wiki:DataManagement

Version 14 (modified by Morris Swertz, 13 years ago)

This page is a work in progress describing the data management of the GoNL project.

GCC-level Directory Structure

The root for all subsequent directories is /data/gcc/

  • /tools
    • Contains all GCC tools including GoNL tools
    • All tools should be put in a folder using the naming convention: toolname-version
      • Ex: Picard v1.32 should be found in /data/gcc/tools/picard-tools-1.32/
  • /resources
    • Contains all GCC resources including GoNL resources
    • All resources should be put in a folder whose name states their version, normally following the convention: resource-version
      • Ex: Human Genome build 19 should be found in /data/gcc/resources/hg-19/
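
The directory convention above can be sketched in a few lines of Python. This is only an illustration: the function names `tool_dir` and `resource_dir` and the constant `GCC_ROOT` are hypothetical and are not part of any existing GCC tooling.

```python
from pathlib import PurePosixPath

# Root of the GCC-level directory structure, as used in the examples above.
GCC_ROOT = PurePosixPath("/data/gcc")


def tool_dir(name: str, version: str) -> PurePosixPath:
    """Return the folder for a tool, following toolname-version."""
    return GCC_ROOT / "tools" / f"{name}-{version}"


def resource_dir(name: str, version: str) -> PurePosixPath:
    """Return the folder for a resource, following resource-version."""
    return GCC_ROOT / "resources" / f"{name}-{version}"


print(tool_dir("picard-tools", "1.32"))  # /data/gcc/tools/picard-tools-1.32
print(resource_dir("hg", "19"))          # /data/gcc/resources/hg-19
```

Building paths through a single helper keeps the toolname-version rule in one place, so pipeline scripts do not have to hard-code folder names.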

Pipeline Result Files Naming Convention

The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above.

  • General convention
    • Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
    • File names should be all lowercase, except where they reference names that follow another convention (ex: sample name).
  • Sample-level files should be named using: sample_name.step_id.step_name.genome_build.time_stamp.extension
    • Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
  • Lane-level files should be named using: sample_name.lane_name.step_id.step_name.genome_build.time_stamp.extension
    • Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam
  • Log file names should correspond to their output counterparts and have the .log extension.
    • Ex: log file for the vcf sample-level step above should be: A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log
    • Ex: log file for the bam lane-level step above should be: A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log
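
As an illustration of the naming convention, the following Python sketch assembles sample-level, lane-level, and log file names from their tokens. The helper names `result_filename` and `log_filename` are hypothetical. The sketch assumes that the time stamp is the run start time formatted as year_month_day_hour_minute, as in the examples above, and that no token may itself contain a '.'.

```python
from datetime import datetime
from typing import Optional


def result_filename(sample: str, step_id: str, step_name: str,
                    genome_build: str, started: datetime, extension: str,
                    lane: Optional[str] = None) -> str:
    """Join the tokens with '.'; pass `lane` for lane-level files."""
    tokens = [sample]
    if lane is not None:
        tokens.append(lane)
    # Sample and lane names keep their own convention; the rest is lowercase.
    tokens += [
        step_id.lower(),
        step_name.lower(),
        genome_build.lower(),
        started.strftime("%Y_%m_%d_%H_%M"),
        extension.lower(),
    ]
    for token in tokens:
        if not token or "." in token:
            raise ValueError(f"invalid token: {token!r}")
    return ".".join(tokens)


def log_filename(result_name: str) -> str:
    """Swap the extension of an output file name for .log."""
    return result_name.rsplit(".", 1)[0] + ".log"


run_start = datetime(2011, 2, 1, 12, 0)
vcf = result_filename("A2a", "vc02", "unified_genotyper",
                      "human_g1k_v37", run_start, "snp")
bam = result_filename("A2a", "pe03", "bwa_sampe",
                      "human_g1k_v37", run_start, "bam", lane="FC20005_L1")
print(vcf)                # A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.snp
print(bam)                # A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.bam
print(log_filename(bam))  # A2a.FC20005_L1.pe03.bwa_sampe.human_g1k_v37.2011_02_01_12_00.log
```

Deriving the log name from the output name, rather than building it separately, guarantees that the two always correspond.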

Logging

The logging strategy is still under development. It will combine log files with database entries in a Molgenis platform. The current status of each is described below.

Log Files

  • At each step of the pipeline a single log is produced and contains:
    • PBS standard output (out) and standard error (err)
    • Tool standard output (out) and standard error (err)
    • Other tool-produced logs, where applicable
  • For log file naming, see section above.
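
One possible way to produce the single per-step log is to concatenate the individual outputs under labelled sections. The sketch below is an assumption, not the pipeline's actual implementation: the function `merge_step_log`, the section header format, and the input file names are all hypothetical.

```python
import tempfile
from pathlib import Path


def merge_step_log(log_path, sources):
    """Write each (label, path) source into one log file, one section each."""
    with open(log_path, "w") as out:
        for label, src in sources:
            out.write(f"===== {label} =====\n")  # hypothetical header format
            src = Path(src)
            if not src.exists():
                out.write("(no output found)\n")
                continue
            text = src.read_text()
            out.write(text)
            if text and not text.endswith("\n"):
                out.write("\n")


# Demonstration with throwaway files standing in for the PBS and tool outputs.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    (tmp / "pbs.out").write_text("job started\n")
    (tmp / "tool.err").write_text("warning: low coverage")
    log = tmp / "A2a.vc02.unified_genotyper.human_g1k_v37.2011_02_01_12_00.log"
    merge_step_log(log, [
        ("PBS out", tmp / "pbs.out"),
        ("PBS err", tmp / "pbs.err"),  # deliberately absent
        ("Tool err", tmp / "tool.err"),
    ])
    merged = log.read_text()

print(merged)
```

Writing a placeholder for a missing source, instead of skipping it, makes it visible in the log that an expected output was never produced.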

Molgenis

The Molgenis platform will be used to provide a more advanced and general view of the status of the pipeline runs (including different views, sorting, etc.). The current status is:

  • Molgenis instance created with proposed model
  • Scripts for insertion under development