Changes between Initial Version and Version 1 of DataManagement


Timestamp: Feb 8, 2011 3:40:54 PM
Author: laurent

This page is a work in progress describing data management for GoNL.

== Millipede File Structure ==
=== '''Access rights''' ===
 * All data should be writable only by their owners
 * All tools and resources should be readable and executable by the whole ''gcc'' group
 * All project-specific data and results should be readable and executable by the ''gvnl'' group

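The access rules above can be checked against a file's or directory's permission bits. The following is a minimal sketch, not an existing GCC tool: the function name `follows_access_policy` is hypothetical, and it assumes the rules map onto standard POSIX mode bits (no write bit for group or others, read and execute bits set for the group). It does not check which group (''gcc'' or ''gvnl'') owns the path.

```python
import stat


def follows_access_policy(mode: int) -> bool:
    """Check POSIX permission bits against the access rules above.

    Hypothetical helper: returns True when nobody but the owner can write
    and the group can both read and execute.
    """
    # "Writable only by their owners": no write bit for group or others.
    owner_only_write = (mode & (stat.S_IWGRP | stat.S_IWOTH)) == 0
    # "Readable and executable by the group": both group bits must be set.
    group_bits = stat.S_IRGRP | stat.S_IXGRP
    group_read_exec = (mode & group_bits) == group_bits
    return owner_only_write and group_read_exec
```

For a real path, the mode can be obtained with `os.stat(path).st_mode`; for example, a directory with mode 750 passes, while 770 (group-writable) does not.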
=== GCC-level Directory Structure ===
The root for all subsequent directories is /data/gcc/

 * /tools
   * Contains all GCC tools, including GoNL tools
   * All tools should be put in a folder using the naming convention: ''toolname-version''
     * Ex: Picard v1.32 should be found in ''/data/gcc/tools/picard-tools-1.32/''
 * /resources
   * Contains all GCC resources, including GoNL resources
   * All resources should be put in a folder whose name states their version, normally following the convention ''resource-version''
     * Ex: Human Genome build 19 should be found in ''/data/gcc/resources/hg-19/''

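As an illustration of the ''toolname-version'' and ''resource-version'' conventions, the sketch below builds the expected folder paths. The helper names `tool_dir` and `resource_dir` are hypothetical and only mirror the two examples given above; they do not create anything on disk.

```python
from pathlib import PurePosixPath

# Root stated on this page for all GCC-level directories.
GCC_ROOT = PurePosixPath("/data/gcc")


def tool_dir(toolname: str, version: str) -> PurePosixPath:
    """Folder for a tool, following the toolname-version convention."""
    return GCC_ROOT / "tools" / f"{toolname}-{version}"


def resource_dir(resource: str, version: str) -> PurePosixPath:
    """Folder for a resource, following the resource-version convention."""
    return GCC_ROOT / "resources" / f"{resource}-{version}"


print(tool_dir("picard-tools", "1.32"))  # /data/gcc/tools/picard-tools-1.32
print(resource_dir("hg", "19"))          # /data/gcc/resources/hg-19
```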
=== GoNL-level Directory Structure ===
The root for all subsequent directories is /data/gcc/projects/gonl/

 * /rawdata
   * Contains all the raw, unprocessed data, organized by batch
     * Ex: All raw data for the first batch is located in ''/data/gcc/projects/gonl/rawdata/first_batch/''
 * /results
   * Contains all the results after processing the data
 * /results/BGI
   * Contains all the results from the BGI pipeline (snps, indels, metrics, etc.)
 * /results/immunochip
   * Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.)
 * /results/pipeline
   * Contains all the results from running the sequence data through the GoNL pipeline, organized by batch
     * Ex: Results on the first batch are in ''/data/gcc/projects/gonl/results/pipeline/first_batch''
   * The subdirectory structure for each of the batches should be the following:
     * All results related to a sample should go in /sample_name
       * Ex: All results related to sample A2a (first batch) should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a''
     * All results related to a lane of a sample should go in /sample_name/lane_name
       * Ex: All results related to sample A2a (first batch), lane FC20005_L1 should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/''

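The batch/sample/lane layout under /results/pipeline can be sketched as a small path builder. This is only an illustration under the structure described above; the function name `pipeline_result_dir` is hypothetical and is not part of the GoNL pipeline.

```python
from pathlib import PurePosixPath
from typing import Optional

# Root stated on this page for all GoNL-level directories.
GONL_ROOT = PurePosixPath("/data/gcc/projects/gonl")


def pipeline_result_dir(batch: str, sample_name: str,
                        lane_name: Optional[str] = None) -> PurePosixPath:
    """Directory for sample-level results, or lane-level if a lane is given."""
    path = GONL_ROOT / "results" / "pipeline" / batch / sample_name
    if lane_name is not None:
        # Lane-level results live one level below the sample directory.
        path = path / lane_name
    return path


print(pipeline_result_dir("first_batch", "A2a"))
# /data/gcc/projects/gonl/results/pipeline/first_batch/A2a
print(pipeline_result_dir("first_batch", "A2a", "FC20005_L1"))
# /data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1
```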
=== Pipeline Result Files Naming Convention ===
The following convention applies to all files that are generated by the pipeline. For the containing folders, see the sections above.

 * General convention
   * Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability.
   * File names should be all lowercase, except where they reference specific names that use another convention (ex: sample name).
 * Sample-level files should be named using: ''step_id.step_name.sample_name.genome_build.time_stamp.extension''
   * Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool !UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''vc02.unified_genotyper.A2a.human_g1k_v37.2011_02_01_12_00.snp''
 * Lane-level files should be named using: ''step_id.step_name.sample_name.lane_name.genome_build.time_stamp.extension''
   * Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''pe03.bwa_sampe.A2a.FC20005_L1.human_g1k_v37.2011_02_01_12_00.bam''
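The naming convention above can be sketched as a pair of helpers that build and split file names. This is a minimal illustration, not the pipeline's own code: the names `result_filename`, `parse_result_filename` and `TIME_FORMAT` are hypothetical, the time stamp layout (year_month_day_hour_minute) is inferred from the sample-level example, and lowercasing only the step and extension tokens is an assumption about which tokens fall under the lowercase rule.

```python
from datetime import datetime
from typing import Dict, Optional

# Assumed time stamp layout, inferred from the example 2011_02_01_12_00.
TIME_FORMAT = "%Y_%m_%d_%H_%M"


def result_filename(step_id: str, step_name: str, sample_name: str,
                    genome_build: str, started: datetime, extension: str,
                    lane_name: Optional[str] = None) -> str:
    """Build a sample-level name, or a lane-level name if lane_name is given."""
    # Assumption: step and extension tokens are lowercased; sample, lane and
    # genome build names keep whatever convention they already follow.
    tokens = [step_id.lower(), step_name.lower(), sample_name]
    if lane_name is not None:
        tokens.append(lane_name)
    tokens += [genome_build, started.strftime(TIME_FORMAT), extension.lower()]
    for token in tokens:
        # '.' separates tokens, so it must not appear inside one.
        if not token or "." in token:
            raise ValueError(f"invalid token: {token!r}")
    return ".".join(tokens)


def parse_result_filename(filename: str) -> Dict[str, Optional[str]]:
    """Split a file name into tokens: 6 for sample-level, 7 for lane-level."""
    tokens = filename.split(".")
    if len(tokens) == 6:
        step_id, step_name, sample, build, stamp, ext = tokens
        lane = None
    elif len(tokens) == 7:
        step_id, step_name, sample, lane, build, stamp, ext = tokens
    else:
        raise ValueError(f"expected 6 or 7 tokens, got {len(tokens)}")
    datetime.strptime(stamp, TIME_FORMAT)  # raises ValueError if malformed
    return {"step_id": step_id, "step_name": step_name,
            "sample_name": sample, "lane_name": lane,
            "genome_build": build, "time_stamp": stamp, "extension": ext}


print(result_filename("vc02", "unified_genotyper", "A2a", "human_g1k_v37",
                      datetime(2011, 2, 1, 12, 0), "snp"))
# vc02.unified_genotyper.A2a.human_g1k_v37.2011_02_01_12_00.snp
```

Passing `lane_name="FC20005_L1"` yields the seven-token lane-level form. Because the parser tells the two forms apart only by token count, it relies on no token containing a '.'.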