| 1 | This page is a work in progress describing the data management of GoNL. |
| 2 | |
| 3 | == Millipede File Structure == |
| 4 | === Access Rights === |
| 5 | * All data should only be writable by their owners |
| 6 | * All tools and resources should be read/executable by the whole ''gcc'' group |
| 7 | * All project-specific data and results should be read/executable by the ''gvnl'' group |
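
The access rules above can be checked against POSIX permission bits. The following is a minimal sketch, not the project's actual tooling: the constant `DIR_MODE` and both helper functions are hypothetical names, and denying all access to "others" is an assumption, since the rules only state what owners and groups must be able to do. Assigning the ''gcc'' or ''gvnl'' group itself (for example with `chgrp`) is outside the scope of this sketch.

```python
import stat

# Owner: read/write/execute; group: read/execute; others: no access.
# Assumption: "others" get nothing. The rules above do not say so explicitly.
DIR_MODE = stat.S_IRWXU | stat.S_IRGRP | stat.S_IXGRP  # 0o750


def is_owner_only_writable(mode: int) -> bool:
    """Return True if neither the group nor others have write permission."""
    return not (mode & (stat.S_IWGRP | stat.S_IWOTH))


def is_group_read_executable(mode: int) -> bool:
    """Return True if the group has both read and execute permission."""
    wanted = stat.S_IRGRP | stat.S_IXGRP
    return (mode & wanted) == wanted
```

A mode obtained with `stat.S_IMODE(os.stat(path).st_mode)` can be passed to either helper; applying `DIR_MODE` would be done with `os.chmod(path, DIR_MODE)`.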
| 8 | |
| 9 | === GCC-level Directory Structure === |
| 10 | The root for all subsequent directories is /data/gcc/ |
| 11 | |
| 12 | * /tools |
| 13 | * Contains all GCC tools, including GoNL tools |
| 14 | * All tools should be put in a folder using the naming convention: ''toolname-version'' |
| 15 | * Ex: Picard v1.32 should be found in ''/data/gcc/tools/picard-tools-1.32/'' |
| 16 | * /resources |
| 17 | * Contains all GCC resources, including GoNL resources |
| 18 | * All resources should be put in a folder whose name specifies their version, normally following the naming convention: ''resource-version'' |
| 19 | * Ex: Human Genome build 19 should be found in ''/data/gcc/resources/hg-19/'' |
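
As an illustration of the ''name-version'' folder convention for tools and resources, here is a small sketch. The function name `versioned_dir` and the constant `GCC_ROOT` are hypothetical and exist only for this example; the root `/data/gcc` is taken from the example paths above.

```python
from pathlib import PurePosixPath

# Root taken from the example paths above.
GCC_ROOT = PurePosixPath("/data/gcc")


def versioned_dir(kind: str, name: str, version: str) -> PurePosixPath:
    """Build the folder for a tool or resource as <root>/<kind>/<name>-<version>."""
    if kind not in ("tools", "resources"):
        raise ValueError(f"kind must be 'tools' or 'resources', got {kind!r}")
    return GCC_ROOT / kind / f"{name}-{version}"
```

For instance, `versioned_dir("tools", "picard-tools", "1.32")` yields the Picard path given above, and `versioned_dir("resources", "hg", "19")` yields the Human Genome build 19 path.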
| 20 | |
| 21 | === GoNL-level Directory Structure === |
| 22 | The root for all subsequent directories is /data/gcc/projects/gonl/ |
| 23 | |
| 24 | * /rawdata |
| 25 | * Contains all the raw unprocessed data by batch |
| 26 | * Ex: All raw data for the 1st batch is located in ''/data/gcc/projects/gonl/rawdata/first_batch/'' |
| 27 | * /results |
| 28 | * Contains all the results after processing the data |
| 29 | * /results/BGI |
| 30 | * Contains all the results from the BGI pipeline (snps, indels, metrics, etc.) |
| 31 | * /results/immunochip |
| 32 | * Contains all the results from the immunochip data (cleaned/QCed data, metrics, etc.) |
| 33 | * /results/pipeline |
| 34 | * Contains all the results of running the sequence data through the GoNL pipeline, organized by batch |
| 35 | * Ex: Results on the first batch are in ''/data/gcc/projects/gonl/results/pipeline/first_batch'' |
| 36 | * The subdirectory structure for each of the batches should be the following: |
| 37 | * All results related to a sample should go in /sample_name |
| 38 | * Ex: All results related to sample A2a (first batch) should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a'' |
| 39 | * All results related to a lane of a sample should go in /sample_name/lane_name |
| 40 | * Ex: All results related to sample A2a (first batch), Lane FC20005_L1 should go in ''/data/gcc/projects/gonl/results/pipeline/first_batch/A2a/FC20005_L1/'' |
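
The batch/sample/lane layout described above can be expressed as a short path-building sketch. The names `GONL_ROOT` and `pipeline_result_dir` are hypothetical, introduced only for illustration, and the function does not create any directories.

```python
from pathlib import PurePosixPath
from typing import Optional

# GoNL project root as stated at the top of this section.
GONL_ROOT = PurePosixPath("/data/gcc/projects/gonl")


def pipeline_result_dir(batch: str, sample: str,
                        lane: Optional[str] = None) -> PurePosixPath:
    """Return the pipeline result folder for a sample, or for one of its lanes."""
    path = GONL_ROOT / "results" / "pipeline" / batch / sample
    if lane is not None:
        # Lane-level results live one level below the sample folder.
        path = path / lane
    return path
```

Calling `pipeline_result_dir("first_batch", "A2a")` gives the sample-level folder from the example above, and adding `lane="FC20005_L1"` gives the lane-level folder.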
| 41 | |
| 42 | === Pipeline Result Files Naming Convention === |
| 43 | The following convention applies to all files that are generated by the pipeline. For containing folders, see sections above. |
| 44 | |
| 45 | * General convention |
| 46 | * Filenames are composed of tokens identifying their content. The tokens are separated by '.' and, if necessary, the words within a token can be separated by '_' for readability. |
| 47 | * Except where they reference specific names that use another convention (ex: sample name), file names should be all lowercase. |
| 48 | * Sample-level files should be named using: ''step_id.step_name.sample_name.genome_build.time_stamp.extension'' |
| 49 | * Ex: A vcf file for the sample A2a produced by the step vc02 (step 2 of variant calling) with the tool !UnifiedGenotyper using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''vc02.unified_genotyper.A2a.human_g1k_v37.2011_02_01_12_00.snp'' |
| 50 | * Lane-level files should be named using: ''step_id.step_name.sample_name.lane_name.genome_build.time_stamp.extension'' |
| 51 | * Ex: A bam file for the lane FC20005_L1 of the sample A2a produced by the step pe03 (step 3 of paired-end alignment) with the tool BWA sampe using genome build human_g1k_v37 on a run that began on February 1st 2011 at 12:00 should be named: ''pe03.bwa_sampe.A2a.FC20005_L1.human_g1k_v37.2011_02_01_12_00.bam'' |
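
The naming convention can be sketched as a small function that joins the tokens in the prescribed order. This is an illustration only, not part of the pipeline: `result_filename` is a hypothetical name, and the timestamp layout (year, month, day, hour, minute separated by '_') is inferred from the sample-level example above.

```python
from datetime import datetime
from typing import Optional


def result_filename(step_id: str, step_name: str, sample_name: str,
                    genome_build: str, started: datetime, extension: str,
                    lane_name: Optional[str] = None) -> str:
    """Join the naming tokens with '.', inserting the lane after the sample if given."""
    tokens = [step_id, step_name, sample_name]
    if lane_name is not None:
        tokens.append(lane_name)
    # Assumed timestamp layout: YYYY_MM_DD_hh_mm (start time of the run).
    tokens += [genome_build, started.strftime("%Y_%m_%d_%H_%M"), extension]
    for token in tokens:
        # '.' is reserved as the token separator, so it may not appear inside a token.
        if not token or "." in token:
            raise ValueError(f"invalid token: {token!r}")
    return ".".join(tokens)
```

With `started=datetime(2011, 2, 1, 12, 0)`, the sample-level call for step vc02 reproduces the file name in the first example; passing `lane_name="FC20005_L1"` produces a lane-level name with the lane token placed directly after the sample name.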