Changes between Version 56 and Version 57 of SnpCallingPipeline


Timestamp:
Dec 9, 2010 9:37:10 AM (13 years ago)
Author:
Leon Mei
Comment:

--

  • SnpCallingPipeline

    v56 v57  
    11= SNP calling pipeline =
    2 Status: Alpha
    3 Authors: Freerk van Dijk, Morris Swertz
     2Status: Alpha Authors: Freerk van Dijk, Morris Swertz
    43
    54This is the documentation of the BBMRI-NL snp calling pipeline based on the [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit Broad GATK]. It consists of the following three workflows:
    6 * Workflow 1: SnpCallingPipeline/ReferencePreparation
    7 * Workflow 2: SnpCallingPipeline/AlignmentAndCleaning
    8 * Workflow 3: SnpCallingPipeline/VariantCalling
     5
     6 * Workflow 1: SnpCallingPipeline/ReferencePreparation
     7 * Workflow 2: SnpCallingPipeline/AlignmentAndCleaning
     8 * Workflow 3: SnpCallingPipeline/VariantCalling
     9
    910== Schematic Overview ==
    10 
    1111This simplified schema hides intermediate sorting and indexing steps and shows each data input/output only the first time it occurs.
    1212
     
    1414
    1515digraph g {
    16         size="10,10"
    17         node [shape=box,style=filled,color=white]
    18         "dbsnp"
    19         "reference.fasta"
    20         "realign.intervals"
    21         "indelcalls.vcf"
    22         "chr[1-24].fasta"
    23         "flowcell_lane.1.fq.gz"
    24         "flowcell_lane.2.fq.gz"
    25         "flowcell_lane.aligned.bam"
    26         "flowcell_lane2.aligned.bam"
    27         "flowcell_lane3.aligned.bam"
    28         "sample.aligned.bam"
    29         "sample QC reports"
    30         "sample_chr[1-24].vcf"
    3116
    32         node [shape=ellipse,color=yellow]
     17  size="10,10" node [shape=box,style=filled,color=white] "dbsnp" "reference.fasta" "realign.intervals" "indelcalls.vcf" "chr[1-24] .fasta" "flowcell_lane.1.fq.gz" "flowcell_lane.2.fq.gz" "flowcell_lane.aligned.bam" "flowcell_lane2.aligned.bam" "flowcell_lane3.aligned.bam" "sample.aligned.bam" "sample QC reports" "sample_chr[1-24] .vcf"
    3318
    34         subgraph cluster_0 {
    35                 style=filled;
    36                 color=lightgrey;
     19  node [shape=ellipse,color=yellow]
    3720
    38                 "reference.fasta" -> RealignerTargetCreator -> "realign.intervals"
    39                 "indelcalls.vcf"-> RealignerTargetCreator
    40                 "reference.fasta"->Split->"chr[1-24].fasta"
    41                 dbsnp -> RealignerTargetCreator
    42                 label = "Per genome (1)";
    43         }
     21  subgraph cluster_0 {
     22    style=filled; color=lightgrey;
    4423
    45         subgraph cluster_1 {
    46                 style=filled;
    47                 color=lightgrey;
    48                 "flowcell_lane.1.fq.gz" -> align1 -> alignPE
    49                 "chr[1-24].fasta" -> align1
    50                 "chr[1-24].fasta" -> align2
    51                 "chr[1-24].fasta" -> alignPE
    52                 "flowcell_lane.2.fq.gz" -> align2 -> alignPE -> MarkDuplicates -> "IndelRealigner & \n FixMateInformation (knownsOnly)" ->"Quality Recalibration"->"flowcell_lane.aligned.bam"
    53                 "realign.intervals" -> "IndelRealigner & \n FixMateInformation (knownsOnly)"   
    54                 label = "Per Lane (750*3=2250) ";
    55         }
     24  "reference.fasta" -> RealignerTargetCreator  -> "realign.intervals" "indelcalls.vcf"-> RealignerTargetCreator   "reference.fasta"->Split->"chr[1-24] .fasta"  dbsnp -> RealignerTargetCreator   label = "Per genome (1)";
    5625
    57       subgraph cluster_2 {
    58                 style=filled;
    59                 color=lightgrey;
    60                 "flowcell_lane.aligned.bam" -> Merge -> "sample.aligned.bam" -> "IndelRealigner & FixMateInformation"
    61                 "flowcell_lane2.aligned.bam" -> Merge
    62                 "flowcell_lane3.aligned.bam" -> Merge
    63                 "IndelRealigner & FixMateInformation" -> IndelGenotyperV2 -> FilterSingleCalls -> UnifiedGenotyper -> Filtration -> VariantEval -> "sample QC reports"
    64 Filtration -> "sample_chr[1-24].vcf"
    65                 label = "Per Sample or Trio*Chromosome (750*24=18k)";
    66         }
    67 
    68        subgraph cluster_3 {
    69                 style=filled;
    70                 color=lightgrey;
    71 
    72                 "sample.aligned.bam" -> "UnifiedGenotyper (without realign)"->"QC against arrays and BGI"
    73 
    74                 label = "QC per sample";
    75 
    76        }
    7726}
    7827
     28  subgraph cluster_1 {
     29    style=filled; color=lightgrey; "flowcell_lane.1.fq.gz" -> align1 -> alignPE "chr[1-24] .fasta" -> align1 "chr[1-24] .fasta" -> align2 "chr[1-24] .fasta" -> alignPE "flowcell_lane.2.fq.gz" -> align2 -> alignPE -> MarkDuplicates  -> "IndelRealigner  & \n FixMateInformation  (knownsOnly)" ->"Quality Recalibration"->"flowcell_lane.aligned.bam" "realign.intervals" -> "IndelRealigner  & \n FixMateInformation  (knownsOnly)"    label = "Per Lane (750*3=2250) ";
     30  }
     31
     32  subgraph cluster_2 {
     33    style=filled; color=lightgrey; "flowcell_lane.aligned.bam" -> Merge -> "sample.aligned.bam" -> "IndelRealigner  & FixMateInformation " "flowcell_lane2.aligned.bam" -> Merge "flowcell_lane3.aligned.bam" -> Merge "IndelRealigner  & FixMateInformation " -> IndelGenotyperV2 -> FilterSingleCalls  -> UnifiedGenotyper  -> Filtration -> VariantEval  -> "sample QC reports"
     34
     35Filtration -> "sample_chr[1-24].vcf"
     36
     37  label = "Per Sample or Trio*Chromosome (750*24=18k)";
     38
     39}
     40
     41  subgraph cluster_3 {
     42    style=filled; color=lightgrey;
     43
     44  "sample.aligned.bam" -> "UnifiedGenotyper  (without realign)"->"QC against arrays and BGI"
     45
     46  label = "QC per sample";
     47
     48  }
     49
     50}
    7951
    8052}}}
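The cluster labels in the graph carry the job-count arithmetic (750*3=2250 per lane, 750*24=18k per sample and chromosome). A minimal sketch of that arithmetic follows. Reading 750 as the number of samples, 3 as lanes per sample, and 24 as the number of chromosome files is an assumption based on the labels; the page itself only states the products.

```python
# Job counts implied by the cluster labels in the schematic overview.
# Assumption: 750 = samples, 3 = lanes per sample, 24 = chromosome files
# (chr[1-24]); the names below are illustrative, not from the pipeline.
SAMPLES = 750
LANES_PER_SAMPLE = 3
CHROMOSOMES = 24


def job_counts(samples=SAMPLES, lanes=LANES_PER_SAMPLE, chromosomes=CHROMOSOMES):
    """Return the number of jobs per workflow stage as labelled in the graph."""
    return {
        "per_genome": 1,                                  # "Per genome (1)"
        "per_lane": samples * lanes,                      # "Per Lane (750*3=2250)"
        "per_sample_chromosome": samples * chromosomes,   # "(750*24=18k)"
    }


if __name__ == "__main__":
    for stage, count in job_counts().items():
        print(f"{stage}: {count}")
```

Changing the three constants shows how quickly the per-lane and per-chromosome stages dominate the total number of jobs.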
     
    8254Discussion
    8355
    84 * How long does alignment take per genome?
     56 * How long does alignment take per genome?
    8557   * If this takes very long, we can split the read files
    86 * How long does realign knownsOnly take (per genome)?
     58 * How long does realign knownsOnly take (per genome)?
    8759   * If very long, we need to rewrite workflow 2 to split before realign
    88 * For realign: if we split per chromosome, can we also split the BAM file?
    89 * How can we easily lift over from b36 to b37?
    90   * Contact BGI: can they use b37?
     60 * For realign: if we split per chromosome, can we also split the BAM file?
     61 * How can we easily lift over from b36 to b37?
     62   * Contact BGI: can they use b37?
    9163
    9264Todo:
    9365
    9466First:
    95 * Recode workflow 2 to work per genome instead of per chromosome and test - Freerk (done) -> workflow 3 still needs to be done
    96 * Run on pilot data (6) to evaluate timing and concurrency issues (can 6 run on one node?) - Freerk (in progress)
    97 * Complete analysis of data (60) up to and including the merge into sample.aligned.bam - Freerk
    98 * QC pipeline - can we get Jeroen and Yurii involved here?
     67
     68 * Recode workflow 2 to work per genome instead of per chromosome and test - Freerk (done) -> workflow 3 still needs to be done
     69 * Run on pilot data (6) to evaluate timing and concurrency issues (can 6 run on one node?) - Freerk (in progress)
     70 * Complete analysis of data (60) up to and including the merge into sample.aligned.bam - Freerk
     71 * QC pipeline - can we get Jeroen and Yurii involved here?
    9972   * UnifiedGenotyper without realign - Freerk
    10073   * GATK variant eval to make venn diagrams
    10174   * Contact Yurii for this; Let Jeroen take charge? (done) -> Jeroen doing QC stuff
    102 * Share data with Grid following Silvia's plan - Freerk
    103 * Contact BGI for sample list - Morris (done)
    104 * Put report on FTP - Ger, Freerk
     75 * Share data with Grid following Silvia's plan - Freerk
     76 * Contact BGI for sample list - Morris (done)
     77 * Put report on FTP - Ger, Freerk
    10578
    10679Next:
    107 * Short tutorial on how to generate pipeline scripts - Morris
     80
     81 * Short tutorial on how to generate pipeline scripts - Morris
    10882   * Teach Barbara and Jeroen
    109 * Port pipeline to Grid with help of Barbara
    110    * What exactly do we need to generate? - Barbara
     83 * Port pipeline to Grid with help of Barbara
     84   * What exactly do we need to generate? - Barbara
     85
     87== Optimization? ==
     88{{{
     89{| {{table}}
     90| align="center" style="background:#f0f0f0;"|'''Step'''
     91| align="center" style="background:#f0f0f0;"|'''Cores'''
     92| align="center" style="background:#f0f0f0;"|'''Memory (gb)'''
     93| align="center" style="background:#f0f0f0;"|'''Time (hh.mm)'''
     94|-
     95| BWA alignment||1||± 6||10.05
     96|-
     97| BWA spe||1||||3.35
     98|-
     99| Sam-Bam||1||||12.3
     100|-
     101| Sam sort||1||||5.05
     102|-
     103| Mark Duplicates||1||4||1.55
     104|-
     105| Realignment (knowns only)||1||8 (*can be lowered)||5.2
     106|-
     107| Fix mates||1||6 (*)||3.05
     108|-
     109| Covariates bef.||1||2||12.35
     110|-
     111| Recalibrate||1||4||7.3
     112|-
     113| Sam sort||1||||4.5
     114|-
     115| Covariates aft.||1||2||11.2
     116|-
     117| Analyze Covar.||1||4||< 00.01
     118|-
     119| Total||||||± 90 (< 4 days)
     120|-
     121|
     122|}
     123}}}
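The per-step times in the table can be summed as a check on the Total row. This is a rough sketch, assuming the "hh.mm" column means hours and minutes, so that "12.3" reads as 12 h 30 min rather than 12.3 h; that reading is an assumption, as are the helper names. "Analyze Covar." is listed as "< 00.01", so its upper bound of one minute is used. Under these assumptions the itemised steps add up to roughly 78 hours, which is within the "< 4 days" bound given in the table.

```python
# Per-step times copied from the table, in the page's "hh.mm" notation.
# Assumption: "12.3" means 12 h 30 min, i.e. the minutes part is padded
# on the right to two digits. Values are kept as strings so that "5.2"
# and "5.20" are not mangled by float parsing.
STEP_TIMES = [
    ("BWA alignment", "10.05"),
    ("BWA spe", "3.35"),
    ("Sam-Bam", "12.3"),
    ("Sam sort", "5.05"),
    ("Mark Duplicates", "1.55"),
    ("Realignment (knowns only)", "5.2"),
    ("Fix mates", "3.05"),
    ("Covariates bef.", "12.35"),
    ("Recalibrate", "7.3"),
    ("Sam sort", "4.5"),
    ("Covariates aft.", "11.2"),
    ("Analyze Covar.", "00.01"),  # table says "< 00.01"; upper bound used
]


def hhmm_to_minutes(value: str) -> int:
    """Convert an 'hh.mm' string to minutes."""
    hours, _, minutes = value.partition(".")
    return int(hours) * 60 + int(minutes.ljust(2, "0"))


def total_minutes(steps) -> int:
    """Sum all step durations in minutes."""
    return sum(hhmm_to_minutes(t) for _, t in steps)


def format_hhmm(minutes: int) -> str:
    """Format minutes back into the table's 'hh.mm' notation."""
    hours, rest = divmod(minutes, 60)
    return f"{hours}.{rest:02d}"


if __name__ == "__main__":
    print("Total (hh.mm):", format_hhmm(total_minutes(STEP_TIMES)))
```

All steps are listed with one core, so the sum is the serial runtime for a single run; any steps that can be parallelized would shorten the wall-clock time accordingly.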
     124=== Disk ===
     125 * Option 1: If a node can guarantee a certain amount of disk space (/tmp), we should use the entire cluster. Before starting a pipeline run, we can ask the node to reserve that amount of disk space.
     126 * Option 2: If we can set aside a dedicated part of the cluster, we can use our own scheduler to share the nodes/disks. E.g., depending on the disk space usage pattern and how we can remove the data, we can decide which jobs run on which node and when.
     127
     128=== Memory/CPU time ===
     129 * Can multiple samples share the same reference genome in memory during the BWA alignment? I.e. 1 sample->6GB, 3 samples->6GB.
     130   * NO?
     131 * Can we parallelize !MarkDuplicates?
     132   * YES!
     133   * !MarkDuplicates finds sequence pairs that map to the same position, marking or removing the duplicates so you can work with unique pairs in downstream analyses. If you want them removed, use the REMOVE_DUPLICATES=true flag when running the program.
     134 * Can we parallelize the covariate (before/after) and recalibration steps?
     135   * Don't know
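The REMOVE_DUPLICATES=true flag mentioned above can be illustrated with a small command builder. This is a sketch under stated assumptions, not the pipeline's actual invocation: the jar location and file names are hypothetical, and the command is only assembled, not executed. INPUT, OUTPUT, METRICS_FILE and REMOVE_DUPLICATES are Picard MarkDuplicates options in the KEY=value syntax.

```python
# Assemble a Picard MarkDuplicates command line as an argument list.
# Assumptions (not from the page): the jar path and the BAM/metrics
# file names are placeholders chosen for illustration.
def mark_duplicates_cmd(input_bam, output_bam, metrics_file,
                        remove_duplicates=False,
                        jar="MarkDuplicates.jar"):  # hypothetical jar path
    """Return the argv list for a MarkDuplicates run.

    With remove_duplicates=False duplicates are only marked;
    with True they are dropped from the output BAM.
    """
    return [
        "java", "-jar", jar,
        f"INPUT={input_bam}",
        f"OUTPUT={output_bam}",
        f"METRICS_FILE={metrics_file}",
        f"REMOVE_DUPLICATES={'true' if remove_duplicates else 'false'}",
    ]


if __name__ == "__main__":
    cmd = mark_duplicates_cmd(
        "flowcell_lane.sorted.bam",      # hypothetical input name
        "flowcell_lane.dedup.bam",       # hypothetical output name
        "flowcell_lane.dedup.metrics",   # hypothetical metrics name
        remove_duplicates=True,
    )
    print(" ".join(cmd))
```

Building the command as a list keeps each option a separate argument, so the same list could later be passed to a job scheduler or to `subprocess.run` without shell quoting issues.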
     136
    111137== List of steps ==
    112 
    113138[[TOC(SnpCallingPipeline/ReferencePreparation,SnpCallingPipeline/AlignmentAndCleaning,SnpCallingPipeline/VariantCalling,inline,noheading)]]