[[TOC()]]

= SNP calling pipeline =

Status: Alpha

Authors: Freerk van Dijk, Morris Swertz

Based on [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit Broad GATK pipeline].

To run the analysis as quickly and reliably as possible, the pipeline has been divided into several small processes. The processes are numbered and listed below with their commands, input files and output files, starting with pre-alignment and ending with variant calling and filtering.

 * SnpCallingPipeline/ReferencePreparation
 * SnpCallingPipeline/AlignmentAndCleaning
 * SnpCallingPipeline/VariantCalling

== Simplified Overview ==

This simplified schema hides intermediate sort and indexing steps and only shows data inputs/outputs the first time they occur.

{{{#!graphviz
digraph g {
  size="10,10"
  node [shape=box,style=filled,color=white]
  "dbsnp"
  "reference.fasta"
  "realign.intervals"
  "indelcalls.vcf"
  "chr[1-24].fasta"
  "flowcell_lane.1.fq.gz"
  "flowcell_lane.2.fq.gz"
  "flowcell_lane.aligned.bam"
  "flowcell_lane2.aligned.bam"
  "flowcell_lane3.aligned.bam"
  "sample.aligned.bam"
  "sample QC reports"
  "sample_chr[1-24].vcf"
  node [shape=ellipse,color=yellow]
  subgraph cluster_0 {
    style=filled;
    color=lightgrey;
    "reference.fasta" -> RealignerTargetCreator -> "realign.intervals"
    "indelcalls.vcf" -> RealignerTargetCreator
    "reference.fasta" -> Split -> "chr[1-24].fasta"
    dbsnp -> RealignerTargetCreator
    label = "Per genome (1)";
  }
  subgraph cluster_1 {
    style=filled;
    color=lightgrey;
    "flowcell_lane.1.fq.gz" -> align1 -> alignPE
    "chr[1-24].fasta" -> align1
    "chr[1-24].fasta" -> align2
    "chr[1-24].fasta" -> alignPE
    "flowcell_lane.2.fq.gz" -> align2 -> alignPE -> MarkDuplicates -> "IndelRealigner & \n FixMateInformation (knownsOnly)" -> "flowcell_lane.aligned.bam"
    "realign.intervals" -> "IndelRealigner & \n FixMateInformation (knownsOnly)"
    label = "Per Lane*Chromosome (750*3*24=54k) ";
  }
  subgraph cluster_2 {
    style=filled;
    color=lightgrey;
    "flowcell_lane.aligned.bam" -> Merge -> "sample.aligned.bam" -> "IndelRealigner" -> FixMateInformation
    "flowcell_lane2.aligned.bam" -> Merge
    "flowcell_lane3.aligned.bam" -> Merge
    FixMateInformation -> IndelGenotyperV2 -> FilterSingleCalls -> UnifiedGenotyper -> Filtration -> VariantEval -> "sample QC reports"
    Filtration -> "sample_chr[1-24].vcf"
    label = "Per Sample*Chromosome (750*24=18k)";
  }
}
}}}

== Workflow 1: genome reference file creation ==

This workflow creates reference files per chromosome, including:
 * genome, dbsnp and indel vcfs per chromosome
 * realignment targets, precomputed once so that later realignment steps run faster
 * index files for samtools and bwa

Workflow inputs:
 * genome.chr.fa - downloaded from genome supplier (now hg19)
 * dbsnpXYZ.rod - downloaded reference SNPs from dbsnp (now 129)
 * indelsXYZ.vcf - downloaded reference indels from 1KG

Workflow outputs:
 * genome.chr.fa - cleaned headers
 * genome.chr.fa.fai - index for samtools
 * genome.chr.fa.xyz - multiple index files for bwa
 * dbsnpXYZ.chr.rod - split per chromosome
 * indelsXYZ.chr.vcf - split per chromosome
 * genome.chr.realign.intervals - targets for realignment

=== clean-fasta-headers ===

Clean the headers so that they contain only '1' instead of 'Chr1', etc.

||tool: || ||
||inputs: ||genome.chr.fa ||
||outputs: ||genome.chr.fa ||
||doc: ||internally developed ||

=== split-vcf-chr for dbsnp and indels ===

Split the dbsnp and indel files per chromosome.

||tool: || ||
||inputs: ||dbsnpXYZ.rod, indelsXYZ.vcf ||
||outputs: ||dbsnpXYZ.chr.rod, indelsXYZ.chr.vcf ||
||doc: || ||

Discussion:
> Can we use http://vcftools.sourceforge.net/options.html ?
>> vcftools --vcf indelsXYZ.vcf --chr <chr> --recode --out indelsXYZ.chr

=== index-chromosomes ===

Index the reference sequence of each chromosome in FASTA format.

||tool: ||samtools faidx ||
||input: ||genome.chr.fa ||
||output: ||genome.chr.fa.fai ||
||doc: ||http://samtools.sourceforge.net/samtools.shtml#3 ||

=== bwa-index-chromosomes ===

Index the reference sequence of each chromosome for bwa alignment.

||tool: ||bwa index -a is ||
||input: ||genome.chr.fa ||
||output: ||genome.chr.fa.xyz ||
||doc: ||http://bio-bwa.sourceforge.net/bwa.shtml#3 ||

=== !RealignerTargetCreator ===

Generate realignment targets at known sites for each chromosome.

||tool: ||GenomeAnalysisTK.jar -T RealignerTargetCreator ||
||input: ||genome.chr.fa, dbsnpXYZ.chr.rod, indelsXYZ.vcf ||
||output: ||genome.chr.realign.intervals ||
||doc: ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites ||
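
The clean-fasta-headers and split-vcf-chr steps above list no tool, so the following is only a minimal sketch of what such helpers might look like, not the internally developed implementation. The function names `clean_fasta_header` and `split_vcf_by_chromosome` are hypothetical. The sketch assumes that a leading "chr" prefix (in any case) is the only thing to strip from a header, that text after the first whitespace in a header can be dropped, and that the input to split is VCF-style text (tab-separated, chromosome in the first column, header lines starting with `#`). The dbsnp .rod format is not handled here.

```python
import re
from collections import defaultdict

# Assumption: headers look like ">Chr1 ..." or ">chr1"; only the "chr" prefix is stripped.
_CHR_PREFIX = re.compile(r"^chr", re.IGNORECASE)


def clean_fasta_header(line):
    """Reduce a FASTA header line to the bare chromosome name, e.g. '>Chr1 ...' to '>1'."""
    if not line.startswith(">"):
        return line  # sequence lines pass through unchanged
    fields = line[1:].split()
    name = fields[0] if fields else ""
    return ">" + _CHR_PREFIX.sub("", name) + "\n"


def split_vcf_by_chromosome(lines):
    """Group VCF lines by chromosome; header lines are copied into every group."""
    header = []
    records = defaultdict(list)
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        elif line.strip():
            # The chromosome is the first tab-separated column of a record.
            chrom = line.split("\t", 1)[0]
            records[chrom].append(line)
    return {chrom: header + body for chrom, body in records.items()}
```

In use, each line of genome.chr.fa would be passed through `clean_fasta_header` before writing, and each value returned by `split_vcf_by_chromosome` would be written to its own indelsXYZ.chr.vcf file. Holding all records in memory keeps the sketch short; a streaming version that writes to one open file per chromosome would suit large inputs better.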