[[TOC()]]
= SNP calling pipeline =
Status: Alpha

Authors: Freerk van Dijk, Morris Swertz

Based on [http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit Broad GATK pipeline].

To perform the analysis as fast and good as possible the pipeline has been divided into several small processes. These processes are all numbered and can be found below, including commands, input and output files starting with pre-alignment and ending with variation calling & filtering.
== Simplified Overview ==

This simplified overview this schema hides intermediate sort and indexing steps and only shows data inputs/outputs first time they occur.

{{{#!graphviz

digraph g {
        size="10,10"
        node [shape=box,style=filled,color=white]
        "dbsnp"
        "reference.fasta"
        "realign.intervals"
        "indelcalls.vcf"
        "chr[1-24].fasta"
        "flowcell_lane.1.fq.gz"
        "flowcell_lane.2.fq.gz"
        "flowcell_lane.aligned.bam"
        "flowcell_lane2.aligned.bam"
        "flowcell_lane3.aligned.bam"
        "sample.aligned.bam"
        "sample QC reports"
        "sample_chr[1-24].vcf"

        node [shape=ellipse,color=yellow]

        subgraph cluster_0 {
		style=filled;
		color=lightgrey;
		"reference.fasta" -> RealignerTargetCreator -> "realign.intervals"
                "indelcalls.vcf"-> RealignerTargetCreator 
                "reference.fasta"->Split->"chr[1-24].fasta" 
                dbsnp -> RealignerTargetCreator 
		label = "Per genome (1)";
	}

        subgraph cluster_1 {
		style=filled;
		color=lightgrey;
		"flowcell_lane.1.fq.gz" -> align1 -> alignPE
                "chr[1-24].fasta" -> align1
                "chr[1-24].fasta" -> align2
                "chr[1-24].fasta" -> alignPE
                "flowcell_lane.2.fq.gz" -> align2 -> alignPE -> MarkDuplicates -> "IndelRealigner & \n FixMateInformation (knownsOnly)" -> "flowcell_lane.aligned.bam"
                "realign.intervals" -> "IndelRealigner & \n FixMateInformation (knownsOnly)"   
		label = "Per Lane*Chromosome (750*3*24=54k) ";
	}

      subgraph cluster_2 {
		style=filled;
		color=lightgrey;
                "flowcell_lane.aligned.bam" -> Merge -> "sample.aligned.bam" -> "IndelRealigner"-> FixMateInformation
                "flowcell_lane2.aligned.bam" -> Merge
                "flowcell_lane3.aligned.bam" -> Merge
		FixMateInformation -> IndelGenotyperV2 -> FilterSingleCalls -> UnifiedGenotyper -> Filtration -> VariantEval -> "sample QC reports"
Filtration -> "sample_chr[1-24].vcf"
		label = "Per Sample*Chromosome (750*24=18k)";
	}
}


}}}



== Workflow 1: genome reference file creation ==

This workflow creates reference files per chromosome including:
* genome, dbsnp and indel vcfs per chromosome
* realign targets for faster realignment target creation
* index files for samtools and bwa

Workflow inputs:
* genome.chr.fa - downloaded from genome supplier (now hg19)
* dbsnpXYZ.rod - downloaded reference SNPs from dbsnp (now 129)
* indelsXYZ.vcf - downloaded reference indels from 1KG

Workflow outputs:
* genome.chr.fa - cleaned headers
* genome.chr.fa.fa - index for samtools
* genome.chr.fa.<format> - multilple index files for bwa
* dbsnpXYZ.chr.rod - split per chromosome
* indelsXYZ.chr.vcf - split per chromosome
* genome.chr.realign.intervals - targets for realignment

=== clean-fasta-headers ===
Clean headers to only have '1' instead of Chr1, etc

||tool:    || ||
||inputs:  ||genome.chr.fa ||
||outputs: ||genome.chr.fa ||
||doc:     ||internally developed ||

=== split-vcf-chr for dbsnp and indels ===
Split vcf per chromosome
||tool:    || ||
||inputs:  ||dbsnpXYZ.rod, indelsXYZ.vcf ||
||outputs: ||dbsnpXYz.chr.rod, indelsXYZ.vcf ||
||doc:     || ||

Discussion:
> Can we use http://vcftools.sourceforge.net/options.html ?
>> vcftools --vcf indelsXYZ.vcf --chr <i> --recode --out indelsXYZ.chr

=== index-chromosomes ===
Index reference sequence for each chromosome in the FASTA format

||tool:   ||samtools faidx ||
||input:  ||genome.chr.fa ||
||output: ||genome.chr.fa.fai ||
||doc:    ||http://samtools.sourceforge.net/samtools.shtml#3 ||

=== bwa-index-chromosomes ===
Index reference sequence for each chromosome for bwa alignment

||tool:   ||bwa index -a IS ||
||input:  ||genome.chr.fa ||
||output: ||genome.chr.fa.xyz ||
||doc:    ||http://bio-bwa.sourceforge.net/bwa.shtml#3 ||

=== !RealignerTargetCreator ===
Generate realignment targets for known sites for each chromosome

||tool:   ||GenomeAnalysisTK.jar -T RealignerTargetCreator ||
||input:  ||genome.chr.fa, dbsnpXYz.chr.rod, indelsXYZ.vcf ||
||output: ||genome.chr.realign.intervals ||
||doc:    ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites ||

== Workflow 2: Alignment per Lane, per Chr ==

This workflow aligns reads per lane and chromosome, including:
* re-alignment to prevend false SNP calls caused by indels (using known indels)
* markduplicates to prevend false coverage caused by PCR errors (per library = lane)
* base quality recalibration to correct for false low scores caused by true variation

Workflow Inputs:
* lane.1.fq.gz - raw reads for lane, pair end 1
* lane.2.fq.gz - raw reads for lane, pair end 2
* genome.chr.fasta - reference genome split on chromosome
* genome.chr.realign.intervals - targets for realignment per chromosome
* genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp
* genome.chr.indelsXYZ.vcf - known indels from, here from 1KG 

Workflow ouputs:
* lane.chr.1.sai - alignment index for first pair
* lane.chr.2.sai - alignment index for second pair
* lane.chr.sam - alignment map for
* lane.chr.bam - alignment map in binary format
* lane.chr.sorted.bam - sorted alignment map
* lane.chr.sorted.bai - sorted alignment index
* lane.chr.dedup.bam - marked duplicate PCR elements
* lane.chr.dedup.metrics - metrics describing deduplication
* lane.chr.realigned.bam - realigned based on known indels
* lane.chr.matefixed.bam - fixed the mate pair ends
* lane.chr.covariate_table.csv - table of countcovariates output for recalibration
* lane.chr.recal.bam - alignment map with recalibrated quality scores

=== align ===
Align each end of paired end.
 
||tool:   ||bwa-align ||
||input:  ||chr.fasta, lane.1.fq.gz, lane.2.fq.gz ||
||output: ||lane.chr.1.sai, lane.chr.2.sai ||
||docs:   ||http://bio-bwa.sourceforge.net/bwa.shtml ||

=== align-pe ===
Align the pairs as one

||tool:    ||bwa sampe || 
||inputs:  ||chr.fasta [[BR]] lane.1.fq.gz [[BR]] lane.2.fq.gz [[BR]] lane.chr.1.sai [[BR]] lane.chr.2.sai ||
||outputs: ||lane.chr.sam ||
||docs:    ||http://bio-bwa.sourceforge.net/bwa.shtml ||

=== sam-to-bam ===
Convert sam to bam

||tool:    ||samtools view || 
||inputs:  ||lane.chr.sam ||
||outputs: ||lane.chr.bam ||
||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||

(Question: can this not index and sort?)

=== sam-sort ===
Sort bam file on coordinate

||tool:    ||samtools sort || 
||inputs:  ||lane.chr.bam ||
||outputs: ||lane.chr.sorted.bam ||
||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||

=== sam-index ===
Index bam file for quicker access

||tool:    ||samtools index ||
||inputs:  ||lane.chr.sorted.bam ||
||outputs: ||lane.chr.sorted.bai ||
||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||

=== !MarkDuplicates ===
Mark duplicate PCR fragments to be filtered in analysis

||tool:    ||MarkDuplicates.jar ||
||inputs:  ||lane.chr.sorted.bam ||
||outputs: ||lane.chr.dedup.bam [[BR]] lane.chr.dedup.metrics ||
||docs:    ||http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates ||

=== !IndelRealigner-!KnownsOnly ===
Improve the alignment using known indel information (will reduce false SNP calls)

||tool:    ||GenomeAnalysisTK.jar -T IndelRealigner ||
||inputs:  ||lane.chr.dedup.bam [[BR]] genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf ||
||outputs: ||lane.chr.realigned.bam ||
||docs     ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites ||

=== !FixMateInformation ===
Fix the paired end information as consequence of the realignment.

||tool:    ||FixMateInformation.jar ||
||inputs:  ||lane.chr.realigned.bam
||outputs: ||lane.chr.matefixed.bam ||
||docs:    ||http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation, 

http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Fixing_Mate_Pairs ||

=== !CountCovariates ===
Count covariants, such as machine cycle and bp position, to be used as basis for quality recalibration.
Optionally: plot the results to pdf using AnalyzeCovariates

||tool:    ||GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar ||
||inputs:  ||lane.chr.matefixed.bam [[BR]] genome.chr.dbsnpXYZ.rod ||
||outputs: ||lane.chr.covariate_table.csv ||
||docs:    ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates [[BR]]

http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#AnalyzeCovariates.jar ||

=== !TableRecalibration ===
Recalibrate quality scores based on the covariate table
||tool:    ||GenomeAnalysisTK.jar -T TableRecalibration ||
||inputs:  ||lane.chr.matefixed.bam [[BR]]lanec.chr.recal_table.csv [[BR]]chr.fasta || 
||outputs: ||lane.chr.recal.bam
||docs:    ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration ||

=== Repeat: sam-sort, sam-index, countcovariates ===
See steps above for commands and docs.

||inputs:  ||lane.chr.recal.bam ||
||outputs: ||lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv ||

Discussion:
> wy do we need to sort and index after recalibration? does it mess up the order of things?

== Workflow 3: sample level variant calling ==
This workflow will call variants for the samples including:
* sample level recalibration
* sample level realignment
N.B. no sample level MarkDuplicates is needed as lanes == libraries.

Workflow inputs:
* lane.chr.recal.sorted.bam - for all sample lanes: dedupped, recalibrated, realigned, sorted and indexed bams (3)
* sample.chip.vcf - genotypes called from genotype chip
Reference:
* genome.chr.fasta - reference genome split on chromosome
* genome.chr.realign.intervals - targets for realignment per chromosome
* genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp
* genome.chr.indelsXYZ.vcf - known indels from, here from 1KG 

Workflow outputs:
* sample.chr.bam - merged bam files per sample
* sample.chr.realign.interval - realignment target intervals
* sample.chr.realigned.bam - realigned 
* sample.chr.matesfixed.bam - fixed pairs in realignment
* sample.chr.indels.vcf - raw indels called
* sample.chr.indels.bed - raw indels annotations
* sample.chr.indels.txt - output from the indel calling
* sample.chr.indels.filtered.bed - indels filtered
* sample.chr.snps.vcf - raw snps called
* sample.chr.snps.filtered.vcf - snps filtered

=== merge-lanes ===
Merge lanes into one sample bam

||tool:    ||sam merge ||
||inputs:  ||lane.chr.recal.sorted.bam ||
||outputs: ||sample.chr.bam ||
||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||

=== !RealignerTargetCreator ===
Create realignment targets based on the data (so not only knowns)

||tool:    ||GenomeAnalysisTK.jar -T RealignerTargetCreator ||
||inputs:  ||sample.chr.bam [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod [[BR]]indelsXYZ.vcf
||outputs: ||sample.chr.realign.intervals ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Creating_Intervals ||

=== !IndelRealigner ===
Realign based on realignment targets in previous step

||tool:    ||GenomeAnalysisTK.jar -T IndelRealigner ||
||inputs:  ||sample.chr.bam [[BR]]genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf ||
||outputs: ||sample.chr.realigned.bam ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Realigning ||

=== !FixMateInformation ===
See description in workflow2, now applied to sample

||inputs:  ||sample.chr.realigned.bam ||
||ouputs:  ||sample.chr.matesfixed.bam ||

=== !IndelGenotyperV2 ===
Call indels

||tool:    ||GenomeAnalysisTK.jar -T IndelGenotyperV2 ||
||inputs:  ||sample.chr.matesfixed.bam [[BR]]genome.chr.fa ||
||outputs: ||sample.chr.indels.vcf [[BR]]sample.chr.indels.bed [[BR]]sample.chr.indels.txt ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0 [[BR]]

http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SampleIndelGenotyper ||

=== !filterSingleSampleCalls ===
Filter indels

||tool:    ||filterSingleSampleCalls.pl ||
||inputs:  ||sample.chr.indels.bed ||
||outputs: ||sample.chr.indels.filtered.bed ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SampleIndelGenotyper ||

=== !UnifiedGenotyper ===
Call SNPs

||tool:    ||GenomeAnalysisTK.jar -T UnifiedGenotyper ||
||inputs:  ||sample.chr.matesfixed [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod ||
||outputs: ||sample.chr.snps.vcf ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SetUnifiedGenotypertoEval [[BR]]

http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper ||

=== !makeIndelMask ===
Make indel mask

||tool:    ||makeIndelMask.py ||
||inputs:  ||sample.chr.indels.bed ||
||outputs: ||sample.chr.indels.mask.bed ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0#Creating_a_indel_mask_file ||

=== !VariantFiltration ===
Filter variants to get the best calls possible

||tool:    ||GenomeAnalysisTK.jar -T VariantFiltration ||
||inputs:  ||sample.chr.snps.vcf [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod  ||
||outputs: ||sample.chr.snps.filtered.vcf ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2#Integrating_analyses:_getting_the_best_call_set_possible 

||

=== !MergeVcfs ===
=== !ChipVcf ===
Produce vcf for the chips

=== !VariantEval ===
Create summary information on the variations called for evaluation.
Run per sample.snps.filtered.vcf against chip.

||tool:    ||GenomeAnalysisTK.jar -T VariantEval ||
||inputs:  ||sample.snps.vcf [[BR]]sample.chip.vcf [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod||
||outputs: ||sample.snps.eval ||
||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/VariantEval ||


Discussion:
> Do we call SNPs based on the filtered indels or the raw indels?
> Should we realign AGAIN after merge of lanes?
> BAQ?
> MINDEL/PINDEL?