= Workflow 2: Alignment per Lane, per Chr =
[[TOC()]]

This workflow aligns reads per lane and per chromosome, including:
 * realignment to prevent false SNP calls caused by indels (using known indels)
 * duplicate marking (MarkDuplicates) to prevent false coverage caused by PCR duplicates (per library = lane)
 * base quality recalibration to correct falsely low scores caused by true variation

Workflow inputs:
 * lane.1.fq.gz - raw reads for the lane, paired end 1
 * lane.2.fq.gz - raw reads for the lane, paired end 2
 * genome.chr.fasta - reference genome, split per chromosome
 * genome.chr.realign.intervals - targets for realignment, per chromosome
 * genome.chr.dbsnpXYZ.rod - known SNP variants, here from dbSNP
 * genome.chr.indelsXYZ.vcf - known indels, here from 1KG

Workflow outputs:
 * lane.chr.1.sai - alignment index for the first end of the pair
 * lane.chr.2.sai - alignment index for the second end of the pair
 * lane.chr.sam - alignment map in SAM (text) format
 * lane.chr.bam - alignment map in binary format
 * lane.chr.sorted.bam - sorted alignment map
 * lane.chr.sorted.bai - index of the sorted alignment map
 * lane.chr.dedup.bam - alignment map with PCR duplicates marked
 * lane.chr.dedup.metrics - metrics describing the deduplication
 * lane.chr.realigned.bam - alignment map realigned around known indels
 * lane.chr.matefixed.bam - alignment map with fixed mate-pair information
 * lane.chr.covariate_table.csv - table of CountCovariates output for recalibration
 * lane.chr.recal.bam - alignment map with recalibrated quality scores

== align ==
Align each end of the paired-end reads.
||tool: ||bwa-align ||
||inputs: ||chr.fasta [[BR]] lane.1.fq.gz [[BR]] lane.2.fq.gz ||
||outputs: ||lane.chr.1.sai [[BR]] lane.chr.2.sai ||
||docs: ||http://bio-bwa.sourceforge.net/bwa.shtml ||
||command: ||[[IncludeSource(ngs_pipelines/templates/ngs/template-paired-end_step_0_1.ftl)]] ||

== align-pe ==
Align the two ends together as one pair.
||tool: ||bwa sampe ||
||inputs: ||chr.fasta [[BR]] lane.1.fq.gz [[BR]] lane.2.fq.gz [[BR]] lane.chr.1.sai [[BR]] lane.chr.2.sai ||
||outputs: ||lane.chr.sam ||
||docs: ||http://bio-bwa.sourceforge.net/bwa.shtml ||

== sam-to-bam ==
Convert SAM to BAM.
||tool: ||samtools view ||
||inputs: ||lane.chr.sam ||
||outputs: ||lane.chr.bam ||
||docs: ||http://samtools.sourceforge.net/samtools.shtml ||

(Question: can this step not also sort and index?)

== sam-sort ==
Sort the BAM file by coordinate.
||tool: ||samtools sort ||
||inputs: ||lane.chr.bam ||
||outputs: ||lane.chr.sorted.bam ||
||docs: ||http://samtools.sourceforge.net/samtools.shtml ||

== sam-index ==
Index the BAM file for quicker access.
||tool: ||samtools index ||
||inputs: ||lane.chr.sorted.bam ||
||outputs: ||lane.chr.sorted.bai ||
||docs: ||http://samtools.sourceforge.net/samtools.shtml ||

== !MarkDuplicates ==
Mark PCR duplicate fragments so that they can be filtered out in the analysis.
||tool: ||MarkDuplicates.jar ||
||inputs: ||lane.chr.sorted.bam ||
||outputs: ||lane.chr.dedup.bam [[BR]] lane.chr.dedup.metrics ||
||docs: ||http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates ||

== !IndelRealigner-!KnownsOnly ==
Improve the alignment using known indel information (this reduces false SNP calls).
||tool: ||GenomeAnalysisTK.jar -T IndelRealigner ||
||inputs: ||lane.chr.dedup.bam [[BR]] genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf ||
||outputs: ||lane.chr.realigned.bam ||
||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites ||

== !FixMateInformation ==
Fix the paired-end information as a consequence of the
realignment.
||tool: ||FixMateInformation.jar ||
||inputs: ||lane.chr.realigned.bam ||
||outputs: ||lane.chr.matefixed.bam ||
||docs: ||http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation [[BR]] http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Fixing_Mate_Pairs ||

== !CountCovariates ==
Count covariates, such as machine cycle and base position, to be used as the basis for quality recalibration. Optionally, plot the results to PDF using AnalyzeCovariates.
||tool: ||GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar ||
||inputs: ||lane.chr.matefixed.bam [[BR]] genome.chr.dbsnpXYZ.rod ||
||outputs: ||lane.chr.covariate_table.csv ||
||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates [[BR]] http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#AnalyzeCovariates.jar ||

== !TableRecalibration ==
Recalibrate the quality scores based on the covariate table.
||tool: ||GenomeAnalysisTK.jar -T TableRecalibration ||
||inputs: ||lane.chr.matefixed.bam [[BR]] lane.chr.covariate_table.csv [[BR]] chr.fasta ||
||outputs: ||lane.chr.recal.bam ||
||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration ||

== Repeat: sam-sort, sam-index, countcovariates ==
See the steps above for commands and docs.
||inputs: ||lane.chr.recal.bam ||
||outputs: ||lane.chr.recal.sorted.bam [[BR]] lane.chr.recal.sorted.bam.bai [[BR]] lane.chr.recal.covariate_table.csv ||

Discussion:
> Why do we need to sort and index after recalibration? Does it mess up the order of things?
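
The file naming used in the steps above can be sketched in a few lines of Python. This is only an illustration, not part of the pipeline templates: the names `Step`, `workflow_inputs`, `workflow_steps` and `missing_inputs` are hypothetical, the reference is assumed to be `genome.chr.fasta` wherever a step lists `chr.fasta`, and !TableRecalibration is assumed to read the table written by !CountCovariates. The sketch expands the file names for one lane and chromosome and checks that every step only reads files that are workflow inputs or were written by an earlier step.

```python
from typing import List, NamedTuple, Tuple


class Step(NamedTuple):
    name: str
    inputs: List[str]
    outputs: List[str]


def workflow_inputs(lane: str, chrom: str) -> List[str]:
    """Files that must exist before the workflow starts."""
    g = f"genome.{chrom}"
    return [
        f"{lane}.1.fq.gz", f"{lane}.2.fq.gz",
        f"{g}.fasta", f"{g}.realign.intervals",
        f"{g}.dbsnpXYZ.rod",    # XYZ is the version placeholder used in the text
        f"{g}.indelsXYZ.vcf",
    ]


def workflow_steps(lane: str, chrom: str) -> List[Step]:
    """The steps of this page with concrete file names (assumed naming)."""
    p = f"{lane}.{chrom}"       # prefix of per-lane, per-chromosome files
    g = f"genome.{chrom}"       # prefix of per-chromosome reference files
    fq = [f"{lane}.1.fq.gz", f"{lane}.2.fq.gz"]
    sai = [f"{p}.1.sai", f"{p}.2.sai"]
    return [
        Step("align", [f"{g}.fasta", *fq], sai),
        Step("align-pe", [f"{g}.fasta", *fq, *sai], [f"{p}.sam"]),
        Step("sam-to-bam", [f"{p}.sam"], [f"{p}.bam"]),
        Step("sam-sort", [f"{p}.bam"], [f"{p}.sorted.bam"]),
        Step("sam-index", [f"{p}.sorted.bam"], [f"{p}.sorted.bai"]),
        Step("MarkDuplicates", [f"{p}.sorted.bam"],
             [f"{p}.dedup.bam", f"{p}.dedup.metrics"]),
        Step("IndelRealigner-KnownsOnly",
             [f"{p}.dedup.bam", f"{g}.realign.intervals",
              f"{g}.dbsnpXYZ.rod", f"{g}.indelsXYZ.vcf"],
             [f"{p}.realigned.bam"]),
        Step("FixMateInformation", [f"{p}.realigned.bam"], [f"{p}.matefixed.bam"]),
        Step("CountCovariates", [f"{p}.matefixed.bam", f"{g}.dbsnpXYZ.rod"],
             [f"{p}.covariate_table.csv"]),
        Step("TableRecalibration",
             [f"{p}.matefixed.bam", f"{p}.covariate_table.csv", f"{g}.fasta"],
             [f"{p}.recal.bam"]),
        Step("repeat: sam-sort, sam-index, countcovariates", [f"{p}.recal.bam"],
             [f"{p}.recal.sorted.bam", f"{p}.recal.sorted.bam.bai",
              f"{p}.recal.covariate_table.csv"]),
    ]


def missing_inputs(steps: List[Step], available: List[str]) -> List[Tuple[str, str]]:
    """Return (step name, file) pairs for inputs that nothing has produced yet."""
    have = set(available)
    missing = []
    for step in steps:
        missing += [(step.name, f) for f in step.inputs if f not in have]
        have.update(step.outputs)
    return missing


if __name__ == "__main__":
    # "L1" and "chr1" are made-up example values.
    for step in workflow_steps("L1", "chr1"):
        print(step.name, "->", ", ".join(step.outputs))
    print("missing:", missing_inputs(workflow_steps("L1", "chr1"),
                                     workflow_inputs("L1", "chr1")))
```

An empty result from `missing_inputs` means the chain of file names is consistent from the raw reads through to the recalibrated, sorted BAM; a non-empty result points at a step whose input name does not match any earlier output.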