| 1 | = Workflow 2: Alignment per Lane, per Chr = |
| 2 | [[TOC()]] |
| 3 | |
| 4 | This workflow aligns reads per lane and chromosome, including: |
| 5 | * re-alignment to prevent false SNP calls caused by indels (using known indels) |
| 6 | * duplicate marking to prevent inflated coverage caused by PCR duplicates (per library = lane) |
| 7 | * base quality recalibration to correct for false low scores caused by true variation |
| 8 | |
| 9 | Workflow Inputs: |
| 10 | * lane.1.fq.gz - raw reads for lane, pair end 1 |
| 11 | * lane.2.fq.gz - raw reads for lane, pair end 2 |
| 12 | * genome.chr.fasta - reference genome split on chromosome |
| 13 | * genome.chr.realign.intervals - targets for realignment per chromosome |
| 14 | * genome.chr.dbsnpXYZ.rod - known SNP variants, here from dbSNP |
| 15 | * genome.chr.indelsXYZ.vcf - known indels, here from 1KG |
| 16 | |
| 17 | Workflow outputs: |
| 18 | * lane.chr.1.sai - alignment index for first pair |
| 19 | * lane.chr.2.sai - alignment index for second pair |
| 20 | * lane.chr.sam - alignment map |
| 21 | * lane.chr.bam - alignment map in binary format |
| 22 | * lane.chr.sorted.bam - sorted alignment map |
| 23 | * lane.chr.sorted.bai - sorted alignment index |
| 24 | * lane.chr.dedup.bam - alignment map with PCR duplicates marked |
| 25 | * lane.chr.dedup.metrics - metrics describing deduplication |
| 26 | * lane.chr.realigned.bam - realigned based on known indels |
| 27 | * lane.chr.matefixed.bam - fixed the mate pair ends |
| 28 | * lane.chr.covariate_table.csv - table of countcovariates output for recalibration |
| 29 | * lane.chr.recal.bam - alignment map with recalibrated quality scores |
| 30 | |
| 31 | == align == |
| 32 | Align each end of the read pair separately. |
| 33 | |
| 34 | ||tool: ||bwa aln || |
| 35 | ||inputs: ||chr.fasta, lane.1.fq.gz, lane.2.fq.gz || |
| 36 | ||outputs: ||lane.chr.1.sai, lane.chr.2.sai || |
| 37 | ||docs: ||http://bio-bwa.sourceforge.net/bwa.shtml || |
| 38 | |
| 39 | == align-pe == |
| 40 | Combine the two single-end alignments into one paired-end alignment. |
| 41 | |
| 42 | ||tool: ||bwa sampe || |
| 43 | ||inputs: ||chr.fasta [[BR]] lane.1.fq.gz [[BR]] lane.2.fq.gz [[BR]] lane.chr.1.sai [[BR]] lane.chr.2.sai || |
| 44 | ||outputs: ||lane.chr.sam || |
| 45 | ||docs: ||http://bio-bwa.sourceforge.net/bwa.shtml || |
| 46 | |
| 47 | == sam-to-bam == |
| 48 | Convert the SAM file to BAM. |
| 49 | |
| 50 | ||tool: ||samtools view || |
| 51 | ||inputs: ||lane.chr.sam || |
| 52 | ||outputs: ||lane.chr.bam || |
| 53 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 54 | |
| 55 | (Question: can this step not also sort and index?) |
| 56 | |
| 57 | == sam-sort == |
| 58 | Sort the BAM file by coordinate. |
| 59 | |
| 60 | ||tool: ||samtools sort || |
| 61 | ||inputs: ||lane.chr.bam || |
| 62 | ||outputs: ||lane.chr.sorted.bam || |
| 63 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 64 | |
| 65 | == sam-index == |
| 66 | Index the sorted BAM file for quicker access. |
| 67 | |
| 68 | ||tool: ||samtools index || |
| 69 | ||inputs: ||lane.chr.sorted.bam || |
| 70 | ||outputs: ||lane.chr.sorted.bai || |
| 71 | ||docs: ||http://samtools.sourceforge.net/samtools.shtml || |
| 72 | |
| 73 | == !MarkDuplicates == |
| 74 | Mark PCR duplicate fragments so they can be filtered out in downstream analysis. |
| 75 | |
| 76 | ||tool: ||MarkDuplicates.jar || |
| 77 | ||inputs: ||lane.chr.sorted.bam || |
| 78 | ||outputs: ||lane.chr.dedup.bam [[BR]] lane.chr.dedup.metrics || |
| 79 | ||docs: ||http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates || |
| 80 | |
| 81 | == !IndelRealigner-!KnownsOnly == |
| 82 | Improve the alignment using known indel information (will reduce false SNP calls) |
| 83 | |
| 84 | ||tool: ||GenomeAnalysisTK.jar -T IndelRealigner || |
| 85 | ||inputs: ||lane.chr.dedup.bam [[BR]] genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf || |
| 86 | ||outputs: ||lane.chr.realigned.bam || |
| 87 | ||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites || |
| 88 | |
| 89 | == !FixMateInformation == |
| 90 | Fix the mate-pair information, which may no longer be consistent after the realignment. |
| 91 | |
| 92 | ||tool: ||FixMateInformation.jar || |
| 93 | ||inputs: ||lane.chr.realigned.bam || |
| 94 | ||outputs: ||lane.chr.matefixed.bam || |
| 95 | ||docs: ||http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation [[BR]] http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Fixing_Mate_Pairs || |
| 98 | |
| 99 | == !CountCovariates == |
| 100 | Count covariates, such as machine cycle and base position in the read, to be used as the basis for quality recalibration. |
| 101 | Optionally, plot the results to PDF using AnalyzeCovariates. |
| 102 | |
| 103 | ||tool: ||GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar || |
| 104 | ||inputs: ||lane.chr.matefixed.bam [[BR]] genome.chr.dbsnpXYZ.rod || |
| 105 | ||outputs: ||lane.chr.covariate_table.csv || |
| 106 | ||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates [[BR]] http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#AnalyzeCovariates.jar || |
| 109 | |
| 110 | == !TableRecalibration == |
| 111 | Recalibrate the base quality scores based on the covariate table. |
| 112 | ||tool: ||GenomeAnalysisTK.jar -T TableRecalibration || |
| 113 | ||inputs: ||lane.chr.matefixed.bam [[BR]] lane.chr.covariate_table.csv [[BR]] chr.fasta || |
| 114 | ||outputs: ||lane.chr.recal.bam || |
| 115 | ||docs: ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration || |
| 116 | |
| 117 | == Repeat: sam-sort, sam-index, countcovariates == |
| 118 | See steps above for commands and docs. |
| 119 | |
| 120 | ||inputs: ||lane.chr.recal.bam || |
| 121 | ||outputs: ||lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv || |
| 122 | |
| 123 | Discussion: |
| 124 | > Why do we need to sort and index after recalibration? Does recalibration mess up the order of things? |
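
The chain of steps above can be checked mechanically: every step input should be either a workflow input or the output of an earlier step. The following is a minimal Python sketch of that check, not part of the workflow itself. The names `Step`, `workflow_inputs`, `workflow_steps` and `unresolved_inputs` are hypothetical helpers introduced for illustration. The sketch assumes that `chr.fasta` in the step tables refers to `genome.chr.fasta`, and that TableRecalibration reads the `lane.chr.covariate_table.csv` written by CountCovariates. It models file names only and does not run any tool.

```python
from typing import List, NamedTuple, Set, Tuple


class Step(NamedTuple):
    """One workflow step: its name and the files it reads and writes."""
    name: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]


def workflow_inputs(lane: str, chrom: str) -> Set[str]:
    """Files that must exist before the workflow starts."""
    return {
        f"{lane}.1.fq.gz",
        f"{lane}.2.fq.gz",
        f"genome.{chrom}.fasta",
        f"genome.{chrom}.realign.intervals",
        f"genome.{chrom}.dbsnpXYZ.rod",
        f"genome.{chrom}.indelsXYZ.vcf",
    }


def workflow_steps(lane: str, chrom: str) -> List[Step]:
    """Steps in the order documented on this page, with their file names."""
    p = f"{lane}.{chrom}"
    fq1, fq2 = f"{lane}.1.fq.gz", f"{lane}.2.fq.gz"
    # Assumption: "chr.fasta" in the step tables means genome.<chr>.fasta.
    ref = f"genome.{chrom}.fasta"
    dbsnp = f"genome.{chrom}.dbsnpXYZ.rod"
    indels = f"genome.{chrom}.indelsXYZ.vcf"
    intervals = f"genome.{chrom}.realign.intervals"
    return [
        Step("align", (ref, fq1, fq2), (f"{p}.1.sai", f"{p}.2.sai")),
        Step("align-pe", (ref, fq1, fq2, f"{p}.1.sai", f"{p}.2.sai"), (f"{p}.sam",)),
        Step("sam-to-bam", (f"{p}.sam",), (f"{p}.bam",)),
        Step("sam-sort", (f"{p}.bam",), (f"{p}.sorted.bam",)),
        Step("sam-index", (f"{p}.sorted.bam",), (f"{p}.sorted.bai",)),
        Step("MarkDuplicates", (f"{p}.sorted.bam",),
             (f"{p}.dedup.bam", f"{p}.dedup.metrics")),
        Step("IndelRealigner-KnownsOnly",
             (f"{p}.dedup.bam", intervals, dbsnp, indels), (f"{p}.realigned.bam",)),
        Step("FixMateInformation", (f"{p}.realigned.bam",), (f"{p}.matefixed.bam",)),
        Step("CountCovariates", (f"{p}.matefixed.bam", dbsnp),
             (f"{p}.covariate_table.csv",)),
        # Assumption: the recalibration table is the CountCovariates output.
        Step("TableRecalibration",
             (f"{p}.matefixed.bam", f"{p}.covariate_table.csv", ref),
             (f"{p}.recal.bam",)),
        Step("repeat sort/index/countcovariates", (f"{p}.recal.bam",),
             (f"{p}.recal.sorted.bam", f"{p}.recal.sorted.bam.bai",
              f"{p}.recal.covariate_table.csv")),
    ]


def unresolved_inputs(steps: List[Step], available: Set[str]) -> List[Tuple[str, str]]:
    """Return (step name, file) pairs for inputs that nothing has produced yet."""
    known = set(available)
    missing = []
    for step in steps:
        for name in step.inputs:
            if name not in known:
                missing.append((step.name, name))
        known.update(step.outputs)
    return missing


if __name__ == "__main__":
    steps = workflow_steps("lane", "chr")
    for step_name, file_name in unresolved_inputs(steps, workflow_inputs("lane", "chr")):
        print(f"{step_name}: no earlier step produces {file_name}")
```

A check like this flags any input file name that does not match an earlier output, which is the kind of naming slip that is easy to make when the step tables are edited by hand.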