Changes between Version 36 and Version 37 of SnpCallingPipeline


Ignore:
Timestamp:
Oct 18, 2010 4:36:18 AM (12 years ago)
Author:
Morris Swertz
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • SnpCallingPipeline

    v36 v37  
    7272
    7373
    74 == A. Pre-alignment ==
    75 
    76 This process involves all needed steps to prepare the raw sequencing reads for an alignment using BWA. These steps are all conducted using PICARD. An overview of the steps can be found in Appendix A.
    77 === 1) Read converting ===
    78 
    79 This step involves converting the output from the Illumina GAII, FASTQ format, to binary SAM format (BAM). To do this !FastqToSam.jar can be used. The following command was used:
    80 
    81 {{{
    82 #!sh
    83 java -jar !FastqToSam.jar FASTQ=../testrun/s_2_sequence041090.txt QUALITY_FORMAT=Illumina OUTPUT=../testrun/s_2_sequence.bam SAMPLE_NAME=s_2_test_sample PLATFORM_UNIT=barcode_13434HWUSIsomething PLATFORM=illumina SORT_ORDER=coordinate
    84 }}}
    85 
    86 It is also possible to convert raw BUSTARD data to BAM using !BustardToSam.jar. The advantage of this tool is the multiple lane input, in theory all lanes from one flow cell can be converted parallel. To remove possible linkers from reads a tool named !ExtractIlluminaBarcodes.jar can be used. Since there isn’t any data including these barcodes this and the !BustardToSam haven’t been tested yet.
    87 
    88 === 2) Sort & index reads ===
    89 
    90 The output of !FastqToSam.jar needs to be sorted and indexed in order to use it as input for the alignment. To sort these files SAMtools was used. The following commands can be executed on the BAM file:
    91 
    92 {{{
    93 #!sh
    94 ./samtools sort ../testrun/s_2_sequence041090_aligned.bam ../testrun/s_2_aligned_sorted
    95 }}}
    96 
    97 The postfix .bam is automatically added to the output file.
    98 
    99 The corresponding BAM index file can be created using:
    100 
    101 {{{
    102 #!sh
    103 ./samtools index ../testrun/s_2_aligned_sorted.bam ../testrun/s_2_aligned_sorted.bam.bai
    104 }}}
    105 
    106 == B. Alignment ==
    107 This consists of alignment using BWA, coverting steps and the sorting and indexing of the output BAM file.
    108 
    109 === 3) Alignment] ===
    110 
    111 '''The three commands in this section are identical. Copy&Paste error?  -- Leon'''
    112 
    113 To align the reads to a reference genome the GATK uses BWA. The indices used for this step were already made for an earlier Galaxy project. GATK uses an own version of BWA which doesn’t support multiple threading yet. To overcome this problem alignment was performed using the standard BWA. The following command was executed:'''''''
    114 
    115 {{{
    116 #!sh
    117 ./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai
    118 }}}
    119 
    120 This results in an alignment file in .sai format. To convert this to SAM format the following command was used:
    121 
    122 {{{
    123 #!sh
    124 ./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai
    125 }}}
    126 
    127 To convert this SAM file to BAM SAMtools was used, the command:
    128 
    129 {{{
    130 #!sh
    131 ./bwa aln -l 101 -k 4 -t 16 ../Sting/dist/resources/index_hg36.fa ../testrun/s_2_sequence041090.txt -f ../testrun/s_2_sequence041090_aligned.sai
    132 }}}
    133 
    134 === 4) Sort & index alignment ===
    135 After this converting step the BAM file needs to be sorted & indexed again using the following command:
    136 
    137 {{{
    138 #!sh
    139 ./samtools sort ../testrun/s_2_sequence041090_aligned.bam ../testrun/s_2_aligned_sorted
    140 }}}
    141 
    142 '''Above should be ../testrun/s_2_aligned_sorted.bam? --Leon'''
    143 
    144 The index was created using:
    145 
    146 {{{
    147 #!sh
    148 ./samtools index ../testrun/s_2_aligned_sorted.bam ../testrun/s_2_aligned_sorted.bam.bai
    149 }}}
    150 
    151 '''NOTE. Keep track of the reference build used. UCSC uses human build hg18 and hg19, which correspond to build 36 and 37 of the NCBI. These builds contains the same information and sequences but different headers etc. The dbSNP ROD files which need to be used further on in the pipeline can be downloaded from UCSC. To overcome any compatibility problems it is advised to use hg18 and hg19 as reference during alignment and further calculations.'''
    152 
    153 == C. Post-alignment ==
    154 This step involves all processes executed on the generated alignments. These steps include quality value recalibration, SNP/indel calling, annotation and the calculation of statistics as well.
    155 
    156 === 5) Quality value recalibration (Count covariates)] ===
    157 
    158 To determine the covariates influencing base quality scores in the BAM file obtained after alignment this step is performed. This results in a CSV file. The following command was used to execute this step:
    159 
    160 {{{
    161 #!sh
    162 java -jar GenomeAnalysisTK.jar -R resources/index_hg36.fa --default_platform illumina --DBSNP resources/hg36/snp130.txt -I ../../testrun/s_2_aligned_sorted.bam --max_reads_at_locus 20000 -T !CountCovariates -cov !ReadGroupcovariate -cov !QualityScoreCovariate -cov !CycleCovariate -cov !DinucCovariate -recalFile ../../testrun/recal_data.csv
    163 }}}
    164 
    165 The parameters used for this calculation need to be tested more. This to see the influence of each parameter on the output results.
    166 
    167 === 6) Quality value recalibration (Table recalibration) ===
    168 
    169 This step includes the actual quality value recalibration. The recalibration takes place using the generated CSV file as input. The output of this process is the alignment file with recalibrated quality values in BAM format. This step was executed using the following command:
    170 
    171 {{{
    172 #!sh
    173 java -jar GenomeAnalysisTK.jar -R resources/index_hg36.fa --default_platform illumina -I ../../testrun/s_2_aligned_sorted.bam -T !TableRecalibration -outputBam ../../testrun/s_2_aligned_recalibrated.bam -recalFile ../../testrun/recal_data.csv
    174 }}}
    175 
    176 === 7) Sort & index recalibrated alignment] ===
    177 
    178 The resulting BAM file needs to be sorted and indexed again. This step needs to be repeated every time an analysis or calculation has been performed on a BAM file. To sort and index this file use the commands explained in step 4.
    179 
    180 === 8) Quality value recalibration (Count covariates) ===
    181 
    182 This step is executed on the obtained BAM file after the quality value recalibration. This file is used as input, the command used further is exactly the same as the command used in step 5.
    183 
    184 === 9) Generate graphs/stats ===
    185 To generate several statistics and graphs on the recalibrated alignment the !AnalyzeCovariates.jar tool can be used. This tool uses the generated .CSV file as an input. The output is a directory containing several statistics and graphs about the quality values etc. The following command was used:
    186 
    187 {{{
    188 #!sh
    189 Java –jar !AnalyzeCovariates.jar –recalFile ../../testrun/recal_data.csv –outputDir ../../testrun/analyzecovariates_output –resources resources/index_hg36.fa –ignoreQ 5
    190 }}}
    191 
    192 === 10) Generate graphs/stats ===
    193 This step calculates the same as step 9, but used the CSV file from step 5 as input. The output from this step can be compared to the output of step 9. By comparing the output from both above mentioned steps the increase and changes in quality value etc. can be visualized.
    194 
    195 == D. SNP & indel calling ==
    196 This part of the pipeline involves the tools for SNP and indel calling. This part also includes local realignment and filtering for potential SNPs. The following steps divide this part in sub processes.
    197 
    198 === 11) Local realignment around indels ===
    199 
    200 The first step is to determine small suspicious intervals in the alignment. These intervals will then be realigned to get an optimal output result. To do this, these intervals were calculated using the following command:
    201 
    202 {{{
    203 #!sh
    204 java -jar GenomeAnalysisTK.jar -T !RealignerTargetCreator -I ../../testrun/s_2_aligned_recalibrated_sorted.bam -R resources/index_hg36.fa -o ../../testrun/forRealigner.intervals -D resources/hg36/snp130.txt
    205 }}}
    206 
    207 This process outputs a file containing intervals on which realignment should be executed.
    208 
    209 === 12) Cleaning BAM alignment file by realignment ===
    210 
    211 For each detect interval a local realignment takes place. This results in a cleaned BAM file on which SNP and indel calling takes place. The local alignment was performed using the following command:
    212 
    213 {{{
    214 #!sh
    215 java -Djava.io.tmpdir=/tmp -jar GenomeAnalysisTK.jar -I ../../testrun/s_2_aligned_recalibrated_sorted.bam -R resources/index_hg36.fa -T !IndelRealigner -targetIntervals ../../testrun/forRealigner.intervals --output ../../testrun/s_2_aligned_cleaned.bam -D resources/hg36/snp130.txt
    216 }}}
    217 
    218 === 13) Sort & index cleaned alignment ===
    219 
    220 The cleaned BAM file is again sorted and indexed using the command as described in step 4.
    221 
    222 === 14) Generate raw indel calls ===
    223 In this step raw indel calls are made. This output two bed files, the first containing raw indels, the second detailed output. This process was executed using the following command:
    224 
    225 {{{
    226 #!sh
    227 Java –jar GenomeAnalysisTK.jar –T IndelGenotyperV2 –R resources/index_hg36.fa --I ../../testrun/s_2_aligned_cleaned.bam –O ../../testrun/indels.raw.bed –o ../../testrun/detailed.output.bed –verbose –mnr 1000000
    228 }}}
    229 
    230 The generated raw indel file will later be used in step 17.
    231 
    232 === 15) Filter raw indels ===
    233 To retrieve the actual list of indels this last step needs to be performed. This results in a filtered list containing all the detected indels. The following command (Perl script) was used:
    234 
    235 {{{
    236 #!sh
    237 perl ../perl/filterSingleSampleCalls.pl --calls ../../testrun/detailed.output.bed --max_cons_av_mm 3.0 --max_cons_nqs_av_mm 0.5 --mode ANNOTATE > ../../testrun/indels.filtered.bed
    238 }}}
    239 
    240 === 16) Making SNP calls ===
    241 To make SNP calls this process needs to be executed. This results in a VCF file containing raw SNPs. This list will be filtered in step 18, using output from step 17. To generate this list of raw SNPs the following command was used:
    242 
    243 {{{
    244 #!sh
    245 Java –jar GenomeAnalysisTK.jar –R resources/index_hg36.fa –T !UnifiedGenotyper -I ../../testrun/s_2_aligned_cleaned.bam -D resources/hg36/snp130.txt –varout ../../testrun/s2_snps.raw.vcf –stand_call_conf 30.0
    246 }}}
    247 
    248 === 17) SNP filter pre-processing ===
    249 In this step the raw indel file created in step 14 was used. Based on the regions described in this file the tool filters out SNPs located near these potential indels. The output of this file is used for filtering the SNP calls in step 18. The following command was used:
    250 
    251 {{{
    252 #!sh
    253 Python ../python/makeIndelMask.py ../../testrun/indels.raw.bed 10 ../../testrun/indels.mask.bed
    254 }}}
    255 
    256 The number in this command stands for the number of bases that will be included on either side of the indel.
    257 
    258 === 18) Filter SNP calls ===
    259 The last step is the filtering on the list of SNPs. This filtering step includes a list of parameters which haven’t been tested yet. These parameters aren’t included in this command because they have to be tested. The following command was executed:
    260 
    261 {{{
    262 #!sh
    263 Java –jar GenomeAnalysisTK.jar –T !VariantFiltration –R resources/index_hg36.fa –O ../../testrun/s2_snps.filtered.vcf –B ../../testrun/s2_snps.raw.vcf –B ../../testrun/indels.mask.bed –maskName !InDel –clusterWindowSize 10 –filterExpression “AB > 0.75 && DP > 40
    264 ||DP > 100||MQ0 > 40||SB > -0.10” –filterName “!StandardFilters” –filterExpression “MQ0 >= 4 && ((1.0 * DP) > 0.1)” –filterName “HARD_TO_VALIDATE”||
    265 }}}
    266 
    267 The output of this step is the list of SNPs found in this individual. More information on the parameters used can be found on: http://www.broadinstitute.org/gsa/wiki/index.php/VariantFiltrationWalker & http://www.broadinstitute.org/gsa/wiki/index.php/Using_JEXL_expressions
    268 
    269 == E. Calculating quality metrics & removing duplicate reads ==
    270 
    271 To remove duplicate reads after alignment PICARD can be used. This tool can also calculate different metrics on alignment data. Therefore PICARD was installed and tested on an aligned and sorted BAM file. A process overview can be seen in figure 3. This process starts at step 4 of the analysis pipeline and uses the sorted alignment file as input. A complete testrun also was performed on this tool and developed pipeline. The command, results etc. can be found in Complete_Picard_test_run.doc.
    272 == Remarks & Suggestions ==
    273 There are several other tools which can be implemented into the pipeline, one of these tools is an implementation for Beagle. This tool is only usable in an experimental way at the moment.  The only use cases supported by this interface are:
    274 
    275  * Inferring      missing genotype data from call sets (e.g. for lack of coverage in      low-pass data)
    276  * Genotype      inference for unrelated individuals.
    277 
    278 The basic workflow for this interface is as follows:
    279 
    280  * After variants are called and possibly filtered, the GATK walker !ProduceBeagleInputWalker will take the resulting VCF as input, and will produce a likelihood file in BEAGLE format.
    281  * User needs to run BEAGLE with this likelihood file specified as input.
    282  *  After Beagle runs, user must unzip resulting output files (.gprobs, .phased) containing posterior genotype probabilities and phased haplotypes.
    283  * User can then run GATK walker BeagleOutputToVCFWalker to produce a new VCF with updated data. The new VCF will contain updated genotypes as well as updated annotations.
    284 
    285 GATK also has the ability to run process in parallel. This option is only supported by a limited amount of tools. It can be imported in every tool using parallelism, more information on this can be found on: http://www.broadinstitute.org/gsa/wiki/index.php/Parallelism_and_the_GATK
    286 
    287 
    288 To construct the correct read groups after merging BAM files obtained using BWA a patch can be used. The patch can be found here: http://www.broadinstitute.org/gsa/wiki/index.php/BWA_patch_to_generate_read_group
    289 In the next version of BWA this patch will be included in the standard BWA version.
    290 = Manuals, help etc =
    291 More information about the GATK can be found at:
    292 
    293 http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
    294 
    295 A forum containing a lot of background information can be found at:
    296 
    297 http://getsatisfaction.com/gsa
    298 
    299 Information on the other tools used can be found at;
    300 
    301 SAMtools: http://samtools.sourceforge.net/
    302 
    303 PICARD: http://picard.sourceforge.net/index.shtml
    304 
    305 Another source of valuable information on anything NGS related:
    306 
    307 http://seqanswers.com/forums/index.php
    308 = Appendix A – Schematic overview of the pipeline =
    309 
    310 == Figure 1. Overview of the analysis pipeline created using the GATK. ==
    311 The steps/processes are all numbered. More information on these steps can be found in the corresponding text sessions.''
    312 
    313 [[Image(wiki:SnpCallingPipeline:Figure1.png)]]
    314 
    315 == Figure 2. Overview of the analysis pipeline created using the GATK. ==
    316 
    317 This part starts at step 7 from figure 1. The steps/processes are all numbered. More information on these steps can be found in the corresponding text sessions.''
    318 
    319 [[Image(wiki:SnpCallingPipeline:Figure2.png)]]
    320 
    321 == Figure 3. Overview of the PICARD pipeline. ==
    322 
    323 This tool performs several quality checks after the alignment step completed.''
    324 
    325 [[Image(wiki:SnpCallingPipeline:Figure3.png)]]
     74== Workflow 1: genome reference file creation ==
     75
     76This workflow creates reference files per chromosome including:
     77* genome, dbsnp and indel vcfs per chromosome
     78* realign targets for faster realignment target creation
     79* index files for samtools and bwa
     80
     81Workflow inputs:
     82* genome.chr.fa - downloaded from genome supplier (now hg19)
     83* dbsnpXYZ.rod - downloaded reference SNPs from dbsnp (now 129)
     84* indelsXYZ.vcf - downloaded reference indels from 1KG
     85
     86Workflow outputs:
     87* genome.chr.fa - cleaned headers
     88* genome.chr.fa.fa - index for samtools
     89* genome.chr.fa.<format> - multilple index files for bwa
     90* dbsnpXYZ.chr.rod - split per chromosome
     91* indelsXYZ.chr.vcf - split per chromosome
     92* genome.chr.realign.intervals - targets for realignment
     93
     94=== clean-fasta-headers ===
     95Clean headers to only have '1' instead of Chr1, etc
     96
     97||tool:    || ||
     98||inputs:  ||genome.chr.fa ||
     99||outputs: ||genome.chr.fa ||
     100||doc:     ||internally developed ||
     101
     102=== split-vcf-chr for dbsnp and indels ===
     103Split vcf per chromosome
     104||tool:    || ||
     105||inputs:  ||dbsnpXYZ.rod, indelsXYZ.vcf ||
     106||outputs: ||dbsnpXYz.chr.rod, indelsXYZ.vcf ||
     107||doc:     || ||
     108
     109Discussion:
     110> Can we use http://vcftools.sourceforge.net/options.html ?
     111>> vcftools --vcf indelsXYZ.vcf --chr <i> --recode --out indelsXYZ.chr
     112
     113=== index-chromosomes ===
     114Index reference sequence for each chromosome in the FASTA format
     115
     116||tool:   ||samtools faidx ||
     117||input:  ||genome.chr.fa ||
     118||output: ||genome.chr.fa.fai ||
     119||doc:    ||http://samtools.sourceforge.net/samtools.shtml#3 ||
     120
     121=== bwa-index-chromosomes ===
     122Index reference sequence for each chromosome for bwa alignment
     123
     124||tool:   ||bwa index -a IS ||
     125||input:  ||genome.chr.fa ||
     126||output: ||genome.chr.fa.xyz ||
     127||doc:    ||http://bio-bwa.sourceforge.net/bwa.shtml#3 ||
     128
     129=== !RealignerTargetCreator ===
     130Generate realignment targets for known sites for each chromosome
     131
     132||tool:   ||GenomeAnalysisTK.jar -T RealignerTargetCreator ||
     133||input:  ||genome.chr.fa, dbsnpXYz.chr.rod, indelsXYZ.vcf ||
     134||output: ||genome.chr.realign.intervals ||
     135||doc:    ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites ||
     136
     137== Workflow 2: Alignment per Lane, per Chr ==
     138
     139This workflow aligns reads per lane and chromosome, including:
     140* re-alignment to prevend false SNP calls caused by indels (using known indels)
     141* markduplicates to prevend false coverage caused by PCR errors (per library = lane)
     142* base quality recalibration to correct for false low scores caused by true variation
     143
     144Workflow Inputs:
     145* lane.1.fq.gz - raw reads for lane, pair end 1
     146* lane.2.fq.gz - raw reads for lane, pair end 2
     147* genome.chr.fasta - reference genome split on chromosome
     148* genome.chr.realign.intervals - targets for realignment per chromosome
     149* genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp
     150* genome.chr.indelsXYZ.vcf - known indels from, here from 1KG
     151
     152Workflow ouputs:
     153* lane.chr.1.sai - alignment index for first pair
     154* lane.chr.2.sai - alignment index for second pair
     155* lane.chr.sam - alignment map for
     156* lane.chr.bam - alignment map in binary format
     157* lane.chr.sorted.bam - sorted alignment map
     158* lane.chr.sorted.bai - sorted alignment index
     159* lane.chr.dedup.bam - marked duplicate PCR elements
     160* lane.chr.dedup.metrics - metrics describing deduplication
     161* lane.chr.realigned.bam - realigned based on known indels
     162* lane.chr.matefixed.bam - fixed the mate pair ends
     163* lane.chr.covariate_table.csv - table of countcovariates output for recalibration
     164* lane.chr.recal.bam - alignment map with recalibrated quality scores
     165
     166=== align ===
     167Align each end of paired end.
     168 
     169||tool:   ||bwa-align ||
     170||input:  ||chr.fasta, lane.1.fq.gz, lane.2.fq.gz ||
     171||output: ||lane.chr.1.sai, lane.chr.2.sai ||
     172||docs:   ||http://bio-bwa.sourceforge.net/bwa.shtml ||
     173
     174=== align-pe ===
     175Align the pairs as one
     176
     177||tool:    ||bwa sampe ||
     178||inputs:  ||chr.fasta [[BR]] lane.1.fq.gz [[BR]] lane.2.fq.gz [[BR]] lane.chr.1.sai [[BR]] lane.chr.2.sai ||
     179||outputs: ||lane.chr.sam ||
     180||docs:    ||http://bio-bwa.sourceforge.net/bwa.shtml ||
     181
     182=== sam-to-bam ===
     183Convert sam to bam
     184
     185||tool:    ||samtools view ||
     186||inputs:  ||lane.chr.sam ||
     187||outputs: ||lane.chr.bam ||
     188||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||
     189
     190(Question: can this not index and sort?)
     191
     192=== sam-sort ===
     193Sort bam file on coordinate
     194
     195||tool:    ||samtools sort ||
     196||inputs:  ||lane.chr.bam ||
     197||outputs: ||lane.chr.sorted.bam ||
     198||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||
     199
     200=== sam-index ===
     201Index bam file for quicker access
     202
     203||tool:    ||samtools index ||
     204||inputs:  ||lane.chr.sorted.bam ||
     205||outputs: ||lane.chr.sorted.bai ||
     206||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||
     207
     208=== !MarkDuplicates ===
     209Mark duplicate PCR fragments to be filtered in analysis
     210
     211||tool:    ||MarkDuplicates.jar ||
     212||inputs:  ||lane.chr.sorted.bam ||
     213||outputs: ||lane.chr.dedup.bam [[BR]] lane.chr.dedup.metrics ||
     214||docs:    ||http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates ||
     215
     216=== !IndelRealigner-!KnownsOnly ===
     217Improve the alignment using known indel information (will reduce false SNP calls)
     218
     219||tool:    ||GenomeAnalysisTK.jar -T IndelRealigner ||
     220||inputs:  ||lane.chr.dedup.bam [[BR]] genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf ||
     221||outputs: ||lane.chr.realigned.bam ||
     222||docs     ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites ||
     223
     224=== !FixMateInformation ===
     225Fix the paired end information as consequence of the realignment.
     226
     227||tool:    ||FixMateInformation.jar ||
     228||inputs:  ||lane.chr.realigned.bam
     229||outputs: ||lane.chr.matefixed.bam ||
     230||docs:    ||http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation,
     231
     232http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Fixing_Mate_Pairs ||
     233
     234=== !CountCovariates ===
     235Count covariants, such as machine cycle and bp position, to be used as basis for quality recalibration.
     236Optionally: plot the results to pdf using AnalyzeCovariates
     237
     238||tool:    ||GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar ||
     239||inputs:  ||lane.chr.matefixed.bam [[BR]] genome.chr.dbsnpXYZ.rod ||
     240||outputs: ||lane.chr.covariate_table.csv ||
     241||docs:    ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates [[BR]]
     242
     243http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#AnalyzeCovariates.jar ||
     244
     245=== !TableRecalibration ===
     246Recalibrate quality scores based on the covariate table
     247||tool:    ||GenomeAnalysisTK.jar -T TableRecalibration ||
     248||inputs:  ||lane.chr.matefixed.bam [[BR]]lanec.chr.recal_table.csv [[BR]]chr.fasta ||
     249||outputs: ||lane.chr.recal.bam
     250||docs:    ||http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration ||
     251
     252=== Repeat: sam-sort, sam-index, countcovariates ===
     253See steps above for commands and docs.
     254
     255||inputs:  ||lane.chr.recal.bam ||
     256||outputs: ||lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv ||
     257
     258Discussion:
     259> wy do we need to sort and index after recalibration? does it mess up the order of things?
     260
     261== Workflow 3: sample level variant calling ==
     262This workflow will call variants for the samples including:
     263* sample level recalibration
     264* sample level realignment
     265N.B. no sample level MarkDuplicates is needed as lanes == libraries.
     266
     267Workflow inputs:
     268* lane.chr.recal.sorted.bam - for all sample lanes: dedupped, recalibrated, realigned, sorted and indexed bams (3)
     269* sample.chip.vcf - genotypes called from genotype chip
     270Reference:
     271* genome.chr.fasta - reference genome split on chromosome
     272* genome.chr.realign.intervals - targets for realignment per chromosome
     273* genome.chr.dbsnpXYZ.rod - known snp variants, here from dpbsnp
     274* genome.chr.indelsXYZ.vcf - known indels from, here from 1KG
     275
     276Workflow outputs:
     277* sample.chr.bam - merged bam files per sample
     278* sample.chr.realign.interval - realignment target intervals
     279* sample.chr.realigned.bam - realigned
     280* sample.chr.matesfixed.bam - fixed pairs in realignment
     281* sample.chr.indels.vcf - raw indels called
     282* sample.chr.indels.bed - raw indels annotations
     283* sample.chr.indels.txt - output from the indel calling
     284* sample.chr.indels.filtered.bed - indels filtered
     285* sample.chr.snps.vcf - raw snps called
     286* sample.chr.snps.filtered.vcf - snps filtered
     287
     288=== merge-lanes ===
     289Merge lanes into one sample bam
     290
     291||tool:    ||sam merge ||
     292||inputs:  ||lane.chr.recal.sorted.bam ||
     293||outputs: ||sample.chr.bam ||
     294||docs:    ||http://samtools.sourceforge.net/samtools.shtml ||
     295
     296=== !RealignerTargetCreator ===
     297Create realignment targets based on the data (so not only knowns)
     298
     299||tool:    ||GenomeAnalysisTK.jar -T RealignerTargetCreator ||
     300||inputs:  ||sample.chr.bam [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod [[BR]]indelsXYZ.vcf
     301||outputs: ||sample.chr.realign.intervals ||
     302||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Creating_Intervals ||
     303
     304=== !IndelRealigner ===
     305Realign based on realignment targets in previous step
     306
     307||tool:    ||GenomeAnalysisTK.jar -T IndelRealigner ||
     308||inputs:  ||sample.chr.bam [[BR]]genome.chr.realign.intervals [[BR]] genome.chr.dbsnpXYZ.rod [[BR]] genome.chr.indelsXYZ.vcf ||
     309||outputs: ||sample.chr.realigned.bam ||
     310||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Realigning ||
     311
     312=== !FixMateInformation ===
     313See description in workflow2, now applied to sample
     314
     315||inputs:  ||sample.chr.realigned.bam ||
     316||ouputs:  ||sample.chr.matesfixed.bam ||
     317
     318=== !IndelGenotyperV2 ===
     319Call indels
     320
     321||tool:    ||GenomeAnalysisTK.jar -T IndelGenotyperV2 ||
     322||inputs:  ||sample.chr.matesfixed.bam [[BR]]genome.chr.fa ||
     323||outputs: ||sample.chr.indels.vcf [[BR]]sample.chr.indels.bed [[BR]]sample.chr.indels.txt ||
     324||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0 [[BR]]
     325
     326http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SampleIndelGenotyper ||
     327
     328=== !filterSingleSampleCalls ===
     329Filter indels
     330
     331||tool:    ||filterSingleSampleCalls.pl ||
     332||inputs:  ||sample.chr.indels.bed ||
     333||outputs: ||sample.chr.indels.filtered.bed ||
     334||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SampleIndelGenotyper ||
     335
     336=== !UnifiedGenotyper ===
     337Call SNPs
     338
     339||tool:    ||GenomeAnalysisTK.jar -T UnifiedGenotyper ||
     340||inputs:  ||sample.chr.matesfixed [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod ||
     341||outputs: ||sample.chr.snps.vcf ||
     342||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Firehose_Parameters#SetUnifiedGenotypertoEval [[BR]]
     343
     344http://www.broadinstitute.org/gsa/wiki/index.php/Unified_genotyper ||
     345
     346=== !makeIndelMask ===
     347Make indel mask
     348
     349||tool:    ||makeIndelMask.py ||
     350||inputs:  ||sample.chr.indels.bed ||
     351||outputs: ||sample.chr.indels.mask.bed ||
     352||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Indel_Genotyper_V2.0#Creating_a_indel_mask_file ||
     353
     354=== !VariantFiltration ===
     355Filter variants to get the best calls possible
     356
     357||tool:    ||GenomeAnalysisTK.jar -T VariantFiltration ||
     358||inputs:  ||sample.chr.snps.vcf [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod  ||
     359||outputs: ||sample.chr.snps.filtered.vcf ||
     360||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/Best_Practice_Variant_Detection_with_the_GATK_v2#Integrating_analyses:_getting_the_best_call_set_possible
     361
     362||
     363
     364=== !MergeVcfs ===
     365=== !ChipVcf ===
     366Produce vcf for the chips
     367
     368=== !VariantEval ===
     369Create summary information on the variations called for evaluation.
     370Run per sample.snps.filtered.vcf against chip.
     371
     372||tool:    ||GenomeAnalysisTK.jar -T VariantEval ||
     373||inputs:  ||sample.snps.vcf [[BR]]sample.chip.vcf [[BR]]genome.chr.fa [[BR]]dbsnpXYz.chr.rod||
     374||outputs: ||sample.snps.eval ||
     375||doc:     ||http://www.broadinstitute.org/gsa/wiki/index.php/VariantEval ||
     376
     377
     378Discussion:
     379> Do we call SNPs based on the filtered indels or the raw indels?
     380> Should we realign AGAIN after merge of lanes?
     381> BAQ?
     382> MINDEL/PINDEL?
     383
     384
     385
     386
     387
     388
     389