Workflow 2: Alignment per Lane, per Chr
This workflow aligns reads per lane and chromosome, including:
- re-alignment to prevent false SNP calls caused by indels (using known indels)
- markduplicates to prevent false coverage caused by PCR errors (per library = lane)
- base quality recalibration to correct for false low scores caused by true variation
Workflow Inputs:
- lane.1.fq.gz - raw reads for lane, pair end 1
- lane.2.fq.gz - raw reads for lane, pair end 2
- genome.chr.fasta - reference genome split on chromosome
- genome.chr.realign.intervals - targets for realignment per chromosome
- genome.chr.dbsnpXYZ.rod - known SNP variants, here from dbSNP
- genome.chr.indelsXYZ.vcf - known indels, here from 1KG
Workflow outputs:
- lane.chr.1.sai - alignment index for first pair
- lane.chr.2.sai - alignment index for second pair
- lane.chr.sam - alignment map for the paired reads
- lane.chr.bam - alignment map in binary format
- lane.chr.sorted.bam - sorted alignment map
- lane.chr.sorted.bai - sorted alignment index
- lane.chr.dedup.bam - marked duplicate PCR elements
- lane.chr.dedup.metrics - metrics describing deduplication
- lane.chr.realigned.bam - realigned based on known indels
- lane.chr.matefixed.bam - fixed the mate pair ends
- lane.chr.covariate_table.csv - table of countcovariates output for recalibration
- lane.chr.recal.bam - alignment map with recalibrated quality scores
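The naming scheme in the outputs list can be sketched as a small helper. This is only an illustration of the convention: the suffixes are copied from the list above, while the function name and the example values `lane3` and `chr20` are hypothetical.

```python
# Suffixes copied from the "Workflow outputs" list, in pipeline order.
OUTPUT_SUFFIXES = [
    "1.sai", "2.sai", "sam", "bam", "sorted.bam", "sorted.bai",
    "dedup.bam", "dedup.metrics", "realigned.bam", "matefixed.bam",
    "covariate_table.csv", "recal.bam",
]


def output_names(lane, chrom):
    """Return every per-lane, per-chromosome output file name."""
    return [f"{lane}.{chrom}.{suffix}" for suffix in OUTPUT_SUFFIXES]


if __name__ == "__main__":
    # "lane3" and "chr20" are made-up example values.
    for name in output_names("lane3", "chr20"):
        print(name)
```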
align
Align each end of the read pair separately.
tool: | bwa aln |
inputs: | chr.fasta, lane.1.fq.gz, lane.2.fq.gz |
outputs: | lane.chr.1.sai, lane.chr.2.sai |
docs: | http://bio-bwa.sourceforge.net/bwa.shtml |
align-pe
Align the pairs as one
tool: | bwa sampe |
inputs: | chr.fasta lane.1.fq.gz lane.2.fq.gz lane.chr.1.sai lane.chr.2.sai |
outputs: | lane.chr.sam |
docs: | http://bio-bwa.sourceforge.net/bwa.shtml |
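The align and align-pe steps could be scripted roughly as follows. This is a sketch, not the workflow's actual implementation: it assumes `bwa` is on the PATH, that the reference has already been indexed with `bwa index`, and that output is captured from stdout. The helper names `bwa_commands` and `run_to_file` are hypothetical.

```python
import subprocess


def bwa_commands(ref, fq1, fq2, prefix):
    """Return (argv, stdout_file) pairs for the align and align-pe steps."""
    sai1 = f"{prefix}.1.sai"
    sai2 = f"{prefix}.2.sai"
    return [
        # bwa aln writes the .sai alignment index to stdout.
        (["bwa", "aln", ref, fq1], sai1),
        (["bwa", "aln", ref, fq2], sai2),
        # bwa sampe combines both ends and writes SAM to stdout.
        (["bwa", "sampe", ref, sai1, sai2, fq1, fq2], f"{prefix}.sam"),
    ]


def run_to_file(argv, out_path):
    """Run one command and redirect its stdout into out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(argv, stdout=out, check=True)


if __name__ == "__main__":
    # Print the commands only; call run_to_file(argv, out) to execute them.
    for argv, out in bwa_commands("chr.fasta", "lane.1.fq.gz",
                                  "lane.2.fq.gz", "lane.chr"):
        print(" ".join(argv), ">", out)
```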
sam-to-bam
Convert sam to bam
tool: | samtools view |
inputs: | lane.chr.sam |
outputs: | lane.chr.bam |
docs: | http://samtools.sourceforge.net/samtools.shtml |
(Question: can this not index and sort?)
sam-sort
Sort bam file on coordinate
tool: | samtools sort |
inputs: | lane.chr.bam |
outputs: | lane.chr.sorted.bam |
docs: | http://samtools.sourceforge.net/samtools.shtml |
sam-index
Index bam file for quicker access
tool: | samtools index |
inputs: | lane.chr.sorted.bam |
outputs: | lane.chr.sorted.bai |
docs: | http://samtools.sourceforge.net/samtools.shtml |
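The three samtools steps above (sam-to-bam, sam-sort, sam-index) might be chained like this. The sketch assumes samtools 1.x argument conventions. The 0.1.x releases from around the time this page was written took an output prefix for `samtools sort` instead of `-o`, so the sort line would need adjusting there. The function name is hypothetical.

```python
def samtools_commands(prefix):
    """Return argv lists for sam-to-bam, sam-sort and sam-index."""
    sam = f"{prefix}.sam"
    bam = f"{prefix}.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    bai = f"{prefix}.sorted.bai"
    return [
        # -b: write BAM; -S: input is SAM (ignored by newer samtools).
        ["samtools", "view", "-b", "-S", "-o", bam, sam],
        # Coordinate sort is the default order (samtools 1.x syntax assumed).
        ["samtools", "sort", "-o", sorted_bam, bam],
        # The optional second argument names the index file.
        ["samtools", "index", sorted_bam, bai],
    ]


if __name__ == "__main__":
    for argv in samtools_commands("lane.chr"):
        print(" ".join(argv))
```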
MarkDuplicates
Mark duplicate PCR fragments to be filtered in analysis
tool: | MarkDuplicates.jar |
inputs: | lane.chr.sorted.bam |
outputs: | lane.chr.dedup.bam lane.chr.dedup.metrics |
docs: | http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates |
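A possible invocation of the MarkDuplicates step, using the legacy per-tool Picard jar and its KEY=VALUE argument style. The jar location and the function name are assumptions for illustration; adjust the path to the local Picard install.

```python
def mark_duplicates_command(prefix, jar="MarkDuplicates.jar"):
    """Return the argv list for Picard MarkDuplicates on one lane/chr."""
    return [
        "java", "-jar", jar,  # jar path is an assumption
        f"INPUT={prefix}.sorted.bam",
        f"OUTPUT={prefix}.dedup.bam",
        f"METRICS_FILE={prefix}.dedup.metrics",
    ]


if __name__ == "__main__":
    print(" ".join(mark_duplicates_command("lane.chr")))
```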
IndelRealigner-KnownsOnly
Improve the alignment using known indel information (will reduce false SNP calls)
tool: | GenomeAnalysisTK.jar -T IndelRealigner |
inputs: | lane.chr.dedup.bam genome.chr.realign.intervals genome.chr.dbsnpXYZ.rod genome.chr.indelsXYZ.vcf |
outputs: | lane.chr.realigned.bam |
docs: | http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites |
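A rough sketch of the known-sites-only realignment call. The flag names (`-known`, `--consensusDeterminationModel KNOWNS_ONLY`) are those of later GATK 1.x to 3.x releases; the release this page was written against may instead have used rod bindings, so treat the exact flags as an assumption and check the linked docs. The dbSNP rod listed in the inputs is left out here because how it is passed depends on the GATK version. The function name and jar path are hypothetical.

```python
def indel_realigner_command(bam_in, ref, intervals, known_indels, bam_out,
                            jar="GenomeAnalysisTK.jar"):
    """Return the argv list for IndelRealigner restricted to known indels."""
    return [
        "java", "-jar", jar,  # jar path is an assumption
        "-T", "IndelRealigner",
        "-R", ref,
        "-I", bam_in,
        "-targetIntervals", intervals,
        # Flag names assumed from later GATK 1.x-3.x releases.
        "-known", known_indels,
        "--consensusDeterminationModel", "KNOWNS_ONLY",
        "-o", bam_out,
    ]


if __name__ == "__main__":
    print(" ".join(indel_realigner_command(
        "lane.chr.dedup.bam", "genome.chr.fasta",
        "genome.chr.realign.intervals", "genome.chr.indelsXYZ.vcf",
        "lane.chr.realigned.bam")))
```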
FixMateInformation
Fix the paired-end information as a consequence of the realignment.
tool: | FixMateInformation.jar |
inputs: | lane.chr.realigned.bam |
outputs: | lane.chr.matefixed.bam |
docs: | http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation |
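The FixMateInformation step could be invoked along these lines, again with the legacy per-tool Picard jar. Setting `SORT_ORDER=coordinate` is a choice made for this sketch so that the output stays coordinate sorted; the jar path and function name are assumptions.

```python
def fix_mate_command(prefix, jar="FixMateInformation.jar"):
    """Return the argv list for Picard FixMateInformation on one lane/chr."""
    return [
        "java", "-jar", jar,  # jar path is an assumption
        f"INPUT={prefix}.realigned.bam",
        f"OUTPUT={prefix}.matefixed.bam",
        # Ask Picard to write the result in coordinate order.
        "SORT_ORDER=coordinate",
    ]


if __name__ == "__main__":
    print(" ".join(fix_mate_command("lane.chr")))
```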
CountCovariates
Count covariates, such as machine cycle and bp position, to be used as the basis for quality recalibration. Optionally, plot the results to PDF using AnalyzeCovariates.
tool: | GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar |
inputs: | lane.chr.matefixed.bam genome.chr.dbsnpXYZ.rod |
outputs: | lane.chr.covariate_table.csv |
docs: | http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates |
TableRecalibration
Recalibrate quality scores based on the covariate table
tool: | GenomeAnalysisTK.jar -T TableRecalibration |
inputs: | lane.chr.matefixed.bam lane.chr.covariate_table.csv chr.fasta |
outputs: | lane.chr.recal.bam |
docs: | http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration |
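The two recalibration steps (CountCovariates, then TableRecalibration) can be sketched as a pair of GATK 1.x style commands. The `-cov` list shown is the commonly documented set of covariates and is an assumption here, as is the `-knownSites` flag: older releases bound the dbSNP rod differently, so check the linked docs for the installed version. The function name and jar path are hypothetical.

```python
def recalibration_commands(bam_in, ref, known_sites, table, bam_out,
                           jar="GenomeAnalysisTK.jar"):
    """Return argv lists for CountCovariates and TableRecalibration."""
    gatk = ["java", "-jar", jar]  # jar path is an assumption
    count = gatk + [
        "-T", "CountCovariates",
        "-R", ref,
        "-I", bam_in,
        # How known sites are passed depends on the GATK version (assumption).
        "-knownSites", known_sites,
        "-cov", "ReadGroupCovariate",
        "-cov", "QualityScoreCovariate",
        "-cov", "CycleCovariate",
        "-cov", "DinucCovariate",
        "-recalFile", table,
    ]
    recal = gatk + [
        "-T", "TableRecalibration",
        "-R", ref,
        "-I", bam_in,
        "-recalFile", table,
        "-o", bam_out,
    ]
    return [count, recal]


if __name__ == "__main__":
    for argv in recalibration_commands(
            "lane.chr.matefixed.bam", "genome.chr.fasta",
            "genome.chr.dbsnpXYZ.rod", "lane.chr.covariate_table.csv",
            "lane.chr.recal.bam"):
        print(" ".join(argv))
```

Both commands read the same `lane.chr.matefixed.bam`; only the second one writes a new BAM.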
Repeat: sam-sort, sam-index, countcovariates
See steps above for commands and docs.
inputs: | lane.chr.recal.bam |
outputs: | lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv |
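To sanity-check the step order, the inputs and outputs listed on this page can be written down as data and checked so that every step only reads files that are workflow inputs or were produced earlier. This is an illustrative sketch: the file names are taken from the steps above, with the reference written as genome.chr.fasta throughout and the covariate table named lane.chr.covariate_table.csv; `missing_inputs` is a hypothetical helper.

```python
WORKFLOW_INPUTS = [
    "lane.1.fq.gz", "lane.2.fq.gz", "genome.chr.fasta",
    "genome.chr.realign.intervals", "genome.chr.dbsnpXYZ.rod",
    "genome.chr.indelsXYZ.vcf",
]

# (step name, inputs, outputs), in the order given on this page.
STEPS = [
    ("align", ["genome.chr.fasta", "lane.1.fq.gz", "lane.2.fq.gz"],
     ["lane.chr.1.sai", "lane.chr.2.sai"]),
    ("align-pe", ["genome.chr.fasta", "lane.1.fq.gz", "lane.2.fq.gz",
                  "lane.chr.1.sai", "lane.chr.2.sai"], ["lane.chr.sam"]),
    ("sam-to-bam", ["lane.chr.sam"], ["lane.chr.bam"]),
    ("sam-sort", ["lane.chr.bam"], ["lane.chr.sorted.bam"]),
    ("sam-index", ["lane.chr.sorted.bam"], ["lane.chr.sorted.bai"]),
    ("MarkDuplicates", ["lane.chr.sorted.bam"],
     ["lane.chr.dedup.bam", "lane.chr.dedup.metrics"]),
    ("IndelRealigner-KnownsOnly",
     ["lane.chr.dedup.bam", "genome.chr.realign.intervals",
      "genome.chr.dbsnpXYZ.rod", "genome.chr.indelsXYZ.vcf"],
     ["lane.chr.realigned.bam"]),
    ("FixMateInformation", ["lane.chr.realigned.bam"],
     ["lane.chr.matefixed.bam"]),
    ("CountCovariates", ["lane.chr.matefixed.bam", "genome.chr.dbsnpXYZ.rod"],
     ["lane.chr.covariate_table.csv"]),
    ("TableRecalibration",
     ["lane.chr.matefixed.bam", "lane.chr.covariate_table.csv",
      "genome.chr.fasta"], ["lane.chr.recal.bam"]),
    ("repeat", ["lane.chr.recal.bam"],
     ["lane.chr.recal.sorted.bam", "lane.chr.recal.sorted.bam.bai",
      "lane.chr.recal.covariate_table.csv"]),
]


def missing_inputs(steps, initial):
    """Return (step, file) pairs for inputs not available when needed."""
    available = set(initial)
    problems = []
    for name, inputs, outputs in steps:
        for path in inputs:
            if path not in available:
                problems.append((name, path))
        available.update(outputs)
    return problems


if __name__ == "__main__":
    print(missing_inputs(STEPS, WORKFLOW_INPUTS))
```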
Discussion:
Why do we need to sort and index after recalibration? Does it mess up the order of things?