wiki:SnpCallingPipeline/AlignmentAndCleaning

Version 1 (modified by Morris Swertz, 14 years ago)

--

Workflow 2: Alignment per Lane, per Chr

This workflow aligns reads per lane and chromosome, including:

  • re-alignment to prevent false SNP calls caused by indels (using known indels)
  • markduplicates to prevent inflated coverage caused by PCR duplicates (per library = lane)
  • base quality recalibration to correct for falsely low scores caused by true variation

Workflow Inputs:

  • lane.1.fq.gz - raw reads for lane, pair end 1
  • lane.2.fq.gz - raw reads for lane, pair end 2
  • genome.chr.fasta - reference genome split on chromosome
  • genome.chr.realign.intervals - targets for realignment per chromosome
  • genome.chr.dbsnpXYZ.rod - known SNP variants, here from dbSNP
  • genome.chr.indelsXYZ.vcf - known indels, here from 1KG

Workflow outputs:

  • lane.chr.1.sai - alignment index for first pair
  • lane.chr.2.sai - alignment index for second pair
  • lane.chr.sam - alignment map
  • lane.chr.bam - alignment map in binary format
  • lane.chr.sorted.bam - sorted alignment map
  • lane.chr.sorted.bai - sorted alignment index
  • lane.chr.dedup.bam - marked duplicate PCR elements
  • lane.chr.dedup.metrics - metrics describing deduplication
  • lane.chr.realigned.bam - realigned based on known indels
  • lane.chr.matefixed.bam - fixed the mate pair ends
  • lane.chr.covariate_table.csv - table of countcovariates output for recalibration
  • lane.chr.recal.bam - alignment map with recalibrated quality scores
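The naming scheme in the output list above can be sketched as a small helper. This is only an illustration: the function name `lane_chr_outputs`, the dictionary keys, and the sample values `lane3` and `chr1` are hypothetical and do not come from the pipeline itself.

```python
def lane_chr_outputs(lane, chrom):
    """Return the per-lane, per-chromosome output file names listed above."""
    base = f"{lane}.{chrom}"  # every output shares the "<lane>.<chr>" prefix
    return {
        "sai_1": f"{base}.1.sai",
        "sai_2": f"{base}.2.sai",
        "sam": f"{base}.sam",
        "bam": f"{base}.bam",
        "sorted_bam": f"{base}.sorted.bam",
        "sorted_bai": f"{base}.sorted.bai",
        "dedup_bam": f"{base}.dedup.bam",
        "dedup_metrics": f"{base}.dedup.metrics",
        "realigned_bam": f"{base}.realigned.bam",
        "matefixed_bam": f"{base}.matefixed.bam",
        "covariate_table": f"{base}.covariate_table.csv",
        "recal_bam": f"{base}.recal.bam",
    }


# Hypothetical lane and chromosome identifiers, for illustration only.
for key, name in lane_chr_outputs("lane3", "chr1").items():
    print(f"{key}: {name}")
```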

align

Align each end of the paired-end reads separately.

tool: bwa aln
inputs: chr.fasta, lane.1.fq.gz, lane.2.fq.gz
outputs: lane.chr.1.sai, lane.chr.2.sai
docs: http://bio-bwa.sourceforge.net/bwa.shtml
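As a rough sketch, the step above could be expressed as two `bwa aln` invocations, one per read end. The helper name `bwa_aln_commands` is hypothetical, the commands are only built and printed (not executed), and the sketch assumes the reference has already been indexed with `bwa index`.

```python
def bwa_aln_commands(reference, fastq_1, fastq_2, sai_1, sai_2):
    """Build one `bwa aln` argument list per read end (hypothetical helper)."""
    # -f names the .sai output file; without it bwa aln writes to stdout.
    return [
        ["bwa", "aln", "-f", sai_1, reference, fastq_1],
        ["bwa", "aln", "-f", sai_2, reference, fastq_2],
    ]


for argv in bwa_aln_commands("chr.fasta", "lane.1.fq.gz", "lane.2.fq.gz",
                             "lane.chr.1.sai", "lane.chr.2.sai"):
    print(" ".join(argv))
```

The two commands are independent of each other, so they could be run in parallel if the scheduler allows it.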

align-pe

Align the two ends together as pairs.

tool: bwa sampe
inputs: chr.fasta, lane.1.fq.gz, lane.2.fq.gz, lane.chr.1.sai, lane.chr.2.sai
outputs: lane.chr.sam
docs: http://bio-bwa.sourceforge.net/bwa.shtml
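A minimal sketch of the pairing step, assuming the argument order documented in the bwa manual (reference, the two .sai files, then the two FASTQ files). The helper name `bwa_sampe_command` is hypothetical and the command is only built, not run.

```python
def bwa_sampe_command(reference, sai_1, sai_2, fastq_1, fastq_2, out_sam):
    """Build the `bwa sampe` argument list that pairs the two ends."""
    # -f names the SAM output file; without it bwa sampe writes to stdout.
    return ["bwa", "sampe", "-f", out_sam,
            reference, sai_1, sai_2, fastq_1, fastq_2]


argv = bwa_sampe_command("chr.fasta", "lane.chr.1.sai", "lane.chr.2.sai",
                         "lane.1.fq.gz", "lane.2.fq.gz", "lane.chr.sam")
print(" ".join(argv))
```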

sam-to-bam

Convert sam to bam

tool: samtools view
inputs: lane.chr.sam
outputs: lane.chr.bam
docs: http://samtools.sourceforge.net/samtools.shtml
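The conversion above might look like the following sketch. The helper name `sam_to_bam_command` is hypothetical. The `-S` flag marks the input as SAM for older samtools releases; to my knowledge newer releases detect the format and accept the flag only for compatibility, so check the installed version.

```python
def sam_to_bam_command(in_sam, out_bam):
    """Build the `samtools view` argument list that converts SAM to BAM."""
    # -b: write BAM, -S: input is SAM (needed by old releases), -o: output file.
    return ["samtools", "view", "-b", "-S", "-o", out_bam, in_sam]


print(" ".join(sam_to_bam_command("lane.chr.sam", "lane.chr.bam")))
```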

(Question: can this not index and sort?)

sam-sort

Sort bam file on coordinate

tool: samtools sort
inputs: lane.chr.bam
outputs: lane.chr.sorted.bam
docs: http://samtools.sourceforge.net/samtools.shtml
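The sort step can be sketched as below. The calling convention of `samtools sort` changed between releases: the 0.1.x series takes an output prefix, while 1.x takes `-o` with a full file name. The helper name `samtools_sort_command` and its `legacy` switch are hypothetical and exist only to show both forms.

```python
def samtools_sort_command(in_bam, out_bam, legacy=False):
    """Build a coordinate-sort command for either samtools calling convention."""
    if legacy:
        # samtools 0.1.x takes an output prefix and appends ".bam" itself.
        prefix = out_bam[:-len(".bam")] if out_bam.endswith(".bam") else out_bam
        return ["samtools", "sort", in_bam, prefix]
    # samtools 1.x names the output file explicitly with -o.
    return ["samtools", "sort", "-o", out_bam, in_bam]


print(" ".join(samtools_sort_command("lane.chr.bam", "lane.chr.sorted.bam")))
print(" ".join(samtools_sort_command("lane.chr.bam", "lane.chr.sorted.bam", legacy=True)))
```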

sam-index

Index bam file for quicker access

tool: samtools index
inputs: lane.chr.sorted.bam
outputs: lane.chr.sorted.bai
docs: http://samtools.sourceforge.net/samtools.shtml
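A small sketch of the indexing call. By default `samtools index` writes the index next to the input as `<input>.bai`; because this workflow names the index `lane.chr.sorted.bai`, the sketch passes the output name as the optional second argument. The helper name `samtools_index_command` is hypothetical.

```python
def samtools_index_command(sorted_bam, out_index=None):
    """Build the `samtools index` argument list, optionally naming the index."""
    argv = ["samtools", "index", sorted_bam]
    if out_index is not None:
        # Optional second positional argument: the index file to write.
        argv.append(out_index)
    return argv


print(" ".join(samtools_index_command("lane.chr.sorted.bam", "lane.chr.sorted.bai")))
```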

MarkDuplicates

Mark duplicate PCR fragments to be filtered in analysis

tool: MarkDuplicates.jar
inputs: lane.chr.sorted.bam
outputs: lane.chr.dedup.bam, lane.chr.dedup.metrics
docs: http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates
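A hedged sketch of the MarkDuplicates call, using the `KEY=value` argument style of the per-tool Picard jars. The helper name `mark_duplicates_command` and the jar location are assumptions; adjust the path to the local Picard installation.

```python
def mark_duplicates_command(picard_jar, in_bam, out_bam, metrics_file):
    """Build the Picard MarkDuplicates argument list (KEY=value style)."""
    return [
        "java", "-jar", picard_jar,
        f"INPUT={in_bam}",              # coordinate-sorted BAM from sam-sort
        f"OUTPUT={out_bam}",            # BAM with duplicate reads flagged
        f"METRICS_FILE={metrics_file}", # duplication metrics report
    ]


# "MarkDuplicates.jar" is a placeholder path, not a known install location.
argv = mark_duplicates_command("MarkDuplicates.jar", "lane.chr.sorted.bam",
                               "lane.chr.dedup.bam", "lane.chr.dedup.metrics")
print(" ".join(argv))
```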

IndelRealigner-KnownsOnly

Improve the alignment using known indel information (will reduce false SNP calls)

tool: GenomeAnalysisTK.jar -T IndelRealigner
inputs: lane.chr.dedup.bam, genome.chr.realign.intervals, genome.chr.dbsnpXYZ.rod, genome.chr.indelsXYZ.vcf
outputs: lane.chr.realigned.bam
docs: http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Running_the_Indel_Realigner_only_at_known_sites
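The realignment call could be sketched as follows. The flags `-T`, `-R`, `-I`, `-targetIntervals` and `-o` are written as I recall them from the GATK 1.x documentation and should be verified against the installed release. How the dbSNP ROD and the known-indel VCF are passed changed between GATK releases, so this sketch deliberately takes those arguments as a caller-supplied list instead of guessing them. The helper name `indel_realigner_command` is hypothetical.

```python
def indel_realigner_command(gatk_jar, reference, in_bam, intervals,
                            out_bam, known_site_args=()):
    """Build a GATK IndelRealigner argument list (flags assumed from GATK 1.x)."""
    argv = [
        "java", "-jar", gatk_jar,
        "-T", "IndelRealigner",
        "-R", reference,
        "-I", in_bam,
        "-targetIntervals", intervals,
        "-o", out_bam,
    ]
    # Release-specific arguments for the known indels and dbSNP go here.
    argv.extend(known_site_args)
    return argv


argv = indel_realigner_command("GenomeAnalysisTK.jar", "chr.fasta",
                               "lane.chr.dedup.bam",
                               "genome.chr.realign.intervals",
                               "lane.chr.realigned.bam")
print(" ".join(argv))
```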

FixMateInformation

Fix the mate-pair information, which the realignment can leave inconsistent.

tool: FixMateInformation.jar
inputs: lane.chr.realigned.bam
outputs: lane.chr.matefixed.bam
docs: http://picard.sourceforge.net/command-line-overview.shtml#FixMateInformation, http://www.broadinstitute.org/gsa/wiki/index.php/Local_realignment_around_indels#Fixing_Mate_Pairs
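A sketch of the mate-fixing call in the Picard `KEY=value` style. Setting `SORT_ORDER=coordinate` asks Picard to write coordinate-sorted output; whether the pipeline wants that here is an assumption on my part. The helper name `fix_mate_command` and the jar path are hypothetical.

```python
def fix_mate_command(picard_jar, in_bam, out_bam, sort_order="coordinate"):
    """Build the Picard FixMateInformation argument list."""
    return [
        "java", "-jar", picard_jar,
        f"INPUT={in_bam}",
        f"OUTPUT={out_bam}",
        f"SORT_ORDER={sort_order}",  # assumption: keep output coordinate-sorted
    ]


argv = fix_mate_command("FixMateInformation.jar", "lane.chr.realigned.bam",
                        "lane.chr.matefixed.bam")
print(" ".join(argv))
```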

CountCovariates

Count covariates, such as machine cycle and base position, to be used as the basis for quality recalibration. Optionally, plot the results to PDF using AnalyzeCovariates.

tool: GenomeAnalysisTK.jar -T CountCovariates, AnalyzeCovariates.jar
inputs: lane.chr.matefixed.bam, genome.chr.dbsnpXYZ.rod
outputs: lane.chr.covariate_table.csv
docs: http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#CountCovariates, http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#AnalyzeCovariates.jar
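The counting step might be sketched as below. The `-recalFile` and `-cov` flags and the four covariate names are as I recall them from the GATK 1.x documentation; treat them as assumptions and check the installed release. The argument for the known-variant (dbSNP) file differed between releases, so it is left to a caller-supplied list. The helper name `count_covariates_command` is hypothetical.

```python
# Covariate set assumed from the GATK 1.x recalibration documentation.
DEFAULT_COVARIATES = (
    "ReadGroupCovariate",
    "QualityScoreCovariate",
    "CycleCovariate",
    "DinucCovariate",
)


def count_covariates_command(gatk_jar, reference, in_bam, recal_table,
                             covariates=DEFAULT_COVARIATES,
                             known_site_args=()):
    """Build a GATK CountCovariates argument list (flags assumed from GATK 1.x)."""
    argv = ["java", "-jar", gatk_jar, "-T", "CountCovariates",
            "-R", reference, "-I", in_bam, "-recalFile", recal_table]
    for name in covariates:
        argv += ["-cov", name]  # one -cov flag per covariate
    # Release-specific argument(s) pointing at the dbSNP file go here.
    argv.extend(known_site_args)
    return argv


argv = count_covariates_command("GenomeAnalysisTK.jar", "chr.fasta",
                                "lane.chr.matefixed.bam",
                                "lane.chr.covariate_table.csv")
print(" ".join(argv))
```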

TableRecalibration

Recalibrate quality scores based on the covariate table

tool: GenomeAnalysisTK.jar -T TableRecalibration
inputs: lane.chr.matefixed.bam, lane.chr.covariate_table.csv, chr.fasta
outputs: lane.chr.recal.bam
docs: http://www.broadinstitute.org/gsa/wiki/index.php/Base_quality_score_recalibration#TableRecalibration
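A sketch of the recalibration call, assuming the covariate table written by CountCovariates is the table consumed here. The flags are as I recall them from the GATK 1.x documentation and should be checked against the installed release. The helper name `table_recalibration_command` is hypothetical.

```python
def table_recalibration_command(gatk_jar, reference, in_bam,
                                recal_table, out_bam):
    """Build a GATK TableRecalibration argument list (flags assumed from GATK 1.x)."""
    return [
        "java", "-jar", gatk_jar,
        "-T", "TableRecalibration",
        "-R", reference,
        "-I", in_bam,
        "-recalFile", recal_table,  # table produced by CountCovariates
        "-o", out_bam,
    ]


argv = table_recalibration_command("GenomeAnalysisTK.jar", "chr.fasta",
                                   "lane.chr.matefixed.bam",
                                   "lane.chr.covariate_table.csv",
                                   "lane.chr.recal.bam")
print(" ".join(argv))
```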

Repeat: sam-sort, sam-index, countcovariates

See steps above for commands and docs.

inputs: lane.chr.recal.bam
outputs: lane.chr.recal.sorted.bam, lane.chr.recal.sorted.bam.bai, lane.chr.recal.covariate_table.csv
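With all steps listed, the file hand-offs can be checked mechanically. The sketch below records each step's inputs and outputs as written on this page and reports any input that no earlier step or workflow input provides. It assumes that `chr.fasta` in the step tables is the same file as `genome.chr.fasta` in the workflow inputs; the names `STEPS`, `WORKFLOW_INPUTS` and `find_missing_inputs` are hypothetical.

```python
WORKFLOW_INPUTS = {
    "lane.1.fq.gz", "lane.2.fq.gz", "genome.chr.fasta",
    "genome.chr.realign.intervals", "genome.chr.dbsnpXYZ.rod",
    "genome.chr.indelsXYZ.vcf",
}

FA, FQ1, FQ2 = "genome.chr.fasta", "lane.1.fq.gz", "lane.2.fq.gz"

# (step name, input files, output files), in execution order.
STEPS = [
    ("align", [FA, FQ1, FQ2], ["lane.chr.1.sai", "lane.chr.2.sai"]),
    ("align-pe", [FA, FQ1, FQ2, "lane.chr.1.sai", "lane.chr.2.sai"],
     ["lane.chr.sam"]),
    ("sam-to-bam", ["lane.chr.sam"], ["lane.chr.bam"]),
    ("sam-sort", ["lane.chr.bam"], ["lane.chr.sorted.bam"]),
    ("sam-index", ["lane.chr.sorted.bam"], ["lane.chr.sorted.bai"]),
    ("MarkDuplicates", ["lane.chr.sorted.bam"],
     ["lane.chr.dedup.bam", "lane.chr.dedup.metrics"]),
    ("IndelRealigner", ["lane.chr.dedup.bam", "genome.chr.realign.intervals",
                        "genome.chr.dbsnpXYZ.rod", "genome.chr.indelsXYZ.vcf"],
     ["lane.chr.realigned.bam"]),
    ("FixMateInformation", ["lane.chr.realigned.bam"],
     ["lane.chr.matefixed.bam"]),
    ("CountCovariates", ["lane.chr.matefixed.bam", "genome.chr.dbsnpXYZ.rod"],
     ["lane.chr.covariate_table.csv"]),
    ("TableRecalibration", ["lane.chr.matefixed.bam",
                            "lane.chr.covariate_table.csv", FA],
     ["lane.chr.recal.bam"]),
    ("repeat", ["lane.chr.recal.bam"],
     ["lane.chr.recal.sorted.bam", "lane.chr.recal.sorted.bam.bai",
      "lane.chr.recal.covariate_table.csv"]),
]


def find_missing_inputs(steps, workflow_inputs):
    """Return (step, file) pairs whose file is not available when the step runs."""
    available = set(workflow_inputs)
    missing = []
    for name, inputs, outputs in steps:
        missing += [(name, f) for f in inputs if f not in available]
        available.update(outputs)
    return missing


print(find_missing_inputs(STEPS, WORKFLOW_INPUTS))
```

A misspelled file name in any step's inputs shows up in the returned list, which makes this kind of check useful before wiring the steps into a scheduler.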

Discussion:

Why do we need to sort and index after recalibration? Does it mess up the order of things?
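One way to start investigating this is to look at the sort order a file declares in its header. The sketch below parses the `SO` tag of the `@HD` line from header text, such as the output of `samtools view -H`. It only reports what the header claims, not whether the records are truly in order, and the sample header is made up for illustration. The helper name `declared_sort_order` is hypothetical.

```python
def declared_sort_order(header_text):
    """Return the SO value of the @HD header line, or None if not declared."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[len("SO:"):]
    return None


# Made-up header for illustration only.
example_header = "@HD\tVN:1.0\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"
print(declared_sort_order(example_header))
```

If the recalibrated BAM still declares `coordinate`, the extra sort may be redundant, but that would need to be confirmed on real output.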