wiki:ImputationTool

Introduction

ImputationTool is a collection of methods for pre- and post-processing in imputation-related tasks.

Implementation

ImputationTool developers:

  • Dr. Lude Franke (Lude@…): Format design, Initial methods.
  • Harm-Jan Westra (harm-jan@…): Extensions, Format converters, SNPs checks.

It is written in Java and was developed in NetBeans.

Availability

Documentation

From the ImputationTool help screen:

ImputationTool v0.2


------------------------
PreProcessing
------------------------

# Create random batches of cases and controls from a TriTyper dataset. Creates a file called batches.txt in outdir.
--mode batch --in TriTyperdir --out outdir --size batchsize

------------------------
Imputation
------------------------

# Convert Impute Imputed data into TriTyper
--mode itt --in ImputeDir --out TriTyperDir

------------------------
Beagle
------------------------

# Convert beagle files (one file/chromosome) to TriTyper. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number.
--mode btt --in BeagleDir --tpl template --ext ext --out TriTyperDir [--fam famfile]

# Convert batches of beagle files (multiple files / chromosome) to trityper files. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
--mode bttb --in BeagleDir --tpl template --out TriTyperDir --size numbatches

------------------------
Ped+Map (Plink files)
------------------------

# Converts Ped and Map files created by ttpmh to Beagle format
--mode pmbg --in indir --batch-file batches.txt

# Converts TriTyper file to Plink Dosage format. Filetemplate is a template for the batch filenames, The text CHROMOSOME will be replaced by the chromosome number, BATCH by the batchname.
--mode ttpd --in indir --beagle beagledir --tpl template --batchdesc batchdescriptor --out outdir --fam famfile

# Converts PED and MAP files to TriTyper.
--mode pmtt --in Ped+MapDir --out TriTyperDir

# Converts TriTyper file to PED and MAP files. The FAM file is optional. --split splits the ped and map files per chromosome
--mode ttpm --in indir --out outdir [--fam famfile] [--split]

# Converts TriTyper dataset to Ped+Map concordant to reference (hap) dataset. Supply a batchfile if you want to export in batches. Supply a chromosome if you want to export a certain chromosome.
--mode ttpmh --in TriTyperDir --hap TriTyperReferenceDir --out outdir [--fam famfile] [--batch-file batchfile] [--chr chromosome] [--exclude fileName]

---------------------
PostProcessing
---------------------

# Correlates genotypes of imputed vs non-imputed datasets. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
--mode corr --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir [--snps snplist]

# Correlates genotypes of imputed vs non-imputed datasets. Also take Beagle imputation score (R2) into account. Saves a file called correlationOutput.txt in outdir, containing correlation per chromosome as well as correlation distribution.
--mode corrb --in TriTyperDir --name datasetname --in2 TriTyperDir2 --name2 datasetname2 --out outdir --beagle beagleDir --tpl template --size numBatches 

# Gets all the excluded snps from chrx.excludedsnps.txt with a certain call-rate threshold (0 < threshold < 1.0)
--mode ecra --in TriTyperDir --threshold threshold

# Generates R2 distribution (beagle quality score) for each batch and chromosome, and tests each batch against chromosome R2 distribution, using WilcoxonMannWhitney test
--mode r2dist --in BeagleDir --template template --out outdir --size numbatches

# Merge two TriTyper datasets
--mode merge --in TriTyper1Dir --in2 TriTyper2Dir --out outdir
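
The help text above describes filename templates in which the text CHROMOSOME is replaced by the chromosome number and BATCH by the batch name. The following is a minimal Python sketch of that substitution, not the tool's own code. The function name `expand_template` and the example template are hypothetical, and ImputationTool may handle details (such as the `--ext` extension) differently.

```python
from typing import Optional


def expand_template(template: str, chromosome: int, batch: Optional[str] = None) -> str:
    """Fill in the CHROMOSOME and (optionally) BATCH placeholders of a template."""
    name = template.replace("CHROMOSOME", str(chromosome))
    if batch is not None:
        name = name.replace("BATCH", batch)
    return name


# Hypothetical template and batch names, for illustration only.
template = "chrCHROMOSOME.batchBATCH"
for chrom in (1, 2):
    for batch in ("a", "b"):
        print(expand_template(template, chrom, batch))
```

With a template like this, one file name per chromosome and batch is produced, which matches the "multiple files / chromosome" layout described for the bttb mode.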

Example

A common scenario using ImputationTool is as follows. Suppose that we have the following directory structure:

  • study
    • study.ped
    • study.map
  • reference
    • reference.ped
    • reference.map

To impute the study against the reference with Beagle:

  • mkdir study_TriTyper
  • Convert study ped/map data to TriTyper:
    • java -jar ImputationTool.jar --mode pmtt --in study/ --out study_TriTyper
  • mkdir reference_TriTyper
  • Convert reference ped/map data to TriTyper:
    • java -jar ImputationTool.jar --mode pmtt --in reference/ --out reference_TriTyper
  • mkdir batches
  • Create batches of 300 samples of the study:
    • java -jar ImputationTool.jar --mode batch --in study_TriTyper --out batches/ --size 300
  • mkdir reference_analyzed
  • Export the reference to Ped+Map files (with some quality checks as well):
    • java -jar ImputationTool.jar --mode ttpmh --in reference_TriTyper --hap reference_TriTyper --out reference_analyzed
  • Convert the reference to Beagle format (repeat for the remaining chromosomes):
    • java -jar linkage2beagle.jar reference_analyzed/chr1.dat reference_analyzed/chr1.ped > reference_analyzed/chr1.bgl
  • mkdir study_reference_compare
  • Compare the study against the reference:
    • java -jar ImputationTool.jar --mode ttpmh --in study_TriTyper --hap reference_TriTyper --batch-file batches/batches.txt --out study_reference_compare
  • Convert the analyzed study to Beagle format:
    • java -jar linkage2beagle.jar study_reference_compare/chr1.dat study_reference_compare/chr1.ped > study_reference_compare/chr1.bgl
  • mkdir RESULTS
  • Run the imputation step (requires Beagle):
    • java -jar beagle.jar phased=reference_analyzed/chr1.bgl unphased=study_reference_compare/chr1.bgl markers=reference_analyzed/chr1.markersBeagleFormat missing=0 out=RESULTS/output
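
The walkthrough above shows the conversion and imputation commands for chromosome 1 only. The Python sketch below prints the same three commands for every chromosome so they can be reviewed or pasted into a shell script. It assumes chromosomes 1 to 22 and that the files for the other chromosomes follow the chr1 naming pattern. The per-chromosome `out=RESULTS/chrN` prefix is also an assumption, chosen to keep the outputs of different chromosomes apart. The function name `chromosome_commands` is hypothetical.

```python
from typing import List


def chromosome_commands(chromosome: int) -> List[str]:
    """Return the three walkthrough commands for one chromosome as shell strings."""
    ref = f"reference_analyzed/chr{chromosome}"
    study = f"study_reference_compare/chr{chromosome}"
    return [
        # Reference Ped+Map export -> Beagle format
        f"java -jar linkage2beagle.jar {ref}.dat {ref}.ped > {ref}.bgl",
        # Study Ped+Map export -> Beagle format
        f"java -jar linkage2beagle.jar {study}.dat {study}.ped > {study}.bgl",
        # Imputation; the out= prefix per chromosome is an assumption
        f"java -jar beagle.jar phased={ref}.bgl unphased={study}.bgl "
        f"markers={ref}.markersBeagleFormat missing=0 "
        f"out=RESULTS/chr{chromosome}",
    ]


if __name__ == "__main__":
    # Assumption: chromosomes 1-22; adjust the range to the data at hand.
    for chrom in range(1, 23):
        for command in chromosome_commands(chrom):
            print(command)
```

Printing the commands instead of running them keeps the sketch free of side effects. The output can be redirected to a file and executed once it has been checked.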

The TriTyper Format

TriTyper is a binary format to store genotype information, including insertion, deletion and expression data, providing very efficient read/write/seek methods.

Filtering

In ttpmh mode, ImputationTool applies the following filtering steps when comparing a study (GWAS) dataset to a reference dataset:

  • Assesses alleles and swaps the SNP if needed. Example: ref: C/T, GWAS: A/G --> the GWAS SNP needs to be swapped and inverted to become C/T.

  • Checks the Hardy-Weinberg equilibrium p-value, minor allele frequency (MAF) and call rate. The SNP is removed if the Hardy-Weinberg p-value <= 0.0001, MAF < 0.01 or call rate < 0.95.
  • Checks if the SNP is present in the reference data. If not, the SNP is removed from the GWAS data.
  • Checks if the SNP has null alleles. If so, the SNP is removed from the GWAS data.
  • Checks if the allele frequency is comparable to the reference. If not (>25% difference), the SNP is removed from the GWAS data.
  • Assesses if the haplotype structure is comparable between reference and GWAS data. This is performed by pairwise comparison of r-squared between SNPs in both reference and GWAS. For SNPs in LD (r-squared > 0.1), the allele frequencies are compared. SNPs are removed from the GWAS data when the major allele differs more often than it is identical.
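
Some of the per-SNP checks listed above can be illustrated with a short Python sketch. This is not ImputationTool's implementation. The function names are hypothetical, "swapped and inverted" is interpreted here as reversing the allele order and taking the strand complement, and the 25% frequency difference is assumed to be an absolute difference. The Hardy-Weinberg p-value, MAF and call rate are taken as precomputed inputs, and the haplotype structure check is not covered.

```python
from typing import Optional, Tuple

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}
Alleles = Tuple[str, str]


def align_to_reference(ref: Alleles, gwas: Alleles) -> Optional[str]:
    """Return the operation that makes the GWAS alleles match the reference.

    Returns None when no combination of swapping and complementing works.
    """
    swapped = (gwas[1], gwas[0])
    flipped = (COMPLEMENT[gwas[0]], COMPLEMENT[gwas[1]])
    flipped_swapped = (flipped[1], flipped[0])
    if gwas == ref:
        return "identical"
    if swapped == ref:
        return "swap"
    if flipped == ref:
        return "complement"
    if flipped_swapped == ref:
        return "complement+swap"
    return None


def passes_qc(hwe_p: float, maf: float, call_rate: float) -> bool:
    """Apply the thresholds from the list above (a failing SNP is removed)."""
    return hwe_p > 0.0001 and maf >= 0.01 and call_rate >= 0.95


def frequency_comparable(ref_freq: float, gwas_freq: float, max_diff: float = 0.25) -> bool:
    """Assumption: '>25% difference' means an absolute frequency difference."""
    return abs(ref_freq - gwas_freq) <= max_diff


# The example from the list: ref C/T, GWAS A/G.
print(align_to_reference(("C", "T"), ("A", "G")))
```

For the C/T versus A/G example, the complement of A/G is T/C, and reversing the order gives C/T, so both operations are needed.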

Last modified on Sep 19, 2011 5:12:30 PM