This page describes the imputation pipelines developed by the GoNL Impute team. Contributions are welcome. For help with Trac wiki formatting, see http://trac.edgewall.org/wiki/WikiFormatting [[br]]
All scripts presented here are located in our SVN repository: http://www.bbmriwiki.nl/svn/imputation [[br]]
Minutes of our Team Calls: [[http://www.bbmriwiki.nl/wiki/Imputations/Minutes]]

== Contributors and Teams ==
 * UMC Groningen: Alexandros Kanterakis alexandros.kanterakis@gmail.com
== Study data ==
== Reference data ==
 * 1000 Genomes October 2011 release
 * GoNL pilot 3, 48 trios, 192 haplotypes
== Pre processing ==
=== Normalize beagle datasets ===
 * Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Normalize_beagle_datasets.ftl [[BR]]
 Takes a list of Beagle and marker files and applies the following checks:
 * Checks whether the alleles of each SNP are compatible between study and reference. If the incompatibility cannot be corrected by inverting (flipping) the SNP, the SNP is discarded.
 * Checks whether a SNP has null alleles. If so, the SNP is removed from the study data.
 * Checks whether two SNPs with the same reference code (rs) are at the same position.
 * Checks whether two SNPs at the same position have the same reference code (rs).
 * Checks whether a SNP in the study has MAF < MAF_minimum, HWE < HWE_minimum or CR < CR_minimum. If any of these criteria is met, the SNP is discarded. (MAF = Minor Allele Frequency, HWE = Hardy Weinberg Equilibrium, CR = Call Rate)
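The allele compatibility check can be illustrated with a minimal Python sketch. It is not taken from Normalize_beagle_datasets.ftl; the function name `classify_alleles` and the three return labels are hypothetical, and the sketch assumes that "inversion" means replacing each allele by its complement on the opposite strand.

```python
# Complement of each base on the opposite strand.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(study, reference):
    """Compare the allele pair of a study SNP with the reference pair.

    Returns "match", "inverted" (fixable by a strand flip) or "incompatible".
    The names and labels are illustrative, not those used by the script.
    """
    study_set = set(study)
    reference_set = set(reference)
    # Null or unknown allele codes cannot be flipped.
    if not all(allele in COMPLEMENT for allele in study_set | reference_set):
        return "incompatible"
    if study_set == reference_set:
        return "match"
    flipped = {COMPLEMENT[allele] for allele in study_set}
    if flipped == reference_set:
        return "inverted"
    return "incompatible"


print(classify_alleles(("T", "C"), ("A", "G")))
print(classify_alleles(("A", "C"), ("A", "G")))
```

Note that A/T and C/G SNPs read the same on both strands, so a comparison of allele codes alone cannot tell whether such a SNP needs flipping.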

It generates a log file with all inconsistencies found. At the end of this file there is a summary of the problems:
 * '''SNPs inverted''': For example, A/G SNPs in the reference that appear as T/C SNPs in the study.
 * '''Allele problems''': Number of SNPs with inconsistent alleles in study and reference that could not be fixed by flipping.
 * '''Position problems (different references, same loci)''': SNPs at the same position with different reference codes. These SNPs are NOT removed. We keep the reference code (rs number) of the reference panel.
 * '''Unresolved single alleles problems''': SNPs in the study that have only one allele. These SNPs are filtered out.
 * '''Double rs codes problems''': SNPs that share the same rs code. These SNPs are filtered out.
 * '''SNPs in study with MAF < MAF_minimum''': SNPs with MAF < MAF_minimum set. 
 * '''SNPs in study with HWE < HWE_minimum''': SNPs with HWE < HWE_minimum set.
 * '''SNPs in study with CR < CR_minimum''': SNPs with Call Rate < CR_minimum set
 * '''SNPs that differ in Allele Frequencies''': SNPs whose allele frequency (AF) differs between reference and study by more than the CR_minimum set.
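The three threshold filters can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's implementation: genotypes are coded as 0/1/2 copies of one allele with `None` for missing calls, "HWE" is taken to be the p-value of a one-degree-of-freedom chi-square test (the script may use a different test), and all function names are hypothetical.

```python
import math


def call_rate(genotypes):
    """Fraction of individuals with a non-missing genotype."""
    return sum(g is not None for g in genotypes) / len(genotypes)


def minor_allele_frequency(called):
    """MAF from called genotypes coded as 0/1/2 copies of allele B."""
    freq_b = sum(called) / (2 * len(called))
    return min(freq_b, 1 - freq_b)


def hwe_p_value(called):
    """Chi-square (1 df) p-value for Hardy Weinberg Equilibrium."""
    n = len(called)
    p = sum(called) / (2 * n)
    q = 1 - p
    expected = [n * q * q, 2 * n * p * q, n * p * p]
    observed = [called.count(0), called.count(1), called.count(2)]
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is not informative
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 df.
    return math.erfc(math.sqrt(chi2 / 2))


def failed_criteria(genotypes, maf_minimum, hwe_minimum, cr_minimum):
    """Return the criteria ("CR", "MAF", "HWE") that a SNP fails."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return ["CR"]
    failed = []
    if call_rate(genotypes) < cr_minimum:
        failed.append("CR")
    if minor_allele_frequency(called) < maf_minimum:
        failed.append("MAF")
    if hwe_p_value(called) < hwe_minimum:
        failed.append("HWE")
    return failed


print(failed_criteria([0] * 99 + [1], 0.01, 0.001, 0.95))
```

A SNP would be discarded whenever the returned list is non-empty, matching the "any of these criteria" rule described above.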
[[BR]]
Options:
 * input_beagle_study : The study in beagle format
 * input_beagle_reference : The reference in beagle format
 * input_markers_study : The study's markers in beagle format
 * input_markers_reference : The reference's markers in beagle format
 * output_beagle_study : The Normalized output of the study (Use this as "study" for imputation)
 * output_beagle_reference : The Normalized output of the reference (Normally you will not use this file)
 * output_markers_study : The Markers of the normalized study
 * output_markers_reference : The Markers of the normalized reference
 * output_log_filename : the log filename
== Imputation software ==
 * Impute2
 * Beagle 
 * MACH / minimac
== Quality metrics ==
=== Convert impute2 gprobs to TPED ===
 * Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Convert_impute2_gprobs_to_PEDMAP_beagle.ftl [[BR]]
This script converts the results of an impute2 imputation to TPED. You can define an R2 threshold; the R2 is the allelic R2 described in http://www.sciencedirect.com/science/article/pii/S0002929709000123#sec2.7.2 . To obtain a complete TPED / TFAM dataset, copy the TFAM from the original study. [[BR]]
 Options:
 * input_impute2_gprobs_filename : The gprobs file generated from impute2
 * output_TPED_filename : The output TPED filename
 * output_stats_filename : The file where the R2 estimates will be written. It will contain ALL the R2 values, not only those exceeding the threshold
 * chromosome : The chromosome of this study
 * r2_threshold : The R2 threshold
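The core of the conversion can be sketched in Python for a single line of input. The sketch assumes the usual impute2 genotype layout (SNP id, rs id, position, allele A, allele B, then three probabilities per individual for AA, AB and BB) and the TPED layout (chromosome, SNP id, genetic distance, position, then two alleles per individual). The function name is hypothetical, the R2 filter is left out, and taking the most probable genotype is an assumption about how calls are made.

```python
def gprobs_line_to_tped(line, chromosome):
    """Convert one impute2 gprobs line to one TPED line (best-guess calls)."""
    fields = line.split()
    rsid, position, allele_a, allele_b = fields[1:5]
    probabilities = [float(value) for value in fields[5:]]
    # Genotype strings in the same order as the probabilities: AA, AB, BB.
    genotype_strings = [
        f"{allele_a} {allele_a}",
        f"{allele_a} {allele_b}",
        f"{allele_b} {allele_b}",
    ]
    calls = []
    for start in range(0, len(probabilities), 3):
        triple = probabilities[start:start + 3]
        best = triple.index(max(triple))  # first maximum wins on ties
        calls.append(genotype_strings[best])
    # Genetic distance is unknown here, so it is written as 0.
    return " ".join([str(chromosome), rsid, "0", position] + calls)


example = "--- rs123 1000 A G 0.9 0.1 0 0 0.2 0.8"
print(gprobs_line_to_tped(example, 1))
```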

=== Statistics of imputation results ===
 * Location: http://www.bbmriwiki.nl/svn/Imputation/alex/scripts/Statistics_of_imputation_results.ftl
Computes several statistics of imputation results. This is useful when "real" genotype data are available to benchmark the imputation pipeline. The computed statistics are:
 * Allelic R2 : according to http://www.sciencedirect.com/science/article/pii/S0002929709000123#sec2.7.2
 * Real_Allelic_R2 : The R2 (or coefficient of determination) between the real and the imputed genotypes.
 * Imputation_Allele_Frequency and Standardized_allele_frequency_error : (From: http://www.sciencedirect.com/science/article/pii/S0002929709000123) Allele-frequency error is the difference between the true allele frequency in the sample and the estimated allele frequency in the sample computed from the posterior genotype probabilities. If the three posterior genotype probabilities for an individual are denoted pAA, pAB, and pBB, then the estimated A allele frequency is found by summing (2pAA + pAB) over all individuals and dividing by twice the number of individuals. However, allele-frequency error is difficult to interpret unless the true allele frequency and sample size are known. The standardized allele-frequency error is therefore computed as abs(p - q) / sqrt(p * (1 - p) / (2 * n)), where p is the allele frequency in the sample of n individuals from a population in Hardy-Weinberg equilibrium and q is the estimated allele frequency obtained from the imputed posterior genotype probabilities.
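The allele-frequency statistics above translate directly into a short Python sketch. The function names are hypothetical and not taken from Statistics_of_imputation_results.ftl. For Real_Allelic_R2 the sketch assumes the squared Pearson correlation between real and imputed allele dosages, which is one common reading of "R2" and may differ from what the script computes.

```python
import math


def estimated_allele_frequency(posteriors):
    """A allele frequency from (pAA, pAB, pBB) triples, one per individual."""
    total = sum(2 * p_aa + p_ab for p_aa, p_ab, _p_bb in posteriors)
    return total / (2 * len(posteriors))


def standardized_allele_frequency_error(p, q, n):
    """abs(p - q) / sqrt(p * (1 - p) / (2 * n)) for true p and estimated q."""
    return abs(p - q) / math.sqrt(p * (1 - p) / (2 * n))


def real_r2(real_dosages, imputed_dosages):
    """Squared Pearson correlation between real and imputed dosages."""
    n = len(real_dosages)
    mean_x = sum(real_dosages) / n
    mean_y = sum(imputed_dosages) / n
    cov = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(real_dosages, imputed_dosages))
    var_x = sum((x - mean_x) ** 2 for x in real_dosages)
    var_y = sum((y - mean_y) ** 2 for y in imputed_dosages)
    return cov ** 2 / (var_x * var_y)


posteriors = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 1, 0)]
q = estimated_allele_frequency(posteriors)
print(q, standardized_allele_frequency_error(0.5, q, len(posteriors)))
```

The standardized error is undefined when p is 0 or 1, and `real_r2` is undefined when either dosage vector is constant; a real implementation would need to handle such monomorphic SNPs explicitly.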
[[BR]]
Options:
 * input_beagle_dosage_filename : The output of the beagle imputation
 * input_beagle_unimputed_filename : The beagle file with the "real", un-imputed genotypes
 * output_filename : Output filename for the stats
== Complete pipelines ==
== Results ==
== References ==
 * Brian L. Browning, Sharon R. Browning. A Unified Approach to Genotype Imputation and Haplotype-Phase Inference for Large Data Sets of Trios and Unrelated Individuals. AJHG, Volume 84, Issue 2, 13 February 2009, Pages 210-223. doi:10.1016/j.ajhg.2009.01.005
 * http://www.plosgenetics.org/article/info%3Adoi%2F10.1371%2Fjournal.pgen.1000529 Impute2
 * http://www.biomedcentral.com/1471-2156/10/27 Accuracy of genome-wide imputation of untyped markers and impacts on statistical power for association studies
 * The effect of genome-wide association scan quality control on imputation outcome for common variants. QC on scan quality has minimal effect on imputation quality. http://www.nature.com/ejhg/journal/v19/n5/full/ejhg2010242a.html . See also Marchini: http://www.nature.com/ng/journal/v39/n7/full/ng2088.html , http://www.nature.com/nrg/journal/v11/n7/full/nrg2796.html
 * Hickey JM, Kinghorn BP, Tier B, Wilson JF, Dunstan N, van der Werf JH. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. Genet Sel Evol. 2011 Mar 10;43:12. http://www.ncbi.nlm.nih.gov/pubmed/21388557
 * The International HapMap Consortium. Integrating common and rare genetic variation in diverse human populations. Nature 467, 52-58. 2010. The HapMap Project
 * The use of imputation in GWAS studies: http://www.nature.com/ejhg/journal/v19/n2/full/ejhg2010157a.html?WT.i_dcsvid=%25%25LIST_ID%25%25-%25%25RECIPIENT_ID%25%25&WT.ec_id=MARKETING&WT.mc_id=EG1107CV030 Politopoulos I. et al. Genome-wide association of breast cancer: composite likelihood with imputed genotypes. European Journal of Human Genetics (2011) 19, 194–199.
 * Introduction of imputation? http://www.ncbi.nlm.nih.gov/pubmed/19165921 Detection of sharing by descent, long-range phasing and haplotype imputation.
 * We found that the difference between the unQCed data and the fully QCed data on imputation outcome was minimal. Our study shows that imputation of common variants is generally very accurate and robust to GWAS QC, which is not a major factor affecting imputation outcome. A minority of common-frequency SNPs with particular properties cannot be accurately imputed regardless of QC stringency. These findings may not generalise to the imputation of low frequency and rare variants. http://www.nature.com/ejhg/journal/v19/n5/full/ejhg2010242a.html . The effect of genome-wide association scan quality control on imputation outcome for common variants. European Journal of Human Genetics (2011) 19, 610–614.
== See also ==
 * An older version of the imputation pipeline, developed mainly by Harm-Jan and Lude Franke: [[ImputationPipeline_old]]. It uses the [[ImputationTool]] for study / reference normalization.
 * SVN repository: http://www.bbmriwiki.nl/svn/imputation
 * ProbABEL, an R package for GWAS data imputation: http://gettinggeneticsdone.blogspot.com/2010/04/probabel-r-package-for-gwas-data.html , http://www.biomedcentral.com/1471-2105/11/134 . GenABEL, an R library for genome-wide association analysis: http://mga.bionet.nsc.ru/~yurii/ABEL/GenABEL/
 * MACH: http://www.sph.umich.edu/csg/abecasis/MACH/
 * Impute2: http://mathgen.stats.ox.ac.uk/impute/impute_v2.html
 * GATK and Interface with BEAGLE imputation software. http://www.broadinstitute.org/gsa/wiki/index.php/Interface_with_BEAGLE_imputation_software
 * https://sites.google.com/site/hickeyjohn/alphaphase. A combined long-range phasing and long haplotype imputation method to impute phase for SNP genotypes. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3068938/?tool=pubmed
 * ParaHaplo 3.0: A program package for imputation and a haplotype-based whole-genome association study using hybrid parallel computing http://www.scfbm.org/content/6/1/10/abstract . A parallel version of ParaHaplo 3.0 can conduct genotype imputation 20 times faster than a non-parallel version of ParaHaplo.
 * HLA*IMP—an integrated framework for imputing classical HLA alleles from SNP genotypes. http://bioinformatics.oxfordjournals.org/content/27/7/968.short?rss=1