Changes between Version 5 and Version 6 of ImputationPipeline

Oct 5, 2010 1:25:27 PM



TODO: describe the protocols here.
== Description from Harm-Jan ==
 2. Convert the dataset to TriTyper format if it is in ped+map format.
The step-2 command changed in v6 from positional arguments to named options:

old: java -Xmx4g -jar ImputationTool.jar pmtt $plinkLocation $trityperOutputLocation
new: java -Xmx4g -jar ImputationTool.jar --mode pmtt --in $plinkLocation --out $trityperOutputLocation
 3. Compare the dataset to be imputed against the reference dataset (for example HapMap2 release 24, also in TriTyper format), and remove any SNPs whose haplotypes differ from, or do not correlate with, the reference dataset. Also remove any SNP that is not present in the reference. Save the output as ped+map.
 * '''Batch effects caused by overrepresentation of a certain haplotype within an imputation batch''': for each batch of samples, Beagle estimates a best-fitting model to predict the genotypes of the missing SNPs, which depends on both the input data and the reference dataset. Cases and controls should therefore be randomly distributed across the batches. Another option is to use IMPUTE rather than Beagle, since it splits batches across parts of the genome instead of across samples.
 * '''Difference in source platform''': different platforms have different SNP content. When you impute datasets coming from different platforms, the resulting model, which is based on the input data, also differs. When associating traits in a GWAS meta-analysis, these differences may account for a platform-specific effect. We should therefore remove the SNPs that do not overlap between such platforms prior to imputation, and impute the samples after combining the datasets. This would remove such a platform bias, although it would also cause a large loss of available SNPs when the overlap between platforms is small. However, in my opinion, this problem is similar to the batch-effect problem and can possibly be resolved by randomizing the sample content of the batches: the model will then be fitted to the data that is available. In any case, the datasets that are used in a meta-analysis should be imputed together.
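The randomization suggested for the batch-effect problem can be sketched as follows. This is a minimal illustration; the function name, batch size, and fixed seed are assumptions, and the real pipeline may assign batches differently.

```python
import random


def make_batches(sample_ids, batch_size, seed=0):
    """Shuffle samples before cutting them into imputation batches, so that
    cases and controls end up spread roughly evenly across batches."""
    shuffled = list(sample_ids)            # copy; leave the input untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps runs reproducible
    return [shuffled[i:i + batch_size]
            for i in range(0, len(shuffled), batch_size)]
```

Because every batch is a random draw from the combined case/control pool, no single haplotype should dominate a batch by construction.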
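The platform-harmonization step described above (keeping only the SNPs present on every platform before joint imputation) could look roughly like this; the function and platform names are purely illustrative.

```python
def shared_snps(platform_snps):
    """platform_snps: dict of platform name -> iterable of SNP ids.
    Returns the sorted intersection: the SNPs safe to impute jointly."""
    id_sets = [set(ids) for ids in platform_snps.values()]
    if not id_sets:                # no platforms at all: nothing to keep
        return []
    return sorted(set.intersection(*id_sets))
```

As the text notes, the size of this intersection is the trade-off: the smaller the overlap between platforms, the more SNPs are lost before imputation.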