Changes between Version 5 and Version 6 of GwasQcPipeline


Ignore:
Timestamp:
Feb 10, 2011 5:31:50 PM (14 years ago)
Author:
Morris Swertz
Comment:

--

Legend:

Unmodified
Added
Removed
Modified
  • GwasQcPipeline

    v5 v6  
    11= GWAS QC pipeline =
    22
    3 (((#!mediawiki
     3Removed all bad samples (taken from Mathieu).
     4{{{p-link --bfile GvNL_250111 --remove GvNL_bad_samples.txt --make-bed --out GvNL_good_samples}}}
    45
    5 # Removed all bad samples (taken from Mathieu).
    6 #* <pre>p-link --bfile GvNL_250111 --remove GvNL_bad_samples.txt --make-bed --out GvNL_good_samples</pre>
    7 # Update family information
    8 #* <pre>p-link --bfile GvNL_samples --update-ids GvNL_family_update.txt  --make-bed --out GvNL_good_fam</pre>
    9 # Set trio relationships
    10 #* <pre>p-link --bfile GvNL_good_fam --update-parents GvNL_parents_update.txt  --make-bed --out GvNL_good_fam_par</pre>
    11 # Manual check for suspicious individuals (not in trios) and remove them
    12 #* <pre>p-link --bfile GvNL_good_fam_par --remove GvNL_suspicious.txt  --make-bed --out GvNL_good_fam_par_susp</pre>
    13 #* After checking suspicious individuals with Mathieu, turns out it was 1 typo and 1 filtered low call rate (93%). Therefore, only the low callrate individual was screened out after all.
    14 # Check and update sex information
    15 ## PLINK sex check
    16 ##* <pre>p-link --bfile GvNL_good_fam_par_susp --check-sex --make-bed</pre>
    17 ## Crosscheck with information from Genome Studio provided by Mathieu
    18 ##* 100% match between plink and Genome Studio when sex information is within plink threshold
    19 ##* 2 pairs of parents swapped
    20 ##* 1 individual tagged as male instead of female by both Genome Studio and plink. Tagged as suspicious in fail-sexcheck-qc.txt
    21 ## Update sex information for children and swapped individuals
    22 ##* <pre>p-link --bfile GvNL_good_fam_par_susp --update-sex GvNL_sex_update.txt --make-bed --out GvNL_raw_final</pre>
    23 # Identification of individuals with elevated missing data rates or outlying heterozygosity rate (See Anderson, NP2010, p.1569) => How does this compare with the PLINK suggested approach  [[http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml here]]
    24 ## Get missing data information: <pre>p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final</pre>
    25 ## Get heterozygocity information: <pre>p-link --bfile GvNL_raw_final --het --out GvNL_raw_final</pre>
    26 ## Calculate the observed heterozygosity rate per individual using the formula (N(NM)  −  O(Hom))/N(NM) and plot the missing SNPs vs heterozygocity rate for eyeball inspection.
    27 ##* In R: <pre>het <- read.table("GvNL_raw_final.het", head=TRUE)&#10;mis <- read.table("GvNL_raw_final.imiss", head=TRUE)&#10;het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM."&#10;plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")</pre>
    28 ## No samples with callrate < 95% (Filtered in lab)
    29 ## Put samples where heterozygocity rate is outside 3 standard dev from the mean (Anderson, NP2010, p.1569) in fail-het-3sd-qc.txt, along with their deviation from the mean
    30 ##* In R: <pre>het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE)))&#10;het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE)&#10;write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)</pre>
    31 ##* A total of 5 individuals failed. Worst failure has a distance of -3.74sd from the mean.
    32 # Check inheritance and duplicates
    33 ## Prune SNPs for LD
    34 ##* <pre>p-link --bfile GvNL_raw_final --indep-pairwise 50 5 0.2 --out GvNL_raw_final</pre>
    35 ## Generate pairwise Identity-By-State (IBS) metrics using pruned SNPs only
    36 ##*<pre>p-link --bfile GvNL_raw_final --extract GvNL_raw_final.prune.in --genome --out GvNL_raw_final</pre>
    37 ## Check trio inheritance based on IBS: Check the Pi_HAT value for the individuals (unrelated individuals =~ 0, parent <-> child =~ 0.5, siblings =~0.5, twins =~ 1.0). Done with homemade perl script.
    38 ##* 2 Families are not real families (G34c (child) is wrong and A56b (mother) is wrong)
    39 ##* Duplicated family: R24 == R25
    40 ## Check trio inheritance based on Mendelian segregation
    41 ##* <pre> p-link --file GvNL_raw_final --me 0.05 0.1 --make-bed --out GvNL_inheritance</pre>
    42 ##* A diff between GvNL_raw_final.fam and GvNL_inheritance.fam confirms IBS version: families G34 and A56 have been removed.
    43 }}}
     6Update family information
     7{{{p-link --bfile GvNL_samples --update-ids GvNL_family_update.txt  --make-bed --out GvNL_good_fam}}}
     8
     9Set trio relationships
     10{{{p-link --bfile GvNL_good_fam --update-parents GvNL_parents_update.txt  --make-bed --out GvNL_good_fam_par}}}
     11
     12Manual check for suspicious individuals (not in trios) and remove them
     13{{{<pre>p-link --bfile GvNL_good_fam_par --remove GvNL_suspicious.txt  --make-bed --out GvNL_good_fam_par_susp}}}
     14
     15NB: After checking suspicious individuals with Mathieu, turns out it was 1 typo and 1 filtered low call rate (93%). Therefore, only the low callrate individual was screened out after all.
     16
     17Check and update sex information
     18{{{p-link --bfile GvNL_good_fam_par_susp --check-sex --make-bed}}}
     19
     20Crosscheck with information from Genome Studio provided by Mathieu
     21* 100% match between plink and Genome Studio when sex information is within plink threshold
     22* 2 pairs of parents swapped
     23* 1 individual tagged as male instead of female by both Genome Studio and plink. Tagged as suspicious in fail-sexcheck-qc.txt
     24
     25Update sex information for children and swapped individuals
     26{{{<pre>p-link --bfile GvNL_good_fam_par_susp --update-sex GvNL_sex_update.txt --make-bed --out GvNL_raw_final}}}
     27
     28Identification of individuals with elevated missing data rates or outlying heterozygosity rate (See Anderson, NP2010, p.1569) => How does this compare with the PLINK suggested approach  [[http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml here]]
     29
     30Get missing data information: {{{p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final}}}
     31 
     32Get heterozygocity information: {{{p-link --bfile GvNL_raw_final --het --out GvNL_raw_final}}}
     33
     34Calculate the observed heterozygosity rate per individual using the formula (N(NM)  −  O(Hom))/N(NM) and plot the missing SNPs vs heterozygocity rate for eyeball inspection.
     35* In R: {{{het <- read.table("GvNL_raw_final.het", head=TRUE)&#10;mis <- read.table("GvNL_raw_final.imiss", head=TRUE)&#10;het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM."&#10;plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")}}}
     36
     37No samples with callrate < 95% (Filtered in lab)
     38
     39Put samples where heterozygocity rate is outside 3 standard dev from the mean (Anderson, NP2010, p.1569) in fail-het-3sd-qc.txt, along with their deviation from the mean
     40
     41In R: {{{het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE)))&#10;het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE)&#10;write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)}}}
     42
     43Result:
     44* A total of 5 individuals failed. Worst failure has a distance of -3.74sd from the mean.
     45
     46Check inheritance and duplicates
     47* Prune SNPs for LD
     48{{{p-link --bfile GvNL_raw_final --indep-pairwise 50 5 0.2 --out GvNL_raw_final}}}
     49* Generate pairwise Identity-By-State (IBS) metrics using pruned SNPs only
     50{{{<pre>p-link --bfile GvNL_raw_final --extract GvNL_raw_final.prune.in --genome --out GvNL_raw_final}}}
     51* Check trio inheritance based on IBS: Check the Pi_HAT value for the individuals (unrelated individuals =~ 0, parent <-> child =~ 0.5, siblings =~0.5, twins =~ 1.0). Done with homemade perl script.
     52* 2 Families are not real families (G34c (child) is wrong and A56b (mother) is wrong)
     53* Duplicated family: R24 == R25
     54Check trio inheritance based on Mendelian segregation
     55{{{p-link --file GvNL_raw_final --me 0.05 0.1 --make-bed --out GvNL_inheritance}}}
     56
     57A diff between GvNL_raw_final.fam and GvNL_inheritance.fam confirms IBS version: families G34 and A56 have been removed.