Changes between Version 6 and Version 7 of GwasQcPipeline
- Timestamp:
- Feb 10, 2011 5:32:47 PM (14 years ago)
Legend:
- Unmodified
- Added
- Removed
- Modified
-
GwasQcPipeline
v6 v7 1 1 = GWAS QC pipeline = 2 2 3 Removed all bad samples (taken from Mathieu). 3 Removed all bad samples (taken from Mathieu). 4 4 5 {{{p-link --bfile GvNL_250111 --remove GvNL_bad_samples.txt --make-bed --out GvNL_good_samples}}} 5 6 6 7 Update family information 8 7 9 {{{p-link --bfile GvNL_samples --update-ids GvNL_family_update.txt --make-bed --out GvNL_good_fam}}} 8 10 9 11 Set trio relationships 12 10 13 {{{p-link --bfile GvNL_good_fam --update-parents GvNL_parents_update.txt --make-bed --out GvNL_good_fam_par}}} 11 14 12 15 Manual check for suspicious individuals (not in trios) and remove them 16 13 17 {{{<pre>p-link --bfile GvNL_good_fam_par --remove GvNL_suspicious.txt --make-bed --out GvNL_good_fam_par_susp}}} 14 18 … … 16 20 17 21 Check and update sex information 22 18 23 {{{p-link --bfile GvNL_good_fam_par_susp --check-sex --make-bed}}} 19 24 … … 24 29 25 30 Update sex information for children and swapped individuals 31 26 32 {{{<pre>p-link --bfile GvNL_good_fam_par_susp --update-sex GvNL_sex_update.txt --make-bed --out GvNL_raw_final}}} 27 33 28 34 Identification of individuals with elevated missing data rates or outlying heterozygosity rate (See Anderson, NP2010, p.1569) => How does this compare with the PLINK suggested approach [[http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml here]] 29 35 30 Get missing data information: {{{p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final}}} 36 Get missing data information: 37 38 {{{p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final}}} 31 39 32 Get heterozygocity information: {{{p-link --bfile GvNL_raw_final --het --out GvNL_raw_final}}} 40 Get heterozygocity information: 41 42 {{{p-link --bfile GvNL_raw_final --het --out GvNL_raw_final}}} 33 43 34 44 Calculate the observed heterozygosity rate per individual using the formula (N(NM) − O(Hom))/N(NM) and plot the missing SNPs vs heterozygocity rate for eyeball inspection. 35 * In R: {{{het <- read.table("GvNL_raw_final.het", head=TRUE) mis <- read.table("GvNL_raw_final.imiss", head=TRUE) het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM." plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")}}} 45 * In R: 46 47 {{{het <- read.table("GvNL_raw_final.het", head=TRUE) mis <- read.table("GvNL_raw_final.imiss", head=TRUE) het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM." plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")}}} 36 48 37 49 No samples with callrate < 95% (Filtered in lab) … … 39 51 Put samples where heterozygocity rate is outside 3 standard dev from the mean (Anderson, NP2010, p.1569) in fail-het-3sd-qc.txt, along with their deviation from the mean 40 52 41 In R: {{{het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE))) het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE) write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)}}} 53 In R: 54 55 {{{het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE))) het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE) write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)}}} 42 56 43 57 Result: … … 46 60 Check inheritance and duplicates 47 61 * Prune SNPs for LD 62 48 63 {{{p-link --bfile GvNL_raw_final --indep-pairwise 50 5 0.2 --out GvNL_raw_final}}} 64 49 65 * Generate pairwise Identity-By-State (IBS) metrics using pruned SNPs only 66 50 67 {{{<pre>p-link --bfile GvNL_raw_final --extract GvNL_raw_final.prune.in --genome --out GvNL_raw_final}}} 68 51 69 * Check trio inheritance based on IBS: Check the Pi_HAT value for the individuals (unrelated individuals =~ 0, parent <-> child =~ 0.5, siblings =~0.5, twins =~ 1.0). Done with homemade perl script. 52 70 * 2 Families are not real families (G34c (child) is wrong and A56b (mother) is wrong) 53 71 * Duplicated family: R24 == R25 72 54 73 Check trio inheritance based on Mendelian segregation 74 55 75 {{{p-link --file GvNL_raw_final --me 0.05 0.1 --make-bed --out GvNL_inheritance}}} 56 76