wiki:GwasQcPipeline

Version 6 (modified by Morris Swertz, 14 years ago) (diff)

--

GWAS QC pipeline

Removed all bad samples (taken from Mathieu). p-link --bfile GvNL_250111 --remove GvNL_bad_samples.txt --make-bed --out GvNL_good_samples

Update family information p-link --bfile GvNL_samples --update-ids GvNL_family_update.txt --make-bed --out GvNL_good_fam

Set trio relationships p-link --bfile GvNL_good_fam --update-parents GvNL_parents_update.txt --make-bed --out GvNL_good_fam_par

Manual check for suspicious individuals (not in trios) and remove them <pre>p-link --bfile GvNL_good_fam_par --remove GvNL_suspicious.txt --make-bed --out GvNL_good_fam_par_susp

NB: After checking suspicious individuals with Mathieu, turns out it was 1 typo and 1 filtered low call rate (93%). Therefore, only the low callrate individual was screened out after all.

Check and update sex information p-link --bfile GvNL_good_fam_par_susp --check-sex --make-bed

Crosscheck with information from Genome Studio provided by Mathieu

  • 100% match between plink and Genome Studio when sex information is within plink threshold
  • 2 pairs of parents swapped
  • 1 individual tagged as male instead of female by both Genome Studio and plink. Tagged as suspicious in fail-sexcheck-qc.txt

Update sex information for children and swapped individuals <pre>p-link --bfile GvNL_good_fam_par_susp --update-sex GvNL_sex_update.txt --make-bed --out GvNL_raw_final

Identification of individuals with elevated missing data rates or outlying heterozygosity rate (See Anderson, NP2010, p.1569) => How does this compare with the PLINK suggested approach http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml here

Get missing data information: p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final Get heterozygocity information: p-link --bfile GvNL_raw_final --het --out GvNL_raw_final

Calculate the observed heterozygosity rate per individual using the formula (N(NM) − O(Hom))/N(NM) and plot the missing SNPs vs heterozygocity rate for eyeball inspection.

  • In R: het <- read.table("GvNL_raw_final.het", head=TRUE)&#10;mis <- read.table("GvNL_raw_final.imiss", head=TRUE)&#10;het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM."&#10;plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")

No samples with callrate < 95% (Filtered in lab)

Put samples where heterozygocity rate is outside 3 standard dev from the mean (Anderson, NP2010, p.1569) in fail-het-3sd-qc.txt, along with their deviation from the mean

In R: het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE)))&#10;het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE)&#10;write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)

Result:

  • A total of 5 individuals failed. Worst failure has a distance of -3.74sd from the mean.

Check inheritance and duplicates

  • Prune SNPs for LD

p-link --bfile GvNL_raw_final --indep-pairwise 50 5 0.2 --out GvNL_raw_final

  • Generate pairwise Identity-By-State (IBS) metrics using pruned SNPs only

<pre>p-link --bfile GvNL_raw_final --extract GvNL_raw_final.prune.in --genome --out GvNL_raw_final

  • Check trio inheritance based on IBS: Check the Pi_HAT value for the individuals (unrelated individuals =~ 0, parent <-> child =~ 0.5, siblings =~0.5, twins =~ 1.0). Done with homemade perl script.
  • 2 Families are not real families (G34c (child) is wrong and A56b (mother) is wrong)
  • Duplicated family: R24 == R25

Check trio inheritance based on Mendelian segregation p-link --file GvNL_raw_final --me 0.05 0.1 --make-bed --out GvNL_inheritance

A diff between GvNL_raw_final.fam and GvNL_inheritance.fam confirms IBS version: families G34 and A56 have been removed.