5 | | # Removed all bad samples (taken from Mathieu). |
6 | | #* <pre>p-link --bfile GvNL_250111 --remove GvNL_bad_samples.txt --make-bed --out GvNL_good_samples</pre> |
7 | | # Update family information |
8 | | #* <pre>p-link --bfile GvNL_samples --update-ids GvNL_family_update.txt --make-bed --out GvNL_good_fam</pre> |
9 | | # Set trio relationships |
10 | | #* <pre>p-link --bfile GvNL_good_fam --update-parents GvNL_parents_update.txt --make-bed --out GvNL_good_fam_par</pre> |
11 | | # Manual check for suspicious individuals (not in trios) and remove them |
12 | | #* <pre>p-link --bfile GvNL_good_fam_par --remove GvNL_suspicious.txt --make-bed --out GvNL_good_fam_par_susp</pre> |
13 | | #* After checking suspicious individuals with Mathieu, turns out it was 1 typo and 1 filtered low call rate (93%). Therefore, only the low callrate individual was screened out after all. |
14 | | # Check and update sex information |
15 | | ## PLINK sex check |
16 | | ##* <pre>p-link --bfile GvNL_good_fam_par_susp --check-sex --make-bed</pre> |
17 | | ## Crosscheck with information from Genome Studio provided by Mathieu |
18 | | ##* 100% match between plink and Genome Studio when sex information is within plink threshold |
19 | | ##* 2 pairs of parents swapped |
20 | | ##* 1 individual tagged as male instead of female by both Genome Studio and plink. Tagged as suspicious in fail-sexcheck-qc.txt |
21 | | ## Update sex information for children and swapped individuals |
22 | | ##* <pre>p-link --bfile GvNL_good_fam_par_susp --update-sex GvNL_sex_update.txt --make-bed --out GvNL_raw_final</pre> |
23 | | # Identification of individuals with elevated missing data rates or outlying heterozygosity rate (See Anderson, NP2010, p.1569) => How does this compare with the PLINK suggested approach [[http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml here]] |
24 | | ## Get missing data information: <pre>p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final</pre> |
25 | | ## Get heterozygocity information: <pre>p-link --bfile GvNL_raw_final --het --out GvNL_raw_final</pre> |
26 | | ## Calculate the observed heterozygosity rate per individual using the formula (N(NM) − O(Hom))/N(NM) and plot the missing SNPs vs heterozygocity rate for eyeball inspection. |
27 | | ##* In R: <pre>het <- read.table("GvNL_raw_final.het", head=TRUE) mis <- read.table("GvNL_raw_final.imiss", head=TRUE) het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM." plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")</pre> |
28 | | ## No samples with callrate < 95% (Filtered in lab) |
29 | | ## Put samples where heterozygocity rate is outside 3 standard dev from the mean (Anderson, NP2010, p.1569) in fail-het-3sd-qc.txt, along with their deviation from the mean |
30 | | ##* In R: <pre>het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE))) het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE) write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)</pre> |
31 | | ##* A total of 5 individuals failed. Worst failure has a distance of -3.74sd from the mean. |
32 | | # Check inheritance and duplicates |
33 | | ## Prune SNPs for LD |
34 | | ##* <pre>p-link --bfile GvNL_raw_final --indep-pairwise 50 5 0.2 --out GvNL_raw_final</pre> |
35 | | ## Generate pairwise Identity-By-State (IBS) metrics using pruned SNPs only |
36 | | ##*<pre>p-link --bfile GvNL_raw_final --extract GvNL_raw_final.prune.in --genome --out GvNL_raw_final</pre> |
37 | | ## Check trio inheritance based on IBS: Check the Pi_HAT value for the individuals (unrelated individuals =~ 0, parent <-> child =~ 0.5, siblings =~0.5, twins =~ 1.0). Done with homemade perl script. |
38 | | ##* 2 Families are not real families (G34c (child) is wrong and A56b (mother) is wrong) |
39 | | ##* Duplicated family: R24 == R25 |
40 | | ## Check trio inheritance based on Mendelian segregation |
41 | | ##* <pre> p-link --file GvNL_raw_final --me 0.05 0.1 --make-bed --out GvNL_inheritance</pre> |
42 | | ##* A diff between GvNL_raw_final.fam and GvNL_inheritance.fam confirms IBS version: families G34 and A56 have been removed. |
43 | | }}} |
| 6 | Update family information |
| 7 | {{{p-link --bfile GvNL_samples --update-ids GvNL_family_update.txt --make-bed --out GvNL_good_fam}}} |
| 8 | |
| 9 | Set trio relationships |
| 10 | {{{p-link --bfile GvNL_good_fam --update-parents GvNL_parents_update.txt --make-bed --out GvNL_good_fam_par}}} |
| 11 | |
| 12 | Manual check for suspicious individuals (not in trios) and remove them |
| 13 | {{{<pre>p-link --bfile GvNL_good_fam_par --remove GvNL_suspicious.txt --make-bed --out GvNL_good_fam_par_susp}}} |
| 14 | |
| 15 | NB: After checking suspicious individuals with Mathieu, turns out it was 1 typo and 1 filtered low call rate (93%). Therefore, only the low callrate individual was screened out after all. |
| 16 | |
| 17 | Check and update sex information |
| 18 | {{{p-link --bfile GvNL_good_fam_par_susp --check-sex --make-bed}}} |
| 19 | |
| 20 | Crosscheck with information from Genome Studio provided by Mathieu |
| 21 | * 100% match between plink and Genome Studio when sex information is within plink threshold |
| 22 | * 2 pairs of parents swapped |
| 23 | * 1 individual tagged as male instead of female by both Genome Studio and plink. Tagged as suspicious in fail-sexcheck-qc.txt |
| 24 | |
| 25 | Update sex information for children and swapped individuals |
| 26 | {{{<pre>p-link --bfile GvNL_good_fam_par_susp --update-sex GvNL_sex_update.txt --make-bed --out GvNL_raw_final}}} |
| 27 | |
| 28 | Identification of individuals with elevated missing data rates or outlying heterozygosity rate (See Anderson, NP2010, p.1569) => How does this compare with the PLINK suggested approach [[http://pngu.mgh.harvard.edu/~purcell/plink/thresh.shtml here]] |
| 29 | |
| 30 | Get missing data information: {{{p-link --bfile GvNL_raw_final --missing --out GvNL_raw_final}}} |
| 31 | |
| 32 | Get heterozygocity information: {{{p-link --bfile GvNL_raw_final --het --out GvNL_raw_final}}} |
| 33 | |
| 34 | Calculate the observed heterozygosity rate per individual using the formula (N(NM) − O(Hom))/N(NM) and plot the missing SNPs vs heterozygocity rate for eyeball inspection. |
| 35 | * In R: {{{het <- read.table("GvNL_raw_final.het", head=TRUE) mis <- read.table("GvNL_raw_final.imiss", head=TRUE) het$HET_RATE = (het$"N.NM." - het$"O.HOM.")/het$"N.NM." plot (mis$F_MISS,het$HET_RATE, xlab="Missing SNPs", ylab="Heterozygocity Rate")}}} |
| 36 | |
| 37 | No samples with callrate < 95% (Filtered in lab) |
| 38 | |
| 39 | Put samples where heterozygocity rate is outside 3 standard dev from the mean (Anderson, NP2010, p.1569) in fail-het-3sd-qc.txt, along with their deviation from the mean |
| 40 | |
| 41 | In R: {{{het_fail = subset(het, (het$HET_RATE < mean(het$HET_RATE)-3*sd(het$HET_RATE)) | (het$HET_RATE > mean(het$HET_RATE)+3*sd(het$HET_RATE))) het_fail$HET_DST = (het_fail$HET_RATE-mean(het$HET_RATE))/sd(het$HET_RATE) write.table(het_fail, "fail-het-qc.txt", row.names=FALSE)}}} |
| 42 | |
| 43 | Result: |
| 44 | * A total of 5 individuals failed. Worst failure has a distance of -3.74sd from the mean. |
| 45 | |
| 46 | Check inheritance and duplicates |
| 47 | * Prune SNPs for LD |
| 48 | {{{p-link --bfile GvNL_raw_final --indep-pairwise 50 5 0.2 --out GvNL_raw_final}}} |
| 49 | * Generate pairwise Identity-By-State (IBS) metrics using pruned SNPs only |
| 50 | {{{<pre>p-link --bfile GvNL_raw_final --extract GvNL_raw_final.prune.in --genome --out GvNL_raw_final}}} |
| 51 | * Check trio inheritance based on IBS: Check the Pi_HAT value for the individuals (unrelated individuals =~ 0, parent <-> child =~ 0.5, siblings =~0.5, twins =~ 1.0). Done with homemade perl script. |
| 52 | * 2 Families are not real families (G34c (child) is wrong and A56b (mother) is wrong) |
| 53 | * Duplicated family: R24 == R25 |
| 54 | Check trio inheritance based on Mendelian segregation |
| 55 | {{{p-link --file GvNL_raw_final --me 0.05 0.1 --make-bed --out GvNL_inheritance}}} |
| 56 | |
| 57 | A diff between GvNL_raw_final.fam and GvNL_inheritance.fam confirms IBS version: families G34 and A56 have been removed. |