23 | | Header line: |
24 | | |
25 | | CHRC POSC SNPC STRANDC A1C A2C CHRV POSV SNPV A1V A2V |
26 | | |
27 | | Each subsequent line should contain 11 tab-delimited values, one per header column. There should be no missing values. |
28 | | * CHR[C/V]: chromosome on which the SNP is located, according to Chip [C] and VCF [V] build versions <integer from 1 to 22 for autosomes, “X” for X-chromosome, and “Y” for Y-chromosome> |
29 | | * POS[C/V]: chromosomal position, according to Chip [C] and VCF [V] build versions <integer> |
30 | | * SNP[C/V]: SNP rs-name, according to Chip [C] and VCF [V] dbSNP versions <alphanumeric> |
31 | | * STRANDC: strand in chip annotation <single character, either “+” or “-”> |
32 | | * A1C: first allele in chip annotation <single character, either “A”, “C”, “G” or “T”> |
33 | | * A2C: second allele in chip annotation <single character, either “A”, “C”, “G” or “T”> |
34 | | * A1V: (translated) first chip allele (A1C) according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
35 | | * A2V: (translated) second chip allele (A2C) according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
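The allele translation described by A1V and A2V can be sketched as below. This is a minimal illustration, not the pipeline's implementation: it assumes STRANDC is given relative to the same reference build as the VCF, so that translation is a plain complement for "-" strand SNPs. The function names `to_plus_strand` and `translate_alleles` are hypothetical.

```python
# Complement map used to translate "-" strand alleles to the "+" strand.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def to_plus_strand(allele: str, strand: str) -> str:
    """Return the allele as read on the "+" strand.

    Assumption: the strand is reported relative to the VCF build;
    strand flips introduced by a build change are not handled here.
    """
    if allele not in COMPLEMENT:
        raise ValueError(f"unexpected allele: {allele!r}")
    if strand == "+":
        return allele
    if strand == "-":
        return COMPLEMENT[allele]
    raise ValueError(f"unexpected strand: {strand!r}")


def translate_alleles(a1c: str, a2c: str, strandc: str) -> tuple:
    """Map (A1C, A2C) to (A1V, A2V), keeping the allele order."""
    return to_plus_strand(a1c, strandc), to_plus_strand(a2c, strandc)
```

Raising on unexpected alleles or strands keeps the "no missing values" rule visible: a row that cannot be translated is reported rather than silently written.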
36 | | |
37 | | Questions: |
38 | | * Q: which build and dbSNP versions are used by the chip and by the VCF? |
39 | | * Q: how many SNPs changed name in the VCF build? |
40 | | * Q: how many SNPs changed strand in the VCF build? |
41 | | * Q: please provide a 2x2 table (name change/not) x (strand change/not) |
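The 2x2 table asked for above could be counted directly from the conversion table, for example as follows. This is a sketch under two assumptions: a name change means SNPC differs from SNPV, and a strand change means the chip allele pair differs from the translated pair. The rows in `EXAMPLE` are invented for illustration only.

```python
import csv
import io
from collections import Counter

HEADER = ["CHRC", "POSC", "SNPC", "STRANDC", "A1C", "A2C",
          "CHRV", "POSV", "SNPV", "A1V", "A2V"]

# Hypothetical rows, not real annotation data.
EXAMPLE = "\n".join("\t".join(fields) for fields in [
    HEADER,
    ["1", "1000", "rs1", "+", "A", "G", "1", "1100", "rs1", "A", "G"],
    ["2", "2000", "rs2", "-", "A", "G", "2", "2100", "rs2", "T", "C"],
    ["X", "3000", "rs3", "-", "C", "T", "X", "3100", "rs30", "G", "A"],
])


def name_by_strand_table(tsv_text: str) -> Counter:
    """Count rows keyed by (name_changed, strand_changed)."""
    counts = Counter()
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        name_changed = row["SNPC"] != row["SNPV"]
        # Assumption: alleles differ exactly when the strand was flipped.
        strand_changed = (row["A1C"], row["A2C"]) != (row["A1V"], row["A2V"])
        counts[(name_changed, strand_changed)] += 1
    return counts
```

Because the function reads only the column names from the header, it does not depend on which chip platform produced the table.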
42 | | |
43 | | Note that in the future, samples typed on a number of different chip platforms will be coming in. The step above must therefore not assume that a particular chip is used! |
44 | | |
45 | | == UPDATED CHIP GENOTYPES == |
46 | | |
47 | | Using the translation table described above, generate the updated chip genotypes file (name: chip_genotypes_yyyy.mm.dd.txt) |
48 | | |
49 | | This is a tab-delimited text file containing a table. The header line is |
50 | | |
51 | | ID SNPV QUALCHIP A1VCHIP A2VCHIP GTCHIP |
52 | | |
53 | | Each subsequent line should contain 6 tab-delimited values, one per header column. Use “.” (dot) for missing. |
54 | | * ID: sample ID (genotyped individual’s code) <alphanumeric> |
55 | | * SNPV: SNP rs-name, according to VCF dbSNP version <alphanumeric> |
56 | | * QUALCHIP: calling quality for the individual genotype |
57 | | * A1VCHIP: first allele of the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
58 | | * A2VCHIP: second allele of the personal genotype, translated according to “+” strand on VCF build <single character, either “A”, “C”, “G” or “T”> |
59 | | * GTCHIP: genotype with alleles in alphabetic order <two characters, each either “A”, “C”, “G” or “T”> |
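A minimal sketch of how GTCHIP could be derived from the two translated alleles. The function name `ordered_genotype` is hypothetical, and returning "." when either allele is missing is an assumption based on the "use dot for missing" rule.

```python
VALID_ALLELES = set("ACGT")


def ordered_genotype(a1: str, a2: str, missing: str = ".") -> str:
    """Return the two alleles joined in alphabetic order, e.g. T,C -> "CT"."""
    if a1 == missing or a2 == missing:
        # Assumption: a genotype with any missing allele is itself missing.
        return missing
    if a1 not in VALID_ALLELES or a2 not in VALID_ALLELES:
        raise ValueError(f"unexpected alleles: {a1!r}, {a2!r}")
    return "".join(sorted((a1, a2)))
```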
60 | | |
61 | | Questions: |
62 | | * Q: do all SNPs in chip data have rs-number? |
63 | | * Q: what alleles are observed in chip data? Only A/T/G/C? |
64 | | * Q: are all SNPs bi-allelic? |
65 | | |
66 | | == EXTRACTION OF CHIP SNPS FROM VCF FILE == |
67 | | |
68 | | From VCF, extract only lines containing SNPs also observed in the chip (see SNPV column of “chip_data_conversion_table_yyyy.mm.dd.txt”) |
69 | | |
70 | | Parse extracted lines, and arrange “Annotation” and “Genotypic” tables |
71 | | |
72 | | Annotation table (name: VCF_annotation_yyyy.mm.dd.txt). Tab-delimited file with a header line, built by extracting the following columns from the VCF: |
73 | | |
74 | | CHROM POS ID REF ALT QUAL FILTER INFO |
75 | | |
76 | | At the beginning of the file, add meta-info from VCF file |
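The extraction and annotation steps above can be sketched as follows. The sketch assumes standard VCF layout: meta-info lines start with "##", the column header starts with "#", and the ID is the third tab-delimited field and may hold several identifiers separated by ";". The function name `annotation_rows` is hypothetical.

```python
def annotation_rows(vcf_lines, chip_snps):
    """Split VCF lines into meta-info lines and annotation rows.

    Returns (meta, rows): meta holds the "##" lines, rows holds the first
    eight columns (CHROM..INFO) of every record whose ID is in chip_snps.
    """
    meta, rows = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith("##"):
            meta.append(line)
        elif line and not line.startswith("#"):
            fields = line.split("\t")
            # The ID field may list several rs-names separated by ";".
            if any(snp in chip_snps for snp in fields[2].split(";")):
                rows.append(fields[:8])
    return meta, rows
```

Passing `chip_snps` as a set keeps the membership test fast when the VCF has many more records than the chip.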
77 | | |
78 | | Genotypic table (name: VCF_genotypes_yyyy.mm.dd.txt). Tab-delimited file containing following information. Header line: |
79 | | |
80 | | ID SNPV GTVCF GQ DP BATCH ???? |
81 | | |
82 | | Next lines should all contain XXX tab-delimited values. Use “.” (dot) for missing. |
83 | | * ID: sample ID (genotyped individual’s code) <alphanumeric> |
84 | | * SNPV: SNP rs-name, according to VCF dbSNP version <alphanumeric> |
85 | | * GTVCF: genotype with alleles in alphabetic order, <two characters, each either “A”, “C”, “G” or “T”>. This can be done by mapping the numbers provided in VCF GT field to REF and ALT and then ordering. |
86 | | * GQ, DP: directly from VCF file |
87 | | * BATCH … |
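The GT-to-alleles mapping described for GTVCF might look like this. It is a sketch under stated assumptions: calls that are missing, not diploid, or that involve anything other than single-base alleles are written as "."; the function name `gt_to_genotype` is hypothetical.

```python
import re


def gt_to_genotype(gt: str, ref: str, alt: str, missing: str = ".") -> str:
    """Map a VCF GT value such as "0/1" to an alphabetically ordered genotype."""
    alleles = [ref] + alt.split(",")  # index 0 is REF, 1.. are the ALT alleles
    called = []
    for index in re.split(r"[/|]", gt):  # "/" unphased, "|" phased
        if index == ".":
            return missing
        called.append(alleles[int(index)])
    # Assumption: only diploid calls of single-base alleles are reported.
    if len(called) != 2 or any(a not in ("A", "C", "G", "T") for a in called):
        return missing
    return "".join(sorted(called))
```

For example, `gt_to_genotype("0/1", "T", "C")` gives "CT", the same ordering convention as GTCHIP, so the two columns can be compared as plain strings.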
88 | | |
89 | | Merge the chip and VCF genotypic tables (“chip_genotypes_yyyy.mm.dd.txt” and “VCF_genotypes_yyyy.mm.dd.txt”) using ID and SNPV as key variables. Keep all chip genotypes, substituting missing (“.”) when no information is available from the VCF. Name the table “merged_chip_and_VCF_genotypes_yyyy.mm.dd.txt”. |
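The merge is a left join on (ID, SNPV) that keeps every chip row. A minimal standard-library sketch, assuming both tables have already been read into lists of dictionaries keyed by column name; the function name and the default `vcf_fields` are illustrative, since the full VCF column list is not yet fixed.

```python
def merge_chip_and_vcf(chip_rows, vcf_rows,
                       vcf_fields=("GTVCF", "GQ", "DP"), missing="."):
    """Left-join VCF columns onto chip rows using (ID, SNPV) as the key."""
    vcf_by_key = {(row["ID"], row["SNPV"]): row for row in vcf_rows}
    merged = []
    for chip in chip_rows:
        vcf = vcf_by_key.get((chip["ID"], chip["SNPV"]), {})
        row = dict(chip)  # copy so the input rows are not modified
        for field in vcf_fields:
            row[field] = vcf.get(field, missing)
        merged.append(row)
    return merged
```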
90 | | |
91 | | Questions: |
92 | | * Q: What are the count and proportion of genotypes that do not match between GTCHIP and GTVCF? How much do these counts/proportions change when rows with QUALCHIP < X are dropped (vary X)? How much do they change when rows with GQ (DP) < X are dropped (vary X)? |
93 | | * Q: What is proportion of false-positive and false-negative findings in our study, if we do not take trio structure into account? |
94 | | * Q: Find out QC metrics thresholds maximizing specificity and sensitivity. |
95 | | |
96 | | Update the table with variable “CHIPVCFMISMATCH” (1 if mismatch, 0 for match, missing (“.”) if any is missing). |
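The CHIPVCFMISMATCH variable and the mismatch counts under a GQ threshold can be sketched as below. Two assumptions are made: rows with a missing flag are excluded from the proportion, and rows with a missing GQ are dropped whenever a threshold is applied. The function names are hypothetical.

```python
def mismatch_flag(gtchip: str, gtvcf: str, missing: str = ".") -> str:
    """Return "1" for a mismatch, "0" for a match, "." if either is missing."""
    if gtchip == missing or gtvcf == missing:
        return missing
    return "1" if gtchip != gtvcf else "0"


def mismatch_summary(rows, min_gq=None):
    """Return (mismatch count, mismatch proportion) over comparable rows."""
    compared = mismatches = 0
    for row in rows:
        if min_gq is not None:
            # Assumption: a row without GQ cannot pass a GQ threshold.
            if row["GQ"] == "." or float(row["GQ"]) < min_gq:
                continue
        flag = mismatch_flag(row["GTCHIP"], row["GTVCF"])
        if flag == ".":
            continue
        compared += 1
        mismatches += flag == "1"
    proportion = mismatches / compared if compared else float("nan")
    return mismatches, proportion
```

Calling `mismatch_summary` in a loop over several `min_gq` values gives the "vary X" curve; the same pattern would apply to DP or QUALCHIP.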
97 | | |
98 | | * Q: Using multiple logistic regression, explore which variables are significant predictors of mismatch. |
99 | | |
100 | | |
101 | | == CHIP SNPS MISSING FROM VCF == |
102 | | |
103 | | Write the list of chip SNPs not found in the VCF to the file “list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt” (single column containing the SNPV name). This should be done when matching chip SNPs with VCF SNPs (see section “EXTRACTION OF CHIP SNPS FROM VCF FILE”). |
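Finding the chip SNPs that are absent from the VCF is a set difference. The sketch below keeps the chip order and removes duplicates; the function name is hypothetical and the inputs are assumed to be plain iterables of SNPV names.

```python
def chip_snps_missing_from_vcf(chip_snps, vcf_snps):
    """Return chip SNP names not present in the VCF, in chip order."""
    in_vcf = set(vcf_snps)
    missing, already_listed = [], set()
    for snp in chip_snps:
        if snp not in in_vcf and snp not in already_listed:
            missing.append(snp)
            already_listed.add(snp)
    return missing
```

The length of the returned list answers the "how many variants do we miss" question directly.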
104 | | |
105 | | * Q: How many variants do we miss in VCF (how many SNPs in file list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt)? |
106 | | |
107 | | For each SNP in list_of_chip_snps_missing_in_VCF_yyyy.mm.dd.txt, based on chip_genotypes_yyyy.mm.dd.txt derive frequency from chip data and arrange the following table (name: annot_chip_snps_missing_in_VCF_yyyy.mm.dd.txt). The header line should contain |
108 | | |
109 | | SNPV A1V A2V FREQA1V |
110 | | |
111 | | Each next line should contain 4 values delimited by tab; SNPV, A1V, and A2V explained above (the same as in “chip_data_conversion_table_yyyy.mm.dd.txt” file). FREQA1V is a floating-point frequency of allele “A1V”. |
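FREQA1V can be derived by counting alleles over all non-missing chip genotypes of a SNP. A minimal sketch, assuming each genotype is given as an (A1VCHIP, A2VCHIP) pair and that a genotype with any missing allele is skipped; the function name is hypothetical.

```python
def allele_frequency(genotypes, a1v: str, missing: str = "."):
    """Return the frequency of allele a1v, or None if nothing was genotyped."""
    a1_count = total = 0
    for first, second in genotypes:
        if first == missing or second == missing:
            continue  # assumption: partially missing genotypes are ignored
        a1_count += (first == a1v) + (second == a1v)
        total += 2
    return a1_count / total if total else None
```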
112 | | |
113 | | * Q: Does the frequency distribution of missed variants match the one expected under the assumption that we miss variants at random because of the limited number of chromosomes and coverage (for each trio we read two chromosomes at 12x and two chromosomes at 24x)? |
114 | | |
| 25 | Automated workflow (will be) provided in ChipBasedQcPipelineWorkflow page. |