= Data QC based on mix-up mapping and concordance of imputed genotypes with genotypes called from RNAseq data =

We used 3 ways of doing the QC:
 1. mix-up mapper: matching genotypes with expression for each sample;
 1. genotype concordance: calculating the concordance of imputed genotypes with genotypes called from RNAseq data;
 1. heterozygosity rate.

The blacklist of samples that do not pass these quality checks can be found in the attachment.

=== LLS ===
MixupMapper detected 5 swaps and 7 samples with wrong genotype. The swaps will be performed and the 7 genotypes replaced. This leaves 7 samples (pers_id: 2014, 3142, 3144, 2634, 2890, 3126 and 3150) without genotype; these should be removed.

||=geno_id =||=run_id =||=Best Match (geno_id) =||=Best Match (run_id) =||= Action =||
|| 561 ||BD2CPRACXX-1-12 || 563 ||BD2CPRACXX-1-21 || swap ||
|| 563 ||BD2CPRACXX-1-21 || 561 ||BD2CPRACXX-1-12 || swap ||
|| 974 ||BD24PGACXX-8-25 || 978 ||BD24PGACXX-7-8 || swap ||
|| 978 ||BD24PGACXX-7-8 || 974 ||BD24PGACXX-8-25 || swap ||
||1841 ||AD2CJPACXX-6-9 ||1842 ||AD2CJPACXX-5-1 || swap ||
||1842 ||AD2CJPACXX-5-1 ||1841 ||AD2CJPACXX-6-9 || swap ||
||2585 ||AD2DATACXX-3-21 ||3273 ||AD2DATACXX-3-22 || swap ||
||3273 ||AD2DATACXX-3-22 ||2585 ||AD2DATACXX-3-21 || swap ||
||3411 ||BD2D5MACXX-3-7 ||3413 ||BD2D5MACXX-4-15 || swap ||
||3413 ||BD2D5MACXX-4-15 ||3411 ||BD2D5MACXX-3-7 || swap ||
||2928 ||AD2DATACXX-8-1 ||2014 ||BD2CPRACXX-1-22 || replace genotype ||
||3126 ||AD1NFNACXX-8-25 ||3142 ||BD1NYRACXX-2-15 || replace genotype ||
||3142 ||BD1NYRACXX-2-15 ||3144 ||AD1NAMACXX-7-19 || replace genotype ||
||3194 ||AD2DATACXX-4-5 ||2634 ||AD2DATACXX-4-9 || replace genotype ||
|| 311 ||BD1NW4ACXX-7-13 ||2890 ||BD1NYRACXX-5-23 || replace genotype ||
|| 905 ||AD1NFNACXX-8-27 ||3126 ||AD1NFNACXX-8-25 || replace genotype ||
||6039 ||AD1NE2ACXX-5-22 ||3150 ||BD24PGACXX-5-5 || replace genotype ||

==== Possibly contaminated samples ====
The outliers that show high heterozygosity rate in genotypes
called from RNA-seq.[[BR]][[BR]]
Also present in gender-specific analysis (see below):[[BR]]
BC1KBKACXX-5-6[[BR]]
BD1NW4ACXX-8-5[[BR]]
BD1NYRACXX-2-16[[BR]]
BC1KBKACXX-5-3[[BR]]
BC1KBKACXX-5-1[[BR]]
BD1NYRACXX-2-27[[BR]]
BD1NYRACXX-4-19[[BR]]
BC1KBKACXX-5-4[[BR]][[BR]]
Possible gender-neutral contaminations:[[BR]]
BC1KBKACXX-3-12[[BR]]
BC1KBKACXX-5-7[[BR]]
BD24PGACXX-7-10[[BR]]
BC1KBKACXX-5-5[[BR]]
BD1NYRACXX-3-1[[BR]]
BC1KBKACXX-5-2[[BR]]
AD1NFNACXX-4-8[[BR]]
BD1NYRACXX-2-18[[BR]]

=== LifeLines ===
http://www.molgenis.org/wiki/DeepNoteworthyObservations

LLDeep_0063: the corresponding RNA-seq sample, AC1C40ACXX-4-4 (old id: 103001429206), has only 76% of reads aligned. It was flagged by !MixupMapper as a sample mix-up and also shows many discordant genotypes when using SNVMix.

LLDeep_0350: the corresponding RNA-seq sample, AD1GWFACXX-4-15 (old id: 103001383279), was not flagged by !MixupMapper. However, it shows many discordant genotypes when using SNVMix and has both high XIST and high chromosome Y expression levels.

Average heterozygosity for all samples = 49%, stdev = 1.9%.
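
To put these cohort statistics in perspective, a heterozygosity rate can be expressed as a z-score against the reported mean (49%) and standard deviation (1.9%). The following is only a minimal sketch: the function names and the cutoff of 3 standard deviations are assumptions made for illustration, since this page does not state which cutoff was applied. The value of 72% is the rate reported for sample LLDeep_0350.

```python
# Cohort statistics as reported on this page (percent heterozygous calls).
COHORT_MEAN = 49.0
COHORT_SD = 1.9


def heterozygosity_z(rate, cohort_mean=COHORT_MEAN, cohort_sd=COHORT_SD):
    """Return how many standard deviations `rate` lies from the cohort mean."""
    return (rate - cohort_mean) / cohort_sd


def is_heterozygosity_outlier(rate, cutoff=3.0):
    """Flag a sample whose rate deviates more than `cutoff` SDs from the mean.

    The cutoff of 3.0 is a hypothetical choice, not taken from the QC itself.
    """
    return abs(heterozygosity_z(rate)) > cutoff


# Rate reported for LLDeep_0350 (72%).
z = heterozygosity_z(72.0)
print(f"z-score: {z:.1f}, outlier: {is_heterozygosity_outlier(72.0)}")
```

Under these assumptions a rate of 72% lies roughly 12 standard deviations above the mean, so it would be flagged by any reasonable cutoff.
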
Sample LLDeep_0350 (103001383279) has a heterozygosity rate of 72%, which indicates a contaminated sample in which a male and a female sample have likely been mixed in very similar proportions, hence the high expression levels of both XIST and chromosome Y genes.[[BR]][[BR]]
The file with genotype concordance and heterozygosity rates on imputed genotypes can be found [raw-attachment:genotype_concordance_heterozygosity_rate_imputed_RS_CODAM_LLS.xlsx here].[[BR]]

=== CODAM ===
'''eQTL mapping (gene level) results:''' 6804 unique cis-regulated genes.[[BR]][[BR]]
'''Samples that failed the QC:'''[[BR]]
2345 (RNA-seq ids: AD10W1ACXX-8-11, CODAM-102-130804): mix-up mapper + genotype concordance;[[BR]]
2495 (RNA-seq ids: AD10W1ACXX-5-18, CODAM-156-130804): mix-up mapper + genotype concordance.[[BR]]
It looks like the RNA-seq sample ids were swapped for these two samples (see: http://www.bbmriwiki.nl/wiki/BIOS_QualityControl/BIOS_QualityControlRun1 of 12-December-2013).[[BR]]
The file with genotype concordance and heterozygosity rates on imputed genotypes can be found [raw-attachment:genotype_concordance_heterozygosity_rate_imputed_RS_CODAM_LLS.xlsx here].[[BR]]

=== RS ===
'''eQTL mapping (gene level) results:''' 7708 unique cis-regulated genes.[[BR]][[BR]]
'''Samples that failed the QC:'''[[BR]]
8190002 (RNA-seq ids: AD1NNNACXX-4-18, RS-287-130804): mix-up mapper + genotype concordance;[[BR]]
9353 (AC1JV9ACXX-1-13, RS-761-130804): mix-up mapper + genotype concordance;[[BR]]
3520 (BC1JTJACXX-6-7, RS-442-130804): genotype concordance;[[BR]]
562 (BC1KAVACXX-8-13, RS-55-130804): genotype concordance + heterozygosity rate;[[BR]]
~~6734 (RS-502-130804): genotype concordance + heterozygosity rate;~~ (passed QC in the first run data)[[BR]]
The file with genotype concordance and heterozygosity rates on imputed genotypes can be found [raw-attachment:genotype_concordance_heterozygosity_rate_imputed_RS_CODAM_LLS.xlsx here].[[BR]]
[[BR]]

= Data QC based on median correlations of gene counts from each sample to all other samples =
[[BR]]
Samples
with much lower median correlations to all other samples [[BR]]
For methods see: http://www.bbmriwiki.nl/wiki/gene_exon_transcript_count [[BR]]
AC1JV9ACXX.1.10 0.0471[[BR]]
AD1NE2ACXX.5.22 0.1174[[BR]]
AD2D8RACXX.3.3 0.8028[[BR]]
AD2D8RACXX.6.3 0.8093[[BR]]
AD2D8RACXX.1.3 0.8257[[BR]]
[[BR]]

= Outliers to be removed based on QC stats and PC analysis =
[[BR]]
Updated: 12-December-2013[[BR]]
Analysis by: Peter-Bram 't Hoen[[BR]]
Too few reads:[[BR]]
AC1JV9ACXX-1-10[[BR]]
AD1NE2ACXX-5-22[[BR]]
BD1NW4ACXX-3-27[[BR]]
Other reasons: See http://www.bbmriwiki.nl/wiki/BIOS_QualityControl/BIOS_QualityControlRun [[BR]]
BD1NYRACXX-6-10: too low percentage of mapped reads, outlier on principal components 1, 4, 5, 6[[BR]]
AD2CJPACXX-8-9: low exon correlation, outlier on principal components 1, 11, 14[[BR]]
BD1NR9ACXX-7-27: low percentage of mapped reads, outlier on principal component 4, likely degraded[[BR]]

= Outliers to be removed based on gender-specific expression analysis =
Updated: 12-December-2013[[BR]]
Analysis by: Peter-Bram 't Hoen[[BR]]
The normalized gene expression values (edgeR TMM method, expressed as cpm) for XIST and for the sum of all protein-coding Y-chromosomal genes were used to check for contamination between samples of different gender. The script can be found [raw-attachment:gender_analysis.r here]. In addition to LL sample AD1GWFACXX-4-15, the following samples (all from LLS) came up and appeared to be contaminated:[[BR]]
BC1KBKACXX-5-1[[BR]]
BC1KBKACXX-5-3[[BR]]
BC1KBKACXX-5-4[[BR]]
BC1KBKACXX-5-6[[BR]]
BC1KBKACXX-5-8[[BR]]
BD1NW4ACXX-8-5[[BR]]
BD1NYRACXX-2-16[[BR]]
BD1NYRACXX-2-27[[BR]]
BD1NYRACXX-4-19[[BR]]
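
The idea behind the gender-specific check described above can be sketched as follows: a sample with high XIST expression and, at the same time, high summed expression of Y-chromosomal genes is a candidate for a male/female mixture. This is an illustrative Python sketch and not the attached `gender_analysis.r` script; the function name, the cpm cutoffs and the example values are all hypothetical, and suitable cutoffs would have to be chosen from the actual distribution of the data.

```python
def classify_sex_signal(xist_cpm, y_sum_cpm, xist_cutoff=10.0, y_cutoff=10.0):
    """Classify a sample from XIST cpm and summed Y-chromosomal gene cpm.

    Both cutoffs are hypothetical placeholders, not values used in the QC.
    """
    high_xist = xist_cpm >= xist_cutoff
    high_y = y_sum_cpm >= y_cutoff
    if high_xist and high_y:
        return "possible contamination"
    if high_xist:
        return "female-like"
    if high_y:
        return "male-like"
    return "low signal"


# Made-up cpm values, for illustration only: (XIST cpm, summed Y cpm).
samples = {
    "sample_A": (150.0, 0.5),
    "sample_B": (0.2, 220.0),
    "sample_C": (90.0, 130.0),
}

flagged = [
    name
    for name, (xist, y_sum) in samples.items()
    if classify_sex_signal(xist, y_sum) == "possible contamination"
]
print(flagged)
```

With these made-up values only `sample_C` is flagged, because it is the only one with both signals above the assumed cutoffs.
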