wiki:BIOS_SampleBlacklist

Version 1 (modified by jamverlouw, 8 years ago)

Data QC based on mix-up mapping and concordance of imputed genotypes with genotypes called from RNAseq data

We used three QC methods:

  1. mix-up mapper: matching genotypes with expression for each sample;
  2. genotype concordance: calculating the concordance of imputed genotypes with genotypes called from RNAseq data;
  3. heterozygosity rate.

The blacklist of samples that do not pass these quality checks can be found in the attachment.
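As an illustration of checks 2 and 3, the following minimal sketch computes a concordance and a heterozygosity rate for one sample. It is not the pipeline used for this QC. It assumes genotypes are encoded as alternative-allele dosages (0, 1, 2) with None for a missing call, and the function names and toy data are hypothetical.

```python
def concordance(imputed, rnaseq):
    """Fraction of sites with identical calls; sites missing in either set are skipped."""
    pairs = [(a, b) for a, b in zip(imputed, rnaseq)
             if a is not None and b is not None]
    if not pairs:
        return float("nan")
    return sum(a == b for a, b in pairs) / len(pairs)


def heterozygosity_rate(genotypes):
    """Fraction of called genotypes that are heterozygous (dosage 1)."""
    called = [g for g in genotypes if g is not None]
    if not called:
        return float("nan")
    return sum(g == 1 for g in called) / len(called)


# Toy genotypes for a single sample (hypothetical values)
imputed = [0, 1, 2, 1, 0, None, 2, 1]
rnaseq = [0, 1, 2, 0, 0, 1, None, 1]

print(f"concordance: {concordance(imputed, rnaseq):.3f}")
print(f"heterozygosity (RNA-seq calls): {heterozygosity_rate(rnaseq):.3f}")
```

A sample would then be compared against the distribution of these values over the whole cohort rather than against a fixed cut-off.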

LLS

MixupMapper detected 5 sample swaps and 7 samples with a wrong genotype. The swaps will be performed and the 7 genotypes replaced. This leaves 7 samples (pers_id: 2014, 3142, 3144, 2634, 2890, 3126 and 3150) without a genotype; these should be removed.

geno_id  run_id           Best Match (geno_id)  Best Match (run_id)  Action
561      BD2CPRACXX-1-12  563                   BD2CPRACXX-1-21      swap
563      BD2CPRACXX-1-21  561                   BD2CPRACXX-1-12      swap
974      BD24PGACXX-8-25  978                   BD24PGACXX-7-8       swap
978      BD24PGACXX-7-8   974                   BD24PGACXX-8-25      swap
1841     AD2CJPACXX-6-9   1842                  AD2CJPACXX-5-1       swap
1842     AD2CJPACXX-5-1   1841                  AD2CJPACXX-6-9       swap
2585     AD2DATACXX-3-21  3273                  AD2DATACXX-3-22      swap
3273     AD2DATACXX-3-22  2585                  AD2DATACXX-3-21      swap
3411     BD2D5MACXX-3-7   3413                  BD2D5MACXX-4-15      swap
3413     BD2D5MACXX-4-15  3411                  BD2D5MACXX-3-7       swap
2928     AD2DATACXX-8-1   2014                  BD2CPRACXX-1-22      replace genotype
3126     AD1NFNACXX-8-25  3142                  BD1NYRACXX-2-15      replace genotype
3142     BD1NYRACXX-2-15  3144                  AD1NAMACXX-7-19      replace genotype
3194     AD2DATACXX-4-5   2634                  AD2DATACXX-4-9       replace genotype
311      BD1NW4ACXX-7-13  2890                  BD1NYRACXX-5-23      replace genotype
905      AD1NFNACXX-8-27  3126                  AD1NFNACXX-8-25      replace genotype
6039     AD1NE2ACXX-5-22  3150                  BD24PGACXX-5-5       replace genotype
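
The table above can be sanity-checked with a short script. The sketch below is not part of the original analysis. It parses the rows as whitespace-separated text, confirms that every swap is listed in both directions, and collects the best-match ids of the "replace genotype" rows. Those ids happen to equal the seven ids listed in the text as left without a genotype; the text labels them pers_id while the table column is geno_id, so treating them as the same identifier is an assumption.

```python
TABLE = """\
561 BD2CPRACXX-1-12 563 BD2CPRACXX-1-21 swap
563 BD2CPRACXX-1-21 561 BD2CPRACXX-1-12 swap
974 BD24PGACXX-8-25 978 BD24PGACXX-7-8 swap
978 BD24PGACXX-7-8 974 BD24PGACXX-8-25 swap
1841 AD2CJPACXX-6-9 1842 AD2CJPACXX-5-1 swap
1842 AD2CJPACXX-5-1 1841 AD2CJPACXX-6-9 swap
2585 AD2DATACXX-3-21 3273 AD2DATACXX-3-22 swap
3273 AD2DATACXX-3-22 2585 AD2DATACXX-3-21 swap
3411 BD2D5MACXX-3-7 3413 BD2D5MACXX-4-15 swap
3413 BD2D5MACXX-4-15 3411 BD2D5MACXX-3-7 swap
2928 AD2DATACXX-8-1 2014 BD2CPRACXX-1-22 replace genotype
3126 AD1NFNACXX-8-25 3142 BD1NYRACXX-2-15 replace genotype
3142 BD1NYRACXX-2-15 3144 AD1NAMACXX-7-19 replace genotype
3194 AD2DATACXX-4-5 2634 AD2DATACXX-4-9 replace genotype
311 BD1NW4ACXX-7-13 2890 BD1NYRACXX-5-23 replace genotype
905 AD1NFNACXX-8-27 3126 AD1NFNACXX-8-25 replace genotype
6039 AD1NE2ACXX-5-22 3150 BD24PGACXX-5-5 replace genotype
"""


def parse(table):
    """Return (geno_id, run_id, best_geno_id, best_run_id, action) tuples."""
    rows = []
    for line in table.strip().splitlines():
        # maxsplit=4 keeps the two-word action "replace genotype" in one field
        geno_id, run_id, best_geno, best_run, action = line.split(maxsplit=4)
        rows.append((int(geno_id), run_id, int(best_geno), best_run, action))
    return rows


rows = parse(TABLE)

# Directed (geno_id, best match) pairs for all swap rows
swaps = {(g, b) for g, _, b, _, action in rows if action == "swap"}
# A swap is only consistent if it is reported from both sides
reciprocal = all((b, g) in swaps for g, b in swaps)
swap_pairs = {frozenset(pair) for pair in swaps}

# Best-match ids whose genotype is handed to another sample
replace_donors = [b for _, _, b, _, action in rows if action == "replace genotype"]

print("swap pairs:", len(swap_pairs), "all reciprocal:", reciprocal)
print("best-match ids of replaced genotypes:", replace_donors)
```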

Possibly contaminated samples

These outliers show a high heterozygosity rate in genotypes called from RNA-seq data.

Also present in gender-specific analysis (see below):
BC1KBKACXX-5-6
BD1NW4ACXX-8-5
BD1NYRACXX-2-16
BC1KBKACXX-5-3
BC1KBKACXX-5-1
BD1NYRACXX-2-27
BD1NYRACXX-4-19
BC1KBKACXX-5-4

Possible gender-neutral contaminations:
BC1KBKACXX-3-12
BC1KBKACXX-5-7
BD24PGACXX-7-10
BC1KBKACXX-5-5
BD1NYRACXX-3-1
BC1KBKACXX-5-2
AD1NFNACXX-4-8
BD1NYRACXX-2-18

LifeLines

http://www.molgenis.org/wiki/DeepNoteworthyObservations

LLDeep_0063

The corresponding RNA-seq sample, AC1C40ACXX-4-4 (old id: 103001429206), has only 76% of reads aligned. It was flagged by MixupMapper as a sample mix-up and also shows many discordant genotypes when using SNVMix.

LLDeep_0350

The corresponding RNA-seq sample, AD1GWFACXX-4-15 (old id: 103001383279), was not flagged by MixupMapper. However, it shows many discordant genotypes when using SNVMix.

The sample has high expression levels of both XIST and chromosome Y genes. The average heterozygosity rate over all samples is 49% (stdev 1.9%), whereas sample LLDeep_0350 (103001383279) has a heterozygosity rate of 72%. This points to a contaminated sample in which a male and a female sample have likely been mixed in very similar proportions, which explains the high expression levels of both XIST and chromosome Y genes.
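
To put the 72% in perspective, the figures quoted above can be turned into a number of standard deviations from the cohort mean. This is plain arithmetic on the values in the text; the variable names are illustrative.

```python
mean_het = 49.0    # average heterozygosity over all samples (%), from the text
sd_het = 1.9       # standard deviation (%), from the text
sample_het = 72.0  # LLDeep_0350 (%), from the text

# Distance from the cohort mean in units of the standard deviation
z = (sample_het - mean_het) / sd_het
print(f"LLDeep_0350 lies {z:.1f} standard deviations above the mean")
```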

The file with genotype concordance and heterozygosity rates for the imputed genotypes can be found here

CODAM

eQTL mapping (gene level) results:

6804 unique cis-regulated genes.

Samples that failed the QC:

2345 (RNA-seq ids: AD10W1ACXX-8-11, CODAM-102-130804): mix-up mapper + genotype concordance;

2495 (RNA-seq ids: AD10W1ACXX-5-18, CODAM-156-130804): mix-up mapper + genotype concordance;

It looks like RNA-seq sample ids were swapped for these two samples (see: http://www.bbmriwiki.nl/wiki/BIOS_QualityControl/BIOS_QualityControlRun1 of 12-December-2013)

The file with genotype concordance and heterozygosity rates for the imputed genotypes can be found here

RS

eQTL mapping (gene level) results:

7708 unique cis-regulated genes.

Samples that failed the QC:

8190002 (RNA-seq ids: AD1NNNACXX-4-18, RS-287-130804): mix-up mapper + genotype concordance;

9353 (AC1JV9ACXX-1-13, RS-761-130804): mix-up mapper + genotype concordance;

3520 (BC1JTJACXX-6-7, RS-442-130804): genotype concordance;

562 (BC1KAVACXX-8-13, RS-55-130804): genotype concordance + heterozygosity rate;

6734 (RS-502-130804): genotype concordance + heterozygosity rate (passed QC in the first run data);

The file with genotype concordance and heterozygosity rates for the imputed genotypes can be found here

Data QC based on median correlations of gene counts from each sample to all other samples


Samples with much lower median correlations to all other samples
For methods see: http://www.bbmriwiki.nl/wiki/gene_exon_transcript_count
AC1JV9ACXX.1.10 0.0471
AD1NE2ACXX.5.22 0.1174
AD2D8RACXX.3.3 0.8028
AD2D8RACXX.6.3 0.8093
AD2D8RACXX.1.3 0.8257
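
For readers unfamiliar with this statistic, a minimal sketch follows. It is not the method documented at the link above. It assumes Pearson correlation on gene-level values, excludes each sample's correlation with itself, and uses simulated data with hypothetical sample names.

```python
import math
import random
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equally long lists of numbers."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def median_correlation_to_others(samples):
    """samples: dict sample_id -> gene-level values (same gene order in every list).

    Returns dict sample_id -> median correlation with all other samples.
    """
    result = {}
    for sid, values in samples.items():
        others = [pearson(values, v) for oid, v in samples.items() if oid != sid]
        result[sid] = median(others)
    return result


# Simulated data: four samples share one expression profile, one is pure noise
random.seed(0)
profile = [random.gauss(0, 1) for _ in range(200)]
samples = {f"good_{i}": [g + random.gauss(0, 0.1) for g in profile] for i in range(4)}
samples["noisy"] = [random.gauss(0, 1) for _ in range(200)]

med = median_correlation_to_others(samples)
for sid, value in sorted(med.items(), key=lambda kv: kv[1]):
    print(sid, round(value, 4))
```

Using the median rather than the mean keeps a single bad sample from pulling down the score of every good sample it is compared with.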

Outliers to be removed based on QC stats and PC analysis


Updated: 12-December-2013
Analysis by: Peter-Bram 't Hoen
Too few reads:
AC1JV9ACXX-1-10
AD1NE2ACXX-5-22
BD1NW4ACXX-3-27
Other reasons: See http://www.bbmriwiki.nl/wiki/BIOS_QualityControl/BIOS_QualityControlRun
BD1NYRACXX-6-10: percentage of mapped reads too low, outlier on principal components 1, 4, 5, 6
AD2CJPACXX-8-9: low exon correlation, outlier on principal components 1, 11, 14
BD1NR9ACXX-7-27: low percentage of mapped reads, outlier on principal component 4, likely degraded

Outliers to be removed based on gender-specific expression analysis

Updated: 12-December-2013
Analysis by: Peter-Bram 't Hoen
The normalized gene expression values (edgeR TMM method, expressed as cpm) for XIST and for the sum of all protein-coding Y-chromosomal genes were used to check for contamination between samples of different gender. The script can be found here. In addition to LifeLines sample AD1GWFACXX-4-15 (LLDeep_0350), the following samples (all from LLS) came up and appeared to be contaminated:
BC1KBKACXX-5-1
BC1KBKACXX-5-3
BC1KBKACXX-5-4
BC1KBKACXX-5-6
BC1KBKACXX-5-8
BD1NW4ACXX-8-5
BD1NYRACXX-2-16
BD1NYRACXX-2-27
BD1NYRACXX-4-19
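
The idea behind this check can be sketched as follows. This is not the script referred to above. It assumes cpm values are already available per sample, and the thresholds, sample names and values are hypothetical. A sample is flagged when both the XIST signal (female-specific) and the summed Y-chromosomal signal (male-specific) are high, which a single uncontaminated sample should not show.

```python
def flag_mixed_sex_signal(cpm, xist_min, y_min):
    """cpm: dict sample_id -> (XIST cpm, summed cpm of protein-coding chrY genes).

    Returns the sorted ids of samples where both values reach their threshold.
    """
    return sorted(
        sid
        for sid, (xist, y_sum) in cpm.items()
        if xist >= xist_min and y_sum >= y_min
    )


# Hypothetical cpm values for three samples
cpm = {
    "female_like": (120.0, 0.4),
    "male_like": (0.3, 95.0),
    "mixed": (60.0, 50.0),
}

# Hypothetical thresholds; in practice they would be chosen from the cohort distribution
flagged = flag_mixed_sex_signal(cpm, xist_min=10.0, y_min=10.0)
print(flagged)
```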

Attachments (8)