== Summary Statistics Plot ==

Author: Peter-Bram 't Hoen[[BR]]
Date: 12-November-2013[[BR]]
Density plots of important QC statistics, separated by biobank, were created from the file http://www.bbmriwiki.nl/attachment/wiki/FgPipelineOutput/5ddcae7/run_1_stats.tsv and can be found here: [raw-attachment:important_summary_stats_run_1.pdf summary stats for different biobanks][[BR]]
The first 10 principal components from Lude, based on the gene counts ([raw-attachment:PCA_Lude_5nov13.txt PCA Lude]), were correlated to these summary statistics. Pearson correlations can be found [raw-attachment:correlations_of_qc_factors_to_pc_Lude.xlsx here]. Plots of PCA loadings against summary statistics can be found [raw-attachment:run1_stats_pc_correlations.rar here].
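
The correlation step can be illustrated with a minimal sketch. It is not the script that produced the attached spreadsheet; the function names (`pearson`, `correlate_pcs_with_qc`), the column names and the toy values are all assumptions made for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

def correlate_pcs_with_qc(pcs, qc):
    """Correlate every PC column with every QC column.

    pcs and qc map a column name to a list of per-sample values,
    with samples in the same order in both (an assumption of this sketch).
    """
    return {(p, q): pearson(pv, qv)
            for p, pv in pcs.items()
            for q, qv in qc.items()}

# Toy values, not taken from the attachments.
pcs = {"PC1": [0.1, 0.2, 0.3, 0.4], "PC2": [0.4, 0.1, 0.3, 0.2]}
qc = {"n_reads": [10.0, 20.0, 30.0, 40.0], "pct_gc": [50.0, 48.0, 47.0, 45.0]}
table = correlate_pcs_with_qc(pcs, qc)
```

Each entry of `table` corresponds to one cell of a PC-by-QC correlation matrix.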

== Decision on samples to be repeated from run 1 ==
On November 13, the BIOS management team decided to repeat samples that had fewer than 30M past-filter reads / mappable reads (15M paired-end reads). In total, 2240 samples were run once and 30 samples were run twice; the merged files for these 30 samples are also in the output database, which gives 2330 run_ids (2240 + 2 x 30 + 30). The number of unique samples is 2270. Of these, 2154 passed the threshold before or after merging and 116 did not. The list of samples not passing the threshold can be found [raw-attachment:run1_qc_stats_for_samples_below_30M_past_filter_reads.txt here].
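
The bookkeeping and the selection rule can be sketched as follows. This is only an illustration under stated assumptions: the function name `samples_to_repeat`, the input layout (one list of read counts per sample, covering its individual runs and the merged file where present) and the toy sample ids are hypothetical.

```python
MIN_READS = 30_000_000  # threshold from the decision: 30M past-filter reads

def samples_to_repeat(reads_per_run, min_reads=MIN_READS):
    """Return the ids of samples that never reach the threshold.

    reads_per_run maps a sample id to the read counts of all its run_ids
    (individual runs plus the merged file, if any). A sample passes when
    at least one of these counts reaches min_reads, i.e. before or after merging.
    """
    return sorted(s for s, counts in reads_per_run.items()
                  if max(counts) < min_reads)

# Bookkeeping of the numbers quoted in the text.
run_once, run_twice = 2240, 30
total_run_ids = run_once + 2 * run_twice + run_twice  # two runs + one merged file
unique_samples = run_once + run_twice
failed = unique_samples - 2154

# Hypothetical samples, not real run_ids.
toy = {
    "sample_a": [41_000_000],
    "sample_b": [12_000_000, 14_000_000, 26_000_000],  # below threshold even after merging
    "sample_c": [18_000_000, 17_000_000, 35_000_000],  # passes after merging
}
repeat = samples_to_repeat(toy)
```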


== PCA on samples that are included after removal of clear sample mix-ups and all Amsterdam samples ==
Author: Lude Franke[[BR]]
Date: December 10[[BR]]
PCA was conducted on 2188 samples. From the original 2270 samples, the following were removed: sample mix-ups (see: http://www.bbmriwiki.nl/wiki/FgSampleBlacklist), Amsterdam samples, and the three samples with a very low number of reads (BD1NW4ACXX-3-27, AC1JV9ACXX-1-10, AD1NE2ACXX-5-22).[[BR]]
Principal components can be found in attachment [raw-attachment:factorloadings.txt here].[[BR]]
[[BR]]
Samples that are outliers (correlation with first principal component < 0.93):[[BR]]
Sample	Comp1[[BR]]
AD2DATACXX-6-6	-0.893288[[BR]]
BD1NYRACXX-6-10	-0.894323[[BR]]
AD1NNNACXX-8-7_BD2D5MACXX-6-7	-0.896686[[BR]]
AD2CJPACXX-8-9	-0.912417[[BR]]
AD1NE2ACXX-1-18	-0.912594[[BR]]
BC1C8DACXX-7-15	-0.919944[[BR]]
AD1NE2ACXX-1-19	-0.922991[[BR]]
AD1NNNACXX-6-18	-0.9262[[BR]]
BD2CPRACXX-7-3	-0.927969[[BR]]
BD1NW4ACXX-2-10	-0.928871[[BR]]
AD1NP0ACXX-8-23	-0.929918[[BR]]
AD1NFNACXX-8-4	-0.930331[[BR]]
AC1C40ACXX-5-8	-0.933724[[BR]]
AD1NE2ACXX-5-4	-0.934033[[BR]]
AD1NNNACXX-5-7	-0.934122[[BR]]
BD1NRGACXX-8-23	-0.935606[[BR]]
AD10W1ACXX-8-9	-0.935622[[BR]]
AC1JV9ACXX-5-10	-0.935901[[BR]]
BC1C19ACXX-6-19	-0.935993[[BR]]
BD2CPRACXX-5-5	-0.936282[[BR]]
AD2CJPACXX-1-5	-0.936736[[BR]]
BC1C8DACXX-6-22	-0.937006[[BR]]
AD1NP0ACXX-2-10	-0.939364[[BR]]
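
A minimal sketch of how such a list could be derived from the PC1 correlations. It assumes that the comparison is made on the absolute value of the correlation (the listed values are negative) and that the cutoff is a free parameter; the function name `flag_outliers` and the two `hypothetical_*` sample ids are invented for illustration.

```python
def flag_outliers(pc1_corr, cutoff=0.93):
    """Return (sample, correlation) pairs whose |correlation| is below cutoff,
    ordered from weakest to strongest correlation with PC1."""
    flagged = [(s, r) for s, r in pc1_corr.items() if abs(r) < cutoff]
    return sorted(flagged, key=lambda item: abs(item[1]))

pc1_corr = {
    "AD2DATACXX-6-6": -0.893288,    # value from the list above
    "BD1NYRACXX-6-10": -0.894323,   # value from the list above
    "hypothetical_1": -0.981200,    # invented, typical sample
    "hypothetical_2": -0.975400,    # invented, typical sample
}
outliers = flag_outliers(pc1_corr)
```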


== Summary Statistics Plots on 2188 samples ==
Author: Peter-Bram 't Hoen[[BR]]
Date: 12-December-2013[[BR]]

Seven outliers were manually flagged based on a relatively low percentage of mapped reads, or relatively low exon or gene correlations. This is how they behave in Lude's principal component analysis: [raw-attachment:Outliers_pca_2188_samples.pdf outlier behavior].[[BR]]
BD1NYRACXX-6-10	too low percentage of mapped reads, outlier on principal component 1,4,5,6[[BR]]
AD2CJPACXX-8-9	low exon correlation, outlier on principal component 1,11,14[[BR]]
BD1NYRACXX-4-20	low percentage of unique mappings, not an outlier in pca[[BR]]
BD24PGACXX-3-13	low percentage of mapped reads, not an outlier in pca[[BR]]
AC1C40ACXX-4-4	low percentage of exon mapping, not an outlier in pca[[BR]]
BC1C19ACXX-8-7	low percentage of mapped reads, not an outlier in pca[[BR]]
BD1NR9ACXX-7-27	low percentage of mapped reads, outlier on principal component 4[[BR]]
[[BR]]
Proposal: exclude BD1NYRACXX-6-10, AD2CJPACXX-8-9 and BD1NR9ACXX-7-27 (degraded?). These are now put on the blacklist. See: 

Principal components were correlated to available [raw-attachment:merged_qc_stats_2188_samples.xlsx qc parameters] (including 5' and 3'-bias and gender specific expression): [raw-attachment:correlations_of_qc_factors_to_pc_Lude_2188_samples.xlsx correlations], [raw-attachment:qc_pca.rar scatter plots] 

PC1: number of reads, but also exon and gene correlations and the difference between them. This may explain discrepancies between exon and gene correlations: discrepant samples are usually samples with a low number of reads.[[BR]]
PC2: percentage GC and biobank, but also number of duplicates. These all seem confounded.[[BR]]
PC3: percentage multiple mappings[[BR]]
PC4: gender and median 5'-bias, and possibly RNA degradation. Exon expression needs to be evaluated to check the latter.[[BR]]
PC5: XIST and Y-chromosomal expression, likely gender effect[[BR]]
PC6: percentage GC and multimappings[[BR]]
PC7+8: nothing obvious[[BR]]
PC9: ratio exon/genome mapped; can reflect genomic DNA contamination, perhaps also intronic and thus pre-mRNA content[[BR]]
PC10: nothing obvious[[BR]]

== CODAM sample mixups ==
[[BR]]
Author: Dasha Zhernakova[[BR]]
Date: 12-December-2013[[BR]]
[[BR]]
In the CODAM dataset, MixupMapper identified one sample swap. The original sample id conversion table contained the following:
  genotype id 2495 -> RNA-seq id AD10W1ACXX-5-18[[BR]]
  genotype id 2345 -> RNA-seq id AD10W1ACXX-8-11[[BR]]
[[BR]]
MixupMapper suggested that these samples were swapped and that the correct conversion table is: 
  genotype id 2495 -> RNA-seq id AD10W1ACXX-8-11[[BR]]
  genotype id 2345 -> RNA-seq id AD10W1ACXX-5-18[[BR]]

If the sample ids are swapped in this way, the genotype concordance indeed increases from a low to a normal level.

Phenotype information says that:[[BR]]
2345 is female[[BR]]
2495 is male[[BR]]


XIST Expression: [[BR]]
AD10W1ACXX-8-11 doesn't have any reads mapping to XIST[[BR]]
AD10W1ACXX-5-18 (normalized) expression is 19.06149483[[BR]]

Mean chrY genes' expression (normalized):[[BR]]
AD10W1ACXX-8-11: 1.807537771[[BR]]
AD10W1ACXX-5-18: 0.467295688[[BR]]
(in AD10W1ACXX-5-18 the expressed genes are pseudogenes)[[BR]]

ChrX heterozygosity rate:[[BR]]
2345: 0.277410392[[BR]]
2495: 0.001561367[[BR]]

These results suggest that RNA-seq sample ids were swapped and the correct conversion table is:[[BR]]
  genotype id 2495 -> RNA-seq id AD10W1ACXX-8-11[[BR]]
  genotype id 2345 -> RNA-seq id AD10W1ACXX-5-18[[BR]]
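
The reasoning above can be written down as a small consistency check. This is a sketch, not part of MixupMapper: the decision rules (female if XIST expression exceeds mean chrY expression, female if chrX heterozygosity is above 0.1) and all function names are assumptions chosen for illustration. The input values are the ones reported above.

```python
# Values reported above; XIST is set to 0.0 for the sample without XIST reads.
xist = {"AD10W1ACXX-8-11": 0.0, "AD10W1ACXX-5-18": 19.06149483}
chr_y = {"AD10W1ACXX-8-11": 1.807537771, "AD10W1ACXX-5-18": 0.467295688}
chr_x_het = {"2345": 0.277410392, "2495": 0.001561367}
phenotype_sex = {"2345": "female", "2495": "male"}

def rna_sex(sample):
    """Assumed rule: XIST above mean chrY expression suggests female."""
    return "female" if xist[sample] > chr_y[sample] else "male"

def genotype_sex(genotype_id, het_cutoff=0.1):
    """Assumed rule: appreciable chrX heterozygosity suggests female."""
    return "female" if chr_x_het[genotype_id] > het_cutoff else "male"

def consistent(mapping):
    """True if phenotype, genotype and RNA-seq sex agree for every pair."""
    return all(
        phenotype_sex[g] == genotype_sex(g) == rna_sex(r)
        for g, r in mapping.items()
    )

original = {"2495": "AD10W1ACXX-5-18", "2345": "AD10W1ACXX-8-11"}
swapped = {"2495": "AD10W1ACXX-8-11", "2345": "AD10W1ACXX-5-18"}
original_ok = consistent(original)
swapped_ok = consistent(swapped)
```

Under these assumed rules only the swapped table is consistent with the phenotype information, in line with the MixupMapper suggestion.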