Changes between Version 11 and Version 12 of DataConcordance


Timestamp:
Apr 21, 2011 4:43:30 PM (14 years ago)
Author:
laurent
Comment:

--

  • DataConcordance

    v11 v12  
    88 * BGI Sequence data
    99
    10 = Methods & Tools =
    11 == File Types ==
     10== Methods & Tools ==
     11=== File Types ===
    1212All data sets were either generated or converted to VCF files aligned on the build Hg19 of the Human Reference Genome:
    1313
     
    1515 * See [https://www.broad.harvard.edu/gsa/wiki/index.php/LiftOverVCF.pl GATK LiftOverVCF] about how to liftover a VCF file from one reference to another
    1616
    17 == Concordance calculation using [http://vcftools.sourceforge.net/ VCFTools] ==
     17=== Concordance calculation using [http://vcftools.sourceforge.net/ VCFTools] ===
    1818To calculate the concordance between the different files, [http://vcftools.sourceforge.net/ VCFTools] was used. More specifically: <pre>vcftools --vcf /data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf --indv ${sample} --diff /data/lfrancioli/results/pilot/${sample}.human_g1k_v37.immuno.vcf --diff-site-discordance --diff-indv-discordance --diff-discordance-matrix</pre> This computes the concordance per file, site and individual, as well as a discordance matrix. Because the command was applied one sample at a time, only the file, site and discordance matrix outputs were actually used.
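The per-sample invocation above lends itself to scripting. The following is a minimal sketch, not the script actually used for this analysis: it only assembles the argument list for each sample, reusing the paths and options from the command above. The helper name `build_vcftools_cmd` and the sample IDs are hypothetical.

```python
from typing import List

# Paths taken from the command quoted above.
REF_VCF = "/data/lfrancioli/immunochip/hg19/GvNL.hg19.final.vcf"
DIFF_TEMPLATE = "/data/lfrancioli/results/pilot/{sample}.human_g1k_v37.immuno.vcf"


def build_vcftools_cmd(sample: str) -> List[str]:
    """Return the vcftools argument list for one sample (hypothetical helper)."""
    return [
        "vcftools",
        "--vcf", REF_VCF,
        "--indv", sample,
        "--diff", DIFF_TEMPLATE.format(sample=sample),
        "--diff-site-discordance",
        "--diff-indv-discordance",
        "--diff-discordance-matrix",
    ]


if __name__ == "__main__":
    # SAMPLE_A and SAMPLE_B are placeholder IDs, not real sample names.
    for sample in ["SAMPLE_A", "SAMPLE_B"]:
        print(" ".join(build_vcftools_cmd(sample)))
```

Each list can be handed to `subprocess.run(cmd, check=True)` to execute the comparison; keeping the arguments as a list avoids shell quoting issues with sample names.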
    1919
    20 == Concordance aggregation using home-made scripts ==
     20=== Concordance aggregation using home-made scripts ===
    2121Because the VCFTools output is per sample, it is useful for single-individual QC but not for population-level QC. A few scripts were developed to aggregate the data easily over a selection of sample files.
    2222
    23 === vcftools-diff_site-concordance.pl ===
     23==== vcftools-diff_site-concordance.pl ====
    2424As its name suggests, this script runs over the individual .diff.site files produced by VCFTools and aggregates their information. The following features are available:
    2525
     
    3232** SNP filtering ** Addition of MAF from a plink frq file ** Addition of SNP ID from a plink bim file ** Output of shared SNPs only
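The core aggregation step can be illustrated in a few lines. This is a sketch in Python, not a port of the Perl script described above, and it leaves out the filtering and plink annotation features. It assumes each per-sample site file is tab-separated with a header row containing the columns `CHROM`, `POS`, `N_COMMON_CALLED` and `N_DISCORD`; check this against the actual VCFTools output before relying on it. The function names are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, Iterable, Optional, TextIO, Tuple

Site = Tuple[str, int]


def aggregate_site_discordance(files: Iterable[TextIO]) -> Dict[Site, Tuple[int, int]]:
    """Sum (common calls, discordant calls) per site over all per-sample files."""
    totals = defaultdict(lambda: [0, 0])
    for handle in files:
        # Assumed layout: tab-separated with a header row (see lead-in).
        for row in csv.DictReader(handle, delimiter="\t"):
            key = (row["CHROM"], int(row["POS"]))
            totals[key][0] += int(row["N_COMMON_CALLED"])
            totals[key][1] += int(row["N_DISCORD"])
    return {site: (called, discord) for site, (called, discord) in totals.items()}


def discordance_rate(called: int, discordant: int) -> Optional[float]:
    """Fraction of discordant calls, or None when no sample was called at the site."""
    return discordant / called if called else None


if __name__ == "__main__":
    header = "CHROM\tPOS\tN_COMMON_CALLED\tN_DISCORD\n"
    # Two made-up per-sample files, for illustration only.
    sample_a = io.StringIO(header + "1\t100\t1\t0\n1\t200\t1\t1\n")
    sample_b = io.StringIO(header + "1\t100\t1\t1\n")
    for site, (called, discord) in sorted(aggregate_site_discordance([sample_a, sample_b]).items()):
        print(site, called, discord, discordance_rate(called, discord))
```

Keying on (chromosome, position) means a site missing from some samples is still aggregated over the samples that do report it.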
    3333
    34 === vcftools-discordance-matrix.py ===
     34==== vcftools-discordance-matrix.py ====
    3535This script aggregates the discordance matrix files produced by vcftools into one.
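Aggregating the matrices amounts to an element-wise sum. The sketch below illustrates the idea and is not the code of vcftools-discordance-matrix.py itself. It assumes each matrix file is tab-separated, with a header row of column labels and a label in the first column of every following row; the labels in the demo are placeholders and the function names are hypothetical.

```python
import io
from typing import Iterable, List, Optional, TextIO, Tuple

Matrix = Tuple[List[str], List[str], List[List[int]]]


def read_matrix(handle: TextIO) -> Matrix:
    """Parse one labelled count matrix into (column labels, row labels, counts)."""
    rows = [line.rstrip("\n").split("\t") for line in handle if line.strip()]
    col_labels = rows[0][1:]
    row_labels = [row[0] for row in rows[1:]]
    counts = [[int(cell) for cell in row[1:]] for row in rows[1:]]
    return col_labels, row_labels, counts


def sum_matrices(handles: Iterable[TextIO]) -> Matrix:
    """Element-wise sum of all matrices; labels must match across files."""
    total: Optional[Matrix] = None
    for handle in handles:
        cols, rows, counts = read_matrix(handle)
        if total is None:
            total = (cols, rows, counts)
            continue
        if (cols, rows) != (total[0], total[1]):
            raise ValueError("matrix labels differ between files")
        for i, row in enumerate(counts):
            for j, value in enumerate(row):
                total[2][i][j] += value
    if total is None:
        raise ValueError("no matrix files given")
    return total


if __name__ == "__main__":
    # Placeholder labels and counts, for illustration only.
    first = io.StringIO("-\tREF\tHET\nREF\t5\t1\nHET\t0\t4\n")
    second = io.StringIO("-\tREF\tHET\nREF\t3\t2\nHET\t1\t6\n")
    cols, rows, counts = sum_matrices([first, second])
    for label, row in zip(rows, counts):
        print(label, row)
```

Refusing to sum matrices whose labels differ guards against silently mixing files with different genotype categories.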
    3636
    37 == Reporting using R scripts ==
     37=== Reporting using R scripts ===
    3838For reporting purposes, R scripts were created. These scripts all take files created using vcftools-diff_site-concordance.pl or vcftools-discordance-matrix.py as input. The following scripts are available:
    3939
     
    5050** Plots the genotype discordance by "discordance type" (0/0 -> 0/1, 0/0 -> 1/1, 0/1 -> 0/0, etc.) ** Usage: Rscript plot_discordance_matrix.R <discordance_matrix_file> <out_plot.jpg> [dataset1_name] [dataset2_name] [show_concordant_data=FALSE] [<concordance_file>] ** Note: *** The last optional argument is a concordance file over the same data. It is used to plot as 'unknown' all loci that were not captured by the concordance matrix because the alleles were not exact matches (e.g. if one of the alleles was monomorphic in one set).
    5151
    52 = Results - GoNL Pilot =
    53 == Groningen / BGI ==
     52== Results - GoNL Pilot ==
     53=== Groningen / BGI ===
    5454Datasets:
    5555
     
    6262** Produced using BGI pipeline on b36, then lifted over to hg19 ** SNPs filtered using standard BGI filter setup
    6363
    64 === Loci Concordance ===
     64==== Loci Concordance ====
    6565Below is a chart showing the shared and unique SNPs in the two datasets regardless of their genotypes. As expected, the vast majority of the SNPs are shared between the datasets. A relatively high number of SNPs are found only in Groningen (amongst them a majority of unfiltered false positives), and a small number of SNPs are unique to the BGI dataset (to be investigated).
    6666
     
    6969Investigation showed that the three least concordant individuals encountered a problem while one of their lanes was being processed, which left them with 2/3 of the normal coverage. The figures should be updated once the lanes have been processed and these individuals corrected.
    7070
    71 === Genotype Concordance ===
     71==== Genotype Concordance ====
    7272The following chart shows the genotype concordance on the shared SNPs between BGI and Groningen datasets.
    7373
     
    7676Note: The chart above excludes the sex chromosomes, because an artifact introduced by the way BGI mapped the Y chromosome showed all males as completely discordant over the sex chromosomes.
    7777
    78 == Groningen / Immunochip ==
     78=== Groningen / Immunochip ===
    7979Datasets:
    8080
     
    8787** ~165K loci after QC (both SNPs and homozygous reference) *** SNP HWE p-val > 1e-3 *** SNP callrate > 99% ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19
    8888
    89 === Genotype Concordance ===
     89==== Genotype Concordance ====
    9090The following chart shows the genotype concordance on the 165K Immunochip loci left after QC.
    9191
     
    103103[[Image(pilot.immuno.seq.gen.concordance.test.jpg)]]
    104104
    105 == BGI / Immunochip ==
     105=== BGI / Immunochip ===
    106106Datasets:
    107107
     
    114114** ~165K loci after QC (both SNPs and homozygous reference) *** SNP HWE p-val > 1e-3 *** SNP callrate > 99% ** Exported from Genome Studio, QC'ed and lifted over from hg18 to hg19
    115115
    116 === Genotype Concordance ===
     116==== Genotype Concordance ====
    117117The following chart shows the concordance between the 2 datasets over ~47K shared loci.
    118118