[[TOC()]]
= Data Analysis Plan =

We will randomly sample 250 trio families (mother, father, offspring) from the 12 provincial regions, each with documented ancestral roots in the Netherlands. These samples are to be sequenced on the Illumina HiSeq platform at an average depth of coverage of 12X. This will result in a data set of 1000 unique chromosomal haplotypes (two from each of the 500 parents; the offspring carry transmitted copies and add no independent haplotypes), which is substantially larger than the European-descent CEPH panel in HapMap (Phase 1 or 2). Compared with a design based exclusively on unrelated individuals, the trio design gives this project a distinct advantage for cataloguing more complex variants such as copy number variants or indels, because calls can be tested for Mendelian consistency between parents and offspring.

This page enumerates our analysis plans for these data:

== Trio-aware variant discovery and genotype calling ==

Identifying novel variants from low-coverage sequencing data is a challenging problem given the known and unknown error modes and biases of sequencing platforms. The 1000 Genomes project has (so far) focused mostly on unrelated samples. The trio design is expected to provide better power for accurate discovery of novel sites, genotyping in familial samples, and haplotype inference, because each call can be constrained by the genotypes of relatives. New methods that exploit this family structure still need to be developed, and the GvNL data are well suited for training and benchmarking them.
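
As an illustration of how family structure can constrain genotype calls, the sketch below combines per-sample genotype likelihoods with a Mendelian transmission prior and picks the most probable joint trio genotype. It is a minimal example under stated assumptions, not the project's calling method: sites are biallelic and autosomal, founders follow Hardy-Weinberg proportions, de novo mutation is ignored, and the function names, the input format (three likelihoods per sample for 0, 1 or 2 alternate alleles) and the default `alt_freq` are all hypothetical.

```python
from itertools import product


def hwe_prior(genotype, alt_freq):
    """Hardy-Weinberg prior for a founder genotype (0, 1 or 2 ALT alleles)."""
    ref_freq = 1.0 - alt_freq
    return (ref_freq ** 2, 2 * ref_freq * alt_freq, alt_freq ** 2)[genotype]


def child_given_parents(child, mother, father):
    """P(child genotype | parental genotypes) under Mendelian transmission.

    Assumption: no de novo mutation, so impossible configurations get 0.
    """
    pm = mother / 2.0  # probability that the mother transmits ALT
    pf = father / 2.0  # probability that the father transmits ALT
    probs = (
        (1 - pm) * (1 - pf),
        pm * (1 - pf) + (1 - pm) * pf,
        pm * pf,
    )
    return probs[child]


def call_trio(lik_mother, lik_father, lik_child, alt_freq=0.1):
    """Return the most probable (mother, father, child) genotypes and
    the posterior probability of that joint configuration."""
    best, best_score, total = None, -1.0, 0.0
    for m, f, c in product(range(3), repeat=3):
        score = (
            hwe_prior(m, alt_freq)
            * hwe_prior(f, alt_freq)
            * child_given_parents(c, m, f)
            * lik_mother[m]
            * lik_father[f]
            * lik_child[c]
        )
        total += score
        if score > best_score:
            best, best_score = (m, f, c), score
    return best, best_score / total


# Both parents look confidently homozygous reference, while the child's
# reads slightly favour a heterozygous call when considered in isolation.
parents = [1.0, 1e-3, 1e-6]
child = [0.4, 0.6, 0.0]
best, posterior = call_trio(parents, parents, child)
```

In this toy input the joint model calls the child homozygous reference, because a heterozygous child would require at least one parent to carry the alternate allele. A model intended for de novo discovery would need to replace the zero transmission probabilities with a small mutation rate.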

== De novo mutations ==

The trio design also allows an accurate characterization of de novo events in the offspring (variants observed in the offspring that are absent from both parents and therefore do not segregate in a Mendelian fashion).
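
The Mendelian check behind de novo detection can be sketched as follows. This is a simplified example that assumes biallelic autosomal sites with genotypes coded as the number of alternate alleles (0, 1 or 2); the function names are hypothetical. Note that an apparent violation in real data can also reflect a genotyping error, so such sites are candidates rather than confirmed de novo events.

```python
from itertools import product

# Alleles (0 = REF, 1 = ALT) that a parent with a given genotype can transmit.
TRANSMISSIBLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_consistent(mother, father, child):
    """True if the child's genotype can arise from one allele per parent."""
    possible = {
        m_allele + f_allele
        for m_allele, f_allele in product(
            TRANSMISSIBLE[mother], TRANSMISSIBLE[father]
        )
    }
    return child in possible


def is_candidate_de_novo(mother, father, child):
    """True if the child is heterozygous and neither parent carries ALT."""
    return mother == 0 and father == 0 and child == 1


# Example trios as (mother, father, child) genotypes.
trios = [(0, 1, 1), (0, 0, 1), (2, 2, 1), (1, 1, 2)]
violations = [t for t in trios if not is_mendelian_consistent(*t)]
candidates = [t for t in trios if is_candidate_de_novo(*t)]
```

Here `(0, 0, 1)` is flagged both as a violation and as a candidate de novo event, while `(2, 2, 1)` is a violation that is more plausibly explained by a genotyping error or a structural event than by a new mutation.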

== Population genetics of Dutch samples within Europe ==

A key question is to what extent Dutch genomic variation can be effectively represented by other European samples (for example, CEU, TSI and GBR in HapMap and 1000 Genomes). Are the common and low-frequency SNPs found in the Dutch samples also present in other European samples (at a given sample size)? Can we detect a North-South cline across the Netherlands?
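
The qualifier "at a given sample size" matters because a variant can exist in another population and still be absent from a small sample of it. A rough sketch of this effect, assuming unrelated diploid individuals, independent sampling of chromosomes and perfect detection (none of which holds exactly for low-coverage data), is the probability of seeing an allele of frequency q at least once among 2n chromosomes, 1 - (1 - q)^(2n). The function name is hypothetical.

```python
def detection_probability(alt_freq, n_individuals):
    """Probability that an allele with population frequency `alt_freq`
    appears at least once in `n_individuals` unrelated diploid samples.

    Assumptions: independent chromosomes and error-free detection.
    """
    n_chromosomes = 2 * n_individuals
    return 1.0 - (1.0 - alt_freq) ** n_chromosomes


# Compare a common (5%) and a low-frequency (1%) allele at two sample sizes.
for freq in (0.05, 0.01):
    for n in (50, 100):
        p = detection_probability(freq, n)
        print(f"freq={freq:.2f} n={n}: P(observed) = {p:.3f}")
```

Under these assumptions, sharing between the Dutch samples and another panel should be compared against this expectation, so that absence due to limited sample size is not mistaken for population-specific variation.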

== Imputation performance based on first-generation SNP microarrays ==

Many thousands of samples have already been genotyped on SNP microarrays in GWAS. A key question is to what extent GvNL will be an effective reference panel for imputation into these samples. Of all variants identified in GvNL, how many that are not genotyped on the arrays can be imputed accurately? How does this compare to the first-generation HapMap? This also addresses the question of how many variants were missed in the initial GWAS.
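
One common way to quantify how well a variant is imputed is the squared Pearson correlation (r²) between imputed allele dosages and known genotypes at sites that were masked before imputation. The sketch below is illustrative only: the function names, the input format (per-sample genotypes coded 0, 1 or 2 and dosages between 0 and 2) and the 0.8 threshold are assumptions, not choices stated in this plan.

```python
def dosage_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Returns 0.0 when either vector has no variance (for example a site that
    is monomorphic in the evaluation samples), where r2 is undefined.
    """
    n = len(true_genotypes)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length vectors with at least 2 samples")
    mean_t = sum(true_genotypes) / n
    mean_d = sum(imputed_dosages) / n
    cov = sum(
        (t - mean_t) * (d - mean_d)
        for t, d in zip(true_genotypes, imputed_dosages)
    )
    var_t = sum((t - mean_t) ** 2 for t in true_genotypes)
    var_d = sum((d - mean_d) ** 2 for d in imputed_dosages)
    if var_t == 0 or var_d == 0:
        return 0.0
    return cov * cov / (var_t * var_d)


def fraction_well_imputed(r2_values, threshold=0.8):
    """Fraction of variants whose r2 meets a (hypothetical) quality threshold."""
    if not r2_values:
        return 0.0
    return sum(r2 >= threshold for r2 in r2_values) / len(r2_values)


# Toy example: three masked variants evaluated in four samples.
truth = [[0, 1, 2, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
imputed = [[0.1, 0.9, 1.8, 1.2], [0.5, 0.4, 0.6, 0.5], [0.0, 0.1, 0.0, 0.0]]
r2_per_variant = [dosage_r2(t, d) for t, d in zip(truth, imputed)]
```

Computing the same summary with the first-generation HapMap as the reference panel would give the comparison asked for above, ideally stratified by allele frequency since rare variants are typically harder to impute.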

== TODO: add trio-aware phasing and novel variant discovery ==