Version 9 (modified by 13 years ago) (diff) | ,
---|
GoNL Immunochip Data Preparation for Concordance
Table of Contents
This page describes the necessary steps to get a VCF Hg19 file containing the GoNL Immunochip data from the raw/QC'ed Immunochip data in PED format. This is using tools as available in early 2011 and should get much simpler when PLINK/Seq is released.
Here, the procedure is shown for a FORWARD strand PED file. If you have a TOP/TOP PED file, you will still need to correct for strand.
PED to VCF
The following steps explain how to produce a VCF file from PLINK ped files. It is a rather cumbersome process at the moment and should be streamlined when PLINK/Seq is released.
Initial PED to VCF
The only tool to have an easy (yet not completely correct) conversion from PED to VCF is the beta version of PLINK/Seq available -along with instructions- here: http://www.broadinstitute.org/gsa/wiki/index.php/Converting_ped_to_vcf
This pre-compiled version can only be run on Linux64 machines and some dependency problems may occur.
The result here is a VCF tool that contains all
Correct the initial VCF file
The initial VCF file produced by PLINK 1.08 does contain the right information, however it is not actually in VCF format. The problem here is that PLINK files specify the Ref/Alt? alleles relative to the dataset where VCF specifies the Ref/Alt? alleles relative to the Human Genome Reference it is aligned on. To correct this VCF file, it is necessary to modify the initial VCF file so that the alleles are relative to the Human Genome Reference and not the dataset anymore.
Create a tab-delimited file containing your loci using BEDTools
First it has to be clear that here BED refers to the UCSC BED format and NOT the PLINK binary file format. To be able to sort the alleles with the Human Genome Reference, we need to access it. As it is a big file, BEDTools function fastaFromBed can extract only the loci of interest (in this case those on the chip) and report them in tab-delimited file.
fastaFromBed needs a UCSC BED file as input. This file is tab-delimited and contains 3 columns: Chrom Start_seq End_seq. As we are only interested in specific loci, Start_seq and End_seq will be 1 base appart so that only the locus of interest is reported in the output file. This file can very easily be generated either from the initial VCF file or the PLINK BIM file:
- From VCF: grep -v '#' in.vcf | awk '{OFS="\t";print $1,$2,$2+1}' > out.bed
Once you have the input file, simply run fastaFromBed on it giving the Human Reference corresponding to the chip data as the other input. For more information on fastaFromBed, see the BEDTools Manual.
Re-arrange Ref/Alt? alleles based on the Human Genome Reference
Now that we have the Human Reference Genome loci, it is trivial to re-arrange the alleles so that the Ref and Alt alleles correspond to the Human Genome Reference. I wrote a small script, align-vcf-to-ref.pl that does the work provided the correct input. Note that when flipping the order of alleles in the VCF Ref/Alt? columns, one must also flipped the genotypes correctly.
[Optional] Update SNP IDs
Depending on the chip platform, the SNP IDs might not correspond to dbSNP IDs (ex. Illumina Immonchip). Illumina provides a list of the corresponding Illumina-dbSNP names. A small script reannotate-vcf-snp-ids.pl can update the SNP IDs so that they correspond to dbSNP based on the Illumina list (not on SNP location).
[Optional] Flip Strand
A small script, flip-vcf-snp.pl, is available in case you have the following:
- A VCF file coming from a TOP/TOP PLINK file set
- A BIM file corresponding to the same dataset but in forward strand
The script can be used to flip the strand according to the BIM file.
Liftover file
The last step in preparing the immunochip data for comparison with the sequence data is to liftover the VCF file to the same Human Genome Reference as the Sequence data so that comparisons can be made. Here is how:
- Get the appropriate chain files
- From the Broad GSA ftp (password is blank)
- From the USCS Genome Browser #
- Get the appropriate fasta files (you'll need both from-and-to builds fasta files)
- From NCBI
- From USCS Genome Browser
- Index the fasta files appropriately to get .fai (samtools) and .dict files (Picard) as described on the GATK wiki
- Get and run the GATK LiftOver tool
- IMPORTANT: Check your VCF file header as some versions of the GATK liftover tool (e.g. v1.0.5083) might mix the individuals in the header (sort alphabetically rather than preserve original order). If the order is changed, then you should copy/paste the original order from the source VCF file.
Note that once the liftover VCF has successfully been created, it can be used to liftover the PLINK files. To do so:
- Remove all SNPs that are not present in the new reference VCF file (using plink --extract)
- Use the liftover VCF as an input to the liftover-bim.pl tool .
Downloads
All tools developed inhouse at UMCG are available here: http://www.bbmriwiki.nl/svn/ped_to_vcf