= Recommended BIOS datasets for downstream analysis =

= RNAseq data =
== Freeze I ==
=== Data available ===
Raw RNA-seq data is available on the grid, see [wiki:FgRnaSeq RNASeq data]. The data has been aligned using the pipeline described at [wiki:FgPipeline RNAseq alignment and quantification pipeline]; the exon-, transcript- and gene-level count output is described below. Count data is available from the so-called 'Freeze1': the 2116 samples from Groningen (N=626), Leiden (N=654), Rotterdam (N=652) and Maastricht (N=184) that passed QC (see [wiki:FgSampleBlacklist2 RNAseq QC]). This is around half of the BIOS RNA-seq data that is used for the first papers; the other half has been measured but is still being aligned and quality-checked. Both raw and TMM-normalized data are available. TMM normalization corrects for differences in library size across subjects; see the attached script for R code, the R package edgeR, and http://genomebiology.com/2010/11/3/r25.
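As a rough illustration of what TMM normalization does, the sketch below applies edgeR to a small simulated count matrix. It is not the attached BIOS script: the simulated matrix `counts`, the sequencing depths, and the choice to report counts per million are assumptions made for illustration only, and the edgeR part runs only when that package is installed.

```r
# Simulated counts (hypothetical data): 200 genes x 3 samples, where the
# samples differ only in sequencing depth.
set.seed(1)
counts <- sapply(c(sample1 = 1, sample2 = 2, sample3 = 3),
                 function(depth) rpois(200, lambda = 20 * depth))
rownames(counts) <- paste0("gene", seq_len(nrow(counts)))

# Library size = total count per sample; this is what differs across subjects.
lib_sizes <- colSums(counts)
print(lib_sizes)

if (requireNamespace("edgeR", quietly = TRUE)) {
  dge <- edgeR::DGEList(counts = counts)
  # Estimate one TMM scaling factor per sample.
  dge <- edgeR::calcNormFactors(dge, method = "TMM")
  # Counts per million based on the TMM-adjusted effective library sizes.
  tmm_cpm <- edgeR::cpm(dge)
  print(dge$samples$norm.factors)
  print(head(round(tmm_cpm, 1)))
}
```

Whether the stored `RNAs` matrices are on a counts-per-million scale is not stated on this page, so check the attached script before comparing values.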
=== Location on VM ===

 * exon counts raw: /virdir/Backup/RP3_data/RNASeq/run_01/exoncounts/exon_count_freeze1_R_object.RData [[BR]]
 * exon counts TMM normalized: /virdir/Backup/RP3_data/RNASeq/run_01/exoncounts/exon_count_freeze1_TMM_normalized_R_object.RData [[BR]]
 * transcript counts raw: /virdir/Backup/RP3_data/RNASeq/run_01/transcriptcounts/transcript_count_freeze1_R_object.RData [[BR]]
 * transcript counts TMM normalized: /virdir/Backup/RP3_data/RNASeq/run_01/transcriptcounts/transcript_count_freeze1_TMM_normalized_R_object.RData [[BR]]
 * gene counts raw: /virdir/Backup/RP3_data/RNASeq/run_01/genecounts/gene_count_freeze1_R_object.RData [[BR]]
 * gene counts TMM normalized: /virdir/Backup/RP3_data/RNASeq/run_01/genecounts/gene_count_freeze1_TMM_normalized_R_object.RData [[BR]]

=== How to use the data ===

The data is stored as R objects. After loading a file in R, e.g. with load('/virdir/Backup/RP3_data/RNASeq/run_01/exoncounts/exon_count_freeze1_R_object.RData'), your workspace contains a matrix called RNA (raw data) or RNAs (TMM-normalized data). The row names of these matrices (rownames(RNAs)) contain the gene, exon or transcript IDs; the column names (colnames(RNAs)) are the subject BIOS IDs (called uuid in other files). We used Ensembl v.71 for annotation, see [wiki:FgReferenceFiles Reference and annotation]. To export the data to a tab-delimited text file, use write.table(RNAs, file='yourfile.txt', quote=FALSE, col.names=TRUE, row.names=TRUE, sep='\t').
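The load-and-export steps can be tried out as follows. This is a minimal sketch: the tiny matrix, its IDs and the temporary files are made-up stand-ins so that the example runs anywhere; on the VM, `rdata_file` would be one of the paths listed above.

```r
# Stand-in for one of the Freeze I files: a tiny matrix named RNAs saved to a
# temporary .RData file (hypothetical IDs and values).
RNAs <- matrix(c(1.5, 2.0, 0.0, 3.2), nrow = 2,
               dimnames = list(c("ENSG_A", "ENSG_B"), c("uuid_1", "uuid_2")))
rdata_file <- tempfile(fileext = ".RData")
save(RNAs, file = rdata_file)
rm(RNAs)

# load() returns the names of the objects it restored, which is a quick way to
# see whether a file contains RNA (raw) or RNAs (TMM normalized).
loaded <- load(rdata_file)
print(loaded)

print(dim(RNAs))            # features x subjects
print(head(rownames(RNAs))) # gene, exon or transcript IDs
print(head(colnames(RNAs))) # subject BIOS IDs (uuid)

# Export to a tab-delimited text file.
out_file <- tempfile(fileext = ".txt")
write.table(RNAs, file = out_file, quote = FALSE,
            col.names = TRUE, row.names = TRUE, sep = "\t")

# Reading it back: the header line has one field fewer than the data lines,
# so read.table() uses the first column as row names.
back <- read.table(out_file, header = TRUE, sep = "\t")
```

Note that the header line of the exported file has no entry above the row-name column, which some non-R tools may need to be told about.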
== Freeze II ==
=== Data available ===
=== Location on VM ===
=== How to use the data ===
= DNA methylation data =

=== Data available ===

Raw methylation data (idat files) are available for all samples generated within the BIOS project as of 01-08-2015. Sample swaps detected by !MixupMapper (http://bioinformatics.oxfordjournals.org/content/27/15/2104) have been corrected. Bad-quality samples or runs can be detected using !MethylAid (http://bioinformatics.oxfordjournals.org/content/30/23/3435); the results are available at http://shiny.bioexp.nl/BIOS. RData files have been generated for each biobank, plus a combined file for all biobanks, containing !SummarizedData objects with functionally normalized beta values and sample and feature annotation. QC has been performed using !MethylAid.

=== Location on VM ===

The idat files are located at /virdir/Scratch/RP3_data/IlluminaHumanMethylation450k/450k, separated by biobank, for example /virdir/Scratch/RP3_data/IlluminaHumanMethylation450k/450k/CODAM and /virdir/Scratch/RP3_data/IlluminaHumanMethylation450k/450k/LL. All folders contain the sample sheets provided by the data generation center, including a unique and universal BIOS identifier (uuid in phenotype files).

=== How to use the data ===

The Bioconductor/R packages minfi and illuminaio provide reading capabilities for the idat-files.
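
A minimal sketch of reading idat files with minfi is given below. The helper `idat_basenames` and the placeholder file names are hypothetical, and the sketch assumes the usual Illumina naming with `_Grn.idat` and `_Red.idat` suffixes. The minfi call only runs when the package is installed and the biobank folder exists, so elsewhere the block just demonstrates the file-name handling.

```r
# Derive the basenames minfi expects: the idat path without the
# _Grn.idat / _Red.idat suffix (hypothetical helper, not part of minfi).
idat_basenames <- function(dir) {
  grn <- list.files(dir, pattern = "_Grn\\.idat$",
                    recursive = TRUE, full.names = TRUE)
  sub("_Grn\\.idat$", "", grn)
}

# Demo on empty placeholder files with made-up names (not real idat data).
demo_dir <- tempfile("idat_demo")
dir.create(demo_dir)
file.create(file.path(demo_dir,
                      c("1234_R01C01_Grn.idat", "1234_R01C01_Red.idat")))
basenames <- idat_basenames(demo_dir)
print(basename(basenames))

# On the VM the real files could be read along these lines.
biobank_dir <- "/virdir/Scratch/RP3_data/IlluminaHumanMethylation450k/450k/CODAM"
if (requireNamespace("minfi", quietly = TRUE) && dir.exists(biobank_dir)) {
  rgset <- minfi::read.metharray(idat_basenames(biobank_dir), verbose = TRUE)
  print(dim(rgset))
}
```

Reading a whole biobank at once can need a lot of memory; passing a subset of the basenames is an easy way to test first.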

= Genotype data =

=== Data available ===

All genotype data that has been imputed (GoNL version 5) and stored on the SRM is also available from the VM. Beware: the imputed files also contain indels, and the biobank person_id is used in the sample files.

Genotype data has also been imputed using the Haplotype Reference Consortium (HRC) reference panel and is available from both the SRM and the VM. The same warnings apply to this set.

=== Location on VM ===

Gzipped IMPUTE2 files are stored at /virdir/Backup/RP3_data/GWAS_ImputationGoNLv5 per biobank.

HRC imputed data is contained in per-chromosome Variant Call Format (VCF) files at /virdir/Backup/RP3_data/HRC_Imputation/[Biobank]/results/unzipped/..

=== How to use the data ===

Once unzipped, the IMPUTE2 files can easily be read, as these are tab-separated text files (see [https://mathgen.stats.ox.ac.uk/impute/impute_v2.html]).

Note that the HRC imputed data is in VCF format, which you may need to convert before use.
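
Reading an IMPUTE2 genotype file in R could look like the sketch below. The toy file and its values are made up, and the layout (five SNP columns followed by three genotype probabilities per sample) is the standard IMPUTE2 genotype format rather than something verified against the BIOS files. The dosage calculation is an illustrative extra, not a required step.

```r
# Write a two-SNP, two-sample toy file in IMPUTE2 genotype layout:
# SNP ID, rsID, position, allele A, allele B, then P(AA) P(AB) P(BB) per sample.
gen_lines <- c(
  "1:1000\trs0001\t1000\tA\tG\t1\t0\t0\t0\t1\t0",
  "1:2000\trs0002\t2000\tC\tT\t0\t0\t1\t0.2\t0.5\t0.3"
)
gen_file <- tempfile(fileext = ".gen.gz")
con <- gzfile(gen_file, "w")
writeLines(gen_lines, con)
close(con)

# Work out the number of samples from the first line, then fix the column
# types so that an allele column containing only "T" is not read as logical.
first <- scan(gen_file, what = "", nlines = 1, quiet = TRUE)
n_samples <- (length(first) - 5) / 3
classes <- c("character", "character", "numeric", "character", "character",
             rep("numeric", 3 * n_samples))

# read.table() reads gzip-compressed files directly; the default sep = ""
# splits on any whitespace, so it copes with tabs as well as spaces.
gen <- read.table(gen_file, header = FALSE, colClasses = classes)

info <- gen[, 1:5]
names(info) <- c("snp_id", "rs_id", "position", "allele_A", "allele_B")
probs <- as.matrix(gen[, -(1:5)])

# Expected allele B dosage per sample: P(AB) + 2 * P(BB).
idx <- seq(1, ncol(probs), by = 3)
dosage <- probs[, idx + 1, drop = FALSE] + 2 * probs[, idx + 2, drop = FALSE]
dimnames(dosage) <- list(info$rs_id, paste0("sample", seq_len(n_samples)))
print(dosage)
```

The columns follow the order of the accompanying sample file, which uses the biobank person_id rather than the BIOS uuid.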

= Phenotype data =

=== Data available ===

BIOS phenotype data is stored in a meta database, see [wiki:FgMetadatabase this page]. This database can be accessed through so-called views, e.g. from R. Three views were extracted (January 2015) and stored on the VM:
 * phenotype data: view="allPhenotypes", design="phenotypes" [[BR]]
 * RNA-seq sample sheets: view="rnaseq", design="samplesheets" [[BR]]
 * IDs: view="getIds", design="identifiers" [[BR]]
These files are available in .RData and .csv formats. For column name explanations, see [wiki:FgPhenotype Phenotype data]. The phenotype data is not complete yet: we are currently contacting the biobanks to complete their files.
=== Location on VM ===

 * Phenotypes: /virdir/Backup/RP3_data/Phenotypes/BIOS_Phenotypes.RData (matrix P) and /virdir/Backup/RP3_data/Phenotypes/BIOS_Phenotypes.csv [[BR]]
 * RNA-seq sample sheets: /virdir/Backup/RP3_data/Phenotypes/rna_seq_sample_sheets.RData (matrix S) and /virdir/Backup/RP3_data/Phenotypes/rna_seq_sample_sheets.csv [[BR]]
 * IDs: /virdir/Backup/RP3_data/Phenotypes/BIOS_IDs.RData (matrix F) and /virdir/Backup/RP3_data/Phenotypes/BIOS_IDs.csv [[BR]]
=== How to use the data ===

Link the files to the RNA-seq, genotype or methylation data by mapping the corresponding IDs.
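
Mapping the IDs might look like the sketch below, shown for the RNA-seq case. The toy objects are made up, and it is an assumption that the phenotype object `P` has a column named `uuid` holding the BIOS ID; check the real column names on the [wiki:FgPhenotype Phenotype data] page.

```r
# Toy stand-ins (made-up IDs and values) for the expression matrix and the
# phenotype table.
RNAs <- matrix(1:6, nrow = 2,
               dimnames = list(c("gene1", "gene2"),
                               c("uuid_3", "uuid_1", "uuid_2")))
P <- data.frame(uuid = c("uuid_1", "uuid_2", "uuid_4"),
                age  = c(50, 61, 47),
                stringsAsFactors = FALSE)

# Keep only subjects present in both data sets.
shared <- intersect(colnames(RNAs), P$uuid)

# Reorder both objects so that column i of the matrix and row i of the
# phenotype table describe the same subject.
RNAs_matched <- RNAs[, shared, drop = FALSE]
P_matched <- P[match(shared, P$uuid), , drop = FALSE]

# Fail loudly if the two objects are not aligned.
stopifnot(identical(colnames(RNAs_matched), P_matched$uuid))
print(P_matched)
```

For genotype data the sample files use the biobank person_id, so the IDs file would probably be needed first to translate between person_id and uuid.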