Version 98 (modified by Barbera van Schaik, 14 years ago)

Port applications to Dutch Life Science Grid

People

  • AMC: Antoine van Kampen, Barbera van Schaik, Silvia D Olabarriaga, Mark Santcroos
  • Sara/BiGGrid: Tom Visser
  • UMCG: Morris Swertz, Freerk van Dijk

Description

The software is being implemented as workflow components, and the resulting workflows will run on the Dutch Life Science Grid.

Implemented workflow components at AMC

The following workflow components are already available. The list can be expanded with Pindel and (parts of) the GATK pipeline.

  • Splitting of fastq files
  • Building a BWA index on the genome sequence (base space and color space)
  • BWA for shotgun reads (base space and color space). Parameter sweeps are possible. Output is in bam format
  • Merge bam results
  • Samtools pileup
  • Varscan (pileup to snp, indel and cns)
  • Bam2coverage creates a UCSC wiggle file to display the genome coverage (per 50kbp)
  • Coverage-per-base determines the coverage for every base in the genome and it summarizes the results (coverage versus frequency)
  • Annovar, a pipeline to annotate variants (gene, dbsnp, hapmap, 1000g, conservation, etc.). It currently works for hg18; support for other assemblies is in progress
  • FastQC
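
The "coverage versus frequency" summary produced by the Coverage-per-base component can be illustrated with a minimal sketch. This is not the AMC implementation: the function name and the example depths are made up for illustration, and the real component works on whole-genome data rather than on a Python list.

```python
from collections import Counter


def coverage_histogram(per_base_coverage):
    """Return {coverage depth: number of bases with that depth}, sorted by depth."""
    return dict(sorted(Counter(per_base_coverage).items()))


# Hypothetical per-base depths for a ten-base region (not real data).
depths = [0, 0, 3, 3, 3, 5, 5, 8, 8, 8]

histogram = coverage_histogram(depths)
for depth, n_bases in histogram.items():
    # One line per depth: coverage, then the number of bases with that coverage.
    print(f"{depth}\t{n_bases}")
```

The counts always add up to the number of bases that were inspected, which is a cheap sanity check on the summary.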

Implemented components of the Groningen pipeline

Template (grid component)

Alignment, realignment, recalibration, stats

  • pe0--fastqc.ftl (FastqToFastQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/quality/Workflow/FastqToFastQC.gwendia)
  • pe00-bwa-align-pair1.ftl (BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
  • pe01-bwa-align-pair2.ftl (BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
  • pe02-bwa-sampe.ftl (BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
  • pe03-sam-to-bam.ftl (BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
  • pe04a-HsMetrics.ftl (CalculateHsMetrics, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/CalculateHsMetrics.gwendia)
  • pe04b-picardQC.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
  • pe04-sam-sort.ftl (SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
  • pe05-mark-duplicates.ftl (MarkDuplicates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/MarkDuplicates.gwendia)
  • pe06-realign.ftl (ReAlign, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/ReAlign.gwendia)
  • pe07-fixmates.ftl (FixMates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/FixMates.gwendia)
  • pe08-covariates-before.ftl (GatkCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkCovariates.gwendia)
  • pe09-recalibrate.ftl (GatkRecalibrate, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkRecalibrate.gwendia)
  • pe10-sam-sort.ftl (SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
  • pe11-covariates-after.ftl (GatkCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkCovariates.gwendia)
  • pe12-analyze-covariates.ftl (GatkAnalyzeCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkAnalyzeCovariates.gwendia)
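
All gvnl components in the list above share one location pattern: the component name followed by ".gwendia" inside the gvnl/Workflows directory. The sketch below only rebuilds those locations as strings; the helper name and the dictionary are illustrative and not part of the pipeline. Note that FastqToFastQC is stored under quality/Workflow and therefore does not follow this pattern.

```python
# Directory that holds the gvnl workflow descriptions (taken from the list above).
GVNL_WORKFLOW_DIR = (
    "lfn://lfc.grid.sara.nl:5010/grid/vlemed/"
    "AMC-e-BioScience/Sequence_WF/gvnl/Workflows"
)


def gvnl_workflow_lfn(component):
    """Return the LFN of the workflow description for a gvnl grid component."""
    return f"{GVNL_WORKFLOW_DIR}/{component}.gwendia"


# Template -> grid component, copied from a few entries of the list above.
TEMPLATE_TO_COMPONENT = {
    "pe04-sam-sort.ftl": "SamSort",
    "pe05-mark-duplicates.ftl": "MarkDuplicates",
    "pe06-realign.ftl": "ReAlign",
}

for template, component in TEMPLATE_TO_COMPONENT.items():
    print(template, "->", gvnl_workflow_lfn(component))
```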

Merge bam per sample and perform SNP and indel calling

  • vc00a-unified-genotyper.ftl to do
  • vc00b-variant-filtration.ftl to do
  • vc00c-variant-eval.ftl to do
  • vc00d-picardMetrics.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
  • vc00-merge.ftl to do
  • vc00.merge.ftl to do
  • vc01-coverage.ftl to do
  • vc01.unified_genotyper.ftl to do
  • vc02.picardQC.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
  • vc02-realigner-target-creator.ftl to do
  • vc03.coverage.ftl to do
  • vc03-realign.ftl (ReAlign, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/ReAlign.gwendia)
  • vc04-fixmates.ftl (FixMates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/FixMates.gwendia)
  • vc05-indel-genotyper-v2.ftl to do
  • vc06-filter-indels.ftl to do
  • vc07-unified-genotyper.ftl to do
  • vc08-make-indel-mask.ftl to do
  • vc09-variant-filtration.ftl to do
  • vc10-variant-eval.ftl to do
  • vc11-name-sort-bam.ftl (SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
  • Pindel (Pindel, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/pindel/Workflows/Pindel.gwendia)

Data access rights

To restrict data access to the smallest possible group of people, we have created a subgroup "gvnl" within the "vlemed" Virtual Organisation (VO). To join this subgroup you need a Grid certificate and membership of the "vlemed" VO. The following page explains how to get a certificate and how to join the "vlemed" VO: http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/EBioInfra#Access

For more information about data access see http://www.bioinformaticslaboratory.nl/twiki/bin/view/EBioScience/DataManagement

Things to address

  • Available disk space on the grid storage elements / worker nodes

Data location on grid

Data

The data is located on the storage element at Sara and is readable and writable only by the vlemed/gvnl group. This screencast demonstrates how to access the data from the Vbrowser: http://www.youtube.com/watch?v=FicwWGAbubQ

Storage location (resource): srm.grid.sara.nl

Path: /pnfs/grid.sara.nl/data/vlemed/gvnl

Workflows and databases

These directories are open to all members of the vlemed VO

  • Workflows: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF
  • Databases: lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_DB

The directories that contain the workflows have the following structure:

Directory | Description
bin | dependent binaries such as bwa and samtools
GasW | component description; describes which executable has to run on the grid and specifies the input and output files/parameters
Java | dependent jar files, e.g. the GATK jar
parameterFiles | text files that contain exactly one line with the parameters that you would like to provide to bwa or another component
Scufl | old workflow description files, can be ignored
shFiles | these files are executed on the grid and are described by the GASW descriptor; the gvnl shFiles are based on the Groningen templates
Workflows | workflow descriptions; clicking on one starts the Moteur plugin. Input files/parameters can be specified in the fields. Clicking "Run" submits the jobs to the grid
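
Because a parameter file must contain exactly one line, it can be worth checking a file before it is used in a workflow. The following is a minimal sketch under a few assumptions: the function name, the file name and the example parameter string are hypothetical, and blank lines (such as a trailing newline) are tolerated here, which the grid components may or may not accept.

```python
import tempfile
from pathlib import Path


def read_parameter_file(path):
    """Return the single parameter line stored in a parameter file.

    Blank lines are ignored; a ValueError is raised unless exactly one
    non-empty line remains.
    """
    lines = [line.strip() for line in Path(path).read_text().splitlines()]
    lines = [line for line in lines if line]
    if len(lines) != 1:
        raise ValueError(f"{path}: expected exactly one line, found {len(lines)}")
    return lines[0]


# Demonstration with a throw-away file; the content is a made-up example.
with tempfile.TemporaryDirectory() as tmp:
    example = Path(tmp) / "bwa_params.txt"
    example.write_text("-n 2 -l 32\n")
    params = read_parameter_file(example)
    print(params.split())
```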

The shell files can (in most cases) run on any Linux cluster. To do so, place the shell file and the executable(s) it depends on in one directory. Each shell file starts with an example of how to run it.
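
Collecting a shell file and its executables in one run directory could be scripted along the following lines. This is only a sketch: the function name and the file names "bwa.sh" and "bwa" are placeholders, and the dummy files created in the demonstration stand in for files downloaded from the shFiles and bin directories. How to invoke the shell file itself is documented in the file's own header.

```python
import shutil
import stat
import tempfile
from pathlib import Path


def stage_component(shell_file, executables, run_dir):
    """Copy a shell file and the executables it depends on into run_dir."""
    run_dir = Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for source in [shell_file, *executables]:
        target = run_dir / Path(source).name
        shutil.copy2(source, target)
        # Make sure the copy is executable by its owner.
        target.chmod(target.stat().st_mode | stat.S_IXUSR)
        staged.append(target)
    return staged


# Demonstration with dummy files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "download"
    source_dir.mkdir()
    for name in ("bwa.sh", "bwa"):
        (source_dir / name).write_text("#!/bin/sh\n")
    staged = stage_component(
        source_dir / "bwa.sh", [source_dir / "bwa"], Path(tmp) / "run"
    )
    staged_names = sorted(path.name for path in staged)
    print(staged_names)
```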

Workflow execution / progress data analysis

First alignment step

# | Sample | Workflow | Status | Start (dd-mm-yyyy hh:mm)
1 | A4a F F F | workflow-425d9ceb | done |
2 | Vartest | workflow-490b15f8 | done |
3 | Iteration test | workflow-bf48aff1 | failed |
4 | Iteration test | workflow-923c6588 | done |
5 | 60-samples-batch (15 lanes) | workflow-d80b5767 | 10/15 done | 11-02-2011 19:30
6 | 60-samples-batch A (55 lanes) | workflow-cbaca6e5 | 15/55 done | 12-02-2011 13:55
7 | 60-samples-batch A remaining 1 (17 lanes) | workflow-835250b1 | failed (grid very busy) | 07-03-2011 17:45
8 | 60-samples-batch A remaining 1 (17 lanes) | workflow-31eb952d | 1/17 done | 08-03-2011 14:15
9 | 60-samples-batch A remaining (27 lanes) | workflow-fd98c7c8 | 11/27 done | 15-03-2011 10:45
10 | 60-samples-batch G (27/54 lanes) | workflow-a781209c | 3/27 done | 15-03-2011 20:37
11 | second-batch R10-11-12 (27 lanes) | workflow-6fe1cb10 | 6/27 done | 16-03-2011 10:57
12 | second-batch R13-14-15 (27 lanes) | workflow-77b82882 | 3/27 done | 18-03-2011 19:07
13 | second-batch R16-17 (24 lanes) | workflow-18ec0160 | 7/24 done | 19-03-2011 11:00
14 | second-batch R18 (11 lanes) | workflow-7596a676 | 2/11 done | 19-03-2011 18:33
15 | second-batch R19-20-21 (31 lanes) | workflow-58ea16f1 | 4 done, scheduled downtime, rest will fail | 20-03-2011 11:50
16 | second-batch R22-8-9 (altered submission scheme: submits 1 job/5 min) | workflow-ea6b36a6 | 4 done, scheduled downtime, rest will fail | 20-03-2011 12:37
17 | second-batch R10-R17 A9-23 (206 lanes) | workflow-7d2b2b5b | 5 done, scheduled downtime, rest will fail | 21-03-2011 11:59
18 | second-batch (244 lanes) | workflow-c902ebf3 | This WF is cancelled, because we only want to run these jobs on Gina and HTC | 23-03-2011 09:48
19 | second-batch (237 lanes) | workflow-62e0f254 | 135 done | 25-03-2011 12:20
20 | second-batch (102 lanes) | workflow-b00af623 | 33 done | 31-03-2011 08:21
21 | second-batch (69 lanes) | workflow-d5709592 | 4 done | 04-04-2011 18:17
22 | second-batch (65 lanes) | workflow-7d8743c1 | 25 done | 06-04-2011 23:24
23 | second-batch (40 lanes) | workflow-d8cdf017 | 16 done | 15-04-2011 18:07
24 | second-batch (24 lanes) | workflow-177917d4 | 12 done | 18-04-2011 11:25
25 | second-batch (12 lanes) | workflow-289c831f | running | 20-04-2011 21:15
26 | second-batch (7 lanes) | workflow-96f414e3 | running | 22-04-2011
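
The lane counts in rows 19 to 25 are consistent with each run resubmitting the lanes that the previous run did not finish. This is read off the numbers in the table and is not stated explicitly on this page, so treat it as an interpretation. A short sketch of the bookkeeping, using only figures from the table:

```python
# (workflow ID, lanes submitted, lanes done), copied from rows 19-24 above.
RUNS = [
    ("workflow-62e0f254", 237, 135),
    ("workflow-b00af623", 102, 33),
    ("workflow-d5709592", 69, 4),
    ("workflow-7d8743c1", 65, 25),
    ("workflow-d8cdf017", 40, 16),
    ("workflow-177917d4", 24, 12),
]


def remaining_lanes(submitted, done):
    """Number of lanes that still have to be processed after a run."""
    return submitted - done


remaining = [remaining_lanes(submitted, done) for _, submitted, done in RUNS]

for (workflow, submitted, done), left in zip(RUNS, remaining):
    print(f"{workflow}: {done}/{submitted} done, {left} lanes left")

# Each run's leftover equals the number of lanes submitted in the next run.
next_submitted = [submitted for _, submitted, _ in RUNS[1:]]
print(remaining[:-1] == next_submitted)
```

The leftover of the last run in the list (12 lanes) matches the size of the run in row 25.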

FastQC analysis

# | Sample | Workflow | Status | Start (dd-mm-yyyy hh:mm)
1 | second-batch (2x295 lanes = 590 fastq files) | workflow-417415cc | 396 done | 23-04-2011 02:44
2 | second-batch (194 fastq files) | workflow-6ded80d3 | 189 done | 23-04-2011 11:40
3 | second-batch (5 fastq files) | workflow-2bb76072 | running | 23-04-2011 14:40

Monitor clusters

Alternatives

Clusters

  • Groningen
    • Description here about code template and automatic PBS script generation. Job submission/monitoring
  • Leiden
  • Huygens
  • Lisa
  • Philips
  • DAS

Grid
