Changes between Version 114 and Version 115 of BigCompute


Timestamp:
May 9, 2011 8:38:37 AM (14 years ago)
Author:
Barbera van Schaik

  • BigCompute

    v114 v115  
    19 19 * BigComputeAnalysisProgress
    20 20 * BigComputeReports
    21 
    22 === Implemented workflow components ===
    23 
    24 The workflow components listed below are already available. We can expand the list with Pindel and (parts of) the GATK pipeline.
    25 
    26 * Splitting of fastq files
    27 * Building a BWA index on the genome sequence (base space and color space)
    28 * BWA for shotgun reads (base space and color space). Parameter sweeps are possible. Output is in bam format.
    29 * Merge bam results
    30 * Samtools pileup
    31 * Varscan (pileup to snp, indel and cns)
    32 * Bam2coverage creates a UCSC wiggle file to display the genome coverage (per 50 kbp)
    33 * Coverage-per-base determines the coverage for every base in the genome and summarizes the results (coverage versus frequency)
    34 * Annovar (works for hg18; support for other assemblies is in progress). This is a pipeline to annotate variants (gene, dbsnp, hapmap, 1000g, conservation, etc.)
    35 * FastqC
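
The two coverage components in the list above (Bam2coverage and Coverage-per-base) can be illustrated with a toy sketch. This is not the code of the actual workflow components: the function names are hypothetical, the input is assumed to be a plain list of per-base depths, and averaging per window is an assumption about how the 50 kbp wiggle values are derived.

```python
from collections import Counter


def coverage_histogram(per_base_coverage):
    """Coverage versus frequency: how many bases have each depth."""
    return Counter(per_base_coverage)


def window_means(per_base_coverage, window=50_000):
    """Mean coverage per fixed-size window (the last window may be shorter)."""
    means = []
    for start in range(0, len(per_base_coverage), window):
        chunk = per_base_coverage[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


# Tiny made-up example: 8 bases, summarized in windows of 4 bases.
depths = [0, 0, 3, 3, 3, 5, 7, 7]
hist = coverage_histogram(depths)
for depth in sorted(hist):
    print(depth, hist[depth])
print(window_means(depths, window=4))
```

For a real genome the per-base depths would be streamed per chromosome rather than held in one list; the sketch only shows the two summaries.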
    36 
    37 === Implemented components of the Groningen pipeline ===
    38 
    39 Template (grid component)
    40 
    41 ==== Alignment, realignment, recalibration, stats ====
    42 * pe0--fastqc.ftl (FastqToFastQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/quality/Workflow/FastqToFastQC.gwendia)
    43 * pe00-bwa-align-pair1.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
    44 * pe01-bwa-align-pair2.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
    45 * pe02-bwa-sampe.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
    46 * pe03-sam-to-bam.ftl (!BwaIllumina, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/BwaIllumina.gwendia)
    47 * pe04a-!HsMetrics.ftl (!CalculateHsMetrics, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/CalculateHsMetrics.gwendia)
    48 * pe04b-picardQC.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
    49 * pe04-sam-sort.ftl (!SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
    50 * pe05-mark-duplicates.ftl (!MarkDuplicates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/MarkDuplicates.gwendia)
    51 * pe06-realign.ftl (!ReAlign, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/ReAlign.gwendia)
    52 * pe07-fixmates.ftl (!FixMates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/FixMates.gwendia)
    53 * pe08-covariates-before.ftl (!GatkCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkCovariates.gwendia)
    54 * pe09-recalibrate.ftl (!GatkRecalibrate, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkRecalibrate.gwendia)
    55 * pe10-sam-sort.ftl (!SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
    56 * pe11-covariates-after.ftl (!GatkCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkCovariates.gwendia)
    57 * pe12-analyze-covariates.ftl (!GatkAnalyzeCovariates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/GatkAnalyzeCovariates.gwendia)
    58 
    59 ==== Merge bam per sample and perform SNP and indel calling ====
    60 * vc00a-unified-genotyper.ftl '''to do'''
    61 * vc00b-variant-filtration.ftl '''to do'''
    62 * vc00c-variant-eval.ftl '''to do'''
    63 * vc00d-picardMetrics.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
    64 * vc00-merge.ftl '''to do'''
    65 * vc00.merge.ftl '''to do'''
    66 * vc01-coverage.ftl '''to do'''
    67 * vc01.unified_genotyper.ftl '''to do'''
    68 * vc02.picardQC.ftl (PicardQC, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/PicardQC.gwendia)
    69 * vc02-realigner-target-creator.ftl '''to do'''
    70 * vc03.coverage.ftl '''to do'''
    71 * vc03-realign.ftl (!ReAlign, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/ReAlign.gwendia)
    72 * vc04-fixmates.ftl (!FixMates, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/FixMates.gwendia)
    73 * vc05-indel-genotyper-v2.ftl '''to do'''
    74 * vc06-filter-indels.ftl '''to do'''
    75 * vc07-unified-genotyper.ftl '''to do'''
    76 * vc08-make-indel-mask.ftl '''to do'''
    77 * vc09-variant-filtration.ftl '''to do'''
    78 * vc10-variant-eval.ftl '''to do'''
    79 * vc11-name-sort-bam.ftl (!SamSort, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/gvnl/Workflows/SamSort.gwendia)
    80 * Pindel (Pindel, lfn://lfc.grid.sara.nl:5010/grid/vlemed/AMC-e-BioScience/Sequence_WF/pindel/Workflows/Pindel.gwendia)
    81 21
    82 22 === Data access rights ===
     
    122 62 The shell files can (in most cases) run on any Linux cluster. To do so, place the shell file and the executable(s) it depends on in one directory. Each shell file starts with an example of how to run it.
    123 63
    124 == Workflow execution / progress data analysis ==
    125 
    126 '''First alignment step - running'''
    127 || '''#''' || '''Sample''' || '''WF''' || '''Status''' || '''Start''' ||
    128 || 1 || A4a || [http://orange.ebioscience.amc.nl/workflows/workflow-693426f3/html/workflow-693426f3.html F] [http://orange.ebioscience.amc.nl/workflows/workflow-35a9b777/html/workflow-35a9b777.html F] [http://orange.ebioscience.amc.nl/workflows/workflow-4ba3f651/html/workflow-4ba3f651.html F] [http://orange.ebioscience.amc.nl/workflows/workflow-425d9ceb/html/workflow-425d9ceb.html workflow-425d9ceb] || done ||
    129 || 2 || Vartest || [http://orange.ebioscience.amc.nl/workflows/workflow-490b15f8/html/workflow-490b15f8.html workflow-490b15f8] || done ||
    130 || 3 || Iteration test || [http://orange.ebioscience.amc.nl/workflows/workflow-bf48aff1/html/workflow-bf48aff1.html workflow-bf48aff1] || failed ||
    131 || 4 || Iteration test || [http://orange.ebioscience.amc.nl/workflows/workflow-923c6588/html/workflow-923c6588.html workflow-923c6588] || done ||
    132 || 5 || 60-samples-batch (15 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-d80b5767/html/workflow-d80b5767.html workflow-d80b5767] || 10 / 15 done || 11-02-2011 19:30 ||
    133 || 6 || 60-samples-batch A (55 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-cbaca6e5/html/workflow-cbaca6e5.html workflow-cbaca6e5] || 15 / 55 done || 12-02-2011 13:55 ||
    134 || 7 || 60-samples-batch A remaining 1 (17 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-835250b1/html/workflow-835250b1.html workflow-835250b1] || failed (grid very busy) || 07-03-2011 17:45 ||
    135 || 8 || 60-samples-batch A remaining 1 (17 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-31eb952d/html/workflow-31eb952d.html workflow-31eb952d] || 1/17 done || 08-03-2011 14:15 ||
    136 || 9 || 60-samples-batch A remaining (27 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-fd98c7c8/html/workflow-fd98c7c8.html workflow-fd98c7c8] || 11/27 done || 15-03-2011 10:45 ||
    137 || 10 || 60-samples-batch G (27/54 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-a781209c/html/workflow-a781209c.html workflow-a781209c] || 3/27 done || 15-03-2011 20:37 ||
    138 || 11 || second-batch R10-11-12 (27 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-6fe1cb10/html/workflow-6fe1cb10.html workflow-6fe1cb10] || 6/27 done || 16-03-2011 10:57 ||
    139 || 12 || second-batch R13-14-15 (27 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-77b82882/html/workflow-77b82882.html workflow-77b82882] || 3/27 done || 18-03-2011 19:07 ||
    140 || 13 || second-batch R16-17 (24 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-18ec0160/html/workflow-18ec0160.html workflow-18ec0160] || 7/24 done || 19-03-2011 11:00 ||
    141 || 14 || second-batch R18 (11 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-7596a676/html/workflow-7596a676.html workflow-7596a676] || 2/11 done || 19-03-2011 18:33 ||
    142 || 15 || second-batch R19-20-21 (31 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-58ea16f1/html/workflow-58ea16f1.html workflow-58ea16f1] || 4 done, scheduled downtime, rest will fail || 20-03-2011 11:50 ||
    143 || 16 || second-batch R22-8-9 (altered submission scheme: submits 1 job/5 min) || [http://orange.ebioscience.amc.nl/workflows/workflow-ea6b36a6/html/workflow-ea6b36a6.html workflow-ea6b36a6] || 4 done, scheduled downtime, rest will fail || 20-03-2011 12:37 ||
    144 || 17 || second-batch R10-R17 A9-23 (206 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-7d2b2b5b/html/workflow-7d2b2b5b.html workflow-7d2b2b5b] || 5 done, scheduled downtime, rest will fail || 21-03-2011 11:59 ||
    145 || 18 || second-batch (244 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-c902ebf3/html/workflow-c902ebf3.html workflow-c902ebf3] || This WF is cancelled, because we only want to run these jobs on Gina and HTC || 23-03-2011 09:48 ||
    146 || 19 || second-batch (237 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-62e0f254/html/workflow-62e0f254.html workflow-62e0f254] || 135 done || 25-03-2011 12:20 ||
    147 || 20 || second-batch (102 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-b00af623/html/workflow-b00af623.html workflow-b00af623] || 33 done || 31-03-2011 08:21 ||
    148 || 21 || second-batch (69 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-d5709592/html/workflow-d5709592.html workflow-d5709592] || 4 done || 04-04-2011 18:17 ||
    149 || 22 || second-batch (65 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-7d8743c1/html/workflow-7d8743c1.html workflow-7d8743c1] || 25 done || 06-04-2011 23:24 ||
    150 || 23 || second-batch (40 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-d8cdf017/html/workflow-d8cdf017.html workflow-d8cdf017] || 16 done || 15-04-2011 18:07 ||
    151 || 24 || second-batch (24 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-177917d4/html/workflow-177917d4.html workflow-177917d4] || 12 done || 18-04-2011 11:25 ||
    152 || 25 || second-batch (12 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-289c831f/html/workflow-289c831f.html workflow-289c831f] || 5 done || 20-04-2011 21:15 ||
    153 || 26 || second-batch (7 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-96f414e3/html/workflow-96f414e3.html workflow-96f414e3] || running || 22-04-2011 ||
    154 || 27 || second-batch (3 lanes) || [http://orange.ebioscience.amc.nl/workflows/workflow-63ecec34/html/workflow-63ecec34.html workflow-63ecec34] || running || 27-04-2011 15:22 ||
    155 
    156 '''Fastqc analysis - done'''
    157 || '''#''' || '''Sample''' || '''WF''' || '''Status''' || '''Start''' ||
    158 || 1 || second-batch (2x295 lanes = 590 fastq files) || [http://orange.ebioscience.amc.nl/workflows/workflow-417415cc/html/workflow-417415cc.html workflow-417415cc] || 396 done || 23-04-2011 02:44 ||
    159 || 2 || second-batch (194 fastq files) || [http://orange.ebioscience.amc.nl/workflows/workflow-6ded80d3/html/workflow-6ded80d3.html workflow-6ded80d3] || 189 done || 23-04-2011 11:40 ||
    160 || 3 || second-batch (5 fastq files) || [http://orange.ebioscience.amc.nl/workflows/workflow-2bb76072/html/workflow-2bb76072.html workflow-2bb76072] || 5 done || 23-04-2011 14:40 ||
    161 
    162 RESULTS: [http://www.bbmriwiki.nl/attachment/wiki/BigCompute/log-fastqc20110423.xls log-fastqc20110423.xls], which contains the run time and disk usage on the compute nodes and the number of sequences per lane
    163 
    164 THROUGHPUT: workflow run time was 12 hrs, total CPU run time was 12 days (speedup of ~24x for this component)
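
The speedup figure follows from dividing the total CPU time by the workflow's wall-clock time. A minimal sketch of that arithmetic, using the two numbers quoted above (the function name is illustrative):

```python
def speedup(total_cpu_hours, wall_clock_hours):
    """Ratio of summed CPU time to elapsed workflow time."""
    return total_cpu_hours / wall_clock_hours


cpu_hours = 12 * 24   # 12 days of total CPU run time, in hours
wall_hours = 12       # workflow run time in hours
print(speedup(cpu_hours, wall_hours))
```

With 288 CPU hours over 12 elapsed hours this gives 24, which matches the ~24x stated for this component.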
    165 
    166 
    167 '''Mark-duplicates analysis on all files that are aligned so far'''
    168 || '''#''' || '''Sample''' || '''WF''' || '''Status''' || '''Start''' ||
    169 || 1 || 351 lanes || [http://orange.ebioscience.amc.nl/workflows/workflow-a5a7a078/html/workflow-a5a7a078.html workflow-a5a7a078] || 212 done || 04-05-2011 12:54 ||
    170 || 2 || 139 lanes || [http://orange.ebioscience.amc.nl/workflows/workflow-87b3994d/html/workflow-87b3994d.html workflow-87b3994d] || 115 done || 05-05-2011 15:29 ||
    171 || 3 || 24 lanes || [http://orange.ebioscience.amc.nl/workflows/workflow-7e395331/html/workflow-7e395331.html workflow-7e395331] || failed || 06-05-2011 08:22 ||
    172 || 4 || 24 lanes || [http://orange.ebioscience.amc.nl/workflows/workflow-6207c67d/html/workflow-6207c67d.html workflow-6207c67d] || 19 done || 06-05-2011 12:55 ||
    173 || 5 || 5 lanes || [http://orange.ebioscience.amc.nl/workflows/workflow-62e2653e/html/workflow-62e2653e.html workflow-62e2653e] || running || 07-05-2011 14:08 ||
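
The lane counts in this table are consistent with each workflow being submitted for the lanes left over from the previous one. The sketch below replays that bookkeeping; it assumes the failed run (#3) completed 0 lanes, which the table does not state explicitly.

```python
def remaining_after(total, done_per_run):
    """Lanes still to process after each successive run."""
    remaining = total
    trace = []
    for done in done_per_run:
        remaining -= done
        trace.append(remaining)
    return trace


# "done" counts for runs 1-4 from the table; run 3 failed (assumed 0 done).
done_per_run = [212, 115, 0, 19]
print(remaining_after(351, done_per_run))
```

The resulting values (139, 24, 24, 5) equal the lane counts of runs 2 to 5.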
    174 
    175 '''Monitor clusters'''
    176 * [http://ganglia.sara.nl/?m=load_one&r=week&s=descending&c=LifeScience+Grid&h=&sh=1&hc=4&z=small Ganglia - LifeScience grid]
    177 * [http://ganglia.sara.nl/?m=load_one&r=week&s=descending&c=GINA+Cluster&h=&sh=1&hc=4&z=small Ganglia - Gina cluster]
    178 
    179 64 == Alternatives ==
    180 65