SPRINT - PowerPoint PPT Presentation

About This Presentation
Title: SPRINT
Description: SPRINT: A Simple Parallel R INTerface

Slides: 13
Provided by: spet151

Transcript and Presenter's Notes


1
SPRINT
  • A Simple Parallel R INTerface

2
Overview
  • What is SPRINT
  • How is SPRINT different from other parallel R
    packages
  • Biological example: Post-genomic data analysis
  • Code comparison

3
SPRINT
Simple Parallel R INTerface (www.r-sprint.org)
SPRINT: A new parallel framework for R, J Hill et al, BMC Bioinformatics, Dec 2008.

4
Issues of existing parallel R packages
  • Difficult to program
  • Require scientist to also be a parallel
    programmer!
  • Require substantial changes to existing scripts
  • Can't be used to solve some problems
  • No data dependencies allowed

5
Biological example
  • Data: A matrix of expression measurements with
    genes in rows and samples in columns

6
Biological example
  • Problem: Using all or many genes will either crash
    or be very slow (R memory allocation limits,
    number of computations)

Data limitations (correlations)
Input array dimensions  Input array size  Final array size in memory
11,000 x 320            26.85 MB          923.15 MB (0.9 GB)
22,000 x 320            53.7 MB           3,692.62 MB (3.6 GB)
35,000 x 320            85.44 MB          9,346 MB (9.12 GB)
45,000 x 320            109.86 MB         15,449.52 MB (15.08 GB)

Work load limitations (permutations)
Input array dimensions  Permutation count  Estimated total run time
36,612 x 76             500,000            20,750 seconds (6 hours)
36,612 x 76             1,000,000          41,500 seconds (12 hours)
73,224 x 76             500,000            35,000 seconds (10 hours)
73,224 x 76             1,000,000          70,000 seconds (20 hours)
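The sizes in the correlation table are consistent with every value being stored as an 8-byte double and with the final array being a genes x genes correlation matrix. The slides do not state this, so treat it as an assumption; it does reproduce the listed figures. A minimal R sketch of the arithmetic (the helper name `matrix_mib` is hypothetical, and "MB" is taken to mean 1024^2 bytes):

```r
# Size of a dense numeric matrix, assuming 8 bytes per value,
# expressed in units of 1024^2 bytes (the "MB" used on the slide).
matrix_mib <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 1024^2
}

genes   <- c(11000, 22000, 35000, 45000)
samples <- 320

sizes <- data.frame(
  genes      = genes,
  input_mib  = matrix_mib(genes, samples),  # genes x samples input array
  output_mib = matrix_mib(genes, genes)     # assumed genes x genes result
)
sizes$output_gib <- sizes$output_mib / 1024

print(round(sizes, 2))
```

The input grows linearly with the number of genes while the result grows quadratically, which is why doubling the gene count roughly quadruples the memory needed for the output.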
7
Workarounds and solution
  • Workaround
  • Remove as many genes as possible before applying
    the algorithm. This can be an arbitrary process
    and can remove relevant data.
  • Perform multiple executions and post-process the
    data. This can become a very painful procedure.
  • Solution: Parallelisation of R code can be made
    accessible to bioinformaticians/statisticians.
    A library of solutions coded once by experts,
    then easy end-point use by all.

[Diagram labels: Big Post Genomic Data, SPRINT, HPC, R, Biological Results]

8
Benchmarks (256 processes)
Data limitations (correlations)
Input array dimensions  Input array size  Final array size in memory  Total run time in serial (s)                  Total run time in parallel (s)
11,000 x 320            26.85 MB          923.15 MB (0.9 GB)          63.18                                         4.76
22,000 x 320            53.7 MB           3,692.62 MB (3.6 GB)        Error: cannot allocate vector of size 3.6 Gb  13.87
35,000 x 320            85.44 MB          9,346 MB (9.12 GB)          CRASHED                                       36.64
45,000 x 320            109.86 MB         15,449.52 MB (15.08 GB)     CRASHED                                       42.18

Work load limitations (permutations)
Input array dimensions  Permutation count  Estimated total run time in serial  Total run time in parallel (s)
36,612 x 76             500,000            20,750 seconds (6 hours)            73.18
36,612 x 76             1,000,000          41,500 seconds (12 hours)           146.64
73,224 x 76             500,000            35,000 seconds (10 hours)           148.46
73,224 x 76             1,000,000          70,000 seconds (20 hours)           294.61
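The benchmark figures can be turned into speedup (serial time divided by parallel time) and parallel efficiency (speedup divided by the number of processes). The sketch below uses only the rows that have a serial time. The serial permutation times are estimates on the slide, so the resulting ratios are rough indications rather than measurements; the column names are chosen here for illustration.

```r
processes <- 256  # from the slide title

runs <- data.frame(
  case = c("correlation 11,000 x 320",
           "permutation 36,612 x 76, 500,000",
           "permutation 36,612 x 76, 1,000,000",
           "permutation 73,224 x 76, 500,000",
           "permutation 73,224 x 76, 1,000,000"),
  serial_s           = c(63.18, 20750, 41500, 35000, 70000),
  parallel_s         = c(4.76, 73.18, 146.64, 148.46, 294.61),
  serial_is_estimate = c(FALSE, TRUE, TRUE, TRUE, TRUE)
)

# Speedup: how many times faster the parallel run is.
runs$speedup <- runs$serial_s / runs$parallel_s
# Efficiency: fraction of the ideal speedup for the process count.
runs$efficiency <- runs$speedup / processes

print(runs)
```

The larger correlation cases have no serial time to compare against, because the serial runs failed; there the benefit is that the computation completes at all.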
9
Correlation code comparison
# Serial version
edata <- read.table("largedata.dat")
pearsonpairwise <- cor(edata)
write.table(pearsonpairwise, "Correlations.txt")
quit(save = "no")

# Parallel version using SPRINT
library("sprint")
edata <- read.table("largedata.dat")
ff_handle <- pcor(edata)
pterminate()
quit(save = "no")
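One reason a correlation matrix lends itself to parallel execution is that it can be computed in independent row blocks. The base R sketch below illustrates that idea on a tiny random matrix. It is not SPRINT's implementation, the block layout is an arbitrary choice for the example, and the blocks are processed one after another here rather than on separate processes.

```r
set.seed(42)
x <- matrix(rnorm(6 * 10), nrow = 6)  # 6 genes (rows) x 10 samples (columns)

# cor() correlates columns, so transpose to get gene-by-gene correlations.
full <- cor(t(x))

# Split the genes into two blocks of row indices.
blocks <- split(seq_len(nrow(x)), rep(1:2, each = 3))

# Each block needs only its own rows plus the full data set,
# so blocks could in principle be handled by different workers.
parts <- lapply(blocks, function(idx) {
  cor(t(x[idx, , drop = FALSE]), t(x))
})

# Stack the block results back into one matrix.
blockwise <- do.call(rbind, unname(parts))

print(isTRUE(all.equal(full, blockwise)))
```

Each block produces a slice of the full result, so the peak memory per worker is a fraction of the complete matrix.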
10
Permutation testing code comparison
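# Serial version
library("multtest")  # provides mt.maxT and the golub data set
data(golub)
smallgd <- golub[1:100, ]
classlabel <- golub.cl
resT <- mt.maxT(smallgd, classlabel, test = "t", side = "abs")
quit(save = "no")

# Parallel version using SPRINT
library("sprint")
library("multtest")  # provides the golub data set
data(golub)
smallgd <- golub[1:100, ]
classlabel <- golub.cl
resT <- pmaxT(smallgd, classlabel, test = "t", side = "abs")
pterminate()
quit(save = "no")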
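To show why the run time grows with both the number of genes and the permutation count, here is a simplified base R sketch of a permutation test with a max-statistic adjustment. It is a single-step variant written for illustration on random data, not the algorithm used by `mt.maxT` or `pmaxT`; the function name `abs_t` and all sizes are assumptions made for the example.

```r
set.seed(1)
n_genes   <- 50
n_samples <- 20
x      <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)
labels <- rep(c(0, 1), each = n_samples / 2)

# Absolute two-sample t-statistic (unequal variances) for every gene.
abs_t <- function(x, labels) {
  a  <- x[, labels == 0, drop = FALSE]
  b  <- x[, labels == 1, drop = FALSE]
  se <- sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  abs((rowMeans(a) - rowMeans(b)) / se)
}

observed <- abs_t(x, labels)

# Each permutation shuffles the class labels and recomputes all statistics,
# so the cost is roughly (number of permutations) x (number of genes).
B <- 200
max_null <- replicate(B, max(abs_t(x, sample(labels))))

# Adjusted p-value: share of permutations whose maximum statistic
# is at least as large as the observed statistic for that gene.
adj_p <- sapply(observed, function(t_obs) mean(max_null >= t_obs))

print(head(round(adj_p, 3)))
```

The permutations are independent of one another, which is what makes this workload a natural candidate for distribution across processes.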
11
SPRINT
  • Website: http://www.r-sprint.org/
  • Source code can be downloaded from the website
  • Soon also in the CRAN repository
  • Mailing list: sprint@lists.ed.ac.uk
  • Contact email: sprint@ed.ac.uk

12
Acknowledgements
  • EPCC Team
  • Terry Sloan
  • Michal Piotrowski
  • Savvas Petrou
  • Bartek Dobrzelecki
  • Jon Hill
  • Florian Scharinger
  • DPM Team
  • Peter Ghazal
  • Thorsten Forster
  • Muriel Mewissen

This work is supported by the Wellcome Trust and
the NAG dCSE Support service.
13
SPRINT - Demo
  • Executing the same code in serial and in parallel.

R --vanilla --slave -f maxT_serial.R
mpiexec -n 2 R --vanilla --slave -f maxT_parallel.R