Title: SPRINT
1. SPRINT
- A Simple Parallel R INTerface
2. Overview
- What is SPRINT?
- How is SPRINT different from other parallel R packages?
- Biological example: post-genomic data analysis
- Code comparison
3. SPRINT
Simple Parallel R INTerface (www.r-sprint.org)
"SPRINT: A new parallel framework for R", J Hill et al., BMC Bioinformatics, Dec 2008.
4. Issues of existing parallel R packages
- Difficult to program
- Require the scientist to also be a parallel programmer!
- Require substantial changes to existing scripts
- Can't be used to solve some problems
- No data dependencies allowed
5. Biological example
- Data: a matrix of expression measurements with genes in rows and samples in columns
6. Biological example
- Problem: using all or many genes will either crash R or be very slow (R memory allocation limits, number of computations)
Data limitations (correlations)
Input array dimensions | Input array size | Final array size in memory
11,000 x 320 | 26.85 MB | 923.15 MB (0.9 GB)
22,000 x 320 | 53.7 MB | 3,692.62 MB (3.6 GB)
35,000 x 320 | 85.44 MB | 9,346 MB (9.12 GB)
45,000 x 320 | 109.86 MB | 15,449.52 MB (15.08 GB)

Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time
36,612 x 76 | 500,000 | 20,750 seconds (about 6 hours)
36,612 x 76 | 1,000,000 | 41,500 seconds (about 12 hours)
73,224 x 76 | 500,000 | 35,000 seconds (about 10 hours)
73,224 x 76 | 1,000,000 | 70,000 seconds (about 20 hours)
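
The sizes in the correlation table above are consistent with every value being stored as an 8-byte double and with sizes reported in binary megabytes (1 MB = 2^20 bytes). The following is a minimal sketch under that assumption; the helper name `matrix_mb` is hypothetical and not part of SPRINT or base R.

```r
# Size in MB (2^20 bytes) of a dense numeric matrix, assuming 8-byte doubles.
# matrix_mb is a hypothetical helper, not part of SPRINT or base R.
matrix_mb <- function(n_rows, n_cols, bytes_per_value = 8) {
  n_rows * n_cols * bytes_per_value / 2^20
}

genes   <- c(11000, 22000, 35000, 45000)
samples <- 320

sizes <- data.frame(
  genes     = genes,
  input_mb  = matrix_mb(genes, samples),  # genes x samples input array
  result_mb = matrix_mb(genes, genes)     # genes x genes correlation matrix
)
sizes$result_gb <- sizes$result_mb / 1024
print(sizes, digits = 6)
```

The input grows linearly with the number of genes, while the gene-by-gene result grows quadratically, which is why doubling the gene count roughly quadruples the final array size.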
7. Workarounds and solution
- Workaround
- Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and may remove relevant data.
- Perform multiple executions and post-process the data. This can become a very painful procedure.
- Solution: parallelisation of R code can be made accessible to bioinformaticians and statisticians. A library of solutions is coded once by experts and is then easy for everyone to use at the end point.
[Diagram labels: Big Post Genomic Data, SPRINT, HPC, R, Biological Results]
8. Benchmarks (256 processes)
Data limitations (correlations)
Input array dimensions | Input array size | Final array size in memory | Total run time in serial (seconds) | Total run time in parallel (seconds)
11,000 x 320 | 26.85 MB | 923.15 MB (0.9 GB) | 63.18 | 4.76
22,000 x 320 | 53.7 MB | 3,692.62 MB (3.6 GB) | Error: cannot allocate vector of size 3.6 Gb | 13.87
35,000 x 320 | 85.44 MB | 9,346 MB (9.12 GB) | CRASHED | 36.64
45,000 x 320 | 109.86 MB | 15,449.52 MB (15.08 GB) | CRASHED | 42.18
Work load limitations (permutations)
Input array dimensions | Permutation count | Estimated total run time in serial | Total run time in parallel (seconds)
36,612 x 76 | 500,000 | 20,750 seconds (about 6 hours) | 73.18
36,612 x 76 | 1,000,000 | 41,500 seconds (about 12 hours) | 146.64
73,224 x 76 | 500,000 | 35,000 seconds (about 10 hours) | 148.46
73,224 x 76 | 1,000,000 | 70,000 seconds (about 20 hours) | 294.61
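
The benchmark figures can be turned into speedup and parallel efficiency with a few lines of arithmetic. The sketch below uses only the rows that have a serial time: the first correlation row (measured) and the four permutation rows (where the serial time is an estimate). Speedup is taken as serial time divided by parallel time, and efficiency as speedup divided by the 256 processes.

```r
# Timings in seconds, copied from the benchmark tables.
# Row 1 is the 11,000 x 320 correlation run; rows 2-5 are the permutation runs,
# whose serial times are estimates rather than measurements.
serial    <- c(63.18, 20750, 41500, 35000, 70000)
parallel  <- c(4.76, 73.18, 146.64, 148.46, 294.61)
processes <- 256

speedup    <- serial / parallel
efficiency <- speedup / processes

print(round(data.frame(serial, parallel, speedup, efficiency), 2))
```

Because the serial permutation times are estimates, any efficiency above 1 in those rows should be read with caution rather than as evidence of superlinear scaling.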
9. Correlation code comparison
Serial:
    edata <- read.table("largedata.dat")
    pearsonpairwise <- cor(edata)
    write.table(pearsonpairwise, "Correlations.txt")
    quit(save="no")
Parallel (SPRINT):
    library("sprint")
    edata <- read.table("largedata.dat")
    ff_handle <- pcor(edata)
    pterminate()
    quit(save="no")
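
The slides do not describe how `pcor` distributes its work, so the following base R sketch only illustrates why pairwise correlation lends itself to parallelisation: each block of rows of the result can be computed independently of the others. The toy data are simulated, and the transpose assumes genes are stored in rows, as on the earlier biological example slide (base R's `cor` correlates columns).

```r
set.seed(1)
# Toy expression matrix: 6 genes (rows) x 10 samples (columns). Simulated data.
edata <- matrix(rnorm(60), nrow = 6,
                dimnames = list(paste0("gene", 1:6), paste0("s", 1:10)))

# cor() correlates columns, so transpose to get gene-by-gene correlations.
full <- cor(t(edata))

# The same result, computed one block of genes at a time. Each block needs
# only its own rows plus the full data, so blocks could run on separate workers.
blocks <- list(1:3, 4:6)
by_block <- do.call(rbind, lapply(blocks, function(rows) {
  cor(t(edata[rows, , drop = FALSE]), t(edata))
}))

# The block-wise result matches the single call.
stopifnot(isTRUE(all.equal(full, by_block)))
dim(full)
```

In this sketch the blocks are processed one after another with `lapply`; a parallel framework would hand each block to a different process and gather the pieces.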
10. Permutation testing code comparison
Serial:
    library("multtest")  # provides mt.maxT and the golub data set
    data(golub)
    smallgd <- golub[1:100, ]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")
Parallel (SPRINT):
    library("sprint")
    data(golub)
    smallgd <- golub[1:100, ]
    classlabel <- golub.cl
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()
    quit(save="no")
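
To show what a max-T permutation test computes, here is a simplified base R sketch on simulated data. It is a single-step version written for illustration only; it is not the step-down algorithm used by `mt.maxT` or `pmaxT`, and the helper `abs_t` and all data are hypothetical.

```r
set.seed(42)
# Simulated data: 20 genes x 12 samples, two classes of 6. Gene 1 is shifted.
x <- matrix(rnorm(20 * 12), nrow = 20)
classlabel <- rep(c(0, 1), each = 6)
x[1, classlabel == 1] <- x[1, classlabel == 1] + 3

# Absolute two-sample t-statistic (unequal variances) for every gene (row).
# abs_t is a hypothetical helper written for this sketch.
abs_t <- function(x, labels) {
  a <- x[, labels == 0, drop = FALSE]
  b <- x[, labels == 1, drop = FALSE]
  se <- sqrt(apply(a, 1, var) / ncol(a) + apply(b, 1, var) / ncol(b))
  abs((rowMeans(b) - rowMeans(a)) / se)
}

observed <- abs_t(x, classlabel)

# Each permutation shuffles the class labels and records the largest statistic
# over all genes. Permutations do not depend on one another.
B <- 1000
max_null <- replicate(B, max(abs_t(x, sample(classlabel))))

# Single-step maxT adjusted p-value: the share of permutations whose maximum
# statistic is at least as large as the gene's observed statistic.
adj_p <- sapply(observed, function(t_obs) mean(max_null >= t_obs))
head(round(adj_p, 3))
```

Since every permutation is independent, the `B` permutations can be divided among processes, which is consistent with the near-linear growth of run time with permutation count in the benchmark table.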
11. SPRINT
- Website: http://www.r-sprint.org/
- Source code can be downloaded from the website
- Soon also in the CRAN repository
- Mailing list: sprint_at_lists.ed.ac.uk
- Contact email: sprint_at_ed.ac.uk
12. Acknowledgements
- EPCC Team
- Terry Sloan
- Michal Piotrowski
- Savvas Petrou
- Bartek Dobrzelecki
- Jon Hill
- Florian Scharinger
- DPM Team
- Peter Ghazal
- Thorsten Forster
- Muriel Mewissen
This work is supported by the Wellcome Trust and
the NAG dCSE Support service.
13. SPRINT - Demo
- Executing the same code in serial and in parallel:
    R --vanilla --slave -f maxT_serial.R
    mpiexec -n 2 R --vanilla --slave -f maxT_parallel.R