SPRINT - PowerPoint PPT Presentation

About This Presentation

Title:

SPRINT

Description:

SPRINT A Simple Parallel R INTerface – PowerPoint PPT presentation

Slides: 13

Provided by: spet151

Transcript and Presenter's Notes

Title: SPRINT

1
SPRINT

A Simple Parallel R INTerface

2
Overview

What is SPRINT?
How is SPRINT different from other parallel R packages?
Biological example: Post-genomic data analysis
Code comparison

3
SPRINT
Simple Parallel R INTerface (www.r-sprint.org)
SPRINT: A new parallel framework for R, J Hill et al., BMC Bioinformatics, Dec 2008.
4
Issues of existing parallel R packages

Difficult to program
Require scientist to also be a parallel
programmer!
Require substantial changes to existing scripts
Can't be used to solve some problems
No data dependencies allowed

5
Biological example

Data: A matrix of expression measurements with genes in rows and samples in columns

6
Biological example

Problem: Using all or many genes will either crash R or be very slow (R memory allocation limits, number of computations)

Data limitations (correlations)

Input array dimensions   Input array size   Final array size in memory
11,000 x 320             26.85 MB           923.15 MB (0.9 GB)
22,000 x 320             53.7 MB            3,692.62 MB (3.6 GB)
35,000 x 320             85.44 MB           9,346 MB (9.12 GB)
45,000 x 320             109.86 MB          15,449.52 MB (15.08 GB)

Work load limitations (permutations)

Input array dimensions   Permutation count   Estimated total run time
36,612 x 76              500,000             20,750 seconds (about 6 hours)
36,612 x 76              1,000,000           41,500 seconds (about 12 hours)
73,224 x 76              500,000             35,000 seconds (about 10 hours)
73,224 x 76              1,000,000           70,000 seconds (about 20 hours)
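The final array sizes in the correlation table are consistent with a dense gene-by-gene matrix of 8-byte values. A minimal sketch of that arithmetic in R, assuming double-precision storage and 1 MB = 1024^2 bytes (both are assumptions, not stated on the slide; `mb` is a hypothetical helper):

```r
# Assumption: every value is an 8-byte double and 1 MB = 1024^2 bytes.
mb <- function(rows, cols) rows * cols * 8 / 1024^2

genes   <- c(11000, 22000, 35000, 45000)
samples <- 320

sizes <- data.frame(
  genes     = genes,
  input_mb  = mb(genes, samples),  # genes x samples input array
  output_mb = mb(genes, genes)     # genes x genes correlation matrix
)
print(sizes)
```

The output size grows with the square of the gene count, so doubling the number of genes roughly quadruples the memory needed for the result.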
7
Workarounds and solution

Workaround
Remove as many genes as possible before applying the algorithm. This can be an arbitrary process and may remove relevant data.
Perform multiple executions and post-process the data. This can become a very painful procedure.
Solution: Parallelisation of R code can be made accessible to bioinformaticians/statisticians. A library with solutions coded once by experts, then easy end-point use by all.

[Diagram labels: Big Post Genomic Data, SPRINT, HPC, R, Biological Results]
8
Benchmarks (256 processes)
Data limitations (correlations)

Input array dimensions   Input array size   Final array size in memory   Total run time in serial (seconds)            Total run time in parallel (seconds)
11,000 x 320             26.85 MB           923.15 MB (0.9 GB)           63.18                                         4.76
22,000 x 320             53.7 MB            3,692.62 MB (3.6 GB)         Error: cannot allocate vector of size 3.6 Gb  13.87
35,000 x 320             85.44 MB           9,346 MB (9.12 GB)           CRASHED                                       36.64
45,000 x 320             109.86 MB          15,449.52 MB (15.08 GB)      CRASHED                                       42.18

Work load limitations (permutations)

Input array dimensions   Permutation count   Estimated total run time in serial   Total run time in parallel (seconds)
36,612 x 76              500,000             20,750 seconds (about 6 hours)       73.18
36,612 x 76              1,000,000           41,500 seconds (about 12 hours)      146.64
73,224 x 76              500,000             35,000 seconds (about 10 hours)      148.46
73,224 x 76              1,000,000           70,000 seconds (about 20 hours)      294.61
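The speedups implied by these benchmarks can be read off by dividing serial time by parallel time. A short sketch in R using only the rows where both figures are available; note that the serial permutation times are estimates on the slide, so the resulting ratios are indicative rather than measured:

```r
# Figures copied from the benchmark tables above (seconds).
# Rows where the serial run crashed are omitted.
bench <- data.frame(
  case     = c("cor 11,000 x 320",
               "perm 36,612 x 76, 500k",
               "perm 36,612 x 76, 1M",
               "perm 73,224 x 76, 500k",
               "perm 73,224 x 76, 1M"),
  serial   = c(63.18, 20750, 41500, 35000, 70000),
  parallel = c(4.76, 73.18, 146.64, 148.46, 294.61)
)

# Speedup = serial time / parallel time on 256 processes.
bench$speedup <- bench$serial / bench$parallel
print(bench)
```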
9
Correlation code comparison
Serial R:

    edata <- read.table("largedata.dat")
    pearsonpairwise <- cor(edata)
    write.table(pearsonpairwise, "Correlations.txt")
    quit(save="no")

With SPRINT:

    library("sprint")
    edata <- read.table("largedata.dat")
    ff_handle <- pcor(edata)
    pterminate()
    quit(save="no")
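One way to see why a correlation matrix lends itself to parallel computation is that it can be built in independent row blocks. The sketch below does this serially in base R and checks the result against `cor()`. It is an illustration under stated assumptions, not SPRINT's implementation: the data are random, genes are taken to be in rows, and `block_size` and the variable names are hypothetical.

```r
set.seed(42)
n_genes   <- 60
n_samples <- 12
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)  # genes in rows

# Reference result: gene-by-gene Pearson correlations.
full <- cor(t(x))

# Split the gene indices into blocks; each block could go to a different worker.
block_size <- 16
blocks <- split(seq_len(n_genes), ceiling(seq_len(n_genes) / block_size))

# Each block needs only its own rows plus the full input, not other blocks' results.
parts <- lapply(blocks, function(idx) cor(t(x[idx, , drop = FALSE]), t(x)))
blockwise <- do.call(rbind, parts)

print(all.equal(full, blockwise, check.attributes = FALSE))
```

Because no block depends on another block's output, the pieces can be computed in any order and stored separately, which is what makes the memory limit avoidable.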
10
Permutation testing code comparison
Serial R:

    library("multtest")  # provides the golub dataset and mt.maxT
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- mt.maxT(smallgd, classlabel, test="t", side="abs")
    quit(save="no")

With SPRINT:

    library("multtest")  # provides the golub dataset
    library("sprint")
    data(golub)
    smallgd <- golub[1:100,]
    classlabel <- golub.cl
    resT <- pmaxT(smallgd, classlabel, test="t", side="abs")
    pterminate()
    quit(save="no")
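To show where the permutation work load comes from, here is a minimal base-R sketch of a single-step max-T permutation test. It is a simplified stand-in: `mt.maxT` uses a step-down procedure, and this sketch does not reproduce it or `pmaxT`. The data are simulated, and `row_t`, `B` and the other names are hypothetical.

```r
set.seed(1)
n_genes   <- 50
n_samples <- 20
x <- matrix(rnorm(n_genes * n_samples), nrow = n_genes)  # genes in rows
labels <- rep(c(0, 1), each = n_samples / 2)
x[1:5, labels == 1] <- x[1:5, labels == 1] + 3  # five genes with a real shift

# Welch two-sample t-statistic for every row (gene).
row_t <- function(m, cl) {
  a <- m[, cl == 0, drop = FALSE]
  b <- m[, cl == 1, drop = FALSE]
  va <- apply(a, 1, var)
  vb <- apply(b, 1, var)
  (rowMeans(b) - rowMeans(a)) / sqrt(va / ncol(a) + vb / ncol(b))
}

observed <- abs(row_t(x, labels))

# Each permutation recomputes all gene statistics and keeps the maximum.
B <- 200
max_null <- replicate(B, max(abs(row_t(x, sample(labels)))))

# Adjusted p-value: share of permutations whose maximum reaches the observed value.
adj_p <- sapply(observed, function(t) mean(max_null >= t))
print(head(round(adj_p, 3), 10))
```

The cost is roughly the number of permutations times the number of genes, and each permutation is independent of the others, so the permutations can be divided among processes.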
11
SPRINT