Cluster-based High Performance Computing (HPC): Applications to Bioinformatics

Department of Electrical, Computer, and Biomedical Engineering. October 13th, 2006

1
Cluster-based High Performance Computing (HPC)
Applications to Bioinformatics

Yibin Dong, Yuanjian Feng, Saifur Rahman, Yue Wang
Advanced Research Institute
URL: http://www.cbil.ece.vt.edu
E-mail: yibin.dong@vt.edu

2
Agenda

Projects at CBIL
Equipment
Bottleneck of Traditional Computing
Architecture of Cluster Computing
Results of Cluster Computing
Expectation of Cluster Computing
Conclusions and Discussions

3
Projects at CBIL

CBIL is a participant in caBIG™, the cancer Biomedical Informatics Grid, sponsored by the National Cancer Institute (NCI) of the National Institutes of Health (NIH): http://caBIG.nci.nih.gov
Goal of caBIG™: demonstrate how a shared informatics platform can make a comprehensive, federated grid of information available to the cancer research community.

4
Projects at CBIL (continued)

Role of CBIL in caBIG™: support the mission of the caBIG™ Integrative Cancer Research (ICR) Workspace by developing an open-source, well-documented, and validated toolset for use throughout the cancer research community.

5
Projects at CBIL (continued)

Two significant applications of grid computing for ongoing projects at CBIL:
Robust biomarker discovery and validation
Assessment of generalization performance in molecular diagnostics

6
Robust Biomarker Discovery

Molecular Profiling
Monitors the whole genome on a single chip, giving researchers a better picture of the interactions among thousands of genes simultaneously.
High-dimensional data (30,000)
Small sample size (10-100)

7
Robust Biomarker Discovery

Differential analysis of molecular profiling data requires extensive computation to establish statistical significance:
Cross-validation and statistical re-sampling (leave-one-out, k-fold, with embedded bootstrap/permutation)
Stability analysis
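The re-sampling schemes named on this slide can be sketched as index generators. This is a minimal illustration using only the Python standard library; the function names (`leave_one_out`, `k_fold`, `bootstrap`, `permute_labels`) are hypothetical and are not taken from the CBIL toolset.

```python
import random

def leave_one_out(n):
    """Yield (train, test) index lists, holding out one sample per trial."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def k_fold(n, k, seed=0):
    """Yield (train, test) index lists for k shuffled folds of near-equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

def bootstrap(n, trials, seed=0):
    """Yield (train, test): train is drawn with replacement, test is the out-of-bag rest."""
    rng = random.Random(seed)
    for _ in range(trials):
        train = [rng.randrange(n) for _ in range(n)]
        in_bag = set(train)
        yield train, [i for i in range(n) if i not in in_bag]

def permute_labels(labels, seed=0):
    """Return a shuffled copy of the class labels for a permutation test."""
    shuffled = list(labels)
    random.Random(seed).shuffle(shuffled)
    return shuffled

# Each (train, test) pair is one independent trial that could become one cluster task.
n_trials = len(list(leave_one_out(125)))
```

Every trial produced this way depends only on its own index sets, which is what makes the workload easy to split across workers.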

8
Robust Biomarker Discovery

Ideal for distributed or parallel computing
Each worker takes one cross-validation trial.
Ensemble analysis is executed at the server.
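The worker/server split described above can be sketched on a single machine. This is only an illustration under stated assumptions: a thread pool stands in for the cluster's compute nodes, the ranking rule (absolute difference of class means) is a placeholder rather than the IDG selection algorithm used at CBIL, and the data, `select_top_genes`, and `run_job` are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def select_top_genes(data, labels, held_out, top_k):
    """Worker task: one leave-one-out trial that ranks genes by |class mean difference|."""
    scores = []
    for g in range(len(data[0])):
        a = [row[g] for i, row in enumerate(data) if i != held_out and labels[i] == 0]
        b = [row[g] for i, row in enumerate(data) if i != held_out and labels[i] == 1]
        scores.append((abs(sum(a) / len(a) - sum(b) / len(b)), g))
    return [g for _, g in sorted(scores, reverse=True)[:top_k]]

def run_job(data, labels, top_k=2, workers=4):
    """Server side: submit one task per trial, then pool the results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(select_top_genes, data, labels, i, top_k)
                   for i in range(len(data))]
        selections = [f.result() for f in futures]
    # Ensemble step: count how often each gene was selected across all trials.
    return Counter(g for sel in selections for g in sel)

# Toy data (assumption): 6 samples x 4 genes; genes 0 and 2 separate the two classes.
data = [
    [5.0, 1.0, 9.0, 2.0],
    [5.2, 1.1, 9.1, 2.1],
    [4.9, 0.9, 8.8, 1.9],
    [1.0, 1.0, 3.0, 2.0],
    [1.1, 1.2, 3.2, 2.2],
    [0.9, 0.8, 2.9, 1.8],
]
labels = [0, 0, 0, 1, 1, 1]
counts = run_job(data, labels)
```

A gene that is selected in every trial is a stable candidate; the selection counts returned by `run_job` are a simple form of the ensemble analysis carried out at the server.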

9
Generalization Performance Assessment of Diagnostic Predictor

Cross-validation and stability analysis
Estimation bias and variance
Plug-in estimate

10
Equipment (traditional)

Single PC
Dell XPS PC (Intel Pentium 4 32-bit CPU, 3.2 GHz, 1.5 GB RAM)

11
Bottleneck of Traditional Computing

On a single PC, cross-validation tasks have to wait in a queue for service.

12
Equipment (now)

Compute Cluster Solution
Computer cluster: 16 nodes
HP ProLiant DL145 Generation 2 Blade Server (dual-core AMD Opteron processor 270, 2.01 GHz, 1 GB RAM)

13
Architecture of Cluster Computing - Highlight

[Diagram labels: User, Job, Head Node, Tasks, Compute Cluster on a high-speed network, Results]

14
Architecture of Cluster Computing - Microsoft's Solution

15
Results of Cluster Computing

Reduce computational cost significantly

16
Two Significant Aspects of HPC

For an algorithm of fixed complexity, the time consumption T ideally scales as 1/N, where N is the number of distributed computing workers in the cluster.
Increasing the number of distributed computing workers enables the cluster to handle jobs of higher complexity while maintaining almost the same time consumption.
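Both aspects can be written down as a small idealized model. The sketch below assumes independent, equal-length tasks and ignores scheduling and network overhead, so it gives a best case rather than a prediction; the function names are hypothetical.

```python
import math

def ideal_time(t_single, n_workers):
    """Aspect 1: with fixed work, ideal run time shrinks in proportion to 1/N."""
    return t_single / n_workers

def makespan(n_tasks, task_time, n_workers):
    """Run time when equal-length tasks are processed in waves of n_workers."""
    return math.ceil(n_tasks / n_workers) * task_time

def workers_for_same_time(scale, n_workers):
    """Aspect 2: workers needed to keep run time constant when the work grows by `scale`."""
    return math.ceil(scale * n_workers)

# A job that takes 160 s on one worker ideally takes 10 s on 16 workers.
t16 = ideal_time(160.0, 16)
# A job four times larger keeps that run time if the cluster grows to 64 workers.
t64 = ideal_time(4 * 160.0, workers_for_same_time(4, 16))
```

The `makespan` helper shows why the ideal is rarely reached exactly: when the task count is not a multiple of N, the last wave leaves some workers idle.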

17
Experimental Results (I): Robust Biomarker Discovery

Dataset: Muscular Dystrophy, 13 groups, 125 samples
Leave-one-out IDG selection algorithm
1 job with 125 independent tasks
Time consumption on Dell PC: 199.707929 seconds
Time consumption on one node of the cluster: 125.292368 seconds
Time consumption on the 16-node cluster: 19.376634 seconds
Time reduction rate: (125.292368 - 19.376634)/125.292368 = 84.53%

18
Experimental Results (I): Robust Biomarker Discovery (continued)

19
Experimental Results (II): Predictor Performance Estimation

Dataset: Breast Cancer, 2 groups, 78 samples, from Nature publication ('t Veer et al. 2002)
3-fold cross-validation, M-SVM, 199 predictors
1 job with 199 independent tasks, 50 iterations
Time consumption on Dell PC: 1,565.61 seconds
Time consumption on one node of the cluster: 828.741729 seconds
Time consumption on the 16-node cluster: 65.628319 seconds
Time reduction rate: (828.741729 - 65.63)/828.741729 = 92.08%
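The time reduction rates reported for both experiments can be rechecked from the measured timings. The sketch below uses only the one-node and 16-node figures quoted on the slides; the speedup ratio (one-node time divided by 16-node time) is an additional derived quantity that does not appear in the original slides.

```python
def time_reduction_rate(t_before, t_after):
    """Fraction of run time saved, as defined on the slides."""
    return (t_before - t_after) / t_before

def speedup(t_before, t_after):
    """How many times faster the cluster run is (derived, not in the slides)."""
    return t_before / t_after

# (one-node seconds, 16-node seconds) as reported for each experiment
experiments = {
    "I: robust biomarker discovery": (125.292368, 19.376634),
    "II: predictor performance estimation": (828.741729, 65.628319),
}

results = {
    name: (time_reduction_rate(t1, t16), speedup(t1, t16))
    for name, (t1, t16) in experiments.items()
}

for name, (rate, ratio) in results.items():
    print(f"{name}: reduction {100 * rate:.2f}%, speedup {ratio:.2f}x")
```

Experiment II comes closer to the ideal 1/N scaling than Experiment I. The slides do not say why, so no cause is assumed here.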

20
Experimental Results (II): Predictor Performance Estimation (continued)

21
Experimental Results (II): Predictor Performance Estimation (continued)

22
Expectation of Microsoft Windows Compute Cluster

Further reduction of computational cost
Security: integration with Active Directory enables role-based security for administrators and users.
Reliability
Scalability: additional compute nodes can be added to the compute cluster by simply plugging in the nodes and connecting them.
Easy deployment and administration: Microsoft Management Console provides a familiar administrative and scheduling interface.
User friendly
MATLAB applications
C applications, Microsoft Visual Studio 2005, SQL Server 2005

23
A Successful Case Study

"Compute Cluster Server 2003 has been a fantastic solution for us. It's affordable, easy to deploy and manage, and...it doesn't require any of our researchers to rewrite code."
Yonael Teklu, IT Support Manager, Advanced Research Institute, Virginia Tech

Faster research time and results
Simple deployment and management
Ease of use
Improved security authentication
Capacity for future expansion

Upgraded existing server computers to 64-bit version of Microsoft Windows Server 2003
Purchased new server computers to create a 16-node cluster using Microsoft Windows Compute Cluster Server 2003

Needed significant computing resources for data and statistical analysis
Required an economical high-performance computing solution
Reluctant to engage in complex system management

24
Conclusions and Discussions

The cluster computing solution will significantly help CBIL reduce computational cost.
The cancer research community will benefit from the computational efficiency of cluster computing.
Microsoft Windows Compute Cluster Server 2003 brings high-performance computing (HPC) to industry-standard, low-cost servers, which meets CBIL's needs perfectly.

25
Acknowledgments

The Microsoft Corporation
National Institutes of Health (NIH) under Grants CA109872, EB000830 and caBIG™-ICR-100501
The MathWorks, Inc.
Advanced Research Institute at Virginia Tech
Children's National Medical Center (CNMC)
