Title: Cluster-based High Performance Computing (HPC) Applications to Bioinformatics
1Cluster-based High Performance Computing (HPC)
Applications to Bioinformatics
- Yibin Dong, Yuanjian Feng, Saifur Rahman, Yue Wang
- Advanced Research Institute
- URL: http://www.cbil.ece.vt.edu
- E-mail: yibin.dong_at_vt.edu
2Agenda
- Projects at CBIL
- Equipment
- Bottleneck of Traditional Computing
- Architecture of Cluster Computing
- Results of Cluster Computing
- Expectation of Cluster Computing
- Conclusions and Discussions
3Projects at CBIL
- CBIL is a participant in caBIG™ (cancer
Biomedical Informatics Grid), sponsored by the
National Cancer Institute (NCI) of the National
Institutes of Health (NIH): http://caBIG.nci.nih.gov
- Goal of caBIG™: demonstrate how a shared
informatics platform can allow a comprehensive,
federal grid of information to be made available
to the cancer research community.
4Projects at CBIL (continued)
- Role of CBIL in caBIG™: support the mission
of the caBIG™ Integrative Cancer Research (ICR)
Workspace by developing an open-source,
well-documented, and validated toolset for use
throughout the cancer research community.
5Projects at CBIL (continued)
- Two significant applications of grid computing
for ongoing projects at CBIL:
- Robust biomarker discovery and validation
- Assessment of generalization performance
in molecular diagnostics
6Robust Biomarker Discovery
- Molecular Profiling
- Monitoring the whole genome on a single chip so
that researchers can have a better picture of the
interactions among thousands of genes
simultaneously.
- High-dimensional data (30,000)
- Small sample size (10-100)
7Robust Biomarker Discovery
- Differential analysis on molecular profiling data
requires extensive computation to obtain
statistical significance
- Cross-validation and statistical re-sampling
- Leave-one-out, k-fold, bootstrap/permutation
embedded
- Stability analysis
8Robust Biomarker Discovery
- Ideal for distributed or parallel computing
- Each worker takes one cross-validation trial.
- Ensemble analysis is executed at the server.
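The worker/server split described on this slide can be sketched as follows. This is a minimal illustration, not the selection algorithm used at CBIL: the toy data, the mean-difference score in `select_genes`, and the helper names are all assumptions made for the example, and a local thread pool stands in for the cluster's compute nodes (it shows the structure of the job, not the speedup).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import random

def make_toy_data(n_samples=20, n_genes=50, n_informative=3, seed=0):
    """Toy two-class data; the first n_informative genes are shifted in class 1."""
    rng = random.Random(seed)
    labels = [i % 2 for i in range(n_samples)]
    data = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        for g in range(n_informative):
            row[g] += 5.0 * y
        data.append(row)
    return data, labels

def select_genes(data, labels, left_out, top_k=3):
    """Worker: one leave-one-out trial, ranking genes by class-mean difference."""
    keep = [i for i in range(len(data)) if i != left_out]
    n_genes = len(data[0])
    scores = []
    for g in range(n_genes):
        c0 = [data[i][g] for i in keep if labels[i] == 0]
        c1 = [data[i][g] for i in keep if labels[i] == 1]
        scores.append(abs(sum(c1) / len(c1) - sum(c0) / len(c0)))
    ranked = sorted(range(n_genes), key=lambda g: scores[g], reverse=True)
    return ranked[:top_k]

def run_job(data, labels, n_workers=4, top_k=3):
    """Server: one task per left-out sample, then count how often each gene is picked."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        trials = list(pool.map(lambda i: select_genes(data, labels, i, top_k),
                               range(len(data))))
    return Counter(g for trial in trials for g in trial)

data, labels = make_toy_data()
frequency = run_job(data, labels)
# Genes selected in every trial are the "stable" ones in this toy ensemble step.
stable = sorted(g for g, count in frequency.items() if count == len(data))
print(stable)
```

Because the trials share no state, the same pattern maps onto one task per compute node, with only the final counting step running on the server.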
9Generalization Performance Assessment of
Diagnostic Predictor
- Cross-validation and stability analysis
- Estimation bias and variance
- Plug-in estimate
10Equipment (traditional)
- Single PC
- Dell XPS PC (32-bit Intel Pentium 4 CPU, 3.2 GHz, 1.5 GB RAM)
11Bottleneck of Traditional Computing
- Cross-validation tasks have to wait in a queue
for service.
12Equipment (now)
- Compute Cluster Solution
- Computer cluster: 16 nodes
- HP ProLiant DL145 Generation 2 Blade Server (dual-core
AMD Opteron 270 processor, 2.01 GHz, 1 GB
RAM).
13Architecture of Cluster Computing - Highlight
[Diagram labels: User, Job, Head Node, Tasks, Results, Compute Cluster on a high-speed network]
14Architecture of Cluster Computing - Microsoft's
Solution
15Results of Cluster Computing
- Reduce computational cost significantly
16Two Significant Aspects of HPC
- Given a fixed complexity of an algorithm, the
time consumption T ideally decreases to about T/N
as N increases, where N is the number of
distributed computing workers in the cluster.
- Increasing the number of distributed computing
workers in the cluster enables the cluster to
handle jobs with higher complexity while maintaining
almost the same time consumption.
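The two aspects above can be written out as a short worked formula. This is a sketch under idealized assumptions (perfectly divisible work, no scheduling or communication overhead, run time linear in workload); the symbols T_1, T_N, and W are introduced here for illustration and do not appear on the slides.

```latex
% Aspect 1: fixed workload W, N workers (ideal case)
\[
T_N(W) \approx \frac{T_1(W)}{N},
\qquad
\text{time reduction rate} = \frac{T_1 - T_N}{T_1} \approx 1 - \frac{1}{N}
\]

% Aspect 2: workload grows with N, run time stays roughly constant
\[
T_N(N \cdot W) \approx \frac{T_1(N \cdot W)}{N} \approx T_1(W)
\]
```

Under these assumptions a 16-worker cluster could reduce run time by at most 1 - 1/16 = 93.75%; measured rates fall below that bound once overhead is included.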
17Experimental Results (I): Robust Biomarker
Discovery
- Dataset: Muscular Dystrophy, 13 groups, 125
samples
- Leave-one-out IDG selection algorithm
- 1 job with 125 independent tasks
- Time consumption on Dell PC: 199.707929 seconds
- Time consumption on one node of cluster:
125.292368 seconds
- Time consumption on a 16-node cluster: 19.376634
seconds
- Time Reduction Rate: (125.292368 -
19.376634)/125.292368 = 84.53%
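The arithmetic behind the time reduction rate can be checked with a few lines of Python. The timings are copied from the slide; the function and variable names, and the additional speedup ratios, are illustrative additions rather than figures reported in the talk.

```python
def time_reduction_rate(t_single, t_cluster):
    """Fraction of the single-node run time saved by running on the cluster."""
    return (t_single - t_cluster) / t_single

# Timings in seconds, as reported on the slide.
t_pc = 199.707929        # Dell PC
t_one_node = 125.292368  # one node of the cluster
t_cluster = 19.376634    # 16-node cluster

reduction = time_reduction_rate(t_one_node, t_cluster)
speedup_vs_node = t_one_node / t_cluster  # derived here, not on the slide
speedup_vs_pc = t_pc / t_cluster          # derived here, not on the slide

print(f"time reduction rate: {reduction:.2%}")
print(f"speedup vs one node: {speedup_vs_node:.2f}x")
print(f"speedup vs Dell PC:  {speedup_vs_pc:.2f}x")
```

The first line reproduces the 84.53% figure. The speedup over a single node comes out at roughly 6.5 times, below the ideal factor of 16, which is consistent with overhead being a noticeable share of such a short job.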
18Experimental Results (I): Robust Biomarker
Discovery (continued)
19Experimental Results (II): Predictor Performance
Estimation
- Dataset: Breast Cancer, 2 groups, 78 samples from
Nature publication (van 't Veer et al. 2002)
- 3-fold cross-validation, M-SVM, 199 predictors
- 1 job with 199 independent tasks, 50 iterations.
- Time consumption on Dell PC: 1,565.61 seconds
- Time consumption on one node of cluster:
828.741729 seconds
- Time consumption on a 16-node cluster: 65.628319
seconds
- Time Reduction Rate: (828.741729 -
65.628319)/828.741729 = 92.08%
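To give a sense of the workload behind these timings, the sketch below enumerates the train/test splits for one task: 50 iterations of 3-fold cross-validation over 78 samples. It assumes plain shuffled folds, which may differ from the splitting scheme actually used (for example, stratified folds), and it omits the M-SVM training step entirely; `repeated_kfold` is a hypothetical helper written for this example.

```python
import random

def repeated_kfold(n_samples, k=3, iterations=50, seed=0):
    """Yield (iteration, fold, train_idx, test_idx) for repeated shuffled k-fold CV."""
    rng = random.Random(seed)
    for it in range(iterations):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[f::k] for f in range(k)]  # k disjoint, near-equal folds
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield it, f, train_idx, test_idx

splits = list(repeated_kfold(n_samples=78))
n_predictors = 199  # one independent task per predictor, as on the slide
total_fits = n_predictors * len(splits)  # model fits across the whole job

print(len(splits), total_fits)
```

Each task therefore covers 150 train/test rounds, and the 199 tasks can be handed to the compute nodes independently because no task needs another task's results.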
20Experimental Results (II): Predictor Performance
Estimation (continued)
21Experimental Results (II): Predictor Performance
Estimation (continued)
22Expectation of Microsoft Windows Compute
Cluster
- Further reduce computational cost
- Security
- Integration with Active Directory enables
role-based security for administration and users.
- Reliability
- Scalability
- Additional compute nodes can be added to the
compute cluster by simply plugging in the nodes
and connecting them.
- Easy deployment and administration
- Microsoft Management Console provides a familiar
administrative and scheduling interface
- User friendly
- MATLAB Applications
- C Applications, Microsoft Visual Studio 2005,
SQL Server 2005
23A Successful Case Study
"Compute Cluster Server 2003 has been a fantastic
solution for us. It's affordable, easy to deploy
and manage, and...it doesn't require any of our
researchers to rewrite code."
Yonael Teklu, IT Support Manager, Advanced
Research Institute, Virginia Tech
- Faster research time and results
- Simple deployment and management
- Ease of use
- Improved security authentication
- Capacity for future expansion
- Upgraded existing server computers to 64-bit
version of Microsoft Windows Server 2003
- Purchased new server computers to create a
16-node cluster using Microsoft Windows Compute
Cluster Server 2003
- Needed significant computing resources for data
and statistical analysis
- Required an economical high-performance computing
solution
- Reluctant to engage in complex system management
24Conclusions and Discussions
- The cluster computing solution will significantly
help CBIL reduce computational cost.
- The cancer research community will benefit from
the computational efficiency of cluster computing.
- Microsoft Windows Compute Cluster Server 2003
brings high-performance computing (HPC) to
industry-standard, low-cost servers, which meets
CBIL's needs perfectly.
25Acknowledgments
- The Microsoft Corporation
- National Institutes of Health (NIH) under Grants
CA109872, EB000830 and caBIG™-ICR-100501
- The MathWorks, Inc.
- Advanced Research Institute at Virginia Tech
- Children's National Medical Center (CNMC)