Cluster-based High Performance Computing (HPC) Applications to Bioinformatics

Description:

Department of Electrical, Computer, and Biomedical Engineering. October 13th, 2006 ... participant of caBIG™ cancer Biomedical Informatics Grid sponsored by the ...

1
Cluster-based High Performance Computing (HPC)
Applications to Bioinformatics
  • Yibin Dong, Yuanjian Feng, Saifur Rahman, Yue
    Wang
  • Advanced Research Institute
  • URL: http://www.cbil.ece.vt.edu
  • E-mail: yibin.dong@vt.edu

2
Agenda
  • Projects at CBIL
  • Equipment
  • Bottleneck of Traditional Computing
  • Architecture of Cluster Computing
  • Results of Cluster Computing
  • Expectation of Cluster Computing
  • Conclusions and Discussions

3
Projects @ CBIL
  • CBIL is a participant in caBIG™ (cancer
    Biomedical Informatics Grid), sponsored by the
    National Cancer Institute (NCI) of the National
    Institutes of Health (NIH): http://caBIG.nci.nih.gov
  • Goal of caBIG™: demonstrate how a shared
    informatics platform can make a comprehensive,
    federated grid of information available to the
    cancer research community.

4
Projects @ CBIL (continued)
  • Role of CBIL in caBIG™: support the mission of
    the caBIG™ Integrative Cancer Research (ICR)
    Workspace by developing an open-source,
    well-documented, and validated toolset for use
    throughout the cancer research community.

5
Projects @ CBIL (continued)
  • Two significant applications of grid computing
    for ongoing projects at CBIL:
  • Robust biomarker discovery and validation
  • Assessment of generalization performance in
    molecular diagnostics

6
Robust Biomarker Discovery
  • Molecular Profiling
  • Monitoring the whole genome on a single chip so
    that researchers can have a better picture of the
    interactions among thousands of genes
    simultaneously.
  • High-dimensional data (30,000)
  • Small sample size (10-100)

7
Robust Biomarker Discovery
  • Differential analysis on molecular profiling data
    requires extensive computation to obtain
    statistical significance
  • Cross-validation and statistical re-sampling
  • Leave-one-out, k-fold, bootstrap/permutation
    embedded
  • Stability analysis
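
The re-sampling schemes listed above can be sketched as index generators. This is a minimal illustration under stated assumptions, not the toolset used at CBIL: the function names (`leave_one_out`, `k_fold`, `bootstrap`, `permute_labels`) and the toy sample counts are hypothetical.

```python
import random

def leave_one_out(n):
    """Yield (train, test) index lists, holding out one sample per trial."""
    for i in range(n):
        yield [j for j in range(n) if j != i], [i]

def k_fold(n, k, rng):
    """Yield (train, test) index lists for k roughly equal shuffled folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

def bootstrap(n, trials, rng):
    """Yield (train, test): train is drawn with replacement, test is the out-of-bag rest."""
    for _ in range(trials):
        train = [rng.randrange(n) for _ in range(n)]
        chosen = set(train)
        yield train, [i for i in range(n) if i not in chosen]

def permute_labels(labels, rng):
    """Return a shuffled copy of the labels, as used to build a permutation null."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled

rng = random.Random(0)  # fixed seed so the splits are reproducible
loo_trials = list(leave_one_out(5))
kfold_trials = list(k_fold(10, 3, rng))
boot_trials = list(bootstrap(10, 4, rng))
print(len(loo_trials), len(kfold_trials), len(boot_trials))  # prints "5 3 4"
```

Each generated (train, test) pair is an independent trial, which is why the number of trials, and with it the computation, grows quickly when these schemes are nested.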

8
Robust Biomarker Discovery
  • Ideal for distributed or parallel computing
  • Each worker takes one cross-validation trial.
  • Ensemble analysis is executed at the server.
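
The worker/server split described above might look like the following sketch. It is an assumption-laden illustration: a local thread pool stands in for the compute nodes, the toy data and the mean-difference ranking in `select_genes` are hypothetical, and the ranking is a stand-in for, not a reproduction of, the selection algorithm used in the experiments.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Toy data: 6 samples x 4 genes, two classes. Purely illustrative values.
DATA = [
    [5.0, 1.0, 2.0, 0.1],
    [5.2, 1.1, 2.1, 0.9],
    [4.9, 0.9, 1.9, 0.5],
    [1.0, 1.0, 2.0, 0.4],
    [1.1, 1.2, 2.2, 0.2],
    [0.9, 0.8, 1.8, 0.8],
]
LABELS = [0, 0, 0, 1, 1, 1]

def select_genes(held_out, top_k=1):
    """Worker: one leave-one-out trial. Rank genes by absolute difference of class means."""
    train = [i for i in range(len(DATA)) if i != held_out]
    scores = []
    for g in range(len(DATA[0])):
        a = [DATA[i][g] for i in train if LABELS[i] == 0]
        b = [DATA[i][g] for i in train if LABELS[i] == 1]
        scores.append((abs(sum(a) / len(a) - sum(b) / len(b)), g))
    scores.sort(reverse=True)
    return [g for _, g in scores[:top_k]]

def run_job(workers=4):
    """Server: dispatch one trial per task, then count how often each gene is selected."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_trial = list(pool.map(select_genes, range(len(DATA))))
    return Counter(g for genes in per_trial for g in genes)

selection_counts = run_job()
print(selection_counts.most_common())
```

Because the trials share no state, the server only needs the per-trial results; the selection counts it builds are one simple form of ensemble (stability) analysis.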

9
Generalization Performance Assessment of
Diagnostic Predictor
  • Cross validation and stability analysis
  • Estimation bias and variance
  • Plug-in estimate

10
Equipment (traditional)
  • Single PC
  • Dell XPS PC (32-bit Intel Pentium 4 CPU, 3.2 GHz,
    1.5 GB RAM)

11
Bottleneck of Traditional Computing
  • Cross validation tasks have to stay in a queue to
    get service.

12
Equipment (now)
  • Compute Cluster Solution
  • Compute cluster: 16 nodes
  • HP ProLiant DL145 Generation 2 Blade Server
    (dual-core AMD Opteron 270 processor, 2.01 GHz,
    1 GB RAM).

13
Architecture of Cluster Computing - Highlight
[Figure labels: User, Job, Head Node, Tasks, Compute Cluster on a high-speed network, Results]

14
Architecture of Cluster Computing - Microsoft's
Solution

15
Results of Cluster Computing
  • Reduce computational cost significantly

16
Two Significant Aspects of HPC
  • For an algorithm of fixed complexity, the time
    consumption T ideally scales as 1/N, where N is
    the number of distributed computing workers in
    the cluster.
  • Increasing the number of distributed computing
    workers lets the cluster handle jobs of higher
    complexity while maintaining almost the same
    time consumption.
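
Both aspects can be illustrated with an idealized timing model. This is a sketch that assumes perfectly divisible work, equal-length tasks, and no scheduling or communication overhead; the function names and the 100-second baseline are hypothetical.

```python
import math

def ideal_time(t_single, n_workers):
    """Ideal wall-clock time if work divides perfectly across n_workers."""
    return t_single / n_workers

def ideal_time_for_tasks(n_tasks, t_task, n_workers):
    """Ideal time when equal, indivisible tasks run in rounds of n_workers."""
    return math.ceil(n_tasks / n_workers) * t_task

# Aspect 1: a fixed job on more workers. T falls as 1/N.
for n in (1, 2, 4, 8, 16):
    print(n, ideal_time(100.0, n))

# Aspect 2: the job grows with the cluster and the time stays the same.
print(ideal_time_for_tasks(16, 1.0, 1), ideal_time_for_tasks(16 * 16, 1.0, 16))  # 16.0 16.0
```

Measured times on a real cluster sit above these ideal values, since job submission, data transfer, and uneven task lengths all add overhead.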

17
Experimental Results (I): Robust Biomarker
Discovery
  • Dataset: muscular dystrophy, 13 groups, 125
    samples
  • Leave-one-out IDG selection algorithm
  • 1 job with 125 independent tasks
  • Time consumption on Dell PC: 199.707929 seconds
  • Time consumption on one node of the cluster:
    125.292368 seconds
  • Time consumption on the 16-node cluster:
    19.376634 seconds
  • Time reduction rate: (125.292368 -
    19.376634)/125.292368 = 84.53%
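
The time reduction rate on this slide can be checked with a few lines of arithmetic. The timings below are copied from the slide; the helper names and the extra speedup figures are additions for illustration only.

```python
def reduction_rate(t_before, t_after):
    """Fraction of the original run time that was saved."""
    return (t_before - t_after) / t_before

def speedup(t_before, t_after):
    """How many times faster the second run is."""
    return t_before / t_after

T_ONE_NODE = 125.292368  # seconds, one node of the cluster
T_CLUSTER = 19.376634    # seconds, 16-node cluster
NODES = 16

rate = reduction_rate(T_ONE_NODE, T_CLUSTER)
gain = speedup(T_ONE_NODE, T_CLUSTER)
print(f"time reduction rate: {rate:.2%}")
print(f"speedup: {gain:.2f}x")
print(f"speedup per node: {gain / NODES:.2f}")
```

The reduction rate matches the 84.53% on the slide, and the corresponding speedup is roughly 6.5 times. The per-node figure is only a rough indicator, since each node is dual-core and the slides do not state how many workers ran per node.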

18
Experimental Results (I): Robust Biomarker
Discovery (continued)

19
Experimental Results (II): Predictor Performance
Estimation
  • Dataset: breast cancer, 2 groups, 78 samples from
    a Nature publication (van 't Veer et al., 2002)
  • 3-fold cross-validation, M-SVM, 199 predictors
  • 1 job with 199 independent tasks, 50 iterations.
  • Time consumption on Dell PC: 1,565.61 seconds
  • Time consumption on one node of the cluster:
    828.741729 seconds
  • Time consumption on the 16-node cluster:
    65.628319 seconds
  • Time reduction rate: (828.741729 -
    65.628319)/828.741729 = 92.08%
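
To get a feel for the size of this job, the workload can be counted as follows. This sketch assumes that "50 iterations" means 50 repetitions of the 3-fold split inside each task and that each of the 199 tasks evaluates one predictor; the slides do not spell out either point, and the helper names are hypothetical.

```python
N_SAMPLES = 78
N_FOLDS = 3
N_ITERATIONS = 50   # assumed: repetitions of the 3-fold split per task
N_PREDICTORS = 199  # assumed: one task per predictor

def fold_sizes(n_samples, n_folds):
    """Sizes of the test folds when samples are split as evenly as possible."""
    base, extra = divmod(n_samples, n_folds)
    return [base + (1 if f < extra else 0) for f in range(n_folds)]

def fits_per_task(n_folds, n_iterations):
    """Number of train/test rounds one task performs."""
    return n_folds * n_iterations

sizes = fold_sizes(N_SAMPLES, N_FOLDS)
per_task = fits_per_task(N_FOLDS, N_ITERATIONS)
total = per_task * N_PREDICTORS
print(sizes, per_task, total)  # prints "[26, 26, 26] 150 29850"
```

Under these assumptions the job amounts to 29,850 train/test rounds, each on 52 training and 26 test samples.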

20
Experimental Results (II): Predictor Performance
Estimation (continued)

21
Experimental Results (II): Predictor Performance
Estimation (continued)

22
Expectation of Microsoft Windows Compute
Cluster
  • Further reduce computational cost
  • Security
  • Integration with Active Directory enables
    role-based security for administration and users.
  • Reliability
  • Scalability
  • Additional compute nodes can be added to the
    compute cluster by simply plugging in the nodes
    and connecting them.
  • Easy deployment and administration
  • Microsoft Management Console provides a familiar
    administrative and scheduling interface
  • User friendly
  • MATLAB Applications
  • C Applications, Microsoft Visual Studio 2005,
    SQL Server 2005

23
A Successful Case Study
"Compute Cluster Server 2003 has been a fantastic
solution for us. It's affordable, easy to deploy
and manage, and...it doesn't require any of our
researchers to rewrite code."
Yonael Teklu, IT Support Manager, Advanced
Research Institute, Virginia Tech
  • Faster research time and results
  • Simple deployment and management
  • Ease of use
  • Improved security authentication
  • Capacity for future expansion
  • Upgraded existing server computers to 64-bit
    version of Microsoft Windows Server 2003
  • Purchased new server computers to create a
    16-node cluster using Microsoft Windows Compute
    Cluster Server 2003
  • Needed significant computing resources for data
    and statistical analysis
  • Required an economical high-performance computing
    solution
  • Reluctant to engage in complex system management

24
Conclusions and Discussions
  • A cluster computing solution will significantly
    help CBIL reduce computational cost.
  • The cancer research community will benefit from
    the computational efficiency of cluster computing.
  • Microsoft Windows Compute Cluster Server 2003
    brings high-performance computing (HPC) to
    industry-standard, low-cost servers, which meets
    CBIL's needs perfectly.

25
Acknowledgments
  • The Microsoft Corporation
  • National Institutes of Health (NIH) under Grants
    CA109872, EB000830 and caBIG™-ICR-100501
  • The MathWorks, Inc.
  • Advanced Research Institute at Virginia Tech
  • Children's National Medical Center (CNMC)