The Efficient Handling of BLAST Applications on the GRID

1
The Efficient Handling of BLAST Applications on the GRID
  • Hurng-Chun Lee (1) and Jakub Moscicki (2)
  • (1) Academia Sinica Computing Centre, Taiwan
  • (2) CERN IT-GD-ED, Switzerland

2
Outline
  • Considerations for distributing BLAST jobs
  • The master-worker computing model of BLAST
  • mpiBLAST
  • The Gridified BLAST
  • mpiBLAST-g2 vs. DIANE-BLAST
  • Summary

3
Considerations for distributing BLAST jobs
  • BLAST is widely and routinely used for
    sequence analysis
      • An essential component of most
        bioinformatics and life science applications
  • Problem complexity: O(Sq × Sd)
      • Sq: the query size
      • Sd: the database size
  • In most cases, Sd >> Sq
      • e.g. Sq ~ O(MB), Sd ~ O(GB)
      • The cost of moving the query is lower
  • Database management, storage and sharing issues
      • Replication, archiving
      • Privacy, security
  • Other considerations for service provision
      • Scalability, robustness
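
The point that moving the query is cheaper than moving the database can be illustrated with a rough transfer-time estimate. This is only a sketch: the sizes follow the orders of magnitude given on the slide, while the 10 MB/s link bandwidth and the helper name `transfer_seconds` are assumptions made for illustration.

```python
def transfer_seconds(size_bytes: float, bandwidth_bytes_per_s: float) -> float:
    """Time needed to move a payload of the given size over a link."""
    return size_bytes / bandwidth_bytes_per_s


MB = 10**6
GB = 10**9

# Illustrative values only: the slide gives orders of magnitude, not exact sizes.
query_size = 1 * MB      # Sq ~ O(MB)
database_size = 1 * GB   # Sd ~ O(GB)
bandwidth = 10 * MB      # assumed 10 MB/s wide-area link (not from the slides)

query_cost = transfer_seconds(query_size, bandwidth)
database_cost = transfer_seconds(database_size, bandwidth)

print(f"move query:    {query_cost:.1f} s")
print(f"move database: {database_cost:.1f} s")
```

Under these assumed numbers the database transfer is three orders of magnitude more expensive than the query transfer, which is why the query is sent to where the database already resides.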

4
The master-worker model of BLAST
  • Database splitting is the easiest way to
    distribute BLAST jobs
      • Fragmenting the database avoids memory
        swapping
      • Each sub-task can be 100% independent
  • Each worker requests tasks from the master (pull
    model) and runs a normal BLAST search
  • The individual results can be easily merged by
    the master process
      • Report generation (BioSeq fetching)
  • A multi-query BLAST search can be easily split into
    multiple independent single-query BLAST searches by
    a trivial script
  • The master-worker model can also be applied within each
    single-query search
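
The pull model described above can be sketched with a shared task queue: each worker pulls the next database fragment when it is free, searches it, and the master merges the per-fragment hits. This is a toy illustration, not mpiBLAST code. The scoring function `toy_score` is a hypothetical stand-in for a real BLAST search, and threads stand in for remote workers.

```python
import queue
import threading


def toy_score(query: str, subject: str) -> int:
    """Hypothetical stand-in for BLAST scoring: count matching positions."""
    return sum(1 for a, b in zip(query, subject) if a == b)


def worker(query: str, tasks: queue.Queue, results: queue.Queue) -> None:
    """Pull fragments until none are left, searching each one independently."""
    while True:
        try:
            fragment_id, fragment = tasks.get_nowait()  # pull the next task
        except queue.Empty:
            return
        hits = [(toy_score(query, subject), subject) for subject in fragment]
        results.put((fragment_id, hits))


def run_master(query: str, fragments: list[list[str]], n_workers: int = 3):
    """Queue one task per fragment, run the workers, then merge the hits."""
    tasks: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    for item in enumerate(fragments):
        tasks.put(item)
    threads = [
        threading.Thread(target=worker, args=(query, tasks, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    merged = []
    while not results.empty():
        _, hits = results.get()
        merged.extend(hits)
    # Best score first; ties broken by subject so the output is deterministic.
    merged.sort(key=lambda hit: (-hit[0], hit[1]))
    return merged


fragments = [["ACGT", "AAAA"], ["ACGA"], ["TTTT", "ACCT"]]
print(run_master("ACGT", fragments))
```

Because each fragment is searched independently, a fast worker simply pulls more fragments than a slow one, which is where the load balancing of the pull model comes from.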

5
mpiBLAST (LANL, US) http://mpiblast.lanl.gov
  • The MPI implementation of the BLAST master-worker
    model
  • Advantages
      • High throughput
      • Load balancing
  • Runs in a local cluster
      • Performance and problem size are still limited by
        local computing power
      • Simultaneous I/O to a centralized database causes
        a performance bottleneck
      • Database sharing is still difficult

6
mpiBLAST-g2 (ASCC, Taiwan and PRAGMA)
http://bits.sinica.edu.tw/mpiBlast/index_en.php
  • A GT2-enabled parallel BLAST that runs on the Grid
      • GT2 GASSCOPY API
      • MPICH-g2
  • Enhancements to mpiBLAST by ASCC
      • Cross-cluster job execution
      • Remote database sharing
  • Helper tools for
      • database replication
      • automatic resource specification and job
        submission (with a static resource table)
      • multi-query job splitting and result merging
  • Close link with the mpiBLAST development team
      • New mpiBLAST patches can be quickly
        applied to mpiBLAST-g2
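
Multi-query job splitting and result merging can be sketched as follows. This is not the mpiBLAST-g2 helper tool itself, whose implementation is not shown in the slides. The function names `split_fasta` and `merge_reports` are hypothetical, and the sketch assumes the queries arrive as one multi-record FASTA text.

```python
def split_fasta(text: str) -> list[str]:
    """Split a multi-record FASTA text into one string per record."""
    records: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        if line.startswith(">"):
            if current:
                records.append("\n".join(current) + "\n")
            current = [line]
        elif line.strip() and current:
            current.append(line)
    if current:
        records.append("\n".join(current) + "\n")
    return records


def merge_reports(reports: dict[int, str]) -> str:
    """Concatenate per-query reports in the original query order."""
    return "".join(reports[index] for index in sorted(reports))


multi_query = ">q1\nACGT\nACGT\n>q2\nTTTT\n"
single_queries = split_fasta(multi_query)
print(single_queries)

# Reports may come back in any order; the key is the query index.
print(merge_reports({1: "report for q2\n", 0: "report for q1\n"}))
```

Keying each report by the index of its query lets the single-query jobs finish in any order while the merged output still follows the order of the input file.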

7
SC2004 mpiBLAST-g2 demonstration

8
mpiBLAST-g2 current deployment
  • From PRAGMA GOC http://pragma-goc.rocksclusters.org

9
mpiBLAST-g2: Performance Evaluation (perfect case)
[Figure: elapsed time and speedup for searching, merging, BioSeq fetching, and overall]
  • Database: est_human, 3.5 GBytes
  • Queries: 441 test sequences, 300 KBytes
  • Overall speedup is approximately linear

10
mpiBLAST-g2: Performance Evaluation (worst case)
[Figure: elapsed time and speedup for searching, merging, BioSeq fetching, and overall]
  • Database: drosophila NT, ~122 MBytes
  • Queries: 441 test sequences, 300 KBytes
  • The overall speedup is limited by the unscalable
    BioSeq fetching

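
Why an unscalable stage caps the overall speedup can be shown with an Amdahl's-law style estimate. The timings below are invented for illustration and are not measurements from the slides. The model assumes the search time divides evenly across workers while the BioSeq fetching time stays constant.

```python
def overall_speedup(scalable_s: float, unscalable_s: float, n_workers: int) -> float:
    """Speedup when only the scalable part is divided among the workers."""
    single_worker_time = scalable_s + unscalable_s
    parallel_time = scalable_s / n_workers + unscalable_s
    return single_worker_time / parallel_time


# Hypothetical timings in seconds (not taken from the presentation).
large_db = overall_speedup(scalable_s=9900, unscalable_s=100, n_workers=10)
small_db = overall_speedup(scalable_s=900, unscalable_s=100, n_workers=10)

print(f"search-dominated job: {large_db:.2f}x on 10 workers")
print(f"fetch-heavy job:      {small_db:.2f}x on 10 workers")
```

With a large database the search dominates and the speedup stays close to the worker count. With a small database the fixed fetching time makes up a larger share of the run, so adding workers helps much less.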
11
Issues of mpiBLAST-g2
  • A single error crashes the whole job
      • This is inherent to MPICH
      • The error may be due to a transient problem in
        the loosely coupled Grid environment
  • An MPI job starts only when all resources
    are available
      • Resources have different levels of availability
  • Error recovery is required for
      • providing a robust application service on the
        Grid
      • using the Grid resources efficiently
  • Asynchronous task dispatching/pulling makes use of
    available resources immediately
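
The kind of error recovery called for above can be sketched as task-level resubmission: a failed task goes back into the queue instead of aborting the whole job. This is a minimal sketch under assumed names (`run_with_resubmission`, `make_flaky`), not code from mpiBLAST-g2 or DIANE.

```python
from collections import deque
from typing import Callable


def run_with_resubmission(tasks: list[str],
                          execute: Callable[[str], str],
                          max_attempts: int = 3):
    """Run every task, requeueing failures until max_attempts is reached."""
    pending = deque((task, 1) for task in tasks)
    done: dict[str, str] = {}
    failed: dict[str, str] = {}
    while pending:
        task, attempt = pending.popleft()
        try:
            done[task] = execute(task)
        except Exception as exc:  # a real system would catch narrower errors
            if attempt < max_attempts:
                pending.append((task, attempt + 1))  # resubmit later
            else:
                failed[task] = str(exc)
    return done, failed


def make_flaky(fail_once: set[str]) -> Callable[[str], str]:
    """Build an executor that fails the first time for the given tasks."""
    seen: set[str] = set()

    def execute(task: str) -> str:
        if task in fail_once and task not in seen:
            seen.add(task)
            raise RuntimeError(f"transient failure on {task}")
        return f"result-of-{task}"

    return execute


done, failed = run_with_resubmission(["frag0", "frag1", "frag2"],
                                     make_flaky({"frag1"}))
print(done)
print(failed)
```

A transient failure on one fragment costs one extra attempt for that fragment only, whereas in a plain MPI job the same failure would end the entire run.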

12
DIANE http://cern.ch/diane
  • DIstributed ANalysis Environment
  • Lightweight distributed framework for parallel
    scientific applications in the master-worker model
  • A perfect match for the mpiBLAST computing model
  • Current applications
      • BLAST for Genomic Sequence Analysis (DIANE-BLAST)
      • Geant4 Simulation for Radiotherapy and
        Astrophysics
      • Image Rendering
      • Data Analysis for High Energy Physics

13
DIANE Features
[Figure: planner, integrator, distributed workers]
  • Pull model
  • Batch and interactive
  • Rapid prototyping
      • Python and CORBA
  • Error recovery
      • Heartbeat worker health check
      • Resubmission of failed tasks
      • User-defined error recovery method
  • No need for outbound connectivity
      • Proxy for workers with only private IPs
  • Job submitters for
      • Simple fork
      • Condor, LSF, SGE, PBS
      • GT2, LCG, gLite

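
A heartbeat-based health check can be sketched as a timeout on the last heartbeat received from each worker, followed by collecting the tasks those workers held. This does not use the DIANE API. The function names, the data layout and the 30-second timeout are all assumptions made for illustration.

```python
def find_dead_workers(last_heartbeat: dict[str, float],
                      now: float,
                      timeout_s: float) -> list[str]:
    """Return workers whose last heartbeat is older than the timeout."""
    return sorted(
        worker for worker, seen_at in last_heartbeat.items()
        if now - seen_at > timeout_s
    )


def tasks_to_resubmit(assignments: dict[str, str], dead: list[str]) -> list[str]:
    """Return the tasks currently assigned to workers considered dead."""
    dead_set = set(dead)
    return sorted(task for task, worker in assignments.items()
                  if worker in dead_set)


# Hypothetical state on the master: timestamps in seconds.
last_heartbeat = {"worker-a": 100.0, "worker-b": 60.0, "worker-c": 95.0}
assignments = {"frag-1": "worker-a", "frag-2": "worker-b", "frag-3": "worker-b"}

dead = find_dead_workers(last_heartbeat, now=105.0, timeout_s=30.0)
print(dead)
print(tasks_to_resubmit(assignments, dead))
```

The master would run such a check periodically and hand the returned tasks to whichever healthy worker pulls next.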
14
DIANE-BLAST implementation
  • Splitting mpiBLAST-g2 into DIANE components
      • Master (Planner and Integrator), Worker
  • Wrapping each component with Python
      • Hooking the core BLAST C libraries into Python with SWIG
  • Implementing the DIANE GT2 job submitter
      • For running workers on the GT2-enabled clusters
  • Reusing the databases deployed for mpiBLAST-g2
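
The split into Planner, Integrator and Worker can be sketched in plain Python. The class and method names below are hypothetical and do not reproduce the DIANE interfaces. `fake_search` stands in for the SWIG-wrapped BLAST call, whose real signature is not given in the slides.

```python
from typing import Callable


class Planner:
    """Master side: turns one query into one task per database fragment."""

    def __init__(self, query: str, n_fragments: int) -> None:
        self.query = query
        self.n_fragments = n_fragments

    def plan(self) -> list[dict]:
        return [{"query": self.query, "fragment": index}
                for index in range(self.n_fragments)]


class Integrator:
    """Master side: collects per-task hits and produces the final report."""

    def __init__(self) -> None:
        self.hits: list[dict] = []

    def integrate(self, task_hits: list[dict]) -> None:
        self.hits.extend(task_hits)

    def report(self) -> list[dict]:
        # Lower E-value means a more significant hit, so sort ascending.
        return sorted(self.hits, key=lambda hit: hit["evalue"])


def worker_run(task: dict, search: Callable[[str, int], list[dict]]) -> list[dict]:
    """Worker side: run the search for one task."""
    return search(task["query"], task["fragment"])


def fake_search(query: str, fragment: int) -> list[dict]:
    """Hypothetical stand-in for the wrapped BLAST search."""
    return [{"subject": f"seq-{fragment}", "evalue": 10.0 ** -(fragment + 1)}]


planner = Planner("ACGT", n_fragments=3)
integrator = Integrator()
for task in planner.plan():
    integrator.integrate(worker_run(task, fake_search))

best_first = [hit["subject"] for hit in integrator.report()]
print(best_first)
```

Keeping planning, execution and integration in separate components is what lets the framework dispatch tasks and recover from failures without the application code being aware of it.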

15
mpiBLAST-g2 vs. DIANE-BLAST: The Speedup
  • Query
      • Drosophila chromosome 4
      • size: 1.2 Mbp
  • DB
      • Drosophila nucleotide sequence database
      • size: 1170 sequences, 122 Mbp
      • no. of fragments: 32
  • Computing resource
      • Available CPUs: 12
      • PIII 1.4 GHz
      • 1 GByte memory
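
With 32 fragments and 12 CPUs, a back-of-the-envelope bound on the achievable speedup follows from counting scheduling rounds. The sketch assumes that every fragment takes the same time to search, that every CPU acts as a worker, and that dispatching, merging and BioSeq fetching cost nothing. None of these assumptions comes from the slides, so the result is an upper bound rather than a prediction.

```python
import math


def ideal_rounds(n_fragments: int, n_cpus: int) -> int:
    """Number of rounds needed if each CPU searches one fragment per round."""
    return math.ceil(n_fragments / n_cpus)


def ideal_speedup_bound(n_fragments: int, n_cpus: int) -> float:
    """Upper bound on speedup relative to searching all fragments serially."""
    return n_fragments / ideal_rounds(n_fragments, n_cpus)


rounds = ideal_rounds(32, 12)
bound = ideal_speedup_bound(32, 12)
print(f"rounds: {rounds}, speedup bound: {bound:.2f}")
```

Since 32 is not a multiple of 12, the last round leaves some CPUs idle, so even the ideal speedup stays below 12.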

16
mpiBLAST-g2 vs. DIANE-BLAST: The Worker Lifeline
  • DIANE-BLAST task dispatching
      • Handled by DIANE's task thread
      • Due to the bugs in the current DIANE release
  • mpiBLAST-g2 task dispatching
      • mpiBLAST-g2 task handling logic

17
mpiBLAST-g2 vs. DIANE-BLAST: Overall Comparisons
  • mpiBLAST-g2
      • Master-Worker model implemented using the MPICH-g2
        libraries
      • Gridification efforts
          • Implementing database sharing with the GASSCOPY API
          • Recompilation with MPICH-g2 and GT2 libraries
      • Error recovery
          • Needs a fault-tolerant MPI
      • Cross-cluster computation
          • Requires outbound connectivity on each worker
      • Performance/Throughput
          • In-cluster performance is as good as that of the
            original mpiBLAST
  • DIANE-BLAST
      • Pluggable application for the DIANE Master-Worker
        framework
      • Gridification efforts
          • Through the gridified DIANE framework
      • Error recovery
          • Task resubmission
          • Tracking the health of each worker
      • Cross-cluster computation
          • Using a proxy for workers with private IPs
      • Performance/Throughput
          • Performance can be tuned by controlling the job
            thread

18
Summary
  • Two grid-enabled BLAST implementations
    (mpiBLAST-g2 and DIANE-BLAST) were introduced for
    efficiently handling BLAST jobs on the Grid
  • Both implementations are based on the
    Master-Worker model for distributing BLAST jobs
    on the Grid
  • mpiBLAST-g2 has good scalability and speedup
    in some cases
      • It requires a fault-tolerant MPI implementation
        for error recovery
      • In the unscalable cases, BioSeq fetching is the
        bottleneck
  • DIANE-BLAST provides a flexible mechanism for error
    recovery
      • Any master-worker workflow can be easily plugged
        into this framework
      • The job thread control should be improved to
        achieve good performance and scalability

19
Thanks for your attention!!