Title: Distributed Computing in Biomedicine
1Distributed Computing in Biomedicine
- Arun Krishnan, PhD
- Francis Tang, PhD
- BioInformatics Institute, Singapore
2Agenda
- Project with NCC
- SCATTER High Throughput BLAST
- GridBLAST
- Grid-enabled high-throughput BLAST
- inGRD
- Inter Network Grid Resource Discovery
- GridX
- Meta scheduler for the grid
- Other Projects
3High-throughput BLAST
4PROBLEM
- 10,000 sequences / year, increasing to 100,000 / year in future
- Sequence lengths 400-600 bp
- Current Process
- Involves submitting sequences one at a time on public servers like NCBI's
- Inherently Limiting
- Only one sequence can be submitted at a time
- There is a limit on the number of sequences that can be submitted in a week
5Problem Formulation Contd
- Requirements
- High Throughput solution
- Storage for the databases, query sequences and the results
- Web Interface for submission of jobs
- Submission of multiple sequences at the same time in the form of a file
- Automatic vector clipping of the sequences
- Password protected login for the users
6Solution Architecture
- Client-server architecture
- Jobs submitted on the Master
- Master spawns jobs across the Slave nodes
- Scalability is nearly linear
- Web-based access
- Login with password protection
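As a rough illustration of how a Master might divide an uploaded multi-sequence file among Slave nodes, here is a minimal sketch. It assumes the queries arrive as FASTA text and are dealt out round-robin; the function names `parse_fasta` and `split_for_slaves` are hypothetical and are not taken from the SCATTER implementation.

```python
from typing import Iterator, List, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_for_slaves(text: str, n_slaves: int) -> List[List[Tuple[str, str]]]:
    """Deal the query sequences round-robin into one batch per slave node."""
    batches: List[List[Tuple[str, str]]] = [[] for _ in range(n_slaves)]
    for i, record in enumerate(parse_fasta(text)):
        batches[i % n_slaves].append(record)
    return batches


# Three short made-up queries split across two slave nodes.
queries = ">seq1\nACGT\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n"
batches = split_for_slaves(queries, 2)
```

Because the query lengths are similar (400-600 bp), an even round-robin split is a plausible first approximation on identical nodes; it would need weighting if the nodes differed in speed.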
7GridBLAST
8GridBLAST
- Distributed Grid Computing: main focus areas
- Integrated computing resources form the Grid.
- Developing applications to run on the grid provides unique challenges
- Dynamic configurations, e.g. performance changes, hardware failures, etc.
- Data management
- Execution management
- Application management
9GridBLAST Solution Architecture
[Architecture diagram. Labels: SERVER/LOCAL MACHINE; Grid Middleware (GLOBUS); COMPUTE/DATA GRID; CLIENT/REMOTE MACHINES. Data flows: Queries, Executables, Databases; Results.]
10SPMD scheduling for GRIDs
- Heterogeneous environment: communications, processing speed, processor count
- Naïve: proportional
- More sophisticated: Minmax
- A performance model also considering inter-node latencies and bandwidth
- Reduce to a linear optimization problem
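The two strategies can be contrasted with a small sketch. The cost model is an assumption made for illustration, not the performance model from the slides: node i finishes n_i queries at time T_i = startup_i + n_i / speed_i, where `startup` stands in for fixed latency and transfer costs. Under that model the minmax problem (minimise the largest T_i subject to the n_i summing to the total) is a linear program whose solution gives every used node the same finish time T, so bisection on T is enough.

```python
from typing import List


def round_to_total(shares: List[float], total: int) -> List[int]:
    """Round fractional shares to integers that still sum to `total`."""
    counts = [int(s) for s in shares]
    leftover = total - sum(counts)
    # Hand the remaining units to the largest fractional parts.
    order = sorted(range(len(shares)), key=lambda i: shares[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


def proportional(total: int, speeds: List[float]) -> List[int]:
    """Naive split: work proportional to each node's speed (queries/second)."""
    s = sum(speeds)
    return round_to_total([total * v / s for v in speeds], total)


def minmax(total: int, speeds: List[float], startup: List[float]) -> List[int]:
    """Minimise the slowest finish time under T_i = startup_i + n_i / speed_i."""
    def assigned(t: float) -> List[float]:
        # Work each node can complete by time t (zero if t is before its startup).
        return [max(0.0, (t - a) * v) for a, v in zip(startup, speeds)]

    lo, hi = 0.0, max(startup) + total / min(speeds)
    for _ in range(100):  # bisection on the common finish time T
        mid = (lo + hi) / 2
        if sum(assigned(mid)) < total:
            lo = mid
        else:
            hi = mid
    return round_to_total(assigned(hi), total)


# Hypothetical two-node grid: the faster node also has a 20 s startup cost.
naive_split = proportional(100, [1.0, 3.0])
minmax_split = minmax(100, [1.0, 3.0], [0.0, 20.0])
```

In this made-up case the naive split ignores the startup cost and overloads the fast node, while the minmax split shifts some queries back to the slower node so that both finish together.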
11Performance Results Speedup
12inGRD Inter Network Grid Resource Discovery
13Why inGRD?
- Inconsistency in information that MDS can provide. Dependent on Globus GIIS/GRIS configuration by Grid Administrators.
- Does not require further installation of sensors on every compute node within a grid node. Makes use of readily available resource information collected by the job managers.
- Pre-formatted data on Grid nodes enable faster request, collection and processing of large amounts of data.
14inGRD overview
- inGRD sensors are installed on Grid nodes to collect available resource information from their compute cluster.
- inGRD client applications facilitate the submission of requests and collection of responses from the inGRD-enabled Grid nodes.
- Results are represented as a single XML document.
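The collection step can be pictured as merging one XML report per Grid node into a single document. The sketch below uses Python's standard `xml.etree.ElementTree`; the element and attribute names (`grid`, `node`, `local`, `cpus`, `queue`) and the node names are hypothetical, since the slides do not show the inGRD schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def merge_node_reports(reports: Dict[str, str]) -> ET.Element:
    """Combine per-node XML reports into one <grid> document."""
    grid = ET.Element("grid")
    for node_name, xml_text in sorted(reports.items()):
        node = ET.SubElement(grid, "node", name=node_name)
        # Each report's root is assumed to wrap a list of resource elements.
        for child in ET.fromstring(xml_text):
            node.append(child)
    return grid


# Made-up reports standing in for the per-node sensor output.
reports = {
    "cluster-a": "<local><cpus free='12' total='32'/><queue jobs='3'/></local>",
    "cluster-b": "<local><cpus free='4' total='16'/><queue jobs='9'/></local>",
}
grid = merge_node_reports(reports)
free_cpus = sum(int(c.get("free")) for c in grid.iter("cpus"))
print(ET.tostring(grid, encoding="unicode"))
```

A client holding the merged document can then answer grid-wide questions, such as the total number of free CPUs, without contacting each node again.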
15Client Machine
[Diagram labels: inGRD Client; Grid.xml; Globus Grid Middleware; External Grid Node (shown twice); inGRD Executor; inGRD Sensor; Local.xml; Local job manager]
16GridX Meta-scheduler for the Grid
17GridX Metascheduler for the Grid
- Metascheduler for scheduling jobs in a grid framework
- Will provide a user-friendly interface for grid users to submit jobs
- Provides Grid resource information by interfacing with inGRD
- Provides basic grid requirements: job submission, monitoring, cancellation, file transfer, etc.
- Advanced features include accounting, load balancing, static and dynamic scheduling strategies
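One simple static strategy a meta-scheduler could apply to such resource information is greedy load balancing: take the longest job first and give it to the node with the least queued work. The sketch below only illustrates that idea and is not GridX's algorithm; the `schedule` function, the load figures and the runtime estimates are all hypothetical.

```python
import heapq
from typing import Dict, List, Tuple


def schedule(jobs: List[Tuple[str, float]], loads: Dict[str, float]) -> Dict[str, List[str]]:
    """Greedy plan: longest job first, each to the currently least-loaded node.

    `jobs` holds (job name, estimated runtime in seconds);
    `loads` maps node name to seconds of work already queued there.
    """
    heap = [(load, name) for name, load in loads.items()]
    heapq.heapify(heap)
    plan: Dict[str, List[str]] = {name: [] for name in loads}
    for job, runtime in sorted(jobs, key=lambda j: j[1], reverse=True):
        load, name = heapq.heappop(heap)  # node with the least queued work
        plan[name].append(job)
        heapq.heappush(heap, (load + runtime, name))
    return plan


# Made-up example: node-b already has 5 s of queued work.
plan = schedule(
    [("j1", 4.0), ("j2", 3.0), ("j3", 2.0)],
    {"node-a": 0.0, "node-b": 5.0},
)
```

A dynamic strategy would differ mainly in refreshing the load figures from the resource monitor between assignments rather than trusting the initial snapshot.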
18[Diagram: Meta-Scheduler architecture]
- Resource information sources: inGRD, NWS, Ganglia, MDS
- Inputs and messages: User inputs, Resource Reservation requests, Job submission Request, Monitoring events
- Services: Resource Monitoring Service, Administration Service, Accounting and Billing Service, Scheduling Service, Job Supervisor Service
- Components: GridInfoCrawler, GridInfoMiner, GridKeeper, AccountManager, PolicyManager, GridBanker, GridScheduler, Performance Evaluator, Application Profiler, GridBalancer, GridMapper, GridLauncher, GridMonitor
- Records and data: User/Grp/Org Record, Grid Info., User/Accounting Policy, LogRecord, Usage Record, App. Profile Record
19Other Projects
- GridGene Project
- High-throughput, grid-enabled version of two different gene-finding applications, GenScan and GeneWise
- Project with GIS
- Parallelization of mass spectrometry code for analysis of proteomics data
- Project with NCC
- In-silico cloning of genes
20Thank you!!