Title: Grids and Biology
1 Grids and Biology
- A take on the Grid
- Issues in Bioinformatics for Grid
- Various BioGrids
- Applicability of Grid to Biology
- Reality check
2 What is the Grid?
- "Grid computing is distinguished from
conventional distributed computing by its focus
on large-scale resource sharing, innovative
applications, and, in some cases,
high-performance orientation ... we review the
'Grid problem', which we define as flexible,
secure, coordinated resource sharing among
dynamic collections of individuals, institutions,
and resources - what we refer to as virtual
organizations."
- From "The Anatomy of the Grid: Enabling Scalable
Virtual Organizations" by Foster, Kesselman and
Tuecke
3 What is the Grid?
- Resource sharing and coordinated problem solving in
dynamic, multi-institutional virtual
organizations
- On-demand, ubiquitous access to computing, data,
and services
- New capabilities constructed dynamically and
transparently from distributed services
- No central location, no central control, no
existing trust relationships, little
predetermination
- Uniformity for pooling resources
- Virtual pools of resources: databases, clusters.
4 Biology as a Grid Application
- Informational Science
- Large Scale
- Distributed
- No one organisation owns it all
5 Motivation
[Figure. Labels: Metabolic Pathways, Pharmacogenomics, Human Genome, Combinatorial Chemistry, Computational Load, Genome Data, Moore's Law. Time axis: 1990, 2000, 2010.]
6 BioMedical Computation
Rick Stevens, Argonne Labs
7 Biomedical Data: High Complexity and Large Scale
Rick Stevens, Argonne Labs
[Figure: categories of biomedical data with scale labels]
- Protein-protein interactions: metabolism, pathways, receptor-ligand, 4º structure
- Physiology: cellular biology, biochemistry, neurobiology, endocrinology, etc.
- Polymorphism and variants: genetic variants, individual patients, epidemiology
- ESTs: expression patterns, large-scale screens
- Genetics and maps: linkage, cytogenetic, clone-based
- MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...
- ...atcgaattccaggcgtcacattctcaattcca...
Scale labels, in transcript order: billions, millions, millions, hundred thousands, billions, millions
8 BioGrid Projects
- EUROGRID BioGRID
- Asia Pacific BioGRID
- North Carolina BioGrid
- Bioinformatics Research Network
- Osaka University BioGrid
- Indiana University BioArchive BioGrid
- myGrid
- BioSim
- e-Protein
- ObiGrid
9 Today's Grid
- A Single System Image
- Transparent wide-area access to large data banks
- Transparent wide-area access to applications on
heterogeneous platforms
- Transparent wide-area access to processing
resources
- Security, certification, single sign-on
authentication, AAA
- Grid Security Infrastructure
- Data access, transfer and replication
- GridFTP, Giggle
- Computational resource discovery, allocation and
process creation
- GRAM, Unicore, Condor-G
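The replication idea on this slide can be illustrated with a small sketch: a logical file name maps to several physical copies, and the client picks the cheapest one. This is a toy illustration only, not the interface of Giggle or GridFTP; the class name, the example URLs and the latency table are all hypothetical.

```python
from collections import defaultdict


class ReplicaCatalog:
    """Toy replica index: logical file name -> set of physical URLs."""

    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, logical_name, physical_url):
        """Record that a physical copy of the logical file exists."""
        self._replicas[logical_name].add(physical_url)

    def lookup(self, logical_name):
        """Return all known physical copies, sorted for stable output."""
        return sorted(self._replicas.get(logical_name, ()))

    def best_replica(self, logical_name, cost):
        """Return the copy with the lowest caller-supplied cost."""
        candidates = self.lookup(logical_name)
        if not candidates:
            raise KeyError(logical_name)
        return min(candidates, key=cost)


# Hypothetical sites and measured latencies (milliseconds).
catalog = ReplicaCatalog()
catalog.register("genome/release.fasta",
                 "gsiftp://site-a.example.org/data/release.fasta")
catalog.register("genome/release.fasta",
                 "gsiftp://site-b.example.org/data/release.fasta")
latency_ms = {"site-a.example.org": 120, "site-b.example.org": 35}


def latency_cost(url):
    """Cost of a replica = latency of its host (third '/'-separated field)."""
    return latency_ms[url.split("/")[2]]


chosen = catalog.best_replica("genome/release.fasta", latency_cost)
```

The cost function is passed in by the caller so that the same catalog could rank replicas by latency, bandwidth or site policy without changing the lookup code.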
10 Immediate benefits
- Uniform file views of directories, regardless of
platform
- Grid-based data transfer libraries for faster
access to large files, reducing the need for
mirror-site servers.
- Replication to support mirroring
- Grid APIs provide a job manager with metadata
about services, so the quality of service providers
can be evaluated on factors that may
include more than just server performance and
availability.
- Grid-aware applications: split sequence
reference libraries among several servers, where
BLAST comparisons can be conducted in parallel.
- Shielding from a variety of low-level computing
problems that users would otherwise have to address
themselves.
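The "split the reference library, compare in parallel" bullet can be sketched as follows. This is a minimal illustration under stated assumptions: the scoring function is a shared k-mer count standing in for BLAST (it is not BLAST), local threads stand in for remote servers, and every function name here is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def kmers(seq, k=3):
    """Set of all length-k substrings of a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_library(library, n_shards):
    """Deal (name, sequence) records round-robin into n_shards lists."""
    shards = [[] for _ in range(n_shards)]
    for i, record in enumerate(library):
        shards[i % n_shards].append(record)
    return shards


def search_shard(query, shard, k=3):
    """Score every record in one shard; stand-in for one server's search."""
    q = kmers(query, k)
    return [(len(q & kmers(seq, k)), name) for name, seq in shard]


def parallel_search(query, library, n_shards=3, top=2):
    """Search all shards concurrently and merge the best hits."""
    shards = split_library(library, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial = pool.map(lambda s: search_shard(query, s), shards)
        hits = [hit for part in partial for hit in part]
    # Highest score first; ties broken by record name for determinism.
    hits.sort(key=lambda h: (-h[0], h[1]))
    return hits[:top]
```

Because each shard is scored independently, the merged result does not depend on how many shards are used; only the wall-clock time changes.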
11 Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
12 Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
13 Classical Grids
- Classical Grids emphasise sharing of physical
resources.
- Existing Grid middleware (e.g. Globus, Condor,
Unicore) allows resource discovery, resource
allocation, data movement, certification
14 High Performance Bioinformatics Software
Jack da Silva, NCSC, Paracel
15 European DataGrid
16
- Managed access to specialist remote resources
17
- Access portal for biomolecular modeling
resources.
- Interfaces to enable chemists and biologists to
submit work to HPC facilities
- Visualization of the electrostatic field generated by
a molecule.
- Dr Krzysztof Nowinski (ICM)
18 Biogrid system
[Diagram: cluster layout]
- Grid system 1: Express5800/ISS for PC-Cluster, Xeon 2.2G x 8, Management node1
- Grid system 2: NEC Blade Server, 78 nodes (156 CPUs)
- Data Grid Disk: Express5800/140Ra-4 x 3
- SCORE Management Station (shown twice)
- Interconnects: Myrinet-2000, Flat Neighborhood networks, 1000Base-SX, 1000Base-T x 12
- Connected to Grid system 3
19 Remote control of instruments
- Sharing of UHVEM (Ultra High Voltage Electron
Microscopy) in Osaka University with NCMIR
(National Center for Microscopy and Imaging
Research)
- 3 million electron volts
- the most powerful microscopy
20 Home Computers Evaluate AIDS Drugs
- Community
- 1000s of home computer users
- Philanthropic computing vendor (Entropia)
- Research group (Scripps)
- Common goal: advance AIDS research
From Steve Tuecke 12 Oct. 01
21 Matlab
Geodise release in November 02 sjc_at_soton.ac.uk
- Matlab and toolboxes for mathematical
computation, analysis, visualization, and
algorithm development
MATLAB is an intuitive language and a technical
computing environment. It provides core
mathematics and advanced graphical tools for data
analysis, visualization, and algorithm and
application development. With more than 600
mathematical, statistical, and engineering
functions, engineers and scientists rely on the
MATLAB environment for their technical computing
needs. (www.mathworks.com)
CROSS PLATFORM/ OS
22 BioSim -- Molecular simulations as a tool for
protein structure analysis
Sansom
[Diagram labels: synchrotron, compute GRID, MD database, novel biology]
- Overall vision: simulation as an integral
component of structural genomics
- Needs both capacity (many systems) and capability
(large systems - HPCx)
- Molecular Dynamics database (distributed)
23 Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
24 Visualization Bioinformatics
Rick Stevens, Argonne Labs
Visualization Environment
Bioinformatic Analysis Tools
Microbiology Biochemistry
Function Assignment
Genome Visualization Tools
Whole Genome Analysis
Metabolic Reconstruction
Enzymatic Constants Metabolic
Network Visualization Tools
Stoichiometric Representation Flux Analysis
Proteomics
Interactive Stoichiometric Graphical Tools
Dynamic Simulation
Whole Cell Visualizations Image/Spectra
Augmentations
Laboratory Verification
25 X-ray microtomography
- Scientific discovery can be enhanced by closely
coupling computation and experiment: simulation,
visualization and data gathering are coupled
- X-ray microtomography produces 3D X-ray
attenuation maps of specimens at a microscopic
level
- Expensive synchrotron beam time resources are
optimally used to obtain sufficient resolution
for simulation
26 Interactive Steering
- User steers calculation from laptop
- Controlled steering on supercomputers
- Visualization and computation use large scale
machines accessed via Grid.
- Enables controlled simulation using knowledge and
skills of trained scientist.
27 Scalable molecular dynamics
- Structure of a protein in a fluid medium
- Calculation takes into account forces between
protein and ambient medium (in this case water
molecules)
- Run on the world's largest academic computer, LeMieux
at PSC (6 Tflops theoretical peak)
28 Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
29 UCSF
UIUC
From Klaus Schulten, Center for Biomolecular
Modeling and Bioinformatics, Urbana-Champaign
30 http://www.ks.uiuc.edu/Research/biocore/
31 Grid Landscape DATA!!
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
32 Information Weaving and Question Answering
- Large amounts of different kinds of data, many
applications.
- Highly heterogeneous.
- Different types, algorithms, forms,
implementations, communities, service providers
- High autonomy.
- Highly complex and inter-related, volatile.
33 Mike Sternberg
Annotation Pipeline
34 myGrid
RASMOL
- Personalised extensible environments for
data-intensive in silico experiments in biology
- Straightforward discovery, interoperation,
deployment and sharing of services
- Service-oriented architecture
- Integration and Information
- Workflow and Databases
- Experimentation
- Provenance, propagating change, personalisation
For bioinformaticians who are building tools and
using or providing services
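The pairing of workflow and provenance on this slide can be made concrete with a small sketch: each step of an in silico experiment is run in order, and a record of what went in and what came out is kept alongside the result. This is an illustration only, not myGrid's actual design; the two "services", the accession string and the record fields are hypothetical.

```python
import hashlib
import json
from datetime import datetime, timezone


def digest(value):
    """Short content hash, so provenance can refer to data without copying it."""
    payload = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


def run_workflow(steps, data):
    """Run (name, function) steps in order; return final data and provenance."""
    provenance = []
    for name, func in steps:
        result = func(data)
        provenance.append({
            "step": name,
            "input": digest(data),
            "output": digest(result),
            "finished": datetime.now(timezone.utc).isoformat(),
        })
        data = result
    return data, provenance


# Hypothetical stand-ins for remote bioinformatics services.
def fetch_sequence(accession):
    """Pretend database lookup; returns a fixed example sequence."""
    return {"accession": accession,
            "sequence": "atcgaattccaggcgtcacattctcaattcca"}


def gc_content(record):
    """Add the fraction of g/c bases to the record."""
    seq = record["sequence"]
    gc = sum(base in "gc" for base in seq) / len(seq)
    return {**record, "gc": gc}


final, trail = run_workflow(
    [("fetch_sequence", fetch_sequence), ("gc_content", gc_content)],
    "EXAMPLE-0001",  # hypothetical accession
)
```

Because each record stores the hash of its input and output, the output hash of one step matches the input hash of the next, which is what lets a change in an upstream result be traced to everything derived from it.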
35 DiscoveryNet
http://www.discovery-on-the.net/
High Throughput Sensing (HTS) Applications:
Large-scale Dynamic Real-time Decision Support,
Large-scale Dynamic System Knowledge Discovery
Based on Kensington Discovery Platform
Grid-based Knowledge Discovery: Grid-based Data
Mining, Collaborative Visualisation
Information Structuring: Information Integration,
Composition, Semantics, Domain-based
Ontologies, Sharing
Distributed Data Engineering: Data Registration,
Data Normalisation, Data Quality
Based on Globus ORB Infrastructure
High Throughput Computing Services:
Utilising Grid Infrastructure for HT Computing
Grid Basic Infrastructure: Globus/Condor/SRB
36 Grid Evolution
- 1st Generation Grid
- Computationally intensive, file access/transfer
- Bag of various heterogeneous protocols and toolkits
- Recognises the Internet, ignores the Web
- Academic teams
- 2nd Generation Grid
- Data intensive -> knowledge intensive
- Services-based architecture
- Recognises Web and Web services
- Global Grid Forum
- Industry participation
We are here!
37 A Grid of resources, not just compute resources
but databases, digital libraries, instruments,
workflows, documents
A Grid vs The Grid
NovartisGrid
BioSimGrid
MouseGrid
Logical
Grid Middleware
These configurations are dynamic: resources are
discovered, combined, used and disbanded as and
when needed or available.
Gigabit IP Network
Physical
Node
Node
Node
Geographically (e.g. UKGrid)
Node
38 A configuration of resources
services
- Not just compute services but databases, digital
libraries, instruments, workflows, documents
Open Grid Services Architecture (OGSA)
Grid Services
Web Services
Grid Technology
39 Bio Services
- Drug Discovery
- Microbial Engineering
- Molecular Ecology
- Oncology Research
Domain Oriented Services
- Integrated Databases
- Sequence Analysis
- Protein Interactions
- Cell Simulation
Basic BioGrid Services
Grid Resource Services
- Compute Services
- Pipeline Services
- Data Archive Service
- Database Hosting
- Workflow Enactment
- Event notification
Common Services
Base Services
Fabric Services
40 What We Need to Create
- Grid Bio applications enablement software layer
- Provide applications access to Grid services
- Provide OS-independent services
- Grid-enabled versions of bioinformatics data
management tools (e.g. DL, SRS, etc.)
- Need to support virtual databases via Grid
services
- Grid support for commercial databases
- Bioinformatics applications plug-in modules
- End user tools for a variety of domains
- Support major existing Bio IT platforms
41 Requirements for the BioGrid
- Open and extendable architecture
- Enable tie-in to service stack at appropriate
points
- Not just access via portals
- Leverage scripting tools in wide use for
bioinformatics
- Create BioGrid services bindings for PERL and
Python
- Address data federation and integration
- Leverage work of IBM, Lion BioSciences, DAS,
BioMOBY, etc.
- Match the biology workflow and tool chain
- Create high-level BioGrid services to address
critical stages in existing workflow
- Support composability of new BioGrid tools with
existing tool chain elements
42 Some BioGrid Challenges
- Scalable human bioinformatics expertise
- Best people working on the important problems
- Exploit collaboration technology to create world
class teams
- Robust local bioinformatics computing environment
- Best systems administrators and high-end
technologies
- Embed local resources into the Grid via portal
technologies
- Access to leading edge bioinformatics software
and databases customized to user needs
- Core content from top scientists and developers
- Integrated access to biological databases
- Worldwide access to robust computing and database
infrastructure
- Leverage Grid technology to provide worldwide
access
- Integrate purpose built systems and service
providers
43 Reality Checks!!
- The Technology is Ready
- Not true: it's emerging
- Building middleware, advancing standards,
developing dependability
- Building demonstrators.
- The computational grid is in advance of the data
intensive middleware
- Integration and curation are probably the
obstacles
- But!! It doesn't have to be all there to be
useful.
- We know how we will use grid services
- No: disruptive technology
- Lower the barriers of entry.
44 Reality Checks!!
- It's the only game
- Not true: I3C, BioMOBY, bioDAS, OMG LSR
- Grid and Web service merge makes integration
likely.
- One Size Fits All
- Not true
- Addressed by a minimum set of composable virtual
services, but starting with Globus
- It's only for big science
- No: small science collaborates too!
- Biology is not unique!
- AstroGrid
45 Not a silver bullet!
- It's just middleware, not magic
- Data quality
- Content management of databases (controlled
vocabularies)
- Provenance and versioning policies
- Appropriate use of tools
- Computational inaccessibility of free text
annotation
- Database accessibility through means other than
point and click web interfaces.
- Independent of the Grid!
46 Life Sciences Grid (LSG)
http://people.cs.uchicago.edu/dangulo/LSG/
47 Spares
48
- Sun BioGrid symposium web site
- GGF Life Sciences web site
49 An International Systems Biology Grid
- A Data, Experiment and Simulation Grid Linking
- People: biologists, computer scientists,
mathematicians, etc.
- Experimental systems: arrays, detectors, MS, MRI,
EM, etc.
- Databases: data centres, curators, analysis
servers
- Simulation resources: supercomputers,
visualization, desktops
- Discovery resources: optimized search servers
- Education and teaching resources: classrooms,
labs, etc.
- Different from and more fine-grained than current
Grid projects
- More laboratory integration: small laboratory
interfaces
- Many participants will be experimentalists:
workflow, visualization
- More diversity of data sources and databases:
integration, federation
- More portals to simulation environments: Advanced
Photon Source models
Rick Stevens, Argonne Lab
50 Marketeer Figure
[Figure. Labels: Metabolic Pathways, Pharmacogenomics, Human Genome, Petabytes, Combinatorial Chemistry, Computational Biology, Moore's Law. Time axis: 1990, 2000, 2010.]
Abbas Farazdel, IBM Life Sciences
51 Motivation
52 Matlab
- Flexibility
- Integrated Development Environment
- Provides full GUI support and can integrate code
from Fortran, C, Java, Web/Grid Services
- Monitoring
- Command line/batch mode
- Maintainable
- Matlab scripts readable and editable by engineers
- Usability
- Already in widespread use by engineers
54 Views on the Grid
Grid collection of services
Knowledge questions
55 Grid Computing from Matlab
- Globus commands (based on Java CoG v0.9.13)
- gd_CreateProxyCertVisual
- gd_BatchSubmit
- gd_JobStatus
- gd_GetFile (GSIftp v0.9.14)
- gd_PutFile (GSIftp v0.9.14)
- Database commands
- gd_archive
- gd_query
- gd_retrieve
56 myGrid Framework
Portal
Work Bench
Applications: UTOPIA
Bio-Medical Services Library: DAS, Talisman,
workflow sets
Upper level knowledge-based Grid Common
Services: semantic integration, knowledge-based
querying, workflow composition, visualisation,
provenance management, semantic service discovery
Middle level Grid Common Services: database
access, distributed query processing, service
discovery, workflow enactment, event notification
Low level Grid Common Services (OGSI): co-scheduling,
data shipping, authentication, job execution,
resource monitoring, replication
57
- DiscoveryNet (get slides from Yike)
- E-Protein (get slides from Mike)
- myGrid
- BioSim
- OBIGrid
- Canada BioGrid
- Paracel
- Novartis
- EU Data Grid (get slides from Vincent)
58 Phenotype Engineering
60 Software Infrastructure in Drug Discovery
Ontologies and Domain Specific Integration
61 A Grid not the Grid
- The Grid is the middleware to build .
- A Grid of resources, not just compute resources
but databases, digital libraries, instruments,
workflows, documents
- geographically (UKGrid)
- for a particular community (MouseGrid)
- to solve a particular problem (BioSim,
e-Protein)
- enterprise (Novartis Grid)
- organised into tiers (CERN Physics Grid).
- These configurations are dynamic
- Resources discovered, combined, used and
disbanded as and when needed or available.
62 Grid Evolution
- Computationally intensive, file access/transfer
- -> data intensive
- -> knowledge intensive
- Bag of protocols and toolkits
- -> services architecture
- A few academic teams
- -> Global Grid Forum and industry
engagement
63 Will Biology Dominate the Grid?
- The largest science discipline
- The most scientists (globally 500,000-1,000,000)
- The most research funding (globally 50
billion/year)
- The most graduate students (>20,000/year)
- Strong couplings to
- Medicine and human health
- Agriculture and food supplies
- Energy and ecology
- Future industrial processes (bio-nano)
- Consumer of other scientific technologies
64 BioGrid Needs To Provide
- An open platform to facilitate interoperability
- Scalable compute and data capabilities beyond
those available locally
- Distributed infrastructure available 24x7
worldwide
- Enables leverage of remote systems administration
and support via service providers
- Integration with local bioinformation systems for
seamless computing and data management
- The explicit management of experimental process
- Workflows, provenance, change notification
- Governance services: security, accounting,
ownership, versioning
65 Grids vs Web tools for biology
- The biology community has developed an extensive
collection of web resources to support research
- Databases and search engines (Entrez, etc.)
- Functional annotation systems (WIT, etc.)
- Organism specific databases (EcoCyc, etc.)
- Literature search engines (PubMed, etc.)
- Web based modeling systems (VCell, etc.)
66 A Modest Proposal myGrid paper!!
- Build a bioinformatics applications development
layer on top of basic grid services
- Think Grid-enabled Matlab Toolkit for Biology
- Re-engineer the bioinformatics database integration
layer to target the Grid services model for access
- Virtualize access to biology databases
- Deploy a network of virtual bioinformatics
computer centers leveraging existing BioGrid
resources and new Grid infrastructure (e.g.
TeraGrid etc.)
- Create a rich market of resources and services
based on a common view of BioGrid
67 Mathematical Toolkits for Modeling Biological
Systems
- A Mathematica for molecular, cellular and
systems biology
- Core data models and structures
- Optimized functions
- Scripting environment, e.g. Python, PERL, etc.
- Database accessors and built-in schemas
- Simulation interfaces
- Parallel and accelerated kernels
- Visualization interfaces: info-vis and sci-vis
- Collaborative workflow and group
use interfaces
Rick Stevens, Argonne Labs