Grids and Biology - PowerPoint PPT Presentation

Slides: 67
Provided by: Carole143
Transcript and Presenter's Notes

1
Grids and Biology
  • A take on the Grid
  • Issues in Bioinformatics for Grid
  • Various BioGrids
  • Applicability of Grid to Biology
  • Reality check

2
What is the Grid?
  • "Grid computing is distinguished from
    conventional distributed computing by its focus
    on large-scale resource sharing, innovative
    applications, and, in some cases,
    high-performance orientation...we review the
    'Grid problem', which we define as flexible,
    secure, coordinated resource sharing among
    dynamic collections of individuals, institutions,
    and resources - what we refer to as virtual
    organizations."
  • From "The Anatomy of the Grid: Enabling Scalable
    Virtual Organizations" by Foster, Kesselman and
    Tuecke

3
What is the Grid?
  • Resource sharing and coordinated problem solving
    in dynamic, multi-institutional virtual
    organizations
  • On-demand, ubiquitous access to computing, data,
    and services
  • New capabilities constructed dynamically and
    transparently from distributed services
  • No central location, No central control, No
    existing trust relationships, Little
    predetermination
  • Uniformity for Pooling Resources
  • Virtual pools of resources: databases, clusters.

4
Biology as a Grid Application
  • Informational Science
  • Large Scale
  • Distributed
  • No one organisation owns it all

5
Motivation
[Figure: chart over 1990, 2000 and 2010. Labels: Computational Load, Genome Data, Moore's Law, Human Genome, Combinatorial Chemistry, Pharmacogenomics, Metabolic Pathways]
6
BioMedical Computation
Rick Stevens, Argonne Labs
7
Biomedical Data: High Complexity and Large Scale
Rick Stevens, Argonne Labs
[Figure: levels of biomedical data, each annotated with a scale label ranging from hundred thousands through millions to billions]
  • Protein-protein interactions: metabolism,
    pathways, receptor-ligand, 4º structure
  • Physiology: cellular biology, biochemistry,
    neurobiology, endocrinology, etc.
  • Polymorphism and variants: genetic variants,
    individual patients, epidemiology
  • ESTs: expression patterns, large-scale screens
  • Genetics and maps: linkage, cytogenetic,
    clone-based
  • Protein sequence: MPMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYT...
  • DNA sequence: ...atcgaattccaggcgtcacattctcaattcca...
8
BioGrid Projects
  • EUROGRID BioGRID
  • Asia Pacific BioGRID
  • North Carolina BioGrid
  • Bioinformatics Research Network
  • Osaka University BioGrid
  • Indiana University BioArchive BioGrid
  • myGrid
  • BioSim
  • e-Protein
  • OBIGrid

9
Today's Grid
  • A Single System Image
  • Transparent wide-area access to large data banks
  • Transparent wide-area access to applications on
    heterogeneous platforms
  • Transparent wide-area access to processing
    resources
  • Security, certification, single sign-on
    authentication, AAA
  • Grid Security Infrastructure
  • Data access, transfer and replication
  • GridFTP, Giggle
  • Computational resource discovery, allocation and
    process creation
  • GRAM, Unicore, Condor-G

10
Immediate benefits
  • Uniform file views of directories, regardless of
    platform
  • Grid-based data transfer libraries for faster
    access to large files, reducing need for
    mirror-site servers.
  • Replication to support mirroring
  • Grid APIs give a job manager metadata about
    services, so users can evaluate service providers
    on factors beyond server performance and
    availability.
  • Grid-aware applications: split sequence
    reference libraries among several servers, where
    BLAST comparisons can be conducted in parallel.
  • Shielding users from a variety of low-level
    computing problems they would otherwise have to
    address themselves.
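
The idea of splitting a sequence reference library among several servers and comparing in parallel can be sketched as follows. This is a minimal illustration under stated assumptions, not how any BioGrid project implemented it: worker threads stand in for the remote servers, a toy shared k-mer count stands in for a real BLAST run, and every name here (`parse_fasta`, `shard`, `search_shard`, `parallel_search`, the `LIBRARY` records) is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def parse_fasta(text):
    """Return a list of (identifier, sequence) pairs from FASTA text."""
    records, name, chunks = [], None, []
    for line in text.strip().splitlines():
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            name, chunks = line[1:].split()[0], []
        elif line:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def shard(records, n_servers):
    """Deal records round-robin into n_servers partial libraries."""
    return [records[i::n_servers] for i in range(n_servers)]


def kmers(seq, k=3):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def search_shard(query, records, k=3):
    """Toy stand-in for BLAST (assumption): score = number of shared k-mers."""
    q = kmers(query, k)
    return [(name, len(q & kmers(seq, k))) for name, seq in records]


def parallel_search(query, records, n_servers=3):
    """Search each shard concurrently, then merge and rank all hits."""
    shards = shard(records, n_servers)
    with ThreadPoolExecutor(max_workers=n_servers) as pool:
        partial = list(pool.map(lambda s: search_shard(query, s), shards))
    hits = [hit for part in partial for hit in part]
    # Highest score first; ties broken by identifier for a stable order.
    return sorted(hits, key=lambda h: (-h[1], h[0]))


# Hypothetical four-record reference library.
LIBRARY = """\
>seqA
ATCGAATTCCAGG
>seqB
GGGGGGGGGGGG
>seqC
TTCCAGGCGTCAC
>seqD
ATCGAATT
"""

if __name__ == "__main__":
    for name, score in parallel_search("ATCGAATTCC", parse_fasta(LIBRARY)):
        print(name, score)
```

The merge step is what keeps sharding safe: each shard is scored independently, so the ranked result does not depend on how many servers the library was split across.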

11
Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
12
Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
13
Classical Grids
  • Classical Grids emphasise sharing of physical
    resources.
  • Existing Grid middleware (e.g. Globus, Condor,
    Unicore) allows resource discovery, resource
    allocation, data movement, certification

14
High Performance Bioinformatics Software
Jack da Silva, NCSC, Paracel
15
European DataGrid
16
  • Managed access to specialist remote resources

17
  • Access portal for biomolecular modeling
    resources.
  • Interfaces that enable chemists and biologists
    to submit work to HPC facilities
  • Visualization of the electrostatic field
    generated by a molecule.
  • Dr. Krzysztof Nowinski (ICM)

18
Biogrid system
[Figure: Biogrid system diagram. Labels as extracted:]
SCORE Management Station (shown twice)
Myrinet-2000
Connected to Grid system 3
Grid system 1: Express5800/ISS for PC-Cluster, Xeon 2.2 GHz x 8, Management node 1
Flat Neighborhood networks
1000Base-SX
Grid system 2: NEC Blade Server, 78 nodes (156 CPUs)
1000Base-T x 12
Data Grid Disk: Express5800/140Ra-4 x 3
19
Remote control of instruments
  • Sharing of the UHVEM (Ultra High Voltage Electron
    Microscope) at Osaka University with NCMIR
    (National Center for Microscopy and Imaging
    Research)
  • 3 million electron volts
  • The most powerful microscope

20
Home Computers Evaluate AIDS Drugs
  • Community
  • 1000s of home computer users
  • Philanthropic computing vendor (Entropia)
  • Research group (Scripps)
  • Common goal: advance AIDS research

From Steve Tuecke 12 Oct. 01
21
Matlab
Geodise release in November 02 sjc_at_soton.ac.uk
  • Matlab and toolboxes for mathematical
    computation, analysis, visualization, and
    algorithm development

MATLAB is an intuitive language and a technical
computing environment. It provides core
mathematics and advanced graphical tools for data
analysis, visualization, and algorithm and
application development. With more than 600
mathematical, statistical, and engineering
functions, engineers and scientists rely on the
MATLAB environment for their technical computing
needs. (www.mathworks.com)
CROSS PLATFORM/ OS
22
BioSim -- Molecular simulations as a tool for
protein structure analysis
Sansom
[Figure labels: synchrotron, compute Grid, MD database, novel biology]
  • Overall vision: simulation as an integral
    component of structural genomics
  • Needs both capacity (many systems) and capability
    (large systems - HPCx)
  • Molecular Dynamics database (distributed)

23
Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
24
Visualization Bioinformatics
Rick Stevens Argonne Labs
[Figure: Visualization Environment. Labels: Bioinformatic Analysis Tools; Microbiology; Biochemistry; Function Assignment; Genome Visualization Tools; Whole Genome Analysis; Metabolic Reconstruction; Enzymatic Constants; Metabolic Network Visualization Tools; Stoichiometric Representation; Flux Analysis; Proteomics; Interactive Stoichiometric Graphical Tools; Dynamic Simulation; Whole Cell Visualizations; Image/Spectra Augmentations; Laboratory Verification]
25
X-ray microtomography
  • Scientific discovery can be enhanced by closely
    coupling computation and experiment. Simulation,
    visualization and data gathering coupled
  • X-ray microtomography produces 3D X-ray
    attenuation maps of specimens at a microscopic
    level
  • Expensive synchrotron beam time resources
    optimally used to obtain sufficient resolution
    for simulation

26
Interactive Steering
  • User steers calculation from laptop
  • Controlled steering on supercomputers
  • Visualization and computation use large scale
    machines accessed via Grid.
  • Enables controlled simulation using knowledge and
    skills of trained scientist.

27
Scalable molecular dynamics
  • Structure of a protein in a fluid medium
  • Calculation takes into account forces between
    protein and ambient medium (in this case water
    molecules)
  • Run on the world's largest academic computer,
    LeMieux at PSC (6 Tflops theoretical peak)

28
Grid Landscape
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
29
UCSF
UIUC
From Klaus Schulten, Center for Biomolecular
Modeling and Bioinformatics, Urbana-Champaign
30
http://www.ks.uiuc.edu/Research/biocore/
31
Grid Landscape DATA!!
Computationally Intensive
Collaborative
Visualisation
Data Intensive
Knowledge Intensive
32
Information Weaving and Question Answering
  • Large amounts of different kinds of data and
    many applications.
  • Highly heterogeneous.
  • Different types, algorithms, forms,
    implementations, communities, service providers
  • High autonomy.
  • Highly complex and inter-related, volatile.

33
Mike Sternberg
Annotation Pipeline
34
myGrid
RASMOL
  • Personalised extensible environments for
    data-intensive in silico experiments in biology

  • Straightforward discovery, interoperation,
    deployment and sharing of services
  • Service-oriented architecture
  • Integration and Information
  • Workflow and Databases
  • Experimentation
  • Provenance, propagating change, personalisation

For bioinformaticians who are building tools and
using or providing services
35
DiscoveryNet
http://www.discovery-on-the.net/
High Throughput Sensing (HTS) Applications:
Large-scale Dynamic Real-time Decision Support,
Large-scale Dynamic System Knowledge Discovery
Based on Kensington Discovery Platform
Grid-based Knowledge Discovery: Grid-based Data
Mining, Collaborative Visualisation
Information Structuring: Information Integration,
Composition, Semantics, Domain-based
Ontologies, Sharing
Distributed Data Engineering: Data Registration,
Data Normalisation, Data Quality
Based on Globus ORB Infrastructure
High Throughput Computing Services:
Utilising Grid Infrastructure for HT Computing
Grid Basic Infrastructure: Globus/Condor/SRB
36
Grid Evolution
  • 1st Generation Grid
  • Computationally intensive, file access/transfer
  • Bag of various heterogeneous protocols and
    toolkits
  • Recognises internet, ignores Web
  • Academic teams
  • 2nd Generation Grid
  • Data intensive -> knowledge intensive
  • Services-based architecture
  • Recognises Web and Web services
  • Global Grid Forum
  • Industry participation

We are here!
37
A Grid of resources, not just compute resources
but databases, digital libraries, instruments,
workflows, documents
A Grid vs The Grid
NovartisGrid
BioSimGrid
MouseGrid
Logical
Grid Middleware
These configurations are dynamic: resources are
discovered, combined, used and disbanded as and
when needed or available.
Gigabit IP Network
Physical
Node
Node
Node
Geographically (e.g. UKGrid)
Node
38
A configuration of resources
services
  • Not just compute services but databases, digital
    libraries, instruments, workflows, documents

Open Grid Service Architecture OGSA
Grid Services
Web Services
Grid Technology
39
Bio Services
  • Drug Discovery
  • Microbial Engineering
  • Molecular Ecology
  • Oncology Research

Domain Oriented Services
  • Integrated Databases
  • Sequence Analysis
  • Protein Interactions
  • Cell Simulation

Basic BioGrid Services
Grid Resource Services
  • Compute Services
  • Pipeline Services
  • Data Archive Service
  • Database Hosting
  • Workflow Enactment
  • Event notification

Common Services
Base Services
Fabric Services
40
What We Need to Create
  • Grid Bio applications enablement software layer
  • Provide applications access to Grid services
  • Provides OS independent services
  • Grid enabled version of bioinformatics data
    management tools (e.g. DL, SRS, etc.)
  • Need to support virtual databases via Grid
    services
  • Grid support for commercial databases
  • Bioinformatics applications plug-in modules
  • End user tools for a variety of domains
  • Support major existing Bio IT platforms

41
Requirements for the BioGrid
  • Open and extendable architecture
  • Enable tie in to service stack at appropriate
    points
  • Not just access via Portals
  • Leverage scripting tools in wide use for
    Bioinformatics
  • Create BioGrid services bindings for PERL and
    Python
  • Address data federation and integration
  • Leverage work of IBM, Lion BioSciences, DAS,
    BioMOBY, etc.
  • Match the biology workflow and tool chain
  • Create high-level BioGrid services to address
    critical stages in existing workflow
  • Support composability of new BioGrid tools with
    existing tool chain elements

42
Some BioGrid Challenges
  • Scalable human bioinformatics expertise
  • Best people working on the important problems
  • Exploit collaboration technology to create world
    class teams
  • Robust local bioinformatics computing environment
  • Best systems administrators and high-end
    technologies
  • Embed local resources into the Grid via portal
    technologies
  • Access to leading edge bioinformatics software
    and databases customized to user needs
  • Core content from top scientists and developers
  • Integrated access to biological databases
  • Worldwide access to robust computing and database
    infrastructure
  • Leverage Grid technology to provide worldwide
    access
  • Integrate purpose built systems and service
    providers

43
Reality Checks!!
  • The Technology is Ready
  • Not true: it's emerging
  • Building middleware, advancing standards,
    developing dependability
  • Building demonstrators.
  • The computational grid is in advance of the data
    intensive middleware
  • Integration and curation are probably the
    obstacles
  • But!! It doesn't have to be all there to be
    useful.
  • We know how we will use grid services
  • No Disruptive technology
  • Lower the barriers of entry.

44
Reality Checks!!
  • It's the only game
  • Not true: I3C, BioMOBY, bioDAS, OMG LSR
  • Grid and Web service merge makes integration
    likely.
  • One Size Fits All
  • Not true
  • Addressed by a minimum set of composable virtual
    services, but starting with Globus
  • It's only for big science
  • No: small science collaborates too!
  • Biology is not unique!
  • AstroGrid

45
Not a silver bullet!
  • It's just middleware, not magic
  • Data quality
  • Content management of databases (controlled
    vocabularies)
  • Provenance and versioning policies
  • Appropriate use of tools
  • Computational inaccessibility of free text
    annotation
  • Database accessibility through means other than
    point and click web interfaces.
  • Independent of the Grid!

46
Life Sciences Grid (LSG)
http://people.cs.uchicago.edu/dangulo/LSG/
47
Spares
48
  • Sun BioGrid symposium web site
  • GGF Life Sciences web site

49
An International Systems Biology Grid
  • A Data, Experiment and Simulation Grid linking:
  • People: biologists, computer scientists,
    mathematicians, etc.
  • Experimental systems: arrays, detectors, MS, MRI,
    EM, etc.
  • Databases: data centres, curators, analysis
    servers
  • Simulation resources: supercomputers,
    visualization, desktops
  • Discovery resources: optimized search servers
  • Education and teaching resources: classrooms,
    labs, etc.
  • Different from, and finer grained than, current
    Grid projects
  • More laboratory integration: small laboratory
    interfaces
  • Many participants will be experimentalists:
    workflow, visualization
  • More diversity of data sources and databases:
    integration, federation
  • More portals to simulation environments: Advanced
    Photon Source models

Rick Stevens, Argonne Lab
50
Marketeer Figure
[Figure: chart over 1990, 2000 and 2010. Labels: Petabytes, Computational Biology, Moore's Law, Human Genome, Combinatorial Chemistry, Pharmacogenomics, Metabolic Pathways]
Abbas Farazdel, IBM Life Sciences
51
Motivation
52
Matlab
  • Flexibility
  • Integrated Development Environment
  • Provides full GUI Support and can integrate code
    from Fortran, C, Java, Web/ Grid Services
  • Monitoring
  • Command line/ batch mode
  • Maintainable
  • Matlab Scripts readable and editable by Engineers
  • Usability
  • Already in widespread use by Engineers

53
(No Transcript)
54
Views on the Grid
Grid collection of services
Knowledge questions
55
Grid Computing from Matlab
  • Globus commands (based on Java CoG v0.9.13)
  • gd_CreateProxyCertVisual
  • gd_BatchSubmit
  • gd_JobStatus
  • gd_GetFile (GSIftp v0.9.14)
  • gd_PutFile (GSIftp v0.9.14)
  • Database commands
  • gd_archive
  • gd_query
  • gd_retrieve

Geodise release in November sjc_at_soton.ac.uk
56
myGrid Framework
Portal
Work Bench
Applications UTOPIA
Bio-Medical Services Library: DAS, Talisman,
workflow sets
Upper level knowledge-based Grid Common
Services: semantic integration, knowledge-based
querying, workflow composition, visualisation,
provenance management, semantic service discovery
Middle level Grid Common Services: database
access, distributed query processing, service
discovery, workflow enactment, event notification
Low level Grid Common Services (OGSI):
co-scheduling, data shipping, authentication, job
execution, resource monitoring, replication
57
  • DiscoveryNet get slides from Yike
  • E-Protein get slides from Mike
  • myGrid
  • BioSim
  • OBIGrid
  • Canada BioGrid
  • Paracel
  • Novartis
  • EU Data Grid get slides from Vincent

58





Phenotype Engineering
59
(No Transcript)
60
Software Infrastructure in Drug Discovery
Ontologies and Domain Specific Integration
61
A Grid not the Grid
  • The Grid is the middleware to build a Grid.
  • A Grid of resources, not just compute resources
    but databases, digital libraries, instruments,
    workflows, documents
  • geographically (UKGrid)
  • for a particular community (MouseGrid)
  • to solve a particular problem (BioSim,
    e-Protein)
  • enterprise (Novartis Grid)
  • organised into tiers (CERN Physics Grid).
  • These configurations are dynamic
  • Resources discovered, combined, used and
    disbanded as and when needed or available.
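
The "discovered, combined, used and disbanded" lifecycle can be sketched in a few lines. This is only an illustration under assumptions: an in-memory registry stands in for real Grid middleware, and all names (`Resource`, `Registry`, `grid`, "cluster-1", "md-db") are hypothetical rather than taken from Globus or any project named on the slides.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Resource:
    name: str
    kind: str               # e.g. "compute", "database", "instrument"
    available: bool = True


@dataclass
class Registry:
    resources: list = field(default_factory=list)

    def register(self, resource):
        self.resources.append(resource)

    def discover(self, kind):
        """Return the currently available resources of one kind."""
        return [r for r in self.resources if r.kind == kind and r.available]


@contextmanager
def grid(registry, name, needs):
    """Assemble 'a Grid' from one resource per needed kind; disband on exit."""
    members = []
    try:
        for kind in needs:
            found = registry.discover(kind)
            if not found:
                raise LookupError(f"no available {kind} resource for {name}")
            found[0].available = False      # combine: claim the resource
            members.append(found[0])
        yield members                        # use
    finally:
        for resource in members:             # disband: release everything
            resource.available = True


if __name__ == "__main__":
    registry = Registry()
    registry.register(Resource("cluster-1", "compute"))
    registry.register(Resource("md-db", "database"))
    with grid(registry, "BioSim", ["compute", "database"]) as members:
        print([r.name for r in members])
    print(len(registry.discover("compute")))
```

Using a context manager means the configuration is released even if assembling it fails halfway, which is one simple way to model a Grid that exists only while it is needed.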

62
Grid Evolution
  • Computationally intensive, file access/transfer
  • -> data intensive
  • -> knowledge intensive
  • Bag of protocols and toolkits
  • -> services architecture
  • A few academic teams
  • -> Global Grid Forum and industry engagement

63
Will Biology Dominate the Grid?
  • The largest science discipline
  • The most scientists (Globally 500,000-1,000,000)
  • The most research funding (Globally 50
    Billion/year)
  • The most graduate students (>20,000/year)
  • Strong couplings to
  • Medicine and human health
  • Agriculture and food supplies
  • Energy and ecology
  • Future industrial processes (bio-nano)
  • Consumer of other scientific technologies

64
BioGrid Needs To Provide
  • An open platform to facilitate interoperability
  • Scalable compute and data capabilities beyond
    that available locally
  • Distributed infrastructure available 24x7
    worldwide
  • Enables leverage of remote systems administration
    and support via service providers
  • Integration with local bioinformation systems for
    seamless computing and data management
  • The explicit management of experimental process
  • Workflows, provenance, change notification
  • Governance services Security, accounting,
    ownership, versioning
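
One way to picture "workflows, provenance, change notification" together is a workflow runner that records a digest of every step's input and output and notifies listeners when an output differs from an earlier run. The sketch below is an assumption-laden illustration, not the myGrid design: the `Workflow` class, the `digest` helper and the two example steps are all hypothetical.

```python
import hashlib
import json


def digest(value):
    """Stable SHA-256 digest of a JSON-serialisable value."""
    blob = json.dumps(value, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


class Workflow:
    def __init__(self):
        self.steps = []        # (name, version, function) in execution order
        self.provenance = []   # one record per executed step of the last run
        self.listeners = []    # callables notified when a step output changes

    def add_step(self, name, version, func):
        self.steps.append((name, version, func))

    def run(self, data, previous=None):
        """Run all steps; compare outputs with a previous provenance log."""
        self.provenance = []
        for i, (name, version, func) in enumerate(self.steps):
            result = func(data)
            record = {
                "step": name,
                "version": version,
                "input": digest(data),
                "output": digest(result),
            }
            self.provenance.append(record)
            changed = (
                previous is not None
                and i < len(previous)
                and previous[i]["output"] != record["output"]
            )
            if changed:
                for notify in self.listeners:
                    notify(record)
            data = result
        return data


def gc_fraction(seq):
    """Fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


if __name__ == "__main__":
    wf = Workflow()
    wf.add_step("clean", "1.0", lambda s: s.strip().upper())
    wf.add_step("gc", "1.0", gc_fraction)
    print(wf.run(" atgc "))
    for record in wf.provenance:
        print(record["step"], record["output"][:12])
```

Keeping digests rather than full values makes the provenance log cheap to store and compare, at the cost of not being able to reconstruct the data from the log alone.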

65
Grids vs Web tools for biology
  • The biology community has developed an extensive
    collection of web resources to support research
  • Databases and search engines (entrez, etc)
  • Functional annotation systems (wit, etc.)
  • Organism specific databases (ecocyc, etc.)
  • Literature search engines (pubmed, etc.)
  • Web based modeling systems (vcell, etc.)

66
A Modest Proposal myGrid paper!!
  • Build a bioinformatics applications development
    layer on top of basic grid services
  • Think of a Grid-enabled Matlab toolkit for biology
  • Re-engineer bioinformatics database integration
    layer to target Grid services model for access
  • Virtualize access to biology databases
  • Deploy a network of virtual bioinformatics
    Computer Centers leveraging existing BioGrid
    resources and new Grid infrastructure (e.g.
    TeraGrid etc.)
  • Create rich market of resources and services
    based on common view of BioGrid

67
Mathematical Toolkits for Modeling Biological
Systems
  • A Mathematica for molecular, cellular and
    systems biology
  • Core data models and structures
  • Optimized functions
  • Scripting environment, e.g. Python, Perl, etc.
  • Database accessors and built-in schemas
  • Simulation interfaces
  • Parallel and accelerated kernels
  • Visualization interfaces: info-vis and sci-vis
  • Collaborative workflow and group-use interfaces

Rick Stevens, Argonne Labs