GTL- Modeling - PowerPoint PPT Presentation

About This Presentation
Title:

GTL- Modeling

Description:

GTL Modeling – PowerPoint PPT presentation

Slides: 40
Provided by: roger167

Transcript and Presenter's Notes

Title: GTL- Modeling


1
GTL- Modeling & Simulation
  • Submitted slides

2
1) How DOE-type big-iron computing could in
principle help biology (I am leading with things
that I believe are sensible rather than starting
from within the DOE's organismic and scientific
constraints)
A) Molecular-dynamics-type stuff, structure
prediction, etc.
B) Docking small molecules and proteins
  • Small molecules to proteins
  • Guided docking of proteins from starting
    compounds to easily synthesized derivative
    pharmacophores using genetic and biochemical
    QSAR information
  • One grand challenge here: use computational
    methods to identify µM inhibitors of all the
    enzymes encoded by a microbial genome. You get
    new drug leads, it helps build up national
    rapid-response capability, and it helps educate
    computer-type people in DOE more about the
    single major industrial application of biology.
  • Another: use the structural insight coupled
    with evolution to change the specificity and
    catalytic properties of enzymes that make stuff
    (hydrogen, useful polymers) or break it down
    (cleanup)
C) Vaccines
  • Use sequence analysis and structure prediction
    to pick all the good B cell epitopes from a
    microbial genome.
  • Use sequence analysis and structural
    information about human Class 1 and Class 2 to
    pick all the good T cell epitopes from a
    microbial genome.
D) Any simulation work, particularly
whizzing-molecule simulations, should we would-be
simulationists succeed.
Roger Brent
3
2) Barriers to the above: hardware, software,
algorithms? Yes.
3) How would you measure success?
a) Predicted structure of the majority of proteins
in a newly sequenced bacterium or virus within one
week of sequencing, with predictions validated by
experiment (2006)
b) Validated lead drug compounds against new
targets in a virus or bacterium one year after
work starts (2005)
c) Predicted vaccine one week after a new
microorganism is sequenced, with B and T cell
epitopes going into validation steps (2006)
d) Simulation would have to work, give nontrivial
insight, and be deemed to do so by a majority of
academic biologists, NIH, HHMI, and NAS (2010)
4) Resources
a) Many questions seem to bear on simulation.
Almost moot until simulation works.
b) The structure/drug/vaccine ideas would require
an increase in DOE internal competency. A 20-year
commitment would be completely appropriate. NIH,
NSF and industry fund some efforts along these
lines now. A serious effort on the
structure/drug/vaccine front would require circa
$1/2-1B/year, would probably need to be spent at a
new, urban center rather than a national lab, and
most of it wouldn't be computation. Could be
complementary with NIH.
5) Why undertake the work?
a) Better security against biological attack on
people, animals, plants, materiel, our ecology
b) This capability is part of stewardship of the
planetary ecology, with DOE handling the
microbial ecology
6) A general consideration that would help MSI
interact with DOE, and DOE interact with the
current research environment outside of the
national labs: all DOE software should be open
source under LGPL or equivalent, and all
biological and chemical reagents freely licensed
using standard academic treaty-type MTAs. JGI
delays data release for a year, which is not the
NIH or MRC/Wellcome standard.
Roger Brent
4
From the DOE's report on the GTL mathematics
workshop: "DOE's current responsibility for
remediating 1.7 trillion gallons of contaminated
groundwater and 40 million cubic meters of
contaminated soil demonstrates the
significance and scale of the need for a new
computational biology program." Most academic
biomedical biologists won't buy this, and
will consider it a non sequitur.
Roger Brent
5
Larry Lok, The Molecular Sciences Institute.
  • Data management infrastructure. Data analysis,
    knowledge infrastructure, data mining.
  • Flexibility: facilitate development of
    intelligent, domain-specific interfaces. Monod.
  • TIA on recent publications. Help supplant
    publication?
  • Protein complexes are a database challenge.
    Adapting in-memory techniques.
  • Inference tools for ... distributed biological
    data.
  • Quantitative simulation largely unsupported by
    current big piles of data.
  • Prediction of protein-protein interaction
    kinetics via MD.
  • Behavioral data, experimental and from
    simulation.
  • Inference of reductionist models. Reaction
    networks and their parameters.
  • Qualitative modeling styles: QDE, dynamic
    Bayesian networks, etc.
  • Data analysis, modeling, visualization
    facilities.
  • Batch uniP/SMP jobs always popular. SMP support
    in tera-scale facilities?
  • Toward device independence?
  • Reaction network generation: discrete-event-style
    difficulties.
  • Reaction network simulation: familiar territory
    for Nat. Labs?
  • ODE-like approaches. Connectivity clustering to
    reduce bandwidth.
  • Spatial approaches: PDE, particle, etc. Spatial
    distribution. Visualization demands both for
    setup (e.g. modeling membrane or E.R.) and
    analysis.

6
Prediction of Protein Structure
Very low homology: T0173, Mycothiol deacetylase
  • Goals
  • - Better understanding of evolutionary
    relationships
  • - Characterization of molecular function
  • - Guiding further experiments
  • Major challenges
  • Comparative modeling (homology modeling)
  • Reliability of sequence alignments
  • Identification and modeling of structural change
  • Refinement!
  • Fold recognition
  • Sequence alignments (potentially combinatorial)
  • Whole genome applications (model quality
    assessment)
  • De novo structure prediction
  • Still mostly an unsolved problem!
  • Importance of methods development cannot be
    overestimated

K. Fidelis
7
Modeling and Simulation Issues
  • Data flow in a heterogeneous environment
  • Avoid bottlenecks, archiving, distributing
  • Build in performance measures
  • Complex modeling capability
  • Universality of storage/compression details
  • Capacity may be more important than capability
  • Parallel paradigms
  • Decomposable in space/time, macro/micro?
  • Security
  • Manageability
  • Scalability
  • Systems approach to design, user involvement
  • Keep it simple, focused, useful; don't reinvent
  • Choices have costs

Stephen Elbert, IBM
8
Inference and modeling of Microbial Regulatory
and Signaling Pathways
  • Reverse engineering problem: build pathway
    models that are most consistent with genomic,
    proteomic, metabolic data and general biological
    knowledge
  • Data mining is an essential first step in solving
    the reverse engineering problem: a great
    amount of information is hidden in the often
    noisy, incomplete, and sometimes conflicting data
  • Computational prediction/modeling and data
    collection through experiments should be one
    integrated process: computation should be a key
    driver for rational design of experiments
  • Computational challenges
  • It is a highly challenging computational
    problem to rigorously reverse engineer or solve
    a network model (e.g., Boolean network, Bayesian
    network, Petri net) that best matches known
    data/knowledge
  • given a list of candidate genes possibly involved
    in a regulatory/signaling network, their
    predicted functions, their predicted interactions
    and causality relationships, their predicted
    regulatory elements,
  • Network validation problem: how to design a set
    of experiments that could provide the maximal
    amount of information, in the most economic
    manner, for validation, rejection and revision of
    network models

Y.Xu
9
phosphorus assimilation pathway
Y.Xu
10
Petascale Distributed Data Analysis
Important issues for mining massive biological
data sets:
  • Distributed: Existing methods work on a single
    centralized dataset. Data transfer is
    prohibitive.
  • Scalable: Popular methods do not scale in terms
    of time and storage.
  • High-dimensional: Need new methods that scale up
    with the number of dimensions.
  • Dynamic: Most methods work with static data.
    Changes lead to complete re-computation.
[Figure labels: raw data, genomes, regulatory
elements, protein structure, pathways, models]
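The "Distributed" and "Dynamic" points above can be illustrated with a small sketch. It is not taken from the slides: it assumes the statistic of interest is a mean and variance, and it uses the standard online (Welford) update plus the pairwise merge formula, so that each site ships a three-number summary instead of its raw data and new values are absorbed without a full re-scan. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Summary:
    """Running count, mean, and sum of squared deviations (m2)."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0

    def add(self, x: float) -> None:
        # Online update: constant work per new value, no re-computation.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def merge(self, other: "Summary") -> "Summary":
        # Combine two sites' summaries without moving their raw data.
        if self.n == 0:
            return Summary(other.n, other.mean, other.m2)
        if other.n == 0:
            return Summary(self.n, self.mean, self.m2)
        n = self.n + other.n
        delta = other.mean - self.mean
        mean = self.mean + delta * other.n / n
        m2 = self.m2 + other.m2 + delta * delta * self.n * other.n / n
        return Summary(n, mean, m2)

    @property
    def variance(self) -> float:
        # Population variance; 0.0 for an empty summary.
        return self.m2 / self.n if self.n else 0.0


def summarize(values) -> Summary:
    """Build a summary from an iterable of numbers."""
    s = Summary()
    for v in values:
        s.add(v)
    return s


if __name__ == "__main__":
    site_a = summarize([1.0, 2.0, 3.0])
    site_b = summarize([4.0, 5.0, 6.0, 7.0])
    combined = site_a.merge(site_b)
    print(combined.n, combined.mean, combined.variance)
```

Only simple statistics merge this cleanly; clustering or network inference over distributed data would need different, method-specific summaries.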
11
Computational Feasibility on a Teraflop Computer
  • Biological Data Growth Trend
  • Genome Assembly: 300TB/genome
  • Protein Structure Prediction: PetaByte
  • Simulations of Bionetworks: 1000s of PBs
  • Algorithmic Complexity
  • Calculate means: O(n)
  • Calculate FFT: O(n log(n))
  • Clustering algorithms: O(n^2)

Algorithm Complexity

Data size, n   n            n log(n)     n^2      n^3
1MB            10^-6 sec.   10^-5 sec.   1 sec.   11 days
100MB          10^-4 sec.   10^-3 sec.   3 hrs    31 millennia
10GB           10^-2 sec.   0.1 sec.     3 yrs.   10^11 x age of the Universe

Bottom line: Bigger computers aren't going to
solve our problems. We need breakthroughs in
modeling and simulation algorithms.
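The run times on this slide follow from dividing an operation count by the machine speed. The sketch below reproduces that arithmetic under stated assumptions that are not on the slide: exactly 10^12 operations per second, one operation per data element per unit of complexity, n equal to the number of bytes, and a base-2 logarithm.

```python
import math

FLOPS = 1e12  # assumption: a teraflop machine sustains 10^12 operations/s
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

# Operation counts for each complexity class, as a function of n.
COMPLEXITIES = {
    "n": lambda n: n,
    "n log(n)": lambda n: n * math.log2(n),  # log base 2 is an assumption
    "n^2": lambda n: n ** 2,
    "n^3": lambda n: n ** 3,
}


def runtime_seconds(n: float, complexity: str) -> float:
    """Idealized run time: operation count divided by machine speed."""
    return COMPLEXITIES[complexity](n) / FLOPS


if __name__ == "__main__":
    for label, n in [("1MB", 1e6), ("100MB", 1e8), ("10GB", 1e10)]:
        cells = ", ".join(
            f"{name}: {runtime_seconds(n, name):.1e} s" for name in COMPLEXITIES
        )
        print(f"{label:>6}  {cells}")
```

For n = 10^6 this gives 1 second for n^2 and 10^6 seconds (about 11.6 days) for n^3; for n = 10^8 the n^2 case takes 10^4 seconds, a little under 3 hours.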
12
100 TeraFLOP Computers Enable First Principles
MD simulations of Enzyme Mechanisms
We are starting to study enzyme mechanisms
We have been using FPMD to simulate the chemical
reactions
FPMD: 500 atoms, 10^-11 sec
[Figure labels: G1, C2, G3, His162, Mg2+, Glu14,
Asp12, Asp167]
  • Constrained FPMD simulations of drug with 70
    water molecules (231 atoms total)
  • 80,000 basis functions
  • Computational requirements
  • ASCI Blue: 1 ps on 36 nodes takes 5 days
  • TC2K: 1 ps on 27 nodes takes 3.1 days

Long-term GTL applications
  • Design of O2 resistant hydrogenases
  • Re-engineering substrate specificity of
    degradative enzymes
  • Modify properties of DNA-binding regulatory
    enzymes

M. Colvin
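The two timings quoted on this slide can be turned into a rough cost estimate. The sketch below is only arithmetic on those numbers; it assumes, which the slide does not state, that wall time grows linearly with simulated time at a fixed node count, and the class name is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    machine: str
    nodes: int
    days_per_ps: float  # wall-clock days per simulated picosecond

    @property
    def node_days_per_ps(self) -> float:
        """Total machine cost of one simulated picosecond."""
        return self.nodes * self.days_per_ps

    def days_for(self, picoseconds: float) -> float:
        # Assumption: cost is linear in simulated time.
        return self.days_per_ps * picoseconds


BENCHMARKS = [
    Benchmark("ASCI Blue", nodes=36, days_per_ps=5.0),
    Benchmark("TC2K", nodes=27, days_per_ps=3.1),
]

if __name__ == "__main__":
    for b in BENCHMARKS:
        print(
            f"{b.machine}: {b.node_days_per_ps:.1f} node-days per ps, "
            f"{b.days_for(10):.0f} days for 10 ps (10^-11 s)"
        )
```

Under that linearity assumption, reaching 10 ps for the 231-atom system would take about 50 days on the first machine and about 31 days on the second.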
13
Integrative Cellular models (for E.E. Selkov,
MCS, Argonne)
  • Integrative Cellular Models
  • Imperative for whole cell simulation
  • Expose modeling/mathematical issues
  • Conceptual, computational, algorithmic,
    infrastructural
  • Integration with Bioinformatics
  • Uniting genomics, proteomics, metabolomics
  • Verification methodology
  • Experimental closure
  • Cyanobacterial Dynamics Models
  • Integrates genetic, MRT and regulatory data
  • Integrates bioinformatics data (EMP, WIT)
  • Practical importance
  • Carbon sequestration, bio-H2 via optimal
    engineering
  • Circadian clock model: fundamental problem in
    biology with a lot of applications
  • Experimental verification
  • Proteomics, metabolomics
  • Experimental facility (CHM, Purdue)
  • Modeling Challenges
  • Bridging the scale gap (spatial and temporal)
  • Multi-scale/multi-model approach
  • Integrating micro and macro description
  • New bioinformatics data model and DB
  • Parameter determination, model validation
  • Systemic approach
  • Computation and Algorithms
  • Large-scale parallel simulation
  • Scalable stiff/differential-algebraic integrators
  • Multi-objective constrained optimization
  • Combinatorial and continuous
  • Integration with databases
  • Multi-parameter bifurcation and sensitivity
    analysis

14
Integrative Cellular Models
  • Imperative for whole cell simulation
  • Expose modeling/mathematical issues
  • Conceptual, computational, algorithmic,
    infrastructural
  • Integration with Bioinformatics
  • Uniting genomics, proteomics, metabolomics
  • Verification methodology
  • Experimental closure

15
Modeling Challenges
  • Bridging the scale gap (spatial and temporal)
  • Multi-scale/multi-model approach
  • Integrating micro and macro description
  • New bioinformatics data model and DB
  • Parameter determination, model validation
  • Systemic approach

16
Cyanobacterial Dynamics Models
  • Integrates genetic, MRT and regulatory data
  • Integrates bioinformatics data (EMP, WIT)
  • Practical importance
  • O2 production, bio-H2 via optimal engineering
  • Synchronous population cultivation
  • Experimental verification
  • Proteomics, metabolomics
  • Experimental facility (CHM, Purdue)

17
Computation and Algorithms
  • Large-scale parallel simulation
  • Scalable stiff/differential-algebraic integrators
  • Multi-objective constrained optimization
  • Combinatorial and continuous
  • Integration with databases
  • Multi-parameter bifurcation and sensitivity
    analysis

18
  • See PDF File

19
Computational Biology Infrastructure for Complex
Microbial Communities: From Genomes to Molecular
Machines
Daniel Van Der Lelie
20
Molecular interaction networks are
revolutionizing the study of biological pathways.
Yeast now has over 20,000 measured
protein-protein, protein-DNA, and protein-small
molecule interactions. Similar networks will
soon be available for a variety of bacteria, worm,
fly, mouse, and human. There is a pressing need for
computational models and tools able to integrate
molecular interaction networks with molecular
states on a global scale.
Pathway mapping: Identify and verify pathways and
complexes of interactions (circuit modules) that
correlate with the observed changes in molecular
state.
Pathway alignment: Identify conserved regions
between the networks of pathogens and hosts,
commensal species, a single species under
different environmental conditions, tissues,
stages of development, etc.
Ideker and Lauffenburger, Trends in Biotech June
2003
21
The Scientific Demand for Modeling & Simulation
High Throughput Data
Cellular Complexity
Increasing R&D efficiency and productivity
22
Developing, implementing, and delivering
model-driven research methodologies
  • Demonstrating how microbial models can drive
    biological discovery
  • Basic scientific understanding of energy-related
    biological systems (improve efficiency of
    discovery)
  • Bio-based economy, biomass-derived products
  • Bio-fuels
  • bioremediation
  • Tight integration with experimental approaches,
    guide experimental design
  • Illustrate how models provide the biological
    context for the integration of genomics,
    proteomics, metabolomics (focus on biologically
    driven integration as opposed to IT driven
    integration)
  • Demonstrated case studies with real biological
    impact! (Let the biology drive the math)
  • Provide QA/QC of biological content in models to
    support Iterative Model Development
  • Distribution of Systems Biology/Modeling
    Platforms and Methodologies (visible impact)
  • Scalable modeling framework for examining
    cellular pathways on up to heterogeneous
    microbial populations (focused on metabolism)
  • Expectation management with the biological
    community (what data do I need?)

Metabolic biochemistry at the systems-level
23
Protein and Gene Networks Inference
1. New Science: What are the underlying
principles (static and dynamic) of biological
networks?
Dynamical attractors
Scale-free static networks
Pragmatic problem: search space size (100 nodes)

Random     Scale-free   Non-chaotic   Networks with
networks   networks     networks      similar dynamics
10^3010    10^55        10^8          ?

Jean-Loup Faulon, GTL Modeling & Simulation
Workshop, July 23, 2003
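The slide does not say how its search-space sizes were derived. One simple counting argument, offered here purely as an assumption, is that an unconstrained wiring diagram on N nodes is a directed graph in which each of the N x N ordered pairs (self-loops included) either is or is not an edge. The sketch below counts those diagrams for N = 100; the function name is hypothetical.

```python
import math


def log10_wiring_diagrams(nodes: int) -> float:
    """log10 of the number of directed graphs on `nodes` labeled nodes.

    Each of the nodes*nodes ordered pairs is independently present or
    absent, giving 2**(nodes*nodes) possible wiring diagrams.
    """
    return nodes * nodes * math.log10(2)


if __name__ == "__main__":
    exponent = log10_wiring_diagrams(100)
    print(f"about 10^{exponent:.0f} wiring diagrams for 100 nodes")
    # Exact check with Python's arbitrary-precision integers.
    print(len(str(2 ** (100 * 100))), "decimal digits")
```

Under this assumption the count is 2^10000, roughly 10^3010, which is why exhaustive search over unconstrained networks is out of reach and structural priors matter.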
24
Protein and Gene Networks Inference
2. Barriers
- Reaction rates (experimental)
- Static and dynamic network characterization tools
  (algorithms and math)
- Data format standard (software and hardware):
  2-hybrid systems, phage display, MS, gene
  microarray, protein chips, bioinformatics
- Inference algorithm with sensitivity analysis
  (algorithms)
3. Success
- Biological question answered
- Inference prediction drives experiment
[Figure: Number of data points required to infer
unique parsimonious Boolean networks from microarray
data, and number of clusters with similar dynamics
vs. number of networks]
4. Resources
- Database (hardware and software)
- Manpower
Jean-Loup Faulon, GTL Modeling & Simulation
Workshop, July 23, 2003
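The point about how many data points are needed to infer a unique parsimonious Boolean network can be made concrete with a toy example. This is a minimal brute-force sketch, not the inference algorithm used by the presenter: the three-gene network in `step` is invented for illustration, and "parsimonious" is taken to mean the smallest set of input genes that makes a target gene's next state a function of those inputs.

```python
from itertools import combinations, product


def step(state):
    """Hypothetical 3-gene Boolean network used to generate toy data."""
    a, b, c = state
    return (b & c, 1 - a, a | b)


def consistent(transitions, inputs, target):
    """True if the target's next state is a function of the chosen inputs."""
    table = {}
    for state, nxt in transitions:
        key = tuple(state[i] for i in inputs)
        if table.setdefault(key, nxt[target]) != nxt[target]:
            return False
    return True


def minimal_input_sets(transitions, target, n_genes, max_k=3):
    """All smallest input sets (up to max_k genes) that explain the target."""
    for k in range(max_k + 1):
        hits = [
            combo
            for combo in combinations(range(n_genes), k)
            if consistent(transitions, combo, target)
        ]
        if hits:
            return hits
    return []


if __name__ == "__main__":
    all_states = list(product((0, 1), repeat=3))
    full = [(s, step(s)) for s in all_states]
    partial = [(s, step(s)) for s in [(0, 0, 0), (0, 1, 1), (1, 0, 1), (1, 1, 0)]]
    print("8 transitions:", minimal_input_sets(full, target=0, n_genes=3))
    print("4 transitions:", minimal_input_sets(partial, target=0, n_genes=3))
```

With all eight transitions the regulators of gene 0 are identified uniquely as genes 1 and 2; with only four transitions, three different two-gene input sets explain the data equally well, so the network is not yet determined. The search is exponential in the number of inputs tried, which is the scaling problem the slide raises.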
25
(No Transcript)
26
[Diagram: from Now (Gen II) to Need (Gen V), 1B.
Labels: Science, Technology, Pilots, Facilities,
Computing, Workshops; items marked "?" and "!"]
27
[Diagram: from Now (Gen II) to Need (Gen V), 1B.
Labels: Science, Technology, Pilots, Facilities,
Computing, Workshops; items marked "?" and "!"]
Complex Systems Interactions Active
Management Patience Focus on End-to-End
Performance On Critical Targets
28
[Diagram: from Now (Gen II) to Need (Gen V), 1B.
Labels: Science, Technology, Pilots, Facilities,
Computing, Workshops; items marked "?" and "!"]
Complex Systems Interactions Active
Management Patience Focus on End-to-End
Performance On Critical Targets
29
Quantitative and Computational Cell Biology: the
Virtual Cell Perspective
Ion I. Moraru
National Resource for Cell Analysis and Modeling
http://www.nrcam.uchc.edu
30
QCB/CCB
  • Scope and Goals: Tools for
  • Analyzing and modeling cellular function /
    subcellular to tissue scale
  • Reverse engineering and re-engineering eukaryotes
  • Issues: Power and Sophistication!
  • Spatial resolution / complex geometries
  • Temporal resolution / stiffness
  • Lack of data / parameter space searching
  • Too much data / 5D imaging, -omics
  • Stochastic behavior / particles, fluctuations
  • Encapsulation and scalability / model reuse,
    supermodels
  • Simulations: Grand Challenges?
  • Complete organelle function (mitochondria, ER)
  • 4D pattern development (embryogenesis, tissue
    repair)
  • Cellular programming (apoptosis, cell cycle)
  • Structural control (mechanics, locomotion)
  • Neuronal signal integration (Purkinje cells)

31
Performance Progress
 
Neuroblastoma Model - simulation of 20 s real
time -
32
Near-term Potential Practical Wins for Modeling
and Simulation of Microbes
  • Bioinformatics
  • Predicting Domain-Ligand Interaction using
    Signature Kernel Support Vector Machines
  • Natural Language Processing
  • Gene Finding, Phylogeny
  • Hardware & Operating Systems Research
  • What does the architecture of the computer look
    like that can solve these problems?

33
Near-term Potential Practical Wins for Modeling
and Simulation of Microbes
  • Computational Molecular Biophysics
  • 40ns Simulation of Rhodopsin Membrane Protein
    System for Insight into the determination of the
    light-adapted structure
  • Complex Systems
  • Network Modeling
  • Complex Systems
  • Massively Parallel Finite Elements and Meshing
  • Computational Technologies
  • Parallel Algorithm Development, Optimization,
    Data Mining and Management and Visualization,
    Frameworks & User Interfaces

34
PGF Raw Data Organization
Project: series of libraries that define a genome
Library: series of plates
Plate: 384 clones
Clone: 2 lanes
1 lane: 1MB each, distributed into 4 files
  • 1 FASTA file: 1KB
  • 1 scf file: 50KB
  • 1 abd file: 250KB
  • 1 rsd/ab1 file: 650KB
In May-03, PGF ran 2.5 million successful lanes:
  • 2.5TB/month
  • 10 million files
  • 0.75TB/month (9 TB/year) non-trace files
This does not include any assembly, database or
metadata!
Michael Banda
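The monthly volumes on this slide follow from the per-lane file sizes. The sketch below redoes that arithmetic with decimal units. Treating the rsd/ab1 file as the only "trace" file is an assumption made here because it reproduces the 0.75TB/month figure; the slide itself does not say which files count as trace data.

```python
# Per-lane file sizes in KB, as listed on the slide.
FILE_SIZES_KB = {"FASTA": 1, "scf": 50, "abd": 250, "rsd/ab1": 650}
LANES_PER_MONTH = 2_500_000
TRACE_KINDS = {"rsd/ab1"}  # assumption, see the note above
KB_PER_TB = 1e9  # decimal units: 1 TB = 10^9 KB


def monthly_tb(kinds) -> float:
    """Monthly volume in TB for the given file kinds."""
    kb_per_lane = sum(FILE_SIZES_KB[k] for k in kinds)
    return kb_per_lane * LANES_PER_MONTH / KB_PER_TB


files_per_month = LANES_PER_MONTH * len(FILE_SIZES_KB)
total_tb = monthly_tb(FILE_SIZES_KB)
non_trace_tb = monthly_tb(set(FILE_SIZES_KB) - TRACE_KINDS)

if __name__ == "__main__":
    print(f"files per month:     {files_per_month:,}")
    print(f"all files:           {total_tb:.2f} TB/month")
    print(f"non-trace files:     {non_trace_tb:.2f} TB/month")
    print(f"non-trace per year:  {12 * non_trace_tb:.1f} TB/year")
```

The four files sum to 951 KB per lane, which the slide rounds to 1MB, so the computed total comes out slightly below the quoted 2.5TB/month.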
35
Community Access to PGF Data
  • Access to these data is in demand by scientific
    fields that were not anticipated by the Human
    Genome Project
  • Microbiologists
  • Environmental Scientists and BioGeologists
  • Evolutionary Scientists
  • GTL projects
  • Not everyone will want the same kind of files.
  • The computational sophistication of the user
    community is uneven, at best.

Michael Banda
36
Data Organization Requirements
1. Metadata for the files being collected
-- schema definition development
-- the database system to support the metadata
-- query interfaces to query the metadata
-- possible rapid prototyping using the object-based
   tools
2. Data entry tools for the metadata
-- procedure to enforce metadata entry
-- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome
Project, but it is essential for JGI and GTL data
management.
Michael Banda
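The requirements on this slide (a schema, enforced entry, correctness checks, and a query interface) can be prototyped quickly with an embedded database. The sketch below uses Python's standard `sqlite3` module; the table name, column names, and allowed values are hypothetical and are not the JGI or PGF schema.

```python
import sqlite3

# Hypothetical schema: NOT NULL enforces entry, CHECK enforces correctness.
SCHEMA = """
CREATE TABLE lane_file (
    file_id   INTEGER PRIMARY KEY,
    project   TEXT NOT NULL,
    library   TEXT NOT NULL,
    plate     TEXT NOT NULL,
    clone     TEXT NOT NULL,
    lane      INTEGER NOT NULL CHECK (lane IN (1, 2)),
    file_type TEXT NOT NULL CHECK (file_type IN ('fasta', 'scf', 'abd', 'ab1')),
    size_kb   INTEGER NOT NULL CHECK (size_kb > 0),
    UNIQUE (project, library, plate, clone, lane, file_type)
)
"""

FIELDS = ("project", "library", "plate", "clone", "lane", "file_type", "size_kb")
INSERT = (
    "INSERT INTO lane_file (project, library, plate, clone, lane, file_type, size_kb) "
    "VALUES (:project, :library, :plate, :clone, :lane, :file_type, :size_kb)"
)


def open_catalog() -> sqlite3.Connection:
    """Create an in-memory metadata catalog."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    return conn


def register(conn: sqlite3.Connection, record: dict) -> bool:
    """Insert one metadata record; return False if the database rejects it."""
    row = {field: record.get(field) for field in FIELDS}  # missing -> NULL
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(INSERT, row)
        return True
    except sqlite3.IntegrityError:
        return False


def files_on_plate(conn: sqlite3.Connection, plate: str) -> int:
    """Example query interface: count registered files for one plate."""
    cur = conn.execute("SELECT COUNT(*) FROM lane_file WHERE plate = ?", (plate,))
    return cur.fetchone()[0]
```

Putting the checks in the schema rather than in each entry tool means every client is held to the same rules, at the cost of less friendly error messages.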
37
  • Wide agreement on general need for new
    theoretical and software infrastructure for
    systems biology, beyond molecular biology,
    bioinformatics, -omics.
  • Potential differences in details and emphasis.
  • Multiscale and large-scale stochastic simulation
    must simultaneously deal with extreme stiffness
    (Petzold), stochastics (Gillespie),
    robustness/fragility, and complexity.
  • Simulation alone is not scalable to larger
    network problems, because answering biologically
    meaningful questions for complex, uncertain
    systems requires an exponentially large number of
    simulations.
  • There are fundamental (i.e. necessary) laws
    governing the organization of biological
    networks, most remaining to be discovered.
    Without exploiting them, network complexity will
    eventually become overwhelming.
  • Dramatic progress in all areas, but lacking
    accessible exposition.
  • There have been extraordinary developments in
    the mathematics of complex networks in the last
    2-3 years, with promising applications to
    engineering and biological networks. This work
    builds on operator theory, control theory,
    dynamical systems, computational complexity, and
    semidefinite programming.

John Doyle
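As a concrete reference point for the stochastic simulation mentioned on this slide, here is a minimal sketch of Gillespie's direct method. The birth-death system and its rate constants are invented for illustration and do not come from the presentation; real network models have many species and reactions, which is where the stiffness and scaling issues raised above appear.

```python
import random


def gillespie_birth_death(k_make, k_decay, x0, t_end, seed=0):
    """Direct-method SSA for  0 -> X  (rate k_make)  and  X -> 0  (rate k_decay * X).

    Returns the event times and the molecule count after each event.
    """
    rng = random.Random(seed)
    t, x = 0.0, x0
    times, counts = [t], [x]
    while True:
        a_make = k_make          # propensity of production
        a_decay = k_decay * x    # propensity of degradation
        a_total = a_make + a_decay
        if a_total == 0.0:
            break                # nothing can happen any more
        t += rng.expovariate(a_total)  # waiting time to the next reaction
        if t > t_end:
            break
        # Choose which reaction fired, in proportion to its propensity.
        if rng.random() * a_total < a_make:
            x += 1
        else:
            x -= 1
        times.append(t)
        counts.append(x)
    return times, counts


if __name__ == "__main__":
    times, counts = gillespie_birth_death(k_make=10.0, k_decay=1.0, x0=0, t_end=50.0)
    print(f"{len(times) - 1} reaction events, final count {counts[-1]}")
```

Each run is one sample path, so estimating distributions requires many runs; that cost, multiplied over uncertain parameters, is the scaling concern the slide describes.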
38
Systems Simulations Needs
  • Most Core Simulation Technologies Available
  • Already existent simulators for
  • ODEs, SDEs, PDEs, discrete particle,
    circuit-based, geometrically changing models
  • Models are not yet large enough for simulation to
    be severely limited by hardware
  • Hybrid simulation systems still in VERY early
    development
  • Mixed deterministic and stochastic
  • Mixed discrete and continuous
  • Mixed differential and algebraic (this is the
    most sophisticated)
  • Mixed scale simulations systems also still in
    early development
  • Combining structural and kinetic modeling, e.g.
  • Formal methods for converting one model type to
    another still lacking in many areas
  • For example, conversion of the Chemical Master
    Equation to the Langevin Equation is still an art
  • ALL of these are limited by the availability of
    good biophysical models of most cellular
    processes.
  • Model Deduction and Parameter Estimation
  • New algorithms beginning to rely on statistical
    graph models: stochastic optimization,
    computationally intensive.
  • Collaborative data filtering for data constraints
    on parameters: large matrix manipulation,
    optimization
  • Model Analysis
  • Model Reduction: e.g. automated time-scale
    separation, extensions of balanced truncation
  • Model abstraction: e.g. conversion of physical
    models to circuit-like descriptions

Adam Arkin
39
Computational Biology Infrastructure for the
Analysis of Complex Microbial Communities: From
Genomes to Molecular Machines
  • Displacement
  • - Commodity chemicals
  • - Fuels
  • - Metabolic pathways and reactions

[Figure labels: CO2 (three times)]
Nitrogenase: a MoFe protein (in blue and purple
at the center) and two copies of the Fe protein
dimer bound on either end (shown in green).
Newer carbon species
  • Sequestration
  • Long-term soil storage
  • Soil C species age
  • - Increased productivity

Below ground
Older carbon species
Rhizosphere: fungi and bacteria
Community structure and metabolic diversity
Carl Anderson BNL 7/19/03