Title: GTL Modeling & Simulation
1. GTL Modeling & Simulation
2. 1) How DOE-type big-iron computing could in principle help biology (I am leading with things that I believe are sensible rather than starting from within the DOE's organismic and scientific constraints)

A) Molecular dynamics-type stuff, structure prediction, etc.

B) Docking small molecules and proteins
- Small molecules to proteins
- Guided docking of proteins from starting compounds to easily synthesized derivative pharmacophores, using genetic and biochemical QSAR information
- One grand challenge here: use computational methods to identify µM inhibitors of all the enzymes encoded by a microbial genome. You get new drug leads, it helps build up national rapid-response capability, and it helps educate computer-type people in DOE about a single major industrial application of biology
- Another: use the structural insight, coupled with evolution, to change specificity and catalytic properties of enzymes that make stuff (hydrogen, useful polymers) or break it down (cleanup)

C) Vaccines
- Use sequence analysis and structure prediction to pick all the good B cell epitopes from a microbial genome
- Use sequence analysis and structural information about human Class I and Class II to pick all the good T cell epitopes from a microbial genome

D) Any simulation work, particularly whizzing-molecule simulations, should we would-be simulationists succeed
Roger Brent
3. 2) Barriers to the above: hardware, software, algorithms? Yes.

3) How would you measure success?
a) Predicted structure of the majority of proteins in a newly sequenced bacterium or virus within one week of sequencing, with predictions validated by experiment (2006)
b) Validated lead drug compounds against new targets in a virus or bacterium one year after work starts (2005)
c) Predicted vaccine one week after a new microorganism is sequenced, with B and T cell epitopes going into validation steps (2006)
d) Simulation would have to work, give nontrivial insight, and be deemed to do so by a majority of academic biologists, NIH, HHMI, and NAS (2010)

4) Resources
a) Many questions seem to bear on simulation. Almost moot until simulation works.
b) The structure/drug/vaccine ideas would require an increase in DOE internal competency. A 20-year commitment would be completely appropriate. NIH, NSF, and industry fund some efforts along these lines now. A serious effort on the structure/drug/vaccine front would require circa $0.5-1B/year, would probably need to be spent at a new, urban center rather than a national lab, and most of it wouldn't be computation. Could be complementary with NIH.

5) Why undertake the work?
a) Better security against biological attack on people, animals, plants, materiel, our ecology
b) This capability is part of stewardship of the planetary ecology, with DOE handling the microbial ecology

6) A general consideration that would help MSI interact with DOE, and DOE interact with the current research environment outside of the national labs: all DOE software should be open source under the LGPL or equivalent, and all biological and chemical reagents freely licensed using standard academic treaty-type MTAs. JGI delays data release for a year, not the NIH or MRC/Wellcome standard.
Roger Brent
4. From the DOE's report on the GTL mathematics workshop: "DOE's current responsibility for remediating 1.7 trillion gallons of contaminated groundwater and 40 million cubic meters of contaminated soil demonstrates the significance and scale of the need for a new computational biology program." Most academic biomedical biologists won't buy this, and will consider it a non sequitur.
Roger Brent
5. Larry Lok, The Molecular Sciences Institute
- Data management infrastructure. Data analysis, knowledge infrastructure, data mining.
- Flexibility: facilitate development of intelligent, domain-specific interfaces. Monod.
- TIA on recent publications. Help supplant publication?
- Protein complexes are a database challenge. Adapting in-memory techniques.
- Inference tools for ... distributed biological data.
- Quantitative simulation largely unsupported by current big piles of data.
- Prediction of protein-protein interaction kinetics via MD.
- Behavioral data, experimental and from simulation.
- Inference of reductionist models: reaction networks and their parameters.
- Qualitative modeling styles: QDE, dynamic Bayesian networks, etc.
- Data analysis, modeling, visualization facilities.
- Batch uniP/SMP jobs always popular. SMP support in tera-scale facilities?
- Toward device independence?
- Reaction network generation: discrete-event-style difficulties.
- Reaction network simulation: familiar territory for Nat. Labs?
- ODE-like approaches. Connectivity clustering to reduce bandwidth.
- Spatial approaches: PDE, particle, etc. Spatial distribution. Visualization demands both for setup (e.g. modeling membrane or E.R.) and analysis.
6. Prediction of Protein Structure
Very low homology: T0173, mycothiol deacetylase
- Goals
- - Better understanding of evolutionary relationships
- - Characterization of molecular function
- - Guiding further experiments
- Major challenges
- Comparative modeling (homology modeling)
- - Reliability of sequence alignments
- - Identification and modeling of structural change
- - Refinement!
- Fold recognition
- - Sequence alignments (potentially combinatorial)
- - Whole-genome applications (model quality assessment)
- De novo structure prediction
- - Still mostly an unsolved problem!
- The importance of methods development cannot be overestimated
K. Fidelis
7. Modeling and Simulation Issues
- Data flow in a heterogeneous environment
- Avoid bottlenecks, archiving, distributing
- Build in performance measures
- Complex modeling capability
- Universality of storage/compression details
- Capacity may be more important than capability
- Parallel paradigms
- Decomposable in space/time, macro/micro?
- Security
- Manageability
- Scalability
- Systems approach to design, user involvement
- Keep it simple, focused, useful; don't reinvent
- Choices have costs
Stephen Elbert, IBM
8. Inference and Modeling of Microbial Regulatory and Signaling Pathways
- Reverse engineering problem: build pathway models that are most consistent with genomic, proteomic, and metabolic data and general biological knowledge
- Data mining is an essential first step in solving the reverse engineering problem: a great amount of information is hidden in the often noisy, incomplete, and sometimes conflicting data
- Computational prediction/modeling and data collection through experiments should be one integrated process; computation should be a key driver for rational design of experiments
- Computational challenges
- It is a highly challenging computational problem to rigorously reverse engineer or solve a network model (e.g., Boolean network, Bayesian network, Petri net) that best matches known data/knowledge
- given a list of candidate genes possibly involved in a regulatory/signaling network, their predicted functions, their predicted interactions and causality relationships, their predicted regulatory elements, ...
- Network validation problem: how to design a set of experiments that could provide a maximal amount of information, in the most economic manner, for validation, rejection, and revision of network models
Y.Xu
9. Phosphorus assimilation pathway
Y.Xu
10. Petascale Distributed Data Analysis
Important issues for mining massive biological data sets:
- Distributed: existing methods work on a single centralized dataset; data transfer is prohibitive
- Scalable: popular methods do not scale in terms of time and storage
- High-dimensional: need new methods that scale up with the number of dimensions
- Dynamic: most methods work with static data; changes lead to complete re-computation
[Diagram: raw data (genomes, protein structures, pathways, regulatory elements) feeding into models]
11. Computational Feasibility on a Teraflop Computer
- Biological data growth trend
- Genome assembly: 300 TB/genome
- Protein structure prediction: petabytes
- Simulations of bionetworks: 1000s of PBs
- Algorithmic complexity
- Calculate means: O(n)
- Calculate FFT: O(n log n)
- Clustering algorithms: O(n^2)
Runtime vs. data size n and algorithmic complexity:

  Data size n | O(n)      | O(n log n) | O(n^2) | O(n^3)
  1 MB        | 10^-6 sec | 10^-5 sec  | 1 sec  | 11 days
  100 MB      | 10^-4 sec | 10^-3 sec  | 3 hrs  | 31 millennia
  10 GB       | 10^-2 sec | 0.1 sec    | 3 yrs  | 10^11 x age of the Universe
Bottom line: bigger computers aren't going to solve our problems. We need breakthroughs in modeling and simulation algorithms.
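The table is easy to reproduce. A minimal sketch (ours, not from the slides), assuming one operation per data element and a sustained rate of 10^12 operations per second:

```python
# Wall-clock time for an algorithm of a given complexity on a machine
# sustaining ~1 teraflop (10^12 operations per second). The one-op-per-
# element assumption and the rate are ours, chosen to match the table.
import math

RATE = 1e12  # sustained operations per second (assumed)

complexities = {
    "n":       lambda n: n,
    "n log n": lambda n: n * math.log2(n),
    "n^2":     lambda n: n ** 2,
    "n^3":     lambda n: n ** 3,
}

for label, n in [("1 MB", 1e6), ("100 MB", 1e8), ("10 GB", 1e10)]:
    row = ", ".join(f"O({name}): {f(n) / RATE:.1e} s"
                    for name, f in complexities.items())
    print(f"{label:>7} -> {row}")
```

Running this recovers the table's pattern: for a 1 MB dataset the n^3 column already costs about 10^6 seconds (roughly 11 days), which is the slide's point about algorithms mattering more than hardware.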
12. 100 TeraFLOP Computers Enable First-Principles MD Simulations of Enzyme Mechanisms
We are starting to study enzyme mechanisms. We have been using FPMD to simulate the chemical reactions.
FPMD: ~500 atoms, 10^-11 sec
[Figure: active-site structure labeled G1, C2, G3, His162, Mg2+, Glu14, Asp12, Asp167]
- Constrained FPMD simulations of a drug with 70 water molecules (231 atoms total)
- 80,000 basis functions
- Computational requirements
- - ASCI Blue: 1 ps on 36 nodes takes 5 days
- - TC2K: 1 ps on 27 nodes takes 3.1 days
Long-term GTL applications:
- Design of O2-resistant hydrogenases
- Re-engineering substrate specificity of degradative enzymes
- Modifying properties of DNA-binding regulatory enzymes
M. Colvin
13. Integrative Cellular Models (for E.E. Selkov, MCS, Argonne)
- Integrative Cellular Models
- Imperative for whole-cell simulation
- Expose modeling/mathematical issues
- Conceptual, computational, algorithmic, infrastructural
- Integration with Bioinformatics
- Uniting genomics, proteomics, metabolomics
- Verification methodology
- Experimental closure
- Cyanobacterial Dynamics Models
- Integrates genetic, MRT, and regulatory data
- Integrates bioinformatics data (EMP, WIT)
- Practical importance
- Carbon sequestration, bio-H2 via optimal engineering
- Circadian clock model: a fundamental problem in biology with many applications
- Experimental verification
- Proteomics, metabolomics
- Experimental facility (CHM, Purdue)
- Modeling Challenges
- Bridging the scale gap (spatial & temporal)
- Multi-scale/multi-model approach
- Integrating micro and macro descriptions
- New bioinformatics data model & DB
- Parameter determination, model validation
- Systemic approach
- Computation and Algorithms
- Large-scale parallel simulation
- Scalable stiff/differential-algebraic integrators
- Multi-objective constrained optimization
- Combinatorial & continuous
- Integration with databases
- Multi-parameter bifurcation & sensitivity analysis
14. Integrative Cellular Models
- Imperative for whole cell simulation
- Expose modeling/mathematical issues
- Conceptual, computational, algorithmic, infrastructural
- Integration with Bioinformatics
- Uniting genomics, proteomics, metabolomics
- Verification methodology
- Experimental closure
15. Modeling Challenges
- Bridging the scale gap (spatial & temporal)
- Multi-scale/multi-model approach
- Integrating micro and macro descriptions
- New bioinformatics data model & DB
- Parameter determination, model validation
- Systemic approach
16. Cyanobacterial Dynamics Models
- Integrates genetic, MRT, and regulatory data
- Integrates bioinformatics data (EMP, WIT)
- Practical importance
- O2 production, bio-H2 via optimal engineering
- Synchronous population cultivation
- Experimental verification
- Proteomics, metabolomics
- Experimental facility (CHM, Purdue)
17. Computation and Algorithms
- Large-scale parallel simulation
- Scalable stiff/differential-algebraic integrators
- Multi-objective constrained optimization
- Combinatorial & continuous
- Integration with databases
- Multi-parameter bifurcation & sensitivity analysis
18. (No Transcript)
19. Computational Biology Infrastructure for Complex Microbial Communities: From Genomes to Molecular Machines
Daniel Van Der Lelie
20. Molecular interaction networks are revolutionizing the study of biological pathways. Yeast now has over 20,000 measured protein-protein, protein-DNA, and protein-small molecule interactions. Similar networks will soon be available for a variety of bacteria, worm, fly, mouse, and human. There is a pressing need for computational models and tools able to integrate molecular interaction networks with molecular states on a global scale.

Pathway mapping: identify and verify pathways and complexes of interactions (circuit modules) that correlate with the observed changes in molecular state.

Pathway alignment: identify conserved regions between the networks of pathogens and hosts, commensal species, a single species under different environmental conditions, tissues, stages of development, etcetera.

Ideker and Lauffenburger, Trends in Biotechnology, June 2003
21. The Scientific Demand for Modeling & Simulation
High-throughput data
Cellular complexity
Increasing R&D efficiency and productivity
22. Developing, implementing, and delivering model-driven research methodologies
- Demonstrating how microbial models can drive biological discovery
- Basic scientific understanding of energy-related biological systems (improve efficiency of discovery)
- Bio-based economy, biomass-derived products
- Biofuels
- Bioremediation
- Tight integration with experimental approaches; guide experimental design
- Illustrate how models provide the biological context for the integration of genomics, proteomics, and metabolomics (focus on biologically driven integration as opposed to IT-driven integration)
- Demonstrated case studies with real biological impact! (Let the biology drive the math)
- Provide QA/QC of biological content in models to support iterative model development
- Distribution of systems biology/modeling platforms and methodologies (visible impact)
- Scalable modeling framework for examining cellular pathways, on up to heterogeneous microbial populations (focused on metabolism)
- Expectation management with the biological community (what data do I need?)

Metabolic biochemistry at the systems level
23. Protein and Gene Network Inference
1. New Science: what are the underlying principles (static and dynamic) of biological networks?
- Dynamical attractors
- Scale-free static networks
- Pragmatic problem: search space size

[Table: counts of 100-node networks. Random networks: ~10^3010; scale-free networks: ~10^55; non-chaotic networks: ~10^8; networks with similar dynamics: ?]

Jean-Loup Faulon, GTL Modeling & Simulation Workshop, July 23, 2003
24. Protein and Gene Network Inference
2. Barriers
- Reaction rates (experimental)
- Static and dynamic network characterization tools (algorithms & math)
- Data format standards (software & hardware): 2-hybrid systems, phage display, MS, gene microarrays, protein chips, bioinformatics
- Inference algorithms with sensitivity analysis (algorithms)

3. Success
- Biological question answered
- Inference prediction drives experiment (a toy inference sketch follows this slide)

[Figure: number of data points required to infer unique parsimonious Boolean networks from microarray data, and number of clusters with similar dynamics, vs. number of networks]

4. Resources
- Database (hardware & software)
- Manpower

Jean-Loup Faulon, GTL Modeling & Simulation Workshop, July 23, 2003
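To make the "parsimonious Boolean network" notion concrete, here is a minimal sketch, entirely ours rather than Faulon's method: for each gene it searches for the smallest regulator set whose states determine the gene's next state in every observed transition.

```python
# Toy parsimonious Boolean network inference: find the smallest set of
# regulators consistent with all observed state transitions for one gene.
from itertools import combinations

def infer_regulators(transitions, gene, max_k=3):
    """transitions: list of (state, next_state) pairs of 0/1 tuples."""
    n = len(transitions[0][0])
    for k in range(1, max_k + 1):          # smallest input sets first
        for inputs in combinations(range(n), k):
            table, consistent = {}, True
            for state, nxt in transitions:
                key = tuple(state[i] for i in inputs)
                if table.setdefault(key, nxt[gene]) != nxt[gene]:
                    consistent = False     # same inputs, different output
                    break
            if consistent:
                return inputs, table       # minimal regulators + truth table
    return None

# Example: gene 2's next state is the AND of genes 0 and 1 (made-up data).
data = [((0, 0, 0), (0, 0, 0)), ((1, 0, 0), (1, 0, 0)),
        ((0, 1, 0), (0, 1, 0)), ((1, 1, 0), (1, 1, 1))]
print(infer_regulators(data, gene=2))
```

With too few transitions, many regulator sets remain consistent; how many data points are needed before the answer is unique is exactly the question the slide's figure addresses.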
25. (No Transcript)
26. Now: Gen II -> $1B Need: Gen V
[Diagram: Science, Technology, Pilots (x3), Facilities, Computing, Workshops, annotated with question marks and exclamation points]
27. Now: Gen II -> $1B Need: Gen V
Complex Systems & Interactions; Active Management; Patience; Focus on End-to-End Performance on Critical Targets
[Same diagram as slide 26]
28. [Repeat of slide 27]
29. Quantitative and Computational Cell Biology: the Virtual Cell Perspective
Ion I. Moraru
National Resource for Cell Analysis and Modeling
http://www.nrcam.uchc.edu
30. QCB/CCB
- Scope and Goals: Tools for
- - Analyzing and modeling cellular function / subcellular to tissue scale
- - Reverse engineering and re-engineering eukaryotes
- Issues: Power and Sophistication!
- - Spatial resolution / complex geometries
- - Temporal resolution / stiffness (see the sketch after this list)
- - Lack of data / parameter space searching
- - Too much data / 5D imaging, -omics
- - Stochastic behavior / particles, fluctuations
- - Encapsulation and scalability / model reuse, supermodels
- Simulation Grand Challenges?
- - Complete organelle function (mitochondria, ER)
- - 4D pattern development (embryogenesis, tissue repair)
- - Cellular programming (apoptosis, cell cycle)
- - Structural control (mechanics, locomotion)
- - Neuronal signal integration (Purkinje cells)
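On the stiffness point: cellular kinetics routinely mix very fast and very slow reactions, which forces implicit integrators. A minimal sketch, assuming SciPy and entirely made-up rate constants:

```python
# Stiff kinetics: a fast reversible binding step (A <-> B) coupled to a
# slow conversion (B -> C). The 10^6 gap between rates makes the system
# stiff, so an implicit method (BDF) is used instead of an explicit one.
from scipy.integrate import solve_ivp

def rhs(t, y):
    a, b, c = y
    kf, kr, kcat = 1e6, 1e6, 1.0   # fast binding, slow catalysis (assumed)
    v_bind = kf * a - kr * b
    return [-v_bind, v_bind - kcat * b, kcat * b]

sol = solve_ivp(rhs, (0.0, 10.0), [1.0, 0.0, 0.0], method="BDF", rtol=1e-8)
print(sol.y[:, -1])  # concentrations of A, B, C at t = 10
```

An explicit method such as RK45 would be forced to steps near 1/kf = 10^-6, while BDF steps at the scale of the slow dynamics.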
31. Performance Progress
Neuroblastoma model: simulation of 20 s real time
32. Near-term Potential Practical Wins for Modeling and Simulation of Microbes
- Bioinformatics
- - Predicting domain-ligand interaction using signature-kernel support vector machines (see the sketch after this list)
- - Natural language processing
- - Gene finding, phylogeny
- Hardware & Operating Systems Research
- - What does the architecture of a computer that can solve these problems look like?
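A hedged illustration of the signature-kernel SVM idea, with toy data and a plain dot-product kernel standing in for the real signature descriptors (all names and numbers below are ours, not from the slide):

```python
# Represent each domain-ligand pair by counts of structural "signatures"
# (toy 3-dimensional count vectors here), then classify interacting vs.
# non-interacting pairs with an SVM over a precomputed kernel.
import numpy as np
from sklearn.svm import SVC

# Made-up signature-count vectors for 6 domain-ligand pairs.
X = np.array([[3, 0, 1], [2, 1, 0], [0, 3, 2],
              [1, 2, 2], [4, 0, 0], [0, 1, 3]], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0])       # 1 = pair interacts

K = X @ X.T                            # stand-in "signature kernel"
clf = SVC(kernel="precomputed").fit(K, y)

x_new = np.array([[2, 0, 1]], dtype=float)
print(clf.predict(x_new @ X.T))        # kernel row vs. the training pairs
```

The point of the precomputed-kernel form is that any similarity function over signatures can be swapped in without changing the classifier.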
33. Near-term Potential Practical Wins for Modeling and Simulation of Microbes
- Computational Molecular Biophysics
- - 40 ns simulation of a rhodopsin membrane protein system, for insight into the determination of the light-adapted structure
- Complex Systems
- - Network modeling
- Complex Systems
- - Massively parallel finite elements and meshing
- Computational Technologies
- - Parallel algorithm development, optimization, data mining and management, visualization, frameworks & user interfaces
34. PGF Raw Data Organization
Project = series of libraries that define a genome
Library = series of plates
Plate = 384 clones
Clone = 2 lanes
Lane = ~1 MB, distributed into 4 files:
  1 FASTA file: 1 KB
  1 scf file: 50 KB
  1 abd file: 250 KB
  1 rsd/ab1 file: 650 KB
In May 2003, PGF ran 2.5 million successful lanes: ~2.5 TB/month and ~10 million files (0.75 TB/month, i.e. 9 TB/year, of non-trace files). This does not include any assembly, database, or metadata! (The arithmetic is checked in the sketch after this slide.)
Michael Banda
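A quick back-of-the-envelope check of those monthly figures (ours, not from the slide), using only the per-lane file sizes given above:

```python
# Monthly data volume from per-lane file sizes and lane throughput.
KB = 1024
sizes = {"fasta": 1 * KB, "scf": 50 * KB, "abd": 250 * KB, "rsd/ab1": 650 * KB}

lanes_per_month = 2.5e6
bytes_per_lane = sum(sizes.values())              # ~951 KB, i.e. ~1 MB/lane
tb_per_month = lanes_per_month * bytes_per_lane / 1024**4
files_per_month = lanes_per_month * len(sizes)    # 4 files per lane

print(f"{tb_per_month:.1f} TB/month, {files_per_month:.0f} files/month")
# -> roughly 2.2 TB/month and 10 million files, consistent with the slide
```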
35. Community Access to PGF Data
- Access to these data is in demand by scientific fields that were not anticipated by the Human Genome Project:
- - Microbiologists
- - Environmental scientists & biogeologists
- - Evolutionary scientists
- - GTL projects
- Not everyone will want the same kind of files.
- The computational sophistication of the user community is uneven, at best.
Michael Banda
36. Data Organization Requirements
1. Metadata for the files being collected
   -- schema definition & development
   -- the database system to support the metadata
   -- query interfaces to query the metadata
   -- possible rapid prototyping using the object-based tools
2. Data entry tools for the metadata
   -- procedure to enforce metadata entry
   -- checks on the correctness of the metadata entered
None of this was contemplated in the Human Genome Project, but it is essential for JGI and GTL data management. (A minimal schema sketch follows this slide.)
Michael Banda
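One way to read the two requirements together, as a minimal sketch: a per-file metadata schema whose entry-time correctness checks are enforced by the database itself. The table, columns, and allowed values below are illustrative assumptions patterned on the PGF hierarchy from slide 34, not an actual JGI schema.

```python
# Per-file metadata schema with correctness checks enforced at entry time.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lane_file (
    project   TEXT    NOT NULL,
    library   TEXT    NOT NULL,
    plate     INTEGER NOT NULL CHECK (plate > 0),
    clone     TEXT    NOT NULL,
    lane      INTEGER NOT NULL CHECK (lane IN (1, 2)),
    file_type TEXT    NOT NULL CHECK (file_type IN ('fasta','scf','abd','ab1')),
    path      TEXT    NOT NULL UNIQUE,
    PRIMARY KEY (project, library, plate, clone, lane, file_type)
);
""")
conn.execute("INSERT INTO lane_file VALUES (?,?,?,?,?,?,?)",
             ("proj1", "libA", 1, "cloneX", 1, "fasta", "/data/x.fasta"))
rows = conn.execute("SELECT path FROM lane_file WHERE file_type = 'fasta'")
print(rows.fetchall())   # a malformed row would raise IntegrityError instead
```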
37.
- Wide agreement on the general need for new theoretical and software infrastructure for systems biology, beyond molecular biology, bioinformatics, and -omics.
- Potential differences in details and emphasis.
- Multiscale and large-scale stochastic simulation must simultaneously deal with extreme stiffness (Petzold), stochastics (Gillespie), robustness/fragility, and complexity.
- Simulation alone is not scalable to larger network problems, because answering biologically meaningful questions for complex, uncertain systems needs an exponentially large number of simulations.
- There are fundamental (i.e. necessary) laws governing the organization of biological networks, most remaining to be discovered. Without exploiting them, network complexity will eventually become overwhelming.
- Dramatic progress in all areas, but lacking accessible exposition.
- There have been extraordinary developments in the mathematics of complex networks in the last 2-3 years, with promising applications to engineering and biological networks. Builds on operator theory, control theory, dynamical systems, computational complexity, and semidefinite programming.
John Doyle
38. Systems Simulation Needs
- Most core simulation technologies are available
- - Already-existing simulators for ODEs, SDEs, PDEs, discrete-particle, circuit-based, and geometrically changing models
- - Models are not yet large enough for simulation to be severely limited by hardware
- Hybrid simulation systems are still in VERY early development
- - Mixed deterministic and stochastic
- - Mixed discrete and continuous
- - Mixed differential and algebraic (this is the most sophisticated)
- Mixed-scale simulation systems are also still in early development
- - e.g. combining structural and kinetic modeling
- Formal methods for converting one model type to another are still lacking in many areas
- - For example, conversion of the Chemical Master Equation to a Langevin equation is still an art (a minimal stochastic-simulation sketch follows this slide)
- ALL of these are limited by good biophysical models of most cellular processes
- Model deduction and parameter estimation
- - New algorithms beginning to rely on statistical graph models & stochastic optimization; computationally intensive
- - Collaborative data filtering for data constraints on parameters: large matrix manipulation, optimization
- Model analysis
- - Model reduction: e.g. automated time-scale separation, extensions of balanced truncation
- - Model abstraction: e.g. conversion of physical models to circuit-like descriptions
Adam Arkin
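The Chemical Master Equation mentioned above can at least be sampled exactly with Gillespie's stochastic simulation algorithm. A minimal sketch (ours) for a toy birth-death process, with made-up rates:

```python
# Exact Gillespie SSA for the birth-death process:
#   0 -> A at rate k_birth, A -> 0 at rate k_death * x.
import math, random

def gillespie(x0=10, k_birth=5.0, k_death=0.5, t_end=10.0, seed=0):
    rng = random.Random(seed)
    t, x = 0.0, x0
    traj = [(t, x)]
    while t < t_end:
        a_birth, a_death = k_birth, k_death * x       # reaction propensities
        a_total = a_birth + a_death
        if a_total == 0.0:
            break
        t += -math.log(1.0 - rng.random()) / a_total  # exponential waiting time
        x += 1 if rng.random() * a_total < a_birth else -1
        traj.append((t, x))
    return traj

print(gillespie()[-1])  # final (time, copy number); mean hovers near 10
```

The Langevin equation the slide refers to approximates exactly this jump process with a continuous SDE; deciding when that approximation is valid is the "art" in question.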
39. Computational Biology Infrastructure for the Analysis of Complex Microbial Communities: From Genomes to Molecular Machines
- Displacement
- - Commodity chemicals
- - Fuels
- - Metabolic pathways and reactions
- Sequestration
- - Long-term soil storage
- - Soil C species age
- - Increased productivity

[Figure: CO2 fluxes above and below ground; newer vs. older carbon species; rhizosphere fungi and bacteria; community structure & metabolic diversity. Inset: nitrogenase, a MoFe protein (in blue and purple at the center) with two copies of the Fe protein dimer bound on either end (shown in green).]

Carl Anderson, BNL, 7/19/03