Title: Adaptive Probabilistic Approach: Applications to Rapid and Robust NMR Structure Determination
2Adaptive Probabilistic Approach
Applications to Rapid and Robust NMR Structure
Determination
Hamid R. Eghbalnia, Department of Biochemistry and
Mathematics, University of Wisconsin-Madison
3People who made it all happen
Marco Tonelli Klaas Hallenga Gabriel Cornilescu
Liya Wang
Fariba Assadi-Porter Claudia Cornilescu Shanteri
Singh Rob Tyler Anna Füzery Nick Reiter CESG
Arash Barhami
John Markley Milo Westler Eldon Ulrich Jurgen
Doreleijers Mark Anderson
4brazzein - 53 a.a.
ubiquitin - 76 a.a.
flavodoxin - 176 a.a.
HNCO
HN(CO)CA
HNCA
CBCA(CO)NH
HN(CA)CB
98% of spin systems assigned with PINE
95% of spin systems assigned with PINE
96% of spin systems assigned with PINE
HNCACB
14 h
48 h
12 h
Total time to obtain complete backbone
information: assignment, 2° structure, other
corrections.
5- Central ideas in the adaptive probabilistic approach
- Implementation of these ideas as various tools
- New tools and extensions to our approach
- A preview of almost-published and unpublished
6For rapid and robust NMR structure determination,
we need
- To find a formulation for the problem that is
- Consequential: addresses the current and future challenges posed
- Robust
- Tractable
- Measurable: merit for the solution can be stated
Rapid and robust NMR structure determination: automation
7The (simplified) big picture
Predictions from sequence and available data:
chemical shifts, appropriate strategy
Construct design; protein production and
labeling; screening for suitability
Data deposition and publication
Data collection
Structure refinement and validation
Chemical shift assignments
Structure determination
Secondary structure and other constraint
determinations
8The big picture (in reality)
Predictions from sequence and available data:
chemical shifts, appropriate strategy
Construct design; protein production and
labeling; screening for suitability
Data deposition and publication
Data collection
Structure refinement and validation
Chemical shift assignments
Structure determination
Secondary structure and other constraint
determinations
9Decision trees
The basic paradigm for translating an expert's
approach into a computer algorithm is to use the
analogy of decision trees.
Decision variables
Decision options
What are the challenges?
10A useful analogy to 20 questions
Choose a number between 1 and 1000000
> 500000?
< 500000?
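The analogy can be made concrete with a minimal sketch: each "greater than?" question halves the remaining range, so 20 questions are enough for any number between 1 and 1000000. The function name `guess_number` is made up for this illustration.

```python
def guess_number(secret, low=1, high=1_000_000):
    """Find `secret` in [low, high] using only 'is it greater than x?' questions."""
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1
        if secret > mid:   # the answer to "> mid?" is yes
            low = mid + 1
        else:              # the answer is no, so secret <= mid
            high = mid
    return low, questions


if __name__ == "__main__":
    value, asked = guess_number(123_456)
    print(value, asked)
```

Each question plays the role of a decision variable, and the two possible answers are the decision options.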
11More challenging version of 20 questions!
Responses to the queries are not yes/no
answers, and they are not always the truth!
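One hedged way to picture the harder game is to keep a probability for every candidate and update it after each unreliable answer. The sketch below assumes every answer is truthful with a fixed probability `p_truth`. That value, the stopping threshold, and all function names are assumptions for illustration, not part of the talk.

```python
def bayes_update(posterior, threshold, said_yes, p_truth=0.8):
    """Update P(secret = k) after a noisy answer to 'is secret > threshold?'."""
    updated = {}
    for k, prob in posterior.items():
        consistent = (k > threshold) == said_yes
        updated[k] = prob * (p_truth if consistent else 1.0 - p_truth)
    total = sum(updated.values())
    return {k: v / total for k, v in updated.items()}


def best_threshold(posterior):
    """Threshold whose question splits the current probability mass most evenly."""
    best_t, best_gap = None, 2.0
    mass_above = 1.0
    for k in sorted(posterior):
        mass_above -= posterior[k]      # now equals P(secret > k)
        gap = abs(mass_above - 0.5)
        if gap < best_gap:
            best_t, best_gap = k, gap
    return best_t


def noisy_search(oracle, low, high, p_truth=0.8, stop_at=0.99, max_questions=200):
    """Ask adaptive questions until one candidate holds `stop_at` of the mass."""
    posterior = {k: 1.0 / (high - low + 1) for k in range(low, high + 1)}
    for _ in range(max_questions):
        best = max(posterior, key=posterior.get)
        if posterior[best] >= stop_at:
            break
        t = best_threshold(posterior)
        posterior = bayes_update(posterior, t, oracle(t), p_truth)
    return max(posterior, key=posterior.get), posterior
```

The question asked next depends on everything learned so far (adaptive), and no single answer is ever treated as certain (probabilistic).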
12Local to global structures
The challenge: put together local data into
globally coherent information
13What is local to global?
HN(CO)CA
Local information
These examples from NMR structure determination
are representative of a more general phenomenon in
biology.
14Integrating data collection and analysis --
automation
- Automating analysis in biology is less like automating a factory or a sample changer
- We do not assemble the same product over and over. Interesting proteins are unique.
- Automating analysis in biology is more like creating a smart robot to deal with new situations as they arise
- We give the robot the flexibility to interpret unknown situations and adapt as needed.
- For typical fuzzy real-world situations, a probabilistic approach provides flexibility and a decision-based approach provides adaptability
15Integrating data collection and analysis in NMR
- The strategy may depend on
- Size of the protein (e.g., relaxation)
- Folds and fold topology (e.g., how much overlap)
- Hetero/homo, multi/mono(mer) (e.g., degeneracy)
- Existence of homologs (e.g., a priori knowledge)
- Required resolution (e.g., desired accuracy)
- etc.
- Successful strategies generate more value from
a given quantity of data.
16The larger impact of generating more value from
data
- The idea of generating more value from data is emerging as a key problem in biological investigations
- Today, analyzing biological systems remains a challenging, sometimes ad hoc, and human knowledge-intensive endeavor
- Most existing methods fail to scale when presented with large systems-oriented data sets
- Robust, reusable, and computationally feasible approaches are needed that require little subjective intervention but offer tools for scientific interpretation
- This is a tough target
17Adaptive probabilistic approach: generating more value from data
- The adaptive probabilistic paradigm offers a novel and promising approach to obtaining more value from available data (database and experimental). It has the potential of becoming a key approach for addressing important biological questions.
- Data collection and analysis
- Protein structure and refinement
- Dynamics of molecules
- Function, binding and interaction
- Fingerprinting and profiling metabolites
- RNA structure determination and refinement
18Adaptive probabilistic approach
Combine informatics, modeling, and experimental
data to achieve fast and robust analysis of
biological systems
Integrating data collection and analysis
19A rigorous Adaptive Probabilistic
- Non-deterministic or randomized
- No idea what nature has in store.
- Try options without any preferences
- Probabilistic
- Have observed nature and collected data
- Use statistics to guide my decisions
- Use models on top of statistics
- Adaptive
- Adjust the cost of decisions based on the
known
20Example existing tools for NMR
21Adaptive probabilistic approach
MLAAKEGAAVSNTPLKK
22Implementation of our adaptive, probabilistic
strategy
Data collection
23Recording multidimensional experiments is time
consuming
Reduced dimensionality (RD) can be viewed as an
alternative sampling strategy that leads to
collecting less data. All RD experiments lose
information. The adaptive and probabilistic
approach taken by HIFI minimizes the information
loss - not simply convergence to n peaks!
Time axes: t1, t2, t3
128 x 128 = 16,384 FIDs; 136.5 h
High-resolution Iterative Frequency Identification (HIFI):
- simultaneously evolving indirect frequencies are extracted from 2D RD spectra
- multiple tilted planes are used
- the angle of each tilted plane is chosen adaptively in real time
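The acquisition figures quoted on this slide can be checked with a little arithmetic: 128 x 128 increments give 16,384 FIDs, and 136.5 h spread over them is roughly 30 s per FID. The sketch below also estimates the time for a set of 2D tilted planes, assuming the same per-FID time and 128 increments per plane. The plane count is a purely hypothetical input, not a number from the talk.

```python
def per_fid_seconds(n_t1, n_t2, total_hours):
    """Return (number of FIDs, seconds per FID) for a conventional 3D experiment."""
    total_fids = n_t1 * n_t2
    return total_fids, total_hours * 3600.0 / total_fids


def tilted_plane_hours(n_planes, increments_per_plane, seconds_per_fid):
    """Hours needed for n_planes 2D planes at the same per-FID time (assumption)."""
    return n_planes * increments_per_plane * seconds_per_fid / 3600.0


if __name__ == "__main__":
    fids, sec = per_fid_seconds(128, 128, 136.5)
    print(fids, round(sec, 2))
    # 10 planes is an illustrative value only.
    print(round(tilted_plane_hours(10, 128, sec), 2))
```

Under these assumptions each tilted plane costs 1/128 of the full 3D experiment, which is why collecting a small number of well-chosen planes is attractive.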
24Reduced dimensionality techniques
RD planes
tilted planes of multidimensional spectra
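A rough sketch of why the tilt angle matters: in a plane tilted by an angle θ, a peak with indirect frequencies (w1, w2) is commonly described as appearing at w1·cos θ ± w2·sin θ. The code assumes that projection relation and uses a deliberately simple stand-in criterion (largest minimum gap between projected peaks) for choosing the next angle. HIFI's actual probabilistic criterion is not reproduced here, and all names are illustrative.

```python
import math


def project(w1, w2, angle_deg):
    """Positions of a peak at (w1, w2) in a plane tilted by angle_deg (assumed relation)."""
    a = math.radians(angle_deg)
    return (w1 * math.cos(a) + w2 * math.sin(a),
            w1 * math.cos(a) - w2 * math.sin(a))


def min_separation(peaks, angle_deg):
    """Smallest gap between any two projected positions at this tilt angle."""
    positions = sorted(p for w1, w2 in peaks for p in project(w1, w2, angle_deg))
    return min(b - a for a, b in zip(positions, positions[1:]))


def choose_angle(peaks, candidate_angles):
    """Pick the candidate angle that keeps the projected peaks best separated."""
    return max(candidate_angles, key=lambda ang: min_separation(peaks, ang))


if __name__ == "__main__":
    # Two candidate peaks that overlap in w1 but differ in w2 (offsets, arbitrary units).
    candidates = [(1.0, 2.0), (1.0, 5.0)]
    print(choose_angle(candidates, [0, 45, 90]))
```

In an adaptive scheme the list of candidate peaks would be re-estimated after every plane, so the best next angle changes as data accumulate.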
25HIFI on CBCA(CO)NH
Combined peaks from HIFI planes are in magenta
Hand picked peaks from 3D spectrum in green
26Simplified description of the HIFI NMR approach
Eghbalnia et al. (2005) JACS 127:12528
27Summary HIFI
- HIFI versions of nearly all backbone experiments are available
- HIFI is being developed for sidechain experiments
- HIFI NOE data have been collected -- analysis is proceeding
- HIFI is now completely automated for backbone data collection by six robust backbone experiments
- Automation of additional backbone experiments can be implemented very easily
28HIFI applications
Restraint generation: HIFI-RDC
29Implementation of our adaptive, probabilistic
strategy
Automated assignment
30PISTACHIO (Probabilistic Identification of Spin
Systems and their Assignments including
Coil-Helix Inference as Output)
- Steps
- Parse peak lists associated with particular experiments into the set of all possible tripeptide spin systems specified by the peptide sequence
- Compute probability scores for matching the chemical shifts of the overlapping tripeptide spin systems to residues in the sequence (this makes use of our prior analysis of the BMRB database of chemical shifts)
- Assemble the overlapping tripeptides to match the sequence and to achieve the maximum probability for correct assignments (the approach used is similar to ones used in problems of statistical physics and combinatorial optimization)
Use existing data to predict an assignment
configuration.
Use prediction to postulate a configuration
distribution
Compare postulated local configuration to
globally minimal solutions
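The scoring step can be illustrated with a minimal sketch that turns observed shifts into residue-type probabilities using independent Gaussian models. This is not PISTACHIO's actual scoring function. The (mean, sd) table holds approximate, illustrative values chosen for the example rather than numbers taken from the talk or copied from BMRB, and all names are hypothetical.

```python
import math

# Illustrative (mean, sd) in ppm per nucleus; approximate values for this example only.
SHIFT_STATS = {
    "ALA": {"CA": (53.1, 2.0), "CB": (19.0, 1.8)},
    "SER": {"CA": (58.7, 2.1), "CB": (63.8, 1.5)},
    "THR": {"CA": (62.2, 2.6), "CB": (69.7, 1.7)},
    "VAL": {"CA": (62.5, 2.9), "CB": (32.7, 1.8)},
}


def gaussian_pdf(x, mean, sd):
    """Normal density evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))


def residue_type_probabilities(observed, stats=SHIFT_STATS):
    """Normalize per-residue-type likelihoods of the observed shifts into probabilities."""
    likelihoods = {}
    for residue, nuclei in stats.items():
        likelihood = 1.0
        for nucleus, shift in observed.items():
            mean, sd = nuclei[nucleus]
            likelihood *= gaussian_pdf(shift, mean, sd)
        likelihoods[residue] = likelihood
    total = sum(likelihoods.values())
    return {residue: value / total for residue, value in likelihoods.items()}


if __name__ == "__main__":
    probs = residue_type_probabilities({"CA": 52.8, "CB": 19.4})
    print(max(probs, key=probs.get))
```

The assembly step would then search for the placement of overlapping tripeptides along the sequence that maximizes the combined probability. That search is not shown here.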
31http://bija.nmrfam.wisc.edu/PISTACHIO/
PISTACHIO is run by uploading files in XEASY or
NMR-STAR format. Data from up to 15 standard
double- and triple-resonance experiments can be
used as input. Other types of data can be
accommodated on request.
Aug 2005 to June 2006
32PISTACHIO uses a new data format for
probabilistic assignments
An NMR-STAR data format for probabilistically assigned protein NMR data has been developed in collaboration with BMRB.
The PISTACHIO server outputs data in this format.
Algorithms under development carry the probabilistic assignments forward and refine them as the structure determination proceeds.
BMRB accepts data depositions in this format.
A graphical interface for PISTACHIO / LACS / PECAN results is nearing completion and will be released soon.
33Implementation of our adaptive, probabilistic
strategy
Validation of 13C referencing; detection of
possible mis-assignments
34Carbon chemical shifts irrespective of structure
can be represented by three Gaussian distributions
Data for all alanines in RefDB
Data for 13Cα as a function of δ13Cα − δ13Cβ
35Linear Analysis of Chemical Shifts (LACS) plot
Data for all valine residues in RefDB
L. Wang et al. (2005) J. Biomol. NMR, 32:13
36LACS of a single protein can be used to identify
problems with referencing and possible assignment
outliers
This intercept should be at (0,0) for properly
referenced data
L. Wang et al. (2005) J. Biomol. NMR, 32:13-22
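The reasoning behind the (0,0) intercept can be sketched in a few lines. If every 13C shift in a data set carries the same referencing offset, the offset cancels in the difference (Cα minus Cβ) but survives in the Cα value itself, so it shows up as the intercept of the fitted line. The sketch assumes the inputs are secondary shifts (observed minus random-coil values) and uses a single ordinary least-squares line. The published LACS procedure is more elaborate, and the function names here are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def referencing_offset(d_ca, d_cb):
    """Estimate a common 13C offset as the intercept of d_ca versus (d_ca - d_cb)."""
    xs = [a - b for a, b in zip(d_ca, d_cb)]
    _, intercept = fit_line(xs, d_ca)
    return intercept


if __name__ == "__main__":
    # Synthetic secondary shifts with a 1.5 ppm offset added to both nuclei.
    ca = [-0.5, 0.5, 1.5, 2.5, 3.5]
    cb = [3.5, 2.5, 1.5, 0.5, -0.5]
    print(round(referencing_offset(ca, cb), 3))
```

Points that fall far from the fitted line are the candidates for assignment outliers mentioned on the slide.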
37We have used LACS to re-reference the BMRB
database
- 11 ( 1.0 ppm )
- 26 ( 0.5 ppm )
- 46 ( 0.3 ppm )
L. Wang et al. (2005) J. Biomol. NMR, 32:13-22
38Implementation of our adaptive, probabilistic
strategy
2° structure determination
39PECAN (Protein Energetic Conformational Analysis from NMR chemical shifts): analysis of secondary structure from assigned chemical shifts and the protein sequence. Energy model for a particular protein (bmr 4083)
Color key: helix, strand, non-helix / non-strand
Plot axes: energy vs. residue number
PECAN's average accuracy is better than 90% across all structural regions, measured on the largest data set to date.
Eghbalnia et al. (2005) J. Biomol. NMR 32:71-81
40PECAN (Protein Energetic Conformational Analysis from NMR chemical shifts): analysis of secondary structure from assigned chemical shifts and the protein sequence. Example of output
Labeled regions: helix, transition region, strand
Eghbalnia et al. (2005) J. Biomol. NMR 32:71-81
41Implementation of our adaptive, probabilistic
strategy
PINE
42PINE
PINE combines information in order to refine probabilities that reflect our state of knowledge
- Use existing data to predict an assignment configuration.
- Use existing data to predict a secondary structure configuration.
- Use assignments to predict a chemical shift configuration.
In each case: use the prediction to postulate a configuration distribution, then compare the postulated local configuration to globally minimal solutions.
43Integration of probabilistic tools: current version of PINE combines PISTACHIO, LACS, and PECAN
Example: HIFI data for ubiquitin
In PINE, assignments from PISTACHIO are validated by LACS, 2° structure is assigned by PECAN, and inconsistent values are flagged and given lower probability in the next round of PISTACHIO. The process is repeated until consistency is achieved.
PISTACHIO alone
PINE
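The repeat-until-consistent cycle described on this slide can be pictured with a toy loop. It is a sketch only: `is_consistent` stands in for whatever validation the real tools perform, the down-weighting factor `penalty` is a hypothetical parameter, and the labels in the example are made up.

```python
def refine_until_consistent(candidates, is_consistent, penalty=0.1, max_rounds=20):
    """Down-weight top assignments that fail validation until none are flagged.

    candidates: dict mapping spin system -> dict of {residue label: probability}.
    is_consistent: callable(spin_system, residue) -> bool, a stand-in validator.
    Returns the (modified) candidates and the number of rounds performed.
    """
    for round_number in range(1, max_rounds + 1):
        flagged = 0
        for spin_system, probs in candidates.items():
            best = max(probs, key=probs.get)
            if not is_consistent(spin_system, best):
                probs[best] *= penalty               # give the flagged choice lower probability
                total = sum(probs.values())
                for residue in probs:                # renormalize this spin system
                    probs[residue] /= total
                flagged += 1
        if flagged == 0:
            return candidates, round_number
    return candidates, max_rounds


if __name__ == "__main__":
    start = {
        "ss1": {"A10": 0.6, "A25": 0.4},
        "ss2": {"V11": 0.9, "V30": 0.1},
    }
    result, rounds = refine_until_consistent(start, lambda ss, res: res != "A10")
    print(max(result["ss1"], key=result["ss1"].get), rounds)
```

The point of the design is that a flagged assignment is not discarded outright; it only becomes less probable, so later evidence can still restore it.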
44brazzein - 53 a.a.
ubiquitin - 76 a.a.
flavodoxin - 176 a.a.
HNCO
HN(CO)CA
HNCA
CBCA(CO)NH
HN(CA)CB
98% of spin systems assigned with PINE
95% of spin systems assigned with PINE
96% of spin systems assigned with PINE
HNCACB
14 h
48 h
12 h
Total time to obtain complete backbone
information: assignment, 2° structure, other
corrections.
46Implementation of our adaptive, probabilistic
strategy
Chemical shift prediction
47Chemical shift prediction
- Chemical shifts are the most easily obtained and most precisely measured observables in biomolecular NMR
- Chemical shifts are highly sensitive to structure
- Chemical shifts are coordinate free
- Our prediction of chemical shifts simulates an adaptive probabilistic walk on the space of known chemical shifts (BMRB) using a simple principle
- Protein folding is a multi-rate stochastic process that is rate-insensitive in each rate domain
- Results can be used to decide where to look for unobserved peaks, to derive approximate values for missing peaks, and to produce restraints for structural refinement
48Chemical shift prediction -- TONES
Time Evolution
Manuscript in preparation
49Summary adaptive probabilistic tools for NMR
- HIFI-NMR, a probabilistic approach to data collection, aims to extract multidimensional NMR peak positions in an optimally efficient manner
- PISTACHIO turns peak lists associated with a protein sequence into probabilistic backbone and side chain assignments
- LACS provides the means for checking data sets for possible referencing problems and misassignments in advance of a structure determination
- PECAN offers a reliable probabilistic analysis of protein secondary structure
- PINE incorporates PISTACHIO, LACS and PECAN
- Work in progress promises further insights into connections between chemical shifts and structure
- These algorithms and associated software are being made available from the NMRFAM website (www.nmrfam.wisc.edu)
50Near-term This year
- HIFI-NMR
- Disseminate HIFI algorithms for backbone experiments. Add visualization.
- Disseminate HIFI-RDC. Incorporate side-chain experiments into HIFI package.
- PINE
- Make experimental PINE server available. Make faster server visualizations available (in collaboration with BMRB)
- HIFI-NMR
- Faster, better resolved
- Larger proteins
- Other applications
- ALMOND
- Probabilistic restraint model coupled to chemical shifts
- TONES
- Disseminate, with additional application
- PINE
- More detailed information about secondary structure
51Adaptive probabilistic approach
52Progress toward automated probabilistic structure
determination
BACKBONE
SIDE CHAINS
NOESY
STRUCTURE REFINEMENT
53Almonds: A probabilistic relationship between chemical shifts and conformation space
- We have carefully refined the relationship of sequence and torsion angles, specifically triples.
- To establish the probabilistic relationship, we need a more precise understanding of the relationship between chemical shifts and the assembly of tripeptides. This is particularly crucial in the difficult parts of the 2° structure.
- We have made a lot of progress in deconstructing the relationship between sequence, chemical shifts, and torsion angles.
54What do we mean by random coil?
The region of (φ,ψ)-space sampled in the absence of any dominant stabilizing interactions.
The experimental random coil state is the energy-weighted distribution of the ensemble of such conformational states.
We can use the LACS approach to remove bias in the reference state introduced by stabilizing interactions.
Result: unbiased random coil chemical shift (uRCCS) values
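One way to read "energy-weighted distribution" is as a Boltzmann-weighted average over conformational states. The sketch below makes that interpretation explicit as an assumption. The state energies and shifts are placeholders, and the default `kT` of about 0.593 kcal/mol corresponds to roughly 298 K.

```python
import math


def energy_weighted_shift(states, kT=0.593):
    """Boltzmann-weighted mean chemical shift.

    states: list of (energy in kcal/mol, chemical shift in ppm) pairs.
    kT: thermal energy in kcal/mol (about 0.593 near 298 K).
    """
    weights = [math.exp(-energy / kT) for energy, _ in states]
    partition = sum(weights)
    return sum(w * shift for w, (_, shift) in zip(weights, states)) / partition


if __name__ == "__main__":
    # Placeholder states: two equally favorable conformations.
    print(energy_weighted_shift([(0.0, 50.0), (0.0, 60.0)]))
```

Under this reading, a dominant stabilizing interaction lowers the energy of one state and pulls the average toward its shift, which is the kind of bias the slide says the LACS-based correction is meant to remove.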
55Derivation of unbiased random coil chemical shift
(uRCCS) values: LACS plot of adjusted RefDB
values for Val
L. Wang, H. R. Eghbalnia et al., manuscript in
press
56Stepwise refinement of the model
57Establishing a simple model to incorporate into
ALMONDS
We want to build a simple model where the
parameters are related to the observed effects. A
multivariate fitting of database values will not
be useful for our application.
Wang et al, J. Biomol. NMR, in press
58Acknowledgments
NIH Grant 1K22 LM8992; NIH Grants U54 GM074901 and
P50 GM64598; NIH Grant P41 RR02301