Title: Prediction of Protein Crystallization using Recursive Partition Trees
Arun Sethuraman (1), Sweta Vangaveti (1), Jesse Walsh (1), Scott Boyken (1), Erin Boggess (1), Yves Sucaet (1), Julie Hoy (2)
(1) Bioinformatics and Computational Biology Laboratory, Iowa State University, Ames, IA, USA - http://lab.bcb.iastate.edu
(2) Macromolecular X-ray Crystallography Facility, Iowa State University, Ames, IA, USA - http://www.biotech.iastate.edu/facilities/XRAY/xray.html
Conclusions
Our model achieves an accuracy of around 38% when predicting the actual crystallization state and around 66% when predicting whether crystallization occurs. We hope that this work will help us develop better prediction models and, eventually, integrate them with our ongoing project to develop an X-ray crystallography database and interface. Although predicting the actual state of crystallization is an extremely difficult task, we hope that even a statistically grounded indication of the most probable conditions for crystallizing a protein would help crystallographers save money on trial-and-error experiments.
About the team - BCBLab
In the summer of 2006, four graduate students in the BCB (Bioinformatics and Computational Biology, http://bcb.iastate.edu) program convened to discuss the future prospects of bioinformatics. One conclusion was that a PhD program provides many opportunities to specialize in a specific research topic, yet is somewhat lacking when it comes to achieving a top-level overview of the field in general. The Bioinformatics and Computational Biology Laboratory (BCBLab) emerged as a result: a novel on-campus consultancy organization, driven by graduate students, with two main goals: to allow undergraduate and graduate students to develop broad bioinformatics skills, and to match non-BCB PIs with on-campus specialists.
Acknowledgements
- Prof. Heike Hofmann, Associate Professor, Department of Statistics, Iowa State University
- NSF-IGERT, project in Computational Molecular Biology
The following people were instrumental in providing general support to the BCBLab:
- Dr. Chris Tuggle, Chair, BCB Program
- Dr. Drena Dobbs, Head of NSF-IGERT Computational Molecular Biology Training Program
- Dr. Volker Brendel, Associate Chair, BCB Program
- Dr. Robert Jernigan, Director, L.A. Baker Center for Bioinformatics and Statistics
Abstract
The formation of protein crystals in an X-ray crystallography experiment depends on many conditions, such as temperature, the concentration of salts in the solution, and buffer pH. The principles governing crystallization are unclear, and most experimental designs depend on past experience or luck. Hence there is a need for strategies that find correlations among experimental variables and use them to formulate optimal conditions for crystal formation. We use existing crystallization information to model recursive partition trees, which describe relationships between the conditions involved, in order to predict the crystallization state of a protein. Statistical analysis using R on data for a class of membrane proteins shows an accuracy of 66% in predicting whether or not a crystal will form and 38% in predicting the specific crystallization state. With a more robust training set, our model could offer better predictions and in turn save money and effort in setting up experiments.
Introduction
Macromolecular X-ray crystallography is a method for determining the molecular structure of proteins and nucleic acids.
Data Analysis and Prediction
Modeling the Recursive Partition Tree in R
- Step 1: Growing the Tree
  - While designing the model, the complexity parameter (cp) was set to 0.000001. If a split does not decrease the overall lack of fit by at least a factor of cp, the split is not attempted. Splits that are not worthwhile are thereby pruned off, which saves computing time. Default values were used for the other control parameters.
- Step 2: Pruning the Tree
  - The best-fitting model is the one for which the cross-validated error (xerror) is minimized. As observed above, the minimum xerror occurs at an nsplit smaller than the maximum nsplit. This is a consequence of over-fitting the data. To counter this, the partition tree is pruned back to nsplit = 9 by using the complexity parameter associated with the minimum xerror (0.01 in this case). The new model now has nsplit = 10, as shown in the figure.
- Dataset used: 5 different screens of 50 conditions each for a class of hemoglobin proteins (Macromolecular X-Ray Crystallography Facility, Iowa State University).
- The data obtained for protein crystallization comprise the conditions in each well of the crystal screen. Observations are made over a period of time and classified by crystallization state (Figure 2).
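The pruning rule in Step 2 above (take the complexity parameter at the minimum cross-validated error) can be sketched in a few lines. This is a minimal illustration in Python rather than the R code used for the poster, which is not shown in the source. The table values below are invented for illustration, and `cp_table` and `cp_at_min_xerror` are hypothetical names. In R, the equivalent step would most likely read the cp table of the fitted tree and pass the chosen value to a pruning call, assuming the rpart package that the cp/nsplit/xerror terminology suggests.

```python
# Hypothetical cp table: one row per candidate tree size.
# Each row is (cp, nsplit, xerror). ALL VALUES ARE INVENTED for illustration;
# they are not the numbers from the poster's analysis.
cp_table = [
    (0.120000, 0, 1.00),
    (0.050000, 2, 0.91),
    (0.020000, 5, 0.84),
    (0.010000, 9, 0.79),
    (0.004000, 14, 0.83),
    (0.000001, 31, 0.90),  # fully grown tree: more splits, worse xerror
]


def cp_at_min_xerror(table):
    """Return the (cp, nsplit, xerror) row with the smallest xerror."""
    return min(table, key=lambda row: row[2])


best_cp, best_nsplit, best_xerror = cp_at_min_xerror(cp_table)
print(f"prune with cp={best_cp} (nsplit={best_nsplit}, xerror={best_xerror})")
```

In this made-up table the cross-validated error falls, reaches a minimum, and rises again as the tree grows, which is the over-fitting pattern the pruning step is meant to counter.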
Recursive Partition Trees
- A decision tree that classifies a multivariate response variable based on several (clusters of) dichotomous conditions on the predictors.
- Describes the conditional distribution of a response variable Y given the status of m covariates by means of tree-structured recursive partitioning (Hothorn et al., Journal of Computational and Graphical Statistics, 2006, Vol. 15, No. 3, pp. 651-674).
- Recursively subdivides the predictor space into homogeneous regions in two steps:
  - Grow a large tree by subdividing the data into very small pieces.
  - Based on the defined threshold complexity parameter, prune the large tree to balance complexity and fit.
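To make "homogeneous regions" concrete, the sketch below scores a single dichotomous split by how pure the two resulting groups are. It is a minimal illustration under stated assumptions: the source does not say which impurity measure was used, so Gini impurity is assumed here as a common choice, and the wells, the pH values, and the function names are all hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly homogeneous)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_impurity(rows, labels, predicate):
    """Size-weighted Gini impurity of the two groups produced by a yes/no predicate."""
    yes = [y for x, y in zip(rows, labels) if predicate(x)]
    no = [y for x, y in zip(rows, labels) if not predicate(x)]
    n = len(labels)
    return (len(yes) / n) * gini(yes) + (len(no) / n) * gini(no)


# Invented example wells; not data from the poster.
rows = [{"pH": 5.5}, {"pH": 6.0}, {"pH": 8.0}, {"pH": 8.5}]
labels = ["crystal", "crystal", "none", "none"]

before = gini(labels)
after = split_impurity(rows, labels, lambda well: well["pH"] < 7.0)
print(before, after)
```

Growing the tree amounts to repeating this comparison over candidate conditions, keeping the split that lowers impurity the most, and then recursing into each of the two groups.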
Accuracy of the model
- Cross-validation on the training data yields an accuracy of 68%.
- The test data comprised 50 different conditions in a crystal screen for another hemoglobin protein, for which the crystallization states were predicted.
- Predicting the crystallization state (1-9) yields an accuracy of 38%.
- We simplified this model by treating the state of the protein as a dichotomous variable (considering only whether the protein had crystallized or not, instead of the 9 separate states), which yielded an accuracy of 66%.
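The effect of collapsing the nine states into a dichotomous outcome can be sketched as follows. This is an illustration only: the source does not say which states were counted as "crystallized", so the threshold below (states 6 to 9, assuming the states are numbered 1 to 9 in the order of the scoring scheme) is an assumption, and the observed and predicted values are invented.

```python
def accuracy(predicted, observed):
    """Fraction of positions where the prediction equals the observation."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("cannot compute accuracy on empty data")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)


# ASSUMPTION: states 6-9 (needles, plates, single crystals) count as crystallized.
# The poster does not state where this cut-off was placed.
CRYSTAL_THRESHOLD = 6


def crystallized(state, threshold=CRYSTAL_THRESHOLD):
    """Collapse a 1-9 crystallization state into a yes/no outcome."""
    return state >= threshold


# Invented example scores for eight wells; not data from the poster.
observed = [1, 3, 6, 9, 4, 7, 2, 8]
predicted = [1, 4, 7, 9, 3, 2, 2, 6]

state_accuracy = accuracy(predicted, observed)
binary_accuracy = accuracy(
    [crystallized(s) for s in predicted],
    [crystallized(s) for s in observed],
)
print(state_accuracy, binary_accuracy)
```

A prediction that misses the exact state but lands on the correct side of the threshold counts as a hit in the dichotomous version, which is why the simplified model can score higher on the same predictions.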
Crystallization States
- X-ray crystallography steps:
  - Grow suitable crystals from a purified sample
  - Obtain an X-ray diffraction data set from which to solve the structure
- The first step is a bottleneck:
  - Optimal conditions for crystal nucleation and growth are difficult to predict
  - A trial-and-error method is used instead
- Sparse Matrix Sampling Technology:
  - Quickly tests a wide range of pH, salts, and precipitants using a very small sample of the protein (Jancarik et al., Journal of Applied Crystallography, 1991, Vol. 24, pp. 409-411)
  - Evaluates unique combinations of pH, buffers, salts, and precipitants, and their ability to promote crystal growth, each well being a different condition, as shown in Figure 1.
- Clear Drop
- Phase Separation
- Regular Granular Precipitate
- Birefringent Precipitate
- Posettes or Spherulites
- Needles, 1D Growth
- Plates, 2D Growth
- Single crystals, 3D Growth, crystal < 0.2 mm
- Single crystals, 3D Growth, crystal > 0.2 mm
Improvements
- In the future, expand to larger datasets with a greater number of conditions represented in the model, giving rise to better predictions.
- Some states occur preferentially over others, and these hidden relationships cannot be represented using a partition tree. We therefore suggest building a Bayesian network or a Fuzzy network to represent the conditional relationships between the variables in the screen. These networks would be populated based on the conditional probabilities of reaching different states, instead of dichotomous conditions as in the recursive partition tree.
Figure 3. Our Pruned Recursive Partition Tree
- We modeled our data as a recursive partition tree (Figure 3) in R (Ihaka et al., Journal of Computational and Graphical Statistics, 1996, Vol. 5, No. 3, pp. 299-314).
- Weights on the partition tree are the probabilities of obtaining a particular state.
- We trained the model on a set of 200 data points and tested its efficacy on a set of 50 data points from a categorically similar protein, with crystallization performed using the same crystal screen (i.e. the crystallization conditions were the same as in the training set).
- The tree is interpreted as follows: start reading the tree from the root node, Buffer. If the buffer is any one of the buffers a/d/e//p, then the left node condition is checked next. If the buffer is not one of a/d/e//p, then the precipitant is checked. The left child node thus gives the condition to be checked when the parent node condition is affirmative. The right child node gives the condition to be checked when the parent node condition is negative.
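The reading rule above (left child on an affirmative condition, right child on a negative one) can be sketched as a small traversal. This is a toy illustration, not the tree in Figure 3: the buffer codes, the pH and precipitant conditions, the leaf outcomes, and the names `Node`, `Leaf`, and `predict` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Leaf:
    """Terminal node holding the predicted outcome."""
    outcome: str


@dataclass
class Node:
    """Internal node: a yes/no condition on one well of the screen."""
    label: str
    condition: Callable[[dict], bool]
    left: Union["Node", Leaf]   # followed when the condition is affirmative
    right: Union["Node", Leaf]  # followed when the condition is negative


def predict(tree, well):
    """Walk from the root to a leaf, going left on yes and right on no."""
    node = tree
    while isinstance(node, Node):
        node = node.left if node.condition(well) else node.right
    return node.outcome


# HYPOTHETICAL tree: the conditions and outcomes below are invented.
toy_tree = Node(
    label="buffer in {a, d, e, p}",
    condition=lambda w: w["buffer"] in {"a", "d", "e", "p"},
    left=Node(
        label="pH < 7.0",
        condition=lambda w: w["pH"] < 7.0,
        left=Leaf("crystal"),
        right=Leaf("no crystal"),
    ),
    right=Node(
        label="precipitant == 'x'",
        condition=lambda w: w["precipitant"] == "x",
        left=Leaf("crystal"),
        right=Leaf("no crystal"),
    ),
)

well_1 = {"buffer": "a", "pH": 6.5, "precipitant": "y"}
well_2 = {"buffer": "b", "pH": 6.5, "precipitant": "y"}
print(predict(toy_tree, well_1), predict(toy_tree, well_2))
```

Each prediction is a single root-to-leaf path, so the conditions visited along the way double as a readable explanation of why a well was assigned its outcome.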
Figure 1. Sparse Matrix Sampling Technology
plate (left) and close-up of well (right)
Figure 2. Scoring scheme: 9 observable crystallization states
Our Goal: Predict crystallization states of proteins based on known conditions and results
Methodology
Recording Observations
Importing the data
Determine key variables
Predict Crystallization of a new protein
Build Recursive Partition Tree
Figure 4. Key variables used in analysis