Prediction of Protein Crystallization using Recursive Partition Trees
Arun Sethuraman1, Sweta Vangaveti1, Jesse Walsh1, Scott Boyken1, Erin Boggess1, Yves Sucaet1, Julie Hoy2
1) Bioinformatics and Computational Biology Laboratory, Iowa State University, Ames, IA USA - http://lab.bcb.iastate.edu
2) Macromolecular X-ray Crystallography Facility, Iowa State University, Ames, IA USA - http://www.biotech.iastate.edu/facilities/XRAY/xray.html
Conclusions
Our model achieves an accuracy of around 38% when predicting the actual crystallization state, and around 66% when predicting whether crystallization occurs at all. We hope this work will help us develop better prediction models and, eventually, integrate them with our ongoing project to develop an X-ray crystallography database and interface. Although predicting the actual crystallization state is an extremely difficult task, even a clear, statistically significant indication of the most probable conditions for crystallizing a protein would help crystallographers save considerable money when setting up trial-and-error experiments.

About the team - BCBLab
In the summer of 2006, four graduate students in the BCB (Bioinformatics & Computational Biology, http://bcb.iastate.edu) program convened to discuss the future prospects of bioinformatics. One conclusion was that a PhD program provided many opportunities to specialize in a specific research topic, yet was somewhat lacking when trying to achieve a top-level overview of the field in general. The Bioinformatics & Computational Biology Laboratory (BCBLab) came about as a result: a novel on-campus consultancy organization, driven by graduate students, with two main goals: to allow undergraduate and graduate students to develop broad bioinformatics skills, and to match non-BCB PIs with on-campus specialists.

Acknowledgements
  • Prof. Heike Hofmann, Associate Professor, Department of Statistics, Iowa State University
  • NSF-IGERT project in Computational Molecular Biology
The following people were instrumental in providing general support to the BCBLab:
  • Dr. Chris Tuggle, Chair, BCB Program
  • Dr. Drena Dobbs, Head of NSF-IGERT Computational Molecular Biology Training Program
  • Dr. Volker Brendel, Associate Chair, BCB Program
  • Dr. Robert Jernigan, Director, L.H. Baker Center for Bioinformatics and Biological Statistics
Abstract
The formation of protein crystals in an X-ray crystallography experiment depends on many conditions, such as temperature, the concentration of salts in the solution, buffer pH, etc. The principles governing crystallization are unclear, and most experimental designs depend on past experience or luck. Hence there is a need to design strategies that find correlations among experimental variables and use them to formulate optimal conditions for crystal formation. We use existing crystallization information to model recursive partition trees, describing relationships between the conditions involved, to predict the crystallization state of a protein. Statistical analysis in R on data for a class of membrane proteins shows an accuracy of 66% in predicting whether or not a crystal will form, and 38% in predicting the specific crystallization state. With a more robust training set, our model could offer better predictions and, in turn, save money and effort in setting up experiments.

Introduction
Macromolecular X-ray crystallography is a method for determining the molecular structure of proteins and nucleic acids.
Data Analysis and Prediction
Modeling the Recursive Partition Tree in R
  • Step 1: Growing the tree
  • While designing the model, the complexity parameter (cp) was set to 0.000001. If the overall lack of fit does not decrease by a minimum factor of cp for a split, the split is not attempted. In this way, splits that are not worthwhile are pruned off, saving computing time. Default values were used for the other control parameters.
  • Step 2: Pruning the tree
  • The best-fitting model is the one for which the cross-validated error (xerror) is minimized. As observed above, the minimum xerror occurs at an nsplit less than the maximum nsplit; this is a consequence of over-fitting the data. To counter this, the partition tree is pruned back to nsplit 9 using the complexity parameter associated with the minimum xerror (0.01 in this case). The new model has nsplit 10, as shown in the figure.
  • Dataset used: 5 different screens of 50 conditions each for a class of hemoglobin proteins (Macromolecular X-Ray Crystallography Facility, Iowa State University).
  • The data obtained for protein crystallization effectively comprise the various conditions in each well of the crystal screen. Observations are made over a period of time and classified by crystallization state (Figure 2).
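The pruning rule in Step 2 amounts to picking the cp value whose cross-validated error is smallest and cutting the tree back to it. A minimal sketch of that selection follows; the cp-table values below are invented for illustration (in R this information comes from rpart's printcp(), and the cut is applied with prune()):

```python
# Choosing the pruning point from an rpart-style cp table.
# The (cp, nsplit, xerror) rows are invented for illustration; the real
# values come from printcp(fit) in R, and prune(fit, cp = best) applies them.

def best_cp(cptable):
    """Return the (cp, nsplit) pair whose cross-validated error is minimal."""
    cp, nsplit, _ = min(cptable, key=lambda row: row[2])
    return cp, nsplit

# Hypothetical cp table: (cp, nsplit, xerror)
cptable = [
    (0.10,      0, 1.00),
    (0.05,      3, 0.85),
    (0.01,      9, 0.72),   # minimum xerror: prune back to this cp
    (0.000001, 25, 0.79),   # the largest tree over-fits: xerror rises again
]
cp, nsplit = best_cp(cptable)
```

Note how the fully grown tree (25 splits) has a *higher* xerror than the 9-split tree, which is exactly the over-fitting symptom the poster describes.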

Recursive Partition Trees
  • A decision tree that helps classify a multivariate response variable based on several (clusters of) dichotomous predictors.
  • Describes the conditional distribution of a response variable Y, given the status of m covariates, by means of tree-structured recursive partitioning (Hothorn et al., Journal of Computational and Graphical Statistics, Vol. 15, No. 3, 2006, pp. 651-674).
  • Recursively subdivides the predictor space into homogeneous regions in two steps:
  • Grow a large tree by subdividing the data into very small pieces.
  • Prune the large tree, based on a defined threshold complexity parameter, to balance complexity and fit.
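The two steps above can be sketched as a tiny hand-rolled partitioner. This is an illustration only, not the authors' R code: the Gini impurity criterion, the dictionary-based tree structure, and the toy data are all assumptions made for the sketch. A split is only kept if it improves fit by at least cp times the root impurity, mirroring rpart's complexity parameter:

```python
# Toy illustration of growing a partition tree with a complexity
# parameter (cp), in the spirit of R's rpart. Not the authors' code;
# the Gini criterion and feature names are assumptions for the sketch.

from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 = perfectly pure)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow(rows, features, cp, root_impurity=None):
    """rows: list of (feature_dict, state). A split is attempted only if
    it reduces impurity by at least cp * root impurity; otherwise the
    node becomes a leaf holding the majority state."""
    labels = [y for _, y in rows]
    if root_impurity is None:
        root_impurity = gini(labels) or 1.0
    best = None
    for f in features:
        for v in {x[f] for x, _ in rows}:          # candidate binary splits
            left = [(x, y) for x, y in rows if x[f] == v]
            right = [(x, y) for x, y in rows if x[f] != v]
            if not left or not right:
                continue
            gain = gini(labels) - (len(left) * gini([y for _, y in left]) +
                                   len(right) * gini([y for _, y in right])) / len(rows)
            if best is None or gain > best[0]:
                best = (gain, f, v, left, right)
    # Splits not worth cp * root impurity are never attempted (pre-pruning).
    if best is None or best[0] < cp * root_impurity:
        return Counter(labels).most_common(1)[0][0]  # leaf: majority state
    gain, f, v, left, right = best
    return {"split": (f, v),
            "yes": grow(left, features, cp, root_impurity),
            "no": grow(right, features, cp, root_impurity)}

# Tiny invented dataset: wells with buffer "a" reached state 8, others state 1.
rows = [({"buffer": "a"}, 8), ({"buffer": "a"}, 8),
        ({"buffer": "x"}, 1), ({"buffer": "x"}, 1)]
tree = grow(rows, ["buffer"], cp=0.000001)
```

With a very small cp (as in the poster, 0.000001) almost every useful split is kept; setting cp large collapses the tree to a single majority-vote leaf.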

Accuracy of the model
  • Cross-validation on the training data yields an accuracy of 68%.
  • Test data comprised 50 different conditions in a crystal screen for another hemoglobin protein, for which the crystallization states were predicted.
  • Predicting the crystallization state (1-9) yields an accuracy of 38%.
  • We simplified the model by treating the state of the protein as a dichotomous variable (considering only whether or not the protein had crystallized, instead of the 9 separate states), which yielded an accuracy of 66%.
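The dichotomous simplification above is a small transformation of the predictions and observations before scoring. The sketch below assumes states 6-9 (1D, 2D, and 3D crystal growth in the Figure 2 scoring scheme) count as "crystallized"; the poster does not state where the authors drew that line, and the five example wells are invented:

```python
# Collapse the 9-state scoring scheme to crystal / no-crystal and score
# accuracy. The cut-off (states >= 6 = crystal growth) is our assumption.

def dichotomize(state):
    return state >= 6   # needles, plates, or single crystals

def accuracy(predicted, observed, collapse=False):
    """Fraction of wells whose predicted state matches the observed one."""
    if collapse:
        predicted = [dichotomize(s) for s in predicted]
        observed = [dichotomize(s) for s in observed]
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)

# Hypothetical predictions vs. observations for 5 wells:
pred = [1, 6, 8, 3, 9]
obs  = [2, 6, 7, 3, 9]
```

Here exact-state accuracy is 3/5, but collapsed accuracy is 5/5: confusing state 1 with 2, or 8 with 7, no longer counts as an error, which is why the dichotomous model scores 66% while the 9-state model scores only 38%.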

Crystallization States
  • X-ray crystallography has two main steps:
  • Grow suitable crystals from a purified sample.
  • Obtain an X-ray diffraction data set from which to solve the structure.
  • The first step is a bottleneck: optimal conditions for crystal nucleation and growth are difficult to predict, so a trial-and-error method is used instead.
  • Sparse Matrix Sampling Technology quickly tests a wide range of pH values, salts, and precipitants using a very small sample of the protein (Jancarik et al., Journal of Applied Crystallography, Vol. 24, 1991, pp. 409-411).
  • It evaluates unique combinations of pH, buffers, salts, and precipitants, and their ability to promote crystal growth, each well being a different condition, as shown in Figure 1.
  • The 9 observable states (Figure 2):
  • 1. Clear drop
  • 2. Phase separation
  • 3. Regular granular precipitate
  • 4. Birefringent precipitate
  • 5. Rosettes or spherulites
  • 6. Needles (1D growth)
  • 7. Plates (2D growth)
  • 8. Single crystals (3D growth, crystal < 0.2 mm)
  • 9. Single crystals (3D growth, crystal > 0.2 mm)

Improvements
  • In the future, expand to larger datasets, with a greater number of conditions represented in the model, giving rise to better predictions.
  • Some states are expressed preferentially over others, and these hidden relationships cannot be represented using a partition tree. We therefore suggest building a Bayesian network or a fuzzy network to represent the conditional relationships between the variables in the screen. These networks would be populated with the conditional probabilities of reaching different states, instead of the dichotomous conditions of the recursive partition tree.

Figure 3. Our Pruned Recursive Partition Tree
  • We modeled our data as a recursive partition tree (Figure 3) in R (Ihaka et al., Journal of Computational and Graphical Statistics, 1996, Vol. 5, No. 3, pp. 299-314).
  • Weights on the partition tree are the probabilities of obtaining a particular state.
  • The model was trained on a set of 200 data points and tested on a set of 50 data points for a categorically similar protein, with crystallization performed using the same crystal screen (i.e. the crystallization conditions were the same as in the training set).
  • The tree is interpreted as follows: start reading from the root node, Buffer. If the buffer is any one of a/d/e/.../p, the condition in the left child node is checked next; if not, precipitant is checked. In general, the left child node holds the condition to be checked when the parent node's condition is affirmed, and the right child node the condition to be checked when it is not.
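Reading the tree this way (affirmative goes left, negative goes right, until a leaf gives the predicted state) is a simple descent. The node contents below are invented for illustration; the real tree, with its buffers and precipitants, is in Figure 3:

```python
# Walking a partition tree of the kind shown in Figure 3.
# Node structure and condition values are invented; "yes" plays the role
# of the left (affirmative) child and "no" the right (negative) child.

def predict(node, well):
    """Descend the tree until a leaf (a predicted state, 1-9) is reached."""
    while isinstance(node, dict):
        feature, allowed = node["test"]
        node = node["yes"] if well[feature] in allowed else node["no"]
    return node

# Invented two-level tree: the root tests the buffer; the negative branch
# then tests the precipitant, as in the poster's reading example.
tree = {
    "test": ("buffer", {"a", "d", "e", "p"}),
    "yes": {"test": ("salt", {"NaCl"}), "yes": 8, "no": 3},
    "no":  {"test": ("precipitant", {"PEG 4000"}), "yes": 6, "no": 1},
}
```

For example, a well with buffer "a" and salt "NaCl" descends yes/yes and is predicted to give small single crystals (state 8) in this toy tree.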

Figure 1. Sparse Matrix Sampling Technology plate (left) and close-up of a well (right)
Figure 2. Scoring scheme: 9 observable crystallization states
Our Goal: Predict crystallization states of proteins based on known conditions and results
Methodology (workflow)
  • Recording observations
  • Importing the data into R
  • Determine key variables
  • Build recursive partition tree
  • Predict crystallization of a new protein
Figure 4. Key variables used in analysis