Title: Identifying Structural Motifs in Proteins
1. Identifying Structural Motifs in Proteins
- Rohit Singh
- Joint work with Mitul Saha
2. The Big Picture: small motifs
Active sites are preserved across proteins with similar functions
3. The Big Picture: large motifs
Even bigger motifs are often conserved.
4. Oh, BTW
- There are two different issues here:
- Find the best match for the motif in the protein
- Extensively studied in vision/graphics
- Is the match significant?
- For small motifs a good match is more likely
- What is the probability that a match against a random protein is this good? (cf. BLAST)
5. What's in it for a CS guy?
- The problem of matching two point-sets has many applications
- Most current algorithms are geared towards points that are indistinguishable (e.g. points on a mesh)
- There are few rigorous results on the significance of matches
6. So what have we done?
- Work towards a more rigorous approach for scoring the quality of a match (between motif and protein)
- Provide a method that is capable of finding the optimum match based on these criteria
7. Problem Description
- Given a motif and a protein, for each point in the motif, find a corresponding point in the protein.
- Given these correspondences, find the best transformation (rotation and translation only) of the motif that aligns it to the protein.
- Optimize over all possible correspondences.
8. Oh, BTW
- Given two sets of k points with known correspondences, it is easy to find the optimal rotation and translation that minimizes the sum-of-squared error (equivalently, the RMSD).
- Boils down to finding the eigenvector for the largest eigenvalue of a 4x4 matrix.
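A minimal sketch of the 4x4 eigenvalue computation, assuming the slide refers to the unit-quaternion formulation (Horn; also used by Besl-McKay). The function name `optimal_rigid_transform` and the row-per-point array layout are illustrative choices, not taken from the talk.

```python
import numpy as np

def optimal_rigid_transform(P, Q):
    """Return (R, t) minimizing sum_i ||R @ P[i] + t - Q[i]||^2.

    P and Q are k x 3 arrays of paired points (row i of P matches row i of Q).
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)

    # 3x3 cross-covariance of the centered point sets: S[a, b] = sum_i p_a * q_b
    S = (P - p_mean).T @ (Q - q_mean)

    # Symmetric 4x4 matrix whose top eigenvector is the optimal unit quaternion.
    A = S - S.T
    delta = np.array([A[1, 2], A[2, 0], A[0, 1]])
    N = np.empty((4, 4))
    N[0, 0] = np.trace(S)
    N[0, 1:] = delta
    N[1:, 0] = delta
    N[1:, 1:] = S + S.T - np.trace(S) * np.eye(3)

    # eigh returns eigenvalues in ascending order, so the last column is the
    # eigenvector of the largest eigenvalue.
    _, eigvecs = np.linalg.eigh(N)
    w, x, y, z = eigvecs[:, -1]

    # Convert the unit quaternion (w, x, y, z) to a rotation matrix.
    R = np.array([
        [w*w + x*x - y*y - z*z, 2*(x*y - w*z),         2*(x*z + w*y)],
        [2*(x*y + w*z),         w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
        [2*(x*z - w*y),         2*(y*z + w*x),         w*w - x*x - y*y + z*z],
    ])
    t = q_mean - R @ p_mean
    return R, t

def rmsd(P, Q, R, t):
    """Root-mean-square deviation after applying (R, t) to P."""
    diff = np.asarray(P, float) @ R.T + t - np.asarray(Q, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

The sign of the eigenvector does not matter, because the rotation matrix is quadratic in the quaternion components.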
9. Previous Work
- Brute-force approach: match edges of the same length.
- Geometric hashing
Pennec & Ayache, Bioinformatics, 1998
10. What is missing?
- Ad hoc: these methods try to minimize a quantity that is only indirectly related to the least-squares error or RMSD.
- Hard to evaluate the quality of partial matches
- Brute-force methods are infeasible for larger motifs
- Geometric hashing requires significant preprocessing
11. Estimating the error
- Model the alignment problem as a regression problem
Y: model set (protein); T: data set (motif); g: transformation (rotation + translation)
- Which error criterion to use?
- Least mean squared error (also RMSD)
- LSE is not good when you have outliers.
- What to do?
12. Robust error estimation
- LSE: larger error terms have disproportionate influence.
- Use a function to reduce the effect of larger error terms (M-estimators)
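To make the point about disproportionate influence concrete, here is a small sketch comparing the squared loss with one common M-estimator, the Huber loss. The slides do not say which M-estimator was used, so the choice of Huber and the threshold `c` are assumptions for illustration only.

```python
import numpy as np

def squared_loss(r):
    """Ordinary least-squares loss: grows quadratically with the residual."""
    return 0.5 * np.asarray(r, dtype=float) ** 2

def huber_loss(r, c=1.0):
    """Huber loss: quadratic for |r| <= c, linear beyond (c is a hypothetical threshold)."""
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= c, 0.5 * a ** 2, c * (a - 0.5 * c))

def huber_weight(r, c=1.0):
    """Weight each residual would receive in a weighted least-squares fit."""
    a = np.abs(np.asarray(r, dtype=float))
    return np.where(a <= c, 1.0, c / np.maximum(a, 1e-12))

# Three well-aligned points and one outlier (residuals in arbitrary units).
residuals = np.array([0.1, 0.2, 0.1, 5.0])
share_lse = squared_loss(residuals)[-1] / squared_loss(residuals).sum()
share_huber = huber_loss(residuals)[-1] / huber_loss(residuals).sum()
print(f"outlier share of total loss: LSE {share_lse:.3f}, Huber {share_huber:.3f}")
```

Under the squared loss the single outlier accounts for a larger share of the total error than under the Huber loss, which is the effect the M-estimator is meant to dampen.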
13. It's an optimization problem!
- Consider the case of full matching
- Domain: set of all possible correspondences between points on the motif and points on the protein
- Range: given a particular set of corresponding points, the minimum error in aligning those point sets.
- Goal: find the global minimum of this function!
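A quick way to see how large this domain is, assuming each motif point must map to a distinct protein point (a one-to-one correspondence, which the slides do not state explicitly). The function name is hypothetical.

```python
import math

def n_full_correspondences(motif_size, protein_size):
    """Number of ordered one-to-one assignments of motif points to protein points."""
    # math.perm(n, k) = n! / (n - k)!  and returns 0 when k > n.
    return math.perm(protein_size, motif_size)

# Example: the count grows rapidly with motif size for a fixed protein size.
for k in (2, 3, 4, 5):
    print(k, n_full_correspondences(k, 200))
```

The growth of this count with motif size is why exhaustive enumeration is only practical for very small motifs.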
14. Looking for the global minimum
- Our approach:
- Prune the search space to a small and plausible sub-space
- Find (most of) the local minima in this sub-space quickly
- Choose the minimum over these local minima
15. Finding local minima is easy: ICP
- Iterative Closest Point (Besl-McKay)
16. ICP contd.
- ICP is guaranteed to converge to a local minimum
- But the result depends a lot on the initial seeding
- Convergence is quick: 4-5 iterations
- ICP movie
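A bare-bones sketch of the ICP loop: alternate between assigning each motif point to its closest protein point and refitting the rigid transform. For brevity the alignment step here uses the SVD-based (Kabsch) solution rather than the 4x4 eigenvalue formulation; both minimize the same sum-of-squared error. The function names, the brute-force nearest-neighbour search, and the stopping tolerance are assumptions, not details from the talk.

```python
import numpy as np

def fit_rigid(P, Q):
    """Least-squares rotation R and translation t mapping paired points P onto Q."""
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_mean).T @ (Q - q_mean)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_mean - R @ p_mean

def icp(motif, protein, R0=None, t0=None, max_iter=50, tol=1e-9):
    """Return (R, t, nearest, rmsd) at the local minimum reached from the seed (R0, t0)."""
    motif = np.asarray(motif, dtype=float)
    protein = np.asarray(protein, dtype=float)
    R = np.eye(3) if R0 is None else np.asarray(R0, dtype=float)
    t = np.zeros(3) if t0 is None else np.asarray(t0, dtype=float)
    prev_err = np.inf
    for _ in range(max_iter):
        moved = motif @ R.T + t
        # Closest protein point for every motif point (brute-force distances).
        d2 = ((moved[:, None, :] - protein[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        # Refit the transform to the current correspondences.
        R, t = fit_rigid(motif, protein[nearest])
        diff = motif @ R.T + t - protein[nearest]
        err = float(np.sqrt((diff ** 2).sum(axis=1).mean()))
        if prev_err - err < tol:                # no further improvement
            break
        prev_err = err
    return R, t, nearest, err
```

Note that plain ICP allows several motif points to pick the same protein point; enforcing one-to-one matches would need an extra assignment step.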
17. Pruning the search space
- Every point in motif/protein has some features
- Amino acid type, element type, secondary structure, hydrophobic/polar, substitutable
- Assume a point with feature X can only match another point with feature X (or Y, Z, W)
- Assume some features are more frequent than others
18. Our Approach
- Find the feature that is least frequent in the protein.
- For each occurrence of the feature:
- Seed ICP appropriately. Find the local minimum.
- Look around a few more times
- Return the best answer you have
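The first two steps above can be sketched as follows. The slides do not spell out how ICP is seeded, so this only enumerates candidate anchor pairs (one motif point and one protein point sharing the rarest feature); turning an anchor pair into an initial transform is left open. Restricting the choice to features that actually occur in the motif is an added assumption, and the feature labels are made up.

```python
from collections import Counter

def rarest_feature_seeds(motif_features, protein_features):
    """Pick the motif feature that is rarest in the protein and list anchor pairs.

    Returns (feature, [(motif_index, protein_index), ...]).
    """
    counts = Counter(protein_features)
    # Counter returns 0 for a feature missing from the protein, in which case
    # the seed list is empty and no full match can exist.
    rarest = min(set(motif_features), key=lambda f: (counts[f], f))
    motif_idx = [i for i, f in enumerate(motif_features) if f == rarest]
    protein_idx = [j for j, f in enumerate(protein_features) if f == rarest]
    return rarest, [(i, j) for i in motif_idx for j in protein_idx]

# Hypothetical residue-type labels for a 3-point motif and an 8-point protein.
motif = ["HIS", "ASP", "SER"]
protein = ["SER", "GLY", "HIS", "SER", "ASP", "ASP", "GLY", "SER"]
feature, seeds = rarest_feature_seeds(motif, protein)
print(feature, seeds)
```

Each seed would then start one ICP run, and the run with the lowest final error is kept.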
19. Observations
- Will always find a perfect match, if it exists.
- Moreover, will find such a match quickly.
- The error is directly interpretable in RMSD terms
20. Does it work?
21. contd.
Trypsin active site against trypsin-like proteins
22. contd.
Trypsin active site against kinases
23. What about partial matching?
- Basic idea is the same: pruning + ICP
- Replace least-squares error estimates by M-estimator-based errors.
- Problem: how to find the optimal rotation/translation that minimizes this new kind of error criterion?
- Answer: weighted LSE?
- Is there a better way?
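One way to read the "weighted LSE" answer is iteratively reweighted least squares: fit a weighted rigid transform, recompute per-point weights from the residuals, and repeat. The sketch below takes that reading as an assumption; the Huber weights, the threshold `c`, the fixed iteration count, and the function names are all illustrative and not taken from the talk.

```python
import numpy as np

def weighted_fit_rigid(P, Q, w):
    """Rotation R and translation t minimizing sum_i w[i] * ||R @ P[i] + t - Q[i]||^2."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    p_mean, q_mean = w @ P, w @ Q               # weighted centroids
    H = (P - p_mean).T @ (w[:, None] * (Q - q_mean))
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_mean - R @ p_mean

def irls_align(P, Q, c=1.0, n_iter=20):
    """Robust alignment of paired points by reweighting with Huber weights."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    w = np.ones(len(P))
    for _ in range(n_iter):
        R, t = weighted_fit_rigid(P, Q, w)
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)
        # Points with residual above c get progressively less say in the fit.
        w = np.where(r <= c, 1.0, c / np.maximum(r, 1e-12))
    return R, t, w
```

Points that fit well keep weight 1, while badly fitting points are down-weighted rather than discarded, which is what allows a partial match to be scored.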
24. RANSAC
Choice of the parameters has statistical justification
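The slide does not say which parameters are meant. One standard example of such a justification is the number of random samples to draw: if a fraction `w` of the points are inliers and each sample uses `k` points, then enough trials are needed that at least one all-inlier sample is drawn with probability `p`. The sketch below computes that count; the function and parameter names are hypothetical.

```python
import math

def ransac_trials(inlier_fraction, sample_size, success_prob=0.99):
    """Smallest N with 1 - (1 - w**k)**N >= p, for w = inlier_fraction, k = sample_size."""
    if not 0.0 < inlier_fraction <= 1.0:
        raise ValueError("inlier_fraction must be in (0, 1]")
    if not 0.0 < success_prob < 1.0:
        raise ValueError("success_prob must be in (0, 1)")
    p_good_sample = inlier_fraction ** sample_size
    if p_good_sample >= 1.0:
        return 1                                 # every sample is all-inlier
    return math.ceil(math.log(1.0 - success_prob) / math.log(1.0 - p_good_sample))

# Example: half the points are inliers, 3 points per sample, 99% confidence.
print(ransac_trials(0.5, 3))
```

Three points per sample is a natural choice here because three non-collinear point pairs determine a rigid transform in 3D.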
25. Plain Vanilla (Least Squares)
26. M-estimator weighted LSE
27. M-estimator RANSAC
28. contd.
- Data for distorted trypsin active site against ten different trypsins
29. Future Work
- Test on larger motifs: secondary structure elements
- Choice of better features
- A theoretical guarantee about the quality of results
- Explore different criteria for partial matching
30. Thanks!