Identifying Structural Motifs in Proteins



1
Identifying Structural Motifs in Proteins
  • Rohit Singh
  • Joint work with Mitul Saha

2
The Big Picture: small motifs
Active sites are preserved across proteins with similar functions.
3
The Big Picture: large motifs
Even bigger motifs are often conserved.
4
Oh, BTW
  • There are two different issues here:
  • Find the best match for the motif in the protein
  • Extensively studied in vision/graphics
  • Is the match significant?
  • For small motifs a good match is more likely
  • What is the probability of a match against a random
    protein being this good? (cf. BLAST)

5
What's in it for a CS guy?
  • The problem of matching two point-sets has many
    applications
  • Most current algorithms are geared towards points
    that are indistinguishable (e.g. points on a
    mesh)
  • There are few rigorous results on the
    significance of matches

6
So what have we done?
  • Towards a more rigorous approach for scoring the
    quality of a match (between motif and protein)
  • Provide a method that is capable of finding the
    optimum match based on these criteria

7
Problem Description
  • Given a motif and a protein, for each point in
    the motif, find a corresponding point in the
    protein.
  • Given these correspondences, find the best
    transformation (rotation and translation only) of
    the motif that aligns it to the protein.
  • Optimize over all possible correspondences

8
Oh, BTW
  • Given two sets of k points with known
    correspondences, it is easy to find the rotation
    and translation that minimize the sum-of-squared
    error (equivalently, the RMSD).
  • Boils down to finding the eigenvector for the
    largest eigenvalue of a 4x4 matrix.
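
A minimal sketch of this closed-form step, assuming the 4x4 matrix is the one from Horn's quaternion method (the slide does not name the method). The name `horn_align` and the use of NumPy are choices made for this illustration only.

```python
import numpy as np

def horn_align(P, Q):
    """Rigid (R, t) minimizing sum_i ||R @ P[i] + t - Q[i]||^2.

    P and Q are (k, 3) arrays; row i of P corresponds to row i of Q.
    """
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    S = (P - p_bar).T @ (Q - q_bar)  # 3x3 cross-covariance
    Sxx, Sxy, Sxz = S[0]
    Syx, Syy, Syz = S[1]
    Szx, Szy, Szz = S[2]
    # Symmetric 4x4 matrix from Horn's quaternion formulation.
    N = np.array([
        [Sxx + Syy + Szz, Syz - Szy,        Szx - Sxz,        Sxy - Syx],
        [Syz - Szy,       Sxx - Syy - Szz,  Sxy + Syx,        Szx + Sxz],
        [Szx - Sxz,       Sxy + Syx,       -Sxx + Syy - Szz,  Syz + Szy],
        [Sxy - Syx,       Szx + Sxz,        Syz + Szy,       -Sxx - Syy + Szz],
    ])
    # eigh sorts eigenvalues in ascending order; the last eigenvector
    # is the unit quaternion (w, x, y, z) of the optimal rotation.
    _, vecs = np.linalg.eigh(N)
    w, x, y, z = vecs[:, -1]
    R = np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])
    t = q_bar - R @ p_bar
    return R, t

def rmsd(P, Q, R, t):
    """Root-mean-square deviation after applying (R, t) to P."""
    diff = np.asarray(P, dtype=float) @ R.T + t - np.asarray(Q, dtype=float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

With noise-free input the recovered transform reproduces the one used to generate `Q`, and `rmsd` drops to numerical zero.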

9
Previous Work
  • Brute-force approach: match edges of the same length.
  • Geometric hashing

Pennec & Ayache, Bioinformatics, 1998
10
What is missing?
  • Ad hoc: they try to minimize a quantity that is
    only indirectly related to the least-squares
    error or RMSD.
  • Hard to evaluate the quality of partial matches
  • Brute-force methods are infeasible for larger motifs
  • Geometric hashing requires significant
    preprocessing

11
Estimating the error
  • Model the alignment problem as a regression
    problem

Y: model set (protein); T: data set (motif);
g: transformation (rotation + translation)
  • Which error criterion to use?
  • Least squared error (LSE; corresponds to RMSD)
  • LSE is not good when you have outliers.
  • What to do?

12
Robust error estimation
  • LSE: larger error terms have disproportionate
    influence.
  • Use a function that reduces the effect of larger
    error terms (M-estimators)
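
To make the contrast concrete, here is a small sketch comparing the squared loss with two common M-estimator losses. The slide does not say which M-estimator was used, so Huber and Tukey's biweight (and the tuning constants `delta` and `c`) are assumptions chosen for illustration.

```python
def squared_loss(r):
    """Least-squares loss: grows quadratically with the residual r."""
    return 0.5 * r * r

def huber_loss(r, delta=1.0):
    """Huber loss: quadratic near zero, linear beyond delta."""
    a = abs(r)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

def tukey_loss(r, c=4.685):
    """Tukey biweight loss: saturates at c*c/6 for |r| >= c."""
    a = abs(r)
    if a >= c:
        return c * c / 6.0
    u = 1.0 - (a / c) ** 2
    return (c * c / 6.0) * (1.0 - u ** 3)

if __name__ == "__main__":
    # A residual of 10 dominates the squared loss but not the robust ones.
    for r in (0.5, 2.0, 10.0):
        print(r, squared_loss(r), huber_loss(r), tukey_loss(r))
```

For small residuals the Huber loss equals the squared loss; only large residuals are penalized less, and the Tukey loss stops growing altogether.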

13
It's an optimization problem!
  • Consider the case of full matching
  • Domain: the set of all possible correspondences
    between points on the motif and points on the
    protein
  • Range: given a particular set of corresponding
    points, the minimum error in aligning those point
    sets.
  • Goal: find the global minimum of this function!
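
One way to write this objective down. The notation is assumed here and does not come from the slides: motif points m_1..m_k, protein points p_1..p_n, a correspondence c mapping {1..k} to {1..n}, rotation R and translation t.

```latex
E(c) \;=\; \min_{R \in SO(3),\; t \in \mathbb{R}^3}
\sqrt{\frac{1}{k} \sum_{i=1}^{k} \bigl\lVert R\,m_i + t - p_{c(i)} \bigr\rVert^{2}},
\qquad
c^{\ast} \;=\; \arg\min_{c} E(c)
```

The inner minimization is the closed-form alignment for a fixed correspondence; the outer one is the combinatorial search over correspondences.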

14
Looking for global minimum
  • Our approach
  • Prune the search space to a small and plausible
    sub-space
  • Find most of the local minima in this sub-space
    quickly
  • Choose the minimum over these local minima

15
Finding local minima is easy: ICP
  • Iterative Closest Point (Besl & McKay)

16
ICP contd
  • ICP is guaranteed to converge to a local minimum
  • But the result depends a lot on the initial seeding
  • Convergence is quick: 4-5 iterations
  • ICP movie
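
A bare-bones ICP loop, as a sketch only. It uses a brute-force nearest-neighbour search and an SVD-based (Kabsch) alignment step in place of the quaternion eigenvalue method; both give the least-squares rigid transform. All names and defaults (`icp`, `max_iter`, `tol`) are assumptions for this illustration.

```python
import numpy as np

def best_rigid_transform(P, Q):
    """Least-squares (R, t) mapping rows of P onto rows of Q (Kabsch/SVD)."""
    p_bar, q_bar = P.mean(axis=0), Q.mean(axis=0)
    H = (P - p_bar).T @ (Q - q_bar)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_bar - R @ p_bar

def icp(motif, protein, R0=None, t0=None, max_iter=50, tol=1e-9):
    """Iterate: match each motif point to its closest protein point, re-align."""
    motif = np.asarray(motif, dtype=float)
    protein = np.asarray(protein, dtype=float)
    R = np.eye(3) if R0 is None else np.asarray(R0, dtype=float)
    t = np.zeros(3) if t0 is None else np.asarray(t0, dtype=float)
    prev_err = np.inf
    for _ in range(max_iter):
        moved = motif @ R.T + t
        # Squared distances between every moved motif point and protein point.
        d2 = ((moved[:, None, :] - protein[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)  # closest protein point per motif point
        R, t = best_rigid_transform(motif, protein[idx])
        diff = motif @ R.T + t - protein[idx]
        err = float(np.sqrt((diff ** 2).sum(axis=1).mean()))  # RMSD
        if prev_err - err < tol:  # no further improvement: local minimum
            break
        prev_err = err
    return R, t, idx, err
```

The initial `(R0, t0)` is the seed: a different seed can lead to a different local minimum, which is why the seeding strategy matters.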

17
Pruning the search space
  • Every point in motif/protein has some features
  • Amino acid type, element type, secondary structure,
    hydrophobic/polar, substitutable
  • Assume a point with feature X can only match
    another point with feature X (or Y,Z,W)
  • Assume some features are more frequent than
    others

18
Our Approach
  • Find the feature that is least frequent in the
    protein.
  • For each occurrence of the feature:
  • Seed ICP appropriately and find a local minimum.
  • Look around a few more times
  • Return the best answer you have
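
The seed-selection step can be sketched as follows. The slides do not say how features are encoded or how a seed is turned into an initial transform, so the string labels and the function names here are hypothetical; the sketch only enumerates candidate (motif point, protein point) pairs for the rarest feature.

```python
from collections import Counter

def rarest_feature(motif_features, protein_features):
    """Motif feature with the fewest occurrences in the protein."""
    counts = Counter(protein_features)
    # sorted() makes tie-breaking deterministic.
    return min(sorted(set(motif_features)), key=lambda f: counts[f])

def seed_pairs(motif_features, protein_features):
    """All (motif index, protein index) pairs sharing the rarest feature.

    Each pair is one candidate seed from which ICP would be started.
    """
    f = rarest_feature(motif_features, protein_features)
    m_idx = [i for i, x in enumerate(motif_features) if x == f]
    p_idx = [j for j, x in enumerate(protein_features) if x == f]
    return f, [(i, j) for i in m_idx for j in p_idx]

if __name__ == "__main__":
    motif = ["HIS", "ASP", "SER"]
    protein = ["GLY", "SER", "HIS", "SER", "ASP", "GLY", "SER", "ASP"]
    print(seed_pairs(motif, protein))
```

Choosing the rarest feature keeps the number of ICP runs small: the fewer occurrences in the protein, the fewer seeds to try.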

19
Observations
  • Will always find a perfect match, if it exists.
  • Moreover, will find such a match quickly.
  • The error is directly interpretable in RMSD terms

20
Does it work?
21
contd
Trypsin active site against trypsin-like proteins
22
contd
Trypsin active site against kinases
23
What about partial matching?
  • Basic idea is the same: pruning + ICP
  • Replace least-squares error estimates with
    M-estimator-based errors.
  • Problem: how to find the optimal
    rotation/translation that minimizes this new
    error criterion?
  • Answer: weighted LSE?
  • Is there a better way?
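
One reading of "weighted LSE" is iteratively reweighted least squares: solve a weighted alignment, recompute weights from the residuals, repeat. The sketch below assumes Huber weights and an SVD-based weighted alignment; neither choice is stated in the slides, and the names `irls_align` and `delta` are illustrative.

```python
import numpy as np

def weighted_rigid_transform(P, Q, w):
    """(R, t) minimizing sum_i w[i] * ||R @ P[i] + t - Q[i]||^2."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()
    p_bar, q_bar = w @ P, w @ Q  # weighted centroids
    H = (P - p_bar).T @ (w[:, None] * (Q - q_bar))
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, q_bar - R @ p_bar

def huber_weights(r, delta):
    """IRLS weights for the Huber loss: 1 inside delta, delta/r outside."""
    return np.where(r <= delta, 1.0, delta / np.maximum(r, 1e-12))

def irls_align(P, Q, delta=1.0, n_iter=20):
    """Robust alignment: alternate weighted fits and weight updates."""
    P = np.asarray(P, dtype=float)
    Q = np.asarray(Q, dtype=float)
    w = np.ones(len(P))
    for _ in range(n_iter):
        R, t = weighted_rigid_transform(P, Q, w)
        r = np.linalg.norm(P @ R.T + t - Q, axis=1)  # per-point residuals
        w = huber_weights(r, delta)
    return R, t, w
```

Points that fit poorly end up with small weights, so a few outliers no longer drag the alignment away from the well-matched points.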

24
RANSAC
Choice of the parameters has statistical
justification
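
The slide does not list the parameters it has in mind. One standard example of a statistically justified RANSAC parameter is the number of random samples needed to draw at least one outlier-free sample with a given confidence. The sketch below assumes a minimal sample of 3 point pairs (enough to fix a rigid transform in 3D); the function name and defaults are illustrative.

```python
import math

def ransac_iterations(inlier_ratio, sample_size=3, confidence=0.99):
    """Smallest N with 1 - (1 - inlier_ratio**sample_size)**N >= confidence."""
    if not 0.0 < inlier_ratio <= 1.0:
        raise ValueError("inlier_ratio must be in (0, 1]")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be in (0, 1)")
    p_good = inlier_ratio ** sample_size  # chance one sample is all inliers
    if p_good >= 1.0:
        return 1
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_good))

if __name__ == "__main__":
    for ratio in (0.9, 0.7, 0.5):
        print(ratio, ransac_iterations(ratio))
```

The count grows quickly as the inlier ratio drops, which is one reason pruning candidate correspondences by feature first is attractive.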
25
Plain Vanilla (Least Squares)
26
M-estimator weighted LSE
27
M-estimator RANSAC
28
contd
  • Data for distorted trypsin active site against
    ten different trypsins

29
Future Work
  • Test on larger motifs: secondary structure
    elements
  • Choice of better features
  • A theoretical guarantee about the quality of
    results
  • Explore different criteria for partial matching

30
Thanks!