Title: Approximation of Protein Structure for Fast Similarity Measures
1. Approximation of Protein Structure for Fast Similarity Measures
- Fabian Schwarzer
- Itay Lotan
- Stanford University
2. Comparing Protein Structures
(Figure: two conformations of the same protein, shown side by side)
- Analysis of MD and MC simulation trajectories
- Graph-based methods
- Structure prediction applications
  - Evaluating decoy sets
  - Clustering predictions (Shortle et al., Biophysics '98)
- Stochastic Roadmap Simulation (Apaydin et al., RECOMB '02)
- http://folding.stanford.edu
3. k Nearest-Neighbors Problem
Given a set S of conformations of a protein and a
query conformation c, find the k conformations
in S most similar to c.
Can be done in O(N · L) time, where N = size of S and L = time to compare two conformations.
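For concreteness, a minimal brute-force sketch (my own illustration, not code from the talk), assuming each conformation is a numpy coordinate array and `dist` is any of the similarity measures discussed below:

```python
# Brute-force k-NN: compare the query against all N conformations.
# Runtime is N * L, where L is the cost of one similarity evaluation.
import numpy as np

def knn_brute_force(query, conformations, k, dist):
    """Return indices of the k conformations most similar to `query`.

    `conformations` is a sequence of coordinate arrays; `dist` is any
    similarity measure (e.g. cRMS or dRMS) taking two such arrays.
    """
    scores = np.array([dist(query, c) for c in conformations])  # N evaluations
    return np.argsort(scores)[:k]                               # k smallest distances
```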
4. k Nearest-Neighbors Problem
- What if it is needed for all c in S?
  - Too much time
- Can be improved by:
  - Reducing L
  - A more efficient algorithm
5. Our Solution
- Reduce the structure description → approximate but fast similarity measures
- Reduce the description further → efficient nearest-neighbor algorithms can be used
6. Description of a Protein's Structure
- 3n coordinates of the Cα atoms (n = number of residues)
7. Similarity Measures: cRMS
- The RMS of the distances between corresponding atoms after the two conformations are optimally aligned
- Computed in O(n) time
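The optimal alignment is commonly computed with the SVD-based Kabsch procedure; a compact numpy sketch of cRMS along those lines (my illustration, not the authors' code):

```python
# cRMS sketch using SVD-based (Kabsch) superposition.
# X and Y are (n, 3) arrays of corresponding C-alpha coordinates.
import numpy as np

def crms(X, Y):
    Xc = X - X.mean(axis=0)            # remove translation
    Yc = Y - Y.mean(axis=0)
    U, _, Vt = np.linalg.svd(Xc.T @ Yc)
    # Correct for a possible reflection so that R is a proper rotation.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    diff = Xc @ R - Yc                 # optimally rotated X minus Y
    return np.sqrt((diff ** 2).sum() / len(X))
```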
8. Similarity Measures: dRMS
- The Euclidean distance between the intra-molecular distance matrices of the two conformations
- Computed in O(n²) time
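A short numpy sketch of dRMS (my illustration; the exact normalization convention, here averaging over the unordered pairs, is an assumption):

```python
# dRMS sketch: compare the intra-molecular C-alpha distance matrices.
# X and Y are (n, 3) coordinate arrays.
import numpy as np

def distance_matrix(X):
    diff = X[:, None, :] - X[None, :, :]      # pairwise coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))  # (n, n) distance matrix

def drms(X, Y):
    n = len(X)
    i, j = np.triu_indices(n, k=1)            # each of the n*(n-1)/2 pairs once
    dA = distance_matrix(X)[i, j]
    dB = distance_matrix(Y)[i, j]
    return np.sqrt(((dA - dB) ** 2).mean())
```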
9. m-Averaged Approximation
- Cut the chain into m pieces
- Replace each sequence of n/m Cα atoms by its centroid
- 3n coordinates → 3m coordinates
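A minimal sketch of m-averaging (my illustration; how a remainder is handled when m does not divide n is my choice, not necessarily the authors'):

```python
# m-averaging: cut the chain of n C-alpha atoms into m contiguous pieces and
# replace each piece by its centroid, shrinking 3n coordinates to 3m.
import numpy as np

def m_average(X, m):
    """X: (n, 3) array of C-alpha coordinates -> (m, 3) array of centroids."""
    pieces = np.array_split(X, m)                      # m contiguous pieces
    return np.array([p.mean(axis=0) for p in pieces])  # centroid of each piece
```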
10. Why m-Averaging?
- Averaging reduces the description of random chains with small error
  - Demonstrated through Haar wavelet analysis
- Protein backbones behave on average like random chains
  - Chain topology
  - Limited compactness
11. Evaluation: Test Sets
- 9 structurally diverse proteins of 38-76 residues
- Decoy sets: conformations from the Park-Levitt set (Park & Levitt, JMB '96), N = 10,000
- Random sets: conformations generated by the program FOLDTRAJ (Feldman & Hogue, Proteins '00), N = 5,000
12. Decoy Sets: Correlation
(Table: correlation between the exact and m-averaged measures for each test protein; per-protein value pairs: 0.37/0.73, 0.40/0.86, 0.84/0.98, 0.70/0.94, 0.98/0.99, 0.92/0.96, 0.98/0.99, 0.92/0.98, 0.98/0.99, 0.93/0.97.)
Higher correlation for random sets!
13. Speed-up for Decoy Sets
- Between 5X and 8X for cRMS (m = 8)
- Between 9X and 36X for dRMS (m = 12)
- With very small error
- For random sets the speed-up for dRMS was between 25X and 64X (m = 8)
14. Efficient Nearest-Neighbor Algorithms
- There are efficient nearest-neighbor algorithms, but they are not compatible with the similarity measures:
  - cRMS is not a Euclidean metric
  - dRMS uses a space of dimensionality n(n-1)/2
15. Further Dimensionality Reduction of dRMS
- kd-trees require dimension ≤ 20
- m-averaging with dRMS is not enough
- Reduce further using SVD
- SVD: a tool for principal component analysis; computes the directions of greatest variance
16. Reduction Using SVD
- Stack the m-averaged distance matrices as vectors
- Compute the SVD of the entire set
- Project onto the most important singular vectors
- dRMS is thus reduced to ≤ 20 dimensions
- Without m-averaging, the SVD can be too costly
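A possible numpy sketch of this pipeline (my illustration; centering the vectors before the SVD, i.e. standard PCA, is an assumption, and the defaults m = 16 and 20 dimensions follow the next slide):

```python
# SVD-based reduction: m-average each conformation, stack the upper triangles
# of the resulting distance matrices as row vectors, take the SVD of the whole
# set, and keep the projections onto the leading right singular vectors.
import numpy as np

def reduce_with_svd(conformations, m=16, n_dims=20):
    rows = []
    for X in conformations:                                            # X: (n, 3) C-alpha coords
        C = np.array([p.mean(axis=0) for p in np.array_split(X, m)])   # m centroids
        D = np.linalg.norm(C[:, None, :] - C[None, :, :], axis=-1)     # (m, m) distance matrix
        i, j = np.triu_indices(m, k=1)
        rows.append(D[i, j])                                           # flatten upper triangle
    A = np.array(rows)                                                 # (N, m*(m-1)/2)
    mean = A.mean(axis=0)
    _, _, Vt = np.linalg.svd(A - mean, full_matrices=False)
    basis = Vt[:n_dims]                                                # top principal directions
    return (A - mean) @ basis.T                                        # each conformation -> n_dims numbers
```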
17. Testing the Method
- Use decoy sets (N = 10,000)
- m-averaging with m = 16
- Project onto the 20 largest PCs (more than 95% of the variance)
- Each conformation is represented by 20 numbers
18. Results
- For k = 10, 25, 100:
  - Decoy sets: 80% correct; the furthest NN is off by 10-20% (0.7 Å - 1.5 Å)
  - 1CTF with N = 100,000 → similar results
  - Random sets: ~90% correct, with smaller error (5-10%)
- When precision is important, use as a pre-filter with a larger k than needed
19. Running Time
- N = 100,000, k = 100 nearest neighbors for each conformation
- Brute force: 84 hours
- Brute force + m-averaging: 4.8 hours
- Brute force + m-averaging + SVD: 41 minutes
- kd-tree + m-averaging + SVD: 19 minutes
- kd-trees will have more impact for larger sets
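A sketch of the final kd-tree step on the SVD-reduced vectors, using scipy's cKDTree as a stand-in for whatever kd-tree implementation was actually used in the talk:

```python
# Query a kd-tree built over the reduced ~20-D vectors for every conformation.
import numpy as np
from scipy.spatial import cKDTree

def all_knn(reduced, k=100):
    """reduced: (N, d) array of SVD-reduced conformations, d <= ~20."""
    tree = cKDTree(reduced)
    # k + 1 because each point is its own nearest neighbor; drop that column.
    _, idx = tree.query(reduced, k=k + 1)
    return idx[:, 1:]
```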
20. Structural Classification
- Computing the similarity between structures of two different proteins is more involved
(Figure: PDB structures 2MM1 vs. 1IRD)
- The correspondence problem: which parts of the two structures should be compared?
21. STRUCTAL (Gerstein & Levitt '98)
- Compute the optimal correspondence using dynamic programming
- Optimally align the corresponding parts in space to minimize cRMS
- Repeat until convergence
- O(n1 · n2) time
- The result depends on the initial correspondence!
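A heavily simplified sketch of this iteration (my reading of the three steps above, not the STRUCTAL program; the DP scoring function and gap penalty below are placeholders, not STRUCTAL's actual parameters):

```python
# Alternate (1) a dynamic-programming alignment of the two chains under the
# current superposition and (2) a Kabsch superposition of the matched pairs.
import numpy as np

def align_dp(A, B, gap=-1.0):
    """Needleman-Wunsch style alignment of two C-alpha chains; returns matched index pairs."""
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    score = 20.0 / (1.0 + (d / 5.0) ** 2)        # placeholder distance-based score
    n1, n2 = len(A), len(B)
    F = np.zeros((n1 + 1, n2 + 1))
    F[1:, 0] = gap * np.arange(1, n1 + 1)
    F[0, 1:] = gap * np.arange(1, n2 + 1)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            F[i, j] = max(F[i - 1, j - 1] + score[i - 1, j - 1],
                          F[i - 1, j] + gap, F[i, j - 1] + gap)
    pairs, i, j = [], n1, n2                      # traceback
    while i > 0 and j > 0:
        if F[i, j] == F[i - 1, j - 1] + score[i - 1, j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif F[i, j] == F[i - 1, j] + gap:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def structal_like(A, B, n_iter=10):
    """Iterate correspondence and superposition; return the cRMS of the matched residues."""
    B_cur = B.copy()
    for _ in range(n_iter):
        pairs = align_dp(A, B_cur)
        ia = np.array([p for p, _ in pairs])
        ib = np.array([q for _, q in pairs])
        P, Q = B_cur[ib], A[ia]                   # matched parts of B and A
        Pc, Qc = P - P.mean(0), Q - Q.mean(0)
        U, _, Vt = np.linalg.svd(Pc.T @ Qc)
        R = U @ np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))]) @ Vt
        B_cur = (B_cur - P.mean(0)) @ R + Q.mean(0)   # apply the transform to all of B
    return np.sqrt(((B_cur[ib] - A[ia]) ** 2).sum() / len(ia))
```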
22. STRUCTAL + m-averaging
- Compute the similarity for structures of the same SCOP super-family, with and without m-averaging:
  - n/m = 3: correlation 0.60 / 0.66, speed-up 7X
  - n/m = 5: correlation 0.44 / 0.58, speed-up 19X
  - n/m = 8: correlation 0.35 / 0.57, speed-up 46X
- NN results were disappointing
23. Conclusion
- Fast computation of similarity measures
- Trade-off between speed and precision
- Exploits the chain topology and limited compactness of proteins
- Allows the use of efficient nearest-neighbor algorithms
- Can be used as a pre-filter when precision is important
24. Random Chains
(Figure: a random chain with vertices c0, c1, ..., cn-1)
- The dimensions are uncorrelated
- Average behavior can be approximated by normal
variables
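A minimal generator for such a chain (my generic construction: unit-length bonds with independent, uniformly random directions; the talk's exact model may differ). Because each coordinate of a chain point is a sum of many independent step components, its distribution is approximately normal:

```python
# Random chain: n points c_0..c_{n-1} joined by unit-length bonds whose
# directions are independent and isotropic.
import numpy as np

def random_chain(n, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    steps = rng.normal(size=(n - 1, 3))                     # isotropic directions
    steps /= np.linalg.norm(steps, axis=1, keepdims=True)   # normalize to unit bonds
    return np.vstack([np.zeros(3), np.cumsum(steps, axis=0)])
```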
25. 1-D Haar Wavelet Transform
- Recursive averaging and differencing of the values

Level 3 averages: 9 7 2 6 5 1 4 6
Level 2 averages: 8 4 3 5; detail coefficients: 1 -2 2 -1
Level 1 averages: 6 4; detail coefficients: 2 -1
Level 0 average: 5; detail coefficient: 1

Original signal: 9 7 2 6 5 1 4 6
Transform: 5 1 2 -1 1 -2 2 -1
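A short sketch of this recursive averaging and differencing (the unnormalized Haar transform; a power-of-two signal length is assumed):

```python
# Recursive pairwise averaging and differencing of a signal.
import numpy as np

def haar_transform(signal):
    s = np.asarray(signal, dtype=float)
    out = []
    while len(s) > 1:
        avg = (s[0::2] + s[1::2]) / 2.0        # pairwise averages -> next level
        det = (s[0::2] - s[1::2]) / 2.0        # detail coefficients of this level
        out.append(det)
        s = avg
    # overall average first, then details from the coarsest to the finest level
    return np.concatenate([s] + out[::-1])

print(haar_transform([9, 7, 2, 6, 5, 1, 4, 6]))
# -> [ 5.  1.  2. -1.  1. -2.  2. -1.]
```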
26. Haar Wavelets and Compression
- When detail coefficients are discarded, the approximation error is the square root of the sum of the squares of the discarded coefficients
- Compress by discarding the smallest coefficients
27. Transform of Random Chains
- For random chains, the pdf of the detail coefficients is known; the coefficients are expected to be ordered!
- Discard coefficients starting at the lowest level
28. Random Chains and Proteins