Protein Structure Similarity

Slides: 72

Provided by: latombe

Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

Title: Protein Structure Similarity

1
Protein Structure Similarity
2
Computation of Best Matches

Two simultaneous subproblems
Find maximal correspondence set C
Find alignment transform T
Chicken-and-egg issue
Each subproblem is relatively simple
If we knew C, we could compute T
If we knew T, we could get C by proximity
But the combination is hard!

4
Find Alignment Transform

Two sets of points $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_n\}$
Correspondence pairs $(a_i, b_i)$
Find $T = \arg\min_T \mathrm{RMSD}(A, T(B))$
O(n) closed-form solution [Arun, Huang, and Blostein, 87] [Horn, 87] [Horn, Hilden, and Negahdaripour, 88]
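
The objective being minimized can be illustrated with a few lines of Python. This is only a sketch: the function name `rmsd` and the tuple-based point representation are choices made here, not something taken from the slides.

```python
import math

def rmsd(A, B):
    """Root-mean-square deviation between corresponding points a_i and b_i."""
    if len(A) != len(B) or not A:
        raise ValueError("A and B must be non-empty and of equal length")
    # Sum of squared distances over all correspondence pairs (a_i, b_i).
    total = sum(sum((ak - bk) ** 2 for ak, bk in zip(a, b)) for a, b in zip(A, B))
    return math.sqrt(total / len(A))

A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
B = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
print(rmsd(A, B))  # every pair is 2 apart, so the RMSD is 2.0
```

Finding T then amounts to choosing the rigid transform of B that makes this value as small as possible.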

5
O(n) SVD-Based Algorithm

T combines translation t and rotation R, such that $T(b_i) = t + R(b_i)$
$\bar{b} = \left(\sum_{i=1}^{n} b_i\right)/n$: mean of the $b_i$'s
Place the origin of the coordinate system at $\bar{b}$
$\min_T \mathrm{RMSD}(A, T(B))$ then simplifies (up to some constants) so that t and R can be computed separately
$t = \bar{a}$: mean of the $a_i$'s

Arun, Huang, and Blostein, 87
6
O(n) SVD-Based Algorithm

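$A_{3 \times n} = [a_1 - \bar{a}, \ldots, a_n - \bar{a}]$, $B_{3 \times n} = [b_1 - \bar{b}, \ldots, b_n - \bar{b}]$
Compute the SVD of the 3×3 correlation matrix $BA^T$: $BA^T = UDV^T$, where D is a diagonal matrix with decreasing non-negative entries (singular values) along the diagonal
If $\det(U)\det(V) = 1$ then $S = I$, else $S = \mathrm{diag}(1, 1, -1)$
$R = VSU^T$ (equivalently, $R = USV^T$ when the SVD is taken of $AB^T$ instead)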

Arun, Huang, and Blostein, 87
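
Why t and R decouple, and why the SVD yields R, can be seen from a short derivation. This is a reconstruction under the conventions stated on the slides ($T(b_i) = t + R\,b_i$, origin placed at $\bar{b}$ so that $\sum_i b_i = 0$), not the authors' own derivation.

```latex
\begin{align*}
\sum_{i=1}^{n} \lVert a_i - t - R b_i \rVert^2
  &= \sum_{i=1}^{n} \lVert a_i - R b_i \rVert^2
     - 2\, t^{\top} \sum_{i=1}^{n} (a_i - R b_i) + n \lVert t \rVert^2 \\
  &= \sum_{i=1}^{n} \lVert a_i - R b_i \rVert^2
     - 2n\, t^{\top} \bar{a} + n \lVert t \rVert^2
     \qquad \Bigl(\text{since } \textstyle\sum_i b_i = 0\Bigr).
\end{align*}
% The terms in t do not involve R, and setting the gradient in t to zero gives t = \bar{a}.
% With A and B the centered 3 x n matrices, the remaining term is
\begin{align*}
\sum_{i=1}^{n} \lVert (a_i - \bar{a}) - R b_i \rVert^2
  &= \lVert A \rVert_F^2 + \lVert B \rVert_F^2 - 2\, \mathrm{tr}\!\left(R\, B A^{\top}\right), \\
\mathrm{tr}\!\left(R\, U D V^{\top}\right)
  &= \mathrm{tr}\!\left(V^{\top} R\, U D\right) \le \mathrm{tr}(D)
  \quad\text{for } B A^{\top} = U D V^{\top},
\end{align*}
% with equality when V^T R U = I, i.e. R = V U^T; the matrix S = diag(1, 1, -1)
% is inserted (R = V S U^T) when V U^T would otherwise be a reflection.
```

If the SVD is instead taken of $AB^{\top} = U' D V'^{\top}$, the same argument gives $R = U' S V'^{\top}$, so the two conventions describe the same rotation.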
8

[Arun, Huang, and Blostein, 87] → rotation matrix
[Horn, 87] → quaternion
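
As a rough illustration of the quaternion route, the sketch below fits R and t in pure Python. The function name `horn_fit` is made up here, and the shifted power iteration is a deliberately simple stand-in for a proper eigen-solver (it assumes the starting vector is not orthogonal to the dominant eigenvector); a production implementation would call a linear-algebra library instead.

```python
import math

def horn_fit(A, B, iters=200):
    """Return (R, t) such that a_i is approximately R b_i + t (least squares)."""
    n = len(A)
    ca = [sum(p[k] for p in A) / n for k in range(3)]
    cb = [sum(p[k] for p in B) / n for k in range(3)]
    # S[x][y] = sum_i (b_i - cb)[x] * (a_i - ca)[y]
    S = [[sum((b[x] - cb[x]) * (a[y] - ca[y]) for a, b in zip(A, B))
          for y in range(3)] for x in range(3)]
    (sxx, sxy, sxz), (syx, syy, syz), (szx, szy, szz) = S
    # Symmetric 4x4 matrix whose dominant eigenvector is the unit quaternion.
    N = [[sxx + syy + szz, syz - szy, szx - sxz, sxy - syx],
         [syz - szy, sxx - syy - szz, sxy + syx, szx + sxz],
         [szx - sxz, sxy + syx, -sxx + syy - szz, syz + szy],
         [sxy - syx, szx + sxz, syz + szy, -sxx - syy + szz]]
    # Shift the spectrum so every eigenvalue is positive; power iteration then
    # converges to the eigenvector of the largest eigenvalue of N.
    shift = sum(abs(v) for row in N for v in row) + 1e-12
    q = [1.0, 0.0, 0.0, 0.0]
    for _ in range(iters):
        q = [sum(N[r][c] * q[c] for c in range(4)) + shift * q[r] for r in range(4)]
        norm = math.sqrt(sum(v * v for v in q))
        q = [v / norm for v in q]
    w, x, y, z = q
    R = [[w*w + x*x - y*y - z*z, 2*(x*y - w*z), 2*(x*z + w*y)],
         [2*(y*x + w*z), w*w - x*x + y*y - z*z, 2*(y*z - w*x)],
         [2*(z*x - w*y), 2*(z*y + w*x), w*w - x*x - y*y + z*z]]
    t = [ca[r] - sum(R[r][c] * cb[c] for c in range(3)) for r in range(3)]
    return R, t

# B is rotated by 90 degrees about z and shifted by (1, 2, 3) to give A.
B = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
A = [(-y + 1, x + 2, z + 3) for x, y, z in B]
R, t = horn_fit(A, B)
```

For this example the recovered R should be the 90 degree rotation about z and t should be close to (1, 2, 3).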

10
Trial-and-Error Approach to Protein Structure Comparison

1. Set CS to a seed correspondence set (small set sufficient to generate an alignment transform)
2. Compute the alignment transform T for CS and apply T to the second protein B
3. Update CS to include all pairs of features that are close to each other
4. If CS has changed, then return to Step 2, else return (CS, T)

11
Trial-and-Error Approach to Protein Structure Comparison

- result ← nil
- Iterate N times:
  1. Set CS to a seed correspondence set (small set sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and apply T to the second protein B
  3. Update CS to include all pairs of features that are close to each other
  4. If CS has changed, then return to Step 2, else result ← result ∪ {(CS, T)}
- Return result
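
The loop above can be sketched in Python as follows. To keep the example short, the alignment transform is restricted to a pure translation (`fit_translation`), which is a stand-in for the full rigid fit; the greedy one-to-one matching in `close_pairs`, the 3 Å default threshold, and all function names are likewise assumptions made for illustration.

```python
import math

def fit_translation(pairs):
    """Least-squares translation t with a ~ b + t (stand-in for the rigid fit)."""
    n = len(pairs)
    return tuple(sum(a[k] - b[k] for a, b in pairs) / n for k in range(3))

def close_pairs(A, B, t, eps):
    """Greedy one-to-one matching of points of A and T(B) closer than eps."""
    cands = []
    for i, a in enumerate(A):
        for j, b in enumerate(B):
            d = math.dist(a, tuple(b[k] + t[k] for k in range(3)))
            if d < eps:
                cands.append((d, i, j))
    used_a, used_b, cs = set(), set(), set()
    for d, i, j in sorted(cands):  # closest candidates are matched first
        if i not in used_a and j not in used_b:
            cs.add((i, j))
            used_a.add(i)
            used_b.add(j)
    return cs

def refine(A, B, seed, eps=3.0, max_iter=50):
    """Steps 1-4: alternate between fitting T and updating CS until CS is stable."""
    cs = set(seed)
    t = fit_translation([(A[i], B[j]) for i, j in cs])
    for _ in range(max_iter):
        new_cs = close_pairs(A, B, t, eps)
        if not new_cs or new_cs == cs:
            break
        cs = new_cs
        t = fit_translation([(A[i], B[j]) for i, j in cs])
    return cs, t

A = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0), (8.0, 0.0, 0.0), (12.0, 0.0, 0.0)]
B = [(x + 5.0, y + 1.0, z) for x, y, z in A] + [(100.0, 100.0, 100.0)]
seeds = [{(0, 0)}, {(2, 2)}]
results = [refine(A, B, seed) for seed in seeds]  # the outer "iterate N times" loop
cs, t = results[0]
```

Starting from a single seed pair, the correspondence set grows to the four true pairs while the outlier in B stays unmatched.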

How to get seed correspondences?

13
Seed Generation from Fragment

From distance matrices
E.g., DALI [Holm and Sander, 1996]

14
Using Distance Matrices (DALI)

Distances are invariant to rigid-body transformations
DALI [Holm and Sander, 1996] looks for similar hexapeptides by searching for similar 7x7 Cα-Cα distance matrices
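
The invariance argument can be illustrated with a small Python sketch that compares the internal distance matrices of all fragments of a given length. This is not DALI's actual algorithm; the function names, the brute-force scan, and the tolerance-based comparison are assumptions chosen to keep the example short.

```python
import math

def distance_matrix(points):
    """All pairwise distances between the points of one fragment."""
    return [[math.dist(p, q) for q in points] for p in points]

def matrix_difference(D1, D2):
    """Largest absolute entry-wise difference between two equal-size matrices."""
    return max(abs(x - y) for r1, r2 in zip(D1, D2) for x, y in zip(r1, r2))

def similar_fragments(ca_a, ca_b, length, tol):
    """Index pairs (i, j) of fragments whose distance matrices agree within tol."""
    hits = []
    for i in range(len(ca_a) - length + 1):
        Da = distance_matrix(ca_a[i:i + length])
        for j in range(len(ca_b) - length + 1):
            Db = distance_matrix(ca_b[j:j + length])
            if matrix_difference(Da, Db) <= tol:
                hits.append((i, j))
    return hits

chain = [(0, 0, 0), (1, 0, 0), (1, 2, 0), (1, 2, 3), (5, 2, 3)]
# Same chain after a 90 degree rotation about z and a translation.
moved = [(-y + 10, x - 4, z + 7) for x, y, z in chain]
hits = similar_fragments(chain, moved, length=3, tol=1e-9)
```

Because distances do not change under the rigid motion, each fragment of `chain` matches the fragment at the same index in `moved`, without any alignment transform being computed.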

15
Seed Generation from Fragment

From distance matrices
E.g., DALI [Holm and Sander, 1996]
From secondary structure elements (SSEs)
E.g., LOCK [Singh and Brutlag, 1997]
From voting scheme (using geometric hashing)
E.g., 3dSEARCH [Singh and Brutlag, 2000]

16
LOCK

A.P. Singh and D.L. Brutlag. Hierarchical Protein Structure Superposition Using Both Secondary and Atomic Representations. Proc. ISMB, pp. 284-293, 1997.
LOCK2: J. Shapiro and D.L. Brutlag. FoldMiner: Structural Motif Discovery Using an Improved Superposition Algorithm. Protein Science, 13:278-294, 2004.
http://motif.stanford.edu/lock2/

17
LOCK

Two levels of features: SSEs and Cα atoms
Stage 1 (SSE alignment): Initial alignment is computed using SSEs represented as vectors
Stage 2 (atom alignment): Alignment is refined using Cα atoms represented as points

18
Rationale for LOCK

Using types of features is an effective way to
reduce combinatorial explosion and computation
SSEs, which are responsible for most of the
stability and functionality of the proteins, are
more meaningful and better conserved than types
of atoms and amino-acids
If 2 structures are similar, some of their SSEs
should form similar substructures
Drawback: It narrows down the set of possible applications, e.g., it can't find small motifs at the atomic level

19
Vector-Based Representation
[Figure labels: β-strands, loops, α-helices]
One vector per SSE (helix, strand, loop)
20
Vector-Based Representation

DSSP [Kabsch and Sander, 1983] classifies residues into helices/strands
For an α-helix starting at residue i: $X_{\mathrm{origin}} = (0.74 X_i + X_{i+1} + X_{i+2} + 0.74 X_{i+3})/3.48$, where $X_i$ is the position of the Cα atom of residue i (the angle between two consecutive residues is 100° → factor 0.74)
Similar computation for $X_{\mathrm{end}}$ and for β-strands
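
The weighted average for the helix origin translates directly into code. In the sketch below, the function names are hypothetical, and the assumption that $X_{\mathrm{end}}$ uses the same weights on the last four residues is an interpretation of "similar computation", not something the slide states. The test helix (radius 2.3 Å, rise 1.5 Å, 100° per residue) is an idealized geometry chosen for the example.

```python
import math

def helix_axis_point(ca, i):
    """Weighted average of four consecutive C-alpha positions starting at index i."""
    weights = (0.74, 1.0, 1.0, 0.74)  # the weights sum to 3.48
    return tuple(sum(w * ca[i + k][c] for k, w in enumerate(weights)) / 3.48
                 for c in range(3))

def helix_vector(ca, start, end):
    """Axis endpoints for a helix spanning residues start..end (inclusive)."""
    if end - start + 1 < 4:
        raise ValueError("need at least four residues")
    origin = helix_axis_point(ca, start)
    # Assumption: X_end applies the same weights to the last four residues.
    tip = helix_axis_point(ca, end - 3)
    return origin, tip

# Idealized helix along the z axis: 100 degrees and 1.5 angstroms per residue.
ca = [(2.3 * math.cos(math.radians(100 * k)),
       2.3 * math.sin(math.radians(100 * k)),
       1.5 * k) for k in range(8)]
origin, tip = helix_vector(ca, 0, 7)
```

With 100° per residue the lateral components nearly cancel, since 0.74 is approximately cos 50° / cos 30°, so both endpoints land almost exactly on the helix axis.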

21
Scoring Similarity

Position-independent differences
|angle(i,k) - angle(p,r)|
|angle(i,j) - angle(p,q)|
|angle(j,k) - angle(q,r)|
|distance(i,k) - distance(p,r)|
|length(k) - length(r)|
Position-dependent differences
angle(k,r)
distance(k,r)
Scores are additive

Score $S = \sum_i S(d_i)$
[Figure: score $S(d_i)$ as a function of the difference $d_i$, with labels "Maximal score" and "Value of $d_i$ for which score is 0"]
22
Stage 1 SSE Alignment

For every pair of SSE vectors of protein A, find all pairs of vectors in B that align well using orientation-independent scores → seed correspondence sets
For each correspondence set:
  Find alignment transform and apply it to B
  Find correspondence set with maximal score
(record the transform T and correspondence set CS that yield the maximal score)

23
Stage 1 SSE Alignment

A = (i, j, k, l, m)
B = (p, q, r, s, t)
Seed correspondence: (i,p), (j,q)

Simultaneous gaps in both structures are not
allowed (not in SCOP2)
Terminate a path when score of new
correspondence is negative
Re-compute new transform with each new
correspondence (?)

24
Stage 2 Atom (Core) Alignment

1. Construct correspondence pairs of atoms. Atom i of A corresponds to atom j of T(B) iff:
   - i is the closest atom in A to j and j is the closest atom in T(B) to i
   - the distance between i and T(j) is < ε (3 Å)
2. Prune the correspondence set to the largest subset of correspondence pairs that follow the backbone alignment constraint
3. Re-compute T to be the transform that minimizes the RMSD of the atoms in the correspondence set
Iterate 1-2-3 until RMSD converges
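
Steps 1 and 2 can be sketched in Python as below. The brute-force nearest-neighbor search and the function names are choices made for brevity. The pruning step assumes that the backbone alignment constraint means matched indices must increase together along both chains, which is an interpretation rather than something the slide spells out.

```python
import math

def mutual_closest_pairs(A, TB, eps=3.0):
    """Pairs (i, j): a_i and T(b_j) are each other's nearest neighbor, closer than eps."""
    nearest_in_B = [min(range(len(TB)), key=lambda j: math.dist(a, TB[j])) for a in A]
    nearest_in_A = [min(range(len(A)), key=lambda i: math.dist(A[i], b)) for b in TB]
    return [(i, j) for i, j in enumerate(nearest_in_B)
            if nearest_in_A[j] == i and math.dist(A[i], TB[j]) < eps]

def prune_to_backbone_order(pairs):
    """Largest subset of pairs in which j increases strictly with i."""
    pairs = sorted(pairs)
    if not pairs:
        return []
    best = [1] * len(pairs)    # best[k]: longest ordered chain ending at pairs[k]
    prev = [-1] * len(pairs)   # back-pointer used to rebuild that chain
    for k in range(len(pairs)):
        for m in range(k):
            if pairs[m][1] < pairs[k][1] and best[m] + 1 > best[k]:
                best[k], prev[k] = best[m] + 1, m
    k = max(range(len(pairs)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(pairs[k])
        k = prev[k]
    return chain[::-1]

A = [(0, 0, 0), (4, 0, 0), (8, 0, 0), (12, 0, 0)]
TB = [(0.5, 0, 0), (8.2, 0, 0), (4.1, 0, 0), (30, 0, 0)]  # B after applying T
pairs = mutual_closest_pairs(A, TB)
pruned = prune_to_backbone_order(pairs)
```

Here atoms 1 and 2 of A match atoms 2 and 1 of T(B), which crosses the backbone order, so only one of those two pairs survives pruning.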

25
Experimental Results

685 protein structures from PDB such that each pair has less than 25% sequence identity
3 families of folds (based on SCOP classification):
- myoglobins (11 structures), 20% amino acid identity
- TIM barrels (50 structures)
- immunoglobulins (38 structures)
Goal: Given one query protein in each family, find the other members of the family (3 × 685 = 2055 alignments)
Method: For each query, sort the 685 structures by score (computed by LOCK). Select the top k proteins. Count members of the family (true positives) and non-members (false positives)

26
Family (size)          True positives   False positives
Myoglobins (11)        11               0
TIM-barrels (50)       40               0
                       45               1
                       50               5
Immunoglobulins (38)   20               0
                       25               1
                       30               2
                       35               11
                       38               383
27
Alignment of 11 Myoglobins
28
Alignment of 50 TIM barrels
α-helices in red, β-strands in yellow
29
Alignments of 31 Immunoglobulins
Only b-strands are shown
30
ROC Curves
31
Running Time

1 ms per seed correspondence
1 h to search 10,000 protein structures
100s of days to compare all pairs of proteins in PDB
→ Geometric hashing to speed up stage 1
33
Voting Scheme with Hash Table

Many-to-many comparison requires a better organization of computation to avoid repeating the same computation again and again
Pre-computation: Index proteins in hash table
Query phase: Voting scheme using hash table
Several variants on this theme:
3d-Lookup [Holm and Sander, 1995]
3dSEARCH [Singh 2002]

35
Indexing Target Structures in Hash Table
(3dSEARCH Singh 2002)

Hash table: 3-D regular grid of cubic bins (2 Å)
For each target structure
  For each pair of vectors (i,j)
    Compute a coordinate system
    Place an entry for each other vector k into the bin containing the coordinates of the midpoint of the vector (or average of coordinates of origin, middle, and end points). Store the ID of the coordinate system, k's orientation, and type (α or β) in the entry.

36
[Figure: coordinate system with axes u and v]
Grid is the same for all coordinate systems
37
[Figure: two coordinate systems, each with axes u and v]
Grid is the same for all coordinate systems
38
Indexing Target Structures in Hash Table
(3dSEARCH Singh 2002)

Grid is sparsely occupied → hash table
A structure with n SSEs contributes n(n-1)(n-2) entries. Each vector is represented (n-1)(n-2) times
10,000 structures with 10 SSEs each yield 7M entries

39
Voting Using Hash Table

Given a query structure
  For each pair of vectors (i,j)
    Compute a coordinate system
    For each other vector k
      Retrieve the bin accessed by this vector and the neighboring bins
      For every entry (vector) in those bins that has the same orientation and type as k, add a vote for the coordinate system stored in the entry
Sort target structures based on the max number of votes received by any of their coordinate systems
→ Small number of target structures. Use LOCK for better alignment
Hours of pure LOCK are reduced to seconds
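
The indexing and voting phases can be sketched together in Python. Several details are assumptions made here because the slides do not specify them: the way a coordinate system is built from a pair of vectors (`frame`), keeping a separate vote count for each (query frame, target frame) combination, and storing only the SSE type (not the orientation of k) in each entry. All function names are hypothetical.

```python
import math
from collections import defaultdict
from itertools import permutations, product

BIN = 2.0  # bin edge in angstroms, as on the slide

def sub(p, q): return tuple(a - b for a, b in zip(p, q))
def dot(p, q): return sum(a * b for a, b in zip(p, q))
def cross(p, q):
    return (p[1]*q[2] - p[2]*q[1], p[2]*q[0] - p[0]*q[2], p[0]*q[1] - p[1]*q[0])
def mid(v): return tuple((a + b) / 2 for a, b in zip(*v))

def frame(vi, vj):
    """Assumed frame: origin at mid(vi), x along vi, z normal to vi and mid(vj)."""
    o, x, d = mid(vi), sub(vi[1], vi[0]), sub(mid(vj), mid(vi))
    z = cross(x, d)
    nx, nz = math.sqrt(dot(x, x)), math.sqrt(dot(z, z))
    if nx < 1e-9 or nz < 1e-9:
        return None  # degenerate pair: no well-defined frame
    x, z = tuple(c / nx for c in x), tuple(c / nz for c in z)
    return o, (x, cross(z, x), z)

def entries(sse):
    """Yield ((i, j), bin, kind) for each ordered pair (i, j) and other vector k."""
    for i, j in permutations(range(len(sse)), 2):
        fr = frame(sse[i][0], sse[j][0])
        if fr is None:
            continue
        o, axes = fr
        for k, (vec, kind) in enumerate(sse):
            if k != i and k != j:
                p = sub(mid(vec), o)
                yield (i, j), tuple(math.floor(dot(p, ax) / BIN) for ax in axes), kind

def build_index(targets):
    """Pre-computation: hash every entry of every target structure."""
    table = defaultdict(list)
    for name, sse in targets.items():
        for ij, b, kind in entries(sse):
            table[b, kind].append((name, ij))
    return table

def vote(table, query):
    """Query phase: rank targets by the best vote count of any of their frames."""
    votes = defaultdict(int)
    for q_ij, b, kind in entries(query):
        for off in product((-1, 0, 1), repeat=3):  # the bin and its neighbors
            nb = tuple(c + o for c, o in zip(b, off))
            for name, t_ij in table.get((nb, kind), ()):
                votes[q_ij, name, t_ij] += 1
    best = defaultdict(int)
    for (_, name, _), v in votes.items():
        best[name] = max(best[name], v)
    return sorted(best.items(), key=lambda kv: -kv[1])

def move(p):  # rigid motion: 90 degree rotation about z, then a translation
    return (-p[1] + 20.0, p[0] - 3.0, p[2] + 1.0)

t1 = [(((0, 0, 0), (6, 0, 0)), "a"), (((0, 5, 0), (0, 5, 6)), "b"),
      (((4, 8, 2), (9, 8, 2)), "a"), (((7, 1, 6), (7, 6, 6)), "b")]
t2 = [(((0, 0, 0), (6, 0, 0)), "a"), (((200, 300, 0), (200, 300, 6)), "b"),
      (((-400, 100, 250), (-395, 100, 250)), "a"),
      (((100, -500, -300), (100, -495, -300)), "b")]
query = [((move(s), move(e)), kind) for (s, e), kind in t1]
table = build_index({"t1": t1, "t2": t2})
ranking = vote(table, query)
```

In this toy setup the query is a rigidly moved copy of `t1`, so `t1` collects votes while the widely spread `t2` collects none; each 4-SSE structure contributes 4 × 3 × 2 = 24 entries, in line with the n(n-1)(n-2) count.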

40
Advantages of Voting System

Very efficient in practice for many-to-many
comparisons
Can establish correspondence between partial,
disconnected substructures
Parallel implementation is straightforward
Independent of the order in which vectors are
considered
Drawback (?): May establish correspondences that do not satisfy the backbone sequence constraint

41
Problem 4: Find Pharmacophore in Ligands

Given
Collection of N (5 to 10) small flexible ligands with similar activity (binding at same sites)

Benzamidine binding to beta-Trypsin (3ptb)
Inhibitor binding to HIV protease
44
Problem 4: Find Pharmacophore in Ligands

Given
Collection of N (5 to 10) small flexible ligands with similar activity (binding at same sites)
A set of low-energy conformations (dozens to a few hundred) for each ligand
Find a substructure (pharmacophore) that has a match in at least one conformation of each ligand

49
Pharmacophore and Rational Drug Design

Pharmacophore identification is a form of
reverse engineering to get a model of a binding
site
A pharmacophore can be used to modify ligands
into more potent drugs and/or to screen large
databases of ligands for leads

50
Three Simultaneous Problems

Conformations?
Correspondence?
Transform?
But ligands are small molecules

51
Software

DISCO [Martin et al., 1993]
DISCOtech and GASP [Tripos, Inc.]
CATALYST and HIPHOP [Accelrys] [Green et al., 1994] [Barnum et al., 1996]
RAPID: P.W. Finn, L.E. Kavraki, J.C. Latombe, R. Motwani, C. Shelton, S. Venkatasubramanian, and A. Yao. RAPID: Randomized Pharmacophore Identification for Drug Design. Computational Geometry: Theory and Applications, 10, pp. 263-272, 1998

53
Pairwise Comparison

Multi-Probe($M_1, \ldots, M_N$)
1. Extract invariants from $M_1$ and $M_2$ by calling Pair-Probe(P1,P2) on every pair of conformations of the two ligands
2. Test each candidate invariant S obtained at Step 1 against every ligand $M_i$, $i = 3, \ldots, N$, by calling Pair-Probe(S,P) on S and each conformation P of $M_i$

54
Pair-Probe

n: smallest number of atoms/features in a ligand
α: a given constant (0 < α ≤ 1)
P1 and P2: conformations of two distinct ligands (or candidate invariant)
Pair-Probe(P1,P2)
Perform s times:
  Pick a triplet of atoms at random from P1
  Determine three atoms in P2 congruent to this triplet; compute the alignment transform T
  Iterate: apply T to P2; determine the atoms in P1 matching those in P2; update T
  If the number of matching atoms exceeds αn, then return this atom set as a candidate invariant S

55
Magnitude of s

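$\Pr[\text{picking 3 atoms in invariant}] \geq \alpha^3$
$\Pr[\text{failing to find invariant}] \leq (1 - \alpha^3)^s$
We want $(1 - \alpha^3)^s \leq \gamma$ (γ is the acceptable probability of failure)
$s \geq \ln(\gamma)/\ln(1 - \alpha^3)$
Since $x < -\ln(1-x)$ for $0 < x < 1$, it suffices to take $s \geq \ln(1/\gamma)/\alpha^3$
For $\gamma = 10^{-2}$ and $\alpha = 0.3$, we get $s \approx 171$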
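
The two expressions for s are easy to evaluate numerically. The helper names below are made up for this sketch; both functions round up to the next integer number of trials.

```python
import math

def trials_exact(alpha, gamma):
    """Smallest integer s with (1 - alpha^3)^s <= gamma."""
    return math.ceil(math.log(gamma) / math.log(1.0 - alpha ** 3))

def trials_bound(alpha, gamma):
    """Simpler sufficient value: s >= ln(1/gamma) / alpha^3."""
    return math.ceil(math.log(1.0 / gamma) / alpha ** 3)

exact = trials_exact(0.3, 1e-2)
bound = trials_bound(0.3, 1e-2)
print(exact, bound)
```

The simpler bound is slightly more conservative than the exact value, and the gap shrinks as α gets smaller.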

56
Some Results

63 to 69 atoms with 10 to 15 torsional degrees of freedom
Feature: every non-H atom → 30 features of 6 types (atom types)
Invariant in active conformations: 7-atom pharmacophore, 7-atom scaffolding

conf   t(s)   4    5    6    7   8   9   10   11   12   13   14
11     800    44   20   10   5   2   1   0    0    1    0    0
57
Fuel for Thoughts
58
Idea: Many-to-many correspondence may be more robust
Example: Hausdorff distance
59
Hausdorff Distance

Two sets of points $A = \{a_1, \ldots, a_n\}$ and $B = \{b_1, \ldots, b_m\}$ in $\mathbb{R}^k$
$d_H(A,B) = \max_{a \in A} \min_{b \in B} \lVert a - b \rVert$
$D_H(A,B) = \max\{d_H(A,B),\, d_H(B,A)\}$
Variation for shape similarity: $\Delta_H(A,B) = \min_T D_H(A, T(B))$
But efficient algorithms only exist for planar sets of points
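
The two definitions map directly onto a brute-force Python sketch (quadratic in the number of points). The function names are chosen here, and the minimization over T in the shape-similarity variation is not attempted.

```python
import math

def directed_hausdorff(A, B):
    """d_H(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(math.dist(a, b) for b in B) for a in A)

def hausdorff(A, B):
    """D_H(A, B): the larger of the two directed distances."""
    return max(directed_hausdorff(A, B), directed_hausdorff(B, A))

A = [(0, 0), (1, 0)]
B = [(0, 0), (1, 0), (5, 0)]
print(directed_hausdorff(A, B), directed_hausdorff(B, A), hausdorff(A, B))
```

The example shows why the directed distance is not symmetric: every point of A coincides with a point of B, but the point (5, 0) of B is 4 away from A.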

60
Other Idea: Minimize cost of transforming A into B

Old idea
Graphics: Morphing distance
Computer vision: Earth Mover's distance [Rubner, Tomasi, and Guibas, 1998]
Protein similarity: Isotopic distance [Erdmann, 2004]

61
Structure Alignment Isotopies

Two curves are isotopic if one can be deformed
into the other without self-collision
Example: Polygonal curve with n vertices
One may think of structure alignment as an
isotopy deforming one structure into the other
Two structures are similar if the isotopy is
small

M.A. Erdmann. Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings, CMU Tech. Rep. CMU-CS-04-138.
62
Small Isotopy

Model a structure as a set of polygonal lines (e.g., vertices are Cα atoms)
Two structures A and B are (T,δ)-isotopic if there exists an isotopy deforming A into T(B) in such a way that no vertex of A moves farther than some δ from its initial or final location

Erdmann 2004
63
Similarity Measure

$\delta_T(A,B) = \inf\,\{\delta : A \text{ is } (T,\delta)\text{-isotopic to } B\}$
$\delta(A,B) = \inf_T \delta_T(A,B)$
δ is computable [Erdmann, 2004]
But as complex as path planning, hence exponential in the number of degrees of freedom
Possibility of approximating δ using probabilistic roadmaps?

64
Topology of Line Weavings
1xis, 1nar
α-helix axes
M.A. Erdmann. Protein Similarity from Knot Theory: Geometric Convolution and Line Weavings, CMU Tech. Rep. CMU-CS-04-138.
66
→ 2 topologically equivalent line weavings
3 equivalence classes for 4 lines
Erdmann 2004
68
Another (incorrect) alignment of 1xis and 1nar
69
→ 2 non-equivalent line weavings
70
Why is topology interesting?

Two conformations may be geometrically close (small RMSD), yet may require a long continuous deformation to map one into the other (without steric clashes)

71
Conclusion

Automatic computation of structure similarity is
essential due to the rapid growth of the PDB and
other molecule (e.g., ligand) libraries
As the growth of new protein structures outpaces
that of new folds, detecting structural
similarity will have to be much more fine-grained
than it is today
Biological discoveries will likely lie in local,
possibly rare structure similarities, rather than
in global fold-level classification
Need for better understanding of applications
and radically new approaches
Still a lot of work ...
