Protein Structure Similarity - PowerPoint PPT Presentation

About This Presentation

Title:

Protein Structure Similarity

Description:

... SCOP: http://scop.berkeley.edu/ CATH http://www.biochem.ucl.ac.uk/bsm/cath/ Protein alignment: DALI: http://www.ebi.ac.uk/dali/ LOCK: ... – PowerPoint PPT presentation

Slides: 60

Provided by: latombe

Learn more at: http://web.stanford.edu

Transcript and Presenter's Notes

Title: Protein Structure Similarity

1
Protein Structure Similarity
2
Secondary Structure Elements: α-helices,
β-strands/sheets, loops
3
Structure Prediction/Determination

Computational tools
Homology, threading
Molecular dynamics
Experimental tools

X-ray crystallography
4
Protein Structure Determination (1)

X-ray diffraction crystallography

5
Protein Structure Determination (2)

Nuclear magnetic resonance spectroscopy

6
Protein Data Bank
1990 → 250 new structures
1999 → 2500 new structures
2000 → >20,000 structures total
2004 → 30,000 structures total
7
Protein Data Bank
Only about 10% of structures have been
determined for known protein sequences
→ Protein Structure Initiative (PSI)
1990 → 250 new structures
1999 → 2500 new structures
2000 → >20,000 structures total
2004 → 30,000 structures total
8
Structure Similarity

Refers to how well (or poorly) 3D folded
structures of proteins can be aligned
Expected to reflect functional similarities
(interaction with other molecules)

Proteins in the TIM barrel fold family
9
Alignment of 1xis and 1nar (TIM-Barrels)
Sayle, R. RasMol: A protein visualization
tool. http://www.umass.edu/microbio/rasmol/index2.htm
ribbon format
1xis 1nar
backbone format
Alignment computed by DALI
α-helix axes
10
Structure Similarity

Refers to how well (or poorly) 3D folded
structures of proteins can be aligned
Is expected to reflect functional similarities
(interaction with other molecules)
2000: 20,000 structures in PDB
4,000 different folds (1:5 ratio)

11
(No Transcript)
12
(No Transcript)
13
Structure Similarity

Refers to how well (or poorly) 3D folded
structures of proteins can be aligned
Is expected to reflect functional similarities
(interaction with other molecules)
2000: 20,000 structures in PDB
4,000 different folds (1:5 ratio)
Three possible reasons:
- evolution
- physical constraints (e.g., few ways to maximize
hydrophobic interactions)
- limits in techniques used for structure determination
Given a new structure, the probability is high
that it is similar to an existing one

14
Why Compare Protein Folded Structures?
Sequence
Structure
Function

Low sequence similarity may yield very similar
structures
Sometimes high sequence similarity yields
different structures

15
Alignment of 1xis and 1nar (TIM-Barrels)
1xis and 1nar have only 7% sequence identity, but
approximately 70% of the residues are
structurally similar
16
Why Compare Protein Folded Structures?
Sequence
Structure
Function

Low sequence similarity may yield very similar
structures
Sometimes high sequence similarity yields
different structures
Structure comparison is expected to provide more
pertinent information about functional
(dis-)similarity among proteins, especially with
non-evolutionary relationships or non-detectable
evolutionary relationships

17
Ill-Posed Problem? Multiple Terminology

(Dis-)similarity analysis
Structure comparison
Alignment, superposition, matching
Classification
Applications
Definitions and issues
Methods

18
A Few Web Sites

Protein Data Bank (PDB): http://www.rcsb.org/pdb/
Protein classification:
- SCOP: http://scop.berkeley.edu/
- CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
Protein alignment:
- DALI: http://www.ebi.ac.uk/dali/
- LOCK: http://motif.stanford.edu/lock2/

19
Application 1: Find Global Similarities Among
Protein Structures

Given two protein structures, find the largest
similar substructures
For example, a substructure is a subset of Cα
atoms or a subset of secondary structure elements
in each molecule
Several possible similarity measures
Variants: 1-to-1, 1-to-many, many-to-many (PDB)
Must be automatic (and fast)

20
Application 2: Classify Proteins

Many proteins, but relatively few distinct fold
families [Chothia, 1992; Holm and Sander, 1996;
Brenner et al., 1997]
Hierarchical classification
Insight into functions and structure
stabilization
Basis for homology and threading
Manual classification → SCOP [Murzin et al.,
1995]

21
Application 2: Classify Proteins

Many proteins, but relatively few distinct fold
families [Chothia, 1992; Holm and Sander, 1996;
Brenner et al., 1997]
Hierarchical classification
Insight into functions and structure
stabilization
Basis for homology and threading
Manual classification → SCOP [Murzin et al.,
1995]
Increasing size of PDB → Automatic classifiers:
CATH [Orengo et al., 1997], Pclass [Singh et
al.], FSSP [Holm and Sander]

Class: Similar secondary structure content
Fold: SSEs in similar arrangement
Family: Clear evolutionary relationship
22
Manual vs. Automatic Classification
23
Application 3: Find Motif in Protein Structure

Given a protein structure and a motif (e.g., a
small collection of atoms corresponding to a
binding site)
Find whether the motif matches a substructure of
the protein
Variant: One motif against many proteins

Active sites of 1PIP and 5PAD. Only 3 amino-acids
participate in the motif
24
Application 4: Find Pharmacophore

Given:
- Small collection (5-10) of small flexible ligands
with similar activity (hence, assumed to bind at
the same protein site)
- Low-energy conformations (several dozen to a few
hundred) for each ligand
Find a substructure (pharmacophore) that occurs in
at least one conformation of each ligand
Key problem in drug design when the binding site is
unknown

25
Application 4: Find Pharmacophore
Inhibitors of thermolysin
26
Application 5: Search for Ligands Containing a
Pharmacophore

Given:
- Database containing several hundred thousand, or
more, small ligands
- A pharmacophore P
Find all ligands that have a low-energy
conformation containing P
Data mining of pharmaceutical databases (lead
generation)

S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C.
Latombe. A Randomized Kinematics-Based Approach
to Pharmacophore-Constrained Conformational
Search and Database Screening. J. of
Computational Chemistry, 21(9):731-747, July 2000
27

Applications
Definitions and issues
Methods

28
3D Molecular Structure

Collection of (possibly typed) atoms or groups of
atoms in some given 3D relative placement
The placement of a group of atoms is defined by
the position of a reference point (e.g., the
center of an atom) and the orientation of a
reference direction
The type can be the atom ID, the amino-acid ID,
etc

29
Matching of Structures

Two structures A and B match iff:
- Correspondence: There is a one-to-one map
between their elements
- Alignment: There exists a rigid-body transform T
such that the RMSD between the elements in A and
those in T(B) is less than some threshold $\varepsilon$.
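
The alignment condition above can be sketched in a few lines. This is only an illustration: the names `rmsd` and `is_match`, and the representation of each structure as a list of (x, y, z) tuples already paired by index, are assumptions made here, and B is assumed to have been transformed by T beforehand.

```python
import math

def rmsd(points_a, points_b):
    """Root-mean-square deviation between two point lists paired by index."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = 0.0
    for p, q in zip(points_a, points_b):
        # Squared Euclidean distance between corresponding points.
        total += sum((pc - qc) ** 2 for pc, qc in zip(p, q))
    return math.sqrt(total / len(points_a))

def is_match(points_a, transformed_b, eps):
    """True if the RMSD between A and T(B) is below the threshold eps."""
    return rmsd(points_a, transformed_b) < eps
```

For instance, `is_match([(0, 0, 0)], [(3, 4, 0)], eps=1.0)` is false because the single pair is 5 units apart.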

30
Complete Match
31
Alignment of 3adk and 1gky

Both matching and non-matching secondary
structure elements

32
Partial Match

Notion of support s of the match: the match is
between s(A) and s(B)
→ Dual problem:
- What is the support?
- What is the transform?
Often several (many) possible supports
Small supports → motifs

33
Mathematical Relative
[Figure: functions f and g compared over a support s]
$\|f - g\|^2$
Over which support?
35
Multiple Partial Matches
36
Distributed Support
37
What is Best?
Should gaps be penalized?
38
What About This?
Sequence along backbone is not preserved
39
Similarity measure is unlikely to satisfy
triangular inequality for partial match
40
Scoring Issues

Trade-off between size of s and RMSD
How should gaps be counted?
Is there a quality of the correspondence?
The correspondence may, or may not, satisfy
type and/or backbone sequence preferences
Should accessible surface be given more
importance?
→ Similarity measure may be different from the
inverse of RMSD (though no consensus on best
measure!)
But RMSD is computationally very convenient!

41
Examples
RMSD dissimilarity measure → emphasizes
differences → smaller support
STRUCTAL's similarity measure → emphasizes
similarities → larger support
42
Comparison of Similarity Measures

A.C.M. May. Toward more meaningful hierarchical
classification of amino acids scoring functions.
Protein Engineering, 12:707-712, 1999, reviews 37
protein structure similarity measures
The difficulty of defining a similarity score is
probably due to the facts that structure
comparison is an ill-posed problem and has
multiple solutions

43
Bottom Line

Finding an optimal partial match is NP-hard
No fast algorithm is guaranteed to give an
optimal answer for any given measure [Godzik,
1996]
→ Heuristic/approximate algorithms
→ Probably not a single solution, but
application-dependent solutions
→ But there exist general algorithmic principles

44
Computational Questions

Given a (dis)similarity measure and two
proteins, compute the best match
Which support?
Which correspondence?
Which alignment transform?

Applications
Definitions and issues
Methods

46
Find Global Similarities Among Protein Structures

Input: Two sets of features (atoms or groups of
atoms) $a_1,\dots,a_n$ and $b_1,\dots,b_m$ belonging to two
different proteins A and B
Output:
- Maximal correspondence set C of pairs
$(a_i,b_j)$, where all $a_i$ and all $b_j$ are distinct
- Alignment transform T such that the RMSD of the
pairs $(a_i,T(b_j))$ is less than a given $\varepsilon$
Several possible outputs

Variant of the Largest Common Point Set
problem [Akutsu and Halldorsson, 1994]
47
Possible Correspondence Constraints

Typed features: $(a_i,b_j)$ is a possible
correspondence pair iff $\mathrm{Type}(a_i) = \mathrm{Type}(b_j)$
Ordered features: $(a_i,b_j)$ and $(a_{i'},b_{j'})$, where
$i' > i$, are possible correspondence pairs iff
$j' > j$. E.g., sequence along backbone
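
A small sketch of how the two constraints could be checked on a candidate correspondence set. The function name `satisfies_constraints`, the index-pair representation, and the use of residue names as types are illustrative assumptions, not taken from any of the tools cited in this presentation.

```python
def satisfies_constraints(pairs, types_a, types_b, typed=True, ordered=True):
    """Check a correspondence set given as (i, j) index pairs.

    pairs   : list of (i, j), meaning feature a_i is matched to b_j
    types_a : type label of each feature of A (e.g., amino-acid name)
    types_b : type label of each feature of B
    """
    indices_a = [i for i, _ in pairs]
    indices_b = [j for _, j in pairs]
    # The correspondence must be one-to-one.
    if len(set(indices_a)) != len(indices_a):
        return False
    if len(set(indices_b)) != len(indices_b):
        return False
    # Typed features: matched features must carry the same type.
    if typed and any(types_a[i] != types_b[j] for i, j in pairs):
        return False
    # Ordered features: i' > i must imply j' > j.
    if ordered:
        by_a = sorted(pairs)
        for (_, j_prev), (_, j_next) in zip(by_a, by_a[1:]):
            if j_next <= j_prev:
                return False
    return True
```

With `ordered=False` the same function accepts matches in which the sequence along the backbone is not preserved.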

48
Some Existing Software

Cα atoms:
- DALI [Holm and Sander, 1993]
- STRUCTAL [Gerstein and Levitt, 1996]
- MINAREA [Falicov and Cohen, 1996]
- CE [Shindyalov and Bourne, 1998]
- ProtDex [Aung, Fu and Tan, 2003]
Secondary structure elements and Cα atoms:
- VAST [Gibrat et al., 1996]
- LOCK [Singh and Brutlag, 1996]
- 3dSEARCH [Singh and Brutlag, 1999]

49
RMSD → Similarity
But matches and RMSDs are not exactly what we
need
In general, we need to compute a similarity
measure of the form $\max_T S(A,T(B))$,
where $S$ is more complex than RMSD
Two-step approach:
1. Compute best matches using RMSD
2. Adjust transform to maximize similarity measure
50
Computation of Best Matches

Two simultaneous subproblems:
- Find maximal correspondence set C
- Find alignment transform T
Chicken-and-egg issue:
- Each subproblem is relatively simple
- If we knew C, we could compute T
- If we knew T, we could get C by proximity
But the combination is hard!
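
The "get C by proximity" half of the problem can be illustrated as follows. This is a greedy sketch under stated assumptions: T has already been applied to B, points are (x, y, z) tuples, and the nearest-first pairing rule and the `cutoff` parameter are choices made for this example rather than a prescribed method. A greedy pairing is not guaranteed to give the largest possible correspondence set.

```python
import math

def correspondence_by_proximity(points_a, transformed_b, cutoff):
    """Pair each a_i with a nearby point of T(B), one-to-one, nearest pairs first."""
    candidates = []
    for i, p in enumerate(points_a):
        for j, q in enumerate(transformed_b):
            d = math.dist(p, q)  # Euclidean distance (Python 3.8+)
            if d <= cutoff:
                candidates.append((d, i, j))
    candidates.sort()  # closest candidate pairs are considered first
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```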

52
Find Alignment Transform

Two sets of points $A = \{a_1,\dots,a_n\}$ and
$B = \{b_1,\dots,b_n\}$
Correspondence pairs $(a_i, b_i)$
Find $T = \arg\min_T \mathrm{RMSD}(A,T(B))$
→ O(n) closed-form solution [Arun, Huang, and
Blostein, 87] [Horn, 87] [Horn, Hilden, and
Negahdaripour, 88]

53
O(n) SVD-Based Algorithm

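$T$ combines translation $t$ and rotation $R$, such
that $T(b_i) = t + R(b_i)$
$\bar{b} = \left(\sum_{i=1}^{n} b_i\right)/n$ [mean of the $b_i$'s]
Place the origin of the coordinate system at $\bar{b}$
$\min_T \mathrm{RMSD}(A,T(B))$ simplifies to (up to some
constants) $\min_{t,R} \sum_{i=1}^{n} \|a_i - t - R(b_i)\|^2$
$t$ and $R$ can be computed separately
$t = \bar{a}$ [mean of the $a_i$'s]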

Arun, Huang, and Blostein, 87
54
O(n) SVD-Based Algorithm

$A_{3\times n} = [a_1-\bar{a}, \dots, a_n-\bar{a}]$, $B_{3\times n} = [b_1-\bar{b}, \dots,
b_n-\bar{b}]$
Compute the SVD of the $3\times 3$ correlation
matrix $BA^T$: $BA^T = UDV^T$,
where $D$ is a diagonal matrix with decreasing
non-negative entries (singular values) along the
diagonal
If $\det(U)\det(V) = 1$ then $S = I$,
else $S = \mathrm{diag}(1,1,-1)$
$R = VSU^T$

Arun, Huang, and Blostein, 87
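
A minimal sketch of the closed-form alignment step, assuming NumPy is available. The function name `best_rigid_transform` and the one-point-per-row array layout are choices made for this illustration. Under the convention used here (correlation matrix $BA^T = UDV^T$, with the transform moving B onto A), the rotation works out to $VSU^T$; decomposing the transposed matrix $AB^T$ instead swaps the roles of $U$ and $V$.

```python
import numpy as np

def best_rigid_transform(a, b):
    """Return (R, t) minimizing sum_i ||a_i - (R b_i + t)||^2.

    a, b : arrays of shape (n, 3); row i of b corresponds to row i of a.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_mean = a.mean(axis=0)
    b_mean = b.mean(axis=0)
    # 3x3 correlation matrix: sum over i of (b_i - b_mean)(a_i - a_mean)^T.
    h = (b - b_mean).T @ (a - a_mean)
    u, _, vt = np.linalg.svd(h)
    v = vt.T
    # Flip the last axis if needed so that R is a rotation, not a reflection.
    sign = 1.0 if np.linalg.det(v @ u.T) > 0 else -1.0
    s = np.diag([1.0, 1.0, sign])
    r = v @ s @ u.T
    # General translation; it reduces to a_mean when the origin is at b_mean.
    t = a_mean - r @ b_mean
    return r, t
```

The cost is linear in the number of pairs: building the 3x3 matrix takes O(n), and the SVD of a 3x3 matrix is constant time.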
55

[Arun, Huang, and Blostein, 87] → rotation matrix
[Horn, 87] → quaternion

56
→ Trial-and-Error Approach to Protein Structure
Comparison
57
→ Trial-and-Error Approach to Protein Structure
Comparison

1. Set CS to a seed correspondence set (small set
sufficient to generate an alignment transform)
2. Compute the alignment transform T for CS and
apply T to the second protein B
3. Update CS to include all pairs of features that
are close to each other
4. If CS has changed, then return to Step 2, else
return (CS,T)

58
→ Trial-and-Error Approach to Protein Structure
Comparison

- result ← nil
- Iterate N times:
1. Set CS to a seed correspondence set (small set
sufficient to generate an alignment transform)
2. Compute the alignment transform T for CS and
apply T to the second protein B
3. Update CS to include all pairs of features that
are close to each other
4. If CS has changed, then return to Step 2, else
result ← result ∪ (CS,T)
- Return result
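
The loop above might be sketched as follows, assuming NumPy. Everything named here is hypothetical and chosen for illustration: `fit_transform` is a least-squares rigid fit, `close_pairs` is a greedy nearest-first pairing with a distance `cutoff`, seeds are supplied by the caller, and the `max_rounds` cap and the minimum of three pairs are safeguards added for this example. It is not the procedure of any specific tool listed earlier.

```python
import numpy as np

def fit_transform(a, b):
    """Least-squares rigid transform (R, t) moving rows of b onto rows of a."""
    a_mean, b_mean = a.mean(axis=0), b.mean(axis=0)
    u, _, vt = np.linalg.svd((b - b_mean).T @ (a - a_mean))
    v = vt.T
    sign = 1.0 if np.linalg.det(v @ u.T) > 0 else -1.0
    r = v @ np.diag([1.0, 1.0, sign]) @ u.T
    return r, a_mean - r @ b_mean

def close_pairs(a, b_moved, cutoff):
    """Greedy one-to-one pairing of points within cutoff, nearest first."""
    dist = np.linalg.norm(a[:, None, :] - b_moved[None, :, :], axis=2)
    order = sorted(
        (dist[i, j], i, j)
        for i in range(len(a))
        for j in range(len(b_moved))
        if dist[i, j] <= cutoff
    )
    used_a, used_b, pairs = set(), set(), []
    for _, i, j in order:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)

def refine(a, b, seed_pairs, cutoff, max_rounds=50):
    """Steps 1-4: alternate between fitting T and updating CS until CS is stable."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    cs = sorted(seed_pairs)
    for _ in range(max_rounds):
        r, t = fit_transform(a[[i for i, _ in cs]], b[[j for _, j in cs]])
        new_cs = close_pairs(a, b @ r.T + t, cutoff)
        # Stop when CS no longer changes, or when too few pairs remain to fit T.
        if new_cs == cs or len(new_cs) < 3:
            return cs, r, t
        cs = new_cs
    r, t = fit_transform(a[[i for i, _ in cs]], b[[j for _, j in cs]])
    return cs, r, t

def trial_and_error(a, b, seeds, cutoff):
    """Outer loop: one refinement per seed, collecting every (CS, R, t)."""
    return [refine(a, b, seed, cutoff) for seed in seeds]
```

Because each seed can converge to a different support, the result is a list of candidate matches; a separate scoring step is still needed to rank them.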