Protein Structure Similarity

Slides: 60
Provided by: latombe
Learn more at: http://web.stanford.edu
1
Protein Structure Similarity
2
Secondary Structure Elements: α-helices,
β-strands/sheets, loops
3
Structure Prediction/Determination
  • Computational tools
    - Homology, threading
    - Molecular dynamics
  • Experimental tools
    - X-ray crystallography
4
Protein Structure Determination (1)
  • X-ray diffraction crystallography

5
Protein Structure Determination (2)
  • Nuclear magnetic resonance spectroscopy

6
Protein Data Bank
  • 1990 → 250 new structures
  • 1999 → 2,500 new structures
  • 2000 → >20,000 structures total
  • 2004 → 30,000 structures total
7
Protein Data Bank
Structures have been determined for only about
10% of known protein sequences
→ Protein Structure Initiative (PSI)
8
Structure Similarity
  • Refers to how well (or poorly) 3D folded
    structures of proteins can be aligned
  • Expected to reflect functional similarities
    (interaction with other molecules)

Proteins in the TIM barrel fold family
9
Alignment of 1xis and 1nar (TIM-Barrels)
[Figure: 1xis and 1nar shown in ribbon format, in backbone format, and as α-helix axes. Alignment computed by DALI.]
Sayle, R. RasMol. A protein visualization
tool. http://www.umass.edu/microbio/rasmol/index2.htm
13
Structure Similarity
  • Refers to how well (or poorly) 3D folded
    structures of proteins can be aligned
  • Is expected to reflect functional similarities
    (interaction with other molecules)
  • 2000: 20,000 structures in PDB,
    4,000 different folds (1:5 ratio)
  • Three possible reasons:
    - evolution
    - physical constraints (e.g., few ways to maximize
      hydrophobic interactions)
    - limits in techniques used for structure
      determination
  • Given a new structure, the probability is high
    that it is similar to an existing one

14
Why Comparing Protein Folded Structures?
Sequence → Structure → Function
  • Low sequence similarity may yield very similar
    structures
  • Sometimes high sequence similarity yields
    different structures

15
Alignment of 1xis and 1nar (TIM-Barrels)
1xis and 1nar have only 7% sequence identity, but
approximately 70% of the residues are
structurally similar
16
Why Comparing Protein Folded Structures?
Sequence → Structure → Function
  • Low sequence similarity may yield very similar
    structures
  • Sometimes high sequence similarity yields
    different structures
  • Structure comparison is expected to provide more
    pertinent information about functional
    (dis-)similarity among proteins, especially with
    non-evolutionary relationships or non-detectable
    evolutionary relationships

17
Ill-Posed Problem? Multiple Terminology
  • (Dis-)similarity analysis
  • Structure comparison
  • Alignment, superposition, matching
  • Classification
  • Applications
  • Definitions and issues
  • Methods

18
A Few Web Sites
  • Protein Data Bank (PDB): http://www.rcsb.org/pdb/
  • Protein classification
    - SCOP: http://scop.berkeley.edu/
    - CATH: http://www.biochem.ucl.ac.uk/bsm/cath/
  • Protein alignment
    - DALI: http://www.ebi.ac.uk/dali/
    - LOCK: http://motif.stanford.edu/lock2/

19
Application 1: Find Global Similarities Among
Protein Structures
  • Given two protein structures, find the largest
    similar substructures
  • For example, a substructure is a subset of Cα
    atoms or a subset of secondary structure elements
    in each molecule
  • Several possible similarity measures
  • Variants: 1-to-1, 1-to-many, many-to-many (PDB)
  • Must be automatic (and fast)

21
Application 2: Classify Proteins
  • Many proteins, but relatively few distinct fold
    families [Chothia, 1992; Holm and Sander, 1996;
    Brenner et al., 1997]
  • Hierarchical classification
    - Class: similar secondary structure content
    - Fold: SSEs in similar arrangement
    - Family: clear evolutionary relationship
  • Insight into functions and structure
    stabilization
  • Basis for homology and threading
  • Manual classification → SCOP [Murzin et al.,
    1995]
  • Increasing size of PDB → automatic classifiers:
    CATH [Orengo et al., 1997], Pclass [Singh et
    al.], FSSP [Holm and Sander]
22
Manual vs. Automatic Classification
23
Application 3: Find Motif in Protein Structure
  • Given a protein structure and a motif (e.g., a
    small collection of atoms corresponding to a
    binding site)
  • Find whether the motif matches a substructure of
    the protein
  • Variant: one motif against many proteins

Active sites of 1PIP and 5PAD. Only 3 amino-acids
participate in the motif
24
Application 4: Find Pharmacophore
  • Given:
    - Small collection (5-10) of small flexible ligands
      with similar activity (hence, assumed to bind at
      same protein site)
    - Low-energy conformations (several dozen to a few
      hundred) for each ligand
  • Find substructure (pharmacophore) that occurs in
    at least one conformation of each ligand
  • Key problem in drug design when binding site is
    unknown

25
Application 4: Find Pharmacophore
Inhibitors of thermolysin
26
Application 5: Search for Ligands Containing a
Pharmacophore
  • Given:
    - Database containing several 100,000, or more,
      small ligands
    - A pharmacophore P
  • Find all ligands that have a low-energy
    conformation containing P
  • Data mining of pharmaceutical databases (lead
    generation)

S.M. LaValle, P.W. Finn, L.E. Kavraki, and J.C.
Latombe. A Randomized Kinematics-Based Approach
to Pharmacophore-Constrained Conformational
Search and Database Screening. J. of
Computational Chemistry, 21(9):731-747, July 2000
27
  • Applications
  • Definitions and issues
  • Methods

28
3D Molecular Structure
  • Collection of (possibly typed) atoms or groups of
    atoms in some given 3D relative placement
  • The placement of a group of atoms is defined by
    the position of a reference point (e.g., the
    center of an atom) and the orientation of a
    reference direction
  • The type can be the atom ID, the amino-acid ID,
    etc

29
Matching of Structures
  • Two structures A and B match iff:
    - Correspondence: There is a one-to-one map
      between their elements
    - Alignment: There exists a rigid-body transform T
      such that the RMSD between the elements in A and
      those in T(B) is less than some threshold ε
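
The matching definition above can be sketched in code. This is a minimal illustration, assuming each structure is reduced to a list of 3D points already paired by index (the correspondence) and that the transform is given as a rotation matrix plus a translation; the function names are hypothetical, not taken from the slides.

```python
import math

def rmsd(points_a, points_b):
    """Root-mean-square deviation between paired 3D points (paired by index)."""
    if len(points_a) != len(points_b) or not points_a:
        raise ValueError("need two non-empty point lists of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(points_a, points_b))
    return math.sqrt(total / len(points_a))

def apply_transform(rotation, translation, point):
    """Rigid-body transform T(p) = t + R p, with R as a 3x3 nested list."""
    return tuple(
        translation[r] + sum(rotation[r][c] * point[c] for c in range(3))
        for r in range(3)
    )

def is_match(points_a, points_b, rotation, translation, epsilon):
    """True if RMSD(A, T(B)) is below the threshold epsilon."""
    moved_b = [apply_transform(rotation, translation, q) for q in points_b]
    return rmsd(points_a, moved_b) < epsilon
```

The threshold and the pairing are inputs here; choosing them is exactly the hard part discussed in the following slides.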

30
Complete Match
31
Alignment of 3adk and 1gky
  • Both matching and non-matching secondary
    structure elements

32
Partial Match
  • Notion of support s of the match: the match is
    between s(A) and s(B)
  • → Dual problem:
    - What is the support?
    - What is the transform?
  • Often several (many) possible supports
  • Small supports → motifs

33
Mathematical Relative
[Figure: functions f and g, with support s marked]
$\|f - g\|^2$
Over which support?
35
Multiple Partial Matches
36
Distributed Support
37
What is Best?
Should gaps be penalized?
38
What About This?
Sequence along backbone is not preserved
39
Similarity measure is unlikely to satisfy the
triangle inequality for partial matches
40
Scoring Issues
  • Trade-off between size of s and RMSD
  • How should gaps be counted?
  • Is there a quality of the correspondence?
  • The correspondence may, or may not, satisfy
    type and/or backbone sequence preferences
  • Should accessible surface be given more
    importance?
  • → Similarity measure may be different from the
    inverse of RMSD (though no consensus on best
    measure!)
  • But RMSD is computationally very convenient!

41
Examples
  • RMSD: dissimilarity measure → emphasizes
    differences → smaller support
  • STRUCTAL's similarity measure → emphasizes
    similarities → larger support
42
Comparison of Similarity Measures
  • A.C.M. May. Toward more meaningful hierarchical
    classification of amino acids scoring functions.
    Protein Engineering, 12:707-712, 1999.
    Reviews 37 protein structure similarity measures
  • The difficulty of defining a similarity score is
    probably due to the facts that structure
    comparison is an ill-posed problem and has
    multiple solutions

43
Bottom Line
  • Finding an optimal partial match is NP-hard
  • No fast algorithm is guaranteed to give an
    optimal answer for any given measure [Godzik,
    1996]
  • → Heuristic/approximate algorithms
  • → Probably not a single solution, but
    application-dependent solutions
  • → But there exist general algorithmic principles

44
Computational Questions
  • Given a (dis)similarity measure and two
    proteins, compute the best match
  • Which support?
  • Which correspondence?
  • Which alignment transform?

45
  • Applications
  • Definitions and issues
  • Methods

46
Find Global Similarities Among Protein Structures
  • Input: Two sets of features (atoms or groups of
    atoms) {a1,...,an} and {b1,...,bm} belonging to two
    different proteins A and B
  • Output:
    - Maximal correspondence set C of pairs (ai,bj),
      where all ai and all bj are distinct
    - Alignment transform T such that the RMSD of the
      pairs (ai,T(bj)) is less than a given threshold ε
  • Several possible outputs

Variant of the Largest Common Point Set
problem [Akutsu and Halldorsson, 1994]
47
Possible Correspondence Constraints
  • Typed features: (ai,bj) is a possible
    correspondence pair iff Type(ai) = Type(bj)
  • Ordered features: (ai,bj) and (ai',bj'), where
    i' > i, are possible correspondence pairs iff
    j' > j. E.g., sequence along backbone
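
The two constraints above can be checked mechanically. The sketch below assumes a correspondence set is a list of index pairs (i, j) that is already one-to-one, and that feature types are given as sequences indexed by feature number; the function names are hypothetical.

```python
def satisfies_type_constraint(pairs, types_a, types_b):
    """Typed features: every pair (i, j) must join features of the same type."""
    return all(types_a[i] == types_b[j] for i, j in pairs)

def satisfies_order_constraint(pairs):
    """Ordered features: whenever i' > i, the partner indices must satisfy j' > j.

    Sorting by i and checking that j strictly increases between neighbours
    is enough, because a strictly increasing sequence is increasing overall.
    """
    ordered = sorted(pairs)
    return all(j1 < j2 for (_, j1), (_, j2) in zip(ordered, ordered[1:]))
```

For example, with one-letter residue types, `satisfies_type_constraint([(0, 1), (2, 2)], "GAV", "AGV")` accepts the set, while the pair list `[(0, 3), (1, 2)]` violates the backbone ordering.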

48
Some Existing Software
  • Cα atoms
    - DALI [Holm and Sander, 1993]
    - STRUCTAL [Gerstein and Levitt, 1996]
    - MINAREA [Falicov and Cohen, 1996]
    - CE [Shindyalov and Bourne, 1998]
    - ProtDex [Aung, Fu and Tan, 2003]
  • Secondary structure elements and Cα atoms
    - VAST [Gibrat et al., 1996]
    - LOCK [Singh and Brutlag, 1996]
    - 3dSEARCH [Singh and Brutlag, 1999]

49
RMSD → Similarity
But matches and RMSDs are not exactly what we
need. In general, we need to compute a similarity
measure of the form max_T S(A,T(B)),
where S is more complex than RMSD.
Two-step approach:
  1. Compute best matches using RMSD
  2. Adjust transform to maximize similarity measure
50
Computation of Best Matches
  • Two simultaneous subproblems
    - Find maximal correspondence set C
    - Find alignment transform T
  • Chicken-and-egg issue
    - Each subproblem is relatively simple
    - If we knew C, we could compute T
    - If we knew T, we could get C by proximity
    - But the combination is hard!
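
To make "get C by proximity" concrete, here is a minimal sketch. It assumes the transform T has already been applied to B, and it uses a greedy closest-pair-first rule with a distance cutoff; both the rule and the cutoff are assumptions for illustration, and a greedy pass is not guaranteed to return the largest possible correspondence set.

```python
import math

def correspondence_by_proximity(points_a, moved_b, cutoff):
    """Greedy one-to-one pairing of points of A with points of T(B).

    Candidate pairs closer than `cutoff` are taken in order of increasing
    distance; a point is used at most once.
    """
    candidates = []
    for i, p in enumerate(points_a):
        for j, q in enumerate(moved_b):
            d = math.dist(p, q)
            if d <= cutoff:
                candidates.append((d, i, j))
    candidates.sort()

    used_a, used_b, pairs = set(), set(), []
    for _, i, j in candidates:
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)
```

The double loop is quadratic in the number of features; a spatial grid or k-d tree would be the usual replacement for large structures.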

52
Find Alignment Transform
  • Two sets of points $A = \{a_1,\dots,a_n\}$ and
    $B = \{b_1,\dots,b_n\}$
  • Correspondence pairs $(a_i, b_i)$
  • Find $T = \arg\min_T \mathrm{RMSD}(A,T(B))$
  • → O(n) closed-form solution [Arun, Huang, and
    Blostein, 87; Horn, 87; Horn, Hilden, and
    Negahdaripour, 88]

53
O(n) SVD-Based Algorithm
  • T combines translation t and rotation R, such
    that $T(b_i) = t + R(b_i)$
  • $\bar{b} = \left(\sum_{i=1,\dots,n} b_i\right)/n$: mean of the $b_i$'s
  • Place the origin of coordinate system at $\bar{b}$
  • $\min_T \mathrm{RMSD}(A,T(B))$ simplifies to (up to some
    constants) $\min_{t,R} \sum_{i=1,\dots,n} \|a_i - t - R(b_i)\|^2$
  • t and R can be computed separately
  • $t = \bar{a}$: mean of the $a_i$'s

[Arun, Huang, and Blostein, 87]
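
The separation of t and R stated on this slide can be checked with a short derivation. This is a sketch of the standard argument, not taken from the slides; the centered coordinates $a_i'$ and $b_i'$ are notation introduced here for illustration.

```latex
% Centered coordinates (introduced for this sketch):
%   a_i' = a_i - \bar{a},  b_i' = b_i - \bar{b},
%   so \sum_i a_i' = 0 and \sum_i b_i' = 0.
\begin{align*}
\sum_{i=1}^{n} \| a_i - t - R(b_i) \|^2
  &= \sum_{i=1}^{n} \| a_i' - R(b_i') + (\bar{a} - t - R(\bar{b})) \|^2 \\
  &= \sum_{i=1}^{n} \| a_i' - R(b_i') \|^2
     + n \, \| \bar{a} - t - R(\bar{b}) \|^2 .
\end{align*}
% The cross term vanishes because the centered points sum to zero.
% The second term is zero for t = \bar{a} - R(\bar{b}); with the origin
% placed at \bar{b} (so \bar{b} = 0) this gives t = \bar{a}.
% The first term depends on R only, so R can be found separately.
```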
54
O(n) SVD-Based Algorithm
  • $A_{3 \times n} = [a_1 - \bar{a}, \dots, a_n - \bar{a}]$, $B_{3 \times n} = [b_1 - \bar{b}, \dots, b_n - \bar{b}]$
  • Compute the SVD of the $3 \times 3$ correlation
    matrix $BA^T$: $BA^T = UDV^T$,
    where D is a diagonal matrix with decreasing
    non-negative entries (singular values) along the
    diagonal
  • If $\det(U)\det(V) = 1$ then $S = I$,
    else $S = \mathrm{diag}(1,1,-1)$
  • $R = VSU^T$

[Arun, Huang, and Blostein, 87]
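
The SVD-based procedure can be sketched with NumPy. This is an illustrative implementation under stated assumptions: points are stored as rows of n x 3 arrays, row i of A is paired with row i of B, and the function name is hypothetical. With the SVD written as $BA^T = UDV^T$, the rotation that carries B onto A is $V S U^T$; the translation is expressed in the original coordinates as $\bar{a} - R\bar{b}$, which reduces to $\bar{a}$ when the origin sits at $\bar{b}$.

```python
import numpy as np

def best_rigid_transform(A, B):
    """Return (R, t) minimizing sum_i ||a_i - (t + R b_i)||^2 for paired rows."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    a_mean = A.mean(axis=0)
    b_mean = B.mean(axis=0)

    # 3x3 correlation matrix of the centered point sets (B A^T in column notation).
    H = (B - b_mean).T @ (A - a_mean)
    U, _, Vt = np.linalg.svd(H)  # H = U @ diag(singular values) @ Vt

    # Guard against a reflection: force det(R) = +1.
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    S = np.diag([1.0, 1.0, sign])

    R = Vt.T @ S @ U.T
    t = a_mean - R @ b_mean
    return R, t

def rmsd_after_fit(A, B):
    """RMSD between A and the optimally transformed copy of B."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    R, t = best_rigid_transform(A, B)
    diff = A - (B @ R.T + t)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))
```

Building H takes O(n) time and the SVD acts on a fixed 3 x 3 matrix, which matches the O(n) cost quoted on the slide.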
55
  • [Arun, Huang, and Blostein, 87] → rotation matrix
  • [Horn, 87] → quaternion

57
→ Trial-and-Error Approach to Protein Structure
Comparison
  1. Set CS to a seed correspondence set (small set
     sufficient to generate an alignment transform)
  2. Compute the alignment transform T for CS and
     apply T to the second protein B
  3. Update CS to include all pairs of features that
     are close to each other
  4. If CS has changed, then return to Step 2; else
     return (CS,T)

58
→ Trial-and-Error Approach to Protein Structure
Comparison
  • result ← nil
  • Iterate N times:
    1. Set CS to a seed correspondence set (small set
       sufficient to generate an alignment transform)
    2. Compute the alignment transform T for CS and
       apply T to the second protein B
    3. Update CS to include all pairs of features that
       are close to each other
    4. If CS has changed, then return to Step 2; else
       result ← result ∪ {(CS,T)}
  • Return result
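
The loop above can be sketched end to end with NumPy. Several choices here are assumptions made only so the example runs: seeds are three random index pairs (the slides leave seed selection open), "close" means within a fixed distance cutoff with greedy one-to-one pairing, and a round limit guards against cycling. All function names are hypothetical.

```python
import numpy as np

def fit_transform(A, B, pairs):
    """Least-squares rigid transform (R, t) for the paired rows, via SVD."""
    ia, ib = (list(idx) for idx in zip(*pairs))
    P, Q = A[ia], B[ib]
    p_mean, q_mean = P.mean(axis=0), Q.mean(axis=0)
    U, _, Vt = np.linalg.svd((Q - q_mean).T @ (P - p_mean))
    sign = 1.0 if np.linalg.det(Vt.T @ U.T) > 0 else -1.0
    R = Vt.T @ np.diag([1.0, 1.0, sign]) @ U.T
    return R, p_mean - R @ q_mean

def close_pairs(A, moved_B, cutoff):
    """Greedy one-to-one pairing of features closer than cutoff."""
    dist = np.linalg.norm(A[:, None, :] - moved_B[None, :, :], axis=2)
    pairs, used_a, used_b = [], set(), set()
    for flat in np.argsort(dist, axis=None):
        i, j = divmod(int(flat), dist.shape[1])
        if dist[i, j] > cutoff:
            break  # remaining candidates are even farther apart
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            pairs.append((i, j))
    return sorted(pairs)

def refine(A, B, seed_pairs, cutoff, max_rounds=50):
    """Steps 1-4: alternate between fitting T and updating CS until CS is stable."""
    cs = sorted(seed_pairs)
    R, t = fit_transform(A, B, cs)
    for _ in range(max_rounds):
        new_cs = close_pairs(A, B @ R.T + t, cutoff)
        if new_cs == cs or len(new_cs) < 3:
            break
        cs = new_cs
        R, t = fit_transform(A, B, cs)
    return cs, (R, t)

def trial_and_error(A, B, cutoff, n_trials, rng):
    """Outer loop: run the refinement from N random seeds and collect the results."""
    results = []
    for _ in range(n_trials):
        ia = rng.choice(len(A), size=3, replace=False).tolist()
        ib = rng.choice(len(B), size=3, replace=False).tolist()
        results.append(refine(A, B, list(zip(ia, ib)), cutoff))
    return results
```

Random seeds rarely land on a good correspondence for large proteins, which is why seed selection is treated as its own question.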

59
  • How to get seed correspondences?