Title: Protein Structure Comparison
1 Protein Structure Comparison
2 Protein Structure Alignment
Human hemoglobin alpha-chain (PDB 1jebA)
Human myoglobin (PDB 2mm1)
Another example: G-proteins 1c1yA and 1kk1A (residues 6-200); sequence identity 18%, structural identity 72%
3 Protein Structure Comparison: Motivation
- Understanding the folding process
- Protein classification
- Finding binding sites of the protein
- Identifying structurally conserved regions in the protein
4 Comparison of proteins by sequence or shape?
- Protein sequence comparison is simpler
- 3D structures are available for only a few percent of all proteins
- Shape similarity is detectable even though the sequences may have changed in the course of evolution
5 Different Instances of Structure Comparison
- All-to-all comparison
- Classify all known structures
- Search for a structural motif
- Study interaction between structures and other molecules (protein docking)
- Use known structures to predict structure from sequence (protein threading)
6 Sequence order dependence
- Sequence order dependent alignment: a 1-D task
- Sequence order independent alignment: a real 3-D task
7 Why sequence order dependence?
- Substructures preserving sequence order might be biologically more meaningful
- With the sequence order constraint the computational task is simpler
8 The problem
- Given a pair of protein structures, find the correspondences between the Cα atoms of the backbone that best align the two structures
- There is a trade-off between maximizing the number of corresponding atoms and minimizing the distance between them
9 Finding Correspondences
- Point-based approaches
- Geometric Hashing, Indexing (Wolfson et al., 1998)
- Comparison based on 2D or 3D distance matrices (Holm, Sander, 96)
- Dynamic Programming (Gerstein, Levitt, 98)
- Combinatorial Extension (Bourne et al, 96)
10 Algorithms for Structure Alignment
- Distance based methods
- DALI (Holm and Sander): aligning scalar distance plots
- STRUCTAL (Gerstein and Levitt): dynamic programming using pairwise inter-molecular distances
- SSAP (Orengo and Taylor): dynamic programming using intra-molecular vector distance
- Vector based methods
- VAST (Bryant): graph theory based secondary structure alignment
- 3dSearch (Singh and Brutlag): fast secondary structure index lookup
- Both vector and distance based
- LOCK (Singh and Brutlag): hierarchically uses both secondary structure vectors and atomic distances
11 Hashing function
- From an object
- To invariant features
- To a t-tuple of numbers
- To indices
- Use the t indices to access a t-dimensional hash table
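The chain from invariant features to table indices can be sketched as follows. This is a minimal illustration, not the scheme of any particular tool: the function name `features_to_index`, the uniform binning, and the feature range are assumptions made for the example.

```python
from collections import defaultdict


def features_to_index(features, lower, upper, bins):
    """Quantize a t-tuple of invariant features into a t-tuple of bin indices."""
    index = []
    for value in features:
        clamped = min(max(value, lower), upper)      # keep the value inside [lower, upper]
        cell = int((clamped - lower) / (upper - lower) * bins)
        index.append(min(cell, bins - 1))            # the upper bound falls in the last bin
    return tuple(index)


# A t-dimensional hash table, stored sparsely as a dictionary keyed by index tuples.
hash_table = defaultdict(list)

# Hypothetical object with three invariant features (for example angles in degrees).
key = features_to_index((12.0, 95.5, 179.9), lower=0.0, upper=180.0, bins=18)
hash_table[key].append("object-1")
```

Storing the table as a dictionary avoids allocating all bins^t cells when most of them stay empty.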
12 Indexing Methods for Fast Retrieval of 3D Patterns
- Select a set of target proteins
- Create and store a hash table indexed by invariant geometric properties of the selected folds
- Update the database as new structures are found
- Use the table to identify the nearest fold for a target protein.
13 Reference Frame
- A 3-D reference frame (r.f.) can be defined by three non-collinear points
- Invariant: the coordinates of any other point in the r.f.
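The invariance can be checked numerically. The sketch below builds an orthonormal frame from three non-collinear points and expresses a fourth point in it; the helper names (`frame`, `local_coords`, `rigid_motion`) and the particular axis construction are assumptions chosen for illustration.

```python
import math


def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def unit(u):
    norm = math.sqrt(dot(u, u))
    return tuple(a / norm for a in u)


def frame(p0, p1, p2):
    """Origin p0, x along p0->p1, z normal to the plane of the three points."""
    x = unit(sub(p1, p0))
    z = unit(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return p0, (x, y, z)


def local_coords(point, reference_frame):
    """Coordinates of a point expressed in the reference frame."""
    origin, axes = reference_frame
    offset = sub(point, origin)
    return tuple(dot(offset, axis) for axis in axes)


def rigid_motion(p, angle, shift):
    """Rotate about the z axis by angle (radians), then translate by shift."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])
```

Applying `rigid_motion` to all four points leaves the output of `local_coords` unchanged up to rounding, which is what makes these coordinates usable as hash keys.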
14 Secondary Structures Representation
- Secondary structures are represented as linear vectors (segments)
- the axis for the alpha helix and the best-fit segment for a beta strand
- An SVD-based alignment algorithm matches each α-helix segment against a helix with a known axis to determine the helix axis. β-sheet strands are fitted directly with segments.
15 Visualization
- Each segment associated with a secondary structure is displayed as a cylinder
16 Secondary Structure-based Approaches
- Geometric Hashing, Indexing (Wolfson et al, Holm et al, Guerra et al)
- Graph-based (Grindley et al)
- Dynamic Programming (Singh, Brutlag)
17 Indexing techniques based on Secondary Structures (Guerra et al)
- Consider all the triplets of secondary structures and their associated segments
- Construct a 3D table indexed by the angles relating three secondary structures.
18 Table Construction
- For each triplet a1, a2, a3 of secondary structures of protein P
- compute the angles between (a1, a2), (a1, a3), and (a2, a3),
- and use them as indexes to an entry in the a-a-a Table where (P, (a1, a2, a3)) is stored.
- At the end, each cell of the table contains information about all triplets that hashed into it (including distances between secondary structures)
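The construction loop can be sketched as below. Segments are assumed to be given as (start, end) pairs of 3-D points, and the uniform 20-bin quantization of each angle is an assumption for the example. The names `add_protein` and `angle_bin` are hypothetical, and inter-segment distances are not stored here.

```python
import math
from collections import defaultdict
from itertools import combinations


def direction(segment):
    """Unit vector from the start to the end of a segment."""
    start, end = segment
    d = tuple(e - s for s, e in zip(start, end))
    norm = math.sqrt(sum(c * c for c in d))
    return tuple(c / norm for c in d)


def angle_deg(seg_a, seg_b):
    """Angle in degrees between the directions of two segments."""
    cosine = sum(a * b for a, b in zip(direction(seg_a), direction(seg_b)))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))


def angle_bin(angle, bins=20):
    """Quantize an angle in [0, 180] into one of `bins` cells."""
    return min(int(angle / 180.0 * bins), bins - 1)


def add_protein(table, name, segments, bins=20):
    """Insert every triplet of segments of one protein into the table."""
    for i, j, k in combinations(range(len(segments)), 3):
        key = (angle_bin(angle_deg(segments[i], segments[j]), bins),
               angle_bin(angle_deg(segments[i], segments[k]), bins),
               angle_bin(angle_deg(segments[j], segments[k]), bins))
        table[key].append((name, (i, j, k)))


table = defaultdict(list)
toy_segments = [((0, 0, 0), (1, 0, 0)),
                ((0, 0, 0), (1, 2, 0)),
                ((0, 0, 0), (1, 0, 3))]
add_protein(table, "P", toy_segments)
```

A protein with s segments contributes s(s-1)(s-2)/6 entries, so the loop is cubic in s for each protein.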
19 Table construction
- Time complexity: O(s^3 n)
- s is the number of secondary structures in a protein
- n is the number of proteins.
20 Searching the table
- For a query protein, compute the same invariants used for the target proteins.
- For each invariant and its corresponding indices, access the corresponding cell in the table, where a vote is cast
- List the target proteins according to their votes
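The voting step can be sketched as follows, assuming the table maps an index tuple to a list of (protein, triplet) entries. The table contents and the function name `rank_targets` are hypothetical.

```python
from collections import Counter


def rank_targets(table, query_keys):
    """Cast one vote per table entry found in each cell hit by the query."""
    votes = Counter()
    for key in query_keys:
        for protein, _triplet in table.get(key, ()):
            votes[protein] += 1
    return votes.most_common()   # target proteins listed by decreasing vote count


# Toy table: index tuple -> entries stored at construction time.
toy_table = {
    (7, 7, 9): [("P1", (0, 1, 2)), ("P2", (0, 1, 3))],
    (3, 4, 5): [("P1", (1, 2, 3))],
}
ranking = rank_targets(toy_table, [(7, 7, 9), (3, 4, 5), (0, 0, 0)])
```

The top-ranked proteins are hypotheses of similarity only; they still need to be verified by an alignment or superposition step.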
21 Distribution of table entries (D. Platt, C. Guerra, I. Rigoutsos, G. Zanotti, 2003)
- There is a strong preference for triplets to fall into cells with indexes α, β, γ satisfying
- α + β = γ
- corresponding to segments lying on parallel planes
22 Analysis of the distribution of globally selected secondary structures
- Distributions show a much stronger preference for alignment than expected for uniformly random vectors.
- There is a greater preference for alignment between any two secondary structure elements if a third structure element aligns with either of the first two: the alignment angles are not independent variates.
26 Geometric Indexing
[Flowchart; box labels: PDB (contains 27000 proteins), Hash Table, Query protein, Proteins Superposition, Search for Similarity, Hypotheses of Similarity, List of Similar Proteins, Pairwise atomic Superimposition with a selected protein, Alignment by Dynamic Programming]
27 Refinement of the matching procedure
- Alignment
- Find a collection of corresponding pairs of secondary structures (SS) which maximizes a given similarity measure
- Dynamic programming
28 D(i,j) = number of times SS i of protein A is associated with SS j of protein B in a triplet of equivalent SS
29 Problem
- Find the increasing path in the D(i,j) matrix that maximizes the total similarity measure
30 Dynamic Programming
- Let M be a 2D matrix such that M(i,j) is the similarity measure between s1 s2 ... si and t1 t2 ... tj
- Compute
- M(i,j) = max { M(i-1,j) + d(si, φ),
- M(i-1,j-1) + d(si, tj),
- M(i,j-1) + d(φ, tj) }
- where φ denotes a gap (an unmatched element)
- The solution
- D(A,B) = M(n,m)
- Quadratic time complexity
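A minimal sketch of this recurrence follows, assuming a constant gap score in place of d(si, φ) and d(φ, tj). The function name `align_score`, the toy similarity function `match`, and the score values are assumptions for the example.

```python
def align_score(s, t, d, gap):
    """Fill M so that M[i][j] is the best score aligning s[:i] with t[:j]."""
    n, m = len(s), len(t)
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap          # s[:i] aligned entirely to gaps
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + gap          # t[:j] aligned entirely to gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j] + gap,                          # s_i left unmatched
                M[i - 1][j - 1] + d(s[i - 1], t[j - 1]),    # s_i matched with t_j
                M[i][j - 1] + gap,                          # t_j left unmatched
            )
    return M[n][m]


def match(a, b):
    """Toy similarity: +1 for identical labels, -1 otherwise."""
    return 1.0 if a == b else -1.0


score = align_score("HEH", "HH", match, gap=-1.0)
```

The two nested loops visit each of the (n+1)(m+1) cells once, which gives the quadratic running time.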
31 Integration of matching strategies
- Using different protein representations at the
- atomic level
- secondary structure level
- sequence level
32 Superposition
- Find a rigid transformation which optimally superimposes the atoms of two proteins
- Horn method
33 Accuracy vs coverage
- Accuracy: how many of the solutions found were correct?
- A = |F ∩ T| / |F|
- Coverage: how many of the correct solutions were found?
- C = |F ∩ T| / |T|
- T = correct solutions
- F = solutions found
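The two ratios can be computed directly with sets. The sketch below treats F and T as sets of identifiers; returning 0.0 for an empty set is an assumption made to avoid division by zero.

```python
def accuracy_and_coverage(found, correct):
    """Return (|F ∩ T| / |F|, |F ∩ T| / |T|) for found set F and correct set T."""
    found, correct = set(found), set(correct)
    hits = len(found & correct)
    accuracy = hits / len(found) if found else 0.0
    coverage = hits / len(correct) if correct else 0.0
    return accuracy, coverage


# Hypothetical identifiers: 4 solutions reported, 3 truly correct, 2 in common.
acc, cov = accuracy_and_coverage({"a", "b", "c", "d"}, {"b", "c", "e"})
```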
34 Evaluating PROuST
- Using SCOP as the standard of truth
- Against other existing servers
35 Accuracy of results for 1tim chain A
36 Another Algorithm: DALI
- Based on aligning 2-D intra-molecular distance matrices
- Computes the best subset of corresponding residues from the two proteins such that the similarity between the 2-D distance matrices is maximized.
- Searches the space of possible residue alignments using Monte Carlo algorithms
37 DALI
38 Distance matrix (2)
- Advantages
- invariant with respect to rotation and translation
- can be used to compare proteins
- Disadvantages
- the distance matrix is O(n^2) for a protein with n residues
- comparing distance matrices is a hard problem
- insensitive to chirality
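Both the invariance and the chirality blind spot can be seen on a toy example. The sketch below uses four hypothetical points rather than real Cα coordinates; the helper names are assumptions.

```python
import math


def distance_matrix(points):
    """Intra-molecular distance matrix: entry (i, j) is the distance between points i and j."""
    return [[math.dist(p, q) for q in points] for p in points]


def rotate_z_and_shift(p, angle, shift):
    """Rigid motion: rotation about the z axis followed by a translation."""
    c, s = math.cos(angle), math.sin(angle)
    x, y, z = p
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])


def mirror_x(p):
    """Reflection through the yz-plane, which flips chirality."""
    return (-p[0], p[1], p[2])


def matrices_close(a, b, tol=1e-9):
    return all(abs(x - y) <= tol for ra, rb in zip(a, b) for x, y in zip(ra, rb))


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
moved = [rotate_z_and_shift(p, 0.9, (4.0, -1.0, 2.0)) for p in points]
mirrored = [mirror_x(p) for p in points]

same_after_motion = matrices_close(distance_matrix(points), distance_matrix(moved))
same_after_mirror = matrices_close(distance_matrix(points), distance_matrix(mirrored))
```

Both flags are true: the matrix does not change under a rigid motion, and it also cannot tell a structure from its mirror image.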
39 Distance matrix
[Figure: four points labeled 1 to 4, with pairwise distances 5.9, 8.1 and 6.0 marked]
40 DALI
- DALI has been used to do an all-vs-all comparison of proteins in the PDB, and to create a hierarchical clustering of families.
- FSSP: Fold classification based on Structure-Structure alignment of Proteins
- http://www.ebi.ac.uk/dali/fssp/fssp.html
41 VAST: Vector Alignment Search Tool
- Aligns only secondary structure elements (SSE)
- Represents each SSE as a vector
- Finds all possible pairs of vectors from the two structures that are similar
- Uses a graph theory algorithm to find the maximal subset of similar vectors
- The overall alignment score is based on the number of similar pairs of vectors between the two structures.
42 VAST
- VAST has been used to do an all-vs-all comparison of proteins in the MMDB (NCBI's structure database), and to find structure neighbors for each structure.
- MMDB provides a service for searching structure neighbors using VAST.
- http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml
43 LOCK
- Define local secondary structures
- Find an initial superposition by using DP to align secondary structure vectors.
- Use greedy algorithms to find nearest neighbors and minimize RMSD between the Cα atoms from query and target.
- Find the core of aligned Cα atoms and minimize RMSD between them.
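The greedy pairing and the RMSD it is scored by can be illustrated as below. This is not LOCK itself: it assumes the two atom sets are already superimposed, uses a hypothetical distance `cutoff`, and omits the minimization over transformations.

```python
import math


def greedy_pairs(query, target, cutoff):
    """Pair atoms greedily by increasing distance, using each atom at most once."""
    candidates = sorted(
        (math.dist(q, t), i, j)
        for i, q in enumerate(query)
        for j, t in enumerate(target)
    )
    used_query, used_target, pairs = set(), set(), []
    for dist, i, j in candidates:
        if dist > cutoff:
            break                        # remaining candidates are farther apart
        if i not in used_query and j not in used_target:
            used_query.add(i)
            used_target.add(j)
            pairs.append((i, j))
    return pairs


def rmsd(query, target, pairs):
    """Root mean square deviation over the paired atoms."""
    total = sum(math.dist(query[i], target[j]) ** 2 for i, j in pairs)
    return math.sqrt(total / len(pairs))


query_atoms = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
target_atoms = [(0.0, 1.0, 0.0), (3.0, 1.0, 0.0), (20.0, 0.0, 0.0)]
pairs = greedy_pairs(query_atoms, target_atoms, cutoff=2.0)
deviation = rmsd(query_atoms, target_atoms, pairs)
```

In an iterative scheme the pairs found here would feed a superposition step, after which pairing is repeated until the set of pairs stops changing.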
44 Comparison of methods
45 Execution times for comparing a query structure to 27,000 target structures
46 Execution times for comparing a query structure to 685 target structures
47 Data
- 30,000 proteins extracted from PDB
- Approx. 27,000 proteins inserted in the geometric database
- 0 to 528 segments per protein
- 13.5 segments on average
- 48,000,000 triplets
- 4x20x20x20 table
48 GEOMETRIC PATTERN MATCHING UNDER RIGID MOTION (C. Guerra, V. Pascucci, 1999)
- Problem 1. Find a transformation T, if it exists, that brings A to within a given distance, say ε, of B, i.e. H(T(A), B) ≤ ε
- Problem 2. Find the minimum Hausdorff distance under a rigid motion
- D(A, B) = min_t H(t(A), B)
- where t is a rigid motion
49 Hausdorff Distance
- Let A = {a1, a2, ..., am} and B = {b1, b2, ..., bn} be sets of either points or segments.
- Definition (Hausdorff distance):
- H(A, B) = max(h(A, B), h(B, A))
- where the one-way Hausdorff distance is
- h(A, B) = max_a min_b ρ(a, b)
- where a (b) is a point of A (B) and ρ(a, b) is a metric.
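For finite point sets the definition translates directly into a brute-force computation. The sketch below assumes the Euclidean metric as the default; the function names are chosen for illustration.

```python
import math


def one_way_hausdorff(A, B, metric=math.dist):
    """h(A, B): the largest distance from a point of A to its nearest point of B."""
    return max(min(metric(a, b) for b in B) for a in A)


def hausdorff(A, B, metric=math.dist):
    """H(A, B) = max(h(A, B), h(B, A))."""
    return max(one_way_hausdorff(A, B, metric), one_way_hausdorff(B, A, metric))


A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
h_ab = one_way_hausdorff(A, B)
h_ba = one_way_hausdorff(B, A)
H = hausdorff(A, B)
```

The example shows why both directions are needed: every point of A coincides with a point of B, yet B has a point far from A.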
50 Segment Hausdorff distance
- HS(A, B) = max(hS(A, B), hS(B, A))
- where
- hS(A, B) = max_ai (min_bj H(ai, bj))
51 Oriented Segment Hausdorff distance
- HOS(A, B) = max(hOS(A, B), hOS(B, A)),
- where
- hOS(A, B) = max_ai (min_bj (max(d(ais, bjs), d(aie, bje))))
- ais, aie are the endpoints (start and end) of ai
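The oriented variant can be sketched as follows, assuming each segment is a (start, end) pair of points and d is the Euclidean distance. The function names are hypothetical.

```python
import math


def oriented_segment_distance(a, b):
    """max(d(a_s, b_s), d(a_e, b_e)): start is compared with start, end with end."""
    (a_start, a_end), (b_start, b_end) = a, b
    return max(math.dist(a_start, b_start), math.dist(a_end, b_end))


def one_way_oriented(A, B):
    """hOS(A, B): worst case over A of the best oriented match in B."""
    return max(min(oriented_segment_distance(a, b) for b in B) for a in A)


def oriented_segment_hausdorff(A, B):
    """HOS(A, B) = max(hOS(A, B), hOS(B, A))."""
    return max(one_way_oriented(A, B), one_way_oriented(B, A))


forward = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0))]
backward = [((1.0, 0.0, 0.0), (0.0, 0.0, 0.0))]   # same segment, opposite orientation

same_orientation = oriented_segment_hausdorff(forward, forward)
flipped = oriented_segment_hausdorff(forward, backward)
```

Reversing a segment changes the distance even though the underlying point set is identical, which is the purpose of the oriented definition.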
52 Exact solution in 2D
- This problem is generally solved as a problem of intersection of unions of disks in the transformation space.
- Time complexity O(m^3 n^3 log^2(nm)) in R^2
53 The Matching algorithm
- Find a rigid body transformation (translation plus rotation) that minimizes the Hausdorff distance between the segments of A and B.
- Derive
- T: A → B
- based on three representative segments of A and all triplets of segments of B, and choose the best T.
54 Practical Approach
- 1. Select three "representative" segments a, a', a'' of A as follows
- 1.a Choose randomly one representative a for A.
- 1.b Select a' to be the segment containing the point a'f farthest from the midpoint ac of a.
55 Practical approach (contd)
- 1.c Select a'' as the segment that contains the point at maximum distance from the line (ac, a'f).
- 2. For each triplet b, b', b'' of elements of B, determine the rotation and translation that maps a, a', a'' into b, b', b''.
- 3. Choose the best transformation among the examined ones.
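Step 1 can be sketched as below. It assumes that a, a' and a'' must be three distinct segments and that A has at least three. Only endpoints are examined, since the farthest point of a segment from a point or from a line is always one of its endpoints. The function name and the `first` parameter, which fixes the random choice for reproducibility, are hypothetical.

```python
import math
import random


def midpoint(segment):
    start, end = segment
    return tuple((s + e) / 2 for s, e in zip(start, end))


def point_line_distance(p, q0, q1):
    """Distance from point p to the infinite 3-D line through q0 and q1."""
    u = tuple(b - a for a, b in zip(q0, q1))
    v = tuple(b - a for a, b in zip(q0, p))
    c = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    return math.hypot(*c) / math.hypot(*u)


def pick_representatives(segments, first=None, rng=random):
    """Return the indices of a, a', a'' chosen as in steps 1.a to 1.c."""
    i = rng.randrange(len(segments)) if first is None else first      # 1.a
    a_c = midpoint(segments[i])
    j, a_f = max(                                                      # 1.b
        ((k, p) for k, seg in enumerate(segments) if k != i for p in seg),
        key=lambda kp: math.dist(kp[1], a_c),
    )
    k2, _ = max(                                                       # 1.c
        ((k, p) for k, seg in enumerate(segments) if k not in (i, j) for p in seg),
        key=lambda kp: point_line_distance(kp[1], a_c, a_f),
    )
    return i, j, k2


toy_A = [((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)),
         ((10.0, 0.0, 0.0), (11.0, 0.0, 0.0)),
         ((5.0, 7.0, 0.0), (5.0, 8.0, 0.0)),
         ((2.0, 1.0, 0.0), (3.0, 1.0, 0.0))]
chosen = pick_representatives(toy_A, first=0)
```

Picking segments that are far apart and far from collinear makes the transformation derived from them less sensitive to small coordinate errors.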
56 Segment Nearest-Neighbor
- The nearest-neighbor query among n segments in R^d is equivalent to a query among 2n points in R^2d.
- HSS(a, b) = min(max(d(as, bs), d(ae, be)), max(d(as, be), d(ae, bs))).
- Approximate nearest neighbor of a point q in R^d (within a factor of (1 + ε)) (Arya et al.)
- Time complexity O(log n) with O(n log n) preprocessing.
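The reduction from segments to points can be sketched as follows. Each segment in R^d is lifted to two points in R^2d, one per orientation, and the query is answered by a plain linear scan. The approximate nearest-neighbor structure of Arya et al. is not implemented here, and all function names are assumptions.

```python
import math


def h_ss(a, b):
    """HSS(a, b): best of the two endpoint pairings between segments a and b."""
    (a_s, a_e), (b_s, b_e) = a, b
    return min(max(math.dist(a_s, b_s), math.dist(a_e, b_e)),
               max(math.dist(a_s, b_e), math.dist(a_e, b_s)))


def lift(segments):
    """Map n segments in R^d to 2n labeled points in R^2d, one per orientation."""
    points = []
    for index, (start, end) in enumerate(segments):
        points.append((index, tuple(start) + tuple(end)))
        points.append((index, tuple(end) + tuple(start)))
    return points


def lifted_distance(p, q):
    """Maximum of the Euclidean distances on the two halves of the lifted points."""
    d = len(p) // 2
    return max(math.dist(p[:d], q[:d]), math.dist(p[d:], q[d:]))


def nearest_segment(query, segments):
    """Index of the segment minimizing HSS to the query, via the lifted points."""
    q = tuple(query[0]) + tuple(query[1])
    index, _ = min(lift(segments), key=lambda item: lifted_distance(q, item[1]))
    return index


toy_segments = [((0.0, 0.0), (1.0, 0.0)),
                ((5.0, 5.0), (6.0, 5.0)),
                ((1.0, 1.0), (0.0, 1.0))]
toy_query = ((1.0, 0.1), (0.0, 0.1))    # close to segment 0, opposite orientation
best = nearest_segment(toy_query, toy_segments)
```

Replacing the linear scan with a spatial index over the lifted points is what would bring the query time down from linear to logarithmic.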
57 Complexity Analysis
- Time complexity O(m n^3 log n)
- Approximation error factor 8
58 Protein 1rpa
62 Questions?