Protein Structure Comparison - PowerPoint PPT Presentation

Slides: 63
Provided by: Gue105
1
Protein Structure Comparison
2
Protein Structure Alignment
Human Hemoglobin alpha-chain (pdb 1jeb, chain A)
Human Myoglobin (pdb 2mm1)
Another example: G-Proteins 1c1yA and
1kk1A (residues 6-200). Sequence id 18%, Structural id 72%
3
Protein Structure Comparison: Motivation
  • Understanding of the folding process.
  • Protein classification
  • Finding binding sites of the protein
  • Identifying structurally conserved regions in the
    protein

4
Comparison of proteins by sequence or shape?
  • Protein sequence comparison is simpler
  • 3D structures are available for only a few
    percent of all proteins
  • Shape similarity is detectable even though the
    sequences may have changed in the course of
    evolution

5
Different Instances of Structure Comparison
  • All-to-All comparison
  • Classify all known structures
  • Search for a structural motif
  • Study interaction between structures and other
    molecules (Protein Docking)
  • Use known structures to predict structure from
    sequence (Protein Threading)

6
Sequence order dependence
  • Sequence order dependent alignment: a 1-D task.
  • Sequence order independent alignment: a real 3-D
    task.

7
Why Sequence Order Dependence?
  • Substructures preserving sequence order might be
    biologically more meaningful
  • With the sequence order constraint the
    computational task is simpler

8
The problem
  • Given a pair of protein structures, find the
    correspondences between the Cα atoms of the
    backbones that best align the two structures
  • There is a trade-off between maximizing the
    number of corresponding atoms and minimizing the
    distance between them

9
Finding Correspondences
  • Point-based approaches
  • Geometric Hashing, Indexing (Wolfson et al.,
    1998)
  • Comparison based on 2D or 3D distance matrices
    (Holm, Sander, 96)
  • Dynamic Programming (Gerstein, Levitt, 98)
  • Combinatorial Extension (Bourne et al, 96)

10
Algorithms for Structure Alignment
  • Distance based methods
      • DALI (Holm and Sander): aligning scalar
        distance plots
      • STRUCTAL (Gerstein and Levitt): dynamic
        programming using pairwise inter-molecular
        distances
      • SSAP (Orengo and Taylor): dynamic programming
        using intra-molecular vector distances
  • Vector based methods
      • VAST (Bryant): graph theory based secondary
        structure alignment
      • 3dSearch (Singh and Brutlag): fast secondary
        structure index lookup
  • Both vector and distance based
      • LOCK (Singh and Brutlag): hierarchically uses
        both secondary structure vectors and atomic
        distances

11
Hashing function
  • From an object
  • To invariant features
  • To a t-tuple of numbers
  • To indices
  • Use the t indices to access a t-dimensional hash
    table

12
Indexing Methods for Fast Retrieval of 3D Patterns
  • Select a set of target proteins
  • Create and store a hash table indexed by
    invariant geometric properties of the selected
    folds
  • Update the databases as new structures are found
  • Use the table to identify the nearest fold for a
    target protein.

13
Reference Frame
  • A 3-D reference frame (r. f.) can be defined by
    three non collinear points
  • Invariant: the coordinates of any other point
    expressed in the r.f.
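
A minimal sketch of this invariance, assuming an orthonormal frame built from the three points by Gram-Schmidt-style cross products. The helper names and the sample rigid motion are hypothetical, not taken from the slides.

```python
import math

def sub(u, v):
    return tuple(a - b for a, b in zip(u, v))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])

def unit(u):
    n = math.sqrt(dot(u, u))
    return tuple(a / n for a in u)

def reference_frame(p0, p1, p2):
    """Origin and orthonormal axes defined by three non-collinear points."""
    x = unit(sub(p1, p0))
    z = unit(cross(x, sub(p2, p0)))
    y = cross(z, x)
    return p0, x, y, z

def local_coords(frame, q):
    """Coordinates of q expressed in the given reference frame."""
    origin, x, y, z = frame
    d = sub(q, origin)
    return (dot(d, x), dot(d, y), dot(d, z))

def rigid_motion(p):
    # Hypothetical motion: rotate 90 degrees about z, then translate.
    x, y, z = p
    return (-y + 5.0, x - 2.0, z + 1.0)

pts = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
q = (1.0, 1.0, 4.0)
before = local_coords(reference_frame(*pts), q)
after = local_coords(reference_frame(*[rigid_motion(p) for p in pts]),
                     rigid_motion(q))
```

Here `before` and `after` agree up to rounding, which is what makes such coordinates usable as hash keys.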

14
Secondary Structures Representation
  • Secondary structures are represented as linear
    vectors (segments)
  • the axis for the alpha helix and the best fit
    segment for a beta strand
  • An SVD-based alignment algorithm matches each
    α-helix segment against helices with known axes
    to determine the helix axis. β-sheet strands are
    fitted directly with line segments.
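
One common way to obtain a best-fit segment is a least-squares line through the Cα coordinates, taken from the first singular vector of the centered points. The sketch below assumes NumPy and that approach; it is not necessarily the fitting procedure used in the presentation, and `fit_segment` is a hypothetical name.

```python
import numpy as np

def fit_segment(coords):
    """Least-squares line segment through a run of C-alpha coordinates."""
    pts = np.asarray(coords, dtype=float)
    centroid = pts.mean(axis=0)
    # The first right singular vector is the direction of largest variance.
    _, _, vt = np.linalg.svd(pts - centroid)
    direction = vt[0]
    # Orient the axis from the first residue towards the last one.
    if np.dot(pts[-1] - pts[0], direction) < 0:
        direction = -direction
    # Project every point on the axis to find the segment endpoints.
    t = (pts - centroid) @ direction
    return centroid + t.min() * direction, centroid + t.max() * direction

# Hypothetical zig-zag strand running roughly along the x axis.
strand = [(0.0, 0.1, 0.0), (1.0, -0.1, 0.0), (2.0, 0.1, 0.0), (3.0, -0.1, 0.0)]
start, end = fit_segment(strand)
```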

15
Visualization
  • Each segment associated to a secondary structure
    is displayed as a cylinder

16
Secondary Structure-based Approaches
  • Geometric Hashing, Indexing (Wolfson et al, Holm
    et al, Guerra et al )
  • Graph-based (Grindley et al)
  • Dynamic Programming (Singh, Brutlag)

17
Indexing techniques based on Secondary
Structures (Guerra et al)
  • Consider all the triplets of secondary structures
    and their associated segments
  • Construct a 3D table indexed by the angles
    relating three secondary structures.

18
Table Construction
  • For each triplet a1 , a2 , a3 of secondary
    structures of protein P
  • compute the angles between
  • (a1 , a2 ), (a1 , a3 ), and (a2 , a3 ),
  • and use them as indexes to an entry in the a-a-a
    Table where (P, (a1 , a2 , a 3)) is stored.
  • Each cell of the table at the end contains
    information about all triplets that hashed into
    it (including distances between secondary
    structures)
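
The construction can be sketched as follows. This is an illustration under assumptions: each secondary structure is reduced to a direction vector, angles are quantized into 20 bins of 9 degrees, and triplets are taken in index order. The function names and toy proteins are hypothetical.

```python
import math
from collections import defaultdict
from itertools import combinations

BIN_DEG = 9.0  # assumed bin width: 180 degrees split into 20 bins

def angle_deg(u, v):
    """Angle between two direction vectors, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    c = max(-1.0, min(1.0, dot / (nu * nv)))
    return math.degrees(math.acos(c))

def angle_bin(theta):
    return min(int(theta / BIN_DEG), int(180 / BIN_DEG) - 1)

def triplet_key(v1, v2, v3):
    """Table index built from the three pairwise angles."""
    return (angle_bin(angle_deg(v1, v2)),
            angle_bin(angle_deg(v1, v3)),
            angle_bin(angle_deg(v2, v3)))

def build_table(proteins):
    """proteins: {name: [direction vector of each secondary structure]}."""
    table = defaultdict(list)
    for name, vectors in proteins.items():
        for i, j, k in combinations(range(len(vectors)), 3):
            key = triplet_key(vectors[i], vectors[j], vectors[k])
            table[key].append((name, (i, j, k)))
    return table

# Toy input: two hypothetical proteins with 3 and 4 secondary structures.
proteins = {
    "P1": [(1, 0, 0), (0, 1, 0), (0, 0, 1)],
    "P2": [(1, 0, 0), (1, 1, 0), (0, 1, 0), (0, 0, 1)],
}
table = build_table(proteins)
```

The triple loop over secondary structures of every protein is where the cubic factor in the construction time comes from.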

19
Table construction
  • Time complexity: $O(s^3 n)$
  • s is the number of secondary structures in a
    protein
  • n is the number of proteins.

20
Searching the table
  • For a query protein, compute the same invariants
    used for the target proteins.
  • For each invariant and its corresponding indices,
    access the corresponding cell in the table, where
    a vote is cast for every protein stored there
  • List the target proteins according to their votes
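
A small sketch of the voting step, assuming the table maps a cell index to a list of (protein, triplet) entries. The table contents and query keys below are made up for illustration.

```python
from collections import Counter

def rank_targets(table, query_keys):
    """Cast one vote per stored triplet hit and rank targets by votes."""
    votes = Counter()
    for key in query_keys:
        for protein, _triplet in table.get(key, []):
            votes[protein] += 1
    return votes.most_common()

# Hypothetical table: cell index (three angle bins) -> stored entries.
table = {
    (10, 10, 10): [("P1", (0, 1, 2)), ("P2", (0, 2, 3))],
    (5, 10, 10): [("P2", (0, 1, 3)), ("P2", (1, 2, 3))],
    (5, 10, 5): [("P2", (0, 1, 2))],
}
# Hypothetical cell indices computed from the query protein's triplets.
query_keys = [(10, 10, 10), (5, 10, 10), (3, 3, 3)]
ranking = rank_targets(table, query_keys)
```

With these inputs P2 collects three votes and P1 one, so P2 is listed first.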

21
Distribution of table entries (D. Platt, C.
Guerra, I. Rigoutsos, G. Zanotti, 2003)
  • There is a strong preference for triplets to fall
    into cells with indexes α, β, γ satisfying
  • α β γ
  • corresponding to segments lying on parallel planes

22
Analysis of Distribution of globally selected
secondary structures
  • Distributions show much stronger preference for
    alignment than expected for randomly uniform
    vectors.
  • There is a greater preference for alignment
    between any two secondary structure elements if a
    third structure element aligns with either of the
    first two -- the alignment angles are not
    independent variates.

23
(No Transcript)
24
(No Transcript)
25
(No Transcript)
26
Geometric Indexing
PDB contains 27000 proteins
Hash Table
Query protein
Proteins Superposition
Search for Similarity
Hypotheses of Similarity
List of Similar Proteins
Pairwise atomic Superimposition with a selected
protein
Alignment by Dynamic Programming
27
Refinement of the matching procedure
  • Alignment
  • Find a collection of corresponding pairs of
    secondary structures (SS) which maximizes a
    given similarity measure
  • Dynamic programming

28
D(i,j) = number of times SS i of protein A is
associated with SS j of protein B in a triplet of
equivalent SS
29
Problem
  • Find the increasing path in the D(i,j) matrix
    that maximizes the total similarity measure

30
Dynamic Programming
  • Let M be a 2D matrix such that M(i,j) is the
    similarity measure between $s_1 s_2 \ldots s_i$ and
    $t_1 t_2 \ldots t_j$
  • Compute
    $M(i,j) = \max \{\, M(i-1,j) + d(s_i, \phi),\; M(i-1,j-1) + d(s_i, t_j),\; M(i,j-1) + d(\phi, t_j) \,\}$
    where $\phi$ denotes a gap
  • The solution:
    $D(A,B) = M(n,m)$
  • Quadratic time complexity
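
The recurrence can be sketched as below, assuming a constant gap score (0 by default) in place of the gap terms and a score matrix such as D(i,j) for the match term. The matrix values are invented for the example.

```python
def align(score, gap=0.0):
    """Best order-preserving pairing of rows s_1..s_n with columns t_1..t_m."""
    n, m = len(score), len(score[0])
    M = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = M[i - 1][0] + gap
    for j in range(1, m + 1):
        M[0][j] = M[0][j - 1] + gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(M[i - 1][j] + gap,                        # s_i unmatched
                          M[i - 1][j - 1] + score[i - 1][j - 1],    # s_i with t_j
                          M[i][j - 1] + gap)                        # t_j unmatched
    # Trace back one optimal increasing path through the matrix.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if M[i][j] == M[i - 1][j - 1] + score[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif M[i][j] == M[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return M[n][m], pairs[::-1]

# Hypothetical vote counts D(i,j) between 3 SS of A and 3 SS of B.
D = [[5, 1, 0],
     [0, 2, 4],
     [1, 0, 3]]
best, pairs = align(D)
```

The two nested loops fill an (n+1) by (m+1) table once, which gives the quadratic running time.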

31
Integration of matching strategies
  • Using different protein representations at
  • atomic level
  • secondary structure level
  • sequence level

32
Superposition
  • Find a rigid transformation which optimally
    superimposes the atoms of two proteins
  • Horn method
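
Horn's method gives a closed-form solution based on unit quaternions. The sketch below uses the closely related SVD formulation (Kabsch) instead, since it is short to write with NumPy; it solves the same least-squares problem for paired atoms. Function names and the test coordinates are hypothetical.

```python
import numpy as np

def superimpose(P, Q):
    """Rotation R and translation t minimizing sum ||R p_i + t - q_i||^2."""
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    pc, qc = P.mean(axis=0), Q.mean(axis=0)
    H = (P - pc).T @ (Q - qc)            # 3x3 covariance of the paired atoms
    U, _, Vt = np.linalg.svd(H)
    # Guard against an improper rotation (reflection).
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = qc - R @ pc
    return R, t

def rmsd(P, Q, R, t):
    """Root-mean-square deviation after applying (R, t) to P."""
    diff = (np.asarray(P, float) @ R.T + t) - np.asarray(Q, float)
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Hypothetical atoms: Q is P rotated 90 degrees about z and translated.
P = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], float)
Rz = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], float)
Q = P @ Rz.T + np.array([1.0, 2.0, 3.0])
R, t = superimpose(P, Q)
residual = rmsd(P, Q, R, t)
```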

33
Accuracy vs coverage
  • Accuracy: how many of the solutions found were
    correct?
    $A = |F \cap T| / |F|$
  • Coverage: how many of the correct solutions were
    found?
    $C = |F \cap T| / |T|$

T = set of correct solutions
F = set of solutions found
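
As a worked example of the two ratios, with made-up result sets:

```python
def accuracy_and_coverage(found, correct):
    """Accuracy |F n T| / |F| and coverage |F n T| / |T| for two sets."""
    found, correct = set(found), set(correct)
    hits = len(found & correct)
    accuracy = hits / len(found) if found else 0.0
    coverage = hits / len(correct) if correct else 0.0
    return accuracy, coverage

# Hypothetical identifiers, for illustration only.
found = {"fold_a", "fold_b", "fold_c", "fold_d"}   # F: solutions found
correct = {"fold_a", "fold_b", "fold_e"}           # T: correct solutions
acc, cov = accuracy_and_coverage(found, correct)
```

Two of the four solutions found are correct (accuracy 0.5), and two of the three correct solutions were found (coverage 2/3).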
34
Evaluating PROuST
  • Using as standard of truth SCOP
  • Against other existing servers

35
Accuracy of results for 1tim chain A
36
Another Algorithm: DALI
  • Based on aligning 2-D intra-molecular distance
    matrices
  • Computes the best subset of corresponding
    residues from the two proteins such that
    similarity between the 2-D distance matrices is
    maximized.
  • Searches through all possible alignments of
    residues using Monte-Carlo algorithms

37
DALI
38
Distance matrix (2)
  • Advantages
  • - invariant with respect to rotation and
    translation
  • - can be used to compare proteins
  • Disadvantages
  • - the distance matrix is $O(n^2)$ for a protein
    with n residues
  • - comparing distance matrices is a hard problem
  • - insensitive to chirality
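
A short sketch of the first advantage and the last disadvantage, using four invented points: the matrix is unchanged by a rotation plus translation, but also by a mirror reflection, so a structure and its mirror image cannot be told apart.

```python
import math

def distance_matrix(points):
    """All pairwise Euclidean distances (requires Python 3.8+ for math.dist)."""
    return [[math.dist(p, q) for q in points] for p in points]

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

# Hypothetical non-coplanar points standing in for residue positions.
points = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 2.0, 0.0), (0.0, 2.0, 1.0)]
# Rigid motion: rotate 90 degrees about z, then translate.
rotated = [(-y + 3.0, x, z - 1.0) for x, y, z in points]
# Mirror image: reflect through the xy plane.
mirrored = [(x, y, -z) for x, y, z in points]

dm = distance_matrix(points)
diff_rotated = max_abs_diff(dm, distance_matrix(rotated))
diff_mirrored = max_abs_diff(dm, distance_matrix(mirrored))
```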

39
Distance matrix
(Figure: example points numbered 1 to 4 with inter-point distances 5.9, 8.1 and 6.0 marked.)
40
DALI
  • DALI has been used to do an ALL vs. ALL
    comparison of proteins in the PDB, and to create
    a hierarchical clustering of families.
  • FSSP: Fold classification based on
    Structure-Structure alignment of Proteins
  • http://www.ebi.ac.uk/dali/fssp/fssp.html

41
VAST: Vector Alignment Search Tool
  • Aligns only secondary structure elements (SSE)
  • Represents each SSE as a vector
  • Finds all possible pairs of vectors from the two
    structures that are similar
  • Uses graph theory algorithms to find a maximal
    subset of similar vectors
  • The overall alignment score is based on the number
    of similar pairs of vectors between the two
    structures.

42
VAST
  • VAST has been used to do an ALL vs. ALL
    comparison of proteins in the MMDB (NCBI's
    structure database), and to find structure
    neighbors for each structure.
  • MMDB provides a service for searching structure
    neighbors using VAST.
  • http://www.ncbi.nlm.nih.gov/Structure/VAST/vast.shtml

43
LOCK
  • Define local secondary structures
  • Find an initial superposition by using DP to
    align secondary structure vectors.
  • Use greedy algorithms to find nearest neighbors
    and minimize RMSD between the Cα atoms from
    query and target.
  • Find the core of aligned Cα atoms and minimize
    RMSD between them.

44
Comparison of methods
45
Execution times for comparing a query structure
to 27,000 target structures
46
Execution times for comparing a query structure
to 685 target structures
47
Data
  • 30,000 proteins extracted from PDB
  • Approx. 27,000 proteins inserted in the geometric
    database
  • 0 to 528 segments per protein
  • 13.5 segments on average
  • 48,000,000 triplets
  • 4x20x20x20 table

48
GEOMETRIC PATTERN MATCHING UNDER RIGID MOTION (C.
Guerra, V. Pascucci, 1999)
  • Problem 1. Find a transformation T, if it exists,
    that brings A to within a given distance, say
    $\epsilon$, of B, i.e. $H(T(A), B) \le \epsilon$
  • Problem 2. Find the minimum Hausdorff distance
    under a rigid motion:
    $D(A, B) = \min_t H(t(A), B)$
    where t is a rigid motion

49
Hausdorff Distance
  • Let $A = \{a_1, a_2, \ldots, a_m\}$ and
    $B = \{b_1, b_2, \ldots, b_n\}$ be sets of either
    points or segments.
  • Definition (Hausdorff distance):
    $H(A, B) = \max(h(A, B), h(B, A))$
    where the one-way Hausdorff distance is
    $h(A, B) = \max_{a \in A} \min_{b \in B} \rho(a, b)$
    where a (b) is a point of A (B) and $\rho(a, b)$ is
    a metric.
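
The definition translates directly into a brute-force computation. The sketch below assumes point sets and the Euclidean metric; the sample points are invented.

```python
import math

def directed_hausdorff(A, B, metric=math.dist):
    """One-way distance h(A, B): the worst case over A of the nearest point in B."""
    return max(min(metric(a, b) for b in B) for a in A)

def hausdorff(A, B, metric=math.dist):
    """Symmetric Hausdorff distance H(A, B)."""
    return max(directed_hausdorff(A, B, metric),
               directed_hausdorff(B, A, metric))

A = [(0.0, 0.0), (1.0, 0.0)]
B = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
h_ab = directed_hausdorff(A, B)
h_ba = directed_hausdorff(B, A)
H = hausdorff(A, B)
```

Every point of A coincides with a point of B, so h(A, B) is 0, while the extra point (4, 0) of B is 3 away from A, so h(B, A) and H(A, B) are both 3. The one-way distance is not symmetric.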

50
Segment Hausdorff distance
  • $H_S(A, B) = \max(h_S(A, B), h_S(B, A))$
    where
    $h_S(A, B) = \max_{a_i} \min_{b_j} H(a_i, b_j)$

51
Oriented Segment Hausdorff distance
  • $H_{OS}(A, B) = \max(h_{OS}(A, B), h_{OS}(B, A))$,
    where
    $h_{OS}(A, B) = \max_{a_i} \min_{b_j} \max(d(a_i^s, b_j^s), d(a_i^e, b_j^e))$
  • $a_i^s$, $a_i^e$ are the endpoints of $a_i$
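
A brute-force sketch of the oriented variant, assuming each segment is stored as a (start, end) pair of 3-D points. The segments below are invented; the second segment of B is the second segment of A with its direction reversed.

```python
import math

def oriented_dist(a, b):
    """Max of start-to-start and end-to-end distances of two oriented segments."""
    (a_s, a_e), (b_s, b_e) = a, b
    return max(math.dist(a_s, b_s), math.dist(a_e, b_e))

def h_os(A, B):
    """One-way oriented segment Hausdorff distance."""
    return max(min(oriented_dist(a, b) for b in B) for a in A)

def H_os(A, B):
    return max(h_os(A, B), h_os(B, A))

A = [((0, 0, 0), (1, 0, 0)), ((0, 2, 0), (1, 2, 0))]
B = [((0, 0, 1), (1, 0, 1)), ((1, 2, 0), (0, 2, 0))]
result = H_os(A, B)
```

The reversed segment occupies the same points as its counterpart, yet it contributes a distance of 1 rather than 0, because orientation is taken into account.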

52
Exact solution in 2D
  • This problem is generally solved as a problem of
    intersection of unions of disks in the
    transformation space.
  • Time complexity $O(m^3 n^3 \log^2(nm))$ in $R^2$

53
The Matching algorithm
  • Find a rigid body transformation (translation
    plus rotation) that minimizes the Hausdorff
    distance between the segments of A and B.
  • Derive
    $T: A \to B$
    based on three representative segments of A and
    all triplets of segments of B, and choose the best
    T.

54
Practical Approach
  • 1. Select three "representative" segments a,
    a', a'' of A as follows:
  • 1.a Choose randomly one representative a for A.
  • 1.b Select a' to be the segment containing the
    point a'f farthest from the midpoint ac of a.

55
Practical approach (contd)
  • 1.c Select a'' as the segment that contains the
    point at maximum distance from the line through
    ac and a'f.
  • 2. For each triplet b, b', b'' of elements of B
    determine the rotation and translation that maps
    a, a' , a'' into b, b' , b''.
  • 3. Choose the best transformation among the
    examined ones.
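
Step 1 can be sketched as follows. This covers only the choice of the three representative segments, not the transformation search of steps 2 and 3. It assumes segments are (start, end) pairs in 3-D and takes the first pick as a parameter in place of a random choice so the example is reproducible; all names and coordinates are hypothetical.

```python
import math

def midpoint(seg):
    s, e = seg
    return tuple((a + b) / 2 for a, b in zip(s, e))

def point_line_distance(p, a, b):
    """Distance from p to the line through a and b (3-D)."""
    ab = [y - x for x, y in zip(a, b)]
    ap = [y - x for x, y in zip(a, p)]
    cx = (ab[1] * ap[2] - ab[2] * ap[1],
          ab[2] * ap[0] - ab[0] * ap[2],
          ab[0] * ap[1] - ab[1] * ap[0])
    return math.sqrt(sum(c * c for c in cx)) / math.sqrt(sum(c * c for c in ab))

def representatives(segments, first=0):
    """Indices of a, a', a''; needs at least three segments."""
    a_c = midpoint(segments[first])
    others = [i for i in range(len(segments)) if i != first]
    # Second segment: the one owning the endpoint farthest from the midpoint a_c.
    second, a_f = max(((i, p) for i in others for p in segments[i]),
                      key=lambda ip: math.dist(ip[1], a_c))
    rest = [i for i in others if i != second]
    # Third segment: the one owning the endpoint farthest from the line (a_c, a_f).
    third, _ = max(((i, p) for i in rest for p in segments[i]),
                   key=lambda ip: point_line_distance(ip[1], a_c, a_f))
    return first, second, third

segments = [((0, 0, 0), (2, 0, 0)), ((10, 0, 0), (12, 0, 0)),
            ((5, 4, 0), (5, 6, 0)), ((3, 1, 0), (4, 1, 0))]
chosen = representatives(segments, first=0)
```

Picking points that are far apart and far from collinear keeps the reference frame built from them well conditioned.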

56
Segment Nearest-Neighbor
  • The nearest-neighbor among n segments in $R^d$ is
    equivalent to a query among 2n points in $R^{2d}$.
  • $H_{SS}(a, b) = \min(\max(d(a_s, b_s), d(a_e, b_e)),\ \max(d(a_s, b_e), d(a_e, b_s)))$
  • Approximate nearest neighbor of a point q in $R^d$
    (within a factor of $(1+\epsilon)$) (Arya et al.)
  • Time complexity $O(\log n)$ with $O(n \log n)$
    preprocessing.
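
The reduction can be illustrated by lifting each segment to its two endpoint orderings, each a point in $R^{2d}$. The sketch below uses a brute-force scan over the lifted points; the approximate nearest-neighbor structure of Arya et al. is not implemented here. Names and coordinates are hypothetical.

```python
import math

def lift(segment):
    """Map a segment to its two endpoint orderings as points in R^{2d}."""
    s, e = segment
    return [tuple(s) + tuple(e), tuple(e) + tuple(s)]

def lifted_metric(p, q, d):
    """Max of the Euclidean distances between the two d-dimensional halves."""
    return max(math.dist(p[:d], q[:d]), math.dist(p[d:], q[d:]))

def hss(a, b):
    """Unoriented segment distance: best of the two endpoint pairings."""
    d = len(a[0])
    query = lift(a)[0]
    return min(lifted_metric(query, p, d) for p in lift(b))

def nearest_segment(query, segments):
    """Brute-force nearest neighbor over the 2n lifted points."""
    return min(range(len(segments)), key=lambda i: hss(query, segments[i]))

a = ((0, 0, 0), (1, 0, 0))
b = ((1, 0, 0.5), (0, 0, 0.5))        # reversed copy of a, shifted by 0.5 in z
far = ((5, 5, 5), (6, 5, 5))
dist_ab = hss(a, b)
index = nearest_segment(a, [far, b])
```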

57
Complexity Analysis
  • Time complexity $O(m n^3 \log n)$
  • Approximation error factor: 8

58
Protein 1rpa
59
(No Transcript)
60
(No Transcript)
61
(No Transcript)
62
Questions?