Seminar in structural bioinformatics - PowerPoint PPT Presentation


Title: Seminar in structural bioinformatics


1
Seminar in structural bioinformatics
  • Pairwise Structural Alignment

Presented by Dana Tsukerman
2
Outline
  • Definitions.
  • What is structural alignment?
  • Why structural alignment?
  • Structural alignment vs. sequence alignment
  • Problem definition
  • Background
  • Preparing the ground for the algorithm.
  • The algorithm

3
Outline - cont.
  • Implementation of the algorithm and an example of
    using real software based on the algorithm
    that will be presented.
  • Method results.
  • Method discussion
  • Method summary.
  • Extensions and additional features - a look
    ahead.
  • Lecture summary.

4
Definitions
  • Sequence alignment (reminder from last lecture)
    unambiguously distinguishes between protein
    pairs of similar and non-similar
    structure only when the pairwise sequence identity is
    high.
  • Structure alignment - compares proteins by their
    three-dimensional structure; it is the precise
    arrangement of the amino acid side chains in that
    structure that dictates the protein's function.

5
Quick review - Basic terms
  • Primary structure refers to the order (and
    sequence) of amino acids along one chain.
  • Some regions form regular local structure
    (folding patterns)
  • Alpha helices
  • Beta strands
  • Collectively called secondary structure elements
    (SSEs).
  • Regions connecting SSEs are loops.
  • Secondary structure is the description of the
    type and locations of the SSEs.
  • Tertiary structure is the 3-D coordinates of the
    atoms in a chain.
  • Quaternary structure describes the spatial
    packing of several folded chains (not all
    proteins have a quaternary structure).

6
Quick review - Basic terms
regular hydrogen bond patterns of backbone atoms
7
3D observation of proteins
8
3D observation of proteins
  • If one looks at the collection of protein
    structures, one is reminded of the works of an
    Origami artist: certain basic folding patterns
    are used over and over again and cleverly
    modified by minor adaptations to generate a wide
    variety of different protein structures. Where
    one such folding unit is insufficient to
    generate the required complexity, multiple
    domains can be combined, such as in the camel or
    giraffe structure in this picture.

9
Comparison in 3D
  • Starting from an example

[Figure: two structures whose substructures are labeled A, B, C, D and E]
10
Comparison in 3D
  • Rotation and translation coordinates - 6 degrees
    of freedom.
  • The method is independent of the amino acid
    sequence.
  • What does it mean?
  • This method is insensitive to insertions,
    deletions and displacements of equivalent
    substructures between the molecules being
    compared.
  • Proteins with similar sequences adopt very
    similar structures.

11
Why 3D comparison?
12
Why 3D comparison?
Wait a minute - isn't sequence comparison enough?
13
Why 3D comparison?
  • Structures are more conserved than sequences.
  • Detection of distant evolutionary relationships.
  • Structural alignment can imply a functional
    similarity that isn't detectable from a sequence
    alignment.
  • The protein docking problem.
  • Structure based drug design.
  • Applications and implications to the protein
    folding problem.

14
Why 3D comparison? Cont.
  • For homologous proteins, this provides the gold
    standard for sequence alignment.
  • For nonhomologous proteins, it allows us to
    identify common substructures of interest.
  • Allows us to classify proteins into clusters,
    based on structural similarity.
  • Design and engineering of synthetic proteins.

15
Problem Definition
  • Input: 3-D coordinate data of the structures to
    be compared.
  • Output: regions of structural similarity (more
    than one, if they exist) that lead to the best
    alignment.
  • NP-Hard.

What's best?
Most atoms matched with the lowest RMSD
16
Our goal
Find the correspondence between the structures:
a transformation T
17
Preparing the ground
  • Transformation definition.
  • How can we evaluate the match we found?
  • RMSD review from the opening lecture.
  • Other methods besides the one we will discuss
    and why our method is better.
  • Progression rule definition.
  • PDB functionality review.
  • Geometric Hashing introduction.

18
3-D Transformation
  • Rotation - the movement of a body in such a way
    that any given point of that body remains at a
    constant distance from some other fixed point.
    Will be denoted by R.
  • Translation - the transformation of moving every
    point by a fixed distance in the same direction
    (addition of a constant vector to every point).
    Will be denoted by T.
  • What is preserved under translation and rotation?
  • relative distances within an object (e.g.
    Shapes)
  • In total, the 3-D transformation has 6 degrees of
    freedom: 3 for rotation and 3 for translation.
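
The claim that rotation and translation preserve relative distances can be checked numerically. The following is a minimal sketch, assuming a rotation about the z axis only (a general rotation R would use all three angles); the function names and the sample points are illustrative and not taken from the lecture.

```python
import math

def rotate_z(point, angle):
    """Rotate a 3-D point about the z axis by `angle` radians (one of the 3 rotational degrees of freedom)."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y, s * x + c * y, z)

def translate(point, offset):
    """Add a constant vector to a point (the 3 translational degrees of freedom)."""
    return tuple(p + o for p, o in zip(point, offset))

# Two arbitrary sample points (illustrative values).
a = (1.0, 0.0, 0.0)
b = (0.0, 2.0, 1.0)

# Apply the same rotation and translation to both points.
moved_a = translate(rotate_z(a, 0.7), (3.0, -1.0, 2.0))
moved_b = translate(rotate_z(b, 0.7), (3.0, -1.0, 2.0))

before = math.dist(a, b)
after = math.dist(moved_a, moved_b)
print(before, after)  # the two distances agree up to floating-point error
```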

19
RMSD - review
  • A tool we use to evaluate the correspondence we
    found.
  • RMSD - Root Mean Square Deviation:
    $RMSD(x, y) = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \| x_i - y_i \|^2}$
  • Where,
  • n: number of atoms
  • x, y: the proteins we want to compare
    (structures)
  • We want to find a 3-D transformation T such that
    the RMSD will be minimal, i.e.
    $\min_T \sqrt{\frac{1}{n} \sum_{i=1}^{n} \| T(x_i) - y_i \|^2}$
  • We know how to do that in O(n).
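
As a small illustration of the RMSD definition, here is a sketch that computes it for two already-superimposed lists of 3-D coordinates with a known one-to-one correspondence. It does not search for the optimal transformation T; the function name and the sample coordinates are assumptions made for this example.

```python
import math

def rmsd(x, y):
    """Root mean square deviation between two equally long lists of 3-D points.

    Point x[i] is assumed to correspond to point y[i].
    """
    if len(x) != len(y) or not x:
        raise ValueError("structures must be non-empty and of equal length")
    total = sum(math.dist(p, q) ** 2 for p, q in zip(x, y))
    return math.sqrt(total / len(x))

# Illustrative coordinates: y is x shifted by one unit along z.
x = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
y = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
print(rmsd(x, y))
```

The single pass over the n matched pairs is what makes the evaluation itself linear in n.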

20
RMSD - Example
21
Other methods for structural alignment
  • Dynamic programming - building a score matrix,
    with a score for each pair of residues.
  • Other improvements of that method.
  • Simplify the problem by moving from 3D space to
    2D space, sacrificing the optimal result for
    speed.
  • Comparing secondary structure elements (SSE)
  • Our method allows access to problems that
    couldn't be approached previously by
    sequence-order-dependent structural comparison
    methods, like the docking problem.

22
Progression rule
  • Rule definition: for elements i and k from one
    sequence and elements j and l from the other
    sequence, if element i is matched to element j
    and element k is matched to element l, and if k
    is to the right of i in the first sequence, then l
    must also be to the right of j in the second
    sequence.
  • For example, the structures we saw at the
    beginning couldn't be found similar by a
    progression-rule-based (sequence-dependent)
    method.
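
The progression rule can be stated as a short check on a list of matched index pairs. This is a minimal sketch, assuming a one-to-one match given as (i, j) pairs of positions in the two sequences; the function name is hypothetical.

```python
def satisfies_progression_rule(pairs):
    """Return True if the matched pairs keep the same left-to-right order in both sequences.

    `pairs` is a list of (i, j): element i of the first sequence is matched
    to element j of the second. A one-to-one match is assumed.
    """
    ordered = sorted(pairs)  # sort by position in the first sequence
    # After sorting by i, the rule holds exactly when j is also strictly increasing.
    return all(j1 < j2 for (_, j1), (_, j2) in zip(ordered, ordered[1:]))

print(satisfies_progression_rule([(0, 0), (1, 2), (3, 5)]))  # order is kept
print(satisfies_progression_rule([(0, 3), (1, 1)]))          # order is swapped
```

A sequence-independent method such as the one in this lecture would accept the second match as well, as long as the geometry agrees.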

Example
23
PDB - Protein Data Bank
http://www.rcsb.org/pdb/index.html
  • International structure database.
  • Archive of experimentally determined 3-D
    structures of biological macromolecules, together
    with extensive annotation.
  • Established at Brookhaven National Laboratory
    in 1971. In the beginning it held 7 structures.
  • In 2003, 4,831 structures were deposited to the
    PDB archive.
  • January 2004 snapshot: 23,792 released atomic
    coordinate entries.

24
Geometric hashing
  • Introduced for model-based object recognition in
    computer vision.
  • Goal: identify and locate in an image all the
    instances of models which appear in the system's
    database.
  • Represents the objects to be compared in a
    translation- and rotation-invariant fashion.
  • The first step of the algorithm presented today
    is based on it.

25
Geometric hashing - cont.
  • We search for a way to represent objects such
    that we can move them without the
    representation changing.
  • HOW? Building triangles!
  • For n nodes, O(n^3) triangles!

A triangle's side lengths don't change when
we move or rotate it, and are thus invariant!
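
To make the invariance concrete, the sketch below computes the three side lengths of a triangle before and after a rigid motion and compares them. The motion used (rotation about the z axis plus a shift) and the sample triangle are assumptions chosen for illustration.

```python
import math

def triangle_sides(p, q, r):
    """Sorted side lengths of the triangle p, q, r: a rotation- and translation-invariant signature."""
    return tuple(sorted((math.dist(p, q), math.dist(q, r), math.dist(p, r))))

def rigid_move(point, angle, offset):
    """Rotate a point about the z axis by `angle` radians, then translate it by `offset`."""
    x, y, z = point
    c, s = math.cos(angle), math.sin(angle)
    return (c * x - s * y + offset[0], s * x + c * y + offset[1], z + offset[2])

# A 3-4-5 right triangle (illustrative values).
triangle = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
moved = [rigid_move(p, 1.1, (5.0, -2.0, 7.0)) for p in triangle]

original_sides = triangle_sides(*triangle)
moved_sides = triangle_sides(*moved)
print(original_sides, moved_sides)  # equal up to floating-point error
```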
26
  • now please pay attention
  • Wake Up!

27
The algorithm major steps
  • Find (relatively small) subsets of the structures
    that form an initial match
  • Find clusters in initial matches that represent
    similar transformations
  • Extend the clusters to contain additional
    matching pairs of residues.

28
Motivation reminder
  • now let's jump into the water...

29
Step 1 - Finding seed matches
  • Goal: search through the structures to find
    candidate initial matches.
  • Those will be referred to as seed matches.
  • Most difficult and time consuming step.
  • Extensive search of the structures.
  • How to represent?
  • Remember what we talked about in geometric
    hashing?

30
Finding seed matches - cont.
  • Seed match - list of matching pairs of atoms.
  • Pair - correspondence between atoms from
    different structures.
  • Assumption: the structures to be compared are
    described by sets of interest points and their
    3-D coordinates (for example, Cα atoms).
  • Model and target.

31
Finding seed matches - cont.
  • Redefinition of the problem: is there a rotated
    and translated subset of the interest points of
    the target which matches those of the model?
  • Two phases:
  • preprocessing
  • recognition

32
Preprocessing - intro.
  • Goal: represent the information about the atoms
    of the model molecule in a rotation and
    translation invariant manner.
  • Off-line. Why?
  • This information will be later used in the
    recognition phase.
  • 3 non-collinear atoms specify a unique
    orthonormal reference frame (unique coordinate
    system).
  • This will be a full reference frame.

33
Preprocessing - intro.
  • We won't use a full reference frame, only 2 atoms
    (not unique). Those 2 atoms will be called the
    reference set.
  • Each atom b in the molecule is represented by the
    triplet of distances of the sides of the triangle
    formed between b and the atoms of the reference
    set.

Reference set (c,a)
34
Reference frames - clarification
Note: the example is in the 2-D case (the basic ideas
are the same as in the 3-D case)
Same shape, different reference frames
35
Preprocessing
36
Preprocessing
  • Hash table:
  • representation of each model atom
  • triplets of distances (from the atom to the
    reference pair)
  • the corresponding reference pair and the atom
    which obtained this key.
  • Note:
  • each atom has a redundant representation in all
    possible reference sets.
  • Many triangles can occupy the same hash table
    entry.
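
The preprocessing phase described above can be sketched as follows. This is an illustration under stated assumptions, not the lecture's implementation: distances are quantized by simple rounding with a hypothetical bin width `BIN_WIDTH`, every ordered pair of atoms serves as a reference set, and the sample coordinates are made up.

```python
import math
from collections import defaultdict
from itertools import permutations

BIN_WIDTH = 1.0  # hypothetical quantization step for distances (an assumption)

def triangle_key(ref_a, ref_c, atom, bin_width=BIN_WIDTH):
    """Quantized triplet of side lengths of the triangle formed by a reference pair and an atom."""
    sides = (math.dist(ref_a, atom), math.dist(ref_c, atom), math.dist(ref_a, ref_c))
    return tuple(round(s / bin_width) for s in sides)

def build_hash_table(model):
    """Store every model atom under the key it obtains in every possible reference set."""
    table = defaultdict(list)
    for a, c in permutations(range(len(model)), 2):  # all ordered reference pairs
        for b in range(len(model)):
            if b in (a, c):
                continue
            # The entry records the reference pair and the atom that obtained this key.
            table[triangle_key(model[a], model[c], model[b])].append(((a, c), b))
    return table

# Illustrative model with four interest points.
model = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.8)]
table = build_hash_table(model)
total_entries = sum(len(entries) for entries in table.values())
print(total_entries)  # n * (n - 1) * (n - 2) entries for n atoms
```

The triple loop makes the cubic cost visible, and the redundant representation is what lets any pair of atoms later act as a reference set during recognition.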

37
Preprocessing Complexity Discussion
  • The complexity is highly dependent on the
    invariants we use for hashing.
  • Complexity: O(n^3)
  • n is the number of atoms in the model.
  • But we can do better!
  • We will later see an optimization that will
    reduce the complexity to O(n^2).

38
Preprocessing example
Note: the example is in the 2-D case (the basic ideas
are the same as in the 3-D case)
  • Reference frame here is a pair of coordinates.
  • For instance, in cell (3, 2) we find point 2,
    in both reference frames, and so we store those
    reference frames in the hash table H(3, 2).

39
Recognition - intro.
  • Goal: discover candidate matching substructures
    in the target and model molecules.
  • Reference set - pair of atoms.
  • Each such matching substructure is based on a
    certain reference set, which appears both in the
    model and target molecules.

40
Recognition algorithm
  • For each reference set of the target
  • Hold a vote counter for each reference set
    appearing in the hash table.
  • Any ideas what it will hold?
  • Of course, it will hold the current number of
    matching atoms, and the list of matching pairs.
  • We will call this list the vote list.
  • In the beginning the list is initialized with
    null.
  • Pick a target atom (take predefined threshold
    distance into consideration).

41
Recognition - cont.
  • Use the 3 sides of the triangle formed to compute
    their hash table key.
  • Access the hash table at this key
  • Extract all the model triangles in this entry.
  • For each triangle
  • Vote_counter++
  • Vote_list.add(current_triangle)
  • Go back to picking another atom, until we
    considered all of them.

42
Recognition - cont.
  • Check the vote counters of all the entries and
    consider the ones with a large number of votes.
  • Verification.
  • Choose another reference set in the target
    molecule and go back to the beginning.
  • Complexity: O(n^3 · k)
  • k indicates the number of triangles in each hash
    table entry.
  • Can be of order O(n^2) after optimizing
    preprocessing.
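
The voting scheme of the recognition phase can be sketched as below. It is a simplified illustration: the verification step and the predefined threshold distance are omitted, distances are quantized by rounding with a hypothetical bin width, and all names and coordinates are assumptions made for this example.

```python
import math
from collections import defaultdict
from itertools import permutations

def triangle_key(p_a, p_c, p_b, bin_width=0.5):
    """Quantized side lengths of the triangle formed by reference pair (p_a, p_c) and atom p_b."""
    sides = (math.dist(p_a, p_b), math.dist(p_c, p_b), math.dist(p_a, p_c))
    return tuple(round(s / bin_width) for s in sides)

def preprocess(model):
    """Hash every model atom under every possible reference pair."""
    table = defaultdict(list)
    for a, c in permutations(range(len(model)), 2):
        for b in range(len(model)):
            if b not in (a, c):
                table[triangle_key(model[a], model[c], model[b])].append(((a, c), b))
    return table

def recognize(table, target, min_votes):
    """Return seed matches as (model reference set, target reference set, vote list)."""
    seeds = []
    for ta, tc in permutations(range(len(target)), 2):  # each target reference set
        votes = defaultdict(list)  # model reference set -> list of (model atom, target atom)
        for tb in range(len(target)):  # pick a target atom
            if tb in (ta, tc):
                continue
            key = triangle_key(target[ta], target[tc], target[tb])
            for model_ref, mb in table.get(key, []):
                votes[model_ref].append((mb, tb))  # one vote per model triangle found
        for model_ref, vote_list in votes.items():
            if len(vote_list) >= min_votes:  # keep reference sets with many votes
                seeds.append((model_ref, (ta, tc), vote_list))
    return seeds

# The target is the model shifted by a constant vector (illustrative values).
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0)]
target = [(x + 10.0, y + 20.0, z + 30.0) for x, y, z in model]
seeds = recognize(preprocess(model), target, min_votes=2)
print(len(seeds))
```

Because the keys are quantized, several different model reference sets can collect votes for the same target reference set, which is one reason the later clustering and verification steps are needed.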

43
Recognition example
Note: the example is in the 2-D case (the basic ideas
are the same as in the 3-D case)
For instance, let's look at point f: its
coordinates are (0, 4), so this is the key to
H. H(0,4) contains the reference frame (1,3),
thus its counter will be increased (a vote for
the base pairs in H) and the pair (7, f) will be
added to the matched list.
Why (7, f)?
44
Step 2 - Clustering
  • Goal: cluster the seed matches that represent
    almost identical transformations.
  • Why clustering? Many of the seed matches obtained
    in step 1 represent the same transformation (but
    contain different pairs of matching atoms).
  • We use the lists of matching atoms to compute the
    3-D rotation and translation, which gives us the
    minimal least squares distance between the target
    and the model.

45
Clustering - cont.
  • The computed 3-D transformation has 6 parameters
    (3 for rotation (angles) and 3 for translation
    (distances)).
  • Join similar transformations into new groups.
  • What's similar?
  • Small 6-D distance between the parameter vectors
    of the transformations.
  • Clustering algorithm (iterative)
  • At the beginning, each seed match forms a group
    represented by 6 parameters of its
    transformation.

46
Clustering - cont.
  • The pair of groups having the minimal distance
    between their transformations is chosen and a new
    group is formed by merging these two groups.
  • What will be the parameters of the new group?
  • A threshold is defined to determine an end to the
    algorithm.
  • What do we have so far?
  • A number of groups, each representing one
    transformation obtained by averaging the individual
    transformations that were joined to the group.
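
The iterative merging described on the clustering slides can be sketched as a simple agglomerative procedure. This is a naive illustration with assumptions that the slides do not state: the 6-D distance is plain Euclidean distance over the parameter vectors (angles and distances are not rescaled), and the merged group is represented by the arithmetic mean of its members' parameters.

```python
import math

def cluster_transformations(params, threshold):
    """Greedily merge the closest groups of 6-parameter transformations.

    `params` is a list of 6-tuples (3 rotation angles, 3 translations).
    Returns a list of (mean parameter vector, member indices).
    """
    groups = [(tuple(p), [i]) for i, p in enumerate(params)]  # each seed match starts alone
    while len(groups) > 1:
        # Find the pair of groups with the minimal 6-D distance.
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = math.dist(groups[i][0], groups[j][0])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:  # the threshold ends the algorithm
            break
        members = groups[i][1] + groups[j][1]
        mean = tuple(sum(params[m][k] for m in members) / len(members) for k in range(6))
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [(mean, members)]
    return groups

# Two nearly identical transformations and one very different one (illustrative values).
params = [
    (0.0, 0.0, 0.0, 0.0, 0.0, 0.0),
    (0.1, 0.0, 0.0, 0.2, 0.0, 0.0),
    (3.0, 0.0, 0.0, 10.0, 0.0, 0.0),
]
groups = cluster_transformations(params, threshold=1.0)
print([members for _, members in groups])
```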

47
Clustering - cont.
  • The seed match of a group is obtained by choosing
    matching pairs from the original seed matches
    that composed the group.
  • But, we don't take the union of all pairs!
  • Improve accuracy by choosing pairs that appear in
    at least a certain percentage of the seed matches.
  • The new correspondence lists are considered more
    reliable than in step 1.
  • Complexity
  • m: number of seed matches to be clustered.
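
Choosing only the pairs that recur across a group's seed matches can be sketched like this. The fraction `min_fraction` stands in for the "certain percentage" mentioned above; its value here, like the function name, is an assumption for illustration.

```python
from collections import Counter

def consensus_pairs(seed_matches, min_fraction):
    """Keep the matching pairs that appear in at least `min_fraction` of the seed matches."""
    # Count each pair once per seed match, even if a match lists it twice.
    counts = Counter(pair for match in seed_matches for pair in set(match))
    needed = min_fraction * len(seed_matches)
    return sorted(pair for pair, count in counts.items() if count >= needed)

# Three seed matches of one group; pairs are (model atom, target atom).
seed_matches = [
    [(1, 1), (2, 2)],
    [(1, 1), (3, 4)],
    [(1, 1), (2, 2), (5, 5)],
]
reliable = consensus_pairs(seed_matches, min_fraction=0.6)
print(reliable)
```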

48
Step 3 - Extending
  • Goal: extend the correspondence lists from step 2
    to contain additional matching pairs.
  • Remember: the transformation representing each
    group was computed by averaging the
    initial transformations.
  • How can we find more matches?
  • Compute again a transformation which gives the
    minimal least squares distance between the
    matched pairs.
  • The pairs that survive the second transformation
    are candidate additional matches.

49
Extending - intro.
  • Number of iterations to extend each seed match
    (small constant).
  • e - maximum allowed distance.
  • At iteration i we extend the match to contain
    pairs of atoms that lie at a maximum distance of

50
Extending - algorithm
  • For iteration i
  • Find the transformation of the current match
    using least squares procedure.
  • Transform the target according to this
    transformation.
  • Remove pairs from the current match that lie in a
    distance larger than
  • Extend the match by heuristic matching algorithm
    (given a threshold value).
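
The extension loop can be sketched in the 2-D case, where the least-squares rigid fit has a short closed form. Several choices here are assumptions rather than the lecture's method: a constant distance threshold `max_dist` is used in every iteration, the "heuristic matching algorithm" is replaced by a greedy nearest-neighbour pairing, and the sample points are made up.

```python
import math

def fit_rigid_2d(src, dst):
    """Least-squares rotation and translation mapping src points onto dst points (2-D)."""
    n = len(src)
    sx, sy = sum(p[0] for p in src) / n, sum(p[1] for p in src) / n
    dx, dy = sum(q[0] for q in dst) / n, sum(q[1] for q in dst) / n
    num = den = 0.0
    for (px, py), (qx, qy) in zip(src, dst):
        px, py, qx, qy = px - sx, py - sy, qx - dx, qy - dy
        num += px * qy - py * qx
        den += px * qx + py * qy
    theta = math.atan2(num, den)  # optimal rotation angle
    c, s = math.cos(theta), math.sin(theta)
    tx, ty = dx - (c * sx - s * sy), dy - (s * sx + c * sy)
    return lambda p: (c * p[0] - s * p[1] + tx, s * p[0] + c * p[1] + ty)

def extend_match(model, target, match, max_dist, iterations=3):
    """Refine and extend a list of (model index, target index) pairs."""
    for _ in range(iterations):
        if len(match) < 2:
            break  # too few pairs to determine a transformation
        transform = fit_rigid_2d([target[t] for _, t in match], [model[m] for m, _ in match])
        moved = [transform(p) for p in target]  # transform the target
        # Remove pairs that lie farther apart than the threshold.
        match = [(m, t) for m, t in match if math.dist(model[m], moved[t]) <= max_dist]
        used_m = {m for m, _ in match}
        used_t = {t for _, t in match}
        # Greedy extension: pair each free target atom with its nearest free model atom.
        for t, p in enumerate(moved):
            if t in used_t:
                continue
            candidates = [(math.dist(model[m], p), m) for m in range(len(model)) if m not in used_m]
            if candidates and min(candidates)[0] <= max_dist:
                m = min(candidates)[1]
                match.append((m, t))
                used_m.add(m)
                used_t.add(t)
    return sorted(match)

model = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 3.0), (5.0, 5.0)]
# First four model points rotated by 90 degrees and shifted, plus one outlier.
target = [(10.0, 0.0), (10.0, 2.0), (9.0, 2.0), (7.0, 0.0), (100.0, 100.0)]
extended = extend_match(model, target, [(0, 0), (1, 1)], max_dist=0.5)
print(extended)
```

Starting from a two-pair seed, the fit recovers the motion and the greedy step picks up the remaining two consistent pairs while leaving the outlier unmatched.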

51
Extending - cont.
  • After the iterations, repeat the first 3 steps
    to refine the last matching.
  • Complexity: same as the heuristic matching algorithm
  • ( or )
  • Output: the best extended matches.
  • A reminder: what is best?
  • Largest number of matching pairs
  • Minimal RMSD between the matching atoms.

52
Preprocessing Optimization
  • We can do better (complexity wise)!
  • Assumption: there is spatial proximity between
    the atoms of the relevant matching substructures.
  • Conclusion: the triangles we will consider are
    those composed of three atoms whose atom-to-atom
    distances are below a certain threshold.

53
Preprocessing Optimization - complexity discussion
  • Maximum allowed distance between the atoms of the
    reference set: r1 = 5 Å
  • Maximum allowed distance between a third point
    and the atoms of a reference set: r2 = 20 Å
  • Theoretically, the complexity is now
  • Practically,
  • Example: 138 residues
  • 13,359 triangles
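
The effect of the two distance thresholds on the number of triangles can be sketched as a filtered enumeration. The defaults follow the r1 and r2 values quoted above; everything else (function name, unordered reference pairs, sample coordinates) is an assumption, and the sketch is not meant to reproduce the 13,359 figure.

```python
import math
from itertools import combinations

def count_local_triangles(atoms, r1=5.0, r2=20.0):
    """Count triangles whose reference pair is within r1 and whose third atom is within r2 of both."""
    count = 0
    for a, c in combinations(range(len(atoms)), 2):  # candidate reference sets
        if math.dist(atoms[a], atoms[c]) > r1:
            continue  # reference atoms too far apart
        for b in range(len(atoms)):
            if b in (a, c):
                continue
            if math.dist(atoms[a], atoms[b]) <= r2 and math.dist(atoms[c], atoms[b]) <= r2:
                count += 1
    return count

# Three nearby atoms and one distant atom (illustrative coordinates in Å).
atoms = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (3.8, 3.8, 0.0), (40.0, 0.0, 0.0)]
local = count_local_triangles(atoms)
print(local)
```

This naive version still loops over all pairs; an implementation that actually reaches the reduced complexity would need a spatial index (for example a grid) so that only nearby atoms are visited.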

54
Implementation - Examples
  • http://bioinfo3d.cs.tau.ac.il/c_alpha_match/prog.html
  • 6LYZ vs. 2LZM
  • Result 1

55
Implementation - Examples
1pmy vs. 1pza
1pmy vs. 1aaj
56
Rasmol example
57
Results of the algorithm
  • 3-D comparison method that isn't constrained by
    the linear order of the amino acid chain.
  • Self comparison - outputs the best match besides
    the trivial one. This could not be obtained by a
    sequence-dependent method.
  • Successful on a wide range of protein comparison
    problems.

58
Method discussion - cont.
  • 2 factors in structural comparison (might be
    conflicting)
  • Sequential order conservation.
  • Geometric pattern conservation.
  • In most known methods, a strict constraint is
    placed on the search: sequential order
    conservation.
  • Much easier (structural alignment is NP-Hard).
  • Linear order conservation isn't necessarily
    undesirable
  • Comparing proteins whose evolutionary relatedness
    is certain
  • But it isn't necessarily desirable either
  • If the exact evolutionary relationship between
    the structures is unknown
  • Possible genetic mutations could have occurred
59
Method discussion
  • Sequence independent
  • Help find common 3-D folding units
  • Dealing with the question of convergence to a
    similar structure or divergence from a common
    ancestor.
  • Classical example: TIM barrel proteins.
  • Demonstrates that a strictly linear match is not
    the best geometrical match between
    two barrel structures.

60
Method summary
  • Based on the geometric hashing paradigm.
  • Pure 3-D approach (sequence-independent).
  • Neither a-priori knowledge of the motifs nor an
    initial alignment is required.
  • Not sensitive to insertions, deletions, gaps or
    displacements of equivalent substructures between
    the molecules being compared.
  • Efficient and fully automated.
  • Seconds for typical pairwise comparisons.
  • Successful on a wide range of protein comparison
    problems.

61
Method summary - cont.
  • In most of the examples, the best match
    corresponds to a linear alignment match.
  • Provides a way to compare proteins without the
    bias of other methods (sequence dependent).
  • Capable of discovering partial structural
    similarities.
  • Sole criterion: geometry!
  • Complexity: O(n^3)

62
Extensions and additional features - a look ahead
  • The method can be extended to allow simultaneous
    and efficient comparison of a target structure
    with a database of many model structures.
  • Protein and amino acid properties can be
    exploited in the definition of the reference
    frame and thus taken into consideration in the
    algorithm.
  • Different choices of interest points.
  • Strategies to reduce the number of triangles.
  • Assigning weights to the matches according to
    certain factors (recognition phase change).
  • Extending and adapting the technique to be used
    in the docking problem.

63
Lecture summary
  • 3D observation of proteins.
  • Why structural alignment?
  • Studies of catalogued motifs can aid in
    understanding the evolutionary relationship
    between the proteins.
  • The method presented allows addressing the
    question of the number of protein structural
    classes found in nature.
  • In particular, the availability of such a library
    is expected to aid in the investigation of the
    protein folding problem.
  • Sequence alignment vs. structure alignment.
  • Geometric hashing and its use in the algorithm.
  • The algorithm and its implementation.
  • Extensions and additional features - a look ahead.

64
  • Questions?

65
That's it