Title: Bioinformatics Algorithms and Data Structures
1 Bioinformatics Algorithms and Data Structures
- Chapter 14.1-5: Multiple String Comparisons
- Lecturer: Dr. Rose
- Slides by: Dr. Rose
- March 1, 2007
2 Multiple String Comparisons
- Q: Why are we interested in multiple string comparisons?
- A: At one level we are data-mining.
  - Looking for similarities
  - Common evolution
  - Common functionality
- Significance of similarity may not be clear with only two strings.
- Multiple string comparison is accomplished by multiple alignment.
3 Multiple String Comparisons
- Defn. A global multiple alignment of k > 2 strings is:
  - A generalization of the alignment of 2 strings
  - Strings S1, S2, ..., Sk are inflated with spaces to achieve strings S1', S2', ..., Sk' with uniform length l.
  - The strings are arrayed in k rows of l columns.
4 Example
    AGT..CTT.ACGCG
    AGTAGCTT...GCG
    ..TAGC.T..GGCG
    .CTA.C.TAACCCG
    ACTA...TAAC...
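The definition can be checked mechanically on this example. The sketch below is illustrative only: it assumes the "." characters are the inserted spaces and that the example consists of five rows of 14 columns; `describe_alignment` is a hypothetical helper name, not something from the text.

```python
GAP = "."  # assumption: "." in the example stands for an inserted space


def describe_alignment(rows, gap=GAP):
    """Return (k, l, originals) for a global multiple alignment given as rows."""
    lengths = {len(row) for row in rows}
    if len(lengths) != 1:
        # The definition requires every inflated string to have the same length l.
        raise ValueError("all rows of a global multiple alignment must have the same length")
    # Removing the spaces undoes the inflation and recovers S1, ..., Sk.
    originals = [row.replace(gap, "") for row in rows]
    return len(rows), lengths.pop(), originals


ROWS = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]

k, l, originals = describe_alignment(ROWS)
print(k, "rows of", l, "columns")
print(originals)
```

The helper rejects ragged input because uniform length is the one structural requirement the definition imposes.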
5 Example
6 Multiple String Comparisons
- Consider the relation between two-string comparison and biological function:
- Two-string alignments are used to find unsuspected biological relationships from apparent string similarity.
- This follows from the first fact of biological sequence comparison: sequence similarity implies functional or structural similarity.
7 Multiple String Comparisons
- Consider the relation between multiple string comparison and biological function:
- Multiple string alignments are used to find unknown string similarities from known biological relationships.
- This isn't as obvious, since there is a tendency to focus on one-dimensional sequences and not on the corresponding three-dimensional structures or two-dimensional substructures.
8 Multiple String Comparisons
- This follows from the second fact of biological sequences:
- Strings that are functionally related can appear very different and yet preserve the same important three-dimensional and two-dimensional features.
- There are several levels of abstraction entailed:
  - Three-dimensional structure
  - Functionality
  - Amino-acid sequence
9 Multiple String Comparisons
- These different levels of abstraction are preserved/conserved to different degrees:
  - Three-dimensional structure is most preserved
  - Functionality is somewhat conserved
  - Amino-acid sequence is less likely to be conserved
- Q: What point are we trying to make?
- A: The significance is that similarity of structure may not be blatantly apparent at the sequence level.
- ⇒ Comparison of multiple sequences highlights less apparent similarity.
10 Multiple String Comparisons
- Example from text: Hemoglobin
- 4 chains of 140 amino acids apiece
- Found in insects to mammals
- Insects and vertebrates diverged 600 million years ago
- Large number of amino acid mutations (100) per chain between the two sequences (insect & vertebrate)
11 Multiple String Comparisons
- Comparison of two mammalian hemoglobin sequences:
  - Exhibits high amino-acid similarity (our cousin the chimpanzee shares the identical sequence)
  - Suggests similar functionality
- Comparison of mammalian and insect hemoglobin sequences:
  - Exhibits little amino-acid similarity
  - However, has similar functionality
12 Multiple String Comparisons
- The important point is that while
  - sequence similarity ⇒ functional & structural similarity
- The converse
  - functional & structural similarity ⇒ sequence similarity
- is not true, i.e.,
  - functional & structural similarity ⇏ sequence similarity
13 Family & Superfamily Representation
- Data Mining Problem:
- Given a set of biologically similar strings, find the commonalities that characterize the family.
- Why would we want to do this?
  - Conserved features may explain function & structure.
  - Characterization of the family may make it easy to recognize new members.
  - Characterization may also make it easier to exclude nonmembers.
14 Family & Superfamily Representation
- Example: protein families
- The similarity may be functionality or
- Two- or three-dimensional structure
- Specific examples:
  - globins (hemoglobins, myoglobins)
  - immunoglobulin (antibody) proteins
15 Family & Superfamily Representation
- Q: Why would we be interested in identifying the family to which a protein belongs?
- A: Family membership immediately clues us in on
  - Physical structure
  - Biological functionality
- Text suggests there are 100,000 proteins in humans but only 1,000 or fewer protein families
16 Family & Superfamily Representation
- Q: If we suspect that a new protein belongs to some family, how do we check?
  - Align the new protein sequence with a representative member of the family?
  - Align the new protein sequence with several representative members of the family?
  - Align the new protein sequence with a generalization of members of the family?
- A: Align the new protein sequence with a generalization of members of the family.
17 Family & Superfamily Representation
- Q: What is the representation of the generalization of members of the family?
- Consider:
  - We want to match family members while
  - Excluding non-family members
- This is an established area in machine learning.
- In general, the key is that the representation language must be sufficiently expressive to distinguish between +/- examples.
- Conjecture: amino acid strings lack sufficient expressiveness
18 Family & Superfamily Representation
- Three representations in common use:
  - Profile (based on multiple alignment)
  - Consensus sequence (based on multiple alignment)
  - Signature (some based on multiple alignment, some not)
19 Profile Representation
- Defn. A profile (aka weight matrix) for a multiple alignment specifies the frequency of each character in each column.
- Consider the following multiple alignment:
    a b c - a
    a b a b a
    a c c b -
    c b - b c
- The corresponding extracted profile:
         C1    C2    C3    C4    C5
    a   .75         .25         .50
    b         .75         .75
    c   .25   .25   .50         .25
    -               .25   .25   .25
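Extracting a profile is just column-wise counting. The sketch below assumes the four aligned rows are abc-a, ababa, accb-, and cb-bc (spaces written as "-"), which reproduces the frequencies listed above; `build_profile` is a hypothetical helper name.

```python
from collections import Counter


def build_profile(rows):
    """Return one {character: frequency} dict per column of the alignment."""
    if len({len(row) for row in rows}) != 1:
        raise ValueError("aligned rows must all have the same length")
    k = len(rows)
    profile = []
    for column in zip(*rows):          # iterate over the columns of the alignment
        counts = Counter(column)       # the space character "-" is counted like any other
        profile.append({ch: n / k for ch, n in counts.items()})
    return profile


ALIGNMENT = ["abc-a", "ababa", "accb-", "cb-bc"]
PROFILE = build_profile(ALIGNMENT)

for j, column in enumerate(PROFILE, start=1):
    print(f"C{j}", dict(sorted(column.items())))
```

Characters that never occur in a column are simply absent from that column's dict, which corresponds to the blank cells of the table.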
20 Profile Representation
- Log-odds ratios: profile entries are sometimes expressed in this form.
- Let p(y, j) denote the frequency of the occurrence of character y in column j.
- Let p(y) denote the frequency of the occurrence of character y anywhere in the multiply aligned sequences.
- log [p(y, j) / p(y)] is the log-odds ratio for cell (y, j) of the profile (weight matrix).
21 Profile Representation
- Alignment of string S with profile P:
- Insertion of spaces into S is allowed
- Use regular string alignment?
- Let C be a string of profile column positions
- Align S by inserting spaces into S and C.
22 Profile Representation
- Example: S = aabbc, P is the profile from the previous slide:
         C1    C2    C3    C4    C5
    a   .75         .25         .50
    b         .75         .75
    c   .25   .25   .50         .25
    -               .25   .25   .25
- Alignment of S and C:
    S: a a b - b c
    C: 1 - 2 3 4 5
- Q: How do we score such an alignment?
23 Profile Representation
- Q: How do we score profile alignments?
- Assume we have an alphabet-weight scoring scheme, e.g.,
         a    b    c    -
    a    2    1   -3   -1
    b    1    2   -1   -1
    c   -3   -1    2   -1
    -   -1   -1   -1    0
- Column score: compute the weighted sum of scores based on the frequency of characters in the column.
- Alignment score: sum the column scores.
24 Profile Representation
- Alphabet-weight scoring scheme:
         a    b    c    -
    a    2    1   -3   -1
    b    1    2   -1   -1
    c   -3   -1    2   -1
    -   -1   -1   -1    0
- Profile:
         C1    C2    C3    C4    C5
    a   .75         .25         .50
    b         .75         .75
    c   .25   .25   .50         .25
    -               .25   .25   .25
- Compute the weighted sum of scores based on the frequency of characters in the column.
    S: a a b - b c
    C: 1 - 2 3 4 5
- Column 1 (a with C1): 0.75(2) + 0.25(-3) = 0.75
- Column 2 (b with C2): 0.75(2) + 0.25(-1) = 1.25
- Column 3 (- with C3): 0.25(0) + 0.50(-1) + 0.25(-1) = -0.75
- Column 4 (b with C4): 0.75(2) + 0.25(-1) = 1.25
- Column 5 (c with C5): 0.50(-3) + 0.25(2) + 0.25(-1) = -1.25
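The column sums above can be reproduced with a short sketch. Two assumptions are made here and are not stated on the slides: the scoring table is taken to be symmetric, with s(a, c) = -3, s(b, c) = -1, and every character-space pair scoring -1, as the worked sums imply; and the second character of S, which sits opposite a space in C, is scored as s(a, -). The names `column_score` and `alignment_score` are illustrative.

```python
ALPHABET = "abc-"
_TABLE = [
    [2, 1, -3, -1],
    [1, 2, -1, -1],
    [-3, -1, 2, -1],
    [-1, -1, -1, 0],
]
# SCORES[x][y] plays the role of s(x, y); symmetric table assumed.
SCORES = {x: dict(zip(ALPHABET, row)) for x, row in zip(ALPHABET, _TABLE)}

# PROFILE[j - 1][y] plays the role of p(y, j); absent characters have frequency 0.
PROFILE = [
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]


def column_score(x, column, scores=SCORES):
    """Weighted sum of s(x, y) over the characters y of one profile column."""
    return sum(scores[x][y] * freq for y, freq in column.items())


def alignment_score(s_row, c_row, profile=PROFILE, scores=SCORES):
    """Sum the scores of an alignment of S with C (None marks a space in C)."""
    total = 0.0
    for x, j in zip(s_row, c_row):
        if j is None:
            total += scores[x]["-"]              # character of S opposite a space in C
        else:
            total += column_score(x, profile[j - 1], scores)
    return total


S_ROW = "aab-bc"
C_ROW = [1, None, 2, 3, 4, 5]
COLUMN_SCORES = [column_score(x, PROFILE[j - 1])
                 for x, j in zip(S_ROW, C_ROW) if j is not None]
TOTAL = alignment_score(S_ROW, C_ROW)
print(COLUMN_SCORES, TOTAL)
```

Under these assumptions the alignment total also includes the -1 for the a that is aligned with a space in C, a term the five column sums do not cover.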
25 Profile Representation
- Q: How do we find optimal alignments?
- A: Use dynamic programming to maximize similarity.
- As before:
  - s(x, y) denotes the alphabet-weight assignment for aligning x & y.
  - p(y, j) denotes the frequency of letter y in column j.
- Then let $S(x, j) = \sum_{y} s(x, y)\, p(y, j)$ denote the score for aligning x with column j.
26 Profile Representation
- Defn. Let V(i, j) denote the value of the optimal alignment of S[1..i] with the first j columns of C.
- Then $V(0, j) = \sum_{k \le j} S(\_, k)$
- And $V(i, 0) = \sum_{k \le i} s(S_1(k), \_)$
- Here $S_1(k)$ denotes the kth character of the first string argument, i.e., S(k).
27 Profile Representation
- The general recurrence is then
  $$V(i, j) = \max \begin{cases}
  V(i-1, j-1) + S(S_1(i), j) & \text{match the $i$th and $j$th letters} \\
  V(i-1, j) + s(S_1(i), \_) & \text{insert a gap in the profile} \\
  V(i, j-1) + S(\_, j) & \text{insert a gap in $S_1$}
  \end{cases}$$
- Q: What is the time complexity for solving this recurrence using DP?
28 Profile Representation
- Clearly the time complexity is O(σmn) for DP
- Where
  - n is the length of the string S,
  - m is the length of the profile, and
  - σ is the size of the alphabet.
- O(σmn) is more costly than sequence-to-sequence alignment. (Do you recall what that cost was?)
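The recurrence and its cost can be sketched directly. This is an illustration under assumptions, not the text's own code: the scoring table is the symmetric one implied by the worked column sums, the profile is the five-column example, and `align_to_profile` is a hypothetical name. Each of the (n + 1)(m + 1) cells evaluates column scores, and each column score sums over at most σ characters, which is where the σ factor comes from.

```python
ALPHABET = "abc-"
_TABLE = [
    [2, 1, -3, -1],
    [1, 2, -1, -1],
    [-3, -1, 2, -1],
    [-1, -1, -1, 0],
]
SCORES = {x: dict(zip(ALPHABET, row)) for x, row in zip(ALPHABET, _TABLE)}  # s(x, y), assumed symmetric

PROFILE = [  # p(y, j) for the five-column example profile
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]


def align_to_profile(s, profile, scores, gap="-"):
    """Return V(n, m), the value of an optimal alignment of s with the profile."""
    n, m = len(s), len(profile)

    def col(x, j):
        # S(x, j): weighted score of x against column j (1-based); O(sigma) work.
        return sum(scores[x][y] * freq for y, freq in profile[j - 1].items())

    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):                    # V(0, j): spaces opposite columns 1..j
        V[0][j] = V[0][j - 1] + col(gap, j)
    for i in range(1, n + 1):                    # V(i, 0): characters of s opposite spaces
        V[i][0] = V[i - 1][0] + scores[s[i - 1]][gap]
    for i in range(1, n + 1):
        x = s[i - 1]
        for j in range(1, m + 1):
            V[i][j] = max(
                V[i - 1][j - 1] + col(x, j),     # match the ith letter with column j
                V[i - 1][j] + scores[x][gap],    # insert a gap in the profile
                V[i][j - 1] + col(gap, j),       # insert a gap in s
            )
    return V[n][m]


BEST = align_to_profile("aabbc", PROFILE, SCORES)
print(BEST)
```

Precomputing S(x, j) for every character and column would move the σ factor out of the inner loop; the sketch keeps it inline so the cost reads off the code.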
29 Signature Representation
- This representation is used by protein databases such as
  - PROSITE
  - BLOCKS
- The core idea is that families of proteins are characterized by motifs or sequence signatures.
- Q: What is a motif?
- A: (Webster) A usu. repeating salient thematic element
30 Signature Representation
- Example from text:
    [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
- Where:
  - A bracket indicates alternative amino acids
  - & denotes any one of I, L, V, M, F, Y, W
  - x denotes any amino acid.
  - The subscript denotes the length of the string; n denotes an arbitrary length.
31 Signature Representation
- Example from text:
    [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
- Observations:
  - The representation is a generalization
  - The generalization is a regular expression
32 Signature Representation
- Signature:
    [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
- Matches:
  - HADDITIIIIQGIIIIIIIA
  - IADDITIIIIQGIIIIIIIA
  - LADDITIIIIQGIIIIIIIA
  - VADDITIIIIQGIIIIIIIA
  - MADDITIIIIQGIIIIIIIA
33 Signature Representation
- Regular expression representation:
  - use regular expression pattern matching.
  - no need to worry about mismatches/errors.
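As an illustration of matching by regular expression, the sketch below translates the example signature into Python's `re` syntax. It rests on assumptions: the signature is read as [&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A] with & standing for one of I, L, V, M, F, Y, W; x is restricted to the 20 standard amino-acid letters; and the arbitrary length n is allowed to be zero. `has_signature` is a hypothetical helper name.

```python
import re

AMP = "ILVMFYW"                      # the group abbreviated by "&" (assumed reading)
ANY = "[ACDEFGHIKLMNPQRSTVWY]"       # x: any one of the 20 standard amino acids

SIGNATURE = re.compile(
    f"[{AMP}H][{AMP}A]D[DE]"         # [&H][&A]D[DE]
    f"{ANY}*"                        # x_n, arbitrary length (n >= 0 assumed)
    f"[TSN]{ANY}{{4}}"               # [TSN] x_4
    f"[QK]G{ANY}{{7}}"               # [QK] G x_7
    f"[{AMP}A]"                      # [&A]
)


def has_signature(protein):
    """True if the signature occurs anywhere in the protein string."""
    return SIGNATURE.search(protein) is not None


MATCHES = [
    "HADDITIIIIQGIIIIIIIA",
    "IADDITIIIIQGIIIIIIIA",
    "LADDITIIIIQGIIIIIIIA",
    "VADDITIIIIQGIIIIIIIA",
    "MADDITIIIIQGIIIIIIIA",
]
print([has_signature(p) for p in MATCHES])
```

`search` is used rather than `fullmatch` on the assumption that a signature may sit anywhere inside a longer protein sequence.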
34 Computing Multiple Alignments
- Recall: two-string local alignment was defined in terms of global alignment of substrings.
- We take the same approach for multiple string local alignment.
- Defn. A local multiple alignment of a set S of strings is obtained by selecting one substring Si' from each Si ∈ S and then globally aligning these substrings.
35 Computing Multiple Alignments
- Q: Global vs. local alignment: which should we prefer?
- Wait for someone to respond!
- Gusfield notes that for
  - Pairs of sequences and
  - Multiple sequences
- there are biological justifications for preferring local over global alignment of multiple sequences.
- But.
36 Computing Multiple Alignments
- But.
- The best (computer science) theoretical results are for global alignment.
- Like the joke about the lost wallet, Gusfield chooses to emphasize global alignment.
37 Computing Multiple Alignments
- Q: How can we generalize the concept of score to multiple alignments?
- IOW, what objective function should we use?
- We will consider three types of objective functions:
  - Sum-of-pairs
  - Consensus
  - Tree
38 Computing Multiple Alignments
- First we define the concepts of induced pairwise alignment and its corresponding score.
- Defn. The induced pairwise alignment of strings Si and Sj is obtained from the global multiple alignment M by removing all other rows.
- Note: instances of matching spaces can be removed from the induced alignment.
- Note: to score an induced pairwise alignment, any two-string alignment scoring scheme can be used.
39 Computing Multiple Alignments
- Consider the following pairwise scoring scheme:
  - score = #mismatches + #spaces
- In the following example:
    1  A A T - G G T T T
    2  A A - C G T T A T
    3  T A T C G - A A T
- score(1,2) = 4
- score(1,3) = 5
- score(2,3) = 4
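The three induced scores can be checked with a short sketch. It assumes the third row is T A T C G - A A T and that a column holding a space in both rows is dropped from the induced alignment and contributes nothing; `induced_pair_score` is a hypothetical helper name. Summing the three values gives the quantity a sum-of-pairs objective would assign to this alignment.

```python
from itertools import combinations


def induced_pair_score(row_i, row_j, gap="-"):
    """Score = #mismatches + #spaces in the pairwise alignment induced by two rows."""
    score = 0
    for a, b in zip(row_i, row_j):
        if a == gap and b == gap:
            continue            # matching spaces are removed from the induced alignment
        if a != b:
            score += 1          # a mismatch, or a space opposite a character
    return score


M = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]

PAIR_SCORES = {
    (i + 1, j + 1): induced_pair_score(M[i], M[j])
    for i, j in combinations(range(len(M)), 2)
}
SP_SCORE = sum(PAIR_SCORES.values())
print(PAIR_SCORES, SP_SCORE)
```

Any other two-string scoring scheme could be substituted for the counting rule inside `induced_pair_score` without changing the surrounding code.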