Bioinformatics Algorithms and Data Structures

Slides: 40

Provided by: john244

Learn more at: https://www.cse.sc.edu

1
Bioinformatics Algorithms and Data Structures

Chapter 14.1-5 Multiple String Comparisons
Lecturer Dr. Rose
Slides by Dr. Rose
March 1, 2007

2
Multiple String Comparisons

Q: Why are we interested in multiple string
comparisons?
A: At one level we are data mining.
Looking for similarities
Common evolution
Common functionality
Significance of similarity may not be clear with
only two strings.
Multiple string comparison is accomplished by
multiple alignment.

3
Multiple String Comparisons

Defn. A global multiple alignment of k > 2 strings
is
a generalization of the alignment of 2 strings.
Strings S1, S2, ..., Sk are inflated with spaces to
achieve strings S1', S2', ..., Sk' with uniform
length l.
Strings are arrayed in k rows of l columns.

4
Example
AGT..CTT.ACGCG
AGTAGCTT...GCG
..TAGC.T..GGCG
.CTA.C.TAACCCG
ACTA...TAAC...

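The definition can be checked mechanically. The sketch below assumes the example is read as five rows of 14 columns, with '.' standing for a space; the function names are illustrative and not from the slides.

```python
GAP = "."  # assumption: the example writes a space as '.'

# The example alignment, read as k = 5 rows of l = 14 columns.
rows = [
    "AGT..CTT.ACGCG",
    "AGTAGCTT...GCG",
    "..TAGC.T..GGCG",
    ".CTA.C.TAACCCG",
    "ACTA...TAAC...",
]

def is_multiple_alignment(rows):
    """True when all inflated strings share one uniform length l."""
    return len({len(r) for r in rows}) == 1

def original_strings(rows, gap=GAP):
    """Remove the inserted spaces to recover the unaligned strings."""
    return [r.replace(gap, "") for r in rows]

def columns(rows):
    """Return the alignment column by column."""
    return ["".join(col) for col in zip(*rows)]

print(is_multiple_alignment(rows))
print(original_strings(rows))
print(columns(rows)[:3])
```

Removing the spaces from each row gives back the input strings, which is what "inflated with spaces" means in the definition.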
5
Example
6
Multiple String Comparisons

Consider the relation between two-string
comparison and biological function:
Two-string alignments are used to find
unsuspected biological relationships from apparent
string similarity.
This follows from the first fact of biological
sequence comparison: sequence similarity implies
functional or structural similarity.

7
Multiple String Comparisons

Consider the relation between multiple string
comparison and biological function:
Multiple string alignments are used to find
unknown string similarities from known biological
relationships.
This isn't as obvious, since there is a tendency
to focus on one-dimensional sequences and not the
corresponding three-dimensional structures or
two-dimensional substructures.

8
Multiple String Comparisons

This follows from the second fact of biological
sequences:
Strings that are functionally related can appear
very different and yet preserve the same
important three-dimensional and two-dimensional
features.
There are several levels of abstraction entailed:
Three-dimensional structure
Functionality
Amino-acid sequence

9
Multiple String Comparisons

These different levels of abstraction are
preserved/conserved to different degrees
Three-dimensional structure is most preserved
Functionality is somewhat conserved
Amino-acid sequence less likely to be conserved
Q: What point are we trying to make?
A: The significance is that similarity of
structure may not be blatantly apparent at the
sequence level.
⇒ Comparison of multiple sequences highlights
less apparent similarity.

10
Multiple String Comparisons

Example from text: Hemoglobin
4 chains of 140 amino acids apiece
Found in organisms from insects to mammals
Insects and vertebrates diverged 600 million
years BP
Large number of amino acid mutations (100) per
chain between the two sequences (insect and vertebrate)

11
Multiple String Comparisons

Comparison of two mammalian hemoglobin sequences
Exhibit high amino-acid similarity (Our cousin
the chimpanzee shares the identical sequence)
Suggest similar functionality
Comparison of mammalian and insect hemoglobin
sequences
Exhibits little amino-acid similarity
However, has similar functionality

12
Multiple String Comparisons

The important point is that while
sequence similarity ⇒ functional and structural
similarity,
the converse,
functional and structural similarity ⇒ sequence
similarity,
is not true, i.e.,
functional and structural similarity ⇏ sequence
similarity.

13
Family & Superfamily Representation

Data Mining Problem
Given a set of biologically similar strings,
find the commonalities that characterize the
family.
Why would we want to do this?
Conserved features may explain function and
structure.
Characterization of the family may make it easy
to recognize new members.
Characterization may also make it easier to
exclude nonmembers.

14
Family & Superfamily Representation

Example: protein families
The similarity may be in functionality or in
two- or three-dimensional structure
Specific Examples
globins (hemoglobins, myoglobins)
immunoglobulin (antibody) proteins

15
Family & Superfamily Representation

Q: Why would we be interested in identifying the
family to which a protein belongs?
A: Family membership immediately clues us in on
Physical structure
Biological functionality
The text suggests there are 100,000 proteins in
humans but only 1000 or fewer protein families

16
Family & Superfamily Representation

Q: If we suspect that a new protein belongs to
some family, how do we check?
Align the new protein sequence with a
representative member of the family?
Align the new protein sequence with several
representative members of the family?
Align the new protein sequence with a
generalization of members of the family?
A: Align the new protein sequence with a
generalization of members of the family.

17
Family & Superfamily Representation

Q: What is the representation of the
generalization of members of the family?
Consider:
We want to match family members while
excluding non-family members.
This is an established area in machine learning.
In general, the key is that the representation
language must be sufficiently expressive to
distinguish between + and - examples.
Conjecture: amino acid strings lack sufficient
expressiveness

18
Family & Superfamily Representation

Three common currently used representations:
Profile (based on multiple alignment)
Consensus sequence (based on multiple alignment)
Signature (some based on multiple alignment, some
not)

19
Profile Representation

Defn. A profile (aka weight matrix) for a multiple
alignment specifies the frequency of each
character in each column.
Consider the following multiple alignment:
a b c - a
a b a b a
a c c b -
c b - b c
The corresponding extracted profile:
     C1   C2   C3   C4   C5
a   .75       .25       .50
b        .75       .75
c   .25  .25  .50       .25
-             .25  .25  .25
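A minimal sketch of extracting a profile from an alignment. It assumes the gap placement implied by the profile (rows abc-a, ababa, accb-, cb-bc); the function name `profile` is illustrative.

```python
from collections import Counter

def profile(rows):
    """For each column, map every character seen there to its frequency."""
    k = len(rows)
    return [
        {ch: count / k for ch, count in Counter(col).items()}
        for col in zip(*rows)
    ]

# Assumed reading of the example alignment, '-' marking a space.
alignment = ["abc-a", "ababa", "accb-", "cb-bc"]
P = profile(alignment)

for j, col in enumerate(P, start=1):
    print(f"C{j}", dict(sorted(col.items())))
```

Characters that never occur in a column are simply absent from that column's dictionary, which corresponds to the blank cells of the table.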

20
Profile Representation

Log-odds ratios: profile entries are sometimes
expressed in this form.
Let p(y, j) denote the frequency of the
occurrence of character y in column j.
Let p(y) denote the frequency of the occurrence
of character y anywhere in the multiply aligned
sequences.
log( p(y, j) / p(y) ) is the log-odds ratio for cell
(y, j) of the profile (weight matrix).

21
Profile Representation

Alignment of string S with profile P
Insertion of spaces into S is allowed
Use regular string alignment?
Let C be a string of profile column positions
Align S by inserting spaces into S and C.

22
Profile Representation

Example: S = aabbc, P is the profile from the
earlier example
     C1   C2   C3   C4   C5
a   .75       .25       .50
b        .75       .75
c   .25  .25  .50       .25
-             .25  .25  .25
Alignment of S and C:
S  a a b - b c
C  1 - 2 3 4 5
Q: How do we score such an alignment?

23
Profile Representation

Q: How do we score profile alignments?
Assume we have an alphabet-weight scoring scheme,
e.g.,
     a   b   c   -
a    2   1  -3  -1
b    1   2  -1  -1
c   -3  -1   2  -1
-   -1  -1  -1   0
Column score: compute the weighted sum of scores
based on the frequency of characters in the
column.
Alignment score: sum the column scores.

24
Profile Representation

Alphabet-weight scoring scheme:
     a   b   c   -
a    2   1  -3  -1
b    1   2  -1  -1
c   -3  -1   2  -1
-   -1  -1  -1   0
Profile:
     C1   C2   C3   C4   C5
a   .75       .25       .50
b        .75       .75
c   .25  .25  .50       .25
-             .25  .25  .25
Compute the weighted sum of scores based on the
frequency of characters in the column.
S  a a b - b c
C  1 - 2 3 4 5
Column 1: 0.75(2) + 0.25(-3) = 0.75
Column 2: 0.75(2) + 0.25(-1) = 1.25
Column 3: 0.25(0) + 0.50(-1) + 0.25(-1) = -0.75
Column 4: 0.75(2) + 0.25(-1) = 1.25
Column 5: 0.50(-3) + 0.25(2) + 0.25(-1) = -1.25
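The column sums above can be reproduced with a short script. This is a sketch under stated assumptions: the scoring scheme is read as symmetric with the signs used in the worked column sums (s(a, b) = 1 is kept as printed), and the second 'a' of S, which sits opposite a space in C, is scored as s(a, -). All names are illustrative.

```python
ALPHABET = "abc-"
SCORE_ROWS = {            # assumed symmetric reading of the scoring scheme
    "a": [2, 1, -3, -1],
    "b": [1, 2, -1, -1],
    "c": [-3, -1, 2, -1],
    "-": [-1, -1, -1, 0],
}
s = {(x, y): SCORE_ROWS[x][j] for x in ALPHABET for j, y in enumerate(ALPHABET)}

profile = [                # columns C1..C5 as {character: frequency}
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]

def column_score(x, col):
    """Weighted sum of s(x, y) over the characters y of one profile column."""
    return sum(s[x, y] * p for y, p in col.items())

# The alignment of S and C: (character of S, profile column index or None).
aligned = [("a", 0), ("a", None), ("b", 1), ("-", 2), ("b", 3), ("c", 4)]

column_scores = [column_score(x, profile[j]) for x, j in aligned if j is not None]
gap_scores = [s[x, "-"] for x, j in aligned if j is None]
total = sum(column_scores) + sum(gap_scores)

print(column_scores)  # [0.75, 1.25, -0.75, 1.25, -1.25]
print(total)          # 0.25
```

The five column scores sum to 1.25; including the -1 for the character of S placed opposite a space in C gives 0.25 for the whole alignment under these assumptions.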

25
Profile Representation

Q: How do we find optimal alignments?
A: Use dynamic programming to maximize
similarity.
As before:
s(x, y) denotes the alphabet-weight assignment
for aligning x with y.
p(y, j) denotes the frequency of letter y in
column j.
Then let S(x, j) denote Σ_y s(x, y) × p(y, j),
the score for aligning x with column j.

26
Profile Representation

Defn. Let V(i, j) denote the value of the optimal
alignment of S[1..i] with the first j columns of
C.
Then V(0, j) = Σ_{k≤j} S(_, k)
And V(i, 0) = Σ_{k≤i} s(S1(k), _)
Here S1(k) denotes the kth character of the first
string argument, i.e., the kth character of S.

27
Profile Representation

The general recurrence is then
V(i, j) = max of
V(i - 1, j - 1) + S(S1(i), j)   (match the ith letter
with the jth column)
V(i - 1, j) + s(S1(i), _)       (insert a gap in
the profile)
V(i, j - 1) + S(_, j)           (insert a gap in S1)
Q: What is the time complexity for solving this
recurrence using DP?

28
Profile Representation

Clearly the time complexity is O(σmn) for DP
where
n is the length of the string S,
m is the length of the profile, and
σ is the size of the alphabet.
O(σmn) is more costly than sequence-to-sequence
alignment. (Do you recall what that cost was?)
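The recurrence can be sketched as a table-filling routine. The scoring scheme and profile below are assumed readings of the running example (symmetric scores, '-' for a space), and the function and variable names are illustrative. Each of the (n + 1)(m + 1) cells evaluates a column score in time proportional to the alphabet size, which is where the σmn bound comes from.

```python
ALPHABET = "abc-"
SCORE_ROWS = {            # assumed symmetric reading of the scoring scheme
    "a": [2, 1, -3, -1],
    "b": [1, 2, -1, -1],
    "c": [-3, -1, 2, -1],
    "-": [-1, -1, -1, 0],
}
s = {(x, y): SCORE_ROWS[x][j] for x in ALPHABET for j, y in enumerate(ALPHABET)}

P = [                      # profile columns C1..C5
    {"a": 0.75, "c": 0.25},
    {"b": 0.75, "c": 0.25},
    {"a": 0.25, "c": 0.50, "-": 0.25},
    {"b": 0.75, "-": 0.25},
    {"a": 0.50, "c": 0.25, "-": 0.25},
]

def best_profile_alignment(S, profile, s, gap="-"):
    """Value V(n, m) of an optimal alignment of string S with the profile."""
    n, m = len(S), len(profile)

    def col_score(x, j):
        # S(x, j): weighted score of x against profile column j (0-based).
        return sum(s[x, y] * p for y, p in profile[j].items())

    V = [[0.0] * (m + 1) for _ in range(n + 1)]
    for j in range(1, m + 1):            # V(0, j): only spaces from S
        V[0][j] = V[0][j - 1] + col_score(gap, j - 1)
    for i in range(1, n + 1):            # V(i, 0): only spaces from the profile
        V[i][0] = V[i - 1][0] + s[S[i - 1], gap]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            x = S[i - 1]
            V[i][j] = max(
                V[i - 1][j - 1] + col_score(x, j - 1),  # letter i with column j
                V[i - 1][j] + s[x, gap],                # gap in the profile
                V[i][j - 1] + col_score(gap, j - 1),    # gap in S
            )
    return V[n][m]

print(best_profile_alignment("aabbc", P, s))
```

Only the optimal value is returned here; recovering the alignment itself would need the usual traceback through V.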

29
Signature Representation

This representation is used by protein databases
such as
PROSITE
BLOCKS
The core idea is that families of proteins are
characterized by motifs or sequence signatures.
Q: What is a motif?
A: (Webster) A usu. repeating salient thematic
element

30
Signature Representation

Example from text:
[&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
where
A bracket indicates alternative amino acids.
& denotes any amino acid from the group
[I, L, V, M, F, Y, W].
x denotes any amino acid.
A subscript denotes the length of the string of x's; n
denotes an arbitrary length.

31
Signature Representation

Example from text
[&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
Observations
The representation is a generalization
The generalization is a regular expression

32
Signature Representation

Signature
[&H][&A]D[DE]x_n[TSN]x_4[QK]Gx_7[&A]
Matches
HADDITIIIIQGIIIIIIIA
IADDITIIIIQGIIIIIIIA
LADDITIIIIQGIIIIIIIA
VADDITIIIIQGIIIIIIIA
MADDITIIIIQGIIIIIIIA
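Because the signature is a regular expression, the matches above can be checked with an ordinary regex engine. The translation below is a sketch under assumptions: the signature is read with brackets for alternatives and '&' for the group I, L, V, M, F, Y, W; "any amino acid" is approximated by any uppercase letter; and x_n is allowed to be empty.

```python
import re

GROUP = "ILVMFYW"   # assumption: the amino acids that '&' stands for
AMINO = "[A-Z]"     # assumption: 'x' approximated as any uppercase letter

pattern = (
    f"[{GROUP}H]"     # [&H]
    f"[{GROUP}A]"     # [&A]
    "D"               # D
    "[DE]"            # [DE]
    f"{AMINO}*"       # x_n, arbitrary length
    "[TSN]"           # [TSN]
    f"{AMINO}{{4}}"   # x_4
    "[QK]"            # [QK]
    "G"               # G
    f"{AMINO}{{7}}"   # x_7
    f"[{GROUP}A]"     # [&A]
)
SIGNATURE = re.compile(pattern)

candidates = [
    "HADDITIIIIQGIIIIIIIA",
    "IADDITIIIIQGIIIIIIIA",
    "LADDITIIIIQGIIIIIIIA",
    "VADDITIIIIQGIIIIIIIA",
    "MADDITIIIIQGIIIIIIIA",
    "KADDITIIIIQGIIIIIIIA",  # hypothetical non-member: K is not allowed first
]
for seq in candidates:
    print(seq, bool(SIGNATURE.fullmatch(seq)))
```

`fullmatch` requires the whole string to fit the signature; searching for the motif inside a longer protein sequence would use `SIGNATURE.search` instead.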

33
Signature Representation

Regular expression representation
use regular expression pattern matching.
no need to worry about mismatches/errors.

34
Computing Multiple Alignments

Recall that two-string local alignment was defined in
terms of global alignment of substrings.
We take the same approach for multiple string
local alignment.
Defn. A local multiple alignment of a set S of
strings is obtained by selecting one substring
Si' from each Si ∈ S and then globally aligning
these substrings.

35
Computing Multiple Alignments

Q: Global vs. local alignment: which should we
prefer?
Wait for someone to respond!
Gusfield notes that for both
pairs of sequences and
multiple sequences
there are biological justifications for
preferring local over global
alignment.
But...

36
Computing Multiple Alignments

But.
The best (computer science) theoretical results
are for global alignment.
Like the joke about the lost wallet, Gusfield
chooses to emphasize global alignment.

37
Computing Multiple Alignments

Q: How can we generalize the concept of score to
multiple alignments?
IOW, what objective function should we use?
We will consider three types of objective
functions:
Sum-of-pairs
Consensus
Tree

38
Computing Multiple Alignments

First we define the concepts of induced pairwise
alignment and its corresponding score.
Defn. The induced pairwise alignment of strings
Si and Sj is obtained from the global alignment M
by removing all other rows.
Note: instances of matching spaces can be removed
from the induced alignment.
Note: to score an induced pairwise alignment, any
two-string alignment scoring scheme can be used.

39
Computing Multiple Alignments

Consider the following pairwise scoring scheme:
score = #mismatches + #spaces
In the following example:
1  A A T - G G T T T
2  A A - C G T T A T
3  T A T C G - A A T
score(1,2) = 4
score(1,3) = 5
score(2,3) = 4
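The three scores can be reproduced by extracting each induced pairwise alignment and counting mismatches plus spaces. This is a minimal sketch; it assumes the unlabeled third row is string 3, and the function names are illustrative.

```python
from itertools import combinations

GAP = "-"

# The example global alignment M, one row per string.
M = ["AAT-GGTTT", "AA-CGTTAT", "TATCG-AAT"]

def induced_pairwise(M, i, j, gap=GAP):
    """Rows i and j of M as column pairs, dropping columns of two spaces."""
    return [(a, b) for a, b in zip(M[i], M[j]) if not (a == gap and b == gap)]

def pair_score(columns):
    """#mismatches + #spaces: every column whose two entries differ costs 1."""
    return sum(1 for a, b in columns if a != b)

scores = {
    (i + 1, j + 1): pair_score(induced_pairwise(M, i, j))
    for i, j in combinations(range(len(M)), 2)
}
print(scores)                # {(1, 2): 4, (1, 3): 5, (2, 3): 4}
print(sum(scores.values()))  # 13
```

Adding the three induced scores gives 13, the quantity a sum-of-pairs objective would assign to this alignment under this scoring scheme.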
