Sequence Comparison

About This Presentation

Title:

Sequence Comparison

Description:

Reference Books: K.-M. Chao and L.X. Zhang. Sequence Comparison: Theory and Methods, Springer ... extremely important in modern biological and medical sciences. ... – PowerPoint PPT presentation

Number of Views:49

Avg rating:3.0/5.0

Slides: 56

Provided by: mat91

Category:

more less

Transcript and Presenter's Notes

Title: Sequence Comparison

1
Sequence Comparison
MA5232 Lecture Notes

LX Zhang
Department of Mathematics
National University of Singapore
matzlx_at_nus.edu.sg

2
See what everyone else has seen, but think them
mathematically.

Reference Books
K.-M. Chao and L.X. Zhang.
Sequence Comparison Theory and Methods,
Springer
2. R.C. Deonier, S. Tavare, and M.S. Waterman,
Computational Genome Analysis, Springer.
3. J. Felsenstein, Inferring Phylogenies,
Springer.

3
Motivation Sequence comparison problems arise
everywhere

Web search tools are designed based on sequence
comparison solutions.
Text processing tools are designed based on
efficient solutions to text storage and word
pattern matching.
Finally, bio-molecular sequence comparison is
extremely important in modern biological and
medical sciences.

4
Motivation
5
Motivation

All genomes are evolved from a common ancestral
genome.
Less important regions of the genome mutate
or change with time more frequently than the
functional important parts.
Hence, sequence conservation implies function.
In other words, sequence conservation serves as a
signal of the functionally important regions in a
genome.

6
An Example of Sequence Function Analysis

To determine whether the SARS is caused by a new
virus or not, one may sequence
the SARS virus and search against a databases
of known bacterial and viral sequences.

Sequence database
Output
List of similar matches
7
Part I Alignment Model for Bio-Seq
Comparison

An alignment among k sequences is a k-row matrix
such that
The row j contains the j-th sequence and
between
two consecutive residues there may be one or
more -s
Each column contains at least one residue.

8
More examples

a g a a - t g c a a c a a g t g - a
a g a a t g - c a a c a a - g t g a
Each alignment is an evolutionary hypothesis,
which accounts for three types of mutations
-- Substitution ( a g c t ? a g t t ) --
Insertion ( a g c t ? a g a c t ) --
Deletion ( a g c t ? a t )
9

Because of point mutation (sub, indel) and
large scale mutation (reversal, fission, fusion,
etc),
a gene has different sequences in different
species.
different genomes have different size and
number of chromosomes.
Indels may not affect the structure of a protein.

10
Alignment of globin protein sequences
11
Part II Scoring an Alignment
a g t c t c c a t c c c
a g t c t c c a t c c c
The quality of an alignment is measured by the
sum of the scores of pairs of aligned residues
plus the scores for gaps.
The left alignment is scored by s(a, a)
3s(c, c) s(t, t) s(g, -) s(t, -)
12

Scores for aligned residues and gaps form a basic
scoring scheme

A G C T - A 4
-2 -1 -2 -1 G 3
-2 -1 -1 C 4
-2 -1 T 3
-1
a g t c t c c a t c c c
has score 4 (-1) 3 4 (-1) 4 4 18
13

Under a meaningful scoring scheme, the score of
an alignment is essentially
the logarithm of likelihood ratio of the
alignment between two random sequences.
Matches score a positive value
Mismatches and indels are penalized
by scoring a negative value.

Indels and Mismatch are introduced to bring up
matches that appear later.
14
A sequence is a word or string on alphabet
?. For DNA sequences, ? a, g, c, t For
protein sequences, ? has 20 letters For
English words, ? a, b, , y, z.

(Global) Pairwise Sequence Alignment
Instance Two sequences S1 and S2, and
a scoring scheme
Question Find an alignment of S1 and S2 that
has the highest score.

Such an alignment that has the highest alignment
score is called an optimal alignment.
15
Part III Generality of Alignment as a Model

Many different problems such as
Maximum subsequence problem
Minimum supersequence problem
Best fitting problem
Finding Levenshtein distance
in sequence comparison can be solved as a
special case of alignment by using a specific
scoring scheme.

16
Maximum Subsequence Problem

A sequence is a word or string on alphabet ?.
For DNA sequences, ? a, g, c, t
For protein sequences, ? has 20 letters
For English words, ? a, b, , y, z.
Let S and T be two sequences on ?.
S is a subsequence of T
or T is a supersequence of S if
all the characters in S appear in T in the same
order.

Maximum Subsequence Problem
Instance Two sequences S1 and S2
Question Find the longest common
subsequence of S1 and S2.
a g g c c a a t a g g c c a a t
a c g g c t c a a c g g c t c a

18
One-to-one correspondence
a g g c c a a t a c g g c t c a
a g g c - - c a a t - - a - - c g g c - - t
c a
s(x,x)1, s(x, y)0
The length of lcs the alignment
score
The LCS problem is a special case of the
alignment problem.
19
Levenshtein distance

In computer science and information theory,
the Levenshtein distance between two strings is
defined to be
the minimum number of edit operations needed to
transform one
string to the other. It is also called the edit
distance.
kitten ? sitten ? sittin ? sitting

Finding the edit distance is equivalent
to aligning sequences with match scores 0
and mismatch and indel score -1
k i t t e n - s i t t i n g
20
Part IV Key Issues of Alignment

1. Algorithmic aspect
Given two or more sequences and a scoring
scheme, how to find the optimal alignment that
has the maximum score?
2. Scoring function
Which score schemes are good at ranking
alignments?
3. Evaluation of output alignments
Are the output alignments statistically
significant?

21
The number of possible alignments

Theorem There are
possible alignments for two m-character
sequences.
Proof. An alignment of two m-character sequences
has at least m columns and at most 2m columns.
So we only need to prove the following fact.

22
Fact There are
possible alignments with k columns for two
m-character sequences S and T.
A
C
A
A
G
C
C
Sequence S appears in the first row in
ways. After the configuration of the first row
is fixed, T appears in the second row in
ways.
23
Alignment GraphCompact representation of all
alignments

The length of a sequence is the number of
characters contained in it.

For example, sequence TATAGC has length 6.
Let S1 and S2 be two sequence of length m and n,
respectively. The alignment graph A(S1, S2) of S1
and S2 are defined as follows
24
There is one-to-one correspondence from the
alignment s between S and T to the paths from
left-top vertex to the right-bottom vertex.
-- Vertices are lattice points (i, j), 0 i
m, 0 j n. In total, (m1)(n1) vertices.
-- There is an arc from (i, j) to (i, j) if
and only if 0 i i 1 and 0 j
j 1. -- Three types of edges
horizontal edges vertical edges
diagonal edges
25
(0, 0)
(1, 0)
(1, 1)
(2, 2)
(3, 3)
(4, 4)
(5, 6)
(4, 5)
(6, 6)
(7,7)
S2
In the alignment graph, each edge
corresponds uniquely to a possible column in an
alignment. -- diagonal edges for
match/mismatches -- horizontal and vertical
edges for indels.
S1
26
A G
T T
A -
C C
- G
T T
G G
G -
- C
S2
S1
27
Part IV Alignment Algorithms

Pairwise Alignment Problem
Input Two sequence S and T and a score
scheme
Question Find an alignment of S and T that has
the
largest alignment score, called optimal
alignment.
Local Alignment Problem
Input Two sequence S and T and a score
scheme
Question Find the best alignment of segments in
S and T respectively.

28
Part IV.1 Global Sequence Alignment
G T A C
G
Each column of an alignment corresponds to an
edge of the corresponding path.
A
T
A
Assign the column score to an edge as
weight. Find an optimal alignment is to find a
path with the largest weight.
C
A
29

Consider two sequences
Xx1x2xj, Yy1y2yk
Write their t-char prefixes as
X1,, tx1x2xt Y1, , ty1y2yt
Let S(j, k) be the score of optimally aligning
X1, , j and Y1,, k.
Consider one of the best alignments of X1,,j
and Y1,,k.

Case 1 The alignment corresponds to a path
whose last edge is diagonal.
xj
yk
A best alignment of X1,, j-1 and Y1, , k-1
S(j, k) S(j-1, k-1) s(xj, yk)
30

S(j, k) optimal score of aligning X1,, j
and Y1,,k.

Case 2 Correspond to a path whose last edge is
vertical
S(j, k) S(j-1, k) s(xj, -)
An optimal alignment of X1,, j-1 and Y1, , k
Case 3 Correspond to a path whose last edge is
horizontal
S(j, k) S(j, k-1) s(-, yk)
An optimal alignment of X1,, j and Y1, , k-1
31

S(j, k) the score of optimally aligning x1, ,
j
and y1, , k.

If we know the optimal alignments for X1,..,
j-1 and Y1, ..,k, X1, .., j and Y1, ,
k-1, X1, , j-1 and Y1, , k-1, we can
find an optimal alignments for X1,, j and Y1,
, k.
S(0, k) the score of optimally aligning an
empty X to Y s(-, y1) s(-, y2)
s(- y3) s(-, yk)
S(j, 0) the score of optimally aligning X to
the empty Y s(x1 , -) s(x2, -)
s(x3, -) s(xj , -)
32
Tabular Computation of Pairwise Alignment
To align an j-character sequence X and a
k-character sequence Y, we form an (j1) by
(k1) matrix F in which S(j, k) is the score
of optimally aligning X1,..,j and Y1,,k
and computed by the formula one by one from left
to right, top to bottom.
A G G C G T
The score of the best align. of AGT and AG
A
G
The score of the best align. of X and Y
T
G
33
Alignment Example
G T A C
G
S(0, 0) S(0, 1) S(0, 2) S(0, 3)
S(0, 4) S(0, 5)
0
A
1
S(1, 0) S(1, 1) S(1, 2) S(1, 3)
S(1, 4) S(1, 5)
T
S(2, 0) S(2, 1) S(2, 2) S(2, 3)
S(2, 4) S(2, 5)
2
A
3
S(3, 0) S(3, 1) S(3, 2) S(3, 3)
S(3, 4) S(3, 5)
C
4
S(4, 0) S(4, 1) S(4, 2) S(4, 3)
S(4, 4) S(4, 5)
A
S(5, 0) S(5, 1) S(5, 2) S(5, 3)
S(5, 4) S(5, 5)
5
34
Alignment Example with Scoring Scheme Match 8,
mismatch -5, indel column -3
G T A C
G
0 -3 -6
-9 -12 -15
0
A
1
-3 -5 -8
2 -1
-4
T
-6 -8 3
2
0
6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -8 5
16 14
5
35
An optimal alignment can be found using
information on how S(j, k) are calculated.
A G G C G T
A
G
T
G
36

a ? b indicates that b can be calculated from
a. Any non-decreasing path from the first cell
to the last cell gives an optimal alignment.
G T A C
G
0
0 -3 -6
-9 -12 -15
A
1
-3 -5 -8 2
-1
-4
T
-6 -8 3
2
0
-6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
37

Backtracking from the right-most cell at the
bottom to the the left-most cell on the top
obtain the optimal path. It is not unique unlike
this example in general.
G T A C
G
0
0 -3 -6
-9 -12 -15
A
1
-3 -5 -8 2
-1
-4
T
-6 -8 3
2
0
-6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
38

An example where there are more than one optimal
alignment
G T A C
G
0
0 -3 -6
-9 -12 -15
A
-3 -5 -8 2
-1
A G
T T
A A
C --
A --
T
-6 -8 3
2
0
A
3
-9 -11 0 11
8 5
C
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
39
Summary

To align a j-char sequence and a k-char
sequence, we use a (j1)x(k1) table in which
each entry is an alignment score. We compute
the entry at (j, k) from three entries at
(j-1, k),
(j, k-1) , (j-1, k) in 5 operations using a
recurrence formula. Hence, the alignment score
can
be computed in 5(j1)(k1) operations.

To output an optimal alignment, we
trackback the score computation from the
the last cell to the first cell. For this
purpose,
we need to know which of entries
(j-1, k), (j, k-1) and (j-1,k-1) gives the
entry at (j, k) using three extra pointers.
Overall, this also takes about 3(j1)(k1)
operations.

Theorem Aligning two sequences of length j
and k respectively takes quadratic time O(jk).
Remark Our algorithm uses O(jk) cells. But it
is possible to use only O(jk) cells
41
Dynamic Programming Technique

Solve an instance of a problem by taking
advantage of the computed solutions to
the small parts of the instance.

The term was first used by Bellman in 1940s
Richard Ernest Bellman (19201984)
42
Part IV.2 Locally Aligning Sequences
Given two sequences S and T, find best
alignment of a segments of X and a segment of Y.
43
A G G C A A
G A
A G T G
G T A
44
Locally Aligning Sequences
Input Two sequences of length m and n, and a
scoring scheme s( , ).
Step 1 F(0, k)0 F(j, 0)0 and for jgt0
and kgt0, compute F(j,k) by
Step 2 Identify the largest entry F(m,n) in
the table and backtrack from the cell (m,
n) to a cell (m, n) such that F(m, n)0
to obtain an optimal local alignment.
45
local alignment
Match 8 Mismatch -5 Gap symbol -3
C G G A T C A T

CTTAACA
46
Example of a local alignment
Match 8 Mismatch -5 Gap symbol -3

C G G A T C A T
CTTA
A
C
C
47
Part IV.3 Multiple Sequence Alignment

Generalization of dynamic programming method
Given three sequences S1x1,..n, S21,..,n,
S31,..,n,
let F(i, j, k) denote the score of optimally
aligning the i-character prefix of S1, the
j-character prefix of S2 and the k-character
prefix of S3. Then, any alignment of these
prefixes has 7 possibilities in the last column

xi xj xk
-- -- --
xi -- --
-- xj --
-- -- xk
xi xj --
xi -- xk
-- xj xk
48
Basis values F(0, 0, k) ?s(-, -, zt) ,
F(i, 0, 0) ?s( xt , -, -) ,
F(0, j,0) ?s( -, yt , -) .

(i, j, k)
49

One popular scoring approach is the
Sum-of-Pairs score (SP-score), which is
defined as follows
where s(,) is given by a scoring scheme as in
pairwise alignment.

50
Aligning k Sequences

In general, for aligning k m-char sequences, we
need a k-dimensional array of (m1)k cells. At
the last column of an alignment, there are 2k -1
possibilities (1 is subtracted because all gaps
are not allowed).
Hence, we must
fill out all (m1)k cells
for each cell, consider all 2k -1 possibilities.
for each possibility, we compute the SP score
using O(k2) operations.

51
Efficiency of the DP method

For k m-character sequences, the computation
involves a matrix with (m1)k entries each
entry is computed by 2k arithmetic operations.
Hence, the method takes about O(k22k(m1)k)
operations.
When mgt100, kgt20 or mgt300 and k5, aligning by
this is impossible even with Carrillio-Lipman
bound.

52
Complexity of MSA

Theorem MSA is an NP-hard problem.
This means that MSA is likely not solved
essentially faster than guessing and checking
its
solutions.
It also seems true that MSA takes exponential
memory space.

53
Application one Genome Sequencing

Genome sequencing is to determine the exact order
of the millions of chemical building blocks (A,
T, C, and G) that make up the DNA of an
organism.

The genome of a complex organism like human is
huge. However, the existing technology can only
be used for fairly short (say, 500-700bp) stands.
Therefore, a feasible approach to genome
sequencing is to cut the entire genome at a
number of points randomly to generate a large
number of small segments, which are sequenced,
called READs.
54
Computer programs use overlapping information on
different READs to assemble them into a
contiguous sequence. Multiple overlapping READS
are obtained for the target genome by performing
several rounds of fragmentation.
Target XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XX !st round NNNNNAGCATGCTGCANNNNNNNNNNNNNNNNN
NN NNNNNNNNNNNNNNNNGTCATGCTTAGGC
TANNNNN 2nd round NNNNNAGCATGNNNNNNNNNNNNNNNNNN
NNNNNNN NNNNNNNNNNNCTGCAGTCATGCT
TAGGCANNNNNNN Assembly
AGCATGCTGCAGTCATGCTTAGGCA
55
The above sequencing process is called Shotgun
sequencing. Applying the shotgun sequencing to
human genome, we will get about 50 million short
segments of about 600 base pairs long. To have a
highly accurate reconstruction of human genome,
the fragmentation-and-sequencing process needs to
be performed in 7 to 10 rounds to get enough
overlapping information.
Target DNA
1st round
2nd round
3rd round
4th round

Write a Comment

User Comments (0)