Title: Sequence Comparison
1Sequence Comparison
MA5232 Lecture Notes
- LX Zhang
- Department of Mathematics
- National University of Singapore
- matzlx_at_nus.edu.sg
-
2See what everyone else has seen, but think them
mathematically.
- Reference Books
- K.-M. Chao and L.X. Zhang.
- Sequence Comparison Theory and Methods,
Springer - 2. R.C. Deonier, S. Tavare, and M.S. Waterman,
- Computational Genome Analysis, Springer.
- 3. J. Felsenstein, Inferring Phylogenies,
Springer.
3Motivation Sequence comparison problems arise
everywhere
- Web search tools are designed based on sequence
comparison solutions. - Text processing tools are designed based on
efficient solutions to text storage and word
pattern matching. - Finally, bio-molecular sequence comparison is
extremely important in modern biological and
medical sciences.
4Motivation
5Motivation
- All genomes are evolved from a common ancestral
genome. - Less important regions of the genome mutate
or change with time more frequently than the
functional important parts. -
- Hence, sequence conservation implies function.
In other words, sequence conservation serves as a
signal of the functionally important regions in a
genome.
6An Example of Sequence Function Analysis
- To determine whether the SARS is caused by a new
virus or not, one may sequence - the SARS virus and search against a databases
of known bacterial and viral sequences.
Sequence database
Output
List of similar matches
7Part I Alignment Model for Bio-Seq
Comparison
- An alignment among k sequences is a k-row matrix
- such that
- The row j contains the j-th sequence and
between - two consecutive residues there may be one or
more -s - Each column contains at least one residue.
8More examples
a g a a - t g c a a c a a g t g - a
a g a a t g - c a a c a a - g t g a
Each alignment is an evolutionary hypothesis,
which accounts for three types of mutations
-- Substitution ( a g c t ? a g t t ) --
Insertion ( a g c t ? a g a c t ) --
Deletion ( a g c t ? a t )
9- Because of point mutation (sub, indel) and
- large scale mutation (reversal, fission, fusion,
etc), - a gene has different sequences in different
species. - different genomes have different size and
number of chromosomes. - Indels may not affect the structure of a protein.
10Alignment of globin protein sequences
11Part II Scoring an Alignment
a g t c t c c a t c c c
a g t c t c c a t c c c
The quality of an alignment is measured by the
sum of the scores of pairs of aligned residues
plus the scores for gaps.
The left alignment is scored by s(a, a)
3s(c, c) s(t, t) s(g, -) s(t, -)
12- Scores for aligned residues and gaps form a basic
scoring scheme -
A G C T - A 4
-2 -1 -2 -1 G 3
-2 -1 -1 C 4
-2 -1 T 3
-1
a g t c t c c a t c c c
has score 4 (-1) 3 4 (-1) 4 4 18
13- Under a meaningful scoring scheme, the score of
an alignment is essentially - the logarithm of likelihood ratio of the
alignment between two random sequences. - Matches score a positive value
- Mismatches and indels are penalized
- by scoring a negative value.
Indels and Mismatch are introduced to bring up
matches that appear later.
14 A sequence is a word or string on alphabet
?. For DNA sequences, ? a, g, c, t For
protein sequences, ? has 20 letters For
English words, ? a, b, , y, z.
- (Global) Pairwise Sequence Alignment
- Instance Two sequences S1 and S2, and
- a scoring scheme
- Question Find an alignment of S1 and S2 that
- has the highest score.
Such an alignment that has the highest alignment
score is called an optimal alignment.
15 Part III Generality of Alignment as a Model
- Many different problems such as
- Maximum subsequence problem
- Minimum supersequence problem
- Best fitting problem
- Finding Levenshtein distance
- in sequence comparison can be solved as a
special case of alignment by using a specific
scoring scheme.
16Maximum Subsequence Problem
- A sequence is a word or string on alphabet ?.
- For DNA sequences, ? a, g, c, t
- For protein sequences, ? has 20 letters
- For English words, ? a, b, , y, z.
- Let S and T be two sequences on ?.
- S is a subsequence of T
- or T is a supersequence of S if
- all the characters in S appear in T in the same
- order.
17-
- Maximum Subsequence Problem
- Instance Two sequences S1 and S2
- Question Find the longest common
- subsequence of S1 and S2.
- a g g c c a a t a g g c c a a t
- a c g g c t c a a c g g c t c a
18One-to-one correspondence
a g g c c a a t a c g g c t c a
a g g c - - c a a t - - a - - c g g c - - t
c a
s(x,x)1, s(x, y)0
The length of lcs the alignment
score
The LCS problem is a special case of the
alignment problem.
19Levenshtein distance
- In computer science and information theory,
- the Levenshtein distance between two strings is
defined to be - the minimum number of edit operations needed to
transform one - string to the other. It is also called the edit
distance. - kitten ? sitten ? sittin ? sitting
Finding the edit distance is equivalent
to aligning sequences with match scores 0
and mismatch and indel score -1
k i t t e n - s i t t i n g
20Part IV Key Issues of Alignment
- 1. Algorithmic aspect
- Given two or more sequences and a scoring
scheme, how to find the optimal alignment that
has the maximum score? - 2. Scoring function
- Which score schemes are good at ranking
alignments? - 3. Evaluation of output alignments
- Are the output alignments statistically
significant?
21The number of possible alignments
- Theorem There are
-
- possible alignments for two m-character
sequences. - Proof. An alignment of two m-character sequences
has at least m columns and at most 2m columns.
So we only need to prove the following fact.
22Fact There are
possible alignments with k columns for two
m-character sequences S and T.
A
C
A
A
G
C
C
Sequence S appears in the first row in
ways. After the configuration of the first row
is fixed, T appears in the second row in
ways.
23 Alignment GraphCompact representation of all
alignments
- The length of a sequence is the number of
- characters contained in it.
For example, sequence TATAGC has length 6.
Let S1 and S2 be two sequence of length m and n,
respectively. The alignment graph A(S1, S2) of S1
and S2 are defined as follows
24There is one-to-one correspondence from the
alignment s between S and T to the paths from
left-top vertex to the right-bottom vertex.
-- Vertices are lattice points (i, j), 0 i
m, 0 j n. In total, (m1)(n1) vertices.
-- There is an arc from (i, j) to (i, j) if
and only if 0 i i 1 and 0 j
j 1. -- Three types of edges
horizontal edges vertical edges
diagonal edges
25(0, 0)
(1, 0)
(1, 1)
(2, 2)
(3, 3)
(4, 4)
(5, 6)
(4, 5)
(6, 6)
(7,7)
S2
In the alignment graph, each edge
corresponds uniquely to a possible column in an
alignment. -- diagonal edges for
match/mismatches -- horizontal and vertical
edges for indels.
S1
26A G
T T
A -
C C
- G
T T
G G
G -
- C
S2
S1
27Part IV Alignment Algorithms
- Pairwise Alignment Problem
- Input Two sequence S and T and a score
scheme - Question Find an alignment of S and T that has
the - largest alignment score, called optimal
alignment. - Local Alignment Problem
- Input Two sequence S and T and a score
scheme - Question Find the best alignment of segments in
S and T respectively.
28Part IV.1 Global Sequence Alignment
G T A C
G
Each column of an alignment corresponds to an
edge of the corresponding path.
A
T
A
Assign the column score to an edge as
weight. Find an optimal alignment is to find a
path with the largest weight.
C
A
29- Consider two sequences
- Xx1x2xj, Yy1y2yk
- Write their t-char prefixes as
- X1,, tx1x2xt Y1, , ty1y2yt
- Let S(j, k) be the score of optimally aligning
X1, , j and Y1,, k. - Consider one of the best alignments of X1,,j
and Y1,,k.
Case 1 The alignment corresponds to a path
whose last edge is diagonal.
xj
yk
A best alignment of X1,, j-1 and Y1, , k-1
S(j, k) S(j-1, k-1) s(xj, yk)
30- S(j, k) optimal score of aligning X1,, j
and Y1,,k.
Case 2 Correspond to a path whose last edge is
vertical
S(j, k) S(j-1, k) s(xj, -)
An optimal alignment of X1,, j-1 and Y1, , k
Case 3 Correspond to a path whose last edge is
horizontal
S(j, k) S(j, k-1) s(-, yk)
An optimal alignment of X1,, j and Y1, , k-1
31- S(j, k) the score of optimally aligning x1, ,
j - and y1, , k.
If we know the optimal alignments for X1,..,
j-1 and Y1, ..,k, X1, .., j and Y1, ,
k-1, X1, , j-1 and Y1, , k-1, we can
find an optimal alignments for X1,, j and Y1,
, k.
S(0, k) the score of optimally aligning an
empty X to Y s(-, y1) s(-, y2)
s(- y3) s(-, yk)
S(j, 0) the score of optimally aligning X to
the empty Y s(x1 , -) s(x2, -)
s(x3, -) s(xj , -)
32Tabular Computation of Pairwise Alignment
To align an j-character sequence X and a
k-character sequence Y, we form an (j1) by
(k1) matrix F in which S(j, k) is the score
of optimally aligning X1,..,j and Y1,,k
and computed by the formula one by one from left
to right, top to bottom.
A G G C G T
The score of the best align. of AGT and AG
A
G
The score of the best align. of X and Y
T
G
33Alignment Example
G T A C
G
S(0, 0) S(0, 1) S(0, 2) S(0, 3)
S(0, 4) S(0, 5)
0
A
1
S(1, 0) S(1, 1) S(1, 2) S(1, 3)
S(1, 4) S(1, 5)
T
S(2, 0) S(2, 1) S(2, 2) S(2, 3)
S(2, 4) S(2, 5)
2
A
3
S(3, 0) S(3, 1) S(3, 2) S(3, 3)
S(3, 4) S(3, 5)
C
4
S(4, 0) S(4, 1) S(4, 2) S(4, 3)
S(4, 4) S(4, 5)
A
S(5, 0) S(5, 1) S(5, 2) S(5, 3)
S(5, 4) S(5, 5)
5
34Alignment Example with Scoring Scheme Match 8,
mismatch -5, indel column -3
G T A C
G
0 -3 -6
-9 -12 -15
0
A
1
-3 -5 -8
2 -1
-4
T
-6 -8 3
2
0
6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -8 5
16 14
5
35An optimal alignment can be found using
information on how S(j, k) are calculated.
A G G C G T
A
G
T
G
36 a ? b indicates that b can be calculated from
a. Any non-decreasing path from the first cell
to the last cell gives an optimal alignment.
G T A C
G
0
0 -3 -6
-9 -12 -15
A
1
-3 -5 -8 2
-1
-4
T
-6 -8 3
2
0
-6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
37 Backtracking from the right-most cell at the
bottom to the the left-most cell on the top
obtain the optimal path. It is not unique unlike
this example in general.
G T A C
G
0
0 -3 -6
-9 -12 -15
A
1
-3 -5 -8 2
-1
-4
T
-6 -8 3
2
0
-6
-3
A
3
-9 -11 0 11
8 5
C
4
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
38 An example where there are more than one optimal
alignment
G T A C
G
0
0 -3 -6
-9 -12 -15
A
-3 -5 -8 2
-1
A G
T T
A A
C --
A --
T
-6 -8 3
2
0
A
3
-9 -11 0 11
8 5
C
-12 -14 -3 8
19 16
A
-15 -17 -6 5
16 14
5
39Summary
- To align a j-char sequence and a k-char
- sequence, we use a (j1)x(k1) table in which
- each entry is an alignment score. We compute
- the entry at (j, k) from three entries at
(j-1, k), - (j, k-1) , (j-1, k) in 5 operations using a
- recurrence formula. Hence, the alignment score
can - be computed in 5(j1)(k1) operations.
-
40- To output an optimal alignment, we
- trackback the score computation from the
- the last cell to the first cell. For this
purpose, - we need to know which of entries
- (j-1, k), (j, k-1) and (j-1,k-1) gives the
entry at (j, k) using three extra pointers.
Overall, this also takes about 3(j1)(k1)
operations.
Theorem Aligning two sequences of length j
and k respectively takes quadratic time O(jk).
Remark Our algorithm uses O(jk) cells. But it
is possible to use only O(jk) cells
41Dynamic Programming Technique
- Solve an instance of a problem by taking
- advantage of the computed solutions to
- the small parts of the instance.
The term was first used by Bellman in 1940s
Richard Ernest Bellman (19201984)
42Part IV.2 Locally Aligning Sequences
Given two sequences S and T, find best
alignment of a segments of X and a segment of Y.
43A G G C A A
G A
A G T G
G T A
44 Locally Aligning Sequences
Input Two sequences of length m and n, and a
scoring scheme s( , ).
Step 1 F(0, k)0 F(j, 0)0 and for jgt0
and kgt0, compute F(j,k) by
Step 2 Identify the largest entry F(m,n) in
the table and backtrack from the cell (m,
n) to a cell (m, n) such that F(m, n)0
to obtain an optimal local alignment.
45local alignment
Match 8 Mismatch -5 Gap symbol -3
C G G A T C A T
CTTAACA
46 Example of a local alignment
Match 8 Mismatch -5 Gap symbol -3
C G G A T C A T
CTTA
A
C
C
47Part IV.3 Multiple Sequence Alignment
- Generalization of dynamic programming method
- Given three sequences S1x1,..n, S21,..,n,
S31,..,n, - let F(i, j, k) denote the score of optimally
aligning the i-character prefix of S1, the
j-character prefix of S2 and the k-character
prefix of S3. Then, any alignment of these
prefixes has 7 possibilities in the last column -
xi xj xk
-- -- --
xi -- --
-- xj --
-- -- xk
xi xj --
xi -- xk
-- xj xk
48Basis values F(0, 0, k) ?s(-, -, zt) ,
F(i, 0, 0) ?s( xt , -, -) ,
F(0, j,0) ?s( -, yt , -) .
(i, j, k)
49- One popular scoring approach is the
- Sum-of-Pairs score (SP-score), which is
- defined as follows
- where s(,) is given by a scoring scheme as in
pairwise alignment.
50Aligning k Sequences
- In general, for aligning k m-char sequences, we
- need a k-dimensional array of (m1)k cells. At
the last column of an alignment, there are 2k -1
possibilities (1 is subtracted because all gaps
are not allowed). - Hence, we must
- fill out all (m1)k cells
- for each cell, consider all 2k -1 possibilities.
- for each possibility, we compute the SP score
using O(k2) operations.
51Efficiency of the DP method
- For k m-character sequences, the computation
involves a matrix with (m1)k entries each
entry is computed by 2k arithmetic operations.
Hence, the method takes about O(k22k(m1)k)
operations. - When mgt100, kgt20 or mgt300 and k5, aligning by
this is impossible even with Carrillio-Lipman
bound.
52Complexity of MSA
- Theorem MSA is an NP-hard problem.
- This means that MSA is likely not solved
- essentially faster than guessing and checking
its - solutions.
- It also seems true that MSA takes exponential
memory space. -
53Application one Genome Sequencing
- Genome sequencing is to determine the exact order
of the millions of chemical building blocks (A,
T, C, and G) that make up the DNA of an
organism.
The genome of a complex organism like human is
huge. However, the existing technology can only
be used for fairly short (say, 500-700bp) stands.
Therefore, a feasible approach to genome
sequencing is to cut the entire genome at a
number of points randomly to generate a large
number of small segments, which are sequenced,
called READs.
54Computer programs use overlapping information on
different READs to assemble them into a
contiguous sequence. Multiple overlapping READS
are obtained for the target genome by performing
several rounds of fragmentation.
Target XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
XX !st round NNNNNAGCATGCTGCANNNNNNNNNNNNNNNNN
NN NNNNNNNNNNNNNNNNGTCATGCTTAGGC
TANNNNN 2nd round NNNNNAGCATGNNNNNNNNNNNNNNNNNN
NNNNNNN NNNNNNNNNNNCTGCAGTCATGCT
TAGGCANNNNNNN Assembly
AGCATGCTGCAGTCATGCTTAGGCA
55The above sequencing process is called Shotgun
sequencing. Applying the shotgun sequencing to
human genome, we will get about 50 million short
segments of about 600 base pairs long. To have a
highly accurate reconstruction of human genome,
the fragmentation-and-sequencing process needs to
be performed in 7 to 10 rounds to get enough
overlapping information.
Target DNA
1st round
2nd round
3rd round
4th round