Title: Alignment Algorithms
1. Alignment Algorithms
Instructor: Yao-Ting Huang
Bioinformatics Laboratory, Department of Computer
Science and Information Engineering, National Chung
Cheng University.
2. Review of Big O Notation
- Definition of Big-O
- A function f(n) is O(g(n)) if there exist
positive constants c and n0 such that f(n) ≤ c · g(n)
for all n ≥ n0.
- e.g., f(n) = 100n^2 + 30n + 2000 is O(n^2)
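To see the definition at work on the example above, here is one possible choice of constants. It assumes the operators dropped from the example are plus signs, so that f(n) = 100n^2 + 30n + 2000; the constants c = 2130 and n0 = 1 are just one convenient choice, not the smallest possible.

```latex
\begin{align*}
f(n) &= 100n^2 + 30n + 2000 \\
     &\le 100n^2 + 30n^2 + 2000n^2 && \text{for all } n \ge 1 \\
     &= 2130\,n^2 .
\end{align*}
```

Hence f(n) ≤ c · g(n) holds with g(n) = n^2, c = 2130, and n0 = 1.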
3. Big O Notation
A function f(n) is O(g(n)) if there exist
positive constants c and n0 such that f(n) ≤ c · g(n)
for all n ≥ n0.
4. How do functions grow?
[Table: growth of each function with n]
(Assume one million operations per second.)
For large data sets, algorithms with a complexity
greater than O(n log n) are often impractical!
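The claim above can be checked with a little arithmetic. The following is a minimal C sketch that uses the slide's assumption of one million operations per second; the function names are hypothetical and the logarithm is rounded down to an integer for simplicity.

```c
#include <stdint.h>

/* Assumption taken from the slide: one million basic operations per second. */
#define OPS_PER_SECOND 1000000.0

/* floor(log2(n)) for n >= 1, computed without <math.h>. */
static unsigned floor_log2(uint64_t n) {
    unsigned k = 0;
    while (n > 1) {
        n >>= 1;
        k++;
    }
    return k;
}

/* Estimated running time in seconds for n, n log n, and n^2 operations. */
double seconds_linear(uint64_t n) {
    return (double)n / OPS_PER_SECOND;
}

double seconds_n_log_n(uint64_t n) {
    return (double)n * floor_log2(n) / OPS_PER_SECOND;
}

double seconds_quadratic(uint64_t n) {
    return (double)n * (double)n / OPS_PER_SECOND;
}
```

For n = 1,000,000 these give about 1 second, 19 seconds, and 1,000,000 seconds (more than 11 days) respectively, which is why quadratic algorithms quickly become impractical on large inputs.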
5. Practical Complexity
6. orz's sequence evolution
- orz (kid)
- OTZ (adult)
- Orz (big head)
- Crz (motorcycle driver)
- or2 (bottom up)
- oO (back high)
- STO (the other way around)
- the origin?
- their evolutionary relationships?
Chaos illustration
7. orz's sequence evolution
orz
Orz
Crz
or2
OTZ
oO
STO
8. Homology
- How to measure the homology (similarity) of two
sequences/genes/species?
AGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTA
AGAGCTATGCCATTTATATAAAATTTAAAGGCGCGGGGGGTTAAGAGGCG
CGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAAGC
TATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAA
AATTTAAAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTT
AAAGCGTAAGAGCTATGCCATTTATTTATATAAAATTTAAAGATACAATC
TAAAGTTAAAGCGTAAGAGCTATGCAGGCGCGGGGGGTTAAGAGCTATGC
CATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTT
AAAGAGGCGCGGGGGGT
AAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGGGCGCGGGGGG
TTAAGAGCTATGCCATTAGGCGCGGGGGGTTAAGAGCTATGCCATTTATA
TAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAATTTAT
ATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGAAA
GCGTAAGAGCTATGCCATTTATATAAAATTTAAAGGGCGCGGGGGGTTAA
GAGCTATGCCATTTATATAAGGGGGTTAAGAGCTATGCCATTTTAAGAGC
TATGCCATTTATATAAAATTTAAAGCGTAAGAAGCTATGCCATTTATATA
AAATTTAAGAGCTATGCCATTTATATAAAATTTAAAGAGGCGCGGGGGGT
TAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATT
TATATAAAATTTAAGCTATGCAGGCGCGGGGAGCTGGGTTTATATAAAAT
TTA
9. Genetic Variants
- Genetic variants decrease the homology of sequences.
- Point mutations, insertions/deletions, and inversions.
- Homology is difficult to measure in the presence of
these genetic variants.
AAGAAGCTATGCC - G - G -
AAGAGACTATGCCGG
AA - - AGCTATGCCAGTGA
AAAGCTATGCCAGTGA
Alignment
10. Genetic Variants
Variations observed in a population
Mutations over time
Disease Mutation
Common Ancestor
time
present
11. Milestones of Alignment Algorithms
- 1970: Needleman-Wunsch algorithm
- 1981: Smith-Waterman algorithm
- 1985: FASTP/FASTN, fast sequence similarity searching
- 1990: BLAST, fast sequence similarity searching
- 2001: MUMmer
12. The Impact of Alignment Algorithms
- The BLAST paper is one of the most highly cited papers so far.
- Biologists may know nothing about algorithms, but
they probably know BLAST.
13. Dynamic Programming
- Dynamic programming is a classical method for
solving sequential decision problems with a
compositional cost structure.
- Alignment algorithms were originally derived from
dynamic programming.
14. Two key ingredients
- Two key ingredients for an optimization problem
to be suitable for a dynamic-programming solution:
1. Optimal substructures: each substructure is optimal
(principle of optimality).
2. Overlapping subproblems: subproblems are dependent
(otherwise, a divide-and-conquer approach is the choice).
15. Three Basic Components
- A dynamic-programming algorithm has three basic
components:
- The recurrence relation (for defining the value
of an optimal solution)
- The tabular computation (for computing the value
of an optimal solution)
- The traceback (for delivering an optimal
solution)
16. Fibonacci Numbers
- Fibonacci numbers F(n), for n = 0, 1, 2, ..., are
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, ...
- Sunflowers have spirals running in two directions, and
the spiral counts are a pair of adjacent Fibonacci numbers
(e.g., 34 and 55). [Wikipedia]
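17. Recursion
- Formal definition: F(0) = 0, F(1) = 1, and
F(n) = F(n-1) + F(n-2) for n ≥ 2, giving
0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
int fib(int n) {
    if (n == 0 || n == 1) return n;  /* base cases: F(0) = 0, F(1) = 1 */
    else return fib(n - 1) + fib(n - 2);
}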
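18. Recursion
int fib(int n) {
    if (n == 0 || n == 1) return n;  /* base cases: F(0) = 0, F(1) = 1 */
    else return fib(n - 1) + fib(n - 2);
}
[Figure: recursion tree of fib(100). The number of calls doubles from level to level: 1 = 2^0, 2 = 2^1, 4 = 2^2, 8 = 2^3, ..., down to a depth of about 100.]
The running time is exponential in n.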
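To make the exponential growth concrete, the sketch below counts how many calls the plain recursive fib makes. The function name fib_call_count is hypothetical, and it assumes the recursion stops at n = 0 and n = 1.

```c
/* Number of calls made by the plain recursive fib for input n.
   Each call with n >= 2 spawns two further calls; inputs below 2
   are treated as base cases that cost a single call. */
unsigned long fib_call_count(int n) {
    if (n <= 1)
        return 1;
    return 1 + fib_call_count(n - 1) + fib_call_count(n - 2);
}
```

For example, fib_call_count(20) is 21891 even though the answer F(20) is only 6765; the count keeps growing by a roughly constant factor for each increment of n.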
19. Tabular Computation
- The tabular computation avoids recomputation: each
subproblem is solved once and its value is stored in a table.
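A minimal sketch of the tabular computation for Fibonacci numbers, assuming F(0) = 0 and F(1) = 1. The name fib_table and the 64-bit range limit are choices made for this sketch.

```c
#include <stdint.h>

/* Bottom-up (tabular) Fibonacci: every F(i) is computed exactly once.
   Valid for 0 <= n <= 93, since F(94) no longer fits in 64 bits.
   Returns 0 for larger n (a sentinel chosen for this sketch). */
uint64_t fib_table(unsigned n) {
    uint64_t table[94];
    if (n > 93)
        return 0;
    table[0] = 0;
    table[1] = 1;
    for (unsigned i = 2; i <= n; i++)
        table[i] = table[i - 1] + table[i - 2];
    return table[n];
}
```

The loop performs n - 1 additions, so the running time is linear in n rather than exponential.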
20. Locating Conserved Genes
- Genes conserved in two different species usually
have the same orientation.
- e.g., MUMmer (http://mummer.sourceforge.net/).
21. Locating Conserved Genes
- Genes conserved in two different species usually
have the same orientation.
- This problem is often formulated as finding the
longest increasing subsequence (LIS).
Gene order in the first genome:  1 2 3 4 5 6 7 8
Gene order in the second genome: 1 5 6 2 3 4 7 8
LIS: 1, 2, 3, 4, 7, 8
22. Longest Increasing Subsequence (LIS)
- The longest increasing subsequence problem is to find a
longest increasing subsequence of a given
sequence of distinct integers a1 a2 ... an.
- e.g. 9 2 5 3 7 11 8 10 13 6
- 3 7
- 7 10 13
- 7 11
are increasing subsequences.
- 3 5 11 13
is not an increasing subsequence of it (5 comes before 3 in the sequence).
We want to find a longest increasing subsequence.
23. A naive approach for LIS
- Let L_i be the length of a longest increasing
subsequence ending at position i.
\[ L_i = 1 + \max_{0 \le j \le i-1} \{ L_j : a_j < a_i \} \]
(use a dummy a_0 = minimum, and L_0 = 0)
a_i:  9  2  5  3  7 11  8 10 13  6
L_i:  1  1  2  2  3  4  ?
24. A naive approach for LIS
\[ L_i = 1 + \max_{0 \le j \le i-1} \{ L_j : a_j < a_i \} \]
a_i:  9  2  5  3  7 11  8 10 13  6
L_i:  1  1  2  2  3  4  4  5  6  3
The maximum length is 6.
The subsequence 2, 3, 7, 8, 10, 13 is a longest
increasing subsequence. This method runs in O(n^2)
time.
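The quadratic method can be sketched in C as follows. The name lis_quadratic and the fixed capacity are assumptions of this sketch; instead of a dummy a_0, each L_i simply starts at 1.

```c
#include <stddef.h>

#define LIS_MAX_N 1024  /* capacity chosen for this sketch */

/* Length of a longest strictly increasing subsequence in O(n^2) time.
   len[i] plays the role of L_i: the best length ending at position i.
   Returns 0 if n exceeds LIS_MAX_N. */
size_t lis_quadratic(const int *a, size_t n) {
    size_t len[LIS_MAX_N];
    size_t best = 0;
    if (n > LIS_MAX_N)
        return 0;
    for (size_t i = 0; i < n; i++) {
        len[i] = 1;  /* a[i] on its own */
        for (size_t j = 0; j < i; j++) {
            if (a[j] < a[i] && len[j] + 1 > len[i])
                len[i] = len[j] + 1;  /* extend the best subsequence ending at j */
        }
        if (len[i] > best)
            best = len[i];
    }
    return best;
}
```

On the slide's input 9 2 5 3 7 11 8 10 13 6 this returns 6, the largest entry in the L_i row.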
25. An O(n log n) method for LIS
- Define BestEnd[k] as the smallest ending number of an
increasing subsequence of length k found so far.
Input:       9  2  5  3  7 11  8 10 13  6
BestEnd[1]:  9  2  2  2
BestEnd[2]:        5  3
26. An O(n log n) method for LIS
- Define BestEnd[k] as the smallest ending number of an
increasing subsequence of length k found so far.
Input:       9  2  5  3  7 11  8 10 13  6
BestEnd[1]:  9  2  2  2  2  2  2
BestEnd[2]:        5  3  3  3  3
BestEnd[3]:              7  7  7
BestEnd[4]:                11  8
27. An O(n log n) method for LIS
- Define BestEnd[k] as the smallest ending number of an
increasing subsequence of length k found so far.
Input:       9  2  5  3  7 11  8 10 13  6
BestEnd[1]:  9  2  2  2  2  2  2  2  2
BestEnd[2]:        5  3  3  3  3  3  3
BestEnd[3]:              7  7  7  7  7
BestEnd[4]:                11  8  8  8
BestEnd[5]:                      10 10
BestEnd[6]:                         13
28. An O(n log n) method for LIS
- Define BestEnd[k] to be the smallest ending number of an
increasing subsequence of length k found so far.
Input:       9  2  5  3  7 11  8 10 13  6
BestEnd[1]:  9  2  2  2  2  2  2  2  2  2
BestEnd[2]:        5  3  3  3  3  3  3  3
BestEnd[3]:              7  7  7  7  7  6
BestEnd[4]:                11  8  8  8  8
BestEnd[5]:                      10 10 10
BestEnd[6]:                         13 13
For each position, we perform a binary search to
update BestEnd.
29. Binary search
- Given an ordered sequence x1, x2, ..., xn, where
x1 < x2 < ... < xn, and a number y, a binary search
finds the largest xi such that xi < y in O(log n)
time.
[Figure: the search range shrinks from n to n/2 to n/4, ...]
30. Binary search
- How many steps does a binary search need to reduce the
problem size to 1?
n → n/2 → n/4 → n/8 → n/16 → ... → 1
O(log n) steps.
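A minimal C sketch of this search. The name count_less_than is hypothetical; rather than returning the element itself, it returns how many elements are smaller than y, from which the largest such element can be read off.

```c
#include <stddef.h>

/* Returns the number of elements of x[0..n-1] that are strictly less than y.
   x must be sorted in increasing order. If the result r is positive, the
   largest element smaller than y is x[r - 1]; r == 0 means there is none. */
size_t count_less_than(const int *x, size_t n, int y) {
    size_t lo = 0, hi = n;  /* the answer always lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (x[mid] < y)
            lo = mid + 1;   /* x[mid] qualifies, so the answer is right of mid */
        else
            hi = mid;       /* x[mid] is too large, so the answer is at or left of mid */
    }
    return lo;
}
```

Each iteration halves the interval [lo, hi], so the loop runs O(log n) times.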
31. An O(n log n) method for LIS
- Define BestEnd[k] to be the smallest ending number of an
increasing subsequence of length k found so far.
Input:       9  2  5  3  7 11  8 10 13  6
BestEnd[1]:  9  2  2  2  2  2  2  2  2  2
BestEnd[2]:        5  3  3  3  3  3  3  3
BestEnd[3]:              7  7  7  7  7  6
BestEnd[4]:                11  8  8  8  8
BestEnd[5]:                      10 10 10
BestEnd[6]:                         13 13
For each position, we perform a binary search to
update BestEnd. Therefore, the running time is
O(n log n).
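The BestEnd method can be sketched in C as below. The name lis_best_end and the fixed capacity are assumptions of this sketch; best_end[k - 1] stores what the slides call BestEnd for length k.

```c
#include <stddef.h>

#define LIS_CAPACITY 1024  /* capacity chosen for this sketch */

/* Length of a longest strictly increasing subsequence in O(n log n) time.
   Returns 0 if n exceeds LIS_CAPACITY. */
size_t lis_best_end(const int *a, size_t n) {
    int best_end[LIS_CAPACITY];
    size_t length = 0;  /* number of BestEnd entries currently defined */
    if (n > LIS_CAPACITY)
        return 0;
    for (size_t i = 0; i < n; i++) {
        /* Binary search for the first entry that is not smaller than a[i]. */
        size_t lo = 0, hi = length;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (best_end[mid] < a[i])
                lo = mid + 1;
            else
                hi = mid;
        }
        best_end[lo] = a[i];  /* a[i] is a smaller end for length lo + 1 */
        if (lo == length)
            length++;         /* a[i] extended the longest subsequence so far */
    }
    return length;
}
```

On the input 9 2 5 3 7 11 8 10 13 6 the array ends as 2 3 6 8 10 13 and the function returns 6. Note that the array holds the best ending values per length, not necessarily an actual subsequence.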
32. LCS
- The longest common subsequence (LCS) problem is to find
a maximum-length common subsequence of two
sequences.
- Sequence 1: president
- Sequence 2: providence
- Their LCS is priden.
[Figure: president set against providence]
33. LCS
- Another example
- Sequence 1: algorithm
- Sequence 2: alignment
- One of their LCSs is algm.
[Figure: algorithm set against alignment]
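34. How to compute LCS?
- Let A = a1 a2 ... am and B = b1 b2 ... bn.
- len(i, j): the length of an LCS between
a1 a2 ... ai and b1 b2 ... bj.
- With proper initializations, len(i, j) can be
computed as follows.
\[
\mathit{len}(i, j) =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0 \\
\mathit{len}(i-1, j-1) + 1 & \text{if } i, j > 0 \text{ and } a_i = b_j \\
\max\{ \mathit{len}(i-1, j),\ \mathit{len}(i, j-1) \} & \text{if } i, j > 0 \text{ and } a_i \ne b_j
\end{cases}
\]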
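A minimal C sketch of the tabular computation of len(i, j), using the standard textbook LCS recurrence: 0 on the borders, the diagonal value plus 1 when the two symbols match, and otherwise the larger of the values above and to the left. The name lcs_length and the length limit are assumptions of this sketch.

```c
#include <string.h>

#define LCS_MAX_LEN 64  /* longest input this sketch accepts */

/* Returns len(m, n), the length of an LCS of a and b,
   or -1 if either string is longer than LCS_MAX_LEN. */
int lcs_length(const char *a, const char *b) {
    static int len[LCS_MAX_LEN + 1][LCS_MAX_LEN + 1];
    size_t m = strlen(a), n = strlen(b);
    if (m > LCS_MAX_LEN || n > LCS_MAX_LEN)
        return -1;

    /* Initializations: an empty prefix has an LCS of length 0 with anything. */
    for (size_t i = 0; i <= m; i++) len[i][0] = 0;
    for (size_t j = 0; j <= n; j++) len[0][j] = 0;

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            if (a[i - 1] == b[j - 1])
                len[i][j] = len[i - 1][j - 1] + 1;
            else if (len[i - 1][j] >= len[i][j - 1])
                len[i][j] = len[i - 1][j];
            else
                len[i][j] = len[i][j - 1];
        }
    }
    return len[m][n];
}
```

For the two examples in the slides, lcs_length("president", "providence") is 6 (priden) and lcs_length("algorithm", "alignment") is 4 (for instance algm).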
39. Pairwise Alignment
- Sequence A: CTTAACT
- Sequence B: CGGATCAT
- An alignment of A and B:
Sequence A: C---TTAACT
Sequence B: CGGATCA--T
40. Pairwise Alignment
- Sequence A: CTTAACT
- Sequence B: CGGATCAT
- An alignment of A and B:
Sequence A: C---TTAACT
Sequence B: CGGATCA--T
Column types labeled in the figure: mismatch, match, deletion gap, insertion gap.
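41. Dot Matrix
- Sequence A: CTTAACT
- Sequence B: CGGATCAT
[Figure: dot matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows]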
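42. Alignment Graph
- Sequence A: CTTAACT
- Sequence B: CGGATCAT
[Figure: alignment graph with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows]
Sequence A: C---TTAACT
Sequence B: CGGATCA--T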
43. A simple scoring scheme
- Match: 8 (w(x, y) = 8, if x = y)
- Mismatch: -5 (w(x, y) = -5, if x ≠ y)
- Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
A:      C   -   -   -   T   T   A   A   C   T
B:      C   G   G   A   T   C   A   -   -   T
Score:  8  -3  -3  -3   8  -5   8  -3  -3   8
Alignment score: 8 - 3 - 3 - 3 + 8 - 5 + 8 - 3 - 3 + 8 = 12
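The scoring scheme can be expressed as a short C sketch. The function names are hypothetical, gaps are written as '-', and both rows are assumed to have the same length.

```c
#include <stddef.h>

enum { W_MATCH = 8, W_MISMATCH = -5, W_GAP = -3 };

/* w(x, y) for a single alignment column; '-' denotes a gap symbol. */
int column_score(char x, char y) {
    if (x == '-' || y == '-')
        return W_GAP;
    return (x == y) ? W_MATCH : W_MISMATCH;
}

/* Sum of the column scores of two aligned rows of equal length.
   Scoring stops at the end of the shorter row if the lengths differ. */
int alignment_score(const char *row_a, const char *row_b) {
    int total = 0;
    for (size_t k = 0; row_a[k] != '\0' && row_b[k] != '\0'; k++)
        total += column_score(row_a[k], row_b[k]);
    return total;
}
```

With the alignment above, alignment_score("C---TTAACT", "CGGATCA--T") gives 12.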
44. An optimal alignment: the alignment of maximum
score
- Let A = a1 a2 ... am and B = b1 b2 ... bn.
- S_{i,j}: the score of an optimal alignment between
a1 a2 ... ai and b1 b2 ... bj.
- With proper initializations, S_{i,j} can be computed
by dynamic programming.
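45. Computing S_{i,j}
\[
S_{i,j} = \max
\begin{cases}
S_{i-1,j-1} + w(a_i, b_j) \\
S_{i-1,j} + w(a_i, -) \\
S_{i,j-1} + w(-, b_j)
\end{cases}
\]
The score of an optimal alignment of A and B is S_{m,n}.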
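A minimal C sketch of this tabular computation, assuming the simple scoring scheme of the slides (match 8, mismatch -5, gap -3) and borders initialized with one gap penalty per symbol. The function name and the length limit are choices made for this sketch.

```c
#include <limits.h>
#include <string.h>

#define NW_MAX_LEN 64  /* longest input this sketch accepts */
enum { NW_MATCH = 8, NW_MISMATCH = -5, NW_GAP = -3 };

/* Returns S[m][n], the optimal global alignment score of a and b,
   or INT_MIN if either input is longer than NW_MAX_LEN. */
int global_alignment_score(const char *a, const char *b) {
    static int S[NW_MAX_LEN + 1][NW_MAX_LEN + 1];
    size_t m = strlen(a), n = strlen(b);
    if (m > NW_MAX_LEN || n > NW_MAX_LEN)
        return INT_MIN;

    /* Initializations: a prefix aligned against nothing costs one gap per symbol. */
    S[0][0] = 0;
    for (size_t i = 1; i <= m; i++) S[i][0] = S[i - 1][0] + NW_GAP;
    for (size_t j = 1; j <= n; j++) S[0][j] = S[0][j - 1] + NW_GAP;

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            int w = (a[i - 1] == b[j - 1]) ? NW_MATCH : NW_MISMATCH;
            int best = S[i - 1][j - 1] + w;   /* a_i aligned with b_j */
            int up = S[i - 1][j] + NW_GAP;    /* a_i aligned with a gap */
            int left = S[i][j - 1] + NW_GAP;  /* b_j aligned with a gap */
            if (up > best) best = up;
            if (left > best) best = left;
            S[i][j] = best;
        }
    }
    return S[m][n];
}
```

For A = CTTAACT and B = CGGATCAT this yields 14, and for A = CAATTGA and B = GAATCTGC it yields 27.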
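46. Initializations
[Figure: initialized dynamic-programming matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows]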
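47. S_{3,5} = ?
- Match: 8 (w(x, y) = 8, if x = y)
- Mismatch: -5 (w(x, y) = -5, if x ≠ y)
- Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
[Figure: partially filled dynamic-programming matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows]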
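48. S_{3,5} = 5
- Match: 8 (w(x, y) = 8, if x = y)
- Mismatch: -5 (w(x, y) = -5, if x ≠ y)
- Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
[Figure: filled dynamic-programming matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows; the optimal score is marked]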
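49. Optimal alignment of A and B
A:      C   T   T   A   A   C   -   T
B:      C   G   G   A   T   C   A   T
Score:  8  -5  -5   8  -5   8  -3   8
Alignment score: 8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14
[Figure: dynamic-programming matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows]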
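The traceback step that recovers an alignment from the filled matrix can be sketched as below. This is only one way to do it: when several moves explain a cell, it prefers the diagonal move, then a gap in B, then a gap in A, so it reports one optimal alignment among possibly several. The name global_align, the length limit, and the output buffers are assumptions of this sketch.

```c
#include <string.h>

#define TB_MAX_LEN 64  /* longest input this sketch accepts */
enum { TB_MATCH = 8, TB_MISMATCH = -5, TB_GAP = -3 };

static int tb_w(char x, char y) { return (x == y) ? TB_MATCH : TB_MISMATCH; }

/* Writes one optimal global alignment into row_a and row_b (each must hold
   strlen(a) + strlen(b) + 1 chars) and returns its score. Returns 0 with
   empty rows if an input is longer than TB_MAX_LEN. */
int global_align(const char *a, const char *b, char *row_a, char *row_b) {
    static int S[TB_MAX_LEN + 1][TB_MAX_LEN + 1];
    size_t m = strlen(a), n = strlen(b), i, j, k = 0, lo, hi;
    row_a[0] = row_b[0] = '\0';
    if (m > TB_MAX_LEN || n > TB_MAX_LEN)
        return 0;

    /* Tabular computation with gap-penalty borders. */
    S[0][0] = 0;
    for (i = 1; i <= m; i++) S[i][0] = S[i - 1][0] + TB_GAP;
    for (j = 1; j <= n; j++) S[0][j] = S[0][j - 1] + TB_GAP;
    for (i = 1; i <= m; i++) {
        for (j = 1; j <= n; j++) {
            int best = S[i - 1][j - 1] + tb_w(a[i - 1], b[j - 1]);
            if (S[i - 1][j] + TB_GAP > best) best = S[i - 1][j] + TB_GAP;
            if (S[i][j - 1] + TB_GAP > best) best = S[i][j - 1] + TB_GAP;
            S[i][j] = best;
        }
    }

    /* Traceback from S[m][n]; columns come out in reverse order. */
    i = m;
    j = n;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 && S[i][j] == S[i - 1][j - 1] + tb_w(a[i - 1], b[j - 1])) {
            row_a[k] = a[--i];
            row_b[k] = b[--j];
        } else if (i > 0 && S[i][j] == S[i - 1][j] + TB_GAP) {
            row_a[k] = a[--i];
            row_b[k] = '-';
        } else {
            row_a[k] = '-';
            row_b[k] = b[--j];
        }
        k++;
    }
    row_a[k] = row_b[k] = '\0';

    /* Reverse both rows in place so they read left to right. */
    for (lo = 0, hi = k; lo + 1 < hi; lo++, hi--) {
        char t = row_a[lo]; row_a[lo] = row_a[hi - 1]; row_a[hi - 1] = t;
        t = row_b[lo]; row_b[lo] = row_b[hi - 1]; row_b[hi - 1] = t;
    }
    return S[m][n];
}
```

Called with a = "CTTAACT" and b = "CGGATCAT", the sketch returns 14 and fills the rows with CTTAAC-T and CGGATCAT.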
50. Now try this example in class
- Sequence A: CAATTGA
- Sequence B: GAATCTGC
- Their optimal alignment?
- Match: 8 (w(x, y) = 8, if x = y)
- Mismatch: -5 (w(x, y) = -5, if x ≠ y)
- Each gap symbol: -3 (w(-, x) = w(x, -) = -3)
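51. Initializations
[Figure: initialized dynamic-programming matrix with Sequence B (G A A T C T G C) across the columns and Sequence A (C A A T T G A) down the rows]
52. S_{4,2} = ?
[Figure: the same matrix, partially filled]
53. S_{5,5} = ?
[Figure: the same matrix, partially filled]
54. S_{5,5} = 14
[Figure: the same matrix, filled; the optimal score is marked]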
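55. Optimal alignment of A and B
A:      C   A   A   T   -   T   G   A
B:      G   A   A   T   C   T   G   C
Score: -5   8   8   8  -3   8   8  -5
Alignment score: -5 + 8 + 8 + 8 - 3 + 8 + 8 - 5 = 27
[Figure: dynamic-programming matrix with Sequence B (G A A T C T G C) across the columns and Sequence A (C A A T T G A) down the rows]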
56. Expressed Sequence Tags (ESTs)
- Expressed sequence tags (ESTs) are a set of
sequenced cDNAs reverse-transcribed from mRNA.
- ESTs are usually short (400 bp) and represent
only part of a gene.
57. Genome Annotation
- The large number of available ESTs is often used to identify
genes in the genome.
- e.g., 8,134,045 ESTs in human.
- http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html
[Figure: ESTs]
58. Global Alignment vs. Local Alignment
- Global alignment
- Local alignment
59. An optimal local alignment
- Mutations may have added too much noise to parts of the
sequences of interest.
- Local alignment avoids these regions altogether
and focuses on regions with a positive score.
- S_{i,j}: the score of an optimal local alignment
ending at ai and bj.
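A minimal C sketch of local alignment scoring in the style of the Smith-Waterman algorithm, assuming the same scoring scheme as before (match 8, mismatch -5, gap -3). The only changes from the global computation are that every cell is floored at 0, so a local alignment may start anywhere, and that the answer is the largest cell rather than the bottom-right one. The function name and the length limit are choices made for this sketch.

```c
#include <string.h>

#define SW_MAX_LEN 64  /* longest input this sketch accepts */
enum { SW_MATCH = 8, SW_MISMATCH = -5, SW_GAP = -3 };

/* Returns the best local alignment score of a and b,
   or -1 if either input is longer than SW_MAX_LEN. */
int local_alignment_score(const char *a, const char *b) {
    static int S[SW_MAX_LEN + 1][SW_MAX_LEN + 1];
    size_t m = strlen(a), n = strlen(b);
    int best = 0;
    if (m > SW_MAX_LEN || n > SW_MAX_LEN)
        return -1;

    /* Initializations: an empty local alignment scores 0. */
    for (size_t i = 0; i <= m; i++) S[i][0] = 0;
    for (size_t j = 0; j <= n; j++) S[0][j] = 0;

    for (size_t i = 1; i <= m; i++) {
        for (size_t j = 1; j <= n; j++) {
            int w = (a[i - 1] == b[j - 1]) ? SW_MATCH : SW_MISMATCH;
            int v = 0;  /* option: start a new local alignment here */
            if (S[i - 1][j - 1] + w > v) v = S[i - 1][j - 1] + w;
            if (S[i - 1][j] + SW_GAP > v) v = S[i - 1][j] + SW_GAP;
            if (S[i][j - 1] + SW_GAP > v) v = S[i][j - 1] + SW_GAP;
            S[i][j] = v;
            if (v > best) best = v;  /* the best score may sit in any cell */
        }
    }
    return best;
}
```

For A = CTTAACT and B = CGGATCAT the best local score is 18.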
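60. Local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: local-alignment matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows]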
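61. Local alignment
Match: 8, Mismatch: -5, Gap symbol: -3
[Figure: local-alignment matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows; the best score is marked]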
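62. Optimal local alignment of A and B
A:      A   -   C   -   T
B:      A   T   C   A   T
Score:  8  -3   8  -3   8
Alignment score: 8 - 3 + 8 - 3 + 8 = 18
[Figure: local-alignment matrix with Sequence B (C G G A T C A T) across the columns and Sequence A (C T T A A C T) down the rows; the best score is marked]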