Alignment Algorithms - PowerPoint PPT Presentation

1
Alignment Algorithms
Instructor: Yao-Ting Huang
Bioinformatics Laboratory, Department of Computer
Science and Information Engineering, National Chung
Cheng University.
2
Review of Big O Notation
  • Definition of Big-O
  • A function $f(n)$ is $O(g(n))$ if there exist
    positive constants $c$ and $n_0$ such that
    $f(n) \le c \cdot g(n)$ for all $n \ge n_0$.
  • e.g., $f(n) = 100n^2 + 30n + 2000$ is $O(n^2)$
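
As a worked check of the example above, one possible choice of constants is sketched below. The values $c = 2130$ and $n_0 = 1$ are picked here for illustration only; many other choices satisfy the definition.

```latex
\begin{align*}
f(n) &= 100n^2 + 30n + 2000 \\
     &\le 100n^2 + 30n^2 + 2000n^2 && \text{for all } n \ge 1 \\
     &= 2130\,n^2
\end{align*}
```

So $f(n) \le c \cdot g(n)$ holds with $g(n) = n^2$, $c = 2130$, and $n_0 = 1$.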

3
Big O Notation
A function $f(n)$ is $O(g(n))$ if there exist
positive constants $c$ and $n_0$ such that
$f(n) \le c \cdot g(n)$ for all $n \ge n_0$.
4
How do functions grow?
[Table: running time of each function for increasing n.]
(Assume one million operations per second.)
For large data sets, algorithms with a complexity
greater than O(n log n) are often impractical!
5
Practical Complexity
6
orz's sequence evolution
  • orz (kid)
  • OTZ (adult)
  • Orz (big head)
  • Crz (motorcycle driver)
  • or2 (bottom up)
  • oO (back high)
  • STO (the other way around)
  • the origin?
  • their evolutionary relationships?

Chaos illustration
7
orz's sequence evolution
orz
Orz
Crz
or2
OTZ
oO
STO
8
Homology
  • How to measure the homology (similarity) of two
    sequences/genes/species?

AGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTA
AGAGCTATGCCATTTATATAAAATTTAAAGGCGCGGGGGGTTAAGAGGCG
CGGGGGGTTAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAAGC
TATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAA
AATTTAAAGAGGCGCGGGGGGTTAAGAGCTATGCCATTTATATAAAATTT
AAAGCGTAAGAGCTATGCCATTTATTTATATAAAATTTAAAGATACAATC
TAAAGTTAAAGCGTAAGAGCTATGCAGGCGCGGGGGGTTAAGAGCTATGC
CATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTT
AAAGAGGCGCGGGGGGT
AAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGGGCGCGGGGGG
TTAAGAGCTATGCCATTAGGCGCGGGGGGTTAAGAGCTATGCCATTTATA
TAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAATTTAT
ATAAAATTTAAAGCGTAAGAGCTATGCCATTTATATAAAATTTAAAGAAA
GCGTAAGAGCTATGCCATTTATATAAAATTTAAAGGGCGCGGGGGGTTAA
GAGCTATGCCATTTATATAAGGGGGTTAAGAGCTATGCCATTTTAAGAGC
TATGCCATTTATATAAAATTTAAAGCGTAAGAAGCTATGCCATTTATATA
AAATTTAAGAGCTATGCCATTTATATAAAATTTAAAGAGGCGCGGGGGGT
TAAGAGCTATGCCATTTATATAAAATTTAAAGCGTAAGAGCTATGCCATT
TATATAAAATTTAAGCTATGCAGGCGCGGGGAGCTGGGTTTATATAAAAT
TTA
9
Genetic Variants
  • Genetic variants decrease homology of sequences.
  • Point mutations, insertions/deletions, and
    inversions.
  • It is difficult to measure homology in the
    presence of these genetic variants.

AAGAAGCTATGCC - G - G -
AAGAGACTATGCCGG
AA - - AGCTATGCCAGTGA
AAAGCTATGCCAGTGA
Alignment
10
Genetic Variants
Variations observed in a population
Mutations over time
Disease Mutation
Common Ancestor
time
present
11
Milestones of Alignment Algorithms
  • 1970: Needleman-Wunsch algorithm
  • 1981: Smith-Waterman algorithm
  • 1985: FASTP/FASTN, fast sequence similarity
    searching
  • 1990: BLAST, fast sequence similarity searching
  • 2001: MUMmer

12
The Impact of Alignment Algorithms
  • The BLAST paper is one of the most cited papers so far.
  • Biologists may know nothing about algorithms, but
    they probably know BLAST.

13
Dynamic Programming
  • Dynamic programming is a classical method for
    solving sequential decision problems with a
    compositional cost structure.
  • Alignment algorithms are originally derived from
    dynamic programming.

14
Two key ingredients
  • Two key ingredients make an optimization problem
    suitable for a dynamic-programming solution:

1. Optimal substructures: each substructure is optimal
   (principle of optimality).
2. Overlapping subproblems: subproblems are dependent
   (otherwise, a divide-and-conquer approach is the choice).
15
Three Basic Components
  • A dynamic-programming algorithm has three basic
    components:
  • The recurrence relation (for defining the value
    of an optimal solution)
  • The tabular computation (for computing the value
    of an optimal solution)
  • The traceback (for delivering an optimal
    solution).

16
Fibonacci Numbers
  • Fibonacci numbers $F(n)$, for $n = 0, 1, 2, \ldots$, are
  • 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144,
    233, ...
  • Sunflowers have spirals in pairs of adjacent
    Fibonacci numbers in the two directions (e.g.,
    34 and 55) [Wikipedia].

17
Recursion
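  • Formal definition: $F(0) = 0$, $F(1) = 1$, and
    $F(n) = F(n-1) + F(n-2)$ for $n \ge 2$, giving
    0, 1, 1, 2, 3, 5, 8, 13, 21, 34, ...

int fib(int n) {
    if (n == 0 || n == 1) return n;  /* base cases: F(0) = 0, F(1) = 1 */
    return fib(n - 1) + fib(n - 2);  /* F(n) = F(n-1) + F(n-2) */
}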
18
Recursion
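int fib(int n) {
    if (n == 0 || n == 1) return n;  /* base cases: F(0) = 0, F(1) = 1 */
    return fib(n - 1) + fib(n - 2);  /* F(n) = F(n-1) + F(n-2) */
}
[Recursion tree for fib(100): the number of calls per level grows as 1 = 2^0, 2 = 2^1, 4 = 2^2, 8 = 2^3, ..., down to a depth of about 100.]
The running time is exponential in n.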
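
To make the exponential growth concrete, here is a minimal sketch in C that counts how many calls the naive recursion makes. The names `fib_naive`, `fib_calls`, and `count_fib_calls` are hypothetical helpers introduced for this illustration, and the base cases assume F(0) = 0 and F(1) = 1.

```c
#include <assert.h>

/* Number of calls made so far; a hypothetical counter added for illustration. */
static unsigned long fib_calls = 0;

/* Naive recursive Fibonacci, assuming F(0) = 0 and F(1) = 1. */
unsigned long fib_naive(int n) {
    assert(n >= 0);
    fib_calls++;                                   /* count every invocation */
    if (n == 0 || n == 1) return (unsigned long)n; /* base cases */
    return fib_naive(n - 1) + fib_naive(n - 2);    /* two recursive calls */
}

/* Reset the counter, evaluate F(n), and report how many calls were needed. */
unsigned long count_fib_calls(int n) {
    fib_calls = 0;
    (void)fib_naive(n);
    return fib_calls;
}
```

Under these assumptions, `count_fib_calls(10)` reports 177 calls and `count_fib_calls(20)` reports 21891, so ten more steps in n cost over a hundred times more calls.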
19
Tabular Computation
  • The tabular computation avoids recomputation.
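
A minimal sketch of the tabular idea in C, assuming F(0) = 0 and F(1) = 1: each table entry is computed exactly once, so the work is linear in n. The name `fib_table` and the cap `FIB_MAX_N` are assumptions made for this illustration.

```c
#include <assert.h>

#define FIB_MAX_N 90  /* assumed cap: F(90) still fits in 64 unsigned bits */

/* Bottom-up Fibonacci: fill the table from F(0) upward, one entry at a time. */
unsigned long long fib_table(int n) {
    unsigned long long table[FIB_MAX_N + 1];
    assert(n >= 0 && n <= FIB_MAX_N);
    table[0] = 0;
    table[1] = 1;
    for (int i = 2; i <= n; i++) {
        table[i] = table[i - 1] + table[i - 2];  /* reuse stored values */
    }
    return table[n];
}
```

For example, `fib_table(50)` evaluates to 12586269025 after only 49 additions.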

20
Locating Conserved Genes
  • Genes conserved in two different species usually
    have the same orientation.
  • e.g., MUMmer (http://mummer.sourceforge.net/).

21
Locating Conserved Genes
  • Genes conserved in two different species usually
    have the same orientation.
  • This problem is often formulated to find the
    longest increasing subsequence (LIS).

1 2 3 4 5 6 7 8
1 5 6 2 3 4 7 8
LIS: 1, 2, 3, 4, 7, 8
22
Longest Increasing Subsequence (LIS)
  • The longest increasing subsequence (LIS) problem is to find a
    longest increasing subsequence of a given
    sequence of distinct integers $a_1 a_2 \ldots a_n$.
  • e.g. 9 2 5 3 7 11 8 10 13 6
  • 3 7, 7 10 13, and 7 11 are increasing subsequences.
  • 3 5 11 13 is not an increasing subsequence of it
    (5 comes before 3 in the sequence).

We want to find a longest one.
23
A naive approach for LIS
  • Let $L_i$ be the length of a longest increasing
    subsequence ending at position $i$.

$L_i = 1 + \max_{j = 0, \ldots, i-1} \{ L_j \mid a_j < a_i \}$ (use a
dummy $a_0$ = minimum, and $L_0 = 0$)
$a_i$:  9  2  5  3  7 11  8 10 13  6
$L_i$:  1  1  2  2  3  4  ?
24
A naive approach for LIS
  • $L_i = 1 + \max_{j = 0, \ldots, i-1} \{ L_j \mid a_j < a_i \}$

$a_i$:  9  2  5  3  7 11  8 10 13  6
$L_i$:  1  1  2  2  3  4  4  5  6  3
The maximum length is 6.
The subsequence 2, 3, 7, 8, 10, 13 is a longest
increasing subsequence. This method runs in $O(n^2)$
time.
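
The recurrence above can be sketched in C as follows. This is an illustrative version, not necessarily the one used in class: the name `lis_quadratic` and the cap `LIS_MAX_N` are assumptions, and starting every `len[i]` at 1 plays the role of the dummy element with length 0.

```c
#include <assert.h>

#define LIS_MAX_N 1000  /* assumed upper bound on the input length */

/* Length of a longest increasing subsequence of a[0..n-1] in O(n^2) time.
   len[i] is the length of a longest increasing subsequence ending at a[i]. */
int lis_quadratic(const int *a, int n) {
    int len[LIS_MAX_N];
    int best = 0;
    assert(n >= 0 && n <= LIS_MAX_N);
    for (int i = 0; i < n; i++) {
        len[i] = 1;                          /* a[i] on its own */
        for (int j = 0; j < i; j++) {
            if (a[j] < a[i] && len[j] + 1 > len[i]) {
                len[i] = len[j] + 1;         /* extend the best one ending at j */
            }
        }
        if (len[i] > best) best = len[i];    /* track the maximum length */
    }
    return best;
}
```

On the slide's input 9 2 5 3 7 11 8 10 13 6 this returns 6.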
25
An O(n log n) method for LIS
  • Define BestEnd[k] as the smallest ending number of any
    increasing subsequence of length k found so far.

Input        9  2  5  3  7 11  8 10 13  6
BestEnd[1]   9  2  2  2
BestEnd[2]         5  3
26
An O(n log n) method for LIS
  • Define BestEnd[k] as the smallest ending number of any
    increasing subsequence of length k found so far.

Input        9  2  5  3  7 11  8 10 13  6
BestEnd[1]   9  2  2  2  2  2  2
BestEnd[2]         5  3  3  3  3
BestEnd[3]               7  7  7
BestEnd[4]                 11  8
27
An O(n log n) method for LIS
  • Define BestEnd[k] as the smallest ending number of any
    increasing subsequence of length k found so far.

Input        9  2  5  3  7 11  8 10 13  6
BestEnd[1]   9  2  2  2  2  2  2  2  2
BestEnd[2]         5  3  3  3  3  3  3
BestEnd[3]               7  7  7  7  7
BestEnd[4]                 11  8  8  8
BestEnd[5]                       10 10
BestEnd[6]                          13
28
An O(n log n) method for LIS
  • Define BestEnd[k] to be the smallest ending number of any
    increasing subsequence of length k found so far.

Input        9  2  5  3  7 11  8 10 13  6
BestEnd[1]   9  2  2  2  2  2  2  2  2  2
BestEnd[2]         5  3  3  3  3  3  3  3
BestEnd[3]               7  7  7  7  7  6
BestEnd[4]                 11  8  8  8  8
BestEnd[5]                       10 10 10
BestEnd[6]                          13 13
For each position, we perform a binary search to
update BestEnd.
29
Binary search
  • Given an ordered sequence $x_1 x_2 \ldots x_n$, where
    $x_1 < x_2 < \cdots < x_n$, and a number $y$, a binary search
    finds the largest $x_i$ such that $x_i < y$ in $O(\log n)$
    time.

[Figure: the search range shrinks from n to n/2 to n/4, ...]
30
Binary search
  • How many steps does a binary search need to reduce the
    problem size to 1? n, n/2, n/4, n/8, n/16,
    ..., 1

How many steps? $O(\log n)$ steps.
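
A minimal C sketch of this search, assuming a strictly increasing array with 0-based indices. The name `largest_below` is hypothetical, and returning -1 when no element is smaller than y is a convention chosen here.

```c
#include <assert.h>

/* Index of the largest x[i] with x[i] < y in a strictly increasing array
   x[0..n-1], or -1 if no element is smaller than y. Runs in O(log n) time. */
int largest_below(const int *x, int n, int y) {
    int lo = 0, hi = n;  /* invariant: x[0..lo-1] < y and x[hi..n-1] >= y */
    assert(n >= 0);
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;   /* halve the remaining range */
        if (x[mid] < y) lo = mid + 1;
        else hi = mid;
    }
    return lo - 1;
}
```

For the array 2 3 7 8 10 13 and y = 6, the function returns index 1, which is the position of 3.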
31
An O(n log n) method for LIS
  • Define BestEnd[k] to be the smallest ending number of any
    increasing subsequence of length k found so far.

Input        9  2  5  3  7 11  8 10 13  6
BestEnd[1]   9  2  2  2  2  2  2  2  2  2
BestEnd[2]         5  3  3  3  3  3  3  3
BestEnd[3]               7  7  7  7  7  6
BestEnd[4]                 11  8  8  8  8
BestEnd[5]                       10 10 10
BestEnd[6]                          13 13
For each position, we perform a binary search to
update BestEnd. Therefore, the running time is
$O(n \log n)$.
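
The BestEnd method can be sketched in C as below. This is one possible implementation under stated assumptions: the name `lis_fast` and the caller-supplied `best_end` buffer are choices made for this illustration, and slot `best_end[k-1]` stores BestEnd for length k.

```c
#include <assert.h>

/* Length of a longest increasing subsequence in O(n log n) time.
   best_end must have room for n ints; on return, best_end[k-1] holds the
   smallest ending number of an increasing subsequence of length k. */
int lis_fast(const int *a, int n, int *best_end) {
    int len = 0;                        /* number of BestEnd slots in use */
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        int lo = 0, hi = len;
        /* Binary search for the first slot whose value is >= a[i]. */
        while (lo < hi) {
            int mid = lo + (hi - lo) / 2;
            if (best_end[mid] < a[i]) lo = mid + 1;
            else hi = mid;
        }
        best_end[lo] = a[i];            /* a[i] is now the best end for length lo + 1 */
        if (lo == len) len++;           /* a[i] extended the longest subsequence */
    }
    return len;
}
```

On the input 9 2 5 3 7 11 8 10 13 6 the function returns 6 and leaves 2 3 6 8 10 13 in `best_end`, matching the last column of the table.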
32
LCS
  • The longest common subsequence problem is to find
    a maximum-length common subsequence between two
    sequences.
  • Sequence 1: president
  • Sequence 2: providence
  • Its LCS is priden.
33
LCS
  • Another example
  • Sequence 1: algorithm
  • Sequence 2: alignment
  • One of its LCSs is algm.

a l g o r i t h m
a l i g n m e n t
34
How to compute LCS?
  • Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
  • $len(i, j)$ = the length of an LCS between
    $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
  • With proper initializations, $len(i, j)$ can be
    computed as follows.
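
The recurrence itself is not in this transcript, so the C sketch below assumes the standard one: len(i, 0) = len(0, j) = 0; if a_i = b_j then len(i, j) = len(i-1, j-1) + 1; otherwise len(i, j) = max(len(i-1, j), len(i, j-1)). The name `lcs_length` and the cap `LCS_MAX_LEN` are assumptions made for this illustration.

```c
#include <assert.h>
#include <string.h>

#define LCS_MAX_LEN 256  /* assumed cap on sequence length for this sketch */

/* Length of a longest common subsequence of a and b in O(mn) time. */
int lcs_length(const char *a, const char *b) {
    static int len[LCS_MAX_LEN + 1][LCS_MAX_LEN + 1];
    assert(strlen(a) <= LCS_MAX_LEN && strlen(b) <= LCS_MAX_LEN);
    int m = (int)strlen(a), n = (int)strlen(b);
    for (int i = 0; i <= m; i++) len[i][0] = 0;   /* empty prefix of b */
    for (int j = 0; j <= n; j++) len[0][j] = 0;   /* empty prefix of a */
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            if (a[i - 1] == b[j - 1]) {
                len[i][j] = len[i - 1][j - 1] + 1;      /* extend a common subsequence */
            } else if (len[i - 1][j] > len[i][j - 1]) {
                len[i][j] = len[i - 1][j];              /* skip a_i */
            } else {
                len[i][j] = len[i][j - 1];              /* skip b_j */
            }
        }
    }
    return len[m][n];
}
```

With this sketch, `lcs_length("president", "providence")` is 6 (priden) and `lcs_length("algorithm", "alignment")` is 4 (for example algm).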

39
Pairwise Alignment
  • Sequence A: CTTAACT
  • Sequence B: CGGATCAT
  • An alignment of A and B:

Sequence A: C---TTAACT
Sequence B: CGGATCA--T
40
Pairwise Alignment
  • Sequence A: CTTAACT
  • Sequence B: CGGATCAT
  • An alignment of A and B:

Sequence A: C---TTAACT
Sequence B: CGGATCA--T
[Figure labels: match, mismatch, insertion gap, deletion gap.]
41
Dot Matrix
  • Sequence A: CTTAACT
  • Sequence B: CGGATCAT

[Dot matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT).]
42
Alignment Graph
  • Sequence A: CTTAACT; Sequence B: CGGATCAT

[Alignment graph: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT).]
C---TTAACT
CGGATCA--T
43
A simple scoring scheme
  • Match: 8 ($w(x, y) = 8$ if $x = y$)
  • Mismatch: -5 ($w(x, y) = -5$ if $x \ne y$)
  • Each gap symbol: -3 ($w(-, x) = w(x, -) = -3$)

C - - - T T A A C T
C G G A T C A - - T
8 - 3 - 3 - 3 + 8 - 5 + 8 - 3 - 3 + 8 = 12
Alignment score: 12
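
The scoring scheme can be sketched in C as follows. The names `column_score` and `alignment_score` are hypothetical, and the sketch assumes the two rows are given as equal-length strings with `-` as the gap symbol.

```c
#include <assert.h>
#include <string.h>

enum { MATCH = 8, MISMATCH = -5, GAP = -3 };  /* scoring scheme from the slide */

/* w(x, y): score of one alignment column; '-' denotes a gap symbol. */
int column_score(char x, char y) {
    if (x == '-' || y == '-') return GAP;
    return x == y ? MATCH : MISMATCH;
}

/* Sum of the column scores of two gapped rows of equal length. */
int alignment_score(const char *row_a, const char *row_b) {
    size_t len = strlen(row_a);
    int total = 0;
    assert(strlen(row_b) == len);     /* both rows must have the same width */
    for (size_t k = 0; k < len; k++) {
        total += column_score(row_a[k], row_b[k]);
    }
    return total;
}
```

For the alignment above, `alignment_score("C---TTAACT", "CGGATCA--T")` gives 12.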
44
An optimal alignment: the alignment of maximum
score
  • Let $A = a_1 a_2 \ldots a_m$ and $B = b_1 b_2 \ldots b_n$.
  • $S_{i,j}$ = the score of an optimal alignment between
    $a_1 a_2 \ldots a_i$ and $b_1 b_2 \ldots b_j$.
  • With proper initializations, $S_{i,j}$ can be computed
    by dynamic programming.

45
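Computing $S_{i,j}$

$S_{i,j} = \max \{ S_{i-1,j-1} + w(a_i, b_j),\; S_{i-1,j} + w(a_i, -),\; S_{i,j-1} + w(-, b_j) \}$
The optimal alignment score is $S_{m,n}$.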
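
A minimal C sketch of the tabular computation for global alignment, using the scoring scheme from the slides (match 8, mismatch -5, gap -3). The boundary values S(i,0) = -3i and S(0,j) = -3j are an assumption of this sketch, as are the names `global_alignment_score`, `max3`, and `ALIGN_MAX_LEN`.

```c
#include <assert.h>
#include <string.h>

#define ALIGN_MAX_LEN 256  /* assumed cap on sequence length for this sketch */

enum { MATCH = 8, MISMATCH = -5, GAP = -3 };  /* scoring scheme from the slides */

static int max3(int x, int y, int z) {
    int m = x > y ? x : y;
    return m > z ? m : z;
}

/* Score of an optimal global alignment of a and b in O(mn) time. */
int global_alignment_score(const char *a, const char *b) {
    static int S[ALIGN_MAX_LEN + 1][ALIGN_MAX_LEN + 1];
    assert(strlen(a) <= ALIGN_MAX_LEN && strlen(b) <= ALIGN_MAX_LEN);
    int m = (int)strlen(a), n = (int)strlen(b);
    S[0][0] = 0;
    for (int i = 1; i <= m; i++) S[i][0] = S[i - 1][0] + GAP;  /* assumed boundary */
    for (int j = 1; j <= n; j++) S[0][j] = S[0][j - 1] + GAP;  /* assumed boundary */
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int w = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            S[i][j] = max3(S[i - 1][j - 1] + w,   /* align a_i with b_j */
                           S[i - 1][j] + GAP,     /* align a_i with a gap */
                           S[i][j - 1] + GAP);    /* align b_j with a gap */
        }
    }
    return S[m][n];
}
```

Under these assumptions, `global_alignment_score("CTTAACT", "CGGATCAT")` is 14, and the prefixes CTT and CGGAT give 5.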
46
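Initializations
$S_{0,0} = 0$, $S_{i,0} = -3i$, $S_{0,j} = -3j$ (each leading gap symbol scores -3)

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT).]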
47
$S_{3,5}$ = ?
  • Match: 8 ($w(x, y) = 8$ if $x = y$)
  • Mismatch: -5 ($w(x, y) = -5$ if $x \ne y$)
  • Each gap symbol: -3 ($w(-, x) = w(x, -) = -3$)

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT).]
48
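$S_{3,5} = 5$
  • Match: 8 ($w(x, y) = 8$ if $x = y$)
  • Mismatch: -5 ($w(x, y) = -5$ if $x \ne y$)
  • Each gap symbol: -3 ($w(-, x) = w(x, -) = -3$)

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT); the optimal score is marked.]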
49
C T T A A C - T
C G G A T C A T
8 - 5 - 5 + 8 - 5 + 8 - 3 + 8 = 14

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT).]
50
Now try this example in class
  • Sequence A: CAATTGA
  • Sequence B: GAATCTGC
  • Their optimal alignment?
  • Match: 8 ($w(x, y) = 8$ if $x = y$)
  • Mismatch: -5 ($w(x, y) = -5$ if $x \ne y$)
  • Each gap symbol: -3 ($w(-, x) = w(x, -) = -3$)

51
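Initializations

[Matrix: rows are Sequence A (CAATTGA); columns are Sequence B (GAATCTGC).]
52
$S_{4,2}$ = ?

[Matrix: rows are Sequence A (CAATTGA); columns are Sequence B (GAATCTGC).]
53
$S_{5,5}$ = ?

[Matrix: rows are Sequence A (CAATTGA); columns are Sequence B (GAATCTGC).]
54
$S_{5,5} = 14$

[Matrix: rows are Sequence A (CAATTGA); columns are Sequence B (GAATCTGC); the optimal score is marked.]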
55
C A A T - T G A
G A A T C T G C
-5 + 8 + 8 + 8 - 3 + 8 + 8 - 5 = 27

[Matrix: rows are Sequence A (CAATTGA); columns are Sequence B (GAATCTGC).]
56
Expressed Sequence Tags (ESTs)
  • Expressed sequence tags (ESTs) are a set of
    sequenced cDNAs reverse-transcribed from mRNA.
  • ESTs are usually short (400bp) and only
    represent parts of a gene.

57
Genome Annotation
  • The large number of ESTs is often used to identify
    genes in the genome.
  • e.g., 8,134,045 ESTs in human.
  • http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html

ESTs
58
Global Alignment vs. Local Alignment
  • Global alignment
  • Local alignment

59
An optimal local alignment
  • Mutations may have added too much noise in the
    sequences of interest.
  • Local alignment avoids these regions altogether
    and focuses on those with a positive score.
  • $S_{i,j}$ = the score of an optimal local alignment
    ending at $a_i$ and $b_j$
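
The local-alignment recurrence is not in this transcript, so the C sketch below assumes the usual Smith-Waterman form: each cell takes the maximum of 0 and the three global-alignment choices, and the answer is the largest cell in the table. It uses the slides' scoring scheme (match 8, mismatch -5, gap -3); the names `local_alignment_score` and `LOCAL_MAX_LEN` are assumptions made for this illustration.

```c
#include <assert.h>
#include <string.h>

#define LOCAL_MAX_LEN 256  /* assumed cap on sequence length for this sketch */

enum { MATCH = 8, MISMATCH = -5, GAP = -3 };  /* scoring scheme from the slides */

/* Best local alignment score between a and b in O(mn) time.
   S[i][j] is the best score of a local alignment ending at a[i-1] and b[j-1];
   it is never allowed to drop below 0 (assumed Smith-Waterman recurrence). */
int local_alignment_score(const char *a, const char *b) {
    static int S[LOCAL_MAX_LEN + 1][LOCAL_MAX_LEN + 1];
    assert(strlen(a) <= LOCAL_MAX_LEN && strlen(b) <= LOCAL_MAX_LEN);
    int m = (int)strlen(a), n = (int)strlen(b);
    int best = 0;
    for (int i = 0; i <= m; i++) S[i][0] = 0;
    for (int j = 0; j <= n; j++) S[0][j] = 0;
    for (int i = 1; i <= m; i++) {
        for (int j = 1; j <= n; j++) {
            int w = (a[i - 1] == b[j - 1]) ? MATCH : MISMATCH;
            int s = 0;                                        /* start a new alignment */
            if (S[i - 1][j - 1] + w > s) s = S[i - 1][j - 1] + w;
            if (S[i - 1][j] + GAP > s) s = S[i - 1][j] + GAP;
            if (S[i][j - 1] + GAP > s) s = S[i][j - 1] + GAP;
            S[i][j] = s;
            if (s > best) best = s;                           /* best cell anywhere */
        }
    }
    return best;
}
```

Under these assumptions, `local_alignment_score("CTTAACT", "CGGATCAT")` is 18.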

60
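local alignment
Match: 8, Mismatch: -5, Gap symbol: -3

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT).]
61
local alignment
Match: 8, Mismatch: -5, Gap symbol: -3

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT); the best score is marked.]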
62
A - C - T
A T C A T
8 - 3 + 8 - 3 + 8 = 18

[Matrix: rows are Sequence A (CTTAACT); columns are Sequence B (CGGATCAT); the best score is marked.]