Sequence Alignment

Slides: 67

Provided by: NirF59

Transcript and Presenter's Notes

1
Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Definition: Given two strings x = x_1 x_2 ... x_M and
y = y_1 y_2 ... y_N, an alignment is an assignment of
gaps to positions 0, ..., M in x and 0, ..., N in y,
so as to line up each letter in one sequence
with either a letter or a gap in the other
sequence
2
Sequence Comparison

Much of bioinformatics involves sequences
DNA sequences
RNA sequences
Protein sequences
We can think of these sequences as strings of
letters
DNA and RNA: alphabet Σ of 4 letters
Protein: alphabet Σ of 20 letters

3
Sequence Comparison

Finding similarity between sequences is important
for many biological questions
During evolution, biological sequences evolve (mutation,
deletion, duplication, addition, movement of
subsequences)
Homologous sequences (those that share a common ancestor)
are (relatively) similar
Algorithms try to detect similar sequences that
possibly share a common function

4
Sequence Comparison (cont)

For example:
Find similar proteins
Allows us to predict function and structure
Locate similar subsequences in DNA
Allows us to identify, e.g., regulatory elements
Locate DNA sequences that might overlap
Helps in sequence assembly

5
Sequence Alignment

Input: two sequences over the same alphabet
Output: an alignment of the two sequences
Example:
GCGCATGGATTGAGCGA
TGCGCCATTGATGACCA
A possible alignment:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

6
Alignments

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Three elements
Matches
Mismatches
Insertions and deletions (indels)

7
Choosing Alignments

There are many possible alignments
For example, compare
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?

8
Scoring Alignments

Intuition
Similar sequences evolved from a common ancestor
Evolution changed the sequences from this
ancestral sequence by mutations
Substitution: one letter replaced by another
Deletion: deletion of a letter
Insertion: insertion of a letter
Scoring of sequence similarity should examine how
many and which operations took place

9
Simple Scoring Rule

Score each position independently:
Match: m = +1
Mismatch: s = -1
Indel: d = -2
Score of an alignment is the sum of position scores
Scoring Function
Match: m, with m ≥ 0
Mismatch: s, with s ≤ 0
Gap: d, with d ≤ 0
Score F = (# matches) × m + (# mismatches) × s
+ (# gaps) × d
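
The simple scoring rule can be sketched in a few lines of Python. This is only an illustration: the function name score_alignment is made up here, and the default values m = 1, s = -1, d = -2 are the ones from this slide.

```python
def score_alignment(top, bottom, m=1, s=-1, d=-2):
    """Sum the position scores of two gapped rows of equal length."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have the same length")
    total = 0
    for a, b in zip(top, bottom):
        if a == "-" and b == "-":
            raise ValueError("a column cannot pair a gap with a gap")
        if a == "-" or b == "-":
            total += d  # indel
        elif a == b:
            total += m  # match
        else:
            total += s  # mismatch
    return total


print(score_alignment("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A"))
```

For the alignment shown, the columns contain 13 matches, 2 mismatches and 4 indels, so the call prints 3.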

10
Example

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Score = (1 × 13) + (-1 × 2) + (-2 × 4) = 3
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Score = (1 × 5) + (-1 × 6) + (-2 × 12) = -25

11
More General Scores

The choice of +1, -1, and -2 scores is quite
arbitrary
Depending on the context, some changes are more
plausible than others
Exchange of an amino acid by one with similar
properties (size, charge, etc.)
Exchange of an amino acid by one with opposite
properties
Probabilistic interpretation: e.g., how likely
is one alignment versus another?

12
Additive Scoring Rules

We define a scoring function by specifying a
function σ(·,·):
σ(x,y) is the score of replacing x by y
σ(x,-) is the score of deleting x
σ(-,x) is the score of inserting x
The score of an alignment is the sum of position
scores

13
How do we compute the best alignment?
[Figure: alignment matrix with the sequences AGTGCCCTGGAACCCTGACGGTGGGTCACAAAACTTCTGGA and AGTGACCTGGGAAGACCCTGACCCTGGGTCACAAAACTC along its axes]

An alignment is a path from cell (0,0) to cell
(M,N)
Too many possible alignments:
O(2^(M+N))

14
The Optimal Score

The optimal alignment score between two sequences
is the maximal score over all alignments of these
sequences
Computing the maximal score or actually finding
an alignment that yields the maximal score are
closely related tasks with similar algorithms.
We now address these two problems.

15
Alignment is additive

Observation:
The score of aligning x_1...x_M
with y_1...y_N
is additive
Say that x_1...x_i  x_{i+1}...x_M
aligns to y_1...y_j  y_{j+1}...y_N
The two scores add up:
V(x[1..M], y[1..N]) = V(x[1..i], y[1..j])
+ V(x[i+1..M], y[j+1..N])

16
Dynamic Programming

We will now describe a dynamic programming
algorithm
Suppose we wish to align
x_1...x_M
y_1...y_N
Let
V(i,j) = optimal score of aligning
x_1...x_i
y_1...y_j

17
Dynamic Programming

Notice three possible cases:
1. x_i aligns to y_j
x_1...x_{i-1} x_i
y_1...y_{j-1} y_j
V(i,j) = V(i-1, j-1) + m, if x_i = y_j
V(i,j) = V(i-1, j-1) + s, if not
2. x_i aligns to a gap
x_1...x_{i-1} x_i
y_1...y_j     -
V(i,j) = V(i-1, j) + d
3. y_j aligns to a gap
x_1...x_i     -
y_1...y_{j-1} y_j
V(i,j) = V(i, j-1) + d
18
Dynamic Programming

How do we know which case is correct?
Inductive assumption:
V(i, j-1), V(i-1, j), V(i-1, j-1) are optimal
Then,
V(i, j) = max { V(i-1, j-1) + s(x_i, y_j),
                V(i-1, j) + d,
                V(i, j-1) + d }
where s(x_i, y_j) = m, if x_i = y_j; s, if not
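
As a rough sketch, the recurrence can be turned into a table-filling loop as below. The function name global_score is hypothetical, and the base cases V(i,0) = i × d and V(0,j) = j × d are assumed here (a prefix aligned against nothing consists only of gaps). The scores m = 1, s = -1, d = -2 are the ones from the simple scoring rule.

```python
def global_score(x, y, m=1, s=-1, d=-2):
    """Optimal global alignment score V(M, N) under the simple scoring rule."""
    M, N = len(x), len(y)
    V = [[0] * (N + 1) for _ in range(M + 1)]
    # Assumed base cases: a prefix aligned against nothing is all gaps.
    for i in range(1, M + 1):
        V[i][0] = i * d
    for j in range(1, N + 1):
        V[0][j] = j * d
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            V[i][j] = max(
                V[i - 1][j - 1] + sub,  # x_i aligns to y_j
                V[i - 1][j] + d,        # x_i aligns to a gap
                V[i][j - 1] + d,        # y_j aligns to a gap
            )
    return V[M][N]


print(global_score("AAAC", "AGC"))
```

For AAAC and AGC the best achievable score is -1 (two matches, one mismatch, one indel).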

19
Recursive Argument

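Define the notation
V[i,j] = d(s[1..i], t[1..j])
Using our recursive argument, we get the
following recurrence for V:
V[i+1,j+1] = max { V[i,j] + σ(s[i+1], t[j+1]),
                   V[i,j+1] + σ(s[i+1], -),
                   V[i+1,j] + σ(-, t[j+1]) }
where σ is the additive scoring function defined earlier

[Figure: the four neighbouring matrix cells V[i,j], V[i+1,j], V[i,j+1], V[i+1,j+1]]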
20
Recursive Argument

Of course, we also need to handle the base cases
in the recursion

[Figure: base-case example, aligning AA against gaps (- -)]
We fill the matrix using the recurrence rule
21
Dynamic Programming Algorithm
We continue to fill the matrix using the
recurrence rule
22
Dynamic Programming Algorithm
[Figure: matrix cells V[0,0], V[0,1], V[1,0], V[1,1]]
25
Reconstructing the Best Alignment

To reconstruct the best alignment, we record
which case(s) in the recursive rule maximized the
score

26
Reconstructing the Best Alignment

We now trace back a path that corresponds to the
best alignment

27
Reconstructing the Best Alignment

Sometimes, more than one alignment has the best
score

AAAC
A-GC
28
The Needleman-Wunsch Matrix
[Figure: matrix with x_1 ... x_M along one axis and y_1 ... y_N along the other]
Every nondecreasing path from (0,0) to (M, N)
corresponds to an alignment of the two
sequences
An optimal alignment is composed of optimal
subalignments
29
The Needleman-Wunsch Algorithm: Global Alignment
Algorithm

Initialization.
F(0, 0) = 0
F(0, j) = j × d
F(i, 0) = i × d
Main Iteration. Filling in partial alignments
For each i = 1...M
For each j = 1...N
F(i, j) = max { F(i-1, j-1) + s(x_i, y_j)   [case 1]
                F(i-1, j) + d               [case 2]
                F(i, j-1) + d               [case 3] }
Ptr(i, j) = DIAG, if case 1
            LEFT, if case 2
            UP,   if case 3
Termination. F(M, N) is the optimal score, and
from Ptr(M, N) we can trace back the optimal alignment
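
The three phases above (initialization, main iteration with pointers, traceback from Ptr(M, N)) might look as follows in Python. This is a minimal sketch, not a reference implementation: the function name is made up, ties are broken in favour of case 1, and the pointers along the borders (LEFT in column 0, UP in row 0) are an assumption the slide does not spell out.

```python
DIAG, LEFT, UP = "DIAG", "LEFT", "UP"


def needleman_wunsch(x, y, m=1, s=-1, d=-2):
    """Return (optimal score, gapped x, gapped y) for a global alignment."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = i * d, LEFT  # assumed border pointer
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = j * d, UP    # assumed border pointer
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            cases = [
                (F[i - 1][j - 1] + sub, DIAG),  # case 1
                (F[i - 1][j] + d, LEFT),        # case 2
                (F[i][j - 1] + d, UP),          # case 3
            ]
            F[i][j], ptr[i][j] = max(cases, key=lambda c: c[0])
    # Trace back from Ptr(M, N) to (0, 0), building the rows in reverse.
    top, bottom = [], []
    i, j = M, N
    while i > 0 or j > 0:
        if ptr[i][j] == DIAG:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif ptr[i][j] == LEFT:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("AAAC", "AGC")
print(score)
print(top)
print(bottom)
```

Only one optimal alignment is returned. Recording every maximizing case instead of a single pointer would allow all of them to be enumerated.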

30
Time Complexity

Space: O(mn)
Time: O(mn)
Filling the matrix: O(mn)
Backtrace: O(m+n)

31
Space Complexity

In real-life applications, n and m can be very
large
The space requirement of O(mn) can be too
demanding
If m = n = 1000, we need 1MB of space
If m = n = 10000, we need 100MB of space
We can afford to perform extra computation to
save space
Looping over a million operations takes less than
a second on modern workstations
Can we trade time for space?

32
Why Do We Need So Much Space?
To compute V[n,m] = d(s[1..n], t[1..m]), we need
only O(min(n,m)) space

Compute V(i,j) column by column, storing only
two columns in memory (or row by row if rows
are shorter).

Note, however, that
this trick fails when we need to reconstruct
the optimizing sequence.
Traceback information requires O(mn) memory
bytes.
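
A sketch of the two-row trick, assuming the simple scoring rule (match 1, mismatch -1, indel -2) and the usual gap-only base cases. The function name is hypothetical. Swapping the inputs so that the stored row follows the shorter sequence is valid here only because this scoring is symmetric in the two sequences.

```python
def global_score_linear_space(s, t, m=1, mis=-1, d=-2):
    """Optimal global score using O(min(n, m)) memory (score only)."""
    if len(t) > len(s):
        s, t = t, s  # keep the stored rows as short as possible
    prev = [j * d for j in range(len(t) + 1)]  # row 0
    for i in range(1, len(s) + 1):
        curr = [i * d] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = m if s[i - 1] == t[j - 1] else mis
            curr[j] = max(prev[j - 1] + sub, prev[j] + d, curr[j - 1] + d)
        prev = curr  # the older row is no longer needed
    return prev[-1]


print(global_score_linear_space("AAAC", "AGC"))
```

Only the score is returned. No pointers are kept, so the alignment itself cannot be read back from this version.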

33
Space Efficient Version Outline
Input: Sequences s[1,n] and t[1,m] to be
aligned. Idea: perform divide and conquer

If n = 1, align s[1,1] and t[1,m]
Else, find position (n/2, j) at which some best
alignment crosses a midpoint

Construct alignments
A = s[1,n/2] vs t[1,j]
B = s[n/2+1,n] vs t[j+1,m]
Return AB

34
Finding the Midpoint

The score of the best alignment that goes through
j equals
d(s[1,n/2], t[1,j]) + d(s[n/2+1,n], t[j+1,m])
Thus, we need to compute these two quantities for
all values of j

35
Finding the Midpoint (Algorithm)

Define
F[i,j] = d(s[1,i], t[1,j])
B[i,j] = d(s[i+1,n], t[j+1,m])
F[i,j] + B[i,j] = score of best alignment through
(i,j)
We compute F[i,j] as we did before
We compute B[i,j] in exactly the same manner,
going backward from B[n,m]
Requires linear space complexity
because there is no need to keep traceback
information in this step
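
The midpoint step can be sketched as follows. The names last_row and find_midpoint are made up for illustration, and the simple scoring rule (1, -1, -2) is assumed. The backward scores B are obtained by running the same forward computation on the reversed strings, which works because this scoring does not depend on reading direction.

```python
def last_row(s, t, m=1, mis=-1, d=-2):
    """Scores of aligning all of s against every prefix of t, in O(len(t)) space."""
    prev = [j * d for j in range(len(t) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * d] + [0] * len(t)
        for j in range(1, len(t) + 1):
            sub = m if s[i - 1] == t[j - 1] else mis
            curr[j] = max(prev[j - 1] + sub, prev[j] + d, curr[j - 1] + d)
        prev = curr
    return prev


def find_midpoint(s, t):
    """Return (j, score): where a best alignment crosses row n/2, and its score."""
    half = len(s) // 2
    forward = last_row(s[:half], t)               # F[n/2, j] for every j
    backward = last_row(s[half:][::-1], t[::-1])  # B[n/2, j], indexed by suffix length
    m_len = len(t)
    best_j = max(range(m_len + 1), key=lambda j: forward[j] + backward[m_len - j])
    return best_j, forward[best_j] + backward[m_len - best_j]


print(find_midpoint("AAAC", "AGC"))
```

The score returned for the best j equals the optimal global score, which gives a simple consistency check against last_row(s, t)[-1].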

36
Time Complexity Analysis

Time to find a midpoint: cnm (c is a constant)
Size of recursive sub-problems is (n/2, j) and
(n/2, m-j-1), hence
T(n,m) = cnm + T(n/2, j) + T(n/2, m-j-1)
Lemma: T(n,m) ≤ 2cnm

Proof (by induction): T(n,m) ≤ cnm + 2c(n/2)j +
2c(n/2)(m-j-1) ≤ 2cnm.
Thus, time complexity is linear in the size of the
problem. At worst, twice the cost of the regular
solution.
37
A variant of the NW algorithm

Maybe it is OK to have an unlimited number of gaps at
the beginning and end

----------CTATCACCTGACCTCCAGGCCGATGCCCCTTCCGGC
GCGAGTTCATCTATCAC--GACCGC--GGTCG--------------

Then, we don't want to penalize gaps at the ends

38
Different types of overlaps
39
The Overlap Detection variant

Changes:
Initialization
For all i, j:
V(i, 0) = 0
V(0, j) = 0
Termination
V_OPT = max { max_i V(i, N), max_j V(M, j) }

[Figure: matrix with x_1 ... x_M along one axis and y_1 ... y_N along the other]
40
Overlap Alignment Example
s = PAWHEAE, t = HEAGAWGHEE

Scoring system:
Match: +4
Mismatch: -1
Indel: -5

41
Overlap Alignment

Initialization: V[i,0] = 0, V[0,j] = 0

Recurrence: as in global alignment
Score: maximum value in the bottom row and
rightmost column of the matrix
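
A small sketch of the overlap variant, using the scoring system of the PAWHEAE / HEAGAWGHEE example (match 4, mismatch -1, indel -5). The function name overlap_score is hypothetical. Only the score is computed, with no traceback.

```python
def overlap_score(x, y, m=4, s=-1, d=-5):
    """Best overlap score: gaps at either end of the alignment are free."""
    M, N = len(x), len(y)
    # Row 0 and column 0 stay at 0: leading gaps are not penalized.
    V = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            V[i][j] = max(V[i - 1][j - 1] + sub, V[i - 1][j] + d, V[i][j - 1] + d)
    # Trailing gaps are free too: take the best cell in the
    # bottom row or the rightmost column.
    bottom_row = max(V[M][j] for j in range(N + 1))
    right_column = max(V[i][N] for i in range(M + 1))
    return max(bottom_row, right_column)


print(overlap_score("PAWHEAE", "HEAGAWGHEE"))
```

With this scoring the call prints 11, the score of overlapping the suffix HEAE of s with the prefix HEAG of t (three matches and one mismatch).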

44
Overlap Alignment Example
The best overlap is
PAWHEAE------
---HEAGAWGHEE

Pay attention! A different scoring system could
yield a different result, such as
---PAW-HEAE
HEAGAWGHEE-

45
The local alignment problem

Given two strings x = x_1...x_M,
y = y_1...y_N
Find substrings x', y' whose similarity
(optimal global alignment value)
is maximum
e.g. x = aaaacccccgggg
y = cccgggaaccaacc

46
Why local alignment

Genes are shuffled between genomes
Portions of proteins (domains) are often conserved

47
Cross-species genome similarity

98% of genes are conserved between any two
mammals
>70% average similarity in protein sequence

hum_a  GTTGACAATAGAGGGTCTGGCAGAGGCTC---------------------  @ 57331/400001
mus_a  GCTGACAATAGAGGGGCTGGCAGAGGCTC---------------------  @ 78560/400001
rat_a  GCTGACAATAGAGGGGCTGGCAGAGACTC---------------------  @ 112658/369938
fug_a  TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG  @ 36008/68174

hum_a  CTGGCCGCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 57381/400001
mus_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 78610/400001
rat_a  CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG  @ 112708/369938
fug_a  TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG  @ 36058/68174

hum_a  AGCGCACTCTCCTTTCAGGCAGCTCCCCGGGGAGCTGTGCGGCCACATTT  @ 57431/400001
mus_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT  @ 78659/400001
rat_a  AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGCGCGGCCACATTT  @ 112757/369938
fug_a  AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC  @ 36084/68174

hum_a  AACACCATCATCACCCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 57481/400001
mus_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 78708/400001
rat_a  AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG  @ 112806/369938
fug_a  CCGAGGACCCTGA-------------------------------------  @ 36097/68174

atoh enhancer in human, mouse, rat, fugu fish
48
Local Alignment

As before, we use dynamic programming
We now want to set V[i,j] to record the score of the best
alignment of a suffix of s[1..i] and a suffix of
t[1..j]
How should we change the recurrence rule?
Same as before but with an option to start afresh
The result is called the Smith-Waterman algorithm

49
The Smith-Waterman algorithm

Idea: Ignore badly aligning regions
Modifications to Needleman-Wunsch:
Initialization: V(0, j) = V(i, 0) = 0
Iteration: V(i, j) = max { 0,
                           V(i-1, j) + d,
                           V(i, j-1) + d,
                           V(i-1, j-1) + s(x_i, y_j) }

50
The Smith-Waterman algorithm

Termination:
If we want the best local alignment:
V_OPT = max_{i,j} V(i, j)
If we want all local alignments scoring > t:
For all i, j, find V(i, j) > t, and trace back
Complicated by overlapping local alignments
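
A compact sketch of the Smith-Waterman steps for the single best local alignment. The scores match 1, mismatch -1, indel -2 are assumed from the simple scoring rule, and the function name is made up. Instead of storing pointers, the traceback re-derives which case produced each cell and stops at the first zero cell.

```python
def smith_waterman(x, y, m=1, s=-1, d=-2):
    """Return (best local score, gapped substring of x, gapped substring of y)."""
    M, N = len(x), len(y)
    V = [[0] * (N + 1) for _ in range(M + 1)]  # V(0, j) = V(i, 0) = 0
    best, best_i, best_j = 0, 0, 0
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            V[i][j] = max(0, V[i - 1][j - 1] + sub, V[i - 1][j] + d, V[i][j - 1] + d)
            if V[i][j] > best:
                best, best_i, best_j = V[i][j], i, j
    # Trace back from the best cell until the score drops to 0.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and V[i][j] > 0:
        sub = m if x[i - 1] == y[j - 1] else s
        if V[i][j] == V[i - 1][j - 1] + sub:
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif V[i][j] == V[i - 1][j] + d:
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("TAATA", "TACTAA"))
```

Reporting all local alignments above a threshold t would need extra bookkeeping to deal with the overlapping alignments mentioned above. This sketch does not attempt it.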

51
Local Alignment

New option
We can start a new match instead of extending a
previous alignment

Alignment of empty suffixes
52
Local Alignment Example
s = TAATA, t = TACTAA
57
Alignment with gaps

Observation: Insertions and deletions often occur
in blocks longer than a single nucleotide.

Consequence: The standard alignment scoring
studied in lecture, which gives a constant penalty
d per gap unit, does not model this
phenomenon well. Hence, a better gap score model is
needed.

Question: Can you think of an appropriate change
to the scoring system for gaps?

58
Scoring the gaps more accurately

Current model:
Gap of length n
incurs penalty n × d
However, gaps usually occur in bunches
Convex gap penalty function
γ(n):
for all n, γ(n+1) - γ(n) ≤ γ(n) - γ(n-1)

[Figure: plots of γ(n) against n]
59
Convex gap alignment

Initialization: same
Iteration:
V(i, j) = max { V(i-1, j-1) + s(x_i, y_j),
                max_{k=0...i-1} V(k, j) - γ(i-k),
                max_{k=0...j-1} V(i, k) - γ(j-k) }
(γ is a non-negative penalty, so it is subtracted)
Termination: same
Running Time: O(N^2 M) (assume N > M)
Space: O(NM)
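
The iteration with an arbitrary gap function can be sketched as below. This is only an illustration: gamma is passed in as a Python function returning a non-negative penalty, the borders are assumed to be V(i,0) = -gamma(i) and V(0,j) = -gamma(j), and the name align_general_gap is made up.

```python
def align_general_gap(x, y, gamma, m=1, s=-1):
    """Global score where a gap of length n costs gamma(n) (cubic-time sketch)."""
    M, N = len(x), len(y)
    V = [[0] * (N + 1) for _ in range(M + 1)]
    for i in range(1, M + 1):
        V[i][0] = -gamma(i)  # assumed border: one gap of length i
    for j in range(1, N + 1):
        V[0][j] = -gamma(j)  # assumed border: one gap of length j
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            best = V[i - 1][j - 1] + sub
            for k in range(i):  # gap of length i-k ending at x_i
                best = max(best, V[k][j] - gamma(i - k))
            for k in range(j):  # gap of length j-k ending at y_j
                best = max(best, V[i][k] - gamma(j - k))
            V[i][j] = best
    return V[M][N]


# A linear penalty of 2 per position reproduces the simple rule with d = -2.
print(align_general_gap("AAAC", "AGC", lambda n: 2 * n))
# A hypothetical affine-shaped penalty: 3 to open, 1 per further position.
print(align_general_gap("ACGT", "AT", lambda n: 3 + (n - 1)))
```

The two inner loops over k are what push the running time to cubic order.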

60
Compromise: affine gaps

γ(n) = d + (n - 1) × e
where d is the gap open penalty and e is the gap extend penalty
[Figure: plot of γ(n) with d and e marked]
To compute the optimal alignment,
at position i,j, we need to remember the best score if
the gap is open, and the
best score if the gap is not open
F(i, j): score of alignment x_1...x_i to y_1...y_j
if x_i aligns to y_j
G(i, j): score if x_i aligns to a gap after y_j
H(i, j): score if y_j aligns to a gap after x_i
V(i, j) = best score of alignment x_1...x_i to
y_1...y_j

61
Needleman-Wunsch with affine gaps

Why do we need two matrices?
1. x_i aligns to y_j
x_1...x_{i-1} x_i x_{i+1}
y_1...y_{j-1} y_j  -
Add -d
2. x_i aligns to a gap
x_1...x_{i-1} x_i x_{i+1}
y_1...y_j      -   -
Add -e
62
Needleman-Wunsch with affine gaps

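Initialization: V(i, 0) = -(d + (i - 1) × e)
V(0, j) = -(d + (j - 1) × e)
(d and e are positive penalties, so they are subtracted)
Iteration:
V(i, j) = max { F(i, j), G(i, j), H(i, j) }
F(i, j) = V(i-1, j-1) + s(x_i, y_j)
G(i, j) = max { V(i-1, j) - d,
                G(i-1, j) - e }
H(i, j) = max { V(i, j-1) - d,
                H(i, j-1) - e }
Termination: similar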
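
A sketch of the affine-gap iteration in Python, assuming d and e are positive penalties (hypothetical values d = 3, e = 1 below) and match/mismatch scores of 1 and -1. G tracks alignments ending with x_i against a gap and H those ending with y_j against a gap. The function name and the use of minus infinity for impossible states are choices made here.

```python
def affine_global_score(x, y, m=1, s=-1, d=3, e=1):
    """Global alignment score with gap penalty gamma(n) = d + (n - 1) * e."""
    NEG = float("-inf")
    M, N = len(x), len(y)
    V = [[NEG] * (N + 1) for _ in range(M + 1)]
    G = [[NEG] * (N + 1) for _ in range(M + 1)]  # x_i aligned to a gap
    H = [[NEG] * (N + 1) for _ in range(M + 1)]  # y_j aligned to a gap
    V[0][0] = 0
    for i in range(1, M + 1):
        G[i][0] = V[i][0] = -(d + (i - 1) * e)
    for j in range(1, N + 1):
        H[0][j] = V[0][j] = -(d + (j - 1) * e)
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            sub = m if x[i - 1] == y[j - 1] else s
            F = V[i - 1][j - 1] + sub                      # x_i aligns to y_j
            G[i][j] = max(V[i - 1][j] - d, G[i - 1][j] - e)  # open or extend
            H[i][j] = max(V[i][j - 1] - d, H[i][j - 1] - e)  # open or extend
            V[i][j] = max(F, G[i][j], H[i][j])
    return V[M][N]


print(affine_global_score("ACGT", "AT"))
```

For ACGT against AT, a single gap of length 2 costs 3 + 1 = 4, so the best score is 1 + 1 - 4 = -2. Setting e equal to d reduces this to the linear gap model.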

63
To generalize a little

Think of how you would compute the optimal
alignment with this gap function

[Figure: plot of the gap function γ(n)]
... in time O(MN)
64
Remark: Edit Distance

Instead of speaking about the score of an
alignment, one often talks about an edit distance
between two sequences, defined to be the cost
of the cheapest set of edit operations needed
to transform one sequence into the other.
Cheapest operation is no change
Next cheapest operation is replace
The most expensive operation is add space.
Our goal is now to minimize the cost of
operations, which is exactly what we actually
did.
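
As an illustration of the minimizing view, here is a short edit-distance sketch. The costs are assumptions: no change costs 0, and the replace and add-space costs are parameters that default to 1 (the classic unit-cost edit distance). The function name edit_distance is made up.

```python
def edit_distance(a, b, replace_cost=1, space_cost=1):
    """Cheapest total cost of edit operations turning a into b."""
    prev = [j * space_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * space_cost] + [0] * len(b)
        for j in range(1, len(b) + 1):
            change = 0 if a[i - 1] == b[j - 1] else replace_cost
            curr[j] = min(
                prev[j - 1] + change,      # no change, or replace
                prev[j] + space_cost,      # space in b
                curr[j - 1] + space_cost,  # space in a
            )
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))
print(edit_distance("AAAC", "AGC", replace_cost=1, space_cost=2))
```

The recurrence has the same shape as the alignment recurrence, with min in place of max and costs in place of scores.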

65
Where do scoring rules come from?

We have defined an additive scoring function by
specifying a function σ(·,·) such that
σ(x,y) is the score of replacing x by y
σ(x,-) is the score of deleting x
σ(-,x) is the score of inserting x
But how do we come up with the correct score?

Answer: By encoding experience of what are
similar sequences for the task at hand.
66
Probabilistic Interpretation of Scores

We define the scoring function via
σ(a, b) = log [ p(a, b) / (q(a) q(b)) ]
Then, the score of an alignment is the log-ratio
between the two models
Score > 0 => Model is more likely
Score < 0 => Random is more likely

67
Modeling Assumptions

It is important to note that this interpretation
depends on our modeling assumption!!
For example, if we assume that the letter in each
position depends on the letter in the preceding
position, then the likelihood ratio will have a
different form.

68
Constructing Scoring Rules

The formula
σ(a, b) = log [ p(a, b) / (q(a) q(b)) ]
suggests how to construct a scoring rule:
Estimate p(·,·) and q(·) from the data
Compute σ(a,b) based on the estimated p(·,·) and
q(·)
How to estimate these parameters is the subject
matter of parameter estimation in statistics.
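
A toy sketch of the two steps, assuming the log-odds form σ(a,b) = log[p(a,b) / (q(a) q(b))] and plain frequency counts as the estimator. The aligned pairs below are made-up data for illustration only, and the function name is hypothetical. Real matrices are built from large curated alignments.

```python
import math
from collections import Counter


def log_odds_scores(aligned_pairs):
    """Estimate p(a,b) and q(a) by counting, then return sigma as a dict."""
    pair_counts = Counter()
    letter_counts = Counter()
    for a, b in aligned_pairs:
        pair_counts[(a, b)] += 1
        pair_counts[(b, a)] += 1  # count both orders so sigma is symmetric
        letter_counts[a] += 1
        letter_counts[b] += 1
    total_pairs = sum(pair_counts.values())
    total_letters = sum(letter_counts.values())
    q = {a: c / total_letters for a, c in letter_counts.items()}
    sigma = {}
    for (a, b), count in pair_counts.items():
        p_ab = count / total_pairs
        sigma[(a, b)] = math.log(p_ab / (q[a] * q[b]))
    return sigma


# Made-up aligned columns: mostly identical letters, a few substitutions.
toy_pairs = [("A", "A")] * 8 + [("C", "C")] * 8 + [("A", "C")] * 2
scores = log_odds_scores(toy_pairs)
print(scores[("A", "A")], scores[("A", "C")])
```

Pairs seen more often than chance get a positive score and pairs seen less often get a negative one. Pairs that never occur in the data have no entry here; handling them (for example with pseudocounts) is left out of this sketch.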

69
Substitution matrix

There exist several matrices based on this scoring
scheme, differing in the way the statistics are
computed
The two major ones are PAM and BLOSUM
PAM 1 corresponds to statistics computed from
global alignments of proteins with at most 1%
mutations
Other PAM matrices (up to PAM 250) are extrapolated
by matrix products
BLOSUM 62 corresponds to statistics from local
alignments with 62% similarity.
Other BLOSUM matrices are built from other
alignments

PAM100 => BLOSUM90
PAM120 => BLOSUM80
PAM160 => BLOSUM60
PAM200 => BLOSUM52
PAM250 => BLOSUM45
