Title: Pairwise Sequence Alignment
1Pair-wise Sequence Alignment
- What happened to the sequences of similar genes?
- random mutation
- deletion, insertion
- What is pair-wise sequence alignment?
Seq. 1  515  EVIRMQDNNPFSFQSDVYSYG
             EVI      P     DV SY
Seq. 2  451  EVI---EHKPYNHKADVFSYA
2Some concepts
3Dotplot
- A simplified representation
- What dotplot does not show
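A dotplot can be sketched in a few lines. This is a minimal illustration only: it marks identical symbols with `*` and uses no window or threshold filtering, and the function name `dotplot` is an assumption made for this example.

```python
def dotplot(seq_a, seq_b):
    """Return one text row per symbol of seq_b; '*' marks identical symbols."""
    return ["".join("*" if a == b else "." for a in seq_a) for b in seq_b]


if __name__ == "__main__":
    # Sequence 1 runs along the columns, sequence 2 along the rows.
    print("  ACAGTAG")
    for symbol, row in zip("ACTCG", dotplot("ACAGTAG", "ACTCG")):
        print(symbol, row)
```

Runs of `*` along a diagonal correspond to stretches that match without gaps.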
4Sequence Alignment
- Total number of possible alignments for length n
- approximately $2^{2n} / \sqrt{\pi n}$ (Stirling's approximation of $\binom{2n}{n}$)
- Dynamic programming
- a method for some optimization problems
- determine a scoring scheme
- best solution based on a scoring scheme
- Needleman-Wunsch - global
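To get a feel for how fast the number of alignments grows, the sketch below compares an exact count with its closed-form estimate. It assumes the count for two sequences of length n is the binomial coefficient C(2n, n), whose Stirling approximation is 2^(2n) / sqrt(pi n); the function names are made up for this illustration.

```python
import math


def exact_count(n):
    """Number of alignments, assuming it equals the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n):
    """Stirling-based estimate of C(2n, n): 2^(2n) / sqrt(pi * n)."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


if __name__ == "__main__":
    for n in (5, 10, 50, 100):
        print(n, exact_count(n), f"{stirling_estimate(n):.3e}")
```

Even for n = 100 the count is far too large to enumerate, which is why dynamic programming is used instead.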
5- Questions
- How does it work?
- How to come up with a DP approach
- to an exponential problem?
- How to implement a DP approach?
6Dynamic Programming Algorithm
- Break a problem into subproblems
- Solve each subproblem separately
$$F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$
F(i,j): the max score for aligning the first i symbols of sequence 1 with the first j symbols of sequence 2
s(x_i, y_j): substitution score for aligning x_i with y_j
g: gap penalty
7Example
Match = 1, Mismatch = 0, Gap = -1
ACAGTAG
ACTCG
- Initialization
- matrix filling (scoring)
- Trace back
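The three steps (initialization, matrix filling, trace back) can be sketched as follows. This is a minimal illustration of global alignment with a linear gap penalty, using the scoring values of this example as defaults; the function name and the tie-breaking order in the traceback (diagonal, then up, then left) are assumptions, so it returns only one of possibly several optimal alignments.

```python
def needleman_wunsch(x, y, match=1, mismatch=0, gap=-1):
    """Return (score, aligned_x, aligned_y) for a global alignment of x and y."""
    m, n = len(x), len(y)

    # Initialization: aligning a prefix with nothing costs one gap per symbol.
    F = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        F[i][0] = i * gap
    for j in range(1, n + 1):
        F[0][j] = j * gap

    # Matrix filling: F[i][j] is the max of the three predecessor moves.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)

    # Trace back from the bottom-right cell to the top-left cell.
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        s = match if i > 0 and j > 0 and x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    score, ax, ay = needleman_wunsch("ACAGTAG", "ACTCG")
    print(score)
    print(ax)
    print(ay)
```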
8Group Discussion 9/23/08 Due in class
- Fill up the table in the previous example
- What is the optimal alignment in this example?
9DP table for the example (to be filled in)
j        0  1  2  3  4  5  6  7
            A  C  A  G  T  A  G
i=0
i=1  A
i=2  C
i=3  T
i=4  C
i=5  G
10Local Alignment: Smith-Waterman
$$F(i,j) = \max \begin{cases} 0 \\ F(i-1,j-1) + s(x_i, y_j) \\ F(i,j-1) + g \\ F(i-1,j) + g \end{cases}$$
11Local Alignment
   A A C C T A T A G C T
G
C
G
A
T
A
T
A
Sequences: AACCTATAGCT and GCGATATA
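A local alignment of these two sequences can be sketched as below. The slide does not state a scoring scheme for this example, so match = 1, mismatch = -1 and gap = -2 are assumptions chosen only for illustration, as is the function name. The only changes from the global method are the extra 0 in the max and a traceback that starts at the largest cell and stops at the first 0.

```python
def smith_waterman(x, y, match=1, mismatch=-1, gap=-2):
    """Return (score, aligned_x, aligned_y) for one best local alignment."""
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_pos = 0, (0, 0)

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0, F[i - 1][j - 1] + s, F[i][j - 1] + gap, F[i - 1][j] + gap)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)

    # Trace back from the largest cell until a cell with score 0 is reached.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] + gap:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    print(smith_waterman("AACCTATAGCT", "GCGATATA"))
```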
12Issues in alignment
- Different ways to fill the table
- Multiple optimal alignments
- s(xi, yj) from substitution matrix
- gap penalty
- linear: $w(k) = gk$
- affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
13Gap models
- New gap vs. gap extension
- A gap of length k vs. k gaps of length 1
- 1 insertion / deletion event vs. k events
- gap penalty
- linear: $w(k) = gk$
- affine: $w(k) = \begin{cases} h + gk, & k \ge 1 \\ 0, & k = 0 \end{cases}$
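The difference between the two gap models shows up when one gap of length k is compared with k gaps of length 1. The sketch below uses h = -3 and g = -1 purely as assumed example values (penalties are negative numbers that are added to the score, as in the example above).

```python
def linear_gap(k, g=-1):
    """Linear model: w(k) = g * k."""
    return g * k


def affine_gap(k, h=-3, g=-1):
    """Affine model: w(k) = h + g * k for k >= 1, and 0 for k = 0."""
    return 0 if k == 0 else h + g * k


if __name__ == "__main__":
    k = 3
    print("linear: one gap of length 3:", linear_gap(k), "| three gaps of length 1:", k * linear_gap(1))
    print("affine: one gap of length 3:", affine_gap(k), "| three gaps of length 1:", k * affine_gap(1))
```

Under the linear model both cases cost the same; under the affine model a single long gap is cheaper than several short ones, which matches the idea of one insertion / deletion event.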
14Affine Gap Penalty
- What is wrong with the F(i,j) formula if an affine gap penalty (AGP) is used?
- Aligning the first i symbols of x with the first j symbols of y
M(i, j): best score when x_i is aligned with y_j
Ix(i, j): best score when x_i is aligned with a gap
Iy(i, j): best score when y_j is aligned with a gap
15DP for global alignment for AGP
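$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \end{cases}$$
$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \\ I_x(i-1,j) + g \end{cases}$$
$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \\ I_y(i,j-1) + g \end{cases}$$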
16DP for global alignment using AGP
- Initialization
- M(0, 0) = 0
- Ix(i, 0) = h + g·i
- Iy(0, j) = h + g·j
- all other cases: -∞
- Traceback starts at the largest of the three final elements M(m, n), Ix(m, n), Iy(m, n)
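The three-matrix scheme for global alignment with an affine gap penalty can be sketched as below. This is a score-only illustration (no traceback); the default values match = 1, mismatch = 0, h = -2, g = -1 and the function name are assumptions made for this example.

```python
NEG_INF = float("-inf")


def affine_global_score(x, y, match=1, mismatch=0, h=-2, g=-1):
    """Best global alignment score with gap penalty w(k) = h + g*k (k >= 1)."""
    m, n = len(x), len(y)
    # M: x_i aligned with y_j; Ix: x_i aligned with a gap; Iy: y_j aligned with a gap.
    M = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Ix = [[NEG_INF] * (n + 1) for _ in range(m + 1)]
    Iy = [[NEG_INF] * (n + 1) for _ in range(m + 1)]

    # Initialization: only leading gaps are possible in the first row and column.
    M[0][0] = 0
    for i in range(1, m + 1):
        Ix[i][0] = h + g * i
    for j in range(1, n + 1):
        Iy[0][j] = h + g * j

    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1], Ix[i - 1][j - 1], Iy[i - 1][j - 1]) + s
            # Opening a gap costs h + g; extending an existing gap costs g.
            Ix[i][j] = max(M[i - 1][j] + h + g, Iy[i - 1][j] + h + g, Ix[i - 1][j] + g)
            Iy[i][j] = max(M[i][j - 1] + h + g, Ix[i][j - 1] + h + g, Iy[i][j - 1] + g)

    # The optimal alignment may end in any of the three states.
    return max(M[m][n], Ix[m][n], Iy[m][n])


if __name__ == "__main__":
    print(affine_global_score("ACAGTAG", "ACTCG"))
```

Setting h = 0 reduces the scheme to the linear gap penalty, which gives a quick consistency check against the single-matrix F(i,j) recurrence.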
17DP for local alignment for AGP
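$$M(i,j) = \max \begin{cases} M(i-1,j-1) + s(x_i, y_j) \\ I_x(i-1,j-1) + s(x_i, y_j) \\ I_y(i-1,j-1) + s(x_i, y_j) \\ 0 \end{cases}$$
$$I_x(i,j) = \max \begin{cases} M(i-1,j) + h + g \\ I_y(i-1,j) + h + g \quad \text{(ignored)} \\ I_x(i-1,j) + g \end{cases}$$
$$I_y(i,j) = \max \begin{cases} M(i,j-1) + h + g \\ I_x(i,j-1) + h + g \quad \text{(ignored)} \\ I_y(i,j-1) + g \end{cases}$$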
18DP for Local Alignment for AGP
- Initialization
- M(0, 0) = 0
- Ix(i, 0) = 0
- Iy(0, j) = 0
- all other cases: -∞
- Traceback starts at the largest of all M(i, j), Ix(i, j), Iy(i, j)
19Database searching methods
- Idea: regions that are similar are likely to share short identical subsequences
- Quick search for such regions, then check carefully locally
20FASTA related methods
1. Quick initial guess: common subsequences
- Word, word size (2, 6), sensitivity vs. speed
- Which words in the query also occur in the target?
- Pre-computed table that stores the locations of words (hashing)
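The word lookup table can be sketched with an ordinary hash map. This is an illustration of the idea only, not the FASTA implementation; the function names `build_word_index` and `shared_words` are made up for this example.

```python
from collections import defaultdict


def build_word_index(target, k):
    """Map every word of length k in the target to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(target) - k + 1):
        index[target[pos:pos + k]].append(pos)
    return index


def shared_words(query, index, k):
    """Return (query_pos, target_pos) for every word of the query found in the index."""
    hits = []
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            hits.append((q, t))
    return hits


if __name__ == "__main__":
    index = build_word_index("GCGATATA", 2)
    print(shared_words("TATA", index, 2))
```

A smaller word size finds more hits (more sensitive) but gives more candidates to process (slower).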
21FASTA related methods
2. Find the regions with a high density of common words
- Process diagonals, rescore, and join regions using gaps
3. Local alignment (DP) in the region identified
- Use Smith-Waterman method
- in a band, 32 aa wide around the best score
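Step 2 relies on the fact that word hits from the same gap-free match share the same diagonal (target position minus query position). The sketch below only counts hits per diagonal; rescoring, joining regions and the banded DP of step 3 are left out, and the function names are assumptions made for this example.

```python
from collections import Counter, defaultdict


def word_hits(query, target, k):
    """Return (query_pos, target_pos) pairs of identical words of length k."""
    index = defaultdict(list)
    for t in range(len(target) - k + 1):
        index[target[t:t + k]].append(t)
    return [(q, t)
            for q in range(len(query) - k + 1)
            for t in index.get(query[q:q + k], [])]


def diagonal_counts(query, target, k):
    """Count word hits on each diagonal d = target_pos - query_pos."""
    return Counter(t - q for q, t in word_hits(query, target, k))


if __name__ == "__main__":
    counts = diagonal_counts("TATA", "GCGATATA", 2)
    best = max(counts, key=counts.get)
    print(counts, "best diagonal:", best)
```

The diagonal with the most hits marks the region around which a banded Smith-Waterman alignment would then be run.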
22Limitation of FASTA
- Can miss biologically significant similarity
- some proteins do not share identical amino acids
- they are missed in the initial step
- Different codons can encode the same amino acid, so DNA sequences for the same protein may differ
23BLAST
- Previous two kinds of approaches
- Theoretically sound
- search for common subsequences
- 1. Word list
- Incorporate similarity measurement for words
- PAM120
- e.g. ACDE
- Scan for word occurrences
- hash table
- Finite state machine
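Building the word list with a similarity measure can be sketched as below. A real search would score words with a substitution matrix such as PAM120; here a toy score (match = 2, mismatch = -1) stands in for it, and the function names, the threshold value and the four-letter alphabet are assumptions made only for this illustration.

```python
from itertools import product


def word_score(w1, w2, match=2, mismatch=-1):
    """Toy similarity score; a stand-in for summing substitution-matrix entries."""
    return sum(match if a == b else mismatch for a, b in zip(w1, w2))


def neighborhood(word, alphabet, threshold):
    """All words of the same length that score at least `threshold` against `word`."""
    candidates = ("".join(c) for c in product(alphabet, repeat=len(word)))
    return [w for w in candidates if word_score(word, w) >= threshold]


if __name__ == "__main__":
    # With this toy score, the list holds the word itself and all one-letter variants.
    print(neighborhood("ACD", "ACDE", threshold=3))
```

The target is then scanned for any word in this list, so similar but non-identical words can also seed a match.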
24BLAST
- 2. Extend words to HSPs (high-scoring segment pairs, i.e. locally optimal pairs)
- Find additional words within threshold
- Merge within distance A
3. Select significant HSPs, use DP in banded
region
25Multiple Sequence Alignment
- How do we extend knowledge of pair-wise alignment?
Pair-wise alignments:
AGAC    AGAC    AC
--AC    AG--    AG
Some possibilities:
AG--
--AC
AGAC
- Fix pair-wise alignment and then add?
- Evaluate all the possible alignments of N sequences?
26Scoring MSA
- Sum of pairs (SP) scoring methods
Given an alignment of N sequences, each of length L in the L x N alignment:
pair-wise sum for each column, then sum over all columns
- Example
- c(match) = 1, c(mismatch) = -1, c(gap) = -2, c(gap, gap) = 0
AQPILLLV
ALR-LL-
AK-ILLL-
CPPVLILV
- SP4 = SP(I, -, I, V) = -2 + 1 - 1 - 2 - 2 - 1 = -7
- SP = SP1 + SP2 + ... + SP8
- SP tends to overweight a single mutation
SP(A,A,A,C) = 0, SP(A,A,A,A) = 6
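The sum-of-pairs score is easy to check in code. The sketch below uses the costs from the example (match 1, mismatch -1, gap -2, gap with gap 0); the function names are assumptions made for this illustration.

```python
from itertools import combinations


def pair_cost(a, b, match=1, mismatch=-1, gap=-2):
    """Cost of one pair of symbols in a column; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return gap
    return match if a == b else mismatch


def sp_column(column):
    """Sum of the pair costs over all pairs of symbols in one column."""
    return sum(pair_cost(a, b) for a, b in combinations(column, 2))


def sp_score(rows):
    """Sum of the column scores; all rows must have the same length."""
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    return sum(sp_column(column) for column in zip(*rows))


if __name__ == "__main__":
    print(sp_column("I-IV"), sp_column("AAAC"), sp_column("AAAA"))
```

A single change from A to C in one of four sequences lowers the column score from 6 to 0, because that one symbol takes part in three of the six pairs.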
27Extension of DP for N sequences
- Extend F(i,j) for N dimensions
- DP of N dimensions using SP
- Time on the order of $L^N (2^N - 1) N^2 = O((2L)^N N^2)$
28STAR method
- DP provides an optimal solution but is costly
- Heuristic methods: STAR, CLUSTALW, ...
- Progressive alignment
- STAR
- - pair-wise alignments
- - build similarity matrix
- - find a star sequence
- - use the star to align the other sequences
- - once a gap, always a gap
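The first three STAR steps can be sketched as below. The slide does not say how the star is chosen; this sketch assumes the common choice of the sequence with the largest sum of pair-wise alignment scores. The scoring values (match 1, mismatch -1, gap -2) and the function names are also assumptions, and the final step of merging the pair-wise alignments is not shown.

```python
def nw_score(x, y, match=1, mismatch=-1, gap=-2):
    """Global alignment score with a linear gap penalty, keeping two rows only."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i, a in enumerate(x, start=1):
        cur = [i * gap]
        for j, b in enumerate(y, start=1):
            s = match if a == b else mismatch
            cur.append(max(prev[j - 1] + s, cur[j - 1] + gap, prev[j] + gap))
        prev = cur
    return prev[-1]


def pick_star(seqs):
    """Return the index of the sequence with the highest total pair-wise score."""
    totals = [
        sum(nw_score(s, t) for j, t in enumerate(seqs) if j != i)
        for i, s in enumerate(seqs)
    ]
    return max(range(len(seqs)), key=totals.__getitem__)


if __name__ == "__main__":
    seqs = ["AAAA", "AAAT", "AATT"]
    print("star sequence:", seqs[pick_star(seqs)])
```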
29STAR method
30CLUSTAL family
- What are the disadvantages of STAR method?
- Build Similarity tree clustering
- Alignment starts at most similar sequences
- Pair-wise alignment -- distance matrix
- Fast approximate approach or DP
31CLUSTALW
2. Construct similarity tree, the guide tree
- UPGMA (unweighted pair-group method using arithmetic averages)
3. Progressive alignment
- Start with most similar sequences
- Align group with group using pair-wise alignment
- e.g.