Title: Sequence Alignment
1Sequence Alignment
2Sequence Alignment
AGGCTATCACCTGACCTCCAGGCCGATGCCC TAGCTATCACGACCGCGG
TCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGA
TTTGCCCGAC
3Distance from sequences
- Mutation events
- Mutation
- we need a score for the substitution of the
symbol i with j - (amino acidic residues, nucleotides, etc.)
- substitution matrices s(i,j)
- A ALASVLIRLITRLYP
- B ASAVHLNRLITRLYP
The substitution matrix should account for the
underlying biological process (conserved
structures or functions)
4Substitution matrices
The basic idea is to measure the correlation
between 2 sequences Given a pair of correlated
sequences we measure the substitution frequency
of a-gt b ( assuming symmetry) Pab The null
hypothesis (random correlation or independent
events) is the product PaPb We then define s
(a,b) log(Pab/PaPb) (log is additive so that
the probability gt sum over pairs) Minimal
distance Maximum score (s)
5Substitution matrices
Following this idea several matrices have been
derived Their main difference is relative to the
alignment types used for computing the
frequencies PAMx (Point Accepted Mutation).
Number x of mutation events. The matrix is
built as A1ik P(ki) for sequences with 1
mutations. Anik(A1ik)n n mutation events
(number of mutations NOT mutated
residues. E.g. n2 P(ik) Pa P(ia) P(ak)
PAMn Log(Anij /Pi)
6PAM10
Very stringent matrix, no out of diagonal element
is gt 0
7PAM160
Some positive values outside the diagonal appear
8PAM250
This is one of the most widely used matrix
9PAM500
10Substitution matrices
PAM matrices are computed starting from very
similar sequences and more distance scoring
matrices are derived by matrix products BLOSUMx
This family is computed directly from blocks of
alignments with a defined sequence identity
threshold (gt x). gt For very related
sequences we must use PAM with low numbers and
BLOSUM with large numbers. The opposite holds for
distant sequences
11BLOSUM62
Probably the most widely used
12BLOSUM90
13BLOSUM30
14Distance between sequences
What kind of operations we consider? Mutation
Deletion and Insertion (rarely rearrangements
) A ALASVLIRLIT--YP B ASAVHL---ITRLYP
The negative gap score value depends only on the
number of holes s(n) -nd linear s(n) -d -
(n-1)e affine (d open e extension) !! Please
note that all scores are position-independent
along the sequence
15Pairwise alignment
Given 2 sequences what is an alignment with a
maximum score? Naïf solution try every
possible alignments and select one with the best
score Using the score function
16How may alignments there are?
Case without internal gaps --AAA -AAA AAA
AAA AAA- AAA-- BB--- BB-- BB- -BB --BB
---BB Given 2 sequences of length m e n, the
number of shifts is m n 1
17How may alignments there are?
Case with internal gaps --AAA -AAA -AAA
-AAA A-AA BB--- BB-- B-B- B--B
BB-- BBAAA BABAA BAABA BAAAB ABBAA AAA
AAA AA-A AAA AAA- BB- B-B -BB-
-BB --BB ... ABABA ABAAB AABBA AABAB AAABB
The number of the alignments is equal at the
number of all possible way of mixing 2 sequences
keeping track of the original sequence order.
Given 2 sequences of lengths n and m, they are
?k0,min(m,n)(mn-k)!/k!(n-k)!(m-k)! For
nm80 we get gt 1043 possible alignments !!!!!!!
18Basic idea
We can keep a table of the precomputed substring
alignments (dynamic programming) ALSKLASPALSAKDL
DSPALS ALSKIADSLAPIKDLSPASLT ALSKLASPALSAKDLDSPAL
-S ALSKIADSLAPIKDLSPASLT- ALSKLASPALSAKDLDSPALS-
ALSKIADSLAPIKDLSPASL-T
19Basic idea
Building step by step Given the 2
sequences ALSKLASPALSAKDLDSPALS,
ALSKIADSLAPIKDLSPASLT The best alignment between
the two substrings ALSKLASPA ALSKIAD can be
computed taking only into consideration
ALSKLASP A ALSKLASP A ALSKLASPA - ALSKIA
D ALSKIAD - ALSKIA D The best among
these 3 possibilities
20The Needleman-Wunsch Matrix
x1 xM
Every nondecreasing path from (0,0) to (M, N)
corresponds to an alignment of the two
sequences
yN y1
21The Needleman-Wunsch Algorithm
- x AGTA m 1
- y ATA s -1
- d -1
F(i,j) i 0 1 2
3 4
Optimal Alignment F(4,3) 2 AGTA A - TA
j 0
1
2
3
22The Needleman-Wunsch Algorithm
- Initialization.
- F(0, 0) 0
- F(0, j) - j ? d
- F(i, 0) - i ? d
- Main Iteration. Filling-in partial alignments
- For each i 1M
- For each j 1N
- F(i-1,j) d case 1
- F(i, j) max F(i, j-1) d case
2 - F(i-1, j-1) s(xi, yj) case 3
- UP, if case 1
- Ptr(i,j) LEFT if case 2
- DIAG if case 3
- Termination. F(M, N) is the optimal score, and
- from Ptr(M, N) can trace back optimal alignment
23The Overlap Detection variant
x1 xN
- Changes
- Initialization
- For all i, j,
- F(i, 0) 0
- F(0, j) 0
- Termination
- maxi F(i, N)
- FOPT max maxj F(M, j)
yM y1
24Structure of a genome
a gene
transcription
pre-mRNA
splicing
mature mRNA
translation
Human 3x109 bp Genome 30,000 genes
200,000 exons 23 Mb coding 15 Mb
noncoding
protein
25Structure of a genome
gene D
A
B
Make D
C
If B then NOT D
If A and B then D
short sequences regulate expression of
genes lots of junk sequence e.g. 50
repeats selfish DNA
gene B
Make B
D
C
If D then B
26Cross-species genome similarity
- 98 of genes are conserved between any two
mammals - 75 average similarity in protein sequence
hum_a GTTGACAATAGAGGGTCTGGCAGAGGCTC------------
--------- _at_ 57331/400001 mus_a
GCTGACAATAGAGGGGCTGGCAGAGGCTC---------------------
_at_ 78560/400001 rat_a GCTGACAATAGAGGGGCTGGCAGAGA
CTC--------------------- _at_ 112658/369938 fug_a
TTTGTTGATGGGGAGCGTGCATTAATTTCAGGCTATTGTTAACAGGCTCG
_at_ 36008/68174 hum_a CTGGCCGCGGTGCGGAGCGTCTGGA
GCGGAGCACGCGCTGTCAGCTGGTG _at_ 57381/400001 mus_a
CTGGCCCCGGTGCGGAGCGTCTGGAGCGGAGCACGCGCTGTCAGCTGGTG
_at_ 78610/400001 rat_a CTGGCCCCGGTGCGGAGCGTCTGGAG
CGGAGCACGCGCTGTCAGCTGGTG _at_ 112708/369938 fug_a
TGGGCCGAGGTGTTGGATGGCCTGAGTGAAGCACGCGCTGTCAGCTGGCG
_at_ 36058/68174 hum_a AGCGCACTCTCCTTTCAGGCAGCT
CCCCGGGGAGCTGTGCGGCCACATTT _at_ 57431/400001 mus_a
AGCGCACTCG-CTTTCAGGCCGCTCCCCGGGGAGCTGAGCGGCCACATTT
_at_ 78659/400001 rat_a AGCGCACTCG-CTTTCAGGCCGCTCC
CCGGGGAGCTGCGCGGCCACATTT _at_ 112757/369938 fug_a
AGCGCTCGCG------------------------AGTCCCTGCCGTGTCC
_at_ 36084/68174 hum_a AACACCATCATCACCCCTCCCCGGC
CTCCTCAACCTCGGCCTCCTCCTCG _at_ 57481/400001 mus_a
AACACCGTCGTCA-CCCTCCCCGGCCTCCTCAACCTCGGCCTCCTCCTCG
_at_ 78708/400001 rat_a AACACCGTCGTCA-CCCTCCCCGGCC
TCCTCAACCTCGGCCTCCTCCTCG _at_ 112806/369938 fug_a
CCGAGGACCCTGA-------------------------------------
_at_ 36097/68174
atoh enhancer in human, mouse, rat, fugu fish
27The local alignment problem
- Given two strings x x1xM,
- y y1yN
- Find substrings x, y whose similarity
- (optimal global alignment value)
- is maximum
- e.g. x aaaacccccgggg
- y cccgggaaccaacc
28Why local alignment
- Genes are shuffled between genomes
- Portions of proteins (domains) are often conserved
29The Smith-Waterman algorithm
- Idea Ignore badly aligning regions
- Modifications to Needleman-Wunsch
- Initialization F(0, j) F(i, 0) 0
-
- 0
- Iteration F(i, j) max F(i 1, j) d
- F(i, j 1) d
- F(i 1, j 1) s(xi, yj)
30The Smith-Waterman algorithm
- Termination
- If we want the best local alignment
-
- FOPT maxi,j F(i, j)
- If we want all local alignments scoring gt t
- For all i, j find F(i, j) gt t, and trace back
31Scoring the gaps more accurately
- Current model
-
- Gap of length n
- incurs penalty n?d
- However, gaps usually occur in bunches
- Convex gap penalty function
- ?(n)
- for all n, ?(n 1) - ?(n) ? ?(n) - ?(n 1)
32General gap dynamic programming
- Initialization same
- Iteration
- F(i-1, j-1) s(xi, yj)
- F(i, j) max maxk0i-1F(k,j) ?(i-k)
- maxk0j-1F(i,k) ?(j-k)
- Termination same
- Running Time O(N2M) (assume NgtM)
- Space O(NM)
33Compromise affine gaps
- ?(n) d (n 1)?e
-
- gap gap
- open extend
- To compute optimal alignment,
- At position i,j, need to remember best score if
gap is open - best score if gap is not open
- F(i, j) score of alignment x1xi to y1yj
- if xi aligns to yj
- G(i, j) score if xi, or yj, aligns to a gap
34Needleman-Wunsch with affine gaps
- Initialization F(i, 0) d (i 1)?e
- F(0, j) d (j 1)?e
- Iteration
- F(i 1, j 1) s(xi, yj)
- F(i, j) max
- G(i 1, j 1) s(xi, yj)
- F(i 1, j) d
- F(i, j 1) d
- G(i, j) max
- G(i, j 1) e
- G(i 1, j) e
- Termination same
35Bounded Dynamic Programming
- Assume we know that x and y are very similar
- Assumption gaps(x, y) lt k(N) ( say NgtM )
- xi
- Then, implies i j lt k(N)
- yj
- We can align x and y more efficiently
- Time, Space O(N ? k(N)) ltlt O(N2)
36Bounded Dynamic Programming
- Initialization
- F(i,0), F(0,j) undefined for i, j gt k
- Iteration
- For i 1M
- For j max(1, i k)min(N, ik)
- F(i 1, j 1) s(xi, yj)
- F(i, j) max F(i, j 1) d, if j gt i k(N)
- F(i 1, j) d, if j lt i k(N)
- Termination same
- Easy to extend to the affine gap case
x1 xM
y1 yN
k(N)