Title: Naive method Boyer-Moore Algorithm/Knuth-Morris Pratt
1Algorithms for biological sequence
Comparison and AlignmentSara Brunetti,
Dipartimento di Scienze Matematiche e
Informatica University of Siena,
Italy,sara.brunetti_at_unisi.it
2History
- 1953 DNA structure, Watson e Crick
- 1975 development of the sequencing technique,
Ranger, Maxam e Gilbert - 1980 American Supreme Court decides that bacteria
genetically modified were patented - 1990 beginning of the Genome Project
- Goals
- sequence the entire human genome producing the
complete DNA trascript - produce maps of the genome showing locations of
expressed sites - 2000 Tony Blair and Bill Clinton announce the
completion of the human genome sequencing - Cost 3 109 euros
3- Amounts of data
- Human genome 3x109 bp word 5000 km long
(Roma-Capo Nord) - Contained in 1015 cells
- Macromolecular structures 35460 entry (14 Marz
06, PDB) - Bioinformatics
- study of problems of storage, organization and
distribution of large amounts of genomic data
4Problems
- Reconstructing long sequences of DNA from
overlapping sequence fragments. - Storing and retrieving DNA sequences in
Databases. - Searching databases for related sequences and
subsequences.
5- The human genome project is an international
multi-component effort to read out the
instruction book for human biology that is our
genome, and to understand what its telling us - Biological problems
- Determine genetic maps from probe data
- Determine the functionality of a protein
- Characterize a protein
- Prediction of protein structure
- Protein interaction
6Computational biology
- study of mathematical and combinatorial problems
of modeling biological processes in the cell,
interpreting the data and providing theories
about their biological relations - Data representation
- Problem formulation
- (Efficient) algorithm design
7Data representation
- Alphabet
- Italian
- A B C D E F G H I L M N O P Q R S T U V Z
- English
- A B C D E F G H I J K L M N O P Q R S T U V W Y Z
- DNA
- A C G T
- Protein
- A Q W E R T Y I P L K H F D S C V N M
- Binary
- 0 1
8Data representation strings
- DNA
- String ACCGTATATAAAAGGCCGGGTT
- Length 22
suffix
substring
prefix
9Similarities and differences?
- -Differences between the human genome and the
chimpanzee genome 2 - -Differences betweeen human and worm 50
- -Similarity between two humans 99,9
- But genome length 3 109 bp
- They can differ into 3 106 positions
10Similarities and differences
- The resemblance of two DNA sequences taken from
different organisms can be explained by the
theory that all contemporary genetic material has
one ancestral ancient DNA. - According to this theory, during the course of
evolution mutations occurred, creating
differences between families of contemporary
species. - Most of these changes are due to local mutations,
each modifying the DNA sequence at a specific
manner.
11Biological motivation
- Learning about the functionality or structure of
a protein without performing any experiments - Basic idea
- In biomolecular sequences (DNA, RNA,
Aminoacid sequences) high similarity usually
implies significant functional or structural
similarity. - Usually 25 sequence identity suffice two
proteins to have same 3-dim structure and almost
identical function
12- WARNING Sequence similarities implies functional
similarities, but the reverse is not necessarily
true! - Beside sequences other levels to enquire 3D
protein structure, cellular biochemistry or
morphology etc., but sequences are easier to
study.
13DNA Sequence Comparison First Success Story
- Finding sequence similarities with genes of known
function is a common approach to infer a newly
sequenced genes function - In 1984 Russell Doolittle and colleagues found
similarities between cancer-causing gene and
normal growth factor (PDGF) gene
14Cystic Fibrosis
- Cystic fibrosis (CF) is a chronic and frequently
fatal genetic disease of the body's mucus glands
(abnormally high level of mucus in glands). CF
primarily affects the respiratory systems in
children. - Mucus is a slimy material that coats many
epithelial surfaces and is secreted into fluids
such as saliva
15Cystic Fibrosis Inheritance
- In early 1980s biologists hypothesized that CF is
an autosomal recessive disorder caused by
mutations in a gene that remained unknown till
1989 - Heterozygous carriers are asymptomatic
- Must be homozygously recessive in this gene in
order to be diagnosed with CF
16Cystic Fibrosis Finding the Gene
17Finding Similarities between the Cystic Fibrosis
Gene and ATP binding proteins
- ATP binding proteins are present on cell membrane
and act as transport channel - In 1989 biologists found similarity between the
cystic fibrosis gene and ATP binding proteins - A plausible function for cystic fibrosis gene,
given the fact that CF involves sweet secretion
with abnormally high sodium level
18Cystic Fibrosis Mutation Analysis
- If a high of cystic fibrosis (CF) patients have
a certain mutation in the gene and the normal
patients dont, then that could be an indicator
of a mutation that is related to CF -
- A certain mutation was found in 70 of CF
patients, convincing evidence that it is a
predominant genetic diagnostics marker for CF
19Cystic Fibrosis and CFTR Gene
20Cystic Fibrosis and the CFTR Protein
- CFTR (Cystic Fibrosis Transmembrane conductance
Regulator) protein is acting in the cell membrane
of epithelial cells that secrete mucus - These cells line the airways of the nose, lungs,
the stomach wall, etc.
21Mechanism of Cystic Fibrosis
The CFTR protein (1480 amino acids) regulates a
chloride ion channel Adjusts the wateriness of
fluids secreted by the cell Those with cystic
fibrosis are missing one single amino acid in
their CFTR Mucus ends up being too thick,
affecting many organs
22Bring in the Bioinformaticians
- Gene similarities between two genes with known
and unknown function alert biologists to some
possibilities - Computing a similarity score between two genes
tells how likely it is that they have similar
functions
23Sequence Comparison Problems
- Informally find which parts of sequences are
alike and which parts are different. - Given two sequences over the same alphabet, about
of the same length (?10.000 char.), and almost
equal, find the places where differences occur. - Problem 1) the same gene is sequenced by two
laboratories and they want to compare the
results. - 2) Given two sequences with a few hundred of
char., find two similar sub-strings (one from
each sequence). - 3) Same as Problem 2), but one sequence is
compared with thousand of others. - Problems 2), 3) in searching local similarities
in large databases of bio-sequences.
24Sequence Comparison Problems
- 4) Given two sequences with a few hundred of
char., find a prefix of one similar to the suffix
of the other. - Problem 4) in the fragment assembly procedure in
large scale DNA sequencing. - We introduce a single basic algorithmic idea to
solve all the above problems.
25Pairwise alignment
- How to compare two sequences?
- Alignment
- Similarity
26Sequence alignment an example
s ATGCAGCTGAGCATCG
t ATACAGCGAGTATCG
27Sequence alignment an example
s ATGCAGCTGAGCATCG
t A
T
A
C
A
G
C
G
A
G
T
A
T
C
G
28Sequence alignment
- Sequence 1 s(s1,,sm) of size m
- Sequence2 t(t1,,tn) of size n
- An alignment (s,t) between s and t is obtained
by insertion of spaces in arbitrary positions
along the sequences so that they end up with the
same size - s1 s2 sl
- t1 t2 tl
- (si,ti) pair of characters in s and t or -
- Not allowed (-,-)
29Longest path in a network
- The nodes of the network are (i,j) where each
(i,j) node denotes an aligned pair (si,tj). - There are edges between nodes (i,j) and (k,l) if
i lt k and j lt l (order preserving) - We add a source and a sink
- (The weights on the edges are given by function p)
A
T
A
G
C
0 1 2 2 3 3 4
A
s A T G A t T A
G C A
T
G
A
0 0 1 2 3 4 5
30The Alignment Grid
- Every alignment path is from source to sink
31Edit Distance vs Hamming Distance
Edit distance may compare i-th letter of v
with j-th letter of w
Hamming distance always compares i-th letter
of v with i-th letter of w
V - ATATATAT
V ATATATAT
Just one shift
Make it all line up
W TATATATA
W TATATATA
Hamming distance Edit
distance d(v, w)8
d(v, w)2 Computing Hamming distance
Computing edit distance is a trivial task
is a non-trivial
task
32Edit Distance Example
- TGCATAT ? ATCCGAT in 5 steps
- TGCATAT ? (delete last T)
- TGCATA ? (delete last A)
- TGCAT ? (insert A at front)
- ATGCAT ? (substitute C for 3rd G)
- ATCCAT ? (insert G before last A)
- ATCCGAT (Done)
- What is the edit distance? 5?
33Edit Distance Example (contd)
TGCATAT ? ATCCGAT in 4 steps TGCATAT ? (insert
A at front) ATGCATAT ? (delete 6th T) ATGCATA ?
(substitute G for 5th A) ATGCGTA ? (substitute
C for 3rd G) ATCCGAT (Done) Can it be
done in 3 steps???
34Alignment as a Path in the Edit Graph
0 1 2 2 3 4 5 6 7 7 A T _ G T T A T _ A T C G
T _ A _ C 0 1 2 3 4 5 5 6 6 7 (0,0) , (1,1) ,
(2,2), (2,3), (3,4), (4,5), (5,5), (6,6), (7,6),
(7,7)
- Corresponding path -
35Alignments in Edit Graph (contd)
- and represent indels in v and w with
score 0. - represent matches with score 1.
- The score of the alignment path is 5.
36Alignment as a Path in the Edit Graph
Every path in the edit graph corresponds to an
alignment
37Alignment as a Path in the Edit Graph
0122345677 v AT_GTTAT_ w ATCGT_A_C
0123455667 (0,0) , (1,1) , (2,2), (2,3),
(3,4), (4,5), (5,5), (6,6), (7,6), (7,7)
38Alignment as a Path in the Edit Graph
Old Alignment 0122345677 v AT_GTTAT_ w
ATCGT_A_C 0123455667
New Alignment 0122345677 v AT_GTTAT_ w
ATCG_TA_C 0123445667
39Sequence similarity an example
ATGCAGCTGAGCATCG ATACAGC GAGTATCG
- We assign a score to each alignment
- p(a,a) 1
- p(a,b) -1
- p(a,-)p(-,a)g -2
-
- Score of the example 13 2 1-1 9
- The best alignment between s and t is the one
with maximal total score (similarity).
40Global alignment
41Global alignment
- Given two sequences s and t of roughly the same
length, determine the alignment of s and t with
maximal score - AC - GCTTTG
- - CATG TAT-
- (NeedlemanWunsch Algorithm)
- Motivation the same gene is sequenced by two
laboratories and they want to compare the results
42Number of alignments
- How many ways s can be aligned with t?
- s1 s2 sl
- t1 t2 tl
- Max(n,m) lt l lt nm
- s1 .. sm - - - -
- - - - - - t1 tn
- - s1 .-. sm -
- t1 . tn
- f(i,j)alignments of one sequence of i letters
with another of j letters - f(n,m)f(n-1,m)f(n-1,m-1)f(n,m-1) and
f(n,n)?(1?2)2n1 ?n as n??
43- Es. two sequences of length 1000 have the
following number of possible alignments - f(1000,1000)?(1?2)2001 ?100010 767,4.. !!!!!!!!
- (there are 10 80 elementary particles in the
universe)
44Dynamic Programming Algorithm
- Basic Idea of Dynamic Programming a problem is
solved taking advantage of the already solved
sub-problems. - Each optimal alignment contains optimal
alignments of the subproblems (example) - - GCTGATATAGCT
- GGGTGAT -TAGCT
- Additivity of the penalty function
- Three essential components
- Recurrence relation
- Tabular computation
- Traceback
-
45Dynamic Programming Algorithm Recurrence
relation
- Sequence 1 s of size m
- Sequence 2 t of size n
- si..j sub-string from char i to
char j of s. - M(i,j) is the score of the best alignment between
s1..i and t1..j - M(j,0) M(0,j)-2j
- M(m,n) is computed by solving the more general
problem of computing M(i,j) for all i,j - No top-down approach, but bottom up
- The computation is arranged in a (m1) (n1)
array M
Mi,j-1 - 2
1, if si tj Mi,j max
Mi-1,j-1 p(i,j), p(i,j)
Mi-1,j -2
-1, if si ? tj
46Dynamic Programming Algorithm Tabular computation
s t A G C
A A A C
row 0 comparison between t and an empty
sequence. column 0
0
-2
-4
-6
comparison between s and an empty sequence
1 -4 -4 1
-3 -6 -1 1
-2
-3
Mi,j is computed by observing the 3 previous
entries Mi-1,j-1, Mi,j-1 and Mi-1,j.
M
-4 -1 0 -2
-6 -3 -2 -1 -8 -5 -4 -1
Mi-1,j-1 a new char of s and a new char of t
are considered 1 is added in
case of match and 1 in case of mismatch.
Align s1..i-1 with t1..j-1 Mi,j-1 a
new char of the sequence t is considered
corresponding to a space in s
(-2). Align s1..i with t1..j-1 and
match a space with tj Mi-1,j a new char of
the sequence s is considered corresponding
to a space in t (-2). Align
s1..i-1 with t1..j and match a space with si
47Dynamic Programming Algorithm- Traceback
Trace back to find the best alignment(s)
solution2
t A G C
s
0
-2
-4
-6
A A A C
1
1 -3
-2
-4 -1 0 -2
-6 -3 -2 -1 -8 -5 -4 -1
Best Score
-1
48Algorithm Similarity
- input S,T,m,n
- output M
- for i ?1 to m do
- M(i,0) ?i g
- for j ?0 to n do
- M(0,j) ? j g
- for i ? 1 to m do
- for j ?1 to n do
- M(i,j) ? max( M(i-1,j)g
- M(i-1,j-1)p(i,j)
- M(i,j-1)g )
- return M
- Complexity O(nm)
49Align (i, j, len)input i, j, array
M obtained by Similarity Alg.output alignment
in align-s align-t, vectors of length lenif i
0 and j 0 then len 0 else if i gt 0 and
Mi, j Mi-1, j cs then Align
(i-1,j,len) len len1 align-s
si align-t (space) else if i gt
0 and jgt0 and Mi, j Mi-1, j-1 c(i,j)
then Align (i-1,j-1,len) len
len1 align-s si align-t tj
else Align (i,j-1,len) len len1
align-s (space) align-t tj
- First call Align(m,n,len)
- max (s,t) ? len ? m n
- Algorithm Align finds solution1.
- By inverting the order of the if statements it
is possible to find - the other solutions.
50Complexity
- Algorithm Similarity takes O( m n) time and
space - Algorithm Align takes O (m n) time
- Let hmn
- T(h) k for h ? 2
- T(h) T(h-1) k, for h gt 2 (k
constant) - T(h) O(h) O(mn)
- Algorithm Similarity can be refined to run with
O(mn) space. - In a row by row computation store the last
and the current - row only.
- Algorithm Align can be designed to run with
O(mn) space with a - divide and conquer strategy.
- It is not a trivial task!
The basic algorithm Similarity can be modified
to solve a variety of different problems!!
51More about similarity and distance
52Similarity and distance
- Two approaches to comparing strings
- Similarity measures how much the strings are
alike - Its definition derives from the concept of one
ancestral ancient DNA - An alignment (s,t) of the strings s and t is
obtained by inserting space characters in them in
such a way that - st
- Removal of - from s gives s
- Removal of - from t gives t
- For every i, either si or ti is not
- A scoring system (p,g) has members
- pAxA-gtR, glt0
- additive scoring
- sim(s,t)max score(s,t)
53Similarity and distance
- Distance measures how much the strings differ
- Its definition derives from the concept of
mutations - A distance d on E is dExE-gtR
- d(x,x)0 for all x in E and d(x,y)gt0 for xltgty
- d(x,y)d(y,x) for all x,y in E
- d(x,y)ltd(x,z)d(y,z) for all x,y in E
- An allignment is obtained by successive
applications of a number of admissible operations
transforming s into t - substitution a-gtb
- insertion or deletion of any character (indel)
- A cost measure (c,h) has members
- cAxA-gtR, hgt0
54- When are similarity and distance algorithms
equivalent? - When sequences are aligned by distance in global
alignment, there is a similarity algorithm that
gives the same set of optimal alignments, and
vice versa - The measures are related by the formula
- p(a,b)M-c(a,b) g-hM/2
- dist(s,t)sim(s,t)M/2(st)
- Es.
- Edit distance, M0gt p(a,a)0, p(a,b)-1, g-1
- M2gt p(a,a)2, p(a,b)1,
g0 - Same set of optimal solutions, different scores.
- Usually 0ltMltmax c(a,b)
55- aligned letterslt-gtsubstitutions l
- spaceslt-gtindel operations r
- score(s,t) ?il p(ai,bi)rg
- cost(s-gtt) ?il c(ai,bi)rh
- score(s,t)cost(s-gtt)lMr M/2
- if global alignment st2lr,
- score(s,t)cost(s-gtt)M/2(st)
- dist(s-gtt)min(M/2(st)- score(s,t))
- M/2(st)- sim(s,t)
56Semi-global alignment
57Semi-global Comparison
- Find the best fit of a short sequence t of size n
into a larger sequence s of size m - s1 sk
sl sm - s
- t
- The solution to this problem as formulated above
will take time proportional to - ?k1..m ?lk..m n(l-k)O(nm3)
58(Exact matching)
- Problem given a pattern p and a larger string s,
find all the occurrences of the pattern p in s - Is there an occurrence?
- How many times p occurrs in s?
- Naive method
- Boyer-Moore Algorithm/Knuth-Morris Pratt Algorithm
59Semi-global Comparison
Ignore the spaces at the beginning and at the end
of a sequence. Problem Find the highest score
semi-global alignment between t and substring
(prefix of a suffix) of s. s
CAGCA -CTTGG ATTCTCGG t - - -
CAGCTTGG(- - - - - - - - 1. Ignore final spaces.
Find the best score between t and a prefix of
s.
Mi,j of problem1 contains the best score
between s1..i and t1..j, hence take the
maximum value Mi,n in the last column n.
There is no need to reach the last row.
60Semi-global Comparison
s CAGCA -CTTGG ATTCTCGG t - - -)CAGCTTGG - - -
- - - - - 2. Ignore initial spaces Find the
best alignment between t and a suffix of
s. Mi,j now contains the best score between
t1..j and a suffix of s1..i, hence in the
first column we have all zeroes.
C A G C . C
0 A 0 G 0
C 0 1 A
Initial char !!Join solutions 1 and 2 to
solve semi-global
comparison!!
61Local Alignment
62Local Alignment
- Find the best fit between a sub-string of s and
a sub-string of t. - s1 sk
si sm - s
- t
- t1 th tj
tn - Motivation Ignore streches of non-coding DNA
63Local Alignment Algorithm SmithWaterman
Global Alignment Local Alignmentbetter
alignment to find conserved segment
--T-CC-C-AGT-TATGT-CAGGGGACACGA-GCATGCAGA-G
AC
AATTGCCGCC-GTCGT-T-TTCAG----CA-GTTATGT-CAGAT-
-C
tccCAGTTATGTCAGgggacacgagcatgcagag
ac
aattgccgccgtcgttttcagCAGTTATGTCAGatc
64Local Alignment Example
Local alignment
Global alignment
65Local Alignment Algorithm SmithWaterman
- The LA problem is still solved computing M.
- Mi,j holds the value of the best alignment
between a suffix - of s1..i and a suffix of t1..j.
- The first row and the first column are
initialized with zeros. -
66Local Alignment
Mi,j-1 - 2
1 if si tj Mi,j max Mi-1,j-1
p(i,j), p(i,j)
Mi-1,j -2 -1 if si ?
tj 0
For any entry Mi,j there exists always the
alignment between the empty suffixes of s1..i
and of t1..j with score 0 At the end choose the
entry Mi,j with maximal score in any position.
Start align tracing back, as before, from there
until you find a value 0.
67Example
HEGAWGHEE PAW HEAE
68End free-space alignment -Motivation
- Find the best fit of substrings of s and t, where
at least one of these substrings must be a prefix
of the original string and one must be a suffix? - Motivation in the shotgun sequence assembly
procedure, one has a large set of partially
overlapping substrings that come from many copies
of one original but unknown DNA sequences. - The problem is to use comparisons of pairs of
substrings to infer the correct original string.
69End free-space alignment
70Example
HEGAWGHEE PAW HEAE
71Kinds of Alignment
72Complexity of Alignments
The space complexity could be a critical
bottleneck. How we can improve such a
complexity? Linear-Space Alignment Hirschbergs
algorithm -- Miller and Myers algorithm
73Extensions to the basic algorithm
- Hirschbergs linear space method for alignment
uses a divide-et-conquer strategy
74- The computation of sim(s,t) can be easily done in
linear space - Each row (or column) depends only on the
preceding one - Optimal alignment in linear space?
- Divide et conquer strategy divide the problem
into smaller subproblems and combine their
solutions to the solution of the whole problem
temp
0 A G C
old
0 AAA C
0
-2
-4
-6
a0
1
-2
1 -3
BestScore(s1..m,t1..n,a)
-4 -1 0 -2
-6 -3 -2 -1 -8 -5 -4 -1
75- The idea is to consider the possible
configurations of an alignment - 1.
- (s1..i-1 si si1..m
- - - t1..n
- for j0)
- s1..i-1 si si1..m
- t1..j-1 tj tj1..n
- for 1ltjltn-1
- (s1..i-1 si si1..m
- t1..n - -
- for jn)
- 2.
- s1..i-1 si si1..m
- t1..j - tj1..n
- for oltjltn
76- An optimal alignment must satisfied also
- Optimal(s1..i-1, t1..j-1) (si,tj)
Optimal(si1..m,tj1..n) - for 0ltjltn, in case 1
- or
- Optimal(s1..i-1, t1..j) (si,-)
Optimal(si1..m,tj1..n) - for 0ltjltn, in case 2
- For a fixed i,
- align s1..i-1 with a prefix of t, and si1..m
with a suffix of t, for 0ltjltn - Take im/2 M(m,n)maxj ( M(m/2,j)Mr(m/2,n-j) )
77Recursive algorithm
- Compute BestScore(s1..m/2,t1..n,a)
-gtM(m/2,n) - Compute BestScore(sm/2..m,tn..1,b)
-gtMr(m/2,n) - Find the column k so that
- M(m/2,k)Mr(m/2,n-k)M(m,n)
- Recursively partition the problem to two
subproblems - Find the path from (0,0) to (m/2,k)
- Find the path from (m,n) to (m/2,n-k)
- Space complexity min(m,n)
- At each step save a and b, and the traceback
pointers for the cells in a and b, and then for
(m/2,k) only - Time complexity T(m,n)number of times a best
score is computed - T(m,n)2T(m,n)T(m/2,k)T(m/2,n-k)O(mn)
78k
m/2
m/2
n-k
79Gap penalty
80Gap penalty function
Gap consecutive number (kgt1) of spaces. From
Biology we know that when mutations are involved,
gap of k spaces are more probable than k isolated
spaces. One concrete example is given by the
c-DNA matching. In the previous problems the cost
w(k) of k internal consecutive spaces was
proportional to k, w(k) k g. Now w(k) h
kg where h g is the cost of the first space
of a gap and g
the cost of the following ones, kgt1.
CA-----CTTGG
hg
g
w(k) h 5g
g
g
g
gap
81- Attention!
- The scoring system is no more additive, i.e. we
cannot break an alignment in two parts and expect
the total score to be the sum of the partial
scores - AAC- - -AATTC C GACTAC
- ACTACCT - - - - - - CGC - -
- The scoring of an alignment is done at the block
level
82Similarities with gap
We need three matrices a, b, c, with the
following meaning ai,j maximum score of an
alignment between s1..i and t1..j
where si is matched with tj. bi,j maximum
score of an alignment between s1..i and t1..j
that ends in a - aligned with
tj. ci,j maximum score of an alignment
between s1..i and t1..j that ends
in si aligned with a - . Where
ai-1,j-1
ai,j p(i,j) max
bi-1,j-1
ci-1,j-1
ai,j-1 -(hg)
ai-1,j -(hg) bi,j max bi,j-1-g
ci,j max
(bi-1,j-(hg) ) (ci,j-1
-(hg))
ci-1,j-g
First space
83Initialization a0,0 0, ai,0 - ? for 0 ?
i ? m, a0,j - ? for 0?j ? n bi,0 - ?
for 0 ? i ? m b0,j -(hgj) for 0 ? j ?
n ci,0 -(hgi) for 0 ? i ? m c0,j - ?
for 0 ? j ? n
am,n Final result
Get the maximum among bm,n
cm,n
Trace back to obtain the optimal alignment,
remembering the current position and which array
belongs to. Time
O(mn) Space 3(mn) O(mn)
84Other gap penalty models
- Constant.
- Affine.
- Convex
- each additional space in a gap contributes less
to the gap weight than the previous space (ex.
Log(q)) - the problem is solvable in O(nm log(m)) time
- Arbitrary
- Any gap weight function is acceptable
- the problem is solvable in O(nm (mn)) time
85Multiple alignment
86Multiple Alignment
- Compare multiple sequences s1, s2, . . . , sk
over the same alphabet - Insert spaces to make them of the same size.
- MQPI LLL
- M LR - LL-
- MK - I LLL
- MP P V LIL
- no column is made exclusively of spaces
- SCORING IS MORE COMPLEX!
87Motivations
- Fact evolutionary and functionally related
molecular strings can differ significantely and
yet preserve the same 3-d and 2-d structure - It is a very important tool to extract and
represent biologically important similarities
(critical common patterns) from a set of several
strings. - Critical patterns may suggest evolutionary
history, a common structure of the protein
product, or a common biological function. - Since critical patterns can be widely dispersed
or faint, they may be not apparent in the
comparison of two strings (ex. hemoglobin). - Such patterns are used to characterize families
of proteins (famility representations) - These representations can be used to identify
other potential members of the family
88SP(Sum of Pairs)-measure
- Possible Scoring Function
-
- SP-function Sum of pair-wise scores of all
pairs of symbols in the - column
- SP-score (I, -, I, V) p(I,-) p(I,I) p(-,V)
p(I,V) p(-,I) p(I,V) - p(a, b) pair-wise score of char a and b. a
or/and b can be space. - p(- ,-) 0 (common practice).
89From multiple alignment Alg. We can also derive
pairwise alignment by just removing columns made
of all spaces. 1.
PEAALYGRFT - - - IKSDUW
2. PEAALYGRFT - - - IKSDUW
3. PESLAYNKF - - -S IKSDUW
4. PEALNYG RY - - -SSESDUW
5. PEALNYGWY - - -SSESDUW
6. PEVIRMQDD
NPFSFQSDYY Projection
2. PEAALYGRF T- - - IKSDUW
4. PEALNYGRY - - -SSESDUW Property if
p(-,-) 0, then for a multiple alignment ?, we
have SP-score (?) ??
score (?ij)
iltj Where ?ij is the pairwise
alignment induced by ? on si and sj
90Combining Optimal Pairwise Alignments into
Multiple Alignment
In the optimal multiple alignment is not true
that induced pairwise alignments ?ij are
optimal for all i,j
91Compute Optimal Alignment (O.A.) according to
SP-score
Dynamic Programming algorithm on k-dimensional
array M of size n x n x n nk for k sequences
- Mi1, i2, , ik holds the score of the O.A. of
s1 .. i1, , s1 .. ik. - Each entry of M depends on 2k -1 other entries,
one for each possible configuration of the
current column. - A A A A -
- B B B B .. -
forbidden configuration - C C - - -
- D - D - -
- To compute SP-score O(k2) steps are required to
sum the scores - of the k(k-1) pair-wise alignment.
- In total O(nk 2k k2) . O.A. is
NP-complete(WangJiang)
92We develop the exact algorithm only for k3.
Consider 3 sequences s1, s2, s3 of size n1, n2,
n3. Initialization M 0,0,0
0 M i,j,0 sp-score(i,j)
(i j) p(x,-) M i,0,k
sp-score(i,k) (i k) p(x,-)
M 0,j,k sp-score(j,k) (j k) p(x,-)
MQPILLL
MQP MLR- LL-
M3,3,0 MLR
MK- ILLL - - -
c(M,M)c(Q,L)c(P,R)c(M,-) c(Q,-) c(P,-)
c(M,-)
c(L,-) c(R,-).
93Mi,j,k is the maximum value among 7 values
S
A
A
N
S
S
V
S
N
94Multiple Alignment Projections
A 3-D alignment can be projected onto the 2-D
plane to represent an alignment between a pair of
sequences.
All 3 Pairwise Projections of the Multiple
Alignment
95Algorithm Optimal Alignment of 3 sequences
For i 1 to n1 for j 1 to n2
for k 1 to n3 if s1(i) s2(j) then
c(i,j) cmatch else
c(i,j) cmis if s1(i) s3(k) then c(i,k)
cmatch else c(i,k)
cmis if s2(j) s3(k) then c(j,k) cmatch
else c(j,k) cmis
m1 Mi-1,j-1,k-1 c(i,j) c(i,k)
c(j,k) m2 Mi-1,j-1,k c(i,j) 2csp
m3 Mi-1,j,k-1 c(i,k) 2csp 1
space on 1 sequence m4 Mi,j-1,k-1
c(j,k) 2csp m5 Mi-1,j,k 2csp
m6 Mi,j-1,k 2csp 1 space on two
sequences m7 Mi,j,k-1 2csp
Mi,j,k max (m1, , m7)
O(n1n2n3)
96Heuristic Algorithms
- Star alignment (do not yield any guarantee of
quality of the resulting)
97Star Alignment-Gusfield
- Build a multiple alignment ? based upon the
pairwise alignments between a fixed sequence, the
center of the star, and the others - is such that its projections ?ij are optimal when
either i or j is the center - Select one of the sequences as center, say sc
- Compute an optimal alignment between sc and si,
for all i - Merge the alignments by inserting spaces in the
sequences
98s4
s3
s5
sc
s2
sk
- How selecting the center sc?
- Try with all and then pick the best
- Compute all the pairwise alignments and take the
best - How inserting the spaces? once a gap in sc,
always a gap - Example
- Let s1 be the center
- ATTGCCATT
- ATGGCCATT
- ATCCAATTTT
- ATCTTCTT
- ACTGACC
s1
99- ATTGCCATT
- ATGGCCATT
- ATTGCCATT - -
- ATC- CAATTTT
- ATTGCCATT - -
- ATCTTC TT - -
- ATTGCCATT - -
- ACTGACC - - - -
- ATTGCCATT - -
- ATGGCCATT - -
- ATC- CAATTTT
- ATCTTC TT - -
- ACTGACC - - - -
100- Define the score in terms of distance d instead
of similarity the triangle inequality holds - Complexity
- Computing the center as the best sequences
minimizing ?id(Si,Sc), takes O(k2 n2) - Computing and merging the pairwise alignment
takes - ?i1 k-1O((i n)n)O(k2 n2)
101- Let M be denotes the optimal multiple alignment
and M the computed - multiple alignment
- d(Si,Sj) is the score of the pairwise alignment
of Si,Sj induced by M - d(M) is the sum of d(Si,Sj) for iltj
- Theorem d(M) ? (2-2/k)d(M)lt2 d(M)
- Computational experiments show that the
approximation is overly pessimistic - Finding solutions are within 15 from the optimum
on average
102The profile
- Given a multiple sequence alignment M involving N
sequences of length l a profile P is a l ?
S?- matrix whose columns are probability
vectors denoting the frequencies of each symbol
in the corresponding alignment column. - Example
- A-GTTTA
- AC-TTA--
- ACGTAAG-
- -C-TA---
103(No Transcript)
104(No Transcript)
105(No Transcript)
106Aligning a string to a profile
- Given a profile P and a string S, how well does S
fit P? - P can be considered as a string of profile
columns positions, and aligning S to P consists
in inserting spaces into S and P. - Scoring a char of S to every char in a column of
P - Example a,a -gt2 a,- and a,b -gt -1 a,c-gt -3
- S aabbc P
- s(a,1)0.75x2 0.25x3
- is the contribute of the first column of P
- A dynamic programming algorithm permits to
compute the optimal al.
A .75 .25 .50 B .75
.75 C .25 .25 .50
.25 - .25 .25 .25
107Align a sequence S vs. a profile P
- Let P (pij) for i1l j1 S1 be a profile
and let Ss1sn be a sequence. - The following scoring function can be defined
- ssp (S?-) ?1,2,,l ? R