Title: Fast and Simple Circular Pattern Matching
1 New tabulation and dynamic programming based techniques for sequence similarity problems
Szymon Grabowski
Lodz University of Technology, Institute of Applied Computer Science, Łódź, Poland
sgrabow_at_kis.p.lodz.pl
Sept. 2014
2 Agenda
- (Naïve) dynamic programming.
- Four Russians.
- Main LCS results.
- Bille & Farach-Colton technique.
- Our improvement of the BFC alg.
- Our LCS result with sparse DP.
- Algorithmic apps (Lev distance, LCTS, MerLCS).
- Conclusions & open problems.
3 Dynamic Programming (DP)
- Everybody knows it.
- Quadratic cost for 2 sequences (can't compute a cell "in the middle" before knowing the previous rows/cols).
- Speedup ideas: tabulation (aka Four Russians), bit-parallelism, sparse dynamic programming, compressing the input sequences.
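For reference, the quadratic baseline can be sketched as follows. This is the textbook LCS recurrence, not code from the talk; the function name `lcs_length` is chosen here for illustration.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b.

    Textbook DP: O(len(a) * len(b)) time, O(len(b)) space.
    """
    prev = [0] * (len(b) + 1)          # DP row for the previous symbol of a
    for x in a:
        cur = [0]                      # column 0 is always 0
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)            # extend the diagonal
            else:
                cur.append(max(prev[j], cur[j - 1]))   # best of up / left
        prev = cur
    return prev[-1]
```

Each cell depends on its upper, left and upper-left neighbours, which is why a cell cannot be filled before the preceding rows and columns are known.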
4 DP made (slightly) faster
If we can process blocks of b × b symbols in O(1) time, we immediately obtain O(mn / b²) time. We can do it (Masek & Paterson, 1980) e.g. for a binary alphabet and b = log n / 4 ⇒ O(mn / log² n) time.
The idea is to precompute the answers for all possible inputs (short enough strings are guaranteed to repeat) and to represent the DP values in a differential manner.
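The block idea can be sketched in a few lines. This is an illustrative simplification, not the Masek & Paterson construction itself: instead of precomputing a table for all inputs, it memoizes block results lazily with `functools.lru_cache`, and the block size `b` is a free parameter rather than log n / 4. The names `block` and `lcs_blocked` are made up for this sketch.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # plays the role of the lookup table (filled lazily)
def block(a, b, top, left):
    """Solve one DP block relative to its top-left corner.

    a, b  : snippets of the two sequences (rows / columns of the block)
    top   : 0/1 differences along the row just above the block
    left  : 0/1 differences along the column just left of the block
    Returns the 0/1 differences along the bottom row and the right column.
    """
    rows, cols = len(a), len(b)
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(cols):
        M[0][j + 1] = M[0][j] + top[j]
    for i in range(rows):
        M[i + 1][0] = M[i][0] + left[i]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if a[i - 1] == b[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    bottom = tuple(M[rows][j + 1] - M[rows][j] for j in range(cols))
    right = tuple(M[i + 1][cols] - M[i][cols] for i in range(rows))
    return bottom, right

def lcs_blocked(A, B, b=3):
    """LCS length computed block by block, touching only block borders."""
    top = [0] * len(B)                 # differences along DP row 0
    total = 0
    for i0 in range(0, len(A), b):
        a = A[i0:i0 + b]
        left = (0,) * len(a)           # DP column 0 is all zeros
        for j0 in range(0, len(B), b):
            bs = B[j0:j0 + b]
            bottom, right = block(a, bs, tuple(top[j0:j0 + b]), left)
            top[j0:j0 + len(bs)] = bottom
            left = right
        total += sum(left)             # walk down the last DP column
    return total
```

Because neighbouring LCS values differ by 0 or 1, every border fits in one bit per cell, which keeps the number of distinct block inputs small enough to tabulate.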
5 LCS, selected results (time compl.)
- Standard DP: O(mn).
- Tabulation (Masek & Paterson, 1980): O(mn / log² n) for a constant alphabet.
- Tabulation (Bille & Farach-Colton, 2008): O(mn (log log n)² / log² n) for an integer alphabet.
- Bit-parallelism (Allison & Dix, 1986, ...): O(mn / w), w ≥ log n is the machine word size (in bits).
- Sparse DP:
  - Hunt & Szymanski, 1977: O(r log log n), r is the # of matches,
  - Eppstein, Galil, Giancarlo & Italiano, 1992: O(D log log(min(D, mn / D))), D ≤ r is the # of dominant matches.
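Two of the listed approaches are short enough to sketch. Both functions below are illustrative and make simplifying assumptions: the bit-parallel one uses Python's unbounded integers in place of ⌈m / w⌉ machine words (the update rule is the well-known (V + U) | (V − U) formulation, not necessarily the Allison & Dix original), and the sparse one uses binary search, giving O(r log n) rather than the O(r log log n) bound quoted above.

```python
from bisect import bisect_left
from collections import defaultdict

def lcs_bitparallel(a, b):
    """LCS length via bit-vector updates; one big-int operation per symbol of b."""
    m = len(a)
    mask = (1 << m) - 1
    match = {}                                   # symbol -> bitmask of positions in a
    for i, c in enumerate(a):
        match[c] = match.get(c, 0) | (1 << i)
    v = mask                                     # all ones: no matches yet
    for c in b:
        u = v & match.get(c, 0)
        v = ((v + u) | (v - u)) & mask
    return m - bin(v).count("1")                 # zero bits count the LCS length

def lcs_sparse(a, b):
    """Hunt-Szymanski-style LCS: work proportional to the number of matches r."""
    pos = defaultdict(list)                      # symbol -> positions in b
    for j, c in enumerate(b):
        pos[c].append(j)
    # thresh[k] = smallest end position in b of a common subsequence of length k + 1
    thresh = []
    for c in a:
        for j in reversed(pos.get(c, [])):       # descending, so one row adds at most 1
            k = bisect_left(thresh, j)
            if k == len(thresh):
                thresh.append(j)
            else:
                thresh[k] = j
    return len(thresh)
```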
6 LCS, selected results, contd
- Sparse DP: Sakai, 2012: O(mσ + min(Dσ, p(m − q)) + n), where p = LCS(A, B), q = LCS(A[1..m], B).
- LZ78-compressed input: Crochemore, Landau & Ziv-Ukelson, 2003: O(hmn / log n), for a constant alphabet, where h ≤ 1 is the entropy of the inputs (for a binary alph.).
- RLE-compressed input: several results, incl. Liu, Wang & Lee, 2008: O(min(nl, km)), where k, l are RLE-compressed seq lengths.
- SLP-compressed input: Gawrychowski, 2012: O(kn sqrt(log(n / k))), where k is the total length of the SLP-compressed sequences.
7 The technique of Bille & Farach-Colton
For an integer alphabet of size σ, the Masek & Paterson result can easily be modified to have O(mn log² σ / log² n) time. This is fine for small σ, but not if σ = n^c, c > 0.
Bille & Farach-Colton use alphabet mapping in superblocks. Use superblocks of size e.g. log³ n × log³ n and divide each superblock into blocks of size Θ(log n / log log n) × Θ(log n / log log n).
8 BFC, contd
That is, for the current text snippet from A of length log³ n, extract up to log³ n distinct symbols and encode the current snippet of A and the current snippet of B accordingly (one extra symbol for "something else" is needed in snippet B).
Easily, O(log log n) bits per encoded symbol are enough, the mapping times are overall negligible (a BST can be used, with a log(superblock)-factor per symbol), and we obtain O(mn (log log n)² / log² n) total time.
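The mapping step can be illustrated with a small sketch. It is a simplification: a Python dict replaces the BST mentioned above, and the function name `remap_superblock` is invented for this example.

```python
def remap_superblock(snip_a, snip_b):
    """Re-encode two snippets over a small local alphabet.

    Symbols of snip_a get consecutive codes in order of first appearance.
    Symbols of snip_b that do not occur in snip_a all share one extra code
    ("something else"); they can never match inside this superblock anyway.
    """
    codes = {}
    for c in snip_a:
        codes.setdefault(c, len(codes))
    other = len(codes)                      # the one extra symbol
    enc_a = [codes[c] for c in snip_a]
    enc_b = [codes.get(c, other) for c in snip_b]
    return enc_a, enc_b
```

For instance, `remap_superblock([100, 7, 100, 42], [7, 999, 42, 5])` yields `([0, 1, 0, 2], [1, 3, 2, 3])`: the match pattern between the two snippets is unchanged, but each symbol now needs only enough bits to index the local alphabet.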
9 BFC, alphabet mapping example
Blocks of size 3 × 3, superblocks of size 9 × 9.
10 Our technique (Alg 1)
Use the BFC alphabet mapping in superblocks. But use many LUTs (instead of 1), yet with modified input. One LUT per horizontal stripe (of length n).
- The LUT input:
  - snippet of A,
  - left block border (1 bit per cell),
  - upper block border (1 bit per cell).
- No snippet of B as part of the input, as it is fixed for a given LUT! (Re-use LUTs for repeating snippets of B.)
- Thanks to it, we work on rectangular (not square) "portrait"-oriented blocks of size Θ(log n / log log n) × Θ(log n).
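A rough accounting of why the taller blocks help, as a sketch only. It assumes that every block is resolved by a single O(1) LUT lookup and that LUT construction and alphabet-mapping costs are lower-order terms; b_1 and b_2 are names introduced here for the block width and height.

```latex
\begin{align*}
b_1 &= \Theta\!\left(\frac{\log n}{\log\log n}\right), \qquad b_2 = \Theta(\log n),\\
\text{LUT input bits} &= \underbrace{b_1 \cdot O(\log\log n)}_{\text{snippet of } A}
  + \underbrace{b_2}_{\text{left border}}
  + \underbrace{b_1}_{\text{upper border}} = O(\log n),\\
\#\text{blocks} &= \frac{mn}{b_1 b_2} = O\!\left(\frac{mn \log\log n}{\log^2 n}\right).
\end{align*}
```

Under these assumptions the block count is a log log n factor below the O(mn (log log n)² / log² n) obtained with square blocks of side Θ(log n / log log n).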
11 One horizontal stripe (4 blocks of 5 × 5)
[Figure: one horizontal stripe of the DP matrix; axes labeled seq A and seq B.]
Red arrows: explicitly stored LCS values; black arrows: diff-encoded LCS values.
05550 and 34023: text snippets encoded with ref. to a superblock (not shown).
The diagonally shaded cells are the block output cells.
12 LCS, first result (Alg 1)
13 Output-dependent algorithm
We work in blocks of (b+1) × (b+1), but divide them into sparse ones, which have ≤ K matches, and dense ones, with > K matches.
Key observation: knowing the top row and leftmost column for the block, plus the location of all matches in it, is enough to compute this block. That is, the text snippets are not needed!
14 Where sparse DP meets tabulation
- A sparse block input:
  - top row: b bits (diff encoding),
  - leftmost column: b bits (diff encoding),
  - match locations: each in log(b²) bits, totalling O(K log b) bits.
- (Output: even less.)
Hence, if K log b + b = O(log n) (with a small enough constant), we can use a LUT for all sparse blocks and compute each of them in constant time.
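The key observation can be checked with a small sketch: the block function below never sees the text snippets, only the two diff-encoded borders and the set of match cells. As a simplification, `functools.lru_cache` stands in for a precomputed LUT, and the names `sparse_block` and `match_set` are made up for this example.

```python
from functools import lru_cache

@lru_cache(maxsize=None)  # stands in for a precomputed LUT
def sparse_block(top, left, matches):
    """Compute a block from its borders and match locations only.

    top, left : tuples of 0/1 differences along the top row / leftmost column
    matches   : frozenset of (row, col) cells inside the block where symbols match
    Returns the 0/1 differences along the bottom row and the right column.
    """
    rows, cols = len(left), len(top)
    M = [[0] * (cols + 1) for _ in range(rows + 1)]
    for j in range(cols):
        M[0][j + 1] = M[0][j] + top[j]
    for i in range(rows):
        M[i + 1][0] = M[i][0] + left[i]
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            if (i - 1, j - 1) in matches:
                M[i][j] = M[i - 1][j - 1] + 1
            else:
                M[i][j] = max(M[i - 1][j], M[i][j - 1])
    bottom = tuple(M[rows][j + 1] - M[rows][j] for j in range(cols))
    right = tuple(M[i + 1][cols] - M[i][cols] for i in range(rows))
    return bottom, right

def match_set(a, b):
    """Match locations between two snippets (done once, in preprocessing)."""
    return frozenset((i, j) for i, x in enumerate(a)
                     for j, y in enumerate(b) if x == y)
```

Snippet pairs such as ("AB", "BA") and ("XY", "YX") have the same match set, so with equal borders they map to the same table entry even though the symbols differ.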
15 Dense blocks
Dense blocks are partitioned into smaller blocks, which then will be processed by our technique from Alg 1. The smaller block sizes are Θ(log n / log log n) × Θ(b).
16 Choosing the parameters
b = O(log n) (otherwise the LUT build costs will be dominating), but also b = ω(log n / sqrt(log log n)) (otherwise this alg will never beat Alg 1). This implies K = Θ(log n / log log n), with an appropriate constant.
For a small enough r (= total # of matches in the matrix) we may have O(mn / log² n) from the above formula; alas, in the preprocessing we have to find and encode all matches in all sparse blocks, in O(n + r) time.
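The two constraints can be traced back to earlier slides with a short calculation. This is a sketch under stated assumptions: sparse blocks cost O(1) each, Alg 1 blocks have area Θ(log n / log log n) × Θ(log n), and the sparse-block LUT input consists of two b-bit borders plus O(K log b) bits of match locations.

```latex
\begin{align*}
\text{beating Alg 1:}\quad
  & b^2 = \omega\!\left(\frac{\log^2 n}{\log\log n}\right)
  \;\Longleftrightarrow\;
  b = \omega\!\left(\frac{\log n}{\sqrt{\log\log n}}\right),\\
\text{LUT input size:}\quad
  & 2b + O(K \log b) = O(\log n), \qquad \log b = \Theta(\log\log n)
  \;\Longrightarrow\;
  K = O\!\left(\frac{\log n}{\log\log n}\right).
\end{align*}
```

Taking K as large as the second line allows gives the Θ(log n / log log n) threshold between sparse and dense blocks.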
17 LCS, second result (Alg 2)
18 Alg 2 niche
Considering the results of
- Eppstein et al., 1992,
- Sakai, 2012,
- Alg 1,
we obtain the following niche in which Alg 2 is the winner
19 Simple generalization of Th. 1 and 2
20 Longest common transposition-invariant subsequence (LCTS)
LCTS = LCS in the best key transposition (in music, transposition is shifting a sequence of notes (pitches) up or down by a constant interval).
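The definition translates directly into a brute-force sketch: try every transposition and keep the best LCS. This is only the naive O(σ mn) baseline, not the talk's result; the function names and the convention that symbols are integers in range(sigma) are assumptions made for this example.

```python
def lcs_length(a, b):
    """Textbook O(len(a) * len(b)) LCS length."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def lcts_length(a, b, sigma):
    """Best LCS length over all transpositions t, with a[i] + t compared to b[j].

    a, b are sequences of integers (e.g. MIDI pitches) in range(sigma).
    """
    return max(lcs_length([x + t for x in a], b)
               for t in range(-(sigma - 1), sigma))
```

For example, `lcts_length([60, 62, 64], [65, 67, 69], 128)` is 3, because shifting the first melody up by 5 makes the two sequences identical.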
21 LCTS, known results and a new one
22 Merged LCS (MerLCS)
A bioinformatics problem on 3 sequences: given sequences A, B and P, return a longest seq. T that is a subsequence of P and can be split into two subsequences T' and T'' such that T' is a subsequence of A and T'' is a subsequence of B. |A| = n, |B| = m, |P| = u.
Known results:
- Peng, Yang, Huang, Tseng & Hor, 2010: O(lmn) time, where l ≤ n is the result length.
- Deorowicz & Danek, 2013: O(⌈u / w⌉ mn log w) time.
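A plain cubic DP makes the problem statement concrete. This is a baseline sketch written for illustration (O(nmu) time and space), not one of the cited algorithms; the function name `merged_lcs` is invented here. M[i][j][k] is the answer for the prefixes A[:i], B[:j], P[:k].

```python
def merged_lcs(A, B, P):
    """Length of the merged LCS of A and B with respect to P (cubic DP)."""
    n, m, u = len(A), len(B), len(P)
    M = [[[0] * (u + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            for k in range(1, u + 1):
                best = M[i][j][k - 1]                       # skip P[k-1]
                if i > 0:
                    best = max(best, M[i - 1][j][k])        # skip A[i-1]
                    if A[i - 1] == P[k - 1]:                # P[k-1] taken from A
                        best = max(best, M[i - 1][j][k - 1] + 1)
                if j > 0:
                    best = max(best, M[i][j - 1][k])        # skip B[j-1]
                    if B[j - 1] == P[k - 1]:                # P[k-1] taken from B
                        best = max(best, M[i][j - 1][k - 1] + 1)
                M[i][j][k] = best
    return M[n][m][u]
```

For example, `merged_lcs("AC", "GT", "AGCT")` is 4: T = AGCT splits into T' = AC (from A) and T'' = GT (from B).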
23 Our result for MerLCS
DP matrix property: Deorowicz and Danek noticed that M(i, j, k) is equal to or larger by 1 than any of the three neighbors M(i − 1, j, k), M(i, j − 1, k), M(i, j, k − 1).
We generalize our result on 2 sequences to 3 sequences (input: 3 text snippets plus 3 2-dim walls instead of 1-dim borders!) to obtain O(mnu / log^(3/2) n) for MerLCS, if u = Ω(n^c) for some c > 0.
24 Conclusions
- Tabulation (Four Russians) is a classic DP-boosting technique. Interestingly, we managed to (slightly) improve its application to the LCS / edit distance problem.
- Applying tabulation may be even better for a sparse matrix.
- These techniques also work for a few problems other than LCS and edit distance.
25 Open problems
- Can we improve the tabulation based result on compressible sequences?
- Can we adapt our technique(s) to problems in which the conditions from Lemma 3 (or Lemma 7, involving 3 sequences) are relaxed, that is, consecutive DP cells may (sometimes) differ by more than a constant? Exemplary problem: SEQ-EC-LCS (Chen & Chao, 2011; Deorowicz & Grabowski, 2014).