Suffix trees - PowerPoint PPT Presentation

Title:

Suffix trees

Description:

Prepare a generalized suffix tree for s = cbaaba$ and sr = abaabc ... Can we construct a suffix tree in linear time ? Ukkonen's linear time construction ... – PowerPoint PPT presentation

Provided by: hai2
Title: Suffix trees


1
Suffix trees
2
Trie
  • A tree representing a set of strings.

Example set: { aeef, ad, bbfe, bbfg, c }
[Figure: the trie for this set, one letter per edge]
3
Trie (Cont)
  • Assume no string is a prefix of another

Each edge is labeled by a letter, no two edges
outgoing from the same node are labeled the
same. Each string corresponds to a leaf.
[Figure: the trie for { aeef, ad, bbfe, bbfg, c }]
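A minimal sketch of such a trie, assuming a nested-dict representation (one dict per node, one key per outgoing edge letter). The names build_trie and leaves are illustrative and do not come from the slides.

```python
def build_trie(words):
    """Return a trie as nested dicts: each key is the letter on an outgoing edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            # setdefault keeps at most one edge per letter out of each node
            node = node.setdefault(letter, {})
    return root


def leaves(node, prefix=""):
    """List the strings spelled by root-to-leaf paths, in lexicographic order."""
    if not node:
        return [prefix]
    result = []
    for letter in sorted(node):
        result.extend(leaves(node[letter], prefix + letter))
    return result


trie = build_trie(["aeef", "ad", "bbfe", "bbfg", "c"])
print(leaves(trie))  # ['ad', 'aeef', 'bbfe', 'bbfg', 'c']
```

Because no string in the set is a prefix of another, every string ends at a leaf, so listing the leaves recovers the whole set.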
4
Compressed Trie
  • Compress unary nodes, label edges by strings

[Figure: the trie for { aeef, ad, bbfe, bbfg, c } before and after compression; the unary paths become the edges eef and bbf]
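The compression step can be sketched as follows, again assuming a nested-dict trie and a prefix-free set of strings (otherwise a string ending at a unary node would be lost). The function names are illustrative.

```python
def build_trie(words):
    """Plain trie as nested dicts, one letter per edge."""
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
    return root


def compress(node):
    """Merge every chain of unary nodes into a single edge labeled by a string."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


compressed = compress(build_trie(["aeef", "ad", "bbfe", "bbfg", "c"]))
print(compressed)
```

After compression every internal node has at least two children, so a compressed trie with k leaves has fewer than 2k nodes.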
5
Suffix tree
Given a string s, a suffix tree of s is a
compressed trie of all suffixes of s
To make these suffixes prefix-free we add a
special character, say $, at the end of s
6
Suffix tree (Example)
Let s = abab. A suffix tree of s is a compressed
trie of all suffixes of s$ = abab$
Suffixes: { $, b$, ab$, bab$, abab$ }
[Figure: the suffix tree of abab$]
7
Trivial algorithm to build a Suffix tree
Put the longest suffix in
[Figure: a single edge labeled abab$]
Put the suffix bab$ in
[Figure: edges abab$ and bab$ leaving the root]
8
[Figure: the tree containing abab$ and bab$]
Put the suffix ab$ in
[Figure: the edge abab$ is split after ab; the new node has children ab$ and $]
9
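[Figure: the tree containing abab$, bab$ and ab$]
Put the suffix b$ in
[Figure: the edge bab$ is split after b; the new node has children ab$ and $]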
10
[Figure: the tree containing abab$, bab$, ab$ and b$]
Put the suffix $ in
[Figure: a new edge $ is added at the root]
11

[Figure: the complete suffix tree of abab$]
We will also label each leaf with the starting
point of the corresponding suffix.
[Figure: the same tree with leaves labeled 1 to 5]
12
Analysis
  • Takes O(n^2) time to build.

We will see how to do it in O(n) time
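The trivial algorithm can be sketched as below: insert the suffixes one at a time, longest first, splitting an edge wherever a suffix leaves it. This is only an illustration under assumed names (Node, naive_suffix_tree, leaf_labels); edges are stored as index pairs into the text rather than as strings.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end  # edge into this node is text[start:end]
        self.children = {}                 # first letter of an edge -> child node
        self.leaf = leaf                   # 1-based suffix start (leaves only)


def naive_suffix_tree(s):
    """Build the suffix tree of s + '$' by inserting every suffix: O(n^2) time."""
    text = s + "$"
    root = Node()
    for i in range(len(text)):
        node, j = root, i
        while True:
            child = node.children.get(text[j])
            if child is None:
                # No edge starts with this letter: hang a new leaf here.
                node.children[text[j]] = Node(j, len(text), leaf=i + 1)
                break
            k = child.start
            while k < child.end and text[k] == text[j]:
                k += 1
                j += 1
            if k == child.end:
                node = child               # whole edge matched, keep descending
                continue
            # Mismatch inside the edge: split it and add the new leaf.
            mid = Node(child.start, k)
            node.children[text[child.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[j]] = Node(j, len(text), leaf=i + 1)
            break
    return root, text


def leaf_labels(node):
    """Leaf labels in lexicographic order of the suffixes ('$' sorts first)."""
    if node.leaf is not None:
        return [node.leaf]
    out = []
    for letter in sorted(node.children):
        out.extend(leaf_labels(node.children[letter]))
    return out


root, text = naive_suffix_tree("abab")
print(leaf_labels(root))  # [5, 3, 1, 4, 2]
```

Each insertion walks at most the length of the suffix, which gives the quadratic bound.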
13
What can we do with it ?
  • Exact string matching
  • Given a text T, |T| = n, preprocess it such that
    when a pattern P, |P| = m, arrives you can quickly
    decide whether it occurs in T.
  • We may also want to find all occurrences of P in T

14
Exact string matching
In preprocessing we just build a suffix tree in
O(n) time

[Figure: the suffix tree of abab$, leaves labeled 1 to 5]
Given a pattern P = ab we traverse the tree
according to the pattern.
15

[Figure: the suffix tree of abab$, leaves labeled 1 to 5]
If we did not get stuck traversing the pattern
then the pattern occurs in the text.
Each leaf in the subtree below the node we reach
corresponds to an occurrence.
By traversing this subtree we get all k
occurrences in O(m + k) time: O(m) to walk down
the pattern and O(k) to visit the subtree
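A small sketch of this search. For brevity it assumes an uncompressed suffix trie built naively with nested dicts (quadratic size), not the linear-size suffix tree of the slides; the traversal logic is the same. The names suffix_trie and occurrences are illustrative.

```python
def suffix_trie(text):
    """Uncompressed trie of all suffixes of text + '$'; leaves carry 1-based starts."""
    text += "$"
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
        node["leaf"] = i + 1  # multi-character key, cannot clash with an edge letter
    return root


def occurrences(root, pattern):
    """Walk down along the pattern, then collect every leaf below that point."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []          # got stuck: the pattern does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = suffix_trie("abab")
print(occurrences(trie, "ab"))  # [1, 3]
```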
16
Generalized suffix tree
Given a set of strings S, a generalized suffix
tree of S is a compressed trie of all suffixes of
every s ∈ S
To associate each suffix with a unique string in
S, add a different special character to each s
17
Generalized suffix tree (Example)
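Let s1 = abab and s2 = aab. Here is a generalized
suffix tree for s1 and s2, with a distinct end
marker on each string (say $ for s1 and # for s2)
Suffixes: { $, #, b$, b#, ab$, ab#, bab$, aab#, abab$ }
[Figure: the generalized suffix tree, leaves labeled 1 to 5 for s1 and 1 to 4 for s2]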
18
So what can we do with it ?
Matching a pattern against a database of strings
19
Longest common substring (of two strings)
Every node with a leaf descendant from string s1
and a leaf descendant from string s2 represents
a maximal common substring and vice versa.


[Figure: the generalized suffix tree of s1 = abab and s2 = aab]
Find such node with largest string depth
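A sketch of the idea, assuming a naive generalized suffix trie (quadratic size) in place of the compressed tree: every node records which of the two strings has a suffix passing through it, and the deepest node reached by both spells a longest common substring. All names here are illustrative.

```python
def longest_common_substring(s1, s2):
    """Deepest trie node that lies on a suffix of s1 and on a suffix of s2."""
    root = {"owners": set(), "next": {}}
    for owner, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["next"].setdefault(ch, {"owners": set(), "next": {}})
                node["owners"].add(owner)

    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if child["owners"] == {1, 2}:       # a suffix of each string passes here
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


print(longest_common_substring("abab", "aab"))  # ab
```

With a real generalized suffix tree the same marking is done bottom-up over O(n) nodes, which is what makes the method linear.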
20
Lowest common ancestors
A lot more can be gained from the suffix tree if
we preprocess it so that we can answer LCA
queries on it








21
Why?
The LCA of two leaves represents the longest
common prefix (LCP) of these 2 suffixes


[Figure: the generalized suffix tree of s1 = abab and s2 = aab]
22
Finding maximal palindromes
  • A palindrome: caabaac, cbaabc
  • Want to find all maximal palindromes in a string s

Let s = cbaaba$
The maximal palindrome with center between i-1
and i is the LCP of the suffix at position i of s
and the suffix at position m-i+1 of s^r (the reverse of s)
23
Maximal palindromes algorithm
Prepare a generalized suffix tree for
s = cbaaba$ and s^r = abaabc# (a distinct end marker on each, say $ and #)
For every i find the LCA of suffix i of s and
suffix m-i+1 of s^r
24
Let s = cbaaba$, then s^r = abaabc
[Figure: the generalized suffix tree of s and s^r, leaves labeled 1 to 7 for each string]
25
Analysis
O(n) time to identify all palindromes
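A sketch of the palindrome computation under stated assumptions: the constant-time LCA query is replaced by a direct character comparison (so this version is quadratic in the worst case), indices are 0-based, and only centers that fall between two characters are handled, as in the slides. The names lcp and even_palindromes are illustrative.

```python
def lcp(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def even_palindromes(s):
    """Map each gap i (between s[i-1] and s[i]) to the maximal palindrome centered there."""
    sr = s[::-1]
    m = len(s)
    result = {}
    for i in range(1, m):
        # sr[m - i:] is s[:i] reversed, i.e. the text read leftwards from the gap.
        k = lcp(s[i:], sr[m - i:])
        if k:
            result[i] = s[i - k:i + k]
    return result


print(even_palindromes("cbaaba"))  # {3: 'baab'}
```

The LCP gives half the palindrome: k matching characters to the right of the gap mirror k characters to its left.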
26
Can we construct a suffix tree in linear time ?
27
Ukkonen's linear time construction
[Figure sequence, slides 27 to 43: the tree for ACTAATC is grown one character at a time through the prefixes A, AC, ACT, ACTA and ACTAA. Each step extends the leaf edges by the new character; C and T each add a leaf (2 and 3), and the second A of ACTAA splits the edge ACTAA into A followed by CTAA and adds leaf 4 below the new internal node.]
44
Phases & extensions
  • Phase i is when we add character i

  • In phase i we have i extensions of suffixes

45
[Figure sequence, slides 45 to 50: phase ACTAAT. The leaf edges are extended by T one at a time (CTAAT, CTAAT, TAAT, AT), then leaf 5 is added on a new edge T below the internal node A.]
51
Extension rules
  • Rule 1: The suffix ends at a leaf, you add a
    character on the edge entering the leaf
  • Rule 2: The suffix ended internally and the
    extended suffix does not exist, you add a leaf
    and possibly an internal node
  • Rule 3: The suffix exists and the extended suffix
    exists, you do nothing

52
[Figure sequence, slides 52 to 58: phase ACTAATC. The leaf edges are extended by C one at a time (CTAATC, CTAATC, TAATC, ATC, TC), then the edge TAATC is split into T followed by AATC and leaf 6 is added on a new edge C.]
59
Skip forward...
[Figure sequence, slides 59 to 78: the tree for ACTAATCAC has leaves 1 to 7. Adding T only extends the leaf edges, because the remaining suffixes ACT, CT and T are already in the tree (rule 3). Adding G extends the seven leaf edges and then adds leaves 8 to 11 by rule 2, creating internal nodes for ACT and CT along the way.]
79
Observations
At the first extension we must end at a leaf
because no longer suffix exists (rule 1)
At the second extension we are still most likely
to end at a leaf.
We will not end at a leaf only if the second
suffix is a prefix of the first
80
Say at some extension we do not end at a leaf
Then this suffix is a prefix of some other suffix
(suffixes)
We will not end at a leaf in subsequent extensions
Is there a way to continue with the i-th character ?
81
If there is, we apply rule 3, and then in all
subsequent extensions we will apply rule 3
Otherwise we keep applying rule 2 until in some
subsequent extension we apply rule 3
82
In terms of the rules that we apply a phase looks
like
1 1 1 1 1 1 1 2 2 2 2 3 3 3 3
We have nothing to do when applying rule 3, so
once rule 3 happens we can stop
We don't really do anything significant when we
apply rule 1 (the structure of the tree does not
change)
83
Representation
  • We do not really store a substring with each
    edge, but rather pointers into the starting
    position and ending position of the substring in
    the text
  • With this representation we do not really have to
    do anything when rule 1 applies

84
How do phases relate to each other
1 1 1 1 1 1 1 2 2 2 2 3 3 3 3
In the next phase we must have
1 1 1 1 1 1 1 1 1 1 1 2/3
So we start the phase with the extension that was
the first where we applied rule 3 in the previous
phase
85
Suffix Links
[Figure sequence, slides 85 to 89: the final tree for ACTAATCACTG (leaves 1 to 11), repeated on each slide to illustrate suffix links]
90
Suffix Links
  • From an internal node that corresponds to the
    string aβ to the internal node that corresponds
    to β (if there is such node)

[Figure: a node labeled aβ with a link to a node labeled β]
91
  • Is there such a node ?

Suppose we create v applying rule 2. Then there
was a suffix aβx.. and now we add aβy
[Figure: node v at the end of aβ, with children x.. and y]
So there was a suffix βx..
92
  • Is there such a node ?

Suppose we create v applying rule 2. Then there
was a suffix aβx.. and now we add aβy
[Figure: node v at the end of aβ, with children x.. and y; a node at the end of β with children x.. and z..]
So there was a suffix βx..
If there was also a suffix βz..
Then a node corresponding to β is there
93
  • Is there such a node ?

Suppose we create v applying rule 2. Then there
was a suffix aβx.. and now we add aβy
[Figure: node v at the end of aβ, with children x.. and y; a node at the end of β with children x.. and y]
So there was a suffix βx..
If there was also a suffix βz..
Then a node corresponding to β is there
Otherwise it will be created in the next
extension when we add βy
94
Invariant: all suffix links are there except
(possibly) that of the last internal node added
You are at the (internal) node corresponding to
the last extension
Remember we apply rule 2
You start a phase at the last internal node of
the first extension in which you applied rule 3
in the previous iteration
95
1) Go up one node (if needed) to find a suffix
link
2) Traverse the suffix link
3) If you went up in step 1 along an edge that
was labeled d then go down consuming a string d
96
[Figure sequence, slides 96 to 99: below the node for aβ and below the node for β there is a path labeled d; v is the node at the end of aβd, with children x.. and y]
Create the new internal node if necessary, add
the suffix and install a suffix link if necessary
100
Analysis
Handling all extensions of rule 1 and all
extensions of rule 3 per phase takes O(1) time ⇒
O(n) total
How many times do we carry out rule 2 in all
phases ?
O(n)
Does each application of rule 2 take constant
time ?
No ! (going up and traversing the suffix link
takes constant time, but then we go down possibly
on many edges..)
101
[Figure: the nodes for aβ and β, each followed by a path labeled d; v has children x.. and y]
102
So why is it a linear time algorithm ?
How much can the depth change when we traverse a
suffix link ?
It can decrease by at most 1
103
Punch line
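Each time we go up or traverse a suffix link the
depth decreases by at most 1
When starting the depth is 0, final depth is at
most n
There are at most n applications of rule 2, each
lowering the depth by at most 2, so the total
decrease is at most 2n
So during all applications of rule 2 together we
cannot go down more than n + 2n = 3n times
Theorem: The running time of Ukkonen's algorithm is
O(n)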
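For reference, here is a compact sketch of a linear-time construction. It uses the common "active point" formulation (active node, active edge, active length, plus a counter of pending suffixes) rather than the literal go up / follow link / go down steps of the slides; the extension rules, the index-pair edge representation and the suffix links play the same roles. All names (Node, ukkonen, leaf_starts) are illustrative, and a `$` terminator is appended as an assumption.

```python
class Node:
    def __init__(self, start, end):
        self.start = start   # the edge into this node is text[start:end]
        self.end = end       # None marks an open leaf edge that grows with each phase
        self.children = {}   # first letter of an edge -> child
        self.link = None     # suffix link (internal nodes only)


def ukkonen(s):
    text = s + "$"
    root = Node(0, 0)
    node, edge, length = root, 0, 0   # active point
    remainder = 0                     # suffixes not yet inserted explicitly

    for i, ch in enumerate(text):     # phase i: add character text[i]
        remainder += 1
        last_internal = None          # internal node still waiting for its suffix link
        while remainder > 0:
            if length == 0:
                edge = i
            child = node.children.get(text[edge])
            if child is None:
                # Rule 2 without a split: new leaf directly below the active node.
                node.children[text[edge]] = Node(i, None)
                if last_internal is not None:
                    last_internal.link = node
                    last_internal = None
            else:
                child_end = i + 1 if child.end is None else child.end
                edge_len = child_end - child.start
                if length >= edge_len:
                    # The active point lies beyond this edge: walk down.
                    node = child
                    edge += edge_len
                    length -= edge_len
                    continue
                if text[child.start + length] == ch:
                    # Rule 3: already present, so the phase ends here.
                    length += 1
                    if last_internal is not None:
                        last_internal.link = node
                    break
                # Rule 2 with a split: new internal node plus new leaf.
                split = Node(child.start, child.start + length)
                node.children[text[edge]] = split
                split.children[ch] = Node(i, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_internal is not None:
                    last_internal.link = split
                last_internal = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link if node.link is not None else root
    return root, text


def leaf_starts(node, text, depth=0):
    """1-based suffix starts below node, in lexicographic order of the suffixes."""
    if not node.children:
        return [len(text) - depth + 1]
    out = []
    for letter in sorted(node.children):
        child = node.children[letter]
        end = len(text) if child.end is None else child.end
        out.extend(leaf_starts(child, text, depth + end - child.start))
    return out


root, text = ukkonen("abab")
print(leaf_starts(root, text))  # [5, 3, 1, 4, 2]
```

Rule 1 never appears in the code: leaf edges have an open end, so they are extended implicitly when the phase counter advances.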
104
Another application
  • Suppose we have a pattern P and a text T and we
    want to find for each position of T the length of
    the longest substring of P that matches there.
  • How would you do that in O(n) time ?

105
(No Transcript)
106
Drawbacks of suffix trees
  • Suffix trees consume a lot of space
  • It is O(n) but the constant is quite big
  • Notice that if we indeed want to traverse an edge
    in O(1) time then we need an array of pointers of
    size |Σ| in each node

107
Suffix array
  • We lose some of the functionality but we save
    space.

Let s = abab
Sort the suffixes lexicographically: ab, abab,
b, bab
The suffix array gives the indices of the
suffixes in sorted order
3 1 4 2
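As a direct illustration of the definition, the array can be obtained by sorting the suffix start positions. This is a simple sketch, not a linear-time construction: comparing suffixes character by character makes it O(n^2 log n) in the worst case. The function name is illustrative.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s, in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


print(suffix_array("abab"))  # [3, 1, 4, 2]
```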
108
How do we build it ?
  • Build a suffix tree
  • Traverse the tree in DFS, lexicographically
    picking edges outgoing from each node and fill
    the suffix array.
  • O(n) time

109
How do we search for a pattern ?
  • If P occurs in T then all its occurrences are
    consecutive in the suffix array.
  • Do a binary search on the suffix array
  • Takes O(m log n) time
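
A sketch of the plain binary search, assuming the suffix array holds 1-based start positions. Two searches locate the block of suffixes that begin with P; each step compares at most m characters. The names are illustrative.

```python
def find_occurrences(text, sa, pattern):
    """All 1-based positions where pattern occurs in text, via the suffix array sa."""
    m = len(pattern)

    def prefix(k):
        start = sa[k] - 1
        return text[start:start + m]   # first m characters of the k-th sorted suffix

    lo, hi = 0, len(sa)
    while lo < hi:                     # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                     # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])        # the occurrences are consecutive in sa


text = "mississippi"
sa = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(find_occurrences(text, sa, "issi"))  # [2, 5]
```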

110
Example
Let S = mississippi
Let P = issa
L → i
    ippi
    issippi
    ississippi
    mississippi
M → pi
    ppi
    sippi
    sissippi
    ssippi
R → ssissippi
111
How do we accelerate the search ?
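Maintain l = LCP(P, L)
Maintain r = LCP(P, R)
If l = r then start comparing M to P at position l + 1
[Figure: P aligned against L, M and R, with the matched prefixes of length l and r marked]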
112
How do we accelerate the search ?
If l > r then
Suppose we know LCP(L, M)
  • If LCP(L, M) < l we go left
  • If LCP(L, M) > l we go right
  • If LCP(L, M) = l we start comparing at l + 1
[Figure: P aligned against L, M and R, with the matched prefixes of length l and r marked]
113
Analysis of the acceleration
If we do more than a single comparison in an
iteration then max(l, r) grows by 1 for each
comparison ⇒ O(log n + m) time
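The sketch below shows only the simpler half of this acceleration: it maintains l and r and starts every comparison with M at position min(l, r), since every suffix between L and R shares that many characters with P. It does not use precomputed LCP(L, M) values, so it does not reach the O(log n + m) bound in the worst case. Suffixes are sliced for clarity, and all names are illustrative.

```python
def lcp_from(a, b, k):
    """Common prefix length of a and b, given that their first k characters agree."""
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def occurs(text, sa, pattern):
    """True if pattern occurs in text; sa holds 1-based sorted suffix starts (non-empty)."""
    def suffix(k):
        return text[sa[k] - 1:]

    m = len(pattern)
    lo, hi = 0, len(sa) - 1
    l = lcp_from(pattern, suffix(lo), 0)
    r = lcp_from(pattern, suffix(hi), 0)
    if l == m or r == m:
        return True
    if pattern < suffix(lo) or pattern > suffix(hi):
        return False
    # Invariant: suffix(lo) < pattern < suffix(hi), with l and r their LCPs with pattern.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        s = suffix(mid)
        k = lcp_from(pattern, s, min(l, r))   # skip the characters known to match
        if k == m:
            return True
        if k == len(s) or s[k] < pattern[k]:
            lo, l = mid, k                    # M < P: go right
        else:
            hi, r = mid, k                    # M > P: go left
    return False


text = "mississippi"
sa = sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
print(occurs(text, sa, "issa"), occurs(text, sa, "issi"))  # False True
```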