Title: Suffix Trees
2Suffix Trees
Construction and Applications
João Carreira 2008
3Outline
- Why Suffix Trees?
- Definition
- Ukkonen's Algorithm (construction)
- Applications
9Why Suffix Trees?
- Asymptotically fast.
- The basis of state-of-the-art data structures.
- You don't need a PhD to use them.
- Challenging.
- Expose interesting algorithmic ideas.
14Definition
Suffix Tree for an m-character string
- m leaves numbered 1 to m
- edge-label vs node-label
- each internal node other than the root has at least two children
- the path-label of leaf j is S[j..m]
- no two edges out of the same node can have
edge-labels beginning with the same character
15Definition Example
- String: xabxac
- Length (m): 6 characters
- Number of leaves: 6
- Node 5 label: ac
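The example can be checked with a small sketch. This is not Ukkonen's algorithm: it inserts the suffixes one by one in quadratic time, and it assumes a non-empty string whose last character occurs nowhere else (true for xabxac). The names Node, build_naive and leaf_labels are made up for this illustration.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end   # edge into this node is text[start:end]
        self.children = {}                  # first character of edge -> child Node
        self.leaf = leaf                    # 1-based suffix number (leaves only)


def build_naive(text):
    """Insert every suffix of text, one at a time (quadratic time, not Ukkonen)."""
    if text[-1] in text[:-1]:
        raise ValueError("last character must be unique (append a terminator)")
    root = Node()
    for j in range(len(text)):
        node, i = root, j
        while True:
            child = node.children.get(text[i])
            if child is None:                       # no edge starts with text[i]
                node.children[text[i]] = Node(i, len(text), leaf=j + 1)
                break
            k = child.start
            while k < child.end and text[k] == text[i]:
                k, i = k + 1, i + 1                 # match along the edge
            if k == child.end:                      # whole edge matched: descend
                node = child
                continue
            mid = Node(child.start, k)              # mismatch inside the edge: split it
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[i]] = Node(i, len(text), leaf=j + 1)
            break
    return root


def leaf_labels(text, node, prefix=""):
    """Map each leaf number to the string spelled from the root to that leaf."""
    label = prefix + text[node.start:node.end]
    if node.leaf is not None:
        return {node.leaf: label}
    labels = {}
    for child in node.children.values():
        labels.update(leaf_labels(text, child, label))
    return labels


text = "xabxac"
labels = leaf_labels(text, build_naive(text))
print(len(labels), labels[5])   # prints: 6 ac
```

Each edge stores a pair of indices into the text rather than a copy of its label, which keeps the node objects small.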
16Implicit vs Explicit
18Ukkonen's Algorithm
suffix tree construction
- Text S[1..m]
- m phases
- phase i+1 is divided into i+1 extensions
- In extension j of phase i+1:
- find the end of the path from the root labeled
with substring S[j..i]
- extend the substring by adding the character
S(i+1) to its end
19Extension Rules
- Rule 1: Path β ends at a leaf. S(i+1) is added
to the end of the label on that leaf edge.
20Extension Rules
- Rule 2: No path from the end of β starts with
S(i+1), but at least one labeled path
continues from the end of β. A new leaf edge
labeled S(i+1) is created, together with a new
internal node if β ends inside an edge.
21Extension Rules
- Rule 3: Some path from the end of β starts with
S(i+1), so we do nothing.
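To see which rule fires in each extension, the sketch below classifies extensions with plain substring tests instead of a tree. This is a stand-in chosen for brevity, not how the algorithm itself decides; extension_rule and phase_rules are hypothetical names, and positions are 1-based as on the slides.

```python
def extension_rule(s, phase, j):
    """Return the rule (1, 2 or 3) that applies in extension j of the given phase.

    phase plays the role of i+1, and beta is S[j..i].
    """
    prefix = s[:phase - 1]        # S[1..i], the text already in the tree
    beta = s[j - 1:phase - 1]     # S[j..i], empty when j == i+1
    c = s[phase - 1]              # S(i+1), the character being added
    if beta + c in prefix:
        return 3                  # some path from the end of beta starts with c
    if beta in prefix[:-1]:
        return 2                  # beta continues, but never with c
    return 1                      # nothing follows beta: it ends at a leaf


def phase_rules(s, phase):
    """Rules applied by extensions 1..phase of one phase, in order."""
    return [extension_rule(s, phase, j) for j in range(1, phase + 1)]


print(phase_rules("xabxac", 5))   # prints: [1, 1, 1, 3, 3]
print(phase_rules("xabxac", 6))   # prints: [1, 1, 1, 2, 2, 2]
```

In both phases the rules appear in the order 1, then 2, then 3.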
27Ukkonen's Algorithm
suffix tree construction
Complexity
- m phases
- phase j -> j extensions
- find the end of the path of substring β: O(|β|) = O(m)
- each extension, once the end is found: O(1)
- total: O(m^3)
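The cubic bound can be spelled out as a sum over all extensions, assuming each path S[j..i] is followed character by character from the root:

```latex
\sum_{i=0}^{m-1} \sum_{j=1}^{i+1} O\bigl(|S[j..i]|\bigr)
  = \sum_{i=0}^{m-1} \sum_{j=1}^{i+1} O(i - j + 1)
  = \sum_{i=0}^{m-1} O(i^2)
  = O(m^3)
```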
28First make it run, then make it run fast.
Brian Kernighan
29Suffix Links
Definition
- For an internal node v with path-label xα
(x a single character, α a possibly empty string),
if there is another node s(v) with path-label α,
then a pointer from v to s(v) is called a suffix link.
32Suffix Links
Lemma
- If a new internal node v with path-label xα is
added to the current tree in extension j of some
phase, then either the path labeled α already ends
at an internal node, or an internal node at the end
of the string α will be created in the next
extension of the same phase.
- If Rule 2 applies:
- S[j..i] continues with c ≠ S(i+1)
- S[j+1..i] continues with c.
36Single Extension Algorithm
Extension j of phase i+1
1. Find the first node v at or above the end of
S[j-1..i] that either has a suffix link
from it or is the root. Let γ denote the string
between v and the end of S[j-1..i].
2. If v is the root, follow the path for S[j..i]
(as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following
the path for string γ.
3. Using the extension rules, ensure that the
string S[j..i]S(i+1) is in the tree.
4. If a new internal node w was created in extension
j-1 (by rule 2), then string α must end at
node s(w), the end node for the suffix link from
w. Create the suffix link (w, s(w)) from w to
s(w).
37Node Depth
The node-depth of v is at most one greater than
the node depth of s(v).
[Figure: nodes with path-labels xβ, xα, xγ and their suffix-link targets β, α, γ. Annotations: node depth 4, node depth 3, equal node-depth 3.]
39Skip/count Trick
- Directly implemented edge traversal: time proportional
to the number of characters on the edge.
- Skip/count: the edge length is known, so compare only
the first character and jump from node to node.
- K: number of nodes in a path
- Time to traverse a path: O(K)
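A minimal sketch of the walk-down, assuming edges are stored as index pairs into the text and that the string being followed is already known to be in the tree. Node and skip_count_walk are hypothetical names, and the small tree is built by hand.

```python
class Node:
    def __init__(self, start=0, end=0):
        self.start, self.end = start, end   # edge into this node is text[start:end]
        self.children = {}                  # first character of edge -> child Node


def skip_count_walk(text, node, lo, hi):
    """Walk down from node along text[lo:hi], assumed to be present in the tree.

    Returns (node, child, offset, hops): the walk ends offset characters into
    the edge leading to child, or exactly at node when child is None.
    """
    hops = 0
    while lo < hi:
        child = node.children[text[lo]]     # only the first character is inspected
        length = child.end - child.start
        if hi - lo < length:
            return node, child, hi - lo, hops   # the string ends inside this edge
        lo += length                        # skip the whole edge in constant time
        node = child
        hops += 1
    return node, None, 0, hops


# Hand-built fragment of the tree for "xabxac": root -"xa"-> xa -"bxac"-> leaf
text = "xabxac"
root, xa, leaf = Node(), Node(0, 2), Node(2, 6)
root.children["x"] = xa
xa.children["b"] = leaf

node, child, offset, hops = skip_count_walk(text, root, 0, 3)   # follow "xab"
print(node is xa, child is leaf, offset, hops)   # prints: True True 1 1
```

The cost depends on the number of nodes passed, not on the number of characters skipped.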
47Ukkonen's Algorithm
- Using the skip/count trick, any phase of
Ukkonen's algorithm takes O(m) time.
Proof
- There are i+1 ≤ m extensions in phase i+1.
- In a single extension, the algorithm walks up at
most one edge, traverses one suffix link,
walks down some number of nodes, applies the
extension rules and may add a suffix link.
- The up-walk decreases the current node-depth by
at most one.
- Each suffix link traversal decreases the
node-depth by at most another one.
- Each down-walk moves to a node of greater depth.
- Over the entire phase the node-depth is
decremented at most 2m times.
- No node can have depth greater than m, so the
total increment to current node-depth
(down walks) is bounded by 3m over the entire
phase.
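The 3m figure is simple bookkeeping on the current node-depth, which can never exceed m. Here D and U are labels introduced only for this sketch, for the total number of down-steps and the total number of depth decrements in the phase:

```latex
D - U \le m
\quad\text{and}\quad
U \le 2m
\quad\Longrightarrow\quad
D \le m + 2m = 3m
```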
49Ukkonen's Algorithm
m phases, O(m) time per phase: O(m^2)
50First make it run fast, then make it run faster.
João Carreira
52Edge-Label Compression
- A string with m characters has m suffixes.
- If edge labels are represented with characters,
O(m^2) space is needed.
- To achieve O(m) space, each edge-label is stored
as a pair of indices (p, q), standing for S[p..q].
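A small sketch of the index-pair representation, with 1-based inclusive positions as on the slides. EdgeLabel is a hypothetical name.

```python
from typing import NamedTuple


class EdgeLabel(NamedTuple):
    p: int   # 1-based position of the first character of the label in S
    q: int   # 1-based position of the last character (inclusive)

    def text(self, s):
        """Recover the label characters from the original string."""
        return s[self.p - 1:self.q]


s = "xabxac"
print(EdgeLabel(1, 2).text(s), EdgeLabel(3, 6).text(s))   # prints: xa bxac
```

Each edge then costs two integers regardless of how long its label is.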
53Two more tricks...
55Rule 3 is a show stopper
If rule 3 applies in extension j, it will also
apply in all further extensions until the end of
the phase.
Why?
- When rule 3 applies, the path labeled S[j..i]
must continue with character S(i+1), and
so the path labeled S[j+1..i] does also,
and rule 3 again applies in extensions j+1, ..., i+1.
56Rule 3 is a show stopper
- End any phase i+1 the first time rule 3 applies.
- The remaining extensions are said to be done
implicitly.
58Once a leaf always a leaf
- Leaf created -> always a leaf in all successive
trees.
- No mechanism for extending a leaf edge beyond
its current leaf.
- Once there is a leaf labeled j, extension rule 1
will always apply to extension j
in any successive phase.
- Leaf edge label: (p, e), where e is a global index
for the current end of the text.
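The effect of the shared end index can be sketched as follows. GlobalEnd and LeafEdge are hypothetical names, and the leaves are created by hand for the first three phases of xabxac, where each phase adds one new leaf.

```python
class GlobalEnd:
    """A single mutable end position shared by every leaf edge."""
    def __init__(self):
        self.value = 0


class LeafEdge:
    def __init__(self, p, end):
        self.p = p        # 1-based start of the label in S
        self.end = end    # shared GlobalEnd object, not a copy of its value

    def label(self, s):
        return s[self.p - 1:self.end.value]


s = "xabxac"
e = GlobalEnd()
leaves = []
for phase in (1, 2, 3):
    e.value = phase                      # one assignment extends every leaf edge
    leaves.append(LeafEdge(phase, e))    # the new leaf created in this phase

after_phase_3 = [leaf.label(s) for leaf in leaves]
e.value = 6                              # jump ahead to the end of phase 6
after_phase_6 = [leaf.label(s) for leaf in leaves]
print(after_phase_3)   # prints: ['xab', 'ab', 'b']
print(after_phase_6)   # prints: ['xabxac', 'abxac', 'bxac']
```

All rule 1 extensions of a phase are covered by the single update of e.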
59Single Phase Algorithm
In each phase i
60Single Phase Algorithm
During construction
61Implicit to Explicit
One last phase to add a terminal character (e.g. $): O(m)
62Suffix Trees are a Swiss Knife
64Applications
Exact String Matching
Preprocessing: O(m), for a text of length m
Search: O(n + k), for a pattern of length n with k occurrences
Three occurrences of string aw.
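A sketch of the search step: match the pattern from the root, then report every leaf below the point reached. The tree here is built naively in quadratic time rather than with Ukkonen's algorithm, the example text awyawxawxz is an assumption chosen so that aw occurs three times, and the pattern is assumed non-empty. Node, build_naive and find_all are hypothetical names.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end   # edge into this node is text[start:end]
        self.children = {}                  # first character of edge -> child Node
        self.leaf = leaf                    # 1-based suffix number (leaves only)


def build_naive(text):
    """Naive suffix-by-suffix build; text must end in a unique character."""
    root = Node()
    for j in range(len(text)):
        node, i = root, j
        while True:
            child = node.children.get(text[i])
            if child is None:
                node.children[text[i]] = Node(i, len(text), leaf=j + 1)
                break
            k = child.start
            while k < child.end and text[k] == text[i]:
                k, i = k + 1, i + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)      # split the edge at the mismatch
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[i]] = Node(i, len(text), leaf=j + 1)
            break
    return root


def find_all(text, root, pattern):
    """Return the sorted 1-based start positions of pattern in text."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return []
        label = text[child.start:child.end]
        step = min(len(label), len(pattern) - i)
        if label[:step] != pattern[i:i + step]:
            return []
        i, node = i + step, child
    found, stack = [], [node]               # every leaf below is an occurrence
    while stack:
        n = stack.pop()
        if n.leaf is not None:
            found.append(n.leaf)
        stack.extend(n.children.values())
    return sorted(found)


text = "awyawxawxz" + "$"                   # "$" assumed absent from the text
root = build_naive(text)
print(find_all(text, root, "aw"))           # prints: [1, 4, 7]
```

The matching loop touches at most n pattern characters, and the leaf collection visits one leaf per occurrence.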
65Applications
And much more...
- Longest common substring: O(n)
- Longest repeated substring: O(n)
- Longest palindrome: O(n)
- Most frequently occurring substrings of a
minimum length: O(n)
- Shortest substrings occurring only once: O(n)
- Lempel-Ziv decomposition: O(n)
- ...
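As one illustration from this list, the longest repeated substring is the path-label of the deepest internal node. The sketch below assumes a naive quadratic build and copies path strings for simplicity, so it does not reach the O(n) bound. Node, build_naive and longest_repeated_substring are hypothetical names.

```python
class Node:
    def __init__(self, start=0, end=0, leaf=None):
        self.start, self.end = start, end   # edge into this node is text[start:end]
        self.children = {}                  # first character of edge -> child Node
        self.leaf = leaf                    # 1-based suffix number (leaves only)


def build_naive(text):
    """Naive suffix-by-suffix build; text must end in a unique character."""
    root = Node()
    for j in range(len(text)):
        node, i = root, j
        while True:
            child = node.children.get(text[i])
            if child is None:
                node.children[text[i]] = Node(i, len(text), leaf=j + 1)
                break
            k = child.start
            while k < child.end and text[k] == text[i]:
                k, i = k + 1, i + 1
            if k == child.end:
                node = child
                continue
            mid = Node(child.start, k)      # split the edge at the mismatch
            node.children[text[mid.start]] = mid
            child.start = k
            mid.children[text[k]] = child
            mid.children[text[i]] = Node(i, len(text), leaf=j + 1)
            break
    return root


def longest_repeated_substring(s):
    """Return a longest substring occurring at least twice ("" if none)."""
    text = s + "$"                          # "$" assumed absent from s
    best = ""
    stack = [(build_naive(text), "")]
    while stack:
        node, path = stack.pop()
        # An internal node has at least two leaves below it, so its
        # path-label occurs at least twice in the text.
        if node.children and len(path) > len(best):
            best = path
        for child in node.children.values():
            stack.append((child, path + text[child.start:child.end]))
    return best


print(longest_repeated_substring("xabxac"))   # prints: xa
print(longest_repeated_substring("banana"))   # prints: ana
```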
66Biology easily has 500 years of exciting
problems to work on.
Donald Knuth
67web.ist.utl.pt/joao.carreira
Questions?