Suffix Trees - PowerPoint PPT Presentation

Title: Suffix Trees


2
Suffix Trees
Construction and Applications
João Carreira 2008
3
Outline
  • Why Suffix Trees?
  • Definition
  • Ukkonen's Algorithm (construction)
  • Applications

4
Why Suffix Trees?
9
Why Suffix Trees?
  • Asymptotically fast.
  • The basis of state-of-the-art data structures.
  • You don't need a PhD to use them.
  • Challenging.
  • Expose interesting algorithmic ideas.

14
Definition
Suffix Tree for an m-character string S
  • m leaves numbered 1 to m
  • edge-label vs node-label: the label of a node is
    the concatenation of the edge-labels on the
    path from the root to that node
  • each internal node, other than the root, has at
    least two children
  • the label of leaf j is S[j..m]
  • no two edges out of the same node can have
    edge-labels beginning with the same character

15
Definition Example
  • String: xabxac
  • Length (m): 6 characters
  • Number of leaves: 6
  • Node 5 label: ac
16
Implicit vs Explicit
  • What if we have axabx? The suffix x is a prefix
    of the suffix xabx, so its path does not end at
    a leaf.
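The definition can be tried out with a naive builder. This is a minimal sketch, not the construction the slides go on to describe: it inserts the suffixes one at a time and splits an edge wherever two paths diverge, which takes quadratic time. The names `Node`, `naive_suffix_tree` and `leaf_labels` are made up for this illustration.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}  # first character of edge -> (edge_label, child)
        self.leaf = leaf    # 1-based suffix number if this node is a leaf


def naive_suffix_tree(s):
    """Insert suffixes s[j:] one by one, splitting edges where paths diverge."""
    root = Node()
    for j in range(len(s)):
        node, rest = root, s[j:]
        while True:
            if rest[0] not in node.children:
                # No edge starts with this character: hang a new leaf here.
                node.children[rest[0]] = (rest, Node(leaf=j + 1))
                break
            label, child = node.children[rest[0]]
            k = 0  # length of the common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(rest):
                # The suffix ends inside the tree, so it cannot get a leaf.
                raise ValueError(f"suffix {j + 1} is a prefix of another suffix")
            if k < len(label):
                # Split the edge so the divergence point becomes a node.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def leaf_labels(node, prefix=""):
    """Map each leaf number to its path-label."""
    if node.leaf is not None:
        return {node.leaf: prefix}
    labels = {}
    for label, child in node.children.values():
        labels.update(leaf_labels(child, prefix + label))
    return labels
```

For xabxac this yields six leaves, with leaf 5 labeled ac. For axabx the builder raises an error at the suffix x, and appending a character that occurs nowhere else (for example `axabx$`) makes every suffix end at its own leaf.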

17
Ukkonen's Algorithm
suffix tree construction
18
Ukkonen's Algorithm
suffix tree construction
  • Text S[1..m]
  • m phases
  • phase i+1 is divided into i+1 extensions
  • In extension j of phase i+1:
  • find the end of the path from the root labeled
    with substring S[j..i]
  • extend the substring by adding the character
    S(i+1) to its end

19
Extension Rules
  • Rule 1: Path β ends at a leaf. S(i+1) is added
    to the end of the label on that leaf edge.

20
Extension Rules
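  • Rule 2: No path from the end of β starts with
    S(i+1), but at least one labeled path
    continues from the end of β. A new leaf edge
    labeled S(i+1) is created there (with a new
    internal node if β ends inside an edge).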

21
Extension Rules
  • Rule 3: Some path from the end of β starts with
    S(i+1), so βS(i+1) is already in the tree and
    we do nothing.
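The phases, extensions and rules can be sketched on an uncompressed trie (one character per edge). That is a simplification assumed here, not the compressed tree of the slides: rules 1 and 2 both end up adding one node, and the code only records which rule would have fired. `naive_implicit_trie` is a hypothetical name.

```python
from collections import Counter


def naive_implicit_trie(s):
    """Run the naive phases and extensions; count how often each rule fires."""
    root = {}          # each node is a dict: character -> child node
    rules = Counter()
    for i in range(len(s)):        # phase i+1 adds character s[i]
        c = s[i]
        for j in range(i + 1):     # one extension per suffix start j
            node = root
            for ch in s[j:i]:      # walk from the root to the end of beta
                node = node[ch]
            if c in node:
                rules[3] += 1      # rule 3: beta + c is already present
                continue
            if not node and j < i:
                rules[1] += 1      # rule 1: beta ends at a leaf
            else:
                rules[2] += 1      # rule 2: a new branch leaves beta
            node[c] = {}
    return root, rules
```

The walk from the root costs time proportional to the length of β, which is where the cubic bound of the naive algorithm comes from.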

27
Ukkonen's Algorithm
suffix tree construction
Complexity
  • m phases
  • phase j -> j extensions
  • find the end of the path of substring β:
    O(|β|) = O(m)
  • each extension rule, once the end is found: O(1)

Total: O(m^3)
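The cubic bound can be checked by summing the walk lengths. The sketch below assumes that extension j of phase i+1 costs time proportional to |S[j..i]| = i - j + 1, and that phase 1 is trivial, so i runs from 1 to m-1.

```latex
\sum_{i=1}^{m-1} \sum_{j=1}^{i+1} \bigl|S[j..i]\bigr|
  = \sum_{i=1}^{m-1} \sum_{j=1}^{i+1} (i - j + 1)
  = \sum_{i=1}^{m-1} \frac{i(i+1)}{2}
  = \frac{(m-1)\,m\,(m+1)}{6}
  = O(m^3)
```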
28
First make it run, then make it run fast.
Brian Kernighan
29
Suffix Links
Definition
  • For an internal node v with path-label xα
    (x a single character, α a possibly empty
    string), if there is another node s(v) with
    path-label α, then a pointer from v to s(v) is
    called a suffix link.
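Suffix links can be listed for a small string without building the tree. This sketch assumes the string ends with a unique terminator; then a substring is the path-label of an internal node exactly when it is followed by at least two distinct characters. The function names are hypothetical and the approach is brute force, for illustration only.

```python
def internal_path_labels(s):
    """Path-labels of the internal nodes of the suffix tree of s.

    Assumes s ends with a unique terminator, so a substring labels an
    internal node exactly when two or more distinct characters follow it.
    """
    followers = {}
    for i in range(len(s)):
        for j in range(i, len(s)):
            # s[i:j] occurs at position i and is followed by s[j].
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each non-root internal path-label x + alpha to alpha."""
    labels = internal_path_labels(s)
    return {w: w[1:] for w in labels if w and w[1:] in labels}
```

For `xabxac$` the internal nodes are the root (empty label), a and xa, and the links are xa to a and a to the root.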

32
Suffix Links
Lemma
  • If a new internal node v with path-label xα is
    added to the current tree in extension j of
    some phase i+1, then either the path labeled α
    already ends at an internal node, or an
    internal node at the end of string α will be
    created in extension j+1 of the same phase.
  • v is created only if Rule 2 applies, so:
  • S[j..i] = xα continues with some c ≠ S(i+1).
  • S[j+1..i] = α continues with c as well.
  • If α continues only with c, extension j+1
    creates the node by Rule 2; otherwise a node
    already exists at the end of α.

36
Single Extension Algorithm
Extension j of phase i+1
1. Find the first node v at or above the end of
S[j-1..i] that either has a suffix link
from it or is the root. Let γ denote the string
between v and the end of S[j-1..i].
2. If v is the root, follow the path for S[j..i]
(as in the naive algorithm). Else traverse the
suffix link and walk down from s(v) following
the path for string γ.
3. Using the extension rules, ensure that the
string S[j..i]S(i+1) is in the tree.
4. If a new internal node w was created in
extension j-1 (by rule 2), then string α (where xα
is the path-label of w) must end at node s(w), the
end node for the suffix link from w. Create the
suffix link (w, s(w)) from w to s(w).
37
Node Depth
The node-depth of v is at most one greater than
the node depth of s(v).


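(Figure: example with a node at node depth 4, its
suffix-link target at node depth 3, and a pair of
nodes at equal node-depth 3; the edge labels shown
include β, xα and α.)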
39
Skip/count Trick
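  • g': number of characters in an edge
  • Directly implemented edge traversal: O(g')
  • Since γ is known to be in the tree, compare only
    the first character of each edge and skip the
    rest of the edge.
  • Jump from node to node.
  • K: number of nodes in a path
  • Time to traverse a path: O(K)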
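A minimal sketch of the skip/count walk on a hand-built suffix tree of `xabxac$`. The tree layout, the index-pair edges and the name `skip_count` are assumptions made for this illustration. The walk looks at one character per edge and then jumps by the edge length.

```python
TEXT = "xabxac$"

# Hand-built suffix tree of TEXT. Each node maps the first character of an
# outgoing edge to (start, end, child); start and end are 0-based inclusive
# indices into TEXT, and child is None for a leaf.
NODE_A = {"b": (2, 6, None), "c": (5, 6, None)}
NODE_XA = {"b": (2, 6, None), "c": (5, 6, None)}
ROOT = {
    "x": (0, 1, NODE_XA),
    "a": (1, 1, NODE_A),
    "b": (2, 6, None),
    "c": (5, 6, None),
    "$": (6, 6, None),
}


def skip_count(node, lo, hi):
    """Walk down TEXT[lo..hi] from node; the string must be in the tree.

    Returns (node, edge_char, offset, hops): the walk ends `offset`
    characters into the edge that starts with edge_char below `node`.
    An offset of 0 means the walk ends exactly at `node` (None if that
    node is a leaf in this toy layout).
    """
    hops = 0
    while lo <= hi:
        start, end, child = node[TEXT[lo]]  # one character decides the edge
        edge_len = end - start + 1
        remaining = hi - lo + 1
        if edge_len > remaining:
            return node, TEXT[lo], remaining, hops  # ends inside this edge
        lo += edge_len                      # skip the whole edge
        node = child
        hops += 1
    return node, None, 0, hops
```

Walking xabx (indices 0 to 3) from the root takes one hop to the node labeled xa and then stops two characters into the edge that starts with b.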

47
Ukkonen's Algorithm
  • Using the skip/count trick, any phase of
    Ukkonen's algorithm takes O(m) time.

Proof
  • There are i+1 ≤ m extensions in phase i+1.
  • In a single extension, the algorithm walks up at
    most one edge, traverses one suffix link,
    walks down some number of nodes, applies the
    extension rules and may add a suffix link.
  • The up-walk decreases the current node-depth by
    at most one.
  • Each suffix link traversal decreases the
    node-depth by at most another one.
  • Each down-walk moves to a node of greater depth.
  • Over the entire phase the node-depth is
    decremented at most 2m times.
  • No node can have depth greater than m, so the
    total increment to current node-depth
    (down-walks) is bounded by 3m over the entire
    phase.

49
Ukkonen's Algorithm
  • m phases
  • 1 phase: O(m)

Total: O(m^2)
50
First make it run fast, then make it run faster.
João Carreira
52
Edge-Label Compression
  • A string with m characters has m suffixes.
  • If edge labels are represented with characters,
    O(m^2) space is needed.

To achieve O(m) space, each edge-label is stored as
a pair of indices (p, q), standing for the
substring S[p..q].
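A small sketch of the space difference, using the easiest case to reason about: a string with no repeated characters, whose suffix tree is just a root with one leaf edge per suffix. That restriction and the name `leaf_edge_costs` are assumptions of this example.

```python
def leaf_edge_costs(s):
    """Compare two ways of storing the edge labels of a suffix tree.

    Assumes all characters of s are distinct, so the tree is a root with
    one leaf edge per suffix.
    """
    m = len(s)
    assert len(set(s)) == m, "sketch only covers strings without repeats"
    explicit = {j: s[j:] for j in range(m)}         # labels as strings
    compressed = {j: (j, m - 1) for j in range(m)}  # labels as (p, q)
    # Both representations describe the same labels.
    assert all(s[p:q + 1] == explicit[j] for j, (p, q) in compressed.items())
    chars = sum(len(label) for label in explicit.values())
    ints = 2 * len(compressed)
    return chars, ints
```

With m distinct characters the explicit labels hold m(m+1)/2 characters in total, while the index pairs need 2m integers. In general a suffix tree with m leaves has fewer than 2m edges, so two integers per edge stay within O(m).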
53
Two more tricks...
55
Rule 3 is a show stopper
If rule 3 applies in extension j, it will also
apply in all further extensions until the end of
the phase.
Why?
  • When rule 3 applies, the path labeled S[j..i]
    must continue with character S(i+1), and
    so the path labeled S[j+1..i] does also,
    and rule 3 again applies in extensions
    j+1, ..., i+1.

56
Rule 3 is a show stopper
  • End any phase i+1 the first time rule 3 applies.
  • The remaining extensions are said to be done
    implicitly.

58
Once a leaf always a leaf
  • Leaf created -> always a leaf in all successive
    trees.
  • No mechanism for extending a leaf edge beyond
    its current leaf.
  • Once there is a leaf labeled j, extension rule 1
    will always apply to extension j in any
    successive phase.

Leaf edge label: (p, e), where e is a single global
index denoting the current end of the string.
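The (p, e) idea can be sketched with a shared mutable end index. `GlobalEnd` and `LeafEdge` are hypothetical names; the point is that one increment of e performs rule 1 for every existing leaf at once.

```python
class GlobalEnd:
    """Shared, mutable 'current end' index e used by all leaf edges."""

    def __init__(self, value):
        self.value = value


class LeafEdge:
    def __init__(self, start, end):
        self.start = start  # p: fixed start index into the text (0-based)
        self.end = end      # e: the shared GlobalEnd object, not a copy

    def label(self, text):
        return text[self.start:self.end.value + 1]


text = "xabxac"
e = GlobalEnd(2)                              # state after phase 3
leaves = [LeafEdge(p, e) for p in (0, 1, 2)]  # leaf edges of leaves 1 to 3
labels_before = [leaf.label(text) for leaf in leaves]
e.value += 1                                  # phase 4: one O(1) update
labels_after = [leaf.label(text) for leaf in leaves]
```

After the increment the three labels grow from xab, ab, b to xabx, abx, bx without touching any individual edge.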
59
Single Phase Algorithm
In each phase i
60
Single Phase Algorithm
During construction
61
Implicit to Explicit
One last phase to add the termination character $: O(m)
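Putting the pieces together, here is a compact sketch of a linear-time construction. It uses the common "active point" bookkeeping (active node, active edge, active length, count of pending suffixes), which organizes the same rules and tricks differently from the step-by-step extensions on the slides. All names are illustrative, and the terminator `$` is assumed not to occur in the text.

```python
class Node:
    def __init__(self, start, end):
        self.start = start   # edge label is s[start..end], inclusive
        self.end = end       # None on leaf edges: "the current end"
        self.children = {}   # first character -> child Node
        self.link = None     # suffix link (internal nodes only)


def build_suffix_tree(text, terminator="$"):
    """Active-point formulation of Ukkonen's algorithm (a sketch)."""
    s = text + terminator
    root = Node(-1, -1)
    node, edge, length = root, 0, 0  # active point
    remaining = 0                    # suffixes still to be inserted
    for i, c in enumerate(s):        # one phase per character
        remaining += 1
        last_new = None              # internal node awaiting its suffix link
        while remaining > 0:
            if length == 0:
                edge = i
            child = node.children.get(s[edge])
            if child is None:        # rule 2 at a node: new leaf
                node.children[s[edge]] = Node(i, None)
                if last_new is not None:
                    last_new.link = node
                    last_new = None
            else:
                end = i if child.end is None else child.end
                edge_len = end - child.start + 1
                if length >= edge_len:           # skip/count down one edge
                    edge += edge_len
                    length -= edge_len
                    node = child
                    continue
                if s[child.start + length] == c:  # rule 3: end the phase
                    if last_new is not None and node is not root:
                        last_new.link = node
                    length += 1
                    break
                # Rule 2 inside an edge: split it and add a new leaf.
                split = Node(child.start, child.start + length - 1)
                node.children[s[edge]] = split
                split.children[c] = Node(i, None)
                child.start += length
                split.children[s[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remaining -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remaining + 1
            elif node is not root:
                node = node.link or root          # follow the suffix link
    return root, s


def leaf_path_labels(root, s):
    """Spell out the path-label of every leaf, to check the result."""
    out, stack = [], [(root, "")]
    while stack:
        node, prefix = stack.pop()
        for child in node.children.values():
            end = len(s) - 1 if child.end is None else child.end
            label = prefix + s[child.start:end + 1]
            if child.children:
                stack.append((child, label))
            else:
                out.append(label)
    return sorted(out)
```

Leaf edges keep `end = None` and are read as ending at the current position, which is the (p, e) trick; breaking out of the loop on rule 3 is the "show stopper". A quick sanity check is that `leaf_path_labels` returns exactly the suffixes of the terminated string.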
62
Suffix Trees are a Swiss Knife
64
Applications
Exact String Matching
Preprocessing: O(m)
Search: O(n + k)
(m: text length, n: pattern length, k: number of
occurrences)
Three occurrences of string aw.
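A minimal sketch of the matching step, assuming a naive index instead of a real suffix tree: an uncompressed suffix trie in which every node records the start positions of the suffixes passing through it. Building it takes quadratic time and space, so only the query illustrates the O(n + k) behavior (n pattern characters walked, k positions reported). The text `awyawxawxz` and the function names are chosen for this example.

```python
def build_index(text):
    """Naive suffix trie; each node records where its suffixes start."""
    root = {"children": {}, "starts": []}
    for j in range(len(text)):
        node = root
        for ch in text[j:]:
            nxt = node["children"].get(ch)
            if nxt is None:
                nxt = {"children": {}, "starts": []}
                node["children"][ch] = nxt
            node = nxt
            node["starts"].append(j + 1)  # 1-based start position
    return root


def find_all(index, pattern):
    """Walk the pattern from the root, then report the stored positions."""
    node = index
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # pattern does not occur
    return node["starts"]


index = build_index("awyawxawxz")
occurrences = find_all(index, "aw")
```

Here `occurrences` holds the three start positions of aw in the example text.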
65
Applications
And much more...
  • Longest common substring: O(n)
  • Longest repeated substring: O(n)
  • Longest palindrome: O(n)
  • Most frequently occurring substrings of a
    minimum length: O(n)
  • Shortest substrings occurring only once: O(n)
  • Lempel-Ziv decomposition: O(n)
  • ...

66
Biology easily has 500 years of exciting
problems to work on.
Donald Knuth
67
web.ist.utl.pt/joao.carreira
Questions?