Building Suffix Trees in O(m) time - PowerPoint PPT Presentation

1 / 33
About This Presentation
Title:

Building Suffix Trees in O(m) time

Description:

Suffix Link Lemma ... Proof of Suffix Link Lemma. A new internal node is ... Lemma: For any suffix link (v, s(v)) traversed in Ukkonen's algorithm, at that ... – PowerPoint PPT presentation

Number of Views:216
Avg rating:3.0/5.0
Slides: 34
Provided by: erict9
Learn more at: http://www.cse.msu.edu
Category:
Tags: building | lemma | suffix | time | trees

less

Transcript and Presenter's Notes

Title: Building Suffix Trees in O(m) time


1
Building Suffix Trees in O(m) time
  • Weiner had first linear time algorithm in 1973
  • McCreight developed a more space efficient
    algorithm in 1976
  • Ukkonen developed a simpler to understand variant
    in 1995
  • This is what we will focus on

2
Implicit Suffix Trees
  • An implicit suffix tree for string S is a tree
    obtained from the suffix tree for S by
  • removing from all edge labels
  • removing any edges that now have no label
  • removing any node that does not still have at
    least two children
  • Some suffixes may no longer be leaves
  • An implicit suffix tree for prefix S1..i of S
    is similarly defined based on the suffix tree for
    S1..i
  • Ii will denote the implicit suffix tree for
    S1..i

3
Example
  • Implicit tree for xabxa from tree for xabxa
  • xabxa, abxa, bxa, xa, a,

b
x
a

x
a
6
a
b

x

b
5

x
a
a
4


3
2
1
4
Remove
  • Remove
  • xabxa, abxa, bxa, xa, a,

b
x
a

x
a
6
a
b

x

b
5

x
a
a
4


3
2
1
5
Remove after
  • Remove
  • xabxa, abxa, bxa, xa, a

b
x
a
x
a
6
a
b
x
b
5
x
a
a
4
3
2
1
6
Remove unlabeled edges
  • Remove unlabeled edges
  • xabxa, abxa, bxa, xa, a

b
x
a
x
a
6
a
b
x
b
5
x
a
a
4
3
2
1
7
Remove unlabeled edges
  • Remove unlabeled edges
  • xabxa, abxa, bxa, xa, a

b
x
a
x
a
a
b
x
b
x
a
a
3
2
1
8
Remove interior nodes
  • Remove internal nodes with only one child
  • xabxa, abxa, bxa, xa, a

b
x
a
x
a
a
b
x
b
x
a
a
3
2
1
9
Final implicit tree
  • Remove internal nodes with only one child
  • xabxa, abxa, bxa, xa, a

b
x
x
a
a
a
b
b
x
x
a
a
3
2
1
10
Basic Structure of Algorithm
  • Initialization
  • I1 has one edge labeled S(1)
  • For i 1 to m-1 (build Ii1)
  • For j 1 to i1
  • Find location of string Sj..i in tree
  • Extend to incorporate character S(i1)
  • Expand final implicit tree to make full suffix
    tree

11
Order of operations visualization
  • S xabxacdefghixabcab
  • i1 1
  • x
  • i1 2
  • extend x to xa
  • a
  • i1 3
  • extend xa to xab
  • extend a to ab
  • b

12
Extension Rules
  • Case 1 Sj..i ends at a leaf
  • Add character S(i1) to end of label on leaf edge
  • Case 2 Not a leaf, but no path from end of
    Sj..i location continues with S(i1)
  • Split a new leaf edge for character S(i1)
  • May need to create an internal node if Sj..i
    ends in the middle of an edge
  • Case 3 Sj..i1 is already in the tree
  • No update

13
Visualization
  • Implicit tree for axabxb from tree for axabx

b
Rule 1 at a leaf node
Rule 3 already in tree
Rule 2 add a leaf edge (and an interior node)
14
Observations
  • Once Sj..i is located in the tree, extending to
    accommodate S(i1) is constant time
  • Making Ukkonens algorithm O(m2)
  • Finding the Sj..i locations in the suffix trees
    quickly when explicit computation is needed

15
Edge Label Representation
  • Potential Problem
  • Size of edge labels may be W(m2)
  • Example
  • S abcdefghijklmnopqrstuvwxyz
  • Total length is Sjltm1 j m(m1)/2
  • Similar problem can happen when the length of the
    string is arbitrarily larger than the alphabet
    size
  • Solution
  • Label edges with pair of indices indicating
    beginning and end of positions of the substring
    in S

16
Full edge label illustration
  • String S xabxa

b
x
a

x
a
6
a
b

x

b
5

x
a
a
4


3
2
1
17
Compact edge label illustration
  • String S xabxa

(1,2) or (4,5)?
(2,2)
(6,6)
(3,6)
6
(6,6)
5
(6,6)
(3,6)
(3,6)
4
3
2
1
18
Modified Extension Rules
  • Rule 2 new leaf edge
  • label new leaf edge (i1, i1)
  • Rule 1 leaf edge extension
  • label had to be (p,i) before extension
  • given rule 2 above and an induction argument
  • now will be (p, i1)
  • Rule 3 still nothing needs to be done

19
Suffix Links
  • Consider the two strings a and xa
  • Suppose some internal node v of the tree is
    labeled with xa and another node s(v) in the tree
    is labeled with a
  • Then the edge (v,s(v)) is a suffix link
  • Do all internal nodes (the root is not considered
    an internal node) have suffix links?

20
Suffix Link Lemma
  • If a new internal node v with path-label xa is
    added to the current tree in extension j of some
    phase i1, then
  • the path labeled a already exists at an internal
    node of the tree or
  • the internal node labeled a will be created in
    the extension of j1 or
  • string a is empty and s(v) is the root

21
Proof of Suffix Link Lemma
  • A new internal node is created only by extension
    rule 2
  • This means there are two distinct suffixes of
    S1..i1 that start with xa
  • xaS(i1) and xacb where c is not S(i1)
  • This means there are two distinct suffixes of
    S1..i1 that start with a
  • aS(i1) and acb where c is not S(i1)
  • Thus, if a is not empty, a will label an internal
    node once extension j1 is processed which is the
    extension of a

22
Using suffix links to speed up location of Sj..i
  • S1..i must end at a leaf since it is the
    longest string in implicit tree Ii
  • Keep a pointer to this leaf in all cases and
    extend according to rule 1
  • Locating Sj1..i from Sj..i which is at node
    w
  • If w is an internal node, set v to w
  • Otherwise, set v parent(w)
  • If v is the root, you must traverse from root to
    find Sj1..i
  • If not, go to s(v) and begin search for remaining
    portion of Sj..i from there
  • Remaining portion is the label of the edge we
    traversed up
  • (see figure 6.5 on page 100)

23
Skip/count Trick
  • Problem Moving down from s(v) naively takes time
    proportional to the number of characters compared
  • Solution
  • At each node, only compare the first character in
    an edge label to the next character to be checked
  • Then, use the number of characters on that edge
    to update search in constant time
  • Running time is now proportional to the number of
    nodes in the path searched rather than the number
    of characters
  • See Figure 6.6 on page 102

24
O(m2) argument
  • node-depth of v number of nodes on path from
    root to node v
  • Lemma For any suffix link (v, s(v)) traversed in
    Ukkonens algorithm, at that moment, nd(v) lt
    nd(s(v))1
  • If xb is an ancestral internal node of v where b
    is not empty, then it has a suffix link to a node
    with path-label b
  • See Figure 6.7 on page 103

25
O(m2) argument
  • Lemma Any phase takes O(m) time with skip/count
    trick
  • Proof
  • Decrements to node depth at most 2m
  • i1 lt m extensions per phase
  • Walking up decreases node depth at most 1
  • Suffix link traversal decreases node depth at
    most 1
  • At most 3m downward edge traversal
  • Max node depth is m
  • None are negative
  • Each downward traversal increases depth by at
    least 1

26
Observation
  • Making Ukkonens algorithm O(m)
  • Implicit computations of many extensions
  • Need to take argument for a single phase and
    extend to multiple phases

27
Rule 3
  • Suppose suffix extension rule 3 applies to
    Sj..i1.
  • This means Sj..i1 already appears in the
    implicit suffix tree as a prefix of a larger
    suffix (and is thus a substring of S1..i)
  • Then it applies to Sk..i1 for k gt j.
  • Clearly, Sk..i1 must also be a substring of
    S1..i and thus must be in the tree
  • Thus, stop a phase once the first application of
    rule 3 occurs
  • All future rule 3 extensions are done implicitly,
    not explicitly

28
Implicit expansion of leaf nodes
  • Once a leaf, always a leaf
  • It will always be extended using rule 1
  • Implicit expansion of leaf nodes
  • When a leaf edge is created in phase i1, instead
    of labeling it with (p, i1), label it with (p,e)
  • e is a global index that is set to i1 once in
    each phase
  • In later phases, we will not need to explicitly
    extend this leaf but rather can implicitly extend
    it by incrementing e once in its global location
  • How can we easily identify leaf nodes to avoid
    explicitly expanding them in later phases?

29
Avoiding leaf nodes
  • For phase i, let last(i) denote the last
    extension of phase i that is not by rule 3
  • Observation
  • All the suffixes Sj..i for 1 lt j lt last(i)
    end at a leaf node
  • All the extensions for 1 lt j lt last(i) in phase
    i1 and higher are rule 1 extensions by previous
    slide
  • Therefore, last(i1) gt last(i)
  • Trick
  • In phase i1, only explicitly compute extensions
    for last(i)1 up till first rule 3 extension is
    found

30
Single phase algorithm
  • Phase i1
  • Increment e to i1 (implicitly extending all
    existing leaves)
  • Explicitly compute successive extensions starting
    at last(i)1 and continuing until a rule 3
    extension or no more extensions needed
  • Exact location is known given next step in
    previous phase
  • Set last(i1) appropriately (last rule 1 or 2
    extension)
  • Observation
  • Phase i and i1 share at most 1 explicit
    extension
  • In all phases, there will be at most 2m extensions

31
Visualization of explicit extensions
  • S xabxacdefghixabcab
  • Phase 1 extend x (rule 1)
  • Phase 2 extend a (rule 2)
  • Phase 3 extend b (rule 2)
  • Phase 4 extend x (rule 3)
  • Phase 5 extend x (rule 3)
  • Phase 6 extend x (rule 2), extend a (rule 2)
    extend c (rule 2)
  • Phase 7 extend d (rule 2)

32
O(m) argument
  • Lemma All phases take O(m) time
  • Proof
  • Similar to previous node-depth argument
  • Key new observation
  • From one phase to the next, we either start at
    root or we work with same node we ended with last
    time so the node depth is identical
  • With 2m total extensions at most, we get O(m)
    upward and downward traversals of links

33
Finishing up
  • Convert final implicit suffix tree to a true
    suffix tree
  • Add using just another phase of execution
  • Now all suffixes will be leaves
  • Replace e in every leaf edge with m
  • Just requires a traversal of tree which is O(m)
    time
Write a Comment
User Comments (0)
About PowerShow.com