1
April 2003
Suffix Trees
Pavel Shvaiko
2
Outline
  • Introduction
  • Suffix Trees (ST)
  • Building STs in linear time: Ukkonen's algorithm
  • Applications of ST

3
  • Introduction

4
Substrings
  • String is any sequence of characters.
  • Substring of string S is a string composed of
    characters i through j, i ≤ j, of S.
  • S = cater: ate is a substring.
  • car is not a substring.
  • Empty string is a substring of S.

5
Subsequences
  • Subsequence of string S is a string composed of
    characters i_1 < i_2 < ... < i_k of S.
  • S = cater: ate is a subsequence.
  • car is a subsequence.
  • Empty string is a subsequence of S.
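
The two definitions can be checked mechanically. The following is a minimal sketch, not part of the original slides; the function names `is_substring` and `is_subsequence` are assumptions made for illustration.

```python
def is_substring(p: str, s: str) -> bool:
    """True if p equals characters i through j of s for some i <= j.

    The empty string counts as a substring of every string.
    """
    return any(s.startswith(p, i) for i in range(len(s) + 1))


def is_subsequence(p: str, s: str) -> bool:
    """True if the characters of p appear in s in order, not necessarily adjacent."""
    remaining = iter(s)
    # `ch in remaining` consumes the iterator up to the first match,
    # so later characters of p must be found strictly further right in s.
    return all(ch in remaining for ch in p)


if __name__ == "__main__":
    print(is_substring("ate", "cater"), is_substring("car", "cater"))
    print(is_subsequence("ate", "cater"), is_subsequence("car", "cater"))
```

With S = cater this reproduces the slide examples: "car" is a subsequence but not a substring.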

6
String/Pattern Matching - I
  • You are given a source string S.
  • Suppose we have to answer queries of the form: is
    the string p_i a substring of S?
  • Knuth-Morris-Pratt (KMP) string matching.
  • O(|S| + |p_i|) time per query.
  • O(n|S| + Σ_i |p_i|) time for n queries.
  • Suffix tree solution.
  • O(|S| + Σ_i |p_i|) time for n queries.

7
String/Pattern Matching - II
  • KMP preprocesses the query string p_i, whereas the
    suffix tree method preprocesses the source string
    (text) S.
  • The suffix tree for a text of length m is built in
    O(m) time during a preprocessing stage; thereafter,
    whenever a string of length n is input, the
    algorithm searches for it in O(n) time using that
    suffix tree.
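
For comparison with the suffix tree method, here is a minimal sketch of the KMP approach, which preprocesses the query string rather than the text. It is an illustration only; the names `failure_function` and `kmp_find_all` are assumptions, and positions are 0-based here, unlike the 1-based positions used in the slides.

```python
def failure_function(p: str) -> list[int]:
    """fail[i] = length of the longest proper prefix of p[:i+1] that is also its suffix."""
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k > 0 and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_find_all(s: str, p: str) -> list[int]:
    """Return the 0-based start positions of p in s in O(|s| + |p|) time."""
    if not p:
        return list(range(len(s) + 1))
    fail = failure_function(p)      # preprocessing of the query string
    hits, k = [], 0
    for i, ch in enumerate(s):      # single left-to-right scan of the text
        while k > 0 and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            hits.append(i - k + 1)
            k = fail[k - 1]
    return hits


if __name__ == "__main__":
    print(kmp_find_all("xabxac", "xa"))
```

Every query rescans the whole text, which is where the n|S| term in the KMP bound for n queries comes from.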

8
String Matching: Prefixes and Suffixes
  • Substrings of S beginning at the first position
    of S are called prefixes of S, and substrings
    that end at its last position are called suffixes
    of S.
  • S = AACTAG
  • Prefixes: AACTAG, AACTA, AACT, AAC, AA, A
  • Suffixes: AACTAG, ACTAG, CTAG, TAG, AG, G
  • p_i is a substring of S iff p_i is a prefix of some
    suffix of S.
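
The prefix-of-a-suffix characterization is the observation that makes suffix trees work, and it can be sketched directly. The helper names below are assumptions for illustration, and the quadratic-time check is only meant to mirror the definition.

```python
def prefixes(s: str) -> list[str]:
    """Non-empty prefixes of s, longest first."""
    return [s[:j] for j in range(len(s), 0, -1)]


def suffixes(s: str) -> list[str]:
    """Non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def is_prefix_of_some_suffix(p: str, s: str) -> bool:
    """True iff p is a prefix of at least one suffix of s."""
    return any(suffix.startswith(p) for suffix in suffixes(s))


if __name__ == "__main__":
    s = "AACTAG"
    print(prefixes(s))
    print(suffixes(s))
    # The characterization agrees with Python's built-in substring test.
    print(is_prefix_of_some_suffix("CTA", s), "CTA" in s)
```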

9
  • Suffix Trees

10
Definition: Suffix Tree (ST) T for S of length m
  • 1. A rooted tree with m leaves numbered from 1 to
    m.
  • 2. Each internal node, excluding the root, of T
    has at least 2 children.
  • 3. Each edge of T is labeled with a nonempty
    substring of S.
  • 4. No two edges out of a node can have
    edge-labels starting with the same character.
  • 5. For any leaf i, the concatenation of the
    edge-labels on the path from the root to leaf i
    exactly spells out the suffix of S, namely
    S[i..m], that starts at position i.
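
To make the five conditions concrete, the sketch below builds a suffix tree naively by inserting one suffix at a time into a compressed trie. This is a quadratic-time illustration, not the linear-time construction discussed later. The representation (a node is a dict mapping a first character to a `[label, child, leaf_number]` list) and the function names are assumptions, and the code assumes the last character of the string occurs nowhere else in it.

```python
def build_naive_suffix_tree(s: str) -> dict:
    """Insert suffixes 1..m of s one by one; O(m^2) time overall."""
    assert s and s.count(s[-1]) == 1, "last character must be unique in s"
    root: dict = {}
    for i in range(len(s)):
        node, rest = root, s[i:]
        while True:
            edge = node.get(rest[0])
            if edge is None:
                # No edge starts with this character: add a new leaf edge.
                node[rest[0]] = [rest, {}, i + 1]
                break
            label, child, leaf = edge
            k = 0
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k == len(label):
                # Whole edge label matched: continue below the child node.
                node, rest = child, rest[k:]
                continue
            # Mismatch inside the edge: split it and hang a new leaf there.
            middle = {label[k]: [label[k:], child, leaf],
                      rest[k]: [rest[k:], {}, i + 1]}
            node[rest[0]] = [label[:k], middle, None]
            break
    return root


def leaf_paths(node: dict, prefix: str = "") -> dict:
    """Map each leaf number to the string spelled on its root-to-leaf path."""
    out = {}
    for label, child, leaf in node.values():
        if leaf is not None:
            out[leaf] = prefix + label
        else:
            out.update(leaf_paths(child, prefix + label))
    return out


if __name__ == "__main__":
    tree = build_naive_suffix_tree("xabxac")
    for number, path in sorted(leaf_paths(tree).items()):
        print(number, path)
```

Condition 5 is what `leaf_paths` checks: the path to leaf i spells exactly the suffix starting at position i.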

11
Example: Suffix Tree for S = xabxac
12
Existence of a Suffix Tree for S
  • If one suffix S_j of S matches a prefix of another
    suffix S_i of S, then the path for S_j would not
    end at a leaf.
  • S = xabxa
  • S_1 = xabxa and S_4 = xa
  • How to avoid this problem?
  • Assume that the last character of S appears
    nowhere else in S.
  • Add a new character $ not in the alphabet to the
    end of S.
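
The problem and its fix can be observed with a small check. This is a sketch for illustration; the function name `nested_suffixes` is an assumption, and `$` is used as the added end character on the assumption that it does not occur in the alphabet.

```python
def nested_suffixes(s: str) -> list[tuple[int, int]]:
    """Return 1-based pairs (j, i) such that suffix S_j is a prefix of suffix S_i, j != i."""
    m = len(s)
    return [(j, i)
            for j in range(1, m + 1)
            for i in range(1, m + 1)
            if i != j and s[i - 1:].startswith(s[j - 1:])]


if __name__ == "__main__":
    print(nested_suffixes("xabxa"))    # some suffixes would not end at leaves
    print(nested_suffixes("xabxa$"))   # the terminator removes every such pair
```

For xabxa the pairs are (4, 1) and (5, 2): xa is a prefix of xabxa, and a is a prefix of abxa. After appending the terminator no suffix is a prefix of another, so every suffix ends at its own leaf.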

13
Example: Suffix Tree for S = xabxac
14
  • Building STs in linear time
  • Ukkonen's algorithm

15
Building STs in linear time
  • Weiner's algorithm [FOCS, 1973]
  • Called "the algorithm of 1973" by Knuth
  • First linear-time algorithm, but uses much space
  • McCreight's algorithm [JACM, 1976]
  • Linear time and linear space (more
    space-economical than Weiner's algorithm)
  • More readable
  • Ukkonen's algorithm [Algorithmica, 1995]
  • Linear-time algorithm that uses less space
  • This is what we will focus on

16
Implicit Suffix Trees
  • Ukkonen's algorithm constructs a sequence of
    implicit STs, the last of which is converted to a
    true ST of the given string.
  • An implicit suffix tree for string S is a tree
    obtained from the suffix tree for S$ by
  • removing every copy of $ from all edge labels
  • removing any edges that now have no label
  • removing any node that does not still have at
    least two children
  • An implicit suffix tree for prefix S[1..i] of S is
    similarly defined based on the suffix tree for
    S[1..i]$.
  • I_i will denote the implicit suffix tree for
    S[1..i].
  • Each suffix is in the tree, but may not end at a
    leaf.

17
Example: Construction of the Implicit ST
  • Implicit tree for xabxa from tree for xabxa$
  • xabxa$, abxa$, bxa$, xa$, a$, $
[Figure: suffix tree for xabxa$ with leaves numbered 1 to 6]
18
Construction of the Implicit ST: Remove $
  • Remove $
  • xabxa$, abxa$, bxa$, xa$, a$, $
[Figure: suffix tree for xabxa$ with leaves numbered 1 to 6]
19
Construction of the Implicit ST: After the Removal of $
  • Remove $
  • xabxa, abxa, bxa, xa, a
[Figure: the same tree with $ removed from every edge label, leaves numbered 1 to 6]
20
Construction of the Implicit ST: Remove Unlabeled Edges
  • Remove unlabeled edges
  • xabxa, abxa, bxa, xa, a
[Figure: the tree before unlabeled edges are removed, leaves numbered 1 to 6]
21
Construction of the Implicit ST: After the Removal of Unlabeled Edges
  • Remove unlabeled edges
  • xabxa, abxa, bxa, xa, a
[Figure: the tree after unlabeled edges are removed, leaves numbered 1 to 3]
22
Construction of the Implicit ST: Remove Interior Nodes
  • Remove internal nodes with only one child
  • xabxa, abxa, bxa, xa, a
[Figure: the tree before single-child internal nodes are removed, leaves numbered 1 to 3]
23
Construction of the Implicit ST: Final Implicit Tree
  • Remove internal nodes with only one child
  • xabxa, abxa, bxa, xa, a
[Figure: final implicit suffix tree for xabxa, leaves numbered 1 to 3]
24
Ukkonen's Algorithm (UA)
  • I_i is the implicit suffix tree of the string
    S[1..i]
  • Construct I_1
  • /* Construct I_{i+1} from I_i */
  • for i = 1 to m-1 do /* phase i+1 */
  • for j = 1 to i+1 do /* extension j */
  • Find the end of the path P from the root whose
    label is S[j..i] in I_i and extend P with S(i+1)
    by suffix extension rules
  • Convert I_m into a suffix tree for S

25
Example
  • S = xabxacd
  • i+1 = 1
  • x
  • i+1 = 2
  • extend x to xa
  • a
  • i+1 = 3
  • extend xa to xab
  • extend a to ab
  • b

26
Extension Rules
  • Goal: extend each S[j..i] into S[j..i+1]
  • Rule 1: S[j..i] ends at a leaf
  • Add character S(i+1) to the end of the label on
    that leaf edge
  • Rule 2: S[j..i] doesn't end at a leaf, and the
    following character is not S(i+1)
  • Split off a new leaf edge for character S(i+1)
  • May need to create an internal node if S[j..i]
    ends in the middle of an edge
  • Rule 3: S[j..i+1] is already in the tree
  • No update
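
The three rules can be traced with a deliberately naive sketch that walks from the root in every extension. To keep it short it uses an uncompressed trie (one character per edge) instead of a compressed implicit suffix tree, so rule 2 never needs to split an edge. This simplification and the function name `extension_rules` are assumptions made for illustration; which rule fires in each extension is unaffected by the simplification.

```python
def extension_rules(s: str) -> list[list[int]]:
    """For each phase i+1 = 1..m, list the rule applied by extensions 1..i+1."""
    root: dict = {}
    phases = []
    for i in range(len(s)):             # phase i+1 appends character S(i+1) = s[i]
        c, rules = s[i], []
        for j in range(i + 1):          # extension j+1 handles S[j+1..i] = s[j:i]
            node = root
            for ch in s[j:i]:           # naive walk from the root
                node = node[ch]
            if c in node:
                rules.append(3)         # S[j+1..i+1] is already in the tree
            elif node is not root and not node:
                node[c] = {}            # path ended at a leaf: extend it
                rules.append(1)
            else:
                node[c] = {}            # branch off a new leaf
                rules.append(2)
        phases.append(rules)
    return phases


if __name__ == "__main__":
    for phase, rules in enumerate(extension_rules("axaxbb"), start=1):
        print(phase, rules)
```

Each phase does up to i+1 walks of length up to i, which is the source of the cubic bound for the naive approach. In the output, once rule 3 appears in a phase, every later extension of that phase is also rule 3.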

27
Example: Extension Rules
  • Implicit tree for axabxb from tree for axabx

[Figure: implicit suffix tree for axabxb, annotated as follows]
Rule 1: at a leaf node
Rule 3: already in tree
Rule 2: add a leaf edge (and an interior node)
28
UA for axabxc (1)
29
UA for axabxc (2)
30
UA for axabxc (3)
31
UA for axabxc (4)
32
Observations
  • Once S[j..i] is located in the tree, extension
    rules take only constant time
  • Naively we could find the end of any suffix
    S[j..i] in O(|S[j..i]|) time by walking from the
    root of the current tree. By that approach, I_m
    could be created in O(m^3) time.
  • Making Ukkonen's algorithm O(m)
  • Suffix links
  • Skip and count trick
  • Edge-label compression
  • A stopper
  • Once a leaf, always a leaf

33
Suffix Links
  • Consider the two strings α and xα, where x is a
    single character and α is a (possibly empty) string
  • Suppose some internal node v of the tree is
    labeled with xα and another node s(v) in the tree
    is labeled with α
  • The edge (v, s(v)) is called a suffix link
  • Do all internal nodes (the root is not considered
    an internal node) have suffix links?

34
Example: suffix links
  • S = ACACACAC

35
Suffix Link Lemma
  • If a new internal node v with path-label xα is
    added to the current tree in extension j of some
    phase i+1, then either
  • the path labeled α already ends at an internal
    node of the tree, or
  • the internal node labeled α will be created in
    extension j+1 of the same phase i+1, or
  • string α is empty and s(v) is the root

36
Proof of Suffix Link Lemma
  • A new internal node is created only by
    extension rule 2
  • This means that there are two distinct suffixes
    of S[1..i+1] that start with xα
  • xαS(i+1) and xαcβ, where c is not S(i+1)
  • This means that there are two distinct suffixes
    of S[1..i+1] that start with α
  • αS(i+1) and αcβ, where c is not S(i+1)
  • Thus, if α is not empty, α will label an internal
    node once extension j+1 is processed, which is the
    extension of α

37
Corollary of Suffix Link Lemma
  • Every internal node of an implicit suffix tree
    has a suffix link from it.

38
How to use suffix links - 1
  • S[1..i] must end at a leaf since it is the longest
    string in the implicit tree I_i
  • Keep a pointer to this leaf in all cases and
    extend according to rule 1
  • Locating S[j+1..i] from S[j..i], which is at node w
  • If w is an internal node, set v to w
  • Otherwise, set v = parent(w)
  • If v is the root, you must traverse from the root
    to find S[j+1..i]
  • If not, go to s(v) and begin the search for the
    remaining portion of S[j..i] from there

39
How to use suffix links - 2
40
Skip and Count Trick (1)
  • Problem: Moving down from s(v), directly
    implemented, takes time proportional to the
    number of characters compared
  • Solution: Make the running time proportional to
    the number of nodes on the path searched, instead
    of the number of characters

41
Skip and Count Trick (2)
  • After 4 node skips down the tree, the end of
    S[j..i] is found.

42
Skip and Count Trick (3)
  • Node-depth of v, denoted ND(v), is the number
    of nodes on the path from the root to the node v
  • Lemma: For any suffix link (v, s(v)) traversed in
    Ukkonen's algorithm, at that moment, ND(v) ≤
    ND(s(v)) + 1

43
Skip and Count Trick (4)
  • At the moment of traversing (v, s(v)): ND(v) ≤
    ND(s(v)) + 1

44
Skip and Count Trick (5)
  • The current node-depth of the algorithm is the
    node depth of the node most recently visited by
    the algorithm
  • Lemma: Using the skip and count trick, any phase
    of Ukkonen's algorithm takes O(m) time.
  • Up-walk decreases the current node-depth by ≤ 1
  • Suffix link traversal: same as up-walk
  • In total, the current node-depth is decreased by ≤
    2m.
  • No node has depth > m.
  • The total possible increment to the current
    node-depth is ≤ 3m.

45
Edge Label Representation
  • Potential Problem
  • Size of edge labels may require Ω(m^2) space
  • Thus, the time for the algorithm is at least as
    large as the size of its output
  • Example
  • S = abcdefghijklmnopqrstuvwxyz
  • Total length is Σ_{j<m+1} j = m(m+1)/2
  • A similar problem can happen when the length of
    the string is arbitrarily larger than the alphabet
    size
  • Solution
  • Label edges with a pair of indices indicating
    the beginning and end positions of the substring
    in S
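
The space argument can be made concrete with a short sketch. When all characters of S are distinct, the suffix tree is a root with one leaf edge per suffix, so explicit labels store m(m+1)/2 characters in total, while index pairs store two integers per edge. The function names below are assumptions for illustration; index pairs are 1-based and inclusive, as in the slides.

```python
import string


def explicit_label_size(s: str) -> int:
    """Characters stored when each leaf edge keeps its whole suffix as a string.

    Assumes all characters of s are distinct, so every suffix hangs directly off the root.
    """
    assert len(set(s)) == len(s), "all characters must be distinct"
    return sum(len(s[i:]) for i in range(len(s)))


def index_pair_labels(s: str) -> list[tuple[int, int]]:
    """One (begin, end) pair per leaf edge for the same tree."""
    m = len(s)
    return [(i, m) for i in range(1, m + 1)]


def label_text(s: str, pair: tuple[int, int]) -> str:
    """Recover the edge label from its index pair."""
    begin, end = pair
    return s[begin - 1:end]


if __name__ == "__main__":
    s = string.ascii_lowercase
    m = len(s)
    print(explicit_label_size(s), m * (m + 1) // 2)
    print(len(index_pair_labels(s)), "pairs, two integers each")
    print(label_text("xabxa$", (3, 6)))
```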

46
Modified Extension Rules
  • Rule 2: new leaf edge (phase i+1)
  • create edge (i+1, i+1)
  • split edge (p, q) => (p, w) and (w+1, q)
  • Rule 1: leaf edge extension
  • label had to be (p, i) before extension
  • given rule 2 above and an induction argument
  • (p, q) => (p, q+1)
  • Rule 3
  • Do nothing

47
Full edge label representation
  • String S = xabxa$
[Figure: suffix tree for xabxa$ with full substring labels on the edges, leaves numbered 1 to 6]
48
Edge-label Compression
  • String S = xabxa$
[Figure: the same tree with each edge label replaced by an index pair: (1,2) or (4,5)? for xa, (2,2) for a, (3,6) for bxa$, (6,6) for $; leaves numbered 1 to 6]
49
A Stopper
  • In any phase, if suffix extension rule 3 applies
    in extension j, it will also apply in all
    extensions k, where k > j, until the end of the
    phase.
  • The extensions in phase i+1 that are done after
    the first execution of rule 3 are said to be done
    implicitly. This is in contrast to any extension
    j where the end of S[j..i] is explicitly found.
    An extension of that kind is called an explicit
    extension.
  • Hence, we can end any phase i+1 as soon as
    extension rule 3 first applies.

50
Once a leaf, always a leaf (1)
  • If at some point in UA a leaf is created and
    labeled j (for the suffix starting at position j
    of S), then that leaf will remain a leaf in all
    successive trees created during the algorithm.
  • In any phase i, there is an initial sequence of
    consecutive extensions (starting with extension
    1) in which only rule 1 or 2 applies; let
    j_i be the last extension in this sequence.
  • Note that j_i ≤ j_{i+1}.

51
Once a leaf, always a leaf (2)
  • Implicit extensions: for extensions 1 to j_i,
    write (p, e) on the leaf edge, where e is a
    symbol denoting the current end and is set to
    i+1 once at the beginning of the phase. In later
    phases, we will not need to explicitly extend
    this leaf but rather can implicitly extend it by
    incrementing e once in its global location.
  • Explicit extensions: from extension j_i + 1 until
    the first rule 3 extension is found (or until
    extension i+1 is done)

52
Once a leaf, always a leaf (3): Single phase
algorithm
  • j is the last explicit extension computed in
    phase i+1
  • Phase i+1
  • Increment e to i+1 (implicitly extending all
    existing leaves)
  • Explicitly compute successive extensions starting
    at j_i + 1 and continuing until reaching the first
    extension j where rule 3 applies or no more
    extensions are needed
  • Set j_{i+1} to j - 1, to prepare for the next phase
  • Observation
  • Phases i and i+1 share at most 1 explicit extension

53
Example: S = axaxbb$ - (1)
  • e = 1, a
  • j_1 = 1
  • e = 2, ax
  • S[1..2]: skip
  • S[2..2]: rule 2, create (2, e)
  • j_2 = 2
  • e = 3, axa
  • S[1..3] .. S[2..3]: skip
  • S[3..3]: rule 3
  • j_3 = 2

54
Example: S = axaxbb$ - (2)
  • e = 4, axax
  • S[1..4] .. S[2..4]: skip
  • S[3..4]: rule 3
  • S[4..4]: auto skip
  • j_4 = 2
  • e = 5, axaxb
  • S[1..5] .. S[2..5]: skip
  • S[3..5]: rule 2, split (1,e) => (1,2) and (3,e),
    create (5,e)
  • S[4..5]: rule 2, split (2,e) => (2,2) and (3,e),
    create (5,e)
  • S[5..5]: rule 2, create (5,e)
  • j_5 = 5

55
Example: S = axaxbb$ - (3)
  • e = 6, axaxbb
  • S[1..6] .. S[5..6]: skip
  • S[6..6]: rule 3
  • j_6 = 5
  • e = 7, axaxbb$
  • S[1..7] .. S[5..7]: skip
  • S[6..7]: rule 2, split (5,e) => (5,5) and (6,e),
    create (7,e)
  • S[7..7]: rule 2, create (7,e)
  • j_7 = 7

56
Complexity of UA
  • Since the cost of all the implicit extensions in
    any phase is constant, their total cost is O(m).
  • In total, at most 2m explicit extensions are
    executed.
  • The maximum number of down-walking skips is O(m).
  • Time complexity of Ukkonen's algorithm: O(m)

57
Finishing up
  • Convert the final implicit suffix tree to a true
    suffix tree
  • Add $ using just one more phase of execution
  • Now all suffixes will end at leaves
  • Replace e in every leaf edge with m
  • This just requires a traversal of the tree, which
    takes O(m) time
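
The pieces described in these slides (index-pair edge labels, suffix links, skip and count, the rule 3 stopper, and the shared end e for leaf edges) can be combined in a compact sketch. The version below uses the common "active point" formulation of Ukkonen's algorithm, which is an equivalent bookkeeping scheme rather than the exact presentation in the slides. All names (`Node`, `build_suffix_tree`, `contains`, `count_leaves`) are assumptions, indices are 0-based with exclusive ends, and the input is assumed to end with a character that occurs nowhere else, so no final conversion pass is needed.

```python
class Node:
    def __init__(self, start, end, link=None):
        self.start = start      # incoming edge label is text[start:end]
        self.end = end          # None marks a leaf edge that ends at the global end e
        self.children = {}      # first character of edge label -> child node
        self.link = link        # suffix link (set for internal nodes only)


def build_suffix_tree(text):
    """Build the suffix tree of text, whose last character must be unique."""
    root = Node(0, 0)
    root.link = root
    node, edge, length = root, 0, 0   # active point: node, edge start index, offset
    remainder = 0                     # suffixes still waiting for an explicit extension
    for pos, ch in enumerate(text):   # phase pos+1; open leaf edges grow implicitly
        remainder += 1
        last_new = None               # internal node created in the previous extension
        while remainder > 0:
            if length == 0:
                edge = pos
            child = node.children.get(text[edge])
            if child is None:
                node.children[text[edge]] = Node(pos, None)   # rule 2, no split
                if last_new is not None:
                    last_new.link, last_new = node, None
            else:
                end = pos + 1 if child.end is None else child.end
                if length >= end - child.start:               # skip and count
                    node, edge = child, edge + end - child.start
                    length -= end - child.start
                    continue
                if text[child.start + length] == ch:          # rule 3: stop the phase
                    length += 1
                    if last_new is not None:
                        last_new.link = node
                    break
                split = Node(child.start, child.start + length, root)  # rule 2 with split
                node.children[text[edge]] = split
                split.children[ch] = Node(pos, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_new is not None:
                    last_new.link = split
                last_new = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = pos - remainder + 1
            elif node is not root:
                node = node.link                              # follow the suffix link
    return root


def contains(root, text, pattern):
    """True if pattern labels a path from the root, i.e. is a substring of text."""
    node, i = root, 0
    while i < len(pattern):
        child = node.children.get(pattern[i])
        if child is None:
            return False
        label = text[child.start:len(text) if child.end is None else child.end]
        k = min(len(label), len(pattern) - i)
        if label[:k] != pattern[i:i + k]:
            return False
        node, i = child, i + k
    return True


def count_leaves(node):
    return 1 if not node.children else sum(count_leaves(c) for c in node.children.values())


if __name__ == "__main__":
    text = "xabxac$"
    tree = build_suffix_tree(text)
    print(count_leaves(tree), contains(tree, text, "bxa"), contains(tree, text, "car"))
```

Leaf edges store `None` as their end so that one global end serves all of them, which is the "once a leaf, always a leaf" idea. Dictionaries are used for child lookup as one of the branch representations a practical implementation might choose.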

58
Implementation Issues (1)
  • When the size of the alphabet grows
  • For large trees, suffix links allow an algorithm
    to move quickly from one part of the tree to a
    distant part of the tree. This is great for
    worst-case time bounds, but it's horrible if the
    tree isn't entirely in memory.
  • Thus, implementing ST to reduce practical space
    use can be a serious concern.
  • The main design issues are how to represent and
    search the branches out of the nodes of the tree.
  • A practical design must balance between
    constraints of space and need for speed

59
Implementation Issues (2)
  • There are four basic choices to represent
    branches
  • An array of size Θ(|Σ|) at each non-leaf node v
  • A linked list at node v of characters that appear
    at the beginning of the edge-labels out of v.
  • If it's kept in sorted order, it reduces the
    average time to search for a given character
  • In the worst case, it adds time |Σ| to every node
    operation. If the number of children of v is
    large, then little space is saved over the array
    while noticeably degrading performance
  • A balanced tree implements the list at node v
  • Additions and searches take O(log k) time and O(k)
    space, where k is the number of children of v.
    This alternative makes sense only when k is
    fairly large.
  • A hashing scheme. The challenge is to find a
    scheme balancing space with speed. For large
    trees and alphabets hashing is very attractive at
    least for some of the nodes

60
Implementation Issues (3)
  • When m and |Σ| are large enough, the best design is
    probably a mixture of the above choices.
  • Nodes near the root of the tree tend to have the
    most children, so arrays are a sensible choice at
    those nodes.
  • For nodes in the middle of a suffix tree, hashing
    or balanced trees may be the best choice.
  • Sometimes the alphabet size is explicitly
    presented in the time and space bounds
  • Construction time is O(m log|Σ|), using Θ(m|Σ|)
    space, where m is the size of the string.

61
  • Applications of Suffix Trees

62
Applications of Suffix Trees
  • Exact string matching
  • Substring problem for a database
  • Longest common substring
  • Suffix arrays

63
Exact String Matching (1)
  • Exact matching problem: given a pattern P of
    length n and a text T of length m, find all
    occurrences of P in T in O(n+m) time.
  • Overview of the ST approach
  • Build a ST for text T in O(m) time
  • Match the characters of P along the unique path
    in ST until either P is exhausted or no more
    matches are possible
  • In the latter case, P doesn't appear anywhere in
    T
  • In the former case, every leaf in the subtree
    below the point of the last match is numbered
    with a starting location of P in T, and every
    starting location of P in T numbers such a leaf
  • ST approach spends O(m) preprocessing time and
    then O(n+k) search time, where k is the number of
    occurrences of P in T

64
Exact String Matching (2)
  • When the search terminates at an element (leaf)
    node, P appears exactly once in the source string T.
  • When the search for P terminates at a branch
    node, each element node in the subtree rooted at
    this branch node gives a different occurrence of
    P.
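
The matching procedure can be sketched with an uncompressed suffix trie standing in for the suffix tree. The trie takes quadratic time and space to build, so it only illustrates the search step, not the O(m) preprocessing bound. The names `build_suffix_trie`, `occurrences`, and the `LEAF` key are assumptions, and the pattern is assumed to be non-empty and not to contain the `$` terminator.

```python
LEAF = "leaf-number"   # multi-character key, so it cannot collide with an edge character


def build_suffix_trie(t: str) -> dict:
    """Uncompressed suffix trie of t + '$'; each leaf stores its 1-based suffix start."""
    t += "$"
    root: dict = {}
    for i in range(len(t)):
        node = root
        for ch in t[i:]:
            node = node.setdefault(ch, {})
        node[LEAF] = i + 1
    return root


def occurrences(root: dict, p: str) -> list[int]:
    """1-based starting positions of p in the indexed text."""
    node = root
    for ch in p:                       # match P along the unique path
        if ch not in node:
            return []                  # no more matches possible: P does not occur
        node = node[ch]
    found, stack = [], [node]
    while stack:                       # collect every leaf below the match point
        current = stack.pop()
        for key, value in current.items():
            if key == LEAF:
                found.append(value)
            else:
                stack.append(value)
    return sorted(found)


if __name__ == "__main__":
    trie = build_suffix_trie("xabxac")
    print(occurrences(trie, "xa"), occurrences(trie, "ac"), occurrences(trie, "car"))
```

The final sort is only for readable output; the leaf numbers themselves are the answer.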

65
Substring Problem for a Database
  • Input: a database D (a set of strings) and a
    string S
  • Output: find all the strings in D containing S as
    a substring
  • Usage: identifying a person
  • Exact string matching methods that rescan D for
    every query are too slow
  • Suffix tree
  • D is stored in O(m) space, where m is the total
    length of all the strings in D.
  • Suffix tree is built in O(m) time
  • Each lookup of S (the length of S is n) costs
    O(n) time
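
A small sketch of the lookup idea follows. It indexes all suffixes of all database strings in one uncompressed trie and records at each node which strings pass through it. This uses quadratic space, so it does not achieve the O(m) bounds stated above, which need a generalized suffix tree; it only illustrates why a lookup costs time proportional to the query length. The names `TrieNode`, `build_index`, and `lookup` are assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # character -> TrieNode
        self.ids = set()     # indices of database strings whose suffixes pass through here


def build_index(strings: list[str]) -> TrieNode:
    """Index every suffix of every string in the database."""
    root = TrieNode()
    for sid, s in enumerate(strings):
        root.ids.add(sid)
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, TrieNode())
                node.ids.add(sid)
    return root


def lookup(root: TrieNode, strings: list[str], pattern: str) -> list[str]:
    """Return the database strings that contain pattern as a substring."""
    node = root
    for ch in pattern:               # one step per pattern character
        node = node.children.get(ch)
        if node is None:
            return []
    return [strings[i] for i in sorted(node.ids)]


if __name__ == "__main__":
    db = ["xabxac", "abx", "cater"]
    index = build_index(db)
    print(lookup(index, db, "abx"), lookup(index, db, "ate"), lookup(index, db, "zz"))
```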

66
Longest common substring (1)
  • Input: two strings S1 and S2
  • Output: find the longest substring S common to S1
    and S2
  • Example
  • S1 = common-substring
  • S2 = common-subsequence
  • Then, S = common-subs
67
Longest common substring (2)
  • Build a generalized suffix tree for S1 and S2
  • Each leaf represents either a suffix from one of
    S1 and S2, or a suffix from both S1 and S2
  • Mark each internal node v with a 1 (2) if there
    is a leaf in the subtree of v representing a
    suffix from S1 (S2)
  • The path-label of any internal node marked both 1
    and 2 is a substring common to both S1 and S2,
    and the longest such string is the longest common
    substring.

68
Longest common substring (3)
  • S1 = xabxac, S2 = abx, S = abx
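
The marking idea can be sketched on an uncompressed trie of the suffixes of both strings. Each node is marked with 1 and/or 2 according to which string contributes a suffix through it, and the deepest node carrying both marks spells the answer. The trie is quadratic in size, so this shows the marking scheme rather than a linear-time method; the names `MarkedNode` and `longest_common_substring` are assumptions.

```python
class MarkedNode:
    def __init__(self):
        self.children = {}   # character -> MarkedNode
        self.marks = set()   # subset of {1, 2}


def longest_common_substring(s1: str, s2: str) -> str:
    root = MarkedNode()
    for mark, s in ((1, s1), (2, s2)):
        for i in range(len(s)):            # insert every suffix of s
            node = root
            for ch in s[i:]:
                node = node.children.setdefault(ch, MarkedNode())
                node.marks.add(mark)       # a suffix of string `mark` passes through here
    best = ""
    stack = [(root, "")]
    while stack:                           # explore only nodes marked with both 1 and 2
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node.children.items():
            if child.marks == {1, 2}:
                stack.append((child, label + ch))
    return best


if __name__ == "__main__":
    print(longest_common_substring("common-substring", "common-subsequence"))
    print(longest_common_substring("xabxac", "abx"))
```

If several common substrings share the maximum length, this sketch returns whichever one its traversal reaches first.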

69
Suffix Arrays (1)
  • A suffix array of an m-character string S, is an
    array of integers in the range 1 to m, specifying
    the lexicographic order of the m suffixes of S.
  • Example: S = xabxac

70
Suffix Arrays (2)
  • A suffix array of S can be obtained from the
    suffix tree T of S by performing a lexical
    depth-first search of T
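
Both views of the suffix array can be sketched side by side: sorting the suffixes directly, and a lexical depth-first search. The depth-first version below runs on an uncompressed suffix trie rather than a true suffix tree, which is a simplification for brevity. The function names and the empty-string `END` key are assumptions; positions are 1-based as in the slides.

```python
END = ""   # marks the end of a suffix; the empty string sorts before every character


def suffix_array_by_sorting(s: str) -> list[int]:
    """1-based suffix start positions in lexicographic order of the suffixes."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def suffix_array_by_dfs(s: str) -> list[int]:
    """Same array, read off a suffix trie by visiting children in sorted order."""
    root: dict = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
        node[END] = i + 1

    order: list[int] = []

    def visit(node: dict) -> None:
        for key in sorted(node):          # lexical order of outgoing edges
            if key == END:
                order.append(node[key])   # a suffix ends here
            else:
                visit(node[key])

    visit(root)
    return order


if __name__ == "__main__":
    print(suffix_array_by_sorting("xabxac"))
    print(suffix_array_by_dfs("xabxac"))
```

For S = xabxac the sorted suffixes are abxac, ac, bxac, c, xabxac, xac, giving the array 2, 5, 3, 6, 1, 4.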

71
Suffix Arrays: Exact string matching
  • Given two strings T and P, where |T| = m and |P| =
    n, find all the occurrences of P in T.
  • Using binary search on the suffix array of T,
    all the occurrences of P in T can be found in
    O(n log m) time.
  • Example: let T = xabxac and P = ac
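
A minimal sketch of the binary search follows. Because the suffixes are sorted, all suffixes that start with P form one contiguous block of the array, and two binary searches find its boundaries with O(log m) comparisons of at most n characters each. The suffix array is built here by plain sorting for brevity, and the function name `find_occurrences` is an assumption. Returned positions are 1-based.

```python
def find_occurrences(t: str, p: str) -> list[int]:
    """1-based start positions of p in t, found by binary search on the suffix array."""
    sa = sorted(range(len(t)), key=lambda i: t[i:])   # 0-based suffix array
    n = len(p)

    def prefix(k: int) -> str:
        """First n characters of the k-th smallest suffix."""
        return t[sa[k]:sa[k] + n]

    lo, hi = 0, len(sa)
    while lo < hi:                    # first suffix whose n-character prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                    # first suffix whose n-character prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(i + 1 for i in sa[first:lo])


if __name__ == "__main__":
    print(find_occurrences("xabxac", "ac"))
    print(find_occurrences("xabxac", "xa"))
```

For T = xabxac and P = ac the block contains the single suffix starting at position 5.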

72
References
  • Dan Gusfield. Algorithms on Strings, Trees, and
    Sequences. University of California, Davis.
    Cambridge University Press, 1997
  • Ela Hunt et al. A database index to large
    biological sequences. Slides at VLDB, 2001
  • R.C.T. Lee and Chin Lung Lu. Suffix Trees.
    Slides at CS 5313 Algorithms for Molecular
    Biology