Title: Cupid
1 April 2003
Suffix Trees
Pavel Shvaiko
2Outline
- Introduction
- Suffix Trees (ST)
- Building STs in linear time: Ukkonen's algorithm
- Applications of ST
3 4Substrings
- A string is any sequence of characters.
- A substring of string S is a string composed of characters i through j, i ≤ j, of S.
- S = cater: ate is a substring.
- car is not a substring.
- The empty string is a substring of S.
5Subsequences
- A subsequence of string S is a string composed of characters i1 < i2 < ... < ik of S.
- S = cater: ate is a subsequence.
- car is a subsequence.
- The empty string is a subsequence of S.
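The difference between the two notions can be checked with a short Python sketch. The function names are illustrative and not part of the slides.

```python
def is_substring(t, s):
    """True if t occurs in s as one contiguous block (characters i..j)."""
    return t in s


def is_subsequence(t, s):
    """True if the characters of t appear in s in order, gaps allowed."""
    remaining = iter(s)
    # "ch in remaining" consumes the iterator up to the first match,
    # so later characters of t can only match further right in s.
    return all(ch in remaining for ch in t)


print(is_substring("ate", "cater"), is_substring("car", "cater"))
print(is_subsequence("car", "cater"), is_subsequence("", "cater"))
```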
6String/Pattern Matching - I
- You are given a source string S.
- Suppose we have to answer queries of the form: is the string p_i a substring of S?
- Knuth-Morris-Pratt (KMP) string matching.
  - O(|S| + |p_i|) time per query.
  - O(n|S| + Σ_i |p_i|) time for n queries.
- Suffix tree solution.
  - O(|S| + Σ_i |p_i|) time for n queries.
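For reference, a minimal sketch of a KMP substring query in Python. It preprocesses only the pattern (the failure table) and then scans S once, which is where the O(|S| + |p_i|) per-query cost comes from. The function name is an assumption made for this example.

```python
def kmp_contains(s, p):
    """Return True if pattern p occurs in text s (Knuth-Morris-Pratt)."""
    if not p:
        return True  # the empty string is a substring of every string
    # fail[i] = length of the longest proper border of p[:i + 1]
    fail = [0] * len(p)
    k = 0
    for i in range(1, len(p)):
        while k and p[i] != p[k]:
            k = fail[k - 1]
        if p[i] == p[k]:
            k += 1
        fail[i] = k
    # Scan the text once, never moving backwards in s.
    k = 0
    for ch in s:
        while k and ch != p[k]:
            k = fail[k - 1]
        if ch == p[k]:
            k += 1
        if k == len(p):
            return True
    return False
```

Answering n queries this way rescans S every time, which is the n|S| term above; the suffix tree pays for S only once.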
7String/Pattern Matching - II
- KMP preprocesses the query string p_i, whereas the suffix tree method preprocesses the source string (text) S.
- The suffix tree for a text of length m is built in O(m) time during a preprocessing stage; thereafter, whenever a string of length n is input, the algorithm searches for it in O(n) time using that suffix tree.
8String Matching: Prefixes & Suffixes
- Substrings of S beginning at the first position of S are called prefixes of S, and substrings that end at its last position are called suffixes of S.
- S = AACTAG
  - Prefixes: AACTAG, AACTA, AACT, AAC, AA, A
  - Suffixes: AACTAG, ACTAG, CTAG, TAG, AG, G
- p_i is a substring of S iff p_i is a prefix of some suffix of S.
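The last statement can be verified directly with a brute-force Python sketch (quadratic, for illustration only; the helper names are assumptions of this example):

```python
def suffixes(s):
    """All non-empty suffixes of s, longest first."""
    return [s[i:] for i in range(len(s))]


def prefixes(s):
    """All non-empty prefixes of s, longest first."""
    return [s[:j] for j in range(len(s), 0, -1)]


def is_prefix_of_some_suffix(p, s):
    """Brute-force form of the substring test used by suffix trees."""
    return any(suf.startswith(p) for suf in suffixes(s))


S = "AACTAG"
print(prefixes(S))
print(suffixes(S))
print(is_prefix_of_some_suffix("CTA", S), "CTA" in S)
```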
9 10Definition: Suffix Tree (ST) T for S of length m
- 1. A rooted tree with m leaves numbered from 1 to m.
- 2. Each internal node, excluding the root, of T has at least 2 children.
- 3. Each edge of T is labeled with a nonempty substring of S.
- 4. No two edges out of a node can have edge-labels starting with the same character.
- 5. For any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the suffix of S, namely S[i..m], that starts at position i.
11Example: Suffix Tree for S = xabxac
12Existence of a suffix tree for S
- If one suffix Sj of S matches a prefix of another suffix Si of S, then the path for Sj would not end at a leaf.
  - S = xabxa
  - S1 = xabxa and S4 = xa
- How to avoid this problem?
  - Assume that the last character of S appears nowhere else in S.
  - Add a new character $ not in the alphabet to the end of S.
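A small Python sketch of the problem and its fix: it lists the suffixes that are proper prefixes of another suffix, and would therefore not end at a leaf. The function name and the choice of `$` as terminal character are assumptions of this example.

```python
def hidden_suffixes(s):
    """Suffixes of s that are a proper prefix of some other suffix of s."""
    sufs = [s[i:] for i in range(len(s))]
    return [a for a in sufs
            if any(b != a and b.startswith(a) for b in sufs)]


print(hidden_suffixes("xabxa"))   # xa and a hide inside xabxa and abxa
print(hidden_suffixes("xabxa$"))  # with a unique last character: none
```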
13Example: Suffix Tree for S = xabxac
14- Building STs in linear time
- Ukkonen's algorithm
15Building STs in linear time
- Weiner's algorithm [FOCS, 1973]
  - Called "the algorithm of 1973" by Knuth
  - First linear-time algorithm, but uses much space
- McCreight's algorithm [JACM, 1976]
  - Linear time, with less space than Weiner's algorithm
  - More readable
- Ukkonen's algorithm [Algorithmica, 1995]
  - Linear-time algorithm using less space
  - This is what we will focus on
16Implicit Suffix Trees
- Ukkonen's algorithm constructs a sequence of implicit STs, the last of which is converted to a true ST of the given string.
- An implicit suffix tree for string S is a tree obtained from the suffix tree for S$ by
  - removing $ from all edge labels
  - removing any edges that now have no label
  - removing any node that does not still have at least two children
- An implicit suffix tree for prefix S[1..i] of S is similarly defined based on the suffix tree for S[1..i]$.
- I_i will denote the implicit suffix tree for S[1..i].
- Each suffix is in the tree, but may not end at a leaf.
17Example Construction of the Implicit ST
- Implicit tree for xabxa from the tree for xabxa$
- Suffixes: xabxa$, abxa$, bxa$, xa$, a$, $
[Figure: suffix tree for xabxa$ with leaves numbered 1 to 6]
18Construction of the Implicit ST: Remove $
- Remove $
- Suffixes: xabxa$, abxa$, bxa$, xa$, a$, $
[Figure: suffix tree for xabxa$ with leaves 1 to 6, the $ on each edge label marked for removal]
19Construction of the Implicit ST: After the Removal of $
- Remove $
- Suffixes: xabxa, abxa, bxa, xa, a
[Figure: tree after removing $ from the edge labels; leaves 1 to 6, some on edges that now have no label]
20Construction of the Implicit ST: Remove unlabeled edges
- Remove unlabeled edges
- Suffixes: xabxa, abxa, bxa, xa, a
[Figure: tree with leaves 1 to 6, the unlabeled edges marked for removal]
21Construction of the Implicit ST: After the Removal of Unlabeled Edges
- Remove unlabeled edges
- Suffixes: xabxa, abxa, bxa, xa, a
[Figure: tree after removing the unlabeled edges; leaves 1, 2 and 3 remain]
22Construction of the Implicit ST: Remove interior nodes
- Remove internal nodes with only one child
- Suffixes: xabxa, abxa, bxa, xa, a
[Figure: tree with leaves 1, 2 and 3, the internal nodes with a single child marked for removal]
23Construction of the Implicit ST: Final implicit tree
- Remove internal nodes with only one child
- Suffixes: xabxa, abxa, bxa, xa, a
[Figure: final implicit suffix tree for xabxa, with leaf edges xabxa (leaf 1), abxa (leaf 2) and bxa (leaf 3)]
24Ukkonen's Algorithm (UA)
- I_i is the implicit suffix tree of the string S[1..i]
- Construct I_1
- /* Construct I_{i+1} from I_i */
- for i = 1 to m-1 do /* phase i+1 */
  - for j = 1 to i+1 do /* extension j */
    - Find the end of the path P from the root whose label is S[j..i] in I_i and extend P with S(i+1) by the suffix extension rules
- Convert I_m into a suffix tree for S
25Example
- S = xabxacd
- i+1 = 1
  - x
- i+1 = 2
  - extend x to xa
  - a
- i+1 = 3
  - extend xa to xab
  - extend a to ab
  - b
26Extension Rules
- Goal: extend each S[j..i] into S[j..i+1]
- Rule 1: S[j..i] ends at a leaf
  - Add character S(i+1) to the end of the label on that leaf edge
- Rule 2: S[j..i] doesn't end at a leaf, and the following character is not S(i+1)
  - Create a new leaf edge for character S(i+1)
  - May need to create an internal node if S[j..i] ends in the middle of an edge
- Rule 3: S[j..i+1] is already in the tree
  - No update
27Example: Extension Rules
- Implicit tree for axabxb from the tree for axabx
[Figure: tree for axabx extended with b, each extension annotated with the rule used]
- Rule 1: at a leaf node
- Rule 3: already in tree
- Rule 2: add a leaf edge (and an interior node)
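The phase/extension loop and the three rules can be traced with a deliberately naive Python sketch. As a simplification it uses an uncompressed trie (one character per edge) instead of the compressed implicit tree, so rules 1 and 2 both just add a child and only their classification differs. It finds each S[j..i] by walking from the root, so it is the slow high-level algorithm, not the linear-time one. All names are illustrative.

```python
def extension_rules(s):
    """Return, for every phase, the rule (1, 2 or 3) used by each extension."""
    root = {}                      # trie node = dict: character -> child node
    phases = []
    for i, c in enumerate(s):      # phase: append character c = s[i]
        rules = []
        for j in range(i + 1):     # extension: suffix s[j:i] of the old prefix
            node = root
            for ch in s[j:i]:      # naive walk from the root
                node = node[ch]
            if c in node:
                rules.append(3)    # s[j:i] + c is already in the tree
            else:
                at_leaf = node is not root and not node
                rules.append(1 if at_leaf else 2)
                node[c] = {}
        phases.append(rules)
    return phases


for phase, rules in enumerate(extension_rules("axabxb"), start=1):
    print(phase, rules)
```

In every phase the printed rules form a run of 1s, then 2s, then 3s, which is the pattern the later speed-ups rely on.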
28UA for axabxc (1)
29UA for axabxc (2)
30UA for axabxc (3)
31UA for axabxc (4)
32Observations
- Once S[j..i] is located in the tree, the extension rules take only constant time.
- Naively, we could find the end of any suffix S[j..i] in O(|S[j..i]|) time by walking from the root of the current tree. By that approach, I_m could be created in O(m^3) time.
- Making Ukkonen's algorithm O(m):
  - Suffix links
  - Skip and count trick
  - Edge-label compression
  - A stopper
  - Once a leaf, always a leaf
33Suffix Links
- Consider the two strings α and xα, where x is a single character
- Suppose some internal node v of the tree is labeled with xα and another node s(v) in the tree is labeled with α
- The edge (v, s(v)) is called a suffix link
- Do all internal nodes (the root is not considered an internal node) have suffix links?
34Example suffix links
35Suffix Link Lemma
- If a new internal node v with path-label xα is added to the current tree in extension j of some phase i+1, then
  - the path labeled α already ends at an internal node of the tree, or
  - the internal node labeled α will be created in extension j+1 of the same phase i+1, or
  - string α is empty and s(v) is the root
36Proof of Suffix Link Lemma
- A new internal node is created only by extension rule 2
- This means that there are two distinct suffixes of S[1..i+1] that start with xα
  - xαS(i+1) and xαcβ, where c is not S(i+1)
- This means that there are two distinct suffixes of S[1..i+1] that start with α
  - αS(i+1) and αcβ, where c is not S(i+1)
- Thus, if α is not empty, α will label an internal node once extension j+1, which is the extension of α, is processed
37Corollary of Suffix Link Lemma
- Every internal node of an implicit suffix tree
has a suffix link from it.
38How to use suffix links - 1
- S[1..i] must end at a leaf, since it is the longest string in the implicit tree I_i
  - Keep a pointer to this leaf in all cases and extend according to rule 1
- Locating S[j+1..i] from S[j..i], which is at node w:
  - If w is an internal node, set v to w
  - Otherwise, set v = parent(w)
  - If v is the root, you must traverse from the root to find S[j+1..i]
  - If not, go to s(v) and begin the search for the remaining portion of S[j..i] from there
39How to use suffix links - 2
40Skip and Count Trick (1)
- Problem: moving down from s(v), directly implemented, takes time proportional to the number of characters compared
- Solution: make the running time proportional to the number of nodes on the path searched, instead of the number of characters
41Skip and Count Trick (2)
- After 4 node skips down the tree, the end of S[j..i] is found.
42Skip and Count Trick (3)
- The node-depth of v, denoted ND(v), is the number of nodes on the path from the root to the node v
- Lemma: for any suffix link (v, s(v)) traversed in Ukkonen's algorithm, at that moment, ND(v) ≤ ND(s(v)) + 1
43Skip and Count Trick (4)
- At the moment of traversing (v, s(v)): ND(v) ≤ ND(s(v)) + 1
44Skip and Count Trick (5)
- The current node-depth of the algorithm is the node-depth of the node most recently visited by the algorithm
- Lemma: using the skip and count trick, any phase of Ukkonen's algorithm takes O(m) time.
  - An up-walk decreases the current node-depth by ≤ 1
  - A suffix link traversal does the same as an up-walk
  - In total, the current node-depth is decreased by ≤ 2m.
  - No node has depth > m.
  - The total possible increment to the current node-depth is ≤ 3m.
45Edge Label Representation
- Potential problem
  - The edge labels may require Ω(m^2) space
  - Thus, the time for the algorithm is at least as large as the size of its output
- Example
  - S = abcdefghijklmnopqrstuvwxyz
  - Total length is Σ_{j<m+1} j = m(m+1)/2
  - A similar problem can happen when the length of the string is arbitrarily larger than the alphabet size
- Solution
  - Label edges with a pair of indices indicating the beginning and end positions of the substring in S
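The numbers for the example can be reproduced with a few lines of Python. The sketch assumes, as in the example, that all characters of S are distinct, so the tree is just one leaf edge per suffix hanging off the root; `label_costs` is a name invented here.

```python
import string


def label_costs(s):
    """Compare explicit edge labels with (begin, end) index pairs.

    Only valid when all characters of s are distinct: the suffix tree
    then has exactly one leaf edge per suffix, directly under the root.
    """
    m = len(s)
    assert len(set(s)) == m, "sketch assumes distinct characters"
    explicit_chars = sum(len(s[j:]) for j in range(m))   # m(m+1)/2
    index_pairs = [(j + 1, m) for j in range(m)]         # 1-based (begin, end)
    return explicit_chars, 2 * len(index_pairs)


chars, ints = label_costs(string.ascii_lowercase)
print(chars, ints)  # 351 characters versus 52 integers
```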
46Modified Extension Rules
- Rule 2: new leaf edge (phase i+1)
  - create edge (i+1, i+1)
  - split edge (p, q) => (p, w) and (w+1, q)
- Rule 1: leaf edge extension
  - label had to be (p, i) before the extension
  - given rule 2 above and an induction argument
  - (p, q) => (p, q+1)
- Rule 3
  - Do nothing
47Full edge label representation
[Figure: suffix tree for xabxa$ with full edge labels (xa, a, bxa$, $) and leaves 1 to 6]
48Edge-label Compression
[Figure: the same tree with each edge label replaced by an index pair: (1,2) or (4,5)? for xa, (2,2) for a, (3,6) for bxa$, (6,6) for $; leaves 1 to 6]
49A Stopper
- In any phase, if suffix extension rule 3 applies in extension j, it will also apply in all extensions k, where k > j, until the end of the phase.
- The extensions in phase i+1 that are done after the first execution of rule 3 are said to be done implicitly. This is in contrast to any extension j where the end of S[j..i] is explicitly found. An extension of that kind is called an explicit extension.
- Hence, we can end any phase i+1 when the first extension rule 3 applies.
50Once a leaf, always a leaf (1)
- If at some point in UA a leaf is created and labeled j (for the suffix starting at position j of S), then that leaf will remain a leaf in all successive trees created during the algorithm.
- In any phase i, there is an initial sequence of consecutive extensions (starting with extension 1) in which only rule 1 or 2 applies; let j_i be the last extension in this sequence.
- Note that j_i ≤ j_{i+1}.
51Once a leaf, always a leaf (2)
- Implicit extensions: for extensions 1 to j_i, write (p, e) on the leaf edge, where e is a symbol denoting the current end and is set to i+1 once at the beginning of the phase. In later phases, we will not need to explicitly extend this leaf, but rather can implicitly extend it by incrementing e once in its global location.
- Explicit extensions: from extension j_i + 1 until the first rule 3 extension is found (or until extension i+1 is done)
52Once a leaf, always a leaf (3): Single phase algorithm
- j is the last explicit extension computed in phase i+1
- Phase i+1
  - Increment e to i+1 (implicitly extending all existing leaves)
  - Explicitly compute successive extensions, starting at j_i + 1 and continuing until reaching the first extension j where rule 3 applies or no more extensions are needed
  - Set j_{i+1} to j - 1, to prepare for the next phase
- Observation
  - Phases i and i+1 share at most 1 explicit extension
53Example: S = axaxbb$ - (1)
- e = 2, ax
  - S[1..2]: skip
  - S[2..2]: rule 2, create (2, e)
  - j_2 = 2
- e = 3, axa
  - S[1..3] .. S[2..3]: skip
  - S[3..3]: rule 3
  - j_3 = 2
54Example: S = axaxbb$ - (2)
- e = 4, axax
  - S[1..4] .. S[2..4]: skip
  - S[3..4]: rule 3
  - S[4..4]: auto skip
  - j_4 = 2
- e = 5, axaxb
  - S[1..5] .. S[2..5]: skip
  - S[3..5]: rule 2, split (1, e) => (1, 2) and (3, e), create (5, e)
  - S[4..5]: rule 2, split (2, e) => (2, 2) and (3, e), create (5, e)
  - S[5..5]: rule 2, create (5, e)
  - j_5 = 5
55Example: S = axaxbb$ - (3)
- e = 6, axaxbb
  - S[1..6] .. S[5..6]: skip
  - S[6..6]: rule 3
  - j_6 = 5
- e = 7, axaxbb$
  - S[1..7] .. S[5..7]: skip
  - S[6..7]: rule 2, split (5, e) => (5, 5) and (6, e), create (7, e)
  - S[7..7]: rule 2, create (7, e)
  - j_7 = 7
56Complexity of UA
- Since the implicit extensions in any phase take constant time in total, their total cost over all phases is O(m).
- In total, at most 2m explicit extensions are executed.
- The maximum number of down-walking skips is O(m).
- Time complexity of Ukkonen's algorithm: O(m)
57Finishing up
- Convert the final implicit suffix tree to a true suffix tree
  - Add $ using just another phase of execution
  - Now all suffixes will end at leaves
  - Replace e in every leaf edge with m
  - This just requires a traversal of the tree, which takes O(m) time
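Putting the tricks together, here is a compact Python sketch of the whole construction. It follows the widely used "active point" formulation rather than the per-extension bookkeeping of the slides: `remainder` counts the suffixes still waiting for an explicit extension, `end = None` on a leaf plays the role of the global end e, and the walk-down step is the skip and count trick. It assumes the caller has already appended a unique terminal character, so no separate conversion pass is needed. All names are assumptions of this example, and indices inside the code are 0-based.

```python
class Node:
    def __init__(self, start, end):
        self.start = start    # edge label into this node is text[start:end]
        self.end = end        # None = open leaf edge (grows with the text)
        self.children = {}    # first character of edge label -> Node
        self.link = None      # suffix link


def build_suffix_tree(text):
    root = Node(-1, -1)
    node, edge, length = root, 0, 0   # active point
    remainder = 0
    for i, c in enumerate(text):      # phase i + 1
        remainder += 1
        last_internal = None
        while remainder > 0:
            if length == 0:
                edge = i
            child = node.children.get(text[edge])
            if child is None:                       # rule 2, at a node
                node.children[text[edge]] = Node(i, None)
                if last_internal is not None:
                    last_internal.link = node
                    last_internal = None
            else:
                edge_end = i + 1 if child.end is None else child.end
                edge_len = edge_end - child.start
                if length >= edge_len:              # skip and count
                    edge += edge_len
                    length -= edge_len
                    node = child
                    continue
                if text[child.start + length] == c:  # rule 3: stop the phase
                    length += 1
                    if last_internal is not None:
                        last_internal.link = node
                    break
                split = Node(child.start, child.start + length)  # rule 2, split
                node.children[text[edge]] = split
                split.children[c] = Node(i, None)
                child.start += length
                split.children[text[child.start]] = child
                if last_internal is not None:
                    last_internal.link = split
                last_internal = split
            remainder -= 1
            if node is root and length > 0:
                length -= 1
                edge = i - remainder + 1
            elif node is not root:
                node = node.link or root            # follow the suffix link
    return root


def leaf_positions(node, text, depth=0):
    """Yield the 1-based start position of the suffix at every leaf."""
    if not node.children:
        yield len(text) - depth + 1
    for child in node.children.values():
        end = len(text) if child.end is None else child.end
        yield from leaf_positions(child, text, depth + end - child.start)


text = "xabxa$"
print(sorted(leaf_positions(build_suffix_tree(text), text)))
```

Leaf numbers are not stored in this sketch; they are recovered from the depth of each leaf, which keeps the construction code short.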
58Implementation Issues (1)
- When the size of the alphabet grows:
  - For large trees, suffix links allow an algorithm to move quickly from one part of the tree to a distant part of the tree. This is great for worst-case time bounds, but it's horrible if the tree isn't entirely in memory.
  - Thus, implementing STs to reduce practical space use can be a serious concern.
  - The main design issues are how to represent and search the branches out of the nodes of the tree.
  - A practical design must balance the constraints of space against the need for speed
59Implementation Issues (2)
- There are four basic choices to represent branches:
  - An array of size Θ(|Σ|) at each non-leaf node v
  - A linked list at node v of the characters that appear at the beginning of the edge-labels out of v.
    - If it is kept in sorted order, it reduces the average time to search for a given character
    - In the worst case, it adds time |Σ| to every node operation. If the number of children of v is large, then little space is saved over the array while noticeably degrading performance
  - A balanced tree that implements the list at node v
    - Additions and searches take O(log k) time and O(k) space, where k is the number of children of v. This alternative makes sense only when k is fairly large.
  - A hashing scheme. The challenge is to find a scheme balancing space with speed. For large trees and alphabets, hashing is very attractive, at least for some of the nodes
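Two of these choices, the array and the hashing scheme, can be sketched in Python behind the same small interface. The class names are invented for this example, and the array variant assumes a lowercase a-z alphabet so that a character maps to a slot by arithmetic.

```python
class ArrayNode:
    """Array of size |alphabet| per node: O(1) lookup, space paid up front."""

    def __init__(self):
        self.slots = [None] * 26          # assumes characters 'a'..'z'

    def get(self, ch):
        return self.slots[ord(ch) - ord("a")]

    def put(self, ch, child):
        self.slots[ord(ch) - ord("a")] = child


class HashNode:
    """Hash table per node: space proportional to the number of children."""

    def __init__(self):
        self.kids = {}

    def get(self, ch):
        return self.kids.get(ch)

    def put(self, ch, child):
        self.kids[ch] = child


for cls in (ArrayNode, HashNode):
    v = cls()
    v.put("x", "child reached by x")
    print(cls.__name__, v.get("x"), v.get("a"))
```

Because both classes expose the same `get` and `put`, a tree could mix them, which is the kind of hybrid design discussed next.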
60Implementation Issues (3)
- When m and |Σ| are large enough, the best design is probably a mixture of the above choices.
  - Nodes near the root of the tree tend to have the most children, so arrays are a sensible choice at those nodes.
  - For nodes in the middle of a suffix tree, hashing or balanced trees may be the best choice.
- Sometimes the alphabet size is explicitly presented in the time and space bounds
  - Construction time is O(m log |Σ|), using Θ(m|Σ|) space, where m is the size of the string.
61- Applications of Suffix Trees
62Applications of Suffix Trees
- Exact string matching
- Substring problem for a database
- Longest common substring
- Suffix arrays
63Exact String Matching (1)
- Exact matching problem: given a pattern P of length n and a text T of length m, find all occurrences of P in T in O(n + m) time.
- Overview of the ST approach:
  - Build a ST for text T in O(m) time
  - Match the characters of P along the unique path in the ST until either P is exhausted or no more matches are possible
  - In the latter case, P doesn't appear anywhere in T
  - In the former case, every leaf in the subtree below the point of the last match is numbered with a starting location of P in T, and every starting location of P in T numbers such a leaf
- The ST approach spends O(m) preprocessing time and then O(n + k) search time, where k is the number of occurrences of P in T
64Exact String Matching (2)
- When the search terminates at an element (leaf) node, P appears exactly once in the source string T.
- When the search for P terminates at a branch (internal) node, each element node in the subtree rooted at this branch node gives a different occurrence of P.
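The search procedure can be sketched in Python as follows. To keep the example short, the tree is built naively by inserting one suffix at a time (quadratic time, not Ukkonen's algorithm), and a `$` terminal, assumed not to occur in the text, is appended inside the builder. Function and class names are illustrative.

```python
class Node:
    def __init__(self, leaf=None):
        self.children = {}   # first character -> (edge label, child Node)
        self.leaf = leaf     # 1-based suffix start for leaves, else None


def build_naive(text):
    text += "$"              # assumption: "$" does not occur in text
    root = Node()
    for i in range(len(text)):
        node, rest = root, text[i:]
        while True:
            entry = node.children.get(rest[0])
            if entry is None:                     # no edge: hang a new leaf
                node.children[rest[0]] = (rest, Node(leaf=i + 1))
                break
            label, child = entry
            k = 0
            while k < len(label) and label[k] == rest[k]:
                k += 1
            if k < len(label):                    # mismatch inside the edge
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root


def find_all(root, p):
    """Sorted 1-based start positions of p in the indexed text."""
    node, rest = root, p
    while rest:                                   # match P along the path
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    found, stack = [], [node]                     # collect leaves below
    while stack:
        v = stack.pop()
        if v.leaf is not None:
            found.append(v.leaf)
        stack.extend(child for _, child in v.children.values())
    return sorted(found)


tree = build_naive("xabxac")
print(find_all(tree, "xa"), find_all(tree, "ac"), find_all(tree, "xc"))
```

The final sort is only for readable output; without it the collection step is the O(k) part of the O(n + k) search bound.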
65Substring Problem for a Database
- Input: a database D (a set of strings) and a string S
- Output: find all the strings in D containing S as a substring
- Usage: identifying a person
- Exact string matching methods cannot work
- Suffix tree
  - D is stored in O(m) space, where m is the total length of all the strings in D.
  - The suffix tree is built in O(m) time
  - Each lookup of S (the length of S is n) costs O(n) time
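A minimal Python sketch of the idea, under a loud simplification: it indexes all suffixes of all database strings in an uncompressed trie and stores at every node the ids of the strings passing through it. That keeps the O(n) lookup but uses far more than O(m) space, so it only illustrates the query side, not the linear-space generalized suffix tree. Names and the sample database are invented for this example.

```python
def build_index(db):
    """Trie over every suffix of every string; nodes remember string ids."""
    root = {"kids": {}, "ids": set(range(len(db)))}
    for sid, s in enumerate(db):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "ids": set()})
                node["ids"].add(sid)
    return root


def lookup(index, q):
    """Ids of the database strings that contain q, in O(len(q)) steps."""
    node = index
    for ch in q:
        node = node["kids"].get(ch)
        if node is None:
            return set()
    return set(node["ids"])


db = ["xabxac", "banana", "cab"]
index = build_index(db)
print(lookup(index, "ab"), lookup(index, "an"), lookup(index, "zz"))
```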
66Longest common substring (1)
- Input: two strings S1 and S2
- Output: find the longest substring S common to S1 and S2
- Example
  - S1 = common-substring
  - S2 = common-subsequence
  - Then, S = common-subs
67Longest common substring (2)
- Build a suffix tree for S1 and S2
- Each leaf represents either a suffix from one of S1 and S2, or a suffix from both S1 and S2
- Mark each internal node v with a 1 (2) if there is a leaf in the subtree of v representing a suffix from S1 (S2)
- The path-label of any internal node marked both 1 and 2 is a substring common to both S1 and S2, and the longest such string is the longest common substring.
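The marking scheme can be tried out with a short Python sketch. As a simplification it uses an uncompressed trie of all suffixes of both strings instead of a linear-size generalized suffix tree, so it is quadratic in space; marking every node on an inserted suffix's path is equivalent to marking by the leaves below it. The function name is an assumption.

```python
def longest_common_substring(s1, s2):
    root = {"kids": {}, "marks": set()}
    # Insert every suffix of both strings, marking the nodes it passes.
    for mark, s in ((1, s1), (2, s2)):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["kids"].setdefault(ch, {"kids": {}, "marks": set()})
                node["marks"].add(mark)
    # The deepest node marked with both 1 and 2 spells the answer.
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["kids"].items():
            if child["marks"] == {1, 2}:
                label = path + ch
                if len(label) > len(best):
                    best = label
                stack.append((child, label))
    return best


print(longest_common_substring("common-substring", "common-subsequence"))
```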
68Longest common substring (3)
69Suffix Arrays (1)
- A suffix array of an m-character string S is an array of integers in the range 1 to m, specifying the lexicographic order of the m suffixes of S.
- Example: S = xabxac
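For the example string, the suffix array can be computed directly from the definition by sorting the suffixes. This Python sketch is a simple stand-in that takes O(m^2 log m) time in the worst case, not a linear-time construction; the function name is illustrative.

```python
def suffix_array(s):
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


S = "xabxac"
for pos in suffix_array(S):
    print(pos, S[pos - 1:])
# suffix_array("xabxac") == [2, 5, 3, 6, 1, 4]
```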
70Suffix Arrays (2)
- A suffix array of S can be obtained from the
suffix tree T of S by performing a lexical
depth-first search of T
71Suffix Arrays Exact string matching
- Given two strings T and P, where |T| = m and |P| = n, find all the occurrences of P in T.
- Using binary search on the suffix array of T, all the occurrences of P in T can be found in O(n log m) time.
- Example: let T = xabxac and P = ac
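A sketch of the binary search in Python: the suffixes that start with P form one contiguous block of the suffix array, so two binary searches (for the left and right ends of the block) suffice, each comparing at most n characters per step. The suffix array itself is built by plain sorting here for brevity, and the names are assumptions of this example.

```python
def suffix_array(t):
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


def find_occurrences(t, p):
    """Sorted 1-based positions where p occurs in t."""
    sa = suffix_array(t)
    n = len(p)

    def head(k):
        # First n characters of the k-th smallest suffix.
        return t[sa[k] - 1: sa[k] - 1 + n]

    lo, hi = 0, len(sa)
    while lo < hi:                 # leftmost suffix whose head is >= p
        mid = (lo + hi) // 2
        if head(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(sa)
    while lo < hi:                 # leftmost suffix whose head is > p
        mid = (lo + hi) // 2
        if head(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


print(find_occurrences("xabxac", "ac"))
print(find_occurrences("xabxac", "xa"))
```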
72References
- Dan Gusfield. Algorithms on Strings, Trees, and Sequences. University of California, Davis. Cambridge University Press, 1997
- Ela Hunt et al. A database index to large biological sequences. Slides at VLDB, 2001
- R.C.T. Lee and Chin Lung Lu. Suffix Trees. Slides at CS 5313 Algorithms for Molecular Biology