Cupid

Description:

April 2003. Suffix Trees. Pavel Shvaiko. Outline: Introduction; Suffix Trees (ST); Building STs in linear time: Ukkonen's algorithm; Applications of ST.

1
April 2003
Suffix Trees
Pavel Shvaiko
2
Outline

Introduction
Suffix Trees (ST)
Building STs in linear time: Ukkonen's algorithm
Applications of ST

Introduction

4
Substrings

String is any sequence of characters.
A substring of string S is a string composed of
characters i through j, i ≤ j, of S.
Example: S = cater. ate is a substring;
car is not a substring.
Empty string is a substring of S.

5
Subsequences

A subsequence of string S is a string composed of
characters i_1 < i_2 < ... < i_k of S.
Example: S = cater. ate is a subsequence;
car is a subsequence.
Empty string is a subsequence of S.
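
The two definitions above can be checked with a short sketch. The function names are illustrative assumptions and do not come from the slides; the code favours clarity over speed.

```python
def is_substring(s: str, p: str) -> bool:
    """True if p equals characters i..j of s for some i <= j, or p is empty."""
    return any(s[i:i + len(p)] == p for i in range(len(s) - len(p) + 1))


def is_subsequence(s: str, p: str) -> bool:
    """True if the characters of p appear in s in the same order, gaps allowed."""
    remaining = iter(s)
    # 'ch in remaining' consumes the iterator up to the first match,
    # so later characters of p can only match further to the right.
    return all(ch in remaining for ch in p)


print(is_substring("cater", "ate"), is_substring("cater", "car"))
print(is_subsequence("cater", "ate"), is_subsequence("cater", "car"))
```

Every substring is also a subsequence, but not the other way round: car skips characters of cater, so it passes only the second test.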

6
String/Pattern Matching - I

You are given a source string S.
Suppose we have to answer queries of the form: is
the string p_i a substring of S?
Knuth-Morris-Pratt (KMP) string matching:
O(|S| + |p_i|) time per query.
O(n|S| + Σ_i |p_i|) time for n queries.
Suffix tree solution:
O(|S| + Σ_i |p_i|) time for n queries.

7
String/Pattern Matching - II

KMP preprocesses the query string p_i, whereas the
suffix tree method preprocesses the source string
(text) S.
The suffix tree for the text is built in O(m)
time during a preprocessing stage; thereafter,
whenever a string of length n is input, the
algorithm searches for it in O(n) time using that
suffix tree.

8
String Matching: Prefixes and Suffixes

Substrings of S beginning at the first position
of S are called prefixes of S, and substrings
that end at its last position are called suffixes
of S.
S = AACTAG
Prefixes: AACTAG, AACTA, AACT, AAC, AA, A
Suffixes: AACTAG, ACTAG, CTAG, TAG, AG, G
p_i is a substring of S iff p_i is a prefix of some
suffix of S.
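
As a small illustration of the last statement, the sketch below lists the prefixes and suffixes of S and tests a pattern against every suffix. The helper names are assumptions made for this example.

```python
def prefixes(s: str) -> list:
    """Non-empty prefixes of s, longest first."""
    return [s[:k] for k in range(len(s), 0, -1)]


def suffixes(s: str) -> list:
    """Non-empty suffixes of s, longest first."""
    return [s[k:] for k in range(len(s))]


def is_prefix_of_some_suffix(s: str, p: str) -> bool:
    """Substring test phrased exactly as on the slide."""
    return p == "" or any(suffix.startswith(p) for suffix in suffixes(s))


print(prefixes("AACTAG"))
print(suffixes("AACTAG"))
print(is_prefix_of_some_suffix("AACTAG", "CTA"))
```

This scan is quadratic in the worst case; the suffix tree stores the same suffixes so that the test becomes a single walk from the root.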

Suffix Trees

10
Definition: Suffix Tree (ST) T for S of length m

1. T is a rooted tree with m leaves numbered from 1 to
m.
2. Each internal node of T, excluding the root,
has at least 2 children.
3. Each edge of T is labeled with a nonempty
substring of S.
4. No two edges out of a node can have
edge-labels starting with the same character.
5. For any leaf i, the concatenation of the
edge-labels on the path from the root to leaf i
exactly spells out the suffix of S that starts at
position i, namely S[i..m].

11
Example: Suffix Tree for S = xabxac
12
Existence of a suffix tree for S

If one suffix S_j of S matches a prefix of another
suffix S_i of S, then the path for S_j would not
end at a leaf.
Example: S = xabxa, with S_1 = xabxa and S_4 = xa.
How to avoid this problem?
Assume that the last character of S appears
nowhere else in S.
To ensure this, add a new character $ that is not
in the alphabet to the end of S.
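
The effect of the terminal character can be seen on an uncompressed suffix trie, used here as a simplified stand-in for the suffix tree (one character per edge, nested dictionaries as nodes). This is a sketch under that assumption, not the tree of the definition above.

```python
def build_suffix_trie(s: str) -> dict:
    """Insert every suffix of s into a trie of nested dicts."""
    root = {}
    for start in range(len(s)):
        node = root
        for ch in s[start:]:
            node = node.setdefault(ch, {})
    return root


def count_leaves(node: dict) -> int:
    """A node without children is a leaf."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


print(count_leaves(build_suffix_trie("xabxa")))
print(count_leaves(build_suffix_trie("xabxa$")))
```

Without the terminator, the suffixes xa and a end inside the paths of xabxa and abxa, so only three of the five suffixes get a leaf. With $ appended, all six suffixes end at their own leaf.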

13
Example: Suffix Tree for S = xabxac
14

Building STs in linear time:
Ukkonen's algorithm

15
Building STs in linear time

Weiner's algorithm [FOCS, 1973]
Called "the algorithm of 1973" by Knuth
First linear-time algorithm, but uses much space
McCreight's algorithm [JACM, 1976]
Linear time and linear space; more space-efficient than Weiner's
More readable
Ukkonen's algorithm [Algorithmica, 1995]
Linear-time algorithm that uses less space
This is what we will focus on

16
Implicit Suffix Trees

Ukkonen's algorithm constructs a sequence of
implicit STs, the last of which is converted to a
true ST of the given string.
An implicit suffix tree for string S is a tree
obtained from the suffix tree for S$ by
removing $ from all edge labels,
removing any edges that now have no label,
removing any node that does not still have at
least two children.
An implicit suffix tree for prefix S[1..i] of S is
similarly defined based on the suffix tree for
S[1..i]$.
I_i will denote the implicit suffix tree for
S[1..i].
Each suffix is in the tree, but may not end at a
leaf.

17
Example: Construction of the Implicit ST

Implicit tree for xabxa from the tree for xabxa$
Suffixes: xabxa$, abxa$, bxa$, xa$, a$, $
[Figure: suffix tree for xabxa$ with leaves numbered 1 to 6.]

18
Construction of the Implicit ST: Remove $

Remove $
Suffixes: xabxa$, abxa$, bxa$, xa$, a$, $
[Figure: the same tree, leaves 1 to 6, with every $ marked for removal.]

19
Construction of the Implicit ST: After the
Removal of $

Suffixes: xabxa, abxa, bxa, xa, a
[Figure: the tree with $ removed from all edge labels, leaves 1 to 6.]

20
Construction of the Implicit ST: Remove unlabeled
edges

Remove unlabeled edges
Suffixes: xabxa, abxa, bxa, xa, a
[Figure: the same tree, leaves 1 to 6, with the unlabeled edges marked for removal.]

21
Construction of the Implicit ST: After the
Removal of Unlabeled Edges

Suffixes: xabxa, abxa, bxa, xa, a
[Figure: the tree without the unlabeled edges; leaves 1, 2 and 3 remain.]

22
Construction of the Implicit ST: Remove interior
nodes

Remove internal nodes with only one child
Suffixes: xabxa, abxa, bxa, xa, a
[Figure: the same tree, leaves 1 to 3, with the single-child internal nodes marked for removal.]

23
Construction of the Implicit ST: Final implicit
tree

Suffixes: xabxa, abxa, bxa, xa, a
[Figure: final implicit tree for xabxa with leaves 1, 2 and 3.]
24
Ukkonen's Algorithm (UA)

I_i is the implicit suffix tree of the string
S[1..i]
Construct I_1
/* Construct I_{i+1} from I_i */
for i = 1 to m-1 do /* phase i+1 */
    for j = 1 to i+1 do /* extension j */
        Find the end of the path P from the root whose
        label is S[j..i] in I_i and extend P with S(i+1)
        by the suffix extension rules
Convert I_m into a suffix tree for S
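
The phase and extension structure can be sketched on an uncompressed trie, where every edge carries one character. This is a deliberate simplification and an assumption of the example: it walks from the root in every extension, so it shows the naive version of the loop rather than the linear-time algorithm, and the function names are illustrative only.

```python
def build_by_phases(s: str) -> dict:
    """Naive phase/extension loop on a trie of nested dicts (0-based indices)."""
    root = {}
    for i in range(len(s)):              # this phase appends character s[i]
        for j in range(i + 1):           # extension for the suffix s[j:i+1]
            node = root
            for ch in s[j:i]:            # find the end of the path s[j:i]
                node = node[ch]          # it exists from the previous phase
            node.setdefault(s[i], {})    # extend the path with s[i] if needed
    return root


def leaf_paths(node: dict, prefix: str = "") -> set:
    """Strings spelled out on the root-to-leaf paths."""
    if not node:
        return {prefix}
    found = set()
    for ch, child in node.items():
        found |= leaf_paths(child, prefix + ch)
    return found


print(sorted(leaf_paths(build_by_phases("xabxa$"))))
```

After the last phase the leaf paths are exactly the suffixes of the input, provided it ends with a character that occurs nowhere else.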

25
Example

S = xabxacd
i+1 = 1: create the leaf edge x
i+1 = 2: extend x to xa; add a
i+1 = 3: extend xa to xab; extend a to ab; add b

26
Extension Rules

Goal: extend each S[j..i] into S[j..i+1]
Rule 1: S[j..i] ends at a leaf
Add character S(i+1) to the end of the label on
that leaf edge
Rule 2: S[j..i] doesn't end at a leaf, and the
following character is not S(i+1)
Create a new leaf edge for character S(i+1)
May need to create an internal node if S[j..i]
ends in the middle of an edge
Rule 3: S[j..i+1] is already in the tree
No update

27
Example: Extension Rules

Implicit tree for axabxb from tree for axabx

[Figure: the implicit tree for axabx extended with the character b.]
Rule 1: at a leaf node
Rule 3: already in tree
Rule 2: add a leaf edge (and an interior node)
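
Which rule fires in each extension can be worked out with a small sketch. It assumes an uncompressed trie as a stand-in for the implicit tree, so the interior node of rule 2 already exists as a trie node; the function name is hypothetical.

```python
def extension_rules(prefix: str, next_char: str) -> list:
    """Rule (1, 2 or 3) used by each extension when next_char is appended."""
    root = {}
    for start in range(len(prefix)):         # trie of all suffixes of prefix
        node = root
        for ch in prefix[start:]:
            node = node.setdefault(ch, {})
    rules = []
    for j in range(len(prefix) + 1):         # last extension: the empty suffix
        node = root
        for ch in prefix[j:]:
            node = node[ch]
        if next_char in node:
            rules.append(3)                  # already in the tree
        elif not node and j < len(prefix):
            rules.append(1)                  # path ends at a leaf
        else:
            rules.append(2)                  # new leaf edge is needed
    return rules


print(extension_rules("axabx", "b"))
```

For axabx extended with b, the four suffixes axabx, xabx, abx and bx end at leaves (rule 1), the suffix x continues only with a (rule 2), and the single character b is already present (rule 3).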
28
UA for axabxc (1)
29
UA for axabxc (2)
30
UA for axabxc (3)
31
UA for axabxc (4)
32
Observations

Once S[j..i] is located in the tree, the extension
rules take only constant time.
Naively, we could find the end of any suffix
S[j..i] in O(|S[j..i]|) time by walking from the root
of the current tree. By that approach, I_m could
be created in O(m³) time.
Making Ukkonen's algorithm O(m):
Suffix links
Skip and count trick
Edge-label compression
A stopper
Once a leaf, always a leaf

33
Suffix Links

Consider the two strings α and xα, where x is a
single character
Suppose some internal node v of the tree is
labeled with xα and another node s(v) in the tree
is labeled with α
The edge (v, s(v)) is called a suffix link
Do all internal nodes (the root is not considered
an internal node) have suffix links?

34
Example: suffix links

S = ACACACAC

35
Suffix Link Lemma

If a new internal node v with path-label xα is
added to the current tree in extension j of some
phase i+1, then either
the path labeled α already ends at an internal
node of the tree, or
the internal node labeled α will be created in
extension j+1 of the same phase i+1, or
string α is empty and s(v) is the root

36
Proof of Suffix Link Lemma

A new internal node is created only by
extension rule 2
This means that there are two distinct suffixes
of S[1..i+1] that start with xα:
xαS(i+1) and xαcβ, where c is not S(i+1)
This means that there are two distinct suffixes
of S[1..i+1] that start with α:
αS(i+1) and αcβ, where c is not S(i+1)
Thus, if α is not empty, α will label an internal
node once extension j+1 is processed, which is the
extension of α

37
Corollary of Suffix Link Lemma

Every internal node of an implicit suffix tree
has a suffix link from it.
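
The corollary can be checked by brute force on small strings. The sketch below assumes the input ends with a terminal character that occurs nowhere else, so that a nonempty substring labels an internal node exactly when it is followed by at least two different characters. The function names are illustrative.

```python
def internal_node_labels(s: str) -> set:
    """Path-labels of the internal nodes of the suffix tree of s."""
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            # substring s[i:j] is followed by the character s[j]
            followers.setdefault(s[i:j], set()).add(s[j])
    return {label for label, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s: str) -> dict:
    """Map each internal label x+alpha to alpha; '' stands for the root."""
    internal = internal_node_labels(s)
    links = {}
    for label in internal:
        target = label[1:]
        # the corollary: the target is the root or another internal node
        assert target == "" or target in internal
        links[label] = target
    return links


print(suffix_links("xabxa$"))
```

For xabxa$ there are two internal nodes, xa and a, with links from xa to a and from a to the root.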

38
How to use suffix links - 1

S[1..i] must end at a leaf since it is the longest
string in the implicit tree I_i
Keep a pointer to this leaf in all cases and
extend according to rule 1
Locating S[j+1..i] from S[j..i], which ends at node w:
If w is an internal node, set v to w
Otherwise, set v = parent(w)
If v is the root, you must traverse from the root
to find S[j+1..i]
If not, go to s(v) and begin the search for the
remaining portion of S[j..i] from there

39
How to use suffix links - 2
40
Skip and Count Trick (1)

Problem: moving down from s(v), directly
implemented, takes time proportional to the
number of characters compared
Solution: make the running time proportional to
the number of nodes on the path searched, instead
of the number of characters

41
Skip and Count Trick (2)

After 4 node down-skips, the end of S[j..i] is
found.

42
Skip and Count Trick (3)

The node-depth of v, denoted ND(v), is the number
of nodes on the path from the root to the node v
Lemma: For any suffix link (v, s(v)) traversed in
Ukkonen's algorithm, at that moment, ND(v) ≤
ND(s(v)) + 1

43
Skip and Count Trick (4)

At the moment of traversing (v, s(v)): ND(v) ≤
ND(s(v)) + 1

44
Skip and Count Trick (5)

The current node-depth of the algorithm is the
node depth of the node most recently visited by
the algorithm
Lemma: Using the skip and count trick, any phase
of Ukkonen's algorithm takes O(m) time.
An up-walk decreases the current node-depth by ≤ 1
Suffix link traversal: same as up-walk
In total, the current node-depth is decreased by ≤
2m.
No node has depth > m.
The total possible increment to the current
node-depth is ≤ 3m.

45
Edge Label Representation

Potential Problem
The edge labels may require Ω(m²) space in total
Thus, the time for the algorithm is at least as
large as the size of its output
Example
S = abcdefghijklmnopqrstuvwxyz
Total length is Σ_{j<m+1} j = m(m+1)/2
A similar problem can happen when the length of the
string is arbitrarily larger than the alphabet
size
Solution
Label each edge with a pair of indices indicating
the beginning and end positions of the substring
in S
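
A quick calculation illustrates the saving. The sketch assumes 1-based inclusive index pairs, as on the slides, and uses function names invented for this example.

```python
def explicit_label_length(s: str) -> int:
    """Total label length when all characters of s are distinct.

    In that case the root has one leaf edge per suffix, and each
    edge spells out its whole suffix.
    """
    return sum(len(s) - start for start in range(len(s)))


def edge_label(s: str, start: int, end: int) -> str:
    """Recover the substring named by the index pair (start, end)."""
    return s[start - 1:end]


alphabet = "abcdefghijklmnopqrstuvwxyz"
print(explicit_label_length(alphabet))    # m(m+1)/2 characters for m = 26
print(2 * len(alphabet))                  # two integers for each of the m edges
print(edge_label("xabxa$", 3, 6))
```

With index pairs every edge costs two integers regardless of the label length, and since a tree with m leaves and branching internal nodes has fewer than 2m edges, the total space is O(m).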

46
Modified Extension Rules

Rule 2: new leaf edge (phase i+1)
create edge (i+1, i+1)
split edge (p, q) => (p, w) and (w+1, q)
Rule 1: leaf edge extension
label had to be (p, i) before extension,
given rule 2 above and an induction argument
(p, q) => (p, q+1)
Rule 3:
Do nothing

47
Full edge label representation

String S = xabxa$

[Figure: suffix tree for xabxa$ with the full substring written on every edge; leaves numbered 1 to 6.]
48
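Edge-label Compression

String S = xabxa$

[Figure: the same tree with every edge label replaced by an index pair: xa becomes (1,2) or (4,5)?, a becomes (2,2), each bxa$ becomes (3,6), and each $ becomes (6,6); leaves numbered 1 to 6.]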
49
A Stopper

In any phase, if suffix extension rule 3 applies
in extension j, it will also apply in all
extensions k, where k > j, until the end of the
phase.
The extensions in phase i+1 that are done after
the first execution of rule 3 are said to be done
implicitly. This is in contrast to any extension
j where the end of S[j..i] is explicitly found.
An extension of that kind is called an explicit
extension.
Hence, we can end any phase i+1 as soon as
extension rule 3 applies for the first time.

50
Once a leaf, always a leaf (1)

If at some point in UA a leaf is created and
labeled j (for the suffix starting at position j
of S), then that leaf will remain a leaf in all
successive trees created during the algorithm.
In any phase i, there is an initial sequence of
consecutive extensions (starting with extension
1) in which only rule 1 or 2 applies. Let
j_i be the last extension in this sequence.
Note that j_i ≤ j_{i+1}.

51
Once a leaf, always a leaf (2)

Implicit extensions: for extensions 1 to j_i,
write (p, e) on the leaf edge, where e is a
symbol denoting the current end and is set to
i+1 once at the beginning of the phase. In later
phases, we will not need to explicitly extend this
leaf but rather can implicitly extend it by
incrementing e once in its global location.
Explicit extensions: from extension j_i + 1 until
the first rule 3 extension is found (or until
extension i+1 is done)

52
Once a leaf, always a leaf (3): Single phase
algorithm

j* is the last explicit extension computed in
phase i+1
Phase i+1:
Increment e to i+1 (implicitly extending all
existing leaves)
Explicitly compute successive extensions starting
at j_i + 1 and continuing until reaching the first
extension j* where rule 3 applies or no more
extensions are needed
Set j_{i+1} to j* - 1, to prepare for the next phase
Observation:
Phases i and i+1 share at most 1 explicit extension

53
Example: S = axaxbb$ (1)

e = 1, a
j_1 = 1

e = 2, ax
S[1..2]: skip
S[2..2]: rule 2, create (2, e)
j_2 = 2

e = 3, axa
S[1..3] .. S[2..3]: skip
S[3..3]: rule 3
j_3 = 2

54
Example: S = axaxbb$ (2)

e = 4, axax
S[1..4] .. S[2..4]: skip
S[3..4]: rule 3
S[4..4]: auto skip
j_4 = 2

e = 5, axaxb
S[1..5] .. S[2..5]: skip
S[3..5]: rule 2, split (1, e) => (1, 2) and (3, e),
create (5, e)
S[4..5]: rule 2, split (2, e) => (2, 2) and (3, e),
create (5, e)
S[5..5]: rule 2, create (5, e)
j_5 = 5

55
Example: S = axaxbb$ (3)

e = 6, axaxbb
S[1..6] .. S[5..6]: skip
S[6..6]: rule 3
j_6 = 5

e = 7, axaxbb$
S[1..7] .. S[5..7]: skip
S[6..7]: rule 2, split (5, e) => (5, 5) and (6, e),
create (7, e)
S[7..7]: rule 2, create (7, e)
j_7 = 7
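
The j values of this example can be recomputed by brute force. The sketch relies on one observation: rule 3 applies to extension j of a phase exactly when S[j..i+1] already occurs as a substring of S[1..i]. The function name is an assumption, and the quadratic substring tests only replay the bookkeeping; they are not part of the linear-time algorithm.

```python
def last_explicit_extensions(s: str) -> list:
    """For every phase, the last extension handled by rule 1 or rule 2."""
    j_values = []
    for i in range(1, len(s) + 1):       # this phase builds the tree for s[:i]
        earlier = s[:i - 1]              # text covered by the previous tree
        first_rule3 = next(
            (j for j in range(1, i + 1) if s[j - 1:i] in earlier),
            i + 1,                       # rule 3 never applies in this phase
        )
        j_values.append(first_rule3 - 1)
    return j_values


print(last_explicit_extensions("axaxbb$"))
```

The values never decrease and end at m, which is what bounds the number of explicit extensions over all phases.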

56
Complexity of UA

Since the cost of all the implicit extensions in
any phase is constant, their total cost is O(m).
In total, at most 2m explicit extensions are
executed.
The maximum number of down-walking skips is O(m).
Time complexity of Ukkonen's algorithm: O(m)

57
Finishing up

Convert the final implicit suffix tree to a true
suffix tree
Add $ using just one more phase of execution
Now all suffixes will end at leaves
Replace e in every leaf edge with m
This just requires a traversal of the tree, which
takes O(m) time

58
Implementation Issues (1)

When the size of the alphabet grows
For large trees, suffix links allow an algorithm
to move quickly from one part of the tree to a
distant part of the tree. This is great for
worst-case time bounds, but it's horrible if the
tree isn't entirely in memory.
Thus, implementing STs to reduce practical space
use can be a serious concern.
The main design issues are how to represent and
search the branches out of the nodes of the tree.
A practical design must balance the
constraints of space against the need for speed

59
Implementation Issues (2)

There are four basic choices to represent
branches:
An array of size Θ(|Σ|) at each non-leaf node v
A linked list at node v of the characters that appear
at the beginning of the edge-labels out of v.
If it is kept in sorted order, it reduces the
average time to search for a given character
In the worst case, it adds time |Σ| to every node
operation. If the number of children of v is
large, then little space is saved over the array
while noticeably degrading performance
A balanced tree implementing the list at node v
Additions and searches take O(log k) time and O(k)
space, where k is the number of children of v.
This alternative makes sense only when k is
fairly large.
A hashing scheme. The challenge is to find a
scheme balancing space with speed. For large
trees and alphabets, hashing is very attractive, at
least for some of the nodes

60
Implementation Issues (3)

When m and |Σ| are large enough, the best design is
probably a mixture of the above choices.
Nodes near the root of the tree tend to have the
most children, so arrays are a sensible choice at
those nodes.
For nodes in the middle of a suffix tree, hashing
or balanced trees may be the best choice.
Sometimes the alphabet size is explicitly
presented in the time and space bounds:
Construction time is O(m log |Σ|), using Θ(m|Σ|)
space, where m is the size of the string.

Applications of Suffix Trees

62
Applications of Suffix Trees

Exact string matching
Substring problem for a database
Longest common substring
Suffix arrays

63
Exact String Matching (1)

Exact matching problem: given a pattern P of
length n and a text T of length m, find all
occurrences of P in T in O(n + m) time.
Overview of the ST approach:
Build a ST for text T in O(m) time
Match the characters of P along the unique path
in the ST until either P is exhausted or no more
matches are possible
In the latter case, P doesn't appear anywhere in
T
In the former case, every leaf in the subtree
below the point of the last match is numbered
with a starting location of P in T, and every
starting location of P in T numbers such a leaf
The ST approach spends O(m) preprocessing time and
then O(n + k) search time, where k is the number of
occurrences of P in T

64
Exact String Matching (2)

When search terminates at an element node, P
appears exactly once in the source string T.
When the search for P terminates at a branch
node, each element node in the subtree rooted at
this branch node gives a different occurrence of
P.
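
The search procedure can be sketched on an uncompressed suffix trie whose leaves carry the suffix numbers. The trie takes quadratic space, so it only stands in for the linear-size suffix tree; the key "leaf", the terminal character $ and the function names are assumptions of this example, and the pattern is assumed to be nonempty and free of $.

```python
def build_suffix_trie(text: str) -> dict:
    """Trie of all suffixes of text + '$'; leaves store 1-based start positions."""
    terminated = text + "$"
    root = {}
    for start in range(len(terminated)):
        node = root
        for ch in terminated[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1         # multi-character key, never a text symbol
    return root


def leaves_below(node: dict) -> list:
    """Collect the suffix numbers of all leaves in the subtree of node."""
    found = []
    for key, child in node.items():
        if key == "leaf":
            found.append(child)
        else:
            found.extend(leaves_below(child))
    return found


def find_occurrences(trie: dict, pattern: str) -> list:
    """Match pattern from the root, then report every leaf below the end point."""
    node = trie
    for ch in pattern:
        node = node.get(ch)
        if node is None:
            return []                    # mismatch: pattern does not occur
    return sorted(leaves_below(node))


trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"), find_occurrences(trie, "ac"))
```

A search that stops above a single leaf reports one occurrence; a search that stops at a branching point reports one occurrence per leaf below it.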

65
Substring Problem for a Database

Input: a database D (a set of strings) and a
string S
Output: all the strings in D containing S as
a substring
Usage: determining the identity of a person
Exact string matching methods cannot work
Suffix tree:
D is stored in O(m) space, where m is the total
length of all the strings in D.
The suffix tree is built in O(m) time
Each lookup of S (the length of S is n) costs
O(n) time

66
Longest common substring (1)

Input: two strings S1 and S2
Output: the longest substring S common to S1
and S2
Example:
S1 = common-substring
S2 = common-subsequence
Then, S = common-subs

67
Longest common substring (2)

Build a suffix tree for S1 and S2
Each leaf represents either a suffix from one of
S1 and S2, or a suffix from both S1 and S2
Mark each internal node v with a 1 (2) if there
is a leaf in the subtree of v representing a
suffix from S1(S2)
The path-label of any internal node marked both 1
and 2 is a substring common to both S1 and S2,
and the longest such string is the longest common
substring.
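
The marking idea can be sketched with a trie that holds the suffixes of both strings. Each node records which string contributed a suffix passing through it, which is equivalent to having a leaf of that string in its subtree. The trie is uncompressed and quadratic in size, so this is an illustration under that assumption rather than the linear-time method; the function name is hypothetical.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Deepest trie node marked by both strings."""
    root = {"children": {}, "marks": set()}
    for mark, s in ((1, s1), (2, s2)):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "marks": set()})
                node["marks"].add(mark)  # a suffix of string 'mark' passes here
    best = ""
    stack = [(root, "")]
    while stack:                         # visit only nodes marked both 1 and 2
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if child["marks"] == {1, 2}:
                stack.append((child, label + ch))
    return best


print(longest_common_substring("common-substring", "common-subsequence"))
```

If several common substrings share the maximum length, this sketch returns one of them.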

68
Longest common substring (3)

S1 = xabxac, S2 = abx, S = abx

69
Suffix Arrays (1)

A suffix array of an m-character string S is an
array of integers in the range 1 to m, specifying
the lexicographic order of the m suffixes of S.
Example: S = xabxac

70
Suffix Arrays (2)

A suffix array of S can be obtained from the
suffix tree T of S by performing a lexical
depth-first search of T

71
Suffix Arrays: Exact string matching

Given two strings T and P, where |T| = m and |P| =
n, find all the occurrences of P in T.
Using binary search on the suffix array of T,
all the occurrences of P in T can be found in
O(n log m) time.
Example: let T = xabxac and P = ac
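
A minimal sketch of this search follows. The suffix array is built here by plain sorting, which is an assumption made for brevity and is not linear time; the slides obtain it from the suffix tree instead. The function names are illustrative.

```python
def suffix_array(s: str) -> list:
    """1-based start positions of the suffixes of s in lexicographic order."""
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])


def find_with_suffix_array(text: str, pattern: str) -> list:
    """All start positions of pattern, found by two binary searches."""
    sa = suffix_array(text)
    n = len(pattern)

    def first_index(strict: bool) -> int:
        # first suffix whose length-n head is >= pattern (or > pattern if strict)
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            head = text[sa[mid] - 1:sa[mid] - 1 + n]
            if head < pattern or (strict and head == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    return sorted(sa[first_index(False):first_index(True)])


print(suffix_array("xabxac"))
print(find_with_suffix_array("xabxac", "ac"))
```

Each of the O(log m) steps compares at most n characters, which gives the O(n log m) bound. The suffixes that start with P occupy one contiguous block of the array, so two boundary searches are enough.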

