Suffix Tree - PowerPoint PPT Presentation

Slides: 41
Provided by: ocl4
1
Suffix Tree
2
Outline
  • Application
  • Characteristics
  • McCreight's algorithm (mcc)
  • Definition
  • Overview
  • Data Structure
  • Constructing Suffix Links

3
Application
  • Text editors: find, replace
  • Bioinformatics applications
  • Automatic command completion
  • Data compression: LZSS (PKZIP)

4
Characteristics
  • Given a string S, we build an index to S in the
    form of a search tree T.
  • Each path from the root of this tree to a leaf
    represents a different suffix.
  • An edge is labeled with a string.
  • The concatenation of the labels along a path
    from the root to a leaf gives us a suffix.
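The path property above can be checked with a minimal sketch. It builds an uncompacted suffix trie (one character per edge) rather than the suffix tree itself; a suffix tree would merge each chain of single-child nodes into one string-labeled edge. The function names are illustrative and do not come from the slides.

```python
def build_suffix_trie(s):
    """Insert every suffix of s into a nested-dict trie, one character per edge."""
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root


def root_to_leaf_paths(node, prefix=""):
    """Concatenate the edge labels along every root-to-leaf path."""
    if not node:
        return [prefix]
    paths = []
    for ch in sorted(node):
        paths.extend(root_to_leaf_paths(node[ch], prefix + ch))
    return paths


def contains(root, pattern):
    """A pattern occurs in s exactly when it can be spelled from the root."""
    node = root
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababc")
print(root_to_leaf_paths(trie))  # ['ababc', 'abc', 'babc', 'bc', 'c']
print(contains(trie, "bab"), contains(trie, "ca"))  # True False
```

Every root-to-leaf path is a suffix here because the final character c occurs only once in ababc.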

5
Example
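  • S = ababc

[Figure: suffix tree of ababc. The root has outgoing edges ab, b and c; the nodes ab and b each have child edges abc and c.]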
6
McCreight's algorithm (mcc)
  • Requirement
  • No suffix of S is a prefix of a different suffix
    of S.
  • Then there is a leaf for each suffix of S.
  • (e.g. aba violates S1, but aba with a unique end
    marker appended is OK!)

S1. The final character of the string S
should not appear elsewhere in S.
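As a quick illustration, the requirement and the sufficient condition S1 can both be tested by brute force. This is only a sketch: the function names are made up for the example, and the end marker `$` is an assumption (any character that does not occur in the string would do).

```python
def some_suffix_is_prefix_of_another(s):
    """True if a suffix of s is a prefix of a different suffix of s."""
    suffixes = [s[i:] for i in range(len(s))]
    return any(a != b and b.startswith(a) for a in suffixes for b in suffixes)


def satisfies_s1(s):
    """S1: the final character of s does not appear elsewhere in s."""
    return len(s) > 0 and s[-1] not in s[:-1]


for text in ["aba", "aba$", "ababc"]:
    print(text, satisfies_s1(text), some_suffix_is_prefix_of_another(text))
# aba    False True   (the suffix a is a prefix of the suffix aba)
# aba$   True  False
# ababc  True  False
```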
7
McCreight's algorithm (mcc)
  • Constraints on the Tree

T1. An edge of T may represent any
nonempty substring of S.
T2. Each internal node of T, except the root,
must have at least two outgoing edges.
T3. Sibling edges represent substrings with
different starting characters.
8
Algorithm mcc: Definitions
  • Σ: the alphabet
  • a, b, c, d: characters in Σ
  • p, q, s, t, u, v, w, y, z: strings
  • If t = uvw for some strings u, v, and w, then u
    is a prefix of t, v is a t-word, and w is
    a suffix of t.

9
Algorithm mcc: Definitions
  • Path notation
  • By path(k) we denote the concatenation of the
    edge labels on the path from the root of T to the
    node k.
  • By T3, path labels are unique, and we can denote k
    by w if and only if path(k) = w.

Remark: w is underlined
10
Algorithm mcc: Definitions
  • Locus notation (another notation in mcc)
  • Node k is called the locus of the string uv if
    the path from the root to k denotes uv.
  • Hence, the locus of uv is uv.

Remark: uv is underlined
11
Algorithm mcc: Definitions
  • Example

[Figure: a path from the root along edge u to node u, then along edge v to node k. Path notation: path(k) = uv ⇔ k = uv. Locus notation: k is the locus of uv.]
12
Algorithm mcc: Definitions
  • The Extended Locus of a string u is the locus of
    the shortest extension of u, uw (w is possibly
    empty), s.t. uw is a node in T.
  • Example
  • u = happ, w = y

[Figure: edge hap followed by edge py; the node uw at the end of py is the extended locus of u.]
13
Algorithm mcc: Definitions
  • The Contracted Locus of a string u is the locus
    of the longest prefix of u, x (x is possibly
    empty), s.t. x is a node in T.
  • Example
  • u = happ, x = hap

[Figure: edge hap followed by edge py; the node x at the end of hap is the contracted locus of u.]
14
Algorithm mcc: Definitions
  • Let S be our main string.
  • suf_i is the suffix of S beginning at the ith
    position (positions are counted from 1, so
    suf_1 = S).
  • head_i is the longest prefix of suf_i which is
    also a prefix of suf_j for some j < i.
  • tail_i is defined s.t. suf_i = head_i tail_i.

15
Algorithm mcc: Definitions
  • Example: S = ababc
  • suf_3 = abc, head_3 = ab, tail_3 = c
  • suf_4 = bc, head_4 = b, tail_4 = c
  • Constraint S1 assures that tail_i is never empty.

[Figure: S drawn twice, once with suf_3 marked and once with suf_4 marked.]
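The definitions of suf_i, head_i and tail_i can be evaluated directly by brute force. The sketch below follows the definitions literally, with positions counted from 1 as in the slides; it runs in quadratic time or worse and is meant only for checking small examples. The function names are illustrative.

```python
def common_prefix_len(a, b):
    """Length of the longest common prefix of a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def heads_and_tails(s):
    """Map each 1-based position i to the pair (head_i, tail_i)."""
    result = {}
    for i in range(1, len(s) + 1):
        suf_i = s[i - 1:]
        # head_i: longest prefix of suf_i shared with some suf_j, j < i.
        best = max(
            (common_prefix_len(suf_i, s[j - 1:]) for j in range(1, i)),
            default=0,
        )
        result[i] = (suf_i[:best], suf_i[best:])
    return result


for i, (head, tail) in heads_and_tails("ababc").items():
    print(i, repr(head), repr(tail))
# 3 'ab' 'c' and 4 'b' 'c' match the example above.
```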
16
Algorithm mcc: Overview
  • To build the suffix tree for ababc, mcc inserts
    suf_i into the tree T_{i-1} at every step i.

[Figure: the suffix inserted at each step. Step 1: ababc; Step 2: babc; Step 3: abc; Step 4: bc; Step 5: c.]
17
Algorithm mcc: Overview
  • To do this we have to insert suf_i at every step
    without duplicating its prefix in the tree, so we
    need to find its longest prefix already in the tree.
  • Its longest prefix in the tree is by definition
    head_i.
  • What we do is find the extended locus of head_i
    in T_{i-1}. Unless head_i already ends at a node,
    the incoming edge of that extended locus is split
    by a new node, which spawns a new edge labeled tail_i.
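The insertion step described above can be sketched without any of the speed-ups that mcc adds later. The version below finds head_i by walking down from the root character by character, so it takes quadratic time in the worst case. It is not McCreight's algorithm, only the naive baseline it improves on. Edge labels are stored as plain strings for readability, and the class and function names are assumptions made for this example.

```python
class Node:
    def __init__(self):
        self.children = {}  # first character -> (edge label, child Node)


def insert_suffix(root, suffix):
    """Insert one suffix; return head_i, the part that was already in the tree."""
    node, matched = root, 0
    while suffix[matched] in node.children:
        label, child = node.children[suffix[matched]]
        k = 0
        while k < len(label) and label[k] == suffix[matched + k]:
            k += 1
        if k == len(label):            # whole edge matched: keep walking down
            node, matched = child, matched + k
            continue
        # Mismatch inside the edge: split it with a new internal node.
        mid = Node()
        mid.children[label[k]] = (label[k:], child)
        node.children[label[0]] = (label[:k], mid)
        node, matched = mid, matched + k
        break
    # Spawn the new leaf edge labeled tail_i (never empty under S1).
    node.children[suffix[matched]] = (suffix[matched:], Node())
    return suffix[:matched]


def build_naive(s):
    """Insert suf_1 .. suf_n in order; s must satisfy S1."""
    assert s and s[-1] not in s[:-1], "S1: final character must be unique"
    root = Node()
    heads = [insert_suffix(root, s[i:]) for i in range(len(s))]
    return root, heads


root, heads = build_naive("ababc")
print(heads)  # ['', '', 'ab', 'b', '']
```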

18
Algorithm mcc: Overview
  • Example
  • suf_3 = abc. Since we already have the word ab in
    the tree, we need to start building our new
    suffix from there. Note that indeed ab = head_3
    and tail_3 = c.
  • Notice that head_i is the longest prefix of suf_i
    whose extended locus exists within T_{i-1}.

[Figure: the trees T_2 and T_3, with edge labels ababc, babc and c.]
19
Algorithm mcc: Overview
  • Overview of mcc's operations via the example ababc

[Figure: the trees T_0 to T_5 built in steps i = 1 to 5 for ababc, with edge labels ababc, babc, ab, b, abc and c.]
20
Algorithm mcc: Data Structure
  • For efficiency we represent each edge label by
    2 numbers denoting its starting and ending
    positions in the main string.

[Figure: S = ababc with positions 1 to 5, and its suffix tree with index-pair labels. Root: (0,0); ab = (1,2); b = (2,2); c = (5,5); abc = (3,5).]
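A small sketch of the index-pair representation, using the pairs shown on the slide (1-based, inclusive). The helper name `label` is an assumption for this example. Each edge costs two integers no matter how long its label is, which keeps the whole tree at O(n) space.

```python
S = "ababc"  # the main string from the slides


def label(edge):
    """Recover an edge label from its (start, end) pair (1-based, inclusive)."""
    start, end = edge
    return S[start - 1:end]


# Index pairs as they appear on the slide.
for edge in [(1, 2), (2, 2), (5, 5), (3, 5)]:
    print(edge, "->", label(edge))
# (1, 2) -> ab
# (2, 2) -> b
# (5, 5) -> c
# (3, 5) -> abc
```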
21
Algorithm mcc: Data Structure
  • Thus, the actual insertion of an edge into the
    tree takes O(1) time.
  • Introducing a new internal node and tail_i also
    takes O(1) time. Hence,
  • if mcc could find the extended locus of head_i in
    T_{i-1} in constant time, averaged over all steps,
    then mcc would be linear in n.
  • This is done by exploiting the following lemma.

22
Algorithm mcc: Data Structure
  • Lemma 1: If head_{i-1} = xu for some character x
    and some string u (possibly empty), then u is a
    prefix of head_i.
  • Proof. head_{i-1} = xu, hence there is a j < i s.t.
    xu is a prefix of both suf_{j-1} and suf_{i-1}.
  • (1) xu is a prefix of suf_{j-1} ⇒ u is a prefix of
    suf_j.
  • (2) xu is a prefix of suf_{i-1} ⇒ u is a prefix of
    suf_i.
  • By (1) and (2) there is some j < i such that u is a
    prefix of both suf_j and suf_i.
  • Hence, by the definition of head, u is a prefix of
    head_i.

[Figure: S = ..xu..xu.., with the two occurrences of xu starting at positions j-1 and i-1.]
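Lemma 1 can be sanity-checked by brute force on small strings. The sketch below recomputes head_i from its definition and tests the claim for every i; it is a check, not a proof, and the function names are illustrative.

```python
def head(s, i):
    """head_i by definition (1-based i): longest prefix of suf_i shared with an earlier suffix."""
    suf_i = s[i - 1:]
    best = 0
    for j in range(1, i):
        suf_j = s[j - 1:]
        k = 0
        while k < len(suf_i) and k < len(suf_j) and suf_i[k] == suf_j[k]:
            k += 1
        best = max(best, k)
    return suf_i[:best]


def lemma1_holds(s):
    """Check: whenever head_{i-1} = xu with x one character, u is a prefix of head_i."""
    for i in range(2, len(s) + 1):
        prev = head(s, i - 1)
        if prev and not head(s, i).startswith(prev[1:]):
            return False
    return True


print(head("abcdabc", 5), head("abcdabc", 6))  # abc bc
print(head("bcababc", 5), head("bcababc", 6))  # ab bc
print(lemma1_holds("abcdabc"), lemma1_holds("bcababc"))  # True True
```

In the second string head_6 = bc is strictly longer than u = b, so the lemma only gives a lower bound on head_i.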
23
Algorithm mcc: Data Structure
S = abcdabc, head_5 = abc, head_6 = bc
[Figure: S with suf_5 marked.]
S = bcababc, head_5 = ab, head_6 = bc
24
Algorithm mcc: Data Structure
  • To exploit this we introduce Suffix Links.
  • From each internal node xu, where |x| = 1, we add
    a pointer to the node u.

[Figure: nodes u and xu below the root, with a suffix link pointing from xu to u.]
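To see which suffix links a finished tree contains, one can use the fact that, under S1, the internal nodes other than the root are exactly the substrings that are followed by at least two different characters somewhere in S. The brute-force sketch below lists those strings and maps each xu to u. It is for illustration only and is unrelated to how mcc builds the links; the function names are assumptions.

```python
def internal_node_paths(s):
    """Non-empty substrings of s that are followed by >= 2 distinct characters."""
    followers = {}
    for i in range(len(s)):
        for j in range(i + 1, len(s)):
            followers.setdefault(s[i:j], set()).add(s[j])
    return {w for w, nxt in followers.items() if len(nxt) >= 2}


def suffix_links(s):
    """Map each internal node xu to u; the empty string stands for the root."""
    return {w: w[1:] for w in internal_node_paths(s)}


links = suffix_links("ababc")
print(sorted(links.items()))  # [('ab', 'b'), ('b', '')]
```

For ababc the node ab links to the node b, and b links to the root.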
25
Algorithm mcc: Data Structure
  • Example

[Figure: the suffix tree of ababc with index-pair labels, illustrating suffix links. Root: (0,0); ab = (1,2); b = (2,2); c = (5,5); abc = (3,5).]
26
Algorithm mcc: Constructing Suffix Links
  • Two properties guarantee the correctness of
    constructing suffix links.

P1: in T_i every internal node, except
perhaps the locus of head_i, has a
valid suffix link.
P2: in step i mcc visits the contracted locus
of head_i in T_{i-1}.
Both are easy to prove by mathematical
induction.
27
Algorithm mcc: Constructing Suffix Links
  • P2 yields that we can use the contracted locus of
    head_{i-1} to jump via its suffix link to some
    prefix of head_i. P1 assures us that such a
    suffix link exists.

[Figure: T_{i-1} after step i-1, with head_{i-1} = xuw. By P2 mcc visits the node xu; by P1 the suffix link from xu to u exists. Below xu, the edge w leads to the leaf edge tail_{i-1} and to an older branch y.]
28
Algorithm mcc: Constructing Suffix Links
  • Three substeps to construct a suffix link
  • Substep A, Jumping: jump with the previous
    suffix link
  • Substep B, Rescanning: construct the current
    suffix link
  • Substep C, Scanning: branch for tail_i

29
Algorithm mcc: Substep A, Jumping
  • In this substep mcc identifies strings it has
    already dealt with in previous steps.
  • Identify 3 strings x, u, w s.t.
  • head_{i-1} = xuw
  • xu is the contracted locus of head_{i-1} in T_{i-2},
    i.e. xu is a node in T_{i-2}. If the contracted
    locus of head_{i-1} in T_{i-2} is the root then
    u = ε (the empty string).
  • |x| ≤ 1. x = ε only if head_{i-1} = ε.

30
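Algorithm mcc: Substep A, Jumping
[Figure: three panels. T_{i-2}: the root, node u, and node xu, the contracted locus of xuw. T_{i-1} (step i-1): head_{i-1} = xuw, with the leaf edge tail_{i-1} and an older branch y below it. T_i (step i): the jump reaches c, the node u, and head_i = uwv.]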
31
Algorithm mcc: Substep A, Jumping
  • Notice that
  • In the previous step head_{i-1} was found.
  • Since |x| ≤ 1, by Lemma 1
  • head_i = uwv for some yet-to-be-discovered
    string v (possibly empty).
  • By P2, mcc visited xu in the previous step (i-1),
    hence it can identify xu.

32
Algorithm mcc: Substep A, Jumping
  • If u = ε then c ← root (note that root = u)
  • else c ← suffix link of xu (note that c = u)
  • Explanation
  • u ≠ ε, thus by definition xu existed (as the
    contracted locus of xuw) in T_{i-2}; hence by P1 the
    internal node xu has a suffix link.
  • By P2 we remember xu from step i-1 and we can now
    follow its suffix link.

33
Algorithm mcc: Substep B, Rescanning w
  • Find the edge that starts with the first
    character of w. Denote the edge's label z and the
    node it leads to f.
  • If |w| > |z| then start a recursive rescan of
    w - z (w with its prefix z removed) from f.
  • If |w| ≤ |z|, then w is a prefix of z, and we
    have found the extended locus of uw.
  • Construct a new node uw (if needed).
  • d ← uw

34
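Algorithm mcc: Substep B, Rescanning w
[Figure: rescanning w in step i, shown in T_{i-1} and T_i. From c (the node u) the edge z leads to f; w ends at the node d, below which v and an older branch y continue. head_{i-1} = xuw, head_i = uwv.]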
35
Algorithm mcc: Substep B, Rescanning w
  • Make the suffix link of xuw point to d.

[Figure: T_i with the suffix link from xuw to d, where d is reached from c via w and f; head_i = uwv.]
36
Algorithm mcc: Substep C, Scanning v
  • Scan the edges from d to find the extended locus
    of uwv.
  • Since we don't know what v is, we must scan each
    character on the path from d downward, comparing
    it to tail_{i-1}.
  • When we fall out of the tree we have found v.
  • The last node on this trek is the contracted
    locus of head_i in T_{i-1}.
  • When we reach the extended locus of uwv we
    construct the new node uwv, if needed.
  • Construct the new leaf edge tail_i.

37
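Algorithm mcc: Substep C, Scanning v
  • Scanning for the requested v.
  • Compare each character of the downward path
    beginning at d to tail_{i-1}. When the comparison
    fails we have reached head_i, and the unmatched
    remainder is tail_i.

[Figure: T_i with head_i = uwv. From d the string v is matched along an edge labeled vt; the edge is split after v, leaving t on the old branch and the new leaf edge tail_i.]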
38
Algorithm mcc: Maintaining T2
  • We shall prove that when we add a new node at the
    end of substep B as the locus of uw, we still obey
    constraint T2, that an internal node has at least
    2 outgoing edges.
  • Remark be careful of previous text with green.

39
Algorithm mcc: Maintaining T2
  • Lemma 2: In step i, at the end of substep B we
    add a new node only if v is empty.
  • Proof. In step i, if v is not empty then
    head_i = uwv and head_{i-1} = xuw; hence, w.l.o.g.
    we can write S as follows:
  • S = xuwz..uwv..xuwv
  • Thus, we have 2 occurrences of uw with different
    extensions, uwv and uwz, that already occur in the
    tree.
  • Hence, there is already a branching node uw.

[Figure: T_i with the node uw branching into the edges v and z.]
40
Algorithm mcc: Time Complexity
  • The total time complexity of constructing a
    suffix tree is O(n).
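Putting substeps A, B and C together gives the following minimal sketch of mcc. It is one possible reading of the slides, not a reference implementation: positions are 0-based with half-open edge labels `s[start:end]`, the root's suffix link points to itself for convenience, and the names `Node`, `split`, `mcc` and `leaf_paths` are assumptions made for this example.

```python
class Node:
    """A node plus its incoming edge, stored as the slice s[start:end]."""
    def __init__(self, start, end, parent):
        self.start, self.end, self.parent = start, end, parent
        self.children = {}   # first character of edge -> child Node
        self.link = None     # suffix link, set one step after creation
        self.depth = 0 if parent is None else parent.depth + (end - start)


def split(s, child, k):
    """Insert a new node k characters down the edge leading into child."""
    parent = child.parent
    mid = Node(child.start, child.start + k, parent)
    parent.children[s[mid.start]] = mid
    child.start += k
    child.parent = mid
    mid.children[s[child.start]] = child
    return mid


def mcc(s):
    """Build the suffix tree of s; assumes s satisfies S1."""
    assert s and s[-1] not in s[:-1], "S1: final character must be unique"
    n = len(s)
    root = Node(0, 0, None)
    root.link = root
    head = root                        # locus of head_{i-1}
    for i in range(n):                 # 0-based here; the slides count from 1
        if head.link is not None:      # w is empty: just follow the link
            d = head.link
        else:
            # Substep A: head_{i-1} = xuw and the parent of head is the node xu.
            parent, ws, we = head.parent, head.start, head.end
            if parent is root:
                ws += 1                # u is empty, so drop x from the label
            d = parent.link            # c in the slides
            # Substep B: rescan w, looking only at first characters of edges.
            while ws < we:
                child = d.children[s[ws]]
                length = child.end - child.start
                if length <= we - ws:
                    d, ws = child, ws + length
                else:
                    d, ws = split(s, child, we - ws), we
            head.link = d
        # Substep C: scan v character by character, starting below d.
        pos = i + d.depth
        while s[pos] in d.children:
            child = d.children[s[pos]]
            j = child.start
            while j < child.end and s[j] == s[pos]:
                j, pos = j + 1, pos + 1
            if j == child.end:
                d = child
            else:
                d = split(s, child, j - child.start)
                break
        head = d
        head.children[s[pos]] = Node(pos, n, head)   # new leaf edge tail_i
    return root


def leaf_paths(s, node, prefix=""):
    """Strings spelled by all root-to-leaf paths below node."""
    if not node.children:
        return [prefix]
    out = []
    for ch in sorted(node.children):
        child = node.children[ch]
        out += leaf_paths(s, child, prefix + s[child.start:child.end])
    return out


print(leaf_paths("ababc", mcc("ababc")))  # ['ababc', 'abc', 'babc', 'bc', 'c']
```

The rescan loop touches one node per iteration rather than one character, which is where the linear bound comes from. The scan loop compares characters, but each compared character advances `pos`, and `pos` never moves backward by more than the rescanned part.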