Title: Augmenting Suffix Trees, with Applications
1. Augmenting Suffix Trees, with Applications
- Yossi Matias, S. Muthukrishnan, Suleyman Cenk Sahinalp, Jacob Ziv
Presented by Genady Garber
2. Abstract
- The theory of string algorithms plays a fundamental role in
  - Information retrieval
  - Data compression
- This work considers one algorithmic problem from each area.
- The algorithms rely on augmenting the suffix tree (adding extra edges, so the resulting structures are DAGs).
- The algorithms construct these suffix DAGs and manipulate them to solve the problems.
3. Introduction
- This paper presents two algorithmic problems:
  - Data compression
  - Information retrieval
- Both algorithms rely on the suffix tree data structure.
- Suffix trees with suitably simple augmentations are very useful in string processing applications.
- In the described work, the suffix tree is augmented with extra edges and additional information.
4. Problems and Background
- The Document Listing Problem
- The HYZ Compression Problem
5. The Document Listing Problem
- Given a set of documents T = {T1, . . . , Tk}
- Given a query pattern P
- The problem:
  - Output a list of all documents containing P as a substring. (The standard problem can be solved in time proportional to the number of occurrences of P in T; the goal here is a running time that depends on the number of documents containing P.)
  - Report the number of documents containing P. (An existing algorithm solves this counting problem in O(|P|) time and is based on data structures for computing lowest common ancestors.)
- May be used in molecular biology applications (discovering gene homologies).
(A brute-force sketch of the two query types follows.)
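To make the two query types concrete, here is a minimal brute-force baseline written straight from the problem statement; it rescans every document per query, so it only illustrates the semantics, not the suffix-tree-based bounds the paper is after. The function names are ours, and the document set is the one used in the later example slides.

```
def list_query(documents, pattern):
    """1-based indices of the documents that contain `pattern` as a substring."""
    return [i for i, doc in enumerate(documents, start=1) if pattern in doc]

def count_query(documents, pattern):
    """Number of documents that contain `pattern` as a substring."""
    return len(list_query(documents, pattern))

docs = ["abcb", "abca", "abab"]              # the documents from the example slides
print(list_query(docs, "ab"))                # -> [1, 2, 3]
print(count_query(docs, "bc"))               # -> 2  (abcb and abca)
```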
6. The (a,b)-HYZ Compression Problem
- Given a binary string T of length t.
- Replace disjoint blocks of size b with (ideally shorter) codewords, in a way that still permits exact decompression.
- Compression algorithm:
  - To compute the codeword cj for block j we determine its context: the context of a block T[i..l] is the longest substring T[k..i-1], k < i, of size at most a, such that T[k..l] occurs earlier in T (see the sketch after this slide).
  - The codeword cj is the ordered pair <g, l>, where
    - g is the length of the context of block j
    - l is the rank of block j with respect to its context (according to some predetermined order, e.g. lexicographic).
- Intuition: similar symbols in data appear in similar contexts.
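As a concrete reading of the context rule above, the sketch below finds the context of one block by brute force (0-indexed, quadratic scan). The function name and parameters are ours, and the paper of course computes contexts via the augmented suffix tree rather than by rescanning T.

```
def context_of(T, i, b, a):
    """Context of the block T[i:i+b]: the longest substring T[k:i] of length
    at most a such that T[k:i+b] occurs earlier in T (starting before k)."""
    best = ""
    for g in range(1, min(a, i) + 1):
        ctx_and_block = T[i - g:i + b]
        if T.find(ctx_and_block) < i - g:      # an occurrence starting before k = i - g
            best = T[i - g:i]
    return best

print(context_of("010011010", 8, b=1, a=2))    # -> "01", the context of the last block
```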
7. The (a,b)-HYZ Compression Problem (previous results)
- Case b = O(1), a unbounded:
  - The average length of a codeword is shown to approach the conditional entropy of the block, within an additive term of c1 log H(C) + c2 for constants c1 and c2, provided that the input is generated by a limited-order Markovian source.
- Case b > log log t, a = O(log t):
  - This scheme also achieves the optimal compression (in terms of the compression ratio).
  - Applies to all ergodic sources.
8. The Problem Statement
- Consider a set of document strings T = {T1, T2, . . . , Tk} of sizes t1, t2, . . . , tk.
- The goal is to build a data structure supporting the following queries for an on-line pattern P of size p:
  - list query: the list of documents containing P
  - count query: the number of documents containing P
- Theorem 1
  - Given T and P, there is a data structure which responds to a count query in O(p) time, and to a list query in O(p log k + out) time, where out is the number of documents in T that contain P.
9. The Suffix-DAG Data Structure
- Proof sketch:
  - Build the suffix-DAG of the documents T1, . . ., Tk in O(t) = O(Σi ti) time, using O(t) space.
  - The suffix-DAG of T, denoted SD(T), contains the generalized suffix tree GST(T) of the set T at its core.
  - GST(T) is defined to be the compact trie of all the suffixes of each of the documents in T.
  - Each leaf node l in GST(T) is labeled with the list of documents which have a suffix represented by the path from the root to l.
  - The substring represented by the path from the root to a node n is denoted P(n).
10. The Suffix-DAG Data Structure (cont.)
- The nodes of SD(T) are the nodes of GST(T).
- The edges of SD(T) are of two types:
  - the skeleton edges of SD(T) are the edges of GST(T)
  - the supportive edges of SD(T) are defined as follows: for any nodes n1 and n2 in SD(T), there is a pointer edge from n1 to n2 if and only if
    - n1 is an ancestor of n2, and
    - among the suffix trees ST(T1), ST(T2), . . . , ST(Tk) there exists at least one, say ST(Ti), which has two nodes n1,i and n2,i such that
      - P(n1) = P(n1,i) and P(n2) = P(n2,i)
      - n1,i is the parent of n2,i
    - such an edge is labeled with i, for every relevant document Ti.
(A naive sketch of these edges follows.)
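As a concrete (and deliberately naive) reading of this definition, the sketch below identifies suffix-tree nodes by their path strings P(n), recomputes the explicit nodes of each ST(Ti) by brute force, and emits one supportive edge per parent/child pair, labeled with the document. The function names are ours; the real construction works in O(t) on pointers into GST(T) rather than on strings.

```
def extensions(text):
    """Map each substring of `text` to the set of characters that can follow it."""
    ext = {}
    for i in range(len(text)):
        for j in range(i, len(text)):
            ext.setdefault(text[i:j], set()).add(text[j])
    return ext

def explicit_nodes(text):
    """Path strings of the explicit nodes of the suffix tree ST(text)."""
    t = text + "$"                                   # per-document terminator
    ext = extensions(t)
    nodes = {""}                                     # the root
    nodes.update(t[i:] for i in range(len(t)))       # the leaves (terminated suffixes)
    nodes.update(s for s, cs in ext.items() if len(cs) >= 2)   # branching nodes
    return nodes

def supportive_edges(docs):
    """(P(n1), P(n2), label) triples: n1, n2 match a parent/child pair in ST(T_label)."""
    edges = []
    for label, text in enumerate(docs, start=1):
        nodes = explicit_nodes(text)
        for s in nodes:
            if s:
                # parent of s in ST(T_label): its longest proper prefix that is explicit
                parent = max((p for p in nodes if p != s and s.startswith(p)), key=len)
                edges.append((parent, s, label))
    return edges

for edge in sorted(supportive_edges(["abcb", "abca", "abab"])):
    print(edge)
```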
11. The Suffix-DAG Data Structure (cont. 2)
- In order to respond to the count and list queries, one of the standard data structures supporting lowest common ancestor (LCA) queries on SD(T) in O(1) time is built.
- In addition, each internal node n of SD(T) stores
  - an array holding its supportive edges in pre-order fashion (ordered by the pre-order numbers of their endpoints)
  - the number of documents which include P(n) as a substring.
12. Example
- The independent suffix trees
- T1 = abcb, T2 = abca, T3 = abab
13. Example (cont.)
- The generalized suffix tree of the set of documents T1, T2, T3.
  [Figure: the generalized suffix tree, labeled SK(T).]
14. Example (cont. 2)
- The suffix-DAG of the set of documents.
  [Figure: SD(T) with its supportive edges; the numbers are the document labels on those edges.]
15. Example (cont. 3)
- The suffix-DAG of the set of documents.
  [Figure: SD(T) with a legend (skeleton edges vs. supportive edges); each node is annotated with the number of documents in its subtree and with its pointer (supportive-edge) array.]
16. Lemmas and Proofs
- Lemma 1
  - The suffix-DAG is sufficient to respond to count queries in O(p) time, and to list queries in O(p log k + out) time.
- Proof sketch
  - To respond to a count query:
    - with P, trace down GST(T) until the highest-level node n is reached for which P is a prefix of P(n)
    - return the number of documents that contain P(n) as a substring (stored at n)
  - To respond to a list query:
    - locate the node n in SD(T) (as defined above) and traverse SD(T) backwards from n to the root
    - at each node u on this path, determine all supportive edges out of u whose endpoints lie in the subtree rooted at n.
(A naive sketch of both queries follows.)
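A minimal end-to-end sketch of both queries on a naive, uncompacted generalized suffix trie: every node simply stores its document set, so the count is the size of that set and the list is the set itself. This mirrors the procedure above but ignores the space-saving supportive-edge machinery; all names are ours.

```
def build_trie(docs):
    root = {"children": {}, "docs": set()}
    for label, text in enumerate(docs, start=1):
        text += "$"
        for start in range(len(text)):
            node = root
            node["docs"].add(label)
            for ch in text[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "docs": set()})
                node["docs"].add(label)
    return root

def locate(root, pattern):
    """Walk P down the trie (the analogue of the highest node n with P a prefix of P(n))."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return None
    return node

def count_query(root, pattern):
    node = locate(root, pattern)
    return len(node["docs"]) if node else 0

def list_query(root, pattern):
    node = locate(root, pattern)
    return sorted(node["docs"]) if node else []

root = build_trie(["abcb", "abca", "abab"])
print(count_query(root, "ab"), list_query(root, "bc"))   # -> 3 [1, 2]
```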
17. Lemmas and Proofs (cont.)
- Complexity of list queries
  - The key observation is that, within the pre-order-sorted array of u, all corresponding edges form a consecutive segment.
  - The segment may be identified with two binary searches (each comparison is an LCA query and takes O(1) time); see the sketch below.
  - The maximum size of the array of supportive edges at any node is at most k|Σ|, where |Σ| = O(1) is the alphabet size, so this procedure takes O(log k) time at each node u.
  - The output over all such segments may contain duplicates, but the total size of the output is O(out·|Σ|) = O(out), where out is the number of documents containing P.
18. Lemmas and Proofs (cont. 2)
- Lemma 2
  - The suffix-DAG of the document set T can be constructed in O(t) time and O(t) space.
- Proof sketch
  - The construction of GST(T) with all suffix links, and of the LCA data structure, is standard.
  - To complete SD(T) it remains
    - to construct the supportive edges
    - to build the supportive-edge arrays
    - to explain how the number of documents that include P(n) is computed.
19. Lemmas and Proofs (cont. 3)
- Proof sketch (cont.)
  - The supportive edges with label i can be built by emulating the construction of ST(Ti) (for each node in ST(Ti) there is a corresponding node in SD(T), with the appropriate suffix link).
  - Building the supportive edges:
    - if there is a supportive edge between nodes n1 and n2, then either there is a supportive edge between n1' and n2' (where ni' is the node reached from ni by its suffix link), or there exists an intermediate node to which there is a supportive edge from n1' (respectively n2')
    - the time to compute such intermediate nodes is proportional to the length of the string between nodes n1 and n2.
  - To compute the number of documents c(n) containing the substring P(n), we need
    - the number of supportive edges from n to its descendants, denoted out(n)
    - the number of supportive edges to n from its ancestors, denoted in(n).
20. Lemmas and Proofs (cont. 4)
- Lemma 3
  - For any node n, c(n) = Σ_{n' ∈ children of n} c(n') - out(n) + in(n).
- Proof
  - If a document Ti includes the substrings of more than one descendant of n (i.e., occurs in more than one child subtree), then there must exist a node in ST(Ti) whose substring is identical to that of n, so the overcount can be read off the supportive edges at n.
  - For any two supportive edges from n to n1 and n2, the path from n to n1 and the path from n to n2 do not have any common edges.
21. The Compression Algorithm
- Compression scheme Ca,b, terms:
  - the input is a string T of size t
  - binary alphabet
- The scheme:
  - partition T into contiguous substrings (blocks) of size b
  - replace each block by a corresponding codeword (a function of its context)
  - the context of a block T[i..l] is the longest substring T[k..i-1], k < i, for which T[k..l] occurs earlier in T
  - if the context exceeds a characters, it is truncated to its a rightmost characters
  - the codeword cj is the ordered pair <g, l>, where
    - g is the context size
    - l is the lexicographic order of block j among all possible substrings of size b immediately following earlier occurrences of the context of block j.
22. Compression Example
- T = 010011010
- a = 2, b = 1
- the context of block 9 is T[7..8] = 01
- the two substrings which follow earlier occurrences of this context are T[3] = 0 and T[6] = 1
- the lexicographic order of block 9 among these substrings is 1, so its codeword is <2, 1> (a sketch reproducing this follows).
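To reproduce this example in code: given the context of block 9 (T[7..8] = 01, found as in the earlier context sketch), collect the size-b substrings that follow earlier occurrences of that context and take the block's 1-based lexicographic rank among them. This is a quadratic illustration with our own function names, and it glosses over corner cases (e.g. an empty context) that the full scheme must handle.

```
def codeword(T, i, b, context):
    """Codeword <g, l> of the block T[i:i+b] (0-indexed), given its context."""
    block = T[i:i + b]
    followers, start = set(), 0
    while True:
        pos = T.find(context, start)
        if pos == -1 or pos >= i - len(context):   # keep only occurrences earlier than the current one
            break
        followers.add(T[pos + len(context):pos + len(context) + b])
        start = pos + 1
    g = len(context)
    l = sorted(followers).index(block) + 1         # 1-based lexicographic rank
    return g, l

T = "010011010"
print(codeword(T, 8, b=1, context="01"))           # block 9 -> (2, 1)
```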
23. The Compression Algorithm (cont.)
- Theorem 2
  - There is an algorithm to implement the compression scheme Ca,b which runs in O(tb) time and requires O(tb) space, independent of a.
- Proof
  - While building the suffix tree of T, we augment it as follows:
    - for each node v we store an array of size b in which, for each i = 1, . . ., b, we store the number of distinct paths rooted at v of precisely i characters minus the number of such distinct paths of precisely i-1 characters (this number may be negative).
(A naive sketch of these stored values follows.)
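The sketch below spells out, for one node v (identified by its path string), what those array entries are: the number of distinct length-i continuations of v minus the number of distinct length-(i-1) continuations, computed here by a quadratic rescan of T rather than by the incremental updates of Lemma 4. Terminator edges are ignored and the function names are ours.

```
def distinct_extensions(T, path, i):
    """Distinct strings of length exactly i that follow occurrences of `path` in T,
    i.e. the distinct length-i paths hanging below the node v with P(v) = path."""
    exts, start = set(), 0
    while True:
        pos = T.find(path, start)
        if pos == -1:
            break
        if pos + len(path) + i <= len(T):
            exts.add(T[pos + len(path):pos + len(path) + i])
        start = pos + 1
    return len(exts)

def node_array(T, path, b):
    """Entry i is (#distinct paths of exactly i chars) - (#of exactly i-1 chars)."""
    return [distinct_extensions(T, path, i) - distinct_extensions(T, path, i - 1)
            for i in range(1, b + 1)]

print(node_array("010011010", "01", b=3))   # -> [1, 0, 0] for the node spelling "01"
```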
24. The Compression Algorithm (cont. 2)
- Lemma 4
  - There is an algorithm to construct the augmented suffix tree of T in O(tb) time.
- Proof
  - While inserting a new node v into the suffix tree, update the subtree information of the ancestors of v which are at most b characters higher than v.
  - The number of such ancestors is at most b.
  - At most one of the b fields of information at any such ancestor of v needs to be updated.
25. The Compression Algorithm (cont. 3)
- Lemma 5
  - The augmented suffix tree is sufficient to compute the codeword for each block of the input T in amortized O(b^2) time (so that the t/b blocks together stay within the O(tb) bound of Theorem 2).
- Proof sketch
  - The computation of g is performed by locating the node in the suffix tree which represents the longest prefix of the context; this can be achieved by following suffix links in amortized O(b) time.
  - The computation of l:
    - traverse the path between v and w, where v represents the longest prefix of the context of block j, and w, a descendant of v, represents the longest prefix of the substring formed by concatenating the context of block j with block j itself
    - during the traversal, use the stored arrays to compute the sizes of the relevant subtrees that are lexicographically smaller/greater than the substring represented by this path.
26. The Compression Algorithm (cont. 4)
- Theorem 3
  - There is an algorithm to implement the compression method Ca,b for a = O(log t) and b = O(log log t) in O(t) time using O(t) space.
- Proof sketch
  - For the descendants wi of v which are b characters apart from v, the data structure makes it possible to compute the lexicographic order of the path between any wi and v, allowing easy computation of the codeword of the size-b block represented by the path between v and wi.
  - The algorithm exploits the fact that the context size is bounded, and it seeks similarities between suffixes of the input only up to a small length.
27. The Compression Algorithm (cont. 5)
- Lemma 6
  - The augmented limited suffix tree of input T is sufficient to compute the codeword for any input block j in O(b) time.
- Proof sketch
  - Given a block j of the input:
    - v represents its context
    - w (a descendant of v) represents the substring context(j) · block(j)
  - Since b = log log t, the maximum number of elements in the search data structure of v is 2^b = O(log t).
  - There is a simple data structure that maintains k elements and computes the rank of any given element in O(log k) time (see the sketch below).
  - Hence the lexicographic order of node w can be computed in only O(log log t) = O(b) time.
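The "simple data structure" invoked here can be read, for the purposes of this sketch, as a sorted list queried with binary search: with at most O(log t) candidate blocks per node, a rank query costs O(log log t) = O(b) comparisons. This is an illustrative stand-in (our naming), not necessarily the structure used in the paper, and its insertions are not the O(1) claimed for the real one.

```
import bisect

class RankStructure:
    """Keeps elements sorted; a rank query takes O(log k) comparisons for k elements."""
    def __init__(self):
        self.items = []

    def insert(self, x):
        bisect.insort(self.items, x)

    def rank(self, x):
        return bisect.bisect_left(self.items, x) + 1   # 1-based lexicographic rank

rs = RankStructure()
for block in ["1", "0"]:        # the candidate blocks from the slide-22 example
    rs.insert(block)
print(rs.rank("0"))             # -> 1, matching the codeword <2, 1>
```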
28. The Compression Algorithm (cont. 6)
- Lemma 7
  - The augmented limited suffix tree of input T can be built and maintained in O(t) time.
- Proof sketch
  - The depth of the augmented limited suffix tree is bounded by log t, so (over a binary alphabet) the total number of nodes in the tree is only O(t); this allows adapting the suffix tree construction to run in O(t) time, without being penalized for building a suffix trie rather than a suffix tree.
  - Because each node v is inserted into the search data structure of at most one of its ancestors, the total number of elements maintained by all search data structures is O(t), so it is possible to construct and maintain the search data structures of all nodes in O(t) time.
  - The insertion time of an element into a search data structure is O(1).
  - As the total number of nodes to be inserted into the search data structures is bounded by O(t), the total time for inserting nodes into the search data structures is O(t).
29. Results and conclusions
- Document listing problem
  - Preprocessing of the k documents in linear time (O(Σi ti)) and space.
  - Time to answer a list query with pattern P is O(|P| log k + out).
  - The previously fastest known algorithm runs in time proportional to the number of occurrences of the pattern in all the documents.
30. Results and conclusions (cont.)
- (a,b)-HYZ compression problem
  - Unbounded a: complexity O(tb), which gives linear time for b = O(1).
    - The only previously known algorithm, for b = O(1), takes O(ta) time; for unbounded a (a can be as large as t) this running time can be as large as O(t^2).
  - a = O(log t), b = log log t: complexity O(t).
    - No algorithm was previously known for this case.