Title: Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries
1Tree Inclusion, Signatures, and Evaluation of Path-Oriented Queries
Dr. Yangjun Chen
Dept. of Applied Computer Science, University of Winnipeg, Canada
- Motivation
- Path-Oriented Queries and Tree Inclusion Problem
- Evaluation of Path-Oriented Queries
- - Top-down Algorithm for Tree Inclusion
- - Integration of Signatures into Top-down Tree Inclusion
- Experiment Results
- Summary and Future Work
2Motivation
- Local Information Resource Management: document databases
- Internet: Distributed Document Databases
- Document Databases
- - Storage of documents in relational databases
- non-structured data, semi-structured data
- - Evaluation of path-oriented queries in document databases
- path-oriented languages: XQL, XPath, and XML-QL
- Query evaluation methods
- inverse-file based
- signature based
- string-matching based: suffix trees, Pat-trees
- tree-inclusion based
- Integrating signatures into top-down tree inclusion algorithm
3Path-Oriented Queries and Tree Inclusion Problem
- XML Documents and Path-Oriented Queries
4Path-Oriented Queries and Tree Inclusion Problem
- Tree Inclusion Problem
- Definition (tree embedding): Let T and P be two labeled trees. A mapping M from the nodes of P to the nodes of T is an embedding of P into T if it preserves labels and ancestorship. That is, for all nodes u and v of P, we require that
- a) M(u) = M(v) if and only if u = v,
- b) label(u) = label(M(u)),
- c) u is an ancestor of v in P if and only if M(u) is an ancestor of M(v) in T, and
- d) v is to the left of u iff M(v) is to the left of M(u).
- An embedding is root preserving if M(root(P)) = root(T). It can be shown that restricting to root-preserving embeddings does not lose generality.
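The four conditions of the definition can be checked mechanically for a given candidate mapping. The following is a minimal sketch in Python, not part of the original slides: the Node class, the preorder/postorder ranking, and the function names are assumptions made for illustration. It uses the standard fact that u is an ancestor of v exactly when u comes first in preorder and last in postorder, and that u is to the left of v exactly when u comes first in both.

```python
from itertools import permutations


class Node:
    """Hypothetical labeled ordered tree node (illustration only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def number(root):
    """Return {node: (preorder rank, postorder rank)} for the tree at root."""
    ranks = {}
    counters = [0, 0]

    def visit(node):
        pre = counters[0]
        counters[0] += 1
        for child in node.children:
            visit(child)
        ranks[node] = (pre, counters[1])
        counters[1] += 1

    visit(root)
    return ranks


def is_ancestor(ranks, u, v):
    return ranks[u][0] < ranks[v][0] and ranks[u][1] > ranks[v][1]


def is_left_of(ranks, u, v):
    return ranks[u][0] < ranks[v][0] and ranks[u][1] < ranks[v][1]


def is_embedding(P, T, M):
    """Check conditions a) to d) for a mapping M (dict: node of P -> node of T)."""
    rp, rt = number(P), number(T)
    if set(M) != set(rp) or any(M[u] not in rt for u in M):
        return False
    if any(u.label != M[u].label for u in M):            # condition b)
        return False
    for u, v in permutations(M, 2):
        if M[u] is M[v]:                                  # condition a)
            return False
        if is_ancestor(rp, u, v) != is_ancestor(rt, M[u], M[v]):   # condition c)
            return False
        if is_left_of(rp, u, v) != is_left_of(rt, M[u], M[v]):     # condition d)
            return False
    return True


# Small hypothetical example: T = a(b, c(d)), P = a(d).
t_d = Node("d")
t_b = Node("b")
T = Node("a", [t_b, Node("c", [t_d])])
p_d = Node("d")
P = Node("a", [p_d])
good = is_embedding(P, T, {P: T, p_d: t_d})
bad = is_embedding(P, T, {P: T, p_d: t_b})
```

This only verifies a mapping that is already given; finding one is the tree inclusion problem itself.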
5Path-Oriented Queries and Tree Inclusion Problem
Example (figure): a target tree T and a pattern tree P, both rooted at Hotel-room-reservation. T describes a reservation at Travel-lodge (location: Winnipeg, Manitoba, Canada; address: 515 Portage Ave., post code R3B 2E9; type: one-bed-room rooms, price 119.00; reservation from April 20, 2005 to April 28, 2005). P contains name (?x) and location (City-or-district: Winnipeg; address: number 515, street Portage Ave.).
6Path-Oriented Queries and Tree Inclusion Problem
- - Algorithms for Tree Inclusion Problem
- Bottom-up algorithm
- Kilpeläinen-Mannila's Algorithm (Pekka Kilpeläinen and Heikki Mannila, Ordered and unordered tree inclusion, SIAM Journal on Computing, 24:340-356, 1995.)
- O(|T|·|P|) time
- O(|T|·|P|) space
- Chen's Algorithm (W. Chen, More efficient algorithm for ordered tree inclusion, Journal of Algorithms, 26:370-385, 1998.)
- O(|T|·|leaves(P)|) time
- O(|leaves(P)|·min{height(P), |leaves(T)|}) space
7Path-Oriented Queries and Tree Inclusion Problem
- - Algorithms for Tree Inclusion Problem
- Top-down algorithms
- (Y. Chen and Y.B. Chen, An Efficient Top-down Algorithm for Tree Inclusion, in Proc. of 18th Intl. Symposium on High Performance Computing Systems and Applications, Winnipeg, Canada, IEEE, May 2004, pp. 183-187.)
- O(|T|·|leaves(P)|) time, no extra space needed
- (Y. Chen and Y.B. Chen, On the Top-down Tree Inclusion Algorithm, submitted to Information Processing Letters.)
- O(|T|·height(P)) time, no extra space needed
- Advantages of top-down over bottom-up
- - better computational complexities
- - checking trees page-wise (suitable for large data volumes)
- - integrating signatures into tree inclusion to cut useless subtree checks as early as possible
8Evaluation of Path-Oriented Queries
- - Top-down Algorithm
- Target tree: T = <t; T1, ..., Tk>, where t = root(T) and each Ti (i = 1, ..., k) is a subtree of t
- Pattern forest: G = <P1, ..., Pq>, where each Pj (j = 1, ..., q) is a subtree.
- Main idea
- The algorithm attempts to find the number of subtrees j (≥ 0) within an ordered forest G = <P1, ..., Pq> (q ≥ 1) that are embedded in a target tree T. If j = q, we say that G is embedded in T. If j < q, then only the trees P1, ..., Pj are embedded in T. Let p1, ..., pq and t be the roots of P1, ..., Pq and T, respectively. Since a forest does not have a root, we use a virtual node pv to serve as a substitute for root(G). Thus, root(G) will return pv if G = <P1, ..., Pq> with q > 1, and will return p1 if q = 1.
9Evaluation of Path-Oriented Queries
- Top-down Algorithm, Case 1: root(G) ≠ pv (i.e., G = <P> is a tree and root(G) = p), and label(p) ≠ label(t). If G is embedded in T, then there must exist a subtree Ti of t that contains the whole of G. The algorithm should return 1 if an embedding can be found and 0 if it cannot.
Figure: label(root(T)) ≠ label(root(G)); tree G is included in a subtree Ti of T.
10Evaluation of Path-Oriented Queries
- Top-down Algorithm, Case 2: root(G) ≠ pv (i.e., G = <P> and root(G) = p), and label(p) = label(t). Let <P1, ..., Pl> (l ≥ 0) be the forest of subtrees of p and <T1, ..., Tk> the forest of subtrees of t. If G is embedded in T, there must exist two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ l) such that T_{k_i} includes <P_{l_{i-1}+1}, ..., P_{l_i}> (i = 1, ..., g; l0 = 0, lg = l), where <P_{l_{i-1}+1}, ..., P_{l_i}> represents a forest containing the subtrees P_{l_{i-1}+1}, ..., and P_{l_i}. Thus, if lg = l, the algorithm should return 1, since we have a root-preserving inclusion of G in T. Otherwise, it should return 0.
Figure: label(root(T)) = label(root(G)); the subtrees P1, ..., Pl of p are included in the subtrees T1, ..., Tk of t.
11Evaluation of Path-Oriented Queries
- Top-down Algorithm, Case 3: root(G) = pv and there exists an integer j (0 ≤ j ≤ q) such that <P1, ..., Pj> is included in T. If j = q, then the whole of G is embedded in T. There are two possibilities to be considered when looking for j. The first possibility is similar to Case 2: there are two sequences of integers k1, ..., kg and l1, ..., lg (g ≤ q) that represent the order in which the subtrees of root(G) are embedded in the subtrees of root(T). In this case, j = lg. If j = 0, we check the second possibility to see whether there exists a root-preserving inclusion of P1 in T, i.e., label(p1) = label(t) and the subtrees of p1 are included in the subtrees of t. In this case, j = 1.
12Evaluation of Path-Oriented Queries
- Top-down Algorithm
Figure (Case 3, possibility 2): root(G) is a virtual node and label(root(T)) = label(root(P1)); the subtrees of P1's root are included in the subtrees T1, ..., Tk of t.
13Evaluation of Path-Oriented Queries
- Top-down Algorithm
function top-down-process(T, G)
input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq> (p may or may not be a virtual node.)
output: if root(G) is virtual, returns j ≥ 0; else returns 1 if T includes G; otherwise returns 0.
begin
  if root(G) is virtual then
    if (|T| < |P1| + |P2| or p has only one child)
    then G := P1
    else { j := bottom-up-process(T, G);
      if (j = 0 and label(t) = label(P1's root))   (* second possibility in Case 3 *)
      then { change P1's root to a virtual node; x := bottom-up-process(T, P1);
        if (x = the number of the children of P1's root) then j := 1 else j := 0 }
      return j }
  if |T| < |G| then return 0
  else if (label(t) = label(p))   (* handling Case 2 *)
    then { p := virtual node;
      j := bottom-up-process(T, G);
      if (j = l) then return 1 else return 0 }   (* l: the number of subtrees of p, as in Case 2 *)
    else if t is a leaf then return 0
  (* handling Case 1 *)
  i := 1
  while (i ≤ k) do
    { if top-down-process(Ti, G) > 0 then return 1;
      i := i + 1 }
  return 0
end

function bottom-up-process(T, G)
input: T = <t; T1, ..., Tk>, G = <p; P1, ..., Pq>
output: j - an integer
begin
  j := 0; i := 1
  while (j < q and i ≤ k) do
    { x := top-down-process(Ti, G);
      j := j + x; G := <p; P_{j+1}, ..., Pq>; i := i + 1 }
  return j
end
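The interplay of the two functions can be illustrated with a much simplified Python sketch. It is not the authors' algorithm: it omits the size tests and the virtual-node bookkeeping, makes no claim about the stated time bounds, and the Node class and function names are assumptions made for illustration. It keeps only the core idea: count how many leading pattern trees fit into a target tree, first by pushing them into the subtrees of the target root, and only if none fits by matching the first pattern root against the target root.

```python
class Node:
    """Hypothetical labeled ordered tree node (illustration only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)


def embed_in_forest(targets, patterns):
    """Largest j such that patterns[:j] embeds, in order, in the forest targets."""
    j = 0
    for t in targets:
        if j == len(patterns):
            break
        # Each target tree absorbs as many of the remaining pattern trees as it can.
        j += embed_in_tree(t, patterns[j:])
    return j


def embed_in_tree(t, patterns):
    """Number of leading trees of patterns that embed in the tree rooted at t."""
    if not patterns:
        return 0
    # First possibility: place the pattern trees below t, without using t itself.
    j = embed_in_forest(t.children, patterns)
    if j > 0:
        return j
    # Second possibility: map the root of the first pattern tree onto t.
    p = patterns[0]
    if p.label == t.label and embed_in_forest(t.children, p.children) == len(p.children):
        return 1
    return 0


def includes(T, P):
    """True if the single pattern tree P is included in the target tree T."""
    return embed_in_tree(T, [P]) == 1


# Hypothetical example: T = a(b(c), d).
T = Node("a", [Node("b", [Node("c")]), Node("d")])
in_order = includes(T, Node("a", [Node("c"), Node("d")]))
out_of_order = includes(T, Node("a", [Node("d"), Node("c")]))
```

The second pattern fails only because the left-to-right order of c and d is reversed, which is what distinguishes ordered from unordered inclusion.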
14Integration of Signatures into Top-down Inclusion
Definition: A signature for a key word or an attribute value is a hash-coded bit string.
- Example (constructing a signature for a word with m = 4 and F = 12): database
→ letter triplets: dat, ata, tab, aba, bas, ase
→ H(dat) = 5, H(ata) = 1, H(tab) = 8, H(aba) = 1, H(bas) = 10, H(ase) = 8
→ 100 010 010 100
(D. Dervos, Y. Manolopoulos and P. Linardis, Comparison of signature file models with superimposed coding, Information Processing Letters 65 (1998) 101-106.)
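The example can be replayed in a few lines of Python. This is a sketch under stated assumptions: the hash function H is not specified on the slide, so its six values are simply copied into a lookup table, and the helper names are hypothetical. Bit positions are taken to be 1-based, counted from the left.

```python
F = 12  # signature length in bits, as on the slide

# Hash values copied from the slide; a real system would compute H(triplet).
H = {"dat": 5, "ata": 1, "tab": 8, "aba": 1, "bas": 10, "ase": 8}


def triplets(word):
    """All overlapping letter triplets of a word, left to right."""
    return [word[i:i + 3] for i in range(len(word) - 2)]


def signature(word, hash_values, length):
    """Set the bit at position H(triplet) for every triplet of the word."""
    bits = ["0"] * length
    for triplet in triplets(word):
        bits[hash_values[triplet] - 1] = "1"   # positions are 1-based
    return "".join(bits)


sig = signature("database", H, F)
```

Two triplets collide on position 1 and two on position 8, which is why six triplets yield only m = 4 bits set to 1.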
15Integration of Signatures into Top-down Inclusion
Definition: A signature for a key word or an attribute value is a hash-coded bit string.
- Important parameters
m: number of 1s in the bit string
F: length of the bit string
D: size of a block (or average number of key words of an element)
- Optimal choice of the parameters: F × ln 2 = m × D (1)
(S. Christodoulakis and C. Faloutsos, Design considerations for a message file server, IEEE Trans. Software Engineering, 10(2) (1984) 201-210.)
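Formula (1) can be solved for F once m and D are chosen. The following Python sketch rounds up to a whole number of bits; the function name and the sample values m = 4 and D = 10 are assumptions for illustration, not figures from the slides.

```python
import math


def signature_length(m, D):
    """Smallest integer F satisfying F * ln 2 >= m * D, following formula (1)."""
    return math.ceil(m * D / math.log(2))


F = signature_length(m=4, D=10)
```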
16Integration of Signatures into Top-down Inclusion
- Assigning signatures to tree nodes: Let v be a node in a tree T. If v is a leaf node, its signature sv is equal to the signature assigned to its label. Otherwise, sv = s ∨ s1 ∨ ... ∨ sn, where s represents the signature for the label associated with v, and s1, ..., sn are the signatures of v's children v1, ..., vn, respectively.
Figure: example tree T (nodes labeled a, b, c, d, e, f), each node annotated with an 8-bit signature.
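The bottom-up assignment of node signatures can be sketched as a single recursive pass. In this Python sketch, signatures are held as integers so that superimposing them is a bitwise OR; the Node class, the function name, and the label signatures in the example are hypothetical and are not the values from the figure.

```python
class Node:
    """Hypothetical tree node carrying a signature (illustration only)."""

    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)
        self.sig = 0


def assign_signatures(node, label_sig):
    """Set node.sig to its label signature OR-ed with all children's signatures."""
    sig = label_sig[node.label]
    for child in node.children:
        sig |= assign_signatures(child, label_sig)
    node.sig = sig
    return sig


# Hypothetical 8-bit label signatures.
label_sig = {"a": 0b00010000, "b": 0b01000000, "c": 0b00101000, "d": 0b10101000}
b = Node("b", [Node("c"), Node("d")])
root = Node("a", [b])
assign_signatures(root, label_sig)
```

Because every node's signature already covers its whole subtree, a single comparison at a node can rule out the entire subtree below it.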
17Integration of Signatures into Top-down Inclusion
- Cutting off useless subtree checks by examining signatures: We assign each node v in T a bit string sv (called a signature), and each node u in P a bit string su, in such a way that if su matches sv then the subtree Tv rooted at v may include the subtree Pu rooted at u. Otherwise, Tv definitely does not contain Pu. By matching, we mean that for each bit set to 1 in su, the corresponding bit in sv is also set to 1, while for a bit set to 0 in su, the corresponding bit in sv can be 0 or 1. In the following, we discuss this technique in detail.
Figure: target tree T and pattern P (rooted at a virtual node), with an 8-bit signature at each node; a subtree of T whose signature is not matched by the pattern's signature is not explored.
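The matching test reduces to one bitwise operation when signatures are stored as integers. A minimal Python sketch follows; the function name is an assumption, and the bit patterns are chosen for illustration only.

```python
def matches(su, sv):
    """True if every bit set to 1 in su is also set to 1 in sv."""
    return (su & sv) == su


sv = 0b11111000          # signature of a target node v
may_contain = matches(0b00101000, sv)    # all 1-bits of su appear in sv
cannot_contain = matches(0b00010101, sv) # su has 1-bits that sv lacks
```

A successful match only means that Tv may include Pu, so the inclusion check must still run; a failed match lets the algorithm skip Tv entirely.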
18Integration of Signatures into Top-down Inclusion
- Determine the length of signatures: Consider s = s1 ∨ s2, where s1 and s2 are of length F and have m1 and m2 bits set to 1, respectively. How do we determine the length of s?
l - the number of 1s in s
d = l - m, where m = max(m1, m2).
length(s) = F + cd, where c is a constant and should be tuned for different applications.
The value of d can be estimated as follows.
λ - random variable representing the number of positions in which both s1 and s2 have 1s.
19Integration of Signatures into Top-down Inclusion
- Determine the length of signatures
E(λ) = 1 × p(λ = 1) + 2 × p(λ = 2) + ... + μ × p(λ = μ) (2)
where μ = min(m1, m2) and p(λ = i) represents the probability that λ is equal to i.
p(λ = i) = ... (3)
d = l - m = m1 + m2 - λ - max(m1, m2).
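The expected overlap of two signatures, and from it the growth d, can be estimated numerically. The Python sketch below rests on an assumption that is not taken from the slides: the 1-bits of each signature are placed uniformly at random and independently, so the number of shared positions follows a hypergeometric distribution. The function names are hypothetical, and the probability used here should not be read as the authors' formula (3).

```python
from math import comb


def overlap_probability(i, F, m1, m2):
    """P(overlap = i) for two random signatures of length F (hypergeometric assumption)."""
    return comb(m1, i) * comb(F - m1, m2 - i) / comb(F, m2)


def expected_overlap(F, m1, m2):
    """Expected number of positions where both signatures have a 1, summed as in formula (2)."""
    upper = min(m1, m2)
    return sum(i * overlap_probability(i, F, m1, m2) for i in range(1, upper + 1))


def expected_growth(F, m1, m2):
    """Expected d = m1 + m2 - overlap - max(m1, m2)."""
    return m1 + m2 - expected_overlap(F, m1, m2) - max(m1, m2)


growth = expected_growth(F=12, m1=4, m2=4)
```

Under this assumption the expected overlap equals m1·m2/F, which gives a quick sanity check on the sum.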
20Evaluation of Path-Oriented Queries
- Procedure for calculating signature length
1) Identify the key words in a document, which can be done by using Connexor-analyzer (http://www.connexor.com/demos/index.html).
2) Determine the length of the signatures for the nodes of a document tree, which can be done in two steps:
- First, use formula (1) to determine the initial length of the signatures according to the number of the chosen key words and their distribution.
- Second, use formulas (2) and (3) to determine the length of the signatures for each document according to the initial length set for signatures.
21Evaluation of Path-Oriented Queries
- Procedure for calculating signature length
In the figure, F stands for the initial length of
the signatures and m for the initial number of
bits set to 1.
22Experiment Results
- Test Platform
Computer - DELL desktop PC equipped with a Pentium III 864 MHz processor, 512 MB RAM, and a 20 GB hard disk.
Database system - Oracle-9i Enterprise Edition. The default buffer cache of Oracle-9i is of size 4 MB.
Language - Oracle PL/SQL.
Data - all 37 of Shakespeare's plays, stored in a database.
23Experiment Results
- Storage of XML documents in databases: All the documents are stored in three tables. The relation Element has the following structure: DocID <integer>, ID <integer>, Ename <string>, firstChildID <integer>, siblingID <integer>, attributeID <integer>.
docID  ID  Ename                   firstChildID  siblingID  attributeID
1      1   Hotel-room-reservation  2                        1
1      2   Name                    1             3
1      3   Location                4             11
1      4   City-or-district        2             5
1      5   State                   3             6
1      6   Country                 4             7
1      7   Address                 8
1      8   Number                  5             9
1      9   Street                  6             10
1      10  Post-code               7
1      11  Type                    12            14
1      12  Rooms                   8             13
1      13  Price                   9
1      14  Reservation-time        15
1      15  From                    10            16
1      16  To                      11
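The firstChildID and siblingID columns encode the tree in first-child/next-sibling form, so the children of an element are found by following one firstChildID pointer and then a chain of siblingID pointers. The Python sketch below shows that walk on a simplified, hypothetical subset of rows: it covers only the Location subtree, ignores Location's own sibling, and sets firstChildID to None for leaf elements, whose pointers in the table appear to refer to text values rather than to elements (an assumption).

```python
def children(rows, parent_id):
    """IDs of the child elements of parent_id, in left-to-right order.

    rows maps ID -> (Ename, firstChildID, siblingID); None means "no pointer".
    """
    result = []
    child = rows[parent_id][1]
    while child is not None:
        result.append(child)
        child = rows[child][2]      # move on to the next sibling
    return result


# Simplified, hypothetical rows for the Location subtree.
rows = {
    3: ("Location", 4, None),
    4: ("City-or-district", None, 5),
    5: ("State", None, 6),
    6: ("Country", None, 7),
    7: ("Address", 8, None),
    8: ("Number", None, 9),
    9: ("Street", None, 10),
    10: ("Post-code", None, None),
}

location_children = children(rows, 3)
address_children = children(rows, 7)
```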
24Experiment Results
- Storage of XML documents in databases: The relation Text has a simpler structure: DocID <integer>, textID <integer>, value <string>, where textID identifies the texts that are the values of the corresponding elements in the original document. Note that a text always has an element as its parent node. See the following table for illustration.
docID textID value
1 1 Travel-lodge
1 2 Winnipeg
1 3 Manitoba
1 4 Canada
1 5 500
1 6 Portage Ave.
1 7 R3B 2E9
1 8 One-bed-room
1 9 119.00
1 10 April 20, 2005
1 11 April 28, 2005
25Experiment Results
- Storage of XML documents in databases: The relation Attribute has five data fields: DocID <integer>, att-ID <integer>, parentID <integer>, att-name <string>, att-value <string>.
docID Att-ID parentID Att-name Att-value
1 1 1 Filecode 1302
26Experiment Results
- Tested queries
Group I - for testing path length impact
Query Path Expression
Q1 /play//magnificence
Q2 /play/act//magnificence
Q3 /play/act/scene//magnificence
Q4 /play/act/scene/speech//magnificence
Q5 /play/act/scene/speech/line/magnificence
Group II - for testing node degree impact
Query Path Expression
Q6 /play//line/magnificence
Q7 /play/act//magnificence ∧ /play//line/churchyard
Q8 /play/act//magnificence ∧ /play//line/churchyard ∧ /play//line/reverence
Q9 /play/act//magnificence ∧ /play//line/churchyard ∧ /play//line/reverence ∧ /play//line/frequent
Q10 /play/act//magnificence ∧ /play//line/churchyard ∧ /play//line/reverence ∧ /play//line/frequent ∧ /play//line/heirless
27Experiment Results
- Tested queries
Group III - for testing impact of matching at higher level
Query Path Expression
Q11 /play//line/magnificence ∧ /play//line/perpetuity
Q12 /play//line/churchyard ∧ /play//line/ladyship
Q13 /play//line/reverence ∧ /play//line/continent
Q14 /play//line/frequent ∧ /play//line/linen
Q15 /play//line/heirless ∧ /play//line/delivery
Group IV - for testing impact of matching at middle level
Query Path Expression
Q16 /scene//line/magnificence ∧ /scene//line/utterance
Q17 /scene//line/churchyard ∧ /scene//line/barbarism
Q18 /scene//line/reverence ∧ /scene//line/carriage
Q19 /scene//line/frequent ∧ /scene//line/imagination
Q20 /scene//line/heirless ∧ /scene//line/successor
28Experiment Results
- Tested queries
Group V - for testing impact of matching at lower level
Query Path Expression
Q21 /speech//line/magnificence ∧ /speech//line/unintelligence
Q22 /speech//line/churchyard ∧ /speech//line/crickets
Q23 /speech//line/reverence ∧ /speech//line/ceremonious
Q24 /speech//line/frequent ∧ /speech//line/exercise
Q25 /speech//line/heirless ∧ /speech//line/companion
29Experiment Results
- - Tested methods
- Inversion on Elements and Words (IEW)
- (C. Zhang, J. Naughton, D. DeWitt, Q. Luo and G. Lohman, On Supporting Containment Queries in Relational Database Management Systems, in Proc. of ACM SIGMOD Intl. Conf. on Management of Data, California, USA, 2001.)
- Inversion on Paths and Words (IPW)
- (C. Seo, S. Lee, and H. Kim, An Efficient Index Technique for XML Documents Using RDBMS, Information and Software Technology 45 (2003) 11-22, Elsevier Science B.V.)
- Tree Inclusion Algorithm (TIA)
- Tree Inclusion with Signatures (TIS)
30Experiment Results
- Tested methods: Inversion on Elements and Words (IEW)
- (Dno, Wposition, level) for a text word
- (Dno, Eposition, level) for an element
31Experiment Results
- Tested methods: To evaluate the query /hotel-room-reservation/location/address[street = "Portage Ave."], four joins are performed: three self-joins on the E-index relation to connect hotel-room-reservation and location, location and address, and address and street; and one join between the E-index and T-index relations to connect street and Portage Ave.
32Experiment Results
- Tested methods: Inversion on Paths and Words (IPW)
- Path(path, pathID)
- PathIndex(pathID, docno, begin, end)
- Word(word, wordID)
- WordIndex(wordID, docno, pathID, position)
33Experiment Results
- - Tested methods
- In order to process the same query
- /hotel-room-reservation/location/address[street = "Portage Ave."],
- two joins are needed.
- The first join is between the Path and WordIndex relations, with the join condition
- Path.path = 'hotel-room-reservation/location/address/street' and Path.pathID = WordIndex.pathID.
- The second join is between the result R of the first join and the Word relation, with the join condition
- R.wordID = Word.wordID and Word.word = 'Portage Ave.'.
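The two joins can be tried out with Python's built-in sqlite3 module. This is an illustrative sketch rather than the setup used in the experiments, which ran on Oracle-9i with PL/SQL: the rows, IDs, and positions are invented, and the PathIndex relation is left out because this query does not touch it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Path(path TEXT, pathID INTEGER);
    CREATE TABLE Word(word TEXT, wordID INTEGER);
    CREATE TABLE WordIndex(wordID INTEGER, docno INTEGER, pathID INTEGER, position INTEGER);
    -- Hypothetical sample rows
    INSERT INTO Path VALUES ('hotel-room-reservation/location/address/street', 1);
    INSERT INTO Path VALUES ('hotel-room-reservation/name', 2);
    INSERT INTO Word VALUES ('Portage Ave.', 1);
    INSERT INTO Word VALUES ('Travel-lodge', 2);
    INSERT INTO WordIndex VALUES (1, 1, 1, 6);
    INSERT INTO WordIndex VALUES (2, 1, 2, 1);
""")

# First join: Path with WordIndex on pathID; second join: that result with Word on wordID.
rows = conn.execute(
    """
    SELECT wi.docno, wi.position
    FROM Path AS p
    JOIN WordIndex AS wi ON p.pathID = wi.pathID
    JOIN Word AS w ON wi.wordID = w.wordID
    WHERE p.path = ? AND w.word = ?
    """,
    ("hotel-room-reservation/location/address/street", "Portage Ave."),
).fetchall()
```

Storing whole paths as strings is what reduces the four joins of IEW to two here: the chain of parent-child steps is resolved by a single equality test on Path.path.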
34Experiment Results
- Tested results
Figure: execution time (sec.) of IPW, TIS, and TIA per query. Panels: Results of Group I, Results of Group II, Results of Group IV.
35Experiment Results
- Tested results
Figure: execution time (sec.) per query. Panel: Results of Group V.
36Summary and Future Work
- Path-oriented queries in document databases
- Evaluation of path-oriented queries
- - top-down algorithm for tree inclusion problem
- - integration of signatures into top-down tree inclusion
- Future work
- document recognition using
- tree inclusion
- probabilistic analysis
- Benford's law
- Zipf's law