CS760: XML Research 2

1
CS760 XML Research 2

September 16, 2002.
Yon Dohn Chung

2
Outline

Selectivity Estimation of Path Expressions
Indexing and Querying XML Data on RDBMS
XML Query processing using Signatures
Path Indexing for XML Document Retrieval
Extraction of DTD information from XML Documents
Filtering of XML Documents in SDI Environments

3
Estimating the Selectivity of XML Path
Expressions for Internet Scale Applications

Ashraf Aboulnaga, et al.
VLDB, 2001

4
Contents

Introduction
Path Trees
Markov Tables
Experimental Evaluation
Summary

5
Introduction

XML queries use path expressions to navigate
through the structure of XML data
Optimizing an XML query requires estimating the
selectivity of path expressions
Database statistics used for selectivity
estimation must be summarized to fit in the
available memory

6
Path Trees

Construct a tree representing the structure of an
XML document

(Figure: each path tree node is labeled with a tag name and its frequency.)
7
Path Trees

Summarize the path tree by
deleting low-frequency nodes
adding *-nodes which represent the information
contained in the deleted nodes at a coarser
granularity
Summarization Methods
sibling-*
level-*
global-*
no-*

8
Path Trees

Sibling-*
mark the lowest-frequency node A for deletion
coalesce A and its sibling B into one *-node if B
is a *-node or a marked regular node

delete A, I, J, E, H, D, C, G
9
Path Trees

Level-*
delete the lowest-frequency nodes
coalesce all deleted nodes into a *-node at each
level

delete A, I, J, E, H, D
10
Path Trees

Global-*
a single *-node represents all deleted nodes

delete A, I, J, E
11
Path Trees

No-*
low-frequency nodes are simply deleted and not
replaced with *-nodes
assumes that nodes not in the summarized path
tree did not exist in the original tree
To reduce the size of a path tree by n nodes, the
number of nodes that each method deletes differs
(the comparison table is not preserved in this transcript)

12
Path Trees

Selectivity Estimation
scan the path tree looking for all nodes whose
tags match the first tag of the path query
navigate down the tree matching tags in the path
query with tags in the tree
match a tag in the path query to a *-node if it
cannot be matched to a node with a regular tag
e.g., //A/B/C matches all of //A/*/C, //A/*/*,
and //*/B/*
the selectivity of the path query is the total
frequency of the nodes which correspond to the
path query
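
The estimation steps above can be sketched in Python. This is a minimal sketch under stated assumptions, not the paper's implementation: the `Node` class, the `estimate` function, the sample frequencies, and the choice to fall back to `*`-nodes only when no regular tag matches are all hypothetical illustrations.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    tag: str    # element tag, or "*" for a summary node (assumed encoding)
    freq: int   # frequency stored at this node of the path tree
    children: list = field(default_factory=list)

def walk(node):
    """Yield every node of the path tree."""
    yield node
    for child in node.children:
        yield from walk(child)

def step(node, tag):
    """Children matching `tag`; use *-children only if no regular tag matches."""
    exact = [c for c in node.children if c.tag == tag]
    return exact or [c for c in node.children if c.tag == "*"]

def estimate(root, tags):
    """Estimate the selectivity of the path query //tags[0]/tags[1]/..."""
    # Scan the whole tree for nodes matching the first tag of the query.
    current = [n for n in walk(root) if n.tag == tags[0]]
    if not current:
        current = [n for n in walk(root) if n.tag == "*"]
    # Navigate down the tree, matching one query tag per level.
    for tag in tags[1:]:
        current = [c for n in current for c in step(n, tag)]
    # The estimate is the total frequency of the nodes reached.
    return sum(n.freq for n in current)

# Hypothetical summarized path tree with two *-nodes.
tree = Node("A", 1, [
    Node("B", 6, [Node("C", 9), Node("*", 4)]),
    Node("*", 3, [Node("C", 2)]),
])

print(estimate(tree, ["A", "B", "C"]))  # regular tags all the way down
print(estimate(tree, ["A", "B", "D"]))  # D falls back to the *-node under B
print(estimate(tree, ["A", "X", "C"]))  # X falls back to the *-node under A
```

How a `*`-node's frequency is aggregated (total or average of the deleted nodes) is left open here; the sketch simply sums whatever is stored.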

13
Markov Tables

Construct a table of all the distinct paths of
length up to m and their frequency

(m = 2)
14
Markov Tables

The frequency of longer paths can be estimated
using the following formula
The paths in XML data are modeled as a Markov
process of order m - 1
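
The formula itself is not preserved in this transcript. A sketch consistent with the order m - 1 Markov assumption is f(t1/.../tn) = f(t1/.../tm) × Π over i = 1..n-m of f(t(1+i)/.../t(m+i)) / f(t(1+i)/.../t(m+i-1)); treat this form, the `estimate` function, and the sample table below as assumptions for illustration.

```python
def estimate(table, tags, m=2):
    """Estimate the frequency of the path tags[0]/tags[1]/... from a Markov table.

    `table` maps tuples of tags (paths of length <= m) to frequencies.
    """
    if m < 2:
        raise ValueError("this sketch assumes m >= 2")
    n = len(tags)
    if n <= m:
        # Short paths are looked up directly.
        return float(table.get(tuple(tags), 0))
    # Start with the first length-m window of the path.
    est = float(table.get(tuple(tags[:m]), 0))
    # Slide the window one tag at a time, multiplying by the conditional
    # frequency f(window) / f(window without its last tag).
    for i in range(1, n - m + 1):
        num = table.get(tuple(tags[i:i + m]), 0)
        den = table.get(tuple(tags[i:i + m - 1]), 0)
        if den == 0:
            return 0.0
        est *= num / den
    return est

# Hypothetical Markov table with m = 2.
table = {
    ("A",): 10, ("B",): 20, ("C",): 30,
    ("A", "B"): 8, ("B", "C"): 15,
}

# f(A/B/C) is estimated as f(A/B) * f(B/C) / f(B) = 8 * 15 / 20
print(estimate(table, ["A", "B", "C"]))
```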

15
Markov Tables

Summarize the Markov table by
deleting low-frequency paths
replacing the deleted paths of length 1 or 2 with
*-paths (paths of length greater than 2 are
discarded)
Summarization Methods
suffix-*
global-*
no-*

16
Markov Tables

Suffix-*

(Figure: suffix-* example showing SD at each step, with entries such as A/D and B/D; the diagram is not preserved.)
17
Markov Tables

Global-*
* represents all deleted paths of length 1
*/* represents all deleted paths of length 2
No-*
low-frequency paths are simply discarded
assumes that paths not in the summarized Markov
table did not exist in the original table

18
Experimental Evaluation

Data Sets
synthetic data set and real data set
Query Workloads
random paths: all queries have a non-zero result
size
random tags: most queries have a result size of
zero
Path Tree Summarization
random paths: the methods using *-nodes are
better than no-*
random tags: no-* is the best method
Markov Table Summarization
random paths: suffix-* with m = 2 is best
random tags: no-* with m = 2 is best

19
Summary

Selectivity estimates for path expressions are very
important for query optimization.
The paper proposed two estimation methods
Path Tree
Markov table

20
Indexing and Querying XML Data for Regular
Expressions

Q. Li and B. Moon
VLDB, 2001

21
Contents

Introduction
Numbering Scheme for A-D Relationship
Index and Data Organization
Path-Join Algorithms
Summary

22
Introduction

XML as a standard for data representation and
exchange
Challenge: indexing and querying XML
Apply a relational DBMS to XML data.
Fast access to XML data via path expressions
Path expressions to navigate through and retrieve
XML data
Q1: /chapter/_*/figure[@caption="Tree Frogs"]
Q2: (E1/E2)*/E3/((E4[@A=v])|(E5/_*/E6))

23
Numbering Scheme

XML objects are modeled by a tree structure
nodes are XML elements and attributes
parent-child represents nesting between objects
To process path expression queries
(e.g.) chapter3/section, chapter3/_*/figure
conventional approach: traverse XML trees
new approach
collect two object sets
determine A-D relationship between objects

24
Extended Preorder

Annotate a node with a pair of <order, size>
for Y and its parent X,
order(X) < order(Y) and
order(Y) + size(Y) < order(X) + size(X)
for siblings X and Y, if X is before Y in
preorder,
order(X) + size(X) < order(Y)
Lemma
X is an ancestor of Y iff order(X) < order(Y) <
order(X) + size(X)

25
Extended Preorder Examples

(1, 100) is an ancestor of (17, 5)
1 < 17, 17 + 5 < 1 + 100
(11, 5) and (25, 5) are siblings
11 + 5 < 25
(10, 30) is not an ancestor of (45, 4)
10 < 45, but
45 + 4 > 10 + 30
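
The lemma translates directly into a constant-time test on two <order, size> pairs. A minimal Python sketch, where the function name `is_ancestor` and the tuple representation are illustrative assumptions:

```python
def is_ancestor(x, y):
    """Return True if node x is an ancestor of node y.

    Each node is an (order, size) pair; the test is the lemma
    order(X) < order(Y) < order(X) + size(X).
    """
    order_x, size_x = x
    order_y, _size_y = y
    return order_x < order_y < order_x + size_x

# The three cases from the examples slide.
print(is_ancestor((1, 100), (17, 5)))   # ancestor case
print(is_ancestor((11, 5), (25, 5)))    # sibling case
print(is_ancestor((10, 30), (45, 4)))   # node outside the interval
```

No tree traversal is needed: the answer depends only on the two pairs being compared.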

26
Index and Data Organization

Two supplementary structures
name index (in a B+-tree)
name string → nid
value table stores all string values
Element index (B+-tree)
nid → a list of element records grouped by
document ID (did)
an element record contains (order, size), depth,
parent ID
quickly find all elements having the same name
string
Attribute index (B+-tree)
same as the element index, except that a value ID
maps to the attribute value in the value table
Structure index (B+-tree)
did → a list of element and attribute records
nid, <order, size>, etc.
quickly find all objects belonging to the same
document
quickly find all objects belonging to the same
document

27
Path-Join Algorithms

Decompose a path expression
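Q2: (E1/E2)*/E3/((E4[@A=v])|(E5/_*/E6))

(Figure: decomposition tree for Q2; the layout is not preserved, and the pairing below is reconstructed from the operator labels.)
E1/E2: EE-Join
(E1/E2)*: KC-Join
E4[@A=v]: EA-Join
E5/_*/E6: EE-Join
(E4[@A=v])|(E5/_*/E6): Union
(E1/E2)*/E3: EE-Join
E3/(...): EE-Join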
28
EA-Join

Join an element set and attribute set by A-D
(e.g.) figure[@caption="Tree Frogs"]
Input
{..., Ei, ...}: Ei is a set of elements from a
document did
{..., Aj, ...}: Aj is a set of attributes from a
document did
Output
a set of (e, a) pairs such that e is a parent of
a
Algorithm
foreach Ei and Aj with the same did do
foreach e ∈ Ei and a ∈ Aj do
if (e is parent of a) then output (e, a)

29
EE-Join

Join two element sets by A-D relationship
(e.g.) chapter/_*/figure
Input
{..., Ei, ...} and {..., Fj, ...}: Ei and Fj
are sets of elements from a document did
Output
a set of (e, f) pairs such that e is an
ancestor of f
Algorithm
foreach Ei and Fj with the same did do
foreach e ∈ Ei and f ∈ Fj do
if (e is ancestor of f) then output (e, f)
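
The EE-Join pseudocode above can be written as a short Python sketch. It follows the nested-loop form shown on the slide; the dictionary layout (document ID mapped to a list of (order, size) records), the function names, and the sample data are assumptions made for illustration.

```python
def is_ancestor(e, f):
    """Ancestor test on (order, size) pairs from the extended preorder scheme."""
    return e[0] < f[0] < e[0] + e[1]

def ee_join(E, F):
    """Join two element sets by the ancestor-descendant relationship.

    E and F map a document ID (did) to a list of (order, size) records.
    Returns (did, e, f) triples where e is an ancestor of f.
    """
    out = []
    # Only element lists from the same document are compared.
    for did in sorted(E.keys() & F.keys()):
        for e in E[did]:
            for f in F[did]:
                if is_ancestor(e, f):
                    out.append((did, e, f))
    return out

# Hypothetical element lists for chapter/_*/figure.
chapters = {1: [(1, 100)], 2: [(10, 30)]}
figures = {1: [(17, 5), (150, 2)], 2: [(45, 4)], 3: [(5, 1)]}

print(ee_join(chapters, figures))
```

Because the records carry their own numbering, the join never touches the XML tree itself; it only compares pairs drawn from the two lists.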

30
KC-Join

(e.g.) chapter, figure, chapter/chapter
Input
{..., Ei, ...}: Ei is a set of elements from a
document did
Output
a Kleene closure of {..., Ei, ...}
Algorithm
i = 1; Ki = {..., Ei, ...}
repeat
i = i + 1; Ki = EE-Join(Ki-1, K1)
until (Ki is empty)
output union of K1, K2, ..., Ki-1
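
The loop above can be sketched in Python for a single document. This is an illustrative sketch only: representing each member of Ki as a chain (tuple) of elements, joining on the last element of the chain, and using the plain ancestor test rather than a parent-child test are all assumptions not stated on the slide.

```python
def is_ancestor(e, f):
    """Ancestor test on (order, size) pairs from the extended preorder scheme."""
    return e[0] < f[0] < e[0] + e[1]

def kc_join(elements):
    """Kleene closure of one element set via repeated EE-Join steps.

    K1 holds single-element chains; Ki extends each chain in K(i-1)
    with an element of K1 that lies below the chain's last element.
    """
    levels = [[(e,) for e in elements]]          # K1
    while True:
        prev = levels[-1]
        nxt = [chain + (f,)
               for chain in prev
               for f in elements
               if is_ancestor(chain[-1], f)]     # Ki = EE-Join(K(i-1), K1)
        if not nxt:                              # until Ki is empty
            break
        levels.append(nxt)
    # Output the union of K1, K2, ..., K(i-1).
    return [chain for level in levels for chain in level]

# Hypothetical nested chapter elements of one document.
chapters = [(1, 100), (2, 50), (3, 10), (60, 5)]
closure = kc_join(chapters)
print(len(closure))
```

The loop terminates because order values strictly increase along every chain, so chains cannot grow beyond the number of elements.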

31
Summary of Contributions

Design a numbering scheme
Extended Preorder
Determine ancestor-descendant relationship
Propose Path-Join algorithms
Conventional tree traversal is slow
Join algorithms to avoid tree traversal
Design indexing and storage structures
XISS
Element index, Attribute index, Structure index

32
A New Query Processing Technique for XML Based on
Signature

S. Park and H.J.Kim
DASFAA, 2001

33
Contents

Introduction
s-DOM
Query Processing with s-NFA
Summary

34
Introduction

Previous index methods (the path index in OODBs
and the T-index) do not cover all possible regular
path expressions because of their storage requirements.
Another problem is that the index itself is
semi-structured data
A signature is one method for reducing the
search space
Our idea
add signature information to each node of XML
documents
the signature gives hints as to whether some
nodes exist in the sub-tree of a specific node
the signature is very small

35
s-DOM

s-DOM is a DOM where we add a signature to each
node
The signature of a node is the OR of the hash
values of all its descendant nodes
Algorithm
MakeSignature(node)
s = 0
if node is an Element or Attribute node then
foreach ChildNode of node do
s = s ∨ MakeSignature(ChildNode)
s = s ∨ Hash(ChildNode.Name)
end for
end if
node.signature = s
return s
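
A runnable Python sketch of MakeSignature follows. The `Node` class, the 16-bit signature width, and the one-bit-per-name hash built on `zlib.crc32` are assumptions chosen for illustration; the slides do not specify the hash function, and the Element/Attribute check is omitted because every node in this toy tree has a name.

```python
import zlib
from dataclasses import dataclass, field

SIG_BITS = 16  # assumed signature width

@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    signature: int = 0

def hash_name(name):
    """Map a node name to a single set bit (hypothetical hash)."""
    return 1 << (zlib.crc32(name.encode("utf-8")) % SIG_BITS)

def make_signature(node):
    """OR together the hashes of all descendant names and store the result."""
    s = 0
    for child in node.children:
        s |= make_signature(child)   # signatures of deeper descendants
        s |= hash_name(child.name)   # the child itself
    node.signature = s
    return s

def may_contain(node, name):
    """False means `name` is definitely absent below `node`; True is only a hint."""
    h = hash_name(name)
    return (node.signature & h) == h

figure = Node("figure")
chapter = Node("chapter", [figure])
book = Node("book", [chapter])
make_signature(book)

print(may_contain(book, "figure"))
print(may_contain(figure, "figure"))
```

A negative answer lets a traversal skip the whole sub-tree; a positive answer can be a false match when two names hash to the same bit, so the sub-tree still has to be checked.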

36
DOM: An Example
37
s-DOM
<Hash value of strings>
<Signature of a node in s-DOM>
38
Query Processing

Query processing with NFA
a regular path expression is a regular
expression, thus can be transformed into NFA
therefore, a regular path expression can be
processed through an NFA
s-NFA is an NFA of which state nodes have
signatures
the signature is the ORed hash values of all the
labels along a NFA path of a state node (called
path signatures)
query processing with s-NFA reduces the search
space

39
s-NFA
<Path Signatures>
40
Summary

s-DOM
add a signature to each node in DOM
the signature of a node is the ORed signature
values of its descendants
s-NFA
add a signature to each state in NFA
the signature of a state is the ORed signature
values of the path to the node
Using signature methods, the search space for
tree traversal is reduced.

41
An Index Scheme for Efficient Retrieval of XML
Documents

J. H. Kim, et al.

42
Contents

Problem Definition
Related Work
the inverted file
Motivation
The Proposed Index Structure
Analysis
An Improvement
Summary

43
Problem Definition

Input
Set of XML documents
Set of path information
Path query
Regular path expression
Output
IDs of the documents that contain a path
satisfying the path query

44
Related Work

The inverted file

45
Motivation

Traditional inverted file
No false match for the plain documents
False match occurs for the XML documents
Do not consider the hierarchy for the elements
Can only provide the candidate set
How about using paths for inversion ?
No false match !
But, tremendous replication will occur.
e.g.
a, a/b, a/b/c, a/b/c/d
a is replicated 4 times, b is replicated 3
times, c is replicated twice.
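
The replication counts above can be checked with a few lines of Python. The second half of the sketch counts distinct path prefixes, which is what a prefix tree would store once each; using a prefix tree is an assumption made here for comparison, since the slides' own transform is not preserved in this transcript.

```python
from collections import Counter

paths = ["a", "a/b", "a/b/c", "a/b/c/d"]

# Inverting on whole paths stores every tag of every path.
tag_copies = Counter(tag for p in paths for tag in p.split("/"))
print(dict(tag_copies))

def prefix_nodes(paths):
    """Number of nodes needed if shared prefixes are stored only once."""
    prefixes = set()
    for p in paths:
        tags = p.split("/")
        for i in range(1, len(tags) + 1):
            prefixes.add(tuple(tags[:i]))
    return len(prefixes)

print(sum(tag_copies.values()), prefix_nodes(paths))
```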

46
The Proposed Method

Transform to reduce replication

/invoice
/invoice/buyer
/invoice/buyer/name
/invoice/buyer/address
47
The Proposed Index

The architecture

48
Analysis

Space analysis
the number of nodes in a k-ary tree with depth n
the number of nodes in case of no transform
thus, we can save space by more than (n-1) times

49
Analysis

Worst cases in query processing
if the query contains the // operator
e.g.
//address
all nodes in the tree must be traversed
/invoice//person
all nodes in sub-trees below /invoice must be
traversed

50
An Improvement

A solution for handling //
construct short-cuts for every vocabulary such
that
it must be easy to get the list of nodes which
are located behind // in the query
it must be easy to determine the
ancestor/descendant relation between the
before-nodes and behind-nodes of // in the query

51
An Improvement

Architecture

52
An Improvement

Query processing
e.g. /a/b/_/c//d/e
1. normal tree traversal before //
make a candidate node list A
2. vocabulary lookup when // appears
acquire all nodes with the tag behind // as
candidate node list B
check ancestor/descendant relationships between
nodes in A and B

53
Experiment

Environment
Windows XP, Pentium4 2GHz, 512MB
JDK 1.4, Xerces 1.4.4

Data sets: DocBook and NITF
54
Experiment Result
Processing Time for Document Retrieval
(Charts for DocBook and NITF; not preserved.)
55
Experiment Result
The Number of Filtered Documents
(Charts for DocBook and NITF; not preserved.)
56
Summary

Inversion of path information of XML documents
a method for XML document retrieval
also, a preprocessing method for XML query
processing.
an index structure for a set of XML documents,
not a single XML document.

57
XTRACT: A System for Extracting Document Type
Descriptors from XML Documents

Minos Garofalakis, et al.
SIGMOD, 2000

58
Contents

Introduction
Problem Definition
System Architecture
Generalization Subsystem
Factoring Subsystem
MDL Subsystem
Summary

59
Introduction

Document Type Descriptor (DTD)
a schema which specifies the internal structure
of an XML document
plays a crucial role in
the efficient storage of XML data
the effective formulation and optimization of XML
queries
XTRACT
a system for inferring a DTD for a database of
XML documents

60
Problem Definition
Given a set I of N input sequences nested
within element e, compute a DTD for e such that
every sequence in I conforms to the DTD.
ex) I = {ab, abab, ababab}
(1) (a | b)* → ANY (allows arbitrary sequences of a's and b's)
(2) ab | abab | ababab → OR of all the sequences in I
(3) ab | ab(ab | abab) → derived from (2) by factoring ab
(4) (ab)* → concise (i.e., small in size) and precise (i.e., does not cover too many sequences not contained in I)
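
The trade-off between the four candidates can be checked with Python's `re` module, treating each candidate DTD as an ordinary regular expression. The patterns below are written out from the example with `|` and `*` made explicit; the `covers` helper and the probe strings are illustrative assumptions.

```python
import re

I = ["ab", "abab", "ababab"]

candidates = {
    "any":      r"(a|b)*",            # (1) arbitrary sequences of a's and b's
    "or_all":   r"ab|abab|ababab",    # (2) OR of all input sequences
    "factored": r"ab|ab(ab|abab)",    # (3) (2) with the prefix ab factored out
    "concise":  r"(ab)*",             # (4) concise and precise
}

def covers(pattern, sequences):
    """True if every sequence conforms to the candidate as a whole."""
    return all(re.fullmatch(pattern, s) is not None for s in sequences)

# Every candidate covers I, so coverage alone cannot pick a winner.
for name, pattern in candidates.items():
    print(name, covers(pattern, I))

# Candidate (1) is too general, (2) and (3) do not generalize at all.
print(re.fullmatch(candidates["any"], "ba") is not None)
print(re.fullmatch(candidates["concise"], "ba") is not None)
print(re.fullmatch(candidates["factored"], "abababab") is not None)
print(re.fullmatch(candidates["concise"], "abababab") is not None)
```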
61
System Architecture
62
Generalization Subsystem

Generates general candidate DTDs for each input
sequence
finds patterns in the input sequence
replaces patterns with appropriate regular
expressions
metacharacters such as * and |
Inspired by real-life DTDs for limiting the set
of candidate DTDs

ex) I = {abab, bbbe}: candidate DTDs (ab)*, (a | b)*, b*e
ex) I = {ababaabb}: candidate DTDs (a | b)*, (a | b)*a*b*, (ab)*(a | b)*, (ab)*a*b*
63
Factoring Subsystem

Factors candidate DTDs in the output of the
generalization module
Uses adaptations of algorithms from the logic
optimization literature

ex) (1) SG = {bd, be} → SF = {b(d | e)}
(2) SG = {ac, ad, bc, bd} → SF = {(a | b)(c | d)}
SG: the output of the generalization module
SF: the output of the factoring module
64
MDL Subsystem

Minimum Description Length (MDL) principle
the best theory to infer from a set of data is
the one which minimizes the sum of
the length of the theory
the length of the data when encoded with the help
of the theory
the above sum is referred to as the MDL cost

ex) I = {ab, abab, ababab}
65
MDL Subsystem

Applies the MDL principle to find the best DTD D
among the candidates
D covers all sequences in I
D has minimum MDL cost
Optimal DTD selection based on MDL cost is
NP-complete
a heuristic algorithm is proposed.
For algorithms of generalization, factoring
and minimum MDL-cost selection, refer to the
paper.

66
Summary

DTD is very important for XML storage and query
processing
DTD extraction from a set of XML documents using
data mining techniques
generalization
factorization
MDL-based optimal DTD selection

67
Efficient Filtering of XML Documents for
Selective Dissemination of Information

Mehmet Altinel and Michael J. Franklin
VLDB, 2000

68
Contents

Introduction: XML-based SDI system
XFilter architecture
Filtering Method
Summary

69
Introduction

XML-based SDI system

(Figure: data sources feed XML conversion, which passes XML documents to the filter engine; the filter engine matches them against user profiles and delivers filtered data to users.)
70
XFilter Architecture
User Profiles (XPath Queries)
/a//b/c //b/d//e /c//d//e
/a/bc/d/e //d///e /b/e
XPath Parser
71
Query Index

Construction of Query Index in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
CL (Candidate List): current node
WL (Wait List): path nodes representing future
states
(Figure: the query index, showing a CL and a WL for each entry; the diagram is not preserved.)
72
XFilter Filtering Method