Title: CS760: XML Research 2
1CS760 XML Research 2
- September 16, 2002.
- Yon Dohn Chung
2Outline
- Selectivity Estimation of Path Expressions
- Indexing and Querying XML Data on RDBMS
- XML Query processing using Signatures
- Path Indexing for XML Document Retrieval
- Extraction of DTD information from XML Documents
- Filtering of XML Documents in SDI Environments
3Estimating the Selectivity of XML Path
Expressions for Internet Scale Applications
- Ashraf Aboulnaga, et al.
- VLDB, 2001
4Contents
- Introduction
- Path Trees
- Markov Tables
- Experimental Evaluation
- Summary
5Introduction
- XML queries use path expressions to navigate through the structure of XML data
- Optimizing an XML query requires estimating the selectivity of path expressions
- Database statistics used for selectivity estimation must be summarized to fit in the available memory
6Path Trees
- Construct a tree representing the structure of an
XML document
[Figure: path tree whose nodes are labeled with a tag name and a frequency]
7Path Trees
- Summarize the path tree by
- deleting low-frequency nodes
- adding ?-nodes which represent the information contained in the deleted nodes at a coarser granularity
- Summarization Methods
- sibling-?
- level-?
- global-?
- no-?
8Path Trees
- Sibling-?
- mark the lowest-frequency node A for deletion
- coalesce A and its sibling B into one ?-node if B
is a ?-node or a marked regular node
delete A, I, J, E, H, D, C, G
9Path Trees
- Level-?
- delete the lowest-frequency nodes
- coalesce all deleted nodes into a ?-node at each
level
delete A, I, J, E, H, D
10Path Trees
- Global-?
- a single ?-node represents all deleted nodes
delete A, I, J, E
11Path Trees
- No-?
- low-frequency nodes are simply deleted and not replaced with ?-nodes
- assumes that nodes not in the summarized path tree did not exist in the original tree
- To reduce the size of a path tree by n nodes, the number of nodes that each method deletes is as follows
[Table not recovered from the slide]
12Path Trees
- Selectivity Estimation
- scan the path tree looking for all nodes whose tags match the first tag of the path query
- navigate down the tree matching tags in the path query with tags in the tree
- match a tag in the path query to a ?-node if it cannot be matched to a node with a regular tag
- e.g., //A/B/C matches all of //A/?/C, //A/?/?, and //?/B/?
- the selectivity of the path query is the total frequency of the nodes which correspond to the path query
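The estimation steps above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `Node` class, the helper names, the example tree, and the rule of falling back to a ?-node separately under each parent are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field

STAR = "?"  # tag used in this sketch for the coalesced ?-nodes


@dataclass
class Node:
    tag: str
    freq: int
    children: list = field(default_factory=list)


def all_nodes(root):
    """Yield every node of the tree in preorder."""
    yield root
    for child in root.children:
        yield from all_nodes(child)


def step(nodes, tag):
    """Children of `nodes` whose tag matches; under each parent, fall back
    to ?-children only when no regular child matches (assumed rule)."""
    matched = []
    for n in nodes:
        exact = [c for c in n.children if c.tag == tag]
        matched += exact or [c for c in n.children if c.tag == STAR]
    return matched


def estimate(root, path):
    """Estimate the selectivity of //t1/t2/.../tn on a summarized path tree."""
    tags = path.strip("/").split("/")
    # Find all starting nodes for the first tag, anywhere in the tree.
    current = [root] if root.tag == tags[0] else []
    current += step(list(all_nodes(root)), tags[0])
    # Navigate down, one query tag at a time.
    for tag in tags[1:]:
        current = step(current, tag)
    # Selectivity is the total frequency of the nodes reached.
    return sum(n.freq for n in current)


# Hypothetical summarized tree: R -> A -> {B -> {C, D}, ? -> {C}}
root = Node("R", 1, [
    Node("A", 10, [
        Node("B", 6, [Node("C", 4), Node("D", 2)]),
        Node(STAR, 3, [Node("C", 1)]),
    ]),
])
print(estimate(root, "//A/B/C"), estimate(root, "//A/X/C"))
```

In this toy tree the tag X was deleted during summarization, so //A/X/C can only be answered through the ?-node under A.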
13Markov Tables
- Construct a table of all the distinct paths of length up to m and their frequency
(m = 2)
14Markov Tables
- The frequency of longer paths can be estimated using the following formula
[Formula not recovered from the slide]
- The paths in XML data are modeled as a Markov process of order m - 1
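For m = 2, an order-1 Markov estimate of a longer path can be sketched as below. The formula used here, f(t1/.../tn) = f(t1/t2) x f(t2/t3)/f(t2) x ... x f(tn-1/tn)/f(tn-1), is the usual form of such an estimate and is assumed to be what the slide refers to. The table contents and the function name are hypothetical.

```python
def estimate(table, path):
    """Order-1 Markov estimate (m = 2) of the frequency of t1/t2/.../tn.

    `table` maps every kept path of length 1 or 2 to its frequency.
    """
    tags = path.strip("/").split("/")
    if len(tags) <= 2:
        # Short paths are looked up directly in the table.
        return float(table.get("/".join(tags), 0))
    est = float(table.get("/".join(tags[:2]), 0))
    for i in range(1, len(tags) - 1):
        prefix = table.get(tags[i], 0)
        if prefix == 0:
            return 0.0
        # Multiply by the fraction of tags[i] elements that have a
        # tags[i + 1] child, i.e. f(ti/ti+1) / f(ti).
        est *= table.get("/".join(tags[i:i + 2]), 0) / prefix
    return est


# Hypothetical Markov table for m = 2.
table = {"A": 2, "B": 4, "C": 8, "D": 2, "A/B": 4, "B/C": 8, "B/D": 2}
print(estimate(table, "A/B/C"), estimate(table, "A/B/D"))
```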
15Markov Tables
- Summarize the Markov table by
- deleting low-frequency paths
- replacing the deleted paths of length 1 or 2 with ?-paths (paths of length greater than 2 are discarded)
- Summarization Methods
- suffix-?
- global-?
- no-?
16Markov Tables
[Figure: summarization example; surviving labels: SD, A/D, B/D]
17Markov Tables
- Global-?
- ? represents all deleted paths of length 1
- ?/? represents all deleted paths of length 2
- No-?
- low-frequency paths are simply discarded
- assumes that paths not in the summarized Markov
table did not exist in the original table
18Experimental Evaluation
- Data Sets
- synthetic data set and real data set
- Query Workloads
- random paths: all queries have a non-zero result size
- random tags: most queries have a result size of zero
- Path Tree Summarization
- random paths: the methods using ?-nodes are better than no-?
- random tags: no-? is the best method
- Markov Table Summarization
- random paths: suffix-? and m = 2 is best
- random tags: no-? and m = 2 is best
19Summary
- Estimating the selectivity of path expressions is very important for query optimization.
- The paper proposed two estimation methods
- Path Tree
- Markov table
20Indexing and Querying XML Data for Regular
Path Expressions
- Q. Li and B. Moon
- VLDB, 2001
21Contents
- Introduction
- Numbering Scheme for A-D Relationship
- Index and Data Organization
- Path-Join Algorithms
- Summary
22Introduction
- XML as a standard for data representation and exchange
- Challenge: indexing and querying XML
- Use a relational DBMS for XML data.
- Fast access to XML data via path expressions
- Path expressions to navigate through and retrieve XML data
- Q1: /chapter/_*/figure[@caption="Tree Frogs"]
- Q2: (E1/E2)*/E3/((E4[@A=v])|(E5/_*/E6))
23Numbering Scheme
- XML objects are modeled by a tree structure
- nodes are XML elements and attributes
- parent-child represents nesting between objects
- To process path expression queries
- (e.g.) chapter3/section, chapter3/_*/figure
- conventional approach: traverse XML trees
- new approach
- collect two object sets
- determine A-D relationship between objects
24Extended Preorder
- Annotate each node with a pair <order, size>
- for Y and its parent X,
- order(X) < order(Y) and
- order(Y) + size(Y) < order(X) + size(X)
- for siblings X and Y, if X is before Y in preorder,
- order(X) + size(X) < order(Y)
- Lemma
- X is an ancestor of Y iff order(X) < order(Y) < order(X) + size(X)
25Extended Preorder Examples
- (1, 100) is an ancestor of (17, 5)
- 1 < 17, 17 + 5 < 1 + 100
- (11, 5) and (25, 5) are siblings
- 11 + 5 < 25
- (10, 30) is not an ancestor of (45, 4)
- 10 < 45
- 45 + 4 > 10 + 30
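The three examples above can be checked with a few lines of code. This is a minimal sketch: the `Label` type and the function name are assumptions, and the strict comparisons follow the lemma as written on the slide.

```python
from typing import NamedTuple


class Label(NamedTuple):
    order: int  # extended preorder number
    size: int   # range reserved for the node's descendants


def is_ancestor(x, y):
    """Lemma: X is an ancestor of Y iff order(X) < order(Y) < order(X) + size(X)."""
    return x.order < y.order < x.order + x.size


print(is_ancestor(Label(1, 100), Label(17, 5)))   # first example
print(is_ancestor(Label(11, 5), Label(25, 5)))    # siblings
print(is_ancestor(Label(10, 30), Label(45, 4)))   # third example
```

The test needs only two integer comparisons per pair, which is what lets the join algorithms avoid traversing the tree.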
26Index and Data Organization
- Two supplementary structures
- name index (in B+-tree)
- a name string → nid
- value table stores all string values
- Element index (B+-tree)
- nid → a list of element records grouped by document ID (did)
- an element record contains (order, size), depth, parent ID
- quickly find all elements having the same name string
- Attribute index (B+-tree)
- same as the element index, except that a value ID maps to the attribute value in the value table
- Structure index (B+-tree)
- did → a list of element and attribute records: nid, <order, size>, etc.
- quickly find all objects belonging to the same document
27Path-Join Algorithms
- Decompose a path expression
- Q2: (E1/E2)*/E3/((E4[@A=v])|(E5/_*/E6))
[Figure: decomposition tree of Q2 with operands E1, E2, E3, E4, @A=v, E5, E6, connected by / and /_*/ steps, and the operators EE-Join, EA-Join, KC-Join, and Union]
28EA-Join
- Join an element set and an attribute set by A-D relationship
- (e.g.) figure[@caption="Tree Frogs"]
- Input
- {..., Ei, ...}: Ei is a set of elements from a document did
- {..., Aj, ...}: Aj is a set of attributes from a document did
- Output
- a set of (e, a) pairs such that e is a parent of a
- Algorithm
- foreach Ei and Aj with the same did do
- foreach e ∈ Ei and a ∈ Aj do
- if (e is parent of a) then output (e, a)
29EE-Join
- Join two element sets by A-D relationship
- (e.g.) chapter/_*/figure
- Input
- {..., Ei, ...} and {..., Fj, ...}: Ei and Fj are sets of elements from a document did
- Output
- a set of (e, f) pairs such that e is an ancestor of f
- Algorithm
- foreach Ei and Fj with the same did do
- foreach e ∈ Ei and f ∈ Fj do
- if (e is ancestor of f) then output (e, f)
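The EE-Join loop above can be written out as a small nested-loop join. This sketch assumes each element record carries its document ID and extended preorder label. The `Elem` type, the function names, and the sample records are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class Elem(NamedTuple):
    did: int    # document ID
    order: int  # extended preorder number
    size: int   # range reserved for descendants


def is_ancestor(e, f):
    """Ancestor test from the extended preorder lemma (strict, as on the slide)."""
    return e.order < f.order < e.order + e.size


def ee_join(E, F):
    """Return all (e, f) pairs from the same document where e is an ancestor of f."""
    by_doc = defaultdict(list)
    for f in F:
        by_doc[f.did].append(f)  # group F by document ID
    return [(e, f)
            for e in E
            for f in by_doc.get(e.did, [])
            if is_ancestor(e, f)]


chapters = [Elem(1, 1, 100), Elem(2, 1, 50)]
figures = [Elem(1, 17, 5), Elem(1, 200, 3), Elem(2, 10, 2), Elem(3, 5, 1)]
for pair in ee_join(chapters, figures):
    print(pair)
```

Grouping by document ID first mirrors the "with the same did" condition, so elements from different documents are never compared.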
30KC-Join
- (e.g.) chapter, figure, chapter/chapter
- Input
- {..., Ei, ...}: Ei is a set of elements from a document did
- Output
- a Kleene closure of {..., Ei, ...}
- Algorithm
- i := 1; Ki := {..., Ei, ...}
- repeat
- i := i + 1; Ki := EE-Join(Ki-1, K1)
- until (Ki is empty)
- output union of K1, K2, ..., Ki-1
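The KC-Join iteration can be sketched as repeated EE-Joins until no longer chain is found. Representing each Ki as a list of element chains (tuples) is an assumption of this sketch, as are the `Elem` type and the sample data.

```python
from typing import NamedTuple


class Elem(NamedTuple):
    did: int    # document ID
    order: int  # extended preorder number
    size: int   # range reserved for descendants


def is_ancestor(e, f):
    """Same-document ancestor test based on the extended preorder labels."""
    return e.did == f.did and e.order < f.order < e.order + e.size


def kc_join(E):
    """Union of K1, K2, ..., where Ki holds chains of i nested elements of E."""
    level = [(e,) for e in E]  # K1
    result = []
    while level:               # stop once Ki is empty
        result.extend(level)
        # Ki = EE-Join(K(i-1), K1): extend each chain by one descendant.
        level = [chain + (f,)
                 for chain in level
                 for f in E
                 if is_ancestor(chain[-1], f)]
    return result


c1, c2, c3 = Elem(1, 1, 100), Elem(1, 10, 20), Elem(1, 12, 5)
c4 = Elem(2, 1, 10)
closure = kc_join([c1, c2, c3, c4])
print(len(closure))
```

The loop terminates because every added element must lie strictly deeper than the previous one, and documents have finite depth.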
31Summary of Contributions
- Design a numbering scheme
- Extended Preorder
- Determine ancestor-descendant relationship
- Propose Path-Join algorithms
- Conventional tree traversal is slow
- Join algorithms to avoid tree traversal
- Design indexing and storage structures
- XISS
- Element index, Attribute index, Structure index
32A New Query Processing Technique for XML Based on
Signature
- S. Park and H.J.Kim
- DASFAA, 2001
33Contents
- Introduction
- s-DOM
- Query Processing with s-NFA
- Summary
34Introduction
- The previous index methods (path index in OODB and T-index) do not cover all possible regular path expressions, because of the storage requirement.
- It is also a problem that the index itself is semi-structured data
- The signature is one of the methods that reduce the search space
- Our idea
- add signature information to each node of XML documents
- the signature gives hints as to whether some nodes exist in the sub-tree of a specific node
- the signature is small in size
35s-DOM
- s-DOM is a DOM where we add a signature to each node
- The signature of a node is the ORing of all the hash values of its child nodes
- Algorithm
- MakeSignature(node)
- s = 0
- if node is an Element or Attribute node then
- foreach ChildNode of node do
- s = s ∨ MakeSignature(ChildNode)
- s = s ∨ Hash(ChildNode.Name)
- end for
- end if
- node.signature = s
- return s
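A runnable version of MakeSignature might look like the sketch below. It uses Python's standard `xml.etree.ElementTree` instead of a DOM and handles element children only. The hash function (one bit chosen by CRC32), the 32-bit signature width, and the `may_contain` helper are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
import zlib

BITS = 32  # signature width; an arbitrary choice for this sketch


def name_hash(name):
    """Hypothetical hash: set a single bit selected by the CRC32 of the name."""
    return 1 << (zlib.crc32(name.encode("utf-8")) % BITS)


def make_signature(node, signatures):
    """OR together the hashes of all names in the sub-tree below `node`."""
    s = 0
    for child in node:
        s |= make_signature(child, signatures)
        s |= name_hash(child.tag)
    signatures[node] = s
    return s


def may_contain(node, name, signatures):
    """False means `name` is certainly absent below `node`; True means maybe."""
    h = name_hash(name)
    return (signatures[node] & h) == h


doc = ET.fromstring("<book><chapter><figure/></chapter><title/></book>")
signatures = {}
make_signature(doc, signatures)
print(may_contain(doc[0], "figure", signatures))  # chapter sub-tree
print(may_contain(doc[1], "figure", signatures))  # title sub-tree
```

A negative answer lets a traversal skip the whole sub-tree. A positive answer can be a false positive when two names hash to the same bit, so the sub-tree still has to be visited.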
36DOM An Example
37s-DOM
<Hash value of strings>
<Signature of a node in s-DOM>
38Query Processing
- Query processing with NFA
- a regular path expression is a regular expression, and thus can be transformed into an NFA
- therefore, a regular path expression can be processed through an NFA
- s-NFA is an NFA whose state nodes have signatures
- the signature is the ORed hash values of all the labels along an NFA path of a state node (called path signatures)
- query processing with s-NFA reduces the search space
39s-NFA
<Path Signatures>
40Summary
- s-DOM
- add a signature to each node in DOM
- the signature of a node is the ORed signature values of its descendants
- s-NFA
- add a signature to each state in NFA
- the signature of a state is the ORed signature values of the path to the node
- Using signature methods, the search space for tree traversal is reduced.
41An Index Scheme for Efficient Retrieval of XML
Documents
42Contents
- Problem Definition
- Related Work
- the inverted file
- Motivation
- The Proposed Index Structure
- Analysis
- An Improvement
- Summary
43Problem Definition
- Input
- Set of XML documents
- Set of path information
- Path query
- Regular path expression
- Output
- IDs of documents which contain a path that satisfies the path query
44Related Work
45Motivation
- Traditional inverted file
- No false matches for plain documents
- False matches occur for XML documents
- Does not consider the hierarchy of the elements
- Can only provide the candidate set
- How about using paths for inversion?
- No false matches!
- But, tremendous replication will occur.
- e.g.
- a, a/b, a/b/c, a/b/c/d
- a is replicated 4 times, b is replicated 3
times, c is replicated twice.
46The Proposed Method
- Transform to reduce replication
/invoice, /invoice/buyer, /invoice/buyer/name, /invoice/buyer/address
47The Proposed Index
48Analysis
- Space analysis
- the number of nodes in a k-ary tree with depth n
- the number of nodes in case of no transform
- thus, we can save space by more than (n-1) times
49Analysis
- Worst cases in query processing
- if the query contains the // operator
- e.g.
- //address
- all nodes in the tree must be traversed
- /invoice//person
- all nodes in sub-trees below /invoice must be traversed
50An Improvement
- A solution for handling //
- construct short-cuts for every vocabulary such that
- it must be easy to get the list of nodes which are located behind // in the query
- it must be easy to determine the ancestor/descendant relation between the before-nodes and behind-nodes of // in the query
51An Improvement
52An Improvement
- Query processing
- e.g. /a/b/_/c//d/e
- 1. normal tree traversal before //
- make a candidate node list A
- 2. vocabulary lookup when // appears
- acquire all nodes with the tag behind //, candidate node list B
- check ancestor/descendant relationships between nodes in A and B
53Experiment
- Environment
- Windows XP, Pentium4 2GHz, 512MB
- JDK 1.4, Xerces 1.4.4
DocBook
NITF
54Experiment Result
Processing Time for Document Retrieval
DocBook
NITF
55Experiment Result
The Number of Filtered Documents
DocBook
NITF
56Summary
- Inversion of path information of XML documents
- a method for XML document retrieval
- also, a preprocessing method for XML query processing.
- an index structure for a set of XML documents, not a single XML document.
57XTRACT: A System for Extracting Document Type
Descriptors from XML Documents
- Minos Garofalakis, et al.
- SIGMOD, 2000
58Contents
- Introduction
- Problem Definition
- System Architecture
- Generalization Subsystem
- Factoring Subsystem
- MDL Subsystem
- Summary
59Introduction
- Document Type Descriptor (DTD)
- a schema which specifies the internal structure of an XML document
- plays a crucial role in
- the efficient storage of XML data
- the effective formulation and optimization of XML queries
- XTRACT
- a system for inferring a DTD for a database of XML documents
60Problem Definition
Given a set I of N input sequences nested within element e, compute a DTD for e such that every sequence in I conforms to the DTD.
ex) I = {ab, abab, ababab}
(1) (a | b)* → ANY (allows any arbitrary sequence of a's and b's)
(2) ab | abab | ababab → the OR of all the sequences in I
(3) ab | ab(ab | abab) → derived from (2) by factoring ab
(4) (ab)* → concise (i.e., small in size) and precise (i.e., does not cover too many sequences not contained in I)
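The trade-off between the four candidates can be made concrete with regular expressions. The sketch below reads the candidates as (a|b)*, ab|abab|ababab, ab|ab(ab|abab), and (ab)*, which is an assumption about the slide's lost metacharacters. It checks that each one covers I and counts how many strings over {a, b} of length at most 6 each accepts. The function names are hypothetical.

```python
import re
from itertools import product

I = ["ab", "abab", "ababab"]
CANDIDATES = ["(a|b)*", "ab|abab|ababab", "ab|ab(ab|abab)", "(ab)*"]


def covers(pattern, sequences):
    """True if every input sequence conforms to the candidate DTD."""
    return all(re.fullmatch(pattern, s) is not None for s in sequences)


def accepted(pattern, max_len=6):
    """Count the strings over {a, b} up to max_len that the candidate accepts."""
    count = 0
    for n in range(max_len + 1):
        for letters in product("ab", repeat=n):
            if re.fullmatch(pattern, "".join(letters)):
                count += 1
    return count


for pattern in CANDIDATES:
    print(pattern, len(pattern), covers(pattern, I), accepted(pattern))
```

All four candidates cover I, so coverage alone cannot pick one. The pattern length stands in for conciseness and the accepted count for precision. This is only an informal proxy for the MDL cost, not the paper's encoding.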
61System Architecture
62Generalization Subsystem
- Generates general candidate DTDs for each input sequence
- finds patterns in the input sequence
- replaces patterns with appropriate regular expressions
- metacharacters such as * and |
- Inspired by real-life DTDs for limiting the set of candidate DTDs
ex) I = {abab, bbbe}: candidate DTDs (ab)*, (a | b)*, b*e
ex) I = {ababaabb}: candidate DTDs (a | b)*, (a | b)*a*b*, (ab)*(a | b)*, (ab)*a*b*
63Factoring Subsystem
- Factors candidate DTDs in the output of the generalization module
- Uses adaptations of algorithms from the logic optimization literature
ex) (1) SG = {bd, be} → SF = {b(d | e)}
(2) SG = {ac, ad, bc, bd} → SF = {(a | b)(c | d)}
SG: the output of the generalization module
SF: the output of the factoring module
64MDL Subsystem
- Minimum Description Length (MDL) principle
- the best theory to infer from a set of data is the one which minimizes the sum of
- the length of the theory
- the length of the data when encoded with the help of the theory
- the above sum is referred to as the MDL cost
ex) I = {ab, abab, ababab}
65MDL Subsystem
- Applies the MDL principle to find the best DTD D among the candidates
- D covers all sequences in I
- D has minimum MDL cost
- Optimal DTD selection based on MDL cost is NP-complete
- a heuristic algorithm is proposed.
- For algorithms of generalization, factoring and minimum MDL-cost selection, refer to the paper.
66Summary
- DTD is very important for XML storage and query processing
- DTD extraction from a set of XML documents using data mining techniques
- generalization
- factorization
- MDL-based optimal DTD selection
67Efficient Filtering of XML Documents for
Selective Dissemination of Information
- Mehmet Altinel and Michael J. Franklin
- VLDB, 2000
68Contents
- Introduction: XML-based SDI system
- XFilter architecture
- Filtering Method
- Summary
69Introduction
User Profiles
Filtered Data
XML Documents
XML Conversion
Filter Engine
Users
Data Sources
70XFilter Architecture
User Profiles (XPath Queries)
/a//b/c //b/d//e /c//d//e
/a/bc/d/e //d///e /b/e
XPath Parser
71Query Index
- Construction of Query Index in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
CL (Candidate List): current node
WL (Wait List): path nodes representing future states
[Figure: Query Index with the CL and WL of each element name]
72XFilter Filtering Method
- Filtering Example in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each element name at this step]
73XFilter Filtering Method
- Filtering Example in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each element name at this step]
74XFilter Filtering Method
- Filtering Example in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each element name at this step]
matching Q1
75XFilter Filtering Method
- Filtering Example in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each element name at this step]
matching Q1
76XFilter Filtering Method
- Filtering Example in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each element name at this step]
77XFilter Filtering Method
- Filtering Example in XFilter System
Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each element name at this step]
78Summary
- Information filtering methods are needed for XML-based SDI systems
- The paper proposed the XFilter filtering system
- user profiles are constructed with XPath queries
- Query Index: indexing XPath queries
- FSM-based filtering method through the Query Index
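The idea of FSM-based filtering can be illustrated with a much simplified matcher. This is not XFilter's Query Index with candidate and wait lists. It is a sketch that keeps one small state machine per query and advances it on the start and end events of a streamed document. It supports only / and // steps with plain element names or *, and all function names are hypothetical.

```python
import io
import re
import xml.etree.ElementTree as ET


def parse_query(q):
    """Split '/a//c/b' into [('/', 'a'), ('//', 'c'), ('/', 'b')]."""
    return re.findall(r"(//|/)([^/]+)", q)


def filter_document(xml_text, queries):
    """Return the names of the queries that the document matches."""
    steps = {name: parse_query(q) for name, q in queries.items()}
    # Per query: a stack of active-state sets; state i waits for steps[i].
    stacks = {name: [{0}] for name in queries}
    matched = set()
    source = io.BytesIO(xml_text.encode("utf-8"))
    for event, elem in ET.iterparse(source, events=("start", "end")):
        for name, stack in stacks.items():
            if event == "end":
                stack.pop()  # leave the element: restore the parent's states
                continue
            active = set()
            for i in stack[-1]:
                axis, tag = steps[name][i]
                if axis == "//":
                    active.add(i)  # a '//' step may skip this element
                if tag in (elem.tag, "*"):
                    if i + 1 == len(steps[name]):
                        matched.add(name)  # final state reached
                    else:
                        active.add(i + 1)
            stack.append(active)
    return matched


queries = {"Q1": "/a/b/c", "Q2": "/a//c/b", "Q3": "/b/a"}
print(filter_document("<a><b><c></c></b></a>", queries))
```

XFilter shares work across queries by indexing their path nodes under element names. This sketch instead loops over every query at every event, so it shows the state transitions but not the indexing benefit.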