CS760: XML Research 2

1
CS760 XML Research 2
  • September 16, 2002.
  • Yon Dohn Chung

2
Outline
  • Selectivity Estimation of Path Expressions
  • Indexing and Querying XML Data on RDBMS
  • XML Query processing using Signatures
  • Path Indexing for XML Document Retrieval
  • Extraction of DTD information from XML Documents
  • Filtering of XML Documents in SDI Environments

3
Estimating the Selectivity of XML Path
Expressions for Internet Scale Applications
  • Ashraf Aboulnaga, et al.
  • VLDB, 2001

4
Contents
  • Introduction
  • Path Trees
  • Markov Tables
  • Experimental Evaluation
  • Summary

5
Introduction
  • XML queries use path expressions to navigate
    through the structure of XML data
  • Optimizing an XML query requires estimating the
    selectivity of path expressions
  • Database statistics used for selectivity
    estimation must be summarized to fit in the
    available memory

6
Path Trees
  • Construct a tree representing the structure of an
    XML document

[Figure: path tree; each node is labeled with a tag name and its frequency]
7
Path Trees
  • Summarize the path tree by
  • deleting low-frequency nodes
  • adding *-nodes which represent the information
    contained in the deleted nodes at a coarser
    granularity
  • Summarization Methods
  • sibling-*
  • level-*
  • global-*
  • no-*

8
Path Trees
  • Sibling-*
  • mark the lowest-frequency node A for deletion
  • coalesce A and its sibling B into one *-node if B
    is a *-node or a marked regular node

delete A, I, J, E, H, D, C, G
9
Path Trees
  • Level-*
  • delete the lowest-frequency nodes
  • coalesce all deleted nodes into a *-node at each
    level

delete A, I, J, E, H, D
10
Path Trees
  • Global-*
  • a single *-node represents all deleted nodes

delete A, I, J, E
11
Path Trees
  • No-*
  • low-frequency nodes are simply deleted and not
    replaced with *-nodes
  • assumes that nodes not in the summarized path
    tree did not exist in the original tree
  • To reduce the size of a path tree by n nodes, the
    number of nodes that each method deletes is as follows

12
Path Trees
  • Selectivity Estimation
  • scan the path tree looking for all nodes whose
    tags match the first tag of the path query
  • navigate down the tree matching tags in the path
    query with tags in the tree
  • match a tag in the path query to a *-node if it
    cannot be matched to a node with a regular tag
  • e.g., //A/B/C matches all of //A/*/C, //A/*/*,
    and //*/B/*
  • the selectivity of the path query is the total
    frequency of the nodes which correspond to the
    path query
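
The estimation steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the Node class, the estimate function, and the rule that a tag falls back to *-nodes only when no regular node matches (including for the first tag) are assumptions made for this example.

```python
from dataclasses import dataclass, field

STAR = "*"  # tag used here for coalesced (summarized) nodes


@dataclass
class Node:
    tag: str
    freq: int
    children: list = field(default_factory=list)


def all_nodes(root):
    """Yield every node of the path tree."""
    stack = [root]
    while stack:
        node = stack.pop()
        yield node
        stack.extend(node.children)


def match(candidates, tag):
    """Prefer nodes with a regular matching tag; otherwise fall back to *-nodes."""
    exact = [n for n in candidates if n.tag == tag]
    return exact or [n for n in candidates if n.tag == STAR]


def estimate(root, path):
    """Estimate the selectivity of a simple path query such as //A/B/C."""
    tags = [t for t in path.split("/") if t]
    current = match(list(all_nodes(root)), tags[0])
    for tag in tags[1:]:
        nxt = []
        for node in current:
            nxt.extend(match(node.children, tag))
        current = nxt
    # Selectivity = total frequency of the nodes reached by the query.
    return sum(n.freq for n in current)


# Hypothetical summarized path tree.
root = Node("A", 1, [
    Node("B", 5, [Node("C", 10), Node(STAR, 3)]),
    Node(STAR, 4, [Node("C", 2)]),
])
print(estimate(root, "//A/B/C"))  # exact match on every level
print(estimate(root, "//A/B/D"))  # D falls back to the *-node under B
```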

13
Markov Tables
  • Construct a table of all the distinct paths of
    length up to m and their frequency

(m = 2)
14
Markov Tables
  • The frequency of longer paths can be estimated
    using the following formula
  • The paths in XML data are modeled as a Markov
    process of order m - 1
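
The formula itself did not survive in this transcript. As a hedged illustration only, the sketch below assumes m = 2 and the usual first-order Markov chain rule, f(t1/.../tn) ≈ f(t1/t2) × Π f(ti/ti+1) / f(ti). The table contents and the function name estimate are hypothetical.

```python
def estimate(table, path):
    """Estimate the frequency of a path from a Markov table with m = 2.

    table maps paths of length 1 ("B") and 2 ("A/B") to their frequencies.
    """
    tags = path.strip("/").split("/")
    if len(tags) <= 2:
        # Paths of length up to m are stored directly in the table.
        return table.get("/".join(tags), 0)
    est = float(table.get(tags[0] + "/" + tags[1], 0))
    for i in range(1, len(tags) - 1):
        pair = table.get(tags[i] + "/" + tags[i + 1], 0)
        single = table.get(tags[i], 0)
        if single == 0:
            return 0.0
        # Fraction of tags[i] elements that have a tags[i + 1] child path.
        est *= pair / single
    return est


# Hypothetical Markov table for m = 2.
table = {"A": 3, "X": 1, "B": 8, "C": 4, "A/B": 6, "X/B": 2, "B/C": 4}
print(estimate(table, "//A/B/C"))
```

The estimate for //A/B/C uses only f(A/B), f(B/C), and f(B), so it cannot tell whether the C elements sit under B elements that came from A or from X. That is the information the Markov assumption gives up in exchange for a small table.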

15
Markov Tables
  • Summarize the Markov table by
  • deleting low-frequency paths
  • replacing the deleted paths of length 1 or 2 with
    *-paths (paths of length greater than 2 are
    discarded)
  • Summarization Methods
  • suffix-*
  • global-*
  • no-*

16
Markov Tables
  • Suffix-*

[Figure: suffix-* example over the set SD of deleted paths, including A/D and B/D]
17
Markov Tables
  • Global-*
  • * represents all deleted paths of length 1
  • */* represents all deleted paths of length 2
  • No-*
  • low-frequency paths are simply discarded
  • assumes that paths not in the summarized Markov
    table did not exist in the original table

18
Experimental Evaluation
  • Data Sets
  • synthetic data set and real data set
  • Query Workloads
  • random paths: all queries have a non-zero result
    size
  • random tags: most queries have a result size of
    zero
  • Path Tree Summarization
  • random paths: the methods using *-nodes are
    better than no-*
  • random tags: no-* is the best method
  • Markov Table Summarization
  • random paths: suffix-* with m = 2 is best
  • random tags: no-* with m = 2 is best

19
Summary
  • The selectivity of path expressions is very
    important for query optimization.
  • The paper proposed two estimation methods
  • Path Tree
  • Markov table

20
Indexing and Querying XML Data for Regular
Path Expressions
  • Q. Li and B. Moon
  • VLDB, 2001

21
Contents
  • Introduction
  • Numbering Scheme for A-D Relationship
  • Index and Data Organization
  • Path-Join Algorithms
  • Summary

22
Introduction
  • XML as a standard for data representation and
    exchange
  • Challenge: indexing and querying XML
  • Use a relational DBMS for XML data
  • Fast access to XML data via path expressions
  • Path expressions to navigate through and retrieve
    XML data
  • Q1: /chapter/_*/figure[@caption = "Tree Frogs"]
  • Q2: (E1/E2)*/E3/((E4[@A = v]) | (E5/_*/E6))

23
Numbering Scheme
  • XML objects are modeled by a tree structure
  • nodes are XML elements and attributes
  • parent-child represents nesting between objects
  • To process path expression queries
  • (e.g.) chapter[3]/section, chapter[3]/_*/figure
  • conventional approach: traverse XML trees
  • new approach
  • collect two object sets
  • determine A-D relationship between objects

24
Extended Preorder
  • Annotate a node with a pair <order, size>
  • for Y and its parent X,
  • order(X) < order(Y) and
  • order(Y) + size(Y) <= order(X) + size(X)
  • for siblings X and Y, if X is before Y in
    preorder,
  • order(X) + size(X) < order(Y)
  • Lemma
  • X is an ancestor of Y iff order(X) < order(Y) <=
    order(X) + size(X)

25
Extended Preorder Examples
  • (1, 100) is an ancestor of (17, 5)
  • 1 < 17, 17 + 5 < 1 + 100
  • (11, 5) and (25, 5) are siblings
  • 11 + 5 < 25
  • (10, 30) is not an ancestor of (45, 4)
  • 10 < 45
  • 45 + 4 > 10 + 30
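
The ancestor test from the lemma is a constant-time comparison of two labels. The sketch below replays the three examples from this slide; the Label type and the function name is_ancestor are chosen for illustration and are not taken from the paper.

```python
from typing import NamedTuple


class Label(NamedTuple):
    order: int  # extended preorder number of the node
    size: int   # range reserved for the node's descendants


def is_ancestor(x: Label, y: Label) -> bool:
    """X is an ancestor of Y iff order(X) < order(Y) <= order(X) + size(X)."""
    return x.order < y.order <= x.order + x.size


print(is_ancestor(Label(1, 100), Label(17, 5)))  # ancestor
print(is_ancestor(Label(11, 5), Label(25, 5)))   # siblings, so not an ancestor
print(is_ancestor(Label(10, 30), Label(45, 4)))  # 45 lies outside 10..40
```

Because the test needs only the two labels, the relationship can be decided without traversing the tree, which is what the join algorithms rely on.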

26
Index and Data Organization
  • Two supplementary structures
  • name index (in a B+ tree)
  • maps a name string → nid
  • value table stores all string values
  • Element index (B+ tree)
  • nid → a list of element records grouped by
    document ID (did)
  • an element record contains (order, size), depth,
    parent ID
  • quickly find all elements having the same name
    string
  • Attribute index (B+ tree)
  • same as the element index, except that a value ID
    maps to the attribute value in the value table
  • Structure index (B+ tree)
  • did → a list of element and attribute records:
    nid, <order, size>, etc.
  • quickly find all objects belonging to the same
    document

27
Path-Join Algorithms
  • Decompose a path expression
  • Q2: (E1/E2)*/E3/((E4[@A = v]) | (E5/_*/E6))

[Figure: decomposition of Q2 into a tree of EE-Join, EA-Join, KC-Join, and Union operations over E1, E2, E3, E4, @A = v, E5, and E6]
28
EA-Join
  • Join an element set and an attribute set by the A-D
    relationship
  • (e.g.) figure[@caption = "Tree Frogs"]
  • Input
  • {..., Ei, ...}: Ei is a set of elements from a
    document did
  • {..., Aj, ...}: Aj is a set of attributes from a
    document did
  • Output
  • a set of (e, a) pairs such that e is a parent of
    a
  • Algorithm
  • foreach Ei and Aj with the same did do
  • foreach e ∈ Ei and a ∈ Aj do
  • if (e is parent of a) then output (e, a)

29
EE-Join
  • Join two element sets by A-D relationship
  • (e.g.) chapter/_*/figure
  • Input
  • {..., Ei, ...} and {..., Fj, ...}: Ei and Fj
    are sets of elements from a document did
  • Output
  • a set of (e, f) pairs such that e is an
    ancestor of f
  • Algorithm
  • foreach Ei and Fj with the same did do
  • foreach e ∈ Ei and f ∈ Fj do
  • if (e is ancestor of f) then output (e, f)
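
The nested-loop form of EE-Join shown above can be written out directly. This is a sketch under assumptions: each input is modeled as a dictionary from document ID to a list of (order, size) labels, and ee_join is a name chosen for this example. The paper's actual algorithm may exploit sorted element lists rather than comparing every pair.

```python
def is_ancestor(e, f):
    """Extended preorder test on (order, size) labels."""
    return e[0] < f[0] <= e[0] + e[1]


def ee_join(E, F):
    """Return (did, e, f) triples where e is an ancestor of f in document did."""
    out = []
    # Only elements from the same document can be related.
    for did in sorted(E.keys() & F.keys()):
        for e in E[did]:
            for f in F[did]:
                if is_ancestor(e, f):
                    out.append((did, e, f))
    return out


# Hypothetical inputs: chapter elements and figure elements per document.
chapters = {1: [(1, 100)], 2: [(10, 30)]}
figures = {1: [(17, 5)], 2: [(45, 4)], 3: [(2, 1)]}
print(ee_join(chapters, figures))
```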

30
KC-Join
  • (e.g.) chapter, figure, chapter/chapter
  • Input
  • {..., Ei, ...}: Ei is a set of elements from a
    document did
  • Output
  • a Kleene closure of {..., Ei, ...}
  • Algorithm
  • i = 1; K1 = {..., Ei, ...}
  • repeat
  • i = i + 1; Ki = EE-Join(Ki-1, K1)
  • until (Ki is empty)
  • output union of K1, K2, ..., Ki-1
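
The repeat-until loop can be sketched as follows for the elements of a single document. Representing each Ki as a list of element chains, and extending a chain when its last element is an ancestor of the new element, are assumptions made for this illustration; the slide does not say how the join results are represented.

```python
def is_ancestor(e, f):
    """Extended preorder test on (order, size) labels."""
    return e[0] < f[0] <= e[0] + e[1]


def kc_join(k1):
    """Iterate EE-Join against K1 until no new chains are produced."""
    levels = [[(e,) for e in k1]]  # K1: chains of length 1
    while True:
        # Ki = EE-Join(Ki-1, K1): extend each chain by one descendant.
        nxt = [chain + (e,)
               for chain in levels[-1]
               for e in k1
               if is_ancestor(chain[-1], e)]
        if not nxt:  # until (Ki is empty)
            break
        levels.append(nxt)
    # Output the union of K1, K2, ..., Ki-1.
    return [chain for level in levels for chain in level]


# Hypothetical nested elements with the same tag in one document.
nested = [(1, 10), (2, 5), (3, 1)]
for chain in kc_join(nested):
    print(chain)
```

The loop terminates because every extension strictly increases the order number of the last element in a chain.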

31
Summary of Contributions
  • Design a numbering scheme
  • Extended Preorder
  • Determine ancestor-descendant relationship
  • Propose Path-Join algorithms
  • Conventional tree traversal is slow
  • Join algorithms to avoid tree traversal
  • Design indexing and storage structures
  • XISS
  • Element index, Attribute index, Structure index

32
A New Query Processing Technique for XML Based on
Signature
  • S. Park and H.J.Kim
  • DASFAA, 2001

33
Contents
  • Introduction
  • s-DOM
  • Query Processing with s-NFA
  • Summary

34
Introduction
  • The previous index methods (path index in OODB
    and T-index) do not cover all possible regular
    path expressions because of the storage requirement.
  • It is also a problem that the index itself is
    semi-structured data
  • Signatures are one method of reducing the
    search space
  • Our idea
  • add signature information to each node of XML
    documents
  • the signature gives hints as to whether some
    nodes exist in the sub-tree of a specific node
  • a signature is very small
35
s-DOM
  • s-DOM is a DOM where we add a signature to each
    node
  • The signature of a node is the OR of the hash
    values of the names of all nodes in its sub-tree
  • Algorithm
  • MakeSignature(node)
  • s = 0
  • if node is an Element or Attribute node then
  • foreach ChildNode of node do
  • s = s ∨ MakeSignature(ChildNode)
  • s = s ∨ Hash(ChildNode.Name)
  • end for
  • end if
  • node.signature = s
  • return s
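
A runnable version of MakeSignature might look like the sketch below. The Node class, the 16-bit signature width, the CRC-based hash_name function, and the may_contain helper are all assumptions for illustration; the slides do not specify the hash function or the signature length.

```python
import zlib
from dataclasses import dataclass, field

BITS = 16  # assumed signature width


def hash_name(name):
    """Map a node name to a signature with a single bit set."""
    return 1 << (zlib.crc32(name.encode("utf-8")) % BITS)


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    signature: int = 0


def make_signature(node):
    """OR together the hashes of all names in the sub-tree below node."""
    s = 0
    for child in node.children:
        s |= make_signature(child)
        s |= hash_name(child.name)
    node.signature = s
    return s


def may_contain(node, name):
    """False means name is certainly absent below node; True is only a hint."""
    h = hash_name(name)
    return (node.signature & h) == h


figure = Node("figure")
chapter = Node("chapter", [figure])
book = Node("book", [chapter, Node("title")])
make_signature(book)
print(may_contain(book, "figure"))
```

A query processor can skip a whole sub-tree when may_contain returns False. A True result can be a false positive, because different names may hash to the same bit, so the sub-tree still has to be checked.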

36
DOM: An Example
37
s-DOM
<Hash value of strings>
<Signature of a node in s-DOM>
38
Query Processing
  • Query processing with NFA
  • a regular path expression is a regular
    expression, thus can be transformed into NFA
  • therefore, a regular path expression can be
    processed through an NFA
  • s-NFA is an NFA of which state nodes have
    signatures
  • the signature is the ORed hash values of all the
    labels along an NFA path of a state node (called
    path signatures)
  • query processing with s-NFA reduces the search
    space

39
s-NFA
<Path Signatures>
40
Summary
  • s-DOM
  • add a signature to each node in DOM
  • the signature of a node is the ORed signature
    values of its descendants
  • s-NFA
  • add a signature to each state in NFA
  • the signature of a state is the ORed signature
    values of the path to the node
  • Using signature methods, the search space for
    tree traversal is reduced.

41
An Index Scheme for Efficient Retrieval of XML
Documents
  • J. H. Kim, et al.

42
Contents
  • Problem Definition
  • Related Work
  • the inverted file
  • Motivation
  • The Proposed Index Structure
  • Analysis
  • An Improvement
  • Summary

43
Problem Definition
  • Input
  • Set of XML documents
  • Set of path information
  • Path query
  • Regular path expression
  • Output
  • IDs of documents that contain a path which
    satisfies the path query

44
Related Work
  • The inverted file

45
Motivation
  • Traditional inverted file
  • No false matches for plain documents
  • False matches occur for XML documents
  • Does not consider the element hierarchy
  • Can only provide the candidate set
  • How about using paths for inversion?
  • No false matches!
  • But tremendous replication will occur.
  • e.g.
  • a, a/b, a/b/c, a/b/c/d
  • a is replicated 4 times, b is replicated 3
    times, c is replicated twice.
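
The trade-off between the two kinds of inversion can be seen on a toy collection. The sketch below is illustrative only: the documents, build_index, tag_candidates, and path_matches are hypothetical names, and the paths of each document are assumed to be given as input.

```python
from collections import defaultdict


def build_index(docs, key_fn):
    """Invert docs (did -> list of paths) on the keys produced by key_fn."""
    index = defaultdict(set)
    for did, paths in docs.items():
        for path in paths:
            for key in key_fn(path):
                index[key].add(did)
    return index


docs = {
    1: ["a", "a/b", "a/b/c"],
    2: ["a", "a/c"],
}
tag_index = build_index(docs, lambda p: p.split("/"))  # traditional inverted file
path_index = build_index(docs, lambda p: [p])          # inversion on whole paths


def tag_candidates(query):
    """Documents containing every tag of the query, regardless of hierarchy."""
    return set.intersection(*[tag_index.get(t, set()) for t in query.split("/")])


def path_matches(query):
    """Documents containing exactly this path."""
    return path_index.get(query, set())


print(tag_candidates("a/c"))  # includes document 1, a false match
print(path_matches("a/c"))
```

Document 1 contains both tags a and c but not the path a/c, so the tag-based index can only return it as a candidate. The path-based index is exact, at the cost of storing the tag a in every one of its keys.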

46
The Proposed Method
  • Transform to reduce replication

/invoice
/invoice/buyer
/invoice/buyer/name
/invoice/buyer/address
47
The Proposed Index
  • The architecture

48
Analysis
  • Space analysis
  • the number of nodes in a k-ary tree with depth n
  • the number of nodes in case of no transform
  • thus, we can save space by more than (n-1) times

49
Analysis
  • Worst cases in query processing
  • if the query contains the // operator
  • e.g.
  • //address
  • all nodes in the tree must be traversed
  • /invoice//person
  • all nodes in sub-trees below /invoice must be
    traversed

50
An Improvement
  • A solution for handling //
  • construct short-cuts for every vocabulary such
    that
  • it must be easy to get the list of nodes which
    are located behind // in the query
  • it must be easy to determine the
    ancestor/descendant relation between the
    before-nodes and behind-nodes of // in the query

51
An Improvement
  • Architecture

52
An Improvement
  • Query processing
  • e.g. /a/b/_/c//d/e
  • 1. normal tree traversal before //
  • make a candidate node list A
  • 2. vocabulary lookup when // appears
  • acquire all nodes with the tag behind // as
    candidate node list B
  • check ancestor/descendant relationships between
    nodes in A and B

53
Experiment
  • Environment
  • Windows XP, Pentium4 2GHz, 512MB
  • JDK 1.4, Xerces 1.4.4

[Figure panels: DocBook, NITF]
54
Experiment Result
[Figure: processing time for document retrieval; panels: DocBook, NITF]
55
Experiment Result
[Figure: the number of filtered documents; panels: DocBook, NITF]
56
Summary
  • Inversion of path information of XML documents
  • a method for XML document retrieval
  • also, a preprocessing method for XML query
    processing.
  • an index structure for a set of XML documents,
    not a single XML document.

57
XTRACT: A System for Extracting Document Type
Descriptors from XML Documents
  • Minos Garofalakis, et al.
  • SIGMOD, 2000

58
Contents
  • Introduction
  • Problem Definition
  • System Architecture
  • Generalization Subsystem
  • Factoring Subsystem
  • MDL Subsystem
  • Summary

59
Introduction
  • Document Type Descriptor (DTD)
  • a schema which specifies the internal structure
    of an XML document
  • plays a crucial role in
  • the efficient storage of XML data
  • the effective formulation and optimization of XML
    queries
  • XTRACT
  • a system for inferring a DTD for a database of
    XML documents

60
Problem Definition
Given a set I of N input sequences nested
within element e, compute a DTD for e such that
every sequence in I conforms to the DTD.
ex) I = {ab, abab, ababab}
(1) (a | b)* → ANY (allows arbitrary sequences of a's and b's)
(2) ab | abab | ababab → the OR of all the sequences in I
(3) ab | ab(ab | abab) → derived from (2) by factoring ab
(4) (ab)* → concise (i.e., small in size) and precise (i.e., does not
    cover too many sequences not contained in I)
61
System Architecture
62
Generalization Subsystem
  • Generates general candidate DTDs for each input
    sequence
  • finds patterns in the input sequence
  • replaces patterns with appropriate regular
    expressions
  • metacharacters such as * and |
  • Inspired by real-life DTDs for limiting the set
    of candidate DTDs

ex) I = {abab, bbbe}: candidate DTDs (ab)*, (a | b)*, b*e
ex) I = {ababaabb}: candidate DTDs (a | b)*, (a | b)*a*b*, (ab)*(a | b)*, (ab)*a*b*
63
Factoring Subsystem
  • Factors candidate DTDs in the output of the
    generalization module
  • Uses adaptations of algorithms from the logic
    optimization literature

ex) (1) SG = {bd, be} → SF = {b(d | e)}
(2) SG = {ac, ad, bc, bd} → SF = {(a | b)(c | d)}
SG: the output of the generalization module
SF: the output of the factoring module
64
MDL Subsystem
  • Minimum Description Length (MDL) principle
  • the best theory to infer from a set of data is
    the one which minimizes the sum of
  • the length of the theory
  • the length of the data when encoded with the help
    of the theory
  • the above sum is referred to as the MDL cost

ex) I = {ab, abab, ababab}
65
MDL Subsystem
  • Applies the MDL principle to find the best DTD D
    among the candidates
  • D covers all sequences in I
  • D has minimum MDL cost
  • Optimal DTD selection based on MDL cost is
    NP-complete
  • a heuristic algorithm is proposed.
  • For algorithms of generalization, factoring
    and minimum MDL-cost selection, refer to the
    paper.

66
Summary
  • DTD is very important for XML storage and query
    processing
  • DTD extraction from a set of XML documents using
    data mining techniques
  • generalization
  • factorization
  • MDL-based optimal DTD selection

67
Efficient Filtering of XML Documents for
Selective Dissemination of Information
  • Mehmet Altinel and Michael J. Franklin
  • VLDB, 2000

68
Contents
  • Introduction: XML-based SDI system
  • XFilter architecture
  • Filtering Method
  • Summary

69
Introduction
  • XML-based SDI system

[Figure: XML-based SDI system with Data Sources, XML Conversion, XML Documents, User Profiles, Filter Engine, Filtered Data, and Users]
70
XFilter Architecture
User Profiles (XPath Queries)
/a//b/c //b/d//e /c//d//e
/a/bc/d/e //d///e /b/e
XPath Parser
71
Query Index
  • Construction of Query Index in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
CL (Candidate List): current node
WL (Wait List): path nodes representing future states
[Figure: Query Index with the CL and WL of each entry]
72
XFilter Filtering Method
  • Filtering Example in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each entry]
73
XFilter Filtering Method
  • Filtering Example in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each entry]
74
XFilter Filtering Method
  • Filtering Example in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each entry; matching Q1]
75
XFilter Filtering Method
  • Filtering Example in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each entry; matching Q1]
76
XFilter Filtering Method
  • Filtering Example in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each entry]
77
XFilter Filtering Method
  • Filtering Example in XFilter System

Q1 = /a/b/c, Q2 = /a//c/b, Q3 = /b/a
Document: <a> <b> <c> </c> </b> </a>
[Figure: Query Index with the CL and WL of each entry]
78
Summary
  • Information filtering methods are needed for
    XML-based SDI systems
  • The paper proposed the XFilter filtering system
  • user profiles are constructed with XPath queries
  • Query Index indexing XPath queries
  • FSM-based Filtering method through Query Index