Title: Approximate XML Query Answers
1- Approximate XML Query Answers
Authors: N. Polyzotis, M. Garofalakis, Y. Ioannidis
Presenter: Hongyu Guo
2Outline of this talk
- Motivation
- TreeSketch Approach
- Experimental Results
- Contributions and Limitations
3Outline
- Motivation
- TreeSketch Approach
- Experimental Results
- Contributions and Limitations
4Motivations
--Need fast feedback
- XML de-facto standard for data exchange
- Need to explore large XML data sets and get fast feedback from complex XML queries
- Conflict between fast on-line response and query execution cost
5XML Query Challenges
- Involve complex traversals of the XML data hierarchy
- Complex queries over massive tree-structured data are very expensive
- Approaches: optimize the query or optimize the data structure
- Exact results are often not needed; we can instead return approximate query answers
6Approximate Query answers
[Figure: a query evaluated to the true result R and to an approximate result]
- Obtain an approximation to the true result
- Currently employed successfully in relational systems
- Use the approximate result to get timely feedback
7Outline
--A technique being used to return fast,
approximate results
- Motivation
- TreeSketch Approach
- Experimental Results
- Contributions and Limitations
8Data and Query Model
--Some background, XML document
Legend: a = author, n = name, b = book, p = paper, y = year, k = keyword, t = title
9Data and Query Process
--Twig Query, Query Tree, and Nested Result Tree
10Basic Query Scenario
Approximate Nesting Tree
True Nesting Tree
- Key idea is to return fast, accurate feedback
11Approximate Query Answers
-- Two key problems
- How to construct concise XML synopses that capture the statistical traits of the true data
- How to produce approximate query answers over the synopsis efficiently
12TreeSketch Construction
--Construction Algorithm
- Step 1: Given an XML tree T, build a graph synopsis in which each node represents a set of same-tag elements (this synopsis can be a large tree)
- Step 2: Compress the synopsis by merging nodes with similar sub-structures (i.e. clustering of the XML elements)
- Step 3: Repeat Step 2 until the predefined space budget constraint is met
- Step 4: Return the TreeSketch synopsis
[Figure labels: Space Budget, Perfect]
13More Discussions
--of the construction procedure
- Graph synopsis construction
- Use a node to represent a set of same-tag elements
- Queries can be answered with zero error
- The size can become very large: it can easily be on the order of the original document size
- TreeSketch synopsis construction
- Compress the synopsis by merging nodes
- Bottom-up merging clustering algorithm
- Key technique to compress → Clustering
- Based on structure
- Model accuracy depends on quality of clustering
- Tight clusters → Accurate synopsis, but large model
- Loose clusters → Less accuracy, but small model
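The merge-until-budget idea can be illustrated with a minimal Python sketch. This is not the authors' algorithm, only a simplified greedy version under several assumptions: the synopsis is a plain dictionary, the space budget is expressed as a maximum number of nodes, the clustering distance is the squared difference of outgoing edge averages, and the two merged nodes have no edges between each other. The names `merge_nodes`, `distance`, and `compress` are hypothetical.

```python
from itertools import combinations

def merge_nodes(synopsis, u, v, merged_id):
    """Merge same-tag nodes u and v into merged_id (in place)."""
    nu, nv = synopsis.pop(u), synopsis.pop(v)
    assert nu["tag"] == nv["tag"], "only same-tag nodes are merged"
    total = nu["count"] + nv["count"]
    out = {}
    for child in set(nu["out"]) | set(nv["out"]):
        # Count-weighted average of children per element.
        weighted = (nu["count"] * nu["out"].get(child, 0.0)
                    + nv["count"] * nv["out"].get(child, 0.0))
        out[child] = weighted / total
    synopsis[merged_id] = {"tag": nu["tag"], "count": total, "out": out}
    # Redirect incoming edges: averages toward u and v add up.
    for node in synopsis.values():
        incoming = node["out"].pop(u, 0.0) + node["out"].pop(v, 0.0)
        if incoming:
            node["out"][merged_id] = incoming

def distance(a, b):
    """Assumed similarity measure: squared difference of edge averages."""
    keys = set(a["out"]) | set(b["out"])
    return sum((a["out"].get(k, 0.0) - b["out"].get(k, 0.0)) ** 2 for k in keys)

def compress(synopsis, max_nodes):
    """Greedily merge the closest same-tag pair until the budget is met."""
    next_id = 0
    while len(synopsis) > max_nodes:
        candidates = [
            (distance(synopsis[u], synopsis[v]), u, v)
            for u, v in combinations(sorted(synopsis), 2)
            if synopsis[u]["tag"] == synopsis[v]["tag"]
        ]
        if not candidates:
            break  # nothing left to merge
        _, u, v = min(candidates)
        merge_nodes(synopsis, u, v, f"m{next_id}")
        next_id += 1
    return synopsis

# Hypothetical 8-node synopsis: node id -> tag, element count, avg children per element.
synopsis = {
    "r":  {"tag": "r", "count": 1, "out": {"p": 1.0}},
    "p":  {"tag": "p", "count": 1, "out": {"s1": 1.0, "s2": 1.0}},
    "s1": {"tag": "s", "count": 1, "out": {"f1": 2.0}},
    "s2": {"tag": "s", "count": 1, "out": {"f2": 2.0}},
    "f1": {"tag": "f", "count": 2, "out": {"c": 1.0, "e": 1.0}},
    "f2": {"tag": "f", "count": 2, "out": {"c": 1.0}},
    "c":  {"tag": "c", "count": 4, "out": {}},
    "e":  {"tag": "e", "count": 2, "out": {}},
}
compress(synopsis, max_nodes=6)
```

In this toy run the two `f` nodes are merged first (their child distributions are closest), after which the two `s` nodes become identical and are merged as well, so the merging proceeds bottom-up. The merged `f` node keeps an average of 0.5 `e` children per element, which is where the approximation error comes from.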
14Construction Example
--Count same tag elements
XML Document
(Graph Synopsis)
- Synopsis node: set of elements of the same tag
- Synopsis edge: document edge(s)
15Construction Example
--Calculate number of children per element
- Calculate the number of children for each edge
- Count(r, p): mean number of children in p per element in r
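The counting described on these two slides can be sketched in Python as follows. This is only an illustration, assuming elements are grouped purely by tag; the function name `build_tag_synopsis` and the toy document are hypothetical and are not taken from the paper.

```python
import xml.etree.ElementTree as ET
from collections import Counter

def build_tag_synopsis(xml_text):
    """Return (node_counts, avg_children) for a tag-level graph synopsis."""
    root = ET.fromstring(xml_text)
    node_counts = Counter()  # tag -> number of elements with that tag
    edge_totals = Counter()  # (parent tag, child tag) -> number of document edges
    for elem in root.iter():  # visits the root and all descendants
        node_counts[elem.tag] += 1
        for child in elem:    # direct children only
            edge_totals[(elem.tag, child.tag)] += 1
    # Edge label: mean number of children per parent element.
    avg_children = {
        (parent, child): total / node_counts[parent]
        for (parent, child), total in edge_totals.items()
    }
    return dict(node_counts), avg_children

# Hypothetical toy document.
doc = (
    "<r><p>"
    "<s><f><c/><e/></f><f><c/></f></s>"
    "<s><f><c/><e/></f><f><c/></f></s>"
    "</p></r>"
)
nodes, edges = build_tag_synopsis(doc)
```

For this document the four `f` elements have two `e` children in total, so the edge (f, e) is labeled 0.5.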
16Merging Nodes
--Less space budget
TreeSketch synopsis
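[Figure: More Concise Synopsis. Nodes with element counts: R(1), P(1), S(2), F(4), C(4), E(2). Edge labels (children per element): R→P 1, P→S 2, S→F 2, F→C 1, F→E 0.5]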
17Compute Approximate Answers
--more like the traditional way
- Travel down the tree
- Match a pattern in the structure and return a sub-tree
- TreeSketch: fast response
- Concise synopsis
- Keep statistical information
- Node: number of same-tag elements
- Edge: number of children per element
18Compute Approximate Answers
--Example
[Figure: a TreeSketch, a twig query (nodes q0 to q3 with steps //section, .//caption, .//equation), and the resulting Approximate Nesting Tree]
Approximate results with structure: take advantage of 1) the concise structure and 2) the statistical data
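To make the use of the statistical data concrete, here is a minimal Python sketch that estimates how many descendants a descendant-axis step (such as `.//caption`) yields per matching element. It is a simplification, not the paper's evaluation algorithm: it assumes an acyclic synopsis with one node per tag, and the synopsis contents and the function name `expected_descendants` are hypothetical.

```python
def expected_descendants(synopsis, node_id, tag):
    """Expected number of `tag` descendants per element of node `node_id`.

    Assumes the synopsis graph is acyclic, so the recursion terminates.
    """
    total = 0.0
    for child_id, avg in synopsis[node_id]["out"].items():
        hit = 1.0 if synopsis[child_id]["tag"] == tag else 0.0
        total += avg * (hit + expected_descendants(synopsis, child_id, tag))
    return total

# Hypothetical synopsis: node id -> tag, element count, avg children per element.
synopsis = {
    "root": {"tag": "doc", "count": 1, "out": {"sec": 4.0}},
    "sec":  {"tag": "section", "count": 4, "out": {"fig": 1.5, "eq": 0.5}},
    "fig":  {"tag": "figure", "count": 6, "out": {"cap": 1.0}},
    "cap":  {"tag": "caption", "count": 6, "out": {}},
    "eq":   {"tag": "equation", "count": 2, "out": {}},
}

# //section: estimated number of matching sections in the whole document.
total_sections = synopsis["root"]["count"] * expected_descendants(
    synopsis, "root", "section"
)
# .//caption and .//equation: estimated matches nested under each section.
captions_per_section = expected_descendants(synopsis, "sec", "caption")
equations_per_section = expected_descendants(synopsis, "sec", "equation")
```

With these numbers the approximate nesting tree would contain about 4 sections, each with roughly 1.5 captions and 0.5 equations, all computed from the synopsis alone without touching the document.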
19Outline
- Motivation
- TreeSketch Approach
- Experimental Results
- Contributions and Limitations
20Experimental Setup
- Focus on
- the quality of the approximate answers generated
- the efficiency of the construction process
- Data Set
- Data sets: XMark, DBLP, IMDB, SwissProt
- Workload: 1000 random twig queries
21Evaluation Methods
- Error: distance between the true result R and its approximation
- Popular metric: tree-edit distance
- Min-cost sequence of operations that transforms one result tree into the other
- Argument: it does not capture structural similarity
- New evaluation metric: ESD (Element Simulation Distance)
- Calculate the number of children for each edge in the tree to capture the complete structure of the tree
- Models how well the structures of two trees match each other
- Degree of simulation between two trees
- Average ESD is used for evaluation
22Experimental Results
--Approximate answers, compared with twig-XSketches
23Experimental Results
--Relative Errors
< 5%, i.e. 95% accuracy
24Outline
-Strengths and Weaknesses
- Motivation
- TreeSketch Approach
- Experimental Results
- Contributions and Limitations
25TreeSketch Approach
-In this paper
- Propose an effective XML-summarization mechanism
- Captures the complete tree structure of large XML data
- Experimental results: produces fast and accurate approximate query answers
- Authors' claim: the first work to address the timely problem of producing approximate tree-structured answers for complex XML queries
- Comparison with the related work: 2 options
- Either compute the exact answer to a path query: expensive
- Or use an approach such as twig-XSketch, which does not capture the complete tree structure of the underlying XML database
26Limitations
-Nice research, Next steps for further
investigation
- Difficult to optimize some pre-defined parameters, such as the space budget
- which is directly related to the accuracy of the approximate query answers
- too large → hurts efficiency; too small → hurts the quality of the answers; the right value depends on the query, data set, and computing resources
- Expecting an incremental model construction process
- XML data keeps growing incrementally, so we need to construct the synopsis model incrementally
- More experiments or some real applications are needed to justify the scalability of this technique
27Thank You / Merci