Title: Approximate XML Query Answers
1 Approximate XML Query Answers
- Alkis Polyzotis (UC Santa Cruz)
- Minos Garofalakis (Bell Labs)
- Yannis Ioannidis (U. of Athens, Hellas)
2 Motivation
- XML: de-facto standard for data exchange
- Development of the XML Warehouse
- Conflict between on-line and query execution cost
- Increased query response times
- Users might wait for un-interesting results
[Figure: query Q, the Warehouse, and result R]
3 Approximate Query Answers
- Evaluate the query over a concise data synopsis and obtain an approximation R' of the true result R
- Use the approximate result as timely feedback
- User can assess the value of the query
- Goal: reduce the number of evaluated queries
[Figure: query Q, the Warehouse, the true result R, and the approximate result R']
4 Contributions
- TreeSketch Synopses
  - Structural summaries for XML data
  - Approximate answers for complex twig queries
  - Summarization model ⇔ structural clustering of elements
  - Efficient processing and construction
- Element Simulation Distance
  - Novel distance metric for XML data
  - Captures approximate similarity between two XML trees
- Experimental Results
  - Accurate approximate answers for low space budgets
  - Low-error selectivity estimates
  - Efficient construction algorithm
5 Outline
- Preliminaries
- TreeSketches
  - Synopsis model
  - Computing approximate answers
  - Summary construction
- Element Simulation Distance
- Experimental Study
- Conclusions
6 Data and Query Model
[Figure: XML Document]
7 Problem Definition
[Figure: Approximate Nesting Tree and True Nesting Tree]
- Process twig query over a synopsis
- Compute approximation of nesting tree
8 TreeSketch Model
9 Graph Synopsis
[Figure: XML Document and its Graph Synopsis]
- Synopsis node ⇔ set of elements of the same tag
- Synopsis edge ⇔ document edge(s)
10 TreeSketch Synopsis
[Figure: XML Document and its TreeSketch]
- Augment the graph synopsis with edge counts
- count(u,v): mean number of children in v per element in u
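The edge-count definition above can be illustrated with a minimal sketch. The toy document, the element ids, and the function name `tree_sketch` are hypothetical and not taken from the slides; the partition used here assumes one synopsis node per tag.

```python
from collections import defaultdict

# Hypothetical toy document: element id -> (tag, list of child ids).
doc = {
    "r":   ("R", ["p1"]),
    "p1":  ("P", ["s2", "s3"]),
    "s2":  ("S", ["f4", "f5"]),
    "s3":  ("S", ["f6", "f7"]),
    "f4":  ("F", ["c8"]),
    "f5":  ("F", ["c9"]),
    "f6":  ("F", ["c10", "e12"]),
    "f7":  ("F", ["c11", "e13"]),
    "c8":  ("C", []), "c9": ("C", []), "c10": ("C", []), "c11": ("C", []),
    "e12": ("E", []), "e13": ("E", []),
}

def tree_sketch(doc, cluster_of):
    """Return (node sizes, average edge counts) for a partition of the elements."""
    size = defaultdict(int)    # synopsis node -> number of elements it holds
    total = defaultdict(int)   # (u, v) -> number of document edges from u to v
    for elem, (_tag, children) in doc.items():
        u = cluster_of[elem]
        size[u] += 1
        for child in children:
            total[(u, cluster_of[child])] += 1
    # count(u, v) = mean number of children in v per element in u
    counts = {(u, v): n / size[u] for (u, v), n in total.items()}
    return dict(size), counts

# Assumption: the graph synopsis groups elements by tag.
by_tag = {elem: tag for elem, (tag, _children) in doc.items()}
size, counts = tree_sketch(doc, by_tag)
print(size["F"], counts[("F", "C")], counts[("F", "E")])  # 4 1.0 0.5
```

In this toy input only two of the four F elements have an E child, so count(F,E) is 0.5 and the synopsis no longer records which F elements those are.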
11 TreeSketch Synopsis
[Figure: XML Document and its TreeSketch]
- Is there a lossless synopsis?
- What is the quality of a lossy synopsis?
12 Count Stability
[Figure: XML Document and its TreeSketch, with edge counts]
- (u,v) is count-stable: all elements in u have the same child-count in v
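A minimal sketch of the count-stability test, assuming a hypothetical toy document and a tag-based partition (neither comes from the slides); `is_count_stable` is an illustrative name.

```python
# Hypothetical toy document: element id -> (tag, list of child ids).
doc = {
    "r":   ("R", ["p1"]),
    "p1":  ("P", ["s2", "s3"]),
    "s2":  ("S", ["f4", "f5"]),
    "s3":  ("S", ["f6", "f7"]),
    "f4":  ("F", ["c8"]),
    "f5":  ("F", ["c9"]),
    "f6":  ("F", ["c10", "e12"]),
    "f7":  ("F", ["c11", "e13"]),
    "c8":  ("C", []), "c9": ("C", []), "c10": ("C", []), "c11": ("C", []),
    "e12": ("E", []), "e13": ("E", []),
}

def is_count_stable(doc, cluster_of, u, v):
    """True if every element of node u has the same number of children in node v."""
    child_counts = {
        sum(1 for c in children if cluster_of[c] == v)
        for elem, (_tag, children) in doc.items()
        if cluster_of[elem] == u
    }
    return len(child_counts) <= 1

# Assumption: one synopsis node per tag.
by_tag = {elem: tag for elem, (tag, _children) in doc.items()}
print(is_count_stable(doc, by_tag, "F", "C"))  # True: every F has exactly one C child
print(is_count_stable(doc, by_tag, "F", "E"))  # False: some F have one E child, others none
```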
13 Count-Stable TreeSketch
[Figure: XML Document and its count-stable TreeSketch, with nodes R(1), P(1), S(1), S(1), F(2), F(2), C(4), E(2) and edge counts]
- A count-stable synopsis can recover the input tree
- Efficient one-pass construction
- Stable summary can be too large for practical use!
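The slides do not spell out the one-pass construction. One plausible way to obtain a count-stable partition, sketched here as an assumption, is a single post-order pass that groups elements by tag and by the multiset of their children's classes. The toy document and all names are hypothetical.

```python
from collections import Counter

# Hypothetical toy document: element id -> (tag, list of child ids).
doc = {
    "r":   ("R", ["p1"]),
    "p1":  ("P", ["s2", "s3"]),
    "s2":  ("S", ["f4", "f5"]),
    "s3":  ("S", ["f6", "f7"]),
    "f4":  ("F", ["c8"]),
    "f5":  ("F", ["c9"]),
    "f6":  ("F", ["c10", "e12"]),
    "f7":  ("F", ["c11", "e13"]),
    "c8":  ("C", []), "c9": ("C", []), "c10": ("C", []), "c11": ("C", []),
    "e12": ("E", []), "e13": ("E", []),
}

def count_stable_partition(doc, root):
    """Map each element to a class id; elements share a class only if they have
    the same tag and the same number of children in every class."""
    class_of = {}
    ids = {}  # signature -> class id

    def visit(elem):
        tag, children = doc[elem]
        # Children are classified first (post-order), then counted per class.
        per_class = Counter(visit(c) for c in children)
        signature = (tag, tuple(sorted(per_class.items())))
        class_of[elem] = ids.setdefault(signature, len(ids))
        return class_of[elem]

    visit(root)  # recursive; fine for a toy tree, not for very deep documents
    return class_of

class_of = count_stable_partition(doc, "r")
print(Counter(Counter(class_of.values()).values()))
```

For this toy input the pass yields eight classes: the two S elements and the two pairs of F elements are kept apart because their child counts differ, which illustrates why a stable summary can grow large.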
14 Lossy TreeSketch
[Figure: XML Document and its lossy TreeSketch]
15 TreeSketches and Clustering
- TreeSketch ⇔ element clustering
- All elements in a node are mapped to a centroid
- Tight clusters ⇔ accurate synopsis
- Synopsis quality ⇔ clustering error
- Options: Manhattan distance, squared error, ...
- Quality can be measured independently of a workload
- Key for effective construction
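As a rough illustration of the clustering view, the sketch below computes the squared error of one synopsis node. It assumes each element is represented by a vector of child counts and that the centroid is the per-dimension mean; the function name and the sample vectors are hypothetical.

```python
def squared_error(vectors):
    """Sum of squared distances from each child-count vector to the centroid.

    Each vector is a dict: child node -> number of children in that node.
    Missing keys count as zero.
    """
    keys = set().union(*vectors)
    n = len(vectors)
    centroid = {k: sum(v.get(k, 0) for v in vectors) / n for k in keys}
    return sum((v.get(k, 0) - centroid[k]) ** 2 for v in vectors for k in keys)

# A tight cluster: every element has exactly one C child.
tight = [{"C": 1}, {"C": 1}]
# A looser cluster: only half of the elements also have an E child.
loose = [{"C": 1}, {"C": 1}, {"C": 1, "E": 1}, {"C": 1, "E": 1}]

print(squared_error(tight))  # 0.0
print(squared_error(loose))  # 1.0
```

Under these assumptions the centroid of a node coincides with its average edge counts, so a zero error corresponds to a count-stable node.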
16 Computing Approximate Answers
[Figure: a TreeSketch and a twig query (nodes q0 to q3; path labels //section, .//caption, .//equation) yield the Approximate Nesting Tree]
- Compute TreeSketch of approximate answer
- Accuracy depends on quality of clustering
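The full twig-evaluation procedure is not given on the slide. As a simplified illustration only, the sketch below estimates the result size of a child-axis path by multiplying the root node size with the average edge counts along the path. The node sizes, edge counts, and function name are assumed values for a tag-based synopsis.

```python
# Assumed synopsis: node sizes and average edge counts (hypothetical values).
size = {"R": 1, "P": 1, "S": 2, "F": 4, "C": 4, "E": 2}
counts = {
    ("R", "P"): 1.0,
    ("P", "S"): 2.0,
    ("S", "F"): 2.0,
    ("F", "C"): 1.0,
    ("F", "E"): 0.5,
}

def estimate_path(size, counts, path):
    """Estimated number of matches for a child-axis path over synopsis nodes."""
    estimate = float(size[path[0]])
    for u, v in zip(path, path[1:]):
        # A missing synopsis edge means no element of u has a child in v.
        estimate *= counts.get((u, v), 0.0)
    return estimate

print(estimate_path(size, counts, ["R", "P", "S", "F", "E"]))  # 2.0
print(estimate_path(size, counts, ["R", "P", "S", "F", "C"]))  # 4.0
print(estimate_path(size, counts, ["R", "P", "E"]))            # 0.0
```

The estimate is exact whenever every edge on the path is count-stable; with looser clusters the averages hide variation, which is one way to read the point that accuracy depends on the quality of the clustering.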
17 TreeSketch Construction
- Given an XML tree T, build a TreeSketch of size B
- Difficult clustering problem
- Space dimensionality depends on the clustering itself
- Construction based on bottom-up clustering
- Compress perfect synopsis by merging clusters
- Best merge determined by marginal gains
- Heuristic to reduce number of candidate merges
[Figure: labels "Perfect" and "Space Budget"]
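The following is a deliberately simplified sketch of bottom-up merging, not the construction algorithm from the talk. It assumes the budget is a number of clusters, that only same-tag clusters may merge, that child-count vectors are keyed by child tag (which sidesteps the changing dimensionality noted above), and that the best merge is the one with the smallest increase in squared error. All names are hypothetical.

```python
from itertools import combinations

def sq_err(vectors):
    """Squared error of a cluster of child-count vectors around its mean."""
    keys = set().union(*vectors)
    n = len(vectors)
    mean = {k: sum(v.get(k, 0) for v in vectors) / n for k in keys}
    return sum((v.get(k, 0) - mean[k]) ** 2 for v in vectors for k in keys)

def greedy_merge(clusters, budget):
    """clusters: list of (tag, [child-count vectors]). Merge until len <= budget."""
    clusters = [(tag, list(vs)) for tag, vs in clusters]
    while len(clusters) > budget:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            (ti, vi), (tj, vj) = clusters[i], clusters[j]
            if ti != tj:
                continue  # only elements with the same tag can share a node
            cost = sq_err(vi + vj) - sq_err(vi) - sq_err(vj)
            if best is None or cost < best[0]:
                best = (cost, i, j)
        if best is None:
            break  # no same-tag pair is left to merge
        _, i, j = best
        merged = (clusters[i][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Hypothetical starting point: four clusters of a perfect synopsis.
perfect = [
    ("F", [{"C": 1}, {"C": 1}]),
    ("F", [{"C": 1, "E": 1}, {"C": 1, "E": 1}]),
    ("S", [{"F": 2}]),
    ("S", [{"F": 2}]),
]
result3 = greedy_merge(perfect, 3)
result2 = greedy_merge(perfect, 2)
print(sorted((tag, len(vs)) for tag, vs in result3))
print(sorted((tag, len(vs)) for tag, vs in result2))
```

With a budget of three the two S clusters merge first because that merge adds no error under these assumptions; the F clusters merge only when the budget drops to two.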
18 Element Simulation Distance
19 Error of Approximation
- Error ⇔ distance between the true result R and the approximation R'
- Popular metric: tree-edit distance
- Min-cost sequence of operations that transform R to R'
- Measures syntactic differences between R and R'
- Not intuitive for approximate answers!
[Figure: trees T1, T, T2; labels "Different counts, similar trait" and "Same counts, opposite trait"]
20 Element Simulation Distance
- Capture approximate similarity between R and R'
- u simulates v: u and v have identical structure
- ESD(u,v): degree of simulation between u and v
- How well the structure of u matches the structure of v
- Modeled as the distance between multi-sets
- Efficient computation using perfect summaries
21 Experimental Results
22 Methodology
- Data sets: XMark, DBLP, IMDB, SwissProt
- Workload: 1000 random twig queries
- Evaluation metrics
  - Average ESD for approximate answers
  - Mean absolute relative error for selectivity estimation
23 Approximate Answers - IMDB
IMDB (102K elements); avg. result size: 3,477 tuples
24 Selectivity Estimation - SwissProt
SwissProt (182K elements); avg. result size: 104,592 tuples
25 Selectivity Estimation - ALL
26 Conclusions
- Approximate query answering for XML databases
- TreeSketch Synopses
  - Structural summaries for tree-structured XML
  - Approximate answers for twig queries
  - Model: Graph Synopsis + Edge-counts
  - Efficient processing and construction
- Element Simulation Distance
  - Capture approximate similarity between XML trees
- Experimental Results
  - High accuracy for low space budgets
  - Efficient construction
27 Questions?
28 TreeSketch Model (2/2)
[Figure: XML Document with labeled elements (r, p1, s2, s3, and f, c, e elements) and its TreeSketch with nodes R, P(1), S(2), F(2), F(2), C(4), E(2) and edge counts]
- Edge count = average number of children
29 XML
[Figure: XML Document with labeled elements (r, p1, s2, s3, and f, c, e elements)]
Legend: p = paper, s = section, c = caption, t = title, f = figure, e = equation
30 TreeSketch Synopsis
[Figure: XML Document and its TreeSketch, with nodes R(1), P(1), S(2), F(4), C(4), E(2) and edge counts 1, 2, 2, 1, 0.5]
- Augment the graph synopsis with edge counts
- count(u,v): mean number of children in v per element in u
31 Depth-Guided Merging
- Key observation: two elements have similar structure if their children have similar structure
- Bottom-up merging, based on depth
- Depth: distance from the leaves of the tree
- Build a pool of candidate merges by increasing depth
- Replenish the pool when it falls below a given threshold
- Reduced construction time
- Accurate synopses
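A small sketch of how a depth-ordered candidate pool might be assembled. It assumes depth means the longest distance to a leaf and that candidates are same-tag pairs; the toy document and function names are hypothetical, and the replenishment threshold is left out.

```python
from itertools import combinations

# Hypothetical toy document: element id -> (tag, list of child ids).
doc = {
    "r":   ("R", ["p1"]),
    "p1":  ("P", ["s2", "s3"]),
    "s2":  ("S", ["f4", "f5"]),
    "s3":  ("S", ["f6", "f7"]),
    "f4":  ("F", ["c8"]),
    "f5":  ("F", ["c9"]),
    "f6":  ("F", ["c10", "e12"]),
    "f7":  ("F", ["c11", "e13"]),
    "c8":  ("C", []), "c9": ("C", []), "c10": ("C", []), "c11": ("C", []),
    "e12": ("E", []), "e13": ("E", []),
}

def depths(doc, root):
    """Depth of each element, taken here as the longest distance to a leaf."""
    depth = {}

    def visit(elem):
        _tag, children = doc[elem]
        depth[elem] = 1 + max((visit(c) for c in children), default=-1)
        return depth[elem]

    visit(root)
    return depth

def candidate_pool(doc, root):
    """Same-tag pairs as (depth, a, b), sorted so the shallowest pairs come first."""
    depth = depths(doc, root)
    pairs = [
        (max(depth[a], depth[b]), a, b)
        for a, b in combinations(sorted(doc), 2)
        if doc[a][0] == doc[b][0]
    ]
    return sorted(pairs)

depth = depths(doc, "r")
pool = candidate_pool(doc, "r")
print(depth["r"], len(pool), pool[-1])  # 4 14 (2, 's2', 's3')
```

Processing the pool front to back considers leaf-level merges before merges of their ancestors, in line with the bottom-up order described above.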
32 Depth-Guided Merging
- Observation: two elements have similar structure if their children have similar structure
- Heuristic: if a merge of two clusters is good, then merges of the child clusters are likely to have been good as well
- Bottom-up merging strategy
- Savings in construction time
- Accurate synopses