Relax and Adapt: Computing Top-k Matches to XPath Queries

Provided by: amliem

Learn more at: https://people.cs.rutgers.edu


1
Relax and Adapt: Computing Top-k Matches to XPath Queries

Amélie Marian (Columbia University)
Joint work with
Sihem Amer-Yahia (AT&T Research)
Nick Koudas (University of Toronto)
Divesh Srivastava (AT&T Research)

2
Example
[Figure: two book trees with different structures. Node labels: book, info, author (Dickens), title (Great Expectations), edition (paperback).]

Heterogeneous XML data about books
Query:
book[./info/title = 'Great Expectations' and ./info/author = 'Dickens' and ./edition = 'paperback']

Legend: query root node, distinguished node
3
XML Query Relaxation
Query
Amer-Yahia et al., EDBT'02

Tree pattern relaxations
Leaf node deletion
Edge generalization
Subtree promotion
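
The three relaxations can be illustrated on a small tree-pattern structure. The sketch below is only an illustration under stated assumptions, not the representation used by the authors: the `PatternNode` class, the `"/"` and `"//"` edge labels, the helper names, and the assumption that node labels are unique within a pattern are all hypothetical. Subtree promotion is modeled here as re-attaching a subtree to its grandparent through a descendant edge.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass(eq=False)  # identity comparison, so list.remove drops the exact node
class PatternNode:
    label: str
    edge: str = "/"  # edge from the parent: "/" (child) or "//" (descendant)
    children: List["PatternNode"] = field(default_factory=list)


def locate(root: PatternNode, label: str) -> Tuple[PatternNode, PatternNode]:
    """Return (parent, node) for a non-root node with this label (labels assumed unique)."""
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in parent.children:
            if child.label == label:
                return parent, child
            stack.append(child)
    raise ValueError(f"no non-root node labeled {label!r}")


def delete_leaf(root: PatternNode, label: str) -> None:
    """Leaf node deletion: the leaf becomes optional and is dropped from the pattern."""
    parent, node = locate(root, label)
    if node.children:
        raise ValueError(f"{label!r} is not a leaf")
    parent.children.remove(node)


def generalize_edge(root: PatternNode, label: str) -> None:
    """Edge generalization: a parent-child edge becomes ancestor-descendant."""
    _, node = locate(root, label)
    node.edge = "//"


def promote_subtree(root: PatternNode, label: str) -> None:
    """Subtree promotion: move the subtree up to its grandparent via a '//' edge."""
    parent, node = locate(root, label)
    grandparent, _ = locate(root, parent.label)  # raises if the parent is the root
    parent.children.remove(node)
    node.edge = "//"
    grandparent.children.append(node)


def to_xpath(node: PatternNode) -> str:
    """Render the pattern as a nested XPath-like predicate string."""
    preds = " and ".join("." + c.edge + to_xpath(c) for c in node.children)
    return node.label + (f"[{preds}]" if preds else "")


def example_query() -> PatternNode:
    """The book query from the example slide, without the value conditions."""
    info = PatternNode("info", children=[PatternNode("title"), PatternNode("author")])
    return PatternNode("book", children=[info, PatternNode("edition")])
```

Applying `delete_leaf(q, "edition")`, then `generalize_edge(q, "info")`, then `promote_subtree(q, "author")` to `q = example_query()` yields progressively more relaxed patterns, each of which still matches every answer of the previous one.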

[Figure: tree diagrams labeled Data. Node labels: book, info, author (Dickens), title (Great Expectations), edition (paperback), edition?]
4
Top-k Queries over XML Data: Motivations and Challenges

Structure heterogeneity
Efficient identification of approximate matches
Top-k
Ranking of approximate matches based on
similarity to query
Early pruning
Query processing cost
Cost increases with number of matches evaluated
Data explosion
Many approximate matches
XML path queries akin to joins
Prioritization to increase pruning

5
Contributions

Whirlpool: an adaptive architecture and top-k query processing strategy for XPath queries
Goal: early pruning of non-top-k partial matches
Approach: partial matches may follow different plans and may be at different stages of query execution
Real prototype implementation of Whirlpool
Instantiation of Whirlpool for various routing strategies and prioritization alternatives

6
Closely Related Work

Adaptive query processing
Eddies
Dynamic query join plans to adapt to processing
environment
No pruning
Adaptive top-k query processing
Upper
Prioritization of partial matches based on
maximum possible scores
Adaptive routing based on scores
No joins

Eddies: Avnur and Hellerstein, SIGMOD'00
Upper: Bruno et al., ICDE'01
7
Outline

Whirlpool Architecture
Query Processing
Strategy
Alternatives
Evaluation Settings
Evaluation Results

8
Whirlpool Architecture
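[Figure: Whirlpool architecture. The query pattern (book, info, title (Great Expectations), author (Dickens), edition (paperback)) is handled by a book server, info server, title server, author server, and edition server, together with a Router and a Top-k Set.]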
9
Whirlpool Architecture: Components

Top-k Set
Only one match with a given root node
Used for pruning
Complete matches are not processed further,
incomplete matches are sent to the router
Router
Router queue is ordered by partial matches' maximum possible final scores
Dynamically chooses which server to send each partial match to, based on the routing strategy
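
The two components on this slide can be sketched as small data structures. This is a minimal sketch under assumptions, not the prototype's code: the class names `TopKSet` and `RouterQueue`, the representation of a match as a root identifier plus a score, and the strict-inequality pruning test are all hypothetical choices made for illustration.

```python
import heapq
import itertools


class TopKSet:
    """Keeps at most k entries, with only one (the best) score per root node."""

    def __init__(self, k: int):
        self.k = k
        self.best = {}  # root id -> best score seen so far for that root

    def offer(self, root: str, score: float) -> None:
        if score > self.best.get(root, float("-inf")):
            self.best[root] = score
            if len(self.best) > self.k:
                worst = min(self.best, key=self.best.get)
                del self.best[worst]  # evict the lowest-scoring root

    def threshold(self) -> float:
        """Score of the k-th entry, or -inf while fewer than k roots are held."""
        if len(self.best) < self.k:
            return float("-inf")
        return sorted(self.best.values(), reverse=True)[self.k - 1]

    def can_prune(self, max_possible_final: float) -> bool:
        """A partial match that cannot reach the threshold is safe to drop."""
        return max_possible_final < self.threshold()


class RouterQueue:
    """Priority queue ordered by maximum possible final score (highest first)."""

    def __init__(self):
        self._heap = []
        self._tie = itertools.count()  # keeps insertion order among equal scores

    def push(self, match, max_possible_final: float) -> None:
        heapq.heappush(self._heap, (-max_possible_final, next(self._tie), match))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

Keeping one entry per root prevents several partial matches of the same book from occupying multiple top-k slots, which would raise the threshold incorrectly.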

10
Whirlpool Architecture: Components

Root server
Generates candidate matches
Node servers
Maintain priority queue of partial matches
For each partial match that is processed
Compute a set of extended partial (or complete)
matches
Compute scores of new matches
Check partial matches against current top-k set
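
The per-match work of a node server (extend, score, check against the top-k set) can be sketched as a single function. Everything named here is an assumption made for illustration: the `PartialMatch` record, the `candidates` and `gain` callbacks, the additive score, and the choice to bind the query node to `None` with no score gain when no data node matches (standing in for leaf node deletion).

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class PartialMatch:
    root: str
    bindings: Tuple[Tuple[str, Optional[str]], ...]  # (query node, data node or None)
    score: float


def process_at_server(
    match: PartialMatch,
    query_node: str,
    candidates: Callable[[str], Iterable[str]],  # data nodes for this query node under a root
    gain: Callable[[str], float],                # score contribution of a data node
    remaining_max: float,                        # best total gain still available afterwards
    threshold: float,                            # current k-th score in the top-k set
) -> List[PartialMatch]:
    """Extend one partial match at one server and drop extensions that cannot qualify."""
    options: List[Optional[str]] = list(candidates(match.root))
    if not options:
        options = [None]  # assumption: an unmatched node is relaxed away at zero gain
    extended = []
    for data_node in options:
        new_score = match.score + (gain(data_node) if data_node is not None else 0.0)
        if new_score + remaining_max < threshold:
            continue  # maximum possible final score is below the top-k threshold
        extended.append(
            PartialMatch(match.root, match.bindings + ((query_node, data_node),), new_score)
        )
    return extended
```

A single input match can produce several extensions, which is why the number of partial matches created is one of the measures reported later in the talk.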

11
Query Processing Alternatives

Prioritization Strategies (at each server)
FIFO
Current Score
Maximum Possible Next Score
Maximum Possible Final Score
Routing Decisions (at the router)
Static
Score-based
Likely to increase score the most
Likely to increase score the least
Size-based
Likely to produce the fewest matches
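
The alternatives listed above amount to different ordering keys at the servers and different selection rules at the router. The following sketch shows one way to express them; the field names (`max_next`, `expected_gain`, `expected_matches`), the strategy labels, and the idea that per-server estimates are available as plain numbers are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueuedMatch:
    name: str
    arrival: int      # position in arrival order
    score: float      # current score
    max_next: float   # maximum possible score after the next server
    max_final: float  # maximum possible final score


# Prioritization strategies: the queued match with the largest key is processed next.
PRIORITY_KEYS = {
    "fifo": lambda m: -m.arrival,
    "current_score": lambda m: m.score,
    "max_next_score": lambda m: m.max_next,
    "max_final_score": lambda m: m.max_final,
}


def next_match(queue: Sequence[QueuedMatch], strategy: str) -> QueuedMatch:
    return max(queue, key=PRIORITY_KEYS[strategy])


@dataclass(frozen=True)
class ServerEstimate:
    name: str
    expected_gain: float     # estimated score increase at this server
    expected_matches: float  # estimated number of extensions produced


def route(unvisited: Sequence[ServerEstimate], strategy: str,
          static_order: Sequence[str] = ()) -> ServerEstimate:
    """Routing decision: pick the next server among those the match has not visited."""
    if strategy == "static":
        by_name = {s.name: s for s in unvisited}
        return next(by_name[n] for n in static_order if n in by_name)
    if strategy == "max_score":
        return max(unvisited, key=lambda s: s.expected_gain)
    if strategy == "min_score":
        return min(unvisited, key=lambda s: s.expected_gain)
    if strategy == "min_size":
        return min(unvisited, key=lambda s: s.expected_matches)
    raise ValueError(f"unknown routing strategy: {strategy}")
```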

12
Evaluation Strategies

Lockstep (Static)
Partial matches follow same execution plan
Partial matches have gone through exactly the
same number of operations
Whirlpool Single-threaded (Adaptive)
Partial matches adaptively routed
Process the partial match with the highest
maximum final score (Query processing similar to
Upper)
Only one partial match processed at a time
Whirlpool Multi-threaded (Adaptive)
Prioritization strategy at server decides which
partial match to process next at server
System determines which server to process next
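
The single-threaded adaptive strategy can be illustrated with a toy loop that always processes the partial match with the highest maximum possible final score. This is a simplified sketch, not Whirlpool itself: each root has exactly one partial match, scores are additive, every server contributes at most `max_gain`, routing simply picks the first unvisited query node, and the function name `adaptive_top_k` and the toy data layout are hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple


def adaptive_top_k(
    data: Dict[str, Dict[str, float]],  # root -> {query node: score contribution}
    query_nodes: Sequence[str],
    k: int,
    max_gain: float = 1.0,              # assumed upper bound on any single contribution
) -> Tuple[List[Tuple[str, float]], int]:
    """Return the top-k (root, score) pairs and the number of server operations used."""
    # Heap entries: (-max possible final score, root, current score, visited nodes).
    heap = [(-len(query_nodes) * max_gain, root, 0.0, ()) for root in data]
    heapq.heapify(heap)
    results: List[Tuple[str, float]] = []
    server_ops = 0
    while heap and len(results) < k:
        _, root, score, visited = heapq.heappop(heap)
        remaining = [q for q in query_nodes if q not in visited]
        if not remaining:
            # A complete match at the head of the queue cannot be beaten by any
            # other match, because every other bound is at most its score.
            results.append((root, score))
            continue
        node = remaining[0]  # stand-in for the router's choice of server
        server_ops += 1
        score += data[root].get(node, 0.0)  # a missing node contributes nothing
        bound = score + (len(remaining) - 1) * max_gain
        heapq.heappush(heap, (-bound, root, score, visited + (node,)))
    return results, server_ops


books = {
    "b1": {"title": 1.0, "author": 1.0, "edition": 1.0},
    "b2": {"title": 1.0, "author": 1.0},
    "b3": {"title": 0.5},
}
top1, ops1 = adaptive_top_k(books, ("title", "author", "edition"), k=1)
```

On this toy data, `top1` is `[("b1", 3.0)]` and `ops1` is 3, whereas a lockstep plan that sends all three books through all three servers would perform 9 server operations.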

13
Evaluation Metrics

Parameters
Query size
Document size
k
Parallelism
Scoring function (tf.idf proposed in paper)
Measures
Query execution time
Number of server operations
Number of partial matches created

14
Evaluation Setting

C implementation, with POSIX threads
Default machine
Red Hat 7.1 Linux
1.4 GHz dual processor
2 GB RAM
XML documents generated using the XMark generation tool
XPath queries chosen from XMark to illustrate different relaxations
XML nodes stored using Dewey encoding
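
Dewey encoding gives each node an identifier formed by its parent's identifier followed by the node's position among its siblings, so structural relationships reduce to prefix tests. The sketch below assumes a dotted-string form such as `"1.2.3"`; the function names are hypothetical and the prototype's actual storage layout is not described in the slides.

```python
from typing import Tuple


def parse_dewey(dewey: str) -> Tuple[int, ...]:
    """Split a dotted Dewey identifier into integer components."""
    return tuple(int(part) for part in dewey.split("."))


def is_ancestor(anc: str, desc: str) -> bool:
    """True if `anc` is a proper ancestor of `desc` (matches a '//' edge)."""
    a, d = parse_dewey(anc), parse_dewey(desc)
    return len(a) < len(d) and d[: len(a)] == a


def is_parent(par: str, child: str) -> bool:
    """True if `par` is the direct parent of `child` (matches a '/' edge)."""
    p, c = parse_dewey(par), parse_dewey(child)
    return len(p) + 1 == len(c) and c[: len(p)] == p
```

Comparing integer components rather than raw strings avoids treating `"1.2"` as a prefix of `"1.20.1"`. With this encoding, checking a generalized edge only requires swapping `is_parent` for `is_ancestor`.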

15
Comparison of Adaptive Routing Strategies
Whirlpool-S and Whirlpool-M perform approximately
the same number of server operations
16
Static Routing Strategies vs. Best Adaptive
17
Effect of Parallelism
18
Varying Query Size and k (log scale)
For large queries and high values of k, Whirlpool-M performs fewer server operations than Whirlpool-S (and is faster even on a one-processor machine)! (27 fewer server operations for q3, k = 75)
19
Varying Query Size and Document Size
Almost twice as fast
20
Scalability
Document size   1M     10M     50M
Q1              100    93.12   85.66
Q2              100    49.56   67.66
Q3              100    39.59   31.20

Percentage of partial matches created by Whirlpool-M, relative to the maximum possible number of partial matches
21
Conclusions