Title: Relax and Adapt: Computing Top-k Matches to XPath Queries
1 Relax and Adapt: Computing Top-k Matches to XPath Queries
- Amélie Marian (Columbia University)
- Joint work with
- Sihem Amer-Yahia (AT&T Research)
- Nick Koudas (University of Toronto)
- Divesh Srivastava (AT&T Research)
2 Example
[Figure: two book trees built from the nodes book, info, title (Great Expectations), author (Dickens), and edition (paperback)]
- Heterogeneous XML Data about books
- Query:
- book[./info/title = 'Great Expectations' and ./info/author = 'Dickens' and ./edition = 'paperback']
Figure legend: query root node; distinguished node
3 XML Query Relaxation
Amer-Yahia et al., EDBT 2002
- Tree pattern relaxations
- Leaf node deletion
- Edge generalization
- Subtree promotion
[Figure: the query tree pattern (with edition?) shown beside the data: two book trees with nodes info, title (Great Expectations), author (Dickens), and edition (paperback)]
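The three tree pattern relaxations named on this slide can be illustrated with a small sketch. It assumes a toy `PatternNode` class (hypothetical, not taken from the slides or from the EDBT 2002 paper) in which every node records the axis connecting it to its parent, and it leaves out value predicates such as the title text.

```python
import copy
from dataclasses import dataclass, field
from typing import List


@dataclass(eq=False)  # identity-based equality, so list.remove() targets one exact node
class PatternNode:
    tag: str
    edge: str = "/"  # axis from the parent: "/" (child) or "//" (descendant)
    children: List["PatternNode"] = field(default_factory=list)

    def to_xpath(self) -> str:
        preds = [f".{c.edge}{c.to_xpath()}" for c in self.children]
        return self.tag + (f"[{' and '.join(preds)}]" if preds else "")


def example_query() -> PatternNode:
    # book[./info[./title and ./author] and ./edition], values omitted
    info = PatternNode("info", "/", [PatternNode("title"), PatternNode("author")])
    return PatternNode("book", "/", [info, PatternNode("edition")])


def delete_leaf(parent: PatternNode, leaf: PatternNode) -> None:
    """Leaf node deletion: drop a leaf so it becomes optional."""
    assert not leaf.children, "only leaf nodes may be deleted"
    parent.children.remove(leaf)


def generalize_edge(node: PatternNode) -> None:
    """Edge generalization: turn a child edge into a descendant edge."""
    node.edge = "//"


def promote_subtree(grandparent: PatternNode, parent: PatternNode,
                    subtree: PatternNode) -> None:
    """Subtree promotion: reattach a subtree to its grandparent via '//'."""
    parent.children.remove(subtree)
    subtree.edge = "//"
    grandparent.children.append(subtree)


q1 = example_query()
delete_leaf(q1, q1.children[1])
print(q1.to_xpath())  # book[./info[./title and ./author]]

q2 = example_query()
generalize_edge(q2.children[0])
print(q2.to_xpath())  # book[.//info[./title and ./author] and ./edition]

q3 = example_query()
info = q3.children[0]
promote_subtree(q3, info, info.children[1])
print(q3.to_xpath())  # book[./info[./title] and ./edition and .//author]
```

Each relaxed pattern matches a superset of the books matched by the original pattern, which is what allows approximate matches to be ranked rather than discarded.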
4 Top-k Queries over XML Data: Motivations and Challenges
- Structure heterogeneity
- Efficient identification of approximate matches
- Top-k
- Ranking of approximate matches based on similarity to query
- Early pruning
- Query processing cost
- Cost increases with number of matches evaluated
- Data explosion
- Many approximate matches
- XML path queries akin to joins
- Prioritization to increase pruning
5 Contributions
- Whirlpool: adaptive architecture and top-k query processing strategy for XPath queries
- Goal: early pruning of non-top-k partial matches
- Approach: partial matches may follow different plans, and may be at different stages of query execution
- Real prototype implementation of Whirlpool
- Instantiation of Whirlpool for various routing strategies and prioritization alternatives
6 Closely Related Work
- Adaptive query processing
- Eddies [Avnur and Hellerstein, SIGMOD'00]
- Dynamic query join plans to adapt to processing environment
- No pruning
- Adaptive top-k query processing
- Upper [Bruno et al., ICDE'01]
- Prioritization of partial matches based on maximum possible scores
- Adaptive routing based on scores
- No joins
7 Outline
- Whirlpool Architecture
- Query Processing
- Strategy
- Alternatives
- Evaluation Settings
- Evaluation Results
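8 Whirlpool Architecture
[Figure: Whirlpool architecture for the example query (book, info, title (Great Expectations), author (Dickens), edition (paperback)): a Router, one server per query node (book server, info server, title server, author server, edition server), and the Top-k Set]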
9 Whirlpool Architecture: Components
- Top-k Set
- Only one match with a given root node
- Used for pruning
- Complete matches are not processed further; incomplete matches are sent to the router
- Router
- Router queue is ordered by the partial matches' maximum possible final scores
- Dynamically chooses which server to send each partial match to, based on the routing strategy
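A minimal sketch of these two components follows. The class names `TopKSet` and `RouterQueue`, and the rule that a partial match is pruned when its maximum possible final score is below the k-th best score seen so far, are assumptions made for illustration; the slides do not give the exact data structures.

```python
import heapq
from typing import Any, Dict, List, Tuple


class TopKSet:
    """Keeps the best score seen per root node and derives a pruning threshold."""

    def __init__(self, k: int) -> None:
        self.k = k
        self.best: Dict[str, float] = {}  # root node id -> best score so far

    def offer(self, root_id: str, score: float) -> None:
        # Only one match per root node: keep the higher-scoring one.
        if score > self.best.get(root_id, float("-inf")):
            self.best[root_id] = score

    def threshold(self) -> float:
        # Score of the current k-th best match, or -inf while fewer than k exist.
        if len(self.best) < self.k:
            return float("-inf")
        return heapq.nlargest(self.k, self.best.values())[-1]

    def can_prune(self, max_final_score: float) -> bool:
        return max_final_score < self.threshold()


class RouterQueue:
    """Max-priority queue keyed on each match's maximum possible final score."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, Any]] = []
        self._count = 0  # tie-breaker so matches themselves are never compared

    def push(self, max_final_score: float, match: Any) -> None:
        heapq.heappush(self._heap, (-max_final_score, self._count, match))
        self._count += 1

    def pop(self) -> Tuple[float, Any]:
        neg_score, _, match = heapq.heappop(self._heap)
        return -neg_score, match
```

Negating the score turns Python's min-heap into the max-first ordering the router needs.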
10 Whirlpool Architecture: Components
- Root server
- Generates candidate matches
- Node servers
- Maintain priority queue of partial matches
- For each partial match that is processed
- Compute a set of extended partial (or complete) matches
- Compute scores of new matches
- Check partial matches against current top-k set
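The per-match work of a node server could look roughly like the sketch below. It is illustrative only: `PartialMatch`, `process_at_server`, the additive score, and the choice to bind a query node to `None` (a leaf deletion) only when the server finds no candidate are all assumptions, not details given in the slides.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class PartialMatch:
    root_id: str
    bindings: Tuple[Tuple[str, Optional[str]], ...]  # (query node, data node id)
    score: float


def process_at_server(
    match: PartialMatch,
    query_node: str,
    candidates: List[Tuple[str, float]],  # (data node id, score contribution)
    remaining_max: float,                 # best score still obtainable elsewhere
    threshold: float,                     # k-th best score in the top-k set
) -> List[PartialMatch]:
    """Extend one partial match, score the extensions, drop hopeless ones."""
    # Assumption: with no candidate, the query node is relaxed away for free.
    options: List[Tuple[Optional[str], float]] = list(candidates) or [(None, 0.0)]
    survivors = []
    for data_id, contribution in options:
        extended = PartialMatch(
            match.root_id,
            match.bindings + ((query_node, data_id),),
            match.score + contribution,
        )
        # Keep the extension only if it can still reach the top-k set.
        if extended.score + remaining_max >= threshold:
            survivors.append(extended)
    return survivors


start = PartialMatch("b1", (("book", "b1"),), 1.0)
kept = process_at_server(start, "title", [("t1", 0.5), ("t2", 0.1)],
                         remaining_max=0.3, threshold=1.5)
print([m.bindings[-1] for m in kept])  # [('title', 't1')]
```

The extension through `t2` is dropped because even the best remaining contributions cannot lift it to the threshold.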
11 Query Processing Alternatives
- Prioritization Strategies (at each server)
- FIFO
- Current Score
- Maximum Possible Next Score
- Maximum Possible Final Score
- Routing Decisions (at the router)
- Static
- Score-based
- Likely to increase score the most
- Likely to increase score the least
- Size-based
- Likely to produce the fewest matches
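The alternatives above can be read as interchangeable ordering and selection functions. The sketch below assumes each queued match already carries the four quantities named on the slide and that per-server estimates of score gain and result size are available; the names `QueuedMatch`, `ServerEstimate`, `next_match`, and `route` are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class QueuedMatch:
    arrival: int            # arrival order at the server
    current_score: float
    max_next_score: float   # best score possible after this server's operation
    max_final_score: float  # best score possible after all remaining operations


# Prioritization strategies (at each server): the match with the largest key goes first.
PRIORITY: Dict[str, Callable[[QueuedMatch], float]] = {
    "fifo": lambda m: -m.arrival,
    "current_score": lambda m: m.current_score,
    "max_next_score": lambda m: m.max_next_score,
    "max_final_score": lambda m: m.max_final_score,
}


def next_match(queue: List[QueuedMatch], strategy: str) -> QueuedMatch:
    return max(queue, key=PRIORITY[strategy])


@dataclass
class ServerEstimate:
    name: str
    expected_score_gain: float  # assumed estimate of the score increase
    expected_matches: float     # assumed estimate of the number of extensions


# Routing decisions (at the router): pick the next server for one partial match.
def route(unvisited: List[ServerEstimate], strategy: str) -> ServerEstimate:
    if strategy == "static":
        return unvisited[0]  # fixed, predetermined order
    if strategy == "max_score":
        return max(unvisited, key=lambda s: s.expected_score_gain)
    if strategy == "min_score":
        return min(unvisited, key=lambda s: s.expected_score_gain)
    if strategy == "min_size":
        return min(unvisited, key=lambda s: s.expected_matches)
    raise ValueError(f"unknown routing strategy: {strategy}")
```

Keeping the strategies as plain functions makes it straightforward to instantiate the same architecture with different combinations, which is how the evaluation compares them.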
12 Evaluation Strategies
- Lockstep (Static)
- Partial matches follow the same execution plan
- Partial matches have gone through exactly the same number of operations
- Whirlpool Single-threaded (Adaptive)
- Partial matches are adaptively routed
- Process the partial match with the highest maximum final score (query processing similar to Upper)
- Only one partial match is processed at a time
- Whirlpool Multi-threaded (Adaptive)
- Prioritization strategy at each server decides which partial match to process next at that server
- System determines which server to process next
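The single-threaded adaptive strategy can be sketched as one loop over a global queue. Everything concrete below is an assumption made for illustration: the toy `DATA`, integer scores that simply add up, a missing node contributing 0 (a relaxation), and size-based routing that peeks at the number of candidates per server.

```python
import heapq
from typing import Dict, Iterable, List, Tuple

# Hypothetical data: root node -> query node -> candidate score contributions.
DATA: Dict[str, Dict[str, List[int]]] = {
    "b1": {"title": [5], "author": [3], "edition": [2]},
    "b2": {"title": [5], "author": [], "edition": [2]},
    "b3": {"title": [], "author": [3], "edition": []},
}
MAX_CONTRIB = {"title": 5, "author": 3, "edition": 2}


def kth_best(scores: Iterable[float], k: int) -> float:
    top = heapq.nlargest(k, scores)
    return top[-1] if len(top) == k else float("-inf")


def whirlpool_s(data, max_contrib, k) -> Tuple[List[Tuple[str, int]], int]:
    best = {root: 0 for root in data}  # top-k bookkeeping: one score per root
    heap = [(-sum(max_contrib.values()), i, root, 0, tuple(max_contrib))
            for i, root in enumerate(data)]
    heapq.heapify(heap)
    counter = len(heap)
    operations = 0
    while heap:
        # Always process the match with the highest maximum possible final score.
        neg_max, _, root, score, remaining = heapq.heappop(heap)
        if -neg_max < kth_best(best.values(), k):
            break  # all queued matches have an even lower upper bound
        if not remaining:
            continue  # complete match: not processed further
        # Adaptive, size-based routing: server with the fewest candidates.
        node = min(remaining, key=lambda n: len(data[root][n]))
        rest = tuple(n for n in remaining if n != node)
        operations += 1
        for contribution in data[root][node] or [0]:
            new_score = score + contribution
            best[root] = max(best[root], new_score)
            new_max = new_score + sum(max_contrib[n] for n in rest)
            counter += 1
            heapq.heappush(heap, (-new_max, counter, root, new_score, rest))
    ranked = sorted(best.items(), key=lambda kv: -kv[1])[:k]
    return ranked, operations


print(whirlpool_s(DATA, MAX_CONTRIB, k=1))  # ([('b1', 10)], 5)
```

On this toy input the top-1 answer is found after 5 server operations, whereas sending every book through every server would take 9; a lockstep plan would differ by fixing `node` to the same order for all matches.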
13 Evaluation Metrics
- Parameters
- Query size
- Document size
- k
- Parallelism
- Scoring function (tf.idf proposed in paper)
- Measures
- Query execution time
- Number of server operations
- Number of partial matches created
14 Evaluation Setting
- C implementation, with POSIX threads
- Default machine
- Red Hat 7.1 Linux
- 1.4 GHz dual processor
- 2 GB RAM
- XML documents generated using the XMark generating tool
- XPath queries chosen from XMark to illustrate different relaxations
- XML nodes stored using Dewey encoding
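Dewey encoding labels each node with its parent's label plus its own position among its siblings, so structural relationships reduce to prefix tests on the labels. The helper names below are hypothetical and the dotted string form is an assumed serialization; this is a sketch of the general technique, not of the prototype's storage layer.

```python
from typing import Tuple

Dewey = Tuple[int, ...]


def parse(label: str) -> Dewey:
    """Turn a dotted label such as '1.2.3' into a tuple of integers."""
    return tuple(int(part) for part in label.split("."))


def is_ancestor(a: Dewey, d: Dewey) -> bool:
    """True if a is a proper ancestor of d (descendant edge, '//')."""
    return len(a) < len(d) and d[:len(a)] == a


def is_parent(p: Dewey, c: Dewey) -> bool:
    """True if p is the parent of c (child edge, '/')."""
    return len(c) == len(p) + 1 and c[:-1] == p


book, info, title = parse("1"), parse("1.2"), parse("1.2.3")
print(is_parent(info, title))     # True
print(is_ancestor(book, title))   # True
print(is_parent(book, title))     # False
```

Comparing integer tuples rather than raw strings avoids false positives such as "1.2" appearing to be a prefix of "1.20.3". Checking an exact child edge and its generalized descendant edge uses the same labels, which suits a system that evaluates relaxed and exact matches together.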
15 Comparison of Adaptive Routing Strategies
Whirlpool-S and Whirlpool-M perform approximately the same number of server operations
16 Static Routing Strategies vs. Best Adaptive
17 Effect of Parallelism
18 Varying Query Size and k (log scale)
For large queries and high values of k, Whirlpool-M performs fewer server operations than Whirlpool-S (and is faster even on a one-processor machine): 27% fewer server operations for q3, k = 75.
19 Varying Query Size and Document Size
Almost twice as fast
20 Scalability
Document size   1M      10M      50M
Q1              100%    93.12%   85.66%
Q2              100%    49.56%   67.66%
Q3              100%    39.59%   31.20%
Partial matches created by Whirlpool-M, as a percentage of the maximum possible number of partial matches
21 Conclusions
- Efficient adaptive top-k query processing strategy
- Minimize number of partial matches evaluated
- Benefit from parallelism with little threading overhead
- Adapt to different environments
- Score distribution
- Selectivity distribution
- Extensive experimental evaluation
- Good scalability