Title: An Efficient and Versatile Query Engine for TopX Search
1An Efficient and Versatile Query Engine for
TopX Search
- Martin Theobald
- Ralf Schenkel
- Gerhard Weikum
- Max-Planck Institute for Informatics
- Saarbrücken
- Germany
VLDB 05
2An XML-IR Scenario
//article[.//sec[about(.//, XML retrieval)] and .//par[about(.//, native XML database)]]//bib[about(.//item, W3C)]
3TopX: Efficient XML-IR
Goal: Efficiently retrieve the best results of a similarity query
- Extend top-k query processing algorithms for sorted lists [Buckley 85; Güntzer, Balke & Kießling 00; Fagin 01] to XML data
- Combined inverted index for content & structure
- Avoid full index scans, postpone expensive random I/Os to large disk-resident data structures
- Exploit cheap disk space for redundant indexing
- Probabilistic candidate pruning for approximate top-k
4XML-IR History and Related Work
IR on structured data (SGML)
Web query languages
1995
W3QS (Technion Haifa)
OED etc. (U Waterloo)
Araneus (U Roma)
HySpirit (U Dortmund)
Lorel (Stanford U)
HyperStorM (GMD Darmstadt)
WebSQL (U Toronto)
WHIRL (CMU)
IR on XML
XML query languages
XIRQL (U Dortmund / Essen)
XML-QL (ATT Labs)
XXL & TopX (U Saarland / MPI)
2000
XPath 1.0 (W3C)
ApproXQL (U Berlin / U Munich)
ELIXIR (U Dublin)
NEXI (INEX benchmark)
XPath & XQuery Full-Text (W3C)
PowerDB-IR (ETH Zurich)
JuruXML (IBM Haifa)
XSearch (Hebrew U)
XPath 2.0 (W3C)
Timber (U Michigan)
XRank & Quark (Cornell U)
XQuery (W3C)
FleXPath (ATT Labs)
XKeyword (UCSD)
TeXQuery (ATT Labs)
Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
2005
5Outline
- Data & scoring model
- Database schema & indexing
- Top-k query processing for XML
- Scheduling & probabilistic candidate pruning
- Experiments & conclusions
6Computational Model
- Precomputed content scores $\mathrm{score}(t_i, e) \in \mathbb{R}$
- E.g., term/element frequencies, probabilistic models (Okapi BM25), etc.
- Typically normalized to $\mathrm{score}(t_i, e) \in [0, 1]$
- Monotonic score aggregation
- $\mathrm{aggr}: (D_1 \times \dots \times D_m) \times (D_1 \times \dots \times D_m) \to \mathbb{R}$
- E.g., sum, max, product (using log), cosine (using L2 norm)
- Structural query conditions
- Complex query DAGs
- Aggregate constant score c for each matched structural condition (edges)
- Similarity queries (aka "andish")
- Non-conjunctive query evaluations
- Weak content matches can be compensated
- Vague structural matches
- Access model
- Disk-resident inverted index
- Inexpensive sequential accesses (SA) to inverted lists: getNextItem()
- More expensive random accesses (RA): getItemBy(Id)
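The access model can be pictured with a small in-memory stand-in. This is only a sketch: the class name `InvertedList`, the Python method names (mirroring getNextItem() and getItemBy(Id)), the access counters, and the toy postings are illustrative assumptions; the index described in the talk is disk-resident.

```python
from typing import Dict, List, Optional, Tuple


class InvertedList:
    """Toy inverted list for one term: entries sorted by descending score."""

    def __init__(self, postings: Dict[int, float]) -> None:
        # Sorted-access order: highest score first (ties broken by id).
        self._sorted: List[Tuple[int, float]] = sorted(
            postings.items(), key=lambda kv: (-kv[1], kv[0])
        )
        self._by_id = dict(postings)  # stands in for the random-access path
        self._pos = 0
        self.sa_count = 0  # number of sequential accesses performed
        self.ra_count = 0  # number of random accesses performed

    def get_next_item(self) -> Optional[Tuple[int, float]]:
        """Sequential access (SA): next (id, score) in descending score order."""
        if self._pos >= len(self._sorted):
            return None
        self.sa_count += 1
        item = self._sorted[self._pos]
        self._pos += 1
        return item

    def get_item_by_id(self, item_id: int) -> float:
        """Random access (RA): score of a given id, 0.0 if absent."""
        self.ra_count += 1
        return self._by_id.get(item_id, 0.0)


# Hypothetical postings: id -> precomputed, normalized score.
lst = InvertedList({7: 0.4, 3: 0.9, 5: 0.7})
first = lst.get_next_item()      # best-scored entry via sorted access
looked_up = lst.get_item_by_id(5)  # direct lookup via random access
```

Counting SAs and RAs separately is a deliberate choice here, since the two access types have very different costs and the later scheduling slides trade one against the other.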
7Data Model
ftf("xml", article1) = 3
<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>
- Simplified XML model
- disregarding IDRef/XLink/XPointer
- Redundant full-contents
- Per-element term frequencies ftf(ti,e) for full contents
- Inverted lists for tag-term pairs
- Pre/postorder labels for structural joins
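The full-content frequencies and pre/postorder labels can be reproduced on the example document with a short script. This is a sketch under assumptions: the tokenizer (lower-cased alphanumeric runs) and the helper names `label_and_count` and `is_descendant` are illustrative, not taken from the talk.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>"""


def tokenize(text):
    """Lower-case alphanumeric tokens; 'XML-IR' yields 'xml' and 'ir'."""
    return re.findall(r"[a-z0-9]+", text.lower())


def label_and_count(root):
    """Return {element: (pre, post, ftf)} for every element.

    pre/post are 1-based preorder/postorder ranks; ftf is a Counter of
    term frequencies over the element's full content (its own text plus
    the text of all its descendants).
    """
    info = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]
        ftf = Counter(tokenize(elem.text or ""))
        for child in elem:
            ftf.update(visit(child))
            ftf.update(tokenize(child.tail or ""))
        counters["post"] += 1
        info[elem] = (pre, counters["post"], ftf)
        return ftf

    visit(root)
    return info


def is_descendant(anc, desc):
    """True if label desc lies strictly below label anc in the same tree."""
    return anc[0] < desc[0] and desc[1] < anc[1]


root = ET.fromstring(DOC)
labels = label_and_count(root)
sec = root.find("sec")
par = sec.find("par")
ftf_xml_article = labels[root][2]["xml"]
```

The descendant test needs only two integer comparisons per pair of elements, which is what makes pre/postorder labels convenient for in-memory structural joins.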
8Full-Content Scoring Model
Element statistics:
tag      N          avglength  k1    b
article  16,850     2,903      10.5  0.75
sec      96,709     413        10.5  0.75
par      1,024,907  32         10.5  0.75
fig      109,230    13         10.5  0.75
- Basic scoring idea within the IR-style family of TF·IDF ranking functions
- Full-content scores cast into an Okapi BM25 probabilistic model with element-specific model parameterization
- Additional static score mass c for relaxable structural conditions
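The scoring formula itself did not survive in the slide text. As a rough illustration, the following sketch assumes the standard Okapi BM25 shape with per-tag statistics taken from the table; the function name, the `ef` statistic (number of elements with the given tag that contain the term), and the exact form of the formula are assumptions, not the authors' definition.

```python
import math

# Per-tag statistics copied from the table: (N, avglength, k1, b).
TAG_STATS = {
    "article": (16_850, 2_903, 10.5, 0.75),
    "sec": (96_709, 413, 10.5, 0.75),
    "par": (1_024_907, 32, 10.5, 0.75),
    "fig": (109_230, 13, 10.5, 0.75),
}


def bm25_element_score(tag, ftf, length, ef):
    """BM25-style score of one term for one element with the given tag.

    ftf    -- full-content term frequency of the term in the element
    length -- number of terms in the element's full content
    ef     -- number of elements with this tag containing the term
              (assumed statistic, analogous to document frequency)
    """
    n, avglength, k1, b = TAG_STATS[tag]
    # Length normalization relative to the average length for this tag.
    k = k1 * ((1.0 - b) + b * length / avglength)
    tf_part = (k1 + 1.0) * ftf / (k + ftf)
    idf_part = math.log((n - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part


# Hypothetical inputs: a sec element of average length, term in 1,000 secs.
low = bm25_element_score("sec", 1, 413, 1_000)
high = bm25_element_score("sec", 3, 413, 1_000)
```

Because N and avglength are looked up per tag, the same term frequency yields different scores for a short par and a long article, which is the point of the element-specific parameterization.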
10Inverted Block-Index for Content & Structure
- Inverted index over tag-term pairs (full-contents)
- Benefits from increased selectivity of combined tag-term pairs
- Accelerates child-or-descendant axis, e.g., sec//clustering

Inverted lists: sec[clustering], title[xml], par[evaluation]

sec[clustering]
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

par[evaluation]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

title[xml]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

- Sequential block-scans
- Re-order elements in descending order of (maxscore, docid, score) per list
- Fetch all tag-term pairs per doc in one sequential block-access
- docid limits range of in-memory structural joins
- Stored as inverted files or database tables (using B-trees)
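The block ordering can be checked against the sec[clustering] list with a few lines of code. This is a minimal sketch: the function name `build_blocks` and the tuple layout are assumptions, and the entries are deliberately given in shuffled order.

```python
from itertools import groupby

# (eid, docid, score, pre, post) entries of one tag-term list, unordered.
entries = [
    (84, 3, 0.1, 1, 12),
    (9, 2, 0.5, 10, 8),
    (171, 5, 0.85, 1, 20),
    (46, 2, 0.9, 2, 15),
]


def build_blocks(rows):
    """Group entries per document and order blocks by descending max-score."""
    max_score = {}
    for _eid, docid, score, _pre, _post in rows:
        max_score[docid] = max(max_score.get(docid, 0.0), score)
    # Sort key is (maxscore, docid, score), all descending.
    ordered = sorted(
        rows,
        key=lambda e: (max_score[e[1]], e[1], e[2]),
        reverse=True,
    )
    # Consecutive entries of the same document form one block.
    return [
        (docid, max_score[docid], list(group))
        for docid, group in groupby(ordered, key=lambda e: e[1])
    ]


blocks = build_blocks(entries)
eid_order = [e[0] for _doc, _max, block in blocks for e in block]
```

Keeping all entries of a document adjacent means one sequential block read delivers every element needed for that document's structural join.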
11Navigational Index
Figure labels: sec, title[xml], par[evaluation] (content lists); sec, title, par (navigational lists)

sec
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

title
eid  docid  pre  post
216  17     2    15
72   3      10   8
51   2      4    12
671  31     12   23

par
eid  docid  pre  post
3    1      1    21
28   2      8    14
182  5      3    7
96   4      6    4

- Additional navigational index
- Non-redundant element directory
- Support for element paths and branching path queries
- Random accesses using (docid, tag) as key
- Schema-oblivious indexing & querying
13TopX Query Processing
[Fagin et al., PODS 01; Güntzer et al., VLDB 00; Buckley & Lewit, SIGIR 85]
- Adapt Threshold Algorithm (TA) paradigm
- Focus on inexpensive sequential/sorted accesses
- Postpone expensive random accesses
- Candidate d = connected sub-pattern with element ids and scores
- Incrementally evaluate path constraints (using pre/postorder labels)
- In-memory structural joins (nested loops, staircase, or holistic twig joins)
- Upper/lower score guarantees per candidate
- Remember set of evaluated dimensions E(d)
- $\mathrm{worstscore}(d) = \sum_{i \in E(d)} \mathrm{score}(t_i, e)$
- $\mathrm{bestscore}(d) = \mathrm{worstscore}(d) + \sum_{i \notin E(d)} \mathrm{high}_i$
- Early threshold termination
- Candidate queuing
- Stop, if $\max\{\mathrm{bestscore}(d) \mid d \in \text{candidates}\} \le \min\{\mathrm{worstscore}(d) \mid d \in \text{top-}k\}$
- Extensions
- Queue management
- Cost model for random access scheduling
- Probabilistic candidate pruning for approximate top-k results [Theobald, Schenkel & Weikum, VLDB 04]
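The worstscore/bestscore bookkeeping and the stopping test can be sketched at document level. This is not TopX itself: it assumes summation as the aggregation function, uses sorted accesses only (no random accesses, no structural tests), and the function name `nra_top_k` and the toy lists are hypothetical.

```python
def nra_top_k(lists, k):
    """Top-k by summed score using sorted accesses only.

    lists -- one list per query dimension, each holding (doc, score)
             pairs in descending score order
    Returns (top_k, rounds): top_k is [(doc, worstscore)] and rounds is
    the number of round-robin scan steps performed before stopping.
    """
    m = len(lists)
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # high_i bounds
    seen = {}  # doc -> {dimension i: score}; the keys form E(d)
    depth = 0
    max_depth = max((len(lst) for lst in lists), default=0)

    def worst(d):
        return sum(seen[d].values())

    def best(d):
        return worst(d) + sum(high[i] for i in range(m) if i not in seen[d])

    while depth < max_depth:
        # One round-robin step of sorted accesses.
        for i, lst in enumerate(lists):
            if depth < len(lst):
                doc, score = lst[depth]
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            else:
                high[i] = 0.0  # list exhausted: nothing more to gain
        depth += 1

        ranked = sorted(seen, key=worst, reverse=True)
        top, rest = ranked[:k], ranked[k:]
        if len(top) < k:
            continue
        min_k = worst(top[-1])
        # Documents not seen so far can reach at most sum(high).
        best_other = max([best(d) for d in rest] + [sum(high)])
        if best_other <= min_k:
            break

    ranked = sorted(seen, key=worst, reverse=True)[:k]
    return [(d, worst(d)) for d in ranked], depth


# Hypothetical two-dimensional example.
lists = [
    [("d1", 0.9), ("d2", 0.8), ("d3", 0.3), ("d4", 0.1)],
    [("d2", 0.9), ("d1", 0.5), ("d4", 0.2), ("d3", 0.1)],
]
top1, rounds = nra_top_k(lists, k=1)
```

In this toy run the scan stops after two of four rounds, because no other candidate and no unseen document can still exceed the current min-k. The scores returned are lower bounds, which is all the stopping test guarantees.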
14TopX Query Processing By Example
Top-2 results for the lists sec[clustering], title[xml], par[evaluation]

title[xml]
eid  docid  score  pre  post
216  17     0.9    2    15
72   3      0.8    10   8
51   2      0.5    4    12
671  31     0.4    12   23

par[evaluation]
eid  docid  score  pre  post
3    1      1.0    1    21
28   2      0.8    8    14
182  5      0.75   3    7
96   4      0.75   6    4

sec[clustering]
eid  docid  score  pre  post
46   2      0.9    2    15
9    2      0.5    10   8
171  5      0.85   1    20
84   3      0.1    1    12

min-2 threshold over the course of the scan: 0.0, 0.5, 0.9, 1.6
Documents shown among the top-2 results and in the candidate queue: doc2, doc17, doc1, doc5, doc3 (plus a pseudo-element)
[Figure: animation of per-candidate worst and best scores; the individual score labels are not recoverable]
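The final threshold min-2 = 1.6 can be cross-checked by evaluating the three lists exhaustively. The sketch below assumes that a result is anchored at a sec[clustering] element and adds the best title[xml] and par[evaluation] scores found among its descendants; that reading of the query, and all helper names, are assumptions. Documents matching only title or par score at most 1.0 here and cannot enter the top 2, so they are left out.

```python
# (eid, docid, score, pre, post) rows copied from the three lists.
SEC = [(46, 2, 0.9, 2, 15), (9, 2, 0.5, 10, 8),
       (171, 5, 0.85, 1, 20), (84, 3, 0.1, 1, 12)]
TITLE = [(216, 17, 0.9, 2, 15), (72, 3, 0.8, 10, 8),
         (51, 2, 0.5, 4, 12), (671, 31, 0.4, 12, 23)]
PAR = [(3, 1, 1.0, 1, 21), (28, 2, 0.8, 8, 14),
       (182, 5, 0.75, 3, 7), (96, 4, 0.75, 6, 4)]


def below(anc, desc):
    """Descendant test within one document via pre/postorder labels."""
    return anc[1] == desc[1] and anc[3] < desc[3] and desc[4] < anc[4]


def best_under(sec_row, rows):
    """Highest score among rows that are descendants of sec_row (0 if none)."""
    return max((r[2] for r in rows if below(sec_row, r)), default=0.0)


def exhaustive_scores():
    """Best summed score per document over all sec anchors."""
    scores = {}
    for sec in SEC:
        total = sec[2] + best_under(sec, TITLE) + best_under(sec, PAR)
        scores[sec[1]] = max(scores.get(sec[1], 0.0), total)
    return scores


scores = exhaustive_scores()
top2 = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:2]
```

Under these assumptions doc2 (0.9 + 0.5 + 0.8) and doc5 (0.85 + 0.75) form the top 2, so the second-best score agrees with the last min-2 value shown on the slide.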
15Incremental Path Validations
Query: //article with two branches, //sec//par//"xml java" and //bib//item//title//"security" (all edges child-or-descendant)
- Complex query DAGs
- Transitive closure of structural constraints
- Aggregate additional static score mass c for a structural condition i, if all edges rooted at i are satisfiable
- Incrementally test structural constraints
- Quickly decrease best scores for early pruning
- Schedule random accesses in ascending order of structural selectivities
[Figure: a promising candidate d with worst(d) = 1.5; random accesses (RA) on bib and item each return 0.0; best(d) values shown are 6.5, 5.5 and 4.5; min-k = 4.8; further score labels 0.7 and 0.8]
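The effect of failed structural tests on a candidate's bounds can be sketched with the numbers from the figure. The class `Candidate`, the value c = 1.0, and the order in which best(d) shrinks are assumptions made for illustration; the figure only shows the values 6.5, 5.5 and 4.5.

```python
from dataclasses import dataclass, field


@dataclass
class Candidate:
    worst: float  # score mass already confirmed
    best: float   # upper bound including outstanding conditions
    history: list = field(default_factory=list)

    def apply_structural_test(self, satisfied: bool, c: float) -> None:
        """Fold in the outcome of one random access on a structural condition."""
        if satisfied:
            self.worst += c  # the static score mass c is now certain
        else:
            self.best -= c   # c can no longer be earned
        self.history.append((self.worst, self.best))

    def can_be_pruned(self, min_k: float) -> bool:
        """A candidate is hopeless once its best score cannot beat min-k."""
        return self.best <= min_k


d = Candidate(worst=1.5, best=6.5)
for _tag in ("bib", "item"):  # both lookups fail, as in the figure
    d.apply_structural_test(satisfied=False, c=1.0)
pruned = d.can_be_pruned(min_k=4.8)
```

After one failed test the candidate still has best(d) = 5.5 > 4.8 and must be kept; the second failure brings it to 4.5, below min-k, so it can be dropped without any further accesses.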
17Random Access Scheduling - Minimal Probes
- MinProbe-Scheduling
- Structural conditions as "soft filters"
- (Expensive predicates & minimal probes [Chang & Hwang, SIGMOD 02])
- Schedule random accesses only for the most promising candidates
- Schedule batch of RAs on d, if
- $\mathrm{worstscore}(d) + o_d \cdot c > \text{min-}k$
- worstscore(d): evaluated content & structure-related score
- $o_d \cdot c$: not yet evaluated structural score mass (constant!)
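The MinProbe condition is a one-line test. In this sketch o_d is read as the number of structural conditions not yet evaluated for d, which is an interpretation of the slide; the function name and sample values are hypothetical.

```python
def should_probe(worstscore: float, outstanding: int, c: float, min_k: float) -> bool:
    """MinProbe test: schedule RAs on d only if worstscore(d) + o_d * c > min-k.

    outstanding -- o_d, assumed to count the structural conditions of d
                   that have not been evaluated yet
    c           -- constant score mass per structural condition
    """
    return worstscore + outstanding * c > min_k


# Hypothetical candidates against a current threshold of min-k = 4.8.
promising = should_probe(worstscore=3.9, outstanding=1, c=1.0, min_k=4.8)
hopeless = should_probe(worstscore=1.5, outstanding=2, c=1.0, min_k=4.8)
```

Since the outstanding structural mass is a known constant, the test costs nothing to evaluate and only candidates that could actually enter the top-k ever trigger random accesses.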
18Random Access Scheduling - Cost Model
- BenProbe-Scheduling
- Analytic cost model
- Basic idea
- Compare expected access costs to the costs of an optimal schedule
- Access costs on d are wasted, if d does not make it into the final top-k (considering both content & structure)
- Compare different Expected Wasted Costs (EWC)
- EWC-RAs(d) of looking up d in the structure
- EWC-RAc(d) of looking up d in the content
- EWC-SA(d) of not seeing d in the next batch of b sorted accesses
- Schedule batch of RAs on d, if
- EWC-RA(d) [RA] < EWC-SA [SA]
19Structural Selectivity Estimator
Query: //sec[//figure[java]] [//par[xml]] [//bib[vldb]]
- Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag-term pairs
- Consider structural selectivities
- P[d satisfies all structural conditions Y] (formula not recoverable)
- P[d satisfies a subset Y' of structural conditions Y] (formula not recoverable)
- Consider binary correlations between structural patterns and/or tag-term pairs
- P[d satisfies all structural conditions Y] (formula not recoverable)
- With cov(Xi, Xj) estimated by data samples, query logs, etc.

Pattern selectivities (figure labels: sec, bib[vldb], EWC-RAs(d)):
p1   0.682  //sec[//figure]//par
p2   0.001  //sec[//figure]//bib
p3   0.002  //sec[//par]//bib
p4   0.688  //sec//figure
p5   0.968  //sec//par
p6   0.002  //sec//bib
p7   0.986  //sec
p8   0.067  //par[xml]
p9   0.011  //figure[java]
p10  0.023  //bib[vldb]
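The role of the covariance term can be illustrated with three of the listed selectivities. The sketch relies on the standard identity for indicator variables, cov(Xi, Xj) = P[Xi = 1 and Xj = 1] - P[Xi = 1] · P[Xj = 1]. Treating p1 as the joint probability of the patterns behind p4 and p5, and the pairing of values with patterns itself, are assumptions; the estimator used in the talk is not recoverable from the slide.

```python
def indicator_covariance(p_joint: float, p_a: float, p_b: float) -> float:
    """Covariance of two 0/1 indicator variables from their probabilities."""
    return p_joint - p_a * p_b


# Selectivities as listed on the slide (assumed pairing with patterns).
p_sec_figure = 0.688      # p4: //sec//figure
p_sec_par = 0.968         # p5: //sec//par
p_sec_figure_par = 0.682  # p1: //sec[//figure]//par

# Estimate under an independence assumption versus the listed joint value.
independent_estimate = p_sec_figure * p_sec_par
covariance = indicator_covariance(p_sec_figure_par, p_sec_figure, p_sec_par)
```

The independence estimate (about 0.666) falls slightly short of the listed joint selectivity 0.682, and the positive covariance of about 0.016 is the correction that a correlation-aware estimator would supply.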
20Full-content Score Predictor
- For each inverted list Li (i.e., tag-term pairs)
- Approximate each local score distribution Si using an equi-width histogram with n buckets
Probabilistic candidate pruning: Drop d from the candidate queue, if $P[d \text{ gets in the final top-}k] < \epsilon$
- For all d in the candidate queue
- Consider the convolution of score distributions Si with $i \notin E(d)$
- Consider $P[d \text{ gets in the final top-}k] = P[\mathrm{worstscore}(d) + \sum_{i \notin E(d)} S_i > \text{min-}k]$
(figure label: EWC-RAc(d))

title[xml]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5

par[evaluation]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
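The histogram-based predictor can be sketched with a plain discrete convolution. Several simplifications are assumed: each bucket j is treated as a point mass at score j · width, the lists are taken to be independent, and the histograms, function names and thresholds are hypothetical.

```python
def convolve(h1, h2):
    """Distribution of the sum of two independent bucketized scores.

    Each histogram is a list of probabilities; bucket j stands for the
    score j * width, so bucket indices add when scores add.
    """
    out = [0.0] * (len(h1) + len(h2) - 1)
    for i, p in enumerate(h1):
        for j, q in enumerate(h2):
            out[i + j] += p * q
    return out


def prob_exceeds(worstscore, remaining_hists, min_k, width):
    """Estimate P[worstscore + sum of remaining scores > min_k]."""
    dist = [1.0]  # distribution of an empty sum: all mass on 0
    for h in remaining_hists:
        dist = convolve(dist, h)
    return sum(p for j, p in enumerate(dist) if worstscore + j * width > min_k)


def keep_candidate(worstscore, remaining_hists, min_k, width, epsilon):
    """Pruning rule: keep d only if its estimated chance is at least epsilon."""
    return prob_exceeds(worstscore, remaining_hists, min_k, width) >= epsilon


# Two unevaluated lists, each with score 0.0 (prob. 0.8) or 0.5 (prob. 0.2).
hist = [0.8, 0.2]
p_top = prob_exceeds(1.0, [hist, hist], min_k=1.6, width=0.5)
kept = keep_candidate(1.0, [hist, hist], min_k=1.6, width=0.5, epsilon=0.1)
```

Here the candidate needs more than 0.6 from the two remaining lists, which only happens if both deliver 0.5; the estimated probability is 0.04, below epsilon = 0.1, so the candidate would be dropped.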
22Experiments: Data Collections & Competitors
- INEX 04 benchmark setting
- 12,223 docs; 12M elements; 119M index entries; 534 MB
- 46 queries, e.g., //article[.//bib[QBIC] and .//p[image retrieval]]
- IMDB (Internet Movie Database)
- 386,529 docs; 34M elements; 130M index entries; 1,117 MB
- 20 queries, e.g., //movie[.//casting[.//actor[John Wayne]]//role[Sheriff]]//[.//year[1959] and .//genre[Western]]
- Competitors
- DBMS-style Join&Sort
- Using the TopX schema
- StructIndex [Kaushik et al., SIGMOD 04]
- Top-k with separate inverted indexes for content & structure
- DataGuide-like structural index
- Full evaluations, hence no uncertainty about final document scores
- No candidate queuing, eager random accesses
- StructIndex+
- Extent chaining technique for DataGuide-based extent identifiers (skip scans)
23INEX Results
24IMDB Results
System         k   epsilon  SA          RA       CPU
Join&Sort      10  n/a      14,510,077  0        37.7
StructIndex    10  n/a      346,697     291,655  0.16
StructIndex+   10  n/a      22,445      301,647  0.17
TopX MinProbe  10  0.0      317,380     72,196   0.08
TopX BenProbe  10  0.0      241,471     50,016   0.06
Precision columns (P@k, MAP@k, relPrec): values not recoverable from the slide; only a stray "n/a" and "1.00" remain.
25INEX with Probabilistic Pruning
26Conclusions & Ongoing Work
- Efficient and versatile TopX query processor
- Extensible framework for text, semi-structured & structured data
- Probabilistic cost model for random access scheduling
- Very good precision/runtime ratio for probabilistic candidate pruning
- Scalability
- Optimized for runtime, exploits cheap disk space (factor 4-5 for INEX)
- Experiments on TREC Terabyte text collection (see paper)
- Support for typical IR extensions
- Phrase matching, mandatory terms "+", negation "-"
- Query weights (e.g., relevance feedback, ontological similarities)
- Dynamic and self-tuning query expansions [SIGIR 05]
- Incrementally merges inverted lists on demand
- Dynamically opens scans on additional expansion terms
- Vague Content & Structure (VCAS) queries
27Thank you!