An Efficient and Versatile Query Engine for TopX Search presentation

1
An Efficient and Versatile Query Engine for
TopX Search

Martin Theobald
Ralf Schenkel
Gerhard Weikum
Max-Planck Institute for Informatics
Saarbrücken
Germany

VLDB 05
2
An XML-IR Scenario
//article[.//sec[about(.//, XML retrieval)]
    and .//par[about(.//, native XML database)]]
    //bib[about(.//item, W3C)]
3
TopX: Efficient XML-IR
Goal: Efficiently retrieve the best results of a similarity query

Extend top-k query processing algorithms for sorted lists [Buckley 85; Güntzer, Balke & Kießling 00; Fagin 01] to XML data
Combined inverted index for content & structure
Avoid full index scans, postpone expensive random I/Os to large disk-resident data structures
Exploit cheap disk space for redundant indexing
Probabilistic candidate pruning for approximate top-k

4
XML-IR History and Related Work
IR on structured data (SGML)
Web query languages
1995
W3QS (Technion Haifa)
OED etc. (U Waterloo)
Araneus (U Roma)
HySpirit (U Dortmund)
Lorel (Stanford U)
HyperStorM (GMD Darmstadt)
WebSQL (U Toronto)
WHIRL (CMU)
IR on XML
XML query languages
XIRQL (U Dortmund / Essen)
XML-QL (AT&T Labs)
XXL & TopX (U Saarland / MPI)
2000
XPath 1.0 (W3C)
ApproXQL (U Berlin / U Munich)
ELIXIR (U Dublin)
NEXI (INEX benchmark)
XPath & XQuery Full-Text (W3C)
PowerDB-IR (ETH Zurich)
JuruXML (IBM Haifa)
XSearch (Hebrew U)
XPath 2.0 (W3C)
Timber (U Michigan)
XRank & Quark (Cornell U)
XQuery (W3C)
FleXPath (AT&T Labs)
XKeyword (UCSD)
TeXQuery (AT&T Labs)
Commercial software: MarkLogic, Verity?, IBM?, Oracle?, ...
2005
5
Outline

Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions

6
Computational Model

Precomputed content scores $\mathrm{score}(t_i, e) \in \mathbb{R}$
E.g., term/element frequencies, probabilistic models (Okapi BM25), etc.
Typically normalized to $\mathrm{score}(t_i, e) \in [0, 1]$
Monotonous score aggregation
aggr (D1Dm ) ? (D1Dm ) ? ?
E.g., sum, max, product (using log), cosine
(using L2 norm)
Structural query conditions
Complex query DAGs
Aggregate constant score c for each matched
structural condition (edges)
Similarity queries (aka. andish)
Non-conjunctive query evaluations
Weak content matches can be compensated
Vague structural matches
Access model
Disk-resident inverted index
Inexpensive sequential accesses (SA) to inverted lists: getNextItem()
More expensive random accesses (RA): getItemBy(Id)

7
Data Model
ftf("xml", article1) = 3
<article>
  <title>XML-IR</title>
  <abs>IR techniques for XML</abs>
  <sec>
    <title>Clustering on XML</title>
    <par>Evaluation</par>
  </sec>
</article>

Simplified XML model
disregarding IDRef/XLink/XPointer
Redundant full-contents
Per-element term frequencies ftf(ti,e) for full
contents
Inverted lists for tag-term pairs
Pre/postorder labels for structural joins
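
The full-content term frequencies and pre/postorder labels from this slide can be illustrated with a minimal Python sketch. It is not TopX's indexer: the tokenizer, the `index` helper, and the 1-based label counters are assumptions made for illustration, and the sample document is the one shown on the slide.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<article><title>XML-IR</title><abs>IR techniques for XML</abs>
<sec><title>Clustering on XML</title><par>Evaluation</par></sec></article>"""

def tokenize(text):
    # Hypothetical tokenizer: lowercase, split on non-alphanumeric characters.
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]

def index(root):
    """Map each element to (pre, post, full-content term counts)."""
    labels = {}
    counters = {"pre": 0, "post": 0}

    def visit(elem):
        counters["pre"] += 1
        pre = counters["pre"]
        # Full content = own text plus the text of all descendants.
        terms = Counter(tokenize(elem.text or ""))
        for child in elem:
            terms += visit(child)
            terms += Counter(tokenize(child.tail or ""))
        counters["post"] += 1
        labels[elem] = (pre, counters["post"], terms)
        return terms

    visit(root)
    return labels

root = ET.fromstring(DOC)
labels = index(root)
pre, post, terms = labels[root]
print(pre, post, terms["xml"])  # 1 6 3
```

The article element receives ftf("xml") = 3 because its full content redundantly includes the text of title, abs, and sec, which matches the value stated on the slide.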

8
Full-Content Scoring Model
Element statistics:
tag      N          avglength  k1    b
article  16,850     2,903      10.5  0.75
sec      96,709     413        10.5  0.75
par      1,024,907  32         10.5  0.75
fig      109,230    13         10.5  0.75

Basic scoring idea within IR-style family of TF·IDF ranking functions

Full-content scores cast into an Okapi-BM25
probabilistic model with
element-specific model parameterization

Additional static score mass c for relaxable
structural conditions
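
The slide's scoring formula did not survive in the transcript, so the following is only a sketch that plugs the element statistics above into the textbook Okapi BM25 form. The function name `bm25_score` and the `ef` argument (number of elements with the given tag that contain the term) are assumptions, and the result is not normalized to [0, 1].

```python
import math

# Per-tag statistics from the slide's table: (N, avglength, k1, b).
STATS = {
    "article": (16_850, 2_903, 10.5, 0.75),
    "sec": (96_709, 413, 10.5, 0.75),
    "par": (1_024_907, 32, 10.5, 0.75),
    "fig": (109_230, 13, 10.5, 0.75),
}

def bm25_score(tag, ftf, length, ef):
    """Textbook BM25 score for one tag-term pair (illustrative only).

    ftf:    full-content term frequency of the term in the element
    length: element length in terms
    ef:     hypothetical count of `tag` elements containing the term
    """
    n, avglength, k1, b = STATS[tag]
    # Length normalization uses the tag-specific average length.
    k = k1 * ((1 - b) + b * length / avglength)
    tf_part = (k1 + 1) * ftf / (k + ftf)
    idf_part = math.log((n - ef + 0.5) / (ef + 0.5))
    return tf_part * idf_part

# A sec element of average length with three occurrences of the term.
print(bm25_score("sec", ftf=3, length=413, ef=1000))
```

Keeping one parameter set per tag is what makes the model element-specific: the same term frequency counts for more in a short par than in a long article.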

9
Outline

Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions

10
Inverted Block-Index for Content & Structure

Inverted index over tag-term pairs (full-contents)
Benefits from increased selectivity of combined tag-term pairs
Accelerates child-or-descendant axis, e.g., sec//clustering

sec[clustering]
eid  docid  score  pre  post  max-score
46   2      0.9    2    15    0.9
9    2      0.5    10   8     0.9
171  5      0.85   1    20    0.85
84   3      0.1    1    12    0.1

title[xml]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5
671  31     0.4    12   23    0.4

par[evaluation]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
96   4      0.75   6    4     0.75

Sequential block-scans
Re-order elements in descending order of
(maxscore, docid, score) per list
Fetch all tag-term pairs per doc in one
sequential block-access
docid limits range of in-memory structural joins
Stored as inverted files or database tables
(using B-trees)
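
The block ordering described above can be reproduced with a short Python sketch. The `block_order` helper is hypothetical; it only illustrates the sort key (maxscore, docid, score) and uses the sec[clustering] entries from the slide as input.

```python
from collections import defaultdict

def block_order(entries):
    """Sort (eid, docid, score) entries of one tag-term list in
    descending order of (maxscore, docid, score)."""
    max_score = defaultdict(float)
    for _, docid, score in entries:
        max_score[docid] = max(max_score[docid], score)
    return sorted(
        entries,
        key=lambda e: (max_score[e[1]], e[1], e[2]),
        reverse=True,
    )

# sec[clustering] entries from the slide, in arbitrary input order.
sec_clustering = [(84, 3, 0.1), (9, 2, 0.5), (171, 5, 0.85), (46, 2, 0.9)]
print([eid for eid, _, _ in block_order(sec_clustering)])  # [46, 9, 171, 84]
```

Element 9 (score 0.5) sorts ahead of element 171 (score 0.85) because it inherits the max-score 0.9 of document 2, so all entries of a document stay in one contiguous block.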

11
Navigational Index

sec
eid  docid  pre  post
46   2      2    15
9    2      10   8
171  5      1    20
84   3      1    12

title
eid  docid  pre  post
216  17     2    15
72   3      10   8
51   2      4    12
671  31     12   23

par
eid  docid  pre  post
3    1      1    21
28   2      8    14
182  5      3    7
96   4      6    4

Additional navigational index
Non-redundant element directory
Support for element paths and branching path
queries
Random accesses using (docid, tag) as key
Schema-oblivious indexing & querying

12
Outline

Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions

13
TopX Query Processing
[Fagin et al., PODS 01; Güntzer et al., VLDB 00; Buckley & Lewit, SIGIR 85]

Adapt Threshold Algorithm (TA) paradigm
Focus on inexpensive sequential/sorted accesses
Postpone expensive random accesses
Candidate d = connected sub-pattern with element ids and scores
Incrementally evaluate path constraints (using pre/postorder labels)
In-memory structural joins (nested loops, staircase, or holistic twig joins)
Upper/lower score guarantees per candidate
Remember set of evaluated dimensions E(d)
$\mathrm{worstscore}(d) = \sum_{i \in E(d)} \mathrm{score}(t_i, e)$
$\mathrm{bestscore}(d) = \mathrm{worstscore}(d) + \sum_{i \notin E(d)} \mathrm{high}_i$
Early threshold termination
Candidate queuing
Stop, if $\max\{\mathrm{bestscore}(d) \mid d \in \text{candidates}\} \le \min\{\mathrm{worstscore}(d) \mid d \in \text{top-}k\}$
Extensions
Queue management
Cost model for random access scheduling
Probabilistic candidate pruning for approximate top-k results
[Theobald, Schenkel & Weikum, VLDB 04]
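
The worstscore/bestscore bookkeeping and the threshold test can be sketched in Python as follows. This is a simplified TA-style loop using sorted accesses only and summation as the aggregation function. It is not the TopX engine: it ignores structural conditions and random accesses, assumes one entry per document and list, and the function name `nra_top_k` is made up for illustration.

```python
from typing import Dict, List, Tuple

def nra_top_k(lists: List[List[Tuple[str, float]]], k: int) -> List[Tuple[str, float]]:
    """lists[i] holds (doc, score) pairs sorted by descending score."""
    m = len(lists)
    pos = [0] * m                                   # scan position per list
    high = [lst[0][1] if lst else 0.0 for lst in lists]  # upper bound per list
    seen: Dict[str, Dict[int, float]] = {}          # doc -> {i: score}, i.e. E(d)

    def worstscore(d: str) -> float:
        return sum(seen[d].values())

    def bestscore(d: str) -> float:
        return worstscore(d) + sum(high[i] for i in range(m) if i not in seen[d])

    while any(pos[i] < len(lists[i]) for i in range(m)):
        for i in range(m):                          # one round of sorted accesses
            if pos[i] < len(lists[i]):
                doc, score = lists[i][pos[i]]
                pos[i] += 1
                seen.setdefault(doc, {})[i] = score
                high[i] = score
            if pos[i] >= len(lists[i]):
                high[i] = 0.0                       # exhausted list adds nothing
        ranked = sorted(seen, key=worstscore, reverse=True)
        if len(ranked) >= k:
            min_k = worstscore(ranked[k - 1])
            # Rivals: queued candidates plus a virtual, still unseen document.
            rivals = [bestscore(d) for d in ranked[k:]] + [sum(high)]
            if max(rivals) <= min_k:
                break                               # early threshold termination
    ranked = sorted(seen, key=worstscore, reverse=True)[:k]
    return [(d, round(worstscore(d), 2)) for d in ranked]

# Per-document scores taken from the three example lists of the talk.
sec = [("doc2", 0.9), ("doc5", 0.85), ("doc3", 0.1)]
title = [("doc17", 0.9), ("doc3", 0.8), ("doc2", 0.5), ("doc31", 0.4)]
par = [("doc1", 1.0), ("doc2", 0.8), ("doc5", 0.75), ("doc4", 0.75)]
print(nra_top_k([sec, title, par], k=2))  # [('doc2', 2.2), ('doc5', 1.6)]
```

On this tiny input the lists are exhausted before the threshold test fires. With longer lists the loop stops as soon as no queued or unseen document can still exceed min-k.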

14
TopX Query Processing By Example
Top-2 results

sec[clustering]
eid  docid  score  pre  post
46   2      0.9    2    15
9    2      0.5    10   8
171  5      0.85   1    20
84   3      0.1    1    12

title[xml]
eid  docid  score  pre  post
216  17     0.9    2    15
72   3      0.8    10   8
51   2      0.5    4    12
671  31     0.4    12   23

par[evaluation]
eid  docid  score  pre  post
3    1      1.0    1    21
28   2      0.8    8    14
182  5      0.75   3    7
96   4      0.75   6    4

[Animation: sorted scans over the three lists; min-2 takes the values 0.0, 0.5, 0.9, 1.6; labels doc2, doc17, doc1, doc5, doc3; Candidate queue; Pseudo-Element]
15
Incremental Path Validations
Query: //article//sec//par//xml java //bib//item//title//security
(edges: child-or-descendant)

Complex query DAGs
Transitive closure of structural constraints
Aggregate additional static score mass c for a structural condition i, if all edges rooted at i are satisfiable
Incrementally test structural constraints
Quickly decrease best scores for early pruning
Schedule random accesses in ascending order of structural selectivities

[Figure: promising candidate d with random accesses (RA) on bib and item; edge labels 0.0, 0.0, 0.7, 0.8; worst(d) = 1.5 with best(d) = 5.5, 4.5, 6.5; min-k = 4.8]
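
Testing a child-or-descendant edge with pre/postorder labels reduces to two integer comparisons. The sketch below assumes the usual encoding (an ancestor has a smaller preorder and a larger postorder rank than its descendants). The `Elem` type and `is_descendant` helper are illustrative names, and the labels are taken from the doc 2 rows of the index tables shown earlier in the talk.

```python
from typing import NamedTuple

class Elem(NamedTuple):
    eid: int
    docid: int
    pre: int
    post: int

def is_descendant(anc: Elem, desc: Elem) -> bool:
    """True if `desc` lies in the subtree rooted at `anc` (same document)."""
    return (
        anc.docid == desc.docid
        and anc.pre < desc.pre
        and anc.post > desc.post
    )

# Elements of document 2 from the example index lists.
sec_46 = Elem(eid=46, docid=2, pre=2, post=15)
sec_9 = Elem(eid=9, docid=2, pre=10, post=8)
title_51 = Elem(eid=51, docid=2, pre=4, post=12)
par_28 = Elem(eid=28, docid=2, pre=8, post=14)

print(is_descendant(sec_46, par_28))    # True
print(is_descendant(sec_46, title_51))  # True
print(is_descendant(sec_9, par_28))     # False
```

Because the test needs only labels that are already stored in the inverted lists, a structural constraint can be checked in memory as soon as both elements of a candidate have been seen.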
16
Outline

Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions

17
Random Access Scheduling - Minimal Probes

MinProbe-Scheduling
Structural conditions as soft filters
(Expensive Predicates & Minimal Probes [Chang & Hwang, SIGMOD 02])
Schedule random accesses only for the most promising candidates
Schedule batch of RAs on d, if
$\mathrm{worstscore}(d) + o_d \cdot c > \text{min-}k$

worstscore(d): evaluated content- & structure-related score
o_d · c: not yet evaluated structural score mass (constant!)
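
The MinProbe test can be written as a one-line predicate. The sketch assumes that o_d counts the structural conditions of candidate d that have not been evaluated yet and that each of them contributes the constant score c. The function name and the sample numbers are hypothetical.

```python
def schedule_random_accesses(worstscore: float, open_conditions: int,
                             c: float, min_k: float) -> bool:
    """MinProbe-style test: probe d only if its already evaluated score plus
    the still unevaluated structural score mass could exceed min-k."""
    return worstscore + open_conditions * c > min_k

# Hypothetical candidates: (name, worstscore, open structural conditions).
candidates = [("d1", 1.5, 3), ("d2", 1.5, 4), ("d3", 4.0, 1)]
to_probe = [name for name, worst, open_c in candidates
            if schedule_random_accesses(worst, open_c, c=1.0, min_k=4.8)]
print(to_probe)  # ['d2', 'd3']
```

Since c is a constant, the test needs no access to the index; it only uses values that are already kept with the candidate in the queue.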
18
Random Access Scheduling - Cost Model

BenProbe-Scheduling
Analytic cost model
Basic idea
Compare expected access costs to the costs of an optimal schedule
Access costs on d are wasted, if d does not make it into the final top-k (considering both content & structure)
Compare different Expected Wasted Costs (EWC)
EWC-RAs(d) of looking up d in the structure
EWC-RAc(d) of looking up d in the content
EWC-SA(d) of not seeing d in the next batch of b sorted accesses
Schedule batch of RAs on d, if
EWC-RA(d) RA < EWC-SA SA
19
Structural Selectivity Estimator
Query: //sec[//figure="java"][//par="xml"][//bib="vldb"]

Split the query into a set of characteristic patterns, e.g., twigs, descendants & tag-term pairs
Consider structural selectivities
P[d satisfies all structural conditions Y]
P[d satisfies a subset Y' of structural conditions Y]
Consider binary correlations between structural patterns and/or tag-term pairs
P[d satisfies all structural conditions Y]
With cov(X_i, X_j) estimated by data samples, query logs, etc.

Pattern selectivities (figure label: EWC-RAs(d))
p    value  pattern
p1   0.682  //sec//figure//par
p2   0.001  //sec//figure//bib
p3   0.002  //sec//par//bib
p4   0.688  //sec//figure
p5   0.968  //sec//par
p6   0.002  //sec//bib
p7   0.986  //sec
p8   0.067  //par="xml"
p9   0.011  //figure="java"
p10  0.023  //bib="vldb"
20
Full-content Score Predictor

For each inverted list L_i (i.e., tag-term pairs)
Approximate each local score distribution S_i using an equi-width histogram with n buckets

Probabilistic candidate pruning: Drop d from the candidate queue, if
$P[d \text{ gets in the final top-}k] < \epsilon$

For all d in the candidate queue
Consider the convolution of score distributions S_i with $i \notin E(d)$
Consider $P[d \text{ gets in the final top-}k] = P[\mathrm{worstscore}(d) + \sum_{i \notin E(d)} S_i > \text{min-}k]$

EWC-RAc(d)

title[xml]
eid  docid  score  pre  post  max-score
216  17     0.9    2    15    0.9
72   3      0.8    10   8     0.8
51   2      0.5    4    12    0.5

par[evaluation]
eid  docid  score  pre  post  max-score
3    1      1.0    1    21    1.0
28   2      0.8    8    14    0.8
182  5      0.75   3    7     0.75
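
The histogram convolution behind the pruning test can be sketched in Python as follows. The sketch assumes scores in [0, 1], represents each equi-width bucket by its midpoint, and treats the remaining lists as independent. The function names, the bucket count, and the sample numbers are illustrative and not taken from the talk.

```python
from typing import List

def convolve(p: List[float], q: List[float]) -> List[float]:
    """Distribution of the sum of two independent bucketized scores."""
    out = [0.0] * (len(p) + len(q) - 1)
    for j, pj in enumerate(p):
        for l, ql in enumerate(q):
            out[j + l] += pj * ql
    return out

def prob_in_top_k(worstscore: float, histograms: List[List[float]],
                  min_k: float, n: int) -> float:
    """Estimate P[worstscore(d) + sum of S_i > min-k], where `histograms`
    holds one n-bucket histogram per list that d has not been seen in."""
    if not histograms:
        return 1.0 if worstscore > min_k else 0.0
    dist = histograms[0]
    for h in histograms[1:]:
        dist = convolve(dist, h)
    m = len(histograms)
    # Index s of `dist` stands for a sum of m bucket midpoints: (s + m/2) / n.
    return sum(prob for s, prob in enumerate(dist)
               if worstscore + (s + 0.5 * m) / n > min_k)

# Two unevaluated lists with uniform 2-bucket histograms (hypothetical data).
uniform = [0.5, 0.5]
p = prob_in_top_k(worstscore=0.5, histograms=[uniform, uniform], min_k=1.4, n=2)
print(p)        # 0.75
print(p < 0.8)  # True: d would be dropped for epsilon = 0.8
```

A larger n gives a finer approximation of each S_i at the price of a quadratic convolution cost per pair of histograms.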
21
Outline

Data & Scoring model
Database schema & indexing
Top-k query processing for XML
Scheduling & probabilistic candidate pruning
Experiments & Conclusions

22
Experiments: Data Collections & Competitors

INEX 04 benchmark setting
12,223 docs, 12M elements, 119M index entries, 534 MB
46 queries, e.g., //article[.//bib="QBIC" and .//p="image retrieval"]
IMDB (Internet Movie Database)
386,529 docs, 34M elements, 130M index entries, 1,117 MB
20 queries, e.g., //movie[.//casting[.//actor="John Wayne"]//role="Sheriff"]//[.//year="1959" and .//genre="Western"]
Competitors
DBMS-style Join&Sort
Using the TopX schema
StructIndex [Kaushik et al., SIGMOD 04]
Top-k with separate inverted indexes for content & structure
DataGuide-like structural index
Full evaluations, hence no uncertainty about final document scores
No candidate queuing, eager random accesses
StructIndex+
Extent chaining technique for DataGuide-based extent identifiers (skip scans)

23
INEX Results
24
IMDB Results
Method         k   epsilon  CPU   RA       SA
Join&Sort      10  n/a      37.7  0        14,510,077
StructIndex    10  n/a      0.16  291,655  346,697
StructIndex+   10  n/a      0.17  301,647  22,445
TopX MinProbe  10  0.0      0.08  72,196   317,380
TopX BenProbe  10  0.0      0.06  50,016   241,471
Remaining columns on the slide: P_at_k, MAP_at_k, relPrec (transcript values: n/a, 1.00)
25
INEX with Probabilistic Pruning
26
Conclusions & Ongoing Work

Efficient and versatile TopX query processor
Extensible framework for text, semi-structured & structured data
Probabilistic cost model for random access scheduling
Very good precision/runtime ratio for probabilistic candidate pruning
Scalability
Optimized for runtime, exploits cheap disk space (factor 4-5 for INEX)
Experiments on TREC Terabyte text collection (see paper)
Support for typical IR extensions
Phrase matching, mandatory terms "+", negation "-"
Query weights (e.g., relevance feedback, ontological similarities)
Dynamic and self-tuning query expansions [SIGIR 05]
Incrementally merges inverted lists on demand
Dynamically opens scans on additional expansion terms
Vague Content-And-Structure (VCAS) queries

27
Thank you!
