1
Distributed Query Sampling: A Quality-Conscious
Approach
  • SIGIR'06

2
Outline
  • Introduction
  • Basic Reference Model
  • Adaptive Architecture for Distributed
    Query-Sampling
  • Experiments

3
Introduction
  • Query-based sampling: a technique for collecting
    document samples from text DBs that are
    autonomous and offer only a limited query
    capability.
  • Three challenges:
  • 1. How to sample: coping with the different
    query interfaces of autonomous DBs
  • 2. When to sample: keeping samples up to date as
    the DBs change
  • 3. What to sample: given realistic network
    constraints and the size and growth rate of
    distributed text DBs

4
  • Distributed query-sampling problem:
  • Given a set of n distributed databases, each of
    which provides access primarily through a
    query-based mechanism, and given limited
    resources for sampling all n databases, can we
    identify an effective strategy for extracting
    high-quality samples from the n databases?
  • Adaptive distributed query-sampling framework:
  • Seed sampling phase
  • Quality-aware iterative sampling phase,
    dynamically scheduled based on size and quality
    parameters estimated during the previous phase

5
Basic Reference Model
  • U = {D1, D2, ..., Dn}: universe of n text DBs
  • D = {doc1, doc2, ..., doc|D|}: each DB is a set
    of documents
  • Vocabulary V of D: the set of terms in D
  • |V|: the number of unique terms
  • c(t,D): the count of documents in D that term t
    occurs in
  • f(t,D): the frequency of occurrence of term t
    across all the documents in D
  • Ds: a document sample, i.e. a set of documents
    drawn from D, with |Ds| << |D|
  • Vs, |Vs|, c(t,Ds), f(t,Ds): the corresponding
    statistics over the sample Ds
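  • A minimal Python sketch of these statistics; the
    helper name db_statistics and the toy documents
    are illustrative, not from the paper:

    from collections import Counter

    def db_statistics(docs):
        """Compute the reference-model statistics.

        docs: a DB modeled as a list of documents,
        each a list of term strings.
        """
        c = Counter()  # c(t,D): docs containing t
        f = Counter()  # f(t,D): total occurrences of t
        for doc in docs:
            f.update(doc)       # count every occurrence
            c.update(set(doc))  # count each doc once per term
        V = set(f)              # vocabulary V of D
        return V, c, f

    D = [["query", "sampling", "query"],
         ["distributed", "sampling"]]
    V, c, f = db_statistics(D)
    print(len(V), c["sampling"], f["query"])  # 3 2 2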

6
Query-Based Sampling
  • Generates estimates of a text DB by examining
    only a fraction of the total documents
  • Works by repeatedly sending one-term keyword
    queries to a text DB and extracting the response
    documents
  • Steps of query-based sampling from a database D
    (see the sketch below):
  • 1. Initialize a query dictionary Q.
  • 2. Select a one-term query q from Q.
  • 3. Issue the query q to the database D.
  • 4. Retrieve the top-m documents from D in
    response to q.
  • 5. (Optional) Update Q with the terms in the
    retrieved documents.
  • 6. Go to Step 2, until a stopping condition is
    met.
  • Critical factors for performance: the choice of
    Q, the query selection algorithm, and the
    stopping condition
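  • A Python sketch of this loop, assuming a
    hypothetical callable db(q, top_m) that stands in
    for the remote database and returns up to top_m
    matching documents (each a list of terms):

    import random

    def query_based_sample(db, seed_terms, top_m=4,
                           target_docs=300):
        Q = set(seed_terms)       # Step 1: initialize Q
        sample, seen = [], set()
        # Step 6: stop at a target size (or an empty Q)
        while len(sample) < target_docs and Q:
            q = random.choice(sorted(Q))  # Step 2
            Q.discard(q)  # avoid reissuing the same query
            for doc in db(q, top_m):      # Steps 3-4
                key = tuple(doc)
                if key not in seen:       # keep new docs only
                    seen.add(key)
                    sample.append(doc)
                    Q.update(doc)         # Step 5: grow Q
        return sample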

7
Sampling from Distributed Text DB
  • Problem: to determine what to sample from each
    of the n text databases under a given sampling
    resource constraint.
  • Goal: to identify the optimal allocation of the
    S sample documents to the n databases.
  • Naïve solution: uniform sampling
  • Uniformly allocate the S sample documents across
    the databases, i.e. sample an equal number S/n
    of documents from each DB.
  • Indifferent to the relative size of each DB and
    the relative quality of the document samples

8
Adaptive Architecture for Distributed
Query-Sampling
  • The proposed adaptive sampling framework
    dynamically determines the amount of sampling at
    each text database based on an analysis of the
    relative merits of each database sample during
    the sampling process.

9
General Procedure
  • Step 1: Seed Sampling
  • Collect an initial seed sample for each DB
  • Total sample documents: S; seed sample budget:
    Sseed; Sseed(Di) = Sseed/n (uniform allocation)
  • Step 2: Dynamic Sampling Allocation (3 schemes)
  • Analyze the seed samples → estimate size and
    quality parameters of the DBs → allocate the
    remaining sample documents Sdyn = S - Sseed
    dynamically over m iterations (Sdyn/m per
    iteration)
  • Number of sample documents allocated to database
    Di in iteration j: Sdyn(Di,j)
  • Step 3: Dynamic Sampling Execution
  • Total documents sampled from Di:
    Stot(Di) = Sseed(Di) + Σ(j=1..m) Sdyn(Di,j)
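  • A Python skeleton of the procedure; the
    db.sample() handles and the recommend() callback
    (standing in for one of the three schemes below)
    are hypothetical:

    def distributed_sampling(dbs, S, m, recommend,
                             seed_fraction=0.5):
        n = len(dbs)
        S_seed = int(S * seed_fraction)
        S_dyn = S - S_seed
        # Step 1: seed sampling, Sseed/n docs per DB
        samples = {db: db.sample(S_seed // n) for db in dbs}
        # Steps 2-3: m rounds of allocation + execution
        for _ in range(m):
            targets = recommend(samples)  # estimated s(Di)
            budget = S_dyn // m           # docs this round
            total = sum(targets.values()) or 1
            for db in dbs:
                share = round(budget * targets[db] / total)
                samples[db].extend(db.sample(share))
        return samples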

10
(No Transcript)
11
Quality-Conscious Sampling Schemes
  • Each scheme recommends for each Di a total
    number of documents to sample, denoted s(Di):
    the allocation that would be recommended if
    complete information about each database were
    available
  • We then discuss how to derive the dynamic
    sampling allocation Sdyn(Di,j) for each database
    from s(Di).

12
Scheme 1 Proportional Document Ratio (PD)
  • Ratio_PD: allocate sample documents to each DB
    in proportion to its estimated size, i.e. s(Di)
    proportional to |Di|
  • Since the DB size |D| is unknown, it must be
    estimated.
  • To estimate DB size:
  • 1. Capture-recapture algorithm
  • 2. Sample-resample algorithm (sketched below)
  • Assumption: the DB indicates the total number of
    documents in which a query term occurs
  • Sample phase: collect a document sample
  • Resample phase: issue a handful of additional
    queries and collect c(t,D)
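  • A sketch of the sample-resample estimate. It
    assumes the DB reports the hit count c(t,D) for
    a probe term; since c(t,Ds)/|Ds| estimates
    c(t,D)/|D|, we get |D| ≈ c(t,D)·|Ds|/c(t,Ds):

    def sample_resample_estimate(sample, probe_terms,
                                 db_doc_count):
        """sample: list of sampled docs (sets of terms);
        db_doc_count: callable t -> c(t,D), the hit
        count reported by the DB for probe term t."""
        estimates = []
        for t in probe_terms:
            c_s = sum(1 for d in sample if t in d)  # c(t,Ds)
            if c_s:
                estimates.append(
                    db_doc_count(t) * len(sample) / c_s)
        # average the per-term estimates of |D|
        return (sum(estimates) / len(estimates)
                if estimates else None)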

13
Scheme 2 Proportional Vocabulary Ratio (PV)
  • The PV scheme is intuitively more closely linked
    to the quality of the DB samples: allocate in
    proportion to each DB's estimated vocabulary
    size |V|
  • Heaps' Law: a text of n words will have a
    vocabulary size |V| = K·n^β, where n refers to
    the total frequency of all terms in D and K, β
    are constants fit empirically (see the sketch
    below)
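  • A sketch of fitting K and β from two snapshots
    of a growing sample and extrapolating to the
    full DB; all numbers are made up:

    import math

    def fit_heaps(n1, v1, n2, v2):
        """Fit |V| = K * n**beta from two
        (total term count, vocabulary size) points."""
        beta = math.log(v2 / v1) / math.log(n2 / n1)
        K = v1 / n1 ** beta
        return K, beta

    # two snapshots of the sample, then extrapolate
    # to the DB's estimated total term count
    K, beta = fit_heaps(1_000, 300, 10_000, 1_400)
    print(K * 1_000_000 ** beta)  # est. full vocabulary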

14
Scheme 3 Vocabulary Growth (VG)
  • Goal: to extract the most vocabulary terms from
    across the space of distributed text DBs
  • Let x denote the number of documents sampled
    from a DB so far
  • The expected new vocabulary contributed by the
    x-th document is the marginal gain
    Δ(V(x)) = V(x) - V(x-1) under the DB's fitted
    vocabulary-growth curve
  • With the formula above, we can find out which DB
    can provide the most new vocabulary at any time.
  • Select the top-S documents from across all the
    DBs as scored by Δ(V(x)) (a greedy selection;
    see the sketch below)
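  • A greedy sketch of VG, treating x as a document
    count and scoring each DB's next document by
    Δ(V(x)) = K·x^β - K·(x-1)^β from its fitted
    Heaps curve; the per-DB constants are invented:

    import heapq

    def vg_allocation(curves, S):
        """curves: DB name -> (K, beta) fitted per DB."""
        def gain(K, b, x):  # marginal vocabulary of doc x
            return K * x ** b - K * (x - 1) ** b

        alloc = {name: 0 for name in curves}
        # max-heap on the gain of each DB's next document
        heap = [(-gain(K, b, 1), name)
                for name, (K, b) in curves.items()]
        heapq.heapify(heap)
        for _ in range(S):  # pick the top-S documents
            _, name = heapq.heappop(heap)
            alloc[name] += 1
            K, b = curves[name]
            heapq.heappush(
                heap, (-gain(K, b, alloc[name] + 1), name))
        return alloc

    print(vg_allocation({"A": (8.0, 0.55),
                         "B": (3.0, 0.7)}, S=100))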

15
Dynamic Sampling Allocation
  • To determine the number of documents to sample
    from each DB in iteration k: Sdyn(Di,k)
  • Case 1: no DB has yet been sampled beyond its
    recommendation s(Di)
  • Case 2: some DB is already oversampled
  • (a) drop the oversampled DB
  • (b) redistribute the remaining docs
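  • The Case 1/Case 2 formulas were not transcribed,
    so the sketch below is only one plausible
    reading: split each iteration's budget by each
    DB's remaining need, dropping any oversampled DB
    and redistributing its share:

    def iteration_allocation(targets, sampled, budget):
        """targets: recommended totals s(Di) per DB;
        sampled: docs already drawn per DB;
        budget: the Sdyn/m docs for this iteration."""
        remaining = {d: targets[d] - sampled[d]
                     for d in targets}
        # Case 2(a): drop DBs with no remaining need
        active = {d: r for d, r in remaining.items()
                  if r > 0}
        total = sum(active.values())
        if total == 0:
            return {d: 0 for d in targets}
        # Cases 1 and 2(b): share budget by remaining
        # need (rounding can be off by a doc or two)
        return {d: round(budget * active.get(d, 0) / total)
                for d in targets}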

16
Experiments
  • Datasets: TREC123, TREC4, and the TREC123-A/B/C
    partitions

17
Estimation Error
  • The seed sample is used to estimate the document
    count and vocabulary size of each DB

18
About E_D and E_V
  • About E_D (the document-count estimation error):
  • Sample size: 50 to 500
  • For TREC123: 14 to 18
  • For TREC4: 13 to 18
  • For large DBs: 20 to 30
  • About E_V (the vocabulary-size estimation
    error):
  • Encouraging: the error falls quickly as the
    sample grows

19
Database Sample Quality
  • PD, PV, VG vs. U (uniform)
  • S = 300n, where n is the number of DBs in the
    dataset (TREC123: 100; TREC4: 100; TREC123-A:
    62; TREC123-B: 83; TREC123-C: 62)
  • For U, sample 300 docs from each DB
  • For the other schemes, allocate half (150 per
    DB) as the seed sample, and allocate the rest
    dynamically
  • Results are averaged over 5 repetitions of each
    tested scheme

20
Sample Quality Metrics
  • Weighted common terms
  • Term ranking: compare term ranks by c(t,D) and
    c(t,Ds) with the Spearman correlation
    coefficient: 1 (identical), 0 (uncorrelated),
    -1 (reversed)
  • Distributional similarity: Jensen-Shannon
    divergence, in [0,1]
  • The lower the JS-divergence, the more similar
    the two distributions (both metrics are sketched
    below)
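  • A sketch of the two computable metrics, assuming
    numpy and scipy are available; the JS divergence
    is written with base-2 logs so it stays in
    [0,1]:

    import numpy as np
    from scipy.stats import spearmanr

    def js_divergence(p, q, eps=1e-12):
        """p, q: term distributions over the union
        vocabulary of D and Ds (lower = more similar)."""
        p = np.asarray(p, float) + eps
        q = np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # rank terms by c(t,D) vs. c(t,Ds)
    rho, _ = spearmanr([5, 3, 2, 1], [4, 3, 2, 2])
    print(rho, js_divergence([0.5, 0.3, 0.2],
                             [0.4, 0.4, 0.2]))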

21
Sample Quality Results
  • Under strong constraints, PD and PV always
    outperform U over all 5 datasets and all 3
    quality metrics
  • VG underperforms U significantly in all cases,
    but extracts 1.5 to 3 times as much vocabulary
    as the other approaches
  • Reason: since it focuses solely on sampling from
    the databases that are most efficient in terms
    of vocabulary production, VG tends to allocate
    all of the sampling documents to a few small
    databases, each with a fairly large vocabulary,
    ignoring the large ones with slower growth rates

22
Analysis of Some Factors
  • Total budget: S = 100n / 300n / 500n
  • As S increases, the results improve
  • Seed ratio: PD(50,250), PD(150,150), PD(250,50),
    i.e. (seed docs, dynamic docs) per DB
  • a) The PD and PV schemes result in higher
    quality samples in all cases
  • b) As Sseed increases, the advantage of the
    quality-conscious schemes over the uniform
    approach diminishes only slightly
  • Iterations: more iterations slightly improve the
    results

23
Thank you!