1
Distributed Query Sampling: A Quality-Conscious
Approach
  • SIGIR'06

2
Outline
  • Introduction
  • Basic Reference Model
  • Adaptive Architecture for Distributed
    Query-Sampling
  • Experiments

3
Introduction
  • Query-based sampling: a technique for collecting
    document samples from text DBs that are
    autonomous and offer only a limited query
    capability.
  • Three challenges:
  • 1. How to sample: coping with the different
    query interfaces of autonomous DBs
  • 2. When to sample: keeping samples up to date as
    the DBs change
  • 3. What to sample: given realistic network
    constraints and the size and growth rate of
    distributed text DBs

4
  • Distributed query-sampling problem:
  • Given a set of n distributed databases, each of
    which provides access primarily through a
    query-based mechanism, and given limited
    resources for sampling all n databases, can we
    identify an effective strategy for extracting
    high-quality samples from the n databases?
  • Adaptive distributed query-sampling framework:
  • Seed sampling phase
  • Quality-aware iterative sampling phase,
    dynamically scheduled based on size and quality
    parameters estimated during the previous phase

5
Basic Reference Model
  • U = {D1, D2, ..., Dn}: universe of n text DBs
  • D = {doc1, doc2, ..., doc|D|}: each DB is a set
    of documents
  • Vocabulary V of D: the set of terms in D
  • |V|: the number of unique terms
  • c(t,D): the count of documents in D that term t
    occurs in
  • f(t,D): the frequency of occurrence of term t
    across all the documents in D
  • Ds: a document sample, i.e. a set of documents
    drawn from D, with |Ds| << |D|
  • Vs, |Vs|, c(t,Ds), f(t,Ds): the corresponding
    statistics over the sample Ds
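  • A minimal Python sketch of these statistics; the
    helper name db_statistics and the toy documents
    are illustrative, not from the paper:

    from collections import Counter

    def db_statistics(docs):
        """Compute the reference-model statistics.

        docs: a DB modeled as a list of documents,
        each a list of term strings.
        """
        c = Counter()  # c(t,D): docs containing t
        f = Counter()  # f(t,D): total occurrences of t
        for doc in docs:
            f.update(doc)       # count every occurrence
            c.update(set(doc))  # count each doc once per term
        V = set(f)              # vocabulary V of D
        return V, c, f

    D = [["query", "sampling", "query"],
         ["distributed", "sampling"]]
    V, c, f = db_statistics(D)
    print(len(V), c["sampling"], f["query"])  # 3 2 2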

6
Query-Based Sampling
  • Generates estimates of a text DB by examining
    only a fraction of the total documents
  • Works by repeatedly sending one-term keyword
    queries to a text DB and extracting the response
    documents
  • Steps of query-based sampling from a database D
    (see the sketch below):
  • 1. Initialize a query dictionary Q.
  • 2. Select a one-term query q from Q.
  • 3. Issue the query q to the database D.
  • 4. Retrieve the top-m documents from D in
    response to q.
  • 5. (Optional) Update Q with the terms in the
    retrieved documents.
  • 6. Go to Step 2, until a stopping condition is
    met.
  • Critical factors for performance: the choice of
    Q, the query selection algorithm, and the
    stopping condition
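  • A Python sketch of this loop, assuming a
    hypothetical callable db(q, top_m) that stands in
    for the remote database and returns up to top_m
    matching documents (each a list of terms):

    import random

    def query_based_sample(db, seed_terms, top_m=4,
                           target_docs=300):
        Q = set(seed_terms)       # Step 1: initialize Q
        sample, seen = [], set()
        # Step 6: stop at a target size (or an empty Q)
        while len(sample) < target_docs and Q:
            q = random.choice(sorted(Q))  # Step 2
            Q.discard(q)  # avoid reissuing the same query
            for doc in db(q, top_m):      # Steps 3-4
                key = tuple(doc)
                if key not in seen:       # keep new docs only
                    seen.add(key)
                    sample.append(doc)
                    Q.update(doc)         # Step 5: grow Q
        return sample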

7
Sampling from Distributed Text DB
  • Problem: to determine what to sample from each
    of the n text databases under a given sampling
    resource constraint.
  • Goal: to identify the optimal allocation of the
    S sample documents to the n databases.
  • Naïve solution: uniform sampling
  • Uniformly allocate the S sample documents across
    the databases, i.e. sample an equal number S/n
    of documents from each DB.
  • Indifferent to the relative size of each DB and
    the relative quality of the document samples

8
Adaptive Architecture for Distributed
Query-Sampling
  • The proposed adaptive sampling framework
    dynamically determines the amount of sampling at
    each text database based on an analysis of the
    relative merits of each database sample during
    the sampling process.

9
General Procedure
  • Step 1: Seed Sampling
  • Collect an initial seed sample for each DB
  • Total sample documents: S; seed sample budget:
    Sseed; Sseed(Di) = Sseed/n (uniform allocation)
  • Step 2: Dynamic Sampling Allocation (3 schemes)
  • Analyze the seed samples → estimate size and
    quality parameters of the DBs → allocate the
    remaining sample documents Sdyn = S - Sseed
    dynamically over m iterations (Sdyn/m per
    iteration)
  • Number of sample documents allocated to database
    Di in iteration j: Sdyn(Di,j)
  • Step 3: Dynamic Sampling Execution
  • Total documents sampled from Di:
    Stot(Di) = Sseed(Di) + Σ(j=1..m) Sdyn(Di,j)
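  • A Python skeleton of the procedure; the
    db.sample() handles and the recommend() callback
    (standing in for one of the three schemes below)
    are hypothetical:

    def distributed_sampling(dbs, S, m, recommend,
                             seed_fraction=0.5):
        n = len(dbs)
        S_seed = int(S * seed_fraction)
        S_dyn = S - S_seed
        # Step 1: seed sampling, Sseed/n docs per DB
        samples = {db: db.sample(S_seed // n) for db in dbs}
        # Steps 2-3: m rounds of allocation + execution
        for _ in range(m):
            targets = recommend(samples)  # estimated s(Di)
            budget = S_dyn // m           # docs this round
            total = sum(targets.values()) or 1
            for db in dbs:
                share = round(budget * targets[db] / total)
                samples[db].extend(db.sample(share))
        return samples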

10
(No Transcript)
11
Quality-Conscious Sampling Schemes
  • Each scheme recommends for each Di a total
    number of documents to sample, denoted s(Di):
    the allocation that would be recommended if
    complete information about each database were
    available
  • We then discuss how to derive the dynamic
    sampling allocation Sdyn(Di,j) for each database
    from s(Di).

12
Scheme 1 Proportional Document Ratio (PD)
  • Ratio_PD: allocate sample documents to each DB
    in proportion to its estimated size, i.e. s(Di)
    proportional to |Di|
  • Since the DB size |D| is unknown, it must be
    estimated.
  • To estimate DB size:
  • 1. Capture-recapture algorithm
  • 2. Sample-resample algorithm (sketched below)
  • Assumption: the DB indicates the total number of
    documents in which a query term occurs
  • Sample phase: collect a document sample
  • Resample phase: issue a handful of additional
    queries and collect c(t,D)
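  • A sketch of the sample-resample estimate. It
    assumes the DB reports the hit count c(t,D) for
    a probe term; since c(t,Ds)/|Ds| estimates
    c(t,D)/|D|, we get |D| ≈ c(t,D)·|Ds|/c(t,Ds):

    def sample_resample_estimate(sample, probe_terms,
                                 db_doc_count):
        """sample: list of sampled docs (sets of terms);
        db_doc_count: callable t -> c(t,D), the hit
        count reported by the DB for probe term t."""
        estimates = []
        for t in probe_terms:
            c_s = sum(1 for d in sample if t in d)  # c(t,Ds)
            if c_s:
                estimates.append(
                    db_doc_count(t) * len(sample) / c_s)
        # average the per-term estimates of |D|
        return (sum(estimates) / len(estimates)
                if estimates else None)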

13
Scheme 2 Proportional Vocabulary Ratio (PV)
  • The PV scheme is intuitively more closely linked
    to the quality of the DB samples: allocate in
    proportion to each DB's estimated vocabulary
    size |V|
  • Heaps' Law: a text of n words will have a
    vocabulary size |V| = K·n^β, where n refers to
    the total frequency of all terms in D and K, β
    are constants fit empirically (see the sketch
    below)
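  • A sketch of fitting K and β from two snapshots
    of a growing sample and extrapolating to the
    full DB; all numbers are made up:

    import math

    def fit_heaps(n1, v1, n2, v2):
        """Fit |V| = K * n**beta from two
        (total term count, vocabulary size) points."""
        beta = math.log(v2 / v1) / math.log(n2 / n1)
        K = v1 / n1 ** beta
        return K, beta

    # two snapshots of the sample, then extrapolate
    # to the DB's estimated total term count
    K, beta = fit_heaps(1_000, 300, 10_000, 1_400)
    print(K * 1_000_000 ** beta)  # est. full vocabulary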

14
Scheme 3 Vocabulary Growth (VG)
  • Goal: to extract the most vocabulary terms from
    across the space of distributed text DBs
  • Let x denote the number of documents sampled
    from a DB so far
  • The expected new vocabulary contributed by the
    x-th document is the marginal gain
    Δ(V(x)) = V(x) - V(x-1) under the DB's fitted
    vocabulary-growth curve
  • With the formula above, we can find out which DB
    can provide the most new vocabulary at any time.
  • Select the top-S documents from across all the
    DBs as scored by Δ(V(x)) (a greedy selection;
    see the sketch below)
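  • A greedy sketch of VG, treating x as a document
    count and scoring each DB's next document by
    Δ(V(x)) = K·x^β - K·(x-1)^β from its fitted
    Heaps curve; the per-DB constants are invented:

    import heapq

    def vg_allocation(curves, S):
        """curves: DB name -> (K, beta) fitted per DB."""
        def gain(K, b, x):  # marginal vocabulary of doc x
            return K * x ** b - K * (x - 1) ** b

        alloc = {name: 0 for name in curves}
        # max-heap on the gain of each DB's next document
        heap = [(-gain(K, b, 1), name)
                for name, (K, b) in curves.items()]
        heapq.heapify(heap)
        for _ in range(S):  # pick the top-S documents
            _, name = heapq.heappop(heap)
            alloc[name] += 1
            K, b = curves[name]
            heapq.heappush(
                heap, (-gain(K, b, alloc[name] + 1), name))
        return alloc

    print(vg_allocation({"A": (8.0, 0.55),
                         "B": (3.0, 0.7)}, S=100))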

15
Dynamic Sampling Allocation
  • To determine the number of documents to sample
    from each DB in iteration k: Sdyn(Di,k)
  • Case 1: no DB has yet been sampled beyond its
    recommendation s(Di)
  • Case 2: some DB is already oversampled
  • (a) drop the oversampled DB
  • (b) redistribute the remaining docs
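  • The Case 1/Case 2 formulas were not transcribed,
    so the sketch below is only one plausible
    reading: split each iteration's budget by each
    DB's remaining need, dropping any oversampled DB
    and redistributing its share:

    def iteration_allocation(targets, sampled, budget):
        """targets: recommended totals s(Di) per DB;
        sampled: docs already drawn per DB;
        budget: the Sdyn/m docs for this iteration."""
        remaining = {d: targets[d] - sampled[d]
                     for d in targets}
        # Case 2(a): drop DBs with no remaining need
        active = {d: r for d, r in remaining.items()
                  if r > 0}
        total = sum(active.values())
        if total == 0:
            return {d: 0 for d in targets}
        # Cases 1 and 2(b): share budget by remaining
        # need (rounding can be off by a doc or two)
        return {d: round(budget * active.get(d, 0) / total)
                for d in targets}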

16
Experiments
  • Datasets: TREC123, TREC4, and the TREC123-A/B/C
    partitions

17
Estimation Error
  • The seed sample is used to estimate the document
    count and vocabulary size of each DB

18
About E_D and E_V
  • About E_D (the document-count estimation error):
  • Sample size: 50 to 500
  • For TREC123: 14 to 18
  • For TREC4: 13 to 18
  • For large DBs: 20 to 30
  • About E_V (the vocabulary-size estimation
    error):
  • Encouraging: the error falls quickly as the
    sample grows

19
Database Sample Quality
  • PD, PV, VG vs. U (uniform)
  • S = 300n, where n is the number of DBs in the
    dataset (TREC123: 100; TREC4: 100; TREC123-A:
    62; TREC123-B: 83; TREC123-C: 62)
  • For U, sample 300 docs from each DB
  • For the other schemes, allocate half (150 per
    DB) as the seed sample, and allocate the rest
    dynamically
  • Results are averaged over 5 repetitions of each
    tested scheme

20
Sample Quality Metrics
  • Weighted common terms
  • Term ranking: compare term ranks by c(t,D) and
    c(t,Ds) with the Spearman correlation
    coefficient: 1 (identical), 0 (uncorrelated),
    -1 (reversed)
  • Distributional similarity: Jensen-Shannon
    divergence, in [0,1]
  • The lower the JS-divergence, the more similar
    the two distributions (both metrics are sketched
    below)
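  • A sketch of the two computable metrics, assuming
    numpy and scipy are available; the JS divergence
    is written with base-2 logs so it stays in
    [0,1]:

    import numpy as np
    from scipy.stats import spearmanr

    def js_divergence(p, q, eps=1e-12):
        """p, q: term distributions over the union
        vocabulary of D and Ds (lower = more similar)."""
        p = np.asarray(p, float) + eps
        q = np.asarray(q, float) + eps
        p, q = p / p.sum(), q / q.sum()
        m = 0.5 * (p + q)
        kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
        return 0.5 * kl(p, m) + 0.5 * kl(q, m)

    # rank terms by c(t,D) vs. c(t,Ds)
    rho, _ = spearmanr([5, 3, 2, 1], [4, 3, 2, 2])
    print(rho, js_divergence([0.5, 0.3, 0.2],
                             [0.4, 0.4, 0.2]))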

21
Sample Quality Results
  • Under strong constraints, PD and PV always
    outperform U over all 5 datasets and all 3
    quality metrics
  • VG underperforms U significantly in all cases,
    but extracts 1.5 to 3 times as much vocabulary
    as the other approaches
  • Reason: since it focuses solely on sampling from
    the databases that are most efficient in terms
    of vocabulary production, VG tends to allocate
    all of the sampling documents to a few small
    databases, each with a fairly large vocabulary,
    ignoring the large ones with slower growth rates

22
Analysis of Some Factors
  • Total budget: S = 100n / 300n / 500n
  • As S increases, the results improve
  • Seed ratio: PD(50,250), PD(150,150), PD(250,50),
    i.e. (seed docs, dynamic docs) per DB
  • a) The PD and PV schemes result in higher
    quality samples in all cases
  • b) As Sseed increases, the advantage of the
    quality-conscious schemes over the uniform
    approach diminishes only slightly
  • Iterations: more iterations slightly improve the
    results

23
Thank you!