Topical Query Decomposition - PowerPoint PPT Presentation

Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis. Yahoo! Research Barcelona, Spain. KDD 08.

1
Topical Query Decomposition
Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis
Yahoo! Research Barcelona, Spain
KDD 08
2
Abstract
  • Given a query and a document retrieval system, produce a small set
    of queries whose union of resulting documents corresponds
    approximately to that of the original query.
  • Formulation as a set cover problem: greedy algorithm
  • Formulation as a clustering problem: two-phase algorithm based on
    hierarchical agglomerative clustering (dynamic programming)

3
Introduction
  • A query log L: a list of pairs <q, D(q)>
  • q: a query
  • D(q): its result, the set of documents that answer query q
  • Q(q): the maximal set of queries p_i such that each set D(p_i) has
    at least one document in common with the documents returned by q
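
As a minimal sketch of the definitions above, Q(q) can be computed from a query log held in memory. The dictionary `log`, the function name `candidate_queries`, the toy data, and the choice to leave q itself out of Q(q) are illustrative assumptions, not part of the original slides.

```python
def candidate_queries(log, q):
    """Return Q(q): every other query whose results share a document with D(q)."""
    target = log[q]
    return {p for p, docs in log.items() if p != q and docs & target}


# Toy query log (made-up data): query -> D(query), a set of document ids.
log = {
    "barcelona": {1, 2, 3, 4},
    "barcelona fc": {1, 2, 9},
    "barcelona hotels": {3, 4, 7},
    "madrid": {8, 9},
}

print(sorted(candidate_queries(log, "barcelona")))
```

"madrid" is left out because D("madrid") shares no document with D("barcelona").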

5
  • The goal is to compute a cover.
  • Select a subcollection C ⊆ Q(q7) such that it covers almost all of
    D(q7)

6
Problem Statement 1/3
  • Red-blue set cover problem
  • U = {b_1, …, b_n, r_1, …, r_m} (for a query q)
  • B = {b_1, …, b_n}: blue points (the documents in D(q))
  • R = {r_1, …, r_m}: red points (documents returned by the candidate
    queries but not by q)
  • S = {S_1, …, S_k} is obtained from the query log L
  • S_i ⊆ U
  • S_i^B: blue points in S_i (S_i^B = S_i ∩ B)
  • S_i^R: red points in S_i (S_i^R = S_i ∩ R)
  • Goal: find a subcollection C ⊆ S that covers many blue points of U
    without covering too many red points.
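
The construction above can be sketched in a few lines. This is an illustration under stated assumptions: the query log is a plain dictionary, blue points are taken to be the documents of D(q), red points are the documents that candidate queries return outside D(q), and the name `red_blue_instance` is hypothetical.

```python
def red_blue_instance(log, q):
    """Build (B, R, {p: (S_p^B, S_p^R)}) for query q from a query log."""
    blue = set(log[q])
    # Candidate sets: results of other queries that overlap D(q).
    sets = {p: set(docs) for p, docs in log.items() if p != q and docs & blue}
    # Red points: everything the candidates return outside D(q).
    red = set().union(*sets.values()) - blue
    split = {p: (s & blue, s & red) for p, s in sets.items()}
    return blue, red, split


# Toy query log (made-up data).
log = {
    "barcelona": {1, 2, 3, 4},
    "barcelona fc": {1, 2, 9},
    "barcelona hotels": {3, 4, 7},
    "madrid": {8, 9},
}

blue, red, split = red_blue_instance(log, "barcelona")
print(blue, red, split)
```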

7
Problem Statement 2/3
  • For each query q, the candidate queries Q(q)
  • For each set S_i with blue and red points, its
    weight is
  • Scatter sc(S_i) (coherence is the opposite of scatter)

8
Problem Statement 3/3
  • Our goal is to find a subcollection C ⊆ S that
    covers almost all the blue points of U and has
    large coherence.
  • More precisely, we want C to satisfy the
    following properties:
  • Cover-blue
  • Not-cover-red
  • Small-overlap
  • Coherence

9
Greedy Algorithm 1/2
  • At the i-th iteration, the algorithm selects the set S that
    minimizes s(S, V_B, V_R)
  • λ_C, λ_R, λ_O are parameters that weight the
    relative importance of the three terms.
  • V_B: blue points already covered by the sets selected in previous
    iterations
  • V_R: red points already covered by the sets selected in previous
    iterations
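
The exact form of s(S, V_B, V_R) is not preserved in this transcript. The sketch below therefore assumes one plausible form: a weighted sum of the set's scatter, its newly covered red points, and its overlap with already covered blue points, divided by the number of newly covered blue points. The function name, the `min_coverage` stopping rule, and the toy data are likewise assumptions.

```python
def greedy_cover(blue, sets, scatter, lam_c=1.0, lam_r=1.0, lam_o=1.0,
                 min_coverage=1.0):
    """Greedy sketch. sets: name -> (blue_points, red_points); scatter: name -> sc(S)."""
    covered_blue, covered_red, chosen = set(), set(), []
    remaining = dict(sets)
    while remaining and len(covered_blue) < min_coverage * len(blue):
        best, best_score = None, float("inf")
        for name, (s_blue, s_red) in remaining.items():
            gain = len(s_blue - covered_blue)
            if gain == 0:
                continue  # this set adds no new blue point
            # ASSUMED score: three weighted penalty terms per newly covered blue point.
            penalty = (lam_c * scatter[name]
                       + lam_r * len(s_red - covered_red)
                       + lam_o * len(s_blue & covered_blue))
            score = penalty / gain
            if score < best_score:
                best, best_score = name, score
        if best is None:
            break  # no remaining set improves coverage
        s_blue, s_red = remaining.pop(best)
        chosen.append(best)
        covered_blue |= s_blue
        covered_red |= s_red
    return chosen, covered_blue, covered_red


# Toy instance (made-up data).
blue = {1, 2, 3, 4, 5}
sets = {
    "a": ({1, 2, 3}, {10}),
    "b": ({3, 4, 5}, {11, 12}),
    "c": ({1, 2, 3, 4, 5}, {10, 11, 12, 13, 14, 15}),
}
scatter = {"a": 0.2, "b": 0.3, "c": 2.0}

chosen, covered_blue, covered_red = greedy_cover(blue, sets, scatter)
print(chosen, covered_blue, covered_red)
```

On this toy instance the single set "c" would cover every blue point by itself, but its many red points and high scatter make the two tighter sets cheaper under the assumed score.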

D. Peleg. Approximation algorithms for the
Label-Cover_MAX and Red-Blue Set Cover problems.
Journal of Discrete Algorithms, 2007
10
Greedy Algorithm 2/2
11
Integer Programming
  • SiS2.Sl < 10
  • Si < 1

12
Clustering-Based Method
  • Two-phase approach
  • First phase: all points in set B are clustered
    using a hierarchical agglomerative clustering
    algorithm (CLUTO toolkit), producing a dendrogram G.
  • Second phase: match the clusters of the
    hierarchy produced by the agglomerative algorithm
    with the sets of S.
  • The main idea is to match sets of S to clusters
    of G
  • Every node T ∈ G corresponds to a cluster
  • T_B denotes the set of points of B in that cluster

13
Clustering-Based Method
Dendrogram G
14
Clustering-Based Method: Dynamic Programming 1/2
  • Complete coverage
  • Each set S ∈ S is compared with each node T ∈ G
  • Matching score m(T, S)
  • m(T): the score of the best matching set in S
  • Optimal cost of covering the points of T_B with
    sets in S
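
The recurrence itself is missing from the transcript. A minimal sketch, assuming the usual bottom-up rule on a binary dendrogram (each node is either covered by its best matching set at cost m(T), or by combining the optimal covers of its two children), could look as follows. The `Node` class, the function name `best_cover`, and the scores are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    name: str
    match_score: float               # m(T): score of the best matching set in S
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def best_cover(node):
    """Return (cost, node names) of the cheapest complete cover below `node`."""
    if node.left is None or node.right is None:
        return node.match_score, [node.name]   # a leaf must be matched directly
    left_cost, left_nodes = best_cover(node.left)
    right_cost, right_nodes = best_cover(node.right)
    split_cost = left_cost + right_cost
    if node.match_score <= split_cost:
        return node.match_score, [node.name]   # one set for the whole cluster
    return split_cost, left_nodes + right_nodes


# Toy dendrogram with made-up matching scores.
a, b, c = Node("a", 1.0), Node("b", 1.0), Node("c", 0.5)
ab = Node("ab", 1.5, a, b)
root = Node("root", 4.0, ab, c)

cost, nodes = best_cover(root)
print(cost, nodes)
```

Each node is visited once, so the running time is linear in the size of the dendrogram once the scores m(T) are available.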

15
Clustering-Based Method: Dynamic Programming 2/2
  • Partial coverage
  • λ_U weights the relative importance of the
    two terms: the scatter cost of the sets S and the
    number of uncovered points.
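
The recurrence on this slide did not survive in the transcript. One plausible form, stated here purely as an assumption, lets every node T of the dendrogram choose among three options: cover T_B with its best matching set, combine the solutions of its two children, or leave the points of T_B uncovered at a price of λ_U per point. The symbols C(T), T_l, and T_r are introduced for illustration only.

```latex
C(T) = \min\bigl\{\, m(T),\; C(T_l) + C(T_r),\; \lambda_U \,\lvert T_B \rvert \,\bigr\}
```

For a leaf the middle term is dropped. Under this assumed form, a very large λ_U makes leaving points uncovered prohibitively expensive, which recovers the complete-coverage case.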

16
Application
  • Query log L: 2.9 million distinct queries
  • The majority of users look only at the first page
    of results, while few users request more result
    pages.
  • D(q): take the last result page to which any user asking for q in
    the query log navigated, and consider the set of result
    documents for the query up to that page
  • 24 million distinct documents seen by the users

17
Application - Candidate queries for the cover
  • For each query q, the candidate queries Q_k(q)

18
Application - Results
  • A set of 100 queries was picked at random from
    the top 10,000 queries submitted by users.
  • Cost of k queries
  • The number of documents included outside the set
    D(q)
  • Average number of queries covering each element
  • Coverage after the top k candidates have been
    picked
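
The measures listed above can be illustrated with a short sketch. The precise definitions used in the evaluation are not given in this transcript, so the ones below are assumptions: coverage is the fraction of D(q) covered, the overlap is averaged over the covered documents of D(q) only, and the name `cover_metrics` is hypothetical.

```python
from collections import Counter


def cover_metrics(blue, chosen):
    """blue: D(q); chosen: result sets of the queries picked so far."""
    counts = Counter(doc for s in chosen for doc in s)
    covered_blue = blue & set(counts)
    outside = set(counts) - blue
    avg_overlap = (sum(counts[d] for d in covered_blue) / len(covered_blue)
                   if covered_blue else 0.0)
    return {
        "coverage": len(covered_blue) / len(blue),
        "outside_docs": len(outside),
        "avg_queries_per_doc": avg_overlap,
    }


# Toy data (made-up): D(q) and the result sets of two picked queries.
blue = {1, 2, 3, 4, 5}
chosen = [{1, 2, 3, 10}, {3, 4, 11, 12}]

metrics = cover_metrics(blue, chosen)
print(metrics)
```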

22
Conclusions
  • A novel problem
  • Topical query decomposition
  • Elegant solutions
  • red-blue metric set cover
  • clustering with predefined clusters
    (hierarchical agglomerative clustering)
  • The set-cover formulation provides solutions of
    better quality
  • Code and data for reproducing the results shown
    in Table 3 are available at
  • http://www.yr-bcn.es/querydecomp/