Efficient Computation of Diverse Query Results

1
Efficient Computation of Diverse Query Results
Erik Vee
Joint work with Utkarsh Srivastava, Jayavel
Shanmugasundaram, Prashant Bhat, Sihem Amer-Yahia
Talk modified for CS 632 by S. Sudarshan
4
Motivation
  • Imagine looking for shoes on Yahoo! Shopping, and
    seeing only Reeboks
  • or looking for cars on Yahoo! Autos, and seeing
    only Hondas
  • or looking for jobs on Yahoo! HotJobs, and
    seeing only jobs from Yahoo!
  • It is not enough to simply give the best response
  • Need diversity of answers

5
Diversity Search
  • If we display 30 results in 5 categories, then we
    should show 6 items from each category
      • NB: our goal is to show a range of choices, not
        a representative sample
      • Recurse on each subgroup of items
  • Diversity is crucial for users looking for a range
    of results
      • E.g. shopping, information gathering/research
  • Useful for aiding navigation
      • Users tend to favor search-and-click over
        hierarchies
  • Likely to give at least one good answer on the
    first page
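
The 30-results-in-5-categories arithmetic can be sketched as a small helper. This is only an illustration, not the paper's algorithm: the name `even_quota` is hypothetical, and the sketch assumes every category has enough matching items to fill its share.

```python
def even_quota(k, num_categories):
    """Split k result slots across sibling categories as evenly as possible.

    Assumption (illustration only): every category has enough matching items.
    """
    base, extra = divmod(k, num_categories)
    # The first `extra` categories absorb the remainder, one slot each.
    return [base + 1 if i < extra else base for i in range(num_categories)]


print(even_quota(30, 5))  # [6, 6, 6, 6, 6]
print(even_quota(4, 3))   # [2, 1, 1]
```

When k does not divide evenly, which categories receive the extra slot is an arbitrary choice here; the slides only require the counts to be as close as possible.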

6
Contributions
  • Formally define diversity search
      • Other diversity-like approaches use extensive
        post-processing or are not query-dependent
  • Proved that traditional IR engines cannot produce
    guaranteed diverse results
  • Gave novel algorithms to produce diverse results
      • Both one-pass (data streaming) and probing
        algorithms
  • Experimentally verified that these algorithms are
    nearly as fast as normal top-k processing
      • Much faster than post-processing techniques

7
What about other approaches?
  • If not diverse enough, query again
      • E.g. if all results are from one company, issue
        another query
      • Bad for latency
  • Issue multiple queries (one for Honda, one for
    Toyota, ...)
      • Can be prohibitively expensive (kills
        throughput), although latency is fine
      • Some applications may have dozens of top-level
        categories
  • Fetch extra results, then find the most diverse
    set among them
      • Not guaranteed to get good results
      • Requires fetching additional results
        unnecessarily
  • Fetch all results, then find a diverse set
      • Many times slower
  • Random sample of results
      • Misses important results

8
What about clever scoring?
  • Can we give each item a global diversity
    score, then find the top-k using this?
      • Proved in paper: there is no global score that
        gives guaranteed diversity
  • Can we give each item a local diversity score,
    so that it has a different score in each list of
    the inverted index?
      • Proved in paper: there is no list-based scoring
        of the items that gives guaranteed diversity

9
Outline
  • Definition of diversity
  • Overview of our algorithms
  • Our experimental results

10
Diversity search
  • Over all possible sets of top-k results that
    match the query, return the set with the most
    diversity
      • Paper defines diversity more precisely
      • Focus on hierarchy view of diversity (in next
        slides)
  • For scored diversity (in which each item has a
    score)
      • Over all possible sets of top-k results with
        maximum score, return the set with the highest
        diversity
      • Note: diversity is only useful when scores are
        not too fine-grained

11
Diversity definition (by picture)
Determine a category ordering: Make, Model, Color,
Year, Text
Implicitly defines hierarchy
13
Hierarchy after a query
All siblings return the same number of
results (or as close as possible)
Diversity search always returns valid results
E.g. query: text contains "Low"
14
Returning top-k diverse results
Diversity search always returns valid results
E.g. query: text contains "Low"
Suppose we return k = 4 results
Must return 2 Hondas and 2 Toyotas
Will not return 2 green Civics
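
The k = 4 example can be reproduced with a naive, non-optimized sketch that applies the "siblings share results as evenly as possible" rule recursively. This is not the paper's one-pass or probing algorithm. The function names are hypothetical, items are represented as tuples of branch numbers (one per hierarchy level), and the list `LOW` is assumed to be the set of items matching "Low", with branch 0 standing for Honda and branch 1 for Toyota.

```python
from collections import defaultdict


def split_quota(k, sizes):
    """Hand out k slots round-robin, never exceeding a sibling's size."""
    quota = [0] * len(sizes)
    remaining = min(k, sum(sizes))
    while remaining > 0:
        for i, size in enumerate(sizes):
            if remaining > 0 and quota[i] < size:
                quota[i] += 1
                remaining -= 1
    return quota


def diverse_subset(ids, k, level=0):
    """Pick k items so that siblings at every level share k evenly."""
    if k <= 0:
        return []
    if k >= len(ids):
        return sorted(ids)
    groups = defaultdict(list)
    for item in ids:
        groups[item[level]].append(item)
    branches = sorted(groups)
    quota = split_quota(k, [len(groups[b]) for b in branches])
    picked = []
    for branch, share in zip(branches, quota):
        picked.extend(diverse_subset(groups[branch], share, level + 1))
    return picked


# Assumed matches for "Low": six Hondas (branch 0), four Toyotas (branch 1).
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]

result = diverse_subset(LOW, 4)
print(result)
# [(0, 0, 0, 0, 0), (0, 0, 1, 0, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0)]
```

The result has two Hondas and two Toyotas, and the two Hondas differ in color. Which sibling wins a tie is arbitrary in this sketch (lowest branch number first); any equally balanced choice would be just as diverse.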
15
Outline
  • Definition of diversity
  • Overview of our algorithms
  • Our experimental results

16
Algorithms
  • One Pass
      • Never goes backward (just one pass over dataset)
      • Maintains a top-k diverse set based on what has
        been seen
      • Jumps ahead if more results will not help
        diversity
      • Optimal one-pass algorithm
  • Probe
      • May jump forward or backward (i.e. probes)
      • Proved: at most 2k probes for a top-k diverse
        result set
  • Both also work for scored diversity

17
Dewey IDs
Every branch gets a number
Every item is then labeled, e.g. 0.2.0.1.0 is Honda
Odyssey Green 06 Good miles
Create inverted index
low → 0.0.0.0.0, 0.0.0.1.0, 0.0.1.0.0, 0.0.2.0.0,
0.0.3.0.0, 0.0.3.1.0, 1.0.0.0.0, 1.1.0.0.0,
1.2.0.0.0, 1.3.0.0.0
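
A minimal sketch of how such labels and the inverted index could be built. Everything here is an assumption for illustration: the function names are hypothetical, the four listings are made up, and branches are numbered in order of first appearance, so the resulting IDs will not match the figure on the slide.

```python
from collections import defaultdict


def assign_dewey_ids(items):
    """Label each item with one branch number per hierarchy level.

    `items` are tuples of attribute values in category order
    (make, model, color, year, text).
    """
    child_numbers = {}  # parent path -> {child value: branch number}
    ids = {}
    for item in items:
        dewey = []
        for depth, value in enumerate(item):
            siblings = child_numbers.setdefault(item[:depth], {})
            # A new child gets the next unused number under its parent.
            dewey.append(siblings.setdefault(value, len(siblings)))
        ids[item] = tuple(dewey)
    return ids


def build_inverted_index(ids):
    """Map each word of the text attribute to a sorted list of Dewey IDs."""
    index = defaultdict(list)
    for item, dewey in ids.items():
        for word in set(item[-1].lower().split()):
            index[word].append(dewey)
    for postings in index.values():
        postings.sort()
    return index


# Hypothetical listings, not taken from the slides.
listings = [
    ("Honda", "Civic", "Green", "2007", "Low miles"),
    ("Honda", "Civic", "Green", "2006", "Low price"),
    ("Honda", "Odyssey", "Green", "2006", "Good miles"),
    ("Toyota", "Prius", "Blue", "2007", "Low miles"),
]
ids = assign_dewey_ids(listings)
index = build_inverted_index(ids)
print(index["low"])
# [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (1, 0, 0, 0, 0)]
```

Because tuples compare lexicographically, sorting the postings puts them in Dewey ID order, which is what the Next and Prev operations rely on.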
18
Next and Prev
Supports two basic operations: Next and Prev
E.g. query: text contains "Low"
Next(0.0.3.2.2) = 1.0.0.0.0
Prev(2.0.0.0.0) = 1.3.0.0.0
Inverted index for "Low" lists all items in Dewey
ID order
In general, must find intersection of lists
(still easy)
low → 0.0.0.0.0, 0.0.0.1.0, 0.0.1.0.0, 0.0.2.0.0,
0.0.3.0.0, 0.0.3.1.0, 1.0.0.0.0, 1.1.0.0.0,
1.2.0.0.0, 1.3.0.0.0
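
The two operations can be sketched with binary search over a sorted posting list. This is an assumption-laden illustration rather than the authors' implementation: `next_id` and `prev_id` are hypothetical names, both are taken to be inclusive (an ID that is present is returned for itself), and only a single posting list is handled, not an intersection.

```python
from bisect import bisect_left, bisect_right


def parse(dewey):
    """Turn '0.0.3.1.0' into the tuple (0, 0, 3, 1, 0)."""
    return tuple(int(part) for part in dewey.split("."))


# Posting list for "Low", in Dewey ID order.
LOW = [parse(s) for s in (
    "0.0.0.0.0 0.0.0.1.0 0.0.1.0.0 0.0.2.0.0 0.0.3.0.0 "
    "0.0.3.1.0 1.0.0.0.0 1.1.0.0.0 1.2.0.0.0 1.3.0.0.0").split()]


def next_id(postings, dewey):
    """Smallest posted ID >= dewey, or None if there is none."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None


def prev_id(postings, dewey):
    """Largest posted ID <= dewey, or None if there is none."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None


print(next_id(LOW, parse("0.0.3.2.2")))  # (1, 0, 0, 0, 0)
print(prev_id(LOW, parse("2.0.0.0.0")))  # (1, 3, 0, 0, 0)
```

Each call costs O(log n) comparisons on one list, which is why jumping ahead is cheap compared with scanning every matching item.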
21
One pass (for k = 2)
First finds 0.0.0.0.0, 0.0.0.1.0
Now knows Civic Green no longer helps!
Jumps by calling next(0.0.1.0.0)
Finds 0.0.1.0.0; removes 0.0.0.1.0
Now knows Civic no longer helps!
Jumps by calling next(0.1.0.0.0)
Finds 0.1.0.0.0; removes 0.0.1.0.0
Knows to stop
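
The jump targets in this walk-through follow a simple pattern: once a subtree can no longer help, skip to the first ID that could lie outside it. The helper below is a hypothetical sketch of that one step only, not the full one-pass algorithm. Here `depth` is assumed to be the number of leading components that identify the exhausted subtree (3 for Civic Green, 2 for Civic).

```python
def skip_target(dewey, depth):
    """First possible ID after the subtree rooted at dewey[:depth]."""
    prefix = dewey[:depth]
    # Bump the last component of the prefix, then pad with zeros.
    bumped = prefix[:-1] + (prefix[-1] + 1,)
    return bumped + (0,) * (len(dewey) - depth)


# Civic Green (prefix 0.0.0) no longer helps:
print(skip_target((0, 0, 0, 1, 0), 3))  # (0, 0, 1, 0, 0)
# Civic (prefix 0.0) no longer helps:
print(skip_target((0, 0, 1, 0, 0), 2))  # (0, 1, 0, 0, 0)
```

The value returned is the argument passed to next, not necessarily an item that exists; next then returns the first real item at or after it.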
22
Unscored One-Pass Algorithm
Remove 1st element in queue
Key step: deciding where to skip to
23
One-Pass Algorithm (Cont.)
  • Complexity k lnd(3k)
  • Scored one-pass algorithm: same as for the unscored
    case, except
      • Replace line 11 of the unscored one-pass
        algorithm with the line
        id = mergedList.next(id1, skipId, root,
        minScore)
      • This call returns the smallest id greater than
        or equal to id1 such that either
          • score(id) > root.minScore, or
          • score(id) ≥ root.minScore, and the returned
            id is greater than skipId
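
One way to read these semantics is sketched below as a linear scan. It is an interpretation, not the authors' code: the function name, the argument layout, and the `scores` dictionary are hypothetical. The sketch assumes the first condition means a strictly higher score than the current minimum, and the second means a score at least equal to the minimum combined with an id beyond skipId.

```python
def scored_next(postings, scores, start_id, skip_id, min_score):
    """Smallest id >= start_id that can still improve the result set.

    An id qualifies if it scores strictly above min_score, or if it
    scores at least min_score and lies beyond skip_id.
    """
    for dewey in postings:  # postings are sorted in Dewey ID order
        if dewey < start_id:
            continue
        score = scores[dewey]
        if score > min_score or (score >= min_score and dewey > skip_id):
            return dewey
    return None


# Hypothetical three-level IDs and scores.
postings = [(0, 0, 0), (0, 0, 1), (0, 1, 0), (1, 0, 0)]
scores = {(0, 0, 0): 5, (0, 0, 1): 3, (0, 1, 0): 3, (1, 0, 0): 4}

# Ties with the minimum score are skipped until we pass skip_id.
print(scored_next(postings, scores, (0, 0, 1), (0, 0, 5), 3))  # (0, 1, 0)
# A strictly better score is returned even before skip_id.
print(scored_next(postings, scores, (0, 0, 1), (9, 9, 9), 3))  # (1, 0, 0)
```

A real index would use score-aware skipping rather than a scan; the loop is only meant to make the two conditions concrete.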

24
Probe (for k = 4)
Calls next(0.0.0.0.0) and prev(∞.∞.∞.∞.∞) to
find first and last items
Discovers there are only 2 top-level categories
Wants another Honda
Calls prev(0.∞.∞.∞.∞)
Why not next(0.1.0.0.0)?
If Honda has only one child, then it will return a
Toyota!
Finds 0.0.3.1.0
Wants another Toyota
Calls next(1.0.0.0.0)
Finds 1.0.0.0.0
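
The probes in this example can be replayed against the "Low" posting list. This is a sketch under assumptions, not the paper's probing algorithm: `next_id` and `prev_id` are hypothetical inclusive lookups, branch 0 is taken to be Honda and branch 1 Toyota, and `math.inf` stands in for the "largest possible" component used when probing from the right.

```python
from bisect import bisect_left, bisect_right
from math import inf

# Posting list for "Low" (assumed: branch 0 = Honda, branch 1 = Toyota).
LOW = [(0, 0, 0, 0, 0), (0, 0, 0, 1, 0), (0, 0, 1, 0, 0), (0, 0, 2, 0, 0),
       (0, 0, 3, 0, 0), (0, 0, 3, 1, 0), (1, 0, 0, 0, 0), (1, 1, 0, 0, 0),
       (1, 2, 0, 0, 0), (1, 3, 0, 0, 0)]


def next_id(postings, dewey):
    """Smallest posted ID >= dewey, or None."""
    i = bisect_left(postings, dewey)
    return postings[i] if i < len(postings) else None


def prev_id(postings, dewey):
    """Largest posted ID <= dewey, or None."""
    i = bisect_right(postings, dewey)
    return postings[i - 1] if i > 0 else None


first = next_id(LOW, (0, 0, 0, 0, 0))          # (0, 0, 0, 0, 0), a Honda
last = prev_id(LOW, (inf, inf, inf, inf, inf))  # (1, 3, 0, 0, 0), a Toyota

# Second Honda: probe from the right edge of the Honda subtree.
second_honda = prev_id(LOW, (0, inf, inf, inf, inf))  # (0, 0, 3, 1, 0)

# The tempting alternative overshoots into Toyota, because every
# matching Honda here has model number 0.
overshoot = next_id(LOW, (0, 1, 0, 0, 0))       # (1, 0, 0, 0, 0)

# Second Toyota: probe from the left edge of the Toyota subtree.
second_toyota = next_id(LOW, (1, 0, 0, 0, 0))   # (1, 0, 0, 0, 0)

print(first, second_honda, second_toyota, last)
```

Probing from the opposite edge of a subtree guarantees that any item found is both inside the subtree and as far as possible from the item already held.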
28
Unscored Probing Algorithm
29
Unscored Probing (Cont.)
30
Unscored Probing (Cont.)
31
Unscored Probing (Cont.)
32
Unscored Probing
  • Invariant: whenever id ∈ node, either id belongs
    to some child of node in our data structure, or
    node.edgeLEFT < id < node.edgeRIGHT
  • Invariant: let node be some node in our data
    structure, and suppose that during the execution
    of the algorithm we call node.getProbeId(),
    returning (probeId, dir). Then we have
    mergedList.next(probeId, dir) ∈ node.
  • Theorem 2: the unscored probing algorithm given
    in Algorithms 2 and 3 makes at most 2k calls to
    next.

33
Scored Probing (Cont.)
  • Let θ be the score of the lowest-scoring item in
    the top-k list returned. Diversity is only
    guaranteed among items whose score is θ.
  • The difficulty comes from not knowing the exact
    value of θ.

34
Scored Probing
35
Outline
  • Definition of diversity
  • Overview of our algorithms
  • Our experimental results

36
Results
  • Dataset consisted of listings from Yahoo! Autos
  • Queries were synthetic, to test various parameters
      • Selectivity, predicates, results
  • Preprocessing time for 100K listings < 5 min
  • Times shown are for 5K queries
  • 4 algorithms
      • Basic: no diversity
      • Naïve: fetch everything, post-process
      • OnePass: our algorithm; takes just one pass
        over the data
      • Probe: our algorithm; may make multiple probes
        into the data

37
Comparable time for diversity search
(Charts: unscored and scored)
Probe: within a factor of 2 of no diversity
Basic: no diversity
Naïve: many times slower
OnePass: close to Probe
MultiQuery (not shown): latency close to Basic,
but throughput many times worse
38
Results summary
  • Getting diverse results not too much slower than
    getting non-diverse results
  • Many times faster than naïve approaches
  • Multi-query approach has even worse throughput
    than naïve
  • But keeps latency low
  • How does this compare to getting extra results,
    then finding a diverse subset?
  • Getting 2k results instead of k is about twice as
    slow
  • Plus, does not guarantee diverse results

39
Conclusions
  • Can get guaranteed diversity, taking time close
    to normal top-k query
  • Almost as fast or faster than non-guaranteed
    results
  • Diversity at every level
  • Works even when items have scores
  • Needs a different algorithm than traditional IR
    engines
  • Proved this in paper (under standard notions)
  • Are there approximate notions that can use
    existing IR machinery?
