Mining search engine query logs via suggestion sampling

1
Mining search engine query logs via suggestion sampling

Ziv Bar-Yossef, Maxim Gurevich
VLDB (2008)

2
Suggestion Sampling - Background
[Diagram: the prefix "mad" is typed into the public suggestion interface; the private suggestion database returns "madoff", "mad men", etc. Behind it sits the private query log of search keywords (Google Query Log).]
Goal: infer information about the private query log and other aspects of the search engine from the public suggestion interface

3
Motivation

Keyword popularity estimation
Optimize online advertising budgets
Search engine evaluation
Coverage of web (size of index)
Quality of the results
freshness
negative content
User behavior studies

Paper's Focus

4
Suggestion Mining as Tree Mining

TRIE data structure
All strings of length at most lmax over the alphabet
Marked nodes represent suggestions
Marked nodes have a score based on popularity
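
A minimal sketch of the trie described on this slide, in Python. The names TrieNode, SuggestionTrie, mark, and suggest are illustrative and do not come from the paper; the scores are made-up numbers. A node is "marked" when it carries a score, and suggest returns the k highest-scored marked nodes under a prefix, the way a suggestion interface would.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    score: Optional[float] = None  # None means the node is unmarked

    @property
    def marked(self) -> bool:
        return self.score is not None


class SuggestionTrie:
    def __init__(self) -> None:
        self.root = TrieNode()

    def mark(self, query: str, score: float) -> None:
        """Walk or create the path for `query` and mark its final node."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score

    def suggest(self, prefix: str, k: int = 10) -> List[str]:
        """Return up to k marked descendants of `prefix`, highest score first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []  # no string in the trie starts with this prefix
        found: List[Tuple[float, str]] = []
        stack = [(prefix, node)]
        while stack:  # iterative depth-first walk of the subtree
            text, cur = stack.pop()
            if cur.marked:
                found.append((cur.score, text))
            for ch, child in cur.children.items():
                stack.append((text + ch, child))
        # Sort by descending score, breaking ties alphabetically.
        found.sort(key=lambda pair: (-pair[0], pair[1]))
        return [text for _, text in found[:k]]


trie = SuggestionTrie()
for q, s in [("madonna", 80), ("madoff", 50), ("mad men", 30), ("map", 10)]:
    trie.mark(q, s)
print(trie.suggest("mad", k=2))
```

A client of the public interface sees only the output of suggest; which nodes are marked, and their scores, stay hidden.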

5
Problem

No direct access to this tree
Unknowable target distribution (which nodes are marked?)
Solution
Local Suggestion Service
Monte Carlo Simulation
Trial distribution to model the target distribution

6
Local Suggestion Service

AOL Query Log
10M unique query strings (with popularity)
Removed non-ASCII query strings
Simple TRIE implementation
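
The preprocessing on this slide could look roughly like the sketch below. It assumes the log has already been reduced to an iterable of raw query strings; the function name build_popularity is hypothetical, and using the occurrence count as the popularity is an assumption, since the slide does not say how popularity was derived.

```python
from collections import Counter
from typing import Dict, Iterable


def build_popularity(raw_queries: Iterable[str]) -> Dict[str, int]:
    """Map each unique ASCII query string to the number of times it occurs."""
    counts: Counter = Counter()
    for raw in raw_queries:
        query = raw.strip()
        # Skip empty lines and any query containing non-ASCII characters.
        if not query or not query.isascii():
            continue
        counts[query] += 1
    return dict(counts)


popularity = build_popularity(["mad men", "mad men ", "café", "", "madoff"])
print(popularity)
```

The resulting query-to-popularity pairs are what would be loaded into the trie as marked nodes and scores.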

7
Monte Carlo Simulation

Idea: a sufficiently random sample can accurately describe the full set
Uniform distribution
Popularity distribution
Importance Sampling: estimate properties of one distribution (the target distribution) from samples drawn from another distribution (the trial distribution).
Bias: determine what bias this substitution introduces in the results (want to minimize)
Variance: determine how drastically Monte Carlo approximations differ between independent trials (want to minimize)
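
As a toy illustration of importance sampling (not the paper's sampler), the sketch below estimates the mean of a function under a target distribution known only up to a constant, using samples from a uniform trial distribution. The function name estimate_mean and the example weights are assumptions made for illustration.

```python
import random
from typing import Callable, Sequence


def estimate_mean(
    f: Callable[[int], float],
    unnormalized_target: Sequence[float],
    n: int,
    rng: random.Random,
) -> float:
    """Ratio importance-sampling estimate of E_target[f(X)] over items 0..m-1.

    Samples come from a uniform trial distribution p; each sample is
    reweighted by w(x) = target(x) / p(x), with the target unnormalized.
    """
    items = list(range(len(unnormalized_target)))
    p = 1.0 / len(items)  # uniform trial probability of every item
    numerator = 0.0
    denominator = 0.0
    for _ in range(n):
        x = rng.choice(items)
        w = unnormalized_target[x] / p
        numerator += f(x) * w
        denominator += w
    return numerator / denominator


weights = [1, 2, 3, 4, 5]  # target probability of item x is proportional to weights[x]
true_mean = sum(x * w for x, w in enumerate(weights)) / sum(weights)
estimate = estimate_mean(lambda x: float(x), weights, 20000, random.Random(0))
print(true_mean, estimate)
```

Re-running with different seeds gives a feel for the variance, and comparing the average of many runs with true_mean gives a feel for the bias.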

8
Mathematical Synopsis
Basic Monte Carlo Framework

NOTE: We can think of
target function: the algorithm that computes suggestions and returns the marked nodes
target distribution: the score of the marked nodes

Importance Sampling (X1, ..., Xn: random samples from trial distribution p)

NOTE:
p(Xi): probability of selecting Xi from trial distribution p
target weight w(Xi): think of the score of the marked nodes
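
The formulas on this slide did not survive in the transcript. For orientation, the textbook importance sampling identity is given below; the symbols \pi (target distribution), f (target function), p (trial distribution), and w (target weight) are assumed notation and may differ from the slide's.

```latex
% Valid when p(x) > 0 wherever \pi(x) f(x) \neq 0.
\mathbb{E}_{\pi}[f(X)]
  = \sum_{x} \pi(x)\, f(x)
  = \sum_{x} p(x)\, \frac{\pi(x)}{p(x)}\, f(x)
  = \mathbb{E}_{p}\bigl[f(X)\, w(X)\bigr],
\qquad
w(x) = \frac{\pi(x)}{p(x)}

% Estimator from samples X_1, \dots, X_n drawn from p:
\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} f(X_i)\, w(X_i)
```

When the target is known only up to a normalizing constant, a common variant divides by the sum of the weights instead of by n.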
9
Target Weight Computation

Uniform distribution: trivial
Popularity (score distribution)
Problem: no precise score information from an arbitrary suggestion server
Solution: use the relative order of returned suggestions as an approximation to scores (order implies popularity)

10
Score Distribution score(x)

Define the score rank of x (the fraction of nodes whose score is at least as high as that of x)
Use the power-law distribution from AOL empirical data to connect score rank and score(x).
Define score order and rewrite score rank in terms of score order.
Approximate score order by introducing a function, b, that is order-consistent with s.

a(x): highest exposing ancestor of x
position of x in a(x)
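
The score rank definition can be computed directly when all scores are known, as in a local suggestion service. The sketch below assumes the scores are held in a dictionary; the function name score_rank and the example values are illustrative only. Against a real service the scores are hidden, which is why the slide falls back on score order.

```python
from typing import Dict


def score_rank(scores: Dict[str, float], x: str) -> float:
    """Fraction of nodes whose score is at least as high as the score of x."""
    threshold = scores[x]
    at_least = sum(1 for s in scores.values() if s >= threshold)
    return at_least / len(scores)


scores = {"madonna": 80, "madoff": 50, "mad men": 30, "map": 10}
print(score_rank(scores, "madoff"))
```

The most popular node gets the smallest score rank and the least popular node gets a score rank of 1.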
11
Bias & Variance

Sources of Bias (Popularity)
Power Law approximation
Estimated score rank
Estimated score order
Empirical Variance (AOL dataset)
22.74
Good?

12
Volume Estimation

Present various models for estimating the size of the suggestion database
Naïve
Sample-based
Score-based
Combined

13
Results: Bias Estimation

Drew 10,000 samples and collected their scores
Uniform: practically unbiased
Popularity: minor bias towards medium-scored queries

Uniform sampler against local suggestion service
Bias becomes 0 after 400K queries.
At most 3% after 50K queries

14
Results: Real Services

Popularity trial distribution from AOL log
Converted to lowercase
Calculated target weights
2 real suggestion services
The paper does not name them explicitly
Collected suggestion data
SE1 (5.7M queries, 10,000 samples)
SE2 (2.1M queries, 3,500 samples)
Submitted suggestions to the search engine directly and collected the results

15
Suggestion Size & Dead Links

Suggestion Size
Surprising difference in absolute size
Score-induced coverage much closer
SE1 does a better job minimizing suggestion DB size while still covering relevant information
Dead Links
Both search engines do a good job keeping dead pages out of search results.
No big surprise here

16
Wikipedia Coverage

Wikipedia entries exist for a large number of suggestions (exact matches)
Wikipedia covers a fairly significant number of popular topics (exactly)

Almost 20% return a Wikipedia article as the first result
Over 60% return a Wikipedia result in the top 10
Implication: search engines rely on Wikipedia to deliver relevant information to users

17
Query Popularity