Title: Mining search engine query logs via suggestion sampling
1 Mining search engine query logs via suggestion sampling
- Ziv Bar-Yossef, Maxim Gurevich
- VLDB (2008)
2 Suggestion Sampling - Background
- Goal: Infer information about the private query log and other aspects of the search engine from the public suggestion interface
- [Diagram: public interface (typing "mad" returns "madoff", "mad men", etc.) -> Suggestion Database (private database) -> Google Query Log (private database of search keywords)]
3 Motivation
- Keyword popularity estimation
- Optimize online advertising budgets
- Search engine evaluation
- Coverage of web (size of index)
- Quality of the results
- freshness
- negative content
- User behavior studies
Paper's Focus
4 Suggestion Mining = Tree Mining
- TRIE data structure
- Nodes: all strings of length at most l_max over the alphabet
- Marked nodes represent suggestions
- Marked nodes have a score based on popularity
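The tree described above can be sketched as a small trie in which only some nodes are marked with a popularity score. This is a minimal illustration, not the search engines' actual implementation: the class names, the `suggest` method, the top-k cutoff, and the example scores are all hypothetical.

```python
class TrieNode:
    def __init__(self):
        self.children = {}   # next character -> child node
        self.score = None    # popularity score if the node is marked, else None


class SuggestionTrie:
    """Toy suggestion tree: marked nodes are the stored suggestions."""

    def __init__(self):
        self.root = TrieNode()

    def mark(self, query, score):
        """Insert `query` and mark its final node with a popularity score."""
        node = self.root
        for ch in query:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score

    def suggest(self, prefix, k=10):
        """Return up to k marked strings under `prefix`, highest score first."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        found = []
        stack = [(node, prefix)]
        while stack:
            cur, text = stack.pop()
            if cur.score is not None:
                found.append((text, cur.score))
            for ch, child in cur.children.items():
                stack.append((child, text + ch))
        # Sort by descending score; break ties alphabetically for determinism.
        found.sort(key=lambda item: (-item[1], item[0]))
        return [text for text, _ in found[:k]]


trie = SuggestionTrie()
for query, score in [("madoff", 50), ("mad men", 40), ("madonna", 30), ("map", 99)]:
    trie.mark(query, score)  # scores here are made-up illustration values
top_two = trie.suggest("mad", k=2)
```

An outside observer only sees the output of `suggest`, never the tree itself, which is what makes the sampling problem non-trivial.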
5 Problem
- No direct access to this tree
- Unknowable target distribution (which nodes are marked?)
Solution
- Local Suggestion Service
- Monte Carlo Simulation
- trial distribution to model target distribution
6 Local Suggestion Service
- AOL Query Log
- 10M unique query strings (with popularity)
- Removed non-ASCII query strings
- Simple TRIE implementation
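The preprocessing steps listed above (unique query strings with popularity, non-ASCII strings removed) might look roughly like the sketch below. The function name and the input format (one raw query string per entry) are assumptions for illustration; the actual AOL log layout is not described in these slides.

```python
from collections import Counter


def build_popularity_table(raw_queries):
    """Count how often each ASCII-only query occurs in a raw query list."""
    counts = Counter()
    for query in raw_queries:
        query = query.strip()
        # Skip empty lines and any query containing non-ASCII characters.
        if not query or not query.isascii():
            continue
        counts[query] += 1
    return counts


table = build_popularity_table(["mad men", "mad men", "café", "madoff", ""])
```

The resulting counts can then serve as the popularity scores of the marked nodes in a local trie.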
7 Monte Carlo Simulation
- Idea: A sufficiently random sampled subset can accurately describe the full set
- Uniform distribution
- Popularity distribution
- Importance Sampling: Estimate properties of one distribution (target distribution) based on properties of another distribution (trial distribution).
- Bias: Determine what bias this substitution introduced in the results (want to minimize)
- Variance: Determine how drastically Monte Carlo approximations differ between independent trials (want to minimize)
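To make the importance sampling idea concrete, here is a minimal sketch on a toy discrete space. It uses the textbook self-normalized (ratio) estimator, which is an assumption on my part rather than the exact estimator from the paper; the function and parameter names are hypothetical.

```python
import random


def importance_sampling_estimate(f, target_weight, trial_sample, trial_prob, n, rng):
    """Estimate the target-distribution mean of f using trial-distribution samples.

    target_weight(x) may be unnormalized; trial_prob(x) must be the true
    probability that trial_sample returns x.
    """
    numerator = 0.0
    denominator = 0.0
    for _ in range(n):
        x = trial_sample(rng)
        ratio = target_weight(x) / trial_prob(x)  # corrects for the trial distribution
        numerator += f(x) * ratio
        denominator += ratio
    return numerator / denominator


items = list(range(10))
estimate = importance_sampling_estimate(
    f=lambda x: x,
    target_weight=lambda x: x + 1,            # target favors larger items
    trial_sample=lambda r: r.choice(items),   # trial distribution is uniform
    trial_prob=lambda x: 1 / len(items),
    n=20000,
    rng=random.Random(0),
)
# The exact target mean here is 330 / 55 = 6.0; the estimate should be close to it.
```

Samples are drawn uniformly, yet the weighting recovers the mean under the non-uniform target, which is the substitution the slide refers to.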
8 Mathematical Synopsis
Basic Monte Carlo Framework
- NOTE: We can think of
- target function: algorithm to compute suggestions and return the marked nodes
- target distribution: the score of the marked nodes
Importance Sampling (X1, ..., Xn random samples from trial distribution p)
- NOTE:
- p(Xi): probability of selecting Xi from trial distribution p
- target weight w(Xi): think of the score of the marked nodes
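The formulas from this slide did not survive extraction. As a hedged reconstruction, the standard textbook forms are given below, assuming f is the target function, \pi the target distribution, w an unnormalized target weight with \pi(x) \propto w(x), and p the trial distribution. The paper's exact notation may differ.

```latex
% Basic Monte Carlo: average f over samples drawn from the target distribution
\mathbb{E}_{\pi}[f] = \sum_{x} f(x)\,\pi(x)
  \approx \frac{1}{n}\sum_{i=1}^{n} f(X_i), \qquad X_i \sim \pi

% Importance sampling (ratio form): samples come from the trial distribution p
\mathbb{E}_{\pi}[f]
  \approx \frac{\sum_{i=1}^{n} f(X_i)\, w(X_i)/p(X_i)}
               {\sum_{i=1}^{n} w(X_i)/p(X_i)}, \qquad X_i \sim p
```

The ratio form only needs w up to a constant factor, which matters here because the normalizing constant of the target distribution is unknown.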
9 Target Weight Computation
- Uniform distribution: Trivial
- Popularity (score distribution)
- Problem: No precise score information from arbitrary suggestion server
- Solution: Use relative order of returned suggestions as an approximation to scores (order implies popularity)
10 Score Distribution score(x)
- Define score rank (fraction of nodes whose score is at least as high as that of x)
- Use power law distribution from AOL empirical data to connect score rank and score(x).
- Define Score Order and rewrite score rank in terms of score order.
- Approx. Score Order by introducing a function, b, that is order-consistent with s.
- a(x): highest exposing ancestor of x
- position of x in a(x): position of x among the suggestions returned for a(x)
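The score rank definition above can be computed directly when scores are known, as they are for a local suggestion service. The sketch below covers only that definition; it does not attempt the power-law fit or the order-based approximation, and the function name and example scores are hypothetical.

```python
from bisect import bisect_left


def score_rank(scores):
    """Map each node to the fraction of nodes whose score is at least as high."""
    n = len(scores)
    ordered = sorted(scores.values())
    ranks = {}
    for node, score in scores.items():
        # bisect_left gives the count of strictly smaller scores,
        # so n minus that is the count of scores >= this one.
        at_least_as_high = n - bisect_left(ordered, score)
        ranks[node] = at_least_as_high / n
    return ranks


ranks = score_rank({"a": 10, "b": 5, "c": 5, "d": 1})
```

A low score rank means a popular query: the top-scored node has rank 1/n, while the lowest-scored node has rank 1.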
11 Bias & Variance
- Sources of bias (Popularity)
- Power Law approximation
- Estimated score rank
- Estimated score order
- Empirical Variance (AOL dataset)
- 22.74
- Good?
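One generic way to measure the bias and variance of any Monte Carlo estimator is to run it many times independently and compare the runs against a known true value. The sketch below does this for a toy estimator (the mean of simulated die rolls); the helper names are hypothetical and the numbers are unrelated to the AOL figure quoted above.

```python
import random
import statistics


def empirical_bias_and_variance(run_estimator, true_value, trials, rng):
    """Run an estimator `trials` times; return (mean error, sample variance)."""
    estimates = [run_estimator(rng) for _ in range(trials)]
    bias = statistics.fmean(estimates) - true_value
    variance = statistics.variance(estimates)
    return bias, variance


def sample_mean_of_die(rng, n=200):
    """Toy estimator: average of n fair die rolls (true mean is 3.5)."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n


bias, variance = empirical_bias_and_variance(
    sample_mean_of_die, true_value=3.5, trials=300, rng=random.Random(1)
)
```

This kind of check requires ground truth, which is why it is run against a local suggestion service rather than a live one.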
12 Volume Estimation
- Present various models for estimating the size of the suggestion database
- Naïve
- Sample-based
- Scored-based
- Combined
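The slides do not spell out the four estimators. As an illustration of what a sample-based size estimate can look like in general, the sketch below uses a standard importance-sampling identity: if x is drawn with known probability p(x), then the average of 1/p(x) over samples that fall in the hidden set estimates the set's size. This is an assumed, generic technique, not necessarily the paper's estimator, and all names are hypothetical.

```python
import random


def estimate_set_size(samples, in_set, trial_prob):
    """Estimate |S| from trial samples: mean of 1/p(x) over samples with x in S."""
    total = 0.0
    for x in samples:
        if in_set(x):
            total += 1.0 / trial_prob(x)
    return total / len(samples)


universe = list(range(1000))
weights = [1 + (i % 3) for i in universe]   # a deliberately non-uniform trial distribution
weight_sum = sum(weights)
rng = random.Random(0)
samples = rng.choices(universe, weights=weights, k=20000)

size_estimate = estimate_set_size(
    samples,
    in_set=lambda x: x % 7 == 0,             # hidden set: multiples of 7 (true size 143)
    trial_prob=lambda x: weights[x] / weight_sum,
)
```

The estimate should land near the true size of 143 even though the sampler never enumerates the hidden set.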
13 Results: Bias Estimation
- Drew 10,000 samples and collected their scores
- Uniform: practically unbiased
- Popularity: minor bias towards medium-scored queries
- Uniform sampler against local suggestion service
- Bias becomes 0% after 400K queries
- At most 3% after 50K queries
14 Results: Real Services
- Popularity trial distribution from AOL log
- Converted to lowercase
- Calculated target weights
- 2 real suggestion services
- The paper does not name them explicitly
- Collected suggestion data
- SE1 (5.7M queries, 10,000 samples)
- SE2 (2.1M queries, 3,500 samples)
- Submitted suggestions to the search engine
directly and collected the results
15 Suggestion Size & Dead Links
- Suggestion Size
- Surprising difference in absolute size
- Score-induced coverage much closer
- SE1 does a better job minimizing suggestion DB size while still covering relevant information
- Dead Links
- Both search engines do a good job keeping dead pages out of search results.
- No big surprise here
16 Wikipedia Coverage
- Wikipedia entries exist for a large number of suggestions (exact matches)
- Wikipedia covers a fairly significant number of popular topics (exactly)
- Almost 20% return a Wikipedia article as first result
- Over 60% return a Wikipedia result in the top 10
- Implication: Search engines rely on Wikipedia to deliver relevant information to users
17 Query Popularity
- Relative popularity of queries
- Helpful for advertisers
- NOTE: This type of service already exists via Google Trends - http://www.google.com/trends