1
Mining search engine query logs via suggestion sampling
  • Ziv Bar-Yossef, Maxim Gurevich
  • VLDB (2008)

2
Suggestion Sampling - Background
[Diagram: a user types the prefix "mad" into a public
suggestion interface, which queries the private suggestion
database and returns completions such as "madoff", "mad
men", etc. Analogously, Google's private query log sits
behind the public search-keyword interface.]
  • Goal: infer information about the private query log
    and other aspects of the search engine from the public
    suggestion interface
3
Motivation
  • Keyword popularity estimation (the paper's focus)
    • Optimize online advertising budgets
  • Search engine evaluation
    • Coverage of the web (size of the index)
    • Quality of the results
    • Freshness
    • Negative content
  • User behavior studies

4
Suggestion Mining as Tree Mining
  • TRIE data structure
  • Nodes: all strings of length at most lmax over a fixed
    alphabet
  • Marked nodes represent suggestions
  • Marked nodes have a score based on popularity (see the
    sketch below)
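
To make the structure concrete, here is a minimal Python
sketch of such a trie, assuming marked nodes simply store a
popularity score; the names (SuggestionTrie, mark) are
illustrative, not from the paper.

```python
# A minimal sketch of the trie abstraction: every node
# corresponds to a string of length <= lmax over the
# alphabet; "marked" nodes carry a popularity score and
# represent suggestions.
class TrieNode:
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.score = None   # non-None iff this node is marked

class SuggestionTrie:
    def __init__(self, lmax):
        self.root = TrieNode()
        self.lmax = lmax

    def mark(self, string, score):
        """Insert `string` as a marked node with its popularity score."""
        assert len(string) <= self.lmax
        node = self.root
        for ch in string:
            node = node.children.setdefault(ch, TrieNode())
        node.score = score

trie = SuggestionTrie(lmax=20)
trie.mark("mad men", 120)
trie.mark("madoff", 95)
```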

5
Problem
  • No direct access to this tree
  • Unknown target distribution (which nodes are
    marked?)
  • Solution
    • Local suggestion service
    • Monte Carlo simulation
    • Trial distribution to model the target distribution

6
Local Suggestion Service
  • AOL Query Log
  • 10M unique query strings (with popularity)
  • Removed non-ASCII query strings
  • Simple TRIE implementation (one possible version is
    sketched below)
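
A hypothetical sketch of such a service; for brevity it
uses a dict-based prefix index rather than an actual trie,
and the cut-offs (top 10 suggestions, prefixes up to lmax
characters) are illustrative assumptions.

```python
# Build a top-k suggestion service from an AOL-style query
# log of (query, popularity) pairs, keeping only ASCII
# strings as the authors do. A real trie would share
# prefixes instead of materializing them all.
from collections import defaultdict

def build_service(log, k=10, lmax=20):
    """Map every prefix (up to lmax chars) to its k top-scoring queries."""
    index = defaultdict(list)
    for query, count in log.items():
        if not query.isascii():       # drop non-ASCII query strings
            continue
        for i in range(1, min(len(query), lmax) + 1):
            index[query[:i]].append((count, query))
    return {prefix: [q for _, q in sorted(pairs, reverse=True)[:k]]
            for prefix, pairs in index.items()}

log = {"mad men": 120, "madoff": 95, "madrid": 40, "map": 300}
service = build_service(log)
print(service["mad"])   # ['mad men', 'madoff', 'madrid']
```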

7
Monte Carlo Simulation
  • Idea: a sufficiently random sample of a set can
    accurately describe the full set
  • Uniform distribution
  • Popularity distribution
  • Importance Sampling: estimate properties of one
    distribution (the target distribution) using samples
    drawn from another distribution (the trial
    distribution)
  • Bias: determine what bias this substitution
    introduces in the results (want to minimize)
  • Variance: determine how drastically Monte Carlo
    estimates differ between independent trials
    (want to minimize)

8
Mathematical Synopsis
Basic Monte Carlo Framework
  • Goal: estimate $E_\pi[f] = \sum_x \pi(x) f(x)$, where
    $f$ is the target function and $\pi$ is the target
    distribution
  • NOTE: we can think of
    • target function $f$: the algorithm that computes
      suggestions and returns the marked nodes
    • target distribution $\pi$: the (normalized) score of
      the marked nodes
Importance Sampling ($X_1, \ldots, X_n$ are random samples
from the trial distribution $p$)
  • Estimator: $E_\pi[f] \approx \frac{1}{n} \sum_{i=1}^{n}
    \frac{w(X_i)}{p(X_i)} f(X_i)$
  • NOTE:
    • $p(X_i)$: probability of selecting $X_i$ from the
      trial distribution $p$
    • target weight $w(X_i)$: think of it as the score of
      the marked node (the estimator is unbiased when
      $w = \pi$)
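
A self-contained sketch of this estimator on a toy
universe. The normalizer W of the target weights is
computed directly here for clarity, whereas the paper must
work with unnormalized weights; all names and numbers are
illustrative assumptions.

```python
# Importance sampling: estimate E_pi[f] from samples drawn
# from a trial distribution p, reweighting each sample by
# w(X_i) / p(X_i).
import random

universe = ["madoff", "mad men", "madrid", "map"]
w = {"madoff": 95, "mad men": 120, "madrid": 40, "map": 300}  # target weights
W = sum(w.values())               # normalizer (unknown in the real setting)
p = {x: 1 / len(universe) for x in universe}   # uniform trial distribution

def f(x):
    return len(x)                 # toy target function: query length

n = 100_000
samples = random.choices(universe, weights=[p[x] for x in universe], k=n)
estimate = sum(f(x) * (w[x] / W) / p[x] for x in samples) / n
exact = sum((w[x] / W) * f(x) for x in universe)
print(round(estimate, 3), round(exact, 3))   # the two should nearly agree
```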
9
Target Weight Computation
  • Uniform distribution: trivial
  • Popularity (score) distribution:
    • Problem: no precise score information is available
      from an arbitrary suggestion server
    • Solution: use the relative order of the returned
      suggestions as an approximation to scores (order
      implies popularity)

10
Score Distribution score(x)
  • Define score rank: the fraction of nodes whose score
    is at least as high as score(x)
  • Use a power law fit to the AOL empirical data to
    connect score rank and score(x)
  • Define score order and rewrite score rank in terms of
    score order
  • Approximate score order by introducing a function b
    that is order-consistent with the score s: b(x) is
    derived from the highest exposing ancestor a(x) of x
    and the position of x in a(x)'s suggestion list
    (a sketch follows)
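
A hedged sketch of the "highest exposing ancestor"
observable: the shortest prefix a(x) whose suggestion list
contains x, together with x's position in that list. The
paper combines these into the order-consistent function b
and then maps order to score through the fitted power law;
the `suggest` callback here is an assumed black box.

```python
def exposing_position(x, suggest):
    """Return (a(x), position of x in suggest(a(x))) for the shortest
    (highest) prefix of x that exposes x, or None if x is never exposed."""
    for i in range(1, len(x) + 1):
        prefix = x[:i]
        results = suggest(prefix)  # assumed: prefix -> ranked suggestion list
        if x in results:
            return prefix, results.index(x)
    return None

# Toy usage with any suggest() function, e.g. the service
# sketched earlier: suggest = lambda pre: service.get(pre, [])
```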
11
Bias Variance
  • Bias Introductions (Popularity)
  • Power Law approximation
  • Estimated score rank
  • Estimated score order
  • Empirical Variance (AOL dataset)
  • 22.74
  • Good?

12
Volume Estimation
  • Presents various models for estimating the size of the
    suggestion database (a generic sketch follows this
    list):
  • Naïve
  • Sample-based
  • Score-based
  • Combined
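
As one illustration of the sample-based flavor, here is a
generic importance-sampling (Horvitz-Thompson style) size
estimator: with samples X_i drawn from a known trial
distribution p over the trie's nodes, the number of marked
nodes is estimated by averaging [X_i is marked] / p(X_i).
This is a standard construction sketched under that
assumption, not necessarily the paper's exact model.

```python
import random

def estimate_volume(samples, p, is_marked):
    """Estimate the number of marked nodes from samples drawn from p."""
    return sum(is_marked(x) / p(x) for x in samples) / len(samples)

# Toy check: 1,000 nodes, the first 100 marked, uniform trial distribution.
universe = list(range(1000))
samples = [random.choice(universe) for _ in range(5000)]
print(estimate_volume(samples, lambda x: 1 / 1000, lambda x: x < 100))  # ~100
```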

13
Results: Bias Estimation
  • Drew 10,000 samples and collected their scores
  • Uniform sampler: practically unbiased
  • Popularity sampler: minor bias towards medium-scored
    queries
  • Uniform sampler against the local suggestion service:
    • Bias approaches 0 after 400K queries
    • At most 3% after 50K queries

14
Results Real Services
  • Popularity trial distribution from AOL log
  • Converted to lowercase
  • Calculated target weights
  • 2 real suggestion services
  • Fail to name them explicitly
  • Collected suggestion data
  • SE1 (5.7M queries, 10,000 samples)
  • SE2 (2.1M queries, 3,500 samples)
  • Submitted suggestions to the search engine
    directly and collected the results

15
Suggestion Size and Dead Links
  • Suggestion size
    • Surprising difference in absolute size
    • Score-induced coverage is much closer
    • SE1 does a better job of minimizing suggestion
      database size while still covering relevant
      information
  • Dead links
    • Both search engines do a good job of keeping dead
      pages out of search results
    • No big surprise here

16
Wikipedia Coverage
  • Wikipedia entries exist for a large number of
    suggestions (exact matches)
  • Wikipedia covers a fairly significant number of
    popular topics (exactly)
  • Almost 20% return a Wikipedia article as the first
    result
  • Over 60% return a Wikipedia result in the top 10
  • Implication: search engines rely on Wikipedia to
    deliver relevant information to users

17
Query Popularity
  • Relative popularity of queries
  • Helpful for advertisers
  • NOTE: this type of service already exists via
    Google Trends
  • http://www.google.com/trends