1
On Ranking the Effectiveness of Searches
  • Vishwa Vinay, Ingemar J. Cox
  • University College London
  • Natasa Milic-Frayling, Ken Wood
  • Microsoft Research Ltd., Cambridge
  • SIGIR 2006

2
Introduction
  • There is considerable interest in estimating
    the effectiveness of a search.
  • Such an estimate is useful for
  • Providing feedback to the user
  • Providing feedback to the search engine
  • Providing feedback to the database creators
  • Optimizing information fusion for meta-search
    engines

3
Previous Work
  • Two classes of strategies have been proposed
  • Analysis of the query
  • E.g., query length, IDF values of query terms,
    and the clarity score, which depends on a
    language model
  • Achieved limited success
  • Analysis of the retrieved document set
  • The best performance to date (0.439 in Kendall's
    τ statistic)

4
Four Proposed Measures
  • Focus on the geometry of the retrieved document
    set
  • The clustering tendency as measured by the
    Cox-Lewis statistic
  • The sensitivity to document perturbation
  • The sensitivity to query perturbation
  • The local intrinsic dimensionality

5
The Clustering Tendency
  • The cluster hypothesis
  • Documents relevant to a given query are likely
    to be similar to each other
  • A lack of clusters in the retrieved set implies
    that the set does not contain relevant documents
  • The task is therefore to detect randomness in
    the retrieved set

6
Measuring the Clustering Tendency
  • The Cox-Lewis statistic (from the
    pattern-recognition literature) is based on the
    ratio of two distances
  • The distance from a randomly generated sampling
    point to its nearest neighbor in the dataset,
    called the marked point
  • The distance between the marked point and its
    nearest neighbor
  • When the data contains inherent clusters, the
    distance Drand between the random sampling point
    and its marked point is likely to be much larger
    than the distance Dnn between the marked point
    and its nearest neighbor
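  • In symbols, the ratio just described (our
    restatement, in LaTeX notation):

      CL = \frac{D_{\mathrm{rand}}}{D_{\mathrm{nn}}}

  • Clustered data gives Drand much larger than Dnn,
    hence a large CL; for an essentially random set
    the two distances are comparable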

7
Approximation of the Cox-Lewis Statistic (1/2)
  • Sampling window
  • A region in the data representation space from
    which the random points are picked
  • The smallest hyper-rectangle that contains all
    the documents in the retrieved set
  • The computed Cox-Lewis ratio is normalized by the
    average length of the sides of the
    hyper-rectangle

8
Approximation of the Cox-Lewis Statistic (2/2)
  • Approximation process
  • Each random point is generated by starting with a
    point randomly selected from the retrieved set of
    documents
  • Each non-zero term weight is replaced with a
    value chosen uniformly from the range that
    corresponds to the side of the hyper-rectangle in
    that dimension
  • The dependency on sparsity is maintained
  • The clustering tendency of a given set of points
    is dependent on their sparsity
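  • A minimal numpy sketch of this sampling step
    (dense vectors for brevity; doc_vectors is the
    N x D term-weight matrix of the retrieved set,
    and all names are ours, not the paper's):

      import numpy as np

      def random_sampling_point(doc_vectors, rng=None):
          rng = rng or np.random.default_rng()
          # Sides of the smallest hyper-rectangle
          # enclosing the retrieved set.
          lo = doc_vectors.min(axis=0)
          hi = doc_vectors.max(axis=0)
          # Start from a randomly chosen retrieved document...
          point = doc_vectors[rng.integers(len(doc_vectors))].copy()
          # ...and redraw only its non-zero weights,
          # preserving the sparsity pattern.
          nz = point != 0
          point[nz] = rng.uniform(lo[nz], hi[nz])
          return point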

9
Query-Dependent Extension of Cosine
  • The cosine measure is extended to be
    query-dependent, where c is the vector of terms
    common to both di and dj, with weights ck being
    the average of dik and djk
  • Query-specific Cox-Lewis statistic, where psp is
    the sampling point, dmp is the marked point, and
    dnn is the nearest neighbor of the marked point
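  • The formulas themselves did not survive this
    transcript. One plausible reconstruction from the
    definitions above, following Tombros and van
    Rijsbergen's query-sensitive similarity (the
    geometric-mean form and the use of 1 - sim as a
    distance are our assumptions, not the paper's
    exact equations):

      \operatorname{sim}_q(d_i, d_j)
        = \sqrt{\cos(d_i, d_j)\,\cos(c, q)},
      \qquad c_k = \tfrac{1}{2}(d_{ik} + d_{jk})

      CL_q = \frac{1 - \operatorname{sim}_q(p_{sp}, d_{mp})}
                  {1 - \operatorname{sim}_q(d_{mp}, d_{nn})}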

10
Document Perturbation (1/2)
  • A document is randomly selected from the
    retrieved set and used as a pseudo-query over the
    retrieved set
  • The selected document will be ranked first in the
    new result list
  • When a perturbed version of the selected
    document is used as the pseudo-query instead, the
    rank of the selected document will increase
  • Plotting the new rank of the selected document
    vs. the level of introduced noise (α)
  • The slope is expected to be related to the
    clustering tendency

11
Document Perturbation (2/2)
  • For each query
    • Issue the query to the dataset
    • Collect 100 results
    • Calculate the variance v_j along each
      dimension j in the retrieved set
    • For each document di in this set
      • For α in {0.01, 0.1, 1, 10}
        • For s = 1 to 10
          • For each term j present in this document
            • weight = original_weight +
              Gaussian(0, α·v_j)
          • Find the similarity of the noisy
            document to all 100 original documents
          • Find the rank of di in the list of
            similarities
        • Find the average rank over the multiple
          samples for document di and this α
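  • A condensed Python sketch of the loop above (dot
    products stand in for the similarity function,
    i.e. cosine if rows are length-normalized; we
    read Gaussian(0, α·v_j) as a variance of α·v_j,
    which is our assumption):

      import numpy as np

      def document_perturbation_curve(retrieved,
              alphas=(0.01, 0.1, 1, 10), samples=10, rng=None):
          rng = rng or np.random.default_rng()
          var = retrieved.var(axis=0)      # variance v_j per dimension
          curve = {}
          for alpha in alphas:
              ranks = []
              for d in retrieved:          # each retrieved document
                  for _ in range(samples):
                      noisy = d.copy()
                      nz = noisy != 0      # perturb only terms present in d
                      noisy[nz] += rng.normal(0.0, np.sqrt(alpha * var[nz]))
                      sims = retrieved @ noisy   # pseudo-query vs. all docs
                      ranks.append(int((sims > d @ noisy).sum()) + 1)
              curve[alpha] = float(np.mean(ranks))  # average rank at this α
          return curve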

12
Query Perturbation (1/2)
  • The original query is perturbed to retrieve
    documents from the collection
  • We attempt to measure how distant the original
    retrieved set is from the set retrieved by the
    perturbed query
  • The rationale
  • If the originally retrieved set forms a tight
    cluster that is significantly distant from other
    topical clusters in the collection, and if the
    magnitude of the added noise is small, then a
    noisy query will still retrieve most documents
    from the original set

13
Query Perturbation (2/2)
  • For each query
    • Issue the query to the dataset
    • Collect 100 results, called the original_set
    • Calculate the variance v_k along each
      dimension k in the entire collection
    • For α in {0.01, 0.1, 1, 10}
      • For s = 1 to 10
        • For each term k present in this query
          • weight = original_weight +
            Gaussian(0, α·v_k)
        • Issue the noisy query to the entire
          dataset
        • Collect 100 results, called the noisy_set
        • Find the Levenshtein distance between
          original_set and noisy_set
      • Find the average distance over the multiple
        samples for this α
    • Find the average distance for each given α
  • Plot the average distance vs. α for the range of
    α values
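  • The Levenshtein step, sketched as the standard
    dynamic program over two ranked lists of document
    ids (our implementation, not the paper's code):

      def levenshtein(a, b):
          # Edit distance between two ranked result lists.
          prev = list(range(len(b) + 1))
          for i, x in enumerate(a, 1):
              curr = [i]
              for j, y in enumerate(b, 1):
                  curr.append(min(prev[j] + 1,              # delete x
                                  curr[j - 1] + 1,          # insert y
                                  prev[j - 1] + (x != y)))  # substitute
              prev = curr
          return prev[-1]

      # e.g. distance = levenshtein(original_set, noisy_set)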

14
Local Intrinsic Dimensionality (1/2)
  • The number of parameters required to represent a
    set of N points in a D-dimensional space is
    called the intrinsic dimensionality and will
    always be less than min(N, D)
  • Methods for intrinsic dimensionality analysis
  • Latent semantic analysis (LSA)
  • Requires a threshold for significance of
    eigenvalues
  • Bayesian model selection using the Laplace
    criterion [Minka, 1999]
  • Suggests the optimal number of components for
    principal component analysis (PCA)

15
Local Intrinsic Dimensionality (2/2)
  • Given the set of retrieved documents, for each
    point in the set we identify its closest K
    neighbors within the retrieved set, where K
    ranges from 5 to 20 in steps of 5
  • The number of components suggested by the Laplace
    criterion for these K+1 data points is the local
    intrinsic dimensionality
  • As we increase K, we can observe the rate of
    change in the intrinsic dimensionality
  • The underlying assumption
  • A high-dimensional dataset can be decomposed
    into a lower-dimensional component plus noise.
    If there is a large amount of noise in the data,
    the number of parameters required to model an
    essentially random set of points is small
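  • A sketch using scikit-learn, whose
    PCA(n_components='mle') implements Minka's
    dimensionality selection (a close stand-in for
    the Laplace criterion cited above). Because 'mle'
    requires n_samples >= n_features, each
    neighborhood is first projected onto the subspace
    it spans; that projection step and all names are
    ours:

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.neighbors import NearestNeighbors

      def local_intrinsic_dim(retrieved, k):
          # k nearest neighbors of every point in the retrieved set
          # (k + 1 because each point is its own nearest neighbor).
          _, idx = (NearestNeighbors(n_neighbors=k + 1)
                    .fit(retrieved).kneighbors(retrieved))
          dims = []
          for neighborhood in idx:
              pts = retrieved[neighborhood]
              pts = pts - pts.mean(axis=0)
              # Coordinates of the K+1 points within their own span,
              # so that n_samples >= n_features holds for PCA 'mle'.
              u, s, _ = np.linalg.svd(pts, full_matrices=False)
              local = (u * s)[:, :k]       # rank <= k after centering
              dims.append(PCA(n_components='mle').fit(local).n_components_)
          return float(np.mean(dims))

      # rate of change: evaluate for k in (5, 10, 15, 20)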

16
Experiments
  • Document collection: TREC disks 4 and 5
  • 200 TREC topics (301-450 and 601-650)
  • The description field is used to formulate the
    query
  • IR system: TF-IDF weighting in the Lemur toolkit
  • The average precision achieved by the IR system
    for each query forms a ground-truth ranking of
    queries by search effectiveness

17
Experiment Results (1/4)
  • Table 1: Correlation between each of the
    features and the average precision
  • Note: the best performance to date on the same
    dataset is 0.439
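  • The rank correlation itself is a single scipy
    call; the toy scores below are purely
    illustrative:

      from scipy.stats import kendalltau

      predicted = [0.42, 0.10, 0.77, 0.25]      # a measure's score per query
      avg_precision = [0.35, 0.05, 0.60, 0.30]  # ground-truth AP per query
      tau, p_value = kendalltau(predicted, avg_precision)
      print(tau)   # Kendall's τ between the two rankings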

18
Experiment Results (2/4)
  • Combining predictive measures
  • Same-mean normalization
  • Adjust x to (x · mean(Y)) / mean(X)
  • Min-max normalization
  • Adjust x to (x-min(X))/(max(X)-min(X))
  • Inverse tangent (arctan) normalization
  • Adjust x to (2 · arctan(x)) / π, which lies in
    [0, 1]
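  • These three rescalings in a few lines of numpy
    (X and Y are the full score vectors of two
    measures being combined; the naming is ours):

      import numpy as np

      def same_mean(x, X, Y):
          # rescale x so that measure X's mean matches Y's mean
          return x * np.mean(Y) / np.mean(X)

      def min_max(x, X):
          return (x - np.min(X)) / (np.max(X) - np.min(X))

      def arctan_norm(x):
          # maps non-negative scores into [0, 1)
          return 2.0 * np.arctan(x) / np.pi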

19
Experiment Results (3/4)
  • Table 2: Combining the four search-effectiveness
    measures

20
Experiment Results (4/4)
  • Table 3: Effectiveness at identifying poorly
    performing searches

21
Conclusions
  • Methods for estimating search effectiveness are
    presented, based on properties of the retrieved
    set
  • Four measures are investigated: the clustering
    tendency, the sensitivity to document
    perturbation, the sensitivity to query
    perturbation, and the rate of change in the
    local intrinsic dimensionality
  • The measures explored can also be used to compare
    the complexity of different document collections
    and the effects of different document
    representations on search