Recall: Query Reformulation Approaches - PowerPoint PPT Presentation

Slides: 19
Provided by: CSU67
Transcript and Presenter's Notes

1
Recall: Query Reformulation Approaches
  • Relevance feedback based
    • vector model (Rocchio)
    • probabilistic model (Robertson & Sparck Jones,
      Croft)
  • Cluster based query expansion
    • Local analysis: derive information from the
      retrieved document set
    • Global analysis: derive information from the
      corpus

2
Local Analysis
  • Known relevant documents contain terms which can
    be used to describe a larger cluster of relevant
    documents (MIR).
  • In relevance feedback, clusters are built from
    interaction with the user about documents.
  • Local analysis automatically exploits the
    retrieved documents by identifying terms related
    to those in the query.

3
Term Clusters
  • Association clusters: model co-occurrence of
    stems in retrieved documents; expand using
    co-occurring terms
    • unnormalized: groups by large frequencies
    • normalized: groups by rarity
  • Metric clusters: factor in intra-document
    distance
  • Problem: expensive to compute on the fly
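
A minimal sketch of the association-cluster idea above, assuming the retrieved documents are already reduced to lists of stems. The function names and the toy documents are hypothetical. The normalization used here, c_uv / (c_uu + c_vv - c_uv), is one common choice and is an assumption, not necessarily the exact form intended on the slide.

```python
from collections import Counter

def association_matrix(docs):
    """Unnormalized association: c[u][v] = sum over docs of f(u, d) * f(v, d)."""
    counts = [Counter(doc) for doc in docs]
    vocab = sorted({term for c in counts for term in c})
    return {
        u: {v: sum(c[u] * c[v] for c in counts) for v in vocab}
        for u in vocab
    }

def normalized_association(c):
    """Normalized association (assumed form): c_uv / (c_uu + c_vv - c_uv)."""
    return {
        u: {v: c[u][v] / (c[u][u] + c[v][v] - c[u][v]) for v in c}
        for u in c
    }

def neighbors(matrix, term, n):
    """The n terms most strongly associated with `term`, excluding itself."""
    row = matrix[term]
    ranked = sorted((v for v in row if v != term), key=lambda v: (-row[v], v))
    return ranked[:n]

# Hypothetical local document set (lists of stems from retrieved documents).
local_docs = [
    ["query", "expand", "term", "term"],
    ["query", "term", "cluster"],
    ["cluster", "document", "document"],
]
c = association_matrix(local_docs)
s = normalized_association(c)
expansion_terms = neighbors(s, "query", 2)
```

The unnormalized matrix favors frequent stems, while the normalized one rewards stems that rarely occur apart from each other. Both are recomputed per query, which is where the on-the-fly cost comes from.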

4
Global Analysis
  • All documents are analyzed for term
    relationships.
  • Two approaches:
    • Similarity thesaurus: relates the whole query to
      new terms. Focus is on the concept underlying
      the terms; each term is indexed by the documents
      in which it appears.
    • Statistical thesaurus: cluster documents into a
      class hierarchy

5
Similarity Thesaurus Basis
  • The inverse term frequency (itf) for doc d_j is
    $itf_j = \log(t / t_j)$
  • N is the number of documents, t is the number of
    distinct terms in the collection, and t_j is the
    number of distinct terms in document j

6
Similarity Thesaurus Creation
  • Thesaurus is a matrix of correlation factors
    between indexing terms
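
A rough sketch of how such a matrix could be built, assuming the Qiu and Frei style weighting: each term is a unit-length vector over documents, with entries (0.5 + 0.5 f / max f) scaled by itf_j = log(t / t_j), and the correlation factor c_uv is the dot product of two term vectors. The function names and toy documents are hypothetical, and giving weight 0 to a term absent from a document is an assumption.

```python
import math

def term_vectors(docs):
    """Represent each term as a unit-length vector indexed by documents."""
    vocab = sorted({term for doc in docs for term in doc})
    t = len(vocab)
    # itf_j = log(t / t_j), with t_j the number of distinct terms in doc j.
    # Assumes every document is non-empty.
    itf = [math.log(t / len(set(doc))) for doc in docs]
    vectors = {}
    for term in vocab:
        f = [doc.count(term) for doc in docs]
        max_f = max(f)
        # Assumption: a term absent from a document gets weight 0 there.
        raw = [
            (0.5 + 0.5 * f_j / max_f) * itf_j if f_j else 0.0
            for f_j, itf_j in zip(f, itf)
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]
    return vectors

def similarity_thesaurus(vectors):
    """Correlation factor c[u][v] = dot product of term vectors k_u and k_v."""
    return {
        u: {v: sum(a * b for a, b in zip(ku, kv)) for v, kv in vectors.items()}
        for u, ku in vectors.items()
    }

# Hypothetical three-document collection.
docs = [
    ["car", "engine", "car"],
    ["car", "auto"],
    ["flower", "garden", "flower"],
]
thesaurus = similarity_thesaurus(term_vectors(docs))
```

Because the matrix covers every pair of index terms in the whole collection, it is built once offline rather than per query.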

7
Relationship between terms and Query
  • from Qiu & Frei, Concept Based Query Expansion,
    SIGIR-93

8
Query Expansion w/Similarity Thesaurus
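  • Represent the query in the concept space of the
    index terms (weight vector)
  • Based on the global similarity thesaurus, compute
    a similarity sim(q, k_v) between each term k_v
    and the whole query q
  • Expand the query with the top r ranked terms and
    weight each added term k_v with
    $w_{v,q'} = sim(q, k_v) / \sum_{k_u \in q} w_{u,q}$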
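
The three steps can be sketched as follows, assuming sim(q, k_v) is the sum over query terms k_u of w_{u,q} * c_{u,v} and that each added term is weighted by sim(q, k_v) divided by the sum of the query term weights (the Qiu and Frei formulation as usually stated). The thesaurus values, query weights, and function name are hypothetical. Skipping terms already in the query is a simplification.

```python
def expand_query(query_weights, thesaurus, r):
    """Add the r terms most similar to the query as a whole."""
    total = sum(query_weights.values())
    sims = {}
    for v in thesaurus:
        if v in query_weights:
            continue  # simplification: only consider terms not already in q
        # sim(q, k_v) = sum over query terms k_u of w_{u,q} * c_{u,v}
        sims[v] = sum(
            w * thesaurus[u].get(v, 0.0)
            for u, w in query_weights.items()
            if u in thesaurus
        )
    top = sorted(sims, key=lambda v: (-sims[v], v))[:r]
    expanded = dict(query_weights)
    for v in top:
        expanded[v] = sims[v] / total  # weight of the added term
    return expanded

# Hypothetical correlation factors between four index terms.
thesaurus = {
    "car":    {"car": 1.0, "engine": 0.6, "auto": 0.8, "flower": 0.0},
    "engine": {"car": 0.6, "engine": 1.0, "auto": 0.5, "flower": 0.1},
    "auto":   {"car": 0.8, "engine": 0.5, "auto": 1.0, "flower": 0.0},
    "flower": {"car": 0.0, "engine": 0.1, "auto": 0.0, "flower": 1.0},
}
query = {"car": 1.0, "engine": 0.5}
expanded = expand_query(query, thesaurus, r=1)
```

Ranking against the whole query, rather than against each query term separately, is what keeps a term like "flower" out even though it is weakly related to "engine".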

9
Global 2: Statistical Thesaurus
  • Thesaurus construction relies on high
    discrimination/low frequency terms.
    • Hard to cluster
  • So, build classes based on clustering similar
    docs instead.
  • Cluster similarity is the minimum cosine (vector
    model) similarity between any two docs, one from
    each cluster.

10
Complete Link Algorithm (Crouch & Yang)
  1. Place each document in a distinct cluster.
  2. Compute the similarity between all pairs of
     clusters.
  3. Determine the pair of clusters Cu, Cv with the
     highest inter-cluster similarity.
  4. Merge the clusters Cu and Cv.
  5. Verify a stop criterion. If this criterion is not
     met, go back to step 2.
  6. Return a hierarchy of clusters.

11
Hierarchy Example
  • Doc1 = D, D, A, B, C, A, B, C
  • Doc2 = E, C, E, A, A, D
  • Doc3 = D, C, B, B, D, A, B, C, A
  • Doc4 = A
  • (from MIR notes)
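
The complete link procedure can be tried on these four documents with a short sketch. Assumptions: documents are weighted by raw term frequency (a tf-idf weighting would give different similarities and possibly a different hierarchy), and the stop criterion is a hypothetical similarity threshold checked before each merge. The function names are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def complete_link(docs, threshold=0.0):
    """Return the merges as (cluster_u, cluster_v, similarity) in merge order."""
    vectors = {name: Counter(terms) for name, terms in docs.items()}
    clusters = [(name,) for name in docs]  # each document in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete link: similarity of the least similar pair of docs.
                sim = min(cosine(vectors[a], vectors[b])
                          for a in clusters[i] for b in clusters[j])
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < threshold:  # assumed stop criterion
            break
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

docs = {
    "Doc1": list("DDABCABC"),
    "Doc2": list("ECEAAD"),
    "Doc3": list("DCBBDABCA"),
    "Doc4": list("A"),
}
hierarchy = complete_link(docs)
```

Under the raw-frequency assumption, Doc1 and Doc3 merge first because their term profiles are nearly identical. The order of the merges, read bottom up, is the cluster hierarchy.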

12
Query Expansion w/Statistical Thesaurus
  • Select the terms for each class
    • Threshold on similarity: determines which
      clusters are selected
    • NDC: determines the maximum number of docs in a
      cluster
    • MIDF: determines the minimum IDF for any term
      (i.e., how rare)
  • Compute thesaurus class weight for terms

13
Global Analysis Summary
  • The thesaurus approach has been effective for
    improving queries
  • However:
    • requires expensive processing (static corpus
      required)
    • statistical generation exploits small
      frequencies better but is sensitive to parameter
      settings

14
Relevance Feedback/Query Reformulation Summary
  • Relevance feedback and query expansion approaches
    have been shown to be effective at improving
    recall, sometimes at the expense of precision.
  • Users resist relevance feedback; it takes time
    and understanding.
  • Query reformulation can be costly (expensive
    computation) for search engines/IR systems.

15
Search Engine Use of Query Feedback
  • Relevance feedback
    • explicit: tried, but mostly abandoned
    • indirect: Teoma (ranks documents that users
      look at more often higher)
  • Similar/related pages or searches
    • suggest expanded queries or ask to search for
      related pages (AltaVista and MSN Search used to
      do this)
    • Google: Find Similar
    • Teoma
  • Web log data mining
16
Behavior-Based Ranking
  • AskJeeves used user behavior to change results
    ranking
    • For each query Q, record which URLs are followed
    • Use click-through counts to order URLs for
      subsequent submissions of Q
  • Pseudo-relevance feedback
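
A minimal sketch of the click-through idea, assuming queries are matched by exact string and that URLs with equal counts keep their original order. The class name, method names, and URLs are hypothetical and do not describe any actual search engine's implementation.

```python
from collections import Counter, defaultdict

class ClickThroughRanker:
    """Reorders results for a query by how often each URL was followed."""

    def __init__(self):
        # query string -> Counter mapping URL to click count
        self.clicks = defaultdict(Counter)

    def record_click(self, query, url):
        """Record that a user followed `url` after submitting `query`."""
        self.clicks[query][url] += 1

    def rerank(self, query, urls):
        """Order urls by descending click count; ties keep the input order."""
        counts = self.clicks.get(query, Counter())
        return sorted(urls, key=lambda u: -counts[u])

ranker = ClickThroughRanker()
for url in ["b.example", "c.example", "b.example"]:
    ranker.record_click("query expansion", url)

results = ["a.example", "b.example", "c.example"]
reranked = ranker.rerank("query expansion", results)
```

A query with no recorded clicks comes back in its original order, so the behavior-based signal only adjusts rankings where evidence exists.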

17
Teoma: Indirect Relevance
  • Combines indirect relevancy judgments with their
    own link analysis
  • Subject-Specific Popularity: ranks a site based
    on the number of same-subject pages that
    reference it (Teoma.com page)
  • Clustering usage
    • Refine: models communities to suggest search
      classification
    • Resources: suggests authoritative sites within
      the designated community

18
Web Log Mining
  • SOP for large search engines to monitor what
    people are querying
  • Goals:
    • learn associations between common terms based
      on a large number of queries
    • identify trends in user behavior that should be
      addressed by the system
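
One simple way to learn term associations from a query log is to count how often two terms appear in the same query. The sketch below assumes whitespace tokenization and a hypothetical `min_count` cutoff. The function name and sample queries are illustrative only.

```python
from collections import Counter
from itertools import combinations

def term_associations(queries, min_count=2):
    """Count term pairs that co-occur in at least `min_count` queries."""
    pairs = Counter()
    for query in queries:
        # Each distinct term counts once per query; pairs are stored sorted.
        terms = sorted(set(query.lower().split()))
        pairs.update(combinations(terms, 2))
    return {pair: n for pair, n in pairs.items() if n >= min_count}

# Hypothetical query log.
log = ["cheap flights", "Cheap flights Paris", "paris hotels"]
associations = term_associations(log)
```

Pairs that clear the cutoff are candidates for suggested expansions. At real log volumes the counting would be done in batch rather than in memory.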