Title: Recall: Query Reformulation Approaches
1. Recall: Query Reformulation Approaches
- Relevance feedback based
  - vector model (Rocchio)
  - probabilistic model (Robertson & Sparck Jones, Croft)
- Cluster based Query Expansion
  - Local analysis: derive information from retrieved document set
  - Global analysis: derive information from corpus
2. Local Analysis
- Known relevant documents contain terms which can be used to describe a larger cluster of relevant documents. (MIR)
- In relevance feedback, clusters are built from interaction with the user about documents.
- Local analysis automatically exploits the retrieved documents by identifying terms related to those in the query.
3. Term Clusters
- Association Clusters: model co-occurrence of stems in retrieved documents, expand using co-occurring terms
  - unnormalized: groups by large frequencies
  - normalized: groups by rarity
- Metric Clusters: factor in intra-document distance
- Problem: Expensive to compute on the fly
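The cluster types above can be sketched in a few lines. This is a minimal illustration, not the lecture's own code: it assumes the usual MIR-style definitions (association c_uv as the sum over documents of the product of stem frequencies, normalized as c_uv / (c_uu + c_vv - c_uv), and a metric correlation that sums 1/distance over co-occurring positions). The function names and the toy `local_docs` are hypothetical.

```python
from collections import Counter

def association_matrix(docs):
    """c[u][v] = sum over documents of freq(u, d) * freq(v, d)."""
    freqs = [Counter(doc) for doc in docs]
    vocab = sorted({stem for doc in docs for stem in doc})
    return {u: {v: sum(f[u] * f[v] for f in freqs) for v in vocab} for u in vocab}

def normalized(c, u, v):
    """Normalized association: c_uv / (c_uu + c_vv - c_uv)."""
    return c[u][v] / (c[u][u] + c[v][v] - c[u][v])

def expand(c, stem, n, normalize=False):
    """Return the n stems most strongly associated with `stem`."""
    if normalize:
        score = lambda v: normalized(c, stem, v)
    else:
        score = lambda v: c[stem][v]
    others = [v for v in c if v != stem]
    return sorted(others, key=score, reverse=True)[:n]

def metric_correlation(docs, u, v):
    """Sum of 1/distance over all position pairs of u and v in the same document."""
    total = 0.0
    for doc in docs:
        pos_u = [i for i, s in enumerate(doc) if s == u]
        pos_v = [i for i, s in enumerate(doc) if s == v]
        total += sum(1.0 / abs(i - j) for i in pos_u for j in pos_v if i != j)
    return total

# Hypothetical local (retrieved) document set: each document is a list of stems.
local_docs = [
    ["query", "expans", "term", "term"],
    ["query", "term", "cluster"],
    ["cluster", "document", "document", "document", "term"],
]
c = association_matrix(local_docs)
by_frequency = expand(c, "query", 2)                 # favours frequent stems
by_rarity = expand(c, "query", 2, normalize=True)    # favours rarer stems
```

On this toy data the unnormalized ranking picks "term" and "cluster", while the normalized one swaps "cluster" for the rarer "expans", which is the frequency versus rarity contrast noted in the bullets.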
4. Global Analysis
- All documents are analyzed for term relationships.
- Two Approaches
  - Similarity thesaurus: relates whole query to new terms. Focus is on the concept underlying terms; each term is indexed by the documents in which it appears.
  - Statistical thesaurus: cluster documents into class hierarchy
5. Similarity Thesaurus Basis
- where itf_j is the inverse term frequency (itf) for doc d_j, N is the number of documents, t is the number of distinct terms in the collection and t_j is the number of distinct terms in document j
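For reference, the term weights that these symbols belong to can be written out as below. This follows the Qiu & Frei formulation as it is usually presented in MIR and is an assumption here, since the slide's own formula is not reproduced in the text. f_{i,j} denotes the frequency of term k_i in document d_j. The weight is taken as 0 when k_i does not occur in d_j, so each term vector has unit length.

```latex
\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N}), \qquad
w_{i,j} = \frac{\left(0.5 + 0.5\,\dfrac{f_{i,j}}{\max_{l} f_{i,l}}\right) itf_j}
               {\sqrt{\displaystyle\sum_{l\,:\,f_{i,l} > 0}
                 \left(0.5 + 0.5\,\dfrac{f_{i,l}}{\max_{l'} f_{i,l'}}\right)^{2} itf_l^{2}}},
\qquad itf_j = \log\frac{t}{t_j}
```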
6. Similarity Thesaurus Creation
- Thesaurus is a matrix of correlation factors
between indexing terms
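As a rough sketch of how such a matrix could be built: each term gets a unit-length vector of itf-based weights over the documents, and the correlation factor between two terms is the dot product of their vectors. The weighting follows the Qiu & Frei scheme as commonly presented and is an assumption here. `term_vectors`, `similarity_thesaurus` and the toy `docs` are hypothetical names and data.

```python
import math
from collections import Counter

def term_vectors(docs):
    """Map each term to a unit-length weight vector over the documents."""
    freqs = [Counter(doc) for doc in docs]
    vocab = sorted(set().union(*docs))
    t = len(vocab)                                   # distinct terms in collection
    itf = [math.log(t / len(f)) for f in freqs]      # len(f) = distinct terms in doc j
    vectors = {}
    for term in vocab:
        max_f = max(f[term] for f in freqs)
        # Weight is 0 for documents that do not contain the term (assumption).
        raw = [(0.5 + 0.5 * f[term] / max_f) * itf[j] if f[term] else 0.0
               for j, f in enumerate(freqs)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]
    return vectors

def similarity_thesaurus(vectors):
    """c[u][v] = dot product of the weight vectors of terms u and v."""
    return {u: {v: sum(a * b for a, b in zip(vectors[u], vectors[v]))
                for v in vectors}
            for u in vectors}

# Hypothetical three-document collection.
docs = [
    ["car", "auto", "engine"],
    ["car", "auto", "wheel", "road"],
    ["road", "map"],
]
c = similarity_thesaurus(term_vectors(docs))
```

Terms that occur in exactly the same documents ("car" and "auto") end up with a correlation close to 1, and terms that never share a document ("car" and "map") get 0. The full matrix is quadratic in the vocabulary size, which is one reason this is done offline on a static corpus.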
7. Relationship between terms and Query
- (figure from Qiu & Frei, Concept Based Query Expansion, SIGIR-93)
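8. Query Expansion w/Similarity Thesaurus
- Represent the query in the concept space of the index terms (weight vector)
- Based on the global similarity thesaurus, compute a similarity sim(q,kv) between the query q and each term kv
- Expand the query with the top r ranked terms and weight each added term by its similarity to the query, normalized by the sum of the original query term weights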
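The three steps can be sketched as follows, assuming sim(q, kv) is the query-weighted sum of correlation factors and that an added term's weight is sim(q, kv) divided by the sum of the original query weights (the Qiu & Frei scheme as usually presented). `expand_query` and the thesaurus values are hypothetical. Skipping terms already in the query is a simplification.

```python
def expand_query(query_weights, thesaurus, r):
    """query_weights: {term: w_uq}; thesaurus: {u: {v: c_uv}}; r: terms to add."""
    total = sum(query_weights.values())
    candidates = {v for u in query_weights for v in thesaurus.get(u, {})}
    # sim(q, kv) = sum over query terms ku of w_uq * c_uv
    sims = {
        v: sum(w * thesaurus.get(u, {}).get(v, 0.0)
               for u, w in query_weights.items())
        for v in candidates if v not in query_weights
    }
    # Rank by similarity, breaking ties alphabetically for determinism.
    top = sorted(sims, key=lambda v: (-sims[v], v))[:r]
    expanded = dict(query_weights)
    for v in top:
        expanded[v] = sims[v] / total
    return expanded

# Hypothetical correlation factors for two query terms.
thesaurus = {
    "car":  {"auto": 0.9, "road": 0.3, "engine": 0.5},
    "fast": {"auto": 0.1, "road": 0.4, "engine": 0.6},
}
expanded = expand_query({"car": 1.0, "fast": 0.5}, thesaurus, r=2)
```

Because similarity is computed against the whole query rather than term by term, "engine" (moderately related to both query terms) outranks "road" here.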
9. Global 2: Statistical Thesaurus
- Thesaurus construction relies on high discrimination/low frequency terms.
  - Hard to cluster
- So, build classes based on clustering similar docs instead.
- Similarity between two clusters is the minimum of the cosine vector model similarity between any two docs (one from each cluster).
10. Complete Link Algorithm (Crouch & Yang)
- Step 1: Place each document in a distinct cluster.
- Step 2: Compute the similarity between all pairs of clusters.
- Step 3: Determine the pair of clusters Cu,Cv with the highest inter-cluster similarity.
- Step 4: Merge the clusters Cu and Cv.
- Step 5: Verify a stop criterion. If this criterion is not met then go back to step 2.
- Step 6: Return a hierarchy of clusters.
11. Hierarchy Example
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
- from MIR notes
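The complete link steps can be tried on the four example documents with a short sketch. Two things here are assumptions: the document vectors use raw term counts (the weighting behind the original example is not given in the text, so the similarity values will not necessarily match the MIR figure), and the stop criterion is a hypothetical similarity threshold.

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(x * x for x in a.values()))
            * math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def complete_link(docs, threshold=0.0):
    """Return the merge history as (cluster_u, cluster_v, similarity) tuples."""
    vecs = {name: Counter(terms) for name, terms in docs.items()}
    clusters = [frozenset([name]) for name in docs]            # step 1

    def sim(cu, cv):
        # Complete link: similarity of the least similar cross-cluster pair.
        return min(cosine(vecs[a], vecs[b]) for a in cu for b in cv)

    merges = []
    while len(clusters) > 1:
        cu, cv = max(combinations(clusters, 2),                # steps 2 and 3
                     key=lambda pair: sim(*pair))
        s = sim(cu, cv)
        if s < threshold:                                      # step 5 (assumed criterion)
            break
        clusters = [c for c in clusters if c not in (cu, cv)] + [cu | cv]  # step 4
        merges.append((set(cu), set(cv), s))
    return merges                                              # step 6

docs = {
    "Doc1": list("DDABCABC"),
    "Doc2": list("ECEAAD"),
    "Doc3": list("DCBBDABCA"),
    "Doc4": list("A"),
}
merges = complete_link(docs)
for cu, cv, s in merges:
    print(sorted(cu), sorted(cv), round(s, 3))
```

Under these assumptions Doc1 and Doc3 merge first, since they share almost the same term distribution. Recomputing all pairwise cluster similarities on every pass keeps the sketch short but is quadratic per merge; a real implementation would cache the document similarity matrix.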
12. Query Expansion w/Statistical Thesaurus
- Select the terms for each class
  - Threshold on similarity determines which clusters are used
  - NDC determines max number of docs in cluster
  - MIDF determines minimum IDF for any term (i.e., how rare)
- Compute thesaurus class weight for terms
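A minimal sketch of the term selection step, under stated assumptions: clusters arrive as (members, similarity) pairs from a prior clustering run, a class's candidate terms are the union of terms in its documents, and IDF is log10(N / df). The function name, the cluster similarities and the parameter values are illustrative, not taken from the source. The class weight computation is not included.

```python
import math

def thesaurus_classes(clusters, docs, tc, ndc, midf):
    """Select thesaurus classes: lists of rare terms from tight, small clusters."""
    n = len(docs)
    df = {}
    for terms in docs.values():
        for term in set(terms):
            df[term] = df.get(term, 0) + 1
    idf = {term: math.log10(n / count) for term, count in df.items()}

    classes = []
    for members, similarity in clusters:
        # The similarity threshold picks the clusters; NDC caps their size.
        if similarity < tc or len(members) > ndc:
            continue
        terms = set().union(*(docs[m] for m in members))
        # MIDF keeps only sufficiently rare terms.
        rare = sorted(t for t in terms if idf[t] >= midf)
        if rare:
            classes.append(rare)
    return classes

docs = {
    "Doc1": list("DDABCABC"),
    "Doc2": list("ECEAAD"),
    "Doc3": list("DCBBDABCA"),
    "Doc4": list("A"),
}
# Hypothetical cluster hierarchy with illustrative similarity values.
clusters = [
    ({"Doc1", "Doc3"}, 0.98),
    ({"Doc2", "Doc4"}, 0.63),
    ({"Doc1", "Doc2", "Doc3", "Doc4"}, 0.44),
]
classes = thesaurus_classes(clusters, docs, tc=0.9, ndc=2, midf=0.2)
```

With these settings only the Doc1/Doc3 cluster survives, and only term B is rare enough to enter its class. Small changes to any of the three parameters change which classes appear, which is the parameter sensitivity this approach is known for.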
13. Global Analysis Summary
- Thesaurus approach has been effective for improving queries
- However
  - requires expensive processing (static corpus required)
  - statistical generation exploits small frequencies better but is sensitive to parameter settings.
14. Relevance Feedback/Query Reformulation Summary
- Relevance feedback and query expansion approaches have been shown to be effective at improving relevance, sometimes at the expense of precision.
- Users resist relevance feedback; it takes time and understanding.
- Query reformulation can be costly (expensive computation) for search engines/IR systems.
15. Search Engine Use of Query Feedback
- Relevance feedback
  - explicit: tried, but mostly abandoned.
  - indirect: Teoma (ranks documents higher that users look at more often)
- Similar/Related Pages or searches
  - suggest expanded queries or ask to search for related pages (AltaVista and MSN Search used to do this)
  - Google: Find Similar
  - Teoma
- Web log data mining
16. Behavior-Based Ranking
- AskJeeves used user behavior to change results ranking
  - For each query Q, record which URLs are followed
  - Use click through counts to order URLs for subsequent submissions of Q
- Pseudo-relevance feedback
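The click-through idea can be sketched with a per-query counter. This is only an illustration of the two bullets above, not a description of how AskJeeves implemented it. `ClickLog` and the example URLs are hypothetical.

```python
from collections import Counter, defaultdict

class ClickLog:
    """Records followed URLs per query and reorders results by click count."""

    def __init__(self):
        self.clicks = defaultdict(Counter)    # query -> Counter of followed URLs

    def record(self, query, url):
        self.clicks[query][url] += 1

    def rerank(self, query, urls):
        counts = self.clicks.get(query, Counter())
        # sorted() is stable, so URLs with equal counts keep their original order.
        return sorted(urls, key=lambda u: -counts[u])

log = ClickLog()
log.record("jaguar", "b.example")
log.record("jaguar", "b.example")
log.record("jaguar", "c.example")

results = ["a.example", "b.example", "c.example", "d.example"]
reranked = log.rerank("jaguar", results)
```

Keying the counts on the exact query string keeps the sketch simple. Queries never seen before fall back to the original ranking.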
17. Teoma: Indirect Relevance
- Combines indirect relevancy judgments with their own link analysis
- Subject-Specific Popularity: ranks a site based on the number of same-subject specific pages that reference it. (Teoma.com page)
- Clustering Usage
  - Refine: Models communities to suggest search classification
  - Resources: Suggests authoritative sites within designated community
18. Web Log Mining
- SOP for large search engines to monitor what people are querying
- Goals
  - learn associations between common terms based on large number of queries
  - identify trends in user behavior that should be addressed by system
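One simple way to learn term associations from a query log is to count how often two terms appear in the same query. The sketch below assumes whitespace tokenization and a hypothetical `min_count` cutoff. The queries are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def term_associations(queries, min_count=2):
    """Count term pairs that co-occur within a query; keep the frequent ones."""
    pair_counts = Counter()
    for q in queries:
        terms = sorted(set(q.lower().split()))       # one count per query
        pair_counts.update(combinations(terms, 2))   # pairs in alphabetical order
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}

# Hypothetical query log.
queries = [
    "cheap flights paris",
    "paris flights",
    "cheap hotels",
    "flights paris deals",
]
associations = term_associations(queries)
```

On a real log the raw counts would usually be normalized, since very common terms co-occur with everything.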