Recall: Query Reformulation Approaches

Slides: 19

Provided by: CSU67

Learn more at: https://www.cs.colostate.edu

Transcript and Presenter's Notes

1
Recall: Query Reformulation Approaches

Relevance feedback based:
  vector model (Rocchio)
  probabilistic model (Robertson & Sparck Jones, Croft)
Cluster-based query expansion:
  Local analysis: derive information from the retrieved document set
  Global analysis: derive information from the corpus

2
Local Analysis

Known relevant documents contain terms that can be used to describe a larger cluster of relevant documents (MIR).
In relevance feedback, clusters are built from interaction with the user about documents.
Local analysis automatically exploits the retrieved documents by identifying terms related to those in the query.

3
Term Clusters

Association clusters: model co-occurrence of stems in the retrieved documents; expand the query using co-occurring terms
  unnormalized: groups terms with large frequencies
  normalized: groups terms by rarity
Metric clusters: factor in the distance between terms within a document
Problem: expensive to compute on the fly
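
The association-cluster idea above can be sketched in a few lines. This is a minimal illustration, not code from the slides: it assumes the usual textbook definitions, where the unnormalized correlation of stems u and v is the sum over retrieved documents of the product of their frequencies, and the normalized form divides by c(u,u) + c(v,v) - c(u,v). The function names and the toy documents are made up for the example.

```python
from collections import Counter


def association_matrix(docs):
    """Unnormalized association matrix over the retrieved documents.

    docs: list of stem lists (the local document set).
    Returns corr[u][v] = sum over documents of freq(u) * freq(v).
    """
    counts = [Counter(doc) for doc in docs]
    stems = sorted(set().union(*docs))
    return {
        u: {v: sum(cnt[u] * cnt[v] for cnt in counts) for v in stems}
        for u in stems
    }


def normalize(corr):
    """Normalized association: c_uv / (c_uu + c_vv - c_uv)."""
    return {
        u: {v: corr[u][v] / (corr[u][u] + corr[v][v] - corr[u][v]) for v in corr[u]}
        for u in corr
    }


def neighbors(matrix, stem, n):
    """The n stems most strongly associated with `stem` (ties broken by name)."""
    others = (v for v in matrix[stem] if v != stem)
    return sorted(others, key=lambda v: (-matrix[stem][v], v))[:n]


# Toy local document set (hypothetical): "x" is frequent, "y" is rare.
retrieved = [
    ["q", "x", "x", "x", "x"],
    ["q", "y"],
    ["x", "x", "x"],
]
corr = association_matrix(retrieved)
print(neighbors(corr, "q", 1))             # unnormalized favors the frequent stem
print(neighbors(normalize(corr), "q", 1))  # normalized favors the rarer stem
```

Both matrices are rebuilt for every query because they depend on the retrieved set, which is where the on-the-fly cost comes from.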

4
Global Analysis

All documents are analyzed for term relationships.
Two approaches:
  Similarity thesaurus: relates the whole query to new terms. The focus is on the concept underlying the terms; each term is indexed by the documents in which it appears.
  Statistical thesaurus: cluster documents into a class hierarchy

5
Similarity Thesaurus Basis

Each term is weighted per document using an inverse term frequency (itf) defined for document dj.
N is the number of documents, t is the number of distinct terms in the collection, and tj is the number of distinct terms in document dj.
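
The symbols N, t and tj match the term-weighting scheme usually given for the Qiu and Frei similarity thesaurus (for example in MIR). As a sketch under that assumption, and writing f_{i,j} for the frequency of term k_i in document d_j (a symbol that does not appear in this transcript), the definitions would read:

```latex
itf_j = \log \frac{t}{t_j}

w_{i,j} =
  \frac{\left(0.5 + 0.5\,\dfrac{f_{i,j}}{\max_{j} f_{i,j}}\right) itf_j}
       {\sqrt{\displaystyle\sum_{l=1}^{N}
          \left(0.5 + 0.5\,\dfrac{f_{i,l}}{\max_{l} f_{i,l}}\right)^{2} itf_l^{2}}}
```

Each term k_i is then represented by the vector (w_{i,1}, ..., w_{i,N}) over the N documents, which is what "each term is indexed by the documents in which it appears" amounts to.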

6
Similarity Thesaurus Creation

The thesaurus is a matrix of correlation factors between index terms.
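
One way to picture the construction is the sketch below. It is an assumption-laden illustration rather than the slides' own method: it uses itf-weighted, length-normalized term vectors (itf for a document = log of distinct terms in the collection over distinct terms in the document), sets the weight to zero where a term does not occur, and takes the correlation factor of two terms to be the dot product of their vectors. The function name and sample documents are hypothetical.

```python
import math
from collections import Counter


def build_similarity_thesaurus(docs):
    """docs: list of token lists. Returns corr[u][v], the term-term correlation."""
    counts = [Counter(doc) for doc in docs]
    terms = sorted(set().union(*docs))
    t = len(terms)
    # itf for document j: log(t / t_j), with t_j = distinct terms in document j.
    itf = [math.log(t / len(cnt)) if cnt else 0.0 for cnt in counts]

    vectors = {}
    for term in terms:
        freqs = [cnt[term] for cnt in counts]
        max_f = max(freqs)
        # Assumption: weight is zero in documents where the term is absent.
        raw = [
            (0.5 + 0.5 * f / max_f) * itf[j] if f > 0 else 0.0
            for j, f in enumerate(freqs)
        ]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]

    # Correlation factor: dot product of the two term vectors.
    return {
        u: {v: sum(a * b for a, b in zip(vectors[u], vectors[v])) for v in terms}
        for u in terms
    }


# Hypothetical four-document collection.
collection = [["a", "b"], ["a", "b", "c"], ["c", "d"], ["d", "e", "f"]]
thesaurus = build_similarity_thesaurus(collection)
print(thesaurus["a"]["b"], thesaurus["a"]["e"])
```

Terms that occur in exactly the same documents with the same frequencies get a correlation close to 1, and terms that never share a document get 0. The matrix has one entry per term pair, so it is built once over the whole corpus rather than per query.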

7
Relationship between terms and Query

(from Qiu & Frei, "Concept Based Query Expansion," SIGIR-93)

8
Query Expansion w/Similarity Thesaurus

Represent the query in the concept space of the index terms (as a weight vector).
Based on the global similarity thesaurus, compute a similarity sim(q, kv) between the query q and each term kv.
Expand the query with the top r ranked terms, each weighted in proportion to its similarity to the query.
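
The three steps can be sketched as follows. This is a hedged illustration: it assumes sim(q, kv) is the sum over query terms ku of (query weight of ku) x (correlation of ku and kv), and that an added term's weight is that similarity divided by the sum of the original query weights, which is the form usually quoted for this method. Skipping terms already in the query is a simplification made here. The function name and all numbers are hypothetical.

```python
def expand_query(query_weights, corr, r):
    """Add the r terms most similar to the whole query.

    query_weights: {term: weight in the original query}
    corr: {query_term: {candidate_term: correlation factor}}
    """
    total = sum(query_weights.values())

    # Similarity of each candidate term to the query as a whole.
    sims = {}
    for u, w_uq in query_weights.items():
        for v, c_uv in corr.get(u, {}).items():
            sims[v] = sims.get(v, 0.0) + w_uq * c_uv

    # Simplification: only consider terms not already in the query.
    candidates = [v for v in sims if v not in query_weights]
    top = sorted(candidates, key=lambda v: (-sims[v], v))[:r]

    expanded = dict(query_weights)
    for v in top:
        expanded[v] = sims[v] / total  # assumed weighting rule
    return expanded


# Hypothetical correlation factors for a two-term query.
corr = {
    "car": {"car": 1.0, "auto": 0.8, "road": 0.3, "fruit": 0.0},
    "fast": {"fast": 1.0, "auto": 0.2, "road": 0.4, "fruit": 0.1},
}
expanded = expand_query({"car": 1.0, "fast": 0.5}, corr, r=2)
print(expanded)
```

Because the similarity is taken against the whole query, a term weakly related to every query term can outrank one strongly related to a single query term.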

9
Global 2: Statistical Thesaurus

Thesaurus construction relies on high-discrimination, low-frequency terms.
Such terms are hard to cluster because they occur so rarely.
So, build classes by clustering similar documents instead.
Cluster similarity is the minimum cosine (vector model) similarity between any two documents, one from each cluster.

10
Complete Link Algorithm (Crouch & Yang)

1. Place each document in a distinct cluster.
2. Compute the similarity between all pairs of clusters.
3. Determine the pair of clusters Cu, Cv with the highest inter-cluster similarity.
4. Merge the clusters Cu and Cv.
5. Verify a stop criterion. If this criterion is not met, go back to step 2.
6. Return a hierarchy of clusters.
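
The steps above can be sketched as below. This is a minimal version under stated assumptions: document similarities are given up front as a dictionary, the stop criterion is a similarity threshold (the slides do not say which criterion is used), and the hierarchy is returned as the ordered list of merges. The function name and the similarity values are made up for the example.

```python
from itertools import combinations


def complete_link(doc_sims, n_docs, threshold=0.0):
    """Complete-link agglomerative clustering.

    doc_sims: {(i, j): similarity} for document ids i < j.
    Returns the merges in order as (cluster_a, cluster_b, similarity).
    """
    def doc_sim(i, j):
        return doc_sims[(i, j)] if i < j else doc_sims[(j, i)]

    def cluster_sim(a, b):
        # Complete link: the least similar cross-cluster pair decides.
        return min(doc_sim(i, j) for i in a for j in b)

    clusters = [frozenset([i]) for i in range(n_docs)]  # step 1
    merges = []
    while len(clusters) > 1:
        # Steps 2 and 3: score every pair, keep the most similar one.
        best_sim, a, b = max(
            ((cluster_sim(a, b), a, b) for a, b in combinations(clusters, 2)),
            key=lambda item: item[0],
        )
        # Step 5 (assumed criterion, checked before merging here):
        # stop once the best pair is too dissimilar.
        if best_sim < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]  # step 4
        merges.append((a, b, best_sim))
    return merges  # step 6: the merge order encodes the hierarchy


# Hypothetical pairwise similarities for four documents (ids 0..3).
sims = {(0, 1): 0.4, (0, 2): 0.9, (0, 3): 0.0,
        (1, 2): 0.3, (1, 3): 0.0, (2, 3): 0.0}
for a, b, s in complete_link(sims, 4):
    print(sorted(a), sorted(b), s)
```

Recomputing every pair on each pass keeps the sketch short; a practical version would cache cluster similarities between merges.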

11
Hierarchy Example

Doc1: D, D, A, B, C, A, B, C
Doc2: E, C, E, A, A, D
Doc3: D, C, B, B, D, A, B, C, A
Doc4: A
(from MIR notes)
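
To see which documents would merge first, the pairwise cosine similarities of these four documents can be computed. The weighting below (raw term frequency times natural-log idf) is an assumption chosen for the sketch; the original notes may use a different tf-idf variant, so exact values can differ from theirs.

```python
import math
from collections import Counter
from itertools import combinations

docs = {
    "Doc1": "D D A B C A B C".split(),
    "Doc2": "E C E A A D".split(),
    "Doc3": "D C B B D A B C A".split(),
    "Doc4": "A".split(),
}


def tfidf_vectors(docs):
    """Map each document name to {term: tf * idf}, with idf = ln(N / df)."""
    n = len(docs)
    df = Counter(term for tokens in docs.values() for term in set(tokens))
    idf = {term: math.log(n / count) for term, count in df.items()}
    return {
        name: {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for name, tokens in docs.items()
    }


def cosine(u, v):
    """Cosine similarity of two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


vectors = tfidf_vectors(docs)
sims = {
    (a, b): cosine(vectors[a], vectors[b])
    for a, b in combinations(sorted(docs), 2)
}
for pair, value in sorted(sims.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 2))
```

Under this weighting Doc1 and Doc3 are the most similar pair, so they would merge first. Doc4 has zero similarity to every other document, because its only term, A, occurs in all four documents and therefore has an idf of zero.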

12
Query Expansion w/Statistical Thesaurus

Select the terms for each class, using three parameters:
  Threshold on similarity: determines which clusters are used
  NDC: determines the maximum number of documents in a cluster
  MIDF: determines the minimum IDF for any selected term (i.e., how rare it must be)
Compute the thesaurus class weight for the terms
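
A rough sketch of how the three parameters might interact is given below. Several things here are assumptions, not taken from the slides: clusters arrive as (document ids, similarity) pairs, a class is formed from the union of terms in the cluster's documents, and the class weight is the average term weight times 0.5. The weighting formula itself is not in this transcript. All names and input values are hypothetical.

```python
def thesaurus_classes(clusters, doc_terms, idf, threshold, ndc, midf):
    """Pick clusters and keep only their sufficiently rare terms.

    clusters: list of (set of doc ids, cluster similarity)
    doc_terms: {doc id: set of terms}
    idf: {term: inverse document frequency}
    """
    classes = []
    for doc_ids, similarity in clusters:
        # The similarity threshold and NDC decide which clusters qualify.
        if similarity < threshold or len(doc_ids) > ndc:
            continue
        terms = set().union(*(doc_terms[d] for d in doc_ids))
        # MIDF keeps only terms that are rare enough.
        selected = sorted(term for term in terms if idf[term] >= midf)
        if selected:
            classes.append(selected)
    return classes


def class_weight(term_weights):
    """Assumed rule: average weight of the class terms, scaled by 0.5."""
    return sum(term_weights) / len(term_weights) * 0.5


# Hypothetical inputs: two candidate clusters over four documents.
doc_terms = {1: {"A", "B", "C", "D"}, 2: {"A", "C", "D", "E"},
             3: {"A", "B", "C", "D"}, 4: {"A"}}
idf = {"A": 0.0, "B": 0.3, "C": 0.12, "D": 0.12, "E": 0.6}
clusters = [({1, 3}, 0.99), ({1, 2, 3}, 0.29)]

classes = thesaurus_classes(clusters, doc_terms, idf,
                            threshold=0.9, ndc=2, midf=0.2)
print(classes)
```

With these made-up inputs only the tight two-document cluster qualifies, and only the term B clears the MIDF cutoff. Loosening any one of the three parameters changes the resulting classes, which is why the method is sensitive to its settings.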

13
Global Analysis Summary

The thesaurus approach has been effective for improving queries.
However:
  it requires expensive processing (a static corpus is required)
  statistical generation exploits small frequencies better but is sensitive to parameter settings

14
Relevance Feedback/Query Reformulation Summary

Relevance feedback and query expansion approaches have been shown to be effective at improving relevance, sometimes at the expense of precision.
Users resist relevance feedback: it takes time and understanding.
Query reformulation can be costly (expensive computation) for search engines and IR systems.

15
Search Engine Use of Query Feedback

Relevance feedback
  explicit: tried, but mostly abandoned
  indirect: Teoma (ranks documents higher that users look at more often)
Similar/Related pages or searches
  suggest expanded queries or ask to search for related pages (AltaVista and MSN Search used to do this)
  Google: Find Similar
  Teoma
Web log data mining

16
Behavior-Based Ranking

AskJeeves used user behavior to change results ranking:
  For each query Q, record which URLs are followed
  Use click-through counts to order URLs for subsequent submissions of Q
Pseudo-relevance feedback
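
The record-then-reorder loop can be sketched as follows. This is a toy illustration of the idea only, not how AskJeeves implemented it; the class name, method names and URLs are hypothetical.

```python
from collections import Counter, defaultdict


class ClickRanker:
    """Reorders results for a query by how often each URL was clicked before."""

    def __init__(self):
        # clicks[query][url] = number of recorded click-throughs
        self.clicks = defaultdict(Counter)

    def record_click(self, query, url):
        self.clicks[query][url] += 1

    def rank(self, query, urls):
        counts = self.clicks.get(query, {})
        # sorted() is stable, so URLs with equal counts keep their original order.
        return sorted(urls, key=lambda url: -counts.get(url, 0))


ranker = ClickRanker()
for url in ["b.example", "c.example", "b.example"]:
    ranker.record_click("query reformulation", url)

results = ["a.example", "b.example", "c.example"]
print(ranker.rank("query reformulation", results))
print(ranker.rank("unseen query", results))
```

Counts are kept per query string, so a query that has never been seen falls back to the engine's original ordering.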

17
Teoma Indirect Relevance

Combines indirect relevancy judgments with their own link analysis.
Subject-Specific Popularity: ranks a site based on the number of same-subject pages that reference it (Teoma.com page).
Clustering usage:
  Refine: models communities to suggest search classification
  Resources: suggests authoritative sites within a designated community
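
As a toy reading of the subject-specific popularity definition quoted above, one can count only the inbound links that come from pages on the same subject. This is not Teoma's actual algorithm, which is not described in these notes; the function name, the subject labels and the link data are all hypothetical.

```python
from collections import Counter


def subject_specific_popularity(links, subject_of, subject):
    """Count inbound links to each on-subject page from other on-subject pages.

    links: iterable of (source page, target page)
    subject_of: {page: subject label}
    """
    scores = Counter()
    for src, dst in links:
        same_subject = (
            subject_of.get(src) == subject and subject_of.get(dst) == subject
        )
        if src != dst and same_subject:
            scores[dst] += 1
    return scores


# Hypothetical link graph: p1..p3 are about "ir", q1..q2 about "cooking".
subject_of = {"p1": "ir", "p2": "ir", "p3": "ir", "q1": "cooking", "q2": "cooking"}
links = [("p1", "p3"), ("p2", "p3"), ("q1", "p3"), ("q2", "p1"), ("p3", "p1")]

scores = subject_specific_popularity(links, subject_of, "ir")
print(scores.most_common())
```

Links from off-subject pages are ignored, so a page that is popular in general but not within its own community scores low.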

18
Web Log Mining