Title: Information Retrieval
1. Information Retrieval
- CSE 8337
- Spring 2007
- Query Operations
- Material for these slides obtained from:
  - Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/hearst/irbook/
  - Prof. Raymond J. Mooney in CS378 at University of Texas
  - Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, McGraw-Hill, 1983
  - Automatic Text Processing by Gerard Salton, Addison-Wesley, 1989
2. Operations TOC
- Introduction
- Relevance Feedback
- Query Expansion
- Term Reweighting
- Automatic Local Analysis
- Query Expansion using Clustering
- Automatic Global Analysis
- Query Expansion using Thesaurus
- Similarity Thesaurus
- Statistical Thesaurus
- Complete Link Algorithm
3. Query Operations Introduction
- IR queries as stated by the user may not be precise or effective.
- There are many techniques to improve a stated query and then process that query instead.
4. Relevance Feedback
- Use assessments by users as to the relevance of previously returned documents to create new (or modify old) queries.
- Technique:
  - Increase weights of terms from relevant documents.
  - Decrease weights of terms from nonrelevant documents.
- Figure 10.4 in Automatic Text Processing
- Figure 6-10 in Introduction to Modern Information Retrieval
5. Relevance Feedback
- After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
- Use this feedback information to reformulate the query.
- Produce new results based on the reformulated query.
- Allows a more interactive, multi-pass process.
6. Relevance Feedback Architecture
[Figure: relevance feedback architecture — the IR System produces Rankings over the Document corpus; user relevance judgments feed back into query reformulation]
7. Query Reformulation
- Revise query to account for feedback:
  - Query Expansion: add new terms to the query from relevant documents.
  - Term Reweighting: increase the weight of terms in relevant documents and decrease the weight of terms in irrelevant documents.
- Several algorithms for query reformulation.
8. Query Reformulation for VSR
- Change the query vector using vector algebra:
  - Add the vectors for the relevant documents to the query vector.
  - Subtract the vectors for the irrelevant documents from the query vector.
- This both adds positively and negatively weighted terms to the query and reweights the initial terms.
9. Optimal Query
- Assume that the set of relevant documents Cr is known.
- Then the best query, the one that ranks all and only the relevant documents at the top, is:

  q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

  where N is the total number of documents.
10. Standard Rocchio Method
- Since the full set of relevant documents is unknown, just use the known relevant (Dr) and irrelevant (Dn) sets of documents and include the initial query q:

  qm = α·q + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj

- α: tunable weight for the initial query.
- β: tunable weight for relevant documents.
- γ: tunable weight for irrelevant documents.
11. Ide Regular Method
- Since more feedback should perhaps increase the degree of reformulation, do not normalize for the amount of feedback:

  qm = α·q + β Σ_{dj ∈ Dr} dj − γ Σ_{dj ∈ Dn} dj

- α: tunable weight for the initial query.
- β: tunable weight for relevant documents.
- γ: tunable weight for irrelevant documents.
12. Ide Dec Hi Method
- Bias towards rejecting just the highest ranked of the irrelevant documents:

  qm = α·q + β Σ_{dj ∈ Dr} dj − γ·max_rank(Dn)

  where max_rank(Dn) is the highest-ranked irrelevant document.

- α: tunable weight for the initial query.
- β: tunable weight for relevant documents.
- γ: tunable weight for the irrelevant document.
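The three reformulation formulas can be sketched as small vector operations. This is a minimal sketch assuming documents and queries are term-weight vectors (NumPy arrays); the defaults of 1 for the tunable constants follow slide 13, while clipping negative weights to zero is a common convention not stated on the slides:

```python
import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Standard Rocchio: normalize by the number of judged documents."""
    q_new = alpha * q
    if len(rel):
        q_new = q_new + (beta / len(rel)) * np.sum(rel, axis=0)
    if len(nonrel):
        q_new = q_new - (gamma / len(nonrel)) * np.sum(nonrel, axis=0)
    return np.maximum(q_new, 0.0)  # clip negative term weights

def ide_regular(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide regular: no normalization, so more feedback moves the query further."""
    q_new = alpha * q + beta * np.sum(rel, axis=0) - gamma * np.sum(nonrel, axis=0)
    return np.maximum(q_new, 0.0)

def ide_dec_hi(q, rel, nonrel_top, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide dec-hi: subtract only the single highest-ranked nonrelevant document."""
    q_new = alpha * q + beta * np.sum(rel, axis=0) - gamma * nonrel_top
    return np.maximum(q_new, 0.0)
```

With one relevant and one nonrelevant vector, all three coincide; they diverge as the amount of feedback grows.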
13. Comparison of Methods
- Overall, experimental results indicate no clear preference for any one of the specific methods.
- All methods generally improve retrieval performance (recall and precision) with feedback.
- Generally just let the tunable constants equal 1.
14. Fair Evaluation of Relevance Feedback
- Remove from the corpus any documents for which feedback was provided.
- Measure recall/precision performance on the remaining residual collection.
- Compared to the complete corpus, specific recall/precision numbers may decrease since relevant documents were removed.
- However, relative performance on the residual collection provides fair data on the effectiveness of relevance feedback.
- Fig 10.5 in Automatic Text Processing
15. Evaluating Relevance Feedback
- Test-and-control collection:
  - Divide the document collection into two parts.
  - Use the test portion to perform relevance feedback and to modify the query.
  - Perform the test on the control portion using both the original and the modified query.
  - Compare results.
16. Why is Feedback Not Widely Used?
- Users are sometimes reluctant to provide explicit feedback.
- Results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one.
- Makes it harder to understand why a particular document was retrieved.
17. Pseudo Feedback
- Use relevance feedback methods without explicit user input.
- Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
- Allows for query expansion that includes terms that are correlated with the query terms.
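One round of pseudo feedback is easy to sketch on top of a vector model. A minimal sketch, assuming cosine ranking over a documents-by-terms matrix; the values of m, alpha, and beta are illustrative, and only positive (Rocchio-style) evidence is used since no documents are judged nonrelevant:

```python
import numpy as np

def cosine_scores(q, D):
    """Cosine similarity of query vector q against each row (document) of D."""
    norms = np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9
    return (D @ q) / norms

def pseudo_feedback(q, D, m=2, alpha=1.0, beta=0.75):
    """Assume the top-m ranked documents are relevant, add their centroid
    to the query, and return the new query and the re-ranked document ids."""
    top = np.argsort(-cosine_scores(q, D))[:m]
    q_new = alpha * q + beta * D[top].mean(axis=0)
    return q_new, np.argsort(-cosine_scores(q_new, D))

# Toy corpus: 3 documents over 3 terms.
D = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q_new, ranking = pseudo_feedback(np.array([1.0, 0.0, 0.0]), D)
```

Note how term 2 (absent from the original query) picks up weight because it co-occurs with the query term in a top-ranked document.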
18. Pseudo-Feedback Results
- Found to improve performance on the TREC competition ad-hoc retrieval task.
- Works even better if the top documents must also satisfy additional boolean constraints in order to be used in feedback.
19. Term Reweighting for PM (Probabilistic Model)
- Use statistics found in the retrieved documents:
  - Dr: set of relevant and retrieved documents.
  - Dr,i: set of relevant and retrieved documents that contain ki.
- With ni the number of documents containing ki and N the collection size, the usual feedback estimates are P(ki|R) ≈ |Dr,i|/|Dr| and P(ki|¬R) ≈ (ni − |Dr,i|)/(N − |Dr|), which replace the initial guesses in the probabilistic ranking formula.
20. Term Reweighting
- No query expansion.
- Document term weights are not used.
- Query term weights are not used.
- Therefore, not usually as effective as the previous vector approach.
21. Local vs. Global Automatic Analysis
- Local: documents retrieved are examined to automatically determine query expansion. No relevance feedback needed.
- Global: a thesaurus is used to help select terms for expansion.
22. Automatic Local Analysis
- At query time, dynamically determine similar terms based on analysis of the top-ranked retrieved documents.
- Base correlation analysis on only the local set of retrieved documents for a specific query.
- Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
  - e.g. "Apple computer" → "Apple computer Powerbook laptop"
23. Automatic Local Analysis
- Expand the query with terms found in local clusters.
- Dl: set of documents retrieved for query q.
- Vl: set of words used in Dl.
- Sl: set of distinct stems in Vl.
- fsi,j: frequency of stem si in document dj found in Dl.
- Construct a stem-stem association matrix.
24Association Matrix
cij Correlation factor between stems si and stem
sj
fik Frequency of term i in document k
25. Normalized Association Matrix
- The frequency-based correlation factor favors more frequent terms.
- Normalize association scores:

  sij = cij / (cii + cjj − cij)

- The normalized score is 1 if two stems have the same frequency in all documents.
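Both matrices reduce to a couple of matrix operations. A minimal sketch over an assumed toy frequency matrix F (stems × local documents):

```python
import numpy as np

# F[i, k]: frequency of stem s_i in local document d_k (toy numbers)
F = np.array([[2.0, 1.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 2.0]])

# Association matrix: c_ij = sum_k f_ik * f_jk
C = F @ F.T

# Normalized association: s_ij = c_ij / (c_ii + c_jj - c_ij)
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)
# By construction the diagonal of S is 1 (a stem is perfectly
# correlated with itself), and S is symmetric.
```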
26. Metric Correlation Matrix
- Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents.
- Metric correlations account for term proximity:

  cij = Σ_{ku ∈ Vi} Σ_{kv ∈ Vj} 1/r(ku, kv)

- Vi: set of all occurrences of term i in any document.
- r(ku, kv): distance in words between word occurrences ku and kv (∞ if ku and kv are occurrences in different documents).
27. Normalized Metric Correlation Matrix
- Normalize scores to account for term frequencies:

  sij = cij / (|Vi| × |Vj|)
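Metric correlation can be sketched directly from occurrence lists. A minimal sketch where each occurrence is represented as an assumed (doc_id, word_position) pair; cross-document pairs contribute nothing, since 1/∞ = 0:

```python
from itertools import product

def metric_correlation(Vi, Vj):
    """c_ij = sum over occurrence pairs of 1 / r(ku, kv), where r is the
    word distance; pairs in different documents contribute 0 (r = infinity)."""
    return sum(1.0 / abs(pi - pj)
               for (di, pi), (dj, pj) in product(Vi, Vj)
               if di == dj)

def normalized_metric(Vi, Vj):
    """s_ij = c_ij / (|Vi| * |Vj|)."""
    return metric_correlation(Vi, Vj) / (len(Vi) * len(Vj))
```

For example, if term i occurs at position 1 of document 0 and position 4 of document 1, and term j occurs only at position 3 of document 0, the single same-document pair at distance 2 gives c_ij = 0.5.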
28. Query Expansion with Correlation Matrix
- For each term i in the query, expand the query with the n terms j having the highest value of cij (or sij).
- This adds semantically related terms in the neighborhood of the query terms.
29. Problems with Local Analysis
- Term ambiguity may introduce irrelevant statistically correlated terms.
  - e.g. "Apple computer" → "Apple red fruit computer"
- Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
30. Automatic Global Analysis
- Determine term similarity through a pre-computed statistical analysis of the complete corpus.
- Compute association matrices which quantify term correlations in terms of how frequently terms co-occur.
- Expand queries with the statistically most similar terms.
31. Automatic Global Analysis
- There are two modern variants based on a thesaurus-like structure built using all documents in the collection:
  - Query Expansion based on a Similarity Thesaurus
  - Query Expansion based on a Statistical Thesaurus
32. Thesaurus
- A thesaurus provides information on synonyms and semantically related words and phrases.
- Example:
  - physician
    - syn: croaker, doc, doctor, MD, medical, mediciner, medico, sawbones
    - rel: medic, general practitioner, surgeon
33. Thesaurus-based Query Expansion
- For each term t in a query, expand the query with synonyms and related words of t from the thesaurus.
- May weight added terms less than the original query terms.
- Generally increases recall.
- May significantly decrease precision, particularly with ambiguous terms.
  - e.g. "interest rate" → "interest rate fascinate evaluate"
34. Similarity Thesaurus
- The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
- These relationships are not derived directly from co-occurrence of terms inside documents.
- They are obtained by considering that the terms are concepts in a concept space.
- In this concept space, each term is indexed by the documents in which it appears.
- Terms assume the original role of documents, while documents are interpreted as indexing elements.
35. Similarity Thesaurus
- The following definitions establish the proper framework:
  - t: number of terms in the collection
  - N: number of documents in the collection
  - fi,j: frequency of occurrence of the term ki in the document dj
  - tj: vocabulary of document dj
  - itfj: inverse term frequency for document dj
36. Similarity Thesaurus
- Inverse term frequency for document dj:

  itfj = log(t / |tj|)

- To ki is associated a vector in the term-concept space:

  ki = (wi,1, wi,2, ..., wi,N)
37Similarity Thesaurus
- where wi,j is a weight associated to
index-document pairki,dj. These weights are
computed as follows
38. Similarity Thesaurus
- The relationship between two terms ku and kv is computed as a correlation factor cu,v given by:

  cu,v = ku · kv = Σ_{dj} wu,j × wv,j

- The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of index terms [ku, kv] in the collection.
39. Similarity Thesaurus
- This computation is expensive.
- However, the global similarity thesaurus has to be computed only once and can be updated incrementally.
40. Query Expansion based on a Similarity Thesaurus
- Query expansion is done in three steps:
  1. Represent the query in the concept space used for representation of the index terms.
  2. Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q.
  3. Expand the query with the top r ranked terms according to sim(q, kv).
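The three steps can be sketched end-to-end with the term-concept matrix. A minimal sketch: the entries of K and w_q below are made-up illustrative numbers (not the slide-37 weighting scheme), and only r = 1 new terms are added:

```python
import numpy as np

# K[i, j]: weight w_{i,j} of term k_i in document d_j (toy numbers standing
# in for the similarity-thesaurus weighting scheme)
K = np.array([[0.8, 0.1, 0.6],
              [0.0, 0.9, 0.2],
              [0.7, 0.0, 0.5],
              [0.1, 0.8, 0.0]])

# Global similarity thesaurus: c_uv = k_u . k_v
C = K @ K.T

# Step 1: query weights w_{i,q} over the 4 terms (nonzero = term in query).
w_q = np.array([1.0, 0.0, 0.0, 0.5])

# Step 2: sim(q, k_v) = sum_u w_{u,q} * c_{u,v}
sim = w_q @ C

# Step 3: add the top-r ranked terms not already in the query, each
# weighted by w_{v,q'} = sim(q, k_v) / sum_u w_{u,q}.
r = 1
new_terms = [v for v in np.argsort(-sim) if w_q[v] == 0][:r]
w_expanded = w_q.copy()
w_expanded[new_terms] = sim[new_terms] / w_q.sum()
```

Here term 2 is added because it shares documents with term 0, the heaviest query term.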
41. Query Expansion - Step One
- To the query q is associated a vector q in the term-concept space, given by:

  q = Σ_{ki ∈ q} wi,q · ki

  where wi,q is a weight associated with the index-query pair [ki, q].
42. Query Expansion - Step Two
- Compute a similarity sim(q, kv) between each term kv and the user query q:

  sim(q, kv) = q · kv = Σ_{ku ∈ q} wu,q × cu,v

  where cu,v is the correlation factor.
43. Query Expansion - Step Three
- Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q'.
- To each expansion term kv in the query q' is assigned a weight wv,q' given by:

  wv,q' = sim(q, kv) / Σ_{ku ∈ q} wu,q

- The expanded query q' is then used to retrieve new documents for the user.
44. Query Expansion Sample
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
- c(A,A) = 10.991
- c(A,C) = 10.781
- c(A,D) = 10.781
- ...
- c(D,E) = 10.398
- c(B,E) = 10.396
- c(E,E) = 10.224
45. Query Expansion Sample
- Query: q = A E E
- sim(q,A) = 24.298
- sim(q,C) = 23.833
- sim(q,D) = 23.833
- sim(q,B) = 23.830
- sim(q,E) = 23.435
- New query: q' = A C D E E
- w(A,q') = 6.88
- w(C,q') = 6.75
- w(D,q') = 6.75
- w(E,q') = 6.64
46. WordNet
- A more detailed database of semantic relationships between English words.
- Developed by famous cognitive psychologist George Miller and a team at Princeton University.
- About 144,000 English words.
- Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.
47. WordNet Synset Relationships
- Antonym: front → back
- Attribute: benevolence → good (noun to adjective)
- Pertainym: alphabetical → alphabet (adjective to noun)
- Similar: unquestioning → absolute
- Cause: kill → die
- Entailment: breathe → inhale
- Holonym: chapter → text (part-of)
- Meronym: computer → cpu (whole-of)
- Hyponym: tree → plant (specialization)
- Hypernym: fruit → apple (generalization)
48. WordNet Query Expansion
- Add synonyms in the same synset.
- Add hyponyms to add specialized terms.
- Add hypernyms to generalize a query.
- Add other related terms to expand the query.
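A minimal sketch of synset-based expansion. The tiny SYNSETS, HYPERNYMS, and HYPONYMS dictionaries below are made-up stand-ins for real WordNet lookups (a real system would query the WordNet database, e.g. through NLTK):

```python
# Made-up miniature "synsets": each entry stands in for a WordNet lookup.
SYNSETS = {
    "physician": {"doctor", "doc", "md", "medico"},
}
HYPERNYMS = {
    "tree": {"plant"},   # generalization
}
HYPONYMS = {
    "plant": {"tree"},   # specialization
}

def expand_query(terms, generalize=False, specialize=False):
    """Expand a set of query terms with synonyms, and optionally with
    hypernyms (to generalize) or hyponyms (to specialize)."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNSETS.get(t, set())
        if generalize:
            expanded |= HYPERNYMS.get(t, set())
        if specialize:
            expanded |= HYPONYMS.get(t, set())
    return expanded
```

As the slides warn (slide 33), indiscriminate expansion of this kind trades precision for recall, so added terms are usually down-weighted.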
49. Statistical Thesaurus
- Existing human-developed thesauri are not easily available in all languages.
- Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
- Semantically related terms can be discovered from statistical analysis of corpora.
50. Query Expansion Based on a Statistical Thesaurus
- The global thesaurus is composed of classes which group correlated terms in the context of the whole collection.
- Such correlated terms can then be used to expand the original user query.
- These terms must be low-frequency terms.
- However, it is difficult to cluster low-frequency terms.
- To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes.
- This algorithm must produce small and tight clusters.
51. Query Expansion based on a Statistical Thesaurus
- Use the thesaurus classes for query expansion.
- Compute an average term weight wtc for each thesaurus class C:

  wtc = (Σ_{ki ∈ C} wi,C) / |C|
52. Query Expansion based on a Statistical Thesaurus
- wtc can be used to compute a thesaurus class weight wc as:

  wc = wtc / (0.5 × |C|)
53Query Expansion Sample
Doc1 D, D, A, B, C, A, B, C Doc2 E, C, E, A,
A, D Doc3 D, C, B, B, D, A, B, C, A Doc4 A
q A E E
sim(1,3) 0.99 sim(1,2) 0.40 sim(1,2)
0.40 sim(2,3) 0.29 sim(4,1) 0.00 sim(4,2)
0.00 sim(4,3) 0.00
- TC 0.90 NDC 2.00 MIDF 0.2
idf A 0.0 idf B 0.3 idf C 0.12 idf D
0.12 idf E 0.60
q'A B E E
54. Query Expansion based on a Statistical Thesaurus
- Problems with this approach:
  - Initialization of the parameters TC, NDC, and MIDF.
  - TC depends on the collection.
  - Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC.
  - A high value of TC might yield classes with too few terms.
55. Complete Link Algorithm
- This is a document clustering algorithm which produces small and tight clusters:
  1. Place each document in a distinct cluster.
  2. Compute the similarity between all pairs of clusters.
  3. Determine the pair of clusters Cu, Cv with the highest inter-cluster similarity.
  4. Merge the clusters Cu and Cv.
  5. Verify a stop criterion. If this criterion is not met, go back to step 2.
  6. Return a hierarchy of clusters.
- Similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
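The steps above can be sketched directly over a pairwise document-similarity table. A minimal sketch: the stop criterion used here (no remaining pair exceeds a threshold, in the spirit of the TC parameter) is one reasonable choice, and a flat partition is returned rather than the full hierarchy:

```python
def complete_link(docs_sim, threshold):
    """Agglomerative clustering where cluster-cluster similarity is the
    MINIMUM pairwise document similarity (complete link).
    docs_sim maps sorted (i, j) document-id pairs to similarities."""
    def sim(a, b):
        return min(docs_sim[tuple(sorted((x, y)))] for x in a for y in b)

    n_docs = max(max(pair) for pair in docs_sim) + 1
    clusters = [{i} for i in range(n_docs)]           # step 1
    while len(clusters) > 1:
        pairs = [(sim(a, b), ai, bi)                  # step 2
                 for ai, a in enumerate(clusters)
                 for bi, b in enumerate(clusters) if ai < bi]
        best, ai, bi = max(pairs)                     # step 3
        if best < threshold:                          # step 5: stop criterion
            break
        clusters[ai] |= clusters[bi]                  # step 4: merge
        del clusters[bi]
    return clusters

# Slide-53 similarities (documents renumbered 0-3):
sims = {(0, 1): 0.40, (0, 2): 0.99, (1, 2): 0.29,
        (0, 3): 0.00, (1, 3): 0.00, (2, 3): 0.00}
clusters = complete_link(sims, threshold=0.90)   # -> [{0, 2}, {1}, {3}]
```

Because the merge criterion uses the minimum pairwise similarity, every document in a merged cluster is close to every other one, which is exactly the "small and tight" property the thesaurus construction needs.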
56. Selecting the Terms that Compose Each Class
- Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows.
- Obtain from the user three parameters:
  - TC: threshold class
  - NDC: number of documents in a class
  - MIDF: minimum inverse document frequency
57. Selecting the Terms that Compose Each Class
- Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes.
- This threshold has to be surpassed by sim(Cu, Cv) if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class.
58. Selecting the Terms that Compose Each Class
- Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered.
- A low value of NDC might restrict the selection to the smaller cluster Cuv.
59. Selecting the Terms that Compose Each Class
- Consider the set of documents in each document cluster pre-selected above.
- Only the lower-frequency terms in these documents are used as sources of terms for the thesaurus classes.
- The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class.
60. Global vs. Local Analysis
- Global analysis requires intensive term correlation computation only once, at system development time.
- Local analysis requires intensive term correlation computation for every query at run time (although the number of terms and documents is less than in global analysis).
- But local analysis gives better results.
61. Query Expansion Conclusions
- Expansion of queries with related terms can improve performance, particularly recall.
- However, similar terms must be selected very carefully to avoid problems, such as loss of precision.
62. Conclusion
- A thesaurus is an efficient method to expand queries.
- The computation is expensive, but it is executed only once.
- Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query.
- Query expansion based on a statistical thesaurus needs well-defined parameters.