Title: Information Retrieval
1. Information Retrieval
- CSE 8337
- Spring 2007
- Query Operations
- Material for these slides obtained from:
  - Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, http://www.sims.berkeley.edu/hearst/irbook/
  - Prof. Raymond J. Mooney in CS378 at University of Texas
  - Introduction to Modern Information Retrieval by Gerald Salton and Michael J. McGill, McGraw-Hill, 1983
  - Automatic Text Processing by Gerard Salton, Addison-Wesley, 1989
2. Operations TOC
- Introduction
- Relevance Feedback
- Query Expansion
- Term Reweighting
- Automatic Local Analysis
- Query Expansion using Clustering
- Automatic Global Analysis
- Query Expansion using Thesaurus
- Similarity Thesaurus
- Statistical Thesaurus
- Complete Link Algorithm
3. Query Operations Introduction
- IR queries as stated by the user may not be precise or effective.
- There are many techniques to improve a stated query and then process that query instead.
4. Relevance Feedback
- Use assessments by users as to the relevance of previously returned documents to create new (or modify old) queries.
- Technique:
  - Increase weights of terms from relevant documents.
  - Decrease weights of terms from nonrelevant documents.
- Figure 10.4 in Automatic Text Processing
- Figure 6-10 in Introduction to Modern Information Retrieval
5. Relevance Feedback
- After initial retrieval results are presented, allow the user to provide feedback on the relevance of one or more of the retrieved documents.
- Use this feedback information to reformulate the query.
- Produce new results based on the reformulated query.
- Allows a more interactive, multi-pass process.
6. Relevance Feedback Architecture
[Figure: relevance feedback architecture — the IR System produces Rankings over the Document corpus; user relevance judgments feed back into query reformulation]
7. Query Reformulation
- Revise query to account for feedback:
  - Query Expansion: add new terms to the query from relevant documents.
  - Term Reweighting: increase the weight of terms in relevant documents and decrease the weight of terms in irrelevant documents.
- Several algorithms for query reformulation.
8. Query Reformulation for VSR
- Change the query vector using vector algebra:
  - Add the vectors for the relevant documents to the query vector.
  - Subtract the vectors for the irrelevant documents from the query vector.
- This both adds positively and negatively weighted terms to the query and reweights the initial terms.
9. Optimal Query
- Assume that the set of relevant documents Cr is known.
- Then the best query, the one that ranks all and only the relevant documents at the top, is:

  q_opt = (1/|Cr|) Σ_{dj ∈ Cr} dj − (1/(N − |Cr|)) Σ_{dj ∉ Cr} dj

  where N is the total number of documents.
10. Standard Rocchio Method
- Since the full set of relevant documents is unknown, just use the known relevant (Dr) and irrelevant (Dn) sets of documents and include the initial query q:

  qm = α·q + (β/|Dr|) Σ_{dj ∈ Dr} dj − (γ/|Dn|) Σ_{dj ∈ Dn} dj

- α: tunable weight for the initial query.
- β: tunable weight for relevant documents.
- γ: tunable weight for irrelevant documents.
11. Ide Regular Method
- Since more feedback should perhaps increase the degree of reformulation, do not normalize for the amount of feedback:

  qm = α·q + β Σ_{dj ∈ Dr} dj − γ Σ_{dj ∈ Dn} dj

- α: tunable weight for the initial query.
- β: tunable weight for relevant documents.
- γ: tunable weight for irrelevant documents.
12. Ide Dec Hi Method
- Bias towards rejecting just the highest ranked of the irrelevant documents:

  qm = α·q + β Σ_{dj ∈ Dr} dj − γ·max_rank(Dn)

  where max_rank(Dn) is the highest-ranked irrelevant document.

- α: tunable weight for the initial query.
- β: tunable weight for relevant documents.
- γ: tunable weight for the irrelevant document.
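The three reformulation formulas can be sketched as small vector operations. This is a minimal sketch assuming documents and queries are term-weight vectors (NumPy arrays); the defaults of 1 for the tunable constants follow slide 13, while clipping negative weights to zero is a common convention not stated on the slides:

```python
import numpy as np

def rocchio(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Standard Rocchio: normalize by the number of judged documents."""
    q_new = alpha * q
    if len(rel):
        q_new = q_new + (beta / len(rel)) * np.sum(rel, axis=0)
    if len(nonrel):
        q_new = q_new - (gamma / len(nonrel)) * np.sum(nonrel, axis=0)
    return np.maximum(q_new, 0.0)  # clip negative term weights

def ide_regular(q, rel, nonrel, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide regular: no normalization, so more feedback moves the query further."""
    q_new = alpha * q + beta * np.sum(rel, axis=0) - gamma * np.sum(nonrel, axis=0)
    return np.maximum(q_new, 0.0)

def ide_dec_hi(q, rel, nonrel_top, alpha=1.0, beta=1.0, gamma=1.0):
    """Ide dec-hi: subtract only the single highest-ranked nonrelevant document."""
    q_new = alpha * q + beta * np.sum(rel, axis=0) - gamma * nonrel_top
    return np.maximum(q_new, 0.0)
```

With one relevant and one nonrelevant vector, all three coincide; they diverge as the amount of feedback grows.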
13. Comparison of Methods
- Overall, experimental results indicate no clear preference for any one of the specific methods.
- All methods generally improve retrieval performance (recall and precision) with feedback.
- Generally just let the tunable constants equal 1.
14. Fair Evaluation of Relevance Feedback
- Remove from the corpus any documents for which feedback was provided.
- Measure recall/precision performance on the remaining residual collection.
- Compared to the complete corpus, specific recall/precision numbers may decrease since relevant documents were removed.
- However, relative performance on the residual collection provides fair data on the effectiveness of relevance feedback.
- Fig 10.5 in Automatic Text Processing
15. Evaluating Relevance Feedback
- Test-and-control collection:
  - Divide the document collection into two parts.
  - Use the test portion to perform relevance feedback and to modify the query.
  - Perform the test on the control portion using both the original and the modified query.
  - Compare results.
16. Why is Feedback Not Widely Used?
- Users are sometimes reluctant to provide explicit feedback.
- Results in long queries that require more computation to retrieve, and search engines process lots of queries and allow little time for each one.
- Makes it harder to understand why a particular document was retrieved.
17. Pseudo Feedback
- Use relevance feedback methods without explicit user input.
- Just assume the top m retrieved documents are relevant, and use them to reformulate the query.
- Allows for query expansion that includes terms that are correlated with the query terms.
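One round of pseudo feedback is easy to sketch on top of a vector model. A minimal sketch, assuming cosine ranking over a documents-by-terms matrix; the values of m, alpha, and beta are illustrative, and only positive (Rocchio-style) evidence is used since no documents are judged nonrelevant:

```python
import numpy as np

def cosine_scores(q, D):
    """Cosine similarity of query vector q against each row (document) of D."""
    norms = np.linalg.norm(D, axis=1) * np.linalg.norm(q) + 1e-9
    return (D @ q) / norms

def pseudo_feedback(q, D, m=2, alpha=1.0, beta=0.75):
    """Assume the top-m ranked documents are relevant, add their centroid
    to the query, and return the new query and the re-ranked document ids."""
    top = np.argsort(-cosine_scores(q, D))[:m]
    q_new = alpha * q + beta * D[top].mean(axis=0)
    return q_new, np.argsort(-cosine_scores(q_new, D))

# Toy corpus: 3 documents over 3 terms.
D = np.array([[1.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])
q_new, ranking = pseudo_feedback(np.array([1.0, 0.0, 0.0]), D)
```

Note how term 2 (absent from the original query) picks up weight because it co-occurs with the query term in a top-ranked document.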
18. Pseudo-Feedback Results
- Found to improve performance on the TREC competition ad-hoc retrieval task.
- Works even better if the top documents must also satisfy additional boolean constraints in order to be used in feedback.
19. Term Reweighting for PM (Probabilistic Model)
- Use statistics found in the retrieved documents:
  - Dr: set of relevant and retrieved documents.
  - Dr,i: set of relevant and retrieved documents that contain ki.
- With ni the number of documents containing ki and N the collection size, the usual feedback estimates are P(ki|R) ≈ |Dr,i|/|Dr| and P(ki|¬R) ≈ (ni − |Dr,i|)/(N − |Dr|), which replace the initial guesses in the probabilistic ranking formula.
20. Term Reweighting
- No query expansion.
- Document term weights are not used.
- Query term weights are not used.
- Therefore, not usually as effective as the previous vector approach.
21. Local vs. Global Automatic Analysis
- Local: documents retrieved are examined to automatically determine query expansion. No relevance feedback needed.
- Global: a thesaurus is used to help select terms for expansion.
22. Automatic Local Analysis
- At query time, dynamically determine similar terms based on analysis of the top-ranked retrieved documents.
- Base correlation analysis on only the local set of retrieved documents for a specific query.
- Avoids ambiguity by determining similar (correlated) terms only within relevant documents.
  - e.g. "Apple computer" → "Apple computer Powerbook laptop"
23. Automatic Local Analysis
- Expand the query with terms found in local clusters.
- Dl: set of documents retrieved for query q.
- Vl: set of words used in Dl.
- Sl: set of distinct stems in Vl.
- fsi,j: frequency of stem si in document dj found in Dl.
- Construct a stem-stem association matrix.
24Association Matrix
cij Correlation factor between stems si and stem
sj
fik Frequency of term i in document k
25. Normalized Association Matrix
- The frequency-based correlation factor favors more frequent terms.
- Normalize association scores:

  sij = cij / (cii + cjj − cij)

- The normalized score is 1 if two stems have the same frequency in all documents.
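Both matrices reduce to a couple of matrix operations. A minimal sketch over an assumed toy frequency matrix F (stems × local documents):

```python
import numpy as np

# F[i, k]: frequency of stem s_i in local document d_k (toy numbers)
F = np.array([[2.0, 1.0, 0.0],
              [1.0, 0.0, 3.0],
              [0.0, 2.0, 2.0]])

# Association matrix: c_ij = sum_k f_ik * f_jk
C = F @ F.T

# Normalized association: s_ij = c_ij / (c_ii + c_jj - c_ij)
d = np.diag(C)
S = C / (d[:, None] + d[None, :] - C)
# By construction the diagonal of S is 1 (a stem is perfectly
# correlated with itself), and S is symmetric.
```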
26. Metric Correlation Matrix
- Association correlation does not account for the proximity of terms in documents, just co-occurrence frequencies within documents.
- Metric correlations account for term proximity:

  cij = Σ_{ku ∈ Vi} Σ_{kv ∈ Vj} 1/r(ku, kv)

- Vi: set of all occurrences of term i in any document.
- r(ku, kv): distance in words between word occurrences ku and kv (∞ if ku and kv are occurrences in different documents).
27. Normalized Metric Correlation Matrix
- Normalize scores to account for term frequencies:

  sij = cij / (|Vi| × |Vj|)
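Metric correlation can be sketched directly from occurrence lists. A minimal sketch where each occurrence is represented as an assumed (doc_id, word_position) pair; cross-document pairs contribute nothing, since 1/∞ = 0:

```python
from itertools import product

def metric_correlation(Vi, Vj):
    """c_ij = sum over occurrence pairs of 1 / r(ku, kv), where r is the
    word distance; pairs in different documents contribute 0 (r = infinity)."""
    return sum(1.0 / abs(pi - pj)
               for (di, pi), (dj, pj) in product(Vi, Vj)
               if di == dj)

def normalized_metric(Vi, Vj):
    """s_ij = c_ij / (|Vi| * |Vj|)."""
    return metric_correlation(Vi, Vj) / (len(Vi) * len(Vj))
```

For example, if term i occurs at position 1 of document 0 and position 4 of document 1, and term j occurs only at position 3 of document 0, the single same-document pair at distance 2 gives c_ij = 0.5.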
28. Query Expansion with Correlation Matrix
- For each term i in the query, expand the query with the n terms j having the highest value of cij (or sij).
- This adds semantically related terms in the neighborhood of the query terms.
29. Problems with Local Analysis
- Term ambiguity may introduce irrelevant statistically correlated terms.
  - e.g. "Apple computer" → "Apple red fruit computer"
- Since terms are highly correlated anyway, expansion may not retrieve many additional documents.
30. Automatic Global Analysis
- Determine term similarity through a pre-computed statistical analysis of the complete corpus.
- Compute association matrices which quantify term correlations in terms of how frequently terms co-occur.
- Expand queries with the statistically most similar terms.
31. Automatic Global Analysis
- There are two modern variants based on a thesaurus-like structure built using all documents in the collection:
  - Query Expansion based on a Similarity Thesaurus
  - Query Expansion based on a Statistical Thesaurus
32. Thesaurus
- A thesaurus provides information on synonyms and semantically related words and phrases.
- Example:
  - physician
    - syn: croaker, doc, doctor, MD, medical, mediciner, medico, sawbones
    - rel: medic, general practitioner, surgeon
33. Thesaurus-based Query Expansion
- For each term t in a query, expand the query with synonyms and related words of t from the thesaurus.
- May weight added terms less than the original query terms.
- Generally increases recall.
- May significantly decrease precision, particularly with ambiguous terms.
  - e.g. "interest rate" → "interest rate fascinate evaluate"
34. Similarity Thesaurus
- The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
- These relationships are not derived directly from co-occurrence of terms inside documents.
- They are obtained by considering that the terms are concepts in a concept space.
- In this concept space, each term is indexed by the documents in which it appears.
- Terms assume the original role of documents, while documents are interpreted as indexing elements.
35. Similarity Thesaurus
- The following definitions establish the proper framework:
  - t: number of terms in the collection
  - N: number of documents in the collection
  - fi,j: frequency of occurrence of the term ki in the document dj
  - tj: vocabulary of document dj
  - itfj: inverse term frequency for document dj
36. Similarity Thesaurus
- Inverse term frequency for document dj:

  itfj = log(t / |tj|)

- To ki is associated a vector in the term-concept space:

  ki = (wi,1, wi,2, ..., wi,N)
37Similarity Thesaurus
- where wi,j is a weight associated to
index-document pairki,dj. These weights are
computed as follows
38. Similarity Thesaurus
- The relationship between two terms ku and kv is computed as a correlation factor cu,v given by:

  cu,v = ku · kv = Σ_{dj} wu,j × wv,j

- The global similarity thesaurus is built through the computation of the correlation factor cu,v for each pair of index terms [ku, kv] in the collection.
39. Similarity Thesaurus
- This computation is expensive.
- However, the global similarity thesaurus has to be computed only once and can be updated incrementally.
40. Query Expansion based on a Similarity Thesaurus
- Query expansion is done in three steps:
  1. Represent the query in the concept space used for representation of the index terms.
  2. Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q.
  3. Expand the query with the top r ranked terms according to sim(q, kv).
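The three steps can be sketched end-to-end with the term-concept matrix. A minimal sketch: the entries of K and w_q below are made-up illustrative numbers (not the slide-37 weighting scheme), and only r = 1 new terms are added:

```python
import numpy as np

# K[i, j]: weight w_{i,j} of term k_i in document d_j (toy numbers standing
# in for the similarity-thesaurus weighting scheme)
K = np.array([[0.8, 0.1, 0.6],
              [0.0, 0.9, 0.2],
              [0.7, 0.0, 0.5],
              [0.1, 0.8, 0.0]])

# Global similarity thesaurus: c_uv = k_u . k_v
C = K @ K.T

# Step 1: query weights w_{i,q} over the 4 terms (nonzero = term in query).
w_q = np.array([1.0, 0.0, 0.0, 0.5])

# Step 2: sim(q, k_v) = sum_u w_{u,q} * c_{u,v}
sim = w_q @ C

# Step 3: add the top-r ranked terms not already in the query, each
# weighted by w_{v,q'} = sim(q, k_v) / sum_u w_{u,q}.
r = 1
new_terms = [v for v in np.argsort(-sim) if w_q[v] == 0][:r]
w_expanded = w_q.copy()
w_expanded[new_terms] = sim[new_terms] / w_q.sum()
```

Here term 2 is added because it shares documents with term 0, the heaviest query term.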
41. Query Expansion - Step One
- To the query q is associated a vector q in the term-concept space, given by:

  q = Σ_{ki ∈ q} wi,q · ki

  where wi,q is a weight associated with the index-query pair [ki, q].
42. Query Expansion - Step Two
- Compute a similarity sim(q, kv) between each term kv and the user query q:

  sim(q, kv) = q · kv = Σ_{ku ∈ q} wu,q × cu,v

  where cu,v is the correlation factor.
43. Query Expansion - Step Three
- Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q'.
- To each expansion term kv in the query q' is assigned a weight wv,q' given by:

  wv,q' = sim(q, kv) / Σ_{ku ∈ q} wu,q

- The expanded query q' is then used to retrieve new documents for the user.
44. Query Expansion Sample
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
- c(A,A) = 10.991
- c(A,C) = 10.781
- c(A,D) = 10.781
- ...
- c(D,E) = 10.398
- c(B,E) = 10.396
- c(E,E) = 10.224
45. Query Expansion Sample
- Query: q = A E E
- sim(q,A) = 24.298
- sim(q,C) = 23.833
- sim(q,D) = 23.833
- sim(q,B) = 23.830
- sim(q,E) = 23.435
- New query: q' = A C D E E
- w(A,q') = 6.88
- w(C,q') = 6.75
- w(D,q') = 6.75
- w(E,q') = 6.64
46. WordNet
- A more detailed database of semantic relationships between English words.
- Developed by famous cognitive psychologist George Miller and a team at Princeton University.
- About 144,000 English words.
- Nouns, adjectives, verbs, and adverbs grouped into about 109,000 synonym sets called synsets.
47. WordNet Synset Relationships
- Antonym: front → back
- Attribute: benevolence → good (noun to adjective)
- Pertainym: alphabetical → alphabet (adjective to noun)
- Similar: unquestioning → absolute
- Cause: kill → die
- Entailment: breathe → inhale
- Holonym: chapter → text (part-of)
- Meronym: computer → cpu (whole-of)
- Hyponym: tree → plant (specialization)
- Hypernym: fruit → apple (generalization)
48. WordNet Query Expansion
- Add synonyms in the same synset.
- Add hyponyms to add specialized terms.
- Add hypernyms to generalize a query.
- Add other related terms to expand the query.
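A minimal sketch of synset-based expansion. The tiny SYNSETS, HYPERNYMS, and HYPONYMS dictionaries below are made-up stand-ins for real WordNet lookups (a real system would query the WordNet database, e.g. through NLTK):

```python
# Made-up miniature "synsets": each entry stands in for a WordNet lookup.
SYNSETS = {
    "physician": {"doctor", "doc", "md", "medico"},
}
HYPERNYMS = {
    "tree": {"plant"},   # generalization
}
HYPONYMS = {
    "plant": {"tree"},   # specialization
}

def expand_query(terms, generalize=False, specialize=False):
    """Expand a set of query terms with synonyms, and optionally with
    hypernyms (to generalize) or hyponyms (to specialize)."""
    expanded = set(terms)
    for t in terms:
        expanded |= SYNSETS.get(t, set())
        if generalize:
            expanded |= HYPERNYMS.get(t, set())
        if specialize:
            expanded |= HYPONYMS.get(t, set())
    return expanded
```

As the slides warn (slide 33), indiscriminate expansion of this kind trades precision for recall, so added terms are usually down-weighted.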
49. Statistical Thesaurus
- Existing human-developed thesauri are not easily available in all languages.
- Human thesauri are limited in the type and range of synonymy and semantic relations they represent.
- Semantically related terms can be discovered from statistical analysis of corpora.
50. Query Expansion Based on a Statistical Thesaurus
- The global thesaurus is composed of classes which group correlated terms in the context of the whole collection.
- Such correlated terms can then be used to expand the original user query.
- These terms must be low-frequency terms.
- However, it is difficult to cluster low-frequency terms.
- To circumvent this problem, we cluster documents into classes instead and use the low-frequency terms in these documents to define our thesaurus classes.
- This algorithm must produce small and tight clusters.
51. Query Expansion based on a Statistical Thesaurus
- Use the thesaurus classes for query expansion.
- Compute an average term weight wtc for each thesaurus class C:

  wtc = (Σ_{ki ∈ C} wi,C) / |C|
52. Query Expansion based on a Statistical Thesaurus
- wtc can be used to compute a thesaurus class weight wc as:

  wc = wtc / (0.5 × |C|)
53Query Expansion Sample
Doc1 D, D, A, B, C, A, B, C Doc2 E, C, E, A,
A, D Doc3 D, C, B, B, D, A, B, C, A Doc4 A
q A E E
sim(1,3) 0.99 sim(1,2) 0.40 sim(1,2)
0.40 sim(2,3) 0.29 sim(4,1) 0.00 sim(4,2)
0.00 sim(4,3) 0.00
- TC 0.90 NDC 2.00 MIDF 0.2
idf A 0.0 idf B 0.3 idf C 0.12 idf D
0.12 idf E 0.60
q'A B E E
54. Query Expansion based on a Statistical Thesaurus
- Problems with this approach:
  - Initialization of the parameters TC, NDC, and MIDF.
  - TC depends on the collection.
  - Inspection of the cluster hierarchy is almost always necessary for assisting with the setting of TC.
  - A high value of TC might yield classes with too few terms.
55. Complete Link Algorithm
- This is a document clustering algorithm which produces small and tight clusters:
  1. Place each document in a distinct cluster.
  2. Compute the similarity between all pairs of clusters.
  3. Determine the pair of clusters Cu, Cv with the highest inter-cluster similarity.
  4. Merge the clusters Cu and Cv.
  5. Verify a stop criterion. If this criterion is not met, go back to step 2.
  6. Return a hierarchy of clusters.
- Similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
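The steps above can be sketched directly over a pairwise document-similarity table. A minimal sketch: the stop criterion used here (no remaining pair exceeds a threshold, in the spirit of the TC parameter) is one reasonable choice, and a flat partition is returned rather than the full hierarchy:

```python
def complete_link(docs_sim, threshold):
    """Agglomerative clustering where cluster-cluster similarity is the
    MINIMUM pairwise document similarity (complete link).
    docs_sim maps sorted (i, j) document-id pairs to similarities."""
    def sim(a, b):
        return min(docs_sim[tuple(sorted((x, y)))] for x in a for y in b)

    n_docs = max(max(pair) for pair in docs_sim) + 1
    clusters = [{i} for i in range(n_docs)]           # step 1
    while len(clusters) > 1:
        pairs = [(sim(a, b), ai, bi)                  # step 2
                 for ai, a in enumerate(clusters)
                 for bi, b in enumerate(clusters) if ai < bi]
        best, ai, bi = max(pairs)                     # step 3
        if best < threshold:                          # step 5: stop criterion
            break
        clusters[ai] |= clusters[bi]                  # step 4: merge
        del clusters[bi]
    return clusters

# Slide-53 similarities (documents renumbered 0-3):
sims = {(0, 1): 0.40, (0, 2): 0.99, (1, 2): 0.29,
        (0, 3): 0.00, (1, 3): 0.00, (2, 3): 0.00}
clusters = complete_link(sims, threshold=0.90)   # -> [{0, 2}, {1}, {3}]
```

Because the merge criterion uses the minimum pairwise similarity, every document in a merged cluster is close to every other one, which is exactly the "small and tight" property the thesaurus construction needs.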
56. Selecting the Terms that Compose Each Class
- Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows.
- Obtain from the user three parameters:
  - TC: threshold class
  - NDC: number of documents in a class
  - MIDF: minimum inverse document frequency
57. Selecting the Terms that Compose Each Class
- Use the parameter TC as a threshold value for determining the document clusters that will be used to generate thesaurus classes.
- This threshold has to be surpassed by sim(Cu, Cv) if the documents in the clusters Cu and Cv are to be selected as sources of terms for a thesaurus class.
58. Selecting the Terms that Compose Each Class
- Use the parameter NDC as a limit on the size of clusters (number of documents) to be considered.
- A low value of NDC might restrict the selection to the smaller cluster Cuv.
59. Selecting the Terms that Compose Each Class
- Consider the set of documents in each document cluster pre-selected above.
- Only the lower-frequency terms in these documents are used as sources of terms for the thesaurus classes.
- The parameter MIDF defines the minimum value of inverse document frequency for any term which is selected to participate in a thesaurus class.
60. Global vs. Local Analysis
- Global analysis requires intensive term correlation computation only once, at system development time.
- Local analysis requires intensive term correlation computation for every query at run time (although the number of terms and documents is less than in global analysis).
- But local analysis gives better results.
61. Query Expansion Conclusions
- Expansion of queries with related terms can improve performance, particularly recall.
- However, similar terms must be selected very carefully to avoid problems, such as loss of precision.
62. Conclusion
- A thesaurus is an efficient method to expand queries.
- The computation is expensive, but it is executed only once.
- Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query.
- Query expansion based on a statistical thesaurus needs well-defined parameters.