Title: Chapter 5: Query Operations
Chapter 5. Query Operations
- Formulating one's information need is difficult.
- A good approach: an initial query formulation, followed by query reformulation.
- Query Expansion: expanding the original query with new terms
- Term Reweighting: reweighting the terms in the expanded query
Outline
- User Relevance Feedback
- Automatic Local Analysis
- Automatic Global Analysis
5.2 User Relevance Feedback
- Query Expansion and Term Reweighting for the Vector Model
- Term Reweighting for the Probabilistic Model
- A Variant of Probabilistic Term Reweighting
- Evaluation of Relevance Feedback Strategies
5.2.1 Query Expansion and Term Reweighting for the Vector Model
- Basic idea: reformulate the query so that it moves closer to the term-weight vectors of the relevant documents
- Notation: for a given query, Dr and Dn are the sets of relevant and non-relevant documents among the retrieved ones, and Cr is the set of relevant documents in the whole collection D
- Their relationship can be depicted as follows.
Relationship
- (Venn diagram: the collection D contains the relevant set Cr; the retrieved documents split into the relevant ones, Dr, and the non-relevant ones, Dn.)
Formula
- If Cr is known in advance, the best query is given by (p. 119, eq. 5.1)
  q_opt = (1/|Cr|) Sum_{dj in Cr} dj - (1/(N - |Cr|)) Sum_{dj not in Cr} dj
- From the user feedback we obtain the modified query q_m, which can be written in three ways (p. 119, middle):
  - Standard Rocchio: q_m = alpha q + (beta/|Dr|) Sum_{dj in Dr} dj - (gamma/|Dn|) Sum_{dj in Dn} dj
  - Ide Regular: q_m = alpha q + beta Sum_{dj in Dr} dj - gamma Sum_{dj in Dn} dj
  - Ide Dec Hi: q_m = alpha q + beta Sum_{dj in Dr} dj - gamma max_non_relevant(dj)
- The three techniques above yield similar results.
- A positive feedback strategy uses the relevant documents only (gamma = 0).
- Advantage: simplicity and good results
- Disadvantage: no optimality criterion
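The Standard Rocchio reformulation can be sketched in a few lines; the alpha, beta, gamma values and the toy vectors below are illustrative choices, not values from the chapter.

```python
# Sketch of Standard Rocchio query reformulation (vector model).
# alpha, beta, gamma and the document vectors are illustrative.

def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """q_m = alpha*q + (beta/|Dr|)*sum(Dr) - (gamma/|Dn|)*sum(Dn)."""
    dims = len(query)
    q_m = [alpha * w for w in query]
    for doc in relevant:
        for i in range(dims):
            q_m[i] += beta * doc[i] / len(relevant)
    for doc in nonrelevant:
        for i in range(dims):
            q_m[i] -= gamma * doc[i] / len(nonrelevant)
    # Negative weights are commonly clipped to zero.
    return [max(w, 0.0) for w in q_m]

q = [1.0, 0.0, 0.0]                        # original query vector
dr = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]    # relevant documents (Dr)
dn = [[1.0, 0.0, 1.0]]                     # non-relevant documents (Dn)
q_m = rocchio(q, dr, dn)
```

Clipping negative weights corresponds to the common practice of discarding terms that end up with negative weight in the modified query.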
5.2.2 Term Reweighting for the Probabilistic Model
- sim(dj, q) ~ Sum_i w_i,q w_i,j ( log [ P(ki|R) / (1 - P(ki|R)) ] + log [ (1 - P(ki|¬R)) / P(ki|¬R) ] )
- How can the probabilities P(ki|R) and P(ki|¬R) be estimated?
- Initial estimates based on assumptions:
  - P(ki|R) = 0.5
  - P(ki|¬R) = ni / N, where ni is the number of documents that contain ki
- Use this initial guess to retrieve an initial ranking
- Improve upon this initial ranking
- Let
  - Dr: set of relevant documents judged by the user
  - Dr,i: subset of Dr containing ki
- Re-evaluated estimates:
  - P(ki|R) = |Dr,i| / |Dr|
  - P(ki|¬R) = (ni - |Dr,i|) / (N - |Dr|)
- A small adjustment factor (e.g. 0.5 or ni/N) can be added to these formulas, e.g. P(ki|R) = (|Dr,i| + 0.5) / (|Dr| + 1), to avoid problems when |Dr| is small
- Advantage: term reweighting is optimal under the probabilistic model
- Disadvantage: no query expansion is performed
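The feedback estimates above, with the 0.5 adjustment factor, can be sketched as follows; the collection statistics in the example are invented for illustration.

```python
import math

# Sketch of probabilistic term reweighting after user feedback.
# N: collection size, n_i: docs containing term k_i,
# Dr: number of relevant docs judged by the user, Dri: those containing k_i.
# The +0.5 / +1 terms are the small-sample adjustment mentioned above.

def term_weight(N, n_i, Dr, Dri):
    p_rel = (Dri + 0.5) / (Dr + 1)               # P(k_i | R)
    p_nrel = (n_i - Dri + 0.5) / (N - Dr + 1)    # P(k_i | not R)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nrel) / p_nrel)

# A rare term present in 3 of 4 judged-relevant docs gets a large weight;
# a common term absent from all of them gets a negative one.
w_good = term_weight(N=1000, n_i=10, Dr=4, Dri=3)
w_bad = term_weight(N=1000, n_i=500, Dr=4, Dri=0)
```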
5.2.3 A Variant of Probabilistic Term Reweighting
- In 1983, Croft extended this weighting scheme so that the initial query and the feedback query use distinct formulations.
- The probabilistic formula includes within-document frequency weights.
- The modified formulas are listed on p. 122.
5.2.4 Evaluation of Relevance Feedback Strategies
- Simple approach: compare the recall-precision figures of the original query q and the modified query q_m. But this evaluation is unrealistic, since the user has already marked part of the retrieved documents as relevant.
- Another approach: evaluate the performance over the residual collection only (the documents not yet judged by the user). But the measured results may look worse, because highly ranked documents have been removed from the collection.
5.3 Automatic Local Analysis
- 5.3.1 Query Expansion Through Local Clustering
  - done by Attar and Fraenkel in 1977
- 5.3.2 Query Expansion Through Local Context Analysis
  - combines techniques from local and global analysis; done by Xu and Croft in 1996
5.3.1 Query Expansion Through Local Clustering
- Basic idea: build an association matrix to quantify the term correlations
- Notation:
  - V(s): grammatical variants of a stem s
  - D_l: set of documents retrieved for a given query q (the local document set)
  - V_l: set of distinct words in D_l
  - S_l: set of distinct stems in V_l
Association Clusters
- Based on the co-occurrence of stems inside documents
- Definition: let m = (m_ij) be an association matrix with |S_l| rows and |D_l| columns, where m_ij = f_i,j is the frequency of stem s_i in document d_j. Then s = m m^t is a local stem-stem association matrix, with entries
  s_u,v = c_u,v = Sum_{dj in D_l} f_u,j f_v,j
- s_u,v can be normalized (p. 125, eq. 5.7): s_u,v = c_u,v / (c_u,u + c_v,v - c_u,v)
- Definition: consider the u-th row of matrix s and take its n largest values s_u,v (v ≠ u); the resulting set S_u(n) is called a local association cluster around the stem s_u.
- Using keywords instead of stems works the same way.
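A minimal sketch of association clusters over a toy local document set, taking the normalized variant to be c_u,v / (c_u,u + c_v,v - c_u,v) as in eq. 5.7 (p. 125); the documents and stems below are invented for illustration.

```python
from collections import Counter

# Sketch of local association clusters: s_uv = sum_j f_uj * f_vj,
# optionally normalized as c_uv / (c_uu + c_vv - c_uv).
# The toy "retrieved" documents below are illustrative.

docs = [["query", "expansion", "expansion"],
        ["query", "reformulation"],
        ["expansion", "reformulation", "query"]]
freqs = [Counter(d) for d in docs]                 # f_{stem, doc}
stems = sorted({s for d in docs for s in d})

def c(u, v):
    """Raw association: sum over documents of f_u,j * f_v,j."""
    return sum(f[u] * f[v] for f in freqs)

def cluster(u, n, normalized=False):
    """The n strongest neighbors of stem u (a local association cluster)."""
    def score(v):
        if normalized:
            return c(u, v) / (c(u, u) + c(v, v) - c(u, v))
        return c(u, v)
    return sorted((v for v in stems if v != u), key=score, reverse=True)[:n]

top_raw = cluster("query", 1)
top_norm = cluster("query", 1, normalized=True)
```

On this toy data the raw and normalized scores disagree: raw counts favor the frequent stem "expansion", while the normalized score favors "reformulation", which co-occurs with "query" more consistently relative to its own frequency.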
Metric Clusters
- Consider the distance r(ki, kj) between two terms ki and kj (e.g. the number of words between their occurrences)
- The metric correlation c_u,v of stems s_u and s_v can be expressed as
  c_u,v = Sum_{ki in V(s_u)} Sum_{kj in V(s_v)} 1 / r(ki, kj)
- Letting s_u,v = c_u,v yields a stem-stem metric correlation matrix
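A sketch of the metric correlation for a single document, taking r to be the distance in word positions between two occurrences; the chapter leaves the exact distance function open, and the sample sentence is invented.

```python
# Sketch of a metric correlation: c_uv sums 1/r over all occurrence pairs,
# where r is taken here as the positional distance between two occurrences
# (an illustrative choice of distance function).

def metric_correlation(doc, u, v):
    pos_u = [i for i, w in enumerate(doc) if w == u]
    pos_v = [i for i, w in enumerate(doc) if w == v]
    return sum(1.0 / abs(i - j) for i in pos_u for j in pos_v if i != j)

doc = ["query", "expansion", "helps", "query", "reformulation"]
# "query" occurs at positions 0 and 3, "expansion" at position 1:
# contributions 1/1 and 1/2.
corr = metric_correlation(doc, "query", "expansion")
```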
Scalar Clusters
- Basic idea: two stems with similar neighborhoods have some synonymy relationship
- Use a scalar measure, such as the cosine, to compare the two stem vectors s_u and s_v (rows of the association matrix)
Interactive Search Formulation
- Neighbor stems have a synonymy relationship, but they are not necessarily synonyms in the grammatical sense
- Query expansion: add neighbor stems, taken from normalized clusters, unnormalized clusters, or a merge of the two
5.3.2 Query Expansion Through Local Context Analysis
- Global analysis: terms are treated as concepts, and a thesaurus acts as a concept-relation structure
- The thesaurus is used for query expansion, and can also serve as a browsing tool
- Instead of simple keywords, concepts (noun groups) are introduced; retrieval is based on concepts, not individual terms
Local Context Analysis Procedure
- Step 1: retrieve the top n ranked passages using the original query
- Step 2: compute sim(c, q) for each concept c in the retrieved passages (for the detailed formula, see p. 130)
- Step 3: add the top m ranked concepts to the query q
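The three steps can be sketched as a skeleton; the real sim(c, q) (p. 130) is more elaborate, so a plain co-occurrence count stands in for it here, and the passages are invented.

```python
# Skeleton of local context analysis over toy retrieved passages.
# sim() below is a placeholder co-occurrence score, not the formula from p. 130.

def lca_expand(query_terms, passages, m):
    def sim(concept):
        # how often the concept co-occurs with query terms, passage by passage
        return sum(p.count(concept) * p.count(t)
                   for p in passages for t in query_terms)
    concepts = {w for p in passages for w in p if w not in query_terms}
    ranked = sorted(concepts, key=sim, reverse=True)
    return list(query_terms) + ranked[:m]          # step 3: expand the query

passages = [["query", "expansion", "thesaurus"],
            ["query", "expansion", "feedback"],
            ["thesaurus", "browsing"]]
expanded = lca_expand(["query"], passages, m=1)
```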
Automatic Global Analysis
- Motivation
  - Local analysis methods extract information from the local set of retrieved documents to expand the query
  - An alternative is to expand the query using information from the whole set of documents
  - Until the beginning of the 1990s these techniques failed to yield consistent improvements in retrieval performance
  - Now, with modern variants, sometimes based on a thesaurus, this perception has changed
Automatic Global Analysis
- There are two modern variants based on a thesaurus-like structure built using all documents in the collection:
  - Query expansion based on a similarity thesaurus
  - Query expansion based on a statistical thesaurus
Similarity Thesaurus
- The similarity thesaurus is based on term-to-term relationships rather than on a matrix of co-occurrence.
- These relationships are not derived directly from the co-occurrence of terms inside documents.
- They are obtained by considering the terms as concepts in a concept space.
- In this concept space, each term is indexed by the documents in which it appears.
- Terms assume the original role of documents, while documents are interpreted as indexing elements.
Similarity Thesaurus
- The following definitions establish the framework:
  - t: number of terms in the collection
  - N: number of documents in the collection
  - f_i,j: frequency of occurrence of term ki in document dj
  - t_j: number of distinct index terms (the vocabulary) of document dj
  - itf_j: inverse term frequency of document dj
Similarity Thesaurus
- Inverse term frequency of document dj: itf_j = log(t / t_j)
- To each term ki is associated a vector ki = (w_i,1, w_i,2, ..., w_i,N)
Similarity Thesaurus
- where w_i,j is the weight associated with the index-document pair [ki, dj]. These weights combine a normalized within-document frequency with the inverse term frequency:
  w_i,j = (0.5 + 0.5 f_i,j / maxf_i) itf_j / sqrt( Sum_l (0.5 + 0.5 f_i,l / maxf_i)^2 itf_l^2 )
  where maxf_i is the maximum frequency of term ki in any document.
Similarity Thesaurus
- The relationship between two terms ku and kv is computed as a correlation factor c_u,v given by
  c_u,v = ku · kv = Sum_{dj} w_u,j w_v,j
- The global similarity thesaurus is built by computing the correlation factor c_u,v for every pair of index terms (ku, kv) in the collection
Similarity Thesaurus
- This computation is expensive
- But the global similarity thesaurus has to be computed only once, and it can be updated incrementally
Query Expansion based on a Similarity Thesaurus
- Query expansion is done in three steps:
  - Represent the query in the concept space used for the representation of the index terms
  - Based on the global similarity thesaurus, compute a similarity sim(q, kv) between each term kv correlated to the query terms and the whole query q
  - Expand the query with the top r ranked terms according to sim(q, kv)
Query Expansion - step one
- To the query q is associated a vector q in the term-concept space, given by
  q = Sum_{ki in q} w_i,q ki
- where w_i,q is the weight associated with the index-query pair [ki, q]
Query Expansion - step two
- Compute the similarity sim(q, kv) between each term kv and the user query q:
  sim(q, kv) = q · kv = Sum_{ku in q} w_u,q c_u,v
- where c_u,v is the correlation factor
Query Expansion - step three
- Add the top r ranked terms according to sim(q, kv) to the original query q to form the expanded query q'
- To each expansion term kv in the query q' is assigned a weight w_v,q' given by
  w_v,q' = sim(q, kv) / Sum_{ku in q} w_u,q
- The expanded query q' is then used to retrieve new documents for the user
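The three steps can be sketched over a toy, precomputed correlation table; the terms and correlation values are invented, and dividing sim(q, kv) by the sum of the query-term weights is an assumed normalization for the expansion-term weight.

```python
# Sketch of query expansion with a (precomputed) similarity thesaurus.
# c[u][v] holds correlation factors c_uv; all numbers are illustrative.

def expand(query_weights, c, r):
    """query_weights: {term: w_uq}; returns the r best expansion terms."""
    candidates = {v for u in query_weights for v in c[u]}
    def sim(v):                                    # step two: sim(q, k_v)
        return sum(w * c[u].get(v, 0.0) for u, w in query_weights.items())
    ranked = sorted(candidates - set(query_weights), key=sim, reverse=True)
    top = ranked[:r]                               # step three: top r terms
    norm = sum(query_weights.values())
    # assumed weight for each expansion term: sim(q, k_v) / sum of w_uq
    return {v: sim(v) / norm for v in top}

c = {"query": {"query": 1.0, "search": 0.8, "index": 0.3},
     "expansion": {"expansion": 1.0, "search": 0.5}}
new_terms = expand({"query": 1.0, "expansion": 0.5}, c, r=1)
```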
Query Expansion Sample
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
- c(A,A) = 10.991
- c(A,C) = 10.781
- c(A,D) = 10.781
- ...
- c(D,E) = 10.398
- c(B,E) = 10.396
- c(E,E) = 10.224
Query Expansion Sample
- Query: q = A E E
- sim(q,A) = 24.298
- sim(q,C) = 23.833
- sim(q,D) = 23.833
- sim(q,B) = 23.830
- sim(q,E) = 23.435
- New query: q' = A C D E E
- w(A,q') = 6.88
- w(C,q') = 6.75
- w(D,q') = 6.75
- w(E,q') = 6.64
Query Expansion Based on a Global Statistical Thesaurus
- The global thesaurus is composed of classes that group correlated terms in the context of the whole collection
- Such correlated terms can then be used to expand the original user query
- These terms must be low-frequency terms
- However, it is difficult to cluster low-frequency terms
- To circumvent this problem, we cluster documents into classes instead, and use the low-frequency terms in these documents to define the thesaurus classes
- The clustering algorithm must produce small and tight clusters
Complete Link Algorithm
- This is a document clustering algorithm that produces small and tight clusters:
  1. Place each document in a distinct cluster.
  2. Compute the similarity between all pairs of clusters.
  3. Determine the pair of clusters (Cu, Cv) with the highest inter-cluster similarity.
  4. Merge the clusters Cu and Cv.
  5. Verify a stop criterion. If this criterion is not met, go back to step 2.
  6. Return a hierarchy of clusters.
- The similarity between two clusters is defined as the minimum of the similarities between all pairs of inter-cluster documents.
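The steps above can be sketched as follows; the stop criterion (a similarity threshold) and the document similarities are illustrative choices.

```python
# Sketch of complete-link clustering: repeatedly merge the pair of clusters
# with the highest similarity, where cluster similarity is the MINIMUM over
# all inter-cluster document pairs; stop when it drops below a threshold.

def complete_link(sim, n_docs, threshold):
    """sim(i, j): similarity between documents i and j."""
    clusters = [[i] for i in range(n_docs)]
    def cluster_sim(cu, cv):
        return min(sim(i, j) for i in cu for j in cv)
    while len(clusters) > 1:
        pairs = [(cluster_sim(cu, cv), a, b)
                 for a, cu in enumerate(clusters)
                 for b, cv in enumerate(clusters) if a < b]
        best, a, b = max(pairs)
        if best < threshold:                 # stop criterion
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Illustrative similarities: docs 0 and 1 are near-duplicates, doc 2 is apart.
S = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}
groups = complete_link(lambda i, j: S[min(i, j), max(i, j)], 3, threshold=0.5)
```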
Selecting the terms that compose each class
- Given the document cluster hierarchy for the whole collection, the terms that compose each class of the global thesaurus are selected as follows
- Obtain from the user three parameters:
  - TC: threshold class
  - NDC: number of documents in a class
  - MIDF: minimum inverse document frequency
Selecting the terms that compose each class
- Use the parameter TC as a threshold for determining which document clusters will be used to generate thesaurus classes
- sim(Cu, Cv) must surpass this threshold if the documents in clusters Cu and Cv are to be selected as sources of terms for a thesaurus class
Selecting the terms that compose each class
- Use the parameter NDC as a limit on the size of the clusters (number of documents) to be considered
- A low value of NDC restricts the selection to the smaller merged cluster Cu+v
Selecting the terms that compose each class
- Consider the set of documents in each document cluster pre-selected above
- Only the lower-frequency terms are used as sources of terms for the thesaurus classes
- The parameter MIDF defines the minimum value of inverse document frequency for any term selected to participate in a thesaurus class
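The MIDF filter can be sketched as follows; idf is taken as log10(N / ni), which is consistent with the chapter's sample values (e.g. idf E = 0.60 with N = 4 and E occurring in one document), and the toy counts are invented.

```python
import math

# Sketch of the MIDF filter: from a pre-selected cluster, keep only terms
# whose inverse document frequency idf = log10(N / n_i) is at least MIDF.
# The document counts below are invented for illustration.

def class_terms(cluster_docs, docs_with_term, N, midf):
    """cluster_docs: list of term lists from one pre-selected cluster."""
    terms = {t for doc in cluster_docs for t in doc}
    return sorted(t for t in terms
                  if math.log10(N / docs_with_term[t]) >= midf)

counts = {"retrieval": 90, "rocchio": 3}   # docs containing each term
selected = class_terms([["retrieval", "rocchio"]], counts, N=100, midf=0.2)
```

The frequent term "retrieval" (idf ≈ 0.05) is filtered out; only the low-frequency term survives, matching the intent that thesaurus classes be built from low-frequency terms.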
Query Expansion based on a Statistical Thesaurus
- Use the thesaurus classes for query expansion
- Compute an average term weight wtc for each thesaurus class C

Query Expansion based on a Statistical Thesaurus
- wtc can then be used to compute a weight wc for the whole thesaurus class
Query Expansion Sample
- Doc1 = D, D, A, B, C, A, B, C
- Doc2 = E, C, E, A, A, D
- Doc3 = D, C, B, B, D, A, B, C, A
- Doc4 = A
- q = A E E
- sim(1,3) = 0.99, sim(1,2) = 0.40, sim(2,3) = 0.29, sim(4,1) = 0.00, sim(4,2) = 0.00, sim(4,3) = 0.00
- idf: A = 0.0, B = 0.3, C = 0.12, D = 0.12, E = 0.60
- TC = 0.90, NDC = 2.00, MIDF = 0.2
- q' = A B E E
Query Expansion based on a Statistical Thesaurus
- Problems with this approach:
  - initialization of the parameters TC, NDC and MIDF
  - TC depends on the collection
  - inspection of the cluster hierarchy is almost always necessary to assist with the setting of TC
  - a high value of TC might yield classes with too few terms
5.4 Conclusion
- A thesaurus is an efficient method to expand queries
- The computation is expensive, but it is executed only once
- Query expansion based on a similarity thesaurus may use high-frequency terms to expand the query
- Query expansion based on a statistical thesaurus needs well-defined parameters
5.5 Trends and Research Issues
- Graphical interfaces display the documents as points in a 2D or 3D space
- Building a thesaurus yields a hierarchy of concepts, as in Yahoo!'s directory
- The combination of several techniques is also a current and important research problem for MIR
Chapter 5. Query Operations
- User Relevance Feedback
- Automatic Local Analysis
- Automatic Global Analysis