1
Chapter 5. Query Operations
  • Wang Jimin
  • Oct. 24

2
  • Formulating one's information need is difficult.
  • A good approach: start with an initial query
    formulation, then improve it through query
    reformulation.
  • Query expansion: expanding the original query
    with new terms.
  • Term reweighting: reweighting the terms in the
    expanded query.

3
Outline
  • User Relevance Feedback
  • Automatic Local Analysis
  • Automatic Global Analysis

4
5.2 User Relevance Feedback
  • Query Expansion and Term Reweighting for the
    Vector Model
  • Term Reweighting for the Probabilistic Model
  • A Variant of Probabilistic Term Reweighting
  • Evaluation of Relevance Feedback Strategies

5
5.2.1 Query Expansion and Term Reweighting for
the Vector Model
  • Basic idea: reformulate the query so that it
    moves closer, in the term-weight vector space,
    to the relevant documents.
  • Some symbols: for a given query q, let Dr be the
    set of relevant retrieved documents, Dn the set
    of non-relevant retrieved documents, and Cr the
    set of all relevant documents in the collection
    D.
  • Their relations can be represented as follows.

6
Relationship
[Venn diagram: the collection D contains Cr, the set
of all relevant documents; the retrieved set splits
into Dr (relevant retrieved, inside Cr) and Dn
(non-relevant retrieved, outside Cr).]
7
Formula
  • If Cr were known in advance, the best query
    would be (P119, eq. 5.1):
    q_opt = (1/|Cr|) sum_{dj in Cr} dj
            - (1/(N - |Cr|)) sum_{dj not in Cr} dj
  • From the user feedback we get the modified query
    q_m, which can be written in three ways (P119,
    middle):
  • Standard Rocchio:
    q_m = alpha q + (beta/|Dr|) sum_{dj in Dr} dj
          - (gamma/|Dn|) sum_{dj in Dn} dj
  • Ide Regular:
    q_m = alpha q + beta sum_{dj in Dr} dj
          - gamma sum_{dj in Dn} dj
  • Ide Dec-Hi:
    q_m = alpha q + beta sum_{dj in Dr} dj
          - gamma max_non_relevant(dj),
    where max_non_relevant(dj) is the highest ranked
    non-relevant document. A code sketch follows.
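A minimal NumPy sketch of the standard Rocchio reformulation, assuming
query and documents are already dense term-weight vectors of equal
length; the function name and the alpha/beta/gamma defaults are
illustrative choices, not prescribed by the slides:

    import numpy as np

    def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Modified query q_m for the vector model (standard Rocchio)."""
        q_m = alpha * np.asarray(q, dtype=float)
        if rel_docs:                                    # beta/|Dr| * sum = beta * centroid
            q_m += beta * np.mean(rel_docs, axis=0)
        if nonrel_docs:                                 # gamma/|Dn| * sum = gamma * centroid
            q_m -= gamma * np.mean(nonrel_docs, axis=0)
        return np.clip(q_m, 0.0, None)                  # keep term weights non-negative

    # Example over a 4-term vocabulary
    q  = [1.0, 0.0, 0.0, 1.0]
    dr = [[0.9, 0.1, 0.0, 0.8], [0.7, 0.0, 0.2, 0.9]]   # judged relevant (Dr)
    dn = [[0.0, 0.8, 0.9, 0.0]]                         # judged non-relevant (Dn)
    print(rocchio(q, dr, dn))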

8
  • The three techniques above yield similar
    results.
  • A positive feedback strategy uses only the
    relevant documents (gamma = 0).
  • Advantage: simplicity and good results.
  • Disadvantage: no optimality criterion.

9
5.2.2 Term Reweighting for the Probabilistic
Model
  • sim(dj,q) ∝ sum_i w_{i,q} × w_{i,j} ×
    ( log [ P(ki|R) / (1 - P(ki|R)) ]
    + log [ (1 - P(ki|¬R)) / P(ki|¬R) ] )
  • How to obtain the probabilities P(ki|R) and
    P(ki|¬R)?
  • Initial estimates based on assumptions:
  • P(ki|R) = 0.5
  • P(ki|¬R) = ni / N, where ni is the number of
    docs that contain ki
  • Use this initial guess to retrieve an initial
    ranking
  • Improve upon this initial ranking

10
  • Let
  • Dr = set of relevant docs judged by the user
  • Dr,i = subset of Dr containing ki
  • Re-evaluate the estimates:
  • P(ki|R) = |Dr,i| / |Dr|
  • P(ki|¬R) = (ni - |Dr,i|) / (N - |Dr|)
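A small sketch of this estimation cycle, assuming that for each term ki
we know ni and N, and, after feedback, |Dr| and |Dr,i|; variable names
are illustrative:

    import math

    def term_weight(n_i, N, Dr=0, Dri=0):
        """Log-odds weight of term ki. With no feedback (Dr == 0) it
        falls back to the initial guesses P(ki|R) = 0.5 and
        P(ki|~R) = ni/N."""
        if Dr == 0:                                # initial ranking
            p_r, p_nr = 0.5, n_i / N
        else:                                      # re-estimated from feedback
            p_r  = Dri / Dr
            p_nr = (n_i - Dri) / (N - Dr)
        # Degenerate estimates (p_r of 0 or 1) are what the adjustment
        # factors on the next slide are meant to avoid.
        return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

    print(term_weight(n_i=20, N=1000))                 # initial guess
    print(term_weight(n_i=20, N=1000, Dr=10, Dri=7))   # after feedback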

11
  • A small adjustment factor (e.g. 0.5, or ni/N)
    can be added to numerator and denominator to
    avoid degenerate estimates when |Dr| is small.
  • Advantage: term reweighting is optimal under the
    probabilistic model.
  • Disadvantage: no query expansion.

12
5.2.3 A Variant of Probabilistic Term Reweighting
  • In 1983, Croft extended this weighting scheme:
    the initial query and the feedback cycles use
    distinct formulations.
  • The probabilistic formula includes
    within-document frequency weights.
  • The modified formulas are listed on P122.

13
5.2.4 Evaluation of Relevance Feedback Strategies
  • Simple approach: compare the recall-precision
    figures for the original query q and the
    modified query q_m. But this evaluation is
    unrealistic, since the user has already marked
    those documents as relevant.
  • Another approach: evaluate performance on the
    residual collection only (the documents already
    judged are removed). The measured figures tend
    to look worse, because the highly ranked
    relevant documents have been removed from the
    collection.

14
5.3 Automatic Local Analysis
  • 5.3.1 Query Expansion Through Local Clustering
  • proposed by Attar and Fraenkel in 1977
  • 5.3.2 Query Expansion Through Local Context
    Analysis
  • combines techniques from local and global
    analysis; proposed by Xu and Croft in 1996

15
5.3.1 Query Expansion Through Local Clustering
  • Basic idea: build an association matrix that
    quantifies the term correlations.
  • Some symbols:
  • V(s) = set of grammatical variants of a stem s
  • D_l = set of documents retrieved for a given
    query q
  • V_l = set of distinct words in D_l
  • S_l = set of distinct stems in V_l

16
Association Clusters
  • Based on the co-occurrence of stems inside
    documents.
  • Definition: let m = (m_ij) be an association
    matrix with |S_l| rows and |D_l| columns, where
    m_ij = f_{i,j}, the frequency of stem s_i in
    document d_j. Then s = m m^T is a local
    stem-stem association matrix whose entries are
    s_{u,v} = c_{u,v} = sum_{dj in D_l}
    f_{s_u,j} × f_{s_v,j}. A sketch follows.
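A minimal sketch of this construction, using whole words in place of
stems (stemming omitted) and an invented three-document local set:

    import numpy as np

    D_l = ["query expansion improves query recall",
           "term reweighting improves precision",
           "query reformulation uses feedback"]

    stems = sorted({w for doc in D_l for w in doc.split()})
    # m_ij = frequency of stem s_i in document d_j
    m = np.array([[doc.split().count(s) for doc in D_l] for s in stems])

    s = m @ m.T          # s_{u,v} = sum_j f_{u,j} * f_{v,j}
    u = stems.index("query")
    n = 3                # the n largest values in row u form S_u(n)
    print([stems[v] for v in np.argsort(s[u])[::-1][:n]])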

17
  • s_{u,v} can be normalized (P125, eq. 5.7):
    s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
  • Definition: consider the u-th row of matrix s
    and take its n largest values s_{u,v}; the
    result S_u(n) is called the local association
    cluster around the stem s_u.
  • Using keywords instead of stems works in the
    same way.

18
Metric Clusters
  • Consider the distance r(ki, kj) between the
    terms ki and kj, e.g. the number of words
    between them in the same document.
  • The metric correlation c_{u,v} of the stems s_u
    and s_v can be expressed as
  • c_{u,v} = sum_{ki in V(s_u)} sum_{kj in V(s_v)}
    1 / r(ki, kj)
  • Let s_{u,v} = c_{u,v} to get a stem-stem metric
    correlation matrix. A sketch follows.
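A sketch of the metric correlation, assuming r(ki,kj) is the number of
positions between two occurrences in the same document and that each
stem has a single-word variant set V(s):

    def metric_correlation(doc_tokens, u, v):
        """c_{u,v} = sum over occurrence pairs of 1 / r(ki,kj)."""
        c = 0.0
        pos_u = [i for i, w in enumerate(doc_tokens) if w == u]
        pos_v = [i for i, w in enumerate(doc_tokens) if w == v]
        for i in pos_u:
            for j in pos_v:
                if i != j:
                    c += 1.0 / abs(i - j)   # nearby occurrences count more
        return c

    doc = "query expansion helps query reformulation".split()
    print(metric_correlation(doc, "query", "expansion"))  # 1/1 + 1/2 = 1.5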

19
Scalar Clusters
  • Basic idea: two stems with similar neighborhoods
    have some synonymity relationship.
  • Use a scalar measure, such as the cosine, to
    compare the two neighborhood vectors s_u(n) and
    s_v(n), as sketched below.
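A one-function sketch of the scalar comparison, using the cosine of two
neighborhood vectors (rows of the association matrix s):

    import numpy as np

    def scalar_sim(s_u, s_v):
        s_u, s_v = np.asarray(s_u, float), np.asarray(s_v, float)
        return s_u @ s_v / (np.linalg.norm(s_u) * np.linalg.norm(s_v))

    # Values near 1 indicate similar neighborhoods, hence synonymity
    print(scalar_sim([5, 2, 0, 1], [4, 1, 0, 2]))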

20
Interactive Search Formulation
  • Neighbor stems have a synonymity relationship,
    but they are not necessarily synonyms in the
    grammatical sense.
  • Query expansion: add neighbor stems, taken from
    normalized clusters, unnormalized clusters, or a
    merge of the two.

21
5.3.2 Query Expansion Through Local Context
Analysis
  • Global analysis: terms are treated as concepts,
    and a thesaurus provides a concept relation
    structure.
  • The thesaurus is used for query expansion, and
    it can also serve as a browsing tool.
  • Instead of simple keywords, concepts (noun
    groups) are introduced; the query is matched
    against concepts, not individual terms.

22
Local Context Analysis procedure
  • Step 1: retrieve the top n ranked passages using
    the original query.
  • Step 2: compute sim(c,q) between each concept c
    in those passages and the query q (for the
    detailed formula, see P130).
  • Step 3: add the top m ranked concepts to the
    query q. A simplified sketch follows.
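A deliberately simplified sketch of the procedure: the real sim(c,q) on
P130 is an idf-weighted formula over query terms, so the plain
co-occurrence count used here is only a stand-in:

    from collections import Counter

    def lca_expand(query_terms, ranked_passages, n=10, m=3):
        top = ranked_passages[:n]                     # step 1: top n passages
        scores = Counter()
        for passage in top:
            words = set(passage.split())
            if any(t in words for t in query_terms):  # passage matches q
                for concept in words - set(query_terms):
                    scores[concept] += 1              # step 2: score concepts
        expansion = [c for c, _ in scores.most_common(m)]
        return list(query_terms) + expansion          # step 3: add top m

    passages = ["query expansion local context", "local analysis ranks concepts"]
    print(lca_expand(["query"], passages, n=2, m=2))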

23
Automatic Global Analysis
  • Motivation
  • Methods of local analysis extract information
    from the local set of retrieved documents to
    expand the query.
  • An alternative is to expand the query using
    information from the whole set of documents.
  • Until the beginning of the 1990s these
    techniques failed to yield consistent
    improvements in retrieval performance.
  • Now, with modern variants, sometimes based on a
    thesaurus, this perception has changed.

24
Automatic Global Analysis
  • There are two modern variants based on a
    thesaurus-like structure built using all
    documents in collection
  • Query Expansion based on a Similarity Thesaurus
  • Query Expansion based on a Statistical Thesaurus

25
Similarity Thesaurus
  • The similarity thesaurus is based on term-to-term
    relationships rather than on a matrix of
    co-occurrence.
  • These relationships are not derived directly
    from the co-occurrence of terms inside
    documents.
  • They are obtained by considering the terms as
    concepts in a concept space.
  • In this concept space, each term is indexed by
    the documents in which it appears.
  • Terms assume the original role of documents,
    while documents are interpreted as indexing
    elements.

26
Similarity Thesaurus
  • The following definitions establish the proper
    framework:
  • t = number of terms in the collection
  • N = number of documents in the collection
  • f_{i,j} = frequency of occurrence of the term ki
    in the document dj
  • t_j = number of distinct terms (vocabulary size)
    of document dj
  • itf_j = inverse term frequency of document dj

27
Similarity Thesaurus
  • Inverse term frequency of document dj:
    itf_j = log (t / t_j)
  • To each term ki is associated a vector
    k_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})

28
Similarity Thesaurus
  • where w_{i,j} is a weight associated with the
    term-document pair [ki, dj]. These weights
    combine the normalized frequency of ki in dj
    with itf_j:
    w_{i,j} = ( (0.5 + 0.5 f_{i,j}/maxf(ki)) × itf_j )
    / sqrt( sum_{l=1..N} ( (0.5 + 0.5
    f_{i,l}/maxf(ki)) × itf_l )^2 ),
    where maxf(ki) is the maximum of f_{i,l} over
    all documents d_l.

29
Similarity Thesaurus
  • The relationship between two terms ku and kv is
    computed as a correlation factor c_{u,v} given by
    c_{u,v} = k_u · k_v = sum_{dj} w_{u,j} × w_{v,j}
  • The global similarity thesaurus is built by
    computing the correlation factor c_{u,v} for
    every pair of index terms [ku, kv] in the
    collection. A sketch on a toy collection
    follows.
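A sketch of the whole construction on a toy two-document collection;
the exact normalization of w_{i,j} follows the reading of the formula
above and may differ in detail from the textbook:

    import numpy as np

    docs = [["d", "d", "a", "b", "c", "a", "b", "c"],
            ["e", "c", "e", "a", "a", "d"]]
    terms = sorted({w for d in docs for w in d})
    t, N = len(terms), len(docs)

    f = np.array([[d.count(k) for d in docs] for k in terms], float)  # f_{i,j}
    itf = np.log(t / np.array([len(set(d)) for d in docs]))           # itf_j
    tf = 0.5 + 0.5 * f / f.max(axis=1, keepdims=True)                 # normalized tf
    w = np.where(f > 0, tf * itf, 0.0)                                # weight only where ki occurs
    w /= np.linalg.norm(w, axis=1, keepdims=True)                     # unit term vectors

    c = w @ w.T                                    # c_{u,v} for every term pair
    print(dict(zip(terms, c[terms.index("a")])))   # correlations of term 'a'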

30
Similarity Thesaurus
  • This computation is expensive.
  • However, the global similarity thesaurus has to
    be computed only once, and it can be updated
    incrementally.

31
Query Expansion based on a Similarity Thesaurus
  • Query expansion is done in three steps as
    follows
  • Represent the query in the concept space used for
    representation of the index terms
  • Based on the global similarity thesaurus, compute
    a similarity sim(q,kv) between each term kv
    correlated to the query terms and the whole query
    q.
  • Expand the query with the top r ranked terms
    according to sim(q,kv)

32
Query Expansion - step one
  • To the query q is associated a vector q in the
    term-concept space, given by
    q = sum_{ki in q} w_{i,q} × k_i
  • where w_{i,q} is a weight associated with the
    term-query pair [ki, q]

33
Query Expansion - step two
  • Compute the similarity sim(q,kv) between each
    term kv and the user query q:
    sim(q,kv) = q · k_v = sum_{ku in q} w_{u,q} × c_{u,v}
  • where c_{u,v} is the correlation factor from the
    similarity thesaurus

34
Query Expansion - step three
  • Add the top r ranked terms according to
    sim(q,kv) to the original query q to form the
    expanded query q'.
  • To each expansion term kv in the query q' is
    assigned a weight w_{v,q'} given by
    w_{v,q'} = sim(q,kv) / sum_{ku in q} w_{u,q}
  • The expanded query q' is then used to retrieve
    new documents for the user. A sketch of the
    three steps follows.
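A sketch of the three steps, assuming a precomputed correlation matrix
c (e.g. from the earlier toy-collection sketch) and simple query
weights; all values here are stand-ins:

    import numpy as np

    terms = ["a", "b", "c", "d", "e"]
    c = np.eye(5)                 # stand-in thesaurus; use real c_{u,v} here
    c[0, 2] = c[2, 0] = 0.8       # pretend 'a' and 'c' are strongly correlated

    w_q = {"a": 1.0, "e": 2.0}    # step 1: the query as a concept vector
    sim = {kv: sum(w * c[terms.index(ku), i]       # step 2: sim(q, kv)
                   for ku, w in w_q.items())
           for i, kv in enumerate(terms)}
    r = 3
    expanded = sorted(sim, key=sim.get, reverse=True)[:r]  # step 3: top r terms
    w_new = {kv: sim[kv] / sum(w_q.values()) for kv in expanded}
    print(expanded, w_new)        # ['e', 'a', 'c'] with their weights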

35
Query Expansion Sample
  • Doc1 = D, D, A, B, C, A, B, C
  • Doc2 = E, C, E, A, A, D
  • Doc3 = D, C, B, B, D, A, B, C, A
  • Doc4 = A
  • c(A,A) = 10.991
  • c(A,C) = 10.781
  • c(A,D) = 10.781
  • ...
  • c(D,E) = 10.398
  • c(B,E) = 10.396
  • c(E,E) = 10.224

36
Query Expansion Sample
  • Query: q = A E E
  • sim(q,A) = 24.298
  • sim(q,C) = 23.833
  • sim(q,D) = 23.833
  • sim(q,B) = 23.830
  • sim(q,E) = 23.435
  • New query: q' = A C D E E
  • w(A,q') = 6.88
  • w(C,q') = 6.75
  • w(D,q') = 6.75
  • w(E,q') = 6.64

37
Query Expansion Based on a Global Statistical
Thesaurus
  • The global thesaurus is composed of classes that
    group correlated terms in the context of the
    whole collection.
  • Such correlated terms can then be used to expand
    the original user query.
  • These terms must be low-frequency terms.
  • However, it is difficult to cluster low-frequency
    terms directly.
  • To circumvent this problem, we cluster documents
    into classes instead, and use the low-frequency
    terms in these documents to define our thesaurus
    classes.
  • The clustering algorithm must produce small and
    tight clusters.

38
Complete link algorithm
  • This is a document clustering algorithm that
    produces small and tight clusters:
  • 1. Place each document in a distinct cluster.
  • 2. Compute the similarity between all pairs of
    clusters.
  • 3. Determine the pair of clusters [Cu, Cv] with
    the highest inter-cluster similarity.
  • 4. Merge the clusters Cu and Cv.
  • 5. Verify a stop criterion. If this criterion is
    not met, go back to step 2.
  • 6. Return a hierarchy of clusters.
  • The similarity between two clusters is defined
    as the minimum of the similarities between all
    pairs of inter-cluster documents. A sketch
    follows.
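A sketch of the algorithm over a plain similarity matrix, with the stop
criterion taken to be a similarity threshold (one reasonable choice
among several):

    import itertools

    def complete_link(sim, threshold):
        """sim[i][j] is the similarity between documents i and j."""
        clusters = [[i] for i in range(len(sim))]     # 1. one doc per cluster
        merges = []
        while len(clusters) > 1:
            best, pair = -1.0, None                   # 2.-3. best pair by
            for a, b in itertools.combinations(range(len(clusters)), 2):
                s = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
            if best < threshold:                      # 5. stop criterion
                break
            a, b = pair
            merged = clusters[a] + clusters[b]        # 4. merge Cu and Cv
            merges.append((merged, best))
            clusters = [cl for k, cl in enumerate(clusters) if k not in (a, b)]
            clusters.append(merged)
        return clusters, merges                       # 6. the hierarchy

    sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
    print(complete_link(sim, threshold=0.5))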

39
Selecting the terms that compose each class
  • Given the document cluster hierarchy for the
    whole collection, the terms that compose each
    class of the global thesaurus are selected as
    follows.
  • Obtain three parameters from the user:
  • TC = threshold class
  • NDC = number of documents in a class
  • MIDF = minimum inverse document frequency

40
Selecting the terms that compose each class
  • Use the parameter TC as the threshold value for
    determining which document clusters will be used
    to generate thesaurus classes.
  • sim(Cu,Cv) must surpass this threshold for the
    documents in the clusters Cu and Cv to be
    selected as sources of terms for a thesaurus
    class.

41
Selecting the terms that compose each class
  • Use the parameter NDC as a limit on the size
    (number of documents) of the clusters to be
    considered.
  • A low value of NDC might restrict the selection
    to the smaller cluster Cu+v.

42
Selecting the terms that compose each class
  • Consider the set of documents in each document
    cluster pre-selected above.
  • Only the lower-frequency terms in these
    documents are used as sources for the thesaurus
    classes.
  • The parameter MIDF defines the minimum inverse
    document frequency a term must have to be
    selected for a thesaurus class. A selection
    sketch follows.
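A sketch of the selection, assuming idf values are precomputed and each
merge from the clustering step carries its similarity; all names are
illustrative:

    def thesaurus_classes(merges, idf, TC, NDC, MIDF):
        """merges: list of (cluster_docs, similarity) pairs."""
        classes = []
        for docs, s in merges:
            if s <= TC or len(docs) > NDC:   # TC gate and NDC size limit
                continue
            terms = {w for doc in docs for w in doc}
            cls = sorted(w for w in terms if idf[w] >= MIDF)  # MIDF filter
            if cls:
                classes.append(cls)
        return classes

    merges = [([["a", "b", "c"], ["b", "c", "e"]], 0.99)]
    idf = {"a": 0.0, "b": 0.3, "c": 0.12, "e": 0.6}
    print(thesaurus_classes(merges, idf, TC=0.90, NDC=2, MIDF=0.2))  # [['b', 'e']]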

43
Query Expansion based on a Statistical Thesaurus
  • Use the thesaurus classes for query expansion.
  • Compute an average term weight wtc for each
    thesaurus class C.

44
Query Expansion based on a Statistical Thesaurus
  • wtc can then be used to compute a thesaurus
    class weight wc (see the textbook for both
    formulas).

45
Query Expansion Sample
  • Doc1 = D, D, A, B, C, A, B, C
  • Doc2 = E, C, E, A, A, D
  • Doc3 = D, C, B, B, D, A, B, C, A
  • Doc4 = A
  • q = A E E
  • sim(1,3) = 0.99   sim(1,2) = 0.40
    sim(2,3) = 0.29   sim(4,1) = 0.00
    sim(4,2) = 0.00   sim(4,3) = 0.00
  • idf(A) = 0.0   idf(B) = 0.3   idf(C) = 0.12
    idf(D) = 0.12   idf(E) = 0.60
  • TC = 0.90, NDC = 2.00, MIDF = 0.2
  • q' = A B E E
46
Query Expansion based on a Statistical Thesaurus
  • Problems with this approach:
  • initialization of the parameters TC, NDC and
    MIDF
  • TC depends on the collection
  • Inspection of the cluster hierarchy is almost
    always necessary to assist with the setting of
    TC
  • A high value of TC might yield classes with too
    few terms

47
5.4 Conclusion
  • A thesaurus is an efficient method to expand
    queries.
  • The computation is expensive, but it is executed
    only once.
  • Query expansion based on a similarity thesaurus
    can use high-frequency terms to expand the
    query.
  • Query expansion based on a statistical thesaurus
    needs well-defined parameters.

48
5.5 Trends and Research Issues
  • Graphical interfaces that display the docs as
    points in a 2D or 3D space.
  • Building a thesaurus to get a hierarchy of
    concepts, like Yahoo!
  • The combination of several techniques is also a
    current and important research problem for MIR.

49
Chapter 5. Query Operations
  • User Relevance Feedback
  • Automatic Local Analysis
  • Automatic Global Analysis

50
  • Thank you!