Chapter 5: Query Operations - PowerPoint PPT Presentation

Slides: 28

Provided by: csieN5

Transcript and Presenter's Notes

1
Chapter 5 Query Operations

Baeza-Yates, 1999
Modern Information Retrieval

2
Query Modification

Improving initial query formulation
Relevance feedback
approaches based on feedback information from
users
Local analysis
approaches based on information derived from the
set of documents initially retrieved (called the
local set of documents)
Global analysis
approaches based on global information derived
from the document collection

3
Relevance Feedback

Relevance feedback process
it shields the user from the details of the query
reformulation process
it breaks down the whole searching task into a
sequence of small steps which are easier to grasp
it provides a controlled process designed to
emphasize some terms and de-emphasize others
Two basic techniques
Query expansion
addition of new terms from relevant documents
Term reweighting
modification of term weights based on the user
relevance judgement

4
Vector Space Model

Definition
$w_{i,j}$: the $i$th term in the vector for document $d_j$
$w_{i,k}$: the $i$th term in the vector for query $q_k$
$t$: the number of unique terms in the data set
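
The similarity formula from this slide did not survive in the transcript. As a minimal sketch, assuming the usual cosine measure of the vector space model, the similarity between a document vector and a query vector over the same $t$ terms could be computed as follows (the function name is illustrative, not from the slides):

```python
import math

def cosine_similarity(doc_weights, query_weights):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(doc_weights) != len(query_weights):
        raise ValueError("vectors must cover the same t terms")
    dot = sum(w_d * w_q for w_d, w_q in zip(doc_weights, query_weights))
    norm_d = math.sqrt(sum(w * w for w in doc_weights))
    norm_q = math.sqrt(sum(w * w for w in query_weights))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)
```

Documents are then ranked by decreasing similarity to the query.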

5
Query Expansion and Term Reweighting for the
Vector Model

Ideal situation
$C_R$: set of relevant documents among all documents in the collection
Rocchio (1965, 1971)
$R$: set of relevant documents, as identified by the user among the retrieved documents
$S$: set of non-relevant documents among the retrieved documents

6
Rocchio's Algorithm

Ide_Regular (1971)
Ide_Dec_Hi
Parameters
$\alpha = \beta = \gamma = 1$
$\beta > \gamma = 0$
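
The update formulas themselves are missing from the transcript. The sketch below assumes the commonly cited forms: standard Rocchio averages the relevant set $R$ and the non-relevant set $S$, Ide_Regular uses the unnormalized sums, and Ide_Dec_Hi subtracts only the highest-ranked non-relevant document. The function and argument names are hypothetical.

```python
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0,
            variant="rocchio"):
    """Return the modified query vector.

    query: list of term weights; relevant / nonrelevant: lists of document
    vectors (R and S on the slides). variant is one of "rocchio",
    "ide_regular", "ide_dec_hi".
    """
    t = len(query)

    def vector_sum(docs):
        return [sum(d[i] for d in docs) for i in range(t)]

    pos = vector_sum(relevant)
    if variant == "rocchio":
        # Standard Rocchio normalizes each sum by the size of its set.
        pos = [x / len(relevant) for x in pos] if relevant else pos
        neg = vector_sum(nonrelevant)
        neg = [x / len(nonrelevant) for x in neg] if nonrelevant else neg
    elif variant == "ide_regular":
        neg = vector_sum(nonrelevant)
    elif variant == "ide_dec_hi":
        # Only the first non-relevant document is subtracted; the caller is
        # assumed to pass nonrelevant in rank order.
        neg = list(nonrelevant[0]) if nonrelevant else [0.0] * t
    else:
        raise ValueError(f"unknown variant: {variant}")
    return [alpha * q + beta * p - gamma * n
            for q, p, n in zip(query, pos, neg)]
```

Negative weights are left as they are here; many systems clip them to zero, which is a design choice the slides do not address. Setting `gamma=0` gives a positive-feedback-only strategy.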

7
Probabilistic Model

Definition
$p_i$: the probability of observing term $t_i$ in the set of relevant documents
$q_i$: the probability of observing term $t_i$ in the set of non-relevant documents
Initial search assumption
$p_i$ is constant for all terms $t_i$ (typically 0.5)
$q_i$ can be approximated by the distribution of $t_i$ in the whole collection
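
As a worked step under these initial assumptions: suppose the term weight takes the usual log-odds form of the probabilistic model (an assumption here, since the slide's formula is not in the transcript), and the second probability is estimated as $n_i / N$, with $n_i$ the number of documents containing the term and $N$ the collection size. The weight then reduces to an idf-like quantity:

```latex
w_i = \log\frac{p_i}{1 - p_i} + \log\frac{1 - q_i}{q_i}
    = \log\frac{0.5}{0.5} + \log\frac{1 - n_i/N}{n_i/N}
    = \log\frac{N - n_i}{n_i}
```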

8
Term Reweighting for the Probabilistic Model

Robertson and Sparck Jones (1976)
With relevance feedback from user
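$N$: the number of documents in the collection
$R$: the number of relevant documents for query $q$
$n_i$: the number of documents having term $t_i$
$r_i$: the number of relevant documents having term $t_i$

Contingency table (document indexing vs. document relevance)

                 Relevant      Non-relevant            Total
Term present     r_i           n_i - r_i               n_i
Term absent      R - r_i       N - n_i - R + r_i       N - n_i
Total            R             N - R                   N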

9
Term Reweighting for the Probabilistic Model
(cont.)

Initial search assumption
$p_i$ is constant for all terms $t_i$ (typically 0.5)
$q_i$ can be approximated by the distribution of $t_i$ in the whole collection
With relevance feedback from users
$p_i$ and $q_i$ can be approximated by
$p_i = \frac{r_i}{R}$, $q_i = \frac{n_i - r_i}{N - R}$
hence the term weight is updated by
$w_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)}$

10
Term Reweighting for the Probabilistic Model
(Cont.)

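However, the last formula poses problems for certain small values of $R$ and $r_i$ (e.g., $R = 1$, $r_i = 0$), so an adjustment factor of 0.5 is added:
$p_i = \frac{r_i + 0.5}{R + 1}$, $q_i = \frac{n_i - r_i + 0.5}{N - R + 1}$
Instead of 0.5, alternative adjustments have been proposed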
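
A minimal sketch of the adjusted weight, assuming the log-odds form $\log\frac{p_i}{1-p_i} + \log\frac{1-q_i}{q_i}$ with $p_i = (r_i + a)/(R + 1)$ and $q_i = (n_i - r_i + a)/(N - R + 1)$. The parameter `adjust` (the constant $a$) and the function name are hypothetical; `adjust=0.5` is the usual constant, and a collection-dependent value such as $n_i / N$ is one alternative found in the literature.

```python
import math

def adjusted_term_weight(N, R, n_i, r_i, adjust=0.5):
    """Term weight after relevance feedback, smoothed by `adjust`.

    N: documents in the collection, R: relevant documents known so far,
    n_i: documents containing the term, r_i: relevant documents containing
    the term. Assumes 0 < adjust < 1 so neither probability reaches 0 or 1.
    """
    p_i = (r_i + adjust) / (R + 1)
    q_i = (n_i - r_i + adjust) / (N - R + 1)
    return math.log(p_i / (1 - p_i)) + math.log((1 - q_i) / q_i)
```

With $R = 1$ and $r_i = 0$ the unadjusted estimate $p_i = r_i / R$ is zero and its logarithm is undefined, whereas the adjusted form stays finite.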

11
Term Reweighting for the Probabilistic Model
(Cont.)

Characteristics
Advantage
the term reweighting is optimal under the
assumptions of
term independence
binary document indexing ($w_{i,q} \in \{0,1\}$ and $w_{i,j} \in \{0,1\}$)
Disadvantage
no query expansion is used
weights of terms in the previous query
formulations are also disregarded
document term weights are not taken into account
during the feedback loop

12
Evaluation of relevance feedback

The standard evaluation method (i.e., recall-precision figures over the whole collection) is not suitable, because the relevant documents used to reweight the query terms are moved to higher ranks.
The residual collection method
the residual collection is the set of all documents minus the set of feedback documents provided by the user
because highly ranked documents are removed from the collection, the recall-precision figures for the modified query tend to be lower than the figures for the original query
as a basic rule of thumb, any experimentation involving relevance feedback strategies should always evaluate recall-precision figures relative to the residual collection
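
The residual collection method can be sketched as follows. The helper names are hypothetical, and precision at a cutoff stands in for a full recall-precision curve; the point is only that the judged documents are removed from both rankings and from the relevant set before measuring.

```python
def residual_ranking(ranking, feedback_docs):
    """Drop the documents the user judged from a ranked list of doc ids."""
    judged = set(feedback_docs)
    return [doc for doc in ranking if doc not in judged]

def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k documents that are relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k if k else 0.0
```

To compare an original query with its feedback-modified version, both rankings are passed through `residual_ranking` with the same feedback set, and the relevant set is reduced by the same documents, so the modified query gets no credit for re-ranking documents the user already saw.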

13
Automatic Local Analysis

Definition
local document set $D_l$: the set of documents retrieved by a query
local vocabulary $V_l$: the set of all distinct words in $D_l$
stemmed vocabulary $S_l$: the set of all distinct stems derived from $V_l$
Building local clusters
association clusters
metric clusters
scalar clusters

14
Association Clusters

Idea
co-occurrence of stems (or terms) inside documents
$f_{u,j}$: the frequency of a stem $k_u$ in a document $d_j$
local association cluster for a stem $k_u$
the set of $k$ largest values $c(k_u, k_v)$
given a query $q$, find clusters for the $|q|$ query terms
normalized form
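
The correlation formulas are missing from the transcript. A minimal sketch, assuming the usual definitions $c(k_u, k_v) = \sum_j f_{u,j} \times f_{v,j}$ and the normalized form $s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})$; the function names are illustrative, and excluding the stem itself from its own cluster is an assumption:

```python
def association_matrix(freq):
    """freq[u][j] = frequency of stem u in local document j.

    Returns c where c[u][v] = sum_j freq[u][j] * freq[v][j].
    """
    m = len(freq)
    return [[sum(a * b for a, b in zip(freq[u], freq[v])) for v in range(m)]
            for u in range(m)]

def normalize(c):
    """s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])."""
    m = len(c)
    s = [[0.0] * m for _ in range(m)]
    for u in range(m):
        for v in range(m):
            denom = c[u][u] + c[v][v] - c[u][v]
            s[u][v] = c[u][v] / denom if denom else 0.0
    return s

def local_cluster(matrix, u, k):
    """Indices of the k stems most strongly associated with stem u."""
    others = [v for v in range(len(matrix)) if v != u]
    others.sort(key=lambda v: matrix[u][v], reverse=True)
    return others[:k]
```

For query expansion, `local_cluster` would be called once per query stem and the returned stems added to the query.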

15
Metric Clusters

Idea
consider the distance between two terms in the
same cluster
Definition
$V(k_u)$: the set of keywords which have the same stem form as $k_u$
distance $r(k_i, k_j)$: the number of words between keywords $k_i$ and $k_j$ in the same document
normalized form
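
A sketch of one way to compute metric correlations, assuming the usual definition in which each pair of keywords contributes the inverse of its distance, and the normalized form divides by $|V(k_u)| \times |V(k_v)|$. Two simplifications are assumptions of this sketch, not statements from the slides: distance is taken as the difference in word positions (so adjacent words are at distance 1), and every pair of occurrences is counted. The `stem_of` mapping is a hypothetical stand-in for a real stemmer.

```python
from collections import defaultdict

def metric_correlations(docs, stem_of):
    """docs: list of token lists; stem_of: dict mapping each word to its stem.

    Returns c with c[(u, v)] = sum of 1 / distance over co-occurring
    positions of words with stems u and v (u != v) in the same document.
    """
    c = defaultdict(float)
    for tokens in docs:
        stems = [stem_of[w] for w in tokens]
        for p in range(len(stems)):
            for q in range(p + 1, len(stems)):
                u, v = stems[p], stems[q]
                if u != v:
                    # Assumption: distance is the difference in word
                    # positions, so adjacent words are at distance 1.
                    c[(u, v)] += 1.0 / (q - p)
                    c[(v, u)] += 1.0 / (q - p)
    return c

def normalized_metric(c, stem_of):
    """Divide c[(u, v)] by |V(u)| * |V(v)|, the variant counts per stem."""
    variants = defaultdict(set)
    for word, stem in stem_of.items():
        variants[stem].add(word)
    return {(u, v): value / (len(variants[u]) * len(variants[v]))
            for (u, v), value in c.items()}
```

Words in different documents never contribute, which corresponds to treating their distance as infinite. The pairwise loop is quadratic in document length, which is acceptable for a small local document set.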

16
Scalar Clusters

Idea
two stems with similar neighborhoods have some
synonymity relationships
Definition
$c_{u,v} = c(k_u, k_v)$
vectors of correlation values for stems $k_u$ and $k_v$
scalar association matrix
scalar clusters
the set of $k$ largest values of scalar association
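
A minimal sketch of the scalar association matrix, assuming it is the cosine between the rows of a correlation matrix, so that two stems score highly when their neighborhoods are similar. The function names are hypothetical.

```python
import math

def scalar_association(c):
    """Cosine between rows of a correlation matrix c (list of lists)."""
    norms = [math.sqrt(sum(x * x for x in row)) for row in c]
    m = len(c)
    s = [[0.0] * m for _ in range(m)]
    for u in range(m):
        for v in range(m):
            if norms[u] and norms[v]:
                dot = sum(a * b for a, b in zip(c[u], c[v]))
                s[u][v] = dot / (norms[u] * norms[v])
    return s

def scalar_cluster(s, u, k):
    """Indices of the k stems with the largest scalar association to u."""
    others = [v for v in range(len(s)) if v != u]
    others.sort(key=lambda v: s[u][v], reverse=True)
    return others[:k]
```

The input `c` can be any correlation matrix, for example one built from co-occurrence counts; two stems that never co-occur can still be clustered together if they correlate with the same third stems.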

17
Automatic Global Analysis

A thesaurus-like structure
Short history
Until the beginning of the 1990s, global analysis
was considered to be a technique which failed to
yield consistent improvements in retrieval
performance with general collections
This perception has changed with the appearance
of modern procedures for global analysis

18
Query Expansion based on a Similarity Thesaurus

Idea by Qiu and Frei (1993)
Similarity thesaurus is based on term to term
relationships rather than on a matrix of
co-occurrence
Terms for expansion are selected based on their
similarity to the whole query rather than on
their similarities to individual query terms
Definition
$N$: total number of documents in the collection
$t$: total number of terms in the collection
$tf_{i,j}$: occurrence frequency of term $k_i$ in the document $d_j$
$t_j$: the number of distinct index terms in the document $d_j$
$itf_j$: the inverse term frequency for document $d_j$, $itf_j = \log(t / t_j)$

19
Similarity Thesaurus

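Each term $k_i$ is associated with a vector $\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$
where $w_{i,j}$ is a weight associated with the index-document pair $[k_i, d_j]$
The relationship between two terms $k_u$ and $k_v$ is
$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$
Note that this is a variation of the correlation measure used for computing scalar association matrices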
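
The weight formula is not in the transcript. The sketch below assumes a commonly cited form: each nonzero entry is $(0.5 + 0.5 \cdot tf_{i,j} / \max_j tf_{i,j}) \cdot itf_j$ with $itf_j = \log(t / t_j)$, and each term vector is scaled to unit length over the documents that contain the term. That normalization choice, and the function names, are assumptions of this sketch.

```python
import math

def term_vectors(tf):
    """tf[i][j] = frequency of term i in document j (t terms, N documents).

    Returns one unit-length vector (w_i1, ..., w_iN) per term.
    """
    t, n_docs = len(tf), len(tf[0])
    # itf_j = log(t / t_j), with t_j the number of distinct terms in d_j.
    itf = []
    for j in range(n_docs):
        t_j = sum(1 for i in range(t) if tf[i][j] > 0)
        itf.append(math.log(t / t_j) if t_j else 0.0)
    vectors = []
    for i in range(t):
        max_tf = max(tf[i])
        raw = [(0.5 + 0.5 * tf[i][j] / max_tf) * itf[j] if tf[i][j] > 0 else 0.0
               for j in range(n_docs)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors.append([x / norm if norm else 0.0 for x in raw])
    return vectors

def similarity_thesaurus(vectors):
    """c[u][v] = dot product of the term vectors for terms u and v."""
    return [[sum(a * b for a, b in zip(ku, kv)) for kv in vectors]
            for ku in vectors]
```

Note the reversed roles compared with ordinary document indexing: here documents act as the features that describe a term, which is why an inverse term frequency replaces the inverse document frequency.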

20
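20
Term weighting vs. Term concept space

[Figure: two diagrams relating document $d_j$ and term $k_i$ through $tf_{ij}$, one for term weighting and one for the term-concept space]

21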
Query Expansion Procedure with Similarity
Thesaurus

1. Represent the query in the concept space by
using the representation of the index terms
2. Compute the similarity sim(q,kv) between each
term kv and the whole query
3. Expand the query with the top r ranked terms
according to sim(q,kv)
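
The three steps above can be sketched as follows, assuming $sim(q, k_v) = \sum_{k_u \in q} w_{u,q} \times c_{u,v}$, where $c$ is a precomputed term-term similarity matrix. The weighting of the added terms (similarity divided by the sum of the original query weights) follows a commonly cited convention and is an assumption here, as are the function and argument names.

```python
def expand_query(query_weights, c, r):
    """query_weights: dict {term index u: weight w_uq}; c: thesaurus matrix.

    Returns a new dict containing the original terms plus the r terms most
    similar to the whole query.
    """
    total = sum(query_weights.values())
    sim = {}
    for v in range(len(c)):
        if v not in query_weights:
            sim[v] = sum(w * c[u][v] for u, w in query_weights.items())
    top = sorted(sim, key=sim.get, reverse=True)[:r]
    expanded = dict(query_weights)
    for v in top:
        # Weight of an added term: its similarity to the query divided by
        # the sum of the original query weights.
        expanded[v] = sim[v] / total if total else 0.0
    return expanded
```

Because candidates are scored against the whole query, a term only moderately related to each query term can outrank a term that is strongly related to just one of them.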

22
Example of Similarity Thesaurus

The distance of a given term kv to the query
centroid QC might be quite distinct from the
distances of kv to the individual query terms

[Figure: terms $k_a$, $k_b$, $k_i$, $k_j$, $k_v$ and the query centroid $Q_C$ for the query $\{k_a, k_b\}$]

23
Query Expansion based on a Similarity Thesaurus

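A document $d_j$ is represented in the term-concept space by
$\vec{d}_j = \sum_{k_i \in d_j} w_{i,j} \times \vec{k}_i$
If the original query $q$ is expanded to include all the $t$ index terms, then the similarity $sim(q, d_j)$ between the document $d_j$ and the query $q$ can be computed as
$sim(q, d_j) \propto \sum_{k_v \in d_j} \sum_{k_u \in q} w_{v,j} \times w_{u,q} \times c_{u,v}$
which is similar to the generalized vector space model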

24
Query Expansion based on a Statistical Thesaurus

Idea by Crouch and Yang (1992)
Use complete link algorithm to produce small and
tight clusters
Use term discrimination value to select terms for
entry into a particular thesaurus class
Term discrimination value
A measure of the change in space separation which
occurs when a given term is assigned to the
document collection
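
One common way to make this measure concrete, offered here as an assumption rather than as the slide's definition, is to compare the density of the document space (the average pairwise similarity) with and without the term. A term whose removal makes the documents look more alike has a positive discrimination value. The function names are hypothetical.

```python
import math

def _cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def space_density(docs):
    """Average pairwise cosine similarity between document vectors."""
    n = len(docs)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(_cosine(docs[i], docs[j]) for i, j in pairs) / len(pairs)

def discrimination_value(docs, k):
    """Density without term k minus density with it."""
    without_k = [[w for i, w in enumerate(d) if i != k] for d in docs]
    return space_density(without_k) - space_density(docs)
```

The sketch needs at least two documents and recomputes all pairwise similarities per term, so it is only practical for small collections.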

25
Term Discrimination Value

Terms
good discriminators (terms with positive
discrimination values)
index terms
indifferent discriminators (near-zero
discrimination values)
thesaurus class
poor discriminators (negative discrimination
values)
term phrases
Document frequency $df_k$
$df_k > n/10$: high-frequency term (poor discriminator)
$df_k < n/100$: low-frequency term (indifferent discriminator)
$n/100 \le df_k \le n/10$: good discriminator
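
The document-frequency thresholds above translate directly into a small classifier. This is a sketch with a hypothetical function name; $n$ is taken to be the number of documents in the collection.

```python
def classify_by_document_frequency(df_k, n):
    """Classify a term from its document frequency df_k in n documents."""
    if df_k > n / 10:
        return "poor"         # high-frequency term
    if df_k < n / 100:
        return "indifferent"  # low-frequency term
    return "good"
```

Using document frequency this way is a cheap proxy for computing discrimination values, which would otherwise require comparing every pair of documents.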

26
Statistical Thesaurus

Term discrimination value theory
the terms which make up a thesaurus class must be
indifferent discriminators
The proposed approach
cluster the document collection into small, tight
clusters
A thesaurus class is defined as the intersection
of all the low frequency terms in that cluster
documents are indexed by the thesaurus classes
the thesaurus classes are weighted by

27
Discussion

Query expansion
useful
little explored technique
Trends and research issues
The combination of local analysis, global
analysis, visual displays, and interactive
interfaces is also a current and important
research problem
