Chapter 5 Query Operations - PowerPoint PPT Presentation

1
Chapter 5 Query Operations
  • Hsin-Hsi Chen
  • Department of Computer Science and Information
    Engineering
  • National Taiwan University

2
Paraphrase Problem in IR
  • Users often input queries containing terms that
    do not match the terms used to index the majority
    of the relevant documents.
  • relevance feedback and query modification
  • reweighting of the query terms based on the
    distribution of these terms in the relevant and
    nonrelevant documents retrieved in response to
    those queries
  • changing the actual terms in the query

3
Query Reformulation
  • basic steps
  • query expansion: expanding the original query
    with new terms
  • feedback information from the user
  • information derived from the set of documents
    initially retrieved (local set of documents)
  • global information derived from document
    collection
  • term reweighting
  • reweighting the terms in the expanded query

4
User Relevance Feedback
  • U (user): Query is submitted
  • S (system): A list of the retrieved documents is
    presented
  • U: The documents are examined and the relevant
    ones are marked
  • S: The important terms/expressions are selected
    from the documents that have been identified as
    relevant
  • The relevance feedback cycle is repeated several
    times

5
User Relevance Feedback (Continued)
  • advantages
  • Shield the user from the details of the query
    reformulation process
  • Break down the whole searching task into a
    sequence of small steps
  • Provide a controlled process designed to
    emphasize some terms (relevant ones) and
    de-emphasize others (non-relevant ones)

6
Query Expansion and Term Reweighting for the
Vector Model
  • basic idea
  • Relevant documents resemble each other
  • Non-relevant documents have term-weight vectors
    which are dissimilar from the ones for the
    relevant documents
  • The reformulated query is moved closer to the
    term-weight vector space of relevant documents

8
Query Expansion and Term Reweighting for the
Vector Model (Continued)
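Dr: set of relevant documents, as identified by
the user, among the retrieved documents
Dn: set of non-relevant documents among the
retrieved documents
Cr: set of relevant documents among all documents
in the collection; the remaining documents in the
collection form the set of non-relevant documents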
9
Query Expansion and Term Reweighting for the
Vector Model (Continued)
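  • when the complete set Cr of relevant documents is
    known, the best query vector is
    $\vec{q}_{opt} = \frac{1}{|C_r|}\sum_{\vec{d}_j \in C_r}\vec{d}_j - \frac{1}{N-|C_r|}\sum_{\vec{d}_j \notin C_r}\vec{d}_j$
    (N: number of documents in the collection)
  • when the set Cr is not known a priori
  • Formulate an initial query
  • Incrementally change the initial query vector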

10
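  • Calculate the modified query $\vec{q}_m$
  • Standard-Rocchio:
    $\vec{q}_m = \alpha\vec{q} + \frac{\beta}{|D_r|}\sum_{\vec{d}_j \in D_r}\vec{d}_j - \frac{\gamma}{|D_n|}\sum_{\vec{d}_j \in D_n}\vec{d}_j$
  • Ide-Regular:
    $\vec{q}_m = \alpha\vec{q} + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\sum_{\vec{d}_j \in D_n}\vec{d}_j$
  • Ide-Dec-Hi:
    $\vec{q}_m = \alpha\vec{q} + \beta\sum_{\vec{d}_j \in D_r}\vec{d}_j - \gamma\,\max_{non\text{-}relevant}(\vec{d}_j)$
  • α, β, γ: tuning constants (usually β > γ)
  • α = 1 (Rocchio, 1971)
  • α = β = γ = 1 (Ide, 1971)
  • γ = 0: positive feedback

The formulas perform both query expansion and
term reweighting.
$\max_{non\text{-}relevant}(\vec{d}_j)$: the highest ranked non-relevant document
The three techniques yield similar performance.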
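A minimal Python sketch of the three reformulation methods named above (Standard Rocchio, Ide Regular, Ide Dec-Hi), assuming their usual textbook definitions and vectors stored as plain lists of term weights. The function name `rocchio`, the `variant` labels, and the default constants are illustrative assumptions, not part of the slides. For Ide Dec-Hi the non-relevant list is assumed to be sorted by rank.

```python
def rocchio(q, relevant, nonrelevant, alpha=1.0, beta=1.0, gamma=1.0,
            variant="standard"):
    """Return the modified query vector q_m (all vectors are equal-length lists)."""
    zero = [0.0] * len(q)

    def vsum(vectors):
        # Component-wise sum of a list of vectors (zero vector if empty).
        return [sum(col) for col in zip(*vectors)] if vectors else list(zero)

    pos = vsum(relevant)
    if variant == "standard":
        # Standard Rocchio: divide the sums by |Dr| and |Dn|.
        if relevant:
            pos = [x / len(relevant) for x in pos]
        neg = vsum(nonrelevant)
        if nonrelevant:
            neg = [x / len(nonrelevant) for x in neg]
    elif variant == "ide_regular":
        # Ide Regular: plain sums, no normalization.
        neg = vsum(nonrelevant)
    elif variant == "ide_dec_hi":
        # Ide Dec-Hi: subtract only the highest ranked non-relevant document
        # (assumed to be the first element of `nonrelevant`).
        neg = list(nonrelevant[0]) if nonrelevant else list(zero)
    else:
        raise ValueError("unknown variant: " + variant)
    return [alpha * qi + beta * p - gamma * n for qi, p, n in zip(q, pos, neg)]


q = [1.0, 0.0, 0.0]
rel = [[0.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
nonrel = [[1.0, 0.0, 1.0], [0.0, 0.0, 1.0]]
print(rocchio(q, rel, nonrel, variant="standard"))
print(rocchio(q, rel, nonrel, variant="ide_dec_hi"))
```

Setting `gamma=0` gives positive feedback only. Terms with a zero weight in `q` that receive a non-zero weight in the result are the expansion terms.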
11
positive relevance feedback: α = β = 1 and γ = 0
12
  • dec-hi method: uses all the relevant information,
    but subtracts only the highest ranked non-relevant
    document
  • feedback with query splitting: addresses two
    problems: (1) the relevant documents identified
    do not form a tight cluster, and (2) non-relevant
    documents are scattered among certain relevant
    ones

homogeneous relevant items
13
Analysis
  • advantages
  • simplicity
  • good results
  • disadvantages
  • No optimality criterion is adopted

14
Term Weighting for the Probabilistic Model
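  • The similarity of a document dj to a query q:
    $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q}\, w_{i,j} \left( \log\frac{P(k_i|R)}{1-P(k_i|R)} + \log\frac{1-P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$

$P(k_i|R)$: the probability of observing the term ki in the
set R of relevant documents
$P(k_i|\bar{R})$: the probability of observing the term ki in the
set $\bar{R}$ of non-relevant documents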
Initial search
15
Initial search
Feedback search
16
Feedback search
No query expansion occurs
17
For small values of |Dr| and |Dr,i| (e.g.,
|Dr| = 1, |Dr,i| = 0), the estimates need an
adjustment factor
Alternative 1
Alternative 2
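A hedged Python sketch of the term weights in the initial and feedback searches, assuming the usual textbook estimates: P(ki|R) = 0.5 and P(ki|R̄) = ni/N for the initial search, and estimates based on |Dr| and |Dr,i| for the feedback search. The `adjust` parameter models the adjustment for small counts (0.5 is one common choice, ni/N another). Which of these corresponds to the slide's Alternative 1 or Alternative 2 is not stated in the transcript, so that mapping is left open. All function and parameter names are illustrative.

```python
import math


def term_weight(p_rel, p_nonrel):
    # log-odds of the term in relevant docs plus inverse log-odds in non-relevant docs
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)


def initial_weight(n_i, N):
    # Initial search: P(ki|R) = 0.5, P(ki|R-bar) = n_i / N
    return term_weight(0.5, n_i / N)


def feedback_weight(dr_i, dr, n_i, N, adjust=0.5):
    # dr: number of retrieved docs judged relevant; dr_i: those containing term ki.
    # The adjustment keeps the estimates away from 0 and 1 when dr is small.
    p_rel = (dr_i + adjust) / (dr + 1)
    p_nonrel = (n_i - dr_i + adjust) / (N - dr + 1)
    return term_weight(p_rel, p_nonrel)


print(initial_weight(10, 1000))
print(feedback_weight(0, 1, 10, 1000))             # |Dr| = 1, |Dr,i| = 0 stays finite
print(feedback_weight(0, 1, 10, 1000, 10 / 1000))  # adjustment by n_i / N
```

Without the adjustment, |Dr,i| = 0 would make P(ki|R) zero and the logarithm undefined, which is the case the slide singles out.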
18
Analysis
  • advantages
  • Feedback process is directly related to the
    derivation of new weights for query terms
  • The term reweighting is optimal under the
    assumptions of term independence and binary
    document indexing
  • disadvantages
  • Document term weights are not considered
  • Weights of terms in previous query formulations
    are disregarded
  • No query expansion is used

19
A Variant of Probabilistic Term Reweighting
  • variant
  • distinct initial search method
  • include within-document frequency weights
  • initial search

Similar to tf-idf scheme
20
C = 0 for automatically indexed collections or for
feedback searching (allows IDF or the relevance
weighting to be the dominant factor)
C > 0 for manually indexed collections (allows the
mere existence of a term within a document to
carry more weight)
K = 0.3 for the initial search of regular-length
documents (documents having many multiple
occurrences of a term)
K = 0.5 for feedback searches
K = 1 for short documents: the within-document
frequency is removed (the within-document
frequency plays a minimum role)
Feedback search
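A small Python sketch of how the constants C and K could enter the initial-search weight. It assumes the commonly cited form F = (C + idf_i) × f̄_ij with the normalized within-document frequency f̄_ij = K + (1 − K) × f_ij / max(f_ij). The formula itself is not present in the transcript, so treat it and the function names as assumptions.

```python
def normalized_tf(f_ij, max_f, K):
    # K = 1 removes the within-document frequency entirely (result is always 1).
    return K + (1 - K) * f_ij / max_f


def initial_factor(idf_i, f_ij, max_f, C=0.0, K=0.3):
    # C = 0 lets idf dominate; C > 0 rewards mere presence of the term.
    return (C + idf_i) * normalized_tf(f_ij, max_f, K)


print(normalized_tf(5, 10, 0.5))
print(initial_factor(2.0, 5, 10, C=0.0, K=0.3))
print(initial_factor(2.0, 5, 10, C=1.0, K=1.0))
```

In a feedback search the idf term would be replaced by the relevance weight of the term, with the same normalized frequency factor.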
21
Analysis
  • advantages
  • The within-document frequencies are considered
  • A normalized version of these frequencies is
    adopted
  • Constants C and K are introduced
  • disadvantages
  • more complex formulation
  • no query expansion

22
Evaluation of relevance feedback
  • Standard evaluation (i.e., recall-precision)
    method is not suitable, because the relevant
    documents used to reweight the query terms move
    to higher ranks.
  • The residual collection method
  • the evaluation of the results compares only the
    residual collections, i.e., the initial run is
    remade minus the documents previously shown to
    the user and this is compared with the feedback
    run minus the same documents

Note that the recall-precision figures for qm tend
to be lower than the figures for the original query
vector q in the residual collection
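The residual collection method can be sketched in Python as follows. The document identifiers, the helper names, and the use of precision at a cutoff as the measure are illustrative assumptions; the point is only that both runs are compared after removing the documents already shown to the user.

```python
def residual_ranking(ranking, shown):
    # Drop every document the user has already seen, keeping the rank order.
    shown = set(shown)
    return [d for d in ranking if d not in shown]


def precision_at(ranking, relevant, k):
    # Fraction of the top-k documents that are relevant.
    top = ranking[:k]
    return sum(1 for d in top if d in relevant) / k if k else 0.0


initial_run = ["d1", "d2", "d3", "d4", "d5", "d6"]
feedback_run = ["d1", "d5", "d2", "d3", "d6", "d4"]
shown = initial_run[:2]          # documents judged by the user
relevant = {"d1", "d5", "d6"}

print(precision_at(residual_ranking(initial_run, shown), relevant, 2))
print(precision_at(residual_ranking(feedback_run, shown), relevant, 2))
```

Because the already judged relevant document `d1` is removed from both runs, the feedback run gets no credit for simply moving it to the top.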
23
Residual Collection with Partial Rank Freezing
  • The previously retrieved items identified as
    relevant are kept frozen and the previously
    retrieved nonrelevant items are simply removed
    from the collection.

Assume 10 documents are relevant.
24
Residual Collection with Partial Rank Freezing
25
Automatic Local Analysis
  • user relevance feedback
  • Known relevant documents contain terms which can
    be used to describe a larger cluster of relevant
    documents with assistance from the user
    (clustering)
  • automatic analysis
  • Obtain a description (in terms of index terms)
    for a larger cluster of relevant documents
    automatically
  • global strategy: a global thesaurus-like structure
    is trained from all documents before querying
  • local strategy: terms from the documents
    retrieved for a given query are selected at query
    time

26
Local Feedback Strategy
  • Internet
  • client site
  • Retrieving the text of 100 Web documents for
    local analysis would take too long
  • server site
  • Analyzing the text of 100 Web documents would
    consume extra CPU time
  • Applications
  • Intranet
  • Specialized document collections, e.g., medical
    document collections

27
Query Expansion-Local Clustering
  • stem
  • V(s): a non-empty subset of words which are
    grammatical variants of each other, e.g., polish,
    polishing, polished
  • A canonical form s of V(s) is called a stem, e.g.,
    polish
  • local document set Dl
  • the set of documents retrieved for a given query
  • local vocabulary Vl (Sl)
  • the set of all distinct words (stems) in the
    local document set

28
local cluster
  • basic concept
  • Expanding the query with terms correlated to the
    query terms
  • The correlated terms are presented in the local
    clusters built from the local document set
  • local clusters
  • association clusters: co-occurrences of pairs of
    terms in documents
  • metric clusters: distance factor between two
    terms
  • scalar clusters: terms with similar neighborhoods
    have some synonymity relationship

29
Association Clusters
  • idea
  • Based on the co-occurrence of stems (or terms)
    inside documents
  • association matrix
  • $f_{s_i,j}$: the frequency of a stem si in a document
    dj ∈ Dl
  • $m = (m_{ij})$, with $m_{ij} = f_{s_i,j}$: an association
    matrix with |Sl| rows and |Dl| columns
  • $s = m\,m^{T}$: a local stem-stem association
    matrix

30
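$c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$: a correlation between the stems su and sv,
an element in $m\,m^{T}$
$s_{u,v} = c_{u,v}$: unnormalized matrix
$s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$: normalized matrix
local association cluster around the stem su:
take the u-th row and return the set Su(n) of the
n largest values $s_{u,v}$ (u ≠ v)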
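A minimal Python sketch of association clusters, assuming the correlation c(u,v) is the sum over local documents of the product of the two stems' frequencies, and the normalized value is c(u,v) / (c(u,u) + c(v,v) − c(u,v)). The sample stems and frequencies are made up for illustration, and every stem is assumed to occur at least once in the local document set.

```python
def association_matrix(freq):
    """freq[stem][j] = frequency of the stem in local document j. Returns c[u][v]."""
    stems = list(freq)
    return {u: {v: sum(a * b for a, b in zip(freq[u], freq[v])) for v in stems}
            for u in stems}


def normalize(c):
    # s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v])
    return {u: {v: c[u][v] / (c[u][u] + c[v][v] - c[u][v]) for v in c[u]}
            for u in c}


def local_cluster(s, u, n):
    # Take the u-th row and return the n stems with the largest values (v != u).
    row = [(v, val) for v, val in s[u].items() if v != u]
    row.sort(key=lambda pair: pair[1], reverse=True)
    return [v for v, _ in row[:n]]


freq = {"polish": [2, 0, 1], "wax": [1, 0, 1], "car": [0, 3, 0]}
c = association_matrix(freq)
print(c["polish"]["wax"])
print(normalize(c)["polish"]["wax"])
print(local_cluster(c, "polish", 1))
```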
31
Metric Clusters
  • idea
  • Consider the distance between two terms in the
    computation of their correlation factor
  • local stem-stem metric correlation matrix
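  • r(ki,kj): the number of words between keywords ki
    and kj in the same document (∞ if they are in
    different documents)
  • $c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i,k_j)}$: metric correlation between stems su and sv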

32
$s_{u,v} = c_{u,v}$: unnormalized matrix
$s_{u,v} = \frac{c_{u,v}}{|V(s_u)| \times |V(s_v)|}$: normalized matrix
local metric cluster around the stem su:
take the u-th row and return the set Su(n) of the
n largest values $s_{u,v}$ (u ≠ v)
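A hedged Python sketch of the unnormalized metric correlation. Two simplifications are assumptions of this sketch rather than statements from the slides: the distance r is taken as the difference in word positions (so adjacent words have distance 1 and no division by zero occurs), and every pair of occurrences in the same document contributes 1/r. The `stem_of` mapping from words to stems is a hypothetical stand-in for a stemmer.

```python
from collections import defaultdict


def metric_correlations(docs, stem_of):
    """docs: list of token lists; stem_of: word -> stem. Returns {(su, sv): c_uv}."""
    c = defaultdict(float)
    for tokens in docs:
        # Positions of all words that belong to the local vocabulary.
        positions = [(i, stem_of[w]) for i, w in enumerate(tokens) if w in stem_of]
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                (i, su), (j, sv) = positions[a], positions[b]
                if su != sv:
                    # Closer occurrences contribute more; pairs in different
                    # documents never meet in this loop (r = infinity).
                    c[(su, sv)] += 1.0 / (j - i)
                    c[(sv, su)] += 1.0 / (j - i)
    return dict(c)


stem_of = {"polish": "polish", "polishing": "polish", "car": "car", "wax": "wax"}
docs = [["polishing", "the", "car"], ["car", "wax"]]
print(metric_correlations(docs, stem_of))
```

The normalized variant would divide each value by |V(su)| × |V(sv)|, the sizes of the two stems' variant sets.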
33
Scalar Clusters
The row corresponding to a specific term in a
term co-occurrence matrix forms its neighborhood
  • idea
  • Two stems with similar neighborhoods have
    synonymity relationship
  • The relationship is indirect or induced by the
    neighborhood
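  • scalar association matrix:
    $s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$, where $\vec{s}_u$ is the row of
    correlation values for the stem su

The direct correlation value for su and sv in the
association matrix may be small
local scalar cluster around the stem su:
take the u-th row and return the set Su(n) of the
n largest values $s_{u,v}$ (u ≠ v)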
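A short Python sketch of the scalar measure, assuming it is the cosine between the rows (neighborhoods) of two stems in a stem-stem correlation matrix. The matrix values are invented to show two stems whose direct correlation is zero but whose neighborhoods are similar.

```python
import math


def scalar_matrix(s):
    """s[u][v]: stem-stem correlation matrix. Returns cosine of row u and row v."""
    stems = list(s)

    def cos(u, v):
        dot = sum(s[u][w] * s[v][w] for w in stems)
        norm_u = math.sqrt(sum(s[u][w] ** 2 for w in stems))
        norm_v = math.sqrt(sum(s[v][w] ** 2 for w in stems))
        return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

    return {u: {v: cos(u, v) for v in stems} for u in stems}


s = {
    "a": {"a": 1, "b": 0, "c": 2},
    "b": {"a": 0, "b": 1, "c": 2},
    "c": {"a": 2, "b": 2, "c": 8},
}
print(s["a"]["b"])                  # direct correlation
print(scalar_matrix(s)["a"]["b"])   # correlation induced by the neighborhoods
```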
34
Interactive Search Formulation
  • neighbors of the query term sv
  • Terms su belonging to clusters associated with sv,
    i.e., su ∈ Sv(n)
  • su is called a searchonym of sv

35
Interactive Search Formulation(Continued)
  • Algorithm
  • For each stem sv ∈ q, select m neighbor stems from
    the cluster Sv(n) and add them to the query
  • Merge normalized and unnormalized clusters:
    normalized clusters tend to group stems that are
    more rare, while unnormalized clusters tend to
    group stems with large frequencies
  • Extension
  • Let su and sv be correlated with a factor cu,v
  • If cu,v is larger than a predefined threshold,
    then a neighbor stem of su can also be
    interpreted as a neighbor stem of sv, and vice
    versa.
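The basic expansion step can be sketched in Python as below. The `clusters` dictionary, mapping each stem sv to its ranked neighbor list Sv(n), is assumed to have been built beforehand by one of the clustering methods; the names and sample data are illustrative.

```python
def expand_query(query_stems, clusters, m):
    """Add up to m neighbor stems from each query stem's cluster to the query."""
    expanded = list(query_stems)
    for sv in query_stems:
        for su in clusters.get(sv, [])[:m]:
            if su not in expanded:   # avoid duplicates in the expanded query
                expanded.append(su)
    return expanded


clusters = {"polish": ["wax", "shine", "car"]}
print(expand_query(["polish"], clusters, 2))
```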
36
Query Expansion through Local Context Analysis
  • local analysis
  • Based on the set of documents retrieved for the
    original query
  • Based on term co-occurrence inside documents
  • Terms closest to individual query terms are
    selected
  • global analysis
  • Based on the whole document collection
  • Based on term co-occurrence inside small contexts
    and phrase structures
  • Terms closest to the whole query are selected

37
Query Expansion through Local Context Analysis
(Continued)
  • candidates
  • noun groups instead of simple keywords
  • single noun, two adjacent nouns, or three
    adjacent nouns
  • query expansion
  • Concepts are selected from the top ranked
    documents (as in local analysis)
  • Passages are used for determining co-occurrence
    (as in global analysis)

38
Query Expansion through Local Context Analysis
(Continued)
  • algorithm
  • Retrieve the top n ranked passages using the
    original query
  • For each concept in the top ranked passages, the
    similarity sim(q,c) between the whole query q and
    the concept c is computed using a variant of
    tf-idf ranking
  • The top m ranked concepts are added to the
    original query q
  • Each concept is assigned a weight 1 − 0.9 × i/m (i:
    rank of the concept)
  • Each term in the original query is assigned a
    weight 2 × original weight

39
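$sim(q,c) = \prod_{k_i \in q} \left( \delta + \frac{\log(f(c,k_i) \times idf_c)}{\log n} \right)^{idf_i}$
n: number of top ranked passages
exponent $idf_i$: emphasizes infrequent query terms
δ = 0.1: constant that avoids a zero value
$f(c,k_i) = \sum_{j=1}^{n} pf_{i,j} \times pf_{c,j}$: correlation between c and ki,
computed as in association clusters but over passages
pfi,j (pfc,j): frequency of ki (c) in the j-th passage
$idf_i = \max(1, \log_{10}(N/np_i)/5)$, $idf_c = \max(1, \log_{10}(N/np_c)/5)$
(both idf values are at least 1)
N: number of passages in the collection
npi: number of passages containing term ki
npc: number of passages containing concept c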
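A hedged Python sketch of the concept ranking in local context analysis, assuming the commonly cited form sim(q,c) = Π (δ + log(f(c,ki) × idf_c) / log n)^idf_i with idf = max(1, log10(N/np)/5). Treating a zero co-occurrence count as contributing only δ is an assumption of this sketch (the logarithm is undefined there), and n is assumed to be greater than 1. All names and sample numbers are illustrative.

```python
import math


def idf(N, np_x):
    # Bounded below by 1, larger for terms or concepts found in few passages.
    return max(1.0, math.log10(N / np_x) / 5)


def lca_similarity(query_terms, term_pf, concept_pf, N, np_term, np_c, delta=0.1):
    """term_pf[k][j], concept_pf[j]: frequencies in the j-th top ranked passage."""
    n = len(concept_pf)
    idf_c = idf(N, np_c)
    sim = 1.0
    for k in query_terms:
        # Co-occurrence of the concept and the query term over the n passages.
        f = sum(pt * pc for pt, pc in zip(term_pf[k], concept_pf))
        inner = math.log(f * idf_c) / math.log(n) if f > 0 else 0.0
        sim *= (delta + inner) ** idf(N, np_term[k])
    return sim


def concept_weight(i, m):
    # Weight of the concept ranked i (1-based) among the m added concepts.
    return 1 - 0.9 * i / m


concept_pf = [1, 0, 2, 0, 0, 0, 0, 0, 0, 0]
term_pf = {"k": [2, 1, 1, 0, 0, 0, 0, 0, 0, 0]}
print(lca_similarity(["k"], term_pf, concept_pf, N=10**6, np_term={"k": 10}, np_c=10))
print(concept_weight(1, 10))
```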
40
Automatic Global Analysis
  • local analysis
  • Extract information from the local set of
    documents (passages) retrieved
  • global analysis
  • Expand the query using information from the whole
    set of documents in the collection
  • Issues
  • How to build the thesaurus
  • How to select the terms for query expansion
  • Query expansion based on similarity thesaurus
  • Query expansion based on statistical thesaurus

41
Similarity Thesaurus
  • How to build the thesaurus
  • Consider term to term relationship instead of
    co-occurrence
  • How to select the terms for query expansion
  • Consider the similarity to the whole query
    instead of individual query terms

42
Concept Space
  • basic idea
  • Each term is indexed by the documents in which it
    appears
  • The role of terms and documents is interchanged
    in the concept space
  • t: the number of terms in the collection
  • N: the number of documents in the collection
  • fi,j: the frequency of term ki in document dj
  • tj: the number of distinct index terms in
    document dj
  • itfj: inverse term frequency for document dj,
    $itf_j = \log\frac{t}{t_j}$ (the more distinct index terms
    dj contains, the less it helps to tell index terms
    apart)
43
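Each term ki is associated with a vector
$\vec{k}_i = (w_{i,1}, w_{i,2}, \ldots, w_{i,N})$
where
$w_{i,j} = \frac{\left(0.5 + 0.5\,\frac{f_{i,j}}{\max_j(f_{i,j})}\right) itf_j}{\sqrt{\sum_{l=1}^{N} \left(0.5 + 0.5\,\frac{f_{i,l}}{\max_l(f_{i,l})}\right)^2 itf_l^2}}$
The relationship between two terms ku and kv is
computed as
$c_{u,v} = \vec{k}_u \cdot \vec{k}_v = \sum_{d_j} w_{u,j} \times w_{v,j}$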
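A Python sketch of the term vectors in the concept space, assuming the weight of term ki in document dj is (0.5 + 0.5 × f_ij / max f_ij) × itf_j normalized to unit length, with itf_j = log(t / t_j), and that the relationship between two terms is the dot product of their vectors. Setting the weight to zero when the term does not occur in the document is an assumption of this sketch. The tiny collection is invented for illustration.

```python
import math


def term_vectors(freq):
    """freq[term][j] = frequency of the term in document j. Returns unit vectors."""
    t = len(freq)                              # number of terms
    N = len(next(iter(freq.values())))         # number of documents
    # t_j: number of distinct index terms in document j
    tj = [sum(1 for f in freq.values() if f[j] > 0) for j in range(N)]
    itf = [math.log(t / tj[j]) if tj[j] else 0.0 for j in range(N)]
    vectors = {}
    for term, f in freq.items():
        max_f = max(f)
        if max_f == 0:
            vectors[term] = [0.0] * N
            continue
        # Assumption: weight is 0 where the term does not occur in the document.
        raw = [(0.5 + 0.5 * f[j] / max_f) * itf[j] if f[j] > 0 else 0.0
               for j in range(N)]
        norm = math.sqrt(sum(x * x for x in raw))
        vectors[term] = [x / norm if norm else 0.0 for x in raw]
    return vectors


def correlation(vectors, u, v):
    # c_{u,v}: dot product of the two term vectors
    return sum(a * b for a, b in zip(vectors[u], vectors[v]))


freq = {"a": [2, 0, 0], "b": [1, 0, 0], "c": [0, 1, 0], "d": [0, 0, 3]}
vecs = term_vectors(freq)
print(correlation(vecs, "a", "b"), correlation(vecs, "a", "c"))
```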
44
Query Expansion using Global Similarity Thesaurus
  • Represent the query in the concept space used for
    representation of the index terms
  • Based on the global similarity thesaurus, compute
    a similarity sim(q,kv) between each term kv
    correlated to the query terms and the whole query
    q

query term
expand term
45
Query Expansion using Global Similarity Thesaurus
  • Expand the query with the top r ranked terms
    according to sim(q,kv)
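A minimal Python sketch of selecting expansion terms from a global similarity thesaurus, assuming sim(q,kv) is the sum over query terms ku of w(u,q) × c(u,v). Skipping terms that are already in the query is an assumption of this sketch, and the weights and correlation values are invented.

```python
def expansion_terms(query_weights, c, r):
    """query_weights: {ku: w_uq}; c[u][v]: term-term correlation. Returns top r."""
    scores = {}
    for u, w_uq in query_weights.items():
        for v, c_uv in c.get(u, {}).items():
            if v not in query_weights:       # assumption: skip existing query terms
                scores[v] = scores.get(v, 0.0) + w_uq * c_uv
    # Rank candidates by their similarity to the whole query, not to single terms.
    ranked = sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:r]


query_weights = {"a": 1.0, "b": 0.5}
c = {
    "a": {"x": 0.2, "y": 0.6, "b": 0.3},
    "b": {"x": 0.8, "y": 0.1, "a": 0.3},
}
print(expansion_terms(query_weights, c, 1))
```

Here `x` is the closest neighbor of the single term `b`, yet `y` is selected because it is closer to the query as a whole.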

46
[Figure: index terms Ki and the expansion term]
47
GVSM vs. Query Expansion
Only the top r ranked terms are used for query
expansion.