1
Chapter 5. Query Operations
  • Wang Jimin
  • Oct. 24

2
  • Formulating one's information need is difficult.
  • A good approach: start with an initial query
    formulation, then improve it through query
    reformulation.
  • Query expansion: expanding the original query
    with new terms.
  • Term reweighting: reweighting the terms in the
    expanded query.

3
Outline
  • User Relevance Feedback
  • Automatic Local Analysis
  • Automatic Global Analysis

4
5.2 User Relevance Feedback
  • Query Expansion and Term Reweighting for the
    Vector Model
  • Term Reweighting for the Probabilistic Model
  • A Variant of Probabilistic Term Reweighting
  • Evaluation of Relevance Feedback Strategies

5
5.2.1 Query Expansion and Term Reweighting for
the Vector Model
  • Basic idea: reformulate the query so that it
    moves closer, in the term-weight vector space,
    to the relevant documents.
  • Some symbols: for a given query q, let Dr be the
    set of relevant retrieved documents, Dn the set
    of non-relevant retrieved documents, and Cr the
    set of all relevant documents in the collection
    D.
  • Their relations can be represented as follows.

6
Relationship
[Venn diagram: the collection D contains Cr, the set
of all relevant documents; the retrieved set splits
into Dr (relevant retrieved, inside Cr) and Dn
(non-relevant retrieved, outside Cr).]
7
Formula
  • If Cr were known in advance, the best query
    would be (P119, eq. 5.1):
    q_opt = (1/|Cr|) sum_{dj in Cr} dj
            - (1/(N - |Cr|)) sum_{dj not in Cr} dj
  • From the user feedback we get the modified query
    q_m, which can be written in three ways (P119,
    middle):
  • Standard Rocchio:
    q_m = alpha q + (beta/|Dr|) sum_{dj in Dr} dj
          - (gamma/|Dn|) sum_{dj in Dn} dj
  • Ide Regular:
    q_m = alpha q + beta sum_{dj in Dr} dj
          - gamma sum_{dj in Dn} dj
  • Ide Dec-Hi:
    q_m = alpha q + beta sum_{dj in Dr} dj
          - gamma max_non_relevant(dj),
    where max_non_relevant(dj) is the highest ranked
    non-relevant document. A code sketch follows.
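A minimal NumPy sketch of the standard Rocchio reformulation, assuming
query and documents are already dense term-weight vectors of equal
length; the function name and the alpha/beta/gamma defaults are
illustrative choices, not prescribed by the slides:

    import numpy as np

    def rocchio(q, rel_docs, nonrel_docs, alpha=1.0, beta=0.75, gamma=0.15):
        """Modified query q_m for the vector model (standard Rocchio)."""
        q_m = alpha * np.asarray(q, dtype=float)
        if rel_docs:                                    # beta/|Dr| * sum = beta * centroid
            q_m += beta * np.mean(rel_docs, axis=0)
        if nonrel_docs:                                 # gamma/|Dn| * sum = gamma * centroid
            q_m -= gamma * np.mean(nonrel_docs, axis=0)
        return np.clip(q_m, 0.0, None)                  # keep term weights non-negative

    # Example over a 4-term vocabulary
    q  = [1.0, 0.0, 0.0, 1.0]
    dr = [[0.9, 0.1, 0.0, 0.8], [0.7, 0.0, 0.2, 0.9]]   # judged relevant (Dr)
    dn = [[0.0, 0.8, 0.9, 0.0]]                         # judged non-relevant (Dn)
    print(rocchio(q, dr, dn))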

8
  • The three techniques above yield similar
    results.
  • A positive feedback strategy uses only the
    relevant documents (gamma = 0).
  • Advantage: simplicity and good results.
  • Disadvantage: no optimality criterion.

9
5.2.2 Term Reweighting for the Probabilistic
Model
  • sim(dj,q) ∝ sum_i w_{i,q} × w_{i,j} ×
    ( log [ P(ki|R) / (1 - P(ki|R)) ]
    + log [ (1 - P(ki|¬R)) / P(ki|¬R) ] )
  • How to obtain the probabilities P(ki|R) and
    P(ki|¬R)?
  • Initial estimates based on assumptions:
  • P(ki|R) = 0.5
  • P(ki|¬R) = ni / N, where ni is the number of
    docs that contain ki
  • Use this initial guess to retrieve an initial
    ranking
  • Improve upon this initial ranking

10
  • Let
  • Dr = set of relevant docs judged by the user
  • Dr,i = subset of Dr containing ki
  • Re-evaluate the estimates:
  • P(ki|R) = |Dr,i| / |Dr|
  • P(ki|¬R) = (ni - |Dr,i|) / (N - |Dr|)
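A small sketch of this estimation cycle, assuming that for each term ki
we know ni and N, and, after feedback, |Dr| and |Dr,i|; variable names
are illustrative:

    import math

    def term_weight(n_i, N, Dr=0, Dri=0):
        """Log-odds weight of term ki. With no feedback (Dr == 0) it
        falls back to the initial guesses P(ki|R) = 0.5 and
        P(ki|~R) = ni/N."""
        if Dr == 0:                                # initial ranking
            p_r, p_nr = 0.5, n_i / N
        else:                                      # re-estimated from feedback
            p_r  = Dri / Dr
            p_nr = (n_i - Dri) / (N - Dr)
        # Degenerate estimates (p_r of 0 or 1) are what the adjustment
        # factors on the next slide are meant to avoid.
        return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

    print(term_weight(n_i=20, N=1000))                 # initial guess
    print(term_weight(n_i=20, N=1000, Dr=10, Dri=7))   # after feedback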

11
  • A small adjustment factor (e.g. 0.5, or ni/N)
    can be added to numerator and denominator to
    avoid degenerate estimates when |Dr| is small.
  • Advantage: term reweighting is optimal under the
    probabilistic model.
  • Disadvantage: no query expansion.

12
5.2.3 A Variant of Probabilistic Term Reweighting
  • In 1983, Croft extended this weighting scheme:
    the initial query and the feedback cycles use
    distinct formulations.
  • The probabilistic formula includes
    within-document frequency weights.
  • The modified formulas are listed on P122.

13
5.2.4 Evaluation of Relevance Feedback Strategies
  • Simple approach: compare the recall-precision
    figures for the original query q and the
    modified query q_m. But this evaluation is
    unrealistic, since the user has already marked
    those documents as relevant.
  • Another approach: evaluate performance on the
    residual collection only (the documents already
    judged are removed). The measured figures tend
    to look worse, because the highly ranked
    relevant documents have been removed from the
    collection.

14
5.3 Automatic Local Analysis
  • 5.3.1 Query Expansion Through Local Clustering
  • proposed by Attar and Fraenkel in 1977
  • 5.3.2 Query Expansion Through Local Context
    Analysis
  • combines techniques from local and global
    analysis; proposed by Xu and Croft in 1996

15
5.3.1 Query Expansion Through Local Clustering
  • Basic idea: build an association matrix that
    quantifies the term correlations.
  • Some symbols:
  • V(s) = set of grammatical variants of a stem s
  • D_l = set of documents retrieved for a given
    query q
  • V_l = set of distinct words in D_l
  • S_l = set of distinct stems in V_l

16
Association Clusters
  • Based on the co-occurrence of stems inside
    documents.
  • Definition: let m = (m_ij) be an association
    matrix with |S_l| rows and |D_l| columns, where
    m_ij = f_{i,j}, the frequency of stem s_i in
    document d_j. Then s = m m^T is a local
    stem-stem association matrix whose entries are
    s_{u,v} = c_{u,v} = sum_{dj in D_l}
    f_{s_u,j} × f_{s_v,j}. A sketch follows.
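A minimal sketch of this construction, using whole words in place of
stems (stemming omitted) and an invented three-document local set:

    import numpy as np

    D_l = ["query expansion improves query recall",
           "term reweighting improves precision",
           "query reformulation uses feedback"]

    stems = sorted({w for doc in D_l for w in doc.split()})
    # m_ij = frequency of stem s_i in document d_j
    m = np.array([[doc.split().count(s) for doc in D_l] for s in stems])

    s = m @ m.T          # s_{u,v} = sum_j f_{u,j} * f_{v,j}
    u = stems.index("query")
    n = 3                # the n largest values in row u form S_u(n)
    print([stems[v] for v in np.argsort(s[u])[::-1][:n]])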

17
  • s_{u,v} can be normalized (P125, eq. 5.7):
    s_{u,v} = c_{u,v} / (c_{u,u} + c_{v,v} - c_{u,v})
  • Definition: consider the u-th row of matrix s
    and take its n largest values s_{u,v}; the
    result S_u(n) is called the local association
    cluster around the stem s_u.
  • Using keywords instead of stems works in the
    same way.

18
Metric Clusters
  • Consider the distance r(ki, kj) between the
    terms ki and kj, e.g. the number of words
    between them in the same document.
  • The metric correlation c_{u,v} of the stems s_u
    and s_v can be expressed as
  • c_{u,v} = sum_{ki in V(s_u)} sum_{kj in V(s_v)}
    1 / r(ki, kj)
  • Let s_{u,v} = c_{u,v} to get a stem-stem metric
    correlation matrix. A sketch follows.
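A sketch of the metric correlation, assuming r(ki,kj) is the number of
positions between two occurrences in the same document and that each
stem has a single-word variant set V(s):

    def metric_correlation(doc_tokens, u, v):
        """c_{u,v} = sum over occurrence pairs of 1 / r(ki,kj)."""
        c = 0.0
        pos_u = [i for i, w in enumerate(doc_tokens) if w == u]
        pos_v = [i for i, w in enumerate(doc_tokens) if w == v]
        for i in pos_u:
            for j in pos_v:
                if i != j:
                    c += 1.0 / abs(i - j)   # nearby occurrences count more
        return c

    doc = "query expansion helps query reformulation".split()
    print(metric_correlation(doc, "query", "expansion"))  # 1/1 + 1/2 = 1.5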

19
Scalar Clusters
  • Basic idea: two stems with similar neighborhoods
    have some synonymity relationship.
  • Use a scalar measure, such as the cosine, to
    compare the two neighborhood vectors s_u(n) and
    s_v(n), as sketched below.
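A one-function sketch of the scalar comparison, using the cosine of two
neighborhood vectors (rows of the association matrix s):

    import numpy as np

    def scalar_sim(s_u, s_v):
        s_u, s_v = np.asarray(s_u, float), np.asarray(s_v, float)
        return s_u @ s_v / (np.linalg.norm(s_u) * np.linalg.norm(s_v))

    # Values near 1 indicate similar neighborhoods, hence synonymity
    print(scalar_sim([5, 2, 0, 1], [4, 1, 0, 2]))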

20
Interactive Search Formulation
  • Neighbor stems have a synonymity relationship,
    but they are not necessarily synonyms in the
    grammatical sense.
  • Query expansion: add neighbor stems, taken from
    normalized clusters, unnormalized clusters, or a
    merge of the two.

21
5.3.2 Query Expansion Through Local Context
Analysis
  • Global analysis: terms are treated as concepts,
    and a thesaurus provides a concept relation
    structure.
  • The thesaurus is used for query expansion, and
    it can also serve as a browsing tool.
  • Instead of simple keywords, concepts (noun
    groups) are introduced; the query is matched
    against concepts, not individual terms.

22
Local Context Analysis procedure
  • Step 1: retrieve the top n ranked passages using
    the original query.
  • Step 2: compute sim(c,q) between each concept c
    in those passages and the query q (for the
    detailed formula, see P130).
  • Step 3: add the top m ranked concepts to the
    query q. A simplified sketch follows.
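A deliberately simplified sketch of the procedure: the real sim(c,q) on
P130 is an idf-weighted formula over query terms, so the plain
co-occurrence count used here is only a stand-in:

    from collections import Counter

    def lca_expand(query_terms, ranked_passages, n=10, m=3):
        top = ranked_passages[:n]                     # step 1: top n passages
        scores = Counter()
        for passage in top:
            words = set(passage.split())
            if any(t in words for t in query_terms):  # passage matches q
                for concept in words - set(query_terms):
                    scores[concept] += 1              # step 2: score concepts
        expansion = [c for c, _ in scores.most_common(m)]
        return list(query_terms) + expansion          # step 3: add top m

    passages = ["query expansion local context", "local analysis ranks concepts"]
    print(lca_expand(["query"], passages, n=2, m=2))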

23
Automatic Global Analysis
  • Motivation
  • Methods of local analysis extract information
    from the local set of retrieved documents to
    expand the query.
  • An alternative is to expand the query using
    information from the whole set of documents.
  • Until the beginning of the 1990s these
    techniques failed to yield consistent
    improvements in retrieval performance.
  • Now, with modern variants, sometimes based on a
    thesaurus, this perception has changed.

24
Automatic Global Analysis
  • There are two modern variants based on a
    thesaurus-like structure built using all
    documents in collection
  • Query Expansion based on a Similarity Thesaurus
  • Query Expansion based on a Statistical Thesaurus

25
Similarity Thesaurus
  • The similarity thesaurus is based on term-to-term
    relationships rather than on a matrix of
    co-occurrence.
  • These relationships are not derived directly
    from the co-occurrence of terms inside
    documents.
  • They are obtained by considering the terms as
    concepts in a concept space.
  • In this concept space, each term is indexed by
    the documents in which it appears.
  • Terms assume the original role of documents,
    while documents are interpreted as indexing
    elements.

26
Similarity Thesaurus
  • The following definitions establish the proper
    framework:
  • t = number of terms in the collection
  • N = number of documents in the collection
  • f_{i,j} = frequency of occurrence of the term ki
    in the document dj
  • t_j = number of distinct terms (vocabulary size)
    of document dj
  • itf_j = inverse term frequency of document dj

27
Similarity Thesaurus
  • Inverse term frequency of document dj:
    itf_j = log (t / t_j)
  • To each term ki is associated a vector
    k_i = (w_{i,1}, w_{i,2}, ..., w_{i,N})

28
Similarity Thesaurus
  • where w_{i,j} is a weight associated with the
    term-document pair [ki, dj]. These weights
    combine the normalized frequency of ki in dj
    with itf_j:
    w_{i,j} = ( (0.5 + 0.5 f_{i,j}/maxf(ki)) × itf_j )
    / sqrt( sum_{l=1..N} ( (0.5 + 0.5
    f_{i,l}/maxf(ki)) × itf_l )^2 ),
    where maxf(ki) is the maximum of f_{i,l} over
    all documents d_l.

29
Similarity Thesaurus
  • The relationship between two terms ku and kv is
    computed as a correlation factor c_{u,v} given by
    c_{u,v} = k_u · k_v = sum_{dj} w_{u,j} × w_{v,j}
  • The global similarity thesaurus is built by
    computing the correlation factor c_{u,v} for
    every pair of index terms [ku, kv] in the
    collection. A sketch on a toy collection
    follows.
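A sketch of the whole construction on a toy two-document collection;
the exact normalization of w_{i,j} follows the reading of the formula
above and may differ in detail from the textbook:

    import numpy as np

    docs = [["d", "d", "a", "b", "c", "a", "b", "c"],
            ["e", "c", "e", "a", "a", "d"]]
    terms = sorted({w for d in docs for w in d})
    t, N = len(terms), len(docs)

    f = np.array([[d.count(k) for d in docs] for k in terms], float)  # f_{i,j}
    itf = np.log(t / np.array([len(set(d)) for d in docs]))           # itf_j
    tf = 0.5 + 0.5 * f / f.max(axis=1, keepdims=True)                 # normalized tf
    w = np.where(f > 0, tf * itf, 0.0)                                # weight only where ki occurs
    w /= np.linalg.norm(w, axis=1, keepdims=True)                     # unit term vectors

    c = w @ w.T                                    # c_{u,v} for every term pair
    print(dict(zip(terms, c[terms.index("a")])))   # correlations of term 'a'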

30
Similarity Thesaurus
  • This computation is expensive.
  • However, the global similarity thesaurus has to
    be computed only once, and it can be updated
    incrementally.

31
Query Expansion based on a Similarity Thesaurus
  • Query expansion is done in three steps as
    follows
  • Represent the query in the concept space used for
    representation of the index terms
  • Based on the global similarity thesaurus, compute
    a similarity sim(q,kv) between each term kv
    correlated to the query terms and the whole query
    q.
  • Expand the query with the top r ranked terms
    according to sim(q,kv)

32
Query Expansion - step one
  • To the query q is associated a vector q in the
    term-concept space, given by
    q = sum_{ki in q} w_{i,q} × k_i
  • where w_{i,q} is a weight associated with the
    term-query pair [ki, q]

33
Query Expansion - step two
  • Compute the similarity sim(q,kv) between each
    term kv and the user query q:
    sim(q,kv) = q · k_v = sum_{ku in q} w_{u,q} × c_{u,v}
  • where c_{u,v} is the correlation factor from the
    similarity thesaurus

34
Query Expansion - step three
  • Add the top r ranked terms according to
    sim(q,kv) to the original query q to form the
    expanded query q'.
  • To each expansion term kv in the query q' is
    assigned a weight w_{v,q'} given by
    w_{v,q'} = sim(q,kv) / sum_{ku in q} w_{u,q}
  • The expanded query q' is then used to retrieve
    new documents for the user. A sketch of the
    three steps follows.
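A sketch of the three steps, assuming a precomputed correlation matrix
c (e.g. from the earlier toy-collection sketch) and simple query
weights; all values here are stand-ins:

    import numpy as np

    terms = ["a", "b", "c", "d", "e"]
    c = np.eye(5)                 # stand-in thesaurus; use real c_{u,v} here
    c[0, 2] = c[2, 0] = 0.8       # pretend 'a' and 'c' are strongly correlated

    w_q = {"a": 1.0, "e": 2.0}    # step 1: the query as a concept vector
    sim = {kv: sum(w * c[terms.index(ku), i]       # step 2: sim(q, kv)
                   for ku, w in w_q.items())
           for i, kv in enumerate(terms)}
    r = 3
    expanded = sorted(sim, key=sim.get, reverse=True)[:r]  # step 3: top r terms
    w_new = {kv: sim[kv] / sum(w_q.values()) for kv in expanded}
    print(expanded, w_new)        # ['e', 'a', 'c'] with their weights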

35
Query Expansion Sample
  • Doc1 = D, D, A, B, C, A, B, C
  • Doc2 = E, C, E, A, A, D
  • Doc3 = D, C, B, B, D, A, B, C, A
  • Doc4 = A
  • c(A,A) = 10.991
  • c(A,C) = 10.781
  • c(A,D) = 10.781
  • ...
  • c(D,E) = 10.398
  • c(B,E) = 10.396
  • c(E,E) = 10.224

36
Query Expansion Sample
  • Query: q = A E E
  • sim(q,A) = 24.298
  • sim(q,C) = 23.833
  • sim(q,D) = 23.833
  • sim(q,B) = 23.830
  • sim(q,E) = 23.435
  • New query: q' = A C D E E
  • w(A,q') = 6.88
  • w(C,q') = 6.75
  • w(D,q') = 6.75
  • w(E,q') = 6.64

37
Query Expansion Based on a Global Statistical
Thesaurus
  • The global thesaurus is composed of classes that
    group correlated terms in the context of the
    whole collection.
  • Such correlated terms can then be used to expand
    the original user query.
  • These terms must be low-frequency terms.
  • However, it is difficult to cluster low-frequency
    terms directly.
  • To circumvent this problem, we cluster documents
    into classes instead, and use the low-frequency
    terms in these documents to define our thesaurus
    classes.
  • The clustering algorithm must produce small and
    tight clusters.

38
Complete link algorithm
  • This is a document clustering algorithm that
    produces small and tight clusters:
  • 1. Place each document in a distinct cluster.
  • 2. Compute the similarity between all pairs of
    clusters.
  • 3. Determine the pair of clusters [Cu, Cv] with
    the highest inter-cluster similarity.
  • 4. Merge the clusters Cu and Cv.
  • 5. Verify a stop criterion. If this criterion is
    not met, go back to step 2.
  • 6. Return a hierarchy of clusters.
  • The similarity between two clusters is defined
    as the minimum of the similarities between all
    pairs of inter-cluster documents. A sketch
    follows.
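A sketch of the algorithm over a plain similarity matrix, with the stop
criterion taken to be a similarity threshold (one reasonable choice
among several):

    import itertools

    def complete_link(sim, threshold):
        """sim[i][j] is the similarity between documents i and j."""
        clusters = [[i] for i in range(len(sim))]     # 1. one doc per cluster
        merges = []
        while len(clusters) > 1:
            best, pair = -1.0, None                   # 2.-3. best pair by
            for a, b in itertools.combinations(range(len(clusters)), 2):
                s = min(sim[i][j] for i in clusters[a] for j in clusters[b])
                if s > best:
                    best, pair = s, (a, b)
            if best < threshold:                      # 5. stop criterion
                break
            a, b = pair
            merged = clusters[a] + clusters[b]        # 4. merge Cu and Cv
            merges.append((merged, best))
            clusters = [cl for k, cl in enumerate(clusters) if k not in (a, b)]
            clusters.append(merged)
        return clusters, merges                       # 6. the hierarchy

    sim = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
    print(complete_link(sim, threshold=0.5))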

39
Selecting the terms that compose each class
  • Given the document cluster hierarchy for the
    whole collection, the terms that compose each
    class of the global thesaurus are selected as
    follows.
  • Obtain three parameters from the user:
  • TC = threshold class
  • NDC = number of documents in a class
  • MIDF = minimum inverse document frequency

40
Selecting the terms that compose each class
  • Use the parameter TC as the threshold value for
    determining which document clusters will be used
    to generate thesaurus classes.
  • sim(Cu,Cv) must surpass this threshold for the
    documents in the clusters Cu and Cv to be
    selected as sources of terms for a thesaurus
    class.

41
Selecting the terms that compose each class
  • Use the parameter NDC as a limit on the size
    (number of documents) of the clusters to be
    considered.
  • A low value of NDC might restrict the selection
    to the smaller cluster Cu+v.

42
Selecting the terms that compose each class
  • Consider the set of documents in each document
    cluster pre-selected above.
  • Only the lower-frequency terms in these
    documents are used as sources for the thesaurus
    classes.
  • The parameter MIDF defines the minimum inverse
    document frequency a term must have to be
    selected for a thesaurus class. A selection
    sketch follows.
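A sketch of the selection, assuming idf values are precomputed and each
merge from the clustering step carries its similarity; all names are
illustrative:

    def thesaurus_classes(merges, idf, TC, NDC, MIDF):
        """merges: list of (cluster_docs, similarity) pairs."""
        classes = []
        for docs, s in merges:
            if s <= TC or len(docs) > NDC:   # TC gate and NDC size limit
                continue
            terms = {w for doc in docs for w in doc}
            cls = sorted(w for w in terms if idf[w] >= MIDF)  # MIDF filter
            if cls:
                classes.append(cls)
        return classes

    merges = [([["a", "b", "c"], ["b", "c", "e"]], 0.99)]
    idf = {"a": 0.0, "b": 0.3, "c": 0.12, "e": 0.6}
    print(thesaurus_classes(merges, idf, TC=0.90, NDC=2, MIDF=0.2))  # [['b', 'e']]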

43
Query Expansion based on a Statistical Thesaurus
  • Use the thesaurus classes for query expansion.
  • Compute an average term weight wtc for each
    thesaurus class C.

44
Query Expansion based on a Statistical Thesaurus
  • wtc can then be used to compute a thesaurus
    class weight wc (see the textbook for both
    formulas).

45
Query Expansion Sample
  • Doc1 = D, D, A, B, C, A, B, C
  • Doc2 = E, C, E, A, A, D
  • Doc3 = D, C, B, B, D, A, B, C, A
  • Doc4 = A
  • q = A E E
  • sim(1,3) = 0.99   sim(1,2) = 0.40
    sim(2,3) = 0.29   sim(4,1) = 0.00
    sim(4,2) = 0.00   sim(4,3) = 0.00
  • idf(A) = 0.0   idf(B) = 0.3   idf(C) = 0.12
    idf(D) = 0.12   idf(E) = 0.60
  • TC = 0.90, NDC = 2.00, MIDF = 0.2
  • q' = A B E E
46
Query Expansion based on a Statistical Thesaurus
  • Problems with this approach:
  • initialization of the parameters TC, NDC and
    MIDF
  • TC depends on the collection
  • Inspection of the cluster hierarchy is almost
    always necessary to assist with the setting of
    TC
  • A high value of TC might yield classes with too
    few terms

47
5.4 Conclusion
  • A thesaurus is an efficient method to expand
    queries.
  • The computation is expensive, but it is executed
    only once.
  • Query expansion based on a similarity thesaurus
    can use high-frequency terms to expand the
    query.
  • Query expansion based on a statistical thesaurus
    needs well-defined parameters.

48
5.5 Trends and Research Issues
  • Graphical interfaces that display the docs as
    points in a 2D or 3D space.
  • Building a thesaurus to get a hierarchy of
    concepts, like Yahoo!
  • The combination of several techniques is also a
    current and important research problem for MIR.

49
Chapter 5. Query Operations
  • User Relevance Feedback
  • Automatic Local Analysis
  • Automatic Global Analysis

50
  • Thank you!