Modern Information Retrieval Chapter 5 Query Operations - PowerPoint PPT Presentation

1
Modern Information Retrieval Chapter 5 Query
Operations
  • ??????
  • ??89522022

2
Introduction
  • It is difficult to formulate queries which are
    well designed for retrieval purposes.
  • Improving the initial query formulation through
    query expansion and term reweighting.
  • Approaches based on:
  • feedback information from the user
  • information derived from the set of documents
    initially retrieved (called the local set of
    documents)
  • global information derived from the document
    collection

3
User Relevance Feedback
  • User is presented with a list of the retrieved
    documents and, after examining them, marks those
    which are relevant.
  • Two basic operations:
  • Query expansion: addition of new terms from
    relevant documents
  • Term reweighting: modification of term weights
    based on the user's relevance judgements

4
User Relevance Feedback
  • User relevance feedback is used to:
  • expand queries with the vector model
  • reweight query terms with the probabilistic model
  • reweight query terms with a variant of the
    probabilistic model

5
Vector Model
  • Define
  • Weight: let $k_i$ be a generic index term in the
    set $K = \{k_1, \ldots, k_t\}$. A weight $w_{i,j} > 0$ is
    associated with each index term $k_i$ of a document
    $d_j$.
  • Document index term vector: the document $d_j$ is
    associated with an index term vector $\vec{d}_j$
    represented by $\vec{d}_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j})$

6
Vector Model (contd)
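  • Define (from Chapter 2)
  • Term weighting: $w_{i,j} = f_{i,j} \times idf_i$
  • Normalized frequency: $f_{i,j} = \frac{freq_{i,j}}{\max_l freq_{l,j}}$,
    where $freq_{i,j}$ is the raw frequency of $k_i$ in
    the document $d_j$
  • Inverse document frequency for $k_i$: $idf_i = \log \frac{N}{n_i}$,
    where $N$ is the number of documents in the
    collection and $n_i$ is the number of documents
    which contain $k_i$
  • Query term weight: $w_{i,q} = \left(0.5 + \frac{0.5 \, freq_{i,q}}{\max_l freq_{l,q}}\right) \times \log \frac{N}{n_i}$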
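
As a rough illustration of the Chapter 2 document weighting (normalized frequency times inverse document frequency), the sketch below computes the weights for a single document. The function name, the dictionary input `doc_freq`, and the use of the natural logarithm are assumptions made for this example; they do not come from the slides.

```python
import math
from collections import Counter

def tf_idf_weights(doc_tokens, doc_freq, n_docs):
    """Weight w_ij of every index term in one document d_j.

    doc_tokens: list of index terms occurring in d_j (must be non-empty)
    doc_freq:   hypothetical dict term -> n_i (documents containing the term)
    n_docs:     N, the number of documents in the collection
    """
    counts = Counter(doc_tokens)       # raw frequencies freq_ij
    max_freq = max(counts.values())    # frequency of the most frequent term
    return {
        term: (freq / max_freq) * math.log(n_docs / doc_freq[term])
        for term, freq in counts.items()
    }

# Toy data, invented for the example.
weights = tf_idf_weights(["query", "query", "term"],
                         {"query": 10, "term": 100}, n_docs=1000)
```

In this toy document the rarer and more frequent term "query" receives the larger weight.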

7
Vector Model (contd)
  • Define
  • Query vector: the query vector $\vec{q}$ is defined as
    $\vec{q} = (w_{1,q}, w_{2,q}, \ldots, w_{t,q})$
  • $D_r$: set of relevant documents identified by the
    user
  • $D_n$: set of non-relevant documents among the
    retrieved documents
  • $C_r$: set of relevant documents among all documents
    in the collection
  • $\alpha, \beta, \gamma$: tuning constants

8
Query Expansion and Term Reweighting for the
Vector Model
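  • Ideal case: the complete set $C_r$ of relevant
    documents to a given query $q$ is known in advance
  • The best query vector is then given by
    $\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j$,
    where $N$ is the number of documents in the
    collection
  • The relevant documents $C_r$ are not known a priori;
    they are exactly what we are looking for.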

9
Query Expansion and Term Reweighting for the
Vector Model (contd)
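  • Three classic, similar ways to calculate the
    modified query $\vec{q}_m$:
  • Standard_Rocchio:
    $\vec{q}_m = \alpha \vec{q} + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j$
  • Ide_Regular:
    $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j$
  • Ide_Dec_Hi:
    $\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \max_{non\text{-}relevant}(\vec{d}_j)$,
    where $\max_{non\text{-}relevant}(\vec{d}_j)$ is the highest
    ranked non-relevant document
  • $D_r$ and $D_n$ are the sets of documents which the
    user judged relevant and non-relevant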
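
A minimal sketch of the three feedback formulas, assuming documents and queries are plain lists of term weights of equal length. The function names, the `variant` strings, and the default value 1.0 for the three tuning constants are choices made for this example only. For Ide_Dec_Hi the list `d_n` is assumed to be in ranking order, so its first entry is the highest-ranked non-relevant document.

```python
def vec_sum(docs, t):
    """Component-wise sum of t-dimensional vectors (all zeros if docs is empty)."""
    return [sum(d[i] for d in docs) for i in range(t)]

def modified_query(q, d_r, d_n, variant="standard_rocchio",
                   alpha=1.0, beta=1.0, gamma=1.0):
    """Return the modified query vector q_m for one round of feedback."""
    t = len(q)
    pos = vec_sum(d_r, t)
    if variant == "standard_rocchio":
        # Rocchio averages over each judged set.
        pos = [x / len(d_r) for x in pos] if d_r else pos
        neg = [x / len(d_n) for x in vec_sum(d_n, t)] if d_n else [0.0] * t
    elif variant == "ide_regular":
        neg = vec_sum(d_n, t)
    elif variant == "ide_dec_hi":
        # Only the highest-ranked non-relevant document is subtracted.
        neg = list(d_n[0]) if d_n else [0.0] * t
    else:
        raise ValueError("unknown variant: " + variant)
    return [alpha * q[i] + beta * pos[i] - gamma * neg[i] for i in range(t)]

# Toy vectors over three index terms, invented for the example.
q = [1.0, 0.0, 0.0]
d_r = [[0.0, 2.0, 0.0], [0.0, 4.0, 0.0]]   # judged relevant
d_n = [[0.0, 0.0, 2.0], [0.0, 0.0, 6.0]]   # judged non-relevant, ranked
q_rocchio = modified_query(q, d_r, d_n)
q_ide = modified_query(q, d_r, d_n, variant="ide_regular")
q_dec_hi = modified_query(q, d_r, d_n, variant="ide_dec_hi")
```

The second term, absent from the original query, gains a positive weight in all three variants, which is the query expansion effect; the third term is pushed negative because it occurs only in non-relevant documents.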

10
Term Reweighting for the Probabilistic Model
  • Similarity: the correlation between the vectors
    $\vec{d}_j$ and $\vec{q}$; this correlation can be quantified as
    $sim(d_j, q) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j| \times |\vec{q}|}$
  • The probabilistic model ranks documents according
    to the probabilistic ranking principle.
  • $P(k_i|R)$: the probability of observing the term
    $k_i$ in the set $R$ of relevant documents
  • $P(k_i|\bar{R})$: the probability of observing the term
    $k_i$ in the set $\bar{R}$ of non-relevant documents

11
Term Reweighting for the Probabilistic Model
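  • The similarity of a document $d_j$ to a query $q$ can
    be expressed as
    $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \left( \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right)$
  • For the initial search, the equation above is
    estimated under the assumptions $P(k_i|R) = 0.5$
    and $P(k_i|\bar{R}) = \frac{n_i}{N}$, where $n_i$ is the number of
    documents which contain the index term $k_i$ and
    $N$ is the number of documents in the collection.
    This gives
    $sim_{initial}(d_j, q) = \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \frac{N - n_i}{n_i}$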

12
Term Reweighting for the Probabilistic Model
(contd)
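  • For the feedback search
  • $P(k_i|R)$ and $P(k_i|\bar{R})$ can be approximated as
    $P(k_i|R) = \frac{|D_{r,i}|}{|D_r|}$ and $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}|}{N - |D_r|}$,
    where $D_r$ is the set of relevant documents
    according to the user judgement and $D_{r,i}$ is the
    subset of $D_r$ composed of the documents which
    contain the term $k_i$
  • The similarity of $d_j$ to $q$ becomes
    $sim(d_j, q) = \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \log \left[ \frac{|D_{r,i}|}{|D_r| - |D_{r,i}|} \div \frac{n_i - |D_{r,i}|}{N - |D_r| - (n_i - |D_{r,i}|)} \right]$
  • No query expansion occurs in this procedure.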

13
Term Reweighting for the Probabilistic Model
(contd)
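  • Adjustment factor
  • Because $|D_r|$ and $|D_{r,i}|$ are often quite small,
    a 0.5 adjustment factor is added to the estimates
    of $P(k_i|R)$ and $P(k_i|\bar{R})$:
    $P(k_i|R) = \frac{|D_{r,i}| + 0.5}{|D_r| + 1}$, $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + 0.5}{N - |D_r| + 1}$
  • Alternative adjustment factor $n_i/N$:
    $P(k_i|R) = \frac{|D_{r,i}| + n_i/N}{|D_r| + 1}$, $P(k_i|\bar{R}) = \frac{n_i - |D_{r,i}| + n_i/N}{N - |D_r| + 1}$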
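
The term reweighting for the probabilistic model can be sketched as follows, assuming the log-odds form of the similarity and the adjusted estimates with a 0.5 factor (pass `adjust=n_i / N` for the alternative factor). The function names and arguments are hypothetical, and the code assumes 0 < n_i < N.

```python
import math

def initial_term_weight(n_i, N):
    """Initial search: P(k_i|R) = 0.5 and P(k_i|not R) = n_i / N."""
    return math.log((N - n_i) / n_i)

def feedback_term_weight(n_i, N, dr_i, dr, adjust=0.5):
    """Feedback search with an adjustment factor.

    n_i:  number of documents containing term k_i
    N:    number of documents in the collection
    dr_i: |D_r,i|, relevant documents that contain k_i
    dr:   |D_r|, documents the user judged relevant
    """
    p_rel = (dr_i + adjust) / (dr + 1)
    p_non = (n_i - dr_i + adjust) / (N - dr + 1)
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_non) / p_non)

# Toy counts, invented for the example: the term occurs in 10 of 1000
# documents and in 3 of the 4 documents judged relevant.
w_initial = initial_term_weight(10, 1000)
w_feedback = feedback_term_weight(10, 1000, dr_i=3, dr=4)
```

Each value multiplies the product of the query and document weights of the term inside the similarity sum. Only terms already in the query are reweighted, so the query itself is not expanded.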

14
A Variant of Probabilistic Term Reweighting
  • In 1983, Croft extended the above weighting scheme
    by suggesting distinct initial search methods and
    by adapting the probabilistic formula to include
    within-document frequency weights.
  • The variant of probabilistic term reweighting:
    $sim(d_j, q) \sim \sum_{i=1}^{t} w_{i,q} \, w_{i,j} \, F_{i,j,q}$,
    where $F_{i,j,q}$ is a factor which depends on the
    triple $[k_i, d_j, q]$.

15
A Variant of Probabilistic Term Reweighting
(contd)
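  • Using distinct formulations for the initial search
    and the feedback searches
  • Initial search: $F_{i,j,q} = (C + idf_i) \, \bar{f}_{i,j}$
    with $\bar{f}_{i,j} = K + (1 - K) \frac{f_{i,j}}{\max(f_{i,j})}$, where $\bar{f}_{i,j}$ is a
    normalized within-document frequency. C and K
    should be adjusted according to the collection.
  • Feedback searches:
    $F_{i,j,q} = \left( C + \log \frac{P(k_i|R)}{1 - P(k_i|R)} + \log \frac{1 - P(k_i|\bar{R})}{P(k_i|\bar{R})} \right) \bar{f}_{i,j}$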

16
Automatic Local Analysis
  • Clustering: the grouping of documents which
    satisfy a set of common properties.
  • Attempting to obtain a description for a larger
    cluster of relevant documents automatically: the
    goal is to identify terms which are related to the
    query terms, such as
  • Synonyms
  • Stemming variations
  • Terms with a distance of at most k words from a
    query term

17
Automatic Local Analysis (contd)
  • In a local strategy, the documents retrieved for
    a given query q are examined at query time to
    determine terms for query expansion.
  • Two basic types of local strategy:
  • Local clustering
  • Local context analysis
  • Local strategies are suited to intranet
    environments, not to Web documents.

18
Query Expansion Through Local Clustering
  • Local feedback strategies expand the query with
    terms correlated to the query terms. Such
    correlated terms are those present in local
    clusters built from the local document set.

19
Query Expansion Through Local Clustering (contd)
  • Definition
  • Stem
  • Let V(s) be a non-empty subset of words which are
    grammatical variants of each other. A canonical
    form s of V(s) is called a stem.
  • Example
  • If V(s) = {polish, polishing, polished} then
    s = polish
  • $D_l$: the local document set, the set of documents
    retrieved for a given query q
  • Strategies for building local clusters:
  • Association clusters
  • Metric clusters
  • Scalar clusters

20
Association clusters
  • An association cluster is based on the
    co-occurrence of stems inside the documents
  • Definition
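  • $f_{s_i,j}$: the frequency of a stem $s_i$ in a document
    $d_j$, $d_j \in D_l$
  • Let $\vec{m} = (m_{ij})$ be an association matrix with $|S_l|$
    rows and $|D_l|$ columns, where $m_{ij} = f_{s_i,j}$ and $S_l$
    is the set of distinct stems in $D_l$.
  • The matrix $\vec{s} = \vec{m} \vec{m}^t$ is a local stem-stem
    association matrix.
  • Each element $s_{u,v}$ in $\vec{s}$ expresses a correlation
    $c_{u,v}$ between the stems $s_u$ and $s_v$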

21
Association Clusters (contd)
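  • The correlation factor $c_{u,v}$ quantifies the
    absolute frequencies of co-occurrence:
    $c_{u,v} = \sum_{d_j \in D_l} f_{s_u,j} \times f_{s_v,j}$
  • The association matrix $\vec{s}$ is unnormalized if
    $s_{u,v} = c_{u,v}$
  • It is normalized if
    $s_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}$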

22
Association Clusters (contd)
  • Build local association clusters
  • Consider the u-th row in the association matrix
  • Let $S_u(n)$ be a function which takes the u-th row
    and returns the set of n largest values $s_{u,v}$,
    where v varies over the set of local stems and
    $v \neq u$
  • Then $S_u(n)$ defines a local association cluster
    around the stem $s_u$.
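
A small sketch of local association clusters, assuming the stem frequencies are given as a list of rows (one row per stem, one column per local document). The function names and the toy data are invented for the example. For convenience, `cluster` returns the indices of the neighbor stems rather than the correlation values themselves.

```python
def association_matrix(m, normalized=False):
    """m[u][j] is the frequency of stem u in local document j.

    Returns the stem-stem matrix s, where c[u][v] sums the products of
    the two stems' frequencies over all local documents.
    """
    n = len(m)
    c = [[sum(a * b for a, b in zip(m[u], m[v])) for v in range(n)]
         for u in range(n)]
    if not normalized:
        return c
    return [[c[u][v] / (c[u][u] + c[v][v] - c[u][v]) for v in range(n)]
            for u in range(n)]

def cluster(s, u, n):
    """Indices of the n stems v != u with the largest s[u][v]."""
    others = [v for v in range(len(s)) if v != u]
    return sorted(others, key=lambda v: s[u][v], reverse=True)[:n]

# Toy data: three stems over three local documents.
stems = ["polish", "wax", "car"]
m = [[2, 0, 1],   # polish
     [1, 0, 1],   # wax
     [0, 3, 0]]   # car
s_unnorm = association_matrix(m)
s_norm = association_matrix(m, normalized=True)
neighbors_of_polish = [stems[v] for v in cluster(s_unnorm, 0, 1)]
```

Here "polish" and "wax" co-occur in two documents, so "wax" is the single closest neighbor of "polish", while "car" never co-occurs with either.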

23
Metric Clusters
  • Two terms which occur in the same sentence seem
    more correlated than two terms which occur far
    apart in a document.
  • It might be worthwhile to factor in the distance
    between two terms in the computation of their
    correlation factor.

24
Metric Clusters
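  • Let the distance $r(k_i, k_j)$ between two keywords
    $k_i$ and $k_j$ be given by the number of words
    between them in the same document.
  • If $k_i$ and $k_j$ are in distinct documents we take
    $r(k_i, k_j) = \infty$
  • A local stem-stem metric correlation matrix $\vec{s}$ is
    defined as follows: each element $s_{u,v}$ of $\vec{s}$
    expresses a metric correlation $c_{u,v}$ between the
    stems $s_u$ and $s_v$,
    $c_{u,v} = \sum_{k_i \in V(s_u)} \sum_{k_j \in V(s_v)} \frac{1}{r(k_i, k_j)}$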

25
Metric Clusters
  • Given a local metric matrix $\vec{s}$, to build local
    metric clusters:
  • Consider the u-th row in the metric matrix
  • Let $S_u(n)$ be a function which takes the u-th row
    and returns the set of n largest values $s_{u,v}$,
    where v varies over the set of local stems and
    $v \neq u$
  • Then $S_u(n)$ defines a local metric cluster
    around the stem $s_u$.
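
One possible way to compute a metric correlation between two stems is sketched below. Several details are assumptions made for the example rather than taken from the slides: the distance between two word occurrences is the difference of their positions (so adjacent words are at distance 1), every pair of occurrences inside the same local document contributes 1/distance, and pairs in distinct documents contribute nothing (infinite distance). The `stem_of` mapping is a hypothetical stand-in for a real stemmer.

```python
def metric_correlation(docs, stem_of, s_u, s_v):
    """Sum of inverse distances between occurrences of two stems.

    docs:    list of local documents, each a list of words
    stem_of: hypothetical function mapping a word to its stem
    """
    total = 0.0
    for words in docs:
        pos_u = [i for i, w in enumerate(words) if stem_of(w) == s_u]
        pos_v = [i for i, w in enumerate(words) if stem_of(w) == s_v]
        for i in pos_u:
            for j in pos_v:
                if i != j:                     # skip a word paired with itself
                    total += 1.0 / abs(i - j)
    return total

# Toy stemmer and local document set, invented for the example.
variants = {"polishing": "polish", "polished": "polish"}
stem_of = lambda w: variants.get(w, w)
docs = [["polish", "the", "car"], ["car", "polishing"]]
c_polish_car = metric_correlation(docs, stem_of, "polish", "car")
c_polish_the = metric_correlation(docs, stem_of, "polish", "the")
```

Stems that tend to appear close together accumulate a larger correlation than stems that appear far apart or in different documents.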

26
Scalar Clusters
  • Two stems with similar neighborhoods have some
    synonymity relationship.
  • The way to quantify such neighborhood
    relationships is to arrange all correlation
    values su,i in a vector su, to arrange all
    correlation values sv,i in another vector sv, and
    to compare these vectors through a scalar measure.

27
Scalar Clusters
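  • Let $\vec{s}_u = (s_{u,1}, s_{u,2}, \ldots, s_{u,n})$ and
    $\vec{s}_v = (s_{v,1}, s_{v,2}, \ldots, s_{v,n})$ be two vectors of
    correlation values for the stems $s_u$ and $s_v$.
  • Let $\vec{s} = (s_{u,v})$ be a scalar association matrix.
  • Each $s_{u,v}$ can be defined as
    $s_{u,v} = \frac{\vec{s}_u \cdot \vec{s}_v}{|\vec{s}_u| \times |\vec{s}_v|}$
  • Let $S_u(n)$ be a function which returns the set of
    n largest values $s_{u,v}$, $v \neq u$. Then $S_u(n)$ defines
    a scalar cluster around the stem $s_u$.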
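
A short sketch of the scalar measure, assuming it is the cosine of the angle between two rows of an existing correlation matrix. The function name and the toy matrix are invented for the example, and every row is assumed to be non-zero.

```python
import math

def scalar_matrix(s):
    """Cosine between every pair of rows of the correlation matrix s."""
    n = len(s)
    norms = [math.sqrt(sum(x * x for x in row)) for row in s]
    return [[sum(a * b for a, b in zip(s[u], s[v])) / (norms[u] * norms[v])
             for v in range(n)]
            for u in range(n)]

# Toy correlation matrix for three stems: the first two stems have
# similar neighborhoods, the third is unrelated to both.
s = [[5, 3, 0],
     [3, 2, 0],
     [0, 0, 9]]
scalar = scalar_matrix(s)
```

The first two rows point in almost the same direction, so their scalar correlation is close to 1, while the third stem has a scalar correlation of 0 with both.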

28
Interactive Search Formulation
  • Stems (or terms) that belong to clusters
    associated to the query stems (or terms) can be
    used to expand the original query.
  • A stem $s_u$ which belongs to a cluster (of size n)
    associated to another stem $s_v$ (i.e.
    $s_u \in S_v(n)$) is said to be a neighbor of $s_v$.

29
Interactive Search Formulation (contd)
  • figure of stem su as a neighbor of the stem sv

30
Interactive Search Formulation (contd)
  • For each stem $s_v \in q$, select m neighbor stems
    from the cluster $S_v(n)$ (which might be of type
    association, metric, or scalar) and add them to
    the query.
  • Hopefully, the additional neighbor stems will
    retrieve new relevant documents.
  • $S_v(n)$ may be composed of stems obtained using
    normalized or unnormalized correlation factors.
  • A normalized cluster tends to group stems which
    are more rare.
  • An unnormalized cluster tends to group stems due
    to their large frequencies.
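
The expansion step can be sketched as follows, assuming each cluster is already available as a list of neighbor stems sorted by decreasing correlation. The function name and the `clusters` dictionary are hypothetical.

```python
def expand_query(query_stems, clusters, m):
    """Add up to m neighbor stems per query stem.

    clusters: hypothetical dict mapping a stem s_v to the list of stems in
              its cluster S_v(n), ordered by decreasing correlation
    """
    expanded = list(query_stems)
    for s_v in query_stems:
        for s_u in clusters.get(s_v, [])[:m]:
            if s_u not in expanded:     # avoid adding a stem twice
                expanded.append(s_u)
    return expanded

# Toy cluster, invented for the example.
clusters = {"polish": ["wax", "shine", "car"]}
expanded = expand_query(["polish"], clusters, m=2)
```

The type of cluster (association, metric, or scalar) only changes how the neighbor lists are built, not this selection step.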

31
Interactive Search Formulation (contd)
  • Using information about correlated stems to
    improve the search.
  • Let two stems su and sv be correlated with a
    correlation factor cu,v.
  • If cu,v is larger than a predefined threshold
    then a neighbor stem of su can also be
    interpreted as a neighbor stem of sv and vice
    versa.
  • This provides greater flexibility, particularly
    with Boolean queries.
  • Consider the expression $(s_u + s_v)$, where the
    $+$ symbol stands for disjunction.
  • Let $s_u'$ be a neighbor stem of $s_u$.
  • Then one can try both $(s_u' + s_v)$ and $(s_u + s_u')$ as
    synonym search expressions, because of the
    correlation given by $c_{u,v}$.

32
Query Expansion Through Local Context Analysis
  • The local context analysis procedure operates in
    three steps:
  • 1. Retrieve the top n ranked passages using the
    original query. This is accomplished by breaking
    up the documents initially retrieved by the query
    into fixed-length passages (for instance, of size
    300 words) and ranking these passages as if they
    were documents.
  • 2. For each concept c in the top ranked passages,
    the similarity sim(q, c) between the whole query
    q (not individual query terms) and the concept c
    is computed using a variant of tf-idf ranking.

33
Query Expansion Through Local Context Analysis
  • 3. The top m ranked concepts (according to
    sim(q, c)) are added to the original query q. Each
    added concept is assigned a weight given by
    $1 - 0.9 \times i/m$, where i is the position of the
    concept in the final concept ranking. The terms
    in the original query q might be stressed by
    assigning a weight equal to 2 to each of them.
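
The passage splitting of step 1 and the weighting of step 3 can be sketched as below. The function names are hypothetical, the concept ranking is assumed to be computed elsewhere (step 2 is not implemented here), and keeping the query weight for a concept that is already a query term is a choice made for this example.

```python
def split_passages(words, size=300):
    """Break a document (a list of words) into fixed-length passages."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def lca_weights(query_terms, ranked_concepts, m, query_weight=2.0):
    """Weights for the expanded query.

    ranked_concepts: concepts sorted by decreasing sim(q, c)
    m:               number of top concepts to add
    """
    weights = {term: query_weight for term in query_terms}
    for i, concept in enumerate(ranked_concepts[:m], start=1):
        # The i-th ranked concept gets weight 1 - 0.9 * i / m; a concept
        # that is already a query term keeps the query weight.
        weights.setdefault(concept, 1.0 - 0.9 * i / m)
    return weights

# Toy ranking, invented for the example.
passages = split_passages(list(range(7)), size=3)
weights = lca_weights(["ir"], ["retrieval", "search", "ir", "index"], m=3)
```

The weights decrease linearly with rank, so the last added concept (i = m) still receives a small positive weight of 0.1.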