1
Query operations
  • 1- Introduction
  • 2- Relevance feedback with user relevance
    information
  • 3- Relevance feedback without user relevance
    information
  • - Local analysis (pseudo-relevance feedback)
  • - Global analysis (thesaurus)
  • 4- Evaluation
  • 5- Issues

2
Introduction (1)
  • No detailed knowledge of collection and retrieval
    environment
  • Difficult to formulate queries that are well
    designed for retrieval
  • Need many formulations of queries for effective
    retrieval
  • First formulation often naïve attempt to
    retrieve relevant information
  • Documents initially retrieved
  • Examined for relevance information (user,
    automatically)
  • Improve query formulations for retrieving
    additional relevant documents
  • Query reformulation
  • Expanding original query with new terms
  • Reweighting the terms in expanded query

3
Introduction (2)
  • Approaches based on feedback from users
    (relevance feedback)
  • Approaches based on information derived from set
    of initially retrieved documents (local set of
    documents)
  • Approaches based on global information derived
    from document collection

4
Relevance feedback with user relevance
information (1)
  • Most popular query reformulation strategy
  • Cycle
  • User presented with list of retrieved documents
  • User marks those which are relevant
  • In practice top 10-20 ranked documents are
    examined
  • Incremental
  • Select important terms from documents assessed
    relevant by users
  • Enhance importance of these terms in a new query
  • Expected
  • New query moves towards relevant documents and
    away from non-relevant documents

5
Relevance feedback with user relevance
information (2)
  • Two basic techniques
  • Query expansion
  • Add new terms from relevant documents
  • Term reweighting
  • Modify term weights based on user relevance
    judgements
  • Advantages
  • Shield users from details of query reformulation
    process
  • Search broken down into a sequence of small steps
  • Controlled process
  • Emphasise some terms (relevant ones)
  • De-emphasise other terms (non-relevant ones)

6
Relevance feedback with user relevance
information (3)
  • Query expansion and term reweighting in the
    vector space model
  • Term reweighting in the probabilistic model

7
Query expansion and term reweighting in
the vector space model
  • Term weight vectors of documents assessed
    relevant
  • Similarities among themselves
  • Term weight vectors of documents assessed
    non-relevant
  • Dissimilar to those of relevant documents
  • Reformulated query
  • Closer to term weight vectors of relevant
    documents

8
Query expansion and term reweighting in
the vector space model
  • For query q
  • Dr: set of relevant documents among retrieved
    documents
  • Dn: set of non-relevant documents among retrieved
    documents
  • Cr: set of relevant documents among all documents
    in collection
  • α, β, γ: tuning constants
  • Assume that Cr is known (unrealistic!)
  • Best query vector for distinguishing relevant
    documents from non-relevant documents
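(The slide's formula image is not reproduced in this transcript; the standard optimal query vector of the vector space model, assuming the complete relevant set Cr is known, is

  \vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
                  - \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j

where N is the total number of documents in the collection.)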

9
Query expansion and term reweighting in
the vector space model
  • Problem: Cr is unknown
  • Approach
  • Formulate initial query
  • Incrementally change initial query vector
  • Use Dr and Dn instead
  • Rocchio formula
  • Ide formula

10
Rocchio formula
  • Direct application of the previous formula, with
    the original query added
  • Initial formulation: α = 1
  • Usually information in relevant documents more
    important than in non-relevant documents (γ << β)
  • Positive relevance feedback (γ = 0)
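(The formula itself is not shown in the transcript; the standard Rocchio reformulation, in the notation introduced above, is

  \vec{q}_m = \alpha \vec{q}
              + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
              - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j )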

11
Rocchio formula in practice (SMART)
  • α = 1
  • Terms
  • Original query
  • Appear in more relevant documents than
    non-relevant documents
  • Appear in more than half the relevant documents
  • Negative weights ignored

12
Ide formula
  • Initial formulation: α = β = γ = 1
  • Same comments as for the Rocchio formula
  • Both Ide and Rocchio: no optimality criterion
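(The formulas are missing from the transcript; the two usual Ide variants, sketched in the same notation, are

  Ide Regular:  \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j

  Ide dec-hi:   \vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j - \gamma \, \vec{d}_{max}

where d_max is the highest-ranked non-relevant document; the slide may show either variant.)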

13
Term reweighting for the probabilistic model
  • (see note on the BIR model)
  • Use idf to rank documents for original query
  • Calculate probabilities of term occurrence in
    relevant and non-relevant documents
  • Predict relevance
  • Improved (optimal) retrieval function
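(The retrieval function is not reproduced here; a sketch of the usual binary independence form with feedback-based estimates, which may differ in detail from the slide, is

  sim(d_j, q) \propto \sum_{t_i \in q \cap d_j}
      \left[ \log \frac{P(t_i \mid R)}{1 - P(t_i \mid R)}
           + \log \frac{1 - P(t_i \mid \bar{R})}{P(t_i \mid \bar{R})} \right]

with P(t_i|R) \approx r_i / R and P(t_i|\bar{R}) \approx (n_i - r_i) / (N - R), where ri, ni, R and N are as defined on the following slides.)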

14
Term reweighting for the probabilistic model
  • Independence assumptions
  • I1: the distribution of terms in relevant
    documents is independent, and their distribution
    in all documents is independent
  • I2: the distribution of terms in relevant
    documents is independent, and their distribution
    in non-relevant documents is independent
  • Ordering principles
  • O1: probable relevance is based only on the
    presence of search terms in documents
  • O2: probable relevance is based on the presence of
    search terms in documents and their absence from
    documents

15
Term reweighting for the probabilistic model
  • Various combinations
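(The combination table is not reproduced in the transcript; the standard Robertson and Sparck Jones pairing is

  I1 + O1 -> F1
  I2 + O1 -> F2
  I1 + O2 -> F3
  I2 + O2 -> F4 )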

16
Term reweighting for the probabilistic model
  • F1 formula
  • ratio of the proportion of relevant documents in
    which the query term ti occurs to the proportion
    of all documents in which the term ti occurs
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in collection
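(The corresponding Robertson and Sparck Jones weight, which the slide's formula image presumably shows, is

  F1: w_i = \log \frac{r_i / R}{n_i / N} )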

17
Term reweighting for the probabilistic model
  • F2 formula
  • ratio of the proportion of relevant documents in
    which the term ti occurs to the proportion of
    non-relevant documents in which ti occurs
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in collection
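(The usual F2 weight, presumably what the slide shows, is

  F2: w_i = \log \frac{r_i / R}{(n_i - r_i) / (N - R)} )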

18
Term reweighting for the probabilistic model
  • F3 formula
  • ratio of relevance odds (ratio of relevant
    documents containing ti to relevant documents not
    containing ti) to collection odds (ratio of
    documents containing ti to documents not
    containing ti)
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in collection
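(The usual F3 weight, presumably what the slide shows, is

  F3: w_i = \log \frac{r_i / (R - r_i)}{n_i / (N - n_i)} )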

19
Term reweighting for the probabilistic model
  • F4 formula
  • ratio of relevance odds to non-relevance odds
    (ratio of non-relevant documents containing ti to
    non-relevant documents not containing ti)
  • ri: number of relevant documents containing ti
  • ni: number of documents containing ti
  • R: number of relevant documents
  • N: number of documents in collection
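(The usual F4 weight, presumably what the slide shows, is

  F4: w_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)} )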

20
Experiments
  • F1, F2, F3 and F4 outperform no relevance
    weighting and idf
  • F1 and F2 perform in the same range; F3 and F4
    perform in the same range
  • F3 and F4 > F1 and F2
  • F4 slightly > F3
  • O2 is correct (looking at presence and absence of
    terms)
  • No conclusion with respect to I1 and I2, although
    I2 seems a more realistic assumption.

21
Relevance feedback without user relevance
  • Relevance feedback with user relevance
  • Clustering hypothesis
  • known relevant documents contain terms which can
    be used to describe a larger cluster of relevant
    documents
  • Description of cluster built interactively with
    user assistance
  • Relevance feedback without user relevance
  • Obtain cluster description automatically
  • Identify terms related to query terms
  • (e.g. synonyms, stemming variations, terms close
    to query terms in text)
  • Local strategies
  • Global strategies

22
Local analysis (pseudo-relevance feedback)
  • Examine documents retrieved for query to
    determine query expansion
  • No user assistance
  • Clustering techniques
  • Query drift (expansion based on non-relevant
    top-ranked documents can move the query away from
    the intended topic)

23
Clusters (1)
  • Synonymy association (one example): terms that
    frequently co-occur inside the local set of
    documents
  • Term-term (e.g., stem-stem) association matrix
    (normalised)
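A minimal Python sketch of building such a normalised association matrix from the locally retrieved documents; the function names and the normalisation c_uv / (c_uu + c_vv - c_uv) follow one common convention and are illustrative, not taken from the slides.

from collections import Counter

def association_matrix(local_docs):
    # local_docs: list of documents, each a list of (stemmed) terms
    # c[u][v] = sum over the local documents of freq(u, d) * freq(v, d)
    c = {}
    for doc in local_docs:
        tf = Counter(doc)
        for u in tf:
            row = c.setdefault(u, Counter())
            for v in tf:
                row[v] += tf[u] * tf[v]
    return c

def normalise(c):
    # s[u][v] = c[u][v] / (c[u][u] + c[v][v] - c[u][v]), a value in (0, 1]
    return {u: {v: cuv / (c[u][u] + c[v][v] - cuv) for v, cuv in row.items()}
            for u, row in c.items()}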

24
Clusters (2)
  • For term ti
  • Take the n largest values mi,j
  • The resulting terms tj form cluster for ti
  • Query q
  • Find clusters for the |q| query terms
  • Keep clusters small
  • Expand original query
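Continuing the sketch above (names are illustrative), a cluster for each query term and the expanded query could be computed as follows.

def cluster_for(term, s, n=3):
    # the n terms most strongly associated with `term` (the term itself excluded)
    neighbours = sorted((v for v in s.get(term, {}) if v != term),
                        key=lambda v: s[term][v], reverse=True)
    return neighbours[:n]

def expand_query(query_terms, s, n=3):
    # original query terms plus the small cluster found for each of them
    expanded = list(query_terms)
    for t in query_terms:
        expanded += [v for v in cluster_for(t, s, n) if v not in expanded]
    return expanded

For example, s = normalise(association_matrix(local_docs)) followed by expand_query(['cancer', 'therapy'], s) returns the two query terms plus up to three strongly associated terms for each, keeping the clusters small as the slide suggests.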

25
Global analysis
  • Expand query using information from whole set of
    documents in collection
  • Thesaurus-like structure using all documents
  • Approach to automatically build a thesaurus
  • (e.g. similarity thesaurus based on co-occurrence
    frequency)
  • Approach to select terms for query expansion
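(One common construction of a similarity thesaurus, sketched here and possibly differing from the approach intended on the slide: represent each term tu as a vector over all documents of the collection and score term pairs by

  c_{u,v} = \sum_{d_j} w_{u,j} \, w_{v,j}

where w_{u,j} is the weight of term tu in document dj; terms with the highest similarity to the query terms are selected for expansion.)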

26
Evaluation of relevance feedback strategies
  • Use qi and compute a precision-recall graph
  • Use qi+1 and compute a precision-recall graph
  • Use all documents in the collection
  • Spectacular improvements
  • Partly because documents marked relevant are
    ranked higher
  • These documents are already known to the user
  • Must evaluate with respect to documents not seen
    by the user
  • Three techniques

27
Evaluation of relevance feedback strategies
  • Freezing
  • Full-freezing
  • Top n documents are frozen (ones used in RF)
  • Remaining documents are re-ranked
  • Precision-recall on whole ranking
  • Change in effectiveness thus comes from unseen
    documents
  • With many iterations, the growing contribution of
    frozen documents may lead to a decrease in
    effectiveness
  • Modified freezing
  • Documents are frozen up to the rank position of
    the last marked relevant document

28
Evaluation of relevance feedback strategies
  • Test and control group
  • Random splitting of documents into a test group
    and a control group
  • Query reformulation performed using the test group
  • New query run against the control group
  • Feedback information thus comes only from the test
    group; effectiveness is measured on the control
    group
  • Difficulty in splitting the collection
  • Distribution of relevant documents

29
Evaluation of relevance feedback strategies
  • Residual ranking
  • Documents used in assessing relevance are removed
  • Precision-recall on residual collection
  • Consider effect of unseen documents
  • Results not comparable with original ranking
    (fewer relevant documents)

30
Issues
  • Interface
  • Allow user to quickly identify relevant and
    non-relevant documents
  • What happens with 2D and 3D visualisations?
  • Global analysis
  • On the web?
  • Yahoo!
  • Local analysis
  • Computation cost (on-line)
  • Interactive query expansion
  • User chooses the terms to be added

31
Negative relevance feedback
  • Documents explicitly marked as non-relevant by
    users
  • Implementation
  • Clarity
  • Usability