Title: Query operations
1. Query operations
- 1. Introduction
- 2. Relevance feedback with user relevance information
- 3. Relevance feedback without user relevance information
- - Local analysis (pseudo-relevance feedback)
- - Global analysis (thesaurus)
- 4. Evaluation
- 5. Issues
2. Introduction (1)
- No detailed knowledge of the collection and the retrieval environment
- - Difficult to formulate queries well designed for retrieval
- Many query formulations are needed for effective retrieval
- - The first formulation is often a naïve attempt to retrieve relevant information
- Documents initially retrieved
- - Examined for relevance information (by the user, or automatically)
- - Used to improve the query formulation so as to retrieve additional relevant documents
- Query reformulation
- - Expanding the original query with new terms
- - Reweighting the terms in the expanded query
3. Introduction (2)
- Approaches based on feedback from users (relevance feedback)
- Approaches based on information derived from the set of initially retrieved documents (the local set of documents)
- Approaches based on global information derived from the document collection
4. Relevance feedback with user relevance information (1)
- Most popular query reformulation strategy
- Cycle
- - User is presented with a list of retrieved documents
- - User marks those which are relevant
- - In practice, the top 10-20 ranked documents are examined
- - Incremental
- Select important terms from the documents assessed relevant by the user
- Enhance the importance of these terms in a new query
- Expected effect
- - The new query moves towards the relevant documents and away from the non-relevant documents
5. Relevance feedback with user relevance information (2)
- Two basic techniques
- - Query expansion: add new terms from relevant documents
- - Term reweighting: modify term weights based on user relevance judgements
- Advantages
- - Shields users from the details of the query reformulation process
- - Search is broken down into a sequence of small steps
- - Controlled process: emphasise some terms (relevant ones) and de-emphasise others (non-relevant ones)
6. Relevance feedback with user relevance information (3)
- Query expansion and term reweighting in the vector space model
- Term reweighting in the probabilistic model
7. Query expansion and term reweighting in the vector space model
- Term weight vectors of documents assessed relevant
- - Similar among themselves
- Term weight vectors of documents assessed non-relevant
- - Dissimilar from those of the relevant documents
- Reformulated query
- - Moved closer to the term weight vectors of the relevant documents
8. Query expansion and term reweighting in the vector space model
- For a query q
- - Dr: set of relevant documents among the retrieved documents
- - Dn: set of non-relevant documents among the retrieved documents
- - Cr: set of relevant documents among all documents in the collection
- - α, β, γ: tuning constants
- Assume that Cr is known (unrealistic!)
- Best query vector for distinguishing the relevant documents from the non-relevant documents
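Under the assumption that the complete set Cr is known, this "best" query is usually written as the difference between the centroid of the relevant documents and the centroid of the non-relevant ones. A reconstruction of the formula the slide alludes to (originally a figure), with N the collection size:

```latex
\vec{q}_{opt} = \frac{1}{|C_r|} \sum_{\vec{d}_j \in C_r} \vec{d}_j
              \;-\; \frac{1}{N - |C_r|} \sum_{\vec{d}_j \notin C_r} \vec{d}_j
```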
9. Query expansion and term reweighting in the vector space model
- Problem: Cr is unknown
- Approach
- - Formulate an initial query
- - Incrementally change the initial query vector
- - Use Dr and Dn instead of Cr
- Rocchio formula
- Ide formula
10. Rocchio formula
- Direct application of the previous formula, with the original query added
- Initial formulation: α = 1
- Usually the information in relevant documents is more important than that in non-relevant documents (γ << β)
- Positive relevance feedback (γ = 0)
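The Rocchio formula itself was a figure in the original slides; its standard form, with Dr, Dn and the tuning constants as defined earlier, is:

```latex
\vec{q}_m = \alpha \vec{q}
          + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j
          - \frac{\gamma}{|D_n|} \sum_{\vec{d}_j \in D_n} \vec{d}_j
```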
11. Rocchio formula in practice (SMART)
- α = 1
- Terms used for expansion
- - Those in the original query
- - Those appearing in more relevant documents than non-relevant documents
- - Those appearing in more than half of the relevant documents
- Negative weights are ignored
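The SMART-style procedure above can be sketched in a few lines. This is a minimal illustration, not the SMART implementation: documents and queries are represented as sparse term-weight dictionaries, and the β/γ defaults are illustrative values, not taken from the slides.

```python
# Minimal sketch of Rocchio-style reformulation with SMART-like conventions.
# Documents and queries are sparse term-weight vectors (dict: term -> weight).
# alpha/beta/gamma defaults are illustrative, not from the original slides.
def rocchio(query, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    new_q = {t: alpha * w for t, w in query.items()}
    for doc in relevant:                      # centroid of Dr, scaled by beta
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) + beta * w / len(relevant)
    for doc in nonrelevant:                   # centroid of Dn, scaled by gamma
        for t, w in doc.items():
            new_q[t] = new_q.get(t, 0.0) - gamma * w / len(nonrelevant)
    # SMART practice: negative weights are ignored
    return {t: w for t, w in new_q.items() if w > 0}
```

Setting gamma=0 gives positive relevance feedback; dropping negative weights mirrors the SMART practice noted above.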
12. Ide formula
- Initial formulation: α = β = γ = 1
- Same comments as for the Rocchio formula
- Neither Ide nor Rocchio is based on an optimality criterion
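The Ide formulas were also figures in the original slides; the standard "regular" and "dec-hi" variants (the latter subtracts only the highest-ranked non-relevant document) are:

```latex
\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j
          - \gamma \sum_{\vec{d}_j \in D_n} \vec{d}_j
\qquad \text{(Ide regular)}

\vec{q}_m = \alpha \vec{q} + \beta \sum_{\vec{d}_j \in D_r} \vec{d}_j
          - \gamma \, \max_{\text{rank}}(D_n)
\qquad \text{(Ide dec-hi)}
```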
13. Term reweighting for the probabilistic model
- (See the note on the BIR model)
- Use idf to rank documents for the original query
- Calculate term weights from the relevance information
- Predict relevance
- Improved (optimal) retrieval function
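The "improved retrieval function" was shown as a formula in the original slides; a reconstruction in the usual BIR notation, ranking a document d_j against a query q, with R the set of relevant documents:

```latex
sim(d_j, q) \;\propto \sum_{k_i \in q \,\cap\, d_j}
  \left[ \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right]
```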
14. Term reweighting for the probabilistic model
- Independence assumptions
- - I1: the distribution of terms in the relevant documents is independent, and their distribution in all documents is independent
- - I2: the distribution of terms in the relevant documents is independent, and their distribution in the irrelevant documents is independent
- Ordering principles
- - O1: probable relevance is based on the presence of search terms in documents
- - O2: probable relevance is based on the presence of search terms in documents and on their absence from documents
15. Term reweighting for the probabilistic model
16. Term reweighting for the probabilistic model
- F1 formula
- - ri: number of relevant documents containing ti
- - ni: number of documents containing ti
- - R: number of relevant documents
- - N: number of documents in the collection
- - F1 is the ratio of the proportion of relevant documents in which the query term ti occurs to the proportion of all documents in which ti occurs
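The F1 weight itself was a figure in the original slides; from the verbal description above (and the classic Robertson-Sparck Jones weighting schemes), it is:

```latex
F1:\qquad w_i = \log \frac{r_i / R}{n_i / N}
```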
17. Term reweighting for the probabilistic model
- F2 formula
- - ri: number of relevant documents containing ti
- - ni: number of documents containing ti
- - R: number of relevant documents
- - N: number of documents in the collection
- - F2 is the ratio of the proportion of relevant documents in which the term ti occurs to the proportion of all irrelevant documents in which ti occurs
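Reconstructing F2 from the description above (the irrelevant documents number N − R, of which n_i − r_i contain ti):

```latex
F2:\qquad w_i = \log \frac{r_i / R}{(n_i - r_i) / (N - R)}
```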
18. Term reweighting for the probabilistic model
- F3 formula
- - ri: number of relevant documents containing ti
- - ni: number of documents containing ti
- - R: number of relevant documents
- - N: number of documents in the collection
- - F3 is the ratio of the relevance odds (relevant documents containing ti vs. relevant documents not containing ti) to the collection odds (documents containing ti vs. documents not containing ti)
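Reconstructing F3 from the odds described above:

```latex
F3:\qquad w_i = \log \frac{r_i / (R - r_i)}{n_i / (N - n_i)}
```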
19. Term reweighting for the probabilistic model
- F4 formula
- - ri: number of relevant documents containing ti
- - ni: number of documents containing ti
- - R: number of relevant documents
- - N: number of documents in the collection
- - F4 is the ratio of the relevance odds to the non-relevance odds (non-relevant documents containing ti vs. non-relevant documents not containing ti)
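Reconstructing F4 from the odds described above (the non-relevant documents not containing ti number N − n_i − R + r_i):

```latex
F4:\qquad w_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)}
```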
20. Experiments
- F1, F2, F3 and F4 all outperform no relevance weighting and idf
- F1 and F2 perform in the same range, as do F3 and F4
- F3 and F4 > F1 and F2
- F4 slightly > F3
- O2 is correct (looking at both the presence and the absence of terms)
- No conclusion with respect to I1 and I2, although I2 seems the more realistic assumption
21. Relevance feedback without user relevance information
- Relevance feedback with user relevance information
- - Clustering hypothesis: known relevant documents contain terms which can be used to describe a larger cluster of relevant documents
- - The description of the cluster is built interactively, with user assistance
- Relevance feedback without user relevance information
- - Obtain the cluster description automatically
- - Identify terms related to the query terms (e.g. synonyms, stemming variations, terms close to the query terms in the text)
- Local strategies
- Global strategies
22. Local analysis (pseudo-relevance feedback)
- Examine the documents retrieved for the query to determine terms for query expansion
- No user assistance
- Clustering techniques
- Risk: query drift
23. Clusters (1)
- Synonymy by association (one example): terms that frequently co-occur inside the local set of documents
- Term-term (e.g., stem-stem) association matrix (normalised)
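The association matrix was a figure in the original slides; a reconstruction of the usual formulation, with f_{u,j} the frequency of stem/term u in document d_j of the local set D_l, and the normalised values m_{u,v} used on the next slide:

```latex
c_{u,v} = \sum_{d_j \in D_l} f_{u,j} \times f_{v,j},
\qquad
m_{u,v} = \frac{c_{u,v}}{c_{u,u} + c_{v,v} - c_{u,v}}
```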
24. Clusters (2)
- For a term ti
- - Take the n largest values mi,j
- - The resulting terms tj form a cluster for ti
- For a query q
- - Find the clusters for the query terms of q
- - Keep the clusters small
- - Expand the original query with the cluster terms
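The local-analysis procedure above (association values over the top-ranked documents, a small cluster per query term, expansion) can be sketched as follows. The function name, the default cluster size n, and the use of plain co-occurrence counts instead of the normalised matrix are all illustrative simplifications.

```python
from collections import Counter
from itertools import combinations

# Sketch of local-analysis (pseudo-relevance feedback) query expansion:
# build a term-term co-occurrence matrix over the top-ranked ("local")
# documents, then expand each query term with its n strongest neighbours.
def expand_query(query_terms, local_docs, n=2):
    cooc = Counter()  # unnormalised association values m[i,j]
    for doc in local_docs:
        for u, v in combinations(sorted(set(doc)), 2):
            cooc[(u, v)] += 1
            cooc[(v, u)] += 1
    expanded = set(query_terms)
    for t in query_terms:
        neighbours = [(m, v) for (u, v), m in cooc.items() if u == t]
        # the n largest association values form a small cluster for t
        for _, v in sorted(neighbours, reverse=True)[:n]:
            expanded.add(v)
    return expanded
```

A real implementation would normalise the counts as on the previous slide and weight them by term frequency rather than mere presence.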
25. Global analysis
- Expand the query using information from the whole set of documents in the collection
- Thesaurus-like structure built using all documents
- An approach to automatically build the thesaurus (e.g. a similarity thesaurus based on co-occurrence frequencies)
- An approach to select terms for query expansion
26. Evaluation of relevance feedback strategies
- Use the original query qi and compute a precision-recall graph
- Use the reformulated query qi+1 and compute a precision-recall graph
- Use all documents in the collection
- Spectacular improvements
- - Partly due to the documents marked relevant by the user being ranked higher
- - These documents are already known to the user
- Must evaluate with respect to the documents not seen by the user
- Three techniques
27. Evaluation of relevance feedback strategies
- Freezing
- - Full freezing: the top n documents (the ones used in RF) are frozen; the remaining documents are re-ranked
- - Precision-recall computed on the whole ranking
- - Changes in effectiveness thus come from the unseen documents
- - With many iterations, the growing contribution of the frozen documents may lead to a decrease in effectiveness
- - Modified freezing: freeze up to the rank position of the last document marked relevant
28. Evaluation of relevance feedback strategies
- Test and control groups
- - Random splitting of the documents into test documents and control documents
- - Query reformulation performed on the test documents
- - New query run against the control documents
- - RF thus performed only on the test group
- - Difficulty: splitting the collection while preserving the distribution of relevant documents
29. Evaluation of relevance feedback strategies
- Residual ranking
- - Documents used in assessing relevance are removed from the collection
- - Precision-recall computed on the residual collection
- - Considers only the effect of unseen documents
- - Results not comparable with the original ranking (fewer relevant documents)
30. Issues
- Interface
- - Allow the user to quickly identify relevant and non-relevant documents
- - What happens with 2D and 3D visualisation?
- Global analysis
- - On the web? (e.g. Yahoo!)
- Local analysis
- - Computation cost (on-line)
- Interactive query expansion
- - The user chooses the terms to be added
31. Negative relevance feedback
- Documents explicitly marked as non-relevant by users
- Implementation
- Clarity
- Usability