Title: CS 430 / INFO 430 Information Retrieval
1. CS 430 / INFO 430 Information Retrieval
Lecture 8: Queries and Strings 2
2. Course Administration
3. Queries: Tasks and Applications

Task                    Application          Example question
Ad hoc search           Retrieval systems    Which documents are relevant to an information need?
Information filtering   Information agents   Which news articles are interesting to a particular person?
Text routing            Help-desk support    Who is an appropriate expert for a particular problem?
4. The Optimal Query: Example, Information Filtering

d1, d2, d3, ... is a stream of incoming documents that are to be divided into two sets:

  R - documents judged relevant to an information need
  S - documents judged not relevant to the information need

A query is defined as the vector in the term vector space:

  q = (w1, w2, ..., wn)

where wi is the weight given to term i. Document dj will be assigned to R if similarity(q, dj) > λ, for some threshold λ.

What is the optimal query, i.e., the optimal values of the wi and of λ?
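As a concrete illustration, here is a minimal sketch of this filtering rule in Python. It assumes documents and the query are already represented as non-zero term-weight vectors; the function names and the threshold value are illustrative, not from the lecture.

```python
import numpy as np

def cosine(q, d):
    """Cosine similarity between query vector q and document vector d
    (assumes neither vector is all zeros)."""
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def filter_stream(q, docs, threshold=0.5):
    """Assign each incoming document to R if similarity(q, d) exceeds
    the threshold, otherwise to S. The threshold 0.5 is arbitrary."""
    R, S = [], []
    for d in docs:
        (R if cosine(q, d) > threshold else S).append(d)
    return R, S
```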
5. Seeking Optimal Parameters

Theoretical approach (not successful):
- Develop a theoretical model
- Derive parameters
- Test with users

Heuristic approach:
- Develop a heuristic
- Vary parameters
- Test with users

Machine learning
6. Seeking Optimal Parameters: Methods

What is the optimal query for each application?
- Rich query languages -- make use of human understanding
- Extending the Boolean model -- avoid the need for exact matches
- Relevance feedback and query refinement
- Automated query formulation using machine learning

Different approaches are needed for fielded information and free text.
7. Query Language

A query language defines the syntax and the semantics of the queries in a given search system. Factors to consider in designing a query language include:

Service needs
- What are the characteristics of the documents being searched?
- What need does the service satisfy?

Human factors
- Are the users trained, untrained, or both?
- What is the trade-off between the power of the language and ease of learning?

Efficiency
- Can the search system process all queries efficiently?
8. The Common Query Language

The Common Query Language is maintained by the Library of Congress. The following examples are taken from the CQL Tutorial, "A Gentle Introduction to CQL":
http://zing.z3950.org/cql/intro.html
9. Query Languages: the Common Query Language

The Common Query Language (CQL) is a formal language for queries to information retrieval systems such as abstracting and indexing services, bibliographic catalogs, and museum collection information.

Objective: human readable and human writable; intuitive, while maintaining the expressiveness of more complex languages.

Supports:
- Full-text searching
- Boolean operators
- Fielded searching
10. The Common Query Language: Examples

Simple queries:
  dinosaur
  comp.sources.misc
  "the complete dinosaur"
  "ext->u.generic"
  "and"

Booleans:
  dinosaur or bird
  dinosaur and bird or dinobird
  "feathered dinosaur" and (yixian or jehol)
  (((a and b) or (c not d) not (e or f and g)) and h not i) or j
11. The Common Query Language: Examples

Indexes (fielded searching):
  title = dinosaur
  title = ((dinosaur and bird) or dinobird)
  dc.title = saurischia
  bath.title = "the complete dinosaur"

Index-set mapping:
  > dc = "http://www.loc.gov/srw/index-sets/dc" ... dc.title = dinosaur and dc.author = farlow

Definition of fields (Dublin Core): title and author use the Dublin Core definitions.
12. The Common Query Language: Examples

Proximity: the prox operator
  prox/relation/distance/unit/ordering

Examples:
  complete prox dinosaur               adjacent
  ribs prox//5 chevrons                within 5 words
  ribs prox//0/sentence chevrons       same sentence
  ribs prox/>/0/paragraph chevrons     not adjacent (different paragraphs)
13. The Common Query Language: Examples

Relations:
  year > 1998
  title all "complete dinosaur"        all terms in title
  title any "dinosaur bird reptile"    any term in title
  title exact "the complete dinosaur"
  publicationYear < 1980
  numberOfWheels < 3
14. The Common Query Language: Examples

Relation modifiers:
  title all/stem "complete dinosaur"
  title any/relevant "dinosaur bird reptile"
  title exact/fuzzy "the complete dinosaur"
  author =/fuzzy tailor

The implementations of relevant and fuzzy are not defined by the query language.
15. The Common Query Language: Examples

Pattern matching:
  dinosaur*      zero or more characters
  *sauria
  man?raptor     exactly one character
  \char          literal character (e.g., \* for a literal asterisk)

Word anchoring:
  title = "^the complete dinosaur"              beginning of field
  author = "bakker^"                            end of field
  author any "^kernighan ^ritchie ^thompson"
16. The Common Query Language: Examples

A complete example. Find records whose author (in the Dublin Core sense) includes either a word beginning "kern" or the word "ritchie", and which have either the exact title (in the sense of the Bath profile) "the c programming language" or a title containing the words "elements" and "programming" not more than four words apart, and whose subject is relevant to one or more of the words "design" or "analysis":

  dc.author = (kern* or ritchie) and
  (bath.title exact "the c programming language"
   or dc.title = elements prox/<=/4 dc.title = programming) and
  subject any/relevant "design analysis"
17. Problems with the Boolean Model

Boolean is all or nothing:
- The Boolean model has no way to rank documents.
- The Boolean model allows for no uncertainty in assigning index terms to documents.
- The Boolean model has no provision for adjusting the importance of query terms.
18. Boolean Model as Sets

d is either in the set A or not in A.

[Figure: Venn diagram of a set A, with a document d either inside or outside it]
19. Problems with the Boolean Model

Counter-intuitive results:

Query q = a and b and c and d and e
Document d has terms a, b, c and d, but not e.
Intuitively, d is quite a good match for q, but it is rejected by the Boolean model.

Query q = a or b or c or d or e
Document d1 has terms a, b, c, d, and e.
Document d2 has term a, but not b, c, d or e.
Intuitively, d1 is a much better match than d2, but the Boolean model ranks them as equal.
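Both effects are easy to reproduce with a toy model that treats documents as sets of terms; a sketch for illustration, not a real Boolean index:

```python
# Toy Boolean matching over documents represented as sets of terms.

def matches_and(query_terms, doc_terms):
    return query_terms <= doc_terms          # every query term must be present

def matches_or(query_terms, doc_terms):
    return bool(query_terms & doc_terms)     # any shared term suffices

q = {"a", "b", "c", "d", "e"}

d = {"a", "b", "c", "d"}                     # 4 of 5 terms, yet rejected
print(matches_and(q, d))                     # False: all or nothing

d1 = {"a", "b", "c", "d", "e"}
d2 = {"a"}
print(matches_or(q, d1), matches_or(q, d2))  # True True: ranked as equal
```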
20. Extending the Boolean Model

Term weighting:
- Give weights to terms in documents and/or queries.
- Combine standard Boolean retrieval with vector ranking of results.

Fuzzy sets:
- Relax the boundaries of the sets used in Boolean retrieval (see the sketch below).
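One common way to relax the set boundaries is fuzzy-set logic, in which a document has a graded membership score for each term and AND/OR are interpreted as min/max. This is a generic sketch of the idea, with invented membership values, not a description of any particular system:

```python
# Fuzzy Boolean matching: each term has a membership score in [0, 1]
# for a document (e.g., a normalized term weight) instead of a hard
# in/out decision. AND becomes min, OR becomes max.

def fuzzy_and(*scores):
    return min(scores)

def fuzzy_or(*scores):
    return max(scores)

# Illustrative per-term membership scores for one document.
w = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.1}

# The document scores low on the AND query but is no longer
# rejected outright, so documents can still be ranked.
print(fuzzy_and(*(w[t] for t in "abcde")))   # 0.1
print(fuzzy_or(*(w[t] for t in "abcde")))    # 0.9
```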
21. Ranking Methods in Boolean Systems

SIRE (Syracuse Information Retrieval Experiment)

Term weights:
- Add term weights, calculated by the standard method of term frequency x inverse document frequency (tf.idf).

Ranking:
- Calculate the results set by standard Boolean methods.
- Rank the results by vector distances (see the sketch below).
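The two-step scheme can be sketched as follows: a Boolean AND filter selects the results set, and tf.idf cosine similarity ranks it. The helper names and the binary query vector are our simplifications, not code or choices from SIRE itself.

```python
import math
import numpy as np

def tfidf_matrix(docs, vocab):
    """docs: list of term lists. Returns a len(docs) x len(vocab)
    matrix of tf.idf weights."""
    N = len(docs)
    df = {t: sum(t in d for d in docs) for t in vocab}
    M = np.zeros((N, len(vocab)))
    for i, d in enumerate(docs):
        for j, t in enumerate(vocab):
            if df[t]:
                M[i, j] = d.count(t) * math.log(N / df[t])
    return M

def boolean_then_rank(query_terms, docs, vocab):
    """Boolean AND filter first, then rank the survivors by cosine
    similarity to a binary query vector."""
    M = tfidf_matrix(docs, vocab)
    q = np.array([1.0 if t in query_terms else 0.0 for t in vocab])
    hits = [i for i, d in enumerate(docs) if query_terms <= set(d)]

    def score(i):
        denom = np.linalg.norm(q) * np.linalg.norm(M[i])
        return float(np.dot(q, M[i]) / denom) if denom else 0.0

    return sorted(hits, key=score, reverse=True)
```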
22. Expanding the Results Set in SIRE

SIRE used relevance feedback to refine the results set:
- The results set is created by standard Boolean retrieval.
- The user selects one document from the results set.
- Other documents in the collection are ranked by vector distance from this document.

This process allows the results set to be expanded, thus overcoming the all-or-nothing problem of Boolean retrieval. Relevance feedback is discussed in the following slides.
23. Query Refinement

[Flow diagram: a new query enters "Query formulation" and is passed to "Search"; the retrieved information is displayed; the user then either reformulates the query, sending a reformulated query back to Search, or EXITs.]
24. Reformulation of Query

Manual:
- Add or remove search terms
- Change Boolean operators
- Change wild cards

Automatic (change the query vector):
- Remove/add search terms
- Change the weighting of search terms
25. Relevance Feedback: Document Vectors as Points on a Surface

- Normalize all document vectors to be of length 1.
- Then the ends of the vectors all lie on a surface with unit radius.
- For similar documents, we can represent parts of this surface as a flat region.
- Similar documents are represented as points that are close together on this surface.
26. Relevance Feedback: Results of a Search

[Figure: the hits from a search shown as points (x) clustered around the query on the document surface. Legend: x = documents found by search; the query is marked at the center.]
27. Relevance Feedback (Concept)

[Figure: the hits from the original search, with the reformulated query shifted toward the relevant documents. Legend: x = documents identified by user as non-relevant; o = documents identified by user as relevant; both the original query and the reformulated query are marked.]
28. Difficulties with Relevance Feedback

[Figure: an optimal query, with a few relevant documents (o) scattered among many non-relevant documents (x). Legend: x = non-relevant documents; o = relevant documents.]
29. Difficulties with Relevance Feedback

[Figure: the same document space; the hits from the initial query are contained in a gray shaded area, with the original query and the reformulated query marked. Legend: x = non-relevant documents; o = relevant documents.]
30. Difficulties with Relevance Feedback

[Figure: the same document space, asking which region provides the optimal results set; the original query and the reformulated query are marked. Legend: x = non-relevant documents; o = relevant documents.]
31. Theoretically Best Query

For a specific query q, let:
  DR          be the set of all relevant documents
  DS          be the set of all non-relevant documents
  sim(q, DR)  be the mean similarity between query q and documents in DR
  sim(q, DS)  be the mean similarity between query q and documents in DS

A possible measure of a best query would be to maximize:

  F = sim(q, DR) - sim(q, DS)
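This measure translates directly into code; a minimal sketch, assuming all vectors have been normalized to length 1 so that the dot product is the cosine similarity:

```python
import numpy as np

def F(q, D_R, D_S):
    """Mean similarity of q to the relevant documents minus its mean
    similarity to the non-relevant documents (unit-length vectors)."""
    sim_R = np.mean([np.dot(q, d) for d in D_R])
    sim_S = np.mean([np.dot(q, d) for d in D_S])
    return sim_R - sim_S
```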
32. Estimating the Best Query

In practice, DR and DS are not known. (The objective is to find them.) However, the results of an initial query can be used to estimate sim(q, DR) and sim(q, DS).
33. Rocchio's Modified Query

Modified query vector =
    original query vector
  + mean vector of relevant documents found by the original query
  - mean vector of non-relevant documents found by the original query
34. Rocchio's Modified Query

  q1 = q0 + (1/n1) Σ ri - (1/n2) Σ si

where:
  q0  vector for the initial query
  q1  vector for the modified query
  ri  vector for relevant document i in the set of hits
  si  vector for non-relevant document i in the set of hits
  n1  number of relevant documents in the set of hits
  n2  number of non-relevant documents in the set of hits

(The sums run over the relevant and the non-relevant hits respectively.)
35. Adjusting Parameters: Relevance Feedback

The weighted form of the modified query is:

  q1 = α q0 + (β/n1) Σ ri - (γ/n2) Σ si

α, β and γ are weights that adjust the importance of the three vectors. If γ = 0, the weights provide positive feedback, by emphasizing the relevant documents in the initial set. If β = 0, the weights provide negative feedback, by reducing the emphasis on the non-relevant documents in the initial set.
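A sketch of this update in Python. The default values of alpha, beta and gamma are common choices in the relevance-feedback literature, not values prescribed by these slides:

```python
import numpy as np

def rocchio(q0, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.25):
    """q0: initial query vector.
    relevant / nonrelevant: lists of document vectors from the hits,
    as judged by the user. Returns the modified query vector q1."""
    q1 = alpha * np.asarray(q0, dtype=float)
    if relevant:                      # beta term: positive feedback
        q1 = q1 + beta * np.mean(relevant, axis=0)
    if nonrelevant:                   # gamma term: negative feedback
        q1 = q1 - gamma * np.mean(nonrelevant, axis=0)
    return q1
```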
36. When to Use Relevance Feedback

Relevance feedback is most important when the user wishes to increase recall, i.e., when it is important to find all relevant documents. Under these circumstances, users can be expected to put effort into searching:
- Formulate queries thoughtfully, with many terms.
- Review results carefully to provide feedback.
- Iterate several times.
- Combine automatic query enhancement with studies of thesauruses and other manual enhancements.
37. Relevance Feedback: Clickthrough Data

Relevance feedback methods have suffered from the unwillingness of users to provide feedback. Joachims and others have developed methods that use clickthrough data from online searches.

Concept: suppose that a query delivers a set of hits to a user. If the user skips a link from hit a and clicks on a link from hit b, which was ranked lower, then the user preference reflects:

  rank(b) < rank(a)
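In the spirit of Joachims' method, preference pairs can be extracted from a clicked result list as follows; a minimal sketch that reproduces the example on the next slide:

```python
def preferences_from_clicks(ranking, clicked):
    """ranking: list of hit ids in presented order.
    clicked: set of ids the user clicked on.
    Returns (b, a) pairs meaning the user prefers clicked hit b to
    the higher-ranked but skipped hit a."""
    prefs = []
    for pos, b in enumerate(ranking):
        if b in clicked:
            for a in ranking[:pos]:
                if a not in clicked:
                    prefs.append((b, a))
    return prefs

# User clicks on hits 1, 3 and 4; hit 2 was skipped.
print(preferences_from_clicks([1, 2, 3, 4, 5], {1, 3, 4}))
# [(3, 2), (4, 2)]
```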
38. Clickthrough Example

Ranking presented to user:
  1. Kernel Machines
     http://svm.first.gmd.de/
  2. Support Vector Machine
     http://jbolivar.freeservers.com/
  3. SVM-Light Support Vector Machine
     http://ais.gmd.de/~thorsten/svm_light/
  4. An Introduction to Support Vector Machines
     http://www.support-vector.net/
  5. Support Vector Machine and Kernel ... References
     http://svm.research.bell-labs.com/SVMrefs.html

The user clicks on 1, 3 and 4. Inferred preferences: rank(3) < rank(2) and rank(4) < rank(2).

(Joachims)
39. Adjusting Parameters: Weights by Machine Learning

Any query can be written:

  q = w1 e1 + w2 e2 + ... + wn en

where the ei are a basis for the term vector space (unit vectors corresponding to the terms in the word list) and the wi are the corresponding weights. If a query is used repeatedly, optimal values of the wi can be estimated using machine learning.
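Under the linear measure F from slide 31, the best unit-length weight vector has a simple closed form: the normalized difference between the mean relevant and mean non-relevant vectors. This sketch assumes judged document vectors have been accumulated over repeated uses of the query; practical systems use richer learning methods, such as ranking losses over preference pairs.

```python
import numpy as np

def learn_weights(relevant, nonrelevant):
    """relevant / nonrelevant: lists of document vectors accumulated
    over repeated uses of the same query.
    Returns the weight vector w that maximizes
    mean(w . r) - mean(w . s) subject to |w| = 1."""
    direction = np.mean(relevant, axis=0) - np.mean(nonrelevant, axis=0)
    return direction / np.linalg.norm(direction)
```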
40. Information Filtering: Seeking Optimal Parameters using Machine Learning

             GENERAL                               EXAMPLE: Text Retrieval

Input        training examples;                    queries with relevance judgments;
             design space                          parameters of the retrieval function

Training     automatically find the solution in    find parameters so that many relevant
             design space that works well on the   documents are ranked highly
             training data

Prediction   predict well on new examples          rank relevant documents high also
                                                   for new queries