Title: Intro to Information Retrieval
1 Intro to Information Retrieval
- By the end of the lecture you should be able to:
- explain the differences between database and information retrieval technologies
- describe the basic maths underlying set-theoretic and vector models of classical IR.
2 Reminder: efficiency is vital
- Reminder: Google finds documents which match your keywords. This must be done EFFICIENTLY: it can't just go through each document from start to end for each keyword.
- So, the cache stores a copy of each document, and also a cut-down version of the document for searching: just a bag of words, a sorted list (or array/vector/...) of the words appearing in the document (with links back to the full document).
- Try to match keywords against this list; if found, then return the full document.
- Even cleverer: dictionary and inverted file.
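The bag-of-words matching described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `bag_of_words`, `contains`, `cache`, `add_document` and `search` are hypothetical, and the tokenisation (split on whitespace, strip punctuation, lower-case) is a simplification, not how any real search engine does it.

```python
from bisect import bisect_left

def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in text."""
    words = {w.strip(".,;:!?()\"'").lower() for w in text.split()}
    words.discard("")  # drop tokens that were pure punctuation
    return sorted(words)

def contains(bag, keyword):
    """Binary search the sorted bag instead of scanning the full document."""
    key = keyword.lower()
    i = bisect_left(bag, key)
    return i < len(bag) and bag[i] == key

# Hypothetical cache: document id -> (full text, bag of words).
cache = {}

def add_document(doc_id, text):
    cache[doc_id] = (text, bag_of_words(text))

def search(keyword):
    """Return the full text of every cached document whose bag holds keyword."""
    return [text for text, bag in cache.values() if contains(bag, keyword)]

add_document(1, "Recipe for jam pudding")
add_document(2, "DoT report on traffic lanes")
add_document(3, "Radio item on traffic jam in Pudding Lane")
print(search("jam"))
```

Because each bag is sorted, a keyword lookup is a binary search, but every document's bag still has to be visited; that remaining cost is what the dictionary and inverted file remove.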
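3 Inverted file structure
[Figure: three linked structures. The dictionary lists each term with its number of postings: Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), ... Each dictionary entry points into the inverted (postings) file, a single list of document numbers: 1 2 | 1 2 3 | 2 | 2 3 4 | ... The postings point in turn to the documents in the data file: Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, ... A further row of values, 1 3 6 7 9 ..., also appears in the figure.]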
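A minimal Python sketch of the dictionary and postings file idea, assuming whitespace tokenisation and in-memory lists. The function names `build_inverted_index` and `lookup`, and the choice to store an (offset, count) pair per term, are assumptions made for illustration; the slide does not specify the exact layout.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of doc_id -> text. Returns (dictionary, postings)."""
    term_docs = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            term_docs[word].add(doc_id)

    postings = []    # one flat list of document ids (the postings file)
    dictionary = {}  # term -> (offset into postings, number of documents)
    for term in sorted(term_docs):
        ids = sorted(term_docs[term])
        dictionary[term] = (len(postings), len(ids))
        postings.extend(ids)
    return dictionary, postings

def lookup(term, dictionary, postings):
    """Return the ids of the documents containing term (empty if unknown)."""
    offset, count = dictionary.get(term.lower(), (0, 0))
    return postings[offset:offset + count]

docs = {
    1: "Recipe for jam pudding",
    2: "DoT report on traffic lanes",
    3: "Radio item on traffic jam in Pudding Lane",
}
dictionary, postings = build_inverted_index(docs)
print(lookup("traffic", dictionary, postings))
```

A query now touches only the dictionary entry for each keyword and its slice of the postings list, rather than every document.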
4 IR vs DBMS
5 Informal introduction
- IR was developed for bibliographic systems. We shall refer to documents, but the technique extends beyond items of text.
- Central to IR is the representation of a document by a set of descriptors or index terms (words in the document).
- Searching for a document is carried out (mainly) in the space of index terms.
- We need a language for formulating queries, and a method for matching queries with document descriptors.
6 Architecture
[Figure: architecture diagram with the labels: user, query, query matching, hits, feedback, learning component, object base (objects and their descriptions).]
7 Basic notation
Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the ith keyword and the jth document. For the jth document, we define an index term vector, $d_j$:
$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$
For example:
D = {d1, d2, d3}, T = {pudding, jam, traffic, lane, treacle}
d1 = (1, 1, 0, 0, 0)  Recipe for jam pudding
d2 = (0, 0, 1, 1, 0)  DoT report on traffic lanes
d3 = (1, 1, 1, 1, 0)  Radio item on traffic jam in Pudding Lane
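The binary index term vectors in this example can be reproduced with a short Python sketch. The helper names `normalise` and `term_vector` are hypothetical, and stripping a trailing "s" is a crude stand-in for stemming, assumed here only so that "lanes" matches the index term "lane" as it does on the slide.

```python
T = ("pudding", "jam", "traffic", "lane", "treacle")

def normalise(word):
    """Crude stand-in for stemming (an assumption): lower-case, drop a plural 's'."""
    word = word.lower()
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def term_vector(text, terms=T):
    """Binary weights: w_ij = 1 if index term i occurs in document j, else 0."""
    words = {normalise(w) for w in text.split()}
    return tuple(1 if t in words else 0 for t in terms)

d1 = term_vector("Recipe for jam pudding")
d2 = term_vector("DoT report on traffic lanes")
d3 = term_vector("Radio item on traffic jam in Pudding Lane")
print(d1, d2, d3)
```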
8 Set theoretic, Boolean model
- Queries are Boolean expressions formed using keywords, e.g.
  $(\text{jam} \lor \text{treacle}) \land \text{pudding} \land \lnot\text{lane} \land \lnot\text{traffic}$
- The query is re-expressed in disjunctive normal form (DNF), e.g.
  $(1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$
  (cf. T = {pudding, jam, traffic, lane, treacle})
To match a document with a query:
$\mathrm{sim}(d, q_{DNF}) = 1$ if d is equal to a component of $q_{DNF}$, and 0 otherwise.
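One way to see where the DNF components come from is to enumerate every 0/1 assignment to the index terms and keep those that satisfy the query. The Python sketch below does this by brute force, which is only practical for a handful of terms; the names `query`, `to_dnf` and `sim` are illustrative assumptions.

```python
from itertools import product

T = ("pudding", "jam", "traffic", "lane", "treacle")

def query(pudding, jam, traffic, lane, treacle):
    """(jam OR treacle) AND pudding AND NOT lane AND NOT traffic."""
    return bool((jam or treacle) and pudding and not lane and not traffic)

def to_dnf(predicate, n_terms):
    """Set of binary term vectors satisfying the predicate (the DNF components)."""
    return {bits for bits in product((0, 1), repeat=n_terms) if predicate(*bits)}

def sim(d, q_dnf):
    """Exact match: 1 if d equals a component of the DNF query, else 0."""
    return 1 if tuple(d) in q_dnf else 0

q_dnf = to_dnf(query, len(T))
d1, d2, d3 = (1, 1, 0, 0, 0), (0, 0, 1, 1, 0), (1, 1, 1, 1, 0)
print(sorted(q_dnf), [sim(d, q_dnf) for d in (d1, d2, d3)])
```

Only d1 matches: d3 contains jam and pudding but also lane and traffic, so it equals none of the three components.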
9 $(1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$
T = {pudding, jam, traffic, lane, treacle}
[Figure: Venn diagram of the sets treacle, pudding, jam, traffic and lane.]
d1 = (1, 1, 0, 0, 0), d2 = (0, 0, 1, 1, 0), d3 = (1, 1, 1, 1, 0)
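10 Collecting results
T = {pudding, jam, traffic, lane, treacle}
Query: $(\text{jam} \lor \text{treacle}) \land \text{pudding} \land \lnot\text{lane} \land \lnot\text{traffic}$
[Figure: Venn diagram of the sets treacle, pudding, jam, traffic and lane, with the query region $(\text{jam} \cup \text{treacle}) \cap \text{pudding} \setminus \text{lane} \setminus \text{traffic}$.]
Answer: d1 = (1, 1, 0, 0, 0)  Jam pud recipe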
11 Statistical vector model
- weights, $1 \ge w_{i,j} \ge 0$, no longer binary-valued
- query also represented by a vector
- $q = (w_{1,q}, w_{2,q}, \ldots, w_{n,q})$
- e.g. q = (1.0, 0.6, 0.0, 0.0, 0.8)
  (cf. T = {pudding, jam, traffic, lane, treacle})
To match the jth document with a query:
$\mathrm{sim}(d_j, q) = \dfrac{d_j \cdot q}{|d_j|\,|q|}$
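The matching coefficient is the cosine of the angle between the document and query vectors. A minimal Python sketch, assuming plain tuples of weights; the function name `cosine_sim` and the convention of returning 0 for an all-zero vector are assumptions, since the slide does not cover that case.

```python
import math

def cosine_sim(d, q):
    """sim(d, q) = (d . q) / (|d| |q|) for two weight vectors of equal length."""
    dot = sum(x * y for x, y in zip(d, q))
    norm_d = math.sqrt(sum(x * x for x in d))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumption: an all-zero vector matches nothing
    return dot / (norm_d * norm_q)

print(cosine_sim((0.5, 0.0), (2.0, 0.0)))  # same direction, different lengths
print(cosine_sim((1.0, 0.0), (0.0, 1.0)))  # no terms in common
```

Dividing by the vector lengths means only the direction of the vectors matters, so a long document is not favoured merely for having larger weights.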
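12 Cosine coefficient
[Figure: two vectors in the plane of terms T1 and T2, separated by an angle θ; the coefficient is cos(θ).]
13 Cosine coefficient
[Figure: axes T1 and T2; θ = 0, so cos(0) = 1.]
14 Cosine coefficient
[Figure: axes T1 and T2; document D1 has weights $w_{1,1}$ and $w_{2,1} = 0$, query Q has weights $w_{1,q} = 0$ and $w_{2,q}$, so θ = 90° and cos(90°) = 0.]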
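15 q = (1.0, 0.6, 0.0, 0.0, 0.8)
d1 = (0.8, 0.8, 0.0, 0.0, 0.2)  Jam pud recipe
$d_1 \cdot q = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$
$|d_1|^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$
$|q|^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_1, q) = 1.44 / \sqrt{1.32 \times 2.0} \approx 0.89$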
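16 q = (1.0, 0.6, 0.0, 0.0, 0.8)
d2 = (0.0, 0.0, 0.9, 0.8, 0.0)  DoT Report
$d_2 \cdot q = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$
$|d_2|^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$
$|q|^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_2, q) = 0.0 / \sqrt{1.45 \times 2.0} = 0$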
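17 q = (1.0, 0.6, 0.0, 0.0, 0.8)
d3 = (0.6, 0.9, 1.0, 0.6, 0.0)  Radio Traffic Report
$d_3 \cdot q = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$
$|d_3|^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$
$|q|^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_3, q) = 1.14 / \sqrt{2.53 \times 2.0} \approx 0.51$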
18 Collecting results
cf. T = {pudding, jam, traffic, lane, treacle}
q = (1.0, 0.6, 0.0, 0.0, 0.8)
Rank  Document vector                   Document               sim
1.    d1 = (0.8, 0.8, 0.0, 0.0, 0.2)    Jam pud recipe         0.89
2.    d3 = (0.6, 0.9, 1.0, 0.6, 0.0)    Radio Traffic Report   0.51
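The ranking above can be checked with a few lines of Python. This is a sketch under assumptions: the names `cosine_sim` and `rank` are hypothetical, and dropping documents with zero similarity is inferred from d2 being absent from the ranked list.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between weight vectors d and q."""
    dot = sum(x * y for x, y in zip(d, q))
    norms = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norms if norms else 0.0

def rank(docs, q):
    """Return (name, sim) pairs with sim > 0, best match first."""
    scored = [(name, cosine_sim(vec, q)) for name, vec in docs.items()]
    hits = [pair for pair in scored if pair[1] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.2),  # Jam pud recipe
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT Report
    "d3": (0.6, 0.9, 1.0, 0.6, 0.0),  # Radio Traffic Report
}
for name, score in rank(docs, q):
    print(name, round(score, 2))
```

Unlike the Boolean model, d3 is still returned as a partial match, just with a lower score than d1.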
19 Discussion: set theoretic model
- The Boolean model is simple and queries have precise semantics, but it is an exact-match model and does not rank results.
- The Boolean model is popular with bibliographic systems and is available on some search engines.
- Users find Boolean queries hard to formulate.
- There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.
20 Discussion: vector model
- The vector model is simple and fast, and results show that it performs well.
- Partial matching leads to ranked output.
- Popular model with search engines.
- Underlying assumption of term independence (not realistic! Phrases, collocations, grammar).
- The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).
21 Questions raised
- Where do the index terms come from? (ALL the words in the source documents?)
- What determines the weights?
- How well can we expect these systems to work for practical applications?
- How can we improve them?
- How do we integrate IR into more traditional DB management?
22 Questions to think about
- Why is a traditional database unsuited to retrieval of unstructured information?
- How would you re-express a Boolean query, e.g. (A or B or (C and not D)), in disjunctive normal form?
- For the matching coefficient, $\mathrm{sim}(\cdot, \cdot)$, show that $0 \le \mathrm{sim}(\cdot, \cdot) \le 1$, and that $\mathrm{sim}(a, a) = 1$.
- Compare and contrast the vector and set theoretic models in terms of power of representation of documents and queries.