Intro to Information Retrieval - PowerPoint PPT Presentation

Provided by: robe60

Title: Intro to Information Retrieval

1
Intro to Information Retrieval

By the end of the lecture you should be able to:
explain the differences between database and information retrieval technologies;
describe the basic maths underlying the set-theoretic and vector models of classical IR.

2
Reminder: efficiency is vital

Reminder: Google finds documents which match your keywords. This must be done EFFICIENTLY: it can't just go through each document from start to end for each keyword.
So, the cache stores a copy of the document, and also a cut-down version of the document for searching: just a "bag of words", a sorted list (or array, vector, ...) of the words appearing in the document (with links back to the full document).
Try to match the keywords against this list; if found, then return the full document.
Even cleverer: a dictionary and an inverted file.
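
The bag-of-words idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a description of how Google works: the helper names (bag_of_words, contains, search), the crude punctuation stripping, and the use of the lecture's three example documents as the cache contents are all choices made for this sketch.

```python
from bisect import bisect_left

def bag_of_words(text):
    """Return the sorted list of distinct lower-cased words in text."""
    words = (w.strip(".,;:!?()\"'").lower() for w in text.split())
    return sorted({w for w in words if w})

def contains(bag, keyword):
    """Binary-search the sorted bag for keyword."""
    k = keyword.lower()
    i = bisect_left(bag, k)
    return i < len(bag) and bag[i] == k

def search(cache, keywords):
    """Return ids of cached documents whose bag contains every keyword."""
    return [doc_id for doc_id, (bag, _full_text) in cache.items()
            if all(contains(bag, k) for k in keywords)]

# Hypothetical cache: each entry keeps the bag plus the full document.
documents = {
    "d1": "Recipe for jam pudding",
    "d2": "DoT report on traffic lanes",
    "d3": "Radio item on traffic jam in Pudding Lane",
}
cache = {doc_id: (bag_of_words(text), text)
         for doc_id, text in documents.items()}

print(search(cache, ["jam", "pudding"]))  # ['d1', 'd3']
```

Each lookup is a binary search over one document's sorted word list, so the full text is never scanned at query time; every document's bag is still visited once per query, which is the cost the inverted file removes.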

3
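Inverted file structure
[Figure: three linked parts labelled "dictionary", "inverted or postings file" and "data file". Rows of labels recovered from the figure:]
1 2 1 2 3 2 2 3 4 ...
Term 1 (2), Term 2 (3), Term 3 (1), Term 4 (3), Term 5 (4), ...
Doc 1, Doc 2, Doc 3, Doc 4, Doc 5, Doc 6, ...
1 3 6 7 9 ...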
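
A minimal sketch of a dictionary plus postings file, in Python. The layout is an assumption made for illustration: each dictionary entry is taken to hold an (offset, count) pair pointing into one flat postings list of document numbers, which is one plausible reading of the figure rather than a confirmed one. The function names and the sample documents are likewise hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Build (dictionary, postings) from a mapping doc number -> text.

    dictionary maps term -> (offset, count) into the flat postings list;
    postings holds document numbers, grouped by term in sorted term order.
    """
    term_docs = defaultdict(set)
    for doc_no, text in docs.items():
        for word in text.lower().split():
            term_docs[word].add(doc_no)

    dictionary, postings = {}, []
    for term in sorted(term_docs):
        doc_nos = sorted(term_docs[term])
        dictionary[term] = (len(postings), len(doc_nos))
        postings.extend(doc_nos)
    return dictionary, postings

def lookup(dictionary, postings, term):
    """Return the document numbers that contain term (empty if unknown)."""
    offset, count = dictionary.get(term.lower(), (0, 0))
    return postings[offset:offset + count]

docs = {
    1: "recipe for jam pudding",
    2: "dot report on traffic lanes",
    3: "radio item on traffic jam in pudding lane",
}
dictionary, postings = build_inverted_file(docs)
print(lookup(dictionary, postings, "jam"))      # [1, 3]
print(lookup(dictionary, postings, "traffic"))  # [2, 3]
```

A query term now costs one dictionary lookup and one slice of the postings list, independent of how many documents do not contain the term.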
4
IR vs DBMS
5
informal introduction

IR was developed for bibliographic systems. We shall refer to "documents", but the technique extends beyond items of text.
Central to IR is the representation of a document by a set of "descriptors" or "index terms" (words in the document).
Searching for a document is carried out (mainly) in the space of index terms.
We need a language for formulating queries, and a method for matching queries with document descriptors.

6
architecture
[Figure: system architecture, with parts labelled user, query, query matching, hits, feedback, learning component, and object base (objects and their descriptions).]
7
basic notation
Given a list of m documents, D, and a list of n index terms, T, we define $w_{i,j} \ge 0$ to be a weight associated with the ith keyword and the jth document. For the jth document, we define an index term vector, $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{n,j})$.
For example: $D = \{d_1, d_2, d_3\}$, $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
$d_1 = (1, 1, 0, 0, 0)$: Recipe for jam pudding
$d_2 = (0, 0, 1, 1, 0)$: DoT report on traffic lanes
$d_3 = (1, 1, 1, 1, 0)$: Radio item on traffic jam in Pudding Lane
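
As a rough illustration, the binary index term vectors of the example can be derived from the document titles with a few lines of Python. The function name index_term_vector is hypothetical, and the handling of plurals (dropping one trailing "s" so that "lanes" matches "lane") is a crude assumption made only so that this sketch reproduces the vectors of the example.

```python
# Index terms, in the order used by the example.
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def normalise(word):
    """Lower-case a word and drop one trailing 's' (crude plural handling)."""
    word = word.lower()
    return word[:-1] if word.endswith("s") else word

def index_term_vector(text, terms=T):
    """Return the binary vector (w_1, ..., w_n): 1 if the term occurs."""
    words = {normalise(w) for w in text.split()}
    return tuple(1 if term in words else 0 for term in terms)

d1 = index_term_vector("Recipe for jam pudding")
d2 = index_term_vector("DoT report on traffic lanes")
d3 = index_term_vector("Radio item on traffic jam in Pudding Lane")
print(d1, d2, d3)
```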
8
set theoretic, Boolean model

Queries are Boolean expressions formed using keywords, e.g.
$(\text{jam} \lor \text{treacle}) \land \text{pudding} \land \lnot\text{lane} \land \lnot\text{traffic}$
The query is re-expressed in disjunctive normal form (DNF).

cf. $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
e.g. $(1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$
To match a document with a query:
$\mathrm{sim}(d, q_{DNF}) = 1$ if d is equal to a component of $q_{DNF}$, and 0 otherwise.
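
The matching rule can be tried out with a small Python sketch. It is an illustration under assumptions: the query is written as an ordinary Python predicate, and the DNF is obtained by brute-force enumeration of all 2^n binary vectors, which is practical only for a handful of terms. The names query, to_dnf and sim are chosen for this sketch.

```python
from itertools import product

# Index terms, in the order used by the example.
T = ["pudding", "jam", "traffic", "lane", "treacle"]

def query(pudding, jam, traffic, lane, treacle):
    """(jam OR treacle) AND pudding AND NOT lane AND NOT traffic."""
    return (jam or treacle) and pudding and not lane and not traffic

def to_dnf(predicate, n):
    """Return the set of binary vectors (DNF components) satisfying predicate."""
    return {bits for bits in product((0, 1), repeat=n) if predicate(*bits)}

def sim(d, q_dnf):
    """Return 1 if d equals a component of the DNF query, otherwise 0."""
    return 1 if tuple(d) in q_dnf else 0

q_dnf = to_dnf(query, len(T))
print(sorted(q_dnf))

d1, d2, d3 = (1, 1, 0, 0, 0), (0, 0, 1, 1, 0), (1, 1, 1, 1, 0)
print(sim(d1, q_dnf), sim(d2, q_dnf), sim(d3, q_dnf))  # 1 0 0
```

The enumeration yields exactly the three components listed above, and only d1 matches.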
9
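$(1, 1, 0, 0, 0) \lor (1, 0, 0, 0, 1) \lor (1, 1, 0, 0, 1)$
$T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
[Figure: diagram of the term sets treacle, pudding, jam, traffic and lane.]
$d_1 = (1, 1, 0, 0, 0)$, $d_2 = (0, 0, 1, 1, 0)$, $d_3 = (1, 1, 1, 1, 0)$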
10
collecting results
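$T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
Query: $(\text{jam} \lor \text{treacle}) \land \text{pudding} \land \lnot\text{lane} \land \lnot\text{traffic}$
[Figure: diagram of the term sets treacle, pudding, jam, traffic and lane, with the region $(\text{jam} \cup \text{treacle}) \cap \text{pudding} - \text{lane} - \text{traffic}$ marked.]
Answer: $d_1 = (1, 1, 0, 0, 0)$, Jam pud recipe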
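
The same answer can be collected with set operations on per-term document sets, which is roughly what an inverted file makes cheap. In this Python sketch the postings dictionary is written by hand from the vectors d1, d2, d3 of the example; the variable names are assumptions of the sketch.

```python
# Documents containing each index term, read off d1, d2, d3.
postings = {
    "pudding": {"d1", "d3"},
    "jam": {"d1", "d3"},
    "traffic": {"d2", "d3"},
    "lane": {"d2", "d3"},
    "treacle": set(),
}

# (jam OR treacle) AND pudding, minus lane, minus traffic:
# union for OR, intersection for AND, set difference for NOT.
answer = (
    ((postings["jam"] | postings["treacle"]) & postings["pudding"])
    - postings["lane"]
    - postings["traffic"]
)
print(answer)  # {'d1'}
```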
11
Statistical vector model

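Weights, $1 \ge w_{i,j} \ge 0$, are no longer binary-valued.
The query is also represented by a vector:
$q = (w_{1q}, w_{2q}, \ldots, w_{nq})$
e.g. $q = (1.0, 0.6, 0.0, 0.0, 0.8)$

cf. $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
To match the jth document with a query:
$\mathrm{sim}(d_j, q) = \dfrac{d_j \cdot q}{|d_j| \, |q|}$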
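
Written out in terms of the weights, the matching coefficient takes the following form. This is a sketch that assumes the usual Euclidean dot product and vector length, and that neither vector is all zeros; the summation index i over the n index terms is notation added here.

```latex
\mathrm{sim}(d_j, q)
  = \frac{d_j \cdot q}{|d_j| \, |q|}
  = \frac{\sum_{i=1}^{n} w_{i,j} \, w_{iq}}
         {\sqrt{\sum_{i=1}^{n} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{n} w_{iq}^{2}}}
```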
12
Cosine coefficient
[Figure: term axes T1 and T2, with the angle $\theta$ between two vectors; the coefficient is $\cos(\theta)$.]
13
Cosine coefficient
$\cos(0) = 1$
[Figure: term axes T1 and T2, with two vectors at angle $\theta = 0$.]
14
Cosine coefficient
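$\cos(90º) = 0$
[Figure: term axes T1 and T2. The document vector D1 has components $w_{11}$ and $w_{21} = 0$; the query vector Q has components $w_{1q} = 0$ and $w_{2q}$; the angle between them is $\theta = 90º$.]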
15
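$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
$d_1 = (0.8, 0.8, 0.0, 0.0, 0.2)$ Jam pud recipe
$d_1 \cdot q = 0.8 \times 1.0 + 0.8 \times 0.6 + 0.0 \times 0.0 + 0.0 \times 0.0 + 0.2 \times 0.8 = 1.44$
$|d_1|^2 = 0.8^2 + 0.8^2 + 0.0^2 + 0.0^2 + 0.2^2 = 1.32$
$|q|^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_1, q) = 1.44 / \sqrt{1.32 \times 2.0} \approx 0.89$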
16
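$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
$d_2 = (0.0, 0.0, 0.9, 0.8, 0.0)$ DoT report
$d_2 \cdot q = 0.0 \times 1.0 + 0.0 \times 0.6 + 0.9 \times 0.0 + 0.8 \times 0.0 + 0.0 \times 0.8 = 0.0$
$|d_2|^2 = 0.0^2 + 0.0^2 + 0.9^2 + 0.8^2 + 0.0^2 = 1.45$
$|q|^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_2, q) = 0.0 / \sqrt{1.45 \times 2.0} = 0$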
17
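$q = (1.0, 0.6, 0.0, 0.0, 0.8)$
$d_3 = (0.6, 0.9, 1.0, 0.6, 0.0)$ Radio traffic report
$d_3 \cdot q = 0.6 \times 1.0 + 0.9 \times 0.6 + 1.0 \times 0.0 + 0.6 \times 0.0 + 0.0 \times 0.8 = 1.14$
$|d_3|^2 = 0.6^2 + 0.9^2 + 1.0^2 + 0.6^2 + 0.0^2 = 2.53$
$|q|^2 = 1.0^2 + 0.6^2 + 0.0^2 + 0.0^2 + 0.8^2 = 2.0$
$\mathrm{sim}(d_3, q) = 1.14 / \sqrt{2.53 \times 2.0} \approx 0.51$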
18
collecting results
cf. $T = \{\text{pudding}, \text{jam}, \text{traffic}, \text{lane}, \text{treacle}\}$
$q = (1.0, 0.6, 0.0, 0.0, 0.8)$

Rank   Document vector                      Document               sim
1.     d1 = (0.8, 0.8, 0.0, 0.0, 0.2)       Jam pud recipe         0.89
2.     d3 = (0.6, 0.9, 1.0, 0.6, 0.0)       Radio traffic report   0.51
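
The ranking can be reproduced with a short Python sketch of the cosine coefficient. The function name cosine, the convention of returning 0.0 when either vector is all zeros, and the decision to drop documents with zero similarity from the hit list are assumptions of this sketch; the vectors are those of the worked example.

```python
from math import sqrt

def cosine(d, q):
    """Cosine coefficient: dot product over the product of vector lengths."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = sqrt(sum(x * x for x in d)) * sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0  # assumption: zero vector scores 0

q = (1.0, 0.6, 0.0, 0.0, 0.8)
docs = {
    "d1": (0.8, 0.8, 0.0, 0.0, 0.2),  # Jam pud recipe
    "d2": (0.0, 0.0, 0.9, 0.8, 0.0),  # DoT report
    "d3": (0.6, 0.9, 1.0, 0.6, 0.0),  # Radio traffic report
}

# Score every document, sort by decreasing similarity, keep non-zero hits.
scored = sorted(((cosine(d, q), name) for name, d in docs.items()),
                reverse=True)
hits = [(name, round(score, 2)) for score, name in scored if score > 0]
print(hits)  # [('d1', 0.89), ('d3', 0.51)]
```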
19
Discussion: set theoretic model

The Boolean model is simple and queries have precise semantics, but it is an exact-match model and does not rank results.
The Boolean model is popular with bibliographic systems and is available on some search engines.
Users find Boolean queries hard to formulate.
There have been attempts to use the set theoretic model as the basis for a partial-match system: the fuzzy set model and the extended Boolean model.

20
Discussion: vector model

The vector model is simple and fast, and results show that it retrieves well.
Partial matching leads to ranked output.
It is a popular model with search engines.
It rests on an underlying assumption of term independence (not realistic: consider phrases, collocations, grammar).
The generalised vector space model relaxes the assumption that index terms are pairwise orthogonal (but is more complicated).

21
questions raised

Where do the index terms come from? (ALL the
words in the source documents?)
What determines the weights?
How well can we expect these systems to work for
practical applications?
How can we improve them?
How do we integrate IR into more traditional DB
management?

22
Questions to think about

Why is a traditional database unsuited to the retrieval of unstructured information?
How would you re-express a Boolean query, e.g. (A or B or (C and not D)), in disjunctive normal form?
For the matching coefficient sim(., .), show that $0 \le \mathrm{sim}(\cdot, \cdot) \le 1$, and that $\mathrm{sim}(a, a) = 1$.
Compare and contrast the vector and set theoretic models in terms of their power of representation of documents and queries.