IST 2140 Information Storage and Retrieval


Provided by: eras


1
IST 2140 Information Storage and Retrieval

Week 3
Information Retrieval Models

2
Retrieval Tasks

Ad hoc retrieval
Static documents, new queries
formerly called retrospective retrieval
Filtering
Static queries, new documents
Based on user profile
Formerly called SDI (selective dissemination of
information) or current awareness

3
Information Retrieval
Matching Process
Document Collection
Query
4
Basic concepts

Document is described by a set of representative
keywords (index terms)
Keywords may have binary weights or weights
calculated from statistics of their frequency in
text
Retrieval is a matching process between
document keywords and words in queries

5
Information Retrieval Models

A model is an embodiment of the theory in which
we define a set of objects about which assertions
can be made and restrict the ways in which
classes of objects can interact
A retrieval model specifies the representations
used for documents and information needs, and how
they are compared.
(Turtle & Croft, 1992)

6
A formal characterization of Information
Retrieval Models

An information retrieval model is a quadruple
[D, Q, F, R(qi, dj)] where
D is a set of representations for the documents
in the collection
Q is a set of representations for the user
information needs (queries)
F is a framework for modelling document
representations, queries, and their relationships
R(qi, dj) is a ranking function which associates a
real number with a query qi (qi ∈ Q) and document
representation dj (dj ∈ D) (Baeza-Yates &
Ribeiro-Neto, 1999)

7
Information Retrieval Models

Three classic models
Boolean Model
Vector Space Model
Probabilistic Model
Additional models
Extended Boolean
Fuzzy matching
Cluster-based retrieval
Language models

8
Implementation vs. Models

An IR model is a formalization of the way of
thinking about information retrieval
Compare to implementation---how to operationalize
the model in a given environment (e.g. file
structures)

9
Classic Retrieval Models

Boolean
Documents and queries are sets of index terms
set theoretic
Vector
Documents and queries are vectors in
L-dimensional space
algebraic
Probabilistic
Based on probability theory

10
Boolean Model

first online systems in 60s and 70s
most widely used in commercial IR
AND, OR, NOT operators
usually supplemented with proximity operators
requires an exact match
based on inverted file

11
Boolean Model

Based on set theory and Boolean algebra
Queries are specified as Boolean expressions
Widely used in commercial IR systems (Dialog,
Lexis/Nexis)
Interpreted using Venn Diagrams
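
The set-theoretic view can be sketched in a few lines of Python. This is a minimal illustration, not how Dialog or Lexis/Nexis are implemented: the toy collection, the whitespace tokenizer and the helper names `build_inverted_file` and `postings` are all assumptions made for the example. Each term maps to the set of documents containing it, and AND, OR, NOT become set intersection, union and difference.

```python
from collections import defaultdict

# Hypothetical toy collection; document IDs and texts are illustrative only.
docs = {
    1: "information retrieval models",
    2: "information storage",
    3: "retrieval of cats and felines",
    4: "cats",
}

def build_inverted_file(collection):
    """Map each index term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_file(docs)
all_ids = set(docs)

def postings(term):
    """Return the posting set for a term (empty set if the term is unknown)."""
    return index.get(term, set())

info_and_retrieval = postings("information") & postings("retrieval")  # AND
cats_or_felines = postings("cats") | postings("felines")              # OR
not_cats = all_ids - postings("cats")                                 # NOT

print(info_and_retrieval, cats_or_felines, not_cats)
```

A document is either in the result set or not; nothing in this sketch orders the retrieved documents.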

12
Boolean Operators

13
Boolean AND

Information AND Retrieval

[Venn diagram: the overlap of the Information and Retrieval circles]
14
Boolean OR

Cats OR Felines

[Venn diagram: the combined area of the Cats and Felines circles]
15
Example

Draw a Venn diagram for
Care and feeding and (cats or dogs)
What is the meaning of
Information and retrieval and performance or
evaluation

16
Boolean-based Matching

Exact match systems separate the documents
containing a given term from those that do not.
No similarity between document and query
structure
Proximity Judgment Gradations of the retrieved
set

[Figure: binary term-document matrix. Terms: Mediterranean, scholarships, horticulture, agriculture, cathedrals, adventure, disasters, leprosy, recipes, bridge, tennis, Venus, flags. Example queries: flags AND tennis; leprosy AND tennis; Venus OR (tennis AND flags); (bridge OR flags) AND tennis]
17
Sample Boolean Query

On DIALOG
SELECT nutritional()value? and (junk()food? or
fast()food? or pizza? or hamburger? or
french()(fry or fries))
(note use of positional operators)

18
Boolean model

Terms are present or absent
Index term weights are binary {0,1}
Exact match system: documents are predicted to be
relevant or non-relevant

19
Other Boolean features

Order dependency of operators
( ), NOT, AND, OR (DIALOG)
May differ on different systems
Nesting of search terms
Nutrition and (fast or junk) and food
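
Why the order of operators matters can be shown with a small sketch. The posting sets below are hypothetical document IDs, not data from any real system. Python's set operators happen to follow the same relative order as the DIALOG list above (`&` for AND binds tighter than `|` for OR), so an unparenthesised query groups differently from the nested one.

```python
# Hypothetical posting sets (document IDs) for four terms.
information = {1, 2, 3}
retrieval = {2, 3, 4}
performance = {3, 5}
evaluation = {2, 6}

# AND is applied before OR, so
# "information AND retrieval AND performance OR evaluation" groups as:
default_grouping = (information & retrieval & performance) | evaluation

# Explicit nesting keeps OR inside the conjunction:
nested = information & retrieval & (performance | evaluation)

print(sorted(default_grouping))  # [2, 3, 6]
print(sorted(nested))            # [2, 3]
```

Document 6 mentions only evaluation, yet the unparenthesised form retrieves it.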

20
Other Boolean features

Extensions to AND operator
Positional operators
adjacency A ADJ B, A(W)B
Field limitations
A/TI, B/TI,DE
Extensions to OR operator
Truncation
A?, A?? ?
Browsing indexes
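
Adjacency and truncation can be sketched over a positional index. This is an illustrative sketch only: the documents, the whitespace tokenizer and the function names `adjacent` and `truncated` are assumptions, and `truncated` models only open-ended truncation (any number of trailing characters), not the length-limited forms.

```python
from collections import defaultdict

# Hypothetical documents; the text is illustrative only.
docs = {
    1: "fast food has low nutritional value",
    2: "food served fast",
    3: "nutrition and foods",
}

def build_positional_index(collection):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in collection.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

index = build_positional_index(docs)

def adjacent(a, b):
    """Documents in which term a is immediately followed by term b."""
    hits = set()
    for doc_id, positions_a in index.get(a, {}).items():
        positions_b = set(index.get(b, {}).get(doc_id, []))
        if any(p + 1 in positions_b for p in positions_a):
            hits.add(doc_id)
    return hits

def truncated(stem):
    """Documents containing any index term that starts with stem."""
    hits = set()
    for term, by_doc in index.items():
        if term.startswith(stem):
            hits.update(by_doc)  # adds the document IDs (dict keys)
    return hits

print(adjacent("fast", "food"))  # {1}
print(truncated("nutrit"))       # {1, 3}
```

Adjacency narrows a plain AND (document 2 contains both words but not in sequence), while truncation widens a single term into an implicit OR over its variants.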

21
Pros and Cons

+ simple formalism
- hard to control (too many, too few documents
retrieved)
- requires formally constructed queries
- performance issues??

22
Extended Boolean model

Add partial matching and term weighting
E.g.
Take retrieval set and weight by binary terms
Take retrieval set and calculate weights and
ranks within the set
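
One possible reading of "rank within the retrieval set" is sketched below. It assumes documents are sets of index terms, retrieves with a Boolean OR, and then orders the retrieved set by the number of query terms matched. The data and the scoring rule are illustrative assumptions, not a specific published extended Boolean model.

```python
# Hypothetical documents represented as sets of index terms.
docs = {
    "d1": {"nutrition", "fast", "food"},
    "d2": {"nutrition", "pizza"},
    "d3": {"tennis", "flags"},
    "d4": {"fast", "food"},
}
query_terms = {"nutrition", "fast", "food"}

# Step 1: Boolean OR retrieval gives an unranked set.
retrieved = {d for d, terms in docs.items() if terms & query_terms}

# Step 2: score each retrieved document by its count of matching
# terms (binary weights), then sort; ties are broken by document ID.
scores = {d: len(docs[d] & query_terms) for d in retrieved}
ranked = sorted(retrieved, key=lambda d: (-scores[d], d))

print(ranked)  # ['d1', 'd4', 'd2']
```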

23
Vector Model

Non-binary weights
Calculate degree of similarity between document
and query
Ranked output by sorting similarity values
Also called vector space model

24
Vector Space Model

based on idea of n-dimensional document space
query is also located in document space
documents are ranked in order of their
closeness to the query
many possible matching functions

25
Vector Space Model

Documents and queries are points in L-dimensional
space (where L is number of unique index terms in
the data collection)

[Figure: query Q and document D plotted as points in the term space]
26
Document and Query Vectors

Documents and Queries are vectors of terms
Actual vectors have many terms (thousands)
Vectors can use binary keyword weights or assume
0-1 weights (term frequencies)
Example terms: dog, cat, house, sink, road, car
Binary: (1,1,0,0,0,0), (0,0,1,1,0,0)
Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)
Queries can be weighted also

27
Vector Space Model with Term Weights

assume document terms have different values for
retrieval
therefore assign weights to each term in each
document
example: TF×IDF weighting
proportional to frequency of term in document
inversely proportional to frequency of term in
collection
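
A minimal sketch of TF×IDF weighting follows. The slide states only the two proportionalities; the specific formula used here, raw term frequency times log(N / document frequency), is one common variant and is an assumption, as are the toy documents and the function name `tfidf`.

```python
import math
from collections import Counter

# Hypothetical three-document collection.
docs = {
    "d1": "cat cat dog",
    "d2": "dog dog house",
    "d3": "cat sink",
}

tokenized = {d: text.split() for d, text in docs.items()}
N = len(tokenized)  # number of documents in the collection

# Document frequency: in how many documents does each term occur?
df = Counter()
for terms in tokenized.values():
    df.update(set(terms))

def tfidf(doc_id):
    """Weight = term frequency in the document * log(N / document frequency)."""
    tf = Counter(tokenized[doc_id])
    return {t: tf[t] * math.log(N / df[t]) for t in tf}

for d in docs:
    print(d, tfidf(d))
```

In d2, "house" occurs once and "dog" twice, but "house" appears in only one document of the collection, so under this variant it receives the larger weight.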

28
A 2-D Document Space

Consider a document collection with only two
words: cat, dog
Three document vectors: D1, D2, D3

[Figure: D1, D2 and D3 plotted in the 2-D cat/dog space]
29
A 3-D Document Space
Di = 3t1 + 4t2 + 2t3
Query = 0t1 + t2 + t3
30
Term document matrix

A document collection can be represented by a
matrix in which rows are documents and columns
are terms; entries are 0 (if term is absent in a
document) or a value for term weight
Note that this is not efficient for
implementation; other data structures are used
(e.g. inverted file)
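
The contrast between the matrix view and an inverted file can be sketched as follows. The terms, documents and weights are hypothetical. The matrix stores every cell, zeros included, while the inverted structure keeps only the non-zero entries, grouped by term.

```python
# Hypothetical term-document matrix: rows are documents, columns are terms.
terms = ["cat", "dog", "house"]
doc_ids = ["d1", "d2", "d3"]
matrix = [
    [2, 1, 0],  # d1
    [0, 2, 1],  # d2
    [1, 0, 0],  # d3
]

# Inverted file: term -> {doc_id: weight}, keeping non-zero entries only.
inverted = {t: {} for t in terms}
for row, doc_id in zip(matrix, doc_ids):
    for term, weight in zip(terms, row):
        if weight != 0:
            inverted[term][doc_id] = weight

cells_in_matrix = len(matrix) * len(terms)
entries_in_inverted = sum(len(p) for p in inverted.values())
print(inverted)
print(cells_in_matrix, entries_in_inverted)  # 9 5
```

With thousands of terms and most documents containing only a small fraction of them, nearly all matrix cells are zero, which is why the sparse form is preferred in practice.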

31
Vector-based Matching Metrics

Metric or distance measure: documents close
together in the vector space are likely to be
highly similar

32
Measuring similarity

Need a means of computing similarity between
documents and queries (or pairs of documents)
Many proposed; none ideal
Allows:
Ranking
Thresholding
Query reformulation

33
Some conventions

N = number of documents in collection
L = number of unique terms in collection
Di = the ith document in the collection
Tj = the jth term in the collection
Q = a query
Wij = the weight of term i in Document j
WiQ = the weight of term i in Query

34
Inner product

Many similarity functions are based on the inner
product
$$\mathrm{SIM}(D_i, Q) = \sum_{j=1}^{L} D_{ij}\, Q_j$$
For binary vectors, reduces to number of terms in
common in D and Q
For weighted terms it is the sum of the products
of the term weights in documents and query
Favours longer documents
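
A short sketch of the inner product, using the six-term order dog, cat, house, sink, road, car. The query and the two binary vectors follow the binary example given earlier in the deck; `d_long` is a hypothetical longer document with made-up term frequencies.

```python
def inner_product(d, q):
    """Sum of the products of corresponding term weights."""
    return sum(dw * qw for dw, qw in zip(d, q))

# Term order: dog, cat, house, sink, road, car
q = [1, 1, 0, 0, 0, 0]        # query: dog cat
d_short = [1, 1, 0, 0, 0, 0]  # binary document sharing both query terms
d_other = [0, 0, 1, 1, 0, 0]  # binary document sharing no query terms
d_long = [3, 3, 5, 5, 5, 5]   # hypothetical long document (term frequencies)

print(inner_product(d_short, q))  # 2, the number of terms in common
print(inner_product(d_other, q))  # 0
print(inner_product(d_long, q))   # 6
```

`d_long` is mostly about other terms, yet it outscores `d_short`, which matches the query exactly. This is the bias toward longer documents that length normalization is meant to remove.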

35
Calculating Degree of Similarity

Many candidate measures, e.g.
Dice Coefficient
Jaccard Coefficient
Cosine Coefficient
Based on contribution to similarity of
co-occurring terms
Generally normalized by length of document

36
Dice Coefficient

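$$\mathrm{SIM}(D,Q) = \frac{2 \sum_{i=1}^{L} w_{iD}\, w_{iQ}}{\sum_{i=1}^{L} w_{iD}^{2} + \sum_{i=1}^{L} w_{iQ}^{2}}$$
Reduces to 2A/(B+C) for binary documents
(A = number of terms shared by D and Q, B = number
of terms in D, C = number of terms in Q)
Sum is taken over all i from 1 to L (dimensionality
of vector space)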
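
A minimal sketch of the Dice coefficient, read here as twice the inner product divided by the sum of the squared weights of D and Q. The vectors are hypothetical binary examples, and the check assumes A counts shared terms while B and C count the terms in D and in Q.

```python
def dice(d, q):
    """Dice coefficient for two equal-length weight vectors."""
    numerator = 2 * sum(x * y for x, y in zip(d, q))
    denominator = sum(x * x for x in d) + sum(y * y for y in q)
    return numerator / denominator if denominator else 0.0

# Hypothetical binary vectors over six terms.
d = [1, 1, 1, 0, 0, 0]
q = [1, 1, 0, 1, 0, 0]

A = sum(1 for x, y in zip(d, q) if x and y)  # shared terms: 2
B = sum(d)                                   # terms in D: 3
C = sum(q)                                   # terms in Q: 3

print(dice(d, q), 2 * A / (B + C))  # both equal 4/6
```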

37
Jaccard Coefficient

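$$\mathrm{SIM}(D,Q) = \frac{\sum_{i=1}^{L} w_{iD}\, w_{iQ}}{\sum_{i=1}^{L} w_{iD}^{2} + \sum_{i=1}^{L} w_{iQ}^{2} - \sum_{i=1}^{L} w_{iD}\, w_{iQ}}$$
Reduces to A/(B+C-A) for binary documents
(A = number of terms shared by D and Q, B = number
of terms in D, C = number of terms in Q)
Sum is taken over all i from 1 to L (dimensionality
of vector space)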
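
A minimal sketch of the Jaccard coefficient, read here as the inner product divided by the sum of squared weights minus the inner product. The vectors are hypothetical binary examples, and the check assumes A counts shared terms while B and C count the terms in D and in Q, so the denominator is the size of the union of the two term sets.

```python
def jaccard(d, q):
    """Jaccard coefficient for two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    denominator = sum(x * x for x in d) + sum(y * y for y in q) - dot
    return dot / denominator if denominator else 0.0

# Hypothetical binary vectors over six terms.
d = [1, 1, 1, 0, 0, 0]
q = [1, 1, 0, 1, 0, 0]

A = sum(1 for x, y in zip(d, q) if x and y)  # shared terms: 2
B = sum(d)                                   # terms in D: 3
C = sum(q)                                   # terms in Q: 3

print(jaccard(d, q), A / (B + C - A))  # 0.5 0.5
```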

38
Cosine Measure

Very commonly used
Measures cosine of angle between document-query
(or document-document) vector

[Figure: angle θ between the vectors Di and Q]
39
Cosine coefficient

$$\mathrm{SIM}(D_j,Q) = \frac{\sum_{i=1}^{L} w_{i,D_j}\, w_{i,q}}{\sqrt{\sum_{i=1}^{L} w_{i,D_j}^{2}}\;\sqrt{\sum_{i=1}^{L} w_{i,q}^{2}}}$$
Sum is taken over all i from 1 to L (dimensionality
of vector space)
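
A minimal sketch of the cosine coefficient: the inner product divided by the product of the two vector lengths. The vectors are hypothetical; `d_scaled` is the same document as `d_same` with every weight tripled, to show the effect of length normalization.

```python
import math

def cosine(d, q):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(d, q))
    norm = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in q))
    return dot / norm if norm else 0.0

q = [1, 1, 0, 0, 0, 0]
d_same = [1, 1, 0, 0, 0, 0]        # same direction as the query
d_scaled = [3, 3, 0, 0, 0, 0]      # same direction, three times the length
d_orthogonal = [0, 0, 1, 1, 0, 0]  # no terms in common with the query

for name, d in [("same", d_same), ("scaled", d_scaled), ("orthogonal", d_orthogonal)]:
    print(name, round(cosine(d, q), 6))
```

Scaling a document vector changes its inner product with the query but not its cosine, so the measure depends only on the angle between the vectors.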

40
What Similarity Function to Use?

Some studies compare the performance of measures
(e.g. McGill, 67 measures)
Significant differences in performance; measures may
perform differently with high- or low-frequency
terms
May be monotonic among similar measures
No consensus on which types of measure are best for
which application
Cosine measure is widely used
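
A worked example of two measures that are monotonic with respect to each other, using the Dice and Jaccard coefficients. The shorthand x and s is introduced here for the derivation only, and the non-trivial case x > 0 is assumed.

```latex
% Shorthand (introduced for this derivation only):
%   x = \sum_i w_{iD} w_{iQ},   s = \sum_i w_{iD}^2 + \sum_i w_{iQ}^2
\begin{align*}
\mathrm{Dice}(D,Q)    &= \frac{2x}{s}, \qquad
\mathrm{Jaccard}(D,Q)  = \frac{x}{s - x} \\[4pt]
\frac{\mathrm{Dice}}{2 - \mathrm{Dice}}
  &= \frac{2x/s}{(2s - 2x)/s}
   = \frac{x}{s - x}
   = \mathrm{Jaccard} \\[4pt]
\frac{d}{dt}\!\left(\frac{t}{2 - t}\right)
  &= \frac{2}{(2 - t)^2} > 0 \quad \text{for } 0 \le t \le 1
\end{align*}
```

Because Jaccard is an increasing function of Dice, the two coefficients rank any set of documents in the same order against a given query; only the score values differ.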

41
Pros and Cons