Title: IST 2140 Information Storage and Retrieval
1IST 2140 Information Storage and Retrieval
- Week 3
- Information Retrieval Models
2Retrieval Tasks
- Ad hoc retrieval
- Static documents, new queries
- formerly called retrospective retrieval
- Filtering
- Static queries, new documents
- Based on user profile
- Formerly called SDI (selective dissemination of
information) or current awareness
3Information Retrieval
Matching Process
Document Collection
Query
4Basic concepts
- Document is described by a set of representative keywords (index terms)
- Keywords may have binary weights or weights calculated from statistics of their frequency in text
- Retrieval is a matching process between document keywords and words in queries
5Information Retrieval Models
- A model is an embodiment of the theory in which we define a set of objects about which assertions can be made and restrict the ways in which classes of objects can interact
- A retrieval model specifies the representations used for documents and information needs, and how they are compared.
- (Turtle & Croft, 1992)
6A formal characterization of Information
Retrieval Models
- An information retrieval model is a quadruple [D, Q, F, R(qi, dj)] where
- D is a set of representations for the documents in the collection
- Q is a set of representations for the user information needs (queries)
- F is a framework for modelling document representations, queries, and their relationships
- R(qi, dj) is a ranking function which associates a real number with a query qi (qi ∈ Q) and a document representation dj (dj ∈ D)
- (Baeza-Yates & Ribeiro-Neto, 1999)
7Information Retrieval Models
- Three classic models
- Boolean Model
- Vector Space Model
- Probabilistic Model
- Additional models
- Extended Boolean
- Fuzzy matching
- Cluster-based retrieval
- Language models
8Implementation vs. Models
- An IR model is a formalization of the way of thinking about information retrieval
- Compare to implementation: how to operationalize the model in a given environment (e.g. file structures)
9Classic Retrieval Models
- Boolean
- Documents and queries are sets of index terms
- set theoretic
- Vector
- Documents and queries are vectors in L-dimensional space
- algebraic
- Probabilistic
- Based on probability theory
10Boolean Model
- first online systems in 60s and 70s
- most widely used in commercial IR
- AND, OR, NOT operators
- usually supplemented with proximity operators
- requires an exact match
- based on inverted file
11Boolean Model
- Based on set theory and Boolean algebra
- Queries are specified as Boolean expressions
- Widely used in commercial IR systems (Dialog, Lexis/Nexis)
- Interpreted using Venn Diagrams
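The set-theoretic reading of Boolean queries can be sketched with a small inverted file. This is a minimal illustration, not how Dialog or Lexis/Nexis are implemented; the collection, the document ids and the helper name `build_inverted_index` are all made up for the example.

```python
from collections import defaultdict

# Hypothetical toy collection; ids and texts are invented for illustration.
docs = {
    1: "information retrieval models",
    2: "information storage",
    3: "retrieval evaluation",
}

def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

index = build_inverted_index(docs)
all_ids = set(docs)

# Boolean operators become set operations on the posting sets.
and_result = index["information"] & index["retrieval"]  # AND -> intersection
or_result = index["information"] | index["retrieval"]   # OR  -> union
not_result = all_ids - index["information"]             # NOT -> complement
```

Each result is an unordered set: a document is either in it or not, which is the exact-match behaviour of the Boolean model.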
12Boolean Operators
13Boolean AND
- Information AND Retrieval
[Figure: Venn diagram of the sets Information and Retrieval]
14Boolean OR
[Figure: Venn diagram of the sets Felines and Cats]
15Example
- Draw a Venn diagram for
- Care and feeding and (cats or dogs)
- What is the meaning of
- Information and retrieval and performance or
evaluation
16Boolean-based Matching
- Exact match systems separate the documents containing a given term from those that do not.
- No similarity between document and query structure
- Proximity Judgment Gradations of the retrieved set
[Figure: binary term-document matrix with example queries evaluated against it]
- Terms: Mediterranean, scholarships, horticulture, agriculture, cathedrals, adventure, disasters, leprosy, recipes, bridge, tennis, Venus, flags
- Queries: flags AND tennis; leprosy AND tennis; Venus OR (tennis AND flags); (bridge OR flags) AND tennis
17Sample Boolean Query
- On DIALOG
- SELECT nutritional()value? and (junk()food? or fast()food? or pizza? or hamburger? or french()(fry or fries))
- (note use of positional operators)
18Boolean model
- Terms are present or absent
- Index term weights are binary {0,1}
- Exact match system: documents are predicted to be either relevant or non-relevant
19Other Boolean features
- Order dependency of operators
- ( ), NOT, AND, OR (DIALOG)
- May differ on different systems
- Nesting of search terms
- Nutrition and (fast or junk) and food
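Why nesting matters can be shown with sets. The sketch below uses invented posting sets (term mapped to the ids of documents containing it) and assumes AND is evaluated before OR, as in the DIALOG ordering listed above; other systems may differ.

```python
# Hypothetical posting sets: term -> ids of documents containing it.
postings = {
    "nutrition": {1, 2, 3},
    "fast": {2, 4},
    "junk": {3, 5},
    "food": {2, 3, 4, 5},
}
p = postings

# With explicit nesting: nutrition AND (fast OR junk) AND food
nested = p["nutrition"] & (p["fast"] | p["junk"]) & p["food"]

# Without parentheses, when AND binds tighter than OR, the same words read as
# (nutrition AND fast) OR (junk AND food)
unnested = (p["nutrition"] & p["fast"]) | (p["junk"] & p["food"])
```

Here the unnested form also retrieves document 5, which never mentions nutrition, so the two queries are not equivalent.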
20Other Boolean features
- Extensions to AND operator
- Positional operators
- adjacency A ADJ B, A(W)B
- Field limitations
- A/TI, B/TI,DE
- Extensions to OR operator
- Truncation
- A?, A?? ?
- Browsing indexes
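Adjacency and truncation can both be sketched on top of a positional index. This is an illustrative approximation only: the toy collection and the function names `truncation` and `adjacent` are assumptions, and real systems define these operators in more detail than shown here.

```python
from collections import defaultdict

# Hypothetical toy collection.
docs = {
    1: "junk food has low nutritional value",
    2: "food critics call it junk",
    3: "fast foods and nutrition",
}

def build_positional_index(docs):
    """Map term -> {doc_id: [word positions]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def truncation(index, prefix):
    """Extension to OR: union of all terms starting with prefix (like A?)."""
    hits = set()
    for term, postings in index.items():
        if term.startswith(prefix):
            hits |= set(postings)
    return hits

def adjacent(index, a, b):
    """Extension to AND: b must immediately follow a (like A(W)B)."""
    hits = set()
    docs_a = index.get(a, {})
    docs_b = index.get(b, {})
    for doc_id in docs_a.keys() & docs_b.keys():
        positions_b = set(docs_b[doc_id])
        if any(pos + 1 in positions_b for pos in docs_a[doc_id]):
            hits.add(doc_id)
    return hits

index = build_positional_index(docs)
```

Document 2 contains both junk and food, so a plain AND would retrieve it, but the adjacency test rejects it because the words are not next to each other in that order.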
21Pros and Cons
- Pro: simple formalism
- Con: hard to control (too many or too few documents retrieved)
- Con: requires formally constructed queries
- Con: performance issues?
22Extended Boolean model
- Add partial matching and term weighting
- E.g.
- take retrieval set and weight by binary terms
- Take retrieval set and calculate weights and
ranks within the set
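One way to read the second option (retrieve with an exact match, then rank within the retrieved set) is sketched below. The term counts are invented, and scoring by summed term frequency is only an assumed, simple weighting; the slides do not prescribe a particular one.

```python
# Hypothetical term-frequency data: doc id -> {term: count}.
doc_terms = {
    1: {"cat": 3, "dog": 1},
    2: {"cat": 1},
    3: {"dog": 2, "house": 1},
    4: {"house": 4},
}

def boolean_or(terms):
    """Exact-match step: ids of documents containing any query term."""
    return {d for d, tf in doc_terms.items() if any(t in tf for t in terms)}

def rank_within_set(retrieved, terms):
    """Weighting step: score each retrieved document by summed term counts."""
    scores = {d: sum(doc_terms[d].get(t, 0) for t in terms) for d in retrieved}
    # Sort by descending score; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))

query = ["cat", "dog"]
ranked = rank_within_set(boolean_or(query), query)
```

Document 4 is excluded by the Boolean step, and the remaining documents come back ordered rather than as a flat set.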
23Vector Model
- Non-binary weights
- Calculate degree of similarity between document and query
- Ranked output by sorting similarity values
- Also called vector space model
24Vector Space Model
- based on idea of n-dimensional document space
- query is also located in document space
- documents are ranked in order of their closeness to the query
- many possible matching functions
25Vector Space Model
- Documents and queries are points in L-dimensional
space (where L is number of unique index terms in
the data collection)
[Figure: query Q and document D located in the document space]
26Document and Query Vectors
- Documents and Queries are vectors of terms
- Actual vectors have many terms (thousands)
- Vectors can use binary keyword weights or assume 0-1 weights (term frequencies)
- Example terms: dog, cat, house, sink, road, car
- Binary: (1,1,0,0,0,0), (0,0,1,1,0,0)
- Weighted: (0.01, 0.01, 0.002, 0.0, 0.0, 0.0)
- Queries can be weighted also
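The two kinds of vector can be sketched over the six example terms. The function names are made up, and dividing each count by the document length is just one assumed way of getting weights between 0 and 1.

```python
# Vocabulary from the example: one vector component per term.
VOCAB = ["dog", "cat", "house", "sink", "road", "car"]

def binary_vector(text):
    """1 if the term occurs in the text, 0 otherwise."""
    words = set(text.lower().split())
    return tuple(1 if term in words else 0 for term in VOCAB)

def frequency_vector(text):
    """Relative term frequency: count divided by the number of words."""
    words = text.lower().split()
    total = len(words)
    return tuple(words.count(term) / total if total else 0.0 for term in VOCAB)
```

A query can be passed through the same functions, which places it in the same space as the documents.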
27Vector Space Model with Term Weights
- assume document terms have different values for retrieval
- therefore assign weights to each term in each document
- example: TF×IDF weighting
- proportional to frequency of term in document
- inversely proportional to frequency of term in collection
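A minimal sketch of TF×IDF weighting follows. It assumes one common variant, raw term count multiplied by log(N / document frequency); many other variants exist and the slides do not fix one. The three documents are invented.

```python
import math
from collections import Counter

# Hypothetical toy collection.
docs = {
    "D1": "cat cat dog",
    "D2": "dog house",
    "D3": "cat road car",
}

def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    tokenized = {d: text.split() for d, text in docs.items()}
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter()
    for words in tokenized.values():
        df.update(set(words))
    weights = {}
    for d, words in tokenized.items():
        tf = Counter(words)
        weights[d] = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    return weights

weights = tf_idf(docs)
```

In D2, house gets a higher weight than dog although both occur once, because house appears in fewer documents of the collection.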
28A 2-D Document Space
- Consider a document collection with only two words: cat, dog
- Three document vectors
- D1
- D2
- D3
[Figure: document vectors D1, D2 and D3 plotted in the two-dimensional cat/dog space]
29A 3-D Document Space
- Di = 3t1 + 4t2 + 2t3
- Query = 0t1 + t2 + t3
30Term document matrix
- A document collection can be represented by a matrix in which rows are documents and columns are terms; entries are 0 (if the term is absent in a document) or a value for the term weight
- Note that this is not efficient for implementation; other data structures are used (inverted file)
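The relation between the matrix and an inverted file can be sketched as follows, with an invented three-document collection and binary entries. The inverted file keeps only the non-zero entries, grouped by term.

```python
# Hypothetical toy collection.
docs = {"D1": "cat dog", "D2": "dog house", "D3": "cat"}

terms = sorted({t for text in docs.values() for t in text.split()})
doc_ids = sorted(docs)

# Dense matrix: one row per document, one column per term.
matrix = [[1 if t in docs[d].split() else 0 for t in terms] for d in doc_ids]

# Inverted file: for each term, the list of documents with a non-zero entry.
inverted = {
    t: [d for d, row in zip(doc_ids, matrix) if row[j]]
    for j, t in enumerate(terms)
}
```

With thousands of terms, most matrix entries are 0, which is why the term-by-term lists are the more practical structure.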
31Vector-based Matching Metrics
- Metric or distance measure: documents close together in the vector space are likely to be highly similar
32Measuring similarity
- Need a means of computing similarity between documents and queries (or pairs of documents)
- Many proposed; none ideal
- Allows
- Ranking
- Thresholding
- Query reformulation
33Some conventions
- N: number of documents in collection
- L: number of unique terms in collection
- Di: the ith document in the collection
- Tj: the jth term in the collection
- Q: a query
- Wij: the weight of term i in document j
- WiQ: the weight of term i in the query
34Inner product
- Many similarity functions are based on the inner product
- $SIM(D_i, Q) = \sum_{j=1}^{L} D_{ij} Q_j$
- For binary vectors, reduces to number of terms in common in D and Q
- For weighted terms it is the sum of the products of the term weights in documents and query
- Favours longer documents
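A short sketch of the inner product, with invented vectors, shows both the binary case and the bias towards longer documents. The long document here is simply the short one repeated three times, so its content is the same but its score is three times higher.

```python
def inner_product(doc, query):
    """Sum of the products of corresponding term weights."""
    return sum(d * q for d, q in zip(doc, query))

# Hypothetical vectors over four terms.
query = (1, 1, 0, 0)
binary_doc = (1, 0, 1, 0)  # shares exactly one term with the query
short_doc = (1, 1, 0, 1)   # term counts of a short document
long_doc = (3, 3, 0, 3)    # the same text repeated three times

common_terms = inner_product(binary_doc, query)
short_score = inner_product(short_doc, query)
long_score = inner_product(long_doc, query)
```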
35Calculating Degree of Similarity
- Many candidate measures, e.g.
- Dice Coefficient
- Jaccard Coefficient
- Cosine Coefficient
- Based on contribution to similarity of co-occurring terms
- Generally normalized by length of document
36Dice Coefficient
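- $SIM(D, Q) = \dfrac{2 \sum_{i=1}^{L} w_{iD} w_{iQ}}{\sum_{i=1}^{L} w_{iD}^2 + \sum_{i=1}^{L} w_{iQ}^2}$
- Reduces to 2A/(B+C) for binary documents (A = number of terms in common, B = number of terms in D, C = number of terms in Q)
- Sum is taken over all i from 1 to L (dimensionality of vector space)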
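The Dice coefficient can be sketched directly from its definition: twice the inner product, divided by the sum of the two squared-weight totals. The example vectors are invented; returning 0.0 for two all-zero vectors is an assumption made to avoid division by zero.

```python
def dice(doc, query):
    """Dice coefficient: 2 * sum(d*q) / (sum(d^2) + sum(q^2))."""
    numerator = 2 * sum(d * q for d, q in zip(doc, query))
    denominator = sum(d * d for d in doc) + sum(q * q for q in query)
    return numerator / denominator if denominator else 0.0

# Hypothetical binary vectors: 2 terms in common, 3 terms in each vector.
doc = (1, 1, 1, 0)
query = (0, 1, 1, 1)
score = dice(doc, query)
```

For these binary vectors the result equals 2A/(B+C) with A = 2, B = 3 and C = 3.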
37Jaccard Coefficient
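- $SIM(D, Q) = \dfrac{\sum_{i=1}^{L} w_{iD} w_{iQ}}{\sum_{i=1}^{L} w_{iD}^2 + \sum_{i=1}^{L} w_{iQ}^2 - \sum_{i=1}^{L} w_{iD} w_{iQ}}$
- Reduces to A/(B+C-A) for binary documents (A = number of terms in common, B = number of terms in D, C = number of terms in Q)
- Sum is taken over all i from 1 to L (dimensionality of vector space)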
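A matching sketch for the Jaccard coefficient: the inner product divided by the two squared-weight totals minus the inner product. The vectors are invented, and the 0.0 result for all-zero input is an assumption to avoid division by zero.

```python
def jaccard(doc, query):
    """Jaccard coefficient: sum(d*q) / (sum(d^2) + sum(q^2) - sum(d*q))."""
    dot = sum(d * q for d, q in zip(doc, query))
    denominator = sum(d * d for d in doc) + sum(q * q for q in query) - dot
    return dot / denominator if denominator else 0.0

# Hypothetical binary vectors: 2 terms in common, 3 terms in each vector.
doc = (1, 1, 1, 0)
query = (0, 1, 1, 1)
score = jaccard(doc, query)
```

For these binary vectors the result equals A/(B+C-A) with A = 2, B = 3 and C = 3, i.e. shared terms over all distinct terms.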
38Cosine Measure
- Very commonly used
- Measures cosine of angle between document-query (or document-document) vectors
[Figure: angle between document vector Di and query vector Q]
39Cosine coefficient
- $SIM(D_j, Q) = \dfrac{\sum_{i=1}^{L} w_{i,D_j} w_{i,Q}}{\sqrt{\sum_{i=1}^{L} w_{i,D_j}^2} \, \sqrt{\sum_{i=1}^{L} w_{i,Q}^2}}$
- Sum is taken over L (dimensionality of vector space)
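The cosine coefficient and the ranked output it produces can be sketched as below. The weighted vectors and document names are invented, and returning 0.0 for a zero-length vector is an assumption made for the sketch.

```python
import math

def cosine(doc, query):
    """Inner product divided by the product of the two vector lengths."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm_d = math.sqrt(sum(d * d for d in doc))
    norm_q = math.sqrt(sum(q * q for q in query))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

# Hypothetical weighted vectors over three terms.
docs = {"D1": (3, 4, 0), "D2": (0, 0, 5), "D3": (1, 1, 1)}
query = (6, 8, 0)

# Ranked output: sort documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda d: cosine(docs[d], query), reverse=True)
```

D1 points in the same direction as the query and scores 1 even though it is half as long, which shows the length normalization; D2 shares no terms with the query and scores 0.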
40What Similarity Function to Use?
- Some studies comparing performance of measures (e.g. McGill, 67 measures)
- Significant differences in performance; measures may perform differently with high or low frequency terms
- May be monotonic among similar measures
- No consensus over what types of measure are best for what application
- Cosine measure is widely used
41Pros and Cons
- Pro: term weights improve retrieval performance
- Pro: partial matching means query need not be exact
- Pro: ranking provides guidance to searcher
- Pro: relevance feedback feasible
- Con: assumes orthogonality (obviously false)
- Con: high degree of empiricism
- Con: less control by searcher?