Title: Chapter 2 Modeling

Chapter 2 Modeling
- Modern Information Retrieval
- by R. Baeza-Yates and B. Ribeiro-Neto
Introduction
- Traditional information retrieval systems usually adopt index terms to index and retrieve documents.
- An index term is a keyword (or group of related words) which has some meaning of its own (usually a noun).
- Advantages
  - Simple
  - The semantics of the documents and of the user information need can be naturally expressed through sets of index terms.
IR Models
- Ranking algorithms are at the core of information retrieval systems: they predict which documents are relevant and which are not.
A taxonomy of information retrieval models
- User task: Retrieval (Ad hoc, Filtering) and Browsing
- Classic Models: Boolean, Vector, Probabilistic
  - Set Theoretic: Fuzzy, Extended Boolean
  - Algebraic: Generalized Vector, Latent Semantic Indexing, Neural Networks
  - Probabilistic: Inference Network, Belief Network
- Structured Models: Non-overlapping Lists, Proximal Nodes
- Browsing: Flat, Structured Guided, Hypertext
Figure 2.2: Retrieval models most frequently associated with distinct combinations of a document logical view and a user task.
Retrieval: Ad hoc and Filtering
- Ad hoc (search): the documents in the collection remain relatively static while new queries are submitted to the system.
- Routing (filtering): the queries remain relatively static while new documents come into the system.
A formal characterization of IR models
- D: a set composed of logical views (or representations) for the documents in the collection.
- Q: a set composed of logical views (or representations) for the user information needs (queries).
- F: a framework for modeling document representations, queries, and their relationships.
- R(qi, dj): a ranking function which defines an ordering among the documents with regard to the query qi.
Definitions
- ki: a generic index term
- K: the set of all index terms {k1, …, kt}
- wi,j: a weight associated with index term ki of a document dj
- gi: a function that returns the weight associated with ki in any t-dimensional vector (gi(d⃗j) = wi,j)
Classic IR Models
- Basic concept: each document is described by a set of representative keywords called index terms.
- Numerical weights are assigned to index terms to capture their distinct relevance for describing the document contents.
Boolean model
- Binary decision criterion (a document is either relevant or non-relevant, with no grading)
- Closer to a data retrieval model than to an information retrieval model
- Advantage
  - clean formalism, simplicity
- Disadvantages
  - It is not simple to translate an information need into a Boolean expression.
  - Exact matching may lead to retrieval of too few or too many documents.
Example
- A query can be represented as a disjunction of conjunctive vectors (in disjunctive normal form, DNF).
- q = ka ∧ (kb ∨ ¬kc) gives q_dnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
- Formal definition
  - For the Boolean model, the index term weights are all binary, i.e., wi,j ∈ {0,1}.
  - A query is a conventional Boolean expression, which can be transformed into a disjunctive normal form q_dnf.
  - sim(dj, q) = 1 if ∃ q_cc such that (q_cc ∈ q_dnf) ∧ (∀ ki, gi(d⃗j) = gi(q⃗_cc)); otherwise sim(dj, q) = 0.
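A minimal sketch of Boolean retrieval over this DNF form; the vocabulary, the toy documents, and the helper names are illustrative, not from the chapter:

```python
# Vocabulary order: (ka, kb, kc)
TERMS = ["ka", "kb", "kc"]

# q = ka AND (kb OR NOT kc) in DNF: each tuple is one conjunctive
# component over (ka, kb, kc), matching (1,1,1) OR (1,1,0) OR (1,0,0).
Q_DNF = [(1, 1, 1), (1, 1, 0), (1, 0, 0)]

def binary_vector(doc_terms):
    """Binary index-term vector g(dj) for a document."""
    return tuple(1 if t in doc_terms else 0 for t in TERMS)

def sim(doc_terms):
    """sim(dj, q) = 1 iff the document's binary vector equals
    one of the conjunctive components of q_dnf."""
    return 1 if binary_vector(doc_terms) in Q_DNF else 0

# Hypothetical toy collection
docs = {"d1": {"ka", "kb"}, "d2": {"kc"}, "d3": {"ka", "kb", "kc"}}
for name, terms in docs.items():
    print(name, sim(terms))   # d1 -> 1, d2 -> 0, d3 -> 1
```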
Vector model
- Assigns non-binary weights to index terms in queries and in documents → tf-idf weights
- Computes the degree of similarity between each document and the query → sim(dj, q)
- More precise than the Boolean model
The IR problem as a clustering problem
- We think of the documents as a collection C of objects and think of the user query as a specification of a set A of objects.
- Intra-cluster
  - What are the features which better describe the objects in the set A?
- Inter-cluster
  - What are the features which better distinguish the objects in the set A from the remaining objects in the collection C?
Idea behind TF×IDF
- TF: intra-cluster similarity is quantified by measuring the raw frequency of a term ki inside a document dj. Such term frequency is usually referred to as the tf factor and provides one measure of how well that term describes the document contents.
- IDF: inter-cluster dissimilarity is quantified by measuring the inverse of the frequency of a term ki among the documents in the collection. This factor is referred to as the inverse document frequency.
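A small sketch of the tf and idf factors under the usual definitions (raw frequency normalized by the maximum frequency in the document, and idf = log(N/ni)); the toy collection is made up for illustration:

```python
import math
from collections import Counter

# Hypothetical toy collection: each document is a list of terms.
docs = [
    ["gold", "silver", "truck"],
    ["shipment", "of", "gold", "damaged", "gold"],
    ["delivery", "of", "silver", "arrived"],
]
N = len(docs)

def tf(term, doc):
    """Normalized term frequency: freq(i,j) / max_l freq(l,j)."""
    counts = Counter(doc)
    return counts[term] / max(counts.values())

def idf(term):
    """Inverse document frequency: log(N / n_i)."""
    n_i = sum(1 for d in docs if term in d)
    return math.log(N / n_i)

def weight(term, doc):
    """tf-idf weight w(i,j) = tf(i,j) * idf(i)."""
    return tf(term, doc) * idf(term)

print(weight("gold", docs[1]))  # "gold" occurs twice in doc 1
```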
Vector Model (1/4)
- Index terms are assigned positive, non-binary weights.
- The index terms in the query are also weighted.
- Term weights are used to compute the degree of similarity between each document and the user query; retrieved documents are then sorted in decreasing order of similarity.
Vector Model (2/4)
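The body of this slide was an equation image in the original deck; the vector model's cosine similarity, which it presumably showed, is:

```latex
\mathrm{sim}(d_j, q) \;=\; \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}|\,|\vec{q}|}
\;=\; \frac{\sum_{i=1}^{t} w_{i,j}\, w_{i,q}}
           {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}}\;\sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
```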
Vector Model (3/4)
- Definitions
  - normalized term frequency
  - inverse document frequency
  - term-weighting scheme
  - query-term weights
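The formulas behind these four definitions were images in the original deck; the standard scheme from the book, which they presumably matched, is (freq_{i,j} is the raw frequency of ki in dj, N the collection size, ni the number of documents containing ki):

```latex
f_{i,j} = \frac{\mathrm{freq}_{i,j}}{\max_l \mathrm{freq}_{l,j}}, \qquad
\mathrm{idf}_i = \log\frac{N}{n_i}, \qquad
w_{i,j} = f_{i,j} \times \log\frac{N}{n_i}, \qquad
w_{i,q} = \Bigl(0.5 + \frac{0.5\,\mathrm{freq}_{i,q}}{\max_l \mathrm{freq}_{l,q}}\Bigr) \times \log\frac{N}{n_i}
```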
Vector Model (4/4)
- Advantages
  - Its term-weighting scheme improves retrieval performance.
  - Its partial matching strategy allows retrieval of documents that approximate the query conditions.
  - Its cosine ranking formula sorts the documents according to their degree of similarity to the query.
- Disadvantage
  - The assumption of mutual independence between index terms
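A compact sketch of cosine ranking over tf-idf vectors (the vectors and names are illustrative):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical tf-idf vectors over a 3-term vocabulary.
doc_vectors = {
    "d1": [0.4, 0.0, 0.7],
    "d2": [0.1, 0.9, 0.0],
}
query = [0.5, 0.2, 0.5]

# Rank documents in decreasing order of similarity to the query.
ranking = sorted(doc_vectors.items(), key=lambda kv: cosine(kv[1], query), reverse=True)
print(ranking)
```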
Orthogonality
- If v1 = (1,0), v2 = (1,1), v3 = (0,1): cos(v1,v2) = 1/√2, cos(v2,v3) = 1/√2, cos(v1,v3) = 0
- If v1 = (1,0), v2 = (0,1), v3 = (-1,1): cos(v1,v2) = 0, cos(v2,v3) = 1/√2, cos(v1,v3) = -1/√2
(The original slide plotted v1, v2, v3 in the plane for both cases.)
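A quick numeric check of these cosines (pure illustration):

```python
import math

def cos(u, v):
    dot = u[0] * v[0] + u[1] * v[1]
    return dot / (math.hypot(*u) * math.hypot(*v))

# First assignment: v2 is not orthogonal to v1 or v3.
print(cos((1, 0), (1, 1)), cos((1, 1), (0, 1)), cos((1, 0), (0, 1)))
# Second assignment: v1 and v2 are orthogonal, v1 and v3 are at 135 degrees.
print(cos((1, 0), (0, 1)), cos((0, 1), (-1, 1)), cos((1, 0), (-1, 1)))
```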
Probabilistic Model (1/6)
- Introduced by Robertson and Sparck Jones, 1976
- Also called the binary independence retrieval (BIR) model
- Idea: given a user query q and the ideal answer set of the relevant documents, the problem is to specify the properties of this set.
- That is, the probabilistic model tries to estimate the probability that the user will find document dj relevant, ranking by the ratio P(dj relevant to q) / P(dj non-relevant to q).
Probabilistic Model (2/6)
- Definitions
  - All index term weights are binary, i.e., wi,j ∈ {0,1}.
  - Let R be the set of documents known to be relevant to query q.
  - Let R̄ be the complement of R.
  - Let P(R | d⃗j) be the probability that the document dj is relevant to the query q.
  - Let P(R̄ | d⃗j) be the probability that the document dj is non-relevant to the query q.
Probabilistic Model (3/6)
- The similarity sim(dj, q) of the document dj to the query q is defined as the ratio
  sim(dj, q) = P(R | d⃗j) / P(R̄ | d⃗j)
- Using Bayes' rule,
  sim(dj, q) = [P(d⃗j | R) × P(R)] / [P(d⃗j | R̄) × P(R̄)]
- P(R) stands for the probability that a document randomly selected from the entire collection is relevant.
- P(d⃗j | R) stands for the probability of randomly selecting the document dj from the set R of relevant documents.
Probabilistic Model (4/6)
- Assuming independence of index terms, and writing d⃗j = (g1(d⃗j), g2(d⃗j), …, gt(d⃗j)),
  sim(dj, q) ~ [ Π_{gi(d⃗j)=1} P(ki | R) × Π_{gi(d⃗j)=0} P(k̄i | R) ] / [ Π_{gi(d⃗j)=1} P(ki | R̄) × Π_{gi(d⃗j)=0} P(k̄i | R̄) ]
Probabilistic Model (5/6)
- P(ki | R) stands for the probability that the index term ki is present in a document randomly selected from the set R.
- P(k̄i | R) stands for the probability that the index term ki is not present in a document randomly selected from the set R.
Probabilistic Model (6/6)
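The body of this slide (the final ranking formula in the original deck) did not survive extraction; taking logarithms of the ratio above and dropping factors that are constant for all documents yields the book's standard BIR ranking expression, which it presumably showed:

```latex
\mathrm{sim}(d_j, q) \;\sim\; \sum_{i=1}^{t} w_{i,q}\, w_{i,j}
  \left( \log\frac{P(k_i \mid R)}{1 - P(k_i \mid R)}
       + \log\frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)
```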
Estimation of Term Relevance
- In the very beginning: P(ki | R) = 0.5 and P(ki | R̄) = ni / N, where ni is the number of documents containing ki and N is the total number of documents.
- Let V be the subset of documents initially retrieved and ranked, and Vi the subset of V whose documents contain ki. The ranking can then be improved with P(ki | R) = Vi / V and P(ki | R̄) = (ni − Vi) / (N − V).
- For small values of V, an adjustment factor avoids degenerate estimates: P(ki | R) = (Vi + 0.5) / (V + 1) and P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1).
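A sketch of these estimates in code; N, ni, V, Vi are the counts defined above, the function name and numbers are illustrative:

```python
import math

def bir_term_weight(n_i, N, V=None, V_i=None):
    """Log-odds weight of term ki under the estimates above.

    With no feedback (V is None): P(ki|R) = 0.5 and P(ki|R-bar) = ni/N.
    With feedback counts V and Vi, the 0.5 / 1 adjustment keeps the
    estimates away from 0 and 1 when V is small.
    """
    if V is None:
        p_r = 0.5
        p_nr = n_i / N
    else:
        p_r = (V_i + 0.5) / (V + 1)
        p_nr = (n_i - V_i + 0.5) / (N - V + 1)
    return math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)

# Initial guess for a term occurring in 20 of 1000 documents:
print(bir_term_weight(20, 1000))
# After inspecting V=10 top documents, 5 of which contain the term:
print(bir_term_weight(20, 1000, V=10, V_i=5))
```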
Alternative Set Theoretic Models
- Fuzzy Set Model
- Extended Boolean Model
Fuzzy Theory
- A fuzzy subset A of a universe U is characterized by a membership function µA : U → [0,1] which associates with each element u ∈ U a number µA(u) in [0,1].
- Let A and B be two fuzzy subsets of U. The usual set operations are defined through their membership functions:
  - complement: µĀ(u) = 1 − µA(u)
  - union: µA∪B(u) = max(µA(u), µB(u))
  - intersection: µA∩B(u) = min(µA(u), µB(u))
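A minimal sketch of these three operations, representing a fuzzy subset as a dict from elements to membership degrees (the universe and values are made up):

```python
def complement(a):
    """Membership of the complement: 1 - uA(u)."""
    return {u: 1.0 - m for u, m in a.items()}

def union(a, b):
    """Membership of the union: max(uA(u), uB(u))."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def intersection(a, b):
    """Membership of the intersection: min(uA(u), uB(u))."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

A = {"x": 0.8, "y": 0.3}
B = {"x": 0.5, "y": 0.9}
print(union(A, B))         # {'x': 0.8, 'y': 0.9}
print(intersection(A, B))  # {'x': 0.5, 'y': 0.3}
print(complement(A))       # {'x': 0.2, 'y': 0.7}
```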
Fuzzy Information Retrieval
- Uses a term-term correlation matrix with c_{i,l} = n_{i,l} / (ni + nl − n_{i,l}), where ni (nl) is the number of documents containing ki (kl) and n_{i,l} is the number containing both.
- Define a fuzzy set associated with each index term ki, with membership function µ_{i,j} = 1 − Π_{kl ∈ dj} (1 − c_{i,l}).
- If a term kl of dj is strongly related to ki, that is c_{i,l} ≈ 1, then µi(dj) ≈ 1.
- If every term kl of dj is only loosely related to ki, that is c_{i,l} ≈ 0, then µi(dj) ≈ 0.
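A sketch of the correlation and membership computations under those definitions (toy collection and helper names for illustration only):

```python
# Hypothetical toy collection: documents as term sets.
docs = [{"gold", "silver"}, {"gold", "truck"}, {"silver", "truck", "gold"}]

def n(*ts):
    """Number of documents containing all the given terms."""
    return sum(1 for d in docs if set(ts) <= d)

def c(ki, kl):
    """Term-term correlation c_{i,l} = n_{i,l} / (n_i + n_l - n_{i,l})."""
    n_il = n(ki, kl)
    return n_il / (n(ki) + n(kl) - n_il)

def membership(ki, doc):
    """mu_{i,j} = 1 - product over k_l in d_j of (1 - c_{i,l})."""
    prod = 1.0
    for kl in doc:
        prod *= 1.0 - c(ki, kl)
    return 1.0 - prod

# "silver" is absent from docs[1] but correlated with its terms: ~0.78
print(membership("silver", docs[1]))
```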
Example
Algebraic Sum and Product
- The degree of membership in a disjunctive fuzzy set is computed using an algebraic sum, instead of the max function.
- The degree of membership in a conjunctive fuzzy set is computed using an algebraic product, instead of the min function.
- Smoother than the max and min functions.
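A sketch contrasting the two choices; the algebraic sum a + b − ab and the algebraic product ab are the standard forms (values illustrative):

```python
def alg_sum(a, b):
    """Algebraic sum: a + b - a*b (smooth alternative to max)."""
    return a + b - a * b

def alg_product(a, b):
    """Algebraic product: a*b (smooth alternative to min)."""
    return a * b

a, b = 0.6, 0.5
print(max(a, b), alg_sum(a, b))       # 0.6 vs 0.8
print(min(a, b), alg_product(a, b))   # 0.5 vs 0.3
```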
Alternative Algebraic Models
- Generalized Vector Space Model
- Latent Semantic Indexing Model
Latent Semantic Indexing (1/5)
- Let A be a term-document association matrix with m rows (terms) and n columns (documents).
- Latent semantic indexing decomposes A by singular value decomposition: A = U Σ Vᵀ.
- U (m×m) is the matrix of eigenvectors derived from the term-to-term correlation matrix (AAᵀ).
- V (n×n) is the matrix of eigenvectors derived from the document-to-document matrix (AᵀA).
- Σ is an m×n diagonal matrix of singular values, where r ≤ min(m, n) is the rank of A.
Latent Semantic Indexing (2/5)
- Consider now only the s largest singular values of Σ, and their corresponding columns in U and V (the remaining singular values are deleted).
- The resultant matrix As = Us Σs Vsᵀ (of rank s) is closest to the original matrix A in the least-squares sense.
- s < r is the dimensionality of a reduced concept space.
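A minimal numpy sketch of this truncation; the matrix is a stand-in and s is the reduced dimensionality:

```python
import numpy as np

# Hypothetical term-document matrix A (m=4 terms, n=3 documents).
A = np.array([[1.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Full SVD: A = U @ diag(sigma) @ Vt, singular values in decreasing order.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the s largest singular values and matching columns/rows.
s = 2
A_s = U[:, :s] @ np.diag(sigma[:s]) @ Vt[:s, :]

# A_s is the best rank-s approximation of A in the least-squares sense.
print(np.linalg.norm(A - A_s))
```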
Latent Semantic Indexing (3/5)
- The selection of s attempts to balance two opposing effects:
  - s should be large enough to allow fitting all the structure in the real data.
  - s should be small enough to allow filtering out the non-relevant representational details.
- Us = {u1, u2, …, us} are the s principal components of the column space (document space) in Rᵐ.
- Vs = {v1, v2, …, vs} are the s principal components of the row space (term space) in Rⁿ.
Latent Semantic Indexing (4/5)
- Consider the relationship between any two documents: Asᵀ As = (Vs Σs)(Vs Σs)ᵀ, so the i-th row of Vs Σs is the projected vector for document di (Rᵐ → Rˢ).
- Dually, As Asᵀ = (Us Σs)(Us Σs)ᵀ, so the i-th row of Us Σs is the projected vector for term ti (Rⁿ → Rˢ).
Latent Semantic Indexing (5/5)
- To rank documents with regard to a given user query, we model the query as a pseudo-document in the original matrix A.
- Assume the query is modeled as the document with number k.
- Then the k-th row of the matrix Asᵀ As provides the ranks of all documents with respect to this query.
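Continuing the numpy sketch above, with the query appended as a pseudo-document (column index k is illustrative):

```python
import numpy as np

# Term-document matrix with the query appended as pseudo-document k=3.
A = np.array([[1.0, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.0],
              [0.0, 1.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 0.0]])
k = 3

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
s = 2
A_s = U[:, :s] @ np.diag(sigma[:s]) @ Vt[:s, :]

# Row k of As^T As holds the similarity of every document to the query.
sims = (A_s.T @ A_s)[k]
ranking = np.argsort(-sims[:k])   # rank the real documents only
print(ranking)
```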
Speedup
- In the original space, the matrix-vector multiplication needed to rank all documents against a query requires a total of N × t scalar multiplications.
- In the reduced concept space, it requires only (N + t) × s scalar multiplications.
Alternative Probabilistic Models
- Bayesian Networks
- Inference Network Model
- Belief Network Model
Bayesian Network
- Let xi be a node in a Bayesian network G and Pa(xi) be the set of parent nodes of xi.
- The influence of Pa(xi) on xi can be specified by any set of functions Fi(xi, Pa(xi)) that satisfy Σ_{xi} Fi(xi, Pa(xi)) = 1 and 0 ≤ Fi(xi, Pa(xi)) ≤ 1.
- Example (for the five-node network in the original slide):
  P(x1,x2,x3,x4,x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)
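A small sketch evaluating that factorization with hypothetical conditional probability tables (all numbers invented):

```python
# Hypothetical CPTs for binary variables x1..x5; the network structure
# matches the factorization above: x1 -> x2, x1 -> x3, {x2,x3} -> x4, x3 -> x5.
p_x1 = {1: 0.6, 0: 0.4}
p_x2_given_x1 = {1: {1: 0.7, 0: 0.3}, 0: {1: 0.2, 0: 0.8}}   # p[x1][x2]
p_x3_given_x1 = {1: {1: 0.5, 0: 0.5}, 0: {1: 0.1, 0: 0.9}}
p_x4_given_x2_x3 = {(1, 1): {1: 0.9, 0: 0.1}, (1, 0): {1: 0.6, 0: 0.4},
                    (0, 1): {1: 0.4, 0: 0.6}, (0, 0): {1: 0.05, 0: 0.95}}
p_x5_given_x3 = {1: {1: 0.8, 0: 0.2}, 0: {1: 0.3, 0: 0.7}}

def joint(x1, x2, x3, x4, x5):
    """P(x1..x5) = P(x1) P(x2|x1) P(x3|x1) P(x4|x2,x3) P(x5|x3)."""
    return (p_x1[x1]
            * p_x2_given_x1[x1][x2]
            * p_x3_given_x1[x1][x3]
            * p_x4_given_x2_x3[(x2, x3)][x4]
            * p_x5_given_x3[x3][x5])

print(joint(1, 1, 0, 1, 0))  # probability of one full assignment
```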
Belief Network Model (1/6)
- The probability space
  - The set K = {k1, k2, …, kt} of all index terms is the universe. To each subset u of K is associated a vector k⃗ such that gi(k⃗) = 1 iff ki ∈ u.
- Random variables
  - To each index term ki is associated a binary random variable.
Belief Network Model (2/6)
- Concept space
  - A document dj is represented as a concept composed of the terms used to index dj.
  - A user query q is also represented as a concept composed of the terms used to index q.
  - Both user query and document are modeled as subsets of index terms.
- A probability distribution P is defined over this concept space K.
Belief Network Model (3/6)
- A query q is modeled as a network node associated with a binary random variable.
  - This variable is set to 1 whenever q completely covers the concept space K.
  - P(q) computes the degree of coverage of the space K by q.
- A document dj is modeled as a network node associated with a binary random variable.
  - This variable is 1 to indicate that dj completely covers the concept space K.
  - P(dj) computes the degree of coverage of the space K by dj.
Belief Network Model (4/6)
Belief Network Model (5/6)
- Assumption
  - P(dj | q) is adopted as the rank of the document dj with respect to the query q.
Belief Network Model (6/6)
- Specify the conditional probabilities in terms of (normalized) tf-idf weights.
- Thus, the belief network model can be tuned to subsume the vector model.
Comparison
- Belief network model
  - is based on a set-theoretic view
  - provides a separation between the document and the query
  - is able to reproduce any ranking strategy generated by the inference network model
- Inference network model