Title: Information Retrieval and Web Search
1 Information Retrieval and Web Search
- Alternative IR models
- Instructor: Rada Mihalcea
- Some of the slides were adapted from a course taught at Cornell University by William Y. Arms
2 Probabilistic Model
- An initial set of documents is retrieved somehow
- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
- The IR system uses this information to refine the description of the ideal answer set
- By repeating this process, it is expected that the description of the ideal answer set will improve
- Keep in mind the need to guess, at the very beginning, the description of the ideal answer set
- The description of the ideal answer set is modeled in probabilistic terms
3 Probabilistic Ranking Principle
- Given a user query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find the document d_j interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But,
  - how do we compute these probabilities?
  - what is the sample space?
4 The Ranking
- Probabilistic ranking is computed as
  sim(q, d_j) = P(d_j relevant-to q) / P(d_j non-relevant-to q)
- How to read this? Maximize the number of relevant documents, minimize the number of non-relevant documents
- This is the odds of the document d_j being relevant
- Ranking by the odds minimizes the probability of an erroneous judgment
- Definitions
  - w_{ij} \in \{0, 1\} (binary index term weights)
  - P(R | \vec{d}_j): probability that the given document is relevant
  - P(\bar{R} | \vec{d}_j): probability that the document is not relevant
- Bayes' rule: P(A|B) P(B) = P(B|A) P(A)
5 The Ranking
- sim(d_j, q) = \frac{P(R | \vec{d}_j)}{P(\bar{R} | \vec{d}_j)}
             = \frac{P(\vec{d}_j | R) \, P(R)}{P(\vec{d}_j | \bar{R}) \, P(\bar{R})}
             \sim \frac{P(\vec{d}_j | R)}{P(\vec{d}_j | \bar{R})}
- P(\vec{d}_j | R): probability of randomly selecting the document d_j from the set R of relevant documents
- P(R) and P(\bar{R}) are constant for all documents, so they can be dropped from the ranking
6 The Ranking
- sim(d_j, q) \sim \frac{P(\vec{d}_j | R)}{P(\vec{d}_j | \bar{R})}
             \sim \frac{\prod_{k_i \in d_j} P(k_i | R) \cdot \prod_{k_i \notin d_j} P(\bar{k}_i | R)}{\prod_{k_i \in d_j} P(k_i | \bar{R}) \cdot \prod_{k_i \notin d_j} P(\bar{k}_i | \bar{R})}
- P(k_i | R): probability that the index term k_i is present in a document randomly selected from the set R of relevant documents
- Based on the assumption that index terms occur independently of each other
  - Strong assumption!
  - In real life, it does not always hold
7 The Ranking
- Taking logs and dropping factors that are the same for all documents:
  sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
  where P(\bar{k}_i | R) = 1 - P(k_i | R) and P(\bar{k}_i | \bar{R}) = 1 - P(k_i | \bar{R})
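To make the weight concrete, here is a minimal Python sketch (not from the slides) of this log-odds term weight; p_rel and p_nonrel are illustrative names for P(k_i | R) and P(k_i | \bar{R}):

```python
import math

def term_weight(p_rel, p_nonrel):
    """Log-odds weight of one index term, per the formula above:
    log(P(ki|R)/(1 - P(ki|R))) + log((1 - P(ki|~R))/P(ki|~R))."""
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_nonrel) / p_nonrel)

# A rare term (10 of 1000 docs) with P(ki|R) = 0.5 gets an IDF-like weight:
print(term_weight(0.5, 10 / 1000))  # ~4.6
```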
8 The Initial Ranking
- sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
- How do we get the probabilities P(k_i | R) and P(k_i | \bar{R})?
- Estimates based on assumptions:
  - P(k_i | R) = 0.5
  - P(k_i | \bar{R}) = n_i / N, where n_i is the number of docs that contain k_i and N is the total number of docs
- Use this initial guess to retrieve an initial ranking (sketched below)
- Improve upon this initial ranking
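As an illustration of the initial pass, a small sketch under assumed data structures (a dict mapping doc id to its set of index terms); with P(k_i | R) = 0.5 the first log term vanishes, so the score reduces to an IDF-like sum:

```python
import math

def initial_ranking(query_terms, corpus, N):
    """Rank docs with the initial estimates P(ki|R) = 0.5, P(ki|~R) = ni/N.
    `corpus` maps a doc id to its set of index terms (illustrative names)."""
    n = {k: sum(k in terms for terms in corpus.values()) for k in query_terms}
    def score(terms):
        # log((1 - ni/N)/(ni/N)) = log((N - ni)/ni); the P(ki|R) term is zero
        return sum(math.log((N - n[k]) / n[k])
                   for k in query_terms if k in terms and 0 < n[k] < N)
    return sorted(corpus, key=lambda d: score(corpus[d]), reverse=True)
```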
9 Improving the Initial Ranking
- sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
- Let
  - V = set of docs initially retrieved
  - V_i = subset of docs retrieved that contain k_i
- Reevaluate the estimates:
  - P(k_i | R) = |V_i| / |V|
  - P(k_i | \bar{R}) = (n_i - |V_i|) / (N - |V|)
- Repeat recursively
10 Improving the Initial Ranking
- sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
- To avoid problems when |V| = 1 or |V_i| = 0, smooth the estimates:
  - P(k_i | R) = (|V_i| + 0.5) / (|V| + 1)
  - P(k_i | \bar{R}) = (n_i - |V_i| + 0.5) / (N - |V| + 1)
- Alternatively,
  - P(k_i | R) = (|V_i| + n_i/N) / (|V| + 1)
  - P(k_i | \bar{R}) = (n_i - |V_i| + n_i/N) / (N - |V| + 1)
- The whole feedback loop is sketched in code below
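Putting slides 8-10 together, an assumption-laden sketch of the loop (the top-ranked docs stand in for the set V, the 0.5-smoothed estimates above are used, and every query term is assumed to occur in at least one document):

```python
import math

def refine_ranking(query_terms, corpus, N, n, top=20, iters=3):
    """corpus: doc id -> set of terms; n: term -> ni (document frequency)."""
    p_rel = {k: 0.5 for k in query_terms}          # initial P(ki|R)
    p_non = {k: n[k] / N for k in query_terms}     # initial P(ki|~R)
    ranking = []
    for _ in range(iters):
        scores = {d: sum(math.log(p_rel[k] / (1 - p_rel[k]))
                         + math.log((1 - p_non[k]) / p_non[k])
                         for k in query_terms if k in terms)
                  for d, terms in corpus.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        V = [corpus[d] for d in ranking[:top]]     # docs taken as relevant
        for k in query_terms:                      # smoothed re-estimates
            Vi = sum(k in terms for terms in V)
            p_rel[k] = (Vi + 0.5) / (len(V) + 1)
            p_non[k] = (n[k] - Vi + 0.5) / (N - len(V) + 1)
    return ranking
```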
11 Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms with indexes that use concepts.
Approach: Map the term vector space into a lower-dimensional space using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.
12 Deficiencies of Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Latent semantic indexing addresses the first of these deficiencies (synonymy) and the third (dependence).
13 Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
14 Technical Memo Example: Terms and Documents

Terms       c1 c2 c3 c4 c5 m1 m2 m3 m4
human        1  0  0  1  0  0  0  0  0
interface    1  0  1  0  0  0  0  0  0
computer     1  1  0  0  0  0  0  0  0
user         0  1  1  0  1  0  0  0  0
system       0  1  1  2  0  0  0  0  0
response     0  1  0  0  1  0  0  0  0
time         0  1  0  0  1  0  0  0  0
EPS          0  0  1  1  0  0  0  0  0
survey       0  1  0  0  0  0  0  0  1
trees        0  0  0  0  0  1  1  1  0
graph        0  0  0  0  0  0  1  1  1
minors       0  0  0  0  0  0  0  1  1
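For readers following along in code, the same matrix as a NumPy array (the counts are from the slide; the variable names are ours, and later snippets reuse X, terms, and docs):

```python
import numpy as np

# Term-document matrix X from the slide (12 terms x 9 documents)
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)
```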
15 Technical Memo Example: Query
Query: Find documents relevant to "human computer interaction"
Simple term matching:
  Matches c1, c2, and c4
  Misses c3 and c5
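A quick check of the term-matching claim, continuing with the X defined above:

```python
query = {"human", "computer"}            # "interaction" is not in the term list
rows = [terms.index(t) for t in query]
matches = [d for j, d in enumerate(docs) if X[rows, j].sum() > 0]
print(matches)  # ['c1', 'c2', 'c4'] -- c3 and c5 share no query terms
```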
16 The Term Vector Space
[Figure: documents d1 and d2 drawn as vectors over term axes t1, t2, t3.]
The space has as many dimensions as there are terms in the word list.
17 Latent Concept Vector Space
[Figure: terms, documents, and the query plotted together in a two-dimensional latent concept space; related items have cosine > 0.9.]
18 Mathematical Concepts
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
Singular Value Decomposition: For any matrix X with t rows and d columns, there exist matrices T0, S0, and D0' such that
  X = T0 S0 D0'
T0 and D0 are the matrices of left and right singular vectors; S0 is the diagonal matrix of singular values.
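With NumPy, the decomposition of the example matrix looks like this (continuing the running example; np.linalg.svd returns the singular values as a vector, so we rebuild the diagonal S0):

```python
import numpy as np

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # D0t is D0'
S0 = np.diag(s0)
assert np.allclose(X, T0 @ S0 @ D0t)                  # X = T0 S0 D0'
```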
19 Dimensions of the Matrices
  X (t x d) = T0 (t x m) S0 (m x m) D0' (m x d)
where m is the rank of X, m <= min(t, d).
20 Reduced Rank
S0 can be chosen so that its diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives the rank-k approximation
  X ≈ X̂ = T S D'
Interpretation: If the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise due to synonymy and recognizes dependence between terms.
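Truncating the running example (k = 2 keeps the toy example easy to visualize; the next slide notes that real collections typically use a few hundred concepts):

```python
k = 2
T = T0[:, :k]               # t x k
S = np.diag(s0[:k])         # k x k
Dt = D0t[:k, :]             # k x d, i.e. D'
X_hat = T @ S @ Dt          # rank-k approximation of X
```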
21 Dimensionality Reduction
  X (t x d) ≈ X̂ = T (t x k) S (k x k) D' (k x d)
k is the number of latent concepts (typically 300-500).
22 Recombination after Dimensionality Reduction
23 Mathematical Revision
A is a p x q matrix; B is an r x q matrix. Let a_i be the vector represented by row i of A, and b_j the vector represented by row j of B. Then the inner product a_i · b_j is element (i, j) of AB'.
[Figure: row i of the p x q matrix A multiplied against row j of B, shown as column j of the q x r matrix B'.]
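A one-off numeric check of this identity (self-contained, with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))      # p x q
B = rng.random((5, 3))      # r x q
i, j = 2, 1
assert np.isclose((A @ B.T)[i, j], A[i] @ B[j])   # element (i,j) of AB' = ai . bj
```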
24 Comparing a Query and a Document
A query can be expressed as a vector x_q in the term-document vector space:
  x_{qi} = 1 if term i is in the query, and 0 otherwise.
(Ignore query terms that are not in the term vector space.)
Let p_{qj} be the inner product of the query x_q with document d_j in the term-document vector space; p_{qj} is the jth element of the product x_q'X.
25 Comparing a Query and a Document
[Figure: the row of inner products (p_q1 ... p_qj ... p_qd) equals the query row (x_q1 x_q2 ... x_qt) times X; document d_j is column j of X.]
  p_q' = x_q'X ≈ x_q'T S D' = x_q'T (DS)'
similarity(q, d_j) is the cosine of the angle between the two vectors: the inner product p_qj = x_q · d_j divided by the lengths of the vectors.
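Continuing the running example, scoring the query "human computer interaction" against all nine documents in the reduced space (a sketch; the cosine uses the columns of X_hat as document vectors):

```python
x_q = np.zeros(len(terms))
x_q[terms.index("human")] = 1
x_q[terms.index("computer")] = 1

p_q = x_q @ T @ S @ Dt      # p_q' = x_q' T S D', one inner product per doc
cos = p_q / (np.linalg.norm(x_q) * np.linalg.norm(X_hat, axis=0))
for d, c in sorted(zip(docs, cos), key=lambda pair: -pair[1]):
    print(d, round(c, 2))   # c3 and c5 now get nonzero scores despite
                            # sharing no terms with the query
```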
26 Comparing a Query and a Document
Alternatively, treat the query q as a pseudo-document in the concept space:
  d_q = x_q'T S^{-1}
To compare a query against document j, extend the method used to compare document i with document j: take the jth element of the product of d_q S and (DS)'. Since d_q S = x_q'T, this is the jth element of the product x_q'T (DS)', which is the same expression as before.
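And the pseudo-document route, verified against the previous expression (continuing the sketch; S is invertible because its k singular values are positive):

```python
d_q = x_q @ T @ np.linalg.inv(S)        # d_q = x_q' T S^-1, query in concept space
D = Dt.T                                # d x k matrix of document concept vectors
p_q_alt = (d_q @ S) @ (D @ S).T         # jth element compares q with document j
assert np.allclose(p_q_alt, x_q @ T @ S @ Dt)   # same scores as before
```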
27 Experimental Results: 100 Factors
28 Experimental Results: Number of Factors