Title: Information Retrieval and Web Search
1 Information Retrieval and Web Search
- Alternative IR models
- Instructor: Rada Mihalcea
- Some of the slides were adapted from a course taught at Cornell University by William Y. Arms
2 Probabilistic Model
- An initial set of documents is retrieved somehow
- The user inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected)
- The IR system uses this information to refine the description of the ideal answer set
- By repeating this process, it is expected that the description of the ideal answer set will improve
- Keep in mind the need to guess, at the very beginning, the description of the ideal answer set
- The description of the ideal answer set is modeled in probabilistic terms
3 Probabilistic Ranking Principle
- Given a user query q and a document d_j, the probabilistic model tries to estimate the probability that the user will find the document d_j interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. The ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant.
- But,
  - how do we compute these probabilities?
  - what is the sample space?
4 The Ranking
- Probabilistic ranking is computed as
  sim(q, d_j) = P(d_j relevant-to q) / P(d_j non-relevant-to q)
- How to read this? Maximize the number of relevant documents, minimize the number of non-relevant documents
- This is the odds of the document d_j being relevant
- Ranking by the odds minimizes the probability of an erroneous judgment
- Definitions
  - w_{ij} \in \{0, 1\} (binary index term weights)
  - P(R | \vec{d}_j): probability that the given document is relevant
  - P(\bar{R} | \vec{d}_j): probability that the document is not relevant
- Bayes' rule: P(A|B) P(B) = P(B|A) P(A)
5 The Ranking
- sim(d_j, q) = \frac{P(R | \vec{d}_j)}{P(\bar{R} | \vec{d}_j)}
             = \frac{P(\vec{d}_j | R) \, P(R)}{P(\vec{d}_j | \bar{R}) \, P(\bar{R})}
             \sim \frac{P(\vec{d}_j | R)}{P(\vec{d}_j | \bar{R})}
- P(\vec{d}_j | R): probability of randomly selecting the document d_j from the set R of relevant documents
- P(R) and P(\bar{R}) are constant for all documents, so they can be dropped from the ranking
6 The Ranking
- sim(d_j, q) \sim \frac{P(\vec{d}_j | R)}{P(\vec{d}_j | \bar{R})}
             \sim \frac{\prod_{k_i \in d_j} P(k_i | R) \cdot \prod_{k_i \notin d_j} P(\bar{k}_i | R)}{\prod_{k_i \in d_j} P(k_i | \bar{R}) \cdot \prod_{k_i \notin d_j} P(\bar{k}_i | \bar{R})}
- P(k_i | R): probability that the index term k_i is present in a document randomly selected from the set R of relevant documents
- Based on the assumption that index terms occur independently of each other
  - Strong assumption!
  - In real life, it does not always hold
7 The Ranking
- Taking logs and dropping factors that are the same for all documents:
  sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
  where P(\bar{k}_i | R) = 1 - P(k_i | R) and P(\bar{k}_i | \bar{R}) = 1 - P(k_i | \bar{R})
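To make the weight concrete, here is a minimal Python sketch (not from the slides) of this log-odds term weight; p_rel and p_nonrel are illustrative names for P(k_i | R) and P(k_i | \bar{R}):

```python
import math

def term_weight(p_rel, p_nonrel):
    """Log-odds weight of one index term, per the formula above:
    log(P(ki|R)/(1 - P(ki|R))) + log((1 - P(ki|~R))/P(ki|~R))."""
    return math.log(p_rel / (1.0 - p_rel)) + math.log((1.0 - p_nonrel) / p_nonrel)

# A rare term (10 of 1000 docs) with P(ki|R) = 0.5 gets an IDF-like weight:
print(term_weight(0.5, 10 / 1000))  # ~4.6
```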
8 The Initial Ranking
- sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
- How do we get the probabilities P(k_i | R) and P(k_i | \bar{R})?
- Estimates based on assumptions:
  - P(k_i | R) = 0.5
  - P(k_i | \bar{R}) = n_i / N, where n_i is the number of docs that contain k_i and N is the total number of docs
- Use this initial guess to retrieve an initial ranking (sketched below)
- Improve upon this initial ranking
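As an illustration of the initial pass, a small sketch under assumed data structures (a dict mapping doc id to its set of index terms); with P(k_i | R) = 0.5 the first log term vanishes, so the score reduces to an IDF-like sum:

```python
import math

def initial_ranking(query_terms, corpus, N):
    """Rank docs with the initial estimates P(ki|R) = 0.5, P(ki|~R) = ni/N.
    `corpus` maps a doc id to its set of index terms (illustrative names)."""
    n = {k: sum(k in terms for terms in corpus.values()) for k in query_terms}
    def score(terms):
        # log((1 - ni/N)/(ni/N)) = log((N - ni)/ni); the P(ki|R) term is zero
        return sum(math.log((N - n[k]) / n[k])
                   for k in query_terms if k in terms and 0 < n[k] < N)
    return sorted(corpus, key=lambda d: score(corpus[d]), reverse=True)
```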
9 Improving the Initial Ranking
- sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
- Let
  - V = set of docs initially retrieved
  - V_i = subset of docs retrieved that contain k_i
- Reevaluate the estimates:
  - P(k_i | R) = |V_i| / |V|
  - P(k_i | \bar{R}) = (n_i - |V_i|) / (N - |V|)
- Repeat recursively
10 Improving the Initial Ranking
- sim(d_j, q) \sim \sum_i w_{iq} \cdot w_{ij} \cdot \left( \log \frac{P(k_i | R)}{1 - P(k_i | R)} + \log \frac{1 - P(k_i | \bar{R})}{P(k_i | \bar{R})} \right)
- To avoid problems when |V| = 1 or |V_i| = 0, smooth the estimates:
  - P(k_i | R) = (|V_i| + 0.5) / (|V| + 1)
  - P(k_i | \bar{R}) = (n_i - |V_i| + 0.5) / (N - |V| + 1)
- Alternatively,
  - P(k_i | R) = (|V_i| + n_i/N) / (|V| + 1)
  - P(k_i | \bar{R}) = (n_i - |V_i| + n_i/N) / (N - |V| + 1)
- The whole feedback loop is sketched in code below
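Putting slides 8-10 together, an assumption-laden sketch of the loop (the top-ranked docs stand in for the set V, the 0.5-smoothed estimates above are used, and every query term is assumed to occur in at least one document):

```python
import math

def refine_ranking(query_terms, corpus, N, n, top=20, iters=3):
    """corpus: doc id -> set of terms; n: term -> ni (document frequency)."""
    p_rel = {k: 0.5 for k in query_terms}          # initial P(ki|R)
    p_non = {k: n[k] / N for k in query_terms}     # initial P(ki|~R)
    ranking = []
    for _ in range(iters):
        scores = {d: sum(math.log(p_rel[k] / (1 - p_rel[k]))
                         + math.log((1 - p_non[k]) / p_non[k])
                         for k in query_terms if k in terms)
                  for d, terms in corpus.items()}
        ranking = sorted(scores, key=scores.get, reverse=True)
        V = [corpus[d] for d in ranking[:top]]     # docs taken as relevant
        for k in query_terms:                      # smoothed re-estimates
            Vi = sum(k in terms for terms in V)
            p_rel[k] = (Vi + 0.5) / (len(V) + 1)
            p_non[k] = (n[k] - Vi + 0.5) / (N - len(V) + 1)
    return ranking
```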
11 Latent Semantic Indexing
Objective: Replace indexes that use sets of index terms with indexes that use concepts.
Approach: Map the term vector space into a lower-dimensional space using singular value decomposition. Each dimension in the new space corresponds to a latent concept in the original data.
12 Deficiencies of Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the same concept (lowers recall).
Polysemy: Individual words have more than one meaning (lowers precision).
Independence: No significance is given to two terms that frequently appear together.
Latent semantic indexing addresses the first of these deficiencies (synonymy) and the third (dependence).
13 Technical Memo Example: Titles
c1: Human machine interface for Lab ABC computer applications
c2: A survey of user opinion of computer system response time
c3: The EPS user interface management system
c4: System and human system engineering testing of EPS
c5: Relation of user-perceived response time to error measurement
m1: The generation of random, binary, unordered trees
m2: The intersection graph of paths in trees
m3: Graph minors IV: Widths of trees and well-quasi-ordering
m4: Graph minors: A survey
14 Technical Memo Example: Terms and Documents

Terms       c1 c2 c3 c4 c5 m1 m2 m3 m4
human        1  0  0  1  0  0  0  0  0
interface    1  0  1  0  0  0  0  0  0
computer     1  1  0  0  0  0  0  0  0
user         0  1  1  0  1  0  0  0  0
system       0  1  1  2  0  0  0  0  0
response     0  1  0  0  1  0  0  0  0
time         0  1  0  0  1  0  0  0  0
EPS          0  0  1  1  0  0  0  0  0
survey       0  1  0  0  0  0  0  0  1
trees        0  0  0  0  0  1  1  1  0
graph        0  0  0  0  0  0  1  1  1
minors       0  0  0  0  0  0  0  1  1
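For readers following along in code, the same matrix as a NumPy array (the counts are from the slide; the variable names are ours, and later snippets reuse X, terms, and docs):

```python
import numpy as np

# Term-document matrix X from the slide (12 terms x 9 documents)
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]
docs = ["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"]
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],   # human
    [1, 0, 1, 0, 0, 0, 0, 0, 0],   # interface
    [1, 1, 0, 0, 0, 0, 0, 0, 0],   # computer
    [0, 1, 1, 0, 1, 0, 0, 0, 0],   # user
    [0, 1, 1, 2, 0, 0, 0, 0, 0],   # system
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # response
    [0, 1, 0, 0, 1, 0, 0, 0, 0],   # time
    [0, 0, 1, 1, 0, 0, 0, 0, 0],   # EPS
    [0, 1, 0, 0, 0, 0, 0, 0, 1],   # survey
    [0, 0, 0, 0, 0, 1, 1, 1, 0],   # trees
    [0, 0, 0, 0, 0, 0, 1, 1, 1],   # graph
    [0, 0, 0, 0, 0, 0, 0, 1, 1],   # minors
], dtype=float)
```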
15 Technical Memo Example: Query
Query: Find documents relevant to "human computer interaction"
Simple term matching:
  Matches c1, c2, and c4
  Misses c3 and c5
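A quick check of the term-matching claim, continuing with the X defined above:

```python
query = {"human", "computer"}            # "interaction" is not in the term list
rows = [terms.index(t) for t in query]
matches = [d for j, d in enumerate(docs) if X[rows, j].sum() > 0]
print(matches)  # ['c1', 'c2', 'c4'] -- c3 and c5 share no query terms
```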
16 The Term Vector Space
[Figure: documents d1 and d2 drawn as vectors over term axes t1, t2, t3.]
The space has as many dimensions as there are terms in the word list.
17 Latent Concept Vector Space
[Figure: terms, documents, and the query plotted together in a two-dimensional latent concept space; related items have cosine > 0.9.]
18 Mathematical Concepts
Define X as the term-document matrix, with t rows (number of index terms) and d columns (number of documents).
Singular Value Decomposition: For any matrix X with t rows and d columns, there exist matrices T0, S0, and D0' such that
  X = T0 S0 D0'
T0 and D0 are the matrices of left and right singular vectors; S0 is the diagonal matrix of singular values.
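With NumPy, the decomposition of the example matrix looks like this (continuing the running example; np.linalg.svd returns the singular values as a vector, so we rebuild the diagonal S0):

```python
import numpy as np

T0, s0, D0t = np.linalg.svd(X, full_matrices=False)   # D0t is D0'
S0 = np.diag(s0)
assert np.allclose(X, T0 @ S0 @ D0t)                  # X = T0 S0 D0'
```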
19 Dimensions of the Matrices
  X (t x d) = T0 (t x m) S0 (m x m) D0' (m x d)
where m is the rank of X, m <= min(t, d).
20 Reduced Rank
S0 can be chosen so that its diagonal elements are positive and decreasing in magnitude. Keep the first k and set the others to zero. Delete the zero rows and columns of S0 and the corresponding rows and columns of T0 and D0. This gives the rank-k approximation
  X ≈ X̂ = T S D'
Interpretation: If the value of k is selected well, the expectation is that X̂ retains the semantic information from X, but eliminates noise due to synonymy and recognizes dependence between terms.
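Truncating the running example (k = 2 keeps the toy example easy to visualize; the next slide notes that real collections typically use a few hundred concepts):

```python
k = 2
T = T0[:, :k]               # t x k
S = np.diag(s0[:k])         # k x k
Dt = D0t[:k, :]             # k x d, i.e. D'
X_hat = T @ S @ Dt          # rank-k approximation of X
```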
21 Dimensionality Reduction
  X (t x d) ≈ X̂ = T (t x k) S (k x k) D' (k x d)
k is the number of latent concepts (typically 300-500).
22 Recombination after Dimensionality Reduction
23 Mathematical Revision
A is a p x q matrix; B is an r x q matrix. Let a_i be the vector represented by row i of A, and b_j the vector represented by row j of B. Then the inner product a_i · b_j is element (i, j) of AB'.
[Figure: row i of the p x q matrix A multiplied against row j of B, shown as column j of the q x r matrix B'.]
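A one-off numeric check of this identity (self-contained, with arbitrary random matrices):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((4, 3))      # p x q
B = rng.random((5, 3))      # r x q
i, j = 2, 1
assert np.isclose((A @ B.T)[i, j], A[i] @ B[j])   # element (i,j) of AB' = ai . bj
```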
24 Comparing a Query and a Document
A query can be expressed as a vector x_q in the term-document vector space:
  x_{qi} = 1 if term i is in the query, and 0 otherwise.
(Ignore query terms that are not in the term vector space.)
Let p_{qj} be the inner product of the query x_q with document d_j in the term-document vector space; p_{qj} is the jth element of the product x_q'X.
25 Comparing a Query and a Document
[Figure: the row of inner products (p_q1 ... p_qj ... p_qd) equals the query row (x_q1 x_q2 ... x_qt) times X; document d_j is column j of X.]
  p_q' = x_q'X ≈ x_q'T S D' = x_q'T (DS)'
similarity(q, d_j) is the cosine of the angle between the two vectors: the inner product p_qj = x_q · d_j divided by the lengths of the vectors.
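Continuing the running example, scoring the query "human computer interaction" against all nine documents in the reduced space (a sketch; the cosine uses the columns of X_hat as document vectors):

```python
x_q = np.zeros(len(terms))
x_q[terms.index("human")] = 1
x_q[terms.index("computer")] = 1

p_q = x_q @ T @ S @ Dt      # p_q' = x_q' T S D', one inner product per doc
cos = p_q / (np.linalg.norm(x_q) * np.linalg.norm(X_hat, axis=0))
for d, c in sorted(zip(docs, cos), key=lambda pair: -pair[1]):
    print(d, round(c, 2))   # c3 and c5 now get nonzero scores despite
                            # sharing no terms with the query
```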
26 Comparing a Query and a Document
Alternatively, treat the query q as a pseudo-document in the concept space:
  d_q = x_q'T S^{-1}
To compare a query against document j, extend the method used to compare document i with document j: take the jth element of the product of d_q S and (DS)'. Since d_q S = x_q'T, this is the jth element of the product x_q'T (DS)', which is the same expression as before.
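And the pseudo-document route, verified against the previous expression (continuing the sketch; S is invertible because its k singular values are positive):

```python
d_q = x_q @ T @ np.linalg.inv(S)        # d_q = x_q' T S^-1, query in concept space
D = Dt.T                                # d x k matrix of document concept vectors
p_q_alt = (d_q @ S) @ (D @ S).T         # jth element compares q with document j
assert np.allclose(p_q_alt, x_q @ T @ S @ Dt)   # same scores as before
```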
27 Experimental Results: 100 Factors
28 Experimental Results: Number of Factors