Information Retrieval and Web Search - PowerPoint PPT Presentation

Provided by: gheorghe

1
Information Retrieval and Web Search
  • Alternative IR models
  • Instructor: Rada Mihalcea
  • Some of the slides were adapted from a course
    taught at Cornell University by William Y. Arms

2
Probabilistic Model
  • An initial set of documents is retrieved somehow
  • User inspects these docs looking for the relevant
    ones (in truth, only top 10-20 need to be
    inspected)
  • IR system uses this information to refine
    description of ideal answer set
  • By repeating this process, it is expected that the
    description of the ideal answer set will improve
  • Keep in mind that the description of the ideal
    answer set has to be guessed at the very beginning
  • Description of ideal answer set is modeled in
    probabilistic terms

3
Probabilistic Ranking Principle
  • Given a user query q and a document dj, the
    probabilistic model tries to estimate the
    probability that the user will find the document
    dj interesting (i.e., relevant). The model
    assumes that this probability of relevance
    depends on the query and the document
    representations only. Ideal answer set is
    referred to as R and should maximize the
    probability of relevance. Documents in the set R
    are predicted to be relevant.
  • But,
  • how to compute probabilities?
  • what is the sample space?

4
The Ranking
  • Probabilistic ranking computed as
  • $\mathrm{sim}(q, d_j) = P(d_j \text{ relevant to } q) \,/\, P(d_j \text{ non-relevant to } q)$
  • How to read this? Maximize the number of
    relevant documents, minimize the number of
    irrelevant documents
  • This is the odds of the document dj being
    relevant
  • Ranking by the odds minimizes the probability of an
    erroneous judgement
  • Definition
  • $w_{ij} \in \{0, 1\}$
  • $P(R \mid \vec{d_j})$: probability that the given document
    is relevant
  • $P(\neg R \mid \vec{d_j})$: probability that the document is
    not relevant
  • Bayes' rule: $P(A \mid B)\,P(B) = P(B \mid A)\,P(A)$
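
The odds reading of the ranking can be illustrated with a minimal sketch. The relevance probabilities below are hypothetical values chosen for illustration, and the function name `odds_of_relevance` is not from the slides.

```python
def odds_of_relevance(p_relevant):
    """Return P(R|d) / P(not R|d) for a relevance probability strictly between 0 and 1."""
    return p_relevant / (1.0 - p_relevant)


# Hypothetical probabilities of relevance for three documents.
p_relevance = {"d1": 0.8, "d2": 0.5, "d3": 0.2}

# The odds grow monotonically with the probability, so ranking by odds
# puts the documents most likely to be relevant first.
ranking = sorted(p_relevance, key=lambda d: odds_of_relevance(p_relevance[d]), reverse=True)
print(ranking)
```

Odds of 1 correspond to a document that is as likely to be relevant as not.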

5
The Ranking
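  • $\mathrm{sim}(d_j, q) = \frac{P(R \mid \vec{d_j})}{P(\neg R \mid \vec{d_j})} = \frac{P(\vec{d_j} \mid R)\,P(R)}{P(\vec{d_j} \mid \neg R)\,P(\neg R)} \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \neg R)}$
  • $P(\vec{d_j} \mid R)$: probability of randomly
    selecting the document dj from the set R of
    relevant documents
  • $P(R)$ and $P(\neg R)$ are constant (the same for every document)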

6
The Ranking
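  • $\mathrm{sim}(d_j, q) \sim \frac{P(\vec{d_j} \mid R)}{P(\vec{d_j} \mid \neg R)} \sim \frac{\prod P(k_i \mid R) \cdot \prod P(\neg k_i \mid \neg R)}{\prod P(\neg k_i \mid R) \cdot \prod P(k_i \mid \neg R)}$
    (products over the index terms ki present in dj; factors
    that are the same for every document are dropped)
  • $P(k_i \mid R)$: probability that the index term ki is
    present in a document randomly selected from the
    set R of relevant documents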
  • Based on independence assumption
  • Strong assumption!
  • In real life, does not always hold

7
The Ranking
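  • $\mathrm{sim}(d_j, q) \sim \log \frac{\prod P(k_i \mid R) \cdot \prod P(\neg k_i \mid \neg R)}{\prod P(\neg k_i \mid R) \cdot \prod P(k_i \mid \neg R)}$
    $\sim \sum_i w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \neg R)}{P(k_i \mid \neg R)} \right)$
    where $P(\neg k_i \mid R) = 1 - P(k_i \mid R)$ and
    $P(\neg k_i \mid \neg R) = 1 - P(k_i \mid \neg R)$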
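
The summation form translates almost directly into code. The sketch below assumes binary weights, so a term contributes only when it occurs in both the query and the document; the probabilities and the function names are hypothetical.

```python
from math import log


def term_weight(p_rel, p_nonrel):
    """Log-odds contribution of one term; both probabilities must lie strictly in (0, 1)."""
    return log(p_rel / (1.0 - p_rel)) + log((1.0 - p_nonrel) / p_nonrel)


def sim(doc_terms, query_terms, p_rel, p_nonrel):
    """Sum the term weights over terms in both query and document (binary w_iq, w_ij)."""
    return sum(term_weight(p_rel[k], p_nonrel[k]) for k in query_terms if k in doc_terms)


# Hypothetical probabilities for two query terms.
p_rel = {"a": 0.5, "b": 0.5}
p_nonrel = {"a": 0.25, "b": 0.5}

# The document contains "a" (a query term) and "c" (not in the query).
score = sim({"a", "c"}, {"a", "b"}, p_rel, p_nonrel)
print(score)
```

A term that is equally likely in relevant and non-relevant documents, with both probabilities at 0.5, has weight zero and does not affect the ranking.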

8
The Initial Ranking
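  • $\mathrm{sim}(d_j, q) \sim \sum_i w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \neg R)}{P(k_i \mid \neg R)} \right)$
  • How to obtain the probabilities $P(k_i \mid R)$ and $P(k_i \mid \neg R)$?
  • Estimates based on assumptions
  • $P(k_i \mid R) = 0.5$
  • $P(k_i \mid \neg R) = n_i / N$, where
    ni is the number of docs that contain ki and N is
    the total number of docs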
  • Use this initial guess to retrieve an initial
    ranking
  • Improve upon this initial ranking
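
A minimal sketch of the initial ranking, assuming a hypothetical four-document collection in which each document is a set of index terms. The function name `initial_weights` is an illustration, not part of the slides.

```python
from math import log


def initial_weights(docs, query_terms):
    """Term weights from the initial guesses P(k|R) = 0.5 and P(k|not R) = n_k / N."""
    n_docs = len(docs)
    weights = {}
    for k in query_terms:
        n_k = sum(1 for d in docs if k in d)
        p_rel, p_nonrel = 0.5, n_k / n_docs
        # Assumes 0 < n_k < N; otherwise the logarithm is undefined.
        weights[k] = log(p_rel / (1 - p_rel)) + log((1 - p_nonrel) / p_nonrel)
    return weights


# Hypothetical collection: each document is a set of index terms.
docs = [{"human", "computer"}, {"computer", "system"}, {"system", "user"}, {"trees", "graph"}]
query = ["human", "computer"]

weights = initial_weights(docs, query)
scores = [sum(weights[k] for k in query if k in d) for d in docs]
ranking = sorted(range(len(docs)), key=lambda j: scores[j], reverse=True)
print(ranking)
```

With P(ki | R) fixed at 0.5 the first logarithm vanishes, so the initial weight of a term reduces to log((N - ni) / ni), which favours rare terms.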

9
Improving the Initial Ranking
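  • $\mathrm{sim}(d_j, q) \sim \sum_i w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \neg R)}{P(k_i \mid \neg R)} \right)$
  • Let
  • V: set of docs initially retrieved
  • Vi: subset of docs retrieved that contain ki
  • Reevaluate estimates (V and Vi also stand for the
    number of docs in each set)
  • $P(k_i \mid R) = V_i / V$
  • $P(k_i \mid \neg R) = (n_i - V_i) / (N - V)$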
  • Repeat recursively
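
The re-estimation step can be sketched as follows, assuming the retrieved set V is given as a list of document indexes. The collection and the function name `reestimate` are hypothetical.

```python
def reestimate(docs, retrieved, query_terms):
    """Re-estimate P(k|R) = V_k / V and P(k|not R) = (n_k - V_k) / (N - V).

    docs is the whole collection (a list of term sets) and retrieved is the
    list of indexes of the documents in V.
    """
    n_docs, n_retrieved = len(docs), len(retrieved)
    p_rel, p_nonrel = {}, {}
    for k in query_terms:
        n_k = sum(1 for d in docs if k in d)
        v_k = sum(1 for j in retrieved if k in docs[j])
        p_rel[k] = v_k / n_retrieved
        p_nonrel[k] = (n_k - v_k) / (n_docs - n_retrieved)
    return p_rel, p_nonrel


# Hypothetical collection: each document is a set of index terms.
docs = [{"human", "computer"}, {"computer", "system"}, {"system", "user"}, {"trees", "graph"}]

# Suppose the first two documents were retrieved by the initial ranking.
p_rel, p_nonrel = reestimate(docs, [0, 1], ["human", "computer"])
print(p_rel, p_nonrel)
```

In this small example `computer` ends up with a relevant-set probability of 1 and both non-relevant estimates are 0, values for which the logarithms in the ranking formula are undefined.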

10
Improving the Initial Ranking
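  • $\mathrm{sim}(d_j, q) \sim \sum_i w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \neg R)}{P(k_i \mid \neg R)} \right)$
  • To avoid problems with V = 1 and Vi = 0
  • $P(k_i \mid R) = (V_i + 0.5) / (V + 1)$
  • $P(k_i \mid \neg R) = (n_i - V_i + 0.5) / (N - V + 1)$
  • Also,
  • $P(k_i \mid R) = (V_i + n_i/N) / (V + 1)$
  • $P(k_i \mid \neg R) = (n_i - V_i + n_i/N) / (N - V + 1)$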
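
Both adjustments fit in one small function. This is a sketch with a hypothetical name (`smoothed_estimates`) and a hypothetical flag (`use_prior`) that switches between the 0.5 and the ni/N variant.

```python
def smoothed_estimates(v_k, n_retrieved, n_k, n_docs, use_prior=False):
    """Adjusted estimates of P(k|R) and P(k|not R).

    use_prior=False adds the constant 0.5; use_prior=True adds n_k / N instead.
    """
    adjust = n_k / n_docs if use_prior else 0.5
    p_rel = (v_k + adjust) / (n_retrieved + 1)
    p_nonrel = (n_k - v_k + adjust) / (n_docs - n_retrieved + 1)
    return p_rel, p_nonrel


# Hypothetical counts: V = 1 retrieved doc, none containing the term (V_k = 0),
# n_k = 2 docs in a collection of N = 5 contain the term.
print(smoothed_estimates(0, 1, 2, 5))
print(smoothed_estimates(0, 1, 2, 5, use_prior=True))
```

Even with Vi = 0 the relevant-set estimate stays strictly positive, so the logarithms in the ranking formula remain defined.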

11
Latent Semantic Indexing
Objective: Replace indexes that use sets of index
terms by indexes that use concepts.
Approach: Map the term vector space into a lower
dimensional space, using singular value
decomposition. Each dimension in the new space
corresponds to a latent concept in the original data.
12
Deficiencies with Conventional Automatic Indexing
Synonymy: Various words and phrases refer to the
same concept (lowers recall).
Polysemy: Individual words have more than one
meaning (lowers precision).
Independence: No significance is given to two terms
that frequently appear together.
Latent semantic indexing addresses the first of
these (synonymy) and the third (dependence).
13
Technical Memo Example: Titles
c1 Human machine interface for Lab ABC computer applications
c2 A survey of user opinion of computer system response time
c3 The EPS user interface management system
c4 System and human system engineering testing of EPS
c5 Relation of user-perceived response time to error measurement
m1 The generation of random, binary, unordered trees
m2 The intersection graph of paths in trees
m3 Graph minors IV: Widths of trees and well-quasi-ordering
m4 Graph minors: A survey
14
Technical Memo Example: Terms and Documents

Terms       Documents
            c1  c2  c3  c4  c5  m1  m2  m3  m4
human        1   0   0   1   0   0   0   0   0
interface    1   0   1   0   0   0   0   0   0
computer     1   1   0   0   0   0   0   0   0
user         0   1   1   0   1   0   0   0   0
system       0   1   1   2   0   0   0   0   0
response     0   1   0   0   1   0   0   0   0
time         0   1   0   0   1   0   0   0   0
EPS          0   0   1   1   0   0   0   0   0
survey       0   1   0   0   0   0   0   0   1
trees        0   0   0   0   0   1   1   1   0
graph        0   0   0   0   0   0   1   1   1
minors       0   0   0   0   0   0   0   1   1
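
The term counts can be reproduced from the nine titles with a short script. This is a sketch under two assumptions that the slides do not state: a small hand-picked stop list, and the rule that a word becomes an index term only if it occurs in more than one title.

```python
import re
from collections import Counter

titles = {
    "c1": "Human machine interface for Lab ABC computer applications",
    "c2": "A survey of user opinion of computer system response time",
    "c3": "The EPS user interface management system",
    "c4": "System and human system engineering testing of EPS",
    "c5": "Relation of user-perceived response time to error measurement",
    "m1": "The generation of random, binary, unordered trees",
    "m2": "The intersection graph of paths in trees",
    "m3": "Graph minors IV: Widths of trees and well-quasi-ordering",
    "m4": "Graph minors: A survey",
}
STOP_WORDS = {"a", "and", "for", "in", "of", "the", "to"}  # assumed stop list


def tokenize(title):
    """Lower-case the title, split on non-letters and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", title.lower()) if w not in STOP_WORDS]


counts = {doc: Counter(tokenize(title)) for doc, title in titles.items()}

# Document frequency: in how many titles does each word occur?
doc_freq = Counter(word for doc_counts in counts.values() for word in doc_counts)

# Assumed selection rule: keep words that occur in more than one title.
terms = sorted(word for word, df in doc_freq.items() if df > 1)

# One row per term, one column per document (c1..c5, m1..m4).
X = {term: [counts[doc][term] for doc in titles] for term in terms}
for term, row in X.items():
    print(f"{term:<10}", row)
```

The terms come out lower-cased and in alphabetical order, so `eps` corresponds to the row labelled EPS.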
 
15
Technical Memo Example: Query
Query: Find documents relevant to "human
computer interaction".
Simple Term Matching: Matches c1, c2, and c4; misses c3 and c5.
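
Simple term matching on this example can be sketched in a few lines. The index terms per document are copied from the term-document table; the function name `term_match` is hypothetical.

```python
# Index terms of each title, copied from the term-document table.
doc_terms = {
    "c1": {"human", "interface", "computer"},
    "c2": {"computer", "user", "system", "response", "time", "survey"},
    "c3": {"interface", "user", "system", "EPS"},
    "c4": {"human", "system", "EPS"},
    "c5": {"user", "response", "time"},
    "m1": {"trees"},
    "m2": {"trees", "graph"},
    "m3": {"trees", "graph", "minors"},
    "m4": {"survey", "graph", "minors"},
}


def term_match(query, doc_terms):
    """Return the documents that share at least one term with the query."""
    query_terms = set(query.lower().split())
    return [doc for doc, terms in doc_terms.items()
            if query_terms & {t.lower() for t in terms}]


matches = term_match("human computer interaction", doc_terms)
print(matches)
```

The word "interaction" is not an index term, so only "human" and "computer" contribute to the match.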
16
The term vector space
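The space has as many dimensions as there are
terms in the word list.
[Figure: document vectors d1 and d2 drawn against term axes t1, t2 and t3, with the angle between the two vectors marked]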
17
Latent concept vector space
[Figure legend: term, document, query; dashed line: cosine > 0.9]
18
Mathematical concepts
Define X as the term-document matrix, with t rows
(number of index terms) and d columns (number of
documents).
Singular Value Decomposition: For any matrix X, with
t rows and d columns, there exist matrices T0, S0
and D0', such that $X = T_0 S_0 D_0'$.
T0 and D0 are the matrices of left and right
singular vectors; S0 is the diagonal matrix of
singular values.
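
A minimal NumPy sketch of the decomposition, assuming a small hypothetical term-document matrix. The variable names follow the slide notation (`T0`, `S0`, `D0t` for D0').

```python
import numpy as np

# Hypothetical term-document matrix with t = 3 terms and d = 2 documents.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [0.0, 1.0]])

# full_matrices=False returns the compact factors: T0 is t x m and D0t is m x d
# with m = min(t, d). This X has full rank, so m also equals the rank of X.
# s0 holds the singular values in decreasing order.
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
S0 = np.diag(s0)

X_rebuilt = T0 @ S0 @ D0t
print(T0.shape, S0.shape, D0t.shape)
print(np.allclose(X_rebuilt, X))
```

NumPy returns the singular values as a vector and the right singular vectors already transposed, which is why `np.diag` is needed and no extra transpose appears in the product.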
19
Dimensions of matrices
$X = T_0 S_0 D_0'$, with dimensions
X: t x d
T0: t x m
S0: m x m
D0': m x d
m is the rank of X, with m ≤ min(t, d)
20
Reduced Rank
S0 can be chosen so that the diagonal elements
are positive and decreasing in magnitude. Keep
the first k and set the others to zero. Delete
the zero rows and columns of S0 and the
corresponding columns of T0 and D0.
This gives $X \approx \hat{X} = T S D'$.
Interpretation: If the value of k is selected well,
the expectation is that $\hat{X}$ retains the semantic
information from X, but eliminates noise from
synonymy and recognizes dependence.
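
The truncation can be sketched with NumPy on the technical memo matrix. The value k = 2 is chosen only to keep the illustration small, and the function name `reduced_rank` is hypothetical.

```python
import numpy as np

# Term-document matrix of the technical memo example
# (rows: human, interface, computer, user, system, response, time, EPS,
#  survey, trees, graph, minors; columns: c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)


def reduced_rank(X, k):
    """Keep the k largest singular values and the matching singular vectors."""
    T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
    T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T
    return T, S, D, s0


k = 2  # illustration only
T, S, D, s0 = reduced_rank(X, k)
X_hat = T @ S @ D.T

# Frobenius norm of the approximation error.
error = np.linalg.norm(X - X_hat)
print(X_hat.shape, error)
```

The error equals the square root of the sum of the squared singular values that were dropped, which gives one way to judge a choice of k.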


21
Dimensionality Reduction
$\hat{X} = T S D'$, with dimensions
$\hat{X}$: t x d
T: t x k
S: k x k
D': k x d
k is the number of latent concepts (typically
300 to 500)
$X \approx \hat{X} = T S D'$
22
Recombination after Dimensionality Reduction
23
Mathematical Revision
A is a p x q matrix.
B is an r x q matrix.
ai is the vector represented by row i of A.
bj is the vector represented by row j of B.
The inner product ai.bj is element i, j of AB'.
[Figure: A (p x q) multiplied by B' (q x r); the ith row of A is combined with the jth row of B, which is the jth column of B']
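
A small worked check of this identity, using hypothetical matrices with p = 2, q = 2 and r = 3. The function name `product_with_transpose` is an illustration.

```python
def product_with_transpose(A, B):
    """Return AB' for A (p x q) and B (r x q), both given as lists of rows.

    Element (i, j) is the inner product of row i of A with row j of B.
    """
    return [[sum(a * b for a, b in zip(row_a, row_b)) for row_b in B] for row_a in A]


A = [[1, 2], [3, 4]]            # p = 2, q = 2
B = [[5, 6], [7, 8], [9, 10]]   # r = 3, q = 2

AB_t = product_with_transpose(A, B)
print(AB_t)
```

The result has p rows and r columns, one entry for every pairing of a row of A with a row of B.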
24
Comparing a Query and a Document
A query can be expressed as a vector xq in the
term-document vector space: xqi = 1 if
term i is in the query and 0 otherwise. (Ignore
query terms that are not in the term vector
space.) Let pqj be the inner product of the query
xq with document dj in the term-document vector
space. pqj is the jth element in the
product xq'X.
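
For the technical memo example, the query vector and the product xq'X can be computed directly. This sketch uses plain lists; the function name `query_vector` is hypothetical.

```python
terms = ["human", "interface", "computer", "user", "system", "response",
         "time", "EPS", "survey", "trees", "graph", "minors"]

# Term-document matrix: one row per term, columns c1..c5, m1..m4.
X = [
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
]


def query_vector(query, terms):
    """xq_i = 1 if term i is in the query, 0 otherwise; unknown words are ignored."""
    words = set(query.lower().split())
    return [1 if t.lower() in words else 0 for t in terms]


xq = query_vector("human computer interaction", terms)

# pq_j is the inner product of xq with column j of X.
pq = [sum(xq[i] * X[i][j] for i in range(len(terms))) for j in range(len(X[0]))]
print(pq)
```

Only c1, c2 and c4 receive a non-zero value, which is the same set that simple term matching finds.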

25
Comparing a Query and a Document
[pq1 ... pqj ... pqd] = [xq1 xq2 ... xqt] X
pqj is the inner product of query q with document dj;
document dj is column j of X.
$p_q' = x_q' X \approx x_q' \hat{X} = x_q' T S D' = x_q' T (DS)'$
similarity(q, dj) = pqj / (|xq| |dj|): the cosine of the
angle is the inner product divided by the lengths
of the vectors.
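
The comparison in the reduced space can be sketched with NumPy on the technical memo example. Two choices here are assumptions made for the illustration: k = 2, and taking the document length |dj| as the length of column j of the reduced-rank matrix.

```python
import numpy as np

# Term-document matrix of the technical memo example
# (rows: human, interface, computer, user, system, response, time, EPS,
#  survey, trees, graph, minors; columns: c1..c5, m1..m4).
X = np.array([
    [1, 0, 0, 1, 0, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0, 0, 0, 0],
    [0, 1, 1, 2, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1],
], dtype=float)

# Query "human computer interaction": rows 0 (human) and 2 (computer).
xq = np.zeros(12)
xq[[0, 2]] = 1.0

k = 2  # illustration only
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

# Inner products of the query with every document: x_q' T (DS)'.
pq = xq @ T @ (D @ S).T

# Cosine: inner product divided by the lengths of the two vectors.
X_hat = T @ S @ D.T
cosines = pq / (np.linalg.norm(xq) * np.linalg.norm(X_hat, axis=0))
for doc, c in zip(["c1", "c2", "c3", "c4", "c5", "m1", "m2", "m3", "m4"], cosines):
    print(doc, round(float(c), 3))
```

Because S is diagonal, SD' equals (DS)', so the same numbers come out of multiplying the query with the reduced-rank matrix directly.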
26
Comparing a Query and a Document
Alternatively, treat the query q as a
pseudo-document dq in the concept space:
$d_q = x_q' T S^{-1}$
To compare a query against document j,
extend the method used to compare document i with
document j. Take the jth element of the product
of dqS and (DS)'. This is the jth element of the
product xq'T(DS)', which is the same
expression as before.
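
The equivalence of the two routes can be checked numerically. The sketch below assumes a small hypothetical matrix with 4 terms and 3 documents and k = 2; none of the numbers come from the slides.

```python
import numpy as np

# Hypothetical term-document matrix: 4 terms x 3 documents.
X = np.array([[1, 1, 0],
              [1, 0, 0],
              [0, 1, 1],
              [0, 0, 1]], dtype=float)

# Hypothetical query containing the first and third terms.
xq = np.array([1, 0, 1, 0], dtype=float)

k = 2
T0, s0, D0t = np.linalg.svd(X, full_matrices=False)
T, S, D = T0[:, :k], np.diag(s0[:k]), D0t[:k, :].T

# Pseudo-document for the query in the concept space: d_q = x_q' T S^-1.
dq = xq @ T @ np.linalg.inv(S)

# Route 1: compare the pseudo-document with the documents, (d_q S)(DS)'.
via_pseudo_doc = dq @ S @ (D @ S).T

# Route 2: the direct expression x_q' T (DS)'.
direct = xq @ T @ (D @ S).T

print(np.allclose(via_pseudo_doc, direct))
```

The factor S^-1 in dq cancels against the S it is multiplied with, which is why both routes agree.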
27
Experimental Results: 100 Factors
28
Experimental Results: Number of Factors