Probabilistic Information Retrieval

About This Presentation


Slides: 12

Provided by: wha93

Learn more at: https://crystal.uta.edu


Transcript and Presenter's Notes


1
Probabilistic Information Retrieval

Z.M. Joseph, Spring 2006, CSE, UTA

2
Basic Rules of Probability

3
Basic Assumptions

Assume a database D consisting of a set of
objects (documents, tuples, etc.)
Q: query
R: relevant set of tuples
The goal is to find an R for each Q, given D.
Instead of a deterministic ordering, consider a
probabilistic ordering.
A ranking/scoring function should decide the degree
of relevance of a document.
Thus, given a document d:
Score(d) = P(R | d)    (1)
According to this, if you know the
relevance set, then R's members would have a
probability of 1, which is the maximum
score. Others would get a probability of 0.

4
Simplification

From (1):
Take the ratio of the probability that the document is in R
to the probability that it is not in R.
This retains the old ordering and factors in the
elements of D that lie outside R.
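
A minimal sketch of why taking the ratio retains the ordering: the map p -> p / (1 - p) is strictly increasing on [0, 1), so sorting documents by P(R | d) or by the odds of relevance gives the same ranking. The probabilities and the helper name `odds` below are made up for illustration and do not come from the slides.

```python
def odds(p: float) -> float:
    """Odds of relevance, P(R | d) / P(not R | d), for 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


# Hypothetical relevance probabilities P(R | d) for four documents.
prob = {"d1": 0.9, "d2": 0.2, "d3": 0.6, "d4": 0.05}

# Rank once by probability and once by odds, highest first.
by_prob = sorted(prob, key=lambda d: prob[d], reverse=True)
by_odds = sorted(prob, key=lambda d: odds(prob[d]), reverse=True)

print(by_prob == by_odds)
```

The case p = 1 is excluded because the odds are then unbounded; a document known to be in R already has the maximum score.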

5
Applying Bayes Theorem

6
Observations

7
Derivation for Keyword Queries

Now assume that a query is a vector of
words, with zero probability assigned to any word
that does not occur.
Then, applying the previous equation to each word
w (instead of to a document) and combining all
the words of the query gives
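
The equation itself is not preserved in this transcript. One plausible reconstruction, given here as a sketch and not as the slide's exact formula, starts from the ratio in slide 4 and applies Bayes' theorem. It rests on three assumptions that are not stated in the surviving text: the constant factor P(R) / P(\bar{R}) is dropped because it does not affect the ordering, the non-relevant set \bar{R} is approximated by the whole database D, and the query words are treated as independent.

```latex
\begin{align*}
\mathrm{Score}(d)
  &\propto \frac{P(R \mid d)}{P(\bar{R} \mid d)}
   = \frac{P(d \mid R)\, P(R)}{P(d \mid \bar{R})\, P(\bar{R})}
   \propto \frac{P(d \mid R)}{P(d \mid \bar{R})} \\
  &\approx \frac{P(d \mid R)}{P(d \mid D)}
   = \prod_{w \in Q \cap d} \frac{P(w \mid R)}{P(w \mid D)}
\end{align*}
```

Here \bar{R} denotes the documents of D that are not in R, and the product runs over the query words w that occur in d. The denominator P(w | D) is what makes a word that is common in the database contribute less to the score.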

8
Search for Microsoft Corporation

9
Search for Microsoft Corporation

Because "Corporation" is more common in the
database D, P(Corporation | D) will be far
higher than P(Microsoft | D).
Thus Score(D1) will be higher than Score(D2).
Thus the document that contains "Microsoft" will
get a higher ranking, as this word is more specific
than "Corporation".
This is similar to vector space ranking by relevance.
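
A minimal sketch of this comparison, assuming each matched query word contributes a factor P(w | R) / P(w | D) to the score. Everything concrete here is hypothetical: the document frequencies, the database size, the contents of D1 and D2, and the choice to give P(w | R) the same constant value for every word (as one might do before any relevance feedback is available).

```python
from math import prod

# Hypothetical database of 1,000 documents with made-up document frequencies.
N_DOCS = 1000
doc_freq = {"microsoft": 10, "corporation": 400}

# Assumption: without feedback, P(w | R) is the same constant for every query word.
P_W_GIVEN_R = 0.5


def p_w_given_db(word: str) -> float:
    """Estimate P(w | D) as the fraction of documents in D containing the word."""
    return doc_freq[word] / N_DOCS


def score(doc_words: set, query: list) -> float:
    """Multiply P(w | R) / P(w | D) over the query words that occur in the document."""
    matched = [w for w in query if w in doc_words]
    if not matched:
        return 0.0  # no query word occurs in the document
    return prod(P_W_GIVEN_R / p_w_given_db(w) for w in matched)


query = ["microsoft", "corporation"]
d1 = {"microsoft", "releases", "update"}   # hypothetical D1: contains "Microsoft"
d2 = {"acme", "corporation", "report"}     # hypothetical D2: contains "Corporation"

score_d1 = score(d1, query)
score_d2 = score(d2, query)
print(score_d1 > score_d2)
```

Dividing by P(w | D) plays a role similar to inverse document frequency in vector space ranking: the rarer word carries more weight.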

10
Relevance Feedback

11
PIR Applied to Databases

Originally PIR was applied to documents, not
to databases.
Applying PIR to databases is not easy, as several
aspects are difficult to capture.
These include:
Different values of an attribute
PIR is based on words in a document; in a database,
the fact that a car is blue, black, etc. is not easily
captured.
Would you assign each color as a keyword?
What to sacrifice in ranking is also not easy to
capture: if a user's preference is black cars,
how is PIR applied when listing results
that do not match entirely?