Probabilistic Information Retrieval

Learn more at: https://crystal.uta.edu
1
Probabilistic Information Retrieval
  • CSE6392 - Database Exploration
  • Gautam Das
  • Thursday, March 29 2006

Z.M. Joseph Spring 2006, CSE, UTA
2
Basic Rules of Probability
  • Recall the product rule
  • Bayes' theorem
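
The formulas for these two rules did not survive in the transcript. The standard statements, written here for two generic events A and B (the symbols are chosen for illustration and are not taken from the slide), are:

```latex
\begin{align*}
% Product rule: a joint probability factors either way round
P(A, B) &= P(A \mid B)\, P(B) = P(B \mid A)\, P(A) \\
% Bayes' theorem: divide the product rule through by P(B)
P(A \mid B) &= \frac{P(B \mid A)\, P(A)}{P(B)}
\end{align*}
```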

3
Basic Assumptions
  • Assume a database D consisting of a set of
    objects (documents, tuples, etc.)
  • Q: a query
  • R: the set of tuples relevant to Q
  • Goal is to find an R for each Q, given D.
  • Instead of a deterministic ordering, consider a
    probabilistic one
  • The ranking/scoring function should decide the
    degree of relevance of a document
  • Thus, given a document d:
  • Score(d) = P(R | d)   (1)
  • According to (1), if the relevant set were known,
    the members of R would have probability 1, the
    maximum score, and all other documents would
    have probability 0.

4
Simplification
  • From (1)
  • Take the ratio of the probability that the
    document is in R to the probability that it is
    not in R
  • This retains the old ordering and factors in the
    elements of D that lie outside R.
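
The claim that the ratio keeps the old ordering follows because p / (1 - p) increases with p. The sketch below checks this on a few made-up relevance probabilities; the document names and values are hypothetical and serve only as an illustration.

```python
def odds(p: float) -> float:
    """Return p / (1 - p), the odds that correspond to probability p."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return p / (1.0 - p)


# Hypothetical values of P(R | d) for four documents (not from the lecture).
p_relevant = {"d1": 0.10, "d2": 0.75, "d3": 0.40, "d4": 0.05}

# Rank once by the probability itself and once by the ratio.
rank_by_probability = sorted(p_relevant, key=lambda d: p_relevant[d], reverse=True)
rank_by_odds = sorted(p_relevant, key=lambda d: odds(p_relevant[d]), reverse=True)

print(rank_by_probability)
print(rank_by_odds)
```

Both lists come out in the same order, which is the sense in which the simplification "retains the old ordering".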

5
Applying Bayes' Theorem
  • Simplify as follows
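
The slide's derivation is missing from the transcript. A reconstruction that is consistent with the surrounding slides, assuming the score is the ratio of P(R | d) to the probability that d is not relevant (written here with the complement of R), would be:

```latex
\begin{align*}
\mathrm{Score}(d)
  &= \frac{P(R \mid d)}{P(\bar{R} \mid d)}
   = \frac{P(d \mid R)\, P(R) / P(d)}{P(d \mid \bar{R})\, P(\bar{R}) / P(d)} \\
  &= \frac{P(d \mid R)}{P(d \mid \bar{R})} \cdot \frac{P(R)}{P(\bar{R})}
   \;\propto\; \frac{P(d \mid R)}{P(d \mid \bar{R})}
\end{align*}
```

The factor P(R) / P(\bar{R}) is the same for every document, so dropping it does not change the ordering.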

6
Observations
  • The result forms the scoring function
  • The equation still contains R, which we do not
    know.
  • Using this equation as the scoring function
    leaves the ordering unchanged

7
Derivation for Keyword Queries
  • Now assume that a query is a vector of words,
    with zero probability assigned to a word that
    does not occur.
  • Then, applying the previous equation to each word
    w (instead of to a document) and combining all
    the words of the query gives
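
The resulting formula is not in the transcript. One plausible reconstruction is given below. It rests on two assumptions that the slides do not state explicitly: the query words occur independently of one another, and the non-relevant set is approximated by the whole database D (the later slides compare words by P(w | D), which suggests this).

```latex
\[
\mathrm{Score}(d) \;\propto\; \prod_{w \,\in\, Q \cap d} \frac{P(w \mid R)}{P(w \mid D)}
\]
```

Here the product runs over the query words that actually occur in d; a rare word (small P(w | D)) contributes a large factor.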

8
Search for Microsoft Corporation
  • Thus the expression would be
  • Assume you had two documents
  • D1: contains Microsoft but not Corporation
  • D2: contains Corporation but not Microsoft
  • Thus
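
The expressions on this slide were lost. Assuming the score of a document is the product, over the query words it contains, of P(w | R) / P(w | D) (a reconstruction, not a formula confirmed by the transcript), they would read:

```latex
\begin{align*}
% A document containing both query words
\mathrm{Score}(d) &\propto
  \frac{P(\text{Microsoft} \mid R)}{P(\text{Microsoft} \mid D)} \cdot
  \frac{P(\text{Corporation} \mid R)}{P(\text{Corporation} \mid D)} \\
% D_1 contains only Microsoft, D_2 only Corporation
\mathrm{Score}(D_1) &\propto \frac{P(\text{Microsoft} \mid R)}{P(\text{Microsoft} \mid D)}, \qquad
\mathrm{Score}(D_2) \propto \frac{P(\text{Corporation} \mid R)}{P(\text{Corporation} \mid D)}
\end{align*}
```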

9
Search for Microsoft Corporation
  • Because Corporation is more common in the
    database D, P(Corporation | D) will be far
    higher than P(Microsoft | D).
  • Thus Score(D1) will be higher than Score(D2).
  • Thus the document that contains Microsoft gets
    the higher ranking, since Microsoft is more
    specific than Corporation.
  • Similar to relevance ranking in the vector space
    model
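
The argument can be tried out on a toy database. The sketch below assumes the score is the product of P(w | R) / P(w | D) over the query words a document contains, and that both query words are equally likely in relevant documents (the value 0.9 is an arbitrary guess). The five documents are invented for illustration.

```python
def p_word_given_db(word, database):
    """P(w | D): fraction of documents in the database that contain the word."""
    return sum(word in doc for doc in database) / len(database)


def score(doc, query, database, p_word_given_rel):
    """Product over matching query words of P(w | R) / P(w | D)."""
    result = 1.0
    for word in query:
        if word in doc:
            result *= p_word_given_rel[word] / p_word_given_db(word, database)
    return result


# Hypothetical database: each document is a set of words.
D1 = {"microsoft", "windows"}       # Microsoft but not Corporation
D2 = {"corporation", "tax"}         # Corporation but not Microsoft
database = [D1, D2, {"corporation", "law"}, {"corporation", "bank"}, {"weather"}]

query = ["microsoft", "corporation"]
# Assumption: without knowing R, give both query words the same P(w | R).
p_word_given_rel = {"microsoft": 0.9, "corporation": 0.9}

score_d1 = score(D1, query, database, p_word_given_rel)
score_d2 = score(D2, query, database, p_word_given_rel)
print(score_d1, score_d2)  # roughly 4.5 and 1.5
```

With equal numerators, the only difference is the denominator: Microsoft appears in one of five documents and Corporation in three of five, so D1 scores higher.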

10
Relevance Feedback
  • Can keep fine-tuning R by getting user feedback
    on initial rankings.
  • Once a better R is known, better scoring and
    ranking of matches is possible.
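
A minimal sketch of one feedback round follows. It assumes the score is the product of P(w | R) / P(w | D) over matching query words, and it re-estimates P(w | R) from the documents the user marked as relevant. The smoothing constants, the toy documents, and the function names are all assumptions made for this illustration.

```python
def p_word_given_db(word, database):
    """P(w | D): fraction of documents in the database that contain the word."""
    return sum(word in doc for doc in database.values()) / len(database)


def p_word_given_rel(word, relevant_docs):
    """P(w | R) estimated from user-approved documents.

    The +0.5 / +1.0 smoothing is an assumption of this sketch; it keeps the
    estimate away from 0 and 1 when only a few documents have been judged.
    """
    hits = sum(word in doc for doc in relevant_docs)
    return (hits + 0.5) / (len(relevant_docs) + 1.0)


def score(doc, query, database, p_rel):
    """Product over matching query words of P(w | R) / P(w | D)."""
    result = 1.0
    for word in query:
        if word in doc:
            result *= p_rel[word] / p_word_given_db(word, database)
    return result


# Hypothetical database of five documents.
database = {
    "d1": {"jaguar", "car"},
    "d2": {"speed", "limit"},
    "d3": {"jaguar", "speed", "car"},
    "d4": {"jaguar", "zoo"},
    "d5": {"weather"},
}
query = ["jaguar", "speed"]

# Before feedback: R is unknown, so use the same guess for every query word.
p_rel_before = {w: 0.5 for w in query}
before = {name: score(doc, query, database, p_rel_before)
          for name, doc in database.items()}

# The user marks d1 and d3 as relevant; re-estimate P(w | R) from them.
judged_relevant = [database["d1"], database["d3"]]
p_rel_after = {w: p_word_given_rel(w, judged_relevant) for w in query}
after = {name: score(doc, query, database, p_rel_after)
         for name, doc in database.items()}

print(before["d1"], before["d2"])
print(after["d1"], after["d2"])
```

In this toy run d2 outranks d1 before feedback, and the order flips afterwards because the judged documents make "jaguar" a stronger signal of relevance than "speed".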

11
PIR Applied to Databases
  • Originally PIR was applied to documents and not
    to databases
  • Applying PIR to databases is not easy as it is
    difficult to capture various aspects
  • These include:
  • Different values of an attribute
  • PIR is based on the words in a document; in a
    database, the fact that a car is blue, black, etc.
    is not easily captured
  • Would you assign each color as a keyword?
  • What to sacrifice in ranking is also not easy to
    capture: if a user's preference is black cars,
    how is PIR applied when listing results that do
    not match entirely?
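
To make the difficulty concrete, the sketch below tries the naive option the slide asks about: every attribute=value pair is treated as a keyword. This is only an illustration of the problem, not a method proposed in the lecture; the car table and the helper name are hypothetical.

```python
def tuple_to_tokens(row):
    """Turn {'color': 'black'} into {'color=black'} so each value acts as a keyword."""
    return {f"{attr}={value}" for attr, value in row.items()}


# Hypothetical car table and a query for black Hondas.
cars = [
    {"make": "honda", "color": "black"},   # full match
    {"make": "honda", "color": "blue"},    # wrong color
    {"make": "ford", "color": "black"},    # wrong make
]
query = {"make": "honda", "color": "black"}

query_tokens = tuple_to_tokens(query)
# Number of query "keywords" each tuple matches.
matches = [len(query_tokens & tuple_to_tokens(row)) for row in cars]
print(matches)
```

The second and third tuples match the same number of keywords, so counting matches alone cannot say whether to sacrifice the make or the color, which is the question the slide leaves open.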