Probabilistic Model - PowerPoint PPT Presentation

Provided by: bert9
1
Probabilistic Model
  • Objective: to capture the IR problem using a
    probabilistic framework
  • Given a user query, there is an ideal answer set
  • Querying as specification of the properties of
    this ideal answer set (clustering)
  • But, what are these properties?
  • Guess at the beginning what they could be (i.e.,
    guess initial description of ideal answer set)
  • Improve by iteration

2
Probabilistic Model
  • An initial set of documents is retrieved somehow
  • User inspects these docs looking for the relevant
    ones (in truth, only top 10-20 need to be
    inspected)
  • IR system uses this information to refine
    description of ideal answer set
  • By repeating this process, it is expected that the
    description of the ideal answer set will improve
  • Always keep in mind the need to guess, at the very
    beginning, the description of the ideal answer set
  • Description of ideal answer set is modeled in
    probabilistic terms

3
Probabilistic Ranking Principle
  • Given a user query q and a document dj, the
    probabilistic model tries to estimate the
    probability that the user will find the document
    dj interesting (i.e., relevant). The model
    assumes that this probability of relevance
    depends on the query and the document
    representations only. Ideal answer set is
    referred to as R and should maximize the
    probability of relevance. Documents in the set R
    are predicted to be relevant.
  • But,
  • how to compute probabilities?
  • what is the sample space?

4
The Ranking
  • Probabilistic ranking computed as
  • $sim(q, d_j) = P(d_j \text{ relevant to } q) \,/\, P(d_j \text{ non-relevant to } q)$
  • This is the odds of the document dj being
    relevant
  • Taking the odds minimizes the probability of an
    erroneous judgement
  • Definition
  • $w_{ij} \in \{0, 1\}$
  • $P(R \mid \vec{d}_j)$: probability that the given doc is
    relevant
  • $P(\neg R \mid \vec{d}_j)$: probability that the doc is not relevant

5
The Ranking
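  • $sim(d_j, q) = \frac{P(R \mid \vec{d}_j)}{P(\neg R \mid \vec{d}_j)}
    = \frac{P(\vec{d}_j \mid R)\, P(R)}{P(\vec{d}_j \mid \neg R)\, P(\neg R)}
    \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \neg R)}$
  • $P(\vec{d}_j \mid R)$: probability of randomly
    selecting the document dj from the set R of
    relevant documents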
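
The step from the posterior odds to the likelihood ratio can be spelled out with Bayes' rule. The sketch below only restates the slide's symbols; nothing new is assumed beyond $P(\vec{d}_j) > 0$.

```latex
\begin{align*}
\frac{P(R \mid \vec{d}_j)}{P(\neg R \mid \vec{d}_j)}
  &= \frac{P(\vec{d}_j \mid R)\, P(R) \,/\, P(\vec{d}_j)}
          {P(\vec{d}_j \mid \neg R)\, P(\neg R) \,/\, P(\vec{d}_j)} \\
  &= \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \neg R)}
     \cdot \frac{P(R)}{P(\neg R)}
\end{align*}
```

The factor $P(R) / P(\neg R)$ is the same for every document in the collection, so dropping it does not change the order of the ranking.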

6
The Ranking
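  • $sim(d_j, q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \neg R)}
    \sim \frac{\left[\prod_{i:\, w_{ij}=1} P(k_i \mid R)\right] \left[\prod_{i:\, w_{ij}=0} P(\neg k_i \mid R)\right]}
    {\left[\prod_{i:\, w_{ij}=1} P(k_i \mid \neg R)\right] \left[\prod_{i:\, w_{ij}=0} P(\neg k_i \mid \neg R)\right]}$
  • $P(k_i \mid R)$: probability that the index term ki is
    present in a document randomly selected from the
    set R of relevant documents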

7
The Ranking
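  • $sim(d_j, q) \sim \log \frac{\left[\prod_{i:\, w_{ij}=1} P(k_i \mid R)\right] \left[\prod_{i:\, w_{ij}=0} P(\neg k_i \mid R)\right]}
    {\left[\prod_{i:\, w_{ij}=1} P(k_i \mid \neg R)\right] \left[\prod_{i:\, w_{ij}=0} P(\neg k_i \mid \neg R)\right]}$
  • $\sim \sum_i w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{P(\neg k_i \mid R)}
    - \log \frac{P(k_i \mid \neg R)}{P(\neg k_i \mid \neg R)} \right)$
  • where $P(\neg k_i \mid R) = 1 - P(k_i \mid R)$ and
    $P(\neg k_i \mid \neg R) = 1 - P(k_i \mid \neg R)$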
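
The final sum can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the names `term_weight` and `sim` and the dictionary-based inputs are assumptions made here, documents and queries are treated as sets of terms (binary weights), and every probability is assumed to lie strictly between 0 and 1.

```python
import math


def term_weight(p_rel: float, p_nonrel: float) -> float:
    """Log-odds weight of one term: log(p / (1 - p)) - log(q / (1 - q)).

    p_rel    plays the role of P(ki | R)
    p_nonrel plays the role of P(ki | not R)
    Both must be strictly between 0 and 1, otherwise the log is undefined.
    """
    return math.log(p_rel / (1.0 - p_rel)) - math.log(p_nonrel / (1.0 - p_nonrel))


def sim(doc_terms: set, query_terms: set, p_rel: dict, p_nonrel: dict) -> float:
    """Sum the term weights over terms present in both query and document.

    With binary weights, wiq * wij is 1 exactly for those shared terms.
    """
    return sum(term_weight(p_rel[k], p_nonrel[k]) for k in query_terms & doc_terms)


# Hypothetical estimates for two query terms.
p_rel = {"a": 0.5, "c": 0.5}
p_nonrel = {"a": 0.2, "c": 0.3}
score = sim({"a", "b"}, {"a", "c"}, p_rel, p_nonrel)
```

Only the term "a" is shared by the document and the query here, so the score is the weight of that single term.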

8
The Initial Ranking
  • Probabilities $P(k_i \mid R)$ and $P(k_i \mid \neg R)$?
  • Estimates based on assumptions
  • $P(k_i \mid R) = 0.5$
  • $P(k_i \mid \neg R) = n_i / N$, where $n_i$ is
    the number of docs that contain ki
  • Use this initial guess to retrieve an initial
    ranking
  • Improve upon this initial ranking
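
A minimal sketch of these initial guesses, assuming the document frequencies are available as a dictionary (`doc_freq`) and the collection size as an integer (`n_docs`); both names, and the helper functions, are hypothetical. With $P(k_i \mid R) = 0.5$ the first log term of the ranking formula is $\log 1 = 0$, so each term weight reduces to $\log((N - n_i) / n_i)$.

```python
import math


def initial_estimates(doc_freq: dict, n_docs: int):
    """Initial guesses: P(ki | R) = 0.5 and P(ki | not R) = ni / N."""
    p_rel = {k: 0.5 for k in doc_freq}
    p_nonrel = {k: n_i / n_docs for k, n_i in doc_freq.items()}
    return p_rel, p_nonrel


def initial_weight(n_i: int, n_docs: int) -> float:
    """Term weight under the initial guesses: log((N - ni) / ni).

    Assumes 0 < ni < N; a term found in no document or in every
    document would make the log undefined.
    """
    return math.log((n_docs - n_i) / n_i)


# Hypothetical collection of 10 docs: "a" occurs in 2, "b" in 5.
p_rel, p_nonrel = initial_estimates({"a": 2, "b": 5}, 10)
w_a = initial_weight(2, 10)
w_b = initial_weight(5, 10)
```

Under these guesses a rarer term gets a larger weight, and a term found in exactly half of the collection gets a weight of zero.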

9
Improving the Initial Ranking
  • Let
  • V: set of docs initially retrieved
  • Vi: subset of docs retrieved that contain ki
  • Reevaluate estimates (V and Vi below stand for the
    sizes of these sets)
  • $P(k_i \mid R) = V_i / V$
  • $P(k_i \mid \neg R) = (n_i - V_i) / (N - V)$
  • Repeat recursively
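
One re-estimation step might look like the sketch below. The function name `reestimate` and the data layout (a dict from doc id to term set, a list of retrieved ids standing in for V) are assumptions for illustration. The retrieved set is taken as a stand-in for the relevant set, as the slide describes, and is assumed to be non-empty and smaller than the collection.

```python
def reestimate(docs: dict, retrieved: list, doc_freq: dict):
    """Re-estimate P(ki | R) and P(ki | not R) from the retrieved set V.

    docs      maps doc id -> set of index terms
    retrieved lists the ids of the docs in V
    doc_freq  maps term ki -> ni, the number of docs containing ki
    """
    n_docs = len(docs)
    v = len(retrieved)
    p_rel, p_nonrel = {}, {}
    for k, n_i in doc_freq.items():
        # Vi: how many retrieved docs contain the term
        v_i = sum(1 for d in retrieved if k in docs[d])
        p_rel[k] = v_i / v
        p_nonrel[k] = (n_i - v_i) / (n_docs - v)
    return p_rel, p_nonrel


# Hypothetical four-document collection; docs 1 and 2 were retrieved.
docs = {1: {"a", "b"}, 2: {"a"}, 3: {"b"}, 4: {"c"}}
doc_freq = {"a": 2, "b": 2, "c": 1}
p_rel, p_nonrel = reestimate(docs, [1, 2], doc_freq)
```

In this toy run the estimates for "a" come out as exactly 1 and 0, which would make the log-odds weight undefined; small retrieved sets produce such extreme values easily.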

10
Improving the Initial Ranking
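  • To avoid problems with $V = 1$ and $V_i = 0$
  • $P(k_i \mid R) = (V_i + 0.5) / (V + 1)$
  • $P(k_i \mid \neg R) = (n_i - V_i + 0.5) / (N - V + 1)$
  • Also,
  • $P(k_i \mid R) = (V_i + n_i/N) / (V + 1)$
  • $P(k_i \mid \neg R) = (n_i - V_i + n_i/N) / (N - V + 1)$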
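
Both adjusted variants can be sketched in a single function. The name `smoothed_estimates` and the `use_prior` flag are hypothetical; the flag merely switches the adjustment factor between the constant 0.5 and the fraction ni/N.

```python
def smoothed_estimates(v: int, v_i: int, n_i: int, n_docs: int, use_prior: bool = False):
    """Adjusted estimates of P(ki | R) and P(ki | not R).

    v      size of the retrieved set V
    v_i    number of retrieved docs that contain ki
    n_i    number of docs in the collection that contain ki
    n_docs collection size N
    """
    # Adjustment factor: 0.5 by default, ni / N when use_prior is set.
    adjust = n_i / n_docs if use_prior else 0.5
    p_rel = (v_i + adjust) / (v + 1)
    p_nonrel = (n_i - v_i + adjust) / (n_docs - v + 1)
    return p_rel, p_nonrel


# Hypothetical case where the term appears in no retrieved doc (Vi = 0).
p_rel, p_nonrel = smoothed_estimates(v=2, v_i=0, n_i=1, n_docs=4)
p_rel_prior, p_nonrel_prior = smoothed_estimates(v=2, v_i=0, n_i=1, n_docs=4, use_prior=True)
```

With the adjustment, a term that occurs in none of the retrieved docs still gets a small non-zero estimate for P(ki | R), so the log-odds weight stays finite.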

11
Pluses and Minuses
  • Advantages
  • Docs ranked in decreasing order of probability of
    relevance
  • Disadvantages
  • need to guess initial estimates for $P(k_i \mid R)$
  • method does not take into account tf and idf
    factors

12
Brief Comparison of Classic Models
  • Boolean model does not provide for partial
    matches and is considered to be the weakest
    classic model
  • Salton and Buckley did a series of experiments
    that indicate that, in general, the vector model
    outperforms the probabilistic model with general
    collections
  • This seems also to be the view of the research
    community