Probabilistic Information Retrieval Models - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Probabilistic Information Retrieval Models


1
Probabilistic Information Retrieval Models
Systems
  • Ahmet Selman Bozkir

2
Outline
  • Introduction to conditional probability, total
    probability, and Bayes' theorem
  • Historical background of probabilistic
    information retrieval
  • Why probabilities in IR?
  • Document ranking problem
  • Binary Independence Model

3
Conditional Probability
  • Given some event B with nonzero probability P(B) > 0,
    we can define the conditional probability of an event A,
    given B, by
    P(A|B) = P(A ∩ B) / P(B)
  • The probability P(A|B) simply reflects the fact
    that the probability of an event A may depend on
    a second event B. So if A and B are mutually
    exclusive, A ∩ B = ∅ and P(A|B) = 0.

4
Conditional Probability
  • Example: 100 resistors classified by resistance and tolerance

Resistance (Ω)    5%    10%    Total
22 Ω              10    14      24
47 Ω              28    16      44
100 Ω             24     8      32
Total             62    38     100

  • Let's define three events: 1. A as "draw a 47 Ω
    resistor", 2. B as "draw a resistor with 5% tolerance",
    3. C as "draw a 100 Ω resistor"
  • P(A) = P(47 Ω) = 44/100, P(B) = P(5%) = 62/100,
    P(C) = P(100 Ω) = 32/100
  • The joint probabilities are P(A ∩ B) = P(47 Ω ∩ 5%) = 28/100,
    P(A ∩ C) = P(47 Ω ∩ 100 Ω) = 0, P(B ∩ C) = P(5% ∩ 100 Ω) = 24/100
  • Using them, the conditional probabilities are
    P(A|B) = P(A ∩ B) / P(B) = 28/62 and
    P(B|A) = P(A ∩ B) / P(A) = 28/44 (a short code check follows)
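The same arithmetic can be checked in a few lines of code. This is a
minimal sketch (not part of the original slides) that recomputes
P(A|B) and P(B|A) from the count table above:

    # Minimal sketch (illustrative): conditional probabilities from the count table.
    counts = {                # (resistance in ohms, tolerance in %) -> number of resistors
        (22, 5): 10, (22, 10): 14,
        (47, 5): 28, (47, 10): 16,
        (100, 5): 24, (100, 10): 8,
    }
    total = sum(counts.values())                                      # 100 resistors

    p_A = sum(c for (r, t), c in counts.items() if r == 47) / total   # P(A) = 44/100
    p_B = sum(c for (r, t), c in counts.items() if t == 5) / total    # P(B) = 62/100
    p_A_and_B = counts[(47, 5)] / total                               # P(A ∩ B) = 28/100

    print(p_A_and_B / p_B)   # P(A|B) = 28/62 ≈ 0.45
    print(p_A_and_B / p_A)   # P(B|A) = 28/44 ≈ 0.64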
5
Total Probability
  • The probability P(A) of any event A defined on
    a sample space S can be expressed in terms of
    conditional probabilities. Suppose we are given N
    mutually exclusive events Bn, n = 1, 2, ..., N, whose
    union equals S, as illustrated in the figure. Then
    P(A) = Σn P(A|Bn) P(Bn)

(Figure: event A overlapping the mutually exclusive events
B1, B2, B3, ..., Bn that partition S; the pieces are A ∩ Bn)
6
Bayes Theorem
  • The definition of conditional probability applies
    to any two events. In particular, let Bn be one
    of the events defined above in the subsection on
    total probability. If P(A) ≠ 0,
    P(Bn|A) = P(Bn ∩ A) / P(A)
    or, alternatively,
    P(A|Bn) = P(A ∩ Bn) / P(Bn)

7
Bayes Theorem (cont.)
  • If P(Bn) ≠ 0, one form of Bayes' theorem is
    obtained by equating these two expressions:
    P(Bn|A) = P(A|Bn) P(Bn) / P(A)
  • Another form derives from a substitution of P(A)
    as given by the total probability theorem:
    P(Bn|A) = P(A|Bn) P(Bn) / [ Σk P(A|Bk) P(Bk) ]
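As a quick numerical check (not part of the original slides), this
form of Bayes' theorem applied to the resistor example of slide 4
recovers the conditional probability computed there directly:

    % Illustrative check using the resistor counts from slide 4
    P(B \mid A) = \frac{P(A \mid B)\, P(B)}{P(A)}
                = \frac{(28/62)\,(62/100)}{44/100}
                = \frac{28}{44} \approx 0.64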

8
Historical Background of PIR
  • The first attempts to develop a probabilistic
    theory of retrieval were made over 30 years ago
    [Maron and Kuhns 1960; Miller 1971], and since
    then there has been a steady development of the
    approach. There are already several operational
    IR systems based upon probabilistic or
    semiprobabilistic models.
  •  
  • One major obstacle in probabilistic or
    semiprobabilistic IR models is finding methods
    for estimating the probabilities used to evaluate
    the probability of relevance that are both
    theoretically sound and computationally
    efficient.
  •  
  • The first models to be based upon such
    assumptions were the binary independence
    indexing model and the binary independence
    retrieval model
  •  
  • One area of recent research investigates the use
    of an explicit network representation of
    dependencies. The networks are processed by means
    of Bayesian inference or belief theory, using
    evidential reasoning techniques such as those
    described by [Pearl 1988]. This approach is an
    extension of the earliest probabilistic models,
    taking into account the conditional dependencies
    present in a real environment.

9
Why probabilities in IR?
10
Probabilistic IR topics
  • Classical probabilistic retrieval model
  • Probability ranking principle, etc.
  • (Naïve) Bayesian Text Categorization
  • Bayesian networks for text retrieval
  • Probabilistic methods are one of the oldest but
    also one of the currently hottest topics in IR.
  • Traditionally neat ideas, but they've never won
    on performance. It may be different now.

11
Introduction
  • In probabilistic information retrieval, the goal
    is the estimation of the probability of relevance
    P(R | qk, dm) that a document dm will be judged
    relevant by a user with request qk. In order to
    estimate this probability, a large number of
    probabilistic models have been developed.
  • Typically, such a model is based on
    representations of queries and documents (e.g.,
    as sets of terms); in addition to this,
    probabilistic assumptions about the distribution
    of elements of these representations within
    relevant and nonrelevant documents are required.
  • By collecting relevance feedback data from a few
    documents, the model then can be applied in order
    to estimate the probability of relevance for the
    remaining documents in the collection.

12
The document ranking problem
  • We have a collection of documents
  • User issues a query
  • A list of documents needs to be returned
  • Ranking method is core of an IR system
  • In what order do we present documents to the
    user?
  • We want the best document to be first, second
    best second, etc.
  • Idea: Rank by probability of relevance of the
    document w.r.t. the information need:
    P(relevant | documenti, query)

13
Recall a few probability basics
  • For events a and b:
    p(a, b) = p(a ∩ b) = p(a|b) p(b) = p(b|a) p(a)
  • Bayes' Rule:
    p(a|b) = p(b|a) p(a) / p(b)
    (the posterior p(a|b) in terms of the prior p(a))
  • Odds:
    O(a) = p(a) / p(¬a) = p(a) / (1 − p(a))
14
Probability Ranking Principle
Let x be a document in the collection. Let R
represent relevance of a document w.r.t. a given
(fixed) query and let NR represent non-relevance.
(Equivalently, R ∈ {0, 1}: R = 1 for relevant, R = 0 for non-relevant.)
We need to find p(R|x) - the probability that a document
x is relevant.
p(R), p(NR) - prior probability of retrieving a
(non-)relevant document
p(x|R), p(x|NR) - probability that if a relevant
(non-relevant) document is retrieved, it is x.
15
Probability Ranking Principle
  • Bayes' Optimal Decision Rule
  • x is relevant iff p(R|x) > p(NR|x)
  • PRP in action: Rank all documents by p(R|x)

16
Probability Ranking Principle
  • More complex case: retrieval costs.
  • Let d be a document
  • C  - cost of retrieval of a relevant document
  • C' - cost of retrieval of a non-relevant document
  • Probability Ranking Principle: if
    C · p(R|d) + C' · (1 − p(R|d)) ≤ C · p(R|d') + C' · (1 − p(R|d'))
    for all d' not yet retrieved, then d is the next
    document to be retrieved
  • We won't further consider loss/utility from now on

17
Probability Ranking Principle
  • How do we compute all those probabilities?
  • Do not know exact probabilities, have to use
    estimates
  • Binary Independence Retrieval (BIR) which we
    discuss later today is the simplest model
  • Questionable assumptions
  • Relevance of each document is independent of
    relevance of other documents.
  • Really, it's bad to keep on returning duplicates
  • Boolean model of relevance

18
Probabilistic Retrieval Strategy
  • Estimate how terms contribute to relevance
  • How do things like tf, df, and document length
    influence your judgments about document relevance?
  • One answer is the Okapi formulae (S. Robertson);
    see the sketch after this list
  • Combine to find document relevance probability
  • Order documents by decreasing probability
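The Okapi weighting is usually written as the BM25 formula. The
following is a minimal illustrative sketch (not code from the slides;
k1 and b are the customary tuning parameters, and all names here are
hypothetical):

    # Minimal illustrative sketch of Okapi BM25 scoring (not from the slides).
    import math
    from collections import Counter

    def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
                   k1=1.5, b=0.75):
        """Score one document (a list of terms) against a list of query terms."""
        tf = Counter(doc_terms)
        doc_len = len(doc_terms)
        score = 0.0
        for term in query_terms:
            df = doc_freq.get(term, 0)
            if df == 0 or term not in tf:
                continue                                   # term contributes nothing
            idf = math.log(1 + (num_docs - df + 0.5) / (df + 0.5))
            tf_norm = tf[term] * (k1 + 1) / (
                tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
            score += idf * tf_norm                         # combine per-term evidence
        return score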

19
Probabilistic Ranking
  • Basic concept
  • "For a given query, if we know some documents
    that are relevant, terms that occur in those
    documents should be given greater weighting in
    searching for other relevant documents.
  • By making assumptions about the distribution of
    terms and applying Bayes Theorem, it is possible
    to derive weights theoretically."
  • Van Rijsbergen

20
Binary Independence Model
  • Traditionally used in conjunction with PRP
  • "Binary" = Boolean: documents are represented as
    binary incidence vectors of terms (cf. lecture 1)
  • xi = 1 iff term i is present in document x,
    and xi = 0 otherwise (see the toy sketch after this list)
  • "Independence": terms occur in documents
    independently
  • Different documents can be modeled as the same vector
  • Bernoulli Naive Bayes model (cf. text
    categorization!)
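As a small illustration (toy documents, not from the slides), mapping
documents to binary term-incidence vectors:

    # Toy sketch: binary term-incidence vectors (documents are made up).
    docs = ["the quick brown fox", "the lazy dog", "quick quick dog"]
    vocab = sorted({t for d in docs for t in d.split()})

    def incidence_vector(doc):
        terms = set(doc.split())
        return [1 if t in terms else 0 for t in vocab]   # xi = 1 iff term i occurs in doc

    for d in docs:
        print(incidence_vector(d), d)
    # Documents with the same term set map to the same vector,
    # as the slide points out.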

21
Binary Independence Model
  • Queries: binary term incidence vectors
  • Given query q,
  • for each document d we need to compute p(R|q,d)
  • replace with computing p(R|q,x) where x is the binary
    term incidence vector representing d (we are interested
    only in ranking)
  • Will use odds and Bayes' Rule:

22
Binary Independence Model
  O(R|q,x) = p(R|q,x) / p(NR|q,x)
           = [p(R|q) / p(NR|q)] · [p(x|R,q) / p(x|NR,q)]
  • The first factor, p(R|q) / p(NR|q), is constant for a given query
  • The second factor needs estimation
23
Binary Independence Model
  • Using the independence assumption:
    p(x|R,q) / p(x|NR,q) = Πi [ p(xi|R,q) / p(xi|NR,q) ]
  • Since xi is either 0 or 1:
    O(R|q,x) = O(R|q) · Π(xi=1) [ p(xi=1|R,q) / p(xi=1|NR,q) ]
                      · Π(xi=0) [ p(xi=0|R,q) / p(xi=0|NR,q) ]
  • Let pi = p(xi=1|R,q) and ri = p(xi=1|NR,q)
  • This can be changed (e.g., in relevance feedback)
  • Then...
24
Binary Independence Model
25
Binary Independence Model
26
Binary Independence Model
  • Estimating RSV coefficients.
  • For each term i, look at this table of document counts
    (N documents in total, S of them relevant; n containing
    term i, s of those relevant):

                 Relevant    Non-relevant     Total
    xi = 1       s           n − s            n
    xi = 0       S − s       N − n − S + s    N − n
    Total        S           N − S            N

  • Estimates: pi ≈ s/S, ri ≈ (n − s)/(N − S), and
    ci = K(N, n, S, s) = log { [s/(S − s)] / [(n − s)/(N − n − S + s)] }

27
Estimation key challenge
  • If non-relevant documents are approximated by the
    whole collection, then ri (prob. of occurrence in
    non-relevant documents for query) is n/N and
  • log [(1 − ri)/ri] ≈ log [(N − n)/n] ≈ log (N/n) = IDF!
  • pi (probability of occurrence in relevant
    documents) can be estimated in various ways:
  • from relevant documents, if we know some
  • Relevance weighting can be used in a feedback loop
  • constant (Croft and Harper combination match):
    then we just get idf weighting of terms
  • proportional to prob. of occurrence in the collection
  • more accurately, to the log of this (Greiff, SIGIR
    1998)
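A minimal sketch (illustrative, using the customary add-0.5 smoothing;
not code from the slides) of how the ci term weights can be computed
from document counts:

    import math

    def term_weight(N, n, S, s):
        """BIM term weight c_i with add-0.5 smoothing.
        N: total docs, n: docs containing the term,
        S: known relevant docs, s: relevant docs containing the term."""
        p = (s + 0.5) / (S + 1.0)              # estimate of p_i
        r = (n - s + 0.5) / (N - S + 1.0)      # estimate of r_i
        return math.log(p * (1 - r) / (r * (1 - p)))

    # With no relevance information (S = s = 0), the weight behaves roughly like idf:
    print(term_weight(N=100_000, n=50, S=0, s=0))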

28
Iteratively estimating pi
  1. Assume that pi is constant over all xi in the query:
     pi = 0.5 (even odds) for any given doc
  2. Determine a guess of the relevant document set:
     V is a fixed-size set of the highest-ranked documents
     on this model (note: now a bit like tf.idf!)
  3. We need to improve our guesses for pi and ri, so
     use the distribution of xi in the docs in V. Let Vi be
     the set of documents in V containing xi:
     pi = |Vi| / |V|
     Assume that if a document is not retrieved then it is
     not relevant:
     ri = (ni − |Vi|) / (N − |V|)
  4. Go to step 2 until the ranking converges, then return
     it (a rough code sketch follows)
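A rough sketch of this iteration (illustrative only; the document
representation, the size of V, and the smoothing constants are
assumptions, not taken from the slides):

    # Illustrative sketch of iterative p_i / r_i estimation (not from the slides).
    import math

    def rank_iteratively(docs, query_terms, v_size=20, iterations=5):
        """docs: list of sets of terms; query_terms: set of terms."""
        N = len(docs)
        n = {t: sum(t in d for d in docs) for t in query_terms}    # document frequencies
        p = {t: 0.5 for t in query_terms}                          # step 1: even odds
        r = {t: (n[t] + 0.5) / (N + 1.0) for t in query_terms}     # collection-based start

        ranking = list(range(N))
        for _ in range(iterations):
            def rsv(d):                                            # sum of c_i over matching query terms
                return sum(math.log(p[t] * (1 - r[t]) / (r[t] * (1 - p[t])))
                           for t in query_terms if t in d)
            ranking = sorted(range(N), key=lambda i: rsv(docs[i]), reverse=True)
            V = ranking[:v_size]                                   # step 2: guessed relevant set
            for t in query_terms:                                  # step 3: re-estimate
                Vi = sum(t in docs[i] for i in V)
                p[t] = (Vi + 0.5) / (len(V) + 1.0)                 # smoothed |Vi| / |V|
                r[t] = (n[t] - Vi + 0.5) / (N - len(V) + 1.0)      # smoothed (ni - |Vi|) / (N - |V|)
        return ranking                                             # step 4: return the ranking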

29
Probabilistic Relevance Feedback
  • Guess a preliminary probabilistic description of
    R and use it to retrieve a first set of documents
    V, as above.
  • Interact with the user to refine the description:
    learn some definite members of R and NR
  • Re-estimate pi and ri on the basis of these
  • Or can combine the new information with the original
    guess (use a Bayesian prior):
    pi(2) = (|Vi| + κ · pi(1)) / (|V| + κ),  where κ is the prior weight
  • Repeat, thus generating a succession of
    approximations to R.
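A one-line sketch of that prior-weighted update (illustrative; κ is a
tuning constant chosen by the system designer):

    def update_p(Vi, V_size, p_prev, kappa=5.0):
        # Combine feedback counts with the previous estimate using a Bayesian prior.
        return (Vi + kappa * p_prev) / (V_size + kappa)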
30
PRP and BIR
  • Getting reasonable approximations of
    probabilities is possible.
  • Requires restrictive assumptions
  • term independence
  • terms not in the query don't affect the outcome
  • boolean representation of documents/queries/relevance
  • document relevance values are independent
  • Some of these assumptions can be removed
  • Problem: either requires partial relevance
    information or can only derive somewhat inferior
    term weights

31
Removing term independence
  • In general, index terms aren't independent
  • Dependencies can be complex
  • van Rijsbergen (1979) proposed a model of simple
    tree dependencies
  • Exactly Friedman and Goldszmidt's Tree Augmented
    Naive Bayes (AAAI 13, 1996)
  • Each term is dependent on one other term
  • In the 1970s, estimation problems held back the
    success of this model

32
Bayesian Networks for Text Retrieval (Turtle and
Croft 1990)
  • What is a Bayesian network?
  • A directed acyclic graph
  • Nodes
  • Events or Variables
  • Assume values.
  • For our purposes, all Boolean
  • Links
  • model direct dependencies between nodes

33
Bayesian Networks
  • Bayesian networks model causal relations between
    events
  • Inference in Bayesian Nets:
  • Given probability distributions for the roots and
    conditional probabilities, we can compute the
    a priori probability of any instance
  • Fixing assumptions (e.g., b was observed) will
    cause recomputation of probabilities

(Figure: a small example network with nodes a, b, c)
For more information see: R. G. Cowell, A. P. Dawid, S. L. Lauritzen,
and D. J. Spiegelhalter. 1999. Probabilistic Networks and Expert
Systems. Springer Verlag. J. Pearl. 1988. Probabilistic Reasoning in
Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann.
34
Example
(Figure: a five-node Bayesian network with nodes Project Due (d),
Finals (f), No Sleep (n), Gloom (g), and Triple Latte (t); n depends
on f, g depends on f and d, and t depends on g.)
35
Independence Assumptions
  • Independence assumption:
    P(t | g, f) = P(t | g)
  • Joint probability:
    P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f, d) P(t|g)
    (a toy numeric sketch follows)

(Figure: the same network as on the previous slide)
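As a toy illustration (all numbers made up, not from the slides), the
joint probability is just a product of small conditional probability
tables:

    # Toy sketch of the factored joint probability (all numbers made up).
    P_f = {True: 0.3, False: 0.7}                       # P(finals)
    P_d = {True: 0.4, False: 0.6}                       # P(project due)
    P_n_given_f = {True: 0.8, False: 0.2}               # P(no sleep = True | f)
    P_g_given_fd = {(True, True): 0.99, (True, False): 0.9,
                    (False, True): 0.7, (False, False): 0.1}   # P(gloom = True | f, d)
    P_t_given_g = {True: 0.7, False: 0.2}               # P(triple latte = True | g)

    def joint(f, d, n, g, t):
        """P(f, d, n, g, t) = P(f) P(d) P(n|f) P(g|f,d) P(t|g)."""
        pf = P_f[f]
        pd = P_d[d]
        pn = P_n_given_f[f] if n else 1 - P_n_given_f[f]
        pg = P_g_given_fd[(f, d)] if g else 1 - P_g_given_fd[(f, d)]
        pt = P_t_given_g[g] if t else 1 - P_t_given_g[g]
        return pf * pd * pn * pg * pt

    print(joint(True, True, True, True, True))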
36
Model for Text Retrieval
  • Goal
  • Given a user's information need (evidence), find the
    probability that a doc satisfies the need
  • Retrieval model
  • Model docs in a document network
  • Model information need in a query network

37
Bayesian Nets for IR Idea
(Figure: a document network feeding a query network; I is the goal
node representing the user's information need.)
38
Bayesian Nets for IR
  • Construct Document Network (once !)
  • For each query
  • Construct best Query Network
  • Attach it to Document Network
  • Find the subset of di's which maximizes the
    probability value of node I (best subset).
  • Retrieve these di's as the answer to the query.

39
Bayesian nets for text retrieval
(Figure: the layered retrieval network.
Document Network: document nodes d1, d2 with links to
representation/term nodes r1, r2, r3.
Query Network: concept nodes c1, c2, c3 feed query-operator nodes
q1, q2 (AND/OR/NOT), which feed the information-need node i.)
40
Link matrices and probabilities
  • Prior doc probability: P(d) = 1/n
  • P(r|d)
  • within-document term frequency
  • tf × idf based
  • P(c|r)
  • 1-to-1
  • thesaurus
  • P(q|c): canonical forms of query operators
  • Always use things like AND and NOT; never store
    a full CPT (conditional probability table)
    (see the sketch below)
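A minimal sketch (an assumption about how such canonical operator
nodes can be evaluated, not code from Turtle and Croft) of why no full
CPT is needed for AND/OR/NOT: each node's probability is a closed-form
function of its parents' probabilities.

    # Canonical query-operator nodes evaluated in closed form (illustrative sketch).
    def p_and(parent_probs):
        prob = 1.0
        for p in parent_probs:
            prob *= p                    # true only if every parent is true
        return prob

    def p_or(parent_probs):
        none_true = 1.0
        for p in parent_probs:
            none_true *= (1.0 - p)       # probability that no parent is true
        return 1.0 - none_true

    def p_not(parent_prob):
        return 1.0 - parent_prob

    # Example: P(i) for the query (c1 AND c2) OR (NOT c3)
    print(p_or([p_and([0.6, 0.5]), p_not(0.2)]))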

41
Example: "reason trouble two"

(Figure: Document Network with document nodes Hamlet and Macbeth
linked to term nodes reason, double, and trouble; Query Network with
concept nodes reason, two, and trouble combined through OR and NOT
operators into the user query "reason trouble two".)
42
Extensions
  • Prior probs don't have to be 1/n.
  • User information need doesn't have to be a
    query - it can be words typed, docs read, any
    combination
  • Phrases, inter-document links
  • Link matrices can be modified over time.
  • User feedback.
  • The promise of personalization

43
Computational details
  • Document network built at indexing time
  • Query network built/scored at query time
  • Representation
  • Link matrices from docs to any single term are
    like the postings entry for that term
  • Canonical link matrices are efficient to store
    and compute
  • Attach evidence only at roots of network
  • Can do single pass from roots to leaves

44
Resources
  • All sources served by Google!