1
Naïve Bayes
  • LING 572
  • Fei Xia
  • Week 2 1/9/06

2
Outline
  • Naïve Bayes in general
  • Naïve Bayes for TC

3
Questions
  • Why is it called Naïve Bayes?
  • What objective function does it optimize?
  • How many types of model parameters are there?
  • What happens at training time?
  • What happens at test time?
  • Any variations?

4
Modeling
  • Given x = (f1, ..., fd), find
  • c* = arg maxc P(c|x)
  •    = arg maxc P(c) P(x|c) / P(x)    (Bayes rule)
  •    = arg maxc P(c) P(x|c)
  • Independence assumption:
  • P(x|c) = P(f1, f2, ..., fd | c)
  •        = ∏k P(fk | c, f1, ..., fk-1)    (chain rule)
  •        ≈ ∏k P(fk | c)    (the "naïve" assumption; see the sketch below)
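A minimal Python sketch of this decision rule (not from the slides): the class priors and conditional probabilities are assumed to be already estimated and stored in plain dictionaries, and the names priors, cond, and nb_decide are hypothetical; sums of logs replace the product to avoid floating-point underflow.

import math

def nb_decide(priors, cond, features):
    # arg max_c [ log P(c) + sum_k log P(f_k | c) ]
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[c][f]) for f in features)
        if score > best_score:
            best_c, best_score = c, score
    return best_c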

5
Naïve Bayes Model
[Figure: a Bayes net with class node C and directed edges to feature nodes f1, f2, ..., fn]

Assumption: each fi is conditionally independent
of fj given C.
6
Model parameters
  • Choose
  • c* = arg maxc P(c) ∏k P(fk | c)
  • Two types of model parameters:
  • Class prior: P(c)
  • Conditional probability: P(fk | c)
  • The number of parameters: |C| + |C|·|V|
  • How many parameters are free?
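A quick worked count, under an assumption the slide leaves implicit (binary features): with |C| = 2 classes and |V| = 3 features there are 2 prior parameters and 2 × 3 = 6 conditional parameters, 8 in total; only |C| − 1 = 1 of the priors is free, since the priors must sum to one.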

7
Training: estimating parameters θ
  • Maximum likelihood (ML)
  • θ* = arg maxθ P(trainingData | θ)
  • P(fk | ci) = Cnt(fk, ci) / Cnt(ci)
  • P(ci) = Cnt(ci) / Σi Cnt(ci)    (see the sketch below)
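A counting sketch of these ML estimates (the (features, class) list format and the name train_mle are hypothetical; Cnt(ci) in the denominator of P(fk | ci) is read here as the total feature count observed with class ci):

from collections import Counter, defaultdict

def train_mle(labeled_data):
    # class_cnt[c] = Cnt(c);  feat_cnt[c][f] = Cnt(f, c)
    class_cnt, feat_cnt = Counter(), defaultdict(Counter)
    for features, c in labeled_data:
        class_cnt[c] += 1
        feat_cnt[c].update(features)
    total_docs = sum(class_cnt.values())
    priors = {c: n / total_docs for c, n in class_cnt.items()}
    cond = {c: {f: n / sum(cnts.values()) for f, n in cnts.items()}
            for c, cnts in feat_cnt.items()}
    return priors, cond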

8
Laplace Estimate/Correction/Smoothing
  • Pretend you saw each outcome one more time than you
    actually did.
  • Suppose X has K possible outcomes, and the counts
    for them are n1, ..., nK, which sum to N.
  • Without smoothing: P(X=i) = ni / N
  • With Laplace smoothing: P(X=i) = (ni + 1) / (N + K)
  • It can be derived from a Dirichlet prior as a MAP
    estimate.
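A one-function sketch of the same rule (the dict format is an assumption, not from the slide; outcomes with count 0 must be included so that K is the full number of outcomes):

def laplace_smooth(counts):
    # counts: outcome -> raw count, including outcomes with count 0
    K, N = len(counts), sum(counts.values())
    return {x: (n + 1) / (N + K) for x, n in counts.items()}

print(laplace_smooth({"a": 3, "b": 1, "c": 0}))  # {'a': 4/7, 'b': 2/7, 'c': 1/7}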

9
Classifying
  • MAP (maximum a posteriori) decision rule:
  • classify(x)
  •   = classify(f1, ..., fd)
  •   = arg maxc P(c|x)
  •   = arg maxc P(c) ∏k P(fk | c)
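A tiny worked instance of the rule, with made-up numbers (none of these values come from the slides): two classes with P(c1) = 0.6 and P(c2) = 0.4, and a document whose two features have P(f1 | c1) = 0.2, P(f2 | c1) = 0.5, P(f1 | c2) = 0.4, P(f2 | c2) = 0.3. The score for c1 is 0.6 × 0.2 × 0.5 = 0.06 and for c2 is 0.4 × 0.4 × 0.3 = 0.048, so the MAP rule picks c1.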

10
Naïve Bayes for TC
11
Features
  • Features: bag of words (word order information is
    lost; see the extraction sketch below)
  • Number of feature templates: 1
  • Number of features: |V|
  • Features: wt, t ∈ {1, 2, ..., |V|}
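A minimal bag-of-words extraction sketch (tokenization is not specified on the slide, so plain lowercase whitespace splitting is assumed):

from collections import Counter

def bag_of_words(text):
    # word order is discarded; only counts survive
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})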

12
Issues
  • Is wt a binary feature?
  • Are absent features used for calculating P(dj | ci)?

13
Two Naive Bayes Models (McCallum and Nigam, 1998)
  • Multi-variate Bernoulli event model
  • (a.k.a. binary independence model)
  • All features are binary; the number of times a
    feature occurs in an instance is ignored.
  • When calculating P(d | c), all features are used,
    including the absent features.
  • Multinomial event model: unigram LM

14
Bernoulli distribution
  • A Bernoulli distribution has exactly two mutually
    exclusive outcomes: P(X=1) = p and P(X=0) = 1 - p.
  • Bernoulli trial: a single experiment which can
    have one of two possible outcomes.
  • A Bernoulli process is a sequence of iid
    (independent, identically distributed) Bernoulli
    trials.

15
Multi-variate Bernoulli Model
  • A document is seen as a collection of |V|
    independent Bernoulli experiments, one for each
    word in the vocabulary: does this word appear in
    the document?
  • Let Bit = 1 if wt appears in di, and 0 otherwise.
  • Modeling (sketched in code below):
  • P(di | cj) = ∏t ( Bit P(wt | cj) + (1 - Bit)(1 - P(wt | cj)) )
  • Training:
  • P(ci) = DocNum(ci) / DocNum
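A sketch of the Bernoulli document likelihood above, in log space (the argument names are hypothetical; p_word_given_c[w] stands for P(wt | cj), and every vocabulary word contributes, whether present or absent):

import math

def bernoulli_log_likelihood(doc_words, vocab, p_word_given_c):
    # log P(d | c): sum over the whole vocabulary, not just the words in d
    present = set(doc_words)
    return sum(math.log(p_word_given_c[w] if w in present
                        else 1.0 - p_word_given_c[w])
               for w in vocab)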

16
Training (cont)
  • P(wt | cj)
  •   = (1 + DocNum(wt, cj)) / (2 + DocNum(cj))

where each document di counts toward cj with weight
P(cj | di) = 1 if di has the label cj, and 0 otherwise.
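A counting sketch of this smoothed estimate, assuming hard (0/1) labels as on the slide and a hypothetical (set_of_words, label) data format:

def bernoulli_word_probs(labeled_docs, vocab, c):
    # P(w_t | c) = (1 + DocNum(w_t, c)) / (2 + DocNum(c))
    docs_c = [words for words, label in labeled_docs if label == c]
    return {w: (1 + sum(1 for words in docs_c if w in words)) / (2 + len(docs_c))
            for w in vocab}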
17
Questions about Bernoulli event model?
18
Multinomial distribution
  • Possible outcomes: w1, w2, ..., w|V|
  • A trial for each word position:
  • P(CurWord = wi) = pi, with Σi pi = 1
  • Perform n independent trials, where n is the length
    of the document.
  • Let Xi be the number of times that the word wi is
    observed in the document.
  • P(X1 = x1, ..., X|V| = x|V|)
      = n! / (x1! ... x|V|!) · p1^x1 ... p|V|^x|V|
      = n! ∏k (pk^xk / xk!)
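A direct transcription of this pmf as a sketch (math.prod needs Python 3.8+); the call below previews the a/b/c example on the next two slides, with assumed uniform probabilities:

from math import factorial, prod

def multinomial_pmf(counts, probs):
    # P(X_1=x_1, ..., X_v=x_v) = n! * prod_k p_k^{x_k} / x_k!
    n = sum(counts)
    coef = factorial(n) / prod(factorial(x) for x in counts)
    return coef * prod(p ** x for p, x in zip(probs, counts))

print(multinomial_pmf([1, 1, 0], [1/3, 1/3, 1/3]))  # 2 * (1/3) * (1/3) = 2/9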

19
An example
  • Suppose
  • the vocabulary V contains only three words: a, b,
    and c;
  • a document, di, contains only 2 word tokens;
  • for each position, P(w=a) = p1, P(w=b) = p2, and
    P(w=c) = p3.
  • What is the probability that we see "a" once and
    "b" once in di?

20
An example (cont)
  • 9 possible sequences: aa, ab, ac, ba, bb, bc, ca,
    cb, cc.
  • The number of sequences with one "a" and one "b"
    (ab and ba): n! / (x1! ... xv!) = 2
  • The prob of the sequence ab is p1 p2,
  • and so is the prob of the sequence ba.
  • So the prob of seeing "a" once and "b" once is
    n! ∏k (pk^xk / xk!) = 2 p1 p2
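Plugging in concrete numbers as a check (the pi are not given on the slide, so assume p1 = p2 = p3 = 1/3): the probability is 2 × 1/3 × 1/3 = 2/9 ≈ 0.22, matching the multinomial_pmf sketch above.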

21
Multinomial Model
  • A document is seen as an ordered sequence of word
    events, drawn from the vocabulary V.
  • Nit: the number of times that wt appears in di
  • Modeling: multinomial distribution (sketched below)
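The formula itself does not survive in the transcript (it was presumably an image on the original slide); a sketch of the usual multinomial event model likelihood, following McCallum and Nigam (1998), in log space and with hypothetical argument names (the class-independent document-length term is dropped since it does not affect the arg max):

from math import lgamma, log

def multinomial_doc_log_likelihood(word_counts, p_word_given_c):
    # log [ n! * prod_t P(w_t | c)^{N_it} / N_it! ],  n = document length
    n = sum(word_counts.values())
    return (lgamma(n + 1)
            + sum(cnt * log(p_word_given_c[w]) - lgamma(cnt + 1)
                  for w, cnt in word_counts.items()))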

22
Training for multinomial model
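This slide's body is also missing from the transcript; a sketch of the standard Laplace-smoothed multinomial estimate from McCallum and Nigam (1998), assuming hard document labels and a hypothetical (word_count_dict, label) data format:

from collections import Counter

def multinomial_word_probs(labeled_docs, vocab, c):
    # P(w_t | c) = (1 + Cnt(w_t, c)) / (|V| + total tokens in class c)
    cnt = Counter()
    for word_counts, label in labeled_docs:
        if label == c:
            cnt.update(word_counts)
    total = sum(cnt.values())
    return {w: (1 + cnt[w]) / (len(vocab) + total) for w in vocab}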
23
Two models
  • Bernoulli event model: treat features as binary;
    each trial corresponds to a feature.
  • Multinomial event model: treat features as
    non-binary; each trial corresponds to a word
    position in the document.
  • The multinomial event model usually beats the
    Bernoulli event model (McCallum and Nigam, 1998).

24
Summary of Naïve Bayes
  • It makes a strong independence assumption.
  • It generally works well despite the strong
    assumption. Why?
  • Both training and testing are simple and fast.

25
Summary of Naïve Bayes (cont)
  • Strengths
  • Conceptual simplicity
  • Efficiency at training time
  • Efficiency at testing time
  • Handling of multi-class problems
  • Scalability
  • Can output the top N classes
  • Weaknesses
  • Theoretical validity
  • Prediction accuracy?
  • Stability and robustness

26
Today
  • Classification algorithm overview
  • Naïve Bayes in general
  • Naïve Bayes for text classification

27
Coming up
  • kNN and Rocchio on Thurs: read the paper.
  • An additional lab session right after Thursday's
    class.
  • Hw1 is due at 11pm on Sat; no extension.