1
Naïve Bayes
  • LING 572
  • Fei Xia
  • Week 2 1/9/06

2
Outline
  • Naïve Bayes in general
  • Naïve Bayes for TC

3
Questions
  • Why is it called Naïve Bayes?
  • What objective function does it optimize?
  • How many types of model parameters are there?
  • What happens at training time?
  • What happens at test time?
  • Any variations?

4
Modeling
  • Given x = (f1, ..., fd), find
  • c* = arg maxc P(c|x)
  •    = arg maxc P(c) P(x|c) / P(x)    (Bayes rule)
  •    = arg maxc P(c) P(x|c)
  • Independence assumption:
  • P(x|c) = P(f1, f2, ..., fd | c)
  •        = ∏k P(fk | c, f1, ..., fk-1)    (chain rule)
  •        ≈ ∏k P(fk | c)    (the "naïve" assumption; see the sketch below)
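A minimal Python sketch of this decision rule (not from the slides): the class priors and conditional probabilities are assumed to be already estimated and stored in plain dictionaries, and the names priors, cond, and nb_decide are hypothetical; sums of logs replace the product to avoid floating-point underflow.

import math

def nb_decide(priors, cond, features):
    # arg max_c [ log P(c) + sum_k log P(f_k | c) ]
    best_c, best_score = None, float("-inf")
    for c, prior in priors.items():
        score = math.log(prior) + sum(math.log(cond[c][f]) for f in features)
        if score > best_score:
            best_c, best_score = c, score
    return best_c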

5
Naïve Bayes Model
[Figure: a Bayes net with class node C and directed edges to feature nodes f1, f2, ..., fn]

Assumption: each fi is conditionally independent
of fj given C.
6
Model parameters
  • Choose
  • c* = arg maxc P(c) ∏k P(fk | c)
  • Two types of model parameters:
  • Class prior: P(c)
  • Conditional probability: P(fk | c)
  • The number of parameters: |C| + |C|·|V|
  • How many parameters are free?
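A quick worked count, under an assumption the slide leaves implicit (binary features): with |C| = 2 classes and |V| = 3 features there are 2 prior parameters and 2 × 3 = 6 conditional parameters, 8 in total; only |C| − 1 = 1 of the priors is free, since the priors must sum to one.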

7
Training: estimating parameters θ
  • Maximum likelihood (ML)
  • θ* = arg maxθ P(trainingData | θ)
  • P(fk | ci) = Cnt(fk, ci) / Cnt(ci)
  • P(ci) = Cnt(ci) / Σi Cnt(ci)    (see the sketch below)
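A counting sketch of these ML estimates (the (features, class) list format and the name train_mle are hypothetical; Cnt(ci) in the denominator of P(fk | ci) is read here as the total feature count observed with class ci):

from collections import Counter, defaultdict

def train_mle(labeled_data):
    # class_cnt[c] = Cnt(c);  feat_cnt[c][f] = Cnt(f, c)
    class_cnt, feat_cnt = Counter(), defaultdict(Counter)
    for features, c in labeled_data:
        class_cnt[c] += 1
        feat_cnt[c].update(features)
    total_docs = sum(class_cnt.values())
    priors = {c: n / total_docs for c, n in class_cnt.items()}
    cond = {c: {f: n / sum(cnts.values()) for f, n in cnts.items()}
            for c, cnts in feat_cnt.items()}
    return priors, cond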

8
Laplace Estimate/Correction/Smoothing
  • Pretend you saw each outcome one more time than you
    actually did.
  • Suppose X has K possible outcomes, and the counts
    for them are n1, ..., nK, which sum to N.
  • Without smoothing: P(X=i) = ni / N
  • With Laplace smoothing: P(X=i) = (ni + 1) / (N + K)
  • It can be derived from a Dirichlet prior as a MAP
    estimate.
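A one-function sketch of the same rule (the dict format is an assumption, not from the slide; outcomes with count 0 must be included so that K is the full number of outcomes):

def laplace_smooth(counts):
    # counts: outcome -> raw count, including outcomes with count 0
    K, N = len(counts), sum(counts.values())
    return {x: (n + 1) / (N + K) for x, n in counts.items()}

print(laplace_smooth({"a": 3, "b": 1, "c": 0}))  # {'a': 4/7, 'b': 2/7, 'c': 1/7}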

9
Classifying
  • MAP (maximum a posteriori) decision rule:
  • classify(x)
  •   = classify(f1, ..., fd)
  •   = arg maxc P(c|x)
  •   = arg maxc P(c) ∏k P(fk | c)
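A tiny worked instance of the rule, with made-up numbers (none of these values come from the slides): two classes with P(c1) = 0.6 and P(c2) = 0.4, and a document whose two features have P(f1 | c1) = 0.2, P(f2 | c1) = 0.5, P(f1 | c2) = 0.4, P(f2 | c2) = 0.3. The score for c1 is 0.6 × 0.2 × 0.5 = 0.06 and for c2 is 0.4 × 0.4 × 0.3 = 0.048, so the MAP rule picks c1.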

10
Naïve Bayes for TC
11
Features
  • Features: bag of words (word order information is
    lost; see the extraction sketch below)
  • Number of feature templates: 1
  • Number of features: |V|
  • Features: wt, t ∈ {1, 2, ..., |V|}
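A minimal bag-of-words extraction sketch (tokenization is not specified on the slide, so plain lowercase whitespace splitting is assumed):

from collections import Counter

def bag_of_words(text):
    # word order is discarded; only counts survive
    return Counter(text.lower().split())

print(bag_of_words("the cat sat on the mat"))
# Counter({'the': 2, 'cat': 1, 'sat': 1, 'on': 1, 'mat': 1})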

12
Issues
  • Is wt a binary feature?
  • Are absent features used for calculating P(dj | ci)?

13
Two Naive Bayes Models (McCallum and Nigam, 1998)
  • Multi-variate Bernoulli event model
  • (a.k.a. binary independence model)
  • All features are binary; the number of times a
    feature occurs in an instance is ignored.
  • When calculating P(d | c), all features are used,
    including the absent features.
  • Multinomial event model: unigram LM

14
Bernoulli distribution
  • A Bernoulli distribution has exactly two mutually
    exclusive outcomes: P(X=1) = p and P(X=0) = 1 - p.
  • Bernoulli trial: a single experiment which can
    have one of two possible outcomes.
  • A Bernoulli process is a sequence of iid
    (independent, identically distributed) Bernoulli
    trials.

15
Multi-variate Bernoulli Model
  • A document is seen as a collection of |V|
    independent Bernoulli experiments, one for each
    word in the vocabulary: does this word appear in
    the document?
  • Let Bit = 1 if wt appears in di, and 0 otherwise.
  • Modeling (sketched in code below):
  • P(di | cj) = ∏t ( Bit P(wt | cj) + (1 - Bit)(1 - P(wt | cj)) )
  • Training:
  • P(ci) = DocNum(ci) / DocNum
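A sketch of the Bernoulli document likelihood above, in log space (the argument names are hypothetical; p_word_given_c[w] stands for P(wt | cj), and every vocabulary word contributes, whether present or absent):

import math

def bernoulli_log_likelihood(doc_words, vocab, p_word_given_c):
    # log P(d | c): sum over the whole vocabulary, not just the words in d
    present = set(doc_words)
    return sum(math.log(p_word_given_c[w] if w in present
                        else 1.0 - p_word_given_c[w])
               for w in vocab)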

16
Training (cont)
  • P(wt | cj)
  •   = (1 + DocNum(wt, cj)) / (2 + DocNum(cj))

where each document di counts toward cj with weight
P(cj | di) = 1 if di has the label cj, and 0 otherwise.
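A counting sketch of this smoothed estimate, assuming hard (0/1) labels as on the slide and a hypothetical (set_of_words, label) data format:

def bernoulli_word_probs(labeled_docs, vocab, c):
    # P(w_t | c) = (1 + DocNum(w_t, c)) / (2 + DocNum(c))
    docs_c = [words for words, label in labeled_docs if label == c]
    return {w: (1 + sum(1 for words in docs_c if w in words)) / (2 + len(docs_c))
            for w in vocab}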
17
Questions about Bernoulli event model?
18
Multinomial distribution
  • Possible outcomes: w1, w2, ..., w|V|
  • A trial for each word position:
  • P(CurWord = wi) = pi, with Σi pi = 1
  • Perform n independent trials, where n is the length
    of the document.
  • Let Xi be the number of times that the word wi is
    observed in the document.
  • P(X1 = x1, ..., X|V| = x|V|)
      = n! / (x1! ... x|V|!) · p1^x1 ... p|V|^x|V|
      = n! ∏k (pk^xk / xk!)
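A direct transcription of this pmf as a sketch (math.prod needs Python 3.8+); the call below previews the a/b/c example on the next two slides, with assumed uniform probabilities:

from math import factorial, prod

def multinomial_pmf(counts, probs):
    # P(X_1=x_1, ..., X_v=x_v) = n! * prod_k p_k^{x_k} / x_k!
    n = sum(counts)
    coef = factorial(n) / prod(factorial(x) for x in counts)
    return coef * prod(p ** x for p, x in zip(probs, counts))

print(multinomial_pmf([1, 1, 0], [1/3, 1/3, 1/3]))  # 2 * (1/3) * (1/3) = 2/9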

19
An example
  • Suppose
  • the vocabulary V contains only three words: a, b,
    and c;
  • a document, di, contains only 2 word tokens;
  • for each position, P(w=a) = p1, P(w=b) = p2, and
    P(w=c) = p3.
  • What is the probability that we see "a" once and
    "b" once in di?

20
An example (cont)
  • 9 possible sequences: aa, ab, ac, ba, bb, bc, ca,
    cb, cc.
  • The number of sequences with one "a" and one "b"
    (ab and ba): n! / (x1! ... xv!) = 2
  • The prob of the sequence ab is p1 p2,
  • and so is the prob of the sequence ba.
  • So the prob of seeing "a" once and "b" once is
    n! ∏k (pk^xk / xk!) = 2 p1 p2
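Plugging in concrete numbers as a check (the pi are not given on the slide, so assume p1 = p2 = p3 = 1/3): the probability is 2 × 1/3 × 1/3 = 2/9 ≈ 0.22, matching the multinomial_pmf sketch above.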

21
Multinomial Model
  • A document is seen as an ordered sequence of word
    events, drawn from the vocabulary V.
  • Nit: the number of times that wt appears in di
  • Modeling: multinomial distribution (sketched below)
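The formula itself does not survive in the transcript (it was presumably an image on the original slide); a sketch of the usual multinomial event model likelihood, following McCallum and Nigam (1998), in log space and with hypothetical argument names (the class-independent document-length term is dropped since it does not affect the arg max):

from math import lgamma, log

def multinomial_doc_log_likelihood(word_counts, p_word_given_c):
    # log [ n! * prod_t P(w_t | c)^{N_it} / N_it! ],  n = document length
    n = sum(word_counts.values())
    return (lgamma(n + 1)
            + sum(cnt * log(p_word_given_c[w]) - lgamma(cnt + 1)
                  for w, cnt in word_counts.items()))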

22
Training for multinomial model
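This slide's body is also missing from the transcript; a sketch of the standard Laplace-smoothed multinomial estimate from McCallum and Nigam (1998), assuming hard document labels and a hypothetical (word_count_dict, label) data format:

from collections import Counter

def multinomial_word_probs(labeled_docs, vocab, c):
    # P(w_t | c) = (1 + Cnt(w_t, c)) / (|V| + total tokens in class c)
    cnt = Counter()
    for word_counts, label in labeled_docs:
        if label == c:
            cnt.update(word_counts)
    total = sum(cnt.values())
    return {w: (1 + cnt[w]) / (len(vocab) + total) for w in vocab}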
23
Two models
  • Bernoulli event model: treat features as binary;
    each trial corresponds to a feature.
  • Multinomial event model: treat features as
    non-binary; each trial corresponds to a word
    position in the document.
  • The multinomial event model usually beats the
    Bernoulli event model (McCallum and Nigam, 1998).

24
Summary of Naïve Bayes
  • It makes a strong independence assumption.
  • It generally works well despite the strong
    assumption. Why?
  • Both training and testing are simple and fast.

25
Summary of Naïve Bayes (cont)
  • Strengths
  • Conceptual simplicity
  • Efficiency at training time
  • Efficiency at testing time
  • Handling of multi-class problems
  • Scalability
  • Can output the top N classes
  • Weaknesses
  • Theoretical validity
  • Prediction accuracy?
  • Stability and robustness

26
Today
  • Classification algorithm overview
  • Naïve Bayes in general
  • Naïve Bayes for text classification

27
Coming up
  • kNN and Rocchio on Thurs: read the paper.
  • An additional lab session right after Thursday's
    class.
  • Hw1 is due at 11pm on Sat; no extension.