Transcript and Presenter's Notes

Title: Machine Learning Chapter 6. Bayesian Learning


1
Machine Learning Chapter 6. Bayesian Learning
  • Tom M. Mitchell

2
Bayesian Learning
  • Bayes Theorem
  • MAP, ML hypotheses
  • MAP learners
  • Minimum description length principle
  • Bayes optimal classifier
  • Naive Bayes learner
  • Example: Learning over text data
  • Bayesian belief networks
  • Expectation Maximization algorithm

3
Two Roles for Bayesian Methods
  • Provides practical learning algorithms
    • Naive Bayes learning
    • Bayesian belief network learning
    • Combine prior knowledge (prior probabilities)
      with observed data
    • Requires prior probabilities
  • Provides useful conceptual framework
    • Provides 'gold standard' for evaluating other
      learning algorithms
    • Additional insight into Occam's razor

4
Bayes Theorem
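For reference, a standard statement of the theorem and the usual reading of its terms:

```latex
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}
```

where P(h) is the prior probability of hypothesis h, P(D) is the prior probability that the training data D will be observed, P(D|h) is the probability of observing D given h, and P(h|D) is the posterior probability of h given D.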
5
Choosing Hypotheses
  • Generally want the most probable hypothesis given
    the training data
  • Maximum a posteriori hypothesis hMAP
  • If we assume P(hi) = P(hj), then we can further
    simplify and choose the maximum likelihood (ML)
    hypothesis (both definitions are written out below)
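Written out, the two quantities referred to above are (a standard restatement):

```latex
h_{MAP} \equiv \arg\max_{h \in H} P(h \mid D)
        = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)}
        = \arg\max_{h \in H} P(D \mid h)\,P(h),
\qquad
h_{ML} = \arg\max_{h \in H} P(D \mid h)
```

P(D) is dropped because it does not depend on h; under the equal-priors assumption P(h) can be dropped as well, which yields hML.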

6
Bayes Theorem
  • Does patient have cancer or not?
  • A patient takes a lab test and the result comes
    back positive. The test returns a correct
    positive result in only 98% of the cases in which
    the disease is actually present, and a correct
    negative result in only 97% of the cases in which
    the disease is not present. Furthermore, .008 of
    the entire population has this cancer.
  • P(cancer) = .008          P(¬cancer) = .992
  • P(+|cancer) = .98         P(−|cancer) = .02
  • P(+|¬cancer) = .03        P(−|¬cancer) = .97
    (a numeric check follows below)
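A quick numeric check of the example, using only the figures stated above (sensitivity .98, specificity .97, prior .008):

```python
# Posterior probabilities for the cancer example, given a positive test result.
p_cancer = 0.008              # prior P(cancer)
p_pos_given_cancer = 0.98     # P(+ | cancer)
p_pos_given_not = 1 - 0.97    # P(+ | ¬cancer) = 0.03

joint_cancer = p_pos_given_cancer * p_cancer      # P(+|cancer) P(cancer)   = 0.00784
joint_not = p_pos_given_not * (1 - p_cancer)      # P(+|¬cancer) P(¬cancer) = 0.02976
p_pos = joint_cancer + joint_not                  # P(+) by total probability

print(joint_cancer / p_pos)   # P(cancer | +)  ≈ 0.21
print(joint_not / p_pos)      # P(¬cancer | +) ≈ 0.79
```

Even after a positive test, ¬cancer remains the MAP hypothesis, because the prior P(cancer) is so small.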

7
Basic Formulas for Probabilities
  • Product Rule: probability P(A ∧ B) of a
    conjunction of two events A and B
  • P(A ∧ B) = P(A|B) P(B) = P(B|A) P(A)
  • Sum Rule: probability of a disjunction of two
    events A and B
  • P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
  • Theorem of total probability: if events A1,…, An
    are mutually exclusive with Σi P(Ai) = 1, then
  • P(B) = Σi P(B|Ai) P(Ai)

8
Brute Force MAP Hypothesis Learner
  • For each hypothesis h in H, calculate the
    posterior probability
  • P(h|D) = P(D|h) P(h) / P(D)
  • Output the hypothesis hMAP with the highest
    posterior probability
  • hMAP = argmax_{h ∈ H} P(h|D)
    (a minimal sketch follows below)
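A minimal sketch of the brute-force learner for a finite hypothesis space; the function and parameter names are illustrative, and the prior and likelihood are supplied by the caller:

```python
def brute_force_map(hypotheses, prior, likelihood, data):
    """Return the MAP hypothesis from a finite hypothesis space H.

    hypotheses -- iterable of candidate hypotheses h in H
    prior      -- function h -> P(h)
    likelihood -- function (data, h) -> P(D | h)
    data       -- observed training data D
    """
    # P(D) is the same for every h, so maximizing P(D|h) P(h) suffices.
    return max(hypotheses, key=lambda h: likelihood(data, h) * prior(h))
```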

9
Relation to Concept Learning (1/2)
  • Consider our usual concept learning task
  • instance space X, hypothesis space H, training
    examples D
  • consider the Find-S learning algorithm (outputs
    the most specific hypothesis from the version
    space VS_H,D)
  • What would Bayes rule produce as the MAP
    hypothesis?
  • Does Find-S output a MAP hypothesis?

10
Relation to Concept Learning (2/2)
  • Assume a fixed set of instances ⟨x1,…, xm⟩
  • Assume D is the set of classifications D =
    ⟨c(x1),…, c(xm)⟩
  • Choose P(D|h)
  • P(D|h) = 1 if h consistent with D
  • P(D|h) = 0 otherwise
  • Choose P(h) to be the uniform distribution
  • P(h) = 1/|H| for all h in H
  • Then (see the derivation sketch below):
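A sketch of how the posterior works out under these choices, using P(D) = Σ_h P(D|h) P(h) = |VS_H,D| / |H|:

```latex
P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} =
\begin{cases}
\dfrac{1 \cdot \frac{1}{|H|}}{\;\frac{|VS_{H,D}|}{|H|}\;} = \dfrac{1}{|VS_{H,D}|}
      & \text{if } h \text{ is consistent with } D,\\[1.5ex]
0     & \text{otherwise.}
\end{cases}
```

So every consistent hypothesis is a MAP hypothesis, which is why Find-S outputs a MAP hypothesis under these assumptions.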

11
Evolution of Posterior Probabilities
12
Characterizing Learning Algorithms by Equivalent
MAP Learners
13
Learning A Real-Valued Function (1/2)
  • Consider any real-valued target function f
  • Training examples ⟨xi, di⟩, where di is a noisy
    training value
  • di = f(xi) + ei
  • ei is a random variable (noise) drawn
    independently for each xi according to some
    Gaussian distribution with mean 0
  • Then the maximum likelihood hypothesis hML is the
    one that minimizes
    the sum of squared errors (see the sketch below)
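A sketch of why this holds, assuming the noise ei is Gaussian with variance σ²:

```latex
h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m}
           \frac{1}{\sqrt{2\pi\sigma^2}}
           \exp\!\left(-\frac{(d_i - h(x_i))^2}{2\sigma^2}\right)
       = \arg\max_{h \in H} \sum_{i=1}^{m} -\frac{(d_i - h(x_i))^2}{2\sigma^2}
       = \arg\min_{h \in H} \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)^2
```

The middle step takes the natural log (the "maximize natural log of this instead" step on the next slide); the constant factor and the 1/(2σ²) scale do not affect the arg max.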

14
Learning A Real-Valued Function (2/2)
  • Maximize natural log of this instead...

15
Learning to Predict Probabilities
  • Consider predicting survival probability from
    patient data
  • Training examples ⟨xi, di⟩, where di is 1 or 0
  • Want to train a neural network to output a
    probability given xi (not a 0 or 1)
  • In this case one can show what hML maximizes, and
    derive a weight update rule for a sigmoid unit
    (both are written out in the sketch below)
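The criterion and update rule referred to above, in a standard restatement (η is the learning rate and x_ijk is the k-th input to sigmoid unit j for training example i):

```latex
h_{ML} = \arg\max_{h \in H} \sum_{i=1}^{m}
         d_i \ln h(x_i) + (1 - d_i)\ln\bigl(1 - h(x_i)\bigr),
\qquad
w_{jk} \leftarrow w_{jk} + \eta \sum_{i=1}^{m} \bigl(d_i - h(x_i)\bigr)\,x_{ijk}
```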

16
Minimum Description Length Principle (1/2)
  • Occam's razor: prefer the shortest hypothesis
  • MDL: prefer the hypothesis h that minimizes
  • hMDL = argmin_{h ∈ H} LC1(h) + LC2(D|h)
  • where LC(x) is the description length of x under
    encoding C
  • Example: H = decision trees, D = training data
    labels
  • LC1(h) is # bits to describe tree h
  • LC2(D|h) is # bits to describe D given h
  • Note: LC2(D|h) = 0 if examples are classified
    perfectly by h. Need only describe exceptions
  • Hence hMDL trades off tree size for training
    errors

17
Minimum Description Length Principle (2/2)
  • Interesting fact from information theory
  • The optimal (shortest expected coding length)
    code for an event with
    probability p is −log2 p bits.
  • So interpret hMAP = argmin_h −log2 P(D|h) − log2 P(h):
  • −log2 P(h) is length of h under optimal code
  • −log2 P(D|h) is length of D given h under optimal
    code
  • → prefer the hypothesis that minimizes
  • length(h) + length(misclassifications)

18
Most Probable Classification of New Instances
  • So far we've sought the most probable hypothesis
    given the data D (i.e., hMAP)
  • Given new instance x, what is its most probable
    classification?
  • hMAP(x) is not the most probable classification!
  • Consider
  • Three possible hypotheses
  • P(h1|D) = .4, P(h2|D) = .3, P(h3|D) = .3
  • Given new instance x,
  • h1(x) = +, h2(x) = −, h3(x) = −
  • What's the most probable classification of x?

19
Bayes Optimal Classifier
  • Bayes optimal classification (defined below)
  • Example
  • P(h1|D) = .4,  P(−|h1) = 0,  P(+|h1) = 1
  • P(h2|D) = .3,  P(−|h2) = 1,  P(+|h2) = 0
  • P(h3|D) = .3,  P(−|h3) = 1,  P(+|h3) = 0
  • therefore, summing P(v|hi) P(hi|D) over the
    hypotheses (worked out below), the Bayes optimal
    classification of the new instance is −
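The Bayes optimal classification rule, with the sums for the example filled in:

```latex
\arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D)
```

Here Σi P(+|hi) P(hi|D) = 1(.4) + 0(.3) + 0(.3) = .4 and Σi P(−|hi) P(hi|D) = 0(.4) + 1(.3) + 1(.3) = .6, so the optimal classification is −, even though the single most probable hypothesis h1 predicts +.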

20
Gibbs Classifier
  • Bayes optimal classifier provides best result,
    but can be expensive if many hypotheses.
  • Gibbs algorithm
  • 1. Choose one hypothesis at random, according to
    P(h|D)
  • 2. Use this to classify new instance
  • Surprising fact: assume target concepts are drawn
    at random from H according to priors on H. Then
  • E[error_Gibbs] ≤ 2 E[error_BayesOptimal]
  • Suppose a correct, uniform prior distribution
    over H; then
  • Pick any hypothesis from VS, with uniform
    probability
  • Its expected error is no worse than twice Bayes
    optimal

21
Naive Bayes Classifier (1/2)
  • Along with decision trees, neural networks, and
    nearest neighbor, one of the most practical
    learning methods.
  • When to use
  • Moderate or large training set available
  • Attributes that describe instances are
    conditionally independent given classification
  • Successful applications
  • Diagnosis
  • Classifying text documents

22
Naive Bayes Classifier (2/2)
  • Assume target function f : X → V, where each
    instance x is described by attributes ⟨a1, a2 … an⟩.
  • Most probable value of f(x) is
  • vMAP = argmax_{vj ∈ V} P(vj | a1, a2 … an)
         = argmax_{vj ∈ V} P(a1, a2 … an | vj) P(vj)
  • Naive Bayes assumption
  • P(a1, a2 … an | vj) = Πi P(ai | vj)
  • which gives the Naive Bayes classifier
  • vNB = argmax_{vj ∈ V} P(vj) Πi P(ai | vj)

23
Naive Bayes Algorithm
  • Naive_Bayes_Learn(examples)
  • For each target value vj
  • P̂(vj) ← estimate P(vj)
  • For each attribute value ai of each attribute a
  • P̂(ai|vj) ← estimate P(ai|vj)
  • Classify_New_Instance(x)
  • vNB = argmax_{vj ∈ V} P̂(vj) Π_{ai ∈ x} P̂(ai|vj)
    (a sketch of both procedures follows below)
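A minimal sketch of the two procedures for discrete attributes, using simple relative-frequency estimates; the function names and data layout are illustrative, not from the slides:

```python
from collections import Counter, defaultdict

def naive_bayes_learn(examples):
    """examples: list of (attribute_tuple, target_value) pairs."""
    class_counts = Counter(v for _, v in examples)
    priors = {v: n / len(examples) for v, n in class_counts.items()}

    # cond_counts[v][i][a] = number of examples with target v whose i-th attribute is a
    cond_counts = defaultdict(lambda: defaultdict(Counter))
    for attrs, v in examples:
        for i, a in enumerate(attrs):
            cond_counts[v][i][a] += 1

    def cond_prob(a, i, v):                      # estimate of P(a_i = a | v)
        return cond_counts[v][i][a] / class_counts[v]

    return priors, cond_prob

def classify_new_instance(x, priors, cond_prob):
    """Return v_NB = argmax_v P(v) * prod_i P(a_i | v)."""
    def score(v):
        s = priors[v]
        for i, a in enumerate(x):
            s *= cond_prob(a, i, v)
        return s
    return max(priors, key=score)
```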



24
Naive Bayes Example
  • Consider PlayTennis again, and new instance
  • ⟨Outlook = sunny, Temperature = cool, Humidity =
    high, Wind = strong⟩
  • Want to compute vNB = argmax_{vj} P(vj) Πi P(ai|vj)
  • P(y) P(sunny|y) P(cool|y) P(high|y) P(strong|y) =
    .005
  • P(n) P(sunny|n) P(cool|n) P(high|n) P(strong|n) =
    .021
  • → vNB = n (a numeric check follows below)
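A numeric check of the two products. The relative-frequency estimates (2/9, 3/5, etc.) are taken from the usual 14-example PlayTennis table (9 yes / 5 no), which is assumed here rather than shown on the slide:

```python
# Frequency estimates from the standard 14-example PlayTennis table (9 yes / 5 no).
p_yes = 9/14 * 2/9 * 3/9 * 3/9 * 3/9   # P(y) P(sun|y) P(cool|y) P(high|y) P(strong|y)
p_no  = 5/14 * 3/5 * 1/5 * 4/5 * 3/5   # P(n) P(sun|n) P(cool|n) P(high|n) P(strong|n)

print(round(p_yes, 4))         # 0.0053  (the .005 above)
print(round(p_no, 4))          # 0.0206  (the .021 above)
print(p_no / (p_yes + p_no))   # ≈ 0.795, the normalized probability of "no"
```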

25
Naive Bayes Subtleties (1/2)
  • 1. Conditional independence assumption is often
    violated
  • ...but it works surprisingly well anyway. Note we
    don't need the estimated posteriors P̂(vj|x) to be
    correct; we need only that
  • argmax_{vj} P̂(vj) Πi P̂(ai|vj) =
    argmax_{vj} P(vj) P(a1,…, an|vj)
  • see [Domingos & Pazzani, 1996] for analysis
  • Naive Bayes posteriors are often unrealistically
    close to 1 or 0

26
Naive Bayes Subtleties (2/2)
  • 2. What if none of the training instances with
    target value vj have attribute value ai? Then
  • P̂(ai|vj) = 0, and so P̂(vj) Πi P̂(ai|vj) = 0
  • Typical solution is a Bayesian estimate for
    P̂(ai|vj)
  • P̂(ai|vj) ← (nc + m·p) / (n + m)
  • where
  • n is the number of training examples for which
    v = vj
  • nc is the number of examples for which v = vj and
    a = ai
  • p is a prior estimate for P̂(ai|vj)
  • m is the weight given to the prior (i.e. the
    number of 'virtual' examples)
    (a small sketch follows below)
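A small sketch of the m-estimate; the function name and the example numbers are illustrative:

```python
def m_estimate(nc, n, p, m):
    """Bayesian (m-)estimate of P(ai | vj).

    nc -- number of examples with v = vj and a = ai
    n  -- number of examples with v = vj
    p  -- prior estimate of P(ai | vj), e.g. 1/k for k attribute values
    m  -- equivalent sample size (number of 'virtual' examples)
    """
    return (nc + m * p) / (n + m)

# With no observed co-occurrences (nc = 0) the estimate stays above zero:
print(m_estimate(nc=0, n=14, p=1/3, m=3))   # ≈ 0.059 rather than 0
```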

27
Learning to Classify Text (1/4)
  • Why?
  • Learn which news articles are of interest
  • Learn to classify web pages by topic
  • Naive Bayes is among most effective algorithms
  • What attributes shall we use to represent text
    documents?

28
Learning to Classify Text (2/4)
  • Target concept Interesting? : Document → {+, −}
  • 1. Represent each document by a vector of words
  • one attribute per word position in document
  • 2. Learning: Use training examples to estimate
  • P(+) and P(−)
  • P(doc|+) and P(doc|−)
  • Naive Bayes conditional independence assumption
  • P(doc|vj) = Π_{i=1}^{length(doc)} P(ai = wk | vj)
  • where P(ai = wk | vj) is the probability that the
    word in position i is wk, given vj
  • one more assumption:
  • P(ai = wk | vj) = P(am = wk | vj), ∀ i, m

29
Learning to Classify Text (3/4)
  • LEARN_NAIVE_BAYES_TEXT(Examples, V)
  • 1. collect all words and other tokens that occur
    in Examples
  • Vocabulary ← all distinct words and other tokens
    in Examples
  • 2. calculate the required P(vj) and P(wk|vj)
    probability terms
  • For each target value vj in V do
  • docsj ← subset of Examples for which the target
    value is vj
  • P(vj) ← |docsj| / |Examples|
  • Textj ← a single document created by
    concatenating all members of docsj

30
Learning to Classify Text (4/4)
  • n ← total number of words in Textj (counting
    duplicate words multiple times)
  • for each word wk in Vocabulary
  • nk ← number of times word wk occurs in Textj
  • P(wk|vj) ← (nk + 1) / (n + |Vocabulary|)
  • CLASSIFY_NAIVE_BAYES_TEXT(Doc)
  • positions ← all word positions in Doc that
    contain tokens found in Vocabulary
  • Return vNB, where
  • vNB = argmax_{vj ∈ V} P(vj) Π_{i ∈ positions} P(ai|vj)
    (a compact sketch follows below)
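A compact sketch of both procedures, assuming each document is given as a list of word tokens and using the (nk + 1)/(n + |Vocabulary|) estimate above; all names are illustrative:

```python
import math
from collections import Counter

def learn_naive_bayes_text(examples, targets):
    """examples: list of (list_of_words, target_value) pairs; targets: set of target values."""
    vocabulary = {w for words, _ in examples for w in words}
    priors, word_probs = {}, {}
    for v in targets:
        docs_v = [words for words, t in examples if t == v]
        priors[v] = len(docs_v) / len(examples)
        text_v = [w for words in docs_v for w in words]      # concatenation of docs_v
        n, counts = len(text_v), Counter(text_v)
        word_probs[v] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return vocabulary, priors, word_probs

def classify_naive_bayes_text(doc, vocabulary, priors, word_probs):
    """Return v_NB; log-probabilities avoid underflow on long documents."""
    positions = [w for w in doc if w in vocabulary]
    def log_score(v):
        return math.log(priors[v]) + sum(math.log(word_probs[v][w]) for w in positions)
    return max(priors, key=log_score)
```

Working in log space is only an implementation convenience; the arg max is unchanged.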

31
Twenty NewsGroups
  • Given 1000 training documents from each group,
    learn to classify new documents according to
    which newsgroup they came from
  • Naive Bayes: 89% classification accuracy

comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey alt.atheism soc.religion.christian talk.religion.misc talk.politics.mideast talk.politics.misc talk.politics.guns sci.space sci.crypt sci.electronics sci.med
32
Learning Curve for 20 Newsgroups
  • Accuracy vs. Training set size (1/3 withheld for
    test)

33
Bayesian Belief Networks
  • Interesting because
  • Naive Bayes assumption of conditional
    independence is too restrictive
  • But it's intractable without some such
    assumptions...
  • Bayesian belief networks describe conditional
    independence among subsets of variables
  • → allows combining prior knowledge about
    (in)dependencies among variables with observed
    training data
  • (also called Bayes Nets)

34
Conditional Independence
  • Definition: X is conditionally independent of Y
    given Z if the probability distribution governing
    X is independent of the value of Y given the
    value of Z; that is, if
  • (∀ xi, yj, zk) P(X = xi | Y = yj, Z = zk) =
    P(X = xi | Z = zk)
  • more compactly, we write
  • P(X | Y, Z) = P(X | Z)
  • Example: Thunder is conditionally independent of
    Rain, given Lightning
  • P(Thunder | Rain, Lightning) =
    P(Thunder | Lightning)
  • Naive Bayes uses cond. indep. to justify
  • P(X, Y | Z) = P(X | Y, Z) P(Y | Z) =
    P(X | Z) P(Y | Z)

35
Bayesian Belief Network (1/2)
  • Network represents a set of conditional
    independence assertions
  • Each node is asserted to be conditionally
    independent of its nondescendants, given its
    immediate predecessors.
  • Directed acyclic graph

36
Bayesian Belief Network (2/2)
  • Represents joint probability distribution over
    all variables
  • e.g., P(Storm, BusTourGroup, …, ForestFire)
  • in general,
  • P(y1,…, yn) = Π_{i=1}^{n} P(yi | Parents(Yi))
  • where Parents(Yi) denotes the immediate
    predecessors of Yi in the graph
  • so, the joint distribution is fully defined by
    the graph, plus the P(yi | Parents(Yi))
    (a tiny worked illustration follows below)
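A tiny illustration of the factorization for a hypothetical two-node network Storm → Campfire; the CPT numbers are made up:

```python
# Hypothetical two-variable network Storm -> Campfire; the CPT numbers are made up.
p_storm = {True: 0.2, False: 0.8}                 # P(Storm)
p_campfire_given_storm = {True: 0.1, False: 0.4}  # P(Campfire = T | Storm)

def joint(storm, campfire):
    """P(Storm = storm, Campfire = campfire) = P(storm) * P(campfire | storm)."""
    p_c = p_campfire_given_storm[storm]
    return p_storm[storm] * (p_c if campfire else 1 - p_c)

# The four joint entries sum to 1, as any joint distribution must.
print(sum(joint(s, c) for s in (True, False) for c in (True, False)))   # 1.0
```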

37
Inference in Bayesian Networks
  • How can one infer the (probabilities of) values
    of one or more network variables, given observed
    values of others?
  • Bayes net contains all information needed for
    this inference
  • If only one variable with unknown value, easy to
    infer it
  • In the general case, the problem is NP-hard
  • In practice, can succeed in many cases
  • Exact inference methods work well for some
    network structures
  • Monte Carlo methods simulate the network
    randomly to calculate approximate solutions

38
Learning of Bayesian Networks
  • Several variants of this learning task
  • Network structure might be known or unknown
  • Training examples might provide values of all
    network variables, or just some
  • If structure is known and we observe all
    variables
  • Then it's as easy as training a Naive Bayes
    classifier

39
Learning Bayes Nets
  • Suppose structure known, variables partially
    observable
  • e.g., observe ForestFire, Storm, BusTourGroup,
    Thunder, but not Lightning, Campfire...
  • Similar to training neural network with hidden
    units
  • In fact, can learn network conditional
    probability tables using gradient ascent!
  • Converge to network h that (locally) maximizes
    P(D|h)

40
Gradient Ascent for Bayes Nets
  • Let wijk denote one entry in the conditional
    probability table for variable Yi in the network
  • wijk = P(Yi = yij | Parents(Yi) = the list uik of
    values)
  • e.g., if Yi = Campfire, then uik might be
  • ⟨Storm = T, BusTourGroup = F⟩
  • Perform gradient ascent by repeatedly
  • 1. updating all wijk using the training data D
  • 2. then renormalizing the wijk to assure
  • Σj wijk = 1 and 0 ≤ wijk ≤ 1

41
More on Learning Bayes Nets
  • The EM algorithm can also be used. Repeatedly:
  • 1. Calculate probabilities of unobserved
    variables, assuming h
  • 2. Calculate new wijk to maximize E[ln P(D|h)],
    where D now includes both observed and
    (calculated probabilities of) unobserved
    variables
  • When structure unknown...
  • Algorithms use greedy search to add/subtract
    edges and nodes
  • Active research topic

42
Summary Bayesian Belief Networks
  • Combine prior knowledge with observed data
  • Impact of prior knowledge (when correct!) is to
    lower the sample complexity
  • Active research area
  • Extend from boolean to real-valued variables
  • Parameterized distributions instead of tables
  • Extend to first-order instead of propositional
    systems
  • More effective inference methods

43
Expectation Maximization (EM)
  • When to use
  • Data is only partially observable
  • Unsupervised clustering (target value
    unobservable)
  • Supervised learning (some instance attributes
    unobservable)
  • Some uses
  • Train Bayesian Belief Networks
  • Unsupervised clustering (AUTOCLASS)
  • Learning Hidden Markov Models

44
Generating Data from Mixture of k Gaussians
  • Each instance x generated by
  • 1. Choosing one of the k Gaussians with uniform
    probability
  • 2. Generating an instance at random according to
    that Gaussian

45
EM for Estimating k Means (1/2)
  • Given
  • Instances from X generated by a mixture of k
    Gaussian distributions
  • Unknown means ⟨μ1,…, μk⟩ of the k Gaussians
  • Don't know which instance xi was generated by
    which Gaussian
  • Determine
  • Maximum likelihood estimates of ⟨μ1,…, μk⟩
  • Think of the full description of each instance as
  • yi = ⟨xi, zi1, zi2⟩, where
  • zij is 1 if xi was generated by the jth Gaussian
  • xi observable
  • zij unobservable

46
EM for Estimating k Means (2/2)
  • EM Algorithm: Pick random initial h = ⟨μ1, μ2⟩,
    then iterate
  • E step: Calculate the expected value E[zij] of
    each hidden variable zij, assuming the current
    hypothesis h = ⟨μ1, μ2⟩ holds.
  • M step: Calculate a new maximum likelihood
    hypothesis h' = ⟨μ'1, μ'2⟩, assuming the value
    taken on by each hidden variable zij is its
    expected value E[zij] calculated above. Replace
    h = ⟨μ1, μ2⟩ by h' = ⟨μ'1, μ'2⟩.
    (a sketch of both steps follows below)
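A minimal sketch of the two steps for k = 2 one-dimensional Gaussians, assuming equal mixing priors and a known, shared variance σ² (as in the slide's setting); all names are illustrative:

```python
import math, random

def em_two_means(xs, sigma2=1.0, iters=50):
    """EM for the means of a 2-Gaussian mixture (equal priors, known variance)."""
    mu = [random.choice(xs), random.choice(xs)]   # random initial hypothesis h = <mu1, mu2>
    for _ in range(iters):
        # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j
        e = []
        for x in xs:
            w = [math.exp(-(x - m) ** 2 / (2 * sigma2)) for m in mu]
            total = sum(w)
            e.append([wj / total for wj in w])
        # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
        mu = [sum(e[i][j] * xs[i] for i in range(len(xs))) /
              sum(e[i][j] for i in range(len(xs)))
              for j in range(2)]
    return mu

# Example: data drawn around two centers; the recovered means should be near -4 and 4.
data = [random.gauss(-4, 1) for _ in range(100)] + [random.gauss(4, 1) for _ in range(100)]
print(em_two_means(data))
```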

47
EM Algorithm
  • Converges to a local maximum likelihood h and
    provides estimates of the hidden variables zij
  • In fact, local maximum in E[ln P(Y|h)]
  • Y is the complete (observable plus unobservable
    variables) data
  • Expected value is taken over possible values of
    the unobserved variables in Y

48
General EM Problem
  • Given
  • Observed data X = {x1,…, xm}
  • Unobserved data Z = {z1,…, zm}
  • Parameterized probability distribution P(Y|h),
    where
  • Y = {y1,…, ym} is the full data, yi = xi ∪ zi
  • h are the parameters
  • Determine: h that (locally) maximizes
    E[ln P(Y|h)]
  • Many uses
  • Train Bayesian belief networks
  • Unsupervised clustering (e.g., k means)
  • Hidden Markov Models

49
General EM Method
  • Define a likelihood function Q(h'|h) which
    calculates
  • Y = X ∪ Z, using the observed X and current
    parameters h to estimate Z
  • Q(h'|h) ← E[ln P(Y|h') | h, X]
  • EM Algorithm
  • Estimation (E) step: Calculate Q(h'|h) using the
    current hypothesis h and the observed data X to
    estimate the probability distribution over Y.
  • Q(h'|h) ← E[ln P(Y|h') | h, X]
  • Maximization (M) step: Replace hypothesis h by
    the hypothesis h' that maximizes this Q function.