1
2D1431 Machine Learning
  • Bayesian Learning

2
Outline
  • Bayes theorem
  • Maximum likelihood (ML) hypothesis
  • Maximum a posteriori (MAP) hypothesis
  • Naïve Bayes classifier
  • Bayes optimal classifier
  • Bayesian belief networks
  • Expectation maximization (EM) algorithm

3
Handwritten character classification
4
Gray level pictures: object classification
5
Gray level pictures: human action classification
6
Literature & Software
  • T. Mitchell, chapter 6
  • S. Russell & P. Norvig, Artificial Intelligence:
    A Modern Approach, chapters 14-15
  • R.O. Duda, P.E. Hart, D.G. Stork, Pattern
    Classification, 2nd ed., chapters 2-3
  • David Heckerman, A Tutorial on Learning with
    Bayesian Belief Networks
  • http://ftp.research.microsoft.com/pub/tr/tr-95-06.pdf
  • Bayes Net Toolbox for Matlab (free), Kevin Murphy
  • http://www.cs.berkeley.edu/~murphyk/Bayes/bnt.html

7
Bayes Theorem
  • P(h|D) = P(D|h) P(h) / P(D)
  • P(D): prior probability of the data D (evidence)
  • P(h): prior probability of the hypothesis h (prior)
  • P(h|D): posterior probability of the hypothesis h
    given the data D (posterior)
  • P(D|h): probability of the data D given the
    hypothesis h (likelihood of the data)

8
Bayes Theorem
  • P(h|D) = P(D|h) P(h) / P(D)
  • posterior = likelihood × prior / evidence
  • By observing the data D we can convert the prior
    probability P(h) into the a posteriori probability
    (posterior) P(h|D).
  • The posterior is the probability that h holds after
    the data D has been observed.
  • The evidence P(D) can be viewed merely as a scale
    factor that guarantees that the posterior
    probabilities sum to one.

9
Choosing Hypotheses
  • P(h|D) = P(D|h) P(h) / P(D)
  • Generally we want the most probable hypothesis given
    the training data
  • Maximum a posteriori hypothesis hMAP:
  • hMAP = argmax h∈H P(h|D)
  •      = argmax h∈H P(D|h) P(h) / P(D)
  •      = argmax h∈H P(D|h) P(h)
  • If the hypothesis priors are equally likely,
    P(hi) = P(hj), then one can choose the maximum
    likelihood (ML) hypothesis
  • hML = argmax h∈H P(D|h)

10
Bayes Theorem Example
  • A patient takes a lab test and the result is
    positive. The test returns a correct positive (+)
    result in 98% of the cases in which the disease
    is actually present, and a correct negative (−)
    result in 97% of the cases in which the disease
    is not present. Furthermore, 0.8% of the entire
    population have the disease. Hypotheses:
    disease, ¬disease
  • priors: P(disease) = 0.008, P(¬disease) = 0.992
  • likelihoods: P(+|disease) = 0.98, P(−|disease) = 0.02,
    P(+|¬disease) = 0.03, P(−|¬disease) = 0.97
  • Maximum posterior: argmax P(h|D)
  • P(disease|+) ∝ P(+|disease) P(disease) = 0.0078
  • P(¬disease|+) ∝ P(+|¬disease) P(¬disease) = 0.0298
  • P(disease|+) = 0.0078 / (0.0078 + 0.0298) = 0.21
  • P(¬disease|+) = 0.0298 / (0.0078 + 0.0298) = 0.79
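For concreteness, a minimal plain-Python sketch of this calculation, using only the numbers given on the slide:

```python
# Posterior P(disease | +) for the lab-test example, via Bayes theorem
p_disease = 0.008              # prior P(disease)
p_no_disease = 0.992           # prior P(~disease)
p_pos_given_disease = 0.98     # likelihood P(+ | disease)
p_pos_given_no_disease = 0.03  # likelihood P(+ | ~disease)

# unnormalized posteriors: likelihood * prior
joint_disease = p_pos_given_disease * p_disease            # ~0.0078
joint_no_disease = p_pos_given_no_disease * p_no_disease   # ~0.0298

# the evidence P(+) acts as the normalizing constant
evidence = joint_disease + joint_no_disease

print(joint_disease / evidence)     # ~0.21  -> P(disease | +)
print(joint_no_disease / evidence)  # ~0.79  -> P(~disease | +)
```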

11
Basic Formula for Probabilities
  • Product rule: P(A ∧ B) = P(A|B) P(B)
  • Sum rule: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
  • Theorem of total probability: if A1, A2, ..., An
    are mutually exclusive events with Σi P(Ai) = 1, then
  • P(B) = Σi P(B|Ai) P(Ai)
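A small sketch of the theorem of total probability, reusing the lab-test numbers above (the event B is a positive test result):

```python
# A1 = disease, A2 = ~disease: mutually exclusive, priors sum to one
p_A = [0.008, 0.992]         # P(A1), P(A2)
p_B_given_A = [0.98, 0.03]   # P(+ | A1), P(+ | A2)

# P(B) = sum_i P(B | Ai) P(Ai)
p_B = sum(pb * pa for pb, pa in zip(p_B_given_A, p_A))
print(p_B)   # P(+) = 0.98*0.008 + 0.03*0.992 = 0.0376
```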

12
Bayes Theorem Example
  • P(x1,x2|μ1,μ2,σ) = 1/(2πσ²) exp[ −Σi (xi−μi)² / (2σ²) ]
  • h = <μ1,μ2,σ>
  • D = {x1,...,xM}

13
Gaussian Probability Function
  • P(D|μ1,μ2,σ) = Πm P(xm|μ1,μ2,σ)
  • Maximum likelihood hypothesis hML:
  • hML = argmax μ1,μ2,σ P(D|μ1,μ2,σ)
  • Trick: maximize the log-likelihood
  • log P(D|μ1,μ2,σ) = Σm log P(xm|μ1,μ2,σ)
  •   = Σm log [ 1/(2πσ²) exp( −Σi (xmi−μi)² / (2σ²) ) ]
  •   = −M log(2πσ²) − Σm Σi (xmi−μi)² / (2σ²)

14
Gaussian Probability Function
  • ∂ log P(D|μ1,μ2,σ) / ∂μi = 0
  • ⇒ Σm (xmi − μi) = 0 ⇒ μi,ML = (1/M) Σm xmi = E[xmi]
  • ∂ log P(D|μ1,μ2,σ) / ∂σ = 0
  • ⇒ σ²ML = (1/2M) Σm Σi (xmi − μi,ML)²
    = E[ Σi (xmi − μi,ML)² ] / 2
  • Maximum likelihood hypothesis hML = <μi,ML, σML>
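A short numpy sketch of these closed-form ML estimates for a 2-D isotropic Gaussian. The data are synthetic, generated here only to illustrate the formulas, with true parameters chosen near the estimates shown on the next slide:

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = np.array([0.2, -0.1]), 1.4
X = rng.normal(true_mu, true_sigma, size=(500, 2))   # M = 500 samples x_m = (x_m1, x_m2)

M = X.shape[0]
mu_ml = X.mean(axis=0)                           # mu_i,ML = (1/M) sum_m x_mi
sigma2_ml = ((X - mu_ml) ** 2).sum() / (2 * M)   # sigma^2_ML = (1/2M) sum_m sum_i (x_mi - mu_i)^2

print(mu_ml, np.sqrt(sigma2_ml))   # close to the true parameters (0.2, -0.1) and 1.4
```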

15
Maximum Likelihood Hypothesis
  • μML = (0.20, −0.14), σML = 1.42

16
Bayes Decision Rule
  • x: examples of class c1 (Gaussian with parameters μ1, σ1)
  • o: examples of class c2 (Gaussian with parameters μ2, σ2)
17
Bayes Decision Rule
  • Assume we have two Gaussian distributions
    associated with two separate classes c1, c2.
  • P(x|ci) = P(x|μi,σi) = 1/(2πσi²) exp[ −Σj (xj−μij)² / (2σi²) ]
  • Bayes decision rule (maximum posterior probability):
  • Decide c1 if P(c1|x) > P(c2|x),
    otherwise decide c2.
  • If P(c1) = P(c2), use the maximum likelihood P(x|ci);
  • else use the maximum posterior P(ci|x) ∝ P(x|ci) P(ci)
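A sketch of this decision rule for two isotropic 2-D Gaussian classes; all parameter values in the example call are hypothetical, chosen only for illustration:

```python
import numpy as np

def log_gaussian(x, mu, sigma):
    """log P(x | mu, sigma) for a 2-D isotropic Gaussian."""
    x, mu = np.asarray(x), np.asarray(mu)
    return -np.log(2 * np.pi * sigma**2) - np.sum((x - mu)**2) / (2 * sigma**2)

def decide(x, mu1, sigma1, mu2, sigma2, prior1=0.5, prior2=0.5):
    """Bayes decision rule: pick the class with the larger posterior P(ci | x)."""
    post1 = log_gaussian(x, mu1, sigma1) + np.log(prior1)   # log P(x|c1) + log P(c1)
    post2 = log_gaussian(x, mu2, sigma2) + np.log(prior2)   # log P(x|c2) + log P(c2)
    return "c1" if post1 > post2 else "c2"

# hypothetical class parameters and equal priors
print(decide([0.5, 0.5], mu1=[0, 0], sigma1=1.0, mu2=[2, 2], sigma2=1.0))  # -> c1
```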

18
Bayes Decision Rule
(Figure: decision boundary between the regions assigned to c1 and c2.)
19
Two-Category Case
  • Discriminant function:
  • if g(x) > 0 then c1 else c2
  • g(x) = P(c1|x) − P(c2|x)
  •      = P(x|c1) P(c1) − P(x|c2) P(c2)
  • or equivalently g(x) = log P(c1|x) − log P(c2|x)
  •      = log [ P(x|c1)/P(x|c2) ] + log [ P(c1)/P(c2) ]
  • For Gaussian probability functions with identical σ:
  • g(x) = (x−μ2)²/(2σ²) − (x−μ1)²/(2σ²) + log P(c1)
    − log P(c2)
  • the decision surface g(x) = 0 is a line/hyperplane
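A sketch of the equal-variance discriminant in one dimension, with hypothetical parameters; since the x² terms cancel, g(x) is linear and the decision surface is a single threshold:

```python
from math import log

# hypothetical parameters: identical sigma, different means and priors
mu1, mu2, sigma = 1.0, 3.0, 1.0
p1, p2 = 0.6, 0.4

def g(x):
    # g(x) = (x-mu2)^2/(2 sigma^2) - (x-mu1)^2/(2 sigma^2) + log P(c1) - log P(c2)
    return ((x - mu2)**2 - (x - mu1)**2) / (2 * sigma**2) + log(p1) - log(p2)

for x in (1.0, 2.0, 3.0):
    print(x, "c1" if g(x) > 0 else "c2")
# g is linear in x, so g(x) = 0 defines a single threshold (a hyperplane in higher dimensions)
```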

20
Learning a Real Valued Function
(Figure: noisy training examples of the real-valued target function f, the maximum likelihood hypothesis hML fitted to them, and the error e.)
  • Consider a real-valued target function f.
  • Noisy training examples <xi, di> with
  • di = f(xi) + ei
  • ei is a random variable drawn from a Gaussian
    distribution with zero mean.
  • The maximum likelihood hypothesis hML is the one
    that minimizes the sum of squared errors:
  • hML = argmin h∈H Σi (di − h(xi))²

21
Learning a Real Valued Function
  • hML = argmax h∈H P(D|h)
  •     = argmax h∈H Πi p(di|h)
  •     = argmax h∈H Πi (2πσ²)^(−1/2) exp[ −(di−h(xi))² / (2σ²) ]
  • maximizing the logarithm log P(D|h):
  • hML = argmax h∈H Σi [ −0.5 log(2πσ²) − (di−h(xi))² / (2σ²) ]
  •     = argmax h∈H Σi −(di − h(xi))²
  •     = argmin h∈H Σi (di − h(xi))²
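A numpy sketch of this equivalence: with zero-mean Gaussian noise, the ML hypothesis is the least-squares fit. The hypothesis space here is assumed, purely for illustration, to be straight lines h(x) = a·x + b:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 50)
d = 2.0 * x + 0.5 + rng.normal(0.0, 0.1, size=x.shape)   # d_i = f(x_i) + e_i, Gaussian noise

# argmin_h sum_i (d_i - h(x_i))^2 over lines h(x) = a*x + b  (closed-form least squares)
A = np.column_stack([x, np.ones_like(x)])
(a, b), *_ = np.linalg.lstsq(A, d, rcond=None)
print(a, b)   # close to the true slope 2.0 and intercept 0.5
```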

22
Learning to Predict Probabilities
  • Predicting the survival probability of a patient
  • Training examples <xi, di>, where di is 0 or 1
  • Objective: train a neural network to output the
    probability h(xi) ≈ P(di = 1 | xi)
  • Maximum likelihood hypothesis:
  • hML = argmax h∈H Σi [ di ln h(xi) + (1−di) ln (1−h(xi)) ]
  • i.e. maximize the cross entropy between di and h(xi)
  • Weight update rule for the synapses wk to the output
    neuron h(xi): wk ← wk + η Σi (di − h(xi)) xik
  • Compare to the standard BP weight update rule:
  • wk ← wk + η Σi h(xi) (1 − h(xi)) (di − h(xi)) xik
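A sketch of the first update rule for a single sigmoid output unit (essentially logistic regression); the data, the true weights used to label them, and the learning rate η are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))                            # instances x_i with 3 attributes
d = (X @ np.array([1.0, -2.0, 0.5]) > 0).astype(float)   # targets d_i in {0, 1}

w = np.zeros(3)
eta = 0.1
for _ in range(200):
    h = 1.0 / (1.0 + np.exp(-(X @ w)))   # h(x_i) = sigmoid(w . x_i)
    w += eta * X.T @ (d - h)             # w_k <- w_k + eta * sum_i (d_i - h(x_i)) x_ik

print(w)   # weights that (approximately) maximize sum_i d_i ln h(x_i) + (1-d_i) ln(1-h(x_i))
```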

23
Most Probable Classification
  • So far we have sought the most probable hypothesis
    hMAP.
  • What is the most probable classification of a new
    instance x given the data D?
  • hMAP(x) is not necessarily the most probable
    classification, although it is often a sufficiently
    good approximation of it.
  • Consider three possible hypotheses:
  • P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
  • Given a new instance x: h1(x) = +, h2(x) = −, h3(x) = −
  • hMAP(x) = h1(x) = +
  • most probable classification:
  • P(+) = P(h1|D) = 0.4, P(−) = P(h2|D) + P(h3|D) = 0.6

24
Bayes Optimal Classifier
  • cmax = argmax cj∈C Σ hi∈H P(cj|hi) P(hi|D)
  • Example:
  • P(h1|D) = 0.4, P(h2|D) = 0.3, P(h3|D) = 0.3
  • P(+|h1) = 1, P(−|h1) = 0
  • P(+|h2) = 0, P(−|h2) = 1
  • P(+|h3) = 0, P(−|h3) = 1
  • therefore
  • Σ hi∈H P(+|hi) P(hi|D) = 0.4
  • Σ hi∈H P(−|hi) P(hi|D) = 0.6
  • argmax cj∈C Σ hi∈H P(cj|hi) P(hi|D) = −
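A plain-Python sketch of the Bayes optimal classifier on the three-hypothesis example above:

```python
# posterior over hypotheses and each hypothesis's prediction P(c | h_i) for classes '+' and '-'
posteriors = {"h1": 0.4, "h2": 0.3, "h3": 0.3}
p_class_given_h = {
    "h1": {"+": 1.0, "-": 0.0},
    "h2": {"+": 0.0, "-": 1.0},
    "h3": {"+": 0.0, "-": 1.0},
}

# c_max = argmax_c sum_h P(c | h) P(h | D)
votes = {c: sum(p_class_given_h[h][c] * posteriors[h] for h in posteriors)
         for c in ("+", "-")}
print(votes)                        # {'+': 0.4, '-': 0.6}
print(max(votes, key=votes.get))    # '-'  (while h_MAP = h1 alone would predict '+')
```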

25
MAP vs. Bayes Method
  • The maximum a posteriori hypothesis estimates a
    single point hMAP in the hypothesis space H.
  • The Bayes method instead estimates and uses the
    complete posterior distribution P(h|D).
  • The difference appears when MAP or the Bayes method
    is used for inference on unseen instances and one
    compares the predictive distributions P(x|D):
  • MAP: P(x|D) ≈ hMAP(x), with hMAP = argmax h∈H P(h|D)
  • Bayes: P(x|D) = Σ hi∈H P(x|hi) P(hi|D)
  • For reasonable prior distributions P(h), the MAP and
    Bayes solutions are equivalent in the asymptotic
    limit of infinite training data D.

26
Naïve Bayes Classifier
  • a popular, simple learning algorithm
  • applicable when a moderate or large training set is
    available
  • assumption: the attributes that describe instances
    are conditionally independent given the
    classification (in practice it works surprisingly
    well even if the assumption is violated)
  • Applications:
  • diagnosis
  • text classification (newsgroup articles: 20
    newsgroups, 1000 documents per newsgroup,
    classification accuracy 89%)

27
Naïve Bayes Classifier
  • Assume a discrete target function f: X → C, where
    each instance x is described by attributes
    <a1,a2,...,an>
  • The most probable value of f(x) is:
  • cMAP = argmax cj∈C P(cj | <a1,a2,...,an>)
  •      = argmax cj∈C P(<a1,a2,...,an> | cj) P(cj) /
    P(<a1,a2,...,an>)
  •      = argmax cj∈C P(<a1,a2,...,an> | cj) P(cj)
  • Naïve Bayes assumption: P(<a1,a2,...,an> | cj) = Πi P(ai|cj)
  • cNB = argmax cj∈C P(cj) Πi P(ai|cj)

28
Naïve Bayes Learning Algorithm
  • Naïve_Bayes_Learn(examples)
  • for each target value cj, estimate P(cj)
  • for each value ai of each attribute a, estimate
    P(ai|cj)
  • Classify_New_Instance(x)
  • cNB = argmax cj∈C P(cj) Π ai∈x P(ai|cj)

29
Naïve Bayes Example
  • Consider the PlayTennis data and the new instance
  • <Outlook=Sunny, Temp=cool, Humidity=high,
    Wind=strong>
  • Compute cNB = argmax cj∈C P(cj) Π ai∈x P(ai|cj)
  • playtennis: (9+, 5−)
  • P(yes) = 9/14, P(no) = 5/14
  • wind=strong: (3+, 3−)
  • P(strong|yes) = 3/9, P(strong|no) = 3/5
  • P(yes) P(sun|yes) P(cool|yes) P(high|yes)
    P(strong|yes) = 0.005
  • P(no) P(sun|no) P(cool|no) P(high|no)
    P(strong|no) = 0.021
  • → cNB = no
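A sketch of the classification step. The conditional probabilities not listed on the slide (e.g. P(sunny|yes) = 2/9) are the usual estimates from the 14-example PlayTennis table in Mitchell, chapter 6; treat them here as assumed inputs:

```python
# estimated probabilities from the PlayTennis data (Mitchell, ch. 6)
prior = {"yes": 9/14, "no": 5/14}
cond = {  # P(attribute value | class)
    "yes": {"sunny": 2/9, "cool": 3/9, "high": 3/9, "strong": 3/9},
    "no":  {"sunny": 3/5, "cool": 1/5, "high": 4/5, "strong": 3/5},
}

x = ["sunny", "cool", "high", "strong"]   # the new instance

scores = {}
for c in prior:
    score = prior[c]
    for a in x:                # naive Bayes: multiply P(a_i | c) over the attributes
        score *= cond[c][a]
    scores[c] = score

print(scores)                        # {'yes': ~0.005, 'no': ~0.021}
print(max(scores, key=scores.get))   # 'no'
```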

30
Estimating Probabilities
  • What if none (nc = 0) of the training instances
    with target value cj have attribute value ai?
  • Then P(ai|cj) = nc/n = 0 and P(cj) Π ai∈x P(ai|cj) = 0
  • Solution: Bayesian (m-)estimate for P(ai|cj):
  • P(ai|cj) = (nc + m·p) / (n + m)
  • n: number of training examples for which c = cj
  • nc: number of examples for which c = cj and a = ai
  • p: prior estimate of P(ai|cj)
  • m: weight given to the prior (number of virtual
    examples)
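A one-function sketch of the m-estimate; the values in the example call are hypothetical:

```python
def m_estimate(n_c, n, p, m):
    """Bayesian (m-)estimate of P(a_i | c_j): (n_c + m*p) / (n + m)."""
    return (n_c + m * p) / (n + m)

# hypothetical: attribute value never seen among 5 examples of the class,
# uniform prior p = 1/3 over three attribute values, m = 3 virtual examples
print(m_estimate(n_c=0, n=5, p=1/3, m=3))   # 0.125 instead of 0, so the product no longer collapses
```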

31
Bayesian Belief Networks
  • the naïve assumption of conditional independence is
    too restrictive
  • but the full joint probability distribution is
    intractable due to lack of data
  • Bayesian belief networks describe conditional
    independence among subsets of variables
  • this allows combining prior knowledge about causal
    relationships among variables with observed data

32
Conditional Independence
  • Definition: X is conditionally independent of Y
    given Z if the probability distribution governing
    X is independent of the value of Y given the
    value of Z, that is, if
  • ∀ xi, yj, zk: P(X=xi | Y=yj, Z=zk) = P(X=xi | Z=zk)
  • or more compactly P(X|Y,Z) = P(X|Z)
  • Example: Thunder is conditionally independent of
    Rain given Lightning
  • P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
  • Notice: P(Thunder | Rain) ≠ P(Thunder)
  • Naïve Bayes uses conditional independence to
    justify
  • P(X,Y|Z) = P(X|Y,Z) P(Y|Z) = P(X|Z) P(Y|Z)

33
Bayesian Belief Network
(Figure: belief network over the variables Storm, BusTourGroup, Lightning, Campfire, Thunder, ForestFire.)
  • The network represents a set of conditional
    independence assertions:
  • Each node is conditionally independent of its
    non-descendants, given its immediate predecessors
    (its parents in the directed acyclic graph).

34
Bayesian Belief Network
(Figure: the same belief network, with the conditional probability table P(C|S,B) for Campfire given its parents Storm and BusTourGroup.)
  • The network represents the joint probability
    distribution over all variables:
  • P(Storm, BusTourGroup, Lightning, Campfire, Thunder,
    ForestFire)
  • P(y1,...,yn) = Πi=1..n P(yi | Parents(Yi))
  • the joint distribution is fully defined by the graph
    plus the conditional probabilities P(yi | Parents(Yi))
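A sketch of this factorization for a fragment of the network. Only the structure follows the slides (e.g. Campfire with parents Storm and BusTourGroup, as in P(C|S,B)); every numeric CPT entry below is hypothetical:

```python
# Fragment of the belief network; all probability values are hypothetical illustrations.
parents = {"Storm": [], "BusTourGroup": [], "Lightning": ["Storm"],
           "Thunder": ["Lightning"], "Campfire": ["Storm", "BusTourGroup"]}

cpt = {  # P(node = True | parent values), keyed by the tuple of parent values
    "Storm":        {(): 0.1},
    "BusTourGroup": {(): 0.3},
    "Lightning":    {(True,): 0.7, (False,): 0.01},
    "Thunder":      {(True,): 0.9, (False,): 0.05},
    "Campfire":     {(True, True): 0.1, (True, False): 0.2,
                     (False, True): 0.9, (False, False): 0.4},
}

def joint(assignment):
    """P(y1,...,yn) = prod_i P(y_i | Parents(Y_i)) for a full True/False assignment."""
    p = 1.0
    for node, value in assignment.items():
        parent_vals = tuple(assignment[par] for par in parents[node])
        p_true = cpt[node][parent_vals]
        p *= p_true if value else (1.0 - p_true)
    return p

print(joint({"Storm": True, "BusTourGroup": False,
             "Lightning": True, "Thunder": True, "Campfire": False}))
```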

35
Expectation Maximization (EM)
  • When to use:
  • data is only partially observable
  • unsupervised clustering: the target value is
    unobservable
  • supervised learning: some instance attributes are
    unobservable
  • Applications:
  • training Bayesian belief networks
  • unsupervised clustering
  • learning hidden Markov models

36
Generating Data from Mixture of Gaussians
  • Each instance x is generated by
  • choosing one of the k Gaussians at random
  • generating an instance according to that Gaussian

37
EM for Estimating k Means
  • Given:
  • instances from X generated by a mixture of k
    Gaussians
  • unknown means <μ1,...,μk> of the k Gaussians
  • we don't know which instance xi was generated by
    which Gaussian
  • Determine:
  • maximum likelihood estimates of <μ1,...,μk>
  • Think of the full description of each instance as
    yi = <xi, zi1, zi2>
  • zij is 1 if xi was generated by the j-th Gaussian
  • xi is observable
  • zij is unobservable

38
EM for Estimating k Means
  • EM algorithm: pick a random initial h = <μ1,μ2>, then
    iterate:
  • E step: Calculate the expected value E[zij] of
    each hidden variable zij, assuming the current
    hypothesis h = <μ1,μ2> holds:
  • E[zij] = p(x=xi | μ=μj) / Σn=1..2 p(x=xi | μ=μn)
  •        = exp(−(xi−μj)²/(2σ²)) / Σn=1..2
    exp(−(xi−μn)²/(2σ²))
  • M step: Calculate a new maximum likelihood
    hypothesis h' = <μ1',μ2'>, assuming the value taken
    on by each hidden variable zij is its expected
    value E[zij] calculated in the E step. Replace
    h = <μ1,μ2> by h' = <μ1',μ2'>:
  • μj' = Σi=1..m E[zij] xi / Σi=1..m E[zij]
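A numpy sketch of this two-mean EM loop; σ is assumed known and fixed (as on the slide), and the 1-D data are synthetic:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 1.0
# synthetic data from a mixture of two Gaussians with (hidden) means -2 and 3
x = np.concatenate([rng.normal(-2.0, sigma, 100), rng.normal(3.0, sigma, 100)])

mu = rng.normal(size=2)   # random initial hypothesis h = <mu1, mu2>
for _ in range(50):
    # E step: E[z_ij] proportional to exp(-(x_i - mu_j)^2 / (2 sigma^2)), normalized over j
    w = np.exp(-(x[:, None] - mu[None, :])**2 / (2 * sigma**2))
    w /= w.sum(axis=1, keepdims=True)
    # M step: mu_j <- sum_i E[z_ij] x_i / sum_i E[z_ij]
    mu = (w * x[:, None]).sum(axis=0) / w.sum(axis=0)

print(mu)   # close to the true means -2 and 3 (up to ordering)
```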

39
EM Algorithm
  • Converges to a local maximum of the likelihood and
    provides estimates of the hidden variables zij.
  • In fact, it finds a local maximum of E[ln P(Y|h)]
  • Y is the complete data (observable plus unobservable
    variables)
  • the expected value is taken over the possible values
    of the unobserved variables in Y

40
General EM Problem
  • Given:
  • observed data X = {x1,...,xm}
  • unobserved data Z = {z1,...,zm}
  • a parameterized probability distribution P(Y|h),
    where
  • Y = {y1,...,ym} is the full data, yi = <xi,zi>
  • h are the parameters
  • Determine:
  • h that (locally) maximizes E[ln P(Y|h)]
  • Applications:
  • train Bayesian belief networks
  • unsupervised clustering
  • hidden Markov models

41
General EM Method
  • Define a likelihood function Q(h'|h) which
    calculates
  • Y = X ∪ Z, using the observed X and the current
    parameters h to estimate Z
  • Q(h'|h) = E[ ln P(Y|h') | h, X ]
  • EM algorithm:
  • Estimation (E) step: Calculate Q(h'|h) using the
    current hypothesis h and the observed data X to
    estimate the probability distribution over Y:
  • Q(h'|h) ← E[ ln P(Y|h') | h, X ]
  • Maximization (M) step: Replace hypothesis h by
    the hypothesis h' that maximizes this Q function:
  • h ← argmax h'∈H Q(h'|h)