Title: Statistical Learning: From Data to Distributions
1. Statistical Learning (From Data to Distributions)
2. Reminders
- HW5 deadline extended to Friday
3. Agenda
- Learning a probability distribution from data
- Maximum likelihood estimation (MLE)
- Maximum a posteriori (MAP) estimation
- Expectation Maximization (EM)
4. Motivation
- Agent has made observations (data)
- Now must make sense of them (hypotheses)
- Hypotheses alone may be important (e.g., in basic science)
- For inference (e.g., forecasting)
- To take sensible actions (decision making)
- A basic component of economics, the social and hard sciences, engineering, ...
5. Candy Example
- Candy comes in 2 flavors, cherry and lime, with identical wrappers
- Manufacturer makes 5 (indistinguishable) bags
- Suppose we draw...
- What bag are we holding? What flavor will we draw next?

  h1: 100% cherry, 0% lime
  h2: 75% cherry, 25% lime
  h3: 50% cherry, 50% lime
  h4: 25% cherry, 75% lime
  h5: 0% cherry, 100% lime
6. Machine Learning vs. Statistics
- Machine Learning ≈ automated statistics
- This lecture:
  - Bayesian learning, the more traditional statistics (R&N 20.1-3)
  - Learning Bayes nets
7. Bayesian Learning
- Main idea: consider the probability of each hypothesis, given the data
- Data d
- Hypotheses: compute P(h_i | d) for each hypothesis h_i
- (Hypotheses h1-h5 as in the candy example: 100/0, 75/25, 50/50, 25/75, 0/100 percent cherry/lime)
8. Using Bayes' Rule
- P(h_i | d) = α P(d | h_i) P(h_i) is the posterior
  - (Recall 1/α = Σ_i P(d | h_i) P(h_i))
- P(d | h_i) is the likelihood
- P(h_i) is the hypothesis prior
- (Hypotheses h1-h5 as before)
9. Computing the Posterior
- Assume draws are independent
- Let (P(h1), ..., P(h5)) = (0.1, 0.2, 0.4, 0.2, 0.1)
- d = 10 limes drawn in a row
- Then P(d|h1) = 0, P(d|h2) = 0.25^10, P(d|h3) = 0.5^10, P(d|h4) = 0.75^10, P(d|h5) = 1^10 (a numerical sketch follows below)
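A minimal numerical sketch in Python of the posterior computation above (the variable names and rounding are mine, not the slide's):

# Posterior over the five bag hypotheses after observing 10 limes in a row.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]        # P(h1), ..., P(h5)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]      # P(lime | h_i) for each bag type
n_limes = 10                              # d = 10 limes, draws i.i.d.

unnorm = [p ** n_limes * prior for p, prior in zip(p_lime, priors)]  # P(d|h_i) P(h_i)
alpha = 1.0 / sum(unnorm)                                            # normalization constant
posterior = [alpha * u for u in unnorm]                              # P(h_i | d)
print([round(p, 3) for p in posterior])   # roughly [0.0, 0.0, 0.003, 0.101, 0.896]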
10. Posterior Hypotheses
11. Predicting the Next Draw
- P(X | d) = Σ_i P(X | h_i, d) P(h_i | d) = Σ_i P(X | h_i) P(h_i | d)
- [Diagram: hypothesis H is the parent of the data D and the next draw X]
- Probability that the next candy drawn is a lime (computed in the sketch below):
  - P(h1|d) = 0, P(h2|d) ≈ 0.00, P(h3|d) ≈ 0.00, P(h4|d) ≈ 0.10, P(h5|d) ≈ 0.90
  - P(X|h1) = 0, P(X|h2) = 0.25, P(X|h3) = 0.5, P(X|h4) = 0.75, P(X|h5) = 1
  - P(X|d) ≈ 0.975
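The same prediction in code, using the rounded posteriors shown on the slide:

# P(X = lime | d) = sum_i P(X = lime | h_i) P(h_i | d)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]            # P(lime | h_i)
posterior = [0.0, 0.00, 0.00, 0.10, 0.90]       # P(h_i | d), rounded as on the slide
p_next_lime = sum(p * q for p, q in zip(p_lime, posterior))
print(round(p_next_lime, 3))                    # 0.975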
12. P(Next Candy is Lime | d)
13. Other Properties of Bayesian Estimation
- Any learning technique trades off between good fit and hypothesis complexity
- The prior can penalize complex hypotheses
  - There are many more complex hypotheses than simple ones
  - Ockham's razor
14. Hypothesis Spaces are often Intractable
- A hypothesis is a joint probability table over state variables
- 2^n entries ⇒ the hypothesis space is [0,1]^(2^n)
- 2^(2^n) deterministic hypotheses; 6 boolean variables ⇒ over 10^19 (≈ 2^64) deterministic hypotheses
- Summing over hypotheses is expensive!
15. Some Common Simplifications
- Maximum a posteriori (MAP) estimation
  - h_MAP = argmax_{h_i} P(h_i | d)
  - P(X | d) ≈ P(X | h_MAP)
- Maximum likelihood (ML) estimation
  - h_ML = argmax_{h_i} P(d | h_i)
  - P(X | d) ≈ P(X | h_ML)
- Both approximate the true Bayesian predictions as the amount of data grows large (see the sketch below)
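A sketch of both shortcuts on the candy example above, contrasted with the full Bayesian prediction (variable names are mine; the data are the 10 limes from slide 9):

# Pick a single "best" hypothesis instead of summing over all of them.
priors = [0.1, 0.2, 0.4, 0.2, 0.1]
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]
n_limes = 10

likelihood = [p ** n_limes for p in p_lime]                        # P(d | h_i)
post_unnorm = [l * pr for l, pr in zip(likelihood, priors)]        # proportional to P(h_i | d)

h_map = max(range(5), key=lambda i: post_unnorm[i])                # argmax_i P(h_i | d)
h_ml = max(range(5), key=lambda i: likelihood[i])                  # argmax_i P(d | h_i)

print(p_lime[h_map], p_lime[h_ml])   # both 1.0 here: P(X|h_MAP) = P(X|h_ML) = 1,
                                     # versus 0.975 for the full Bayesian prediction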
16. Maximum a Posteriori
- h_MAP = argmax_{h_i} P(h_i | d)
- P(X | d) ≈ P(X | h_MAP)
- [Plot comparing P(X | h_MAP) with P(X | d); h_MAP labeled h3, h4, h5]
17. Maximum a Posteriori
- For large amounts of data, P(incorrect hypothesis | d) → 0
- For small sample sizes, MAP predictions are overconfident
- [Plot comparing P(X | h_MAP) with P(X | d)]
18. Maximum Likelihood
- h_ML = argmax_{h_i} P(d | h_i)
- P(X | d) ≈ P(X | h_ML)
- [Plot comparing P(X | h_ML) with P(X | d); h_ML labeled "undefined", then h5]
19. Maximum Likelihood
- h_ML = h_MAP with a uniform prior
- Relevance of the prior diminishes with more data
- Preferred by some statisticians
  - Are priors cheating?
  - What is a prior anyway?
20. Advantages of MAP and MLE over Bayesian estimation
- Involves an optimization rather than a large summation
  - Local search techniques
- For some types of distributions, there are closed-form solutions that are easily computed
21. Learning Coin Flips (Bernoulli distribution)
- Let the unknown fraction of cherries be θ
- Suppose draws are independent and identically distributed (i.i.d.)
- Observe that c out of N draws are cherries
22. Maximum Likelihood
- Likelihood of data d = d1, ..., dN given θ:
  - P(d | θ) = Π_j P(d_j | θ) = θ^c (1-θ)^(N-c)
  - (The product uses the i.i.d. assumption; then gather the c cherries together, followed by the N-c limes)
23. Maximum Likelihood
- Same as maximizing the log likelihood
  - L(d; θ) = log P(d | θ) = c log θ + (N-c) log(1-θ)
- max_θ L(d; θ) ⇒ dL/dθ = 0 ⇒ 0 = c/θ - (N-c)/(1-θ) ⇒ θ = c/N (a quick numerical check follows below)
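A quick check of θ_ML = c/N, comparing the closed form against a brute-force grid search over the log likelihood (the counts below are made up for illustration):

import math

N, c = 20, 13                     # e.g., 13 cherries in 20 draws

def log_lik(theta):
    # L(d; theta) = c log theta + (N - c) log(1 - theta)
    return c * math.log(theta) + (N - c) * math.log(1 - theta)

grid = [i / 1000 for i in range(1, 1000)]
theta_grid = max(grid, key=log_lik)   # argmax over the grid
print(theta_grid, c / N)              # both are 0.65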
24. Maximum Likelihood for BNs
- For any BN, the ML parameters of any CPT can be derived as the fraction of observed values in the data (see the counting sketch below)
- Example (Earthquake, Burglar → Alarm): N = 1000 samples, B observed 200 times, E observed 500 times
  - P(E) = 0.5, P(B) = 0.2
  - P(A | E, B) = 19/20, P(A | ¬E, B) = 188/200, P(A | E, ¬B) = 170/500, P(A | ¬E, ¬B) = 1/380
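A counting sketch of this idea in Python: each CPT entry is just the fraction of matching rows in the data. The toy generator below only roughly mimics the slide's numbers and is not the slide's own dataset:

import random

random.seed(0)
data = []                                   # samples of (earthquake, burglar, alarm)
for _ in range(1000):
    e = random.random() < 0.5
    b = random.random() < 0.2
    a = random.random() < (0.95 if (e and b) else 0.9 if b else 0.3 if e else 0.01)
    data.append((e, b, a))

def p_alarm_given(e_val, b_val):
    # Fraction of A = true among rows matching the parent configuration.
    rows = [a for (e, b, a) in data if e == e_val and b == b_val]
    return sum(rows) / len(rows)

print(sum(b for _, b, _ in data) / len(data))   # ML estimate of P(B)
print(p_alarm_given(True, True))                # ML estimate of P(A | E, B)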
25. Maximum Likelihood for Gaussian Models
- Observe a continuous variable: x1, ..., xN
- Fit a Gaussian with mean μ and standard deviation σ
- Standard procedure: write the log likelihood L = -N(C + log σ) - Σ_j (x_j - μ)^2 / (2σ^2), where C is a constant
- Set the derivatives to zero
26. Maximum Likelihood for Gaussian Models
- Observe a continuous variable: x1, ..., xN
- Results (see the sketch below):
  - μ = (1/N) Σ_j x_j (sample mean)
  - σ^2 = (1/N) Σ_j (x_j - μ)^2 (sample variance)
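The closed-form estimates in a few lines of Python (the data values are arbitrary):

xs = [2.1, 1.9, 2.5, 2.2, 1.8, 2.0]          # any observed real-valued data

N = len(xs)
mu = sum(xs) / N                              # mu_ML = (1/N) sum_j x_j
var = sum((x - mu) ** 2 for x in xs) / N      # sigma^2_ML = (1/N) sum_j (x_j - mu)^2
print(mu, var)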
27. Maximum Likelihood for Conditional Linear Gaussians
- Y is a child of X
- Data: pairs (x_j, y_j)
- X is Gaussian, Y is a linear Gaussian function of X: Y(x) ~ N(ax + b, σ)
- The ML estimate of a, b is given by least-squares regression, and σ by the standard errors (see the sketch below)
- [Diagram: X → Y]
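A least-squares sketch for the linear Gaussian case, using the standard simple-regression closed form on made-up (x, y) pairs:

xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [0.9, 3.1, 5.2, 6.8, 9.1]

n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n
# Slope and intercept from the least-squares normal equations.
a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
b = my - a * mx
# ML (biased) estimate of the noise variance from the residuals.
sigma2 = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys)) / n
print(a, b, sigma2 ** 0.5)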
28. Back to Coin Flips
- What about Bayesian or MAP learning?
- Motivation:
  - I pick a coin out of my pocket
  - 1 flip turns up heads
  - What's the MLE?
29. Back to Coin Flips
- Need some prior distribution P(θ)
- P(θ | d) ∝ P(d | θ) P(θ) = θ^c (1-θ)^(N-c) P(θ)
- P(θ) defines, for each θ, how strongly I believe in θ
- [Plot: a prior P(θ) over θ from 0 to 1]
30. MAP estimate
- Could maximize θ^c (1-θ)^(N-c) P(θ) using some optimization method
- Turns out that for some families of P(θ) (conjugate priors), the MAP estimate is easy to compute
- [Plot: Beta distributions as priors P(θ) over θ from 0 to 1]
31. Beta Distribution
- Beta_a,b(θ) = α θ^(a-1) (1-θ)^(b-1)
- a, b: hyperparameters
- α is a normalization constant
- Mean at a / (a + b)
32. Posterior with Beta Prior
- Posterior ∝ θ^c (1-θ)^(N-c) P(θ) = α θ^(c+a-1) (1-θ)^(N-c+b-1)
- MAP estimate: θ = (c + a) / (N + a + b)
- The posterior is also a Beta distribution!
  - See heads, increment a
  - See tails, increment b
- The prior specifies a virtual count of a heads and b tails (see the sketch below)
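The conjugate update in code, following the slide's formulas; the prior pseudo-counts and flip data below are made up:

a, b = 2, 2                 # Beta(a, b) prior: as if we had already seen 2 heads, 2 tails
flips = [1, 1, 1, 0, 1]     # 1 = heads, 0 = tails

c, N = sum(flips), len(flips)
a_post, b_post = a + c, b + (N - c)          # posterior is Beta(a + c, b + N - c)
theta_hat = (c + a) / (N + a + b)            # slide's estimate: 6 / 9 = 0.667
print(a_post, b_post, round(theta_hat, 3))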
33. Does this work in general?
- Only specific distributions have the right type of prior
  - Bernoulli, Poisson, geometric, Gaussian, exponential, ...
- Otherwise, MAP requires an (often expensive) numerical optimization
34. How to deal with missing observations?
- A very difficult statistical problem in general
- E.g., surveys:
  - Did the person skip the political-affiliation question at random?
  - Or do independents skip it more often than someone with a strong affiliation?
- Things are better if a variable is completely hidden
35. Expectation Maximization for Gaussian Mixture Models
- Clustering with N Gaussian distributions
- Each datapoint has a label saying which Gaussian it belongs to, but the label is a hidden variable
- E step: compute the probability that each datapoint belongs to each Gaussian
- M step: compute the ML estimates of each Gaussian, weighting each sample by the probability that it belongs to that Gaussian
- (a minimal EM loop follows below)
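A minimal EM loop for a two-component 1-D Gaussian mixture, following the E/M steps above; the data, initialization, and number of components are toy choices of this sketch:

import math

data = [1.0, 1.2, 0.8, 5.1, 4.9, 5.3, 5.0, 1.1]
mus, sigmas, weights = [0.0, 6.0], [1.0, 1.0], [0.5, 0.5]      # initial guesses

def gauss(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

for _ in range(20):
    # E-step: responsibility of each component for each point, P(component | x).
    resp = []
    for x in data:
        p = [w * gauss(x, m, s) for w, m, s in zip(weights, mus, sigmas)]
        z = sum(p)
        resp.append([pk / z for pk in p])
    # M-step: weighted ML estimates of each Gaussian and of the mixing weights.
    for k in range(2):
        nk = sum(r[k] for r in resp)
        mus[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
        sigmas[k] = max(math.sqrt(sum(r[k] * (x - mus[k]) ** 2 for r, x in zip(resp, data)) / nk), 1e-6)
        weights[k] = nk / len(data)

print([round(m, 2) for m in mus])   # component means end up near 1.0 and 5.1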
36. Learning HMMs
- Want to find the transition and observation probabilities
- Data: many sequences O_1:t^(j) for 1 ≤ j ≤ N
- Problem: we don't observe the X's!
- [Diagram: HMM X0 → X1 → X2 → X3 with observations O1, O2, O3]
37. Learning HMMs
- Assume a stationary Markov chain with discrete states x1, ..., xm
- Transition parameters: θ_ij = P(X_t+1 = x_j | X_t = x_i)
- Observation parameters: φ_i = P(O | X_t = x_i)
- [Diagram: HMM X0 → X1 → X2 → X3 with observations O1, O2, O3]
38. Learning HMMs
- Assume a stationary Markov chain with discrete states x1, ..., xm
- Transition parameters: θ_ij = P(X_t+1 = x_j | X_t = x_i)
- Observation parameters: φ_i = P(O | X_t = x_i)
- Initial-state parameters: λ_i = P(X_0 = x_i)
- [Diagram: 3-state HMM over x1, x2, x3 with transition arcs (e.g., θ_13, θ_31) and observation arcs (φ_2, φ_3) to O]
39. Expectation Maximization
- Initialize the parameters randomly
- E-step: infer the expected probabilities of the hidden variables over time, given the current parameters
- M-step: maximize the likelihood of the data over the parameters
- Full parameter vector: θ = (λ_1, λ_2, λ_3, θ_11, θ_12, ..., θ_32, θ_33, φ_1, φ_2, φ_3)
- [Diagram: the 3-state HMM annotated with P(initial state), P(transition i→j), and P(emission)]
40. Expectation Maximization
- θ = (λ_1, λ_2, λ_3, θ_11, θ_12, ..., θ_32, θ_33, φ_1, φ_2, φ_3)
- Initialize θ^(0)
- E: compute expectations under P(Z = z | θ^(0), O), where Z ranges over all combinations of hidden state sequences
  - Result: a probability distribution over the hidden state at each time t
- M: compute θ^(1), the ML estimate of the transition / observation distributions given those expectations
- [Diagram: candidate hidden state sequences over x1, x2, x3, alongside the 3-state HMM]
41. Expectation Maximization
- θ = (λ_1, λ_2, λ_3, θ_11, θ_12, ..., θ_32, θ_33, φ_1, φ_2, φ_3)
- Initialize θ^(0)
- E: compute expectations under P(Z = z | θ^(0), O), where Z ranges over all combinations of hidden state sequences (this is the hard part)
  - Result: a probability distribution over the hidden state at each time t
- M: compute θ^(1), the ML estimate of the transition / observation distributions
- [Diagram: candidate hidden state sequences over x1, x2, x3, alongside the 3-state HMM]
42. E-Step on HMMs
- Computing the expectations can be done by:
  - Sampling
  - Using the forward/backward algorithm on the unrolled HMM (R&N p. 546)
- The latter gives the classic Baum-Welch algorithm (a sketch of the forward/backward pass follows below)
- Note that EM can still get stuck in local optima or even saddle points
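A sketch of the forward/backward pass that supplies the E-step expectations, on a small two-state discrete HMM; the transition/observation matrices below are arbitrary placeholders, and the Baum-Welch M-step (re-estimating them from these expectations) is omitted:

import numpy as np

T = np.array([[0.7, 0.3],            # T[i, j] = P(X_t+1 = j | X_t = i)
              [0.4, 0.6]])
E = np.array([[0.9, 0.1],            # E[i, o] = P(O_t = o | X_t = i)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])            # initial state distribution
obs = [0, 0, 1, 0, 1]                # one observation sequence

# Forward pass: alpha[t, i] proportional to P(X_t = i, O_1:t).
alpha = np.zeros((len(obs), 2))
alpha[0] = pi * E[:, obs[0]]
for t in range(1, len(obs)):
    alpha[t] = (alpha[t - 1] @ T) * E[:, obs[t]]

# Backward pass: beta[t, i] = P(O_t+1:T | X_t = i).
beta = np.ones((len(obs), 2))
for t in range(len(obs) - 2, -1, -1):
    beta[t] = T @ (E[:, obs[t + 1]] * beta[t + 1])

# Smoothed posteriors gamma[t, i] = P(X_t = i | O_1:T): the expected hidden-state
# "counts" that an M-step would turn into new ML parameter estimates.
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
print(gamma.round(3))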
43. Next Time