Title: EM algorithm
1. EM algorithm
- LING 572
- Fei Xia
- 03/02/06
2. Outline
- The EM algorithm
- EM for PM models
- Three special cases
- Inside-outside algorithm
- Forward-backward algorithm
- IBM models for MT
3. The EM algorithm
4. Basic setting in EM
- X is a set of data points: the observed data.
- θ is a parameter vector.
- EM is a method to find θ_ML, the value of θ that maximizes the likelihood of the observed data (see below).
- Calculating P(X | θ) directly is hard.
- Calculating P(X, Y | θ) is much simpler, where Y is hidden data (or missing data).
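As a reminder (standard definitions, not necessarily the exact formula from the original slide), the quantity EM maximizes and its relation to the complete-data likelihood are:

\[
\theta_{ML} = \arg\max_{\theta} P(X \mid \theta),
\qquad
P(X \mid \theta) = \sum_{Y} P(X, Y \mid \theta)
\]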
5. The basic EM strategy
- Z = (X, Y)
- Z: complete data (augmented data)
- X: observed data (incomplete data)
- Y: hidden data (missing data)
- Given a fixed x, there could be many possible y's.
- Example: given a sentence x, there could be many state sequences in an HMM that generate x.
6. Examples of EM
               HMM                   PCFG             MT                           Coin toss
X (observed)   sentences             sentences        parallel data                head-tail sequences
Y (hidden)     state sequences       parse trees      word alignments              coin id sequences
θ              a_ij, b_ijk, π_i      P(A → B C)       t(f|e), d(a_j|j,l,m), ...    p1, p2, λ
Algorithm      forward-backward      inside-outside   IBM models                   N/A
7. The log-likelihood function
- L is a function of θ, with X held constant.
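The standard definition, assuming the observed data X = (x1, ..., xn) are independent samples, is:

\[
L(\theta) = \log P(X \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)
\]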
8. The iterative approach for MLE
In many cases, we cannot find the solution directly.
An alternative is to find a sequence of estimates θ^0, θ^1, θ^2, ... such that L(θ^0) ≤ L(θ^1) ≤ L(θ^2) ≤ ...
9. Jensen's inequality
10. Jensen's inequality
log is a concave function.
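For reference, Jensen's inequality for the concave log function: for weights λ_i ≥ 0 with Σ_i λ_i = 1 and positive values x_i,

\[
\log \sum_{i} \lambda_i x_i \;\ge\; \sum_{i} \lambda_i \log x_i
\]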
11. Maximizing the lower bound
Maximizing the lower bound reduces to maximizing the Q function.
12. The Q-function
- Define the Q-function (a function of θ).
- Y is a random vector.
- X = (x1, x2, ..., xn) is a constant (vector).
- θ^t is the current parameter estimate and is a constant (vector).
- θ is the normal variable (vector) that we wish to adjust.
- The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.
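In symbols (standard form; the original slide's notation may differ slightly):

\[
Q(\theta, \theta^{t})
  = E_{Y \mid X, \theta^{t}}\!\left[\log P(X, Y \mid \theta)\right]
  = \sum_{Y} P(Y \mid X, \theta^{t}) \,\log P(X, Y \mid \theta)
\]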
13. The inner loop of the EM algorithm
- E-step: calculate the Q function Q(θ, θ^t).
- M-step: find the θ that maximizes Q(θ, θ^t).
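A compact statement of the two steps (standard formulation):

\[
\text{E-step:}\;\; Q(\theta, \theta^{t}) = E_{Y \mid X, \theta^{t}}\!\left[\log P(X, Y \mid \theta)\right],
\qquad
\text{M-step:}\;\; \theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^{t})
\]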
14. L(θ) is non-decreasing at each iteration
- The EM algorithm produces a sequence of estimates θ^0, θ^1, θ^2, ...
- It can be proved that L(θ^{t+1}) ≥ L(θ^t) for every t.
15. The inner loop of the generalized EM algorithm (GEM)
- E-step: calculate the Q function Q(θ, θ^t).
- M-step: find a θ^{t+1} such that Q(θ^{t+1}, θ^t) ≥ Q(θ^t, θ^t) (any improvement suffices; the maximum is not required).
16. Recap of the EM algorithm
17. Idea 1: find the θ that maximizes the likelihood of the training data
18. Idea 2: find the θ^t sequence
- No analytical solution; use an iterative approach: find a sequence θ^0, θ^1, θ^2, ... such that L(θ^0) ≤ L(θ^1) ≤ L(θ^2) ≤ ...
19. Idea 3: find the θ^{t+1} that maximizes a tight lower bound of L(θ)
20. Idea 4: find the θ^{t+1} that maximizes the Q function, which provides such a lower bound of L(θ)
21. The EM algorithm
- Start with an initial estimate θ^0.
- Repeat until convergence:
- E-step: calculate Q(θ, θ^t).
- M-step: find θ^{t+1} = argmax_θ Q(θ, θ^t).
22. Important classes of EM problems
- Products of multinomial (PM) models
- Exponential families
- Gaussian mixtures
- ...
23. The EM algorithm for PM models
24. PM models
A PM model expresses the complete-data likelihood as a product of multinomial parameters (see the sketch below), where the parameters are partitioned into groups and, for any group j, the parameters in that group sum to one.
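A sketch of the PM form in the notation of the later slides (Count(x, y, r) as defined on slide 31); the indexing on the original slide may differ, and the group symbol S_j is introduced here for convenience:

\[
P(x, y \mid \theta) = \prod_{r} \theta_{r}^{\,Count(x, y, r)},
\qquad
\sum_{r \in S_j} \theta_{r} = 1 \;\; \text{for every parameter group } S_j
\]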
25. HMM is a PM
26. PCFG
- PCFG: each sample point is a pair (x, y):
- x is a sentence.
- y is a possible parse tree for that sentence.
27. PCFG is a PM
28. Q-function for PM
29. Maximizing the Q function
Maximize Q(θ, θ^t) subject to the constraint that the parameters within each multinomial group sum to one.
Use Lagrange multipliers.
30. Optimal solution
The optimal value of each parameter is an expected count divided by a normalization factor (the total expected count of the parameters in the same group); see the sketch below.
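Carrying out the Lagrange-multiplier calculation yields the count-and-normalize update below (a sketch in the PM notation used above; S_j is the multinomial group containing θ_r):

\[
\theta_{r}^{t+1}
 = \frac{\displaystyle \sum_{i} \sum_{y} P(y \mid x_i, \theta^{t}) \, Count(x_i, y, r)}
        {\displaystyle \sum_{r' \in S_j} \sum_{i} \sum_{y} P(y \mid x_i, \theta^{t}) \, Count(x_i, y, r')}
\]

The numerator is the expected count of θ_r; the denominator is the normalization factor.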
31. PM models
θ_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution.
Count(x, y, r) is the number of times that θ_r is seen in the expression for P(x, y | θ).
32. The EM algorithm for PM models
- Calculate expected counts
- Update parameters
33. PCFG example
- Calculate expected counts
- Update parameters
34. The EM algorithm for PM models
// for each iteration
//   for each training example x_i
//     for each possible y
//       for each parameter: accumulate its expected count
//   for each parameter: normalize to obtain the updated estimate
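The following Python sketch spells out the loop structure above for a generic PM model. It is a minimal illustration, not the original slide's pseudocode: the helper names (em_for_pm, p_complete, counts, possible_ys, group_of) are hypothetical, and it enumerates every possible y explicitly, which is only feasible when the hidden space is small.

from collections import defaultdict

def em_for_pm(data, theta, p_complete, counts, possible_ys, group_of, iters=10):
    """EM for a PM model. theta maps a parameter id r to its value;
    p_complete(x, y, theta) returns P(x, y | theta); counts(x, y) returns
    {r: Count(x, y, r)}; group_of[r] names the multinomial group of r."""
    for _ in range(iters):                                  # for each iteration
        expected = defaultdict(float)
        for x in data:                                      # for each training example
            ys = possible_ys(x)
            z = sum(p_complete(x, y, theta) for y in ys)    # P(x | theta)
            for y in ys:                                    # for each possible y
                post = p_complete(x, y, theta) / z          # P(y | x, theta)
                for r, c in counts(x, y).items():           # for each parameter
                    expected[r] += post * c                 # expected count
        totals = defaultdict(float)                         # per-group normalizers
        for r, c in expected.items():
            totals[group_of[r]] += c
        for r in theta:                                     # for each parameter
            if totals[group_of[r]] > 0:
                theta[r] = expected[r] / totals[group_of[r]]
    return theta

Enumerating all y is exponential in general; the inside-outside and forward-backward algorithms below compute the same expected counts with dynamic programming instead.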
35. Inside-outside algorithm
36. Inner loop of the inside-outside algorithm
- Given an input sentence and the current parameter estimates:
1. Calculate the inside probabilities (recursions sketched below)
- Base case
- Recursive case
2. Calculate the outside probabilities
- Base case
- Recursive case
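For reference, here is a sketch of the inside and outside recursions for a PCFG in Chomsky normal form, in common textbook notation (nonterminals N^j, sentence w_1 ... w_m, inside probability β_j(p, q), outside probability α_j(p, q)); the original slide's notation may differ:

\begin{align*}
\text{Inside, base case:} \quad & \beta_j(k, k) = P(N^j \rightarrow w_k) \\
\text{Inside, recursion:} \quad & \beta_j(p, q) = \sum_{r, s} \sum_{d=p}^{q-1}
    P(N^j \rightarrow N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q) \\
\text{Outside, base case:} \quad & \alpha_1(1, m) = 1, \qquad \alpha_j(1, m) = 0 \;\text{ for } j \neq 1 \\
\text{Outside, recursion:} \quad & \alpha_j(p, q) =
    \sum_{f, g} \sum_{e=q+1}^{m} \alpha_f(p, e)\, P(N^f \rightarrow N^j N^g)\, \beta_g(q+1, e) \\
  & \quad + \sum_{f, g} \sum_{e=1}^{p-1} \alpha_f(e, q)\, P(N^f \rightarrow N^g N^j)\, \beta_g(e, p-1)
\end{align*}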
37. Inside-outside algorithm (cont)
3. Collect the counts
4. Normalize and update the parameters
38. Expected counts for PCFG rules
The formula (sketched below) applies when X contains only one sentence; add an outer sum over sentences if X contains multiple sentences.
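A sketch of the single-sentence expected count for a binary rule N^j → N^r N^s, in the same textbook notation as the recursions above (the original slide may write it differently); note that P(w_{1m} | G) = β_1(1, m):

\[
E[\,count(N^j \rightarrow N^r N^s)\,]
 = \frac{1}{P(w_{1m} \mid G)}
   \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1}
     \alpha_j(p, q)\, P(N^j \rightarrow N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)
\]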
39. Expected counts (cont)
40. Relation to EM
- PCFG is a PM model.
- The inside-outside algorithm is a special case of the EM algorithm for PM models.
- X (observed data): each data point is a sentence w_1 ... w_m.
- Y (hidden data): the parse tree Tr.
- θ (parameters): the rule probabilities.
41. Forward-backward algorithm
42. The inner loop of the forward-backward algorithm
- Given an input sequence O_1 ... O_T and the current parameter estimates:
1. Calculate the forward probabilities (recursions sketched below)
- Base case
- Recursive case
2. Calculate the backward probabilities
- Base case
- Recursive case
3. Calculate the expected counts
4. Update the parameters
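A sketch of the forward and backward recursions, assuming an arc-emission HMM in which b_ijk is the probability of emitting symbol k while moving from state i to state j (this convention matches the parameters a_ij, b_ijk, π_i used on the later slides, but the exact indexing on the original slide may differ):

\begin{align*}
\text{Forward, base case:} \quad & \alpha_1(i) = \pi_i \\
\text{Forward, recursion:} \quad & \alpha_{t+1}(j) = \sum_i \alpha_t(i)\, a_{ij}\, b_{ij\,o_t} \\
\text{Backward, base case:} \quad & \beta_{T+1}(i) = 1 \\
\text{Backward, recursion:} \quad & \beta_t(i) = \sum_j a_{ij}\, b_{ij\,o_t}\, \beta_{t+1}(j) \\
\text{Total probability:} \quad & P(O \mid \theta) = \sum_i \alpha_{T+1}(i) = \sum_i \pi_i\, \beta_1(i)
\end{align*}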
43. Expected counts
44. Expected counts (cont)
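A sketch of the expected counts in the same arc-emission notation as above (an assumption, not necessarily the slide's formulation). The posterior probability of taking the arc from state i to state j at time t, emitting o_t, is

\[
p_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_{ij\,o_t}\, \beta_{t+1}(j)}{P(O \mid \theta)}
\]

and the expected counts used to re-estimate a_ij and b_ijk are

\[
E[count(i \rightarrow j)] = \sum_{t=1}^{T} p_t(i, j),
\qquad
E[count(i \rightarrow j, k)] = \sum_{t:\, o_t = k} p_t(i, j)
\]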
45. Relation to EM
- HMM is a PM model.
- The forward-backward algorithm is a special case of the EM algorithm for PM models.
- X (observed data): each data point is an observation sequence O_1 ... O_T.
- Y (hidden data): the state sequence X_1 ... X_T.
- θ (parameters): a_ij, b_ijk, π_i.
46. IBM models for MT
47. Expected counts for (f, e) pairs
- Let Ct(f, e) be the fractional count of the (f, e) pair in the training data.
- The count (sketched below) weights, for each alignment a, the actual number of times e and f are linked in (E, F) by a with the alignment probability P(a | E, F).
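A sketch in the style of IBM Model 1 (an illustration; the original slides may use a different model or notation). The fractional count sums the link count over alignments, weighted by the alignment posterior, and for Model 1 the posterior factorizes over target positions (e_0 denotes the NULL word and l the length of E):

\[
Ct(f, e) = \sum_{(E, F)} \sum_{a} P(a \mid E, F)\; count(f, e; a, E, F),
\qquad
P(a_j = i \mid E, F) = \frac{t(f_j \mid e_i)}{\sum_{i'=0}^{l} t(f_j \mid e_{i'})}
\]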
48. Relation to EM
- The IBM models are PM models.
- The EM algorithm used in the IBM models is a special case of the EM algorithm for PM models.
- X (observed data): each data point is a sentence pair (F, E).
- Y (hidden data): the word alignment a.
- θ (parameters): t(f|e), d(i|j, m, n), etc.
49. Summary
- The EM algorithm
- An iterative approach
- L(θ) is non-decreasing at each iteration
- An optimal solution in the M-step exists for many classes of problems
- The EM algorithm for PM models
- Simpler formulae
- Three special cases
- Inside-outside algorithm
- Forward-backward algorithm
- IBM models for MT
50. Relations among the algorithms
The generalized EM (GEM) contains the EM algorithm; the EM algorithm for PM models and EM for Gaussian mixtures are special cases of EM; and the inside-outside algorithm, the forward-backward algorithm, and the IBM models are special cases of the EM algorithm for PM models.
51. Strengths of EM
- Numerical stability: every iteration of the EM algorithm increases the likelihood of the observed data.
- EM handles parameter constraints gracefully.
52. Problems with EM
- Convergence can be very slow on some problems and is intimately related to the amount of missing information.
- It is guaranteed to improve the probability of the training corpus, which is different from reducing the errors directly.
- It cannot guarantee reaching the global maximum (it could get stuck at a local maximum, a saddle point, etc.).
- As a result, the initial estimate is important.
53. Additional slides
54. Lower bound lemma
If Q(θ, θ^t) ≥ Q(θ^t, θ^t), then L(θ) ≥ L(θ^t).
Proof: see the sketch below.
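A standard statement and proof sketch of the lemma via Jensen's inequality (the original slide's derivation may be arranged differently):

\begin{align*}
L(\theta) - L(\theta^{t})
  &= \log \sum_{Y} P(Y \mid X, \theta^{t})\,
       \frac{P(X, Y \mid \theta)}{P(Y \mid X, \theta^{t})\, P(X \mid \theta^{t})} \\
  &\ge \sum_{Y} P(Y \mid X, \theta^{t})\,
       \log \frac{P(X, Y \mid \theta)}{P(Y \mid X, \theta^{t})\, P(X \mid \theta^{t})}
       \qquad \text{(Jensen's inequality)} \\
  &= Q(\theta, \theta^{t}) - Q(\theta^{t}, \theta^{t})
\end{align*}

Hence Q(θ, θ^t) ≥ Q(θ^t, θ^t) implies L(θ) ≥ L(θ^t).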
55. L(θ) is non-decreasing
Let θ^{t+1} = argmax_θ Q(θ, θ^t).
Then Q(θ^{t+1}, θ^t) ≥ Q(θ^t, θ^t), so L(θ^{t+1}) ≥ L(θ^t) by the lower bound lemma.