Title: EM algorithm
1. EM algorithm
- LING 572
- Fei Xia
- 03/02/06
2. Outline
- The EM algorithm
- EM for PM models
- Three special cases
- Inside-outside algorithm
- Forward-backward algorithm
- IBM models for MT
3. The EM algorithm
4. Basic setting in EM
- X is a set of data points: the observed data.
- θ is a parameter vector.
- EM is a method to find θ_ML, the value of θ that maximizes the likelihood of the observed data (see below).
- Calculating P(X | θ) directly is hard.
- Calculating P(X, Y | θ) is much simpler, where Y is hidden data (or missing data).
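As a reminder (standard definitions, not necessarily the exact formula from the original slide), the quantity EM maximizes and its relation to the complete-data likelihood are:

\[
\theta_{ML} = \arg\max_{\theta} P(X \mid \theta),
\qquad
P(X \mid \theta) = \sum_{Y} P(X, Y \mid \theta)
\]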
5. The basic EM strategy
- Z = (X, Y)
- Z: complete data (augmented data)
- X: observed data (incomplete data)
- Y: hidden data (missing data)
- Given a fixed x, there could be many possible y's.
- Example: given a sentence x, there could be many state sequences in an HMM that generate x.
6. Examples of EM
               HMM                   PCFG             MT                           Coin toss
X (observed)   sentences             sentences        parallel data                head-tail sequences
Y (hidden)     state sequences       parse trees      word alignments              coin id sequences
θ              a_ij, b_ijk, π_i      P(A → B C)       t(f|e), d(a_j|j,l,m), ...    p1, p2, λ
Algorithm      forward-backward      inside-outside   IBM models                   N/A
7. The log-likelihood function
- L is a function of θ, with X held constant.
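The standard definition, assuming the observed data X = (x1, ..., xn) are independent samples, is:

\[
L(\theta) = \log P(X \mid \theta) = \sum_{i=1}^{n} \log P(x_i \mid \theta)
\]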
8. The iterative approach for MLE
In many cases, we cannot find the solution directly.
An alternative is to find a sequence of estimates θ^0, θ^1, θ^2, ... such that L(θ^0) ≤ L(θ^1) ≤ L(θ^2) ≤ ...
9. Jensen's inequality
10. Jensen's inequality
log is a concave function.
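For reference, Jensen's inequality for the concave log function: for weights λ_i ≥ 0 with Σ_i λ_i = 1 and positive values x_i,

\[
\log \sum_{i} \lambda_i x_i \;\ge\; \sum_{i} \lambda_i \log x_i
\]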
11. Maximizing the lower bound
Maximizing the lower bound reduces to maximizing the Q function.
12. The Q-function
- Define the Q-function (a function of θ).
- Y is a random vector.
- X = (x1, x2, ..., xn) is a constant (vector).
- θ^t is the current parameter estimate and is a constant (vector).
- θ is the normal variable (vector) that we wish to adjust.
- The Q-function is the expected value of the complete-data log-likelihood log P(X, Y | θ) with respect to Y, given X and θ^t.
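In symbols (standard form; the original slide's notation may differ slightly):

\[
Q(\theta, \theta^{t})
  = E_{Y \mid X, \theta^{t}}\!\left[\log P(X, Y \mid \theta)\right]
  = \sum_{Y} P(Y \mid X, \theta^{t}) \,\log P(X, Y \mid \theta)
\]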
13. The inner loop of the EM algorithm
- E-step: calculate the Q function Q(θ, θ^t).
- M-step: find the θ that maximizes Q(θ, θ^t).
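A compact statement of the two steps (standard formulation):

\[
\text{E-step:}\;\; Q(\theta, \theta^{t}) = E_{Y \mid X, \theta^{t}}\!\left[\log P(X, Y \mid \theta)\right],
\qquad
\text{M-step:}\;\; \theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^{t})
\]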
14. L(θ) is non-decreasing at each iteration
- The EM algorithm produces a sequence of estimates θ^0, θ^1, θ^2, ...
- It can be proved that L(θ^{t+1}) ≥ L(θ^t) for every t.
15. The inner loop of the generalized EM algorithm (GEM)
- E-step: calculate the Q function Q(θ, θ^t).
- M-step: find a θ^{t+1} such that Q(θ^{t+1}, θ^t) ≥ Q(θ^t, θ^t) (any improvement suffices; the maximum is not required).
16. Recap of the EM algorithm
17. Idea 1: find the θ that maximizes the likelihood of the training data
18. Idea 2: find the θ^t sequence
- No analytical solution; use an iterative approach: find a sequence θ^0, θ^1, θ^2, ... such that L(θ^0) ≤ L(θ^1) ≤ L(θ^2) ≤ ...
19. Idea 3: find the θ^{t+1} that maximizes a tight lower bound of L(θ)
20. Idea 4: find the θ^{t+1} that maximizes the Q function, which provides such a lower bound of L(θ)
21. The EM algorithm
- Start with an initial estimate θ^0.
- Repeat until convergence:
- E-step: calculate Q(θ, θ^t).
- M-step: find θ^{t+1} = argmax_θ Q(θ, θ^t).
22. Important classes of EM problems
- Products of multinomial (PM) models
- Exponential families
- Gaussian mixtures
- ...
23. The EM algorithm for PM models
24. PM models
A PM model expresses the complete-data likelihood as a product of multinomial parameters (see the sketch below), where the parameters are partitioned into groups and, for any group j, the parameters in that group sum to one.
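A sketch of the PM form in the notation of the later slides (Count(x, y, r) as defined on slide 31); the indexing on the original slide may differ, and the group symbol S_j is introduced here for convenience:

\[
P(x, y \mid \theta) = \prod_{r} \theta_{r}^{\,Count(x, y, r)},
\qquad
\sum_{r \in S_j} \theta_{r} = 1 \;\; \text{for every parameter group } S_j
\]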
25. HMM is a PM
26. PCFG
- PCFG: each sample point is a pair (x, y):
- x is a sentence.
- y is a possible parse tree for that sentence.
27. PCFG is a PM
28. Q-function for PM
29. Maximizing the Q function
Maximize Q(θ, θ^t) subject to the constraint that the parameters within each multinomial group sum to one.
Use Lagrange multipliers.
30. Optimal solution
The optimal value of each parameter is an expected count divided by a normalization factor (the total expected count of the parameters in the same group); see the sketch below.
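Carrying out the Lagrange-multiplier calculation yields the count-and-normalize update below (a sketch in the PM notation used above; S_j is the multinomial group containing θ_r):

\[
\theta_{r}^{t+1}
 = \frac{\displaystyle \sum_{i} \sum_{y} P(y \mid x_i, \theta^{t}) \, Count(x_i, y, r)}
        {\displaystyle \sum_{r' \in S_j} \sum_{i} \sum_{y} P(y \mid x_i, \theta^{t}) \, Count(x_i, y, r')}
\]

The numerator is the expected count of θ_r; the denominator is the normalization factor.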
31. PM models
θ_r is the r-th parameter in the model. Each parameter is a member of some multinomial distribution.
Count(x, y, r) is the number of times that θ_r is seen in the expression for P(x, y | θ).
32. The EM algorithm for PM models
- Calculate expected counts
- Update parameters
33. PCFG example
- Calculate expected counts
- Update parameters
34. The EM algorithm for PM models
// for each iteration
//   for each training example x_i
//     for each possible y
//       for each parameter: accumulate its expected count
//   for each parameter: normalize to obtain the updated estimate
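The following Python sketch spells out the loop structure above for a generic PM model. It is a minimal illustration, not the original slide's pseudocode: the helper names (em_for_pm, p_complete, counts, possible_ys, group_of) are hypothetical, and it enumerates every possible y explicitly, which is only feasible when the hidden space is small.

from collections import defaultdict

def em_for_pm(data, theta, p_complete, counts, possible_ys, group_of, iters=10):
    """EM for a PM model. theta maps a parameter id r to its value;
    p_complete(x, y, theta) returns P(x, y | theta); counts(x, y) returns
    {r: Count(x, y, r)}; group_of[r] names the multinomial group of r."""
    for _ in range(iters):                                  # for each iteration
        expected = defaultdict(float)
        for x in data:                                      # for each training example
            ys = possible_ys(x)
            z = sum(p_complete(x, y, theta) for y in ys)    # P(x | theta)
            for y in ys:                                    # for each possible y
                post = p_complete(x, y, theta) / z          # P(y | x, theta)
                for r, c in counts(x, y).items():           # for each parameter
                    expected[r] += post * c                 # expected count
        totals = defaultdict(float)                         # per-group normalizers
        for r, c in expected.items():
            totals[group_of[r]] += c
        for r in theta:                                     # for each parameter
            if totals[group_of[r]] > 0:
                theta[r] = expected[r] / totals[group_of[r]]
    return theta

Enumerating all y is exponential in general; the inside-outside and forward-backward algorithms below compute the same expected counts with dynamic programming instead.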
35. Inside-outside algorithm
36. Inner loop of the inside-outside algorithm
- Given an input sentence and the current parameter estimates:
1. Calculate the inside probabilities (recursions sketched below)
- Base case
- Recursive case
2. Calculate the outside probabilities
- Base case
- Recursive case
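For reference, here is a sketch of the inside and outside recursions for a PCFG in Chomsky normal form, in common textbook notation (nonterminals N^j, sentence w_1 ... w_m, inside probability β_j(p, q), outside probability α_j(p, q)); the original slide's notation may differ:

\begin{align*}
\text{Inside, base case:} \quad & \beta_j(k, k) = P(N^j \rightarrow w_k) \\
\text{Inside, recursion:} \quad & \beta_j(p, q) = \sum_{r, s} \sum_{d=p}^{q-1}
    P(N^j \rightarrow N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q) \\
\text{Outside, base case:} \quad & \alpha_1(1, m) = 1, \qquad \alpha_j(1, m) = 0 \;\text{ for } j \neq 1 \\
\text{Outside, recursion:} \quad & \alpha_j(p, q) =
    \sum_{f, g} \sum_{e=q+1}^{m} \alpha_f(p, e)\, P(N^f \rightarrow N^j N^g)\, \beta_g(q+1, e) \\
  & \quad + \sum_{f, g} \sum_{e=1}^{p-1} \alpha_f(e, q)\, P(N^f \rightarrow N^g N^j)\, \beta_g(e, p-1)
\end{align*}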
37. Inside-outside algorithm (cont)
3. Collect the counts
4. Normalize and update the parameters
38. Expected counts for PCFG rules
The formula (sketched below) applies when X contains only one sentence; add an outer sum over sentences if X contains multiple sentences.
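A sketch of the single-sentence expected count for a binary rule N^j → N^r N^s, in the same textbook notation as the recursions above (the original slide may write it differently); note that P(w_{1m} | G) = β_1(1, m):

\[
E[\,count(N^j \rightarrow N^r N^s)\,]
 = \frac{1}{P(w_{1m} \mid G)}
   \sum_{p=1}^{m-1} \sum_{q=p+1}^{m} \sum_{d=p}^{q-1}
     \alpha_j(p, q)\, P(N^j \rightarrow N^r N^s)\, \beta_r(p, d)\, \beta_s(d+1, q)
\]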
39. Expected counts (cont)
40. Relation to EM
- PCFG is a PM model.
- The inside-outside algorithm is a special case of the EM algorithm for PM models.
- X (observed data): each data point is a sentence w_1 ... w_m.
- Y (hidden data): the parse tree Tr.
- θ (parameters): the rule probabilities.
41. Forward-backward algorithm
42. The inner loop of the forward-backward algorithm
- Given an input sequence O_1 ... O_T and the current parameter estimates:
1. Calculate the forward probabilities (recursions sketched below)
- Base case
- Recursive case
2. Calculate the backward probabilities
- Base case
- Recursive case
3. Calculate the expected counts
4. Update the parameters
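A sketch of the forward and backward recursions, assuming an arc-emission HMM in which b_ijk is the probability of emitting symbol k while moving from state i to state j (this convention matches the parameters a_ij, b_ijk, π_i used on the later slides, but the exact indexing on the original slide may differ):

\begin{align*}
\text{Forward, base case:} \quad & \alpha_1(i) = \pi_i \\
\text{Forward, recursion:} \quad & \alpha_{t+1}(j) = \sum_i \alpha_t(i)\, a_{ij}\, b_{ij\,o_t} \\
\text{Backward, base case:} \quad & \beta_{T+1}(i) = 1 \\
\text{Backward, recursion:} \quad & \beta_t(i) = \sum_j a_{ij}\, b_{ij\,o_t}\, \beta_{t+1}(j) \\
\text{Total probability:} \quad & P(O \mid \theta) = \sum_i \alpha_{T+1}(i) = \sum_i \pi_i\, \beta_1(i)
\end{align*}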
43. Expected counts
44. Expected counts (cont)
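A sketch of the expected counts in the same arc-emission notation as above (an assumption, not necessarily the slide's formulation). The posterior probability of taking the arc from state i to state j at time t, emitting o_t, is

\[
p_t(i, j) = \frac{\alpha_t(i)\, a_{ij}\, b_{ij\,o_t}\, \beta_{t+1}(j)}{P(O \mid \theta)}
\]

and the expected counts used to re-estimate a_ij and b_ijk are

\[
E[count(i \rightarrow j)] = \sum_{t=1}^{T} p_t(i, j),
\qquad
E[count(i \rightarrow j, k)] = \sum_{t:\, o_t = k} p_t(i, j)
\]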
45. Relation to EM
- HMM is a PM model.
- The forward-backward algorithm is a special case of the EM algorithm for PM models.
- X (observed data): each data point is an observation sequence O_1 ... O_T.
- Y (hidden data): the state sequence X_1 ... X_T.
- θ (parameters): a_ij, b_ijk, π_i.
46. IBM models for MT
47. Expected counts for (f, e) pairs
- Let Ct(f, e) be the fractional count of the (f, e) pair in the training data.
- The count (sketched below) weights, for each alignment a, the actual number of times e and f are linked in (E, F) by a with the alignment probability P(a | E, F).
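A sketch in the style of IBM Model 1 (an illustration; the original slides may use a different model or notation). The fractional count sums the link count over alignments, weighted by the alignment posterior, and for Model 1 the posterior factorizes over target positions (e_0 denotes the NULL word and l the length of E):

\[
Ct(f, e) = \sum_{(E, F)} \sum_{a} P(a \mid E, F)\; count(f, e; a, E, F),
\qquad
P(a_j = i \mid E, F) = \frac{t(f_j \mid e_i)}{\sum_{i'=0}^{l} t(f_j \mid e_{i'})}
\]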
48. Relation to EM
- The IBM models are PM models.
- The EM algorithm used in the IBM models is a special case of the EM algorithm for PM models.
- X (observed data): each data point is a sentence pair (F, E).
- Y (hidden data): the word alignment a.
- θ (parameters): t(f|e), d(i|j, m, n), etc.
49. Summary
- The EM algorithm
- An iterative approach
- L(θ) is non-decreasing at each iteration
- An optimal solution in the M-step exists for many classes of problems
- The EM algorithm for PM models
- Simpler formulae
- Three special cases
- Inside-outside algorithm
- Forward-backward algorithm
- IBM models for MT
50. Relations among the algorithms
The generalized EM (GEM) contains the EM algorithm; the EM algorithm for PM models and EM for Gaussian mixtures are special cases of EM; and the inside-outside algorithm, the forward-backward algorithm, and the IBM models are special cases of the EM algorithm for PM models.
51. Strengths of EM
- Numerical stability: every iteration of the EM algorithm increases the likelihood of the observed data.
- EM handles parameter constraints gracefully.
52. Problems with EM
- Convergence can be very slow on some problems and is intimately related to the amount of missing information.
- It is guaranteed to improve the probability of the training corpus, which is different from reducing the errors directly.
- It cannot guarantee reaching the global maximum (it could get stuck at a local maximum, a saddle point, etc.).
- As a result, the initial estimate is important.
53. Additional slides
54. Lower bound lemma
If Q(θ, θ^t) ≥ Q(θ^t, θ^t), then L(θ) ≥ L(θ^t).
Proof: see the sketch below.
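A standard statement and proof sketch of the lemma via Jensen's inequality (the original slide's derivation may be arranged differently):

\begin{align*}
L(\theta) - L(\theta^{t})
  &= \log \sum_{Y} P(Y \mid X, \theta^{t})\,
       \frac{P(X, Y \mid \theta)}{P(Y \mid X, \theta^{t})\, P(X \mid \theta^{t})} \\
  &\ge \sum_{Y} P(Y \mid X, \theta^{t})\,
       \log \frac{P(X, Y \mid \theta)}{P(Y \mid X, \theta^{t})\, P(X \mid \theta^{t})}
       \qquad \text{(Jensen's inequality)} \\
  &= Q(\theta, \theta^{t}) - Q(\theta^{t}, \theta^{t})
\end{align*}

Hence Q(θ, θ^t) ≥ Q(θ^t, θ^t) implies L(θ) ≥ L(θ^t).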
55. L(θ) is non-decreasing
Let θ^{t+1} = argmax_θ Q(θ, θ^t).
Then Q(θ^{t+1}, θ^t) ≥ Q(θ^t, θ^t), so L(θ^{t+1}) ≥ L(θ^t) by the lower bound lemma.