Title: EM algorithm and applications
1 EM algorithm and applications
2 Relative Entropy
- Let p, q be two probability distributions on the same sample space. The relative entropy between p and q is defined by
  H(p||q) = Σx p(x) log(p(x)/q(x)) = Σx p(x) log(1/q(x)) − Σx p(x) log(1/p(x)).
- It measures the inefficiency of assuming distribution q when the correct distribution is p.
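A minimal sketch in Python (an illustration, not part of the slides), assuming finite distributions given as dicts, with q(x) > 0 wherever p(x) > 0:

```python
import math

# Relative entropy H(p||q) for finite distributions given as dicts
# mapping sample points to probabilities.
def relative_entropy(p, q):
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

p = {'H': 0.5, 'T': 0.5}
q = {'H': 0.25, 'T': 0.75}
print(relative_entropy(p, q))   # > 0; equals 0 only when p == q
```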
3 EM algorithm: approximating MLE from Incomplete Data
- Finding the MLE parameters is a nonlinear optimization problem: maximize log P(x|λ).
- [Figure: log P(x|λ) and the auxiliary function E[log P(x,y|λ)] plotted as functions of λ.]
- General idea of EM: use the current point λ to construct an alternative function Qλ (which is "nice"); if Qλ(λ*) > Qλ(λ), then λ* has higher likelihood than λ.
4 The EM algorithm
- Consider a model where, for observed data x and model parameters λ:
  p(x|λ) = Σy p(x,y|λ)
- (y are the hidden data).
- The EM algorithm receives x and parameters λ, and returns new parameters λ*, s.t. p(x|λ*) > p(x|λ).
5 The EM algorithm
- Finding λ* which maximizes
  p(x|λ*) = Σy p(x,y|λ*)
- is equivalent to finding λ* which maximizes the logarithm
  log p(x|λ*) = log(Σy p(x,y|λ*)),
- which is what the EM algorithm attempts to do.
6 The EM algorithm
- In each iteration the EM algorithm does the following:
- (E step) Calculate
  Qλ(λ*) = Σy p(y|x,λ) log p(x,y|λ*)
- (M step) Find the λ* which maximizes Qλ(λ*)
- (The next iteration sets λ ← λ* and repeats).
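The iteration structure can be summarized in a short sketch. The callables e_step and m_step below are hypothetical stand-ins for the model-specific computations, not the lecture's notation:

```python
# A minimal sketch of the generic EM loop. e_step(x, lam) returns the
# expected sufficient statistics under p(y|x, lam); m_step(stats)
# returns the lam* that maximizes Q_lam(lam*).
def em(x, lam, e_step, m_step, n_iters=100):
    for _ in range(n_iters):
        stats = e_step(x, lam)   # E step
        lam = m_step(stats)      # M step; next iteration uses lam <- lam*
    return lam
```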
7 Example: Baum-Welch EM for HMM
- The Baum-Welch algorithm is the EM algorithm for HMMs.
- E step for HMM: Qλ(λ*) = Σs p(s|x,λ) log p(s,x|λ*), where λ* are the new parameters {akl, ek(b)}.
- (The Akl and Ek(b) are the counts of state transitions and symbol emissions in (s,x).)
8 Baum-Welch EM for HMM
- M step for HMM: Find λ* which maximizes Qλ(λ*). As we proved, λ* is given by the relative frequencies of the Akl's and the Ek(b)'s.
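Written out for concreteness (the standard Baum-Welch update; the slide states it only in words), the relative-frequency M-step is:

```latex
a^*_{kl} = \frac{A_{kl}}{\sum_{l'} A_{kl'}}, \qquad
e^*_k(b) = \frac{E_k(b)}{\sum_{b'} E_k(b')}.
```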
9 A simplest example: EM for 2 coin tosses
- Given a coin with two possible outcomes, H (head) and T (tail), with probabilities qH and qT = 1 − qH.
- The coin is tossed twice, but only the 1st outcome, T, is seen. So the data is x = (T, ·).
- We wish to apply the EM algorithm to get parameters that increase the likelihood of the data.
- Let the initial parameters be λ = (qH, qT) = (¼, ¾).
10 EM for 2 coin tosses
The hidden data which can produce x are the sequences y1 = (T,H) and y2 = (T,T). Hence the likelihood of x with parameters (qH, qT) is
  p(x|λ) = P(x,y1|λ) + P(x,y2|λ) = qH·qT + qT².
For the initial parameters λ = (¼, ¾), we have
  p(x|λ) = ¾·¼ + ¾·¾ = ¾.
- Note that in this case P(x,yi|λ) = P(yi|λ), for i = 1,2.
- We can always define y so that (x,y) = y (otherwise we set y′ = (x,y) and replace the y's by the y′'s).
11 EM for 2 coin tosses - E step
- Calculate Qλ(λ*) = Qλ(q*H, q*T):
  Qλ(λ*) = p(y1|x,λ) log p(x,y1|λ*) + p(y2|x,λ) log p(x,y2|λ*)
- p(y1|x,λ) = p(y1,x|λ)/p(x|λ) = (¾·¼)/(¾) = ¼
- p(y2|x,λ) = p(y2,x|λ)/p(x|λ) = (¾·¾)/(¾) = ¾
- Thus we have
  Qλ(λ*) = ¼ log p(x,y1|λ*) + ¾ log p(x,y2|λ*)
12 EM for 2 coin tosses - E step
- For a sequence y of coin tosses, let NH(y) be the number of H's in y, and NT(y) the number of T's in y. Then
  log p(y|λ*) = NH(y) log q*H + NT(y) log q*T
- In our example:
- y1 = (T,H), y2 = (T,T), hence
- NH(y1) = NT(y1) = 1; NH(y2) = 0, NT(y2) = 2
13 Example: 2 coin tosses - E step
- Thus
  ¼ log p(x,y1|λ*) = ¼ (NH(y1) log q*H + NT(y1) log q*T) = ¼ (log q*H + log q*T)
  ¾ log p(x,y2|λ*) = ¾ (NH(y2) log q*H + NT(y2) log q*T) = ¾ (2 log q*T)
- Substituting in the equation for Qλ(λ*):
  Qλ(λ*) = ¼ log p(x,y1|λ*) + ¾ log p(x,y2|λ*)
         = (¼ NH(y1) + ¾ NH(y2)) log q*H + (¼ NT(y1) + ¾ NT(y2)) log q*T
- Writing NH = ¼ NH(y1) + ¾ NH(y2) = ¼ and NT = ¼ NT(y1) + ¾ NT(y2) = 7/4, this becomes
  Qλ(λ*) = NH log q*H + NT log q*T
14 EM for 2 coin tosses - M step
Find λ* which maximizes
  Qλ(λ*) = NH log q*H + NT log q*T = ¼ log q*H + 7/4 log q*T.
We saw earlier that this is maximized by the relative frequencies
  q*H = NH/(NH + NT) = (¼)/2 = 1/8,  q*T = NT/(NH + NT) = (7/4)/2 = 7/8.
The optimal parameters (0,1) will never be reached by the EM algorithm!
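A small sketch of this iteration (an illustration, not from the slides) makes the last remark concrete: qH is halved at every step, so it approaches 0 without ever reaching it. Exact fractions are used so the values can be checked against the computation above.

```python
from fractions import Fraction

# One EM iteration for the 2-coin-toss example: data x = (T, ?),
# with the second toss hidden.
def em_step(qH):
    qT = 1 - qH
    p1, p2 = qT * qH, qT * qT         # joints of y1 = (T,H), y2 = (T,T)
    px = p1 + p2                      # p(x | lambda)
    w1, w2 = p1 / px, p2 / px         # posteriors p(yi | x, lambda)
    NH = w1 * 1 + w2 * 0              # expected number of H's
    NT = w1 * 1 + w2 * 2              # expected number of T's
    return NH / (NH + NT)             # M step: relative frequency

qH = Fraction(1, 4)
for _ in range(5):
    qH = em_step(qH)
    print(qH)   # 1/8, 1/16, 1/32, ... -> qH tends to 0, never reaching it
```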
15 EM for single random variable (dice)
Now, the probability of each y (recall y = (x,y)) is given by a sequence of dice tosses. The dice has m outcomes, with probabilities q1,..,qm. Let Nl(y) = #(outcome l occurs in y). Then
  log p(y|λ*) = Σl Nl(y) log q*l
Let Nl be the expected value of Nl(y), given x and λ:
  Nl = E(Nl(y)|x,λ) = Σy p(y|x,λ) Nl(y)
16 Qλ(λ*) for one dice
By the same computation as for the coin,
  Qλ(λ*) = Σy p(y|x,λ) log p(y|λ*) = Σl Nl log q*l,
which is maximized by the relative frequencies q*l = Nl / Σl' Nl'.
17 EM algorithm for n independent observations x1,…,xn
Expectation step: It can be shown that, if the xj are independent, then
  Nl = Σj Nl^j, where Nl^j = E(Nl(yj)|xj,λ) = Σyj p(yj|xj,λ) Nl(yj)
18 Example: The ABO locus
A locus is a particular place on the chromosome. Each locus state (called a genotype) consists of two alleles: one paternal and one maternal. Some loci (plural of locus) determine distinguishable features. The ABO locus, for example, determines blood type.
Suppose we randomly sampled N individuals and found that Na/a have genotype a/a, Na/b have genotype a/b, etc. Then the MLE is given by the allele frequencies in the sample:
  qa = (2Na/a + Na/b + Na/o) / (2N), and similarly for qb and qo.
19 The ABO locus
However, testing individuals for their genotype is very expensive. Can we estimate the genotype proportions using the common, cheap blood test, whose outcome is one of the four blood types (A, B, AB, O)?
The problem is that among individuals measured to have blood type A, we don't know how many have genotype a/a and how many have genotype a/o. So what can we do?
20 The ABO locus
The Hardy-Weinberg equilibrium rule states that in equilibrium the frequencies of the three alleles qa, qb, qo in the population determine the frequencies of the genotypes as follows: qa/b = 2qaqb, qa/o = 2qaqo, qb/o = 2qbqo, qa/a = qa², qb/b = qb², qo/o = qo². In fact, the Hardy-Weinberg equilibrium rule follows from modeling this problem as data x with hidden data y.
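As a quick consistency check (not on the slide), these genotype frequencies sum to 1, since the two alleles are drawn independently:

```latex
(q_a + q_b + q_o)^2 = q_a^2 + q_b^2 + q_o^2 + 2q_aq_b + 2q_aq_o + 2q_bq_o = 1.
```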
21 The ABO locus
The dice outcomes are the three possible alleles a, b and o. The observed data are the blood types A, B, AB or O. Each blood type is determined by two successive random samplings of alleles, which form an ordered genotype pair; this is the hidden data. For instance, blood type A corresponds to the ordered genotype pairs (a,a), (a,o) and (o,a). So we have three parameters of one dice, qa, qb, qo, that we need to estimate.
22 EM setting for the ABO locus problem
The observed data x = (x1,..,xn) is a sequence of letters (blood types) from the alphabet {A, B, AB, O}; e.g., (B,A,B,B,O,A,B,A,O,B,AB) are observations (x1,…,x11). The hidden data (i.e., the y's) for each letter xj are the ordered pairs of alleles that can generate it. For instance, for A it is the set {(a,a), (a,o), (o,a)}. The parameters λ = {qa, qb, qo} are the probabilities of the alleles; we need to find the λ that maximizes the likelihood of the given data. We do this by the EM algorithm.
23 EM for ABO locus problem
- For each observed blood type xj ∈ {A,B,AB,O} and for each allele z ∈ {a,b,o} we compute Nz(xj), the expected number of times that z appears in xj:
  Nz(xj) = Σyj p(yj|xj,λ) Nz(yj),
where the sum is taken over the ordered genotype pairs yj, and Nz(yj) is the number of times allele z occurs in the pair yj; e.g., Na((o,b)) = 0, Nb((o,b)) = No((o,b)) = 1.
24 EM for ABO locus problem
The computation for blood type B:
  P(B|λ) = P((b,b)|λ) + p((b,o)|λ) + p((o,b)|λ) = qb² + 2qbqo.
Since Nb((b,b)) = 2, and Nb((b,o)) = Nb((o,b)) = No((o,b)) = No((b,o)) = 1, the quantities No(B) and Nb(B), the expected numbers of occurrences of o and b in B, are given by
  Nb(B) = (2qb² + 2qbqo) / (qb² + 2qbqo),  No(B) = 2qbqo / (qb² + 2qbqo).
Observe that Nb(B) + No(B) = 2.
25 EM for ABO loci
Similarly, P(A|λ) = qa² + 2qaqo, with
  Na(A) = (2qa² + 2qaqo) / (qa² + 2qaqo),  No(A) = 2qaqo / (qa² + 2qaqo).
P(AB|λ) = p((b,a)|λ) + p((a,b)|λ) = 2qaqb, and
  Na(AB) = Nb(AB) = 1.
P(O|λ) = p((o,o)|λ) = qo², and
  No(O) = 2.
Finally, Nb(O) = Na(O) = No(AB) = Nb(A) = Na(B) = 0.
26 E step: compute Na, Nb and No
Let #(A) = 3, #(B) = 5, #(AB) = 1, #(O) = 2 be the numbers of observations of A, B, AB, and O respectively. Then the expected allele counts are
  Na = #(A)·Na(A) + #(AB)·Na(AB),  Nb = #(B)·Nb(B) + #(AB)·Nb(AB),
  No = #(A)·No(A) + #(B)·No(B) + #(O)·No(O).
M step: set λ* = (q*a, q*b, q*o) = (Na, Nb, No) / (Na + Nb + No).
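A compact sketch of this E/M iteration (an illustration, not the slides' code), using the observation counts above; with these counts the total expected allele count Na + Nb + No is always 2n = 22. The uniform initial parameters are an arbitrary choice.

```python
# EM for the ABO locus example, with #(A)=3, #(B)=5, #(AB)=1, #(O)=2.
counts = {'A': 3, 'B': 5, 'AB': 1, 'O': 2}

def em_step(qa, qb, qo):
    # E step: expected allele counts per blood type (slides 24-25)
    pA, pB = qa*qa + 2*qa*qo, qb*qb + 2*qb*qo
    Na = counts['A'] * (2*qa*qa + 2*qa*qo) / pA + counts['AB'] * 1
    Nb = counts['B'] * (2*qb*qb + 2*qb*qo) / pB + counts['AB'] * 1
    No = (counts['A'] * 2*qa*qo / pA + counts['B'] * 2*qb*qo / pB
          + counts['O'] * 2)
    # M step: relative frequencies (Na + Nb + No = 2n = 22)
    total = Na + Nb + No
    return Na / total, Nb / total, No / total

qa, qb, qo = 1/3, 1/3, 1/3          # arbitrary initial parameters
for _ in range(20):
    qa, qb, qo = em_step(qa, qb, qo)
print(qa, qb, qo)                    # converges to a local MLE
```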
27 Example: the Motif Finding Problem
- Given a set of DNA sequences:
- cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat
- agtactggtgtacatttgatacgtacgtacaccggcaacctgaaacaaacgctcagaaccagaagtgc
- aaacgtacgtgcaccctctttcttcgtggctctggccaacgagggctgatgtataagacgaaaatttt
- agcctccgatgtaagtcatagctgtaactattacctgccacccctattacatcttacgtacgtataca
- ctgttatacaacgcgtcatggcggggtatgcgttttggtcgtcgtacgctcgatcgttaacgtacgtc
- Find the motif in each of the individual sequences.
28 Motif Finding Problem: a reformulation
- Collect all substrings of the same length k from the input sequences.
- From N sequences, each of length L, n = N·(L−k+1) substrings can be derived.
- Find a significant number of substrings that can be described by a profile model.
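For instance, a one-line sketch of the substring collection step (illustrative; the sequence below is the first one from the previous slide):

```python
# Collect all length-k substrings from a list of sequences.
def all_kmers(seqs, k):
    return [s[i:i + k] for s in seqs for i in range(len(s) - k + 1)]

seq = "cctgatagacgctatctggctatccacgtacgtaggtcctctgtgcgaatctatgcgtttccaaccat"
print(len(all_kmers([seq], 8)))   # L - k + 1 = 68 - 8 + 1 = 61 substrings
```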
29 Fitting a mixture model by EM
- A finite mixture model:
- Data X = (X1,…,Xn) arises from two or more groups, with g models θ = (θ1,…,θg).
- Indicator vectors Z = (Z1,…,Zn), where Zi = (Zi1,…,Zig), and Zij = 1 if Xi is from group j, and 0 otherwise.
- P(Zij = 1|λ) = λj. For any given i, all Zij are 0 except one j.
- Here g = 2: class 1 (the motif) and class 2 (the background) are given by a position-specific and a general multinomial distribution, respectively.
30 Complete data likelihood
- Under the assumption that the pairs (Zi, Xi) are mutually independent, their joint density may be written
  P(Z, X|λ,θ) = Πij [λj P(Xi|θj)]^Zij
- The log likelihood of the model is thus
  log L(λ, θ|Z, X) = Σij Zij log(λj P(Xi|θj)).
- The EM algorithm iteratively computes the expectation of the likelihood given the observed data X and initial estimates λ⁰ and θ⁰ of λ and θ (the E-step), and then maximizes the result in the free variables λ and θ, leading to new estimates λ* and θ* (the M-step).
31 Mixture models: the E-step
- Since the log likelihood is a sum over i and j of terms multiplying Zij, and these are independent across i, we need only consider the expectation of one such term, given Xi. Using initial parameter values λ⁰ and θ⁰, and the fact that the Zij are binary, we get
  E(Zij|X, λ⁰, θ⁰) = λ⁰j P(Xi|θ⁰j) / Σk λ⁰k P(Xi|θ⁰k) =: Z*ij
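As a sketch (an illustration, not the slides' notation), with the per-class likelihoods precomputed, the responsibilities Z*ij are one normalization away:

```python
import numpy as np

# E-step responsibilities. p_x is an (n, g) array with entries
# P(X_i | theta_j); lam is the length-g vector of mixing proportions.
# Both are hypothetical inputs, computed elsewhere in the model.
def e_step(p_x, lam):
    w = p_x * lam                               # lam_j * P(X_i | theta_j)
    return w / w.sum(axis=1, keepdims=True)     # Z*_ij; each row sums to 1
```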
32 Mixture models: the M-step
- Now we want to maximize the result of the E-step:
  Σij Z*ij log λj + Σij Z*ij log P(Xi|θj).
- The maximization over λ is independent of the rest and is readily achieved with
  λ*j = Σi Z*ij / n.
33 Mixture models: the M-step
- Note that P(Xi|θ1) = Πj Πk fjk^I(k,Xij), and P(Xi|θ2) = Πj Πk f0k^I(k,Xij),
- where Xij is the letter in the jth position of sample i, and I(k,a) = 1 if a = ak, and 0 otherwise.
- Let c0k = Σi Σj Z*i2 I(k,Xij), and cjk = Σi Z*i1 I(k,Xij).
- Here c0k is the expected number of times letter ak appears in the background, and cjk the expected number of times ak appears at position j in occurrences of the motif in the data.
- The maximization over the frequencies f is achieved by
  fjk = cjk / Σk'=1..L cjk' ,  j = 0,1,…,w; k = 1,…,L.
- In practice, care must be taken to avoid zero frequencies, so one adds pseudo-counts: small constants βjk, with βj = Σk βjk, giving
  fjk = (cjk + βjk) / (Σk'=1..L cjk' + βj) ,  j = 0,1,…,w; k = 1,…,L.
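A sketch of this M-step (an illustration under the assumptions noted in the comments; the uniform pseudo-count beta is a simplification of the per-letter constants above):

```python
import numpy as np

# M-step for the two-class motif mixture. X is an (n, w) integer array of
# letter indices in 0..L-1 (w = motif width, L = alphabet size); Z is the
# (n, 2) responsibility matrix from the E-step: column 0 = motif class,
# column 1 = background class.
def m_step(X, Z, L, beta=0.1):
    n, w = X.shape
    lam = Z.sum(axis=0) / n                  # lambda*_j = sum_i Z*_ij / n
    c = np.zeros((w + 1, L))                 # row 0: background counts c0k
    for i in range(n):
        for j in range(w):
            c[0, X[i, j]] += Z[i, 1]         # background class
            c[j + 1, X[i, j]] += Z[i, 0]     # motif class, position j+1
    f = (c + beta) / (c.sum(axis=1, keepdims=True) + L * beta)
    return lam, f                            # each row of f sums to 1
```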