Title: CRFs vs CMMs, and Stacking
1. CRFs vs. CMMs, and Stacking
- William W. Cohen
- Sep 30, 2009
2. Announcements
- Wednesday, 9/29
- Project abstract due, one/person
- Next Wed, 10/4
- Sign up for a slot to present a paper
- 20 min + Q/A time
- Warning: I might shuffle the schedule around a little after I see the proposals
- Next Friday, 10/8
- Project abstract due, one/team
- Put email addresses of project members on the proposal
3. Conditional Random Fields
4. Label Bias Problem - 1
- Consider this as an HMM, and enough training data to perfectly model it
Pr(0123 | rib) = 1, Pr(0453 | rob) = 1
Pr(0123) = Pr(0453) = 0.5
Pr(rob | 0123) = Pr(r|0) Pr(o|1) Pr(b|2) = 1 · 0 · 1 = 0
Pr(rob | 0453) = Pr(r|0) Pr(o|4) Pr(b|5) = 1 · 1 · 1 = 1
Pr(0453 | rob) = Pr(rob | 0453) Pr(0453) / Z = 1
Pr(0123 | rob) = Pr(rob | 0123) Pr(0123) / Z = 0
(forward/backward)
5. Label Bias Problem - 2
- Consider this as an MEMM, and enough training data to perfectly model it
Pr(0123 | rib) = 1, Pr(0453 | rob) = 1
Pr(0123 | rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1 = 0.5
Pr(0453 | rib) = Pr(4|0,r)/Z1 · Pr(5|4,i)/Z2 · Pr(3|5,b)/Z3 = 0.5 · 1 · 1 = 0.5
No next-state classifier will model this well. There are some things HMMs can learn that MEMMs can't.
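To make the contrast concrete, here is a minimal Python sketch (not from the lecture) of the rib/rob example. The state structure and the maximum-likelihood probabilities match the numbers above; the conditional tables are hard-coded rather than learned.

# States: 0 = start; 1, 2 on the "rib" branch; 4, 5 on the "rob" branch; 3 = final.
# Training data: "rib" -> path 0-1-2-3 and "rob" -> path 0-4-5-3, equally often.

# HMM parameters estimated from that data (each step emits from its source state).
hmm_trans = {(0, 1): 0.5, (0, 4): 0.5, (1, 2): 1.0, (2, 3): 1.0, (4, 5): 1.0, (5, 3): 1.0}
hmm_emit = {(0, "r"): 1.0, (1, "i"): 1.0, (2, "b"): 1.0, (4, "o"): 1.0, (5, "b"): 1.0}

def hmm_joint(path, word):
    """Unnormalized Pr(path, word) under the HMM."""
    p = 1.0
    for (s, s_next), ch in zip(zip(path, path[1:]), word):
        p *= hmm_trans.get((s, s_next), 0.0) * hmm_emit.get((s, ch), 0.0)
    return p

def memm_step(s, s_next, ch):
    """MEMM local probability Pr(s_next | s, ch) at its maximum-likelihood values.
    From state 0 with "r" both branches were seen equally, so each gets 0.5; states
    1, 2, 4, 5 each have a single successor, so the per-state normalizer forces
    probability 1 regardless of the observation -- exactly the label-bias effect."""
    if s == 0:
        return 0.5 if (ch == "r" and s_next in (1, 4)) else 0.0
    return 1.0 if (s, s_next) in {(1, 2), (2, 3), (4, 5), (5, 3)} else 0.0

def memm_prob(path, word):
    """Pr(path | word) under the MEMM (a product of local conditionals)."""
    p = 1.0
    for (s, s_next), ch in zip(zip(path, path[1:]), word):
        p *= memm_step(s, s_next, ch)
    return p

word, paths = "rob", [(0, 1, 2, 3), (0, 4, 5, 3)]
z = sum(hmm_joint(p, word) for p in paths)
for p in paths:
    print(p, "HMM posterior:", hmm_joint(p, word) / z, "MEMM:", memm_prob(p, word))
# The HMM puts posterior 1 on 0-4-5-3; the MEMM gives both paths 0.5.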
6. From MEMMs to CRFs
7. CRF inference
(succinct like maxent; can see the locality)
To classify, find the highest-weight path through
the lattice. The normalizer is the sum of the
weights of all paths through the lattice.
[Figure: label lattice over the tokens "When will prof Cohen post", with rows B, I, O at each position]
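A minimal Python sketch of the inference just described, under assumed names: edge_score stands in for the exponentiated sum of weighted CRF features on a lattice edge, and the toy scores are made up. Viterbi finds the highest-weight path; the same recursion with sums in place of maxes gives the normalizer Z.

LABELS = ["B", "I", "O"]
tokens = ["When", "will", "prof", "Cohen", "post"]

def edge_score(tokens, j, y_prev, y):
    """Unnormalized weight of labeling tokens[j] with y when the previous label is
    y_prev (None at j = 0). A stand-in for exp(sum of weighted CRF features)."""
    s = 2.0 if (y == "B" and tokens[j][0].isupper()) else 1.0
    if y_prev == "B" and y == "I":
        s *= 1.5
    return s

def viterbi_and_Z(tokens):
    best = {y: edge_score(tokens, 0, None, y) for y in LABELS}   # max-product scores
    alpha = dict(best)                                           # sum-product scores
    back = [{}]
    for j in range(1, len(tokens)):
        new_best, new_alpha, ptr = {}, {}, {}
        for y in LABELS:
            scored = [(best[yp] * edge_score(tokens, j, yp, y), yp) for yp in LABELS]
            new_best[y], ptr[y] = max(scored)
            new_alpha[y] = sum(alpha[yp] * edge_score(tokens, j, yp, y) for yp in LABELS)
        best, alpha = new_best, new_alpha
        back.append(ptr)
    # trace back the highest-weight path
    y = max(best, key=best.get)
    path = [y]
    for j in range(len(tokens) - 1, 0, -1):
        y = back[j][y]
        path.append(y)
    path.reverse()
    Z = sum(alpha.values())   # sum of the weights of all paths through the lattice
    return path, Z

print(viterbi_and_Z(tokens))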
8. (can see the locality)
[Figure: the same B/I/O lattice over "When will prof Cohen post"]
9. With Z_{j,y} we can also compute things like:
- what's the probability that y2 = B?
- what's the probability that y2 = B and y3 = I?
[Figure: the same B/I/O lattice over "When will prof Cohen post"]
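In standard forward-backward notation (assumed here, not transcribed from the slide), these marginals look as follows.

% alpha_j(y), the total weight of all partial paths ending in label y at position j,
% plays the role of the slide's Z_{j,y}; beta_j(y) is the corresponding backward
% quantity, n is the sequence length, and M_3(B, I) is the unnormalized weight of the
% B -> I edge into position 3.
\[
  Z = \sum_{y} \alpha_n(y), \qquad
  \Pr(y_2 = \mathrm{B} \mid \mathbf{x}) = \frac{\alpha_2(\mathrm{B})\,\beta_2(\mathrm{B})}{Z}, \qquad
  \Pr(y_2 = \mathrm{B},\, y_3 = \mathrm{I} \mid \mathbf{x})
     = \frac{\alpha_2(\mathrm{B})\, M_3(\mathrm{B},\mathrm{I})\, \beta_3(\mathrm{I})}{Z}.
\]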
10. CRF learning
[Figure: B/I/O lattice over the tokens "When will prof Cohen post the notes"]
- Goal is to learn how to weight edges in the graph, e.g.:
weight(y_i, y_{i+1}) = 2 · [(y_i = B or I) and isCap(x_i)] + 1 · [y_i = B and isFirstName(x_i)] - 5 · [y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
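In standard linear-chain CRF notation (a hedged reconstruction; only the indicator features and the weights 2, 1, -5 above come from the slide), the edge weights and the resulting conditional distribution are:

\[
  \mathrm{weight}(y_i, y_{i+1}) = \sum_k \lambda_k\, f_k(y_i, y_{i+1}, \mathbf{x}, i),
  \qquad
  \Pr(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
     \exp\Big(\sum_i \mathrm{weight}(y_i, y_{i+1})\Big),
\]
% so the +2, +1 and -5 in the example above are three particular weights lambda_k
% attached to indicator features f_k, and learning means fitting the lambda_k.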
11. CRF learning (from Sha & Pereira)
12. CRF learning (from Sha & Pereira)
13. CRF learning (from Sha & Pereira)
Something like forward-backward
- Idea
- Define a matrix of y, y' affinities at stage j
- M_j[y, y'] = unnormalized probability of a transition from y to y' at stage j, as in the notes above
- M_j · M_{j+1} = unnormalized probability of any path through stages j and j+1
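A small numpy sketch of that matrix idea, with made-up entries and assumed shapes (an illustration, not Sha & Pereira's code): multiplying the M_j matrices and summing over start and end labels gives the partition function Z.

import numpy as np

n_labels = 3                       # e.g. B, I, O
length = 5                         # e.g. the five tokens of "When will prof Cohen post"
rng = np.random.default_rng(0)
# M[j][y, y'] = unnormalized weight of a y -> y' transition into position j+1 (toy values)
M = [np.exp(rng.normal(size=(n_labels, n_labels))) for _ in range(length - 1)]

alpha = np.ones(n_labels)          # toy start weights for the label at position 0
for Mj in M:                       # forward recursion: alpha_j = alpha_{j-1} @ M_j
    alpha = alpha @ Mj
Z = alpha.sum()                    # partition function: total weight of all paths

# Equivalently, multiply all the M_j together and sum over start and end labels.
prod = np.linalg.multi_dot(M)
assert np.isclose(Z, np.ones(n_labels) @ prod @ np.ones(n_labels))
print("Z =", Z)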
14. Forward-backward ideas
[Figure: trellis with states name / nonName at three positions, edges labeled a through h]
15. CRF learning (from Sha & Pereira)
16. Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron.
17. Sha & Pereira results
(times in minutes; 375k examples)
18. Klein & Manning: Conditional Structure vs. Estimation
19. Task 1: WSD (Word Sense Disambiguation)
Bush's election-year ad campaign will begin this summer, with... (sense 1)
Bush whacking is tiring but rewarding; who wants to spend all their time on marked trails? (sense 2)
Class is sense1/sense2; features are context words.
20. Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model.
Use the conditional rule to predict sense s from context-word observations o. Standard NB training maximizes joint likelihood under the independence assumption.
21. Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize conditional likelihood (sound familiar?)
... or maybe SenseEval score, or maybe even ...
22. In other words:
MaxEnt
Naïve Bayes
Different optimization goals
... or, dropping a constraint about the f's and λ's
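The equations on this slide were images; one standard way to write the two objectives being contrasted (textbook notation, not transcribed from the deck) is:

\begin{align*}
  \mathrm{JL}(\theta) &= \sum_d \log \Pr_\theta(s_d, \mathbf{o}_d)
     = \sum_d \log \Big( \Pr_\theta(s_d) \prod_i \Pr_\theta(o_{d,i} \mid s_d) \Big), \\
  \mathrm{CL}(\theta) &= \sum_d \log \Pr_\theta(s_d \mid \mathbf{o}_d)
     = \sum_d \log \frac{\Pr_\theta(s_d) \prod_i \Pr_\theta(o_{d,i} \mid s_d)}
                        {\sum_{s'} \Pr_\theta(s') \prod_i \Pr_\theta(o_{d,i} \mid s')}.
\end{align*}
% Maximizing CL after dropping the constraint that the per-class weights be the
% log-probabilities of a single normalized multinomial (the "constraint about the f's
% and lambda's") gives exactly the conditional MaxEnt / logistic model.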
23. Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning
- Optimize SCL, CL with conjugate gradient
- Also over non-deficient models, using Lagrange penalties to enforce a soft version of the deficiency constraint
- Makes sure the non-conditional version is a valid probability
- Punt on optimizing accuracy
- Penalty for extreme predictions in SCL
25. Conclusion: maxent beats NB?
All generalizations are wrong?
26. Task 2: POS Tagging
- Sequential problem
- Replace NB with an HMM model
- Standard algorithms maximize joint likelihood
- Claim: keeping the same model but maximizing conditional likelihood leads to a CRF
- Is this true?
- Alternative is conditional structure (CMM)
27. [Figure: CRF vs. HMM graphical models]
28. Using conditional structure vs. maximizing conditional likelihood
A CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate = the CL estimate for Pr(s|o).
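Spelled out in (assumed) notation, the argument is one line:

\[
  \Pr(\mathbf{s}, \mathbf{o}) = \Pr(\mathbf{s} \mid \mathbf{o}) \, \Pr(\mathbf{o}),
\]
% and a dependency added among the observations changes only the second factor. Since
% the two factors share no parameters, the joint-likelihood and conditional-likelihood
% estimates of Pr(s | o) coincide for a CMM.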
29. Task 2: POS Tagging
Experiments with a simple feature set. For a fixed model, CL is preferred to JL (CRF beats HMM). For a fixed objective, HMM is preferred to MEMM/CMM.
30. Error analysis for POS tagging
- Label bias is not the issue
- state-state dependencies are weak compared to observation-state dependencies
- too much emphasis on observation, not enough on previous states (observation bias)
- put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
31. Error analysis for POS tagging
32. Stacked Sequential Learning
- William W. Cohen
- Center for Automated Learning and Discovery, Carnegie Mellon University
- Vitor Carvalho, Language Technologies Institute, Carnegie Mellon University
33. Outline
- Motivation
- MEMMs don't work on segmentation tasks
- New method
- Stacked sequential MaxEnt
- Stacked sequential Anything
- Results
- More results...
- Conclusions
34. However, in celebration of the locale, I will present these results in the style of Sir Walter Scott (1771-1832), author of Ivanhoe and other classics.
In that pleasant district of merry Pennsylvania which is watered by the river Mon, there extended since ancient times a large computer science department. Such being our chief scene, the date of our story refers to a period towards the middle of the year 2003....
35. Chapter 1, in which a graduate student (Vitor) discovers a bug in his advisor's code that he cannot fix
The problem: identifying reply and signature sections of email messages. The method: classify each line as reply, signature, or other.
36. Chapter 1, in which a graduate student discovers a bug in his advisor's code that he cannot fix
The problem: identifying reply and signature sections of email messages. The method: classify each line as reply, signature, or other. The warmup: classify each line as signature or nonsignature, using learning methods from Minorthird and a dataset of 600 messages. The results: from CEAS-2004, Carvalho & Cohen....
37. Chapter 1, in which a graduate student discovers a bug in his advisor's code that he cannot fix
But... Minorthird's version of MEMMs has an accuracy of less than 70% (guessing the majority class gives an error of around 10%!)
38. Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMMs)
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
(a toy sketch of this recipe follows)
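A minimal Python sketch of the recipe above, under assumed interfaces (prob_next stands in for any learned local classifier returning Pr(y_i | y_{i-1}, x_i); this is generic pseudocode made concrete, not Minorthird's implementation).

import math

def beam_search(xs, labels, prob_next, beam=3):
    """Return an approximately best label sequence y_1..y_n for observations x_1..x_n,
    where prob_next(y_prev, x, y) is any learned local model for Pr(y_i | y_{i-1}, x_i)."""
    hyps = [(0.0, [])]                                  # (log-probability, label prefix)
    for x in xs:
        expanded = []
        for logp, ys in hyps:
            prev = ys[-1] if ys else "<start>"
            for y in labels:
                p = prob_next(prev, x, y)
                if p > 0.0:
                    expanded.append((logp + math.log(p), ys + [y]))
        hyps = sorted(expanded, reverse=True)[:beam]    # keep the `beam` best prefixes
    return max(hyps)[1]

# Toy local model: strongly prefer to stay in "sig" once there, otherwise prefer "other".
def toy_prob_next(prev, x, y):
    if prev == "sig":
        return 0.9 if y == "sig" else 0.1
    return 0.8 if y == "other" else 0.2

print(beam_search(["line 1", "line 2", "line 3"], ["other", "sig"], toy_prob_next))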
39. Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMMs) ... and also praise their many virtues relative to CRFs
- MEMMs are easy to implement
- MEMMs train quickly
- no probabilistic inference in the inner loop of learning
- You can use any old classifier (even if it's not probabilistic)
- MEMMs scale well with the number of classes and the length of history:
Pr(Y_i | Y_{i-1}, Y_{i-2}, ..., f1(X_i), f2(X_i), ...)
40. The flashback ends and we return again to our document analysis task, on which the elegant MEMM method fails miserably for reasons unknown
MEMMs have an accuracy of less than 70% on this problem... but why?
41. Chapter 2, in which, in the fullness of time, the mystery is investigated...
[Figure: an example message with the true labels]
...and it transpires that often the classifier predicts a signature block that is much longer than is correct, as if the MEMM gets stuck predicting the sig label.
42. Chapter 2, in which, in the fullness of time, the mystery is investigated...
...and it transpires that Pr(Y_i = sig | Y_{i-1} = sig) = 1 - ε as estimated from the data, giving the previous label a very high weight.
43. Chapter 2, in which, in the fullness of time, the mystery is investigated...
- We added sequence noise by randomly switching around 10% of the lines; this
- lowers the weight for the previous-label feature
- improves performance for MEMMs
- degrades performance for CRFs
- Adding noise in this case, however, is a loathsome bit of hackery.
44. Chapter 2, in which, in the fullness of time, the mystery is investigated...
- Label bias problem: CRFs can represent some distributions that MEMMs cannot [Lafferty et al. 2001]
- e.g., the rib-rob problem
- this doesn't explain why MaxEnt >> MEMMs
- Observation bias problem: MEMMs can overweight observation features [Klein and Manning 2002]
- here we observe the opposite: the history features are overweighted
45. Chapter 2, in which, in the fullness of time, the mystery is investigated... and an explanation is proposed.
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
46. Chapter 2, in which, in the fullness of time, the mystery is investigated... and an explanation is proposed.
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
Learning data is noise-free, including the values for Y_{i-1}.
Classification data: the values for Y_{i-1} are noisy, since they come from predictions; i.e., the history values used at learning time are a poor approximation of the values seen at classification time.
47. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
While learning, replace the true value for Y_{i-1} with an approximation of the predicted value of Y_{i-1}.
To approximate the value predicted by MEMMs, use the value predicted by non-sequential MaxEnt in a cross-validation experiment. After Wolpert (1992), we call this stacked MaxEnt: find approximate Y's with a MaxEnt-learned hypothesis, and then apply the sequential model to that.
48. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- Learn Pr(y_i | x_i) with MaxEnt and save the model as f(x)
- Do k-fold cross-validation with MaxEnt, saving the cross-validated predictions ŷ_i = f^k(x_i)
- Augment the original examples with the ŷ's and compute history features: g(x, ŷ) = x'
- Learn Pr(y_i | x'_i) with MaxEnt and save the model as f'(x')
- To classify: augment x with ŷ = f(x), and apply f' to the resulting x'; i.e., return f'(g(x, f(x)))
[Diagram: the two learned models f and f']
(a code sketch of this procedure follows)
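A sketch of the same procedure using scikit-learn's logistic regression as the MaxEnt learner (an assumed stand-in; the lecture's experiments used Minorthird, and the names X, y, k, w below are illustrative). X is an (n_lines, n_features) array for one long sequence of lines, and labels are assumed numeric (e.g., 1 = signature line, 0 = other).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def augment(X, y_hat, w=1):
    """g(x, y_hat): append the predicted labels of the w previous and w following lines."""
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y_hat)
    cols = []
    for d in range(-w, w + 1):
        if d == 0:
            continue                      # position i itself is already described by X
        shifted = np.zeros(n)
        src = np.arange(n) + d
        ok = (src >= 0) & (src < n)
        shifted[ok] = y_hat[src[ok]]
        cols.append(shifted)
    return np.hstack([X, np.column_stack(cols)])

def train_stacked_maxent(X, y, k=5, w=1):
    f = LogisticRegression(max_iter=1000).fit(X, y)                           # base model f(x)
    y_cv = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=k)   # cross-validated y-hats
    f_prime = LogisticRegression(max_iter=1000).fit(augment(X, y_cv, w), y)   # stacked model f'(x')
    return f, f_prime

def predict_stacked(f, f_prime, X, w=1):
    return f_prime.predict(augment(X, f.predict(X), w))                       # f'(g(x, f(x)))

Increasing w adds more past and future predicted labels as features, which corresponds to the larger history/future windows in the results that follow.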
49. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt (k=5) outperforms MEMMs and non-sequential MaxEnt, but not CRFs
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's easy to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
50. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's cheap to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
[Diagram: stacked layers of Y_{i-1}, Y_i, Y_{i+1} nodes over the observations X_{i-1}, X_i, X_{i+1}]
51. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's cheap to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
[Diagram: estimated Y_{i-1}, Y_i, Y_{i+1} used as features, together with X_{i-1}, X_i, X_{i+1}, for predicting Y_i]
52. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's cheap to increase the history size, and to build features for future estimated Y_i's as well as past Y_i's.
53. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's cheap to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
- Learn Pr(y_i | x_i) with MaxEnt and save the model as f(x)
- Do k-fold cross-validation with MaxEnt, saving the cross-validated predictions ŷ_i = f^k(x_i)
- Augment the original examples with the ŷ's and compute history features: g(x, ŷ) = x'
- Learn Pr(y_i | x'_i) with MaxEnt and save the model as f'(x')
- To classify: augment x with ŷ = f(x), and apply f' to the resulting x'; i.e., return f'(g(x, f(x)))
54. Chapter 3, in which a novel extension to MEMMs is proposed and several diverse variants of the extension are evaluated on signature-block finding....
[Plot: error vs. window/history size, comparing the non-sequential MaxEnt baseline, the CRF baseline, stacked MaxEnt with no future features, and stacked MaxEnt / stacked CRFs with large history+future windows]
- The reduction in error rate for stacked MaxEnt (s-ME) vs. CRFs is 46%, which is statistically significant.
- With large windows, stacked MaxEnt is better than the CRF baseline.
55. Chapter 4, in which the experiment above is repeated on a new domain, and then repeated again on yet another new domain.
[Results: with and without stacking (w=k=5), on newsgroup FAQ segmentation (2 labels × three newsgroups) and on video segmentation]
56. Chapter 4, in which the experiment above is repeated on a new domain, and then repeated again on yet another new domain.
57. Chapter 5, in which all the experiments above were repeated for a second set of learners: the voted perceptron (VP), the voted-perceptron-trained HMM (VP-HMM), and their stacked versions.
58. Chapter 5, in which all the experiments above were repeated for a second set of learners: the voted perceptron (VP), the voted-perceptron-trained HMM (VP-HMM), and their stacked versions.
- Stacking usually improves or leaves unchanged:
- MaxEnt (p > 0.98)
- VotedPerc (p > 0.98)
- VP-HMM (p > 0.98)
- CRFs (p > 0.92)
- on a randomly chosen problem, using a 1-tailed sign test
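For reference, a small Python illustration (not the paper's script) of what that one-tailed sign test computes; the 8-of-9 example is just arithmetic, and the resulting tail probability of about 0.02 is roughly the "p > 0.98" level of confidence quoted above.

from math import comb

def sign_test_p(wins, losses):
    """One-tailed sign test: P(at least `wins` successes in wins + losses fair coin flips)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# e.g. if stacking helps on 8 of 9 non-tied problems, the tail probability is about 0.02
print(sign_test_p(8, 1))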
59. Chapter 4b, in which the experiment above is repeated again for yet one more new domain....
- Classify pop songs as happy or sad
- 1-second-long song frames inherit the mood of their containing song
- Song frames are classified with a sequential classifier
- Song mood is the majority class of all its frames
- 52,188 frames from 201 songs, 130 features per frame; used k=5, w=25
60. Epilog: in which the speaker discusses certain issues of possible interest to the listener, who is now fully informed of the technical issues (or, it may be, only better rested) and thus receptive to such commentary
- Scope
- we considered only segmentation tasks (sequences with long runs of identical labels) and 2-class problems.
- MEMM fails here.
- Issue
- the learner is brittle w.r.t. its assumptions
- training data for the local model is assumed to be error-free, which is systematically wrong
- Solution: sequential stacking
- a model-free way to improve robustness
- stacked MaxEnt outperforms or ties CRFs on 8/10 tasks; stacked VP outperforms CRFs on 8/9 tasks.
- a meta-learning method: it applies to any base learner, and can also reduce the error of CRFs substantially
- experiments with non-segmentation problems (NER) showed no large gains
61. Epilog: in which the speaker discusses certain issues of possible interest to the listener, who is now fully informed of the technical issues (or, it may be, only better rested) and thus receptive to such commentary
- ... and in which, finally, the speaker realizes that the structure of the epic romantic novel is ill-suited to talks of this ilk, and perhaps even to the very medium of PowerPoint itself, but nonetheless persists with a final animation...
Sir W. Scott, R.I.P.