Title: CRFs vs CMMs, and Stacking
1. CRFs vs. CMMs, and Stacking
- William W. Cohen
- Sep 30, 2009
2. Announcements
- Wednesday, 9/29
- Project abstract due, one/person
- Next Wed, 10/4
- Sign up for a slot to present a paper
- 20 min + Q/A time
- Warning: I might shuffle the schedule around a little after I see the proposals
- Next Friday, 10/8
- Project abstract due, one/team
- Put email addresses of project members on the proposal
3. Conditional Random Fields
4. Label Bias Problem - 1
- Consider this as an HMM, and enough training data to perfectly model it
Pr(0123 | rib) = 1, Pr(0453 | rob) = 1
Pr(0123) = Pr(0453) = 0.5
Pr(rob | 0123) = Pr(r|0) Pr(o|1) Pr(b|2) = 1 · 0 · 1 = 0
Pr(rob | 0453) = Pr(r|0) Pr(o|4) Pr(b|5) = 1 · 1 · 1 = 1
Pr(0453 | rob) = Pr(rob | 0453) Pr(0453) / Z = 1
Pr(0123 | rob) = Pr(rob | 0123) Pr(0123) / Z = 0
(forward/backward)
5. Label Bias Problem - 2
- Consider this as an MEMM, and enough training data to perfectly model it
Pr(0123 | rib) = 1, Pr(0453 | rob) = 1
Pr(0123 | rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1 = 0.5
Pr(0453 | rib) = Pr(4|0,r)/Z1 · Pr(5|4,i)/Z2 · Pr(3|5,b)/Z3 = 0.5 · 1 · 1 = 0.5
No next-state classifier will model this well. There are some things HMMs can learn that MEMMs can't.
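To make the contrast concrete, here is a minimal Python sketch (not from the lecture) of the rib/rob example. The state structure and the maximum-likelihood probabilities match the numbers above; the conditional tables are hard-coded rather than learned.

# States: 0 = start; 1, 2 on the "rib" branch; 4, 5 on the "rob" branch; 3 = final.
# Training data: "rib" -> path 0-1-2-3 and "rob" -> path 0-4-5-3, equally often.

# HMM parameters estimated from that data (each step emits from its source state).
hmm_trans = {(0, 1): 0.5, (0, 4): 0.5, (1, 2): 1.0, (2, 3): 1.0, (4, 5): 1.0, (5, 3): 1.0}
hmm_emit = {(0, "r"): 1.0, (1, "i"): 1.0, (2, "b"): 1.0, (4, "o"): 1.0, (5, "b"): 1.0}

def hmm_joint(path, word):
    """Unnormalized Pr(path, word) under the HMM."""
    p = 1.0
    for (s, s_next), ch in zip(zip(path, path[1:]), word):
        p *= hmm_trans.get((s, s_next), 0.0) * hmm_emit.get((s, ch), 0.0)
    return p

def memm_step(s, s_next, ch):
    """MEMM local probability Pr(s_next | s, ch) at its maximum-likelihood values.
    From state 0 with "r" both branches were seen equally, so each gets 0.5; states
    1, 2, 4, 5 each have a single successor, so the per-state normalizer forces
    probability 1 regardless of the observation -- exactly the label-bias effect."""
    if s == 0:
        return 0.5 if (ch == "r" and s_next in (1, 4)) else 0.0
    return 1.0 if (s, s_next) in {(1, 2), (2, 3), (4, 5), (5, 3)} else 0.0

def memm_prob(path, word):
    """Pr(path | word) under the MEMM (a product of local conditionals)."""
    p = 1.0
    for (s, s_next), ch in zip(zip(path, path[1:]), word):
        p *= memm_step(s, s_next, ch)
    return p

word, paths = "rob", [(0, 1, 2, 3), (0, 4, 5, 3)]
z = sum(hmm_joint(p, word) for p in paths)
for p in paths:
    print(p, "HMM posterior:", hmm_joint(p, word) / z, "MEMM:", memm_prob(p, word))
# The HMM puts posterior 1 on 0-4-5-3; the MEMM gives both paths 0.5.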
6. From MEMMs to CRFs
7. CRF inference
(succinct like maxent; can see the locality)
To classify, find the highest-weight path through
the lattice. The normalizer is the sum of the
weights of all paths through the lattice.
[Figure: label lattice over the tokens "When will prof Cohen post", with rows B, I, O at each position]
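A minimal Python sketch of the inference just described, under assumed names: edge_score stands in for the exponentiated sum of weighted CRF features on a lattice edge, and the toy scores are made up. Viterbi finds the highest-weight path; the same recursion with sums in place of maxes gives the normalizer Z.

LABELS = ["B", "I", "O"]
tokens = ["When", "will", "prof", "Cohen", "post"]

def edge_score(tokens, j, y_prev, y):
    """Unnormalized weight of labeling tokens[j] with y when the previous label is
    y_prev (None at j = 0). A stand-in for exp(sum of weighted CRF features)."""
    s = 2.0 if (y == "B" and tokens[j][0].isupper()) else 1.0
    if y_prev == "B" and y == "I":
        s *= 1.5
    return s

def viterbi_and_Z(tokens):
    best = {y: edge_score(tokens, 0, None, y) for y in LABELS}   # max-product scores
    alpha = dict(best)                                           # sum-product scores
    back = [{}]
    for j in range(1, len(tokens)):
        new_best, new_alpha, ptr = {}, {}, {}
        for y in LABELS:
            scored = [(best[yp] * edge_score(tokens, j, yp, y), yp) for yp in LABELS]
            new_best[y], ptr[y] = max(scored)
            new_alpha[y] = sum(alpha[yp] * edge_score(tokens, j, yp, y) for yp in LABELS)
        best, alpha = new_best, new_alpha
        back.append(ptr)
    # trace back the highest-weight path
    y = max(best, key=best.get)
    path = [y]
    for j in range(len(tokens) - 1, 0, -1):
        y = back[j][y]
        path.append(y)
    path.reverse()
    Z = sum(alpha.values())   # sum of the weights of all paths through the lattice
    return path, Z

print(viterbi_and_Z(tokens))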
8. (can see the locality)
[Figure: the same B/I/O lattice over "When will prof Cohen post"]
9. With Z_{j,y} we can also compute things like:
- what's the probability that y2 = B?
- what's the probability that y2 = B and y3 = I?
[Figure: the same B/I/O lattice over "When will prof Cohen post"]
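In standard forward-backward notation (assumed here, not transcribed from the slide), these marginals look as follows.

% alpha_j(y), the total weight of all partial paths ending in label y at position j,
% plays the role of the slide's Z_{j,y}; beta_j(y) is the corresponding backward
% quantity, n is the sequence length, and M_3(B, I) is the unnormalized weight of the
% B -> I edge into position 3.
\[
  Z = \sum_{y} \alpha_n(y), \qquad
  \Pr(y_2 = \mathrm{B} \mid \mathbf{x}) = \frac{\alpha_2(\mathrm{B})\,\beta_2(\mathrm{B})}{Z}, \qquad
  \Pr(y_2 = \mathrm{B},\, y_3 = \mathrm{I} \mid \mathbf{x})
     = \frac{\alpha_2(\mathrm{B})\, M_3(\mathrm{B},\mathrm{I})\, \beta_3(\mathrm{I})}{Z}.
\]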
10. CRF learning
[Figure: B/I/O lattice over the tokens "When will prof Cohen post the notes"]
- Goal is to learn how to weight edges in the graph, e.g.:
weight(y_i, y_{i+1}) = 2 · [(y_i = B or I) and isCap(x_i)] + 1 · [y_i = B and isFirstName(x_i)] - 5 · [y_{i+1} ≠ B and isLower(x_i) and isUpper(x_{i+1})]
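In standard linear-chain CRF notation (a hedged reconstruction; only the indicator features and the weights 2, 1, -5 above come from the slide), the edge weights and the resulting conditional distribution are:

\[
  \mathrm{weight}(y_i, y_{i+1}) = \sum_k \lambda_k\, f_k(y_i, y_{i+1}, \mathbf{x}, i),
  \qquad
  \Pr(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
     \exp\Big(\sum_i \mathrm{weight}(y_i, y_{i+1})\Big),
\]
% so the +2, +1 and -5 in the example above are three particular weights lambda_k
% attached to indicator features f_k, and learning means fitting the lambda_k.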
11. CRF learning (from Sha & Pereira)
12. CRF learning (from Sha & Pereira)
13. CRF learning (from Sha & Pereira)
Something like forward-backward
- Idea
- Define a matrix of y, y' affinities at stage j
- M_j[y, y'] = unnormalized probability of a transition from y to y' at stage j, as in the notes above
- M_j · M_{j+1} = unnormalized probability of any path through stages j and j+1
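A small numpy sketch of that matrix idea, with made-up entries and assumed shapes (an illustration, not Sha & Pereira's code): multiplying the M_j matrices and summing over start and end labels gives the partition function Z.

import numpy as np

n_labels = 3                       # e.g. B, I, O
length = 5                         # e.g. the five tokens of "When will prof Cohen post"
rng = np.random.default_rng(0)
# M[j][y, y'] = unnormalized weight of a y -> y' transition into position j+1 (toy values)
M = [np.exp(rng.normal(size=(n_labels, n_labels))) for _ in range(length - 1)]

alpha = np.ones(n_labels)          # toy start weights for the label at position 0
for Mj in M:                       # forward recursion: alpha_j = alpha_{j-1} @ M_j
    alpha = alpha @ Mj
Z = alpha.sum()                    # partition function: total weight of all paths

# Equivalently, multiply all the M_j together and sum over start and end labels.
prod = np.linalg.multi_dot(M)
assert np.isclose(Z, np.ones(n_labels) @ prod @ np.ones(n_labels))
print("Z =", Z)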
14. Forward-backward ideas
[Figure: trellis with states name / nonName at three positions, edges labeled a through h]
15. CRF learning (from Sha & Pereira)
16. Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron.
17. Sha & Pereira results
(times in minutes; 375k examples)
18. Klein & Manning: Conditional Structure vs. Estimation
19. Task 1: WSD (Word Sense Disambiguation)
Bush's election-year ad campaign will begin this summer, with... (sense 1)
Bush whacking is tiring but rewarding; who wants to spend all their time on marked trails? (sense 2)
Class is sense1/sense2; features are context words.
20. Task 1: WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model.
Use the conditional rule to predict sense s from context-word observations o. Standard NB training maximizes joint likelihood under the independence assumption.
21. Task 1: WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize conditional likelihood (sound familiar?)
... or maybe SenseEval score, or maybe even ...
22. In other words:
MaxEnt
Naïve Bayes
Different optimization goals
... or, dropping a constraint about the f's and λ's
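The equations on this slide were images; one standard way to write the two objectives being contrasted (textbook notation, not transcribed from the deck) is:

\begin{align*}
  \mathrm{JL}(\theta) &= \sum_d \log \Pr_\theta(s_d, \mathbf{o}_d)
     = \sum_d \log \Big( \Pr_\theta(s_d) \prod_i \Pr_\theta(o_{d,i} \mid s_d) \Big), \\
  \mathrm{CL}(\theta) &= \sum_d \log \Pr_\theta(s_d \mid \mathbf{o}_d)
     = \sum_d \log \frac{\Pr_\theta(s_d) \prod_i \Pr_\theta(o_{d,i} \mid s_d)}
                        {\sum_{s'} \Pr_\theta(s') \prod_i \Pr_\theta(o_{d,i} \mid s')}.
\end{align*}
% Maximizing CL after dropping the constraint that the per-class weights be the
% log-probabilities of a single normalized multinomial (the "constraint about the f's
% and lambda's") gives exactly the conditional MaxEnt / logistic model.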
23. Task 1: WSD (Word Sense Disambiguation)
- Optimize JL with standard NB learning
- Optimize SCL, CL with conjugate gradient
- Also over non-deficient models, using Lagrange penalties to enforce a soft version of the deficiency constraint
- Makes sure the non-conditional version is a valid probability
- Punt on optimizing accuracy
- Penalty for extreme predictions in SCL
25. Conclusion: maxent beats NB?
All generalizations are wrong?
26. Task 2: POS Tagging
- Sequential problem
- Replace NB with an HMM model
- Standard algorithms maximize joint likelihood
- Claim: keeping the same model but maximizing conditional likelihood leads to a CRF
- Is this true?
- Alternative is conditional structure (CMM)
27. [Figure: CRF vs. HMM graphical models]
28. Using conditional structure vs. maximizing conditional likelihood
A CMM factors Pr(s,o) into Pr(s|o) Pr(o). For the CMM model, adding dependencies between observations does not change Pr(s|o), i.e., the JL estimate = the CL estimate for Pr(s|o).
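Spelled out in (assumed) notation, the argument is one line:

\[
  \Pr(\mathbf{s}, \mathbf{o}) = \Pr(\mathbf{s} \mid \mathbf{o}) \, \Pr(\mathbf{o}),
\]
% and a dependency added among the observations changes only the second factor. Since
% the two factors share no parameters, the joint-likelihood and conditional-likelihood
% estimates of Pr(s | o) coincide for a CMM.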
29. Task 2: POS Tagging
Experiments with a simple feature set. For a fixed model, CL is preferred to JL (CRF beats HMM). For a fixed objective, HMM is preferred to MEMM/CMM.
30. Error analysis for POS tagging
- Label bias is not the issue
- state-state dependencies are weak compared to observation-state dependencies
- too much emphasis on observation, not enough on previous states (observation bias)
- put another way: label bias predicts overprediction of states with few outgoing transitions, or more generally, low entropy...
31. Error analysis for POS tagging
32. Stacked Sequential Learning
- William W. Cohen
- Center for Automated Learning and Discovery, Carnegie Mellon University
- Vitor Carvalho, Language Technologies Institute, Carnegie Mellon University
33. Outline
- Motivation
- MEMMs don't work on segmentation tasks
- New method
- Stacked sequential MaxEnt
- Stacked sequential Anything
- Results
- More results...
- Conclusions
34. However, in celebration of the locale, I will present these results in the style of Sir Walter Scott (1771-1832), author of Ivanhoe and other classics.
In that pleasant district of merry Pennsylvania which is watered by the river Mon, there extended since ancient times a large computer science department. Such being our chief scene, the date of our story refers to a period towards the middle of the year 2003....
35. Chapter 1, in which a graduate student (Vitor) discovers a bug in his advisor's code that he cannot fix
The problem: identifying reply and signature sections of email messages. The method: classify each line as reply, signature, or other.
36. Chapter 1, in which a graduate student discovers a bug in his advisor's code that he cannot fix
The problem: identifying reply and signature sections of email messages. The method: classify each line as reply, signature, or other. The warmup: classify each line as signature or nonsignature, using learning methods from Minorthird and a dataset of 600 messages. The results: from CEAS-2004, Carvalho & Cohen....
37. Chapter 1, in which a graduate student discovers a bug in his advisor's code that he cannot fix
But... Minorthird's version of MEMMs has an accuracy of less than 70% (guessing the majority class gives an error of around 10%!)
38. Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMMs)
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
(a toy sketch of this recipe follows)
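A minimal Python sketch of the recipe above, under assumed interfaces (prob_next stands in for any learned local classifier returning Pr(y_i | y_{i-1}, x_i); this is generic pseudocode made concrete, not Minorthird's implementation).

import math

def beam_search(xs, labels, prob_next, beam=3):
    """Return an approximately best label sequence y_1..y_n for observations x_1..x_n,
    where prob_next(y_prev, x, y) is any learned local model for Pr(y_i | y_{i-1}, x_i)."""
    hyps = [(0.0, [])]                                  # (log-probability, label prefix)
    for x in xs:
        expanded = []
        for logp, ys in hyps:
            prev = ys[-1] if ys else "<start>"
            for y in labels:
                p = prob_next(prev, x, y)
                if p > 0.0:
                    expanded.append((logp + math.log(p), ys + [y]))
        hyps = sorted(expanded, reverse=True)[:beam]    # keep the `beam` best prefixes
    return max(hyps)[1]

# Toy local model: strongly prefer to stay in "sig" once there, otherwise prefer "other".
def toy_prob_next(prev, x, y):
    if prev == "sig":
        return 0.9 if y == "sig" else 0.1
    return 0.8 if y == "other" else 0.2

print(beam_search(["line 1", "line 2", "line 3"], ["other", "sig"], toy_prob_next))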
39. Flashback: In which we recall the invention and re-invention of sequential classification with recurrent sliding windows, ..., MaxEnt Markov Models (MEMMs) ... and also praise their many virtues relative to CRFs
- MEMMs are easy to implement
- MEMMs train quickly
- no probabilistic inference in the inner loop of learning
- You can use any old classifier (even if it's not probabilistic)
- MEMMs scale well with the number of classes and the length of history:
Pr(Y_i | Y_{i-1}, Y_{i-2}, ..., f1(X_i), f2(X_i), ...)
40. The flashback ends and we return again to our document analysis task, on which the elegant MEMM method fails miserably for reasons unknown
MEMMs have an accuracy of less than 70% on this problem... but why?
41. Chapter 2, in which, in the fullness of time, the mystery is investigated...
[Figure: an example message with the true labels]
...and it transpires that often the classifier predicts a signature block that is much longer than is correct, as if the MEMM gets stuck predicting the sig label.
42. Chapter 2, in which, in the fullness of time, the mystery is investigated...
...and it transpires that Pr(Y_i = sig | Y_{i-1} = sig) = 1 - ε as estimated from the data, giving the previous label a very high weight.
43. Chapter 2, in which, in the fullness of time, the mystery is investigated...
- We added sequence noise by randomly switching around 10% of the lines; this
- lowers the weight for the previous-label feature
- improves performance for MEMMs
- degrades performance for CRFs
- Adding noise in this case, however, is a loathsome bit of hackery.
44. Chapter 2, in which, in the fullness of time, the mystery is investigated...
- Label bias problem: CRFs can represent some distributions that MEMMs cannot [Lafferty et al. 2001]
- e.g., the rib-rob problem
- this doesn't explain why MaxEnt >> MEMMs
- Observation bias problem: MEMMs can overweight observation features [Klein and Manning 2002]
- here we observe the opposite: the history features are overweighted
45. Chapter 2, in which, in the fullness of time, the mystery is investigated... and an explanation is proposed.
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
46. Chapter 2, in which, in the fullness of time, the mystery is investigated... and an explanation is proposed.
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
Learning data is noise-free, including the values for Y_{i-1}.
Classification data: the values for Y_{i-1} are noisy, since they come from predictions; i.e., the history values used at learning time are a poor approximation of the values seen at classification time.
47. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- From data, learn Pr(y_i | y_{i-1}, x_i)
- a MaxEnt model
- To classify a sequence x1, x2, ..., search for the best y1, y2, ...
- Viterbi
- beam search
While learning, replace the true value for Y_{i-1} with an approximation of the predicted value of Y_{i-1}.
To approximate the value predicted by MEMMs, use the value predicted by non-sequential MaxEnt in a cross-validation experiment. After Wolpert (1992), we call this stacked MaxEnt: find approximate Y's with a MaxEnt-learned hypothesis, and then apply the sequential model to that.
48. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- Learn Pr(y_i | x_i) with MaxEnt and save the model as f(x)
- Do k-fold cross-validation with MaxEnt, saving the cross-validated predictions ŷ_i = f^k(x_i)
- Augment the original examples with the ŷ's and compute history features: g(x, ŷ) = x'
- Learn Pr(y_i | x'_i) with MaxEnt and save the model as f'(x')
- To classify: augment x with ŷ = f(x), and apply f' to the resulting x'; i.e., return f'(g(x, f(x)))
[Diagram: the two learned models f and f']
(a code sketch of this procedure follows)
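A sketch of the same procedure using scikit-learn's logistic regression as the MaxEnt learner (an assumed stand-in; the lecture's experiments used Minorthird, and the names X, y, k, w below are illustrative). X is an (n_lines, n_features) array for one long sequence of lines, and labels are assumed numeric (e.g., 1 = signature line, 0 = other).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

def augment(X, y_hat, w=1):
    """g(x, y_hat): append the predicted labels of the w previous and w following lines."""
    y_hat = np.asarray(y_hat, dtype=float)
    n = len(y_hat)
    cols = []
    for d in range(-w, w + 1):
        if d == 0:
            continue                      # position i itself is already described by X
        shifted = np.zeros(n)
        src = np.arange(n) + d
        ok = (src >= 0) & (src < n)
        shifted[ok] = y_hat[src[ok]]
        cols.append(shifted)
    return np.hstack([X, np.column_stack(cols)])

def train_stacked_maxent(X, y, k=5, w=1):
    f = LogisticRegression(max_iter=1000).fit(X, y)                           # base model f(x)
    y_cv = cross_val_predict(LogisticRegression(max_iter=1000), X, y, cv=k)   # cross-validated y-hats
    f_prime = LogisticRegression(max_iter=1000).fit(augment(X, y_cv, w), y)   # stacked model f'(x')
    return f, f_prime

def predict_stacked(f, f_prime, X, w=1):
    return f_prime.predict(augment(X, f.predict(X), w))                       # f'(g(x, f(x)))

Increasing w adds more past and future predicted labels as features, which corresponds to the larger history/future windows in the results that follow.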
49. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt (k=5) outperforms MEMMs and non-sequential MaxEnt, but not CRFs
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's easy to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
50. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's cheap to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
[Diagram: stacked layers of Y_{i-1}, Y_i, Y_{i+1} nodes over the observations X_{i-1}, X_i, X_{i+1}]
51. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's cheap to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
[Diagram: estimated Y_{i-1}, Y_i, Y_{i+1} used as features, together with X_{i-1}, X_i, X_{i+1}, for predicting Y_i]
52. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's cheap to increase the history size, and to build features for future estimated Y_i's as well as past Y_i's.
53. Chapter 3, in which a novel extension to MEMMs is proposed that will correct the performance problem
- StackedMaxEnt can also be easily extended....
- It's easy (but expensive) to increase the depth of stacking
- It's cheap to increase the history size
- It's easy to build features for future estimated Y_i's as well as past Y_i's
- stacking can be applied to any other sequential learner
- Learn Pr(y_i | x_i) with MaxEnt and save the model as f(x)
- Do k-fold cross-validation with MaxEnt, saving the cross-validated predictions ŷ_i = f^k(x_i)
- Augment the original examples with the ŷ's and compute history features: g(x, ŷ) = x'
- Learn Pr(y_i | x'_i) with MaxEnt and save the model as f'(x')
- To classify: augment x with ŷ = f(x), and apply f' to the resulting x'; i.e., return f'(g(x, f(x)))
54. Chapter 3, in which a novel extension to MEMMs is proposed and several diverse variants of the extension are evaluated on signature-block finding....
[Plot: error vs. window/history size, comparing the non-sequential MaxEnt baseline, the CRF baseline, stacked MaxEnt with no future features, and stacked MaxEnt / stacked CRFs with large history+future windows]
- The reduction in error rate for stacked MaxEnt (s-ME) vs. CRFs is 46%, which is statistically significant.
- With large windows, stacked MaxEnt is better than the CRF baseline.
55. Chapter 4, in which the experiment above is repeated on a new domain, and then repeated again on yet another new domain.
[Results: with and without stacking (w=k=5), on newsgroup FAQ segmentation (2 labels × three newsgroups) and on video segmentation]
56. Chapter 4, in which the experiment above is repeated on a new domain, and then repeated again on yet another new domain.
57. Chapter 5, in which all the experiments above were repeated for a second set of learners: the voted perceptron (VP), the voted-perceptron-trained HMM (VP-HMM), and their stacked versions.
58. Chapter 5, in which all the experiments above were repeated for a second set of learners: the voted perceptron (VP), the voted-perceptron-trained HMM (VP-HMM), and their stacked versions.
- Stacking usually improves or leaves unchanged:
- MaxEnt (p > 0.98)
- VotedPerc (p > 0.98)
- VP-HMM (p > 0.98)
- CRFs (p > 0.92)
- on a randomly chosen problem, using a 1-tailed sign test
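For reference, a small Python illustration (not the paper's script) of what that one-tailed sign test computes; the 8-of-9 example is just arithmetic, and the resulting tail probability of about 0.02 is roughly the "p > 0.98" level of confidence quoted above.

from math import comb

def sign_test_p(wins, losses):
    """One-tailed sign test: P(at least `wins` successes in wins + losses fair coin flips)."""
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n

# e.g. if stacking helps on 8 of 9 non-tied problems, the tail probability is about 0.02
print(sign_test_p(8, 1))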
59. Chapter 4b, in which the experiment above is repeated again for yet one more new domain....
- Classify pop songs as happy or sad
- 1-second-long song frames inherit the mood of their containing song
- Song frames are classified with a sequential classifier
- Song mood is the majority class of all its frames
- 52,188 frames from 201 songs, 130 features per frame; used k=5, w=25
60. Epilog: in which the speaker discusses certain issues of possible interest to the listener, who is now fully informed of the technical issues (or, it may be, only better rested) and thus receptive to such commentary
- Scope
- we considered only segmentation tasks (sequences with long runs of identical labels) and 2-class problems.
- MEMM fails here.
- Issue
- the learner is brittle w.r.t. its assumptions
- training data for the local model is assumed to be error-free, which is systematically wrong
- Solution: sequential stacking
- a model-free way to improve robustness
- stacked MaxEnt outperforms or ties CRFs on 8/10 tasks; stacked VP outperforms CRFs on 8/9 tasks.
- a meta-learning method: it applies to any base learner, and can also reduce the error of CRFs substantially
- experiments with non-segmentation problems (NER) showed no large gains
61. Epilog: in which the speaker discusses certain issues of possible interest to the listener, who is now fully informed of the technical issues (or, it may be, only better rested) and thus receptive to such commentary
- ... and in which, finally, the speaker realizes that the structure of the epic romantic novel is ill-suited to talks of this ilk, and perhaps even to the very medium of PowerPoint itself, but nonetheless persists with a final animation...
Sir W. Scott, R.I.P.