CRFs vs CMMs, and Stacking

Transcript and Presenter's Notes
1
CRFs vs CMMs, and Stacking
  • William W. Cohen
  • Sep 30, 2009

2
Announcements
  • Wednesday, 9/29
  • Project abstract due, one per person
  • Next Wed, 10/4
  • Sign up for a slot to present a paper
  • 20 min plus Q&A time
  • Warning: I might shuffle the schedule around a
    little after I see the proposals
  • Next Friday, 10/8
  • Project abstract due, one per team
  • Put email addresses of project members on the
    proposal

3
Conditional Random Fields
  • Review

4
Label Bias Problem - 1
  • Consider this as an HMM, and enough training
    data to perfectly model it

Pr(0123|rib) = 1, Pr(0453|rob) = 1
Pr(0123) = Pr(0453) = 0.5
Pr(rob|0123) = Pr(r|0) Pr(o|1) Pr(b|2) = 1 · 0 · 1 = 0
Pr(rob|0453) = Pr(r|0) Pr(o|4) Pr(b|5) = 1 · 1 · 1 = 1
Pr(0453|rob) = Pr(rob|0453) Pr(0453) / Z  >  Pr(0123|rob) = Pr(rob|0123) Pr(0123) / Z
(computed with forward/backward)
5
Label Bias Problem - 2
  • Consider this as an MEMM, and enough training
    data to perfectly model it

Pr(0123|rib) = 1, Pr(0453|rob) = 1
Pr(0123|rob) = Pr(1|0,r)/Z1 · Pr(2|1,o)/Z2 · Pr(3|2,b)/Z3 = 0.5 · 1 · 1 = 0.5
Pr(0453|rib) = Pr(4|0,r)/Z1 · Pr(5|4,i)/Z2 · Pr(3|5,b)/Z3 = 0.5 · 1 · 1 = 0.5
No next-state classifier will model this
well. There are some things HMMs can learn that
MEMMs can't.
6
From MEMMs to CRFs
7
CRF inference
succinct like maxent
can see the locality
When will prof Cohen post

To classify, find the highest-weight path through
the lattice. The normalizer is the sum of the
weights of all paths through the lattice.
[Lattice diagram: one column of candidate tags B / I / O for each token in the sentence]
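As a rough illustration of the two computations just described (the highest-weight path and the sum of the weights of all paths), here is a minimal sketch; the edge_weight interface, the START convention, and the use of multiplicative scores are assumptions for this example, not code from the talk.

    import numpy as np

    TAGS = ["B", "I", "O"]

    def decode_and_normalize(edge_weight, tokens):
        # edge_weight(prev_tag, tag, token) -> positive multiplicative score (assumed)
        best = None        # best path weight ending in each tag so far
        total = None       # summed weight of all paths ending in each tag so far
        backptrs = []
        for j, tok in enumerate(tokens):
            if j == 0:
                w = np.array([edge_weight("START", t, tok) for t in TAGS])
                best, total = w.copy(), w.copy()
            else:
                w = np.array([[edge_weight(p, t, tok) for t in TAGS] for p in TAGS])
                scores = best[:, None] * w              # weight of paths prev -> cur
                backptrs.append(scores.argmax(axis=0))  # best previous tag per current tag
                best = scores.max(axis=0)
                total = total @ w                       # sum over all previous tags
        Z = float(total.sum())                          # normalizer: sum over all paths
        path = [int(best.argmax())]                     # trace back the best path
        for bp in reversed(backptrs):
            path.append(int(bp[path[-1]]))
        return [TAGS[i] for i in reversed(path)], Z

Running this on the five tokens above with any positive edge_weight returns the highest-weight tag sequence and the normalizer Z.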
8
can see the locality
When will prof Cohen post
[Lattice diagram: one column of candidate tags B / I / O for each token in the sentence]
9
  • With Zj,y we can also compute stuff like
  • what's the probability that y2 = B?
  • what's the probability that y2 = B and y3 = I?
    (see the formulas below)

When will prof Cohen post
[Lattice diagram: one column of candidate tags B / I / O for each token in the sentence]
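One standard way to write these quantities (added for clarity; the notation below is not on the slide): if αj(y) is the total weight of partial paths ending in tag y at position j, and βj(y) is the total weight of partial paths from (j, y) to the end of the sentence, then

    Pr(yj = y | x) = αj(y) · βj(y) / Z
    Pr(yj = y, yj+1 = y' | x) = αj(y) · Mj+1[y, y'] · βj+1(y') / Z

where Mj+1[y, y'] is the edge weight in the Sha & Pereira notation used a few slides later.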
10
CRF learning
When will prof Cohen post
the notes
[Lattice diagram: one column of candidate tags B / I / O for each of the seven tokens]
  • Goal is to learn how to weight edges in the
    graph
  • weight(yi, yi+1) = 2·[(yi = B or I) and isCap(xi)]
    + 1·[yi = B and isFirstName(xi)]
    - 5·[yi+1 ≠ B and isLower(xi) and isUpper(xi+1)]
    (a weighted-feature sketch of this is given below)
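A small sketch of how such an edge weight could be computed as a weighted sum of indicator features; the helper functions, toy lexicon, and feature names are illustrative assumptions. Exponentiating the result would give a multiplicative edge score for the lattice sketch above (note this variant also looks at the next token, so its signature differs from the earlier decoding sketch).

    def is_cap(w):   return w[:1].isupper()
    def is_lower(w): return w.islower()
    FIRST_NAMES = {"William", "Vitor"}   # toy lexicon, purely illustrative

    def edge_features(y_i, y_next, x_i, x_next):
        # indicator features on the edge (y_i -> y_next) and its observations
        return {
            "cap_and_B_or_I":   1.0 if (y_i in ("B", "I") and is_cap(x_i)) else 0.0,
            "firstname_and_B":  1.0 if (y_i == "B" and x_i in FIRST_NAMES) else 0.0,
            "lower_upper_notB": 1.0 if (y_next != "B" and is_lower(x_i)
                                        and is_cap(x_next)) else 0.0,
        }

    WEIGHTS = {"cap_and_B_or_I": 2.0, "firstname_and_B": 1.0, "lower_upper_notB": -5.0}

    def edge_weight(y_i, y_next, x_i, x_next):
        return sum(WEIGHTS[name] * value
                   for name, value in edge_features(y_i, y_next, x_i, x_next).items())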

11
CRF learning from Sha & Pereira
12
CRF learning from Sha & Pereira
13
CRF learning from Sha & Pereira
Something like forward-backward
  • Idea
  • Define a matrix of y,y' affinities at stage j
  • Mj[y,y'] = unnormalized probability of a
    transition from y to y' at stage j, as in the notes
    above
  • Mj · Mj+1 = unnormalized probability of any
    path through stages j and j+1 (sketched below)
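A small numpy sketch of that matrix view, assuming the Mj entries are already nonnegative scores (how each Mj is built from features is left abstract here):

    import numpy as np

    def partition_function(Ms, start, stop):
        # Ms: list of K x K matrices; Ms[j][y, y'] is the unnormalized score of
        # moving from tag y to tag y' at stage j.
        # start, stop: length-K vectors for the beginning and end of the sequence.
        v = start.copy()
        for M in Ms:
            v = v @ M            # v[y'] = sum over y of v[y] * M[y, y']
        return float(v @ stop)   # Z: summed score of all complete paths

The same product with max in place of sum gives the best-path (Viterbi) score.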

14
Forward backward ideas
[Diagram: forward-backward on a lattice of name / nonName states across three positions, with nodes labeled a through h]
15
CRF learning from Sha Pereira
16
Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably
beats voted perceptron
17
Sha & Pereira results
(times in minutes, 375k examples)
18
Klein & Manning: Conditional Structure vs.
Estimation
19
Task 1 WSD (Word Sense Disambiguation)
"Bush's election-year ad campaign will begin this
summer, with..." (sense1) "Bush whacking is tiring
but rewarding; who wants to spend all their time
on marked trails?" (sense2) Class is
sense1/sense2; features are context words.
20
Task 1 WSD (Word Sense Disambiguation)
Model 1: Naive Bayes multinomial model.
Use the conditional rule to predict sense s from
context-word observations o. Standard NB
training maximizes joint likelihood under the
independence assumption.
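The slide's formulas are images and do not survive in this transcript; in standard notation, the multinomial NB model and its joint-likelihood (JL) objective are roughly

    Pr(s, o) = Pr(s) · ∏j Pr(oj | s)
    JL = Σ(s,o) log Pr(s, o)

which standard NB training maximizes in closed form from counts.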
21
Task 1 WSD (Word Sense Disambiguation)
Model 2: Keep the same functional form, but maximize
conditional likelihood (sound familiar?)
... or maybe SenseEval score
... or maybe even ...
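Again the slide's formula is an image; the conditional-likelihood (CL) objective for the same functional form is roughly

    CL = Σ(s,o) log Pr(s | o),  where
    Pr(s | o) = Pr(s) ∏j Pr(oj | s) / Σs' Pr(s') ∏j Pr(oj | s')

i.e., the same parameters, scored only on how well they predict the sense given the context.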
22
In other words
MaxEnt vs. Naïve Bayes: different optimization goals
or, dropping a constraint about the f's and the λ's
23
Task 1 WSD (Word Sense Disambiguation)
  • Optimize JL with standard NB learning
  • Optimize SCL, CL with conjugate gradient
  • Also over non-deficient models, using Lagrange
    penalties to enforce a soft version of the
    deficiency constraint
  • Makes sure the non-conditional version is a valid
    probability
  • Punt on optimizing accuracy
  • Penalty for extreme predictions in SCL

24
(No Transcript)
25
Conclusion: maxent beats NB?
All generalizations are wrong?
26
Task 2 POS Tagging
  • Sequential problem
  • Replace NB with HMM model.
  • Standard algorithms maximize joint likelihood
  • Claim: keeping the same model but maximizing
    conditional likelihood leads to a CRF
  • Is this true?
  • Alternative is conditional structure (CMM)

27
[Figure: graphical models for the HMM and the corresponding CRF]
28
Using conditional structure vs maximizing
conditional likelihood
A CMM factors Pr(s,o) into Pr(s|o)·Pr(o). For the
CMM model, adding dependencies between observations
does not change Pr(s|o), i.e., the JL estimate equals
the CL estimate for Pr(s|o).
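Spelling out the step behind that claim (this derivation is not on the slide):

    log Pr(s, o) = log Pr(s | o) + log Pr(o)

Since the Pr(o) term does not involve the parameters of the next-state classifier Pr(s | o), maximizing joint likelihood and maximizing conditional likelihood pick the same estimate of Pr(s | o) when the model is conditionally structured.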
29
Task 2 POS Tagging
Experiments with a simple feature set: for a fixed
model, CL is preferred to JL (CRF beats HMM); for a
fixed objective, HMM is preferred to MEMM/CMM.
30
Error analysis for POS tagging
  • Label bias is not the issue
  • state-state dependencies are weak compared to
    observation-state dependencies
  • too much emphasis on observation, not enough on
    previous states (observation bias)
  • put another way: label bias predicts
    overprediction of states with few outgoing
    transitions, or more generally, low entropy...

31
Error analysis for POS tagging
32
Stacked Sequential Learning
  • William W. Cohen
  • Center for Automated Learning and Discovery
  • Carnegie Mellon University

Vitor Carvalho, Language Technologies
Institute, Carnegie Mellon University
33
Outline
  • Motivation
  • MEMMs don't work on segmentation tasks
  • New method
  • Stacked sequential MaxEnt
  • Stacked sequential Anything
  • Results
  • More results...
  • Conclusions

34
However, in celebration of the locale, I will
present these results in the style of Sir Walter
Scott (1771-1832), author of Ivanhoe and other
classics. In that pleasant district of merry
Pennsylvania which is watered by the river Mon,
there extended since ancient times a large
computer science department. Such being our
chief scene, the date of our story refers to a
period towards the middle of the year 2003 ....
35
Chapter 1, in which a graduate student (Vitor)
discovers a bug in his advisor's code that he
cannot fix.
The problem: identifying reply and signature
sections of email messages. The method: classify
each line as reply, signature, or other.
36
Chapter 1, in which a graduate student discovers
a bug in his advisor's code that he cannot fix.
The problem: identifying reply and signature
sections of email messages. The method: classify
each line as reply, signature, or other. The
warmup: classify each line as signature or
nonsignature, using learning methods from
Minorthird, and a dataset of 600 messages. The
results: from CEAS-2004, Carvalho & Cohen....
37
Chapter 1, in which a graduate student discovers
a bug in his advisors code that he cannot fix
But... Minorthird's version of MEMMs has an
accuracy of less than 70% (guessing the majority
class gives an accuracy around 10% higher!)
38
Flashback, in which we recall the invention and
re-invention of sequential classification with
recurrent sliding windows, ..., MaxEnt Markov
Models (MEMM)
  • From data, learn Pr(yi | yi-1, xi)
  • MaxEnt model
  • To classify a sequence x1,x2,... search for the
    best y1,y2,...
  • Viterbi
  • beam search (see the sketch below)
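A minimal sketch of MEMM-style decoding with beam search; the local_prob interface and the START label are assumptions for illustration, not the Minorthird implementation.

    import heapq, math

    def beam_decode(local_prob, xs, labels, beam=5):
        # local_prob(prev_y, x) -> dict mapping each label to Pr(y | prev_y, x)
        hyps = [(0.0, [])]                       # (log probability, partial label sequence)
        for x in xs:
            expanded = []
            for logp, ys in hyps:
                prev = ys[-1] if ys else "START"   # assumed dummy start label
                probs = local_prob(prev, x)
                for y in labels:
                    expanded.append((logp + math.log(probs[y]), ys + [y]))
            hyps = heapq.nlargest(beam, expanded, key=lambda h: h[0])  # keep the best few
        return max(hyps, key=lambda h: h[0])[1]

With beam=1 this reduces to the greedy left-to-right classifier.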

39
Flashback, in which we recall the invention and
re-invention of sequential classification with
recurrent sliding windows, ..., MaxEnt Markov
Models (MEMM) ... and also praise their many
virtues relative to CRFs
  • MEMMs are easy to implement
  • MEMMs train quickly
  • no probabilistic inference in the inner loop of
    learning
  • You can use any old classifier (even if it's not
    probabilistic)
  • MEMMs scale well with number of classes and
    length of history

Pr(Yi | Yi-1, Yi-2, ..., f1(Xi), f2(Xi), ...) ...
40
The flashback ends and we return again to our
document analysis task, on which the elegant
MEMM method fails miserably for reasons unknown.
MEMMs have an accuracy of less than 70% on this
problem. But why?
41
Chapter 2, in which, in the fullness of time, the
mystery is investigated...
...and it transpires that often the classifier
predicts a signature block that is much longer
than is correct
...as if the MEMM gets stuck predicting the sig
label.
42
Chapter 2, in which, in the fullness of time, the
mystery is investigated...
...and it transpires that Pr(Yi = sig | Yi-1 = sig) ≈
1 - ε as estimated from the data, giving the
previous label a very high weight.
43
Chapter 2, in which, in the fullness of time, the
mystery is investigated...
  • We added sequence noise by randomly switching
    around 10% of the lines; this
  • lowers the weight for the previous-label feature
  • improves performance for MEMMs
  • degrades performance for CRFs
  • Adding noise in this case, however, is a loathsome
    bit of hackery.

44
Chapter 2, in which, in the fullness of time, the
mystery is investigated...
  • Label bias problem: CRFs can represent some
    distributions that MEMMs cannot [Lafferty et al
    2000]
  • e.g., the rib-rob problem
  • this doesn't explain why MaxEnt >> MEMMs
  • Observation bias problem: MEMMs can overweight
    observation features [Klein and Manning 2002]
  • here we observe the opposite: the history
    features are overweighted

45
Chapter 2, in which, in the fullness of time, the
mystery is investigated...and an explanation is
proposed.
  • From data, learn Pr(yi | yi-1, xi)
  • MaxEnt model
  • To classify a sequence x1,x2,... search for the
    best y1,y2,...
  • Viterbi
  • beam search

46
Chapter 2, in which, in the fullness of time, the
mystery is investigated...and an explanation is
proposed.
  • From data, learn Pr(yi | yi-1, xi)
  • MaxEnt model
  • To classify a sequence x1,x2,... search for the
    best y1,y2,...
  • Viterbi
  • beam search

Learning data is noise-free, including values for
Yi-1.
Classification data: values for Yi-1 are noisy,
since they come from predictions; i.e., the
history values used at learning time are a poor
approximation of the values seen at classification time.
47
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • From data, learn Pr(yi | yi-1, xi)
  • MaxEnt model
  • To classify a sequence x1,x2,... search for the
    best y1,y2,...
  • Viterbi
  • beam search

While learning, replace the true value for Yi-1
with an approximation of the predicted value of
Yi-1.
To approximate the value predicted by the MEMM, use
the value predicted by non-sequential MaxEnt in a
cross-validation experiment. After Wolpert
[1992] we call this "stacked MaxEnt":
find approximate Y's with a MaxEnt-learned
hypothesis, and then apply the sequential model
to that.
48
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • Learn Pr(yi|xi) with MaxEnt and save the model as
    f(x)
  • Do k-fold cross-validation with MaxEnt, saving
    the cross-validated predictions y'i = fk(xi)
  • Augment the original examples with the y's and
    compute history features: g(x, y') → x'
  • Learn Pr(yi|x'i) with MaxEnt and save the model
    as f'(x')
  • To classify: augment x with y' = f(x), and apply f'
    to the resulting x', i.e., return f'(g(x, f(x)))

[Diagram: the base model f and the stacked model f'
(the procedure is sketched in code below)]
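A minimal sketch of that recipe, using scikit-learn's LogisticRegression as the MaxEnt learner; the window encoding, helper names, and the assumption that X holds one contiguous sequence of lines with numeric 0/1 labels are illustrative choices, not the authors' implementation.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_predict

    def history_features(X, y_hat, w=1):
        # g(x, y'): append the predicted labels of the w previous and w following
        # lines to each line's feature vector (zero-padded at the sequence ends)
        cols = []
        for d in range(-w, w + 1):
            if d == 0:
                continue
            shifted = np.roll(y_hat, d).astype(float)
            if d > 0:
                shifted[:d] = 0.0     # before the first line
            else:
                shifted[d:] = 0.0     # after the last line
            cols.append(shifted)
        return np.hstack([X, np.column_stack(cols)])

    def train_stacked_maxent(X, y, k=5, w=1):
        f = LogisticRegression(max_iter=1000).fit(X, y)               # base model f
        y_cv = cross_val_predict(LogisticRegression(max_iter=1000),   # cross-validated y'
                                 X, y, cv=k)
        f_prime = LogisticRegression(max_iter=1000).fit(              # stacked model f'
            history_features(X, y_cv, w), y)
        return f, f_prime

    def predict_stacked(f, f_prime, X, w=1):
        y_hat = f.predict(X)                                    # y' = f(x)
        return f_prime.predict(history_features(X, y_hat, w))   # f'(g(x, f(x)))

In a faithful implementation the history features would be computed per message rather than across the whole dataset, and the stacking could be iterated for greater depth, as the later bullets describe.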
49
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • StackedMaxEnt (k=5) outperforms MEMMs and
    non-sequential MaxEnt, but not CRFs
  • StackedMaxEnt can also be easily extended....
  • It's easy (but expensive) to increase the depth
    of stacking
  • It's easy to increase the history size
  • It's easy to build features for future
    estimated Yi's as well as past Yi's.
  • stacking can be applied to any other sequential
    learner

50
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • StackedMaxEnt can also be easily extended....
  • It's easy (but expensive) to increase the depth
    of stacking
  • It's cheap to increase the history size
  • It's easy to build features for future
    estimated Yi's as well as past Yi's.
  • stacking can be applied to any other sequential
    learner

[Diagram: estimated labels Yi-1, Yi, Yi+1 from the previous stacking level, together with observations Xi-1, Xi, Xi+1, feeding the next level's model]
51
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • StackedMaxEnt can also be easily extended....
  • It's easy (but expensive) to increase the depth
    of stacking
  • It's cheap to increase the history size
  • It's easy to build features for future
    estimated Yi's as well as past Yi's.
  • stacking can be applied to any other sequential
    learner

[Diagram: estimated labels Yi-1, Yi, Yi+1 from the previous stacking level, together with observations Xi-1, Xi, Xi+1, feeding the next level's model]
52
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • StackedMaxEnt can also be easily extended....
  • It's cheap to increase the history size, and to
    build features for future estimated Yi's as
    well as past Yi's.

53
Chapter 3, in which a novel extension to MEMMs is
proposed that will correct the performance problem
  • StackedMaxEnt can also be easily extended....
  • It's easy (but expensive) to increase the depth
    of stacking
  • It's cheap to increase the history size
  • It's easy to build features for future
    estimated Yi's as well as past Yi's.
  • stacking can be applied to any other sequential
    learner
  • Learn Pr(yi|xi) with MaxEnt and save the model as
    f(x)
  • Do k-fold cross-validation with MaxEnt, saving
    the cross-validated predictions y'i = fk(xi)
  • Augment the original examples with the y's and
    compute history features: g(x, y') → x'
  • Learn Pr(yi|x'i) with MaxEnt and save the model
    as f'(x')
  • To classify: augment x with y' = f(x), and apply f'
    to the resulting x', i.e., return f'(g(x, f(x)))

54
Chapter 3, in which a novel extension to MEMMs is
proposed and several diverse variants of the
extension are evaluated on signature-block
finding....
[Chart: error rate vs. window/history size, with curves for the
non-sequential MaxEnt baseline, the CRF baseline, stacked MaxEnt with no
future features, and stacked MaxEnt / stacked CRFs with a large
history+future window]
Reduction in error rate for stacked MaxEnt (s-ME) vs CRFs is 46%, which is
statistically significant. With large windows, stacked MaxEnt is better
than the CRF baseline.
55
Chapter 4, in which the experiment above is
repeated on a new domain, and then repeated again
on yet another new domain.
[Chart: with and without stacking (w=k=5) on newsgroup FAQ segmentation
(2 labels x three newsgroups) and on video segmentation]
56
Chapter 4, in which the experiment above is
repeated on a new domain, and then repeated again
on yet another new domain.
57
Chapter 5, in which all the experiments above
were repeated for a second set of learners: the
voted perceptron (VP), the voted-perceptron-trained
HMM (VP-HMM), and their stacked versions.
58
Chapter 5, in which all the experiments above
were repeated for a second set of learners: the
voted perceptron (VP), the voted-perceptron-trained
HMM (VP-HMM), and their stacked versions.
  • Stacking usually improves or leaves unchanged
  • MaxEnt (p > 0.98)
  • VotedPerc (p > 0.98)
  • VP-HMM (p > 0.98)
  • CRFs (p > 0.92)
  • on a randomly chosen problem, using a 1-tailed
    sign test

59
Chapter 4b, in which the experiment above is
repeated again for yet one more new domain....
  • Classify pop songs as happy or sad
  • 1-second long song frames inherit the mood of
    their containing song
  • Song frames are classified with a sequential
    classifier
  • Song mood is the majority class of all its
    frames (see the sketch below)
  • 52,188 frames from 201 songs, 130 features per
    frame; used k=5, w=25
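A tiny sketch of the frame-majority rule mentioned above (names are illustrative):

    from collections import Counter

    def song_mood(frame_labels):
        # song mood = majority class over the predicted labels of its frames
        return Counter(frame_labels).most_common(1)[0][0]

    # e.g. song_mood(["happy", "happy", "sad"]) -> "happy"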

60
Epilog, in which the speaker discusses certain
issues of possible interest to the listener, who
is now fully informed of the technical issues (or
it may be, only better rested) and thus receptive
to such commentary
  • Scope
  • we considered only segmentation tasks (sequences
    with long runs of identical labels) and 2-class
    problems.
  • MEMM fails here.
  • Issue
  • learner is brittle w.r.t. assumptions
  • training data for the local model is assumed to be
    error-free, which is systematically wrong
  • Solution: sequential stacking
  • a model-free way to improve robustness
  • stacked MaxEnt outperforms or ties CRFs on 8/10
    tasks; stacked VP outperforms CRFs on 8/9 tasks.
  • a meta-learning method: applies to any base
    learner, and can also reduce the error of CRFs
    substantially
  • experiments with non-segmentation problems (NER)
    showed no large gains

61
Epilog, in which the speaker discusses certain
issues of possible interest to the listener, who
is now fully informed of the technical issues (or
it may be, only better rested) and thus receptive
to such commentary
  • ... and in which finally, the speaker realizes
    that the structure of the epic romantic novel is
    ill-suited to talks of this ilk, and perhaps even
    the very medium of PowerPoint itself, but
    none-the-less persists with a final animation...

Sir W. Scott R.I.P.