Title: Conditional Random Fields
1. Conditional Random Fields
- William W. Cohen
- Feb 13, 2007
2. One winter day in a certain unnamed small college town, there was a snowstorm of such epic proportions that many roads were closed down. However, one stalwart and dedicated student decided to make the trek to class anyway. Because of the treacherous conditions, she arrived at the lecture hall forty minutes late, only to find the room empty except for the professor, busy lecturing, and one other classmate. She took the seat next to him. After a few minutes, she leaned over and asked her fellow student, "What's the prof talking about?" The other student replied, "How should I know? I only got here ten minutes before you."
- Lillian Lee, Cornell CS
3. Announcements
- This week
- Office hours Fri 10:30-12:00
- Lecture 1: Sha & Pereira, Lafferty et al. 2001, Klein & Manning
- Lecture 2: Stacked Sequential Learning
- Three student presentations
4. Review: motivation for CMMs
Ideally we would like to use many, arbitrary, overlapping features of words.
[Figure: HMM graphical model with hidden states S_{t-1}, S_t, S_{t+1} and observations O_{t-1}, O_t, O_{t+1}. Candidate features: identity of word; ends in "-ski"; is capitalized; is part of a noun phrase; is in a list of city names; is under node X in WordNet; is in bold font; is indented; is in hyperlink anchor. Example: "Wisniewski" is part of a noun phrase and ends in "-ski".]
5. Motivation for CMMs
[Figure: the same HMM diagram and feature list as the previous slide.]
Idea: replace the generative model in the HMM with a maxent model, where the state depends on the observations and the previous state.
6. Implications of the model
- Does this do what we want?
- Q: does Y_{i-1} depend on X_{i+1}?
- A: a node is conditionally independent of its non-descendants given its parents, so no: in this directed model an earlier label cannot depend on a later observation.
7. Inference for MXPOST
[Figure: trellis over the sentence "When will prof Cohen post the notes", with one column per word and states B, I, O in each column.]
(Approximate view) find the best path; weights are now on arcs from state to state.
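The "best path" view is ordinary Viterbi decoding over the trellis. A minimal sketch, assuming log-space arc weights are given as input (an illustration, not MXPOST's actual code):

```python
import numpy as np

STATES = ["B", "I", "O"]  # one trellis column of these per word

def viterbi(start, arcs):
    """start: log-weights for the first column, shape (3,).
    arcs: list of (3, 3) matrices; arcs[t][i, j] is the log-weight on
    the arc from state i at word t to state j at word t + 1.
    Returns the highest-weight B/I/O sequence."""
    delta, back = start, []
    for M in arcs:
        scores = delta[:, None] + M          # score of every arc into the next column
        back.append(scores.argmax(axis=0))   # best predecessor of each state
        delta = scores.max(axis=0)
    path = [int(delta.argmax())]             # best final state
    for bp in reversed(back):                # walk the backpointers
        path.append(int(bp[path[-1]]))
    return [STATES[s] for s in reversed(path)]
```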
8. Inference for MXPOST
[Figure: the same B/I/O trellis over "When will prof Cohen post the notes".]
More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.
9. Label Bias Problem
- P(1 and 2 | ro) = P(2 | 1 and ro) · P(1 | ro) = P(2 | 1 and o) · P(1 | r)
- P(1 and 2 | ri) = P(2 | 1 and ri) · P(1 | ri) = P(2 | 1 and i) · P(1 | r)
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri)
- In the training data, label value 2 is the only label value observed after label value 1
- Therefore P(2 | 1) = 1, so P(2 | 1 and x) = 1 for all x
- However, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro)
- Per-state normalization does not allow the required expectation
10. Label Bias Problem
- Consider this MEMM, and enough training data to model it perfectly
In the data: Pr(0123 | rib) = 1, Pr(0453 | rob) = 1
Pr(0123 | rob) = Pr(1 | 0, r)/Z1 · Pr(2 | 1, o)/Z2 · Pr(3 | 2, b)/Z3 = 0.5 · 1 · 1
Pr(0453 | rib) = Pr(4 | 0, r)/Z1 · Pr(5 | 4, i)/Z2 · Pr(3 | 5, b)/Z3 = 0.5 · 1 · 1
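The arithmetic is easy to check directly. A small sketch of this MEMM with its locally normalized transition table (the encoding is mine; the probabilities are the ones implied above):

```python
def memm_path_prob(path, word, P):
    """Product of per-state (locally normalized) transition probabilities."""
    prob = 1.0
    for s, x, s2 in zip(path, word, path[1:]):
        prob *= P.get((s, x, s2), 0.0)
    return prob

# From state 0, 'r' cannot prefer a branch: both get 0.5. States 1, 2, 4, 5
# each have a single successor, so they must pass all of their probability
# mass forward, whatever the observation is.
P = {(0, 'r', 1): 0.5, (0, 'r', 4): 0.5,
     (1, 'i', 2): 1.0, (1, 'o', 2): 1.0,
     (4, 'i', 5): 1.0, (4, 'o', 5): 1.0,
     (2, 'b', 3): 1.0, (5, 'b', 3): 1.0}

for word in ("rib", "rob"):
    for path in ((0, 1, 2, 3), (0, 4, 5, 3)):
        print(word, path, memm_path_prob(path, word, P))
# Every (word, path) pair prints 0.5: the middle observation is ignored,
# which is exactly the label bias problem.
```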
11. How important is label bias?
- Could be avoided in this case by changing the structure
- Our models are always wrong; is this wrongness a problem?
- See Klein & Manning's paper for more on this.
12. Another view of label bias [Sha & Pereira]
So what's the alternative?
13. Inference for MXPOST
[Figure: the B/I/O trellis again, repeated from slide 8.]
More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.
14. Another max-flow scheme
[Figure: the same B/I/O trellis over "When will prof Cohen post the notes".]
More accurately: find the total flow to each node; weights are now on arcs from state to state. Flow out of a node is always fixed.
15. Another max-flow scheme: MRFs
[Figure: the B/I/O trellis, now as an undirected graph.]
Find the total flow to each node; weights are now on edges from state to state. The goal is to learn how to weight edges in the graph, given features from the examples.
16. CRFs vs MEMMs
- MEMMs
- Sequence classification f: x → y is reduced to many cases of ordinary classification, f: x_i → y_i
- combined with Viterbi or beam search
- CRFs
- Sequence classification f: x → y is done by
- Converting x, Y to an MRF
- Using flow computations on the MRF to compute some best y | x
[Figure: an input x1 … x6 and labels y1 … y6 converted to an MRF with pairwise potentials f(Y1,Y2), f(Y2,Y3), … and per-position scores such as Pr(Y | x2, y1), Pr(Y | x4, y3), Pr(Y | x5, y5).]
17. The math: review of maxent
18. Review of maxent/MEMM/CMMs
We know how to compute this.
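The equations on these two slides did not survive extraction. For reference, the standard maxent conditional model being reviewed has the form (a reconstruction, not necessarily the lost slide's exact notation):

\[
P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_k \lambda_k f_k(x, y)\Big),
\qquad
Z(x) = \sum_{y'} \exp\Big(\sum_k \lambda_k f_k(x, y')\Big)
\]

A CMM/MEMM applies this classifier once per position, modeling P(y_i | y_{i-1}, x); the per-position Z is exactly the local normalization criticized below.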
19. Details on CMMs
20. From CMMs to CRFs
Recall why we're unhappy: we don't want local normalization.
How to compute this?
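The model we want (the equation lost from this slide) is, in the standard Lafferty et al. form, normalized once over the whole sequence rather than per state:

\[
P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})}
\exp\Big(\sum_i \sum_k \lambda_k f_k(y_{i-1}, y_i, \mathbf{x}, i)\Big),
\qquad
Z(\mathbf{x}) = \sum_{\mathbf{y}'} \exp\Big(\sum_i \sum_k \lambda_k f_k(y'_{i-1}, y'_i, \mathbf{x}, i)\Big)
\]

Computing the global Z(x), a sum over all label sequences, is the "how to compute this?" in question.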
21. What's the new model look like?
What's independent?
22. What's the new model look like?
What's independent now?
[Figure: an undirected chain over labels y1, y2, y3, with each label also connected to the whole observation x.]
23. Hammersley-Clifford
- For positive distributions P(x1, …, xn):
- Pr(xi | x1, …, xi-1, xi+1, …, xn) = Pr(xi | Neighbors(xi))
- Pr(A | B, S) = Pr(A | S), where A, B are sets of nodes and S is a set that separates A and B
- P can be written as a normalized product of clique potentials
So this is very general: any Markov distribution can be written in this form (modulo nits like the positive-distribution requirement).
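For concreteness, the factorization the theorem guarantees is

\[
P(x_1, \dots, x_n) = \frac{1}{Z} \prod_{c \in \mathrm{cliques}(G)} \psi_c(\mathbf{x}_c)
\]

with each potential ψ_c a positive function of only the variables in clique c. In a chain the cliques are adjacent pairs, which is what licenses the edge-weight view of the earlier slides.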
24. Definition of CRFs
X is a random variable over data sequences to be labeled. Y is a random variable over corresponding label sequences.
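The defining condition (as in Lafferty et al. 2001) is that Y obeys the Markov property with respect to its graph once we condition on all of X:

\[
P(Y_v \mid X, \, Y_w, w \neq v) = P(Y_v \mid X, \, Y_w, w \sim v)
\]

where w ∼ v means that w and v are neighbors in the graph over Y.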
25. Example of CRFs
26. Graphical comparison among HMMs, MEMMs, and CRFs
[Figure: side-by-side graphical models of an HMM, an MEMM, and a CRF.]
27. Lafferty et al. notation
28. Conditional Distribution (cont'd)
- CRFs use the observation-dependent normalization Z(x) for the conditional distributions; Z(x) is a normalization over the data sequence x
- Learning
- Lafferty et al.'s IIS-based method is rather inefficient
- Gradient-based methods are faster
- The trickiest bit is computing the normalization, which is over exponentially many y vectors
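For reference, the gradient these methods climb is the standard observed-minus-expected feature-count form (a well-known result, reconstructed here rather than copied from the slide):

\[
\frac{\partial \log P(\mathbf{y} \mid \mathbf{x})}{\partial \lambda_k}
= \sum_i f_k(y_{i-1}, y_i, \mathbf{x}, i)
- \sum_{\mathbf{y}'} P(\mathbf{y}' \mid \mathbf{x}) \sum_i f_k(y'_{i-1}, y'_i, \mathbf{x}, i)
\]

The second term is the expectation over exponentially many y' vectors; the matrix construction on the next slides makes it tractable.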
29. CRF learning [from Sha & Pereira]
30. CRF learning [from Sha & Pereira]
31. CRF learning [from Sha & Pereira]
Something like forward-backward:
- Idea
- Define a matrix of y, y' affinities at stage i
- M_i[y, y'] = unnormalized probability of a transition from y to y' at stage i
- M_i · M_{i+1} = unnormalized probability of any path through stages i and i+1
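A minimal numpy sketch of the matrix idea, assuming the Sha & Pereira convention of dedicated start and stop labels (the indices and the feature-function list are illustrative):

```python
import numpy as np
from functools import reduce

START, STOP = 0, 1  # assumed indices of the added start/stop labels

def transition_matrix(weights, features, x, i, n_labels):
    """M_i[y, y2] = exp(sum_k lambda_k f_k(y, y2, x, i)).
    `features` is a hypothetical list of feature functions f_k."""
    M = np.zeros((n_labels, n_labels))
    for y in range(n_labels):
        for y2 in range(n_labels):
            M[y, y2] = np.exp(sum(w * f(y, y2, x, i)
                                  for w, f in zip(weights, features)))
    return M

def partition_function(M):
    """Z(x): multiplying the M_i sums the scores of *all* label paths,
    so the (START, STOP) entry of the full product is the normalizer."""
    return reduce(np.matmul, M)[START, STOP]
```

In practice this is done in log space to avoid overflow; the plain product form above matches the slide.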
32. [Figure: two copies of the y1, y2, y3 chain with observation x, repeated from slide 22.]
33. Forward-backward ideas
[Figure: a three-stage lattice with states "name" and "nonName" at each stage, and edge weights labeled a through h.]
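Continuing the sketch above, forward and backward flows through the same matrices yield the edge marginals needed for the expected feature counts, i.e. something like forward-backward:

```python
import numpy as np

START, STOP = 0, 1  # same assumed start/stop labels as before

def edge_marginals(M):
    """M: list of per-stage matrices as on slide 31. Returns one (S, S)
    array per stage giving P(y at i, y2 at i + 1 | x)."""
    S = M[0].shape[0]
    alpha = [np.eye(S)[START]]          # forward flow out of START
    for Mi in M:
        alpha.append(alpha[-1] @ Mi)
    beta = [np.eye(S)[STOP]]            # backward flow into STOP
    for Mi in reversed(M):
        beta.append(Mi @ beta[-1])
    beta.reverse()
    Z = alpha[-1][STOP]                 # same Z(x) as the matrix product
    return [np.outer(alpha[i], beta[i + 1]) * M[i] / Z
            for i in range(len(M))]
```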
34. CRF learning [from Sha & Pereira]
35. Sha & Pereira results
CRF beats MEMM (McNemar's test); MEMM probably beats voted perceptron.
36. Sha & Pereira results
[Table: training times in minutes, 375k examples.]
37. Some recent results
ICML 2006
38. Some recent results
39. POS tagging experiments in Lafferty et al.
- Compared HMMs, MEMMs, and CRFs on Penn Treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Add a small set of orthographic features: whether a spelling begins with a number or an upper-case letter, whether it contains a hyphen, and whether it contains one of the suffixes -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies (a sketch of such features follows this list)
- oov = out-of-vocabulary (not observed in the training set)
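A quick sketch of features of this kind (function and feature names are illustrative, not from the paper):

```python
SUFFIXES = ("ing", "ogy", "ed", "s", "ly", "ion", "tion", "ity", "ies")

def orthographic_features(word):
    """The active binary spelling features for one token."""
    feats = {
        "starts_with_digit": word[0].isdigit(),
        "starts_with_upper": word[0].isupper(),
        "contains_hyphen": "-" in word,
    }
    for suf in SUFFIXES:
        feats["suffix_" + suf] = word.endswith(suf)
    return {name for name, on in feats.items() if on}

print(orthographic_features("Baking"))  # {'starts_with_upper', 'suffix_ing'}
```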
40. POS tagging vs MXPost