Title: Log-Linear Models in NLP
1Log-Linear Models in NLP
- Noah A. Smith
- Department of Computer Science /
- Center for Language and Speech Processing
- Johns Hopkins University
- nasmith@cs.jhu.edu
2Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
3Data
For now, we're just talking about modeling data.
No task.
How to assign probability to each shape type?
4Maximum Likelihood
Counts and relative frequencies for the 12 types (icons not recoverable):
3 → .19   2 → .12   0 → 0   1 → .06
4 → .25   3 → .19   1 → .06   0 → 0
0 → 0   1 → .06   0 → 0   1 → .06
Fewer parameters?
How to smooth?
11 degrees of freedom (12 - 1).
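A minimal sketch of this count-and-normalize estimate, assuming each observation is a (color, shape, size) tuple; the attribute values below are hypothetical stand-ins for the slide's icons:

```python
from collections import Counter

def mle(observations):
    """Relative-frequency (count-and-normalize) estimate of Pr(type)."""
    counts = Counter(observations)
    total = sum(counts.values())
    return {x: c / total for x, c in counts.items()}

# Hypothetical observations: each one is a (color, shape, size) triple.
data = [("dark", "circle", "small"), ("light", "square", "large"),
        ("dark", "circle", "small"), ("light", "circle", "small")]
print(mle(data))  # e.g. {('dark', 'circle', 'small'): 0.5, ('light', 'square', 'large'): 0.25, ...}
```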
5Some other kinds of models
11 degrees of freedom (1 + 4 + 6).
Pr(Color) = 0.5, 0.5
Pr(Shape | Color) = 0.125, 0.375, 0.500 for each color (these two are the same!)
Pr(Size | Color, Shape): large/small pairs (0.000, 1.000), (0.333, 0.667), (0.250, 0.750), (1.000, 0.000), (0.000, 1.000), (0.000, 1.000)
Pr(Color, Shape, Size) = Pr(Color) × Pr(Shape | Color) × Pr(Size | Color, Shape)
6Some other kinds of models
9 degrees of freedom (1 + 2 + 6).
Pr(Color) = 0.5, 0.5
Pr(Shape) = 0.125, 0.375, 0.500
Pr(Size | Color, Shape): large/small pairs (0.000, 1.000), (0.333, 0.667), (0.250, 0.750), (1.000, 0.000), (0.000, 1.000), (0.000, 1.000)
Pr(Color, Shape, Size) = Pr(Color) × Pr(Shape) × Pr(Size | Color, Shape)
7Some other kinds of models
7 degrees of freedom (1 + 2 + 4).
Pr(Size) = large 0.375, small 0.625
Pr(Shape | Size): large 0.333, 0.333, 0.333; small 0.077, 0.385, 0.538
Pr(Color | Size): large 0.667, 0.333; small 0.462, 0.538
No zeroes here ...
Pr(Color, Shape, Size) = Pr(Size) × Pr(Shape | Size) × Pr(Color | Size)
8Some other kinds of models
4 degrees of freedom (1 + 2 + 1).
Pr(Color) = 0.5, 0.5
Pr(Size) = large 0.375, small 0.625
Pr(Shape) = 0.125, 0.375, 0.500
Pr(Color, Shape, Size) = Pr(Size) × Pr(Shape) × Pr(Color)
9This is difficult.
- Different factorizations affect
- smoothing
- parameters (model size)
- model complexity
- interpretability
- goodness of fit ...
Usually, this isn't done empirically, either!
10Desiderata
- You decide which features to use.
- Some intuitive criterion tells you how to use them in the model.
- Empirical.
11Maximum Entropy
- Make the model as uniform as possible ...
- ... but I noticed a few things that I want to model ...
- ... so pick a model that fits the data on those things.
12Occam's Razor
One should not increase, beyond what is
necessary, the number of entities required to
explain anything.
13Uniform model
small 0.083 0.083 0.083
small 0.083 0.083 0.083
large 0.083 0.083 0.083
large 0.083 0.083 0.083
14Constraint: Pr(small) = 0.625
small 0.104 0.104 0.104
small 0.104 0.104 0.104
large 0.063 0.063 0.063
large 0.063 0.063 0.063
(small rows sum to 0.625)
15Constraint: Pr(•, small) = 0.048 (• is a shape icon not recoverable from the slide)
small 0.024 0.144 0.144
small 0.024 0.144 0.144
large 0.063 0.063 0.063
large 0.063 0.063 0.063
(marginals shown on the slide: 0.048, 0.625)
16Constraint: Pr(large, •) = 0.125 (• is a shape icon not recoverable from the slide)
small 0.024 0.144 0.144
small 0.024 0.144 0.144
large 0.063 0.063 0.063
large 0.063 0.063 0.063
(marginals shown: 0.048, 0.625; the cell for the new constraint is marked "?")
17Questions
Is there an efficient way to solve this problem?
Does a solution always exist?
Is there a way to express the model succinctly?
What to do if it doesn't?
18Entropy
- A statistical measurement on a distribution.
- Measured in bits.
- H(p) ∈ [0, log2|X|]
- High entropy: close to uniform
- Low entropy: close to deterministic
- Concave in p.
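As a quick illustration of these bullets, a small sketch that computes H(p) in bits for a discrete distribution:

```python
import math

def entropy(p):
    """Entropy, in bits, of a discrete distribution given as a list of probabilities."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 outcomes: 2.0 bits (the maximum, log2 4)
print(entropy([1.0, 0.0, 0.0, 0.0]))      # deterministic: 0.0 bits (the minimum)
```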
19The Max Ent Problem
[Figure: entropy H plotted over the probability simplex (axes p1, p2).]
20The Max Ent Problem
We are picking a distribution p to
maximize H(p)   (the objective function is H)
subject to Σx p(x) = 1 and p(x) ≥ 0 for all x   (probabilities sum to 1 ... and are nonnegative)
and Ep[fi] = Edata[fi] for i = 1, ..., n   (n constraints: the expected feature value under the model equals the expected feature value from the data)
21The Max Ent Problem
[Figure: entropy H over the probability simplex (axes p1, p2), now with the constraint region shown.]
22About feature constraints
1 if x is large and light, 0 otherwise
1 if x is small, 0 otherwise
23Mathematical Magic
Constrained, |X| variables (p), concave in p → unconstrained, N variables (λ), concave in λ.
24What's the catch?
The model takes on a specific, parameterized
form. It can be shown that any max-ent model
must take this form.
25Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
26Log-linear models
Taking the log of the model gives something linear in the features: hence "log-linear".
27Log-linear models
One parameter (λi) for each feature.
p(x) = exp(Σi λi fi(x)) / Z
The numerator exp(Σi λi fi(x)) is the unnormalized probability, or weight; Z = Σx′ exp(Σi λi fi(x′)) is the partition function.
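A minimal sketch of this form for a small enumerable domain; the domain, features, and weights below are made up for illustration:

```python
import math

def loglinear_prob(x, features, lambdas, domain):
    """p(x) = weight(x) / Z for a log-linear model over an enumerable domain."""
    def weight(z):  # unnormalized probability, or "weight"
        return math.exp(sum(lam * f(z) for lam, f in zip(lambdas, features)))
    Z = sum(weight(z) for z in domain)  # partition function
    return weight(x) / Z

# Hypothetical domain and indicator features.
domain = [("large", "light"), ("large", "dark"), ("small", "light"), ("small", "dark")]
features = [lambda z: 1.0 if z[1] == "light" else 0.0,
            lambda z: 1.0 if z == ("small", "dark") else 0.0]
lambdas = [0.7, -0.3]
print(loglinear_prob(("small", "dark"), features, lambdas, domain))
```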
28Mathematical Magic
Max ent problem: constrained, |X| variables (p), concave in p.
Log-linear ML problem: unconstrained, N variables (λ), concave in λ.
These two problems are duals: solving the log-linear maximum-likelihood problem solves the max ent problem.
29What does MLE mean?
λ* = argmaxλ Πj pλ(xj)   (independence among examples)
   = argmaxλ Σj log pλ(xj)   (the arg max is the same in the log domain)
30MLE Then and Now
Directed models: concave, constrained (simplex), count-and-normalize (closed-form solution).
Log-linear models: concave, unconstrained, iterative methods.
31Iterative Methods
All of these methods are correct and will converge to the right answer; it's just a matter of how fast.
- Generalized Iterative Scaling
- Improved Iterative Scaling
- Gradient Ascent
- Newton/Quasi-Newton Methods
- Conjugate Gradient
- Limited-Memory Variable Metric
- ...
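To make the idea of an iterative method concrete, here is a plain gradient-ascent sketch for log-linear MLE over an enumerable domain; the step size and iteration count are arbitrary choices, not values from the slides. The gradient is the observed feature counts minus the model's expected counts:

```python
import math

def train_loglinear(data, features, domain, lr=0.1, iters=500):
    """Plain gradient ascent on the log-likelihood of a log-linear model over `domain`."""
    K = len(features)
    lam = [0.0] * K
    def fvec(z):
        return [f(z) for f in features]
    observed = [sum(col) for col in zip(*(fvec(x) for x in data))]   # empirical feature counts
    for _ in range(iters):
        weights = {z: math.exp(sum(l * fz for l, fz in zip(lam, fvec(z)))) for z in domain}
        Z = sum(weights.values())
        # Gradient of the log-likelihood: observed counts minus N * E_model[f_k].
        expected = [len(data) * sum(weights[z] * fvec(z)[k] for z in domain) / Z
                    for k in range(K)]
        lam = [l + lr * (o - e) for l, o, e in zip(lam, observed, expected)]
    return lam
```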
32Questions
Is there an efficient way to solve this problem? Yes, many iterative methods.
Does a solution always exist? Yes, if the constraints come from the data.
Is there a way to express the model succinctly? Yes, a log-linear model.
33Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
34Conditional Estimation
Given examples x with labels y:
Training objective: maximize Σj log p(yj | xj)
Classification rule: ŷ = argmaxy p(y | x)
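A small sketch of this training objective, assuming features f(x, y) defined over (example, label) pairs and a fixed label set; all names here are hypothetical:

```python
import math

def conditional_log_likelihood(pairs, labels, features, lambdas):
    """Training objective: sum over (x, y) pairs of log p(y | x) under a log-linear model."""
    def score(x, y):
        return sum(lam * f(x, y) for lam, f in zip(lambdas, features))
    total = 0.0
    for x, y in pairs:
        log_Zx = math.log(sum(math.exp(score(x, y2)) for y2 in labels))  # per-example partition function
        total += score(x, y) - log_Zx
    return total
```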
35Maximum Likelihood
[Figure: a grid of (label, object) cells.]
36Maximum Likelihood
[Figure: the same (label, object) grid.]
37Maximum Likelihood
[Figure: the same (label, object) grid.]
38Maximum Likelihood
[Figure: the same (label, object) grid.]
39Conditional Likelihood
[Figure: the same grid; conditional likelihood models Pr(label | object) within each object's column.]
40Remember
- log-linear models
- conditional estimation
41The Whole Picture
MLE: directed models → count and normalize; log-linear models → unconstrained concave optimization.
CLE: directed models → constrained concave optimization; log-linear models → unconstrained concave optimization.
42Log-linear models MLE vs. CLE
In MLE, the partition function sums over all example types × all labels.
In CLE, the partition function Z(x) sums over all labels only.
43Classification Rule
- Pick the most probable label y: ŷ = argmaxy Pr(y | x)
We don't need to compute the partition function at test time!
But it does need to be computed during training.
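Since Z(x) is the same for every candidate label, the rule can be implemented with raw scores only; a minimal sketch (names hypothetical):

```python
def classify(x, labels, features, lambdas):
    """argmax_y Pr(y | x): Z(x) is the same for every y, so raw scores suffice."""
    def score(y):
        return sum(lam * f(x, y) for lam, f in zip(lambdas, features))
    return max(labels, key=score)
```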
44Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
45Ratnaparkhi's POS Tagger (1996)
- Probability model: Pr(t1 ... tn | w1 ... wn) = Πi Pr(ti | hi), each factor a log-linear model over a local history hi
- Assume unseen words behave like rare words.
- Rare words: count < 5
- Training: GIS
- Testing/Decoding: beam search
46Features: common words
the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS
Each feature pairs a contextual predicate with the predicted tag IN for "about": the word itself (about); the surrounding words (the, stories, well-heeled, communities); the previous tag (NNS); and the previous two tags (DT, NNS).
47Features: rare words
the stories about well-heeled communities and developers
DT NNS IN JJ NNS CC NNS
For the rare word "well-heeled", each feature pairs a predicate with the predicted tag JJ: the surrounding words (stories, about, communities, and); the previous tag (IN); the previous two tags (NNS, IN); a contains-hyphen predicate (...-...); the suffixes ...d, ...ed, ...led, ...eled; and the prefixes w..., we..., wel..., well...
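A rough sketch of this kind of rare-word feature extraction (prefixes and suffixes up to length 4, a hyphen check, surrounding words, and previous tags); the template names are hypothetical and this is a simplification of Ratnaparkhi's actual inventory:

```python
def rare_word_features(words, tags, i):
    """Feature predicates for the (rare) word at position i, in the spirit of
    Ratnaparkhi (1996); each would be paired with the candidate tag."""
    w = words[i]
    feats = []
    for k in range(1, 5):                    # prefixes and suffixes up to length 4
        feats.append("prefix=" + w[:k])
        feats.append("suffix=" + w[-k:])
    if "-" in w:
        feats.append("contains-hyphen")
    if i >= 1:
        feats.append("word-1=" + words[i - 1])
        feats.append("tag-1=" + tags[i - 1])
    if i >= 2:
        feats.append("word-2=" + words[i - 2])
        feats.append("tags-2-1=" + tags[i - 2] + "+" + tags[i - 1])
    if i + 1 < len(words):
        feats.append("word+1=" + words[i + 1])
    if i + 2 < len(words):
        feats.append("word+2=" + words[i + 2])
    return feats

words = "the stories about well-heeled communities and developers".split()
tags = ["DT", "NNS", "IN", None, None, None, None]  # tags to the left are already predicted
print(rare_word_features(words, tags, 3))           # features for "well-heeled"
```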
48The Label Bias Problem
Tagged examples, with counts (4) and (6):
born to run → VBN TO VB
born to wealth → VBN TO NN
49The Label Bias Problem
Decoding "born to wealth":
Pr(VBN | born) × Pr(IN | VBN, to) × Pr(NN | VBN, IN, wealth) = 1 × .4 × 1
Pr(VBN | born) × Pr(TO | VBN, to) × Pr(VB | VBN, TO, wealth) = 1 × .6 × 1
[Figure: decoding lattice over the states VBN; VBN,IN; VBN,TO; IN,NN; TO,VB for the words born, to, wealth/run.]
The TO path wins (.6 > .4) even though VB is a terrible tag for "wealth", because once the model is in state TO its only continuation gets probability 1.
50Is this symptomatic of log-linear models?
No!
51Tagging Decisions
[Figure: a lattice of tagging decisions over tag1, tag2, ..., tagn, with candidate tags A, B, C, D at each step.]
At each decision point, the total weight is 1.
Choose the path with the greatest weight. You never pay a penalty for it!
You must choose tag2 = B, even if B is a terrible tag for word2: Pr(tag2 = B | anything at all!) = 1.
52Tagging Decisions in an HMM
[Figure: the same lattice of tagging decisions, now scored by an HMM.]
At each decision point, the total weight can be 0.
Choose the path with the greatest weight.
You may choose to discontinue this path if B can't tag word2, or pay a high cost.
53Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
54Conditional Random Fields
- Lafferty, McCallum, and Pereira (2001)
- Whole sentence model with local features
55Simple CRFs as Graphs
[Figure: the sentence "My cat begs silently" tagged PRP NN VBZ ADV as a chain; in the CRF, feature weights are added together along the chain.]
Compare with an HMM: the same chain, but with log-probs added together.
56What can CRFs do that HMMs can't?
[Figure: the same "My cat begs silently" chain (PRP NN VBZ ADV), with additional features attached.]
57An Algorithmic Connection
What is the partition function? The total weight of all paths.
58CRF weight training
- Maximize log-likelihood: requires the total weight of all paths (computed with the forward algorithm).
- Gradient: requires expected feature values (computed with the forward-backward algorithm).
59Forward, Backward, and Expectations
fk is the number of firings; each firing is at some position. By the Markovian property, the expected count decomposes over positions: each term combines a forward weight (for the part of the path before the firing) with a backward weight (for the part after it).
60Forward, Backward, and Expectations
At each position, the expected feature value combines the forward weight into the position with the backward weight out of it; the forward weight carried to the final state is the weight of all paths.
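A minimal sketch of the forward recursion just described: summing the weights of all tag paths, which is the partition function Z(x) of a linear-chain model. The `log_score` function is a placeholder for the summed feature weights of one local decision:

```python
import math

def log_partition(n, tags, log_score):
    """log Z(x) for a linear-chain model: the (log) total weight of all tag paths.

    log_score(prev_tag, tag, i) is the summed feature weight for choosing `tag` at
    position i given `prev_tag` (prev_tag is None at position 0); it is a placeholder.
    """
    def logsumexp(xs):
        m = max(xs)
        return m + math.log(sum(math.exp(x - m) for x in xs))
    alpha = {t: log_score(None, t, 0) for t in tags}                  # forward weights, position 0
    for i in range(1, n):
        alpha = {t: logsumexp([alpha[s] + log_score(s, t, i) for s in tags]) for t in tags}
    return logsumexp(list(alpha.values()))                            # forward weight to the final state
```

The expected feature values needed for the gradient come from combining these forward weights with the analogous backward weights at each position.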
61Forward-Backward's Clients
Training a CRF: supervised (labeled data); concave; converges to the global max; maximizes p(y | x) (conditional training).
Baum-Welch: unsupervised; bumpy; converges to a local max; maximizes p(x) (y unknown).
62A Glitch
- Suppose we notice that -ly words are always adverbs.
- Call this feature 7.
The expectation can't exceed the max (it can't even reach it).
-ly words are all ADV: this is maximal.
The gradient will always be positive, so λ7 is pushed toward +∞.
63The Dark Side of Log-Linear Models
64Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
65Regularization
- λs shouldn't have huge magnitudes
- Model must generalize to test data
- Example: quadratic penalty
66Bayesian Regularization: Maximum A Posteriori Estimation
67Independent Gaussian Prior (Chen and Rosenfeld, 2000)
Prior over the λs: independent, Gaussian, 0-mean, identical variance.
Taking the log of this prior gives exactly the quadratic penalty!
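A small sketch of how the quadratic penalty enters training: it subtracts λk²/(2σ²) from the objective and λk/σ² from each gradient component (σ² is a tuning constant, not a value from the slides):

```python
def gaussian_prior_penalty(log_likelihood, gradient, lambdas, sigma2=1.0):
    """Apply a 0-mean, shared-variance Gaussian prior: a quadratic penalty on the
    objective and a -lambda_k / sigma^2 correction to each gradient component."""
    value = log_likelihood - sum(l * l for l in lambdas) / (2.0 * sigma2)
    grad = [g - l / sigma2 for g, l in zip(gradient, lambdas)]
    return value, grad
```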
68Alternatives
- Different variances for different parameters
- Laplacian prior (1-norm): not differentiable.
- Exponential prior (Goodman, 2004): all λk ≥ 0.
- Relax the constraints (Kazama & Tsujii, 2003)
69Effect of the penalty
[Figure: the penalty plotted as a function of λk.]
70Kazama & Tsujii's box constraints
The primal Max Ent problem
71Sparsity
- Fewer features → better generalization
- E.g., support vector machines
- Kazama & Tsujii's prior, and Goodman's, give sparsity.
72Sparsity
[Figure: the penalty plotted as a function of λk, with annotations "Gradient is 0" and "Cusp: the function is not differentiable here".]
73Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
74Feature Selection
- Sparsity from priors is one way to pick the features. (Maybe not a good way.)
- Della Pietra, Della Pietra, and Lafferty (1997) gave another way.
75Back to the original example.
76Nine features.
λi = log(counti)
- f1 through f8: each fi is 1 if the example is one particular observed type (icons not recoverable), 0 otherwise
- f9 = 1 unless some other feature fires; λ9 << 0
What's wrong here?
77The Della Pietras' and Lafferty's Algorithm
1. Start out with no features.
2. Consider a set of candidates:
   - Atomic features.
   - Current features conjoined with atomic features.
3. Pick the candidate g with the greatest gain.
4. Add g to the model.
5. Retrain all parameters.
6. Go to 2.
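A schematic sketch of this greedy loop; `gain` (estimated improvement from adding one candidate) and `retrain` (refit all weights) are placeholder callables, not the Della Pietra et al. estimates themselves:

```python
def induce_features(atomic, gain, retrain, rounds=10):
    """Greedy feature induction loop in the style of Della Pietra, Della Pietra,
    and Lafferty (1997). `gain` and `retrain` are placeholder callables."""
    selected, weights = [], []
    for _ in range(rounds):
        # Candidates: atomic features plus current features conjoined with atomic ones.
        candidates = [a for a in atomic if a not in selected]
        candidates += [(f, a) for f in selected for a in atomic if (f, a) not in selected]
        if not candidates:
            break
        g = max(candidates, key=lambda c: gain(selected, weights, c))  # greatest gain
        selected.append(g)
        weights = retrain(selected)                                    # retrain all parameters
    return selected, weights
```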
78Feature Induction Example
[Figure: the "My cat begs silently" chain tagged PRP NN VBZ ADV. Selected features so far; atomic feature candidates such as the tags PRP, NN, VBZ, ADV; other candidates formed by conjunction, such as PRP∧NN, NN∧"cat", NN∧VBZ.]
79Outline
- Maximum Entropy principle
- Log-linear models
- Conditional modeling for classification
- Ratnaparkhi's tagger
- Conditional random fields
- Smoothing
- Feature Selection
80Conclusions
The math is beautiful and easy to implement.
You pick the features; the rest is just math!
Log-linear models combine the strengths of probabilistic models (robustness, data-oriented, mathematically understood) with those of hacks (explanatory power, exploiting experts' choice of features, (can be) more data-oriented).
81Thank you!