Title: Back to Conditional Log-Linear Modeling
1. Back to Conditional Log-Linear Modeling
2. Probability is Useful
(summary of half of the course: statistics)
- We love probability distributions!
- We've learned how to define & use p(...) functions.
- Pick best output text T from a set of candidates
  - speech recognition (HW2); machine translation; OCR; spell correction ...
  - maximize p1(T) for some appropriate distribution p1
- Pick best annotation T for a fixed input I
  - text categorization, parsing, part-of-speech tagging
  - maximize p(T | I); equivalently maximize joint probability p(I,T)
  - often define p(I,T) by noisy channel: p(I,T) = p(T) * p(I | T)
  - speech recognition & other tasks above are cases of this too:
    we're maximizing an appropriate p1(T) defined by p(T | I)
- Pick best probability distribution (a meta-problem!)
  - really, pick best parameters θ: train HMM, PCFG, n-grams, clusters ...
  - maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  - Bayesian smoothing: max p(θ | data) = max p(θ, data) = max p(θ) p(data | θ)
3. Probability is Flexible
(summary of the other half of the course: linguistics)
- We love probability distributions!
- We've learned how to define & use p(...) functions.
- We want p(...) to define probability of linguistic objects
  - Trees of (non)terminals (PCFGs: CKY, Earley, pruning, inside-outside)
  - Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs: regex compilation, best-paths, forward-backward, collocations)
  - Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff
  - Syntactic features, semantics, morphology, Gold. Could be stochasticized?
  - Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking
- But probabilities have wormed their way into most things
- p(...) has to capture our intuitions about the linguistic data
4. An Alternative Tradition
- Old AI hacking technique
  - Possible parses (or whatever) have scores.
  - Pick the one with the best score.
  - How do you define the score?
    - Completely ad hoc!
    - Throw anything you want into the stew
    - Add a bonus for this, a penalty for that, etc.
  - "Learns" over time as you adjust bonuses and penalties by hand to improve performance.
  - Total kludge, but totally flexible too
    - Can throw in any intuitions you might have
5. An Alternative Tradition
- Old AI hacking technique
  - Possible parses (or whatever) have scores.
  - Pick the one with the best score.
  - How do you define the score?
    - Completely ad hoc!
    - Throw anything you want into the stew
    - Add a bonus for this, a penalty for that, etc.
  - "Learns" over time as you adjust bonuses and penalties by hand to improve performance.
  - Total kludge, but totally flexible too
    - Can throw in any intuitions you might have

Exposé at 9: "Probabilistic Revolution Not Really a Revolution, Critics Say" – "Log-probabilities no more than scores in disguise" – "We're just adding stuff up like the old corrupt regime did," admits spokesperson.
6. Nuthin' but adding weights
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of composed FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
- Note: Today we'll use +logprob, not -logprob; i.e., bigger weights are better.
7. Nuthin' but adding weights
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- Can describe any linguistic object as a collection of "features" (here, a tree's features are all of its component rules) (different meaning of "features" from singular/plural/etc.)
- Weight of the object = total weight of its features
- Our weights have always been conditional log-probs (≤ 0)
  - but what if we changed that?
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
8. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that)
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
9. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that)
- n-grams: ... + λ(w7 | w5, w6) + λ(w8 | w6, w7) + ...
- PCFG: λ(NP VP | S) + λ(Papa | NP) + λ(VP PP | VP) + ...
- HMM tagging: ... + λ(t7 | t5, t6) + λ(w7 | t7) + ...
- Noisy channel: [λ(source)] + [λ(data | source)]
- Cascade of FSTs: [λ(A)] + [λ(B | A)] + [λ(C | B)] + ...
- Naïve Bayes: λ(Class) + λ(feature1 | Class) + λ(feature2 | Class) + ...

In practice, λ is a hash table: it maps from a feature name (a string or object) to a feature weight (a float), e.g., λ(S → NP VP) = weight of the S → NP VP rule, say -0.1 or +1.3.
10. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that) = λ(that, this)  [a prettier name]
- n-grams: ... + λ(w5 w6 w7) + λ(w6 w7 w8) + ...
- PCFG: λ(S → NP VP) + λ(NP → Papa) + λ(VP → VP PP) + ...
- HMM tagging: ... + λ(t5 t6 t7) + λ(t7 → w7) + ...
- Noisy channel: [λ(source)] + [λ(source, data)]
- Cascade of FSTs: [λ(A)] + [λ(A, B)] + [λ(B, C)] + ...
- Naïve Bayes: λ(Class) + λ(Class, feature1) + λ(Class, feature2) + ...

In practice, λ is a hash table: it maps from a feature name (a string or object) to a feature weight (a float), e.g., λ(S → NP VP) = weight of the S → NP VP rule, say -0.1 or +1.3.
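For concreteness, here is a minimal Python sketch (with made-up feature names and weights, not anything from the homeworks) of "λ is a hash table" and "weight of the object = total weight of its features":

# lambda as a hash table: feature name -> real-valued weight.
# The names and weights below are invented for illustration.
weights = {
    "S -> NP VP": -0.1,
    "NP -> Papa": 1.3,
    "VP -> VP PP": -0.5,
}

def score(features):
    """Weight of an object = sum of the weights of its features
    (unseen feature names default to weight 0)."""
    return sum(weights.get(f, 0.0) for f in features)

# e.g., a tree described as the collection of rules it uses:
print(score(["S -> NP VP", "NP -> Papa", "VP -> VP PP"]))   # 0.7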
11. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(that, this)
- n-grams: ... + λ(w5 w6 w7) + λ(w6 w7 w8) + ...
  - Best string is the one whose trigrams have the highest total weight
- PCFG: λ(S → NP VP) + λ(NP → Papa) + λ(VP → VP PP) + ...
  - Best parse is the one whose rules have the highest total weight (use CKY/Earley)
- HMM tagging: ... + λ(t5 t6 t7) + λ(t7 → w7) + ...
  - Best tagging has the highest total weight of all transitions and emissions
- Noisy channel: [λ(source)] + [λ(source, data)]
  - To guess source: max (weight of source + weight of source-data match)
- Naïve Bayes: λ(Class) + λ(Class, feature1) + λ(Class, feature2) + ...
  - Best class maximizes prior weight + weight of compatibility with features
12. The general problem
- Given some input x
  - Occasionally empty, e.g., no input needed for a generative n-gram or model of strings (randsent)
- Consider a set of candidate outputs y
  - Classifications for x (small number: often just 2)
  - Taggings of x (exponentially many)
  - Parses of x (exponential, even infinite)
  - Translations of x (exponential, even infinite)
  - ...
- Want to find the best y, given x
13. Finding the best y given x
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x,y)
  - We're talking about linear functions: a sum of feature weights
- Choose the y that maximizes score(x,y)
  - Easy when there are only two candidates y (spam classification, binary WSD, etc.): just try both!
  - Hard for structured prediction: but you now know how!
    - At least for linear scoring functions with certain kinds of features.
    - Generalizing beyond this is an active area!
      - Approximate inference in graphical models, integer linear programming, weighted MAX-SAT, etc.; see 600.325/425 Declarative Methods
14.
- Given sentence x
- You know how to find the max-score parse y (or min-cost parse) – see the sketch below
  - Provided that the score of a parse = a sum over its individual rules
  - Each rule's score can add up several features of that rule
  - But a feature can't look at 2 rules at once (how to solve?)

Weighted grammar fragment:
1 S → NP VP
6 S → Vst NP
2 S → S PP
1 VP → V NP
2 VP → VP PP
1 NP → Det N
2 NP → NP PP
3 NP → NP NP
0 PP → P NP
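A minimal Python sketch of the dynamic program this slide alludes to. Because a parse's score is just the sum of its rules' weights, CKY can build the best parse bottom-up. The binary-rule costs are taken from the fragment above (treated as costs, so lower total = better; flip min to max for the max-score convention); the lexical entries are hypothetical, invented only so the example runs.

import math

binary = {                       # cost of rule  parent -> left right
    ("S", "NP", "VP"): 1,
    ("VP", "V", "NP"): 1,
    ("NP", "Det", "N"): 1,
    ("NP", "NP", "PP"): 2,
    ("PP", "P", "NP"): 0,
}
lexicon = {                      # hypothetical preterminal -> word costs
    ("NP", "Papa"): 1, ("V", "ate"): 1, ("Det", "the"): 1,
    ("N", "caviar"): 1, ("P", "with"): 1, ("N", "spoon"): 1,
}

def cky(words):
    n = len(words)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # best[i][j][X]
    for i, w in enumerate(words):                              # width-1 spans
        for (X, word), c in lexicon.items():
            if word == w and c < best[i][i + 1].get(X, math.inf):
                best[i][i + 1][X] = c
    for width in range(2, n + 1):                              # wider spans
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (X, Y, Z), c in binary.items():
                    if Y in best[i][k] and Z in best[k][j]:
                        total = c + best[i][k][Y] + best[k][j][Z]
                        if total < best[i][j].get(X, math.inf):
                            best[i][j][X] = total
    return best[0][n].get("S")

print(cky("Papa ate the caviar".split()))   # total cost of the cheapest S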
15.
- Given upper string x
- You know how to find the lower string y such that score(x,y) is highest
  - Provided that score(x,y) is a sum of arc scores along the best path that transduces x to y
  - Each arc's score can add up several features of that arc
  - But a feature can't look at 2 arcs at once (how to solve?)
16. Linear model notation
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x,y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose the y that maximizes score(x,y)
17. Linear model notation
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x,y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose the y that maximizes score(x,y)
18. Probabilists Rally Behind Paradigm
83% of ...
- ".2, .4, .6, .8! We're not gonna take your bait!"
- Can estimate our parameters automatically
  - e.g., log p(t7 | t5, t6)  (trigram tag probability)
  - from supervised or unsupervised data
- Our results are more meaningful
  - Can use probabilities to place bets, quantify risk
  - e.g., how sure are we that this is the correct parse?
- Our results can be meaningfully combined ⇒ modularity!
  - Multiply independent conditional probs – normalized, unlike scores
  - p(English text) * p(English phonemes | English text) * p(Jap. phonemes | English phonemes) * p(Jap. text | Jap. phonemes)
  - p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology)
19. Probabilists Regret Being Bound by Principle
- Problem with our course's principled approach:
  - All we've had is the chain rule + backoff.
  - But this forced us to make some tough either-or decisions.
    - p(t7 | t5, t6): do we want to back off to t6 or to t5?
    - p(S → NP VP | S) with features: do we want to back off from number or gender features first?
    - p(spam | message text): which words of the message do we back off from??

p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, ...)
20. News Flash! Hope arrives
- So far: chain rule + backoff
  - = directed graphical model
  - = Bayesian network or Bayes net
  - = locally normalized model
- We do have a good trick to help with this:
  - Conditional log-linear model – look back at the smoothing lecture
  - Solves the problems on the previous slide!
  - Computationally a bit harder to train
    - Have to compute Z(x) for each condition x
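A minimal sketch, with hypothetical features and weights, of what "compute Z(x) for each condition x" means in code: every input x gets its own normalizing sum over the candidate outputs.

import math

weights = {"bias_spam": 0.5, "contains_buy_spam": 1.3}   # hypothetical weights

def features(x, y):
    """Hypothetical features of an (input, output) pair; each fires 0 or 1 times."""
    f = []
    if y == "spam":
        f.append("bias_spam")
        if "buy" in x.lower():
            f.append("contains_buy_spam")
    return f

def score(x, y):
    return sum(weights.get(name, 0.0) for name in features(x, y))

def p(x, y, outputs=("spam", "ling")):
    Z = sum(math.exp(score(x, yp)) for yp in outputs)    # Z(x): one sum per x
    return math.exp(score(x, y)) / Z

print(p("Buy now!", "spam"))               # higher than for a message without "buy"
print(p("colorless green ideas", "spam"))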
21. Gradient-based training
General function maximization algorithms include gradient ascent, L-BFGS, simulated annealing, ...
- Gradually try to adjust λ in a direction that will improve the function we're trying to maximize
- So compute that function's partial derivatives with respect to the feature weights in λ: the gradient.
- Here's how the key part works out:
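For a conditional log-linear model p_λ(y | x) = exp(Σ_i λ_i f_i(x,y)) / Z(x), the key partial derivative has the standard observed-minus-expected form:

\frac{\partial}{\partial \lambda_i} \log p_\lambda(y \mid x) \;=\; f_i(x, y) \;-\; \sum_{y'} p_\lambda(y' \mid x)\, f_i(x, y')

That is, how much feature i fires on the true output, minus how much the current model expects it to fire. Summed over training examples, the gradient is zero exactly when the model's expected feature counts match the observed counts.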
22. Why Bother?
- Gives us probs, not just scores.
  - Can use 'em to bet, or combine with other probs.
- We can now learn the weights from data!
23. News Flash! More hope
- So far: chain rule + backoff
  - = directed graphical model
  - = Bayesian network or Bayes net
  - = locally normalized model
- Also consider: Markov Random Field
  - = undirected graphical model
  - = log-linear model (globally normalized)
  - = exponential model
  - = maximum entropy model
  - = Gibbs distribution
24. Maximum Entropy
- Suppose there are 10 classes, A through J.
  - I don't give you any other information.
  - Question: Given message m, what is your guess for p(C | m)?
- Suppose I tell you that 55% of all messages are in class A.
  - Question: Now what is your guess for p(C | m)?
- Suppose I also tell you that 10% of all messages contain "Buy" and 80% of these are in class A or C.
  - Question: Now what is your guess for p(C | m), if m contains "Buy"?
  - OUCH!
25. Maximum Entropy
- Column A sums to 0.55 ("55% of all messages are in class A")
26. Maximum Entropy
- Column A sums to 0.55
- Row "Buy" sums to 0.1 ("10% of all messages contain Buy")
27. Maximum Entropy
- Column A sums to 0.55
- Row "Buy" sums to 0.1
- (Buy, A) and (Buy, C) cells sum to 0.08 (80% of the 10%)
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy (related to cross-entropy, perplexity)
  - Entropy = -.051 log .051 - .0025 log .0025 - .029 log .029 - ...
  - Largest if probabilities are evenly distributed
28. Maximum Entropy
- Column A sums to 0.55
- Row "Buy" sums to 0.1
- (Buy, A) and (Buy, C) cells sum to 0.08 (80% of the 10%)
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy
- Now p(Buy, C) = .029, so p(C | Buy) = .029/.1 = .29
- We got a compromise: p(C | Buy) < p(A | Buy) < .55
29. Generalizing to More Features
(table extending the previous example with additional feature rows, e.g., "<100" and "Other")
30. What we just did
- For each feature ("contains Buy"), see what fraction of training data has it
- Many distributions p(c,m) would predict these fractions (including the unsmoothed one where all mass goes to feature combos we've actually seen)
- Of these, pick the distribution that has max entropy
- Amazing Theorem: this distribution has the form p(m,c) = (1/Z(λ)) exp Σi λi fi(m,c)
  - So it is log-linear. In fact it is the same log-linear distribution that maximizes Πj p(mj, cj) as before!
- Gives another motivation for our log-linear approach.
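A small numeric sketch of this connection (the constraint values come from the table above; the training loop itself is not from the slides): fit the three weights λ by plain gradient ascent until the log-linear distribution's expected feature values match the three constraints. At that point the distribution is the maximum-entropy one of the stated form.

import math

# Joint distribution over 20 cells: rows {Buy, Other} x classes A..J.
classes = "ABCDEFGHIJ"
cells = [(r, c) for r in ("Buy", "Other") for c in classes]

def feats(r, c):
    """The three constraint features from the table, as 0/1 indicators."""
    return (1.0 if c == "A" else 0.0,                  # message is in class A
            1.0 if r == "Buy" else 0.0,                # message contains "Buy"
            1.0 if r == "Buy" and c in "AC" else 0.0)  # contains "Buy", class A or C

targets = (0.55, 0.10, 0.08)   # required expectations of those features
lam = [0.0, 0.0, 0.0]          # one weight per feature

for step in range(10000):      # gradient ascent: gradient = target - expected
    scores = [sum(l * f for l, f in zip(lam, feats(r, c))) for r, c in cells]
    Z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / Z for s in scores]
    expected = [sum(p * feats(r, c)[k] for p, (r, c) in zip(probs, cells))
                for k in range(3)]
    for k in range(3):
        lam[k] += 0.5 * (targets[k] - expected[k])

print([round(e, 3) for e in expected])   # -> [0.55, 0.1, 0.08]: constraints met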
31. Overfitting
- If we have too many features, we can choose weights to model the training data perfectly.
- If we have a feature that only appears in spam training, not ling training, it will get weight +∞ to maximize p(spam | feature) at 1.
- These behaviors overfit the training data.
- Will probably do poorly on test data.
32. Solutions to Overfitting
- Throw out rare features.
  - Require every feature to occur > 4 times, and to occur at least once with each output class.
- Only keep 1000 features.
  - Add one at a time, always greedily picking the one that most improves performance on held-out data.
- Smooth the observed feature counts.
- Smooth the weights by using a prior.
  - max p(λ | data) = max p(λ, data) = max p(λ) p(data | λ)
  - decree p(λ) to be high when most weights are close to 0
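With a Gaussian prior on the weights (one standard way to "decree p(λ) high when most weights are close to 0"; the variance σ² is a knob you choose), the training objective and its gradient just pick up an extra term pulling each weight toward 0:

\max_\lambda \; \sum_j \log p_\lambda(y_j \mid x_j) \;-\; \frac{1}{2\sigma^2} \sum_i \lambda_i^2,
\qquad
\frac{\partial}{\partial \lambda_i} \;=\; \big(\text{observed}_i - \text{expected}_i\big) \;-\; \frac{\lambda_i}{\sigma^2}

Smaller σ² means a stronger pull toward 0, i.e., more smoothing.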