Title: Back to Conditional Log-Linear Modeling
1. Back to Conditional Log-Linear Modeling
2. Probability is Useful
(summary of half of the course: statistics)
- We love probability distributions!
- We've learned how to define & use p(...) functions.
- Pick best output text T from a set of candidates
  - speech recognition (HW2); machine translation; OCR; spell correction ...
  - maximize p1(T) for some appropriate distribution p1
- Pick best annotation T for a fixed input I
  - text categorization, parsing, part-of-speech tagging
  - maximize p(T | I); equivalently maximize joint probability p(I,T)
  - often define p(I,T) by noisy channel: p(I,T) = p(T) * p(I | T)
  - speech recognition & other tasks above are cases of this too:
    we're maximizing an appropriate p1(T) defined by p(T | I)
- Pick best probability distribution (a meta-problem!)
  - really, pick best parameters θ: train HMM, PCFG, n-grams, clusters ...
  - maximum likelihood; smoothing; EM if unsupervised (incomplete data)
  - Bayesian smoothing: max p(θ | data) = max p(θ, data) = max p(θ) p(data | θ)
3. Probability is Flexible
(summary of the other half of the course: linguistics)
- We love probability distributions!
- We've learned how to define & use p(...) functions.
- We want p(...) to define probability of linguistic objects
  - Trees of (non)terminals (PCFGs: CKY, Earley, pruning, inside-outside)
  - Sequences of words, tags, morphemes, phonemes (n-grams, FSAs, FSTs: regex compilation, best-paths, forward-backward, collocations)
  - Vectors (decision lists, Gaussians, naïve Bayes; Yarowsky, clustering/k-NN)
- We've also seen some not-so-probabilistic stuff
  - Syntactic features, semantics, morphology, Gold. Could be stochasticized?
  - Methods can be quantitative & data-driven but not fully probabilistic: transformation-based learning, bottom-up clustering, LSA, competitive linking
- But probabilities have wormed their way into most things
- p(...) has to capture our intuitions about the linguistic data
4. An Alternative Tradition
- Old AI hacking technique
  - Possible parses (or whatever) have scores.
  - Pick the one with the best score.
  - How do you define the score?
    - Completely ad hoc!
    - Throw anything you want into the stew
    - Add a bonus for this, a penalty for that, etc.
  - "Learns" over time as you adjust bonuses and penalties by hand to improve performance.
  - Total kludge, but totally flexible too
    - Can throw in any intuitions you might have
5. An Alternative Tradition
- Old AI hacking technique
  - Possible parses (or whatever) have scores.
  - Pick the one with the best score.
  - How do you define the score?
    - Completely ad hoc!
    - Throw anything you want into the stew
    - Add a bonus for this, a penalty for that, etc.
  - "Learns" over time as you adjust bonuses and penalties by hand to improve performance.
  - Total kludge, but totally flexible too
    - Can throw in any intuitions you might have

Exposé at 9: "Probabilistic Revolution Not Really a Revolution, Critics Say" – "Log-probabilities no more than scores in disguise" – "We're just adding stuff up like the old corrupt regime did," admits spokesperson.
6. Nuthin' but adding weights
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of composed FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
- Note: Today we'll use +logprob, not -logprob; i.e., bigger weights are better.
7. Nuthin' but adding weights
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- Can describe any linguistic object as a collection of "features" (here, a tree's features are all of its component rules) (different meaning of "features" from singular/plural/etc.)
- Weight of the object = total weight of its features
- Our weights have always been conditional log-probs (≤ 0)
  - but what if we changed that?
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
8. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that)
- n-grams: ... + log p(w7 | w5, w6) + log p(w8 | w6, w7) + ...
- PCFG: log p(NP VP | S) + log p(Papa | NP) + log p(VP PP | VP) + ...
- HMM tagging: ... + log p(t7 | t5, t6) + log p(w7 | t7) + ...
- Noisy channel: [log p(source)] + [log p(data | source)]
- Cascade of FSTs: [log p(A)] + [log p(B | A)] + [log p(C | B)] + ...
- Naïve Bayes: log p(Class) + log p(feature1 | Class) + log p(feature2 | Class) + ...
9. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that)
- n-grams: ... + λ(w7 | w5, w6) + λ(w8 | w6, w7) + ...
- PCFG: λ(NP VP | S) + λ(Papa | NP) + λ(VP PP | VP) + ...
- HMM tagging: ... + λ(t7 | t5, t6) + λ(w7 | t7) + ...
- Noisy channel: [λ(source)] + [λ(data | source)]
- Cascade of FSTs: [λ(A)] + [λ(B | A)] + [λ(C | B)] + ...
- Naïve Bayes: λ(Class) + λ(feature1 | Class) + λ(feature2 | Class) + ...

In practice, λ is a hash table: it maps from a feature name (a string or object) to a feature weight (a float), e.g., λ(S → NP VP) = weight of the S → NP VP rule, say -0.1 or +1.3.
10. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(this | that) = λ(that, this)  [a prettier name]
- n-grams: ... + λ(w5 w6 w7) + λ(w6 w7 w8) + ...
- PCFG: λ(S → NP VP) + λ(NP → Papa) + λ(VP → VP PP) + ...
- HMM tagging: ... + λ(t5 t6 t7) + λ(t7 → w7) + ...
- Noisy channel: [λ(source)] + [λ(source, data)]
- Cascade of FSTs: [λ(A)] + [λ(A, B)] + [λ(B, C)] + ...
- Naïve Bayes: λ(Class) + λ(Class, feature1) + λ(Class, feature2) + ...

In practice, λ is a hash table: it maps from a feature name (a string or object) to a feature weight (a float), e.g., λ(S → NP VP) = weight of the S → NP VP rule, say -0.1 or +1.3.
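For concreteness, here is a minimal Python sketch (with made-up feature names and weights, not anything from the homeworks) of "λ is a hash table" and "weight of the object = total weight of its features":

# lambda as a hash table: feature name -> real-valued weight.
# The names and weights below are invented for illustration.
weights = {
    "S -> NP VP": -0.1,
    "NP -> Papa": 1.3,
    "VP -> VP PP": -0.5,
}

def score(features):
    """Weight of an object = sum of the weights of its features
    (unseen feature names default to weight 0)."""
    return sum(weights.get(f, 0.0) for f in features)

# e.g., a tree described as the collection of rules it uses:
print(score(["S -> NP VP", "NP -> Papa", "VP -> VP PP"]))   # 0.7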
11. What if our weights were arbitrary real numbers?
Change log p(this | that) to λ(that, this)
- n-grams: ... + λ(w5 w6 w7) + λ(w6 w7 w8) + ...
  - Best string is the one whose trigrams have the highest total weight
- PCFG: λ(S → NP VP) + λ(NP → Papa) + λ(VP → VP PP) + ...
  - Best parse is the one whose rules have the highest total weight (use CKY/Earley)
- HMM tagging: ... + λ(t5 t6 t7) + λ(t7 → w7) + ...
  - Best tagging has the highest total weight of all transitions and emissions
- Noisy channel: [λ(source)] + [λ(source, data)]
  - To guess source: max (weight of source + weight of source-data match)
- Naïve Bayes: λ(Class) + λ(Class, feature1) + λ(Class, feature2) + ...
  - Best class maximizes prior weight + weight of compatibility with features
12. The general problem
- Given some input x
  - Occasionally empty, e.g., no input needed for a generative n-gram or model of strings (randsent)
- Consider a set of candidate outputs y
  - Classifications for x (small number: often just 2)
  - Taggings of x (exponentially many)
  - Parses of x (exponential, even infinite)
  - Translations of x (exponential, even infinite)
  - ...
- Want to find the best y, given x
13. Finding the best y given x
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x,y)
  - We're talking about linear functions: a sum of feature weights
- Choose the y that maximizes score(x,y)
  - Easy when there are only two candidates y (spam classification, binary WSD, etc.): just try both!
  - Hard for structured prediction: but you now know how!
    - At least for linear scoring functions with certain kinds of features.
    - Generalizing beyond this is an active area!
      - Approximate inference in graphical models, integer linear programming, weighted MAX-SAT, etc.; see 600.325/425 Declarative Methods
14.
- Given sentence x
- You know how to find the max-score parse y (or min-cost parse) – see the sketch below
  - Provided that the score of a parse = a sum over its individual rules
  - Each rule's score can add up several features of that rule
  - But a feature can't look at 2 rules at once (how to solve?)

Weighted grammar fragment:
1 S → NP VP
6 S → Vst NP
2 S → S PP
1 VP → V NP
2 VP → VP PP
1 NP → Det N
2 NP → NP PP
3 NP → NP NP
0 PP → P NP
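A minimal Python sketch of the dynamic program this slide alludes to. Because a parse's score is just the sum of its rules' weights, CKY can build the best parse bottom-up. The binary-rule costs are taken from the fragment above (treated as costs, so lower total = better; flip min to max for the max-score convention); the lexical entries are hypothetical, invented only so the example runs.

import math

binary = {                       # cost of rule  parent -> left right
    ("S", "NP", "VP"): 1,
    ("VP", "V", "NP"): 1,
    ("NP", "Det", "N"): 1,
    ("NP", "NP", "PP"): 2,
    ("PP", "P", "NP"): 0,
}
lexicon = {                      # hypothetical preterminal -> word costs
    ("NP", "Papa"): 1, ("V", "ate"): 1, ("Det", "the"): 1,
    ("N", "caviar"): 1, ("P", "with"): 1, ("N", "spoon"): 1,
}

def cky(words):
    n = len(words)
    best = [[{} for _ in range(n + 1)] for _ in range(n + 1)]  # best[i][j][X]
    for i, w in enumerate(words):                              # width-1 spans
        for (X, word), c in lexicon.items():
            if word == w and c < best[i][i + 1].get(X, math.inf):
                best[i][i + 1][X] = c
    for width in range(2, n + 1):                              # wider spans
        for i in range(n - width + 1):
            j = i + width
            for k in range(i + 1, j):
                for (X, Y, Z), c in binary.items():
                    if Y in best[i][k] and Z in best[k][j]:
                        total = c + best[i][k][Y] + best[k][j][Z]
                        if total < best[i][j].get(X, math.inf):
                            best[i][j][X] = total
    return best[0][n].get("S")

print(cky("Papa ate the caviar".split()))   # total cost of the cheapest S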
15.
- Given upper string x
- You know how to find the lower string y such that score(x,y) is highest
  - Provided that score(x,y) is a sum of arc scores along the best path that transduces x to y
  - Each arc's score can add up several features of that arc
  - But a feature can't look at 2 arcs at once (how to solve?)
16. Linear model notation
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x,y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose the y that maximizes score(x,y)
17. Linear model notation
- Given some input x
- Consider a set of candidate outputs y
- Define a scoring function score(x,y)
  - Linear function: a sum of feature weights (you pick the features!)
- Choose the y that maximizes score(x,y)
18. Probabilists Rally Behind Paradigm
83% of ...
- ".2, .4, .6, .8! We're not gonna take your bait!"
- Can estimate our parameters automatically
  - e.g., log p(t7 | t5, t6)  (trigram tag probability)
  - from supervised or unsupervised data
- Our results are more meaningful
  - Can use probabilities to place bets, quantify risk
  - e.g., how sure are we that this is the correct parse?
- Our results can be meaningfully combined ⇒ modularity!
  - Multiply independent conditional probs – normalized, unlike scores
  - p(English text) * p(English phonemes | English text) * p(Jap. phonemes | English phonemes) * p(Jap. text | Jap. phonemes)
  - p(semantics) * p(syntax | semantics) * p(morphology | syntax) * p(phonology | morphology) * p(sounds | phonology)
19. Probabilists Regret Being Bound by Principle
- Problem with our course's principled approach:
  - All we've had is the chain rule + backoff.
  - But this forced us to make some tough either-or decisions.
    - p(t7 | t5, t6): do we want to back off to t6 or to t5?
    - p(S → NP VP | S) with features: do we want to back off from number or gender features first?
    - p(spam | message text): which words of the message do we back off from??

p(Paul Revere wins | weather's clear, ground is dry, jockey getting over sprain, Epitaph also in race, Epitaph was recently bought by Gonzalez, race is on May 17, ...)
20. News Flash! Hope arrives
- So far: chain rule + backoff
  - = directed graphical model
  - = Bayesian network or Bayes net
  - = locally normalized model
- We do have a good trick to help with this:
  - Conditional log-linear model – look back at the smoothing lecture
  - Solves the problems on the previous slide!
  - Computationally a bit harder to train
    - Have to compute Z(x) for each condition x
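A minimal sketch, with hypothetical features and weights, of what "compute Z(x) for each condition x" means in code: every input x gets its own normalizing sum over the candidate outputs.

import math

weights = {"bias_spam": 0.5, "contains_buy_spam": 1.3}   # hypothetical weights

def features(x, y):
    """Hypothetical features of an (input, output) pair; each fires 0 or 1 times."""
    f = []
    if y == "spam":
        f.append("bias_spam")
        if "buy" in x.lower():
            f.append("contains_buy_spam")
    return f

def score(x, y):
    return sum(weights.get(name, 0.0) for name in features(x, y))

def p(x, y, outputs=("spam", "ling")):
    Z = sum(math.exp(score(x, yp)) for yp in outputs)    # Z(x): one sum per x
    return math.exp(score(x, y)) / Z

print(p("Buy now!", "spam"))               # higher than for a message without "buy"
print(p("colorless green ideas", "spam"))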
21. Gradient-based training
General function maximization algorithms include gradient ascent, L-BFGS, simulated annealing, ...
- Gradually try to adjust λ in a direction that will improve the function we're trying to maximize
- So compute that function's partial derivatives with respect to the feature weights in λ: the gradient.
- Here's how the key part works out:
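For a conditional log-linear model p_λ(y | x) = exp(Σ_i λ_i f_i(x,y)) / Z(x), the key partial derivative has the standard observed-minus-expected form:

\frac{\partial}{\partial \lambda_i} \log p_\lambda(y \mid x) \;=\; f_i(x, y) \;-\; \sum_{y'} p_\lambda(y' \mid x)\, f_i(x, y')

That is, how much feature i fires on the true output, minus how much the current model expects it to fire. Summed over training examples, the gradient is zero exactly when the model's expected feature counts match the observed counts.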
22. Why Bother?
- Gives us probs, not just scores.
  - Can use 'em to bet, or combine with other probs.
- We can now learn the weights from data!
23. News Flash! More hope
- So far: chain rule + backoff
  - = directed graphical model
  - = Bayesian network or Bayes net
  - = locally normalized model
- Also consider: Markov Random Field
  - = undirected graphical model
  - = log-linear model (globally normalized)
  - = exponential model
  - = maximum entropy model
  - = Gibbs distribution
24. Maximum Entropy
- Suppose there are 10 classes, A through J.
  - I don't give you any other information.
  - Question: Given message m, what is your guess for p(C | m)?
- Suppose I tell you that 55% of all messages are in class A.
  - Question: Now what is your guess for p(C | m)?
- Suppose I also tell you that 10% of all messages contain "Buy" and 80% of these are in class A or C.
  - Question: Now what is your guess for p(C | m), if m contains "Buy"?
  - OUCH!
25. Maximum Entropy
- Column A sums to 0.55 ("55% of all messages are in class A")
26. Maximum Entropy
- Column A sums to 0.55
- Row "Buy" sums to 0.1 ("10% of all messages contain Buy")
27. Maximum Entropy
- Column A sums to 0.55
- Row "Buy" sums to 0.1
- (Buy, A) and (Buy, C) cells sum to 0.08 (80% of the 10%)
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy (related to cross-entropy, perplexity)
  - Entropy = -.051 log .051 - .0025 log .0025 - .029 log .029 - ...
  - Largest if probabilities are evenly distributed
28. Maximum Entropy
- Column A sums to 0.55
- Row "Buy" sums to 0.1
- (Buy, A) and (Buy, C) cells sum to 0.08 (80% of the 10%)
- Given these constraints, fill in the cells "as equally as possible": maximize the entropy
- Now p(Buy, C) = .029, so p(C | Buy) = .029/.1 = .29
- We got a compromise: p(C | Buy) < p(A | Buy) < .55
29. Generalizing to More Features
(table extending the previous example with additional feature rows, e.g., "<100" and "Other")
30. What we just did
- For each feature ("contains Buy"), see what fraction of training data has it
- Many distributions p(c,m) would predict these fractions (including the unsmoothed one where all mass goes to feature combos we've actually seen)
- Of these, pick the distribution that has max entropy
- Amazing Theorem: this distribution has the form p(m,c) = (1/Z(λ)) exp Σi λi fi(m,c)
  - So it is log-linear. In fact it is the same log-linear distribution that maximizes Πj p(mj, cj) as before!
- Gives another motivation for our log-linear approach.
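A small numeric sketch of this connection (the constraint values come from the table above; the training loop itself is not from the slides): fit the three weights λ by plain gradient ascent until the log-linear distribution's expected feature values match the three constraints. At that point the distribution is the maximum-entropy one of the stated form.

import math

# Joint distribution over 20 cells: rows {Buy, Other} x classes A..J.
classes = "ABCDEFGHIJ"
cells = [(r, c) for r in ("Buy", "Other") for c in classes]

def feats(r, c):
    """The three constraint features from the table, as 0/1 indicators."""
    return (1.0 if c == "A" else 0.0,                  # message is in class A
            1.0 if r == "Buy" else 0.0,                # message contains "Buy"
            1.0 if r == "Buy" and c in "AC" else 0.0)  # contains "Buy", class A or C

targets = (0.55, 0.10, 0.08)   # required expectations of those features
lam = [0.0, 0.0, 0.0]          # one weight per feature

for step in range(10000):      # gradient ascent: gradient = target - expected
    scores = [sum(l * f for l, f in zip(lam, feats(r, c))) for r, c in cells]
    Z = sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / Z for s in scores]
    expected = [sum(p * feats(r, c)[k] for p, (r, c) in zip(probs, cells))
                for k in range(3)]
    for k in range(3):
        lam[k] += 0.5 * (targets[k] - expected[k])

print([round(e, 3) for e in expected])   # -> [0.55, 0.1, 0.08]: constraints met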
31. Overfitting
- If we have too many features, we can choose weights to model the training data perfectly.
- If we have a feature that only appears in spam training, not ling training, it will get weight +∞ to maximize p(spam | feature) at 1.
- These behaviors overfit the training data.
- Will probably do poorly on test data.
32. Solutions to Overfitting
- Throw out rare features.
  - Require every feature to occur > 4 times, and to occur at least once with each output class.
- Only keep 1000 features.
  - Add one at a time, always greedily picking the one that most improves performance on held-out data.
- Smooth the observed feature counts.
- Smooth the weights by using a prior.
  - max p(λ | data) = max p(λ, data) = max p(λ) p(data | λ)
  - decree p(λ) to be high when most weights are close to 0
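With a Gaussian prior on the weights (one standard way to "decree p(λ) high when most weights are close to 0"; the variance σ² is a knob you choose), the training objective and its gradient just pick up an extra term pulling each weight toward 0:

\max_\lambda \; \sum_j \log p_\lambda(y_j \mid x_j) \;-\; \frac{1}{2\sigma^2} \sum_i \lambda_i^2,
\qquad
\frac{\partial}{\partial \lambda_i} \;=\; \big(\text{observed}_i - \text{expected}_i\big) \;-\; \frac{\lambda_i}{\sigma^2}

Smaller σ² means a stronger pull toward 0, i.e., more smoothing.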