Title: Word clustering: Smaller models, Faster training
1. Word clustering: Smaller models, faster training
- Joshua Goodman
- Machine Learning and Applied Statistics
- Microsoft Research
2. Quick Overview
- Microsoft Research overview (5 minutes)
- What I've done (55 minutes)
- Word clusters
- How word clusters make language models
- Smaller
- Faster
3. Microsoft Research
- 10th Anniversary Last Week
- Very roughly 500 researchers
- I don't know what 430 of them are doing
- Speech Group
- I used to be there
- Machine Learning and Applied Statistics
- My new home
- Natural Language Processing
4. Speech Recognition Research
- Ciprian Chelba, Milind Mahajan
- Language Model Adaptation
- Kuansan Wang
- Dialog systems
- Asela Gunawardana
- Acoustic Model Adaptation
- Li Deng, Alex Acero, Jasha Droppo
- Noise Robustness (great results in the Aurora competition)
- Yeyi Wang
- Understanding
5. Natural Language Processing
- Mostly non-statistical
- Main project: Machine Translation
- Parse in source language
- Translate deep structure
- Generate in target language
- Good initial results (beating Systran in target
domain)
6. Statistical People in NLP
- Bob Moore
- Automatically building translation lexicons
- Eric Ringger
- Using machine learning for generation
- Michele Banko
- Working with Bob, Eric Brill, and me
7. Machine Learning and Applied Statistics
- Lots of non-language stuff
- I won't talk about it
- Bayes networks, Bayesian approaches
- Lots of language stuff too
8. Language Things in MLAS
- Hagai Attias
- Noise robustness (with speech group)
- Eric Brill and Michele Banko
- Question answering using the web as a resource
- Joshua Goodman, Eric Brill and Michele Banko
- Machine learning for Grammar Checking
9. Machine Learning for Grammar Checking
- Example: "to" or "too"
- Take 50 million words of Wall Street Journal data
- Train classifier using nearby words
- Take real data, find places where we predict "too" but see "to"
- Mark as errors
- Strunk and White
10. Overview: Word clusters solve problems -- smaller, faster
- Background: What are word clusters?
- Word clusters for smaller models
- Use a clustering technique that leads to larger models, then prune
- Up to 3 times smaller at the same perplexity
- Word clusters for faster training of maximum entropy models
- Train two models, each of which predicts "half as much"
- Up to 35 times faster training
11-14. A bad language model
15. What's a Language Model?
- For our purposes today, a language model gives the probability of a word given its context
- P(truth | and nothing but the) ≈ 0.2
- P(roof | and nuts sing on the) ≈ 0.00000001
- Useful for speech recognition, handwriting, OCR, etc.
16. The Trigram Approximation
- Assume each word depends only on the previous two words
- P(the | whole truth and nothing but) ≈
- P(the | nothing but)
17. Trigrams, continued
- Find probabilities by counting in real text:
- P(the | nothing but) ≈ C(nothing but the) / C(nothing but)
- Smoothing: need to combine the trigram P(the | nothing but) with the bigram P(the | nothing) and the unigram P(the); otherwise, too many things you've never seen (see the interpolation sketch below)
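A minimal sketch of the counting-and-smoothing idea above, in Python. The toy corpus, the helper names, and the fixed interpolation weights are illustrative assumptions; the talk does not specify its exact smoothing scheme.

```python
from collections import Counter

def train_counts(words):
    """Collect unigram, bigram, and trigram counts from a list of tokens."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def p_interp(w, u, v, uni, bi, tri, total, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) as a linear interpolation of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

words = "nothing but the truth and nothing but the whole truth".split()
uni, bi, tri = train_counts(words)
print(p_interp("the", "nothing", "but", uni, bi, tri, total=len(words)))
```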
18. Perplexity
- Perplexity: the standard measure of language model accuracy; lower is better
- Corresponds to the average branching factor of the model (see the sketch below)
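A small sketch of how perplexity is computed from a model's per-word probabilities; the numbers are illustrative.

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability per word; lower is better."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# A model that assigns probability 1/100 to every test word has perplexity 100:
# an "average branching factor" of 100 equally likely choices.
print(perplexity([0.01] * 50))  # ~100.0
```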
19. Trigram Problems
- Models are potentially huge: similar in size to the training data
- Largest part of commercial recognizers
- Sophisticated variations can be slow to learn
- Maximum entropy could take weeks, months, or
years!
20. What are word clusters?
- CLUSTERING = CLASSES (same thing)
- What is P(Tuesday | party on)?
- Similar to P(Monday | party on)
- Similar to P(Tuesday | celebration on)
- Put words in clusters
- WEEKDAY = Sunday, Monday, Tuesday, ...
- EVENT = party, celebration, birthday, ...
21. Putting words into clusters
- One cluster per word: hard clustering
- WEEKDAY = Sunday, Monday, Tuesday, ...
- MONTH = January, February, April, May, June, ...
- Soft clustering (each word belongs to more than one cluster) is possible, but complicates things: you get fractional counts.
22. Clustering: how to get them
- Build them by hand
- Works OK when there is almost no data
- Part of Speech (POS) tags
- Tend not to work as well as automatic clusters
- Automatic Clustering
- Swap words between clusters to minimize perplexity
23. Clustering: automatic
- Minimize perplexity of P(z|Y)
- Put words into clusters randomly
- Swap words between clusters whenever the overall perplexity of P(z|Y) goes down
- Doing this naively is very slow, but mathematical tricks speed it up (a naive sketch follows below)
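A deliberately naive sketch of the swap ("exchange") idea, assuming a class-bigram likelihood objective closely related to the P(z|Y) criterion above. The real implementation uses incremental count updates and top-down splitting rather than recomputing the likelihood from scratch as done here.

```python
import math, random
from collections import Counter

def class_bigram_ll(words, cluster_of):
    """Log-likelihood under P(w_i | w_{i-1}) ~ P(C_i | C_{i-1}) * P(w_i | C_i)."""
    cluster_count = Counter(cluster_of[w] for w in words)
    pair_count = Counter((cluster_of[a], cluster_of[b]) for a, b in zip(words, words[1:]))
    word_count = Counter(words)
    ll = 0.0
    for a, b in zip(words, words[1:]):
        ca, cb = cluster_of[a], cluster_of[b]
        p = (pair_count[(ca, cb)] / cluster_count[ca]) * (word_count[b] / cluster_count[cb])
        ll += math.log(p)
    return ll

def exchange(words, n_clusters=2, sweeps=3, seed=0):
    """Randomly assign clusters, then greedily move each word to its best cluster."""
    rng = random.Random(seed)
    vocab = sorted(set(words))
    cluster_of = {w: rng.randrange(n_clusters) for w in vocab}
    for _ in range(sweeps):
        for w in vocab:
            cluster_of[w] = max(range(n_clusters),
                                key=lambda c: class_bigram_ll(words, {**cluster_of, w: c}))
    return cluster_of

print(exchange("the cat sat on the mat the dog sat on the rug".split()))
```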
24. Clustering: fast
- Use top-down splitting: at each level, consider swapping each word between two clusters
- Not bottom-up merging! (That considers all pairs of clusters!)
25. Clustering example
- Imagine the following counts:
- C(Tuesday party on) = 0
- C(Wednesday celebration before) = 100
- C(Tuesday WEEKDAY) = 1000
- Then:
- P(Tuesday | party on) ≈ 0
- P(WEEKDAY | EVENT PREPOSITION) ≈ large
- P(Tuesday | WEEKDAY) ≈ large
- P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY) ≈ large
26. Two actual WSJ clusters
- Cluster 1: MONDAYS, FRIDAYS, THURSDAY, MONDAY, EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY, TENTERHOOKS, TUESDAY, SUNDAY
- Cluster 2: CONDITION, PARTY, FESCO, CULT, NILSON, PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN, DEPARTMENT, PENH, GUILD
27. How to use clusters
- Let x, y, z be words, and X, Y, Z be the clusters of those words.
- P(z|xy) ≈ P(Z|XY) × P(z|Z)
- P(Tuesday | party on) ≈ P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
- Much smoother, smaller model than the normal P(z|xy), but higher perplexity (toy sketch below)
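A toy sketch of the fully clustered approximation; the probability tables are made-up numbers, not trained values.

```python
def p_clustered(z, x, y, cluster_of, p_cluster, p_word_in_cluster):
    """P(z | x y) ~= P(Z | X Y) * P(z | Z) for hard clusters."""
    X, Y, Z = cluster_of[x], cluster_of[y], cluster_of[z]
    return p_cluster[(Z, X, Y)] * p_word_in_cluster[(z, Z)]

cluster_of = {"party": "EVENT", "on": "PREPOSITION", "Tuesday": "WEEKDAY"}
p_cluster = {("WEEKDAY", "EVENT", "PREPOSITION"): 0.2}   # P(Z | X Y)
p_word_in_cluster = {("Tuesday", "WEEKDAY"): 0.15}       # P(z | Z)
print(p_clustered("Tuesday", "party", "on", cluster_of, p_cluster, p_word_in_cluster))  # ~0.03
```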
28. Predictive clustering
- IMPORTANT FACT -- with no smoothing, etc.
- We are using hard clusters, so if we know z then we know the cluster Z, so P(z, Z | history) = P(z | history)
29. Predictive clustering
- Equality (with no smoothing, etc.):
- P(z|history) = P(Z|history) × P(z|history, Z) (derivation below)
- With smoothing, tends to be better
- May have trouble figuring out the probability P(Tuesday | party on), but can guess
- P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY) ≈
- P(WEEKDAY | party on) × P(Tuesday | on WEEKDAY)
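The identity behind predictive clustering, written out (hard clusters assumed: the cluster Z is a deterministic function of the word z).

```latex
\begin{align*}
P(z \mid \text{history})
  &= P(z, Z \mid \text{history})
     && \text{knowing $z$ determines $Z$} \\
  &= P(Z \mid \text{history})\, P(z \mid \text{history}, Z)
     && \text{chain rule}
\end{align*}
% With smoothing, the two factors are estimated separately, so the product is
% no longer exactly P(z | history); it tends, however, to be better smoothed.
```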
30. Compression - Introduction
- We have billions of words of training data.
- Most large-vocabulary models are limited by model size.
- The most important question in language modeling is: "What is the best language model we can build that will fit in the available memory?"
- Relatively little research.
- New results, up to a factor of 3 or more smaller
than previous state of the art at the same
perplexity.
31. Compression overview
- Review previous techniques
- Count cutoffs
- Stolcke pruning
- IBM clustering
- Describe new techniques (Stolcke pruning + predictive clustering)
- Show experimental results
- Up to a factor of 3 or more size decrease (at the same perplexity) versus Stolcke pruning.
32. Count cutoffs
- Simple, commonly used technique
- Just remove n-grams with small counts
33. Stolcke pruning
- Consider P(City | New York) vs. P(City | York)
- Probabilities are almost the same
- Pruning P(City | New York) has almost no cost, even though C(New York City) is big.
- Consider pruning P(lightbulb | change a): much more likely than P(lightbulb | a) (a simplified sketch of the criterion follows below)
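A simplified sketch of the intuition behind the pruning criterion: weight the change in log probability by how much probability mass the n-gram carries. The exact Stolcke criterion also accounts for re-normalizing the backoff weights; the numbers and the idea of a fixed threshold here are illustrative.

```python
import math

def pruning_cost(p_history, p_full, p_backoff):
    """Approximate loss in weighted log probability from pruning one n-gram
    and letting the lower-order (backoff) estimate stand in for it."""
    return p_history * p_full * (math.log(p_full) - math.log(p_backoff))

# P(City | New York) is almost the same as P(City | York): nearly free to prune,
# even though C(New York City) is big.
print(pruning_cost(p_history=1e-4, p_full=0.30, p_backoff=0.28))    # tiny cost
# P(lightbulb | change a) is far larger than P(lightbulb | a): costly to prune.
print(pruning_cost(p_history=1e-4, p_full=0.10, p_backoff=0.0005))  # ~25x larger cost
# Prune every n-gram whose cost falls below a chosen threshold.
```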
34. IBM clustering
- Use P(Z|XY) × P(z|Z)
- Don't interpolate with P(z|xy), of course
- Model is much smaller, but higher perplexity.
- How does it compare to count cutoffs, etc.? No one ever tried the comparison!
35. Predictive clustering
- Predictive clustering: P(Z|xy) × P(z|xyZ)
- Model is actually larger than the original P(z|xy)
- For each original P(z|xy), we must store P(z|xyZ); in addition, we need P(Z|xy)
- Normal model stores:
- P(Sunday | party on), P(Monday | party on), P(Tuesday | party on), ...
- Clustered, pruned model stores:
- P(WEEKDAY | party on)
- P(Sunday | WEEKDAY), P(Monday | WEEKDAY), P(Tuesday | WEEKDAY), ...
36. Experiments
37. Different Clusterings
- Let x^j, x^k, x^l be alternate clusterings of x
- Example:
- Tuesday^l = WEEKDAY
- Tuesday^j = DAYS-MONTHS-TIMES
- Tuesday^k = NOUNS
- You can think of l, j, and k as being the number of clusters.
- Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
38. Different Clusterings (continued)
- Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
- P(WEEKDAY | party on) ≈
- P(WEEKDAY | party^j on^j) =
- P(WEEKDAY | NOUN PREP)
- Or
- P(WEEKDAY | party on) ≈
- P(WEEKDAY | party^k on^k) =
- P(WEEKDAY | EVENT LOC-PREP)
39. Both Clustering
- P(z^l | xy) ≈ P(z^l | x^j y^j)
- P(z | xy z^l) ≈ P(z | x^k y^k z^l)
- Substitute into predictive clustering:
- P(z|xy)
- = P(z^l | xy) × P(z | xy z^l)
- ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
40. Example
- P(z|xy)
- = P(z^l | xy) × P(z | xy z^l)
- ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
- P(Tuesday | party on)
- = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
- ≈ P(WEEKDAY | NOUN PREP) × P(Tuesday | EVENT LOC-PREP WEEKDAY)
41. Size reduction
- P(z|xy) = P(z^l | xy) × P(z | xy z^l) ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l) (written out below)
- The optimal setting for k is often very large, e.g. the whole vocabulary.
- The unpruned model is typically larger than the unclustered model, but smaller than the predictive model.
- The pruned model is smaller than the unclustered model and smaller than the predictive model at the same perplexity.
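The full "both clustering" factorization in one place, with superscripts marking which clustering each symbol comes from (z^l is z's cluster in clustering l, and so on).

```latex
% First line is exact for hard clusters; the approximation comes from
% conditioning each factor on clusters rather than on the words themselves.
\begin{align*}
P(z \mid x\,y)
  &= P(z^{l} \mid x\,y)\; P(z \mid x\,y, z^{l}) \\
  &\approx P(z^{l} \mid x^{j} y^{j})\; P(z \mid x^{k} y^{k}, z^{l})
\end{align*}
```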
42. Experiments
43. WSJ (English) results -- relative
44. Chinese Newswire Results (with Jianfeng Gao, MSR Beijing)
45. Compression conclusion
- We can achieve up to a factor of 3 or more reduction at the same perplexity by using Both Clustering combined with Stolcke pruning.
- The model is surprising: it actually increases the model size and then prunes it down smaller.
- Results are similar for Chinese and English.
46. Maximum Entropy Speedups
- Many people think Maximum Entropy is the future of language modeling
- (not me anymore)
- Allows lots of different information to be combined
- Very slow to train: weeks
- Predictive cluster models are up to 35 times faster to train
47. Maximum entropy overview
- Describe what maximum entropy is
- Explain how to train maxent models, and why it is slow
- Show how predictive clustering can speed it up
- Give experimental results showing a factor of 35 speedup
- Talk about application to other areas
48. Maximum Entropy Introduction
- "I'm busy next weekend. We are having a big party on ..."
- How likely is "Friday"?
- Reasonably likely to start: 0.001
- "weekend" occurs nearby: 2 times as likely
- Previous word is "on": 3 times as likely
- Previous words are "party on": 5 times as likely
- 0.001 × 2 × 3 × 5 = 0.03
- Need to normalize: 0.03 / Σ P(all words)
49. Maximum Entropy: what is it
- Product of many indicator functions
- f_j is an indicator: 1 if some condition holds, e.g. f_j(w, w_{i-2}, w_{i-1}) = 1 if w = Friday, w_{i-2} = party, w_{i-1} = on
- Can create bigrams, trigrams, skipping, caches, and triggers with the right indicator functions.
- Z_λ is a normalization constant (a toy sketch follows below)
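A toy sketch of the model in its product form, with made-up indicators and weights. Note that the normalization sums over the entire vocabulary, which is what makes training slow.

```python
def maxent_prob(word, history, vocab, indicators, weights):
    """P(word | history) = product of weights of active indicators, normalized over vocab."""
    def score(w):
        s = 1.0
        for f, lam in zip(indicators, weights):
            if f(w, history):
                s *= lam
        return s
    z = sum(score(w) for w in vocab)   # normalization constant: sum over the whole vocabulary
    return score(word) / z

vocab = ["Friday", "Tuesday", "fish"]
indicators = [
    lambda w, h: w == "Friday" and h == ("party", "on"),   # trigram indicator
    lambda w, h: w == "Friday" and h[-1] == "on",          # bigram indicator
]
weights = [5.0, 3.0]
print(maxent_prob("Friday", ("party", "on"), vocab, indicators, weights))  # 15/17 ~ 0.88
```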
50. Maximum entropy training
- How to get the λs? Iterative EM algorithm
- Requires computing the probability distribution in all training contexts. For each training context:
- Requires determining all indicators that might apply
- Requires computing the normalization constant
- Note that the number of indicators that can apply, and the time to normalize, are both bounded by a factor of the vocabulary size.
51. Example: "party on Tuesday"
- Consider "party on Friday"
- We need to compute P(Friday | party on), P(Tuesday | party on), P(fish | party on), etc.
- Number of trigram indicators (f_j s) that we need to consider: bounded by the vocabulary size
- Number of words to normalize over: the vocabulary size.
52. Solution: Predictive Clustering
- Create two separate maximum entropy models: P(Z|wxy) and P(z|wxyZ).
- Imagine a 10,000-word vocabulary, 100 clusters, 100 words per cluster.
- Time to train the first model will be proportional to the number of clusters (100)
- Time to train the second model is proportional to the number of words per cluster (100)
- 10,000 / 200 = 50 times speedup (see the cost sketch below)
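The back-of-the-envelope cost comparison from this slide, under its stated assumptions (10,000 words, 100 clusters, 100 words per cluster); training cost is dominated by the normalization sum.

```python
vocab_size, n_clusters = 10_000, 100
words_per_cluster = vocab_size // n_clusters

baseline_cost = vocab_size                       # P(z | wxy): normalize over all words
clustered_cost = n_clusters + words_per_cluster  # P(Z | wxy), then P(z | wxyZ)
print(baseline_cost / clustered_cost)            # 10,000 / 200 = 50x speedup
```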
53. Predictive clustering example
- Consider "party on Tuesday", P(Z|wxy)
- We need to know P(WEEKDAY | party on), P(MONTH | party on), P(ANIMAL | party on), etc.
- Number of trigram indicators (f_j s) that we need to consider: bounded by the number of clusters
- Normalize only over the number of clusters
54. Predictive clustering example (continued)
- Consider "party on Tuesday", P(z|wxyZ)
- We need to know P(Monday | party on WEEKDAY), P(Tuesday | party on WEEKDAY), etc. Note that P(fish | party on WEEKDAY) = 0
- Number of trigram indicators (f_j s) we need to consider: bounded by the number of words in the cluster.
- Normalize only over the number of words in the cluster
55. Improvements: testing
- May also speed up testing.
- If running the decoder with all words, then we need to compute P(z|wxy) for all z, and there is no speedup.
- If using maximum entropy as a postprocessing step, on a lattice or n-best list, it may still lead to speedups, since we only need to compute a few z's for each context wxy.
56. Maximum entropy results
57. Maximum entropy conclusions
- At 10,000 words of training data, predictive clustering hurts a little
- At any larger size it helps.
- The amount it helps increases as the training data size increases.
- Triple predictive gives a factor of 35 over fast unigram at 10,000,000 words of training data
- Perplexity actually decreases slightly, even with the faster training!
58. Overall conclusion: Predictive clustering ⇒ smaller, faster
- Clustering is a well known technique
- Smaller: new ways of using clustering to reduce language model size; up to 50% reduction in size at the same perplexity.
- Faster: new ways of speeding up training for maximum entropy models.
59. Speedup applied to other areas
- Can apply to any problem with many outputs, not just words
- Example: collaborative filtering tasks
- This speedup can be used with most machine learning algorithms applied to problems with many outputs
- Examples: neural networks, decision trees
60. Neural Networks
- Imagine a neural network with a large number of outputs (10,000)
- Requires backpropagating one 1 and 9,999 0s (a two-stage sketch follows below)
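A sketch of the same two-stage trick applied to a network with many outputs, assuming a hard clustering of the output words and toy weight matrices: only the cluster softmax and the softmax within the one relevant cluster are ever computed.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
words = ["Monday", "Tuesday", "party", "celebration"]
cluster_of = {"Monday": 0, "Tuesday": 0, "party": 1, "celebration": 1}
n_clusters = 2

W_cluster = rng.normal(size=(n_clusters, hidden))       # cluster output layer
W_word = {w: rng.normal(size=hidden) for w in words}    # one output vector per word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_word(word, h):
    """P(word | h) = P(cluster | h) * P(word | h, cluster), with hard clusters."""
    c = cluster_of[word]
    p_c = softmax(W_cluster @ h)[c]                      # softmax over clusters only
    members = [w for w in words if cluster_of[w] == c]
    scores = np.array([W_word[w] @ h for w in members])  # softmax within one cluster only
    p_w = softmax(scores)[members.index(word)]
    return p_c * p_w

h = rng.normal(size=hidden)                              # stand-in for the hidden layer
print(p_word("Tuesday", h))
```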
61. Maximum Entropy training: Inner loop
- For each word w in vocabulary
- P_w ← 1
- next w
- For each non-zero f_j
- P_w ← P_w × λ_j
- next j
- z ← Σ_w P_w
- For each word w in vocabulary
- observed_w ← observed_w + P_w / z
- next w
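This loop accumulates the expected counts needed by the iterative update, and every pass sums over the whole vocabulary. With predictive clustering the same loop runs twice, but each pass normalizes over far fewer outcomes: once over the clusters for P(Z|wxy), and once over the words within the known cluster for P(z|wxyZ), which is where the up-to-35-times training speedup comes from.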