Title: Word clustering: Smaller models, Faster training
1. Word clustering: Smaller models, faster training
- Joshua Goodman
- Machine Learning and Applied Statistics
- Microsoft Research
2. Quick Overview
- Microsoft Research overview (5 minutes)
- What I've done (55 minutes)
- Word clusters
- How word clusters make language models
- Smaller
- Faster
3. Microsoft Research
- 10th Anniversary Last Week
- Very roughly 500 researchers
- I don't know what 430 of them are doing
- Speech Group
- I used to be there
- Machine Learning and Applied Statistics
- My new home
- Natural Language Processing
4. Speech Recognition Research
- Ciprian Chelba, Milind Mahajan
- Language Model Adaptation
- Kuansan Wang
- Dialog systems
- Asela Gunawardana
- Acoustic Model Adaptation
- Li Deng, Alex Acero, Jasha Droppo
- Noise Robustness (great results in the Aurora competition)
- Yeyi Wang
- Understanding
5. Natural Language Processing
- Mostly non-statistical
- Main project: Machine Translation
- Parse in source language
- Translate deep structure
- Generate in target language
- Good initial results (beating Systran in target
domain)
6. Statistical People in NLP
- Bob Moore
- Automatically building translation lexicons
- Eric Ringger
- Using machine learning for generation
- Michele Banko
- Working with Bob, Eric Brill, and me
7. Machine Learning and Applied Statistics
- Lots of non-language stuff
- I won't talk about it
- Bayes networks, Bayesian approaches
- Lots of language stuff too
8. Language Things in MLAS
- Hagai Attias
- Noise robustness (with speech group)
- Eric Brill and Michele Banko
- Question answering using the web as a resource
- Joshua Goodman, Eric Brill and Michele Banko
- Machine learning for Grammar Checking
9. Machine Learning for Grammar Checking
- Example: "to" or "too"
- Take 50 million words of Wall Street Journal data
- Train classifier using nearby words
- Take real data, find places where we predict "too" but see "to"
- Mark as errors
- Strunk and White
10. Overview: Word clusters solve problems -- smaller, faster
- Background: What are word clusters?
- Word clusters for smaller models
- Use a clustering technique that leads to larger models, then prune
- Up to 3 times smaller at the same perplexity
- Word clusters for faster training of maximum entropy models
- Train two models, each of which predicts "half as much"
- Up to 35 times faster training
11-14. A bad language model
15. What's a Language Model?
- For our purposes today, a language model gives the probability of a word given its context
- P(truth | and nothing but the) ≈ 0.2
- P(roof | and nuts sing on the) ≈ 0.00000001
- Useful for speech recognition, handwriting, OCR, etc.
16. The Trigram Approximation
- Assume each word depends only on the previous two words
- P(the | whole truth and nothing but) ≈
- P(the | nothing but)
17. Trigrams, continued
- Find probabilities by counting in real text:
- P(the | nothing but) ≈ C(nothing but the) / C(nothing but)
- Smoothing: need to combine the trigram P(the | nothing but) with the bigram P(the | nothing) and the unigram P(the); otherwise, too many things you've never seen (see the interpolation sketch below)
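A minimal sketch of the counting-and-smoothing idea above, in Python. The toy corpus, the helper names, and the fixed interpolation weights are illustrative assumptions; the talk does not specify its exact smoothing scheme.

```python
from collections import Counter

def train_counts(words):
    """Collect unigram, bigram, and trigram counts from a list of tokens."""
    uni = Counter(words)
    bi = Counter(zip(words, words[1:]))
    tri = Counter(zip(words, words[1:], words[2:]))
    return uni, bi, tri

def p_interp(w, u, v, uni, bi, tri, total, lambdas=(0.6, 0.3, 0.1)):
    """P(w | u v) as a linear interpolation of trigram, bigram, and unigram estimates."""
    l3, l2, l1 = lambdas
    p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    p1 = uni[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

words = "nothing but the truth and nothing but the whole truth".split()
uni, bi, tri = train_counts(words)
print(p_interp("the", "nothing", "but", uni, bi, tri, total=len(words)))
```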
18. Perplexity
- Perplexity: the standard measure of language model accuracy; lower is better
- Corresponds to the average branching factor of the model (see the sketch below)
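A small sketch of how perplexity is computed from a model's per-word probabilities; the numbers are illustrative.

```python
import math

def perplexity(word_probs):
    """exp of the average negative log-probability per word; lower is better."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))

# A model that assigns probability 1/100 to every test word has perplexity 100:
# an "average branching factor" of 100 equally likely choices.
print(perplexity([0.01] * 50))  # ~100.0
```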
19. Trigram Problems
- Models are potentially huge: similar in size to the training data
- Largest part of commercial recognizers
- Sophisticated variations can be slow to learn
- Maximum entropy could take weeks, months, or
years!
20. What are word clusters?
- CLUSTERING = CLASSES (same thing)
- What is P(Tuesday | party on)?
- Similar to P(Monday | party on)
- Similar to P(Tuesday | celebration on)
- Put words in clusters
- WEEKDAY = Sunday, Monday, Tuesday, ...
- EVENT = party, celebration, birthday, ...
21. Putting words into clusters
- One cluster per word: hard clustering
- WEEKDAY = Sunday, Monday, Tuesday, ...
- MONTH = January, February, April, May, June, ...
- Soft clustering (each word belongs to more than one cluster) is possible, but complicates things: you get fractional counts.
22. Clustering: how to get them
- Build them by hand
- Works OK when there is almost no data
- Part of Speech (POS) tags
- Tend not to work as well as automatic clusters
- Automatic Clustering
- Swap words between clusters to minimize perplexity
23. Clustering: automatic
- Minimize perplexity of P(z|Y)
- Put words into clusters randomly
- Swap words between clusters whenever the overall perplexity of P(z|Y) goes down
- Doing this naively is very slow, but mathematical tricks speed it up (a naive sketch follows below)
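A deliberately naive sketch of the swap ("exchange") idea, assuming a class-bigram likelihood objective closely related to the P(z|Y) criterion above. The real implementation uses incremental count updates and top-down splitting rather than recomputing the likelihood from scratch as done here.

```python
import math, random
from collections import Counter

def class_bigram_ll(words, cluster_of):
    """Log-likelihood under P(w_i | w_{i-1}) ~ P(C_i | C_{i-1}) * P(w_i | C_i)."""
    cluster_count = Counter(cluster_of[w] for w in words)
    pair_count = Counter((cluster_of[a], cluster_of[b]) for a, b in zip(words, words[1:]))
    word_count = Counter(words)
    ll = 0.0
    for a, b in zip(words, words[1:]):
        ca, cb = cluster_of[a], cluster_of[b]
        p = (pair_count[(ca, cb)] / cluster_count[ca]) * (word_count[b] / cluster_count[cb])
        ll += math.log(p)
    return ll

def exchange(words, n_clusters=2, sweeps=3, seed=0):
    """Randomly assign clusters, then greedily move each word to its best cluster."""
    rng = random.Random(seed)
    vocab = sorted(set(words))
    cluster_of = {w: rng.randrange(n_clusters) for w in vocab}
    for _ in range(sweeps):
        for w in vocab:
            cluster_of[w] = max(range(n_clusters),
                                key=lambda c: class_bigram_ll(words, {**cluster_of, w: c}))
    return cluster_of

print(exchange("the cat sat on the mat the dog sat on the rug".split()))
```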
24. Clustering: fast
- Use top-down splitting: at each level, consider swapping each word between two clusters
- Not bottom-up merging! (That considers all pairs of clusters!)
25. Clustering example
- Imagine the following counts:
- C(Tuesday party on) = 0
- C(Wednesday celebration before) = 100
- C(Tuesday WEEKDAY) = 1000
- Then:
- P(Tuesday | party on) ≈ 0
- P(WEEKDAY | EVENT PREPOSITION) ≈ large
- P(Tuesday | WEEKDAY) ≈ large
- P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY) ≈ large
26. Two actual WSJ clusters
- Cluster 1: MONDAYS, FRIDAYS, THURSDAY, MONDAY, EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY, TENTERHOOKS, TUESDAY, SUNDAY
- Cluster 2: CONDITION, PARTY, FESCO, CULT, NILSON, PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN, DEPARTMENT, PENH, GUILD
27. How to use clusters
- Let x, y, z be words, and X, Y, Z be the clusters of those words.
- P(z|xy) ≈ P(Z|XY) × P(z|Z)
- P(Tuesday | party on) ≈ P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday | WEEKDAY)
- Much smoother, smaller model than the normal P(z|xy), but higher perplexity (toy sketch below)
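A toy sketch of the fully clustered approximation; the probability tables are made-up numbers, not trained values.

```python
def p_clustered(z, x, y, cluster_of, p_cluster, p_word_in_cluster):
    """P(z | x y) ~= P(Z | X Y) * P(z | Z) for hard clusters."""
    X, Y, Z = cluster_of[x], cluster_of[y], cluster_of[z]
    return p_cluster[(Z, X, Y)] * p_word_in_cluster[(z, Z)]

cluster_of = {"party": "EVENT", "on": "PREPOSITION", "Tuesday": "WEEKDAY"}
p_cluster = {("WEEKDAY", "EVENT", "PREPOSITION"): 0.2}   # P(Z | X Y)
p_word_in_cluster = {("Tuesday", "WEEKDAY"): 0.15}       # P(z | Z)
print(p_clustered("Tuesday", "party", "on", cluster_of, p_cluster, p_word_in_cluster))  # ~0.03
```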
28. Predictive clustering
- IMPORTANT FACT -- with no smoothing, etc.
- We are using hard clusters, so if we know z then we know the cluster Z, so P(z, Z | history) = P(z | history)
29. Predictive clustering
- Equality (with no smoothing, etc.):
- P(z|history) = P(Z|history) × P(z|history, Z) (derivation below)
- With smoothing, tends to be better
- May have trouble figuring out the probability P(Tuesday | party on), but can guess
- P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY) ≈
- P(WEEKDAY | party on) × P(Tuesday | on WEEKDAY)
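The identity behind predictive clustering, written out (hard clusters assumed: the cluster Z is a deterministic function of the word z).

```latex
\begin{align*}
P(z \mid \text{history})
  &= P(z, Z \mid \text{history})
     && \text{knowing $z$ determines $Z$} \\
  &= P(Z \mid \text{history})\, P(z \mid \text{history}, Z)
     && \text{chain rule}
\end{align*}
% With smoothing, the two factors are estimated separately, so the product is
% no longer exactly P(z | history); it tends, however, to be better smoothed.
```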
30. Compression - Introduction
- We have billions of words of training data.
- Most large-vocabulary models are limited by model size.
- The most important question in language modeling is: "What is the best language model we can build that will fit in the available memory?"
- Relatively little research.
- New results, up to a factor of 3 or more smaller
than previous state of the art at the same
perplexity.
31. Compression overview
- Review previous techniques
- Count cutoffs
- Stolcke pruning
- IBM clustering
- Describe new techniques (Stolcke pruning + predictive clustering)
- Show experimental results
- Up to a factor of 3 or more size decrease (at the same perplexity) versus Stolcke pruning.
32. Count cutoffs
- Simple, commonly used technique
- Just remove n-grams with small counts
33. Stolcke pruning
- Consider P(City | New York) vs. P(City | York)
- Probabilities are almost the same
- Pruning P(City | New York) has almost no cost, even though C(New York City) is big.
- Consider pruning P(lightbulb | change a): much more likely than P(lightbulb | a) (a simplified sketch of the criterion follows below)
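A simplified sketch of the intuition behind the pruning criterion: weight the change in log probability by how much probability mass the n-gram carries. The exact Stolcke criterion also accounts for re-normalizing the backoff weights; the numbers and the idea of a fixed threshold here are illustrative.

```python
import math

def pruning_cost(p_history, p_full, p_backoff):
    """Approximate loss in weighted log probability from pruning one n-gram
    and letting the lower-order (backoff) estimate stand in for it."""
    return p_history * p_full * (math.log(p_full) - math.log(p_backoff))

# P(City | New York) is almost the same as P(City | York): nearly free to prune,
# even though C(New York City) is big.
print(pruning_cost(p_history=1e-4, p_full=0.30, p_backoff=0.28))    # tiny cost
# P(lightbulb | change a) is far larger than P(lightbulb | a): costly to prune.
print(pruning_cost(p_history=1e-4, p_full=0.10, p_backoff=0.0005))  # ~25x larger cost
# Prune every n-gram whose cost falls below a chosen threshold.
```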
34. IBM clustering
- Use P(Z|XY) × P(z|Z)
- Don't interpolate with P(z|xy), of course
- Model is much smaller, but higher perplexity.
- How does it compare to count cutoffs, etc.? No one ever tried the comparison!
35. Predictive clustering
- Predictive clustering: P(Z|xy) × P(z|xyZ)
- Model is actually larger than the original P(z|xy)
- For each original P(z|xy), we must store P(z|xyZ); in addition, we need P(Z|xy)
- Normal model stores:
- P(Sunday | party on), P(Monday | party on), P(Tuesday | party on), ...
- Clustered, pruned model stores:
- P(WEEKDAY | party on)
- P(Sunday | WEEKDAY), P(Monday | WEEKDAY), P(Tuesday | WEEKDAY), ...
36. Experiments
37. Different Clusterings
- Let x^j, x^k, x^l be alternate clusterings of x
- Example:
- Tuesday^l = WEEKDAY
- Tuesday^j = DAYS-MONTHS-TIMES
- Tuesday^k = NOUNS
- You can think of l, j, and k as being the number of clusters.
- Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
38. Different Clusterings (continued)
- Example: P(z^l | xy) ≈ P(z^l | x^j y^j)
- P(WEEKDAY | party on) ≈
- P(WEEKDAY | party^j on^j) =
- P(WEEKDAY | NOUN PREP)
- Or
- P(WEEKDAY | party on) ≈
- P(WEEKDAY | party^k on^k) =
- P(WEEKDAY | EVENT LOC-PREP)
39. Both Clustering
- P(z^l | xy) ≈ P(z^l | x^j y^j)
- P(z | xy z^l) ≈ P(z | x^k y^k z^l)
- Substitute into predictive clustering:
- P(z|xy)
- = P(z^l | xy) × P(z | xy z^l)
- ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
40. Example
- P(z|xy)
- = P(z^l | xy) × P(z | xy z^l)
- ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l)
- P(Tuesday | party on)
- = P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
- ≈ P(WEEKDAY | NOUN PREP) × P(Tuesday | EVENT LOC-PREP WEEKDAY)
41. Size reduction
- P(z|xy) = P(z^l | xy) × P(z | xy z^l) ≈ P(z^l | x^j y^j) × P(z | x^k y^k z^l) (written out below)
- The optimal setting for k is often very large, e.g. the whole vocabulary.
- The unpruned model is typically larger than the unclustered model, but smaller than the predictive model.
- The pruned model is smaller than the unclustered model and smaller than the predictive model at the same perplexity.
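The full "both clustering" factorization in one place, with superscripts marking which clustering each symbol comes from (z^l is z's cluster in clustering l, and so on).

```latex
% First line is exact for hard clusters; the approximation comes from
% conditioning each factor on clusters rather than on the words themselves.
\begin{align*}
P(z \mid x\,y)
  &= P(z^{l} \mid x\,y)\; P(z \mid x\,y, z^{l}) \\
  &\approx P(z^{l} \mid x^{j} y^{j})\; P(z \mid x^{k} y^{k}, z^{l})
\end{align*}
```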
42. Experiments
43. WSJ (English) results -- relative
44. Chinese Newswire Results (with Jianfeng Gao, MSR Beijing)
45. Compression conclusion
- We can achieve up to a factor of 3 or more reduction at the same perplexity by using Both Clustering combined with Stolcke pruning.
- The model is surprising: it actually increases the model size and then prunes it down smaller.
- Results are similar for Chinese and English.
46. Maximum Entropy Speedups
- Many people think Maximum Entropy is the future of language modeling
- (not me anymore)
- Allows lots of different information to be combined
- Very slow to train: weeks
- Predictive cluster models are up to 35 times faster to train
47. Maximum entropy overview
- Describe what maximum entropy is
- Explain how to train maxent models, and why it is slow
- Show how predictive clustering can speed it up
- Give experimental results showing a factor of 35 speedup
- Talk about application to other areas
48. Maximum Entropy Introduction
- "I'm busy next weekend. We are having a big party on ..."
- How likely is "Friday"?
- Reasonably likely to start: 0.001
- "weekend" occurs nearby: 2 times as likely
- Previous word is "on": 3 times as likely
- Previous words are "party on": 5 times as likely
- 0.001 × 2 × 3 × 5 = 0.03
- Need to normalize: 0.03 / Σ P(all words)
49. Maximum Entropy: what is it
- Product of many indicator functions
- f_j is an indicator: 1 if some condition holds, e.g. f_j(w, w_{i-2}, w_{i-1}) = 1 if w = Friday, w_{i-2} = party, w_{i-1} = on
- Can create bigrams, trigrams, skipping, caches, and triggers with the right indicator functions.
- Z_λ is a normalization constant (a toy sketch follows below)
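A toy sketch of the model in its product form, with made-up indicators and weights. Note that the normalization sums over the entire vocabulary, which is what makes training slow.

```python
def maxent_prob(word, history, vocab, indicators, weights):
    """P(word | history) = product of weights of active indicators, normalized over vocab."""
    def score(w):
        s = 1.0
        for f, lam in zip(indicators, weights):
            if f(w, history):
                s *= lam
        return s
    z = sum(score(w) for w in vocab)   # normalization constant: sum over the whole vocabulary
    return score(word) / z

vocab = ["Friday", "Tuesday", "fish"]
indicators = [
    lambda w, h: w == "Friday" and h == ("party", "on"),   # trigram indicator
    lambda w, h: w == "Friday" and h[-1] == "on",          # bigram indicator
]
weights = [5.0, 3.0]
print(maxent_prob("Friday", ("party", "on"), vocab, indicators, weights))  # 15/17 ~ 0.88
```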
50. Maximum entropy training
- How to get the λs? Iterative EM algorithm
- Requires computing the probability distribution in all training contexts. For each training context:
- Requires determining all indicators that might apply
- Requires computing the normalization constant
- Note that the number of indicators that can apply, and the time to normalize, are both bounded by a factor of the vocabulary size.
51. Example: "party on Tuesday"
- Consider "party on Friday"
- We need to compute P(Friday | party on), P(Tuesday | party on), P(fish | party on), etc.
- Number of trigram indicators (f_j s) that we need to consider: bounded by the vocabulary size
- Number of words to normalize over: the vocabulary size.
52. Solution: Predictive Clustering
- Create two separate maximum entropy models: P(Z|wxy) and P(z|wxyZ).
- Imagine a 10,000-word vocabulary, 100 clusters, 100 words per cluster.
- Time to train the first model will be proportional to the number of clusters (100)
- Time to train the second model is proportional to the number of words per cluster (100)
- 10,000 / 200 = 50 times speedup (see the cost sketch below)
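The back-of-the-envelope cost comparison from this slide, under its stated assumptions (10,000 words, 100 clusters, 100 words per cluster); training cost is dominated by the normalization sum.

```python
vocab_size, n_clusters = 10_000, 100
words_per_cluster = vocab_size // n_clusters

baseline_cost = vocab_size                       # P(z | wxy): normalize over all words
clustered_cost = n_clusters + words_per_cluster  # P(Z | wxy), then P(z | wxyZ)
print(baseline_cost / clustered_cost)            # 10,000 / 200 = 50x speedup
```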
53. Predictive clustering example
- Consider "party on Tuesday", P(Z|wxy)
- We need to know P(WEEKDAY | party on), P(MONTH | party on), P(ANIMAL | party on), etc.
- Number of trigram indicators (f_j s) that we need to consider: bounded by the number of clusters
- Normalize only over the number of clusters
54. Predictive clustering example (continued)
- Consider "party on Tuesday", P(z|wxyZ)
- We need to know P(Monday | party on WEEKDAY), P(Tuesday | party on WEEKDAY), etc. Note that P(fish | party on WEEKDAY) = 0
- Number of trigram indicators (f_j s) we need to consider: bounded by the number of words in the cluster.
- Normalize only over the number of words in the cluster
55. Improvements: testing
- May also speed up testing.
- If running the decoder with all words, then we need to compute P(z|wxy) for all z, and there is no speedup.
- If using maximum entropy as a postprocessing step, on a lattice or n-best list, it may still lead to speedups, since we only need to compute a few z's for each context wxy.
56. Maximum entropy results
57. Maximum entropy conclusions
- At 10,000 words of training data, predictive clustering hurts a little
- At any larger size it helps.
- The amount it helps increases as the training data size increases.
- Triple predictive gives a factor of 35 over fast unigram at 10,000,000 words of training data
- Perplexity actually decreases slightly, even with the faster training!
58. Overall conclusion: Predictive clustering ⇒ smaller, faster
- Clustering is a well known technique
- Smaller: new ways of using clustering to reduce language model size; up to 50% reduction in size at the same perplexity.
- Faster: new ways of speeding up training for maximum entropy models.
59. Speedup applied to other areas
- Can apply to any problem with many outputs, not just words
- Example: collaborative filtering tasks
- This speedup can be used with most machine learning algorithms applied to problems with many outputs
- Examples: neural networks, decision trees
60. Neural Networks
- Imagine a neural network with a large number of outputs (10,000)
- Requires backpropagating one 1 and 9,999 0s (a two-stage sketch follows below)
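A sketch of the same two-stage trick applied to a network with many outputs, assuming a hard clustering of the output words and toy weight matrices: only the cluster softmax and the softmax within the one relevant cluster are ever computed.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 8
words = ["Monday", "Tuesday", "party", "celebration"]
cluster_of = {"Monday": 0, "Tuesday": 0, "party": 1, "celebration": 1}
n_clusters = 2

W_cluster = rng.normal(size=(n_clusters, hidden))       # cluster output layer
W_word = {w: rng.normal(size=hidden) for w in words}    # one output vector per word

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def p_word(word, h):
    """P(word | h) = P(cluster | h) * P(word | h, cluster), with hard clusters."""
    c = cluster_of[word]
    p_c = softmax(W_cluster @ h)[c]                      # softmax over clusters only
    members = [w for w in words if cluster_of[w] == c]
    scores = np.array([W_word[w] @ h for w in members])  # softmax within one cluster only
    p_w = softmax(scores)[members.index(word)]
    return p_c * p_w

h = rng.normal(size=hidden)                              # stand-in for the hidden layer
print(p_word("Tuesday", h))
```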
61. Maximum Entropy training: Inner loop
- For each word w in vocabulary
- P_w ← 1
- next w
- For each non-zero f_j
- P_w ← P_w × λ_j
- next j
- z ← Σ_w P_w
- For each word w in vocabulary
- observed_w ← observed_w + P_w / z
- next w
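This loop accumulates the expected counts needed by the iterative update, and every pass sums over the whole vocabulary. With predictive clustering the same loop runs twice, but each pass normalizes over far fewer outcomes: once over the clusters for P(Z|wxy), and once over the words within the known cluster for P(z|wxyZ), which is where the up-to-35-times training speedup comes from.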