Word clustering: Smaller models, Faster training
1
Word clustering: Smaller models, Faster training
  • Joshua Goodman
  • Machine Learning and Applied Statistics
  • Microsoft Research

2
Quick Overview
  • Microsoft Research Overview (5 minutes)
  • What I've done (55 minutes)
  • Word clusters
  • How word clusters make language models
  • Smaller
  • Faster

3
Microsoft Research
  • 10th Anniversary Last Week
  • Very roughly 500 researchers
  • I don't know what 430 of them are doing
  • Speech Group
  • I used to be there
  • Machine Learning and Applied Statistics
  • My new home
  • Natural Language Processing

4
Speech Recognition Research
  • Ciprian Chelba, Milind Mahajan
  • Language Model Adaptation
  • Kuansan Wang
  • Dialog systems
  • Asela Gunawardana
  • Acoustic Model Adaptation
  • Li Deng, Alex Acero, Jasha Droppo
  • Noise Robustness (Great results in Aurora
    Competition)
  • Yeyi Wang
  • Understanding

5
Natural Language Processing
  • Mostly non-statistical
  • Main project: Machine Translation
  • Parse in source language
  • Translate deep structure
  • Generate in target language
  • Good initial results (beating Systran in target
    domain)

6
Statistical People in NLP
  • Bob Moore
  • Automatically building translation lexicons
  • Eric Ringger
  • Using machine learning for generation
  • Michele Banko
  • Working with Bob, Eric Brill, and Me

7
Machine Learning and Applied Statistics
  • Lots of non-language stuff
  • I won't talk about it
  • Bayes networks, Bayesian approaches
  • Lots of language stuff too

8
Language Things in MLAS
  • Hagai Attias
  • Noise robustness (with speech group)
  • Eric Brill and Michele Banko
  • Question answering using the web as a resource
  • Joshua Goodman, Eric Brill and Michele Banko
  • Machine learning for Grammar Checking

9
Machine Learning for Grammar Checking
  • Example: "to" or "too"
  • Take 50 million words of Wall Street Journal data
  • Train a classifier using nearby words (toy sketch
    after this slide)
  • Take real data, find places where we predict
    "too" but see "to".
  • Mark as errors

Strunk and White
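A toy sketch of the classifier idea on this slide. This is not the author's system; the scikit-learn pipeline, the tiny training set, and all names are illustrative assumptions only.

```python
# Toy "to"/"too" checker: learn to predict which word belongs in a context,
# then flag places in real text where the prediction disagrees with what was
# written.  The pipeline and data here are purely illustrative.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Training "contexts": the nearby words around each occurrence of to/too.
contexts = ["want go home", "much salt in it", "gave it him", "far away drive"]
labels   = ["to", "too", "to", "too"]

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(contexts, labels)

# Check real data: if we predict "too" but the text says "to", mark an error.
written, context = "to", "ate much candy"
predicted = model.predict([context])[0]
if predicted != written:
    print(f"possible error: wrote {written!r}, expected {predicted!r}")
```

In the talk's setup the classifier would be trained on 50 million words of Wall Street Journal text rather than four toy examples.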
10
Overview: Word clusters solve problems --
smaller, faster
  • Background: What are word clusters?
  • Word clusters for smaller models
  • Use a clustering technique that leads to larger
    models, then prune
  • Up to 3 times smaller at same perplexity
  • Word clusters for faster training of maximum
    entropy models
  • Train two models, each of which predicts half as
    much. Up to 35 times faster training

11
A bad language model
12
A bad language model
13
A bad language model
14
A bad language model
15
What's a Language Model?
  • For our purposes today, a language model gives
    the probability of a word given its context
  • P(truth | and nothing but the) ≈ 0.2
  • P(roof | and nuts sing on the) ≈ 0.00000001
  • Useful for speech recognition, handwriting, OCR,
    etc.

16
The Trigram Approximation
  • Assume each word depends only on the previous two
    words
  • P(the | whole truth and nothing but) ≈
  • P(the | nothing but)

17
Trigrams, continued
  • Find probabilities by counting in real text:
    P(the | nothing but) ≈
  • C(nothing but the) / C(nothing but)
  • Smoothing: need to combine trigram P(the |
    nothing but) with bigram P(the | nothing) and
    unigram P(the); otherwise, too many things
    you've never seen (see the sketch below)
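A minimal sketch of the counting and interpolation described on this slide. The toy corpus and the interpolation weights are illustrative assumptions, not values from the talk.

```python
# Estimate P(w | u v) by counting, then interpolate trigram, bigram, and
# unigram estimates so unseen trigrams still get nonzero probability.
from collections import Counter

words = "nothing but the whole truth and nothing but the truth".split()

unigrams = Counter(words)
bigrams  = Counter(zip(words, words[1:]))
trigrams = Counter(zip(words, words[1:], words[2:]))
total    = len(words)

def p_interpolated(w, u, v, l3=0.6, l2=0.3, l1=0.1):
    """Interpolated P(w | u v); l1 + l2 + l3 = 1 (illustrative weights)."""
    p3 = trigrams[(u, v, w)] / bigrams[(u, v)] if bigrams[(u, v)] else 0.0
    p2 = bigrams[(v, w)] / unigrams[v] if unigrams[v] else 0.0
    p1 = unigrams[w] / total
    return l3 * p3 + l2 * p2 + l1 * p1

print(p_interpolated("the", "nothing", "but"))  # driven by C(nothing but the) / C(nothing but)
```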

18
Perplexity
  • Perplexity: standard measure of language model
    accuracy; lower is better
  • Corresponds to the average branching factor of the
    model (formula below)
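For reference, the standard definition behind this slide, written for the trigram case (N is the number of test words):

$$\text{Perplexity} \;=\; 2^{\,-\frac{1}{N}\sum_{i=1}^{N}\log_2 P(w_i \mid w_{i-2}, w_{i-1})}$$

A model that spreads probability uniformly over V choices at every step has perplexity V, which is why perplexity can be read as an average branching factor.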

19
Trigram Problems
  • Models are potentially huge: similar in size to
    the training data
  • Largest part of commercial recognizers
  • Sophisticated variations can be slow to learn
  • Maximum entropy could take weeks, months, or
    years!

20
What are word clusters?
  • CLUSTERING = CLASSES (same thing)
  • What is P(Tuesday | party on)?
  • Similar to P(Monday | party on)
  • Similar to P(Tuesday | celebration on)
  • Put words in clusters
  • WEEKDAY = Sunday, Monday, Tuesday, ...
  • EVENT = party, celebration, birthday, ...

21
Putting words into clusters
  • One cluster per word: hard clustering
  • WEEKDAY = Sunday, Monday, Tuesday, ...
  • MONTH = January, February, April, May, June, ...
  • Soft clustering (each word belongs to more than
    one cluster) is possible, but complicates things:
    you get fractional counts.

22
Clustering: how to get them
  • Build them by hand
  • Works OK when there is almost no data
  • Part of Speech (POS) tags
  • Tends not to work as well as automatic
  • Automatic Clustering
  • Swap words between clusters to minimize perplexity

23
Clustering: automatic
  • Minimize perplexity of P(z|Y)
  • Put words into clusters randomly
  • Swap words between clusters whenever overall
    perplexity of P(z|Y) goes down
  • Doing this naively is very slow, but mathematical
    tricks speed it up (naive version sketched below)
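A deliberately naive sketch of this swapping procedure, without the mathematical tricks; the data layout (a list of (previous word, word) pairs) and all names are assumptions for illustration.

```python
# Exchange clustering, naive version: start from random clusters and move a
# word to another cluster whenever that raises the training log-likelihood of
# P(z | Y), i.e. lowers its perplexity (Y = cluster of the previous word).
import math
import random
from collections import Counter

def log_likelihood(bigrams, cluster_of):
    # Sum over the corpus of log P(z | Y) with maximum-likelihood estimates.
    cy_z = Counter((cluster_of[y], z) for y, z in bigrams)  # C(Y, z)
    cy   = Counter(cluster_of[y] for y, _ in bigrams)       # C(Y)
    return sum(n * math.log(n / cy[Y]) for (Y, _), n in cy_z.items())

def exchange_cluster(words, bigrams, num_clusters, passes=5):
    cluster_of = {w: random.randrange(num_clusters) for w in words}
    best = log_likelihood(bigrams, cluster_of)
    for _ in range(passes):
        for w in words:
            chosen = cluster_of[w]
            for c in range(num_clusters):
                cluster_of[w] = c
                ll = log_likelihood(bigrams, cluster_of)  # naive full recount
                if ll > best:
                    best, chosen = ll, c
            cluster_of[w] = chosen                        # keep the best move
    return cluster_of
```

Every candidate move recounts the whole corpus, which is exactly the "very slow" naive version; the tricks alluded to on the slide update the counts incrementally instead.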

24
Clustering: fast
  • Use top-down splitting: at each level, consider
    swapping each word between two clusters.
  • Not bottom-up merging! (That considers all pairs
    of clusters!)

25
Clustering example
  • Imagine the following counts:
  • C(Tuesday | party on) = 0
  • C(Wednesday | celebration before) = 100
  • C(Tuesday | WEEKDAY) = 1000
  • Then
  • P(Tuesday | party on) ≈ 0
  • P(WEEKDAY | EVENT PREPOSITION) is large
  • P(Tuesday | WEEKDAY) is large
  • P(WEEKDAY | EVENT PREPOSITION) × P(Tuesday |
    WEEKDAY) is large

26
Two actual WSJ clusters
  • Cluster 1: MONDAYS, FRIDAYS, THURSDAY, MONDAY,
    EURODOLLARS, SATURDAY, WEDNESDAY, FRIDAY,
    TENTERHOOKS, TUESDAY, SUNDAY
  • Cluster 2: CONDITION, PARTY, FESCO, CULT, NILSON,
    PETA, CAMPAIGN, WESTPAC, FORCE, CONRAN,
    DEPARTMENT, PENH, GUILD

27
How to use clusters
  • Let x, y, z be words, X, Y, Z be the clusters of
    those words.
  • P(z|xy) ≈ P(Z|XY) × P(z|Z)
  • P(Tuesday | party on) ≈ P(WEEKDAY | EVENT
    PREPOSITION) × P(Tuesday | WEEKDAY)
  • Much smoother, smaller model than normal P(z|xy),
    but higher perplexity (toy sketch below).
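A toy sketch of this decomposition estimated from counts (an IBM-style cluster model). The word list, cluster map, and toy corpus below are made up for illustration.

```python
# P(z | x y) ~= P(Z | X Y) * P(z | Z), with clusters X, Y, Z of words x, y, z.
from collections import Counter

cluster_of = {"party": "EVENT", "celebration": "EVENT", "on": "PREPOSITION",
              "Monday": "WEEKDAY", "Tuesday": "WEEKDAY"}
trigrams = [("party", "on", "Monday"), ("celebration", "on", "Tuesday")]

cxyz = Counter((cluster_of[x], cluster_of[y], cluster_of[z]) for x, y, z in trigrams)
cxy  = Counter((cluster_of[x], cluster_of[y]) for x, y, _ in trigrams)
czw  = Counter((cluster_of[z], z) for _, _, z in trigrams)
cz   = Counter(cluster_of[z] for _, _, z in trigrams)

def p_clustered(z, x, y):
    X, Y, Z = cluster_of[x], cluster_of[y], cluster_of[z]
    return (cxyz[(X, Y, Z)] / cxy[(X, Y)]) * (czw[(Z, z)] / cz[Z])

# Nonzero even though the exact trigram "party on Tuesday" was never seen.
print(p_clustered("Tuesday", "party", "on"))
```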

28
Predictive clustering
  • IMPORTANT FACT -- with no smoothing, etc.

We are using hard clusters, so if we know z then
we know the cluster, Z, so P(z, Z | history) =
P(z | history)
29
Predictive clustering
  • Equality, with no smoothing, etc.:
  • P(z | history) = P(Z | history) × P(z | history, Z)
    (written out below)
  • With smoothing, tends to be better
  • May have trouble figuring out the probability
    P(Tuesday | party on), but can guess
  • P(WEEKDAY | party on) × P(Tuesday | party on WEEKDAY)
    ≈
  • P(WEEKDAY | party on) × P(Tuesday | on WEEKDAY)
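Written out in one line (h is the history, and Z is the cluster of z, so Z is determined by z):

$$P(z \mid h) \;=\; P(z, Z \mid h) \;=\; P(Z \mid h)\,P(z \mid h, Z)$$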

30
Compression - Introduction
  • We have billions of words of training data.
  • Most large-vocabulary models are limited by model
    size.
  • The most important question in language modeling
    is: "What is the best language model we can build
    that will fit in the available memory?"
  • Relatively little research.
  • New results, up to a factor of 3 or more smaller
    than previous state of the art at the same
    perplexity.

31
Compression overview
  • Review previous techniques
  • Count cutoffs
  • Stolcke pruning
  • IBM clustering
  • Describe new techniques (Stolcke pruning +
    predictive clustering)
  • Show experimental results
  • Up to factor of 3 or more size decrease (at same
    perplexity) versus Stolcke Pruning.

32
Count cutoffs
  • Simple, commonly used technique
  • Just remove n-grams with small counts

33
Stolcke pruning
  • Consider P(City | New York) vs. P(City | York)
  • Probabilities are almost the same
  • Pruning P(City | New York) has almost no cost,
    even though C(New York City) is big.
  • Consider pruning P(lightbulb | change a): much
    more likely than P(lightbulb | a), so pruning it
    is costly (criterion sketched below)
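The slide gives only the intuition. As background (my summary of the standard entropy-based formulation, not text from the talk): each n-gram is scored by roughly how much the weighted log probability changes when it is removed and its probability is recomputed by backing off,

$$D(h, w) \;\approx\; P(h)\,P(w \mid h)\,\bigl[\log P(w \mid h) - \log P'(w \mid h)\bigr],$$

where P' is the backed-off estimate after pruning; n-grams whose score falls below a threshold are dropped. (The full criterion also accounts for the change in the history's backoff weight.)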

34
IBM clustering
  • Use P(Z|XY) × P(z|Z)
  • Don't interpolate with P(z|xy), of course
  • Model is much smaller, but higher perplexity.
  • How does it compare to count cutoffs, etc.? No one
    ever tried the comparison!

35
Predictive clustering
  • Predictive clustering: P(Z|xy) × P(z|xyZ)
  • Model is actually larger than the original P(z|xy)
  • For each original P(z|xy), we must store
    P(z|xyZ). In addition, we need P(Z|xy)
  • Normal model stores
  • P(Sunday | party on), P(Monday | party on),
    P(Tuesday | party on), ...
  • Clustered, pruned model stores
  • P(WEEKDAY | party on)
  • P(Sunday | WEEKDAY), P(Monday | WEEKDAY),
    P(Tuesday | WEEKDAY), ...

36
Experiments
37
Different Clusterings
  • Let x_j, x_k, x_l be alternate clusterings of x
  • Example
  • Tuesday_l = WEEKDAY
  • Tuesday_j = DAYS-MONTHS-TIMES
  • Tuesday_k = NOUNS
  • You can think of l, j, and k as being the number
    of clusters.
  • Example: P(z_l | xy) ≈ P(z_l | x_j y_j)

38
Different Clusterings (continued)
  • Example: P(z_l | xy) ≈ P(z_l | x_j y_j)
  • P(WEEKDAY | party on) ≈
  • P(WEEKDAY | party_j on_j) =
  • P(WEEKDAY | NOUN PREP)
  • Or
  • P(WEEKDAY | party on) ≈
  • P(WEEKDAY | party_k on_k) =
  • P(WEEKDAY | EVENT LOC-PREP)

39
Both Clustering
  • P(z_l | xy) ≈ P(z_l | x_j y_j)
  • P(z | xy z_l) ≈ P(z | x_k y_k z_l)
  • Substitute into predictive clustering:
  • P(z|xy) =
  • P(z_l | xy) × P(z | xy z_l) ≈
  • P(z_l | x_j y_j) × P(z | x_k y_k z_l)
    (written out below)
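The same decomposition in conventional notation (subscripts j, k, l name the three clusterings, exactly as on the slide):

$$P(z \mid x\,y) \;=\; P(z_l \mid x\,y)\,P(z \mid x\,y\,z_l) \;\approx\; P(z_l \mid x_j\,y_j)\,P(z \mid x_k\,y_k\,z_l)$$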

40
Example
  • P(z|xy) =
  • P(z_l | xy) × P(z | xy z_l) ≈
  • P(z_l | x_j y_j) × P(z | x_k y_k z_l)
  • P(Tuesday | party on) =
  • P(WEEKDAY | party on) × P(Tuesday | party
    on WEEKDAY) ≈
  • P(WEEKDAY | NOUN PREP) × P(Tuesday | EVENT
    LOC-PREP WEEKDAY)

41
Size reduction
  • P(z|xy) = P(z_l | xy) × P(z | xy z_l) ≈
    P(z_l | x_j y_j) × P(z | x_k y_k z_l)
  • Optimal setting for k is often very large, e.g.
    whole vocabulary.
  • Unpruned model is typically larger than
    unclustered, but smaller than predictive.
  • Pruned model is smaller than unclustered and
    smaller than predictive at same perplexity

42
Experiments
43
WSJ (English) results -- relative
44
Chinese Newswire Results (with Jianfeng Gao, MSR
Beijing)
45
Compression conclusion
  • We can achieve up to a factor of 3 or more
    reduction at the same perplexity by using Both
    Clustering combined with Stolcke pruning.
  • The model is surprising: it actually increases the
    model size and then prunes it down smaller
  • Results are similar for Chinese and English.

46
Maximum Entropy Speedups
  • Many people think Maximum Entropy is the future
    of language modeling
  • (not me anymore)
  • Allows lots of different information to be
    combined
  • Very slow to train: weeks
  • Predictive cluster models are up to 35 times
    faster to train

47
Maximum entropy overview
  • Describe what maximum entropy is
  • Explain how to train maxent models, and why it is
    slow
  • Show how predictive clustering can speed it up
  • Give experimental results showing factor of 35
    speedup.
  • Talk about application to other areas

48
Maximum Entropy Introduction
  • "I'm busy next weekend. We are having a big
    party on ..."
  • How likely is "Friday"?
  • Reasonably likely to start: 0.001
  • "weekend" occurs nearby: 2 times as likely
  • Previous word is "on": 3 times as likely
  • Previous words are "party on": 5 times as likely
  • 0.001 × 2 × 3 × 5 = 0.03
  • Need to normalize: 0.03 / Σ P(all words)

49
Maximum Entropy: what is it?
  • Product of many indicator functions (model form
    sketched below)
  • f_j is an indicator: 1 if some condition holds,
    e.g. f_j(w, w_i-2, w_i-1) = 1 if w = Friday,
    w_i-2 = party, w_i-1 = on
  • Can create bigrams, trigrams, skipping, caches,
    triggers with the right indicator functions.
  • Z is a normalization constant
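The formula on this slide is lost in the transcript; a standard way to write the product form it describes, using the λ_j weights and f_j indicators named here (my reconstruction, not necessarily the original slide's exact notation), is:

$$P(w \mid w_{i-2}, w_{i-1}) \;=\; \frac{\prod_j \lambda_j^{\,f_j(w,\,w_{i-2},\,w_{i-1})}}{Z_\lambda(w_{i-2}, w_{i-1})},
\qquad
Z_\lambda(w_{i-2}, w_{i-1}) \;=\; \sum_{w'} \prod_j \lambda_j^{\,f_j(w',\,w_{i-2},\,w_{i-1})}$$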

50
Maximum entropy training
  • How to get the λs? Iterative EM algorithm
  • Requires computing probability distribution in
    all training contexts. For each training
    context
  • Requires determining all indicators that might
    apply
  • Requires computing normalization constant
  • Note that number of indicators that can apply,
    and time to normalize are both bounded by a
    factor of vocabulary size.

51
Example: party on Tuesday
  • Consider "party on Friday"
  • We need to compute P(Friday | party on),
    P(Tuesday | party on), P(fish | party on), etc.
  • Number of trigram indicators (f_j's) that we need
    to consider is bounded by the vocabulary size
  • Number of words to normalize = vocabulary size.

52
Solution: Predictive Clustering
  • Create two separate maximum entropy models:
    P(Z|wxy) and P(z|wxyZ).
  • Imagine 10,000 word vocabulary, 100 clusters, 100
    words per cluster.
  • Time to train first model will be proportional to
    number of clusters (100)
  • Time to train second model proportional to number
    of words per cluster (100)
  • 10,000 / 200 = 50 times speedup

53
Predictive clustering example
  • Consider "party on Tuesday", P(Z|wxy)
  • We need to know P(WEEKDAY | party on),
    P(MONTH | party on), P(ANIMAL | party on), etc.
  • Number of trigram indicators (f_j's) that we need
    to consider is bounded by the number of clusters
  • Normalize only over the number of clusters

54
Predictive clustering example(continued)
  • Consider "party on Tuesday", P(z|wxyZ)
  • We need to know P(Monday | party on WEEKDAY),
    P(Tuesday | party on WEEKDAY), etc. Note that
    P(fish | party on WEEKDAY) = 0
  • Number of trigram indicators (f_j's) we need to
    consider is bounded by the number of words in the
    cluster.
  • Normalize only over the number of words in the
    cluster

55
Improvements: testing
  • May also speed up testing.
  • If running the decoder with all words, then we need
    to compute P(z|wxy) for all z, and there is no speedup.
  • If using maximum entropy as a postprocessing
    step, on a lattice or n-best list, may still lead
    to speedups, since only need to compute a few zs
    for each context wxy.

56
Maximum entropy results
57
Maximum entropy conclusions
  • At 10,000 words of training data, predictive
    clustering hurts a little
  • At any larger size they help.
  • Amount they help increases as training data size
    increases.
  • Triple predictive gives a factor of 35 over fast
    unigram at 10,000,000 words training
  • Perplexity actually decreases slightly, even with
    faster training!

58
Overall conclusion: Predictive clustering →
smaller, faster
  • Clustering is a well-known technique
  • Smaller: new ways of using clustering to reduce
    language model size; up to 50% reduction in size
    at the same perplexity.
  • Faster: new ways of speeding up training for
    maximum entropy models.

59
Speedup applied to other areas
  • Can apply to any problem with many outputs, not
    just words
  • Example: collaborative filtering tasks
  • This speedup can be used with most machine
    learning algorithms applied to problems with many
    outputs
  • Examples: neural networks, decision trees

60
Neural Networks
  • Imagine a neural network with a large number of
    outputs (10,000)
  • Requires backpropagating one 1 and 9,999 0s
    (two-stage sketch below)
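A hedged sketch of how the same two-stage idea could look for a network like this: predict the cluster from the hidden layer, then the word within that cluster, so each example touches only one small output block instead of all 10,000 outputs. Sizes, weights, and names are illustrative assumptions.

```python
# Two-stage (class-factored) output layer:
# P(word | h) = P(cluster | h) * P(word | h, cluster).
import numpy as np

rng = np.random.default_rng(0)
hidden, n_clusters, words_per_cluster = 64, 100, 100   # 10,000 words total

W_cluster = rng.normal(scale=0.01, size=(hidden, n_clusters))
W_word = rng.normal(scale=0.01, size=(n_clusters, hidden, words_per_cluster))

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def two_stage_prob(h, cluster_id, word_in_cluster):
    p_cluster = softmax(h @ W_cluster)[cluster_id]              # P(cluster | h)
    p_word = softmax(h @ W_word[cluster_id])[word_in_cluster]   # P(word | h, cluster)
    return p_cluster * p_word

h = rng.normal(size=hidden)   # some hidden-layer activation
print(two_stage_prob(h, cluster_id=3, word_in_cluster=42))
```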

61
Maximum Entropy training: Inner loop
  • For each word w in vocabulary
  • P_w ← 1
  • next w
  • For each non-zero f_j
  • P_w ← P_w × λ_j
  • next j
  • z ← Σ_w P_w
  • For each word w in vocabulary
  • observed_w ← observed_w + P_w / z
  • next w
  • (a runnable sketch of this loop follows)
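A small runnable version of the loop above, as I read it; the data layout (a dict of active feature ids per word) and the accumulation into expected counts are my assumptions about how the pseudocode is meant to be used.

```python
def inner_loop(vocab, active_features, lam, observed):
    """One training context: accumulate the model's expectation P(w | context).

    active_features: dict word -> list of feature ids whose indicators fire
                     for (word, context)
    lam:             dict feature id -> multiplicative weight lambda_j
    observed:        dict word -> running expected count (updated in place)
    """
    p = {w: 1.0 for w in vocab}                 # P_w <- 1
    for w, feats in active_features.items():    # for each non-zero f_j
        for j in feats:
            p[w] *= lam[j]                      # P_w <- P_w * lambda_j
    z = sum(p.values())                         # normalization constant
    for w in vocab:
        observed[w] = observed.get(w, 0.0) + p[w] / z
    return observed

# Toy usage with a three-word vocabulary and two active features.
counts = inner_loop(["Friday", "Tuesday", "fish"],
                    active_features={"Friday": [0], "Tuesday": [1]},
                    lam={0: 3.0, 1: 2.0},
                    observed={})
print(counts)   # {'Friday': 0.5, 'Tuesday': 0.333..., 'fish': 0.166...}
```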