Title: A Bit of Progress in Language Modeling (Extended Version)
1. A Bit of Progress in Language Modeling (Extended Version)
- Presented by Louis-Tsai
- Speech Lab, CSIE, NTNU
- louis_at_csie.ntnu.edu.tw
2. Introduction: Overview
- Language modeling (LM) is the art of determining the probability of a sequence of words
- Applications: speech recognition, optical character recognition, handwriting recognition, machine translation, spelling correction
- Improvements covered:
- Higher-order n-grams
- Skipping models
- Clustering
- Caching
- Sentence-mixture models
3. Introduction: Technique Introductions
- The goal of a LM is to determine the probability of a word sequence w_1...w_n, P(w_1...w_n)
- Trigram assumption: P(w_1...w_n) ≈ ∏_{i=1}^{n} P(w_i | w_{i-2} w_{i-1})
4. Introduction: Technique Introductions
- C(w_{i-2} w_{i-1} w_i) represents the number of occurrences of w_{i-2} w_{i-1} w_i in the training corpus, and similarly for C(w_{i-2} w_{i-1}); the maximum-likelihood estimate is P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})
- There are many three-word sequences that never occur; consider the sequence "party on Tuesday": what is P(Tuesday | party on)?
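A minimal sketch (not from the slides) of the maximum-likelihood trigram estimate described above, on a toy corpus; all names and data here are illustrative:

```python
# Maximum-likelihood trigram estimate: P(w | w2 w1) = C(w2 w1 w) / C(w2 w1).
from collections import Counter

corpus = "party on monday party on wednesday".split()   # toy data

trigram_counts = Counter()
context_counts = Counter()
for i in range(2, len(corpus)):
    trigram_counts[tuple(corpus[i - 2:i + 1])] += 1
    context_counts[tuple(corpus[i - 2:i])] += 1

def p_mle(w, w2, w1):
    """Returns 0 for unseen trigrams, which is exactly what smoothing must fix."""
    c_ctx = context_counts[(w2, w1)]
    return trigram_counts[(w2, w1, w)] / c_ctx if c_ctx else 0.0

print(p_mle("monday", "party", "on"))   # 0.5
print(p_mle("tuesday", "party", "on"))  # 0.0 -> motivates smoothing
```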
5. Introduction: Smoothing
- The training corpus might not contain any instances of the phrase, so C(party on Tuesday) would be 0, while there might still be 20 instances of the phrase "party on"; then P(Tuesday | party on) = 0
- Smoothing techniques take some probability away from some occurrences
- Imagine that "party on Stan Chen's birthday" occurs in the training data, and occurs only one time
6. Introduction: Smoothing
- By taking some probability away from some words, such as "Stan", and redistributing it to other words, such as "Tuesday", zero probabilities can be avoided
- Examples: Katz smoothing, Jelinek-Mercer smoothing (deleted interpolation), Kneser-Ney smoothing
7. Introduction: Higher-Order n-grams
- The most obvious extension to trigram models is to simply move to higher-order n-grams, such as four-grams and five-grams
- There is a significant interaction between smoothing and n-gram order: higher-order n-grams work better with Kneser-Ney smoothing than with some other methods, especially Katz smoothing
8. Introduction: Skipping
- We condition on a different context than the previous two words
- For example, instead of computing P(w_i | w_{i-2} w_{i-1}), we compute P(w_i | w_{i-3} w_{i-2})
9. Introduction: Clustering
- Clustering (classing) models attempt to make use of the similarities between words
- If we have seen occurrences of phrases like "party on Monday" and "party on Wednesday", then we might imagine that the word "Tuesday" is also likely to follow the phrase "party on"
10. Introduction: Caching
- Caching models make use of the observation that
if you use a word, you are likely to use it again
11. Introduction: Sentence Mixture
- Sentence Mixture models make use of the
observation that there are many different
sentence types, and that making models for each
type of sentence may be better than using one
global model
12. Introduction: Evaluation
- A LM that assigned equal probability to 100 words
would have perplexity 100
13. Introduction: Evaluation
- In general, the perplexity of a LM is equal to the geometric average of the inverse probability of the words measured on test data:
  Perplexity = ( ∏_{i=1}^{n} 1 / P(w_i | w_1...w_{i-1}) )^{1/n}
14. (No transcript)
15. Introduction: Evaluation
- The true model for any data source will have the lowest possible perplexity
- The lower the perplexity of our model, the closer it is, in some sense, to the true model
- Entropy is simply log2 of perplexity
- Entropy is the average number of bits per word that would be necessary to encode the test data using an optimal coder
16. Introduction: Evaluation
- A reduction in entropy from 5 bits to 4 bits corresponds to a reduction in perplexity from 32 to 16, i.e. a 50% reduction
- A reduction in entropy from 5 bits to 4.5 bits corresponds to a reduction in perplexity from 32 to about 22.6, i.e. a 29.3% reduction
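A short sketch (not from the slides) that computes perplexity as the geometric average of the inverse word probabilities and entropy as its log2, and reproduces the percentage reductions quoted above:

```python
# perplexity = geometric mean of the inverse word probabilities,
# entropy = log2(perplexity), measured in bits per word.
import math

def entropy_and_perplexity(word_probs):
    """word_probs: P(w_i | history) assigned by the model to each test word."""
    n = len(word_probs)
    entropy = -sum(math.log2(p) for p in word_probs) / n   # bits per word
    perplexity = 2 ** entropy                              # geometric-mean inverse prob
    return entropy, perplexity

# A model assigning every word probability 1/100 has perplexity 100:
print(entropy_and_perplexity([1 / 100] * 20))   # (~6.64 bits, 100.0)

# The reductions quoted on the slide:
print(2 ** 5, 2 ** 4, 1 - 2 ** 4 / 2 ** 5)      # 32 16 0.5    -> 50% reduction
print(2 ** 5, 2 ** 4.5, 1 - 2 ** 4.5 / 2 ** 5)  # 32 ~22.6 ~0.293 -> 29.3% reduction
```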
17. Introduction: Evaluation
- Experiment corpus: 1996 NAB
- Experiments performed at 4 different training data sizes: 100K words, 1M words, 10M words, 284M words
- Heldout and test data taken from the 1994 WSJ
- Heldout data: 20K words
- Test data: 20K words
- Vocabulary: 58,546 words
18. Smoothing: Simple Interpolation
- P_interp(w_i | w_{i-2} w_{i-1}) = λ P_MLE(w_i | w_{i-2} w_{i-1}) + μ P_MLE(w_i | w_{i-1}) + (1 - λ - μ) P_MLE(w_i), where 0 ≤ λ, μ ≤ 1
- In practice, the uniform distribution is also interpolated; this ensures that no word is assigned probability 0
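A minimal sketch of the interpolation just described, including the uniform floor; the weights are illustrative placeholders, and tri/bi/uni are assumed to be Counter-style count tables:

```python
# Simple linear interpolation of trigram, bigram, unigram, and uniform estimates.
def p_interp(w, w2, w1, tri, bi, uni, total, vocab_size,
             lambdas=(0.5, 0.3, 0.15, 0.05)):
    """lambdas weight the trigram, bigram, unigram, and uniform components."""
    l3, l2, l1, l0 = lambdas                       # should sum to 1
    p_tri = tri.get((w2, w1, w), 0) / bi[(w2, w1)] if bi.get((w2, w1)) else 0.0
    p_bi  = bi.get((w1, w), 0) / uni[w1] if uni.get(w1) else 0.0
    p_uni = uni.get(w, 0) / total
    p_uniform = 1.0 / vocab_size                   # guarantees every word gets P > 0
    return l3 * p_tri + l2 * p_bi + l1 * p_uni + l0 * p_uniform
```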
19. Smoothing: Katz Smoothing
- Katz smoothing is based on the Good-Turing formula
- Let n_r represent the number of n-grams that occur r times
- The Good-Turing discounted count is r* = (r + 1) n_{r+1} / n_r
20. Smoothing: Katz Smoothing
- The counts removed by discounting add up to n_1: Σ_{r≥1} ( r n_r - (r + 1) n_{r+1} ) = n_1
- Let N represent the total size of the training set; this left-over probability will be equal to n_1 / N
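A sketch (illustrative, not the paper's implementation) of the Good-Turing quantities above: the discounted counts r* and the left-over mass n_1/N:

```python
# Good-Turing discounting, the basis of Katz smoothing:
# r* = (r + 1) * n_{r+1} / n_r, and the left-over probability mass is n_1 / N.
from collections import Counter

def good_turing_discounts(ngram_counts):
    """ngram_counts: dict mapping each observed n-gram to its count r."""
    n_r = Counter(ngram_counts.values())        # n_r = number of n-grams seen r times
    N = sum(r * count for r, count in n_r.items())   # total training tokens covered

    discounted = {}
    for gram, r in ngram_counts.items():
        if n_r[r + 1] > 0:                      # in practice Katz only discounts small r
            discounted[gram] = (r + 1) * n_r[r + 1] / n_r[r]
        else:
            discounted[gram] = r                # no reliable estimate; leave count alone
    leftover_mass = n_r[1] / N                  # probability reserved for unseen n-grams
    return discounted, leftover_mass
```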
21. Smoothing: Katz Smoothing
- Consider a bigram model of a phrase such as P_Katz(Francisco | on). Since the phrase "San Francisco" is fairly common, the unigram probability of "Francisco" will also be fairly high
- This means that, using Katz smoothing, the backed-off probability P_Katz(Francisco | on) will also be fairly high. But the word "Francisco" occurs in exceedingly few contexts, and its probability of occurring in a new one is very low
22. Smoothing: Kneser-Ney Smoothing
- KN smoothing uses a modified backoff distribution based on the number of contexts each word occurs in, rather than the number of occurrences of the word. Thus, the probability P_KN(Francisco | on) would be fairly low, while for a word like "Tuesday" that occurs in many contexts, P_KN(Tuesday | on) would be relatively high, even if the phrase "on Tuesday" did not occur in the training data
23. Smoothing: Kneser-Ney Smoothing
- Backoff Kneser-Ney smoothing (bigram form):
  P_BKN(w_i | w_{i-1}) = (C(w_{i-1} w_i) - D) / C(w_{i-1})   if C(w_{i-1} w_i) > 0
  P_BKN(w_i | w_{i-1}) = α(w_{i-1}) · |{v : C(v w_i) > 0}| / Σ_w |{v : C(v w) > 0}|   otherwise
- where |{v : C(v w_i) > 0}| is the number of words v that w_i can occur in the context of, D is the discount, and α is a normalization constant such that the probabilities sum to 1
24. Smoothing: Kneser-Ney Smoothing
- (Worked example over the toy vocabulary V = {a, b, c, d}; the example table was not preserved in the transcript)
25. Smoothing: Kneser-Ney Smoothing
- Interpolated models always combine both the higher-order and the lower-order distribution
- Interpolated Kneser-Ney smoothing (bigram form):
  P_IKN(w_i | w_{i-1}) = (C(w_{i-1} w_i) - D) / C(w_{i-1}) + λ(w_{i-1}) · |{v : C(v w_i) > 0}| / Σ_w |{v : C(v w) > 0}|
  where λ(w_{i-1}) is a normalization constant such that the probabilities sum to 1
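A compact sketch of interpolated Kneser-Ney as written above, for bigrams only; the discount value and data structures are illustrative, and the lower-order distribution uses context counts rather than occurrence counts:

```python
# Interpolated Kneser-Ney for bigrams: discounted MLE plus a continuation
# distribution based on how many distinct contexts each word appears in.
from collections import Counter, defaultdict

def train_interpolated_kn(bigrams, D=0.75):
    c_bi = Counter(bigrams)                            # C(w_{i-1} w_i)
    c_w1 = Counter(w1 for w1, _ in bigrams)            # C(w_{i-1})
    followers = defaultdict(set)                       # distinct words seen after w_{i-1}
    contexts = defaultdict(set)                        # distinct left contexts of w_i
    for w1, w2 in bigrams:
        followers[w1].add(w2)
        contexts[w2].add(w1)
    total_contexts = sum(len(s) for s in contexts.values())

    def p_kn(w2, w1):
        p_cont = len(contexts[w2]) / total_contexts    # |{v : C(v w2) > 0}| / total
        if c_w1[w1] == 0:                              # unseen context: continuation prob only
            return p_cont
        lam = D * len(followers[w1]) / c_w1[w1]        # lambda(w_{i-1}): mass freed by discounting
        return max(c_bi[(w1, w2)] - D, 0) / c_w1[w1] + lam * p_cont

    return p_kn

# p = train_interpolated_kn([("party", "on"), ("on", "tuesday"), ("on", "monday")])
# p("tuesday", "on")
```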
26. Smoothing: Kneser-Ney Smoothing
- Multiple discounts, one for one counts, another for two counts, and another for three or more counts. But this has too many parameters
- Modified Kneser-Ney smoothing
27. Smoothing: Jelinek-Mercer Smoothing
- Combines different n-gram orders by linearly interpolating all three models whenever computing the trigram probability
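The interpolation weights are typically optimized on held-out data; the sketch below uses the classic EM re-estimation recipe for deleted interpolation, which is an illustrative choice rather than necessarily the search procedure used in the paper:

```python
# EM-style re-estimation of Jelinek-Mercer interpolation weights on held-out data.
def em_weights(heldout, component_probs, n_iters=20, k=3):
    """heldout: list of (w, w2, w1) events from held-out text.
    component_probs(w, w2, w1): returns the k component probabilities
    (e.g. trigram, bigram, unigram MLE) for that event."""
    lambdas = [1.0 / k] * k
    for _ in range(n_iters):
        expected = [0.0] * k
        for (w, w2, w1) in heldout:
            ps = component_probs(w, w2, w1)                  # length-k list
            mix = sum(l * p for l, p in zip(lambdas, ps)) or 1e-12
            for j in range(k):
                expected[j] += lambdas[j] * ps[j] / mix      # posterior weight of component j
        total = sum(expected)
        lambdas = [e / total for e in expected]              # re-normalise
    return lambdas
```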
28. Smoothing: Absolute Discounting
- Absolute discounting: subtract a fixed discount D < 1 from each nonzero count, e.g. (for bigrams) P_abs(w_i | w_{i-1}) = (C(w_{i-1} w_i) - D) / C(w_{i-1}) + λ(w_{i-1}) P(w_i)
29. Witten-Bell Discounting
- Key concept (things seen once): use the count of things you've seen once to help estimate the count of things you've never seen
- So we estimate the total probability mass of all the zero N-grams with the number of types divided by the number of tokens plus observed types: T / (N + T), where N is the number of tokens and T the number of observed types
30. Witten-Bell Discounting
- T / (N + T) gives the total probability of unseen N-grams; we need to divide this up among all the zero-count N-grams
- We could just choose to divide it equally: each zero-count N-gram gets probability T / (Z (N + T)), where Z is the total number of N-grams with count zero
31. Witten-Bell Discounting
- Alternatively, we can represent the smoothed counts directly as
  c_i* = (T / Z) · N / (N + T)   if c_i = 0
  c_i* = c_i · N / (N + T)       if c_i > 0
32. Witten-Bell Discounting
33. Witten-Bell Discounting
- For bigrams: T is the number of bigram types and N is the number of bigram tokens
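A sketch of the Witten-Bell smoothed counts written above; the function names and the num_possible argument (used to derive Z) are illustrative assumptions:

```python
# Witten-Bell discounting: the unseen mass T/(N+T) is shared equally among the
# Z zero-count n-grams, and seen counts are scaled by N/(N+T).
def witten_bell(counts, num_possible):
    """counts: dict of observed n-gram counts.
    num_possible: number of n-gram types that could occur, so Z = num_possible - T."""
    N = sum(counts.values())              # tokens
    T = len(counts)                       # observed types
    Z = num_possible - T                  # n-grams with count zero

    def smoothed_count(gram):
        if gram in counts:
            return counts[gram] * N / (N + T)
        return (T / Z) * (N / (N + T)) if Z else 0.0   # equal share of the unseen mass

    def prob(gram):
        return smoothed_count(gram) / N   # smoothed counts still sum to N
    return prob
```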
34. (Figure; roughly 20 words per sentence)
35. Higher-Order n-grams
- Trigram P(w_i | w_{i-2} w_{i-1}) → five-gram P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1})
- In many cases, no sequence of the form w_{i-4} w_{i-3} w_{i-2} w_{i-1} will have been seen in the training data, so we back off to or interpolate with four-grams, trigrams, bigrams, or even unigrams
- But in those cases where such a long sequence has been seen, it may be a good predictor of w_i
36. (Figure: results by n-gram order and training size; values shown include 0.06, 0.02, and 0.01 bits for the 284,000,000-word training set)
37. Higher-Order n-grams
- As we can see, the behavior of Katz smoothing is very different from the behavior of KN smoothing; the main cause of this difference lies in backoff smoothing techniques, such as Katz smoothing, or even the backoff version of KN smoothing
- Backoff smoothing techniques work poorly on low counts, especially one counts, and as the n-gram order increases, the number of one counts increases
38. Higher-Order n-grams
- Katz smoothing has its best performance around the trigram level, and actually gets worse as this level is exceeded
- KN smoothing is essentially monotonic even through 20-grams
- The plateau point for KN smoothing depends on the amount of training data available: small (100,000 words) at the trigram level; full (284 million words) at 5-grams to 7-grams
- (The 6-gram is .02 bits better than the 5-gram; the 7-gram is .01 bits better than the 6-gram)
39. Skipping
- When considering a 5-gram context, there are many subsets of the 5-gram we could consider, such as P(w_i | w_{i-4} w_{i-3} w_{i-1}) or P(w_i | w_{i-4} w_{i-2} w_{i-1})
- If we have never seen "Show John a good time" but we have seen "Show Stan a good time", a normal 5-gram predicting P(time | show John a good) would back off to P(time | John a good) and from there to P(time | a good), which would have a relatively low probability
- A skipping model of the form P(w_i | w_{i-4} w_{i-2} w_{i-1}) would assign high probability to P(time | show ____ a good)
40. Skipping
- These skipping 5-grams are then interpolated with a normal 5-gram, forming models such as
  λ P(w_i | w_{i-4} w_{i-3} w_{i-2} w_{i-1}) + μ P(w_i | w_{i-4} w_{i-3} w_{i-1}) + (1 - λ - μ) P(w_i | w_{i-4} w_{i-2} w_{i-1})
  where 0 ≤ λ ≤ 1, 0 ≤ μ ≤ 1, and 0 ≤ (1 - λ - μ) ≤ 1
- Another (and more traditional) use for skipping is as a sort of poor man's higher-order n-gram. One can, for instance, create a model of the form
  λ P(w_i | w_{i-2} w_{i-1}) + μ P(w_i | w_{i-3} w_{i-1}) + (1 - λ - μ) P(w_i | w_{i-3} w_{i-2})
  in which no component probability depends on more than two previous words, but the overall probability is 4-gram-like, since it depends on w_{i-3}, w_{i-2}, and w_{i-1}
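A minimal sketch of the poor man's 4-gram just described; pair_prob stands in for any smoothed two-word-context estimator, and the weights are placeholders:

```python
# Poor man's 4-gram via skipping: interpolate P(w | x y), P(w | w_3 y), P(w | w_3 x),
# where w_3 = w_{i-3}, x = w_{i-2}, y = w_{i-1}.
def p_skip(word, w3, w2, w1, pair_prob, lambdas=(0.5, 0.3, 0.2)):
    """pair_prob(word, a, b): a smoothed estimate of P(word | a b)."""
    l_xy, l_wy, l_wx = lambdas                     # should sum to 1
    return (l_xy * pair_prob(word, w2, w1) +       # normal trigram context   (x y)
            l_wy * pair_prob(word, w3, w1) +       # skip the 2-back word     (w_3 _ y)
            l_wx * pair_prob(word, w3, w2))        # skip the 1-back word     (w_3 x _)
```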
41. Skipping
- For the 5-gram skipping experiments, all contexts depended on at most the previous four words, w_{i-4}, w_{i-3}, w_{i-2}, and w_{i-1}, but used the four words in a variety of ways
- For readability and conciseness, we define v = w_{i-4}, w = w_{i-3}, x = w_{i-2}, y = w_{i-1}
42. (No transcript)
43. Skipping
- The first model interpolated dependencies on vw_y and v_xy; it does not work well on the smallest training data size, but is competitive for larger ones
- In the second model, we add vwx_ to the first model, gaining roughly .02 to .04 bits over it
- Next, we add back in the dependencies on the missing words: xvwy, wvxy, and yvwx; that is, all models depend on the same variables, but with the interpolation order modified
- e.g., by xvwy we refer to a model of the form P(z | vwxy) interpolated with P(z | vw_y) interpolated with P(z | w_y) interpolated with P(z | y) interpolated with P(z)
44. Skipping
- Interpolating together vwyx, vxyw, and wxyv (based on vwxy): this model puts each of the four preceding words in the last position for one component. It does not work as well as the previous two, leading us to conclude that the y word is by far the most important
45. Skipping
- Interpolating together vwyx, vywx, and yvwx, which put the y word in each possible position in the backoff model: this was overall the worst model, reconfirming the intuition that the y word is critical
- Finally, we interpolate together vwyx, vxyw, wxyv, vywx, yvwx, xvwy, and wvxy: the result is a marginal gain of less than 0.01 bits over the best previous model
46. (No transcript)
47. Skipping
- 1-back word (y): xy, wy, vy, uy, and ty
- 4-gram level: xy, wy, and wx
- The improvement over 4-gram pairs was still marginal
48. Clustering
- Consider a probability such as P(Tuesday | party on)
- Perhaps the training data contains no instances of the phrase "party on Tuesday", although other phrases such as "party on Wednesday" and "party on Friday" do appear
- We can put words into classes, such as the word "Tuesday" into the class WEEKDAY
- P(Tuesday | party on WEEKDAY)
49. Clustering
- When each word belongs to only one class, which is called hard clustering, this decomposition is a strict equality, a fact that can be trivially proven. Let W_i represent the cluster of word w_i:
  P(w_i | w_{i-2} w_{i-1}) = P(W_i | w_{i-2} w_{i-1}) × P(w_i | w_{i-2} w_{i-1} W_i)   (1)
50. Clustering
- Since each word belongs to a single cluster, P(W_i | w_i) = 1, and therefore P(w_i W_i) = P(w_i)   (2)
- Combining (2) with (1):
  P(W_i | w_{i-2} w_{i-1}) × P(w_i | w_{i-2} w_{i-1} W_i) = P(w_{i-2} w_{i-1} w_i W_i) / P(w_{i-2} w_{i-1}) = P(w_i | w_{i-2} w_{i-1})   (3)
- This decomposition is called predictive clustering
51. Clustering
- Another type of clustering we can do is to cluster the words in the contexts. For instance, if "party" is in the class EVENT and "on" is in the class PREPOSITION, then we could write
  P(Tuesday | party on) ≈ P(Tuesday | EVENT PREPOSITION)
  or, more generally,
  P(w_i | w_{i-2} w_{i-1}) ≈ P(w_i | W_{i-2} W_{i-1})   (4)
- Combining (4) with (3) we get
  P(w_i | w_{i-2} w_{i-1}) ≈ P(W_i | W_{i-2} W_{i-1}) × P(w_i | W_{i-2} W_{i-1} W_i)   (5)
  which is called fullibm clustering
52. Clustering
- Use the approximation P(w_i | W_{i-2} W_{i-1} W_i) ≈ P(w_i | W_i) to get
  P(w_i | w_{i-2} w_{i-1}) ≈ P(W_i | W_{i-2} W_{i-1}) × P(w_i | W_i)   (6)
  which is called ibm clustering
- Since fullibm clustering uses more information than ibm clustering, we assumed that it would lead to improvements (goodibm)
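A sketch of the ibm-clustering decomposition in Eq. (6); the cluster map and the two component estimators (which would themselves be smoothed models) are assumed to be given and are named here only for illustration:

```python
# ibm clustering: P(w | w2 w1) ~= P(W | W2 W1) * P(w | W), where W is the cluster of w.
def p_ibm_cluster(word, w2, w1, cluster_of, p_cluster_trigram, p_word_given_cluster):
    W, W2, W1 = cluster_of[word], cluster_of[w2], cluster_of[w1]
    return p_cluster_trigram(W, W2, W1) * p_word_given_cluster(word, W)

# Example with hypothetical clusters:
# cluster_of = {"Tuesday": "WEEKDAY", "party": "EVENT", "on": "PREPOSITION"}
# p_ibm_cluster("Tuesday", "party", "on", cluster_of, p_cluster_tri, p_word_cluster)
```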
53. Clustering
- Index clustering uses the clusters as additional context, e.g.
  P(Tuesday | party EVENT on PREPOSITION)   (7)
- Backoff/interpolation goes from P(Tuesday | party EVENT on PREPOSITION) to P(Tuesday | EVENT on PREPOSITION) to P(Tuesday | on PREPOSITION) to P(Tuesday | PREPOSITION) to P(Tuesday); since each word belongs to a single cluster, some of this context information is redundant
54. Clustering
- C(party EVENT on PREPOSITION) = C(party on), and C(EVENT on PREPOSITION) = C(EVENT on)
- We generally write an index clustered model with both the words and their clusters in the context, as in (7)
- fullibmpredict clustering
55. Clustering
- indexpredict: combining index and predictive clustering
- combinepredict: interpolating a normal trigram with a predictive clustered trigram
56. Clustering
- allcombinenotop: an interpolation of a normal trigram, a fullibm-like model, an index model, a predictive model, a true fullibm model, and an indexpredict model
57. Clustering
- allcombine: interpolates the predict-type models first at the cluster level, before interpolating with the word-level model (same components: normal trigram, fullibm-like, index, predictive, true fullibm, indexpredict)
58. (Figure: baseline results)
59. Clustering
- The value of clustering decreases as training data increases, since clustering is a technique for dealing with data sparseness
- ibm clustering consistently works very well
60. (No transcript)
61. Clustering
- In Fig. 6 we show a comparison of several techniques using Katz smoothing and the same techniques with KN smoothing. The results are similar, with some interesting exceptions:
- indexpredict works well for the KN-smoothed model, but very poorly for the Katz-smoothed model
- This shows that smoothing can have a significant effect on other techniques, such as clustering
62. Other Ways to Perform Clustering
- Cluster groups of words instead of individual words, and compute probabilities conditioned on these multi-word clusters
- For instance, in a trigram model, one could cluster contexts like "New York" and "Los Angeles" as CITY, and "on Wednesday" and "late tomorrow" as TIME
63. Finding Clusters
- There is no need for the clusters used for different positions to be the same
- ibm clustering: P(w_i | W_i) P(W_i | W_{i-2} W_{i-1}); W_i is a predictive cluster, while W_{i-1} and W_{i-2} are conditional clusters
- The predictive and conditional clusters can be different. Consider the words "a" and "an": in general, "a" and "an" can follow the same words and so, for predictive clustering, belong in the same cluster. But there are very few words that can follow both "a" and "an", so for conditional clustering they belong in different clusters
64. Finding Clusters
- The clusters are found automatically using a tool that attempts to minimize perplexity
- For the conditional clusters, we try to minimize the perplexity of the training data for a bigram of the form P(w_i | W_{i-1}), which is equivalent to maximizing ∏_{i=1}^{N} P(w_i | W_{i-1})
65. Finding Clusters
- For the predictive clusters, we try to minimize the perplexity of the training data of P(W_i | w_{i-1}) × P(w_i | W_i)
- P(W_i | w_{i-1}) × P(w_i | W_i)
  = [ P(w_{i-1} | W_i) P(W_i) / P(w_{i-1}) ] × [ P(w_i) P(W_i | w_i) / P(W_i) ]
  = P(w_{i-1} | W_i) × P(w_i) / P(w_{i-1})   (since P(W_i | w_i) = 1)
- Since P(w_i) / P(w_{i-1}) does not depend on the clustering, this is equivalent to maximizing ∏ P(w_{i-1} | W_i)
66. Caching
- If a speaker uses a word, it is likely that he will use the same word again in the near future
- We could form a smoothed bigram or trigram from the previous words, and interpolate this with the standard trigram:
  P_cache(w_i | w_1...w_{i-1}) = λ P_smooth(w_i | w_{i-2} w_{i-1}) + (1 - λ) P_tricache(w_i | w_1...w_{i-1})
  where P_tricache(w_i | w_1...w_{i-1}) is a simple interpolated trigram model, using counts from the preceding words in the same document
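A simplified sketch of the caching idea: the slides interpolate the static trigram with bigram/trigram caches and conditional weights, while the version below uses just a unigram cache to show the mechanism (class and parameter names are illustrative):

```python
# Cache-augmented model: interpolate the static trigram with a unigram cache
# built from the words already seen in the current document.
from collections import Counter

class CachedLM:
    def __init__(self, p_static, lam=0.9):
        self.p_static = p_static          # smoothed trigram P(w | w2 w1)
        self.lam = lam                    # weight on the static model
        self.history = Counter()          # counts of words seen so far in the document

    def prob(self, word, w2, w1):
        n = sum(self.history.values())
        p_cache = self.history[word] / n if n else 0.0
        return self.lam * self.p_static(word, w2, w1) + (1 - self.lam) * p_cache

    def observe(self, word):
        self.history[word] += 1           # update the cache after each word
```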
67. Caching
- When interpolating three probabilities P1(w), P2(w), and P3(w), rather than use
  λ P1(w) + μ P2(w) + (1 - λ - μ) P3(w)
  we actually use
  ( λ P1(w) + μ P2(w) + ν P3(w) ) / (λ + μ + ν)
- This allows us to simplify the constraints of the search
68. Caching
- Conditional caching: weight the trigram cache differently depending on whether or not we have previously seen the context
69. Caching
- Assume that the more data we have in a cache, the more useful that cache is. Thus we make λ, μ, and ν linear functions of the amount of data in the cache
- We always set the maxwordsweight parameter to at or near 1,000,000 while assigning the multiplier parameter a small value (100 or less)
70. Caching
- Finally, we can try conditionally combining
unigram, bigram, and trigram caches
71. (No transcript)
72. Caching
- As can be seen, caching is potentially one of the most powerful techniques we can apply, leading to performance improvements of up to 0.6 bits on small data. Even on large data, the improvement is still substantial, up to 0.23 bits
- At all data sizes, the n-gram caches perform substantially better than the unigram cache, but which version of the n-gram cache is used appears to make only a small difference
73. Caching
- It should be noted that all of these results assume that the previous words are known exactly
- In a speech recognition system, it is possible for a cache to lock in errors: if "recognize speech" is misrecognized as "wreck a nice beach", then later "speech recognition" may be misrecognized as "beach wreck ignition", since the probability of "beach" will have been significantly raised
74. Sentence Mixture Models
- There may be several different sentence types within a corpus; these types could be grouped by topic, or style, or some other criterion
- In WSJ data, we might assume that there are three types: financial market sentences (with a great deal of numbers and stock names), business sentences (promotions, demotions, mergers), and general news stories
- Of course, in general, we do not know the sentence type until we have heard the sentence. Therefore, instead, we treat the sentence type as a hidden variable
75. Sentence Mixture Models
- Let s_j denote the condition that the sentence under consideration is a sentence of type j. Then the probability of the sentence, given that it is of type j, can be written as ∏_{i=1}^{n} P(w_i | w_{i-2} w_{i-1} s_j)
- Let s_0 be a special context that is always true
- Let there be S different sentence types (4 ≤ S ≤ 8), and let σ_0 ... σ_S be sentence interpolation parameters such that Σ_{j=0}^{S} σ_j = 1
76. Sentence Mixture Models
- The overall probability of a sentence w_1...w_n is
  P(w_1...w_n) = Σ_{j=0}^{S} σ_j ∏_{i=1}^{n} P(w_i | w_{i-2} w_{i-1} s_j)   (8)
- Eq. (8) can be read as saying that there is a hidden variable, the sentence type; the prior probability for each sentence type is σ_j
- The probabilities P(w_i | w_{i-2} w_{i-1} s_j) may suffer from data sparsity, so they are linearly interpolated with the global model P(w_i | w_{i-2} w_{i-1})
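A sketch of Eq. (8): the sentence probability is a mixture over sentence types; the per-type models (already interpolated with the global model, as the slide notes) and the priors σ_j are assumed to be given:

```python
# Sentence mixture model, Eq. (8):
# P(w_1..w_n) = sum_j sigma_j * prod_i P(w_i | w_{i-2} w_{i-1}, s_j)
def sentence_mixture_prob(sentence, sigmas, type_models):
    """sentence: list of words with two dummy start tokens already prepended;
    sigmas[j]: prior of sentence type j;
    type_models[j](w, w2, w1): P(w | w2 w1, s_j), already smoothed/interpolated."""
    total = 0.0
    for sigma, p_j in zip(sigmas, type_models):
        prod = 1.0
        for i in range(2, len(sentence)):
            prod *= p_j(sentence[i], sentence[i - 2], sentence[i - 1])
        total += sigma * prod             # marginalise out the hidden sentence type
    return total
```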
77. Sentence Mixture Models
- Sentence types for the training data were found by using the same clustering program used for clustering words; in this case, we tried to minimize the sentence-cluster unigram perplexities
- Let s(i) represent the sentence type assigned to the sentence that word i is part of (all words in a given sentence are assigned to the same type)
- We tried to put sentences into clusters in such a way that ∏_{i=1}^{N} P(w_i | s(i)) was maximized
78. (Figure: relationship between training data size, n-gram order, and number of types)
79. (Figure; values 0.08 and 0.12 bits)
80. Sentence Mixture Models
- Note that we don't trust the results for 128 mixtures. With 128 sentence types, there are 773 parameters, and the system may not have had enough heldout data to accurately estimate the parameters
- Ideally, we would run this experiment with a larger heldout set, but it already required 5.5 days with 20,000 words, so this is impractical
81. Sentence Mixture Models
- We suspected that sentence mixture models would be more useful on larger training data sizes: with 100,000 words the gain is only .1 bits, while with 284,000,000 words it's nearly .3 bits
- This bodes well for the future of sentence mixture models: as computers get faster and larger, training data sizes should also increase
82. Sentence Mixture Models
- Both 5-grams and sentence mixture models attempt to model long-distance dependencies, so the improvement from their combination would be less than the sum of the individual improvements
- In Fig. 8, for 100,000 and 1,000,000 words, the difference between the trigram and the 5-gram is very small, so the question is not very important
- For 10,000,000 words and all training data, there is some negative interaction: approximately one third of the improvement seems to be correlated
83. Combining Techniques
- Combining techniques: we interpolate this clustered trigram with a normal 5-gram
84. Combining Techniques
- We interpolate the sentence-specific 5-gram model with the global 5-gram model, the three skipping models, and the two cache models
85. Combining Techniques
- Next, we define the analogous function for
predicting words given clusters
86. Combining Techniques
- Now, we can write out our full probability model, Eq. (9)
87. (No transcript)
88. (No transcript)
89. Experiments
- In fact, without KN smoothing, 5-grams actually hurt at small and medium data sizes. This is a wonderful example of synergy
- Caching gives the largest gain at small and medium data sizes
- Combined with KN smoothing, 5-grams give the largest gain at large data sizes
90. (No transcript)
91. (No transcript)
92. (No transcript)