Title: Language Modeling
1 Language Modeling
- "Anytime a linguist leaves the group, the recognition rate goes up." (Fred Jelinek)
2 Word Prediction in Application Domains
- Guessing the next word/letter
- Once upon a time there was ...
- C'era una volta ... (Italian: "Once upon a time there was ...")
- Domains: speech modeling, augmentative communication systems (for disabled persons), T9
3 Word Prediction for Spelling
- Each example below contains a real-word spelling or grammar error:
- Andranno a trovarlo alla sua cassa domani. (Italian: "cassa", cash desk, instead of "casa", house)
- Se andrei al mare sarei abbronzato. (Italian: conditional "andrei" instead of subjunctive "andassi")
- Vado a spiaggia. (Italian: wrong preposition, should be "in spiaggia")
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my message?
- I need to notified the bank of this problem.
4 Probabilities
- Prior probability that the training data D will be observed: P(D)
- Prior probability of h, P(h); may include any prior knowledge that h is the correct hypothesis
- P(D|h): probability of observing the data D given a world where hypothesis h holds
- P(h|D): probability that h holds given the data D, i.e. the posterior probability of h, because it reflects our confidence that h holds after we have seen the data D
5 The Bayes Rule (Theorem)
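In the notation of the previous slide, Bayes' rule reads:

\[ P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \]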
6 Maximum A Posteriori Hypothesis and Maximum Likelihood
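The defining formulas are presumably the standard ones:

\[ h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h) \]
\[ h_{ML} = \arg\max_{h \in H} P(D \mid h) \]

The ML hypothesis is the special case of MAP with a uniform prior P(h).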
7 Bayes Optimal Classifier
- Motivation: 3 hypotheses with posterior probabilities 0.4, 0.3 and 0.3. Thus, the first one is the MAP hypothesis. (!) BUT
- (A problem) Suppose a new instance is classified positive by the first hypothesis, but negative by the other two. So the probability that the new instance is positive is 0.4, as opposed to 0.6 for the negative classification. Yet the MAP hypothesis is the 0.4 one!
- Solution: the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
8 Bayes Optimal Classifier
- Classification: the most probable class for a new instance
- Bayes Optimal Classifier (formula below)
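Presumably the standard definition: the probability of each class is obtained by summing over all hypotheses, weighted by their posteriors, and the classifier picks the class with the highest combined probability.

\[ P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D), \qquad v_{OB} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) \]

On the example of slide 7 this gives P(+|D) = 0.4 and P(-|D) = 0.6, so the optimal classification is negative even though the MAP hypothesis says positive.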
9 Naïve Bayes Classifier
- Bayes Optimal Classifier
- Naïve version
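Presumably the standard form: the naïve version assumes the attributes a_1, ..., a_n are conditionally independent given the class, so the joint likelihood factorizes:

\[ v_{MAP} = \arg\max_{v_j \in V} P(a_1, \dots, a_n \mid v_j)\,P(v_j) \approx \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) = v_{NB} \]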
10 m-estimate of probability
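Presumably the standard m-estimate, which smooths the raw relative frequency n_c / n towards a prior estimate p:

\[ \hat{P} = \frac{n_c + m\,p}{n + m} \]

where n is the number of training examples for the class, n_c the number of those with the attribute value in question, p a prior probability estimate, and m the equivalent sample size.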
11 Tagging
12 Use corpus to find them
Language Model
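In the standard HMM tagging setup, presumably what slides 11-12 refer to, the best tag sequence T for a word sequence W is found via Bayes' rule, and the second factor P(T) is the language model estimated from a corpus:

\[ T^{*} = \arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)} = \arg\max_{T} P(W \mid T)\,P(T) \]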
13 N-gram Model
- The N-th word is predicted by the previous N-1 words.
- What is a word?
- Token, word-form, lemma, m-tag, ...
14 N-gram approximation models
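The chain-rule decomposition and its N-gram approximation, presumably the content of this slide:

\[ P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1} \dots w_{i-1}) \]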
15 Bi-gram and tri-gram models
- N = 2 (bigram)
- N = 3 (trigram)
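The corresponding conditional probabilities:

\[ P_{bi}(w_i \mid h_i) = P(w_i \mid w_{i-1}), \qquad P_{tri}(w_i \mid h_i) = P(w_i \mid w_{i-2}, w_{i-1}) \]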
16 Counting n-grams
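The maximum-likelihood estimates are obtained directly from corpus counts C(.):

\[ P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}, \qquad P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})} \]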
17 The Language Model Allows Us to Calculate Sentence Probabilities
- P( Today is a beautiful day . )
- P(Today | <Start>) P(is | Today) P(a | is) P(beautiful | a) P(day | beautiful) P(. | day) P(<End> | .)
- Work in log space!
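A minimal Python sketch of slides 16-17 (counting bigrams, MLE estimates, and scoring a sentence in log space); the toy corpus and function names are illustrative, not from the slides:

import math
from collections import Counter

START, END = "<Start>", "<End>"

def count_ngrams(sentences):
    # Count unigrams and bigrams over tokenized sentences with boundary markers.
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = [START] + tokens + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_logprob(tokens, unigrams, bigrams):
    # Sum of log P(w_i | w_{i-1}) with MLE estimates; -inf for unseen bigrams.
    padded = [START] + tokens + [END]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        if bigrams[(prev, cur)] == 0:
            return float("-inf")  # unseen n-gram: this is what smoothing (next slide) addresses
        logp += math.log(bigrams[(prev, cur)] / unigrams[prev])
    return logp

corpus = [["Today", "is", "a", "beautiful", "day", "."]]
uni, bi = count_ngrams(corpus)
print(sentence_logprob(["Today", "is", "a", "beautiful", "day", "."], uni, bi))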
18 Unseen n-grams and Smoothing
- Discounting (several types)
- Backoff
- Deleted Interpolation
19 Deleted Interpolation
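The standard deleted-interpolation estimate mixes the maximum-likelihood trigram, bigram and unigram models with weights that sum to one:

\[ \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\,P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\,P_{ML}(w_i \mid w_{i-1}) + \lambda_1\,P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]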
20 Searching For the Best Tagging
[Lattice figure: words W_1 ... W_8, each with a column of candidate tags t_1_1 ... t_4_1.]
Use Viterbi search to find the best path through
the lattice.
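A compact Python sketch of Viterbi search over such a lattice; the tag set and the transition/emission tables below are made-up toy values, not from the slides:

import math

def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    # Most probable tag path; log_trans[(t_prev, t)] and log_emit[(t, w)]
    # are log probabilities, missing entries count as -inf.
    NEG_INF = float("-inf")
    best = {t: log_trans.get((start, t), NEG_INF) + log_emit.get((t, words[0]), NEG_INF)
            for t in tags}
    back = []
    for w in words[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            score, prev = max((best[tp] + log_trans.get((tp, t), NEG_INF), tp) for tp in tags)
            new_best[t] = score + log_emit.get((t, w), NEG_INF)
            ptr[t] = prev
        back.append(ptr)
        best = new_best
    last = max(best, key=best.get)          # best final tag
    path = [last]
    for ptr in reversed(back):              # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["N", "V"]
log_trans = {("<s>", "N"): math.log(0.7), ("<s>", "V"): math.log(0.3),
             ("N", "N"): math.log(0.4), ("N", "V"): math.log(0.6),
             ("V", "N"): math.log(0.8), ("V", "V"): math.log(0.2)}
log_emit = {("N", "dogs"): math.log(0.5), ("V", "dogs"): math.log(0.1),
            ("N", "bark"): math.log(0.2), ("V", "bark"): math.log(0.7)}
print(viterbi(["dogs", "bark"], tags, log_trans, log_emit))   # -> ['N', 'V']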
21 Cross Entropy
- Entropy from the point of view of a user who has misinterpreted the source distribution to be q rather than p.
- Cross entropy is an upper bound on entropy.
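Formally, for the true distribution p and the model q, Gibbs' inequality gives the bound:

\[ H(p, q) = -\sum_{x} p(x)\,\log q(x) \;\ge\; -\sum_{x} p(x)\,\log p(x) = H(p) \]

with equality exactly when q = p.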
22 Cross Entropy as a Quality Measure
- Two models, therefore two upper bounds on entropy.
- The more accurate model is the one with the lower cross entropy.
23 Imagine that y was generated by either model A or model B. Then ...
24 Cont.
Proof of convergence of the EM algorithm
25 Expectation-Maximization Algorithm
- Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions (assuming equal variances).
- The hypothesis is therefore defined by the vector of the means of the distributions.
26 Expectation-Maximization Algorithm
- Step 1: Calculate the expected value of the hidden variables (which distribution generated each instance), assuming that the current hypothesis holds.
- Step 2: Calculate a new maximum-likelihood hypothesis, assuming that the expected values are the true values. Then make the new hypothesis the current one.
- Step 3: Go to Step 1.
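For the Gaussian-mixture setting of slide 25 (equal, known variances), the two steps presumably correspond to the usual textbook updates: the expected value of the hidden indicator z_ij (instance x_i generated by distribution j) and the re-estimated means:

\[ E[z_{ij}] = \frac{\exp\!\big(-\tfrac{(x_i - \mu_j)^2}{2\sigma^2}\big)}{\sum_{n=1}^{k} \exp\!\big(-\tfrac{(x_i - \mu_n)^2}{2\sigma^2}\big)}, \qquad \mu_j \leftarrow \frac{\sum_i E[z_{ij}]\,x_i}{\sum_i E[z_{ij}]} \]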
27 If we find lambda prime such that ...
So we need to maximize A with respect to lambda prime, under the constraint that all lambdas sum up to one → use Lagrange multipliers.
28 The EM Algorithm
Can be analogously generalized to more lambdas.
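A sketch of the standard re-estimation of the interpolation weights (the usual deleted-interpolation derivation, assumed here rather than taken from the slide): accumulate the expected contribution of each component model P_j over the held-out data, then renormalize:

\[ c_j = \sum_{t} \frac{\lambda_j\,P_j(w_t \mid h_t)}{\sum_{k} \lambda_k\,P_k(w_t \mid h_t)}, \qquad \lambda'_j = \frac{c_j}{\sum_{k} c_k} \]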
29 Measuring Success Rates
- Recall = (correct answers) / (total possible answers)
- Precision = (correct answers) / (answers produced)
- Fallout = (incorrect answers) / (number of spurious facts in the text)
- F-measure = (β² + 1)PR / (β²P + R)
- If β > 1, recall (R) is favored; if β < 1, precision (P) is favored.
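A worked instance with illustrative numbers (not from the slide): with P = 0.5 and R = 0.8, the balanced case β = 1 gives

\[ F_1 = \frac{2 \cdot 0.5 \cdot 0.8}{0.5 + 0.8} \approx 0.62 \]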
30 Chunking as Tagging
- Even certain parsing problems can be solved via tagging
- E.g.
- ((A B) C ((D F) G))
- BIA tags: A/B B/A C/I D/B F/A G/A