Title: Language Modeling
1 Language Modeling
- "Anytime a linguist leaves the group, the recognition rate goes up." (Fred Jelinek)
2 Word Prediction in Application Domains
- Guessing the next word/letter
- Once upon a time there was ...
- C'era una volta ... (Italian: "Once upon a time there was ...")
- Domains: speech modeling, augmentative communication systems (for disabled persons), T9
3 Word Prediction for Spelling
- Each example below contains a real-word spelling or grammar error:
- Andranno a trovarlo alla sua cassa domani. (Italian: "cassa", cash desk, instead of "casa", house)
- Se andrei al mare sarei abbronzato. (Italian: conditional "andrei" instead of subjunctive "andassi")
- Vado a spiaggia. (Italian: wrong preposition, should be "in spiaggia")
- Hopefully, all with continue smoothly in my absence.
- Can they lave him my message?
- I need to notified the bank of this problem.
4 Probabilities
- Prior probability that the training data D will be observed: P(D)
- Prior probability of h, P(h); may include any prior knowledge that h is the correct hypothesis
- P(D|h): probability of observing the data D given a world where hypothesis h holds
- P(h|D): probability that h holds given the data D, i.e. the posterior probability of h, because it reflects our confidence that h holds after we have seen the data D
5 The Bayes Rule (Theorem)
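In the notation of the previous slide, Bayes' rule reads:

\[ P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)} \]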
6 Maximum A Posteriori Hypothesis and Maximum Likelihood
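The defining formulas are presumably the standard ones:

\[ h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} P(D \mid h)\,P(h) \]
\[ h_{ML} = \arg\max_{h \in H} P(D \mid h) \]

The ML hypothesis is the special case of MAP with a uniform prior P(h).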
7 Bayes Optimal Classifier
- Motivation: 3 hypotheses with posterior probabilities 0.4, 0.3 and 0.3. Thus, the first one is the MAP hypothesis. (!) BUT
- (A problem) Suppose a new instance is classified positive by the first hypothesis, but negative by the other two. So the probability that the new instance is positive is 0.4, as opposed to 0.6 for the negative classification. Yet the MAP hypothesis is the 0.4 one!
- Solution: the most probable classification of the new instance is obtained by combining the predictions of all hypotheses, weighted by their posterior probabilities.
8 Bayes Optimal Classifier
- Classification: the most probable class for a new instance
- Bayes Optimal Classifier (formula below)
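Presumably the standard definition: the probability of each class is obtained by summing over all hypotheses, weighted by their posteriors, and the classifier picks the class with the highest combined probability.

\[ P(v_j \mid D) = \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D), \qquad v_{OB} = \arg\max_{v_j \in V} \sum_{h_i \in H} P(v_j \mid h_i)\,P(h_i \mid D) \]

On the example of slide 7 this gives P(+|D) = 0.4 and P(-|D) = 0.6, so the optimal classification is negative even though the MAP hypothesis says positive.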
9 Naïve Bayes Classifier
- Bayes Optimal Classifier
- Naïve version
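Presumably the standard form: the naïve version assumes the attributes a_1, ..., a_n are conditionally independent given the class, so the joint likelihood factorizes:

\[ v_{MAP} = \arg\max_{v_j \in V} P(a_1, \dots, a_n \mid v_j)\,P(v_j) \approx \arg\max_{v_j \in V} P(v_j) \prod_i P(a_i \mid v_j) = v_{NB} \]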
10 m-estimate of probability
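Presumably the standard m-estimate, which smooths the raw relative frequency n_c / n towards a prior estimate p:

\[ \hat{P} = \frac{n_c + m\,p}{n + m} \]

where n is the number of training examples for the class, n_c the number of those with the attribute value in question, p a prior probability estimate, and m the equivalent sample size.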
11 Tagging
12 Use corpus to find them
Language Model
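In the standard HMM tagging setup, presumably what slides 11-12 refer to, the best tag sequence T for a word sequence W is found via Bayes' rule, and the second factor P(T) is the language model estimated from a corpus:

\[ T^{*} = \arg\max_{T} P(T \mid W) = \arg\max_{T} \frac{P(W \mid T)\,P(T)}{P(W)} = \arg\max_{T} P(W \mid T)\,P(T) \]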
13 N-gram Model
- The N-th word is predicted by the previous N-1 words.
- What is a word?
- Token, word-form, lemma, m-tag, ...
14 N-gram approximation models
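The chain-rule decomposition and its N-gram approximation, presumably the content of this slide:

\[ P(w_1 \dots w_n) = \prod_{i=1}^{n} P(w_i \mid w_1 \dots w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-N+1} \dots w_{i-1}) \]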
15 Bi-gram and tri-gram models
- N = 2 (bigram)
- N = 3 (trigram)
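The corresponding conditional probabilities:

\[ P_{bi}(w_i \mid h_i) = P(w_i \mid w_{i-1}), \qquad P_{tri}(w_i \mid h_i) = P(w_i \mid w_{i-2}, w_{i-1}) \]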
16 Counting n-grams
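The maximum-likelihood estimates are obtained directly from corpus counts C(.):

\[ P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\,w_i)}{C(w_{i-1})}, \qquad P(w_i \mid w_{i-2}, w_{i-1}) = \frac{C(w_{i-2}\,w_{i-1}\,w_i)}{C(w_{i-2}\,w_{i-1})} \]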
17 The Language Model Allows Us to Calculate Sentence Probabilities
- P( Today is a beautiful day . )
- P(Today | <Start>) P(is | Today) P(a | is) P(beautiful | a) P(day | beautiful) P(. | day) P(<End> | .)
- Work in log space!
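A minimal Python sketch of slides 16-17 (counting bigrams, MLE estimates, and scoring a sentence in log space); the toy corpus and function names are illustrative, not from the slides:

import math
from collections import Counter

START, END = "<Start>", "<End>"

def count_ngrams(sentences):
    # Count unigrams and bigrams over tokenized sentences with boundary markers.
    unigrams, bigrams = Counter(), Counter()
    for tokens in sentences:
        padded = [START] + tokens + [END]
        unigrams.update(padded)
        bigrams.update(zip(padded, padded[1:]))
    return unigrams, bigrams

def sentence_logprob(tokens, unigrams, bigrams):
    # Sum of log P(w_i | w_{i-1}) with MLE estimates; -inf for unseen bigrams.
    padded = [START] + tokens + [END]
    logp = 0.0
    for prev, cur in zip(padded, padded[1:]):
        if bigrams[(prev, cur)] == 0:
            return float("-inf")  # unseen n-gram: this is what smoothing (next slide) addresses
        logp += math.log(bigrams[(prev, cur)] / unigrams[prev])
    return logp

corpus = [["Today", "is", "a", "beautiful", "day", "."]]
uni, bi = count_ngrams(corpus)
print(sentence_logprob(["Today", "is", "a", "beautiful", "day", "."], uni, bi))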
18 Unseen n-grams and Smoothing
- Discounting (several types)
- Backoff
- Deleted Interpolation
19 Deleted Interpolation
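The standard deleted-interpolation estimate mixes the maximum-likelihood trigram, bigram and unigram models with weights that sum to one:

\[ \hat{P}(w_i \mid w_{i-2}, w_{i-1}) = \lambda_3\,P_{ML}(w_i \mid w_{i-2}, w_{i-1}) + \lambda_2\,P_{ML}(w_i \mid w_{i-1}) + \lambda_1\,P_{ML}(w_i), \qquad \lambda_1 + \lambda_2 + \lambda_3 = 1 \]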
20 Searching For the Best Tagging
[Lattice figure: words W_1 ... W_8, each with a column of candidate tags t_1_1 ... t_4_1.]
Use Viterbi search to find the best path through
the lattice.
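A compact Python sketch of Viterbi search over such a lattice; the tag set and the transition/emission tables below are made-up toy values, not from the slides:

import math

def viterbi(words, tags, log_trans, log_emit, start="<s>"):
    # Most probable tag path; log_trans[(t_prev, t)] and log_emit[(t, w)]
    # are log probabilities, missing entries count as -inf.
    NEG_INF = float("-inf")
    best = {t: log_trans.get((start, t), NEG_INF) + log_emit.get((t, words[0]), NEG_INF)
            for t in tags}
    back = []
    for w in words[1:]:
        new_best, ptr = {}, {}
        for t in tags:
            score, prev = max((best[tp] + log_trans.get((tp, t), NEG_INF), tp) for tp in tags)
            new_best[t] = score + log_emit.get((t, w), NEG_INF)
            ptr[t] = prev
        back.append(ptr)
        best = new_best
    last = max(best, key=best.get)          # best final tag
    path = [last]
    for ptr in reversed(back):              # follow back-pointers
        path.append(ptr[path[-1]])
    return list(reversed(path))

tags = ["N", "V"]
log_trans = {("<s>", "N"): math.log(0.7), ("<s>", "V"): math.log(0.3),
             ("N", "N"): math.log(0.4), ("N", "V"): math.log(0.6),
             ("V", "N"): math.log(0.8), ("V", "V"): math.log(0.2)}
log_emit = {("N", "dogs"): math.log(0.5), ("V", "dogs"): math.log(0.1),
            ("N", "bark"): math.log(0.2), ("V", "bark"): math.log(0.7)}
print(viterbi(["dogs", "bark"], tags, log_trans, log_emit))   # -> ['N', 'V']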
21 Cross Entropy
- Entropy from the point of view of a user who has misinterpreted the source distribution to be q rather than p.
- Cross entropy is an upper bound on entropy.
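Formally, for the true distribution p and the model q, Gibbs' inequality gives the bound:

\[ H(p, q) = -\sum_{x} p(x)\,\log q(x) \;\ge\; -\sum_{x} p(x)\,\log p(x) = H(p) \]

with equality exactly when q = p.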
22 Cross Entropy as a Quality Measure
- Two models, therefore two upper bounds on entropy.
- The more accurate model is the one with the lower cross entropy.
23 Imagine that y was generated by either model A or model B. Then ...
24 Cont.
Proof of convergence of the EM algorithm
25 Expectation-Maximization Algorithm
- Consider a problem in which the data D is a set of instances generated by a probability distribution that is a mixture of k distinct Normal distributions (assuming equal variances).
- The hypothesis is therefore defined by the vector of the means of the distributions.
26 Expectation-Maximization Algorithm
- Step 1: Calculate the expected value of the hidden variables (which distribution generated each instance), assuming that the current hypothesis holds.
- Step 2: Calculate a new maximum-likelihood hypothesis, assuming that the expected values are the true values. Then make the new hypothesis the current one.
- Step 3: Go to Step 1.
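For the Gaussian-mixture setting of slide 25 (equal, known variances), the two steps presumably correspond to the usual textbook updates: the expected value of the hidden indicator z_ij (instance x_i generated by distribution j) and the re-estimated means:

\[ E[z_{ij}] = \frac{\exp\!\big(-\tfrac{(x_i - \mu_j)^2}{2\sigma^2}\big)}{\sum_{n=1}^{k} \exp\!\big(-\tfrac{(x_i - \mu_n)^2}{2\sigma^2}\big)}, \qquad \mu_j \leftarrow \frac{\sum_i E[z_{ij}]\,x_i}{\sum_i E[z_{ij}]} \]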
27 If we find lambda prime such that ...
So we need to maximize A with respect to lambda prime, under the constraint that all lambdas sum up to one → use Lagrange multipliers.
28 The EM Algorithm
Can be analogously generalized to more lambdas.
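A sketch of the standard re-estimation of the interpolation weights (the usual deleted-interpolation derivation, assumed here rather than taken from the slide): accumulate the expected contribution of each component model P_j over the held-out data, then renormalize:

\[ c_j = \sum_{t} \frac{\lambda_j\,P_j(w_t \mid h_t)}{\sum_{k} \lambda_k\,P_k(w_t \mid h_t)}, \qquad \lambda'_j = \frac{c_j}{\sum_{k} c_k} \]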
29 Measuring Success Rates
- Recall = (correct answers) / (total possible answers)
- Precision = (correct answers) / (answers produced)
- Fallout = (incorrect answers) / (number of spurious facts in the text)
- F-measure = (β² + 1)PR / (β²P + R)
- If β > 1, recall (R) is favored; if β < 1, precision (P) is favored.
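A worked instance with illustrative numbers (not from the slide): with P = 0.5 and R = 0.8, the balanced case β = 1 gives

\[ F_1 = \frac{2 \cdot 0.5 \cdot 0.8}{0.5 + 0.8} \approx 0.62 \]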
30 Chunking as Tagging
- Even certain parsing problems can be solved via tagging
- E.g.
- ((A B) C ((D F) G))
- BIA tags: A/B B/A C/I D/B F/A G/A