Language Modeling

Transcript and Presenter's Notes

Title: Language Modeling


1
Language Modeling
  • Anytime a linguist leaves the group the
    recognition rate goes up. (Fred Jelinek)

2
Word Prediction in Application Domains
  • Guessing the next word/letter
  • Once upon a time there was …
  • C'era una volta … (Italian: "Once upon a time …")
  • Domains: speech modeling, augmentative
    communication systems (for disabled persons), T9

3
Word Prediction for Spelling
  • Andranno a trovarlo alla sua cassa domani.
    (Italian; "cassa" should be "casa": they will go
    to visit him at his house tomorrow.)
  • Se andrei al mare sarei abbronzato. (Italian;
    conditional "andrei" instead of subjunctive
    "andassi": if I went to the seaside I would be
    tanned.)
  • Vado a spiaggia. (Italian; "a spiaggia" instead
    of "in spiaggia": I am going to the beach.)
  • Hopefully, all with continue smoothly in my
    absence. ("with" instead of "will")
  • Can they lave him my message? ("lave" instead of
    "leave")
  • I need to notified the bank of this problem.
    ("to notified" instead of "to notify")
  • All are real-word errors that a spell checker
    needs a language model to catch.

4
Probabilities
  • Prior probability that the training data D will
    be observed: P(D)
  • Prior probability of h, P(h), may include any
    prior knowledge that h is the correct hypothesis
  • P(D|h): probability of observing data D given a
    world where hypothesis h holds
  • P(h|D): probability that h holds given the data
    D, i.e. the posterior probability of h, because it
    reflects our confidence that h holds after we
    have seen the data D

5
The Bayes Rule (Theorem)
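The formula itself is not in the transcript; the standard
statement, in the notation of slide 4, is:

  P(h|D) = P(D|h) P(h) / P(D)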
6
Maximum A Posteriori Hypothesis and Maximum
Likelihood
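The defining equations are lost in the transcript; the
standard definitions, in the notation of slide 4, are:

  h_MAP = argmax_{h ∈ H} P(h|D) = argmax_{h ∈ H} P(D|h) P(h)
  h_ML  = argmax_{h ∈ H} P(D|h)   (when all priors P(h) are equal)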
7
Bayes Optimal Classifier
  • Motivation: 3 hypotheses with posterior probs
    of 0.4, 0.3 and 0.3. Thus, the first one is the
    MAP hypothesis. (!) BUT
  • (A problem) Suppose a new instance is classified
    positive by the first hypothesis, but negative by
    the other two. So the probability that the new
    instance is positive is 0.4, as opposed to 0.6
    for the negative classification, yet the MAP
    hypothesis is the 0.4 one! (See the worked sums
    below.)
  • Solution: the most probable classification of the
    new instance is obtained by combining the
    predictions of all hypotheses, weighted by their
    posterior probabilities.
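Spelling out the example's arithmetic:

  P(+|D) = 0.4              (only h1 votes positive)
  P(−|D) = 0.3 + 0.3 = 0.6  (h2 and h3 vote negative)

so the combined classification is negative, even though the
MAP hypothesis h1 says positive.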

8
Bayes Optimal Classifier
  • Classification class
  • Bayes Optimal Classifier
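Both formulas are lost in the transcript; the standard form
of the Bayes optimal classifier (v_j ranges over classes V,
h over hypotheses H) is:

  v_OB = argmax_{v_j ∈ V} Σ_{h ∈ H} P(v_j | h) P(h | D)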

9
Naïve Bayes Classifier
  • Bayes Optimal Classifier
  • Naïve version
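The naïve version drops the sum over hypotheses and assumes
the attributes a_1 … a_n are conditionally independent given
the class:

  v_NB = argmax_{v_j ∈ V} P(v_j) · Π_i P(a_i | v_j)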

10
m-estimate of probability
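The formula is lost in the transcript; the usual m-estimate
(n = training examples of the class, n_c = those with the
given attribute value, p = prior estimate, m = equivalent
sample size) is:

  P ≈ (n_c + m·p) / (n + m)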
11
Tagging
  • P(tag = Noun | word = saw) = ?

12
Use a corpus to estimate them
Language Model
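A minimal sketch of that corpus estimate in Python (the toy
corpus, function name, and tag labels are illustrative, not
from the slides):

  def tag_given_word(pairs, word, tag):
      # MLE of P(tag | word) from a list of (word, tag) pairs
      word_count = sum(1 for w, t in pairs if w == word)
      pair_count = sum(1 for w, t in pairs if w == word and t == tag)
      return pair_count / word_count if word_count else 0.0

  corpus = [("I", "Pron"), ("saw", "Verb"), ("the", "Det"),
            ("saw", "Noun")]
  print(tag_given_word(corpus, "saw", "Noun"))  # 0.5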
13
N-gram Model
  • The N-th word is predicted by the previous N-1
    words.
  • What is a word?
  • Token, word-form, lemma, m-tag, …

14
N-gram approximation models
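The approximation itself is lost in the transcript; the
standard chain-rule form is:

  P(w_1 … w_n) = Π_{i=1}^{n} P(w_i | w_1 … w_{i-1})
               ≈ Π_{i=1}^{n} P(w_i | w_{i-N+1} … w_{i-1})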
15
bi-gram and tri-gram models
N = 2 (bigram)
N = 3 (trigram)
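The corresponding maximum-likelihood estimates (with C(·) a
corpus count, as on the next slide) are:

  bigram:  P(w_i | w_{i-1}) = C(w_{i-1} w_i) / C(w_{i-1})
  trigram: P(w_i | w_{i-2} w_{i-1}) = C(w_{i-2} w_{i-1} w_i) / C(w_{i-2} w_{i-1})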
16
Counting n-grams
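A minimal counting sketch in Python (the tokenized example
and function name are illustrative, not from the slides):

  from collections import Counter

  def count_ngrams(tokens, n):
      # counts of all n-grams (as tuples) in a token list
      return Counter(tuple(tokens[i:i + n])
                     for i in range(len(tokens) - n + 1))

  tokens = "today is a beautiful day .".split()
  print(count_ngrams(tokens, 2))  # bigram counts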
17
The Language Model Allows Us to Calculate
Sentence Probabilities
  • P( Today is a beautiful day . )
  • = P(Today | <Start>) · P(is | Today) · P(a | is)
    · P(beautiful | a) · P(day | beautiful) ·
    P(. | day) · P(<End> | .)
  • Work in log space!
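A sketch of the log-space computation in Python; bigram_prob
is a hypothetical lookup returning P(w2 | w1). Multiplying
many probabilities below 1 underflows floating point, so we
sum logs instead:

  import math

  def sentence_logprob(words, bigram_prob):
      words = ["<Start>"] + words + ["<End>"]
      return sum(math.log(bigram_prob(w1, w2))
                 for w1, w2 in zip(words, words[1:]))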

18
Unseen n-grams and Smoothing
  • Discounting (several types)
  • Backoff
  • Deleted Interpolation
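The slides do not show a specific scheme; as one concrete
example of discounting, add-one (Laplace) smoothing with
vocabulary size V gives:

  P(w_i | w_{i-1}) = (C(w_{i-1} w_i) + 1) / (C(w_{i-1}) + V)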

19
Deleted Interpolation
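The formula is lost in the transcript; the standard
deleted-interpolation estimate for a trigram model is:

  P(w_i | w_{i-2} w_{i-1}) = λ_3 P̂(w_i | w_{i-2} w_{i-1})
                            + λ_2 P̂(w_i | w_{i-1})
                            + λ_1 P̂(w_i),   λ_1 + λ_2 + λ_3 = 1

where the P̂ are MLE estimates and the λs are trained by EM
on held-out data (see slides 24 to 28).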
20
Searching For the Best Tagging
[Lattice figure: columns are the words W_1 … W_8; each column
lists that word's candidate tags (t_1_j for every word,
second candidates t_2_j for some words, and so on).]
Use Viterbi search to find the best path through
the lattice.
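A minimal Viterbi sketch in Python, assuming hypothetical
dictionaries trans[(t_prev, t)] = P(t | t_prev) and
emit[(t, w)] = P(w | t) standing in for the corpus estimates
(the names and the 1e-12 floor for unseen events are
illustrative):

  import math

  def viterbi(words, tags, trans, emit, start="<s>"):
      # best[t] = (log prob of the best path ending in tag t, that path)
      best = {t: (math.log(trans.get((start, t), 1e-12))
                  + math.log(emit.get((t, words[0]), 1e-12)), [t])
              for t in tags}
      for w in words[1:]:                  # extend the lattice column by column
          step = {}
          for t in tags:
              score, path = max(
                  (p + math.log(trans.get((pt, t), 1e-12))
                   + math.log(emit.get((t, w), 1e-12)), path)
                  for pt, (p, path) in best.items())
              step[t] = (score, path + [t])
          best = step
      return max(best.values())[1]         # tag sequence of the best path

Storing whole paths keeps the sketch short; a production
tagger would keep backpointers and reconstruct the path at
the end.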
21
Cross Entropy
  • Entropy from the point of view of a user who
    has misinterpreted the source distribution to be
    q rather than p. Cross entropy is an upper bound
    on entropy.
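In formulas:

  H(p, q) = −Σ_x p(x) log q(x) ≥ H(p) = −Σ_x p(x) log p(x)

with equality exactly when q = p.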

22
Cross Entropy as a Quality Measure
  • Two models, therefore two upper bounds on entropy.
  • The more accurate model is the one with the lower
    cross entropy.
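In practice the cross entropy of a model q is estimated per
word on held-out text w_1 … w_n, which is what makes it
usable as a quality measure:

  H(p, q) ≈ −(1/n) log q(w_1 … w_n)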

23
Imagine that y was generated with either model A
or model B. Then
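The equation that followed is lost in the transcript; given
the interpolation setting of slides 27 and 28, it is
presumably the two-component mixture

  P(y) = λ P_A(y) + (1 − λ) P_B(y),   0 ≤ λ ≤ 1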
24
Cont.
Proof of convergence of the EM algorithm
25
Expectation-Maximization (EM) Algorithm
  • Consider a problem in which the data D is a set
    of instances generated by a probability
    distribution that is a mixture of k distinct
    Normal distributions (assuming equal variances)
  • A hypothesis is therefore defined by the vector
    of the means of the distributions
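Concretely, the hypothesis and the mixture density are then

  h = ⟨μ_1, …, μ_k⟩
  p(x) = Σ_{j=1}^{k} (1/k) · N(x; μ_j, σ²)

assuming, as in the textbook version of this problem, that
each instance is drawn from one of the k Gaussians with equal
probability and a shared, known variance σ².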

26
Expectation-Maximization Algorithm
  • Step 1: Calculate the expected value of each
    hidden variable, assuming that the current
    hypothesis holds
  • Step 2: Calculate a new maximum likelihood
    hypothesis, assuming that the expected values are
    the true values. Then make the new hypothesis
    the current one.
  • Step 3: Go to Step 1. (The update formulas for
    the Gaussian-mixture case follow below.)
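For the k-Gaussians problem of slide 25 the two steps take
the standard form, with hidden variable z_ij = 1 iff instance
x_i was generated by Gaussian j:

  E-step:  E[z_ij] = exp(−(x_i − μ_j)² / 2σ²) / Σ_{l=1}^{k} exp(−(x_i − μ_l)² / 2σ²)
  M-step:  μ_j ← Σ_i E[z_ij] · x_i / Σ_i E[z_ij]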

27
If we find λ′ such that (inequality lost in the transcript;
in the standard EM convergence argument, one guaranteeing the
likelihood does not decrease), then we need to maximize A
with respect to λ′, under the constraint that all λs sum up
to one ⇒ use Lagrange multipliers.
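Assuming A has the usual auxiliary-function form
A(λ′) = Σ_j c_j log λ′_j, with c_j the expected count of
component j (an assumption; the exact A is on the lost
slide), setting ∂/∂λ′_j [A − α(Σ_l λ′_l − 1)] = 0 gives
c_j / λ′_j = α, and the constraint Σ_j λ′_j = 1 fixes
α = Σ_l c_l, so

  λ′_j = c_j / Σ_l c_l

i.e. the re-estimated weights are the normalized expected
counts.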
28
The EM Algorithm
Can be generalized analogously to more λs.
29
Measuring success rates
  • Recall = (correct answers)/(total possible
    answers)
  • Precision = (correct answers)/(answers given)
  • Fallout = (incorrect answers)/(# of spurious
    facts in the text)
  • F-measure = (β² + 1)PR / (β²P + R)
  • If β > 1, R (recall) is favored.
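A quick arithmetic check with β = 1 (the balanced F1):
P = 0.5 and R = 1.0 give

  F1 = (1 + 1) · 0.5 · 1.0 / (1 · 0.5 + 1.0) ≈ 0.67

and as β → ∞ the formula tends to R, which is why β > 1
favors recall.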

30
Chunking as Tagging
  • Even certain parsing problems can be solved via
    tagging
  • E.g.
  • ((A B) C ((D F) G))
  • BIA tags: A/B B/A C/I D/B F/A G/A