Title: Probabilistic Models of Language and Spelling Correction
1. Probabilistic Models of Language and Spelling Correction
2. Why probabilistic models?
- Our job is to analyze speech and language
- for a speech recognizer, to figure out the words
- for a language analyzer, to figure out the structure of a sentence
- the part of speech of each word
- the argument/modifier relations between words
- for machine translation, to find a good
translation of a source sentence
3. Linguistic theory?
- By itself, linguistic theory is not very helpful
- tells us what is possible, but not what is likely
- does a good job for core grammatical phenomena,
but ignores many real issues
4. Speech recognition
- We don't understand words in isolation
- recognition is heavily dependent on our expectations, particularly in noisy environments
- we need to capture those expectations
- what word(s) are most likely coming up next?
- linguistics (grammar, semantics) doesn't help much
- but we can make a good prediction based on the immediately preceding words: P(word_i | word_{i-2} word_{i-1})
- example: strong, silent, ___
5. Estimating probabilities
- we can estimate these probabilities accurately from corpora (sketched in the snippet below)
- P(word_i | word_{i-2} word_{i-1}) = count(word_{i-2} word_{i-1} word_i) / count(word_{i-2} word_{i-1})
- a large corpus is a crucial ingredient
- the Web now gives us a very large corpus (about 1 teraword)
- speech recognition did not start making steady progress until it adopted probabilistic models in the 1980s
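
A minimal Python sketch of this trigram estimate; the toy token list and function name are illustrative, not from the lecture, and a real system would train on a far larger corpus.

    from collections import Counter

    def make_trigram_model(tokens):
        trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
        bigrams = Counter(zip(tokens, tokens[1:]))
        def p(w, prev2, prev1):
            # P(w | prev2 prev1) = count(prev2 prev1 w) / count(prev2 prev1)
            if bigrams[(prev2, prev1)] == 0:
                return 0.0   # unseen context; a real system would smooth or back off
            return trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)]
        return p

    tokens = "the strong silent type sat near the strong silent sea".split()
    p = make_trigram_model(tokens)
    print(p("type", "strong", "silent"))   # 0.5 on this toy corpus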
6. Part-of-speech tagging
- Can we tag parts of speech using a dictionary and a few rules?
- Problem: dictionaries do not distinguish common from rare parts of speech
- "a" is listed as a noun (the first entry in the dictionary)
- "number" is listed as an adjective
- we need to know what the common parts of speech of a word are
- Problem: rules are hard to write
- we can do quite well knowing likely sequences of parts of speech
- we can get these from a POS-tagged corpus
7. Parsing
- Writing a grammar for real sentences is really hard
- if the grammar is not rich enough, many sentences won't parse
- if the grammar is too rich, you will drown in parses
- a rule you add for one problem sentence will be applied to lots of other sentences
- Need to know which constructs are likely and try those first
- estimate likelihood from a corpus of parses
- Parser performance was stuck until probabilistic parsers were developed in the 1990s
- based on a hand-parsed corpus
- linguistics re-enters through corpus annotation
8. Machine translation
- The first NLP application (dating from the early 1950s)
- For linguistically divergent languages (e.g., English-Arabic), probabilistic models now beat rule-based systems
- P(English phrase | Arabic phrase)
- trained from very large bi-texts (parallel bilingual texts)
9. Estimating Probabilities
- We assume some infinite random source of data, with a certain distribution
- e.g., P(x) = 0.75 for some outcome x
- We estimate this probability by taking a data sample and counting how often x appears
- P(x) = count(x) / N, where N = number of samples
- the bigger the sample, the better the estimate
10. Conditional Probabilities
- A conditional probability P(X | Y) measures how often X happens in those cases where Y happens
- P(w_i | w_{i-1})
- e.g., P(w_i = dog | w_{i-1} = hot)
11. Noisy Channel Model
[Diagram: word -> noisy channel -> noisy word -> decoder -> guess at original word]
- a useful model for many tasks
- spelling correction
- speech recognition
- machine translation
- an Arabic sentence is a noisy version of an English sentence
12. Decoder
- The job of the decoder is to make a good guess at the original input
- good guess = most likely input
- w* = argmax_{w in V} P(w | O)
- where O = observed data (text, sound)
- V = vocabulary of possible guesses
- P(w | O) = probability of the word given the observation
13. P(w | O)
- How do we estimate P(w | O)?
- We will compute it from two probabilities that are easier to estimate, using Bayes' rule
14. Bayes' Rule
- P(x | y) = P(y | x) P(x) / P(y)
- based on the definition of conditional probability:
- P(x | y) = P(x ∧ y) / P(y)
- P(x ∧ y) = P(x | y) P(y) = P(y | x) P(x)
- example: P(circle ∧ green) = P(circle) P(green | circle) = P(green) P(circle | green) (checked numerically in the snippet below)
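
A small numeric check of Bayes' rule in Python, using made-up shape counts that mirror the circle/green example above:

    # Hypothetical counts of (shape, color) observations
    counts = {("circle", "green"): 3, ("circle", "red"): 1,
              ("square", "green"): 2, ("square", "red"): 4}
    N = sum(counts.values())                                                        # 10

    p_circle = sum(v for (shape, _), v in counts.items() if shape == "circle") / N  # 0.4
    p_green  = sum(v for (_, color), v in counts.items() if color == "green") / N   # 0.5
    p_joint  = counts[("circle", "green")] / N                                      # 0.3

    p_circle_given_green = p_joint / p_green     # 0.6
    p_green_given_circle = p_joint / p_circle    # 0.75

    # Bayes' rule: P(circle | green) = P(green | circle) * P(circle) / P(green)
    assert abs(p_circle_given_green - p_green_given_circle * p_circle / p_green) < 1e-12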
15. Using Bayes' Rule
- P(w | O) = P(O | w) P(w) / P(O)
- w* = argmax_{w in V} P(w | O)
-    = argmax_{w in V} P(O | w) P(w) / P(O)
-    = argmax_{w in V} P(O | w) P(w), since P(O) does not depend on w and can be dropped
- P(O | w) is the likelihood; P(w) is the prior
16. Computing the prior
- Take a large corpus (N words)
- Count the frequency of each word w
- P(w) = count(w) / N
- may want to smooth so that no word has P = 0 (a sketch with simple smoothing follows below)
- For the moment, compute the probability of a word in isolation
- in Chapter 6, we will consider words in context
- probabilities of n-grams (sequences of n words)
- very large corpora (the Web's roughly 1 teraword) allow for good estimates
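
A sketch of the prior in Python; add-one (Laplace) smoothing is used here only as one simple way to keep every vocabulary word above zero probability, and the corpus, vocabulary, and function name are illustrative:

    from collections import Counter

    def make_prior(corpus_tokens, vocabulary):
        # P(w) = (count(w) + 1) / (N + |V|): add-one smoothing so no word has P = 0
        counts = Counter(corpus_tokens)
        N = len(corpus_tokens)
        V = len(vocabulary)
        return lambda w: (counts[w] + 1) / (N + V)

    vocab = {"the", "a", "hot", "dog", "bog", "ate", "i"}
    prior = make_prior("i ate a hot dog the dog ate".split(), vocab)
    print(prior("dog"), prior("bog"))   # seen word vs. unseen word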
17. Computing the likelihood (channel probability)
- Likelihood of error depends on the source of error
- typing error
- OCR error
- handwriting recognition error
- noisy room (speech recognition)
- P(O | w) = count(typed O when I meant to type w) / count(meant to type w)
- collect these counts over a large corpus (how large?); see the sketch below
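
One way this estimate might look in Python, assuming a hypothetical list of (intended word, typed word) observations; the data and function name are illustrative only:

    from collections import Counter

    def make_channel(observations):
        # observations: (intended_word, typed_word) pairs from an error-annotated corpus
        # P(O | w) = count(typed O when w was intended) / count(w intended)
        typed_given_intended = Counter(observations)
        intended = Counter(w for w, _ in observations)
        def p(O, w):
            return typed_given_intended[(w, O)] / intended[w] if intended[w] else 0.0
        return p

    channel = make_channel([("coat", "cot"), ("coat", "coat"), ("coat", "coat"), ("cut", "cot")])
    print(channel("cot", "coat"))   # 1/3 on this made-up data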
18. Building a probabilistic model
- In any task, there is a trade-off:
- very detailed model
- can capture reality better
- but needs lots of training data
- may not be well trained with limited data
- simpler model
- lumps cases together
- can be trained with less data
19. Models of spelling error
- Most errors involve a single change (the sketch below enumerates all single-edit candidates):
- insertion (cot -> coat)
- substitution (cot -> cut)
- deletion (cot -> ct)
- transposition (cot -> cto)
- Estimate probabilities for:
- inserting y after x
- substituting y for x
- deleting y after x
- transposing x and y
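
A candidate-generation sketch in the spirit of Norvig's well-known spelling corrector: it enumerates every string one insertion, deletion, substitution, or transposition away from a word, so the correction process on the next slide can filter the results against a dictionary. The alphabet and function name are assumptions for illustration:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def edits1(word):
        # All strings reachable from `word` by one of the four single edits above
        splits      = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes     = [L + R[1:]               for L, R in splits if R]
        transposes  = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        substitutes = [L + c + R[1:]           for L, R in splits if R for c in ALPHABET]
        inserts     = [L + c + R               for L, R in splits for c in ALPHABET]
        return set(deletes + transposes + substitutes + inserts)

    print("coat" in edits1("cot"), "cut" in edits1("cot"), "ct" in edits1("cot"))   # True True True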
20. Spelling correction process
- Given a typed word t:
- check if it is in the dictionary
- if not, generate all candidates w for which
- w is in the dictionary, and
- t can be produced from w by a single change (substitution, insertion, deletion, transposition)
- if there are no candidates, give up
- for each candidate, compute P(t | w) P(w)
- return the top-ranked (most likely) candidate
- (the whole process is sketched in the snippet below)
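
Putting the steps together, a sketch of the correction process that reuses the illustrative edits1, make_channel, and make_prior helpers from the earlier snippets (none of these are the lecture's own code):

    def correct(t, dictionary, channel, prior):
        # Return t unchanged if it is a known word; otherwise pick the in-dictionary
        # single-edit candidate maximizing P(t | w) * P(w), or None if there is none.
        if t in dictionary:
            return t
        candidates = [w for w in edits1(t) if w in dictionary]
        if not candidates:
            return None   # give up
        return max(candidates, key=lambda w: channel(t, w) * prior(w))

On a toy setup, correct("cot", {"coat", "cut", "cost"}, channel, prior) returns whichever candidate the channel and prior jointly score highest.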
21. Choosing a model
- We can assume that all changes are equally likely
- OR
- We can assume that different errors have different probabilities
- e.g., that P(inserting X after Y) depends only on X
- or that P(inserting X after Y) depends on both X and Y
22. Estimating model probabilities
- We can annotate lots of data and count
- not much fun
- We can run a correction program with a simple model, gather correction counts, and use them to build a better model
23. Real-word spelling correction
- (J&M 6.6)
- The approach so far only handles errors which produce non-words
- but a good spelling corrector would also fix words which are very unlikely in context
- "I ate a hot bog."
24. Conditional Probabilities
- We capture context using conditional probabilities
- P(dog | I ate a hot)
- P(bog | I ate a hot)
25. Bigrams
- But gathering statistics on entire sentences requires a very large corpus, so we typically look at smaller contexts
- P(w_i | w_{i-1}): bigrams
- may have to smooth to account for unseen bigrams (discussed extensively in Chapter 6); one simple option is sketched below
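
A bigram model sketch with add-one smoothing, chosen here only because it is the simplest option; Chapter 6 covers better smoothing methods. Names and toy data are illustrative:

    from collections import Counter

    def make_bigram_model(tokens, vocabulary):
        # P(w_i | w_{i-1}) with add-one smoothing so unseen bigrams keep a small
        # nonzero probability
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens)
        V = len(vocabulary)
        return lambda w, prev: (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    vocab = {"i", "ate", "a", "hot", "dog", "bog"}
    bigram = make_bigram_model("i ate a hot dog".split(), vocab)
    print(bigram("dog", "hot"), bigram("bog", "hot"))   # seen vs. unseen bigram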
26. Real-word spelling correction
- The correction process is the same
- now applied to all words
- w* = argmax_{w in V} P(O | w) P(w | w_{i-1})
- include the no-error probability P(O | w) for the case O = w
- (a sketch of this context-sensitive variant follows below)
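
A sketch of the context-sensitive variant, reusing the earlier illustrative helpers. The constant p_no_error, and the choice to spread the remaining channel mass over the single-edit candidates, are assumptions for illustration rather than anything specified on the slide:

    def correct_in_context(t, prev, dictionary, channel, bigram, p_no_error=0.95):
        # Score every in-dictionary word one edit away from t, plus t itself,
        # by P(t | w) * P(w | prev); p_no_error is an assumed value for P(t | t)
        candidates = {w for w in edits1(t) if w in dictionary}
        if t in dictionary:
            candidates.add(t)
        def score(w):
            p_channel = p_no_error if w == t else (1 - p_no_error) * channel(t, w)
            return p_channel * bigram(w, prev)
        return max(candidates, key=score) if candidates else t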
27. Extending the corrector: multiple errors
- Can we extend our approach to handle multiple errors in a single word?
- Problem: multiple derivations
- Minimum edit distance = the fewest changes needed to convert X into Y
28. Minimum edit distance
- First try at finding min_edit_distance(X, Y):
- let N = max(length(X), length(Y))
- for i = 1 to N
- starting from X, generate all strings obtainable by making i changes
- is one of them Y?
- if yes, stop: min edit distance = i
- Verrrrrrrry slow (the number of candidate strings grows exponentially with i)
29. Minimum edit distance (2)
- N <- length(Y)
- M <- length(X)
- create matrix distance[N+1, M+1]
- distance[0,0] = 0
- for i = 0 to N
-   for j = 0 to M
-     distance[i,j] = min( distance[i-1,j] + insertion_cost(Y_i),
                           distance[i-1,j-1] + substitution_cost(X_j, Y_i),
                           distance[i,j-1] + deletion_cost(X_j) )
- a dynamic programming algorithm
- computes min edit distance in M*N steps (a runnable version with unit costs is given below)
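
A runnable version of the recurrence with unit costs; the base cases for the first row and column, left implicit on the slide, are filled in explicitly:

    def min_edit_distance(X, Y):
        # Bottom-up dynamic programming; runs in O(len(X) * len(Y)) steps
        M, N = len(X), len(Y)
        distance = [[0] * (M + 1) for _ in range(N + 1)]   # distance[i][j]: Y[:i] vs X[:j]
        for i in range(1, N + 1):
            distance[i][0] = i              # build Y[:i] from the empty string: i insertions
        for j in range(1, M + 1):
            distance[0][j] = j              # erase X[:j]: j deletions
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                sub = 0 if X[j - 1] == Y[i - 1] else 1
                distance[i][j] = min(distance[i - 1][j] + 1,         # insertion
                                     distance[i][j - 1] + 1,         # deletion
                                     distance[i - 1][j - 1] + sub)   # substitution (free if equal)
        return distance[N][M]

    print(min_edit_distance("cot", "coat"))   # 1
    print(min_edit_distance("cot", "cto"))    # 2, since this basic version has no transposition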
30. Other improvements
- Some spelling errors are typos
- Others come from not knowing how to spell a word
- those errors may involve multiple letters
- a letter-by-letter model would make such errors very unlikely
- look for larger patterns
- make use of pronunciation rules
- x <--> cks