Title: Probabilistic Models of Language and Spelling Correction
1. Probabilistic Models of Language and Spelling Correction
2. Why probabilistic models?
- Our job is to analyze speech and language
- for a speech recognizer, to figure out the words
- for a language analyzer, to figure out the structure of a sentence
- the part of speech of each word
- the argument/modifier relations between words
- for machine translation, to find a good
translation of a source sentence
3. Linguistic theory?
- By itself, linguistic theory is not very helpful
- tells us what is possible, but not what is likely
- does a good job for core grammatical phenomena,
but ignores many real issues
4. Speech recognition
- We don't understand words in isolation
- recognition is heavily dependent on our expectations, particularly in noisy environments
- we need to capture those expectations
- what word(s) are most likely coming up next?
- linguistics (grammar, semantics) doesn't help much
- but we can make a good prediction based on the immediately preceding words: P(word_i | word_{i-2} word_{i-1})
- example: strong, silent, ___
5. Estimating probabilities
- we can estimate these probabilities accurately from corpora (sketched in the snippet below)
- P(word_i | word_{i-2} word_{i-1}) = count(word_{i-2} word_{i-1} word_i) / count(word_{i-2} word_{i-1})
- a large corpus is a crucial ingredient
- the Web now gives us a very large corpus (about 1 teraword)
- speech recognition did not start making steady progress until it adopted probabilistic models in the 1980s
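
A minimal Python sketch of this trigram estimate; the toy token list and function name are illustrative, not from the lecture, and a real system would train on a far larger corpus.

    from collections import Counter

    def make_trigram_model(tokens):
        trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
        bigrams = Counter(zip(tokens, tokens[1:]))
        def p(w, prev2, prev1):
            # P(w | prev2 prev1) = count(prev2 prev1 w) / count(prev2 prev1)
            if bigrams[(prev2, prev1)] == 0:
                return 0.0   # unseen context; a real system would smooth or back off
            return trigrams[(prev2, prev1, w)] / bigrams[(prev2, prev1)]
        return p

    tokens = "the strong silent type sat near the strong silent sea".split()
    p = make_trigram_model(tokens)
    print(p("type", "strong", "silent"))   # 0.5 on this toy corpus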
6. Part-of-speech tagging
- Can we tag parts of speech using a dictionary and a few rules?
- Problem: dictionaries do not distinguish common from rare parts of speech
- "a" is listed as a noun (the first entry in the dictionary)
- "number" is listed as an adjective
- we need to know what the common parts of speech of a word are
- Problem: rules are hard to write
- we can do quite well knowing likely sequences of parts of speech
- we can get these from a POS-tagged corpus
7. Parsing
- Writing a grammar for real sentences is really hard
- if the grammar is not rich enough, many sentences won't parse
- if the grammar is too rich, you will drown in parses
- a rule you add for one problem sentence will be applied to lots of other sentences
- Need to know which constructs are likely and try those first
- estimate likelihood from a corpus of parses
- Parser performance was stuck until probabilistic parsers were developed in the 1990s
- based on a hand-parsed corpus
- linguistics re-enters through corpus annotation
8. Machine translation
- The first NLP application (dating from the early 1950s)
- For linguistically divergent languages (e.g., English-Arabic), probabilistic models now beat rule-based systems
- P(English phrase | Arabic phrase)
- trained from very large bi-texts (parallel bilingual texts)
9. Estimating Probabilities
- We assume some infinite random source of data, with a certain distribution
- e.g., P(x) = 0.75 for some outcome x
- We estimate this probability by taking a data sample and counting how often x appears
- P(x) = count(x) / N, where N = number of samples
- the bigger the sample, the better the estimate
10. Conditional Probabilities
- A conditional probability P(X | Y) measures how often X happens in those cases where Y happens
- P(w_i | w_{i-1})
- e.g., P(w_i = dog | w_{i-1} = hot)
11. Noisy Channel Model
[Diagram: word -> noisy channel -> noisy word -> decoder -> guess at original word]
- a useful model for many tasks
- spelling correction
- speech recognition
- machine translation
- an Arabic sentence is a noisy version of an English sentence
12. Decoder
- The job of the decoder is to make a good guess at the original input
- good guess = most likely input
- w* = argmax_{w in V} P(w | O)
- where O = observed data (text, sound)
- V = vocabulary of possible guesses
- P(w | O) = probability of the word given the observation
13. P(w | O)
- How do we estimate P(w | O)?
- We will compute it from two probabilities that are easier to estimate, using Bayes' rule
14. Bayes' Rule
- P(x | y) = P(y | x) P(x) / P(y)
- based on the definition of conditional probability:
- P(x | y) = P(x ∧ y) / P(y)
- P(x ∧ y) = P(x | y) P(y) = P(y | x) P(x)
- example: P(circle ∧ green) = P(circle) P(green | circle) = P(green) P(circle | green) (checked numerically in the snippet below)
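
A small numeric check of Bayes' rule in Python, using made-up shape counts that mirror the circle/green example above:

    # Hypothetical counts of (shape, color) observations
    counts = {("circle", "green"): 3, ("circle", "red"): 1,
              ("square", "green"): 2, ("square", "red"): 4}
    N = sum(counts.values())                                                        # 10

    p_circle = sum(v for (shape, _), v in counts.items() if shape == "circle") / N  # 0.4
    p_green  = sum(v for (_, color), v in counts.items() if color == "green") / N   # 0.5
    p_joint  = counts[("circle", "green")] / N                                      # 0.3

    p_circle_given_green = p_joint / p_green     # 0.6
    p_green_given_circle = p_joint / p_circle    # 0.75

    # Bayes' rule: P(circle | green) = P(green | circle) * P(circle) / P(green)
    assert abs(p_circle_given_green - p_green_given_circle * p_circle / p_green) < 1e-12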
15. Using Bayes' Rule
- P(w | O) = P(O | w) P(w) / P(O)
- w* = argmax_{w in V} P(w | O)
-    = argmax_{w in V} P(O | w) P(w) / P(O)
-    = argmax_{w in V} P(O | w) P(w), since P(O) does not depend on w and can be dropped
- P(O | w) is the likelihood; P(w) is the prior
16. Computing the prior
- Take a large corpus (N words)
- Count the frequency of each word w
- P(w) = count(w) / N
- may want to smooth so that no word has P = 0 (a sketch with simple smoothing follows below)
- For the moment, compute the probability of a word in isolation
- in Chapter 6, we will consider words in context
- probabilities of n-grams (sequences of n words)
- very large corpora (the Web's roughly 1 teraword) allow for good estimates
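
A sketch of the prior in Python; add-one (Laplace) smoothing is used here only as one simple way to keep every vocabulary word above zero probability, and the corpus, vocabulary, and function name are illustrative:

    from collections import Counter

    def make_prior(corpus_tokens, vocabulary):
        # P(w) = (count(w) + 1) / (N + |V|): add-one smoothing so no word has P = 0
        counts = Counter(corpus_tokens)
        N = len(corpus_tokens)
        V = len(vocabulary)
        return lambda w: (counts[w] + 1) / (N + V)

    vocab = {"the", "a", "hot", "dog", "bog", "ate", "i"}
    prior = make_prior("i ate a hot dog the dog ate".split(), vocab)
    print(prior("dog"), prior("bog"))   # seen word vs. unseen word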
17. Computing the likelihood (channel probability)
- Likelihood of error depends on the source of error
- typing error
- OCR error
- handwriting recognition error
- noisy room (speech recognition)
- P(O | w) = count(typed O when I meant to type w) / count(meant to type w)
- collect these counts over a large corpus (how large?); see the sketch below
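
One way this estimate might look in Python, assuming a hypothetical list of (intended word, typed word) observations; the data and function name are illustrative only:

    from collections import Counter

    def make_channel(observations):
        # observations: (intended_word, typed_word) pairs from an error-annotated corpus
        # P(O | w) = count(typed O when w was intended) / count(w intended)
        typed_given_intended = Counter(observations)
        intended = Counter(w for w, _ in observations)
        def p(O, w):
            return typed_given_intended[(w, O)] / intended[w] if intended[w] else 0.0
        return p

    channel = make_channel([("coat", "cot"), ("coat", "coat"), ("coat", "coat"), ("cut", "cot")])
    print(channel("cot", "coat"))   # 1/3 on this made-up data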
18. Building a probabilistic model
- In any task, there is a trade-off:
- very detailed model
- can capture reality better
- but needs lots of training data
- may not be well trained with limited data
- simpler model
- lumps cases together
- can be trained with less data
19. Models of spelling error
- Most errors involve a single change (the sketch below enumerates all single-edit candidates):
- insertion (cot -> coat)
- substitution (cot -> cut)
- deletion (cot -> ct)
- transposition (cot -> cto)
- Estimate probabilities for:
- inserting y after x
- substituting y for x
- deleting y after x
- transposing x and y
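
A candidate-generation sketch in the spirit of Norvig's well-known spelling corrector: it enumerates every string one insertion, deletion, substitution, or transposition away from a word, so the correction process on the next slide can filter the results against a dictionary. The alphabet and function name are assumptions for illustration:

    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def edits1(word):
        # All strings reachable from `word` by one of the four single edits above
        splits      = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes     = [L + R[1:]               for L, R in splits if R]
        transposes  = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        substitutes = [L + c + R[1:]           for L, R in splits if R for c in ALPHABET]
        inserts     = [L + c + R               for L, R in splits for c in ALPHABET]
        return set(deletes + transposes + substitutes + inserts)

    print("coat" in edits1("cot"), "cut" in edits1("cot"), "ct" in edits1("cot"))   # True True True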
20. Spelling correction process
- Given a typed word t:
- check if it is in the dictionary
- if not, generate all candidates w for which
- w is in the dictionary, and
- t can be produced from w by a single change (substitution, insertion, deletion, transposition)
- if there are no candidates, give up
- for each candidate, compute P(t | w) P(w)
- return the top-ranked (most likely) candidate
- (the whole process is sketched in the snippet below)
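
Putting the steps together, a sketch of the correction process that reuses the illustrative edits1, make_channel, and make_prior helpers from the earlier snippets (none of these are the lecture's own code):

    def correct(t, dictionary, channel, prior):
        # Return t unchanged if it is a known word; otherwise pick the in-dictionary
        # single-edit candidate maximizing P(t | w) * P(w), or None if there is none.
        if t in dictionary:
            return t
        candidates = [w for w in edits1(t) if w in dictionary]
        if not candidates:
            return None   # give up
        return max(candidates, key=lambda w: channel(t, w) * prior(w))

On a toy setup, correct("cot", {"coat", "cut", "cost"}, channel, prior) returns whichever candidate the channel and prior jointly score highest.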
21. Choosing a model
- We can assume that all changes are equally likely
- OR
- We can assume that different errors have different probabilities
- e.g., that P(inserting X after Y) depends only on X
- or that P(inserting X after Y) depends on both X and Y
22. Estimating model probabilities
- We can annotate lots of data and count
- not much fun
- We can run a correction program with a simple model, gather correction counts, and use them to build a better model
23. Real-word spelling correction
- (J&M 6.6)
- The approach so far only handles errors which produce non-words
- but a good spelling corrector would also fix words which are very unlikely in context
- "I ate a hot bog."
24. Conditional Probabilities
- We capture context using conditional probabilities
- P(dog | I ate a hot)
- P(bog | I ate a hot)
25. Bigrams
- But gathering statistics on entire sentences requires a very large corpus, so we typically look at smaller contexts
- P(w_i | w_{i-1}): bigrams
- may have to smooth to account for unseen bigrams (discussed extensively in Chapter 6); one simple option is sketched below
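
A bigram model sketch with add-one smoothing, chosen here only because it is the simplest option; Chapter 6 covers better smoothing methods. Names and toy data are illustrative:

    from collections import Counter

    def make_bigram_model(tokens, vocabulary):
        # P(w_i | w_{i-1}) with add-one smoothing so unseen bigrams keep a small
        # nonzero probability
        bigrams = Counter(zip(tokens, tokens[1:]))
        unigrams = Counter(tokens)
        V = len(vocabulary)
        return lambda w, prev: (bigrams[(prev, w)] + 1) / (unigrams[prev] + V)

    vocab = {"i", "ate", "a", "hot", "dog", "bog"}
    bigram = make_bigram_model("i ate a hot dog".split(), vocab)
    print(bigram("dog", "hot"), bigram("bog", "hot"))   # seen vs. unseen bigram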
26. Real-word spelling correction
- The correction process is the same
- now applied to all words
- w* = argmax_{w in V} P(O | w) P(w | w_{i-1})
- include the no-error probability P(O | w) for the case O = w
- (a sketch of this context-sensitive variant follows below)
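
A sketch of the context-sensitive variant, reusing the earlier illustrative helpers. The constant p_no_error, and the choice to spread the remaining channel mass over the single-edit candidates, are assumptions for illustration rather than anything specified on the slide:

    def correct_in_context(t, prev, dictionary, channel, bigram, p_no_error=0.95):
        # Score every in-dictionary word one edit away from t, plus t itself,
        # by P(t | w) * P(w | prev); p_no_error is an assumed value for P(t | t)
        candidates = {w for w in edits1(t) if w in dictionary}
        if t in dictionary:
            candidates.add(t)
        def score(w):
            p_channel = p_no_error if w == t else (1 - p_no_error) * channel(t, w)
            return p_channel * bigram(w, prev)
        return max(candidates, key=score) if candidates else t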
27. Extending the corrector: multiple errors
- Can we extend our approach to handle multiple errors in a single word?
- Problem: multiple derivations
- Minimum edit distance = the fewest changes needed to convert X into Y
28. Minimum edit distance
- First try at finding min_edit_distance(X, Y):
- let N = max(length(X), length(Y))
- for i = 1 to N
- starting from X, generate all strings obtainable by making i changes
- is one of them Y?
- if yes, stop: min edit distance = i
- Verrrrrrrry slow (the number of candidate strings grows exponentially with i)
29. Minimum edit distance (2)
- N <- length(Y)
- M <- length(X)
- create matrix distance[N+1, M+1]
- distance[0,0] = 0
- for i = 0 to N
-   for j = 0 to M
-     distance[i,j] = min( distance[i-1,j] + insertion_cost(Y_i),
                           distance[i-1,j-1] + substitution_cost(X_j, Y_i),
                           distance[i,j-1] + deletion_cost(X_j) )
- a dynamic programming algorithm
- computes min edit distance in M*N steps (a runnable version with unit costs is given below)
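
A runnable version of the recurrence with unit costs; the base cases for the first row and column, left implicit on the slide, are filled in explicitly:

    def min_edit_distance(X, Y):
        # Bottom-up dynamic programming; runs in O(len(X) * len(Y)) steps
        M, N = len(X), len(Y)
        distance = [[0] * (M + 1) for _ in range(N + 1)]   # distance[i][j]: Y[:i] vs X[:j]
        for i in range(1, N + 1):
            distance[i][0] = i              # build Y[:i] from the empty string: i insertions
        for j in range(1, M + 1):
            distance[0][j] = j              # erase X[:j]: j deletions
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                sub = 0 if X[j - 1] == Y[i - 1] else 1
                distance[i][j] = min(distance[i - 1][j] + 1,         # insertion
                                     distance[i][j - 1] + 1,         # deletion
                                     distance[i - 1][j - 1] + sub)   # substitution (free if equal)
        return distance[N][M]

    print(min_edit_distance("cot", "coat"))   # 1
    print(min_edit_distance("cot", "cto"))    # 2, since this basic version has no transposition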
30. Other improvements
- Some spelling errors are typos
- Others come from not knowing how to spell a word
- those errors may involve multiple letters
- a letter-by-letter model would make such errors very unlikely
- look for larger patterns
- make use of pronunciation rules
- x <--> cks