Probabilistic Models of Language and spelling correction

1
Probabilistic Models of Language and spelling correction
2
Why probabilistic models?
  • Our job is to analyze speech and language
  • for a speech recognizer, to figure out the words
  • for a language analyzer, to figure out the
    structure of a sentence
  • the part of speech of each word
  • the argument/modifier relations between words
  • for machine translation, to find a good
    translation of a source sentence

3
Linguistic theory?
  • By itself, linguistic theory is not very helpful
  • tells us what is possible, but not what is likely
  • does a good job for core grammatical phenomena,
    but ignores many real issues

4
Speech recognition
  • We don't understand words in isolation
  • recognition is heavily dependent on our
    expectations
  • particularly in noisy environments
  • we need to capture those expectations
  • what word(s) are most likely coming up next?
  • linguistics (grammar, semantics) doesn't help
    much
  • but we can make a good prediction based on the
    immediately preceding words
  • P(wordi | wordi-2 wordi-1)
  • e.g., strong silent ___

5
Estimating probabilities
  • we can estimate these probabilities accurately
    from corpora
  • P(wordi | wordi-2 wordi-1) = count(wordi-2
    wordi-1 wordi) / count(wordi-2 wordi-1)
    (see the sketch below)
  • a large corpus is a crucial ingredient
  • Web now gives us a very large corpus (about 1
    teraword)
  • speech recognition did not start making steady
    progress until it adopted probabilistic models in
    the 1980s
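
A minimal Python sketch of this estimate over a toy corpus; the corpus, function names, and start/end padding tokens are illustrative assumptions, not part of the slides.

  from collections import Counter

  def train_counts(sentences):
      """Count bigrams and trigrams over tokenized sentences."""
      bigrams, trigrams = Counter(), Counter()
      for tokens in sentences:
          padded = ["<s>", "<s>"] + tokens + ["</s>"]
          for i in range(2, len(padded)):
              bigrams[(padded[i-2], padded[i-1])] += 1
              trigrams[(padded[i-2], padded[i-1], padded[i])] += 1
      return bigrams, trigrams

  def p_trigram(w, w2, w1, bigrams, trigrams):
      """P(w | w2 w1) = count(w2 w1 w) / count(w2 w1)."""
      denom = bigrams[(w2, w1)]
      return trigrams[(w2, w1, w)] / denom if denom else 0.0

  # hypothetical toy corpus of tokenized sentences
  corpus = [["the", "strong", "silent", "type"],
            ["the", "strong", "silent", "man"]]
  bi, tri = train_counts(corpus)
  print(p_trigram("type", "strong", "silent", bi, tri))   # 0.5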

6
Part of speech tagging
  • Can we tag parts of speech using a dictionary and
    a few rules?
  • Problem: dictionaries do not distinguish common
    from rare parts of speech
  • a is a noun (first entry in the dictionary)
  • number is an adjective
  • need to know what the common parts of speech of a
    word are
  • Problem: rules are hard to write
  • can do quite well knowing likely sequences of
    parts of speech
  • can get these from a POS-tagged corpus (see the
    sketch below)
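
A minimal sketch of the simplest way to use such a corpus: tag each word with its most frequent tag. The toy tagged data and function name are assumptions for illustration.

  from collections import Counter, defaultdict

  def most_frequent_tag(tagged_corpus):
      """tagged_corpus: iterable of (word, tag) pairs from a hand-tagged corpus."""
      counts = defaultdict(Counter)
      for word, tag in tagged_corpus:
          counts[word][tag] += 1
      return {w: c.most_common(1)[0][0] for w, c in counts.items()}

  # hypothetical toy data: "a" is far more often a determiner than a noun
  tagged = [("a", "DT"), ("a", "DT"), ("a", "NN"),
            ("number", "NN"), ("number", "JJ"), ("number", "NN")]
  best = most_frequent_tag(tagged)
  print(best["a"], best["number"])   # DT NN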

7
Parsing
  • Writing a grammar for real sentences is real hard
  • if grammar is not rich enough, many sentences
    won't parse
  • if grammar is too rich, you will drown in parses
  • a rule you add for a problem sentence will be
    applied to lots of other sentences
  • Need to know what constructs are likely and try
    those first
  • estimate likelihood from a corpus of parses
  • Parser performance was stuck until probabilistic
    parsers were developed in the 1990s
  • based on a hand-parsed corpus
  • linguistics re-enters through corpus annotation

8
Machine translation
  • The first NLP application (since the early 1950s)
  • For linguistically divergent languages (e.g.,
    English-Arabic), probabilistic models now beat
    rule-based systems
  • P(English phrase | Arabic phrase)
  • trained from very large bi-texts (parallel
    bilingual texts)

9
Estimating Probabilities
  • We assume some infinite random source of data,
    with a certain distribution
  • e.g., P(x) = 0.75 for some symbol x in the stream
  • We estimate this probability by taking a data
    sample and counting how often x appears
  • P(x) = count(x) / N, where N = number of samples
  • the bigger the sample, the better the estimate

10
Conditional Probabilities
  • A conditional probability P(X Y) measures how
    often X happens in those cases where Y happens
  • P(wi | wi-1)
  • e.g., P(wi = dog | wi-1 = hot)

11
Noisy Channel Model
[Diagram: word -> noisy channel -> noisy word -> decoder -> guess at original word]
  • a useful model for many tasks
  • spelling correction
  • speech recognition
  • machine translation
  • Arabic sentence is a noisy version of English

12
Decoder
  • The job of the decoder is to make a good guess at
    the original input
  • good guess = most likely input
  • ŵ = argmax P(w | O)
  • w in V
  • where O = observed data (text, sound)
  • V = vocabulary = possible guesses
  • P(w | O) = prob. of word given observation

13
P(w | O)
  • How do we estimate P(w | O)?
  • We will compute it from two probabilities that
    are easier to estimate, using Bayes rule

14
Bayes Rule
  • P(x | y) = P(y | x) P(x) / P(y)
  • based on the definition of conditional probability
  • P(x | y) = P(x, y) / P(y)
  • P(x, y) = P(x | y) P(y) = P(y | x) P(x)

Example: P(circle, green) = P(green) P(circle | green) = P(circle) P(green | circle)
15
Using Bayes Rule
  • P(w | O) = P(O | w) P(w) / P(O)
  • ŵ = argmax P(w | O)
  • w in V
  • = argmax P(O | w) P(w) / P(O)
  • w in V
  • = argmax P(O | w) P(w)
  • w in V      [P(O | w) is the likelihood, P(w) the prior]

16
Computing the prior
  • Take a large corpus (N words)
  • Count frequency of each word w
  • P(w) = count(w) / N (see the sketch below)
  • may want to smooth so that no word has P = 0
  • For the moment, compute prob. in isolation
  • in Chapter 6, will consider words in context
  • probabilities of n-grams (sequences of n words)
  • very large corpora (web, about 1 teraword) allow
    for good estimates
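
A minimal sketch of computing the prior from unigram counts, using add-one smoothing as one simple way to keep any word from having P = 0; the toy corpus and vocabulary are assumptions.

  from collections import Counter

  def unigram_prior(tokens, vocab):
      """Add-one smoothed prior: P(w) = (count(w) + 1) / (N + |V|)."""
      counts = Counter(tokens)
      n = len(tokens)
      return lambda w: (counts[w] + 1) / (n + len(vocab))

  tokens = "the cat sat on the mat".split()   # hypothetical corpus
  vocab = set(tokens) | {"dog"}               # vocabulary includes an unseen word
  p = unigram_prior(tokens, vocab)
  print(p("the"), p("dog"))                   # "dog" still gets a small nonzero prior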

17
Computing the likelihood (channel probability)
  • Likelihood of error depends on source of error
  • typing error
  • OCR error
  • handwriting recognition error
  • noisy room (speech recognition)
  • P(O | w) = count(typed O when I meant to type w) /
    count(meant to type w)
  • collect these counts over a large corpus (how
    large?); see the sketch below
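
A minimal sketch of estimating the channel probability from a hypothetical collection of (intended word, typed word) observations, following the count formula above; all names here are illustrative.

  from collections import Counter

  def channel_model(observations):
      """observations: iterable of (intended_w, typed_o) pairs, e.g. from typing logs."""
      pair_counts = Counter(observations)                    # count(meant w, typed O)
      intended_counts = Counter(w for w, _ in observations)  # count(meant to type w)
      def p(o, w):
          return pair_counts[(w, o)] / intended_counts[w] if intended_counts[w] else 0.0
      return p

  # hypothetical error data
  obs = [("the", "teh"), ("the", "the"), ("the", "the"), ("coat", "cot")]
  p = channel_model(obs)
  print(p("teh", "the"))   # 1/3: "the" was typed as "teh" once out of three attempts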

18
Building a probabilistic model
  • In any task, there is a trade-off
  • very detailed model
  • can capture reality better
  • but needs lots of training data
  • may not be well trained with limited data
  • simpler model
  • lumps cases together
  • can be trained with less data

19
Models of spelling error
  • Most errors involve a single
  • insertion (cot -> coat)
  • substitution (cot -> cut)
  • deletion (cot -> ct)
  • transposition (cot -> cto)
  • Estimate probabilities for
  • inserting y after x
  • substituting y for x
  • deleting y after x
  • transposing x and y

20
Spelling correction process
  • Given typed word t
  • check whether it is in the dictionary
  • if not, generate all candidates w such that
  • w is in the dictionary
  • t can be produced from w by a single
    change(substitution, insertion, deletion,
    transposition)
  • if none, give up
  • for each candidate, compute P(t | w) P(w)
  • return the top-ranked (most likely) candidate (see
    the sketch below)
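
A minimal sketch of this correction loop, in the spirit of Norvig's well-known spelling corrector: generate every dictionary word one edit away from the typed word and rank by P(t | w) P(w). The uniform per-edit channel probability and the toy dictionary/prior are simplifying assumptions, not part of the slides.

  import string

  def single_edits(t):
      """All strings one insertion, deletion, substitution, or transposition away from t."""
      splits = [(t[:i], t[i:]) for i in range(len(t) + 1)]
      deletes     = [a + b[1:] for a, b in splits if b]
      transposes  = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
      substitutes = [a + c + b[1:] for a, b in splits if b for c in string.ascii_lowercase]
      inserts     = [a + c + b for a, b in splits for c in string.ascii_lowercase]
      return set(deletes + transposes + substitutes + inserts)

  def correct(t, dictionary, prior, p_edit=0.001):
      """Return the most likely intended word w, ranked by P(t | w) * P(w)."""
      if t in dictionary:
          return t
      candidates = [w for w in single_edits(t) if w in dictionary]
      if not candidates:
          return t                                # give up
      return max(candidates, key=lambda w: p_edit * prior.get(w, 0.0))

  # hypothetical dictionary with unigram priors
  prior = {"coat": 0.002, "cot": 0.001, "cat": 0.005}
  print(correct("ct", prior, prior))   # 'cat': both 'cat' and 'cot' are one edit away,
                                       # but 'cat' has the higher prior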

21
Choosing a model
  • We can assume that all changes are equally likely
  • OR
  • We can assume that different errors have
    different probabilities
  • that P(inserting X after Y) depends on X
  • that P(inserting X after Y) depends on X and Y

22
Estimating model probabilities
  • We can annotate lots of data and count
  • not much fun
  • We can run correction program with a simple
    model, gather correction counts, and use them to
    build a better model

23
Real-word spelling correction
  • (J&M 6.6)
  • The approach so far only handles errors which
    produce non-words
  • but a good spelling corrector would fix words
    which are very unlikely in context
  • I ate a hot bog.

24
Conditional Probabilities
  • We capture context using conditional
    probabilities
  • P(dog | I ate a hot)
  • P(bog | I ate a hot)

25
Bigrams
  • But gathering statistics on entire sentences
    requires a very large corpus, so we typically
    look at smaller contexts
  • P(wi | wi-1): bigrams
  • may have to smooth to account for unseen
    bigrams (discussed extensively in Chapter 6); see
    the sketch below
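
A minimal sketch of an add-one smoothed bigram estimate of P(wi | wi-1), so unseen bigrams keep a small nonzero probability; the toy corpus is an assumption.

  from collections import Counter

  def bigram_model(tokens, vocab):
      """Add-one smoothed estimate: P(w | prev) = (count(prev w) + 1) / (count(prev) + |V|)."""
      unigrams = Counter(tokens)
      bigrams = Counter(zip(tokens, tokens[1:]))
      v = len(vocab)
      return lambda w, prev: (bigrams[(prev, w)] + 1) / (unigrams[prev] + v)

  tokens = "i ate a hot dog i ate a hot dog".split()   # hypothetical corpus
  p = bigram_model(tokens, set(tokens) | {"bog"})
  print(p("dog", "hot"), p("bog", "hot"))              # P(dog | hot) > P(bog | hot)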

26
Real-word spelling correction
  • Correction process is the same
  • now applied to all words
  • argmax P(O | w) P(w | wi-1)
  • w in V
  • include the no-error probability P(O | w) for the
    case O = w

27
Extending the corrector: multiple errors
  • Can we extend our approach to handle multiple
    errors in a single word?
  • Problem: multiple derivations
  • Minimum edit distance = fewest changes to convert
    X into Y

28
Minimum edit distance
  • A first try at computing min edit distance(X, Y):
  • let N = max(length(X), length(Y))
  • for i = 1 to N
  • starting from X, generate all strings by making i
    changes
  • is one of them Y?
  • if yes, quit: min edit distance = i
  • Verrrrrrrry slow

29
Minimum edit distance (2)
  • N <- length(Y)
  • M <- length(X)
  • create matrix distance[N+1, M+1]
  • distance[0,0] = 0
  • for i = 0 to N
  • for j = 0 to M
  • distance[i,j] = min(distance[i-1,j] + insertion_cost(Yi),
    distance[i-1,j-1] + substitution_cost(Xj, Yi),
    distance[i,j-1] + deletion_cost(Xj))
  • dynamic programming algorithm (a runnable sketch
    follows below)
  • computes min edit distance in M·N steps
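
A minimal runnable version of this recurrence; the slide leaves the cost functions abstract, so unit costs for insertion, deletion, and substitution are assumed here.

  def min_edit_distance(x, y):
      """distance[i][j] = fewest edits converting x[:j] into y[:i] (unit costs)."""
      n, m = len(y), len(x)
      distance = [[0] * (m + 1) for _ in range(n + 1)]
      for i in range(1, n + 1):
          distance[i][0] = i                   # insert the first i characters of y
      for j in range(1, m + 1):
          distance[0][j] = j                   # delete the first j characters of x
      for i in range(1, n + 1):
          for j in range(1, m + 1):
              sub = 0 if x[j - 1] == y[i - 1] else 1
              distance[i][j] = min(distance[i - 1][j] + 1,        # insertion
                                   distance[i - 1][j - 1] + sub,  # substitution (or match)
                                   distance[i][j - 1] + 1)        # deletion
      return distance[n][m]

  print(min_edit_distance("cot", "coat"))   # 1 (a single insertion)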

30
Other improvements
  • Some spelling errors are typos
  • Others come from not knowing how to spell a word
  • those errors may involve multiple letters
  • a letter-by-letter model would make such errors
    very unlikely
  • look for larger patterns
  • make use of pronunciation rules
  • x <--> cks