Probabilistic Detection of Context-Sensitive Spelling Errors


1
Probabilistic Detection of Context-Sensitive
Spelling Errors
  • Johnny Bigert
  • Royal Institute of Technology, Sweden
  • johnny_at_kth.se

2
What?
  • Context-Sensitive Spelling Errors
  • Example: Nice whether today.
  • All words found in dictionary
  • If context is considered, the spelling of "whether"
    is incorrect

3
Why?
  • Why do we need detection of context-sensitive
    spelling errors?
  • These errors are quite frequent (reports on
    16-40% of all errors)
  • Larger dictionaries result in more errors
    undetected
  • They cannot be found by regular spell checkers!

4
Why not?
  • What about proposing corrections for the errors?
  • An interesting topic, but not the topic of this
    article
  • Detection is imperative, correction is an aid

5
Related work?
  • Are there no algorithms doing this already?
  • A full parser is perfect for the job
  • Drawbacks
  • high accuracy is required
  • not available for many languages
  • manual labor is expensive
  • not robust

6
Related work?
  • Are there no other algorithms?
  • Several other algorithms (e.g. Winnow)
  • Some do correction
  • Drawbacks
  • They require a set of easily confused words
  • Normally, you don't know your spelling errors
    beforehand

7
Why?
  • What are the benefits of this algorithm?
  • Find any error
  • Avoid extensive manual work
  • Robustness

8
How?
  • Prerequisites
  • We use PoS tag trigram frequencies from an
    annotated corpus
  • We are given a sentence, and apply a PoS tagger
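
A minimal sketch of the first prerequisite, counting PoS tag trigrams in an annotated corpus. The corpus format (lists of word/tag pairs), the boundary tags `<s>` and `</s>`, and the Penn-style tag names are illustrative assumptions, not details from the presentation.

```python
from collections import Counter


def count_tag_trigrams(tagged_sentences):
    """Count PoS tag trigrams over sentences given as lists of (word, tag) pairs."""
    counts = Counter()
    for sentence in tagged_sentences:
        # Hypothetical boundary tags, so sentence edges also contribute trigrams.
        tags = ["<s>", "<s>"] + [tag for _, tag in sentence] + ["</s>"]
        counts.update(zip(tags, tags[1:], tags[2:]))
    return counts


# Toy annotated corpus; the tags are illustrative only.
corpus = [
    [("Nice", "JJ"), ("weather", "NN"), ("today", "NN")],
    [("Cold", "JJ"), ("weather", "NN"), ("today", "NN")],
]
trigram_counts = count_tag_trigrams(corpus)
```

Here `trigram_counts[("JJ", "NN", "NN")]` is 2, since both toy sentences contain that tag sequence.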

9
How?
  • Basic assumption
  • If any tag trigram frequency is low, that part
    is probably ungrammatical
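
The basic assumption can be sketched as a scan over the tag trigrams of a tagged sentence. The function name, the threshold value and the tags (including `CS` for the conjunction "whether") are assumptions made for illustration.

```python
def low_frequency_positions(tags, trigram_counts, threshold):
    """Return start indices of tag trigrams whose corpus count is below threshold."""
    flagged = []
    for i in range(len(tags) - 2):
        trigram = tuple(tags[i:i + 3])
        # Unseen trigrams count as zero.
        if trigram_counts.get(trigram, 0) < threshold:
            flagged.append(i)
    return flagged


# Toy counts; real counts would come from an annotated corpus.
toy_counts = {("JJ", "NN", "NN"): 120}

# "Nice weather today" vs. "Nice whether today" under the toy tags.
ok_positions = low_frequency_positions(["JJ", "NN", "NN"], toy_counts, 5)
bad_positions = low_frequency_positions(["JJ", "CS", "NN"], toy_counts, 5)
```

With these toy counts, `ok_positions` is empty and `bad_positions` is `[0]`, pointing at the trigram that contains the misused word.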

10
But?
  • But don't you often encounter rare or unseen
    trigrams?
  • Yes, unfortunately
  • We modify the notion of frequency
  • Find and use other, syntactically close PoS
    trigrams

11
Close?
  • What is the syntactic distance between two PoS
    tags?
  • A probability that one tag is replaceable by
    another
  • Retain grammaticality
  • Distances extracted from corpus
  • Unsupervised learning algorithm
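
The slides do not say how the replacement probabilities are extracted. As a stand-in only, the sketch below uses a simple shared-context measure: tag a is considered replaceable by tag b to the extent that a's (left, right) tag contexts have also been seen around b. This measure, the function names and the tags are assumptions, not the author's method.

```python
from collections import Counter, defaultdict


def context_counts(tag_sequences):
    """For every tag, count the (left, right) tag contexts it occurs in."""
    contexts = defaultdict(Counter)
    for tags in tag_sequences:
        for left, mid, right in zip(tags, tags[1:], tags[2:]):
            contexts[mid][(left, right)] += 1
    return contexts


def replaceability(a, b, contexts):
    """Share of a's context occurrences in which b has also been observed."""
    total = sum(contexts[a].values())
    if total == 0:
        return 0.0
    shared = sum(n for ctx, n in contexts[a].items() if ctx in contexts[b])
    return shared / total


# Toy tag sequences; no manual annotation beyond the tags themselves is used.
sequences = [
    ["DT", "JJ", "NN", "VB"],
    ["DT", "VBG", "NN", "VB"],
    ["DT", "JJ", "NNS", "VB"],
]
contexts = context_counts(sequences)
jj_by_vbg = replaceability("JJ", "VBG", contexts)
vbg_by_jj = replaceability("VBG", "JJ", contexts)
```

Note that the measure is asymmetric: in the toy data, `vbg_by_jj` is 1.0 while `jj_by_vbg` is 0.5.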

12
Then?
  • The algorithm
  • We have a generalized PoS tag trigram frequency
  • If frequency below threshold, text is probably
    ungrammatical
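
One possible reading of the generalized frequency, sketched under assumptions: the observed count of a trigram plus the counts of trigrams reachable by replacing a single tag, each weighted by the replacement probability. The presentation does not give the formula; the weighting scheme, the `replaceable` table and the threshold are hypothetical.

```python
def generalized_frequency(trigram, trigram_counts, replaceable):
    """Observed count plus weighted counts of single-tag substitutions."""
    total = float(trigram_counts.get(trigram, 0))
    for pos, tag in enumerate(trigram):
        for other, prob in replaceable.get(tag, {}).items():
            neighbor = trigram[:pos] + (other,) + trigram[pos + 1:]
            total += prob * trigram_counts.get(neighbor, 0)
    return total


def probably_ungrammatical(trigram, trigram_counts, replaceable, threshold):
    """Apply the threshold test to the generalized frequency."""
    return generalized_frequency(trigram, trigram_counts, replaceable) < threshold


# Toy data: (DT, VBG, NN) is unseen, but VBG is assumed replaceable by JJ.
toy_counts = {("DT", "JJ", "NN"): 100}
toy_replaceable = {"VBG": {"JJ": 0.5}}

rare_but_fine = generalized_frequency(("DT", "VBG", "NN"), toy_counts, toy_replaceable)
truly_odd = generalized_frequency(("NN", "DT", "DT"), toy_counts, toy_replaceable)
```

The unseen trigram borrows support from its syntactically close neighbor (50.0 here), while a trigram with no close neighbors stays at 0.0 and falls below any positive threshold.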

13
Result?
  • Summary so far
  • Unsupervised learning
  • Automatic algorithm
  • Detection of any error
  • No manual labor!
  • Alas, phrase boundaries cause problems

14
Phrases?
  • What about phrases?
  • PoS tag trigrams overlapping two phrases are very
    productive
  • Rare phrases, rare trigrams
  • Transformations!

15
Transform?
  • How do we transform a phrase?
  • Shallow parser
  • Transform phrases to most common form
  • Normally, the head
  • Benefits: retain grammaticality, fewer rare
    trigrams, longer tagger scope

16
Example?
  • Example of phrase transformation
  • Only the paintings that are old are for sale
  • Only the paintings are for sale
  • (Figure: the noun phrase in each sentence is labeled NP)

17
Then what?
  • How do we use the transformations?
  • Apply tagger to transformed sentence
  • Run first part of algorithm again
  • If any transformation yields only trigrams with
    high frequency, sentence ok
  • Otherwise, probable error
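
The decision procedure on this slide can be sketched as follows. The tagger, the shallow-parser transformations and the frequency function are passed in as callables; the toy versions below (a lookup lexicon, a hard-coded relative-clause removal, a two-valued frequency) are hypothetical stand-ins, not the components used by the author.

```python
def all_trigrams_frequent(tags, frequency, threshold):
    """True if every tag trigram in the sentence reaches the threshold."""
    return all(
        frequency(tuple(tags[i:i + 3])) >= threshold
        for i in range(len(tags) - 2)
    )


def probable_error(words, tag, transformations, frequency, threshold):
    """Accept the sentence if it, or any transformed version, has only frequent trigrams."""
    if all_trigrams_frequent(tag(words), frequency, threshold):
        return False
    for transformed in transformations(words):
        # Re-tag the transformed sentence and rerun the trigram test.
        if all_trigrams_frequent(tag(transformed), frequency, threshold):
            return False
    return True


# Toy stand-ins for the tagger, shallow parser and frequency table.
LEXICON = {"Only": "RB", "the": "DT", "paintings": "NNS", "that": "WDT",
           "are": "VBP", "old": "JJ", "for": "IN", "sale": "NN"}
COMMON = {("RB", "DT", "NNS"), ("DT", "NNS", "VBP"),
          ("NNS", "VBP", "IN"), ("VBP", "IN", "NN")}


def toy_tag(words):
    return [LEXICON[w] for w in words]


def toy_transformations(words):
    # Reduce "the paintings that are old" to "the paintings".
    if words[3:6] == ["that", "are", "old"]:
        yield words[:3] + words[6:]


def toy_frequency(trigram):
    return 50 if trigram in COMMON else 0


good = "Only the paintings that are old are for sale".split()
bad = "Only the paintings that are old sale for are".split()
good_flagged = probable_error(good, toy_tag, toy_transformations, toy_frequency, 5)
bad_flagged = probable_error(bad, toy_tag, toy_transformations, toy_frequency, 5)
```

In the toy setup the grammatical sentence is rescued by the transformation (`good_flagged` is `False`), while the scrambled one still contains rare trigrams after transformation (`bad_flagged` is `True`).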

18
Result?
  • Summary
  • Trigram part, fully automatic
  • Phrase part, could use machine learning of rules
    for shallow parser
  • Finds many difficult error types
  • Threshold determines precision/recall trade-off
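
To illustrate the precision/recall trade-off, the sketch below flags every item whose score falls below the threshold and compares against known error labels. The scores and labels are made-up toy values, and returning 1.0 when a denominator is zero is a convention chosen here.

```python
def precision_recall(scores, is_error, threshold):
    """Flag items with score < threshold and compare against gold labels."""
    flagged = [s < threshold for s in scores]
    tp = sum(1 for f, e in zip(flagged, is_error) if f and e)
    fp = sum(1 for f, e in zip(flagged, is_error) if f and not e)
    fn = sum(1 for f, e in zip(flagged, is_error) if not f and e)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy generalized frequencies and whether each position really is an error.
scores = [0.5, 2.0, 8.0, 30.0, 60.0]
is_error = [True, True, False, True, False]

strict = precision_recall(scores, is_error, 1)    # few detections
lenient = precision_recall(scores, is_error, 50)  # many detections
```

A low threshold flags little and is precise but misses errors; a high threshold catches every error in the toy data at the cost of a false alarm.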

19
Evaluation?
  • Fully automatic evaluation
  • Introduce artificial context-sensitive spelling
    errors (using software Missplel)
  • Automated evaluation procedure for 1, 2, 5, 10
    and 20% misspelled words (using software AutoEval)
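
The idea of introducing artificial errors at a fixed rate can be sketched as below. This does not use or reproduce Missplel or AutoEval; the `variants` table of real-word confusions, the function name and the seeding are assumptions made so that the example is self-contained.

```python
import random


def inject_errors(words, variants, rate, seed=0):
    """Replace about rate * len(words) words with real-word variants.

    Returns the corrupted word list and the set of changed positions.
    """
    rng = random.Random(seed)
    candidates = [i for i, w in enumerate(words) if w in variants]
    n = min(len(candidates), round(rate * len(words)))
    chosen = set(rng.sample(candidates, n))
    corrupted = [
        rng.choice(variants[w]) if i in chosen else w
        for i, w in enumerate(words)
    ]
    return corrupted, chosen


# Hypothetical table: each replacement is itself a dictionary word.
variants = {"weather": ["whether"], "there": ["their"], "no": ["know"],
            "rain": ["reign"], "sight": ["site"]}
words = "nice weather today and there is no rain in sight".split()
corrupted, positions = inject_errors(words, variants, rate=0.20)
```

Because the changed positions are known, a detector's output can be scored automatically against `positions` without manual annotation.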

20
Results? 1% errors

21
Results? 2% errors