Title: Spelling correction
Spelling correction
- Mani Thomas and Greg Waltz
Overview
- Introduction to Probabilistic Modeling
- Techniques of spelling correction
- Types of errors
- Non-word error detection
- Noisy channel model
- Isolated-word error correction
- Real-word error correction
- Latent Semantic Analysis
Introduction
- Spelling error correction deals with two main types of issues
  - Error detection
  - Error correction
- Probabilistic modeling
  - Given a sequence of letters corresponding to a misspelled word, produce an ordered list of possible correct words for each misspelled word
  - Transducing from surface forms to lexical forms, assigning each with a probability
- Karen Kukich, Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, 1992
- Jurafsky and Martin, Speech and Language Processing, Chapter 5
Techniques of spelling correction
- Automatic word correction research focuses on three increasingly broad problems
  - Non-word error detection: detecting spelling errors that result in non-words
  - Isolated-word error correction: correcting spelling errors that result in non-words
  - Context-dependent word correction: using context to detect and correct spelling errors
- Karen Kukich, Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, 1992
- J. L. Peterson, Computer Programs for Detecting and Correcting Spelling Errors, Communications of the ACM, 1980
Non-word error detection
- Dictionary look-up techniques
  - Response time is a problem when the lexicon exceeds a few hundred words
  - Most common technique to reduce search time -> hashing (see the sketch after this list)
- N-gram analysis
  - Errors made by OCR devices -> N-gram analysis flags the improbable n-grams they produce
  - Kinds
    - Unigram analysis
    - Bigram analysis
    - Trigram analysis
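- A minimal Python sketch of both detection techniques (ours, not from the slides; the toy lexicon and bigram table are invented). A Python set is a hash table, so the look-up illustrates the hashing idea above:

    # Illustrative only: toy lexicon and bigram table invented for this sketch.
    DICTIONARY = {"the", "quick", "brown", "fox", "spelling", "correction"}
    VALID_BIGRAMS = {w[i:i + 2] for w in DICTIONARY for i in range(len(w) - 1)}

    def is_non_word(token):
        """Dictionary look-up: a Python set is a hash table, so membership
        testing stays fast no matter how large the lexicon grows."""
        return token.lower() not in DICTIONARY

    def improbable_bigrams(token):
        """Bigram analysis: flag letter pairs never seen in the lexicon,
        the improbable n-grams that OCR errors tend to produce."""
        return [token[i:i + 2] for i in range(len(token) - 1)
                if token[i:i + 2] not in VALID_BIGRAMS]

    print(is_non_word("teh"))            # True: not in the lexicon
    print(improbable_bigrams("qx1ck"))   # ['qx', 'x1', '1c']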
Non-word error detection (contd.)
- Issues with dictionary look-up
  - Dictionary construction -> the lexicon must be carefully tuned to the intended domain of discourse
  - Word boundary problem -> word boundaries are defined by whitespace characters (problems: running two words together, and splitting a single word)
Isolated-word error correction
- Used mostly for text recognition and text editing
- Application-specific issues
  - Lexicon issues
  - Computer-human interface issues
- Spelling error patterns
  - Typographical errors
  - Cognitive errors
  - Phonetic errors
Isolated-word error correction
- Basic error types
  - Single-error misspellings -> about 80% of all misspelled words in a sample of human-keypunched text (the four operations are sketched below)
    - Insertion
    - Deletion
    - Substitution
    - Transposition
  - Multi-error misspellings
- Fred Damerau, A Technique for Computer Detection and Correction of Spelling Errors, Communications of the ACM, Vol. 7, No. 3
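- As an illustration of Damerau's four single-error types, this hedged Python sketch generates every string one such operation away from a misspelling; the function name and toy lexicon are ours, not Damerau's:

    import string

    def edits1(word):
        """All strings one single-error operation away from `word`, covering
        Damerau's four types: insertion, deletion, substitution, transposition."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {L + R[1:] for L, R in splits if R}
        transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
        substitutes = {L + c + R[1:] for L, R in splits if R
                       for c in string.ascii_lowercase}
        inserts = {L + c + R for L, R in splits for c in string.ascii_lowercase}
        return deletes | transposes | substitutes | inserts

    # Candidate corrections = one-error variants that are dictionary words.
    lexicon = {"receive", "believe", "their"}
    print(edits1("recieve") & lexicon)   # {'receive'}, found via transposition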
Isolated-word error correction
- Other issues related to spelling error patterns
  - Word-length effects
  - First-position errors
  - Keyboard effects
  - Error rates
  - Phonetic errors
  - Heuristic rules and probabilistic tendencies
  - Common misspellings
Isolated-word error correction
- Related problems
  - Detecting errors
  - Generating candidates for correction
  - Ranking the candidates
- Techniques include (a minimum-edit-distance sketch follows this list)
  - Minimum edit distance
  - Similarity key technique
  - Rule-based technique
  - N-gram-based technique
  - Probabilistic techniques
  - Neural net techniques
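- Of these, minimum edit distance is the easiest to make concrete. Below is a minimal dynamic-programming sketch with unit costs (the Levenshtein variant; the candidate words are invented for illustration):

    def min_edit_distance(source, target):
        """Dynamic-programming (Levenshtein) distance with unit costs for
        insertion, deletion and substitution."""
        n, m = len(source), len(target)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                       # delete all of source
        for j in range(m + 1):
            d[0][j] = j                       # insert all of target
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if source[i - 1] == target[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + sub)  # substitution or match
        return d[n][m]

    # Rank invented candidates by distance to a misspelling.
    candidates = ["graft", "craft", "grail"]
    print(sorted(candidates, key=lambda w: min_edit_distance("graffe", w)))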
Real-word error correction
- One correctly spelled word substituted for another
  - Simple typos (from -> form, form -> farm)
  - Cognitive or phonetic lapses (there -> their)
  - Syntactic or grammatical mistakes (arrive -> arrives)
  - Wrong function word (for -> of)
  - Semantic anomalies (in five minuets)
  - Insertions or deletions of whole words (the system has been operating for almost three years, at absolutely extra cost)
  - Improper spacing (myself -> my self)
- Requires information from the context for both detection and correction
- Karen Kukich, Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, 1992
Real-word error correction
- Errors as analyzed by Atwell and Elliot
  - Local syntactic errors (the study was conducted mainly be John Black)
    - Involve local incongruities -> detectable by a parser or a statistical language model
  - Global syntactic errors (Not only we in the academia but the scientific community as a whole needs to guard against this)
    - Involve a long-distance dependency between a complex subject phrase and its verb -> requires the full syntactic structure
  - Semantic errors (He is trying to fine out)
    - Have valid syntactic parses -> no general-purpose NLP systems to detect these semantic errors
Latent semantic analysis
- Approach originally developed for automatic indexing and information retrieval
- Overcomes a fundamental problem -> matching the words of queries with the words of documents
- Overcomes these deficiencies by treating the observed term-document association data as a statistical problem
- LSA makes predictions by building a high-dimensional semantic space of words
- Compares the similarity of the words from a confusion set to a given context
- Dumais, Susan T., et al., Using Latent Semantic Analysis to Improve Access to Textual Information, Proc. CHI '88
Latent semantic analysis (contd.)
- Goal of LSA
  - Take evidence (i.e., words) and uncover the underlying semantics
  - Eliminate noise from the data -> transform the data from the high-dimensional space to a reduced-dimensional space
- Dimension reduction using Singular Value Decomposition (SVD)
- Similarity obtained using the dot product or cosine between points in the space
Latent semantic analysis (contd.)
- Rows of the matrix -> terms; columns of the matrix -> documents
- Singular Value Decomposition (see the sketch after this list)
  - Factors the original matrix into the product of three matrices, X = T0 S0 D0^T
  - The matrices show a breakdown into linearly independent and orthogonal components (factors)
  - T0 -> original term vectors expressed as vectors of the factor values
  - S0 -> diagonal matrix of the singular values (a scaling factor for each dimension)
  - D0 -> original document vectors expressed as vectors of the factor values
- Jones, M. P., et al., Contextual Spelling Correction Using Latent Semantic Analysis, Proc. 5th Conf. on Applied Natural Language Processing, 1997
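- A small numpy sketch of this factorization; the toy matrix is invented, and numpy names the factors U, s, Vt rather than T0, S0, D0:

    import numpy as np

    # Toy term-document matrix: rows = terms, columns = documents
    # (the counts are invented purely for illustration).
    X = np.array([[1, 0, 1, 0],
                  [1, 1, 0, 0],
                  [0, 1, 1, 1],
                  [0, 0, 1, 1]], dtype=float)

    # Full SVD: X = T0 @ diag(S0) @ D0.T in the slide's notation
    # (numpy returns U = T0, s = S0 and Vt = D0.T).
    T0, S0, D0t = np.linalg.svd(X, full_matrices=False)

    print(S0)                                      # singular values, largest first
    print(np.allclose(X, T0 @ np.diag(S0) @ D0t))  # True: exact reconstruction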
Latent semantic analysis (contd.)
- Approximation of the original matrix
  - Eliminate some number of the least important singular values
  - The product is a least-squares best fit of the original matrix
- Generate a vector representation for each text passage in which a confusion word appears
- Similarity between the text-passage vector and each confusion-word vector -> predicts the most likely word given the context
- Jones, M. P., et al., Contextual Spelling Correction Using Latent Semantic Analysis, Proc. 5th Conf. on Applied Natural Language Processing, 1997
Vector Space Model
- d documents described by t terms -> a t x d term-document matrix
- The d document vectors form the columns of the matrix
- Semantic content is wholly contained in the column space
- Common measure of similarity -> cosine of the angle between the query and document vectors (see the sketch below)
- Berry, M. W., Zlatko Drmac and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, 1999
- Sonia Leach, Singular Value Decomposition: A Primer
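- A minimal sketch of query matching in this model; the 4-term, 3-document matrix is invented (the terms echo the "baking bread" example on the next slides, but the counts are ours):

    import numpy as np

    def cosine(u, v):
        """Cosine of the angle between two vectors (1.0 = same direction)."""
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Invented t x d matrix: 4 terms, 3 documents (columns).
    terms = ["baking", "bread", "cake", "pastry"]
    A = np.array([[1, 1, 0],     # baking
                  [1, 0, 1],     # bread
                  [0, 1, 0],     # cake
                  [0, 0, 1]],    # pastry
                 dtype=float)

    query = np.array([1, 1, 0, 0], dtype=float)   # "baking bread"
    scores = [cosine(query, A[:, j]) for j in range(A.shape[1])]
    print(scores)   # keep documents whose cosine exceeds a threshold, e.g. 0.5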
Vector Space Model (contd.)
- Query -> "baking bread"
- Threshold 0.5
- Col. 5 can be written as the sum of col. 2 and col. 3
- Query -> "baking"
- Not able to retrieve the 4th document even though it is more comprehensive
Vector Space Model (contd.)
- A rank-4 matrix has 4 non-zero singular values
- The two zero rows -> the first four columns of U form a basis of the column space
- Berry, M. W., Zlatko Drmac and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, 1999
Vector Space Model (contd.)
- Rank-3 approximation of the original term-document matrix (see the sketch after this list)
  - Keep the top three singular values
  - Set the other singular values to 0
- This subspace represents the structure of the matrix better than the original
  - Reduction of the noise that resides in the lower singular values
- Note 1: Deciding the rank-approximation factor is still an open question
- Note 2: Negative values -> entries are linear combinations of the entries in the original matrix
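- A hedged sketch of the rank-k approximation; the data is a random stand-in, and the least-squares optimality is the standard Eckart-Young result, not something specific to these slides:

    import numpy as np

    def rank_k_approximation(A, k):
        """Best rank-k least-squares approximation of A (Eckart-Young):
        keep the top k singular values and zero out the rest."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        s[k:] = 0.0                        # drop the least important factors
        return U @ np.diag(s) @ Vt

    A = np.random.rand(12, 9)              # stand-in for a term-document matrix
    A3 = rank_k_approximation(A, 3)
    print(np.linalg.matrix_rank(A3))       # 3
    # Note: A3 can contain negative entries even though A is non-negative;
    # each entry is a linear combination of entries of the original matrix.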
Vector Space Model (contd.)
- Using the rank-3 approximation
  - Query -> "baking bread"
  - Query -> "baking"
  - The remaining cosines are no longer zero but are still below the threshold, and are cut off
- Lowering the rank lowers the cost of query matching
  - Uses the k factor vectors instead of the d document vectors
Vector Space Model (contd.)
- Rank lowering redistributes the data points
- Imagine viewing the 12-dimensional space projected so that the 1st and 2nd singular dimensions form the X and Y axes
- Dhillon, I., Course Notes, CS 395T, Large Scale Data Mining, University of Texas at Austin, http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
Latent semantic analysis
- Experimental method
- Data (random assignment)
  - 80% of the Brown Corpus -> training; 20% of the Brown Corpus -> testing
  - Sentences that contained the confusion words were extracted for training and testing
- Training
  - Each training sentence is used as a column in the LSA matrix -> each training sentence is treated as a document
Latent semantic analysis (contd.)
- Transformations prior to LSA processing (see the preprocessing sketch after this list)
  - Context reduction
    - Column vector reduced to the confusion word plus 7 words on either side (instead of the average sentence length of 28 words in the corpus)
    - Reduces running time and storage requirements
  - Stemming
    - Reduce each word to its morphological root (Porter's algorithm, 1980)
  - Bigram creation
    - Bigrams formed between all adjacent pairs are treated as additional terms in the term-document matrix
  - Term weighting
    - Local weighting: terms nearer the confusion word are given more weight, in a linearly decreasing manner
    - Global weighting: each term is weighted by the importance of the word in the corpus as a whole
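- A rough sketch of these transformations; the function names are ours, and a toy suffix-stripper stands in for Porter's algorithm (global weighting is omitted for brevity):

    def stem(word):
        """Toy stand-in for Porter's (1980) stemmer."""
        for suffix in ("ing", "ed", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[:-len(suffix)]
        return word

    def preprocess(sentence, target_index, window=7):
        """Reduce the sentence to the confusion word +/- `window` words, stem,
        add adjacent-pair bigrams as extra terms, and attach linearly
        decreasing local weights by distance from the confusion word."""
        lo = max(0, target_index - window)
        hi = min(len(sentence), target_index + window + 1)
        context = [stem(w.lower()) for w in sentence[lo:hi]]
        bigrams = [a + "_" + b for a, b in zip(context, context[1:])]
        center = target_index - lo
        weights = {w: (window + 1 - min(abs(i - center), window)) / (window + 1)
                   for i, w in enumerate(context)}
        return context + bigrams, weights

    sentence = "I hoped the weather would be fine today".split()
    terms, weights = preprocess(sentence, target_index=6)  # confusion word "fine"
    print(terms)
    print(weights)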
Latent semantic analysis (contd.)
- Testing (see the sketch after this list)
  - Each word from the confusion set is inserted at the target location in the test sentence
  - A vector in LSA space is constructed for the resulting passage
  - Vector similarity of the test vector with each confusion-word vector is computed using the cosine between the two vectors
  - The largest cosine value identifies the most probable word, in a least-squares sense, for the test sentence
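- A minimal end-to-end sketch of this test step; the fold-in projection is the standard LSA formula, but the random matrices and confusion vectors here are purely illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.random((50, 30))              # stand-in training term-document matrix
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    k = 5                                 # reduced dimensionality
    U_k, s_k = U[:, :k], s[:k]

    def fold_in(term_counts):
        """Standard LSA fold-in: project a term vector into the k-dim space."""
        return term_counts @ U_k / s_k

    def cosine(u, v):
        return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

    # Invented term vectors for a confusion set and for the test passage
    # (in practice these come from the preprocessing steps above).
    confusion = {"their": rng.random(50), "there": rng.random(50)}
    test_passage = rng.random(50)

    t = fold_in(test_passage)
    best = max(confusion, key=lambda w: cosine(t, fold_in(confusion[w])))
    print(best)   # most probable word, in a least-squares sense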
Latent semantic analysis (contd.)
- Results
  - Baseline performance: percentage of correct predictions using the most frequent word
  - LSA performs better on words with the same part of speech
  - Since it doesn't include part-of-speech information, it does not perform as well as Tribayes on the second group
Latent semantic analysis (contd.)
- Advantages of SVD and LSA
  - Noise removal -> better semantic description
  - Incorporates synonymy and polysemy
    - Synonymy: many ways to refer to the same object
    - Polysemy: words with more than one distinct meaning
- Disadvantages of SVD and LSA
  - Computationally expensive
  - Matrices are large and sparse
  - Typically many singular vectors are required (100-500)
  - Not intuitive
Conclusions
- Overview of spelling correction mechanisms
- Vector Space Model
- Singular Value Decomposition
- Real-word spelling errors
- Latent semantic analysis
- Tribayes
- Comparative results of the two methods
References
- Papers
  - Jones, M. P. and James H. Martin, Contextual Spelling Correction, Proc. 5th Conf. on Applied Natural Language Processing, 1997
  - Deerwester, S., Dumais, S. T., Landauer, T. K., Furnas, G. W. and Harshman, R. A., "Indexing by Latent Semantic Analysis", Journal of the Society for Information Science, 41(6), 1990
  - Berry, M. W., Zlatko Drmac and E. R. Jessup, Matrices, Vector Spaces, and Information Retrieval, SIAM Review, 1999
  - Sonia Leach, Singular Value Decomposition: A Primer, Brown University
  - Karen Kukich, Techniques for Automatically Correcting Words in Text, ACM Computing Surveys, 1992
  - J. L. Peterson, Computer Programs for Detecting and Correcting Spelling Errors, Communications of the ACM, 1980
  - J. L. Peterson, A Note on Undetected Typing Errors, Communications of the ACM, Vol. 29, No. 7, 1986
  - Kernighan, M. D., K. W. Church and W. A. Gale, A Spelling Correction Program Based on a Noisy Channel Model, Proceedings of COLING '90, Helsinki, Finland, 1990
- Course Notes
  - Dhillon, I., LSI with SVD, Course Notes, CS 395T, Large Scale Data Mining, University of Texas at Austin, http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
  - Sarkar, A., Edit Distances for Spelling Correction, Course Notes, CMPT-413, Computational Linguistics, Simon Fraser University, Canada, http://www.sfu.ca/anoop/courses/CMPT-413-Spring-2003/