Transcript and Presenter's Notes

Title: Spelling correction


1
Spelling correction
  • Mani Thomas and Greg Waltz

2
Overview
  • Introduction to Probabilistic Modeling
  • Techniques of spelling correction
  • Types of errors
  • Non-word error detection
  • Noisy channel Model
  • Isolated word error correction
  • Real word error correction
  • Latent Semantic Analysis

3
Introduction
  • Spelling error correction deals with two main
    types of issues
  • Error Detection
  • Error Correction
  • Probabilistic Modeling
  • Given: a sequence of letters corresponding to a
    misspelled word
  • Wanted: an ordered list of possible correct words
    for each misspelled word
  • Transducing from surface forms to lexical forms,
    assigning a probability to each (a toy ranking
    sketch follows the references below)
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992
  • Jurafsky and Martin, Speech and Language
    Processing, Chapter 5
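
As a toy illustration of this probabilistic view, here is a minimal
Python sketch that ranks candidate corrections by P(x|w) · P(w). It is
not the presenters' implementation, and every probability below is a
made-up assumption:

    # Minimal sketch: rank candidate corrections w for a misspelling x
    # by channel probability P(x|w) times prior P(w). The numbers are
    # illustrative assumptions, not estimates from real data.
    candidates = {
        # word: (P(x|w), P(w))
        "form": (0.010, 0.0005),
        "from": (0.008, 0.0100),
        "farm": (0.002, 0.0002),
    }

    def rank(cands):
        """Return candidates sorted by P(x|w) * P(w), best first."""
        return sorted(cands, key=lambda w: cands[w][0] * cands[w][1],
                      reverse=True)

    print(rank(candidates))  # ['from', 'form', 'farm']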

4
Techniques of spelling correction
  • Automatic word correction research focuses on
    three increasingly broad problems
  • Non-word error detection
  • Detecting spelling errors that result in
    non-words
  • Isolated-word error correction
  • Correcting spelling errors that result in
    non-words
  • Context-dependent word correction
  • Using context to detect and correct spelling
    errors
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992
  • J. L. Peterson, "Computer Programs for Detecting
    and Correcting Spelling Errors", Communications
    of the ACM, 1980

5
Non-word error detection
  • Dictionary look-up techniques
  • Response time is a problem when size exceeds a
    few hundred words
  • Most common technique to reduce search time →
    hashing (sketched below, together with bigram
    analysis)
  • N-gram analysis
  • Errors made by OCR devices → N-gram analysis
    produces improbable N-grams
  • Kinds
  • Unigram analysis
  • Bigram analysis
  • Trigram analysis
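
A minimal sketch of the two detection ideas above, using a toy lexicon
rather than the presenters' data: hashed dictionary look-up, and bigram
analysis that flags letter pairs never seen in the lexicon:

    LEXICON = {"the", "cat", "sat", "on", "mat"}  # toy dictionary
    VALID_BIGRAMS = {w[i:i + 2] for w in LEXICON for i in range(len(w) - 1)}

    def is_non_word(token):
        """Dictionary look-up: O(1) average time via Python's hashed set."""
        return token not in LEXICON

    def improbable_bigrams(token):
        """N-gram analysis: return letter bigrams unseen in the lexicon."""
        return [token[i:i + 2] for i in range(len(token) - 1)
                if token[i:i + 2] not in VALID_BIGRAMS]

    print(is_non_word("czt"), improbable_bigrams("czt"))  # True ['cz', 'zt']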

6
Non-word error detection (Contd.)
  • Issues under Dictionary look-up
  • Dictionary construction issues → lexicon must be
    carefully tuned to the intended domain of
    discourse
  • Word boundary problem → word boundaries are
    defined by whitespace characters (problems:
    running two words together, and splitting a
    single word)

7
Isolated-word error correction
  • Used mostly for text recognition and text editing
  • Application specific issues
  • Lexicon issues
  • Computer-human interface issues
  • Spelling error patterns
  • Typographical errors
  • Cognitive errors
  • Phonetic errors

8
Isolated-word error correction
  • Basic error types
  • Single-error misspellings → ≈80% of all misspelled
    words in a sample of human keypunched text; four
    single-edit types (candidate generation sketched
    after this list)
  • Insertion
  • Deletion
  • Substitution
  • Transposition
  • Multi-error misspellings
  • Fred Damerau, "A Technique for Computer Detection
    and Correction of Spelling Errors", Communications
    of the ACM, Vol. 7, No. 3, 1964
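
A minimal sketch, not the presenters' code, that generates every string
one single-edit operation away from a word, covering the four error
types above (alphabet restricted to a-z for brevity):

    import string

    def edits1(word):
        """All strings reachable by one insertion, deletion,
        substitution, or transposition."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {L + R[1:] for L, R in splits if R}
        transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
        substitutes = {L + c + R[1:] for L, R in splits if R
                       for c in string.ascii_lowercase}
        inserts = {L + c + R for L, R in splits for c in string.ascii_lowercase}
        return deletes | transposes | substitutes | inserts

    print("acre" in edits1("acres"), "form" in edits1("from"))  # True True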

9
Isolated-word error correction
  • Other issues related to Spelling Error Patterns
  • Word length effects
  • First-position errors
  • Keyboard effects
  • Error rates
  • Phonetic errors
  • Heuristic rules and probabilistic tendencies
  • Common misspellings

10
Isolated-word error correction
  • Related problems
  • Detecting errors
  • Generating candidates for correction
  • Ranking the candidates
  • Techniques include
  • Minimum edit distance (sketched after this list)
  • Similarity key technique
  • Rule-based technique
  • N-gram based technique
  • Probabilistic Techniques
  • Neural Net techniques
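
A hedged sketch of the first technique in the list, minimum edit
distance with unit costs (the Damerau-Levenshtein variant, which also
counts transpositions); this is a generic textbook formulation, not the
presenters' code:

    def dl_distance(s, t):
        """Minimum number of insertions, deletions, substitutions,
        and adjacent transpositions turning s into t."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
                if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                        and s[i - 2] == t[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    print(dl_distance("form", "from"))  # 1 (one transposition)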

11
Real-word error correction
  • One correctly spelled word substituted for another
  • Simple typos (from → form, form → farm)
  • Cognitive or phonetic lapses (there → their)
  • Syntactic or grammatical mistakes (arrive →
    arrives)
  • Wrong function word (for → of)
  • Semantic anomalies (in five minuets)
  • Insertions or deletions of whole words (the
    system has been operating for almost three years,
    at absolutely extra cost)
  • Improper spacing (myself → my self)
  • Require information from the context for both
    detection and correction
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992

12
Real-word Error Correction
  • Errors as analyzed by Atwell and Elliot
  • Local Syntactic Errors (the study was conducted
    mainly be John Black)
  • Involve local incongruities → detectable by a
    parser or a statistical language model
  • Global Syntactic Errors (Not only we in the
    academia but the scientific community as a whole
    needs to guard against this)
  • Involve a long-term dependency between a complex
    subject phrase and its verb → requires the full
    syntactic structure
  • Semantic Errors (He is trying to fine out)
  • Have valid syntactic parses → no general-purpose
    NLP systems to detect these semantic errors

13
Latent semantic analysis
  • Approach originally developed for automatic
    indexing and data retrieval
  • Overcome a fundamental problem → matching words
    of queries with words of the documents
  • Overcome deficiencies by treating the observed
    term-document association data as a statistical
    problem
  • LSA makes predictions by building a
    high-dimensional semantic space of words
  • Compare similarity of the words from a confusion
    set to a given context
  • Dumais, Susan T., et al., "Using Latent Semantic
    Analysis to Improve Access to Textual
    Information", Proc. CHI '88

14
Latent semantic analysis (contd.)
  • Goal of LSA
  • Take evidence (i.e., words) and uncover the
    underlying semantics
  • Eliminate noise from the data → transform data
    from the high-dimensional space to a reduced
    dimensional space
  • Dimension reduction using Singular Value
    Decomposition (SVD)
  • Similarity obtained using the dot product or
    cosine between the points in the space

15
Latent semantic analysis (contd.)
  • Rows of the matrix → terms; columns of the
    matrix → documents
  • Singular Value Decomposition (a NumPy sketch
    follows the citation below)
  • Factors the original matrix into the product of
    three matrices
  • Matrices show the breakdown into linearly
    independent and orthogonal components (factors)
  • T0 → original term vectors as vectors of the
    factor values
  • S0 → diagonal matrix of the singular values
    (scaling factor for each dimension)
  • D0 → original document vectors as vectors of the
    factor values
  • Jones, M. P., et al., "Contextual Spelling
    Correction Using Latent Semantic Analysis", Proc.
    5th Conf. on Applied Natural Language Processing,
    1997
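
A minimal NumPy sketch of the factorization described above, A = T0 ·
S0 · D0', with terms as rows and documents as columns; the counts are
toy values, not data from the slides:

    import numpy as np

    A = np.array([[1., 0., 1.],   # term 1 across 3 documents
                  [0., 1., 0.],   # term 2
                  [1., 1., 0.],   # term 3
                  [0., 0., 1.]])  # term 4

    T0, s, D0t = np.linalg.svd(A, full_matrices=False)
    S0 = np.diag(s)               # singular values on the diagonal

    print(np.allclose(A, T0 @ S0 @ D0t))  # True: exact reconstruction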

16
Latent semantic analysis (contd.)
  • Approximation of original matrix
  • Eliminate some number of the least important
    singular values
  • The product is a least-squares best fit of the
    original matrix
  • Generate a vector representation for each text
    passage in which a confusion word appears
  • Similarity between the text-passage vector and
    each confusion-word vector → predict the most
    likely word given the context
  • Jones, M. P., et al., "Contextual Spelling
    Correction Using Latent Semantic Analysis", Proc.
    5th Conf. on Applied Natural Language Processing,
    1997

17
Vector Space Model
  • d documents described by t terms → t × d
    term-document matrix
  • The d vectors form the columns of the matrix
  • Semantic content is wholly contained in the
    column space
  • Common measure of similarity → cosine of the
    angle between the query and document vectors
    (sketched below)
  • Berry, M. W., Zlatko Drmac and E. R. Jessup,
    "Matrices, Vector Spaces, and Information
    Retrieval", SIAM Review, 1999
  • Sonia Leach, "Singular Value Decomposition: A
    Primer"
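
A minimal sketch of cosine matching in the vector space model, using a
toy 3-term × 3-document matrix rather than the example matrix from the
cited papers:

    import numpy as np

    A = np.array([[1., 0., 1.],   # t x d term-document matrix
                  [0., 1., 0.],
                  [1., 1., 1.]])
    q = np.array([1., 0., 1.])    # query over the same 3 terms

    # Cosine between the query and each document column of A.
    cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    print(cosines)  # documents scoring above a threshold are retrieved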

18
Vector Space Model (contd.)
  • Query → "baking bread"
  • Threshold: 0.5
  • Col. 5 can be written as the sum of col. 2 and
    col. 3
  • Query → "baking"
  • Not able to retrieve the 4th document even though
    it is more comprehensive

19
Vector Space Model (contd.)
  • The rank-4 matrix has 4 non-zero singular values
  • The two zero rows → the first four columns of U
    form a basis of the column space
  • Berry, M. W., Zlatko Drmac and E. R. Jessup,
    "Matrices, Vector Spaces, and Information
    Retrieval", SIAM Review, 1999

20
Vector Space Model (contd.)
  • Rank-3 approximation of the original
    term-document matrix (a general rank-k sketch
    follows)
  • Considering the top three singular values
  • Setting the other elements to 0
  • This subspace represents the structure of the
    matrix better than the original
  • Reduction of the noise which exists in the lower
    singular values
  • Note 1: Deciding the rank-approximation factor is
    still an open question
  • Note 2: Negative values → entries are linear
    combinations of the entries in the original matrix
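
A minimal sketch of the rank-k approximation: keep the top k singular
values and zero the rest, which gives the least-squares best rank-k fit
of the original matrix (k = 3 reproduces the slide's rank-3 case):

    import numpy as np

    def rank_k_approx(A, k):
        """Least-squares best rank-k fit of A via truncated SVD."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]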

21
Vector Space Model (contd.)
  • Using the rank-3 approximation
  • Query → "baking bread"
  • Query → "baking"
  • Remaining cosines are no longer zero but are
    still below the threshold and are cut off
  • Lowering the rank lowers the cost of query
    matching
  • Using the k vectors instead of the d vectors

22
Vector space model (contd.)
  • Redistribution of the data points depending on
    the rank lowering
  • Imagine looking at the 12-dimensional space such
    that the 1st and 2nd singular values form the X
    and Y axes
  • Dhillon, I., Course Notes, CS 395T, Large Scale
    Data Mining, University of Texas at Austin,
    http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html

23
Latent semantic analysis
  • Experimental Method
  • Data (random assignment)
  • 80% of the Brown Corpus → training; 20% of the
    Brown Corpus → testing
  • Sentences containing the confusion words were
    extracted for training and testing
  • Training
  • Each training sentence is used as a column in the
    LSA matrix → each training sentence is treated
    as a document

24
Latent semantic analysis (contd.)
  • Transformations prior to LSA processing (sketched
    after this list)
  • Context reduction
  • Column vector reduced to the confusion word plus
    7 words on either side (instead of the average
    sentence length of 28 words in the corpus)
  • Reduces running time and storage requirements
  • Stemming
  • Reducing each word to its morphological root
    (Porter's algorithm, 1980)
  • Bi-gram Creation
  • Formed between all adjacent word pairs and
    treated as additional terms in the term-document
    matrix
  • Term Weighting
  • Local Weighting
  • Terms nearer the confusion word given more weight
    in a linearly decreasing manner
  • Global Weighting
  • Weight given to each term depending on the
    importance of the word in the corpus as a whole
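
A hedged sketch of this pipeline under toy assumptions: the function
and variable names are illustrative, and rstrip("s") is only a crude
stand-in for the Porter stemmer named above:

    def preprocess(words, idx, window=7):
        """Trim to +/- window words around the confusion word at idx,
        crudely stem, add adjacent-pair bigrams, and give terms a
        linearly decreasing local weight by distance."""
        lo, hi = max(0, idx - window), min(len(words), idx + window + 1)
        context = [w.lower().rstrip("s") for w in words[lo:hi]]  # "stem"
        bigrams = ["_".join(p) for p in zip(context, context[1:])]
        center = idx - lo
        local = {w: window + 1 - abs(i - center)
                 for i, w in enumerate(context)}
        return context + bigrams, local

    sent = "the meeting will start in five minuets if everyone is on time".split()
    terms, weights = preprocess(sent, sent.index("minuets"))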

25
Latent semantic analysis (contd.)
  • Testing (sketched below)
  • Words from the confusion set are inserted at the
    confusion-word location
  • A vector in LSA space is constructed for the test
    sentence
  • Vector similarity of the test vector with each
    confusion-word vector is computed using the
    cosine between the two vectors
  • The largest cosine value identifies the most
    probable word, in a least-squares sense, for the
    test sentence
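
A hedged sketch of the test step as a fold-in, assuming Uk and sk come
from a truncated SVD of the training term-document matrix; the names
are illustrative, not the presenters' code:

    import numpy as np

    def fold_in(q, Uk, sk):
        """Map a test passage's raw term vector q into the k-dim LSA
        space: q @ Uk divided elementwise by the singular values."""
        return (q @ Uk) / sk

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Predict the confusion word whose LSA vector has the largest
    # cosine with the folded-in test vector.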

26
Latent semantic analysis (contd.)
  • Results
  • Baseline performance: percentage of correct
    predictions using the most frequent word
  • LSA performs better on words with the same part
    of speech
  • Since it doesn't include part-of-speech
    information, it does not perform as well as
    Tribayes on the second group

27
Latent semantic analysis (contd.)
  • Advantages of SVD and LSA
  • Noise removal → better semantic description
  • Incorporate synonymy and polysemy
  • Synonymy: many ways to refer to the same object
  • Polysemy: words have more than one distinct
    meaning
  • Disadvantages of SVD and LSA
  • Computationally expensive
  • Matrices are large and sparse
  • Typically many singular vectors are required
    (100-500)
  • Not intuitive

28
Conclusions
  • Overview of spelling correction mechanisms
  • Vector Space Model
  • Singular Value Decomposition
  • Real word spelling errors
  • Latent semantic analysis
  • Tribayes
  • Comparative results of the two methods

29
References
  • Papers
  • Jones, M. P. and James H. Martin, "Contextual
    Spelling Correction Using Latent Semantic
    Analysis", Proc. 5th Conf. on Applied Natural
    Language Processing, 1997
  • Deerwester, S., Dumais, S. T., Landauer, T. K.,
    Furnas, G. W. and Harshman, R. A., "Indexing by
    Latent Semantic Analysis", Journal of the
    American Society for Information Science, 41(6),
    1990
  • Berry, M. W., Zlatko Drmac and E. R. Jessup,
    "Matrices, Vector Spaces, and Information
    Retrieval", SIAM Review, 1999
  • Sonia Leach, "Singular Value Decomposition: A
    Primer", Brown University
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992
  • J. L. Peterson, "Computer Programs for Detecting
    and Correcting Spelling Errors", Communications
    of the ACM, 1980
  • J. L. Peterson, "A Note on Undetected Typing
    Errors", Communications of the ACM, Vol. 29, No.
    7, 1986
  • Kernighan, M. D., K. W. Church and W. A. Gale, "A
    Spelling Correction Program Based on a Noisy
    Channel Model", Proceedings of COLING '90,
    Helsinki, Finland, 1990
  • Course Notes
  • Dhillon, I., "LSI with SVD", Course Notes, CS
    395T, Large Scale Data Mining, University of
    Texas at Austin,
    http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
  • Sarkar, A., "Edit Distances for Spelling
    Correction", Course Notes, CMPT-413,
    Computational Linguistics, Simon Fraser
    University, Canada,
    http://www.sfu.ca/~anoop/courses/CMPT-413-Spring-2003/