Transcript and Presenter's Notes

Title: Spelling correction


1
Spelling correction
  • Mani Thomas and Greg Waltz

2
Overview
  • Introduction to Probabilistic Modeling
  • Techniques of spelling correction
  • Types of errors
  • Non-word error detection
  • Noisy channel Model
  • Isolated word error correction
  • Real word error correction
  • Latent Semantic Analysis

3
Introduction
  • Spelling error correction deals with two main
    types of issues
  • Error Detection
  • Error Correction
  • Probabilistic Modeling
  • Given: a sequence of letters corresponding to a
    misspelled word
  • Wanted: an ordered list of possible correct words
    for each misspelled word
  • Transducing from surface forms to lexical forms,
    assigning a probability to each (a toy ranking
    sketch follows the references below)
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992
  • Jurafsky and Martin, Speech and Language
    Processing, Chapter 5
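
As a toy illustration of this probabilistic view, here is a minimal
Python sketch that ranks candidate corrections by P(x|w) · P(w). It is
not the presenters' implementation, and every probability below is a
made-up assumption:

    # Minimal sketch: rank candidate corrections w for a misspelling x
    # by channel probability P(x|w) times prior P(w). The numbers are
    # illustrative assumptions, not estimates from real data.
    candidates = {
        # word: (P(x|w), P(w))
        "form": (0.010, 0.0005),
        "from": (0.008, 0.0100),
        "farm": (0.002, 0.0002),
    }

    def rank(cands):
        """Return candidates sorted by P(x|w) * P(w), best first."""
        return sorted(cands, key=lambda w: cands[w][0] * cands[w][1],
                      reverse=True)

    print(rank(candidates))  # ['from', 'form', 'farm']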

4
Techniques of spelling correction
  • Automatic word correction research focuses on
    three increasingly broad problems
  • Non-word error detection
  • Detecting spelling errors that result in
    non-words
  • Isolated-word error correction
  • Correcting spelling errors that result in
    non-words
  • Context-dependent word correction
  • Using context to detect and correct spelling
    errors
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992
  • J. L. Peterson, "Computer Programs for Detecting
    and Correcting Spelling Errors", Communications
    of the ACM, 1980

5
Non-word error detection
  • Dictionary look-up techniques
  • Response time is a problem when size exceeds a
    few hundred words
  • Most common technique to reduce search time →
    hashing (sketched below, together with bigram
    analysis)
  • N-gram analysis
  • Errors made by OCR devices → N-gram analysis
    produces improbable N-grams
  • Kinds
  • Unigram analysis
  • Bigram analysis
  • Trigram analysis
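
A minimal sketch of the two detection ideas above, using a toy lexicon
rather than the presenters' data: hashed dictionary look-up, and bigram
analysis that flags letter pairs never seen in the lexicon:

    LEXICON = {"the", "cat", "sat", "on", "mat"}  # toy dictionary
    VALID_BIGRAMS = {w[i:i + 2] for w in LEXICON for i in range(len(w) - 1)}

    def is_non_word(token):
        """Dictionary look-up: O(1) average time via Python's hashed set."""
        return token not in LEXICON

    def improbable_bigrams(token):
        """N-gram analysis: return letter bigrams unseen in the lexicon."""
        return [token[i:i + 2] for i in range(len(token) - 1)
                if token[i:i + 2] not in VALID_BIGRAMS]

    print(is_non_word("czt"), improbable_bigrams("czt"))  # True ['cz', 'zt']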

6
Non-word error detection (Contd.)
  • Issues under Dictionary look-up
  • Dictionary construction issues → lexicon must be
    carefully tuned to the intended domain of
    discourse
  • Word boundary problem → word boundaries are
    defined by whitespace characters (problems:
    running two words together, and splitting a
    single word)

7
Isolated-word error correction
  • Used mostly for text recognition and text editing
  • Application specific issues
  • Lexicon issues
  • Computer-human interface issues
  • Spelling error patterns
  • Typographical errors
  • Cognitive errors
  • Phonetic errors

8
Isolated-word error correction
  • Basic error types
  • Single-error misspellings → ≈80% of all misspelled
    words in a sample of human keypunched text; four
    single-edit types (candidate generation sketched
    after this list)
  • Insertion
  • Deletion
  • Substitution
  • Transposition
  • Multi-error misspellings
  • Fred Damerau, "A Technique for Computer Detection
    and Correction of Spelling Errors", Communications
    of the ACM, Vol. 7, No. 3, 1964
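
A minimal sketch, not the presenters' code, that generates every string
one single-edit operation away from a word, covering the four error
types above (alphabet restricted to a-z for brevity):

    import string

    def edits1(word):
        """All strings reachable by one insertion, deletion,
        substitution, or transposition."""
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = {L + R[1:] for L, R in splits if R}
        transposes = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
        substitutes = {L + c + R[1:] for L, R in splits if R
                       for c in string.ascii_lowercase}
        inserts = {L + c + R for L, R in splits for c in string.ascii_lowercase}
        return deletes | transposes | substitutes | inserts

    print("acre" in edits1("acres"), "form" in edits1("from"))  # True True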

9
Isolated-word error correction
  • Other issues related to Spelling Error Patterns
  • Word length effects
  • First-position errors
  • Keyboard effects
  • Error rates
  • Phonetic errors
  • Heuristic rules and probabilistic tendencies
  • Common misspellings

10
Isolated-word error correction
  • Related problems
  • Detecting errors
  • Generating candidates for correction
  • Ranking the candidates
  • Techniques include
  • Minimum edit distance (sketched after this list)
  • Similarity key technique
  • Rule-based technique
  • N-gram based technique
  • Probabilistic Techniques
  • Neural Net techniques
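
A hedged sketch of the first technique in the list, minimum edit
distance with unit costs (the Damerau-Levenshtein variant, which also
counts transpositions); this is a generic textbook formulation, not the
presenters' code:

    def dl_distance(s, t):
        """Minimum number of insertions, deletions, substitutions,
        and adjacent transpositions turning s into t."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i
        for j in range(n + 1):
            d[0][j] = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1,        # insertion
                              d[i - 1][j - 1] + cost) # substitution
                if (i > 1 and j > 1 and s[i - 1] == t[j - 2]
                        and s[i - 2] == t[j - 1]):
                    d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
        return d[m][n]

    print(dl_distance("form", "from"))  # 1 (one transposition)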

11
Real-word error correction
  • One correctly spelled word substituted for another
  • Simple typos (from → form, form → farm)
  • Cognitive or phonetic lapses (there → their)
  • Syntactic or grammatical mistakes (arrive →
    arrives)
  • Wrong function word (for → of)
  • Semantic anomalies (in five minuets)
  • Insertions or deletions of whole words (the
    system has been operating for almost three years,
    at absolutely extra cost)
  • Improper spacing (myself → my self)
  • Require information from the context for both
    detection and correction
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992

12
Real-word Error Correction
  • Errors as analyzed by Atwell and Elliot
  • Local Syntactic Errors (the study was conducted
    mainly be John Black)
  • Involve local incongruities → detectable by a
    parser or a statistical language model
  • Global Syntactic Errors (Not only we in the
    academia but the scientific community as a whole
    needs to guard against this)
  • Involve a long-term dependency between a complex
    subject phrase and its verb → requires the full
    syntactic structure
  • Semantic Errors (He is trying to fine out)
  • Have valid syntactic parses → no general-purpose
    NLP systems to detect these semantic errors

13
Latent semantic analysis
  • Approach originally developed for automatic
    indexing and data retrieval
  • Overcome a fundamental problem → matching words
    of queries with words of the documents
  • Overcome deficiencies by treating the observed
    term-document association data as a statistical
    problem
  • LSA makes predictions by building a
    high-dimensional semantic space of words
  • Compare similarity of the words from a confusion
    set to a given context
  • Dumais, Susan T., et al., "Using Latent Semantic
    Analysis to Improve Access to Textual
    Information", Proc. CHI '88

14
Latent semantic analysis (contd.)
  • Goal of LSA
  • Take evidence (i.e., words) and uncover the
    underlying semantics
  • Eliminate noise from the data → transform data
    from the high-dimensional space to a reduced
    dimensional space
  • Dimension reduction using Singular Value
    Decomposition (SVD)
  • Similarity obtained using the dot product or
    cosine between the points in the space

15
Latent semantic analysis (contd.)
  • Rows of the matrix → terms; columns of the
    matrix → documents
  • Singular Value Decomposition (a NumPy sketch
    follows the citation below)
  • Factors the original matrix into the product of
    three matrices
  • Matrices show the breakdown into linearly
    independent and orthogonal components (factors)
  • T0 → original term vectors as vectors of the
    factor values
  • S0 → diagonal matrix of the singular values
    (scaling factor for each dimension)
  • D0 → original document vectors as vectors of the
    factor values
  • Jones, M. P., et al., "Contextual Spelling
    Correction Using Latent Semantic Analysis", Proc.
    5th Conf. on Applied Natural Language Processing,
    1997
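
A minimal NumPy sketch of the factorization described above, A = T0 ·
S0 · D0', with terms as rows and documents as columns; the counts are
toy values, not data from the slides:

    import numpy as np

    A = np.array([[1., 0., 1.],   # term 1 across 3 documents
                  [0., 1., 0.],   # term 2
                  [1., 1., 0.],   # term 3
                  [0., 0., 1.]])  # term 4

    T0, s, D0t = np.linalg.svd(A, full_matrices=False)
    S0 = np.diag(s)               # singular values on the diagonal

    print(np.allclose(A, T0 @ S0 @ D0t))  # True: exact reconstruction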

16
Latent semantic analysis (contd.)
  • Approximation of original matrix
  • Eliminate some number of the least important
    singular values
  • The product is a least-squares best fit of the
    original matrix
  • Generate a vector representation for each text
    passage in which a confusion word appears
  • Similarity between the text-passage vector and
    each confusion-word vector → predict the most
    likely word given the context
  • Jones, M. P., et al., "Contextual Spelling
    Correction Using Latent Semantic Analysis", Proc.
    5th Conf. on Applied Natural Language Processing,
    1997

17
Vector Space Model
  • d documents described by t terms → t × d
    term-document matrix
  • The d vectors form the columns of the matrix
  • Semantic content is wholly contained in the
    column space
  • Common measure of similarity → cosine of the
    angle between the query and document vectors
    (sketched below)
  • Berry, M. W., Zlatko Drmac and E. R. Jessup,
    "Matrices, Vector Spaces, and Information
    Retrieval", SIAM Review, 1999
  • Sonia Leach, "Singular Value Decomposition: A
    Primer"
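
A minimal sketch of cosine matching in the vector space model, using a
toy 3-term × 3-document matrix rather than the example matrix from the
cited papers:

    import numpy as np

    A = np.array([[1., 0., 1.],   # t x d term-document matrix
                  [0., 1., 0.],
                  [1., 1., 1.]])
    q = np.array([1., 0., 1.])    # query over the same 3 terms

    # Cosine between the query and each document column of A.
    cosines = (A.T @ q) / (np.linalg.norm(A, axis=0) * np.linalg.norm(q))
    print(cosines)  # documents scoring above a threshold are retrieved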

18
Vector Space Model (contd.)
  • Query → "baking bread"
  • Threshold: 0.5
  • Col. 5 can be written as the sum of col. 2 and
    col. 3
  • Query → "baking"
  • Not able to retrieve the 4th document even though
    it is more comprehensive

19
Vector Space Model (contd.)
  • The rank-4 matrix has 4 non-zero singular values
  • The two zero rows → the first four columns of U
    form a basis of the column space
  • Berry, M. W., Zlatko Drmac and E. R. Jessup,
    "Matrices, Vector Spaces, and Information
    Retrieval", SIAM Review, 1999

20
Vector Space Model (contd.)
  • Rank-3 approximation of the original
    term-document matrix (a general rank-k sketch
    follows)
  • Considering the top three singular values
  • Setting the other elements to 0
  • This subspace represents the structure of the
    matrix better than the original
  • Reduction of the noise which exists in the lower
    singular values
  • Note 1: Deciding the rank-approximation factor is
    still an open question
  • Note 2: Negative values → entries are linear
    combinations of the entries in the original matrix
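
A minimal sketch of the rank-k approximation: keep the top k singular
values and zero the rest, which gives the least-squares best rank-k fit
of the original matrix (k = 3 reproduces the slide's rank-3 case):

    import numpy as np

    def rank_k_approx(A, k):
        """Least-squares best rank-k fit of A via truncated SVD."""
        U, s, Vt = np.linalg.svd(A, full_matrices=False)
        return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]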

21
Vector Space Model (contd.)
  • Using the rank-3 approximation
  • Query → "baking bread"
  • Query → "baking"
  • Remaining cosines are no longer zero but are
    still below the threshold and are cut off
  • Lowering the rank lowers the cost of query
    matching
  • Using the k vectors instead of the d vectors

22
Vector space model (contd.)
  • Redistribution of the data points depending on
    the rank lowering
  • Imagine looking at the 12-dimensional space such
    that the 1st and 2nd singular values form the X
    and Y axes
  • Dhillon, I., Course Notes, CS 395T, Large Scale
    Data Mining, University of Texas at Austin,
    http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html

23
Latent semantic analysis
  • Experimental Method
  • Data (random assignment)
  • 80% of the Brown Corpus → training; 20% of the
    Brown Corpus → testing
  • Sentences containing the confusion words were
    extracted for training and testing
  • Training
  • Each training sentence is used as a column in the
    LSA matrix → each training sentence is treated
    as a document

24
Latent semantic analysis (contd.)
  • Transformations prior to LSA processing (sketched
    after this list)
  • Context reduction
  • Column vector reduced to the confusion word plus
    7 words on either side (instead of the average
    sentence length of 28 words in the corpus)
  • Reduces running time and storage requirements
  • Stemming
  • Reducing each word to its morphological root
    (Porter's algorithm, 1980)
  • Bi-gram Creation
  • Formed between all adjacent word pairs and
    treated as additional terms in the term-document
    matrix
  • Term Weighting
  • Local Weighting
  • Terms nearer the confusion word given more weight
    in a linearly decreasing manner
  • Global Weighting
  • Weight given to each term depending on the
    importance of the word in the corpus as a whole
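
A hedged sketch of this pipeline under toy assumptions: the function
and variable names are illustrative, and rstrip("s") is only a crude
stand-in for the Porter stemmer named above:

    def preprocess(words, idx, window=7):
        """Trim to +/- window words around the confusion word at idx,
        crudely stem, add adjacent-pair bigrams, and give terms a
        linearly decreasing local weight by distance."""
        lo, hi = max(0, idx - window), min(len(words), idx + window + 1)
        context = [w.lower().rstrip("s") for w in words[lo:hi]]  # "stem"
        bigrams = ["_".join(p) for p in zip(context, context[1:])]
        center = idx - lo
        local = {w: window + 1 - abs(i - center)
                 for i, w in enumerate(context)}
        return context + bigrams, local

    sent = "the meeting will start in five minuets if everyone is on time".split()
    terms, weights = preprocess(sent, sent.index("minuets"))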

25
Latent semantic analysis (contd.)
  • Testing (sketched below)
  • Words from the confusion set are inserted at the
    confusion-word location
  • A vector in LSA space is constructed for the test
    sentence
  • Vector similarity of the test vector with each
    confusion-word vector is computed using the
    cosine between the two vectors
  • The largest cosine value identifies the most
    probable word, in a least-squares sense, for the
    test sentence
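
A hedged sketch of the test step as a fold-in, assuming Uk and sk come
from a truncated SVD of the training term-document matrix; the names
are illustrative, not the presenters' code:

    import numpy as np

    def fold_in(q, Uk, sk):
        """Map a test passage's raw term vector q into the k-dim LSA
        space: q @ Uk divided elementwise by the singular values."""
        return (q @ Uk) / sk

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Predict the confusion word whose LSA vector has the largest
    # cosine with the folded-in test vector.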

26
Latent semantic analysis (contd.)
  • Results
  • Baseline performance: percentage of correct
    predictions using the most frequent word
  • LSA performs better on words with the same part
    of speech
  • Since it doesn't include part-of-speech
    information, it does not perform as well as
    Tribayes on the second group

27
Latent semantic analysis (contd.)
  • Advantages of SVD and LSA
  • Noise removal → better semantic description
  • Incorporate synonymy and polysemy
  • Synonymy: many ways to refer to the same object
  • Polysemy: words have more than one distinct
    meaning
  • Disadvantages of SVD and LSA
  • Computationally expensive
  • Matrices are large and sparse
  • Typically many singular vectors are required
    (100-500)
  • Not intuitive

28
Conclusions
  • Overview of spelling correction mechanisms
  • Vector Space Model
  • Singular Value Decomposition
  • Real word spelling errors
  • Latent semantic analysis
  • Tribayes
  • Comparative results of the two methods

29
References
  • Papers
  • Jones, M. P. and James H. Martin, "Contextual
    Spelling Correction Using Latent Semantic
    Analysis", Proc. 5th Conf. on Applied Natural
    Language Processing, 1997
  • Deerwester, S., Dumais, S. T., Landauer, T. K.,
    Furnas, G. W. and Harshman, R. A., "Indexing by
    Latent Semantic Analysis", Journal of the
    American Society for Information Science, 41(6),
    1990
  • Berry, M. W., Zlatko Drmac and E. R. Jessup,
    "Matrices, Vector Spaces, and Information
    Retrieval", SIAM Review, 1999
  • Sonia Leach, "Singular Value Decomposition: A
    Primer", Brown University
  • Karen Kukich, "Techniques for Automatically
    Correcting Words in Text", ACM Computing Surveys,
    1992
  • J. L. Peterson, "Computer Programs for Detecting
    and Correcting Spelling Errors", Communications
    of the ACM, 1980
  • J. L. Peterson, "A Note on Undetected Typing
    Errors", Communications of the ACM, Vol. 29, No.
    7, 1986
  • Kernighan, M. D., K. W. Church and W. A. Gale, "A
    Spelling Correction Program Based on a Noisy
    Channel Model", Proceedings of COLING '90,
    Helsinki, Finland, 1990
  • Course Notes
  • Dhillon, I., "LSI with SVD", Course Notes, CS
    395T, Large Scale Data Mining, University of
    Texas at Austin,
    http://www.cs.utexas.edu/users/inderjit/courses/dm2000.html
  • Sarkar, A., "Edit Distances for Spelling
    Correction", Course Notes, CMPT-413,
    Computational Linguistics, Simon Fraser
    University, Canada,
    http://www.sfu.ca/~anoop/courses/CMPT-413-Spring-2003/