Lecture 3 summary - PowerPoint PPT Presentation

About This Presentation
Title:

Lecture 3 summary

Slides: 3
Provided by: Saurabh70
Transcript and Presenter's Notes


1
Lecture 12 Summary
  • Assuming that the correct word differs from the
    misspelling by a single insertion, deletion, or
    substitution, a list of candidate words is
    generated from the typo (each candidate differs
    from the misspelling by at most one error).
  • We then score each candidate correction c for
    the typo t by P(t|c) P(c) and choose as the
    correct word the candidate with the highest
    probability.
  • The prior P(c) is estimated from C, the count of
    the candidate word in the corpus, and V, the
    number of different words in the corpus.
  • The likelihood P(t|c) is estimated by simple
    heuristics governed by the source of the typo.
  • Confusion matrix: a 26x26 matrix that records
    the number of times one letter was incorrectly
    used instead of another (4 confusion matrices
    are used for computing likelihoods, one for each
    type of error: deletion, insertion,
    substitution, and reversal).
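
The candidate-and-score procedure above can be sketched roughly as follows. This is a minimal illustration, not the lecture's own code: the function names are hypothetical, the add-0.5 smoothing in the prior is an assumed form (the slide names only C and V), and the likelihood defaults to a uniform placeholder because the confusion-matrix counts are not given here. The toy word counts are made up for the example.

```python
import string
from collections import Counter

ALPHABET = string.ascii_lowercase


def single_edit_candidates(typo):
    """Strings one deletion, insertion, substitution, or reversal away from typo."""
    splits = [(typo[:i], typo[i:]) for i in range(len(typo) + 1)]
    deletions = [a + b[1:] for a, b in splits if b]
    reversals = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + ch + b[1:] for a, b in splits if b for ch in ALPHABET]
    insertions = [a + ch + b for a, b in splits for ch in ALPHABET]
    # Substituting a letter with itself reproduces the typo, so drop it.
    return set(deletions + reversals + substitutions + insertions) - {typo}


def prior(word, counts):
    """Assumed smoothed prior: (C + 0.5) / (N + 0.5 * V)."""
    n = sum(counts.values())  # N: total number of word tokens (assumption)
    v = len(counts)           # V: number of different words in the corpus
    return (counts[word] + 0.5) / (n + 0.5 * v)


def correct(typo, counts, likelihood=lambda t, c: 1.0):
    """Pick the known candidate c that maximizes P(t|c) * P(c)."""
    candidates = [c for c in single_edit_candidates(typo) if c in counts]
    if not candidates:
        return typo  # no known word within one edit; leave the input as is
    return max(candidates, key=lambda c: likelihood(typo, c) * prior(c, counts))


if __name__ == "__main__":
    # Toy counts, invented purely for illustration.
    toy_counts = Counter({"actress": 3, "across": 5, "acres": 2,
                          "caress": 1, "access": 4})
    print(correct("acress", toy_counts))
```

With a uniform likelihood the choice reduces to the most frequent candidate; passing a `likelihood` function built from the four confusion matrices would let the error type influence the ranking.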

2
Lecture 12 Summary
  • Minimum Edit Distance
  • The previous algorithm assumes that each word
    has a single spelling error.
  • To handle multiple errors, we need a way to
    measure the distance between two strings.
  • The Minimum Edit Distance between two strings
    is the minimum number of editing operations
    (insertions, deletions, substitutions) needed to
    transform one string into the other. It can be
    represented as a trace, an alignment, or an
    operation list, and is also known as Levenshtein
    distance.
  • It is computed by dynamic programming, by
    creating a distance matrix with one column for
    each symbol in the target sequence and one row
    for each symbol in the source sequence.
  • Each cell edit_distance[i, j] contains the
    distance between the first i characters of the
    target and the first j characters of the source.
    The value in each cell is computed as the minimum
    over the 3 possible paths into it (an insertion,
    a deletion, or a substitution).
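
The dynamic-programming computation described above might look like the sketch below. It assumes unit costs for insertion and deletion and takes the substitution cost as a parameter (1 gives plain Levenshtein distance; some presentations use 2). The slide does not state which costs the lecture used, so these are assumptions, and the function name is illustrative. The matrix has one extra row and column for the empty prefix.

```python
def min_edit_distance(source, target, sub_cost=1):
    """Minimum edit distance from source to target via dynamic programming."""
    n_rows, n_cols = len(source) + 1, len(target) + 1
    # dist[j][i]: distance between the first j characters of the source
    # and the first i characters of the target.
    dist = [[0] * n_cols for _ in range(n_rows)]
    for j in range(1, n_rows):
        dist[j][0] = j  # delete all j source characters
    for i in range(1, n_cols):
        dist[0][i] = i  # insert all i target characters
    for j in range(1, n_rows):
        for i in range(1, n_cols):
            same = source[j - 1] == target[i - 1]
            dist[j][i] = min(
                dist[j - 1][i] + 1,  # deletion of a source character
                dist[j][i - 1] + 1,  # insertion of a target character
                dist[j - 1][i - 1] + (0 if same else sub_cost),  # substitution or match
            )
    return dist[-1][-1]


if __name__ == "__main__":
    print(min_edit_distance("kitten", "sitting"))
    print(min_edit_distance("intention", "execution", sub_cost=2))
```

Each cell takes the cheapest of its three neighbouring paths, so the bottom-right cell holds the distance between the full strings. Keeping back-pointers alongside the costs would also recover the alignment or operation list.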