Learning a Spelling Error Model from Search Query Logs

1
Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak, University of Alberta
2
Overview
  • Motivation and Prior Work
  • Learning an Error Model
  • Results
  • Future Work

3
Background
  • Motivation
  • Over 700 M search queries made every day
  • 10% misspelled
  • Problems
  • Queries are often not found in a dictionary
  • E.g. multiplayer, blog, federline
  • Many possible candidate corrections for any given
    misspelled query

4
Motivation
  • Blue: Dictionary Words
  • 27% of unique types
  • 80% of all words
  • Yellow: Non-Dictionary Words
  • 73% of token types
  • 20% of all words

[Figure: Token Frequency vs. Rank for Dictionary and Non-Dictionary Words]
5
What are these non-dictionary words?
6
Possible Approaches
  • 1. Naïve Method
  • Search a dictionary for the closest match, using
    Levenshtein edit distance
  • Returns the minimum number of insertions,
    deletions, and substitutions needed to transform
    one word into another
  • Assigns a uniform cost to every substitution
  • The best word is the one with minimum edit
    distance from the misspelled word
  • Why not just use Levenshtein?
  • E.g. britny -> briny vs. britney
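
The naïve method can be sketched in a few lines. This is a minimal illustration of plain Levenshtein distance with uniform costs, not the presenters' code; the function name `levenshtein` is an assumption made for this example. It shows why the britny case is a problem: both candidates are exactly one edit away, so edit distance alone cannot choose between them.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # deleting the first i letters of a
        for j, cb in enumerate(b, start=1):
            sub_cost = 0 if ca == cb else 1  # uniform substitution cost
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + sub_cost))  # match or substitution
        prev = curr
    return prev[-1]


print(levenshtein("britny", "briny"))    # 1 (delete "t")
print(levenshtein("britny", "britney"))  # 1 (insert "e")
```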

7
Better Method
  • 2. Incorporate a Language Model
  • Use Levenshtein edit distance and word
    probability to select the best match
  • Use Levenshtein to find candidate words
  • Rank candidates by word probability
  • Mays, Damerau 1991
  • Spelling correction using a bigram language model
    and Levenshtein edit distance
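
A rough sketch of the two-stage idea: Levenshtein distance selects the candidates, and word probability ranks them. The word counts below are invented for illustration, and `rank_candidates` is a hypothetical helper name, not something taken from the presentation.

```python
def levenshtein(a: str, b: str) -> int:
    """Uniform-cost edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def rank_candidates(word, counts, max_dist=1):
    """Keep words within max_dist edits, then sort by unigram probability."""
    total = sum(counts.values())
    candidates = [w for w in counts if levenshtein(word, w) <= max_dist]
    return sorted(candidates, key=lambda w: counts[w] / total, reverse=True)


# Toy counts (assumed values, not from the query logs in the talk).
counts = {"britney": 900, "briny": 5, "brittany": 95}
print(rank_candidates("britny", counts))  # ['britney', 'briny']
```

With word probability as the tie-breaker, britney is preferred over briny even though both are one edit away from britny.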

8
Even Better...
  • 3. Use probabilistic edit distance and word
    probability
  • probabilistic edit distance
  • each type of insertion, deletion and substitution
    has its own edit cost EC
  • E.g. $P(e \to i) > P(e \to z)$, so we want
    $EC(e \to i) < EC(e \to z)$
  • Word probability
  • Use unigram, bigram, or trigram probabilities
  • E.g. unigram probability: $P(w_i) = c(w_i)/N$
  • How do we integrate these probabilities?

9
Noisy Channel Model
  • Basic Noisy Channel Model
  • Kernighan, Church, Gale 1990
  • Use a dictionary to find candidates w within 1
    edit of v
  • Given misspelled word v, find the best w:
    $\hat{w} = \arg\max_w P(w \mid v) = \arg\max_w P(v \mid w)\,P(w)$
  • What do we want?
  • Language model $P(w)$
  • Error model $P(v \mid w)$
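
The noisy channel decision rule can be illustrated as follows. This is only a sketch: the probabilities are made-up toy values, and the error model is represented as a lookup table keyed by (misspelling, word), which is an assumption made for brevity. Scores are summed in log space to avoid underflow.

```python
import math


def noisy_channel_correct(v, candidates, language_model, error_model):
    """Return the candidate w that maximises log P(v|w) + log P(w)."""
    def score(w):
        return math.log(error_model[(v, w)]) + math.log(language_model[w])
    return max(candidates, key=score)


# Toy values (assumed): P(w) and P(v|w) for two candidates.
language_model = {"briny": 0.0001, "britney": 0.01}
error_model = {("britny", "briny"): 0.02, ("britny", "britney"): 0.01}

best = noisy_channel_correct("britny", list(language_model),
                             language_model, error_model)
print(best)  # britney
```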

10
Language Model
  • Can be determined from query logs
  • Brill, Moore 2004
  • N-gram language model derived from search queries
  • Log thousands (millions) of search queries
  • http://www.metacrawler.com/perl/metaspy
  • real-time display of search queries processed by
    the metacrawler search engine
  • compile word probabilities (Unigram, Bigram, etc)
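
Compiling word probabilities from a query log might look like the sketch below. It uses unsmoothed maximum-likelihood estimates, and the three queries are invented for illustration; `train_language_model` is a hypothetical name. A real system would need smoothing for unseen words and bigrams.

```python
from collections import Counter


def train_language_model(queries):
    """Return unigram P(w) and bigram P(b|a) estimates from a list of queries."""
    unigrams = Counter()
    bigrams = Counter()
    for query in queries:
        words = query.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))  # adjacent word pairs
    total = sum(unigrams.values())
    p_unigram = {w: c / total for w, c in unigrams.items()}
    # P(b | a) ~ c(a, b) / c(a); a rough estimate that ignores query boundaries.
    p_bigram = {(a, b): c / unigrams[a] for (a, b), c in bigrams.items()}
    return p_unigram, p_bigram


p_uni, p_bi = train_language_model(["new jersey", "new york", "new jersey map"])
print(p_uni["new"], p_bi[("new", "jersey")])
```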

11
Error Model
  • Probability of misspelling v given word w
  • Depends on the probability of each edit operation:
    $P(v \mid w) = \prod_k P(e_k)$
  • Taking the log of both sides gives
    $\log P(v \mid w) = \sum_k \log P(e_k)$
  • How do we relate edit cost (lower is better)
    and probability (higher is better)?
  • $EC(e) = -\log P(e)$ (Ristad, Yianilos 1997)
  • So, $ED(v, w) = \sum_k EC(e_k) = -\log P(v \mid w)$
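
One way to turn edit probabilities into a distance is sketched below: each edit costs -log P(e), and dynamic programming finds the cheapest alignment of the intended word w to the typed word v. The probability table, the default values for unseen edits, and the use of "_" for the empty letter are all assumptions made for this example. Taking the single best alignment is an approximation of -log P(v|w).

```python
import math


def weighted_edit_distance(v, w, prob, default=0.01, match_default=0.9):
    """Minimum total of -log P(e) over alignments of intended w to typed v."""
    def cost(intended, typed):
        fallback = match_default if intended == typed else default
        return -math.log(prob.get((intended, typed), fallback))

    n, m = len(w), len(v)
    d = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + cost(w[i - 1], "_")       # letter of w dropped
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + cost("_", v[j - 1])       # extra letter typed
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i][j] = min(d[i - 1][j] + cost(w[i - 1], "_"),
                          d[i][j - 1] + cost("_", v[j - 1]),
                          d[i - 1][j - 1] + cost(w[i - 1], v[j - 1]))
    return d[n][m]


# Toy table (assumed): typing "i" for "e" is far likelier than typing "z".
prob = {("e", "i"): 0.05, ("e", "z"): 0.001}
print(weighted_edit_distance("bit", "bet", prob) <
      weighted_edit_distance("bzt", "bet", prob))  # True
```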

12
Learning the Error Model
  • How do we find the edit probabilities P(e)?
  • Use a hand compiled list of spelling errors and
    their corrections
  • Compile statistics on the edit operations
  • OR...
  • use the language model to determine the error
    model using expectation maximization

13
Expectation Maximization
  • Soft Clustering (EM)
  • Given a data point (possibly misspelled word v)
    and clusters (possible corrections $w_i$)
  • Assign the data point (v) to each cluster (w) in
    proportion to how well it fits the cluster:
    $P(v \mid w)$, $P(w)$
  • Update the cluster centers (edit costs) to
    reflect the inclusion of the new data
14
Use EM to find Edit Distances
  • Start with a naïve error model
  • Use Expectation Maximization to improve it
  • For each query v
  • Determine the most likely candidate corrections
    using the existing edit distance model and
    language model (E-Step)
  • For each candidate word $w_i$ within edit distance $x$
  • candidates = $\arg\max_n P(v \mid w_n)\,P(w_n)$
  • $P(v \mid w) = \prod_k P(e_k)$
  • one candidate may be the word itself
  • Update the edit distance model (M-Step)
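
The E-step can be read as computing a normalised weight for each candidate correction. The sketch below assumes the error model is available as a table of P(v|w) values and uses invented numbers; `e_step` is a hypothetical name. Note that the query itself is included as a candidate, as the slide suggests.

```python
def e_step(v, candidates, language_model, error_model):
    """Weight each candidate w in proportion to P(v|w) * P(w)."""
    scores = {w: error_model[(v, w)] * language_model[w] for w in candidates}
    z = sum(scores.values())
    return {w: s / z for w, s in scores.items()}


# Toy values (assumed), with the query "scopl" itself as one candidate.
language_model = {"school": 0.005, "scope": 0.001, "scopl": 0.00001}
error_model = {("scopl", "school"): 0.01,
               ("scopl", "scope"): 0.02,
               ("scopl", "scopl"): 0.9}

weights = e_step("scopl", list(language_model), language_model, error_model)
print(max(weights, key=weights.get))  # school
```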

15
M-Step
  • M-Step
  • Given $P(e_1 \ldots e_n)$
  • Each $e_k$ is a single insertion, deletion, or
    substitution of two letters
  • Want to adjust $P(e_1), \ldots, P(e_n)$ accordingly
  • Update Frequency Table
  • $F(e_k) \leftarrow F(e_k) + P(w_i)$
  • Normalize
  • $P(e_k) = F(e_k) / N$
  • N = total number of edit operations for that letter
  • Convert into Edit Distance
  • $D(e_k) = -\log P(e_k)$
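
The three M-step operations (update the frequency table, normalise, convert to distances) could be sketched as below. The input format, a list of (edit sequence, weight) pairs with "_" standing for the empty letter, is an assumption for this example, as is normalising over all edits that share the same intended letter. The school alignment follows the scopl example in this deck; the scope alignment and its weight are invented.

```python
import math
from collections import defaultdict


def m_step(weighted_alignments):
    """Re-estimate edit probabilities and costs from weighted alignments."""
    freq = defaultdict(float)
    for edits, weight in weighted_alignments:
        for edit in edits:                 # edit = (intended letter, typed letter)
            freq[edit] += weight           # update frequency table
    totals = defaultdict(float)
    for (intended, _typed), f in freq.items():
        totals[intended] += f              # N: total weight for that letter
    prob = {e: f / totals[e[0]] for e, f in freq.items()}   # normalise
    cost = {e: -math.log(p) for e, p in prob.items()}       # D = -log P
    return prob, cost


alignments = [
    ([("s", "s"), ("c", "c"), ("h", "_"), ("o", "o"), ("o", "p"), ("l", "l")], 0.11),
    ([("s", "s"), ("c", "c"), ("o", "o"), ("p", "p"), ("e", "l")], 0.05),  # assumed
]
prob, cost = m_step(alignments)
print(round(prob[("o", "p")], 3))  # 0.407
```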

16
EM Example
  • E and M-Step working together

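[Diagram: cycle through E-Step, Update Frequency Table, Normalization, Probability Table, and Edit Distance Table, with D = -log(P)]
Example: $P(\text{equipment} \mid \text{equibmnt}) = 0.11$, with edits $e \to e,\ q \to q,\ u \to u,\ i \to i,\ p \to b,\ m \to m,\ e \to \_,\ n \to n,\ t \to t$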
17
Example
  • Say we are using a bigram language model and see
    the following bigram
  • High scopl (v = scopl)
  • 1. Find all possible candidate corrections w and
    their probabilities

18
EM Example
  • E and M-Step working together

$P(\text{school} \mid \text{scopl}) = P(s \to s,\ c \to c,\ h \to \_,\ o \to o,\ o \to p,\ l \to l) = 0.11$
19
Results: Most Common Letter Substitutions
20
Results: Letter Insertion Probabilities
21
Evaluation
  • Test of 508 Misspelled Dictionary Words
  • Corrections Returned are Ranked by Probability
  • Percentage of times that the correct word was
    within the top 1, 5, and 25 of the returned corrections
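
The evaluation measure described above amounts to top-k accuracy. A minimal sketch follows, using invented test cases rather than the 508-word test set; `top_k_accuracy` is a hypothetical helper name.

```python
def top_k_accuracy(results, k):
    """Fraction of test cases whose correct word is in the first k suggestions.

    results: list of (correct_word, ranked_suggestions) pairs.
    """
    hits = sum(1 for truth, ranked in results if truth in ranked[:k])
    return hits / len(results)


# Toy results (assumed): suggestions are already ranked by probability.
results = [
    ("school", ["school", "scope", "scoop"]),
    ("britney", ["briny", "britney"]),
    ("soften", ["often", "sofa", "seven"]),
]
for k in (1, 5, 25):
    print(k, top_k_accuracy(results, k))
```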

22
Future Work
  • Better Language Model (Word Bigrams)
  • New jeresy 5
  • New jers 1
  • New jersery 3
  • New jerseu 1
  • New jersey 4
  • New jersey 4654
  • New jersy 19
  • New jersye 1
  • Use Letter Context
  • (condition on previous letter $l_{t-1}$)
  • E.g. sofen -> soften
  • $P(t \to \_ \mid f)$ instead of just $P(t \to \_)$
  • Transpositions
  • their -> thier
  • Use a stemmer
  • hot dog vs. hot dogs
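
For the transposition item above, one common extension is to count a swap of two adjacent letters as a single edit. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance with uniform costs; it is a generic illustration, not the extension the presenters planned, and `osa_distance` is a name chosen for this example.

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance where swapping two adjacent letters counts as one edit."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub_cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,              # deletion
                          d[i][j - 1] + 1,              # insertion
                          d[i - 1][j - 1] + sub_cost)   # match or substitution
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]


print(osa_distance("thier", "their"))  # 1
```

Under plain Levenshtein distance the same pair would cost two edits.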

23
Questions?