Title: Learning a Spelling Error Model from Search Query Logs
1. Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak, University of Alberta
2. Overview
- Motivation and Prior Work
- Learning an Error Model
- Results
- Future Work
3. Background
- Motivation
- Over 700 million search queries are made every day
- Roughly 10% are misspelled
- Problems
- Queries are often not found in a dictionary
- E.g. multiplayer, blog, federline
- Many possible candidate corrections for any given
misspelled query
4. Motivation
- Blue: Dictionary Words
- 27% of unique types
- 80% of all words
- Yellow: Non-Dictionary Words
- 73% of unique types
- 20% of all words
Token Frequency vs. Rank for Dictionary and Non-Dictionary Words
5. What are these non-dictionary words?
6. Possible Approaches
- 1. Naïve Method (sketched below)
- Search a dictionary for the closest match, using Levenshtein edit distance
- Returns the minimum number of insertions, deletions, and substitutions needed to transform one word into another
- Assigns a uniform cost to every edit operation
- The best word is the one with minimum edit distance from the misspelled word
- Why not just use Levenshtein?
- E.g. britny: briny vs. britney (both are only one edit away, so uniform costs cannot prefer the intended word)
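A minimal sketch of this naïve method. The dictionary lookup and example words are illustrative, not the authors' code; the edit-distance routine is the standard dynamic-programming Levenshtein algorithm.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance: minimum number of insertions,
    deletions, and substitutions (all with uniform cost 1) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

def naive_correct(query, dictionary):
    """Return the dictionary word with the minimum edit distance to the query."""
    return min(dictionary, key=lambda w: edit_distance(query, w))

# Both candidates are one edit away from the misspelling, so uniform
# costs cannot tell them apart -- the motivation for a better model.
print(edit_distance("britny", "briny"))    # 1
print(edit_distance("britny", "britney"))  # 1
```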
7. Better Method
- 2. Incorporate a Language Model (sketched below)
- Use Levenshtein edit distance and word probability to select the best match
- Use Levenshtein to find candidate words
- Rank candidates by word probability
- Mays, Damerau 1991
- Spelling correction using a bigram language model and Levenshtein edit distance
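A sketch of this two-stage method, reusing the `edit_distance` function from the previous sketch; the word-probability table is invented for illustration, not data from the paper.

```python
# Hypothetical unigram probabilities, e.g. estimated from query logs.
word_prob = {"briny": 1e-7, "britney": 5e-5, "brainy": 2e-6}

def correct_with_lm(query, word_prob, max_dist=2):
    """Find candidate words within a small edit distance of the query,
    then rank them by word probability rather than by edit distance alone."""
    candidates = [w for w in word_prob if edit_distance(query, w) <= max_dist]
    return max(candidates, key=lambda w: word_prob[w])

print(correct_with_lm("britny", word_prob))  # 'britney'
```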
8. Even Better...
- 3. Use probabilistic edit distance and word probability (sketched below)
- Probabilistic edit distance
- Each type of insertion, deletion, and substitution has its own edit cost EC
- E.g. P(e→i) > P(e→z), so we want EC(e→i) < EC(e→z)
- Word probability
- Use unigram, bigram, or trigram probabilities
- E.g. unigram probability: P(w_i) = c(w_i) / N
- How do we integrate these probabilities?
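A sketch of an edit distance in which every operation can carry its own cost. The cost table is a toy illustration (a cheap e→i substitution, an expensive e→z one), not values learned in the paper.

```python
DEFAULT_COST = 1.0

# Toy per-operation costs: ('e', 'i') is substituting e with i,
# ('e', '') deleting e, and ('', 'i') inserting i.
edit_cost = {('e', 'i'): 0.3, ('e', 'z'): 2.0}

def cost(x, y):
    """Cost of the single edit operation x -> y (0 if the letters match)."""
    if x == y:
        return 0.0
    return edit_cost.get((x, y), DEFAULT_COST)

def weighted_edit_distance(source, target):
    """Dynamic-programming edit distance where every insertion, deletion,
    and substitution can have its own cost."""
    prev = [0.0]
    for j, t in enumerate(target, 1):
        prev.append(prev[j - 1] + cost('', t))             # insertions only
    for s in source:
        curr = [prev[0] + cost(s, '')]                     # deletions only
        for j, t in enumerate(target, 1):
            curr.append(min(prev[j] + cost(s, ''),         # delete s
                            curr[j - 1] + cost('', t),     # insert t
                            prev[j - 1] + cost(s, t)))     # substitute s -> t
        prev = curr
    return prev[-1]

print(weighted_edit_distance("e", "i"))  # 0.3: the cheap e -> i substitution
print(weighted_edit_distance("e", "z"))  # 2.0
```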
9. Noisy Channel Model
- Basic Noisy Channel Model (see the sketch below)
- Kernighan, Church, Gale 1990
- Use a dictionary to find candidates w within 1 edit of v
- Given misspelled word v, find the best w: w* = argmax_w P(w) P(v|w)
- What do we want?
- Language model P(w)
- Error model P(v|w)
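A sketch of the noisy-channel decision rule in log space; the probability tables here are placeholders, whereas in the paper P(w) comes from the language model and P(v|w) from the error model.

```python
import math

# Placeholder language model P(w) and error model P(v|w) values.
word_prob = {"briny": 1e-7, "britney": 5e-5}
error_prob = {("britny", "briny"): 0.01, ("britny", "britney"): 0.02}

def best_correction(v, candidates):
    """Noisy channel: choose w maximizing P(w) * P(v|w), computed in log space."""
    return max(candidates,
               key=lambda w: math.log(word_prob[w]) + math.log(error_prob[(v, w)]))

print(best_correction("britny", ["briny", "britney"]))  # 'britney'
```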
10. Language Model
- Can be determined from query logs
- Brill, Moore 2004
- N-gram language model derived from search queries
- Log thousands (or millions) of search queries
- http://www.metacrawler.com/perl/metaspy
- Real-time display of search queries processed by the MetaCrawler search engine
- Compile word probabilities (unigram, bigram, etc.); see the sketch below
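A sketch of compiling unigram probabilities from logged queries; the example log is invented.

```python
from collections import Counter

# A few logged queries (invented for illustration).
query_log = ["new jersey maps", "new jersy", "britney spears", "new jersey"]

tokens = [word for query in query_log for word in query.lower().split()]
counts = Counter(tokens)
N = len(tokens)

# Unigram probability: P(w) = c(w) / N
unigram_prob = {w: c / N for w, c in counts.items()}
print(unigram_prob["new"])  # 3 occurrences out of 9 tokens
```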
11. Error Model
- Probability of misspelling v given word w: P(v|w) = Π_k P(e_k)
- Depends on the probability of each edit operation e_k
- Taking the log of both sides gives: log P(v|w) = Σ_k log P(e_k)
- How do we relate edit cost (lower is better) and probability (higher is better)?
- EC(e) = -log P(e)  (Ristad, Yianilos 1997)
- So ED(v,w) = -log P(v|w) = Σ_k EC(e_k)  (see the sketch below)
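A small sketch of this cost-probability relationship, with invented edit probabilities.

```python
import math

# Invented edit-operation probabilities P(e).
edit_prob = {('e', 'i'): 0.05, ('e', 'z'): 0.001, ('e', 'e'): 0.9}

# EC(e) = -log P(e): likely edits become cheap, unlikely edits expensive.
edit_cost = {e: -math.log(p) for e, p in edit_prob.items()}

# Since ED(v, w) = -log P(v|w), summing these costs over an alignment
# is equivalent to multiplying the corresponding probabilities.
print(edit_cost[('e', 'i')] < edit_cost[('e', 'z')])  # True
```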
12. Learning the Error Model
- How do we find the edit probabilities P(e)?
- Use a hand-compiled list of spelling errors and their corrections
- Compile statistics on the edit operations
- OR...
- Use the language model to determine the error model using expectation maximization
13. Expectation Maximization
- Given a data point (a possibly misspelled word v) and clusters (possible corrections w_i)
- E-Step: assign the data point v to each cluster w in proportion to how well it fits the cluster, using P(v|w) and P(w)
- M-Step: update the cluster centers (edit costs) to reflect the inclusion of the new data
14. Use EM to find Edit Distances
- Start with a naïve error model
- Use Expectation Maximization to improve it
- For each query v:
- Determine the most likely candidate corrections using the existing edit distance model and language model (E-Step; see the sketch below)
- For each candidate word w_i within edit distance x
- candidates = argmax_n P(v|w_n) P(w_n)
- P(v|w) = Π_k P(e_k)
- One candidate may be the word itself
- Update the edit distance model (M-Step)
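A sketch of the E-step described above. The helpers `align(w, v)` (returns the edit operations turning w into v) and the probability tables are hypothetical inputs, not the authors' code.

```python
import math

def posteriors(v, candidates, word_prob, edit_prob, align):
    """E-step: weight each candidate correction w of query v by
    P(v|w) * P(w), then normalize the weights so they sum to 1."""
    def p_v_given_w(w):
        # Product of per-operation probabilities along the alignment of w to v.
        return math.prod(edit_prob[e] for e in align(w, v))
    scores = {w: p_v_given_w(w) * word_prob[w] for w in candidates}
    total = sum(scores.values())
    return {w: score / total for w, score in scores.items()}
```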
15. M-Step
- M-Step
- Given P(e_1 ... e_n)
- Each e_k is a single insertion, deletion, or substitution of two letters
- Want to adjust P(e_1) ... P(e_n) accordingly
- Update frequency table (see the sketch below)
- F(e_k) += P(w_i)
- Normalize
- P(e_k) = F(e_k) / N
- N = total number of edit operations for that letter
- Convert into edit distance
- D(e_k) = -log(P(e_k))
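A sketch of the M-step following the same scheme, reusing the hypothetical `posteriors` and `align` helpers from the E-step sketch above; `candidates_for` is likewise an assumed helper that generates candidate corrections for a query.

```python
import math
from collections import defaultdict

def m_step(queries, edit_prob, word_prob, candidates_for, align):
    """M-step: accumulate fractional counts for every edit operation,
    weighted by the E-step posteriors, then renormalize and convert to costs."""
    freq = defaultdict(float)
    for v in queries:
        for w, weight in posteriors(v, candidates_for(v),
                                    word_prob, edit_prob, align).items():
            for e in align(w, v):        # e = (intended letter, typed letter)
                freq[e] += weight        # F(e) += posterior weight of candidate w
    # Normalize per intended letter: P(e) = F(e) / N,
    # where N is the total count of edit operations for that letter.
    letter_total = defaultdict(float)
    for (src, _), f in freq.items():
        letter_total[src] += f
    new_edit_prob = {e: f / letter_total[e[0]] for e, f in freq.items()}
    # Convert into edit distances: D(e) = -log P(e).
    edit_dist = {e: -math.log(p) for e, p in new_edit_prob.items()}
    return new_edit_prob, edit_dist
```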
16. EM Example
- E-Step and M-Step working together
- [Diagram: E-Step → update frequency table → normalization → probability table → edit distance table (D = -log P)]
- Example: P(equibmnt | equipment) = P(e→e) P(q→q) P(u→u) P(i→i) P(p→b) P(m→m) P(e→ε) P(n→n) P(t→t) = 0.11
17. Example
- Say we are using a bigram language model and see the following bigram:
- "High scopl"  (v = scopl)
- 1. Find all possible candidate corrections w and their probabilities
18. EM Example
- E-Step and M-Step working together
- P(scopl | school) = P(s→s) P(c→c) P(h→ε) P(o→o) P(o→p) P(l→l) = 0.11
19. Results: Most Common Letter Substitutions
20. Results: Letter Insertion Probabilities
21. Evaluation
- Test set of 508 misspelled dictionary words
- Corrections returned are ranked by probability
- Percentage of times that the correct word was within the top 1, 5, or 25 returned corrections (see the sketch below)
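A sketch of this evaluation metric; `test_pairs` (misspelling, correct word) and `rank_candidates` are hypothetical stand-ins for the 508-word test set and the model's ranked output.

```python
def top_k_accuracy(test_pairs, rank_candidates, ks=(1, 5, 25)):
    """For each k, the fraction of test items whose correct word appears
    among the top-k ranked corrections returned for its misspelling."""
    hits = {k: 0 for k in ks}
    for misspelling, correct in test_pairs:
        ranked = rank_candidates(misspelling)   # best correction first
        for k in ks:
            if correct in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(test_pairs) for k in ks}
```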
22. Future Work
- Better language model (word bigrams), e.g. query-log counts for variants of "new jersey":
- New jeresy 5
- New jers 1
- New jersery 3
- New jerseu 1
- New jersey 4
- New jersey 4654
- New jersy 19
- New jersye 1
- Use letter context
- Condition on the previous letter (see the sketch after this list)
- E.g. sofen → soften
- P(t→ε | f) instead of just P(t→ε)
- Transpositions
- their → thier
- Use a stemmer
- hot dog / hot dogs
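A sketch of what conditioning on the previous letter could look like: the deletion probability is keyed by the preceding letter, with a back-off to the context-free value. All probabilities are invented for illustration.

```python
# Invented probabilities: deleting 't' in general vs. deleting 't'
# when the previous letter is 'f' (as in typing "sofen" for "soften").
p_del_t = 0.02                       # P(t -> eps)
p_del_t_given_prev = {'f': 0.15}     # P(t -> eps | previous letter)

def deletion_prob(prev_letter):
    """Back off to the context-free probability when the context is unseen."""
    return p_del_t_given_prev.get(prev_letter, p_del_t)

print(deletion_prob('f'))  # 0.15
print(deletion_prob('a'))  # 0.02
```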
23. Questions?