Title: Learning a Spelling Error Model from Search Query Logs
1. Learning a Spelling Error Model from Search Query Logs
Farooq Ahmad and Grzegorz Kondrak, University of Alberta
2. Overview
- Motivation and Prior Work
- Learning an Error Model
- Results
- Future Work
3. Background
- Motivation
- Over 700 million search queries are made every day
- Roughly 10% are misspelled
- Problems
- Queries are often not found in a dictionary
- E.g. multiplayer, blog, federline
- Many possible candidate corrections for any given
misspelled query
4. Motivation
- Blue: Dictionary Words
- 27% of unique types
- 80% of all words
- Yellow: Non-Dictionary Words
- 73% of unique types
- 20% of all words
Token Frequency vs. Rank for Dictionary and Non-Dictionary Words
5. What are these non-dictionary words?
6. Possible Approaches
- 1. Naïve Method (sketched below)
- Search a dictionary for the closest match, using Levenshtein edit distance
- Returns the minimum number of insertions, deletions, and substitutions needed to transform one word into another
- Assigns a uniform cost to every edit operation
- The best word is the one with minimum edit distance from the misspelled word
- Why not just use Levenshtein?
- E.g. britny: briny vs. britney (both are only one edit away, so uniform costs cannot prefer the intended word)
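A minimal sketch of this naïve method. The dictionary lookup and example words are illustrative, not the authors' code; the edit-distance routine is the standard dynamic-programming Levenshtein algorithm.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance: minimum number of insertions,
    deletions, and substitutions (all with uniform cost 1) to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute ca -> cb
        prev = curr
    return prev[-1]

def naive_correct(query, dictionary):
    """Return the dictionary word with the minimum edit distance to the query."""
    return min(dictionary, key=lambda w: edit_distance(query, w))

# Both candidates are one edit away from the misspelling, so uniform
# costs cannot tell them apart -- the motivation for a better model.
print(edit_distance("britny", "briny"))    # 1
print(edit_distance("britny", "britney"))  # 1
```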
7. Better Method
- 2. Incorporate a Language Model (sketched below)
- Use Levenshtein edit distance and word probability to select the best match
- Use Levenshtein to find candidate words
- Rank candidates by word probability
- Mays, Damerau 1991
- Spelling correction using a bigram language model and Levenshtein edit distance
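A sketch of this two-stage method, reusing the `edit_distance` function from the previous sketch; the word-probability table is invented for illustration, not data from the paper.

```python
# Hypothetical unigram probabilities, e.g. estimated from query logs.
word_prob = {"briny": 1e-7, "britney": 5e-5, "brainy": 2e-6}

def correct_with_lm(query, word_prob, max_dist=2):
    """Find candidate words within a small edit distance of the query,
    then rank them by word probability rather than by edit distance alone."""
    candidates = [w for w in word_prob if edit_distance(query, w) <= max_dist]
    return max(candidates, key=lambda w: word_prob[w])

print(correct_with_lm("britny", word_prob))  # 'britney'
```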
8. Even Better...
- 3. Use probabilistic edit distance and word probability (sketched below)
- Probabilistic edit distance
- Each type of insertion, deletion, and substitution has its own edit cost EC
- E.g. P(e→i) > P(e→z), so we want EC(e→i) < EC(e→z)
- Word probability
- Use unigram, bigram, or trigram probabilities
- E.g. unigram probability: P(w_i) = c(w_i) / N
- How do we integrate these probabilities?
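A sketch of an edit distance in which every operation can carry its own cost. The cost table is a toy illustration (a cheap e→i substitution, an expensive e→z one), not values learned in the paper.

```python
DEFAULT_COST = 1.0

# Toy per-operation costs: ('e', 'i') is substituting e with i,
# ('e', '') deleting e, and ('', 'i') inserting i.
edit_cost = {('e', 'i'): 0.3, ('e', 'z'): 2.0}

def cost(x, y):
    """Cost of the single edit operation x -> y (0 if the letters match)."""
    if x == y:
        return 0.0
    return edit_cost.get((x, y), DEFAULT_COST)

def weighted_edit_distance(source, target):
    """Dynamic-programming edit distance where every insertion, deletion,
    and substitution can have its own cost."""
    prev = [0.0]
    for j, t in enumerate(target, 1):
        prev.append(prev[j - 1] + cost('', t))             # insertions only
    for s in source:
        curr = [prev[0] + cost(s, '')]                     # deletions only
        for j, t in enumerate(target, 1):
            curr.append(min(prev[j] + cost(s, ''),         # delete s
                            curr[j - 1] + cost('', t),     # insert t
                            prev[j - 1] + cost(s, t)))     # substitute s -> t
        prev = curr
    return prev[-1]

print(weighted_edit_distance("e", "i"))  # 0.3: the cheap e -> i substitution
print(weighted_edit_distance("e", "z"))  # 2.0
```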
9. Noisy Channel Model
- Basic Noisy Channel Model (see the sketch below)
- Kernighan, Church, Gale 1990
- Use a dictionary to find candidates w within 1 edit of v
- Given misspelled word v, find the best w: w* = argmax_w P(w) P(v|w)
- What do we want?
- Language model P(w)
- Error model P(v|w)
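A sketch of the noisy-channel decision rule in log space; the probability tables here are placeholders, whereas in the paper P(w) comes from the language model and P(v|w) from the error model.

```python
import math

# Placeholder language model P(w) and error model P(v|w) values.
word_prob = {"briny": 1e-7, "britney": 5e-5}
error_prob = {("britny", "briny"): 0.01, ("britny", "britney"): 0.02}

def best_correction(v, candidates):
    """Noisy channel: choose w maximizing P(w) * P(v|w), computed in log space."""
    return max(candidates,
               key=lambda w: math.log(word_prob[w]) + math.log(error_prob[(v, w)]))

print(best_correction("britny", ["briny", "britney"]))  # 'britney'
```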
10. Language Model
- Can be determined from query logs
- Brill, Moore 2004
- N-gram language model derived from search queries
- Log thousands (or millions) of search queries
- http://www.metacrawler.com/perl/metaspy
- Real-time display of search queries processed by the MetaCrawler search engine
- Compile word probabilities (unigram, bigram, etc.); see the sketch below
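A sketch of compiling unigram probabilities from logged queries; the example log is invented.

```python
from collections import Counter

# A few logged queries (invented for illustration).
query_log = ["new jersey maps", "new jersy", "britney spears", "new jersey"]

tokens = [word for query in query_log for word in query.lower().split()]
counts = Counter(tokens)
N = len(tokens)

# Unigram probability: P(w) = c(w) / N
unigram_prob = {w: c / N for w, c in counts.items()}
print(unigram_prob["new"])  # 3 occurrences out of 9 tokens
```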
11. Error Model
- Probability of misspelling v given word w: P(v|w) = Π_k P(e_k)
- Depends on the probability of each edit operation e_k
- Taking the log of both sides gives: log P(v|w) = Σ_k log P(e_k)
- How do we relate edit cost (lower is better) and probability (higher is better)?
- EC(e) = -log P(e)  (Ristad, Yianilos 1997)
- So ED(v,w) = -log P(v|w) = Σ_k EC(e_k)  (see the sketch below)
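A small sketch of this cost-probability relationship, with invented edit probabilities.

```python
import math

# Invented edit-operation probabilities P(e).
edit_prob = {('e', 'i'): 0.05, ('e', 'z'): 0.001, ('e', 'e'): 0.9}

# EC(e) = -log P(e): likely edits become cheap, unlikely edits expensive.
edit_cost = {e: -math.log(p) for e, p in edit_prob.items()}

# Since ED(v, w) = -log P(v|w), summing these costs over an alignment
# is equivalent to multiplying the corresponding probabilities.
print(edit_cost[('e', 'i')] < edit_cost[('e', 'z')])  # True
```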
12. Learning the Error Model
- How do we find the edit probabilities P(e)?
- Use a hand-compiled list of spelling errors and their corrections
- Compile statistics on the edit operations
- OR...
- Use the language model to determine the error model using expectation maximization
13. Expectation Maximization
- Given a data point (a possibly misspelled word v) and clusters (possible corrections w_i)
- E-Step: assign the data point v to each cluster w in proportion to how well it fits the cluster, using P(v|w) and P(w)
- M-Step: update the cluster centers (edit costs) to reflect the inclusion of the new data
14. Use EM to find Edit Distances
- Start with a naïve error model
- Use Expectation Maximization to improve it
- For each query v:
- Determine the most likely candidate corrections using the existing edit distance model and language model (E-Step; see the sketch below)
- For each candidate word w_i within edit distance x
- candidates = argmax_n P(v|w_n) P(w_n)
- P(v|w) = Π_k P(e_k)
- One candidate may be the word itself
- Update the edit distance model (M-Step)
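A sketch of the E-step described above. The helpers `align(w, v)` (returns the edit operations turning w into v) and the probability tables are hypothetical inputs, not the authors' code.

```python
import math

def posteriors(v, candidates, word_prob, edit_prob, align):
    """E-step: weight each candidate correction w of query v by
    P(v|w) * P(w), then normalize the weights so they sum to 1."""
    def p_v_given_w(w):
        # Product of per-operation probabilities along the alignment of w to v.
        return math.prod(edit_prob[e] for e in align(w, v))
    scores = {w: p_v_given_w(w) * word_prob[w] for w in candidates}
    total = sum(scores.values())
    return {w: score / total for w, score in scores.items()}
```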
15. M-Step
- M-Step
- Given P(e_1 ... e_n)
- Each e_k is a single insertion, deletion, or substitution of two letters
- Want to adjust P(e_1) ... P(e_n) accordingly
- Update frequency table (see the sketch below)
- F(e_k) += P(w_i)
- Normalize
- P(e_k) = F(e_k) / N
- N = total number of edit operations for that letter
- Convert into edit distance
- D(e_k) = -log(P(e_k))
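A sketch of the M-step following the same scheme, reusing the hypothetical `posteriors` and `align` helpers from the E-step sketch above; `candidates_for` is likewise an assumed helper that generates candidate corrections for a query.

```python
import math
from collections import defaultdict

def m_step(queries, edit_prob, word_prob, candidates_for, align):
    """M-step: accumulate fractional counts for every edit operation,
    weighted by the E-step posteriors, then renormalize and convert to costs."""
    freq = defaultdict(float)
    for v in queries:
        for w, weight in posteriors(v, candidates_for(v),
                                    word_prob, edit_prob, align).items():
            for e in align(w, v):        # e = (intended letter, typed letter)
                freq[e] += weight        # F(e) += posterior weight of candidate w
    # Normalize per intended letter: P(e) = F(e) / N,
    # where N is the total count of edit operations for that letter.
    letter_total = defaultdict(float)
    for (src, _), f in freq.items():
        letter_total[src] += f
    new_edit_prob = {e: f / letter_total[e[0]] for e, f in freq.items()}
    # Convert into edit distances: D(e) = -log P(e).
    edit_dist = {e: -math.log(p) for e, p in new_edit_prob.items()}
    return new_edit_prob, edit_dist
```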
16. EM Example
- E-Step and M-Step working together
- [Diagram: E-Step → update frequency table → normalization → probability table → edit distance table (D = -log P)]
- Example: P(equibmnt | equipment) = P(e→e) P(q→q) P(u→u) P(i→i) P(p→b) P(m→m) P(e→ε) P(n→n) P(t→t) = 0.11
17. Example
- Say we are using a bigram language model and see the following bigram:
- "High scopl"  (v = scopl)
- 1. Find all possible candidate corrections w and their probabilities
18. EM Example
- E-Step and M-Step working together
- P(scopl | school) = P(s→s) P(c→c) P(h→ε) P(o→o) P(o→p) P(l→l) = 0.11
19. Results: Most Common Letter Substitutions
20. Results: Letter Insertion Probabilities
21. Evaluation
- Test set of 508 misspelled dictionary words
- Corrections returned are ranked by probability
- Percentage of times that the correct word was within the top 1, 5, or 25 returned corrections (see the sketch below)
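A sketch of this evaluation metric; `test_pairs` (misspelling, correct word) and `rank_candidates` are hypothetical stand-ins for the 508-word test set and the model's ranked output.

```python
def top_k_accuracy(test_pairs, rank_candidates, ks=(1, 5, 25)):
    """For each k, the fraction of test items whose correct word appears
    among the top-k ranked corrections returned for its misspelling."""
    hits = {k: 0 for k in ks}
    for misspelling, correct in test_pairs:
        ranked = rank_candidates(misspelling)   # best correction first
        for k in ks:
            if correct in ranked[:k]:
                hits[k] += 1
    return {k: hits[k] / len(test_pairs) for k in ks}
```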
22. Future Work
- Better language model (word bigrams), e.g. query-log counts for variants of "new jersey":
- New jeresy 5
- New jers 1
- New jersery 3
- New jerseu 1
- New jersey 4
- New jersey 4654
- New jersy 19
- New jersye 1
- Use letter context
- Condition on the previous letter (see the sketch after this list)
- E.g. sofen → soften
- P(t→ε | f) instead of just P(t→ε)
- Transpositions
- their → thier
- Use a stemmer
- hot dog / hot dogs
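A sketch of what conditioning on the previous letter could look like: the deletion probability is keyed by the preceding letter, with a back-off to the context-free value. All probabilities are invented for illustration.

```python
# Invented probabilities: deleting 't' in general vs. deleting 't'
# when the previous letter is 'f' (as in typing "sofen" for "soften").
p_del_t = 0.02                       # P(t -> eps)
p_del_t_given_prev = {'f': 0.15}     # P(t -> eps | previous letter)

def deletion_prob(prev_letter):
    """Back off to the context-free probability when the context is unseen."""
    return p_del_t_given_prev.get(prev_letter, p_del_t)

print(deletion_prob('f'))  # 0.15
print(deletion_prob('a'))  # 0.02
```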
23. Questions?