Title: Language Models for Handwriting
1Language Models for Handwriting
- Joshua Goodman
- Microsoft Research
- Machine Learning and Applied Statistics Group
- http://www.research.microsoft.com/joshuago
2What does this say?
3What does this say?
4What does this say?
5What does this say?
6Without context, very hard to read, if it's even
possible
7Context is Key
- Very hard to read any individual part of this
without context.
- But the overall string has only one reasonably
likely interpretation.
- Language model tells you which strings are
likely, which ones are not.
8Overview
- What are language models
- (2 Slides of Boring Preliminaries)
- Who uses language models
- Every natural language input technique
- Even a few handwriting systems
- Reduce errors by about 40% (relative)
- Why language modeling is hard for handwriting
- And what to do about it
- How to build language models
- Techniques, tools, etc.
- The future
- Handwriting researchers using language models for
almost all applications
- And contributing back to the speech and language
modeling communities
9What is a language model?
- Gives a probability distribution over text
strings (characters/words/phrases/documents)
- May be easier to model
- P(ink | word sequence) than
- P(word sequence | ink)
- Can train language model on much more/different
data than the ink model
10Language Model Evaluation: Entropy and Perplexity
- Language model quality is typically
measured with entropy or perplexity
- Entropy is just the number of bits required to encode
the test data (should be called the cross entropy of the
model on the test data)
- Perplexity is 2^entropy (a small sketch of both follows below)
- Lower is better for both
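A minimal sketch of the two measures (the per-word probabilities below are made up, not from the talk):

```python
import math

def cross_entropy(word_probs):
    """Cross-entropy in bits per word: the average negative log2 probability
    the model assigns to each word of the test data."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

# Hypothetical per-word probabilities a model assigns to a short test text.
test_word_probs = [0.1, 0.02, 0.3, 0.05]

h = cross_entropy(test_word_probs)   # entropy, in bits per word
perplexity = 2 ** h                  # perplexity = 2^entropy
print(f"cross-entropy = {h:.2f} bits/word, perplexity = {perplexity:.1f}")
```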
11Who uses language models
- Almost every natural language input technique
uses language models
- Speech recognition
- Typing in Japanese and Chinese
- Machine Translation
- A whole bunch of others
- Handwriting recognition
- Simple dictionaries with some kind of frequency
information: everyone who deals with words
- Bigram and trigram models: about 12 papers, all
with good results
12Error-rate correlates with entropy (for speech
recognition)
13Pinyin Conversion
- How to enter Chinese text
- Type phonetically (pinyin)
- Many characters have same sound
- Find the character sequence maximizing
P(pinyin | characters) × P(characters) (toy sketch below)
- P(pinyin | characters) is 1 if correct pinyin, 0 otherwise
- P(characters) is the Language Model
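A toy sketch of that search; the pinyin-to-character dictionary and the language-model scores are invented stand-ins, and real systems score whole sentences rather than single syllables:

```python
# Hypothetical homophones for one pinyin syllable, and made-up LM log-probs.
HOMOPHONES = {"ma": ["马", "妈", "吗"]}

def lm_logprob(chars):
    """Stand-in language model score for a candidate character string."""
    return {"妈": -1.0, "马": -2.0, "吗": -3.0}.get(chars, -10.0)

def convert(pinyin):
    """Pick the characters maximizing P(pinyin | chars) * P(chars), where
    P(pinyin | chars) is 1 for correct readings and 0 otherwise."""
    candidates = HOMOPHONES.get(pinyin, [])   # only the characters with that reading
    return max(candidates, key=lm_logprob) if candidates else None

print(convert("ma"))
```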
14Machine Translation
- Let f be a French sentence we wish to translate
into English.
- Let e be a possible English translation.
- Pick the e that maximizes P(f | e) × P(e)
- P(f | e) is the Translation Model; P(e) is the Language Model
Thanks to Eugene Charniak for this slide
15Machine Translation Error rate versus Entropy
(From Franz Och)
These are old results; the newest results from Franz
Och are trained on 200 BILLION words of data and get
even better results.
16Other Language Model Uses
- Information retrieval
- P(query | document) × P(document)
- Telephone Keypad input
- P(numbers | words) × P(words)
- Soft Keyboard input
- P(pen-down positions | words) × P(words)
- Spelling Correction
- P(observed keys | words) × P(words)
- In each case, P(words) (or P(document)) is the Language Model
17Language Models in Handwriting Systems
- Kinds of language models (unigram, bigram,
trigram)
- Results from handwriting papers
- Character-based vs. word based
- Some of the problems of using LMs for handwriting
18What kind of language model to use
- P(a b c ... q r s)
- = P(a) P(b | a) P(c | a b) ...
- ... P(s | a b c ... q r)
- How can we compute P(s | a b c ... q r)?
- Too hard, so approximate it by one of the following
(count-based estimates are sketched below)
- P(s | q r)  (Trigram)
- P(s | r)  (Bigram)
- P(s)  (Unigram)
- P(s) ≈ 1/vocabulary size  (Uniform)
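An illustrative sketch (not from the talk) of the count-based, maximum-likelihood versions of these estimates; the smoothing slides later deal with the zero-count problem:

```python
from collections import Counter

def ngram_counts(words, n):
    """Count all n-grams in a token sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

corpus = "the cat sat on the mat the cat ran".split()
uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))

def p_unigram(w):
    return uni[(w,)] / sum(uni.values())

def p_bigram(w, prev):
    return bi[(prev, w)] / uni[(prev,)] if uni[(prev,)] else 0.0

def p_trigram(w, prev2, prev1):
    hist = bi[(prev2, prev1)]
    return tri[(prev2, prev1, w)] / hist if hist else 0.0

# P(sat | the cat), P(cat | the), P(the)
print(p_trigram("sat", "the", "cat"), p_bigram("cat", "the"), p_unigram("the"))
```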
19Language modeling for handwriting recognition
- Highlights from 12 papers that I can find
- Hybrid neuro-Markovian (Marukatat et al., 01)
- Error rate drops from 30% to 18% by using a
bigram.
- Quiniou et al., 05: 18% to 10% (bigram and
trigram the same)
- Perraud et al., 03
- From 34% (uniform) to 29% (unigram) to 22.5%
error rates
- Vinciarelli et al., 04
- Tried different test sets, with semi-realistic
mismatch between training and test.
- Results highly dependent on match between
training/test
- Unigram always much better than uniform (no
probabilities) (typically 50% more accurate)
- Typically marginal gains from trigram
- Sometimes large gains from bigram (up to 20%
relative) when the match was good
20Pretty good Bibliography of LMs for handwriting
- Using a Statistical Language Model to Improve the Performance of an HMM-Based Cursive Handwriting System, Marti et al., IJPRAI 2001
- On the Influence of Vocabulary Size and Language Models in Unconstrained Handwritten Text Recognition, Marti et al., ICDAR 2001
- N-gram and N-Class Models for On-line Handwriting Recognition, Perraud et al., ICDAR 2003
- Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models, Vinciarelli et al., IEEE PAMI 2004
- N-Gram Language Models for Offline Handwritten Text Recognition, Zimmerman et al., IWFHR 2004
- Stability Measure of Entropy Estimate and Its Application to Language Model Evaluation, Kim, J., Ryu, S., and Kim, J.H., IWFHR 2004
- An Empirical Study of Statistical Language Models for Contextual Post-processing of Chinese Script Recognition, Li, Y.-X. and Tan, C.L., IWFHR 2004
- Statistical Language Models for On-line Handwritten Sentence Recognition, Quiniou et al., ICDAR 2005
- A Data Structure Using Hashing and Tries for Efficient Chinese Lexical Access, Y.-K. Lam and Q. Huo, ICDAR 2005
- Document Understanding System Using Stochastic Context-Free Grammars, J. Handley, A. Namboodiri, and R. Zanibbi, ICDAR 2005
- Multiple Handwritten Text Line Recognition Systems Derived from Specific Integration of a Language Model, R. Bertolami and H. Bunke, ICDAR 2005
- A Priori and A Posteriori Integration and Combination of Language Models in an On-line Handwritten Sentence Recognition System, Solen Quiniou and Eric Anquetil, IWFHR 2006
21Character-based or Word-based
- Model letter sequences instead of word sequences
- Example: P(letter | letter-3 letter-2 letter-1)
- Typically, use a higher-order n-gram (e.g. 6)
- This is good if you want to model
out-of-vocabulary words, proper names,
misspellings, digit sequences, etc.
22Combining character-based and word-based models
- Can model both in-vocabulary and
out-of-vocabulary words with language models, at
the same time.
- Explicitly model
- P(out-of-vocabulary | previous-word)
- P(out-of-vocabulary | "Mr.")
- Model out-of-vocabulary using character model.
23Digit Sequences
- Might think that a language model can't help with
digit sequences (e.g. street numbers, dollar
amounts)
- Numbers are much more likely to start with 1
(about 30%)
- Dollar amounts are much more likely to end in .00,
.99, .95
The entropy of first digits is about 10% lower
than uniform entropy. 10% lower error rate on
first digits? (A small entropy calculation is sketched below.)
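A small sketch of the entropy comparison, assuming a Benford-like first-digit distribution (the distribution is an assumption for illustration, not data from the talk):

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Benford-style first-digit probabilities for digits 1..9 (assumed for illustration).
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
uniform = [1 / 9] * 9

print(f"uniform first digit: {entropy(uniform):.2f} bits")
print(f"skewed first digit:  {entropy(benford):.2f} bits")
```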
24Why it's hard to use language models for
handwriting
- Model incompatibilities
- Lack of training data
25Model Incompatibilities: Toy Example
- Toy example of a handwriting system: segment into
words, then segment words into letters, then
recognize letters
- Where do we put the language modeling
probabilities? How do we integrate them into the
segmentation and recognition process?
26A slightly more realistic example: Produce a
lattice of scores/results
[Lattice figure: alternative letter hypotheses for each ink segment]
- For each segment, train a recognizer (NN, SVM,
etc.) to recognize the letter in the segment.
Roughly, you learn P(letter | ink)
- What happens when you multiply by the language model?
- P(letter | ink) × P(letter | previous letters)
- This is not a meaningful probability!
- Will it work in practice? Maybe, maybe not.
27Some handwriting systems work very well with LMs
- HMM-based approaches integrate very well with a
language model.
- Hybrid models (Neural Net with HMM, e.g. Marukatat
01) also work well.
- Use a neural net to predict P(ink | state, previous
ink)
- Discriminatively trained neural nets can work
very well, if trained at the sentence level, with
the LM integrated at training time.
- Gradient-Based Learning Applied to Document
Recognition (LeCun et al., 98)
- How to integrate SVMs with an LM? Not clear.
- Often the SVM score is not a probability P(letter | ink)
28Lots of training data, or none
- Usually, can train on millions, hundreds of
millions, or billions of words of data
- Easy to find text that looks like addresses.
- Hard to find text that looks like meeting notes
- Potential gain from language models likely to be
very application specific
29Training Data is key
30Quick Overview of LM Research
- Trigram models are obviously brain-damaged; lots
of improvements:
- Smoothing
- Caching
- Clustering
- Neural methods
- Other stuff
31Smoothing: None
- Lowest perplexity trigram on training data.
- Terrible on test data: if there are no occurrences of
a trigram xyz (C(xyz) = 0), the probability is 0.
- Makes it impossible to recognize the word
32Smoothing: Key Ideas
- Want some way to combine trigram probabilities,
bigram probabilities, and unigram probabilities
- Use trigram/bigram for accurate modeling of
context
- Use unigram to get the best guess you can when data
is sparse.
- Lots of different techniques
- Simple interpolation (sketched below)
- Absolute Discounting
- Katz Smoothing (Good-Turing)
- Interpolated Kneser-Ney Smoothing
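A minimal sketch of the simplest of these techniques, simple (linear) interpolation; the weights here are made up, and in practice they are tuned on held-out data:

```python
def interpolated_prob(w, prev2, prev1, p_tri, p_bi, p_uni,
                      lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    p_tri, p_bi, p_uni are functions returning maximum-likelihood estimates;
    the lambdas must sum to 1 so the result is still a distribution."""
    l3, l2, l1 = lambdas
    return (l3 * p_tri(w, prev2, prev1) +
            l2 * p_bi(w, prev1) +
            l1 * p_uni(w))
```

Used with the count-based estimators sketched earlier, this never assigns zero probability to a word seen in training, even when the trigram count is zero.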
33Smoothing: Interpolated Kneser-Ney
- Simple-to-implement smoothing technique.
- Consistently the best performing across a very
wide range of conditions.
- See the appendix of "A Bit of Progress in Language
Modeling" for pseudo-code (a rough bigram-level sketch follows below)
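This is not the pseudo-code from that appendix; it is a rough sketch of interpolated Kneser-Ney at the bigram level, with an assumed fixed discount (around 0.75 is a common choice):

```python
from collections import Counter

def build_kn_bigram(words, discount=0.75):
    """Interpolated Kneser-Ney for bigrams (rough sketch).
    P(w | v) = max(c(v,w) - D, 0)/c(v) + lambda(v) * Pcont(w),
    where Pcont(w) is proportional to the number of distinct left contexts of w."""
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words)
    continuation = Counter(w for (_, w) in bigrams)   # distinct left contexts per word
    followers = Counter(v for (v, _) in bigrams)      # distinct continuations per context
    total_bigram_types = len(bigrams)

    def prob(w, v):
        c_v = unigrams[v]
        if c_v == 0:
            return continuation[w] / total_bigram_types
        discounted = max(bigrams[(v, w)] - discount, 0) / c_v
        backoff_weight = discount * followers[v] / c_v
        return discounted + backoff_weight * continuation[w] / total_bigram_types

    return prob

p = build_kn_bigram("the cat sat on the mat the cat ran".split())
print(p("cat", "the"), p("the", "cat"))   # P(cat | the), P(the | cat)
```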
34Caching
- If you say something, you are likely to say it
again later.
- Interpolate the trigram model with a cache (sketched below)
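A minimal sketch of the idea; the cache weight is an assumption and would be tuned in practice:

```python
from collections import Counter

class CachedLM:
    """Interpolate a static trigram model with a unigram cache of the
    user's recent words (sketch; the cache weight is illustrative)."""
    def __init__(self, trigram_prob, cache_weight=0.1):
        self.trigram_prob = trigram_prob   # function: (w, prev2, prev1) -> prob
        self.cache = Counter()
        self.cache_weight = cache_weight

    def observe(self, word):
        self.cache[word] += 1              # remember what the user just wrote

    def prob(self, w, prev2, prev1):
        total = sum(self.cache.values())
        p_cache = self.cache[w] / total if total else 0.0
        lam = self.cache_weight if total else 0.0
        return (1 - lam) * self.trigram_prob(w, prev2, prev1) + lam * p_cache
```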
35Caching: Real Life
- Someone writes "The white house"
- System recognizes "The white horse"
- Cache remembers!
- Person writes "The whole house", and, with the cache,
the system recognizes "The whole horse": errors
are locked in.
- Caching works well when users correct as they go;
it works poorly or even hurts without correction.
36Cache Results
37Neural Probabilistic Language Models (Bengio et
al., 2000)
- Multi-layer neural network
- Similar to a convolutional neural net applied to
language modeling
- Largest improvement reported for any single
technique
- Relatively slow and complex to build and apply,
but see ongoing research
- (Combination techniques, e.g. the "Bit of Progress"
paper, have slightly better overall results.)
38Clustering
- CLUSTERING = CLASSES (same thing)
- What is P(Tuesday | party on)?
- Similar to P(Monday | party on)
- Similar to P(Tuesday | celebration on)
- Put words in clusters
- WEEKDAY = Sunday, Monday, Tuesday, ...
- EVENT = party, celebration, birthday, ...
- P(Tuesday | party on) ≈ P(Tuesday | WEEKDAY) ×
P(WEEKDAY | EVENT PREPOSITION)  (sketched below)
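A toy sketch of the cluster-based estimate; the cluster assignments and the two component probabilities are invented for illustration:

```python
# Hypothetical word-to-cluster assignments.
cluster_of = {"Monday": "WEEKDAY", "Tuesday": "WEEKDAY",
              "party": "EVENT", "celebration": "EVENT", "on": "PREPOSITION"}

def cluster_prob(word, history, p_word_given_cluster, p_cluster_given_history):
    """P(word | history) ~= P(cluster(word) | clusters(history)) * P(word | cluster(word))."""
    cluster_history = tuple(cluster_of.get(w, w) for w in history)
    c = cluster_of.get(word, word)
    return p_cluster_given_history(c, cluster_history) * p_word_given_cluster(word, c)

# Example with made-up component estimates:
p = cluster_prob(
    "Tuesday", ("party", "on"),
    p_word_given_cluster=lambda w, c: 1 / 7,    # e.g. each weekday equally likely
    p_cluster_given_history=lambda c, h: 0.4,   # assumed P(WEEKDAY | EVENT PREPOSITION)
)
print(p)  # 0.4 * 1/7
```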
39Cluster Results
40Language Model Compression
- Use handwriting on devices that are too small for
a keyboard
- Will also have memory constraints
- Same reasoning/problems for speech recognition
- Use count cutoffs: discard all n-grams that
don't occur at least k times (sketched below)
- Use Stolcke (1998) entropy-based pruning
- Works better than cutoffs
- Use clustering techniques (Goodman and Gao, 2000)
- Smallest models
- Harder to implement
- Interacts poorly with the tree representation in an
HMM
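A minimal sketch of the count-cutoff idea; the cutoff k is an assumption, and Stolcke pruning instead drops the n-grams whose removal changes the model's entropy least:

```python
from collections import Counter

def apply_count_cutoff(ngram_counts, k=2):
    """Keep only n-grams observed at least k times; everything else is
    handled by backing off to lower-order estimates, shrinking the model."""
    return {ng: c for ng, c in ngram_counts.items() if c >= k}

trigram_counts = Counter({("the", "cat", "sat"): 3, ("cat", "sat", "on"): 1})
print(apply_count_cutoff(trigram_counts, k=2))  # drops the singleton trigram
```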
41Lots and lots of other language model research
- Endless ways to improve LMs
- Sentence-mixture models
- Skipping models
- Parsing-based models
- Decision-tree models
- Maximum entropy (single layer NN) models
42How to Build Language Models
- Language modeling is depressingly easy
- The best practical technique is to simply use a
bigram or trigram model
- Works well with tree-structured HMMs
- Almost all other techniques don't
- If you have correction information, also use a
cache
- If you have adaptation data (user-specific),
interpolate it in (weighted average of
probabilities)
- Use count cutoffs or Stolcke pruning
43Speech recognizer with language model
- In theory, pick the word sequence maximizing
P(acoustics | words) × P(words)
- In practice, the language model is a better predictor
-- acoustic probabilities aren't real
probabilities, so the language model score is weighted
- In practice, also penalize insertions (scoring sketched below)
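A sketch of how the combination usually looks in log space; the language-model weight and insertion penalty are assumed values, tuned per system rather than taken from the talk:

```python
def hypothesis_score(ink_logprobs, lm_logprob, lm_weight=8.0, insertion_penalty=0.5):
    """Combine per-word ink/acoustic log scores with a weighted language model
    log score and a per-word insertion penalty (illustrative values)."""
    num_words = len(ink_logprobs)
    return sum(ink_logprobs) + lm_weight * lm_logprob - insertion_penalty * num_words

# Hypothetical comparison of two recognition hypotheses:
print(hypothesis_score([-2.1, -3.0, -1.5], lm_logprob=-4.0))
print(hypothesis_score([-2.0, -3.2, -1.4], lm_logprob=-7.5))
```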
44Tools
- SRI Language Modeling Toolkit
- http://www.speech.sri.com/projects/srilm/
- Free for non-profit use
- Can handle clusters, lattices, n-best lists,
hidden tags
- CMU Language Modeling Toolkit
- Can handle bigrams, trigrams, more
- Can handle different smoothing schemes
- Many separate tools: the output of one tool is the
input to the next; easy to use
- Free for research purposes
- http://svr-www.eng.cam.ac.uk/prc14/toolkit.html
45Synergies between Handwriting and Speech/NLP
- Language Modeling is only one of the similarities
between handwriting and speech
- The two communities have lots of useful ideas
that they could learn from each other
- Things for handwriting people to teach speech
people
- Other useful ideas in speech recognition
46From Handwriting to Speech and NLP
- Neural Nets for Language Modeling
- Neural Probabilistic Language Model
- Work by Yoshua Bengio et al. (handwriting
researcher)
- Like a convolutional neural net applied to
language modeling
- Some of the best and most exciting recent results
in language modeling
- Gradient-based learning with multi-layer Neural
Nets (LeCun et al., 98)
- Similar to CRFs, which are increasingly popular
for NLP, but more sophisticated (CRFs are
basically the single-layer version.)
- How to use SVMs and NNs to recognize sequence
information
- Probably much more; this is just stuff I noticed
as I prepared
47From Speech to Handwriting: Finite State
Transducers
- More powerful than HMMs
- Can encode n-gram models efficiently (like HMMs)
- Can encode simple grammars like spelled-out
numbers ("One thousand two hundred and eighteen")
or addresses.
- Can encode error models, e.g. spelling errors.
- Some toolkits can convert a transducer to a large
HMM, then optimize/compress it.
- Can encode context-dependencies efficiently
(e.g., the first state of "a" is different when
preceded by a space than when preceded by "d"
than when preceded by "o".)
- Lots of toolkits available
- AT&T Finite State Machine Library
- MIT Finite-State Transducer Toolkit
- SFST, the Stuttgart Finite State Transducer Tools
48From Speech to Handwriting
- ROVER
- Technique for combining the final output of
multiple recognizers, to get overall better
results
- HMM expertise
- Endless tricks used in the speech world to improve
search: sophisticated thresholding, lattice
processing, multiple-pass techniques, tree
structuring, etc.
- Decision-Tree Senone Models
- Ways to model context dependencies of phonemes
(letters) without getting too much data sparsity.
49More Resources
- Joshua's web page: www.research.microsoft.com/joshuago
- These slides
- Slides for a 3-hour language modeling tutorial
- Smoothing paper with Stan Chen: good introduction
to smoothing and lots of details too.
- "A Bit of Progress in Language Modeling":
comprehensive overview and experimental
comparison of LM techniques; best overall LM
results.
- Papers on fuzzy keyboard, language model
compression, more.
- Appendix to this talk
50Conclusion
- Handwriting has special challenges
- Difficulty of getting training data for some
applications
- Difficulty integrating some model types (e.g.
SVMs)
- Language Modeling has been very helpful for every
other kind of natural language input technique
- Can reduce the error rate by 1/3 or ½
- Results for handwriting show similar improvements
- Potential of language modeling is not just for
note-taking
- Addresses
- Check-writing
- Any application where the distribution is not
uniform
- May not be easy, but it will be worth it.
51- If you can read this slide, you're too close
53More Resources
- Joshua's web page: www.research.microsoft.com/joshuago
- Smoothing technical report: good introduction to
smoothing and lots of details too.
- "A Bit of Progress in Language Modeling", which
is the journal version of much of this talk.
- Papers on fuzzy keyboard, language model
compression, and maximum entropy.
54More Resources
- Eugene Charniak's web page: http://www.cs.brown.edu/people/ec/home.html
- Papers on statistical parsing for its own sake
and for language modeling, as well as using
language modeling to measure contextual
influence.
- Pointers to software for statistical parsing as
well as statistical parsers optimized for
language modeling
55More Resources: Books
- Books (all are OK, none focus on language models)
- Statistical Language Learning by Eugene Charniak
- Speech and Language Processing by Dan Jurafsky
and Jim Martin (especially Chapter 6)
- Foundations of Statistical Natural Language
Processing by Chris Manning and Hinrich Schütze
- Statistical Methods for Speech Recognition by
Frederick Jelinek
56More Resources
- Sentence Mixture Models (also, caching)
- Rukmini Iyer, EE Ph.D. Thesis, 1998, "Improving
and predicting performance of statistical
language models in sparse domains"
- Rukmini Iyer and Mari Ostendorf. Modeling long
distance dependence in language: Topic mixtures
versus dynamic cache models. IEEE Transactions
on Acoustics, Speech and Audio Processing,
7:30--39, January 1999.
- Caching: the above, plus
- R. Kuhn. Speech recognition and the frequency of
recently used words: A modified Markov model for
natural language. In 12th International
Conference on Computational Linguistics, pages
348--350, Budapest, August 1988.
- R. Kuhn and R. D. Mori. A cache-based natural
language model for speech recognition. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 12(6):570--583, 1990.
- R. Kuhn and R. D. Mori. Correction to "A
cache-based natural language model for speech
recognition." IEEE Transactions on Pattern
Analysis and Machine
Intelligence, 14(6):691--692, 1992.
57More Resources Clustering
- The seminal reference
- P. F. Brown, V. J. DellaPietra, P. V. deSouza, J.
C. Lai, and R. L. Mercer. Class-based n-gram
models of natural language. Computational
Linguistics, 18(4):467--479, December 1992.
- Two-sided clustering
- H. Yamamoto and Y. Sagisaka. Multi-class
composite n-gram based on connection direction.
In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal
Processing, Phoenix, Arizona, May 1999.
- Fast clustering
- D. R. Cutting, D. R. Karger, J. R. Pedersen, and
J. W. Tukey. Scatter/Gather: A cluster-based
approach to browsing large document collections.
In SIGIR 92, 1992.
- Other
- R. Kneser and H. Ney. Improved clustering
techniques for class-based statistical language
modeling. In Eurospeech 93, volume 2, pages
973--976, 1993.
58More Resources
- Structured Language Models
- Eugene's web page
- Ciprian Chelba's web page
- http://www.clsp.jhu.edu/people/chelba/
- Maximum Entropy
- Roni Rosenfeld's home page and thesis
- http://www.cs.cmu.edu/roni/
- Joshua's web page
- Stolcke Pruning
- A. Stolcke (1998), Entropy-based pruning of
backoff language models. Proc. DARPA Broadcast
News Transcription and Understanding Workshop,
pp. 270-274, Lansdowne, VA. NOTE: get the corrected
version from http://www.speech.sri.com/people/stolcke
59More Resources Skipping
- Skipping
- X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang,
K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech
recognition system: An overview. Computer,
Speech, and Language, 2:137--148, 1993.
- Lots of stuff
- S. Martin, C. Hamacher, J. Liermann, F. Wessel,
and H. Ney. Assessment of smoothing methods and
complex stochastic language modeling. In 6th
European Conference on Speech Communication and
Technology, volume 5, pages 1939--1942, Budapest,
Hungary, September 1999.
- H. Ney, U. Essen, and R. Kneser. On structuring
probabilistic dependences in stochastic language
modeling. Computer, Speech, and Language,
8:1--38, 1994.
60- If you can read this slide, you're too close
61Fuzzy Keyboard
- Very small: users can type on a key boundary, or
hit the wrong key easily
- A soft keyboard is an image of a keyboard, e.g. on
a Palm Pilot or Pocket PC.
62Fuzzy Keyboard: Language Model and Pen Positions
- Math: Language Model times Pen Position model
- For pen-down positions, collect data and
compute a simple Gaussian distribution. (Sketched below.)
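A rough sketch of that combination; the key positions, the Gaussian spread, the candidate words, and the language-model scores are all invented for illustration:

```python
# Hypothetical key centers on a soft keyboard, in pixels, and an assumed spread.
KEY_CENTERS = {"c": (20, 40), "v": (30, 40), "a": (5, 30), "s": (10, 30)}
SIGMA = 10.0

def log_gaussian(tap, key):
    """Log P(pen-down position | intended key), isotropic Gaussian (up to a constant)."""
    (x, y), (cx, cy) = tap, KEY_CENTERS[key]
    return -((x - cx) ** 2 + (y - cy) ** 2) / (2 * SIGMA ** 2)

def best_word(taps, candidates, lm_logprob):
    """Pick the word maximizing log P(taps | word) + log P(word)."""
    def score(word):
        return sum(log_gaussian(t, ch) for t, ch in zip(taps, word)) + lm_logprob(word)
    return max(candidates, key=score)

def lm_logprob(word):
    """Made-up language model log-probabilities for the candidate words."""
    return {"cas": -2.0}.get(word, -6.0)

taps = [(22, 41), (7, 29), (28, 38)]   # noisy pen-down positions
print(best_word(taps, ["cav", "sas", "cas"], lm_logprob))
```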
63Fuzzy Keyboard: Results
- 40% fewer errors, same speed.