Title: Language Models for Handwriting
1Language Models for Handwriting
- Joshua Goodman
- Microsoft Research
- Machine Learning and Applied Statistics Group
- http://www.research.microsoft.com/joshuago
2What does this say?
3What does this say?
4What does this say?
5What does this say?
6Without context, very hard to read, if it's even
possible
7Context is Key
- Very hard to read any individual part of this
without context.
- But the overall string has only one reasonably
likely interpretation.
- Language model tells you which strings are
likely, which ones are not.
8Overview
- What are language models
- (2 Slides of Boring Preliminaries)
- Who uses language models
- Every natural language input technique
- Even a few handwriting systems
- Reduce errors by about 40% (relative)
- Why language modeling is hard for handwriting
- And what to do about it
- How to build language models
- Techniques, tools, etc.
- The future
- Handwriting researchers using language models for
almost all applications
- And contributing back to the speech and language
modeling communities
9What is a language model?
- Gives a probability distribution over text
strings (characters/words/phrases/documents)
- May be easier to model
- P(ink | word sequence) than
- P(word sequence | ink)
- Can train language model on much more/different
data than the ink model
10Language Model Evaluation: Entropy and Perplexity
- Language model quality is typically
measured with entropy or perplexity
- Entropy is just the number of bits required to encode
the test data (should be called the cross entropy of the
model on the test data)
- Perplexity is 2^entropy (a small sketch of both follows below)
- Lower is better for both
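A minimal sketch of the two measures (the per-word probabilities below are made up, not from the talk):

```python
import math

def cross_entropy(word_probs):
    """Cross-entropy in bits per word: the average negative log2 probability
    the model assigns to each word of the test data."""
    return -sum(math.log2(p) for p in word_probs) / len(word_probs)

# Hypothetical per-word probabilities a model assigns to a short test text.
test_word_probs = [0.1, 0.02, 0.3, 0.05]

h = cross_entropy(test_word_probs)   # entropy, in bits per word
perplexity = 2 ** h                  # perplexity = 2^entropy
print(f"cross-entropy = {h:.2f} bits/word, perplexity = {perplexity:.1f}")
```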
11Who uses language models
- Almost every natural language input technique
uses language models
- Speech recognition
- Typing in Japanese and Chinese
- Machine Translation
- A whole bunch of others
- Handwriting recognition
- Simple dictionaries with some kind of frequency
information: everyone who deals with words
- Bigram and trigram models: about 12 papers, all
with good results
12Error-rate correlates with entropy (for speech
recognition)
13Pinyin Conversion
- How to enter Chinese text
- Type phonetically (pinyin)
- Many characters have same sound
- Find the character sequence maximizing
P(pinyin | characters) × P(characters) (toy sketch below)
- P(pinyin | characters) is 1 if correct pinyin, 0 otherwise
- P(characters) is the Language Model
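A toy sketch of that search; the pinyin-to-character dictionary and the language-model scores are invented stand-ins, and real systems score whole sentences rather than single syllables:

```python
# Hypothetical homophones for one pinyin syllable, and made-up LM log-probs.
HOMOPHONES = {"ma": ["马", "妈", "吗"]}

def lm_logprob(chars):
    """Stand-in language model score for a candidate character string."""
    return {"妈": -1.0, "马": -2.0, "吗": -3.0}.get(chars, -10.0)

def convert(pinyin):
    """Pick the characters maximizing P(pinyin | chars) * P(chars), where
    P(pinyin | chars) is 1 for correct readings and 0 otherwise."""
    candidates = HOMOPHONES.get(pinyin, [])   # only the characters with that reading
    return max(candidates, key=lm_logprob) if candidates else None

print(convert("ma"))
```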
14Machine Translation
- Let f be a French sentence we wish to translate
into English.
- Let e be a possible English translation.
- Pick the e that maximizes P(f | e) × P(e)
- P(f | e) is the Translation Model; P(e) is the Language Model
Thanks to Eugene Charniak for this slide
15Machine Translation Error rate versus Entropy
(From Franz Och)
These are old results; the newest results from Franz
Och are trained on 200 BILLION words of data and get
even better results.
16Other Language Model Uses
- Information retrieval
- P(query | document) × P(document)
- Telephone Keypad input
- P(numbers | words) × P(words)
- Soft Keyboard input
- P(pen-down positions | words) × P(words)
- Spelling Correction
- P(observed keys | words) × P(words)
- In each case, P(words) (or P(document)) is the Language Model
17Language Models in Handwriting Systems
- Kinds of language models (unigram, bigram,
trigram)
- Results from handwriting papers
- Character-based vs. word based
- Some of the problems of using LMs for handwriting
18What kind of language model to use
- P(a b c ... q r s)
- = P(a) P(b | a) P(c | a b) ...
- ... P(s | a b c ... q r)
- How can we compute P(s | a b c ... q r)?
- Too hard, so approximate it by one of the following
(count-based estimates are sketched below)
- P(s | q r)  (Trigram)
- P(s | r)  (Bigram)
- P(s)  (Unigram)
- P(s) ≈ 1/vocabulary size  (Uniform)
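An illustrative sketch (not from the talk) of the count-based, maximum-likelihood versions of these estimates; the smoothing slides later deal with the zero-count problem:

```python
from collections import Counter

def ngram_counts(words, n):
    """Count all n-grams in a token sequence."""
    return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

corpus = "the cat sat on the mat the cat ran".split()
uni, bi, tri = (ngram_counts(corpus, n) for n in (1, 2, 3))

def p_unigram(w):
    return uni[(w,)] / sum(uni.values())

def p_bigram(w, prev):
    return bi[(prev, w)] / uni[(prev,)] if uni[(prev,)] else 0.0

def p_trigram(w, prev2, prev1):
    hist = bi[(prev2, prev1)]
    return tri[(prev2, prev1, w)] / hist if hist else 0.0

# P(sat | the cat), P(cat | the), P(the)
print(p_trigram("sat", "the", "cat"), p_bigram("cat", "the"), p_unigram("the"))
```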
19Language modeling for handwriting recognition
- Highlights from 12 papers that I can find
- Hybrid neuro-Markovian (Marukatat et al., 01)
- Error rate drops from 30% to 18% by using a
bigram.
- Quiniou et al., 05: 18% to 10% (bigram and
trigram the same)
- Perraud et al., 03
- From 34% (uniform) to 29% (unigram) to 22.5%
error rates
- Vinciarelli et al., 04
- Tried different test sets, with semi-realistic
mismatch between training and test.
- Results highly dependent on match between
training/test
- Unigram always much better than uniform (no
probabilities) (typically 50% more accurate)
- Typically marginal gains from trigram
- Sometimes large gains from bigram (up to 20%
relative) when the match was good
20Pretty good Bibliography of LMs for handwriting
- Using a Statistical Language Model to Improve the Performance of an HMM-Based Cursive Handwriting System, Marti et al., IJPRAI 2001
- On the Influence of Vocabulary Size and Language Models in Unconstrained Handwritten Text Recognition, Marti et al., ICDAR 2001
- N-gram and N-Class Models for On-line Handwriting Recognition, Perraud et al., ICDAR 2003
- Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models, Vinciarelli et al., IEEE PAMI 2004
- N-Gram Language Models for Offline Handwritten Text Recognition, Zimmerman et al., IWFHR 2004
- Stability Measure of Entropy Estimate and Its Application to Language Model Evaluation, Kim, J., Ryu, S., and Kim, J.H., IWFHR 2004
- An Empirical Study of Statistical Language Models for Contextual Post-processing of Chinese Script Recognition, Li, Y.-X. and Tan, C.L., IWFHR 2004
- Statistical Language Models for On-line Handwritten Sentence Recognition, Quiniou et al., ICDAR 2005
- A Data Structure Using Hashing and Tries for Efficient Chinese Lexical Access, Y.-K. Lam and Q. Huo, ICDAR 2005
- Document Understanding System Using Stochastic Context-Free Grammars, J. Handley, A. Namboodiri, and R. Zanibbi, ICDAR 2005
- Multiple Handwritten Text Line Recognition Systems Derived from Specific Integration of a Language Model, R. Bertolami and H. Bunke, ICDAR 2005
- A Priori and A Posteriori Integration and Combination of Language Models in an On-line Handwritten Sentence Recognition System, Solen Quiniou and Eric Anquetil, IWFHR 2006
21Character-based or Word-based
- Model letter sequences instead of word sequences
- Example: P(letter | letter-3 letter-2 letter-1)
- Typically, use a higher-order n-gram (e.g. 6)
- This is good if you want to model
out-of-vocabulary words, proper names,
misspellings, digit sequences, etc.
22Combining character-based and word-based models
- Can model both in-vocabulary and
out-of-vocabulary words with language models, at
the same time.
- Explicitly model
- P(out-of-vocabulary | previous-word)
- P(out-of-vocabulary | "Mr.")
- Model out-of-vocabulary using character model.
23Digit Sequences
- Might think that a language model can't help with
digit sequences (e.g. street numbers, dollar
amounts)
- Numbers are much more likely to start with 1
(about 30%)
- Dollar amounts are much more likely to end in .00,
.99, .95
The entropy of first digits is about 10% lower
than uniform entropy. 10% lower error rate on
first digits? (A small entropy calculation is sketched below.)
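A small sketch of the entropy comparison, assuming a Benford-like first-digit distribution (the distribution is an assumption for illustration, not data from the talk):

```python
import math

def entropy(dist):
    """Entropy in bits of a discrete distribution given as a list of probabilities."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Benford-style first-digit probabilities for digits 1..9 (assumed for illustration).
benford = [math.log10(1 + 1 / d) for d in range(1, 10)]
uniform = [1 / 9] * 9

print(f"uniform first digit: {entropy(uniform):.2f} bits")
print(f"skewed first digit:  {entropy(benford):.2f} bits")
```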
24Why it's hard to use language models for
handwriting
- Model incompatibilities
- Lack of training data
25Model Incompatibilities: Toy Example
- Toy example of a handwriting system: segment into
words, then segment words into letters, then
recognize letters
- Where do we put the language modeling
probabilities? How do we integrate them into the
segmentation and recognition process?
26A slightly more realistic example: Produce a
lattice of scores/results
[Lattice figure: alternative letter hypotheses for each ink segment]
- For each segment, train a recognizer (NN, SVM,
etc.) to recognize the letter in the segment.
Roughly, you learn P(letter | ink)
- What happens when you multiply by the language model?
- P(letter | ink) × P(letter | previous letters)
- This is not a meaningful probability!
- Will it work in practice? Maybe, maybe not.
27Some handwriting systems work very well with LMs
- HMM-based approaches integrate very well with a
language model.
- Hybrid models (Neural Net with HMM, e.g. Marukatat
01) also work well.
- Use a neural net to predict P(ink | state, previous
ink)
- Discriminatively trained neural nets can work
very well, if trained at the sentence level, with
the LM integrated at training time.
- Gradient-Based Learning Applied to Document
Recognition (LeCun et al., 98)
- How to integrate SVMs with an LM? Not clear.
- Often the SVM score is not a probability P(letter | ink)
28Lots of training data, or none
- Usually, can train on millions, hundreds of
millions, or billions of words of data
- Easy to find text that looks like addresses.
- Hard to find text that looks like meeting notes
- Potential gain from language models likely to be
very application specific
29Training Data is key
30Quick Overview of LM Research
- Trigram models are obviously brain-damaged; lots
of improvements:
- Smoothing
- Caching
- Clustering
- Neural methods
- Other stuff
31Smoothing: None
- Lowest perplexity trigram on training data.
- Terrible on test data: if there are no occurrences of
a trigram xyz (C(xyz) = 0), the probability is 0.
- Makes it impossible to recognize the word
32Smoothing: Key Ideas
- Want some way to combine trigram probabilities,
bigram probabilities, and unigram probabilities
- Use trigram/bigram for accurate modeling of
context
- Use unigram to get the best guess you can when data
is sparse.
- Lots of different techniques
- Simple interpolation (sketched below)
- Absolute Discounting
- Katz Smoothing (Good-Turing)
- Interpolated Kneser-Ney Smoothing
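A minimal sketch of the simplest of these techniques, simple (linear) interpolation; the weights here are made up, and in practice they are tuned on held-out data:

```python
def interpolated_prob(w, prev2, prev1, p_tri, p_bi, p_uni,
                      lambdas=(0.6, 0.3, 0.1)):
    """Linear interpolation of trigram, bigram, and unigram estimates.
    p_tri, p_bi, p_uni are functions returning maximum-likelihood estimates;
    the lambdas must sum to 1 so the result is still a distribution."""
    l3, l2, l1 = lambdas
    return (l3 * p_tri(w, prev2, prev1) +
            l2 * p_bi(w, prev1) +
            l1 * p_uni(w))
```

Used with the count-based estimators sketched earlier, this never assigns zero probability to a word seen in training, even when the trigram count is zero.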
33Smoothing: Interpolated Kneser-Ney
- Simple-to-implement smoothing technique.
- Consistently the best performing across a very
wide range of conditions.
- See the appendix of "A Bit of Progress in Language
Modeling" for pseudo-code (a rough bigram-level sketch follows below)
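This is not the pseudo-code from that appendix; it is a rough sketch of interpolated Kneser-Ney at the bigram level, with an assumed fixed discount (around 0.75 is a common choice):

```python
from collections import Counter

def build_kn_bigram(words, discount=0.75):
    """Interpolated Kneser-Ney for bigrams (rough sketch).
    P(w | v) = max(c(v,w) - D, 0)/c(v) + lambda(v) * Pcont(w),
    where Pcont(w) is proportional to the number of distinct left contexts of w."""
    bigrams = Counter(zip(words, words[1:]))
    unigrams = Counter(words)
    continuation = Counter(w for (_, w) in bigrams)   # distinct left contexts per word
    followers = Counter(v for (v, _) in bigrams)      # distinct continuations per context
    total_bigram_types = len(bigrams)

    def prob(w, v):
        c_v = unigrams[v]
        if c_v == 0:
            return continuation[w] / total_bigram_types
        discounted = max(bigrams[(v, w)] - discount, 0) / c_v
        backoff_weight = discount * followers[v] / c_v
        return discounted + backoff_weight * continuation[w] / total_bigram_types

    return prob

p = build_kn_bigram("the cat sat on the mat the cat ran".split())
print(p("cat", "the"), p("the", "cat"))   # P(cat | the), P(the | cat)
```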
34Caching
- If you say something, you are likely to say it
again later.
- Interpolate the trigram model with a cache (sketched below)
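A minimal sketch of the idea; the cache weight is an assumption and would be tuned in practice:

```python
from collections import Counter

class CachedLM:
    """Interpolate a static trigram model with a unigram cache of the
    user's recent words (sketch; the cache weight is illustrative)."""
    def __init__(self, trigram_prob, cache_weight=0.1):
        self.trigram_prob = trigram_prob   # function: (w, prev2, prev1) -> prob
        self.cache = Counter()
        self.cache_weight = cache_weight

    def observe(self, word):
        self.cache[word] += 1              # remember what the user just wrote

    def prob(self, w, prev2, prev1):
        total = sum(self.cache.values())
        p_cache = self.cache[w] / total if total else 0.0
        lam = self.cache_weight if total else 0.0
        return (1 - lam) * self.trigram_prob(w, prev2, prev1) + lam * p_cache
```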
35Caching: Real Life
- Someone writes "The white house"
- System recognizes "The white horse"
- Cache remembers!
- Person writes "The whole house", and, with the cache,
the system recognizes "The whole horse": errors
are locked in.
- Caching works well when users correct as they go;
it works poorly or even hurts without correction.
36Cache Results
37Neural Probabilistic Language Models (Bengio et
al., 2000)
- Multi-layer neural network
- Similar to a convolutional neural net applied to
language modeling
- Largest improvement reported for any single
technique
- Relatively slow and complex to build and apply,
but see ongoing research
- (Combination techniques, e.g. the "Bit of Progress"
paper, have slightly better overall results.)
38Clustering
- CLUSTERING = CLASSES (same thing)
- What is P(Tuesday | party on)?
- Similar to P(Monday | party on)
- Similar to P(Tuesday | celebration on)
- Put words in clusters
- WEEKDAY = Sunday, Monday, Tuesday, ...
- EVENT = party, celebration, birthday, ...
- P(Tuesday | party on) ≈ P(Tuesday | WEEKDAY) ×
P(WEEKDAY | EVENT PREPOSITION)  (sketched below)
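A toy sketch of the cluster-based estimate; the cluster assignments and the two component probabilities are invented for illustration:

```python
# Hypothetical word-to-cluster assignments.
cluster_of = {"Monday": "WEEKDAY", "Tuesday": "WEEKDAY",
              "party": "EVENT", "celebration": "EVENT", "on": "PREPOSITION"}

def cluster_prob(word, history, p_word_given_cluster, p_cluster_given_history):
    """P(word | history) ~= P(cluster(word) | clusters(history)) * P(word | cluster(word))."""
    cluster_history = tuple(cluster_of.get(w, w) for w in history)
    c = cluster_of.get(word, word)
    return p_cluster_given_history(c, cluster_history) * p_word_given_cluster(word, c)

# Example with made-up component estimates:
p = cluster_prob(
    "Tuesday", ("party", "on"),
    p_word_given_cluster=lambda w, c: 1 / 7,    # e.g. each weekday equally likely
    p_cluster_given_history=lambda c, h: 0.4,   # assumed P(WEEKDAY | EVENT PREPOSITION)
)
print(p)  # 0.4 * 1/7
```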
39Cluster Results
40Language Model Compression
- Use handwriting on devices that are too small for
a keyboard
- Will also have memory constraints
- Same reasoning/problems for speech recognition
- Use count cutoffs: discard all n-grams that
don't occur at least k times (sketched below)
- Use Stolcke (1998) entropy-based pruning
- Works better than cutoffs
- Use clustering techniques (Goodman and Gao, 2000)
- Smallest models
- Harder to implement
- Interacts poorly with the tree representation in an
HMM
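A minimal sketch of the count-cutoff idea; the cutoff k is an assumption, and Stolcke pruning instead drops the n-grams whose removal changes the model's entropy least:

```python
from collections import Counter

def apply_count_cutoff(ngram_counts, k=2):
    """Keep only n-grams observed at least k times; everything else is
    handled by backing off to lower-order estimates, shrinking the model."""
    return {ng: c for ng, c in ngram_counts.items() if c >= k}

trigram_counts = Counter({("the", "cat", "sat"): 3, ("cat", "sat", "on"): 1})
print(apply_count_cutoff(trigram_counts, k=2))  # drops the singleton trigram
```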
41Lots and lots of other language model research
- Endless ways to improve LMs
- Sentence-mixture models
- Skipping models
- Parsing-based models
- Decision-tree models
- Maximum entropy (single layer NN) models
42How to Build Language Models
- Language modeling is depressingly easy
- The best practical technique is to simply use a
bigram or trigram model
- Works well with tree-structured HMMs
- Almost all other techniques don't
- If you have correction information, also use a
cache
- If you have adaptation data (user-specific),
interpolate it in (weighted average of
probabilities)
- Use count cutoffs or Stolcke pruning
43Speech recognizer with language model
- In theory, pick the word sequence maximizing
P(acoustics | words) × P(words)
- In practice, the language model is a better predictor
-- acoustic probabilities aren't real
probabilities, so the language model score is weighted
- In practice, also penalize insertions (scoring sketched below)
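A sketch of how the combination usually looks in log space; the language-model weight and insertion penalty are assumed values, tuned per system rather than taken from the talk:

```python
def hypothesis_score(ink_logprobs, lm_logprob, lm_weight=8.0, insertion_penalty=0.5):
    """Combine per-word ink/acoustic log scores with a weighted language model
    log score and a per-word insertion penalty (illustrative values)."""
    num_words = len(ink_logprobs)
    return sum(ink_logprobs) + lm_weight * lm_logprob - insertion_penalty * num_words

# Hypothetical comparison of two recognition hypotheses:
print(hypothesis_score([-2.1, -3.0, -1.5], lm_logprob=-4.0))
print(hypothesis_score([-2.0, -3.2, -1.4], lm_logprob=-7.5))
```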
44Tools
- SRI Language Modeling Toolkit
- http://www.speech.sri.com/projects/srilm/
- Free for non-profit use
- Can handle clusters, lattices, n-best lists,
hidden tags
- CMU Language Modeling Toolkit
- Can handle bigrams, trigrams, more
- Can handle different smoothing schemes
- Many separate tools: the output of one tool is the
input to the next; easy to use
- Free for research purposes
- http://svr-www.eng.cam.ac.uk/prc14/toolkit.html
45Synergies between Handwriting and Speech/NLP
- Language Modeling is only one of the similarities
between handwriting and speech
- The two communities have lots of useful ideas
that they could learn from each other
- Things for handwriting people to teach speech
people
- Other useful ideas in speech recognition
46From Handwriting to Speech and NLP
- Neural Nets for Language Modeling
- Neural Probabilistic Language Model
- Work by Yoshua Bengio et al. (handwriting
researcher)
- Like a convolutional neural net applied to
language modeling
- Some of the best and most exciting recent results
in language modeling
- Gradient-based learning with multi-layer Neural
Nets (LeCun et al., 98)
- Similar to CRFs, which are increasingly popular
for NLP, but more sophisticated (CRFs are
basically the single-layer version.)
- How to use SVMs and NNs to recognize sequence
information
- Probably much more; this is just stuff I noticed
as I prepared
47From Speech to Handwriting: Finite State
Transducers
- More powerful than HMMs
- Can encode n-gram models efficiently (like HMMs)
- Can encode simple grammars like spelled-out
numbers ("One thousand two hundred and eighteen")
or addresses.
- Can encode error models, e.g. spelling errors.
- Some toolkits can convert a transducer to a large
HMM, then optimize/compress it.
- Can encode context-dependencies efficiently
(e.g., the first state of "a" is different when
preceded by a space than when preceded by "d"
than when preceded by "o".)
- Lots of toolkits available
- AT&T Finite State Machine Library
- MIT Finite-State Transducer Toolkit
- SFST, the Stuttgart Finite State Transducer Tools
48From Speech to Handwriting
- ROVER
- Technique for combining the final output of
multiple recognizers, to get overall better
results
- HMM expertise
- Endless tricks used in the speech world to improve
search: sophisticated thresholding, lattice
processing, multiple-pass techniques, tree
structuring, etc.
- Decision-Tree Senone Models
- Ways to model context dependencies of phonemes
(letters) without getting too much data sparsity.
49More Resources
- Joshua's web page: www.research.microsoft.com/joshuago
- These slides
- Slides for a 3-hour language modeling tutorial
- Smoothing paper with Stan Chen: good introduction
to smoothing and lots of details too.
- "A Bit of Progress in Language Modeling":
comprehensive overview and experimental
comparison of LM techniques; best overall LM
results.
- Papers on fuzzy keyboard, language model
compression, more.
- Appendix to this talk
50Conclusion
- Handwriting has special challenges
- Difficulty of getting training data for some
applications
- Difficulty integrating some model types (e.g.
SVMs)
- Language Modeling has been very helpful for every
other kind of natural language input technique
- Can reduce the error rate by 1/3 or ½
- Results for handwriting show similar improvements
- Potential of language modeling is not just for
note-taking
- Addresses
- Check-writing
- Any application where the distribution is not
uniform
- May not be easy, but it will be worth it.
51- If you can read this slide, you're too close
53More Resources
- Joshua's web page: www.research.microsoft.com/joshuago
- Smoothing technical report: good introduction to
smoothing and lots of details too.
- "A Bit of Progress in Language Modeling", which
is the journal version of much of this talk.
- Papers on fuzzy keyboard, language model
compression, and maximum entropy.
54More Resources
- Eugene Charniak's web page: http://www.cs.brown.edu/people/ec/home.html
- Papers on statistical parsing for its own sake
and for language modeling, as well as using
language modeling to measure contextual
influence.
- Pointers to software for statistical parsing as
well as statistical parsers optimized for
language modeling
55More Resources: Books
- Books (all are OK, none focus on language models)
- Statistical Language Learning by Eugene Charniak
- Speech and Language Processing by Dan Jurafsky
and Jim Martin (especially Chapter 6)
- Foundations of Statistical Natural Language
Processing by Chris Manning and Hinrich Schütze
- Statistical Methods for Speech Recognition by
Frederick Jelinek
56More Resources
- Sentence Mixture Models (also, caching)
- Rukmini Iyer, EE Ph.D. Thesis, 1998, "Improving
and predicting performance of statistical
language models in sparse domains"
- Rukmini Iyer and Mari Ostendorf. Modeling long
distance dependence in language: Topic mixtures
versus dynamic cache models. IEEE Transactions
on Acoustics, Speech and Audio Processing,
7:30--39, January 1999.
- Caching: the above, plus
- R. Kuhn. Speech recognition and the frequency of
recently used words: A modified Markov model for
natural language. In 12th International
Conference on Computational Linguistics, pages
348--350, Budapest, August 1988.
- R. Kuhn and R. D. Mori. A cache-based natural
language model for speech recognition. IEEE
Transactions on Pattern Analysis and Machine
Intelligence, 12(6):570--583, 1990.
- R. Kuhn and R. D. Mori. Correction to "A
cache-based natural language model for speech
recognition." IEEE Transactions on Pattern
Analysis and Machine
Intelligence, 14(6):691--692, 1992.
57More Resources Clustering
- The seminal reference
- P. F. Brown, V. J. DellaPietra, P. V. deSouza, J.
C. Lai, and R. L. Mercer. Class-based n-gram
models of natural language. Computational
Linguistics, 18(4):467--479, December 1992.
- Two-sided clustering
- H. Yamamoto and Y. Sagisaka. Multi-class
composite n-gram based on connection direction.
In Proceedings of the IEEE International
Conference on Acoustics, Speech and Signal
Processing, Phoenix, Arizona, May 1999.
- Fast clustering
- D. R. Cutting, D. R. Karger, J. R. Pedersen, and
J. W. Tukey. Scatter/Gather: A cluster-based
approach to browsing large document collections.
In SIGIR 92, 1992.
- Other
- R. Kneser and H. Ney. Improved clustering
techniques for class-based statistical language
modeling. In Eurospeech 93, volume 2, pages
973--976, 1993.
58More Resources
- Structured Language Models
- Eugene's web page
- Ciprian Chelba's web page
- http://www.clsp.jhu.edu/people/chelba/
- Maximum Entropy
- Roni Rosenfeld's home page and thesis
- http://www.cs.cmu.edu/roni/
- Joshua's web page
- Stolcke Pruning
- A. Stolcke (1998), Entropy-based pruning of
backoff language models. Proc. DARPA Broadcast
News Transcription and Understanding Workshop,
pp. 270-274, Lansdowne, VA. NOTE: get the corrected
version from http://www.speech.sri.com/people/stolcke
59More Resources Skipping
- Skipping
- X. Huang, F. Alleva, H.-W. Hon, M.-Y. Hwang,
K.-F. Lee, and R. Rosenfeld. The SPHINX-II speech
recognition system: An overview. Computer,
Speech, and Language, 2:137--148, 1993.
- Lots of stuff
- S. Martin, C. Hamacher, J. Liermann, F. Wessel,
and H. Ney. Assessment of smoothing methods and
complex stochastic language modeling. In 6th
European Conference on Speech Communication and
Technology, volume 5, pages 1939--1942, Budapest,
Hungary, September 1999.
- H. Ney, U. Essen, and R. Kneser. On structuring
probabilistic dependences in stochastic language
modeling. Computer, Speech, and Language,
8:1--38, 1994.
60- If you can read this slide, you're too close
61Fuzzy Keyboard
- Very small: users can type on a key boundary, or
hit the wrong key easily
- A soft keyboard is an image of a keyboard, e.g. on
a Palm Pilot or Pocket PC.
62Fuzzy Keyboard: Language Model and Pen Positions
- Math: Language Model times Pen Position model
- For pen-down positions, collect data and
compute a simple Gaussian distribution. (Sketched below.)
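A rough sketch of that combination; the key positions, the Gaussian spread, the candidate words, and the language-model scores are all invented for illustration:

```python
# Hypothetical key centers on a soft keyboard, in pixels, and an assumed spread.
KEY_CENTERS = {"c": (20, 40), "v": (30, 40), "a": (5, 30), "s": (10, 30)}
SIGMA = 10.0

def log_gaussian(tap, key):
    """Log P(pen-down position | intended key), isotropic Gaussian (up to a constant)."""
    (x, y), (cx, cy) = tap, KEY_CENTERS[key]
    return -((x - cx) ** 2 + (y - cy) ** 2) / (2 * SIGMA ** 2)

def best_word(taps, candidates, lm_logprob):
    """Pick the word maximizing log P(taps | word) + log P(word)."""
    def score(word):
        return sum(log_gaussian(t, ch) for t, ch in zip(taps, word)) + lm_logprob(word)
    return max(candidates, key=score)

def lm_logprob(word):
    """Made-up language model log-probabilities for the candidate words."""
    return {"cas": -2.0}.get(word, -6.0)

taps = [(22, 41), (7, 29), (28, 38)]   # noisy pen-down positions
print(best_word(taps, ["cav", "sas", "cas"], lm_logprob))
```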
63Fuzzy Keyboard: Results
- 40% fewer errors, same speed.