Title: Digit Recognition Using the SPEECHDAT Corpus
1Robust Recognition of Digits and Natural Numbers
Frederico Rodrigues and Isabel Trancoso
INESC/IST, 2000
2Summary
- Problem overview
- Baseline system
- Extensions to the baseline system
- Conclusions and future work
3The Problem
4Corpus Description
- Multilingual telephone speech corpus
- SPEECHDAT(M): 1000 speakers
- SPEECHDAT(II): 4000 speakers
- Orthographically transcribed, including noise events
5Noise events
- spk: speaker-related noises
- sta: stationary noises
- int: intermittent noises
7Train and Test Set Definition
- Selection procedure
- Age, gender and region distributions are approximately equal in both train and test sets
- SPEECHDAT(II)
- Fixed 500-speaker evaluation set
- Additional 300-speaker development set
- SPEECHDAT(M)
- 200-speaker evaluation set
- Overall ratio of 80% train / 20% test
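The selection procedure above can be sketched as a stratified split over speakers. This is only an illustration, assuming each speaker carries age, gender and region labels; the field names (`age_band`, `gender`, `region`), the synthetic metadata and the rounding rule are hypothetical and not taken from the slides.

```python
import random
from collections import defaultdict

def stratified_speaker_split(speakers, test_fraction=0.2, seed=0):
    """Split speaker ids so every (age, gender, region) stratum sends
    roughly `test_fraction` of its speakers to the test set."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for spk in speakers:
        key = (spk["age_band"], spk["gender"], spk["region"])
        strata[key].append(spk["id"])

    train, test = [], []
    for key in sorted(strata):
        ids = sorted(strata[key])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

# Hypothetical metadata for 1000 speakers (the SPEECHDAT(M) size).
speakers = [
    {
        "id": f"spk{i:04d}",
        "age_band": ("young", "adult")[i % 2],
        "gender": "FM"[(i // 2) % 2],
        "region": ("north", "south")[(i // 4) % 2],
    }
    for i in range(1000)
]
train_ids, test_ids = stratified_speaker_split(speakers)
print(len(train_ids), len(test_ids))
```

Splitting by speaker rather than by utterance keeps every speaker entirely in one set, which is what a fixed evaluation speaker list implies.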
8Sub-corpus Used
- I1 - Isolated digit strings
- B1 - Sequences of 10 digits
- N - Natural numbers
9Feature Extraction
- MFCC (Mel Frequency Cepstral Coefficients)
- 14 cepstra + 14 Δ cepstra + energy + Δ energy
- Speech signal band-limited between 200 and 3800 Hz
- Hamming window of 25 ms, every 10 ms
- Cepstral Mean Subtraction
- Simple but effective technique for channel and speaker normalization
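A minimal sketch of how cepstral mean subtraction and the delta terms combine into a 30-dimensional observation vector. It assumes the 14 cepstra and the log energy have already been computed per frame; the simple two-frame difference used for the deltas and the ordering of the columns are assumptions, since the slides do not give the delta formula.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of every cepstral coefficient."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def simple_delta(features):
    """Central difference (f[t+1] - f[t-1]) / 2, repeating the edge frames."""
    padded = np.pad(features, ((1, 1), (0, 0)), mode="edge")
    return (padded[2:] - padded[:-2]) / 2.0

def build_observation_vectors(cepstra, log_energy):
    """Stack [14 cepstra, energy, 14 delta cepstra, delta energy] per frame."""
    static = np.hstack([cepstral_mean_subtraction(cepstra), log_energy[:, None]])
    return np.hstack([static, simple_delta(static)])

# Synthetic utterance: 200 frames with a constant offset standing in for a channel bias.
rng = np.random.default_rng(0)
cepstra = rng.normal(size=(200, 14)) + 3.0
log_energy = rng.normal(size=200)
observations = build_observation_vectors(cepstra, log_energy)
print(observations.shape)
```

A fixed channel adds a constant to each cepstral coefficient, so removing the utterance mean cancels it; that is why the technique is described as simple but effective.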
10Acoustic Modeling
- Left-right continuous-density HMMs
- Word models for each digit, with no skips
- Silence and filler models with forward and backward skips
- Gender-dependent models
HMM: Hidden Markov Model
11Model Topology
Fillers and silence models topology
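The two topologies can be illustrated by their transition matrices. The figure for this slide is not part of the transcript, so the sketch below is an assumption: the number of states, the initial probabilities and the exact set of skip transitions (one state forward, two states forward, two states back) are hypothetical.

```python
import numpy as np

def left_right_no_skip(n_states, self_loop=0.6):
    """Digit-style topology: each state either repeats or moves to the next one."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    A[-1, -1] = 1.0  # the exit transition to the next model is left out of this sketch
    return A

def with_forward_backward_skips(n_states):
    """Filler/silence-style topology: self-loop, next state, a forward skip
    (i -> i+2) and a backward skip (i -> i-2), initialised uniformly."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states):
        targets = [j for j in (i, i + 1, i + 2, i - 2) if 0 <= j < n_states]
        A[i, targets] = 1.0 / len(targets)
    return A

digit_A = left_right_no_skip(5)            # 5 states is a hypothetical choice
filler_A = with_forward_backward_skips(3)  # 3 states is a hypothetical choice
print(digit_A)
print(filler_A)
```

Zeros in a transition matrix stay zero under Baum-Welch re-estimation, so the topology chosen at initialisation is the topology of the trained model.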
12Baseline System - Isolated Digits
- Choose isolated digits with no noise marks
- HMM parameters initialized with the global mean and variance of the training data
- Embedded Baum-Welch re-estimation
- Evaluate performance with Viterbi decoding
- Grammar allowing one digit with initial and final silence
- Grammar allowing one digit and any number of fillers or silence
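As an illustration of two of the steps above, the sketch below shows a flat start (every state receives the global mean and variance) and log-domain Viterbi decoding for a single HMM with one diagonal Gaussian per state. The toy data, the two-state model and all function names are hypothetical; the grammar network and the Baum-Welch step are not modelled here.

```python
import numpy as np

def flat_start(train_frames, n_states):
    """Give every state the global mean and variance of the training frames."""
    mean = train_frames.mean(axis=0)
    var = train_frames.var(axis=0)
    return np.tile(mean, (n_states, 1)), np.tile(var, (n_states, 1))

def log_gaussian(frames, means, variances):
    """Log-density of each frame under each state's diagonal Gaussian, shape (T, N)."""
    diff = frames[:, None, :] - means[None, :, :]
    terms = np.log(2 * np.pi * variances)[None] + diff ** 2 / variances[None]
    return -0.5 * terms.sum(axis=2)

def viterbi(log_b, log_A, log_pi):
    """Return the most likely state sequence and its log score."""
    T, N = log_b.shape
    delta = log_pi + log_b[0]
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A  # scores[i, j]: leave state i, enter state j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(delta.max())

# Toy one-dimensional example with an already trained two-state left-right model.
frames = np.array([[0.1], [-0.2], [4.9], [5.2]])
means = np.array([[0.0], [5.0]])
variances = np.ones((2, 1))
with np.errstate(divide="ignore"):  # log(0) = -inf marks forbidden transitions
    log_A = np.log(np.array([[0.5, 0.5], [0.0, 1.0]]))
    log_pi = np.log(np.array([1.0, 0.0]))
path, score = viterbi(log_gaussian(frames, means, variances), log_A, log_pi)
flat_means, flat_vars = flat_start(frames, n_states=2)
print(path, score)
```

After a flat start all states are identical, so it is the re-estimation passes that make the states specialise on different parts of each word.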
13Baseline System - Isolated Digits
14Baseline System - Isolated Digits
- Increment Gaussian mixtures per state up to 3 for the digit models
- Introduce files with noise marks
- Repeat re-estimation/evaluation process
- Increment Gaussian mixtures per state up to 3 for the filler and digit models
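The slides do not say how mixture components were added. One common heuristic, shown here purely as an assumption, is to split the heaviest component into two copies with slightly perturbed means and then re-estimate; the perturbation factor of 0.2 standard deviations and the function name are hypothetical choices.

```python
import numpy as np

def split_heaviest_component(weights, means, variances, perturb=0.2):
    """Add one mixture component by splitting the one with the largest weight."""
    k = int(np.argmax(weights))
    offset = perturb * np.sqrt(variances[k])
    new_weights = np.append(weights, weights[k] / 2.0)
    new_weights[k] = weights[k] / 2.0
    new_means = np.vstack([means, means[k] + offset])
    new_means[k] = means[k] - offset
    new_variances = np.vstack([variances, variances[k]])
    return new_weights, new_means, new_variances

# Start from a single Gaussian over 30-dimensional features and grow to 3 components.
weights = np.array([1.0])
means = np.zeros((1, 30))
variances = np.ones((1, 30))
while len(weights) < 3:
    weights, means, variances = split_heaviest_component(weights, means, variances)
    # In a real training loop, Baum-Welch re-estimation would run here before the next split.
print(weights, means.shape)
```

Growing the mixtures one step at a time, with re-estimation in between, gives each new component a reasonable starting point instead of a random one.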
15Connected vs Isolated Digits
Example: the number 3 1 2 6 said as
- Isolated digits: t r e S u d o j S s 6 j S
- Connected digits: t r e z u d o j S _ 6 j S
16Baseline System - Connected Digits
- Use best isolated digit models as bootstrap models
- Repeat re-estimation/evaluation process
- Gradually increment Gaussian mixtures per state up to 5 for the digit models
17Baseline System - Results
18Extension to the Baseline System
- New way of modelling the filler models
- Same training/evaluation process
- Train the 9 filler and silence models with no skips
- Build a single filler model by concatenating all filler and silence models
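The architecture of the combined model is shown on a slide whose figure is not in this transcript, so the following is only one possible reading: the no-skip models are chained in series, with the last state of each model handing over to the first state of the next. The three states per model, the hand-over probability and the function names are hypothetical.

```python
import numpy as np

def no_skip_matrix(n_states, self_loop=0.6):
    """Left-right transition matrix without skips."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = self_loop
        A[i, i + 1] = 1.0 - self_loop
    A[-1, -1] = 1.0
    return A

def concatenate_in_series(models, exit_prob=0.4):
    """Place each model on the block diagonal and link the last state of one
    model to the first state of the next with probability `exit_prob`."""
    total = sum(A.shape[0] for A in models)
    big = np.zeros((total, total))
    offset = 0
    for idx, A in enumerate(models):
        n = A.shape[0]
        big[offset:offset + n, offset:offset + n] = A
        if idx + 1 < len(models):
            last = offset + n - 1
            big[last] *= 1.0 - exit_prob   # rescale so the row still sums to 1
            big[last, offset + n] = exit_prob
        offset += n
    return big

fillers = [no_skip_matrix(3) for _ in range(9)]  # 9 models, as on the slide
single_filler = concatenate_in_series(fillers)
print(single_filler.shape)
```

If the original design instead connected the models in parallel or added skip arcs between them, only the linking step would change; the block-diagonal layout would stay the same.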
19New Filler Model Architecture
20Results With New Filler Model
21Natural Numbers
- Phone models with 3 states and no skips
- Larger vocabulary size
- May be adapted to other tasks
- Phones initialized from models already trained for a directory assistance task
- Digits are still modeled by word models
- Grammar for natural numbers ranging from zero to hundreds of millions
22Natural Numbers Example
Number: 25
- Hypothesis 1: vinte e cinco (twenty and five)
- Hypothesis 2: vinte cinco (twenty five)
But vinte cinco could also be the sequence of natural numbers 20 5
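The ambiguity can be made concrete with a toy parser that lists every way of reading a word sequence as natural numbers. The lexicon and the rule set (tens, units, optional "e" between them) are a deliberately tiny assumption and not the grammar used in the system, which covers zero to hundreds of millions.

```python
UNITS = {"um": 1, "dois": 2, "três": 3, "quatro": 4, "cinco": 5,
         "seis": 6, "sete": 7, "oito": 8, "nove": 9}
TENS = {"vinte": 20, "trinta": 30, "quarenta": 40, "cinquenta": 50}

def parse_numbers(words):
    """Return every reading of `words` as a list of natural numbers."""
    if not words:
        return [[]]
    readings = []
    head = words[0]
    if head in UNITS:
        readings += [[UNITS[head]] + rest for rest in parse_numbers(words[1:])]
    if head in TENS:
        # The tens word on its own, e.g. "vinte" = 20.
        readings += [[TENS[head]] + rest for rest in parse_numbers(words[1:])]
        # Tens followed directly by a unit ("vinte cinco") or joined by "e".
        for unit_pos in (1, 2):
            if unit_pos == 2 and (len(words) < 2 or words[1] != "e"):
                continue
            if len(words) > unit_pos and words[unit_pos] in UNITS:
                value = TENS[head] + UNITS[words[unit_pos]]
                tails = parse_numbers(words[unit_pos + 1:])
                readings += [[value] + rest for rest in tails]
    return readings

with_e = parse_numbers("vinte e cinco".split())
without_e = parse_numbers("vinte cinco".split())
print(with_e)
print(without_e)
```

With the connective "e" only one reading survives, while dropping it leaves both the single number and the two-number sequence, which is the confusion described on the slide.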
23Natural Numbers - Results
24Sample Application
25Conclusions and Future Work
- Explicitly modeling fillers is a difficult task
- Improved filler model decreases error rate by up to 50%
- Develop context-dependent models
- Solve vowel reduction and co-articulation problems
- Results may be improved through the use of discriminative training techniques