Title: Relating Reliability in Phonetic Feature Streams to Noise Robustness
- Alex Park
- August 26th, 2003
Overview
- Motivation for using a layered, phonetic feature stream approach
- Building a recognizer based on phonetic features
  - MFCC-based GMM feature detectors (baseline)
  - Sample feature stream outputs
  - Training a digit recognizer using concatenated feature streams as input
- Robust alternatives for the voicing feature stream module
  - Saul sinusoid detector
  - Autocorrelation
  - GMM classifier using alternative features
- Evaluation of stream reliability using distortion between clean and noisy speech
  - Hard question: what is ground truth for continuous measurements?
- Relating stream extraction reliability to word recognition accuracy
- Conclusions and Future Work
Motivation
- Failure of recognizers in noise is due to mismatch between the features observed in training and testing
- To reduce this mismatch, we can evaluate and optimize the reliability of the features presented to the acoustic models at a middle layer
- Current recognizers typically use one set of front-end features to train acoustic models at the phone level
- Typical front-end features can only be evaluated by looking at WER, which is influenced by many factors; global optimization can mask serious inconsistencies in the speech representation under noise
- Phonetic features can change asynchronously, especially in spontaneous speech
- Why phonetic features?
  - They are perceivable by humans and relevant to speech
  - Several examples of phonetic feature/phone class detection exist: bursts (Niyogi 2002), nasality (Glass 1986), voicing (Saul 2003)
  - Other researchers have recently proposed acoustic modelling frameworks based on related feature streams (articulatory, acoustic, distinctive): articulatory (Livescu 2003, Metze 2002), acoustic (Kirchhoff 2002)
- Why not?
Training MFCC GMM Feature Classifiers
- Sparse set of 6 phonetic features chosen for simplicity
  - For a less constrained task, more features should probably be used
  - More extensive training data would also improve the quality of each feature detector
- For each feature F, train two GMMs, p(x|F) and p(x|¬F), using frame-level MFCC feature vectors
  - Trained on 410 TIMIT sentences from 40 speakers (126k frames)
- Use Bayes' rule (with equal priors) to determine posterior probabilities, computed every 10 ms (a minimal sketch follows the table below)
[Diagram: transcribed speech (training data) is used to train p(x|F) and p(x|¬F); in testing, the two models yield a per-frame posterior probability]

Feature        TIMIT labels
Frication      s, sh, z, zh, f, th, ...
Rounding       w, ow, uw, ...
Nasal          n, m, ng, ...
Liquid/Glide   el, l, uw, ...
Burst          g, k, p, ...
Voice          aa, ae, ah, ...
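For concreteness, here is a minimal sketch of the per-frame posterior computation. It assumes scikit-learn GMMs and placeholder arrays standing in for the labelled TIMIT MFCC frames; the library choice, mixture count, and variable names are illustrative, not from the slides.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Stand-ins for frame-level MFCC vectors labelled +F and -F (13-dim here).
    frames_f = np.random.randn(1000, 13)
    frames_not_f = np.random.randn(1000, 13)

    # One GMM per class; the mixture count is illustrative.
    gmm_f = GaussianMixture(n_components=10, covariance_type="diag").fit(frames_f)
    gmm_not_f = GaussianMixture(n_components=10, covariance_type="diag").fit(frames_not_f)

    def feature_posterior(x):
        """Per-frame p(F|x) via Bayes' rule with equal priors; x is (frames, dims)."""
        ll_f = gmm_f.score_samples(x)          # log p(x|F), one value per frame
        ll_not_f = gmm_not_f.score_samples(x)  # log p(x|-F)
        # Equal priors cancel: p(F|x) = p(x|F) / (p(x|F) + p(x|-F)).
        return np.exp(ll_f - np.logaddexp(ll_f, ll_not_f))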
Sample Outputs: MFCC-based Streams
- Feature streams on an Aurora utterance ("six three five seven one zero four")
Recognizer Training
- Phonetic feature posterior probability outputs used as feature vectors to train an Aurora HMM recognizer
  - Standard training script included with the Aurora 2 evaluation (8440 clean training utterances)
- Eleven whole-word models and one silence model
  - 18 states each, 3 mixtures, 6-dimensional diagonal Gaussian emission probabilities
- Probably not an optimal model structure for the given feature set
- Also used HCompV instead of HInit with time-aligned transcriptions
[Diagram: clean training data → feature extraction modules → concatenated feature vector → whole-word HMMs ("one", "two", ..., "oh")]
Preliminary recognition results
- Tested across all 4 noise conditions and 7 SNR levels on Aurora test set A
- Accuracy is 88% on clean data (91% was obtained earlier using 9 feature streams, but the set was reduced to 6 for simplicity)
- Poor performance compared to the Aurora baseline, but interesting considering the sparsity of the feature set used to train the HMMs
- Many factors should be addressed to improve the stream-based recognizer:
  - More feature streams
  - Deltas and delta-deltas (a sketch follows this list)
  - Relationship between feature streams
  - Discriminative lexical ability for different word models
  - Noise compensation in feature extraction
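To illustrate the deltas point above, a minimal sketch of standard regression-based delta coefficients applied to a single posterior stream; the window size and function name are assumptions, not from the slides.

    import numpy as np

    def deltas(stream, n=2):
        """Regression deltas: d[t] = sum_k k*(x[t+k] - x[t-k]) / (2*sum_k k^2)."""
        T = len(stream)
        padded = np.pad(stream, n, mode="edge")  # repeat edge frames at boundaries
        num = sum(k * (padded[n + k:T + n + k] - padded[n - k:T + n - k])
                  for k in range(1, n + 1))
        return num / (2 * sum(k * k for k in range(1, n + 1)))

    # Delta-deltas are just deltas of deltas: deltas(deltas(stream)).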
A closer look: stream corruption under noise
- Effect of noise on the output of the MFCC-based voicing feature module

[Figure: p(Voice) traces for the same utterance under increasing noise (three conditions)]
In search of a better voicing module
- Several possible alternatives to the MFCC-based voicing module:
  - Autocorrelation (AutoCorr); a sketch of one common formulation follows the table below
  - Sinusoid Uncertainty (Saul, 2003)
  - Alternative GMM classifier (AltGMM)
    - Trained like the MFCC classifier, but using the above features
    - 6-dimensional, 10-mixture diagonal Gaussians each for p(x|F) and p(x|¬F)
- Evaluated voicing detection using the phonetic transcription as reference
- In clean conditions, the MFCC GMM has the best detection performance
- Is this the best module to use?
Method     Equal error rate (%)
GMM        11.14
Sinusoid   18.11
AutoCorr   16.78
AltGMM     24.84
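As a reference point for the AutoCorr alternative, a minimal sketch of a frame-level autocorrelation voicing measure: the peak of the normalized autocorrelation over plausible pitch lags. The slides do not specify the exact features used, so this formulation, the 8 kHz sampling rate, and the pitch range are assumptions.

    import numpy as np

    def autocorr_voicing(frame, fs=8000, f0_min=60.0, f0_max=400.0):
        """Peak normalized autocorrelation over the pitch lag range."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            return 0.0                           # silent frame
        ac = ac / ac[0]                          # normalize by lag-0 energy
        lo = int(fs / f0_max)                    # shortest pitch period (samples)
        hi = min(int(fs / f0_min), len(ac) - 1)  # longest pitch period
        return float(ac[lo:hi + 1].max())        # near 1 for strongly voiced frames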
Evaluating stream robustness
- Several problems with using global frame detection accuracy to rate module performance:
  - Would like some continuous measure of voicing (degree of voicing) instead of a binary decision
  - Ground truth is hard to come by: voiced phone labels are not necessarily voiced!
- To evaluate reliability, try using the distortion between the clean and noisy voicing probability for the same utterance (a sketch follows)
  - For each frame, measure the difference between the clean estimate, fc(t), and the noisy estimate, fn(t)
  - If |fc(t) - fn(t)| > 0.2, label f(t) as a gross error
  - If |fc(t) - fn(t)| < 0.2, use |fc(t) - fn(t)| as a measure of the distortion caused by noise
- N.B. Consistency doesn't guarantee accuracy; we still need to check
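A minimal sketch of this distortion measure, assuming fc and fn are frame-aligned probability arrays for the same utterance (the function and argument names are illustrative):

    import numpy as np

    def stream_distortion(fc, fn, thresh=0.2):
        """Gross-error rate and mean distortion between clean/noisy streams."""
        diff = np.abs(fc - fn)
        gross = diff > thresh                    # frames counted as gross errors
        gross_rate = float(gross.mean())
        mean_dist = float(diff[~gross].mean()) if (~gross).any() else 0.0
        return gross_rate, mean_dist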
Distortion Comparison
- Compared the frame distortion of the voicing modules at each noise level:
  - Percentage of frames labelled as gross errors (distortion > 0.2)
  - Average distortion over the remaining frames (distortion < 0.2)
- Despite higher performance on clean data, the MFCC module is the most erratic
- For consistency, the AltGMM module outperforms the MFCC module in noise
A better voicing module?
- Output of the AltGMM module trained on AutoCorr and SinUn features

[Figure: p(Voice) traces from the AltGMM module under increasing noise (three conditions)]
Recognition Performance Comparison
- Trained 3 additional recognizers, one for each alternative voicing module
- Performed recognition experiments to compare the voicing modules
- No significant difference in accuracy at any noise level -_-
- Additional experiments are needed to understand the effect of the voicing modules on recognition
[Diagram: test utterance → feature extraction modules, with the voicing module swapped among the candidates → feature vector]
Oracle Experiment
- What happens if we assume the voicing module is perfectly reliable?
  - i.e., same output under any noise condition
- Accuracy not improved over the normal scenario
- Having a robust voicing feature alone is not enough to improve recognition
  - Corruption of the other feature streams is likely skewing the overall acoustic model scores
- How can we isolate the contribution of this feature stream? (a sketch of the substitution follows the diagram)
[Diagram: clean and noisy versions of the test utterance → feature extraction modules; the voicing stream comes from the clean signal, the remaining streams from the noisy signal → feature vector]
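A minimal sketch of the oracle substitution (and its inverse, used on the next slide), assuming each utterance yields a frames-by-6 matrix of stream posteriors with voicing in a known column; the names and column index are illustrative.

    import numpy as np

    VOICE = 5  # column index of the voicing stream (illustrative)

    def oracle(clean, noisy):
        """Oracle: noisy streams everywhere, but voicing taken from clean."""
        out = noisy.copy()
        out[:, VOICE] = clean[:, VOICE]
        return out

    def inverse_oracle(clean, noisy):
        """Inverse oracle: clean streams everywhere, actual (noisy) voicing."""
        out = clean.copy()
        out[:, VOICE] = noisy[:, VOICE]
        return out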
Inverse Oracle Experiment
- Assume the other feature streams are computed consistently
- Allow the voicing module to contribute its actual output
- Significant difference in performance between the 4 voicing modules
- Even with 5 of 6 clean features, the MFCC voicing module degrades quickly in noise
- Recognition performance of each method is correlated with the distortion results
[Diagram: the noisy test utterance supplies the voicing stream, the clean version supplies the other streams → feature vector]
Conclusions and Future Work
- A small set of phonetic features can obtain somewhat high (88%) recognition accuracy on a constrained digit task, even when integrated in a non-optimal manner (HMM)
- Reliable extraction of feature streams is essential for robust recognition
- Combining statistical training with feature-specific measurements can improve the reliability of feature stream extraction
- Even if the other 5 streams are computed perfectly, corrupting the voicing stream can drastically degrade recognition accuracy
- Integrate feature streams with a more appropriate acoustic modelling layer (e.g., feature-based graphical models or DBNs)
- Optimize individual feature stream modules with relevant measurements:
  - Nasality: broad F1 bandwidth, low spectral slope in the F1-F2 region, stable low-frequency energy
  - Rounding: low F1, F2
  - Retroflex: low F3, rising formants
- Combine feature streams with an SNR-based measure of reliability
- Lots to be done!
References
- J. R. Glass and V. W. Zue (1986). "Detection and Recognition of Nasal Consonants in American English," in Proc. ICASSP '86, Tokyo, Japan.
- P. Niyogi and M. M. Sondhi (2002). "Detecting Stop Consonants in Continuous Speech," J. Acoust. Soc. Am., vol. 111, p. 1063.
- L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA.
- K. Kirchhoff, G. A. Fink, and G. Sagerer (2002). "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
- K. Livescu, J. R. Glass, and J. Bilmes (2003). "Hidden Feature Models for Speech Recognition Using Dynamic Bayesian Networks," to be presented at Eurospeech '03, Geneva, Switzerland.
- F. Metze and A. Waibel (2002). "A Flexible Stream Architecture for ASR Using Articulatory Features," in Proc. ICSLP '02, Denver, Colorado.
Extra Slides

Band-limited Sinusoid Fitting (Saul 2003)
- Filter bandwidths allow at least one filter to resolve single harmonics
- Frames of the filtered signals are fit with a sinusoid of frequency w and error u
- At each step, the lowest u gives the voicing probability; the corresponding w gives the pitch estimate
- The algorithm is fast and gives accurate pitch tracks (a simplified sketch follows)
Supp. recognition results I (Actual streams)

Supp. recognition results II (Oracle voice)

Supp. recognition results III (Inv. Oracle voice)

Supp. distortion results I (Gross error rate)

Supp. distortion results II (Avg. frame distortion)