Title: Relating Reliability in Phonetic Feature Streams to Noise Robustness
- Alex Park
- August 26th, 2003
Overview
- Motivation for using a layered, phonetic feature stream approach
- Building a recognizer based on phonetic features
  - MFCC-based GMM feature detectors (baseline)
  - Sample feature stream outputs
  - Training a digit recognizer using concatenated feature streams as input
- Robust alternatives for the voicing feature stream module
  - Saul sinusoid detector
  - Autocorrelation
  - GMM classifier using alternative features
- Evaluation of stream reliability using distortion between clean and noisy speech
  - Hard question: what is ground truth for continuous measurements?
- Relating stream extraction reliability to word recognition accuracy
- Conclusions and Future Work
Motivation
- Failure of recognizers in noise is due to mismatch between the features observed in training and testing
- To reduce this mismatch, we can evaluate and optimize the reliability of the features presented to the acoustic models at a middle layer
- Current recognizers typically use one set of front-end features to train acoustic models at the phone level
- Typical front-end features can only be evaluated by looking at WER, which is influenced by many factors; global optimization can mask serious inconsistencies in the speech representation under noise
- Phonetic features can change asynchronously, especially in spontaneous speech
- Why phonetic features?
  - They are perceivable by humans and relevant to speech
  - Several examples of phonetic feature/phone class detection exist: bursts (Niyogi 2002), nasality (Glass 1986), voicing (Saul 2003)
  - Other researchers have recently proposed acoustic modelling frameworks based on related feature streams (articulatory, acoustic, distinctive): articulatory (Livescu 2003, Metze 2002), acoustic (Kirchhoff 2002)
- Why not?
Training MFCC GMM Feature Classifiers
- Sparse set of 6 phonetic features chosen for simplicity
  - For a less constrained task, more features should probably be used
  - More extensive training data would also improve the quality of each feature detector
- For each feature F, train two GMMs, p(x|F) and p(x|¬F), using frame-level MFCC feature vectors
  - Trained on 410 TIMIT sentences from 40 speakers (126k frames)
- Use Bayes' rule (with equal priors) to determine posterior probabilities, computed every 10 ms (a minimal sketch follows the table below)
[Diagram: transcribed speech (training data) is used to train p(x|F) and p(x|¬F); in testing, the two models yield a per-frame posterior probability]

Feature        TIMIT labels
Frication      s, sh, z, zh, f, th, ...
Rounding       w, ow, uw, ...
Nasal          n, m, ng, ...
Liquid/Glide   el, l, uw, ...
Burst          g, k, p, ...
Voice          aa, ae, ah, ...
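For concreteness, here is a minimal sketch of the per-frame posterior computation. It assumes scikit-learn GMMs and placeholder arrays standing in for the labelled TIMIT MFCC frames; the library choice, mixture count, and variable names are illustrative, not from the slides.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Stand-ins for frame-level MFCC vectors labelled +F and -F (13-dim here).
    frames_f = np.random.randn(1000, 13)
    frames_not_f = np.random.randn(1000, 13)

    # One GMM per class; the mixture count is illustrative.
    gmm_f = GaussianMixture(n_components=10, covariance_type="diag").fit(frames_f)
    gmm_not_f = GaussianMixture(n_components=10, covariance_type="diag").fit(frames_not_f)

    def feature_posterior(x):
        """Per-frame p(F|x) via Bayes' rule with equal priors; x is (frames, dims)."""
        ll_f = gmm_f.score_samples(x)          # log p(x|F), one value per frame
        ll_not_f = gmm_not_f.score_samples(x)  # log p(x|-F)
        # Equal priors cancel: p(F|x) = p(x|F) / (p(x|F) + p(x|-F)).
        return np.exp(ll_f - np.logaddexp(ll_f, ll_not_f))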
Sample Outputs: MFCC-based Streams
- Feature streams on an Aurora utterance ("six three five seven one zero four")
Recognizer Training
- Phonetic feature posterior probability outputs used as feature vectors to train an Aurora HMM recognizer
  - Standard training script included with the Aurora 2 evaluation (8440 clean training utterances)
- Eleven whole-word models and one silence model
  - 18 states each, 3 mixtures, 6-dimensional diagonal Gaussian emission probabilities
- Probably not an optimal model structure for the given feature set
- Also used HCompV instead of HInit with time-aligned transcriptions
[Diagram: clean training data → feature extraction modules → concatenated feature vector → whole-word HMMs ("one", "two", ..., "oh")]
Preliminary recognition results
- Tested across all 4 noise conditions and 7 SNR levels on Aurora test set A
- Accuracy is 88% on clean data (91% was obtained earlier using 9 feature streams, but the set was reduced to 6 for simplicity)
- Poor performance compared to the Aurora baseline, but interesting considering the sparsity of the feature set used to train the HMMs
- Many factors should be addressed to improve the stream-based recognizer:
  - More feature streams
  - Deltas and delta-deltas (a sketch follows this list)
  - Relationship between feature streams
  - Discriminative lexical ability for different word models
  - Noise compensation in feature extraction
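To illustrate the deltas point above, a minimal sketch of standard regression-based delta coefficients applied to a single posterior stream; the window size and function name are assumptions, not from the slides.

    import numpy as np

    def deltas(stream, n=2):
        """Regression deltas: d[t] = sum_k k*(x[t+k] - x[t-k]) / (2*sum_k k^2)."""
        T = len(stream)
        padded = np.pad(stream, n, mode="edge")  # repeat edge frames at boundaries
        num = sum(k * (padded[n + k:T + n + k] - padded[n - k:T + n - k])
                  for k in range(1, n + 1))
        return num / (2 * sum(k * k for k in range(1, n + 1)))

    # Delta-deltas are just deltas of deltas: deltas(deltas(stream)).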
A closer look: stream corruption under noise
- Effect of noise on the output of the MFCC-based voicing feature module

[Figure: p(Voice) traces for the same utterance under increasing noise (three conditions)]
In search of a better voicing module
- Several possible alternatives to the MFCC-based voicing module:
  - Autocorrelation (AutoCorr); a sketch of one common formulation follows the table below
  - Sinusoid Uncertainty (Saul, 2003)
  - Alternative GMM classifier (AltGMM)
    - Trained like the MFCC classifier, but using the above features
    - 6-dimensional, 10-mixture diagonal Gaussians each for p(x|F) and p(x|¬F)
- Evaluated voicing detection using the phonetic transcription as reference
- In clean conditions, the MFCC GMM has the best detection performance
- Is this the best module to use?
Method     Equal error rate (%)
GMM        11.14
Sinusoid   18.11
AutoCorr   16.78
AltGMM     24.84
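As a reference point for the AutoCorr alternative, a minimal sketch of a frame-level autocorrelation voicing measure: the peak of the normalized autocorrelation over plausible pitch lags. The slides do not specify the exact features used, so this formulation, the 8 kHz sampling rate, and the pitch range are assumptions.

    import numpy as np

    def autocorr_voicing(frame, fs=8000, f0_min=60.0, f0_max=400.0):
        """Peak normalized autocorrelation over the pitch lag range."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        if ac[0] <= 0:
            return 0.0                           # silent frame
        ac = ac / ac[0]                          # normalize by lag-0 energy
        lo = int(fs / f0_max)                    # shortest pitch period (samples)
        hi = min(int(fs / f0_min), len(ac) - 1)  # longest pitch period
        return float(ac[lo:hi + 1].max())        # near 1 for strongly voiced frames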
Evaluating stream robustness
- Several problems with using global frame detection accuracy to rate module performance:
  - Would like some continuous measure of voicing (degree of voicing) instead of a binary decision
  - Ground truth is hard to come by: voiced phone labels are not necessarily voiced!
- To evaluate reliability, try using the distortion between the clean and noisy voicing probability for the same utterance (a sketch follows)
  - For each frame, measure the difference between the clean estimate, fc(t), and the noisy estimate, fn(t)
  - If |fc(t) - fn(t)| > 0.2, label f(t) as a gross error
  - If |fc(t) - fn(t)| < 0.2, use |fc(t) - fn(t)| as a measure of the distortion caused by noise
- N.B. Consistency doesn't guarantee accuracy; we still need to check
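A minimal sketch of this distortion measure, assuming fc and fn are frame-aligned probability arrays for the same utterance (the function and argument names are illustrative):

    import numpy as np

    def stream_distortion(fc, fn, thresh=0.2):
        """Gross-error rate and mean distortion between clean/noisy streams."""
        diff = np.abs(fc - fn)
        gross = diff > thresh                    # frames counted as gross errors
        gross_rate = float(gross.mean())
        mean_dist = float(diff[~gross].mean()) if (~gross).any() else 0.0
        return gross_rate, mean_dist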
Distortion Comparison
- Compared the frame distortion of the voicing modules at each noise level:
  - Percentage of frames labelled as gross errors (distortion > 0.2)
  - Average distortion over the remaining frames (distortion < 0.2)
- Despite higher performance on clean data, the MFCC module is the most erratic
- For consistency, the AltGMM module outperforms the MFCC module in noise
A better voicing module?
- Output of the AltGMM module trained on AutoCorr and SinUn features

[Figure: p(Voice) traces from the AltGMM module under increasing noise (three conditions)]
Recognition Performance Comparison
- Trained 3 additional recognizers, one for each alternative voicing module
- Performed recognition experiments to compare the voicing modules
- No significant difference in accuracy at any noise level -_-
- Additional experiments are needed to understand the effect of the voicing modules on recognition
[Diagram: test utterance → feature extraction modules, with the voicing module swapped among the candidates → feature vector]
Oracle Experiment
- What happens if we assume the voicing module is perfectly reliable?
  - i.e., same output under any noise condition
- Accuracy not improved over the normal scenario
- Having a robust voicing feature alone is not enough to improve recognition
  - Corruption of the other feature streams is likely skewing the overall acoustic model scores
- How can we isolate the contribution of this feature stream? (a sketch of the substitution follows the diagram)
[Diagram: clean and noisy versions of the test utterance → feature extraction modules; the voicing stream comes from the clean signal, the remaining streams from the noisy signal → feature vector]
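A minimal sketch of the oracle substitution (and its inverse, used on the next slide), assuming each utterance yields a frames-by-6 matrix of stream posteriors with voicing in a known column; the names and column index are illustrative.

    import numpy as np

    VOICE = 5  # column index of the voicing stream (illustrative)

    def oracle(clean, noisy):
        """Oracle: noisy streams everywhere, but voicing taken from clean."""
        out = noisy.copy()
        out[:, VOICE] = clean[:, VOICE]
        return out

    def inverse_oracle(clean, noisy):
        """Inverse oracle: clean streams everywhere, actual (noisy) voicing."""
        out = clean.copy()
        out[:, VOICE] = noisy[:, VOICE]
        return out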
Inverse Oracle Experiment
- Assume the other feature streams are computed consistently
- Allow the voicing module to contribute its actual output
- Significant difference in performance between the 4 voicing modules
- Even with 5 of 6 clean features, the MFCC voicing module degrades quickly in noise
- Recognition performance of each method is correlated with the distortion results
[Diagram: the noisy test utterance supplies the voicing stream, the clean version supplies the other streams → feature vector]
Conclusions and Future Work
- A small set of phonetic features can obtain somewhat high (88%) recognition accuracy on a constrained digit task, even when integrated in a non-optimal manner (HMM)
- Reliable extraction of feature streams is essential for robust recognition
- Combining statistical training with feature-specific measurements can improve the reliability of feature stream extraction
- Even if the other 5 streams are computed perfectly, corrupting the voicing stream can drastically degrade recognition accuracy
- Integrate feature streams with a more appropriate acoustic modelling layer (e.g., feature-based graphical models or DBNs)
- Optimize individual feature stream modules with relevant measurements:
  - Nasality: broad F1 bandwidth, low spectral slope in the F1-F2 region, stable low-frequency energy
  - Rounding: low F1, F2
  - Retroflex: low F3, rising formants
- Combine feature streams with an SNR-based measure of reliability
- Lots to be done!
References
- J. R. Glass and V. W. Zue (1986). "Detection and Recognition of Nasal Consonants in American English," in Proc. ICASSP '86, Tokyo, Japan.
- P. Niyogi and M. M. Sondhi (2002). "Detecting Stop Consonants in Continuous Speech," J. Acoust. Soc. Am., vol. 111, p. 1063.
- L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. MIT Press, Cambridge, MA.
- K. Kirchhoff, G. A. Fink, and G. Sagerer (2002). "Combining acoustic and articulatory feature information for robust speech recognition," Speech Communication, May 2002.
- K. Livescu, J. R. Glass, and J. Bilmes (2003). "Hidden Feature Models for Speech Recognition Using Dynamic Bayesian Networks," to be presented at Eurospeech '03, Geneva, Switzerland.
- F. Metze and A. Waibel (2002). "A Flexible Stream Architecture for ASR Using Articulatory Features," in Proc. ICSLP '02, Denver, Colorado.
Extra Slides

Band-limited Sinusoid Fitting (Saul 2003)
- Filter bandwidths allow at least one filter to resolve single harmonics
- Frames of the filtered signals are fit with a sinusoid of frequency w and error u
- At each step, the lowest u gives the voicing probability; the corresponding w gives the pitch estimate
- The algorithm is fast and gives accurate pitch tracks (a simplified sketch follows)
Supp. recognition results I (Actual streams)

Supp. recognition results II (Oracle voice)

Supp. recognition results III (Inv. Oracle voice)

Supp. distortion results I (Gross error rate)

Supp. distortion results II (Avg. frame distortion)