Title: An Analysis of the Aurora Large Vocabulary Evaluation
EUROSPEECH 2003
- Authors
- Naveen Parihar and Joseph Picone
- Inst. for Signal and Info. Processing
- Dept. Electrical and Computer Eng.
- Mississippi State University
- Contact Information
- Box 9571
- Mississippi State University
- Mississippi State, Mississippi 39762
- Tel 662-325-8335
- Fax 662-325-2298
- Email: {parihar, picone}@isip.msstate.edu
- URL: isip.msstate.edu/publications/seminars/ece_weekly/2003/evaluation/
INTRODUCTION
ABSTRACT
In this presentation, we analyze the results of
the recent Aurora large vocabulary evaluations
(ALV). Two consortia submitted proposals on
speech recognition front ends for this
evaluation: (1) Qualcomm, ICSI, and OGI (QIO),
and (2) Motorola, France Telecom, and Alcatel
(MFA). These front ends used a variety of noise
reduction techniques, including discriminative
transforms, feature normalization, voice activity
detection, and blind equalization. Participants
used a common speech recognition engine to
post-process their features. In this
presentation, we show that the results of this
evaluation were not significantly impacted by
suboptimal recognition system parameter settings.
Without any front end specific tuning, the MFA
front end outperforms the QIO front end by 9.6%
relative. With tuning, the relative performance
gap increases to 15.8%. Both the mismatched
microphone and additive noise evaluation
conditions resulted in a significant degradation
in performance for both front ends.
INTRODUCTION
SPEECH RECOGNITION OVERVIEW
A noisy communication channel model for speech
production and perception
- Bayesian formulation for speech recognition:
  - P(W|A) = P(A|W) P(W) / P(A)
- Objective: minimize word error rate by maximizing P(W|A) (a toy example follows this slide)
- Approach: maximize P(A|W) (training)
- P(A|W): acoustic model (hidden Markov models, Gaussians)
- P(W): language model (finite state machines, N-grams)
- P(A): acoustics (ignored during maximization)
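To make the decision rule concrete, here is a toy Python sketch (not from the original slides): the recognizer scores each candidate word sequence W by log P(A|W) + log P(W) and returns the argmax; P(A) is constant across hypotheses and drops out. All hypothesis strings and log-probability values below are invented for illustration.

```python
# Hypothetical log-probabilities for three candidate transcriptions
# of the same acoustic observation A (values invented for illustration).
hypotheses = {
    "recognize speech":   {"log_p_a_given_w": -120.3, "log_p_w": -8.1},
    "wreck a nice beach": {"log_p_a_given_w": -118.9, "log_p_w": -14.7},
    "recognized speech":  {"log_p_a_given_w": -121.0, "log_p_w": -9.5},
}

def decode(hyps):
    """Return argmax_W P(A|W) P(W); P(A) is common to all hypotheses."""
    return max(hyps, key=lambda w: hyps[w]["log_p_a_given_w"] + hyps[w]["log_p_w"])

print(decode(hypotheses))  # -> "recognize speech"
```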
INTRODUCTION
BLOCK DIAGRAM APPROACH
- Core components
- Transduction
- Feature extraction
- Acoustic modeling (hidden Markov models)
- Language modeling (statistical N-grams)
- Search (Viterbi beam)
- Knowledge sources
INTRODUCTION
AURORA EVALUATION OVERVIEW
- WSJ 5K (closed task) with seven digitally-added noise conditions
- Common ASR system
- Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)
INTRODUCTION
MOTIVATION
ALV Evaluation Results
- ALV goal was at least a 25% relative improvement over the baseline MFCC front end
- Two consortia participated:
  - QIO: QualComm, ICSI, OGI
  - MFA: Motorola, France Telecom, Alcatel
- Generic baseline LVCSR system with no front end specific tuning
- Would front end specific tuning change the rankings?
EVALUATION PARADIGM
THE AURORA 4 DATABASE
- Acoustic Training:
  - Derived from the 5000-word WSJ0 task
  - TS1 (clean) and TS2 (multi-condition)
  - Clean plus 6 noise conditions
  - Randomly chosen SNR between 10 and 20 dB (see the mixing sketch below)
  - 2 microphone conditions (Sennheiser and secondary)
  - 2 sample frequencies: 16 kHz and 8 kHz
  - G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
- Development and Evaluation Sets:
  - Derived from the WSJ0 Evaluation and Development sets
  - 14 test sets for each
  - 7 recorded on the Sennheiser mic, 7 on the secondary mic
  - Clean plus 6 noise conditions
  - Randomly chosen SNR between 5 and 15 dB
  - G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
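As an illustration of how noise can be digitally added at a randomly chosen SNR, here is a minimal numpy sketch of the SNR-targeted mixing step only; the actual Aurora-4 construction used specific noise recordings and channel filtering, and the synthetic signals below are stand-ins.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then mix."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Required noise power is P_speech / 10^(snr_db / 10)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)   # stand-in for 1 s of 16 kHz speech
noise = rng.standard_normal(16000)    # stand-in for a noise recording
snr_db = rng.uniform(10, 20)          # Aurora-4 training: 10-20 dB SNR
noisy = add_noise_at_snr(speech, noise, snr_db)
```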
EVALUATION PARADIGM
BASELINE LVCSR SYSTEM
- Standard context-dependent cross-word HMM-based system
- Acoustic models: state-tied 4-mixture cross-word triphones
- Language model: WSJ0 5K bigram
- Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
- Lexicon: based on CMUlex
- Real-time performance: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
EVALUATION PARADIGM
WI007 ETSI MFCC FRONT END
- The baseline HMM system used an ETSI standard MFCC-based front end:
  - Zero-mean debiasing
  - 10 ms frame duration
  - 25 ms Hamming window
  - Absolute energy
  - 12 cepstral coefficients
  - First and second derivatives
[Block diagram: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Energy and Cepstral Analysis]
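For readers unfamiliar with the pipeline, here is a minimal numpy/scipy sketch of a generic MFCC front end in the spirit of the slide's parameters (25 ms Hamming window, 10 ms shift, absolute energy, 12 cepstral coefficients). The 23-filter mel bank and 512-point FFT are assumptions, and pre-emphasis and the derivative computation are omitted; this is not the exact WI007 reference implementation.

```python
import numpy as np
from scipy.fftpack import dct

def mfcc(signal, fs=16000, frame_ms=25, shift_ms=10, n_filters=23, n_ceps=12):
    """Generic MFCC sketch: framing, Hamming window, FFT, mel filterbank, log, DCT."""
    signal = signal - np.mean(signal)          # zero-mean debiasing, as in WI007

    frame_len = int(fs * frame_ms / 1000)
    shift = int(fs * shift_ms / 1000)
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    nfft = 512

    # Triangular filters equally spaced on the mel scale
    def hz_to_mel(f): return 2595 * np.log10(1 + f / 700)
    def mel_to_hz(m): return 700 * (10 ** (m / 2595) - 1)
    mel_points = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_points) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)

    feats = []
    for t in range(n_frames):
        frame = signal[t * shift: t * shift + frame_len] * window
        power = np.abs(np.fft.rfft(frame, nfft)) ** 2
        log_energy = np.log(np.maximum(power.sum(), 1e-10))      # absolute energy
        mel_spec = np.log(np.maximum(fbank @ power, 1e-10))
        ceps = dct(mel_spec, type=2, norm='ortho')[:n_ceps]      # 12 cepstra
        feats.append(np.concatenate(([log_energy], ceps)))
    return np.array(feats)

rng = np.random.default_rng(0)
feats = mfcc(rng.standard_normal(16000))   # stand-in for 1 s of 16 kHz audio
print(feats.shape)                          # (frames, 13): energy + 12 cepstra
```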
FRONT END PROPOSALS
QIO FRONT END
- Qualcomm, ICSI, OGI (QIO) front end:
  - 10 ms frame duration
  - 25 ms analysis window
  - 15 RASTA-like filtered cepstral coefficients
  - MLP-based VAD
  - Mean and variance normalization (see the sketch below)
  - First and second derivatives
[Block diagram: Input Speech → Fourier Transform → Mel-scale Filter Bank → RASTA → MLP-based VAD → DCT → Mean/Variance Normalization]
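Of the QIO components, mean and variance normalization is the simplest to show. Below is a minimal per-utterance sketch, assuming a (frames × dimensions) feature matrix; the QIO front end's actual implementation (e.g., on-line estimation) may differ.

```python
import numpy as np

def mean_variance_normalize(features, eps=1e-10):
    """Per-utterance normalization: shift each feature dimension to zero mean
    and scale to unit variance, reducing channel and scale mismatch."""
    mu = features.mean(axis=0)
    sigma = features.std(axis=0)
    return (features - mu) / np.maximum(sigma, eps)

# Usage: `feats` stands in for a (num_frames, num_dims) front-end output
feats = np.random.randn(300, 15) * 3.0 + 1.5
normalized = mean_variance_normalize(feats)
```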
FRONT END PROPOSALS
MFA FRONT END
- 10 ms frame duration
- 25 ms analysis window
- Mel-warped Wiener filter based noise reduction (see the sketch below)
- Energy-based VADNest
- Waveform processing to enhance SNR
- Weighted log-energy
- 12 cepstral coefficients
- Blind equalization (cepstral domain)
- VAD based on acceleration of various energy-based measures
- First and second derivatives
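As a rough illustration of Wiener-filter noise reduction, here is a generic single-pass spectral-gain sketch. It is not the MFA design (which is mel-warped and multi-stage); the gain floor, FFT size, and the assumption that the leading frames are speech-free are all illustrative choices.

```python
import numpy as np

def wiener_gain(noisy_power, noise_power, floor=0.1):
    """Wiener gain H = SNR / (1 + SNR), with the SNR estimated per frequency
    bin from the noisy spectrum and a noise estimate. A gain floor limits
    musical-noise artifacts. Applied to the power spectrum for simplicity."""
    snr = np.maximum(noisy_power / np.maximum(noise_power, 1e-10) - 1.0, 0.0)
    return np.maximum(snr / (1.0 + snr), floor)

# Noise power is often estimated from frames a VAD marks as speech-free,
# e.g. the first few frames of the utterance:
rng = np.random.default_rng(0)
frames = np.abs(np.fft.rfft(rng.standard_normal((50, 400)), 512, axis=1)) ** 2
noise_estimate = frames[:5].mean(axis=0)     # assume leading frames are noise
enhanced = frames * wiener_gain(frames, noise_estimate)
```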
EXPERIMENTAL RESULTS
FRONT END SPECIFIC TUNING
- Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors.
- Tuning parameters (see the scoring sketch after this list):
  - State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
  - Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
  - Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
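A minimal sketch of how the last two parameters typically enter a hypothesis score in log space; the function name and default values are hypothetical, not the evaluation system's actual settings.

```python
def hypothesis_score(log_acoustic, log_lm, num_words,
                     lm_scale=15.0, word_insertion_penalty=-5.0):
    """Combined log-score used to rank hypotheses during decoding.
    lm_scale weights the language model against the acoustic model;
    word_insertion_penalty taxes each emitted word, trading insertions
    against deletions. Values here are illustrative only."""
    return log_acoustic + lm_scale * log_lm + num_words * word_insertion_penalty

# Raising lm_scale favors fluent word sequences; making the penalty more
# negative suppresses spurious short-word insertions in noisy audio.
print(hypothesis_score(-1200.0, -35.0, 12))
```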
EXPERIMENTAL RESULTS
FRONT END SPECIFIC TUNING - QIO
- Parameter tuning:
  - Clean data recorded on the Sennheiser mic (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
  - 8 kHz sampling frequency
- 7.5% relative improvement
EXPERIMENTAL RESULTS
FRONT END SPECIFIC TUNING - MFA
- Parameter tuning:
  - Clean data recorded on the Sennheiser mic (corresponds to Training Set 1 and Devtest Set 1 of the Aurora-4 database)
  - 8 kHz sampling frequency
- 9.4% relative improvement
- Ranking is still the same (14.9% vs. 12.5% WER)! (See the arithmetic sketch below.)
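The relative figures quoted in these slides follow the standard formula, sketched below. Applied to the post-tuning WERs above it gives about 16.1%; the quoted 15.8% gap is presumably computed by a slightly different averaging over the test sets.

```python
def relative_improvement(wer_baseline, wer_new):
    """Relative WER reduction, in percent: 100 * (base - new) / base."""
    return 100.0 * (wer_baseline - wer_new) / wer_baseline

# Post-tuning WERs from the slide above (QIO vs. MFA):
print(round(relative_improvement(14.9, 12.5), 1))  # -> 16.1
```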
EXPERIMENTAL RESULTS
COMPARISON OF TUNING
- Same ranking: the relative performance gap increased from 9.6% to 15.8%
- On TS1, the MFA front end is significantly better on all 14 test sets (MAPSSWE, p < 0.1)
- On TS2, the MFA front end is significantly better only on test sets 5 and 14
EXPERIMENTAL RESULTS
MICROPHONE VARIATION
- Train on the Sennheiser mic, evaluate on the secondary mic
- Matched conditions result in optimal performance
- Significant degradation for all front ends on mismatched conditions
- Both QIO and MFA provide improved robustness relative to the MFCC baseline
EXPERIMENTAL RESULTS
ADDITIVE NOISE
SUMMARY AND CONCLUSIONS
WHAT HAVE WE LEARNED?
- Front end specific parameter tuning did not result in a significant change in overall performance (MFA still outperforms QIO)
- Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline
- Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative over the ETSI baseline
- WER is still high (above 35%); further research on noise robust front ends is needed
SUMMARY AND CONCLUSIONS
AVAILABLE RESOURCES
SUMMARY AND CONCLUSIONS
BRIEF BIBLIOGRAPHY
- N. Parihar, "Performance Analysis of Advanced Front Ends," M.S. Dissertation, Mississippi State University, December 2003.
- N. Parihar, J. Picone, D. Pearce, and H.G. Hirsch, "Performance Analysis of the Aurora Large Vocabulary Baseline System," submitted to Eurospeech 2003, Geneva, Switzerland, September 2003.
- N. Parihar and J. Picone, "DSR Front End LVCSR Evaluation - AU/384/02," Aurora Working Group, European Telecommunications Standards Institute, December 6, 2002.
- D. Pearce, "Overview of Evaluation Criteria for Advanced Distributed Speech Recognition," ETSI STQ-Aurora DSR Working Group, October 2001.
- G. Hirsch, "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task," ETSI STQ-Aurora DSR Working Group, December 2002.
- ETSI ES 201 108 v1.1.2, "Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm," ETSI, April 2000.
SUMMARY AND CONCLUSIONS
BIOGRAPHY
- Naveen Parihar is an M.S. student in Electrical Engineering in the Department of Electrical and Computer Engineering at Mississippi State University. He currently leads the Core Speech Technology team developing a state-of-the-art public-domain speech recognition system. Mr. Parihar's research interests lie in the development of discriminative algorithms for better acoustic modeling and feature extraction. Mr. Parihar is a student member of the IEEE.
- Joseph Picone is currently a Professor in the Department of Electrical and Computer Engineering at Mississippi State University, where he also directs the Institute for Signal and Information Processing. For the past 15 years he has been promoting open source speech technology. He has previously been employed by Texas Instruments and AT&T Bell Laboratories. Dr. Picone received his Ph.D. in Electrical Engineering from the Illinois Institute of Technology in 1983. He is a Senior Member of the IEEE and a registered Professional Engineer.