Title: Reverberation-robust automatic speech recognition using missing data techniques
1Reverberation-robust automatic speech recognition
using missing data techniques
- Guy J. Brown (1), Kalle Palomäki (2) and Jon Barker (1)
- (1) Department of Computer Science, University of Sheffield, UK
- (2) Laboratory of Acoustics and Audio Signal Processing, Helsinki University of Technology, Finland
- g.brown_at_dcs.shef.ac.uk, kalle.palomaki_at_hut.fi, j.barker_at_dcs.shef.ac.uk
2Overview
- Missing data approach to ASR
- The reverberation problem
- System description
- Acoustic features
- Reverberation mask estimation
- Spectral normalisation
- Feature combination
- ASR results
- Conclusions
3Missing data approach to ASR
- Devised by Cooke et al. (2001) as a means of handling additive noise in ASR.
- Motivated by the observation that listeners are able to recognise speech even when parts of the spectrum are rendered unreliable (by noise) or removed (by filtering).
- Adapt a hidden Markov model (HMM) classifier to cope with missing or unreliable features.
- Requires that reliable acoustic features are labelled.
4The time-frequency mask
- Likelihood associated with spectral features $x_s$ cannot be computed directly.
- Partition $x_s$ into reliable part $x_{s,r}$ and unreliable part $x_{s,u}$.
- Compute an estimate of the likelihood by using $x_{s,r}$ directly and exploiting bounds on $x_{s,u}$ (bounded marginalization).
- Provide a time-frequency mask showing reliable regions.
- Mask may be binary (as used here) or real-valued.
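The bounded marginalization step can be illustrated with a minimal sketch. It assumes a single diagonal-covariance Gaussian per HMM state (practical systems typically use mixtures) and assumes that each unreliable feature is bounded between zero and its observed value. The function names and the `floor` parameter are hypothetical and are not taken from the slides.

```python
import math


def gaussian_log_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def gaussian_cdf(x, mean, var):
    """Cumulative distribution function of a univariate Gaussian."""
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


def bounded_marginal_log_likelihood(x, mask, means, variances, floor=1e-300):
    """Log-likelihood estimate for one frame under one state.

    Reliable features (mask == 1) contribute their density directly.
    Unreliable features (mask == 0) contribute the probability mass
    between 0 and the observed value (assumed bounds).
    """
    log_like = 0.0
    for x_j, m_j, mu_j, var_j in zip(x, mask, means, variances):
        if m_j:
            log_like += gaussian_log_pdf(x_j, mu_j, var_j)
        else:
            mass = gaussian_cdf(x_j, mu_j, var_j) - gaussian_cdf(0.0, mu_j, var_j)
            log_like += math.log(max(mass, floor))  # floor avoids log(0)
    return log_like
```

With an all-ones mask this reduces to the ordinary diagonal-Gaussian log-likelihood, so the mask only changes the contribution of the features labelled unreliable.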
5Example time-frequency mask
[Figure: three panels showing the rate map for clean speech, the rate map for noisy speech, and the mask]
6Characteristics of room reverberation
- Room impulse response comprises early and late (higher-order) reflections.
- Early reflections
  - Sparse, highly correlated with speech
  - May enhance intelligibility by increasing loudness of speech
  - May cause spectral deviation due to comb filtering and varying characteristics of surface absorption
- Higher-order reflections
  - Dense, poorly correlated with original speech
  - More like additive noise
7Missing data ASR and reverberation
- We use the following approach for reverberated speech:
  - Use spectral normalisation to deal with distortion caused by early reflections.
  - Treat late reverberation as additive noise, and apply standard missing data techniques.
- Identify spectral features which are relatively uncontaminated by reverberation and contain strong speech energy.
- Approach based on modulation filtering.
8System architecture
[Figure: block diagram with the components reverberated speech (input), auditory filterbank, rate map, reverberation mask, spectral normalisation, and missing data speech recogniser]
9Acoustic features
- Missing data approach requires spectral features, so that local time-frequency regions can be selected in the mask.
- Features derived from an auditory model:
  - Filterbank consisting of 32 gammatone filters, centre frequencies between 50 Hz and 3850 Hz on ERB scale.
  - Envelope of each filter extracted and smoothed by first-order lowpass filter with a time constant of 8 ms.
  - Smoothed envelope is sampled at 10 ms intervals and cube root compressed to give a rate map.
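The feature extraction steps after the gammatone filterbank can be sketched as follows. This is a rough illustration, not the authors' implementation: it assumes the Glasberg and Moore ERB-rate formula for spacing the centre frequencies, assumes a one-pole recursive smoother for the first-order lowpass filter, and takes the per-channel envelope as a given input. All function names are hypothetical.

```python
import math


def erb_rate(f_hz):
    """ERB-rate scale (Glasberg and Moore formula, assumed here)."""
    return 21.4 * math.log10(0.00437 * f_hz + 1.0)


def inverse_erb_rate(e):
    """Frequency in Hz corresponding to an ERB-rate value."""
    return (10.0 ** (e / 21.4) - 1.0) / 0.00437


def centre_frequencies(n=32, low=50.0, high=3850.0):
    """n centre frequencies equally spaced on the ERB-rate scale."""
    lo, hi = erb_rate(low), erb_rate(high)
    step = (hi - lo) / (n - 1)
    return [inverse_erb_rate(lo + k * step) for k in range(n)]


def rate_map_channel(envelope, fs, tau=0.008, frame_s=0.010):
    """Smooth one channel envelope, sample every frame_s, cube-root compress.

    envelope: non-negative envelope samples for one filterbank channel.
    fs: sampling rate of the envelope in Hz.
    """
    a = math.exp(-1.0 / (tau * fs))      # one-pole coefficient for time constant tau
    hop = int(round(frame_s * fs))       # samples per 10 ms frame
    smoothed, state = [], 0.0
    for v in envelope:
        state = (1.0 - a) * v + a * state
        smoothed.append(state)
    return [max(s, 0.0) ** (1.0 / 3.0) for s in smoothed[::hop]]
```

Applying `rate_map_channel` to each of the 32 channel envelopes and stacking the results frame by frame would give a rate map with one row per 10 ms frame.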
10Reverberation mask estimation
- Identify acoustic features that contain strong speech energy and are relatively unaffected by reverberation.
- Detect modulations in the speech range by filtering each channel of the rate map with a modulation filter, pass band between 1.5 Hz and 8.2 Hz.
- Apply threshold to modulation-filtered rate map $y_m(i,j)$:
  $$m(i,j) = \begin{cases} 1 & \text{if } y_m(i,j) > \theta(j) \\ 0 & \text{otherwise} \end{cases}$$
- where $m(i,j)$ is the mask at time $i$ and frequency $j$, and $\theta(j)$ is a frequency-dependent threshold.
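The thresholding step might look like the sketch below. It assumes that a time-frequency element is labelled reliable when its modulation-filtered value exceeds the frequency-dependent threshold. How the thresholds are chosen is not stated on the slide, so they are simply passed in. The function name is hypothetical.

```python
def reverberation_mask(filtered_rate_map, thresholds):
    """Binary mask from a modulation-filtered rate map.

    filtered_rate_map[i][j]: filtered value at time frame i, channel j.
    thresholds[j]: threshold for channel j (assumed to be given).
    Returns mask[i][j] = 1 (reliable) or 0 (unreliable).
    """
    return [
        [1 if y > thresholds[j] else 0 for j, y in enumerate(frame)]
        for frame in filtered_rate_map
    ]
```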
11Form of the modulation filter
- The modulation filter $h(n)$ has the following form: $h(n) = h_{lp}(n) * h_{diff}(n)$
- $h_{lp}(n)$ is a linear phase low pass filter which detects modulations in the speech range.
- $h_{diff}(n)$ is a differentiator which emphasizes abrupt onsets, which are likely to correspond to direct sound and early reflections.
- Overall filter $h(n)$ is band pass, with 3 dB cutoff points at 1.5 Hz and 8.2 Hz.
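A band pass filter of this general kind can be sketched by cascading a linear-phase lowpass filter with a differentiator. The design below is an assumption made for illustration: a Hamming-windowed sinc lowpass convolved with a first difference, running at the 100 Hz frame rate implied by 10 ms frames. It will not reproduce the exact 3 dB points of 1.5 Hz and 8.2 Hz, and the tap count and cutoff argument are hypothetical.

```python
import math


def lowpass_fir(cutoff_hz, fs, num_taps=51):
    """Linear-phase windowed-sinc lowpass (Hamming window), unity gain at DC."""
    mid = (num_taps - 1) / 2.0
    fc = cutoff_hz / fs
    taps = []
    for n in range(num_taps):
        t = n - mid
        ideal = 2.0 * fc if t == 0 else math.sin(2.0 * math.pi * fc * t) / (math.pi * t)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [c / total for c in taps]


def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out


def modulation_filter(fs=100.0, cutoff_hz=8.2, num_taps=51):
    """Cascade of a lowpass filter and a first-difference differentiator."""
    h_lp = lowpass_fir(cutoff_hz, fs, num_taps)
    h_diff = [1.0, -1.0]  # simple differentiator (assumed form)
    return convolve(h_lp, h_diff)


def gain(h, freq_hz, fs):
    """Magnitude response of FIR filter h at freq_hz."""
    re = sum(c * math.cos(2.0 * math.pi * freq_hz * n / fs) for n, c in enumerate(h))
    im = sum(c * math.sin(2.0 * math.pi * freq_hz * n / fs) for n, c in enumerate(h))
    return math.hypot(re, im)
```

The differentiator removes the DC component of each rate map channel and the lowpass part attenuates fast fluctuations, so the cascade responds mainly to modulations in between. Each channel would be filtered with `convolve(channel, modulation_filter())`.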
12Form of the modulation filter (contd)
13Example reverberation mask estimation
[Figure: four panels for one channel (centre frequency 103 Hz): rate map; rate map filtered by low pass part; rate map filtered by entire modulation filter; estimated reliable regions (solid line) and unreliable regions (dotted line)]
14Spectral normalisation
- Need to compensate for spectral distortion caused by room impulse response, but with partial information.
- Features known to be unreliable should not be included in the normalisation process.
- We use an utterance-based normalisation scheme.
- Normalisation factor for each channel is the mean of the L largest reliable features in that channel.
- Generally set L to M/D, where M is the number of time frames in the rate map and D is a constant parameter.
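The normalisation scheme can be sketched as below. Several details are assumptions: the value of D, the use of integer division for L = M/D, normalising by division rather than subtraction, and falling back to a factor of 1.0 when a channel has no reliable features. The function names are hypothetical.

```python
def normalisation_factors(rate_map, mask, d=10):
    """Per-channel factor: mean of the L largest reliable features, L = M // d.

    rate_map[i][j], mask[i][j]: value and reliability at frame i, channel j.
    """
    num_frames = len(rate_map)
    num_channels = len(rate_map[0])
    l = max(1, num_frames // d)
    factors = []
    for j in range(num_channels):
        reliable = sorted(
            (rate_map[i][j] for i in range(num_frames) if mask[i][j]),
            reverse=True,
        )
        top = reliable[:l]
        factor = sum(top) / len(top) if top else 1.0  # fallback is an assumption
        factors.append(factor if factor > 0.0 else 1.0)
    return factors


def normalise(rate_map, factors):
    """Divide every channel by its normalisation factor (assumed operation)."""
    return [[v / f for v, f in zip(frame, factors)] for frame in rate_map]
```

Because only reliable features enter the mean, time-frequency regions dominated by late reverberation do not bias the estimate of the channel gain.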
15Example of mask estimation
[Figure: four panels showing the rate map of unreverberated speech, the a priori mask, the rate map of reverberated speech (T60 = 1.7 s), and the reverberation mask. Axes: time (seconds) against frequency (kHz)]
16Recent work: feature combination
- MD approach requires spectral features, which are more correlated than cepstral features.
- Problems with modeling spectral data may reduce baseline ASR performance, offsetting the gain in robustness.
- Compute an estimate of the likelihood for spectral features $x_s$ with missing data.
- Compute likelihood $f(x_c \mid C)$ for (complete) cepstral features.
- Combine likelihoods from the two feature streams as a weighted average in the log domain.
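A weighted average in the log domain can be written as a convex combination of the two stream log-likelihoods. The sketch below assumes a single stream weight between 0 and 1. The weight value and the function name are hypothetical, since the slide does not give them.

```python
def combined_log_likelihood(log_f_spectral, log_f_cepstral, weight=0.5):
    """Weighted average of two stream log-likelihoods.

    weight applies to the spectral (missing data) stream and
    1 - weight to the cepstral stream.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * log_f_spectral + (1.0 - weight) * log_f_cepstral
```

In the linear domain this corresponds to a weighted geometric mean of the two likelihoods, so a weight of 1 recovers the missing data stream alone and a weight of 0 recovers the cepstral stream alone.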
17Evaluation
- Compare against missing data using a priori masks
  - Measure difference between each element in the rate map for clean and reverberation-contaminated speech.
  - Only set mask elements to unity if this difference lies within a threshold value (tuned for each condition).
- Compare against Kingsbury's HMM-MLP recogniser
  - Hidden Markov model / multilayer perceptron architecture
  - Uses PLP and modulation-filtered spectrogram features.
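The a priori mask construction described above can be sketched as follows. It assumes that "difference" means the absolute difference between corresponding rate map elements and that the two rate maps are time-aligned and equal in size. The function name is hypothetical.

```python
def a_priori_mask(clean, reverberated, threshold):
    """Mask element is 1 where the reverberated rate map stays close to the clean one.

    clean[i][j], reverberated[i][j]: rate map values at frame i, channel j.
    threshold: maximum allowed absolute difference (tuned per condition).
    """
    return [
        [1 if abs(r - c) < threshold else 0 for c, r in zip(clean_frame, rev_frame)]
        for clean_frame, rev_frame in zip(clean, reverberated)
    ]
```

Since it needs the clean signal, a mask of this kind is only available in evaluation and serves as a reference point for the estimated masks.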
18Results
- Test set of 1001 utterances drawn from Aurora 2 corpus.
- Connected digits (1-9 plus "oh" and "zero").
- Missing data HMM system trained on rate maps and deltas.
- Reverberated using recorded room impulse responses.
19Conclusions
- Missing data techniques can be used to tackle the problem of reverberation.
- Detection of modulations in the speech range is a key element of our approach.
- Advantage of the missing data framework: different mask estimation rules can be selected dynamically to deal with varying acoustic environments.
  - May be important for mobile devices.
- Experiments with a priori masks suggest good potential; most recent results are superior to Kingsbury's system.