Title: ROBUST SPEECH RECOGNITION: Introduction to Microphone Arrays
1. ROBUST SPEECH RECOGNITION: Introduction to Microphone Arrays
- Richard Stern
- (with Mike Seltzer, Tom Sullivan, and
- Evandro Gouvea)
- Robust Speech Recognition Group
- Carnegie Mellon University
- Telephone (412) 268-2535
- Fax (412) 268-3890
- rms@cs.cmu.edu
- http://www.cs.cmu.edu/rms
- Short Course at UNAM
- August 14-17, 2007
2. Introduction
- The use of arrays of microphones can improve speech recognition accuracy in noise
- Outline of this talk
  - Review of classical approaches to microphone array processing
    - Delay-and-sum beamforming
    - Traditional adaptive filtering
    - Physiologically motivated processing
  - Description and discussion of selected recent results
    - Matched-filter array processing (Rutgers)
    - Array processing based on speech features (CMU)
3. OVERVIEW OF SPEECH RECOGNITION
[Block diagram: speech waveform -> feature extraction -> speech features -> decision-making procedure -> phoneme hypotheses]
- Major functional components
  - Signal processing to extract features from speech waveforms
  - Comparison of features to pre-stored templates
- Important design choices
  - Choice of features
  - Specific method of comparing features to stored templates
4. Why use microphone arrays?
- Microphone arrays can provide directional response, accepting speech from some directions but suppressing others
5. Another reason why microphone arrays help
- Microphone arrays can focus attention on the direct field in a reverberant environment
6. Three classical types of microphone arrays
- Delay-and-sum beamforming and its many variants
- Adaptive arrays based on mean-square suppression of noise
- Physiologically and perceptually motivated approaches to multiple microphones
7. Delay-and-sum beamforming
- Simple processing based on equalizing delays to the sensors
- High directivity can be achieved with many sensors
[Block diagram: each sensor k (k = 1..K) passes through a delay z^-nk; the K delayed signals are summed to form the output]
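As an illustration of the diagram above, here is a minimal time-domain delay-and-sum sketch in NumPy. The integer steering delays and the two-sensor example data are assumptions for illustration, not details from the talk:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming with integer-sample steering delays.

    channels: (K, T) array, one row per microphone signal
    delays:   length-K list of nonnegative integer sample delays that
              time-align the look direction across sensors
    """
    K, T = channels.shape
    out = np.zeros(T)
    for x, n in zip(channels, delays):
        # z^-n: shift each sensor signal by its steering delay
        out[n:] += x[:T - n] if n > 0 else x
    return out / K  # average so the look-direction gain is unity

# Example: two sensors, where the second lags the first by 3 samples;
# delaying sensor 1 by 3 samples re-aligns the source across channels
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
x = np.stack([s, np.concatenate([np.zeros(3), s[:-3]])])
y = delay_and_sum(x, [3, 0])
```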
8. The physics of delay-and-sum beamforming
[Geometry: for sensors spaced d apart, a plane wave arriving from angle θ travels an extra distance d sin(θ) between adjacent sensors]
9. The physics of delay-and-sum beamforming
- If the sensor outputs are simply added together, the look direction is θ = 0
- The look direction can be steered to other directions by inserting electronic delays to compensate for the physical ones
- For a look direction of θ = 0, the net output of an N-sensor array in response to a plane wave from angle θ at frequency f has the normalized magnitude

    |H(θ)| = | sin(N π f d sin(θ) / c) / (N sin(π f d sin(θ) / c)) |

  where d is the sensor spacing and c is the speed of sound
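A small NumPy sketch of this array factor, using the example parameters of the following slides (d = 8.62 cm, N = 9); the speed of sound c = 343 m/s is an assumption, not a figure from the talk:

```python
import numpy as np

def array_factor(theta, f, d=0.0862, N=9, c=343.0):
    """Normalized delay-and-sum response magnitude, look direction 0."""
    psi = np.pi * f * d * np.sin(theta) / c
    num, den = np.sin(N * psi), N * np.sin(psi)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = np.abs(num / den)
    # At psi = 0 the expression is 0/0 but the response is exactly 1
    return np.where(np.abs(den) < 1e-12, 1.0, H)

# The main lobe narrows as f grows: the first null is where
# N * psi = pi, i.e. sin(theta) = c / (N f d)
for f in [1000, 2000, 4000]:
    s = 343.0 / (9 * f * 0.0862)
    print(f"f = {f} Hz: first null at {np.degrees(np.arcsin(s)):.1f} degrees")
```

This matches the trend in the beam-pattern examples that follow: the beam becomes progressively narrower as frequency increases.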
10. Examples of delay-and-sum beams
11. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 1000 Hz
12. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 1500 Hz
13. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 2000 Hz
14. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 2500 Hz
15. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 3000 Hz
16. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 3500 Hz
17. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 4000 Hz
18. Nested microphone arrays (Flanagan et al.)
- 5-element low-frequency array
19. Nested microphone arrays
- 5-element mid-frequency array
20. Nested microphone arrays
- 5-element high-frequency array
21. Combined nested array (Flanagan et al.)
- Three-band quasi-constant-beamwidth array
[Block diagram: the low-, mid-, and high-frequency subarrays are combined through lowpass, bandpass, and highpass filters, respectively]
22. Another delay-and-sum issue: spatial aliasing
- d = 8.62 cm, N = 9, f = 4000 Hz
23. Another delay-and-sum issue: spatial aliasing
- d = 8.62 cm, N = 9, f = 5000 Hz
24. Another delay-and-sum issue: spatial aliasing
- d = 8.62 cm, N = 9, f = 6000 Hz
25. Preventing spatial aliasing
- Spatial aliasing occurs when adjacent sensors receive the input more than half a period apart
- The spatial Nyquist constraint depends on both frequency and arrival angle
- To prevent spatial aliasing we require that the maximum frequency satisfy (see the numeric check below)

    f_max < c / (2 d sin(θ))

  where d is the sensor spacing, θ the arrival angle, and c the speed of sound
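A quick numeric check of this constraint for the example spacing d = 8.62 cm, again assuming c = 343 m/s:

```python
# Worst case is endfire arrival (theta = 90 degrees, sin(theta) = 1)
c, d = 343.0, 0.0862
for sin_theta in [1.0, 0.5]:
    f_max = c / (2 * d * sin_theta)
    print(f"sin(theta) = {sin_theta}: f_max = {f_max:.0f} Hz")
# Endfire arrivals alias above roughly 2 kHz; sources nearer broadside
# tolerate proportionally higher frequencies, which is consistent with
# the grating lobes in the 4-6 kHz example plots above
```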
26. Filter-and-sum beamforming
- The input filters can (in principle) apply delays that vary with frequency, to ameliorate the frequency dependence of beamforming
- The filters can also compensate for channel characteristics
[Block diagram: each sensor k (k = 1..K) passes through its own filter; the K filter outputs are summed to form the output]
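A minimal filter-and-sum sketch in NumPy. The filter length and the steering delays in the example are assumptions for illustration:

```python
import numpy as np

def filter_and_sum(channels, filters):
    """Filter-and-sum beamforming: y[n] = sum_k (h_k * x_k)[n].

    channels: (K, T) array of microphone signals
    filters:  (K, L) array of per-channel FIR filter taps h_k
    """
    T = channels.shape[1]
    y = np.zeros(T)
    for x, h in zip(channels, filters):
        # full convolution, truncated to the input length
        y += np.convolve(x, h)[:T]
    return y

# If each filter is a pure integer delay (a unit impulse at the
# steering delay), this reduces exactly to delay-and-sum
K, T, L = 4, 1000, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((K, T))
h = np.zeros((K, L))
h[np.arange(K), [0, 2, 4, 6]] = 1.0  # hypothetical steering delays
y = filter_and_sum(x, h)
```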
27. Compensated delay-and-sum beamforming
- A filter is added after the summation to compensate for the filtering effects of delay-and-sum beamforming
[Block diagram: each sensor k passes through a delay z^-nk; the delayed signals are summed and then passed through a single compensation filter to form the output]
28. Sample recognition results using compensated delay-and-sum
- The Flanagan array with CDCN does improve accuracy
29. Traditional adaptive arrays
30. Traditional adaptive arrays
- Large established literature
- Use MMSE techniques to establish beams in the look direction and nulls toward additive noise sources
- Generally do not perform well in reverberant environments
  - Signal cancellation
  - Effective impulse response longer than the length of the filter
- Techniques to circumvent signal cancellation
  - Switching the nulling mechanism off and on according to the presence or absence of speech (Van Compernolle)
  - Switching off adaptation in reverberant environments
  - Use of alternate adaptation algorithms
31. Array processing based on human binaural hearing
32. Array processing based on human binaural hearing
- Motivation: the human binaural system is known to have excellent immunity to additive noise and reverberation
- Binaural phenomena of interest
  - Cocktail-party effect
  - Precedence effect
- Problems with binaural models
  - Correlation produces signal distortion from rectification and squaring
  - Precedence-effect processing defeats echoes but also suppresses desired signals
  - Greatest challenge: decoding useful information from the cross-correlation display
33. Correlation-based system motivated by binaural hearing
34. Vowel representations using correlation processing
- Reconstructed features of the vowel /a/
  - Two inputs, zero delay
  - Two inputs, 120-ms delay
  - Eight inputs, 120-ms delay
35. So what do things sound like on the cross-correlation display?
- Signals combined with ITDs of 0 and 0.5 ms
  - Individual speech signals
  - Combined speech signals
  - Signals separated by delay-and-sum beamforming
  - Signals separated by the cross-correlation display
  - Signals separated by additional correlations across frequency at a common ITD (straightness weighting)
36. Matched-filter beamforming (Rutgers)
- Goal: compensation for the delay and dispersion introduced in reverberant environments
[Plots: a 600-ms room impulse response and its autocorrelation function]
37. Matched-filter beamforming procedure
- Measure or estimate the impulse response from the source to each sensor
- Convolve each input with the time-reversed impulse response (producing an autocorrelation function)
- Sum the outputs of the channels together
- Rationale: the main lobes of the autocorrelation functions should reinforce while the side lobes cancel
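A sketch of this procedure in NumPy, under the assumption that the per-channel impulse responses are already measured or estimated; this illustrates the idea rather than the exact Rutgers implementation:

```python
import numpy as np

def matched_filter_beamform(channels, impulse_responses):
    """Matched-filter array processing.

    channels:          (K, T) microphone signals x_k = s * h_k
    impulse_responses: list of K measured/estimated room impulse
                       responses h_k from the source to each sensor
    """
    T = channels.shape[1]
    y = np.zeros(T)
    for x, h in zip(channels, impulse_responses):
        # Convolving x_k with the time-reversed h_k turns the effective
        # channel into the autocorrelation of h_k: its main lobe adds
        # coherently across sensors while the side lobes tend to cancel.
        # The slice picks out lags 0..T-1, time-aligned with the source.
        y += np.convolve(x, h[::-1])[len(h) - 1:len(h) - 1 + T]
    return y / channels.shape[0]
```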
38. Optimizing microphone arrays for speech recognition features
- The objective of typical microphone array algorithms has been signal enhancement rather than speech recognition accuracy
[Diagram: N microphones feed an array processor]
39. Automatic Speech Recognition (ASR)
- Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said
- The objective is accurate recognition, a statistical pattern classification problem
40. ASR feature extraction
- Convert a segment of speech into a compact set of descriptive features, as sketched below
[Pipeline: 400-sample speech frame -> 512-point FFT -> 40-channel Mel spectrum -> log -> DCT -> 13 features -> ASR]
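A minimal NumPy/SciPy sketch of this pipeline: 400-sample frame, 512-point FFT, 40 triangular Mel filters, log, and a DCT down to 13 features. The sample rate (16 kHz), window choice, and Mel-scale formula are assumptions, not figures from the talk:

```python
import numpy as np
from scipy.fftpack import dct

def log_mel_features(frame, sr=16000, n_fft=512, n_mels=40, n_ceps=13):
    # Power spectrum of a windowed 400-sample frame, zero-padded to 512
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Triangular Mel filterbank between 0 Hz and sr/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    log_mel = np.log(fb @ spec + 1e-10)         # 40 log Mel spectral values
    return dct(log_mel, norm='ortho')[:n_ceps]  # 13 cepstral features

features = log_mel_features(np.random.randn(400))
```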
41. Speech recognition with microphone arrays
- Recognition with microphone arrays has traditionally been performed by gluing the two systems together
- The systems have different objectives
- Neither system exploits information present in the other
42. Array processing based on speech features
- Develop an array processing scheme targeted at improved speech recognition performance, without regard to conventional array processing objective criteria
[Diagram: the microphone signals feed a combined array processing + feature extraction module that drives the ASR system]
43. Choosing array weights based on speech features
- We want an objective function that uses parameters directly related to recognition
[Diagram: each microphone signal x_k passes through a filter h_k and delay tau_k (k = 1..M); the channels are summed to give y, whose features M_y (computed by the feature extraction module FE) are compared with the clean-speech features M_s; the error e = M_s - M_y is minimized]
44. An objective function for mic arrays based on speech recognition
- Define Q as the sum of the squared errors between the log Mel spectra of the clean speech s and the noisy speech y (a sketch follows below):

    Q = sum over f, l of ( M_s(f, l) - M_y(f, l) )^2

  where y is the output of a filter-and-sum microphone array and M(f, l) is the l-th log Mel spectral value in frame f
- M_y(f, l) is a function of the signals captured by the array and of the filter parameters associated with each microphone
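The objective itself is simple to state in code. Here `log_mel` stands for a frame-level routine returning the 40 log Mel spectral values (the pipeline of slide 40 before the final DCT); it is a hypothetical helper, passed in for clarity:

```python
import numpy as np

def objective_Q(clean_frames, array_frames, log_mel):
    """Sum of squared log Mel spectral errors over frames f and bands l.

    clean_frames, array_frames: (F, frame_len) arrays of aligned frames
    log_mel: callable mapping one frame to its log Mel spectral vector
    """
    Ms = np.array([log_mel(x) for x in clean_frames])  # M_s(f, l)
    My = np.array([log_mel(y) for y in array_frames])  # M_y(f, l)
    return np.sum((Ms - My) ** 2)                      # Q
```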
45. Calibration of microphone arrays for ASR
- Calibration of the filter-and-sum microphone array
  - The user speaks an utterance with a known transcription
  - With or without a close-talking microphone
- Derive the optimal set of filters
  - Minimize the objective function with respect to the filter coefficients
  - Since the objective function is non-linear, use iterative gradient-based methods
- Apply the filters to all future speech
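A toy sketch of the gradient-based calibration step: adjust the per-channel FIR filters to reduce Q. The finite-difference gradient and plain gradient-descent update here are purely illustrative stand-ins; a real system would use analytic gradients and a proper optimizer:

```python
import numpy as np

def calibrate_filters(Q_of_filters, K, L=30, iters=50, lr=1e-3):
    """Minimize Q over a (K, L) bank of FIR filter taps.

    Q_of_filters: callable mapping a (K, L) filter array to the scalar Q
                  (a closure over the calibration utterance and targets)
    """
    h = np.zeros((K, L))
    h[:, 0] = 1.0 / K                    # start from simple averaging
    eps = 1e-4
    for _ in range(iters):
        base = Q_of_filters(h)
        grad = np.zeros_like(h)
        for idx in np.ndindex(h.shape):  # finite-difference gradient
            hp = h.copy()
            hp[idx] += eps
            grad[idx] = (Q_of_filters(hp) - base) / eps
        h -= lr * grad                   # gradient-descent step
    return h
```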
46. Calibration using a close-talking recording
- Given a close-talking microphone recording of the calibration utterance, derive an optimal filter for each channel to improve recognition
[Diagram: the close-talking signal s(n) passes through feature extraction (FE) to give the target features M_s; each array channel is filtered by h_k(n), delayed by tau_k, and summed, and the features M_y of the sum are compared with M_s by the optimizer (OPT), which updates the filters feeding the ASR system]
47. Multi-microphone data sets
- TMS
  - Recorded in the CMU auditory lab
  - Approx. 5 m x 5 m x 3 m
  - Noise from computer fans, blowers, etc.
  - Isolated letters and digits, keywords
  - 10 speakers x 14 utterances = 140 utterances
  - Each utterance has a close-talking mic control waveform
[Array diagram, labeled 7 cm and 1 m]
48. Multi-microphone data sets (2)
- WSJ off-axis noise source
  - Room simulation created using the image method
  - 5 m x 4 m x 3 m
  - 200-ms reverberation time
  - WGN source @ 5 dB SNR
- WSJ test set
  - 5K-word vocabulary
  - 10 speakers x 65 utterances = 650 utterances
  - Original recordings used as close-talking control waveforms
[Array diagram, labeled 25 cm, 2 m, 15 cm, and 1 m]
49. Results
- TMS data set, WSJ0 WGN point-source simulation
- Constructed 50-point filters from a single calibration utterance
- Applied the filters to all test utterances
50. Calibration without a close-talking microphone
- Obtain an initial waveform estimate using a conventional array processing technique (e.g. delay-and-sum)
- Use the transcription and the recognizer to estimate the sequence of target clean log Mel spectra
- Optimize the filter parameters as before
51. Calibration w/o close-talking microphone (2)
- Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence
[Diagram: the microphone signals are delayed and summed to s(n), passed through feature extraction (FE), and force-aligned (FALIGN) against the known transcription to produce the HMM state sequence]
52. Calibration w/o close-talking microphone (3)
- Extract the means from the single-Gaussian HMMs of the estimated state sequence
- Since the models have been trained on clean speech, use these means as the target clean speech feature vectors
[Diagram: the HMM state means are converted back to log Mel spectra via an IDCT]
53. Calibration w/o close-talking microphone (4)
- Use the estimated clean speech feature vectors to optimize the filters as before
[Diagram: as in slide 46, but with the estimated target features driving the optimizer (OPT) that updates the per-channel filters h_k(n)]
54. Results
- TMS data set, WSJ0 WGN point-source simulation
- Constructed 50-point filters from the calibration utterance
- Applied the filters to all utterances
55. Results (2)
- WER vs. SNR for WSJ WGN
- Constructed 50-point filters from the calibration utterance using the transcription only
- Applied the filters to all utterances
56. Performance in highly reverberant rooms
- Comparison of single-channel and delay-and-sum beamforming (WSJ data passed through measured impulse responses)
[WER plot: single channel vs. delay-and-sum]
57. Performance in highly reverberant rooms
- Impact of matched-filter and feature-based array processing
- The matched-filter array used 1040 points per channel, with perfect knowledge of the channel characteristics
- The feature-based array used 30 points per channel, with imperfect channel knowledge
58. Subband processing using optimized features (Seltzer)
- Subband processing can address some of the effects of reverberation; the subband signals have more desirable narrowband signal properties (a sketch follows below)
  1. Divide the signal into independent subbands
  2. Downsample
  3. Process the subbands independently
  4. Upsample
  5. Resynthesize the full signal from the subbands
[Filter-bank diagram: the input x(n) is split by analysis filters H_1(z)..H_L(z), downsampled, processed by per-band filters F_1(z)..F_L(z), upsampled, passed through synthesis filters G_1(z)..G_L(z), and summed to give y(n)]
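An illustrative two-band version of the five steps above. This is a sketch, not a perfect-reconstruction filter-bank design; the band split, filter length, and the identity "processing" step are all assumptions:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def subband_process(x, M=2, ntaps=65):
    h = [firwin(ntaps, 0.5),                    # H1: lowpass analysis
         firwin(ntaps, 0.5, pass_zero=False)]   # H2: highpass analysis
    y = np.zeros_like(x)
    for hk in h:
        sub = lfilter(hk, 1.0, x)[::M]          # steps 1-2: filter, downsample
        sub = sub                               # step 3: per-band processing (identity here)
        up = np.zeros(len(sub) * M)             # step 4: upsample by zero insertion
        up[::M] = sub
        y += M * lfilter(hk, 1.0, up)[:len(x)]  # step 5: synthesis filter and sum
    return y

x = np.random.randn(4000)
y = subband_process(x)  # approximately a delayed copy of x with this trivial processing
```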
59. Subband results with the reverberated WSJ task
- WER for all speakers, compared to delay-and-sum processing
60. Summary
- Microphone array processing is effective, although not yet in widespread use except for simple delay-and-sum beamforming
- Despite many developments in signal processing, actual applications to speech are based on very simple concepts
- Major problems and issues
  - Maintaining good performance in reverberation
  - Real-time operation with time-varying environments and speakers