Title: ROBUST SPEECH RECOGNITION: Introduction to Microphone Arrays
1. ROBUST SPEECH RECOGNITION: Introduction to Microphone Arrays
- Richard Stern
- (with Mike Seltzer, Tom Sullivan, and
- Evandro Gouvea)
- Robust Speech Recognition Group
- Carnegie Mellon University
- Telephone (412) 268-2535
- Fax (412) 268-3890
- rms@cs.cmu.edu
- http://www.cs.cmu.edu/rms
- Short Course at UNAM
- August 14-17, 2007
2. Introduction
- The use of arrays of microphones can improve speech recognition accuracy in noise
- Outline of this talk
  - Review of classical approaches to microphone array processing
    - Delay-and-sum beamforming
    - Traditional adaptive filtering
    - Physiologically motivated processing
  - Description and discussion of selected recent results
    - Matched-filter array processing (Rutgers)
    - Array processing based on speech features (CMU)
3. OVERVIEW OF SPEECH RECOGNITION
[Block diagram: speech waveform -> feature extraction -> speech features -> decision-making procedure -> phoneme hypotheses]
- Major functional components
  - Signal processing to extract features from speech waveforms
  - Comparison of features to pre-stored templates
- Important design choices
  - Choice of features
  - Specific method of comparing features to stored templates
4. Why use microphone arrays?
- Microphone arrays can provide directional response, accepting speech from some directions but suppressing others
5. Another reason why microphone arrays help
- Microphone arrays can focus attention on the direct field in a reverberant environment
6. Three classical types of microphone arrays
- Delay-and-sum beamforming and its many variants
- Adaptive arrays based on mean-square suppression of noise
- Physiologically and perceptually motivated approaches to multiple microphones
7. Delay-and-sum beamforming
- Simple processing based on equalizing delays to the sensors
- High directivity can be achieved with many sensors
[Block diagram: each sensor k (k = 1..K) passes through a delay z^-nk; the K delayed signals are summed to form the output]
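As an illustration of the diagram above, here is a minimal time-domain delay-and-sum sketch in NumPy. The integer steering delays and the two-sensor example data are assumptions for illustration, not details from the talk:

```python
import numpy as np

def delay_and_sum(channels, delays):
    """Delay-and-sum beamforming with integer-sample steering delays.

    channels: (K, T) array, one row per microphone signal
    delays:   length-K list of nonnegative integer sample delays that
              time-align the look direction across sensors
    """
    K, T = channels.shape
    out = np.zeros(T)
    for x, n in zip(channels, delays):
        # z^-n: shift each sensor signal by its steering delay
        out[n:] += x[:T - n] if n > 0 else x
    return out / K  # average so the look-direction gain is unity

# Example: two sensors, where the second lags the first by 3 samples;
# delaying sensor 1 by 3 samples re-aligns the source across channels
rng = np.random.default_rng(0)
s = rng.standard_normal(1000)
x = np.stack([s, np.concatenate([np.zeros(3), s[:-3]])])
y = delay_and_sum(x, [3, 0])
```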
8. The physics of delay-and-sum beamforming
[Geometry: for sensors spaced d apart, a plane wave arriving from angle θ travels an extra distance d sin(θ) between adjacent sensors]
9. The physics of delay-and-sum beamforming
- If the sensor outputs are simply added together, the look direction is θ = 0
- The look direction can be steered to other directions by inserting electronic delays to compensate for the physical ones
- For a look direction of θ = 0, the net output of an N-sensor array in response to a plane wave from angle θ at frequency f has the normalized magnitude

    |H(θ)| = | sin(N π f d sin(θ) / c) / (N sin(π f d sin(θ) / c)) |

  where d is the sensor spacing and c is the speed of sound
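A small NumPy sketch of this array factor, using the example parameters of the following slides (d = 8.62 cm, N = 9); the speed of sound c = 343 m/s is an assumption, not a figure from the talk:

```python
import numpy as np

def array_factor(theta, f, d=0.0862, N=9, c=343.0):
    """Normalized delay-and-sum response magnitude, look direction 0."""
    psi = np.pi * f * d * np.sin(theta) / c
    num, den = np.sin(N * psi), N * np.sin(psi)
    with np.errstate(divide="ignore", invalid="ignore"):
        H = np.abs(num / den)
    # At psi = 0 the expression is 0/0 but the response is exactly 1
    return np.where(np.abs(den) < 1e-12, 1.0, H)

# The main lobe narrows as f grows: the first null is where
# N * psi = pi, i.e. sin(theta) = c / (N f d)
for f in [1000, 2000, 4000]:
    s = 343.0 / (9 * f * 0.0862)
    print(f"f = {f} Hz: first null at {np.degrees(np.arcsin(s)):.1f} degrees")
```

This matches the trend in the beam-pattern examples that follow: the beam becomes progressively narrower as frequency increases.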
10. Examples of delay-and-sum beams
11. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 1000 Hz
12. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 1500 Hz
13. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 2000 Hz
14. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 2500 Hz
15. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 3000 Hz
16. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 3500 Hz
17. Examples of delay-and-sum beams
- d = 8.62 cm, N = 9, f = 4000 Hz
18. Nested microphone arrays (Flanagan et al.)
- 5-element low-frequency array
19. Nested microphone arrays
- 5-element mid-frequency array
20. Nested microphone arrays
- 5-element high-frequency array
21. Combined nested array (Flanagan et al.)
- Three-band quasi-constant-beamwidth array
[Block diagram: the low-, mid-, and high-frequency subarrays are combined through lowpass, bandpass, and highpass filters, respectively]
22. Another delay-and-sum issue: spatial aliasing
- d = 8.62 cm, N = 9, f = 4000 Hz
23. Another delay-and-sum issue: spatial aliasing
- d = 8.62 cm, N = 9, f = 5000 Hz
24. Another delay-and-sum issue: spatial aliasing
- d = 8.62 cm, N = 9, f = 6000 Hz
25. Preventing spatial aliasing
- Spatial aliasing occurs when adjacent sensors receive the input more than half a period apart
- The spatial Nyquist constraint depends on both frequency and arrival angle
- To prevent spatial aliasing we require that the maximum frequency satisfy (see the numeric check below)

    f_max < c / (2 d sin(θ))

  where d is the sensor spacing, θ the arrival angle, and c the speed of sound
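A quick numeric check of this constraint for the example spacing d = 8.62 cm, again assuming c = 343 m/s:

```python
# Worst case is endfire arrival (theta = 90 degrees, sin(theta) = 1)
c, d = 343.0, 0.0862
for sin_theta in [1.0, 0.5]:
    f_max = c / (2 * d * sin_theta)
    print(f"sin(theta) = {sin_theta}: f_max = {f_max:.0f} Hz")
# Endfire arrivals alias above roughly 2 kHz; sources nearer broadside
# tolerate proportionally higher frequencies, which is consistent with
# the grating lobes in the 4-6 kHz example plots above
```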
26. Filter-and-sum beamforming
- The input filters can (in principle) apply delays that vary with frequency, to ameliorate the frequency dependence of beamforming
- The filters can also compensate for channel characteristics
[Block diagram: each sensor k (k = 1..K) passes through its own filter; the K filter outputs are summed to form the output]
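A minimal filter-and-sum sketch in NumPy. The filter length and the steering delays in the example are assumptions for illustration:

```python
import numpy as np

def filter_and_sum(channels, filters):
    """Filter-and-sum beamforming: y[n] = sum_k (h_k * x_k)[n].

    channels: (K, T) array of microphone signals
    filters:  (K, L) array of per-channel FIR filter taps h_k
    """
    T = channels.shape[1]
    y = np.zeros(T)
    for x, h in zip(channels, filters):
        # full convolution, truncated to the input length
        y += np.convolve(x, h)[:T]
    return y

# If each filter is a pure integer delay (a unit impulse at the
# steering delay), this reduces exactly to delay-and-sum
K, T, L = 4, 1000, 16
rng = np.random.default_rng(0)
x = rng.standard_normal((K, T))
h = np.zeros((K, L))
h[np.arange(K), [0, 2, 4, 6]] = 1.0  # hypothetical steering delays
y = filter_and_sum(x, h)
```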
27. Compensated delay-and-sum beamforming
- A filter is added after the summation to compensate for the filtering effects of delay-and-sum beamforming
[Block diagram: each sensor k passes through a delay z^-nk; the delayed signals are summed and then passed through a single compensation filter to form the output]
28. Sample recognition results using compensated delay-and-sum
- The Flanagan array with CDCN does improve accuracy
29. Traditional adaptive arrays
30. Traditional adaptive arrays
- Large established literature
- Use MMSE techniques to establish beams in the look direction and nulls toward additive noise sources
- Generally do not perform well in reverberant environments
  - Signal cancellation
  - Effective impulse response longer than the length of the filter
- Techniques to circumvent signal cancellation
  - Switching the nulling mechanism off and on according to the presence or absence of speech (Van Compernolle)
  - Switching off adaptation in reverberant environments
  - Use of alternate adaptation algorithms
31. Array processing based on human binaural hearing
32. Array processing based on human binaural hearing
- Motivation: the human binaural system is known to have excellent immunity to additive noise and reverberation
- Binaural phenomena of interest
  - Cocktail-party effect
  - Precedence effect
- Problems with binaural models
  - Correlation produces signal distortion from rectification and squaring
  - Precedence-effect processing defeats echoes but also suppresses desired signals
  - Greatest challenge: decoding useful information from the cross-correlation display
33. Correlation-based system motivated by binaural hearing
34. Vowel representations using correlation processing
- Reconstructed features of the vowel /a/
  - Two inputs, zero delay
  - Two inputs, 120-ms delay
  - Eight inputs, 120-ms delay
35. So what do things sound like on the cross-correlation display?
- Signals combined with ITDs of 0 and 0.5 ms
  - Individual speech signals
  - Combined speech signals
  - Signals separated by delay-and-sum beamforming
  - Signals separated by the cross-correlation display
  - Signals separated by additional correlations across frequency at a common ITD (straightness weighting)
36. Matched-filter beamforming (Rutgers)
- Goal: compensation for the delay and dispersion introduced in reverberant environments
[Plots: a 600-ms room impulse response and its autocorrelation function]
37. Matched-filter beamforming procedure
- Measure or estimate the impulse response from the source to each sensor
- Convolve each input with the time-reversed impulse response (producing an autocorrelation function)
- Sum the outputs of the channels together
- Rationale: the main lobes of the autocorrelation functions should reinforce while the side lobes cancel
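A sketch of this procedure in NumPy, under the assumption that the per-channel impulse responses are already measured or estimated; this illustrates the idea rather than the exact Rutgers implementation:

```python
import numpy as np

def matched_filter_beamform(channels, impulse_responses):
    """Matched-filter array processing.

    channels:          (K, T) microphone signals x_k = s * h_k
    impulse_responses: list of K measured/estimated room impulse
                       responses h_k from the source to each sensor
    """
    T = channels.shape[1]
    y = np.zeros(T)
    for x, h in zip(channels, impulse_responses):
        # Convolving x_k with the time-reversed h_k turns the effective
        # channel into the autocorrelation of h_k: its main lobe adds
        # coherently across sensors while the side lobes tend to cancel.
        # The slice picks out lags 0..T-1, time-aligned with the source.
        y += np.convolve(x, h[::-1])[len(h) - 1:len(h) - 1 + T]
    return y / channels.shape[0]
```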
38. Optimizing microphone arrays for speech recognition features
- The objective of typical microphone array algorithms has been signal enhancement rather than speech recognition accuracy
[Diagram: N microphones feed an array processor]
39. Automatic Speech Recognition (ASR)
- Parameterize the speech signal and compare the parameter sequence to statistical models of speech sound units to hypothesize what a user said
- The objective is accurate recognition, a statistical pattern classification problem
40. ASR feature extraction
- Convert a segment of speech into a compact set of descriptive features, as sketched below
[Pipeline: 400-sample speech frame -> 512-point FFT -> 40-channel Mel spectrum -> log -> DCT -> 13 features -> ASR]
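A minimal NumPy/SciPy sketch of this pipeline: 400-sample frame, 512-point FFT, 40 triangular Mel filters, log, and a DCT down to 13 features. The sample rate (16 kHz), window choice, and Mel-scale formula are assumptions, not figures from the talk:

```python
import numpy as np
from scipy.fftpack import dct

def log_mel_features(frame, sr=16000, n_fft=512, n_mels=40, n_ceps=13):
    # Power spectrum of a windowed 400-sample frame, zero-padded to 512
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2

    # Triangular Mel filterbank between 0 Hz and sr/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(mel(0.0), mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * edges / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    log_mel = np.log(fb @ spec + 1e-10)         # 40 log Mel spectral values
    return dct(log_mel, norm='ortho')[:n_ceps]  # 13 cepstral features

features = log_mel_features(np.random.randn(400))
```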
41. Speech recognition with microphone arrays
- Recognition with microphone arrays has traditionally been performed by gluing the two systems together
- The systems have different objectives
- Neither system exploits information present in the other
42. Array processing based on speech features
- Develop an array processing scheme targeted at improved speech recognition performance, without regard to conventional array processing objective criteria
[Diagram: the microphone signals feed a combined array processing + feature extraction module that drives the ASR system]
43. Choosing array weights based on speech features
- We want an objective function that uses parameters directly related to recognition
[Diagram: each microphone signal x_k passes through a filter h_k and delay tau_k (k = 1..M); the channels are summed to give y, whose features M_y (computed by the feature extraction module FE) are compared with the clean-speech features M_s; the error e = M_s - M_y is minimized]
44. An objective function for mic arrays based on speech recognition
- Define Q as the sum of the squared errors between the log Mel spectra of the clean speech s and the noisy speech y (a sketch follows below):

    Q = sum over f, l of ( M_s(f, l) - M_y(f, l) )^2

  where y is the output of a filter-and-sum microphone array and M(f, l) is the l-th log Mel spectral value in frame f
- M_y(f, l) is a function of the signals captured by the array and of the filter parameters associated with each microphone
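The objective itself is simple to state in code. Here `log_mel` stands for a frame-level routine returning the 40 log Mel spectral values (the pipeline of slide 40 before the final DCT); it is a hypothetical helper, passed in for clarity:

```python
import numpy as np

def objective_Q(clean_frames, array_frames, log_mel):
    """Sum of squared log Mel spectral errors over frames f and bands l.

    clean_frames, array_frames: (F, frame_len) arrays of aligned frames
    log_mel: callable mapping one frame to its log Mel spectral vector
    """
    Ms = np.array([log_mel(x) for x in clean_frames])  # M_s(f, l)
    My = np.array([log_mel(y) for y in array_frames])  # M_y(f, l)
    return np.sum((Ms - My) ** 2)                      # Q
```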
45. Calibration of microphone arrays for ASR
- Calibration of the filter-and-sum microphone array
  - The user speaks an utterance with a known transcription
  - With or without a close-talking microphone
- Derive the optimal set of filters
  - Minimize the objective function with respect to the filter coefficients
  - Since the objective function is non-linear, use iterative gradient-based methods
- Apply the filters to all future speech
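A toy sketch of the gradient-based calibration step: adjust the per-channel FIR filters to reduce Q. The finite-difference gradient and plain gradient-descent update here are purely illustrative stand-ins; a real system would use analytic gradients and a proper optimizer:

```python
import numpy as np

def calibrate_filters(Q_of_filters, K, L=30, iters=50, lr=1e-3):
    """Minimize Q over a (K, L) bank of FIR filter taps.

    Q_of_filters: callable mapping a (K, L) filter array to the scalar Q
                  (a closure over the calibration utterance and targets)
    """
    h = np.zeros((K, L))
    h[:, 0] = 1.0 / K                    # start from simple averaging
    eps = 1e-4
    for _ in range(iters):
        base = Q_of_filters(h)
        grad = np.zeros_like(h)
        for idx in np.ndindex(h.shape):  # finite-difference gradient
            hp = h.copy()
            hp[idx] += eps
            grad[idx] = (Q_of_filters(hp) - base) / eps
        h -= lr * grad                   # gradient-descent step
    return h
```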
46. Calibration using a close-talking recording
- Given a close-talking microphone recording of the calibration utterance, derive an optimal filter for each channel to improve recognition
[Diagram: the close-talking signal s(n) passes through feature extraction (FE) to give the target features M_s; each array channel is filtered by h_k(n), delayed by tau_k, and summed, and the features M_y of the sum are compared with M_s by the optimizer (OPT), which updates the filters feeding the ASR system]
47. Multi-microphone data sets
- TMS
  - Recorded in the CMU auditory lab
  - Approx. 5 m x 5 m x 3 m
  - Noise from computer fans, blowers, etc.
  - Isolated letters and digits, keywords
  - 10 speakers x 14 utterances = 140 utterances
  - Each utterance has a close-talking mic control waveform
[Array diagram, labeled 7 cm and 1 m]
48. Multi-microphone data sets (2)
- WSJ off-axis noise source
  - Room simulation created using the image method
  - 5 m x 4 m x 3 m
  - 200-ms reverberation time
  - WGN source @ 5 dB SNR
- WSJ test set
  - 5K-word vocabulary
  - 10 speakers x 65 utterances = 650 utterances
  - Original recordings used as close-talking control waveforms
[Array diagram, labeled 25 cm, 2 m, 15 cm, and 1 m]
49. Results
- TMS data set, WSJ0 WGN point-source simulation
- Constructed 50-point filters from a single calibration utterance
- Applied the filters to all test utterances
50. Calibration without a close-talking microphone
- Obtain an initial waveform estimate using a conventional array processing technique (e.g. delay-and-sum)
- Use the transcription and the recognizer to estimate the sequence of target clean log Mel spectra
- Optimize the filter parameters as before
51. Calibration w/o close-talking microphone (2)
- Force-align the delay-and-sum waveform to the known transcription to generate an estimated HMM state sequence
[Diagram: the microphone signals are delayed and summed to s(n), passed through feature extraction (FE), and force-aligned (FALIGN) against the known transcription to produce the HMM state sequence]
52. Calibration w/o close-talking microphone (3)
- Extract the means from the single-Gaussian HMMs of the estimated state sequence
- Since the models have been trained on clean speech, use these means as the target clean speech feature vectors
[Diagram: the HMM state means are converted back to log Mel spectra via an IDCT]
53. Calibration w/o close-talking microphone (4)
- Use the estimated clean speech feature vectors to optimize the filters as before
[Diagram: as in slide 46, but with the estimated target features driving the optimizer (OPT) that updates the per-channel filters h_k(n)]
54. Results
- TMS data set, WSJ0 WGN point-source simulation
- Constructed 50-point filters from the calibration utterance
- Applied the filters to all utterances
55. Results (2)
- WER vs. SNR for WSJ WGN
- Constructed 50-point filters from the calibration utterance using the transcription only
- Applied the filters to all utterances
56. Performance in highly reverberant rooms
- Comparison of single-channel and delay-and-sum beamforming (WSJ data passed through measured impulse responses)
[WER plot: single channel vs. delay-and-sum]
57. Performance in highly reverberant rooms
- Impact of matched-filter and feature-based array processing
- The matched-filter array used 1040 points per channel, with perfect knowledge of the channel characteristics
- The feature-based array used 30 points per channel, with imperfect channel knowledge
58. Subband processing using optimized features (Seltzer)
- Subband processing can address some of the effects of reverberation; the subband signals have more desirable narrowband signal properties (a sketch follows below)
  1. Divide the signal into independent subbands
  2. Downsample
  3. Process the subbands independently
  4. Upsample
  5. Resynthesize the full signal from the subbands
[Filter-bank diagram: the input x(n) is split by analysis filters H_1(z)..H_L(z), downsampled, processed by per-band filters F_1(z)..F_L(z), upsampled, passed through synthesis filters G_1(z)..G_L(z), and summed to give y(n)]
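An illustrative two-band version of the five steps above. This is a sketch, not a perfect-reconstruction filter-bank design; the band split, filter length, and the identity "processing" step are all assumptions:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def subband_process(x, M=2, ntaps=65):
    h = [firwin(ntaps, 0.5),                    # H1: lowpass analysis
         firwin(ntaps, 0.5, pass_zero=False)]   # H2: highpass analysis
    y = np.zeros_like(x)
    for hk in h:
        sub = lfilter(hk, 1.0, x)[::M]          # steps 1-2: filter, downsample
        sub = sub                               # step 3: per-band processing (identity here)
        up = np.zeros(len(sub) * M)             # step 4: upsample by zero insertion
        up[::M] = sub
        y += M * lfilter(hk, 1.0, up)[:len(x)]  # step 5: synthesis filter and sum
    return y

x = np.random.randn(4000)
y = subband_process(x)  # approximately a delayed copy of x with this trivial processing
```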
59. Subband results with the reverberated WSJ task
- WER for all speakers, compared to delay-and-sum processing
60. Summary
- Microphone array processing is effective, although not yet in widespread use except for simple delay-and-sum beamforming
- Despite many developments in signal processing, actual applications to speech are based on very simple concepts
- Major problems and issues
  - Maintaining good performance in reverberation
  - Real-time operation with time-varying environments and speakers