ROBUST SPEECH RECOGNITION: Introduction to Microphone Arrays
1
ROBUST SPEECH RECOGNITION: Introduction to Microphone Arrays
  • Richard Stern
  • (with Mike Seltzer, Tom Sullivan, and Evandro Gouvea)
  • Robust Speech Recognition Group
  • Carnegie Mellon University
  • Telephone: (412) 268-2535
  • Fax: (412) 268-3890
  • rms@cs.cmu.edu
  • http://www.cs.cmu.edu/~rms
  • Short Course at UNAM
  • August 14-17, 2007

2
Introduction
  • The use of arrays of microphones can improve
    speech recognition accuracy in noise
  • Outline of this talk:
    • Review classical approaches to microphone array processing
      • Delay-and-sum beamforming
      • Traditional adaptive filtering
      • Physiologically motivated processing
    • Describe and discuss selected recent results
      • Matched-filter array processing (Rutgers)
      • Array processing based on speech features (CMU)

3
OVERVIEW OF SPEECH RECOGNITION
[Block diagram: speech waveform → feature extraction → speech features → decision-making procedure → phoneme hypotheses]
  • Major functional components:
    • Signal processing to extract features from speech waveforms
    • Comparison of features to pre-stored templates
  • Important design choices:
    • Choice of features
    • Specific method of comparing features to stored templates

4
Why use microphone arrays?
  • Microphone arrays can provide directional
    response, accepting speech from some directions
    while suppressing sound from others

5
Another reason to use microphone arrays
  • Microphone arrays can focus attention on the
    direct field in a reverberant environment

6
Three classical types of microphone arrays
  • Delay-and-sum beamforming and its many variants
  • Adaptive arrays based on mean-square suppression
    of noise
  • Physiologically and perceptually motivated
    approaches to multiple microphones

7
Delay-and-sum beamforming
  • Simple processing based on equalizing delays to
    sensors
  • High directivity can be achieved with many sensors

[Block diagram: each sensor k (k = 1 ... K) is delayed by z^(-n_k); the K delayed sensor signals are summed to form the output]
8
The physics of delay-and-sum beamforming
[Geometry: plane wave arriving at angle θ to a line array with sensor spacing d; the extra path length between adjacent sensors is d·sin(θ)]
9
The physics of delay-and-sum beamforming
  • If sensor outputs are added together, the look
    direction is θ = 0
  • The look direction can be steered to other
    directions by inserting electronic delays to
    compensate for the physical ones
  • For a look direction of θ = 0, the net array output
    for a plane wave arriving from angle θ is
    y(t) = Σ_{k=0}^{N-1} x(t - k·d·sin(θ)/c)

[Geometry: plane wave at angle θ, sensor spacing d, path difference d·sin(θ)]
10
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 500 Hz

11
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 1000 Hz

12
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 1500 Hz

13
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 2000 Hz

14
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 2500 Hz

15
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 3000 Hz

16
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 3500 Hz

17
Examples of delay-and-sum beams
  • d = 8.62 cm, N = 9, f = 4000 Hz
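The beam patterns in this sequence of slides can be reproduced numerically. Below is a minimal sketch, assuming a far-field plane-wave model and speed of sound c = 343 m/s (function and variable names are illustrative, not from the slides):

```python
import numpy as np

def delay_and_sum_pattern(d, N, f, c=343.0):
    """Magnitude response of an N-sensor uniform line array with
    spacing d (m), summed with equal weights, versus arrival angle,
    at a single frequency f (Hz)."""
    angles = np.linspace(-90.0, 90.0, 721)
    # Phase shift between adjacent sensors for a plane wave from each angle
    phase = 2 * np.pi * f * d * np.sin(np.radians(angles)) / c
    k = np.arange(N)[:, None]
    # Sum of N unit-magnitude phasors, normalized to 1 at broadside
    response = np.abs(np.exp(-1j * k * phase).sum(axis=0)) / N
    return angles, response

# Settings from the slides above:
angles, resp = delay_and_sum_pattern(d=0.0862, N=9, f=2000.0)
```

Sweeping f as in the slides shows the main lobe narrowing as frequency grows, which is what motivates the nested arrays on the following slides.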

18
Nested microphone arrays (Flanagan et al.)
  • 5-element low-frequency array

19
Nested microphone arrays
  • 5-element mid-frequency array

20
Nested microphone arrays
  • 5-element high-frequency array

21
Combined nested array (Flanagan et al.)
  • Three-band quasi-constant beamwidth array

[Block diagram: the low-, mid-, and high-frequency subarrays feed a lowpass, bandpass, and highpass filter respectively; the filter outputs are summed]
22
Another delay-and-sum issue: spatial aliasing
  • d = 8.62 cm, N = 9, f = 4000 Hz

23
Another delay-and-sum issue: spatial aliasing
  • d = 8.62 cm, N = 9, f = 5000 Hz

24
Another delay-and-sum issue: spatial aliasing
  • d = 8.62 cm, N = 9, f = 6000 Hz

25
Preventing spatial aliasing
  • Spatial aliasing occurs when the signals arriving at
    adjacent sensors are offset by more than half a period
  • The spatial Nyquist constraint depends on both
    frequency and arrival angle
  • To prevent spatial aliasing we require that the
    maximum frequency satisfy
    f_max < c / (2·d·sin(θ))

[Geometry: plane wave at angle θ, sensor spacing d, path difference d·sin(θ)]
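Plugging in this array's numbers, a quick sketch (assuming c = 343 m/s) shows how the limit depends on arrival angle:

```python
import numpy as np

c, d = 343.0, 0.0862                 # speed of sound (m/s), sensor spacing (m)
theta = np.radians(30.0)             # example arrival angle
f_max = c / (2 * d * np.sin(theta))  # roughly 3980 Hz at 30 degrees
```

For a 30-degree arrival this gives roughly 4 kHz, which lines up with the frequency at which the preceding slides start to show aliased beams.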
26
Filter-and-sum beamforming
  • Input filters can (in principle) place delays
    that vary with frequency to ameliorate frequency
    dependencies of beamforming
  • Filters can also compensate for channel
    characteristics

[Block diagram: each of the K sensors feeds its own filter; the K filter outputs are summed to form the output]
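A minimal sketch of this structure (helper names are illustrative; any steering delays are assumed to be folded into the filters):

```python
import numpy as np
from scipy.signal import lfilter

def filter_and_sum(channels, filters):
    """Filter-and-sum beamformer: each channel k is passed through
    its own FIR filter h_k and the filtered channels are summed.
    `channels` and `filters` are lists of 1-D NumPy arrays;
    all channels are assumed to have equal length."""
    return np.sum([lfilter(h, 1.0, x) for h, x in zip(filters, channels)], axis=0)
```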
27
Compensated delay-and-sum beamforming
  • A filter is added to compensate for the filtering
    effects of delay-and-sum beamforming

[Block diagram: sensors 1 ... K are delayed by z^(-n_k) and summed; a single filter after the summation compensates the overall response]
28
Sample recognition results using compensated
delay-and-sum
  • The Flanagan array with CDCN (codeword-dependent
    cepstral normalization) does improve accuracy

29
Traditional adaptive arrays
30
Traditional adaptive arrays
  • Large established literature
  • Use MMSE techniques to establish beams in the look
    direction and nulls toward additive noise sources
    (a minimal example is sketched below)
  • Generally do not perform well in reverberant
    environments, because of:
    • Signal cancellation
    • Effective impulse responses longer than the length of
      the filter
  • Techniques to circumvent signal cancellation:
    • Switching the nulling mechanism off and on according
      to the presence or absence of speech (Van Compernolle)
    • Switching off adaptation in reverberant environments
    • Use of alternate adaptation algorithms
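The slides do not name a specific algorithm; as one concrete member of the MMSE family, here is a minimal two-channel normalized-LMS noise canceller (all names illustrative). In a real adaptive array the reference channel would come from a null-steering or blocking stage:

```python
import numpy as np

def nlms_canceller(primary, reference, n_taps=32, mu=0.1):
    """Adapt an FIR filter so that the filtered reference channel
    cancels the noise component of the primary channel (MMSE
    criterion). Returns the error signal, i.e. the cleaned output."""
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]    # most recent sample first
        y = w @ x                            # current noise estimate
        e = primary[n] - y                   # error = desired output
        out[n] = e
        w += mu * e * x / (x @ x + 1e-8)     # normalized LMS update
    return out
```

Signal cancellation arises when target speech leaks into the reference channel (for example through reverberation): the filter then adapts to cancel the speech itself, which is why the list above includes freezing adaptation while speech is present.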

31
Array processing based on human binaural hearing
32
Array processing based on human binaural hearing
  • Motivation: the human binaural system is known to
    have excellent immunity to additive noise and
    reverberation
  • Binaural phenomena of interest:
    • Cocktail-party effect
    • Precedence effect
  • Problems with binaural models:
    • Correlation produces signal distortion from
      rectification and squaring
    • Precedence-effect processing defeats echoes but
      also suppresses desired signals
    • Greatest challenge: decoding useful information
      from the cross-correlation display (see the sketch below)
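The cross-correlation display referred to here is, at its core, a short-time cross-correlation of the two ear signals as a function of interaural time difference (ITD); binaural models compute it separately within each auditory frequency band. A minimal single-band, single-window sketch (names illustrative):

```python
import numpy as np

def cross_correlation_display(left, right, fs, max_itd=0.001):
    """Normalized cross-correlation of two equal-length 'ear'
    signals over a range of ITDs. Full binaural models apply this
    per frequency band and over short time windows."""
    max_lag = int(max_itd * fs)
    lags = np.arange(-max_lag, max_lag + 1)
    n = len(left)
    denom = np.sqrt((left @ left) * (right @ right)) + 1e-12
    cc = np.array([left[max(0, -lag):n - max(0, lag)]
                   @ right[max(0, lag):n - max(0, -lag)]
                   for lag in lags]) / denom
    return lags / fs, cc   # ITD axis in seconds, correlation values
```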

33
Correlation-based system motivated by binaural
hearing
34
Vowel representations using correlation processing
  • Reconstructed features of vowel /a/

[Panels: two inputs, zero delay; two inputs, 120-ms delay; eight inputs, 120-ms delay]
35
So what do things sound like on the
cross-correlation display?
  • Signals combined with ITDs of 0 and 0.5 ms:
    • Individual speech signals
    • Combined speech signals
    • Signals separated by delay-and-sum beamforming
    • Signals separated by the cross-correlation display
    • Signals separated by additional correlations
      across frequency at a common ITD (straightness
      weighting)

36
Matched-filter beamforming (Rutgers)
  • Goal: compensation for the delay and dispersion
    introduced in reverberant environments

[Plots: a 600-ms room impulse response and its autocorrelation function]
37
Matched-filter beamforming procedure
  • Measure or estimate the impulse response from the
    source to each sensor
  • Convolve each input with its time-reversed impulse
    response (producing an autocorrelation function)
  • Sum the channel outputs together (see the sketch below)
  • Rationale: the main lobes of the autocorrelation
    functions should reinforce while the side lobes cancel
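A minimal sketch of this procedure (assuming the room impulse responses have already been measured or estimated; names illustrative):

```python
import numpy as np
from scipy.signal import fftconvolve

def matched_filter_beamform(channels, impulse_responses):
    """Matched-filter array: convolve each channel with its own
    time-reversed room impulse response, then sum. Each channel's
    result peaks like an autocorrelation function, so the main
    lobes add coherently while the side lobes tend to cancel."""
    filtered = [fftconvolve(x, h[::-1])
                for x, h in zip(channels, impulse_responses)]
    n = min(len(y) for y in filtered)
    return np.sum([y[:n] for y in filtered], axis=0)
```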

38
Optimizing microphone arrays for speech
recognition features
  • The objective of typical microphone array algorithms
    has been signal enhancement rather than speech recognition

[Block diagram: MIC1 ... MICN feed an array processor whose single output goes to the recognizer]
39
Automatic Speech Recognition (ASR)
  • Parameterize the speech signal and compare the
    parameter sequence to statistical models of speech
    sound units to hypothesize what a user said
  • The objective is accurate recognition, a
    statistical pattern classification problem

40
ASR feature extraction
  • Convert a segment of speech into a compact set of
    descriptive features

[Pipeline: 400-sample speech frame → 512-point FFT → 40-channel Mel spectrum → log (40 values) → DCT → 13 features → ASR]
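A minimal sketch of this front end, assuming 16-kHz input and textbook mel-filterbank formulas (the exact parameters of the CMU front end may differ; all names illustrative):

```python
import numpy as np
from scipy.fftpack import dct

def mel(f):        # Hz -> mel
    return 2595 * np.log10(1 + f / 700)

def mel_inv(m):    # mel -> Hz
    return 700 * (10 ** (m / 2595) - 1)

def mel_filterbank(n_filt=40, n_fft=512, fs=16000):
    """Triangular filters equally spaced on the mel scale."""
    pts = mel_inv(np.linspace(mel(0), mel(fs / 2), n_filt + 2))
    bins = np.floor((n_fft + 1) * pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for i in range(n_filt):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

def mfcc_frame(frame, fb, n_ceps=13):
    """One 400-sample frame: FFT power spectrum -> mel energies
    -> log -> DCT, keeping the first 13 coefficients."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n=512)) ** 2
    logmel = np.log(fb @ spec + 1e-10)
    return dct(logmel, type=2, norm='ortho')[:n_ceps]
```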
41
Speech recognition with microphone arrays
  • Recognition with microphone arrays has traditionally
    been performed by gluing the two systems together
  • The two systems have different objectives
  • Neither system exploits the information present in
    the other

42
Array processing based on speech features
  • Develop an array processing scheme targeted at
    improved speech recognition performance without
    regard to conventional array processing objective
    criteria.

[Block diagram: MIC1 ... MIC4 → array processing → feature extraction → ASR, designed as a single system]
43
Choosing array weights based on speech features
  • Want an objective function that uses parameters
    directly related to recognition

[Block diagram: each microphone signal x_m passes through a filter h_m with delay τ_m; the filter-and-sum output y goes through feature extraction (FE) to give features M_y, which are compared with the clean-speech features M_s; the error e = M_s - M_y is minimized]
44
An objective function for mic arrays based on
speech recognition
  • Define Q as the sum of the squared errors between the
    log Mel spectra of the clean speech s and the noisy
    speech y:
    Q = Σ_f Σ_l ( M_s[f, l] - M_y[f, l] )²
  • where y is the output of a filter-and-sum
    microphone array and M[f, l] is the l-th log
    Mel spectral value in frame f
  • M_y[f, l] is a function of the signals captured
    by the array and of the filter parameters associated
    with each microphone
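A direct transcription of this definition, reusing mel_filterbank from the feature-extraction sketch above (the 400-sample, 160-hop framing is an assumption, not from the slides):

```python
import numpy as np

def log_mel_features(x, fb, frame_len=400, hop=160):
    """Log mel spectra of a waveform, frame by frame (fb is the
    mel filterbank from the earlier sketch)."""
    win = np.hamming(frame_len)
    return np.array([
        np.log(fb @ np.abs(np.fft.rfft(x[i:i + frame_len] * win, 512)) ** 2
               + 1e-10)
        for i in range(0, len(x) - frame_len + 1, hop)])

def objective_Q(clean, array_output, fb):
    """Q = sum over frames f and mel channels l of
    (M_s[f, l] - M_y[f, l])^2."""
    Ms = log_mel_features(clean, fb)
    My = log_mel_features(array_output, fb)
    n = min(len(Ms), len(My))
    return float(np.sum((Ms[:n] - My[:n]) ** 2))
```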

45
Calibration of microphone arrays for ASR
  • Calibration of the filter-and-sum microphone array:
    • User speaks an utterance with a known transcription
    • With or without a close-talking microphone
  • Derive the optimal set of filters:
    • Minimize the objective function with respect to the
      filter coefficients
    • Since the objective function is non-linear, use
      iterative gradient-based methods (a sketch follows below)
  • Apply the filters to all future speech
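A minimal sketch of the calibration step, reusing filter_and_sum and objective_Q from the earlier sketches. The actual system used iterative gradient-based search; for brevity this sketch hands the objective to a general-purpose derivative-free optimizer, and it assumes the channels are already time-aligned:

```python
import numpy as np
from scipy.optimize import minimize

def calibrate_filters(channels, clean, fb, n_taps=50):
    """Optimize one n_taps-point FIR filter per channel so that the
    filter-and-sum output's log mel spectra match the clean-speech
    targets. Slow but simple; illustrative only."""
    n_ch = len(channels)
    w0 = np.zeros(n_ch * n_taps)
    w0[::n_taps] = 1.0 / n_ch            # initialize at delay-and-sum

    def Q(w):
        y = filter_and_sum(channels, list(w.reshape(n_ch, n_taps)))
        return objective_Q(clean, y, fb)

    res = minimize(Q, w0, method='Powell')   # derivative-free stand-in
    return res.x.reshape(n_ch, n_taps)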

46
Calibration using a close-talking recording
  • Given a close-talking mic recording for the
    calibration utterance, derive an optimal filter
    for each channel to improve recognition

[Block diagram: the close-talking signal s(n) is feature-extracted (FE) to give M_s; the microphone signals pass through filters h_1(n) ... h_M(n) with delays τ_1 ... τ_M, are summed, and are feature-extracted to give M_y; the optimizer (OPT) adjusts the filters before recognition]
47
Multi-microphone data sets
  • TMS:
    • Recorded in the CMU auditory lab
    • Approx. 5 m x 5 m x 3 m
    • Noise from computer fans, blowers, etc.
    • Isolated letters and digits, keywords
    • 10 speakers x 14 utterances = 140 utterances
    • Each utterance has a close-talking mic control
      waveform

[Array geometry diagram: 7 cm, 1 m]
48
Multi-microphone data sets (2)
  • WSJ with off-axis noise source:
    • Room simulation created using the image method
    • 5 m x 4 m x 3 m
    • 200-ms reverberation time
    • WGN source @ 5 dB SNR
  • WSJ test set:
    • 5K-word vocabulary
    • 10 speakers x 65 utterances = 650 utterances
    • Original recordings used as close-talking control
      waveforms

[Room geometry diagram: 25 cm, 2 m, 15 cm, 1 m]
49
Results
  • TMS data set; WSJ0 + WGN point-source simulation
  • Constructed 50-point filters from a single
    calibration utterance
  • Applied the filters to all test utterances

50
Calibration without Close-talking Microphone
  • Obtain an initial waveform estimate using a
    conventional array processing technique (e.g.
    delay-and-sum)
  • Use the transcription and the recognizer to estimate
    the sequence of target clean log Mel spectra
  • Optimize the filter parameters as before

51
Calibration w/o Close-talking Microphone (2)
  • Force-align the delay-and-sum waveform to the
    known transcription to generate an estimated HMM
    state sequence

[Block diagram: the delay-and-sum output s(n) is feature-extracted (FE) and force-aligned (FALIGN) against the known transcription to produce the estimated HMM state sequence]
52
Calibration w/o Close-talking Microphone (3)
  • Extract the means from the single-Gaussian HMMs
    of the estimated state sequence
  • Since the models have been trained on clean
    speech, use these means as the target clean
    speech feature vectors (see the sketch below)

[Diagram: HMM state means → IDCT → target log Mel spectra]
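A minimal sketch of the target-generation step. Since the HMM means live in the (truncated) cepstral domain, an inverse DCT maps them back to log mel spectra; zero-padding the cepstra before the IDCT is one simple, assumed way to restore the 40 mel channels:

```python
import numpy as np
from scipy.fftpack import idct

def targets_from_state_means(state_means_ceps, n_mel=40):
    """Map the cepstral mean vector of each aligned HMM state back
    to the log mel spectral domain, giving the target clean-speech
    features for filter optimization. `state_means_ceps` has shape
    (n_frames, n_ceps), one row per aligned frame."""
    n_frames, n_ceps = state_means_ceps.shape
    padded = np.zeros((n_frames, n_mel))
    padded[:, :n_ceps] = state_means_ceps      # zero-pad truncated cepstra
    return idct(padded, type=2, norm='ortho', axis=1)
```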
53
Calibration w/o Close-talking Microphone (4)
  • Use estimated clean speech feature vectors to
    optimize filters as before.

[Block diagram: as on slide 46, but with the estimated clean-speech targets in place of the close-talking features; the optimizer (OPT) adjusts the filters h_1(n) ... h_M(n) before recognition]
54
Results
  • TMS data set; WSJ0 + WGN point-source simulation
  • Constructed 50-point filters from the calibration
    utterance
  • Applied the filters to all utterances

55
Results (2)
  • WER vs. SNR for WSJ + WGN
  • Constructed 50-point filters from the calibration
    utterance using the transcription only
  • Applied the filters to all utterances

56
Performance in highly reverberant rooms
  • Comparison of single-channel and delay-and-sum
    beamforming (WSJ data passed through measured
    impulse responses)

[Plot: WER for the single channel vs. delay-and-sum]
57
Performance in highly reverberant rooms
  • Impact of matched-filter and feature-based array
    processing
  • The matched-filter array used 1040 points per channel,
    with perfect knowledge of the channel characteristics
  • The feature-based array used 30 points per channel,
    with imperfect channel knowledge

58
Subband processing using optimized features
(Seltzer)
  • Subband processing can address some of the
    effects of reverberation. The subband signals
    have more desirable narrowband signal properties.

[Block diagram: (1) divide the input x(n) into L independent subbands with analysis filters H_1(z) ... H_L(z); (2) downsample each subband; (3) process the subbands independently with filters F_1(z) ... F_L(z); (4) upsample; (5) resynthesize the full signal y(n) with synthesis filters G_1(z) ... G_L(z) and sum]
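A minimal sketch of this analysis/processing/synthesis chain. A real system would use a carefully designed (e.g. polyphase DFT) filterbank; this one uses plain FIR bandpass filters, makes no perfect-reconstruction claim, and all names are illustrative:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def subband_process(x, n_bands=8, n_taps=129, fs=16000,
                    process=lambda s: s):
    """Split x into n_bands contiguous subbands, decimate by
    n_bands, apply `process` to each narrowband signal, upsample
    by zero-stuffing, refilter, and sum the bands."""
    edges = np.linspace(0, fs / 2, n_bands + 1)
    y = np.zeros(len(x))
    for i in range(n_bands):
        lo, hi = edges[i], edges[i + 1]
        if i == 0:
            h = firwin(n_taps, hi, fs=fs)                    # lowpass band
        elif i == n_bands - 1:
            h = firwin(n_taps, lo, fs=fs, pass_zero=False)   # highpass band
        else:
            h = firwin(n_taps, [lo, hi], fs=fs, pass_zero=False)
        sub = lfilter(h, 1.0, x)[::n_bands]                  # analysis + decimate
        sub = process(sub)                                   # narrowband processing
        up = np.zeros(len(x))
        up[::n_bands] = sub                                  # zero-stuff upsample
        y += lfilter(h, 1.0, up) * n_bands                   # synthesis filter
    return y
```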
59
Subband results with reverberated WSJ task
  • WER for all speakers, compared to delay-and-sum
    processing

60
Summary
  • Microphone array processing is effective, although
    not yet in widespread use except for simple
    delay-and-sum beamforming
  • Despite many developments in signal processing,
    actual applications to speech are based on very
    simple concepts
  • Major problems and issues:
    • Maintaining good performance in reverberation
    • Real-time operation with time-varying environments
      and speakers
