Title: Multi-Microphone Speech Processing
1. Multi-Microphone Speech Processing
Prof. Parham Aarabi
Canada Research Chair in Multi-Sensor Information Systems
Founder and Director of the Artificial Perception Lab
Department of Electrical and Computer Engineering, University of Toronto
2. The why of speech recognition
Speech recognition is great for:
- Interfacing with handheld/tablet computers
- Hands-free car function control
- Interactive man-machine conversation
3. The why not of speech recognition
Speech recognition systems need further research since:
- noise significantly degrades their performance
- they are easily confused by multiple speakers
- they are computationally demanding
4. My team's research goal
Make real-time, robust speech recognition possible using multiple sensors and hardware acceleration.
Things we can do with multiple sensors:
- Multi-Microphone Sound Localization
- Multi-Microphone Speech Enhancement
- Audiovisual Speech Enhancement
5. The Artificial Perception Lab
- An experimental environment with microphone arrays, camera arrays, …
- 1 postdoc, 8 graduate students, 25 undergraduate research students
- Funded by Dell, Microsoft, …
6. Our approach
1. Localize the speech source of interest (distributed sound localization)
2. Enhance the speech source of interest
3. Recognize the enhanced signal
7. Basic Sound Localization
Microphone arrays can localize sound using time-of-arrival and intensity differences.
Applications:
- smart rooms
- automatic teleconferencing
- robust speech recognition
- robotics
- other applications
8. Sound Localization
- Sound localization can be expressed as a search for the location that maximizes a Spatial Likelihood Function (SLF) $F(x)$: $\hat{x} = \arg\max_x F(x)$
- Most basic example: Steered Response Power (SRP), a.k.a. delay-and-sum,
  $F(x) = \int \big| \sum_{i=1}^{N} m_i(t + \tau_i(x)) \big|^2 \, dt$,
  where $\tau_i(x)$ is the propagation delay from a source at $x$ to microphone $i$.
9. Sound Localization
- The filter-and-sum version is better (i.e., using Generalized Cross-Correlations [Knapp 76])
- The SRP-PHAT algorithm [DiBiase 01] uses the Phase Transform, which whitens the cross-spectrum so that only phase information is used (see the sketch below)
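To make slides 8-9 concrete, here is a minimal NumPy sketch of SRP-PHAT localization over a grid of candidate source locations. It is an illustrative reconstruction of the standard algorithm, with made-up geometry and variable names, not the lab's actual implementation.

```python
import numpy as np

def gcc_phat(x1, x2, n_fft=1024):
    """Generalized cross-correlation with the Phase Transform (PHAT)
    weighting [Knapp 76]: whiten the cross-spectrum so only phase
    information contributes."""
    X1 = np.fft.rfft(x1, n_fft)
    X2 = np.fft.rfft(x2, n_fft)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12       # PHAT: unit-magnitude cross-spectrum
    cc = np.fft.irfft(cross, n_fft)
    return np.fft.fftshift(cc)           # lag 0 now sits at index n_fft // 2

def srp_phat_slf(signals, mic_pos, grid, fs, c=343.0, n_fft=1024):
    """SRP-PHAT spatial likelihood F(x) over candidate locations,
    accumulated from all microphone pairs [DiBiase 01]."""
    n_mics = len(signals)
    slf = np.zeros(len(grid))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(signals[i], signals[j], n_fft)
            for g, x in enumerate(grid):
                # TDOA (seconds) a source at x would produce for this pair
                tdoa = (np.linalg.norm(x - mic_pos[i]) -
                        np.linalg.norm(x - mic_pos[j])) / c
                lag = int(round(tdoa * fs)) + n_fft // 2
                if 0 <= lag < n_fft:
                    slf[g] += cc[lag]
    return slf  # argmax over the grid gives the location estimate
```

Summing each pair's PHAT-weighted cross-correlation at the TDOA predicted for every candidate location is exactly the steered-response search of slide 8, with the PHAT weighting of slide 9 making the peak robust to reverberation.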
10. Distributed Microphone Arrays (DMAs)
- DMA advantage: higher localization accuracy
- Where could DMAs be used?
  - Suitable for cars, smart rooms, large environments (airports, …)
  - In situations with networked cell phones, PDAs, etc.
- Are current techniques suitable for DMAs?
  - Perhaps not!
11. Prior Work
- Account for the different levels of access of different microphones [Aarabi 01]
[Figure: a sound source surrounded by five distributed microphones, A through E, each with a different level of access to the source.]
12. Modeling the Environment, for a presumed speaker location and orientation
- Three attenuation factors:
  - Source directivity, $a(\theta)$
  - Microphone directivity, $b(\phi)$
  - Source-microphone distance, $d(x_1, x_2)$
13. Modeling the Environment, for a presumed speaker location and orientation (cont.)
[Figure: geometry of the presumed source and a microphone.]
14. Modeling the Environment, for a presumed speaker location and orientation (cont.)
- The overall attenuation is $\alpha_i = \frac{a(\theta_i)\, b(\phi_i)}{d(x, x_i)}$
- The time-delays of arrival are $\tau_i = d(x, x_i)/c$, where $c$ is the speed of sound (see the sketch below)
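A small sketch of this model, combining the three factors from slide 12. The cardioid directivity patterns and the unit-vector conventions are illustrative assumptions, since the slides do not specify the actual $a(\cdot)$ and $b(\cdot)$:

```python
import numpy as np

def attenuation_and_delay(src_pos, src_dir, mic_pos, mic_dir, c=343.0):
    """Overall attenuation alpha_i and arrival delay tau_i for one
    microphone under the free-field model above. src_dir and mic_dir
    are unit "look direction" vectors; the cardioid patterns for a()
    and b() are assumptions for illustration."""
    v = mic_pos - src_pos
    d = np.linalg.norm(v)                   # distance d(x, x_i)
    u = v / d
    a = 0.5 * (1.0 + np.dot(src_dir, u))    # source directivity a(theta)
    b = 0.5 * (1.0 + np.dot(mic_dir, -u))   # microphone directivity b(phi)
    alpha = a * b / d                       # overall attenuation alpha_i
    tau = d / c                             # time of arrival tau_i (seconds)
    return alpha, tau
```

These $\alpha_i$ and $\tau_i$ are exactly the quantities that the enhanced localizer of slides 15-17 plugs into its spatial likelihood function.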
15. Enhanced Sound Localization
- Enhanced sound localization searches jointly over location and orientation: $(\hat{x}, \hat{\theta}) = \arg\max_{x,\theta} F(x, \theta)$
- $F(x, \theta)$ is the new Spatial Likelihood Function
- The attenuations $\alpha_i$ weight the contributions from each individual microphone
16. Enhanced Sound Localization
- $N$ microphones, each observing a signal $m_i(t)$
- Define the SLF to be proportional to the log-likelihood of the observations given a source at $(x, \theta)$
17. SLF Generation
- Assuming that the noise is Gaussian, the log-likelihood-based SLF reduces to an attenuation-weighted steered response power:
  $F(x, \theta) \propto \int \big| \sum_{i=1}^{N} \alpha_i\, m_i(t + \tau_i) \big|^2 \, dt$
  (a brute-force sketch follows)
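Putting slides 14-17 together, here is a brute-force sketch of the orientation-aware SLF. It reuses attenuation_and_delay() from the previous sketch and assumes the attenuation-weighted steered-response form written above; the grid search, orientation set, and integer-sample circular delays are illustrative simplifications.

```python
import numpy as np
# assumes attenuation_and_delay() from the previous sketch is in scope

def enhanced_slf(signals, mic_pos, mic_dir, grid, orientations, fs, c=343.0):
    """Attenuation-weighted steered-response SLF F(x, theta), evaluated
    by brute force over candidate locations and source orientations."""
    slf = np.zeros((len(grid), len(orientations)))
    for g, x in enumerate(grid):
        for o, src_dir in enumerate(orientations):
            steered = np.zeros(len(signals[0]))
            for i, m in enumerate(signals):
                alpha, tau = attenuation_and_delay(x, src_dir,
                                                   mic_pos[i], mic_dir[i], c)
                shift = int(round(tau * fs))
                # advance each signal by its predicted delay (circular
                # shift is a simplification) and weight by its attenuation
                steered += alpha * np.roll(m, -shift)
            slf[g, o] = np.sum(steered ** 2)   # steered response power
    return slf  # argmax over (grid, orientations): location and heading
```

The nested loops over locations and orientations make the cost of the extra orientation dimension explicit, which is precisely the brute-force inefficiency noted on slide 24.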
18-23. Sound Localization Example Using 24 Mics.
[Figure sequence: spatial likelihood maps from a 24-microphone setup, with the high- and low-likelihood regions indicated.]
24. Remarks
- Can be extended to the filter-and-sum sound localization technique
- Results in over a 60% reduction in percent anomalies over SRP-PHAT
- Brute-force search is inefficient, especially with the extra orientation dimension
25. So now, we can localize a sound source…
[Figure: microphone array]
26. …but how do we use the location information to remove noise?
Multi-Microphone Phase-Based Speech Enhancement
27. Current multi-microphone systems improve SNR
28. Our goal: reduce the perceptual quality of the noise
29. Spectrogram of one speaker with no noise
[Figure: spectrogram $X_k(\omega)$; axes: frequency ($\omega$) vs. time segment index ($k$).]
30. Spectrograms with two speakers and two microphones
[Figure: spectrograms of the Microphone 1 and Microphone 2 recordings.]
31. The counterpart of spectrograms
Besides the magnitude spectrograms $X_{1k}(\omega)$ and $X_{2k}(\omega)$, we also have the corresponding phase spectrograms, $\angle X_{1k}(\omega)$ and $\angle X_{2k}(\omega)$.
[Figure: phase spectrograms; axes: frequency ($\omega$) vs. time index ($k$).]
32. Basic approach
- Ideally, after aligning the two microphones to the source location, the phases agree: $\angle X_{1k}(\omega) = \angle X_{2k}(\omega)$
- But in reality, with noise and reverberation, we have a phase error $\varepsilon_k(\omega) = \angle X_{1k}(\omega) - \angle X_{2k}(\omega) \neq 0$ (sketched below)
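A minimal SciPy sketch of this phase error, assuming the two recordings are first aligned using the target's TDOA from the localization stage; the integer-sample alignment and default STFT parameters are illustrative simplifications:

```python
import numpy as np
from scipy.signal import stft

def phase_error(x1, x2, tdoa, fs, nperseg=256):
    """Phase error eps_k(omega) between two microphone recordings,
    after coarsely aligning microphone 2 to the target's TDOA."""
    x2_aligned = np.roll(x2, -int(round(tdoa * fs)))
    _, _, X1 = stft(x1, fs, nperseg=nperseg)
    _, _, X2 = stft(x2_aligned, fs, nperseg=nperseg)
    # angle of X1 * conj(X2) is the phase difference wrapped to (-pi, pi]
    eps = np.angle(X1 * np.conj(X2))
    return X1, X2, eps
```

For time-frequency blocks dominated by the (aligned) target, eps stays near 0; blocks dominated by noise or other speakers show large phase errors, which is the separation the distributions of slide 33 illustrate.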
33. [Figure: power distributions vs. phase error $\varepsilon(\omega)$, comparing the signal of interest (concentrated near a phase error of 0) with the noise (spread over the full range).]
34. Goal: scale each time-frequency (TF) block in order to damage the noise signal
[Figure: the proposed perceptually motivated phase-error filter, plotted against the power distributions over phase error $\varepsilon_k(\omega)$.]
35. Hence, we get a TF mask…
36. …which can be applied to either spectrogram, as in the sketch below.
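A sketch of the masking step. The exponential mask shape and the gamma parameter are stand-in assumptions for the perceptually motivated filter of slide 34, whose exact form the slides do not give:

```python
import numpy as np
from scipy.signal import istft

def apply_phase_error_mask(X1, eps, fs, gamma=4.0, nperseg=256):
    """Build a TF mask from the phase error and apply it to one of the
    spectrograms; gamma and the exponential shape are illustrative."""
    mask = np.exp(-gamma * np.abs(eps))   # gain ~1 where phases agree,
                                          # small where they disagree
    Y = mask * X1                         # damage noise-dominated TF blocks
    _, y = istft(Y, fs, nperseg=nperseg)  # back to the time domain
    return y
```

Used together with phase_error() from the previous sketch, this scales each time-frequency block by how well the two microphones' phases agree, damaging the noise as slide 37 shows.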
37. Resulting in damaged noise
[Figure: spectrograms of the original signal and of the result.]
38. Phase-error filter design choices
[Figure: the proposed perceptually motivated phase-error filter, plotted against the power distribution over phase error $\varepsilon(\omega)$.]
39. Comparison with other speech enhancement methods
40. Speech recognition experiments
- Using the Bell Labs CARVUI multi-microphone speech database (56 speakers)
- SNR = -5 dB
[Figure: recording geometry; two microphones 6 cm apart, with sources at 45° on either side.]
41-47. Speech recognition accuracy rates
[Figure sequence: accuracy-rate plots, beginning with the 2-microphone case.]
48. [Figure: comparison of superdirective beamforming (4 mics.) and Time-Frequency Speech Separation (4 mics.).]
49. Ongoing work: speech recognition with feedback
[Diagram: Multi-Mic. Speech Processing → Speech Recognition Front-End → Speech Recognition Back-End → "hello".]
50. Ongoing work: probabilistic speech separation
- By Frey, Kristjansson, Deng, Attias, and others.
[Diagram: graphical model for a single speaker; a Speaker 1 Speech Class node generates per-frequency magnitudes (Freq. 1 through Freq. N), which combine with per-frequency noise to produce the per-frequency Mic. 1 observations.]
51. Ongoing work: probabilistic speech separation
[Diagram: two-speaker extension; Speaker 1 and Speaker 2 Speech Class nodes each generate per-frequency magnitudes (Freq. 1 through Freq. N), which are combined through a speaker-location-based mixture, with per-frequency noise, to produce the per-frequency Mic. 1 and Mic. 2 observations.]
52. Multi-microphone localization and enhancement problems:
- Cannot be performed in real-time on standard processors
- Scalability (not appropriate for 10 mics., even with DSPs)
- Power requirements (not good for mobile applications)
- Space requirements (multiple chips, etc.)
53. Solution: Hardware Acceleration
- Initially implemented a TDOA-estimation VLSI chip for sound localization
- First implemented on an FPGA [Nguyen et al.], and then in 0.18 µm CMOS
54. Solution: Hardware Acceleration
- Currently working on a low-power joint localization and enhancement IC core
55. Solution: Hardware Acceleration
- Eventually, the goal is to have a low-power localization and enhancement co-processor
56. Concluding Remarks
- Multi-microphone speech processing is useful for robust speech recognition
- Other work at the APL includes:
  - Distributed Processing for Microphone Arrays (from a Sensor Networks view)
  - Camera Arrays and Audiovisual Speech Processing
57. This research has led to a spin-off company
58. Please visit www.apl.utoronto.ca