Title: Parham Aarabi
1Parham Aarabi
Assistant Professor, Canada Research Chair in Multi-Sensor Information Systems, and Founder/Director of the Artificial Perception Lab
The Edward S. Rogers Sr. Department of Electrical and Computer Engineering, University of Toronto
2Our Research Goal
Multi-Sensor Information Fusion for Human-Computer Interaction Applications
Examples include
- Multi-Microphone Sound Localization
- Multi-Microphone Speech Separation
- Audiovisual Speech Processing
- Robotics
Why?
Improve life for humans (e.g. speech recognition for cars or the disabled, more intelligent robotics, intelligent environments/cars/homes, etc.)
3The Artificial Perception Lab
9 Graduate Students, 1 postdoc, 25 undergraduate
researchers
4Goal for today
Discuss central APL projects
- Sound Localization
- Speech Separation/Enhancement (briefly)
Introduce other ongoing projects
- Acoustical Robotic Navigation
- Audiovisual Sound Localization
5Microphone Array
Cameras
7Now we transition
Basic Sound Localization
8Basic Sound Localization
Microphone arrays can be used to localize sound sources, since each source emits a sound wave that arrives at each microphone with a different delay and amplitude.
Applications
- smart rooms
- automatic teleconferencing
- robust speech recognition
- robotics
- other HCI-related applications
9Basic Sound Localization
10Basic Sound Localization
Knowledge of the TDOA between a microphone pair constrains the source location to a hyperbola in 2D, or a hyperboloid in 3D.
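As a worked equation (the notation here is assumed, not taken from the slide): for microphones at positions m_i and m_j, speed of sound c, and a measured TDOA tau_ij, a source at position y must satisfy

```latex
\|y - m_i\| - \|y - m_j\| \;=\; c\,\tau_{ij}
```

which is one branch of a hyperbola (2D) or hyperboloid (3D) with the two microphones as foci.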
11TDOA-based Sound Localization
TDOA estimate
12Basic Sound Localization
13Sound Localization
- Sound localization can be expressed as a maximization over candidate source locations (see the equation below)
- F(y) is a Spatial Likelihood Function (SLF)
- Most basic example: delay-and-sum beamforming (a.k.a. Steered Response Power, or SRP)
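In symbols (reconstructed from the surrounding text; the slide's own equation is not in this transcript), the estimated source location is the maximizer of the SLF:

```latex
\hat{y}_s \;=\; \arg\max_{y}\; F(y)
```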
14The simplest SLF generation technique: delay-and-sum energy scanning. Many more advanced techniques exist.
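A minimal sketch of delay-and-sum energy scanning over a grid of candidate locations, assuming frame-based processing and known microphone coordinates; all function and variable names here are illustrative rather than taken from the slides.

```python
import numpy as np

def srp_delay_and_sum(frames, mic_positions, grid, fs, c=343.0):
    """Delay-and-sum spatial likelihood: for each candidate point,
    time-align the microphone signals and measure the summed energy.

    frames: (n_mics, n_samples) time-domain snapshot
    mic_positions, grid: (n_mics, 3) and (n_points, 3) coordinates
    Returns an SLF value F(y) for every grid point y.
    """
    n_mics, n_samples = frames.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)         # FFT bin frequencies (Hz)
    spectra = np.fft.rfft(frames, axis=1)                   # (n_mics, n_bins)
    slf = np.empty(len(grid))
    for p, y in enumerate(grid):
        # Propagation delay from candidate point y to each microphone
        delays = np.linalg.norm(mic_positions - y, axis=1) / c
        # Align channels by a phase shift in the frequency domain, then sum
        steered = spectra * np.exp(2j * np.pi * freqs * delays[:, None])
        slf[p] = np.sum(np.abs(steered.sum(axis=0)) ** 2)   # beamformer output energy
    return slf
```

Scanning F(y) over the room and taking the peak gives the SRP localization described on the previous slide.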
15Sound Localization
- Filter-and-sum beamforming based (i.e. using Generalized Cross-Correlations [Knapp76])
- The SRP-PHAT algorithm [DiBiase01] uses the Phase Transform
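A hedged reconstruction of the standard formulas behind these references (the slide's own equations are not in this transcript): the PHAT-weighted generalized cross-correlation for a microphone pair (i, j), and the SRP-PHAT spatial likelihood obtained by summing it over all pairs at the delays a source at y would produce:

```latex
R^{\mathrm{PHAT}}_{ij}(\tau) = \int \frac{X_i(\omega)\,X_j^{*}(\omega)}{\left|X_i(\omega)\,X_j^{*}(\omega)\right|}\, e^{j\omega\tau}\, d\omega,
\qquad
F_{\mathrm{SRP\text{-}PHAT}}(y) = \sum_{i<j} R^{\mathrm{PHAT}}_{ij}\!\big(\tau_{ij}(y)\big)
```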
16Microphone Array
17Sound Localization
- The problem with SRP-PHAT is that all microphones are weighted equally
- We should instead be weighting microphones according to their level of access [Aarabi01, Aarabi03, Mungamuru03]
18Microphone Array
Microphone Arrays
19Measuring the reliability of a microphone array
- Three primary factors affect the reliability of a microphone (combined in the sketch below)
- Source Directivity
- Microphone Directivity
- Source-Microphone Distance
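The slides do not give the actual weighting formula; the sketch below is just one plausible way to turn the three factors above into a per-microphone "level of access" weight. The gain-times-gain-over-distance-squared combination, and every name in the code, are assumptions for illustration.

```python
import numpy as np

def microphone_weight(source_pos, source_heading, mic_pos, mic_heading,
                      source_directivity, mic_directivity):
    """Illustrative reliability weight combining the three factors on the
    slide: source directivity, microphone directivity, and distance.
    The directivity arguments are gain functions of angle (0 = on-axis)."""
    v = mic_pos - source_pos
    distance = np.linalg.norm(v)
    direction = v / distance
    # Angle between the source's facing direction and the microphone
    angle_from_source = np.arccos(np.clip(np.dot(source_heading, direction), -1.0, 1.0))
    # Angle between the microphone's facing direction and the source
    angle_from_mic = np.arccos(np.clip(np.dot(mic_heading, -direction), -1.0, 1.0))
    # Stronger mutual directivity gain and shorter distance -> more reliable microphone
    return (source_directivity(angle_from_source)
            * mic_directivity(angle_from_mic)) / distance**2
```

Such weights could then scale each pair's contribution in the SRP-PHAT sum, as sketched after slide 21.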
20Enhanced Sound Localization
- Since we are modeling directivities, it is now possible to extract the source orientation
- So, we now have Enhanced Sound Localization
- The Spatial Likelihood Function is now also a function of the source orientation, θ
21Enhanced Sound Localization
- Temporal ML Algorithm
- Weighted SRP-PHAT Algorithm
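The Weighted SRP-PHAT formula is not reproduced in this transcript; a hedged sketch consistent with slides 17-20 is to scale each microphone pair's correlation term by per-microphone reliability weights (the product form and the symbol w are assumptions):

```latex
F(y, \theta) \;=\; \sum_{i<j} w_i(y, \theta)\, w_j(y, \theta)\; R^{\mathrm{PHAT}}_{ij}\!\big(\tau_{ij}(y)\big)
```

where each weight w_i reflects the source directivity, microphone directivity, and source-microphone distance factors of slide 19.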
22Sound Localization Example Using 24 Mics.
23Sound Localization Example Using 24 Mics.
24Sound Localization Example Using 24 Mics.
(SLF color scale: high likelihood to low likelihood)
25Sound Localization Example Using 24 Mics.
26More experiments with Stationary Speaker
27Results Stationary Speaker
28Experimental Setup
29Moving Speaker Comparison with SRP-PHAT
Weighted SRP-PHAT
SRP-PHAT
30Moving Speaker Comparison with SRP-PHAT
Over 100 seconds (1000 frames of 100 ms each) of moving-speaker trials
- SRP-PHAT: 7.3% anomalies, 23 cm average error
- Weighted SRP-PHAT: 3% anomalies, 20 cm average error
An anomaly is a localization whose distance error is greater than 1 m.
31Now we transition
Sound Localization Algorithms
Sound Localization Hardware Implementation
32Hardware-based Speech Localization
- Problems with current sound localization implementations
- Scalability (not appropriate for >10 mics.)
- Power requirements (not good for mobile applications)
- Space requirements (multiple chips, etc.)
- As a result, we implemented a hardware-based 2-microphone sound localization (TDOA estimation) system in 0.18 µm CMOS
33Hardware-based Speech Localization
- Initially implemented on an FPGA [NguyenICASSP03/ICME03]
- Used the Phase Transform technique, as shown below
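The FPGA/ASIC equations themselves are missing from this transcript; below is a software sketch of two-microphone, PHAT-based TDOA estimation of the kind the hardware presumably implements. Frame handling, variable names, and the small regularization constant are assumptions.

```python
import numpy as np

def phat_tdoa(x1, x2, fs, max_delay=None):
    """Estimate the TDOA between two microphone frames using the
    Phase Transform (PHAT)-weighted cross-correlation."""
    n = len(x1) + len(x2)                       # zero-padded length for linear correlation
    X1, X2 = np.fft.rfft(x1, n), np.fft.rfft(x2, n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12              # PHAT weighting: keep phase, discard magnitude
    r = np.fft.irfft(cross, n)
    r = np.concatenate((r[-(n // 2):], r[:n // 2 + 1]))    # reorder so zero lag is centered
    lags = np.arange(-(n // 2), n // 2 + 1) / fs
    if max_delay is not None:                   # restrict to physically possible delays
        keep = np.abs(lags) <= max_delay
        r, lags = r[keep], lags[keep]
    return lags[np.argmax(r)]
```

For two microphones separated by distance d, max_delay would be d divided by the speed of sound.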
34Solution
- A full-custom ASIC solution is capable of
- 100% resource utilization
- Efficient power utilization
- Efficient scalability options
35Chip Block Diagram
DSP Front-end
DSP Core
36Maximum Likelihood Engine
- Most computationally expensive part of chip
37The Result
38Chip Testing
39Chip Features
- The 1.8 V core consumes 28.98 mW (10 times more efficient than our FPGA implementation, 20-50 times more efficient than typical DSP implementations)
- At 20 dB SNR, about 20% of the localizations resulted in anomalies, with a 2.2 degree average angle error in non-anomalous estimations
- The next step is to combine speech localization and separation into a single VLSI chip, for Tablet PC/PDA/cell phone applications
40Now we transition
Sound Localization Hardware Implementation
Speech Separation
41Speech Separation Using Time-Frequency Masking [AarabiFusion02, ShiICASSP03/ICME03, AarabiICME03]
Question: How can we use knowledge about the location of the sound source in order to remove noise/unwanted background speakers?
42Speech Separation Using Time-Frequency Masking
43Speech Separation Using Time-Frequency Masking
Frequency (ω)
Time index (k)
44How do we process the noisy recordings to get back our signal of interest?
Idea: scale each time-frequency (TF) block based on the phases in each recording
Microphone 1 Recording
Microphone 2 Recording
45Time-Frequency (TF) speech representation
The spectrograms X1k(ω) and X2k(ω) are not the complete frequency-domain representation; we also have the phase functions ∠X1k(ω) and ∠X2k(ω).
Frequency (ω)
Time index (k)
46Using the two phase functions, we obtain a TF
mask
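The mask formula itself is not in this transcript; the sketch below shows one simple phase-error-based TF mask in the spirit of slides 44-46. The binary thresholding, the use of scipy's STFT, and all parameter names are assumptions rather than the authors' exact rule.

```python
import numpy as np
from scipy.signal import stft, istft

def phase_error_mask(x1, x2, fs, tdoa, frame_len=256, threshold=0.5):
    """Attenuate time-frequency blocks whose inter-microphone phase
    difference is inconsistent with the target source's TDOA (seconds)."""
    f, t, X1 = stft(x1, fs, nperseg=frame_len)
    _, _, X2 = stft(x2, fs, nperseg=frame_len)
    omega = 2 * np.pi * f[:, None]                        # rad/s, one row per frequency
    # Phase error between the observed phase difference and the one a
    # target source at this TDOA would produce
    error = np.angle(X1 * np.conj(X2) * np.exp(-1j * omega * tdoa))
    mask = (np.abs(error) < threshold).astype(float)      # 1 = keep block, 0 = suppress
    _, y = istft(X1 * mask, fs, nperseg=frame_len)        # apply mask to microphone 1
    return y
```

The target speaker's TDOA would come from the localization front end described earlier in the talk.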
47Example
Result after applying the mask to the first microphone's signal
Original signal
48Speech Recognition Results
TFM (Time-Frequency Masking) outperforms both DS (Delay-and-Sum) and SD (Superdirective [Bitzer99]) at 0 dB SNR.
49Multi-Microphone Probabilistic Speech Separation [RennieICASSP03/ICME03]
The previous technique assumed no prior knowledge about the speech sources.
Question: How can such prior knowledge be used, in conjunction with the spatial position of the source, to separate multiple speakers?
50M Sources, N Mics.
(Block diagram: source signals s1(t), s2(t), ..., sM(t) mixing into microphone signals x1(t), ..., xN(t))
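The block diagram itself is not reproduced here; a standard delayed-and-attenuated mixing model consistent with the slide's labels (the symbols a, τ, and v are assumptions) would be

```latex
x_n(t) \;=\; \sum_{m=1}^{M} a_{nm}\, s_m(t - \tau_{nm}) \;+\; v_n(t), \qquad n = 1, \dots, N
```

where a_nm and τ_nm are the attenuation and propagation delay from source m to microphone n, and v_n(t) is sensor noise.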
51Multi-Microphone Probabilistic Speech Separation
Our approach:
1. Learn probabilistic models for each source
2. Estimate the original source signals by computing the most likely (or, alternatively, the expected) source signal given the prior speech model, the mixed microphone recordings, and the time delays of arrival
This is an extension of the work of Deng, Kristjansson, Frey, and Acero, as well as others.
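In symbols (a hedged paraphrase of the two steps above, not the slide's own equation), each source estimate is either a maximum a posteriori value or an expected value under the learned prior:

```latex
\hat{s} \;=\; \arg\max_{s}\; p(s)\, p(x_{1:N} \mid s, \tau)
\qquad \text{or} \qquad
\hat{s} \;=\; \mathbb{E}\left[\, s \mid x_{1:N}, \tau \,\right]
```

where p(s) is the learned speech prior, x_{1:N} are the microphone recordings, and τ the time delays of arrival.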
52Graphical model representation (for each frequency)
53Preliminary Results with 2 Microphones
- Each microphone receives a mixture of 2, 3, or 4 speakers, i.e. 0 dB SNR between speakers.
- In addition, the microphone signal is corrupted by independent white Gaussian noise at 20 dB, 10 dB, and 0 dB.
54Now we transition
Speech Separation
Other Topics
55Audiovisual Sound Localization [Aarabi01]
56Acoustic Robot Navigation
57Conclusions
- The fusion of multiple sensors allows for more accurate sound localization and speech recognition.
- Current research efforts include
- Audiovisual Speech Separation
- Dynamic Camera and Microphone Arrays
- Multi-Rate Multi-Microphone Signal Enhancement
- Multi-Microphone Speaker Identification
58Please visit
www.apl.utoronto.ca