WOZ acoustic data collection for interactive TV

Title: Slide 1; Author: luisa; Last modified by: Luca; Created Date: 10/11/2006 10:05:01 AM; Document presentation format: On-screen show

1
WOZ acoustic data collection for interactive TV

A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, M. Omologo
Fondazione Bruno Kessler (FBK) - irst, Via Sommarive 18, 38050 Povo (TN), ITALY
Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg (FAU), Cauerstr. 7, 91058 Erlangen, GERMANY
LREC 2008, Marrakech, 28-30/05/08

2
The DICIT EU project

Distant-talking Interfaces for Control of Interactive TV
User-friendly human-machine interface enabling speech-based interaction with a TV and related digital devices (set-top box, STB)
Interaction in a natural and spontaneous way, without a close-talk microphone

3
The DICIT EU project: Distant-talking Interfaces for Control of Interactive TV
4
The DICIT EU project: Distant-talking Interfaces for Control of Interactive TV
What is the output from each loudspeaker? How is it at each microphone?
Where is she? What is her head orientation?
When did she speak?
Who is she?
What does she say?
5
The DICIT Project

STREP Project, FP6
Strategic objective 2.5.7: Multimodal Interfaces
Duration: October 2006 to September 2009

6
What is a Wizard of Oz (WOZ) experiment?

A subject is asked to complete specific tasks using an artificial system
The user is told that the system is fully functional and should try to use it in an intuitive way
BUT
The system is operated by a person (the wizard) not visible to the subject
The wizard can react in a more comprehensive way and can create particular situations
7
Why a WOZ data collection?

We needed to collect an acoustic database for testing pre-processing algorithms:
  acoustic scene analysis
  speaker ID and verification
  echo cancellation
  blind source separation
  beamforming
  speaker localization and tracking
  distant automatic speech recognition
With a WOZ, realistic scenarios can be simulated at a preliminary stage, allowing for repeatable experiments
There is no need for a fully working system in order to collect real data
Naïve users do not behave like expert users; they use the system in a realistic way

8
The DICIT WOZ

Experiments were conducted in the laboratories at FBK and FAU
A room was set up as a living room with a TV, loudspeakers and seats
An adjacent room, not visible to the users, was used by the wizard and the simulation system
Users watched the TV and had to interact with it by voice and remote control, to change channels and to retrieve information from the teletext pages
At some point, they had to move around and speak with the system

9
Strategy for the recordings

Four users sat in the room, but one of them was the co-wizard, who ensured the regularity of the experiment and produced some acoustic events
Users were recorded by close-talk and distant microphones
Interactions were recorded by 3 fixed cameras that allow automatic tracking of the users' movements
Recordings were made with Italian, German and English groups

10
Hints for wizard and co-wizard

Wizard
  simulate some recognition errors
  don't accept speech for 10 seconds after a volume change (for convergence of the algorithms)
Co-wizard (in the room)
  lead the first phase (user registration)
  produce noises during teletext interaction (key jingle, cough, phone ring, etc.)
  keep the situation under control (give hints to the real users)
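
The wizard's hold-off rule after a volume change can be written down as a small gate. This is only an illustrative sketch: in the experiment the rule was applied by a human wizard, the class name SpeechGate and its methods are hypothetical, and the 10-second hold-off is an assumed reading of the slide.

```python
class SpeechGate:
    """Rejects speech input for a fixed hold-off period after each volume change.

    Hypothetical helper: the hold-off gives adaptive algorithms (for example an
    echo canceller) time to re-converge before speech is accepted again.
    """

    def __init__(self, holdoff_s=10.0):
        self.holdoff_s = holdoff_s          # assumed hold-off, in seconds
        self._last_volume_change = None     # time of the last volume change

    def volume_changed(self, t):
        """Record that the TV volume was changed at time t (seconds)."""
        self._last_volume_change = t

    def accepts_speech(self, t):
        """Return True if speech arriving at time t should be accepted."""
        if self._last_volume_change is None:
            return True
        return (t - self._last_volume_change) >= self.holdoff_s


gate = SpeechGate()
gate.volume_changed(t=100.0)
print(gate.accepts_speech(t=105.0))  # False: still inside the hold-off window
print(gate.accepts_speech(t=110.0))  # True: hold-off has elapsed
```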

11
Script for the interaction (person A)

Enter the room and sit on the seat marked A
Wait for user D to switch on the system, then say your name and read the four phonetically rich sentences on the screen
When user D gives you the remote control, try using it to change channels and volume (next/previous channel, volume up/down, mute)
Connect to the system using your voice: "DICIT activate"
Use your voice to change channels and volume, e.g., "I want to see CNN"
Select the Euronews channel and start the teletext (using your voice or the remote control)
Use the teletext to obtain the requested news and weather info. Please move to different positions in the room when interacting with the system
Log off from the system ("DICIT logoff") and give the remote control to user B

12
The FBK experimental room
A harmonic array of 15 electret microphones was purpose-built and placed above the TV
13
Audio and video sensor setup

A harmonic array of 15 electret microphones was purpose-built and placed above the TV
A NIST MarkIII linear array of 64 electret microphones was used for comparison
A table microphone and 2 side microphones (omnidirectional pattern) were used
Every participant wore a close-talk microphone for reference
3 video cameras recorded the sessions for monitoring and to derive 3D reference positions

14
Clip from a recorded session
15
WOZ preparation

12 video clips and 100 teletext pages were recorded from real TV; everything was available in 3 languages
Stereo audio channels were extracted and decorrelated (by FAU) for the echo canceller, and the clips were recreated to fit the simulation
The system was controlled by a PC running the Elektrobit EB GUIDE Studio simulator tool
A remote control infrared receiver was integrated into the system, so that users could control the TV with a real remote control

16
Recording hardware setup

3 PCs recorded all the data: 2 Linux machines and 1 Windows machine

17
Recorded sessions

FBK and FAU recorded different sessions in different languages, using a similar setup
Each user interaction lasted about 10 minutes, for a total of 360 minutes of recordings
24 or 26 synchronous channels were recorded at 48 kHz with 16-bit precision, plus 64 channels from the MarkIII array at 44.1 kHz and 24 bits

Site   Language   Number of sessions
FBK    Italian    6
FAU    German     5
FAU    English    1
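
The channel counts, sampling rates and sample widths above determine the raw audio data rate of each stream. The sketch below works through that arithmetic for uncompressed PCM. The function names are illustrative, and no total is derived because the slides do not state which sessions included the MarkIII array or whether video is counted in the stored data.

```python
def stream_rate_bytes_per_s(channels, sample_rate_hz, bits_per_sample):
    """Raw PCM data rate of a multichannel stream, in bytes per second."""
    return channels * sample_rate_hz * bits_per_sample // 8


def megabytes_per_minute(rate_bytes_per_s):
    """Convert a byte rate to megabytes (10**6 bytes) per minute."""
    return rate_bytes_per_s * 60 / 1e6


# Figures taken from the slide; everything else is plain arithmetic.
main_24 = stream_rate_bytes_per_s(24, 48_000, 16)   # 2_304_000 bytes/s
main_26 = stream_rate_bytes_per_s(26, 48_000, 16)   # 2_496_000 bytes/s
markiii = stream_rate_bytes_per_s(64, 44_100, 24)   # 8_467_200 bytes/s

for name, rate in [("24 ch @ 48 kHz/16 bit", main_24),
                   ("26 ch @ 48 kHz/16 bit", main_26),
                   ("MarkIII 64 ch @ 44.1 kHz/24 bit", markiii)]:
    print(f"{name}: {rate} B/s, {megabytes_per_minute(rate):.1f} MB/min")
```

On these figures the 64-channel MarkIII stream is by far the largest contributor to the audio data volume wherever it was recorded.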
18
Data annotation

The 6 Italian sessions have been manually transcribed and segmented at word level using Transcriber
An automatic segmentation was obtained with a tool based on the energy of the close-talk signals, then adjusted where necessary
A stereo file was created, with two channels for close-talk and environment sounds, to ease the annotation process
Annotation comprises the speaker ID, the transcription of the uttered sentence and any noise included in the acoustic event list
Specific labels for acoustic events have been introduced, following a defined guideline
Video data have been used to derive 3D coordinates for the head of the speaker, and reference files were created at a rate of 5 labels per second
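
An energy-based segmentation of a close-talk signal can be sketched as follows. This is not the tool used in the project, whose details are not given on the slides: the frame length, the fixed threshold and the function names are all assumptions chosen for illustration.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def energy_segments(samples, sample_rate, frame_len, threshold):
    """Return (start_s, end_s) pairs for runs of frames whose energy exceeds threshold."""
    energies = frame_energies(samples, frame_len)
    frame_s = frame_len / sample_rate   # frame duration in seconds
    segments = []
    start = None                        # index of the first frame of the current run
    for idx, energy in enumerate(energies):
        active = energy > threshold
        if active and start is None:
            start = idx
        elif not active and start is not None:
            segments.append((start * frame_s, idx * frame_s))
            start = None
    if start is not None:               # signal ends while still active
        segments.append((start * frame_s, len(energies) * frame_s))
    return segments


# Toy signal at 100 Hz: 1 s of silence, 1 s of constant amplitude 0.5, 1 s of silence.
sr = 100
signal = [0.0] * sr + [0.5] * sr + [0.0] * sr
print(energy_segments(signal, sr, frame_len=10, threshold=0.01))  # [(1.0, 2.0)]
```

Segment boundaries from a detector of this kind are only as fine as the frame length, which is one reason a manual adjustment pass is still useful.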

19
Data exploitation / testing

Data have been used for a preliminary evaluation of some FBK algorithms:
  localization techniques: precision is around 30 cm
  682 108 audio segments have been used for the acoustic event classification system, with 92% accuracy
  data have been used to test the speaker verification and identification system, but close-talk is still better than the beamformed signal
Room impulse response measurements have been carried out at both sites, in different positions. They are useful, for example, for speech contamination purposes

20
Transcriber session
21
Conclusions

This collection of data has been the first of its kind and is of significant benefit to acoustic front-end algorithms and dialogue strategies
36 naïve users have been recorded, yielding 360 minutes of signals on 24-26 different channels recorded synchronously (125 GB of data)
Users enjoyed the system and tolerated some recognition errors; they preferred the voice modality over remote control interaction

22
Current status of the project