Title: WOZ acoustic data collection for interactive TV
1. WOZ acoustic data collection for interactive TV
- A. Brutti, L. Cristoforetti, W. Kellermann, L. Marquardt, M. Omologo
- Fondazione Bruno Kessler (FBK) - irst, Via Sommarive 18, 38050 Povo (TN), ITALY
- Multimedia Communications and Signal Processing, University of Erlangen-Nuremberg (FAU), Cauerstr. 7, 91058 Erlangen, GERMANY
- LREC 2008 Marrakech, 28-30/05/08
2. The DICIT EU project
- Distant-talking Interfaces for Control of Interactive TV
- User-friendly human-machine interface to enable a speech-based interaction with a TV and related digital devices (STB)
- Interaction in a natural and spontaneous way without a close-talk microphone
3. The DICIT EU project: Distant-talking Interfaces for Control of Interactive TV
4. The DICIT EU project: Distant-talking Interfaces for Control of Interactive TV
- What is the output from each loudspeaker? How is it at each microphone?
- Where is she? What is her head orientation?
- When did she speak?
- Who is she?
- What does she say?
5. The DICIT Project
- STREP Project FP6
- Strategic objective 2.5.7 Multimodal Interfaces
- Duration: October 2006 to September 2009
6. What is a Wizard of Oz (WOZ) experiment?
- A subject is asked to complete specific tasks using an artificial system
- The user is told that the system is fully functional and should try to use it in an intuitive way
BUT
- The system is operated by a person (the wizard) not visible to the subject
- The wizard can react in a more comprehensive way and can create particular situations
7. Why a WOZ data collection?
- We needed to collect an acoustic database for testing pre-processing algorithms:
  - acoustic scene analysis
  - speaker ID and verification
  - echo cancellation
  - blind source separation
  - beamforming
  - speaker localization and tracking
  - distant automatic speech recognition
- With a WOZ, realistic scenarios can be simulated at a preliminary stage, allowing for repeatable experiments
- There is no need for a fully working system in order to collect real data
- Naïve users do not behave like expert users; they use the system in a realistic way
8. The DICIT WOZ
- Experiments were conducted in the laboratories at FBK and FAU
- One room was set up as a living room with TV, loudspeakers and seats
- An adjacent room was used by the wizard and the simulation system, not visible to the users
- Users watched the TV and had to interact with it by voice and remote control, to change channels and to retrieve information from the teletext pages
- At some point, they had to move around and speak with the system
9. Strategy for the recordings
- Four users sat in the room; one of them was the co-wizard, who ensured the regularity of the experiment and produced some acoustic events
- Users were recorded by close-talk and far-field microphones
- Interactions were recorded by 3 fixed cameras that allow the automatic tracking of users' movements
- Recordings were done with Italian, German and English groups
10. Hints for wizard and co-wizard
- Wizard
  - simulate some recognition errors
  - don't accept speech for 10 s after a volume change (for convergence of the algorithms)
- Co-wizard (in the room)
  - lead the first phase (user registration)
  - produce noises during teletext interaction (key jingle, cough, phone ring, etc.)
  - keep the situation under control (give hints to the real users)
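The two wizard hints above amount to a small decision rule. The following is only an illustrative sketch, not the DICIT simulation tool: the class `WizardPolicy`, its method names, the error rate, and the reading of the hold-off as 10 seconds are all assumptions made for the example.

```python
import random


class WizardPolicy:
    """Illustrative decision rule for the wizard (all names are hypothetical)."""

    def __init__(self, holdoff_s=10.0, error_rate=0.1, seed=0):
        self.holdoff_s = holdoff_s    # assumed pause after a volume change, in seconds
        self.error_rate = error_rate  # assumed share of simulated recognition errors
        self.last_volume_change = None
        self.rng = random.Random(seed)

    def volume_changed(self, t):
        """Record the time (in seconds) of the latest volume change."""
        self.last_volume_change = t

    def react(self, t, command):
        """Decide how to answer a spoken command heard at time t."""
        # Ignore speech while the algorithms are assumed to be re-converging.
        if (self.last_volume_change is not None
                and t - self.last_volume_change < self.holdoff_s):
            return "ignored"
        # Occasionally simulate a recognition error.
        if self.rng.random() < self.error_rate:
            return "misrecognized"
        return "accepted: " + command
```

For example, after `policy.volume_changed(100.0)`, a command at `t = 105.0` is ignored, while the same command at `t = 110.0` is processed normally.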
11. Script for the interaction (person A)
- Enter the room and sit on the seat marked A
- Wait for user D to switch on the system and say your name, then read the four phonetically rich sentences on the screen
- When user D gives you the remote control, try using it to change channels and volume (next/previous channel, volume up/down, mute)
- Connect to the system using your voice: "DICIT activate"
- Use your voice to change channels and volume, e.g., "I want to see CNN"
- Select the Euronews channel and start the teletext (using your voice or the remote control)
- Use the teletext to obtain the requested news and weather info. Please move to different positions in the room when interacting with the system
- Log off from the system ("DICIT logoff") and give the remote control to user B
12. The FBK experimental room
A harmonic 15-electret-microphone array was purpose-built and placed above the TV
13. Audio and video sensor setup
- A harmonic 15-electret-microphone array was purpose-built and placed above the TV
- A NIST MarkIII 64-electret-microphone linear array was used for comparison
- A table microphone and 2 side mics were used (omnidirectional pattern)
- Every participant wore a close-talk mic for reference
- 3 video cameras recorded the sessions for monitoring and to derive 3D reference positions
14. Clip from a recorded session
15. WOZ preparation
- 12 video clips and 100 teletext pages were recorded from real TV; everything was available in 3 languages
- Stereo audio channels were extracted and decorrelated (by FAU) for the echo canceller, and clips were recreated to fit the simulation
- The system was controlled by a PC running the Elektrobit EB GUIDE Studio simulator tool
- A remote control infrared receiver was integrated into the system, so that users could operate the TV with a real remote control
16. Recording hardware setup
- 3 PCs recorded all the data: 2 Linux machines and 1 Windows machine
17. Recorded sessions
- FBK and FAU recorded different sessions using a similar setup, in different languages
- Each user interaction lasted about 10 minutes, for a total of 360 minutes of recordings
- 24 or 26 synchronous channels were recorded at 48 kHz with 16-bit precision, plus 64 channels from the MarkIII array at 44.1 kHz and 24 bits

Site   Language   Number of sessions
FBK    Italian    6
FAU    German     5
FAU    English    1
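The totals quoted in this talk can be cross-checked from the session table. The sketch below assumes three naïve users per session (four people in the room, one of them the co-wizard) and exactly 10 minutes per user. The data-rate helper `pcm_bytes_per_second` is a hypothetical name and assumes uncompressed PCM with no file-format overhead.

```python
# Sessions per (site, language), taken from the table above.
sessions = {("FBK", "Italian"): 6, ("FAU", "German"): 5, ("FAU", "English"): 1}

NAIVE_USERS_PER_SESSION = 3  # assumption: four people in the room, one is the co-wizard
MINUTES_PER_USER = 10        # "about 10 minutes" per interaction

total_sessions = sum(sessions.values())
total_users = total_sessions * NAIVE_USERS_PER_SESSION
total_minutes = total_users * MINUTES_PER_USER


def pcm_bytes_per_second(channels, sample_rate_hz, bits_per_sample):
    """Raw data rate of uncompressed PCM audio, ignoring container overhead."""
    return channels * sample_rate_hz * bits_per_sample // 8


main_rate = pcm_bytes_per_second(24, 48_000, 16)     # 24-channel configuration
markiii_rate = pcm_bytes_per_second(64, 44_100, 24)  # MarkIII array

print(total_sessions, total_users, total_minutes)
print(main_rate, markiii_rate)
```

Under these assumptions the script yields 12 sessions, 36 users and 360 minutes, which agrees with the figures given in the talk.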
18. Data annotation
- The 6 Italian sessions have been manually transcribed and segmented at word level, using Transcriber
- An automatic segmentation was obtained with a tool based on the energy of the close-talk signals, then adjusted when necessary
- A stereo file was created, with two channels for close-talk and environment sounds, to ease the annotation process
- Annotation comprises the speaker ID, the transcription of the uttered sentence and any noise included in the acoustic event list
- Specific labels for acoustic events have been introduced, following a defined guideline
- Video data have been used to derive 3D coordinates for the head of the speaker, and reference files were created with a frame rate of 5 labels per second
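The energy-based pre-segmentation of the close-talk signals can be illustrated with a minimal sketch. This is not the tool used in the project: the function name `segment_by_energy`, the 20 ms frame length and the fixed threshold are assumptions chosen for the example, and a practical tool would also smooth over short pauses.

```python
def segment_by_energy(samples, sample_rate, frame_ms=20, threshold=0.01):
    """Return (start_s, end_s) spans whose mean frame energy exceeds threshold.

    samples is a sequence of floats; frame_ms and threshold are
    illustrative defaults, not values from the DICIT tool.
    """
    frame_len = sample_rate * frame_ms // 1000
    n_frames = len(samples) // frame_len
    segments = []
    start = None  # index of the first frame of the current active span
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        active = energy > threshold
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((start * frame_len / sample_rate,
                             i * frame_len / sample_rate))
            start = None
    if start is not None:  # signal ends while still active
        segments.append((start * frame_len / sample_rate,
                         n_frames * frame_len / sample_rate))
    return segments
```

The returned spans are only a first pass; as noted above, the automatic boundaries were then adjusted by hand where necessary.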
19. Data exploitation / testing
- Data have been used for a preliminary evaluation of some FBK algorithms:
  - localization techniques: precision is around 30 cm
  - 682 108 audio segments have been used for the acoustic event classification system, with 92% accuracy
  - data have been used to test the speaker verification and identification system, but the close-talk signal still performs better than the beamformed signal
- Room impulse response measurements have been carried out at both sites, in different positions. They are useful, e.g., for speech contamination purposes
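Speech contamination with a measured room impulse response usually means convolving a clean (close-talk) signal with that response to simulate a distant microphone. A minimal pure-Python sketch follows; the function name `contaminate` is hypothetical, and the direct-form loop is chosen for clarity rather than speed (real recordings would call for an FFT-based convolution such as `scipy.signal.fftconvolve`).

```python
def contaminate(clean, rir):
    """Convolve a clean signal with a room impulse response (direct form).

    Both arguments are sequences of floats at the same sampling rate.
    The output has len(clean) + len(rir) - 1 samples.
    """
    out = [0.0] * (len(clean) + len(rir) - 1)
    for i, x in enumerate(clean):
        for j, h in enumerate(rir):
            out[i + j] += x * h
    return out


# Toy example: a unit impulse passed through a two-tap "room"
# (direct path plus one attenuated reflection).
print(contaminate([1.0, 0.0, 0.0], [1.0, 0.5]))
```

In this toy case the output simply reproduces the impulse response, padded with zeros.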
20. Transcriber session
21. Conclusions
- This collection of data has been the first of its kind and is of significant benefit to acoustic front-end algorithms and dialogue strategies
- 36 naïve persons have been recorded, leading to 360 minutes of signals on 24-26 different channels recorded synchronously (125 GB of data)
- Users enjoyed the system and tolerated some recognition errors; they preferred the voice modality over remote control interaction
22. Current status of the project
- The project is in its second year
- We have just finished integrating the first prototype
- We are ready to start the evaluation of the prototype
- More information and demo clips can be found at http://dicit.fbk.eu