WOZ acoustic data collection for interactive TV

Source file metadata: title: Slide 1; author: luisa; last modified by: Luca; created: 10/11/2006 10:05:01 AM; document presentation format: on-screen show

Learn more at: http://www.lrec-conf.org

1
WOZ acoustic data collection for interactive TV
  • A. Brutti, L. Cristoforetti, W. Kellermann, L.
    Marquardt, M. Omologo
  • Fondazione Bruno Kessler (FBK) - irst
  • Via Sommarive 18, 38050 Povo (TN), ITALY
  • Multimedia Communications and Signal
    Processing, University of Erlangen-Nuremberg
    (FAU)
  • Cauerstr. 7, 91058 Erlangen, GERMANY
  • LREC 2008 Marrakech, 28-30/05/08

2
The DICIT EU project
  • Distant-talking Interfaces for Control of
    Interactive TV
  • User-friendly human-machine interface to enable a
    speech-based interaction with a TV and related
    digital devices (STB)
  • Interaction in a natural and spontaneous way
    without a close-talk microphone

3
The DICIT EU project: Distant-talking Interfaces
for Control of Interactive TV
4
The DICIT EU project: Distant-talking Interfaces
for Control of Interactive TV
  • What is the output from each loudspeaker? How is
    it at each microphone?
  • Where is she? What is her head orientation?
  • When did she speak?
  • Who is she?
  • What does she say?
5
The DICIT Project
  • STREP project (FP6)
  • Strategic objective 2.5.7: Multimodal
    Interfaces
  • Duration: October 2006 to September 2009

6
What is a Wizard of Oz (WOZ) experiment?
  • A subject is requested to complete specific tasks
    using an artificial system
  • The user is told that the system is fully
    functional and should try to use it in an
    intuitive way
  • The system is operated by a person (wizard) not
    visible to the subject
  • The wizard can react in a more comprehensive way
    and can create particular situations

BUT
7
Why a WOZ data collection?
  • We needed to collect an acoustic database for
    testing pre-processing algorithms:
    • acoustic scene analysis
    • speaker ID and verification
    • echo cancellation
    • blind source separation
    • beamforming
    • speaker localization and tracking
    • distant automatic speech recognition
  • With a WOZ, realistic scenarios can be simulated
    at a preliminary stage, allowing for repeatable
    experiments
  • There is no need to have a fully working system in
    order to collect real data
  • Naïve users do not behave like expert users;
    they use the system in a realistic way

8
The DICIT WOZ
  • Experiments were conducted in the laboratories at
    FBK and FAU
  • A room was used as a living room with TV,
    loudspeakers and seats
  • An adjacent room was used by the wizard and the
    simulation system, not visible to the users
  • Users watched the TV and had to interact with it
    by voice and remote control, to change channels
    and to retrieve information from the teletext
    pages
  • At some point, they had to move around and speak
    with the system

9
Strategy for the recordings
  • Four users sat in the room, but one of them was
    the co-wizard, who ensured the regularity of the
    experiment and produced some acoustic events
  • Users were recorded by close-talk and distant
    microphones
  • Interactions were recorded by 3 fixed cameras
    that allow the automatic tracking of users'
    movements
  • Recordings were made with Italian, German and
    English groups

10
Hints for wizard and co-wizard
  • Wizard:
    • simulate some recognition errors
    • do not accept speech for 10 seconds after a
      volume change (for convergence of the
      algorithms)
  • Co-wizard (in the room):
    • lead the first phase (user registration)
    • produce noises during teletext interaction (key
      jingle, cough, phone ring, etc.)
    • keep the situation under control (give hints to
      the real users)
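
The wizard's hold-off rule after a volume change can be sketched as a small time gate. This is only an illustration: the class and method names are hypothetical, and the 10-second hold-off assumes the "10" on the slide means seconds.

```python
class VolumeHoldoff:
    """Reject speech input for a fixed time after each volume change.

    Hypothetical helper: the hold-off gives adaptive algorithms (for
    example an echo canceller) time to re-converge.
    """

    def __init__(self, holdoff_s=10.0):
        self.holdoff_s = holdoff_s   # assumed to be 10 seconds
        self.last_change_s = None    # time of the most recent volume change

    def volume_changed(self, now_s):
        """Record the time (in seconds) at which the volume was changed."""
        self.last_change_s = now_s

    def accepts_speech(self, now_s):
        """Return True if a spoken command may be accepted at time now_s."""
        if self.last_change_s is None:
            return True
        return (now_s - self.last_change_s) >= self.holdoff_s


gate = VolumeHoldoff()
gate.volume_changed(now_s=100.0)
print(gate.accepts_speech(now_s=104.0))  # False: only 4 s have passed
print(gate.accepts_speech(now_s=110.0))  # True: 10 s have passed
```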

11
Script for the interaction (person A)
  • Enter the room and sit on the seat marked A
  • Wait for user D to switch on the system and say
    your name, then read the four phonetically rich
    sentences on the screen
  • When user D gives you the remote control, try
    using it to change channels and volume
    (next/previous channel, volume up/down, mute)
  • Connect to the system using your voice: "DICIT
    activate"
  • Use your voice to change channels and volume,
    e.g., "I want to see CNN"
  • Select the Euronews channel and start the teletext
    (using your voice or the remote control)
  • Use the teletext to obtain the requested news and
    weather info. Please move to different positions
    in the room when interacting with the system
  • Log off from the system ("DICIT logoff") and give
    the remote control to user B

12
The FBK experimental room
A harmonic 15-electret-microphone array was
purpose-built and placed above the TV
13
Audio and video sensor setup
  • A harmonic 15-electret-microphone array was
    purpose-built and placed above the TV
  • A NIST MarkIII 64-electret-microphone linear
    array was used for comparison
  • A table microphone and 2 side mics were
    used (omnidirectional pattern)
  • Every participant wore a close-talk mic for
    reference
  • 3 video cameras recorded the sessions for
    monitoring and to derive 3D reference positions

14
Clip from a recorded session
15
WOZ preparation
  • 12 video clips and 100 teletext pages were
    recorded from real TV; everything was available
    in 3 languages
  • Stereo audio channels were extracted and
    decorrelated (by FAU) for the echo canceller and
    clips were recreated to fit the simulation
  • The system was controlled by a PC running the
    Elektrobit EB GUIDE Studio simulator tool
  • A remote control infrared receiver was integrated
    into the system and enabled the users to use a
    real remote control to operate the TV

16
Recording hardware setup
  • 3 PCs recorded all the data: 2 Linux machines and
    1 Windows machine

17
Recorded sessions
  • FBK and FAU recorded different sessions using a
    similar setup, in different languages
  • Each user interaction lasted about 10 minutes, for
    a total of 360 minutes of recordings
  • 24 or 26 synchronous channels were recorded at
    48 kHz with 16-bit precision, plus 64 channels
    from the MarkIII array at 44.1 kHz and 24 bits

Site   Language   Number of sessions
FBK    Italian    6
FAU    German     5
FAU    English    1
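
The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below assumes three naïve users per session (four people in the room minus the co-wizard) and uncompressed PCM audio; the function and variable names are illustrative. It computes raw data rates only and does not try to reproduce the total size of the corpus.

```python
# Sessions per (site, language), as listed in the table above.
sessions = {("FBK", "Italian"): 6, ("FAU", "German"): 5, ("FAU", "English"): 1}

NAIVE_USERS_PER_SESSION = 3  # assumption: four people in the room, one is the co-wizard
MINUTES_PER_USER = 10        # "about 10 minutes" per user interaction

total_sessions = sum(sessions.values())
total_users = total_sessions * NAIVE_USERS_PER_SESSION
total_minutes = total_users * MINUTES_PER_USER


def pcm_rate_bytes_per_s(channels, sample_rate_hz, bits_per_sample):
    """Raw data rate of uncompressed multichannel PCM audio."""
    return channels * sample_rate_hz * bits_per_sample // 8


rate_24ch = pcm_rate_bytes_per_s(24, 48_000, 16)
rate_26ch = pcm_rate_bytes_per_s(26, 48_000, 16)
rate_mark3 = pcm_rate_bytes_per_s(64, 44_100, 24)

print(total_sessions, total_users, total_minutes)  # 12 36 360
print(rate_24ch, rate_26ch, rate_mark3)            # 2304000 2496000 8467200
```

Under these assumptions the MarkIII array alone produces roughly 8.5 MB of raw audio per second, more than three times the rate of the other 24 or 26 channels combined.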
18
Data annotation
  • The 6 Italian sessions have been manually
    transcribed and segmented at word level, using
    Transcriber
  • An automatic segmentation was obtained with a
    tool based on energy of the close-talk signals,
    then adjusted when necessary
  • A stereo file was created, with two channels for
    close-talk and environment sounds to ease the
    annotation process
  • Annotation comprises the speaker ID, the
    transcription of the uttered sentence and any
    noise included in the acoustic event list
  • Specific labels for acoustic events have been
    introduced, following a defined guideline
  • Video data has been used to derive 3D coordinates
    for the head of the speaker and reference files
    were created with a frame rate of 5 labels per
    second
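
An energy-based segmentation of a close-talk signal could look roughly like the sketch below. It is not the tool used in the project: the frame length, the threshold and the function names are assumptions chosen for illustration, and samples are taken to be floats in the range -1 to 1.

```python
def frame_energies(samples, frame_len):
    """Mean squared amplitude of consecutive non-overlapping frames."""
    return [
        sum(x * x for x in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def segment_by_energy(samples, sample_rate_hz, frame_ms=20, threshold=0.01):
    """Return (start_s, end_s) pairs where frame energy exceeds the threshold."""
    frame_len = int(sample_rate_hz * frame_ms / 1000)
    energies = frame_energies(samples, frame_len)
    segments = []
    start = None  # index of the first frame of the current active run
    for idx, energy in enumerate(energies):
        active = energy > threshold
        if active and start is None:
            start = idx
        elif not active and start is not None:
            segments.append((start * frame_len / sample_rate_hz,
                             idx * frame_len / sample_rate_hz))
            start = None
    if start is not None:  # signal ends while still active
        segments.append((start * frame_len / sample_rate_hz,
                         len(energies) * frame_len / sample_rate_hz))
    return segments


# Toy signal at 1 kHz: 50 ms silence, 30 ms of constant amplitude, 20 ms silence.
signal = [0.0] * 50 + [0.5] * 30 + [0.0] * 20
print(segment_by_energy(signal, sample_rate_hz=1000, frame_ms=10))
# [(0.05, 0.08)]
```

Boundaries found this way are only as fine as the frame length, which is one reason a manual adjustment pass, as described above, remains useful.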

19
Data exploitation / testing
  • Data have been used for a preliminary evaluation
    of some FBK algorithms:
    • localization techniques: precision is around 30
      cm
    • 682 108 audio segments have been used for the
      acoustic event classification system, with 92%
      accuracy
    • data have been used to test the speaker
      verification and identification system, but
      close-talk is still better than the beamformed
      signal
  • Room impulse response measurements have been
    carried out at both sites, in different
    positions. They are useful, e.g., for speech
    contamination purposes
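
Speech contamination is read here as convolving clean (close-talk) speech with a measured room impulse response to simulate a distant-talking recording; that reading, the function name and the toy values below are assumptions, not taken from the slides. A minimal sketch in plain Python:

```python
def convolve(signal, impulse_response):
    """Direct-form linear convolution of two sample sequences."""
    out = [0.0] * (len(signal) + len(impulse_response) - 1)
    for i, s in enumerate(signal):
        for j, h in enumerate(impulse_response):
            out[i + j] += s * h
    return out


# Toy data: a short "clean" signal and an impulse response made of the
# direct path plus one reflection two samples later at half amplitude.
clean = [1.0, 0.0, 0.0, 0.5]
room_ir = [1.0, 0.0, 0.5]

contaminated = convolve(clean, room_ir)
print(contaminated)  # [1.0, 0.0, 0.5, 0.5, 0.0, 0.25]
```

Real impulse responses are thousands of samples long, so an FFT-based convolution would normally replace the double loop; the direct form is kept here for clarity.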

20
Transcriber session
21
Conclusions
  • This collection of data has been the first of its
    kind and is of significant benefit to acoustic
    front-end algorithms and dialogue strategies
  • 36 naïve persons have been recorded, leading to
    360 minutes of signals, on 24-26 different
    channels recorded in a synchronous way (125 GB of
    data)
  • Users enjoyed the system and tolerated some
    recognition errors; they preferred the voice
    modality over remote control interaction

22
Current status of the project
  • The project is in its second year
  • We have just finished integrating the first
    prototype
  • We are ready to start the evaluation of the
    prototype
  • More information and demo clips can be found
    at http://dicit.fbk.eu

23
  • Thank You!