Title: Overview of ICMI'02 (International Conference on Multimodal Interfaces, 2002)
Overview of ICMI'02 (International Conference on Multimodal Interfaces, 2002)
General Information
- Official website: http://www.is.cs.cmu.edu/icmi/
- Submission results:
Paper Submission
165 submissions in total
Paper Acceptance
87 papers in the Proceedings
Attendees
About 150 attendees
Conference Outline
Sensors, Tools and Platforms
Speech Generation/Recognition
Speech/Text
Dialogue/Language Understanding
Translation/Multilingual Interface
Signing, Gesturing, Writing
Vision
Gaze Tracking and Lipreading
Face Detection/Recognition
Application, User Study, Evaluation
Keynote Speakers
- Three keynote speakers:
- Hiroshi Ishii, Director of the Tangible Media Group, MIT Media Lab
- Lucas Parra, Sarnoff Corporation
- Clifford Nass, Stanford University
Keynote Talk I
- Hiroshi Ishii, "Tangible Bits: Designing the Seamless Interface between People, Bits, and Atoms"
- Goal: change the "painted bits" of GUIs (graphical user interfaces) into "tangible bits" (physical user interfaces)
Keynote Talk I (cont.)
- Tangible user interfaces employ physical objects, surfaces, and spaces as tangible embodiments of digital information
- Three key concepts:
- Interactive Surfaces: surfaces (e.g., walls, desktops) become active interfaces
- Coupling of Bits and Atoms: seamless coupling of graspable objects (e.g., books, cards) with digital information
- Ambient Media: use of ambient media (e.g., sound, light) as a background interface
Keynote Talk II
- Lucas Parra, "Noninvasive Brain Computer Interfaces for Rehabilitation and Augmentation"
- Brain-computer interfaces: reading information directly from the brain, instead of typing, writing, or pointing
Keynote Talk II (cont.)
- Advances in non-invasive brain imaging
- Avoids exposure to X-ray radiation
- Applications:
- Rehabilitation research (for the past two decades)
- Augmenting HCI (now)
Keynote Talk III
- Clifford Nass, "Integrating Multiple Modalities: Psychology and Design of Multimodal Interfaces"
- Question: how do USERS integrate the modalities and content? (social psychology experiments)
- When should synthetic and veridical aspects of interfaces be mixed?
- What is the link between the visual appearance and the voice characteristics of pictorial agents?
- How should interfaces respond to misunderstandings (a common problem in multimodal interfaces)?
- How do multimodal interfaces manifest gender, personality, and emotion?
Keynote Talk III (cont.)
- An example of their study results:
- When integrating a voice and a face, they should be consistent
- A non-native speaker's face with a non-native speaker's voice works better than a non-native speaker's face with a native speaker's voice
- You can't internationalize a website or agent by manipulating only one dimension
- Nass's slides
Speech Recognition/Generation
- Covariance-tied Clustering in Speaker Identification (Zhiqiang Wang et al.)
- Problem description:
- Speaker identification uses GMM models trained with the EM algorithm, which suffers from local maxima
- Solution: K-means clustering -> better initial model
- Focus: how to provide more robust initial models using K-means clustering
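The following slides describe the authors' own initialization scheme; as a baseline illustration of the general idea (a clustering-based starting point makes EM less likely to settle in a poor local maximum of the likelihood), here is a minimal sketch using scikit-learn. The library choice, feature values, and parameters are my assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for per-speaker MFCC frames (the paper uses real speech data).
rng = np.random.default_rng(0)
frames = rng.standard_normal((5000, 26))  # 26-dim features, as on the results slide

# EM initialized from k-means cluster assignments: a better starting point
# than random initialization, so EM is less likely to converge to a poor
# local maximum.
gmm = GaussianMixture(n_components=64, covariance_type='diag',
                      init_params='kmeans', max_iter=100, random_state=0)
gmm.fit(frames)
print(gmm.lower_bound_)  # final per-frame log-likelihood bound
```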
Speech Recognition/Generation (cont.)
- Euclidean distance
- The most widely used distance for K-means clustering
- Takes no account of how the points are distributed
- Mahalanobis distance
- Weights the distance by the covariance matrix
- Spheres the data and defines a better distance
- But data sparseness has to be solved
- More parameters to estimate (the covariance matrix)
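For reference, the Mahalanobis distance between a point and a centroid is the Euclidean distance after sphering the data by the covariance matrix. A small illustration (my own, not the paper's code):

```python
import numpy as np

def mahalanobis(x, mu, cov):
    """Mahalanobis distance: Euclidean distance after sphering by cov."""
    diff = x - mu
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

x = np.array([1.0, 2.0])
mu = np.array([0.0, 0.0])
cov = np.array([[4.0, 0.0],    # large variance along the first axis...
                [0.0, 0.25]])  # ...small variance along the second
print(mahalanobis(x, mu, cov))  # downweights the high-variance axis
print(np.linalg.norm(x - mu))   # plain Euclidean distance, for contrast
```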
Speech Recognition/Generation (cont.)
- Layered clustering algorithm:
1. Train one covariance matrix on all the speech data
2. Use the K-means algorithm with the Mahalanobis distance to split the data into two clusters
3. Get two new covariance matrices
4. Iterate steps 2-3 until there are enough clusters
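A compact sketch of this layered splitting as I read the steps above; the stopping rule and the choice of which cluster to split next are my assumptions. Mahalanobis K-means with a shared covariance is run as ordinary Euclidean K-means in whitened (sphered) coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans

def split_layered(data, n_clusters):
    """Recursively split data, re-estimating one covariance matrix per
    cluster at each layer (my reading of the slide's steps)."""
    clusters = [data]
    while len(clusters) < n_clusters:
        # Split the largest cluster into two (selection rule is an assumption).
        clusters.sort(key=len, reverse=True)
        big = clusters.pop(0)
        cov = np.cov(big, rowvar=False)            # one covariance for this layer
        whiten = np.linalg.cholesky(np.linalg.inv(cov))
        # Euclidean K-means on whitened data == Mahalanobis K-means on raw data.
        labels = KMeans(n_clusters=2, n_init=10).fit_predict(big @ whiten)
        clusters += [big[labels == 0], big[labels == 1]]
    return clusters
```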
Speech Recognition/Generation (cont.)
- Speaker database (83 male, 83 female speakers)
- 26-dimensional MFCC vectors
- 64-mixture GMMs
- Error rates (table not reproduced here)
- The numbers represent the layers of clustering
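The 26-dimensional vectors are plausibly 13 static MFCCs plus their first-order deltas; a sketch of such a front end with librosa, where the 13+13 split and the file path are my assumptions:

```python
import librosa
import numpy as np

# Load an utterance (file path is a placeholder).
y, sr = librosa.load('utterance.wav', sr=16000)

# 13 static MFCCs plus their first-order deltas -> 26-dim frame vectors.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
features = np.vstack([mfcc, delta]).T  # shape: (num_frames, 26)
print(features.shape)
```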
Gesturing
- Prosody-based Co-analysis for Continuous Recognition of Coverbal Gestures (Sanshzar Kettebekov et al.)
- Motivation: better ways to recognize gesture
- Previous work:
- Combination of speech and gesture to boost classification (semantically motivated)
- This work:
- Fuse more elementary features with the gesture features: prosodic features, including the fundamental frequency (F0) contour and voiceless intervals (pauses)
- Construct a co-occurrence model (see the sketch below)
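A sketch of extracting the two prosodic cues just named, using librosa's pYIN pitch tracker as a stand-in for whatever F0 extractor the authors actually used (their tooling is not stated on the slides):

```python
import librosa
import numpy as np

y, sr = librosa.load('speech.wav', sr=16000)  # placeholder file path

# F0 contour via the pYIN tracker; unvoiced frames come back as NaN.
f0, voiced_flag, _ = librosa.pyin(y, fmin=65.0, fmax=400.0, sr=sr)

# Voiceless intervals (pauses) show up as runs of consecutive unvoiced frames.
unvoiced = ~voiced_flag
print('F0 frames:', f0.shape[0], 'unvoiced frames:', int(unvoiced.sum()))
```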
Gesturing (cont.)
- Results:
- Correct recognition rate: 81.8% (vs. 72.4% with only the visual features)
- Deletion errors: 8.6% (vs. 16.1%)
- Substitution errors: 5.8% (vs. 9.2%)
- Comment: co-analysis of audio-visual features should be helpful elsewhere too, such as for monologue detection in TREC
Multilingual Interface
- Improved Named Entity Translation and Bilingual Named Entity Extraction (Fei Huang, Stephan Vogel)
- Motivation: improve named entity annotation quality using bilingual corpus information
- Basic idea:
- Cross-lingual information can provide hints for correcting named entity extraction errors in the baseline system
- An extracted named entity with a high alignment cost tends to be wrong
Multilingual Interface (cont.)
- Proposed NE annotation scheme (sketched below):
- Annotate the two sides of the bilingual corpus separately, using BBN's IdentiFinder
- Compute the Augmented Sentence Alignment Cost (ASAC) of the baseline annotation
- Find all the possible NEs in the corpus
- Use a greedy approximation algorithm to find the alignment with minimal ASAC; if its cost is less than the baseline's, accept the alignment
- Tag any unaligned NE with its most frequent type
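A toy sketch of the greedy approximation step; `cost` here is a placeholder for the paper's ASAC computation, which is not reproduced on these slides:

```python
def greedy_align(source_nes, target_nes, cost, baseline_cost):
    """Greedily pair source/target named entities by ascending cost.

    cost(s, t) is a stand-in for the paper's ASAC-style score; the final
    acceptance test against baseline_cost mirrors the slide above.
    """
    pairs = sorted(((cost(s, t), s, t) for s in source_nes for t in target_nes),
                   key=lambda x: x[0])
    used_s, used_t, alignment, total = set(), set(), [], 0.0
    for c, s, t in pairs:
        if s not in used_s and t not in used_t:
            alignment.append((s, t))
            used_s.add(s)
            used_t.add(t)
            total += c
    return alignment if total < baseline_cost else None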
Multilingual Interface (cont.)
- Results (table not reproduced here)
- Comment: cross-lingual relations contain useful information
Sensors, Tools and Platforms
- Audiovisual Arrays for Untethered Spoken Interfaces (Kevin Wilson, Trevor Darrell, et al.)
- Motivation:
- When faced with a distant speaker at a known location, a microphone array can improve speech recognition
- Estimating the location of a speaker in a reverberant environment is difficult, so a video camera array is used to aid localization
- An audio-visual array approach to tracking the speaker
Sensors, Tools and Platforms (cont.)
- Two array processing problems:
- Beamforming: spatial filtering of the speech signal
- Amplify the signals coming from the selected region by adding and filtering the signals from the array elements (see the sketch below)
- Source localization: estimate the location of the signal source
- One way: beamform toward every candidate location and choose the strongest (large computational cost)
- Another way: use the delays between array elements to calculate the location
- The two problems are complementary
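A minimal delay-and-sum beamformer, the simplest instance of the spatial filtering described above; the array geometry, delays, and noise levels here are illustrative only:

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Delay-and-sum beamformer: shift each microphone's signal so that
    sound from the chosen location adds coherently, then average.

    signals: (num_mics, num_samples); delays_samples: per-mic integer
    delays (integer shifts keep the sketch simple; real systems use
    fractional delays).
    """
    shifted = [np.roll(sig, -d) for sig, d in zip(signals, delays_samples)]
    return np.mean(shifted, axis=0)

# Toy example: the same source arrives at 4 mics with known delays plus noise.
rng = np.random.default_rng(1)
source = rng.standard_normal(1000)
delays = [0, 3, 5, 8]
mics = np.stack([np.roll(source, d) + 0.5 * rng.standard_normal(1000)
                 for d in delays])
enhanced = delay_and_sum(mics, delays)
print(np.corrcoef(enhanced, source)[0, 1])  # closer to 1 than any single mic
```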
Sensors, Tools and Platforms (cont.)
- Person tracking with multiple stereo views
- To aid source localization
- Beamforming: audio data only
- Source localization: audio + video
- Process (sketched below):
- Vision tracker: initial guess for the location, accurate to within one meter
- Beam power: gradient ascent search for a local maximum
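A sketch of that two-stage search: a coarse vision estimate seeds a local hill climb on beam power. `beam_power` is a hypothetical stand-in for steering the array to a 3-D point and measuring the output energy; step size and iteration count are assumptions.

```python
import numpy as np

def refine_location(beam_power, vision_guess, step=0.05, iters=50):
    """Hill-climb on beam power from a vision-based initial guess.

    beam_power(xyz) -> steered-array output energy at a 3-D point;
    vision_guess is the coarse (<1 m) location from the stereo tracker.
    """
    loc = np.asarray(vision_guess, dtype=float)
    for _ in range(iters):
        # Numerical gradient of beam power at the current location.
        grad = np.array([
            (beam_power(loc + step * e) - beam_power(loc - step * e)) / (2 * step)
            for e in np.eye(3)])
        if np.linalg.norm(grad) < 1e-6:
            break
        loc += step * grad / np.linalg.norm(grad)  # ascend toward a local maximum
    return loc
```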
Summary
- Informedia is related to multimodal interfaces
- Information gathering, searching, and browsing
- Experience On Demand (EOD)
- Capturing, Coordinating and Remembering Human Experience (CCRHE)
- Cross-modal relationships
- Cross-modal consistency