Title: CALO VISUAL INTERFACE RESEARCH PROGRESS
Slide 1: CALO Visual Interface Research Progress
- David Demirdjian
- Trevor Darrell
- MIT CSAIL
Slide 2: pTablet (or pLaptop!)
- Goal: visual cues to conversation or interaction state
  - presence
  - attention
  - turn-taking
  - agreement and grounding gestures
  - emotion and expression cues
  - visual speech features
Slide 3: Functional Capabilities
- Help CALO infer:
  - whether the user is still participating in a conversation or interaction
  - whether the user is focused on the interface or listening to another person
  - when the user is speaking
  - further features pertaining to visual speech
  - non-verbal cues as to whether the user is confirming understanding of, or agreement with, the current topic or question
  - whether the user is confused or irritated
- Useful both for meeting understanding and for the CALO UI
Slide 4: Machine Learning Research Challenges
- Focusing on learning methods that capture personalized interaction:
  - articulatory models of visual speech
  - sample-based methods for body tracking
  - hidden-state conditional random fields
  - context-based gesture recognition
- (Not all are yet in the deployed demo.)
Slide 5: Articulatory Models of Visual Speech
- Traditional models of visual speech presume synchronous units based on visemes, the visual correlates of phonemes.
- Audiovisual speech production is often asynchronous.
- We instead model speech with a set of loosely coupled streams of articulatory features (a toy sketch follows this slide).
- (See Saenko and Darrell, ICMI 2004, and Saenko et al., ICCV 2005, for more information.)
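The sketch below is only an illustration of the loosely-coupled-streams idea, not the published model: two articulatory-feature streams (lip opening and lip rounding, names chosen here for illustration) are scored independently and combined with a soft penalty on asynchrony, rather than being forced onto a single synchronous viseme sequence.

    # Minimal sketch (illustrative, not the CALO implementation): visual speech
    # scored as two loosely coupled articulatory-feature streams.
    import numpy as np

    def stream_log_likelihood(obs, means, var=1.0):
        """Log-likelihood of a 1-D observation sequence under per-frame
        Gaussian state means (a stand-in for one articulatory-feature stream)."""
        obs = np.asarray(obs, dtype=float)
        means = np.asarray(means, dtype=float)
        return -0.5 * np.sum((obs - means) ** 2) / var

    def coupled_score(obs_open, obs_round, path_open, path_round, lag_penalty=0.1):
        """Combine two articulatory streams, softly penalising asynchrony
        (difference between the two state paths) instead of requiring a
        single shared viseme sequence."""
        ll = stream_log_likelihood(obs_open, path_open)
        ll += stream_log_likelihood(obs_round, path_round)
        async_cost = lag_penalty * np.sum(
            np.abs(np.asarray(path_open) - np.asarray(path_round)))
        return ll - async_cost

    if __name__ == "__main__":
        # Toy example: the rounding stream lags the lip-opening stream by one frame.
        lips     = [0, 1, 1, 0, 0]
        rounding = [0, 0, 1, 1, 0]
        print(coupled_score(lips, rounding,
                            path_open=[0, 1, 1, 0, 0],
                            path_round=[0, 0, 1, 1, 0]))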
Slide 6: Sample-Based Methods for Body Tracking
- Tracking human bodies requires exploration of a high-dimensional state space.
- Estimated posteriors are often sharp and multimodal.
- New tracking techniques based on a novel approximate nearest-neighbor hashing method give comprehensive pose coverage and optimally integrate information over time (a lookup sketch follows this slide).
- These techniques are suitable for real-time markerless motion capture and for tracking the human body to infer attention and gesture.
- (See Demirdjian et al., ICCV 2005, and Taycher et al., CVPR 2006, for more information.)
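As a rough sketch of the hashing idea (an assumption about the flavor of method, not the published algorithm), the code below indexes stored pose exemplars with random-hyperplane locality-sensitive hashing, so a new image descriptor retrieves candidate body poses from its hash bucket rather than by exhaustive search. Descriptor dimensions and pose parameterization are placeholders.

    # Sketch: approximate nearest-neighbor lookup of pose exemplars via LSH.
    import numpy as np

    rng = np.random.default_rng(0)

    def make_hash(dim, n_bits=8):
        """Random hyperplanes defining one LSH function."""
        planes = rng.normal(size=(n_bits, dim))
        def h(x):
            return tuple((planes @ x) > 0)   # sign pattern = bucket key
        return h

    class PoseIndex:
        """Hash table mapping image descriptors to stored pose exemplars."""
        def __init__(self, dim):
            self.h = make_hash(dim)
            self.buckets = {}

        def add(self, descriptor, pose):
            self.buckets.setdefault(self.h(descriptor), []).append((descriptor, pose))

        def query(self, descriptor):
            # Candidates share the query's bucket; rank them by true distance.
            cands = self.buckets.get(self.h(descriptor), [])
            return sorted(cands, key=lambda dp: np.linalg.norm(dp[0] - descriptor))

    if __name__ == "__main__":
        index = PoseIndex(dim=32)
        for _ in range(1000):
            d = rng.normal(size=32)
            index.add(d, pose=rng.normal(size=6))    # 6-D pose, purely illustrative
        print(len(index.query(rng.normal(size=32)))) # number of candidate poses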
Slide 7: Hidden-State Conditional Random Fields
- Discriminative techniques are efficient and accurate, and learn to represent only the portion of the state necessary for a specific task.
- Conditional random fields are effective at recognizing visual gestures, but lack the ability of generative models to capture gesture substructure through hidden states.
- We have developed a hidden-state conditional random field formulation (a toy scoring sketch follows this slide).
- (See Wang et al., CVPR 2006.)
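The toy sketch below shows only the core HCRF idea, that the class posterior marginalizes over a latent state sequence so gesture substructure is captured without ever labeling it. The feature form, parameter shapes, and brute-force summation are assumptions for readability, not the CVPR 2006 model or its efficient inference.

    # Toy hidden-state CRF posterior by brute-force marginalisation (tiny T only).
    import itertools
    import numpy as np

    def hcrf_log_posterior(x, theta, n_hidden=2, n_classes=3):
        """x: (T, D) observation sequence; theta: dict of parameter arrays.
        Returns log P(y | x) for each class y, summing over hidden sequences."""
        T = len(x)
        scores = np.full(n_classes, -np.inf)
        for y in range(n_classes):
            for h in itertools.product(range(n_hidden), repeat=T):
                s = sum(theta["obs"][y, h[t]] @ x[t] for t in range(T))          # observation potentials
                s += sum(theta["trans"][y, h[t - 1], h[t]] for t in range(1, T)) # hidden transitions
                scores[y] = np.logaddexp(scores[y], s)
        return scores - np.logaddexp.reduce(scores)   # normalise over classes

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        theta = {"obs": rng.normal(size=(3, 2, 4)), "trans": rng.normal(size=(3, 2, 2))}
        x = rng.normal(size=(5, 4))                   # 5 frames of 4-D head-motion features
        print(np.exp(hcrf_log_posterior(x, theta)))   # posterior over nod / shake / junk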
Slide 8: Hidden Conditional Random Fields for Head Gesture Recognition
- 3 classes: nods, shakes, junk

  Model                 Window (W)   Accuracy (%)
  HMM                   0            46.33
  CRF                   0            38.42
  HCRF (multi-class)    0            45.37
  HCRF (multi-class)    1            64.44
Slide 9: Context-Based Gesture Recognition
- Recognition of the user's gestures should be done in the context of the current interaction.
- Visual recognition can be augmented with context cues from the interaction state (a fusion sketch follows this slide):
  - conversational dialog with an embodied agent
  - interaction with a conventional windows-and-mouse interface
- (See Morency, Sidner and Darrell, ICMI 2005, and Morency and Darrell, IUI 2006.)
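A minimal sketch of the fusion idea follows; the prior values and context labels ("yes_no_question", "system_speaking") are illustrative assumptions, not the prediction model of the cited papers. It simply shows how a context cue from the interaction state can sharpen an ambiguous visual gesture score.

    # Sketch: fuse a visual gesture likelihood with a context prior.
    import numpy as np

    GESTURES = ["nod", "shake", "other"]

    def contextual_posterior(visual_likelihood, context):
        """visual_likelihood: dict gesture -> P(observation | gesture).
        context: string naming the interaction state (illustrative labels)."""
        if context == "yes_no_question":
            prior = {"nod": 0.4, "shake": 0.4, "other": 0.2}   # nods/shakes expected
        elif context == "system_speaking":
            prior = {"nod": 0.25, "shake": 0.1, "other": 0.65}
        else:
            prior = {g: 1.0 / len(GESTURES) for g in GESTURES}
        post = np.array([visual_likelihood[g] * prior[g] for g in GESTURES])
        return dict(zip(GESTURES, post / post.sum()))

    if __name__ == "__main__":
        vis = {"nod": 0.5, "shake": 0.1, "other": 0.4}          # ambiguous visual evidence
        print(contextual_posterior(vis, "yes_no_question"))     # context sharpens the decision
        print(contextual_posterior(vis, "system_speaking"))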
Slide 10: User-Adaptive Agreement Recognition
- Each person has their own idiolect.
- User agreement is inferred from recognized speech and head gestures via multimodal co-training (a toy sketch follows this slide).
- Challenges:
  - asynchrony between modalities
  - missing-data problem
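The sketch below illustrates generic multimodal co-training under simplifying assumptions (synchronized views, no missing data, toy features): a speech-view and a gesture-view classifier each pseudo-label the unlabeled examples they are confident about, and those labels augment the other view's training set, adapting agreement recognition toward the current user. It is not the deployed CALO component.

    # Sketch: two-view co-training for agree/disagree recognition.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def cotrain(Xs, Xg, y, Xs_u, Xg_u, rounds=3, conf=0.9):
        """Xs/Xg: labeled speech/gesture features; y: agree(1)/disagree(0);
        Xs_u/Xg_u: unlabeled features from the current user (same rows)."""
        ys, yg = y.copy(), y.copy()
        Ls, Lg = Xs.copy(), Xg.copy()
        for _ in range(rounds):
            cs = LogisticRegression().fit(Ls, ys)
            cg = LogisticRegression().fit(Lg, yg)
            ps, pg = cs.predict_proba(Xs_u), cg.predict_proba(Xg_u)
            # Confident speech predictions label the gesture view, and vice versa.
            sel_s = ps.max(axis=1) > conf
            sel_g = pg.max(axis=1) > conf
            Lg = np.vstack([Lg, Xg_u[sel_s]]); yg = np.concatenate([yg, ps[sel_s].argmax(1)])
            Ls = np.vstack([Ls, Xs_u[sel_g]]); ys = np.concatenate([ys, pg[sel_g].argmax(1)])
        return cs, cg

    if __name__ == "__main__":
        rng = np.random.default_rng(2)
        y = rng.integers(0, 2, 40)
        Xs = rng.normal(size=(40, 5)) + y[:, None]      # toy speech features
        Xg = rng.normal(size=(40, 3)) + y[:, None]      # toy gesture features
        cs, cg = cotrain(Xs, Xg, y, rng.normal(size=(100, 5)), rng.normal(size=(100, 3)))
        print(cs.score(Xs, y), cg.score(Xg, y))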
Slide 11: Status
- New pTablet functionalities:
  - face/gaze tracking
  - head gesture recognition (nod/shake), gaze
  - lip/mouth motion detection
  - user enrollment/recognition (ongoing work)
- A/V integration:
  - audio-visual synchronization/calibration
  - meeting visualization/understanding
Slide 12: pTablet System
[System diagram: the pTablet camera drives a user model (frontal view, 6-D head pose) and head gesture recognition; speech audio is processed in parallel; the system emits OAA messages carrying person ID, head pose, gesture, and lips-moving state.]
Slide 13: (image-only slide, no transcript)
Slide 14: Speaking Activity Detection
- Face tracking as:
  - rigid pose
Slide 15: Speaking Activity Detection
- Face tracking as:
  - rigid pose
  - non-rigid facial deformations
Slide 16: Speaking Activity Detection
- Speaking activity:
  - high motion energy in the mouth/lips region (a detection sketch follows this slide)
  - a weak assumption (e.g., a hand moving in front of the mouth will also trigger speaking-activity detection)
  - but it complements audio-based speaker detection well
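A minimal sketch of this cue, assuming the face tracker has already localized a mouth region of interest; the threshold and ROI values are illustrative, not the system's settings.

    # Sketch: speaking-activity detection from mouth-region motion energy.
    import numpy as np

    def mouth_motion_energy(prev_frame, frame, mouth_roi):
        """Mean absolute intensity change inside the mouth bounding box."""
        x0, y0, x1, y1 = mouth_roi
        diff = np.abs(frame[y0:y1, x0:x1].astype(float) -
                      prev_frame[y0:y1, x0:x1].astype(float))
        return diff.mean()

    def is_speaking(prev_frame, frame, mouth_roi, threshold=8.0):
        """Weak visual cue: high mouth-region motion energy => user is speaking.
        (Any motion in the ROI, e.g. a hand passing the mouth, also fires.)"""
        return mouth_motion_energy(prev_frame, frame, mouth_roi) > threshold

    if __name__ == "__main__":
        rng = np.random.default_rng(3)
        f0 = rng.integers(0, 255, size=(240, 320), dtype=np.uint8)
        f1 = f0.copy()
        f1[150:190, 130:190] = rng.integers(0, 255, size=(40, 60))  # simulate lip motion
        print(is_speaking(f0, f1, mouth_roi=(130, 150, 190, 190)))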
Slide 17: Speaking Activity Detection
Slide 18: User Enrollment/Recognition
- Idea:
  - At startup, the user is automatically identified and logged in by the pTablet.
  - If the user is not recognized or is misrecognized, they have to log in manually.
- Face recognition is based on a feature-set matching algorithm (see "The Pyramid Match Kernel: Discriminative Classification with Sets of Image Features", Grauman and Darrell, ICCV 2005); a simplified sketch follows this slide.
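The sketch below shows the pyramid-match idea in a deliberately simplified 1-D setting: local features are binned into histograms at several resolutions, intersections count implicit matches, and finer levels are weighted more. The real kernel operates on multi-dimensional image feature sets; this is an illustration of the principle, not the published implementation.

    # Sketch: pyramid-match-style similarity between two feature sets (1-D toy).
    import numpy as np

    def pyramid_match(set_a, set_b, levels=4, lo=0.0, hi=1.0):
        """Approximate similarity between two sets of scalar features."""
        score, prev = 0.0, 0.0
        for level in range(levels, -1, -1):            # finest bins first, then coarser
            bins = 2 ** level
            ha, _ = np.histogram(set_a, bins=bins, range=(lo, hi))
            hb, _ = np.histogram(set_b, bins=bins, range=(lo, hi))
            inter = np.minimum(ha, hb).sum()           # matches found at this resolution
            weight = 1.0 / 2 ** (levels - level)       # finer levels weighted more
            score += weight * (inter - prev)           # count only newly formed matches
            prev = inter
        return score

    if __name__ == "__main__":
        rng = np.random.default_rng(4)
        enrolled = rng.uniform(size=50)                           # stored user features
        probe_same = np.clip(enrolled + rng.normal(scale=0.01, size=50), 0, 1)
        probe_other = rng.uniform(size=50)
        print(pyramid_match(enrolled, probe_same),                # high similarity
              pyramid_match(enrolled, probe_other))               # lower similarity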
Slide 19: Audio-Visual Calibration
- Temporal calibration:
  - aligning audio with visual data
  - how? by aligning lip-motion energy in the images with audio energy (an offset-estimation sketch follows this slide)
- Geometric calibration:
  - estimating camera location/orientation in the world coordinate system
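The temporal-calibration step can be sketched as a cross-correlation between the per-frame lip-motion energy and the audio energy envelope resampled to the video frame rate; the function and signal names below are assumptions for illustration, not the system's code.

    # Sketch: estimate the audio/video temporal offset by cross-correlation.
    import numpy as np

    def estimate_av_offset(lip_energy, audio_energy, max_lag=30):
        """Return the lag (in frames) that best aligns the two energy signals."""
        lip = (lip_energy - lip_energy.mean()) / (lip_energy.std() + 1e-9)
        aud = (audio_energy - audio_energy.mean()) / (audio_energy.std() + 1e-9)
        lags = range(-max_lag, max_lag + 1)
        scores = []
        for lag in lags:
            if lag >= 0:
                a, b = lip[lag:], aud[:len(aud) - lag]
            else:
                a, b = lip[:lag], aud[-lag:]
            n = min(len(a), len(b))
            scores.append(np.dot(a[:n], b[:n]) / n)
        return list(lags)[int(np.argmax(scores))]

    if __name__ == "__main__":
        rng = np.random.default_rng(5)
        true_offset = 7                                # video lags audio by 7 frames
        speech = (rng.uniform(size=300) > 0.7).astype(float)
        audio = speech + 0.1 * rng.normal(size=300)
        video = np.roll(speech, true_offset) + 0.1 * rng.normal(size=300)
        print(estimate_av_offset(video, audio))        # should recover roughly 7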
Slide 20: Audio-Visual Integration
[Diagram: the CAMEO room camera and the individual pTablets feed the audio-visual integration.]
Slide 21: Calibration
- Alternative approach to estimating the position/orientation of the pTablets, with or without a global view (e.g., from CAMEO).
- Idea: use discourse information (e.g., who is talking to whom, a dialog between two people) and local head pose to find the locations of the pTablets.
Slide 22: A/V Integration
- AVIntegrator:
  - same functionalities as Year 2 (e.g., includes activity recognition, etc.)
  - modified to accept calibration data estimated externally
Slide 23: Integration and Activity Estimation
- A/V integration
- Activity estimation:
  - Who is in the room?
  - Who is looking at whom?
  - Who is talking to whom?
  - ...
Slide 24: A/V Integrator System
[System diagram: calibration information, speech recognition, and discourse/dialog cues (e.g., the current speaker) feed the A/V Integrator, which emits OAA/MOKB messages carrying the user list, the current speaker, agreement/disagreement, and who is speaking to whom.]
Slide 25: A/V Integration
Slide 26: Demonstration?
- Real-time meeting understanding
- Use of the pTablet suite for interaction with a personal CALO, e.g.:
  - head pose/lip motion for speaking-activity detection
  - yes/no answers by head nods/shakes
  - visual login