Transcript and Presenter's Notes

Title: CALO VISUAL INTERFACE RESEARCH PROGRESS


1
CALO VISUAL INTERFACE RESEARCH PROGRESS
  • David Demirdjian
  • Trevor Darrell
  • MIT CSAIL

2
pTablet (or pLaptop!)
  • Goal: visual cues to conversation or interaction
    state
  • presence
  • attention
  • turn-taking
  • agreement and grounding gestures
  • emotion and expression cues
  • visual speech features

3
Functional Capabilities
  • Help CALO infer
  • whether the user is still participating in a
    conversation or interaction,
  • whether the user is focused on the interface or is
    listening to another person,
  • when the user is speaking,
  • further features pertaining to visual speech,
  • non-verbal cues to whether the user is confirming
    understanding of, or agreement with, the current
    topic or question,
  • whether the user is confused or irritated
  • Useful both for meeting understanding and for the
    CALO UI

4
Machine Learning Research Challenges
  • Focusing on learning methods that capture
    personalized interaction
  • Articulatory models of visual speech
  • Sample-based methods for body tracking
  • Hidden-state conditional random fields
  • Context-based gesture recognition
  • (Not all are yet in the deployed demo)

5
Articulatory models of visual speech
  • Traditional models of visual speech presume
    synchronous units based on visemes, the visual
    correlates of phonemes.
  • Audio-visual speech production, however, is often
    asynchronous.
  • Our model is instead formed from a set of loosely
    coupled streams of articulatory features (a toy
    sketch follows below).
  • (See Saenko and Darrell, ICMI 2004, and Saenko et
    al., ICCV 2005, for more information.)
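A minimal sketch of the loosely-coupled-streams idea, in Python. Everything here is illustrative rather than the published model: the two stream names, the stand-in log-likelihoods, and the simple per-frame asynchrony penalty are assumptions; the point is only that each articulatory feature evolves as its own chain, with soft rather than hard coupling between streams.

    import numpy as np

    # Two articulatory feature streams (illustrative: lip opening and lip
    # rounding), each a small Markov chain over its own states. A soft
    # penalty on how far the streams' states diverge models loose coupling,
    # instead of the hard synchrony a single viseme HMM would impose.
    N, T = 3, 20                      # states per stream, video frames
    rng = np.random.default_rng(0)
    log_trans = np.log(np.full((N, N), 0.1) + 0.7 * np.eye(N))  # sticky chains
    log_obs = rng.standard_normal((2, T, N))  # stand-in per-stream likelihoods
    ASYNC = 1.0                               # cost per unit of stream desync
    desync = ASYNC * np.abs(np.arange(N)[:, None] - np.arange(N)[None, :])

    # Joint Viterbi over (state_a, state_b); coupling enters only via desync.
    score = log_obs[0, 0][:, None] + log_obs[1, 0][None, :] - desync
    for t in range(1, T):
        prev = (score[:, :, None, None]
                + log_trans[:, None, :, None]      # stream a: i -> k
                + log_trans[None, :, None, :])     # stream b: j -> l
        score = prev.max(axis=(0, 1))              # best predecessor per (k, l)
        score += log_obs[0, t][:, None] + log_obs[1, t][None, :] - desync

    print("best joint log-score:", score.max())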

6
Sample-based methods for body tracking
  • Tracking human bodies requires exploration of a
    high-dimensional state space
  • Estimated posteriors are often sharp and
    multimodal.
  • Our new tracking techniques, based on a novel
    approximate nearest-neighbor hashing method, have
    comprehensive pose coverage and optimally
    integrate information over time (a toy sketch of
    the hashing idea follows below).
  • These techniques are suitable for real-time
    markerless motion capture, and for tracking the
    human body to infer attention and gesture.
  • (See Demirdjian et al., ICCV 2005, and Taycher et
    al., CVPR 2006, for more information.)
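A toy sketch of the exemplar-lookup idea. Random-hyperplane hash bits stand in for the learned, parameter-sensitive hash functions of the cited papers, and the data are synthetic; the point is constant-time retrieval of pose exemplars whose features collide with the query.

    import numpy as np

    rng = np.random.default_rng(1)
    D, P, BITS = 128, 30, 12                  # feature dim, pose dim, hash bits
    feats = rng.standard_normal((5000, D))    # synthetic exemplar features
    poses = rng.standard_normal((5000, P))    # synthetic exemplar poses
    planes = rng.standard_normal((BITS, D))   # random hyperplanes (stand-in)

    def code(x):
        # binary hash code: which side of each hyperplane the feature falls on
        return tuple((planes @ x > 0).astype(int))

    buckets = {}
    for f, p in zip(feats, poses):
        buckets.setdefault(code(f), []).append(p)

    def estimate_pose(query):
        hits = buckets.get(code(query))
        if not hits:
            return None          # in practice, fall back to a wider search
        return np.mean(hits, axis=0)   # average pose of colliding exemplars

    print(estimate_pose(feats[0])[:3])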

7
Hidden-state conditional random fields
  • Discriminative techniques are efficient and
    accurate, and learn to represent only the portion
    of a state necessary for a specific task.
  • Conditional random fields are effective at
    recognizing visual gestures, but lack the ability
    of generative models to capture gesture
    substructure through hidden state.
  • We have developed a hidden-state conditional
    random field (HCRF) formulation (see below).
  • (See Wang et al., CVPR 2006, for more
    information.)
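The model has the standard HCRF form: for gesture class y, observation sequence x, and hidden substructure assignment h,

    P(y \mid \mathbf{x}; \theta) =
      \frac{\sum_{\mathbf{h}} \exp \Psi(y, \mathbf{h}, \mathbf{x}; \theta)}
           {\sum_{y'} \sum_{\mathbf{h}} \exp \Psi(y', \mathbf{h}, \mathbf{x}; \theta)}

where Psi is a learned potential over the label, hidden states, and observations. (This is the generic formulation; the paper's specific feature design is not reproduced here.)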

8
Hidden Conditional Random Fields for Head Gesture
Recognition
3 classes: Nods, Shakes, Junk

  Model               Window   Accuracy (%)
  HMM                 w = 0    46.33
  CRF                 w = 0    38.42
  HCRF (multiclass)   w = 0    45.37
  HCRF (multiclass)   w = 1    64.44
9
Context-based gesture recognition
  • Recognition of the user's gestures should be done
    in the context of the current interaction
  • Visual recognition can be augmented with context
    cues from the interaction state
  • conversational dialog with an embodied agent
  • interaction with a conventional windows and mouse
    interface.
  • (See Morency, Sidner, and Darrell, ICMI 2005, and
    Morency and Darrell, IUI 2006, for more
    information; a toy fusion sketch follows below.)
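An illustrative late-fusion sketch. The context cue names and weights below are hypothetical (the cited papers learn the combination from data); the point is that dialog/UI state shifts the prior on seeing a gesture before the visual evidence is weighed.

    import numpy as np

    def gesture_posterior(visual_log_odds, context):
        # context: binary cues read off the current interaction state
        w = {"system_just_asked_yes_no": 1.5,   # a nod is a priori likelier
             "dialog_expects_reply": 0.8,
             "user_typing": -1.2}               # typing makes gestures unlikely
        z = visual_log_odds + sum(w[k] for k, v in context.items() if v)
        return 1.0 / (1.0 + np.exp(-z))         # P(gesture | vision, context)

    p = gesture_posterior(visual_log_odds=0.2,
                          context={"system_just_asked_yes_no": True,
                                   "dialog_expects_reply": True,
                                   "user_typing": False})
    print(f"P(nod) = {p:.2f}")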

10
User Adaptive Agreement Recognition
  • Person's idiolect
  • User agreement from recognized speech and head
    gestures
  • multimodal co-training (sketched below)
  • Challenges
  • asynchrony between modalities
  • missing-data problem
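A minimal co-training sketch on synthetic data: a speech-view and a gesture-view classifier each label the unlabeled examples they are most confident about, and those labels train the other view. Feature generation, thresholds, and round count are all placeholders.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)
    y = rng.integers(0, 2, 200)                    # agree / disagree
    Xs = y[:, None] + 0.8 * rng.standard_normal((200, 5))  # "speech" view
    Xg = y[:, None] + 0.8 * rng.standard_normal((200, 5))  # "gesture" view
    lab, pool = np.arange(20), np.arange(20, 200)  # small labeled seed set

    cs = LogisticRegression().fit(Xs[lab], y[lab])
    cg = LogisticRegression().fit(Xg[lab], y[lab])
    for _ in range(5):                             # a few co-training rounds
        conf_s = pool[cs.predict_proba(Xs[pool]).max(1) > 0.95]
        conf_g = pool[cg.predict_proba(Xg[pool]).max(1) > 0.95]
        cg = LogisticRegression().fit(             # speech labels gesture...
            np.vstack([Xg[lab], Xg[conf_s]]),
            np.concatenate([y[lab], cs.predict(Xs[conf_s])]))
        cs = LogisticRegression().fit(             # ...and vice versa
            np.vstack([Xs[lab], Xs[conf_g]]),
            np.concatenate([y[lab], cg.predict(Xg[conf_g])]))

    print("gesture-view accuracy:", cg.score(Xg[pool], y[pool]))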

11
Status
  • New pTablet functionalities
  • Face/gaze tracking
  • Head gesture recognition (nod/shake)
  • Gaze
  • Lip/Mouth motion detection
  • User enrollment/recognition (ongoing work)
  • A/V Integration
  • Audio-visual sync./calibration
  • Meeting visualization/understanding

12
pTablet system
  [System diagram: the pTablet camera feeds face
  tracking (user model, frontal view; 6D pose) and
  head gesture recognition; together with the speech
  audio stream, the pTablet emits OAA messages:
  person ID, head pose, gesture, lips moving.]
14
Speaking activity detection
  • Face tracking as
  • Rigid pose

15
Speaking activity detection
  • Face tracking as
  • Rigid pose + non-rigid facial deformations

16
Speaking activity detection
  • Speaking activity detected as high motion energy
    in the mouth/lips region
  • a weak assumption (e.g. a hand moving in front of
    the mouth will also trigger the detector)
  • but it complements audio-based speaker detection
    well (a minimal sketch follows below)
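A minimal sketch of the detector, assuming the face tracker supplies the mouth region; the threshold and smoothing window are arbitrary placeholders.

    import numpy as np

    def speaking_activity(frames, mouth_roi, threshold=8.0, window=5):
        # frame-difference energy inside the tracked mouth region,
        # smoothed over a short window and thresholded
        y0, y1, x0, x1 = mouth_roi
        crop = frames[:, y0:y1, x0:x1].astype(float)
        energy = np.abs(np.diff(crop, axis=0)).mean(axis=(1, 2))
        smooth = np.convolve(energy, np.ones(window) / window, mode="same")
        return smooth > threshold     # True where the lips appear to move

    frames = np.random.default_rng(3).integers(0, 255, (100, 240, 320))
    flags = speaking_activity(frames, mouth_roi=(150, 190, 130, 190))
    print(flags.sum(), "of", flags.size, "frame transitions flagged")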

17
Speaking activity detection
18
User enrollment/recognition
  • Idea
  • At startup, the user is automatically identified
    and logged in by the pTablet.
  • If the user is not recognized, or is
    misrecognized, they have to log in manually.
  • Face recognition is based on a feature-set
    matching algorithm ("The Pyramid Match Kernel:
    Discriminative Classification with Sets of Image
    Features", Grauman and Darrell, ICCV 2005; a toy
    sketch follows below).
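A toy one-dimensional pyramid match, in the spirit of the cited kernel: matches are counted by histogram intersection at increasingly coarse bins, and matches first found at level i earn weight 1 / 2**i. Real inputs are sets of high-dimensional image features; scalar values keep the sketch short.

    import numpy as np

    def pyramid_match(xs, ys, levels=5):
        # xs, ys: sets of feature values in [0, 1)
        score, prev = 0.0, 0.0
        for i in range(levels):
            width = 2**i / 2**levels          # bin side doubles per level
            edges = np.arange(0, 1 + 1e-9, width)
            hx, _ = np.histogram(xs, edges)
            hy, _ = np.histogram(ys, edges)
            inter = np.minimum(hx, hy).sum()  # matches at this resolution
            score += (inter - prev) / 2**i    # credit only the new matches
            prev = inter
        return score

    rng = np.random.default_rng(4)
    a, b = rng.random(50), rng.random(50)
    print(pyramid_match(a, a), ">=", pyramid_match(a, b))  # self-match is max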

19
Audio-Visual Calibration
  • Temporal calibration
  • aligning audio with visual data
  • How? By aligning the lip motion energy in images
    with the audio energy (sketched below)
  • Geometric calibration
  • estimating the camera location/orientation in the
    world coordinate system
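A sketch of the temporal alignment, assuming the lip motion energy track and the audio energy envelope have already been resampled to a common rate; normalized cross-correlation then recovers the lag.

    import numpy as np

    def av_lag(lip_energy, audio_energy):
        # lag (in frames) that best aligns lip motion with audio energy
        a = (lip_energy - lip_energy.mean()) / lip_energy.std()
        b = (audio_energy - audio_energy.mean()) / audio_energy.std()
        corr = np.correlate(a, b, mode="full")   # score every possible shift
        return int(np.argmax(corr)) - (len(b) - 1)

    rng = np.random.default_rng(6)
    audio = rng.random(300)
    lip = np.roll(audio, 7) + 0.1 * rng.standard_normal(300)  # video 7 late
    print("estimated lag:", av_lag(lip, audio), "frames")     # -> 7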

20
Audio-Visual Integration
  [Diagram: CAMEO and pTablets.]
21
Calibration
  • Alternative approach: estimate the
    position/orientation of the pTablets with or
    without a global view (e.g. from CAMEO)
  • Idea: use discourse information (e.g. who is
    talking to whom, a dialog between 2 people) and
    local head pose to find the locations of the
    pTablets

22
A/V Integration
  • AVIntegrator
  • same functionalities as in Year 2 (e.g. includes
    activity recognition, etc.)
  • modified to accept calibration data estimated
    externally

23
Integration and activity estimation
  • A/V integration
  • Activity estimation (a geometric sketch follows
    below)
  • Who's in the room?
  • Who is looking at whom?
  • Who is talking to whom?
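A geometric sketch for "who is looking at whom", assuming the A/V calibration puts every participant's position and head-pose (gaze) direction in one coordinate frame; the angular threshold is a placeholder.

    import numpy as np

    def looking_at(positions, gaze_dirs, max_angle_deg=15.0):
        # attribute each person's attention to whoever lies closest
        # to their gaze ray, if anyone falls within the threshold
        out = {}
        for a in positions:
            best, best_ang = None, max_angle_deg
            for b in positions:
                if b == a:
                    continue
                v = np.subtract(positions[b], positions[a])
                g = np.asarray(gaze_dirs[a])
                cos = v @ g / (np.linalg.norm(v) * np.linalg.norm(g))
                ang = np.degrees(np.arccos(np.clip(cos, -1, 1)))
                if ang < best_ang:
                    best, best_ang = b, ang
            out[a] = best              # None: looking at no tracked person
        return out

    pos = {"alice": (0, 0), "bob": (2, 0), "carol": (1, 2)}
    gaze = {"alice": (1, 0), "bob": (-0.5, 1), "carol": (0, -1)}
    print(looking_at(pos, gaze))   # e.g. alice -> bob, bob -> carol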

24
A/V Integrator system
  [System diagram: calibration information, speech
  recognition, and discourse/dialog cues (e.g. the
  current speaker) feed the A/V Integrator, which
  emits OAA/MOKB messages: user list, speaker,
  agrees/disagrees, who speaks to whom.]
25
A/V Integration
26
Demonstration?
  • Real-time meeting understanding
  • Use of the pTablet suite for interaction with a
    personal CALO, e.g.
  • use of head pose/lip motion for speaking activity
    detection
  • Yes/No answers by head nods/shakes
  • visual login