modeling individual and group actions in meetings with layered HMMs

1
modeling individual and group actions in meetings
with layered HMMs
dong zhang, daniel gatica-perez, samy bengio, iain mccowan, guillaume lathoud
idiap research institute, martigny, switzerland
2
meetings as sequences of actions
  • human interaction
  • similar/complementary roles
  • individuals constrained by group
  • agenda: a prior sequence of
    • discussion points
    • presentations
    • decisions to be made
  • minutes: a posterior sequence of
    • key phases
    • summarized discussions
    • decisions made

3
the goal: recognizing sequences of meeting actions
[figure: a meeting timeline annotated at several views — group task (information sharing, decision making), group interest level (high, neutral), and discussion phase/topic (presentation, group discussion; weather, budget); group-level actions provide these meeting views]
4
our work: two-layer HMMs
  • decompose the recognition problem
  • both layers use HMMs
  • individual action layer (I-HMM): various models
  • group action layer (G-HMM)

5
our work in detail
  1. definition of meeting actions
  2. audio-visual observations
  3. action recognition
  4. results

D. Zhang et al., Modeling Individual and Group Actions in Meetings with Layered HMMs, IEEE CVPR Workshop on Event Mining, 2004.
I. McCowan et al., ICASSP 2003; PAMI 2005.
N. Oliver et al., ICMI 2002.
6
1. defining meeting actions
  • multiple parallel views
  • tech-based: what can we recognize?
  • application-based: respond to user needs
  • psychology-based: coding schemes from social psychology

7
multi-modal turn-taking
  • describes the group discussion state
  • group actions:
    • discussion,
    • monologue (x4),
    • white-board,
    • presentation,
    • note-taking,
    • monologue + note-taking (x4),
    • white-board + note-taking,
    • presentation + note-taking
  • individual actions:
    • speaking,
    • writing,
    • idle
  • actions are multi-modal in nature (the full vocabulary is written out in the sketch below)

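For concreteness, the action vocabulary above can be written down as plain constants. A minimal sketch in Python; the identifier names are mine, not the paper's:

    # group actions: the multi-modal turn-taking vocabulary from this slide
    # (monologue and monologue + note-taking are one class per participant)
    GROUP_ACTIONS = (
        ["discussion", "white-board", "presentation", "note-taking",
         "white-board + note-taking", "presentation + note-taking"]
        + [f"monologue{i}" for i in range(1, 5)]
        + [f"monologue{i} + note-taking" for i in range(1, 5)]
    )

    # individual actions: one label per participant per frame
    INDIVIDUAL_ACTIONS = ["speaking", "writing", "idle"]

    assert len(GROUP_ACTIONS) == 14
    assert len(INDIVIDUAL_ACTIONS) == 3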
8
example
[figure: example timeline — per-person individual actions (S = speaking, W = writing), plus whiteboard and presentation usage, aligned with the group action sequence: monologue1 + note-taking, discussion, presentation + note-taking, whiteboard + note-taking]
9
2. audio-visual observations
  • audio:
    • 12 channels, 48 kHz
    • 4 lapel microphones
    • 1 microphone array
  • video:
    • 3 CCTV cameras
  • all synchronized

10
multimodal feature extraction: audio
  • microphone array:
    • speech activity (SRP-PHAT) at seats and presentation/whiteboard area
    • speech/silence segmentation
  • lapel microphones (a minimal sketch follows below):
    • speech pitch
    • speech energy
    • speaking rate

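A minimal sketch of the lapel-microphone prosodic features (energy and pitch) in Python with librosa; SRP-PHAT localization and speaking-rate estimation are omitted, the 5 f/s rate follows the experiment setup later in the deck, and the function name is mine:

    import librosa
    import numpy as np

    def lapel_features(wav_path, frame_rate=5):
        """Per-frame speech energy and pitch from one lapel channel."""
        y, sr = librosa.load(wav_path, sr=None, mono=True)
        hop = sr // frame_rate                 # 5 feature frames per second
        frame = 2 * hop

        # short-time energy (RMS) per frame
        energy = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]

        # fundamental frequency via the YIN estimator, limited to speech range
        pitch = librosa.yin(y, fmin=60, fmax=400, sr=sr,
                            frame_length=frame, hop_length=hop)

        n = min(len(energy), len(pitch))
        return np.column_stack([energy[:n], pitch[:n]])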
11
multimodal feature extraction: video
  • head and hands blobs (see the sketch below):
    • skin colour models (GMM)
    • head position
    • hands position features (eccentricity, size, orientation)
  • head and hands blob motion:
    • moving blobs from background subtraction

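A minimal sketch of the skin-colour GMM step using scikit-learn: fit a mixture on labelled skin pixels, then classify each pixel of a frame by thresholding its log-likelihood. Blob grouping, the shape features and the background-subtraction motion cue are left out; names and the threshold value are illustrative:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def fit_skin_model(skin_pixels, n_components=5):
        """Fit a colour GMM on an (N, 3) array of known skin pixels."""
        gmm = GaussianMixture(n_components=n_components, covariance_type="full")
        gmm.fit(skin_pixels)
        return gmm

    def skin_mask(frame, gmm, threshold=-10.0):
        """Binary skin mask for an (H, W, 3) frame via per-pixel log-likelihood."""
        h, w, _ = frame.shape
        ll = gmm.score_samples(frame.reshape(-1, 3).astype(float))
        return (ll > threshold).reshape(h, w)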
12
3. recognition with two-layer HMM
  • each layer trained independently (illustrative sketch below)
  • trained as in ASR (using Torch)
  • simultaneous segmentation and recognition

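The paper trained with Torch; purely as an illustration, here is how the two independently trained layers could be wired together in Python with hmmlearn — one Gaussian HMM per individual action, whose soft outputs become observations for the group layer. The windowed scoring, the helper names and hmmlearn itself are my assumptions, not the original implementation:

    import numpy as np
    from hmmlearn.hmm import GaussianHMM

    def train_action_models(segments_by_action, n_states=3):
        """Layer 1: one I-HMM per individual action, trained independently."""
        models = {}
        for action, segments in segments_by_action.items():
            X = np.vstack(segments)
            lengths = [len(s) for s in segments]
            models[action] = GaussianHMM(n_components=n_states).fit(X, lengths)
        return models

    def i_hmm_outputs(models, features, win=10):
        """Per-window soft decision: normalized likelihoods of the I-HMMs."""
        out = []
        for t in range(0, len(features) - win + 1, win):
            ll = np.array([m.score(features[t:t + win]) for m in models.values()])
            ll -= ll.max()                     # stabilize before exponentiating
            p = np.exp(ll)
            out.append(p / p.sum())
        return np.array(out)

    # layer 2: a G-HMM is then trained, also independently, on the
    # concatenated soft-decision vectors of all participants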
13
models for I-HMM
  • early integration
    • all observations concatenated
    • captures correlation between streams
    • requires frame-synchronous streams
  • multi-stream (Dupont, TMM 2000)
    • one HMM per stream (audio or visual), trained independently
    • decoding: weighted likelihoods combined at each frame (sketch below)
    • allows little inter-stream asynchrony
    • used in multi-band and audio-visual ASR
  • asynchronous (Bengio, NIPS 2002)
    • audio and visual streams share a single state sequence
    • states emit on one or both streams, given a synchronization variable
    • allows inter-stream asynchrony

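A minimal sketch of the multi-stream combination rule: per-stream HMM log-likelihoods are weighted and summed at each frame. The function name is mine, and the default (0.8, 0.2) audio/video weighting mirrors the figure residue on the individual-action results slide, so treat it as illustrative:

    import numpy as np

    def multistream_loglik(ll_audio, ll_video, w_audio=0.8, w_video=0.2):
        """Combine frame-level log-likelihoods of independently trained
        audio and video HMMs with fixed stream weights; the streams stay
        frame-synchronous, so no inter-stream asynchrony is modeled."""
        return w_audio * np.asarray(ll_audio) + w_video * np.asarray(ll_video)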
14
linking the two layers
  • hard decision: the i-action model with the highest probability outputs 1, all other models output 0
  • soft decision: each individual action model outputs its probability
  • (both schemes are sketched below)

[figure: audio-visual features feed the I-HMMs, whose outputs link to the G-HMM; hard decision e.g. (1, 0, 0) vs. soft decision e.g. (0.9, 0.05, 0.05)]
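A minimal sketch of the two linking schemes, assuming the I-HMM layer yields one log-likelihood per individual action model (function names are mine):

    import numpy as np

    def soft_decision(log_liks):
        """Normalized probabilities over the action models,
        e.g. (0.9, 0.05, 0.05)."""
        z = np.asarray(log_liks) - np.max(log_liks)
        p = np.exp(z)
        return p / p.sum()

    def hard_decision(log_liks):
        """One-hot vector for the most likely model, e.g. (1, 0, 0)."""
        out = np.zeros(len(log_liks))
        out[int(np.argmax(log_liks))] = 1.0
        return out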
15
4. experiments: data setup
  • 59 meetings (30/29 train/test split)
  • four people, five minutes each
  • scripted:
    • schedule of actions
    • natural behavior
  • features at 5 f/s

mmm.idiap.ch
16
performance measures
  • individual actions: frame error rate (FER)
  • group actions: action error rate (AER), computed as in the sketch below
    • Subs: number of substituted actions
    • Del: number of deleted actions
    • Ins: number of inserted actions
    • Total: number of target actions

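By analogy with word error rate, AER = (Subs + Del + Ins) / Total × 100. A minimal sketch computing it from a minimum-edit-distance alignment of the recognized action sequence against the target sequence (the function name is mine):

    def action_error_rate(target, recognized):
        """AER (%) = (substitutions + deletions + insertions) / len(target) * 100,
        obtained from a Levenshtein alignment of the two action sequences."""
        n, m = len(target), len(recognized)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i                        # i deletions
        for j in range(m + 1):
            d[0][j] = j                        # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = d[i - 1][j - 1] + (target[i - 1] != recognized[j - 1])
                d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
        return 100.0 * d[n][m] / n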
17
results: individual actions
[table: frame error rates of the I-HMM models on 43000 test frames; stream weights (0.8, 0.2); asynchrony range (0.2-2.2 s)]
18
results: group actions
  • multi-modality outperforms single modalities
  • the two-layer HMM outperforms a single-layer HMM for audio-only, visual-only and audio-visual
  • best model: the asynchronous HMM
  • soft decision slightly better than hard decision

19
action-based meeting structuring
20
conclusions
  • structuring meetings as sequences of meeting actions
  • layered HMMs successful for recognition
  • turn-taking patterns useful for browsing
  • public dataset, standard evaluation procedures
  • open issues:
    • less training data (unsupervised; ACM MM 2004)
    • other relevant actions (interest level; ICASSP 2005)
    • other features (words, emotions)
    • efficient models for many interacting streams

21
Linking Two Layers (1)
22
Linking Two Layers (2)
Please refer to D. Zhang et al., Modeling Individual and Group Actions in Meetings: A Two-Layer HMM Framework, IEEE CVPR Workshop on Event Mining, 2004.