Title: modeling individual and group actions in meetings with layered HMMs
Slide 1: modeling individual and group actions in meetings with layered HMMs
Dong Zhang, Daniel Gatica-Perez, Samy Bengio, Iain McCowan, Guillaume Lathoud
IDIAP Research Institute, Martigny, Switzerland
Slide 2: meetings as sequences of actions
- human interaction
  - similar/complementary roles
  - individuals constrained by group
- agenda: prior sequence
  - discussion points
  - presentations
  - decisions to be made
- minutes: posterior sequence
  - key phases
  - summarized discussions
  - decisions made
Slide 3: the goal: recognizing sequences of meeting actions
[Figure: a meeting timeline annotated with parallel meeting views of group-level actions (meeting actions):
- Discussion Phase: Presentation, Group Discussion
- Topic: Whether, Budget
- Group Interest Level: High, High, Neutral
- Group Task: Information Sharing, Decision Making]
Slide 4: our work: two-layer HMMs
- decompose the recognition problem
- both layers use HMMs
  - individual action layer (I-HMM): various models
  - group action layer (G-HMM)
Slide 5: our work in detail
- definition of meeting actions
- audio-visual observations
- action recognition
- results
D. Zhang et al., "Modeling Individual and Group Actions in Meetings with Layered HMMs," IEEE CVPR Workshop on Event Mining, 2004.
I. McCowan et al., ICASSP 2003, PAMI 2005.
N. Oliver et al., ICMI 2002.
Slide 6: 1. defining meeting actions
- multiple parallel views
  - tech-based: what can we recognize?
  - application-based: respond to user needs
  - psychology-based: coding schemes from social psychology
Slide 7: multi-modal turn-taking
- describes the group discussion state
- group actions A = { discussion, monologue (x4), white-board, presentation, note-taking, monologue + note-taking (x4), white-board + note-taking, presentation + note-taking }
- individual actions I = { speaking, writing, idle }
- actions are multi-modal in nature
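Slide 8: example
[Figure: timeline of one meeting. Per-person rows (Person 2, Person 3) carry individual action labels S and W; further rows mark "Presentation Used" and "Whiteboard Used"; the Group Action row shows Discussion, Monologue1 + Note-taking, Presentation + Note-taking, and Whiteboard + Note-taking.]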
Slide 9: 2. audio-visual observations
- audio
  - 12 channels, 48 kHz
  - 4 lapel microphones
  - 1 microphone array
- video
  - 3 CCTV cameras
- all synchronized
Slide 10: multimodal feature extraction: audio
- microphone array
  - speech activity (SRP-PHAT)
    - seats
    - presentation/whiteboard area
  - speech/silence segmentation
- lapel microphones
  - speech pitch
  - speech energy
  - speaking rate
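As a rough illustration of the "speech energy" feature listed for the lapel microphones, the sketch below computes a short-time log energy per frame in plain Python. The frame length, hop size, and threshold are hypothetical values chosen for the example. The energy threshold is a simple stand-in and is not the SRP-PHAT based speech activity measure named on the slide.

```python
import math

def short_time_log_energy(samples, frame_len, hop):
    """Return the log of the mean squared amplitude for each frame."""
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        mean_sq = sum(x * x for x in frame) / frame_len
        # A small floor keeps log() defined for all-zero (silent) frames.
        energies.append(math.log(mean_sq + 1e-10))
    return energies

def speech_silence(energies, threshold):
    """Label each frame True (speech-like) or False (silence-like).

    Assumption: a fixed energy threshold, used here only for illustration.
    """
    return [e > threshold for e in energies]

if __name__ == "__main__":
    # Synthetic signal: 100 silent samples followed by 100 samples of a tone.
    signal = [0.0] * 100 + [math.sin(0.3 * n) for n in range(100)]
    energies = short_time_log_energy(signal, frame_len=50, hop=50)
    print(speech_silence(energies, threshold=-5.0))
```

At 48 kHz audio and 5 feature frames per second, the hop would correspond to 9600 samples per frame; the toy values above are much smaller so the example stays readable.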
Slide 11: multimodal feature extraction: video
- head + hands blobs
  - skin colour models (GMM)
  - head position
  - hands position + features (eccentricity, size, orientation)
- head + hands blob motion
- moving blobs from background subtraction
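The blob features named above (eccentricity, size, orientation) can be derived from second-order central moments of the blob's pixel coordinates. The sketch below is one common way to do this and is an assumption for illustration; the slides do not say how the features were actually computed, and the function name `blob_features` is hypothetical.

```python
import math

def blob_features(pixels):
    """Compute size, centroid, orientation and eccentricity of a blob.

    pixels: non-empty list of (x, y) coordinates belonging to one blob.
    """
    n = len(pixels)
    cx = sum(x for x, _ in pixels) / n
    cy = sum(y for _, y in pixels) / n
    # Second-order central moments (normalised by the pixel count).
    mu20 = sum((x - cx) ** 2 for x, _ in pixels) / n
    mu02 = sum((y - cy) ** 2 for _, y in pixels) / n
    mu11 = sum((x - cx) * (y - cy) for x, y in pixels) / n
    # Angle of the major axis relative to the x axis, in radians.
    orientation = 0.5 * math.atan2(2 * mu11, mu20 - mu02)
    # Eigenvalues of the covariance matrix give the axis variances.
    spread = math.sqrt(4 * mu11 ** 2 + (mu20 - mu02) ** 2)
    lam_major = (mu20 + mu02 + spread) / 2
    lam_minor = (mu20 + mu02 - spread) / 2
    if lam_major > 0:
        eccentricity = math.sqrt(max(0.0, 1 - lam_minor / lam_major))
    else:
        eccentricity = 0.0  # degenerate blob (a single pixel)
    return {
        "size": n,
        "centroid": (cx, cy),
        "orientation": orientation,
        "eccentricity": eccentricity,
    }

if __name__ == "__main__":
    # A horizontal run of four pixels: maximally elongated along x.
    print(blob_features([(0, 0), (1, 0), (2, 0), (3, 0)]))
```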
Slide 12: 3. recognition with two-layer HMM
- each layer trained independently
- trained as in ASR (Torch)
- simultaneous segmentation and recognition
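"Simultaneous segmentation and recognition" is typically obtained in ASR-style systems by Viterbi decoding: the best state path labels every frame, and runs of the same label give the segment boundaries. The sketch below is a generic log-domain Viterbi decoder with toy parameters. It is an illustration under that assumption and does not reproduce the Torch implementation mentioned on the slide.

```python
import math

def viterbi(log_init, log_trans, log_emit):
    """Return the most likely state path.

    log_init[s]: log prior of state s
    log_trans[p][s]: log probability of moving from state p to state s
    log_emit[t][s]: log-likelihood of frame t under state s
    """
    n_states = len(log_init)
    delta = [log_init[s] + log_emit[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_emit)):
        new_delta, ptr = [], []
        for s in range(n_states):
            best_prev = max(range(n_states),
                            key=lambda p: delta[p] + log_trans[p][s])
            new_delta.append(delta[best_prev] + log_trans[best_prev][s]
                             + log_emit[t][s])
            ptr.append(best_prev)
        delta = new_delta
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: delta[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path

def segments(path):
    """Collapse a frame-level path into (label, start, end) runs, end exclusive."""
    result = []
    start = 0
    for t in range(1, len(path) + 1):
        if t == len(path) or path[t] != path[start]:
            result.append((path[start], start, t))
            start = t
    return result

if __name__ == "__main__":
    log = math.log
    init = [log(0.5), log(0.5)]
    trans = [[log(0.9), log(0.1)], [log(0.1), log(0.9)]]
    # Three frames that favour state 0, then three that favour state 1.
    emit = [[log(0.8), log(0.2)]] * 3 + [[log(0.2), log(0.8)]] * 3
    print(segments(viterbi(init, trans, emit)))
```

Each state here stands for one action label; in a real system every action would be modelled by several HMM states and the runs would be collapsed per action model.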
Slide 13: models for I-HMM
- early integration
  - all observations concatenated
  - correlation between streams
  - frame-synchronous streams
- multi-stream (Dupont, TMM 2000)
  - HMM per stream (a or v), trained independently
  - decoding: weighted likelihoods combined at each frame
  - little inter-stream asynchrony
  - multi-band and a-v ASR
- asynchronous (Bengio, NIPS 2002)
  - a and v streams with single state sequence
  - states emit on one or both streams, given a sync variable
  - inter-stream asynchrony
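The first two integration schemes can be contrasted in a few lines. In the sketch below, early integration concatenates the audio and video feature vectors of each frame, while the multi-stream variant combines per-stream log-likelihoods with a weighted sum at each frame. The function names and the equal weights are hypothetical; the slides do not give the weighting formula, so the log-domain weighted sum is an assumption based on common multi-stream practice.

```python
def early_integration(audio_frames, video_frames):
    """Concatenate audio and video feature vectors frame by frame."""
    if len(audio_frames) != len(video_frames):
        raise ValueError("streams must be frame-synchronous")
    return [a + v for a, v in zip(audio_frames, video_frames)]

def combine_stream_loglik(loglik_audio, loglik_video, w_audio, w_video):
    """Weighted sum of per-state log-likelihoods at each frame.

    loglik_audio[t][s] and loglik_video[t][s] come from separately
    trained audio-only and video-only models (assumed given here).
    """
    return [
        [w_audio * la + w_video * lv for la, lv in zip(frame_a, frame_v)]
        for frame_a, frame_v in zip(loglik_audio, loglik_video)
    ]

if __name__ == "__main__":
    print(early_integration([[1.0, 2.0]], [[3.0]]))
    print(combine_stream_loglik([[-1.0, -3.0]], [[-2.0, -1.0]], 0.5, 0.5))
```

The combined scores can then be passed to a frame-level decoder in place of single-stream emission scores. The asynchronous model is not covered by this sketch, since it needs a joint model over both streams.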
Slide 14: linking the two layers
- hard decision
  - i-action model with highest probability outputs 1; all other models output 0
- soft decision
  - outputs probability for each individual action model
- example: HD (1, 0, 0), SD (0.9, 0.05, 0.05)
[Figure label: Audio-visual features]
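The two linking schemes differ only in what the individual-action layer hands to the group layer. A minimal sketch, assuming the I-HMM yields one non-negative score per individual action model (the function names are hypothetical):

```python
def soft_decision(scores):
    """Normalise non-negative model scores so they sum to 1."""
    total = sum(scores)
    if total <= 0:
        raise ValueError("scores must contain a positive value")
    return [s / total for s in scores]

def hard_decision(scores):
    """Output 1 for the highest-scoring model and 0 for all others."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]

if __name__ == "__main__":
    scores = [0.9, 0.05, 0.05]    # e.g. speaking, writing, idle
    print(hard_decision(scores))  # one-hot vector, as in the HD example
    print(soft_decision(scores))  # probabilities, as in the SD example
```

The soft vector keeps the I-HMM's uncertainty available to the group layer, whereas the hard vector discards it.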
Slide 15: 4. experiments: data setup
- 59 meetings (30/29 train/test)
- four people, five minutes each
- scripts
  - schedule of actions
  - natural behavior
- features: 5 frames/s
mmm.idiap.ch
Slide 16: performance measures
- individual actions: frame error rate (FER)
- group actions: action error rate (AER)
  - Subs: number of substituted actions
  - Del: number of deleted actions
  - Ins: number of added actions
  - Total actions: number of target actions
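Both measures can be sketched briefly. FER is the percentage of frames whose label differs from the reference. For AER, the sketch assumes the usual ASR-style definition, (Subs + Del + Ins) / Total actions x 100, and obtains the summed error count as the edit distance between the reference and recognized action sequences. The slide's own formula did not survive extraction, so this definition is an assumption by analogy with word error rate.

```python
def frame_error_rate(ref_frames, hyp_frames):
    """Percentage of frames whose label differs from the reference."""
    if len(ref_frames) != len(hyp_frames):
        raise ValueError("frame sequences must have equal length")
    errors = sum(r != h for r, h in zip(ref_frames, hyp_frames))
    return 100.0 * errors / len(ref_frames)

def edit_distance(ref, hyp):
    """Minimum number of substitutions, deletions and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def action_error_rate(ref_actions, hyp_actions):
    """(Subs + Del + Ins) / Total actions, as a percentage."""
    return 100.0 * edit_distance(ref_actions, hyp_actions) / len(ref_actions)

if __name__ == "__main__":
    ref = ["discussion", "monologue1", "presentation", "discussion"]
    hyp = ["discussion", "presentation", "discussion", "note-taking"]
    print(action_error_rate(ref, hyp))
    print(frame_error_rate([0, 0, 1, 1, 2], [0, 1, 1, 1, 2]))
```

Because insertions are counted against the number of target actions, AER can exceed 100% when the recognizer outputs many spurious actions.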
Slide 17: results: individual actions
[Table not recovered. Surviving annotations: 43000 frames; (0.8, 0.2); (0.2-2.2 s)]
Slide 18: results: group actions
- multi-modality outperforms single modalities
- two-layer HMM outperforms single-layer HMM for a-only, v-only and a-v
- best model: A-HMM
- soft decision slightly better than hard decision
Slide 19: action-based meeting structuring
Slide 20: conclusions
- structuring meetings as sequences of meeting actions
- layered HMMs successful for recognition
- turn-taking patterns useful for browsing
- public dataset, standard evaluation procedures
- open issues
  - less training data (unsupervised, ACM MM 2004)
  - other relevant actions (interest-level, ICASSP 2005)
  - other features (words, emotions)
  - efficient models for many interacting streams
Slide 21: Linking Two Layers (1)
Slide 22: Linking Two Layers (2)
Please refer to D. Zhang et al., "Modeling Individual and Group Actions in Meetings: a Two-Layer HMM Framework," in IEEE Workshop on Event Mining, CVPR, 2004.