Title: Exploiting video information for Meeting Structuring
1. Exploiting video information for Meeting Structuring
2. Agenda
- Introduction
- Feature set extension
- Video features processing
- Video features integration
- Preliminary results
- Conclusions
3. Meeting Structuring (1)
- Goal: recognise events which involve one or more communicative modalities
  - Monologue / Dialogue / Note taking / Presentation / Presentation at the whiteboard
- Working environment: the IDIAP framework
  - 69 five-minute meetings of 4 participants
  - 30 transcribed meetings
  - Scripted meeting structure
4. Meeting Structuring (2)
- 3 audio-derived feature families: Speaker Turns, Prosodic Features and Lexical Features
  - Speaker Turns: microphone array recordings with beam-forming
  - Prosodic Features: lapel microphones; pitch baseline, energy and rate of speech
  - Lexical Features: ASR transcription; monologue/dialogue (M/D) discrimination
5. Meeting Structuring (3)
- Dynamic Bayesian Network based models (using GMTK, Bilmes et al.)
- Multi-stream processing (parallel stream processing)
- Counter structure (state duration modelling); a toy sketch of the counter idea follows the feature list below
- 3 feature families
  - Prosodic features (S1)
  - Speaker Turns (S2)
  - Lexical features (S3)
- Leave-one-out cross-validation over the 30 annotated meetings
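The counter structure enforces state-duration constraints inside the DBN. The real models are built with GMTK; purely as an illustrative toy, here is a hedged Python sketch of a Viterbi decoder over (action, counter) pairs that forces each action to last a minimum number of frames. The action names, MIN_DURATION, and all scores are assumptions, not the actual IDIAP setup.

```python
import numpy as np

MIN_DURATION = 5          # frames an action must last (assumed value)
ACTIONS = ["monologue", "dialogue", "note_taking", "presentation"]

def viterbi_with_counter(log_likes, log_trans, min_dur=MIN_DURATION):
    """Viterbi over (action, counter) pairs.

    log_likes: (T, A) per-frame log-likelihood of each action.
    log_trans: (A, A) action transition log-probabilities.
    A transition out of an action is only allowed once its counter
    has reached min_dur - 1; otherwise the model must stay.
    """
    T, A = log_likes.shape
    NEG = -np.inf
    # delta[a, c]: best log-score ending in action a with counter c
    delta = np.full((A, min_dur), NEG)
    delta[:, 0] = log_likes[0]
    back = np.zeros((T, A, min_dur, 2), dtype=int)
    for t in range(1, T):
        new = np.full((A, min_dur), NEG)
        for a in range(A):
            for c in range(min_dur):
                if c > 0:              # stayed in a, counter advances
                    score, prev = delta[a, c - 1], (a, c - 1)
                else:                  # entered a from a finished action
                    choices = [(delta[b, min_dur - 1] + log_trans[b, a],
                                (b, min_dur - 1)) for b in range(A) if b != a]
                    score, prev = max(choices, key=lambda p: p[0])
                # counter saturates at min_dur - 1 (free to stay or leave)
                if c == min_dur - 1 and delta[a, min_dur - 1] > score:
                    score, prev = delta[a, min_dur - 1], (a, min_dur - 1)
                new[a, c] = score + log_likes[t, a]
                back[t, a, c] = prev
        delta = new
    # backtrack from the best final (action, counter) pair
    a, c = np.unravel_index(int(np.argmax(delta)), delta.shape)
    path = [a]
    for t in range(T - 1, 0, -1):
        a, c = back[t, a, c]
        path.append(a)
    return [ACTIONS[i] for i in reversed(path)]

# Example: T = 12 frames, random scores (illustrative only)
rng = np.random.default_rng(1)
labels = viterbi_with_counter(rng.normal(size=(12, 4)),
                              np.log(np.full((4, 4), 0.25)))
```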
[Figure: the counter-based multi-stream DBN unrolled over time, with counter nodes C, action-level nodes E and A, per-stream sub-state nodes S^1, S^2, S^3 and observation nodes Y^1, Y^2, Y^3]
                 Corr  Sub  Del  Ins  AER
Without counter  91.7  4.5  3.8  2.6  10.9
With counter     92.9  5.1  1.9  1.9   9.0
6. Feature set extension (1)
- Multi-party meetings are multi-modal communicative processes
- Our features cover only two modalities: audio (prosodic features, speaker turns) and lexical content (the lexical monologue/dialogue discriminator)
Exploiting video content is the next step!
7. Feature set extension (2)
- Goal: improve the recognition of Note taking, Presentation and Presentation at the whiteboard
  - the three most confused symbols
  - three meeting actions which heavily involve body/hand movements
- Approach: extract low-level video features and leave their interpretation to high-level specialised models
8. Feature set extension (3)
- We need motion features for the hands and head-torso regions
- Constraints
  - The system must be simple
  - Reliable against environmental changes (lighting, backgrounds, ...)
  - Open to further extensions / modifications
- Initial assumptions
  - Meeting video content is quite static
  - Participants occupy only a few spatial regions and tend to stay there
  - The meeting room configuration (camera positions, seats, furniture, ...) is fixed
9. Video feature extraction (1)
- Motion analysis is performed by tracking features with the Kanade-Lucas-Tomasi (KLT) tracker and by partitioning the resulting trajectories according to their relative position in the scene
- Four spatial regions for each scene: Head 1 / 2 and Hands 1 / 2
10. KLT (1)
- Assumption: the brightness of every point of a (slowly) moving or static object does not change between images taken at nearby time instants:
  I(x + u \delta t, y + v \delta t, t + \delta t) = I(x, y, t)
- Approximating the Taylor series to the 1st derivative yields the optical flow constraint equation:
  I_x u + I_y v + I_t = 0, i.e. \nabla I \cdot v = -I_t
  where (u, v) are the moving-object speeds, \nabla I = (I_x, I_y) is the brightness gradient, and I_t represents how fast the intensity is changing with time
- This is one equation in two unknowns, hence it admits more than one solution
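As a quick numeric sanity check of the constraint (not part of the original system), the following Python snippet verifies that a uniformly translating 1-D brightness pattern approximately satisfies I_x u + I_t = 0; all values are illustrative.

```python
import numpy as np

x = np.linspace(0, 2 * np.pi, 200)
u = 0.3                                   # true horizontal speed (assumed)
I0 = np.sin(x)                            # frame at time t
I1 = np.sin(x - u)                        # frame at time t + 1, shifted by u
Ix = np.gradient(I0, x)                   # brightness gradient
It = I1 - I0                              # temporal derivative (dt = 1)
# Ix * u + It should be ~0 where the 1st-order Taylor approximation holds
print(np.max(np.abs(Ix * u + It)))        # small residual
```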
11. KLT (2)
- Minimizing the weighted least-squares error
  E(v) = \sum_i w(x_i) [\nabla I(x_i) \cdot v + I_t(x_i)]^2
  where the x_i are neighbour points of x, assumed to move with the same constant velocity v
- In two dimensions the system has the form G v = b, with
  G = \sum_i w(x_i) \nabla I(x_i) \nabla I(x_i)^T and b = -\sum_i w(x_i) I_t(x_i) \nabla I(x_i)
- If G is invertible, the solution is v = G^{-1} b
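A minimal numeric sketch of this least-squares solve, assuming the per-point derivatives Ix, Iy, It and weights w are already available (the slides do not give the actual implementation or its parameters):

```python
import numpy as np

def lk_velocity(Ix, Iy, It, w):
    """Solve G v = b for the 2-D velocity v = (u, v)."""
    grads = np.stack([Ix, Iy], axis=1)            # (N, 2) gradient per point
    G = (w[:, None] * grads).T @ grads            # 2x2 structure matrix
    b = -(w * It) @ grads                         # right-hand side
    return np.linalg.solve(G, b)                  # fails if G is singular

# Example with a synthetic window of N = 49 points (a 7x7 window)
rng = np.random.default_rng(0)
Ix, Iy = rng.normal(size=49), rng.normal(size=49)
true_v = np.array([0.5, -0.2])
It = -(Ix * true_v[0] + Iy * true_v[1])           # exact constraint, no noise
print(lk_velocity(Ix, Iy, It, np.ones(49)))       # ~ [0.5, -0.2]
```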
12. KLT (3)
- A good feature is
  - one that can be tracked well (Tomasi et al.): if \lambda_1 and \lambda_2 are the eigenvalues of G, the system is well-conditioned if \min(\lambda_1, \lambda_2) > \tau, i.e. large eigenvalues (high texture content), but in the same range
  - and even better if it is part of a human body: pixels with a higher probability of being skin are preferred
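The eigenvalue test could look as follows in a brief sketch; G is the same structure matrix as in the previous sketch, and the threshold value is an assumption for illustration:

```python
import numpy as np

def is_good_feature(Ix, Iy, lambda_min_thresh=1.0):
    """Keep a window only if its smaller eigenvalue clears the threshold."""
    grads = np.stack([Ix, Iy], axis=1)
    G = grads.T @ grads                   # unweighted structure matrix
    l1, l2 = np.linalg.eigvalsh(G)        # ascending order: l1 <= l2
    return l1 > lambda_min_thresh         # well-conditioned window
```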
13. KLT (4)
- KLT feature tracking consists of 3 steps:
  - Select n good features
  - Track the selected n features
  - Replace lost features
- We decided to track n = 100 features in a square (7x7) window; a sketch of the three steps follows below
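The slides do not name a particular library; as one possible realisation, here is a hedged OpenCV sketch of the three steps with n = 100 features and a 7x7 window (the file name and the remaining parameter values are assumptions):

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("meeting.avi")                 # hypothetical input
ok, frame = cap.read()
prev_gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
# Step 1: select n = 100 good features (Shi-Tomasi / KLT criterion)
pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=100,
                              qualityLevel=0.01, minDistance=7)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Step 2: track the selected features in a 7x7 search window
    new_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, pts, None, winSize=(7, 7))
    pts = new_pts[status.flatten() == 1].reshape(-1, 1, 2)
    # Step 3: replace lost features to keep n close to 100
    if len(pts) < 100:
        extra = cv2.goodFeaturesToTrack(gray, maxCorners=100 - len(pts),
                                        qualityLevel=0.01, minDistance=7)
        if extra is not None:
            pts = np.vstack([pts, extra])
    prev_gray = gray
```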
14. Skin modelling
- Colour-based approach in the (Cr, Cb) chromatic subspace
- Skin samples taken from unused meetings
- Initial experiments made using a single Gaussian; now a 3-component Gaussian Mixture Model is used (a sketch follows below)
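A minimal sketch of such a skin model, assuming OpenCV for the YCrCb conversion and scikit-learn for the 3-component GMM (the file names and the decision threshold are illustrative):

```python
import cv2
import numpy as np
from sklearn.mixture import GaussianMixture

def crcb(image_bgr):
    """Map pixels to the (Cr, Cb) chromatic subspace, dropping luminance."""
    ycrcb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2YCrCb)
    return ycrcb[:, :, 1:3].reshape(-1, 2).astype(np.float64)

skin_pixels = crcb(cv2.imread("skin_samples.png"))   # hypothetical samples
gmm = GaussianMixture(n_components=3).fit(skin_pixels)

def skin_probability_map(image_bgr, log_thresh=-10.0):
    scores = gmm.score_samples(crcb(image_bgr))      # per-pixel log-likelihood
    h, w = image_bgr.shape[:2]
    return (scores > log_thresh).reshape(h, w)       # boolean skin mask
```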
15. Video feature extraction (2)
- Structure of the implemented system:
  Video -> Skin Detection (using the skin model) -> KLT (100 features / frame) -> Trajectory Structure (100 trajectories / frame)
16. Video feature extraction (3)
- Trajectory Structure
  - Remove long and quite static trajectories
  - Define 4 partitions (regions): 2 x heads (H1, H2) and 2 x hands (Ha1, Ha2)
  - Define 2 additional fixed regions, L and R
  - Classify the trajectories into the regions
  - Evaluate the Average Motion of each region; a sketch of this post-processing follows below
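A hedged sketch of this trajectory post-processing; the region rectangles, the staticness test, and the length threshold are assumptions, since the slides fix only the number and role of the regions:

```python
import numpy as np

REGIONS = {                        # (x0, y0, x1, y1), assumed coordinates
    "head1": (40, 0, 160, 90),    "head2": (200, 0, 320, 90),
    "hands1": (20, 120, 180, 240), "hands2": (180, 120, 340, 240),
    "L": (0, 0, 20, 240),         "R": (340, 0, 360, 240),
}

def region_of(point):
    for name, (x0, y0, x1, y1) in REGIONS.items():
        if x0 <= point[0] < x1 and y0 <= point[1] < y1:
            return name
    return None

def average_motion(trajectories, min_motion=1.0):
    """trajectories: list of (T_i, 2) arrays of tracked point positions."""
    sums = {name: [] for name in REGIONS}
    for traj in trajectories:
        disp = traj[-1] - traj[0]
        if len(traj) > 30 and np.linalg.norm(disp) < min_motion:
            continue                      # long and quite static: remove
        name = region_of(traj[0])
        if name is not None:
            sums[name].append(disp / (len(traj) - 1))  # mean velocity
    # averaging over many trajectories reduces per-feature noise
    return {name: np.mean(v, axis=0) if v else np.zeros(2)
            for name, v in sums.items()}
```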
17. Video feature extraction (4)
[Figure: four-panel illustration (steps 1-4) of the trajectory processing]
18. Video feature extraction (5)
- Taking motion vectors averaged over many trajectories helps to reduce noise
- For each scene 4 motion vectors, one for each region, are estimated (soon to be enhanced with 2 more regions/vectors, L and R, in order to detect whether someone is entering or leaving the scene)
- Open issues
  - Loss of tracking for fast-moving objects (to be accounted for during the tracking)
  - Assumption of a fixed scene structure
  - Delayed/offline processing
19. Integration
- Goal: extend the multi-stream model with a new video stream
- It is possible that the extended model will be intractable due to the increased state space
- In this case
  - state-space reduction through a multi-time-scale approach will be attempted
  - early integration of Speaker Turns and Lexical features will be investigated (a minimal illustration follows below)
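As a minimal illustration of the early-integration option, the two feature streams would simply be concatenated frame by frame before modelling, so the DBN sees one combined observation stream instead of two; the array names and shapes are assumptions:

```python
import numpy as np

speaker_turns = np.load("speaker_turns.npy")   # hypothetical, shape (T, d1)
lexical = np.load("lexical.npy")               # hypothetical, shape (T, d2)
combined = np.concatenate([speaker_turns, lexical], axis=1)  # (T, d1 + d2)
```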
20. Preliminary results
- Before proceeding with the proposed integration we need to
  - compare video performance against the other feature families
  - validate the extracted video features

            Speaker Turns  Prosodic Features  Lexical Features  Video Features
Accuracy    85.9           69.9               52.6              48.1

                                                           Corr  Sub  Del  Ins  AER
(A) Two-stream model (Speaker Turns; Prosody + Lexical)    87.8  4.5  7.7  3.2  15.4
(B) Two-stream model (Speaker Turns; Video Features)       90.4  3.2  6.4  4.5  14.1

Video features alone perform quite poorly, but they appear to be helpful when evaluated together with Speaker Turns.
21. Summary
- Extraction of video features through
  - a skin-detector-enhanced KLT feature tracker
  - segmentation of trajectories into 4/6 spatial regions
  - (a simple and fast approach, but with some open problems)
- Validation of Motion Vectors as a video feature
- Integration into the existing framework (work in progress)