Title: EM detection of common origin of multimodal cues
1. EM detection of common origin of multimodal cues
- Athanasios K. Noulas
- Ben J.A. Kröse
- Intelligent Systems Laboratory
- University of Amsterdam
- MultimediaN
2. Overview of the presentation
- Problem Description
- Our Objective
- Our Approach
- Proposed model
- Learning
- Results
- Contributions, Applications, Open Issues
3. Problem Description - Scenario
- We work with multimedia data in which people appear talking.
- A great variety of video streams matches our scenario:
- News videos
- Interviews / talk shows
- Movies
4. Problem Description
5. Our Objective
- We want to assign the visual and audio cues to the person that generated them.
- Ideally, we would like to estimate, for each time slice (0.04 sec), the identity of the speaker and the visible person(s).
6. Available cues
7. Our Approach
- We deal with noisy data that arrive as a sequence of observations from a non-deterministic (stochastic) process.
- We model the problem as a Dynamic Bayesian Network.
8. Dynamic Bayesian Networks
- Can model complex relationships between variables
- Make inference about hidden variables
- Deal with dynamic systems, taking temporal relations into account
Example: Hidden Markov Model (see the sketch below)
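As a concrete instance of such a model, a minimal sketch of the HMM forward pass in Python; the two-state numbers are illustrative, not taken from this work:

```python
import numpy as np

# Minimal HMM: the simplest Dynamic Bayesian Network. A hidden state
# chain z_1..z_T emits one observation per time step.
pi = np.array([0.6, 0.4])             # illustrative initial distribution p(z_1)
A = np.array([[0.9, 0.1],             # transition matrix p(z_t | z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],             # emission matrix p(x_t | z_t)
              [0.1, 0.9]])

def forward(obs):
    """Likelihood p(x_1..x_T), summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]          # alpha_1(z) = p(z_1) p(x_1 | z_1)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # propagate one time step
    return alpha.sum()

print(forward([0, 0, 1, 1]))
```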
9. Our Dynamic Bayesian Network
- Our model has two layers: single-modality analysis and modality fusion.
- The hidden variables represent the identities of the speaker and of the visible persons.
- The visible variables represent the features we extract from the multimedia stream.
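A plausible factorization consistent with this description, in our own notation (a_t: speaker label, v_t: visible-person labels, x_t: extracted features); the paper's exact structure may differ:

```latex
p(a_{1:T}, v_{1:T}, x_{1:T}) =
  \prod_{t=1}^{T}
    \underbrace{p(a_t \mid a_{t-1})\, p(v_t \mid v_{t-1})}_{\text{temporal layer}}\;
    \underbrace{p(x_t^{\text{audio}} \mid a_t)\, p(x_t^{\text{video}} \mid v_t)}_{\text{single-modality analysis}}\;
    \underbrace{p(x_t^{\text{fusion}} \mid a_t, v_t)}_{\text{modality fusion}}
```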
10. Our Approach - Audio Analysis
- Features (Mel-frequency cepstral coefficients) are extracted from the audio stream.
- These features are used to make inference about the speaker's identity.
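A minimal extraction sketch, assuming librosa as a stand-in toolkit and a hypothetical input file (the presentation names neither); the 0.04 s hop yields one feature vector per video frame:

```python
import librosa  # assumption: librosa is our stand-in audio toolkit

# Hypothetical input file; 16 kHz is a common rate for speech.
y, sr = librosa.load("talkshow.wav", sr=16000)
hop = int(sr * 0.04)  # 0.04 s per time slice, matching the video frame rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
print(mfcc.shape)     # (13, number_of_time_slices)
```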
11. Our Approach - Video Analysis
- We detect faces in the video frames (number of faces and their positions).
- We extract face features (color histograms).
- These features are used to make inference about the identities of the visible persons.
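A sketch of this stage using OpenCV's Haar cascade as a stand-in detector (the presentation does not name one):

```python
import cv2  # assumption: OpenCV as a stand-in for the unspecified detector

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_features(frame):
    """Detect faces and describe each with an HSV color histogram."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    feats = []
    for (x, y, w, h) in faces:
        hsv = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        feats.append(((x, y, w, h), hist.flatten()))
    return feats  # one (position, histogram) pair per detected face
```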
12. Independent Modality Analysis
Accuracy: 72%
Squares indicate detected faces; green indicates the speaker.
13. Modality Fusion
- At this point, the labels for the states of the video and audio modalities are independent.
- We need a quantity that measures the correlation between the two modalities. In terms of graphical models, this is a visible node that is a child of both the audio and video label nodes.
14. Our Approach - Fusion Model
15. Measuring the Correlation
- We need a quantity that can be estimated from the data and that relates the different modalities.
- A statistical measure of the correlation between two random variables is mutual information.
16. Mutual Information
- Intuitively, the mutual information between variables X and Y measures the information about X that is shared by Y.
- The higher the value of the mutual information, the more correlated the values of X and Y are.
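For reference, the standard definition for discrete random variables:

```latex
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
       = H(X) - H(X \mid Y) \;\ge\; 0
```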
17. Mutual Information
- We expect the changes in pixel values originating from the sound source (for instance, the lips of the speaker) to be correlated with the changes in the audio signal.
- Therefore, we estimate the mutual information between each pixel's value variation and the average acoustic energy of the audio stream (a sketch follows).
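A histogram-based sketch of this estimate; the binning and variable names are ours, for illustration:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in nats for two 1-D signals."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# x: one pixel's intensity change between consecutive frames over time,
# y: the average acoustic energy of the matching 0.04 s audio slices.
# Computing this per pixel yields a map whose high-MI regions
# (e.g. the speaker's lips) move in sync with the audio.
```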
18. Mutual Information Image
[Figure: mutual information image with the corresponding audio stream segment]
19. Mutual Information Example
[Figure: example mutual information images, with panels labeled Face, Olaf, and Michael]
20. Learning with our model
- We need to learn the parameters of our model:
- The person models
- The transition matrices
- We use the EM algorithm (a schematic loop follows this list):
- E-step: we estimate the expectation of the system state for each time slice.
- M-step: we estimate the model parameters that maximize this expectation.
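A schematic of the loop; e_step and m_step are hypothetical helpers standing in for forward-backward inference in the DBN and for the closed-form parameter updates:

```python
def em(observations, params, n_iter=20, tol=1e-4):
    """Alternate E- and M-steps until the likelihood stops improving."""
    prev_ll = -float("inf")
    for _ in range(n_iter):
        # E-step: posterior over the hidden identities at each time slice,
        # under the current parameters (hypothetical helper).
        expectations, log_lik = e_step(observations, params)
        # M-step: person models and transition matrices that maximize the
        # expected complete-data log-likelihood (hypothetical helper).
        params = m_step(observations, expectations)
        if log_lik - prev_ll < tol:  # EM never decreases the likelihood
            break
        prev_ll = log_lik
    return params
```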
21. Inference with our model
- Since the person models and the DBN structure are known, any inference technique can be used.
- We use the Viterbi algorithm to obtain the state sequence that maximizes the likelihood of our observation sequence.
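A self-contained Viterbi sketch for a plain chain, in log space for numerical stability; in the actual model the state would be the composite of speaker and visible-person labels:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden state sequence given discrete observations.

    log_pi: (S,) initial log-probabilities
    log_A:  (S, S) transition log-probabilities
    log_B:  (S, O) emission log-probabilities
    """
    S, T = len(log_pi), len(obs)
    delta = np.empty((T, S))            # best path score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```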
22. Results
Accuracy: 96%
Squares indicate detected faces; green indicates the speaker.
23. Results
Squares indicate detected faces; green indicates the speaker.
24. Contributions
- To the best of our knowledge, the original contributions of this work to the problem of multimodal stream segmentation are:
- Use of mutual information as a measure of modality correlation under a DBN
- Inference on frame-duration intervals regarding the origin of audio and video cues
- Excellent results on real-world problems without the use of training data
25. Applications
- The output of our model can be used:
- To enhance video content extraction applications
- To browse multimedia libraries through their scripts in an intelligent way
- To create user-friendly interfaces for human-robot interaction and video conferencing
26. Open Issues
- Real-time approximation
- Incorporation of more complex video and audio features
- Development of a graphical user interface that utilizes the output of the model to provide intelligent search capabilities
27. Questions
- Thank you for your time!
- Questions?