Title: EM detection of common origin of multimodal cues
1. EM detection of common origin of multimodal cues
- Athanasios K. Noulas
- Ben J.A. Kröse
- Intelligent Systems Laboratory
- University of Amsterdam
- MultimediaN
2. Overview of the presentation
- Problem Description
- Our Objective
- Our Approach
- Proposed model
- Learning
- Results
- Contributions, Applications, Open Issues
3. Problem Description - Scenario
- We work with multimedia data in which people appear talking.
- A great variety of video streams matches our scenario:
- News videos
- Interviews / talk shows
- Movies
4. Problem Description
5. Our Objective
- We want to assign the visual and audio cues to the person that generated them.
- Ideally, we would like to estimate, for each time slice (0.04 sec), the identity of the speaker and the visible person(s).
6. Available cues
7. Our Approach
- We deal with noisy data that arrive as a sequence of observations from a non-deterministic (stochastic) process.
- We model the problem as a Dynamic Bayesian Network.
8. Dynamic Bayesian Networks
- Can model complex relationships between variables
- Make inference about hidden variables
- Deal with dynamic systems, taking temporal relations into account
Example: Hidden Markov Model (see the sketch below)
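As a concrete instance of such a model, a minimal sketch of the HMM forward pass in Python; the two-state numbers are illustrative, not taken from this work:

```python
import numpy as np

# Minimal HMM: the simplest Dynamic Bayesian Network. A hidden state
# chain z_1..z_T emits one observation per time step.
pi = np.array([0.6, 0.4])             # illustrative initial distribution p(z_1)
A = np.array([[0.9, 0.1],             # transition matrix p(z_t | z_{t-1})
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],             # emission matrix p(x_t | z_t)
              [0.1, 0.9]])

def forward(obs):
    """Likelihood p(x_1..x_T), summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]          # alpha_1(z) = p(z_1) p(x_1 | z_1)
    for x in obs[1:]:
        alpha = (alpha @ A) * B[:, x]  # propagate one time step
    return alpha.sum()

print(forward([0, 0, 1, 1]))
```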
9. Our Dynamic Bayesian Network
- Our model has two layers: single-modality analysis and modality fusion.
- The hidden variables represent the identities of the speaker and of the visible persons.
- The visible variables represent the features we extract from the multimedia stream.
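A plausible factorization consistent with this description, in our own notation (a_t: speaker label, v_t: visible-person labels, x_t: extracted features); the paper's exact structure may differ:

```latex
p(a_{1:T}, v_{1:T}, x_{1:T}) =
  \prod_{t=1}^{T}
    \underbrace{p(a_t \mid a_{t-1})\, p(v_t \mid v_{t-1})}_{\text{temporal layer}}\;
    \underbrace{p(x_t^{\text{audio}} \mid a_t)\, p(x_t^{\text{video}} \mid v_t)}_{\text{single-modality analysis}}\;
    \underbrace{p(x_t^{\text{fusion}} \mid a_t, v_t)}_{\text{modality fusion}}
```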
10. Our Approach - Audio Analysis
- Features (Mel-frequency cepstral coefficients) are extracted from the audio stream.
- These features are used to make inference about the speaker's identity.
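A minimal extraction sketch, assuming librosa as a stand-in toolkit and a hypothetical input file (the presentation names neither); the 0.04 s hop yields one feature vector per video frame:

```python
import librosa  # assumption: librosa is our stand-in audio toolkit

# Hypothetical input file; 16 kHz is a common rate for speech.
y, sr = librosa.load("talkshow.wav", sr=16000)
hop = int(sr * 0.04)  # 0.04 s per time slice, matching the video frame rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
print(mfcc.shape)     # (13, number_of_time_slices)
```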
11. Our Approach - Video Analysis
- We detect faces in the video frames (number of faces and their positions).
- We extract face features (color histograms).
- These features are used to make inference about the identities of the visible persons.
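A sketch of this stage using OpenCV's Haar cascade as a stand-in detector (the presentation does not name one):

```python
import cv2  # assumption: OpenCV as a stand-in for the unspecified detector

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_features(frame):
    """Detect faces and describe each with an HSV color histogram."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    feats = []
    for (x, y, w, h) in faces:
        hsv = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [16, 16], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        feats.append(((x, y, w, h), hist.flatten()))
    return feats  # one (position, histogram) pair per detected face
```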
12. Independent Modality Analysis
Accuracy: 72%
Squares indicate detected faces; green indicates the speaker.
13. Modality Fusion
- At this point, the labels for the states of the video and audio modalities are independent.
- We need a quantity that measures the correlation between the two modalities. In terms of graphical models, this is a visible node that is a child of both the audio and video label nodes.
14. Our Approach - Fusion Model
15. Measuring the Correlation
- We need a quantity that can be estimated from the data and that relates the different modalities.
- A statistical measure of the correlation between two random variables is mutual information.
16. Mutual Information
- Intuitively, the mutual information between variables X and Y measures the information about X that is shared by Y.
- The higher the value of the mutual information, the more correlated the values of X and Y are.
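For reference, the standard definition for discrete random variables:

```latex
I(X;Y) = \sum_{x}\sum_{y} p(x,y)\,\log\frac{p(x,y)}{p(x)\,p(y)}
       = H(X) - H(X \mid Y) \;\ge\; 0
```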
17. Mutual Information
- We expect the changes in pixel values originating from the sound source (for instance, the lips of the speaker) to be correlated with the changes in the audio signal.
- Therefore, we estimate the mutual information between each pixel's value variation and the average acoustic energy of the audio stream (a sketch follows).
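A histogram-based sketch of this estimate; the binning and variable names are ours, for illustration:

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Histogram estimate of I(X;Y) in nats for two 1-D signals."""
    pxy, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = pxy / pxy.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal p(x)
    py = pxy.sum(axis=0, keepdims=True)   # marginal p(y)
    nz = pxy > 0                          # skip empty bins to avoid log(0)
    return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

# x: one pixel's intensity change between consecutive frames over time,
# y: the average acoustic energy of the matching 0.04 s audio slices.
# Computing this per pixel yields a map whose high-MI regions
# (e.g. the speaker's lips) move in sync with the audio.
```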
18. Mutual Information Image
[Figure: mutual information image with the corresponding audio stream segment]
19. Mutual Information Example
[Figure: example mutual information images, with panels labeled Face, Olaf, and Michael]
20. Learning with our model
- We need to learn the parameters of our model:
- The person models
- The transition matrices
- We use the EM algorithm (a schematic loop follows this list):
- E-step: we estimate the expectation of the system state for each time slice.
- M-step: we estimate the model parameters that maximize this expectation.
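A schematic of the loop; e_step and m_step are hypothetical helpers standing in for forward-backward inference in the DBN and for the closed-form parameter updates:

```python
def em(observations, params, n_iter=20, tol=1e-4):
    """Alternate E- and M-steps until the likelihood stops improving."""
    prev_ll = -float("inf")
    for _ in range(n_iter):
        # E-step: posterior over the hidden identities at each time slice,
        # under the current parameters (hypothetical helper).
        expectations, log_lik = e_step(observations, params)
        # M-step: person models and transition matrices that maximize the
        # expected complete-data log-likelihood (hypothetical helper).
        params = m_step(observations, expectations)
        if log_lik - prev_ll < tol:  # EM never decreases the likelihood
            break
        prev_ll = log_lik
    return params
```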
21. Inference with our model
- Since the person models and the DBN structure are known, any inference technique can be used.
- We use the Viterbi algorithm to obtain the state sequence that maximizes the likelihood of our observation sequence.
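A self-contained Viterbi sketch for a plain chain, in log space for numerical stability; in the actual model the state would be the composite of speaker and visible-person labels:

```python
import numpy as np

def viterbi(log_pi, log_A, log_B, obs):
    """Most likely hidden state sequence given discrete observations.

    log_pi: (S,) initial log-probabilities
    log_A:  (S, S) transition log-probabilities
    log_B:  (S, O) emission log-probabilities
    """
    S, T = len(log_pi), len(obs)
    delta = np.empty((T, S))            # best path score ending in each state
    back = np.zeros((T, S), dtype=int)  # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A  # scores[i, j]: from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):               # backtrack
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```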
22. Results
Accuracy: 96%
Squares indicate detected faces; green indicates the speaker.
23. Results
Squares indicate detected faces; green indicates the speaker.
24. Contributions
- To the best of our knowledge, the original contributions of this work to the problem of multimodal stream segmentation are:
- Use of mutual information as a measure of modality correlation under a DBN
- Inference on frame-duration intervals regarding the origin of audio and video cues
- Excellent results on real-world problems without the use of training data
25. Applications
- The output of our model can be used:
- To enhance video content extraction applications
- To browse multimedia libraries through their scripts in an intelligent way
- To create user-friendly interfaces for human-robot interaction and video conferencing
26. Open Issues
- Real-time approximation
- Incorporation of more complex video and audio features
- Development of a graphical user interface that utilizes the output of the model to provide intelligent search capabilities
27. Questions
- Thank you for your time!
- Questions?