Title: EM detection of common origin of multimodal cues


1
EM detection of common origin of multimodal cues
  • Athanasios K. Noulas
  • Ben J.A. Kröse
  • Intelligent Systems Laboratory
  • University of Amsterdam

MultimediaN
2
Overview of the presentation
  • Problem Description
  • Our Objective
  • Our Approach
  • Proposed model
  • Learning
  • Results
  • Contributions, Applications & Open Issues

3
Problem Description - Scenario
  • We work with multimedia data where people appear
    talking
  • Great variety of video streams match our scenario
  • News videos
  • Interviews / Talk shows
  • Movies

4
Problem Description
5
Our objective
  • We want to assign the visual and audio cues to
    the person that generated them
  • Ideally, we would like to estimate, at each time
    slice (0.04 sec), the identity of the speaker and
    of the visible person(s).

6
Available cues
7
Our Approach
  • We deal with noisy data arriving as a sequence of
    observations from a non-deterministic (stochastic)
    process.
  • We model the problem as a Dynamic Bayesian
    Network

8
Dynamic Bayesian Networks
  • Can model complex relationships between variables
  • Make inference about hidden variables
  • Deal with dynamic systems, taking into account
    temporal relations

Example: Hidden Markov Model (sketched below)
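
A hidden Markov model is the simplest DBN: one hidden chain with a transition matrix and an emission model. Below is a minimal forward-filtering sketch; the two states and all probabilities are invented for illustration and are not taken from the paper.

  import numpy as np

  # Illustrative HMM with 2 hidden states (e.g., "person A speaks",
  # "person B speaks"); every number here is made up for the sketch.
  A = np.array([[0.9, 0.1],          # P(state_t | state_t-1)
                [0.1, 0.9]])
  pi = np.array([0.5, 0.5])          # initial state distribution
  B = np.array([[0.8, 0.2],          # P(observation | state)
                [0.3, 0.7]])

  def forward_filter(obs):
      """Return P(state_t | obs_1..t) for every time slice."""
      belief = pi * B[:, obs[0]]
      belief /= belief.sum()
      beliefs = [belief]
      for o in obs[1:]:
          belief = (A.T @ belief) * B[:, o]   # predict, then weigh evidence
          belief /= belief.sum()
          beliefs.append(belief)
      return np.array(beliefs)

  print(forward_filter([0, 0, 1, 1, 1]))
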
9
Our Dynamic Bayesian Network
  • We have two layers in our model, the single
    modality analysis and the modality fusion.
  • The hidden variables represent the identities of
    the speaker and the visible persons
  • The visible variables represent the features we
    extract from the multimedia stream (one time slice
    is sketched below).
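
For concreteness, here is one time slice of the two-layer model as a plain data structure; the field names are ours, chosen to mirror the slide's description, not the authors' notation.

  from dataclasses import dataclass
  import numpy as np

  @dataclass
  class TimeSlice:
      """One 0.04 s slice of the two-layer model (illustrative names)."""
      # Hidden layer: the identities we want to infer.
      speaker_id: int                  # who generated the audio
      visible_ids: list                # who appears in the video frame
      # Visible layer: features extracted from the stream.
      mfcc: np.ndarray                 # audio features (slide 10)
      face_features: list              # boxes and histograms (slide 11)
      mutual_information: float        # audio-video link (slides 15-19)
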

10
Our Approach - Audio Analysis
  • Features are extracted from the audio stream
    (Mel-frequency cepstral coefficients)
  • These features are used to make inference about
    the speaker's identity (see the sketch below)
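
A minimal sketch of this step using the librosa library; the file name, sampling rate, and the 0.04 s hop (matching the time slice of slide 5) are assumptions, not the authors' exact configuration.

  import librosa

  # "interview.wav" is a placeholder file name.
  y, sr = librosa.load("interview.wav", sr=16000)

  # 13 Mel-frequency cepstral coefficients per frame; a 0.04 s hop
  # yields one feature vector per video frame at 25 fps.
  mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                              hop_length=int(0.04 * sr))
  print(mfcc.shape)   # (13, number_of_time_slices)
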

11
Our Approach - Video Analysis
  • We detect faces in the video frames (number of
    faces and their position)
  • We extract face features (color histogram)
  • These features are used to make inference about
    the identity of the visible persons (see the
    sketch below)
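
A sketch of both steps with OpenCV; the Haar cascade and the histogram bin counts are stand-ins, since the slides do not name a specific detector.

  import cv2

  # Frontal-face Haar cascade shipped with OpenCV (a stand-in detector).
  cascade = cv2.CascadeClassifier(
      cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

  def face_features(frame):
      """Return a ((x, y, w, h), histogram) pair per detected face."""
      gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
      faces = cascade.detectMultiScale(gray, scaleFactor=1.1,
                                       minNeighbors=5)
      features = []
      for (x, y, w, h) in faces:
          patch = frame[y:y + h, x:x + w]
          # 8x8x8 BGR color histogram; the bin count is arbitrary.
          hist = cv2.calcHist([patch], [0, 1, 2], None, [8, 8, 8],
                              [0, 256, 0, 256, 0, 256])
          features.append(((x, y, w, h),
                           cv2.normalize(hist, hist).flatten()))
      return features
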

12
Independent Modality Analysis
Accuracy: 72%
Squares indicate detected faces, green color
indicates speaker
13
Modality Fusion
  • At this point, the labels for the state of the
    video and audio modalities are independent.
  • We need a quantity that measures the correlation
    between the two modalities.

In terms of graphical models, this is a visible
node, a child of both the audio and video label
nodes.
14
Our Approach - Fusion Model
15
Measuring the Correlation
  • We need a quantity that can be estimated from the
    data and relates the different modalities.
  • A statistical measure of correlation between two
    random variables is mutual information.

16
Mutual Information
  • Intuitively, mutual information between variables
    X and Y measures the information about X that is
    shared by Y.
  • The higher the value of the mutual information,
    the more correlated the values of X and Y are
    (the standard definition is given below).
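
For reference, the standard definition for two discrete random variables, which the slides use only implicitly:

  I(X;Y) = \sum_{x}\sum_{y} p(x,y) \log \frac{p(x,y)}{p(x)\, p(y)}
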

17
Mutual Information
  • We expect the transformations of pixel values
    coming from the sound source (for instance lips
    of the speaker) to be correlated with the
    transformations of the audio signal.
  • Therefore, we estimate the Mutual Information
    between each pixel's value variation and the
    Average Acoustic Energy of the Audio stream (see
    the sketch below).
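
A minimal histogram-based sketch of this estimate; the bin count, and the use of absolute frame differences as the "pixel value variation", are our assumptions. The unoptimized per-pixel loop is kept for clarity.

  import numpy as np

  def mutual_information(x, y, bins=8):
      """Histogram-based MI estimate between two 1-D signals."""
      joint, _, _ = np.histogram2d(x, y, bins=bins)
      pxy = joint / joint.sum()
      px = pxy.sum(axis=1, keepdims=True)
      py = pxy.sum(axis=0, keepdims=True)
      nz = pxy > 0                                  # avoid log(0)
      return float((pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])).sum())

  def mi_image(frames, energy):
      """Per-pixel MI between intensity variation and audio energy.
      frames: (T, H, W) gray-scale video; energy: (T,) acoustic energy."""
      variation = np.abs(np.diff(frames, axis=0))   # pixel value changes
      e = energy[1:]                                # align with differences
      H, W = variation.shape[1:]
      mi = np.zeros((H, W))
      for i in range(H):
          for j in range(W):
              mi[i, j] = mutual_information(variation[:, i, j], e)
      return mi
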

18
Mutual Information Image
[Figure: mutual information image computed for the corresponding audio stream segment]
19
Mutual Information Example
[Figure: mutual information images for each detected face (Olaf, Michael) while each speaker (Olaf, Michael) is talking]
20
Learning with our model
  • We need to learn the parameters of our model:
  • The person models
  • The transition matrices
  • We use the EM algorithm (sketched below):
  • E-step: we estimate the expected state of the
    system for each time slice
  • M-step: we estimate the model parameters that
    maximize this expectation
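
As an illustration of the E/M alternation only, here is textbook Baum-Welch EM for a single discrete HMM chain; this is a much simpler model than the paper's two-layer DBN, and all dimensions are arbitrary.

  import numpy as np

  def em_hmm(obs, n_states, n_symbols, n_iter=20, seed=0):
      """Baum-Welch EM for a discrete HMM (toy stand-in for the DBN)."""
      rng = np.random.default_rng(seed)
      A = rng.dirichlet(np.ones(n_states), n_states)    # transitions
      B = rng.dirichlet(np.ones(n_symbols), n_states)   # emissions
      pi = np.full(n_states, 1.0 / n_states)
      obs = np.asarray(obs)
      T = len(obs)
      for _ in range(n_iter):
          # E-step: forward-backward gives expected state occupancies
          # (gamma) and expected transitions (xi) for every time slice.
          alpha = np.zeros((T, n_states))
          beta = np.zeros((T, n_states))
          alpha[0] = pi * B[:, obs[0]]
          alpha[0] /= alpha[0].sum()
          for t in range(1, T):
              alpha[t] = (A.T @ alpha[t - 1]) * B[:, obs[t]]
              alpha[t] /= alpha[t].sum()
          beta[-1] = 1.0
          for t in range(T - 2, -1, -1):
              beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
              beta[t] /= beta[t].sum()
          gamma = alpha * beta
          gamma /= gamma.sum(axis=1, keepdims=True)
          xi = np.zeros((n_states, n_states))
          for t in range(T - 1):
              x = np.outer(alpha[t], B[:, obs[t + 1]] * beta[t + 1]) * A
              xi += x / x.sum()
          # M-step: re-estimate the parameters from the expected counts.
          pi = gamma[0]
          A = xi / xi.sum(axis=1, keepdims=True)
          for k in range(n_symbols):
              B[:, k] = gamma[obs == k].sum(axis=0)
          B /= B.sum(axis=1, keepdims=True)
      return pi, A, B
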

21
Inference with our model
  • Since the person models and the DBN structure are
    known, any inference technique can be used
  • We use the Viterbi algorithm to obtain the state
    sequence that maximizes the likelihood of our
    observation sequence (sketched below).
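
A log-space Viterbi sketch for the same single-chain toy model; pi, A, and B are the initial, transition, and emission parameters, as in the EM sketch above. Log probabilities avoid numerical underflow on long sequences.

  import numpy as np

  def viterbi(obs, pi, A, B):
      """Most likely hidden state sequence given the observations."""
      T, n = len(obs), len(pi)
      logA, logB = np.log(A), np.log(B)
      delta = np.log(pi) + logB[:, obs[0]]
      back = np.zeros((T, n), dtype=int)
      for t in range(1, T):
          scores = delta[:, None] + logA      # score of each i -> j move
          back[t] = scores.argmax(axis=0)
          delta = scores.max(axis=0) + logB[:, obs[t]]
      path = [int(delta.argmax())]
      for t in range(T - 1, 0, -1):           # backtrack the best path
          path.append(int(back[t][path[-1]]))
      return path[::-1]
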

22
Results
Accuracy: 96%
Squares indicate detected faces, green color
indicates speaker
23
Results
Squares indicate detected faces, green color
indicates speaker
24
Contributions
  • To the best of our knowledge, the original
    contributions of this work to the problem of
    multimodal stream segmentation are:
  • Use of M.I. as a measure of modality correlation
    under a DBN
  • Inference on frame-duration intervals regarding
    the origin of audio and video cues
  • Excellent results on real-world problems without
    use of training data

25
Applications
  • The output of our model can be used:
  • To enhance video content extraction applications
  • To browse multimedia libraries using their script
    in an intelligent way
  • To create user-friendly interfaces in robot-human
    interaction and video conferencing situations

26
Open Issues
  • Real time approximation
  • Incorporation of more complex video and audio
    features
  • Development of a graphical user interface that
    will utilize the output of the model to provide
    intelligent search possibilities

27
Questions
  • Thank you for your time!
  • Questions?