Visual and auditory scene analysis using graphical models

1
Visual and auditory scene analysis using
graphical models

Nebojsa Jojic
www.research.microsoft.com/~jojic

2
People
Interns: Anitha Kannan, Nemanja Petrovic, Matt Beal
Collaborators: Brendan Frey, Hagai Attias, Sumit Basu
Windows: Ollivier Colle, Nenad Stefanovic, Sheldon Fisher
Soon to join: Trausti Kristijansson
3
Our representation

Objects rather than pixels
regions with stable appearance over time
moving coherently
occluding each other
subject to lighting changes
associated audio and its structure
Applications: compression, editing, watermarking,
indexing, search/retrieval, ...

4
A structured probability model

Reflects desired structure
Randomly generates plausible images
Represents the data by parameters

5
[Figure: (a) Block processing; (b) Structured probability model. Node labels in both panels: intrinsic appearance, illumination, mask, appearance, position, observed image.]
6
Inference, learning and generation

Inference (inverting the generative process): Bayesian inference, variational inference, loopy belief propagation, sampling techniques
Learning: expectation maximization (EM), generalized EM, variational EM
Generation: editing by changing some variables, video/audio textures

7
Basic flexible layer model
8
Basic flexible layer model
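[Figure: graphical model for two layers. Appearances s1, s2 and masks m1, m2; transformations T1, T2; transformed variables T1s1, T1m1, T2s2, T2m2; observed image x.]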
9
Multiple flexible layers

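[Figure: graphical model with L layers. Layer 1 variables: class c1, appearance s1, mask m1, transformation T1, transformed T1s1 and T1m1. Layer L variables: class cL, appearance sL, mask mL, transformation TL, transformed TLsL and TLmL. Observed image: x.]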
10
Probability distribution
[Figure: class variables c1, c2, c3]
11
Layer equation (Adelson et al)
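
A minimal sketch of layered compositing in the spirit of the layer equation, assuming the layers are already transformed (so each appearance stands for T_l s_l and each mask for T_l m_l), masks lie in [0, 1], layers are listed back to front, and sensor noise is omitted. The function name `composite_layers` and the toy arrays are hypothetical and not taken from the slides.

```python
import numpy as np

def composite_layers(appearances, masks):
    """Composite already-transformed layers, listed back to front.

    appearances[l] plays the role of T_l s_l and masks[l] of T_l m_l,
    with mask values in [0, 1]. Noise-free sketch (assumption).
    """
    x = np.zeros_like(appearances[0], dtype=float)
    for s, m in zip(appearances, masks):
        # A layer in front replaces what is behind it wherever its mask is 1
        # and blends with it where the mask is fractional.
        x = m * s + (1.0 - m) * x
    return x

# Hypothetical 1x4 "images": an opaque background and a sprite covering two pixels.
background = np.array([[1.0, 1.0, 1.0, 1.0]])
sprite = np.array([[5.0, 5.0, 5.0, 5.0]])
bg_mask = np.ones((1, 4))
sprite_mask = np.array([[0.0, 1.0, 1.0, 0.0]])
x = composite_layers([background, sprite], [bg_mask, sprite_mask])
```

Here the sprite occludes the two middle background pixels and leaves the outer two visible.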

12

13
Probability distribution
14
Likelihood, learning, inference

The pdf of x, p(x), is the integral, over the hidden variables, of the product of all
the conditional pdfs
Inference is hard!
Maximizing p(x_t) is done efficiently using
variational EM:
Infer hidden variables
Optimize parameters, keeping the above fixed
Loop
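
The infer/optimize/loop pattern can be sketched on a much simpler model than the layered one. The example below uses exact EM (the special case of variational EM in which q is the exact posterior) for a 1-D mixture of two unit-variance Gaussians. The model, the function name `em_two_gaussians` and the synthetic data are illustrative assumptions, not the model from the slides.

```python
import numpy as np

def em_two_gaussians(x, n_iters=50):
    """EM for a 1-D mixture of two unit-variance Gaussians (illustrative)."""
    mu = np.array([x.min(), x.max()], dtype=float)  # crude initialisation
    pi = np.array([0.5, 0.5])
    for _ in range(n_iters):
        # Infer hidden variables: posterior q over the class of each point.
        log_p = -0.5 * (x[:, None] - mu[None, :]) ** 2 + np.log(pi)[None, :]
        log_p -= log_p.max(axis=1, keepdims=True)  # numerical stability
        q = np.exp(log_p)
        q /= q.sum(axis=1, keepdims=True)
        # Optimize parameters, keeping q fixed.
        nk = q.sum(axis=0)
        mu = (q * x[:, None]).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, pi, q

# Synthetic data from two well-separated clusters (hypothetical values).
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-3.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
mu, pi, q = em_two_gaussians(x)
```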

15
Flexible sprites
16
Stabilization
17
Walking back
18
Moon-walking
19
Video editing
20
Video editing
21
Video indexing: Six break points vs. six things
in video

Traditional video segmentation: Find breakpoints
Example: MovieMaker (cut and paste)
Our goal: Find possibly recurring scenes or
objects

[Figure: timeline; labels as extracted: 1 3 2 4 2 1 4 3 2 3 2 3 5 6]
22
Video clustering
Class index
Class mean (representative image)
Mean with added variability
Shift
Transformed (shifted) image
Transformed image with added non-uniform noise
Optimizing average or minimum frame likelihood
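
The chain of variables listed on this slide can be read as a generative process. The sketch below samples one frame under several assumptions that the slide does not state: a uniform prior over classes and shifts, Gaussian variability and noise, and cyclic shifts. The function name `sample_frame` and all parameter names are hypothetical.

```python
import numpy as np

def sample_frame(means, variances, noise_var, rng):
    """Draw one frame from a shift-invariant clustering model (sketch)."""
    c = int(rng.integers(len(means)))                 # class index
    mean = means[c]                                   # class mean (representative image)
    # Mean with added variability (per-pixel Gaussian, assumed).
    z = mean + np.sqrt(variances[c]) * rng.standard_normal(mean.shape)
    h, w = z.shape
    shift = (int(rng.integers(h)), int(rng.integers(w)))  # cyclic shift (assumed uniform)
    shifted = np.roll(z, shift, axis=(0, 1))          # transformed (shifted) image
    # Transformed image with added non-uniform (per-pixel variance) noise.
    x = shifted + np.sqrt(noise_var) * rng.standard_normal(z.shape)
    return c, shift, x

# Toy usage with one 3x4 class and all variances set to zero.
means = [np.arange(12.0).reshape(3, 4)]
variances = [np.zeros((3, 4))]
noise_var = np.zeros((3, 4))
c, shift, x = sample_frame(means, variances, noise_var, np.random.default_rng(1))
```

With zero variances the sampled frame is simply the class mean under the sampled shift, which makes the role of each variable easy to check.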
23
Video indexing: Six break points vs. six things
in video

Traditional video segmentation: Find breakpoints
Example: MovieMaker (cut and paste)
Our goal: Find possibly recurring scenes or
objects

[Figure: timeline; labels as extracted: 1 3 2 4 2 1 4 3 2 3 2 3 5 6]
24
Video indexing: Six break points vs. six things
in video

Differences

A class is detected at multiple intervals on the
timeline. For example, class 1 models a baby's
face. Breakpoint detection misses it at the second
occurrence. The class occurs again later in
the sequence.

[Figure: timeline; labels as extracted: 1 3 2 4 2 1 4 3 2 3 2 3 5 6]
25
Video indexing: Six break points vs. six things
in video

Differences

One long shot contains a pan of the camera back
and forth among three scenes (classes 2, 3 and 5).

[Figure: timeline; labels as extracted: 1 3 2 4 2 1 4 3 2 3 2 3 5 6]
26
Video indexing: Six break points vs. six things
in video

Differences

Two shots that are detected only because the camera was
turned off and back on at a slightly different
vantage point are treated as a single scene class.

[Figure: timeline; labels as extracted: 1 3 2 4 2 1 4 3 2 3 2 3 5 6]
27
Example: Clustering a 20-minute whale-watching
sequence
28
Learned scene classes
29
A random interesting 20s video
30
Adding other variables (see also
www.research.microsoft.com/users/jojic/FlexibleSprites.htm)

Subspace variables (for PCA-like models)
Deformation fields
Cluster variables
Illumination
Texture
Time series model
Context
Rendering model

31
Adding other modalities and/or sensors
[Figure: the structured probability model (intrinsic appearance, illumination, mask, appearance, position, observed image) extended with audio nodes. Audio labels: audio model, time delay, A, Mic 1, Mic 2, observed audio.]
32
Speaker detection and tracking
33
Audio-visual textures
34
Challenges

Computational complexity
Achieving modularity in inference
Generality at expense of optimality?

35
Rewards

Object-based media: metadata and annotations, automated search, compression, manipulability
Structured probability models: ease of development, unified framework, compatible with other reasoning engines

36
A unified theory of natural signals

Probabilistic formulation: flexibility in stability and coherence, unsupervised learning possible
Structured probability models: random variables (observed and hidden), dependence models, inference and learning engines

37
Variational inference and learning
Gaussian Multinomial

1. Generalized E step (variational inference): optimize $B_n$ with respect to $q(h_n)$, keeping the model fixed
2. Generalized M step: optimize $\sum_n B_n$ with respect to the model
parameters, keeping $q(h_n)$ fixed
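
For orientation, the quantity $B_n$ in these two steps is taken here to be the standard variational lower bound on the log-likelihood of data point $x_n$. That reading is an assumption, and $\theta$ is a hypothetical symbol for the model parameters:

```latex
\log p(x_n \mid \theta)
  \;\ge\; B_n
  \;=\; \sum_{h_n} q(h_n)\, \log \frac{p(x_n, h_n \mid \theta)}{q(h_n)}
  \;=\; \log p(x_n \mid \theta) - \mathrm{KL}\!\left( q(h_n) \,\|\, p(h_n \mid x_n, \theta) \right)
```

The sum becomes an integral for continuous hidden variables. Under this reading the E step tightens the bound by moving $q(h_n)$ toward the true posterior, and the M step raises the summed bound by changing $\theta$.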

38
Use of FFTs in inference
Gaussian Multinomial
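Optimizing terms of the form $\sum_T q(T)\,(x - Ts)^\top (x - Ts)$
requires $x^\top T s$ for all $T$, which is a correlation if the $T$ are
shifts! In the FFT domain this is an elementwise product of $X$ and $S$ (with one factor conjugated).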
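
A minimal NumPy sketch of this trick, assuming the transformations are 2-D cyclic shifts and the images are real-valued. For cyclic shifts, $x^\top x$ and $(Ts)^\top(Ts) = s^\top s$ do not depend on $T$, so only the cross term $x^\top T s$ has to be computed per shift. The function names are hypothetical, and the brute-force version is included only as a check.

```python
import numpy as np

def shift_correlations_fft(x, s):
    """c[dy, dx] = sum(x * roll(s, (dy, dx))) for every cyclic shift, via FFT."""
    X = np.fft.fft2(x)
    S = np.fft.fft2(s)
    # Correlation theorem: conjugate the spectrum of the image being shifted.
    return np.fft.ifft2(X * np.conj(S)).real

def shift_correlations_direct(x, s):
    """Same quantity computed one shift at a time (slow reference)."""
    h, w = x.shape
    c = np.empty((h, w))
    for dy in range(h):
        for dx in range(w):
            c[dy, dx] = np.sum(x * np.roll(s, (dy, dx), axis=(0, 1)))
    return c

rng = np.random.default_rng(0)
x = rng.standard_normal((6, 8))   # stand-in for an observed image
s = rng.standard_normal((6, 8))   # stand-in for a sprite / class mean
fast = shift_correlations_fft(x, s)
slow = shift_correlations_direct(x, s)
```

The direct version costs on the order of N^2 operations for N pixels, while the FFT version costs on the order of N log N.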
39
Use of FFTs in learning
Gaussian Multinomial
Computing expectations of the form $\sum_T q(T)\, T^\top x$
reduces to an elementwise product of $Q$ and $X$ (with one factor conjugated) in the FFT domain!
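
A minimal NumPy sketch of this expectation, again assuming 2-D cyclic shifts, a real-valued image `x`, and a posterior `q` stored as an array indexed by shift. Here $T^\top x$ is implemented as the inverse shift of `x`. The function names and the random test data are hypothetical.

```python
import numpy as np

def expected_unshifted_fft(q, x):
    """sum over shifts of q[dy, dx] * roll(x, (-dy, -dx)), via FFT."""
    Q = np.fft.fft2(q)
    X = np.fft.fft2(x)
    return np.fft.ifft2(np.conj(Q) * X).real

def expected_unshifted_direct(q, x):
    """Same expectation accumulated one shift at a time (slow reference)."""
    h, w = x.shape
    e = np.zeros((h, w))
    for dy in range(h):
        for dx in range(w):
            # T^T x undoes the shift (dy, dx).
            e += q[dy, dx] * np.roll(x, (-dy, -dx), axis=(0, 1))
    return e

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 7))
q = rng.random((5, 7))
q /= q.sum()                      # posterior over shifts sums to one
fast = expected_unshifted_fft(q, x)
slow = expected_unshifted_direct(q, x)
```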
40
Media is multidisciplinary

Image processing: filtering, compression, fingerprinting, hashing, scene cut detection
Telecommunications: encryption, transmission, error correction
Computer vision: motion estimation, structure from motion, motion/object recognition, feature extraction
Computer graphics: rendering, mixing natural and synthetic, art
Signal processing: speech recognition, speaker detection/tracking, source separation, audio encoding, fingerprinting

41
Lack of a new unifying theory

The old general theory of signal decomposition lacked: semantics in the representation (objects, motion patterns, illumination conditions, ...); a notion of unknown and hidden cases
Narrow application-dependent frameworks: structure from motion, video segmentation and indexing, face recognition, HMMs for speech recognition