Title: Visual and auditory scene analysis using graphical models
1Visual and auditory scene analysis using graphical models
- Nebojsa Jojic
- www.research.microsoft.com/jojic
2People
- Interns: Anitha Kannan, Nemanja Petrovic, Matt Beal
- Collaborators: Brendan Frey, Hagai Attias, Sumit Basu
- Windows: Ollivier Colle, Nenad Stefanovic, Sheldon Fisher
- Soon to join: Trausti Kristijansson
3Our representation
- Objects rather than pixels
  - regions with stable appearance over time
  - moving coherently
  - occluding each other
  - subject to lighting changes
  - associated audio and its structure
- Applications: compression, editing, watermarking, indexing, search/retrieval, ...
4A structured probability model
- Reflects desired structure
- Randomly generates plausible images
- Represents the data by parameters
5Intrinsic appearance
[Figure: two panels, (a) Block processing and (b) Structured probability model. Labels: intrinsic appearance, illumination, mask, appearance, position, observed image]
6Inference, learning and generation
- Inference (inverting the generative process)
  - Bayesian inference
  - Variational inference
  - Loopy belief propagation
  - Sampling techniques
- Learning
  - Expectation maximization (EM)
  - Generalized EM
  - Variational EM
- Generation
  - Editing by changing some variables
  - Video/audio textures
7Basic flexible layer model
8Basic flexible layer model
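[Figure: graphical model for two layers. Labels: appearances s1, s2; masks m1, m2; transformations T1, T2; transformed masks and appearances T1m1, T1s1, T2s2, T2m2; observed image x]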
9Multiple flexible layers
[Figure: graphical model for L layers. Each layer l = 1, ..., L has its own variables: class cl, appearance sl, mask ml, transformation Tl, and the transformed mask and appearance Tlml, Tlsl. All layers share the observed image x]
10Probability distribution
[Figure labels: classes c1, c2, c3]
11Layer equation (Adelson et al)
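The layer equation itself did not survive on this slide, so the following is only a minimal sketch of standard mask-based layer compositing, not necessarily the exact form shown in the talk. It assumes the appearances and masks have already been transformed (they play the role of T1s1, T1m1, ...), that masks take values in [0, 1], and that index 0 is the frontmost layer. The function name `composite_layers` is hypothetical.

```python
import numpy as np

def composite_layers(appearances, masks):
    """Composite already-transformed layers, front to back.

    appearances, masks: lists of equally sized 2-D arrays; index 0 is
    assumed to be the frontmost layer and masks lie in [0, 1].
    """
    x = np.zeros_like(appearances[0], dtype=float)
    # Fraction of each pixel not yet covered by layers in front.
    visibility = np.ones_like(x)
    for s, m in zip(appearances, masks):
        x += visibility * m * s      # contribution of the visible part of this layer
        visibility *= (1.0 - m)      # what remains visible for the layers behind
    return x
```

For two layers with a fully opaque background (m2 = 1), this reduces to x = m1 * s1 + (1 - m1) * s2, with element-wise products.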
13Probability distribution
14Likelihood, learning, inference
- Pdf of x: p(x) is the integral, over the hidden variables, of the product of all the conditional pdfs
- Inference is hard!
- Maximizing p(xt) is done efficiently using variational EM
  - Infer hidden variables
  - Optimize parameters, keeping the above fixed
  - Loop
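The infer / optimize / loop structure can be illustrated on a much simpler model. The sketch below runs exact EM on a 1-D mixture of Gaussians; it is a stand-in for illustration only, not the variational EM used for the layered model, and the function name `em_gaussian_mixture` and the quantile-based initialization are assumptions.

```python
import numpy as np

def em_gaussian_mixture(x, n_classes=2, n_iters=50):
    """Exact EM for a 1-D Gaussian mixture (toy stand-in for the loop above)."""
    x = np.asarray(x, dtype=float)
    # Deterministic initialization: means at evenly spaced quantiles of the data.
    mu = np.quantile(x, (np.arange(n_classes) + 0.5) / n_classes)
    var = np.full(n_classes, x.var() + 1e-6)
    pi = np.full(n_classes, 1.0 / n_classes)
    log_lik = []
    for _ in range(n_iters):
        # Infer hidden variables: posterior q over the class of each point.
        log_p = (np.log(pi) - 0.5 * np.log(2 * np.pi * var)
                 - 0.5 * (x[:, None] - mu) ** 2 / var)
        log_norm = np.logaddexp.reduce(log_p, axis=1)
        q = np.exp(log_p - log_norm[:, None])
        log_lik.append(log_norm.sum())
        # Optimize parameters, keeping q fixed.
        nk = q.sum(axis=0) + 1e-12
        mu = (q * x[:, None]).sum(axis=0) / nk
        var = (q * (x[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return mu, var, pi, np.array(log_lik)
```

The returned log-likelihood trace should be non-decreasing up to numerical tolerance, which is a convenient sanity check when adapting the loop to a richer model.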
15Flexible sprites
16Stabilization
17Walking back
18Moon-walking
19Video editing
20Video editing
21Video indexing: Six break points vs. six things in video
- Traditional video segmentation: find breakpoints
- Example: MovieMaker (cut and paste)
- Our goal: find possibly recurring scenes or objects
[Figure: timeline with labels 1, 3, 2, 4, 2, 1, 4, 3, 2, 3, 2, 3, 5, 6]
22Video clustering
Class index
Class mean (representative image)
Mean with added variability
Shift
Transformed (shifted image)
Transformed image with added non-uniform noise
Optimizing average or minimum frame likelihood
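The labels on this slide read as a generative chain: class index, class mean, added variability, shift, added noise. A minimal sampling sketch of that chain follows, assuming cyclic shifts, Gaussian variability, and a per-pixel noise variance; the function name `sample_frame` and all parameter names are hypothetical.

```python
import numpy as np

def sample_frame(means, variances, noise_var, class_probs, rng):
    """Sample one frame from a shift-invariant clustering model (a sketch).

    means, variances: arrays of shape (n_classes, H, W)
    noise_var: array of shape (H, W), per-pixel (non-uniform) noise variance
    class_probs: array of shape (n_classes,), summing to 1
    """
    n_classes, height, width = means.shape
    # Class index.
    c = rng.choice(n_classes, p=class_probs)
    # Class mean with added variability.
    z = means[c] + np.sqrt(variances[c]) * rng.standard_normal((height, width))
    # Shift, drawn uniformly; cyclic shifts are an assumption of this sketch.
    shift = (int(rng.integers(height)), int(rng.integers(width)))
    shifted = np.roll(z, shift, axis=(0, 1))
    # Transformed image with added non-uniform noise.
    x = shifted + np.sqrt(noise_var) * rng.standard_normal((height, width))
    return c, shift, x
```

Clustering a video then amounts to inverting this process: inferring the class and shift of each frame while learning the class means and variances.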
24Video indexing: Six break points vs. six things in video
A class is detected at multiple intervals on the timeline. For example, class 1 models a baby's face. Breakpoint-based methods miss it at its second occurrence. The class recurs in the rest of the sequence.
[Figure: timeline with labels 1, 3, 2, 4, 2, 1, 4, 3, 2, 3, 2, 3, 5, 6]
25Video indexing: Six break points vs. six things in video
One long shot contains a pan of the camera back and forth among three scenes (classes 2, 3, and 5).
[Figure: timeline with labels 1, 3, 2, 4, 2, 1, 4, 3, 2, 3, 2, 3, 5, 6]
26Video indexing: Six break points vs. six things in video
Two shots that are detected only because the camera was turned off and then back on with a slightly different vantage point are treated as a single scene class.
[Figure: timeline with labels 1, 3, 2, 4, 2, 1, 4, 3, 2, 3, 2, 3, 5, 6]
27Example: Clustering a 20-minute whale-watching sequence
28Learned scene classes
29A random interesting 20s video
30Adding other variables (see also www.research.microsoft.com/users/jojic/FlexibleSprites.htm)
- Subspace variables (for PCA-like models)
- Deformation fields
- Cluster variables
- Illumination
- Texture
- Time series model
- Context
- Rendering model
31Adding other modalities and/or sensors
[Figure: the structured probability model extended with audio. Labels: intrinsic appearance, illumination, mask, appearance, position, audio model, time delay, Mic 1, Mic 2, observed image, observed audio]
32 Speaker detection and tracking
33Audio-visual textures
34Challenges
- Computational complexity
- Achieving modularity in inference
- Generality at expense of optimality?
35Rewards
- Object-based media
  - Metadata, annotations
  - Automated search
  - Compression
  - Manipulability
- Structured probability models
  - Ease of development
  - Unified framework
  - Compatible with other reasoning engines
36A unified theory of natural signals
- Probabilistic formulation
  - flexibility in stability and coherence
  - unsupervised learning possible
- Structured probability models
  - Random variables: observed and hidden
  - Dependence models
  - Inference and learning engines
37Variational inference and learning
[Figure legend: Gaussian, Multinomial]
1. Generalized E step (variational inference): optimize $B_n$ with respect to $q(h_n)$, keeping the model fixed
2. Generalized M step: optimize $\sum_n B_n$ with respect to the model parameters, keeping $q(h_n)$ fixed
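The slide does not define the bound, so the following assumes $B_n$ is the usual variational lower bound on the log-likelihood of observation $x_n$, written here for discrete hidden variables $h_n$ (sums become integrals for continuous ones):

```latex
B_n = \sum_{h_n} q(h_n) \log \frac{p(x_n, h_n)}{q(h_n)}
    = \log p(x_n) - \mathrm{KL}\big( q(h_n) \,\|\, p(h_n \mid x_n) \big)
    \le \log p(x_n)
```

Under that assumption, the generalized E step tightens the bound by moving $q(h_n)$ toward the exact posterior, and the generalized M step raises the bound on the total log-likelihood $\sum_n \log p(x_n)$.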
38Use of FFTs in inference
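[Figure legend: Gaussian, Multinomial]
Optimizing terms of the form $\sum_T q(T)\,(x - Ts)^\top (x - Ts)$ requires $x^\top T s$ for all $T$, which is a correlation if the $T$ are shifts! In the FFT domain this is an element-wise product of the spectra $X$ and $S$ (with one of them conjugated).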
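A minimal NumPy sketch of this trick, assuming the transformations are cyclic 2-D shifts, so that T s corresponds to `np.roll(s, (dy, dx), axis=(0, 1))`. The function names are hypothetical, and the brute-force version is included only to check the FFT result.

```python
import numpy as np

def shift_correlations(x, s):
    """c[dy, dx] = sum(x * roll(s, (dy, dx))) for every cyclic shift, via FFT."""
    X = np.fft.fft2(x)
    S = np.fft.fft2(s)
    # Cross-correlation theorem: element-wise product with one spectrum conjugated.
    return np.fft.ifft2(X * np.conj(S)).real

def shift_sq_distances(x, s):
    """(x - Ts)^T (x - Ts) for every cyclic shift T.

    Uses |x - Ts|^2 = |x|^2 + |s|^2 - 2 x^T T s, since a cyclic shift
    preserves the norm of s.
    """
    return np.sum(x ** 2) + np.sum(s ** 2) - 2.0 * shift_correlations(x, s)

def brute_force_sq_distances(x, s):
    """Same quantity computed one shift at a time, for verification."""
    h, w = x.shape
    d = np.empty((h, w))
    for dy in range(h):
        for dx in range(w):
            d[dy, dx] = np.sum((x - np.roll(s, (dy, dx), axis=(0, 1))) ** 2)
    return d
```

The FFT route costs O(N log N) for an N-pixel image, versus O(N^2) for evaluating every shift separately.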
39Use of FFTs in learning
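[Figure legend: Gaussian, Multinomial]
Computing expectations of the form $\sum_T q(T)\, T^\top x$ reduces, when the $T$ are shifts, to an element-wise product of $Q$ and $X$ in the FFT domain (with one of them conjugated)!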
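A matching sketch for the learning-side expectation, again assuming cyclic 2-D shifts, so that the transpose of a shift by (dy, dx) is a shift by (-dy, -dx). Here `q` is an array holding the posterior over shifts, and the function names are hypothetical.

```python
import numpy as np

def expected_unshifted(q, x):
    """sum over shifts of q[dy, dx] * roll(x, (-dy, -dx)), computed via FFT."""
    Q = np.fft.fft2(q)
    X = np.fft.fft2(x)
    # Correlation of q with x: conjugate the spectrum of q.
    return np.fft.ifft2(np.conj(Q) * X).real

def brute_force_expected_unshifted(q, x):
    """Same expectation accumulated one shift at a time, for verification."""
    h, w = x.shape
    out = np.zeros((h, w))
    for dy in range(h):
        for dx in range(w):
            out += q[dy, dx] * np.roll(x, (-dy, -dx), axis=(0, 1))
    return out
```

When `q` puts all its mass on a single shift, the result is simply the observed image shifted back into the coordinate frame of the model.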
40Media is multidisciplinary
- Image processing
  - Filtering, compression, fingerprinting, hashing, scene cut detection
- Telecommunications
  - Encryption, transmission, error correction
- Computer vision
  - Motion estimation, structure from motion, motion/object recognition, feature extraction
- Computer graphics
  - Rendering, mixing natural and synthetic, art
- Signal processing
  - Speech recognition, speaker detection/tracking, source separation, audio encoding, fingerprinting
41Lack of a new unifying theory
- The old general theory of signal decomposition lacked
  - Semantics in the representation (objects, motion patterns, illumination conditions, ...)
  - Notion of unknown and hidden cases
- Narrow application-dependent frameworks
  - Structure from motion
  - Video segmentation and indexing
  - Face recognition
  - HMMs for speech recognition