Title: Learning the Appearance and Motion of People in Video
1. Learning the Appearance and Motion of People in Video
- Hedvig Sidenbladh, KTH
- hedvig_at_nada.kth.se, www.nada.kth.se/hedvig/
- Michael Black, Brown University
- black_at_cs.brown.edu, www.cs.brown.edu/people/black/
2. Collaborators
- David Fleet, Xerox PARC
- fleet_at_parc.xerox.com
- Dirk Ormoneit, Stanford University
- ormoneit_at_stat.stanford.edu
- Jan-Olof Eklundh, KTH
- joe_at_nada.kth.se
3. Goal
- Tracking and reconstruction of human motion in 3D
- Articulated 3D model
- Monocular sequence
- Pinhole camera model
- Unknown, cluttered environment
4. Why is it Important?
- Human-machine interaction
- Robots
- Intelligent rooms
- Video search
- Animation, motion capture
- Surveillance
5. Why is it Hard?
6. Why is it Hard?
- People move fast and non-linearly
- 3D to 2D projection ambiguities
- Large occlusion
- Similar appearance of different limbs
- Large search space
Extreme case
7. Bayesian Inference
- Exploit cues in the images. Learn likelihood models p(image cue | model).
- Build models of human form and motion. Learn priors over model parameters p(model).
- Represent the posterior distribution p(model | cue) ∝ p(cue | model) p(model).
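As a rough numerical illustration of this relation (a minimal sketch, not the paper's implementation), the snippet below weights samples drawn from a Gaussian prior over a single hypothetical pose parameter by a Gaussian stand-in for the image likelihood, giving a sampled representation of the posterior.

```python
import numpy as np

# Stand-in likelihood p(cue | model): a Gaussian around the observed cue value.
def likelihood(cue, theta, std=0.5):
    return np.exp(-0.5 * ((cue - theta) / std) ** 2)

# Sampled representation of the prior p(model): draws from a Gaussian over a
# single hypothetical pose parameter (e.g. one joint angle).
samples = np.random.normal(loc=0.0, scale=1.0, size=1000)

# Posterior p(model | cue) ~ p(cue | model) p(model): because the samples are
# drawn from the prior, weighting each sample by its likelihood and normalizing
# yields a sampled representation of the posterior.
cue = 0.8
weights = likelihood(cue, samples)
weights /= weights.sum()

print("posterior mean estimate:", np.sum(weights * samples))
```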
8. Human Model
- Limbs are truncated cones in 3D
- Pose determined by the parameter vector φ
9. Bregler and Malik 98
State of the Art.
- Brightness constancy cue
- Insensitive to appearance
- Full-body tracking required multiple cameras
- Single hypothesis
10. Brightness Constancy
I(x, t+1) = I(x+u, t) + η
- Image motion of the foreground as a function of the 3D motion of the body.
- Problem: no fixed model of appearance (drift).
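A minimal sketch of this constraint, assuming a synthetic image pair and an integer-shift warp (np.roll) as a crude stand-in for the model-predicted image motion u; the residual plays the role of the noise term η.

```python
import numpy as np

# Brightness constancy residual, assuming a predicted per-pixel displacement u
# (here a constant integer shift for simplicity).
def brightness_constancy_residual(I_t, I_t1, u):
    """Residual eta = I(x, t+1) - I(x+u, t) for an integer shift u=(rows, cols)."""
    warped = np.roll(I_t, shift=u, axis=(0, 1))   # crude stand-in for warping I(x+u, t)
    return I_t1 - warped

rng = np.random.default_rng(0)
I_t = rng.random((48, 64))
true_shift = (0, 2)
I_t1 = np.roll(I_t, shift=true_shift, axis=(0, 1)) + 0.01 * rng.standard_normal((48, 64))

# The residual is small only when the hypothesized motion matches the true one.
for u in [(0, 0), (0, 2)]:
    eta = brightness_constancy_residual(I_t, I_t1, u)
    print(u, "mean squared residual:", np.mean(eta ** 2))
```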
11. Cham and Rehg 99
State of the Art.
- Single camera, multiple hypotheses
- 2D templates (no drift, but view dependent)
I(x, t) = I(x+u, 0) + η
12. Multiple Hypotheses
- Posterior distribution over model parameters is often multi-modal (due to ambiguities)
- Represent the whole distribution
- sampled representation
- each sample is a pose
- predict over time using a particle filtering approach (see the sketch below)
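A minimal particle-filter sketch of these ideas, assuming a hypothetical one-dimensional pose state, a random-walk prediction, and a Gaussian stand-in for the learned image likelihood; the real tracker propagates samples over the full pose vector.

```python
import numpy as np

rng = np.random.default_rng(1)
n_particles = 500
particles = rng.normal(0.0, 1.0, n_particles)      # initial pose hypotheses

def predict(particles, noise_std=0.1):
    # Temporal prior: diffuse each hypothesis (a simple random-walk model here).
    return particles + rng.normal(0.0, noise_std, particles.shape)

def weight(particles, observation, std=0.3):
    # Stand-in for the learned likelihood p(image cue | pose).
    w = np.exp(-0.5 * ((observation - particles) / std) ** 2)
    return w / w.sum()

observations = [0.2, 0.35, 0.5]                     # hypothetical cue values over time
for z in observations:
    particles = predict(particles)
    w = weight(particles, z)
    # Resample: keep a sampled, possibly multi-modal representation of the posterior.
    particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    print("posterior mean:", particles.mean())
```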
13. Deutscher, North, Bascle, Blake 00
State of the Art.
- Multiple hypotheses
- Multiple cameras
- Simplified clothing, lighting and background
14. Sidenbladh, Black, Fleet 00
State of the Art.
- Multiple hypotheses
- Monocular
- Brightness constancy
- Activity specific prior
- Under significant changes in view and depth, template-based methods will fail
15. How to Address the Problems
- Bayesian formulation
- p(model | cue) ∝ p(cue | model) p(model)
16. What do people look like?
Changing background
Varying shadows
Occlusion
Deforming clothing
Low contrast limb boundaries
What do non-people look like?
17. Edge Detection?
- Probabilistic model?
- Under/over-segmentation, thresholds, ...
18. Key Idea 1
- Use the 3D model to predict the location of limb boundaries in the scene.
- Compute various filter responses steered to the predicted orientation of the limb (sketched below).
- Compute the likelihood of the filter responses using a statistical model learned from examples.
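A sketch of the steering step, assuming a plain image-gradient operator in place of the paper's filters: the gradient is projected onto the direction normal to a predicted limb orientation, so the response is strong only when the steering matches the true boundary.

```python
import numpy as np

def steered_edge_response(image, theta):
    """Directional derivative across a limb boundary of orientation theta (radians)."""
    gy, gx = np.gradient(image.astype(float))
    # Project the gradient onto the normal (-sin(theta), cos(theta)) of the limb axis.
    return -np.sin(theta) * gx + np.cos(theta) * gy

rng = np.random.default_rng(2)
img = rng.random((64, 64))
img[:, 32:] += 1.0                 # a vertical step edge, a crude "limb boundary"

resp_aligned = steered_edge_response(img, theta=np.pi / 2)   # steering matches the edge
resp_wrong = steered_edge_response(img, theta=0.0)           # steering 90 degrees off
print("aligned response energy:   ", np.abs(resp_aligned).mean())
print("misaligned response energy:", np.abs(resp_wrong).mean())
```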
19. Key Idea 2
Explain the entire image:
p(image | foreground, background)
Generic, unknown background
Foreground person
20. Key Idea 2
p(image | foreground, background) ∝
p(foreground part of image | foreground) / p(foreground part of image | background)
- Do not look in parts of the image considered background
Foreground part of image
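A sketch of this ratio, assuming hypothetical Gaussian response models for "person" and "background" pixels and a binary mask of the pixels the 3D model projects to; only foreground pixels enter the sum, so background regions are never examined.

```python
import numpy as np

def log_gauss(x, mean, std):
    return -0.5 * ((x - mean) / std) ** 2 - np.log(std * np.sqrt(2 * np.pi))

def log_likelihood_ratio(responses, fg_mask, fg=(1.0, 0.5), bg=(0.0, 0.5)):
    """Sum over foreground pixels of log p(response | fg) - log p(response | bg)."""
    r = responses[fg_mask]
    return np.sum(log_gauss(r, *fg) - log_gauss(r, *bg))

rng = np.random.default_rng(3)
responses = rng.normal(0.0, 0.5, (48, 48))                  # background-like everywhere...
responses[10:30, 20:28] = rng.normal(1.0, 0.5, (20, 8))     # ...except on a "limb"

good_mask = np.zeros((48, 48), bool); good_mask[10:30, 20:28] = True   # correct pose
bad_mask = np.zeros((48, 48), bool);  bad_mask[10:30, 5:13] = True     # wrong pose
print("correct pose:", log_likelihood_ratio(responses, good_mask))
print("wrong pose:  ", log_likelihood_ratio(responses, bad_mask))
```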
21. Training Data
Points on limbs
Points on background
22. Edge Distributions
Edge response steered to model edge
Similar to Konishi et al., CVPR 99
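A sketch of how such distributions can be learned and used, assuming synthetic on-limb and background responses in place of the labelled training points; the normalized histograms give a log-likelihood-ratio lookup table over response values.

```python
import numpy as np

rng = np.random.default_rng(4)
on_limb = np.abs(rng.normal(0.8, 0.3, 5000))      # steered responses at limb boundaries
on_bg = np.abs(rng.normal(0.0, 0.3, 5000))        # responses at background points

bins = np.linspace(0.0, 2.0, 41)
p_on, _ = np.histogram(on_limb, bins=bins, density=True)
p_off, _ = np.histogram(on_bg, bins=bins, density=True)
eps = 1e-6
log_ratio = np.log(p_on + eps) - np.log(p_off + eps)   # lookup table over response value

def edge_log_likelihood_ratio(response):
    """Look up log p(response | limb edge) / p(response | background)."""
    idx = np.clip(np.digitize(response, bins) - 1, 0, len(log_ratio) - 1)
    return log_ratio[idx]

print(edge_log_likelihood_ratio(np.array([0.05, 0.9])))   # weak vs. strong edge response
```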
23. Edge Likelihood Ratio
24. Ridge Distributions
Ridge response steered to limb orientation
Ridge response only on certain image scales!
25. Ridge Likelihood Ratio
26. Motion Training Data
Motion response: I(x, t+1) - I(x+u, t)
- Motion response = temporal brightness change given the model of motion
- i.e. the noise term η in the brightness constancy assumption
27. Motion Distributions
Different underlying motion models
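A sketch of the motion cue as a likelihood, assuming an integer-shift warp and a single Gaussian noise model standing in for the learned motion-response distributions; the pose hypothesis whose predicted motion matches the data scores highest.

```python
import numpy as np

def motion_log_likelihood(I_t, I_t1, u, noise_std=0.05):
    # Temporal difference I(x, t+1) - I(x+u, t) under the hypothesized motion u,
    # scored under a Gaussian noise model (stand-in for the learned distributions).
    warped = np.roll(I_t, shift=u, axis=(0, 1))     # crude warp by integer shift u
    residual = I_t1 - warped                        # the "motion response" per pixel
    return np.sum(-0.5 * (residual / noise_std) ** 2)

rng = np.random.default_rng(5)
I_t = rng.random((32, 32))
I_t1 = np.roll(I_t, shift=(1, 0), axis=(0, 1)) + 0.02 * rng.standard_normal((32, 32))

for u in [(1, 0), (0, 0), (0, 1)]:    # candidate motions from different pose hypotheses
    print(u, motion_log_likelihood(I_t, I_t1, u))
```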
28. Fg, Bg Likelihood
29. Likelihood Formulation
- Independence assumptions
- Cues: p(image | model) = p(cue1 | model) p(cue2 | model)
- Spatial: p(image | model) = ∏_{x ∈ image} p(image(x) | model)
- Scales: p(image | model) = ∏_{σ ∈ {σ1, ...}} p(image(σ) | model)
- Combines cues and scales!
- Simplification; in reality there are dependencies
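A sketch of what these assumptions buy in practice: with independence across cues, pixels and scales, the total log-likelihood is a plain sum of per-cue, per-scale log-ratio values over the foreground pixels (the maps below are random placeholders, not learned models).

```python
import numpy as np

rng = np.random.default_rng(6)
scales = [1, 2, 4]
cues = ["edge", "ridge", "motion"]

def total_log_likelihood(log_ratio_maps, fg_mask):
    """Sum log-likelihood ratios over cues, scales, and foreground pixels."""
    total = 0.0
    for cue in cues:
        for s in scales:
            total += np.sum(log_ratio_maps[(cue, s)][fg_mask])
    return total

# Placeholder per-pixel log-ratio maps, one per (cue, scale) pair.
maps = {(c, s): rng.normal(0.0, 1.0, (32, 32)) for c in cues for s in scales}
mask = np.zeros((32, 32), bool); mask[8:24, 12:20] = True
print("combined log-likelihood:", total_log_likelihood(maps, mask))
```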
30. Likelihood
Foreground pixels
Background pixels
31. Step One Discussed
- Bayesian formulation
- p(model | cue) ∝ p(cue | model) p(model)
32. Models of Human Dynamics
- Models of dynamics are used to propagate the sampled distribution in time
- Constant velocity model
- All DOF in the model parameter space, φ, are independent
- Angles are assumed to change with constant speed
- Speed and position changes are randomly sampled from a normal distribution (see the sketch below)
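A sketch of the constant-velocity prior, assuming a hypothetical number of joint-angle DOF and arbitrary Gaussian noise levels; each particle's angles and angular speeds are diffused independently when the sampled distribution is propagated to the next frame.

```python
import numpy as np

rng = np.random.default_rng(7)

def propagate(angles, velocities, pos_std=0.02, vel_std=0.01):
    # Constant-velocity model: keep each angle's speed, add independent Gaussian
    # noise to both speed and position.
    velocities = velocities + rng.normal(0.0, vel_std, velocities.shape)
    angles = angles + velocities + rng.normal(0.0, pos_std, angles.shape)
    return angles, velocities

n_particles, n_dof = 1000, 25            # e.g. 25 joint-angle DOF per pose hypothesis
angles = rng.normal(0.0, 0.1, (n_particles, n_dof))
velocities = np.zeros((n_particles, n_dof))
angles, velocities = propagate(angles, velocities)
print("propagated angle std for the first DOFs:", angles.std(axis=0)[:3])
```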
33. Models of Human Dynamics
- Action-specific model: walking
- Training data: 3D motion capture data
- From the training set, learn the mean cycle and common modes of deviation (PCA)
(Figure panels: mean cycle, small noise, large noise)
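A sketch of how such a walking prior can be built, assuming synthetic time-normalized joint-angle cycles in place of real motion capture: stack the cycles, take the mean, keep the leading PCA modes of deviation, and sample new cycles as the mean plus a small combination of the modes.

```python
import numpy as np

rng = np.random.default_rng(8)
n_cycles, n_phases, n_dof = 20, 50, 25
t = np.linspace(0, 2 * np.pi, n_phases)
# Synthetic "mocap" cycles: a common sinusoidal pattern plus per-cycle variation.
cycles = np.stack([np.sin(t)[:, None] * rng.normal(1.0, 0.1, n_dof)
                   + 0.05 * rng.standard_normal((n_phases, n_dof))
                   for _ in range(n_cycles)])                 # (cycles, phases, DOF)

flat = cycles.reshape(n_cycles, -1)                           # one row per training cycle
mean_cycle = flat.mean(axis=0)
U, S, Vt = np.linalg.svd(flat - mean_cycle, full_matrices=False)
modes = Vt[:5]                                                # leading modes of deviation

# Sample a new walking cycle: the mean plus a random combination of the modes.
coeffs = rng.normal(0.0, 1.0, 5) * (S[:5] / np.sqrt(n_cycles))
sample_cycle = (mean_cycle + coeffs @ modes).reshape(n_phases, n_dof)
print("sampled cycle shape:", sample_cycle.shape)
```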
34. Step Two Also Discussed
- Bayesian formulation
- p(model | cue) ∝ p(cue | model) p(model)
35. Particle Filter
- Problem: expensive representation of the posterior!
- Approaches to solve the problem:
- Lower the number of samples (Deutscher et al., CVPR00)
- Represent the space in other ways (Choo and Fleet, ICCV01)
36. Tracking an Arm
1500 samples, 2 min/frame
Moving camera, constant velocity model
37. Self Occlusion
1500 samples, 2 min/frame
Constant velocity model
38. Walking Person
Samples reduced from 15000 to 2500 by using the learned likelihood
2500 samples, 10 min/frame
Walking model
39. Ongoing and Future Work
- Learned dynamics
- Correlation across scale
- Estimate background motion
- Statistical models of color and texture
- Automatic initialization
40. Lessons Learned
- Probabilistic (Bayesian) framework allows
- Integration of information in a principled way
- Modeling of priors
- Particle filtering allows
- Multi-modal distributions
- Tracking with ambiguities and non-linear models
- Learning image statistics and combining cues
improves robustness and reduces computation
41. Conclusions
- Generic, learned model of appearance
- Combines multiple cues
- Exploits work on image statistics
- Uses the 3D model to predict features
- Models foreground and background
- Exploits the ratio between foreground and background likelihoods
- Improves tracking
42. Other Related Work
J. Sullivan, A. Blake, M. Isard, and J. MacCormick. Object localization by Bayesian correlation. ICCV99.
J. Sullivan, A. Blake, and J. Rittscher. Statistical foreground modelling for object localisation. ECCV00.
J. Rittscher, J. Kato, S. Joga, and A. Blake. A probabilistic background model for tracking. ECCV00.
S. Wachter and H. Nagel. Tracking of persons in monocular image sequences. CVIU, 74(3), 1999.