Title: Detecting Pedestrians Using Patterns of Motion and Appearance Viola
1 Detecting Pedestrians Using Patterns of Motion
and Appearance (Viola, Jones)
2 Closely Related Work
- P. Viola, M. Jones - Robust Real-time Object
Detection, Workshop on Statistical and
Computational Theories of Vision, July 2001
- P. Viola, M. Jones - Rapid Object Detection
Using a Boosted Cascade of Simple Features,
CVPR, 2001
- P. Viola, M. Jones - Robust Real-Time Face
Detection, IJCV, 2003
- P. Viola, M. Jones, D. Snow - Detecting
Pedestrians Using Patterns of Motion and
Appearance, ICCV 2003
3 The Goals
- Development of a representation of image motion
which is extremely efficient.
- Implementation of a state-of-the-art pedestrian
detection system which operates on low-res images
under difficult conditions.
The Approach
- Find extremely basic features of the images that
can be computed very quickly (in real time).
- Get a huge set of features, and then use machine
learning techniques (AdaBoost) to find the best
distinguishing features.
4 The Features
- First, 5 images are created from the original 2
(I_t, I_{t+1}) to represent motion.
- Δ, U, D, L, R are made by shifting I_t 1 pixel in the
corresponding direction (e.g. U means Up; Δ means
no shift, so it is the temporal gradient) and taking
the absolute difference with I_{t+1}.
- These images represent crude gradients in motion.
The image shifted in the direction of motion lines
up better with I_{t+1}, so its pixel sum will be
smaller than the sums of the others.
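The construction of the five motion images can be sketched in a few lines of NumPy. This is a minimal sketch, not the authors' implementation: the names `shift` and `motion_images`, the key `delta` for the no-shift image, the edge padding at the image border, and the toy 8x8 frames are all assumptions made for illustration.

```python
import numpy as np

def shift(img, dy, dx):
    """Move image content by (dy, dx) pixels; borders are edge-padded (an assumption)."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # out[y, x] == img[y - dy, x - dx], clamped at the border
    return padded[1 - dy : 1 - dy + h, 1 - dx : 1 - dx + w]

def motion_images(frame_t, frame_t1):
    """Absolute differences between the (shifted) first frame and the second frame."""
    a = frame_t.astype(np.int32)   # avoid uint8 wrap-around on subtraction
    b = frame_t1.astype(np.int32)
    return {
        "delta": np.abs(a - b),            # no shift: temporal gradient
        "U": np.abs(shift(a, -1, 0) - b),  # content moved up one pixel
        "D": np.abs(shift(a, 1, 0) - b),   # content moved down one pixel
        "L": np.abs(shift(a, 0, -1) - b),  # content moved left one pixel
        "R": np.abs(shift(a, 0, 1) - b),   # content moved right one pixel
    }

# Toy example: a 2x2 bright blob that moves up by one pixel between frames.
frame_t = np.zeros((8, 8), dtype=np.uint8)
frame_t[4:6, 3:5] = 200
frame_t1 = np.zeros((8, 8), dtype=np.uint8)
frame_t1[3:5, 3:5] = 200

sums = {name: int(img.sum()) for name, img in motion_images(frame_t, frame_t1).items()}
print(sums)
print("smallest sum:", min(sums, key=sums.get))
```

In this toy case the U image has the smallest sum, because shifting the first frame up lines the blob up with its position in the second frame.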
5 The Features
- A feature is a thresholded filter, f_i:
- F_i = α if f_i(I_t, Δ, U, D, L, R) > t_i
- F_i = β otherwise
- for some constants α, β, t_i.
- There are essentially 3 types of filters:
- 1. f_i = r_i(S)
- 2. f_i = abs(r_i(Δ) - r_i(S))
- 3. f_j = φ_j(S)
- φ_m represents a sum of pixels over a rectangular
filter m.
- S is one of I_t, Δ, U, D, L or R.
- r_i(S) is a sum of pixel values over a box region
of image S.
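Box sums like r_i(S) are what make these features cheap. The sketch below assumes the integral-image trick from the related Viola and Jones object detection work (the slides here only say the features must be fast to compute). The names `integral_image`, `box_sum`, `filter_type1`, `filter_type2` and `threshold_feature` are hypothetical, as is the `(top, left, height, width)` box convention.

```python
import numpy as np

def integral_image(img):
    """ii[y, x] holds the sum of img[:y, :x]; the extra zero row/column simplifies lookups."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = img.cumsum(axis=0).cumsum(axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of pixel values in a box, using four lookups regardless of box size."""
    bottom, right = top + height, left + width
    return int(ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left])

def filter_type1(ii_s, box):
    """Type 1: box sum over one image S."""
    return box_sum(ii_s, *box)

def filter_type2(ii_delta, ii_s, box):
    """Type 2: absolute difference between box sums on the no-shift image and on S."""
    return abs(box_sum(ii_delta, *box) - box_sum(ii_s, *box))

def threshold_feature(value, t, alpha, beta):
    """Turn a filter response into a feature: alpha above the threshold, beta otherwise."""
    return alpha if value > t else beta

img = np.arange(16).reshape(4, 4)
ii = integral_image(img)
print(box_sum(ii, 1, 1, 2, 2), img[1:3, 1:3].sum())  # both sums should agree
```

The integral image is computed once per image, after which every box sum costs the same four lookups.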
6 Examples
[Figure: example frames I_t and I_{t+1}]
7 Representing Motion (Examples)
- Compute U, D, L, R by shifting image I_t over 1
pixel and taking the absolute difference with
I_{t+1}. Δ is computed as just abs(I_t - I_{t+1}).
D has a sum of 121,020; U has a sum of 62,126. So
motion is in the upward direction.
[Figure: motion images D and U]
8 Filter Type 1
- f_i = r_i(S), where S is any of I_t, Δ, U, D, L, R.
- r_i(S) is the sum of
pixel values over a box region.
[Figure: box region on motion image L]
9 Filter Type 2
[Figure: box regions on motion images Δ and U]
S is any of U, D, L, R. r_i(S) is the sum of
pixel values over a box region.
10 Rectangular Features (Filter Type 3)
- f_i = φ_i(S), where φ_i represents a rectangular filter.
- The filter response is the total difference in
pixel values between the dark and light parts of
the rectangle.
Difference: 224
Difference: 6,683
Difference: 5,476
If we set the threshold to 300, this filter can
recognize the symmetry between eyes.
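A rectangle filter of this kind can be sketched as the difference between the pixel sums of two adjacent halves of a box. This is only an illustration: the slides show the filters as figures, so the left/right split, the function name `two_rect_filter` and the toy patches are assumptions.

```python
import numpy as np

def two_rect_filter(img, top, left, height, width):
    """Sum over the left half of a box minus the sum over its right half."""
    half = width // 2
    region = img[top : top + height, left : left + 2 * half].astype(np.int64)
    return int(region[:, :half].sum() - region[:, half:].sum())

# A patch whose two halves look alike gives a small response ...
symmetric = np.full((4, 4), 10)
# ... while a patch with one bright half gives a large one.
lopsided = np.zeros((4, 4), dtype=np.int64)
lopsided[:, :2] = 10

print(abs(two_rect_filter(symmetric, 0, 0, 4, 4)))
print(abs(two_rect_filter(lopsided, 0, 0, 4, 4)))
```

Thresholding the magnitude of the response then separates patches whose two halves are similar from patches whose halves differ.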
11 Classifier
- A classifier is a thresholded sum of features:
- C(I_t, I_{t+1}) = 1 iff Σ_i F_i(I_t, Δ, U, D, L, R) > T.
- A feature is a thresholded filter:
- F_i = α if f_i(I_t, Δ, U, D, L, R) > t_i
- F_i = β otherwise
- This gives us 4 parameters to select (α, β, t_i,
T) in addition to choosing what subset of filters
to use.
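The classifier rule can be written out directly. In this minimal sketch the filter responses are assumed to be already computed, and the names `feature` and `classify` as well as the example numbers are hypothetical.

```python
def feature(value, threshold, alpha, beta):
    """Thresholded filter: alpha if the response exceeds its threshold, else beta."""
    return alpha if value > threshold else beta

def classify(filter_values, params, T):
    """Return 1 iff the sum of the thresholded features exceeds T.

    params holds one (threshold, alpha, beta) triple per filter.
    """
    score = sum(feature(v, t, a, b) for v, (t, a, b) in zip(filter_values, params))
    return 1 if score > T else 0

# Two filters, each with its own (threshold, alpha, beta).
params = [(3, 1.0, -1.0), (2, 0.5, -0.5)]
print(classify([5, 1], params, T=0.0))  # score = 1.0 - 0.5
print(classify([1, 1], params, T=0.0))  # score = -1.0 - 0.5
```

Note that the threshold and the two output values are chosen per feature, while T is a single value for the whole classifier.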
12 AdaBoost
- 1990 - The Strength of Weak Learnability
(Schapire)
- 1997 - Generalized version of AdaBoost (Schapire,
Singer)
- AdaBoost is an algorithm for constructing a
strong classifier as a linear combination
of simple weak classifiers h_t(x).
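To make the "linear combination of weak classifiers" concrete, here is a minimal sketch of the basic discrete AdaBoost algorithm with one-feature threshold stumps as weak classifiers. It is a simplified stand-in, not the variant used in the paper, where each feature outputs real-valued constants. The function names, the exhaustive stump search and the toy data are assumptions.

```python
import numpy as np

def train_adaboost(X, y, rounds):
    """X: (n_samples, n_features) filter responses; y: labels in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)  # start with uniform example weights
    stumps = []
    for _ in range(rounds):
        best = None
        # Pick the stump (feature, threshold, polarity) with the lowest weighted error.
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for sign in (1, -1):
                    pred = np.where(X[:, j] > thr, sign, -sign)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, sign, pred)
        err, j, thr, sign, pred = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * np.log((1 - err) / err)  # weight of this weak classifier
        w *= np.exp(-alpha * y * pred)         # emphasize misclassified examples
        w /= w.sum()
        stumps.append((alpha, j, thr, sign))
    return stumps

def predict(stumps, X):
    """Sign of the weighted vote of all stumps."""
    score = np.zeros(X.shape[0])
    for alpha, j, thr, sign in stumps:
        score += alpha * np.where(X[:, j] > thr, sign, -sign)
    return np.where(score > 0, 1, -1)

X = np.array([[1, 5], [2, 3], [3, 8], [4, 1]])
y = np.array([-1, -1, 1, 1])
stumps = train_adaboost(X, y, rounds=3)
print(predict(stumps, X))
```

Because each round selects a single feature, running AdaBoost for a fixed number of rounds also acts as feature selection over the large filter set.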
13 Cascaded Classifier
- Using all the features in the classifier would
take too long.
- Instead, a cascade of classifiers was used, where
each subsequent level of the cascade contains
more features.
- This way, image patches that are very different
from actual pedestrians can be thrown out using
only a few features.
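The early-rejection idea can be sketched as a loop over stages that stops at the first failure. The stage representation (a scoring function plus a threshold) and the toy stages below are hypothetical; in the real detector each stage would be a boosted classifier over more and more features.

```python
def cascade_classify(patch, stages):
    """stages: list of (score_fn, threshold) pairs, cheapest stage first."""
    for score_fn, threshold in stages:
        if score_fn(patch) <= threshold:
            return False  # rejected here; later, costlier stages never run
    return True  # survived every stage

calls = []  # records which stages actually ran

def cheap_stage(patch):
    calls.append("cheap")
    return patch

def costly_stage(patch):
    calls.append("costly")
    return patch * 2

stages = [(cheap_stage, 0), (costly_stage, 10)]
print(cascade_classify(-1, stages), calls)  # rejected by the first stage
print(cascade_classify(6, stages), calls)   # passes both stages
```

Most patches in a frame contain no pedestrian, so most of the work is done by the first few cheap stages.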
14 Experiments
- Train each classifier in the cascade using 2250
positive examples and 2250 false positives from
the previous stages of the cascade. (This lowers the
false positive rate at each stage.)
- Each stage is trained so that 99.5% of true
positives from the previous stage are kept while 10%
of false positives are eliminated (if this can't
be done, more features are added).
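The per-stage targets compound over the cascade. The arithmetic below reads the two figures as 99.5% of true positives kept and 10% of false positives eliminated per stage, and treats them as exact rates rather than bounds. The cascade depths used are hypothetical, since the slides do not give the number of stages.

```python
def cascade_rates(n_stages, keep_tp=0.995, keep_fp=0.90):
    """Return (fraction of pedestrians kept, fraction of false positives left)."""
    return keep_tp ** n_stages, keep_fp ** n_stages

for n in (10, 20, 30):  # hypothetical cascade depths
    tp, fp = cascade_rates(n)
    print(f"{n:2d} stages: keep {tp:.3f} of pedestrians, {fp:.4f} of false positives")
```

Since 10% is only the minimum a stage has to eliminate, the false positive fractions printed here are pessimistic upper values.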
15 Experiments
- Two detectors (dynamic and static).
- Dynamic: trained using 54,624 filters on the
original image I_t and the motion images Δ, U, D,
L, R.
- Static: trained using 24,328 filters on only the
original image I_t.
16 Results
- ROC curves for the classification (by adjusting
the number of features)
17 Results
- Correct detections - 80%
- False positives (the total number of false
positives / the total number of patches tested):
- 1/400,000 for the dynamic detector, which
corresponds to 1 false positive every 2 frames.
- 1/15,000 for the static detector, which
corresponds to 13 false positives per frame.
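The two ways of stating the false positive rate can be cross-checked with a little arithmetic: a per-patch rate and a per-frame rate together imply how many patches are scanned in each frame. The function name is hypothetical, and the result is an inference from the slide's numbers, not a figure reported on it.

```python
def patches_per_frame(patches_per_false_positive, false_positives_per_frame):
    """Patches scanned per frame implied by the two reported rates."""
    return patches_per_false_positive * false_positives_per_frame

dynamic = patches_per_frame(400_000, 0.5)  # 1 false positive every 2 frames
static = patches_per_frame(15_000, 13)     # 13 false positives per frame
print(dynamic, static)
```

Both detectors come out at roughly 200,000 patches per frame, so the two sets of numbers are consistent with each other.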
18 Results
[Figure: dynamic detector and static detector]
19 [Figure: dynamic detector and static detector]
20 Comments
- Using more complex features such as optical flow
would likely be more successful (but might make
things slower).
- Why not use basic background subtraction? It
would greatly reduce the number of pixels the
detector would have to search over.
21 Comments
- Using information about where pedestrians were in
previous frames would improve the detector and
help against occlusions, etc. (i.e. tracking).
- Is overfitting a problem? AdaBoost can succumb
to overfitting the training data (thus
generalizing badly) by picking too many features.
Here we have 2250 training examples and 54,624
features. Is 24.3 features per training example
not too much?