Title: Object Labelling from Human Action Recognition
1. Object Labelling from Human Action Recognition
1st IEEE Conference on Pervasive Computing and Communications, 2003, Fort Worth, Texas
Contributors: Patrick Peursum, Svetha Venkatesh, Geoff West, Hai Hung Bui (Presenter)
School of Computing, Curtin University of Technology, Perth, Western Australia
2. Introduction
- Aim: infer object identity from human action
- Indirectly recognise an object by detecting the signature human action needed to use the object
- Monitoring human activity in the home has certain problems and opportunities for this
  - ✓ Frequent and repeated human activity
  - ✓ Indoor scenes
  - ✓ Objects are often directly used, e.g. appliances
  - ✗ Wide-angle views, cluttered environment
  - ✗ Scene and object locations change over time
3. Objectives
- Evidence-based approach to labelling
  - Label objects in a scene based on repeated human interactions
  - Accumulation of evidence over time
- Flexible and robust to noise and errors
- Potentially adaptable to changes in the scene
- Independent of an object's physical structure
- Learn the location of chairs and floor areas
- Initial study into the potential of the approach
4. Related Work
- Traditional object recognition
  - Function-based variant (Stark and Bowyer, 1991)
  - Inherent difficulty in recognition using physical structure
- Human activity / action recognition
  - Focus is mainly on detecting anomalous activities (e.g. surveillance applications)
- Human-object interaction recognition
  - Work on the use of occlusion to estimate object positions and sizes (Grimson et al., 1998)
  - Top-down view of desk scenes, using hand movements for action recognition (Moore, 1999)
5. Method - Overview
1. Raw Video
2. Person Segmentation and Tracking
3. Activity Segmentation
4. Scene Labelling
6. Method - Person Segmentation
- Raw video
  - Four ceiling-mounted cameras, 25 fps
  - Monitor a single scene with overlapping fields of view
- Person segmentation and tracking (sketched below)
  - Gaussian mixture-model background subtraction (Stauffer et al., 2000) to find the person
  - Bounding box used to outline the person
  - Tracking via a Kalman filter on the box centroid
  - Views calibrated to a world coordinate system using Tsai's algorithm (Tsai, 1986)
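A minimal sketch of one view's segmentation and tracking, using OpenCV's Gaussian mixture-model background subtractor as a stand-in for the Stauffer-Grimson method; the video path and the filter parameters are hypothetical, and calibration to world coordinates is omitted.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("nw_view.avi")               # hypothetical camera file
bg = cv2.createBackgroundSubtractorMOG2(history=500, varThreshold=16)

# Kalman filter on the bounding-box centroid: state (x, y, dx, dy), measurement (x, y)
kf = cv2.KalmanFilter(4, 2)
kf.transitionMatrix = np.array([[1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0],
                                [0, 0, 0, 1]], np.float32)
kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                 [0, 1, 0, 0]], np.float32)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    mask = bg.apply(frame)                          # foreground mask
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        continue
    # Bounding box of the largest foreground blob, assumed to be the person
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    kf.predict()                                    # predict centroid motion
    kf.correct(np.array([[x + w / 2.0], [y + h / 2.0]], np.float32))
```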
7. Method - Activity Segmentation
- HMMs used to segment four activities
  - Walking, sitting down, seated, standing up
  - Sitting/Standing: strict left-right HMMs (10 states)
  - Walk/Seated: standard HMMs (5 and 3 states)
  - Walking → floor interaction
  - Others → chair interaction
- Training (sketched below)
  - 24 sequences of a person sitting into a chair
  - Each sequence manually segmented into the four different activities and used for HMM training
- Training features (from bounding box)
  - World height (mm), change in height/width, velocity
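A minimal sketch of training the four activity HMMs, assuming the hmmlearn library (which postdates the original work); the sequence variables in the usage comment are hypothetical, and EM fitting replaces whatever training regime the authors used.

```python
import numpy as np
from hmmlearn import hmm

def left_right_transmat(n_states):
    """Strict left-right transitions: each state may stay or advance by one."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = A[i, i + 1] = 0.5
    A[-1, -1] = 1.0
    return A

def train_activity_hmm(sequences, n_states, left_right=False):
    """sequences: list of (T_i, n_features) arrays of per-frame features
    (world height, change in height/width, velocity) for one activity."""
    X = np.vstack(sequences)
    lengths = [len(s) for s in sequences]
    if left_right:
        # Freeze the transition structure; EM fits only means/covariances.
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag",
                                init_params="mc", params="mc")
        model.startprob_ = np.eye(n_states)[0]       # always start in state 0
        model.transmat_ = left_right_transmat(n_states)
    else:
        model = hmm.GaussianHMM(n_components=n_states, covariance_type="diag")
    model.fit(X, lengths)
    return model

# models = {"sit": train_activity_hmm(sit_seqs, 10, left_right=True),
#           "stand": train_activity_hmm(stand_seqs, 10, left_right=True),
#           "walk": train_activity_hmm(walk_seqs, 5),
#           "seated": train_activity_hmm(seated_seqs, 3)}
```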
8. Method - Activity Segmentation (2)
- Activity segmentation window (sketched below)
  - Fixed-size moving window (30 frames)
  - Window moves forward one frame at a time
  - Frames within the window used to calculate log-likelihoods of all four HMMs
- Best HMM taken as the activity for the window
  - Best HMM must significantly outperform the other HMMs
  - Minimises short-lived false positives
  - Last activity re-instated if there is no significantly best HMM
- Voting between views to elect the activity
- Activity estimated to begin halfway through the window
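A minimal sketch of the 30-frame sliding window over one view: each window is scored under all four trained HMMs, and the best model is adopted only if it clearly beats the runner-up. The log-likelihood margin is a hypothetical stand-in for the paper's "significantly outperforms" test.

```python
import numpy as np

def segment_activities(features, models, win=30, margin=20.0):
    """features: (T, n_features) array of per-frame features;
    models: dict name -> fitted HMM with a .score(X) log-likelihood method.
    Returns {frame: activity}, with each label placed at its window midpoint."""
    labels = {}
    last = None
    for t in range(len(features) - win + 1):
        window = features[t:t + win]
        scores = sorted(((m.score(window), name) for name, m in models.items()),
                        reverse=True)
        (best_ll, best_name), (second_ll, _) = scores[0], scores[1]
        if best_ll - second_ll >= margin:
            last = best_name           # clear winner: adopt the new activity
        # otherwise the last activity is re-instated (last stays unchanged)
        labels[t + win // 2] = last    # activity begins halfway through window
    return labels

# Per-view label streams would then be combined by voting across the four views.
```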
9. Method - Scene Labelling
- Objects labelled according to activity
  - Labelled area depends on the activity / object
  - Sit: chairs labelled using the person's fitted ellipse
  - Walk: floors labelled using the lower 5% of the fitted ellipse
- Labels are weights that are updated via an exponential-forgetting function (sketched below):
  - w_L^{t+1}(x,y) = w_L^t(x,y) · (1 − α) + (α · δ)
  - δ = 1 if L is the detected object, 0 otherwise
  - w_L^t(x,y) is the weight of the Lth label (chair or floor) at time t, pixel (x,y)
  - α is the learning rate for label updating
  - δ controls which label is strengthened
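A minimal sketch of the exponential-forgetting update above, applied to the pixels of the activity's region (the fitted ellipse for sitting, its lower part for walking); the learning rate of 0.05 is a hypothetical value.

```python
import numpy as np

def update_label_weights(weights, detected, region_mask, alpha=0.05):
    """weights: dict label -> (H, W) float array of w_L values;
    detected: 'chair' or 'floor'; region_mask: (H, W) bool array of the
    pixels covered by the activity's region."""
    for label, w in weights.items():
        delta = 1.0 if label == detected else 0.0   # delta from the slide
        # w_L^{t+1} = w_L^t * (1 - alpha) + alpha * delta
        w[region_mask] = w[region_mask] * (1.0 - alpha) + alpha * delta
    return weights
```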
10. Method - Use of Occlusion
- Partial occlusions used to refine labelling (sketched below)
  - Person is occluded when walking behind a chair
  - Bounding box used to judge the occluded area
  - Can cause over-estimation of the occluded area
- Chair labels are erased in the unoccluded area
  - ...since occlusion is a strong indicator of the chair's bounds
  - Learning rate for chair labels in the area is retarded by a factor of 4
- Feeds occlusion evidence back into the labelling process
- Floor labelling is unaffected
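A minimal sketch of the occlusion feedback: within the walking person's bounding box, the part where the person remains hidden marks the chair's extent, so chair labels in the unoccluded remainder are erased and the chair learning rate in the area is retarded by a factor of 4. The function name and both masks are hypothetical.

```python
import numpy as np

def apply_occlusion_evidence(chair_weights, box_mask, occluded_mask, alpha):
    """chair_weights: (H, W) float array of chair label weights;
    box_mask: person's bounding box; occluded_mask: portion of the box where
    the person is hidden by the chair (both (H, W) bool arrays)."""
    unoccluded = box_mask & ~occluded_mask
    chair_weights[unoccluded] = 0.0   # occlusion bounds the chair: erase the rest
    return alpha / 4.0                # retarded chair learning rate in the area
```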
11. Experiments
- Three video sequences (2000 frames each)
- Four camera views per sequence
- Activity segmentation and label weighting on each view
- Strongest label for each pixel assigned as the pixel's label (sketched below)
- Threshold then applied to eliminate weak labels
- Labelling analysed by overlaying manually-defined edges
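A minimal sketch of the final per-pixel assignment: the strongest label wins each pixel, then labels whose weight falls below a threshold are eliminated; the 0.5 cutoff is a hypothetical value.

```python
import numpy as np

def assign_pixel_labels(weights, threshold=0.5):
    """weights: dict label -> (H, W) float array; returns an (H, W) array of
    indices into `names`, with -1 meaning unlabelled."""
    names = list(weights)
    stack = np.stack([weights[n] for n in names])   # (n_labels, H, W)
    best = stack.argmax(axis=0)                     # strongest label per pixel
    best[stack.max(axis=0) < threshold] = -1        # eliminate weak labels
    return best, names
```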
Figure 1. Sample labelling with chair and floor
edges
12. Camera Views
Figure 2. NW, NE, SW and SE Views of Lab
13. Demonstration Video
(Video: NW, NE, SW and SE views)
14. Results - Activity Segmentation
- Activity segmentation evaluation
  - Ground truth estimated manually, with an uncertainty of ±5 frames
Table 1. Error means and variances for activity
segmentation
15. Analysis - Activity Segmentation
- Sit / Walk segmentation
  - Highly accurate given the uncertainty of ±5 frames
  - Sit found late, Walk found early
  - Conservatively estimates the sitting action, which improves robustness
- Seated lost, Stand found far too early
  - The problems are related: the end of Sit is misinterpreted as the start of Stand
  - Can be solved with termination probabilities (Al-Ohali et al., 2002)
- Loss of Seated not critical
  - Later instances of sitting offset the loss of evidence
16. Results - Scene Labelling
- Labelling accuracy
  - Chair area includes the space between chair legs
  - "Other" covers all non-chair, non-floor pixels
- Chair precision of 49.07% seems quite poor
- Floor recall seems low, but this is misleading

Table 2. Confusion matrix for labelling (all image pixels)
17. Analysis - Scene Labelling
- Table ignores unseen pixels (i.e. "Other")
- Chair precision better, but still low
  - Not unexpected: use of the fitted ellipse causes over-labelling
  - Occlusion helps, but there were not many instances of occlusion
- Floor recall much higher (93.6%, up from 66.7%)
- Not all of the floor area was visited, hence many "Other" misclassifications

Table 3. Confusion matrix for labelling (labelled pixels only)
18. Conclusions and Future Work
- Action-based approach to object labelling
  - Advantage of evidence accumulation
  - Robust to noise: false positives have minimal impact
- No use of background image information
  - Accuracy would be improved by including image information (e.g. regions) as secondary evidence
- Must increase variation in objects and situations
- Will require addressing limitations, including:
  - Finer measurements of the human to separate subtler actions
  - More information on object labels (e.g. object position)
  - Experiments with shifting objects around
19. References
- Y. Al-Ohali, M. Cheriet and C. Suen. Introducing termination probabilities to HMM. ICPR 2002.
- W. Grimson, C. Stauffer, R. Romano and L. Lee. Using adaptive tracking to classify and monitor activities in a site. CVPR 1998, pages 22-29.
- D. Moore, I. Essa and M. Hayes. Exploiting Human Actions and Object Context for Recognition Tasks. ICCV 1999.
- L. Stark and K. Bowyer. Achieving generalized object recognition through reasoning about association of function to structure. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(10):1097-1104, October 1991.
- C. Stauffer and W. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):747-757, August 2000.
- R. Tsai. An efficient and accurate camera calibration technique for 3D machine vision. CVPR 1986, pages 364-374.