Title: S-Seer: A Selective Perception System for Multimodal Office Activity Recognition
1. S-Seer: A Selective Perception System for Multimodal Office Activity Recognition
- Nuria Oliver, Eric Horvitz
- Adaptive Systems and Interaction
- Microsoft Research
2. Overview of the Talk
- Background: the Seer system
- Value of information
- Selective perception policies
- Selective-Seer (S-Seer)
- Experiments and video
- Summary and future directions
3. Background and Motivation
- Research area: automatic recognition of human behavior from sensory observations
- Applications:
  - Multimodal human-computer interaction
  - Visual surveillance, office awareness, distributed teams
  - Accessibility, medical applications
4. Sensing in Multimodal Systems
- Multimodal sensing and reasoning in personal computing as central vs. peripheral
- Multimodal signal processing typically requires a large portion, if not nearly all, of the available computational resources
- Need for strategies to control the allocation of resources for perception in multimodal systems
- Design-time and/or real-time
5. Seer: Office Awareness System (ICMI 2002; CVIU 2004, to appear)
- Seer: prototype for performing real-time, multimodal, multi-scale office activity recognition
- Distinguishes among:
  - Phone conversation
  - Face-to-face conversation
  - Working on the computer
  - Presentation
  - Nobody around
  - Distant conversation
  - Other activities
6. Multimodal Inputs
- Vision: one static FireWire camera sampled at 30 fps
- Audio: two binaural mini-microphones (20-16000 Hz, SNR 58 dB), sampled at 44100 Hz
- Keyboard and mouse: history of the activity during the last 1, 5 and 60 seconds
7. HMMs for Behavior Recognition
[Figure: graphical model and state trellis for an example four-state HMM; the most likely path through the trellis over time is found with the Viterbi algorithm.]
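The most-likely-path (Viterbi) computation behind the trellis can be written compactly. A minimal sketch for a generic discrete HMM in NumPy (the four-state example on the slide is illustrative; all names here are ours, not Seer's code):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path for a discrete HMM.

    pi:  (N,) initial state probabilities
    A:   (N, N) transition matrix, A[s, l] = p(l | s)
    B:   (N, M) emission matrix, B[s, o] = p(o | s)
    obs: sequence of observation indices
    """
    T = len(obs)
    N = len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])   # log-prob of best path ending in each state
    psi = np.zeros((T, N), dtype=int)           # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)     # scores[s, l]: come from s, go to l
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # trace back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With sharply peaked emissions, the decoded path simply tracks the observations, which makes the trellis picture easy to check by hand.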
8. Several Limitations of HMMs for Multimodal Reasoning
- First-order Markov assumption doesn't address long-term dependencies and multiple time granularities
- Assumes single-process dynamics, but signals may be generated by multiple processes
- Context limited to a single state variable; representing multiple processes with a Cartesian-product HMM becomes intractable quickly
- Large parameter space, hence large data needs
- Empirical experience: the representation is sensitive to changes in the environment (lighting, background noise, etc.)
9. Seer Explored Layered HMMs (LHMMs)
- Goal: decompose the parameter space to reduce training and re-training requirements
- Approach: segment the problem into distinct layers that operate at different temporal granularities
- Consequence: data explained at different levels of temporal abstraction
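The layering idea can be sketched: a bank of low-level HMMs classifies short windows of raw features, and the winning labels become the coarser-grained observation sequence of the next layer. A minimal sketch, not Seer's implementation (the one-state toy models in the usage are invented):

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of obs under a discrete HMM (scaled forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

def layered_inference(bank, windows):
    """Layer 1: classify each short window with a bank of HMMs;
    the index of the winning model becomes one symbol of the
    layer-2 observation sequence (coarser time granularity)."""
    return [max(range(len(bank)),
                key=lambda j: forward_loglik(*bank[j], w))
            for w in windows]
```

A second-layer HMM would then run `forward_loglik` over the symbol sequence that `layered_inference` produces, so each layer is trained and re-trained independently.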
10. Seer: Multi-Scale Activity Recognition
[Figure: activities recognized at multiple temporal scales over time.]
12. Seer's Architecture
[Architecture diagram. Top layer: office activities (phone conversation, face-to-face conversation, working on the computer, presentation, nobody present, distant conversation). Middle layer: audio HMM classification results (ambient noise, human speech, music, keyboard, phone ring), video HMM classification results (one person present, one active person present, multiple people present, nobody present), sound localization, and keyboard/mouse activities. Bottom layer: feature vectors; audio features (PCA on LPC coefficients; energy; mean and variance of the fundamental frequency; zero-crossing rate) and video features (skin color probability, face density, foreground/background, motion density).]
13. Value of LHMMs for the Seer Task
- Comparison between traditional (Cartesian product) HMMs and LHMMs
- 60 minutes of office activity data (10 min/activity, 3 users)
- 50% of data for training and 50% for testing
- 6 office activities recognized:
  - Phone conversation
  - Face-to-face conversation
  - Working on the computer
  - Distant conversation
  - Presentation
  - Nobody around

              HMMs     LHMMs
Accuracy      72.7%    99.7%
# params      1360     670
14. HMM Inference in Seer
15. LHMM Inference in Seer
16. Selective Perception Policies (ICMI 2003)
- Seer performs well, but sensing consumes a large portion of the available CPU
- Seek to understand the value and cost of different sensors/analyses
- Define policies for dynamically selecting sensors/features:
  - Principled decision-theoretic approach: Expected Value of Information (EVI)
  - Heuristic approach: observational frequencies (rate-based perception)
  - Random: select features randomly, as a background case
17. Related Work
- Principles for guiding perception:
  - Expected value of information (EVI) as a core concept of decision analysis (Raiffa; Howard)
  - Value of information in probabilistic reasoning systems, and its use in sequential diagnosis (Gorry 79; Ben-Bassat 80; Horvitz et al. 89; Heckerman et al. 90)
  - Probability and utility to model the behavior of vision modules (Bolles, IJCAI 77) and to score plans of perceptual actions (Garvey 76); reliability indicators to fuse vision modules (Toyama & Horvitz 2000)
- Growing interest in applying decision theory in perceptual applications, e.g., active-vision search tasks (Rimey 93)
18. Policy 1: Expected Value of Information (EVI)
- Decision-theoretic principles to determine the value of observations
- EVI computed by considering the value of eliminating uncertainty about the state of the observational features under consideration
- Example: vision sensor (camera) features:
  - Motion density
  - Face density
  - Foreground density
  - Skin color density
- There are K = 16 possible combinations of these features, representing plausible sets of observations
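The K = 16 combinations are simply the power set of the four vision features; a one-liner makes the count concrete (feature names here are paraphrased from the slide):

```python
from itertools import combinations

features = ["motion_density", "face_density",
            "foreground_density", "skin_color_density"]

# All 2^4 = 16 subsets, from the empty set up to the full feature set
subsets = [combo for r in range(len(features) + 1)
           for combo in combinations(features, r)]
```

Each subset is one candidate "set of observations" whose value and cost the EVI policy weighs at every step.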
19. Subsets of Features: Example
[Lattice of the 16 feature subsets: from the empty set, through the singletons, pairs and triples of {skin color probability, motion density, face density, foreground/background}, up to the full set of all four features.]
20. Criticality of the Utility Model
- EVI guides sensing by considering the influence of observations on the expected utility of the system's actions
- Need to endow the system with a representation of the utility of actions in the world
- Assess the utilities u(a, s) as the value of asserting activity a when the real-world activity is s
- Maximum expected utility action: a* = argmax_a sum_s p(s | E) u(a, s)
21. Considering the Outcome of Making an Observation
- Expected value (EV) of observing (computing) features f:
  EV(f, E) = sum_k p(f = f_k | E) max_a sum_s p(s | f = f_k, E) u(a, s)
  where the f_k are the possible values of the set of observational features f, and E is the prior observational evidence
- Represents uncertainty about the values that the system will observe when evaluating f
- Consider the change in expected value given the current probability distribution:
  EVI(f, E) = EV(f, E) - max_a sum_s p(s | E) u(a, s)
22. Balancing Costs and Benefits
- The net expected value of information (NEVI) of feature combination f is
  NEVI(f, E) = EVI(f, E) - C(f)
- The cost C(f) is the cost assigned to the computational latency associated with sensing and computing feature combination f
- If the difference is positive, it is worth collecting the information, and therefore computing the feature combination
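The balance can be sketched numerically. A toy sketch of computing NEVI for one candidate feature combination (not the Seer implementation; the utility matrix, outcome distribution and posteriors in the usage below are invented):

```python
import numpy as np

def max_eu(belief, U):
    """Expected utility of the best action under a belief over states.
    U[a, s] is the utility of asserting activity a when the truth is s."""
    return (U @ belief).max()

def nevi(belief, U, outcome_probs, posteriors, cost):
    """Net expected value of observing a feature combination f.

    outcome_probs[k]: p(f = f_k | current evidence)
    posteriors[k]:    belief over states after observing value f_k
    cost:             cost of sensing/computing this combination
    """
    ev_after = sum(p * max_eu(post, U)
                   for p, post in zip(outcome_probs, posteriors))
    return ev_after - max_eu(belief, U) - cost
```

A perfectly informative feature that resolves a 50/50 belief under an identity utility matrix is worth 0.5; if it costs less than that, NEVI is positive and the feature is worth computing.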
23. Cost Models
- Distinct cost models:
  - Measure of total computation usage
  - Cost associated with latencies that users will experience
- Costs of computation can be context dependent
- Example: an expected cost model that takes into account the likelihood that the user will experience poor responsiveness, and frustration if it is encountered
24. Single- and Multi-Step Analyses
- Real-world applications of EVI typically employ a greedy approach, i.e., compute the next best observations at each step
- In our analysis, we extend typical EVI computations by reasoning about groups (combinations) of features
- We select the feature combination with the greatest EVI at each step
- Sequential diagnosis, or the hypothetico-deductive cycle
25. Hypothetico-Deductive Cycle
[Diagram of the selective perception analysis loop: a control module, holding the probability model for selective perception analysis, decides which set of features to compute and feeds them to the probabilistic module, whose output in turn informs the next control decision.]
26. EVI with HMMs
- Given that our probabilistic models are HMMs, the predictive term p(o_{t+1} | E) can be computed as
  p(o_{t+1} | E) = sum_s alpha_t(s) sum_l a_{sl} b_l(o_{t+1})
  where:
  - alpha_t(s) is the (normalized) forward variable at time t and state s
  - a_{sl} is the state transition probability of going from state s to state l
  - b_l(o) is the observation probability in state l
  - all of them for model mu
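The predictive term can be computed with two matrix products. A minimal NumPy sketch under the slide's definitions (normalized forward variable, transition matrix A, discrete emission matrix B; the function name is ours):

```python
import numpy as np

def predictive_obs_dist(alpha_t, A, B):
    """p(o_{t+1} | evidence so far) for one discrete HMM.

    alpha_t: (N,) normalized forward variable at time t
    A:       (N, N) transitions, A[s, l] = a_{sl}
    B:       (N, M) emissions,   B[l, o] = b_l(o)
    """
    state_next = alpha_t @ A      # p(state_{t+1} = l | E) = sum_s alpha_t(s) a_{sl}
    return state_next @ B         # sum_l p(l | E) b_l(o), for every bin o
```

The EVI policy evaluates this distribution for each HMM and candidate feature combination to anticipate what the next observation is likely to reveal.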
27. EVI in HMMs
- If we discretize the observation space, the expectation over feature values in the NEVI becomes a sum over discrete observation bins rather than an integral
- In SEER we discretize the observation space into M bins, with M typically 10
- The computational overhead of carrying out EVI in the discrete case is O(M^F * N^2 * J), where:
  - M is the maximum cardinality of the features,
  - F is the number of feature combinations,
  - N is the maximum number of states in the HMMs, and
  - J is the number of HMMs
28. Policy 2: Heuristic Rate-Based Perception
- For comparison, we consider selective perception policies based on defining observational frequencies and duty cycles for each feature
[Timing diagram: each channel (audio classification, video classification, sound localization, keyboard/mouse) is switched ON and OFF over time according to its own period and duty cycle.]
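Rate-based perception reduces to a periodic ON/OFF gate per feature. A minimal sketch (the periods and duty cycles in the example schedule are invented, not Seer's settings):

```python
def is_active(t, period, duty_cycle, phase=0.0):
    """Rate-based perception gate: the feature is computed only during
    the ON fraction (duty_cycle) of each period."""
    return ((t + phase) % period) < duty_cycle * period

# Hypothetical schedule: (period in seconds, duty cycle)
schedule = {"audio": (1.0, 0.5), "video": (2.0, 0.25),
            "sound_loc": (4.0, 0.1), "kb_mouse": (0.5, 1.0)}

# Which channels should be computed at t = 1.2 s under this schedule
active = {name: is_active(1.2, period, duty)
          for name, (period, duty) in schedule.items()}
```

Unlike EVI, this policy never consults the current belief state; it trades principled value-of-information reasoning for a fixed, predictable sensing budget.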
29. Policy 3: Random Selection
- Baseline policy for comparisons
- Randomly select a feature combination from all possible combinations
30. S-Seer
[Architecture diagram: the Seer architecture, now driven by selective perception. Top layer: office activities (phone conversation, face-to-face conversation, working on the computer, presentation, nobody present, distant conversation). Middle layer: video HMM classification results (one person present, one active person present, multiple people present, nobody present), audio HMM classification results (ambient noise, human speech, music, keyboard, phone ring), sound localization, and keyboard/mouse activities. Bottom layer: feature vectors; audio features (PCA on LPC coefficients; energy; mean and variance of the fundamental frequency; zero-crossing rate) and video features (skin color probability, face density, foreground/background, motion density).]
31. Experiments with Selective Perception
- Qualitative and formal evaluations
- Activity abbreviations:
  - DC: Distant Conversation
  - NP: Nobody Present
  - O: Other
  - P: Presentation
  - FFC: Face-to-Face Conversation
  - WC: Working on Computer
  - PC: Phone Conversation
32. Experiments with Selective Perception
- At times the system does not use any features (e.g., around time = 50)
- The system guided by EVI tends to have a longer switching time than when using all the features all the time
- Some features might not be activated at all (e.g., the sound localization feature)
33. Comparison of the Selective Perception Policies
- Mean accuracies (%) when testing EVI, observational frequencies and random selection with 600 sequences of real-time data (100 seq/behavior)

                            All features   EVI     ObsFreq   Random
Phone Conversation          100            100     89        78
Face-to-Face Conversation   100            100     86.9      90.2
Presentation                100            97.8    100       91.2
Other Activity              100            100     100       96.7
Nobody Present              100            98.9    100       100
Distant Conversation        100            100     100       100
34. Comparison of the Selective Perception Policies
- Mean computational costs (% CPU time) when testing EVI, observational frequencies and random selection with 600 sequences of real-time data (100 seq/behavior)

                            All features   EVI     ObsFreq   Random
Phone Conversation          61.22          44.5    43        47.5
Face-to-Face Conversation   67.07          56.5    38.5      53.4
Presentation                49.8           20.88   35.9      53.3
Other Activity              59             19.6    37.8      48.9
Nobody Present              44.33          35.7    39.4      41.9
Distant Conversation        44.54          23.27   33.9      46.1
35. Richer Utility and Cost Models
- Initial experiments:
  - Identity matrix as the system's utility model
  - Measure of the cost, C(f), as the percentage of CPU usage
- Richer utility models for misdiagnosis:
  - One can assess the cost to a user of misclassifying one activity as another
  - Seek the amounts that users would be willing to pay to avoid having the activity misdiagnosed, for all possible N-1 misdiagnoses
- Richer models for the cost of perceptual analysis:
  - We map computational costs and utility to the same currency
  - Cost that a user would be willing to pay to avoid latencies of different kinds in different settings
36. Context-Sensitive Cost Models
- S-SEER's domain-level reasoning supports such context-sensitive cost models
- Assuming the cost (C) of computation is zero when the users are not using the computer, we can generate an expected cost (EC) of perception as follows:
  EC(f, E) = C(f) * (1 - sum_{i=1..m} p(s_i | E))
  where:
  - C(f) represents the latency associated with observing and analyzing the set of features f
  - E represents the evidence already observed
  - the index i = 1..m ranges over the subset of activities that do not include interaction with the user
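One plausible reading of this expected-cost computation, as a sketch (the activity names, belief values and latency figure below are invented for illustration):

```python
def expected_cost(latency_cost, belief, noninteractive):
    """Context-sensitive expected cost of computing a feature set.

    The computation's cost is taken to be zero when the user is not
    interacting with the computer, so the latency cost is weighted by
    the probability that the current activity involves interaction.

    latency_cost:   C(f), cost of sensing/analyzing the feature set
    belief:         dict mapping activity -> p(activity | E)
    noninteractive: activities that involve no user interaction
    """
    p_noninteractive = sum(belief[s] for s in noninteractive)
    return latency_cost * (1.0 - p_noninteractive)
```

Under this model, the same feature set gets cheaper as the system becomes more confident that nobody is using the machine, which is exactly when perceptual latency is harmless.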
37. Studies with Richer Utility and Cost Models
- Condition cost models on the software application that has focus
- Consider whether the user is interacting vs. not interacting
- Analysis of the influence of an activity-dependent cost model:
  - 900 sequences of office activity (150 seq/activity) with a rich cost of misdiagnosis
  - Activity-dependent cost model: cost is higher when the user is interacting with the computer
  - e.g., Presentation or person present in other activity vs. Nobody present, distant conversation overheard, etc.
38. Feature Activation (% of time), Constant Cost (Const) vs. Activity-Dependent Cost (Act)

                      Video         Audio         Sound Loc    Kb/Mouse
                      Const  Act    Const  Act    Const  Act   Const  Act
Phone Conv            86.7   78     86.7   78     0      14.7  100    100
Face-to-Face Conv     65.3   48.7   65.3   40.7   0      0     100    100
Presentation          10     2      10     2      0      2     27.3   53.3
Other Activity        10     1.3    10     1.3    0      1.3   63.3   63.3
Nobody Present        78.7   86     78.7   86     0      86    80.7   88
Distant Conv          47.3   100    47.3   100    0      86    100    100
39. Video
[Video demonstration.]
40. Summary
- Decision-theoretic approach to feature selection in multimodal systems: how do observations affect the utility of the system?
- Selective perception significantly reduces the computational burden of S-SEER while preserving good recognition accuracies
- In comparative studies, EVI provides the overall best trade-off between the recognition accuracy of the system and its computational burden
41. Future Work
- Extending utility models:
  - Models of the cost of latencies
  - Cost of misdiagnosis in applications
- Models of persistence and volatility:
  - Models that represent the decay of confidence about states of the world with increasing time since an observation was made
  - Design-time and real-time applications
- Exploration of the decision-theoretic approach with other graphical models
- Emotional content
42. Thank you!