Title: S-Seer: A Selective Perception System for Multimodal Office Activity Recognition
1. S-Seer: A Selective Perception System for Multimodal Office Activity Recognition
- Nuria Oliver, Eric Horvitz
- Adaptive Systems and Interaction
- Microsoft Research
2. Overview of the Talk
- Background: the Seer system
- Value of information
- Selective perception policies
- Selective-Seer (S-Seer)
- Experiments and video
- Summary and future directions
3. Background and Motivation
- Research area: automatic recognition of human behavior from sensory observations
- Applications:
  - Multimodal human-computer interaction
  - Visual surveillance, office awareness, distributed teams
  - Accessibility, medical applications
4. Sensing in Multimodal Systems
- Multimodal sensing and reasoning in personal computing as central vs. peripheral
- Multimodal signal processing typically requires a large portion, if not nearly all, of the available computational resources
- Need for strategies to control the allocation of resources for perception in multimodal systems
- Design-time and/or real-time
5. Seer: Office Awareness System (ICMI 2002; CVIU 2004, to appear)
- Seer: prototype for performing real-time, multimodal, multi-scale office activity recognition
- Distinguishes among:
  - Phone conversation
  - Face-to-face conversation
  - Working on the computer
  - Presentation
  - Nobody around
  - Distant conversation
  - Other activities
6. Multimodal Inputs
- Vision: one static FireWire camera sampled at 30 fps
- Audio: two binaural mini-microphones (20-16000 Hz, SNR 58 dB), sampled at 44100 Hz
- Keyboard and mouse: history of the activity during the last 1, 5 and 60 seconds
7. HMMs for Behavior Recognition
[Figure: graphical model and state trellis for an example four-state HMM; the most likely path through the trellis over time is found with the Viterbi algorithm.]
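The most-likely-path (Viterbi) computation behind the trellis can be written compactly. A minimal sketch for a generic discrete HMM in NumPy (the four-state example on the slide is illustrative; all names here are ours, not Seer's code):

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely state path for a discrete HMM.

    pi:  (N,) initial state probabilities
    A:   (N, N) transition matrix, A[s, l] = p(l | s)
    B:   (N, M) emission matrix, B[s, o] = p(o | s)
    obs: sequence of observation indices
    """
    T = len(obs)
    N = len(pi)
    delta = np.log(pi) + np.log(B[:, obs[0]])   # log-prob of best path ending in each state
    psi = np.zeros((T, N), dtype=int)           # back-pointers
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)     # scores[s, l]: come from s, go to l
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):               # trace back-pointers
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With sharply peaked emissions, the decoded path simply tracks the observations, which makes the trellis picture easy to check by hand.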
8. Several Limitations of HMMs for Multimodal Reasoning
- First-order Markov assumption doesn't address long-term dependencies and multiple time granularities
- Assumes single-process dynamics, but signals may be generated by multiple processes
- Context limited to a single state variable; representing multiple processes with a Cartesian-product HMM becomes intractable quickly
- Large parameter space, hence large data needs
- Empirical experience: the representation is sensitive to changes in the environment (lighting, background noise, etc.)
9. Seer Explored Layered HMMs (LHMMs)
- Goal: decompose the parameter space to reduce training and re-training requirements
- Approach: segment the problem into distinct layers that operate at different temporal granularities
- Consequence: data explained at different levels of temporal abstraction
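The layering idea can be sketched: a bank of low-level HMMs classifies short windows of raw features, and the winning labels become the coarser-grained observation sequence of the next layer. A minimal sketch, not Seer's implementation (the one-state toy models in the usage are invented):

```python
import numpy as np

def forward_loglik(pi, A, B, obs):
    """Log-likelihood of obs under a discrete HMM (scaled forward algorithm)."""
    alpha = pi * B[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
        loglik += np.log(alpha.sum())
        alpha /= alpha.sum()
    return loglik

def layered_inference(bank, windows):
    """Layer 1: classify each short window with a bank of HMMs;
    the index of the winning model becomes one symbol of the
    layer-2 observation sequence (coarser time granularity)."""
    return [max(range(len(bank)),
                key=lambda j: forward_loglik(*bank[j], w))
            for w in windows]
```

A second-layer HMM would then run `forward_loglik` over the symbol sequence that `layered_inference` produces, so each layer is trained and re-trained independently.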
10. Seer: Multi-Scale Activity Recognition
[Figure: activities recognized at multiple temporal scales over time.]
12. Seer's Architecture
[Architecture diagram. Top layer: office activities (phone conversation, face-to-face conversation, working on the computer, presentation, nobody present, distant conversation). Middle layer: audio HMM classification results (ambient noise, human speech, music, keyboard, phone ring), video HMM classification results (one person present, one active person present, multiple people present, nobody present), sound localization, and keyboard/mouse activities. Bottom layer: feature vectors; audio features (PCA on LPC coefficients; energy; mean and variance of the fundamental frequency; zero-crossing rate) and video features (skin color probability, face density, foreground/background, motion density).]
13. Value of LHMMs for the Seer Task
- Comparison between traditional (Cartesian product) HMMs and LHMMs
- 60 minutes of office activity data (10 min/activity, 3 users)
- 50% of data for training and 50% for testing
- 6 office activities recognized:
  - Phone conversation
  - Face-to-face conversation
  - Working on the computer
  - Distant conversation
  - Presentation
  - Nobody around

              HMMs     LHMMs
Accuracy      72.7%    99.7%
# params      1360     670
14. HMM Inference in Seer
15. LHMM Inference in Seer
16. Selective Perception Policies (ICMI 2003)
- Seer performs well, but sensing consumes a large portion of the available CPU
- Seek to understand the value and cost of different sensors/analyses
- Define policies for dynamically selecting sensors/features:
  - Principled decision-theoretic approach: Expected Value of Information (EVI)
  - Heuristic approach: observational frequencies (rate-based perception)
  - Random: select features randomly, as a background case
17. Related Work
- Principles for guiding perception:
  - Expected value of information (EVI) as a core concept of decision analysis (Raiffa; Howard)
  - Value of information in probabilistic reasoning systems, and its use in sequential diagnosis (Gorry 79; Ben-Bassat 80; Horvitz et al. 89; Heckerman et al. 90)
  - Probability and utility to model the behavior of vision modules (Bolles, IJCAI 77) and to score plans of perceptual actions (Garvey 76); reliability indicators to fuse vision modules (Toyama & Horvitz 2000)
- Growing interest in applying decision theory in perceptual applications, e.g., active-vision search tasks (Rimey 93)
18. Policy 1: Expected Value of Information (EVI)
- Decision-theoretic principles to determine the value of observations
- EVI computed by considering the value of eliminating uncertainty about the state of the observational features under consideration
- Example: vision sensor (camera) features:
  - Motion density
  - Face density
  - Foreground density
  - Skin color density
- There are K = 16 possible combinations of these features, representing plausible sets of observations
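The K = 16 combinations are simply the power set of the four vision features; a one-liner makes the count concrete (feature names here are paraphrased from the slide):

```python
from itertools import combinations

features = ["motion_density", "face_density",
            "foreground_density", "skin_color_density"]

# All 2^4 = 16 subsets, from the empty set up to the full feature set
subsets = [combo for r in range(len(features) + 1)
           for combo in combinations(features, r)]
```

Each subset is one candidate "set of observations" whose value and cost the EVI policy weighs at every step.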
19. Subsets of Features: Example
[Lattice of the 16 feature subsets: from the empty set, through the singletons, pairs and triples of {skin color probability, motion density, face density, foreground/background}, up to the full set of all four features.]
20. Criticality of the Utility Model
- EVI guides sensing by considering the influence of observations on the expected utility of the system's actions
- Need to endow the system with a representation of the utility of actions in the world
- Assess the utilities u(a, s) as the value of asserting activity a when the real-world activity is s
- Maximum expected utility action: a* = argmax_a sum_s p(s | E) u(a, s)
21. Considering the Outcome of Making an Observation
- Expected value (EV) of observing (computing) features f:
  EV(f, E) = sum_k p(f = f_k | E) max_a sum_s p(s | f = f_k, E) u(a, s)
  where the f_k are the possible values of the set of observational features f, and E is the prior observational evidence
- Represents uncertainty about the values that the system will observe when evaluating f
- Consider the change in expected value given the current probability distribution:
  EVI(f, E) = EV(f, E) - max_a sum_s p(s | E) u(a, s)
22. Balancing Costs and Benefits
- The net expected value of information (NEVI) of feature combination f is
  NEVI(f, E) = EVI(f, E) - C(f)
- The cost C(f) is the cost assigned to the computational latency associated with sensing and computing feature combination f
- If the difference is positive, it is worth collecting the information, and therefore computing the feature combination
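The balance can be sketched numerically. A toy sketch of computing NEVI for one candidate feature combination (not the Seer implementation; the utility matrix, outcome distribution and posteriors in the usage below are invented):

```python
import numpy as np

def max_eu(belief, U):
    """Expected utility of the best action under a belief over states.
    U[a, s] is the utility of asserting activity a when the truth is s."""
    return (U @ belief).max()

def nevi(belief, U, outcome_probs, posteriors, cost):
    """Net expected value of observing a feature combination f.

    outcome_probs[k]: p(f = f_k | current evidence)
    posteriors[k]:    belief over states after observing value f_k
    cost:             cost of sensing/computing this combination
    """
    ev_after = sum(p * max_eu(post, U)
                   for p, post in zip(outcome_probs, posteriors))
    return ev_after - max_eu(belief, U) - cost
```

A perfectly informative feature that resolves a 50/50 belief under an identity utility matrix is worth 0.5; if it costs less than that, NEVI is positive and the feature is worth computing.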
23. Cost Models
- Distinct cost models:
  - Measure of total computation usage
  - Cost associated with latencies that users will experience
- Costs of computation can be context dependent
- Example: an expected cost model that takes into account the likelihood that the user will experience poor responsiveness, and frustration if it is encountered
24. Single- and Multi-Step Analyses
- Real-world applications of EVI typically employ a greedy approach, i.e., compute the next best observations at each step
- In our analysis, we extend typical EVI computations by reasoning about groups (combinations) of features
- We select the feature combination with the greatest EVI at each step
- Sequential diagnosis, or the hypothetico-deductive cycle
25. Hypothetico-Deductive Cycle
[Diagram of the selective perception analysis loop: a control module, holding the probability model for selective perception analysis, decides which set of features to compute and feeds them to the probabilistic module, whose output in turn informs the next control decision.]
26. EVI with HMMs
- Given that our probabilistic models are HMMs, the predictive term p(o_{t+1} | E) can be computed as
  p(o_{t+1} | E) = sum_s alpha_t(s) sum_l a_{sl} b_l(o_{t+1})
  where:
  - alpha_t(s) is the (normalized) forward variable at time t and state s
  - a_{sl} is the state transition probability of going from state s to state l
  - b_l(o) is the observation probability in state l
  - all of them for model mu
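The predictive term can be computed with two matrix products. A minimal NumPy sketch under the slide's definitions (normalized forward variable, transition matrix A, discrete emission matrix B; the function name is ours):

```python
import numpy as np

def predictive_obs_dist(alpha_t, A, B):
    """p(o_{t+1} | evidence so far) for one discrete HMM.

    alpha_t: (N,) normalized forward variable at time t
    A:       (N, N) transitions, A[s, l] = a_{sl}
    B:       (N, M) emissions,   B[l, o] = b_l(o)
    """
    state_next = alpha_t @ A      # p(state_{t+1} = l | E) = sum_s alpha_t(s) a_{sl}
    return state_next @ B         # sum_l p(l | E) b_l(o), for every bin o
```

The EVI policy evaluates this distribution for each HMM and candidate feature combination to anticipate what the next observation is likely to reveal.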
27. EVI in HMMs
- If we discretize the observation space, the expectation over feature values in the NEVI becomes a sum over discrete observation bins rather than an integral
- In SEER we discretize the observation space into M bins, with M typically 10
- The computational overhead of carrying out EVI in the discrete case is O(M^F * N^2 * J), where:
  - M is the maximum cardinality of the features,
  - F is the number of feature combinations,
  - N is the maximum number of states in the HMMs, and
  - J is the number of HMMs
28. Policy 2: Heuristic Rate-Based Perception
- For comparison, we consider selective perception policies based on defining observational frequencies and duty cycles for each feature
[Timing diagram: each channel (audio classification, video classification, sound localization, keyboard/mouse) is switched ON and OFF over time according to its own period and duty cycle.]
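Rate-based perception reduces to a periodic ON/OFF gate per feature. A minimal sketch (the periods and duty cycles in the example schedule are invented, not Seer's settings):

```python
def is_active(t, period, duty_cycle, phase=0.0):
    """Rate-based perception gate: the feature is computed only during
    the ON fraction (duty_cycle) of each period."""
    return ((t + phase) % period) < duty_cycle * period

# Hypothetical schedule: (period in seconds, duty cycle)
schedule = {"audio": (1.0, 0.5), "video": (2.0, 0.25),
            "sound_loc": (4.0, 0.1), "kb_mouse": (0.5, 1.0)}

# Which channels should be computed at t = 1.2 s under this schedule
active = {name: is_active(1.2, period, duty)
          for name, (period, duty) in schedule.items()}
```

Unlike EVI, this policy never consults the current belief state; it trades principled value-of-information reasoning for a fixed, predictable sensing budget.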
29. Policy 3: Random Selection
- Baseline policy for comparisons
- Randomly select a feature combination from all possible combinations
30. S-Seer
[Architecture diagram: the Seer architecture, now driven by selective perception. Top layer: office activities (phone conversation, face-to-face conversation, working on the computer, presentation, nobody present, distant conversation). Middle layer: video HMM classification results (one person present, one active person present, multiple people present, nobody present), audio HMM classification results (ambient noise, human speech, music, keyboard, phone ring), sound localization, and keyboard/mouse activities. Bottom layer: feature vectors; audio features (PCA on LPC coefficients; energy; mean and variance of the fundamental frequency; zero-crossing rate) and video features (skin color probability, face density, foreground/background, motion density).]
31. Experiments with Selective Perception
- Qualitative and formal evaluations
- Activity abbreviations:
  - DC: Distant Conversation
  - NP: Nobody Present
  - O: Other
  - P: Presentation
  - FFC: Face-to-Face Conversation
  - WC: Working on Computer
  - PC: Phone Conversation
32. Experiments with Selective Perception
- At times the system does not use any features (e.g., around time = 50)
- The system guided by EVI tends to have a longer switching time than when using all the features all the time
- Some features might not be activated at all (e.g., the sound localization feature)
33. Comparison of the Selective Perception Policies
- Mean accuracies (%) when testing EVI, observational frequencies and random selection with 600 sequences of real-time data (100 seq/behavior)

                            All features   EVI     ObsFreq   Random
Phone Conversation          100            100     89        78
Face-to-Face Conversation   100            100     86.9      90.2
Presentation                100            97.8    100       91.2
Other Activity              100            100     100       96.7
Nobody Present              100            98.9    100       100
Distant Conversation        100            100     100       100
34. Comparison of the Selective Perception Policies
- Mean computational costs (% CPU time) when testing EVI, observational frequencies and random selection with 600 sequences of real-time data (100 seq/behavior)

                            All features   EVI     ObsFreq   Random
Phone Conversation          61.22          44.5    43        47.5
Face-to-Face Conversation   67.07          56.5    38.5      53.4
Presentation                49.8           20.88   35.9      53.3
Other Activity              59             19.6    37.8      48.9
Nobody Present              44.33          35.7    39.4      41.9
Distant Conversation        44.54          23.27   33.9      46.1
35. Richer Utility and Cost Models
- Initial experiments:
  - Identity matrix as the system's utility model
  - Measure of the cost, C(f), as the percentage of CPU usage
- Richer utility models for misdiagnosis:
  - One can assess the cost to a user of misclassifying one activity as another
  - Seek the amounts that users would be willing to pay to avoid having the activity misdiagnosed, for all possible N-1 misdiagnoses
- Richer models for the cost of perceptual analysis:
  - We map computational costs and utility to the same currency
  - Cost that a user would be willing to pay to avoid latencies of different kinds in different settings
36. Context-Sensitive Cost Models
- S-SEER's domain-level reasoning supports such context-sensitive cost models
- Assuming the cost (C) of computation is zero when the users are not using the computer, we can generate an expected cost (EC) of perception as follows:
  EC(f, E) = C(f) * (1 - sum_{i=1..m} p(s_i | E))
  where:
  - C(f) represents the latency associated with observing and analyzing the set of features f
  - E represents the evidence already observed
  - the index i = 1..m ranges over the subset of activities that do not include interaction with the user
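One plausible reading of this expected-cost computation, as a sketch (the activity names, belief values and latency figure below are invented for illustration):

```python
def expected_cost(latency_cost, belief, noninteractive):
    """Context-sensitive expected cost of computing a feature set.

    The computation's cost is taken to be zero when the user is not
    interacting with the computer, so the latency cost is weighted by
    the probability that the current activity involves interaction.

    latency_cost:   C(f), cost of sensing/analyzing the feature set
    belief:         dict mapping activity -> p(activity | E)
    noninteractive: activities that involve no user interaction
    """
    p_noninteractive = sum(belief[s] for s in noninteractive)
    return latency_cost * (1.0 - p_noninteractive)
```

Under this model, the same feature set gets cheaper as the system becomes more confident that nobody is using the machine, which is exactly when perceptual latency is harmless.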
37. Studies with Richer Utility and Cost Models
- Condition cost models on the software application that has focus
- Consider whether the user is interacting vs. not interacting
- Analysis of the influence of an activity-dependent cost model:
  - 900 sequences of office activity (150 seq/activity) with a rich cost of misdiagnosis
  - Activity-dependent cost model: cost is higher when the user is interacting with the computer
  - e.g., Presentation or person present in other activity vs. Nobody present, distant conversation overheard, etc.
38. Feature Activation (% of time), Constant Cost (Const) vs. Activity-Dependent Cost (Act)

                      Video         Audio         Sound Loc    Kb/Mouse
                      Const  Act    Const  Act    Const  Act   Const  Act
Phone Conv            86.7   78     86.7   78     0      14.7  100    100
Face-to-Face Conv     65.3   48.7   65.3   40.7   0      0     100    100
Presentation          10     2      10     2      0      2     27.3   53.3
Other Activity        10     1.3    10     1.3    0      1.3   63.3   63.3
Nobody Present        78.7   86     78.7   86     0      86    80.7   88
Distant Conv          47.3   100    47.3   100    0      86    100    100
39. Video
[Video demonstration.]
40. Summary
- Decision-theoretic approach to feature selection in multimodal systems: how do observations affect the utility of the system?
- Selective perception significantly reduces the computational burden of S-SEER while preserving good recognition accuracies
- In comparative studies, EVI provides the overall best trade-off between the recognition accuracy of the system and its computational burden
41. Future Work
- Extending utility models:
  - Models of the cost of latencies
  - Cost of misdiagnosis in applications
- Models of persistence and volatility:
  - Models that represent the decay of confidence about states of the world with increasing time since an observation was made
  - Design-time and real-time applications
- Exploration of the decision-theoretic approach with other graphical models
- Emotional content
42. Thank you!