Title: Interactive Event Detection in Video and Audio
1. Interactive Event Detection in Video and Audio
- Rahul Sukthankar, Intel Research Pittsburgh and Carnegie Mellon University
2. Contributors
- Diamond team: L. Huston, Satya, L. Mummert, C. Helfrich, L. Fix
- Forensic video retrieval: J. Campbell, P. Pillai, Diamond team
- Volumetric video analysis: Y. Ke, M. Hebert
- Sound object detection in soundtracks: D. Hoiem, Y. Ke
- Interactive search-assisted diagnosis for breast cancer: Y. Liu, R. Jin, B. Zheng, D. Jukic
3. Why Interactive Event Detection?
- Events of interest are often not known a priori
- Data exploration: "find me more things like this"
- User requirements change based on partial results
- Surveillance: "Alert me if you see X ... hmm, actually I want Y"
- Challenges
- Limited training data
- can we still learn good event detectors?
- Efficiency
- how best to organize/index/pre-process the data?
4. Outline
- Event detection in audio
- sound object detection from a few examples
- Diamond
- efficient search of non-indexed data
- Event detection in video
- forensic video surveillance
- volumetric analysis for action detection
5. Example: Sound Object Detection
- Applications of sound object detection
- "Alert me if you hear a gunshot." (monitoring)
- "Fast forward to the next swordfight in LotR." (search and retrieval)
- Approach
- Learn a boosted classifier from 5-10 examples of the object
- Scan the windowed classifier over all possible locations
[Figure: the audio stream is split into clips (Clip 1 ... Clip N); the clip classifier classifies each clip as object or non-object and returns the locations of the detected sound object]
D. Hoiem, Y. Ke, R. Sukthankar, ICASSP 2005
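The scanning step on this slide can be sketched as below. This is a minimal illustration, not the detector from the ICASSP paper: the function `scan_stream`, the window and hop sizes, and the energy-threshold stand-in `loud_clip` (used in place of a learned classifier) are all assumptions made for the example.

```python
from typing import Callable, List, Sequence, Tuple


def scan_stream(
    samples: Sequence[float],
    classify_clip: Callable[[Sequence[float]], bool],
    window: int,
    hop: int,
) -> List[Tuple[int, int]]:
    """Slide a fixed-length window over the stream and return the
    (start, end) sample ranges that the clip classifier accepts."""
    hits = []
    for start in range(0, len(samples) - window + 1, hop):
        clip = samples[start:start + window]
        if classify_clip(clip):
            hits.append((start, start + window))
    return hits


def loud_clip(clip: Sequence[float]) -> bool:
    """Hypothetical stand-in for a learned clip classifier: mean-energy test."""
    energy = sum(s * s for s in clip) / len(clip)
    return energy > 0.5


# Toy stream: silence, a loud burst at samples 8..11, then silence again.
stream = [0.0] * 8 + [1.0] * 4 + [0.0] * 8
detections = scan_stream(stream, loud_clip, window=4, hop=2)
print(detections)  # [(8, 12)]
```

Only the window that fully covers the burst fires here; a real detector would typically fire on several overlapping windows and merge them.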
6. Sound Object Detection: Clip Classifier
- Feature extraction
- Weak classifier: small decision trees on features
- Learn classifier cascade using AdaBoost
D. Hoiem, Y. Ke, R. Sukthankar, ICASSP 2005
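As a rough illustration of boosting small trees, the sketch below runs discrete AdaBoost over depth-1 trees (decision stumps). It is an assumption-laden toy: the feature extraction and the cascade structure from the slide are not modeled, and the names `best_stump`, `adaboost`, and `predict` are made up for this example.

```python
import math
from typing import List, Tuple

Stump = Tuple[int, float, int]  # (feature index, threshold, polarity)


def stump_predict(stump: Stump, x: List[float]) -> int:
    feature, threshold, polarity = stump
    return polarity if x[feature] > threshold else -polarity


def best_stump(X, y, w) -> Tuple[Stump, float]:
    """Exhaustively pick the stump with the lowest weighted error."""
    best, best_err = None, float("inf")
    for feature in range(len(X[0])):
        for threshold in sorted({x[feature] for x in X}):
            for polarity in (1, -1):
                stump = (feature, threshold, polarity)
                err = sum(wi for xi, yi, wi in zip(X, y, w)
                          if stump_predict(stump, xi) != yi)
                if err < best_err:
                    best, best_err = stump, err
    return best, best_err


def adaboost(X, y, rounds: int):
    """Discrete AdaBoost; labels must be +1 / -1."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Re-weight: misclassified examples gain weight, then normalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x) -> int:
    score = sum(alpha * stump_predict(s, x) for alpha, s in ensemble)
    return 1 if score >= 0 else -1


# Toy 1-D data: the negatives sit in the middle, so no single stump suffices.
X = [[1], [2], [3], [4], [5], [6]]
y = [1, 1, -1, -1, 1, 1]
model = adaboost(X, y, rounds=3)
print([predict(model, x) for x in X])
```

On this toy set a single stump misclassifies two points, while three boosted stumps classify all six training points correctly.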
7. Sound Object Detection: Results
                 stage 1       stage 2       stage 3
sound            pos    neg    pos    neg    pos    neg
meow             0.0    1.4    0.0    1.2    2.2    0.8
phone            0.0    0.4    4.3    0.1    5.9    0.0
car horn         0.0    3.9    0.6    2.2    3.6    1.3
door bell        1.4    2.1    2.1    0.4    6.3    0.1
swords           6.1    1.3    6.7    0.1    6.7    0.0
scream           0.3    5.5    2.7    1.4    5.3    1.1
dog bark         0.7    1.0    6.0    0.3    7.7    0.2
laser gun        0.0    6.8    4.4    5.1    6.7    0.9
explosion        4.1    5.2    7.5    1.5   12.0    0.5
light saber      4.8    6.8    9.7    1.0   13.9    0.2
gunshot          8.1    6.1   12.5    2.3   14.5    1.1
close door       7.9    7.8   14.5    4.8   17.6    2.3
male laugh       4.3   14.7    9.5    9.7   13.3    7.0
average          2.9    4.4    6.0    2.2    8.5    1.1
8. Framework for Interactive Event Detection
- Interactive event detection ⇒ non-indexed search
- Search and indexing
- If queries can be predicted in advance, indexing is possible (e.g., Google for text data)
- Alternative is brute-force search through non-indexed data
- How to perform efficient non-indexed search?
- May need to execute arbitrary code (learned event detector)
9. Brute-Force Search
- Event detection: vast majority of the data is useless
- BFS scales poorly with storage volume
[Figure: User, Search app, Storage]
10. Diamond: Early Discard
- Reject as close to storage as possible
- Reduce volume of data transferred
- Scales much better!
[Figure: User, Search app, Storage]
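The difference between brute-force search and early discard can be made concrete by counting bytes that cross the network. The sketch below is a toy model, not Diamond code: `brute_force`, `early_discard`, the in-memory `corpus`, and the byte-marker detector are assumptions chosen only to show the accounting.

```python
from typing import Callable, Iterable, List, Tuple

Obj = Tuple[str, bytes]  # (object id, payload)


def brute_force(storage: Iterable[Obj], detector: Callable[[bytes], bool]):
    """Ship every object to the search app, then filter there."""
    transferred = 0
    hits: List[str] = []
    for obj_id, payload in storage:
        transferred += len(payload)      # everything crosses the network
        if detector(payload):
            hits.append(obj_id)
    return hits, transferred


def early_discard(storage: Iterable[Obj], detector: Callable[[bytes], bool]):
    """Run the detector next to storage; ship only surviving objects."""
    transferred = 0
    hits: List[str] = []
    for obj_id, payload in storage:
        if detector(payload):            # evaluated at the storage node
            transferred += len(payload)  # only matches cross the network
            hits.append(obj_id)
    return hits, transferred


def contains_marker(payload: bytes) -> bool:
    """Hypothetical detector: looks for a marker byte in the payload."""
    return b"X" in payload


# Toy corpus: 100 objects of 1000 bytes each, 3 of which contain the marker.
corpus = [(f"obj{i}", (b"X" if i in (7, 42, 99) else b".") * 1000)
          for i in range(100)]

hits_bf, bytes_bf = brute_force(corpus, contains_marker)
hits_ed, bytes_ed = early_discard(corpus, contains_marker)
print(bytes_bf, bytes_ed)  # 100000 3000
```

Both strategies return the same three objects; only the volume of data moved differs, and the gap grows with the fraction of useless data.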
11. Diamond Architecture
[Figure: architecture diagram. Labels: App Code (proprietary or open); Diamond API (open); Diamond code (open); Storage access protocol (open); three storage blocks, each labeled Searchlet, Filter API, Storage Runtime, and Assoc DMA]
Diamond is a collaborative project between Intel Research and CMU
12. Anatomy of a Diamond Searchlet
- Sequence of partially-ordered filters
- each filter can pass or drop an object
- filters share state through attributes
- Diamond determines an optimal filter order
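One way to picture a searchlet is as a set of filters with dependencies, run in an order that drops objects as cheaply as possible. The sketch below is an assumption, not the Diamond API or its actual ordering policy: the `Filter` class, the `cost` and `pass_rate` estimates, and the greedy rule (lowest cost per unit of drop probability among filters whose dependencies are satisfied) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass
class Filter:
    name: str
    run: Callable[[dict], bool]   # returns True to pass, False to drop
    cost: float                   # assumed average cost per object
    pass_rate: float              # assumed fraction of objects passed
    depends_on: Set[str] = field(default_factory=set)


def order_filters(filters: List[Filter]) -> List[Filter]:
    """Greedy order that respects dependencies and prefers filters that
    are cheap and drop a lot."""
    remaining = {f.name: f for f in filters}
    done: Set[str] = set()
    ordered: List[Filter] = []
    while remaining:
        ready = [f for f in remaining.values() if f.depends_on <= done]
        if not ready:
            raise ValueError("dependency cycle among filters")
        nxt = min(ready, key=lambda f: f.cost / max(1.0 - f.pass_rate, 1e-9))
        ordered.append(nxt)
        done.add(nxt.name)
        del remaining[nxt.name]
    return ordered


def run_searchlet(ordered: List[Filter], attrs: dict) -> bool:
    """Apply filters in order; stop at the first one that drops the object."""
    return all(f.run(attrs) for f in ordered)


def decode(attrs: dict) -> bool:      # writes an attribute for later filters
    attrs["mean"] = sum(attrs["raw"]) / len(attrs["raw"])
    return True


def bright(attrs: dict) -> bool:      # reads the attribute written by decode
    return attrs["mean"] > 0.5


def expensive(attrs: dict) -> bool:   # stand-in for a costly detector
    return max(attrs["raw"]) > 0.9


filters = [
    Filter("expensive", expensive, cost=10.0, pass_rate=0.5, depends_on={"decode"}),
    Filter("bright", bright, cost=1.0, pass_rate=0.2, depends_on={"decode"}),
    Filter("decode", decode, cost=2.0, pass_rate=1.0),
]
plan = order_filters(filters)
print([f.name for f in plan])  # ['decode', 'bright', 'expensive']
```

The cheap, selective `bright` filter runs before the costly one, so most objects are dropped before `expensive` is ever evaluated; the shared `attrs` dictionary plays the role of the attributes through which filters share state.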
13. Example Application: Forensic Video Surveillance
- Timely reconstruction of a crime scene
- large quantities of video surveillance data
- current practice: gather and manually scan video tapes
- obvious optimization: transfer data to central site
- Better solution: send your detector to the data
J. Campbell et al., VSSN 2004
14. Video Action Detection: Goal
15. Idea: Treat Video as a Volume
16. Related work: Recognition using SVMs on Space-Time Interest Points
[Figure: space-time interest points]
Figures courtesy Schuldt et al., ICPR 2004
17. Problem with Space-Time Interest Points: Too Sparse
[Figure: two examples of smooth motions where no stable space-time interest points are detected]
18. Problem with Space-Time Interest Points: Dependent on lighting conditions
19. Volumetric Features on Optical Flow
20. Our Features: 3D Extension of Viola-Jones
[Figure: volumetric features; integral volume at (x, y, t)]
Volumetric features can be efficiently computed using integral volumes, with only 8 memory accesses per feature. The sum of the volume is $e - a - f - g + b + c + h - d$.
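The eight-access box sum can be checked with a small sketch. The layout below (a zero-padded integral volume indexed as `S[t][y][x]`, half-open box ranges, and the names `integral_volume` and `box_sum`) is an assumption for illustration; the slide's corner labels a to h refer to a figure that is not reproduced here.

```python
from typing import List

Volume = List[List[List[float]]]  # indexed as volume[t][y][x]


def integral_volume(v: Volume) -> Volume:
    """Return S with one layer of zero padding, where S[t][y][x] is the sum
    of v over all voxels with indices strictly below (t, y, x)."""
    T, H, W = len(v), len(v[0]), len(v[0][0])
    S = [[[0.0] * (W + 1) for _ in range(H + 1)] for _ in range(T + 1)]
    for t in range(T):
        for y in range(H):
            for x in range(W):
                S[t + 1][y + 1][x + 1] = (
                    v[t][y][x]
                    + S[t][y + 1][x + 1] + S[t + 1][y][x + 1] + S[t + 1][y + 1][x]
                    - S[t][y][x + 1] - S[t][y + 1][x] - S[t + 1][y][x]
                    + S[t][y][x]
                )
    return S


def box_sum(S: Volume, t0: int, t1: int, y0: int, y1: int, x0: int, x1: int) -> float:
    """Sum of the original volume over [t0, t1) x [y0, y1) x [x0, x1),
    using exactly eight lookups into the integral volume."""
    return (
        S[t1][y1][x1]
        - S[t0][y1][x1] - S[t1][y0][x1] - S[t1][y1][x0]
        + S[t0][y0][x1] + S[t0][y1][x0] + S[t1][y0][x0]
        - S[t0][y0][x0]
    )


# Toy video: 3 frames of 4 x 5 pixels with distinct integer values.
T, H, W = 3, 4, 5
video = [[[t * 100 + y * 10 + x for x in range(W)] for y in range(H)]
         for t in range(T)]
S = integral_volume(video)
fast = box_sum(S, 1, 3, 0, 2, 2, 5)
slow = sum(video[t][y][x]
           for t in range(1, 3) for y in range(0, 2) for x in range(2, 5))
print(fast, slow)
```

Building `S` takes one pass over the video; after that, every box sum costs eight lookups regardless of the box size, which is what makes evaluating very many volumetric features affordable.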
21. Classifier cascade learned using Direct Feature Selection (Wu et al., NIPS, 2002)
Millions of potential features for selection, so AdaBoost is too slow.
[Figure: an example of the features learned by the classifier to recognize the hand-wave action in a detection volume]
22. Detection
- Use a sliding volume over the video sequence
- Model a true event as a cluster of detections with a Gaussian distribution.
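A simple way to turn raw sliding-volume firings into events is to group nearby detections and summarize each group by its mean and variance (the parameters of an axis-aligned Gaussian). The sketch below is only a stand-in for the model on the slide: the greedy grouping rule, `radius`, and `min_count` are assumptions, and `cluster_detections` is a hypothetical name.

```python
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, t) centre of a firing window


def cluster_detections(points: List[Point], radius: float, min_count: int):
    """Greedy grouping: a detection joins the first cluster whose running
    mean lies within `radius`; otherwise it starts a new cluster.
    Returns (mean, per-axis variance, count) for every cluster that has
    at least `min_count` members."""
    clusters: List[List[Point]] = []
    for p in points:
        for members in clusters:
            mean = [sum(axis) / len(members) for axis in zip(*members)]
            dist = sum((a - b) ** 2 for a, b in zip(p, mean)) ** 0.5
            if dist <= radius:
                members.append(p)
                break
        else:
            clusters.append([p])

    events = []
    for members in clusters:
        if len(members) < min_count:
            continue  # isolated firings are treated as false positives
        n = len(members)
        mean = tuple(sum(axis) / n for axis in zip(*members))
        var = tuple(sum((a - m) ** 2 for a in axis) / n
                    for axis, m in zip(zip(*members), mean))
        events.append((mean, var, n))
    return events


# Four firings around one action plus one isolated firing elsewhere.
firings = [(10, 10, 5), (11, 10, 5), (10, 11, 6), (12, 10, 6), (50, 50, 40)]
events = cluster_detections(firings, radius=5.0, min_count=3)
print(events)
```

Here the four nearby firings collapse into one event centred at (10.75, 10.25, 5.5), and the lone firing is discarded.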
23. Generic Volumetric Features
- Processing non-indexed video is slow: lots of data
- Are there application-independent representations for video?
- Goal: pre-process video once, support multiple video event apps.
Y. Ke, unpublished 2006
24. Related work: Space-Time Behavior Based Correlation
Figures courtesy Shechtman and Irani, CVPR 2005
25. Interactive Search-Assisted Diagnosis
[Figure: ISAD results. Query: suspicious mass. Retrieved: rank 1, benign biopsy; rank 2, benign biopsy; rank 3, malignant biopsy. Annotation: "CLOSE?"]
Collaborators: B. Zheng, D. Jukic, L. Yang, R. Jin
26. Query-adaptive Local Distance Learning
- Previously
- Various Lp norms: Euclidean distance is typically not the best
- Global metric learning
- Learn metric that best satisfies user-given pairwise data constraints
- Fares poorly with multimodal data
- Local metric learning
- Learn metric that does the above, but weighs nearby constraints higher
- Chicken-and-egg problem
- What's new
- Learn a metric for the given query based on its neighborhood
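To make the idea of a query-dependent metric concrete, the sketch below derives per-dimension weights from pairwise constraints, counting constraints near the query more heavily. It is a heuristic invented for illustration, not the learning algorithm behind this slide: the Gaussian proximity kernel, `sigma`, the ratio rule, and the names `local_weights` and `weighted_distance` are all assumptions. Proximity is measured with plain Euclidean distance, which sidesteps rather than solves the chicken-and-egg problem.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[Sequence[float], Sequence[float], bool]  # (a, b, same_class)


def local_weights(query: Sequence[float], pairs: List[Pair], sigma: float) -> List[float]:
    """Per-dimension weights for a diagonal metric around `query`.
    A dimension gets a large weight when nearby different-class pairs
    differ along it and nearby same-class pairs do not."""
    dims = len(query)
    same = [1e-9] * dims   # small constants avoid division by zero
    diff = [1e-9] * dims
    for a, b, same_class in pairs:
        mid = [(ai + bi) / 2 for ai, bi in zip(a, b)]
        d2 = sum((q - m) ** 2 for q, m in zip(query, mid))
        k = math.exp(-d2 / (2 * sigma ** 2))   # nearby constraints count more
        target = same if same_class else diff
        for d in range(dims):
            target[d] += k * (a[d] - b[d]) ** 2
    raw = [diff[d] / same[d] for d in range(dims)]
    total = sum(raw)
    return [r / total for r in raw]


def weighted_distance(w: Sequence[float], a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum(wi * (ai - bi) ** 2 for wi, ai, bi in zip(w, a, b)))


# Near the origin, dimension 0 separates classes; near (10, 10), dimension 1 does.
constraints: List[Pair] = [
    ((0, 0), (0, 1), True),
    ((0, 0), (1, 0), False),
    ((10, 10), (11, 10), True),
    ((10, 10), (10, 11), False),
]
w_near = local_weights((0, 0), constraints, sigma=2.0)
w_far = local_weights((10, 10), constraints, sigma=2.0)
print(w_near, w_far)
```

The same constraint set yields opposite weightings for the two queries, which is the behaviour a single global metric cannot provide on multimodal data.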
27. Summary
- Many real applications require interactive event detection
- Good for ML algorithms that
- operate with limited training data
- train quickly/incrementally
- exploit unlabeled data
- Diamond: infrastructure for efficient non-indexed search
- http://diamond.cs.cmu.edu/
- Interactive event detection in video is still painful
- Good general-purpose representation for event detection?