Title: Gesture Recognition in Complex Scenes
1Gesture Recognition in Complex Scenes
- Vassilis Athitsos
- Computer Science and Engineering Department
- University of Texas at Arlington
- Jonathan Alon (ex-BU, now Negevtech, Israel).
- Jingbin Wang (ex-BU, now Google).
- Quan Yuan (Boston University).
- Alexandra Stefan (Boston University).
- Stan Sclaroff (Boston University).
- George Kollios (Boston University).
- Margrit Betke (Boston University).
3Motivation ASL Dictionary
4Motivation ASL Dictionary
- Addresses needs of a large community
- 500,000 to 2 million ASL users in the US.
- ??? in the European Union.
- Direct impact in education of Deaf children.
- Most are born to hearing parents and learn ASL at school.
- Challenging problems in vision, learning, and database indexing.
- Large-scale motion-based video retrieval.
- Efficient large-scale multiclass recognition.
- Learning complex patterns from few examples.
5Sources of Information
- Hand motion.
- Hand pose.
- Shape.
- Orientation.
- Facial expressions.
- Body pose.
6Dynamic Gestures
- What gesture did the user perform?
Class 8
7Typical Motion Recognition Approach
input sequence
class 0
8Bottom-up Shortcoming
input frame
hand likelihood
- Hand detection is often hard!
- Color, motion, background subtraction are often not enough.
- Bottom-up frameworks are a fundamental computer vision bottleneck.
9Key Idea
hand candidates
input frame
- Hand detection can return multiple candidates.
- Design a recognition module for this type of
10Nearest-Neighbor Recognition
- Question: how should we measure similarity?
11Database Sequences
Example database gesture
- Assumption: hand location is known in all frames of the database gestures.
- Database is built offline.
- In worst case, manual annotation.
- Online user experience is not affected.
12Comparing Trajectories
- is the hand position at frame i.
- Temporary assumption: known hand location.
- How do we compare these trajectories?
13Comparing Trajectories
- Comparing i-th frame to i-th frame is problematic.
- What do we do with frame 8?
14Comparing Trajectories
- Alignment: ((f1, g1), ..., (fm, gm)).
- Must include all frames of both sequences.
- A frame can occur multiple consecutive times.
15Comparing Trajectories
- ((1,1), (1,2), (2,3), (3,4), (4,5),(5,6), (6,7),
16Optimal Alignment
- Cost of ((f1, g1), ..., (fm, gm)) has two terms:
- Correspondence cost: average cost of each (fi, gi).
- Transition cost: cost of two consecutive pairings.
- Dynamic Time Warping (DTW) computes the optimal alignment.
- Complexity: quadratic in the length of the sequences.
[Figure: alignment table between two sequences, with axis labels Frame 1, Frame 50, Frame 80 and Frame 1, ..., Frame 32, ..., Frame 51.]
- For each cell (i, j):
- Compute optimal alignment of M(1:i) to Q(1:j).
- Answer depends only on (i-1, j), (i, j-1), (i-1, j-1).
- Time complexity: proportional to size of table.
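The table-filling idea can be sketched as follows. This is a minimal illustration, not the presenter's implementation: it assumes trajectories are lists of 2D hand positions, uses Euclidean distance as the correspondence cost, sums costs instead of averaging them, and omits the transition cost. The function name `dtw_cost` is hypothetical.

```python
import math

def dtw_cost(model, query):
    """Cost of the optimal alignment between two 2D trajectories.

    Assumptions (not from the slides): Euclidean correspondence cost,
    summed rather than averaged, and no transition cost.
    """
    m, n = len(model), len(query)
    inf = float("inf")
    # table[i][j] = best cost of aligning model[0..i] with query[0..j]
    table = [[inf] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            local = math.dist(model[i], query[j])
            if i == 0 and j == 0:
                best_prev = 0.0
            else:
                # Each cell depends only on its three neighbors.
                best_prev = min(
                    table[i - 1][j] if i > 0 else inf,
                    table[i][j - 1] if j > 0 else inf,
                    table[i - 1][j - 1] if i > 0 and j > 0 else inf,
                )
            table[i][j] = local + best_prev
    return table[m - 1][n - 1]

# A query that lingers on one position still aligns with zero cost.
model = [(0, 0), (1, 0), (2, 0)]
query = [(0, 0), (1, 0), (1, 0), (2, 0)]
print(dtw_cost(model, query))
```

Every cell is visited once and does constant work, which is why the running time is proportional to the size of the table.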
- Alignment: ((f1, g1, k1), ..., (fm, gm, km)).
- Matching cost: average cost of each (fi, gi, ki).
- Transition cost: cost of two consecutive pairings.
- How do we find the optimal alignment?
- For each cell (i, j, k):
- Compute optimal alignment of M(1:i) to Q(1:j), using the k-th candidate for frame Q(j).
- Answer depends on (i-1, j,k), (i, j-1,), (i-1,
- Result: optimal alignment.
- ((f1, g1, k1), (f2, g2, k2), ..., (fm, gm, km)).
- We get hand locations for free!
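A minimal sketch of the three-dimensional table, under stated assumptions: each query frame is a list of candidate 2D hand positions, the matching cost is Euclidean distance (summed, not averaged), and there is no transition cost. The predecessor set used here, (i-1, j, k) plus any candidate in query frame j-1, is an assumption, since the list of dependencies on the slide is cut off. The name `dstw` is hypothetical.

```python
import math

def dstw(model, query_candidates):
    """Align a model trajectory to a query with several hand candidates per frame.

    Returns (cost, path), where path is a list of (i, j, k) cells and k is
    the index of the candidate chosen for query frame j.
    """
    m, n = len(model), len(query_candidates)
    cost, back = {}, {}
    for i in range(m):
        for j in range(n):
            for k, cand in enumerate(query_candidates[j]):
                local = math.dist(model[i], cand)
                if i == 0 and j == 0:
                    cost[i, j, k], back[i, j, k] = local, None
                    continue
                preds = []
                if i > 0:
                    preds.append((i - 1, j, k))  # same query frame, same candidate
                if j > 0:
                    # Assumed: any candidate of the previous query frame is allowed.
                    for kp in range(len(query_candidates[j - 1])):
                        preds.append((i, j - 1, kp))
                        if i > 0:
                            preds.append((i - 1, j - 1, kp))
                best = min(preds, key=lambda c: cost[c])
                cost[i, j, k] = local + cost[best]
                back[i, j, k] = best
    last = range(len(query_candidates[n - 1]))
    end = min(((m - 1, n - 1, k) for k in last), key=lambda c: cost[c])
    path, cell = [], end
    while cell is not None:  # follow back-pointers to recover the alignment
        path.append(cell)
        cell = back[cell]
    path.reverse()
    return cost[end], path

model = [(0, 0), (1, 1), (2, 2)]
query_candidates = [[(5, 5), (0, 0)], [(1, 1), (9, 9)], [(7, 0), (2, 2)]]
total, path = dstw(model, query_candidates)
print(total, path)
```

The k values along the returned path name the chosen candidate in each query frame, which is how the hand locations come out as a by-product of recognition.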
21Application Gesture Recognition with Short
22Experiment 10 Digits.
23Experiment 10 Digits.
- Test set: 90 gestures, from 3 users.
- Database: 90 gestures from 3 users.
- Each test gesture was only matched to the 60 examples from the other users.
- Accuracy: 91%.
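The user-independent protocol can be sketched as below. This is an illustration only: the gesture records (dictionaries with hypothetical `user`, `label`, and `features` keys) and the pluggable `distance` function are assumptions; in the experiment the distance would be the DSTW score.

```python
def classify_leave_user_out(test_gesture, database, distance):
    """Label a test gesture by its nearest database gesture from a different user."""
    candidates = [g for g in database if g["user"] != test_gesture["user"]]
    nearest = min(
        candidates,
        key=lambda g: distance(test_gesture["features"], g["features"]),
    )
    return nearest["label"]

def accuracy(gestures, distance):
    """Fraction of gestures classified correctly when matched only to other users."""
    correct = sum(
        classify_leave_user_out(g, gestures, distance) == g["label"]
        for g in gestures
    )
    return correct / len(gestures)

# Toy data: one scalar feature per gesture, absolute difference as distance.
gestures = [
    {"user": "a", "label": 0, "features": 0.0},
    {"user": "a", "label": 1, "features": 10.0},
    {"user": "b", "label": 0, "features": 1.0},
    {"user": "b", "label": 1, "features": 9.0},
]
print(accuracy(gestures, lambda x, y: abs(x - y)))
```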
- Higher level module (recognition) tolerant to lower-level (detection) ambiguities.
- Recognition disambiguates detection.
- This is important for designing plug-and-play modules.
- Use in ASL dictionary:
- User signs unknown word in front of computer.
- Video sequences of signs are ranked in order of
DSTW score.
25Static Gestures (Hand Poses)
- Given a hand model, and a single image of a hand, estimate:
- 3D hand shape (joint angles).
- 3D hand orientation.
Input image
Articulated hand model
27Similarity Based Matching
- Goal
- Estimate the class of query gesture q.
- Method
- Find the most similar database gestures.
query gesture
- How do we measure similarity?
- Tolerate errors in feature extraction.
- Hand detection and segmentation.
- How do we achieve efficient retrieval?
- Efficient approximations of slow similarity
query gesture
29Goal Hand Tracking Initialization
- Given the 3D hand pose in the previous frame, estimate it in the current frame.
- Problem: no good way to automatically initialize a tracker.
- Rehg et al. (1995), Heap et al. (1996), Shimada et al. (2001),
- Wu et al. (2001), Stenger et al. (2001), Lu et al. (2003),
32Assumptions in Our Approach
- A few tens of distinct hand shapes.
- All 3D orientations should be allowed.
- Motivation: American Sign Language.
- Input: single image, bounding box of hand.
33Assumptions in Our Approach
input image
skin detection
segmented hand
- We do not assume precise segmentation!
- No clean contour extracted.
34Approach Database Search
- Over 100,000 computer-generated images.
- Known hand pose.
- We avoid direct estimation of 3D info.
- With a database, we only match 2D to 2D.
- We can find all plausible estimates.
- Hand pose is often ambiguous.
36Building the Database
26 hand shapes
37Building the Database
4128 images are generated for each hand shape. Total: 26 x 4128 = 107,328 images.
38Features Edge Pixels
- We use edge images.
- Easy to extract.
- Stable under illumination changes.
edge image
39Similarity Measure Chamfer Distance
Overlaying input and model
How far apart are they?
40Directed Chamfer Distance
- Input: two sets of points.
- red, green.
- c(red, green):
- Average distance from each red point to nearest
green point.
42Chamfer Distance
- Input: two sets of points.
- red, green.
- c(red, green):
- Average distance from each red point to nearest green point.
- c(green, red):
- Average distance from each green point to nearest red point.
Chamfer distance: C(red, green) = c(red, green) + c(green, red)
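The two definitions translate directly into code. This is a brute-force sketch for small point sets, assuming each set is a non-empty list of 2D pixel coordinates; the function names are hypothetical, and a real system matching against a large database would need something faster than comparing every pair of points.

```python
import math

def directed_chamfer(src, dst):
    """c(src, dst): average distance from each src point to its nearest dst point."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)

def chamfer(a, b):
    """C(a, b) = c(a, b) + c(b, a); symmetric in its arguments."""
    return directed_chamfer(a, b) + directed_chamfer(b, a)

red = [(0, 0), (2, 0)]
green = [(0, 0)]
print(directed_chamfer(red, green))  # red -> green
print(directed_chamfer(green, red))  # green -> red
print(chamfer(red, green))
```

The two directed distances differ in this example, which is why the symmetric measure adds them.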
43Evaluating Retrieval Accuracy
- A database image is a correct match for the input if:
- the hand shapes are the same,
- 3D hand orientations differ by at most 30 degrees.
correct matches
incorrect matches
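One way to test the correct-match criterion is sketched below. The slides do not say how the difference between two 3D orientations is measured; this sketch assumes orientations are 3x3 rotation matrices and uses the angle of the relative rotation. All function names are hypothetical.

```python
import math

def rotation_angle_deg(r1, r2):
    """Angle (degrees) of the rotation taking orientation r1 to orientation r2."""
    # trace(r1^T r2) equals the sum of elementwise products of the two matrices.
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp rounding error
    return math.degrees(math.acos(cos_angle))

def is_correct_match(shape_a, rot_a, shape_b, rot_b, max_angle_deg=30.0):
    """Same hand shape, and orientations within max_angle_deg of each other."""
    return shape_a == shape_b and rotation_angle_deg(rot_a, rot_b) <= max_angle_deg

def rot_z(deg):
    """Rotation about the z axis, used here only to build example orientations."""
    c, s = math.cos(math.radians(deg)), math.sin(math.radians(deg))
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

print(is_correct_match("A", rot_z(0), "A", rot_z(20)))  # same shape, 20 degrees apart
print(is_correct_match("A", rot_z(0), "A", rot_z(45)))  # same shape, 45 degrees apart
```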
44Evaluating Retrieval Accuracy
- An input image has 25-35 correct matches among the 107,328 database images.
- Ground truth for input images is estimated by
45Evaluating Retrieval Accuracy
- Retrieval accuracy measure: what is the rank of the highest ranking correct match?
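This measure can be computed as in the following sketch, assuming one distance per database image (smaller means more similar) and a parallel list of correctness flags. The function name is hypothetical.

```python
def rank_of_highest_correct(distances, is_correct):
    """1-based rank of the best-ranked correct match, or None if there is none."""
    order = sorted(range(len(distances)), key=lambda idx: distances[idx])
    for rank, idx in enumerate(order, start=1):
        if is_correct[idx]:
            return rank
    return None

# Four database images; the two closest ones are incorrect matches.
distances = [0.9, 0.2, 0.5, 0.1]
is_correct = [True, False, True, False]
print(rank_of_highest_correct(distances, is_correct))
```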
46Evaluating Retrieval Accuracy
[Figure: retrieved database images at ranks 1 to 6, with the highest ranking correct match marked.]
48Results on 703 Real Hand Images
Rank of highest ranking correct match    Percentage of test images
1                                        15
1-10                                     40
1-100                                    73
- Results are better on nicer images:
- Dark background.
- Frontal view.
- For half the images, top match was correct.
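A table of this kind can be produced from per-image ranks as sketched below. The ranks in the example are made up for illustration and are not the experimental data; the function name is hypothetical.

```python
def percent_within(ranks, thresholds=(1, 10, 100)):
    """Percentage of test images whose highest ranking correct match is within each threshold."""
    return {
        t: 100.0 * sum(1 for r in ranks if r <= t) / len(ranks)
        for t in thresholds
    }

# Hypothetical ranks for five test images.
example_ranks = [1, 5, 50, 644, 1]
for threshold, pct in percent_within(example_ranks).items():
    print(f"1-{threshold}: {pct:.0f}%")
```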
[Figures: example results. Each panel shows the initial image, the segmented hand, the edge image, and a retrieved match: correct match at rank 1; correct match at rank 644; incorrect match at rank 1; correct match at rank 1; correct match at rank 33; incorrect match at rank 1.]
[Figure: segmented hand and edge image for a hard case and an easy case.]
- 3D pose estimation from a single image is hard!
- What is our system good for?
- Cleanly segmented frontal views.
- Generating hypotheses that domain knowledge/constraints can disambiguate.
- Tracker initialization and error recovery.
- How would our system be integrated with a tracker?
57Research Directions
- More accurate similarity measures.
- Problem: higher-level features are more informative, but harder to calculate.
- Better tolerance to segmentation errors.
- Clutter.
- Incorrect scale and translation.
58Current Work ASL Dictionary
59Current Work ASL Dictionary
- Computer vision challenge:
- Estimate hand pose and motion accurately and fast.
- Our existing hand pose method leaves many questions unanswered.
- Machine learning challenge:
- Currently, in DSTW, there is no learning.
- Learn models of signs.
- 4000 classes, 1-5 examples per sign.
- Data mining challenge:
- Indexing methods for large numbers of classes.
60- Comments, questions, complaints
- E-mail: athitsos_at_uta.edu
- Web: http://crystal.uta.edu/athitsos/