Scalable Clustering for Vision using GPUs

Transcript and Presenter's Notes


1
Scalable Clustering for Vision using GPUs

K Wasif Mohiuddin, P J Narayanan
Center for Visual Information Technology
International Institute of Information Technology (IIIT), Hyderabad

2
Publications

1) K Wasif Mohiuddin and P J Narayanan,
Scalable Clustering using Multiple GPUs.
HiPC '11 (Conference on High Performance
Computing), Bangalore, India.
2) K Wasif Mohiuddin and P J Narayanan,
GPU Assisted Video Organizing Application.
ICCV '11, Workshop on GPU in Computer Vision
Applications, Barcelona, Spain.

3
Presentation Flow

Scalable Clustering on Multiple GPUs
GPU assisted Personal Video Organizer

4
Introduction

Classification of data is desired for meaningful
representation.
Unsupervised learning finds hidden structure in
the data.
Applications in computer vision and data mining,
such as
Image Classification
Document Retrieval
K-Means algorithm

5
Clustering
[Figure: clustering stages labeled Mean Evaluation, Select Centers, Labeling, Relabeling]

6
Need for High Performance Clustering

Clustering 125k vectors of 128 dimensions into 2k
clusters took nearly 8 minutes per iteration on
the CPU.
A fast, efficient clustering implementation is
needed to deal with large data, high
dimensionality and a large number of centers.
In computer vision, SIFT (128 dim) and GIST
features are common. Feature counts can run into
several millions.
Bag of Words vocabulary generation using SIFT
vectors

7
Challenges and Contributions

Computational: $O(n^{dk+1} \log n)$
Growing n, k, d for large scale applications.
Contributions: A complete GPU-based
implementation with
Exploitation of intra-vector parallelism
Efficient Mean evaluation
Data Organization for coalesced access
Multi GPU framework

8
Related Work

General Improvements
KD-trees: Moore et al., SIGKDD 1999
Triangle Inequality: Elkan, ICML 2003
Distributed Systems: Dhillon et al., LSPDM 2000
Pre-CUDA GPU Efforts and Improvements
Fragment Shader: Hall et al., SIGGRAPH 2004

9
Related Work (cont)

Recent GPU efforts
Mean on CPU: Che et al., JPDC 2008
Mean on CPU and GPU: Hong et al., WCCSIE 2009
GPU Miner: Wenbin et al., HKUSTCS 2008
HPK-Means: Wu et al., UCHPC 2009
Divide and Rule: Li et al., ICCIT 2010
One thread assigned per vector. Parallelism not
exploited within a data object.
Inefficient Mean evaluation
Proposed techniques are parameter dependent.

10
K-Means

Objective Function: $\sum_i \sum_j \| x_i^{(j)} - c_j \|^2$,
$1 \le i \le n$, $1 \le j \le k$
K random centers are initially chosen from input
data objects.
Steps
Membership Evaluation
New Mean Evaluation
Convergence
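
The three steps above can be sketched on the CPU as follows. This is a minimal illustration in plain Python, not the GPU implementation from the talk; the function names, the `max_iters` cap, the seeded random initialization and the "labels stopped changing" convergence test are assumptions made for the example.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iters=100, seed=0):
    """Plain K-Means; returns (centers, labels)."""
    rng = random.Random(seed)
    # K random centers are initially chosen from the input data objects.
    centers = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iters):
        # Membership evaluation: label each point with its nearest center.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Convergence (assumed criterion): no label changed.
        if new_labels == labels:
            break
        labels = new_labels
        # New mean evaluation: average the members of each cluster.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels
```

For example, `kmeans([[0, 0], [0, 1], [10, 10], [10, 11]], 2)` separates the two obvious groups.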

11
GPU Architecture

Fermi architecture has 16 Streaming
Multiprocessors (SM)
Each SM has 32 cores, for a total of 512 CUDA
cores.
Kernels launch multiple threads to perform a
task in a Single Instruction Multiple Data (SIMD)
fashion.
Each SM has registers divided equally amongst its
threads. Each thread has a private local memory.
Single unified memory request path for loads and
stores using the L1 cache per SM and an L2 cache
that services all operations
Double precision, faster context switching,
faster atomic operations and multiple kernel
execution

12
K-Means on GPU

Membership Evaluation
Involves Distance and Minima evaluation.
Single thread per component of vector
Parallel computation done on d components of
input and center vectors stored in row major
format.
Log-step summation (parallel reduction) for distance evaluation.
For each input vector we traverse across all
centers.
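
A rough CPU simulation of this membership step is shown below. Each list element plays the role of one thread handling one vector component, and `tree_sum` mimics a log-step reduction; the names and the zero-padding to a power of two are assumptions made for illustration, not details taken from the talk.

```python
def tree_sum(values):
    """Sum values in log2(n) steps, the way a parallel reduction would."""
    buf = list(values)
    size = 1
    while size < len(buf):
        size *= 2
    buf += [0.0] * (size - len(buf))  # pad to a power of two (assumption)
    stride = size // 2
    while stride > 0:
        # On a GPU, every t below would be a separate thread.
        for t in range(stride):
            buf[t] += buf[t + stride]
        stride //= 2
    return buf[0]


def nearest_center(vector, centers):
    """Traverse all centers and return (label, squared distance) of the nearest."""
    best_label, best_dist = -1, float("inf")
    for label, center in enumerate(centers):
        # One "thread" per component computes a squared difference.
        partial = [(x - c) ** 2 for x, c in zip(vector, center)]
        dist = tree_sum(partial)  # no square root: the ordering is unchanged
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist
```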

13
Membership on GPU

[Figure: an input vector i is compared against center vectors 1, 2, ..., k-1, k to produce its label.]

14
Membership on GPU(Cont)

Data objects stored in row major format
Provides coalesced access
Distance evaluation using shared memory.
Square root avoided, since comparing squared
distances gives the same nearest center.

15
K-Means on GPU (Cont)

Mean Evaluation Issues
Random reads and writes
Concurrent writes
Non uniform distribution of data objects per
label.

[Figure: threads performing write and read/write accesses to data]
16
Mean Evaluation on GPU

Store labels and index in 64-bit records
Group data objects with the same membership using
the Splitsort operation.
We split using labels as key
Gather primitive used to rearrange input in order
of labels.
Sorted global index of input vectors is
generated.

Splitsort: Suryakant and Narayanan, IIITH, TR 2009
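
The grouping step can be sketched as below. Python's built-in `sorted` stands in for Splitsort here, and the record layout (label in the upper 32 bits, index in the lower 32 bits) is an assumption chosen so that sorting the records orders them by label; the talk does not specify the layout.

```python
def pack(label, index):
    """Pack a label and a vector index into one 64-bit record (assumed layout)."""
    return (label << 32) | index


def unpack(record):
    """Return (label, index) from a packed record."""
    return record >> 32, record & 0xFFFFFFFF


def group_by_label(vectors, labels):
    """Sort records by label, then gather the vectors in that order."""
    records = sorted(pack(label, i) for i, label in enumerate(labels))
    order = [unpack(r)[1] for r in records]          # sorted global index
    sorted_labels = [unpack(r)[0] for r in records]
    gathered = [vectors[i] for i in order]           # gather primitive
    return gathered, sorted_labels, order
```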
17
Splitsort and Transpose Operation
18
Mean Evaluation on GPU (cont)

Row major storage of vectors enabled coalesced
access.
Segmented scan followed by compact operation for
histogram count.
Transpose operation before rearranging input
vectors.
Using segmented scan again we evaluated mean of
rearranged vectors as per labels.
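
Assuming the vectors have already been rearranged in label order, the mean evaluation can be sketched with a sequential stand-in for the segmented scan. The helper names and the head/tail flag construction are illustrative assumptions; a GPU version would run the scan in parallel.

```python
def segmented_scan(values, head_flags):
    """Inclusive scan that restarts wherever head_flags is True."""
    out, running = [], 0.0
    for value, is_head in zip(values, head_flags):
        running = value if is_head else running + value
        out.append(running)
    return out


def means_from_sorted(sorted_vectors, sorted_labels):
    """Return {label: mean vector} for vectors already grouped by label."""
    n = len(sorted_labels)
    head = [i == 0 or sorted_labels[i] != sorted_labels[i - 1] for i in range(n)]
    tail = [i == n - 1 or sorted_labels[i] != sorted_labels[i + 1] for i in range(n)]
    counts = segmented_scan([1] * n, head)      # histogram count per label
    columns = list(zip(*sorted_vectors))        # transpose: one row per dimension
    scans = [segmented_scan(col, head) for col in columns]
    means = {}
    for i in range(n):
        if tail[i]:  # compact: keep only the last element of each segment
            means[sorted_labels[i]] = [scan[i] / counts[i] for scan in scans]
    return means
```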

19
Implementation Details

Tesla
2 vectors per block, 2 centers at a time
Centers accessed via texture memory
Fermi
2 vectors per block, 4 centers at a time
Centers accessed via global memory using L2 cache
More shared memory for distance evaluation
Occupancy of 83% achieved on both Fermi and
Tesla.

20
Limitations of a GPU device

Highly compute- and memory-intensive
algorithm.
Overloading on GPU device
Limited Global and Shared memory on a GPU device.
Handling of large data vectors
Scalability of the algorithm

21
Multi GPU Approach

Partition input data into chunks proportional to
the number of cores.
Broadcast k centers to all the nodes.
Perform membership and partial mean evaluation on
each of the GPUs for the data sent to their
respective nodes.

22
Multi GPU Approach (cont)

Nodes direct partial sums to Master node.
New means evaluated by Master node for next
iteration.

[Figure: Nodes A, B, ..., Z send partial sums Sa, Sb, ..., Sz to the Master Node, which computes S = Sa + Sb + ... + Sz and the New Centers.]
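
The multi GPU scheme can be simulated in a single process as follows. Each call to `partial_sums` stands for the work of one node, and `master_update` for the Master node; the function names and the rule that an empty cluster keeps its old center are assumptions for this sketch. The talk uses MPI for the actual communication.

```python
def split_proportional(data, cores):
    """Partition data into chunks proportional to each device's core count."""
    total = sum(cores)
    chunks, start = [], 0
    for i, c in enumerate(cores):
        end = len(data) if i == len(cores) - 1 else start + len(data) * c // total
        chunks.append(data[start:end])
        start = end
    return chunks


def partial_sums(chunk, centers):
    """One node: label its chunk and return per-center sums and counts."""
    k, d = len(centers), len(centers[0])
    sums = [[0.0] * d for _ in range(k)]
    counts = [0] * k
    for v in chunk:
        label = min(
            range(k),
            key=lambda j: sum((x - c) ** 2 for x, c in zip(v, centers[j])),
        )
        counts[label] += 1
        for t in range(d):
            sums[label][t] += v[t]
    return sums, counts


def master_update(partials, centers):
    """Master node: combine partial sums from all nodes into new centers."""
    k, d = len(centers), len(centers[0])
    new_centers = []
    for j in range(k):
        total = sum(counts[j] for _, counts in partials)
        if total == 0:
            new_centers.append(list(centers[j]))  # assumption: keep old center
        else:
            new_centers.append(
                [sum(sums[j][t] for sums, _ in partials) / total for t in range(d)]
            )
    return new_centers
```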
23
Results

Generated Gaussian SIFT vectors
Variation in parameters n, d, k
Performance on CPU (1 GB RAM, 2.7 GHz), Tesla T10,
GTX 480, 8600 tested up to n_max = 4 million,
k_max = 8000, d_max = 256
Multi GPU (4xT10 + GTX 480) using MPI
n_max = 32 million, k_max = 8000, d_max = 256
Comparison with previous GPU implementations.

24
Overall Results
N, K        CPU      Tesla T10   GTX 480   4xT10
10K, 80     1.3      0.119       0.18      0.097
50K, 800    71.3     2.73        1.73      0.891
125K, 2K    463.6    14.18       7.71      2.47
250K, 4K    1320     38.5        27.7      7.45
1M, 8K      28936    268.6       170.6     68.5
Times of K-Means on CPU and GPUs (Tesla T10, GTX 480, 4xT10) in seconds for
d = 128.
25
Performance on GPUs
Performance of 8600 (32 cores), Tesla (240 cores),
GTX 480 (480 cores) for d = 128 and k = 1,000.
26
Performance vs n
Linear in n, with d = 128 and k = 4,000.
27
Overall Performance

Multi GPU provided linear speedup
Speedup of up to 170x on GTX 480
6 million vectors of 128 dimensions clustered in
just 136 sec per iteration.
Low-end GPUs provide nearly 10-20x
speedup.

28
Comparison
N           K     D     Li et al   Wu et al   Our K-Means
2 Million   400   8     1.23       4.53       1.27
4 Million   100   8     0.689      4.95       0.734
4 Million   400   8     2.26       9.03       2.4
51,200      32    64    0.403      -          0.191
51,200      32    128   0.475      -          0.262
Up to twice the speed of the best
GPU implementation, on GTX 280
29
Multi GPU Results
N       Dim   1 Tesla   4xTesla   4xTesla + GTX 480
1 M     128   120.4     33.6      22.8
1.5 M   128   181.7     47.2      34.8
3 M     128   364.2     95.67     67.4
6 M     128   -         183.8     136.7
16 M    16    220.4     57.8      40.9
32 M    16    -         116       84.3
Scalable to the number of cores in a Multi GPU setup.
Results on Tesla, GTX 480 in seconds for d = 128,
k = 4000
30
Time Division
Time on GTX 480 device. Mean evaluation reduced
to 6% of the total time for large input of high
dimensional data.
31
Presentation Flow

Scalable Clustering on Multiple GPUs
GPU assisted Personal Video Organizer

33
Motivation

Many and varied videos in everyone's collection,
and growing every day
Sports, TV Shows, Movies, home events, etc.
Categorizing them based on content is useful
No effective tools for video (or images)
Existing efforts are very category specific
Cannot require heavy training or large clusters of
computers
Goal: Personal categorization performed on
personal machines
Training and testing on a personal scale

34
Challenges and Contributions

Algorithmic: Extend image classification to
videos.
Data: Use a small number of personal videos that
span a wide range of categories.
Computational: Need to do it on laptops or personal
workstations.
Contributions: A video organization scheme with
Learning categories from user-labelled data
Fast category assignment for the collection.
Exploiting the GPU for computation
Good performance even on personal machines

35
Related Work

Image Categorization
ACDSee, Dbgallery, Flickr, Picasa, etc.
Image Representation
SIFT (Lowe, IJCV 04), GIST (Torralba, IJCV 01), HOG
(Dalal and Triggs, CVPR 05), etc.
Key Frame extraction
Difference of Histograms (Gianluigi, SPIE 05)

36
Related Work (contd)

Genre Classification
SVM: Ekenel et al., AIEMPro 2010
HMM: Haoran et al., ICICS 2003
GMM: Truong et al., ICPR 2000
Motion and color: Chen et al., JVCIR 2011
Spatio-temporal behavior: Rea et al., ICIP 2000
These involved extensive learning of categories for a
specific type of videos
Not suitable for personal collections that vary
greatly.

37
Video Classification Steps

Category Determination
User tags videos separately for each class
Learning done using these videos
Cluster centers derived for each class
Category Assignment
Use the trained categories on remaining videos
Final assignment done based on scoring
Ambiguities resolved by user

38
Category Determination

Segmentation, Thresholding
Keyframe extraction, PHOG Features
K-Means

39
Work Division

Less intensive steps processed on CPU.
Computationally expensive steps moved onto GPU.
Steps like key frame extraction, feature
extraction and clustering are time consuming.

40
Key frame Extraction

Segmentation
Compute color histogram for all the frames.
Divide video into shots using the score of
difference of histograms across consecutive
frames.
Thresholding
Shots having more than 60 frames selected.
Four equidistant frames chosen as key frames from
every shot.
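
The segmentation and thresholding steps can be sketched as below, assuming the per-frame color histograms are already available as lists of bin counts. The L1 histogram difference, the `cut_threshold` parameter and the choice of interior equidistant positions are assumptions made for the example; only the 60-frame minimum and the four key frames per shot come from the slide.

```python
def histogram_difference(h1, h2):
    """L1 difference between two histograms (assumed score)."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def split_into_shots(histograms, cut_threshold):
    """Return shots as (start, end) frame ranges, end exclusive."""
    shots, start = [], 0
    for i in range(1, len(histograms)):
        # A large change between consecutive frames marks a shot boundary.
        if histogram_difference(histograms[i - 1], histograms[i]) > cut_threshold:
            shots.append((start, i))
            start = i
    if histograms:
        shots.append((start, len(histograms)))
    return shots


def select_key_frames(shots, min_frames=60, per_shot=4):
    """Pick equidistant key frames from every shot longer than min_frames."""
    key_frames = []
    for start, end in shots:
        length = end - start
        if length > min_frames:
            step = length / (per_shot + 1)
            key_frames.extend(start + int(step * (j + 1)) for j in range(per_shot))
    return key_frames
```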

41
PHOG

Edge contours extracted using the Canny edge
detector.
Orientation gradients computed with a 3 x 3 Sobel
mask without Gaussian smoothing.
HOG descriptor discretized into K orientation
bins.
HOG vector is computed for each grid cell at each
pyramid resolution level (Bosch et al., CIVR 2007)
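
The pyramid structure of the descriptor can be illustrated as follows, assuming the gradient magnitude and orientation of each pixel have already been computed (for example by the Canny and Sobel steps above). The grid of 2^l x 2^l cells at level l, the orientation range [0, pi) and the final normalization are assumptions for this sketch rather than details from the slides.

```python
import math


def phog(magnitude, orientation, bins=8, levels=2):
    """Concatenate per-cell orientation histograms over all pyramid levels.

    magnitude, orientation: 2D lists of equal shape; orientation in [0, pi).
    """
    height, width = len(magnitude), len(magnitude[0])
    descriptor = []
    for level in range(levels + 1):
        cells = 2 ** level  # cells x cells grid at this level
        hist = [[0.0] * bins for _ in range(cells * cells)]
        for y in range(height):
            for x in range(width):
                cy = y * cells // height
                cx = x * cells // width
                b = min(int(orientation[y][x] / math.pi * bins), bins - 1)
                # Each pixel votes into its cell's bin, weighted by magnitude.
                hist[cy * cells + cx][b] += magnitude[y][x]
        for cell in hist:
            descriptor.extend(cell)
    total = sum(descriptor)
    return [v / total for v in descriptor] if total > 0 else descriptor
```

With these defaults the descriptor has 8 x (1 + 4 + 16) = 168 entries.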

42
Final Representation

Cluster the accumulated key frames separately for
every category.
Grouping of similar frames into single cluster.
Meaningful representation of key frames for each
category is achieved.
Reduced search space for the test videos.

43
K-Means

Partitions n data objects into k partitions
Clustering of extracted training key frames.
Separately for each of the categories.
Represent each category with meaningful cluster
centers.
For instance grouping frames consisting of pitch,
goal post, etc.
30 clusters per category generated.

44
PHOG on GPU

HOG computed using existing code (Prisacariu et
al., 2009)
Gradients evaluated using convolution kernels
from the NVIDIA CUDA SDK.
One thread per pixel, with a thread block size of
16 x 16.
Each thread computes its own histogram
PHOG descriptors computed by applying HOG at
different scales and merging the results.
Downsample the image and send it to HOG.

45
Category Assignment

Segmentation, Thresholding, keyframes
Extract keyframes from untagged videos.
Compute PHOG for each keyframe
Classify each keyframe independently
K-Nearest Neighbor classifier
Allot each keyframe to the nearest k clusters
Final scoring for category assignment

46
K-Nearest Neighbor

Classification done based on closest training
samples.
K nearest centers evaluated for each frame.
Euclidean distance used as distance metric.

47
KNN on GPU

Each block handles L new key frames at a time and
loops over all key frames.
Find distances for each key frame against all
centers sequentially
Handle each dimension in parallel using a thread
Find the vector distance using a log-step summation
Write back to global memory
Sort using the distance as key for each key frame.
Keep the top k values
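
A sequential sketch of the per-frame search is given below. The function name is hypothetical, and the inner sum is written as an ordinary loop; on the GPU each dimension would be handled by a thread and summed in log steps.

```python
import math


def k_nearest_centers(frame, centers, k):
    """Return the k nearest centers as sorted (distance, center_index) pairs."""
    distances = []
    for index, center in enumerate(centers):
        # Euclidean distance between the key frame descriptor and this center.
        dist = math.sqrt(sum((x - c) ** 2 for x, c in zip(frame, center)))
        distances.append((dist, index))
    distances.sort()  # distance is the sort key
    return distances[:k]  # keep the top k values
```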

48
Scoring

Use the distance ratio r = d1 / d2 of distances
d1 and d2 to the two nearest neighbors.
If r < threshold, allot a single membership to
the keyframe. Threshold used: 0.6
Assign multiple memberships otherwise. We assign
to the top c/2 categories.
Final category
Count the votes for each category for the video
If the top category is a clear winner, assign to
it. (20% more score than the next)
Seek manual assignment otherwise.
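
One possible reading of these scoring rules is sketched below. It assumes that c is the number of categories, that "20% more score" means the top vote count is at least 1.2 times the runner-up, and that `None` signals a video left for manual assignment; the function names and the `center_category` lookup are hypothetical.

```python
from collections import Counter


def frame_memberships(neighbors, center_category, num_categories, threshold=0.6):
    """neighbors: (distance, center_index) pairs sorted nearest first."""
    d1, d2 = neighbors[0][0], neighbors[1][0]
    ratio = d1 / d2 if d2 > 0 else 1.0
    if ratio < threshold:
        # Clear nearest neighbor: single membership.
        return [center_category[neighbors[0][1]]]
    # Otherwise assign to the top c/2 distinct categories among the neighbors.
    limit = max(1, num_categories // 2)
    categories = []
    for _, index in neighbors:
        category = center_category[index]
        if category not in categories:
            categories.append(category)
        if len(categories) == limit:
            break
    return categories


def assign_video(frame_votes, margin=0.20):
    """Count votes over all key frames; return the winner or None if ambiguous."""
    votes = Counter(cat for cats in frame_votes for cat in cats)
    ranked = votes.most_common()
    if not ranked:
        return None
    if len(ranked) == 1 or ranked[0][1] >= (1 + margin) * ranked[1][1]:
        return ranked[0][0]
    return None  # seek manual assignment
```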

49
Results

Selected four popular Sport categories
Cricket, Football, Tennis, Table Tennis
Collected a dataset of about 100 videos of 10 to
15 minutes each.
The user tags 3 videos per category.
Rest of the videos used for testing.
4 frames considered to represent a shot.
Roughly 200 key frames per category.

50
Keyframes (Football)
51
Keyframes (Cricket)
52
Category Labeling
Final key frames for tagged multiple Cricket
videos
Final key frames for tagged multiple Football
videos
Clubbing of key frames from various tagged videos
for each category.
53
Category Labeling
Final key frames for tagged multiple Tennis
videos
Final key frames for tagged multiple Table Tennis
videos
54
Frame classification per category

Variation of K nearest neighbors
Evaluated using 12 tagged videos, 3 per category.
Reduction in error percentage for certain
categories using 3-NN vs. just NN.
64% to 73% for cricket
58% to 66% for football
Achieved overall accuracy of nearly 96%

55
Category Determination
GPU Device   No. of Videos   Keyframes   Segmentation (sec)   PHOG Features (sec)   K-Means (sec)
8600         4               756         182.7                139.6                 3.94
8600         12              2432        584.3                468.4                 14.3
280          4               756         24.8                 19.2                  0.59
280          12              2432        76.9                 61.8                  1.97
580          4               756         11.8                 9.1                   0.26
580          12              2432        37.91                30.2                  0.89
About 80 secs per video on the 8600 and 5 secs per video on the GTX 580.
Time taken to process the Category Labeling phase
on NVIDIA 8600, GTX 280 and GTX 580 cards
56
Category Assignment

Videos of total duration 1375 minutes are
processed in less than 10 minutes.
Time share for K-NN in seconds

GPU Device   No. of Videos   Keyframes   K-NN (sec)
8600         88              16946       40.33
280          88              16946       5.39
580          88              16946       2.46
Processing time per 10-15 minute video: 5 sec on
GTX 580, 80 sec on an 8600
57
Conclusions

Complete GPU-based implementation.
Achieved a speedup of up to 170x on a single NVIDIA
Fermi GPU.
High performance for large d due to processing
each vector in parallel.
Scalable in problem size n, d, k and number of
cores.
Use of operations like Splitsort, Transpose for
coalesced memory access.
Large datasets clustered using the Multi GPU
framework.

58
Conclusions (contd)

Achieved accuracy of up to 96%.
Involving user for ambiguous videos reduced
misclassification rate.
Exploited the computational power of GPU for
vision algorithms.
Effective training with variations in a single
category.
Could be extended to other class of sport
categories as well as other genres of video.
More sophisticated classification algorithms can
help accuracy.

59
Future Work

With evolving GPU architecture the approach may
be altered to enhance the performance.
Improve Multi GPU framework by message passing.
Target applications in computer vision which use
extensive amount of clustering.
Explore for more categories of video and
effective training.

60
Thank You