Title: GPU Acceleration of Iterative Clustering
1 GPU Acceleration of Iterative Clustering
- Jesse D. Hall
- John C. Hart
- University of Illinois, Urbana-Champaign
2 Iterative Clustering (aka k-means / Lloyd's / LBG Algorithm)
- Initialization: Pick a (random) collection of cluster centers
- Partitioning: Assign each point to the nearest cluster center
- Fitting: Find a new center for each cluster
Cluster bunnies courtesy Nate Carr
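The three steps on this slide can be sketched in a few lines of plain Python. This is a minimal CPU-only illustration, not the authors' implementation; the function names (`partition`, `fit`, `lloyd`), the Euclidean metric, the fixed seed, and the rule of keeping an empty cluster's old center are all assumptions made for the example.

```python
import random

def dist2(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def partition(points, centers):
    """Partitioning: index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: dist2(p, centers[c]))
            for p in points]

def fit(points, labels, centers):
    """Fitting: move each center to the mean of its assigned points."""
    new_centers = []
    for c, old in enumerate(centers):
        members = [p for p, l in zip(points, labels) if l == c]
        if not members:                 # empty cluster: keep the old center
            new_centers.append(old)
            continue
        new_centers.append(tuple(sum(m[i] for m in members) / len(members)
                                 for i in range(len(old))))
    return new_centers

def lloyd(points, k, iterations=20, seed=0):
    """Initialization, then alternate partitioning and fitting."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)     # Initialization: random data points
    labels = partition(points, centers)
    for _ in range(iterations):
        centers = fit(points, labels, centers)
        new_labels = partition(points, centers)
        if new_labels == labels:        # assignments stable: converged
            break
        labels = new_labels
    return centers, labels

points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
          (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, labels = lloyd(points, k=2)
```

In this sketch `partition` is the per-point step that the rest of the talk moves to the GPU, while `fit` is the reduction that stays on the CPU.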
3 Applications
- Vector quantization
  - Image compression (Gersho & Gray, VQ & Signal Compression, '92)
  - Light field compression (Levoy & Hanrahan, S'96)
  - Texture synthesis (Wei & Levoy, S'00)
- Clustered principal component analysis
  - Precomputed radiance transfer (Sloan, us & Snyder, S'03)
- Mesh clustering
  - Multichart geometry images (Sander et al., SGP'03)
  - Variational Shape Approx. (Cohen-Steiner et al., S'04)
Vector quantization of a light field (Levoy & Hanrahan, S'96)
Vector quantization of radiance transfer
4 Clustering of
- Orientation
- Illumination
- Position
5 CPU + GPU Approach
- GPU Partitioning
  - Metric evaluation independent for each point
  - Metric evaluation usually not data dependent
  - Ideal for SIMD streaming implementation
- CPU Fitting
  - Fitting inherently a reduction operation
  - Relies on sophisticated data structures (e.g. kd-trees) and processes (e.g. matrix eigenstructure)
6 GPU Partitioning
- Load cluster center and ID into fragment shader constants
- Fragment shader evaluates metric on point data stored in (deep) texture
- Distance stored in z-buffer if less than current z-value (and ID written to framebuffer)
- After all clusters processed, framebuffer contains IDs of nearest cluster for each datapoint
- Requires z-buffer readback, which will be faster with PCI-Express
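The pass-per-cluster scheme above can be emulated on the CPU to show the logic. This is only a sketch under stated assumptions: the lists `zbuffer` and `framebuffer` stand in for the GPU buffers, the metric is assumed to be squared Euclidean distance, and the function name `gpu_style_partition` is made up for the example. A real depth buffer has limited range and precision, which this sketch ignores.

```python
def dist2(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def gpu_style_partition(points, centers):
    """One 'rendering pass' per cluster, with a depth test per data point."""
    zbuffer = [float("inf")] * len(points)   # nearest distance found so far
    framebuffer = [-1] * len(points)         # ID of the nearest cluster so far
    for cluster_id, center in enumerate(centers):  # constants for this pass
        for i, p in enumerate(points):             # one "fragment" per point
            d = dist2(p, center)                   # metric evaluated per point
            if d < zbuffer[i]:                     # depth test: keep if closer
                zbuffer[i] = d
                framebuffer[i] = cluster_id
    return framebuffer, zbuffer

points = [(0.0, 0.0), (1.0, 1.0), (4.0, 4.2)]
centers = [(0.0, 0.1), (4.0, 4.0)]
ids, depths = gpu_style_partition(points, centers)
```

The inner loop has no dependence between points, which is the property that makes this step a fit for a streaming implementation; only the outer loop over clusters is sequential.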
7 Hierarchical GPU Partitioning
- Organize cluster centers in a kd-tree
- Point may need to backtrack several times to find best cluster
- Nevertheless O(n log k) in practice
- Requires lots of decisions per point, which would hinder GPU implementation
- Instead send group (e.g. previous cluster) of datapoints simultaneously through kd-tree
- Traverse based on group's bounding box
- Keep track of individual point's closest distance and ID
- Groups organized as rectangles in texture memory
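One way to read the group traversal is sketched below in plain Python. The slides do not give the tree layout or the pruning rule, so both are assumptions here: each kd-tree node stores the bounding box of the centers beneath it, and a subtree is skipped when its box is farther from the group's bounding box than the worst current best distance in the group. All names (`build`, `group_partition`, `box_dist2`) are hypothetical.

```python
def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def bbox(pts):
    """Axis-aligned bounding box as (mins, maxs)."""
    dim = len(pts[0])
    return (tuple(min(p[i] for p in pts) for i in range(dim)),
            tuple(max(p[i] for p in pts) for i in range(dim)))

def box_dist2(a, b):
    """Smallest squared distance between two axis-aligned boxes."""
    total = 0.0
    for alo, ahi, blo, bhi in zip(a[0], a[1], b[0], b[1]):
        gap = max(blo - ahi, alo - bhi, 0.0)
        total += gap * gap
    return total

def build(centers, ids=None, depth=0):
    """kd-tree over cluster centers; each node stores the box of its centers."""
    if ids is None:
        ids = list(range(len(centers)))
    node = {"box": bbox([centers[i] for i in ids])}
    if len(ids) == 1:
        node["id"] = ids[0]
        return node
    axis = depth % len(centers[0])
    ids = sorted(ids, key=lambda i: centers[i][axis])
    mid = len(ids) // 2
    node["kids"] = [build(centers, ids[:mid], depth + 1),
                    build(centers, ids[mid:], depth + 1)]
    return node

def group_partition(group, centers, tree):
    """Send a whole group through the tree, tracking each point's best match."""
    gbox = bbox(group)
    best_d = [float("inf")] * len(group)
    best_id = [-1] * len(group)

    def visit(node):
        # Prune: no center below this node can beat any point's current best.
        if box_dist2(gbox, node["box"]) > max(best_d):
            return
        if "id" in node:                       # leaf: test this one center
            c = centers[node["id"]]
            for i, p in enumerate(group):
                d = dist2(p, c)
                if d < best_d[i]:
                    best_d[i], best_id[i] = d, node["id"]
            return
        # Visit the child nearer to the group's box first to tighten bounds.
        for kid in sorted(node["kids"], key=lambda k: box_dist2(gbox, k["box"])):
            visit(kid)

    visit(tree)
    return best_id

centers = [(float(x), float(y)) for x in range(4) for y in range(4)]
tree = build(centers)
group = [(0.2, 0.3), (0.4, 0.1), (1.1, 0.2), (0.3, 1.2)]
ids = group_partition(group, centers, tree)
```

The traversal decisions depend only on the group's box and the group-wide worst distance, so every point in the group follows the same path; the per-point work is reduced to the distance test at the leaves.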
8 Clustered Principal Components for Precomputed Radiance Transfer
- Lloyd's Algorithm on PRT datasets would take hours on John Snyder's PC, ran overnight
- Can implement CPCA cluster metric in Cg
- Final result is d^2
- VDIM: number of 4-vectors needed to store a point
- NPCA: number of PCA basis vectors
- More complex than Euclidean distance, but same number of texture fetches
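As a rough illustration of such a metric, the sketch below computes the squared distance from a point to a cluster's affine subspace (cluster mean plus the span of its PCA basis). It is written in Python rather than Cg, and it assumes the basis vectors are orthonormal; the function name `cpca_dist2` and the argument layout are made up for the example. The loop over `basis` corresponds to NPCA, and the length of each vector corresponds to 4 x VDIM components.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cpca_dist2(x, mean, basis):
    """Squared distance from x to the affine subspace mean + span(basis).

    Assumes the basis vectors are orthonormal.
    """
    r = [xi - mi for xi, mi in zip(x, mean)]  # residual from the cluster mean
    d2 = dot(r, r)                             # plain squared Euclidean distance
    for b in basis:                            # one term per PCA basis vector
        proj = dot(r, b)
        d2 -= proj * proj                      # remove what the basis explains
    return max(d2, 0.0)                        # clamp tiny negative round-off

x = (1.0, 2.0, 3.0)
mean = (0.0, 0.0, 0.0)
d2_euclid = cpca_dist2(x, mean, [])                      # no basis: Euclidean
d2_one = cpca_dist2(x, mean, [(1.0, 0.0, 0.0)])          # one basis vector
d2_two = cpca_dist2(x, mean, [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)])
```

With an empty basis the function reduces to the Euclidean case, and each added basis vector subtracts one more projection term.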
9 GPU + CPU v. CPU
NVIDIA GeForce FX 5900 vs. AMD Athlon XP 2800+
16D points, 128 clusters
3x
10 16K points, 128 clusters
11 64K 16D points
12 Conclusions
- GPU + CPU 3x faster than CPU alone
- Slower CPU fared just as well => process is bandwidth limited
- Scales well even with number of clusters, which corresponds to number of fragment program state changes => process dominated by floating point performance
- GPU valuable for preprocessing