Title: GPU Acceleration of Iterative Clustering
1. GPU Acceleration of Iterative Clustering
- Jesse D. Hall
- John C. Hart
- University of Illinois, Urbana-Champaign
2. Iterative Clustering (aka k-means / Lloyd's / LBG)
- Initialization: Pick a (random) collection of cluster centers
- Partitioning: Assign each point to the nearest cluster center
- Fitting: Find a new center for each cluster
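The three steps above can be sketched in plain Python. This is a minimal CPU illustration, not the presenters' code: it assumes squared Euclidean distance as the metric and the mean as the fitted center, and the names `partition`, `fit` and `lloyd` are hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def partition(points, centers):
    """Partitioning: index of the nearest center for each point."""
    return [min(range(len(centers)), key=lambda c: sq_dist(p, centers[c]))
            for p in points]

def fit(points, labels, centers):
    """Fitting: move each center to the mean of its assigned points.
    A cluster that received no points keeps its old center."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for p, c in zip(points, labels):
        counts[c] += 1
        for i, v in enumerate(p):
            sums[c][i] += v
    new_centers = []
    for c, old in enumerate(centers):
        if counts[c] == 0:
            new_centers.append(old)
        else:
            new_centers.append(tuple(s / counts[c] for s in sums[c]))
    return new_centers

def lloyd(points, k, iterations=20, seed=0):
    """Initialization, then alternate partitioning and fitting until stable."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # random initial centers
    labels = []
    for _ in range(iterations):
        new_labels = partition(points, centers)
        if new_labels == labels:         # assignments stopped changing
            break
        labels = new_labels
        centers = fit(points, labels, centers)
    return centers, labels
```

Note that `partition` treats every point independently, while `fit` accumulates over all points of a cluster.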
Cluster bunnies courtesy Nate Carr
- Vector quantization
  - Image compression (Gersho & Gray, VQ & Signal Compression, '92)
  - Light field compression (Levoy & Hanrahan, S96)
  - Texture synthesis (Wei & Levoy, S00)
- Clustered principal component analysis
  - Precomputed radiance transfer (Sloan, us & Snyder, S03)
- Mesh clustering
  - Multichart geometry images (Sander et al., SGP03)
  - Variational Shape Approx. (Cohen-Steiner et al.,
Vector quantization of a light field (Levoy & Hanrahan, S96)
Vector quantization of radiance transfer
4. Clustering of
5. CPU + GPU Approach
- GPU Partitioning
  - Metric evaluation independent for each point
  - Metric evaluation usually not data dependent
  - Ideal for SIMD streaming implementation
- CPU Fitting
  - Fitting inherently a reduction operation
  - Relies on sophisticated data structures (e.g. kd-trees) and processes (e.g. matrix
6. GPU Partitioning
- Load cluster center and ID into fragment shader constants
- Fragment shader evaluates metric on point data stored in (deep) texture
- Distance stored in z-buffer if less than current z-value (and ID written to framebuffer)
- After all clusters processed, framebuffer contains IDs of nearest cluster for each datapoint
- Requires z-buffer readback, which will be faster with PCI-Express
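The pass structure above can be emulated on the CPU to make the data flow concrete. This is only a sketch under stated assumptions: two Python lists stand in for the z-buffer and framebuffer, squared Euclidean distance stands in for the metric, and `zbuffer_partition` is a hypothetical name. It does not use any graphics API.

```python
import math

def zbuffer_partition(points, centers):
    """Emulate one rendering pass per cluster with a LESS depth test."""
    depth = [math.inf] * len(points)   # "z-buffer", cleared to the far value
    ids = [-1] * len(points)           # "framebuffer" holding cluster IDs
    for cid, center in enumerate(centers):      # one pass per cluster center
        for i, p in enumerate(points):          # one "fragment" per datapoint
            d = sum((a - b) ** 2 for a, b in zip(p, center))
            if d < depth[i]:                    # depth test: keep the nearer one
                depth[i] = d
                ids[i] = cid
    return ids, depth
```

After the last pass, `ids` holds the nearest cluster for each datapoint and `depth` holds the corresponding distance, which is what a z-buffer readback would return.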
7. Hierarchical GPU Partitioning
- Organize cluster centers in a kd-tree
- Point may need to backtrack several times to find best cluster
- Nevertheless O(n log k) in practice
- Requires lots of decisions per point, which would hinder GPU implementation
- Instead send group (e.g. previous cluster) of datapoints simultaneously through kd-tree
- Traverse based on group's bounding box
- Keep track of individual point's closest distance and ID
- Groups organized as rectangles in texture memory
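One way to read the group traversal is sketched below on the CPU. It is an illustration under assumptions, not the presenters' implementation: the tree is a nested dict, the metric is squared Euclidean distance, a subtree is skipped only when the group's bounding box is farther from its half-space than every point's current best distance, and the names `build`, `group_search` and `partition_group` are hypothetical.

```python
import math

def build(centers, ids=None, depth=0):
    """Build a kd-tree over cluster centers as nested dicts (None when empty)."""
    if ids is None:
        ids = list(range(len(centers)))
    if not ids:
        return None
    axis = depth % len(centers[0])
    ids = sorted(ids, key=lambda i: centers[i][axis])
    mid = len(ids) // 2
    return {"id": ids[mid], "axis": axis, "split": centers[ids[mid]][axis],
            "left": build(centers, ids[:mid], depth + 1),
            "right": build(centers, ids[mid + 1:], depth + 1)}

def group_search(node, centers, group, best_d, best_id, lo, hi):
    """Send the whole group through the tree; decisions use only the box lo..hi."""
    if node is None:
        return
    c = centers[node["id"]]
    for i, p in enumerate(group):              # per-point closest distance and ID
        d = sum((a - b) ** 2 for a, b in zip(p, c))
        if d < best_d[i]:
            best_d[i], best_id[i] = d, node["id"]
    axis, split = node["axis"], node["split"]
    # Smallest possible distance along this axis from the box to each half-space.
    children = [(max(0.0, lo[axis] - split), node["left"]),
                (max(0.0, split - hi[axis]), node["right"])]
    children.sort(key=lambda t: t[0])          # visit the nearer side first
    for gap, child in children:
        if gap * gap < max(best_d):            # some point could still improve
            group_search(child, centers, group, best_d, best_id, lo, hi)

def partition_group(tree, centers, group):
    """Nearest center ID and squared distance for each point of a non-empty group."""
    dim = len(group[0])
    lo = [min(p[a] for p in group) for a in range(dim)]
    hi = [max(p[a] for p in group) for a in range(dim)]
    best_d = [math.inf] * len(group)
    best_id = [-1] * len(group)
    group_search(tree, centers, group, best_d, best_id, lo, hi)
    return best_id, best_d
```

The branch decision is made once per group rather than once per point, which is the property the slide relies on. A tight group, such as the points of one previous cluster, prunes more of the tree than a scattered one.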
8. Clustered Principal Components for Precomputed Radiance Transfer
- Lloyd's algorithm on PRT datasets would take hours on John Snyder's PC, ran overnight
- Can implement CPCA cluster metric in Cg
  - Final result is d^2
  - VDIM = # of 4-vectors needed to store a point
  - NPCA = # of PCA basis vectors
- More complex than Euclidean distance, but same # of texture fetches
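The metric formula itself did not survive on this slide. As a hedged illustration, a common form of the CPCA error is the squared distance from a point to the cluster's affine subspace, which for an orthonormal basis is the squared length of the mean-centered point minus its squared projections onto the basis vectors. The sketch below assumes that form. The function name is hypothetical, a point would have 4 x VDIM components, and `basis` would hold NPCA vectors.

```python
def cpca_sq_distance(x, mean, basis):
    """Squared distance from x to the affine subspace mean + span(basis).
    Assumes the basis vectors are orthonormal."""
    r = [a - m for a, m in zip(x, mean)]        # mean-centered point
    d2 = sum(v * v for v in r)                  # squared Euclidean distance to mean
    for b in basis:
        proj = sum(v * w for v, w in zip(r, b)) # coefficient along this basis vector
        d2 -= proj * proj                       # remove the part the basis explains
    return max(d2, 0.0)                         # guard against rounding below zero
```

With an empty basis this reduces to the squared Euclidean distance to the cluster mean.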
NVIDIA GeForce FX 5900 vs. AMD Athlon XP 2800+
16D points, 128 clusters
10. 16K points, 128 clusters
11. 64K 16D points
- GPU + CPU 3x faster than CPU alone
- Slower CPU fared just as well → process is bandwidth limited
- Scales well even with # of clusters, which corresponds to # of fragment program state changes → process dominated by floating point performance
- GPU valuable for preprocessing