Title: A Hardware Processing Unit For Point Sets
1A Hardware Processing Unit For Point Sets
- S. Heinzle, G. Guennebaud, M. Botsch, M. Gross
- Graphics Hardware 2008
2Motivation
- Point-based graphics established
- Powerful algorithms
- Representation
- Processing
- Manipulation
- Rendering
- Decomposition
- Get neighborhood
- Operate on neighbors
3Motivation
- GPUs not suited for getting neighborhood
- SIMD
- Incoherent branching
- Dynamic data structures slow
- Recursive calls not supported
- CPUs
- Small number of FPUs
- Inflexible memory caches
Courtesy of NVIDIA
Courtesy of Intel
4Contributions
- Hardware architecture for point sets
- Neighbor search module
- Novel advanced caching mechanism
- Reconfigurable processing module
- Programmability using FPGA compiler
- FPGA prototype and measurements
- Small and lean
- → Integration into multi-core CPU/GPU possible
5Outline
- Related Work
- Spatial Searching and Caching
- Architecture and Prototype
- Results
- Conclusion
6Related Work
- kNN on GPUs: Ma and McCool 02
- Kd-tree hardware: Woop et al. 05, Woop et al. 06
- Kd-tree on GPUs: Popov et al. 07
7Related Work
- Algebraic Moving Least Squares: Guennebaud and Gross 07
- Linear Moving Least Squares: Adamson and Alexa 04
- Adaptive SPH Fluid Simulation: Adams et al. 07
8Linear Moving Least Squares
- Implicit surface defined by a set of points
10Linear Moving Least Squares
[Figure: query point x and sample points pi with normals ni]
11Linear Moving Least Squares
- Iterative projections onto plane
15Linear Moving Least Squares
- Surface defined by points projecting onto themselves
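The iterative projection described on these slides can be sketched as follows. This is a minimal illustration in the spirit of the linear MLS formulation, not the authors' implementation: the Gaussian weight, the support size `h`, and the function name `mls_project` are assumptions made for the example. Each iteration fits a plane through the weighted average position with the weighted average normal, then moves x onto that plane.

```python
import math

def mls_project(x, points, normals, h=1.0, iters=20, tol=1e-9):
    """Iteratively project x onto a locally fitted plane (hypothetical helper)."""
    x = list(x)
    for _ in range(iters):
        # Gaussian weights centred on the current estimate (assumed weight function)
        w = [math.exp(-sum((xc - pc) ** 2 for xc, pc in zip(x, p)) / (h * h))
             for p in points]
        wsum = sum(w)
        # Weighted average position a(x) and normalized weighted average normal n(x)
        a = [sum(wi * p[c] for wi, p in zip(w, points)) / wsum for c in range(3)]
        n = [sum(wi * nv[c] for wi, nv in zip(w, normals)) for c in range(3)]
        nlen = math.sqrt(sum(c * c for c in n))
        n = [c / nlen for c in n]
        # Signed distance of x to the plane through a(x) with normal n(x)
        f = sum((xc - ac) * nc for xc, ac, nc in zip(x, a, n))
        # Move x onto the plane; stop once x projects onto itself
        x = [xc - f * nc for xc, nc in zip(x, n)]
        if abs(f) < tol:
            break
    return x

# Usage: samples of the plane z = 0 with upward normals
pts = [(float(i), float(j), 0.0) for i in range(-2, 3) for j in range(-2, 3)]
nrm = [(0.0, 0.0, 1.0)] * len(pts)
projected = mls_project([0.3, 0.2, 0.7], pts, nrm)
```

In a hardware setting the sums would run only over the neighbors returned by the spatial search rather than over all points, which is why the neighborhood query dominates the cost.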
17Spatial Search
- Spatial search: kNN (k nearest neighbors) and eNN (all neighbors within a given radius)
- Common in most point operations
- Based on kd-tree
- Example: eNN search
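A fixed-radius (eNN) search on a kd-tree can be sketched as below. The tuple-based tree layout, the median split, and the names `build`, `dist2`, and `range_query` are assumptions chosen for brevity; they do not describe the on-chip data format. The far child is visited only when the splitting plane lies within the search radius.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree as nested tuples (hypothetical layout)."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def range_query(tree, q, r, out=None):
    """Collect all points within distance r of q."""
    if out is None:
        out = []
    if tree[0] == "leaf":
        out.extend(p for p in tree[1] if dist2(p, q) <= r * r)
        return out
    _, axis, split, left, right = tree
    delta = q[axis] - split
    near, far = (left, right) if delta < 0 else (right, left)
    range_query(near, q, r, out)
    # The far side can only contain neighbors if the split plane is within r
    if abs(delta) <= r:
        range_query(far, q, r, out)
    return out

# Usage: a small 2D grid of sample points
pts = [(x * 0.5, y * 0.5) for x in range(6) for y in range(6)]
tree = build(pts)
neighbors = range_query(tree, (1.1, 1.2), 0.8)
```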
18Spatial Search
- kNN search similar to eNN search
- Start with infinite radius
- Sort leaf points into priority queue
- Shrink radius with every point sorted
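The kNN variant described above can be sketched in the same way. This is an illustrative software version, assuming a tuple-based kd-tree and Python's `heapq` as the priority queue; the names `build`, `dist2`, and `knn` are hypothetical. The search radius starts at infinity and shrinks to the distance of the current k-th candidate as leaf points are inserted into the queue.

```python
import heapq

def build(points, depth=0, leaf_size=2):
    """Build a kd-tree as nested tuples (hypothetical layout)."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(tree, q, k):
    """Return the k nearest points as (squared distance, point), nearest first."""
    best = []  # max-heap emulated with negated squared distances

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d2 = dist2(p, q)
                if len(best) < k:
                    heapq.heappush(best, (-d2, p))
                elif d2 < -best[0][0]:
                    # Closer than the current k-th candidate: radius shrinks
                    heapq.heapreplace(best, (-d2, p))
            return
        _, axis, split, left, right = node
        delta = q[axis] - split
        near, far = (left, right) if delta < 0 else (right, left)
        visit(near)
        # Radius is infinite until k candidates have been collected
        r2 = -best[0][0] if len(best) == k else float("inf")
        if delta * delta <= r2:
            visit(far)

    visit(tree)
    return sorted((-nd2, p) for nd2, p in best)

# Usage
pts = [((i * 37) % 50 / 50.0, (i * 11) % 50 / 50.0) for i in range(50)]
tree = build(pts)
nearest = knn(tree, (0.42, 0.58), 5)
```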
19Coherent Neighbor Cache (eNN)
- Find neighbors in slightly bigger radius
- Re-use result for spatially close query
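One way to read the two bullets above is sketched below, under stated assumptions: the search is run with an enlarged radius (1 + alpha) * r, and the stored result is reused whenever the next query lies within alpha * r of the cached query point, since by the triangle inequality every true neighbor is then already in the cached list. The class name `RangeCache`, the parameter `alpha`, the single-entry cache, and the final filtering step are choices made for this example, not details taken from the slides. A brute-force scan stands in for the kd-tree search.

```python
import math

class RangeCache:
    """One-entry coherent neighbor cache for fixed-radius queries (sketch)."""

    def __init__(self, points, r, alpha=0.5):
        self.points, self.r = points, r
        self.big_r = (1.0 + alpha) * r  # enlarged search radius
        self.center, self.cached = None, []
        self.hits = 0

    def _search(self, q, radius):
        # Brute-force stand-in for the kd-tree range search
        return [p for p in self.points if math.dist(p, q) <= radius]

    def query(self, q):
        margin = self.big_r - self.r
        if self.center is not None and math.dist(q, self.center) <= margin:
            self.hits += 1  # cached superset still covers the ball around q
        else:
            self.center, self.cached = q, self._search(q, self.big_r)
        # Filter the cached superset down to the neighbors of q
        return [p for p in self.cached if math.dist(p, q) <= self.r]

# Usage: three spatially close queries, only the first triggers a search
pts = [(x * 0.25, y * 0.25) for x in range(9) for y in range(9)]
cache = RangeCache(pts, r=0.6, alpha=0.5)
results = [cache.query(q) for q in [(1.0, 1.0), (1.1, 1.05), (1.2, 0.9)]]
```

A larger `alpha` raises the hit rate but makes each miss more expensive, because more points are collected per search.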
20Coherent Neighbor Cache (kNN, exact)
- Find (k+1) neighbors
- Re-use result for spatially close query
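The reuse test for exact kNN can be sketched as follows. It assumes the criterion that follows from the triangle inequality: if r_k and r_(k+1) are the distances of the k-th and (k+1)-th neighbors of the cached query, the same k points remain the k nearest for any new query closer than (r_(k+1) - r_k) / 2 to the cached one. The names `KnnCache` and `knn_brute` are hypothetical, and the brute-force search stands in for the kd-tree.

```python
import math

def knn_brute(points, q, k):
    """Brute-force stand-in for the kd-tree search: k nearest points, sorted."""
    return sorted(points, key=lambda p: math.dist(p, q))[:k]

class KnnCache:
    """One-entry coherent neighbor cache for exact kNN queries (sketch)."""

    def __init__(self, points, k):
        self.points, self.k = points, k  # requires len(points) >= k + 1
        self.center, self.neighbors, self.slack = None, [], 0.0
        self.hits = 0

    def query(self, q):
        if self.center is not None and math.dist(q, self.center) < self.slack:
            self.hits += 1  # same neighbor set is still exact
        else:
            found = knn_brute(self.points, q, self.k + 1)
            r_k = math.dist(found[self.k - 1], q)
            r_k1 = math.dist(found[self.k], q)
            self.center, self.neighbors = q, found[:self.k]
            self.slack = (r_k1 - r_k) / 2.0
        return self.neighbors

# Usage: the second query reuses the first result, the third does not
pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
cache = KnnCache(pts, k=3)
first = cache.query((0.2, 0.2))
second = cache.query((0.4, 0.3))
third = cache.query((5.5, 5.0))
```

On a hit the neighbor set is exact, but its order relative to the new query point may differ from the cached order.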
21Coherent Neighbor Cache (kNN, approximation)
- Approximation error e
- Enlarge radius
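How a tolerance e enlarges the reuse region can be worked out under one assumed definition of approximation: a result is acceptable if every returned neighbor lies within (1 + e) times the true k-th neighbor distance of the query. With d the distance between the new and the cached query, the cached neighbors lie within r_k + d of the new query, while any point outside the cached set is at least r_(k+1) - d away. Requiring r_k + d <= (1 + e)(r_(k+1) - d) gives the bound below. The function name `reuse_radius` and this reading of the slide are assumptions.

```python
def reuse_radius(r_k, r_k1, eps=0.0):
    """Largest query displacement for which a cached kNN result stays valid.

    r_k, r_k1: distances of the k-th and (k+1)-th neighbors of the cached query.
    eps: tolerated relative error; eps = 0 reduces to the exact condition.
    """
    return ((1.0 + eps) * r_k1 - r_k) / (2.0 + eps)

# Usage: a 10% tolerance widens the reuse region from about 0.1 to about 0.15
exact = reuse_radius(1.0, 1.2)
approx = reuse_radius(1.0, 1.2, eps=0.1)
```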
23The Architecture
[Figure: architecture block diagram with host connection]
24Coherent Neighbor Cache
- Eight cached neighborhoods
- Problem: parallel queries in kd-tree module
- → Interleave spatially similar queries
25Kd-Tree Traversal
26NodeRecurse
- Kd-tree structure on chip
- 16 threads
- Pipelining and multi-threading
27Stacks
- 16 stacks
- Parallel read/write
- Bounded in depth
- 6 bytes per thread per recursion
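Replacing recursion with an explicit, depth-bounded stack can be sketched in software as below. The fixed `max_depth`, the tuple-based tree, and the names `build`, `dist2`, and `range_query_iter` are assumptions for illustration; the actual stack entry format and per-thread layout of the hardware are not modeled. Only the far child is pushed, and only when it can still contain neighbors.

```python
def build(points, depth=0, leaf_size=2):
    """Build a kd-tree as nested tuples (hypothetical layout)."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return ("node", axis, pts[mid][axis],
            build(pts[:mid], depth + 1, leaf_size),
            build(pts[mid:], depth + 1, leaf_size))

def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def range_query_iter(tree, q, r, max_depth=32):
    """Fixed-radius search using a preallocated stack instead of recursion."""
    stack = [None] * max_depth  # bounded in depth, allocated once
    top = 0
    out = []
    node = tree
    while True:
        # Descend to a leaf, deferring far children that may hold neighbors
        while node[0] == "node":
            _, axis, split, left, right = node
            delta = q[axis] - split
            near, far = (left, right) if delta < 0 else (right, left)
            if abs(delta) <= r:
                if top == max_depth:
                    raise OverflowError("traversal stack depth exceeded")
                stack[top] = far
                top += 1
            node = near
        out.extend(p for p in node[1] if dist2(p, q) <= r * r)
        if top == 0:
            return out
        top -= 1
        node = stack[top]

# Usage
pts = [(x * 0.5, y * 0.5) for x in range(6) for y in range(6)]
tree = build(pts)
neighbors = range_query_iter(tree, (1.1, 1.2), 0.8)
```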
28Leaf
- 16 parallel priority queues (1-cycle ops)
- Queues store pointers and distances
- Bandwidth bottleneck
29Processing Module
- Multithreaded quad-port bank of 16 registers
- 128 threads
- Programmability using FPGA-technology
30Further Data
- Implemented on two FPGAs
- 64-bit DDR DRAM
- Interconnection: no overhead
- Resource usage (registers and LUTs)
- Virtex 2 Pro 100 (kNN): 26% registers, 38% LUTs
- Virtex 2 Pro 70 (MLS): 47% registers, 52% LUTs
- Clock frequency: 75 MHz
32Applications
- Tested on various applications
- PCI interface of prototype slow
Weyrich et al. 04
Adams et al. 07
33Results kNN
[Figure: number of queries vs. number of neighbors. Bar labels relative to FPGA x1: CPU x1.1, x1.4, x1.5; CUDA x1.6, x2.4, x4; CUDA w/o sort x3.1, x4.0; ASIC estimate (500 MHz) x6.6. Clock frequencies shown: 75 MHz, 2200 MHz, 1200 MHz]
34Results kNN
- Small hardware footprint
- FPGA slightly slower
- Realistic clock frequency
- → Prototype faster than CPU/GPU
35Results MLS
[Figure: number of queries vs. number of neighbors for MLS. Bar labels: FPGA x1, MLS CPU x0.4, MLS CUDA x3.8. Clock frequencies shown: 75 MHz, 2200 MHz, 1200 MHz]
36Coherent Neighbor Cache
[Figure: number of queries vs. level of coherence for CPU (e = 0.1), FPGA (e = 0.1), and FPGA (exact)]
37Results Approximation Error (MLS projection)
[Figure: MLS error vs. approximation e, compared with no approximation]
38Results Approximation Error (MLS projection)
[Figure: cache hits vs. approximation e]
39Approximation Error (visual)
40Approximation Error (visual)
- Coherent Neighbor Cache
- Not optimal for exact queries
- Approximate queries
- Can be tolerated in most cases
- Greatly increases performance
- Even for small approximations
42Conclusion
- Novel hardware architecture for
- Nearest-neighbor searches
- Generic meshless processing operators
- Cache exploiting spatial coherence
- Good performance considering resources
- Possible GPU integration
43Future Work
- Programmable data structure
- Support different data structures
- Programmability in data structure
- Construction on-chip
- Real programmability in point processing module