A Hardware Processing Unit For Point Sets


Slides: 45

Provided by: shei71

Learn more at: https://www.graphicshardware.org

Transcript and Presenter's Notes

1
A Hardware Processing Unit For Point Sets

S. Heinzle, G. Guennebaud, M. Botsch, M. Gross
Graphics Hardware 2008

2
Motivation

Point-based graphics established
Powerful algorithms
Representation
Processing
Manipulation
Rendering
Decomposition
Get neighborhood
Operate on neighbors

3
Motivation

GPUs not suited for getting neighborhood
SIMD
Incoherent branching
Dynamic data structures slow
Recursive calls not supported
CPUs
Small number of FPUs
Inflexible memory caches

Courtesy of NVIDIA
Courtesy of Intel
4
Contributions

Hardware architecture for point sets
Neighbor search module
Novel advanced caching mechanism
Reconfigurable processing module
Programmability using FPGA compiler
FPGA prototype and measurements
Small and lean
→ Integration into multi-core CPU/GPU possible

5
Outline

Related Work
Spatial Searching and Caching
Architecture and Prototype
Results
Conclusion

6
Related Work

Kd-Tree: Bentley 75
kNN on GPUs: Ma and McCool 02
Kd-Tree Hardware: Woop et al. 05, Woop et al. 06
Kd-Tree on GPUs: Popov et al. 07
7
Related Work
Algebraic Moving Least Squares: Guennebaud and Gross 07
Linear Moving Least Squares: Adamson and Alexa 04
Adaptive SPH Fluid Simulation: Adams et al. 07

8
Linear Moving Least Squares

Implicit surface defined by a set of points

10
Linear Moving Least Squares

[Figure: query point x; sample points pi with normals ni]
11
Linear Moving Least Squares

Iterative projections onto plane

x
15
Linear Moving Least Squares

Surface defined by points projecting onto
themselves

x
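
The iterative projection described on these slides can be sketched as follows. This is a minimal sketch under stated assumptions, not the presenters' implementation: the Gaussian weight and its bandwidth `h` are assumptions, the function name `mls_project` is hypothetical, and all samples are weighted here, whereas a real system would use only the neighbors returned by the spatial search.

```python
import math

def mls_project(x, points, normals, h=0.5, iters=20, tol=1e-9):
    """Iteratively project x onto a locally fitted plane (illustrative sketch)."""
    for _ in range(iters):
        # Gaussian weights; h is an assumed bandwidth, not a value from the slides.
        w = [math.exp(-sum((xc - pc) ** 2 for xc, pc in zip(x, p)) / (h * h))
             for p in points]
        wsum = sum(w)
        # Weighted average position a and normalized weighted average normal n.
        a = [sum(wi * p[d] for wi, p in zip(w, points)) / wsum for d in range(3)]
        n = [sum(wi * nm[d] for wi, nm in zip(w, normals)) for d in range(3)]
        norm = math.sqrt(sum(c * c for c in n))
        n = [c / norm for c in n]
        # Signed distance from x to the plane through a with normal n.
        dist = sum(nc * (xc - ac) for nc, xc, ac in zip(n, x, a))
        x = [xc - dist * nc for xc, nc in zip(x, n)]
        if abs(dist) < tol:
            break  # x now projects onto itself
    return x

# Samples of the plane z = 0 with upward normals.
pts = [(i * 0.25, j * 0.25, 0.0) for i in range(-4, 5) for j in range(-4, 5)]
nrm = [(0.0, 0.0, 1.0)] * len(pts)
projected = mls_project([0.1, -0.2, 0.3], pts, nrm)
```

For this flat test set the query point lands on z = 0 after one step and the second iteration confirms that it no longer moves, which is the "points projecting onto themselves" condition.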
16
Outline

Related Work
Spatial Searching and Caching
Architecture and Prototype
Results
Conclusion

17
Spatial Search

Spatial search: kNN (k nearest neighbors) and eNN (all neighbors within a radius)
Common in most point operations
Based on kd-tree
Example: eNN

18
Spatial Search

kNN search similar to eNN search
Start with infinite radius
Sort leaf points into priority queue
Shrink radius with every point sorted
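
The kNN procedure listed above can be sketched in software as follows. This is an illustrative sketch, not the hardware design: the tree layout, the `leaf_size` parameter, and the function names `build` and `knn` are assumptions. The search radius is infinite until k candidates are in the priority queue, then shrinks to the distance of the current k-th best.

```python
import heapq
import random

def build(points, depth=0, leaf_size=4):
    """Build a kd-tree; leaves hold small buckets of points."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return ("node", axis, points[mid][axis],
            build(points[:mid], depth + 1, leaf_size),
            build(points[mid:], depth + 1, leaf_size))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(tree, q, k):
    """Return the k nearest points to q, closest first."""
    heap = []  # max-heap via negated squared distance; holds at most k entries

    def radius2():
        # Infinite until k candidates are known, then distance to the k-th best.
        return float("inf") if len(heap) < k else -heap[0][0]

    def visit(node):
        if node[0] == "leaf":
            for p in node[1]:
                d = dist2(p, q)
                if d < radius2():
                    if len(heap) == k:
                        heapq.heapreplace(heap, (-d, p))  # drop farthest: radius shrinks
                    else:
                        heapq.heappush(heap, (-d, p))
            return
        _, axis, split, left, right = node
        near, far = (left, right) if q[axis] < split else (right, left)
        visit(near)
        # Visit the far side only if the splitting plane lies inside the radius.
        if (q[axis] - split) ** 2 < radius2():
            visit(far)

    visit(tree)
    return [p for _, p in sorted(heap, reverse=True)]

random.seed(0)
pts = [(random.random(), random.random(), random.random()) for _ in range(200)]
tree = build(pts)
q = (0.5, 0.5, 0.5)
result = knn(tree, q, 5)
```

Descending into the near child first matters: it fills the queue early, so the radius shrinks before the far child is tested and more subtrees can be skipped.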

19
Coherent Neighbor Cache (eNN)

Find neighbors in slightly bigger radius
Re-use result for spatially close query
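
The reuse idea on this slide can be sketched as follows. The sketch assumes the enlarged radius is (1 + alpha) * r, where `alpha` is a hypothetical parameter name not given on the slide. By the triangle inequality, any query within alpha * r of the cached query position has its whole radius-r ball inside the cached ball, so the cached set only needs to be filtered. The class name and the brute-force search standing in for the kd-tree module are also assumptions.

```python
import math

class CoherentRangeCache:
    """Single-entry cache for radius queries (illustrative sketch)."""

    def __init__(self, points, r, alpha=0.5):
        self.points, self.r, self.alpha = points, r, alpha
        self.center = None      # query position the cached set was built for
        self.cached = []        # points within (1 + alpha) * r of center
        self.hits = self.misses = 0

    def _search(self, q, radius):
        # Brute-force stand-in for the kd-tree search.
        return [p for p in self.points if math.dist(p, q) <= radius]

    def query(self, q):
        if (self.center is not None
                and math.dist(q, self.center) <= self.alpha * self.r):
            self.hits += 1
        else:
            self.misses += 1
            self.center = q
            self.cached = self._search(q, (1 + self.alpha) * self.r)
        # Filter the cached superset down to the exact radius r.
        return [p for p in self.cached if math.dist(p, q) <= self.r]

pts = [(x * 0.1, y * 0.1) for x in range(20) for y in range(20)]
cache = CoherentRangeCache(pts, r=0.25, alpha=0.5)
queries = [(1.0 + 0.02 * i, 1.0) for i in range(10)]
results = [cache.query(q) for q in queries]
```

A larger `alpha` raises the hit rate for coherent query sequences but makes each miss more expensive, since the enlarged search returns more points to filter.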

20
Coherent Neighbor Cache (kNN, exact)

Find (k+1) neighbors
Re-use result for spatially close query
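
One way the extra neighbor can justify exact reuse is sketched below, assuming the cache is filled with k+1 neighbors as this slide suggests. Let d_k and d_(k+1) be the distances from the cached query to its k-th and (k+1)-th neighbors. If a new query moves by less than (d_(k+1) - d_k) / 2, every cached neighbor is still closer than every other point, so the set of k nearest neighbors is unchanged. This bound follows from the triangle inequality; the slides do not state the exact test used in the hardware, and the class name and brute-force search are assumptions.

```python
import math
import random

class CoherentKnnCache:
    """Single-entry cache for exact kNN queries (illustrative sketch)."""

    def __init__(self, points, k):
        self.points, self.k = points, k   # requires len(points) >= k + 1
        self.center = None
        self.neighbors = []   # k nearest neighbors of center
        self.slack = 0.0      # (d_(k+1) - d_k) / 2 for the cached query
        self.hits = self.misses = 0

    def _search(self, q, count):
        # Brute-force stand-in for the kd-tree kNN search.
        return sorted(self.points, key=lambda p: math.dist(p, q))[:count]

    def query(self, q):
        if self.center is not None and math.dist(q, self.center) < self.slack:
            self.hits += 1
        else:
            self.misses += 1
            found = self._search(q, self.k + 1)
            d_k = math.dist(found[self.k - 1], q)
            d_k1 = math.dist(found[self.k], q)
            self.center, self.neighbors = q, found[:self.k]
            self.slack = (d_k1 - d_k) / 2
        # The cached set is still exact; re-sort it by distance to the new query.
        return sorted(self.neighbors, key=lambda p: math.dist(p, q))

random.seed(1)
pts = [(random.random(), random.random()) for _ in range(300)]
cache = CoherentKnnCache(pts, k=4)
queries = [(0.5 + 0.001 * i, 0.5) for i in range(20)]
results = [cache.query(q) for q in queries]
```

Only the membership of the neighbor set is guaranteed on a hit; the order can change, which is why the cached neighbors are re-sorted for each query.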

21
Coherent Neighbor Cache (kNN, approximation)

Approximation error e
Enlarge radius

22
Outline

Related Work
Spatial Searching and Caching
Architecture and Prototype
Results
Conclusion

23
The Architecture
Host
24
Coherent Neighbor Cache

Eight cached neighborhoods
Problem: parallel queries in kd-tree module
→ Interleave spatially similar queries

25
Kd-Tree Traversal
26
NodeRecurse

Kd-tree structure on chip
16 threads
Pipelining and multi-threading

27
Stacks

16 stacks
Parallel read/write
Bounded in depth
6 bytes per thread per recursion
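
Since recursion is replaced by per-thread stacks that are bounded in depth, a software analogue is a traversal with an explicit stack. The sketch below is an assumption-laden illustration, not the hardware design: it runs a radius query over a kd-tree, and the bound `max_depth=32` is a hypothetical value, as are the function names. With that hypothetical bound, 16 stacks at 6 bytes per level would need 16 × 32 × 6 = 3072 bytes.

```python
import random

def build(points, depth=0, leaf_size=4):
    """Build a kd-tree; leaves hold small buckets of points."""
    if len(points) <= leaf_size:
        return ("leaf", points)
    axis = depth % len(points[0])
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return ("node", axis, points[mid][axis],
            build(points[:mid], depth + 1, leaf_size),
            build(points[mid:], depth + 1, leaf_size))

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def range_query(tree, q, r, max_depth=32):
    """Collect all points within distance r of q without recursion."""
    found, stack = [], [tree]
    while stack:
        node = stack.pop()
        if node[0] == "leaf":
            found.extend(p for p in node[1] if dist2(p, q) <= r * r)
            continue
        _, axis, split, left, right = node
        near, far = (left, right) if q[axis] < split else (right, left)
        # Defer the far subtree only if the splitting plane is within reach.
        if (q[axis] - split) ** 2 <= r * r:
            stack.append(far)
        stack.append(near)  # popped next, so the near side is visited first
        if len(stack) > max_depth:
            raise OverflowError("stack bound exceeded")
    return found

random.seed(2)
pts = [(random.random(), random.random(), random.random()) for _ in range(200)]
tree = build(pts)
q, r = (0.5, 0.5, 0.5), 0.2
found = range_query(tree, q, r)
```

Because the near child is always processed immediately, the stack never holds more than about one deferred subtree per tree level, which is what makes a fixed depth bound workable for a balanced tree.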

28
Leaf

16 parallel priority queues (1-cycle ops)
Queues store pointers and distances
Bandwidth bottleneck

29
Processing Module

Multithreaded quad-port bank of 16 registers
128 threads
Programmability using FPGA-technology

30
Further Data

Implemented on two FPGAs
64 bit DDR DRAM
Interconnection: no overhead
Resource usage (registers and LUTs):
Virtex 2 Pro 100 (kNN): 26 registers, 38 LUTs
Virtex 2 Pro 70 (MLS): 47 registers, 52 LUTs
Clock frequency: 75 MHz

31
Outline

Related Work
Spatial Searching and Caching
Architecture and Prototype
Results
Conclusion

32
Applications

Tested on various applications
PCI interface of prototype slow

Weyrich et al. 04
Adams et al. 07
33
Results: kNN
[Chart: number of queries versus number of neighbors. Clock frequencies shown: 75 MHz, 2200 MHz, 1200 MHz. Speedup labels: FPGA x1; CPU x1.1, x1.4, x1.5; CUDA x1.6, x2.4, x4; CUDA w/o sort x3.1, x4.0; ASIC estimate at 500 MHz x6.6.]
34
Results: kNN

Small hardware footprint
FPGA slightly slower
Realistic clock frequency
→ Prototype faster than CPU/GPU

[Chart repeated from the previous slide: kNN results at 75 MHz, 2200 MHz, and 1200 MHz.]
35
Results: MLS
[Chart: number of queries versus number of neighbors. Clock frequencies shown: 75 MHz, 2200 MHz, 1200 MHz. Speedup labels: MLS CUDA x3.8; FPGA x1; MLS CPU x0.4.]

kNN bottleneck: FPGA, GPU
FPGA faster than CPU
36
Coherent Neighbor Cache
[Chart: number of queries versus level of coherence, for CPU (e = 0.1), FPGA (e = 0.1), and FPGA (exact).]
37
Results: Approximation Error (MLS projection)
[Chart: MLS error, with e approximation versus no approximation.]
38
Results: Approximation Error (MLS projection)
[Chart: cache hits with e approximation.]
39
Approximation Error (visual)
40
Approximation Error (visual)