Title: Real-time Ray Tracing on GPU with BVH-based Packet Traversal
1. Real-time Ray Tracing on GPU with BVH-based Packet Traversal
- Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek
2. Background
- GPUs attractive for ray tracing
- High computational power
- Shading oriented architecture
- GPU ray tracers
- Carr: the Ray Engine
- Purcell: full ray tracing on the GPU, based on grids
- Ernst: KD-trees with a parallel stack
- Carr, Thrane & Simonsen: BVHs
- Foley, Horn, Popov: KD-trees with stackless traversal
3. Motivation
- So far
- Interactive RT on GPU, but
- Limited model size
- No dynamic scene support
- The G80: a new approach to the GPU
- High-performance general-purpose processor with graphics extensions
- PRAM architecture
- BVHs allow for
- Dynamic/deformable scenes
- Small memory footprint
- Goal: recursive ordered traversal of BVHs on the G80
4. GPU Architecture (G80)
- Multi-threaded scalar architecture
- 12K HW threads
- Threads cover latencies
- Off-chip memory ops
- Instruction dependencies
- 4 or 16 cycles to issue instr.
- 16 (multi-)cores
- 8-wide SIMD
- 128 scalar cores in total
- Cores process threads in 32-wide SIMD chunks
5. GPU Architecture (G80)
- Scalar register file (8K)
- Partitioned among running threads
- Shared memory (16KB)
- On-chip, 0 cycle latency
- On-board memory (768MB)
- Large latency (~200 cycles)
- R/W from within a thread
- Uncached
- Read-only L2 cache (128KB)
- On chip, shared among all threads
6. Programming the G80
- CUDA
- C-based language with parallel extensions
- GPU utilization at 100% only if
- Enough threads are present (>> 12K)
- Every thread uses fewer than 10 registers and 5 words (32-bit) of shared memory
- Enough computations per transferred word of data
- Bandwidth << computational power
- Adequate memory access pattern to allow read combining
7. Performance Bottlenecks
- Efficient per-thread stack implementation
- Shared memory is too small - it would limit parallelism
- On-board memory is uncached
- Need enough computations between stack ops
- Efficient memory access pattern
- Use texture caches
- However, only a few words of cache per thread
- Read successive memory locations in successive threads of a chunk (see the sketch below)
- Single round trip to memory (read combining)
- Cover latency with enough computations
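To illustrate the read-combining pattern, here is a minimal, hedged CUDA sketch (not from the slides; the kernel name and buffers are hypothetical). The point is only that consecutive threads of a chunk touch consecutive 32-bit words, so their loads can be served by a single memory transaction.

    // Hedged sketch: read combining on the G80.
    // Thread i of a chunk reads word i, so the chunk's 32 loads fall into one
    // contiguous segment and need a single round trip to memory.
    __global__ void coalescedCopy(const int *src, int *dst)
    {
        unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
        dst[t] = src[t];        // consecutive threads -> consecutive addresses
    }
    // A strided pattern such as src[t * 16] would instead issue separate
    // transactions per thread and waste bandwidth.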
8. Ray Tracing on the G80
- Map each ray to one thread
- Enough threads to keep the GPU busy
- Recursive ray tracing
- Use a per-thread stack stored in on-board memory (see the kernel skeleton below)
- Efficient, since enough computations are present
- But how to do the traversal?
- Skip pointers (Thrane): no ordered traversal
- Geometry images (Carr): single mesh only
- Shared stack traversal
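Below is a hedged kernel skeleton for the ray-to-thread mapping with a per-thread stack in on-board memory; Ray, Node, STACK_DEPTH, and the result buffer are simplified placeholders, not the authors' data structures.

    // Sketch only: one thread per ray, private traversal stack in on-board memory.
    struct Ray  { float ox, oy, oz, dx, dy, dz; };
    struct Node { int data[8]; };               // placeholder BVH node layout

    #define STACK_DEPTH 64

    __global__ void traceRays(const Ray *rays, const Node *bvh,
                              int *stackPool, float *result, int numRays)
    {
        unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
        if (t >= numRays) return;

        int *stack = stackPool + t * STACK_DEPTH;   // this thread's stack slice
        int  top   = 0;
        stack[top++] = 0;                           // push the BVH root

        Ray   ray     = rays[t];
        float closest = 1e30f;

        while (top > 0) {
            int nodeIdx = stack[--top];             // pop the next node
            // ... intersect 'ray' with bvh[nodeIdx]; push children in front-to-back
            //     order or intersect leaf geometry, updating 'closest' ...
        }
        result[t] = closest;
    }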
9. SIMD Packet Traversal of BVH
- Traverse a node with the whole packet
- At an internal node
- Intersect all rays with both children and determine the traversal order
- Push the far child (if any) on a stack and descend to the near one with the packet
- At a leaf
- Intersect all rays with the contained geometry
- Pop the next node to visit from the stack (a traversal loop is sketched below)
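The loop below is a hedged, host-style C++ sketch of this packet traversal; intersectChildren and intersectLeaf are hypothetical helpers (declared but not defined here), and Packet stands in for whatever ray-bundle representation is used.

    #include <vector>

    struct Packet;                         // a bundle of rays traced together
    struct BVHNode {
        bool isLeaf;
        int  left, right;                  // children of an inner node
        int  firstTri, numTris;            // triangle range of a leaf
    };

    // Assumed helpers: test the packet against both children (reporting hit
    // flags and a near/far order) and against the triangles of a leaf.
    void intersectChildren(const BVHNode&, Packet&, int &near, int &far,
                           bool &hitNear, bool &hitFar);
    void intersectLeaf(const BVHNode&, Packet&);

    void traversePacket(const std::vector<BVHNode> &nodes, Packet &packet)
    {
        std::vector<int> stack;
        int current = 0;                                  // start at the root
        for (;;) {
            const BVHNode &node = nodes[current];
            if (!node.isLeaf) {
                int near, far; bool hitNear, hitFar;
                intersectChildren(node, packet, near, far, hitNear, hitFar);
                if (hitNear) {
                    if (hitFar) stack.push_back(far);     // visit far child later
                    current = near;                       // descend to near child
                    continue;
                }
                if (hitFar) { current = far; continue; }
            } else {
                intersectLeaf(node, packet);              // test contained geometry
            }
            if (stack.empty()) break;                     // traversal done
            current = stack.back(); stack.pop_back();     // pop next node to visit
        }
    }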
10. PRAM Basics
- The PRAM model
- Implicitly synchronized processors (threads)
- Shared memory between all processors
- Basic PRAM operations (sketched below)
- Parallel OR in O(1)
- Parallel reduction in O(log N)
[Figure: worked examples of a parallel OR over boolean flags and a parallel sum reduction tree]
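A hedged CUDA sketch of these two primitives for one 32-thread chunk, using shared memory; the kernel name, buffers, and the assumption of a 32-thread block are illustrative, not the authors' implementation.

    // Hedged sketch: chunk-level parallel OR and parallel sum reduction.
    // Assumes one block of CHUNK_SIZE threads (one chunk) per call.
    #define CHUNK_SIZE 32

    __global__ void pramPrimitives(const int *flags, const int *values,
                                   int *orOut, int *sumOut)
    {
        __shared__ int anySet;                  // parallel OR result
        __shared__ int partial[CHUNK_SIZE];     // workspace for the sum reduction
        unsigned lane = threadIdx.x;

        // Parallel OR in O(1): every thread with a set flag writes the same value.
        if (lane == 0) anySet = 0;
        __syncthreads();
        if (flags[lane]) anySet = 1;            // concurrent writes of the same value
        __syncthreads();

        // Parallel sum reduction in O(log N): halve the active range each step.
        partial[lane] = values[lane];
        __syncthreads();
        for (int stride = CHUNK_SIZE / 2; stride > 0; stride /= 2) {
            if (lane < stride) partial[lane] += partial[lane + stride];
            __syncthreads();
        }

        if (lane == 0) { *orOut = anySet; *sumOut = partial[0]; }
    }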
11. PRAM Packet Traversal of BVH
- The G80 is a PRAM machine at the chunk level
- Map packet → chunk, ray → thread
- Threads behave as in the single-ray traversal
- At a leaf: intersect with geometry, pop the next node from the stack
- At a node: decide which children to visit and in what order, push the far child
- Difference
- How rays choose which node to visit first
- Might not be the one they individually prefer
12. PRAM Packet Traversal of BVH
- Choose the child traversal order (see the sketch below)
- PRAM OR to determine if all rays agree on visiting the same node first
- The result is stored in shared memory
- In case of divergence, choose the child with more ray candidates
- Use a PRAM SUM over +/-1 for each thread, -1 → left node
- Look at the result's sign
- Guarantees synchronous traversal of the BVH
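The device function below is a hedged sketch of this decision (illustrative names, assuming one 32-thread chunk per block; not the authors' code): an OR per child detects agreement, and on divergence a signed sum of +/-1 votes picks the majority child.

    #define CHUNK_SIZE 32
    #define LEFT  (-1)
    #define RIGHT (+1)
    #define NONE  0

    // Returns the child the whole packet descends into first.
    __device__ int chooseFirstChild(int myPreference /* LEFT, RIGHT or NONE */)
    {
        __shared__ int wantsLeft, wantsRight;   // parallel ORs, one per child
        __shared__ int votes[CHUNK_SIZE];       // workspace for the signed sum
        unsigned lane = threadIdx.x;

        if (lane == 0) { wantsLeft = 0; wantsRight = 0; }
        __syncthreads();
        if (myPreference == LEFT)  wantsLeft  = 1;   // PRAM OR: concurrent writes
        if (myPreference == RIGHT) wantsRight = 1;   //          of the same value
        __syncthreads();

        // All candidate rays agree -> take that child first.
        if (!wantsRight) return LEFT;
        if (!wantsLeft)  return RIGHT;

        // Divergence: PRAM SUM of +/-1 votes, then follow the sign (majority).
        votes[lane] = myPreference;                  // -1, 0 or +1 per thread
        __syncthreads();
        for (int stride = CHUNK_SIZE / 2; stride > 0; stride /= 2) {
            if (lane < stride) votes[lane] += votes[lane + stride];
            __syncthreads();
        }
        return (votes[0] <= 0) ? LEFT : RIGHT;       // same answer in every thread
    }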
13. PRAM Packet Traversal of BVH
- Stack
- The near / far child is the same for all threads → store it once
- Keep the stack in shared memory - only a few bits per thread!
- Only thread 0 does all stack ops (see the sketch below)
- Reading data
- All threads work with the same node / triangle
- Sequential threads bring in sequential words
- Single load operation, single round trip to memory
- Implementable in CUDA
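Here is a hedged sketch combining the two ideas (a shared per-packet stack with thread 0 doing all stack ops, and a cooperative node fetch where sequential threads read sequential words); the node layout, stack depth, and kernel structure are placeholders, not the authors' implementation.

    #define STACK_DEPTH 64
    #define NODE_WORDS  8                   // assumed BVH node size in 32-bit words

    __global__ void packetTraversal(const int *bvhNodes)
    {
        __shared__ int stack[STACK_DEPTH];  // one stack for the whole packet
        __shared__ int top;
        __shared__ int node[NODE_WORDS];    // node data shared by all rays
        __shared__ int current;
        unsigned lane = threadIdx.x;

        if (lane == 0) { top = 0; current = 0; }     // start at the root
        __syncthreads();

        while (current >= 0) {
            // Cooperative, coalesced load: thread i fetches word i of the node,
            // so the whole node arrives in a single round trip to memory.
            if (lane < NODE_WORDS)
                node[lane] = bvhNodes[current * NODE_WORDS + lane];
            __syncthreads();

            // ... every thread intersects its ray with the shared node data and
            //     the packet agrees on a near/far order (as on the previous slide) ...
            int nearChild = -1, farChild = -1;       // placeholders for that result

            if (lane == 0) {                         // only thread 0 touches the stack
                if (farChild >= 0) stack[top++] = farChild;
                current = (nearChild >= 0) ? nearChild
                        : (top > 0 ? stack[--top] : -1);   // descend, pop, or stop
            }
            __syncthreads();
        }
    }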
14. Results
Scene                   | Tris  | FPS Primary (1K²) | FPS Shading (1K²)
Conference              | 282K  | 16 (19)           | 6.1
Conference (with ropes) | 282K  | 16.7              | 6.7
Soda Hall               | 2.1M  | 13.6 (16.2)       | 5.7
Power Plant Outside     | 12.7M | 6.4               | 2.9
Power Plant Furnace     | 12.7M | 1.9               |
15. Analysis
- Coherent branch decisions / memory access
- Small footprint of the data structure
- Can trace models of up to 12 million triangles
- Program becomes compute bound
- Determined by over/under-clocking the core/memory
- No frustums required
- Good for secondary rays, bad for primary
- Can use rasterization for primary rays
- Implicit SIMD → easy shader programming
- Running on a GPU → shading for free
16. Dynamic Scenes
- Update parts of / the whole BVH and geometry on the GPU
- Use the GPU for RT and the CPU for BVH construction / refitting
- Construct the BVH using binning (see the sketch below)
- Similar to Wald RT07 / Popov RT06
- Bin all 3 dimensions using SIMD
- Results in > 10% better trees
- Measured as SAH quality, not FPS
- Speed loss is almost negligible
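As an illustration of binned construction, the following is a hedged host-side C++ sketch of a binned SAH split search along one axis, in the spirit of the referenced Wald RT07 / Popov RT06 approach; AABB, the bin count, and the cost model are simplified placeholders rather than the authors' implementation.

    // Hedged sketch: binned SAH split selection along one axis (host code).
    #include <vector>
    #include <algorithm>
    #include <limits>

    struct AABB {
        float lo[3] = { 1e30f,  1e30f,  1e30f};
        float hi[3] = {-1e30f, -1e30f, -1e30f};
        void grow(const AABB &b) {
            for (int k = 0; k < 3; ++k) {
                lo[k] = std::min(lo[k], b.lo[k]);
                hi[k] = std::max(hi[k], b.hi[k]);
            }
        }
        float area() const {
            float d0 = hi[0]-lo[0], d1 = hi[1]-lo[1], d2 = hi[2]-lo[2];
            return 2.0f * (d0*d1 + d1*d2 + d2*d0);
        }
        float center(int axis) const { return 0.5f * (lo[axis] + hi[axis]); }
    };

    // Returns the index of the best bin boundary along 'axis' under the SAH,
    // or -1 if no useful split was found.
    int bestBinnedSplit(const std::vector<AABB> &prims, const AABB &bounds,
                        int axis, int numBins = 16)
    {
        std::vector<AABB> binBox(numBins);
        std::vector<int>  binCnt(numBins, 0);
        float k = numBins / (bounds.hi[axis] - bounds.lo[axis] + 1e-20f);

        // Project each primitive's centroid into a bin and grow that bin's box.
        for (const AABB &p : prims) {
            int b = std::min(numBins - 1,
                             (int)(k * (p.center(axis) - bounds.lo[axis])));
            binCnt[b]++;
            binBox[b].grow(p);
        }

        // Sweep the bin boundaries and evaluate the SAH cost of each split.
        int bestSplit = -1;
        float bestCost = std::numeric_limits<float>::max();
        for (int split = 1; split < numBins; ++split) {
            AABB leftBox, rightBox;
            int leftCnt = 0, rightCnt = 0;
            for (int b = 0; b < split; ++b)       { leftBox.grow(binBox[b]);  leftCnt  += binCnt[b]; }
            for (int b = split; b < numBins; ++b) { rightBox.grow(binBox[b]); rightCnt += binCnt[b]; }
            float cost = leftCnt * leftBox.area() + rightCnt * rightBox.area();
            if (leftCnt && rightCnt && cost < bestCost) { bestCost = cost; bestSplit = split; }
        }
        return bestSplit;
    }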
17. Results
Scene       | Tris  | Exact SAH Speed | Binning 1D Speed | Binning 1D Quality | Binning 3D Speed | Binning 3D Quality
Conference  | 282K  | 0.8 s           | 0.15 s           | 92.5%              | 0.2 s            | 99.4%
Soda Hall   | 2.1M  | 8.78 s          | 1.28 s           | 103.5%             | 1.59 s           | 101.6%
Power Plant | 12.7M | 119 s           | 6.6 s            | 99.4%              | 8.1 s            | 100.5%
Boeing      | 348M  | 5605 s          | 572 s            | 94.8%              | 667 s            | 98.1%
18. Conclusions
- New recursive PRAM BVH traversal algorithm
- Very well suited for the new generation of GPUs
- No additional pre-computed data required
- First GPU ray tracer to handle large models
- Previous implementations were limited to < 300K triangles
- Can handle dynamic scenes
- By using the CPU to update the geometry / BVH
19. Future Work
- More features
- Shaders, adaptive anti-aliasing, ...
- Global illumination
- Code optimizations
- Current implementation uses too many registers
21. CUDA Hello World
    #include <cstdlib>                  // malloc, rand

    __global__ void addArrays(int *arr1, int *arr2)
    {
        // One thread per element: global thread index.
        unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
        arr1[t] += arr2[t];
    }

    int main()
    {
        const int N = 4194304;          // number of elements
        int *inArr1 = (int*)malloc(N * sizeof(int));
        int *inArr2 = (int*)malloc(N * sizeof(int));
        int *ta1, *ta2;
        cudaMalloc((void**)&ta1, N * sizeof(int));
        cudaMalloc((void**)&ta2, N * sizeof(int));
        for (int i = 0; i < N; i++) {
            inArr1[i] = rand();
            inArr2[i] = rand();
        }
        cudaMemcpy(ta1, inArr1, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(ta2, inArr2, N * sizeof(int), cudaMemcpyHostToDevice);

        // Launch N / 512 blocks of 512 threads each; thread t adds element t.
        addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);
    }