Title: Real-time Ray Tracing on GPU with BVH-based Packet Traversal
1. Real-time Ray Tracing on GPU with BVH-based Packet Traversal
- Stefan Popov, Johannes Günther, Hans-Peter Seidel, Philipp Slusallek
2. Background
- GPUs attractive for ray tracing
- High computational power
- Shading oriented architecture
- GPU ray tracers
- Carr: the Ray Engine
- Purcell: full ray tracing on the GPU, based on grids
- Ernst: KD-trees with a parallel stack
- Carr, Thrane & Simonsen: BVHs
- Foley, Horn, Popov: KD-trees with stackless traversal
3. Motivation
- So far
- Interactive RT on GPU, but
- Limited model size
- No dynamic scene support
- The G80: a new approach to the GPU
- High-performance general-purpose processor with graphics extensions
- PRAM architecture
- BVHs allow for
- Dynamic/deformable scenes
- Small memory footprint
- Goal: recursive ordered traversal of BVHs on the G80
4. GPU Architecture (G80)
- Multi-threaded scalar architecture
- 12K HW threads
- Threads cover latencies
- Off-chip memory ops
- Instruction dependencies
- 4 or 16 cycles to issue instr.
- 16 (multi-)cores
- 8-wide SIMD
- 128 scalar cores in total
- Cores process threads in 32-wide SIMD chunks
5. GPU Architecture (G80)
- Scalar register file (8K)
- Partitioned among running threads
- Shared memory (16KB)
- On-chip, 0 cycle latency
- On-board memory (768MB)
- Large latency (~200 cycles)
- R/W from within a thread
- Uncached
- Read-only L2 cache (128KB)
- On chip, shared among all threads
6. Programming the G80
- CUDA
- C-based language with parallel extensions
- GPU utilization at 100% only if
- Enough threads are present (>> 12K)
- Every thread uses fewer than 10 registers and 5 words (32-bit) of shared memory
- Enough computations per transferred word of data
- Bandwidth << computational power
- Adequate memory access pattern to allow read combining
7. Performance Bottlenecks
- Efficient per-thread stack implementation
- Shared memory is too small - it would limit parallelism
- On-board memory is uncached
- Need enough computations between stack ops
- Efficient memory access pattern
- Use texture caches
- However, only a few words of cache per thread
- Read successive memory locations in successive threads of a chunk (see the sketch below)
- Single round trip to memory (read combining)
- Cover latency with enough computations
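To illustrate the read-combining pattern, here is a minimal, hedged CUDA sketch (not from the slides; the kernel name and buffers are hypothetical). The point is only that consecutive threads of a chunk touch consecutive 32-bit words, so their loads can be served by a single memory transaction.

    // Hedged sketch: read combining on the G80.
    // Thread i of a chunk reads word i, so the chunk's 32 loads fall into one
    // contiguous segment and need a single round trip to memory.
    __global__ void coalescedCopy(const int *src, int *dst)
    {
        unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
        dst[t] = src[t];        // consecutive threads -> consecutive addresses
    }
    // A strided pattern such as src[t * 16] would instead issue separate
    // transactions per thread and waste bandwidth.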
8. Ray Tracing on the G80
- Map each ray to one thread
- Enough threads to keep the GPU busy
- Recursive ray tracing
- Use a per-thread stack stored in on-board memory (see the kernel skeleton below)
- Efficient, since enough computations are present
- But how to do the traversal?
- Skip pointers (Thrane): no ordered traversal
- Geometry images (Carr): single mesh only
- Shared stack traversal
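Below is a hedged kernel skeleton for the ray-to-thread mapping with a per-thread stack in on-board memory; Ray, Node, STACK_DEPTH, and the result buffer are simplified placeholders, not the authors' data structures.

    // Sketch only: one thread per ray, private traversal stack in on-board memory.
    struct Ray  { float ox, oy, oz, dx, dy, dz; };
    struct Node { int data[8]; };               // placeholder BVH node layout

    #define STACK_DEPTH 64

    __global__ void traceRays(const Ray *rays, const Node *bvh,
                              int *stackPool, float *result, int numRays)
    {
        unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
        if (t >= numRays) return;

        int *stack = stackPool + t * STACK_DEPTH;   // this thread's stack slice
        int  top   = 0;
        stack[top++] = 0;                           // push the BVH root

        Ray   ray     = rays[t];
        float closest = 1e30f;

        while (top > 0) {
            int nodeIdx = stack[--top];             // pop the next node
            // ... intersect 'ray' with bvh[nodeIdx]; push children in front-to-back
            //     order or intersect leaf geometry, updating 'closest' ...
        }
        result[t] = closest;
    }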
9. SIMD Packet Traversal of BVH
- Traverse a node with the whole packet
- At an internal node
- Intersect all rays with both children and determine the traversal order
- Push the far child (if any) on a stack and descend to the near one with the packet
- At a leaf
- Intersect all rays with the contained geometry
- Pop the next node to visit from the stack (a traversal loop is sketched below)
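The loop below is a hedged, host-style C++ sketch of this packet traversal; intersectChildren and intersectLeaf are hypothetical helpers (declared but not defined here), and Packet stands in for whatever ray-bundle representation is used.

    #include <vector>

    struct Packet;                         // a bundle of rays traced together
    struct BVHNode {
        bool isLeaf;
        int  left, right;                  // children of an inner node
        int  firstTri, numTris;            // triangle range of a leaf
    };

    // Assumed helpers: test the packet against both children (reporting hit
    // flags and a near/far order) and against the triangles of a leaf.
    void intersectChildren(const BVHNode&, Packet&, int &near, int &far,
                           bool &hitNear, bool &hitFar);
    void intersectLeaf(const BVHNode&, Packet&);

    void traversePacket(const std::vector<BVHNode> &nodes, Packet &packet)
    {
        std::vector<int> stack;
        int current = 0;                                  // start at the root
        for (;;) {
            const BVHNode &node = nodes[current];
            if (!node.isLeaf) {
                int near, far; bool hitNear, hitFar;
                intersectChildren(node, packet, near, far, hitNear, hitFar);
                if (hitNear) {
                    if (hitFar) stack.push_back(far);     // visit far child later
                    current = near;                       // descend to near child
                    continue;
                }
                if (hitFar) { current = far; continue; }
            } else {
                intersectLeaf(node, packet);              // test contained geometry
            }
            if (stack.empty()) break;                     // traversal done
            current = stack.back(); stack.pop_back();     // pop next node to visit
        }
    }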
10. PRAM Basics
- The PRAM model
- Implicitly synchronized processors (threads)
- Shared memory between all processors
- Basic PRAM operations (sketched below)
- Parallel OR in O(1)
- Parallel reduction in O(log N)
[Figure: worked examples of a parallel OR over boolean flags and a parallel sum reduction tree]
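A hedged CUDA sketch of these two primitives for one 32-thread chunk, using shared memory; the kernel name, buffers, and the assumption of a 32-thread block are illustrative, not the authors' implementation.

    // Hedged sketch: chunk-level parallel OR and parallel sum reduction.
    // Assumes one block of CHUNK_SIZE threads (one chunk) per call.
    #define CHUNK_SIZE 32

    __global__ void pramPrimitives(const int *flags, const int *values,
                                   int *orOut, int *sumOut)
    {
        __shared__ int anySet;                  // parallel OR result
        __shared__ int partial[CHUNK_SIZE];     // workspace for the sum reduction
        unsigned lane = threadIdx.x;

        // Parallel OR in O(1): every thread with a set flag writes the same value.
        if (lane == 0) anySet = 0;
        __syncthreads();
        if (flags[lane]) anySet = 1;            // concurrent writes of the same value
        __syncthreads();

        // Parallel sum reduction in O(log N): halve the active range each step.
        partial[lane] = values[lane];
        __syncthreads();
        for (int stride = CHUNK_SIZE / 2; stride > 0; stride /= 2) {
            if (lane < stride) partial[lane] += partial[lane + stride];
            __syncthreads();
        }

        if (lane == 0) { *orOut = anySet; *sumOut = partial[0]; }
    }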
11. PRAM Packet Traversal of BVH
- The G80 is a PRAM machine at the chunk level
- Map packet → chunk, ray → thread
- Threads behave as in the single-ray traversal
- At a leaf: intersect with geometry, pop the next node from the stack
- At a node: decide which children to visit and in what order, push the far child
- Difference
- How rays choose which node to visit first
- Might not be the one they individually prefer
12. PRAM Packet Traversal of BVH
- Choose the child traversal order (see the sketch below)
- PRAM OR to determine if all rays agree on visiting the same node first
- The result is stored in shared memory
- In case of divergence, choose the child with more ray candidates
- Use a PRAM SUM over +/-1 for each thread, -1 → left node
- Look at the result's sign
- Guarantees synchronous traversal of the BVH
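The device function below is a hedged sketch of this decision (illustrative names, assuming one 32-thread chunk per block; not the authors' code): an OR per child detects agreement, and on divergence a signed sum of +/-1 votes picks the majority child.

    #define CHUNK_SIZE 32
    #define LEFT  (-1)
    #define RIGHT (+1)
    #define NONE  0

    // Returns the child the whole packet descends into first.
    __device__ int chooseFirstChild(int myPreference /* LEFT, RIGHT or NONE */)
    {
        __shared__ int wantsLeft, wantsRight;   // parallel ORs, one per child
        __shared__ int votes[CHUNK_SIZE];       // workspace for the signed sum
        unsigned lane = threadIdx.x;

        if (lane == 0) { wantsLeft = 0; wantsRight = 0; }
        __syncthreads();
        if (myPreference == LEFT)  wantsLeft  = 1;   // PRAM OR: concurrent writes
        if (myPreference == RIGHT) wantsRight = 1;   //          of the same value
        __syncthreads();

        // All candidate rays agree -> take that child first.
        if (!wantsRight) return LEFT;
        if (!wantsLeft)  return RIGHT;

        // Divergence: PRAM SUM of +/-1 votes, then follow the sign (majority).
        votes[lane] = myPreference;                  // -1, 0 or +1 per thread
        __syncthreads();
        for (int stride = CHUNK_SIZE / 2; stride > 0; stride /= 2) {
            if (lane < stride) votes[lane] += votes[lane + stride];
            __syncthreads();
        }
        return (votes[0] <= 0) ? LEFT : RIGHT;       // same answer in every thread
    }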
13. PRAM Packet Traversal of BVH
- Stack
- The near / far child is the same for all threads → store it once
- Keep the stack in shared memory - only a few bits per thread!
- Only thread 0 does all stack ops (see the sketch below)
- Reading data
- All threads work with the same node / triangle
- Sequential threads bring in sequential words
- Single load operation, single round trip to memory
- Implementable in CUDA
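Here is a hedged sketch combining the two ideas (a shared per-packet stack with thread 0 doing all stack ops, and a cooperative node fetch where sequential threads read sequential words); the node layout, stack depth, and kernel structure are placeholders, not the authors' implementation.

    #define STACK_DEPTH 64
    #define NODE_WORDS  8                   // assumed BVH node size in 32-bit words

    __global__ void packetTraversal(const int *bvhNodes)
    {
        __shared__ int stack[STACK_DEPTH];  // one stack for the whole packet
        __shared__ int top;
        __shared__ int node[NODE_WORDS];    // node data shared by all rays
        __shared__ int current;
        unsigned lane = threadIdx.x;

        if (lane == 0) { top = 0; current = 0; }     // start at the root
        __syncthreads();

        while (current >= 0) {
            // Cooperative, coalesced load: thread i fetches word i of the node,
            // so the whole node arrives in a single round trip to memory.
            if (lane < NODE_WORDS)
                node[lane] = bvhNodes[current * NODE_WORDS + lane];
            __syncthreads();

            // ... every thread intersects its ray with the shared node data and
            //     the packet agrees on a near/far order (as on the previous slide) ...
            int nearChild = -1, farChild = -1;       // placeholders for that result

            if (lane == 0) {                         // only thread 0 touches the stack
                if (farChild >= 0) stack[top++] = farChild;
                current = (nearChild >= 0) ? nearChild
                        : (top > 0 ? stack[--top] : -1);   // descend, pop, or stop
            }
            __syncthreads();
        }
    }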
14. Results
Scene                   | Tris  | FPS Primary (1K²) | FPS Shading (1K²)
Conference              | 282K  | 16 (19)           | 6.1
Conference (with ropes) | 282K  | 16.7              | 6.7
Soda Hall               | 2.1M  | 13.6 (16.2)       | 5.7
Power Plant Outside     | 12.7M | 6.4               | 2.9
Power Plant Furnace     | 12.7M | 1.9               |
15. Analysis
- Coherent branch decisions / memory access
- Small footprint of the data structure
- Can trace models of up to 12 million triangles
- Program becomes compute bound
- Determined by over/under-clocking the core/memory
- No frustums required
- Good for secondary rays, bad for primary
- Can use rasterization for primary rays
- Implicit SIMD → easy shader programming
- Running on a GPU → shading for free
16. Dynamic Scenes
- Update parts of / the whole BVH and geometry on the GPU
- Use the GPU for RT and the CPU for BVH construction / refitting
- Construct the BVH using binning (see the sketch below)
- Similar to Wald RT07 / Popov RT06
- Bin all 3 dimensions using SIMD
- Results in > 10% better trees
- Measured as SAH quality, not FPS
- Speed loss is almost negligible
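As an illustration of binned construction, the following is a hedged host-side C++ sketch of a binned SAH split search along one axis, in the spirit of the referenced Wald RT07 / Popov RT06 approach; AABB, the bin count, and the cost model are simplified placeholders rather than the authors' implementation.

    // Hedged sketch: binned SAH split selection along one axis (host code).
    #include <vector>
    #include <algorithm>
    #include <limits>

    struct AABB {
        float lo[3] = { 1e30f,  1e30f,  1e30f};
        float hi[3] = {-1e30f, -1e30f, -1e30f};
        void grow(const AABB &b) {
            for (int k = 0; k < 3; ++k) {
                lo[k] = std::min(lo[k], b.lo[k]);
                hi[k] = std::max(hi[k], b.hi[k]);
            }
        }
        float area() const {
            float d0 = hi[0]-lo[0], d1 = hi[1]-lo[1], d2 = hi[2]-lo[2];
            return 2.0f * (d0*d1 + d1*d2 + d2*d0);
        }
        float center(int axis) const { return 0.5f * (lo[axis] + hi[axis]); }
    };

    // Returns the index of the best bin boundary along 'axis' under the SAH,
    // or -1 if no useful split was found.
    int bestBinnedSplit(const std::vector<AABB> &prims, const AABB &bounds,
                        int axis, int numBins = 16)
    {
        std::vector<AABB> binBox(numBins);
        std::vector<int>  binCnt(numBins, 0);
        float k = numBins / (bounds.hi[axis] - bounds.lo[axis] + 1e-20f);

        // Project each primitive's centroid into a bin and grow that bin's box.
        for (const AABB &p : prims) {
            int b = std::min(numBins - 1,
                             (int)(k * (p.center(axis) - bounds.lo[axis])));
            binCnt[b]++;
            binBox[b].grow(p);
        }

        // Sweep the bin boundaries and evaluate the SAH cost of each split.
        int bestSplit = -1;
        float bestCost = std::numeric_limits<float>::max();
        for (int split = 1; split < numBins; ++split) {
            AABB leftBox, rightBox;
            int leftCnt = 0, rightCnt = 0;
            for (int b = 0; b < split; ++b)       { leftBox.grow(binBox[b]);  leftCnt  += binCnt[b]; }
            for (int b = split; b < numBins; ++b) { rightBox.grow(binBox[b]); rightCnt += binCnt[b]; }
            float cost = leftCnt * leftBox.area() + rightCnt * rightBox.area();
            if (leftCnt && rightCnt && cost < bestCost) { bestCost = cost; bestSplit = split; }
        }
        return bestSplit;
    }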
17. Results
Scene       | Tris  | Exact SAH Speed | Binning 1D Speed | Binning 1D Quality | Binning 3D Speed | Binning 3D Quality
Conference  | 282K  | 0.8 s           | 0.15 s           | 92.5%              | 0.2 s            | 99.4%
Soda Hall   | 2.1M  | 8.78 s          | 1.28 s           | 103.5%             | 1.59 s           | 101.6%
Power Plant | 12.7M | 119 s           | 6.6 s            | 99.4%              | 8.1 s            | 100.5%
Boeing      | 348M  | 5605 s          | 572 s            | 94.8%              | 667 s            | 98.1%
18. Conclusions
- New recursive PRAM BVH traversal algorithm
- Very well suited for the new generation of GPUs
- No additional pre-computed data required
- First GPU ray tracer to handle large models
- Previous implementations were limited to < 300K triangles
- Can handle dynamic scenes
- By using the CPU to update the geometry / BVH
19. Future Work
- More features
- Shaders, adaptive anti-aliasing, ...
- Global illumination
- Code optimizations
- Current implementation uses too many registers
21. CUDA Hello World
    #include <cstdlib>                  // malloc, rand

    __global__ void addArrays(int *arr1, int *arr2)
    {
        // One thread per element: global thread index.
        unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
        arr1[t] += arr2[t];
    }

    int main()
    {
        const int N = 4194304;          // number of elements
        int *inArr1 = (int*)malloc(N * sizeof(int));
        int *inArr2 = (int*)malloc(N * sizeof(int));
        int *ta1, *ta2;
        cudaMalloc((void**)&ta1, N * sizeof(int));
        cudaMalloc((void**)&ta2, N * sizeof(int));
        for (int i = 0; i < N; i++) {
            inArr1[i] = rand();
            inArr2[i] = rand();
        }
        cudaMemcpy(ta1, inArr1, N * sizeof(int), cudaMemcpyHostToDevice);
        cudaMemcpy(ta2, inArr2, N * sizeof(int), cudaMemcpyHostToDevice);

        // Launch N / 512 blocks of 512 threads each; thread t adds element t.
        addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);
    }