Transcript and Presenter's Notes

Title: Real-time Ray Tracing on GPU with BVH-based Packet Traversal


1
Real-time Ray Tracing on GPU with BVH-based
Packet Traversal
  • Stefan Popov, Johannes Günther, Hans-Peter
    Seidel, Philipp Slusallek

2
Background
  • GPUs attractive for ray tracing
  • High computational power
  • Shading-oriented architecture
  • GPU ray tracers
  • Carr: The Ray Engine
  • Purcell: full ray tracing on the GPU, based on
    grids
  • Ernst: kd-trees with a parallel stack
  • Carr, Thrane & Simonsen: BVHs
  • Foley, Horn, Popov: kd-trees with stackless
    traversal

3
Motivation
  • So far:
  • Interactive RT on the GPU, but
  • Limited model size
  • No dynamic scene support
  • The G80: a new approach to the GPU
  • High-performance general-purpose processor with
    graphics extensions
  • PRAM architecture
  • BVHs allow for
  • Dynamic/deformable scenes
  • Small memory footprint
  • Goal: recursive ordered traversal of BVHs on the
    G80

4
GPU Architecture (G80)
  • Multi-threaded scalar architecture
  • 12K HW threads
  • Threads cover latencies
  • Off-chip memory ops
  • Instruction dependencies
  • 4 or 16 cycles to issue an instruction
  • 16 (multi-)cores
  • 8-wide SIMD
  • 128 scalar cores in total
  • Cores process threads in 32 wide SIMD chunks

5
GPU Architecture (G80)
  • Scalar register file (8K)
  • Partitioned among running threads
  • Shared memory (16KB)
  • On-chip, 0 cycle latency
  • On-board memory (768MB)
  • Large latency (~200 cycles)
  • R/W from within thread
  • Un-cached
  • Read-only L2 cache (128KB)
  • On chip, shared among all threads

6
Programming the G80
  • CUDA
  • C based language with parallel extensions
  • GPU utilization at 100% only if
  • Enough threads are present (>> 12K)
  • Every thread uses less than 10 registers and 5
    words (32-bit) of shared memory
  • Enough computations per transferred word of data
  • Bandwidth << computational power
  • Adequate memory access pattern to allow read
    combining

7
Performance Bottlenecks
  • Efficient per-thread stack implementation
  • Shared memory is too small and would limit
    parallelism
  • On-board memory is uncached
  • Need enough computations between stack ops
  • Efficient memory access pattern
  • Use texture caches
  • However, only a few words of cache per thread
  • Read successive memory locations in successive
    threads of a chunk
  • Single roundtrip to memory (read combining)
  • Cover latency with enough computations

8
Ray Tracing on the G80
  • Map each ray to one thread
  • Enough threads to keep the GPU busy
  • Recursive ray tracing
  • Use a per-thread stack stored in on-board memory
  • Efficient, since enough computations are present
  • But how to do the traversal ?
  • Skip pointers (Thrane): no ordered traversal
  • Geometry images (Carr): single mesh only
  • Shared stack traversal

9
SIMD Packet Traversal of BVH
  • Traverse a node with the whole packet
  • At an internal node:
  • Intersect all rays with both children and
    determine traversal order
  • Push the far child (if any) on a stack and descend to
    the near one with the packet
  • At a leaf:
  • Intersect all rays with the contained geometry
  • Pop the next node to visit from the stack (a
    sketch of this loop follows below)
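
A minimal sketch of this traversal loop, seen from a single ray's
perspective. BVHNode, Ray, Hit and the helpers isLeaf, orderChildren and
intersectLeafTriangles are illustrative assumptions, not the authors' code,
and the per-packet shared stack described on later slides is simplified
here to a private array.

// Hedged sketch of the packet traversal loop, single-ray view; all types
// and helper functions are illustrative assumptions.
__device__ void traverse(const Ray& ray, const BVHNode* nodes, Hit& hit)
{
    int stack[64];               // simplified: per-ray stack
    int top = 0;
    int node = 0;                // start at the root
    for (;;) {
        if (isLeaf(nodes[node])) {
            intersectLeafTriangles(nodes[node], ray, hit);
            if (top == 0) break;         // nothing left to visit
            node = stack[--top];         // pop the next node
        } else {
            int nearChild, farChild;     // -1 if the ray misses that child
            orderChildren(nodes[node], ray, nearChild, farChild);
            if (nearChild < 0) {         // both children missed
                if (top == 0) break;
                node = stack[--top];
            } else {
                if (farChild >= 0)
                    stack[top++] = farChild;  // push the far child
                node = nearChild;             // descend into the near child
            }
        }
    }
}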

10
PRAM Basics
  • The PRAM model
  • Implicitly synchronized processors (threads)
  • Shared memory between all processors
  • Basic PRAM operations
  • Parallel OR in O(1)
  • Parallel reduction in O(log N)

(Figure: parallel OR over per-thread boolean flags, and a parallel sum
reduction tree over an array of values)
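
To make the two primitives concrete, here is a minimal CUDA sketch for one
32-thread chunk, relying on the chunk-level lockstep execution of the G80,
so no explicit synchronization appears. parallelOr, parallelSum and the
shared buffer sdata are illustrative names, not code from the paper.

// Hedged sketch: PRAM OR and PRAM SUM for one 32-thread chunk.
__device__ bool parallelOr(volatile int* sdata, int tid, bool myFlag)
{
    if (tid == 0) sdata[0] = 0;   // clear the shared flag
    if (myFlag)   sdata[0] = 1;   // any thread may set it: O(1)
    return sdata[0] != 0;
}

__device__ int parallelSum(volatile int* sdata, int tid, int myVal)
{
    sdata[tid] = myVal;
    for (int s = 16; s > 0; s >>= 1)   // tree reduction: O(log N) steps
        if (tid < s)
            sdata[tid] += sdata[tid + s];
    return sdata[0];                   // every thread reads the total
}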
11
PRAM Packet Traversal of BVH
  • The G80 is a PRAM machine at chunk level
  • Map packet → chunk, ray → thread
  • Threads behave as in the single-ray traversal
  • At a leaf: intersect with geometry, pop the next
    node from the stack
  • At a node: decide which children to visit and in
    what order, push the far child
  • Difference:
  • How the rays choose which node to visit first
  • It might not be the one they want to visit

12
PRAM Packet Traversal of BVH
  • Choose the child traversal order
  • PRAM OR to determine if all rays agree on
    visiting the same node first
  • The result is stored in shared memory
  • In case of divergence, choose the child with more
    ray candidates
  • Use PRAM SUM over +/-1 for each thread, -1 → left
    node
  • Look at the result's sign
  • Guarantees synchronous traversal of the BVH (see
    the sketch below)
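
One possible encoding of this vote, building on the parallelOr/parallelSum
sketches from the PRAM basics slide; the function and constant names are
assumptions for illustration, not the paper's implementation.

// Hedged sketch of the child-order vote. wantLeftFirst is each ray's
// preferred first child; LEFT/RIGHT are illustrative constants.
enum { LEFT = 0, RIGHT = 1 };

__device__ int voteChildOrder(volatile int* sdata, int tid, bool wantLeftFirst)
{
    // Two PRAM ORs tell us whether the packet is unanimous.
    bool anyLeft  = parallelOr(sdata, tid,  wantLeftFirst);
    bool anyRight = parallelOr(sdata, tid, !wantLeftFirst);
    if (anyLeft != anyRight)
        return anyLeft ? LEFT : RIGHT;          // all rays agree
    // Divergence: majority vote via PRAM SUM over +/-1 (-1 = left first).
    int sum = parallelSum(sdata, tid, wantLeftFirst ? -1 : +1);
    return (sum < 0) ? LEFT : RIGHT;            // the sign decides
}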

13
PRAM Packet Traversal of BVH
  • Stack
  • The near/far child is the same for all threads →
    store it once
  • Keep the stack in shared memory. Only a few bits
    per thread!
  • Only thread 0 does all stack ops
  • Reading data
  • All threads work with the same node / triangle
  • Sequential threads bring in sequential words
  • Single load operation, single round trip to
    memory
  • Implementable in CUDA (see the sketch below)
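
For illustration, a sketch of the two points above: thread 0 owning the
shared per-packet stack, and a coalesced node fetch where successive
threads read successive words. NODE_WORDS, the stack layout and all names
are assumptions, not the paper's code.

// Hedged sketch: per-packet shared stack and coalesced node fetch.
#define NODE_WORDS 8   // assumed size of one BVH node in 32-bit words

__global__ void traceKernel(const float* nodes)
{
    __shared__ int   stack[64];        // one stack for the whole packet
    __shared__ int   stackTop;
    __shared__ float node[NODE_WORDS]; // staging area for the current node
    int tid = threadIdx.x;

    if (tid == 0) stackTop = 0;        // thread 0 owns the stack

    // Push: the near/far decision is identical across the packet, so the
    // far child's index is stored once and only thread 0 touches the stack.
    int farChild = 1;                  // would come from the packet's vote
    if (tid == 0)
        stack[stackTop++] = farChild;

    // Fetch: successive threads read successive words of the same node,
    // which the hardware combines into one round trip (read combining).
    int current = 0;                   // index of the node to visit next
    if (tid < NODE_WORDS)
        node[tid] = nodes[current * NODE_WORDS + tid];
}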

14
Results
Scene                     Tris     FPS Primary (1K²)   FPS Shading (1K²)
Conference                282K     16 (19)             6.1
Conference (with ropes)   282K     16.7                6.7
Soda Hall                 2.1M     13.6 (16.2)         5.7
Power Plant Outside       12.7M    6.4                 2.9
Power Plant Furnace       12.7M    1.9                 -
15
Analysis
  • Coherent branch decisions / memory access
  • Small footprint of the data structure
  • Can trace models with up to 12 million triangles
  • Program becomes compute bound
  • Determined by over/under-clocking the core/memory
  • No frustums required
  • Good for secondary rays, bad for primary
  • Can use rasterization for primary rays
  • Implicit SIMD: easy shader programming
  • Running on a GPU: shading for free

16
Dynamic Scenes
  • Update parts of / the whole BVH and geometry on
    the GPU
  • Use the GPU for RT and the CPU for BVH
    construction / refitting
  • Construct the BVH using binning (see the sketch
    below)
  • Similar to Wald RT07 / Popov RT06
  • Bin all 3 dimensions using SIMD
  • Results in > 10% better trees
  • Measured as SAH quality, not FPS
  • Speed loss is almost negligible
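
As a CPU-side illustration of the binning step, here is a compact sketch of
binned SAH plane selection along one axis, in the spirit of the Wald RT07 /
Popov RT06 builders cited above. The AABB type, the bin count of 16, and
the cost formula details are assumptions for illustration, not the
construction code measured on the next slide.

#include <algorithm>
#include <cfloat>

const int NUM_BINS = 16;

struct AABB {
    float lo[3], hi[3];
    AABB() { for (int a = 0; a < 3; ++a) { lo[a] = FLT_MAX; hi[a] = -FLT_MAX; } }
    void extend(const AABB& o) {
        for (int a = 0; a < 3; ++a) {
            lo[a] = std::min(lo[a], o.lo[a]);
            hi[a] = std::max(hi[a], o.hi[a]);
        }
    }
    float area() const {               // surface area for the SAH
        float dx = hi[0] - lo[0], dy = hi[1] - lo[1], dz = hi[2] - lo[2];
        return 2.0f * (dx * dy + dy * dz + dz * dx);
    }
};

// Cost of the best of the NUM_BINS-1 candidate planes along one axis.
// tri[i] is the i-th triangle's bounds, c[i] its centroid coordinate on
// this axis; cmin/cmax bound the centroids (cmax > cmin assumed).
float binnedSAH(const AABB* tri, const float* c, int n,
                float cmin, float cmax, int* bestPlane)
{
    int  cnt[NUM_BINS] = {};
    AABB box[NUM_BINS];
    float scale = NUM_BINS / (cmax - cmin);
    for (int i = 0; i < n; ++i) {      // scatter triangles into bins
        int b = std::min(NUM_BINS - 1, (int)((c[i] - cmin) * scale));
        cnt[b]++;
        box[b].extend(tri[i]);
    }
    AABB rbox[NUM_BINS]; int rcnt[NUM_BINS];
    AABB acc; int rc = 0;              // suffix pass: right-side bounds
    for (int b = NUM_BINS - 1; b > 0; --b) {
        acc.extend(box[b]); rc += cnt[b];
        rbox[b] = acc; rcnt[b] = rc;
    }
    float best = FLT_MAX;              // prefix sweep over the planes
    AABB lbox; int lc = 0;
    for (int b = 1; b < NUM_BINS; ++b) {
        lbox.extend(box[b - 1]); lc += cnt[b - 1];
        if (lc == 0 || rcnt[b] == 0) continue;   // skip empty sides
        float cost = lbox.area() * lc + rbox[b].area() * rcnt[b];
        if (cost < best) { best = cost; *bestPlane = b; }
    }
    return best;
}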

17
Results
Scene         Tris     Exact SAH    Binning 1D             Binning 3D
                       Speed        Speed     Quality      Speed     Quality
Conference    282K     0.8 s        0.15 s    92.5%        0.2 s     99.4%
Soda Hall     2.1M     8.78 s       1.28 s    103.5%       1.59 s    101.6%
Power Plant   12.7M    119 s        6.6 s     99.4%        8.1 s     100.5%
Boeing        348M     5605 s       572 s     94.8%        667 s     98.1%
18
Conclusions
  • New recursive PRAM BVH traversal algorithm
  • Very well suited for the new generation of GPUs
  • No additional pre-computed data required
  • First GPU ray tracer to handle large models
  • Previous implementations were limited to < 300K
    triangles
  • Can handle dynamic scenes
  • By using the CPU to update the geometry / BVH

19
Future Work
  • More features
  • Shaders, adaptive anti-aliasing, ...
  • Global illumination
  • Code optimizations
  • Current implementation uses too many registers

20
  • Thank you!

21
CUDA Hello World
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void addArrays(int* arr1, int* arr2)
{
    // One thread per element: global thread index.
    unsigned t = threadIdx.x + blockIdx.x * blockDim.x;
    arr1[t] += arr2[t];
}

int main()
{
    const int N = 4194304;   // number of elements
    int* inArr1 = (int*)malloc(N * sizeof(int));
    int* inArr2 = (int*)malloc(N * sizeof(int));
    int *ta1, *ta2;
    cudaMalloc((void**)&ta1, N * sizeof(int));
    cudaMalloc((void**)&ta2, N * sizeof(int));
    for (int i = 0; i < N; i++) {
        inArr1[i] = rand();
        inArr2[i] = rand();
    }
    cudaMemcpy(ta1, inArr1, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(ta2, inArr2, N * sizeof(int), cudaMemcpyHostToDevice);
    addArrays<<<dim3(N / 512, 1, 1), dim3(512, 1, 1)>>>(ta1, ta2);
    return 0;
}