Global Illumination on the GPU: Lessons Learned - PowerPoint PPT Presentation

About This Presentation

Title:

Global Illumination on the GPU: Lessons Learned

Slides: 44

Provided by: Natha57

Transcript and Presenter's Notes

1
Global Illumination on the GPU: Lessons Learned

John C. Hart
University of Illinois

2
What do NVidia U?

It's all about porting and speed
How to implement on the GPU, especially NVidia's
How to make it run as fast (& competitive) as possible
Research papers leave out grodie details
Won't be important five years from now
But they are very important right now
This is where we discuss grodie details
Texture cache size, organization
Tricks that probably won't work next year

3
How to NVidia U.

Ask lots of questions
Chat between talks
Stuff you can know and stuff you can't know
NDA v. No @ Way
Don't expect to change the hardware
Dirty little secrets to getting code to run fast
Send interns
Computational Pantheism
Don't be religious, use whatever works best now
Windows, Linux, OpenGL, Direct3D

4
Local Illumination

GPU designed for efficient local illumination
computations to make video games more interesting
Bump mapping, BRDF
Vertex shader processes per-vertex attributes
(normal, texcoords, color)
Displacement mapping, skinning
Rasterization interpolates vertex attributes
across pixels
Projective, perspective correct
Pixel shader computes colors from interpolated
values and texture lookups
texture shading, perspective texturing

[Figure: shading diagram with vectors N, V, L]

5
Modern GPU Org.
[Diagram: Geometry (vertex stream) → Vertex Shader → Setup → Rasterization → Pixel Shader → Frame Buffer, with Texture Memory (Tex 0, Tex 1, Tex 2) feeding the Pixel Shader and four groups of parallel units numbered 1-4]

6
Global Illumination

Energy transport across all paths from source to eye
Ray tracing, radiosity, path tracing,
bidirectional ray tracing, irradiance caching,
photon mapping, subsurface scattering
Often broken into stages
Precomputation (e.g. form factors, radiosity
solution)
Display (e.g. render colored patches)
Storage/query format
Ray: ray tracing
Point-to-point: radiosity, subsurface scattering
Point-sphere: photon map, PRT

7
Global Illumination on GPU

GPU not designed for global illumination
OpenGL (w/ shadow maps, env. maps) can yield LgDSgE path
Global illumination approximated and precomputed
Cass Everitt's Dueling Frusta
Shadow map resolution needs to be view dependent

8
Environment Map Problems

Environment maps provide precomputed global
illumination lookup
Approximate
Reflection from behind boat
boat doesn't meet reflection
Sampled
aliasing where magnified

9
Ray Tracing

Uses GPU to intersect rays with triangles
Turns Geometry Engine into a Ray Engine
Carr et al., GH02

10
GPU Ray-Tri Intersect
Vertex Shader (prep)
Shared quad-vertex attributes (triangle data): vertex, normal, edge0, edge1, ID (color)
Rasterization (distributes triangle data across pixels)
Texture Memory: ray origins, ray directions
Pixel Shader (ray-triangle intersection)
Z-buffer holds t-values
Frame buffer holds triangle ID
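
A minimal CPU-side sketch of what the ray-triangle pixel shader computes per pixel. It assumes the Möller-Trumbore test, which fits the vertex/edge0/edge1 attributes listed above, but the slides do not say which test the shader actually uses. The function names and the `eps` tolerance are illustrative. Keeping the smallest t per ray stands in for the Z-buffer, and the winning triangle index stands in for the frame buffer.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def intersect(origin, direction, vertex, edge0, edge1, eps=1e-9):
    """Return t along the ray, or None on a miss.

    The triangle has corners vertex, vertex + edge0, vertex + edge1.
    """
    p = cross(direction, edge1)
    det = dot(edge0, p)
    if abs(det) < eps:          # ray parallel to the triangle plane
        return None
    inv = 1.0 / det
    s = sub(origin, vertex)
    u = dot(s, p) * inv         # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, edge0)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(edge1, q) * inv
    return t if t > eps else None

def nearest_hit(origin, direction, triangles):
    """Emulate the depth test: keep the smallest t and its triangle ID."""
    best_t, best_id = float("inf"), None
    for tri_id, (vertex, edge0, edge1) in enumerate(triangles):
        t = intersect(origin, direction, vertex, edge0, edge1)
        if t is not None and t < best_t:
            best_t, best_id = t, tri_id
    return best_t, best_id
```

On the GPU the outer loop over triangles is the stream of quads and the inner per-ray work runs in parallel across pixels; the sketch serializes both.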
11
What Doesn't Work

Rays in vertex stream, triangles in texture
Rays 5D, triangles 9D, more attributes than
textures
Ray can be held in two textures
10-bit RGB anchors texture
16-bit XY directions texture (bump map texture)
Ray-triangle intersection in vertex shader
Loses benefit of rasterization crossbar
Need to store rays in constant registers
Only 4.1M ray-triangle intersections per sec.
In general vertex shader not much faster than CPU
Vertex shader allows CPU to focus on other tasks

12
Quick & Dirty

Implemented on ATI Radeon 8500, DX 8.1
Pixel Shaders 1.4
16-bit fixed point
114M ray-triangle ints/s
Much faster than best single CPU time (20-40M on 800MHz P3, Wald et al. EGRW01)
Expect gap to widen further
Problem: Don't want to intersect all rays with all tris

13
Fast Ray Tracing

Avoid all pairs intersection
Need acceleration structure
We used ray cache (Pharr et al. S97)
Batches ray intersection queries
Organizes queries into coherent ray bundles
Triangle octree and 5D ray tree (Arvo & Kirk S87)
Problem: How to implement?

14
Ray Engine Organization

We have a perfectly good CPU sitting around doing nothing. Put it to work!
Let the GPU do what it does best
SIMD parallel execution
streamed ray-triangle intersections
Let the CPU do what it does best
traverse/maintain data structures
decide which rays & triangles to intersect
Ray Engine CPU-side
Cache ray-triangle int. queries into coherent
buckets
When bucket large enough, send to GPU
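
The CPU-side caching described above can be sketched as follows. This is an illustration only: the class name, the `send_to_gpu` callback, and the coherence key (the sign octant of the ray direction) are all assumptions, since the slides do not specify how buckets are keyed or how large they must be.

```python
from collections import defaultdict

class RayBucketCache:
    """Collect ray queries into coherent buckets and flush full ones."""

    def __init__(self, flush_size, send_to_gpu):
        self.flush_size = flush_size      # bucket size that triggers a GPU batch
        self.send_to_gpu = send_to_gpu    # hypothetical callback: (key, rays) -> None
        self.buckets = defaultdict(list)

    @staticmethod
    def bucket_key(direction):
        # Assumed coherence criterion: rays in the same direction octant.
        return tuple(c >= 0 for c in direction)

    def add(self, origin, direction):
        key = self.bucket_key(direction)
        bucket = self.buckets[key]
        bucket.append((origin, direction))
        if len(bucket) >= self.flush_size:
            self.send_to_gpu(key, bucket)
            self.buckets[key] = []        # start a fresh bucket for this key

    def flush_all(self):
        # Send whatever is left, e.g. at the end of a frame.
        for key, bucket in list(self.buckets.items()):
            if bucket:
                self.send_to_gpu(key, bucket)
        self.buckets.clear()
```

A real front end would also pick which triangles accompany each bucket; that part is left out here.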

15
The Ray Engine
CPU: Application (Ray Tracing, Path Tracing, Photon Mapping, Radiosity Form Factors, ...)
Application sends rays to query and geometry to the front end; intersection results come back
Front End (batch/queue/sort coherent rays)
Front end sends ray data in textures and triangle data as quad attributes to the GPU
GPU: Ray-Triangle Intersection Pixel Shader
GPU returns intersection pixel data to the front end
16
Analysis

How small can the ray/tri buckets be?
Overhead: texture & attr. setup, readback delay
Determined best query size by experimentation
Texture-strip ray buckets 4 texels high
Takes advantage of 2-D spatial texture cache
CPU handles small queries (using NV_FENCE)
CPU traces between 10% and 33% of rays

17
Comparison

Stanford GPU Ray Tracer (Purcell et al., S02)
State-based traversal, intersection, shading
Each pixel is ray intersection process
Four states: traversal, intersection, shading, spawning
Same state program run simultaneously on all
pixels
Result ignored when pixel was in different state
(90)
Implemented entirely on GPU, avoids readback!
All geometry must fit in texture memory
Could page geometry from host
Limited to simple grid-based ray acceleration
GPU spawns rays, can be complex (importance)

18
Results
Ray Engine GH02
200K rays/s for highly coherent, small (2.5K) scenes
115K rays/s for large (34K) complicated scenes
Wald et al. EGRW01
P3 SSE 200K-1.5M rays/s
CPU-SSE tightly coupled
Purcell et al. S02
300K rays/s large (35K)
up to 4M rays/s small (35)

[Image captions: 150K rays/s, 207K rays/s, 115K rays/s, 128K rays/s, 131K rays/s]

19
! Readback

Why is readback slow?
Driver uses PCI readback, even for AGP cards!
Only got 250MB/s (should get 1GB/s)
Problem goes away if readback asynchronous
Proposed in OpenGL 2.0

21
GPU Ray Tracing Lessons

Ray Engine only doubles ray tracing speed
Maintaining coherence expensive, 2-3x intersection
Is coherence worth it?
Need to factor readback rate into performance
$\mathrm{GPU}(R,T) = T \, R \, \mathrm{fill}^{-1} + R \, \gamma \, \mathrm{readback}^{-1}$
$\gamma = 4$ → ID: flat shaded triangles
$\gamma = 16$ → barycentrics: textured, shaded triangles
Vertex shader ≈ CPU, pixel shader > CPU
SIGGRAPH values analysis over implementation
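
The readback-aware cost model on this slide can be evaluated with a few lines of code. This sketch reads the model as a time estimate: R rays against T triangles cost T·R pixel-shader tests at the fill rate, plus γ bytes read back per ray at the readback rate. Those units (tests per second, bytes per second, bytes per ray) are an interpretation, not something the slide states, and the function and parameter names are made up for illustration.

```python
def gpu_time(num_rays, num_tris, fill_rate, readback_rate, bytes_per_result):
    """Estimated seconds to intersect num_rays with num_tris and read results back.

    fill_rate:        ray-triangle tests per second (assumed unit)
    readback_rate:    bytes per second (assumed unit)
    bytes_per_result: gamma, e.g. 4 for an ID only, 16 when barycentrics are returned
    """
    intersect_time = num_tris * num_rays / fill_rate
    readback_time = num_rays * bytes_per_result / readback_rate
    return intersect_time + readback_time

# Illustrative inputs borrowed from other slides: 114M tests/s, 250 MB/s readback.
id_only = gpu_time(65536, 100, 114e6, 250e6, 4)
with_barycentrics = gpu_time(65536, 100, 114e6, 250e6, 16)
```

The readback term does not depend on T, so it dominates when each bucket of rays is tested against only a few triangles.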

22
Matrix Radiosity

Given form-factor
Energy balance of scene result of linear system
solution
$MB = E$
Matrix → 2-D texture
Vector → 1-D texture
Product → series of row-vector dot products accumulated into a 1-D texture

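$$\begin{bmatrix} A & B & C & D \\ E & F & G & H \\ I & J & K & L \\ M & N & O & P \end{bmatrix} \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \end{bmatrix} = \begin{bmatrix} A \cdot 1 + B \cdot 2 + C \cdot 3 + D \cdot 4 \\ E \cdot 1 + F \cdot 2 + G \cdot 3 + H \cdot 4 \\ I \cdot 1 + J \cdot 2 + K \cdot 3 + L \cdot 4 \\ M \cdot 1 + N \cdot 2 + O \cdot 3 + P \cdot 4 \end{bmatrix}$$
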
23
Jacobi v. Gauss-Seidel

Jacobi iteration
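Classical: $B_i^{(k+1)} = E_i - \sum_{j \neq i} M_{ij} B_j^{(k)}$
Decision free: $B^{(k+1)} = E - M B^{(k)} + B^{(k)}$
Gauss-Seidel
Needs decision: $B_i = E_i - \sum_{j \neq i} M_{ij} B_j$
Converges 2x faster than Jacobi
GPU Gauss-Seidel
n passes (Krüger & Westermann S03)
GPU Jacobi
n/254 passes (unrolled)

Both forms assume $M_{ii} = 1$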
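
A small CPU sketch of the two iterations, assuming the radiosity system M B = E with unit diagonal as on the previous slide. The function names are illustrative. The Jacobi step uses the decision-free form E - M B + B, which needs no per-element test for j ≠ i and so maps onto a plain matrix-vector product; the Gauss-Seidel step updates B in place and must skip the diagonal.

```python
def jacobi_step(M, E, B):
    """Decision-free Jacobi: B_new = E - M B + B (valid when M[i][i] == 1)."""
    n = len(B)
    MB = [sum(M[i][j] * B[j] for j in range(n)) for i in range(n)]
    return [E[i] - MB[i] + B[i] for i in range(n)]

def gauss_seidel_step(M, E, B):
    """Gauss-Seidel: each B[i] uses the entries already updated in this sweep."""
    B = list(B)
    n = len(B)
    for i in range(n):
        B[i] = E[i] - sum(M[i][j] * B[j] for j in range(n) if j != i)
    return B

def solve(step, M, E, iterations):
    B = list(E)                 # start from the emitted energy
    for _ in range(iterations):
        B = step(M, E, B)
    return B
```

Both converge to the same solution on a diagonally dominant M; the difference on the GPU is the number of passes each one needs.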
24
Radiosity Performance

CPU: Athlon 2800+
Gauss-Seidel
40 iter/s, 190M fp/s
100% mem. bw
O(n^2)
GPU: FX5900 Ultra
Jacobi
30 iter/s, 141M fp/s
10% mem. bw
O(n)

25
Radiosity Lessons

Matrix size limited
Maximum texture size 4Kx4K, maximum p-buffer size
2Kx2K
Need paged block-based solutions, or sparse (Bolz
et al. S03)
1-D texture vector non-optimal for texture
cache
2x according to Krüger & Westermann S03
They pack vectors nicely into 2-D textures
Also accelerates dot product, magnitude
operations
Gouraud interpolation not so easy!
Need to interpolate 1-D texture across a 2-D mesh
KW-S03's 2-D texture vector not appropriate for radiosity
Matrix-matrix product caches better if done blockwise
R: upper left, G: lower left, B: upper right, A: lower right
See UIUCDCS-R-2003-2328
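
To illustrate the idea of packing a vector into a 2-D texture instead of a 1-D strip, here is a minimal sketch. It assumes a simple row-major layout with zero padding; the layout Krüger & Westermann actually use may differ, and the function names are hypothetical.

```python
def pack_vector(vec, width):
    """Lay a 1-D vector out row by row in a width-wide 2-D array (zero padded)."""
    height = -(-len(vec) // width)            # ceiling division
    padded = list(vec) + [0.0] * (height * width - len(vec))
    return [padded[r * width:(r + 1) * width] for r in range(height)]

def texel_coords(i, width):
    """Map vector index i to (x, y) in the packed texture."""
    return i % width, i // width
```

Neighbouring vector elements then land in a compact 2-D block, which is what a 2-D spatial texture cache favours.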

26
Subsurface Scattering

Simulates scattering of light within a
homogeneous translucent material
Needed for all non-metallic surfaces
Skin, milk, bread, stone
Precompute scattering for real-time display
CPU implementations
Jensen et al. S02 used octree
Hao et al. I3D03 approx. vert. backscatter
Lensch et al. PG02 used atlas
GPU implementation
Carr et al. GH03, extends Lensch et al.
Sloan et al. S03, incorporates SS into PRT

27
Scattering v. Radiosity

Diffuse subsurface scattering resembles a single
radiosity transport step (Lensch et al. PG02)
Scattering factor Fij based on BSSRDF Rd
Precomputed and stored
Hierarchically clustered to avoid O(n^2) evaluation

28
Multires Meshed Atlas

Quads in atlas correspond to clusters in surface
mesh
Each cluster composed of four subclusters
Allows MIP-mapping to provide multiresolution mesh access
Allows subsurface scattering to operate at
multiple resolutions
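
Because each cluster is made of four subclusters, a MIP pyramid over the atlas gives one value per cluster at every level. A minimal sketch, assuming a square atlas whose side is a power of two and a plain 2x2 box filter (the slides do not state the filter used):

```python
def build_mip_pyramid(level0):
    """Return [level0, level1, ...] where each texel averages its four children."""
    pyramid = [level0]
    current = level0
    while len(current) > 1:
        half = len(current) // 2
        current = [[(current[2 * y][2 * x] + current[2 * y][2 * x + 1] +
                     current[2 * y + 1][2 * x] + current[2 * y + 1][2 * x + 1]) / 4.0
                    for x in range(half)]
                   for y in range(half)]
        pyramid.append(current)
    return pyramid
```

Level 0 holds the finest clusters; the last level is a single texel covering the whole mesh.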

29
Algorithm

Pass 1: Construct Radiosity Map
Illuminate each patch by external light
Scale each patch by 1 - Fresnel
Pass 2: Construct Irradiance Map
Gather for each texel i all texels j
Scaled by precomputed Fij
Pass 3: Display result
Scale irradiance map by 1 - Fresnel
Texture map onto surface and display
Problem: how to store all j terms for each texel i?
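
The three passes can be written out as a brute-force CPU sketch. It assumes scalar (single-channel) values and a dense factor table F with one row per texel, which is exactly the O(n^2) storage the closing question is about; the function and argument names are illustrative.

```python
def subsurface_passes(light, fresnel, F):
    """light[i]: external illumination, fresnel[i]: Fresnel reflectance, F[i][j]: factor."""
    n = len(light)
    # Pass 1: radiosity map, the light that enters the surface.
    radiosity = [light[i] * (1.0 - fresnel[i]) for i in range(n)]
    # Pass 2: irradiance map, gathering every texel j for each texel i.
    irradiance = [sum(F[i][j] * radiosity[j] for j in range(n)) for i in range(n)]
    # Pass 3: display value, scaled again on the way out of the surface.
    return [irradiance[i] * (1.0 - fresnel[i]) for i in range(n)]
```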

[Slide labels: Vertex Shader, Pixel Shader]
30
Hierarchical Links

Each texel i needs to represent a link to all other texels j
Instead link texel i to a cluster j
Store Fij records at each texel
Fij → factor between texel i and cluster j
Accuracy limited by # of textures
16 links per texel → 4 textures of 4 components of 16-bit floats
Per link needs 1 lookup for address and ¼ lookup (dependent) for factor

31
Adaptive Links

Construct dynamically
based on magnitude of Fij
Store u,v,LOD,Fij records at each texel
u,v → location of cluster j
LOD → MIP-map level of cluster j
Fij → factor between patch i and cluster j
Needs 2 lookups (dependent) per link
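
A sketch of the gather through adaptive links for one texel. It assumes integer texel coordinates, nearest-texel sampling, and a MIP pyramid stored as nested lists indexed `pyramid[lod][v][u]`; these are simplifications chosen for illustration, not details from the slides.

```python
def gather_adaptive(links, pyramid):
    """Sum factor * cluster radiosity over one texel's link records.

    links:   list of (u, v, lod, factor) records stored at texel i
    pyramid: pyramid[lod][v][u] is the radiosity of the cluster at that MIP level
    """
    total = 0.0
    for u, v, lod, factor in links:
        # Lookup 1 fetched the record; lookup 2 (dependent) fetches the cluster value.
        total += factor * pyramid[lod][v][u]
    return total
```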

32
Results
70K faces, 1K texture, 13 fps
Direct 18%
Scattered 68%
Displayed 14%

33
Subsurface Lessons

Adaptive looks bad when # of links small
Need to be very careful where to place links
Need to increase # of links
Used vector quantization to compress link records
Increased to 64 links
Also allowed color

34
Precomputed Radiance Xfer

Radiance fns about p
Source (env. map)
Incident
Exit
Represented with spherical harmonics
25-vector of SH weights
Radiance transfer
Transfer matrix source-to-incident x BRDF
Multiply source vector w/precomputed transfer
matrix at p to get exit radiance vector
Need to store 25^2 = 625 elements at each point p
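
As a concrete instance of the transfer step, the sketch below multiplies a source-lighting vector of 25 spherical-harmonic weights by a 25x25 transfer matrix to get the exit-radiance vector. The names are illustrative and the matrix contents are left to the caller.

```python
N_SH = 25  # 5 SH bands: 1 + 3 + 5 + 7 + 9 coefficients

def exit_radiance(transfer, source):
    """Matrix-vector product: exit = transfer * source."""
    return [sum(row[j] * source[j] for j in range(len(source))) for row in transfer]

def storage_per_point(n=N_SH):
    """Number of matrix elements stored at each surface point."""
    return n * n
```

That per-point storage is the motivation for the compression schemes on the following slides.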

35
VQ PRT

Vector quantization
Create a codebook of typical transfer matrices
(LBG)
Pick random codebook matrices
Cluster transfer matrices nearest to each
codebook matrix
Replace codebook matrix with its cluster center
Repeat until done
Store for each transfer matrix the index of its
nearest codebook matrix
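
The codebook loop above can be sketched on generic vectors (a 625-element transfer matrix is just a long vector here). The slide says only "repeat until done"; stopping when the codebook no longer changes, with an iteration cap, is an assumption, as are the function names and the fixed random seed.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vec, codebook):
    """Index of the codebook entry closest to vec."""
    return min(range(len(codebook)), key=lambda k: dist2(vec, codebook[k]))

def lbg(vectors, codebook_size, iterations=20, seed=0):
    rng = random.Random(seed)
    # Pick random input vectors as the initial codebook.
    codebook = [list(v) for v in rng.sample(vectors, codebook_size)]
    for _ in range(iterations):
        # Cluster every vector with its nearest codebook entry.
        clusters = [[] for _ in codebook]
        for v in vectors:
            clusters[nearest(v, codebook)].append(v)
        # Replace each entry with its cluster center (keep it if the cluster is empty).
        new_codebook = []
        for k, members in enumerate(clusters):
            if members:
                dim = len(members[0])
                new_codebook.append(
                    [sum(m[d] for m in members) / len(members) for d in range(dim)])
            else:
                new_codebook.append(codebook[k])
        if new_codebook == codebook:
            break
        codebook = new_codebook
    # Store, for each vector, the index of its nearest codebook entry.
    return codebook, [nearest(v, codebook) for v in vectors]
```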

36
PCA PRT

Principal Component Analysis
Determine which few principal directions in 625-D
space have greatest transfer matrix variance
Store global origin transfer matrix and
principal direction axes transfer matrices
Store for each transfer matrix its approx.
coordinates along the axes
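
Once the origin and principal axes are known, storing and recovering a matrix reduces to a projection and a weighted sum. The sketch below assumes the axes are already computed and orthonormal; finding them (the eigen-decomposition step of PCA) is omitted, and the function names are illustrative.

```python
def project(vec, origin, axes):
    """Coordinates of vec along each principal axis, relative to the origin."""
    centered = [m - o for m, o in zip(vec, origin)]
    return [sum(c * a for c, a in zip(centered, axis)) for axis in axes]

def reconstruct(coords, origin, axes):
    """Approximate vector: origin + sum_k coords[k] * axes[k]."""
    out = list(origin)
    for w, axis in zip(coords, axes):
        for d in range(len(out)):
            out[d] += w * axis[d]
    return out
```

With only a few axes, any component outside their span is lost, which is the approximation error the next slide tries to control.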

37
CPCA

VQPCA
Creates VQ clusters, codebook
Computes PCA on each VQ cluster
Iterative VQPCA
Computes PCA on each VQ cluster
Reclusters based on approx. error
Repeat until done
Adaptive VQPCA
Homogenize error
Give some clusters more PCA axes

38
Vertex Shader Rendering

Set blending mode to ADD
For each cluster
Load cluster's PCA origin and axes (multiplied by lighting) as constants
Render only faces w/a vertex whose transfer
matrix is in the current cluster
Color non-cluster vertices black
The impact on runtime is not so nice
When faces 'tween clusters show twice or thrice (sorry)
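
The cost of the per-cluster loop can be measured with a short sketch: a face is drawn once for every distinct cluster its vertices belong to. The data layout (faces as tuples of vertex indices, `vertex_cluster` as a list) and the function names are assumptions for illustration.

```python
def render_passes(faces, vertex_cluster):
    """Map each cluster to the faces drawn in its pass."""
    passes = {}
    for face in faces:
        for cluster in {vertex_cluster[v] for v in face}:
            passes.setdefault(cluster, []).append(face)
    return passes

def mean_overdraw(faces, vertex_cluster):
    """Average number of times a face is drawn across all cluster passes."""
    draws = sum(len({vertex_cluster[v] for v in face}) for face in faces)
    return draws / len(faces)
```

A face entirely inside one cluster is drawn once; a face straddling two or three clusters is drawn two or three times, with additive blending summing the contributions.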

39
Cluster Coherence

Reclassification
Move some vertices to slightly worse clusters
(10) if they improve coherence
Reduces mean overdraw from 2.0 to 1.8
Superclustering
Load several clusters' constants into vertex shader simultaneously, (1 + axes) × 25/4
Greedily merge neighboring clusters into
superclusters
Limited by vertex shader constant store
Reduced mean overdraw to 1.6

40
Results

Look what we can do in

30Hz
60Hz (250Hz non-local viewer)
40Hz
41
PRT Lessons

24-vectors as good as 25-vectors, and fit nicely
into 6 RGBA registers
Adaptive VQPCA needs data-dependent looping,
eludes GPU implementation
Render one pass per channel to take advantage of
alpha channel in registers
Allows each register to hold four data elements
instead of three (one per channel)
GPU needs to interpolate high-precision textures
GeForceFX only interpolates 8-bit textures
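
The register arithmetic behind the first lesson is easy to check. This helper is illustrative only; it assumes four components per register.

```python
import math

def rgba_registers(num_coeffs, channels=4):
    """Registers needed to hold num_coeffs values at `channels` values per register."""
    return math.ceil(num_coeffs / channels)

full_vector = rgba_registers(25)     # last register mostly empty
trimmed_vector = rgba_registers(24)  # fills every register exactly
```

Dropping one coefficient saves a whole register per vector, which matters when the vertex shader constant store is the limit.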

42
Global Illumination Lessons

New algorithms & data structures needed to port global illumination efficiently to GPU
What's good for CPU not necessarily good for GPU
Focus has been on real time display of
precomputed global illumination
All cheats except for ray engine
Need more GPU global illumination algorithms
Like Purcell et al. GH03
Leading to a GPUPACK
Library of tuned GPU algorithms

43
Thanks