Title: Global Illumination on the GPU: Lessons Learned
1 Global Illumination on the GPU: Lessons Learned
- John C. Hart
- University of Illinois
2 What do NVidia U?
- It's all about porting and speed
- How to implement on the GPU, especially NVidia's
- How to make it run as fast (competitive) as possible
- Research papers leave out grodie details
- Won't be important five years from now
- But they are very important right now
- This is where we discuss grodie details
- Texture cache size, organization
- Tricks that probably won't work next year
3 How to NVidia U.
- Ask lots of questions
- Chat between talks
- Stuff you can know and stuff you can't know
- NDA v. No _at_ Way
- Don't expect to change the hardware
- Dirty little secrets to getting code to run fast
- Send interns
- Computational Pantheism
- Don't be religious, use whatever works best now
- Windows, Linux, OpenGL, Direct3D
4 Local Illumination
- GPU designed for efficient local illumination computations to make video games more interesting
- Bump mapping, BRDF
- Vertex shader processes per-vertex attributes (normal, texcoords, color)
- Displacement mapping, skinning
- Rasterization interpolates vertex attributes across pixels
- Projective, perspective correct
- Pixel shader computes colors from interpolated values and texture lookups
- texture shading, perspective texturing
(Figure labels: N, V, L)
5 Modern GPU Org.
(Diagram labels: Geometry (vertex stream), Vertex Shader, Setup, Rasterization, Pixel Shader reading Tex 0 / Tex 1 / Tex 2 from Texture Memory, Frame Buffer; four groups of parallel units numbered 1-4)
6 Global Illumination
- Energy transport across all paths from source to eye
- Ray tracing, radiosity, path tracing, bidirectional ray tracing, irradiance caching, photon mapping, subsurface scattering
- Often broken into stages
- Precomputation (e.g. form factors, radiosity solution)
- Display (e.g. render colored patches)
- Storage/query format
- Ray: ray tracing
- Point-to-point: radiosity, subsurface scattering
- Point-sphere: photon map, PRT
7 Global Illumination on GPU
- GPU not designed for global illumination
- OpenGL (w/shadow maps, env. maps) can yield L(D|S)E path
- Global illumination approximated and precomputed
- Cass Everitt's Dueling Frusta
- Shadow map resolution needs to be view dependent
8 Environment Map Problems
- Environment maps provide precomputed global illumination lookup
- Approximate
- Reflection from behind boat
- boat doesn't meet reflection
- Sampled
- aliasing where magnified
9 Ray Tracing
- Uses GPU to intersect rays with triangles
- Turns Geometry Engine into a Ray Engine
- Carr et al., GH02
10 GPU Ray-Tri Intersect
- Vertex Shader (prep)
- Shared quad-vertex attributes carry the triangle data: vertex, normal, edge0, edge1, ID (color)
- Rasterization (distributes triangle data across pixels)
- Pixel Shader (ray-triangle intersection)
- Texture Memory: ray origins, ray directions
- Z-buffer holds t-values, frame buffer holds triangle ID
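The per-pixel test described on this slide can be sketched on the CPU with NumPy. This is only an illustration under assumptions: it uses the Moller-Trumbore formulation (the slides do not say which intersection formula the pixel shader used), and the function names `intersect_rays_triangle` and `trace` are hypothetical. The triangle arrives as a vertex plus two edge vectors, every ray plays the role of one pixel, and two arrays stand in for the z-buffer (nearest t) and the frame buffer (triangle ID).

```python
import numpy as np


def intersect_rays_triangle(origins, directions, v0, edge0, edge1, eps=1e-9):
    """Moller-Trumbore test of many rays against one triangle.

    origins, directions: (R, 3) arrays, one ray per "pixel".
    v0, edge0, edge1: triangle given as a vertex and two edge vectors.
    Returns t for each ray, with np.inf where the ray misses.
    """
    pvec = np.cross(directions, edge1)
    det = pvec @ edge0
    parallel = np.abs(det) < eps
    inv_det = 1.0 / np.where(parallel, 1.0, det)  # avoid dividing by ~0
    tvec = origins - v0
    u = (tvec * pvec).sum(axis=1) * inv_det
    qvec = np.cross(tvec, edge0)
    v = (directions * qvec).sum(axis=1) * inv_det
    t = (qvec @ edge1) * inv_det
    hit = ~parallel & (u >= 0) & (v >= 0) & (u + v <= 1) & (t > eps)
    return np.where(hit, t, np.inf)


def trace(origins, directions, triangles):
    """Intersect all rays with all triangles, keeping the nearest hit."""
    t_buffer = np.full(len(origins), np.inf)       # stands in for the z-buffer
    id_buffer = np.full(len(origins), -1)          # stands in for the frame buffer
    for tri_id, (v0, v1, v2) in enumerate(triangles):
        t = intersect_rays_triangle(origins, directions, v0, v1 - v0, v2 - v0)
        closer = t < t_buffer                      # the depth test
        t_buffer[closer] = t[closer]
        id_buffer[closer] = tri_id
    return t_buffer, id_buffer
```

Each loop iteration corresponds to drawing one screen-filling quad whose attributes hold that triangle's data, which is why the cost grows with rays times triangles.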
11 What Doesn't Work
- Rays in vertex stream, triangles in texture
- Rays 5D, triangles 9D, more attributes than textures
- Ray can be held in two textures
- 10-bit RGB anchors texture
- 16-bit XY directions texture (bump map texture)
- Ray-triangle intersection in vertex shader
- Loses benefit of rasterization crossbar
- Need to store rays in constant registers
- Only 4.1M ray-triangle intersections per sec.
- In general vertex shader not much faster than CPU
- Vertex shader allows CPU to focus on other tasks
12 Quick & Dirty
- Implemented on ATI Radeon 8500, DX 8.1
- Pixel Shaders 1.4
- 16-bit fixed point
- 114M ray-triangle ints/s
- Much faster than best single CPU time (20-40M on 800MHz P3, Wald et al. EGRW01)
- Expect gap to widen further
- Problem: Don't want to intersect all rays with all tris
13 Fast Ray Tracing
- Avoid all pairs intersection
- Need acceleration structure
- We used ray cache (Pharr et al. S97)
- Batches ray intersection queries
- Organizes queries into coherent ray bundles
- Triangle octree and 5D ray tree (Arvo & Kirk S87)
- Problem: How to implement?
14 Ray Engine Organization
- We have a perfectly good CPU sitting around doing nothing. Put it to work!
- Let the GPU do what it does best
- SIMD parallel execution
- streamed ray-triangle intersections
- Let the CPU do what it does best
- traverse/maintain data structures
- decide which rays & triangles to intersect
- Ray Engine CPU-side
- Cache ray-triangle int. queries into coherent buckets
- When bucket large enough, send to GPU
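The CPU-side caching idea can be sketched as follows. This is a minimal illustration, not the Ray Engine's actual front end: the class name `RayBucketCache`, the `coherence_key` helper (grid cell of the origin plus direction octant), and the `send_to_gpu` callback are all hypothetical stand-ins for whatever coherence measure and GPU submission path an implementation uses.

```python
from collections import defaultdict


def coherence_key(origin, direction, cell_size=1.0):
    """Hypothetical coherence key: grid cell of the origin plus direction octant."""
    cell = tuple(int(c // cell_size) for c in origin)
    octant = tuple(d >= 0 for d in direction)
    return cell, octant


class RayBucketCache:
    """Collects ray queries into coherent buckets and sends full buckets on."""

    def __init__(self, bucket_size, send_to_gpu):
        self.bucket_size = bucket_size
        self.send_to_gpu = send_to_gpu  # callback standing in for the GPU pass
        self.buckets = defaultdict(list)

    def add(self, origin, direction):
        key = coherence_key(origin, direction)
        bucket = self.buckets[key]
        bucket.append((origin, direction))
        if len(bucket) >= self.bucket_size:
            # Bucket is large enough: hand the whole batch over at once.
            self.send_to_gpu(key, self.buckets.pop(key))

    def flush(self):
        """Send whatever is left, e.g. at the end of a frame."""
        for key in list(self.buckets):
            self.send_to_gpu(key, self.buckets.pop(key))
```

A real front end would probably route the small leftover buckets in `flush` to a CPU intersector rather than the GPU, since per-batch setup and readback overhead dominates for small queries.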
15 The Ray Engine
(Diagram of data flow between CPU and GPU)
- CPU: Application (Ray Tracing, Path Tracing, Photon Mapping, Radiosity Form Factors, ...)
- Application and Front End exchange: Rays To Query, Geometry, Intersection Results
- CPU: Front End (batch/queue/sort coherent rays)
- Sent to GPU: Ray Data In Textures, Triangle Data as Quad Attributes
- GPU: Ray-Triangle Intersection Pixel Shader
- Returned from GPU: Intersection Pixel Data
16 Analysis
- How small can the ray/tri buckets be?
- Overhead: texture and attr. setup, readback delay
- Determined best query size by experimentation
- Texture-strip ray buckets 4 texels high
- Takes advantage of 2-D spatial texture cache
- CPU handles small queries (using NV_FENCE)
- CPU traces between 10% and 33% of rays
17 Comparison
- Stanford GPU Ray Tracer (Purcell et al., S02)
- State-based traversal, intersection, shading
- Each pixel is ray intersection process
- Four states: traversal, intersection, shading, spawning
- Same state program run simultaneously on all pixels
- Result ignored when pixel was in different state (90%)
- Implemented entirely on GPU, avoids readback!
- All geometry must fit in texture memory
- Could page geometry from host
- Limited to simple grid-based ray acceleration
- GPU spawns rays, can be complex (importance)
18 Results
- Ray Engine GH02
- 200K rays/s for highly coherent, small (2.5K) scenes
- 115K rays/s for large (34K) complicated scenes
- Wald et al. EGRW01
- P3 SSE: 200K to 1.5M rays/s
- CPU-SSE tightly coupled
- Purcell et al. S02
- 300K rays/s large (35K)
- up to 4M rays/s small (35)
(Image captions: 150K rays/s, 207K rays/s, 115K rays/s, 128K rays/s, 131K rays/s)
19 Readback
- Why is readback slow?
- Driver uses PCI readback, even for AGP cards!
- Only got 250MB/s (should get 1GB/s)
- Problem goes away if readback asynchronous
- Proposed in OpenGL 2.0
21 GPU Ray Tracing Lessons
- Ray Engine only doubles ray tracing speed
- Maintaining coherence expensive, 2-3x intersection
- Is coherence worth it?
- Need to factor readback rate into performance
- $\mathrm{GPU}(R,T) = T\,R\,\mathrm{fill}^{-1} + R\,\gamma\,\mathrm{readback}^{-1}$
- $\gamma = 4$ -> ID: flat shaded triangles
- $\gamma = 16$ -> barycentrics: textured, shaded triangles
- Vertex shader ≈ CPU, pixel shader > CPU
- SIGGRAPH values analysis over implementation
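The performance model on this slide can be worked through numerically. The sketch below reads it as time = T R / fill + R g / readback, where g is taken to be the number of bytes read back per ray (4 or 16); that reading, the function name `gpu_time`, and the choice of units are assumptions. The sample rates reuse figures quoted elsewhere in the talk (114M ray-triangle tests/s, 250MB/s readback) purely for illustration, even though they come from different slides.

```python
def gpu_time(num_rays, num_tris, fill_rate, readback_rate, bytes_per_result):
    """Estimated seconds to intersect num_rays with num_tris and read results back.

    fill_rate: ray-triangle tests (pixels) per second.
    readback_rate: bytes per second from GPU to host.
    bytes_per_result: bytes returned per ray (assumed meaning of the slide's g).
    """
    intersect = num_rays * num_tris / fill_rate
    readback = num_rays * bytes_per_result / readback_rate
    return intersect + readback


if __name__ == "__main__":
    # Illustrative numbers only; see the lead-in for where they come from.
    for g in (4, 16):
        seconds = gpu_time(4096, 256, 114e6, 250e6, g)
        print(f"g = {g:2d}: {seconds * 1e3:.3f} ms per batch")
```

Under this reading the readback term does not depend on the number of triangles, so it matters most for small triangle sets, which is one way to see why small buckets are better handled on the CPU.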
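22 Matrix Radiosity
- Given form factors
- Energy balance of scene is the result of a linear system solution
- $MB = E$
- Matrix -> 2-D texture
- Vector -> 1-D texture
- Product -> Series of row-vector dot products accumulated into a 1-D texture
(Figure, with 1-4 labeling the vector entries:)
$$\begin{pmatrix} A & B & C & D \\ E & F & G & H \\ I & J & K & L \\ M & N & O & P \end{pmatrix} \begin{pmatrix} 1 \\ 2 \\ 3 \\ 4 \end{pmatrix} = \begin{pmatrix} A1 + B2 + C3 + D4 \\ E1 + F2 + G3 + H4 \\ I1 + J2 + K3 + L4 \\ M1 + N2 + O3 + P4 \end{pmatrix}$$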
23 Jacobi v. Gauss-Seidel
- Jacobi iteration
- Classical: $B_i^{(k+1)} = E_i - \sum_{j \neq i} M_{ij} B_j^{(k)}$
- Decision free: $B^{(k+1)} = E - M B^{(k)} + B^{(k)}$
- Gauss-Seidel
- Needs decision: $B_i = E_i - \sum_{j \neq i} M_{ij} B_j$
- Converges 2x faster than Jacobi
- GPU Gauss-Seidel
- n passes (Kruger & Westermann S03)
- GPU Jacobi
- n/254 passes (unrolled)
(assuming $M_{ii} = 1$)
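The two iterations compared on this slide can be sketched in NumPy. This assumes the system MB = E with a unit diagonal (Mii = 1), so the Jacobi update can be written without a "j != i" branch as B <- E - MB + B. The function names and the small test matrix are illustrative, not taken from the talk.

```python
import numpy as np


def jacobi_step(M, E, B):
    """One decision-free Jacobi step, assuming diag(M) == 1.

    E - M @ B + B equals E_i - sum_{j != i} M_ij B_j in every row,
    because the diagonal term M_ii B_i = B_i is added straight back.
    """
    return E - M @ B + B


def gauss_seidel_step(M, E, B):
    """One Gauss-Seidel sweep: each row uses entries already updated in this sweep."""
    B = B.copy()
    for i in range(len(B)):
        off_diagonal = M[i] @ B - M[i, i] * B[i]  # the "j != i" decision
        B[i] = E[i] - off_diagonal
    return B


if __name__ == "__main__":
    F = np.array([[0.0, 0.2, 0.3], [0.1, 0.0, 0.4], [0.3, 0.2, 0.0]])
    M = np.eye(3) - 0.5 * F  # hypothetical radiosity-style matrix, unit diagonal
    E = np.ones(3)
    B = np.zeros(3)
    for _ in range(50):
        B = jacobi_step(M, E, B)
    print(B, np.linalg.solve(M, E))
```

The Jacobi step is a single matrix-vector product over the whole vector, which maps onto one rendering pass, whereas the Gauss-Seidel sweep has a serial dependence from row to row.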
24 Radiosity Performance
- CPU: Athlon 2800
- Gauss-Seidel
- 40 iter/s, 190M fp/s
- 100% mem. bw
- O(n^2)
- GPU: FX5900 Ultra
- Jacobi
- 30 iter/s, 141M fp/s
- 10% mem. bw
- O(n)
25 Radiosity Lessons
- Matrix size limited
- Maximum texture size 4Kx4K, maximum p-buffer size 2Kx2K
- Need paged block-based solutions, or sparse (Bolz et al. S03)
- 1-D texture vector non-optimal for texture cache
- 2x according to Kruger & Westermann S03
- They pack vectors nicely into 2-D textures
- Also accelerates dot product, magnitude operations
- Gouraud interpolation not so easy!
- Need to interpolate 1-D texture across a 2-D mesh
- KW-S03's 2-D texture vector not appropriate for radiosity
- Matrix-matrix product caches better if done blockwise
- R = Upper Left, G = LL, B = UR, A = LR
- See UIUCDCS-R-2003-2328
26 Subsurface Scattering
- Simulates scattering of light within a homogeneous translucent material
- Needed for all non-metallic surfaces
- Skin, milk, bread, stone
- Precompute scattering for real-time display
- CPU implementations
- Jensen et al. S02 used octree
- Hao et al. I3D03 approx. vert. backscatter
- Lensch et al. PG02 used atlas
- GPU implementation
- Carr et al. GH03, extends Lensch et al.
- Sloan et al. S03, incorporates SS into PRT
27 Scattering v. Radiosity
- Diffuse subsurface scattering resembles a single radiosity transport step (Lensch et al. PG02)
- Scattering factor Fij based on BSSRDF Rd
- Precomputed and stored
- Hierarchically clustered to avoid O(n^2) evaluation
28 Multires Meshed Atlas
- Quads in atlas correspond to clusters in surface mesh
- Each cluster composed of four subclusters
- Allows MIP-mapping to provide multiresolution mesh access
- Allows subsurface scattering to operate at multiple resolutions
29 Algorithm
- Pass 1: Construct Radiosity Map
- Illuminate each patch by external light
- Scale each patch by 1 - Fresnel
- Pass 2: Construct Irradiance Map
- Gather for each texel i all texels j
- Scaled by precomputed Fij
- Pass 3: Display result
- Scale irradiance map by 1 - Fresnel
- Texture map onto surface and display
- Problem: how to store all j terms for each texel i?
(Diagram labels: Vertex Shader, Pixel Shader)
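The three passes can be sketched with a dense factor matrix, ignoring the atlas and the storage problem the slide raises. This assumes the scale factor in passes 1 and 3 is the Fresnel transmittance, 1 - Fresnel, and that pass 2 is a plain weighted gather over all texels j. The function name and array layout are illustrative.

```python
import numpy as np


def subsurface_passes(light, fresnel, F):
    """Dense, unaccelerated version of the three passes.

    light:   (n,) external illumination per texel/patch.
    fresnel: (n,) Fresnel reflectance per texel (assumed to lie in [0, 1]).
    F:       (n, n) precomputed scattering factors, F[i, j] from texel j to i.
    """
    radiosity = light * (1.0 - fresnel)    # pass 1: light entering the surface
    irradiance = F @ radiosity             # pass 2: gather all j for each i
    return irradiance * (1.0 - fresnel)    # pass 3: light leaving, for display
```

Pass 2 is the O(n^2) step. The hierarchical and adaptive link schemes on the following slides replace the dense row F[i, :] with a small fixed number of links per texel.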
30 Hierarchical Links
- Each texel i needs to represent a link to all other texels j
- Instead link texel i to a cluster j
- Store Fij records at each texel
- Fij -> factor between texel i and cluster j
- Accuracy limited by # of textures
- 16 links per texel -> 4 textures of 4 components of 16-bit floats
- Per link needs 1 lookup for address and ¼ lookup (dependent) for factor
31 Adaptive Links
- Construct dynamically
- based on magnitude of Fij
- Store (u, v, LOD, Fij) records at each texel
- u, v -> location of cluster j
- LOD -> MIP-map level of cluster j
- Fij -> factor between patch i and cluster j
- Needs 2 lookups (dependent) per link
32 Results
70K faces, 1K texture, 13 fps
Direct: 18%
Scattered: 68%
Displayed: 14%
33 Subsurface Lessons
- Adaptive looks bad when # of links small
- Need to be very careful where to place links
- Need to increase # of links
- Used vector quantization to compress link records
- Increased to 64 links
- Also allowed color
34 Precomputed Radiance Xfer
- Radiance fns about p
- Source (env. map)
- Incident
- Exit
- Represented with spherical harmonics
- 25-vector of SH weights
- Radiance transfer
- Transfer matrix: source-to-incident x BRDF
- Multiply source vector w/precomputed transfer matrix at p to get exit radiance vector
- Need to store 25^2 = 625 elements at each point p
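The per-point work described on this slide is a single matrix-vector product. The sketch below only illustrates the sizes involved, taking 25 SH weights per radiance function so that each point stores a 25 x 25 transfer matrix (625 elements); the names `N_SH` and `exit_radiance` are hypothetical.

```python
import numpy as np

N_SH = 25  # SH weights per radiance function, as on the slide


def exit_radiance(transfer, source):
    """Exit radiance SH vector at one point p.

    transfer: (25, 25) precomputed transfer matrix stored at p.
    source:   (25,) SH weights of the source lighting (environment map).
    """
    return transfer @ source


if __name__ == "__main__":
    transfer = np.eye(N_SH)          # placeholder matrix, not real transfer data
    source = np.arange(N_SH, dtype=float)
    print(transfer.size, exit_radiance(transfer, source)[:3])
```

Storing 625 values at every point is what motivates the compression schemes on the following slides.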
35 VQ PRT
- Vector quantization
- Create a codebook of typical transfer matrices (LBG)
- Pick random codebook matrices
- Cluster transfer matrices nearest to each codebook matrix
- Replace codebook matrix with its cluster center
- Repeat until done
- Store for each transfer matrix the index of its nearest codebook matrix
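The codebook loop listed above can be sketched in NumPy. This is a plain LBG / k-means style iteration under assumptions: squared Euclidean distance between flattened matrices, a fixed iteration count in place of "repeat until done", and empty clusters left unchanged. The function name `lbg_codebook` is hypothetical.

```python
import numpy as np


def lbg_codebook(matrices, num_codes, iterations=20, seed=0):
    """Build a codebook for a set of transfer matrices.

    matrices: (n, r, c) array; each matrix is flattened to a vector.
    Returns (codebook, index): codebook is (num_codes, r*c) and index[k]
    is the nearest codebook entry for matrix k.
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(matrices, dtype=float).reshape(len(matrices), -1)

    def nearest(codebook):
        dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        return dists.argmin(axis=1)

    # Pick random codebook matrices from the data itself.
    codebook = data[rng.choice(len(data), size=num_codes, replace=False)].copy()
    for _ in range(iterations):
        index = nearest(codebook)            # cluster around each code
        for c in range(num_codes):
            members = data[index == c]
            if len(members):                 # leave empty clusters as they are
                codebook[c] = members.mean(axis=0)  # move code to cluster center
    return codebook, nearest(codebook)
```

After this step each point stores only one small index instead of a full matrix, at the cost of every point in a cluster sharing exactly the same transfer matrix.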
36 PCA PRT
- Principal Component Analysis
- Determine which few principal directions in 625-D space have greatest transfer matrix variance
- Store global origin transfer matrix and principal direction axes transfer matrices
- Store for each transfer matrix its approx. coordinates along the axes
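The PCA step can be sketched with an SVD. Assumptions here: the "global origin" is taken to be the mean transfer matrix, the axes are the leading right singular vectors of the centered data, and the names `pca_compress` and `pca_reconstruct` are illustrative.

```python
import numpy as np


def pca_compress(matrices, num_axes):
    """Approximate each flattened transfer matrix as origin + coords @ axes.

    matrices: (n, r, c) array. Returns (origin, axes, coords) with shapes
    (r*c,), (num_axes, r*c) and (n, num_axes).
    """
    data = np.asarray(matrices, dtype=float).reshape(len(matrices), -1)
    origin = data.mean(axis=0)                       # assumed global origin
    _, _, vt = np.linalg.svd(data - origin, full_matrices=False)
    axes = vt[:num_axes]                             # directions of greatest variance
    coords = (data - origin) @ axes.T                # per-matrix coordinates
    return origin, axes, coords


def pca_reconstruct(origin, axes, coords):
    """Rebuild the approximate flattened matrices from their coordinates."""
    return origin + coords @ axes
```

Each point then stores only `num_axes` coordinates, while the origin and axes are shared by all points.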
37 CPCA
- VQPCA
- Creates VQ clusters, codebook
- Computes PCA on each VQ cluster
- Iterative VQPCA
- Computes PCA on each VQ cluster
- Reclusters based on approx. error
- Repeat until done
- Adaptive VQPCA
- Homogenize error
- Give some clusters more PCA axes
(Figure panels: Bad, Good)
38 Vertex Shader Rendering
- Set blending mode to ADD
- For each cluster
- Load cluster's PCA origin and axes (multiplied by lighting) as constants
- Render only faces w/a vertex whose transfer matrix is in the current cluster
- Color non-cluster vertices black
- The impact on runtime is not so nice
- When faces 'tween clusters show twice or thrice
- (sorry)
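Why the cluster's origin and axes can be "multiplied by lighting" once and loaded as constants follows from linearity, and can be checked numerically. The sketch assumes each vertex's transfer matrix is approximated as origin + sum_k w_k * axis_k within its cluster; the function names and array shapes are illustrative.

```python
import numpy as np


def cluster_constants(origin, axes, light):
    """Per-cluster constants, computed once per frame.

    origin: (n, n) cluster origin matrix, axes: (K, n, n) PCA axes,
    light:  (n,) SH weights of the current lighting.
    Returns the origin and each axis already multiplied by the lighting.
    """
    return origin @ light, axes @ light          # shapes (n,) and (K, n)


def vertex_exit_radiance(origin_lit, axes_lit, weights):
    """Per-vertex work: a short weighted sum of the lit constants."""
    return origin_lit + weights @ axes_lit
```

With K axes a vertex needs only K multiply-adds on vectors instead of a full matrix-vector product, and each cluster occupies 1 + K lit vectors of constant storage, which is what limits how many clusters can be loaded at once.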
39 Cluster Coherence
- Reclassification
- Move some vertices to slightly worse clusters (10%) if they improve coherence
- Reduces mean overdraw from 2.0 to 1.8
- Superclustering
- Load several clusters' constants into vertex shader simultaneously, (1 + axes) x 25/4 each
- Greedily merge neighboring clusters into superclusters
- Limited by vertex shader constant store
- Reduced mean overdraw to 1.6
40 Results
30Hz
60Hz (250Hz non-local viewer)
40Hz
41 PRT Lessons
- 24-vectors as good as 25-vectors, and fit nicely into 6 RGBA registers
- Adaptive VQPCA needs data-dependent looping, eludes GPU implementation
- Render one pass per channel to take advantage of alpha channel in registers
- Allows each register to hold four data elements instead of three (one per channel)
- GPU needs to interpolate high-precision textures
- GeForceFX only interpolates 8-bit textures
42 Global Illumination Lessons
- New algorithms & data structures needed to port global illumination efficiently to GPU
- What's good for CPU not necessarily good for GPU
- Focus has been on real time display of precomputed global illumination
- All cheats except for ray engine
- Need more GPU global illumination algorithms
- Like Purcell et al. GH03
- Leading to a GPUPACK
- Library of tuned GPU algorithms
43 Thanks
- Who did all the work?
- Nate Carr
- Jesse Hall
- NSF ITR Award ACI-0113968
- NVidia
- Microsoft Research
- Peter-Pike Sloan
- John Snyder