Title: GPU Program Optimization
1GPU Program Optimization
- Cliff Woolley
- University of Virginia / NVIDIA
2Overview
- Data Parallel Computing
- Computational Frequency
- Profiling and Load Balancing
3Data Parallel Computing
4Data Parallel Computing
- Instruction-Level Parallelism
- Data-Level Parallelism
5A really naïve shader
frag2frame Smooth(vert2frag IN,
                  uniform samplerRECT Source   : texunit0,
                  uniform samplerRECT Operator : texunit1,
                  uniform samplerRECT Boundary : texunit2,
                  uniform float4 params)
{
    frag2frame OUT;

    float2 center = IN.TexCoord0.xy;
    float4 U = f4texRECT(Source, center);

    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y))
    {
        float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                               params.x*center.y - 0.5f*(params.x-1.0f));
        ...
        float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                                 center.y - 1.0f, center.y + 1.0f);
        float central = -2.0f*(O.x + O.y);
        float poisson = ((params.x*params.x)*U.z +
                         (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) +
                          -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) +
                          -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) +
                          -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w;
        OUT.COL.x = poisson;
    }
    ...
    return OUT;
}
7Instruction-Level Parallelism
    float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                           params.x*center.y - 0.5f*(params.x-1.0f));
    float4 neighbor = float4(center.x - 1.0f,
                             center.x + 1.0f,
                             center.y - 1.0f,
                             center.y + 1.0f);
8Instruction-Level Parallelism
    float2 offset = center.xy - 0.5f;
    offset = offset * params.xx + 0.5f;  // MADR is cool too: one cycle, two flops
    float4 neighbor = center.xxyy + float4(-1.0f, 1.0f, -1.0f, 1.0f);
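The vectorized form works because s*c - 0.5*(s - 1) equals s*(c - 0.5) + 0.5, so each component needs only one subtract and one multiply-add. A minimal sketch in Python checks that identity; it is not from the original slides, the function names are hypothetical, and plain tuples stand in for Cg vectors.

```python
def offset_scalar(center, scale):
    """Per-component form, as written in the naive shader."""
    cx, cy = center
    return (scale * cx - 0.5 * (scale - 1.0),
            scale * cy - 0.5 * (scale - 1.0))


def offset_vector(center, scale):
    """Vector form: one subtract, then one multiply-add over both components."""
    shifted = tuple(c - 0.5 for c in center)         # center.xy - 0.5
    return tuple(s * scale + 0.5 for s in shifted)   # offset * params.xx + 0.5


def neighbors_vector(center):
    """Models center.xxyy + float4(-1, 1, -1, 1) as a single 4-wide add."""
    cx, cy = center
    swizzled = (cx, cx, cy, cy)
    delta = (-1.0, 1.0, -1.0, 1.0)
    return tuple(a + b for a, b in zip(swizzled, delta))
```

Both offset functions return the same pair for the same inputs; the second simply expresses the work in a shape that maps onto 4-wide vector instructions.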
9Data-Level Parallelism
- Pack scalar data into RGBA in texture memory
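As a rough illustration of packing, the Python sketch below groups a flat list of scalars into 4-component "texels" so that one vector operation touches four values at once. The helper names and the zero-padding of the last texel are assumptions made for this example, not something the slides specify.

```python
def pack_rgba(scalars):
    """Group scalars into 4-component texels, zero-padding the final one."""
    padded = list(scalars) + [0.0] * (-len(scalars) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]


def scale_texels(texels, k):
    """One 'vector instruction' per texel processes four scalars at once."""
    return [tuple(k * c for c in texel) for texel in texels]
```

With this layout, six scalars occupy two texels instead of six, so a pass over the data runs the fragment program on a quarter as many fragments (rounded up).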
10Computational Frequency
11Computational Frequency
- Think of your CPU program and your vertex and
fragment programs as different levels of nested
looping.
...
foreach tri in triangles
{
    // run the vertex program on each vertex
    v1 = process_vertex(tri.vertex1);
    v2 = process_vertex(tri.vertex2);
    v3 = process_vertex(tri.vertex3);

    // assemble the vertices into a triangle
    assembledtriangle = setup_tri(v1, v2, v3);

    // rasterize the assembled triangle into [0..many] fragments
    fragments = rasterize(assembledtriangle);

    // run the fragment program on each fragment
    foreach frag in fragments
    {
        outbuffer[frag.position] = process_fragment(frag);
    }
}
...
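To put rough numbers on the nesting, the Python sketch below counts how often each level runs for a single pass. It is illustrative only: the function name is hypothetical, and it assumes the triangles exactly tile the render target with no overdraw and no shared vertices.

```python
def invocation_counts(width, height, triangles=2):
    """Count how often each 'loop level' runs for one full-target pass."""
    return {
        "cpu": 1,                     # outermost level: once per pass
        "vertex": 3 * triangles,      # vertex program: once per vertex
        "fragment": width * height,   # fragment program: once per covered pixel
    }
```

For a 512 x 512 target drawn as two triangles, the fragment program runs 262144 times against 6 vertex-program runs, which is why work moved outward from the fragment level pays off so heavily.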
12Computational Frequency
- Branches
- Avoid these, especially in the inner loop, i.e., the fragment program.
13Computational Frequency
- Static branch resolution
- Write several variants of each fragment program to handle boundary cases
- Eliminates conditionals in the fragment program
- Equivalent to avoiding CPU inner-loop branching
case 1: no boundaries
case 2: accounts for boundaries
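The CPU analogy can be sketched in Python with a 1D three-point smoothing filter. This is a hypothetical stand-in for the fragment programs, not the author's code: one version tests for the boundary on every element, the other selects a boundary-free or boundary-aware variant per region, outside the inner loop. The input is assumed to be non-empty.

```python
def smooth_branching(src):
    """Single kernel: every element pays for the boundary tests."""
    n = len(src)
    out = []
    for i in range(n):
        left = src[i - 1] if i > 0 else src[i]
        right = src[i + 1] if i < n - 1 else src[i]
        out.append((left + src[i] + right) / 3.0)
    return out


def smooth_interior(src, i):
    """Case 1: no boundaries, so no conditionals."""
    return (src[i - 1] + src[i] + src[i + 1]) / 3.0


def smooth_boundary(src, i):
    """Case 2: accounts for boundaries by clamping neighbor indices."""
    n = len(src)
    left = src[max(i - 1, 0)]
    right = src[min(i + 1, n - 1)]
    return (left + src[i] + right) / 3.0


def smooth_static(src):
    """Pick the variant per region; the interior loop is branch-free."""
    n = len(src)
    out = [0.0] * n
    for i in (0, n - 1):              # thin boundary region
        out[i] = smooth_boundary(src, i)
    for i in range(1, n - 1):         # large interior region
        out[i] = smooth_interior(src, i)
    return out
```

On the GPU the same split would correspond to drawing the interior and the boundary as separate pieces of geometry, each with its own fragment program bound.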
14Computational Frequency
- Dynamic branching
- Dynamic branching on NV4x and G70 hardware is better than branching with NV3x
- But still, there is a branch penalty
- Good perf requires spatial coherence in branching
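One way to picture spatial coherence is a deliberately simplified model in which fragments are processed in square tiles, and a tile pays for both sides of a branch whenever its fragments disagree. The Python sketch below counts such tiles. The tile size of 4 and the model itself are assumptions for illustration, not a description of how NV4x or G70 hardware actually schedules fragments.

```python
def divergent_tiles(taken, tile=4):
    """Count tiles whose fragments disagree on the branch.

    `taken` is a square 2D list of booleans (branch taken per fragment).
    """
    n = len(taken)
    count = 0
    for ty in range(0, n, tile):
        for tx in range(0, n, tile):
            outcomes = {taken[y][x]
                        for y in range(ty, min(ty + tile, n))
                        for x in range(tx, min(tx + tile, n))}
            if len(outcomes) > 1:
                count += 1
    return count


n = 16
# Incoherent: neighboring fragments alternate, as in a red-black pattern.
checkerboard = [[(x + y) % 2 == 0 for x in range(n)] for y in range(n)]
# Coherent: the branch outcome changes only along one tile-aligned edge.
half_plane = [[x < n // 2 for x in range(n)] for y in range(n)]
```

Under this model every tile of the checkerboard diverges, while the half-plane pattern leaves every tile uniform.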
15Computational Frequency
- Branches
- Ian Buck will talk more about various branching
techniques after lunch
16Computational Frequency
- Precompute
- Precompute
- Precompute
17Computational Frequency
- Precompute texture coordinates
- Take advantage of under-utilized hardware
- vertex processor
- rasterizer
- Reduce instruction count at the per-fragment level
- Avoid lookups being treated as texture indirections
18Computational Frequency
- Precompute texture coordinates
frag2frame Smooth(vert2frag IN,
                  uniform samplerRECT Source   : texunit0,
                  uniform samplerRECT Operator : texunit1,
                  uniform samplerRECT Boundary : texunit2,
                  uniform float4 params)
{
    frag2frame OUT;

    float2 center = IN.TexCoord0.xy;
    float4 U = f4texRECT(Source, center);

    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y))
    {
        float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                               params.x*center.y - 0.5f*(params.x-1.0f));
        ...
        float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                                 center.y - 1.0f, center.y + 1.0f);
        float central = -2.0f*(O.x + O.y);
        float poisson = ((params.x*params.x)*U.z +
                         (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) +
                          -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) +
                          -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) +
                          -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w;
        OUT.COL.x = poisson;
    }
    ...
    return OUT;
}
19Computational Frequency
- Precompute texture coordinates
vert2frag smooth(app2vert IN,
                 uniform float4x4 xform : C0,
                 uniform float2 srcoffset,
                 uniform float size)
{
    vert2frag OUT;

    OUT.position = mul(xform, IN.position);
    OUT.center = IN.center;
    OUT.redblack = IN.center - srcoffset;
    OUT.operator = size*(OUT.redblack - 0.5f) + 0.5f;
    OUT.hneighbor = IN.center.xxyx + float4(-1.0f, 1.0f, 0.0f, 0.0f);
    OUT.vneighbor = IN.center.xyyy + float4(0.0f, -1.0f, 1.0f, 0.0f);

    return OUT;
}
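Moving these coordinate computations to the vertex program is safe because each one is an affine function of the texture coordinate, and linear interpolation commutes with affine maps. The Python sketch below checks this in 1D. It is a simplified model, not the author's code: the function names are hypothetical, and it assumes plain linear interpolation across the primitive (for example an orthographic full-screen quad, with no perspective correction).

```python
def operator_coord(center, srcoffset, size):
    """Affine map mirroring OUT.operator = size*(redblack - 0.5) + 0.5."""
    redblack = center - srcoffset
    return size * (redblack - 0.5) + 0.5


def lerp(a, b, t):
    """Linear interpolation, standing in for the rasterizer."""
    return a + (b - a) * t


def per_fragment(c0, c1, t, srcoffset, size):
    """Evaluate the map at every fragment (the naive approach)."""
    return operator_coord(lerp(c0, c1, t), srcoffset, size)


def per_vertex(c0, c1, t, srcoffset, size):
    """Evaluate the map at the two vertices; interpolate the results."""
    return lerp(operator_coord(c0, srcoffset, size),
                operator_coord(c1, srcoffset, size), t)
```

Both functions agree up to floating-point rounding, but `per_vertex` evaluates the map only at the endpoints, however many fragments lie between them.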
20Computational Frequency
- Precomputing other values
- Same deal! Factor other computations out
- Anything that varies linearly across the geometry
- Anything that has a complex value computed per-vertex
- Anything that is uniform across the geometry
21Computational Frequency
- Precomputing on the CPU
- Use glMultiTexCoord4f() creatively
- Extract as much uniformity from uniform
parameters as you can
22Computational Frequency
- Precomputed lookup tables
    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + 0.5f) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x + mask.y) && params.y) || (!(mask.x + mask.y) && !params.y))
    ...
23Computational Frequency
- Precomputed lookup tables
    half4 mask = f4texRECT(RedBlack, IN.redblack);

    /* mask.x and mask.w tell whether IN.center.x and IN.center.y
       are both odd or both even, respectively. either of these two
       conditions indicates that the fragment is red.
       params.x==1 selects red; params.y==1 selects black. */
    if (dot(mask, params.xyyx))
    ...
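The lookup-table idea can be sketched in Python as a 2 x 2 table indexed by the parity of the pixel coordinates, compared against the arithmetic version. The exact texel layout here is an assumption: the slide only states that mask.x and mask.w flag red, so filling y and z with a "black" flag is a guess made to match dot(mask, params.xyyx). All names are hypothetical.

```python
def is_red_arithmetic(x, y):
    """Per-fragment arithmetic: red means x and y have the same parity."""
    place_x, place_y = x % 2, y % 2
    both_even = (1 - place_x) * (1 - place_y)
    both_odd = place_x * place_y
    return both_even + both_odd == 1


def build_redblack_table():
    """Precompute a 2x2 table of (both_odd, black, black, both_even) texels."""
    table = {}
    for py in (0, 1):
        for px in (0, 1):
            both_odd = px * py
            both_even = (1 - px) * (1 - py)
            black = 1 - both_odd - both_even
            table[(px, py)] = (both_odd, black, black, both_even)
    return table


def is_selected(table, x, y, params):
    """Models dot(mask, params.xyyx); params = (select_red, select_black)."""
    mask = table[(x % 2, y % 2)]
    selector = (params[0], params[1], params[1], params[0])
    return sum(m * s for m, s in zip(mask, selector)) != 0
```

The table is built once on the CPU; the per-fragment work then reduces to one lookup and one dot product in place of the floor/modf/round chain.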
24Computational Frequency
- Precomputed lookup tables
- Be careful with texture lookups: cache coherence is crucial
- Use the smallest data types you can get away with to reduce bandwidth consumption
- Use swizzles or writemasks on tex ops when possible
- Computation is cheap; memory accesses are not.
- ...if you're memory access limited.
25Profiling and Load Balancing
26Profiling and Load Balancing
- Software profiling
- GPU pipeline profiling
- GPU load balancing
27Profiling and Load Balancing
- Run a standard software profiler!
- Rational Quantify
- Intel VTune
- AMD CodeAnalyst
28Profiling and Load Balancing
- GPU Pipeline Profiling
- This is where it gets tricky.
- Some tools exist to help you
- NVPerfKit: NVIDIA exhibitor tech talk tomorrow morning at 10am in room 404A
- NVPerfHUD: http://developer.nvidia.com/docs/IO/8343/How-To-Profile.pdf
- NVShaderPerf: http://developer.nvidia.com/object/nvshaderperf_home.html
- Apple OpenGL Profiler: http://developer.apple.com/opengl/profiler_image.html
29Profiling and Load Balancing
- GPU Load Balancing
- This is a whole talk in and of itself
- e.g., http://developer.nvidia.com/docs/IO/8343/Performance-Optimisation.pdf
- Be sure to read the NVIDIA GPU Programming Guide
- http://developer.nvidia.com/object/gpu_programming_guide.html
- Sometimes you can get more hints from third parties than from the vendors themselves
- http://www.3dcenter.de/artikel/cinefx/index6_e.php
- http://www.3dcenter.de/artikel/nv40_technik/
30Conclusions
31Conclusions
- Get used to thinking in terms of parallel computation
- Understand how frequently each computation will run, and reduce that frequency wherever possible
- Track down bottlenecks in your application, and shift work to other parts of the system that are idle
32Questions?
- Acknowledgements
- Pat Brown at NVIDIA
- NVIDIA for having given me a job this summer
- Dave Luebke, my advisor
- GPGPU course presenters
33See Also