GPU Program Optimization

Slides: 34

Provided by: steve1631

Transcript and Presenter's Notes

1
GPU Program Optimization

Cliff Woolley
University of Virginia / NVIDIA

2
Overview

Data Parallel Computing
Computational Frequency
Profiling and Load Balancing

3
Data Parallel Computing
4
Data Parallel Computing

Instruction-Level Parallelism
Data-Level Parallelism

5
A really naïve shader
frag2frame Smooth(vert2frag IN,
                  uniform samplerRECT Source   : texunit0,
                  uniform samplerRECT Operator : texunit1,
                  uniform samplerRECT Boundary : texunit2,
                  uniform float4 params)
{
    frag2frame OUT;
    float2 center = IN.TexCoord0.xy;
    float4 U = f4texRECT(Source, center);

    // Calculate Red-Black (odd-even) masks
    float2 intpart;
    float2 place = floor(1.0f - modf(round(center + float2(0.5f, 0.5f)) / 2.0f, intpart));
    float2 mask = float2((1.0f-place.x) * (1.0f-place.y), place.x * place.y);

    if (((mask.x || mask.y) && params.y) || (!(mask.x || mask.y) && !params.y))
    {
        float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                               params.x*center.y - 0.5f*(params.x-1.0f));
        ...
        float4 neighbor = float4(center.x - 1.0f, center.x + 1.0f,
                                 center.y - 1.0f, center.y + 1.0f);
        float central = -2.0f*(O.x + O.y);
        float poisson = ((params.x*params.x)*U.z +
                         (-O.x * f1texRECT(Source, float2(neighbor.x, center.y)) +
                          -O.x * f1texRECT(Source, float2(neighbor.y, center.y)) +
                          -O.y * f1texRECT(Source, float2(center.x, neighbor.z)) +
                          -O.z * f1texRECT(Source, float2(center.x, neighbor.w)))) / O.w;
        OUT.COL.x = poisson;
        ...
    }
    return OUT;
}
7
Instruction-Level Parallelism
float2 offset = float2(params.x*center.x - 0.5f*(params.x-1.0f),
                       params.x*center.y - 0.5f*(params.x-1.0f));
float4 neighbor = float4(center.x - 1.0f,
                         center.x + 1.0f,
                         center.y - 1.0f,
                         center.y + 1.0f);
8
Instruction-Level Parallelism
float2 offset = center.xy - 0.5f;
offset = offset * params.xx + 0.5f;   // MADR is cool too: one cycle, two flops
float4 neighbor = center.xxyy + float4(-1.0f,1.0f,-1.0f,1.0f);
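
The rewrite above only helps if both forms of offset and neighbor produce the same values. The following is a minimal CPU-side sketch in Python (not shader code; the function names and sample coordinates are assumptions made for illustration) that checks the algebra: s*c - 0.5*(s - 1) equals (c - 0.5)*s + 0.5, and the swizzled add reproduces the four neighbor coordinates.

```python
import math

def offset_naive(center, scale):
    """Per-component form from the naive shader: scale*c - 0.5*(scale - 1)."""
    return tuple(scale * c - 0.5 * (scale - 1.0) for c in center)

def offset_vectorized(center, scale):
    """Rewritten form: one vector subtract, then one multiply-add."""
    shifted = tuple(c - 0.5 for c in center)         # center.xy - 0.5
    return tuple(s * scale + 0.5 for s in shifted)   # offset * params.xx + 0.5

def neighbor_vectorized(center):
    """center.xxyy + float4(-1, 1, -1, 1) as a single 4-wide add."""
    x, y = center
    swizzled = (x, x, y, y)
    return tuple(a + b for a, b in zip(swizzled, (-1.0, 1.0, -1.0, 1.0)))

# Compare both offset forms over a few texel centers and scale factors.
for c in [(0.5, 0.5), (3.5, 7.5), (128.5, 64.5)]:
    for scale in (1.0, 2.0, 4.0):
        a, b = offset_naive(c, scale), offset_vectorized(c, scale)
        assert all(math.isclose(p, q) for p, q in zip(a, b))
```

The vectorized form does the same arithmetic with fewer, wider operations, which is the point of the slide.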
9
Data-Level Parallelism

Pack scalar data into RGBA in texture memory
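
As a rough illustration of the packing idea, here is a small Python sketch (a CPU-side model only; pack_rgba, unpack_rgba, and scale_texels are hypothetical helper names, and zero padding is an assumption) in which four scalars share one RGBA texel, so each simulated fragment invocation processes four values at once.

```python
def pack_rgba(scalars, pad=0.0):
    """Group a flat list of scalars into 4-component (R, G, B, A) texels."""
    padded = list(scalars) + [pad] * (-len(scalars) % 4)
    return [tuple(padded[i:i + 4]) for i in range(0, len(padded), 4)]

def unpack_rgba(texels, count):
    """Flatten texels back to the first `count` scalars, dropping padding."""
    flat = [v for texel in texels for v in texel]
    return flat[:count]

def scale_texels(texels, k):
    """One 'fragment' per texel: each invocation handles four scalars."""
    return [tuple(k * v for v in texel) for texel in texels]

data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
texels = pack_rgba(data)                       # 2 texels instead of 6
result = unpack_rgba(scale_texels(texels, 2.0), len(data))
```

Six scalars need only two simulated fragments here instead of six.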

10
Computational Frequency
11
Computational Frequency

Think of your CPU program and your vertex and
fragment programs as different levels of nested
looping.

...
foreach tri in triangles
{
    // run the vertex program on each vertex
    v1 = process_vertex(tri.vertex1);
    v2 = process_vertex(tri.vertex2);
    v3 = process_vertex(tri.vertex3);
    // assemble the vertices into a triangle
    assembledtriangle = setup_tri(v1, v2, v3);
    // rasterize the assembled triangle into [0..many] fragments
    fragments = rasterize(assembledtriangle);
    // run the fragment program on each fragment
    foreach frag in fragments
    {
        outbuffer[frag.position] = process_fragment(frag);
    }
}
...
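
To put numbers on the nested-loop picture, the Python sketch below counts how often each level runs for a single draw call. It is an idealized model under stated assumptions: run_pipeline is a hypothetical name, every triangle is assumed to produce the same number of fragments, and a 512x512 target covered by two triangles is just an example size.

```python
from collections import Counter

def run_pipeline(triangles, fragments_per_triangle):
    """Count how often each pipeline level executes for one draw call."""
    counts = Counter()
    counts["cpu_setup"] += 1                  # once per draw: uniforms, state
    for _tri in range(triangles):
        counts["vertex_program"] += 3         # once per vertex
        counts["fragment_program"] += fragments_per_triangle  # once per fragment
    return counts

# Two triangles covering a 512x512 target, split evenly (idealized assumption).
counts = run_pipeline(triangles=2, fragments_per_triangle=512 * 512 // 2)
```

In this model the fragment program runs 262,144 times against 6 vertex-program runs and 1 CPU setup, which is why work moved outward from the fragment level pays off so heavily.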
12
Computational Frequency

Branches
Avoid these, especially in the inner loop, i.e.,
the fragment program.

13
Computational Frequency

Static branch resolution
write several variants of each fragment program
to handle boundary cases
eliminates conditionals in the fragment program
equivalent to avoiding CPU inner-loop branching

case 1: no boundaries
case 2: accounts for boundaries
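
The two cases can be sketched on the CPU as follows. This is a minimal Python model, not the deck's shader: the 1D three-point average and the pass-through boundary rule are assumptions chosen for brevity, and all function names are hypothetical. The point is that the caller decides which variant runs on which region, so the interior kernel carries no conditional.

```python
def smooth_branching(u):
    """Single kernel: every element pays for the boundary test."""
    out = []
    for i in range(len(u)):
        if i == 0 or i == len(u) - 1:          # boundary case
            out.append(u[i])
        else:
            out.append((u[i - 1] + u[i] + u[i + 1]) / 3.0)
    return out

def kernel_interior(u, i):
    """Variant 1: no boundaries, no conditionals."""
    return (u[i - 1] + u[i] + u[i + 1]) / 3.0

def kernel_boundary(u, i):
    """Variant 2: boundary elements are passed through unchanged."""
    return u[i]

def smooth_static(u):
    """The caller picks the variant per region, much like drawing separate
    geometry for the interior and for the edges."""
    out = [0.0] * len(u)
    for i in (0, len(u) - 1):
        out[i] = kernel_boundary(u, i)
    for i in range(1, len(u) - 1):
        out[i] = kernel_interior(u, i)
    return out

grid = [0.0, 3.0, 6.0, 9.0, 0.0]
```

Both versions compute the same result; only the static version keeps the test out of the inner loop.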
14
Computational Frequency

Dynamic branching
Dynamic branching on NV4x and G70 hardware is
better than branching with NV3x
But still, there is a branch penalty
Good perf requires spatial coherence in branching

15
Computational Frequency

Branches
Ian Buck will talk more about various branching
techniques after lunch

16
Computational Frequency

Precompute
Precompute
Precompute

17
Computational Frequency

Precompute texture coordinates
Take advantage of under-utilized hardware
vertex processor
rasterizer
Reduce instruction count at the per-fragment
level
Avoid lookups being treated as texture
indirections

18
Computational Frequency

Precompute texture coordinates

19
Computational Frequency

Precompute texture coordinates

vert2frag smooth(app2vert IN,
                 uniform float4x4 xform : C0,
                 uniform float2 srcoffset,
                 uniform float size)
{
    vert2frag OUT;
    OUT.position = mul(xform, IN.position);
    OUT.center = IN.center;
    OUT.redblack = IN.center - srcoffset;
    OUT.operator = size*(OUT.redblack - 0.5f) + 0.5f;
    OUT.hneighbor = IN.center.xxyx + float4(-1.0f, 1.0f, 0.0f, 0.0f);
    OUT.vneighbor = IN.center.xyyy + float4(0.0f, -1.0f, 1.0f, 0.0f);
    return OUT;
}
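
Moving these computations to the vertex program works because each output is an affine function of the texture coordinate, and linear interpolation commutes with affine maps. A small Python check of that claim follows; it assumes a single 1D edge with no perspective correction, and lerp, operator_coord, per_fragment, and per_vertex are hypothetical names.

```python
import math

def lerp(a, b, t):
    """Linear interpolation, standing in for the rasterizer's interpolators."""
    return a + (b - a) * t

def operator_coord(redblack, size):
    """Affine map from the vertex program: size*(redblack - 0.5) + 0.5."""
    return size * (redblack - 0.5) + 0.5

def per_fragment(c0, c1, size, steps):
    """Naive: interpolate the raw coordinate, then evaluate at every fragment."""
    return [operator_coord(lerp(c0, c1, i / steps), size)
            for i in range(steps + 1)]

def per_vertex(c0, c1, size, steps):
    """Precomputed: evaluate at the two endpoints only, then interpolate."""
    v0, v1 = operator_coord(c0, size), operator_coord(c1, size)
    return [lerp(v0, v1, i / steps) for i in range(steps + 1)]

a = per_fragment(0.5, 8.5, 2.0, 8)
b = per_vertex(0.5, 8.5, 2.0, 8)
```

The two lists agree, but per_vertex evaluates the affine map twice instead of once per fragment.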
20
Computational Frequency

Precomputing other values
Same deal! Factor other computations out
Anything that varies linearly across the geometry
Anything that has a complex value computed
per-vertex
Anything that is uniform across the geometry

21
Computational Frequency

Precomputing on the CPU
Use glMultiTexCoord4f() creatively
Extract as much uniformity from uniform
parameters as you can

22
Computational Frequency

Precomputed lookup tables

// Calculate Red-Black (odd-even) masks
float2 intpart;
float2 place = floor(1.0f - modf(round(center + 0.5f) / 2.0f, intpart));
float2 mask = float2((1.0f-place.x) * (1.0f-place.y),
                     place.x * place.y);
if (((mask.x || mask.y) && params.y) || (!(mask.x || mask.y) && !params.y))
{
    ...
}
23
Computational Frequency

Precomputed lookup tables

half4 mask = f4texRECT(RedBlack, IN.redblack);
/* mask.x and mask.w tell whether IN.center.x and IN.center.y are both
   odd or both even, respectively. either of these two conditions
   indicates that the fragment is red. params.x==1 selects red;
   params.y==1 selects black. */
if (dot(mask, params.xyyx))
{
    ...
}
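
The lookup can be modeled on the CPU as a 2x2 table indexed by coordinate parity. The Python sketch below is an illustration under assumptions: the source only states what mask.x and mask.w hold, so the contents of the two middle channels are a guess, integer cell indices replace texture coordinates, and all names are hypothetical.

```python
def build_redblack_table():
    """2x2 RGBA table indexed by (x % 2, y % 2); exactly one channel is 1."""
    table = {}
    for px in (0, 1):
        for py in (0, 1):
            both_odd = float(px == 1 and py == 1)     # mask.x per the slide
            both_even = float(px == 0 and py == 0)    # mask.w per the slide
            x_odd_only = float(px == 1 and py == 0)   # assumed channel layout
            y_odd_only = float(px == 0 and py == 1)   # assumed channel layout
            table[(px, py)] = (both_odd, x_odd_only, y_odd_only, both_even)
    return table

TABLE = build_redblack_table()

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def selected(x, y, params_x, params_y):
    """Mirror of `if (dot(mask, params.xyyx))` for an integer cell (x, y)."""
    mask = TABLE[(x % 2, y % 2)]
    return dot(mask, (params_x, params_y, params_y, params_x)) != 0.0

def is_red(x, y):
    """Direct parity test that the lookup replaces."""
    return (x + y) % 2 == 0
```

With params_x = 1 the test passes exactly on red cells, and with params_y = 1 exactly on black cells, so one fetch and one dot product replace the round/modf/floor arithmetic.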
24
Computational Frequency

Precomputed lookup tables
Be careful with texture lookups: cache coherence
is crucial
Use the smallest data types you can get away with
to reduce bandwidth consumption
Use swizzles or writemasks on tex ops when
possible
Computation is cheap; memory accesses are
not. ...if you're memory access limited.
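
For a sense of scale on the data-type advice, this Python sketch does the bandwidth arithmetic for one full read of a texture. The 1024x1024 size is an arbitrary example, bytes_per_pass is a hypothetical helper, and the model ignores caching and compression, so treat the numbers as upper bounds on traffic rather than measurements.

```python
import struct

def bytes_per_pass(width, height, fmt):
    """Bytes fetched if every texel of a width x height texture is read once."""
    return width * height * struct.calcsize(fmt)

float4 = bytes_per_pass(1024, 1024, "<4f")   # four 32-bit floats per texel
half4 = bytes_per_pass(1024, 1024, "<4e")    # four 16-bit halves per texel
half1 = bytes_per_pass(1024, 1024, "<e")     # one 16-bit half per texel
```

Under these assumptions, dropping from float4 to half4 halves the bytes read, and fetching a single half channel cuts them by a factor of eight.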

25
Profiling and Load Balancing
26
Profiling and Load Balancing

Software profiling
GPU pipeline profiling
GPU load balancing

27
Profiling and Load Balancing

Run a standard software profiler!
Rational Quantify
Intel VTune
AMD CodeAnalyst

28
Profiling and Load Balancing

GPU Pipeline Profiling
This is where it gets tricky.
Some tools exist to help you
NVPerfKit: NVIDIA exhibitor tech talk tomorrow
morning at 10am in room 404A
NVPerfHUD: http://developer.nvidia.com/docs/IO/8343/How-To-Profile.pdf
NVShaderPerf: http://developer.nvidia.com/object/nvshaderperf_home.html
Apple OpenGL Profiler: http://developer.apple.com/opengl/profiler_image.html

29
Profiling and Load Balancing

GPU Load Balancing
This is a whole talk in and of itself
e.g., http://developer.nvidia.com/docs/IO/8343/Performance-Optimisation.pdf
Be sure to read the NVIDIA GPU Programming Guide
http://developer.nvidia.com/object/gpu_programming_guide.html
Sometimes you can get more hints from third
parties than from the vendors themselves
http://www.3dcenter.de/artikel/cinefx/index6_e.php
http://www.3dcenter.de/artikel/nv40_technik/

30
Conclusions
31
Conclusions

Get used to thinking in terms of parallel
computation
Understand how frequently each computation will
run, and reduce that frequency wherever possible
Track down bottlenecks in your application, and
shift work to other parts of the system that are
idle

32
Questions?

Acknowledgements
Pat Brown at NVIDIA
NVIDIA for having given me a job this summer
Dave Luebke, my advisor
GPGPU course presenters
