Title: Brook for GPUs
1 Brook for GPUs
- Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman
- Pat Hanrahan
- GCafe December 10th, 2003
2 Brook general purpose streaming language
- developed for the Stanford streaming supercomputing project
- architecture: Merrimac
- compiler: RStream
- Reservoir Labs
- Center for Turbulence Research
- NASA
- DARPA PCA Program
- Stanford SmartMemories
- UT Austin TRIPS
- MIT RAW
- Brook version 0.2 spec: http://merrimac.stanford.edu
3 why graphics hardware?
GeForce FX
4 why graphics hardware?
- Pentium 4 SSE theoretical
- 3 GHz * 4 wide * .5 inst / cycle = 6 GFLOPS
- GeForce FX 5900 (NV35) fragment shader obtained
- MULR R0, R0, R0: 20 GFLOPS
- equivalent to a 10 GHz P4
- and getting faster: 3x improvement over NV30 (6 months)
GeForce FX
from Intel P4 Optimization Manual
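The GFLOPS figures on this slide follow from simple arithmetic. A minimal sketch, assuming peak throughput is clock rate times SIMD width times instructions issued per cycle; the helper names `peak_gflops` and `equivalent_clock_ghz` are hypothetical and do not appear in the slides:

```c
/* Peak throughput in GFLOPS, assuming
 * peak = clock (GHz) * SIMD width * instructions per cycle. */
double peak_gflops(double clock_ghz, double simd_width, double inst_per_cycle)
{
    return clock_ghz * simd_width * inst_per_cycle;
}

/* Clock rate (GHz) a processor with the given width and issue rate
 * would need in order to reach target_gflops. */
double equivalent_clock_ghz(double target_gflops, double simd_width,
                            double inst_per_cycle)
{
    return target_gflops / (simd_width * inst_per_cycle);
}
```

With the slide's numbers, `peak_gflops(3.0, 4.0, 0.5)` gives the 6 GFLOPS quoted for the Pentium 4, and `equivalent_clock_ghz(20.0, 4.0, 0.5)` gives the 10 GHz quoted as the P4 equivalent of the measured 20 GFLOPS.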
6 gpu data parallel arithmetic intensity
- data parallelism
- each fragment shaded independently
- better alu usage
- hide memory latency
- arithmetic intensity
- compute-to-bandwidth ratio
- lots of ops per word transferred
- app limited by alu performance, not off-chip bandwidth
- more chip real estate for alus, not caches
7 Brook general purpose streaming language
- stream programming model
- enforce data parallel computing
- streams
- encourage arithmetic intensity
- kernels
- C with streams
8 Brook for gpus
- demonstrate gpu streaming coprocessor
- explicit programming abstraction
11 Brook for gpus
- demonstrate gpu streaming coprocessor
- make programming gpus easier
- hide texture/pbuffer data management
- hide graphics based constructs in CG/HLSL
- hide rendering passes
- virtualize resources
- performance!
- on applications that matter
- highlight gpu areas for improvement
- features required for general purpose stream computing
12 system outline
- .br
- Brook source files
- brcc
- source to source compiler
- brt
- Brook run-time library
14 Brook language: streams
- streams
- collection of records requiring similar computation
- particle positions, voxels, FEM cell, ...
    float3 positions<200>;
    float3 velocityfield<100,100,100>;
- similar to arrays, but
- index operations disallowed: position[i]
- read/write stream operators
    streamRead (positions, p_ptr);
    streamWrite (velocityfield, v_ptr);
- encourage data parallelism
17 Brook language: kernels
- kernels
- functions applied to streams
- similar to for_all construct
    kernel void foo (float a<>, float b<>,
                     out float result<>) {
      result = a + b;
    }
    float a<100>;
    float b<100>;
    float c<100>;
    foo(a,b,c);
    for (i=0; i<100; i++) c[i] = a[i]+b[i];
18 Brook language: kernels
- kernels
- functions applied to streams
- similar to for_all construct
    kernel void foo (float a<>, float b<>,
                     out float result<>) {
      result = a + b;
    }
- no dependencies between stream elements
- encourage high arithmetic intensity
19 Brook language: kernels
- kernels arguments
- input/output streams
    kernel void foo (float a<>, float b<>,
                     out float result<>) {
      result = a + b;
    }
- a, b: read-only input streams
- result: write-only output stream
20 Brook language: kernels
- kernels arguments
- input/output streams
- constant parameters
    kernel void foo (float a<>, float b<>,
                     float t,
                     out float result<>) {
      result = a + t*b;
    }
    float a<100>;
    float b<100>;
    float c<100>;
    foo(a,b,3.2f,c);
21 Brook language: kernels
- kernels arguments
- input/output streams
- constant parameters
- gather streams
    kernel void foo (float a<>, float b<>,
                     float t, float array[],
                     out float result<>) {
      result = array[a] + t*b;
    }
    float a<100>;
    float b<100>;
    float c<100>;
    float array<25>;
    foo(a,b,3.2f,array,c);
gpu bonus
22 Brook language: kernels
- kernels arguments
- input/output streams
- constant parameters
- gather streams
- iterator streams
    kernel void foo (float a<>, float b<>,
                     float t, float array[],
                     iter float n<>,
                     out float result<>) {
      result = array[a] + t*b + n;
    }
    float a<100>;
    float b<100>;
    float c<100>;
    float array<25>;
    iter float n<100> = iter(0, 10);
gpu bonus
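One way to read the kernel above is as a plain C loop over the stream elements. The sketch below is a CPU-side illustration only, under two assumptions that the slides do not state: the gather index `a` is truncated to an integer and is in range, and `iter(start, end)` over `count` elements yields `start + i*(end-start)/count`. The names `fill_iter` and `foo_cpu` are hypothetical.

```c
#include <stddef.h>

/* Assumed iterator-stream semantics: n[i] = start + i*(end-start)/count. */
void fill_iter(float *n, size_t count, float start, float end)
{
    for (size_t i = 0; i < count; i++)
        n[i] = start + (end - start) * (float)i / (float)count;
}

/* Element-wise reading of: result = array[a] + t*b + n
 * a, b, n : input streams (one element consumed per iteration)
 * t       : constant parameter
 * array   : gather stream, indexed freely by the kernel */
void foo_cpu(const float *a, const float *b, float t,
             const float *array, const float *n,
             float *result, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        size_t idx = (size_t)a[i];  /* assumed: truncation, index in range */
        result[i] = array[idx] + t * b[i] + n[i];
    }
}
```

No iteration reads another iteration's output, which is what lets each element map to an independently shaded fragment.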
23 Brook language: kernels
- example
- position update in velocity field
    kernel void updatepos (float2 pos<>,
                           float2 vel[100][100],
                           float timestep,
                           out float2 newpos<>) {
      newpos = pos + vel[pos]*timestep;
    }
    updatepos (positions, velocityfield,
               10.0f, positions);
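A CPU-side sketch of the same position update may help make the gather explicit. It assumes, beyond what the slide says, that `pos.x` selects the column and `pos.y` the row of the velocity grid, that indices are truncated and clamped to the grid, and that the grid is stored row-major in a flat array. `float2`, `clamp_index` and `updatepos_cpu` are illustrative names, not the code brcc generates.

```c
#include <stddef.h>

typedef struct { float x, y; } float2;

/* Truncate a coordinate to a grid index, clamped to [0, dim-1] (assumption). */
static size_t clamp_index(float v, size_t dim)
{
    if (v < 0.0f) return 0;
    if (v >= (float)dim) return dim - 1;
    return (size_t)v;
}

/* newpos[i] = pos[i] + vel[pos[i]] * timestep, one element at a time.
 * vel is a width x height grid stored row-major. */
void updatepos_cpu(const float2 *pos, const float2 *vel,
                   size_t width, size_t height,
                   float timestep, float2 *newpos, size_t count)
{
    for (size_t i = 0; i < count; i++) {
        float2 p = pos[i];                       /* read before any write */
        size_t col = clamp_index(p.x, width);
        size_t row = clamp_index(p.y, height);
        float2 v = vel[row * width + col];       /* gather */
        newpos[i].x = p.x + v.x * timestep;
        newpos[i].y = p.y + v.y * timestep;
    }
}
```

Because each element reads only its own position, passing the same array as input and output, as the slide's call does with `positions`, is safe in this loop.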
24 Brook language: reductions
26 Brook language: reductions
- reductions
- compute single value from a stream
    reduce void sum (float a<>,
                     reduce float r<>) {
      r += a;
    }
    float a<100>;
    float r;
    sum(a,r);
    r = a[0];
    for (int i=1; i<100; i++) r += a[i];
27 Brook language: reductions
- reductions
- associative operations only
- (a+b)+c = a+(b+c)
- sum, multiply, max, min, OR, AND, XOR
- matrix multiply
30 Brook language: reductions
- multi-dimension reductions
- stream shape differences resolved by reduce function
    reduce void sum (float a<>,
                     reduce float r<>) {
      r += a;
    }
    float a<20>;
    float r<5>;
    sum(a,r);
    for (int i=0; i<5; i++) {
      r[i] = a[i*4];
      for (int j=1; j<4; j++)
        r[i] += a[i*4 + j];
    }
34 Brook language: stream repeat stride
- kernel arguments of different shape
- resolved by repeat and stride
    kernel void foo (float a<>, float b<>,
                     out float result<>)
    float a<20>;
    float b<5>;
    float c<10>;
    foo(a,b,c);
    foo(a[0],  b[0], c[0])
    foo(a[2],  b[0], c[1])
    foo(a[4],  b[1], c[2])
    foo(a[6],  b[1], c[3])
    foo(a[8],  b[2], c[4])
    foo(a[10], b[2], c[5])
    foo(a[12], b[3], c[6])
    foo(a[14], b[3], c[7])
    foo(a[16], b[4], c[8])
    foo(a[18], b[4], c[9])
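The index pattern in the expansion above can be reproduced with a small mapping from output element to input element. This is a sketch that assumes one-dimensional streams whose lengths are integer multiples of each other; `resolve_index` is a hypothetical helper, not a Brook runtime function.

```c
#include <stddef.h>

/* Map output element i to the input element it reads.
 * Longer input:  stride through it (every (in_len/out_len)-th element).
 * Shorter input: repeat each element (out_len/in_len) times.
 * Assumes the longer length is an integer multiple of the shorter. */
size_t resolve_index(size_t i, size_t in_len, size_t out_len)
{
    if (in_len >= out_len)
        return i * (in_len / out_len);   /* stride */
    return i / (out_len / in_len);       /* repeat */
}
```

For the slide's shapes, `resolve_index(3, 20, 10)` is 6 and `resolve_index(3, 5, 10)` is 1, matching `foo(a[6], b[1], c[3])`.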
35 Brook language: matrix vector multiply
    kernel void mul (float a<>, float b<>,
                     out float result<>) {
      result = a*b;
    }
    reduce void sum (float a<>,
                     reduce float result<>) {
      result += a;
    }
    float matrix<20,10>;
    float vector<1,10>;
    float tempmv<20,10>;
    float result<20,1>;
    mul(matrix,vector,tempmv);
    sum(tempmv,result);
[Figure: mul combines M with V (repeated) to give T]
36 Brook language: matrix vector multiply
    kernel void mul (float a<>, float b<>,
                     out float result<>) {
      result = a*b;
    }
    reduce void sum (float a<>,
                     reduce float result<>) {
      result += a;
    }
    float matrix<20,10>;
    float vector<1,10>;
    float tempmv<20,10>;
    float result<20,1>;
    mul(matrix,vector,tempmv);
    sum(tempmv,result);
[Figure: sum reduces T to R]
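The two-step structure (element-wise multiply with the vector repeated down the rows, then a reduction along each row) can be written out on the CPU as below. This is an illustrative sketch assuming row-major storage; `matvec_cpu` is a hypothetical name and the shapes are passed as parameters rather than fixed at 20 by 10.

```c
#include <stddef.h>

/* matrix : rows x cols, row-major
 * vector : cols elements (shape <1,cols>, repeated for every row)
 * temp   : rows x cols scratch, the slide's tempmv
 * result : rows elements (shape <rows,1>) */
void matvec_cpu(const float *matrix, const float *vector,
                float *temp, float *result,
                size_t rows, size_t cols)
{
    /* mul kernel: temp = matrix * vector, element by element */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            temp[r * cols + c] = matrix[r * cols + c] * vector[c];

    /* sum reduction: collapse each row of temp to one value */
    for (size_t r = 0; r < rows; r++) {
        result[r] = temp[r * cols];
        for (size_t c = 1; c < cols; c++)
            result[r] += temp[r * cols + c];
    }
}
```

The shape mismatch between `vector<1,10>` and `matrix<20,10>` is what triggers the repeat, and the mismatch between `tempmv<20,10>` and `result<20,1>` is what tells the reduction to collapse along rows.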
37 brcc compiler: infrastructure
38 brcc compiler: infrastructure
- based on ctool
- http://ctool.sourceforge.net
- parser
- build code tree
- extend C grammar to accept Brook
- convert
- tree transformations
- codegen
- generate cg hlsl code
- call cgc, fxc
- generate stub function
39 brcc compiler: kernel compilation
    kernel void updatepos (float2 pos<>,
                           float2 vel[100][100],
                           float timestep,
                           out float2 newpos<>) {
      newpos = pos + vel[pos]*timestep;
    }
    float4 main (uniform float4 _workspace : register (c0),
                 uniform sampler _tex_pos : register (s0),
                 float2 _tex_pos_pos : TEXCOORD0,
                 uniform sampler vel : register (s1),
                 uniform float4 vel_scalebias : register (c1),
                 uniform float timestep : register (c2)) : COLOR0 {
      float4 _OUT;
      float2 pos;
      float2 newpos;
      pos = tex2D(_tex_pos, _tex_pos_pos).xy;
      newpos = pos + tex2D(vel,(pos).xy*vel_scalebias.xy+vel_scalebias.zw).xy * timestep;
      _OUT.x = newpos.x;
      _OUT.y = newpos.y;
      _OUT.z = newpos.y;
      _OUT.w = newpos.y;
      return _OUT;
    }
40 brcc compiler: kernel compilation
    static const char __updatepos_ps20[] = "ps_2_0 .....
    static const char __updatepos_fp30[] = "!!fp30 .....
    void updatepos (const __BRTStream pos,
                    const __BRTStream vel,
                    const float timestep,
                    const __BRTStream newpos) {
      static const void *__updatepos_fp[] = {
        "fp30", __updatepos_fp30,
        "ps20", __updatepos_ps20,
        "cpu", (void *) __updatepos_cpu,
        "combine", 0,
        NULL, NULL };
      static __BRTKernel k(__updatepos_fp);
      k->PushStream(pos);
      k->PushGatherStream(vel);
      k->PushConstant(timestep);
      k->PushOutput(newpos);
      k->Map();
    }
41 brt runtime: streams
42 brt runtime: streams
- separate texture per stream
- vel: texture 1
- pos: texture 2
43 brt runtime: kernels
- kernel execution
- set stream texture as render target
- bind inputs to texture units
- issue screen size quad
- texture coords provide stream positions
    kernel void foo (float a<>, float b<>,
                     out float result<>) {
      result = a + b;
    }
[Figure: streams a and b flow through kernel foo into result; labels: vel, a, foo, b, result]
44 brt runtime: reductions
- reduction execution
- multipass execution
- associativity required
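To see why associativity matters for multipass execution, consider a reduction in which every pass combines neighbouring pairs and so roughly halves the number of elements. The sketch below is a CPU model of that idea only; the slides do not say how brt schedules its passes, so the pairwise scheme and the name `multipass_sum` are assumptions.

```c
#include <stddef.h>

/* Reduce buf[0..n) to buf[0] by summing neighbouring pairs, one "pass"
 * per loop iteration. Returns the number of passes performed.
 * The regrouping of additions is only valid for associative operators.
 * For n == 0 nothing is done and buf is left untouched. */
size_t multipass_sum(float *buf, size_t n)
{
    size_t passes = 0;
    while (n > 1) {
        size_t half = n / 2;
        for (size_t i = 0; i < half; i++)
            buf[i] = buf[2 * i] + buf[2 * i + 1];
        if (n % 2) {              /* odd element carries over unchanged */
            buf[half] = buf[n - 1];
            half++;
        }
        n = half;
        passes++;
    }
    return passes;
}
```

The sequential loop on the earlier reductions slide computes ((a0+a1)+a2)+..., while this scheme computes (a0+a1)+(a2+a3)+...; the two agree only when the operator is associative (and, for floating point, only up to rounding).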
45 research directions
- demonstrate gpu streaming coprocessor
- compiling Brook to gpus
- evaluation
- applications
46 research directions
- applications
- linear algebra
- image processing
- molecular dynamics (gromacs)
- FEM
- multigrid
- raytracer
- volume renderer
- SIGGRAPH / GH papers
47 research directions
- virtualize gpu resources
- texture size and formats
- packing streams to fit in 2D segmented memory space
    float matrix<8096,10,30,5>;
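One possible packing, sketched here purely as an illustration, linearizes the N-dimensional stream index and then wraps the linear address across rows of a 2D texture. The row-major ordering, the texture width, and the names `texcoord`, `linearize` and `to_texel` are all assumptions; the slides pose this as an open direction and do not describe a chosen layout.

```c
#include <stddef.h>

typedef struct { size_t x, y; } texcoord;

/* Row-major linear address of an N-dimensional index (last dim fastest). */
size_t linearize(const size_t *index, const size_t *dims, size_t rank)
{
    size_t lin = 0;
    for (size_t d = 0; d < rank; d++)
        lin = lin * dims[d] + index[d];
    return lin;
}

/* Wrap a linear address into a 2D texture tex_width texels wide. */
texcoord to_texel(size_t lin, size_t tex_width)
{
    texcoord t = { lin % tex_width, lin / tex_width };
    return t;
}
```

For `matrix<8096,10,30,5>` and an assumed 2048-texel-wide texture, element (2,0,0,0) has linear address 3000 and lands at texel (952, 1).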
48 research directions
- virtualize gpu resources
- texture size and formats
- support complex formats
    typedef struct {
      float3 pos;
      float3 vel;
      float mass;
    } particle;
    kernel void foo (particle a<>, float timestep,
                     out particle b<>)
    float a<100>[8][2];
49 research directions
- virtualize gpu resources
- multiple outputs
- simple: let cgc or fxc do dead code elimination
- better: compute intermediates separately
    kernel void foo (float3 a<>, float3 b<>,
                     ..., out float3 x<>, out float3 y<>)
    kernel void foo1(float3 a<>, float3 b<>,
                     ..., out float3 x<>)
    kernel void foo2(float3 a<>, float3 b<>,
                     ..., out float3 y<>)
50 research directions
- virtualize gpu resources
- limited instructions per kernel
- generalize RDS algorithm for kernels
- compute ideal # of passes for intermediate results
- hard ???
51 research directions
    kernel void foo (float a<>, float b<>,
                     out float result<>) {
      result = a + b;
    }
    kernel void foo_faster (float4 a<>, float4 b<>,
                            out float4 result<>) {
      result = a + b;
    }
52 research directions
- Brook v0.2 support
- stream operators
- stencil, group, domain, repeat, stride, merge, ...
- building and manipulating data structures
- scatterOp: a[i] += p
- gatherOp: p = a[i]++
- gpu primitives
53 research directions
- gpu areas of improvement
- reduction registers
- texture constraints
- scatter capabilities
- programmable blending
- gatherOp, scatterOp
54 Brook status
- team
- Jeremy Sugerman
- Daniel Horn
- Tim Foley
- Ian Buck
- beta release
- December 15th
- sourceforge
55 Questions?
Fly-fishing fly images from The English Fly Fishing Shop