Title: Brook for GPUs


1
Brook for GPUs
  • Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman
  • Pat Hanrahan
  • GCafe December 10th, 2003

2
Brook: general purpose streaming language
  • developed for the Stanford streaming supercomputing project
  • architecture: Merrimac
  • compiler: RStream
  • Reservoir Labs
  • Center for Turbulence Research
  • NASA
  • DARPA PCA Program
  • Stanford: SmartMemories
  • UT Austin: TRIPS
  • MIT: RAW
  • Brook version 0.2 spec: http://merrimac.stanford.edu

3
why graphics hardware?
GeForce FX
4
why graphics hardware?
  • Pentium 4 SSE theoretical
  • 3 GHz × 4 wide × 0.5 inst/cycle = 6 GFLOPS
  • GeForce FX 5900 (NV35) fragment shader obtained
  • MULR R0, R0, R0: 20 GFLOPS
  • equivalent to a 10 GHz P4
  • and getting faster: 3x improvement over NV30 (6 months)

GeForce FX
from Intel P4 Optimization Manual
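
The peak-rate arithmetic on this slide can be checked in a few lines. This is only a sketch of the slide's own numbers; the helper name `peak_gflops` is made up for illustration.

```python
def peak_gflops(clock_ghz, simd_width, inst_per_cycle):
    """Theoretical peak: clock rate x SIMD lanes x instructions issued per cycle."""
    return clock_ghz * simd_width * inst_per_cycle


# Pentium 4 SSE figures quoted on the slide
p4_sse = peak_gflops(3.0, 4, 0.5)

# Measured MULR rate quoted for the GeForce FX 5900 (NV35)
nv35 = 20.0

# Clock a P4 would need (same width and issue rate) to match the NV35 figure
equivalent_p4_ghz = nv35 / (4 * 0.5)

print(p4_sse, equivalent_p4_ghz)
```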
5
gpu: data parallel, arithmetic intensity
  • data parallelism
  • each fragment shaded independently
  • better alu usage
  • hide memory latency

6
gpu: data parallel, arithmetic intensity
  • data parallelism
  • each fragment shaded independently
  • better alu usage
  • hide memory latency
  • arithmetic intensity
  • compute-to-bandwidth ratio
  • lots of ops per word transferred
  • app limited by alu performance, not off-chip
    bandwidth
  • more chip real estate for alus, not caches
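
As a rough illustration of "ops per word transferred", the ratio can be computed directly. The operation and word counts below are hypothetical examples, not measurements from the talk.

```python
def arithmetic_intensity(ops, words_transferred):
    """Operations performed per word moved to or from memory."""
    return ops / words_transferred


# c = a + b on one element: 1 op, 3 words (read a, read b, write c)
low = arithmetic_intensity(ops=1, words_transferred=3)

# A hypothetical kernel doing 60 ops on the same 3 words
high = arithmetic_intensity(ops=60, words_transferred=3)

print(low, high)
```

The higher the ratio, the more likely the kernel is limited by ALU throughput rather than off-chip bandwidth.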

7
Brook: general purpose streaming language
  • stream programming model
  • enforce data parallel computing
  • streams
  • encourage arithmetic intensity
  • kernels
  • C with streams

8
Brook for gpus
  • demonstrate gpu streaming coprocessor
  • explicit programming abstraction

9
Brook for gpus
  • demonstrate gpu streaming coprocessor
  • make programming gpus easier
  • hide texture/pbuffer data management
  • hide graphics based constructs in Cg/HLSL
  • hide rendering passes
  • virtualize resources

10
Brook for gpus
  • demonstrate gpu streaming coprocessor
  • make programming gpus easier
  • hide texture/pbuffer data management
  • hide graphics based constructs in Cg/HLSL
  • hide rendering passes
  • virtualize resources
  • performance!
  • on applications that matter

11
Brook for gpus
  • demonstrate gpu streaming coprocessor
  • make programming gpus easier
  • hide texture/pbuffer data management
  • hide graphics based constructs in Cg/HLSL
  • hide rendering passes
  • virtualize resources
  • performance!
  • on applications that matter
  • highlight gpu areas for improvement
  • features required for general purpose stream computing

12
system outline
  • .br: Brook source files
  • brcc: source to source compiler
  • brt: Brook run-time library

13
Brook language: streams
  • streams
  • collection of records requiring similar computation
  • particle positions, voxels, FEM cell, ...
      float3 positions<200>;
      float3 velocityfield<100,100,100>;

14
Brook language: streams
  • streams
  • collection of records requiring similar computation
  • particle positions, voxels, FEM cell, ...
      float3 positions<200>;
      float3 velocityfield<100,100,100>;
  • similar to arrays, but
  • index operations disallowed: position[i]
  • read/write stream operators
      streamRead (positions, p_ptr);
      streamWrite (velocityfield, v_ptr);
  • encourage data parallelism
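
A toy model of these rules, assuming (as the operator names suggest) that streamRead fills a stream from ordinary memory and streamWrite copies it back out. The `Stream` class and both helper functions are illustrative stand-ins, not the Brook runtime.

```python
class Stream:
    """Toy stand-in for a Brook stream: fixed length, no element indexing."""

    def __init__(self, length):
        self._length = length
        self._data = [0.0] * length

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        # Mirrors "index operations disallowed: position[i]"
        raise TypeError("streams cannot be indexed")


def stream_read(stream, source):
    """Fill the stream from ordinary memory (role assumed for streamRead)."""
    if len(source) != len(stream):
        raise ValueError("source length must match stream length")
    stream._data = list(source)


def stream_write(stream, dest):
    """Copy the stream contents out to ordinary memory (role assumed for streamWrite)."""
    dest[:] = stream._data


positions = Stream(3)
stream_read(positions, [1.0, 2.0, 3.0])
out = []
stream_write(positions, out)
```

Because elements can only be moved in bulk, any per-element work has to be expressed as a kernel, which is what pushes programs toward data parallelism.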

15
Brook language: kernels
  • kernels
  • functions applied to streams
  • similar to for_all construct
      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

16
Brook language: kernels
  • kernels
  • functions applied to streams
  • similar to for_all construct
      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }
      float a<100>;
      float b<100>;
      float c<100>;
      foo(a,b,c);

17
Brook language: kernels
  • kernels
  • functions applied to streams
  • similar to for_all construct
      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }
      float a<100>;
      float b<100>;
      float c<100>;
      foo(a,b,c);

for (i=0; i<100; i++) c[i] = a[i]+b[i];
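
The equivalence between the kernel call and the loop can be sketched as follows. `run_kernel` is a hypothetical helper that plays the role of the runtime; it is not part of Brook.

```python
def foo(a, b):
    """Kernel body: runs once per stream element and sees only that element."""
    return a + b


def run_kernel(kernel, *inputs):
    """Apply the kernel to corresponding elements of equally sized input streams."""
    length = len(inputs[0])
    if any(len(s) != length for s in inputs):
        raise ValueError("streams must have the same shape")
    return [kernel(*elems) for elems in zip(*inputs)]


a = [float(i) for i in range(100)]
b = [2.0 * i for i in range(100)]
c = run_kernel(foo, a, b)

# The explicit loop from the slide, for comparison
expected = [a[i] + b[i] for i in range(100)]
```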
18
Brook language: kernels
  • kernels
  • functions applied to streams
  • similar to for_all construct
      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }
  • no dependencies between stream elements
  • encourage high arithmetic intensity

19
Brook language: kernels
  • kernel arguments
  • input/output streams
      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

a, b: read-only input streams
result: write-only output stream
20
Brook language: kernels
  • kernel arguments
  • input/output streams
  • constant parameters
      kernel void foo (float a<>, float b<>,
                       float t,
                       out float result<>) {
        result = a + t*b;
      }
      float a<100>;
      float b<100>;
      float c<100>;
      foo(a,b,3.2f,c);

21
Brook language: kernels
  • kernel arguments
  • input/output streams
  • constant parameters
  • gather streams
      kernel void foo (float a<>, float b<>,
                       float t, float array[],
                       out float result<>) {
        result = array[a] + t*b;
      }
      float a<100>;
      float b<100>;
      float c<100>;
      float array<25>;
      foo(a,b,3.2f,array,c);

gpu bonus
22
Brook language: kernels
  • kernel arguments
  • input/output streams
  • constant parameters
  • gather streams
  • iterator streams
      kernel void foo (float a<>, float b<>,
                       float t, float array[],
                       iter float n<>,
                       out float result<>) {
        result = array[a] + t*b + n;
      }
      float a<100>;
      float b<100>;
      float c<100>;
      float array<25>;
      iter float n<100> = iter(0, 10);

gpu bonus
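
The four argument kinds can be mimicked in plain Python. This is a sketch under one stated assumption: that `iter(0, 10)` over 100 elements yields evenly spaced values starting at 0 and stopping short of 10. The input data is made up.

```python
def iter_stream(length, start, end):
    """Assumed iterator-stream semantics: evenly spaced, start inclusive, end exclusive."""
    return [start + (end - start) * i / length for i in range(length)]


def foo(a, b, t, array, n):
    """result = array[a] + t*b + n, for one element."""
    return array[int(a)] + t * b + n


a = [float(i % 25) for i in range(100)]   # input stream, used as gather indices
b = [1.0] * 100                           # input stream
array = [10.0 * k for k in range(25)]     # gather stream: random-access reads allowed
n = iter_stream(100, 0, 10)               # iterator stream

# 3.2 is the constant parameter, identical for every element
c = [foo(ai, bi, 3.2, array, ni) for ai, bi, ni in zip(a, b, n)]
```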
23
Brook language: kernels
  • example
  • position update in velocity field
      kernel void updatepos (float2 pos<>,
                             float2 vel[100][100],
                             float timestep,
                             out float2 newpos<>) {
        newpos = pos + vel[pos]*timestep;
      }
      updatepos (positions, velocityfield,
                 10.0f, positions);
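
A plain Python rendering of the same update, for intuition only. How Brook turns a float2 position into a gather index is not stated on the slide; truncating and clamping to the grid is an assumption here, and the velocity field contents are invented.

```python
def updatepos(pos, vel, timestep):
    """newpos = pos + vel[pos]*timestep for one particle; pos is an (x, y) pair."""
    x, y = pos
    # Assumption: the gather index is the truncated position, clamped to the grid
    ix = min(max(int(x), 0), len(vel) - 1)
    iy = min(max(int(y), 0), len(vel[0]) - 1)
    vx, vy = vel[ix][iy]
    return (x + vx * timestep, y + vy * timestep)


# 100x100 velocity field with a uniform velocity of (0.1, 0.0), made up for the example
velocityfield = [[(0.1, 0.0) for _ in range(100)] for _ in range(100)]
positions = [(0.0, 0.0), (5.0, 7.0)]

# Same stream as input and output, as in the slide's call
positions = [updatepos(p, velocityfield, 10.0) for p in positions]
```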

24
Brook language: reductions
25
Brook language: reductions
  • reductions
  • compute single value from a stream
      reduce void sum (float a<>,
                       reduce float r<>) {
        r += a;
      }
      float a<100>;
      float r;
      sum(a,r);

26
Brook language: reductions
  • reductions
  • compute single value from a stream
      reduce void sum (float a<>,
                       reduce float r<>) {
        r += a;
      }
      float a<100>;
      float r;
      sum(a,r);

r = a[0]; for (int i=1; i<100; i++) r += a[i];
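
The same reduction written with Python's `functools.reduce`, which also starts from the first element and folds in the rest. The function name `sum_step` is illustrative.

```python
from functools import reduce


def sum_step(r, a):
    """One step of the reduction: r += a."""
    return r + a


a = [float(i) for i in range(100)]

# Starts from a[0], then folds in a[1] .. a[99], like the loop on the slide
r = reduce(sum_step, a)
```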
27
Brook language: reductions
  • reductions
  • associative operations only
  • (a+b)+c = a+(b+c)
  • sum, multiply, max, min, OR, AND, XOR
  • matrix multiply
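
Matrix multiply is a useful case because it is associative but not commutative: a reduction may regroup the operands but must keep their order. A small check with 2x2 integer matrices (values chosen arbitrarily):

```python
def matmul2(x, y):
    """Product of two 2x2 matrices given as ((a, b), (c, d))."""
    return (
        (x[0][0] * y[0][0] + x[0][1] * y[1][0], x[0][0] * y[0][1] + x[0][1] * y[1][1]),
        (x[1][0] * y[0][0] + x[1][1] * y[1][0], x[1][0] * y[0][1] + x[1][1] * y[1][1]),
    )


A = ((1, 2), (3, 4))
B = ((0, 1), (1, 0))
C = ((2, 0), (1, 3))

left = matmul2(matmul2(A, B), C)    # (A*B)*C
right = matmul2(A, matmul2(B, C))   # A*(B*C)

# Subtraction, by contrast, is not associative and could not be used
sub_left = (5 - 3) - 1
sub_right = 5 - (3 - 1)
```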

28
Brook language: reductions
  • multi-dimension reductions
  • stream shape differences resolved by reduce function

29
Brook language: reductions
  • multi-dimension reductions
  • stream shape differences resolved by reduce function
      reduce void sum (float a<>,
                       reduce float r<>) {
        r += a;
      }
      float a<20>;
      float r<5>;
      sum(a,r);

30
Brook language: reductions
  • multi-dimension reductions
  • stream shape differences resolved by reduce function
      reduce void sum (float a<>,
                       reduce float r<>) {
        r += a;
      }
      float a<20>;
      float r<5>;
      sum(a,r);

for (int i=0; i<5; i++) {
  r[i] = a[i*4];
  for (int j=1; j<4; j++)
    r[i] += a[i*4 + j];
}
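
The nested loop generalizes to any input length that is a multiple of the output length. A minimal sketch; `reduce_to_shape` is a hypothetical helper, not a Brook call.

```python
def reduce_to_shape(a, out_len, op):
    """Reduce stream a into out_len values; each output folds one contiguous block."""
    if len(a) % out_len != 0:
        raise ValueError("input length must be a multiple of the output length")
    block = len(a) // out_len
    r = []
    for i in range(out_len):
        acc = a[i * block]
        for j in range(1, block):
            acc = op(acc, a[i * block + j])
        r.append(acc)
    return r


a = [float(i) for i in range(20)]
r = reduce_to_shape(a, 5, lambda x, y: x + y)
```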
32
Brook language: stream repeat and stride
  • kernel arguments of different shape
  • resolved by repeat and stride

33
Brook language: stream repeat and stride
  • kernel arguments of different shape
  • resolved by repeat and stride
      kernel void foo (float a<>, float b<>,
                       out float result<>);
      float a<20>;
      float b<5>;
      float c<10>;
      foo(a,b,c);

34
Brook language: stream repeat and stride
  • kernel arguments of different shape
  • resolved by repeat and stride
      kernel void foo (float a<>, float b<>,
                       out float result<>);
      float a<20>;
      float b<5>;
      float c<10>;
      foo(a,b,c);

foo(a[0],  b[0], c[0])
foo(a[2],  b[0], c[1])
foo(a[4],  b[1], c[2])
foo(a[6],  b[1], c[3])
foo(a[8],  b[2], c[4])
foo(a[10], b[2], c[5])
foo(a[12], b[3], c[6])
foo(a[14], b[3], c[7])
foo(a[16], b[4], c[8])
foo(a[18], b[4], c[9])
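
The pairing above is consistent with a simple integer scaling of the output index, which strides the longer input (`a`) and repeats the shorter one (`b`). The formula below is inferred from the ten calls listed on the slide, not taken from the Brook specification.

```python
def resample_index(i, in_len, out_len):
    """Input element paired with output element i when lengths differ."""
    return (i * in_len) // out_len


# (index into a<20>, index into b<5>, index into c<10>) for each kernel invocation
pairs = [(resample_index(i, 20, 10), resample_index(i, 5, 10), i) for i in range(10)]
```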
35
Brook language: matrix vector multiply
      kernel void mul (float a<>, float b<>,
                       out float result<>) {
        result = a*b;
      }
      reduce void sum (float a<>,
                       reduce float result<>) {
        result += a;
      }
      float matrix<20,10>;
      float vector<1, 10>;
      float tempmv<20,10>;
      float result<20, 1>;
      mul(matrix,vector,tempmv);
      sum(tempmv,result);

[Diagram: T = M * V, with V repeated for each row of M]
36
Brook language: matrix vector multiply
      kernel void mul (float a<>, float b<>,
                       out float result<>) {
        result = a*b;
      }
      reduce void sum (float a<>,
                       reduce float result<>) {
        result += a;
      }
      float matrix<20,10>;
      float vector<1, 10>;
      float tempmv<20,10>;
      float result<20, 1>;
      mul(matrix,vector,tempmv);
      sum(tempmv,result);

[Diagram: R = sum over each row of T]
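
The two-step structure (elementwise multiply with the vector repeated per row, then a row-wise sum reduction) can be written out in Python. The matrix contents are made up for the example.

```python
def matvec(matrix, vector):
    """Matrix-vector product as mul (elementwise) followed by sum (row reduction)."""
    rows, cols = len(matrix), len(matrix[0])
    # mul: the <1,10> vector is repeated for each of the 20 rows
    tempmv = [[matrix[i][j] * vector[j] for j in range(cols)] for i in range(rows)]
    # sum: the <20,10> intermediate is reduced to <20,1>
    return [sum(row) for row in tempmv]


matrix = [[float(i + j) for j in range(10)] for i in range(20)]
vector = [1.0] * 10
result = matvec(matrix, vector)
```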
37
brcc compiler: infrastructure
38
brcc compiler: infrastructure
  • based on ctool
  • http://ctool.sourceforge.net
  • parser
  • build code tree
  • extend C grammar to accept Brook
  • convert
  • tree transformations
  • codegen
  • generate Cg and HLSL code
  • call cgc, fxc
  • generate stub function

39
brcc compiler: kernel compilation
      kernel void updatepos (float2 pos<>,
                             float2 vel[100][100],
                             float timestep,
                             out float2 newpos<>) {
        newpos = pos + vel[pos]*timestep;
      }

float4 main (uniform float4 _workspace : register (c0),
             uniform sampler _tex_pos : register (s0),
             float2 _tex_pos_pos : TEXCOORD0,
             uniform sampler vel : register (s1),
             uniform float4 vel_scalebias : register (c1),
             uniform float timestep : register (c2)) : COLOR0 {
  float4 _OUT;
  float2 pos;
  float2 newpos;
  pos = tex2D(_tex_pos, _tex_pos_pos).xy;
  newpos = pos + tex2D(vel, (pos).xy*vel_scalebias.xy + vel_scalebias.zw).xy * timestep;
  _OUT.x = newpos.x;
  _OUT.y = newpos.y;
  _OUT.z = newpos.y;
  _OUT.w = newpos.y;
  return _OUT;
}
40
brcc compiler: kernel compilation
      static const char __updatepos_ps20[] = "ps_2_0 .....
      static const char __updatepos_fp30[] = "!!fp30 .....

      void updatepos (const __BRTStream& pos,
                      const __BRTStream& vel,
                      const float timestep,
                      const __BRTStream& newpos) {
        static const void *__updatepos_fp[] = {
          "fp30", __updatepos_fp30,
          "ps20", __updatepos_ps20,
          "cpu", (void *) __updatepos_cpu,
          "combine", 0,
          NULL, NULL };
        static __BRTKernel k(__updatepos_fp);
        k->PushStream(pos);
        k->PushGatherStream(vel);
        k->PushConstant(timestep);
        k->PushOutput(newpos);
        k->Map();
      }
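
The stub follows a push-arguments-then-map pattern. A toy Python analogue is sketched below; the `Kernel` class and its method names only imitate the shape of the generated code and are not the brt API. The example kernel uses 1D positions to stay short.

```python
class Kernel:
    """Toy analogue of the stub's kernel object: collect arguments, then map."""

    def __init__(self, body):
        self.body = body
        self.args = []       # (kind, value) pairs in declaration order
        self.output = None

    def push_stream(self, s):
        self.args.append(("stream", s))

    def push_gather_stream(self, g):
        self.args.append(("gather", g))

    def push_constant(self, c):
        self.args.append(("constant", c))

    def push_output(self, out):
        self.output = out

    def map(self):
        """Run the body once per output element."""
        for i in range(len(self.output)):
            # Input streams supply element i; gathers and constants are passed whole
            call_args = [v[i] if kind == "stream" else v for kind, v in self.args]
            self.output[i] = self.body(*call_args)


k = Kernel(lambda pos, vel, timestep: pos + vel[int(pos)] * timestep)
newpos = [0.0] * 3
k.push_stream([0.0, 1.0, 2.0])
k.push_gather_stream([1.0, 2.0, 3.0])
k.push_constant(10.0)
k.push_output(newpos)
k.map()
```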

41
brt runtime: streams
42
brt runtime: streams
  • streams

separate texture per stream
vel: texture 1
pos: texture 2
43
brt runtime: kernels
  • kernel execution
  • set stream texture as render target
  • bind inputs to texture units
  • issue screen size quad
  • texture coords provide stream positions
      kernel void foo (float a<>, float b<>,
                       out float result<>) {
        result = a + b;
      }

[Diagram: input textures a and b feed kernel foo, which renders into result]
44
brt runtime: reductions
  • reduction execution
  • multipass execution
  • associativity required
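
One way to picture multipass execution: each pass combines neighbouring pairs, roughly halving the data until one value remains. The slide does not say how brt actually partitions its passes, so the pairwise scheme here is an assumption. The regrouping it performs is exactly why associativity is required.

```python
def multipass_reduce(values, op):
    """Reduce by repeated passes; each pass combines neighbouring pairs in order."""
    data = list(values)
    passes = 0
    while len(data) > 1:
        nxt = [op(data[i], data[i + 1]) for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:
            # An unpaired last element carries over unchanged to the next pass
            nxt.append(data[-1])
        data = nxt
        passes += 1
    return data[0], passes


total, passes = multipass_reduce([float(i) for i in range(100)], lambda x, y: x + y)
```

For 100 elements this takes 7 passes (100, 50, 25, 13, 7, 4, 2, 1) instead of 99 sequential steps.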

45
research directions
  • demonstrate gpu streaming coprocessor
  • compiling Brook to gpus
  • evaluation
  • applications

46
research directions
  • applications
  • linear algebra
  • image processing
  • molecular dynamics (gromacs)
  • FEM
  • multigrid
  • raytracer
  • volume renderer
  • SIGGRAPH / GH papers

47
research directions
  • virtualize gpu resources
  • texture size and formats
  • packing streams to fit in 2D segmented memory
    space

float matrix<8096,10,30,5>;
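
Packing a stream like this into 2D texture memory needs an address translation. A minimal sketch, assuming row-major linearization followed by a split into texture rows; the texture width of 2048 is a made-up limit, and the talk does not describe the scheme Brook ends up using.

```python
def linear_index(index, shape):
    """Row-major offset of an N-dimensional index."""
    offset = 0
    for i, extent in zip(index, shape):
        if not 0 <= i < extent:
            raise IndexError("index out of range")
        offset = offset * extent + i
    return offset


def texel_address(index, shape, tex_width):
    """Map an N-D stream index to (row, column) in a 2D texture of the given width."""
    return divmod(linear_index(index, shape), tex_width)


shape = (8096, 10, 30, 5)
addr_a = texel_address((1, 2, 3, 4), shape, tex_width=2048)
addr_b = texel_address((2, 0, 0, 0), shape, tex_width=2048)
```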
48
research directions
  • virtualize gpu resources
  • texture size and formats
  • support complex formats

typedef struct {
  float3 pos;
  float3 vel;
  float mass;
} particle;

kernel void foo (particle a<>, float timestep,
                 out particle b<>);

float a<100>[8][2];
49
research directions
  • virtualize gpu resources
  • multiple outputs
  • simple: let cgc or fxc do dead code elimination
  • better: compute intermediates separately

kernel void foo (float3 a<>, float3 b<>,
                 ..., out float3 x<>, out float3 y<>)
kernel void foo1(float3 a<>, float3 b<>,
                 ..., out float3 x<>)
kernel void foo2(float3 a<>, float3 b<>,
                 ..., out float3 y<>)
50
research directions
  • virtualize gpu resources
  • limited instructions per kernel
  • generalize RDS algorithm for kernels
  • compute ideal # of passes for intermediate results
  • hard ???

51
research directions
  • auto vectorization

kernel void foo (float a<>, float b<>,
                 out float result<>) {
  result = a + b;
}

kernel void foo_faster (float4 a<>, float4 b<>,
                        out float4 result<>) {
  result = a + b;
}
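
The idea behind the float4 version is that one kernel invocation handles four packed elements. A small sketch of the packing step; `pack4` is a hypothetical helper and assumes the stream length is a multiple of 4.

```python
def pack4(stream):
    """Group a scalar stream into 4-wide tuples."""
    if len(stream) % 4 != 0:
        raise ValueError("stream length must be a multiple of 4")
    return [tuple(stream[i:i + 4]) for i in range(0, len(stream), 4)]


def foo_faster(a4, b4):
    """One invocation adds four elements at once, like a float4 add."""
    return tuple(x + y for x, y in zip(a4, b4))


a = [float(i) for i in range(8)]
b = [1.0] * 8

packed = [foo_faster(x, y) for x, y in zip(pack4(a), pack4(b))]
result = [v for quad in packed for v in quad]
```

Eight scalar invocations become two packed ones, with the same result.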
52
research directions
  • Brook v0.2 support
  • stream operators
  • stencil, group, domain, repeat, stride, merge, ...
  • building and manipulating data structures
  • scatterOp: a[i] += p
  • gatherOp: p = a[i]++
  • gpu primitives

53
research directions
  • gpu areas of improvement
  • reduction registers
  • texture constraints
  • scatter capabilities
  • programmable blending
  • gatherOp, scatterOp

54
Brook status
  • team
  • Jeremy Sugerman
  • Daniel Horn
  • Tim Foley
  • Ian Buck
  • beta release
  • December 15th
  • sourceforge

55
Questions?
Fly-fishing fly images from The English Fly Fishing Shop