Title: Streaming Architectures and GPUs
1. Streaming Architectures and GPUs
- Ian Buck
- Bill Dally
- Pat Hanrahan
- Stanford University
- February 11, 2004
2. To Exploit VLSI Technology We Need
- Parallelism
- To keep 100s of ALUs per chip (thousands/board, millions/system) busy
- Latency tolerance
- To cover 500-cycle remote memory access time
- Locality
- To match 20 Tb/s ALU bandwidth to 100 Gb/s chip bandwidth
- Moore's Law
- Growth of transistors, not performance
Courtesy of Bill Dally
Arithmetic is cheap, global bandwidth is expensive:
local << global on-chip << off-chip << global system
3. Arithmetic Intensity
- Lots of ops per word transferred
- Compute-to-bandwidth ratio
- High arithmetic intensity is desirable
- App limited by ALU performance, not off-chip bandwidth
- More chip real estate for ALUs, not caches
Courtesy of Pat Hanrahan
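The compute-to-bandwidth idea can be written down directly. In the sketch below the helper names (`arithmetic_intensity`, `alu_limited`) and the per-kernel numbers are illustrative; only the 20 Tb/s ALU bandwidth and 100 Gb/s chip bandwidth figures come from the slides:

```python
# Arithmetic intensity = ops performed per word moved off-chip. A kernel
# is ALU-limited when its intensity exceeds the machine balance
# (peak ops/s divided by peak off-chip words/s).

def arithmetic_intensity(ops, words_moved):
    """Ops performed per word transferred off-chip."""
    return ops / words_moved

def alu_limited(ops, words_moved, peak_ops_per_s, peak_words_per_s):
    """True when the kernel keeps the ALUs busy rather than the memory bus."""
    machine_balance = peak_ops_per_s / peak_words_per_s
    return arithmetic_intensity(ops, words_moved) > machine_balance

# Matching 20 Tb/s of ALU demand to 100 Gb/s of chip bandwidth requires
# roughly 200 ops of reuse per word fetched:
print(20e12 / 100e9)  # -> 200.0
```

This is why the slides argue for spending chip area on ALUs rather than caches: only kernels with intensity above the machine balance can use the extra ALUs.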
4. Brook Stream Programming Model
- Enforce Data Parallel computing
- Encourage Arithmetic Intensity
- Provide fundamental ops for stream computing
5. Streams & Kernels
- Streams
- Collection of records requiring similar computation
- Vertex positions, voxels, FEM cells, ...
- Provide data parallelism
- Kernels
- Functions applied to each element in the stream
- Transforms, PDEs, ...
- No dependencies between stream elements
- Encourage high arithmetic intensity
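A minimal sketch of the streams-and-kernels model (real Brook kernels are C-like; `apply_kernel` and `scale_vertex` below are hypothetical names for illustration). The key property is that each kernel invocation sees one record per input stream and nothing else, which is what makes elements independently parallelizable:

```python
# A stream is a collection of records; a kernel is a function applied
# independently to each element, so elements may be processed in any
# order or in parallel.

def apply_kernel(kernel, *streams):
    # Zip the input streams record-by-record; each kernel invocation
    # sees only its own records and its own temporaries.
    return [kernel(*records) for records in zip(*streams)]

# Hypothetical kernel: scale 2D vertex positions.
def scale_vertex(pos, s):
    return (pos[0] * s, pos[1] * s)

positions = [(1.0, 2.0), (3.0, 4.0)]
scales = [2.0, 0.5]
print(apply_kernel(scale_vertex, positions, scales))
# -> [(2.0, 4.0), (1.5, 2.0)]
```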
6. Vectors vs. Streams
- Vectors
- v = array of floats
- Instruction sequence:
- LD v0
- LD v1
- ADD v0, v1, v2
- ST v2
- -> Large set of temporaries
- Streams
- s = stream of records
- Instruction sequence:
- LD s0
- LD s1
- CALLS f, s0, s1, s2
- ST s2
- -> Small set of temporaries
Higher arithmetic intensity: |f|/|s| >> |+|/|v|
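A rough, back-of-the-envelope model of why the single kernel call raises arithmetic intensity. The cost model (about two loads and one store per vector instruction, and the record size) is an illustrative assumption, not taken from the slide:

```python
# A vector machine moves every intermediate through loads/stores, while
# a stream kernel keeps temporaries local and moves each record only
# once in and once out.

def vector_intensity(ops, num_instructions):
    # Hypothetical cost: ~2 loads + 1 store per element per instruction.
    words_moved = 3 * num_instructions
    return ops / words_moved

def stream_intensity(ops, record_words):
    # The kernel reads each input record once and writes one output.
    words_moved = 2 * record_words
    return ops / words_moved

# 50 ops on a 4-word record, expressed as 50 vector instructions
# versus a single 50-op kernel call:
v = vector_intensity(50, 50)
s = stream_intensity(50, 4)
print(v, s)  # the stream version does far more work per word moved
```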
7. Imagine
- Imagine
- Stream processor for image and signal processing
- 16 mm die in 0.18 um TI process
- 21M transistors
8. Merrimac Processor
- 90 nm tech (1 V), ASIC technology
- 1 GHz (37 FO4)
- 128 GOPS
- Inter-cluster switch between clusters
- 127.5 mm^2 (small: ~12 x 10 mm)
- Stanford Imagine is 16 mm x 16 mm
- MIT Raw is 18 mm x 18 mm
- 25 Watts (Pentium 4: ~75 W)
- 41 W with memories
[Die floorplan: MIPS64 20Kc scalar core, microcontroller, 16 ALU clusters, banked stream register file and cache, address generators, ECC, memory channels, and network interface; 12.5 mm die edge]
9. Merrimac Streaming Supercomputer
10. Streaming Applications
- Finite volume: StreamFLO (from TFLO)
- Finite element: StreamFEM
- Molecular dynamics code (ODEs): StreamMD
- Model (elliptic, hyperbolic, and parabolic) PDEs
- PCA applications: FFT, matrix multiply, SVD, sort
11. StreamFLO
- StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of the inviscid flow around an airfoil.
- The code uses a cell-centered finite volume formulation with multigrid acceleration to solve the 2D Euler equations.
- The structure of the code is similar to TFLO, and the algorithm is found in many compressible flow solvers.
12. StreamFEM
- A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
13. StreamMD: Motivation
- Application: study the folding of human proteins.
- Molecular dynamics: computer simulation of the dynamics of macromolecules.
- Why this application?
- Expect high arithmetic intensity.
- Requires variable-length neighbor lists.
- Molecular dynamics can be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, and deformation of droplets.
- Test case chosen for initial evaluation: box of water molecules.
[Figures: DNA molecule; human immunodeficiency virus (HIV)]
14. Summary of Application Results
1. Simulated on a machine with 64 GFLOPS peak performance.
2. The low numbers are a result of many divide and square-root operations.
15. Streaming on graphics hardware?
- Pentium 4 SSE theoretical:
- 3 GHz x 4-wide x 0.5 inst/cycle = 6 GFLOPS
- GeForce FX 5900 (NV35) fragment shader observed:
- MULR R0, R0, R0: 20 GFLOPS
- Equivalent to a 10 GHz P4
- And getting faster: 3x improvement over NV30 (in 6 months)
[Chart: GFLOPS over time for GeForce FX (NV30, NV35) vs. Pentium 4; P4 data from the Intel P4 Optimization Manual]
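The peak-rate arithmetic on this slide, checked explicitly (the "equivalent P4 clock" is just the slide's 20 GFLOPS divided by the same 4-wide, 0.5 inst/cycle SSE issue rate):

```python
# Pentium 4 SSE theoretical peak: 3 GHz x 4-wide x 0.5 inst/cycle.
p4_sse = 3e9 * 4 * 0.5
print(p4_sse / 1e9)  # -> 6.0 (GFLOPS)

# A 20 GFLOPS observed shader rate, at the same 4 x 0.5 flops/cycle,
# corresponds to a ~10 GHz Pentium 4:
equiv_ghz = 20e9 / (4 * 0.5) / 1e9
print(equiv_ghz)     # -> 10.0
```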
16. GPU Program Architecture
[Diagram: input registers, textures, and constants feed the program; temporary registers hold intermediates; results go to output registers]
17. Example Program
Simple specular and diffuse lighting (!!VP1.0 vertex program):

!!VP1.0
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
# Outputs homogeneous position and color.
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
18. Cg/HLSL: High-level language for GPUs
Specular lighting:

// Lookup the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);
// Multiply 3x2 matrix generated using lightDir and halfAngle with
// scaled normal, followed by lookup in intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);
// Lookup color
float4 color = tex2D(colorMap, I.texCoord3.xy);
// Blend/modulate intensity with color
return color * intensity;
19. GPU: Data Parallel
- Each fragment shaded independently
- No dependencies between fragments
- Temporary registers are zeroed
- No static variables
- No read-modify-write textures
- Multiple pixel pipes
- Data parallelism
- Supports ALU-heavy architectures
- Hides memory latency
- [Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
20. GPU Arithmetic Intensity
- Lots of ops per word transferred
- Graphics pipeline:
- Vertex:
- BW: 1 triangle = 32 bytes
- OP: 100-500 f32-ops / triangle
- Rasterization:
- Creates 16-32 fragments per triangle
- Fragment:
- BW: 1 fragment = 10 bytes
- OP: 300-1000 i8-ops / fragment
- Shader programs
Courtesy of Pat Hanrahan
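The slide's bandwidth and op counts can be reduced to ops-per-byte ratios, which shows why the fragment stage is the natural place for ALU-heavy work (byte and op counts are from the slide; the ratio computation is the only thing added here):

```python
# Per-stage arithmetic intensity from slide 20's figures.
vertex_lo, vertex_hi = 100 / 32, 500 / 32   # f32-ops per triangle byte
frag_lo, frag_hi = 300 / 10, 1000 / 10      # i8-ops per fragment byte

print(vertex_lo, vertex_hi)  # -> 3.125 15.625
print(frag_lo, frag_hi)      # -> 30.0 100.0
# The fragment stage sustains roughly 10x the ops per byte of the
# vertex stage, and rasterization amplifies each triangle into
# 16-32 fragments of such work.
```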
21. Streaming Architectures
[Diagram: SDRAM channels connected through a stream register file to ALU clusters]
22. Streaming Architectures
[Diagram: as in slide 21, with a kernel execution unit issuing "MAD R3, R1, R2; MAD R5, R2, R3" to the ALU clusters]
23. Streaming Architectures
[Diagram: as in slide 22, with the ALU clusters labeled as parallel fragment pipelines]
24. Streaming Architectures
[Diagram: kernel execution unit ("MAD R3, R1, R2; MAD R5, R2, R3") over parallel fragment pipelines]
- Stream Register File:
- Texture Cache?
- F-Buffer [Mark et al.]
25. Conclusions
- The problem is bandwidth; arithmetic is cheap
- Stream processing architectures can provide VLSI-efficient scientific computing
- Imagine
- Merrimac
- GPUs are first-generation streaming architectures
- Apply the same stream programming model for general-purpose computing on GPUs