Title: Streaming Architectures and GPUs
1. Streaming Architectures and GPUs
- Ian Buck
- Bill Dally
- Pat Hanrahan
- Stanford University
- February 11, 2004
2. To Exploit VLSI Technology We Need
- Parallelism
- To keep 100s of ALUs per chip (thousands/board, millions/system) busy
- Latency tolerance
- To cover 500-cycle remote memory access time
- Locality
- To match 20 Tb/s ALU bandwidth to 100 Gb/s chip bandwidth
- Moore's Law
- Growth of transistors, not performance
Courtesy of Bill Dally
Arithmetic is cheap, global bandwidth is expensive:
local << global on-chip << off-chip << global system
3. Arithmetic Intensity
- Lots of ops per word transferred
- Compute-to-bandwidth ratio
- High arithmetic intensity is desirable
- App limited by ALU performance, not off-chip bandwidth
- More chip real estate for ALUs, not caches
Courtesy of Pat Hanrahan
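The compute-to-bandwidth idea can be written down directly. In the sketch below the helper names (`arithmetic_intensity`, `alu_limited`) and the per-kernel numbers are illustrative; only the 20 Tb/s ALU bandwidth and 100 Gb/s chip bandwidth figures come from the slides:

```python
# Arithmetic intensity = ops performed per word moved off-chip. A kernel
# is ALU-limited when its intensity exceeds the machine balance
# (peak ops/s divided by peak off-chip words/s).

def arithmetic_intensity(ops, words_moved):
    """Ops performed per word transferred off-chip."""
    return ops / words_moved

def alu_limited(ops, words_moved, peak_ops_per_s, peak_words_per_s):
    """True when the kernel keeps the ALUs busy rather than the memory bus."""
    machine_balance = peak_ops_per_s / peak_words_per_s
    return arithmetic_intensity(ops, words_moved) > machine_balance

# Matching 20 Tb/s of ALU demand to 100 Gb/s of chip bandwidth requires
# roughly 200 ops of reuse per word fetched:
print(20e12 / 100e9)  # -> 200.0
```

This is why the slides argue for spending chip area on ALUs rather than caches: only kernels with intensity above the machine balance can use the extra ALUs.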
4. Brook Stream Programming Model
- Enforce Data Parallel computing
- Encourage Arithmetic Intensity
- Provide fundamental ops for stream computing
5. Streams & Kernels
- Streams
- Collection of records requiring similar computation
- Vertex positions, voxels, FEM cells, ...
- Provide data parallelism
- Kernels
- Functions applied to each element in the stream
- Transforms, PDEs, ...
- No dependencies between stream elements
- Encourage high arithmetic intensity
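A minimal sketch of the streams-and-kernels model (real Brook kernels are C-like; `apply_kernel` and `scale_vertex` below are hypothetical names for illustration). The key property is that each kernel invocation sees one record per input stream and nothing else, which is what makes elements independently parallelizable:

```python
# A stream is a collection of records; a kernel is a function applied
# independently to each element, so elements may be processed in any
# order or in parallel.

def apply_kernel(kernel, *streams):
    # Zip the input streams record-by-record; each kernel invocation
    # sees only its own records and its own temporaries.
    return [kernel(*records) for records in zip(*streams)]

# Hypothetical kernel: scale 2D vertex positions.
def scale_vertex(pos, s):
    return (pos[0] * s, pos[1] * s)

positions = [(1.0, 2.0), (3.0, 4.0)]
scales = [2.0, 0.5]
print(apply_kernel(scale_vertex, positions, scales))
# -> [(2.0, 4.0), (1.5, 2.0)]
```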
6. Vectors vs. Streams
- Vectors
- v = array of floats
- Instruction sequence:
- LD v0
- LD v1
- ADD v0, v1, v2
- ST v2
- -> Large set of temporaries
- Streams
- s = stream of records
- Instruction sequence:
- LD s0
- LD s1
- CALLS f, s0, s1, s2
- ST s2
- -> Small set of temporaries
Higher arithmetic intensity: |f|/|s| >> |+|/|v|
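A rough, back-of-the-envelope model of why the single kernel call raises arithmetic intensity. The cost model (about two loads and one store per vector instruction, and the record size) is an illustrative assumption, not taken from the slide:

```python
# A vector machine moves every intermediate through loads/stores, while
# a stream kernel keeps temporaries local and moves each record only
# once in and once out.

def vector_intensity(ops, num_instructions):
    # Hypothetical cost: ~2 loads + 1 store per element per instruction.
    words_moved = 3 * num_instructions
    return ops / words_moved

def stream_intensity(ops, record_words):
    # The kernel reads each input record once and writes one output.
    words_moved = 2 * record_words
    return ops / words_moved

# 50 ops on a 4-word record, expressed as 50 vector instructions
# versus a single 50-op kernel call:
v = vector_intensity(50, 50)
s = stream_intensity(50, 4)
print(v, s)  # the stream version does far more work per word moved
```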
7. Imagine
- Imagine
- Stream processor for image and signal processing
- 16 mm die in 0.18 um TI process
- 21M transistors
8. Merrimac Processor
- 90 nm tech (1 V), ASIC technology
- 1 GHz (37 FO4)
- 128 GOPS
- Inter-cluster switch between clusters
- 127.5 mm^2 (small: ~12 x 10 mm)
- Stanford Imagine is 16 mm x 16 mm
- MIT Raw is 18 mm x 18 mm
- 25 Watts (Pentium 4: ~75 W)
- 41 W with memories
[Die floorplan: MIPS64 20Kc scalar core, microcontroller, 16 ALU clusters, banked stream register file and cache, address generators, ECC, memory channels, and network interface; 12.5 mm die edge]
9. Merrimac Streaming Supercomputer
10. Streaming Applications
- Finite volume: StreamFLO (from TFLO)
- Finite element: StreamFEM
- Molecular dynamics code (ODEs): StreamMD
- Model (elliptic, hyperbolic, and parabolic) PDEs
- PCA applications: FFT, matrix multiply, SVD, sort
11. StreamFLO
- StreamFLO is the Brook version of FLO82, a FORTRAN code written by Prof. Jameson for the solution of the inviscid flow around an airfoil.
- The code uses a cell-centered finite volume formulation with multigrid acceleration to solve the 2D Euler equations.
- The structure of the code is similar to TFLO, and the algorithm is found in many compressible flow solvers.
12. StreamFEM
- A Brook implementation of the Discontinuous Galerkin (DG) Finite Element Method (FEM) in 2D triangulated domains.
13. StreamMD: Motivation
- Application: study the folding of human proteins.
- Molecular dynamics: computer simulation of the dynamics of macromolecules.
- Why this application?
- Expect high arithmetic intensity.
- Requires variable-length neighbor lists.
- Molecular dynamics can be used in engine simulation to model spray, e.g. droplet formation and breakup, drag, and deformation of droplets.
- Test case chosen for initial evaluation: box of water molecules.
[Figures: DNA molecule; human immunodeficiency virus (HIV)]
14. Summary of Application Results
1. Simulated on a machine with 64 GFLOPS peak performance.
2. The low numbers are a result of many divide and square-root operations.
15. Streaming on graphics hardware?
- Pentium 4 SSE theoretical:
- 3 GHz x 4-wide x 0.5 inst/cycle = 6 GFLOPS
- GeForce FX 5900 (NV35) fragment shader observed:
- MULR R0, R0, R0: 20 GFLOPS
- Equivalent to a 10 GHz P4
- And getting faster: 3x improvement over NV30 (in 6 months)
[Chart: GFLOPS over time for GeForce FX (NV30, NV35) vs. Pentium 4; P4 data from the Intel P4 Optimization Manual]
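The peak-rate arithmetic on this slide, checked explicitly (the "equivalent P4 clock" is just the slide's 20 GFLOPS divided by the same 4-wide, 0.5 inst/cycle SSE issue rate):

```python
# Pentium 4 SSE theoretical peak: 3 GHz x 4-wide x 0.5 inst/cycle.
p4_sse = 3e9 * 4 * 0.5
print(p4_sse / 1e9)  # -> 6.0 (GFLOPS)

# A 20 GFLOPS observed shader rate, at the same 4 x 0.5 flops/cycle,
# corresponds to a ~10 GHz Pentium 4:
equiv_ghz = 20e9 / (4 * 0.5) / 1e9
print(equiv_ghz)     # -> 10.0
```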
16. GPU Program Architecture
[Diagram: input registers, textures, and constants feed the program; temporary registers hold intermediates; results go to output registers]
17. Example Program
Simple specular and diffuse lighting (!!VP1.0 vertex program):

!!VP1.0
# c[0-3]  = modelview projection (composite) matrix
# c[4-7]  = modelview inverse transpose
# c[32]   = eye-space light direction
# c[33]   = constant eye-space half-angle vector (infinite viewer)
# c[35].x = pre-multiplied monochromatic diffuse light color & diffuse mat.
# c[35].y = pre-multiplied monochromatic ambient light color & diffuse mat.
# c[36]   = specular color
# c[38].x = specular power
# Outputs homogeneous position and color.
DP4 o[HPOS].x, c[0], v[OPOS];     # Compute position.
DP4 o[HPOS].y, c[1], v[OPOS];
DP4 o[HPOS].z, c[2], v[OPOS];
DP4 o[HPOS].w, c[3], v[OPOS];
DP3 R0.x, c[4], v[NRML];          # Compute normal.
DP3 R0.y, c[5], v[NRML];
DP3 R0.z, c[6], v[NRML];          # R0 = N' = transformed normal
DP3 R1.x, c[32], R0;              # R1.x = Ldir DOT N'
DP3 R1.y, c[33], R0;              # R1.y = H DOT N'
MOV R1.w, c[38].x;                # R1.w = specular power
LIT R2, R1;                       # Compute lighting values
MAD R3, c[35].x, R2.y, c[35].y;   # diffuse + ambient
MAD o[COL0].xyz, c[36], R2.z, R3; # + specular
END
18. Cg/HLSL: High-level language for GPUs
Specular lighting:

// Lookup the normal map
float4 normal = 2 * (tex2D(normalMap, I.texCoord0.xy) - 0.5);
// Multiply 3x2 matrix generated using lightDir and halfAngle with
// scaled normal, followed by lookup in intensity map with the result.
float2 intensCoord = float2(dot(I.lightDir.xyz, normal.xyz),
                            dot(I.halfAngle.xyz, normal.xyz));
float4 intensity = tex2D(intensityMap, intensCoord);
// Lookup color
float4 color = tex2D(colorMap, I.texCoord3.xy);
// Blend/modulate intensity with color
return color * intensity;
19. GPU: Data Parallel
- Each fragment shaded independently
- No dependencies between fragments
- Temporary registers are zeroed
- No static variables
- No read-modify-write textures
- Multiple pixel pipes
- Data parallelism
- Supports ALU-heavy architectures
- Hides memory latency
- [Torborg and Kajiya 96, Anderson et al. 97, Igehy et al. 98]
20. GPU Arithmetic Intensity
- Lots of ops per word transferred
- Graphics pipeline:
- Vertex:
- BW: 1 triangle = 32 bytes
- OP: 100-500 f32-ops / triangle
- Rasterization:
- Creates 16-32 fragments per triangle
- Fragment:
- BW: 1 fragment = 10 bytes
- OP: 300-1000 i8-ops / fragment
- Shader programs
Courtesy of Pat Hanrahan
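The slide's bandwidth and op counts can be reduced to ops-per-byte ratios, which shows why the fragment stage is the natural place for ALU-heavy work (byte and op counts are from the slide; the ratio computation is the only thing added here):

```python
# Per-stage arithmetic intensity from slide 20's figures.
vertex_lo, vertex_hi = 100 / 32, 500 / 32   # f32-ops per triangle byte
frag_lo, frag_hi = 300 / 10, 1000 / 10      # i8-ops per fragment byte

print(vertex_lo, vertex_hi)  # -> 3.125 15.625
print(frag_lo, frag_hi)      # -> 30.0 100.0
# The fragment stage sustains roughly 10x the ops per byte of the
# vertex stage, and rasterization amplifies each triangle into
# 16-32 fragments of such work.
```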
21. Streaming Architectures
[Diagram: SDRAM channels connected through a stream register file to ALU clusters]
22. Streaming Architectures
[Diagram: as in slide 21, with a kernel execution unit issuing "MAD R3, R1, R2; MAD R5, R2, R3" to the ALU clusters]
23. Streaming Architectures
[Diagram: as in slide 22, with the ALU clusters labeled as parallel fragment pipelines]
24. Streaming Architectures
[Diagram: kernel execution unit ("MAD R3, R1, R2; MAD R5, R2, R3") over parallel fragment pipelines]
- Stream Register File:
- Texture Cache?
- F-Buffer [Mark et al.]
25. Conclusions
- The problem is bandwidth; arithmetic is cheap
- Stream processing architectures can provide VLSI-efficient scientific computing
- Imagine
- Merrimac
- GPUs are first-generation streaming architectures
- Apply the same stream programming model for general-purpose computing on GPUs