Brook for GPUs

Slides: 25

Provided by: IanB86

Learn more at: http://gamma.cs.unc.edu

Transcript and Presenter's Notes

1
Brook for GPUs

Ian Buck, Tim Foley, Daniel Horn, Jeremy
Sugerman, Kayvon Fatahalian, Mike Houston, Pat
Hanrahan
Stanford University
DARPA Site Visit, UNC
May 6th, 2004

2
Motivation

GPUs are faster than CPUs
GPUs are getting faster, faster
Why?
Massive parallelism (1000s of ALUs)
Choreographed communication
Efficiently utilize VLSI resources (the DIS/PCA mantra)
Programmable GPUs are stream processors
Many streaming applications beyond graphics
Buy a desktop supercomputer for $50!
Revolutionize computing?

3
Recent Performance Trends
5
CPU vs GPU

Intel 3 GHz Pentium 4
12 GFLOPS peak performance (via SSE2)
5.96 GB/sec peak memory bandwidth
44 GB/sec peak bandwidth from 8K L1 data cache
NVIDIA GeForce 6800
45 GFLOPS peak performance
36 GB/sec peak memory bandwidth
Texture cache bandwidth and size (undisclosed)?
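
The Pentium 4 peak figures above can be reproduced with back-of-the-envelope arithmetic. The sketch below is only a plausibility check: the per-cycle flop count, the front-side bus parameters, the L1 load width, and the convention that "GB" means 2^30 bytes are all assumptions, not values given on the slide.

```python
# Back-of-the-envelope check of the Pentium 4 peak figures.
# Every hardware parameter here is an assumption, not taken from the slides.
CLOCK_HZ = 3.0e9                 # 3 GHz core clock (from the slide)
FLOPS_PER_CYCLE = 4              # assumed: one 4-wide single-precision SSE op per cycle
FSB_TRANSFERS_PER_SEC = 800e6    # assumed: 800 MT/s front-side bus
FSB_BYTES_PER_TRANSFER = 8       # assumed: 64-bit bus
L1_BYTES_PER_CYCLE = 16          # assumed: one 128-bit load per cycle
BYTES_PER_GB = 2 ** 30           # assumed: "GB" counted as 2^30 bytes


def peak_gflops() -> float:
    """Peak arithmetic rate in GFLOPS."""
    return CLOCK_HZ * FLOPS_PER_CYCLE / 1e9


def peak_memory_bandwidth() -> float:
    """Peak main-memory bandwidth in GB/sec (2^30-byte GB)."""
    return FSB_TRANSFERS_PER_SEC * FSB_BYTES_PER_TRANSFER / BYTES_PER_GB


def peak_l1_bandwidth() -> float:
    """Peak L1 data cache bandwidth in GB/sec (2^30-byte GB)."""
    return CLOCK_HZ * L1_BYTES_PER_CYCLE / BYTES_PER_GB


if __name__ == "__main__":
    print(f"peak GFLOPS:      {peak_gflops():.2f}")
    print(f"memory bandwidth: {peak_memory_bandwidth():.2f} GB/sec")
    print(f"L1 bandwidth:     {peak_l1_bandwidth():.2f} GB/sec")
```

Under these assumptions the three results land on or just above the slide's 12 GFLOPS, 5.96 GB/sec, and 44 GB/sec, which suggests (but does not prove) that the slide's figures were derived this way.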

6
Deliverables

Develop version of PCA Brook for GPUs
Programmer need not know GL
Versions
New ATI (420) and NVIDIA (NV40) hardware
Linux and Windows
DX and OpenGL
Release as open source: V1.0, Dec 2003
Support OneSAF LOS, collision detection, and route planning algorithms

7
Research Issues

Brook semantics
E.g. variable-length streams (vout)
Compilation techniques
Virtualization of GPU
Splitting kernels (MRDS)
Explore streaming application space
Scientific computing: RT, MD, BLAS, FFT, ...
Machine learning: HMM, linear mod., Bayes, ...

8
Brook Update

Ian Buck

10
Understanding the Efficiency of GPU Algorithms for Matrix-Matrix Multiplication

Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan

11
Dense Matrix-Matrix Multiplication

ATLAS on the Intel P4 wins!

12
CPU vs GPU

Intel 3 GHz Pentium 4
12 GFLOPS peak performance (via SSE2)
5.96 GB/sec peak memory bandwidth
44 GB/sec peak bandwidth from 8K L1 data cache
NVIDIA GeForce 6800
43 GFLOPS peak performance
36 GB/sec peak memory bandwidth
Texture cache bandwidth and size (undisclosed)?
Why is graphics hardware so slow?

14
Why is Graphics Hardware so Slow?
Microbenchmark (MAD)
GPU           GFLOPS   Cache BW (GB/sec)   Seq Read BW (GB/sec)
NV35          39.99    11.08               4.40
NV40          43.00    18.9                3.85
ATI 9800XT    26.14    12.20               7.33
ATI X800      33.4     30.7                18.4

NVIDIA: 8% compute efficiency, 82% of cache bandwidth.
Arithmetic intensity: 12 math operations per float fetched from cache
ATI: 18% of peak performance, 99% of peak cache bandwidth.
Arithmetic intensity: 8 to 1 math to cache-fetch ratio

15
Why is Graphics Hardware so Slow?
Matrix-Matrix Multiplication
Processor     GFLOPS   Bandwidth (GB/sec)
NV35          3.04     9.07
NV40          7.24     14.88
ATI 9800XT    4.83     12.06
ATI X800      12       30
P4            7.78     27.68

Matrix-matrix multiplication is bandwidth limited on GPU.
Memory blocking to increase cache utilization does not help
Architectural problem, not programming model problem
PCA stream processing architectures (Imagine) will do much better!
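
The efficiency percentages quoted on slide 14 can be related to the two tables by dividing each GPU's matrix-multiplication result (slide 15) by its microbenchmark peak (slide 14). The sketch below does that division. Reading the "NVIDIA" and "ATI" bullets as the NV35 and ATI 9800XT rows is an inference from the numbers, not something the slides state.

```python
# Figures transcribed from the microbenchmark (slide 14) and
# matrix-matrix multiplication (slide 15) tables.
# Each entry is (GFLOPS, cache bandwidth).
MICROBENCHMARK = {
    "NV35": (39.99, 11.08),
    "NV40": (43.00, 18.9),
    "ATI 9800XT": (26.14, 12.20),
    "ATI X800": (33.4, 30.7),
}
MATMUL = {
    "NV35": (3.04, 9.07),
    "NV40": (7.24, 14.88),
    "ATI 9800XT": (4.83, 12.06),
    "ATI X800": (12.0, 30.0),
}


def efficiency(gpu: str) -> tuple[float, float]:
    """Return (compute %, cache bandwidth %) achieved by matmul on `gpu`,
    relative to that GPU's microbenchmark peak."""
    peak_gflops, peak_bandwidth = MICROBENCHMARK[gpu]
    gflops, bandwidth = MATMUL[gpu]
    return 100.0 * gflops / peak_gflops, 100.0 * bandwidth / peak_bandwidth


if __name__ == "__main__":
    for gpu in MICROBENCHMARK:
        compute_pct, bandwidth_pct = efficiency(gpu)
        print(f"{gpu:12s} compute {compute_pct:5.1f}%  cache BW {bandwidth_pct:5.1f}%")
```

In every row the bandwidth fraction is far higher than the compute fraction, which is the pattern behind the "bandwidth limited" conclusion.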

16
Variable Output Shaders
Daniel Horn, Ian Buck, Pat Hanrahan
17
Motivation: Enabling Algorithms

Not all algorithms map to the 1-in 1-out semantics of GPUs
Other classes of algorithms require data filtering (1-in 0-out) and amplification (1-in n-out).
Vout is conditional write on Imagine
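
To make the filtering and amplification cases concrete, here is a minimal CPU-side sketch of a kernel that may push zero, one, or many outputs per input element. It is not Brook syntax: `run_vout_kernel`, `keep_even`, and `repeat_value` are hypothetical names chosen for illustration.

```python
from typing import Callable, Iterable, List

# A "kernel" receives one input element and a push function; it may call
# push zero times (filtering) or several times (amplification).
Kernel = Callable[[int, Callable[[int], None]], None]


def run_vout_kernel(kernel: Kernel, stream: Iterable[int]) -> List[int]:
    """Apply `kernel` to each element in order and collect everything it pushes."""
    output: List[int] = []
    for element in stream:
        kernel(element, output.append)
    return output


def keep_even(x: int, push: Callable[[int], None]) -> None:
    """Filtering: 1-in, 0-or-1-out."""
    if x % 2 == 0:
        push(x)


def repeat_value(x: int, push: Callable[[int], None]) -> None:
    """Amplification: 1-in, n-out (here n equals the value itself)."""
    for _ in range(x):
        push(x)


if __name__ == "__main__":
    print(run_vout_kernel(keep_even, [1, 2, 3, 4]))     # [2, 4]
    print(run_vout_kernel(repeat_value, [0, 1, 2, 3]))  # [1, 2, 2, 3, 3, 3]
```

The output length depends on the data, which is exactly what a fixed 1-in 1-out fragment program cannot express directly.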

18
Algorithms

Ray Tracing terrains
Marching Cubes
Adaptive Subdivision Surfaces
Collision Detection (OBB)
Graph traversal

19
Implementation on GPU

Push output (sentinel if no push)
Options to consolidate sentinels:
Sort: O(n (log n)^2)
Sort sentinels to the end, truncate
Scan/Search: O(n log n)
Perform a running sum, then search for gather location
Scan/Scatter: O(n log n)
Perform a running sum, scatter to destination
Constant-time hardware implementation
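
The three software options for consolidating sentinels can be sketched on the CPU as follows. This is an illustration of the data flow only, under stated assumptions: `None` stands in for the sentinel, the function names are hypothetical, and the code is sequential Python, so it does not reproduce the multi-pass GPU costs listed on the slide (for instance, Python's built-in sort is not the GPU sorting network implied by the O(n (log n)^2) bound).

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Optional

SENTINEL = None  # assumed marker for "no output was pushed here"
Stream = List[Optional[str]]


def running_sum_of_valid(stream: Stream) -> List[int]:
    """Inclusive running sum of 1 for each real element, 0 for each sentinel."""
    return list(accumulate(0 if x is SENTINEL else 1 for x in stream))


def compact_sort(stream: Stream) -> Stream:
    """Sort sentinels to the end (stable), then truncate."""
    count = sum(1 for x in stream if x is not SENTINEL)
    return sorted(stream, key=lambda x: x is SENTINEL)[:count]


def compact_scan_scatter(stream: Stream) -> Stream:
    """Running sum gives each real element its destination; scatter to it."""
    running = running_sum_of_valid(stream)
    out: Stream = [SENTINEL] * (running[-1] if running else 0)
    for i, x in enumerate(stream):
        if x is not SENTINEL:
            out[running[i] - 1] = x
    return out


def compact_scan_search(stream: Stream) -> Stream:
    """Each output slot j searches the running sum for its source index:
    the first position whose running sum exceeds j."""
    running = running_sum_of_valid(stream)
    total = running[-1] if running else 0
    return [stream[bisect_right(running, j)] for j in range(total)]


if __name__ == "__main__":
    data = [None, "a", None, "b", "c"]
    print(compact_sort(data))
    print(compact_scan_scatter(data))
    print(compact_scan_search(data))
```

Scan/search turns the scatter into a gather: every output slot locates its own source with a binary search, which suits hardware where a kernel can read from arbitrary addresses but cannot write to them.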

20
Timing and Bandwidth Numbers
21
Future Work

Brook semantics, compiling, virtualization
Support new GPU features (branching, FB ops, ...)
Predication
Integration with graphics pipeline
Documented path to texture for rendering
Access to other GPU features, e.g. occlusion culling
Interactive simulation: new algorithms
Collision detection and line of sight calculations
Merge ray tracer with UNC/SAIC algorithm
Machine learning: HMM, GLM, K-means, ...
Protein folding (StreamMD) and docking
Virtual surgery

22
Distributed Brook

Stream- and thread-level parallelism
UPC distributed memory semantics
PCI-express system for fast readback

23
GPU Cluster DOE

16 node cluster
Each node: 3U, half depth
Cluster totals:
32 2.4GHz P4 Xeons
16GB DDR
1.2TB disk
Infiniband 4X interconnect
Per node:
Dual 2.4GHz P4 Xeons
Intel E7505 chipset
1GB DDR
ATI Radeon 9800 Pro 256MB
GigE
80 GB IDE

24
Questions?
Fly-fishing fly images from The English Fly
Fishing Shop
