Title: Brook for GPUs
1Brook for GPUs
- Ian Buck, Tim Foley, Daniel Horn, Jeremy Sugerman, Kayvon Fatahalian, Mike Houston, Pat Hanrahan
- Stanford University
- DARPA Site Visit, UNC
- May 6th, 2004
2Motivation
- GPUs are faster than CPUs
- GPUs are getting faster, faster
- Why?
- Massive parallelism (1000s of ALUs)
- Choreographed communication
- Efficiently utilize VLSI resources (the DIS/PCA mantra)
- Programmable GPUs are stream processors
- Many streaming applications beyond graphics
- Buy a desktop supercomputer for $50!
- Revolutionize computing?
3Recent Performance Trends
5CPU vs GPU
- Intel 3 GHz Pentium 4
- 12 GFLOPS peak performance (via SSE2)
- 5.96 GB/sec peak memory bandwidth
- 44 GB/sec peak bandwidth from 8K L1 data cache
- NVIDIA GeForce 6800
- 45 GFLOPS peak performance
- 36 GB/sec peak memory bandwidth
- Texture cache bandwidth and size (undisclosed)?
6Deliverables
- Develop version of PCA Brook for GPUs
- Programmer need not know GL
- Versions
- New ATI (420) and NVIDIA (NV40) hardware
- Linux and Windows
- DX and OpenGL
- Release as open source V1.0 Dec 2003
- Support OneSAF LOS, collision detection and route
planning algorithms
7Research Issues
- Brook semantics
- E.g. variable-length streams (vout)
- Compilation techniques
- Virtualization of GPU
- Splitting kernels (MRDS)
- Explore streaming application space
- Scientific computing: RT, MD, BLAS, FFT, ...
- Machine learning: HMM, linear models, Bayes, ...
8Brook Update
10Understanding the Efficiency of GPU Algorithms
for Matrix-Matrix Multiplication
- Kayvon Fatahalian, Jeremy Sugerman, Pat Hanrahan
11Dense Matrix-Matrix Multiplication
- ATLAS on the Intel P4 wins!
12CPU vs GPU
- Intel 3 GHz Pentium 4
- 12 GFLOPS peak performance (via SSE2)
- 5.96 GB/sec peak memory bandwidth
- 44 GB/sec peak bandwidth from 8K L1 data cache
- NVIDIA GeForce 6800
- 43 GFLOPS peak performance
- 36 GB/sec peak memory bandwidth
- Texture cache bandwidth and size (undisclosed)?
- Why is graphics hardware so slow?
14Why is Graphics Hardware so Slow?
Microbenchmark (MAD)
              GFLOPS   Cache BW (GB/sec)   Seq Read BW (GB/sec)
NV35           39.99               11.08                   4.40
NV40           43.00               18.9                    3.85
ATI 9800XT     26.14               12.20                   7.33
ATI X800       33.4                30.7                   18.4
- NVIDIA: 8% compute efficiency, 82% of cache bandwidth
- Arithmetic intensity: 12 math operations per float fetched from cache
- ATI: 18% of peak performance, 99% of peak cache bandwidth
- Arithmetic intensity: 8 to 1 math to cache-fetch ratio
15Why is Graphics Hardware so Slow?
Matrix-Matrix Multiplication
              GFLOPS   Bandwidth (GB/sec)
NV35            3.04                 9.07
NV40            7.24                14.88
ATI 9800XT      4.83                12.06
ATI X800       12                   30
P4              7.78                27.68
- Matrix-matrix multiplication is bandwidth limited on GPU
- Memory blocking to increase cache utilization does not help
- Architectural problem, not programming model problem
- PCA stream processing architectures (Imagine) will do much better!
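The efficiency percentages quoted on the microbenchmark slide appear to follow from dividing each matrix-multiplication figure by the corresponding microbenchmark peak. The sketch below redoes that arithmetic in Python using only the numbers from the two tables. The dictionary and function names are hypothetical, the bandwidth unit is assumed to be GB/sec, and reading "NVIDIA" as NV35 and "ATI" as the 9800XT is an assumption based on which rows reproduce the quoted 8%/82% and 18%/99%.

```python
# Measured peaks from the microbenchmark table: (GFLOPS, cache bandwidth).
# Bandwidth unit is assumed to be GB/sec.
PEAK = {
    "NV35": (39.99, 11.08),
    "NV40": (43.00, 18.9),
    "ATI 9800XT": (26.14, 12.20),
    "ATI X800": (33.4, 30.7),
}

# Achieved figures from the matrix-matrix multiplication table: (GFLOPS, bandwidth).
MATMUL = {
    "NV35": (3.04, 9.07),
    "NV40": (7.24, 14.88),
    "ATI 9800XT": (4.83, 12.06),
    "ATI X800": (12.0, 30.0),
}


def percent_of_peak(achieved, peak):
    """Return achieved as a percentage of peak."""
    return 100.0 * achieved / peak


def efficiency(gpu):
    """Return (compute %, cache-bandwidth %) for one GPU from the tables."""
    peak_gflops, peak_bw = PEAK[gpu]
    mm_gflops, mm_bw = MATMUL[gpu]
    return (percent_of_peak(mm_gflops, peak_gflops),
            percent_of_peak(mm_bw, peak_bw))


if __name__ == "__main__":
    for gpu in PEAK:
        compute, bandwidth = efficiency(gpu)
        print(f"{gpu:<11} compute {compute:5.1f}%   cache bandwidth {bandwidth:5.1f}%")
```

Every GPU row uses most of its cache bandwidth while reaching only a fraction of its arithmetic peak, which is the pattern behind the "bandwidth limited" conclusion.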
16Variable Output Shaders
Daniel Horn, Ian Buck, Pat Hanrahan
17Motivation Enabling Algorithms
- Not all algorithms map to the 1-in 1-out semantics of GPUs
- Other classes of algorithms require data filtering (1-in 0-out) and amplification (1-in n-out)
- Vout is conditional write on Imagine
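To make the filtering and amplification cases concrete, here is a minimal Python sketch of kernels that emit a variable number of outputs per input element. It is not Brook syntax. The names `filter_kernel`, `amplify_kernel`, and `run_vout` are hypothetical and only illustrate the 1-in 0-out and 1-in n-out behavior.

```python
def filter_kernel(x):
    """Filtering: each input produces zero or one output (1-in 0-out when rejected)."""
    if x >= 0:
        yield x


def amplify_kernel(x):
    """Amplification: each input x produces x copies of itself (1-in n-out)."""
    for _ in range(x):
        yield x


def run_vout(kernel, stream):
    """Apply a variable-output kernel to every element and concatenate the results in order."""
    return [y for x in stream for y in kernel(x)]


if __name__ == "__main__":
    print(run_vout(filter_kernel, [3, -1, 0, -5, 2]))
    print(run_vout(amplify_kernel, [2, 0, 3]))
```

The output stream length depends on the data, which is what a fixed 1-in 1-out fragment program cannot express directly.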
18Algorithms
- Ray Tracing terrains
- Marching Cubes
- Adaptive Subdivision Surfaces
- Collision Detection (OBB)
- Graph traversal
19Implementation on GPU
- Push output (sentinel if no push)
- Options to consolidate sentinels
- Sort: O(n (log n)^2)
- Sort sentinels to the end, truncate
- Scan/Search: O(n log n)
- Perform a running sum, then search for gather location
- Scan/Scatter: O(n log n)
- Perform a running sum, scatter to destination
- Constant time hardware implementation
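The three software options for consolidating sentinels can be sketched on the CPU as follows. This is a sequential Python illustration of the data movement only. It does not model the multipass GPU implementation or its cost, and the `SENTINEL` marker and function names are assumptions made for the example.

```python
from bisect import bisect_left
from itertools import accumulate

SENTINEL = None  # hypothetical marker meaning "this element pushed no output"


def _running_sum(stream):
    """Inclusive running sum of keep-flags (1 for real data, 0 for a sentinel)."""
    return list(accumulate(0 if x is SENTINEL else 1 for x in stream))


def compact_sort(stream):
    """Sort sentinels to the end (stable sort keeps data order), then truncate."""
    ordered = sorted(stream, key=lambda x: x is SENTINEL)
    kept = sum(1 for x in stream if x is not SENTINEL)
    return ordered[:kept]


def compact_scan_search(stream):
    """Running sum, then each output slot searches for its gather location."""
    sums = _running_sum(stream)
    total = sums[-1] if sums else 0
    # Output slot j reads from the first index whose running sum reaches j + 1.
    return [stream[bisect_left(sums, j + 1)] for j in range(total)]


def compact_scan_scatter(stream):
    """Running sum, then each kept element scatters to its destination slot."""
    sums = _running_sum(stream)
    total = sums[-1] if sums else 0
    out = [SENTINEL] * total
    for i, x in enumerate(stream):
        if x is not SENTINEL:
            out[sums[i] - 1] = x  # destination = number of kept elements before i
    return out


if __name__ == "__main__":
    data = [3, SENTINEL, SENTINEL, 7, 1, SENTINEL, 9]
    print(compact_sort(data), compact_scan_search(data), compact_scan_scatter(data))
```

The search variant only needs gather (reads at computed addresses), while the scatter variant needs writes to computed addresses, so the choice depends on what the hardware supports.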
20Timing and Bandwidth Numbers
21Future Work
- Brook semantics, compiling, virtualization
- Support new GPU features (branching, FB ops, ...)
- Predication
- Integration with graphics pipeline
- Documented path to texture for rendering
- Access to other GPU features, e.g. occlusion culling
- Interactive simulation: new algorithms
- Collision detection and line of sight calculations
- Merge ray tracer with UNC/SAIC algorithm
- Machine learning: HMM, GLM, K-means, ...
- Protein folding (StreamMD) and docking
- Virtual surgery
22Distributed Brook
- Stream- and thread-level parallelism
- UPC distributed memory semantics
- PCI-express system for fast readback
23GPU Cluster DOE
- 16 node cluster
- Each node 3U half depth
- 32 2.4GHz P4 Xeons
- 16GB DDR
- 1.2TB disk
- Infiniband 4X interconnect
- Dual 2.4GHz P4 Xeons
- Intel E7505 chipset
- 1GB DDR
- ATI Radeon 9800 Pro 256MB
- GigE
- 80 GB IDE
24Questions?
Fly-fishing fly images from The English Fly
Fishing Shop