Title: GPU Cluster for High Performance Computing
1. GPU Cluster for High Performance Computing
- Zhe Fan, Feng Qiu, Arie Kaufman,
- Suzanne Yoakum-Stover
- Stony Brook University
2. Outline
- Background
- GPU: graphics processing unit
- GPGPU: general-purpose computation using GPU
- The Computational GPU Cluster
- The Lattice Boltzmann Computation
- Performance Evaluation
- Application for Urban Dispersion Modeling
- Conclusion and Future Work
3. Background: What's a GPU?
- Graphics processing units
- nVIDIA NV40, ATI R420
- In COTS 3D graphics cards
- GeForce 6800 Ultra, Radeon X800 XT
- Modern GPU: 600M vertices, 6G pixels / second
4. Background: GPU Growth Rate
- Driven by booming market for games
- The rendering task is embarrassingly parallel, so GPUs can
efficiently use a large number of computational units
5. Background: Graphics Pipeline
Vertex Processing (transform 3D coords to 2D coords)
Rasterization
Texture Memory
Fragment Processing (combine colors into the final image)
Composite
6. Background: GPU Compute Power
- High compute parallelism
- 6 vertex and 16 fragment pipelines
- 4D vector fp32 instructions
- More than 100 FLOPs/cycle!
- Fast on-board memory
- Bandwidth: 35.2 GB/sec
- Size: 256 - 512 MB
- Low price
[Pipeline diagram: 6 vertex pipelines (Vertex Processing) -> Rasterization -> 16 fragment pipelines (Fragment Processing) -> Composite, with 256-bit GDDR3 Texture Memory]
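The "more than 100 FLOPs/cycle" claim can be sanity-checked from the pipeline counts above. The sketch below assumes each pipeline can issue one 4-wide multiply-add per cycle (2 ops per component), which is a common peak-rate convention rather than a figure from the slides.

```python
# Back-of-the-envelope check of the ">100 FLOPs/cycle" claim.
# Assumption (ours): one 4-wide multiply-add per pipeline per cycle.
fragment_pipes = 16
vertex_pipes = 6
vec_width = 4          # 4D vector fp32 instructions
ops_per_mad = 2        # multiply + add counted as 2 FLOPs
flops_per_cycle = (fragment_pipes + vertex_pipes) * vec_width * ops_per_mad
print(flops_per_cycle)  # 176 -- comfortably above 100
```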
7. Background: GPGPU
- Recent development of GPUs
- Programmability
- High-level language, Cg
- 32-bit floating point
- General-purpose computation using GPU
- GPU accelerates certain computations by 10 times
- Data parallelism
- Computationally intensive
- Relatively simple data structures and control flow
8. Background: GPGPU Examples
- Scientific computation
- Physically-based simulations
- Linear algebra, conjugate gradient solver
- Level set
- Fast Fourier transform
- Image and volume processing
- Computational geometry
- Database
- Flow visualization
- Global illumination
- GPGPU language
- Many papers. See http://www.gpgpu.org
9. Motivations
- Scale-up to GPU cluster for large-scale problems
- Explore the use for high-performance computing
- Outperform COTS CPU clusters for certain computations
- Motivated by
- PlayStation2 computational cluster at NCSA
- Graphics PC clusters
- Humphrey et al., 02; Govindaraju et al., 03; etc.
10. Stony Brook Visual Computing Cluster
- Gigabit Ethernet
- 32 HP PCs
- 64 Pentium Xeon 2.4 GHz CPUs
- 32 x 2 GB DDR memory
- 32 x 120 GB hard disks
- 32 GeForce FX 5800 Ultra
- 32 VolumePro 1000 (volume rendering)
- 9 HP Sepia-2A cards (compositing)
- ServerNet (compositing network)
[Diagram: the machine serves both as a computational GPU cluster and as a visualization cluster]
11. Computational GPU Cluster
- Gigabit Ethernet
- 32 HP PCs
- 64 Pentium Xeon 2.4 GHz CPUs
- 32 x 2 GB DDR memory
- 32 x 120 GB hard disks
- 32 GeForce FX 5800 Ultra
[Slide table: per-node component prices x 32 nodes: $1,621 x 32, $349 x 32, $607 x 32, $150 x 32; GPU: $399 x 32; 16 GFLOPS x 32 (fragment pipeline capability)]
Plugging in 32 GPUs, the theoretical peak has been increased by 512 GFLOPS at a price of only $12,768: 1 GFLOPS for $25.
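The slide's cost-effectiveness figures check out arithmetically (the "$" signs were lost in extraction; the numbers themselves are the slide's):

```python
# Cost-effectiveness of adding the GPUs, using the slide's figures.
gpu_gflops = 16          # per-GPU fragment-pipeline peak
nodes = 32
gpu_price = 399          # GeForce FX 5800 Ultra, per card, in dollars
total_gflops = gpu_gflops * nodes   # 512 GFLOPS of added peak
total_price = gpu_price * nodes     # $12,768 total
dollars_per_gflops = round(total_price / total_gflops)
print(total_gflops, total_price, dollars_per_gflops)  # 512 12768 25
```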
12. GPU Cluster Architecture
- Currently, reading back from the GPU is slow
13. Lattice Boltzmann Model (LBM)
- CFD method on the lattice
- Numerical calculations
- Stream
- Collision
- Yields the Navier-Stokes equations for incompressible fluids
- Great flexibility in specifying complex boundaries
A cell of the D3Q19 lattice
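The two numerical phases named above can be sketched in a few lines. This is a minimal pure-Python illustration using the smaller D2Q9 lattice rather than the paper's D3Q19; the grid size and relaxation time are arbitrary, and the variable names are ours.

```python
# Minimal BGK lattice Boltzmann step: "stream" moves distributions to
# neighbor cells, "collision" relaxes them toward a local equilibrium.
# D2Q9 lattice (9 velocities in 2D), periodic boundaries, toy sizes.
NX, NY, TAU = 8, 8, 0.6
C = [(0,0),(1,0),(0,1),(-1,0),(0,-1),(1,1),(-1,1),(-1,-1),(1,-1)]
W = [4/9] + [1/9]*4 + [1/36]*4

def equilibrium(rho, ux, uy):
    feq = []
    for (cx, cy), w in zip(C, W):
        cu = cx*ux + cy*uy
        feq.append(w * rho * (1 + 3*cu + 4.5*cu*cu - 1.5*(ux*ux + uy*uy)))
    return feq

# f[x][y][i]: distribution for direction i at cell (x, y); fluid at rest
f = [[equilibrium(1.0, 0.0, 0.0) for _ in range(NY)] for _ in range(NX)]

def step(f):
    # Stream: pull each distribution from the upwind neighbor (periodic)
    fs = [[[f[(x - C[i][0]) % NX][(y - C[i][1]) % NY][i]
            for i in range(9)] for y in range(NY)] for x in range(NX)]
    # Collide: BGK relaxation toward the local equilibrium
    for x in range(NX):
        for y in range(NY):
            rho = sum(fs[x][y])
            ux = sum(fs[x][y][i] * C[i][0] for i in range(9)) / rho
            uy = sum(fs[x][y][i] * C[i][1] for i in range(9)) / rho
            feq = equilibrium(rho, ux, uy)
            fs[x][y] = [fi - (fi - fe) / TAU for fi, fe in zip(fs[x][y], feq)]
    return fs

f = step(f)
# Both phases conserve mass: total density stays NX * NY
total_mass = sum(sum(sum(cell) for cell in col) for col in f)
```

On the GPU, this same per-cell update is what the fragment processing stage evaluates in parallel for every cell.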
14. LBM on a Single GPU
- Li et al., Visual Computer 03
- Program the fragment processing stage
15. Store LBM Data in Textures
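Only the slide title survives here. The general idea in the single-GPU work (Li et al. 2003) is to pack the distribution functions into the four channels (RGBA) of 2D textures, with the 3D volume flattened into a grid of 2D slices. Below is one plausible index mapping; the function name, tile layout, and toy sizes are our illustration, not the paper's exact scheme.

```python
# Hypothetical packing of D3Q19 distributions into RGBA textures:
# 19 directions fit in ceil(19/4) = 5 four-channel textures, and the
# 3D volume is flattened slice-by-slice into a 2D texture.
NX, NY, NZ = 4, 4, 4          # toy lattice dimensions
TILES_PER_ROW = 2             # slices laid out in a 2D grid of tiles

def pack(direction, x, y, z):
    """Map (direction, cell) -> (texture, channel, texel u, texel v)."""
    tex, channel = divmod(direction, 4)        # 4 directions per RGBA texture
    tile_v, tile_u = divmod(z, TILES_PER_ROW)  # which tile holds slice z
    u = tile_u * NX + x
    v = tile_v * NY + y
    return tex, channel, u, v

print(pack(18, 3, 2, 3))  # -> (4, 2, 7, 6): last direction lands in texture 4
```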
16. Scale-up LBM
- Domain decomposition
- Communication
- Read out from GPU
- Network transfer
- Write into GPUs
17. GPU <-> PC Data Transfer
- Read data from GPU
- Aggregate necessary boundary data together into a texture
- Read them out in a single operation
- Write data into GPU
- Reverse the above procedure
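The "aggregate, then read once" idea amounts to gathering all boundary cells destined for a neighbor into one contiguous buffer (one texture on the GPU) so a single readback replaces many small ones. A pure-Python sketch of the gather; the function name and layout are ours:

```python
# Gather the two x-faces of a flat, row-major nx * ny * nz domain into
# contiguous buffers, as one would before a single GPU readback.
def gather_boundary(domain, nx, ny, nz):
    idx = lambda x, y, z: (z * ny + y) * nx + x
    left  = [domain[idx(0,      y, z)] for z in range(nz) for y in range(ny)]
    right = [domain[idx(nx - 1, y, z)] for z in range(nz) for y in range(ny)]
    return left, right

nx, ny, nz = 4, 3, 2
domain = list(range(nx * ny * nz))     # cell index doubles as cell value
left, right = gather_boundary(domain, nx, ny, nz)
print(left, right)
```

Writing data into the GPU reverses this: scatter the received buffer back to the boundary cells.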
18. Network Transfer
- MPI
- To minimize communication cost
- Overlap network transfer time with computation time
- Simplify communication pattern
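The benefit of overlapping can be illustrated with a toy timing model: per step, update boundary cells first, start the (nonblocking) exchange, update interior cells while the transfer is in flight, then wait. The time values below are made-up units, not measurements from the paper.

```python
# Toy model of overlapping network transfer with computation.
# All three durations are arbitrary illustration values.
boundary_compute, interior_compute, transfer = 1.0, 8.0, 5.0

# Without overlap: boundary, interior, and transfer are serialized
serial = boundary_compute + interior_compute + transfer
# With overlap: the transfer hides behind the interior computation
overlapped = boundary_compute + max(interior_compute, transfer)
print(serial, overlapped)  # 14.0 9.0
```

The transfer is fully hidden whenever it takes no longer than the interior computation, which is why the slides report it as "mostly overlapped".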
19. To Compare with CPU Cluster Code
- Background
- For our CPU cluster, each node uses 1 CPU to compute
- The CPU cluster code hasn't used SSE
- (SSE is expected to be about 3 times faster)
Communication            CPU cluster code         GPU cluster code
Read-from-GPU overhead   N/A                      17%
Network transfer time    Fully overlapped         Mostly overlapped
                         with computation         with computation
20. Performance Comparison
Scale-up test (each node computes an 80³ sub-domain)
21. Performance Comparison (cont.)
22. For Further Improvement
- Use newer generation GPUs
- Already 2.2 times faster
- Use PCI-Express bus
- Much faster GPU <-> PC communication
- 4 GB / sec either way
- Multiple GPUs on each PC to be feasible soon
- Code can be optimized
- Faster network
23. Urban Application
- Use LBM to simulate airborne contaminant dispersion in complex urban environments
- Simulation and visualization on a single GPU
- Qiu et al., Visualization 2004
24. Simulation Area for GPU Cluster
Times Square Area of New York City
- 1.66 km x 1.13 km
- 91 blocks
- 850 buildings
25. Air Flow, Times Square Area, NYC
- 480 x 400 x 80
- On 30 GPUs
- 0.31 sec/step
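The slide's figures imply a sustained update rate (the arithmetic is ours; the lattice size and step time are the slide's):

```python
# Sustained cell-update rate for the Times Square run:
# a 480 x 400 x 80 lattice stepped in 0.31 s on 30 GPUs.
cells = 480 * 400 * 80            # 15,360,000 lattice cells
step_time = 0.31                  # seconds per LBM step
updates_per_sec = cells / step_time
print(round(updates_per_sec / 1e6, 1))  # ~49.5 million cell updates/s
```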
26. Dispersion Plume, Volume Rendered
27. Future Work
- Other computations
- E.g., PDE, FEM
- GPUs as co-processors of CPUs
- Carr et al., 2002, etc.
- Online visualization and computational steering
- Potential advantage of GPU cluster
28. Conclusions
- Cluster of commodity GPUs for high-performance computing
- LBM to simulate airborne dispersion in the urban environment
- Compared with the same model on a CPU cluster, the GPU cluster is much faster, and better performance is anticipated
- The GPU cluster is very promising for scientific computation
29. Acknowledgements
- NSF CCR0306438
- Department of Homeland Security at Environment Measurement Lab
- Hewlett Packard
- TeraRecon
- Reviewers
- Bin Zhang, Wei Li, Ye Zhao, Xiaoming Wei, Klaus
Mueller, Brent Lindquist
30. Thank You!