Title: GPU Acceleration of Scientific Applications Using CUDA
1. GPU Acceleration of Scientific Applications Using CUDA
- John E. Stone
- Theoretical and Computational Biophysics Group
- NIH Resource for Macromolecular Modeling and Bioinformatics
- Beckman Institute for Advanced Science and Technology
- University of Illinois at Urbana-Champaign
2. What Speedups Can GPUs Achieve?
- Speedups of 8x to 30x are quite common
- Best speedups (100x!) are attained on codes that are skewed towards floating point arithmetic, esp. CPU-unfriendly operations that prevent use of SSE or vectorization
- Amdahl's Law can prevent legacy codes from achieving peak speedups with only shallow GPU acceleration efforts
3. Some GPU Speedup Examples (vs. SSE-vectorized CPU code)
- Fluorescence microscopy simulation: 12x
- Molecular dynamics
  - Non-bonded force calculation (no pairlist): 8x
  - Non-bonded force calculation (pairlist): 16x
- Electrostatics, ion placement
  - Direct Coulomb summation: 40-120x
  - Multilevel Coulomb summation (short-range lattice cutoff): 7x
4. How Difficult is CUDA Programming?
- Parallel algorithms are the hard part, not the programming API
- CUDA is as easy to learn as any other parallel programming API I've used
- Easy to mix with other parallel programming APIs (e.g. POSIX threads, MPI, etc.); a sketch of the one-thread-per-GPU pattern follows this list
- CUDA's fine-grained parallelism nicely complements the comparatively coarse-grained parallelism available in other APIs
- GPU hardware constraints present their own challenges
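Mixing CUDA with coarse-grained host parallelism typically looks like the pattern below: a minimal sketch (not from the slides; the kernel and variable names are hypothetical) in which one POSIX thread drives each GPU, bound to its device with cudaSetDevice().

    /* Sketch: one POSIX thread per GPU; names are hypothetical. */
    #include <pthread.h>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = 2.0f * i;          /* stand-in for real work */
    }

    static void *gpu_worker(void *arg) {
      int dev = *(int *) arg;
      cudaSetDevice(dev);                    /* bind this host thread to one GPU */
      int n = 1 << 20;
      float *d_out = NULL;
      cudaMalloc((void **) &d_out, n * sizeof(float));
      dummy_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
      cudaDeviceSynchronize();
      cudaFree(d_out);
      return NULL;
    }

    int main(void) {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      if (ndev > 16) ndev = 16;
      pthread_t tid[16];
      int devid[16];
      for (int i = 0; i < ndev; i++) {
        devid[i] = i;
        pthread_create(&tid[i], NULL, gpu_worker, &devid[i]);
      }
      for (int i = 0; i < ndev; i++)
        pthread_join(tid[i], NULL);
      return 0;
    }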
5. CUDA Class at Illinois
- ECE498: Programming Massively Parallel Processors
  - Wen-mei Hwu (ECE Professor, UIUC)
  - David Kirk (Chief Scientist, NVIDIA)
  - Several guest lecturers
- Attended by both students and interested researchers on campus
- Class projects are supported by research groups on campus
  - MRI Processing
  - Biomolecular Simulations
  - Scientific Visualization
  - Many more
- Class home page, lectures, MP3 audio:
  - http://courses.ece.uiuc.edu/ece498/al1/
6. An Approach to Writing CUDA Kernels
- Use algorithms that can expose substantial parallelism; you'll need thousands of threads
- Identify the ideal GPU memory system to use for kernel data for best performance
- Minimize host/GPU DMA transfers; use pinned memory buffers when appropriate (see the sketch after this list)
- Optimal kernels involve many trade-offs; these are easier to explore through experimentation with microbenchmarks based on key components of the real science code, without the baggage
- Analyze the real-world use cases and select the kernel(s) that best match, by size, parameters, etc.
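As a concrete illustration of the pinned-buffer point, here is a minimal sketch (assumed, not from the slides; the buffer and stream names are hypothetical) of staging data through page-locked host memory and copying it asynchronously:

    #include <cuda_runtime.h>

    void stage_input(const float *src, int n) {
      float *h_buf = NULL, *d_buf = NULL;
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      /* Page-locked (pinned) host memory enables faster, asynchronous DMA. */
      cudaMallocHost((void **) &h_buf, n * sizeof(float));
      cudaMalloc((void **) &d_buf, n * sizeof(float));
      for (int i = 0; i < n; i++)
        h_buf[i] = src[i];
      /* The copy is queued on 'stream'; the CPU is free to do other work. */
      cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                      cudaMemcpyHostToDevice, stream);
      /* ... launch kernels that consume d_buf on the same stream ... */
      cudaStreamSynchronize(stream);
      cudaFree(d_buf);
      cudaFreeHost(h_buf);
      cudaStreamDestroy(stream);
    }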
7. Be Open-Minded
- Experienced programmers have a hard time getting used to the idea that GPUs can actually do arithmetic 100x faster than CPUs
- CPU-centric programming idioms are often frugal with arithmetic ops but cavalier with memory references/locality/register spills; GPU hardware prefers almost the opposite approach
- Pretend you've never written optimized code before and learn the GPU on its own terms; don't force it to run CPU-centric code
8. Potentially Beneficial Trade-offs
- Additional arithmetic for reduced memory references and lower register count
  - Example: CPU codes often precalculate values to reduce arithmetic. On the GPU, arithmetic is cheaper than memory accesses or register use
- Additional arithmetic/memory to avoid branching, and especially branch divergence
  - Example: pad input data to full block sizes rather than handling boundaries specially (see the sketch after this list)
- Additional arithmetic for more parallelism
  - Example: decreasing the computational tile size by forgoing loop optimizations that reduce redundant arithmetic yields better performance on very small datasets
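To make the padding trade-off concrete, the following minimal sketch (assumed illustration; the kernel and names are hypothetical) rounds a 2-D lattice up to a whole number of thread blocks so the kernel needs no boundary branches:

    #include <cuda_runtime.h>

    #define BLOCKSIZEX 16
    #define BLOCKSIZEY 16

    __global__ void clear_lattice(float *lattice, int padded_width) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      /* No bounds test: every thread maps to a valid (possibly padding) cell. */
      lattice[y * padded_width + x] = 0.0f;
    }

    void launch_padded(int width, int height) {
      /* Round dimensions up to full blocks; the extra cells are wasted work
         but avoid divergent boundary handling inside the kernel. */
      int pw = ((width  + BLOCKSIZEX - 1) / BLOCKSIZEX) * BLOCKSIZEX;
      int ph = ((height + BLOCKSIZEY - 1) / BLOCKSIZEY) * BLOCKSIZEY;
      float *d_lattice = NULL;
      cudaMalloc((void **) &d_lattice, pw * ph * sizeof(float));
      dim3 block(BLOCKSIZEX, BLOCKSIZEY);
      dim3 grid(pw / BLOCKSIZEX, ph / BLOCKSIZEY);
      clear_lattice<<<grid, block>>>(d_lattice, pw);
      cudaDeviceSynchronize();
      cudaFree(d_lattice);
    }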
9. Fluorescence Microscopy
- 2-D reaction-diffusion simulation used to predict results of fluorescence microphotolysis experiments
- Simulates 1-10 second microscopy experiments with 0.1 ms integration timesteps
- Goal: < 1 min per simulation on commodity PC hardware
- Project home page:
  - http://www.ks.uiuc.edu/Research/microscope/
10. Fluorescence Microscopy (2)
- Challenges for CPU
  - Efficient handling of boundary conditions
  - Large number of floating point operations per timestep
- Challenges for GPU/CUDA
  - Hiding global memory latency, improving memory access patterns, controlling register use
  - Few arithmetic operations per memory reference (for a GPU); see the sketch below
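The simulation's actual numerical scheme is not reproduced here; as a generic sketch of the kind of per-timestep work involved (and of its low arithmetic intensity), an explicit forward-Euler diffusion update on a 2-D map might look like this (all names and the scheme itself are assumptions):

    /* One explicit diffusion timestep: c_out = c_in + dt * D * Laplacian(c_in).
       Only a handful of arithmetic ops per 5 global memory reads. */
    __global__ void diffusion_step(const float *c_in, float *c_out,
                                   int w, int h, float D, float dt, float dx) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
        return;                               /* boundaries handled separately */
      int i = y * w + x;
      float lap = (c_in[i - 1] + c_in[i + 1] + c_in[i - w] + c_in[i + w]
                   - 4.0f * c_in[i]) / (dx * dx);  /* 5-point Laplacian */
      c_out[i] = c_in[i] + dt * D * lap;
    }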
11. Fluorescence Microscopy (3)
- Simulation runtime and software development time
  - Original research code (CPU): 80 min
  - Optimized algorithm (CPU): 27 min
    - 40 hours of work
  - SSE-vectorized (CPU): 8 min
    - 20 hours of work
  - CUDA w/ 8800 GTX: 38 sec, 12 times faster than SSE!
    - 12 hours of work; it should be possible to improve further, but it is already fast enough for real use
- The CUDA code was more similar to the original than to the SSE-vectorized version: arithmetic is almost free on the GPU
12. Biomolecular Simulation Process
- Prepare model
  - Assemble structure
  - Add ions
  - Add solvent
- Perform molecular dynamics simulation
- Analyze simulation trajectories
- GPUs can accelerate many of the steps in this process
[Image: Satellite Tobacco Mosaic Virus]
13. Molecular Dynamics: Initial NAMD GPU Performance
- Full NAMD, not a test harness, so Amdahl's Law applies
- Useful performance boost
  - 8x speedup for nonbonded force calculation
  - 5x speedup overall w/o PME
  - 3.5x speedup overall w/ PME
- Plans for better performance
  - Overlap GPU and CPU work
  - Tune or port remaining work
    - PME, bonded, integration, etc.
[Chart: ApoA1 benchmark performance, 2.67 GHz Core 2 Quad Extreme vs. GeForce 8800 GTX]
14. Overview of the Ion Placement Process
- Calculate the initial electrostatic potential map around the simulated structure, considering the contributions of all atoms (the most costly step!)
- Ions are then placed one at a time (sketched below):
  - Find the voxel containing the minimum potential value
  - Add a new ion atom at the location of minimum potential
  - Add the potential contribution of the newly placed ion to the entire map
  - Repeat until the required number of ions have been added
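A minimal host-side sketch of that placement loop (the helper add_ion_contribution_to_map() is a hypothetical stand-in for the GPU-accelerated map update described on the next slide; ion bookkeeping is omitted):

    #include <float.h>

    void add_ion_contribution_to_map(float *map, int nvoxels, int ionvoxel);

    void place_ions(float *potentialmap, int nvoxels, int nions) {
      for (int ion = 0; ion < nions; ion++) {
        /* 1. find the voxel containing the minimum potential value */
        int minidx = 0;
        float minval = FLT_MAX;
        for (int v = 0; v < nvoxels; v++) {
          if (potentialmap[v] < minval) {
            minval = potentialmap[v];
            minidx = v;
          }
        }
        /* 2. a new ion is placed at the location of voxel 'minidx' */
        /* 3. add its potential contribution to the entire map
              (the GPU-accelerated, DCS-style update)               */
        add_ion_contribution_to_map(potentialmap, nvoxels, minidx);
      }
    }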
15. GPU-Accelerated Ion Placement: Electrostatic Potential Calculations
- Direct Coulomb Summation (DCS)
  - Brute-force arithmetic, no approximations, O(MN)
  - GPU 40-120x faster than CPU-SSE
  - Outperforms MCS for small to medium sized structures, and for ion placement map updates
  - Template for the inner loop of other grid-evaluated kernels (e.g. MCS)
- Multilevel Coulomb Summation (MCS)
  - Efficient hierarchical approximation, O(M+N)
  - GPU short-range lattice cutoff part 7x faster than CPU-SSE
  - Supports periodic boundary conditions
16. Runtime of Coulomb Summation Algorithms on CPU and GPU
17. Direct Coulomb Summation Algorithm
- At each lattice point, sum potential contributions from all atoms in the simulated structure:
- potential += charge[i] / (distance to atom[i])
[Figure: lattice point being evaluated, distance to atom[i]]
18. Comparison of Direct Coulomb Summation Kernels on CPU and GPU
19. DCS CUDA Block/Grid Decomposition (non-unrolled)
[Figure: grid of thread blocks over the potential map; thread blocks of 64-256 threads; each thread computes 1 potential; padding waste at the map edges]
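A reconstructed sketch (not copied from the slides) of what the non-unrolled kernel body could look like: one thread per lattice point, each summing potential += charge / distance over all atoms. The constant-memory atom array, with the squared z-distance precomputed in the .z field, is an assumption consistent with the unrolled loop shown on slide 21.

    #define MAXATOMS 4000
    __constant__ float4 atominfo[MAXATOMS];   /* x, y, precomputed dz*dz, charge */

    __global__ void cenergy(int numatoms, float gridspacing, float *energygrid) {
      unsigned int xindex = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
      unsigned int outaddr = gridDim.x * blockDim.x * yindex + xindex;
      float coorx = gridspacing * xindex;
      float coory = gridspacing * yindex;
      float energyval = 0.0f;
      for (int atomid = 0; atomid < numatoms; atomid++) {
        float dx = coorx - atominfo[atomid].x;
        float dy = coory - atominfo[atomid].y;
        /* potential += charge / distance */
        energyval += atominfo[atomid].w *
                     (1.0f / sqrtf(dx*dx + dy*dy + atominfo[atomid].z));
      }
      energygrid[outaddr] += energyval;
    }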
20. DCS CUDA Algorithm: Unrolling Loops
- Add each atom's contribution to several lattice points at a time, where the distances differ in only one component:
- potentialA += charge[i] / (distanceA to atom[i])
- potentialB += charge[i] / (distanceB to atom[i])
[Figure: distances from atom[i] to multiple lattice points]
21. DCS Loop Unrolling (CUDA-Unroll4x): Multiple Lattice Points Per Iteration
    for (atomid=0; atomid<numatoms; atomid++) {
      float dy = coory - atominfo[atomid].y;
      /* atominfo[].z presumably holds the precomputed squared z-distance */
      float dysqpdzsq = (dy * dy) + atominfo[atomid].z;
      float dx1 = coorx1 - atominfo[atomid].x;
      float dx2 = coorx2 - atominfo[atomid].x;
      float dx3 = coorx3 - atominfo[atomid].x;
      float dx4 = coorx4 - atominfo[atomid].x;
      energyvalx1 += atominfo[atomid].w * (1.0f / sqrtf(dx1*dx1 + dysqpdzsq));
      energyvalx2 += atominfo[atomid].w * (1.0f / sqrtf(dx2*dx2 + dysqpdzsq));
      energyvalx3 += atominfo[atomid].w * (1.0f / sqrtf(dx3*dx3 + dysqpdzsq));
      energyvalx4 += atominfo[atomid].w * (1.0f / sqrtf(dx4*dx4 + dysqpdzsq));
    }
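The index and coordinate setup surrounding that loop is not shown on the slide; below is a reconstructed sketch (an assumption, not the original code) of how the four lattice-point x-coordinates and the final stores might be arranged, spacing the points BLOCKSIZEX columns apart as in the coalesced variant on slide 22 so that each half-warp's global stores stay contiguous:

      /* Fragment of the kernel body around the loop above.
         Assumes BLOCKSIZEX == blockDim.x and that gridspacing and
         energygrid are kernel parameters. */
      unsigned int xindex = blockIdx.x * blockDim.x * 4 + threadIdx.x;
      unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
      unsigned int outaddr = gridDim.x * blockDim.x * 4 * yindex + xindex;

      float coory  = gridspacing * yindex;
      float coorx1 = gridspacing * xindex;
      float coorx2 = gridspacing * (xindex +     BLOCKSIZEX);
      float coorx3 = gridspacing * (xindex + 2 * BLOCKSIZEX);
      float coorx4 = gridspacing * (xindex + 3 * BLOCKSIZEX);

      float energyvalx1 = 0.0f, energyvalx2 = 0.0f;
      float energyvalx3 = 0.0f, energyvalx4 = 0.0f;

      /* ... the unrolled atom loop shown above runs here ... */

      energygrid[outaddr                 ] += energyvalx1;
      energygrid[outaddr +     BLOCKSIZEX] += energyvalx2;
      energygrid[outaddr + 2 * BLOCKSIZEX] += energyvalx3;
      energygrid[outaddr + 3 * BLOCKSIZEX] += energyvalx4;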
22. DCS CUDA Block/Grid Decomposition (unrolled, coalesced)
Unrolling increases computational tile size
[Figure: grid of thread blocks (0,0), (0,1), (1,0), (1,1); thread blocks of 64-256 threads; threads compute up to 8 potentials each, skipping by half-warps; padding waste at the map edges]
23. Questions?
[Photo: 8 min exposure, central Illinois]
24. References and Acknowledgements
- Additional Information and References
  - http://www.ks.uiuc.edu/Research/gpu/
  - http://www.ks.uiuc.edu/Research/vmd/
- Questions, source code requests
  - John Stone: johns_at_ks.uiuc.edu
- Acknowledgements
  - J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten (UIUC TCB Group)
  - Prof. Wen-mei Hwu (UIUC)
  - David Kirk and the CUDA team at NVIDIA
  - NIH support: P41-RR05969