Title: GPU Acceleration of Scientific Applications Using CUDA
1. GPU Acceleration of Scientific Applications Using CUDA
- John E. Stone
- Theoretical and Computational Biophysics Group
- NIH Resource for Macromolecular Modeling and Bioinformatics
- Beckman Institute for Advanced Science and Technology
- University of Illinois at Urbana-Champaign
2. What Speedups Can GPUs Achieve?
- Speedups of 8x to 30x are quite common
- Best speedups (100x!) are attained on codes that are skewed towards floating point arithmetic, esp. CPU-unfriendly operations that prevent use of SSE or vectorization
- Amdahl's Law can prevent legacy codes from achieving peak speedups with only shallow GPU acceleration efforts
3. Some GPU Speedup Examples (vs. SSE-vectorized CPU code)
- Fluorescence microscopy simulation: 12x
- Molecular dynamics
  - Non-bonded force calculation (no pairlist): 8x
  - Non-bonded force calculation (pairlist): 16x
- Electrostatics, ion placement
  - Direct Coulomb summation: 40-120x
  - Multilevel Coulomb summation (short-range lattice cutoff): 7x
4. How Difficult is CUDA Programming?
- Parallel algorithms are the hard part, not the programming API
- CUDA is as easy to learn as any other parallel programming API I've used
- Easy to mix with other parallel programming APIs (e.g. POSIX threads, MPI, etc.); a sketch of the one-thread-per-GPU pattern follows this list
- CUDA's fine-grained parallelism nicely complements the comparatively coarse-grained parallelism available in other APIs
- GPU hardware constraints present their own challenges
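Mixing CUDA with coarse-grained host parallelism typically looks like the pattern below: a minimal sketch (not from the slides; the kernel and variable names are hypothetical) in which one POSIX thread drives each GPU, bound to its device with cudaSetDevice().

    /* Sketch: one POSIX thread per GPU; names are hypothetical. */
    #include <pthread.h>
    #include <cuda_runtime.h>

    __global__ void dummy_kernel(float *out, int n) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) out[i] = 2.0f * i;          /* stand-in for real work */
    }

    static void *gpu_worker(void *arg) {
      int dev = *(int *) arg;
      cudaSetDevice(dev);                    /* bind this host thread to one GPU */
      int n = 1 << 20;
      float *d_out = NULL;
      cudaMalloc((void **) &d_out, n * sizeof(float));
      dummy_kernel<<<(n + 255) / 256, 256>>>(d_out, n);
      cudaDeviceSynchronize();
      cudaFree(d_out);
      return NULL;
    }

    int main(void) {
      int ndev = 0;
      cudaGetDeviceCount(&ndev);
      if (ndev > 16) ndev = 16;
      pthread_t tid[16];
      int devid[16];
      for (int i = 0; i < ndev; i++) {
        devid[i] = i;
        pthread_create(&tid[i], NULL, gpu_worker, &devid[i]);
      }
      for (int i = 0; i < ndev; i++)
        pthread_join(tid[i], NULL);
      return 0;
    }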
5. CUDA Class at Illinois
- ECE498: Programming Massively Parallel Processors
  - Wen-mei Hwu (ECE Professor, UIUC)
  - David Kirk (Chief Scientist, NVIDIA)
  - Several guest lecturers
- Attended by both students and interested researchers on campus
- Class projects are supported by research groups on campus
  - MRI Processing
  - Biomolecular Simulations
  - Scientific Visualization
  - Many more
- Class home page, lectures, MP3 audio:
  - http://courses.ece.uiuc.edu/ece498/al1/
6. An Approach to Writing CUDA Kernels
- Use algorithms that can expose substantial parallelism; you'll need thousands of threads
- Identify the ideal GPU memory system to use for kernel data for best performance
- Minimize host/GPU DMA transfers; use pinned memory buffers when appropriate (see the sketch after this list)
- Optimal kernels involve many trade-offs; these are easier to explore through experimentation with microbenchmarks based on key components of the real science code, without the baggage
- Analyze the real-world use cases and select the kernel(s) that best match, by size, parameters, etc.
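As a concrete illustration of the pinned-buffer point, here is a minimal sketch (assumed, not from the slides; the buffer and stream names are hypothetical) of staging data through page-locked host memory and copying it asynchronously:

    #include <cuda_runtime.h>

    void stage_input(const float *src, int n) {
      float *h_buf = NULL, *d_buf = NULL;
      cudaStream_t stream;
      cudaStreamCreate(&stream);
      /* Page-locked (pinned) host memory enables faster, asynchronous DMA. */
      cudaMallocHost((void **) &h_buf, n * sizeof(float));
      cudaMalloc((void **) &d_buf, n * sizeof(float));
      for (int i = 0; i < n; i++)
        h_buf[i] = src[i];
      /* The copy is queued on 'stream'; the CPU is free to do other work. */
      cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                      cudaMemcpyHostToDevice, stream);
      /* ... launch kernels that consume d_buf on the same stream ... */
      cudaStreamSynchronize(stream);
      cudaFree(d_buf);
      cudaFreeHost(h_buf);
      cudaStreamDestroy(stream);
    }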
7. Be Open-Minded
- Experienced programmers have a hard time getting used to the idea that GPUs can actually do arithmetic 100x faster than CPUs
- CPU-centric programming idioms are often frugal with arithmetic ops but cavalier with memory references/locality/register spills; GPU hardware prefers almost the opposite approach
- Pretend you've never written optimized code before and learn the GPU on its own terms; don't force it to run CPU-centric code
8. Potentially Beneficial Trade-offs
- Additional arithmetic for reduced memory references and lower register count
  - Example: CPU codes often precalculate values to reduce arithmetic. On the GPU, arithmetic is cheaper than memory accesses or register use
- Additional arithmetic/memory to avoid branching, and especially branch divergence
  - Example: pad input data to full block sizes rather than handling boundaries specially (see the sketch after this list)
- Additional arithmetic for more parallelism
  - Example: decreasing the computational tile size by forgoing loop optimizations that reduce redundant arithmetic yields better performance on very small datasets
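To make the padding trade-off concrete, the following minimal sketch (assumed illustration; the kernel and names are hypothetical) rounds a 2-D lattice up to a whole number of thread blocks so the kernel needs no boundary branches:

    #include <cuda_runtime.h>

    #define BLOCKSIZEX 16
    #define BLOCKSIZEY 16

    __global__ void clear_lattice(float *lattice, int padded_width) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      /* No bounds test: every thread maps to a valid (possibly padding) cell. */
      lattice[y * padded_width + x] = 0.0f;
    }

    void launch_padded(int width, int height) {
      /* Round dimensions up to full blocks; the extra cells are wasted work
         but avoid divergent boundary handling inside the kernel. */
      int pw = ((width  + BLOCKSIZEX - 1) / BLOCKSIZEX) * BLOCKSIZEX;
      int ph = ((height + BLOCKSIZEY - 1) / BLOCKSIZEY) * BLOCKSIZEY;
      float *d_lattice = NULL;
      cudaMalloc((void **) &d_lattice, pw * ph * sizeof(float));
      dim3 block(BLOCKSIZEX, BLOCKSIZEY);
      dim3 grid(pw / BLOCKSIZEX, ph / BLOCKSIZEY);
      clear_lattice<<<grid, block>>>(d_lattice, pw);
      cudaDeviceSynchronize();
      cudaFree(d_lattice);
    }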
9. Fluorescence Microscopy
- 2-D reaction-diffusion simulation used to predict results of fluorescence microphotolysis experiments
- Simulates 1-10 second microscopy experiments with 0.1 ms integration timesteps
- Goal: < 1 min per simulation on commodity PC hardware
- Project home page:
  - http://www.ks.uiuc.edu/Research/microscope/
10. Fluorescence Microscopy (2)
- Challenges for CPU
  - Efficient handling of boundary conditions
  - Large number of floating point operations per timestep
- Challenges for GPU/CUDA
  - Hiding global memory latency, improving memory access patterns, controlling register use
  - Few arithmetic operations per memory reference (for a GPU); see the sketch below
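The simulation's actual numerical scheme is not reproduced here; as a generic sketch of the kind of per-timestep work involved (and of its low arithmetic intensity), an explicit forward-Euler diffusion update on a 2-D map might look like this (all names and the scheme itself are assumptions):

    /* One explicit diffusion timestep: c_out = c_in + dt * D * Laplacian(c_in).
       Only a handful of arithmetic ops per 5 global memory reads. */
    __global__ void diffusion_step(const float *c_in, float *c_out,
                                   int w, int h, float D, float dt, float dx) {
      int x = blockIdx.x * blockDim.x + threadIdx.x;
      int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x <= 0 || y <= 0 || x >= w - 1 || y >= h - 1)
        return;                               /* boundaries handled separately */
      int i = y * w + x;
      float lap = (c_in[i - 1] + c_in[i + 1] + c_in[i - w] + c_in[i + w]
                   - 4.0f * c_in[i]) / (dx * dx);  /* 5-point Laplacian */
      c_out[i] = c_in[i] + dt * D * lap;
    }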
11. Fluorescence Microscopy (3)
- Simulation runtime and software development time
  - Original research code (CPU): 80 min
  - Optimized algorithm (CPU): 27 min
    - 40 hours of work
  - SSE-vectorized (CPU): 8 min
    - 20 hours of work
  - CUDA w/ 8800 GTX: 38 sec, 12 times faster than SSE!
    - 12 hours of work; it should be possible to improve further, but it is already fast enough for real use
- The CUDA code was more similar to the original than to the SSE-vectorized version: arithmetic is almost free on the GPU
12. Biomolecular Simulation Process
- Prepare model
  - Assemble structure
  - Add ions
  - Add solvent
- Perform molecular dynamics simulation
- Analyze simulation trajectories
- GPUs can accelerate many of the steps in this process
[Image: Satellite Tobacco Mosaic Virus]
13. Molecular Dynamics: Initial NAMD GPU Performance
- Full NAMD, not a test harness, so Amdahl's Law applies
- Useful performance boost
  - 8x speedup for nonbonded force calculation
  - 5x speedup overall w/o PME
  - 3.5x speedup overall w/ PME
- Plans for better performance
  - Overlap GPU and CPU work
  - Tune or port remaining work
    - PME, bonded, integration, etc.
[Chart: ApoA1 benchmark performance, 2.67 GHz Core 2 Quad Extreme vs. GeForce 8800 GTX]
14. Overview of the Ion Placement Process
- Calculate the initial electrostatic potential map around the simulated structure, considering the contributions of all atoms (the most costly step!)
- Ions are then placed one at a time (sketched below):
  - Find the voxel containing the minimum potential value
  - Add a new ion atom at the location of minimum potential
  - Add the potential contribution of the newly placed ion to the entire map
  - Repeat until the required number of ions have been added
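A minimal host-side sketch of that placement loop (the helper add_ion_contribution_to_map() is a hypothetical stand-in for the GPU-accelerated map update described on the next slide; ion bookkeeping is omitted):

    #include <float.h>

    void add_ion_contribution_to_map(float *map, int nvoxels, int ionvoxel);

    void place_ions(float *potentialmap, int nvoxels, int nions) {
      for (int ion = 0; ion < nions; ion++) {
        /* 1. find the voxel containing the minimum potential value */
        int minidx = 0;
        float minval = FLT_MAX;
        for (int v = 0; v < nvoxels; v++) {
          if (potentialmap[v] < minval) {
            minval = potentialmap[v];
            minidx = v;
          }
        }
        /* 2. a new ion is placed at the location of voxel 'minidx' */
        /* 3. add its potential contribution to the entire map
              (the GPU-accelerated, DCS-style update)               */
        add_ion_contribution_to_map(potentialmap, nvoxels, minidx);
      }
    }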
15. GPU-Accelerated Ion Placement: Electrostatic Potential Calculations
- Direct Coulomb Summation (DCS)
  - Brute-force arithmetic, no approximations, O(MN)
  - GPU 40-120x faster than CPU-SSE
  - Outperforms MCS for small to medium sized structures, and for ion placement map updates
  - Template for the inner loop of other grid-evaluated kernels (e.g. MCS)
- Multilevel Coulomb Summation (MCS)
  - Efficient hierarchical approximation, O(M+N)
  - GPU short-range lattice cutoff part 7x faster than CPU-SSE
  - Supports periodic boundary conditions
16. Runtime of Coulomb Summation Algorithms on CPU and GPU
17. Direct Coulomb Summation Algorithm
- At each lattice point, sum potential contributions from all atoms in the simulated structure:
- potential += charge[i] / (distance to atom[i])
[Figure: lattice point being evaluated, distance to atom[i]]
18. Comparison of Direct Coulomb Summation Kernels on CPU and GPU
19. DCS CUDA Block/Grid Decomposition (non-unrolled)
[Figure: grid of thread blocks over the potential map; thread blocks of 64-256 threads; each thread computes 1 potential; padding waste at the map edges]
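A reconstructed sketch (not copied from the slides) of what the non-unrolled kernel body could look like: one thread per lattice point, each summing potential += charge / distance over all atoms. The constant-memory atom array, with the squared z-distance precomputed in the .z field, is an assumption consistent with the unrolled loop shown on slide 21.

    #define MAXATOMS 4000
    __constant__ float4 atominfo[MAXATOMS];   /* x, y, precomputed dz*dz, charge */

    __global__ void cenergy(int numatoms, float gridspacing, float *energygrid) {
      unsigned int xindex = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
      unsigned int outaddr = gridDim.x * blockDim.x * yindex + xindex;
      float coorx = gridspacing * xindex;
      float coory = gridspacing * yindex;
      float energyval = 0.0f;
      for (int atomid = 0; atomid < numatoms; atomid++) {
        float dx = coorx - atominfo[atomid].x;
        float dy = coory - atominfo[atomid].y;
        /* potential += charge / distance */
        energyval += atominfo[atomid].w *
                     (1.0f / sqrtf(dx*dx + dy*dy + atominfo[atomid].z));
      }
      energygrid[outaddr] += energyval;
    }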
20. DCS CUDA Algorithm: Unrolling Loops
- Add each atom's contribution to several lattice points at a time, where the distances differ in only one component:
- potentialA += charge[i] / (distanceA to atom[i])
- potentialB += charge[i] / (distanceB to atom[i])
[Figure: distances from atom[i] to multiple lattice points]
21. DCS Loop Unrolling (CUDA-Unroll4x): Multiple Lattice Points Per Iteration
    for (atomid=0; atomid<numatoms; atomid++) {
      float dy = coory - atominfo[atomid].y;
      /* atominfo[].z presumably holds the precomputed squared z-distance */
      float dysqpdzsq = (dy * dy) + atominfo[atomid].z;
      float dx1 = coorx1 - atominfo[atomid].x;
      float dx2 = coorx2 - atominfo[atomid].x;
      float dx3 = coorx3 - atominfo[atomid].x;
      float dx4 = coorx4 - atominfo[atomid].x;
      energyvalx1 += atominfo[atomid].w * (1.0f / sqrtf(dx1*dx1 + dysqpdzsq));
      energyvalx2 += atominfo[atomid].w * (1.0f / sqrtf(dx2*dx2 + dysqpdzsq));
      energyvalx3 += atominfo[atomid].w * (1.0f / sqrtf(dx3*dx3 + dysqpdzsq));
      energyvalx4 += atominfo[atomid].w * (1.0f / sqrtf(dx4*dx4 + dysqpdzsq));
    }
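The index and coordinate setup surrounding that loop is not shown on the slide; below is a reconstructed sketch (an assumption, not the original code) of how the four lattice-point x-coordinates and the final stores might be arranged, spacing the points BLOCKSIZEX columns apart as in the coalesced variant on slide 22 so that each half-warp's global stores stay contiguous:

      /* Fragment of the kernel body around the loop above.
         Assumes BLOCKSIZEX == blockDim.x and that gridspacing and
         energygrid are kernel parameters. */
      unsigned int xindex = blockIdx.x * blockDim.x * 4 + threadIdx.x;
      unsigned int yindex = blockIdx.y * blockDim.y + threadIdx.y;
      unsigned int outaddr = gridDim.x * blockDim.x * 4 * yindex + xindex;

      float coory  = gridspacing * yindex;
      float coorx1 = gridspacing * xindex;
      float coorx2 = gridspacing * (xindex +     BLOCKSIZEX);
      float coorx3 = gridspacing * (xindex + 2 * BLOCKSIZEX);
      float coorx4 = gridspacing * (xindex + 3 * BLOCKSIZEX);

      float energyvalx1 = 0.0f, energyvalx2 = 0.0f;
      float energyvalx3 = 0.0f, energyvalx4 = 0.0f;

      /* ... the unrolled atom loop shown above runs here ... */

      energygrid[outaddr                 ] += energyvalx1;
      energygrid[outaddr +     BLOCKSIZEX] += energyvalx2;
      energygrid[outaddr + 2 * BLOCKSIZEX] += energyvalx3;
      energygrid[outaddr + 3 * BLOCKSIZEX] += energyvalx4;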
22. DCS CUDA Block/Grid Decomposition (unrolled, coalesced)
Unrolling increases computational tile size
[Figure: grid of thread blocks (0,0), (0,1), (1,0), (1,1); thread blocks of 64-256 threads; threads compute up to 8 potentials each, skipping by half-warps; padding waste at the map edges]
23. Questions?
[Photo: 8 min exposure, central Illinois]
24. References and Acknowledgements
- Additional Information and References
  - http://www.ks.uiuc.edu/Research/gpu/
  - http://www.ks.uiuc.edu/Research/vmd/
- Questions, source code requests
  - John Stone: johns_at_ks.uiuc.edu
- Acknowledgements
  - J. Phillips, P. Freddolino, D. Hardy, L. Trabuco, K. Schulten (UIUC TCB Group)
  - Prof. Wen-mei Hwu (UIUC)
  - David Kirk and the CUDA team at NVIDIA
  - NIH support: P41-RR05969