Title: Massively Parallel Computing and GPU Hardware
1. Massively Parallel Computing and GPU Hardware
2. Types of Parallel Computing
- Parallel on device (supercomputer / co-processors)
  - Cray built the first supercomputers. Economies of scale eventually priced out nearly all custom supercomputing solutions.
  - Modern supercomputers are most often many (thousands of) commodity processors in an extremely highly tuned cluster with specially built custom interconnects.
  - Famous supercomputers / co-processors include: Cray, the GRAPE board (GRAvity PipE), IBM Blue Gene
- Parallel not on device (cluster)
  - Types of clusters:
    - Beowulf clusters: TCP/IP-networked, (usually) identical computers built into a cluster designed specifically for supercomputing
    - Grid computing: distributed computing systems that are more loosely coupled, heterogeneous in nature, and geographically dispersed
  - Famous distributed projects include: Folding@home, SETI@home
3. Distributed Computing vs Parallel Computing
- OpenMP (Open Multi-Processing)
  - OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C/C++ and Fortran on many architectures.
  - Supercomputers / co-processors are typically shared-memory / distributed-shared-memory parallel multiprocessing platforms.
  - An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and MPI.
- MPI (Message Passing Interface)
  - MPI is a specification for an application programming interface (API) that allows many computers to communicate with one another.
  - Clusters are distributed-memory parallel multiprocessing platforms.
  - Both types of parallel computing offer unique programming and optimization challenges.
4. Pros and Cons of Clusters and Supercomputers
PRO
- $0: it costs nothing but a proposal to apply for time on a cluster / supercomputer.
- If properly programmed, they can do amazing amounts of PC-wall-clock work in a very short amount of time.
- Clusters are well suited to running the same operation on many discrete data sets.
- Supercomputers can be extremely effective at running massive simulations very quickly.
CON
- You need to write a proposal for cluster / supercomputer time. Writing a new proposal for each new set of simulations lacks flexibility for those who run complex code.
- It is extremely difficult to get the maximum potential out of clusters with MPI.
- Supercomputer time may be very difficult to come by.
- Supercomputers / large clusters are massively expensive to build.
- Clusters are not necessarily well suited to larger data sets which cannot be broken apart and processed separately.
5. On-Device Parallel Computing
CON
- On-device memory limitations will keep simulations necessarily small compared to the possibilities with a cluster or supercomputer.
- Cannot compete with the scale of clusters and supercomputers. The sheer amount of computing power provided by a massive cluster or supercomputer is unrivaled.
PRO
- Keeping calculations on the device rather than transferring data over some type of network saves massive amounts of compute time.
- The amount of programming required to take advantage of an on-board device is significantly less than with MPI/OpenMP.
- Much of the architectural nuance of a parallel device can be hidden in the compiler. Or, conversely, more optimization can occur at significantly less cost in terms of development time.
6. Why the GPU and Why Now?
- There are two primary answers to this question:
  1. Economies of scale
  2. Unified shader model / stream processing
- Economies of scale: where Cray and many other supercomputing solutions failed is in the realm of profit. GPUs are graphics cards for computer video game players and graphics professionals first, and for HPC second. Technological advances have come at a rapid rate due to the ever-increasing graphical demands of modern computer gaming. Because of the constricted budget of the average PC gamer as compared to a professional/scientist seeking an expensive solution to an HPC problem, graphics cards were necessarily produced at much higher volume with lower margins.
7. Classical Graphics Pipeline
Each stage of the graphics pipeline traditionally had dedicated shader processors for each operation. For example, a hypothetical traditional-pipeline video card could have:
- 8 vertex shader pipelines
- 24 pixel shader pipelines
- 24 texture filtering units
- 8 vertex texture addressing units
- 8 rendering output units
8. Unified Shader Model
- By unifying the pixel/vertex shaders into a large programmable shader core, there are no longer discrete/separate shader core types. Since the data is no longer flowing down the pipeline in sequential fashion, bottlenecks were removed for gaming/video processing.
- However, the advent of programmable shaders is important for the sheer fact that each shader is no longer forced to perform only graphics operations. The shaders can be given a kernel of our choice and can process information as the data is streamed through the device.
9. Sample Device Architecture
NVIDIA G80 architecture
10. NVIDIA / ATI / Intel GPU Solutions
- NVIDIA: GeForce (gaming/movie playback), Quadro (graphics), Tesla (HPC)
- ATI: Radeon (gaming/movie playback), FireStream (HPC)
- Intel: Larrabee (HPC/gaming)
11. ATI
Cons
- SDK is still in beta.
- SDK only supports Windows XP (ATI says Linux support will come within the next calendar year).
- Brook/CTM is more general than CUDA --> more code development is required to achieve similar functionality.
- CUDA is considered to be more fully featured and 1+ years ahead.
Pros
- Open-source driver.
- CAL (Compute Abstraction Layer) and CTM (Close To Metal) allow for a high degree of optimization.
- Slightly better price for non-FireStream GPUs (Radeon).

To run the CAL/Brook SDK, you need a platform based on the AMD R600 GPU or later. R600 and newer GPUs are found on ATI Radeon HD2400, HD2600, HD2900 and HD3800 graphics boards.
12. Intel
Tera-Scale
- 1st teraflops research chip, named Polaris.
- 80-core prototype developed. Achieved 1.01 teraflops of computing performance at 3.16 GHz and 65 W; later achieved 2 teraflops at 5.7 GHz and 265 W.
- In order to take advantage of the numerous cores of planned processors, Intel is developing Ct, a programming model that eases SIMD and multi-threading programming.
- No plans for mass production. Currently, Tera-Scale is just a research initiative.
Larrabee
- A GPU to compete with NVIDIA/ATI.
- Will run at 1.7 - 2.5 GHz.
- 16 to 24 in-order cores supporting 4 simultaneous threads of execution.
- TDP 150 - 300 W.
- Different from NVIDIA/ATI solutions in that it will not use a custom instruction set designed for graphics, but rather an extension of the x86 instruction set.
- Public release late '09 / '10.
13. NVIDIA
- CUDA SDK 2.0 is no longer beta.
- CUDA SDK supports 32-bit/64-bit Windows, Linux, OS X.
- CUDA driver supports 32-bit/64-bit Windows, Linux, OS X.
- Large selection of CUDA-enabled GPUs.
- Double precision support.
- Greater performance than ATI cards.
- CUDA adds functionality and ease of use as compared to Brook and CAL.
- Ready for deployment in the here and now:
  - Rocks 4.0/5.0 operating system has CUDA rolls ready.
  - Sun Grid Engine (SGE) already deployed.
- Decent documentation.
14. The Problem with CPU Architecture
- The central processing unit contains one to several processors which run serial operations extremely quickly in order to make tasks appear to run simultaneously. (Newer dual- and quad-core processors obviously operate with true simultaneity.)
- As can be seen in the block diagram, a significant portion of the transistors on the die are designated for flow control and data caching. These CPUs are smart but narrow. Of the 820 million transistors on a Core 2, only a small fraction are dedicated to ALUs.
- An arithmetic logic unit (ALU) is a digital circuit that performs arithmetic and logical operations. It is the basic building block of modern microprocessors.
15. Graphics Processing Units
- A large portion of the processor die is spent on ALUs: 80% of 1.4 billion transistors.
- Because the same function is executed on each element of data with high arithmetic intensity, there is a much-reduced requirement for flow control in the hardware.
- Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets such as arrays can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads.
16. Hardware Architecture
- Each multiprocessor has a Single Instruction, Multiple Data (SIMD) architecture: at any given clock cycle, each stream processor of the multiprocessor executes the same instruction, but operates on different data.
- On the GTX 280 there are 30 multiprocessors with 8 stream processors each (240 SPs total).
- On the 8800 GTX there are 16 multiprocessors with 8 stream processors each (128 SPs total).
17. Memory Management Model
- Access to registers is nearly instantaneous and read-write.
- Access to shared memory is very fast and read-write.
- Access to texture memory is very fast and read-only.
- Access to constant memory is very fast and read-only.
- Access to global memory is slow and read-write.
- The number of registers per multiprocessor is 8192.
- The amount of shared memory available per multiprocessor is 16 KB, organized into 16 banks.
- The total amount of constant memory is 64 KB.
- The cache working set for constant memory is 8 KB per multiprocessor.
- The cache working set for texture memory is 8 KB per multiprocessor.
- The total amount of global memory varies by card from 256 MB to 4 GB.
- The maximum number of active blocks per multiprocessor is 8.
- The maximum number of active threads per multiprocessor is 768.
18.
- A function that is compiled to the instruction set of the device results in a program, called a kernel, which is downloaded to the device.
- As stated before, each multiprocessor is capable of processing a maximum of 768 active threads at a time, and each block can contain at most 512 threads.
- The actual number of threads used per block depends on the complexity of the calculation. The more sophisticated the calculation, the more registers are required. The more registers required, the fewer instances of the thread that can be run simultaneously.
19. CPU vs GPU Performance per Watt
- Intel Xeon/Core 2 processors (Core microarchitecture, x86 ISA):
  - Operating speed: 2.0 - 3.0 GHz
  - Core 2/Xeon architecture: 4 flops/cycle single precision, 2 flops/cycle double precision (only if optimized with SSE); 1 flop/cycle single/double precision non-SSE
  - TDP: 50 - 150 W
  - Number of cores: 2 - 4
- For a 2.83 GHz Intel quad-core (95 W) processor running SSE-optimized code:
  - 4 cores x 2.83 GHz x 4 flops = 45.2 GFLOPS theoretical peak (single precision); 22.6 GFLOPS theoretical peak (double precision)
  - 45.2 GFLOPS / 95 W = 475 MFLOPS/W (sp); 238 MFLOPS/W (dp)
- For a 3.0 GHz Intel Core 2 (65 W) processor with no multithreading / SSE:
  - 1 core x 3.00 GHz x 1 flop = 3.0 GFLOPS (single/double precision)
  - 3.0 GFLOPS / 65 W = 46 MFLOPS/W (single/double precision)
20. CPU vs GPU Performance per Watt
- NVIDIA 8800 GTX / 9800 GTX (G80, G92 architecture respectively):
  - Shader clock freq.: 1.35 / 1.688 GHz
  - G80/G92 architecture dual-issue MAD+MUL --> 3 flops/cycle single precision
  - TDP: 145 / 156 W (maximum 225 W)
  - Number of stream processors: 128
  - 8800 GTX: 518.4 GFLOPS (single precision), 3.575 GFLOPS/W
  - 9800 GTX: 648.2 GFLOPS (single precision), 4.155 GFLOPS/W
- NVIDIA GTX 280 (GT200 architecture):
  - Shader clock freq.: 1.296 GHz
  - Dual-issue MAD+MUL --> 3 flops/cycle single precision
  - TDP: 236 W (maximum 300 W)
  - Number of stream processors: 240
  - GTX 280: 933.1 GFLOPS (single precision), 3.95 GFLOPS/W; > 90 GFLOPS (double precision), 380 MFLOPS/W
21. CPU vs GPU Performance per Watt
- Typical CPU programming would result in 3 GFLOPS of single/double precision performance.
- Multi-threaded and Streaming SIMD Extensions (SSE) enhanced code could perform at 50 and 25 GFLOPS for single and double precision respectively.
- Fully enhanced GPU code could perform at 1 TFLOPS and 100 GFLOPS for single and double precision respectively.
- Typically, simply mapping CPU code to GPU programming with little or no memory management would result in at least a 10x performance boost. This corresponds to:
  - a minimum 20x speedup in single precision over fully enhanced SSE code
  - a maximum 350x speedup in single precision over regular code
  - a minimum 4x speedup in double precision over fully enhanced SSE code
  - a maximum 30x speedup in double precision over regular code
22. CPU vs GPU Price vs Performance
- GeForce GTX 280 (236 W): 933 GFLOPS single precision, >90 GFLOPS double precision, $649.99 (Newegg) --> 1.44 GFLOPS/$ sp, 0.14 GFLOPS/$ dp
- GeForce 9800 GX2 (197 W): 1.15 TFLOPS single precision, $449.99 (Newegg) --> 2.56 GFLOPS/$ sp
- Intel Core 2 Quad Yorkfield 2.83 GHz (95 W): 45.2 GFLOPS single precision, 22.6 GFLOPS double precision, $559.99 (Newegg) --> 0.08 GFLOPS/$ sp, 0.04 GFLOPS/$ dp
- Intel Core 2 Duo E7200 Wolfdale 2.53 GHz (65 W): 20.2 GFLOPS single precision, 10.1 GFLOPS double precision, $129.99 --> 0.16 GFLOPS/$ sp, 0.08 GFLOPS/$ dp

Even when we consider the ultimate metric of performance per watt per dollar, the GPU comes out ahead. The wattage use is roughly double or triple for the GPUs as compared to the CPUs; however, they offer orders of magnitude better performance per dollar.
23. Drawbacks
- Accuracy: single precision; double precision is available but at a significant performance penalty.
- Portability of code: CUDA vs Brook. CUDA and Brook share a common heritage in that CUDA is partly based on Brook. Code is not directly portable; however, programming concepts between the two implementations are similar.
- Future-proofing code: increasing register space could result in different kernels being useful. Increasing amounts of shared memory would allow different-sized / more data sets to be handled.
24. Brook/CUDA
- At this point, you basically have three options: CUDA, BrookGPU, or PeakStream.
- Brook has backends for CTM, D3D9, and a generic CPU backend, so it'll run on pretty much anything. However, there's not a huge amount of documentation (compared to the other two options).
- CUDA is G8x/G9x/G2xx-only and is similar to Brook -- it was primarily designed by Ian Buck, who is also one of the principal authors of Brook. It's kind of a superset of Brook, as it exposes more functionality than can be exposed in Brook in a cross-platform way, and has very similar syntax. It has some libraries included, like an FFT (Fast Fourier Transform) and a BLAS (Basic Linear Algebra Subprograms) implementation. CUDA tries to hide GPU implementation details except for fundamentals (warps, blocks, grids, etc.); however, it retains three different memory regions, which can initially be confusing.
- PeakStream is relatively new and is commercial, but it's the only way (besides writing D3D9 HLSL and compiling it) to write CTM in a high-level language. Its syntax model is somewhat like OpenMP's, where particular code regions are marked to be executed on the GPU. Dispatching programs to the GPU is usually handled automatically, and the memory management model is considered somewhat more straightforward than CUDA's. Like CUDA, it has FFT and BLAS libraries included. Right now, it only compiles to CTM or to a CPU backend, but that will most likely not be the case in the future. Even though it is commercial, PeakStream is a good cross-platform API that seems to be pretty sensibly designed and doesn't require programmers to learn totally new programming paradigms. It's also got good documentation.
25. How to Build Your Own Cluster
- Useful builder tips
- Troubleshooting
26. Case and Power Supply Unit
Case
- An ATX form-factor case is required to fit a typical motherboard. Motherboards are typically ATX, but other form factors exist; your case only truly needs to match the motherboard type, whatever that may be. Practically speaking, though, you will most likely end up with an ATX motherboard and will need an ATX case.
- Check to make sure your case comes with at least 1 fan to help evacuate heat.
Power Supply Unit
- Most high-end NVIDIA GPUs suggest 550 or 600 W power supplies --> we recommend 700 - 850 W PSUs.
- (More) reliable brands: Seasonic, Cooler Master, BFG, Antec.
- YOU GET WHAT YOU PAY FOR --> do not skimp on your PSU. They WILL burn out under heavy use if they are cheap / underpowered.
27. Motherboards
- PCI-Express 2.0 16x slots are required for 2xx- and 9xxx-series NVIDIA GPUs (2xx-series cards support double precision).
- You need at least 1 PCI-Express 16x (PCI-e 16x) slot for the GPU.
- Intel socket type --> LGA 775 (mostly Core 2 Duo/Quad), LGA 771 (Xeon).
- AMD socket type --> AM2, AM2+ (mostly Athlon/Phenom), 940, F (Opteron).
- The socket type must match the motherboard, otherwise the pins on the bottom of the processor will not line up.
- Make sure that it has dual integrated networking ports.
- Most come with 4 DIMM slots.
- Most motherboards can support between 8 and 32 GB of system memory.
Hard Drives
- Make sure your hard drive is SATA 1.5 Gb/s or SATA II 3.0 Gb/s. Many newer motherboards have limited or nonexistent IDE ATA 100/133 support.
- This drive does not need to be very large at all if you plan on using NAS to provide additional hard disk storage space.
28. Processor
- AMD Athlon/Phenom and Intel Core 2 Duo/Quad processors are all excellent choices. Intel offers faster processors at the top end, but the middle-range processors are all very competent.
- Server processors do offer superior reliability; however, they do not (typically) increase performance. They are also more expensive and harder to find (Xeons in particular).
Memory
- The jury is still out on the performance of DDR3 memory, although it may be easier to find a motherboard that supports DDR3 memory with high FSB speeds. This warning is only to tell you that DDR2 memory performance is not poor.
- Buy a minimum of 4 GB in 2 x 2 GB sticks. This will allow for upgrades later on (if required).
- The major key here is to make sure that you buy RAM that performs at the maximum allowable speed your motherboard supports. CPU front-side bus speeds often exceed that of your RAM. Buying RAM that is faster than your motherboard can support will yield no practical benefits.
- For example, most Core 2s have an 800, 1066 or 1333 MHz front-side bus. This corresponds to memory support of DDR2 800/1066 RAM or DDR3 1066/1333 RAM in most high-end motherboards.
29. Troubleshooting / Testing Your Build
- Run MemTest86 for 24 hrs.
- Be sure that there is good contact between your CPU and heatsink.
- If you were required to apply thermal paste, do not use too much -- only enough to cover thinly. Most suggest a drop about the size of a pea. Do not allow large amounts of dust particulate to touch the thermal paste.
- Make sure there is enough space in your PC for good ventilation and air flow.
- Make sure you tie down your cables so that they do not get jammed in active fans.
- Make sure additional 6-pin and 8-pin power connectors are attached to your GPU (or else it will run at very low efficiency).
- If your system memory is dual-channel, make sure that the pairs are plugged in and correspond to the appropriate channels on your motherboard (they are usually color-coordinated or slightly different heights).
- The motherboard manual will provide all necessary information on where fans can be plugged in, how the power supply connects to the motherboard, and how (if any) jumpers need to be attached. LED connections are also labeled.
30. CUDA-Enabled GPUs
- http://www.nvidia.com/object/cuda_learn_products.html
- 8xxx, 9xxx, and 2xx series GeForce GPUs
- Tesla C/S/D 870 (old), C/S/D 1060/1070 (new, fall '08)
- NVIDIA Quadro FX/NVS series (nearly all mobile versions supported as well)
- 2xx-series GPUs add double precision support.
- The card is considerably better if the second number is higher, i.e. 8800 GT >> 8600 GT.
- GTX Ultra > GTX > GTS > GT >> GS > G (an O on anything means overclocked, I think).
- GX2 is literally two boards fused together: roughly 2x a single card (slightly less).
- Some 8xxx and all 9xxx / 2xx series cards are PCI-e 2.0 16x compatible. Some 8xxx cards only support PCI-e 1.0 16x; however, these cards will work on a motherboard which supports PCI-Express 2.0.
31. Info on Cluster OS
- ROCKS 5.0 (based on CentOS)
- Sun Grid Engine (SGE)
- You may need to roll your own installation depending on what features you desire. However, it is as simple as selecting the CUDA roll during a ROCKS install to get GPU job scheduling support.
32. Future
- OpenCL, the Open Computing Language (Apple)
- Intel Ct
- IBM / HP
- PeakStream --> a unified language?
33. Sources
- Nvidia CUDA Programming Guide 2.0
- Nvidia Tesla Technical Brief
- Nvidia GeForce 8800 GPU Architecture Overview
- Nvidia CUDA SDK 2.0
- Nvidia Compiler Documentation
- GPGPU.org
- Intel whitepapers
- www.nvidia.com
- www.ati.com
- www.intel.com
- www.arstechnica.com
- www.cnet.com