Massively Parallel Computing and GPU Hardware

1
Massively Parallel Computing and GPU Hardware
2
Types of Parallel Computing
  • Parallel on device (supercomputer /
    co-processors)
  • Cray built the first supercomputers. Economies
    of scale eventually priced out nearly all custom
    supercomputing solutions.
  • Modern supercomputers are most often many
    (thousands of) commodity processors in an
    extremely highly tuned cluster with purpose-built
    custom interconnects.
  • Famous supercomputers / co-processors include:
    Cray, the GRAPE board (GRAvity PipE), and IBM
    Blue Gene.
  • Parallel not on device (cluster)
  • Types of clusters: Beowulf clusters -
    TCP/IP-networked, (usually) identical computers
    built into a cluster designed specifically for
    supercomputing; grid computing - distributed
    computing systems that are more loosely coupled,
    heterogeneous in nature, and geographically
    dispersed.
  • Famous distributed projects include:
    Folding@home and SETI@home.

3
Distributed Computing vs Parallel Computing
  • OpenMP (Open Multi-Processing)
  • OpenMP is an application programming interface
    (API) that supports multi-platform shared-memory
    multiprocessing programming in C/C++ and Fortran
    on many architectures.
  • Supercomputers / co-processors are typically
    shared-memory / distributed-shared-memory
    parallel multiprocessing platforms.
  • An application built with the hybrid model of
    parallel programming can run on a computer
    cluster using both OpenMP and MPI.
  • MPI (Message Passing Interface)
  • MPI is a specification for an application
    programming interface (API) that allows many
    computers to communicate with one another.
  • Clusters are distributed-memory parallel
    multiprocessing platforms. Both types of
    parallel computing offer unique programming and
    optimization challenges.

4
Pros and Cons of clusters and supercomputers
PRO
  • $0 to write a proposal for time on a cluster /
    supercomputer.
  • If properly programmed, can do amazing amounts of
    work in very little wall-clock time.
  • Clusters are well suited to running the same
    operation on many discrete data sets.
  • Supercomputers can be extremely effective at
    running massive simulations very quickly.
CON
  • Need to write a proposal for cluster /
    supercomputer time. Writing a new proposal for
    each new set of simulations lacks flexibility for
    those who run complex code.
  • Extremely difficult to get the maximum potential
    out of clusters with MPI.
  • Supercomputer time may be very difficult to come
    by.
  • Supercomputers / large clusters are massively
    expensive to build.
  • Clusters are not necessarily well suited to
    larger data sets which cannot be broken apart
    and processed separately.

5
On-device Parallel computing
PRO
Keeping calculations on the device rather than
transferring data over some type of network saves
massive amounts of compute time. The amount of
programming required to take advantage of an
on-board device is significantly less than with
MPI/OpenMP. Many of the architectural nuances of
a parallel device can be hidden by the compiler;
or, conversely, more optimization can occur at
significantly less cost in terms of development
time.
CON
On-device memory limitations will keep simulations
necessarily small compared to what is possible
with a cluster or supercomputer. It cannot compete
with the scale of clusters and supercomputers: the
sheer amount of computing power provided by a
massive cluster or supercomputer is unrivaled.
6
Why the GPU and why now?
  • There are two primary answers to this question
  • 1.) Economies of Scale
  • 2.) Unified Shader Model / Stream Processing

Economies of scale - Where Cray and many other
supercomputing solutions failed is in the realm
of profit. GPUs are graphics cards for computer
video game players and graphics professionals
first, and for HPC second. Technological advances
have come at a rapid rate due to the ever
increasing graphical demands of modern computer
gaming. Because the average PC gamer has a far
more constricted budget than a professional or
scientist seeking an expensive solution to an HPC
problem, graphics cards are necessarily produced
at much higher volume with lower margins.
7
Classical Graphics Pipeline
Each stage of the graphics pipeline traditionally
had dedicated shader processors for each
operation. For example, a hypothetical
traditional-pipeline video card could have:
8 vertex shader pipelines, 24 pixel shader
pipelines, 24 texture filtering units, 8 vertex
texture addressing units, and 8 rendering output
units.
8
Unified Shader Model
  • By unifying the pixel/vertex shaders into a large
    programmable shader core, there are no longer
    discrete/separate shader core types. Since the
    data is no longer flowing down the pipeline in
    sequential fashion, bottlenecks were removed for
    gaming/video processing.
  • However, the advent of programmable shaders is
    important for the sheer fact that each shader is
    no longer forced to perform only graphics
    operations. They can be given a kernel of our
    choice and can process information as the data is
    streamed through the device.

9
Sample Device Architecture
NVIDIA G80 architecture
10
NVIDIA / ATI / Intel GPU Solutions
  • NVIDIA - GeForce (gaming/movie playback),
    Quadro (professional graphics), Tesla (HPC)
  • ATI - Radeon (gaming/movie playback),
    FireStream (HPC)
  • Intel - Larrabee (HPC/gaming)

11
ATI
Pros
  • Open-source driver. CAL (Compute Abstraction
    Layer) and CTM (Close To Metal) allow for a
    high degree of optimization.
  • Slightly better price for non-FireStream GPUs
    (Radeon).
Cons
  • SDK still in beta.
  • SDK only supports Windows XP (ATI says Linux
    support will come within the next calendar year).
  • Brook/CTM are more general than CUDA ---> more
    code development is required to achieve similar
    functionality.
  • CUDA is considered to be more fully featured and
    a year or more ahead.
Note: to run the CAL/Brook SDK, you need a platform
based on the AMD R600 GPU or later. R600 and newer
GPUs are found on ATI Radeon HD2400, HD2600, HD2900
and HD3800 graphics boards.
12
Intel
Tera-Scale
  • First teraflops research chip, named Polaris: an
    80-core prototype that achieved 1.01 teraflops of
    computing performance at 3.16 GHz and 65 W, and
    later 2 teraflops at 5.7 GHz and 265 W.
  • In order to take advantage of the numerous cores
    of planned processors, Intel is developing Ct, a
    programming model that eases SIMD and
    multi-threading programming.
  • No plans for mass production; currently,
    Tera-Scale is just a research initiative.
Larrabee
  • GPU to compete with NVIDIA/ATI.
  • Will run at 1.7 - 2.5 GHz, with 16 to 24 in-order
    cores supporting 4 simultaneous threads of
    execution; TDP 150 - 300 W.
  • Different from NVIDIA/ATI solutions in that it
    will not use a custom instruction set designed
    for graphics, but rather an extension of the x86
    instruction set.
  • Public release late '09 / '10.
13
NVIDIA
  • CUDA SDK 2.0 is no longer beta.
  • The CUDA SDK supports 32-bit/64-bit Windows,
    Linux, and OS X; the CUDA driver supports
    32-bit/64-bit Windows, Linux, and OS X.
  • Large selection of CUDA-enabled GPUs.
  • Double precision support.
  • Greater performance than ATI cards.
  • CUDA adds functionality and ease of use as
    compared to Brook and CAL.
  • Ready for deployment in the here and now: the
    Rocks 4.0/5.0 operating system has CUDA rolls
    ready, and Sun Grid Engine (SGE) is already
    deployed.
  • Decent documentation.
14
The Problem with CPU Architecture
  • The central processing unit contains one to
    several cores which run serial operations
    extremely quickly in order to make tasks appear
    to run simultaneously. (Newer dual- and quad-core
    processors obviously operate with true
    simultaneity.)
  • As can be seen in the block diagram, a
    significant portion of the transistors on the die
    are designated for flow control and data caching.
    These CPUs are smart but narrow: of the 820
    million transistors on a Core 2, only a small
    fraction are dedicated to ALUs.

An arithmetic logic unit (ALU) is a digital
circuit that performs arithmetic and logical
operations. It is the basic building block of
modern microprocessors.
15
Graphics Processing Units
  • A large portion of the processor die is spent on
    ALUs: roughly 80% of 1.4 billion transistors.
  • Because the same function is executed on each
    element of data with high arithmetic intensity,
    there is a much reduced requirement for flow
    control in the hardware.
  • Data-parallel processing maps data elements to
    parallel processing threads. Many applications
    that process large data sets such as arrays can
    use a data-parallel programming model to speed up
    the computations. In 3D rendering, large sets of
    pixels and vertices are mapped to parallel
    threads (a kernel sketch of this mapping follows
    below).

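As a concrete illustration of mapping data elements to
threads, below is a minimal CUDA kernel sketch (the kernel
name, parameters, and the scaling operation are
illustrative, not taken from the slides); each thread
handles exactly one array element. A matching host-side
launch is sketched after slide 18.

// Device side of a data-parallel operation: one thread per array element.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global index of this thread
    if (i < n)                                       // the last block may overrun the array
        data[i] *= factor;
}
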
16
Hardware Architecture
  • Each multiprocessor has a Single Instruction,
    Multiple Data (SIMD) architecture: at any given
    clock cycle, each stream processor of the
    multiprocessor executes the same instruction, but
    operates on different data.
  • On the GTX 280 there are 30 multiprocessors with
    8 stream processors each (240 SPs total).
  • On the 8800 GTX there are 16 multiprocessors with
    8 stream processors each (128 SPs total). These
    figures can be queried at run time, as sketched
    below.

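These per-device figures can be checked at run time; the
following is a minimal sketch using the CUDA runtime call
cudaGetDeviceProperties (the field names are real
cudaDeviceProp members; the values printed depend on the
installed card, and several of them are the limits quoted
on the next slide).

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);          // properties of device 0

    printf("Multiprocessors       : %d\n", prop.multiProcessorCount);
    printf("Warp size (SIMD width): %d\n", prop.warpSize);
    printf("Registers per block   : %d\n", prop.regsPerBlock);
    printf("Shared mem per block  : %zu bytes\n", prop.sharedMemPerBlock);
    printf("Constant memory       : %zu bytes\n", prop.totalConstMem);
    printf("Global memory         : %zu bytes\n", prop.totalGlobalMem);
    printf("Max threads per block : %d\n", prop.maxThreadsPerBlock);
    return 0;
}
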
17
Memory Management Model
  • Access to registers is nearly instantaneous and
    read-write
  • Access to Shared Memory is very fast and
    read-write.
  • Access to Texture Memory is very fast and
    read-only.
  • Access to Constant Memory is very fast and
    read-only.
  • Access to Global Memory is slow and read-write.
  • The number of registers per multiprocessor is
    8192.
  • The amount of shared memory available per
    multiprocessor is 16 KB, organized into 16 banks.
  • The total amount of constant memory is 64 KB.
  • The cache working set for constant memory is 8 KB
    per multiprocessor.
  • The cache working set for texture memory is 8 KB
    per multiprocessor.
  • The total amount of global memory varies by card
    from 256 MB to 4 GB.
  • The maximum number of active blocks per
    multiprocessor is 8.
  • The maximum number of active threads per
    multiprocessor is 768.

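A minimal sketch, assuming illustrative names (applyCoeffs,
coeffs, tile) and a 256-element problem size, of how these
memory spaces appear in CUDA C; texture memory is omitted,
and local variables such as the index below are held in
registers by the compiler.

#include <cuda_runtime.h>

__constant__ float coeffs[256];   // constant memory: cached, read-only inside kernels

__global__ void applyCoeffs(const float *in, float *out, int n)
{
    __shared__ float tile[256];   // shared memory: fast, read-write, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // index value lives in a register
    if (i < n) {
        tile[threadIdx.x] = in[i];                          // read from slow global memory
        out[i] = tile[threadIdx.x] * coeffs[threadIdx.x];   // write back to global memory
    }
}

int main()
{
    const int n = 256;
    float h_coeffs[256];
    for (int k = 0; k < 256; ++k) h_coeffs[k] = 1.0f;
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));  // fill constant memory

    float *d_in, *d_out;                                     // global memory allocations
    cudaMalloc((void **)&d_in,  n * sizeof(float));
    cudaMalloc((void **)&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));

    applyCoeffs<<<1, 256>>>(d_in, d_out, n);                 // one block of 256 threads
    cudaDeviceSynchronize();
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
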
18
  • A function that is compiled to the instruction
    set of the device results in a program, called a
    kernel, which is downloaded to the device.
  • Each thread block can contain a maximum of 512
    threads, and each multiprocessor (as noted on the
    previous slide) can keep up to 768 threads active
    at a time.
  • The actual number of threads used per block
    depends on the complexity of the calculation.
    The more sophisticated the calculation, the more
    registers are required; the more registers
    required, the fewer instances of the thread can
    be run simultaneously (a host-side launch sketch
    follows below).

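A hedged sketch of the host side that launches the kernel
from the slide 15 example: 256 threads per block is an
illustrative starting point (the hard limit here is 512 per
block), and a register-heavy kernel may have to use 128 or
fewer; nvcc's -Xptxas -v option reports per-kernel register
usage.

#include <cuda_runtime.h>

// Same data-parallel kernel as in the slide 15 sketch.
__global__ void scaleArray(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 20;                             // one million elements (illustrative)
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));   // global memory on the device
    cudaMemset(d_data, 0, n * sizeof(float));

    int threadsPerBlock = 256;                         // must not exceed 512 on this hardware
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;   // round up
    scaleArray<<<blocksPerGrid, threadsPerBlock>>>(d_data, 2.0f, n);

    cudaDeviceSynchronize();                           // wait for the kernel to finish
    cudaFree(d_data);
    return 0;
}
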
19
CPU vs GPU Performance per Watt
  • Intel Xeon/Core2 processors (Core
    microarchitecture, x86 ISA): operating speed
    2.0 - 3.0 GHz; 4 flops single precision and
    2 flops double precision per cycle (only true
    if optimized with SSE); 1 flop single/double
    precision per cycle non-SSE; TDP 50 - 150 W;
    number of cores 2 - 4.
  • For a 2.83 GHz Intel Quad Core (95 W) processor
    running SSE-optimized code: 4 cores x 2.83 GHz x
    4 flops = 45.2 GFLOPS theoretical peak (single
    precision), 22.6 GFLOPS theoretical peak (double
    precision); 45.2 GFLOPS / 95 W = 475 MFLOPS/W
    (sp), 238 MFLOPS/W (dp).
  • For a 3.0 GHz Intel Core 2 (65 W) processor with
    no multithreading / SSE: 1 core x 3.00 GHz x
    1 flop = 3.0 GFLOPS (single/double precision);
    3.0 GFLOPS / 65 W = 46 MFLOPS/W (single/double
    precision).

20
CPU vs GPU Performance per Watt
  • Nvidia 8800 GTX / 9800 GTX (G80 / G92
    architecture respectively): shader clock
    frequency 1.35 / 1.688 GHz; dual-issue MAD/MUL
    ---> 3 flops per cycle single precision; TDP
    145 / 156 W (maximum 225 W); 128 stream
    processors. 8800 GTX: 518.4 GFLOPS (single
    precision) = 3.575 GFLOPS/W. 9800 GTX: 648.2
    GFLOPS (single precision) = 4.155 GFLOPS/W.
  • Nvidia GTX 280 (GT200 architecture): shader clock
    frequency 1.296 GHz; dual-issue MAD/MUL --->
    3 flops per cycle single precision; TDP 236 W
    (maximum 300 W); 240 stream processors. GTX 280:
    933.1 GFLOPS (single precision) = 3.95 GFLOPS/W;
    > 90 GFLOPS (double precision) = 380 MFLOPS/W.

21
CPU vs GPU Performance per Watt
  • Typical CPU programming would result in 3 GFLOPS
    of single/double precision performance.
  • Multi-threaded and Streaming SIMD Extension
    (SSE) enhanced code could perform at 50 and 25
    GFLOPS for single and double precision
    respectively.
  • Fully enhanced GPU code could perform at roughly
    1 TFLOPS and 100 GFLOPS for single and double
    precision respectively.
  • Typically, simply mapping CPU code to a GPU
    program with little or no memory management
    would result in at least a 10x performance boost.

This corresponds to: a minimum 20x speed-up in
single precision over fully enhanced SSE code; a
maximum 350x speed-up in single precision over
regular code; a minimum 4x speed-up in double
precision over fully enhanced SSE code; and a
maximum 30x speed-up in double precision over
regular code.
22
CPU vs GPU Price vs Performance
  • GeForce GTX 280 (236 W): 933 GFLOPS single
    precision, > 90 GFLOPS double precision, $649.99
    (newegg)
  • GeForce 9800 GX2 (197 W): 1.15 TFLOPS single
    precision, $449.99 (newegg)
  • Intel Core 2 Quad Yorkfield 2.83 GHz (95 W): 45.2
    GFLOPS single precision, 22.6 GFLOPS double
    precision, $559.99 (newegg)
  • Intel Core 2 Duo E7200 Wolfdale 2.53 GHz (65 W):
    20.2 GFLOPS single precision, 10.1 GFLOPS double
    precision, $129.99

GFLOPS per dollar:
GTX 280: 1.44 (sp) / 0.14 (dp)
9800 GX2: 2.56 (sp)
Core 2 Quad: 0.08 (sp) / 0.04 (dp)
Core 2 Duo: 0.16 (sp) / 0.08 (dp)

Even when we consider the ultimate metric of
performance per watt and per dollar, the GPU comes
out ahead. The GPUs draw roughly double or triple
the wattage of the CPUs, yet they offer more than
an order of magnitude better performance per
dollar.
23
Drawbacks
  • Accuracy - single precision; double precision is
    available, but at a significant performance
    penalty.
  • Portability of code - CUDA vs Brook: CUDA and
    Brook share a common heritage in that CUDA is
    partly based on Brook. Code is not directly
    portable; however, programming concepts between
    the two implementations are similar.
  • Future-proofing code - Increasing register space
    could result in different kernels being useful;
    increasing amounts of shared memory would allow
    different-sized / more data sets to be handled.

24
Brook/CUDA
  • At this point, you basically have three options:
    CUDA, BrookGPU, or PeakStream.
  • Brook has backends for CTM, D3D9, and a generic
    CPU backend, so it'll run on pretty much
    anything. However, there's not a huge amount of
    documentation (compared to the other two
    options).
  • CUDA is G8x/G9x/G2xx-only and is similar to
    Brook--it was primarily designed by Ian Buck, who
    was also one of the principal authors of Brook.
    It's kind of a superset of Brook, as it exposes
    more functionality than can be exposed in Brook
    in a cross-platform way, and has very similar
    syntax. It has some libraries included, like an
    FFT (Fast Fourier Transform) and a BLAS (Basic
    Linear Algebra Subprograms) implementation (a
    usage sketch follows after this list). CUDA
    tries to hide GPU implementation details except
    for fundamentals (warps, blocks, grids, etc.);
    however, it retains three different memory
    regions, which can initially be confusing.
  • PS is relatively new and is commercial, but it's
    the only way (besides writing D3D9 HLSL and
    compiling it) to write CTM in a high-level
    language. Its syntax model is somewhat like
    OpenMP, where particular code regions are marked
    to be executed on the GPU. Dispatching programs
    to the GPU is usually handled automatically, and
    the memory management model is considered
    somewhat more straightforward than CUDA's. Like
    CUDA, it has FFT and BLAS libraries included.
    Right now, it only compiles to CTM or to a CPU
    backend, but that will most likely not be the
    case in the future. Even though it is commercial,
    PS is a good cross-platform API that seems to be
    pretty sensibly designed and doesn't require
    programmers to learn totally new programming
    paradigms. It's also got good documentation.

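As an example of the bundled libraries mentioned above,
here is a hedged sketch of a 1-D complex-to-complex
transform using the CUFFT library shipped with CUDA (the
transform length and in-place usage are illustrative;
error checking is omitted, and the program links against
-lcufft).

#include <cufft.h>
#include <cuda_runtime.h>

int main()
{
    const int N = 1024;                                      // transform length (illustrative)
    cufftComplex *d_signal;
    cudaMalloc((void **)&d_signal, N * sizeof(cufftComplex));
    cudaMemset(d_signal, 0, N * sizeof(cufftComplex));       // real input would be copied in here

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);                     // plan a single 1-D C2C transform
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);   // in-place forward FFT
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}
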
25
How to build your own cluster
  • Useful builder tips
  • Troubleshooting

26
Case
An ATX form-factor case is required to fit a
typical motherboard. Motherboards are typically
ATX, but other form factors exist; your case only
truly needs to match the motherboard type,
whatever that may be. Check to make sure your
case comes with at least one fan to help evacuate
heat. Practically speaking, though, you will most
likely end up with an ATX motherboard and will
need an ATX case.

Power Supply Unit
  • Most high-end NVIDIA GPUs suggest 550 or 600 W
    power supplies ---> we recommend 700 - 850 W
    PSUs.
  • (More) reliable brands: Seasonic, Cooler Master,
    BFG, Antec.
  • YOU GET WHAT YOU PAY FOR ---> do not skimp on
    your PSU. Cheap / underpowered PSUs WILL burn
    out under heavy use.

27
Motherboards
  • PCI-Express 2.0 16x slots are required for 2xx
    and 9xxx series NVIDIA GPUs (2xx series cards
    support double precision).
  • Need at least one PCI-Express 16x (PCI-e 16x)
    slot for the GPU.
  • Intel socket types ---> LGA 775 (mostly Core 2
    Duo/Quad), LGA 771 (Xeon).
  • AMD socket types ---> AM2, AM2+ (mostly
    Athlon/Phenom), 940, F (Opteron).
  • The socket type must match the motherboard,
    otherwise the pins on the bottom of the processor
    will not line up.
  • Make sure that it has dual integrated networking
    ports.
  • Most come with 4 DIMM slots.
  • Most motherboards can support between 8 and 32 GB
    of system memory.

Hard Drives
Make sure your hard drive is SATA 1.5 Gb/s or
SATA II 3.0 Gb/s. Many newer motherboards have
limited or nonexistent IDE ATA 100/133 support.
This drive does not need to be very large at all
if you plan on using NAS to provide additional
hard disk storage space.
28
Processor
AMD Athlon/Phenom and Intel Core 2 Duo/Quad
processors are all excellent choices. Intel
offers faster processors at the top end, but the
mid-range processors are all very competent.
Server processors do offer superior reliability;
however, they do not (typically) increase
performance. They are also more expensive and
harder to find (Xeons in particular).
Memory
  • The jury is still out on the performance of DDR3
    memory, although it may be easier to find a
    motherboard that supports DDR3 memory with high
    FSB speeds. This warning is only to tell you
    that DDR2 memory performance is not poor.
  • Buy a minimum of 4GB in 2 x 2GB sticks. This
    will allow for upgrades later on (if required).
  • The major key here is to make sure that you buy
    RAM that performs at the maximum allowable speed
    your motherboard supports. CPU front-side bus
    speeds often exceed that of your RAM. Buying RAM
    that is faster than your motherboard can support
    will yield no practical benefit.
  • For example, most Core 2s have an 800, 1066 or
    1333 MHz front-side bus. This corresponds to
    memory support of DDR2 800 / 1066 RAM, or DDR3
    1066 / 1333 RAM at the high end in most high-end
    motherboards.

29
Troubleshooting / Testing your Build
  • Run MemTest86 for 24 hrs.
  • Be sure that there is good contact between your
    cpu and heatsink.
  • If you were required to apply thermal paste, do
    not use too much. Only enough to cover thinly.
    Most suggest a drop about the size of a pea. Do
    not allow large amounts of dust particulate to
    touch the thermal paste.
  • Make sure there is enough space in your pc for
    good ventilation and air flow.
  • Make sure you tie down your cables so that they
    do not get jammed in active fans.
  • Make sure additional 6-pin and 8-pin power
    connectors are attached to your GPU (or else it
    will run at very low efficiency).
  • If your system memory is dual channel, make sure
    that the pairs are plugged in and correspond to
    the appropriate channels on your motherboard
    (they are usually color coordinated or slightly
    different heights).
  • The motherboard manual will provide all necessary
    information on where fans can be plugged in,
    where the power supply connects to the
    motherboard, and how (and if) any jumpers need to
    be attached. LED connections are also labeled.

30
CUDA enabled GPUs
  • http://www.nvidia.com/object/cuda_learn_products.html
  • 8xxx, 9xxx, and 2xx series GeForce GPUs
  • Tesla C/S/D 870 (old), C/S/D 1060/1070 (new, fall
    '08)
  • Nvidia Quadro FX/NVS series (nearly all mobile
    versions supported as well)
  • 2xx series GPUs add double precision support
  • The card is considerably better if the second
    number is higher, i.e. 8800 GT >> 8600 GT
  • GTX Ultra > GTX > GTS > GT >> GS > G (an O on
    anything means overclocked, I think)
  • GX2 is literally two boards fused together:
    roughly 2x of a single card (slightly less)
  • Some 8xxx and all 9xxx / 2xx series cards are
    PCI-e 2.0 16x compatible. Some 8xxx cards only
    support PCI-e 1.0 16x. However, these cards will
    work on a motherboard which supports PCI-Express
    2.0.

31
Info on Cluster OS
  • ROCKS 5.0 <---- CentOS
  • Sun Grid Engine (SGE)
  • May need to roll your own installation depending
    on what features you desire. However, it is as
    simple as selecting the CUDA roll during a ROCKS
    install to get GPU job scheduling support.

32
Future
  • OpenCL - Open Computing Language (Apple)
  • Intel Ct
  • IBM / HP
  • PeakStream ---> a unified language?

33
Sources
  • Nvidia CUDA Programming Guide 2.0
  • Nvidia Tesla Technical Brief
  • Nvidia GeForce 8800 GPU Architecture Overview
  • Nvidia CUDA SDK 2.0
  • Nvidia Compiler Documentation
  • GPGPU.org
  • Intel Whitepaper(s)
  • www.nvidia.com
  • www.ati.com
  • www.intel.com
  • www.arstechnica.com
  • www.cnet.com