ME964 High Performance Computing for Engineering Applications

Provided by: sbel3
Learn more at: http://sbel.wisc.edu
1
ME964 High Performance Computing for Engineering Applications
  • Execution Model and Its Hardware Support
  • Sept. 25, 2008

2
Before we get started
  • Last Time
  • The CUDA execution model
  • Wrapped up the overview of the CUDA API
  • Read CUDA Programming Guide 1.1 (for next Tuesday)
  • Today
  • Review of concepts discussed over the previous
    two lectures
  • More on the CUDA execution model and its hardware
    support
  • Focus on thread scheduling
  • HW4 assigned
  • Due on Thursday, Oct. 2 at 11:59 PM
  • Timing Kernel Call Overhead, Matrix-Matrix
    multiplication (tiled, arbitrary size matrices),
    Vector Reduction operation
  • Please note: on Nov. 11 and 13 we'll have a guest
    lecturer, Dr. Darius Buntinas, of Argonne
    National Lab
  • Lectures will cover MPI, a different parallel
    computational model
  • The two lectures will run two hours long
  • You'll get a free Tuesday or Thursday afterwards

3
Why Use the GPU for Computing?
  • The GPU has evolved into a flexible and powerful
    processor
  • It's programmable using high-level languages
    (soon in FORTRAN)
  • It supports 32-bit floating-point precision and
    double precision (2.0)
  • Capable of GFLOP-level number-crunching speed
  • There is a GPU in each of today's PCs and workstations

4
What is Driving this Evolution?
  • The GPU is specialized for compute-intensive,
    highly data parallel computation (owing to its
    graphics rendering origin)
  • More transistors can be devoted to data
    processing rather than data caching and flow
    control
  • The fast-growing video game industry exerts
    strong economic pressure that forces constant
    innovation

[Figure: CPU vs. GPU]
HK-UIUC
5
ALU: Arithmetic Logic Unit
  • Digital circuit that performs arithmetic and
    logical operations
  • Fundamental building block of a processing unit
    (CPU and GPU)
  • A and B are the operands (the data, coming from input
    registers)
  • F is an operator (+, -, etc.) specified by
    the control unit
  • R is the result, stored in the output register
  • D is an output flag passed back to the control
    unit
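
The A, B, F, R, D description above can be mimicked with a toy software model. This is only a sketch: the operator table and the meaning of the D flag (here "result is zero") are assumptions made for illustration, not something the slide specifies.

```python
import operator

# Hypothetical operator table: the control unit's F signal selects one entry.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "&": operator.and_,
    "|": operator.or_,
}

def alu(a, b, f):
    """Apply operator f to operands a and b; return the pair (R, D)."""
    r = OPERATIONS[f](a, b)
    d = (r == 0)  # assumed flag: True when the result is zero
    return r, d

print(alu(7, 5, "-"))  # (2, False)
print(alu(6, 6, "-"))  # (0, True)
```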

6
Some Useful Information on Tools (short detour)
7
Compilation
  • Any source file containing CUDA language
    extensions must be compiled with nvcc
  • You spot such a file by its .cu suffix
  • nvcc is a compile driver
  • Works by invoking all the necessary tools and
    compilers like cudacc, g++, cl, ...
  • Assignment: read the nvcc document available on
    the class website
  • nvcc can output:
  • C code
  • Must then be compiled with the rest of the
    application using another tool
  • Assembly code (ptx)
  • Or directly object code

HK-UIUC
8
Linking
  • Any executable with CUDA code requires two
    dynamic libraries
  • The CUDA runtime library (cudart)
  • The CUDA core library (cuda)

HK-UIUC
9
Debugging Using the Device Emulation Mode
  • An executable compiled in device emulation mode
    (using nvcc -deviceemu) runs entirely on the
    host using the CUDA runtime
  • No need for any device or CUDA driver
  • Each device thread is emulated with a host
    thread
  • For your assignments: in the Developer Studio
    project, select the EmuDebug or EmuRelease
    build configurations
  • When running in device emulation mode, one can:
  • Use host native debug support (breakpoints,
    variable QuickWatch and edit, etc.)
  • Access any device-specific data from host code
    and vice-versa
  • Call any host function from device code (e.g.
    printf) and vice-versa
  • Detect deadlock situations caused by improper
    usage of __syncthreads

10
Device Emulation Mode Pitfalls
  • Emulated device threads execute sequentially, so
    simultaneous accesses of the same memory location
    by multiple threads could produce different
    results than on the device
  • Dereferencing device pointers on the host or host
    pointers on the device can produce correct
    results in device emulation mode, but will
    generate an error in device execution mode
  • Results of floating-point computations will
    differ slightly because of:
  • Different compiler outputs, instruction sets
  • Use of extended precision for intermediate
    results
  • There are various options to force strict single
    precision on the host
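
To make the extended-precision pitfall concrete, here is a small sketch. It assumes that "strict" means every intermediate result is rounded to single precision, while "extended" keeps intermediates in double precision and rounds only at the end. The helper names are made up for this illustration, and the rounding is emulated with Python's struct module rather than measured on real hardware.

```python
import struct

def to_single(x):
    """Round a Python float (double precision) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

def sum_single(value, n):
    """Accumulate with every intermediate result rounded to single precision."""
    total = 0.0
    for _ in range(n):
        total = to_single(total + value)
    return total

def sum_extended(value, n):
    """Accumulate in double precision and round to single only at the end."""
    total = 0.0
    for _ in range(n):
        total += value
    return to_single(total)

tenth = to_single(0.1)                 # the single-precision value closest to 0.1
strict = sum_single(tenth, 100_000)    # every step rounded to single
relaxed = sum_extended(tenth, 100_000) # wider intermediates
print(strict, relaxed)                 # the two sums are not equal
```

The same source code can therefore give different answers depending on where the rounding happens, which is why emulated and device results need not match bit for bit.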

HK-UIUC
11
End: Information on Tools
Begin: Discussion on Block/Thread Scheduling
12
Review: The CUDA Programming Model
  • GPU architecture paradigm: Single Instruction
    Multiple Data (SIMD)
  • CUDA perspective: Single Program Multiple Threads
  • What's the overall software (application)
    development model?
  • CUDA: integrated CPU + GPU application C program
  • Serial C code executes on CPU
  • Parallel Kernel C code executes on GPU thread
    blocks

CPU Serial Code
GPU Parallel Kernel: KernelA<<< nBlkA, nTidA >>>(args);
Grid 0
CPU Serial Code
GPU Parallel Kernel: KernelB<<< nBlkB, nTidB >>>(args);
Grid 1
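
The "single program, multiple threads" idea can be sketched on the host by running one function once per (block, thread) pair. This is a sequential emulation in plain Python, not CUDA; `launch` and `scale_kernel` are hypothetical names chosen for the illustration.

```python
def launch(kernel, n_blocks, n_threads, *args):
    """Run kernel once per (block, thread) pair, one after another."""
    for block_id in range(n_blocks):
        for thread_id in range(n_threads):
            kernel(block_id, thread_id, n_threads, *args)

def scale_kernel(block_id, thread_id, block_dim, data, factor):
    """Each emulated thread scales the one element it owns."""
    i = block_id * block_dim + thread_id  # global index of this thread
    if i < len(data):                     # guard: the last block may be partly idle
        data[i] *= factor

values = [1.0, 2.0, 3.0, 4.0, 5.0]
launch(scale_kernel, 2, 4, values, 10.0)  # 2 blocks of 4 threads cover 5 elements
print(values)  # [10.0, 20.0, 30.0, 40.0, 50.0]
```

Because the loop visits threads one at a time, this sketch shares the main limitation of device emulation mode: it cannot expose races between threads that would run simultaneously on the GPU.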
13
Execution Configuration: Grids and Blocks (Review)
  • A kernel is executed as a grid of blocks of
    threads
  • All threads in a kernel can access several device
    data memory spaces
  • A block of threads is a batch of threads that
    can cooperate with each other by:
  • Synchronizing their execution
  • For hazard-free shared memory accesses
  • Efficiently sharing data through a low-latency
    shared memory
  • Threads from two different blocks cannot
    cooperate!!!
  • This has important software design implications

HK-UIUC
Courtesy NVIDIA
14
CUDA Thread Block: Review
  • In relation to a Block, the programmer decides:
  • Block size: from 1 to 512 concurrent threads
  • Block dimension (shape): 1D, 2D, or 3D
  • # of threads in each dimension
  • All threads in a Block execute the same thread
    code
  • Threads have thread id numbers within Block
  • Threads share data and synchronize while doing
    their share of the work
  • Thread program uses thread id to select work and
    address shared data
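
As a quick check of the block-shape rules above, the sketch below multiplies out a block's dimensions, rejects shapes over the 512-thread limit quoted on the slide, and flattens a 3D thread index with x varying fastest (the ordering used in the CUDA Programming Guide). The function names are illustrative only.

```python
MAX_THREADS_PER_BLOCK = 512  # limit quoted on the slide

def threads_in_block(dim_x, dim_y=1, dim_z=1):
    """Total threads in a 1D, 2D, or 3D block; reject shapes over the limit."""
    total = dim_x * dim_y * dim_z
    if not 1 <= total <= MAX_THREADS_PER_BLOCK:
        raise ValueError(f"{total} threads per block is outside 1..{MAX_THREADS_PER_BLOCK}")
    return total

def linear_thread_id(x, y, z, dim_x, dim_y):
    """Flatten a 3D thread index, with x varying fastest."""
    return x + y * dim_x + z * dim_x * dim_y

print(threads_in_block(16, 16))           # 256
print(linear_thread_id(3, 2, 0, 16, 16))  # 35
```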

[Figure: CUDA Thread Block; threads with Thread Id # 0, 1, 2, 3, ..., m all run the same thread code]
Courtesy John Nickolls, NVIDIA
15
GeForce-8 Series HW Overview
[Figure: the Stream Processor Array is built from TPCs; each Texture Processor Cluster holds a TEX unit and SMs; each Stream Multiprocessor holds Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, SPs, and SFUs]
HK-UIUC
16
CUDA Processor Terminology
  • SPA: Stream Processor Array (variable across GeForce
    8-series, 8 in GeForce 8800 GTX)
  • TPC: Texture Processor Cluster (2 SM + TEX)
  • SM: Stream Multiprocessor (8 SP)
  • Multi-threaded processor core
  • Fundamental processing unit for CUDA thread block
  • SP: Scalar Stream Processor
  • Scalar ALU for a single CUDA thread

HK-UIUC
17
Stream Multiprocessor (SM)
  • Stream Multiprocessor (SM)
  • 8 Scalar Processors (SP)
  • 2 Special Function Units (SFU)
  • It's where a block lands for execution
  • Multi-threaded instruction dispatch
  • 1 to 768 (!) threads active
  • Shared instruction fetch per 32 threads
  • 20 GFLOPS on G80
  • 16 KB shared memory
  • DRAM texture and memory access

[Figure: Stream Multiprocessor with Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, and 2 SFUs]
HK-UIUC
18
Scheduling on the HW
  • Grid is launched on the SPA
  • Thread Blocks are serially distributed to all the
    SMs
  • Potentially >1 Thread Block per SM
  • Each SM launches Warps of Threads
  • SM schedules and executes Warps that are ready to
    run
  • As Warps and Thread Blocks complete, resources
    are freed
  • SPA can launch next Blocks in line
  • NOTE: two levels of scheduling
  • For running a (desirably) large number of blocks
    on a small number of SMs (16/14/etc.)
  • For running up to 24 warps of threads on the 8
    SPs available on each SM

19
SM Executes Blocks
[Figure: Blocks assigned to SM 0 and SM 1]
  • Threads are assigned to SMs in Block granularity
  • Up to 8 Blocks to each SM (doesn't mean you'll
    have eight though)
  • SM in G80 can take up to 768 threads
  • This is 24 warps (occupancy calculator!!)
  • Could be 256 (threads/block) * 3 blocks
  • Or 128 (threads/block) * 6 blocks, etc.
  • Threads run concurrently but time slicing is
    involved
  • SM assigns/maintains thread id #s
  • SM manages/schedules thread execution

[Figure, continued: Texture L1, L2, Memory]
HK-UIUC
20
Thread Scheduling/Execution
  • Each Thread Block is divided into 32-thread Warps
  • This is an implementation decision, not part of
    the CUDA programming model
  • Warps are the basic scheduling units in SM
  • If 3 blocks are assigned to an SM and each Block
    has 256 threads, how many Warps are there in an
    SM?
  • Each Block is divided into 256/32 = 8 Warps
  • There are 8 * 3 = 24 Warps
  • At any point in time, only one of the 24 Warps
    will be selected for instruction fetch and
    execution.
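
The warp count worked out above generalizes to other block sizes. In this sketch the rounding up for block sizes that are not a multiple of 32 goes beyond the slide's example: a partially filled warp still occupies a full warp slot.

```python
WARP_SIZE = 32  # threads per warp on G80, per the slide

def warps_per_block(threads_per_block):
    """Warps in one block; a partial warp still counts as a whole warp."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE

def warps_per_sm(blocks_on_sm, threads_per_block):
    """Warps resident on an SM that holds the given number of blocks."""
    return blocks_on_sm * warps_per_block(threads_per_block)

print(warps_per_block(256))  # 8
print(warps_per_sm(3, 256))  # 24
print(warps_per_block(100))  # 4 (the last warp has only 4 active threads)
```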



[Figure: Block 1 Warps and Block 2 Warps feeding a Streaming Multiprocessor (Instruction L1, Data L1, Instruction Fetch/Dispatch, Shared Memory, 8 SPs, 2 SFUs)]
HK-UIUC
21
SM Warp Scheduling
  • SM hardware implements zero-overhead Warp
    scheduling
  • Warps whose next instruction has its operands
    ready for consumption are eligible for execution
  • Eligible Warps are selected for execution based on a
    prioritized scheduling policy
  • All threads in a Warp execute the same
    instruction when selected
  • 4 clock cycles needed to dispatch the same
    instruction for all threads in a Warp in G80
  • Side comment:
  • Suppose your code has one global memory access
    every four instructions
  • Then, a minimum of 13 Warps is needed to fully
    tolerate a 200-cycle memory latency
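
One way to read the side comment's arithmetic: a warp issues 4 instructions at 4 cycles each, so it keeps the SM busy for 16 cycles before it stalls on its next global memory access. Covering a 200-cycle latency then takes 200/16 = 12.5, rounded up to 13 warps. The sketch below encodes that simplified model; the function name and parameters are illustrative, and a real scheduler has more going on.

```python
import math

def warps_to_hide_latency(latency_cycles, instructions_between_loads, cycles_per_instruction):
    """Warps needed so that the others cover one warp's memory stall."""
    busy_cycles = instructions_between_loads * cycles_per_instruction
    return math.ceil(latency_cycles / busy_cycles)

print(warps_to_hide_latency(200, 4, 4))  # 13
```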

[Figure: SM multithreaded Warp scheduler issuing warp instructions over time]
HK-UIUC
22
SM Instruction Buffer: Warp Scheduling
  • Fetch one warp instruction/cycle
  • from instruction L1 cache
  • into any instruction buffer slot
  • Issue one ready-to-go warp instruction every 4 cycles
  • from any warp - instruction buffer slot
  • operand scoreboarding used to prevent hazards
  • Issue selection based on round-robin/age of warp
  • SM broadcasts the same instruction to the 32 Threads
    of a Warp

[Figure: SM pipeline: I$ L1, Multithreaded Instruction Buffer, RF, C$ L1, Shared Mem, Operand Select, MAD, SFU]
HK-UIUC
23
Scoreboarding
  • All register operands of all instructions in the
    Instruction Buffer are scoreboarded
  • Status becomes ready after the needed values
    are deposited
  • Prevents hazards
  • Cleared instructions are eligible for issue
  • Decoupled Memory/Processor pipelines
  • Any thread can continue to issue instructions
    until scoreboarding prevents issue
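
A toy model can show the idea of scoreboarding. It assumes the scoreboard is simply a set of (warp, register) pairs whose values have not been deposited yet, and it only checks source operands. The buffer contents and register names are invented for the illustration; the hardware mechanism is more involved.

```python
def ready_instructions(instruction_buffer, pending):
    """Return buffered instructions whose source registers are all deposited."""
    return [
        inst for inst in instruction_buffer
        if not any((inst["warp"], reg) in pending for reg in inst["sources"])
    ]

# Hypothetical buffer contents, one instruction per warp.
instruction_buffer = [
    {"warp": 0, "op": "add r3, r1, r2", "sources": ["r1", "r2"]},
    {"warp": 1, "op": "mul r6, r4, r5", "sources": ["r4", "r5"]},
]
pending = {(0, "r1")}  # warp 0 is still waiting for a load into r1

print([i["warp"] for i in ready_instructions(instruction_buffer, pending)])  # [1]
pending.discard((0, "r1"))  # the load completes and deposits the value
print([i["warp"] for i in ready_instructions(instruction_buffer, pending)])  # [0, 1]
```

While warp 0 waits on memory, warp 1 can still issue, which is the decoupling of the memory and processor pipelines mentioned above.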

HK-UIUC
24
Granularity Considerations
  • For Matrix Multiplication, should I use 8X8,
    16X16 or 32X32 tiles?
  • For 8X8, we have 64 threads per Block. Since each
    SM can take up to 768 threads, it could take up to
    12 Blocks. However, because each SM can only take up
    to 8 Blocks, only 512 threads will go into each SM!
  • For 16X16, we have 256 threads per Block. Since
    each SM can take up to 768 threads, it can take
    up to 3 Blocks and achieve full capacity unless
    other resource considerations overrule.
  • For 32X32, we have 1024 threads per Block. This
    is not an option anyway (we need at most 512
    per block, and at most 768 per SM)
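
The tile-size reasoning above fits in a few lines. The sketch uses only the three G80 limits quoted in these slides and deliberately ignores register and shared memory usage, which can lower the numbers further.

```python
MAX_THREADS_PER_SM = 768
MAX_BLOCKS_PER_SM = 8
MAX_THREADS_PER_BLOCK = 512

def resident_threads(tile_width):
    """Threads resident on one SM for square tiles of the given width."""
    threads_per_block = tile_width * tile_width
    if threads_per_block > MAX_THREADS_PER_BLOCK:
        return 0  # the block cannot be launched at all
    blocks = min(MAX_BLOCKS_PER_SM, MAX_THREADS_PER_SM // threads_per_block)
    return blocks * threads_per_block

for width in (8, 16, 32):
    print(width, resident_threads(width))
# 8 512
# 16 768
# 32 0
```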

HK-UIUC
25
How would you scale up the GPU?
  • Scaling up here means beefing it up
  • Two issues:
  • As a company, you don't want to rock the boat a
    lot when scaling up
  • You don't want to have legacy code re-written to
    take advantage of new HW
  • You can beef up the memory, not discussed here
  • Increase the number of TPCs
  • Easy to do, basically more HW
  • Implications on our side: if you have enough
    blocks, you rise with the tide too
  • Increase the number of SMs on each TPC
  • Easy to do, basically more HW
  • Implications on our side: if you have enough
    blocks, you rise with the tide too
  • Increase the number of SPs
  • This is tricky, you'd have to fiddle with the
    control unit of the SM
  • The Warp size would change; most likely this
    would require more threads in a block to be
    efficient, but that requires more memory on the
    chip (shared registers)
  • It snowballs; this is probably going to stay like
    this for a while

26
New GT200 GPU Architecture
G80: up to 8 TPCs in the SPA; GT200: 10 TPCs in the SPA
[Figure: Stream Processor Array with its Texture Processing Clusters]
27
End: Discussion on Block/Thread Scheduling
Begin: Discussion on Memory Access