Title: Introduction to GPU Programming for EDA
Slide 1: Introduction to GPU Programming for EDA
- John F. Croix, Cadence Design Systems, Inc.
- Sunil P. Khatri, Texas A&M University
- Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc.
- Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli
Slide 2: Outline
- GPU Architecture Overview
- GPU Programming
- Algorithm Acceleration Guidelines
- Case Studies
- Conclusion
- Q&A
Slide 3: Outline
- GPU Architecture Overview
  - Evolution and architecture
  - Peak performance
  - GPU and CPU interaction: practical considerations
- GPU Programming
- Algorithm Acceleration Guidelines
- Case Studies
- Conclusion
- Q&A
Slide 4: GPU Evolution
- In the early days, graphics accelerators were primitive
  - Acceleration of graphics rendering tasks for (CRT) displays
  - Many hardwired graphics acceleration units
- With VLSI technology scaling, the GPU was born
  - Many programmable processors to handle graphics rendering tasks
  - Increased peak memory bandwidth and peak performance
  - Goal was faster and more realistic rendering for gaming applications
- Recently, several scientific communities began to leverage these GPUs
  - Initially used graphics APIs like OpenGL and DirectX for these tasks
- GPU vendors recognized this interest
  - Development of C-like programming environments such as CUDA
  - Development of GPU architectures tuned for scientific computations
Slide 5: GPU Introduction
- A GPU is essentially a commodity stream processor
  - Highly parallel (100s of processor cores)
  - Very fast (>900 GFLOPS of peak performance)
- Operates in a SIMD manner; this is a key restriction
  - Multiple processors operate in lock-step (same instruction) but on different data
- GPUs, owing to their massively parallel architecture, have been used to accelerate
  - Image/stream processing, data compression, numerical algorithms
  - Recently they have been used to accelerate CAD algorithms as well
- Inexpensive, off-the-shelf cards like the NVIDIA Quadro FX / GTX 280 GPUs achieve impressive performance
  - 933 GFLOPS peak performance
  - 240 SIMD cores partitioned into 30 multiprocessors (MPs)
  - 4 GB (Quadro) or 1 GB (GTX 280) device memory with 142 GB/s bandwidth
  - 1.4 GHz GPU operating frequency
  - Programmed with the Compute Unified Device Architecture (CUDA) framework
Slide 6: GPU Architecture
- In the GTX 280, there are 10 Thread Processing Clusters (TPCs)
- Each TPC has 3 Streaming Multiprocessors (SMs), which we will refer to as multiprocessors (MPs)
- Each MP has 8 Streaming Processors (SPs) or Thread Processors (TPs); we will refer to these as processors
- 240 processors and 30 MPs in all!
- One double-precision FP unit per SM
(Source: NVIDIA)
Slide 7: GPU vs CPU: NVIDIA GTX 280 vs Intel i7 860
[Comparison table on the slide]
- Footnotes: (1) http://ark.intel.com/Product.aspx?id=41316; (2) TPC = Thread Processing Cluster (24 cores); (3) 30 multiprocessors in a GTX 280
Slide 8: GPU vs CPU Peak Performance Trends
- GPU peak performance has grown aggressively
- Hardware has kept up with Moore's law
(Source: NVIDIA)
Slide 9: GPU Programming Model
- The GPU is viewed as a compute device that
  - Is a coprocessor (slave) to the CPU (host)
  - Has its own DRAM (device memory) but no virtual memory
    - Entire design instance may not fit on the GPU!
- A kernel is a CPU-callable function; a thread is an instance of a kernel
- The GPU runs many threads in parallel
[Figure: host (CPU) and device (GPU) connected over PCIe; a kernel launched from the host runs as many threads on the GPU, which has its own device memory]
Slide 10: Data Transfers (CPU ↔ GPU)
- GPUs and CPUs communicate via a PCIe bus
  - This communication is expensive and should be minimized for target applications
- Graphics applications usually require
  - Initial data to be sent from CPU to GPU
  - A single transfer of processed data from GPU to CPU
- General-purpose computations usually require
  - Multiple transfers between CPU and GPU (since conditional checks run on the CPU)
  - Possibility of saturating the PCIe bus and reducing the achievable performance
Slide 11: GPU Threads vs. CPU Threads
- GPU threads
  - Lightweight: small creation and scheduling overhead, extremely fast hardware context switching
  - Need to issue 1000s of GPU threads to hide global memory latencies (600-800 cycles)
- CPU threads
  - Heavyweight: large scheduling overhead, slow context switching
- Multi-GPU usage requires invocation of multiple CPU threads
  - Each CPU thread creates a GPU context
  - Context swapping is required for a CPU thread to access GPU memory allocated by another CPU thread
Slide 12: Device Memory Space Overview
- Each thread runs on an SP and has
  - R/W per-thread registers (on-chip)
    - Limit usage (max 16K/MP)
  - R/W per-thread local memory (off-chip)
  - R/W per-block shared memory (on-chip)
    - Need to avoid bank conflicts
  - R/W per-grid global memory (off-chip)
    - Not cached, 600-800 cycle read
    - Latency hidden by parallelism and fast context switches
    - Main means for data transfer between host and device
    - Coalescing recommended
  - RO per-grid cached constant and texture memory (off-chip)
- The host can R/W global, constant and texture memories (visible to all threads)
[Figure: device grid of blocks; each block has shared memory and per-thread registers/local memory; global, constant and texture memory are shared across the grid and accessible from the host. Source: NVIDIA CUDA Programming Guide version 1.1]
Slide 13: Outline
- GPU Architecture Overview
- GPU Programming
  - CPU threads
  - Conditional and loop processing
  - Floating point
  - General GPU program structure
  - CUDA and OpenCL
- Algorithm Acceleration Guidelines
- Case Studies
- Conclusion
- Q&A
Slide 14: CPU Threading
- CPU
  - All threads are equivalent
  - Read/write concurrently to the same memory
  - Synchronization primitives required to avoid collisions
- GPU (NVIDIA)
  - Each CPU thread maintains a unique context
  - GPU resources (e.g. memory, code modules, address space) are context-specific
  - Each CPU thread can access a single context at once
  - Contexts must be exchanged between CPU threads to share GPU resources between CPU threads
  - Contexts use reference counting and are automatically destroyed
Slide 15: SIMD Conditional Processing
- Unlike threads in a CPU-based program, SIMD programs cannot follow different execution paths
- Ideal scenario
  - All GPU threads follow the same execution path
  - All processors active continuously
- In divergent paths, some processors execute the then-block and others the else-block
  - Program flow cannot actually diverge; all instructions are executed
  - The then- and else-blocks are both executed
  - A bit is used to enable/disable processors based on the block being executed
  - Parallelism is reduced, impacting performance (see the sketch after this list)
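
A minimal sketch of such divergence (hypothetical kernel, not from the slides):

  // Hypothetical CUDA kernel illustrating warp divergence. Threads with even
  // and odd indices take different branches; within a warp the hardware
  // executes both branches serially, masking off the threads not on each path.
  __global__ void divergentKernel(float *data, int n)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n) {
          if (tid % 2 == 0)                      // half the threads of each warp
              data[tid] = data[tid] * 2.0f;      // then-block
          else                                   // the other half
              data[tid] = data[tid] + 1.0f;      // else-block
      }
  }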
Slide 16: Idle Processors
- Idle CPU processors can be dynamically rescheduled by the OS
- SIMD processors are not actually idle
  - All processors scheduled are following identical execution paths
  - Disabled (idle) processors are unavailable for other work and cannot be rescheduled
- Effective utilization of processors is the programmer's responsibility
  - Scheduling is an art, not necessarily a science
  - Techniques will vary from chip to chip
Slide 17: Conditional Processing
- (Illustration of conditional processing on the slide)
Slide 18: Nested Conditional Processing
- Nested conditionals compound the problem: every block is executed, with processors masked off along the way

  if (condition) {
      if (condition2) {
          ...
      } else {
          ...
      }
  } else {
      ...
  }
Slide 19: Loop Processing
- Loops with data-dependent trip counts behave like conditionals: the processors in a group keep stepping through the loop until every one of them has exited

  while (condition) {
      if (cond2) {
          ...
      }
      ...
  }
Slide 20: The Cost of Memory Access
- Registers are extremely fast, but are a limited resource
- Cached memories also tend to be small
- For large data sets, global memory provides read/write access
  - Accesses take between 600 and 800 clock cycles
  - Accesses are not cached
- To hide memory latency, the hardware provides fast context switches when memory is accessed
  - However, there must be enough computational work to hide the high cost of memory access
- Programmers need to be smart
  - Compilers often don't provide the necessary optimizations when optimizing for speed instead of code size
  - It can sometimes be cheaper to recompute a result than perform a memory read/write
Slide 21: Conditional Processing
- Where a variable is loaded relative to a conditional affects when memory accesses (and the associated context swaps) occur; the slide's "Access/Swap" annotations mark those points in the two versions below

  // Version 1: read before the conditional
  float a = someVar;
  if (condition) {
      ...
  } else {
      ...
  }
  ...

  // Version 2: read inside each branch
  if (condition) {
      ...
      float a = someVar;
      ...
  } else {
      ...
      float a = someVar;
      ...
  }
  ...
Slide 22: Floating Point
- GPUs are optimized for 32-bit accesses
- 64-bit double-precision values are fetched from memory as two 32-bit quantities
  - May impact performance in the event of memory bank conflicts
- One double-precision unit per multiprocessor (1)
  (1) http://www.ddj.com/hpc-high-performance-computing/210102115
Slide 23: OpenCL vs. CUDA
- CUDA uses early code binding
  - Code is compiled with normal C/C++/FORTRAN (beta) source code
  - Need the CUDA occupancy calculator to determine the number of threads based on resource utilization
  - Library support: BLAS, FFT, DPT
- OpenCL
  - Late binding of OpenCL code to executable
  - OpenCL compiler/linker embedded within the application
  - No need for the CUDA occupancy calculator
  - Only supports C
  - No libraries
Slide 24: CUDA Occupancy Calculator
- (Screenshot of the occupancy calculator on the slide)
Slide 25: OpenCL vs. CUDA
- (Comparison table on the slide)
Slide 26: General Program Structure
- Initialize GPU
- Create GPU context
- Build GPU program
- Allocate GPU memory
- Transfer data from CPU to GPU
- Invoke GPU functions
- Transfer data from GPU to CPU
- Deallocate GPU memory
- Finalize GPU usage
A minimal host-side sketch of this flow is shown below.
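
For illustration, a minimal CUDA host-side sketch of the above flow (hypothetical kernel and buffer names; error checking omitted):

  #include <cuda_runtime.h>
  #include <stdlib.h>

  __global__ void myKernel(float *d_data, int n);    // assumed to exist elsewhere

  int main()
  {
      const int n = 1 << 20;
      size_t bytes = n * sizeof(float);
      float *h_data = (float *)malloc(bytes);        // host buffer (left uninitialized here)
      float *d_data = NULL;

      cudaSetDevice(0);                              // initialize GPU (context created implicitly)
      cudaMalloc((void **)&d_data, bytes);           // allocate GPU memory
      cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU

      myKernel<<<n / 256, 256>>>(d_data, n);         // invoke GPU function
      cudaThreadSynchronize();                       // wait for completion (CUDA 2.x API)

      cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
      cudaFree(d_data);                              // deallocate GPU memory
      free(h_data);
      return 0;
  }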
Slide 27: Create GPU Context
- CUDA
  - Context creation is implicit in single-threaded programs
  - Multiple contexts can be explicitly created
  - Each thread maintains a context stack
    - Top context is the current context
  - Threads
    - Contexts can be swapped between threads
    - A thread can only have one context active at a time (stack)
    - A context cannot be shared simultaneously between threads
- OpenCL
  - All commands are explicitly associated with a context
  - Must create a command queue to invoke commands
Slide 28: Initialize GPU
- CUDA
  - cudaGetDeviceCount()
  - cudaSetDevice()
  - cudaGetDeviceProperties()

  CUDA::CUDA(int Device) : Base()
  {
      mValid = false;
      int DeviceCount;
      cudaGetDeviceCount( &DeviceCount );
      if (!DeviceCount)
          return;
      Device = (Device == -1) ? DeviceCount - 1 : Device;
      cudaSetDevice( Device );
      mValid = true;
  }
Slide 29: Initialize GPU
- OpenCL
  - Context must be built before anything can be done on the GPU
  - All commands are with respect to a given context

  OpenCL::OpenCL(int Device) : Base()
  {
      init();   // Initialize class pointers to NULL
      cl_int RC;
      mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
      size_t Bytes;
      RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
      int NumDevices = Bytes / sizeof( cl_device_id );
      cl_device_id *Devices = new cl_device_id[ NumDevices ];
      RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
      mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
      size_t MaxWorkItemSizes[ 256 ];
      RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                            sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
      mMaxWorkItems = MaxWorkItemSizes[ 0 ];
      mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
  }
Slide 30: Build GPU Program
- CUDA
  - GPU code is compiled using the nvcc compiler
  - Object code is statically bound to the CPU executable
  - GPU code is intrinsically part of the program
  - Mapping of problem to threads is performed at compile time
Slide 31: Build GPU Program
- OpenCL
  - GPU code is bound at runtime to the GPU
    - OpenCL compiler is part of the executable
  - Code can be source code or object code
    - Source code can be dynamically generated by the program
    - Can be stored in an external file

  // Continued from constructor
  char *code = shrFindFilePath( "code.cl", "." );
  size_t CodeLength = 0;
  char *Source = oclLoadProgSource( code, "", &CodeLength );
  const char *SourceCode = Source;
  mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode,
                                        &CodeLength, &RC );
  RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
  std::free( code );
  std::free( Source );
  mValid = (RC == CL_SUCCESS);
Slide 32: Allocate/Deallocate GPU Memory
- CUDA
  - Most frequently used allocator: cudaMalloc()
  - Returns a memory pointer to GPU memory
  - Memory pointer cannot be used by the CPU directly
    - Passed to GPU calls

  void *CUDA::malloc(size_t Bytes)
  {
      void *Memory;
      cudaError_t RC = cudaMalloc( &Memory, Bytes );
      return( RC == cudaSuccess ? Memory : NULL );
  }

  void CUDA::free(void *Memory)
  {
      if (Memory)
          cudaFree( Memory );
  }
Slide 33: Allocate/Deallocate GPU Memory
- OpenCL
  - Like all things, memory allocation is explicitly performed within a context

  void *OpenCL::malloc(size_t NumBytes)
  {
      size_t Size = NumBytes / 32 + ((NumBytes & 31) ? 1 : 0);
      cl_int RC;
      cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE,
                                      Size, NULL, &RC );
      return( RC == CL_SUCCESS ? Memory : NULL );
  }

  void OpenCL::free(void *Memory)
  {
      if (Memory) {
          cl_mem Ptr = reinterpret_cast<cl_mem>( Memory );
          clReleaseMemObject( Ptr );
      }
  }
Slide 34: CPU/GPU Data Transfer
- Data is moved across the PCIe bus
- CUDA
  - Data transfer accomplished via the cudaMemcpy() routine
    - Implicit synchronization point
    - Non-blocking copies are available
  - Direction is determined by an enumeration
    - cudaMemcpyHostToDevice
    - cudaMemcpyDeviceToHost
  - Allocated memory can be bound to texture memory (see the sketch below)
    - cudaBindTexture
- OpenCL
  - Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
  - Synchronization controlled by parameters to the calls
    - Default is non-blocking
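
A hedged illustration of binding allocated global memory to a texture (hypothetical names, using the texture-reference runtime API of this CUDA generation):

  #include <cuda_runtime.h>

  // File-scope texture reference through which kernels read the data.
  texture<float, 1, cudaReadModeElementType> texParams;

  void uploadParams(const float *h_params, int n, float **d_params)
  {
      size_t bytes = n * sizeof(float);
      cudaMalloc((void **)d_params, bytes);
      cudaMemcpy(*d_params, h_params, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
      cudaBindTexture(0, texParams, *d_params, bytes);  // kernels read via tex1Dfetch(texParams, i)
  }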
Slide 35: Call GPU Functions (Kernels)
- Functions on the CPU are executed when invoked
- GPU function calls from the CPU are placed on an execution queue
  - The CPU does not wait until the GPU function completes; the command is simply queued
  - The GPU executes commands on the queue using its own ordering
- Synchronization points cause the CPU to stall to wait for the GPU to return
  - CUDA: cudaThreadSynchronize()
Slide 36: GPU Function Calls
- GPU function calls have an associated dimensionality (which can be 1D, 2D or 3D)
- CUDA
  - Extended language syntax to include the problem dimension
  - Syntax: function<<<dimGrid, dimBlock>>>( arguments )
- OpenCL
  - Must explicitly put function arguments into the context
    - clSetKernelArg()
  - Invoke the kernel using the context
    - Kernel retrieves arguments from the context automatically
Slide 37: GPU Cleanup/Termination
- CUDA
  - Manages most cleanup operations automatically as a context is destroyed
- OpenCL
  - Provides low-level APIs for deallocation of all resources
  - Invoked in the order opposite to invocation
    - clReleaseKernel()
    - clReleaseProgram()
    - clReleaseCommandQueue()
    - clReleaseContext()
Slide 38: Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks (aka blocks)
- A thread block is a batch of threads that can cooperate with each other by
  - Synchronizing their execution
    - Diverging execution results in performance loss
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 and Kernel 2, each executing on the device as a grid of thread blocks. Source: NVIDIA CUDA Programming Guide version 1.1]
Slide 39: Block and Thread IDs
- Threads and blocks have IDs
  - So each thread can identify what data it will operate on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data (see the sketch below)
  - Image processing
  - Solving PDEs on volumes
  - Other problems with underlying 1D, 2D or 3D geometry
(Source: NVIDIA CUDA Programming Guide version 1.1)
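
For example (a minimal sketch, not from the slides), a 2D kernel typically derives a unique element coordinate from its block and thread IDs:

  // Hypothetical 2D kernel: each thread computes its (x, y) coordinate
  // from its block and thread IDs and processes one image pixel.
  __global__ void scalePixels(float *image, int width, int height, float s)
  {
      unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x < width && y < height)
          image[y * width + x] *= s;
  }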
Slide 40: GPU Kernels
- Each function is passed data to create a unique ID
- The data typically specifies the spatial coordinates of the executing processor within the hardware
- The ID is used to coordinate data access
  - Ensures that two threads' accesses do not collide
- CUDA function types (see the sketch below)
  - __global__
    - Callable by CPU
    - Cannot be called by GPU
  - __device__
    - Callable by other GPU functions
    - Cannot be called by CPU
  - CUDA expands these as inline functions via nvcc
    - Adds to function resource utilization
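
A minimal sketch of the two function types (hypothetical names):

  // A __device__ helper callable only from GPU code, and a __global__
  // kernel callable from the CPU.
  __device__ float square(float x)               // GPU-only helper, inlined by nvcc
  {
      return x * x;
  }

  __global__ void squareAll(float *data, int n)  // CPU-callable kernel
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n)
          data[tid] = square(data[tid]);
  }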
Slide 41: OpenCL Kernel Invocation
- Use C++ templates to simplify argument handling

  template<typename T> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, T Arg)
  {
      return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
  }

  template<> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
  {
      // This routine, unlike the others, sets up shared memory by passing
      // NULL in as the pointer to the variable.
      return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
  }

  template<> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, int Arg)
  {
      cl_int ArgInt = Arg;
      return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
  }

  template<> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, float Arg)
  {
      cl_float ArgFloat = Arg;
      return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
  }
  ...

  template<typename T0> inline cl_int
  setArgs(cl_kernel Kernel, T0 Arg0)
  {
      return( setArg( Kernel, 0, Arg0 ) );
  }

  template<typename T0, typename T1> inline cl_int
  setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
  {
      return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );   // combine return codes
  }

  template<typename T0, typename T1, typename T2> inline cl_int
  setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
  {
      return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) |
              setArg( Kernel, 2, Arg2 ) );
  }
  ...
Slide 42: OpenCL Kernel Invocation
- BLAS-like example
  - CUDA provides a BLAS library; OpenCL doesn't
  - Must write own BLAS routines in OpenCL to port between the two easily
  - The swap() function swaps the contents of 2 vectors with differing vector strides

  void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
  {
      if (!checkBLASKernel( mSswapKernel, "Sswap" ))
          return;
      mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
      if (x && y) {
          if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS)
              executeBLASKernel( mSswapKernel, n );
      }
  }
Slide 43: OpenCL Kernel Invocation

  bool OpenCL::checkBLASKernel(cl_kernel &Kernel, const char *KernelName)
  {
      if (!mValid) {
          mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
          return( false );
      }
      if (!(Kernel)) {
          cl_int RC;
          Kernel = clCreateKernel( mProgram, KernelName, &RC );
          if (RC != CL_SUCCESS) {
              mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
              return( false );
          }
      }
      return( true );
  }

  inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
  {
      size_t Size = n;
      size_t GlobalWorkSize = Size & mMaxWorkItemsMask;   // round down to a multiple of mMaxWorkItems
      if (Size & ~mMaxWorkItemsMask)
          GlobalWorkSize += mMaxWorkItems;                // then round up if there is a remainder
      cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL,
                                          &GlobalWorkSize, &mMaxWorkItems,
                                          0, NULL, NULL );
      clFinish( mCommandQueue );
      mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS
                                           : Base::BLAS_EXECUTION_FAILED;
  }
Slide 44: OpenCL Kernels

  __kernel void Sswap(int n, __global float *x, int incx,
                      __global float *y, int incy)
  {
      const unsigned GID = get_global_id( 0 );
      if (GID < n) {
          int lx = (incx > 0) ? 0 : ((1 - n) * incx);
          int ly = (incy > 0) ? 0 : ((1 - n) * incy);
          float temp           = y[ ly + GID * incy ];
          y[ ly + GID * incy ] = x[ lx + GID * incx ];
          x[ lx + GID * incx ] = temp;
      }
  }

  http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
Slide 45: CUDA Kernels

  #include "kernel.cu"
  ...
  const unsigned int size_x = 256;
  const unsigned int size_y = 4096;
  ...
  dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
  dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
  transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
  cudaThreadSynchronize();
  ...

  #define BLOCK_DIM 16

  __global__ void transpose_naive(float *odata, float *idata, int width, int height)
  {
      unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
      unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
      if (xIndex < width && yIndex < height) {
          unsigned int index_in  = xIndex + width * yIndex;
          unsigned int index_out = yIndex + height * xIndex;
          odata[index_out] = idata[index_in];
      }
  }
Slide 46: Outline
- GPU Architecture Overview
- GPU Programming
- Algorithm Acceleration Guidelines
  - Streams and pinned memory
  - Thread scheduling
  - Parallel reduction
  - Program partitioning
  - Simultaneous graphics and algorithm processing
- Case Studies
- Conclusion
- Q&A
Slide 47: Streams
- Sequence of commands that execute serially
- Allow overlapping of memory transfers and kernel computations from different streams
  - Hides data transfer cost
- Implementable in CUDA devices with compute capability 1.1
- Host memory must be of type "pinned" (see the sketch below)
[Figure: timeline comparing the sequential schedule (H→D transfer, kernel computation, D→H transfer for Data1 then Data2) with a streamed schedule in which Data2's transfers overlap Data1's kernel computation]
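
A minimal CUDA sketch of this pattern, under stated assumptions (hypothetical kernel name, n a multiple of 512, pinned host memory via cudaMallocHost):

  #include <cuda_runtime.h>
  #include <string.h>

  __global__ void process(float *d_buf, int n);     // assumed to exist elsewhere

  void runStreams(const float *h_src, int n)
  {
      const int half = n / 2;
      float *h_pinned, *d_buf;
      cudaMallocHost((void **)&h_pinned, n * sizeof(float));   // pinned host memory
      memcpy(h_pinned, h_src, n * sizeof(float));
      cudaMalloc((void **)&d_buf, n * sizeof(float));

      cudaStream_t stream[2];
      for (int i = 0; i < 2; ++i)
          cudaStreamCreate(&stream[i]);

      // Each stream copies its half asynchronously and runs its kernel;
      // the second stream's transfer can overlap the first stream's kernel.
      for (int i = 0; i < 2; ++i) {
          int offset = i * half;
          cudaMemcpyAsync(d_buf + offset, h_pinned + offset,
                          half * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
          process<<<half / 256, 256, 0, stream[i]>>>(d_buf + offset, half);
          cudaMemcpyAsync(h_pinned + offset, d_buf + offset,
                          half * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
      }
      cudaThreadSynchronize();   // wait for both streams (CUDA 2.x API)

      for (int i = 0; i < 2; ++i)
          cudaStreamDestroy(stream[i]);
      cudaFree(d_buf);
      cudaFreeHost(h_pinned);
  }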
Slide 48: Pinned Memory
- Memory on the host that is mapped to the device's address space and thus accessible directly by a kernel
- Has several advantages
  - There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel
  - Bandwidth between host and device memories is higher
- Write-combining memory (see the sketch below)
  - Type of pinned memory where individual writes are aggregated into a larger write operation
  - Avoids internal L1, L2 cache writes, making more cache available for the rest of the application
  - Is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
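
A minimal sketch of allocating mapped, write-combined pinned memory with the CUDA runtime (illustrative helper; cudaSetDeviceFlags() must run before the context is created):

  #include <cuda_runtime.h>

  float *allocMappedBuffer(int n, float **d_alias)
  {
      float *h_buf = NULL;
      cudaSetDeviceFlags(cudaDeviceMapHost);                  // enable mapped pinned memory
      cudaHostAlloc((void **)&h_buf, n * sizeof(float),
                    cudaHostAllocMapped | cudaHostAllocWriteCombined);
      cudaHostGetDevicePointer((void **)d_alias, h_buf, 0);   // device-side alias of the host buffer
      return h_buf;                                           // kernels can read/write *d_alias directly
  }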
Slide 49: Threads and Scheduling in the GPU
- The GPU consists of multiprocessors, each of which has many processors
- A kernel is executed as a grid of blocks
- A thread block is a batch of threads that cooperate with each other by
  - Synchronizing their execution
    - Diverging execution results in performance loss
  - Efficiently sharing data through a low-latency shared memory
- All threads of a block reside on the same multiprocessor (max 1024/MP)
- Number of blocks a multiprocessor can process at once depends on register and shared memory usage per thread
(Source: NVIDIA CUDA Programming Guide version 1.1)
Slide 50: Threads and Scheduling in the GPU (cont'd)
- Before execution, a block is split into warps
  - A warp is a set of 32 threads which execute the same instruction on an MP
  - A half-warp is either the first 16 or second 16 threads of a warp
- Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
- Branch divergence occurs if threads of a half-warp diverge via a data-dependent conditional branch
  - The half-warp serially executes each branch path taken, ignoring the result from threads that are not on that path
  - Increases kernel execution time
- Warps of the same block are executed in a time-sliced fashion
Slide 51: Program Parallelism
- The GPU is designed to address applications that are data-parallel
- Parallelism is an inherent factor in determining the suitability of a problem for the GPU
  - In fact, applications in which enough parallelism cannot be exposed may be slower on a GPU in comparison to a single-threaded CPU
- Since the same program is executed for each data element, there is no sophisticated flow control
- Conditional checks need to be done on the CPU
  - Reduce the output of all threads, transfer the reduced result to the CPU, which tests the condition and appropriately issues further GPU threads
  - Can be expensive since transfers are done over the PCIe bus!
Slide 52: Parallel Reduction
- Perform a reduction of the data before transferring it to the CPU
- Tree-based reduction approach used within each thread block (see the sketch below)
- Reduction decomposed into multiple kernels to reduce the number of threads issued in the later stages of the tree-based reduction
[Figure: example of a tree-based SUM within a block, with __syncthreads() between levels]
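
A minimal sketch of the block-level tree-based sum (kernel name and launch parameters are illustrative; sequential addressing is used, as discussed on the next slide):

  __global__ void reduceSum(const float *g_in, float *g_out, int n)
  {
      extern __shared__ float sdata[];              // one float per thread
      unsigned int tid = threadIdx.x;
      unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

      sdata[tid] = (i < n) ? g_in[i] : 0.0f;        // load into shared memory
      __syncthreads();

      // Tree-based reduction with sequential addressing: at each step the
      // active threads add an element s positions away, halving s each time.
      for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (tid < s)
              sdata[tid] += sdata[tid + s];
          __syncthreads();
      }

      if (tid == 0)                                 // one partial sum per block
          g_out[blockIdx.x] = sdata[0];
  }

Launched as reduceSum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(...), it leaves one partial sum per block, which a follow-up kernel (or the CPU) reduces further.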
Slide 53: Parallel Reduction (cont'd)
- Types of optimization for efficient parallel reduction
  - Algorithmic optimizations
    - Avoid divergent warps
    - Avoid shared memory bank conflicts: sequential addressing
    - First addition during the global load halves the number of blocks
  - Code optimizations
    - Loop unrolling
    - Multiple adds per thread to increase the arithmetic intensity of kernels (high ratio of computation in the kernel to global reads and writes)
Slide 54: Parallel Reduction (cont'd)
- Example of a tree-based reduced sum
[Figure: shared-memory contents and active thread IDs (0, 2, 4, ..., 14; then 0, 4, 8, 12; then 0, 8; then 0) at each step of the tree-based sum; thread 0 ends with the block total (41)]
Slide 55: Parallel Reduction (cont'd)
[Figure: the same tree-based sum with sequential addressing; shared-memory bank IDs are shown, and the active threads (0-7, then 0-3, then 0-1, then 0) access distinct banks, avoiding bank conflicts]
Slide 56: Parallel Reduction (cont'd)
[Figure: continuation of the sequential-addressing example; shared-memory contents after each reduction step, ending with the total (41) held by thread 0]
Slide 57: Program Partitioning
- Assume a subroutine S is invoked N times in an application
- If a multiprocessor of the GPU has 16K registers and S uses x registers, the maximum parallelism is 16K/x
  - The GPU can do fast hardware context switches between the threads, which share the 16K registers
- Partitioning S into smaller kernels, each using fewer registers (y < x), allows more threads in flight
- However, data transfers between kernels become a significant overhead as the number of partitions increases
[Figure: subroutine S (x registers, time T) partitioned into kernels 1, 2, 3, each using y registers]
Slide 58: Simultaneous Graphics and Algorithm Processing
- If the same GPU is used for graphics and algorithmic processing
  - GPU resources may be saturated by the graphics application, leaving little bandwidth for other applications
  - The fixed size of GPU memory (without swap space) may cause application launch failure
  - Graphics tasks may cause cache pollution, which may cause erratic runtimes for general-purpose applications
    - Run warm-up code to flush out caches
  - A single kernel execution cannot be longer than 5 seconds
- Using a separate GPU for graphics and computation avoids the problems listed above
Slide 59: Outline
- GPU Architecture Overview
- GPU Programming
- Algorithm Acceleration Guidelines
- Case Studies
  - Boolean Satisfiability
  - Fast SPICE model evaluation
  - Fault Simulation
  - SSTA
- Conclusion
- Q&A
Slide 60: Guidelines for GPU Acceleration of Software
- Current GPUs have an expensive communication link to the host; data transfers should be minimized
- Streams should be used to overlap communication and computation
- Partition kernels to increase the parallelism that can be leveraged
- Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
  - Reduce warp divergence
- Avoid bank conflicts when using shared memory (see the sketch below)
- Kernels should have high arithmetic intensity
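
As one concrete illustration of the shared-memory and coalescing guidelines (a standard tiled-transpose sketch under stated assumptions, not from the slides):

  #define TILE 16

  // Assumes a square width x width matrix with width a multiple of TILE,
  // launched with a (width/TILE, width/TILE) grid of (TILE, TILE) blocks.
  __global__ void transposeTile(const float *in, float *out, int width)
  {
      __shared__ float tile[TILE][TILE + 1];     // +1 column of padding avoids bank conflicts
      int x = blockIdx.x * TILE + threadIdx.x;
      int y = blockIdx.y * TILE + threadIdx.y;

      tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced global read
      __syncthreads();

      int tx = blockIdx.y * TILE + threadIdx.x;
      int ty = blockIdx.x * TILE + threadIdx.y;
      out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];   // coalesced write, conflict-free read
  }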
Slide 61: Case Studies
- Two approaches for accelerating an algorithm on the GPU
- Re-architecting approach
  - Applicable when the problem does not have an inherent SIMD nature
  - May require significant algorithmic modifications
  - Examples
    - Boolean Satisfiability
    - Fault Dictionary Computation (not covered in this talk, slides at end)
- Porting approach
  - Applicable when the problem runtime is dominated by a subroutine, multiple invocations of which operate upon independent data
  - Partition the subroutine into GPU kernels
  - Examples
    - Accelerating SPICE by porting model evaluation to the GPU
    - Fault Simulation
    - Monte Carlo based statistical static timing analysis (SSTA)
Slide 62: Boolean Satisfiability (SAT)
- Given a Boolean formula in conjunctive normal form (CNF)
  - Either find a satisfying truth assignment of all variables
  - Or prove that there is no satisfying assignment
- Example: with decisions x = true, y = true, the unassigned literal z gets implied because of the unit clause rule
  - Implication: z = false
- Iterative application of the unit clause rule is called Boolean constant propagation (BCP)
- Recent BCP-based SAT solvers incorporate conflict-driven learning
  - A learned clause represents the search space that has been pruned
[Figure: an example clause with positive and negative literals under the decisions x = true, y = true]
Slide 63: Approach
- Complete approaches for SAT
  - Are exact, but the algorithms do not easily lend themselves to parallel implementations. Examples: GRASP, zChaff, CirCUs, MiniSAT
- Stochastic approaches for SAT
  - Can execute at high speeds and are scalable, but are not exact. Examples: Survey Propagation, WalkSAT, RandomSAT
- Present a hybrid procedure for SAT
  - Retains the best features of complete and stochastic approaches
  - Proposed algorithm is based on MiniSAT (implemented on the CPU)
  - The variable ordering heuristic of MiniSAT is enhanced by a survey propagation (SP) based procedure, which is implemented on the GPU
  - Proposed approach is called MESP (MiniSAT enhanced with SP)
- Next few slides
  - Discuss the GPU-based SP implementation
  - Describe our MESP approach
[Figure: MESP = MiniSAT + SP]
Slide 64: Survey Propagation (SP) based SAT
- Factor graph: graphical representation of a SAT instance
  - Variable nodes (variables)
  - Function nodes (clauses)
  - Is a tree if it has no cycles
- SP is an algorithm in which agreement between clauses and variables is reached by sending probabilistic messages along edges of the factor graph (message passing)
- Pros: highly scalable, parallelizable, exact for factor graphs that are trees
- Cons: incomplete for non-tree factor graphs
Slide 65: Survey Propagation Equations
- Notation
  - α, β are clauses; i, j are variables
  - V+(i): set of all clauses where i appears in the positive form
  - V-(i): set of all clauses where i appears in the negative form
  - η(α→i) is a warning (a probability) from clause α to variable i
  - Let i be in the positive form in α
  - The η's and π's are iteratively computed until convergence
[Equations on the slide: the η and π message updates (during computation) and the bias expressions (after convergence)]
Slide 66: Survey Propagation Flowchart
[Flowchart on the slide; the main steps are:]
- Randomly initialize the η(α→i); fixed variables and satisfied clauses are ignored
- Compute the π's, then compute η(α→i)new
- If Σ|η(α→i)new - η(α→i)| > ε, set η(α→i) ← η(α→i)new and iterate; if the iteration count exceeds the maximum, declare non-convergence
- On convergence, if Σ η(α→i) = 0, call WalkSAT to determine a satisfying assignment
- Otherwise compute the biases W, sort variables in decreasing order of W, and fix the first x% of the variables
- If a contradiction arises, report it and quit
Slide 67: Survey Propagation on the GPU
- Implemented GPU kernels for the following
  - Compute the π's, for all variables (V) in parallel
  - Compute the η's, for all clauses (C) in parallel
    - In particular, compute η(α→i)new for each variable i in clause α
  - Check convergence (Σ|η(α→i)new - η(α→i)| ≤ ε) using a reduced integer add operation over all literals in all clauses
  - Compute Σ(η(α→i)) (to determine whether convergence is non-trivial) using a reduced float add operation
  - Compute the W's (biases), for all variables in parallel
  - Parallel bitonic sort to find the largest x% of the W's
- The CPU performs conditional checks, fixes variables and executes WalkSAT
Slide 68: Data Structure on the GPU
- Per-variable data (static): for each of the V variables, the clauses it appears in and the literal polarity in each
- Per-clause data (static): for each of the C clauses, its variables and their polarities
- η(α→i) array (indexed by clause): written by clauses, read by variables
- π+ and π- arrays (indexed by variable): written by variables, read by clauses
- With 1 GB of global memory, the GTX 280 GPU can fit instances with up to 10M clauses and 1M variables
Slide 69: Survey Propagation on the GPU
- Memory transfers between GPU and CPU
  - Single transfer for the static per-variable and per-clause data
  - During the computation of π and η there are no transfers at all; all intermediate data is stored in the global memory of the GPU
  - After convergence is detected, the sorted list of variables in decreasing order of biases is transferred (GPU → CPU)
  - After the graph is simplified, the following are updated (CPU → GPU)
    - Variables that are fixed (don't contribute to the η computation)
    - Clauses that are satisfied (don't contribute to the π computation)
Slide 70: Results (GPU-based SP)
- MESP is compared against
  - Braunstein et al. 2005 (B05) and MiniSAT, which were executed on a 3.6 GHz, 3 GB Intel machine running Linux
  - Manolios et al. 2006 (M06), which uses OpenGL on an NVIDIA GTX 7900 (512 MB memory, 128 cores, 750 MHz) to implement survey propagation
- For hard random instances MESP shows a 22x speedup over B05
  - M06 reports a 9x speedup over B05
Slide 71: MESP
- The SAT instance is read into MiniSAT and onto the GPU (executing SP)
- MiniSAT is first invoked on the instance and, after it has made some progress, it invokes GPU-based SP. MiniSAT transfers to SP:
  - The current assignments, and
  - A subset of the current learned clauses
- Augment the current clause database in GPU-based SP with 3 sets of learned clauses (LC): C1, C2 and C3, where L is the number of literals in an LC
  - C1 (0 < L ≤ 10), C2 (10 < L ≤ 25), C3 (25 < L ≤ 50)
  - Statically allocate enough space in the GPU's global memory to store 8K clauses in each of C1, C2 and C3
- Messages computed over all clauses (η) are now computed in 4 separate kernels, one for each set of clauses (C1, C2, C3 and C)
- On convergence, SP (in MESP) fixes variables for which the absolute bias difference |W(+) - W(-)| < t
Slide 72: MESP
- MiniSAT decides the next variable to assign based on the Variable State Independent Decaying Sum (VSIDS) heuristic
  - VSIDS chooses the next decision variable with the highest activity
  - Activity is the variable occurrence count, with a higher weight on the variables of the more recently added learned clauses
  - Activity of the variables in the learned clauses is incremented by FM
- In MESP, a GPU-based SP invocation can return with the following outcomes
  - SP converges and fixes certain variables S; MiniSAT updates the activity of the variables in S by FSP
  - SP converges, fixes S and determines the factor graph is a tree; it invokes WalkSAT. If WalkSAT finds an assignment, the instance is solved. Else the fixed variables in S are returned to MiniSAT
  - SP converges but does not fix any variable; MiniSAT continues the search
  - SP does not converge or reports a contradiction
Slide 73: MESP
[Flowchart: interaction between MiniSAT (complete, on the CPU) and Survey Propagation (stochastic, on the GPU)]
- MiniSAT performs the initial search and passes the current assignments and a subset of the learned clauses (from its decision tree) to the GPU
- The GPU attempts to converge on the SP messages
- The GPU works in conjunction with the CPU to fix variables; the CPU instructs the GPU to ignore fixed variables and satisfied clauses
- Activity is updated (in the activity table) for the variables S that are fixed in SP
- MiniSAT continues the search using the updated activities
Slide 74: Results
- MESP run on a GTX 280 GPU card with an Intel i7 CPU (2.6 GHz, 9 GB RAM) running Linux; MiniSAT run on the same CPU. Runtimes in seconds
- D = 1% of the number of variables; FSP = FM = 1; C = 20; t = 0.01
- The learned clauses on the GPU were updated at every 5th invocation of SP
  - Up to 24K learned clauses
- None of these instances were solved in MESP by an invocation to WalkSAT
Slide 75: Summary
- MESP is a GPU-enhanced variable ordering heuristic for SAT
- GPU-based survey propagation
  - π's for all variables and η's for all clauses computed in parallel
  - Check convergence using a reduced integer add operation over all literals in all clauses
  - Test for non-trivial convergence using a reduced float add operation
  - Compute biases for all variables in parallel
  - Parallel bitonic sort to find the largest x% of the biases
- Survey propagation enhances the variable ordering in MESP
  - Augment the clause database on the GPU with 3 sets of learned clauses
  - η's for all clauses computed in 4 different kernels
- On average MESP is 64% (92%) faster than MiniSAT on the original (3-SAT) instances
Slide 76: SPICE Model Evaluation on a GPU
- SPICE is the de facto industry standard for VLSI circuit simulation
- Significant motivation for accelerating SPICE simulations without losing accuracy
  - Increasing complexity and size of VLSI circuits
  - Increasing impact of process variations on the electrical behavior of circuits
    - Requires Monte Carlo based simulations
- Accelerate the computationally expensive portion of SPICE, transistor model evaluation, on a GPU
- Proposed approach is integrated into a commercial SPICE accelerator tool, OmegaSIM
  - Already 10-1000x faster than traditional SPICE implementations
  - With the proposed approach integrated, OmegaSIM achieves a further speedup of 2.36X (3.07X) on average (max)
Slide 77: Approach
- Profiled SPICE simulations over several benchmarks
  - 75% of time spent in BSIM3 device model evaluations
  - Billions of calls to device model evaluation routines
    - Every device in the circuit is evaluated for every time step
    - Possibly repeatedly, until the Newton-Raphson loop for solving the non-linear equations converges
  - Asymptotic speedup of 4X considering Amdahl's law
- These calls are parallelizable
  - Since they are independent of each other
  - Each call performs identical computations on different data
    - Conforms to the GPU's SIMD operating paradigm
Slide 78: Approach
- CDFG-guided manual partitioning of the BSIM3 evaluation code
- Limitation on the available hardware resources
  - Registers (8192 per multiprocessor)
  - Shared memory (16 KB per multiprocessor)
  - Bandwidth to global memory (max. sustainable is 80 GB/s)
- If the entire BSIM3 model is implemented as a single kernel
  - The number of threads that can be issued in parallel is not enough
    - To hide global memory access latency
- If the BSIM3 code is partitioned into many (small) kernels
  - Requires large amounts of data transfer across kernels
    - Done using global memory (not cached)
    - Negatively impacts performance
- Proposed approach
  - Creates a CDFG of the BSIM3 equations
  - Uses maximally disconnected components of this graph as different kernels, considering the above hardware limitations
Slide 79: Approach
- Take GPU memory constraints into account
- Global memory
  - Used to store intermediate data which is generated by one kernel and needed by another (instead of transferring this data to the host)
- Texture memory
  - Used for storing runtime parameters
    - Device parameters that remain unchanged throughout the simulation
  - Advantages
    - It is cached, unlike global memory
    - No coalescing requirements, unlike global memory
    - No bank conflicts, such as are possible in shared memory
    - CUDA's efficient built-in texture fetching routines are used
    - Small texture memory loading overhead is easily amortized
- Constant memory used for storing physical constants
  - Most efficient when all threads access the same data
Slide 80: Experiments
- Proposed approach is implemented and integrated into a commercial SPICE accelerator tool, OmegaSIM
- Hardware used
  - CPU: Intel Core 2 Quad, 2.4 GHz, 4 GB RAM
  - GPU: GeForce 8800 GTS, 128 processors, 675 MHz, 512 MB RAM
- Comparing BSIM3 model evaluation alone
Slide 81: Experiments - Complete SPICE Simulation
- With an increase in the number of transistors, the speedup obtained is higher
  - More device evaluation calls are made in parallel, so latencies are better hidden
- High accuracy with a single-precision floating point implementation
  - Over 1M device evaluations: average (max.) error of 2.88 x 10^-26 (9.0 x 10^-22) Amperes
- Newer devices with double-precision capability are already on the market
Slide 82: Conclusions
- Significant interest in accelerating SPICE
  - 75% of the SPICE runtime is spent in BSIM3 model evaluation, which allows an asymptotic speedup of 4X
- Our approach of accelerating model evaluation using GPUs has been integrated with a commercial fast SPICE tool
  - Obtained a speedup of 2.36X on average
  - BSIM3 model evaluation alone can be sped up by 30-40X over 1M-2M calls
- Take GPU memory constraints into account
  - Global memory used to store intermediate data
  - Texture memory used for storing runtime parameters
  - Constant memory used for storing physical constants
- Carefully partition kernels since
  - If the entire BSIM3 model is implemented as a single kernel, the number of threads that can be issued in parallel is not enough to hide global memory access latency
  - If the BSIM3 code is partitioned into many (small) kernels, it requires large amounts of data transfer across kernels, done using global memory
Slide 83: Introduction - Fault Simulation
- Fault simulation (FS) is crucial in the VLSI design flow
  - Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (Fsim) tested by applying V
  - The ratio Fsim/Ftotal is a measure of the fault coverage
- Current designs have millions of logic gates
  - The number of faulty variations is proportional to the design size
  - Each of these variations needs to be simulated for the V vectors
- Therefore, it is important to explore ways to accelerate FS
- The ideal FS approach should be
  - Fast
  - Scalable
  - Cost effective
Slide 84: Approach
- Implement a look-up table (LUT) based FS
  - All gates' LUTs stored in texture memory (cached)
  - LUTs of all library gates fit in the texture cache
    - To avoid cache misses during lookup
  - An individual k-input gate LUT requires 2^k entries
  - Each gate's LUT entries are located at a fixed offset in the texture memory, as shown in the figure
- Gate output is obtained by accessing the memory at (gate offset + input value); see the sketch below
  - Example: output of an AND2 gate when the inputs are 1 and 0
[Figure: LUT entries at indices 0, 1, 2, 3 within a gate's region; for AND2 with inputs (1, 0) the entry at offset + 2 holds 0]
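
A minimal sketch of the LUT lookup idea (hypothetical layout and names; the host is assumed to have bound the concatenated truth tables to the texture with cudaBindTexture()):

  // 1D texture holding the concatenated truth tables of all library gates.
  texture<int, 1, cudaReadModeElementType> texLUT;

  // Hypothetical device helper: evaluate a 2-input gate by indexing its LUT.
  // 'offset' is the gate type's base entry; inputs a, b are 0/1 values.
  __device__ int evalGate2(int offset, int a, int b)
  {
      return tex1Dfetch(texLUT, offset + (a << 1) + b);   // gate offset + input value
  }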
Slide 85: Approach
- Evaluate two vectors for the same gate in a single thread
  - 1/2/3/4-input gates then require 4/16/64/256 entries in their LUTs respectively
  - Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates
  - Hence the total memory required for all LUTs is 1348 words
    - This fits in the texture memory cache (8 KB per MP)
- Exploit both fault and pattern parallelism
  - Fault parallel
    - All gates at a fixed topological level are evaluated in parallel
  - Pattern parallel
    - Simulations for any gate, for different patterns, are done in parallel
Slide 86: Approach
[Figure: for each gate, the good-circuit and faulty-circuit values for vectors 1, 2, ..., N]
- In practice, simulations for any gate, for different patterns, are done in 2 phases, for all the faults which lie in its TFI only
  - Phase 1: good circuit simulation; results returned to the CPU
  - Phase 2: faulty circuit simulation; the CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value
- Fault injection is also performed in parallel
Slide 87: Approach - Fault Injection / Fault Simulation

  typedef struct __align__(16) {
      int offset;        // Gate type's offset
      int a, b, c, d;    // Input values
      int m0, m1;        // Mask variables (for fault injection)
  } threadData;
Slide 88: Approach - Fault Detection

  typedef struct __align__(16) {
      int offset;                  // Gate type's offset
      int a, b, c, d;              // Input values
      int Good_Circuit_threadID;   // Good circuit simulation thread ID
  } threadData_Detect;
Slide 89: Approach
- We maximize GPU performance by ensuring that
  - No data dependency exists between threads issued in parallel
  - The same instructions, on different data, are executed by all threads
- We adapt to specific G80 memory constraints
  - LUT stored in texture memory. Key advantages are
    - Texture memory is cached
    - Total LUT size easily fits into the available cache size of 8 KB/MP
    - No memory coalescing requirements
    - Efficient built-in texture fetching routines available in CUDA
    - Non-zero time taken to load texture memory, but the cost is easily amortized
  - Global memory writes for level-i gates (and reads for level-(i+1) gates) are performed in a coalesced fashion
Slide 90: Results