Title: Introduction to GPU Programming for EDA
Slide 1: Introduction to GPU Programming for EDA
- John F. Croix, Cadence Design Systems, Inc.
- Sunil P. Khatri, Texas A&M University
- Acknowledgements: NVIDIA, Nascentric Inc., Accelicon Inc.
- Students: Kanupriya Gulati, Vinay Karkala, Kalyana Bollapalli
Slide 2: Outline
- GPU Architecture Overview
- GPU Programming
- Algorithm Acceleration Guidelines
- Case Studies
- Conclusion
- Q&A
Slide 3: Outline
- GPU Architecture Overview
  - Evolution and architecture
  - Peak performance
  - GPU and CPU interaction: practical considerations
- GPU Programming
- Algorithm Acceleration Guidelines
- Case Studies
- Conclusion
- Q&A
Slide 4: GPU Evolution
- In the early days, graphics accelerators were primitive
  - Acceleration of graphics rendering tasks for (CRT) displays
  - Many hardwired graphics acceleration units
- With VLSI technology scaling, the GPU was born
  - Many programmable processors to handle graphics rendering tasks
  - Increased peak memory bandwidth and peak performance
  - Goal was faster and more realistic rendering for gaming applications
- Recently, several scientific communities began to leverage these GPUs
  - Initially used graphics APIs like OpenGL and DirectX for these tasks
- GPU vendors recognized this interest
  - Development of C-like programming environments such as CUDA
  - Development of GPU architectures tuned for scientific computations
Slide 5: GPU Introduction
- A GPU is essentially a commodity stream processor
  - Highly parallel (100s of processor cores)
  - Very fast (>900 GFLOPS of peak performance)
- Operates in a SIMD manner; this is a key restriction
  - Multiple processors operate in lock-step (same instruction) but on different data
- GPUs, owing to their massively parallel architecture, have been used to accelerate
  - Image/stream processing, data compression, numerical algorithms
  - Recently they have been used to accelerate CAD algorithms as well
- Inexpensive, off-the-shelf cards like the NVIDIA Quadro FX / GTX 280 GPUs achieve impressive performance
  - 933 GFLOPS peak performance
  - 240 SIMD cores partitioned into 30 multiprocessors (MPs)
  - 4 GB (Quadro) or 1 GB (GTX 280) device memory with 142 GB/s bandwidth
  - 1.4 GHz GPU operating frequency
  - Programmed with the Compute Unified Device Architecture (CUDA) framework
Slide 6: GPU Architecture
- In the GTX 280, there are 10 Thread Processing Clusters (TPCs)
- Each TPC has 3 Streaming Multiprocessors (SMs), which we will refer to as multiprocessors (MPs)
- Each MP has 8 Streaming Processors (SPs) or Thread Processors (TPs); we will refer to these as processors
- 240 processors and 30 MPs in all!
- One double-precision FP unit per SM
(Source: NVIDIA)
Slide 7: GPU vs CPU: NVIDIA GTX 280 vs Intel i7 860
[Comparison table on the slide]
- Footnotes: (1) http://ark.intel.com/Product.aspx?id=41316; (2) TPC = Thread Processing Cluster (24 cores); (3) 30 multiprocessors in a GTX 280
Slide 8: GPU vs CPU Peak Performance Trends
- GPU peak performance has grown aggressively
- Hardware has kept up with Moore's law
(Source: NVIDIA)
Slide 9: GPU Programming Model
- The GPU is viewed as a compute device that
  - Is a coprocessor (slave) to the CPU (host)
  - Has its own DRAM (device memory) but no virtual memory
    - Entire design instance may not fit on the GPU!
- A kernel is a CPU-callable function; a thread is an instance of a kernel
- The GPU runs many threads in parallel
[Figure: host (CPU) and device (GPU) connected over PCIe; a kernel launched from the host runs as many threads on the GPU, which has its own device memory]
Slide 10: Data Transfers (CPU ↔ GPU)
- GPUs and CPUs communicate via a PCIe bus
  - This communication is expensive and should be minimized for target applications
- Graphics applications usually require
  - Initial data to be sent from CPU to GPU
  - A single transfer of processed data from GPU to CPU
- General-purpose computations usually require
  - Multiple transfers between CPU and GPU (since conditional checks run on the CPU)
  - Possibility of saturating the PCIe bus and reducing the achievable performance
Slide 11: GPU Threads vs. CPU Threads
- GPU threads
  - Lightweight: small creation and scheduling overhead, extremely fast hardware context switching
  - Need to issue 1000s of GPU threads to hide global memory latencies (600-800 cycles)
- CPU threads
  - Heavyweight: large scheduling overhead, slow context switching
- Multi-GPU usage requires invocation of multiple CPU threads
  - Each CPU thread creates a GPU context
  - Context swapping is required for a CPU thread to access GPU memory allocated by another CPU thread
Slide 12: Device Memory Space Overview
- Each thread runs on an SP and has
  - R/W per-thread registers (on-chip)
    - Limit usage (max 16K/MP)
  - R/W per-thread local memory (off-chip)
  - R/W per-block shared memory (on-chip)
    - Need to avoid bank conflicts
  - R/W per-grid global memory (off-chip)
    - Not cached, 600-800 cycle read
    - Latency hidden by parallelism and fast context switches
    - Main means for data transfer between host and device
    - Coalescing recommended
  - RO per-grid cached constant and texture memory (off-chip)
- The host can R/W global, constant and texture memories (visible to all threads)
[Figure: device grid of blocks; each block has shared memory and per-thread registers/local memory; global, constant and texture memory are shared across the grid and accessible from the host. Source: NVIDIA CUDA Programming Guide version 1.1]
Slide 13: Outline
- GPU Architecture Overview
- GPU Programming
  - CPU threads
  - Conditional and loop processing
  - Floating point
  - General GPU program structure
  - CUDA and OpenCL
- Algorithm Acceleration Guidelines
- Case Studies
- Conclusion
- Q&A
Slide 14: CPU Threading
- CPU
  - All threads are equivalent
  - Read/write concurrently to the same memory
  - Synchronization primitives required to avoid collisions
- GPU (NVIDIA)
  - Each CPU thread maintains a unique context
  - GPU resources (e.g. memory, code modules, address space) are context-specific
  - Each CPU thread can access a single context at once
  - Contexts must be exchanged between CPU threads to share GPU resources between CPU threads
  - Contexts use reference counting and are automatically destroyed
Slide 15: SIMD Conditional Processing
- Unlike threads in a CPU-based program, SIMD programs cannot follow different execution paths
- Ideal scenario
  - All GPU threads follow the same execution path
  - All processors active continuously
- In divergent paths, some processors execute the then-block and others the else-block
  - Program flow cannot actually diverge; all instructions are executed
  - The then- and else-blocks are both executed
  - A bit is used to enable/disable processors based on the block being executed
  - Parallelism is reduced, impacting performance (see the sketch after this list)
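
A minimal sketch of such divergence (hypothetical kernel, not from the slides):

  // Hypothetical CUDA kernel illustrating warp divergence. Threads with even
  // and odd indices take different branches; within a warp the hardware
  // executes both branches serially, masking off the threads not on each path.
  __global__ void divergentKernel(float *data, int n)
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n) {
          if (tid % 2 == 0)                      // half the threads of each warp
              data[tid] = data[tid] * 2.0f;      // then-block
          else                                   // the other half
              data[tid] = data[tid] + 1.0f;      // else-block
      }
  }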
Slide 16: Idle Processors
- Idle CPU processors can be dynamically rescheduled by the OS
- SIMD processors are not actually idle
  - All processors scheduled are following identical execution paths
  - Disabled (idle) processors are unavailable for other work and cannot be rescheduled
- Effective utilization of processors is the programmer's responsibility
  - Scheduling is an art, not necessarily a science
  - Techniques will vary from chip to chip
Slide 17: Conditional Processing
- (Illustration of conditional processing on the slide)
Slide 18: Nested Conditional Processing
- Nested conditionals compound the problem: every block is executed, with processors masked off along the way

  if (condition) {
      if (condition2) {
          ...
      } else {
          ...
      }
  } else {
      ...
  }
Slide 19: Loop Processing
- Loops with data-dependent trip counts behave like conditionals: the processors in a group keep stepping through the loop until every one of them has exited

  while (condition) {
      if (cond2) {
          ...
      }
      ...
  }
Slide 20: The Cost of Memory Access
- Registers are extremely fast, but are a limited resource
- Cached memories also tend to be small
- For large data sets, global memory provides read/write access
  - Accesses take between 600 and 800 clock cycles
  - Accesses are not cached
- To hide memory latency, the hardware provides fast context switches when memory is accessed
  - However, there must be enough computational work to hide the high cost of memory access
- Programmers need to be smart
  - Compilers often don't provide the necessary optimizations when optimizing for speed instead of code size
  - It can sometimes be cheaper to recompute a result than perform a memory read/write
Slide 21: Conditional Processing
- Where a variable is loaded relative to a conditional affects when memory accesses (and the associated context swaps) occur; the slide's "Access/Swap" annotations mark those points in the two versions below

  // Version 1: read before the conditional
  float a = someVar;
  if (condition) {
      ...
  } else {
      ...
  }
  ...

  // Version 2: read inside each branch
  if (condition) {
      ...
      float a = someVar;
      ...
  } else {
      ...
      float a = someVar;
      ...
  }
  ...
Slide 22: Floating Point
- GPUs are optimized for 32-bit accesses
- 64-bit double-precision values are fetched from memory as two 32-bit quantities
  - May impact performance in the event of memory bank conflicts
- One double-precision unit per multiprocessor (1)
  (1) http://www.ddj.com/hpc-high-performance-computing/210102115
Slide 23: OpenCL vs. CUDA
- CUDA uses early code binding
  - Code is compiled with normal C/C++/FORTRAN (beta) source code
  - Need the CUDA occupancy calculator to determine the number of threads based on resource utilization
  - Library support: BLAS, FFT, DPT
- OpenCL
  - Late binding of OpenCL code to executable
  - OpenCL compiler/linker embedded within the application
  - No need for the CUDA occupancy calculator
  - Only supports C
  - No libraries
Slide 24: CUDA Occupancy Calculator
- (Screenshot of the occupancy calculator on the slide)
Slide 25: OpenCL vs. CUDA
- (Comparison table on the slide)
Slide 26: General Program Structure
- Initialize GPU
- Create GPU context
- Build GPU program
- Allocate GPU memory
- Transfer data from CPU to GPU
- Invoke GPU functions
- Transfer data from GPU to CPU
- Deallocate GPU memory
- Finalize GPU usage
A minimal host-side sketch of this flow is shown below.
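
For illustration, a minimal CUDA host-side sketch of the above flow (hypothetical kernel and buffer names; error checking omitted):

  #include <cuda_runtime.h>
  #include <stdlib.h>

  __global__ void myKernel(float *d_data, int n);    // assumed to exist elsewhere

  int main()
  {
      const int n = 1 << 20;
      size_t bytes = n * sizeof(float);
      float *h_data = (float *)malloc(bytes);        // host buffer (left uninitialized here)
      float *d_data = NULL;

      cudaSetDevice(0);                              // initialize GPU (context created implicitly)
      cudaMalloc((void **)&d_data, bytes);           // allocate GPU memory
      cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);   // CPU -> GPU

      myKernel<<<n / 256, 256>>>(d_data, n);         // invoke GPU function
      cudaThreadSynchronize();                       // wait for completion (CUDA 2.x API)

      cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);   // GPU -> CPU
      cudaFree(d_data);                              // deallocate GPU memory
      free(h_data);
      return 0;
  }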
Slide 27: Create GPU Context
- CUDA
  - Context creation is implicit in single-threaded programs
  - Multiple contexts can be explicitly created
  - Each thread maintains a context stack
    - Top context is the current context
  - Threads
    - Contexts can be swapped between threads
    - A thread can only have one context active at a time (stack)
    - A context cannot be shared simultaneously between threads
- OpenCL
  - All commands are explicitly associated with a context
  - Must create a command queue to invoke commands
Slide 28: Initialize GPU
- CUDA
  - cudaGetDeviceCount()
  - cudaSetDevice()
  - cudaGetDeviceProperties()

  CUDA::CUDA(int Device) : Base()
  {
      mValid = false;
      int DeviceCount;
      cudaGetDeviceCount( &DeviceCount );
      if (!DeviceCount)
          return;
      Device = (Device == -1) ? DeviceCount - 1 : Device;
      cudaSetDevice( Device );
      mValid = true;
  }
Slide 29: Initialize GPU
- OpenCL
  - Context must be built before anything can be done on the GPU
  - All commands are with respect to a given context

  OpenCL::OpenCL(int Device) : Base()
  {
      init();   // Initialize class pointers to NULL
      cl_int RC;
      mGPUContext = clCreateContextFromType( 0, CL_DEVICE_TYPE_GPU, NULL, NULL, &RC );
      size_t Bytes;
      RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, 0, NULL, &Bytes );
      int NumDevices = Bytes / sizeof( cl_device_id );
      cl_device_id *Devices = new cl_device_id[ NumDevices ];
      RC = clGetContextInfo( mGPUContext, CL_CONTEXT_DEVICES, Bytes, Devices, NULL );
      mCommandQueue = clCreateCommandQueue( mGPUContext, Devices[ Device ], 0, &RC );
      size_t MaxWorkItemSizes[ 256 ];
      RC = clGetDeviceInfo( Devices[ Device ], CL_DEVICE_MAX_WORK_ITEM_SIZES,
                            sizeof( MaxWorkItemSizes ), MaxWorkItemSizes, NULL );
      mMaxWorkItems = MaxWorkItemSizes[ 0 ];
      mMaxWorkItemsMask = ~(mMaxWorkItems - 1);
  }
Slide 30: Build GPU Program
- CUDA
  - GPU code is compiled using the nvcc compiler
  - Object code is statically bound to the CPU executable
  - GPU code is intrinsically part of the program
  - Mapping of problem to threads is performed at compile time
Slide 31: Build GPU Program
- OpenCL
  - GPU code is bound at runtime to the GPU
    - OpenCL compiler is part of the executable
  - Code can be source code or object code
    - Source code can be dynamically generated by the program
    - Can be stored in an external file

  // Continued from constructor
  char *code = shrFindFilePath( "code.cl", "." );
  size_t CodeLength = 0;
  char *Source = oclLoadProgSource( code, "", &CodeLength );
  const char *SourceCode = Source;
  mProgram = clCreateProgramWithSource( mGPUContext, 1, &SourceCode,
                                        &CodeLength, &RC );
  RC = clBuildProgram( mProgram, 0, NULL, NULL, NULL, NULL );
  std::free( code );
  std::free( Source );
  mValid = (RC == CL_SUCCESS);
Slide 32: Allocate/Deallocate GPU Memory
- CUDA
  - Most frequently used allocator: cudaMalloc()
  - Returns a memory pointer to GPU memory
  - Memory pointer cannot be used by the CPU directly
    - Passed to GPU calls

  void *CUDA::malloc(size_t Bytes)
  {
      void *Memory;
      cudaError_t RC = cudaMalloc( &Memory, Bytes );
      return( RC == cudaSuccess ? Memory : NULL );
  }

  void CUDA::free(void *Memory)
  {
      if (Memory)
          cudaFree( Memory );
  }
Slide 33: Allocate/Deallocate GPU Memory
- OpenCL
  - Like all things, memory allocation is explicitly performed within a context

  void *OpenCL::malloc(size_t NumBytes)
  {
      size_t Size = NumBytes / 32 + ((NumBytes & 31) ? 1 : 0);
      cl_int RC;
      cl_mem Memory = clCreateBuffer( mGPUContext, CL_MEM_READ_WRITE,
                                      Size, NULL, &RC );
      return( RC == CL_SUCCESS ? Memory : NULL );
  }

  void OpenCL::free(void *Memory)
  {
      if (Memory) {
          cl_mem Ptr = reinterpret_cast<cl_mem>( Memory );
          clReleaseMemObject( Ptr );
      }
  }
Slide 34: CPU/GPU Data Transfer
- Data is moved across the PCIe bus
- CUDA
  - Data transfer accomplished via the cudaMemcpy() routine
    - Implicit synchronization point
    - Non-blocking copies are available
  - Direction is determined by an enumeration
    - cudaMemcpyHostToDevice
    - cudaMemcpyDeviceToHost
  - Allocated memory can be bound to texture memory (see the sketch below)
    - cudaBindTexture
- OpenCL
  - Memory transfer via clEnqueueWriteBuffer() and clEnqueueReadBuffer()
  - Synchronization controlled by parameters to the calls
    - Default is non-blocking
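
A hedged illustration of binding allocated global memory to a texture (hypothetical names, using the texture-reference runtime API of this CUDA generation):

  #include <cuda_runtime.h>

  // File-scope texture reference through which kernels read the data.
  texture<float, 1, cudaReadModeElementType> texParams;

  void uploadParams(const float *h_params, int n, float **d_params)
  {
      size_t bytes = n * sizeof(float);
      cudaMalloc((void **)d_params, bytes);
      cudaMemcpy(*d_params, h_params, bytes, cudaMemcpyHostToDevice);  // CPU -> GPU
      cudaBindTexture(0, texParams, *d_params, bytes);  // kernels read via tex1Dfetch(texParams, i)
  }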
Slide 35: Call GPU Functions (Kernels)
- Functions on the CPU are executed when invoked
- GPU function calls from the CPU are placed on an execution queue
  - The CPU does not wait until the GPU function completes; the command is simply queued
  - The GPU executes commands on the queue using its own ordering
- Synchronization points cause the CPU to stall to wait for the GPU to return
  - CUDA: cudaThreadSynchronize()
Slide 36: GPU Function Calls
- GPU function calls have an associated dimensionality (which can be 1D, 2D or 3D)
- CUDA
  - Extended language syntax to include the problem dimension
  - Syntax: function<<<dimGrid, dimBlock>>>( arguments )
- OpenCL
  - Must explicitly put function arguments into the context
    - clSetKernelArg()
  - Invoke the kernel using the context
    - Kernel retrieves arguments from the context automatically
Slide 37: GPU Cleanup/Termination
- CUDA
  - Manages most cleanup operations automatically as a context is destroyed
- OpenCL
  - Provides low-level APIs for deallocation of all resources
  - Invoked in the order opposite to invocation
    - clReleaseKernel()
    - clReleaseProgram()
    - clReleaseCommandQueue()
    - clReleaseContext()
Slide 38: Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks (aka blocks)
- A thread block is a batch of threads that can cooperate with each other by
  - Synchronizing their execution
    - Diverging execution results in performance loss
  - Efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 and Kernel 2, each executing on the device as a grid of thread blocks. Source: NVIDIA CUDA Programming Guide version 1.1]
Slide 39: Block and Thread IDs
- Threads and blocks have IDs
  - So each thread can identify what data it will operate on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data (see the sketch below)
  - Image processing
  - Solving PDEs on volumes
  - Other problems with underlying 1D, 2D or 3D geometry
(Source: NVIDIA CUDA Programming Guide version 1.1)
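
For example (a minimal sketch, not from the slides), a 2D kernel typically derives a unique element coordinate from its block and thread IDs:

  // Hypothetical 2D kernel: each thread computes its (x, y) coordinate
  // from its block and thread IDs and processes one image pixel.
  __global__ void scalePixels(float *image, int width, int height, float s)
  {
      unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
      unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
      if (x < width && y < height)
          image[y * width + x] *= s;
  }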
Slide 40: GPU Kernels
- Each function is passed data to create a unique ID
- The data typically specifies the spatial coordinates of the executing processor within the hardware
- The ID is used to coordinate data access
  - Ensures that two threads' accesses do not collide
- CUDA function types (see the sketch below)
  - __global__
    - Callable by CPU
    - Cannot be called by GPU
  - __device__
    - Callable by other GPU functions
    - Cannot be called by CPU
  - CUDA expands these as inline functions via nvcc
    - Adds to function resource utilization
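
A minimal sketch of the two function types (hypothetical names):

  // A __device__ helper callable only from GPU code, and a __global__
  // kernel callable from the CPU.
  __device__ float square(float x)               // GPU-only helper, inlined by nvcc
  {
      return x * x;
  }

  __global__ void squareAll(float *data, int n)  // CPU-callable kernel
  {
      int tid = blockIdx.x * blockDim.x + threadIdx.x;
      if (tid < n)
          data[tid] = square(data[tid]);
  }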
Slide 41: OpenCL Kernel Invocation
- Use C++ templates to simplify argument handling

  template<typename T> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, T Arg)
  {
      return( clSetKernelArg( Kernel, Pos, sizeof( T ), &Arg ) );
  }

  template<> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, size_t SharedSize)
  {
      // This routine, unlike the others, sets up shared memory by passing
      // NULL in as the pointer to the variable.
      return( clSetKernelArg( Kernel, Pos, SharedSize, NULL ) );
  }

  template<> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, int Arg)
  {
      cl_int ArgInt = Arg;
      return( clSetKernelArg( Kernel, Pos, sizeof( ArgInt ), &ArgInt ) );
  }

  template<> inline cl_int
  setArg(cl_kernel Kernel, unsigned Pos, float Arg)
  {
      cl_float ArgFloat = Arg;
      return( clSetKernelArg( Kernel, Pos, sizeof( ArgFloat ), &ArgFloat ) );
  }
  ...

  template<typename T0> inline cl_int
  setArgs(cl_kernel Kernel, T0 Arg0)
  {
      return( setArg( Kernel, 0, Arg0 ) );
  }

  template<typename T0, typename T1> inline cl_int
  setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1)
  {
      return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) );   // combine return codes
  }

  template<typename T0, typename T1, typename T2> inline cl_int
  setArgs(cl_kernel Kernel, T0 Arg0, T1 Arg1, T2 Arg2)
  {
      return( setArg( Kernel, 0, Arg0 ) | setArg( Kernel, 1, Arg1 ) |
              setArg( Kernel, 2, Arg2 ) );
  }
  ...
Slide 42: OpenCL Kernel Invocation
- BLAS-like example
  - CUDA provides a BLAS library; OpenCL doesn't
  - Must write own BLAS routines in OpenCL to port between the two easily
  - The swap() function swaps the contents of 2 vectors with differing vector strides

  void OpenCL::blasSswap(int n, float *x, int incx, float *y, int incy)
  {
      if (!checkBLASKernel( mSswapKernel, "Sswap" ))
          return;
      mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
      if (x && y) {
          if (setArgs( mSswapKernel, n, x, incx, y, incy ) == CL_SUCCESS)
              executeBLASKernel( mSswapKernel, n );
      }
  }
Slide 43: OpenCL Kernel Invocation

  bool OpenCL::checkBLASKernel(cl_kernel &Kernel, const char *KernelName)
  {
      if (!mValid) {
          mLastBLASStatus = Base::BLAS_NOT_INITIALIZED;
          return( false );
      }
      if (!(Kernel)) {
          cl_int RC;
          Kernel = clCreateKernel( mProgram, KernelName, &RC );
          if (RC != CL_SUCCESS) {
              mLastBLASStatus = Base::BLAS_INTERNAL_ERROR;
              return( false );
          }
      }
      return( true );
  }

  inline void OpenCL::executeBLASKernel(cl_kernel Kernel, int n)
  {
      size_t Size = n;
      size_t GlobalWorkSize = Size & mMaxWorkItemsMask;   // round down to a multiple of mMaxWorkItems
      if (Size & ~mMaxWorkItemsMask)
          GlobalWorkSize += mMaxWorkItems;                // then round up if there is a remainder
      cl_int RC = clEnqueueNDRangeKernel( mCommandQueue, Kernel, 1, NULL,
                                          &GlobalWorkSize, &mMaxWorkItems,
                                          0, NULL, NULL );
      clFinish( mCommandQueue );
      mLastBLASStatus = (RC == CL_SUCCESS) ? Base::BLAS_SUCCESS
                                           : Base::BLAS_EXECUTION_FAILED;
  }
Slide 44: OpenCL Kernels

  __kernel void Sswap(int n, __global float *x, int incx,
                      __global float *y, int incy)
  {
      const unsigned GID = get_global_id( 0 );
      if (GID < n) {
          int lx = (incx > 0) ? 0 : ((1 - n) * incx);
          int ly = (incy > 0) ? 0 : ((1 - n) * incy);
          float temp           = y[ ly + GID * incy ];
          y[ ly + GID * incy ] = x[ lx + GID * incx ];
          x[ lx + GID * incx ] = temp;
      }
  }

  http://developer.download.nvidia.com/OpenCL/NVIDIA_OpenCL_JumpStart_Guide.pdf
Slide 45: CUDA Kernels

  #include "kernel.cu"
  ...
  const unsigned int size_x = 256;
  const unsigned int size_y = 4096;
  ...
  dim3 grid(size_x / BLOCK_DIM, size_y / BLOCK_DIM, 1);
  dim3 threads(BLOCK_DIM, BLOCK_DIM, 1);
  transpose_naive<<< grid, threads >>>(d_odata, d_idata, size_x, size_y);
  cudaThreadSynchronize();
  ...

  #define BLOCK_DIM 16

  __global__ void transpose_naive(float *odata, float *idata, int width, int height)
  {
      unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
      unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
      if (xIndex < width && yIndex < height) {
          unsigned int index_in  = xIndex + width * yIndex;
          unsigned int index_out = yIndex + height * xIndex;
          odata[index_out] = idata[index_in];
      }
  }
Slide 46: Outline
- GPU Architecture Overview
- GPU Programming
- Algorithm Acceleration Guidelines
  - Streams and pinned memory
  - Thread scheduling
  - Parallel reduction
  - Program partitioning
  - Simultaneous graphics and algorithm processing
- Case Studies
- Conclusion
- Q&A
Slide 47: Streams
- Sequence of commands that execute serially
- Allow overlapping of memory transfers and kernel computations from different streams
  - Hides data transfer cost
- Implementable in CUDA devices with compute capability 1.1
- Host memory must be of type "pinned" (see the sketch below)
[Figure: timeline comparing the sequential schedule (H→D transfer, kernel computation, D→H transfer for Data1 then Data2) with a streamed schedule in which Data2's transfers overlap Data1's kernel computation]
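
A minimal CUDA sketch of this pattern, under stated assumptions (hypothetical kernel name, n a multiple of 512, pinned host memory via cudaMallocHost):

  #include <cuda_runtime.h>
  #include <string.h>

  __global__ void process(float *d_buf, int n);     // assumed to exist elsewhere

  void runStreams(const float *h_src, int n)
  {
      const int half = n / 2;
      float *h_pinned, *d_buf;
      cudaMallocHost((void **)&h_pinned, n * sizeof(float));   // pinned host memory
      memcpy(h_pinned, h_src, n * sizeof(float));
      cudaMalloc((void **)&d_buf, n * sizeof(float));

      cudaStream_t stream[2];
      for (int i = 0; i < 2; ++i)
          cudaStreamCreate(&stream[i]);

      // Each stream copies its half asynchronously and runs its kernel;
      // the second stream's transfer can overlap the first stream's kernel.
      for (int i = 0; i < 2; ++i) {
          int offset = i * half;
          cudaMemcpyAsync(d_buf + offset, h_pinned + offset,
                          half * sizeof(float), cudaMemcpyHostToDevice, stream[i]);
          process<<<half / 256, 256, 0, stream[i]>>>(d_buf + offset, half);
          cudaMemcpyAsync(h_pinned + offset, d_buf + offset,
                          half * sizeof(float), cudaMemcpyDeviceToHost, stream[i]);
      }
      cudaThreadSynchronize();   // wait for both streams (CUDA 2.x API)

      for (int i = 0; i < 2; ++i)
          cudaStreamDestroy(stream[i]);
      cudaFree(d_buf);
      cudaFreeHost(h_pinned);
  }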
Slide 48: Pinned Memory
- Memory on the host that is mapped to the device's address space and thus accessible directly by a kernel
- Has several advantages
  - There is no need to allocate a block in device memory and copy data between this block and the block in host memory; data transfers are implicitly performed as needed by the kernel
  - Bandwidth between host and device memories is higher
- Write-combining memory (see the sketch below)
  - Type of pinned memory where individual writes are aggregated into a larger write operation
  - Avoids internal L1, L2 cache writes, making more cache available for the rest of the application
  - Is not snooped during transfers across the PCI Express bus, which can improve transfer performance by up to 40%
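
A minimal sketch of allocating mapped, write-combined pinned memory with the CUDA runtime (illustrative helper; cudaSetDeviceFlags() must run before the context is created):

  #include <cuda_runtime.h>

  float *allocMappedBuffer(int n, float **d_alias)
  {
      float *h_buf = NULL;
      cudaSetDeviceFlags(cudaDeviceMapHost);                  // enable mapped pinned memory
      cudaHostAlloc((void **)&h_buf, n * sizeof(float),
                    cudaHostAllocMapped | cudaHostAllocWriteCombined);
      cudaHostGetDevicePointer((void **)d_alias, h_buf, 0);   // device-side alias of the host buffer
      return h_buf;                                           // kernels can read/write *d_alias directly
  }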
Slide 49: Threads and Scheduling in the GPU
- The GPU consists of multiprocessors, each of which has many processors
- A kernel is executed as a grid of blocks
- A thread block is a batch of threads that cooperate with each other by
  - Synchronizing their execution
    - Diverging execution results in performance loss
  - Efficiently sharing data through a low-latency shared memory
- All threads of a block reside on the same multiprocessor (max 1024/MP)
- Number of blocks a multiprocessor can process at once depends on register and shared memory usage per thread
(Source: NVIDIA CUDA Programming Guide version 1.1)
Slide 50: Threads and Scheduling in the GPU (cont'd)
- Before execution, a block is split into warps
  - A warp is a set of 32 threads which execute the same instruction on an MP
  - A half-warp is either the first 16 or second 16 threads of a warp
- Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
- Branch divergence occurs if threads of a half-warp diverge via a data-dependent conditional branch
  - The half-warp serially executes each branch path taken, ignoring the result from threads that are not on that path
  - Increases kernel execution time
- Warps of the same block are executed in a time-sliced fashion
Slide 51: Program Parallelism
- The GPU is designed to address applications that are data-parallel
- Parallelism is an inherent factor in determining the suitability of a problem for the GPU
  - In fact, applications in which enough parallelism cannot be exposed may be slower on a GPU in comparison to a single-threaded CPU
- Since the same program is executed for each data element, there is no sophisticated flow control
- Conditional checks need to be done on the CPU
  - Reduce the output of all threads, transfer the reduced result to the CPU, which tests the condition and appropriately issues further GPU threads
  - Can be expensive since transfers are done over the PCIe bus!
Slide 52: Parallel Reduction
- Perform a reduction of the data before transferring it to the CPU
- Tree-based reduction approach used within each thread block (see the sketch below)
- Reduction decomposed into multiple kernels to reduce the number of threads issued in the later stages of the tree-based reduction
[Figure: example of a tree-based SUM within a block, with __syncthreads() between levels]
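
A minimal sketch of the block-level tree-based sum (kernel name and launch parameters are illustrative; sequential addressing is used, as discussed on the next slide):

  __global__ void reduceSum(const float *g_in, float *g_out, int n)
  {
      extern __shared__ float sdata[];              // one float per thread
      unsigned int tid = threadIdx.x;
      unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;

      sdata[tid] = (i < n) ? g_in[i] : 0.0f;        // load into shared memory
      __syncthreads();

      // Tree-based reduction with sequential addressing: at each step the
      // active threads add an element s positions away, halving s each time.
      for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
          if (tid < s)
              sdata[tid] += sdata[tid + s];
          __syncthreads();
      }

      if (tid == 0)                                 // one partial sum per block
          g_out[blockIdx.x] = sdata[0];
  }

Launched as reduceSum<<<numBlocks, threadsPerBlock, threadsPerBlock * sizeof(float)>>>(...), it leaves one partial sum per block, which a follow-up kernel (or the CPU) reduces further.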
Slide 53: Parallel Reduction (cont'd)
- Types of optimization for efficient parallel reduction
  - Algorithmic optimizations
    - Avoid divergent warps
    - Avoid shared memory bank conflicts: sequential addressing
    - First addition during the global load halves the number of blocks
  - Code optimizations
    - Loop unrolling
    - Multiple adds per thread to increase the arithmetic intensity of kernels (high ratio of computation in the kernel to global reads and writes)
Slide 54: Parallel Reduction (cont'd)
- Example of a tree-based reduced sum
[Figure: shared-memory contents and active thread IDs (0, 2, 4, ..., 14; then 0, 4, 8, 12; then 0, 8; then 0) at each step of the tree-based sum; thread 0 ends with the block total (41)]
Slide 55: Parallel Reduction (cont'd)
[Figure: the same tree-based sum with sequential addressing; shared-memory bank IDs are shown, and the active threads (0-7, then 0-3, then 0-1, then 0) access distinct banks, avoiding bank conflicts]
Slide 56: Parallel Reduction (cont'd)
[Figure: continuation of the sequential-addressing example; shared-memory contents after each reduction step, ending with the total (41) held by thread 0]
Slide 57: Program Partitioning
- Assume a subroutine S is invoked N times in an application
- If a multiprocessor of the GPU has 16K registers and S uses x registers, the maximum parallelism is 16K/x
  - The GPU can do fast hardware context switches between the threads, which share the 16K registers
- Partitioning S into smaller kernels, each using fewer registers (y < x), allows more threads in flight
- However, data transfers between kernels become a significant overhead as the number of partitions increases
[Figure: subroutine S (x registers, time T) partitioned into kernels 1, 2, 3, each using y registers]
Slide 58: Simultaneous Graphics and Algorithm Processing
- If the same GPU is used for graphics and algorithmic processing
  - GPU resources may be saturated by the graphics application, leaving little bandwidth for other applications
  - The fixed size of GPU memory (without swap space) may cause application launch failure
  - Graphics tasks may cause cache pollution, which may cause erratic runtimes for general-purpose applications
    - Run warm-up code to flush out caches
  - A single kernel execution cannot be longer than 5 seconds
- Using a separate GPU for graphics and computation avoids the problems listed above
Slide 59: Outline
- GPU Architecture Overview
- GPU Programming
- Algorithm Acceleration Guidelines
- Case Studies
  - Boolean Satisfiability
  - Fast SPICE model evaluation
  - Fault Simulation
  - SSTA
- Conclusion
- Q&A
Slide 60: Guidelines for GPU Acceleration of Software
- Current GPUs have an expensive communication link to the host; data transfers should be minimized
- Streams should be used to overlap communication and computation
- Partition kernels to increase the parallelism that can be leveraged
- Full efficiency is realized when all 16 threads of a half-warp agree on their execution path
  - Reduce warp divergence
- Avoid bank conflicts when using shared memory (see the sketch below)
- Kernels should have high arithmetic intensity
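
As one concrete illustration of the shared-memory and coalescing guidelines (a standard tiled-transpose sketch under stated assumptions, not from the slides):

  #define TILE 16

  // Assumes a square width x width matrix with width a multiple of TILE,
  // launched with a (width/TILE, width/TILE) grid of (TILE, TILE) blocks.
  __global__ void transposeTile(const float *in, float *out, int width)
  {
      __shared__ float tile[TILE][TILE + 1];     // +1 column of padding avoids bank conflicts
      int x = blockIdx.x * TILE + threadIdx.x;
      int y = blockIdx.y * TILE + threadIdx.y;

      tile[threadIdx.y][threadIdx.x] = in[y * width + x];      // coalesced global read
      __syncthreads();

      int tx = blockIdx.y * TILE + threadIdx.x;
      int ty = blockIdx.x * TILE + threadIdx.y;
      out[ty * width + tx] = tile[threadIdx.x][threadIdx.y];   // coalesced write, conflict-free read
  }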
Slide 61: Case Studies
- Two approaches for accelerating an algorithm on the GPU
- Re-architecting approach
  - Applicable when the problem does not have an inherent SIMD nature
  - May require significant algorithmic modifications
  - Examples
    - Boolean Satisfiability
    - Fault Dictionary Computation (not covered in this talk, slides at end)
- Porting approach
  - Applicable when the problem runtime is dominated by a subroutine, multiple invocations of which operate upon independent data
  - Partition the subroutine into GPU kernels
  - Examples
    - Accelerating SPICE by porting model evaluation to the GPU
    - Fault Simulation
    - Monte Carlo based statistical static timing analysis (SSTA)
Slide 62: Boolean Satisfiability (SAT)
- Given a Boolean formula in conjunctive normal form (CNF)
  - Either find a satisfying truth assignment of all variables
  - Or prove that there is no satisfying assignment
- Example: with decisions x = true, y = true, the unassigned literal z gets implied because of the unit clause rule
  - Implication: z = false
- Iterative application of the unit clause rule is called Boolean constant propagation (BCP)
- Recent BCP-based SAT solvers incorporate conflict-driven learning
  - A learned clause represents the search space that has been pruned
[Figure: an example clause with positive and negative literals under the decisions x = true, y = true]
Slide 63: Approach
- Complete approaches for SAT
  - Are exact, but the algorithms do not easily lend themselves to parallel implementations. Examples: GRASP, zChaff, CirCUs, MiniSAT
- Stochastic approaches for SAT
  - Can execute at high speeds and are scalable, but are not exact. Examples: Survey Propagation, WalkSAT, RandomSAT
- Present a hybrid procedure for SAT
  - Retains the best features of complete and stochastic approaches
  - Proposed algorithm is based on MiniSAT (implemented on the CPU)
  - The variable ordering heuristic of MiniSAT is enhanced by a survey propagation (SP) based procedure, which is implemented on the GPU
  - Proposed approach is called MESP (MiniSAT enhanced with SP)
- Next few slides
  - Discuss the GPU-based SP implementation
  - Describe our MESP approach
[Figure: MESP = MiniSAT + SP]
Slide 64: Survey Propagation (SP) based SAT
- Factor graph: graphical representation of a SAT instance
  - Variable nodes (variables)
  - Function nodes (clauses)
  - Is a tree if it has no cycles
- SP is an algorithm in which agreement between clauses and variables is reached by sending probabilistic messages along edges of the factor graph (message passing)
- Pros: highly scalable, parallelizable, exact for factor graphs that are trees
- Cons: incomplete for non-tree factor graphs
Slide 65: Survey Propagation Equations
- Notation
  - α, β are clauses; i, j are variables
  - V+(i): set of all clauses where i appears in the positive form
  - V-(i): set of all clauses where i appears in the negative form
  - η(α→i) is a warning (a probability) from clause α to variable i
  - Let i be in the positive form in α
  - The η's and π's are iteratively computed until convergence
[Equations on the slide: the η and π message updates (during computation) and the bias expressions (after convergence)]
Slide 66: Survey Propagation Flowchart
[Flowchart on the slide; the main steps are:]
- Randomly initialize the η(α→i); fixed variables and satisfied clauses are ignored
- Compute the π's, then compute η(α→i)new
- If Σ|η(α→i)new - η(α→i)| > ε, set η(α→i) ← η(α→i)new and iterate; if the iteration count exceeds the maximum, declare non-convergence
- On convergence, if Σ η(α→i) = 0, call WalkSAT to determine a satisfying assignment
- Otherwise compute the biases W, sort variables in decreasing order of W, and fix the first x% of the variables
- If a contradiction arises, report it and quit
Slide 67: Survey Propagation on the GPU
- Implemented GPU kernels for the following
  - Compute the π's, for all variables (V) in parallel
  - Compute the η's, for all clauses (C) in parallel
    - In particular, compute η(α→i)new for each variable i in clause α
  - Check convergence (Σ|η(α→i)new - η(α→i)| ≤ ε) using a reduced integer add operation over all literals in all clauses
  - Compute Σ(η(α→i)) (to determine whether convergence is non-trivial) using a reduced float add operation
  - Compute the W's (biases), for all variables in parallel
  - Parallel bitonic sort to find the largest x% of the W's
- The CPU performs conditional checks, fixes variables and executes WalkSAT
Slide 68: Data Structure on the GPU
- Per-variable data (static): for each of the V variables, the clauses it appears in and the literal polarity in each
- Per-clause data (static): for each of the C clauses, its variables and their polarities
- η(α→i) array (indexed by clause): written by clauses, read by variables
- π+ and π- arrays (indexed by variable): written by variables, read by clauses
- With 1 GB of global memory, the GTX 280 GPU can fit instances with up to 10M clauses and 1M variables
Slide 69: Survey Propagation on the GPU
- Memory transfers between GPU and CPU
  - Single transfer for the static per-variable and per-clause data
  - During the computation of π and η there are no transfers at all; all intermediate data is stored in the global memory of the GPU
  - After convergence is detected, the sorted list of variables in decreasing order of biases is transferred (GPU → CPU)
  - After the graph is simplified, the following are updated (CPU → GPU)
    - Variables that are fixed (don't contribute to the η computation)
    - Clauses that are satisfied (don't contribute to the π computation)
Slide 70: Results (GPU-based SP)
- MESP is compared against
  - Braunstein et al. 2005 (B05) and MiniSAT, which were executed on a 3.6 GHz, 3 GB Intel machine running Linux
  - Manolios et al. 2006 (M06), which uses OpenGL on an NVIDIA GTX 7900 (512 MB memory, 128 cores, 750 MHz) to implement survey propagation
- For hard random instances MESP shows a 22x speedup over B05
  - M06 reports a 9x speedup over B05
Slide 71: MESP
- The SAT instance is read into MiniSAT and onto the GPU (executing SP)
- MiniSAT is first invoked on the instance and, after it has made some progress, it invokes GPU-based SP. MiniSAT transfers to SP:
  - The current assignments, and
  - A subset of the current learned clauses
- Augment the current clause database in GPU-based SP with 3 sets of learned clauses (LC): C1, C2 and C3, where L is the number of literals in an LC
  - C1 (0 < L ≤ 10), C2 (10 < L ≤ 25), C3 (25 < L ≤ 50)
  - Statically allocate enough space in the GPU's global memory to store 8K clauses in each of C1, C2 and C3
- Messages computed over all clauses (η) are now computed in 4 separate kernels, one for each set of clauses (C1, C2, C3 and C)
- On convergence, SP (in MESP) fixes variables for which the absolute bias difference |W(+) - W(-)| < t
Slide 72: MESP
- MiniSAT decides the next variable to assign based on the Variable State Independent Decaying Sum (VSIDS) heuristic
  - VSIDS chooses the next decision variable with the highest activity
  - Activity is the variable occurrence count, with a higher weight on the variables of the more recently added learned clauses
  - Activity of the variables in the learned clauses is incremented by FM
- In MESP, a GPU-based SP invocation can return with the following outcomes
  - SP converges and fixes certain variables S; MiniSAT updates the activity of the variables in S by FSP
  - SP converges, fixes S and determines the factor graph is a tree; it invokes WalkSAT. If WalkSAT finds an assignment, the instance is solved. Else the fixed variables in S are returned to MiniSAT
  - SP converges but does not fix any variable; MiniSAT continues the search
  - SP does not converge or reports a contradiction
Slide 73: MESP
[Flowchart: interaction between MiniSAT (complete, on the CPU) and Survey Propagation (stochastic, on the GPU)]
- MiniSAT performs the initial search and passes the current assignments and a subset of the learned clauses (from its decision tree) to the GPU
- The GPU attempts to converge on the SP messages
- The GPU works in conjunction with the CPU to fix variables; the CPU instructs the GPU to ignore fixed variables and satisfied clauses
- Activity is updated (in the activity table) for the variables S that are fixed in SP
- MiniSAT continues the search using the updated activities
Slide 74: Results
- MESP run on a GTX 280 GPU card with an Intel i7 CPU (2.6 GHz, 9 GB RAM) running Linux; MiniSAT run on the same CPU. Runtimes in seconds
- D = 1% of the number of variables; FSP = FM = 1; C = 20; t = 0.01
- The learned clauses on the GPU were updated at every 5th invocation of SP
  - Up to 24K learned clauses
- None of these instances were solved in MESP by an invocation to WalkSAT
Slide 75: Summary
- MESP is a GPU-enhanced variable ordering heuristic for SAT
- GPU-based survey propagation
  - π's for all variables and η's for all clauses computed in parallel
  - Check convergence using a reduced integer add operation over all literals in all clauses
  - Test for non-trivial convergence using a reduced float add operation
  - Compute biases for all variables in parallel
  - Parallel bitonic sort to find the largest x% of the biases
- Survey propagation enhances the variable ordering in MESP
  - Augment the clause database on the GPU with 3 sets of learned clauses
  - η's for all clauses computed in 4 different kernels
- On average MESP is 64% (92%) faster than MiniSAT on the original (3-SAT) instances
Slide 76: SPICE Model Evaluation on a GPU
- SPICE is the de facto industry standard for VLSI circuit simulation
- Significant motivation for accelerating SPICE simulations without losing accuracy
  - Increasing complexity and size of VLSI circuits
  - Increasing impact of process variations on the electrical behavior of circuits
    - Requires Monte Carlo based simulations
- Accelerate the computationally expensive portion of SPICE, transistor model evaluation, on a GPU
- Proposed approach is integrated into a commercial SPICE accelerator tool, OmegaSIM
  - Already 10-1000x faster than traditional SPICE implementations
  - With the proposed approach integrated, OmegaSIM achieves a further speedup of 2.36X (3.07X) on average (max)
Slide 77: Approach
- Profiled SPICE simulations over several benchmarks
  - 75% of time spent in BSIM3 device model evaluations
  - Billions of calls to device model evaluation routines
    - Every device in the circuit is evaluated for every time step
    - Possibly repeatedly, until the Newton-Raphson loop for solving the non-linear equations converges
  - Asymptotic speedup of 4X considering Amdahl's law
- These calls are parallelizable
  - Since they are independent of each other
  - Each call performs identical computations on different data
    - Conforms to the GPU's SIMD operating paradigm
Slide 78: Approach
- CDFG-guided manual partitioning of the BSIM3 evaluation code
- Limitation on the available hardware resources
  - Registers (8192 per multiprocessor)
  - Shared memory (16 KB per multiprocessor)
  - Bandwidth to global memory (max. sustainable is 80 GB/s)
- If the entire BSIM3 model is implemented as a single kernel
  - The number of threads that can be issued in parallel is not enough
    - To hide global memory access latency
- If the BSIM3 code is partitioned into many (small) kernels
  - Requires large amounts of data transfer across kernels
    - Done using global memory (not cached)
    - Negatively impacts performance
- Proposed approach
  - Creates a CDFG of the BSIM3 equations
  - Uses maximally disconnected components of this graph as different kernels, considering the above hardware limitations
Slide 79: Approach
- Take GPU memory constraints into account
- Global memory
  - Used to store intermediate data which is generated by one kernel and needed by another (instead of transferring this data to the host)
- Texture memory
  - Used for storing runtime parameters
    - Device parameters that remain unchanged throughout the simulation
  - Advantages
    - It is cached, unlike global memory
    - No coalescing requirements, unlike global memory
    - No bank conflicts, such as are possible in shared memory
    - CUDA's efficient built-in texture fetching routines are used
    - Small texture memory loading overhead is easily amortized
- Constant memory used for storing physical constants
  - Most efficient when all threads access the same data
Slide 80: Experiments
- Proposed approach is implemented and integrated into a commercial SPICE accelerator tool, OmegaSIM
- Hardware used
  - CPU: Intel Core 2 Quad, 2.4 GHz, 4 GB RAM
  - GPU: GeForce 8800 GTS, 128 processors, 675 MHz, 512 MB RAM
- Comparing BSIM3 model evaluation alone
Slide 81: Experiments - Complete SPICE Simulation
- With an increase in the number of transistors, the speedup obtained is higher
  - More device evaluation calls are made in parallel, so latencies are better hidden
- High accuracy with a single-precision floating point implementation
  - Over 1M device evaluations: average (max.) error of 2.88 x 10^-26 (9.0 x 10^-22) Amperes
- Newer devices with double-precision capability are already on the market
Slide 82: Conclusions
- Significant interest in accelerating SPICE
  - 75% of the SPICE runtime is spent in BSIM3 model evaluation, which allows an asymptotic speedup of 4X
- Our approach of accelerating model evaluation using GPUs has been integrated with a commercial fast SPICE tool
  - Obtained a speedup of 2.36X on average
  - BSIM3 model evaluation alone can be sped up by 30-40X over 1M-2M calls
- Take GPU memory constraints into account
  - Global memory used to store intermediate data
  - Texture memory used for storing runtime parameters
  - Constant memory used for storing physical constants
- Carefully partition kernels since
  - If the entire BSIM3 model is implemented as a single kernel, the number of threads that can be issued in parallel is not enough to hide global memory access latency
  - If the BSIM3 code is partitioned into many (small) kernels, it requires large amounts of data transfer across kernels, done using global memory
Slide 83: Introduction - Fault Simulation
- Fault simulation (FS) is crucial in the VLSI design flow
  - Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (Fsim) tested by applying V
  - The ratio Fsim/Ftotal is a measure of the fault coverage
- Current designs have millions of logic gates
  - The number of faulty variations is proportional to the design size
  - Each of these variations needs to be simulated for the V vectors
- Therefore, it is important to explore ways to accelerate FS
- The ideal FS approach should be
  - Fast
  - Scalable
  - Cost effective
Slide 84: Approach
- Implement a look-up table (LUT) based FS
  - All gates' LUTs stored in texture memory (cached)
  - LUTs of all library gates fit in the texture cache
    - To avoid cache misses during lookup
  - An individual k-input gate LUT requires 2^k entries
  - Each gate's LUT entries are located at a fixed offset in the texture memory, as shown in the figure
- Gate output is obtained by accessing the memory at (gate offset + input value); see the sketch below
  - Example: output of an AND2 gate when the inputs are 1 and 0
[Figure: LUT entries at indices 0, 1, 2, 3 within a gate's region; for AND2 with inputs (1, 0) the entry at offset + 2 holds 0]
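
A minimal sketch of the LUT lookup idea (hypothetical layout and names; the host is assumed to have bound the concatenated truth tables to the texture with cudaBindTexture()):

  // 1D texture holding the concatenated truth tables of all library gates.
  texture<int, 1, cudaReadModeElementType> texLUT;

  // Hypothetical device helper: evaluate a 2-input gate by indexing its LUT.
  // 'offset' is the gate type's base entry; inputs a, b are 0/1 values.
  __device__ int evalGate2(int offset, int a, int b)
  {
      return tex1Dfetch(texLUT, offset + (a << 1) + b);   // gate offset + input value
  }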
Slide 85: Approach
- Evaluate two vectors for the same gate in a single thread
  - 1/2/3/4-input gates then require 4/16/64/256 entries in their LUTs respectively
  - Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates
  - Hence the total memory required for all LUTs is 1348 words
    - This fits in the texture memory cache (8 KB per MP)
- Exploit both fault and pattern parallelism
  - Fault parallel
    - All gates at a fixed topological level are evaluated in parallel
  - Pattern parallel
    - Simulations for any gate, for different patterns, are done in parallel
Slide 86: Approach
[Figure: for each gate, the good-circuit and faulty-circuit values for vectors 1, 2, ..., N]
- In practice, simulations for any gate, for different patterns, are done in 2 phases, for all the faults which lie in its TFI only
  - Phase 1: good circuit simulation; results returned to the CPU
  - Phase 2: faulty circuit simulation; the CPU does not schedule a stuck-at-v fault in a pattern which has v as the good circuit value
- Fault injection is also performed in parallel
Slide 87: Approach - Fault Injection / Fault Simulation

  typedef struct __align__(16) {
      int offset;        // Gate type's offset
      int a, b, c, d;    // Input values
      int m0, m1;        // Mask variables (for fault injection)
  } threadData;
Slide 88: Approach - Fault Detection

  typedef struct __align__(16) {
      int offset;                  // Gate type's offset
      int a, b, c, d;              // Input values
      int Good_Circuit_threadID;   // Good circuit simulation thread ID
  } threadData_Detect;
Slide 89: Approach
- We maximize GPU performance by ensuring that
  - No data dependency exists between threads issued in parallel
  - The same instructions, on different data, are executed by all threads
- We adapt to specific G80 memory constraints
  - LUT stored in texture memory. Key advantages are
    - Texture memory is cached
    - Total LUT size easily fits into the available cache size of 8 KB/MP
    - No memory coalescing requirements
    - Efficient built-in texture fetching routines available in CUDA
    - Non-zero time taken to load texture memory, but the cost is easily amortized
  - Global memory writes for level-i gates (and reads for level-(i+1) gates) are performed in a coalesced fashion
Slide 90: Results