1
A Compiler Framework for Optimization of Affine
Loop Nests for GPGPUs
  • Muthu Baskaran1, Uday Bondhugula1,
    Sriram Krishnamoorthy1,
  • J. Ramanujam2, Atanas Rountev1,
    P. Sadayappan1
  • 1Department of Computer Science and Engineering,
    The Ohio State University
  • 2Department of Electrical and Computer
    Engineering, Louisiana State University

Supported by NSF
2
Introduction
  • Emergence of many-core architectures
  • High computation power
  • E.g. GPUs
  • Development of high-performance codes for such
    architectures is non-trivial!
  • CUDA: parallel programming model for NVIDIA GPUs
  • Good abstraction of the underlying architecture
  • Not straightforward to write high-performance
    CUDA code

3
Introduction
  • Optimizations needed to address architectural
    challenges
  • Memory access model
  • Granularity and levels of parallelism in the
    architecture
  • Solution: compiler infrastructure to
    automatically generate efficient parallel
    programs
  • PLuTo compiler framework [PLDI'08], recently
    developed for general-purpose multi-core targets
  • Sequential C to OpenMP parallel tiled code
  • Develop a framework to automatically generate
    parallel CUDA code

4
Polyhedral Model
  • An algebraic framework for representing affine
    programs: statement domains, dependences, array
    access functions, and affine program
    transformations
  • Regular affine programs
  • Dense arrays
  • Loop bounds: affine functions of outer loop
    variables, constants, and program parameters
  • Array access functions: affine functions of
    surrounding loop variables, constants, and program
    parameters

5
Polyhedral Model
for (i = 1; i < 7; i++)
  for (j = 2; j < 6; j++)
S1: a[i][j] = a[j][i] + a[i][j-1];
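For illustration (reconstructed from the loop bounds above, not shown verbatim on the slide), the polyhedral representation of S1 consists of its iteration domain and affine access functions:

\[
\mathcal{D}_{S1} = \{\, (i,j) \in \mathbb{Z}^2 \mid 1 \le i \le 6,\; 2 \le j \le 5 \,\}
\]
\[
F_a^{\mathrm{write}}(i,j) = (i,\, j), \qquad
F_a^{\mathrm{read}}(i,j) = (j,\, i) \ \text{and} \ (i,\, j-1)
\]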
6
PLuTo Framework
  • Available at
  • http://sourceforge.net/projects/pluto-compiler

7
NVIDIA GPU Architecture
  • Two levels of parallelism (see the launch sketch
    below)
  • Threads (Processor cores)
  • Grouped into SIMD warps
  • Thread blocks (Multiprocessors)
  • Various memory units
  • Different memory access model
  • Cache and local store hierarchy
  • Partitioning (e.g. registers) and sharing of
    resources (e.g. shared memory)
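A minimal sketch of how the two levels appear in CUDA; the kernel name and sizes are illustrative, not from the talk:

#include <cuda_runtime.h>

__global__ void scale(float* x, float a) {
    // block-level index, then thread-level index within the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    x[i] = a * x[i];
}

int main() {
    const int n = 1 << 20;
    float* x;
    cudaMalloc((void**)&x, n * sizeof(float));
    dim3 block(256);        // threads: run on one multiprocessor, in SIMD warps
    dim3 grid(n / block.x); // thread blocks: distributed across multiprocessors
    scale<<<grid, block>>>(x, 2.0f);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}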

8
Performance Characterization of NVIDIA GeForce
8800 GTX
  • Get insights into optimizations to be addressed
    by a compiler framework
  • Characterize key features of the machine
    architecture and their impact on different
    strategies
  • Global memory access
  • Shared memory (scratchpad) access
  • Concurrency and register pressure

9
Global Memory Access
  • Measured memory read bandwidth for
  • Different data sizes
  • Blocked and cyclic distribution of data access
    amongst the threads of a single thread block

10
Global Memory Access
  • Cyclic access has much higher bandwidth
  • Due to a hardware optimization called global
    memory coalescing
  • Accesses from consecutive threads of a (half) warp
    to consecutive locations are coalesced (the two
    patterns are contrasted in the sketch below)
  • Base address of the (half) warp aligned to 4, 8 or
    16 bytes
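A hedged sketch of the two access distributions; the kernel names and the reduction are illustrative, and n is assumed to be a multiple of blockDim.x:

__global__ void blocked_read(const float* g, float* out, int n) {
    int chunk = n / blockDim.x;          // contiguous chunk per thread
    float s = 0.0f;
    for (int k = 0; k < chunk; k++)
        s += g[threadIdx.x * chunk + k]; // neighboring threads are 'chunk'
                                         // words apart: not coalesced
    out[threadIdx.x] = s;
}

__global__ void cyclic_read(const float* g, float* out, int n) {
    float s = 0.0f;
    for (int k = threadIdx.x; k < n; k += blockDim.x)
        s += g[k];                       // neighboring threads read
                                         // neighboring words: coalesced
    out[threadIdx.x] = s;
}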

11
Optimizing Global Memory Access
  • Determine the extent of reuse of arrays
  • For arrays with sufficient reuse
  • Copy from global memory to shared memory
    [PPoPP'08]
  • For arrays with no reuse
  • Find affine transformations enabling global
    memory coalescing
  • If no suitable affine transformation enables
    global memory coalescing
  • Copy to shared memory with possible global memory
    coalescing

12
Optimizing Global Memory Access
tmv kernel:
for (i = 0; i < n; i++) {
S:  x[i] = 0;
    for (j = 0; j < n; j++)
T:      x[i] += a[j][i] * y[j];
}

mv kernel:
for (i = 0; i < n; i++) {
P:  x[i] = 0;
    for (j = 0; j < n; j++)
Q:      x[i] += a[i][j] * y[j];
}

(Figure: access patterns of arrays a, y, and x)
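A hedged sketch of the copy-in strategy for the mv kernel, where each thread's direct row access of a cannot coalesce. The tile size T = 16, blockDim.x == T, gridDim.x == n/T, and n being a multiple of T are illustrative assumptions, not the paper's generated code:

#define T 16  // tile size (assumed; the framework tunes this)

__global__ void mv_shared(const float* a, const float* y, float* x, int n) {
    __shared__ float a_sh[T][T];
    __shared__ float y_sh[T];
    int tx = threadIdx.x;                 // blockDim.x == T assumed
    float s = 0.0f;
    for (int j0 = 0; j0 < n; j0 += T) {
        // Coalesced copy-in: for each tile row r, consecutive threads
        // read consecutive global addresses.
        for (int r = 0; r < T; r++)
            a_sh[r][tx] = a[(blockIdx.x * T + r) * n + j0 + tx];
        y_sh[tx] = y[j0 + tx];
        __syncthreads();
        for (int jj = 0; jj < T; jj++)
            s += a_sh[tx][jj] * y_sh[jj]; // row access now hits shared
                                          // memory, not global memory
        __syncthreads();
    }
    x[blockIdx.x * T + tx] = s;
}

Note that a_sh[tx][jj] is read with a stride of T words across the threads of a half warp, which causes exactly the bank conflicts addressed on the shared-memory slides later in the talk.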
13
Experimental Evaluation
Performance comparison (in GFLOPS) of the mv kernel

N     Direct Global    Copied to Shared
4K    0.43             5.61
5K    0.48             5.79
6K    0.35             6.04
7K    0.30             5.78
8K    0.24             5.52
14
Experimental Evaluation
Performance comparison (in GFLOPS) of the tmv kernel

N     Non-optimized Global    Optimized Global
4K    4.22                    25.21
5K    3.09                    28.90
6K    3.24                    33.47
7K    3.70                    33.58
8K    4.13                    34.93
15
Shared Memory Access
  • Shared memory organized into banks
  • 16 banks in NVIDIA 8800 GTX
  • Successive 32-bit words in successive banks
  • Bank conflicts in shared memory
  • n threads accessing different addresses in the
    same bank cause n sequential requests (n-way
    conflict)
  • Bandwidth of shared memory access is inversely
    proportional to the degree of bank conflicts
  • Goal: minimize shared memory bank conflicts

16
Optimizing Shared Memory Access
  • Strategy to minimize bank conflicts during shared
    memory access: pad the arrays copied into shared
    memory (see the sketch below)
  • Degree of bank conflicts
  • gcd(stride of array access across threads of
    a half warp, number of bank modules)
  • Cost of accessing a word in shared memory
  • Linear function of the degree of bank conflicts
  • Find the padding factor that minimizes the cost,
    considering all references to the array
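A hedged sketch applying the padding to the T x T tile from the earlier mv sketch. Without padding the access stride across a half warp is 16 words, so the conflict degree is gcd(16, 16) = 16; padding each row by one word makes the stride 17, and gcd(17, 16) = 1, i.e. conflict-free. The padding factor of 1 is an assumption for this layout, not a general rule:

#define T 16  // matches the 16 banks of the 8800 GTX

__global__ void padded_tile_demo(const float* a, float* out, int n) {
    __shared__ float a_sh[T][T + 1];   // one padding word per row
    int tx = threadIdx.x;              // blockDim.x == T assumed
    for (int r = 0; r < T; r++)
        a_sh[r][tx] = a[r * n + tx];   // coalesced copy-in, as before
    __syncthreads();
    float s = 0.0f;
    for (int jj = 0; jj < T; jj++)
        s += a_sh[tx][jj];             // stride across threads is now 17
                                       // words: gcd(17, 16) = 1, no conflict
    out[tx] = s;
}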

17
Experimental Evaluation
Performance comparison (in GFLOPS) of the mv kernel

N     Non-optimized Shared    Optimized Shared
4K    5.61                    13.18
5K    5.79                    13.87
6K    6.04                    14.37
7K    5.78                    13.86
8K    5.52                    13.63
18
Parallelism vs Register Pressure
  • Performance-enhancing approaches
  • Reduction of the number of loads/stores
  • Increase in ILP
  • Reduction of dynamic instructions
  • Loop overhead reduction
  • Well-known optimization: loop unrolling (sketched
    below)
  • Issues
  • Increased register pressure
  • Might reduce the number of concurrent threads
  • Registers are partitioned among thread blocks
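A hedged sketch of unrolling the inner mv loop by 4 (the kernel and unroll factor are illustrative; n is assumed to be a multiple of 4, and the grid is assumed to cover exactly n rows). The four accumulators expose ILP and cut loop overhead, but each one is a live register for the whole loop:

__global__ void mv_unrolled(const float* a, const float* y, float* x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one row per thread
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int j = 0; j < n; j += 4) {    // one branch per 4 iterations
        s0 += a[i * n + j    ] * y[j    ];
        s1 += a[i * n + j + 1] * y[j + 1];
        s2 += a[i * n + j + 2] * y[j + 2];
        s3 += a[i * n + j + 3] * y[j + 3];
    }
    x[i] = (s0 + s1) + (s2 + s3);
}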

19
Parallelism vs Register Pressure
  • Higher thread-level parallelism is needed to mask
    global memory access latency
  • Threads are scheduled in an interleaved manner to
    mask global memory access latency
  • Trade-off between
  • the number of active concurrent threads
  • the number of registers available to a thread in a
    thread block
  • Problem: register allocation cannot be managed by
    an external compilation framework
  • Solution: empirical evaluation to select an
    optimal choice (a worked example of the trade-off
    follows)
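A worked illustration under assumed GeForce 8800 GTX limits (8192 registers and at most 768 resident threads per multiprocessor; these figures are not stated on the slides):

\[
\text{resident threads} \;\le\; \min\!\left(768,\ \left\lfloor \frac{8192}{\text{registers per thread}} \right\rfloor\right)
\]

At 10 registers per thread the cap is 768 threads (full concurrency); at 20 registers it drops to 409, roughly half.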

20
Model-driven Empirical Search
  • Need for empirical search
  • Tight coupling between
  • Program parameters: tile sizes, unroll factors
  • System parameters: threads, thread blocks
  • Resources: number of registers, shared memory
  • Lack of control over register usage and
    allocation
  • Model to estimate the number of loads/stores
  • Analytically, in the polyhedral model
  • Empirically, using the PTX code
  • Register usage instrumentation
  • Empirically, using the cubin object code (one
    alternative is sketched below)
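The paper instruments the cubin object code; a hedged alternative sketch is to query the same quantities through the CUDA runtime (the kernel here is a placeholder):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void candidate_kernel(float* x) { x[threadIdx.x] *= 2.0f; }

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, candidate_kernel);
    printf("registers per thread: %d, static shared memory per block: %zu bytes\n",
           attr.numRegs, attr.sharedSizeBytes);
    return 0;
}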

21
Model-driven Empirical Search
  • Perform multilevel tiling (except register-level
    tiling)
  • Generate optimal copy code
  • Prune code versions by global memory traffic
  • For all selected loop structures
  • do register-level tiling and explicit unrolling
  • instrument the register usage
  • discard those for which increased register
    pressure reduces concurrency to less than 25% of
    the maximum possible concurrency
  • In all selected code versions, pad the arrays in
    shared memory with the optimal padding factor
  • Search empirically among the remaining candidate
    loop structures
  • explicitly running them, timing the execution,
    and selecting the optimal one (a minimal timing
    harness is sketched below)
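A hedged sketch of such a timing harness (assumed, not from the paper); it times one launch of a candidate code version with CUDA events:

#include <cuda_runtime.h>

// Returns the execution time of one launch, in milliseconds.
template <typename Kernel, typename... Args>
float time_version(Kernel k, dim3 grid, dim3 block, Args... args) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    k<<<grid, block>>>(args...);  // run the candidate code version
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for the launch to finish
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;                    // keep the version with the smallest time
}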

22
Experimental Evaluation
Performance of Matrix Kernels
23
Experimental Evaluation
Performance of Matrix Kernels
24
Related Work
  • Earlier GPU work
  • Automatic generation of pixel shader operations
    from a high-level data-parallel language
    [Tarditi et al., ASPLOS'06]
  • Stream processing: Brook, RapidMind, PeakStream
  • Considerable work on developing specific
    optimized algorithms and libraries for GPUs
  • E.g. CUDPP (CUDA Data Parallel Primitives)
  • Very little work on general compiler optimization
    strategies for GPUs
  • Performance metrics to prune the optimization
    search space on a Pareto-optimality basis
    [Ryoo et al., CGO'08]
  • Optimizing data communication between CPU and
    co-processor [Gelado et al., ICS'08]

25
Summary
  • Developed compiler optimizations to address key
    performance-influencing factors on NVIDIA GPUs
  • Enable global memory coalescing in the polyhedral
    model for regular programs
  • Reduce shared memory bank conflicts
  • Determine optimized program parameters (unroll
    factors, tile sizes) through a model-driven
    empirical search

26
Ongoing and Future Work
  • Automatic thread-centric CUDA code generation in
    the polyhedral model
  • Data layout reordering to enhance memory accesses
    at various levels

27
Thank You!
28
How prevalent are affine computations?
  • Innermost core computations in many codes
  • Dense linear algebra
  • Image and signal processing
  • Computational Electromagnetics (FDTD)
  • Explicit PDE solvers (e.g. SWIM, SWEEP3D)
  • Likely to be more prevalent in future codes (esp.
    scientific)
  • Codes with direct data access are better than
    those with indirect data access in power and
    performance
  • Structured-sparse (block-sparse) is better than
    arbitrary sparse (e.g. OSKI)
  • Algorithms with sparse-outer but regular-inner
    structure may be more attractive for many-core
    processors, e.g. multi-block methods