Title: SPIRAL: An Overview
1SPIRAL An Overview
José M.F. Moura and Markus Püschel
Students
- Gavin Haentjens (CMU)
- Pinit Kumhom (Drexel)
- Neungsoo Park (USC)
- David Sepiashvili (CMU)
- Bryan Singer (CMU)
- Yevgen Voronenko (Drexel)
- Edward Wertz (CMU)
- Jianxin Xiong (UIUC)
Faculty
- José Moura (CMU)
- Jeremy Johnson (Drexel)
- Robert Johnson (MathStar)
- David Padua (UIUC)
- Viktor Prasanna (USC)
- Markus Püschel (CMU)
- Manuela Veloso (CMU)
Collaborators
- Christoph Überhuber (TU Vienna)
- Franz Franchetti (TU Vienna)
http://www.ece.cmu.edu/spiral
2Sponsor
Work supported by DARPA (DSO), Applied
Computational Mathematics Program, OPAL, through
research grant DABT63-98-1-0004 administered by
the Army Directorate of Contracting.
3Moore's Law and High(est) Performance Scientific
Computing
4SPIRAL
Automates
Implementation
Optimization
Platform-Adaptation
of DSP algorithms
5SPIRAL system
6Related Work on Code Generation/Adaptation
- PhiPAC, ATLAS (linear algebra)
  - Enumeration and evaluation of different blocking, looping, etc. strategies for BLAS routines
- SPARSITY (sparse matrix-vector multiply)
  - Search for optimal blocking strategy to improve register performance
- FFTW (discrete Fourier transform package)
  - Generated code modules (machine independent) for small sizes
  - Flexible recursion to adapt to memory hierarchy
SPIRAL
- Code generation and adaptation for an entire domain (linear transforms) of structurally complex algorithms
- Adaptation to all architecture features (memory, cache, register, etc.) by automatic exploration of algorithm space
7DSP Transform
Algorithm
8DSP Algorithms Example 4-point DFT
Cooley/Tukey FFT (size 4)
Fourier transform
Diagonal matrix (twiddles)
Permutation
Kronecker product
Identity
- algorithms reduce arithmetic cost from O(n^2) to O(n log(n))
- product of structured sparse matrices
- mathematical notation exhibits structure
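The factorization that the labels on this slide annotate can be written out as follows. This is a reconstruction from the SPL program for the 4-point FFT shown later in the deck, not the original slide graphic; the sign convention for the root of unity (omega_4 = e^{-2 pi i/4} = -i) is an assumption.

```latex
\mathrm{DFT}_4 \;=\; (F_2 \otimes I_2)\; T^4_2\; (I_2 \otimes F_2)\; L^4_2,
\qquad
F_2 = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix},
\quad
T^4_2 = \operatorname{diag}(1,\, 1,\, 1,\, \omega_4),
\quad
L^4_2 : (x_0, x_1, x_2, x_3) \mapsto (x_0, x_2, x_1, x_3)
```

Here F_2 is the 2-point Fourier transform, T^4_2 the diagonal matrix of twiddle factors, L^4_2 the stride permutation, I_2 the identity, and the tensor symbol the Kronecker product.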
9DSP Algorithms Terminology (SPIRAL)
Transform
parameterized matrix
Rule
- a breakdown strategy
- product of sparse matrices
Ruletree
- recursive application of rules
- uniquely defines an algorithm
- efficient representation
- easy manipulation
Formula
- few constructs and primitives
- uniquely defines an algorithm
- can be translated into code
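As a rough illustration of how a ruletree determines a formula, the sketch below expands a ruletree into an SPL string using only the Cooley-Tukey rule. The tree encoding (an integer k for an unexpanded DFT of size 2^k, a pair for one rule application) and the names size and to_spl are hypothetical; the parameter order of (T n s) and (L n r) is assumed from the 4-point SPL program in this deck.

```python
def size(tree):
    """Transform size covered by a ruletree (leaf k stands for size 2^k)."""
    if isinstance(tree, int):
        return 2 ** tree
    left, right = tree
    return size(left) * size(right)


def to_spl(tree):
    """Expand a ruletree into an SPL formula string.

    Assumed rule: F_rs = (F_r tensor I_s) T^rs_s (I_r tensor F_s) L^rs_r.
    """
    if isinstance(tree, int):
        return f"(F {2 ** tree})"          # leaf: left unexpanded
    left, right = tree
    r, s = size(left), size(right)
    return (f"(compose (tensor {to_spl(left)} (I {s})) (T {r * s} {s}) "
            f"(tensor (I {r}) {to_spl(right)}) (L {r * s} {r}))")


# Two different ruletrees for size 8 give two different formulas.
print(to_spl((1, 1)))
print(to_spl((1, (1, 1))))
print(to_spl(((1, 1), 1)))
```

The first call yields the 4-point formula used elsewhere in the deck; the other two show that the same transform size admits structurally different algorithms.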
10DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sine transforms (16 types)
modified discrete cosine transform
two-dimensional transform
Others: filters, discrete wavelet transforms,
Haar, Hartley, ...
11Rules Breakdown Strategies
base case
recursive
translation
iterative
recursive
recursive
recursive
iterative/ recursive
built from few constructs and primitives
12Algorithms Ruletrees Formulas
13Formula for a DCT, size 16
14Number of Formulas/Algorithms
Using the rules included in SPIRAL
k   DFT, size 2^k     DCT IV, size 2^k
1   1                 1
2   6                 10
3   40                126
4   296               31242
5   27744             1924443362
6   162570361280      7343815121631354242
7   1.01 x 10^27      1.07 x 10^38
8   2.31 x 10^61      2.30 x 10^76
9   2.86 x 10^133     1.06 x 10^153
exponential search space
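To see why the space grows so quickly, the sketch below counts ruletrees for a DFT of size 2^k under a deliberately reduced assumption: only one binary split rule (Cooley-Tukey) and full expansion down to size 2. It does not reproduce the counts above, which come from the larger rule set included in SPIRAL; the function name count_ruletrees is hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_ruletrees(k):
    """Number of fully expanded binary ruletrees for size 2^k.

    Assumption: a single rule that splits 2^k into 2^i * 2^(k-i), 0 < i < k.
    """
    if k == 1:
        return 1
    return sum(count_ruletrees(i) * count_ruletrees(k - i) for i in range(1, k))


for k in range(1, 10):
    print(k, count_ruletrees(k))
```

Even with this single rule the counts follow the Catalan numbers, which already grow exponentially in k; adding further rules only enlarges the space.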
15DSP Transform
Algorithm (Formula)
Implementation
16Formulas in SPL
( compose
  ( diagonal ( 2*cos(1/16*pi) 2*cos(3/16*pi) 2*cos(5/16*pi) 2*cos(7/16*pi) ) )
  ( permutation ( 1 3 4 2 ) )
  ( tensor ( I 2 ) ( F 2 ) )
  ( permutation ( 1 4 2 3 ) )
  ( direct_sum
    ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) )
    ( compose
      ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) )
      ( diagonal ( cos(13/8*pi)-sin(13/8*pi) sin(13/8*pi) cos(13/8*pi)+sin(13/8*pi) ) )
      ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) )
      ( permutation ( 2 1 ) )
    )
  )
)
17SPL Syntax (Subset)
- matrix operations
  - (compose formula formula ...)
  - (tensor formula formula ...)
  - (direct_sum formula formula ...)
- direct matrix description
  - (matrix (a11 a12 ...) (a21 a22 ...) ...)
  - (diagonal (d1 d2 ...))
  - (permutation (p1 p2 ...))
- parameterized matrices
  - (I n)
  - (F n)
- scalars
  - 1.5, 2/7, cos(..), w(3), pi, 1.2e-04
- definition of new symbols (allows extension of SPL)
  - (define name formula)
  - (template formula (i-code-list))
- directives for code generation
  - codetype real/complex
  - unroll on/off (controls loop unrolling)
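The meaning of these constructs can be made concrete with a small evaluator that turns a formula into a dense matrix. This is only a sketch for checking formulas numerically, not how the SPL compiler works (the compiler generates code and never builds the matrix). Formulas are written as nested Python tuples instead of parsed SPL text; the definitions of T and L, the reading of (permutation ...) as y[i] = x[p_i], and the sign of the root of unity are assumptions.

```python
import cmath
from functools import reduce


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def kron(a, b):
    """Kronecker (tensor) product of two dense matrices."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def direct_sum(a, b):
    """Block-diagonal matrix with blocks a and b."""
    return ([row + [0] * len(b[0]) for row in a] +
            [[0] * len(a[0]) + row for row in b])


def omega(n, e):
    """Assumed convention: omega_n = exp(-2*pi*i/n), raised to the power e."""
    return cmath.exp(-2j * cmath.pi * e / n)


def evaluate(expr):
    op, *args = expr
    if op == "I":
        n = args[0]
        return [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    if op == "F":
        n = args[0]
        return [[omega(n, k * l) for l in range(n)] for k in range(n)]
    if op == "diagonal":
        d = args[0]
        return [[d[i] if i == j else 0 for j in range(len(d))] for i in range(len(d))]
    if op == "permutation":            # 1-based; row i selects input p[i]
        p = args[0]
        return [[1 if p[i] - 1 == j else 0 for j in range(len(p))] for i in range(len(p))]
    if op == "T":                      # twiddles for n = r*s, assumed layout
        n, s = args
        return evaluate(("diagonal", [omega(n, j * k) for j in range(n // s) for k in range(s)]))
    if op == "L":                      # stride permutation: read input at stride r
        n, r = args
        return evaluate(("permutation", [i * r + j + 1 for j in range(r) for i in range(n // r)]))
    combine = {"compose": matmul, "tensor": kron, "direct_sum": direct_sum}[op]
    return reduce(combine, [evaluate(a) for a in args])


def max_error(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


fft4 = ("compose", ("tensor", ("F", 2), ("I", 2)), ("T", 4, 2),
        ("tensor", ("I", 2), ("F", 2)), ("L", 4, 2))
print(max_error(evaluate(fft4), evaluate(("F", 4))))
```

The final line compares the factored 4-point formula with the DFT matrix built directly from its definition; under the stated conventions the difference should be at the level of floating-point rounding.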
18SPL Compiler, 4-point FFT
fast algorithm as formula as SPL program
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
codetype
complex
real
19SPL Compiler Summary
SPL Program
SPL Formula
Template Definition
Symbol Definition
Parsing
Symbol Table
Abstract Syntax Tree
Template Table
Intermediate Code Generation
I-Code
Intermediate Code Restructuring
I-Code
Built-in optimizations
Optimization
- static single assignment (SSA) code
- no reuse of temporary vars
- only scalar temporary vars
- constants precomputed
- limited CSE
I-Code
Target Code Generation
C, FORTRAN function
Extensible through templates
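To give a feel for the kind of code these optimizations aim at, here is a hand-written sketch of straight-line code for the 4-point FFT formula, in Python rather than the C or FORTRAN the compiler emits. It is not compiler output; the function name fft4 and the sign convention (twiddle -i) are assumptions. Each scalar temporary is assigned exactly once, no temporary is reused, and the only twiddle constant is precomputed.

```python
import cmath


def fft4(x):
    """Unrolled 4-point DFT: scalar temporaries only, each assigned once."""
    x0, x1, x2, x3 = x
    # (L 4 2) is folded into the operand order: inputs are read at stride 2.
    # (tensor (I 2) (F 2)): two 2-point butterflies.
    t0 = x0 + x2
    t1 = x0 - x2
    t2 = x1 + x3
    t3 = x1 - x3
    # (T 4 2): the only nontrivial twiddle factor, a precomputed constant.
    t4 = -1j * t3
    # (tensor (F 2) (I 2)): two butterflies at stride 2.
    y0 = t0 + t2
    y1 = t1 + t4
    y2 = t0 - t2
    y3 = t1 - t4
    return [y0, y1, y2, y3]


def dft(x):
    """Reference O(n^2) DFT by definition, used only for checking."""
    n = len(x)
    return [sum(x[l] * cmath.exp(-2j * cmath.pi * k * l / n) for l in range(n))
            for k in range(n)]


print(fft4([1, 2, 3, 4]))
```

The unrolled version uses 8 complex additions and one multiplication by a constant, compared with 16 multiplications in the definition-based reference.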
20SIMD Short Vector Extensions
vector length 4 (4-way)
- Extension to instruction set architecture
- Available on most current architectures
- (SSE on Pentium, AltiVec on Motorola G4)
- Requires fine grain parallelism
- Large potential speed-up
Problems
- SIMD instructions are architecture specific
- No common API (usually assembly hand coding)
- Performance very sensitive to memory access
- Automatic (compiler) vectorization very limited
very difficult to use
21Vector code generation from SPL formulas
22DSP Transform
Algorithm (Formula)
Search
Implementation
23Why Search?
24Search Methods available in SPIRAL
- Exhaustive Search
- Dynamic Programming (DP)
- Random Search
- Hill Climbing
- STEER (similar to a genetic algorithm)
- Search over
- algorithm space and
- implementation options (degree of unrolling)
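The dynamic programming idea from this list can be sketched as follows: for each size, combine the best trees already found for smaller sizes and keep the cheapest candidate. Everything here is illustrative: trees are labeled by exponent k (size 2^k), toy_cost is a made-up stand-in for an actual runtime measurement, and max_leaf (the largest size left as an unexpanded codelet) is a hypothetical parameter.

```python
def exponent(tree):
    """Exponent k of the transform size 2^k covered by a ruletree."""
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])


def toy_cost(tree):
    """Hypothetical cost model; a real search would time generated code."""
    if isinstance(tree, int):
        # Pretend unrolled leaves get slower once they outgrow the registers.
        return (2 ** tree) * tree * (1 if tree <= 3 else 3)
    left, right = tree
    l, r = exponent(left), exponent(right)
    return (2 ** r) * toy_cost(left) + (2 ** l) * toy_cost(right) + 2 ** (l + r)


def dp_search(k, measure, max_leaf=5):
    """Return (best tree, cost) for size 2^k, reusing best subtrees."""
    best = {}
    for size in range(1, k + 1):
        candidates = [size] if size <= max_leaf else []
        for l in range(1, size):
            candidates.append((best[l][0], best[size - l][0]))
        best[size] = min(((t, measure(t)) for t in candidates), key=lambda pair: pair[1])
    return best[k]


tree, cost = dp_search(10, toy_cost)
print(tree, cost)
```

DP evaluates only a handful of candidates per size instead of the whole space. It assumes that the best tree for a size is built from the best trees of its parts, which need not hold on real hardware; this is one reason to have the other search methods as well.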
25STEER
Population n
Mutation
expand differently
Cross-Breeding
Population n1
swap expansions
Survival of Fittest
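The three steps named on this slide can be sketched as a small evolutionary loop over ruletrees. This is not the STEER implementation; it is a minimal illustration under assumptions: trees are labeled by exponent, toy_cost is a made-up cost model replacing runtime measurement, and all function names and parameters (generations, size, seed) are hypothetical.

```python
import random


def exponent(tree):
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])


def toy_cost(tree):
    """Hypothetical cost model; a real search would time generated code."""
    if isinstance(tree, int):
        return (2 ** tree) * tree * (1 if tree <= 3 else 3)
    l, r = exponent(tree[0]), exponent(tree[1])
    return (2 ** r) * toy_cost(tree[0]) + (2 ** l) * toy_cost(tree[1]) + 2 ** (l + r)


def random_tree(k, rng, max_leaf=5):
    if k == 1 or (k <= max_leaf and rng.random() < 0.5):
        return k
    split = rng.randint(1, k - 1)
    return (random_tree(split, rng, max_leaf), random_tree(k - split, rng, max_leaf))


def subtrees(tree, path=()):
    """Yield (path, subtree) for every node of the tree."""
    yield path, tree
    if not isinstance(tree, int):
        yield from subtrees(tree[0], path + (0,))
        yield from subtrees(tree[1], path + (1,))


def replace(tree, path, new):
    if not path:
        return new
    left, right = tree
    if path[0] == 0:
        return (replace(left, path[1:], new), right)
    return (left, replace(right, path[1:], new))


def mutate(tree, rng):
    """Mutation: expand one randomly chosen node differently."""
    path, sub = rng.choice(list(subtrees(tree)))
    return replace(tree, path, random_tree(exponent(sub), rng))


def crossbreed(a, b, rng):
    """Cross-breeding: graft an expansion from b onto a same-size node of a."""
    pairs = [(pa, sb) for pa, sa in subtrees(a) for _, sb in subtrees(b)
             if exponent(sa) == exponent(sb)]
    path, sub = rng.choice(pairs)
    return replace(a, path, sub)


def evolve(k, measure, generations=20, size=12, seed=0):
    rng = random.Random(seed)
    population = [random_tree(k, rng) for _ in range(size)]
    for _ in range(generations):
        children = [mutate(t, rng) for t in population]
        children += [crossbreed(rng.choice(population), rng.choice(population), rng)
                     for _ in range(size)]
        # Survival of the fittest: keep the cheapest trees of parents and children.
        population = sorted(population + children, key=measure)[:size]
    return population[0]


best = evolve(8, toy_cost)
print(best, toy_cost(best))
```

Because parents compete with their children for survival, the best cost in the population never gets worse from one generation to the next.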
26Learning to Generate Fast Algorithms
- Learns from given dataset (formulas and runtimes) how to design a fast algorithm (breakdown strategy)
- Learns from a transform of one size, generates the best algorithm for many sizes
- Tested for DFT and WHT
27Experimental Results
28Generated DFT Code Pentium 4, SSE
[Plot: (pseudo) gflop/s versus n; one curve is labeled "hand-tuned vendor assembly code"]
DFT 2^n single precision, Pentium 4, 2.53 GHz,
using Intel C compiler 6.0
speedups (vector to C code) up to factor of 3.1
P. Rodriguez. A Radix-2 FFT Algorithm for Modern
Single Instruction Multiple Data (SIMD)
Architectures. Proc. ICASSP 2002
29Generated DFT Code Pentium 4, SSE2
[Plot: gflop/s versus n]
DFT 2^n double precision, Pentium 4, 2.53 GHz,
using Intel C compiler 6.0
speedups (vector to C code) up to factor of 1.8
30Other transforms
[Plots: gflop/s versus transform size, two panels]
2-dim DCT 2^n x 2^n, Pentium 4, 2.53 GHz, SSE
WHT 2^n, Pentium 4, 2.53 GHz, SSE
- WHT has only additions
- very simple transform
speedups (vector to C code) up to factor of 3
31Best DFT Trees, size 2^10 = 1024
[Figure: best found ruletrees, nodes labeled by exponent with root 10, for four platforms (Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float) and three code types (scalar, C vect, SIMD)]
trees platform/datatype dependent
32Crosstiming of best trees on Pentium 4
e.g., 50% performance loss by using PIII code on P4
[Plot: relative performance w.r.t. best versus n]
DFT 2^n single precision, runtime of best found of
other platforms
software adaptation is necessary
33Conclusions
SPIRAL closes the gap between math domain
(algorithms) and implementation domain (programs)
- Mathematical computer representation of algorithms
- Automatic translation of algorithms into code
SPIRAL does automatic optimization by intelligent
search/learning in the space of alternatives
- High level: mathematical manipulation of algorithms
- Low level: coding degrees of freedom
http://www.ece.cmu.edu/spiral
34References
Related Work
- R.C. Whaley and J. Dongarra. Automatically Tuned Linear Algebra Software (ATLAS). In Proc. Supercomputing 1998. math-atlas.sourceforge.net
- M. Frigo and S. G. Johnson. FFTW: An adaptive software architecture for the FFT. In Proc. ICASSP 1998, pp. 1381-1384. www.fftw.org
- E.-J. Im and K. Yelick. Optimizing Sparse Matrix Computations for Register Reuse in SPARSITY. In Proc. ICCS 2001, pp. 127-136.
Further Reading on SPIRAL
http://www.ece.cmu.edu/spiral
- M. Püschel, B. Singer, J. Xiong, J. Moura, J. Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL: A Generator for Platform-Adapted Libraries of Signal Processing Algorithms. To appear in Journal of High Performance Computing and Applications.
- J. Xiong, J. Johnson, R. Johnson, and D. Padua. SPL: A Language and Compiler for DSP Algorithms. In Proc. PLDI 2001, pp. 298-308.
- Bryan Singer and Manuela Veloso. Automating the Modeling and Optimization of the Performance of Signal Transforms. IEEE Trans. Signal Processing, 50(8), 2002, pp. 2003-2014.
- F. Franchetti and M. Püschel. A SIMD Vectorizing Compiler for Digital Signal Processing Algorithms. In Proc. IPDPS 2002.
- F. Franchetti and M. Püschel. Short Vector Code Generation for the Discrete Fourier Transform. To appear in Proc. IPDPS 2003.