SPIRAL: An Overview

1
SPIRAL: An Overview
José M.F. Moura and Markus Püschel
Students
  • Gavin Haentjens (CMU)
  • Pinit Kumhom (Drexel)
  • Neungsoo Park (USC)
  • David Sepiashvili (CMU)
  • Bryan Singer (CMU)
  • Yevgen Voronenko (Drexel)
  • Edward Wertz (CMU)
  • Jianxin Xiong (UIUC)

Faculty
  • José Moura (CMU)
  • Jeremy Johnson (Drexel)
  • Robert Johnson (MathStar)
  • David Padua (UIUC)
  • Viktor Prasanna (USC)
  • Markus Püschel (CMU)
  • Manuela Veloso (CMU)

Collaborators
  • Christoph Überhuber (TU Vienna)
  • Franz Franchetti (TU Vienna)

http://www.ece.cmu.edu/spiral
2
Sponsor
Work supported by DARPA (DSO), Applied and
Computational Mathematics Program, OPAL, through
research grant DABT63-98-1-0004 administered by
the Army Directorate of Contracting.
3
Moore's Law and High(est) Performance Scientific
Computing
4
SPIRAL automates
  • Implementation
  • Optimization
  • Platform-Adaptation
of DSP algorithms
5
SPIRAL system
6
Related Work on Code Generation/Adaptation
  • PhiPAC, ATLAS (linear algebra)
      • Enumeration and evaluation of different blocking,
        looping, etc. strategies for BLAS routines
  • SPARSITY (sparse matrix-vector multiply)
      • Search for optimal blocking strategy to improve
        register performance
  • FFTW (discrete Fourier transform package)
      • Generated code modules (machine independent) for
        small sizes
      • Flexible recursion to adapt to memory hierarchy

SPIRAL
  • Code generation and adaptation for an entire
    domain (linear transforms) of structurally
    complex algorithms
  • Adaptation to all architecture features (memory,
    cache, register, etc.) by automatic exploration
    of algorithm space

7
DSP Transform
Algorithm
8
DSP Algorithms: Example 4-point DFT
Cooley/Tukey FFT (size 4)
Fourier transform
Diagonal matrix (twiddles)
Permutation
Kronecker product
Identity
  • algorithms reduce arithmetic cost from
    $O(n^2)$ to $O(n \log n)$
  • product of structured sparse matrices
  • mathematical notation exhibits structure
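
As a small check of the "product of structured sparse matrices" idea, the sketch below multiplies out the size-4 Cooley/Tukey factors and compares the result with the dense DFT matrix. It assumes the usual conventions, which the transcript does not spell out: DFT entries w^(k*l) with w = exp(-2*pi*i/4), twiddle matrix T = diag(1, 1, 1, -i), and a stride permutation L that maps (x0, x1, x2, x3) to (x0, x2, x1, x3). The helper names are illustrative only.

```python
import cmath

def dft(n):
    """n x n DFT matrix with entries w^(k*l), w = exp(-2*pi*i/n) (assumed sign convention)."""
    w = cmath.exp(-2j * cmath.pi / n)
    return [[w ** (k * l) for l in range(n)] for k in range(n)]

def identity(n):
    return [[1 if r == c else 0 for c in range(n)] for r in range(n)]

def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

F2, I2 = dft(2), identity(2)

# Diagonal twiddle matrix for size 4 (assumed convention).
T42 = [[0] * 4 for _ in range(4)]
for i, t in enumerate([1, 1, 1, -1j]):
    T42[i][i] = t

# Stride permutation: row r selects input element p, so L42 * x = (x0, x2, x1, x3).
L42 = [[1 if c == p else 0 for c in range(4)] for p in (0, 2, 1, 3)]

# DFT_4 = (F2 (x) I2) * T42 * (I2 (x) F2) * L42
factors = [kron(F2, I2), T42, kron(I2, F2), L42]
product = factors[0]
for f in factors[1:]:
    product = matmul(product, f)

reference = dft(4)
max_err = max(abs(product[r][c] - reference[r][c])
              for r in range(4) for c in range(4))
print("max deviation from dense DFT_4:", max_err)
```

Every factor has at most two nonzero entries per row, which is where the arithmetic savings come from.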

9
DSP Algorithms: Terminology (SPIRAL)
Transform
parameterized matrix
Rule
  • a breakdown strategy
  • product of sparse matrices

Ruletree
  • recursive application of rules
  • uniquely defines an algorithm
  • efficient representation
  • easy manipulation

Formula
  • few constructs and primitives
  • uniquely defines an algorithm
  • can be translated into code

10
DSP Transforms
discrete Fourier transform
Walsh-Hadamard transform
discrete cosine and sine transforms (16 types)
modified discrete cosine transform
two-dimensional transform
Others: filters, discrete wavelet transforms,
Haar, Hartley, ...
11
Rules: Breakdown Strategies
[Figure: list of breakdown rules, each labeled by type: base case, recursive, translation, iterative, iterative/recursive]
built from few constructs and primitives
12
Algorithms = Ruletrees = Formulas
13
Formula for a DCT, size 16
14
Number of Formulas/Algorithms
Using the rules included in SPIRAL

k    DFT, size 2^k      DCT-IV, size 2^k
1    1                  1
2    6                  10
3    40                 126
4    296                31242
5    27744              1924443362
6    162570361280       7343815121631354242
7    1.01 × 10^27       1.07 × 10^38
8    2.31 × 10^61       2.30 × 10^76
9    2.86 × 10^133      1.06 × 10^153

exponential search space
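
To get a feel for why the space grows so quickly, the sketch below counts ruletrees for size 2^k under a deliberately simplified assumption: the only rule is a binary Cooley/Tukey-style split of the exponent k into i and k - i, and only size 2 is a base case. SPIRAL's actual rule set is larger, so these counts are smaller than the ones on the slide (which already reach 6 and 40 for k = 2 and 3); the point is only that even this restricted recursion grows exponentially.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def count_trees(k):
    """Ruletrees for size 2^k when only binary splits k -> (i, k - i) are allowed."""
    if k == 1:
        return 1  # size 2 is the only base case in this simplified model
    # Each split combines any tree for the left part with any tree for the right part.
    return sum(count_trees(i) * count_trees(k - i) for i in range(1, k))

counts = [count_trees(k) for k in range(1, 10)]
print(counts)  # [1, 1, 2, 5, 14, 42, 132, 429, 1430]
```

These are the Catalan numbers; adding more rules per node, as SPIRAL does, multiplies the choices at every level of the tree.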
15
DSP Transform
Algorithm (Formula)
Implementation
16
Formulas in SPL

( compose
  ( diagonal ( 2cos(1/16pi) 2cos(3/16pi) 2cos(5/16pi) 2cos(7/16pi) ) )
  ( permutation ( 1 3 4 2 ) )
  ( tensor ( I 2 ) ( F 2 ) )
  ( permutation ( 1 4 2 3 ) )
  ( direct_sum
    ( compose ( F 2 ) ( diagonal ( 1 sqrt(1/2) ) ) )
    ( compose
      ( matrix ( 1 1 0 ) ( 0 (-1) 1 ) )
      ( diagonal ( cos(13/8pi)-sin(13/8pi) sin(13/8pi) cos(13/8pi)+sin(13/8pi) ) )
      ( matrix ( 1 0 ) ( 1 1 ) ( 0 1 ) )
      ( permutation ( 2 1 ) ) ) ) )

17
SPL Syntax (Subset)
  • matrix operations
      • (compose formula formula ...)
      • (tensor formula formula ...)
      • (direct_sum formula formula ...)
  • direct matrix description
      • (matrix (a11 a12 ...) (a21 a22 ...) ...)
      • (diagonal (d1 d2 ...))
      • (permutation (p1 p2 ...))
  • parameterized matrices
      • (I n)
      • (F n)
  • scalars
      • 1.5, 2/7, cos(..), w(3), pi, 1.2e-04
  • definition of new symbols
      • (define name formula)
      • (template formula (i-code-list)): allows extension of SPL
  • directives for code generation
      • codetype real/complex
      • unroll on/off: controls loop unrolling
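
The syntax above is a small S-expression language, so its meaning can be illustrated with a toy interpreter. The sketch below is not the SPL compiler: it parses a formula into nested lists and evaluates it to a dense matrix, covering only compose, tensor, direct_sum, matrix, diagonal, (I n), and (F n). It is an assumption of this sketch that scalars are plain numbers or fractions (no cos(..), w(..), or pi), and that (F n) uses w = exp(-2*pi*i/n). All function names are made up for illustration.

```python
import cmath
import re
from fractions import Fraction

def parse(text):
    """Parse an S-expression into nested lists of string tokens."""
    tokens = re.findall(r"[()]|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok
        items = []
        while tokens[pos] != ")":
            items.append(read())
        pos += 1  # consume ")"
        return items

    return read()

def scalar(tok):
    if isinstance(tok, list):  # parenthesized scalar such as (-1)
        tok = tok[0]
    return float(Fraction(tok))  # handles 1.5, 2/7, 1.2e-04

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def kron(a, b):
    return [[x * y for x in ra for y in rb] for ra in a for rb in b]

def direct_sum(a, b):
    top = [row + [0] * len(b[0]) for row in a]
    bottom = [[0] * len(a[0]) + row for row in b]
    return top + bottom

def evaluate(node):
    """Evaluate a parsed formula to a dense matrix (list of rows)."""
    head, args = node[0], node[1:]
    if head == "I":
        n = int(args[0])
        return [[1 if r == c else 0 for c in range(n)] for r in range(n)]
    if head == "F":
        n = int(args[0])
        w = cmath.exp(-2j * cmath.pi / n)
        return [[w ** (k * l) for l in range(n)] for k in range(n)]
    if head == "diagonal":
        d = [scalar(s) for s in args[0]]
        return [[d[r] if r == c else 0 for c in range(len(d))] for r in range(len(d))]
    if head == "matrix":
        return [[scalar(s) for s in row] for row in args]
    op = {"compose": matmul, "tensor": kron, "direct_sum": direct_sum}[head]
    result = evaluate(args[0])
    for arg in args[1:]:
        result = op(result, evaluate(arg))
    return result

formula = "(compose (tensor (F 2) (I 2)) (direct_sum (I 2) (diagonal (1 (-1)))))"
m = evaluate(parse(formula))
```

A real compiler would of course never build the dense matrix; it would translate each construct into loops or straight-line code, which is what the templates and directives control.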
18
SPL Compiler, 4-point FFT
fast algorithm as formula as SPL program
(compose (tensor (F 2) (I 2)) (T 4 2) (tensor (I 2) (F 2)) (L 4 2))
[Figure: generated code for codetype complex and codetype real]
19
SPL Compiler: Summary
  • SPL Program: SPL Formula, Template Definition, Symbol Definition
  • Parsing: produces Abstract Syntax Tree, Template Table, Symbol Table
  • Intermediate Code Generation: produces I-Code
  • Intermediate Code Restructuring: produces I-Code
  • Optimization: produces I-Code
  • Target Code Generation: produces C, FORTRAN function

Built-in optimizations
  • single static assignment code
  • no reuse of temporary vars
  • only scalar temporary vars
  • constants precomputed
  • limited CSE

Extensible through templates
20
SIMD Short Vector Extensions
[Figure: 4-way vector operation, vector length 4]
  • Extension to instruction set architecture
  • Available on most current architectures
    (SSE on Pentium, AltiVec on Motorola G4)
  • Requires fine grain parallelism
  • Large potential speed-up

Problems
  • SIMD instructions are architecture specific
  • No common API (usually assembly hand coding)
  • Performance very sensitive to memory access
  • Automatic (compiler) vectorization very limited

very difficult to use
21
Vector code generation from SPL formulas
22
DSP Transform
Algorithm (Formula)
Search
Implementation
23
Why Search?
24
Search Methods available in SPIRAL
  • Exhaustive Search
  • Dynamic Programming (DP)
  • Random Search
  • Hill Climbing
  • STEER (similar to a genetic algorithm)
  • Search over
      • algorithm space and
      • implementation options (degree of unrolling)
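
As an illustration of the dynamic programming option, the sketch below builds the best ruletree for size 2^k from the best trees already found for smaller sizes. Two things are assumptions of this sketch and not part of SPIRAL: a ruletree is written as either an integer k (an unrolled leaf of size 2^k) or a pair (left, right), and `toy_cost` is a made-up cost model standing in for the measured runtime of generated code.

```python
def exponent(tree):
    """log2 of the transform size a ruletree computes."""
    if isinstance(tree, int):
        return tree
    return exponent(tree[0]) + exponent(tree[1])

def toy_cost(tree):
    """Hypothetical cost model (assumption); SPIRAL would time real code instead."""
    k = exponent(tree)
    n = 2 ** k
    if isinstance(tree, int):
        # Unrolled leaf: cheap while it is small, penalized once it gets large.
        return k * n * (1 if k <= 3 else 4)
    left, right = tree
    i, j = exponent(left), exponent(right)
    # 2^j copies of the left subtree, 2^i copies of the right one,
    # plus a linear overhead for twiddles and permutation.
    return 2 ** j * toy_cost(left) + 2 ** i * toy_cost(right) + 2 * n

def dp_search(kmax):
    """best[k] = cheapest tree for size 2^k, built only from best smaller trees."""
    best = {}
    for k in range(1, kmax + 1):
        candidates = [k] + [(best[i], best[k - i]) for i in range(1, k)]
        best[k] = min(candidates, key=toy_cost)
    return best

best = dp_search(10)
print(best[10], toy_cost(best[10]))
```

DP only ever combines the best subtrees, so it evaluates a handful of candidates per size. The price is the assumption that a subtree that is best in isolation is also best inside a larger tree, which need not hold on real hardware.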

25
STEER
Population n
  • Mutation: expand differently
  • Cross-Breeding: swap expansions
  • Survival of Fittest
Population n+1
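
The three steps on this slide can be sketched as a generic evolutionary loop over ruletrees. This is not the STEER implementation; it only mirrors the slide's outline. Assumptions of the sketch: a ruletree is an integer k (leaf of size 2^k, allowed for k <= 3) or a pair (left, right), and `toy_cost` is a made-up stand-in for measured runtime.

```python
import random

def exponent(tree):
    return tree if isinstance(tree, int) else exponent(tree[0]) + exponent(tree[1])

def toy_cost(tree):
    """Hypothetical cost model (assumption), standing in for timing generated code."""
    k = exponent(tree)
    if isinstance(tree, int):
        return k * 2 ** k
    i, j = exponent(tree[0]), exponent(tree[1])
    return 2 ** j * toy_cost(tree[0]) + 2 ** i * toy_cost(tree[1]) + 2 * 2 ** k

def random_tree(k, rng):
    """Expand size 2^k by random splits; sizes up to 2^3 may stay leaves."""
    if k == 1 or (k <= 3 and rng.random() < 0.5):
        return k
    i = rng.randint(1, k - 1)
    return (random_tree(i, rng), random_tree(k - i, rng))

def subtrees(tree, path=()):
    """Yield (path, subtree) for every node; a path is a tuple of 0/1 child choices."""
    yield path, tree
    if isinstance(tree, tuple):
        yield from subtrees(tree[0], path + (0,))
        yield from subtrees(tree[1], path + (1,))

def replace(tree, path, new):
    if not path:
        return new
    left, right = tree
    if path[0] == 0:
        return (replace(left, path[1:], new), right)
    return (left, replace(right, path[1:], new))

def mutate(tree, rng):
    """Mutation: pick a node and expand it differently."""
    path, sub = rng.choice(list(subtrees(tree)))
    return replace(tree, path, random_tree(exponent(sub), rng))

def crossbreed(a, b, rng):
    """Cross-breeding: graft an equal-sized expansion from b onto a node of a."""
    path, sub = rng.choice(list(subtrees(a)))
    donors = [t for _, t in subtrees(b) if exponent(t) == exponent(sub)]
    return replace(a, path, rng.choice(donors)) if donors else a

def evolve(k, generations=20, size=12, seed=0):
    rng = random.Random(seed)
    population = [random_tree(k, rng) for _ in range(size)]
    for _ in range(generations):
        children = [mutate(t, rng) for t in population]
        children += [crossbreed(a, b, rng)
                     for a, b in zip(population, reversed(population))]
        # Survival of the fittest: keep the cheapest distinct trees.
        population = sorted(set(population + children), key=toy_cost)[:size]
    return population[0]

best = evolve(8)
print(best, toy_cost(best))
```

Because parents compete with their children for survival, the best cost in the population never gets worse from one generation to the next.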
26
Learning to Generate Fast Algorithms
  • Learns from a given dataset (formulas and their
    runtimes) how to design a fast algorithm
    (breakdown strategy)
  • Learns from a transform of one size, generates
    the best algorithm for many sizes
  • Tested for DFT and WHT

27
Experimental Results
28
Generated DFT Code: Pentium 4, SSE
[Figure: (pseudo) Gflop/s versus n for the DFT of size 2^n, single precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0; one curve is labeled "hand-tuned vendor assembly code"]
speedups (vector to C code) up to factor of 3.1
P. Rodriguez. A Radix-2 FFT Algorithm for Modern
Single Instruction Multiple Data (SIMD)
Architectures. Proc. ICASSP 2002
29
Generated DFT Code: Pentium 4, SSE2
[Figure: Gflop/s versus n for the DFT of size 2^n, double precision, Pentium 4, 2.53 GHz, using Intel C compiler 6.0]
speedups (vector to C code) up to factor of 1.8
30
Other transforms
[Figure: Gflop/s versus transform size for the 2-dim DCT of size 2^n x 2^n, Pentium 4, 2.53 GHz, SSE]
[Figure: Gflop/s versus transform size for the WHT of size 2^n, Pentium 4, 2.53 GHz, SSE]
  • WHT has only additions
  • very simple transform

speedups (vector to C code) up to factor of 3
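
To make the "only additions" remark concrete, here is a minimal sketch of an iterative fast Walsh-Hadamard transform (unnormalized, natural ordering). It is a textbook formulation and not code generated by SPIRAL; the function name `wht` is illustrative.

```python
def wht(x):
    """Fast Walsh-Hadamard transform using only additions and subtractions."""
    y = list(x)
    n = len(y)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # One stage of butterflies at distance h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = y[i], y[i + h]
                y[i], y[i + h] = a + b, a - b
        h *= 2
    return y

print(wht([1, 0, 1, 0, 0, 1, 1, 0]))  # [4, 2, 0, -2, 0, 2, 0, 2]
```

There are no twiddle factors and no multiplications, so the runtime is dominated by data movement, which makes the WHT a clean test case for vectorization and memory-hierarchy effects.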
31
Best DFT Trees, size 2^10 = 1024
[Figure: best ruletrees found, each with root 10, shown for three code types (scalar, C vect, SIMD) on four platform/datatype combinations: Pentium 4 float, Pentium 4 double, Pentium III float, AthlonXP float]
trees platform/datatype dependent
32
Crosstiming of best trees on Pentium 4
e.g., 50% performance loss by using PIII code on P4
[Figure: relative performance w.r.t. best versus n, for the DFT of size 2^n, single precision; runtime of the best trees found on other platforms]
software adaptation is necessary
33
Conclusions
SPIRAL closes the gap between math domain
(algorithms) and implementation domain (programs)
  • Mathematical computer representation of
    algorithms
  • Automatic translation of algorithms into code

SPIRAL does automatic optimization by intelligent
search/learning in the space of alternatives
  • High level: mathematical manipulation of
    algorithms
  • Low level: coding degrees of freedom

http://www.ece.cmu.edu/spiral
34
References
Related Work
  • R. C. Whaley and J. Dongarra. Automatically Tuned
    Linear Algebra Software (ATLAS). In Proc.
    Supercomputing 1998. math-atlas.sourceforge.net
  • M. Frigo and S. G. Johnson. FFTW: An adaptive
    software architecture for the FFT. In Proc.
    ICASSP 1998, pp. 1381-1384. www.fftw.org
  • E.-J. Im and K. Yelick. Optimizing Sparse Matrix
    Computations for Register Reuse in SPARSITY. In
    Proc. ICCS 2001, pp. 127-136.

Further Reading on SPIRAL
http://www.ece.cmu.edu/spiral
  • M. Püschel, B. Singer, J. Xiong, J. Moura, J.
    Johnson, D. Padua, M. Veloso, R. Johnson. SPIRAL:
    A Generator for Platform-Adapted Libraries of
    Signal Processing Algorithms. To appear in
    Journal of High Performance Computing and
    Applications.
  • J. Xiong, J. Johnson, R. Johnson, and D. Padua.
    SPL: A Language and Compiler for DSP Algorithms.
    In Proc. PLDI 2001, pp. 298-308.
  • Bryan Singer and Manuela Veloso. Automating the
    Modeling and Optimization of the Performance of
    Signal Transforms. IEEE Trans. Signal Processing,
    50(8), 2002, pp. 2003-2014.
  • F. Franchetti and M. Püschel. A SIMD Vectorizing
    Compiler for Digital Signal Processing
    Algorithms. In Proc. IPDPS 2002.
  • F. Franchetti and M. Püschel. Short Vector Code
    Generation for the Discrete Fourier Transform. To
    appear in Proc. IPDPS 2003.