1 How Can We Address the Needs and Solve the
Problems in HPC Benchmarking?
Workshop on the Performance Characterization of
- Jack Dongarra
- Innovative Computing Laboratory
- University of Tennessee
- http://www.cs.utk.edu/~dongarra/
2 LINPACK Benchmark
- Accidental benchmarking
- Designed to help users extrapolate execution time
for Linpack software
- First benchmark report from 1979
3 Accidental Benchmarking
- Portable, runs on any system
- Easy to understand
- Content changed over time
- n = 100, 300, 1000, as large as possible (Top500)
- Allows for restructuring algorithm
- Performance data with the same arithmetic
precision
- Benchmark checks to see if correct solution
achieved
- Not intended to measure entire machine
performance.
- In the benchmark report: "One further note: The
following performance data should not be taken
too seriously."
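The measure-then-check pattern described above can be sketched in a few lines. This is only an illustration in pure Python, not the benchmark's own code: the function names, the random test matrix, and the exact residual scaling are assumptions. The operation count 2/3 n^3 + 2 n^2 is the conventional one for solving a dense system by Gaussian elimination.

```python
import random
import sys
import time

def solve(A, b):
    """Gaussian elimination with partial pivoting; overwrites A and b."""
    n = len(b)
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to row k.
        piv = max(range(k, n), key=lambda i: abs(A[i][k]))
        A[k], A[piv] = A[piv], A[k]
        b[k], b[piv] = b[piv], b[k]
        row_k = A[k]
        for i in range(k + 1, n):
            row_i = A[i]
            m = row_i[k] / row_k[k]
            for j in range(k + 1, n):
                row_i[j] -= m * row_k[j]
            b[i] -= m * b[k]
    # Back substitution on the upper triangle.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = b[i] - sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / A[i][i]
    return x

def linpack_like(n=100, seed=1):
    """Time one solve, report Mflop/s, and check the answer (hypothetical helper)."""
    rng = random.Random(seed)
    A = [[rng.uniform(-0.5, 0.5) for _ in range(n)] for _ in range(n)]
    b = [sum(row) for row in A]              # exact solution is all ones
    A0, b0 = [row[:] for row in A], b[:]     # keep copies for the residual
    t0 = time.perf_counter()
    x = solve(A, b)
    elapsed = time.perf_counter() - t0
    flops = 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2
    resid = max(abs(sum(a * xj for a, xj in zip(row, x)) - bi)
                for row, bi in zip(A0, b0))
    norm_a = max(abs(a) for row in A0 for a in row)
    norm_x = max(abs(v) for v in x)
    # Residual scaled by problem size and machine precision (assumed scaling).
    normalized = resid / (n * norm_a * norm_x * sys.float_info.epsilon)
    return elapsed, flops / elapsed / 1e6, normalized, x

elapsed, mflops, normalized, x = linpack_like()
print(f"n=100: {elapsed:.3f} s, {mflops:.2f} Mflop/s, normalized residual {normalized:.2f}")
```

A small normalized residual indicates that the computed solution is as accurate as the arithmetic precision allows, which is the sense in which the benchmark "checks to see if correct solution achieved".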
4 LINPACK Benchmark
- Scalable benchmark, size and parallel
- Pressure on vendors to optimize my software and
provide a set of kernels that benefit others
- Run rules very important
- Today, n = 0.5 x 10^6 at 7.2 TFlop/s requires 3.3
hours
- On a Petaflops machine, n = 5 x 10^6 will require 1
- Historical data
- For n = 100, same software for the last 22 years
- Unbiased reporting
- Freely available sw/results worldwide
- Should be able to achieve high performance on
this problem, if not
- Compiler test at n = 100, heavily hand optimized at
TPP (Modified ScaLAPACK implementation)
- Machine signatures
- Algorithm characteristics
- Make improvements in applications
- Users looking for performance portability
- Many of the things we do are specific to one
system's parameters
- Need a way to understand and rapidly develop
software which has a chance at high performance
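The run times quoted on the LINPACK Benchmark slide can be reproduced approximately from the conventional operation count for a dense solve, 2/3 n^3 + 2 n^2. The sketch below assumes that the quoted rates are sustained for the whole run and that "a Petaflops machine" means 10^15 flop/s sustained; the function names are hypothetical.

```python
def lu_solve_flops(n):
    """Conventional flop count for solving a dense n x n system by LU."""
    return 2.0 * n ** 3 / 3.0 + 2.0 * n ** 2

def run_hours(n, flops_per_second):
    """Estimated wall-clock hours at a sustained rate (assumption: constant rate)."""
    return lu_solve_flops(n) / flops_per_second / 3600.0

today = run_hours(0.5e6, 7.2e12)     # n = 0.5 x 10^6 at 7.2 TFlop/s
petaflops = run_hours(5e6, 1e15)     # n = 5 x 10^6 at an assumed 1 PFlop/s
print(f"today: {today:.1f} h, petaflops machine: {petaflops:.1f} h")
```

Under these assumptions the first case comes out a little over 3 hours, close to the 3.3 hours on the slide, and the second comes out at roughly 23 hours. Because the work grows as n^3, a tenfold larger problem needs a thousandfold more operations, which is why a machine more than a hundred times faster still runs several times longer.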
6 Self-Adapting Numerical Software (SANS)
- Today's processors can achieve high performance,
but this requires extensive machine-specific hand
tuning.
- Simple operations like Matrix-Vector ops require
many man-hours / platform
- Software lags far behind hardware introduction
- Only done if financial incentive is there
- Compilers not up to optimization challenge
- Hardware, compilers, and software have a large
design space w/many parameters
- Blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules.
- Complicated interactions with the increasingly
sophisticated micro-architectures of new
microprocessors.
- Need for quick/dynamic deployment of optimized
routines.
- ATLAS: Automatically Tuned Linear Algebra Software
7 Software Generation Strategy - BLAS
- Parameter study of the hw
- Generate multiple versions of code, with different
values of key performance parameters
- Run and measure the performance for various
versions
- Pick best and generate library
- Level 1 cache multiply optimizes for
- TLB access
- L1 cache reuse
- FP unit usage
- Memory fetch
- Register reuse
- Loop overhead minimization
- Takes 20 minutes to run.
- New model of high performance programming where
critical code is machine generated using
parameter optimization.
- Designed for RISC arch
- Super Scalar
- Need reasonable C compiler
- Today ATLAS in use by Matlab, Mathematica,
Octave, Maple, Debian, Scyld Beowulf, SuSE,
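The generate, measure, and pick-best loop described on this slide can be illustrated with a toy example. This is a sketch only, not ATLAS itself: ATLAS generates and times C kernels, whereas the code below varies a single parameter (the block size) of a pure-Python blocked matrix multiply. All names and the candidate list are hypothetical.

```python
import time

def blocked_matmul(A, B, n, nb):
    """C = A * B for n x n lists of lists, computed in nb x nb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        aik, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += aik * Bk[j]
    return C

def tune_block_size(n=48, candidates=(4, 8, 16, 48), repeats=2):
    """Time every candidate block size and return the fastest one."""
    A = [[float((i + j) % 7) for j in range(n)] for i in range(n)]
    B = [[float((i * j) % 5) for j in range(n)] for i in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):          # keep the best of a few runs
            t0 = time.perf_counter()
            blocked_matmul(A, B, n, nb)
            best = min(best, time.perf_counter() - t0)
        timings[nb] = best
    best_nb = min(timings, key=timings.get)
    return best_nb, timings

best_nb, timings = tune_block_size()
print("fastest block size on this machine:", best_nb)
```

The selected value depends on the machine and the interpreter, which is the point of the approach: the choice is made by experiment on the target system rather than fixed in advance.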
8 ATLAS (DGEMM, n = 500)
- ATLAS is faster than all other portable BLAS
implementations and it is comparable with
machine-specific libraries provided by the vendor.
9 Related Tuning Projects
- PHiPAC: Portable High Performance ANSI C,
initial automatic GEMM generation project
- FFTW: Fastest Fourier Transform in the West
- http://www.fftw.org
- tuning parallel FFT algorithms
- http://rodin.cs.uh.edu/mirkovic/fft/parfft.htm
- SPIRAL: Signal Processing Algorithms Implementation
Research for Adaptable Libraries, maps DSP
algorithms to architectures
- http://www.ece.cmu.edu/spiral/
- Sparsity
- Sparse-matrix-vector and Sparse-matrix-matrix
multiplication, tunes code to sparsity structure of
matrix (more later in this tutorial)
- http://www.cs.berkeley.edu/ejim/publication/
- University of Tennessee
10 Experiments with C, Fortran, and Java for ATLAS
(DGEMM kernel)
11 Machine-Assisted Application Development and
- Communication libraries
- Optimize for the specifics of one's
configuration.
- Algorithm layout and implementation
- Look at the different ways to express
12 Work in Progress: ATLAS-like Approach Applied to
Broadcast (PII 8-Way Cluster with 100 Mb/s
switched network)
13 Conjugate Gradient Variants by Dynamic Selection
at Run Time
- Variants combine inner products to reduce
communication bottleneck at the expense of more
scalar ops.
- Same number of iterations, no advantage on a
sequential processor
- With a large number of processors and a
high-latency network there may be advantages.
- Improvements can range from 15% to 50% depending
on size.
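To make the trade-off concrete, the sketch below compares textbook conjugate gradient with one known variant that fuses its inner products, the Chronopoulos-Gear formulation. The slides do not say which variants were used, so this choice is an assumption, as are all names in the code. `reduce_dots` stands in for a single global reduction; on a parallel machine each call would cost one round of communication regardless of how many inner products it carries.

```python
def matvec(A, v):
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def reduce_dots(pairs):
    # Stand-in for ONE global reduction returning several inner products.
    return [sum(a * b for a, b in zip(u, v)) for u, v in pairs]

def cg_standard(A, b, max_iters=50, tol=1e-10):
    """Textbook CG: two separate reductions per iteration."""
    x, r = [0.0] * len(b), b[:]
    p = r[:]
    (gamma,) = reduce_dots([(r, r)])
    reductions = 1
    for _ in range(max_iters):
        if gamma <= tol * tol:
            break
        s = matvec(A, p)
        (delta,) = reduce_dots([(p, s)])      # first reduction
        alpha = gamma / delta
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        gamma_old = gamma
        (gamma,) = reduce_dots([(r, r)])      # second reduction
        reductions += 2
        beta = gamma / gamma_old
        p = [ri + beta * pi for ri, pi in zip(r, p)]
    return x, reductions

def cg_fused(A, b, max_iters=50, tol=1e-10):
    """Chronopoulos-Gear style CG: one fused reduction per iteration."""
    n = len(b)
    x, p, s = [0.0] * n, [0.0] * n, [0.0] * n
    r = b[:]
    w = matvec(A, r)
    gamma, delta = reduce_dots([(r, r), (w, r)])
    reductions = 1
    gamma_old = alpha = None
    for it in range(max_iters):
        if gamma <= tol * tol:
            break
        if it == 0:
            beta, alpha = 0.0, gamma / delta
        else:
            beta = gamma / gamma_old
            # (p, A p) is recovered from scalars instead of a new reduction.
            alpha = gamma / (delta - beta * gamma / alpha)
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        s = [wi + beta * si for wi, si in zip(w, s)]   # extra update: s = A p
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = matvec(A, r)
        gamma_old = gamma
        gamma, delta = reduce_dots([(r, r), (w, r)])   # single fused reduction
        reductions += 1
    return x, reductions

A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]
x_std, red_std = cg_standard(A, b)
x_fused, red_fused = cg_fused(A, b)
print("reductions, standard vs fused:", red_std, red_fused)
```

The fused version performs an additional vector update and a little more scalar arithmetic per iteration, so it gains nothing on a sequential processor; the saving is in the number of global reductions, which matters when many processors communicate over a high-latency network.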
- Example is the reduction to narrow band form for
the SVD
- Fetch each entry of A once
- Restructure and combine operations
- Results in a speedup of > 30%
16 Tools for Performance Evaluation
- Timing and performance evaluation has been an art
- Resolution of the clock
- Issues about cache effects
- Different systems
- Can be cumbersome and inefficient with
traditional tools
- Situation about to change
- Today's processors have internal counters
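Two of the difficulties listed above, clock resolution and cache effects, can be handled with a simple timing discipline: measure the clock's observable tick, warm up once, then repeat the kernel until the elapsed time is many ticks long. The sketch below uses Python's `time.perf_counter`; the helper names and the threshold of 1000 ticks are assumptions chosen for illustration.

```python
import time

def clock_tick(samples=200):
    """Smallest nonzero step observed between successive clock readings."""
    smallest = float("inf")
    for _ in range(samples):
        t0 = time.perf_counter()
        t1 = time.perf_counter()
        while t1 == t0:                  # spin until the clock advances
            t1 = time.perf_counter()
        smallest = min(smallest, t1 - t0)
    return smallest

def time_kernel(kernel, min_ticks=1000, warmup=1):
    """Return (seconds per call, repetitions used) for a callable kernel."""
    tick = clock_tick()
    for _ in range(warmup):              # untimed run so caches are warm
        kernel()
    reps = 1
    while True:
        t0 = time.perf_counter()
        for _ in range(reps):
            kernel()
        elapsed = time.perf_counter() - t0
        if elapsed >= min_ticks * tick:  # long enough to trust the clock
            return elapsed / reps, reps
        reps *= 2                        # too short: double the work and retry

per_call, reps = time_kernel(lambda: sum(range(1000)))
print(f"{per_call * 1e6:.2f} microseconds per call over {reps} repetitions")
```

Timing a warm kernel measures a different quantity than timing a cold one; which is appropriate depends on how the routine is used in the application.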
17 Performance Counters
- Almost all high performance processors include
hardware performance counters.
- Some are easy to access, others not available to
users.
- On most platforms the APIs, if they exist, are
not appropriate for the end user or well
documented.
- Existing performance counter APIs
- Compaq Alpha EV 6 6/7
- SGI MIPS R10000
- IBM Power Series
- Sun Solaris
- Pentium Linux and Windows
- IA-64
- Hitachi
- Fujitsu
- Need tools that allow us to examine performance
and identify problems.
- Should be simple to use
- Perhaps in an automatic fashion
- Machine assisted optimization of key components
- Think of it as a higher level compiler
- Done via experimentation