Provided by: jack236

Learn more at: https://www.netlib.org

1
How Can We Address the Needs and Solve the
Problems in HPC Benchmarking?
Workshop on the Performance Characterization of
Algorithms

Jack Dongarra
Innovative Computing Laboratory
University of Tennessee
http://www.cs.utk.edu/~dongarra/

2
LINPACK Benchmark

Accidental benchmarking
Designed to help users extrapolate execution time
for Linpack software
First benchmark report from 1979

3
Accidental Benchmarking

Portable, runs on any system
Easy to understand
Content changed over time
n = 100, 300, 1000, as large as possible (Top500)
Allows for restructuring algorithm
Performance data with the same arithmetic
precision
Benchmark checks to see if correct solution
achieved
Not intended to measure entire machine
performance.
In the benchmark report: "One further note: The
following performance data should not be taken
too seriously."

4
LINPACK Benchmark

Scalable benchmark, size and parallel
Pressure on vendors to optimize my software and
provide a set of kernels that benefit others
Run rules very important
Today, n = 0.5 x 10^6 at 7.2 TFlop/s requires 3.3
hours
On a Petaflops machine, n = 5 x 10^6 will require
1 day.
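
The run times quoted above can be roughly reproduced with a back-of-the-envelope calculation. The sketch below assumes the conventional LINPACK operation count of 2/3 n^3 + 2 n^2 floating-point operations and a machine that sustains the stated rate for the whole run; the function names are illustrative and do not come from the slides.

```python
def linpack_flops(n):
    """Conventional operation count for solving a dense n x n system (assumed)."""
    return (2.0 / 3.0) * n**3 + 2.0 * n**2


def linpack_hours(n, flops_per_second):
    """Estimated wall-clock hours at a sustained rate of flops_per_second."""
    return linpack_flops(n) / flops_per_second / 3600.0


if __name__ == "__main__":
    # n = 0.5 x 10^6 at 7.2 TFlop/s (the "today" figure on the slide)
    print(f"{linpack_hours(0.5e6, 7.2e12):.1f} hours")
    # n = 5 x 10^6 on a hypothetical 1 PFlop/s machine
    print(f"{linpack_hours(5e6, 1e15):.1f} hours")
```

Because the cost grows as n^3, a tenfold increase in n needs a thousandfold increase in work, which is why the Petaflops case still takes about a day.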

Historical data
For n = 100, same software for the last 22 years
Unbiased reporting
Freely available sw/results worldwide
Should be able to achieve high performance on
this problem, if not
Compiler test at n = 100; heavily hand optimized at
TPP (modified ScaLAPACK implementation)

5
Benchmark

Machine signatures
Algorithm characteristics
Make improvements in applications
Users looking for performance portability
Many of the things we do are specific to one
system's parameters
Need a way to understand and rapidly develop
software which has a chance at high performance

6
Self-Adapting Numerical Software (SANS)

Today's processors can achieve high performance,
but this requires extensive machine-specific hand
tuning.
Simple operations like Matrix-Vector ops require
many man-hours / platform
Software lags far behind hardware introduction
Only done if financial incentive is there
Compilers not up to optimization challenge
Hardware, compilers, and software have a large
design space w/many parameters
Blocking sizes, loop nesting permutations, loop
unrolling depths, software pipelining strategies,
register allocations, and instruction schedules.
Complicated interactions with the increasingly
sophisticated micro-architectures of new
microprocessors.
Need for quick/dynamic deployment of optimized
routines.
ATLAS - Automatically Tuned Linear Algebra Software

7
Software Generation Strategy - BLAS

Parameter study of the hw
Generate multiple versions of code, w/different
values of key performance parameters
Run and measure the performance for various
versions
Pick best and generate library
Level 1 cache multiply optimizes for
TLB access
L1 cache reuse
FP unit usage
Memory fetch
Register reuse
Loop overhead minimization
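
The generate, measure, and pick-best loop described above can be illustrated with a minimal sketch. This is not ATLAS itself: it tunes a single hypothetical parameter (the block size nb of a blocked matrix multiply) in pure Python, and the names blocked_matmul and autotune, the candidate block sizes, and the problem size are all assumptions chosen for illustration.

```python
import random
import time


def blocked_matmul(A, B, n, nb):
    """Return C = A*B for n x n lists of lists, computed in nb x nb blocks."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, nb):
        for kk in range(0, n, nb):
            for jj in range(0, n, nb):
                for i in range(ii, min(ii + nb, n)):
                    Ci, Ai = C[i], A[i]
                    for k in range(kk, min(kk + nb, n)):
                        a, Bk = Ai[k], B[k]
                        for j in range(jj, min(jj + nb, n)):
                            Ci[j] += a * Bk[j]
    return C


def autotune(n=48, candidates=(4, 8, 16, 48), repeats=2):
    """Time each candidate block size and return (best_nb, timings)."""
    rng = random.Random(0)  # fixed seed so every variant sees the same data
    A = [[rng.random() for _ in range(n)] for _ in range(n)]
    B = [[rng.random() for _ in range(n)] for _ in range(n)]
    timings = {}
    for nb in candidates:
        best = float("inf")
        for _ in range(repeats):
            t0 = time.perf_counter()
            blocked_matmul(A, B, n, nb)
            best = min(best, time.perf_counter() - t0)
        timings[nb] = best  # keep the fastest of the repeats
    best_nb = min(timings, key=timings.get)
    return best_nb, timings


if __name__ == "__main__":
    best_nb, timings = autotune()
    for nb, t in sorted(timings.items()):
        print(f"nb={nb:3d}  {t * 1e3:8.2f} ms")
    print("selected block size:", best_nb)
```

A real tuner searches many more parameters (unrolling depth, loop order, register blocking) and emits generated C code; the selection logic, however, has this same shape.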

Takes 20 minutes to run.
New model of high performance programming where
critical code is machine generated using
parameter optimization.
Designed for RISC arch
Super Scalar
Need reasonable C compiler
Today ATLAS is in use by Matlab, Mathematica,
Octave, Maple, Debian, Scyld Beowulf, SuSE, ...

8
ATLAS (DGEMM, n = 500)

ATLAS is faster than all other portable BLAS
implementations and it is comparable with
machine-specific libraries provided by the vendor.

9
Related Tuning Projects

PHiPAC
Portable High Performance ANSI C
http://www.icsi.berkeley.edu/~bilmes/phipac
Initial automatic GEMM generation project
FFTW: Fastest Fourier Transform in the West
http://www.fftw.org
UHFFT
Tuning parallel FFT algorithms
http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
SPIRAL
Signal Processing Algorithms Implementation
Research for Adaptable Libraries
Maps DSP algorithms to architectures
http://www.ece.cmu.edu/~spiral/
Sparsity
Sparse-matrix-vector and sparse-matrix-matrix
multiplication
http://www.cs.berkeley.edu/~ejim/publication/
Tunes code to sparsity structure of matrix
More later in this tutorial

10
Experiments with C, Fortran, and Java for ATLAS
(DGEMM kernel)

11
Machine-Assisted Application Development and
Adaptation

Communication libraries
Optimize for the specifics of one's
configuration.
Algorithm layout and implementation
Look at the different ways to express
implementation

12
Work in Progress: ATLAS-like Approach Applied to
Broadcast (PII 8-Way Cluster with 100 Mb/s
switched network)

13
Conjugate Gradient Variants by Dynamic Selection
at Run Time

Variants combine inner products to reduce
communication bottleneck at the expense of more
scalar ops.
Same number of iterations, no advantage on a
sequential processor
With a large number of processors and a
high-latency network, there may be advantages.
Improvements can range from 15% to 50% depending
on size.
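
As an illustration of combining inner products, the sketch below implements one well-known reformulation, the Chronopoulos-Gear variant of conjugate gradients. The slides do not say which variants were used, so this choice is an assumption. Its two inner products per iteration, (r, r) and (A r, r), are computed back to back, so on a parallel machine they could share a single global reduction; the price is an extra vector recurrence (s) and a few more scalar operations. The helper names are illustrative.

```python
def matvec(A, v):
    """Dense matrix-vector product for a list-of-lists matrix."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cg_fused(A, b, tol=1e-10, max_iter=100):
    """Chronopoulos-Gear CG for symmetric positive definite A.

    Returns (x, iterations). Starts from x = 0.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                 # residual b - A*x with x = 0
    w = matvec(A, r)            # w = A*r
    gamma = dot(r, r)
    delta = dot(w, r)
    if gamma ** 0.5 <= tol:
        return x, 0
    alpha = gamma / delta
    beta = 0.0
    p = [0.0] * n
    s = [0.0] * n               # s tracks A*p without a second matvec
    for k in range(1, max_iter + 1):
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        s = [wi + beta * si for wi, si in zip(w, s)]
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * si for ri, si in zip(r, s)]
        w = matvec(A, r)
        # The two inner products below are independent of each other,
        # so a parallel code could combine them into one reduction.
        gamma_new = dot(r, r)
        delta = dot(w, r)
        if gamma_new ** 0.5 <= tol:
            return x, k
        beta = gamma_new / gamma
        alpha = gamma_new / (delta - beta * gamma_new / alpha)
        gamma = gamma_new
    return x, max_iter


if __name__ == "__main__":
    A = [[4.0, 1.0, 0.0], [1.0, 3.0, 1.0], [0.0, 1.0, 2.0]]
    b = [1.0, 2.0, 3.0]
    x, iters = cg_fused(A, b)
    print("iterations:", iters)
    print("x =", x)
```

In exact arithmetic this produces the same iterates as standard CG, which matches the point above that the iteration count is unchanged and any gain comes only from reduced communication.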

15
Reformulating/Rearranging/Reuse

Example is the reduction to narrow band form for
the SVD
Fetch each entry of A once
Restructure and combine operations
Results in a speedup of > 30%

16
Tools for Performance Evaluation

Timing and performance evaluation has been an art
Resolution of the clock
Issues about cache effects
Different systems
Can be cumbersome and inefficient with
traditional tools
Situation about to change
Today's processors have internal counters
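
Two of the timing pitfalls listed above, clock resolution and cache effects, can be probed with a short sketch. It uses only Python's standard time module; the helper names observed_tick_ns and best_of are illustrative, and the approach (one warm-up call, then the minimum over several repeats) is one common convention rather than anything prescribed by the slides.

```python
import time


def observed_tick_ns(samples=100):
    """Smallest nonzero step seen between consecutive clock readings, in ns."""
    smallest = None
    for _ in range(samples):
        t0 = time.perf_counter_ns()
        t1 = time.perf_counter_ns()
        while t1 == t0:          # spin until the clock visibly advances
            t1 = time.perf_counter_ns()
        step = t1 - t0
        if smallest is None or step < smallest:
            smallest = step
    return smallest


def best_of(func, repeats=5, warmup=1):
    """Run func a few times untimed (warm caches), then return the best time in seconds."""
    for _ in range(warmup):
        func()
    best = float("inf")
    for _ in range(repeats):
        t0 = time.perf_counter()
        func()
        best = min(best, time.perf_counter() - t0)
    return best


if __name__ == "__main__":
    print("observed clock tick:", observed_tick_ns(), "ns")
    print("reported resolution:", time.get_clock_info("perf_counter").resolution, "s")
    print("best of 5:", best_of(lambda: sum(range(100_000))), "s")
```

Any measured interval that is not many times larger than the observed tick is dominated by clock granularity, which is one reason wall-clock timing alone gives a coarse picture compared with hardware counters.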

17
Performance Counters

Almost all high performance processors include
hardware performance counters.
Some are easy to access, others not available to
users.
On most platforms the APIs, if they exist, are
not appropriate for the end user or well
documented.
Existing performance counter APIs
Compaq Alpha EV 6 & 6/7
SGI MIPS R10000
IBM Power Series
CRAY T3E
Sun Solaris
Pentium Linux and Windows