1
How Can We Address the Needs and Solve the
Problems in HPC Benchmarking?
Workshop on the Performance Characterization of
Algorithms
  • Jack Dongarra
  • Innovative Computing Laboratory
  • University of Tennessee
  • http://www.cs.utk.edu/~dongarra/

2
LINPACK Benchmark
  • Accidental benchmarking
  • Designed to help users extrapolate execution time
    for Linpack software
  • First benchmark report from 1979

3
Accidental Benchmarking
  • Portable, runs on any system
  • Easy to understand
  • Content changed over time
  • n = 100, 300, 1000, as large as possible (Top500)
  • Allows for restructuring algorithm
  • Performance data with the same arithmetic
    precision
  • Benchmark checks to see if correct solution
    achieved
  • Not intended to measure entire machine
    performance.
  • In the benchmark report: "One further note: The
    following performance data should not be taken
    too seriously."
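The correctness check mentioned above is, in LINPACK-style benchmarks, a scaled residual test. A minimal sketch in C of the general idea (the routine name scaled_residual is illustrative, not the benchmark's own code): the residual of the computed solution, scaled by the norms of the data, the problem size, and machine epsilon, should be of order one.

  #include <math.h>
  #include <float.h>
  #include <stdlib.h>

  /* Scaled residual test in the style of the LINPACK benchmark:
   * r = ||A*x - b||_inf / (||A||_inf * ||x||_inf * n * eps).
   * A is n x n, column-major with leading dimension lda; a result
   * of order 1 indicates the solution is acceptably accurate. */
  double scaled_residual(int n, const double *A, int lda,
                         const double *x, const double *b)
  {
      double *r = malloc((size_t)n * sizeof(double));
      double rnorm = 0.0, anorm = 0.0, xnorm = 0.0;

      for (int i = 0; i < n; i++) r[i] = -b[i];

      /* r = A*x - b, accumulated column by column */
      for (int j = 0; j < n; j++)
          for (int i = 0; i < n; i++)
              r[i] += A[i + (size_t)j * lda] * x[j];

      for (int i = 0; i < n; i++) {
          double rowsum = 0.0;
          for (int j = 0; j < n; j++)
              rowsum += fabs(A[i + (size_t)j * lda]);
          if (rowsum > anorm)     anorm = rowsum;
          if (fabs(r[i]) > rnorm) rnorm = fabs(r[i]);
          if (fabs(x[i]) > xnorm) xnorm = fabs(x[i]);
      }
      free(r);
      return rnorm / (anorm * xnorm * n * DBL_EPSILON);
  }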

4
LINPACK Benchmark
  • Scalable benchmark, in size and parallelism
  • Pressure on vendors to optimize my software and
    provide a set of kernels that benefit others
  • Run rules very important
  • Today, n ≈ 0.5 x 10^6 at 7.2 TFlop/s requires 3.3
    hours
  • On a Petaflops machine, n = 5 x 10^6 will require 1
    day.
  • Historical data
  • For n = 100, same software for the last 22 years
  • Unbiased reporting
  • Freely available sw/results worldwide
  • Should be able to achieve high performance on
    this problem, if not
  • Compiler test at n = 100, heavily hand-optimized at
    TPP (modified ScaLAPACK implementation)
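The run times quoted above follow directly from the roughly (2/3)n^3 flops of the LU solve; a few lines of C reproduce the arithmetic, using the sizes and rates on the slide:

  #include <stdio.h>

  /* Rough runtime estimate for the TPP LINPACK run: the LU solve costs
   * about (2/3)*n^3 flops, so time ~ (2/3)*n^3 / rate.  The two cases
   * below are the ones quoted on the slide. */
  int main(void)
  {
      double n1 = 0.5e6, rate1 = 7.2e12;   /* today: n ~ 0.5e6 at 7.2 TFlop/s */
      double n2 = 5.0e6, rate2 = 1.0e15;   /* petaflops machine, n = 5e6      */

      double t1 = (2.0 / 3.0) * n1 * n1 * n1 / rate1;   /* seconds */
      double t2 = (2.0 / 3.0) * n2 * n2 * n2 / rate2;

      printf("n = %.1e at %.1e flop/s: %.1f hours\n", n1, rate1, t1 / 3600.0);
      printf("n = %.1e at %.1e flop/s: %.1f days\n",  n2, rate2, t2 / 86400.0);
      return 0;
  }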

5
Benchmark
  • Machine signatures
  • Algorithm characteristics
  • Make improvements in applications
  • Users looking for performance portability
  • Many of the things we do are specific to one
    system's parameters
  • Need a way to understand and rapidly develop
    software which has a chance at high performance

6
Self-Adapting Numerical Software (SANS)
  • Today's processors can achieve high performance,
    but this requires extensive machine-specific hand
    tuning.
  • Simple operations like matrix-vector ops require
    many man-hours per platform
  • Software lags far behind hardware introduction
  • Only done if financial incentive is there
  • Compilers not up to optimization challenge
  • Hardware, compilers, and software have a large
    design space w/many parameters
  • Blocking sizes, loop nesting permutations, loop
    unrolling depths, software pipelining strategies,
    register allocations, and instruction schedules.
  • Complicated interactions with the increasingly
    sophisticated micro-architectures of new
    microprocessors.
  • Need for quick/dynamic deployment of optimized
    routines.
  • ATLAS - Automatically Tuned Linear Algebra Software
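To make the design space concrete, here is a minimal C sketch of the kind of cache-blocked matrix-multiply kernel whose blocking factor becomes one of the tunable parameters. This is an illustration only, not ATLAS's generated code; the name dgemm_blocked and the default NB of 64 are placeholders, not tuned choices.

  #include <stddef.h>

  /* Sketch of a cache-blocked C = C + A*B kernel (column-major, n x n).
   * NB is the kind of parameter an ATLAS-style generator would sweep. */
  #ifndef NB
  #define NB 64
  #endif

  static void dgemm_blocked(int n, const double *A, const double *B, double *C)
  {
      for (int jj = 0; jj < n; jj += NB)
          for (int kk = 0; kk < n; kk += NB)
              for (int ii = 0; ii < n; ii += NB)
                  for (int j = jj; j < jj + NB && j < n; j++)
                      for (int k = kk; k < kk + NB && k < n; k++) {
                          double bkj = B[k + (size_t)j * n];
                          for (int i = ii; i < ii + NB && i < n; i++)
                              C[i + (size_t)j * n] += A[i + (size_t)k * n] * bkj;
                      }
  }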

7
Software Generation Strategy - BLAS
  • Parameter study of the hw
  • Generate multiple versions of code, w/different
    values of key performance parameters
  • Run and measure the performance for various
    versions
  • Pick best and generate library
  • Level 1 cache multiply optimizes for
  • TLB access
  • L1 cache reuse
  • FP unit usage
  • Memory fetch
  • Register reuse
  • Loop overhead minimization
  • Takes 20 minutes to run.
  • New model of high performance programming where
    critical code is machine generated using
    parameter optimization.
  • Designed for RISC arch
  • Super Scalar
  • Need reasonable C compiler
  • Today ATLAS is in use by Matlab, Mathematica,
    Octave, Maple, Debian, Scyld Beowulf, SuSE, ...
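A minimal sketch of the generate-and-measure strategy described above: build the kernel with several candidate blocking factors, time each version, and keep the best. The compiler command, the file kernel.c, and the binary ./kernel are placeholder names; a real generator such as ATLAS sweeps many more parameters than this.

  #include <stdio.h>
  #include <stdlib.h>

  int main(void)
  {
      const int candidates[] = { 16, 32, 48, 64, 80, 128 };
      int best_nb = 0;
      double best_time = 1e30;
      char cmd[256];

      for (size_t i = 0; i < sizeof candidates / sizeof candidates[0]; i++) {
          int nb = candidates[i];

          /* generate and build one version of the kernel */
          snprintf(cmd, sizeof cmd, "cc -O3 -DNB=%d kernel.c -o kernel", nb);
          if (system(cmd) != 0) continue;

          /* the kernel binary is assumed to print its run time in seconds */
          FILE *p = popen("./kernel", "r");
          double t;
          if (p && fscanf(p, "%lf", &t) == 1 && t < best_time) {
              best_time = t;
              best_nb = nb;
          }
          if (p) pclose(p);
      }
      printf("best NB = %d (%.3f s)\n", best_nb, best_time);
      return 0;
  }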

8
ATLAS (DGEMM, n = 500)
  • ATLAS is faster than all other portable BLAS
    implementations and it is comparable with
    machine-specific libraries provided by the vendor.

9
Related Tuning Projects
  • PHiPAC
  • Portable High Performance ANSI C
    http://www.icsi.berkeley.edu/~bilmes/phipac
    initial automatic GEMM generation project
  • FFTW Fastest Fourier Transform in the West
  • http://www.fftw.org
  • UHFFT
  • tuning parallel FFT algorithms
  • http://rodin.cs.uh.edu/~mirkovic/fft/parfft.htm
  • SPIRAL
  • Signal Processing Algorithms Implementation
    Research for Adaptable Libraries; maps DSP
    algorithms to architectures
  • http://www.ece.cmu.edu/spiral/
  • Sparsity
  • Sparse-matrix-vector and sparse-matrix-matrix
    multiplication
    http://www.cs.berkeley.edu/~ejim/publication/
    tunes code to the sparsity structure of the
    matrix; more later in this tutorial
  • University of Tennessee

10
Experiments with C, Fortran, and Java for ATLAS
(DGEMM kernel)
11
Machine-Assisted Application Development and
Adaptation
  • Communication libraries
  • Optimize for the specifics of one's
    configuration.
  • Algorithm layout and implementation
  • Look at the different ways to express
    implementation

12
Work in Progress: ATLAS-like Approach Applied to
Broadcast (PII 8-way cluster with 100 Mb/s
switched network)
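As an illustration of the ATLAS-like approach applied to broadcast, the C/MPI sketch below times the library MPI_Bcast against a simple chain (pipeline) broadcast for one message size and selects the faster variant. The function names and the message size are placeholders, not taken from the experiment on the slide.

  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Chain broadcast: each rank forwards the buffer to the next one. */
  static void chain_bcast(double *buf, int count, int rank, int size)
  {
      if (rank > 0)
          MPI_Recv(buf, count, MPI_DOUBLE, rank - 1, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      if (rank < size - 1)
          MPI_Send(buf, count, MPI_DOUBLE, rank + 1, 0, MPI_COMM_WORLD);
  }

  static void lib_bcast(double *buf, int count, int rank, int size)
  {
      (void)rank; (void)size;
      MPI_Bcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD);
  }

  /* Time one variant; report the maximum time over all ranks. */
  static double time_variant(void (*bcast)(double *, int, int, int),
                             double *buf, int count, int rank, int size)
  {
      MPI_Barrier(MPI_COMM_WORLD);
      double t0 = MPI_Wtime();
      bcast(buf, count, rank, size);
      double t = MPI_Wtime() - t0, tmax;
      MPI_Allreduce(&t, &tmax, 1, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
      return tmax;
  }

  int main(int argc, char **argv)
  {
      MPI_Init(&argc, &argv);
      int rank, size, count = 100000;          /* ~0.8 MB message, placeholder */
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      double *buf = calloc(count, sizeof(double));

      double t_lib   = time_variant(lib_bcast,   buf, count, rank, size);
      double t_chain = time_variant(chain_bcast, buf, count, rank, size);
      if (rank == 0)
          printf("MPI_Bcast %.6f s, chain %.6f s -> use %s\n",
                 t_lib, t_chain, t_lib <= t_chain ? "MPI_Bcast" : "chain");

      free(buf);
      MPI_Finalize();
      return 0;
  }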
13
Conjugate Gradient Variants by Dynamic Selection
at Run Time
  • Variants combine inner products to reduce
    communication bottleneck at the expense of more
    scalar ops.
  • Same number of iterations, no advantage on a
    sequential processor
  • With a large number of processors and a
    high-latency network there may be an advantage.
  • Improvements can range from 15% to 50% depending
    on problem size.
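The communication-saving idea can be shown in isolation; this is only the reduction pattern, not a full CG implementation, and the function names are illustrative. The rearranged variants pack the two inner products of an iteration into a single allreduce, halving the number of latency-bound global operations.

  #include <mpi.h>

  /* local_a, local_b: this process's partial sums for the two inner products */

  /* classic pattern: two separate global reductions per iteration */
  void two_reductions(double local_a, double local_b, double *a, double *b)
  {
      MPI_Allreduce(&local_a, a, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      MPI_Allreduce(&local_b, b, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
  }

  /* rearranged pattern: one reduction on a packed two-element buffer */
  void one_reduction(double local_a, double local_b, double *a, double *b)
  {
      double in[2] = { local_a, local_b }, out[2];
      MPI_Allreduce(in, out, 2, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      *a = out[0];
      *b = out[1];
  }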

14
Conjugate Gradient Variants by Dynamic Selection
at Run Time

15
Reformulating/Rearranging/Reuse
  • Example is the reduction to narrow band form for
    the SVD
  • Fetch each entry of A once
  • Restructure and combine operations
  • Results in a speedup of > 30
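The "fetch each entry of A once" idea, illustrated with a generic C sketch rather than the actual band-reduction kernel: two matrix-vector products that would normally take two passes over A are combined into one sweep, so every entry is loaded exactly once.

  #include <stddef.h>

  /* Compute y = A*x and z = A'*w in a single sweep over A
   * (column-major, n x n), instead of two separate passes. */
  void fused_matvecs(int n, const double *A,
                     const double *x, const double *w,
                     double *y, double *z)
  {
      for (int i = 0; i < n; i++) y[i] = 0.0;
      for (int j = 0; j < n; j++) {
          double xj = x[j], zj = 0.0;
          for (int i = 0; i < n; i++) {
              double aij = A[i + (size_t)j * n];  /* single load of A(i,j) */
              y[i] += aij * xj;                   /* contributes to A*x    */
              zj   += aij * w[i];                 /* contributes to A'*w   */
          }
          z[j] = zj;
      }
  }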

16
Tools for Performance Evaluation
  • Timing and performance evaluation has been an art
  • Resolution of the clock
  • Issues about cache effects
  • Different systems
  • Can be cumbersome and inefficient with
    traditional tools
  • Situation about to change
  • Today's processors have internal counters
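A common way to cope with coarse clock resolution and cold-cache effects is sketched below; the harness and its names are illustrative, not a specific tool. Warm up once, then repeat the kernel until the elapsed time is well above the timer granularity, and report the average per call.

  #include <time.h>

  static double now_sec(void)
  {
      struct timespec ts;
      clock_gettime(CLOCK_MONOTONIC, &ts);
      return ts.tv_sec + 1e-9 * ts.tv_nsec;
  }

  double time_kernel(void (*kernel)(void *), void *arg)
  {
      const double min_elapsed = 0.1;       /* stay far above clock resolution */
      long reps = 1;

      kernel(arg);                          /* warm-up: prime caches and TLB */
      for (;;) {
          double t0 = now_sec();
          for (long r = 0; r < reps; r++)
              kernel(arg);
          double elapsed = now_sec() - t0;
          if (elapsed >= min_elapsed)
              return elapsed / reps;        /* average time per call */
          reps *= 2;                        /* too short to trust: repeat more */
      }
  }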

17
Performance Counters
  • Almost all high performance processors include
    hardware performance counters.
  • Some are easy to access, others not available to
    users.
  • On most platforms the APIs, if they exist, are
    not appropriate for the end user or are not well
    documented.
  • Existing performance counter APIs
  • Compaq Alpha EV6 and EV67
  • SGI MIPS R10000
  • IBM Power Series
  • CRAY T3E
  • Sun Solaris
  • Pentium Linux and Windows
  • IA-64
  • HP-PA RISC
  • Hitachi
  • Fujitsu
  • NEC
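One way to hide these platform differences is a portable counter layer such as PAPI (developed at the University of Tennessee, though not named on this slide). The sketch below assumes PAPI is installed and that the two preset events are available on the platform.

  #include <stdio.h>
  #include <papi.h>

  /* Read total cycles and L1 data cache misses around a code region. */
  int main(void)
  {
      int eventset = PAPI_NULL;
      long long counts[2];

      if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;
      if (PAPI_create_eventset(&eventset) != PAPI_OK) return 1;
      PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles         */
      PAPI_add_event(eventset, PAPI_L1_DCM);    /* L1 data cache misses */

      PAPI_start(eventset);
      /* ... code region to be measured goes here ... */
      PAPI_stop(eventset, counts);

      printf("cycles: %lld  L1 D-cache misses: %lld\n", counts[0], counts[1]);
      return 0;
  }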

18
Directions
  • Need tools that allow us to examine performance
    and identify problems.
  • Should be simple to use
  • Perhaps in an automatic fashion
  • Machine assisted optimization of key components
  • Think of it as a higher level compiler
  • Done via experimentation