Title: Automatic Parameterisation of Parallel Linear Algebra Routines
1Automatic Parameterisation of Parallel Linear
Algebra Routines
- Domingo Giménez
- Javier Cuenca
- José González
- University of Murcia
- SPAIN
- Algèbre Linéaire et Arithmétique: Calcul Numérique, Symbolique et Parallèle - Rabat, Morocco, 28-31 May 2001
2Outline
- Current Situation of Linear Algebra Parallel Routines (LAPRs)
- Objective
- Approach I: Analytical Model of the LAPRs
  - Application: Jacobi Method on Origin 2000
- Approach II: Exhaustive Executions
  - Application: Gauss elimination on networks of processors
- Validation with the LU factorization
- Conclusions
- Future Works
3Current Situation of Linear Algebra Parallel
Routines (LAPRs)
- Linear Algebra: highly optimizable operations
- Optimizations are platform specific
- Traditional method: hand-optimization for each platform
4Problems of traditional method
- Time-consuming
- Incompatible with Hardware Evolution
- Incompatible with changes in the system (architecture and basic libraries)
- Unsuitable for dynamic systems
- Misuse by non-expert users
5Current approaches
- ATLAS, FLAME, I-LIB
- Analyse platform characteristics in detail
- Sequential code
- Empirical results of the LAPR Automation
- High Installation Time
6Our objective
- Develop a methodology for obtaining Automatically Tuned Software
- Execution Environment
- Auto-tuning Software
7Methodology
- Routines Parameterised
  - System parameters, Algorithmic parameters
- System parameters obtained at installation time
  - Analytical model of the routine and simple installation routines to obtain the system parameters
  - A reduced number of executions at installation time
- Algorithmic parameters obtained at running time
  - From the analytical model with the system parameters obtained in the installation process
  - From the file with information generated in the installation process
8Analytical modelling
- System parameters obtained at installation time
  - Analytical model of the routine and simple installation routines to obtain the system parameters
- Algorithmic parameters obtained at running time
  - From the analytical model with the system parameters obtained in the installation process
9Analytical Model
- The behaviour of the algorithm on the platform is defined by
  - Texec = f(SPs, n, APs)
  - SPs = f(n, APs): System Parameters
  - APs: Algorithmic Parameters
  - n: Problem Size
10Analytical Model
- System Parameters (SPs)
- Hardware Platform
- Physical Characteristics
- Current Conditions
- Basic libraries
- How to estimate each SP?
  - 1º Obtain the kernel of performance cost of the LAPR
  - 2º Make an Estimation Routine from this kernel
- Two kinds of SPs
  - Communication System Parameters (CSPs)
  - Arithmetic System Parameters (ASPs)
LAPRs Performance
11Analytical Model
- Arithmetic System Parameters (ASPs)
  - tc: arithmetic cost
  - but using BLAS: k1, k2 and k3 (cost of arithmetic operations of the different BLAS levels)
- Computation Kernel of the LAPR → Estimation Routine
  - Similar storage scheme
  - Similar quantity of data
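The idea of an estimation routine for an arithmetic parameter can be sketched as follows. This is only an illustration under stated assumptions, not the authors' routine: a naive pure-Python matrix product stands in for the level-3 BLAS kernel, and the names `matmul` and `estimate_k3` are hypothetical. The cost per floating-point operation is taken as the best elapsed time divided by the 2b³ operations of a b × b product.

```python
import time

def matmul(a, b_mat, size):
    """Naive size x size matrix product; a stand-in for a level-3 BLAS call."""
    c = [[0.0] * size for _ in range(size)]
    for i in range(size):
        row_a, row_c = a[i], c[i]
        for k in range(size):
            aik = row_a[k]
            row_b = b_mat[k]
            for j in range(size):
                row_c[j] += aik * row_b[j]
    return c

def estimate_k3(block_size, repeats=3):
    """Estimate the time in seconds of one flop in a block_size x block_size product."""
    a = [[1.0] * block_size for _ in range(block_size)]
    b_mat = [[2.0] * block_size for _ in range(block_size)]
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        matmul(a, b_mat, block_size)
        best = min(best, time.perf_counter() - start)
    flops = 2 * block_size ** 3  # one multiply and one add per inner iteration
    return best / flops

# One estimate per block size, because the cost per flop can depend on b.
k3_by_block = {b: estimate_k3(b) for b in (8, 16, 32)}
```

Measuring with the same block sizes and storage layout as the real routine is what the slide means by "similar storage scheme" and "similar quantity of data".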
12Analytical Model
- Communication System Parameters (CSPs)
  - ts: start-up time
  - tw: word-sending time
- Communication Kernel of the LAPR → Estimation Routine
  - Similar kind of communication
  - Similar quantity of data
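A possible way to obtain ts and tw from measurements is sketched below. It assumes the common linear model in which sending m words costs ts + m·tw, and fits both parameters by least squares. The timings here are synthetic, generated from assumed values, because real ones would come from timed message exchanges on the target platform; `fit_ts_tw` is a hypothetical name.

```python
def fit_ts_tw(sizes, times):
    """Least-squares fit of times ~ ts + size * tw; returns (ts, tw)."""
    count = len(sizes)
    mean_m = sum(sizes) / count
    mean_t = sum(times) / count
    cov = sum((m - mean_m) * (t - mean_t) for m, t in zip(sizes, times))
    var = sum((m - mean_m) ** 2 for m in sizes)
    tw = cov / var
    ts = mean_t - tw * mean_m
    return ts, tw

# Synthetic timings in microseconds, generated from the assumed values
# ts = 20 and tw = 0.1; a real estimation routine would measure them.
message_sizes = [1000, 2000, 4000, 8000, 16000]
measured = [20.0 + 0.1 * m for m in message_sizes]
ts, tw = fit_ts_tw(message_sizes, measured)
```

The message sizes should cover the range that the routine actually sends, in line with "similar quantity of data".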
13Analytical Model
- Algorithmic Parameters (APs): values chosen in each execution
  - b: block size
  - p: number of processors
  - r × c: logical topology, grid configuration (logical 2D mesh)
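The set of AP combinations to consider can be enumerated with a few lines of code. The sketch below is illustrative only; the function names and the candidate values for b and p are assumptions. For each number of processors p it lists every r × c mesh with r · c = p.

```python
def grid_configurations(p):
    """Return every (r, c) with r * c == p, i.e. each logical 2D mesh for p processors."""
    return [(r, p // r) for r in range(1, p + 1) if p % r == 0]

def candidate_aps(block_sizes, processor_counts):
    """Enumerate the (b, p, r, c) combinations to be evaluated."""
    return [(b, p, r, c)
            for b in block_sizes
            for p in processor_counts
            for r, c in grid_configurations(p)]

candidates = candidate_aps(block_sizes=(32, 64, 128), processor_counts=(4, 8))
```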
14The Methodology. Step by step
- Pre-installing (manual)
  - 1º Make the Analytical Model: Texec = f(SPs, n, APs)
  - 2º Write the Estimation Routines for the SPs
- Installing on a Platform (automatic)
  - 3º Estimate the SPs using the Estimation Routines of step 2
  - 4º Write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec
- Execution
  - The user executes the LAPR for a size n
  - The LAPR obtains the optimal APs
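The installation step that turns a model plus estimated SPs into a table of optimal APs can be sketched as follows. Everything specific here is an assumption: `toy_model` is a placeholder cost function invented for the example (it is not the model of any routine in this talk), and the SP values and candidate sets are illustrative. Only the selection logic, evaluating the model for each candidate and keeping the minimum, reflects the methodology.

```python
def toy_model(sps, n, b, r, c):
    """Placeholder cost model in microseconds; NOT the model from the slides."""
    p = r * c
    arithmetic = sps["k3"] * (n ** 3 / p + n ** 2 * b)
    communication = (n / b) * (r + c) * (sps["ts"] + sps["tw"] * b * n / p)
    return arithmetic + communication

def select_aps(model, sps, n, block_sizes, processor_counts):
    """Return (predicted_time, aps) for the candidate minimising the model."""
    best_time, best_aps = None, None
    for b in block_sizes:
        for p in processor_counts:
            for r in range(1, p + 1):
                if p % r:
                    continue  # r must divide p to form an r x c mesh
                c = p // r
                predicted = model(sps, n, b, r, c)
                if best_time is None or predicted < best_time:
                    best_time = predicted
                    best_aps = {"b": b, "p": p, "r": r, "c": c}
    return best_time, best_aps

# Illustrative SP values (assumed, in microseconds).
SPS = {"k3": 0.004, "ts": 20.0, "tw": 0.1}
BLOCK_SIZES = (32, 64, 128)
PROCESSOR_COUNTS = (4, 8)

# Content of the "configuration file": the best APs for each problem size.
configuration = {
    n: select_aps(toy_model, SPS, n, BLOCK_SIZES, PROCESSOR_COUNTS)[1]
    for n in (512, 1024, 2048)
}
```

At execution time the routine would only need to read the entry for its problem size, so the search cost is paid once per platform.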
15Application Example
- LAPR: One-sided Block Jacobi Method to solve the Symmetric Eigenvalue Problem
  - Message-passing with MPI
  - Logical Ring, Logical 2D-Mesh
- Platform: SGI Origin 2000
16Application Example. Algorithm Scheme
[Figure: algorithm scheme. Matrices B, W and D divided into blocks and distributed over processors labelled 00, 01, 10, 11, 20, 21; dimensions marked b, n/r and n]
17Application Example Pre-installing.
1º Make the Analytical Model: Texec = f(SPs, n, APs)
18Application Example Pre-installing.
2º Write the Estimation Routines for the SPs
- k3: matrix-matrix multiplication with DGEMM
- k1: Givens Rotation to 2 vectors with DROT
- ts, tw: communications along the 2 directions of the 2D-mesh
19Application Example Installing
3º Estimate the SPs using the Estimation Routines
- k1: 0.01 µs
- k3: 0.005 µs (b = 32), 0.004 µs (b = 64), 0.003 µs (b = 128)
- ts: 20 µs
- tw: 0.1 µs
20Application Example Executing
Comparison of execution times using
different sets of Execution Parameters (4
processors)
21Application Example Executing
Comparison of execution times using
different sets of Execution Parameters (8
processors)
22Application Example Executing
- LAPR: One-sided Block Jacobi Method
  - Algorithmic Parameters: block size, mesh topology
- Platform: SGI Origin 2000 with message-passing
  - System Parameters: arithmetic costs, communication costs
- Satisfactory Reduction of the Execution Time
  - from 25% higher than the optimal to only 2%
24Exhaustive Execution
- System parameters obtained at installation time
  - Installation routines making a reduced number of executions at installation time
- Algorithmic parameters obtained at running time
  - From the file with information generated in the installation process
25Exhaustive Execution
- The behaviour of the algorithm on the platform is defined (as in Analytical Modelling) by
  - Texec = f(SPs, n, APs)
  - SPs = f(n, APs): System Parameters
  - APs: Algorithmic Parameters
  - n: Problem Size
26Exhaustive Execution
- Identify Algorithmic Parameters (APs) (as in Analytical Modelling): values chosen in each execution
  - b: block size
  - p: number of processors
  - r × c: logical topology, grid configuration (logical 2D mesh)
27The Methodology. Step by step
- Pre-installing (manual)
  - 1º Determine the APs
  - 2º Decide heuristics to reduce execution time in the installation process
- Installing on a Platform (automatic)
  - 3º The manager decides the problem sizes to be analysed
  - 4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec
- Execution
  - The user executes the LAPR for a size n
  - The LAPR obtains the optimal APs
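The installation by executions can be sketched as below. This is a minimal illustration, not the installation routine used in the experiments: `stand_in_routine` performs dummy work in place of the real LAPR, the candidate APs and problem sizes are arbitrary, and the JSON layout of the configuration file is an assumption.

```python
import json
import os
import tempfile
import time

def stand_in_routine(n, aps):
    """Hypothetical stand-in for the real LAPR: performs dummy work only."""
    total = 0.0
    for i in range(0, n * n, aps["b"]):
        total += i
    return total

def install(routine, problem_sizes, candidates, path):
    """Time routine(n, aps) for each candidate and store the fastest APs per n."""
    config = {}
    for n in problem_sizes:
        timings = []
        for aps in candidates:
            start = time.perf_counter()
            routine(n, aps)
            timings.append((time.perf_counter() - start, aps))
        best_time, best_aps = min(timings, key=lambda item: item[0])
        config[str(n)] = {"aps": best_aps, "time": best_time}
    with open(path, "w") as handle:
        json.dump(config, handle, indent=2)
    return config

candidates = [{"b": b, "p": 1} for b in (8, 16, 32)]
config_path = os.path.join(tempfile.mkdtemp(), "lapr_config.json")
config = install(stand_in_routine, (64, 128), candidates, config_path)
```

In a real installation the manager would supply the problem sizes, and the heuristics of step 2 would prune the candidate list so that not every combination is executed.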
28Application Example
- LAPR: Gaussian elimination
  - Message-passing with MPI
  - Logical Ring, rowwise block-cyclic striped partitioning
- Platform: networks of processors (heterogeneous system)
29Application Example Pre-installing.
1º Determine the APs (logical ring, rowwise block-cyclic striped partitioning)
- p: number of processors
- b: block size for the data distribution
- different block sizes in heterogeneous systems
[Figure: rows distributed cyclically in blocks of sizes b0, b1, b2, b0, b1, b2, ...]
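One way to derive per-processor block sizes for a heterogeneous system is sketched here. The proportional rule is an assumption made for illustration (the slides only state that block sizes differ between processors), and both function names are hypothetical. The second function shows the resulting cyclic assignment of rows to processors.

```python
def block_sizes_from_speeds(relative_speeds, base_block):
    """Give each processor a block size proportional to its relative speed.

    The slowest processor receives base_block rows per cycle (assumed rule).
    """
    slowest = min(relative_speeds)
    return [max(1, round(base_block * speed / slowest)) for speed in relative_speeds]

def row_owner_pattern(block_sizes, n_rows):
    """Assign rows cyclically: b0 rows to processor 0, b1 to processor 1, and so on."""
    owners = []
    proc = 0
    while len(owners) < n_rows:
        take = min(block_sizes[proc], n_rows - len(owners))
        owners.extend([proc] * take)
        proc = (proc + 1) % len(block_sizes)
    return owners

blocks = block_sizes_from_speeds([1.0, 2.0, 1.0], base_block=4)
owners = row_owner_pattern([1, 2, 1], n_rows=6)
```

With relative speeds 1, 2, 1 the middle processor receives twice as many rows per cycle as the others, which keeps the work per cycle roughly balanced.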
30Application Example Pre-installing.
- 2º Decide heuristics to reduce execution time in the installation process
  - Execution time varies in a continuous way with the problem size and the APs
  - Consider the system as homogeneous
  - Installation can finish
    - When Analytical and Experimental predictions coincide
    - When a certain time has been spent on the installation
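The heuristics above can be illustrated with a small search sketch. It relies on two assumptions that go beyond the slide: the execution time is treated as unimodal in the block size, so a local search suffices, and the search for each problem size starts from the block size found for the previous one. The `measure` function is synthetic and all names are hypothetical; the time budget implements the "certain time spent" stopping rule.

```python
import time

def local_search(measure, n, block_sizes, start_index):
    """Hill-climb over the ordered block_sizes starting from start_index."""
    index = start_index
    best = measure(n, block_sizes[index])
    improved = True
    while improved:
        improved = False
        for neighbour in (index - 1, index + 1):
            if 0 <= neighbour < len(block_sizes):
                t = measure(n, block_sizes[neighbour])
                if t < best:
                    best, index, improved = t, neighbour, True
                    break  # restart the scan around the new best index
    return index, best

def install_with_budget(measure, sizes, block_sizes, budget_seconds):
    """Choose a block size per problem size, stopping when the budget is spent."""
    deadline = time.monotonic() + budget_seconds
    chosen, index = {}, 0
    for n in sizes:
        if time.monotonic() > deadline:
            break
        index, _ = local_search(measure, n, block_sizes, index)
        chosen[n] = block_sizes[index]
    return chosen

def synthetic_measure(n, b):
    """Synthetic stand-in for a timed execution; minimum at b = n / 16."""
    return (b - n / 16) ** 2

chosen = install_with_budget(synthetic_measure, (256, 512, 1024),
                             [8, 16, 32, 64, 128], budget_seconds=10.0)
```

Starting each search at the previous optimum exploits the continuity of the execution time and avoids re-testing block sizes that are far from the likely optimum.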
31Application Example Installing
- Homogeneous Systems
  - 3º The manager decides the problem sizes
  - 4º Execute and write a Configuration File, or include the information in the LAPR: for each n, the APs that minimize Texec
- Heterogeneous Systems
  - 3º The manager decides the problem sizes
  - 4º Execute
    - write a Configuration File: for each n, the APs that minimize Texec
    - write a Speed File, with the relative speeds of the processors in the system
32Application Example Installation Routines
- RI-THE: Obtains p and b from the formula.
- RI-HOM: Obtains p and b through a reduced number of executions.
- RI-HET:
  - 1º As RI-HOM.
  - 2º Obtains bi for each processor.
33Application Example Systems
Three different configurations
- PLA_HOM: 5 SUN Ultra-1
- PLA_HYB: 5 SUN Ultra-1, 1 SUN Ultra-5
- PLA_HET: 1 SUN Ultra-1, 1 SUN Ultra-5, 1 SUN Ultra-1 (manages the file system)
34Application Example Executing
Experimental results in PLA_HOM: quotient between the execution time with the parameters from the Installation Routine and the optimum execution time
35Application Example Executing
Experimental results in PLA_HYB: quotient between the execution time with the parameters from the Installation Routine and the optimum execution time
36Application Example Executing
Experimental results in PLA_HET: quotient between the execution time with the parameters from the Installation Routine and the optimum execution time
37Comparison
- Two techniques for automatic tuning of Parallel Linear Algebra Routines
  - 1. Analytical Modelling
    - For predictable systems (homogeneous, static, ...), like Origin 2000
  - 2. Exhaustive Execution
    - For less predictable systems (heterogeneous, dynamic, ...), like networks of workstations
- Transparent to the user
- Execution close to the optimum
39Validation with the LU factorization
- To validate the methodology it is necessary to experiment with
  - More routines
    - block LU factorization
  - More systems
    - Architectures: IBM SP2 and Origin 2000
    - Libraries: reference BLAS, machine BLAS, ATLAS
40Sequential LU
Analytical Model: Texec = f(SPs, n, APs)
- SPs: cost of arithmetic operations of different levels k1, k2, k3
- APs: block size b
[Figure: block LU scheme with block size b; regions labelled LU, ES, ES and UM]
41Sequential LU. Comparison in IBM SP2
Quotient between different execution
times and the optimum execution time
42Sequential LU. Model execution time/optimum
execution time
Quotient between the execution time
with the parameters provided by the model and the
optimum execution time, with different basic
libraries. In SUN 1
43Parallel LU
Analytical Model: Texec = f(SPs, n, APs)
- SPs: cost of arithmetic operations k1, k2, k3; cost of communications ts, tw
- APs: block size b, number of processors p, grid configuration r × c
[Figure: matrix blocks of size b distributed cyclically over a logical grid of processors labelled 00, 01, 02, 10, 11, 12]
44Parallel LU. Comparison in IBM SP2
Quotient between the execution time with
the parameters provided by the model and the
optimum execution time. In the sequential case,
and in parallel with 4 and 8 processors.
45Parallel LU. Comparison in Origin 2000
Quotient between the execution time with
the parameters provided by the model and the
optimum execution time. In the sequential case,
and in parallel with 4 and 8 processors.
46Parallel LU. Conclusions
- The modelling of the algorithm provides satisfactory results in different systems
  - Origin 2000, IBM SP2
  - reference BLAS, machine BLAS, ATLAS
- The prediction is worse in some cases
  - When the number of processors increases
  - In multicomputers where communications are more important (IBM SP2)
  - → Exhaustive Executions
47Parallel LU. Exhaustive Execution
- If the manager installs the routine for sizes 512, 1536, 2560, and executions are performed for sizes 1024, 2048, 3072, the execution time is well predicted
- The same policy can be used in the installation of other software
Quotient between the execution time with the parameters provided by the installation process and the optimum execution time. With ScaLAPACK, in IBM SP2
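How a routine might pick its parameters at run time for a size that was not installed is sketched below. The nearest-size rule, the tie-breaking towards the larger installed size, and the AP values in `config` are all assumptions made for the example; the installed sizes are the ones quoted above.

```python
def lookup_aps(config, n):
    """Return the APs stored for the installed size nearest to n.

    Ties between two installed sizes go to the larger one (assumed rule).
    """
    nearest = min(config, key=lambda size: (abs(size - n), -size))
    return config[nearest]

# Hypothetical configuration produced at installation time.
config = {
    512: {"b": 32, "p": 4},
    1536: {"b": 64, "p": 8},
    2560: {"b": 64, "p": 8},
}

aps_1024 = lookup_aps(config, 1024)  # equidistant from 512 and 1536
aps_3072 = lookup_aps(config, 3072)  # beyond the largest installed size
```

Because 1024 and 2048 lie exactly midway between installed sizes, some explicit tie rule is needed; interpolating the parameters would be an alternative design.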
48Conclusions
- Parameterisation of Parallel Linear Algebra Routines enables development of Automatically Tuned Software
- Two techniques can be used
  - Analytical Modelling
  - Exhaustive Executions
  - or a combination of both
- Experiments performed in different systems and with different routines
49Future Works
- We try to develop a methodology valid for a wide range of systems, and to include it in the design of linear algebra libraries
- It is necessary to analyse the methodology in more systems and with more routines
- Architecture of an Automatically Tuned Linear Algebra Library
- At the moment we are analysing routines individually, but it could be preferable to analyse algorithmic schemes
50Architecture of an Automatically Tuned Linear
Algebra Library
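[Diagram: components of the library, with the role responsible for each in parentheses]
- Installation routines (designer)
- Library (designer)
- Basic routines declaration (manager)
- Basic routines library
- Installation file (manager)
- Installation (manager)
- SP file
- AP file
- Compilation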