Title: MLD2P4: a package of parallel algebraic multilevel Preconditioners
1. MLD2P4: a package of parallel algebraic multilevel preconditioners
Bologna, March 2008
- Pasqua D'Ambra, Institute for High-Performance Computing and Networking (ICAR-CNR), Naples Branch, Italy
- Joint work with Daniela di Serafino (Second University of Naples) and Salvatore Filippone (University of Rome Tor Vergata)
2. Overview
- Motivations
- Background
- Objectives
- MLD2P4: Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS
- Algorithms and computational kernels
- Software architecture
- Some results and applications
3. Background
- Large-scale applications have to solve linear systems Ax = b
- The linear system matrix is
- Real or complex and square
- Large and sparse
- Distributed among parallel processors
- Matrix dimensions and entries, conditioning, sparsity pattern and coupling among variables vary during simulations
4. Background (cont'd)
- What is the best method/preconditioner?
- No absolute winner, experimentation is needed
- Reliable preconditioners require access to the complete matrix
- Parallel implementation is not trivial
- Interfacing with application software is required
- Custom-made interfaces to parallel legacy codes
- Different interfaces for different preconditioners/solvers
5. Objectives
- Designing and implementing a suite of algebraic preconditioners based on linear algebra kernels for parallel sparse matrix computations
- Flexibility: different preconditioners through a single API
- Portability and efficiency: standard base software for serial kernels and data communications
- Simplicity of usage: modern (OO) Fortran 95 features and auxiliary routines for smooth legacy code integration
6. MLD2P4
- Multi-Level Domain Decomposition Parallel Preconditioners Package based on PSBLAS
7. PSBLAS (Filippone et al., http://www.ce.uniroma2.it/psblas/)
Basic Linear Algebra Operations with Sparse Matrices on MIMD Architectures
- Appl.: Iterative Sparse Linear Solvers (CG, BiCG, CGS, BiCGSTAB, RGMRES, ...)
- Kernels: Parallel Sparse Matrix Operations (matrix-matrix products, matrix-vector products, ...)
- Kernels: Parallel Sparse Matrix Management (allocate, build, update, ...)
- Base sw: BLACS (Basic Linear Algebra Communication Subprograms), MPI, F95, F77
8. MLD2P4 Design: Algorithms
- Algebraic multi-level Schwarz preconditioners based on smoothed aggregation
- good trade-off between parallelism and convergence
- optimal scalability for symmetric positive-definite matrices
- algebraic framework allows general-purpose application
9. (1-lev) Schwarz: basic ingredients
Adjacency graph of A
0-overlap partition of W
δ-overlap partition of W
[Figure: example with indices 1 to 9]
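A minimal Python sketch of how an overlapping partition can be grown from a 0-overlap one using the adjacency graph: each subdomain absorbs the graph neighbours of its current vertices, once per overlap layer. The 3x3 grid graph, the two-subdomain split, and the function names `grid_adjacency` and `add_overlap` are illustrative assumptions and are not taken from the slide's figure.

```python
def grid_adjacency(rows, cols):
    """Adjacency lists of a rows x cols grid graph; vertices numbered from 1, row by row."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            v = r * cols + c + 1
            nbrs = []
            if r > 0:
                nbrs.append(v - cols)
            if r < rows - 1:
                nbrs.append(v + cols)
            if c > 0:
                nbrs.append(v - 1)
            if c < cols - 1:
                nbrs.append(v + 1)
            adj[v] = nbrs
    return adj


def add_overlap(adj, partition, delta):
    """Grow every vertex set of the partition by `delta` layers of graph neighbours."""
    grown = [set(part) for part in partition]
    for _ in range(delta):
        grown = [part | {w for v in part for w in adj[v]} for part in grown]
    return [sorted(part) for part in grown]


adj = grid_adjacency(3, 3)
zero_overlap = [[1, 2, 3, 4], [5, 6, 7, 8, 9]]  # assumed 0-overlap split of W = {1..9}
one_overlap = add_overlap(adj, zero_overlap, 1)
print(one_overlap)
```

With overlap 0 the subdomains are disjoint; each added layer enlarges the local problems and the amount of data shared between neighbouring processes.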
10. AS: basic ingredients (cont'd)
Restriction/prolongation operators
Restriction of A
[Figure: example with indices 1 to 9]
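The restriction and prolongation operators can be sketched in a few lines of Python, assuming a subdomain is described by a list of (0-based) indices. Restricting a vector keeps the subdomain entries, prolongation pads them back with zeros, and the restriction of A is the corresponding principal submatrix. The helper names and the small tridiagonal matrix are hypothetical and serve only as an illustration.

```python
def restrict_vector(v, idx):
    """Keep only the entries of v that belong to the subdomain."""
    return [v[i] for i in idx]


def prolong_vector(v_sub, idx, n):
    """Scatter a subdomain vector into a global vector of length n, zero elsewhere."""
    out = [0.0] * n
    for k, i in enumerate(idx):
        out[i] = v_sub[k]
    return out


def restrict_matrix(A, idx):
    """Principal submatrix of A on the subdomain indices."""
    return [[A[i][j] for j in idx] for i in idx]


# Illustrative 4x4 tridiagonal matrix (an assumption, not the slide's example)
A = [[ 2.0, -1.0,  0.0,  0.0],
     [-1.0,  2.0, -1.0,  0.0],
     [ 0.0, -1.0,  2.0, -1.0],
     [ 0.0,  0.0, -1.0,  2.0]]
idx = [1, 2, 3]                      # assumed subdomain
v = [10.0, 20.0, 30.0, 40.0]

A_sub = restrict_matrix(A, idx)
v_sub = restrict_vector(v, idx)
v_back = prolong_vector(v_sub, idx, len(v))
print(A_sub, v_sub, v_back)
```

In an additive Schwarz preconditioner, each subdomain solves (exactly or approximately) a system with its restricted matrix and the prolonged local solutions are summed.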
11. Coarse level correction: basic ingredients
Algebraic coarsening: uncoupled aggregation
Smoothed prol./restr. operators
Coarse-level matrix
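The slide's formulas are not recoverable from this transcript. As a reminder, the standard smoothed-aggregation construction can be written as follows, where the symbols are assumptions: \tilde{P} is the piecewise-constant tentative prolongator built from the aggregates, D is the diagonal of A, and \omega is a damping parameter.

```latex
\begin{align*}
  (\tilde{P})_{ij} &=
    \begin{cases}
      1 & \text{if vertex } i \text{ belongs to aggregate } j,\\
      0 & \text{otherwise,}
    \end{cases} \\
  P &= (I - \omega D^{-1} A)\,\tilde{P}, \qquad R = P^{T}, \\
  A_C &= R\,A\,P .
\end{align*}
```

The exact variant and the choice of \omega used in MLD2P4 may differ from this generic form.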
12. Multilevel Schwarz preconditioners: computational kernels
Example: 2-lev hybrid-post
P. D'Ambra, D. di Serafino, S. Filippone, On the Development of PSBLAS-based Parallel Two-level Schwarz Preconditioners, Applied Numerical Mathematics, 57, 2007.
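A hedged sketch of how a two-level hybrid preconditioner with post-smoothing is typically applied to a vector v. It assumes that "post" means the one-level preconditioner M_{1L} acts after the coarse-level correction, as in multigrid post-smoothing. The symbols R, P, A_C and M_{1L} are notation chosen here, not recovered from the slide.

```latex
\begin{align*}
  w &= P\,A_C^{-1}\,R\,v
      && \text{(coarse-level correction)} \\
  z &= w + M_{1L}^{-1}\,(v - A\,w)
      && \text{(one-level post-smoothing on the residual)} \\
  \text{so that}\quad
  M^{-1} &= M_{1L}^{-1} + \bigl(I - M_{1L}^{-1} A\bigr)\,P\,A_C^{-1}\,R .
\end{align*}
```

The computational kernels are then a restriction, a coarse solve, a prolongation, one sparse matrix-vector product, and one application of the one-level Schwarz preconditioner.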
13. MLD2P4 Design: Software Architecture
14. Performance Results: Comparisons
- Different test matrices from various sources
- thm matrices: thermal diffusion in solids
- kivap matrices: automotive engine design
- shipsec matrices: from UF sparse matrix collection
- Experiments carried out on different Linux clusters
- 64 Intel Itanium dual-processor nodes connected by Quadrics QSNetII Elan 4
- 32 AMD Opteron dual-processor nodes connected by Myrinet
- 8 AMD Opteron dual-processor nodes connected by InfiniBand
- 8 Intel Itanium dual-processor nodes connected by Myrinet
- 16 Intel Pentium IV nodes connected by Fast Ethernet
- Comparison with up-to-date related work
- Trilinos-ML
A. Buttari, P. D'Ambra, D. di Serafino, S. Filippone, 2LEV-D2P4: a package of high-performance preconditioners for scientific and engineering applications, Applicable Algebra in Engineering, Communication and Computing, Vol. 18, 2007.
15. Experimental Setting
- MLD2P4: right-preconditioned BiCGSTAB
- 1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
- 2-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
- Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI) or with UMFPACK (2LDU) on diagonal blocks
- 3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
- Distributed coarsest matrix: 4 sweeps of block Jacobi with ILU(0) (3LDI) or with UMFPACK (3LDU) on diagonal blocks
16. thm matrices: number of iterations
thm1: n = 600000, nnz = 2996800

Overlap 0 (OV0)
np    RAS   2LDI   2LDU   3LDI   3LDU
 1    613    190      -     70      -
 2    705    184      -     72      -
 4    761    206      -     74      -
 8    688    202     44     67     28
16    748    211     61     70     36
32    766    186     81     69     51
64    809    196    113     86     68

Overlap 1 (OV1)
np    RAS   2LDI   2LDU   3LDI   3LDU
 1    613    190      -     70      -
 2    923    183      -     76      -
 4    684    178      -     63      -
 8    937    191     34     62     27
16    688    172     57     68     33
32    714    181     74     65     45
64    720    180    107     77     62

64 Intel Itanium dual-processor nodes connected by QSNetII
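A short Python sketch that quantifies how the iteration count grows from 1 to 64 processes for the preconditioners with a complete OV0 column (RAS, 2LDI, 3LDI). The numbers are copied from the OV0 table above; the growth ratio is just one convenient summary chosen here, not a metric used on the slide.

```python
procs = [1, 2, 4, 8, 16, 32, 64]

# Iteration counts for thm1, overlap 0, copied from the table above
iters_ov0 = {
    "RAS":  [613, 705, 761, 688, 748, 766, 809],
    "2LDI": [190, 184, 206, 202, 211, 186, 196],
    "3LDI": [70, 72, 74, 67, 70, 69, 86],
}

# Ratio between iterations on 64 processes and on 1 process
growth = {name: counts[-1] / counts[0] for name, counts in iters_ov0.items()}

for name, ratio in growth.items():
    print(f"{name}: {ratio:.2f}")
```

A ratio close to 1 means the iteration count is nearly independent of the number of processes. The absolute counts still differ widely between the one-level and multilevel variants.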
17. thm matrices: execution times and speed-ups
(OV1; best execution times: 3LDU)
64 Intel Itanium dual-processor nodes connected by QSNetII
18. Application test case: large eddy simulation of incompressible turbulent flows in a bi-periodical channel
- Main computational kernel: nonsymmetric and singular linear systems arising from an elliptic PDE with Neumann b.c.
A. Aprovitola, P. D'Ambra, F. M. Denaro, D. di Serafino, S. Filippone, Application of Parallel Algebraic Multilevel Domain Decomposition Preconditioners in Large-Eddy Simulations of Wall-bounded Turbulent Flows: First Experiments, RT-ICAR-NA-2007-02, July 2007.
19. Experimental Setting
Reynolds number: 180. Computational grid: 140x32x45, non-uniform in the y direction. Time step: 10^-4.
Pressure linear system: n = 201600, nnz = 1398600
- MLD2P4: right-preconditioned RGMRES(30)
- 1-lev Restricted Additive Schwarz preconditioner with ILU(0) (RAS)
- 2-lev/3-lev hybrid Schwarz preconditioner, with RAS/ILU(0) as 1-lev prec.
- Distributed coarse matrix: 4 sweeps of block Jacobi with ILU(0) (2LDI/3LDI) on diagonal blocks
- Stopping criterion or maxit
- General row-block distribution
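A quick consistency check of the reported system size, sketched in Python. It assumes a 7-point stencil on the 140x32x45 grid, periodicity in x and z (the two "bi-periodical" directions), and one missing neighbour for each cell adjacent to a wall in y. These discretization details are assumptions, not stated on the slide, but the counts they imply agree with the reported n and nnz.

```python
nx, ny, nz = 140, 32, 45          # grid size from the slide

n = nx * ny * nz                  # one pressure unknown per grid cell

# Assumed 7-point stencil: diagonal + 2 neighbours per direction.
# x and z are assumed periodic, so no neighbour is lost there;
# cells on the two y walls lose one neighbour each.
wall_cells = 2 * nx * nz
nnz = 7 * n - wall_cells

print(n, nnz, nnz / n)            # unknowns, nonzeros, average nonzeros per row
```

Under these assumptions the matrix has just under 7 nonzeros per row on average, which is typical for a second-order discretization of a 3D elliptic problem.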
20. LES of incompressible wall-bounded flow
SOR on 1 proc.: 9 sec.
SOR on 1 proc.: 8580 sec.
16 Intel Itanium dual-processor nodes connected by QSNetII
21. Work in progress
- Package available on the web very soon
- More sophisticated aggregation algorithms
- Integration of preconditioners and solvers in large-scale applications