Title: PARALLEL COMPUTATION OF SINGULAR VALUES OF DENSE MATRICES
1 PARALLEL COMPUTATION OF SINGULAR VALUES OF DENSE MATRICES
- Rui Ralha
- DMAT, CMAT
- Univ. do Minho
- Portugal
- r_ralha_at_math.uminho.pt
2 Acknowledgements
- CMAT
- FCT
- POCTI (European Union contribution)
- Prof. B. Parlett
3 Outline
- Bidiagonalization Methods
- Ralha's algorithm
- Barlow's algorithm
- Modified Barlow's algorithm
- Parallelization Strategy
- ScaLAPACK's methodology
- Parallel algorithms
- Computational Results
- Parallel bisection algorithm
- Conclusions (on bidiagonalization)
4 Bidiagonalization Methods
- Given A (m x n, m >= n), compute A = U B V^T,
- where B is upper bidiagonal and U, V are orthogonal matrices (see the check below).
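A quick numpy check of this definition (our illustration, not part of the talk): build A from random orthogonal factors U, V and an upper bidiagonal B, and verify that A and B share the same singular values.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal U
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal V
    B = np.diag(rng.random(n) + 1.0) + np.diag(rng.random(n - 1), 1)  # upper bidiagonal
    A = U @ B @ V.T
    print(np.allclose(np.linalg.svd(A, compute_uv=False),
                      np.linalg.svd(B, compute_uv=False)))  # True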
5 Bidiagonalization Methods
- Ralha's algorithm (1994, 1996; LAA 2003)
- Implicit reduction of A^T A to tridiagonal form by post-multiplication of A by orthogonal matrices (for instance, n-2 Householder matrices Hr)
- QR decomposition (Gram-Schmidt orthogonalization of adjacent columns)
- Problems:
- singular values of B may have large errors
- columns of U may be far from orthogonal
6 Ralha's bidiagonalization
- In exact arithmetic, the transformed A^T A is exactly tridiagonal, i.e., non-adjacent columns of the transformed A are orthogonal.
- Rounding errors may ruin orthogonality. Example: n = 3, 24-bit IEEE arithmetic.
- Reorthogonalization may be used to guarantee that non-adjacent columns get orthogonal to working precision (in the example, even Householder needs two reorthogonalization steps). The purpose of this is high relative accuracy.
7 Ralha's bidiagonalization
- The first stage is perfectly stable: the computed matrix is an exact right transformation of a nearby matrix, with P exactly orthogonal.
- If the columnwise perturbation is small, this one-sided transformation preserves the small singular values better than two-sided transformations.
8 Ralha's bidiagonalization
- Ill-conditioning of the reduction to bidiagonal form (example):
- If A = QR for some orthogonal Q, then non-adjacent columns of A are orthogonal to working precision, but there is no bidiagonal B satisfying A = QB (uniqueness of the QR decomposition).
- Small perturbations outside the tridiagonal band of A^T A cause much larger perturbations in R.
9 Relation to Golub-Kahan's algorithm
- In G-K, the first right transformation can not be applied before the first left one.
- It makes columns 3 and 4 orthogonal to the first column; it may be applied first if the inner products are computed (collinear vectors produce the same Householder vector, thus the same transformation). The Householder vector is then computed from the row of inner products.
10 Relation to Golub-Kahan's algorithm
- For n columns: because multiplication on the left by an orthogonal matrix does not change the inner products of the columns of A, the right-side transformations can be completed first (this requires computing the inner products). This corresponds to a QR decomposition of the transformed A (sketch below).
- HOWEVER, the left transformations will introduce non-negligible perturbations in the i-th row of R if the i-th diagonal element gets much smaller than the norm of the i-th column of the transformed A.
11 Relation to Golub-Kahan's algorithm
- The first left transformation can not introduce large elements in the first row outside the bidiagonal form, since the corresponding inner products are already small.
- BUT, if adjacent columns are close to being parallel, then the orthogonalization step will take most of the norm of the second column, and the computed entries suffer cancellation.
12
- Assuming that non-adjacent columns of A are orthogonal, we just need to orthogonalize adjacent columns.
- But the singular values of the bidiagonal B may still have errors. GS is NOT to be blamed: left orthogonal transformations can not do better.
13 Ralha's algorithm
For r = 1, ..., n-2
    Compute t = A(:, r+1:n)^T A(:, r)
    Compute Hr such that t^T Hr = (gamma, 0, ..., 0)
    Compute A(:, r+1:n) = A(:, r+1:n) Hr
End For
Compute phi(1) = ||A(:, 1)||
Compute U(:, 1) = A(:, 1)/phi(1)
For r = 2, ..., n
    Compute psi(r-1) = U(:, r-1)^T A(:, r)
    Compute f = A(:, r) - psi(r-1) U(:, r-1)
    Compute phi(r) = ||f||
    Compute U(:, r) = f/phi(r)
End For
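A numpy sketch of the two stages just listed, under our reading of the slides (it reuses the house helper from the Golub-Kahan sketch above). No reorthogonalization or zero-norm safeguards, so it illustrates the structure rather than the accurate implementation discussed in the talk.

    def ralha_bidiag(A):
        # Stage 1: right Householder transformations make A^T A tridiagonal,
        # i.e. non-adjacent columns of A become mutually orthogonal.
        A = np.array(A, dtype=float)
        m, n = A.shape
        for r in range(n - 2):
            t = A[:, r + 1:].T @ A[:, r]        # inner products with column r
            v, beta = house(t)
            A[:, r + 1:] -= beta * np.outer(A[:, r + 1:] @ v, v)
        # Stage 2: Gram-Schmidt on adjacent columns gives U and bidiagonal B.
        U = np.zeros((m, n))
        B = np.zeros((n, n))
        B[0, 0] = np.linalg.norm(A[:, 0])
        U[:, 0] = A[:, 0] / B[0, 0]
        for r in range(1, n):
            B[r - 1, r] = U[:, r - 1] @ A[:, r]
            f = A[:, r] - B[r - 1, r] * U[:, r - 1]
            B[r, r] = np.linalg.norm(f)
            U[:, r] = f / B[r, r]
        return U, B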
14 A stable algorithm (Barlow et al., LAA 2005)
- Interleaves the two stages
- Carries out the r-th GS step (this makes column r+1 orthogonal to column r) immediately after the post-multiplication with Hr (this makes columns r+2, ..., n orthogonal to column r). The Householder matrix Hr is computed from the inner products of column r of U with columns r+1, ..., n of A.
- The computed B satisfies a small backward error bound.
- Orthogonality of the computed U is similar to that of Q in MGS QR.
15 Barlow's algorithm (see the sketch below)
Compute phi(1) = ||A(:, 1)||, U(:, 1) = A(:, 1)/phi(1)
For r = 1, ..., n-2
    Compute z = A(:, r+1:n)^T U(:, r)
    Compute Hr such that z^T Hr = (psi(r), 0, ..., 0)
    Update A(:, r+1:n) = A(:, r+1:n) Hr
    Compute f = A(:, r+1) - psi(r) U(:, r)
    Compute phi(r+1) = ||f||, U(:, r+1) = f/phi(r+1)
End
Compute last elements psi(n-1), phi(n), U(:, n)
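A numpy sketch of the interleaved scheme (ours, again reusing house; a reconstruction of the skeleton above, without safeguards): the GS step for column r+1 follows immediately after the right transformation Hr.

    def barlow_bidiag(A):
        # Interleaved one-sided bidiagonalization (Barlow et al. style):
        # each pass computes Hr from z = A(:, r+1:n)^T u_r, applies it, then
        # immediately orthogonalizes column r+1 against u_r.
        A = np.array(A, dtype=float)
        m, n = A.shape
        U = np.zeros((m, n))
        B = np.zeros((n, n))
        B[0, 0] = np.linalg.norm(A[:, 0])
        U[:, 0] = A[:, 0] / B[0, 0]
        for r in range(n - 1):
            z = A[:, r + 1:].T @ U[:, r]
            if r < n - 2:
                v, beta = house(z)                      # z^T Hr = (psi_r, 0, ..., 0)
                A[:, r + 1:] -= beta * np.outer(A[:, r + 1:] @ v, v)
                B[r, r + 1] = -np.copysign(np.linalg.norm(z), z[0])
            else:
                B[r, r + 1] = z[0]                      # last step: Hr is trivial
            f = A[:, r + 1] - B[r, r + 1] * U[:, r]     # the interleaved GS step
            B[r + 1, r + 1] = np.linalg.norm(f)
            U[:, r + 1] = f / B[r + 1, r + 1]
        return U, B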
16 A variant of Barlow's algorithm
- Motivation: to reduce communications in the parallel algorithm
- Uses A(:, r) instead of Q(:, r) in the r-th step
17 Barlow's algorithm (modified)
(sequential)
For r = 1, ..., n-2
    Compute z = A(:, r+1:n)^T A(:, r)
    Compute Hr such that z^T Hr = (gamma(r), 0, ..., 0)
    Update A(:, r+1:n) = A(:, r+1:n) Hr
    Compute the orthogonalization of column r+1 against column r
End For
Compute the last elements and recover the entries of B from the unnormalized columns
18 Parallelization Strategy
- ScaLAPACK's methodology
- ScaLAPACK software hierarchy:
- BLAS: Basic Linear Algebra Subprograms
- LAPACK: Linear Algebra PACKage
- BLACS: Basic Linear Algebra Communication Subprograms
- PBLAS: Parallel BLAS
19 Parallelization Strategy
- ScaLAPACK's methodology
- Two-dimensional block-cyclic distribution
- Rectangular grid of processors
- Example: 6x6 matrix, 2x2 blocks, 2x3 grid of processors (mapping shown below)
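The owner of each entry is easy to compute; a few lines of Python (ours) reproduce the slide's 6x6 example by printing the (process row, process column) pair that stores each entry.

    def owner(i, j, mb=2, nb=2, pr=2, pc=3):
        # 2D block-cyclic distribution: entry (i, j) (0-based) lives in
        # block (i//mb, j//nb), which is owned by process
        # (block_row mod pr, block_col mod pc) on the pr-by-pc grid.
        return (i // mb) % pr, (j // nb) % pc

    for i in range(6):
        print([owner(i, j) for j in range(6)])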
20 Parallelization Strategy
- Parallel algorithms
- Translation of BLAS and LAPACK routines into calls to the equivalent parallel routines of PBLAS and ScaLAPACK
- Takes into account the data distribution and the corresponding rules to convert sequential LAPACK-based programs into parallel ScaLAPACK-based programs
- Parallel algorithms:
- Barlow's algorithm
- The modified algorithm
21 Parallel algorithms
- Barlow's algorithm (parallel), with the PBLAS / ScaLAPACK / BLACS routine used at each step:
Compute phi(1) = ||A(:, 1)|| (PDNRM2)
Broadcast of phi(1) (DGEBS2D/DGEBR2D)
Compute U(:, 1) = A(:, 1)/phi(1) (PDSCAL)
For r = 1, ..., n-2
    Compute z = A(:, r+1:n)^T U(:, r) (PDGEMV)
    Compute Hr such that z^T Hr = (psi(r), 0, ..., 0) (PDLARFG)
    Compute A(:, r+1:n) = A(:, r+1:n) Hr (PDLARF)
    Compute f = A(:, r+1) - psi(r) U(:, r) (PDAXPY)
    Compute phi(r+1) = ||f|| (PDNRM2)
    Broadcast of phi(r+1) (DGEBS2D/DGEBR2D)
    Compute U(:, r+1) = f/phi(r+1) (PDSCAL)
End
Compute last elements
22 Computational Results
- Computational environment (hardware):
- Cluster with 20 dual-processor nodes
- Each node has Pentium Xeon processors at 2 GHz and 1 GB of RAM
- Connected with an SCI network
- 2D torus with a 4x5 processor grid
23 Computational Results
- Computational environment (software):
- Operating system
- Red Hat 8.0 (Psyche) with kernel 2.4.20 and openMosix
- Compilers
- GNU gcc, g++, g77
- Intel version 7.0 icc, ifc
- Communications
- ScaFmpich (a Scali version of MPICH 1.2.4)
- ScaMPI
- Task scheduling
- OpenPBS
- Numerical libraries
- BLAS, LAPACK, ScaLAPACK
- Programming language
- Fortran 90
24 Computational Results
- Computational experiments
- Data
- Random matrices, from 10000x1000 up to 10000x4500, and square matrices
- Results
- Execution time without accumulation of the orthogonal transformations
- DGEBRD vs PDGEBRD
- DGEBRD vs sequential implementations
- PDGEBRD vs parallel implementations
- Speed-up and efficiency
25 Computational Results
- Remark
- LAPACK and ScaLAPACK implement a blocked organization of the Golub-Kahan algorithm (BLAS3)
- The new algorithm relies upon BLAS2 modules (not BLAS3)
26 Computational Results
- DGEBRD vs PDGEBRD
- High performance of the PDGEBRD module
27 Computational Results
- DGEBRD vs PDGEBRD
- High performance of the PDGEBRD module
28 Computational Results
- DGEBRD vs sequential implementations
29 Computational Results
- DGEBRD vs sequential implementations
- Initial QR decomposition
30 Computational Results
- DGEBRD vs sequential implementations
31 Computational Results
- PDGEBRD vs parallel implementations
32 Computational Results
- PDGEBRD vs parallel implementations
- A variant of Barlow's algorithm reduces the number of communication instances
- A small improvement in the performance of the modified algorithm is obtained when compared to Barlow's algorithm
33 Computational Results
- DGEBRD vs sequential implementations
- PDGEBRD vs parallel implementations
34 Computational Results
- DGEBRD vs sequential implementations
- PDGEBRD vs parallel implementations
35 Computational Results
- Parallel implementations
- Barlow's algorithm
36 Computational Results
- Parallel implementations
- Modified Barlow's algorithm
37 Computational Results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
38 Computational Results
- Parallel results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
- Some improvements in the performance of the modified Barlow's algorithm are obtained when compared to Barlow's algorithm
39 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
40 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
41 Computational Results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
42 Computational Results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
43 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
44 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
45 Computational Results
- Speed-up
- Modified Barlow's algorithm
46 Computational Results
- Efficiency
- Modified Barlow's algorithm
47 Computational Results
- Speed-up and efficiency
- Modified Barlow's algorithm
48 Computational Results
- Speed-up
- Modified Barlow's algorithm
49 Computational Results
- Efficiency
- Modified Barlow's algorithm
50 Computational Results
- Speed-up and efficiency
- Modified Barlow's algorithm
51 Parallel bisection algorithm
52 Parallel bisection algorithm
- Computing eigenvalues is a significant part of the sequential MRRR algorithm (C. Voemel)
- Bisection is well suited to parallel computing:
- Highly flexible
- Adapts well to mixed precision arithmetic
- (not true for the competitors QR, dqds, DC?) It can take advantage of processors with very fast single precision arithmetic.
53 Parallel bisection algorithm
- Each bisection step delivers one bit of accuracy; therefore single precision can be used until double precision is required for more correct bits (see the sketch below)
- Single precision needs to deliver an interval which is guaranteed to contain an eigenvalue
- Switch to a faster method when good approximations are known (Newton-Raphson? a recurrence relation for the arithmetic inverse of the correction delivers high relative accuracy)
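A self-contained numpy sketch (ours) of the basic tools: a Sturm count for a symmetric tridiagonal matrix and the bisection loop. In the spirit of the slide, the loop could be run in float32 until the interval width reaches single-precision level and only then continue in float64.

    import numpy as np

    def count_less(d, e, x):
        # Sturm count: number of eigenvalues of the symmetric tridiagonal
        # matrix (diagonal d, off-diagonal e) smaller than x, from the
        # LDL^T recurrence q_i = d_i - x - e_{i-1}^2 / q_{i-1}.
        q, k = 1.0, 0
        for i in range(len(d)):
            q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
            if q < 0.0:
                k += 1
            elif q == 0.0:
                q = np.finfo(float).tiny    # nudge off the exact zero
        return k

    def eig_k(d, e, k, lo, hi, tol=1e-12):
        # Bisect [lo, hi] (e.g. Gershgorin bounds) down to width tol around
        # the k-th smallest eigenvalue; one bit of the result per step.
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if count_less(d, e, mid) <= k:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

For the [-1 2 -1] matrix of the later slides, d = 2*np.ones(n) and e = -np.ones(n - 1), and the exact eigenvalues 2 - 2*cos((k+1)*pi/(n+1)) provide a check.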
54 Delivering a correct interval
55 Delivering a correct interval
56 Towards an algorithm for load-balanced parallel bisection without communication
- Floating-point performance is likely to keep improving much faster than inter-processor communication.
- It is unlikely to pay off to do much communication for achieving perfect load balance in the parallel bisection algorithm.
- Some redundancy in the arithmetic is acceptable.
- An assignment of segments (of the initial interval) of equal width to processors may deliver poor speedup (see the sketch below).
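The effect is easy to quantify for the matrix of the next slide, whose eigenvalues are known in closed form and cluster at both ends of the spectrum (our illustration):

    import numpy as np

    n, p = 10_000, 100
    # eigenvalues of the order-n tridiagonal [-1 2 -1]:
    # lambda_k = 2 - 2 cos(k pi / (n + 1)), k = 1, ..., n
    lam = 2.0 - 2.0 * np.cos(np.arange(1, n + 1) * np.pi / (n + 1))
    counts, _ = np.histogram(lam, bins=p, range=(0.0, 4.0))
    print(counts.min(), counts.max())  # equal-width segments: very uneven loads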
57 Parallel bisection for computing the eigenvalues of [-1 2 -1] with 100 processors
58 Towards an algorithm for load-balanced parallel bisection without communication
- Phase 1: carry out a (not too large) number of bisection steps in a breadth-first search to get a good picture of the spectrum. Produces a number of intervals (at least p = number of processors).
- Phase 2: distribute the intervals among the processors, trying to achieve load balance (in non-pathological cases it will assign almost the same number of eigenvalues to each processor).
- Phase 3: each processor computes the assigned eigenvalues to the prescribed accuracy (see the sketch below).
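A Python sketch (ours, reusing count_less from the bisection sketch above) of Phases 1 and 2; Phase 3 is just the bisection loop already shown, run independently on each processor's intervals.

    def phase1(d, e, lo, hi, min_intervals):
        # Breadth-first bisection: split every interval that still holds
        # more than one eigenvalue until enough intervals exist.
        ivals = [(lo, hi, count_less(d, e, lo), count_less(d, e, hi))]
        while len(ivals) < min_intervals:
            nxt = []
            for a, b, ca, cb in ivals:
                if cb - ca <= 1:                 # already isolates one eigenvalue
                    nxt.append((a, b, ca, cb))
                    continue
                m = 0.5 * (a + b)
                cm = count_less(d, e, m)
                if cm > ca:
                    nxt.append((a, m, ca, cm))   # keep only nonempty halves
                if cb > cm:
                    nxt.append((m, b, cm, cb))
            if nxt == ivals:                     # nothing left to split
                break
            ivals = nxt
        return ivals

    def phase2(ivals, p):
        # Greedy assignment: give each interval to the least-loaded
        # processor, counting cb - ca eigenvalues per interval.
        load, work = [0] * p, [[] for _ in range(p)]
        for iv in sorted(ivals, key=lambda t: t[3] - t[2], reverse=True):
            j = load.index(min(load))
            load[j] += iv[3] - iv[2]
            work[j].append(iv)
        return work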
59 Towards an algorithm for load-balanced parallel bisection without communication
60 Towards an algorithm for load-balanced parallel bisection without communication
61 Towards an algorithm for load-balanced parallel bisection without communication
- Preliminary implementation (in Matlab)
- Phase 1 finishes when enough intervals have been produced such that, for each k = 1, ..., p-1, an end point x of one of those intervals splits the spectrum at (approximately) the k-th p-quantile.
- This may affect the speedup by 10%.
- This termination criterion for Phase 1 may be hard to satisfy (i.e., it may take too many bisection steps) in some cases.
62 Parallel bisection for computing the eigenvalues of [-1 2 -1] of order 10^4
63 Conclusions (on bidiagonalization)
- Sequential Barlow's algorithm is competitive with LAPACK's DGEBRD
- The modification we propose reduces communication costs
- ScaLAPACK's PDGEBRD was efficient in our tests (not as efficient on other platforms)
- The current parallel implementation of the new algorithm is not OPTIMAL, but the results already indicate that the proposed algorithm is a serious candidate for ScaLAPACK.
- Relative accuracy? (reorthogonalization required in some cases)
64 References
- Ralha, R., A new algorithm for singular value decompositions, Proc. 3rd Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, 1994, pp. 240-244.
- Ralha, R. and Mackiewicz, A., An efficient algorithm for the computation of singular values, Proc. 3rd Int. Congress on Numerical Methods in Engineering, Spanish Soc. of Numerical Methods in Engineering, 1996, pp. 1371-1380.
- Ralha, R., One-sided reduction to bidiagonal form, LAA 358 (2003), pp. 219-238.
- Barlow, J., Bosner, N. and Drmac, Z., A new stable bidiagonal reduction algorithm, LAA 397 (2005), pp. 35-84.
- Campos, C., Guerrero, D., Hernandez, V. and Ralha, R., Parallel bidiagonalization of a dense matrix, submitted.
- Barlow, J. and Bosner, N., Block and parallel versions of one-sided bidiagonalization, submitted.