Title: PARALLEL COMPUTATION OF SINGULAR VALUES OF DENSE MATRICES
1 PARALLEL COMPUTATION OF SINGULAR VALUES OF DENSE MATRICES
- Rui Ralha
- DMAT, CMAT
- Univ. do Minho
- Portugal
- r_ralha_at_math.uminho.pt
2 Acknowledgements
- CMAT
- FCT
- POCTI (European Union contribution)
- Prof. B. Parlett
3 Outline
- Bidiagonalization Methods
- Ralha's algorithm
- Barlow's algorithm
- Modified Barlow's algorithm
- Parallelization Strategy
- ScaLAPACK's methodology
- Parallel algorithms
- Computational Results
- Parallel bisection algorithm
- Conclusions (on bidiagonalization)
4 Bidiagonalization Methods
- Given A (m x n, m >= n), compute A = U B V^T,
- where B is upper bidiagonal and U, V are orthogonal matrices (see the check below).
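A quick numpy check of this definition (our illustration, not part of the talk): build A from random orthogonal factors U, V and an upper bidiagonal B, and verify that A and B share the same singular values.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 6
    U, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal U
    V, _ = np.linalg.qr(rng.standard_normal((n, n)))  # random orthogonal V
    B = np.diag(rng.random(n) + 1.0) + np.diag(rng.random(n - 1), 1)  # upper bidiagonal
    A = U @ B @ V.T
    print(np.allclose(np.linalg.svd(A, compute_uv=False),
                      np.linalg.svd(B, compute_uv=False)))  # True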
5 Bidiagonalization Methods
- Ralha's algorithm (1994, 1996; LAA 2003)
- Implicit reduction of A^T A to tridiagonal form by post-multiplication of A by orthogonal matrices (for instance, n-2 Householder matrices Hr)
- QR decomposition (Gram-Schmidt orthogonalization of adjacent columns)
- Problems:
- singular values of B may have large errors
- columns of U may be far from orthogonal
6 Ralha's bidiagonalization
- In exact arithmetic, the transformed A^T A is exactly tridiagonal, i.e., non-adjacent columns of the transformed A are orthogonal.
- Rounding errors may ruin orthogonality. Example: n = 3, 24-bit IEEE arithmetic.
- Reorthogonalization may be used to guarantee that non-adjacent columns get orthogonal to working precision (in the example, even Householder needs two reorthogonalization steps). The purpose of this is high relative accuracy.
7 Ralha's bidiagonalization
- The first stage is perfectly stable: the computed matrix is an exact right transformation of a nearby matrix, with P exactly orthogonal.
- If the columnwise perturbation is small, this one-sided transformation preserves the small singular values better than two-sided transformations.
8 Ralha's bidiagonalization
- Ill-conditioning of the reduction to bidiagonal form (example):
- If A = QR for some orthogonal Q, then non-adjacent columns of A are orthogonal to working precision, but there is no bidiagonal B satisfying A = QB (uniqueness of the QR decomposition).
- Small perturbations outside the tridiagonal band of A^T A cause much larger perturbations in R.
9 Relation to Golub-Kahan's algorithm
- In G-K, the first right transformation can not be applied before the first left one.
- It makes columns 3 and 4 orthogonal to the first column; it may be applied first if the inner products are computed (collinear vectors produce the same Householder vector, thus the same transformation). The Householder vector is then computed from the row of inner products.
10 Relation to Golub-Kahan's algorithm
- For n columns: because multiplication on the left by an orthogonal matrix does not change the inner products of the columns of A, the right-side transformations can be completed first (this requires computing the inner products). This corresponds to a QR decomposition of the transformed A (sketch below).
- HOWEVER, the left transformations will introduce non-negligible perturbations in the i-th row of R if the i-th diagonal element gets much smaller than the norm of the i-th column of the transformed A.
11 Relation to Golub-Kahan's algorithm
- The first left transformation can not introduce large elements in the first row outside the bidiagonal form, since the corresponding inner products are already small.
- BUT, if adjacent columns are close to being parallel, then the orthogonalization step will take most of the norm of the second column, and the computed entries suffer cancellation.
12
- Assuming that non-adjacent columns of A are orthogonal, we just need to orthogonalize adjacent columns.
- But the singular values of the bidiagonal B may still have errors. GS is NOT to be blamed: left orthogonal transformations can not do better.
13 Ralha's algorithm
For r = 1, ..., n-2
    Compute t = A(:, r+1:n)^T A(:, r)
    Compute Hr such that t^T Hr = (gamma, 0, ..., 0)
    Compute A(:, r+1:n) = A(:, r+1:n) Hr
End For
Compute phi(1) = ||A(:, 1)||
Compute U(:, 1) = A(:, 1)/phi(1)
For r = 2, ..., n
    Compute psi(r-1) = U(:, r-1)^T A(:, r)
    Compute f = A(:, r) - psi(r-1) U(:, r-1)
    Compute phi(r) = ||f||
    Compute U(:, r) = f/phi(r)
End For
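A numpy sketch of the two stages just listed, under our reading of the slides (it reuses the house helper from the Golub-Kahan sketch above). No reorthogonalization or zero-norm safeguards, so it illustrates the structure rather than the accurate implementation discussed in the talk.

    def ralha_bidiag(A):
        # Stage 1: right Householder transformations make A^T A tridiagonal,
        # i.e. non-adjacent columns of A become mutually orthogonal.
        A = np.array(A, dtype=float)
        m, n = A.shape
        for r in range(n - 2):
            t = A[:, r + 1:].T @ A[:, r]        # inner products with column r
            v, beta = house(t)
            A[:, r + 1:] -= beta * np.outer(A[:, r + 1:] @ v, v)
        # Stage 2: Gram-Schmidt on adjacent columns gives U and bidiagonal B.
        U = np.zeros((m, n))
        B = np.zeros((n, n))
        B[0, 0] = np.linalg.norm(A[:, 0])
        U[:, 0] = A[:, 0] / B[0, 0]
        for r in range(1, n):
            B[r - 1, r] = U[:, r - 1] @ A[:, r]
            f = A[:, r] - B[r - 1, r] * U[:, r - 1]
            B[r, r] = np.linalg.norm(f)
            U[:, r] = f / B[r, r]
        return U, B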
14 A stable algorithm (Barlow et al., LAA 2005)
- Interleaves the two stages
- Carries out the r-th GS step (this makes column r+1 orthogonal to column r) immediately after the post-multiplication with Hr (this makes columns r+2, ..., n orthogonal to column r). The Householder matrix Hr is computed from the inner products of column r of U with columns r+1, ..., n of A.
- The computed B satisfies a small backward error bound.
- Orthogonality of the computed U is similar to that of Q in MGS QR.
15 Barlow's algorithm (see the sketch below)
Compute phi(1) = ||A(:, 1)||, U(:, 1) = A(:, 1)/phi(1)
For r = 1, ..., n-2
    Compute z = A(:, r+1:n)^T U(:, r)
    Compute Hr such that z^T Hr = (psi(r), 0, ..., 0)
    Update A(:, r+1:n) = A(:, r+1:n) Hr
    Compute f = A(:, r+1) - psi(r) U(:, r)
    Compute phi(r+1) = ||f||, U(:, r+1) = f/phi(r+1)
End
Compute last elements psi(n-1), phi(n), U(:, n)
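A numpy sketch of the interleaved scheme (ours, again reusing house; a reconstruction of the skeleton above, without safeguards): the GS step for column r+1 follows immediately after the right transformation Hr.

    def barlow_bidiag(A):
        # Interleaved one-sided bidiagonalization (Barlow et al. style):
        # each pass computes Hr from z = A(:, r+1:n)^T u_r, applies it, then
        # immediately orthogonalizes column r+1 against u_r.
        A = np.array(A, dtype=float)
        m, n = A.shape
        U = np.zeros((m, n))
        B = np.zeros((n, n))
        B[0, 0] = np.linalg.norm(A[:, 0])
        U[:, 0] = A[:, 0] / B[0, 0]
        for r in range(n - 1):
            z = A[:, r + 1:].T @ U[:, r]
            if r < n - 2:
                v, beta = house(z)                      # z^T Hr = (psi_r, 0, ..., 0)
                A[:, r + 1:] -= beta * np.outer(A[:, r + 1:] @ v, v)
                B[r, r + 1] = -np.copysign(np.linalg.norm(z), z[0])
            else:
                B[r, r + 1] = z[0]                      # last step: Hr is trivial
            f = A[:, r + 1] - B[r, r + 1] * U[:, r]     # the interleaved GS step
            B[r + 1, r + 1] = np.linalg.norm(f)
            U[:, r + 1] = f / B[r + 1, r + 1]
        return U, B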
16 A variant of Barlow's algorithm
- Motivation: to reduce communications in the parallel algorithm
- Uses A(:, r) instead of Q(:, r) in the r-th step
17 Barlow's algorithm (modified)
(sequential)
For r = 1, ..., n-2
    Compute z = A(:, r+1:n)^T A(:, r)
    Compute Hr such that z^T Hr = (gamma(r), 0, ..., 0)
    Update A(:, r+1:n) = A(:, r+1:n) Hr
    Compute the orthogonalization of column r+1 against column r
End For
Compute the last elements and recover the entries of B from the unnormalized columns
18 Parallelization Strategy
- ScaLAPACK's methodology
- ScaLAPACK software hierarchy:
- BLAS: Basic Linear Algebra Subprograms
- LAPACK: Linear Algebra PACKage
- BLACS: Basic Linear Algebra Communication Subprograms
- PBLAS: Parallel BLAS
19 Parallelization Strategy
- ScaLAPACK's methodology
- Two-dimensional block-cyclic distribution
- Rectangular grid of processors
- Example: 6x6 matrix, 2x2 blocks, 2x3 grid of processors (mapping shown below)
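The owner of each entry is easy to compute; a few lines of Python (ours) reproduce the slide's 6x6 example by printing the (process row, process column) pair that stores each entry.

    def owner(i, j, mb=2, nb=2, pr=2, pc=3):
        # 2D block-cyclic distribution: entry (i, j) (0-based) lives in
        # block (i//mb, j//nb), which is owned by process
        # (block_row mod pr, block_col mod pc) on the pr-by-pc grid.
        return (i // mb) % pr, (j // nb) % pc

    for i in range(6):
        print([owner(i, j) for j in range(6)])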
20 Parallelization Strategy
- Parallel algorithms
- Translation of BLAS and LAPACK routines into calls to the equivalent parallel routines of PBLAS and ScaLAPACK
- Takes into account the data distribution and the corresponding rules to convert sequential LAPACK-based programs into parallel ScaLAPACK-based programs
- Parallel algorithms:
- Barlow's algorithm
- The modified algorithm
21 Parallel algorithms
- Barlow's algorithm (parallel), with the PBLAS / ScaLAPACK / BLACS routine used at each step:
Compute phi(1) = ||A(:, 1)|| (PDNRM2)
Broadcast of phi(1) (DGEBS2D/DGEBR2D)
Compute U(:, 1) = A(:, 1)/phi(1) (PDSCAL)
For r = 1, ..., n-2
    Compute z = A(:, r+1:n)^T U(:, r) (PDGEMV)
    Compute Hr such that z^T Hr = (psi(r), 0, ..., 0) (PDLARFG)
    Compute A(:, r+1:n) = A(:, r+1:n) Hr (PDLARF)
    Compute f = A(:, r+1) - psi(r) U(:, r) (PDAXPY)
    Compute phi(r+1) = ||f|| (PDNRM2)
    Broadcast of phi(r+1) (DGEBS2D/DGEBR2D)
    Compute U(:, r+1) = f/phi(r+1) (PDSCAL)
End
Compute last elements
22 Computational Results
- Computational environment (hardware):
- Cluster with 20 dual-processor nodes
- Each node has Pentium Xeon processors at 2 GHz and 1 GB of RAM
- Connected with an SCI network
- 2D torus with a 4x5 processor grid
23 Computational Results
- Computational environment (software):
- Operating system
- Red Hat 8.0 (Psyche) with kernel 2.4.20 and openMosix
- Compilers
- GNU gcc, g++, g77
- Intel version 7.0 icc, ifc
- Communications
- ScaFmpich (a Scali version of MPICH 1.2.4)
- ScaMPI
- Task scheduling
- OpenPBS
- Numerical libraries
- BLAS, LAPACK, ScaLAPACK
- Programming language
- Fortran 90
24 Computational Results
- Computational experiments
- Data
- Random matrices, from 10000x1000 up to 10000x4500, and square matrices
- Results
- Execution time without accumulation of the orthogonal transformations
- DGEBRD vs PDGEBRD
- DGEBRD vs sequential implementations
- PDGEBRD vs parallel implementations
- Speed-up and efficiency
25 Computational Results
- Remark
- LAPACK and ScaLAPACK implement a blocked organization of the Golub-Kahan algorithm (BLAS3)
- The new algorithm relies upon BLAS2 modules (not BLAS3)
26 Computational Results
- DGEBRD vs PDGEBRD
- High performance of the PDGEBRD module
27 Computational Results
- DGEBRD vs PDGEBRD
- High performance of the PDGEBRD module
28 Computational Results
- DGEBRD vs sequential implementations
29 Computational Results
- DGEBRD vs sequential implementations
- Initial QR decomposition
30 Computational Results
- DGEBRD vs sequential implementations
31 Computational Results
- PDGEBRD vs parallel implementations
32 Computational Results
- PDGEBRD vs parallel implementations
- A variant of Barlow's algorithm reduces the number of communication instances
- A small improvement in the performance of the modified algorithm is obtained when compared to Barlow's algorithm
33 Computational Results
- DGEBRD vs sequential implementations
- PDGEBRD vs parallel implementations
34 Computational Results
- DGEBRD vs sequential implementations
- PDGEBRD vs parallel implementations
35 Computational Results
- Parallel implementations
- Barlow's algorithm
36 Computational Results
- Parallel implementations
- Modified Barlow's algorithm
37 Computational Results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
38 Computational Results
- Parallel results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
- Some improvements in the performance of the modified Barlow's algorithm are obtained when compared to Barlow's algorithm
39 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
40 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
41 Computational Results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
42 Computational Results
- Parallel implementations
- Barlow's algorithm / Modified Barlow's algorithm
43 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
44 Computational Results
- Parallel implementations
- PDGEBRD / Modified Barlow's algorithm
45 Computational Results
- Speed-up
- Modified Barlow's algorithm
46 Computational Results
- Efficiency
- Modified Barlow's algorithm
47 Computational Results
- Speed-up and efficiency
- Modified Barlow's algorithm
48 Computational Results
- Speed-up
- Modified Barlow's algorithm
49 Computational Results
- Efficiency
- Modified Barlow's algorithm
50 Computational Results
- Speed-up and efficiency
- Modified Barlow's algorithm
51 Parallel bisection algorithm
52 Parallel bisection algorithm
- Computing eigenvalues is a significant part of the sequential MRRR algorithm (C. Voemel)
- Bisection is well suited to parallel computing:
- Highly flexible
- Adapts well to mixed precision arithmetic
- (not true for the competitors QR, dqds, DC?) It can take advantage of processors with very fast single precision arithmetic.
53 Parallel bisection algorithm
- Each bisection step delivers one bit of accuracy; therefore single precision can be used until double precision is required for more correct bits (see the sketch below)
- Single precision needs to deliver an interval which is guaranteed to contain an eigenvalue
- Switch to a faster method when good approximations are known (Newton-Raphson? a recurrence relation for the arithmetic inverse of the correction delivers high relative accuracy)
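A self-contained numpy sketch (ours) of the basic tools: a Sturm count for a symmetric tridiagonal matrix and the bisection loop. In the spirit of the slide, the loop could be run in float32 until the interval width reaches single-precision level and only then continue in float64.

    import numpy as np

    def count_less(d, e, x):
        # Sturm count: number of eigenvalues of the symmetric tridiagonal
        # matrix (diagonal d, off-diagonal e) smaller than x, from the
        # LDL^T recurrence q_i = d_i - x - e_{i-1}^2 / q_{i-1}.
        q, k = 1.0, 0
        for i in range(len(d)):
            q = d[i] - x - (e[i - 1] ** 2 / q if i > 0 else 0.0)
            if q < 0.0:
                k += 1
            elif q == 0.0:
                q = np.finfo(float).tiny    # nudge off the exact zero
        return k

    def eig_k(d, e, k, lo, hi, tol=1e-12):
        # Bisect [lo, hi] (e.g. Gershgorin bounds) down to width tol around
        # the k-th smallest eigenvalue; one bit of the result per step.
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            if count_less(d, e, mid) <= k:
                lo = mid
            else:
                hi = mid
        return 0.5 * (lo + hi)

For the [-1 2 -1] matrix of the later slides, d = 2*np.ones(n) and e = -np.ones(n - 1), and the exact eigenvalues 2 - 2*cos((k+1)*pi/(n+1)) provide a check.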
54 Delivering a correct interval
55 Delivering a correct interval
56 Towards an algorithm for load-balanced parallel bisection without communication
- Floating-point performance is likely to keep improving much faster than inter-processor communication.
- It is unlikely to pay off to do much communication for achieving perfect load balance in the parallel bisection algorithm.
- Some redundancy in the arithmetic is acceptable.
- An assignment of segments (of the initial interval) of equal width to processors may deliver poor speedup (see the sketch below).
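The effect is easy to quantify for the matrix of the next slide, whose eigenvalues are known in closed form and cluster at both ends of the spectrum (our illustration):

    import numpy as np

    n, p = 10_000, 100
    # eigenvalues of the order-n tridiagonal [-1 2 -1]:
    # lambda_k = 2 - 2 cos(k pi / (n + 1)), k = 1, ..., n
    lam = 2.0 - 2.0 * np.cos(np.arange(1, n + 1) * np.pi / (n + 1))
    counts, _ = np.histogram(lam, bins=p, range=(0.0, 4.0))
    print(counts.min(), counts.max())  # equal-width segments: very uneven loads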
57 Parallel bisection for computing the eigenvalues of [-1 2 -1] with 100 processors
58 Towards an algorithm for load-balanced parallel bisection without communication
- Phase 1: carry out a (not too large) number of bisection steps in a breadth-first search to get a good picture of the spectrum. Produces a number of intervals (at least p = number of processors).
- Phase 2: distribute the intervals among the processors, trying to achieve load balance (in non-pathological cases it will assign almost the same number of eigenvalues to each processor).
- Phase 3: each processor computes the assigned eigenvalues to the prescribed accuracy (see the sketch below).
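A Python sketch (ours, reusing count_less from the bisection sketch above) of Phases 1 and 2; Phase 3 is just the bisection loop already shown, run independently on each processor's intervals.

    def phase1(d, e, lo, hi, min_intervals):
        # Breadth-first bisection: split every interval that still holds
        # more than one eigenvalue until enough intervals exist.
        ivals = [(lo, hi, count_less(d, e, lo), count_less(d, e, hi))]
        while len(ivals) < min_intervals:
            nxt = []
            for a, b, ca, cb in ivals:
                if cb - ca <= 1:                 # already isolates one eigenvalue
                    nxt.append((a, b, ca, cb))
                    continue
                m = 0.5 * (a + b)
                cm = count_less(d, e, m)
                if cm > ca:
                    nxt.append((a, m, ca, cm))   # keep only nonempty halves
                if cb > cm:
                    nxt.append((m, b, cm, cb))
            if nxt == ivals:                     # nothing left to split
                break
            ivals = nxt
        return ivals

    def phase2(ivals, p):
        # Greedy assignment: give each interval to the least-loaded
        # processor, counting cb - ca eigenvalues per interval.
        load, work = [0] * p, [[] for _ in range(p)]
        for iv in sorted(ivals, key=lambda t: t[3] - t[2], reverse=True):
            j = load.index(min(load))
            load[j] += iv[3] - iv[2]
            work[j].append(iv)
        return work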
59 Towards an algorithm for load-balanced parallel bisection without communication
60 Towards an algorithm for load-balanced parallel bisection without communication
61 Towards an algorithm for load-balanced parallel bisection without communication
- Preliminary implementation (in Matlab)
- Phase 1 finishes when enough intervals have been produced such that, for each k = 1, ..., p-1, an end point x of one of those intervals splits the spectrum at (approximately) the k-th p-quantile.
- This may affect the speedup by 10%.
- This termination criterion for Phase 1 may be hard to satisfy (i.e., it may take too many bisection steps) in some cases.
62 Parallel bisection for computing the eigenvalues of [-1 2 -1] of order 10^4
63 Conclusions (on bidiagonalization)
- Sequential Barlow's algorithm is competitive with LAPACK's DGEBRD
- The modification we propose reduces communication costs
- ScaLAPACK's PDGEBRD was efficient in our tests (not as efficient on other platforms)
- The current parallel implementation of the new algorithm is not OPTIMAL, but the results already indicate that the proposed algorithm is a serious candidate for ScaLAPACK.
- Relative accuracy? (reorthogonalization required in some cases)
64 References
- Ralha, R., A new algorithm for singular value decompositions, Proc. 3rd Euromicro Workshop on Parallel and Distributed Processing, IEEE Computer Society Press, 1994, pp. 240-244.
- Ralha, R. and Mackiewicz, A., An efficient algorithm for the computation of singular values, Proc. 3rd Int. Congress on Numerical Methods in Engineering, Spanish Soc. of Numerical Methods in Engineering, 1996, pp. 1371-1380.
- Ralha, R., One-sided reduction to bidiagonal form, LAA 358 (2003), pp. 219-238.
- Barlow, J., Bosner, N. and Drmac, Z., A new stable bidiagonal reduction algorithm, LAA 397 (2005), pp. 35-84.
- Campos, C., Guerrero, D., Hernandez, V. and Ralha, R., Parallel bidiagonalization of a dense matrix, submitted.
- Barlow, J. and Bosner, N., Block and parallel versions of one-sided bidiagonalization, submitted.