PARALLEL COMPUTATION OF SINGULAR VALUES OF DENSE MATRICES
LAPACK Seminar, UC Berkeley, 1 March 2006

1
PARALLEL COMPUTATION OF SINGULAR VALUES OF DENSE
MATRICES
  • Rui Ralha
  • DMAT, CMAT
  • Univ. do Minho
  • Portugal
  • r_ralha@math.uminho.pt

2
Acknowledgements
  • CMAT
  • FCT
  • POCTI (European Union contribution)
  • Prof. B. Parlett

3
Outline
  • Bidiagonalization Methods
  • Ralha's algorithm
  • Barlow's algorithm
  • Modified Barlow's algorithm
  • Parallelization Strategy
  • ScaLAPACK's methodology
  • Parallel algorithms
  • Computational Results
  • Parallel bisection algorithm
  • Conclusions (on bidiagonalization)

4
Bidiagonalization Methods
  • Given A in R^(m x n), m >= n, compute the factorization A = U B V^T,
  • where B is upper bidiagonal
  • and U and V
  • are orthogonal matrices.

5
Bidiagonalization Methods
  • Ralha's algorithm (1994, 1996; LAA, 2003)
  • Implicit reduction of A^T A to tridiagonal
    form by post-multiplication with orthogonal
    matrices (for instance, n-2 Householder matrices
    Hr)
  • QR decomposition (Gram-Schmidt for
    orthogonalization of adjacent columns)
  • Problems
  • singular values of B may have large errors
  • columns of U may be far from orthogonal

6
Ralha's bidiagonalization
  • In exact arithmetic, the result is
    exactly tridiagonal, i.e., non-adjacent columns
    of the transformed matrix are orthogonal
  • Rounding errors may ruin orthogonality. Example
    (n = 3, 24-bit IEEE arithmetic)
  • Reorthogonalization may be used to guarantee that
    non-adjacent columns get orthogonal to working
    precision (in the example, Householder needs two
    reorthogonalization steps). The purpose of this is
    high relative accuracy.

7
Ralha's bidiagonalization
  • The first stage is perfectly stable,
  • with P exactly orthogonal and
  • If
  • this one-sided transformation preserves the
    small singular values better than two-sided
    transformations

8
Ralha's bidiagonalization
  • Ill-conditioning of the reduction to bidiagonal
    form (example)
  • If A = QR for some orthogonal Q, then non-adjacent
    columns of A are orthogonal to working precision,
    but there is no bidiagonal B satisfying A = QB
    (uniqueness of the QR decomposition)
  • Small perturbations outside the tridiagonal band
    cause much larger perturbations in R.

9
Relation to the Golub-Kahan algorithm
  • G-K (cannot be applied before)
  • makes columns 3 and 4 orthogonal to the first
    column
  • may be applied first if the inner products
    are computed (collinear vectors produce the same
    Householder vector). The Householder vector is
    computed from

10
Relation to the Golub-Kahan algorithm
  • For n columns, because multiplication on the
    left by an orthogonal matrix does not change the
    inner products of the columns of A, the right-side
    transformation can be completed (it requires
    computing the inner products). Then
  • This corresponds to a QR decomposition of .
  • HOWEVER, this will introduce non-negligible
    perturbations in the ith row of R if (the
    ith diagonal element) gets much smaller than the
    norm of the ith column
    of .

11
Relation to the Golub-Kahan algorithm
  • Let
  • The first left transformation cannot introduce
    large elements in the first row outside the
    bidiagonal form
  • since
  • BUT, if are close to
    being parallel then will take most of the
    norm of the second column
  • of and

12
  • Assuming that non-adjacent columns of
    are orthogonal, we just need to
    orthogonalize adjacent columns
  • But the singular values of the bidiagonal
    may have errors. GS is NOT to blame. Left
    orthogonal transformations cannot do better.

13
  • Ralha's algorithm
  • (sequential)

For r = 1, ..., n-2
  Compute
  Compute Hr such that
  Compute
End For
Compute
Compute
For r = 2, ..., n
  Compute
  Compute
  Compute
  Compute
End For
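The formulas of the steps above were shown as images on the original slide. Purely as an illustration of the two-stage structure (not the authors' Fortran implementation; all function and variable names are ours), a minimal NumPy sketch: Stage 1 post-multiplies A by Householder matrices so that non-adjacent columns become orthogonal, Stage 2 runs Gram-Schmidt on adjacent columns to obtain Q and the upper bidiagonal B, with A V = Q B.

import numpy as np

def ralha_bidiag(A):
    # Illustrative sketch of a one-sided (Ralha-style) bidiagonalization.
    # Stage 1: post-multiply A by Householder matrices so that non-adjacent
    #          columns of the transformed matrix are orthogonal.
    # Stage 2: Gram-Schmidt on adjacent columns gives Q and upper bidiagonal B,
    #          so that A @ V ~= Q @ B.  Assumes full column rank; no
    #          reorthogonalization, so accuracy may suffer (cf. slide 6).
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.eye(n)
    for r in range(n - 2):
        z = A[:, r] @ A[:, r + 1:]            # inner products with later columns
        v = z.copy()
        v[0] += np.sign(z[0] if z[0] != 0 else 1.0) * np.linalg.norm(z)
        nv = np.linalg.norm(v)
        if nv > 0:                            # Householder H = I - 2 v v^T on the right
            v /= nv
            A[:, r + 1:] -= 2.0 * np.outer(A[:, r + 1:] @ v, v)
            V[:, r + 1:] -= 2.0 * np.outer(V[:, r + 1:] @ v, v)
    Q = np.zeros((m, n))
    alpha = np.zeros(n)                       # diagonal of B
    beta = np.zeros(n - 1)                    # superdiagonal of B
    alpha[0] = np.linalg.norm(A[:, 0])
    Q[:, 0] = A[:, 0] / alpha[0]
    for r in range(1, n):
        beta[r - 1] = Q[:, r - 1] @ A[:, r]   # Gram-Schmidt against previous column
        w = A[:, r] - beta[r - 1] * Q[:, r - 1]
        alpha[r] = np.linalg.norm(w)
        Q[:, r] = w / alpha[r]
    B = np.diag(alpha) + np.diag(beta, 1)
    return Q, B, V

For a random test matrix one can check that np.allclose(A @ V, Q @ B) holds and that the singular values of B agree with those of A, up to the accuracy issues discussed on the previous slides.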
14
A stable algorithm (Barlow et al., LAA 2005)
  • Interleaves the two stages
  • Carries out the rth GS step (this makes
    column r+1 orthogonal to column r) immediately
    after the post-multiplication with Hr (this makes
    columns r+2, ..., n orthogonal to column r). The
    Householder matrix Hr is computed from
  • The computed satisfies
  • Orthogonality of the computed U is similar to that
    of Q in MGS QR

15
Barlow's algorithm
  • (sequential)

Compute
For r = 1, ..., n-2
  Compute
  Compute Hr such that
  Update
  Compute
  Compute
End
Compute last elements
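Again only as an illustration (our own reading of the interleaving described on the previous slide, not the authors' code): a NumPy sketch in which the rth Gram-Schmidt step is carried out immediately after the post-multiplication with Hr, so the two stages of the previous algorithm are merged into one loop.

import numpy as np

def barlow_bidiag(A):
    # Illustrative sketch of an interleaved (Barlow-style) one-sided reduction.
    # At step r: the Householder Hr is computed from the inner products of the
    # already normalized column q_r with the trailing columns; after the
    # post-multiplication, column r+1 is orthogonalized against q_r (GS step).
    # Returns Q, B, V with A @ V ~= Q @ B and B upper bidiagonal.
    A = A.astype(float).copy()
    m, n = A.shape
    V = np.eye(n)
    Q = np.zeros((m, n))
    alpha = np.zeros(n)
    beta = np.zeros(n - 1)
    alpha[0] = np.linalg.norm(A[:, 0])
    Q[:, 0] = A[:, 0] / alpha[0]
    for r in range(n - 1):
        if r < n - 2:                          # Householder from q_r^T A(:, r+1:n)
            z = Q[:, r] @ A[:, r + 1:]
            v = z.copy()
            v[0] += np.sign(z[0] if z[0] != 0 else 1.0) * np.linalg.norm(z)
            nv = np.linalg.norm(v)
            if nv > 0:
                v /= nv
                A[:, r + 1:] -= 2.0 * np.outer(A[:, r + 1:] @ v, v)
                V[:, r + 1:] -= 2.0 * np.outer(V[:, r + 1:] @ v, v)
        beta[r] = Q[:, r] @ A[:, r + 1]        # GS step on the adjacent column
        A[:, r + 1] -= beta[r] * Q[:, r]
        alpha[r + 1] = np.linalg.norm(A[:, r + 1])
        Q[:, r + 1] = A[:, r + 1] / alpha[r + 1]
    B = np.diag(alpha) + np.diag(beta, 1)
    return Q, B, V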
16
A variant of Barlow's algorithm
  • Motivation: to reduce communication in the
    parallel algorithm
  • Uses A(:,r) instead of Q(:,r) in the rth step

17
Barlow's algorithm (modified)
(sequential)
For r = 1, ..., n-2
  Compute
  Compute Hr such that
  Update
  Compute
End For
Compute
Compute
Compute
Update
Compute
18
Parallelization Strategy
  • ScaLAPACK's methodology
  • ScaLAPACK software hierarchy
  • BLAS: Basic Linear Algebra Subprograms
  • LAPACK: Linear Algebra PACKage
  • BLACS: Basic Linear Algebra Communication
    Subprograms
  • PBLAS: Parallel BLAS

19
Parallelization Strategy
  • ScaLAPACKs methodology
  • Two-dimensional block cyclic distribution
  • Rectangular grid of processors
  • Example (matrix 6x6, blocks 2x2 2x3 grid of
    processors)
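The two-dimensional block-cyclic mapping can be summarized in a few lines. The following Python helper is our own sketch, not a ScaLAPACK routine, and it assumes the distribution starts at process (0, 0): given a global entry, it returns the owning process and the local indices.

def block_cyclic_owner(i, j, mb, nb, p_rows, p_cols):
    # 2D block-cyclic distribution: mb-by-nb blocks dealt out cyclically
    # over a p_rows-by-p_cols process grid (all indices 0-based, first
    # block assumed to live on process (0, 0)).
    bi, ri = divmod(i, mb)                 # block row and offset inside it
    bj, rj = divmod(j, nb)                 # block column and offset inside it
    p_row, p_col = bi % p_rows, bj % p_cols
    li = (bi // p_rows) * mb + ri          # local row index on that process
    lj = (bj // p_cols) * nb + rj          # local column index on that process
    return (p_row, p_col), (li, lj)

# The slide's example: a 6x6 matrix, 2x2 blocks, 2x3 process grid.
for i in range(6):
    print([block_cyclic_owner(i, j, 2, 2, 2, 3)[0] for j in range(6)])

The printed grid shows which process owns each entry, row by row.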

20
Parallelization Strategy
  • Parallel algorithms
  • Translation of BLAS and LAPACK routines into
    calls of the equivalent parallel routines of
    PBLAS and ScaLAPACK
  • Takes into account the data distribution and the
    corresponding rules to convert sequential
    LAPACK-based programs into parallel
    ScaLAPACK-based programs
  • Parallel algorithms
  • Barlow's algorithm
  • The modified algorithm

21
Parallel algorithms
  • Barlow's algorithm (parallel)


PBLAS / ScaLAPACK
Compute                      (PDNRM2)
Broadcast of                 (DGEBS2D / DGEBR2D)
Compute                      (PDSCAL)
For r = 1, ..., n-2
  Compute                    (PDGEMV)
  Compute Hr such that       (PDLARFG)
  Compute                    (PDLARF)
  Compute                    (PDAXPY)
  Compute                    (PDNRM2)
  Broadcast of               (DGEBS2D / DGEBR2D)
  Compute                    (PDSCAL)
End
Compute last elements
22
Computational Results
  • Computational environment (hardware)
  • Cluster with 20 dual-processor nodes
  • Each node is a Pentium Xeon at 2 GHz with 1 GB of
    RAM
  • Connected by an SCI network
  • 2D torus with a 4x5 processor grid

23
Computational Results
  • Computational environment (software)
  • Operating system
  • Red Hat 8.0 (Psyche) with kernel 2.4.20 and
    OpenMosix
  • Compilers
  • GNU: gcc, g++, g77
  • Intel version 7.0: icc, ifc
  • Communications
  • ScaFmpich (a Scali version of MPICH 1.2.4)
  • ScaMPI
  • Task scheduling
  • OpenPBS
  • Numerical libraries
  • BLAS, LAPACK, ScaLAPACK
  • Programming language
  • Fortran 90

24
Computational Results
  • Computational experiments
  • Data
  • Random matrices from 10000x1000 up to 10000x4500,
    and square matrices
  • Results
  • Execution time without accumulation of the
    orthogonal transformations
  • DGEBRD vs PDGEBRD
  • DGEBRD vs sequential implementations
  • PDGEBRD vs parallel implementations
  • Speedup and efficiency

25
Computational Results
  • Remark
  • LAPACK and ScaLAPACK implement a block
    organization of the Golub-Kahan algorithm (BLAS3)
  • The new algorithm relies upon BLAS2 modules (not
    BLAS3)

26
Computational Results
  • DGEBRD vs PDGEBRD
  • High performance of PDGEBRD module

27
Computational Results
  • DGEBRD vs PDGEBRD
  • High performance of PDGEBRD module

28
Computational Results
DGEBRD vs sequential implementations
29
Computational Results
  • DGEBRD vs sequential implementations
  • Initial QR decomposition

30
Computational Results
DGEBRD vs sequential implementations
31
Computational Results
PDGEBRD vs parallel implementations
32
Computational Results
  • PDGEBRD vs parallel implementations
  • A variant of Barlow's algorithm reduces the
    number of communication instances
  • A small improvement in the performance of the
    modified algorithm is obtained when compared to
    Barlow's algorithm

33
Computational Results
  • DGEBRD vs sequential implementations
  • PDGEBRD vs parallel implementations

34
Computational Results
  • DGEBRD vs sequential implementations
  • PDGEBRD vs parallel implementations

35
Computational Results
  • Parallel implementations
  • Barlow's algorithm

36
Computational Results
  • Parallel implementations
  • Modified Barlow's algorithm

37
Computational Results
  • Parallel implementations
  • Barlow's algorithm / Modified Barlow's algorithm

38
Computational Results
  • Parallel results
  • Parallel implementations
  • Barlow's algorithm / Modified Barlow's algorithm
  • Some improvements in the performance of modified
    Barlow's algorithm are obtained when compared to
    Barlow's algorithm

39
Computational Results
  • Parallel implementations
  • PDGEBRD / Modified Barlow's algorithm

40
Computational Results
  • Parallel implementations
  • PDGEBRD / Modified Barlow's algorithm

41
Computational Results
  • Parallel implementations
  • Barlow's algorithm / Modified Barlow's algorithm

42
Computational Results
  • Parallel implementations
  • Barlow's algorithm / Modified Barlow's algorithm

43
Computational Results
  • Parallel implementations
  • PDGEBRD / Modified Barlow's algorithm

44
Computational Results
  • Parallel implementations
  • PDGEBRD / Modified Barlow's algorithm

45
Computational Results
  • Speed-up
  • Modified Barlow's algorithm

46
Computational Results
  • Efficiency
  • Modified Barlow's algorithm

47
Computational Results
  • Speed-up and efficiency
  • Modified Barlow's algorithm

48
Computational Results
  • Speed-up
  • Modified Barlow's algorithm

49
Computational Results
  • Efficiency
  • Modified Barlow's algorithm

50
Computational Results
  • Speed-up and Efficiency
  • Modified Barlow's algorithm

51
Parallel bisection algorithm
52
Parallel bisection algorithm
  • Computing eigenvalues is a significant part of
    the sequential MRRR algorithm (C. Voemel)
  • Bisection is well suited to parallel computing
  • Highly flexible
  • Adapts well to mixed-precision arithmetic
  • (not true for the competitors QR, dqds, DC?)
    It can take advantage of processors with very fast
    single-precision arithmetic.

53
Parallel bisection algorithm
  • Each bisection step delivers one bit of accuracy;
    therefore single precision can be used until
    double precision is required for more correct
    bits (see the sketch below)
  • Single precision needs to deliver an interval
    which is guaranteed to contain an eigenvalue
  • Switch to a faster method when good approximations
    are known (Newton-Raphson? a recurrence relation
    for the arithmetic inverse of the correction
    delivers high relative accuracy).
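To make the "one bit per step" remark concrete, here is a small NumPy sketch of bisection for a symmetric tridiagonal matrix using the standard Sturm-count (LDL^T) recurrence. It is our own illustration: everything runs in double precision, so the single/double switching and the Newton refinement mentioned above are not implemented.

import numpy as np

def count_less_than(d, e, x):
    # Sturm count: number of eigenvalues of the symmetric tridiagonal matrix
    # with diagonal d and off-diagonal e that are smaller than x.
    count, t = 0, d[0] - x
    if t < 0:
        count += 1
    for k in range(1, len(d)):
        # crude guard against division by zero; enough for a sketch
        t = d[k] - x - e[k - 1] ** 2 / (t if t != 0 else np.finfo(float).tiny)
        if t < 0:
            count += 1
    return count

def bisect_eigenvalue(d, e, k, lo, hi, tol=1e-12):
    # k-th smallest eigenvalue (k = 1, 2, ...); [lo, hi] must contain the
    # whole spectrum (e.g. Gershgorin bounds).  Each step halves the
    # interval, i.e. delivers one more correct bit.
    while hi - lo > tol * max(1.0, abs(lo), abs(hi)):
        mid = 0.5 * (lo + hi)
        if count_less_than(d, e, mid) >= k:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Example: the tridiagonal matrix (-1, 2, -1) of order 100
n = 100
d, e = 2.0 * np.ones(n), -1.0 * np.ones(n - 1)
print(bisect_eigenvalue(d, e, 1, 0.0, 4.0),      # computed smallest eigenvalue
      2.0 - 2.0 * np.cos(np.pi / (n + 1)))       # analytic value for comparison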

54
Delivering a correct interval
55
Delivering a correct interval
56
Towards an algorithm for load balanced parallel
bisection without communication
  • Floating-point performance is likely to keep
    improving much faster than inter-processor
    communication.
  • It is unlikely to pay off to do much
    communication to achieve perfect load balance
    in the parallel bisection algorithm.
  • Some redundancy in the arithmetic is acceptable.
  • An assignment of segments (of the initial
    interval) of equal width to processors may
    deliver poor speedup.

57
Parallel bisection for computing the eigenvalues
of [-1 2 -1] with 100 processors
58
Towards an algorithm for load balanced parallel
bisection without communication
  • Phase 1: Carry out a (not too large) number of
    bisection steps in a breadth-first search to get
    a good picture of the spectrum. This produces a
    number of intervals (at least p, the number of
    processors).
  • Phase 2: Distribute the intervals to processors,
    trying to achieve load balance (in non-pathological
    cases it will assign almost the same number of
    eigenvalues to each processor).
  • Phase 3: Each processor computes the assigned
    eigenvalues to the prescribed accuracy (a toy
    sketch of the three phases follows below).
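A toy Python sketch of the three phases (our own self-contained illustration; the splitting and stopping rules are simplified and are not the termination criterion of the Matlab implementation described on the next slides): Phase 1 splits the initial interval breadth-first until at least p non-empty intervals exist, Phase 2 deals them out greedily by eigenvalue count, and Phase 3 would then run bisection independently on each processor.

import numpy as np

def count_less_than(d, e, x):
    # Sturm count for a symmetric tridiagonal matrix (see the previous sketch)
    count, t = 0, d[0] - x
    if t < 0:
        count += 1
    for k in range(1, len(d)):
        t = d[k] - x - e[k - 1] ** 2 / (t if t != 0 else np.finfo(float).tiny)
        if t < 0:
            count += 1
    return count

def phase1_split(d, e, lo, hi, p, max_sweeps=30):
    # Phase 1: breadth-first bisection of [lo, hi] until at least p
    # non-empty intervals (a, b, number of eigenvalues in [a, b)) exist.
    intervals = [(lo, hi, count_less_than(d, e, hi) - count_less_than(d, e, lo))]
    for _ in range(max_sweeps):
        nonempty = [iv for iv in intervals if iv[2] > 0]
        if len(nonempty) >= p:
            return nonempty
        refined = []
        for a, b, c in nonempty:
            m = 0.5 * (a + b)
            cl = count_less_than(d, e, m) - count_less_than(d, e, a)
            refined += [(a, m, cl), (m, b, c - cl)]
        intervals = refined
    return [iv for iv in intervals if iv[2] > 0]

def phase2_assign(intervals, p):
    # Phase 2: greedy load balancing -- the next largest interval goes to the
    # processor currently holding the fewest eigenvalues (no communication is
    # needed once every processor knows the interval list).
    buckets, loads = [[] for _ in range(p)], [0] * p
    for iv in sorted(intervals, key=lambda iv: -iv[2]):
        q = loads.index(min(loads))
        buckets[q].append(iv)
        loads[q] += iv[2]
    return buckets, loads

# Example: (-1, 2, -1) of order 1000 on p = 8 processors
n, p = 1000, 8
d, e = 2.0 * np.ones(n), -1.0 * np.ones(n - 1)
buckets, loads = phase2_assign(phase1_split(d, e, 0.0, 4.0, p), p)
print(loads)   # eigenvalues per processor; Phase 3 bisects each interval locally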

59
Towards an algorithm for load balanced parallel
bisection without communication

60
Towards an algorithm for load balanced parallel
bisection without communication

61
Towards an algorithm for load balanced parallel
bisection without communication
  • Preliminary implementation (in Matlab)
  • Finishes Phase 1 when enough intervals have been
    produced such that, for each k = 1, ..., p-1, an end
    point x of one of those intervals satisfies
  • This may affect the speedup by 10%.
  • This termination criterion for Phase 1 may be hard
    (i.e., take too many bisection steps) to satisfy
    in some cases.

62
Parallel bisection for computing the eigenvalues
of [-1 2 -1] of order 10^4
63
Conclusions (on bidiagonalization)
  • Sequential Barlow's algorithm is competitive with
    LAPACK's DGEBRD
  • The modification proposed by us reduces
    communication costs
  • ScaLAPACK's PDGEBRD is efficient in our tests (not
    as efficient on other platforms)
  • The current parallel implementation of the new
    algorithm is not OPTIMAL, but the results already
    indicate that the proposed algorithm is a serious
    candidate for ScaLAPACK.
  • Relative accuracy? (reorthogonalization required
    in some cases)

64
References
  • Ralha, R., A new algorithm for singular value
    decompositions, Proc. 3rd Euromicro Workshop on
    Parallel and Distributed Processing, IEEE
    Computer Society Press, 1994, pp. 240-244.
  • Ralha, R. and Mackiewicz, A., An efficient
    algorithm for the computation of singular values,
    Proc. 3rd Int. Congress on Numerical Methods in
    Engineering, Spanish Soc. of Numerical Methods in
    Engineering, 1996, pp.1371-1380.
  • Ralha, R., One-sided reduction to bidiagonal
    form, LAA 358 (2003), pp. 219-238.
  • Barlow, J., Bosner, N. and Drmac, Z., A new
    stable bidiagonal reduction algorithm, LAA 397
    (2005), pp.35-84.
  • Campos, C., Guerrero, D., Hernandez, V. and
    Ralha, R., Parallel bidiagonalization of a dense
    matrix, submitted.
  • Barlow, J. and Bosner, N., Block and parallel
    versions of one-sided bidiagonalization, submitted.