Title: Jack Dongarra
1. Some of the Software Challenges for Numerical Libraries on ManyCore Systems
- Jack Dongarra
- Innovative Computing Laboratory
- University of Tennessee
- Oak Ridge National Laboratory
2. Major Changes to Math Software
- Scalar
  - Fortran code in EISPACK (1974)
- Vector
  - Level 1 BLAS use in LINPACK (1979)
- SMP
  - Level 3 BLAS use in LAPACK (1992)
- Distributed Memory
  - Message passing with MPI in ScaLAPACK (1995)
- Many-Core
  - Event-driven multithreading in PLASMA (Parallel Linear Algebra Software for Multicore Architectures)
3. ManyCore - Parallelism for the Masses
- We are looking at the following concepts in designing the next library implementation:
  - Dynamic data-driven execution
  - Self-adapting software
  - Block data layout
  - Mixed precision in the algorithm (see the sketch after this list)
  - Exploiting hybrid architectures
  - Fault-tolerant methods
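One well-known instance of mixed precision in the algorithm is to factor the matrix in single precision and recover full accuracy with iterative refinement carried out in double precision. The sketch below is a hypothetical illustration using standard LAPACKE/CBLAS calls, not PLASMA code; it runs a fixed number of refinement sweeps instead of testing a convergence criterion.

```c
#include <stdio.h>
#include <lapacke.h>
#include <cblas.h>

#define N 3

/* Solve Ax = b: factor A once in single precision (the O(n^3) work runs at
 * single-precision speed), then refine the solution in double precision
 * (O(n^2) work per sweep) to recover double-precision accuracy. */
int main(void)
{
    double A[N*N] = { 4, 2, 1,   2, 5, 3,   1, 3, 6 };   /* column-major */
    double b[N]   = { 1, 2, 3 };
    float  As[N*N];
    lapack_int ipiv[N];
    double x[N], r[N];
    float  d[N];

    for (int i = 0; i < N*N; ++i) As[i] = (float)A[i];    /* demote A      */
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, N, N, As, N, ipiv);  /* LU in single  */

    for (int i = 0; i < N; ++i) d[i] = (float)b[i];       /* first solve   */
    LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', N, 1, As, N, ipiv, d, N);
    for (int i = 0; i < N; ++i) x[i] = (double)d[i];

    for (int sweep = 0; sweep < 3; ++sweep) {             /* refinement    */
        for (int i = 0; i < N; ++i) r[i] = b[i];
        /* r = b - A*x, computed in double precision */
        cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, -1.0, A, N, x, 1, 1.0, r, 1);
        for (int i = 0; i < N; ++i) d[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', N, 1, As, N, ipiv, d, N);
        for (int i = 0; i < N; ++i) x[i] += (double)d[i]; /* correction    */
    }

    printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
    return 0;
}
```

The expensive factorization runs at single-precision speed; each refinement sweep costs only a matrix-vector product and two triangular solves.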
4. Parallelism in LAPACK / ScaLAPACK
[Diagram: the two software stacks. Shared memory: LAPACK layered on ATLAS or other specialized BLAS and threads. Distributed memory: ScaLAPACK layered on the PBLAS (parallel BLAS), the BLACS, and MPI.]
Two well-known open-source software efforts for dense matrix problems.
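As a concrete view of the layering, a user code calls LAPACK (here through the standard LAPACKE C interface), and LAPACK pushes essentially all of the flops down into whichever BLAS is linked in (ATLAS, a vendor BLAS, ...). A minimal sketch, with error handling kept to the info codes:

```c
#include <stdio.h>
#include <lapacke.h>

/* Factor and solve a small dense system with LAPACK's LU (DGETRF/DGETRS).
 * LAPACK does the blocking; the flops land in the BLAS underneath. */
int main(void)
{
    double A[3*3] = { 4, 2, 1,      /* column-major 3x3 matrix */
                      2, 5, 3,
                      1, 3, 6 };
    double b[3]   = { 1, 2, 3 };
    lapack_int ipiv[3], info;

    info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, 3, 3, A, 3, ipiv);
    if (info == 0)
        info = LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', 3, 1, A, 3, ipiv, b, 3);

    if (info == 0)
        printf("x = %f %f %f\n", b[0], b[1], b[2]);
    return (int)info;
}
```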
5. Steps in the LAPACK LU
[Diagram: one step of the blocked LU - factor a panel, forward and backward row swaps, a triangular solve for the block row of U, and a matrix multiply that updates the trailing submatrix. Most of the work is done in the matrix multiply.]
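A minimal sketch of that blocked step in plain C may help fix the structure. It uses column-major storage and hypothetical helper names, and it omits pivoting and the row swaps for brevity, so it is the unpivoted variant of what DGETRF actually does:

```c
/* Sketch of one blocked LU step (pivoting omitted for brevity).
 * A is n x n, column-major, leading dimension lda; nb is the block size;
 * assumes k + nb <= n. */
#include <stddef.h>

#define A(i,j) a[(size_t)(j)*lda + (i)]

/* 1. Panel factorization: unblocked LU on the nb-wide panel (cf. DGETF2). */
static void factor_panel(double *a, int lda, int m, int nb)
{
    for (int k = 0; k < nb; ++k)
        for (int i = k + 1; i < m; ++i) {
            A(i,k) /= A(k,k);                    /* column of L         */
            for (int j = k + 1; j < nb; ++j)
                A(i,j) -= A(i,k) * A(k,j);       /* update within panel */
        }
}

/* One iteration of the blocked algorithm at column k (cf. DGETRF). */
void lu_block_step(double *a, int lda, int n, int k, int nb)
{
    int m = n - k;                               /* rows below the panel top */
    factor_panel(&A(k,k), lda, m, nb);

    /* 2. Triangular solve for the block row of U: L11 * U12 = A12 (cf. DTRSM). */
    for (int j = k + nb; j < n; ++j)
        for (int i = k; i < k + nb; ++i)
            for (int p = k; p < i; ++p)
                A(i,j) -= A(i,p) * A(p,j);

    /* 3. Trailing matrix update: A22 -= L21 * U12 (cf. DGEMM). */
    for (int j = k + nb; j < n; ++j)
        for (int i = k + nb; i < n; ++i)
            for (int p = k; p < k + nb; ++p)
                A(i,j) -= A(i,p) * A(p,j);
}
```

The trailing update in step 3 is the cubic term, which is why the slide points there as where most of the work is done.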
6. LU Timing Profile (4-Core System)
[Trace: time spent in each component (DGETF2, DLASWP left, DLASWP right, DTRSM, DGEMM) for a threaded run with no lookahead and a 1D decomposition, showing the bulk-synchronous phases.]
7. Adaptive Lookahead - Dynamic
- Reorganizing algorithms to use this approach: event-driven multithreading and out-of-order execution
8. Fork-Join vs. Dynamic Execution
[Timeline: execution trace of the fork-join parallel BLAS version. Experiments on Intel's quad-core Clovertown, 2 sockets, 8 threads.]
9. Fork-Join vs. Dynamic Execution
[Timeline: the fork-join parallel BLAS trace next to the DAG-based dynamic scheduling trace, showing the time saved. Experiments on Intel's quad-core Clovertown, 2 sockets, 8 threads.]
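The difference between the two traces can be sketched with OpenMP. In the fork-join version every update step ends in an implicit barrier, so nothing from step k+1 can start until the slowest thread of step k finishes; in the task version the same work is declared as a DAG, so the runtime may start the next panel (the lookahead) as soon as its one input column is ready. The kernels below are dummies standing in for the real BLAS calls, and the first element of each block column serves as the dependence proxy:

```c
#include <stdlib.h>

#define NB  8                          /* block columns                     */
#define LEN 4096                       /* entries per block column (dummy)  */

/* Stand-ins for the panel factorization and the trailing update. */
static void panel_factor(double *c)                  { for (int i = 0; i < LEN; ++i) c[i] *= 0.5;  }
static void update_block(const double *p, double *c) { for (int i = 0; i < LEN; ++i) c[i] -= p[i]; }

int main(void)
{
    double *col[NB];
    for (int k = 0; k < NB; ++k) {
        col[k] = malloc(LEN * sizeof(double));
        for (int i = 0; i < LEN; ++i) col[k][i] = 1.0;
    }

    /* Fork-join: each update step is parallel, but the implicit barrier at the
     * end of the parallel loop keeps any work of step k+1 from starting early. */
    for (int k = 0; k < NB; ++k) {
        panel_factor(col[k]);
        #pragma omp parallel for
        for (int j = k + 1; j < NB; ++j)
            update_block(col[k], col[j]);
    }

    for (int k = 0; k < NB; ++k)                   /* reset the dummy data */
        for (int i = 0; i < LEN; ++i) col[k][i] = 1.0;

    /* DAG-based dynamic scheduling: every operation is a task with explicit
     * data dependences, so the panel of step k+1 may start (lookahead) as
     * soon as block column k+1 has been updated. */
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NB; ++k) {
        #pragma omp task depend(inout: col[k][0])
        panel_factor(col[k]);
        for (int j = k + 1; j < NB; ++j) {
            #pragma omp task depend(in: col[k][0]) depend(inout: col[j][0])
            update_block(col[k], col[j]);
        }
    }

    for (int k = 0; k < NB; ++k) free(col[k]);
    return 0;
}
```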
10. Cholesky Factorization - DAG-based Dependency Tracking
[Diagram: the DAG of tile tasks for a 4x4-tile Cholesky factorization, with nodes labeled by the tile indices (1,1) through (4,4) that each task updates.]
- Dependencies expressed by the DAG are enforced on a tile basis
  - fine-grained parallelization
  - flexible scheduling
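PLASMA uses its own dynamic scheduler, but the same tile DAG can be sketched with OpenMP tasks and depend clauses (OpenMP 4.0 or later): each tile kernel becomes a task and the in/inout annotations reproduce the DAG edges. The sketch below assumes the matrix is stored as contiguous nb x nb column-major tiles, uses LAPACKE/CBLAS for the tile kernels, and uses the first element of each tile as the dependence proxy:

```c
#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

/* Tile (i,j) of an nt x nt grid of contiguous nb x nb column-major tiles. */
#define TILE(i,j) (&A[((size_t)(j) * nt + (i)) * nb * nb])

/* Tile Cholesky (lower triangular) driven by task dependences. */
void tile_cholesky(double *A, int nt, int nb)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; ++k) {
        /* POTRF: factor the diagonal tile. */
        #pragma omp task depend(inout: TILE(k,k)[0])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, TILE(k,k), nb);

        /* TRSM: update the tiles below the diagonal tile. */
        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: TILE(k,k)[0]) depend(inout: TILE(i,k)[0])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, nb, nb, 1.0, TILE(k,k), nb, TILE(i,k), nb);
        }

        /* SYRK / GEMM: update the trailing submatrix, tile by tile. */
        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: TILE(i,k)[0]) depend(inout: TILE(i,i)[0])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        nb, nb, -1.0, TILE(i,k), nb, 1.0, TILE(i,i), nb);

            for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: TILE(i,k)[0], TILE(j,k)[0]) \
                                 depend(inout: TILE(i,j)[0])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                            -1.0, TILE(i,k), nb, TILE(j,k), nb, 1.0, TILE(i,j), nb);
            }
        }
    }   /* all tasks complete at the barrier that ends the parallel region */
}
```

Because the dependences are declared per tile, the runtime is free to interleave tasks from different steps k, which is exactly the fine-grained parallelization and flexible scheduling listed above.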
11. Cholesky on the IBM Cell
- Pipelining
  - Between loop iterations
- Double buffering (see the sketch below)
  - Within BLAS
  - Between BLAS
  - Between loop iterations
- Result
  - Minimum load imbalance
  - Minimum dependency stalls
  - Minimum memory stalls (no waiting for data)
- Achieves 174 Gflop/s, 85% of peak, in single precision.
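The double-buffering pattern itself is generic. The sketch below shows its shape in portable C with memcpy standing in for the transfer; on the Cell the copy would be an asynchronous DMA issued through the MFC, which is what actually lets the fetch of tile k+1 overlap the compute on tile k. Names are hypothetical:

```c
#include <string.h>

#define NB 64                                   /* tile dimension */

/* Process ntiles tiles, staging each one through a pair of local buffers.
 * While the kernel works on buf[cur], the next tile is already being
 * brought into buf[nxt]. */
void process_tiles(const double *tiles, int ntiles,
                   void (*compute_tile)(double *t))
{
    static double buf[2][NB * NB];

    memcpy(buf[0], tiles, NB * NB * sizeof(double));            /* prime        */
    for (int k = 0; k < ntiles; ++k) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < ntiles)                                      /* fetch next   */
            memcpy(buf[nxt], tiles + (size_t)(k + 1) * NB * NB,
                   NB * NB * sizeof(double));
        compute_tile(buf[cur]);                                  /* compute cur  */
    }
}
```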
12. How to Deal with Complexity?
- Adaptivity is the key for applications to effectively use available resources whose complexity is increasing exponentially
- Goal
  - Automatically bridge the gap between the application and computers that are rapidly changing and getting more and more complex
13. Examples of Automatic Performance Tuning
- Proceedings of the IEEE, Vol. 93, No. 2, Feb. 2005: special issue on Program Generation, Optimization, and Platform Adaptation
- Dense BLAS
  - Sequential
  - ATLAS (UTK), PHiPAC (UCB)
- Fast Fourier Transform (FFT) variations
  - FFTW (MIT)
  - Sequential and parallel
  - www.fftw.org
- Digital Signal Processing
  - SPIRAL, www.spiral.net (CMU)
- MPI collectives (UCB, UTK)
- More projects, conferences, government reports, ...
14. Generic Code Optimization
- Can ATLAS-like techniques be applied to arbitrary code?
- What do we mean by ATLAS-like techniques?
  - Blocking
  - Loop unrolling
  - Data prefetch
  - Functional unit scheduling
  - etc.
- Referred to as empirical optimization (see the sketch after this list)
  - Generate many variations
  - Pick the best implementation by measuring the performance
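The generate-and-measure loop can be sketched for a single tuning parameter, the blocking factor of a matrix multiply. A real empirical optimizer such as ATLAS also varies unrolling depths, register blocking, prefetch distances, and instruction schedules, and generates the candidate code automatically; this sketch only times a handful of variants and keeps the fastest:

```c
#include <stdio.h>
#include <time.h>

#define N 512

/* Candidate kernel: C += A*B with an nb x nb blocking of the loop nest.
 * nb is the tuning parameter being searched over. */
static void gemm_blocked(const double *A, const double *B, double *C, int nb)
{
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb; ++i)
                    for (int k = kk; k < kk + nb; ++k)
                        for (int j = jj; j < jj + nb; ++j)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}

static double A[N*N], B[N*N], C[N*N];

int main(void)
{
    for (int i = 0; i < N*N; ++i) { A[i] = 1.0; B[i] = 2.0; }

    const int candidates[] = { 16, 32, 64, 128 };
    int best_nb = candidates[0];
    double best_t = 1e30;

    /* Generate-and-measure loop: run each variant, keep the fastest. */
    for (int c = 0; c < 4; ++c) {
        for (int i = 0; i < N*N; ++i) C[i] = 0.0;
        double t = (double)clock();
        gemm_blocked(A, B, C, candidates[c]);
        t = ((double)clock() - t) / CLOCKS_PER_SEC;
        printf("nb = %3d : %.3f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best_nb = candidates[c]; }
    }
    printf("selected block size: %d\n", best_nb);
    return 0;
}
```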
15. Applying Self-Adapting Software
- Numerical and non-numerical applications
  - BLAS-like operations / message-passing collectives
- Static or dynamic choice of the code to be used (see the sketch after this list)
  - Performed at make time / every time the routine is invoked
- Independent of or dependent on the data presented
  - Same on each data set / depends on properties of the data
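A small sketch of the static/dynamic distinction (names are hypothetical): the static choice is fixed once, e.g. from a benchmark run at make/install time, while the dynamic choice is re-made on every call from properties of the data presented:

```c
#include <stdio.h>
#include <stddef.h>

typedef void (*axpy_fn)(size_t n, double a, const double *x, double *y);

static void axpy_simple(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

static void axpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {            /* variant that wins on long vectors */
        y[i]   += a * x[i];     y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];   y[i+3] += a * x[i+3];
    }
    for (; i < n; ++i) y[i] += a * x[i];
}

/* Static choice: fixed once, e.g. by benchmarking at make/install time. */
static axpy_fn axpy_static = axpy_unrolled;

/* Dynamic choice: made on every call from properties of the data presented. */
static void axpy_dynamic(size_t n, double a, const double *x, double *y)
{
    if (n < 32) axpy_simple(n, a, x, y);
    else        axpy_unrolled(n, a, x, y);
}

int main(void)
{
    double x[100], y[100];
    for (int i = 0; i < 100; ++i) { x[i] = 1.0; y[i] = 2.0; }
    axpy_static(100, 0.5, x, y);            /* decision made ahead of time  */
    axpy_dynamic(100, 0.5, x, y);           /* decision depends on the data */
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```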
16. Multi, Many, ..., Many-More
- Parallelism for the masses
- Multi-, many-, and many-more-core systems are here and coming fast
- Use dynamic DAG-based scheduling
  - Minimize sync: non-blocking communication
  - Maximize locality: block data layout
- Autotuners should take on a larger, or at least complementary, role to compilers in translating parallel programs.
- What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC Ecosystem.
17. Collaborators / Support
- Alfredo Buttari
- Julien Langou
- Julie Langou
- Piotr Luszczek
- Jakub Kurzak
- Stan Tomov
19. Summary of Current Unmet Needs
- Performance / portability
- Memory bandwidth / latency
- Fault tolerance
- Adaptability: some degree of autonomy to self-optimize, test, or monitor; able to change mode of operation, static or dynamic
- Better programming models
  - Global shared address space
  - Visible locality
- Maybe coming soon (incremental, yet offering real benefits)
  - Global Address Space (GAS) languages: UPC, Co-Array Fortran, Titanium, Chapel, X10, Fortress
  - Minor extensions to existing languages
  - More convenient than MPI
  - Performance transparency via explicit remote memory references
- What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC Ecosystem.