Title: Jack Dongarra
1. Some of the Software Challenges for Numerical Libraries on ManyCore Systems
- Jack Dongarra
- Innovative Computing Laboratory
- University of Tennessee
- Oak Ridge National Laboratory
2. Major Changes to Math Software
- Scalar
  - Fortran code in EISPACK (1974)
- Vector
  - Level 1 BLAS use in LINPACK (1979)
- SMP
  - Level 3 BLAS use in LAPACK (1992)
- Distributed Memory
  - Message passing with MPI in ScaLAPACK (1995)
- Many-Core
  - Event-driven multithreading in PLASMA (Parallel Linear Algebra Software for Multicore Architectures)
3. ManyCore - Parallelism for the Masses
- We are looking at the following concepts in designing the next library implementation:
  - Dynamic data-driven execution
  - Self-adapting software
  - Block data layout
  - Mixed precision in the algorithm (see the sketch after this list)
  - Exploiting hybrid architectures
  - Fault-tolerant methods
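One well-known instance of mixed precision in the algorithm is to factor the matrix in single precision and recover full accuracy with iterative refinement carried out in double precision. The sketch below is a hypothetical illustration using standard LAPACKE/CBLAS calls, not PLASMA code; it runs a fixed number of refinement sweeps instead of testing a convergence criterion.

```c
#include <stdio.h>
#include <lapacke.h>
#include <cblas.h>

#define N 3

/* Solve Ax = b: factor A once in single precision (the O(n^3) work runs at
 * single-precision speed), then refine the solution in double precision
 * (O(n^2) work per sweep) to recover double-precision accuracy. */
int main(void)
{
    double A[N*N] = { 4, 2, 1,   2, 5, 3,   1, 3, 6 };   /* column-major */
    double b[N]   = { 1, 2, 3 };
    float  As[N*N];
    lapack_int ipiv[N];
    double x[N], r[N];
    float  d[N];

    for (int i = 0; i < N*N; ++i) As[i] = (float)A[i];    /* demote A      */
    LAPACKE_sgetrf(LAPACK_COL_MAJOR, N, N, As, N, ipiv);  /* LU in single  */

    for (int i = 0; i < N; ++i) d[i] = (float)b[i];       /* first solve   */
    LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', N, 1, As, N, ipiv, d, N);
    for (int i = 0; i < N; ++i) x[i] = (double)d[i];

    for (int sweep = 0; sweep < 3; ++sweep) {             /* refinement    */
        for (int i = 0; i < N; ++i) r[i] = b[i];
        /* r = b - A*x, computed in double precision */
        cblas_dgemv(CblasColMajor, CblasNoTrans, N, N, -1.0, A, N, x, 1, 1.0, r, 1);
        for (int i = 0; i < N; ++i) d[i] = (float)r[i];
        LAPACKE_sgetrs(LAPACK_COL_MAJOR, 'N', N, 1, As, N, ipiv, d, N);
        for (int i = 0; i < N; ++i) x[i] += (double)d[i]; /* correction    */
    }

    printf("x = %.15f %.15f %.15f\n", x[0], x[1], x[2]);
    return 0;
}
```

The expensive factorization runs at single-precision speed; each refinement sweep costs only a matrix-vector product and two triangular solves.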
4. Parallelism in LAPACK / ScaLAPACK
[Diagram: the two software stacks. Shared memory: LAPACK layered on ATLAS or other specialized BLAS and threads. Distributed memory: ScaLAPACK layered on the PBLAS (parallel BLAS), the BLACS, and MPI.]
Two well-known open-source software efforts for dense matrix problems.
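As a concrete view of the layering, a user code calls LAPACK (here through the standard LAPACKE C interface), and LAPACK pushes essentially all of the flops down into whichever BLAS is linked in (ATLAS, a vendor BLAS, ...). A minimal sketch, with error handling kept to the info codes:

```c
#include <stdio.h>
#include <lapacke.h>

/* Factor and solve a small dense system with LAPACK's LU (DGETRF/DGETRS).
 * LAPACK does the blocking; the flops land in the BLAS underneath. */
int main(void)
{
    double A[3*3] = { 4, 2, 1,      /* column-major 3x3 matrix */
                      2, 5, 3,
                      1, 3, 6 };
    double b[3]   = { 1, 2, 3 };
    lapack_int ipiv[3], info;

    info = LAPACKE_dgetrf(LAPACK_COL_MAJOR, 3, 3, A, 3, ipiv);
    if (info == 0)
        info = LAPACKE_dgetrs(LAPACK_COL_MAJOR, 'N', 3, 1, A, 3, ipiv, b, 3);

    if (info == 0)
        printf("x = %f %f %f\n", b[0], b[1], b[2]);
    return (int)info;
}
```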
5. Steps in the LAPACK LU
[Diagram: one step of the blocked LU - factor a panel, forward and backward row swaps, a triangular solve for the block row of U, and a matrix multiply that updates the trailing submatrix. Most of the work is done in the matrix multiply.]
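A minimal sketch of that blocked step in plain C may help fix the structure. It uses column-major storage and hypothetical helper names, and it omits pivoting and the row swaps for brevity, so it is the unpivoted variant of what DGETRF actually does:

```c
/* Sketch of one blocked LU step (pivoting omitted for brevity).
 * A is n x n, column-major, leading dimension lda; nb is the block size;
 * assumes k + nb <= n. */
#include <stddef.h>

#define A(i,j) a[(size_t)(j)*lda + (i)]

/* 1. Panel factorization: unblocked LU on the nb-wide panel (cf. DGETF2). */
static void factor_panel(double *a, int lda, int m, int nb)
{
    for (int k = 0; k < nb; ++k)
        for (int i = k + 1; i < m; ++i) {
            A(i,k) /= A(k,k);                    /* column of L         */
            for (int j = k + 1; j < nb; ++j)
                A(i,j) -= A(i,k) * A(k,j);       /* update within panel */
        }
}

/* One iteration of the blocked algorithm at column k (cf. DGETRF). */
void lu_block_step(double *a, int lda, int n, int k, int nb)
{
    int m = n - k;                               /* rows below the panel top */
    factor_panel(&A(k,k), lda, m, nb);

    /* 2. Triangular solve for the block row of U: L11 * U12 = A12 (cf. DTRSM). */
    for (int j = k + nb; j < n; ++j)
        for (int i = k; i < k + nb; ++i)
            for (int p = k; p < i; ++p)
                A(i,j) -= A(i,p) * A(p,j);

    /* 3. Trailing matrix update: A22 -= L21 * U12 (cf. DGEMM). */
    for (int j = k + nb; j < n; ++j)
        for (int i = k + nb; i < n; ++i)
            for (int p = k; p < k + nb; ++p)
                A(i,j) -= A(i,p) * A(p,j);
}
```

The trailing update in step 3 is the cubic term, which is why the slide points there as where most of the work is done.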
6. LU Timing Profile (4-Core System)
[Trace: time spent in each component (DGETF2, DLASWP left, DLASWP right, DTRSM, DGEMM) for a threaded run with no lookahead and a 1D decomposition, showing the bulk-synchronous phases.]
7. Adaptive Lookahead - Dynamic
- Reorganizing algorithms to use this approach: event-driven multithreading and out-of-order execution
8. Fork-Join vs. Dynamic Execution
[Timeline: execution trace of the fork-join parallel BLAS version. Experiments on Intel's quad-core Clovertown, 2 sockets, 8 threads.]
9. Fork-Join vs. Dynamic Execution
[Timeline: the fork-join parallel BLAS trace next to the DAG-based dynamic scheduling trace, showing the time saved. Experiments on Intel's quad-core Clovertown, 2 sockets, 8 threads.]
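The difference between the two traces can be sketched with OpenMP. In the fork-join version every update step ends in an implicit barrier, so nothing from step k+1 can start until the slowest thread of step k finishes; in the task version the same work is declared as a DAG, so the runtime may start the next panel (the lookahead) as soon as its one input column is ready. The kernels below are dummies standing in for the real BLAS calls, and the first element of each block column serves as the dependence proxy:

```c
#include <stdlib.h>

#define NB  8                          /* block columns                     */
#define LEN 4096                       /* entries per block column (dummy)  */

/* Stand-ins for the panel factorization and the trailing update. */
static void panel_factor(double *c)                  { for (int i = 0; i < LEN; ++i) c[i] *= 0.5;  }
static void update_block(const double *p, double *c) { for (int i = 0; i < LEN; ++i) c[i] -= p[i]; }

int main(void)
{
    double *col[NB];
    for (int k = 0; k < NB; ++k) {
        col[k] = malloc(LEN * sizeof(double));
        for (int i = 0; i < LEN; ++i) col[k][i] = 1.0;
    }

    /* Fork-join: each update step is parallel, but the implicit barrier at the
     * end of the parallel loop keeps any work of step k+1 from starting early. */
    for (int k = 0; k < NB; ++k) {
        panel_factor(col[k]);
        #pragma omp parallel for
        for (int j = k + 1; j < NB; ++j)
            update_block(col[k], col[j]);
    }

    for (int k = 0; k < NB; ++k)                   /* reset the dummy data */
        for (int i = 0; i < LEN; ++i) col[k][i] = 1.0;

    /* DAG-based dynamic scheduling: every operation is a task with explicit
     * data dependences, so the panel of step k+1 may start (lookahead) as
     * soon as block column k+1 has been updated. */
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < NB; ++k) {
        #pragma omp task depend(inout: col[k][0])
        panel_factor(col[k]);
        for (int j = k + 1; j < NB; ++j) {
            #pragma omp task depend(in: col[k][0]) depend(inout: col[j][0])
            update_block(col[k], col[j]);
        }
    }

    for (int k = 0; k < NB; ++k) free(col[k]);
    return 0;
}
```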
10. Cholesky Factorization - DAG-based Dependency Tracking
[Diagram: the DAG of tile tasks for a 4x4-tile Cholesky factorization, with nodes labeled by the tile indices (1,1) through (4,4) that each task updates.]
- Dependencies expressed by the DAG are enforced on a tile basis
  - fine-grained parallelization
  - flexible scheduling
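PLASMA uses its own dynamic scheduler, but the same tile DAG can be sketched with OpenMP tasks and depend clauses (OpenMP 4.0 or later): each tile kernel becomes a task and the in/inout annotations reproduce the DAG edges. The sketch below assumes the matrix is stored as contiguous nb x nb column-major tiles, uses LAPACKE/CBLAS for the tile kernels, and uses the first element of each tile as the dependence proxy:

```c
#include <stdlib.h>
#include <lapacke.h>
#include <cblas.h>

/* Tile (i,j) of an nt x nt grid of contiguous nb x nb column-major tiles. */
#define TILE(i,j) (&A[((size_t)(j) * nt + (i)) * nb * nb])

/* Tile Cholesky (lower triangular) driven by task dependences. */
void tile_cholesky(double *A, int nt, int nb)
{
    #pragma omp parallel
    #pragma omp single
    for (int k = 0; k < nt; ++k) {
        /* POTRF: factor the diagonal tile. */
        #pragma omp task depend(inout: TILE(k,k)[0])
        LAPACKE_dpotrf(LAPACK_COL_MAJOR, 'L', nb, TILE(k,k), nb);

        /* TRSM: update the tiles below the diagonal tile. */
        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: TILE(k,k)[0]) depend(inout: TILE(i,k)[0])
            cblas_dtrsm(CblasColMajor, CblasRight, CblasLower, CblasTrans,
                        CblasNonUnit, nb, nb, 1.0, TILE(k,k), nb, TILE(i,k), nb);
        }

        /* SYRK / GEMM: update the trailing submatrix, tile by tile. */
        for (int i = k + 1; i < nt; ++i) {
            #pragma omp task depend(in: TILE(i,k)[0]) depend(inout: TILE(i,i)[0])
            cblas_dsyrk(CblasColMajor, CblasLower, CblasNoTrans,
                        nb, nb, -1.0, TILE(i,k), nb, 1.0, TILE(i,i), nb);

            for (int j = k + 1; j < i; ++j) {
                #pragma omp task depend(in: TILE(i,k)[0], TILE(j,k)[0]) \
                                 depend(inout: TILE(i,j)[0])
                cblas_dgemm(CblasColMajor, CblasNoTrans, CblasTrans, nb, nb, nb,
                            -1.0, TILE(i,k), nb, TILE(j,k), nb, 1.0, TILE(i,j), nb);
            }
        }
    }   /* all tasks complete at the barrier that ends the parallel region */
}
```

Because the dependences are declared per tile, the runtime is free to interleave tasks from different steps k, which is exactly the fine-grained parallelization and flexible scheduling listed above.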
11. Cholesky on the IBM Cell
- Pipelining
  - Between loop iterations
- Double buffering (see the sketch below)
  - Within BLAS
  - Between BLAS
  - Between loop iterations
- Result
  - Minimum load imbalance
  - Minimum dependency stalls
  - Minimum memory stalls (no waiting for data)
- Achieves 174 Gflop/s, 85% of peak, in single precision.
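The double-buffering pattern itself is generic. The sketch below shows its shape in portable C with memcpy standing in for the transfer; on the Cell the copy would be an asynchronous DMA issued through the MFC, which is what actually lets the fetch of tile k+1 overlap the compute on tile k. Names are hypothetical:

```c
#include <string.h>

#define NB 64                                   /* tile dimension */

/* Process ntiles tiles, staging each one through a pair of local buffers.
 * While the kernel works on buf[cur], the next tile is already being
 * brought into buf[nxt]. */
void process_tiles(const double *tiles, int ntiles,
                   void (*compute_tile)(double *t))
{
    static double buf[2][NB * NB];

    memcpy(buf[0], tiles, NB * NB * sizeof(double));            /* prime        */
    for (int k = 0; k < ntiles; ++k) {
        int cur = k & 1, nxt = cur ^ 1;
        if (k + 1 < ntiles)                                      /* fetch next   */
            memcpy(buf[nxt], tiles + (size_t)(k + 1) * NB * NB,
                   NB * NB * sizeof(double));
        compute_tile(buf[cur]);                                  /* compute cur  */
    }
}
```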
12. How to Deal with Complexity?
- Adaptivity is the key for applications to effectively use available resources whose complexity is increasing exponentially
- Goal
  - Automatically bridge the gap between the application and computers that are rapidly changing and getting more and more complex
13. Examples of Automatic Performance Tuning
- Proceedings of the IEEE, Vol. 93, No. 2, Feb. 2005: special issue on Program Generation, Optimization, and Platform Adaptation
- Dense BLAS
  - Sequential
  - ATLAS (UTK), PHiPAC (UCB)
- Fast Fourier Transform (FFT) variations
  - FFTW (MIT)
  - Sequential and parallel
  - www.fftw.org
- Digital Signal Processing
  - SPIRAL, www.spiral.net (CMU)
- MPI collectives (UCB, UTK)
- More projects, conferences, government reports, ...
14. Generic Code Optimization
- Can ATLAS-like techniques be applied to arbitrary code?
- What do we mean by ATLAS-like techniques?
  - Blocking
  - Loop unrolling
  - Data prefetch
  - Functional unit scheduling
  - etc.
- Referred to as empirical optimization (see the sketch after this list)
  - Generate many variations
  - Pick the best implementation by measuring the performance
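The generate-and-measure loop can be sketched for a single tuning parameter, the blocking factor of a matrix multiply. A real empirical optimizer such as ATLAS also varies unrolling depths, register blocking, prefetch distances, and instruction schedules, and generates the candidate code automatically; this sketch only times a handful of variants and keeps the fastest:

```c
#include <stdio.h>
#include <time.h>

#define N 512

/* Candidate kernel: C += A*B with an nb x nb blocking of the loop nest.
 * nb is the tuning parameter being searched over. */
static void gemm_blocked(const double *A, const double *B, double *C, int nb)
{
    for (int ii = 0; ii < N; ii += nb)
        for (int kk = 0; kk < N; kk += nb)
            for (int jj = 0; jj < N; jj += nb)
                for (int i = ii; i < ii + nb; ++i)
                    for (int k = kk; k < kk + nb; ++k)
                        for (int j = jj; j < jj + nb; ++j)
                            C[i*N + j] += A[i*N + k] * B[k*N + j];
}

static double A[N*N], B[N*N], C[N*N];

int main(void)
{
    for (int i = 0; i < N*N; ++i) { A[i] = 1.0; B[i] = 2.0; }

    const int candidates[] = { 16, 32, 64, 128 };
    int best_nb = candidates[0];
    double best_t = 1e30;

    /* Generate-and-measure loop: run each variant, keep the fastest. */
    for (int c = 0; c < 4; ++c) {
        for (int i = 0; i < N*N; ++i) C[i] = 0.0;
        double t = (double)clock();
        gemm_blocked(A, B, C, candidates[c]);
        t = ((double)clock() - t) / CLOCKS_PER_SEC;
        printf("nb = %3d : %.3f s\n", candidates[c], t);
        if (t < best_t) { best_t = t; best_nb = candidates[c]; }
    }
    printf("selected block size: %d\n", best_nb);
    return 0;
}
```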
15. Applying Self-Adapting Software
- Numerical and non-numerical applications
  - BLAS-like operations / message-passing collectives
- Static or dynamic choice of the code to be used (see the sketch after this list)
  - Performed at make time / every time the routine is invoked
- Independent of or dependent on the data presented
  - Same on each data set / depends on properties of the data
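A small sketch of the static/dynamic distinction (names are hypothetical): the static choice is fixed once, e.g. from a benchmark run at make/install time, while the dynamic choice is re-made on every call from properties of the data presented:

```c
#include <stdio.h>
#include <stddef.h>

typedef void (*axpy_fn)(size_t n, double a, const double *x, double *y);

static void axpy_simple(size_t n, double a, const double *x, double *y)
{
    for (size_t i = 0; i < n; ++i) y[i] += a * x[i];
}

static void axpy_unrolled(size_t n, double a, const double *x, double *y)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {            /* variant that wins on long vectors */
        y[i]   += a * x[i];     y[i+1] += a * x[i+1];
        y[i+2] += a * x[i+2];   y[i+3] += a * x[i+3];
    }
    for (; i < n; ++i) y[i] += a * x[i];
}

/* Static choice: fixed once, e.g. by benchmarking at make/install time. */
static axpy_fn axpy_static = axpy_unrolled;

/* Dynamic choice: made on every call from properties of the data presented. */
static void axpy_dynamic(size_t n, double a, const double *x, double *y)
{
    if (n < 32) axpy_simple(n, a, x, y);
    else        axpy_unrolled(n, a, x, y);
}

int main(void)
{
    double x[100], y[100];
    for (int i = 0; i < 100; ++i) { x[i] = 1.0; y[i] = 2.0; }
    axpy_static(100, 0.5, x, y);            /* decision made ahead of time  */
    axpy_dynamic(100, 0.5, x, y);           /* decision depends on the data */
    printf("y[0] = %f\n", y[0]);
    return 0;
}
```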
16. Multi, Many, ..., Many-More
- Parallelism for the masses
- Multi-, many-, and many-more-core systems are here and coming fast
- Use dynamic DAG-based scheduling
  - Minimize sync: non-blocking communication
  - Maximize locality: block data layout
- Autotuners should take on a larger, or at least complementary, role to compilers in translating parallel programs.
- What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC Ecosystem.
17. Collaborators / Support
- Alfredo Buttari
- Julien Langou
- Julie Langou
- Piotr Luszczek
- Jakub Kurzak
- Stan Tomov
19. Summary of Current Unmet Needs
- Performance / portability
- Memory bandwidth / latency
- Fault tolerance
- Adaptability: some degree of autonomy to self-optimize, test, or monitor; able to change mode of operation, static or dynamic
- Better programming models
  - Global shared address space
  - Visible locality
- Maybe coming soon (incremental, yet offering real benefits)
  - Global Address Space (GAS) languages: UPC, Co-Array Fortran, Titanium, Chapel, X10, Fortress
  - Minor extensions to existing languages
  - More convenient than MPI
  - Performance transparency via explicit remote memory references
- What's needed is a long-term, balanced investment in hardware, software, algorithms, and applications in the HPC Ecosystem.