Programming for High Performance Computers

Slides: 23

Provided by: johnle9

Learn more at: https://www.csm.ornl.gov

Transcript and Presenter's Notes

Title: Programming for High Performance Computers

1
Programming for High Performance Computers

John M. Levesque
Director
Cray's Supercomputing Center of Excellence

2
Outline

Building a Petascale Computer
Challenges for utilizing a Petascale System
Utilizing the Core
Utilizing the Socket
Scaling to 100,000 cores
How one programs for the Petascale System
Conclusion

3
Petascale Computer

First we need to define what we mean by a
Petascale computer
Google already has a Petaflop on their floor
Embarrassingly parallel application
My definition
A Petascale computer is a computer system that
delivers a sustained Petaflop on several real
science applications

4
A Petascale Computer Requires

A state-of-the-art Commodity Micro-processor
An ultra-fast proprietary Interconnect
A sophisticated LWK Operating System to stay out
of the way of application scaling
Efficient messaging between processors
MPI may not be efficient enough!!

5
Potential Petascale Computer

32,768 sockets
More dense circuitry results in more processors
(cores) on the chip (socket)
Each core produces 4 floating point results per
clock cycle
Each socket contains 4 cores sharing memory
We expect micro-processor technology to supply
3 GHz sockets by the end of 2009, each capable
of delivering 16 floating point operations per
clock cycle.

32,768 sockets × 16 operations/cycle × 3 GHz = 1,572,864 GFLOPS ≈ 1.57 PFLOPS
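
The arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the peak-rate calculation; the constant names are illustrative, and the values are the ones quoted on the slide.

```python
# Peak floating point rate of the hypothetical system described on this slide.
SOCKETS = 32_768
CORES_PER_SOCKET = 4
RESULTS_PER_CORE_PER_CYCLE = 4
CLOCK_GHZ = 3

# 4 cores x 4 results = 16 floating point operations per socket per cycle.
ops_per_socket_per_cycle = CORES_PER_SOCKET * RESULTS_PER_CORE_PER_CYCLE
total_cores = SOCKETS * CORES_PER_SOCKET

# Operations per cycle times cycles per nanosecond (GHz) gives GFLOPS.
peak_gflops = SOCKETS * ops_per_socket_per_cycle * CLOCK_GHZ
peak_pflops = peak_gflops / 1_000_000

print(f"{total_cores:,} cores, {peak_gflops:,} GFLOPS = {peak_pflops:.3f} PFLOPS")
# prints: 131,072 cores, 1,572,864 GFLOPS = 1.573 PFLOPS
```

The 131,072 core count is the same figure used later as the number of processors to scale across.
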
6
Petascale Challenge for Interconnect

Connect 32,768 Sockets together with an
interconnect that has 2-3 microseconds latency
across the entire system
Supply a cross-section bandwidth to facilitate
ALLTOALL communication across the entire system

7
Petascale Challenge for Programming

Use as 131,072 Uni-processors or 32,768 4-way
Shared Memory sockets
MPI across all the processors
Hard on Socket Memory bandwidth and injection
bandwidth into the network
MPI between sockets and OpenMP within the socket
Hybrid programming is difficult

8
Petascale Challenge for Software

OS must be able to supply required facilities and
not be overloaded with daemons that steal CPU
cycles and get cores out of sync
The notion of a Light Weight Kernel (LWK) that
only has what is needed to run the application
No keyboard daemon, no kernel threads, no sockets,
...

Two systems are using this very successfully
today: Cray's XT4 and IBM's Blue Gene
9
The Programming Challenge

We start with 1.5 Petaflops and want to sustain
more than 1 Petaflop
Must achieve 67% of peak across the entire system
Inhibitors
On-socket memory bandwidth
Scaling across 131,072 processors or,
Utilizing OpenMP on socket, Messaging across
system
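
A rough way to see how the inhibitors combine is sketched below. The 67% target follows directly from the slide's numbers. The multiplicative model in `sustained_pflops` and the example efficiencies passed to it are assumptions made for illustration, not measurements.

```python
PEAK_PFLOPS = 1.5
TARGET_SUSTAINED_PFLOPS = 1.0

# Fraction of peak the whole system must sustain to reach the target.
required_fraction = TARGET_SUSTAINED_PFLOPS / PEAK_PFLOPS
print(f"Required fraction of peak: {required_fraction:.0%}")  # prints 67%


def sustained_pflops(peak_pflops, core_efficiency, scaling_efficiency):
    """Assumed model: per-core and scaling losses multiply."""
    return peak_pflops * core_efficiency * scaling_efficiency


# Hypothetical values: 80% of peak on the core, 85% parallel efficiency.
example = sustained_pflops(PEAK_PFLOPS, 0.80, 0.85)
print(f"Example sustained rate: {example:.2f} PFLOPS")
```

Under this assumed model, even 80% of peak on each core leaves little room for scaling losses before the system drops below 1 Petaflop.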

10
The Programming Challenge

Inhibitors
On-socket memory bandwidth
Today we see sustained performance of between 5%
and 80% of peak on the core. This single-core
sustained performance is the maximum we will
achieve.
Scaling across 131,072 processors or,
Today few applications scale as high as 5,000
processors
Utilizing OpenMP on socket, Messaging across
system
OpenMP must be used on a very high percentage of
the application, or else Amdahl's law applies
and the peak of the socket may be degraded
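
The Amdahl's law point can be made concrete for a 4-core socket. The sketch below uses the standard formula; the parallel fractions in the loop are illustrative values, not figures from the presentation.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Amdahl's law: speedup when only part of the run time is parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


CORES_PER_SOCKET = 4

# Illustrative fractions of the run time covered by OpenMP parallel regions.
for fraction in (0.50, 0.90, 0.99):
    speedup = amdahl_speedup(fraction, CORES_PER_SOCKET)
    share_of_socket = speedup / CORES_PER_SOCKET
    print(f"{fraction:.0%} parallel: speedup {speedup:.2f}x, "
          f"{share_of_socket:.0%} of the socket's 4-core capability")
```

With only half of the run time under OpenMP, the speedup on 4 cores is 1.6x, which is well under half of what the socket could deliver.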

11
Programming for the Core

Each core produces 4 floating point
results/clock cycle, but the memory can only
supply 16 bytes/clock cycle
Best case: contiguous on 16-byte boundaries
32-bit arithmetic: 4 words/cycle
64-bit arithmetic: 2 words/cycle
Worst case
One word every 2-4 cycles

12
Consider a Triad Kernel

A = B + Scalar * C

Need 2 loads and 1 store to produce 1 result
How can we produce 4 results each clock cycle
when even 1 result per clock cycle (64-bit) requires
fetching 16 bytes/clock cycle and storing
8 bytes/clock cycle?
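
The mismatch can be worked through numerically. The sketch below assumes 64-bit operands and, as a simplification, that loads and stores share the 16 bytes/clock cycle the memory can supply. The `triad` function is only a reference version of the kernel, and all names are illustrative.

```python
BYTES_PER_WORD = 8            # 64-bit operands
LOADS_PER_RESULT = 2          # B(i) and C(i)
STORES_PER_RESULT = 1         # A(i)
MEMORY_BYTES_PER_CYCLE = 16   # supply rate quoted on the slides
PEAK_RESULTS_PER_CYCLE = 4    # what one core could produce


def triad(b, c, scalar):
    """Reference triad: a[i] = b[i] + scalar * c[i]."""
    return [bi + scalar * ci for bi, ci in zip(b, c)]


load_bytes_per_result = LOADS_PER_RESULT * BYTES_PER_WORD     # 16 bytes
store_bytes_per_result = STORES_PER_RESULT * BYTES_PER_WORD   # 8 bytes
bytes_per_result = load_bytes_per_result + store_bytes_per_result

# Assumption: loads and stores compete for the same 16 bytes/cycle.
results_per_cycle = MEMORY_BYTES_PER_CYCLE / bytes_per_result
fraction_of_peak = results_per_cycle / PEAK_RESULTS_PER_CYCLE

print(f"{bytes_per_result} bytes moved per result")
print(f"{results_per_cycle:.2f} results/cycle from memory, "
      f"{fraction_of_peak:.0%} of the core's peak")
```

Under these assumptions a triad streaming from memory sustains well under one result per cycle, far short of the 4 results per cycle the core can produce.
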
13
Programming for the Core

Each core produces 4 floating point
results/clock cycle, but the memory can only
supply 16 bytes/clock cycle
Best case: contiguous on 16-byte boundaries
32-bit arithmetic: 4 words/cycle
64-bit arithmetic: 2 words/cycle
Worst case
One word every 2-4 cycles

IMBALANCE
14
CACHE to the rescue?

To solve the processor/memory mismatch
Caches are introduced to facilitate the re-use of
data
2-3 levels of cache: L1, L2, L3
L1 and L2 are dedicated to a core
L3 is typically shared across the cores
To improve performance, users must understand how
to take advantage of cache
Users can improve cache utilization by blocking
their algorithms to have a working set that fits
in cache
Efficient libraries tend to be cache-friendly
ZGEMM achieves 80-90% of peak performance
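
Blocking can be illustrated with a matrix multiply. The sketch below shows only the loop structure: the work is tiled so that each tile's working set is small enough to stay in cache. It is written in Python for readability, so it will not show the speedup itself; in practice the same tiling would be applied in Fortran or C. The function names and the tiny block size are illustrative assumptions.

```python
def matmul_naive(a, b):
    """Plain triple loop: streams through all of b for every row of a."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            aik = a[i][k]
            for j in range(p):
                c[i][j] += aik * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, tiled so each block-sized tile is re-used while hot."""
    n, m, p = len(a), len(b), len(b[0])
    c = [[0.0] * p for _ in range(n)]
    for ii in range(0, n, block):              # tile of rows of a and c
        for kk in range(0, m, block):          # tile of the inner dimension
            for jj in range(0, p, block):      # tile of columns of b and c
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, m)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, p)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
b = [[9.0, 8.0, 7.0], [6.0, 5.0, 4.0], [3.0, 2.0, 1.0]]
print(matmul_blocked(a, b) == matmul_naive(a, b))  # prints True
```

On a real system the block size would be chosen so that the tiles of all three matrices fit in the targeted cache level together.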

15
Programming Challenge

Minimize loads/stores and maximize floating point
operations
Fortran compilers have been and are extremely
good at optimizing Fortran code
C compilers are hindered by the use of pointers, which
confuse the compiler's data dependency analysis
unless one writes C-tran.
C++ compilers completely give up

16
Programming Challenge

80% of ORNL's major science applications are
written in Fortran
University students are being taught about new
architectures and C, C++ and Java
No classes are teaching how to write Fortran and
C to take advantage of cache and utilize SSE
instructions through the language

17
We must have more Fortran Programmers
18
Why Fortran?

Legacy codes are mostly written in Fortran
Compiler writers tend to develop better Fortran
optimizations because of the existing code base
83% of ORNL's major codes are Fortran
Fortran allows the users to relay more
information about memory access to the compiler
Compilers can generate better optimized code from
Fortran than from C, and C++ code is just awful
Scientific Programmers tend to use Fortran to get
the most out of the system
Even large C++ Frameworks use Fortran
computational kernels

19
What about new Languages?

Famous Question
What languages are going to be used in the year
2000?
Famous Answer
Don't know what it will be called; however, it
will look a lot like Fortran

20
Seriously

HPF (High Performance Fortran) was a complete
failure. A language was developed that was
difficult to compile efficiently. Since early use
was unsuccessful, programmers quit using the new
language before the compilers got better
ARPA HPCC: three new language proposals. Will
they suffer from the HPF syndrome?

21
The Hybrid Programming Model

OpenMP on the socket
Master/Slave model
MPI or CAF or UPC across the system
Single program, Multiple Data (SPMD)
A few are Multiple-instruction, Multiple Data (MIMD)

Co-array Fortran and UPC greatly simplify this
into a single programming Model
22
Shared Memory Programming

OpenMP
Directives for Fortran and Pragmas for C
Co-Arrays
User specifies a processor
A(I,J)[nproc] = B(I,J)[nproc+1] + C(I,J)

If nproc or nproc+1 is on the socket, this is a
store into memory; if off processor, it is a
remote memory store. C always comes from local memory
23
How to create a new Language

Extend an old one
Co-Array Fortran
Extension of Fortran
UPC
Extension of C
This way the compiler writers only have to
address the extension when generating efficient
code.

24
We must start teaching Co-array Fortran and UPC
25
The Programming Challenge

Scaling to 131,072 processors
MPI is a more coarse-grained form of messaging, requiring
hand-holding between communicating processors
User is protected to some degree
Co-Array Fortran and UPC are Fortran and C
extensions that facilitate low latency gets and
puts into remote memory. These two languages
are commonly known as Global Address Space
languages, where the user can address all of the
memory of the MPP
User must be cognizant of synchronization between
processors

26
Conclusions

Scientific Programmers must start learning
how to utilize 100,000s of processors
how to utilize 4-8 cores per socket
Fortran is the best language to use for
controlling cache usage
utilizing SSE2 instructions required to obtain
more than 1 result per clock cycle
working with the compiler to get the most out of
the core
GAS languages such as Co-Array Fortran and UPC
facilitate efficient utilization of 100,000s of
processors