Title: Programming for High Performance Computers
1. Programming for High Performance Computers
- John M. Levesque
- Director, Cray's Supercomputing Center of Excellence
2. Outline
- Building a Petascale Computer
- Challenges for utilizing a Petascale System
- Utilizing the Core
- Utilizing the Socket
- Scaling to 100,000 cores
- How one programs for the Petascale System
- Conclusion
3. Petascale Computer
- First we need to define what we mean by a Petascale computer
- Google already has a Petaflop on their floor
- Embarrassingly Parallel Application
- My Definition
- A Petascale computer is a computer system that delivers a sustained Petaflop to several real science applications
4. A Petascale Computer Requires
- A state-of-the-art commodity microprocessor
- An ultra-fast proprietary interconnect
- A sophisticated LWK operating system to stay out of the way of application scaling
- Efficient messaging between processors
- MPI may not be efficient enough!
5. Potential Petascale Computer
- 32,768 sockets
- Denser circuitry results in more processors (cores) on the chip (socket)
- Each core produces 4 results per clock cycle
- Each socket contains 4 cores sharing memory
- We expect that by the end of 2009, microprocessor technology will supply 3 GHz sockets, each capable of delivering 16 floating-point operations per clock cycle.
32,768 × 16 × 3 = 1,572,864 GFLOPS ≈ 1.57 PFLOPS
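The peak figure above is simple arithmetic, and a short Python sketch makes the bookkeeping explicit. The constant names are illustrative only; the values are the ones quoted on this slide (32,768 sockets, 4 cores per socket, 4 results per core per clock, 3 GHz).

```python
# Peak-performance arithmetic for the hypothetical system on this slide.
# All names are illustrative; the numbers are taken from the slide.
SOCKETS = 32_768
CORES_PER_SOCKET = 4
RESULTS_PER_CORE_PER_CLOCK = 4
CLOCK_GHZ = 3

cores = SOCKETS * CORES_PER_SOCKET
flops_per_socket_per_clock = CORES_PER_SOCKET * RESULTS_PER_CORE_PER_CLOCK
peak_gflops = SOCKETS * flops_per_socket_per_clock * CLOCK_GHZ
peak_pflops = peak_gflops / 1_000_000  # 1 PFLOPS = 10**6 GFLOPS

print(f"cores in the system: {cores:,}")
print(f"peak: {peak_gflops:,} GFLOPS = {peak_pflops:.3f} PFLOPS")
```

The same core count, 131,072, is the number of processors an MPI-everywhere program would have to scale across.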
6. Petascale Challenge for Interconnect
- Connect 32,768 sockets together with an interconnect that has 2-3 microseconds of latency across the entire system
- Supply enough cross-section bandwidth to facilitate ALLTOALL communication across the entire system
7. Petascale Challenge for Programming
- Use as 131,072 uniprocessors or as 32,768 4-way shared-memory sockets
- MPI across all the processors
- Hard on socket memory bandwidth and injection bandwidth into the network
- MPI between sockets and OpenMP across the socket
- Hybrid programming is difficult
8. Petascale Challenge for Software
- The OS must be able to supply the required facilities and not be overloaded with daemons that steal CPU cycles and get cores out of sync
- The notion of a Light Weight Kernel (LWK) that has only what is needed to run the application
- No keyboard daemon, no kernel threads, no sockets, ...
Two systems are using this very successfully today: Cray's XT4 and IBM's Blue Gene
9. The Programming Challenge
- We start with 1.5 Petaflops and want to sustain > 1 Petaflop
- Must achieve 67% of peak across the entire system
- Inhibitors
- On-socket memory bandwidth
- Scaling across 131,072 processors, or
- Utilizing OpenMP on the socket and messaging across the system
10. The Programming Challenge
- Inhibitors
- On-socket memory bandwidth
- Today we see sustained performance of between 5% and 80% of peak on the core. This single-core sustained performance is the maximum we will achieve.
- Scaling across 131,072 processors, or
- Today few applications scale as high as 5,000 processors
- Utilizing OpenMP on the socket and messaging across the system
- OpenMP must be used on a very high percentage of the application, or else Amdahl's law applies and the peak of the socket may be degraded
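To see why OpenMP coverage matters so much, here is a minimal sketch of Amdahl's law applied to one 4-core socket. The function name `amdahl_speedup` and the sample fractions are illustrative assumptions, not measurements from any application.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's law: speedup when only `parallel_fraction` of the run time is threaded."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


CORES_PER_SOCKET = 4  # socket size used in these slides

# Hypothetical OpenMP coverage levels, chosen only for illustration.
for fraction in (0.50, 0.90, 0.99):
    speedup = amdahl_speedup(fraction, CORES_PER_SOCKET)
    efficiency = speedup / CORES_PER_SOCKET
    print(f"{fraction:.0%} threaded: speedup {speedup:.2f}x, "
          f"socket efficiency {efficiency:.0%}")
```

Under this model, a code that is only half threaded gets a speedup of 1.6 on four cores, which is well short of the 67% of peak the whole system needs to sustain.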
11. Programming for the Core
- Each core produces 4 floating-point results per clock cycle, but the memory can only supply 16 bytes per clock cycle
- Best case: contiguous on 16-byte boundaries
- 32-bit arithmetic: 4 words/cycle
- 64-bit arithmetic: 2 words/cycle
- Worst case
- One word every 2-4 cycles
12. Consider a Triad Kernel
Need 2 loads and 1 store to produce 1 result.
How can we produce 4 results each clock cycle when we need to fetch 16 bytes/clock cycle and store 8 bytes/clock cycle?
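The mismatch can be put into numbers with a small sketch. It assumes 64-bit (8-byte) operands, which is one reading of the 16-byte fetch and 8-byte store figures above, and it takes the 16 bytes per clock cycle memory supply rate quoted elsewhere in these slides. All names are illustrative.

```python
# Memory traffic needed by the triad kernel versus what memory can supply.
WORD_BYTES = 8                # assumption: 64-bit operands
LOADS_PER_RESULT = 2
STORES_PER_RESULT = 1
TARGET_RESULTS_PER_CLOCK = 4  # what the core can produce
MEMORY_BYTES_PER_CLOCK = 16   # supply rate quoted in these slides

load_bytes_per_result = LOADS_PER_RESULT * WORD_BYTES
store_bytes_per_result = STORES_PER_RESULT * WORD_BYTES

# Traffic required to keep the core busy at 4 results per clock.
needed_load_bytes = TARGET_RESULTS_PER_CLOCK * load_bytes_per_result
needed_store_bytes = TARGET_RESULTS_PER_CLOCK * store_bytes_per_result

# Simplifying assumption: only loads compete for the 16 bytes per clock.
sustainable_results = MEMORY_BYTES_PER_CLOCK / load_bytes_per_result

print(f"needed per clock: {needed_load_bytes} bytes loaded, "
      f"{needed_store_bytes} bytes stored")
print(f"memory-bound rate: {sustainable_results:.1f} of "
      f"{TARGET_RESULTS_PER_CLOCK} results per clock")
```

Under these assumptions the kernel runs at no more than a quarter of the core's peak when its operands come from memory rather than cache.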
13. Programming for the Core
- Each core produces 4 floating-point results per clock cycle, but the memory can only supply 16 bytes per clock cycle
- Best case: contiguous on 16-byte boundaries
- 32-bit arithmetic: 4 words/cycle
- 64-bit arithmetic: 2 words/cycle
- Worst case
- One word every 2-4 cycles
IMBALANCE
14. CACHE to the rescue?
- To solve the processor/memory mismatch
- Caches are introduced to facilitate the re-use of data
- 2-3 levels of cache: L1, L2, L3
- L1 and L2 are dedicated to a core
- L3 is typically shared across the cores
- To improve performance, users must understand how to take advantage of cache
- Users can improve cache utilization by blocking their algorithms to have a working set that fits in cache
- Efficient libraries tend to be cache-friendly
- ZGEMM achieves 80-90% of peak performance
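Blocking is easiest to see in the loop structure. The sketch below shows a plain and a blocked matrix multiply in Python. It illustrates the loop transformation only: interpreted Python will not show the cache benefit, and a real code would apply the same restructuring in Fortran or C, or call a tuned library. The function names and the `block` parameter are illustrative.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n lists of lists."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block):
    """Same product, computed tile by tile.

    Each (ii, kk, jj) step touches only block x block tiles of a, b and c,
    so the working set can be sized to fit in cache.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        a_ik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += a_ik * b[k][j]
    return c


n = 6
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
print(matmul_blocked(a, b, n, block=4) == matmul_naive(a, b, n))
```

The block size is the tuning knob: it would be chosen so that the tiles in use fit in the cache level being targeted.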
15. Programming Challenge
- Minimize loads/stores and maximize floating-point operations
- Fortran compilers have been and are extremely good at optimizing Fortran code
- C compilers are hindered by the use of pointers, which confuse the compiler's data dependency analysis unless one writes C-tran
- C++ compilers completely give up
16. Programming Challenge
- 80% of ORNL major science applications are written in Fortran
- University students are being taught about new architectures and C, C++ and Java
- No classes are teaching how to write Fortran and C to take advantage of cache and utilize SSE instructions through the language
17. We must have more Fortran Programmers
18. Why Fortran?
- Legacy codes are mostly written in Fortran
- Compiler writers tend to develop better Fortran optimizations because of the existing code base
- 83% of ORNL's major codes are Fortran
- Fortran allows users to relay more information about memory access to the compiler
- Compilers can generate better optimized code from Fortran than from C, and C++ code is just awful
- Scientific programmers tend to use Fortran to get the most out of the system
- Even large C++ frameworks use Fortran computational kernels
19. What about new Languages?
- Famous Question
- What languages are going to be used in the year 2000?
- Famous Answer
- Don't know what it will be called; however, it will look a lot like Fortran
20. Seriously
- HPF (High Performance Fortran) was a complete failure. A language was developed that was difficult to compile efficiently. Since early use was unsuccessful, programmers quit using the new language before the compilers got better.
- ARPA HPCC: three new language proposals. Will they suffer from the HPF syndrome?
21. The Hybrid Programming Model
- OpenMP on the socket
- Master/Slave model
- MPI or CAF or UPC across the system
- Single Program, Multiple Data (SPMD)
- Few Multiple Instruction, Multiple Data (MIMD)
Co-Array Fortran and UPC greatly simplify this into a single programming model
22. Shared Memory Programming
- OpenMP
- Directives for Fortran and pragmas for C
- Co-Arrays
- User specifies a processor
- A(I,J)[nproc] = B(I,J)[nproc+1] + C(I,J)
If nproc or nproc+1 is on the socket, this is a store into memory; if off processor, it is a remote memory store. C always comes from local memory.
23. How to create a new Language
- Extend an old one
- Co-Array Fortran
- Extension of Fortran
- UPC
- Extension of C
- This way the compiler writers only have to
address the extension when generating efficient
code.
24. We must start teaching Co-Array Fortran and UPC
25. The Programming Challenge
- Scaling to 131,072 processors
- MPI is a more coarse-grained form of messaging, requiring hand-holding between communicating processors
- User is protected to some degree
- Co-Array Fortran and UPC are Fortran and C extensions that facilitate low-latency gets and puts into remote memory. These two languages are commonly known as Global Address Space languages, where the user can address all of the memory of the MPP
- User must be cognizant of synchronization between processors
26. Conclusions
- Scientific programmers must start learning
- how to utilize 100,000s of processors
- how to utilize 4-8 cores per socket
- Fortran is the best language to use for
- controlling cache usage
- utilizing the SSE2 instructions required to obtain > 1 result per clock cycle
- working with the compiler to get the most out of the core
- GAS languages such as Co-Arrays and UPC facilitate efficient utilization of 100,000s of processors