1
Programming for High Performance Computers
  • John M. Levesque
  • Director, Cray's Supercomputing Center of Excellence

2
Outline
  • Building a Petascale Computer
  • Challenges for utilizing a Petascale System
  • Utilizing the Core
  • Utilizing the Socket
  • Scaling to 100,000 cores
  • How one programs for the Petascale System
  • Conclusion

3
Petascale Computer
  • First we need to define what we mean by a
    Petascale computer
  • Google already has a Petaflop on their floor
  • But only for an embarrassingly parallel application
  • My definition
  • A Petascale computer is a computer system that
    delivers a sustained Petaflop to several real
    science applications

4
A Petascale Computer Requires
  • A state-of-the-art Commodity Micro-processor
  • An ultra-fast proprietary Interconnect
  • A sophisticated LWK Operating System to stay out
    of the way of application scaling
  • Efficient messaging between processors
  • MPI may not be efficient enough!!

5
Potential Petascale Computer
  • 32,768 sockets
  • More dense circuitry results in more processors
    (cores) on the chip (socket)
  • Each core produces 4 results per clock cycle
  • Each socket contains 4 cores sharing memory
  • We expect that by the end of 2009, micro-processor
    technology will supply 3 GHz sockets, each
    capable of delivering 16 floating point
    operations per clock cycle.

32,768 sockets × 16 flops/cycle × 3 GHz = 1,572,864 GFLOPS ≈ 1.572 PFLOPS
6
Petascale Challenge for Interconnect
  • Connect 32,768 Sockets together with an
    interconnect that has 2-3 microseconds latency
    across the entire system
  • Supply a cross-section bandwidth to facilitate
    ALLTOALL communication across the entire system

7
Petascale Challenge for Programming
  • Use as 131,072 Uni-processors or 32,768 4-way
    Shared Memory sockets
  • MPI across all the processors
  • Hard on Socket Memory bandwidth and injection
    bandwidth into the network
  • MPI between sockets and OpenMP across socket
  • Hybrid programming is difficult

8
Petascale Challenge for Software
  • OS must be able to supply required facilities and
    not be over-loaded with daemons that steal CPU
    cycles and get cores out of sync
  • The notion of a Light Weight Kernel (LWK) that
    only has what is needed to run the app
  • No keyboard daemon, no kernel threads, no sockets,
    ...

Two systems are using this very successfully
today: Cray's XT4 and IBM's Blue Gene.
9
The Programming Challenge
  • We start with 1.5 Petaflops and want to sustain >
    1 Petaflop
  • Must achieve 67% of peak across the entire system
  • Inhibitors
  • On-socket memory bandwidth
  • Scaling across 131,072 processors, or
  • Utilizing OpenMP on the socket and messaging
    across the system

10
The Programming Challenge
  • Inhibitors
  • On-socket memory bandwidth
  • Today we see between 5% and 80% of peak sustained
    on the core. This single-core sustained
    performance is the maximum we will achieve.
  • Scaling across 131,072 processors, or
  • Today few applications scale as high as 5,000
    processors
  • Utilizing OpenMP on the socket and messaging
    across the system
  • OpenMP must be used on a very high percentage of
    the application or else Amdahl's Law applies (see
    the note below) and the peak of the socket may be
    degraded
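
For reference, Amdahl's Law: if a fraction p of the
work runs in parallel on n cores, speedup = 1 / ((1 - p) + p / n).
Even with p = 0.9, a 4-core socket delivers only about
3.1x of its 4x potential.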

11
Programming for the Core
  • Each core produces 4 floating point
    results/clock cycle, but the memory can only
    supply 16 bytes/clock cycle
  • Best case: contiguous, aligned on 16-byte boundaries
  • 32-bit arithmetic: 4 words/cycle
  • 64-bit arithmetic: 2 words/cycle
  • Worst case
  • One word every 2-4 cycles

12
Consider a Triad Kernel
  • A(i) = B(i) + scalar * C(i)

Need 2 loads and 1 store to produce 1 result.
How can we produce 4 results each clock cycle
when each result requires fetching 16 bytes and
storing 8 bytes?
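
A minimal sketch of this kernel in Fortran (names and
size are illustrative):

  subroutine triad(a, b, c, scalar, n)
    integer, intent(in)  :: n
    real(8), intent(in)  :: b(n), c(n)
    real(8), intent(in)  :: scalar
    real(8), intent(out) :: a(n)
    integer :: i
    ! Per iteration: 2 loads (b, c) and 1 store (a) for 2 flops.
    ! In 64-bit arithmetic that is 16 bytes fetched and 8 bytes
    ! stored per result, exactly the imbalance described above.
    do i = 1, n
      a(i) = b(i) + scalar * c(i)
    end do
  end subroutine triad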
13
Programming for the Core
  • Each core produces 4 floating point
    results/clock cycle, but the memory can only
    supply 16 bytes/clock cycle
  • Best case: contiguous, aligned on 16-byte boundaries
  • 32-bit arithmetic: 4 words/cycle
  • 64-bit arithmetic: 2 words/cycle
  • Worst case
  • One word every 2-4 cycles

IMBALANCE
14
CACHE to the rescue?
  • To solve the processor/memory mismatch
  • Caches are introduced to facilitate the re-use of
    data
  • 2-3 levels of cache: L1, L2, L3
  • L1 and L2 are dedicated to a core
  • L3 is typically shared across the cores
  • To improve performance, users must understand how
    to take advantage of cache
  • Users can improve cache utilization by blocking
    their algorithms so the working set fits in
    cache (see the sketch below)
  • Efficient libraries tend to be cache-friendly
  • ZGEMM achieves 80-90% of peak performance
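
A minimal sketch of cache blocking for a matrix
multiply in Fortran (nb is an illustrative tile size,
tuned so the working set fits in cache):

  subroutine blocked_mm(a, b, c, n)
    integer, intent(in)    :: n
    real(8), intent(inout) :: a(n,n)
    real(8), intent(in)    :: b(n,n), c(n,n)
    integer, parameter :: nb = 64   ! illustrative tile size
    integer :: i, j, k, jj, kk
    ! Tiles of b and c stay resident in cache and are
    ! reused many times before being evicted.
    do jj = 1, n, nb
      do kk = 1, n, nb
        do j = jj, min(jj + nb - 1, n)
          do k = kk, min(kk + nb - 1, n)
            do i = 1, n
              a(i,j) = a(i,j) + b(i,k) * c(k,j)
            end do
          end do
        end do
      end do
    end do
  end subroutine blocked_mm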

15
Programming Challenge
  • Minimize loads/stores and maximize floating point
    operations
  • Fortran compilers have been and are extremely
    good at optimizing Fortran code
  • C compilers are hindered by the use of pointers,
    which confuse the compiler's data dependency
    analysis, unless one writes C-tran.
  • C++ compilers completely give up

16
Programming Challenge
  • 80% of ORNL's major science applications are
    written in Fortran
  • University students are being taught about new
    architectures and C, C++, and Java
  • No classes are teaching how to write Fortran and
    C to take advantage of cache and utilize SSE
    instructions through the language

17
We must have more Fortran Programmers
18
Why Fortran?
  • Legacy codes are mostly written in Fortran
  • Compiler writers tend to develop better Fortran
    optimizations because of the existing code base
  • 83% of ORNL's major codes are Fortran
  • Fortran allows users to convey more information
    about memory access to the compiler (see the
    sketch after this list)
  • Compilers can generate better optimized code from
    Fortran than from C, and C++ code is just awful
  • Scientific programmers tend to use Fortran to get
    the most out of the system
  • Even large C++ frameworks use Fortran
    computational kernels
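
One concrete example of the information Fortran
conveys: dummy arguments that are modified may not
alias each other, so the compiler can vectorize
without the pointer analysis C requires (a sketch;
names are illustrative):

  subroutine axpy(y, x, a, n)
    integer, intent(in)    :: n
    real(8), intent(in)    :: x(n), a
    real(8), intent(inout) :: y(n)
    ! The Fortran standard forbids x and y from overlapping here,
    ! so the compiler can issue SSE vector loads/stores directly.
    ! The equivalent C loop needs 'restrict' pointers (or C-tran)
    ! before the compiler can prove the same thing.
    y = y + a * x
  end subroutine axpy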

19
What about new Languages?
  • Famous question
  • What languages are going to be used in the year
    2000?
  • Famous answer
  • Don't know what it will be called; however, it
    will look a lot like Fortran

20
Seriously
  • HPF (High Performance Fortran) was a complete
    failure. A language was developed that was
    difficult to compile efficiently. Since early use
    was unsuccessful, programmers quit using the new
    language before the compilers got better
  • DARPA HPCS: three new language proposals. Will
    they suffer from the HPF syndrome?

21
The Hybrid Programming Model
  • OpenMP on the socket
  • Master/slave model
  • MPI or CAF or UPC across the system
  • Single Program, Multiple Data (SPMD)
  • Few Multiple Instruction, Multiple Data (MIMD)

Co-array Fortran and UPC greatly simplify this
into a single programming model (a sketch of the
hybrid model follows below).
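
A minimal hybrid sketch in Fortran, assuming one MPI
rank per socket with OpenMP threads across its cores
(the loop body is a stand-in for real work):

  program hybrid
    use mpi
    implicit none
    integer :: ierr, provided, rank, i
    real(8) :: local_sum
    ! Funneled threading: only the master thread makes MPI calls.
    call MPI_Init_thread(MPI_THREAD_FUNNELED, provided, ierr)
    call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
    local_sum = 0.0d0
    ! OpenMP spreads the loop across the cores of the socket.
    !$omp parallel do reduction(+:local_sum)
    do i = 1, 1000000
      local_sum = local_sum + 1.0d0 / dble(i + rank)
    end do
    ! MPI combines the per-socket results across the system.
    call MPI_Allreduce(MPI_IN_PLACE, local_sum, 1, &
                       MPI_DOUBLE_PRECISION, MPI_SUM, &
                       MPI_COMM_WORLD, ierr)
    call MPI_Finalize(ierr)
  end program hybrid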
22
Shared Memory Programming
  • OpenMP
  • Directives for Fortran and Pragmas for C
  • Co-Arrays
  • User specifies a processor
  • A(I,J)[nproc] = B(I,J)[nproc1] + C(I,J)

If nproc or nproc1 refers to the local socket, this
is an ordinary memory access; if it refers to a
remote processor, it is a remote memory store (or
load). C always comes from local memory.
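
A minimal Co-array Fortran sketch of the statement
above (co-array syntax as later standardized in
Fortran 2008; image numbers and sizes are
illustrative):

  program caf_demo
    implicit none
    integer, parameter :: n = 100
    real(8) :: a(n,n)[*], b(n,n)[*]  ! co-arrays: one copy per image
    real(8) :: c(n,n)                ! no co-dimension: always local
    integer :: nproc, nproc1
    b = 1.0d0
    c = 2.0d0
    nproc  = 1              ! illustrative target images
    nproc1 = num_images()
    sync all   ! ensure every image has initialized b
    ! Remote put to image nproc, remote get from image nproc1,
    ! local read of c:
    a(1,1)[nproc] = b(1,1)[nproc1] + c(1,1)
    sync all   ! order the remote store before anyone reads a
  end program caf_demo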
23
How to create a new Language
  • Extend an old one
  • Co-Array Fortran
  • Extension of Fortran
  • UPC
  • Extension of C
  • This way the compiler writers only have to
    address the extension when generating efficient
    code.

24
We must start teaching Co-array Fortran and UPC
25
The Programming Challenge
  • Scaling to 131,072 processors
  • MPI is a more coarse-grained messaging model,
    requiring hand-holding between communicating
    processors
  • The user is protected to some degree
  • Co-Array Fortran and UPC are Fortran and C
    extensions that facilitate low-latency gets and
    puts into remote memory. These two languages
    are commonly known as Global Address Space
    languages, where the user can address all of the
    memory of the MPP
  • The user must be cognizant of synchronization
    between processors

26
Conclusions
  • Scientific Programmers must start learning
  • how to utilize 100,000s of processors
  • how to utilize 4-8 cores per socket
  • Fortran is the best language to use for
  • controlling cache usage
  • utilizing SSE2 instructions required to obtain
    >1 result per clock cycle
  • working with the compiler to get the most out of
    the core
  • GAS languages such as Co-Arrays and UPC
    facilitate efficient utilization of 100,000s of
    processors