1
High Performance Parallel Programming
  • Dirk van der Knijff
  • Advanced Research Computing
  • Information Division

2
  • Lecture 2: History of Supercomputing

3
  • prehistory
  • for the first 10 years, all computers were
    supercomputers.
  • ENIAC
  • CSIRAC
  • UNIVAC
  • valves and relays, acoustic delay lines, paper
    tape, punch cards, machine language coding, etc.

4
  • late 50s till late 60s
  • differentiation
  • scientific computers
  • business computers
  • IBM 7030 (Stretch)
  • CDC 6600
  • ILLIAC IV
  • False starts: TI-ASC, CDC STAR-100, IBM 360/95

5
  • CDC 6600
  • first production scientific computer (Stretch was
    a one-off)
  • 60-bit words (most others 36-bit)
  • lots of registers (most others had only one)
  • 10 functional units (used a scoreboard)
  • up to 15 peripheral processors (time-sliced CPUs!)
  • push-down operand stack (cache?)
  • 3 MFlops in 1964

6
  • the 70s - the Supercomputer arrives
  • Cray 1
  • 10 times faster than any other computer available
  • 100 times more expensive
  • 1000 times more style
  • Some others
  • Cyber 205, ETA 10, attached vector processors on
    IBMs
  • Successors
  • Cray X-MP (2-way multiprocessing)
  • Cray 2 (4-way multiprocessing - 1 GFLOPS in 1985)

7
  • Characteristics of a traditional supercomputer
  • High Memory Bandwidth
  • Vector Processor
  • Advanced Technology
  • Balanced Systems - All parts are high-performance
  • Memory Bandwidth (STREAM - kernel sketched after
    this list)
  • CPU (FLOPS)
  • I/O rates
  • Memory Size
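The STREAM numbers on the next slide come from small, memory-bound kernels rather than dense floating point work. A minimal sketch of the TRIAD kernel in C, assumed to follow the shape of the published benchmark (the real code also times the loops and verifies the results):

/* Sketch of a STREAM-style TRIAD kernel (illustrative only).
 * SCALE and ADD are the same idea with one operation each:
 * x[i] = s * y[i]  and  x[i] = y[i] + z[i]. */
#include <stddef.h>

#define N (1 << 24)                 /* illustrative array size */
static double a[N], b[N], c[N];

void triad(double s)
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];     /* 2 flops, 24 bytes of traffic per element */
}

Because each element is touched once and never reused, the loop's rate is set by memory bandwidth rather than by the CPU's peak FLOPS.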

8
  • STREAM results (MFLOPS)
Machine ID                  ncpus    SCALE      ADD    TRIAD
Cray_T932_321024-3E            16   9680.0   8055.6  16213.5
Cray_T932_321024-3E             4   2438.6   2091.8   4226.5
Cray_T932_321024-3E             1    638.8    542.2   1140.2
Cray_T3E-1200                 256   7691.5   6028.3  11881.7
Cray_T3E-1200                  64   1922.8   1507.1   2970.5
Cray_T3E-1200                   8    240.6    188.5    371.5
Cray_T3E-1200                   4    119.9     94.2    185.8
Cray_T3E-1200                   2     60.1     47.1     92.9
Cray_T3E-1200                   1     30.0     23.6     46.5
IBM_SP-PWR3smp-High-222         8    176.3    162.0    322.7
Compaq_AlphaServer_GS160       16    603.0    439.6    851.1
Compaq_AlphaServer_GS160        4    150.8    110.2    213.7
Compaq_AlphaServer_GS160        1     59.9     41.8     83.5
DEC_8400_5-300                  8     52.1     37.2     81.6
DEC_8400_5-300                  1     10.9      8.3     16.3

9
  • Early supercomputers were technology leaders
  • Transistors
  • Multiprocessing
  • Cryogenic cooling
  • GaAs, SOS, Josephson junctions, etc.
  • Improving the technology yourself is no longer
    cost-effective; the rate of improvement in
    mainstream performance will always beat an
    individual design

10
  • Vector Processing
  • Pipelined Processor
  • Usually Multiple Pipes
  • Multiple Load and Store Paths
  • Operands pre-fetched
  • Special machine instructions

11

12

[Figure: block diagram of a vector unit - Add/Shift, Multiply, Divide and Logical pipelines fed from vector registers, with element 8n+k handled by pipe k (k = 0..7, n = 0..31, vector length 256)]
13

SX-4 Vector Performance
  • 1.8 GFLOPS with a 114 MHz clock (8.8 ns).
  • Each clock period the vector floating point add
    and multiply units can each produce 8 results for
    a total of 16 results every 8.8 ns.
  • (16 results / cycle) × (1 cycle / 8.8 ns) ≈ 1.8 GFLOPS
  • Note that this does not include the vector
    floating point divide and the superscalar unit
    which can execute concurrently with the vector
    unit.
  • The key is that reaching peak performance requires
    only one floating point add and one floating point
    multiply (issued every cycle).

14
  • Vectorization
  • Most work is done by compilers
  • Loops usual targets for vectorization
  • Can add directives to advise the compiler (see
    the sketch after this list)
  • Compiler must be able to determine a simple
    expression to load all vector elements
  • Can include a single level of indirection
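As a concrete illustration (not from the original slides), the loop below is the kind of code a vectorizing compiler handles well. The directive spelling is compiler-specific (GCC's is shown; Cray and NEC compilers use their own ivdep-style directives), so treat the whole function as an assumed example:

/* Hypothetical vectorizable loop. The straight-line indexed accesses let the
 * compiler generate vector loads/stores; x[idx[i]] shows the single level of
 * indirection (a gather) mentioned above. */
void axpy_gather(int n, double a, const double *x, const int *idx, double *y)
{
#pragma GCC ivdep        /* advise the compiler that iterations are independent */
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[idx[i]];
}

Without the directive, the compiler may refuse to vectorize if it cannot prove that idx[] introduces no dependence between iterations.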

15

Amdahl's Law
In any system having two or more processing
modes of differing speeds, the performance of the
system will be dominated by the slowest mode.
16

Amdahl's Law
  • For vector processors
  • Ts = the time required to perform an operation
    in scalar mode
  • Tv = the time required to perform an operation
    in vector mode
  • Fs = the fraction of operations performed in
    scalar mode
  • Fv = the fraction of operations performed in
    vector mode
  • then the time T to perform N operations is
  • T = N (Fs Ts + Fv Tv)

17
  • Given that Fs + Fv = 1, then
  • T = N [(1 - Fv) Ts + Fv Tv]
  • Normalizing to Ts = 1 and defining the vector
    speedup VS = Ts / Tv
  • Then
  • T = N [(1 - Fv) + Fv / VS]
      = N [1 - Fv (VS - 1) / VS]

18
  • Now let performance be defined as the number of
    operations performed per unit time
  • P = N / T
      = 1 / [1 - Fv (VS - 1) / VS]
    (a worked sketch of this formula follows below)
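A worked check of this formula (the Fv and VS values are assumed for illustration, not taken from the slides) shows how sharply performance depends on the vector fraction Fv:

#include <stdio.h>

/* Normalised performance from the formula above:
 * P = 1 / (1 - Fv * (VS - 1) / VS).  Inputs below are illustrative. */
static double perf(double Fv, double VS)
{
    return 1.0 / (1.0 - Fv * (VS - 1.0) / VS);
}

int main(void)
{
    printf("Fv=0.70 VS=10: P = %.2f\n", perf(0.70, 10.0));  /* about 2.7 */
    printf("Fv=0.90 VS=10: P = %.2f\n", perf(0.90, 10.0));  /* about 5.3 */
    printf("Fv=0.99 VS=10: P = %.2f\n", perf(0.99, 10.0));  /* about 9.2 */
    return 0;
}

Even with a vector unit ten times faster than scalar mode, 70% vectorization yields less than a 3x overall gain.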

19


20
  • Amdahl's Law for Parallel Processors
  • Amdahl's Law assumes that the time to run a
    parallel program on N processors depends on the
    fraction α of the program that is inherently
    serial, and the fraction (1 - α) that is
    inherently parallel.
  • That is, TN = α T1 + T1 (1 - α) / N
    and SA = T1 / TN = N / (α N + (1 - α))

21
  • Scalability
  • Suppose that we define any algorithm that can be
    run N times faster when run on N processors as a
    scalable algorithm. Then S = O(N).
  • According to Amdahl's Law SA = O(1) (it approaches
    1/α as N grows), so it is not scalable...

22
  • Gustafson-Barsis
  • Uses the same ratio for speedup but different
    assumptions about T1 and TN
  • Suppose we normalize TN to 1; then
    T1 = α + (1 - α) N, and by substitution
    SGB = N - (N - 1) α
  • This is O(N), which is scalable.

23
  • Scalability
  • Suppose α = 0.5; then SA = 2N / (N + 1), i.e. as N
    increases the speedup goes asymptotically to 2,
    but SGB = (N + 1) / 2, i.e. speedup is proportional
    to N (both are tabulated in the sketch after this
    list)
  • Why?
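A minimal sketch (illustrative, not from the slides) that tabulates both formulas for α = 0.5 makes the contrast concrete:

#include <stdio.h>

/* Amdahl speedup: S_A = N / (alpha*N + 1 - alpha) */
static double amdahl(double alpha, double N)
{
    return N / (alpha * N + (1.0 - alpha));
}

/* Gustafson-Barsis speedup: S_GB = N - (N - 1) * alpha */
static double gustafson(double alpha, double N)
{
    return N - (N - 1.0) * alpha;
}

int main(void)
{
    const double alpha = 0.5;                /* the value used on this slide */
    for (int N = 1; N <= 1024; N *= 4)
        printf("N=%4d  S_A=%5.2f  S_GB=%7.2f\n",
               N, amdahl(alpha, N), gustafson(alpha, N));
    return 0;                                /* S_A flattens near 2; S_GB grows as (N+1)/2 */
}

The loop quadruples N each step so the asymptotic behaviour is visible within a few lines of output.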

24
  • late 80s and early 90s - Mini-supers
  • Achieved High Peak performance by large scale
    parallelism of cheap components
  • First use of Commodity Components
  • Many Different Interconnect Designs
  • Only successful in limited areas
  • Not balanced!
  • Cheap (and Nasty?)

25
  • Examples
  • CM (Connection Machine)
  • Tree structured
  • CM-1: 65,536 1-bit processors
  • Intel iPSC
  • Hypercube
  • Initially i386 later i860 processors
  • MasPar MP-1
  • 2D Mesh
  • 16384 4-bit processors

26
  • Most successful
  • Intel Paragon - many produced (at a loss?)
  • 2D mesh architecture
  • ASCI Red is last example
  • Cray T3E
  • most successful massively parallel machine ever
  • 3D torus interconnection
  • uses Alpha chips

27
  • COTS Systems
  • Commodity Off The Shelf Systems
  • Use Commodity Processors and Commodity
    Interconnects
  • May be physically distributed as in a Workstation
    Farm or packaged as a single system like IBM SP-2
  • Very good price/peak-performance ratio
  • Build-your-own supercomputers
  • Beowulf systems.

28

29

30

31

32

33

34

35
  • Next week - Architectures.

36