1
High Performance Parallel Programming
  • Dirk van der Knijff
  • Advanced Research Computing
  • Information Division

2
  • Lecture 2: History of Supercomputing

3
  • prehistory
  • for the first 10 years, all computers were
    supercomputers.
  • ENIAC
  • CSIRAC
  • UNIVAC
  • valves and relays, acoustic delay lines, paper
    tape, punch cards, machine language coding, etc.

4
  • late 50s till late 60s
  • differentiation
  • scientific computers
  • business computers
  • IBM 7030 (Stretch)
  • CDC 6600
  • ILLIAC IV
  • False starts: TI-ASC, CDC STAR-100, IBM 360/95

5
  • CDC 6600
  • first production scientific computer (Stretch was
    a one-off)
  • 60-bit words (most others 36-bit)
  • lots of registers (most others had only one)
  • 10 functional units (used a scoreboard)
  • up to 15 peripheral processors (time-sliced CPUs!)
  • push-down operand stack (cache?)
  • 3 MFlops in 1964

6
  • the 70s - the Supercomputer arrives
  • Cray 1
  • 10 times faster than any other computer available
  • 100 times more expensive
  • 1000 times more style
  • Some others
  • Cyber 205, ETA 10, attached vector processors on
    IBMs
  • Successors
  • Cray X-MP (2-way multiprocessing)
  • Cray 2 (4-way multiprocessing - 1 GFLOPS in 1985)

7
  • Characteristics of a traditional supercomputer
  • High Memory Bandwidth
  • Vector Processor
  • Advanced Technology
  • Balanced Systems - All parts are high-performance
  • Memory Bandwidth (STREAM - kernel sketched after
    this list)
  • CPU (FLOPS)
  • I/O rates
  • Memory Size
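The STREAM numbers on the next slide come from small, memory-bound kernels rather than dense floating point work. A minimal sketch of the TRIAD kernel in C, assumed to follow the shape of the published benchmark (the real code also times the loops and verifies the results):

/* Sketch of a STREAM-style TRIAD kernel (illustrative only).
 * SCALE and ADD are the same idea with one operation each:
 * x[i] = s * y[i]  and  x[i] = y[i] + z[i]. */
#include <stddef.h>

#define N (1 << 24)                 /* illustrative array size */
static double a[N], b[N], c[N];

void triad(double s)
{
    for (size_t i = 0; i < N; i++)
        a[i] = b[i] + s * c[i];     /* 2 flops, 24 bytes of traffic per element */
}

Because each element is touched once and never reused, the loop's rate is set by memory bandwidth rather than by the CPU's peak FLOPS.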

8
  • STREAM results (MFLOPS)
Machine ID                  ncpus    SCALE      ADD    TRIAD
Cray_T932_321024-3E            16   9680.0   8055.6  16213.5
Cray_T932_321024-3E             4   2438.6   2091.8   4226.5
Cray_T932_321024-3E             1    638.8    542.2   1140.2
Cray_T3E-1200                 256   7691.5   6028.3  11881.7
Cray_T3E-1200                  64   1922.8   1507.1   2970.5
Cray_T3E-1200                   8    240.6    188.5    371.5
Cray_T3E-1200                   4    119.9     94.2    185.8
Cray_T3E-1200                   2     60.1     47.1     92.9
Cray_T3E-1200                   1     30.0     23.6     46.5
IBM_SP-PWR3smp-High-222         8    176.3    162.0    322.7
Compaq_AlphaServer_GS160       16    603.0    439.6    851.1
Compaq_AlphaServer_GS160        4    150.8    110.2    213.7
Compaq_AlphaServer_GS160        1     59.9     41.8     83.5
DEC_8400_5-300                  8     52.1     37.2     81.6
DEC_8400_5-300                  1     10.9      8.3     16.3

9
  • Early supercomputers were technology leaders
  • Transistors
  • Multiprocessing
  • Cryogenic cooling
  • GaAs, SOS, Josephson junctions, etc.
  • Improving the technology yourself is no longer
    cost-effective; the rate of improvement in
    mainstream performance will always beat an
    individual design

10
  • Vector Processing
  • Pipelined Processor
  • Usually Multiple Pipes
  • Multiple Load and Store Paths
  • Operands pre-fetched
  • Special machine instructions

11

12

[Figure: block diagram of a vector unit - Add/Shift, Multiply, Divide and Logical pipelines fed from vector registers, with element 8n+k handled by pipe k (k = 0..7, n = 0..31, vector length 256)]
13

SX-4 Vector Performance
  • 1.8 GFLOPS with a 114 MHz clock (8.8 ns).
  • Each clock period the vector floating point add
    and multiply units can each produce 8 results for
    a total of 16 results every 8.8 ns.
  • (16 results / cycle) × (1 cycle / 8.8 ns) ≈ 1.8 GFLOPS
  • Note that this does not include the vector
    floating point divide and the superscalar unit
    which can execute concurrently with the vector
    unit.
  • The key is that reaching peak performance requires
    only one floating point add and one floating point
    multiply (issued every cycle).

14
  • Vectorization
  • Most work is done by compilers
  • Loops usual targets for vectorization
  • Can add directives to advise the compiler (see
    the sketch after this list)
  • Compiler must be able to determine a simple
    expression to load all vector elements
  • Can include a single level of indirection
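As a concrete illustration (not from the original slides), the loop below is the kind of code a vectorizing compiler handles well. The directive spelling is compiler-specific (GCC's is shown; Cray and NEC compilers use their own ivdep-style directives), so treat the whole function as an assumed example:

/* Hypothetical vectorizable loop. The straight-line indexed accesses let the
 * compiler generate vector loads/stores; x[idx[i]] shows the single level of
 * indirection (a gather) mentioned above. */
void axpy_gather(int n, double a, const double *x, const int *idx, double *y)
{
#pragma GCC ivdep        /* advise the compiler that iterations are independent */
    for (int i = 0; i < n; i++)
        y[i] = y[i] + a * x[idx[i]];
}

Without the directive, the compiler may refuse to vectorize if it cannot prove that idx[] introduces no dependence between iterations.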

15

Amdahl's Law
In any system having two or more processing
modes of differing speeds, the performance of the
system will be dominated by the slowest mode.
16

Amdahl's Law
  • For vector processors
  • Ts = the time required to perform an operation
    in scalar mode
  • Tv = the time required to perform an operation
    in vector mode
  • Fs = the fraction of operations performed in
    scalar mode
  • Fv = the fraction of operations performed in
    vector mode
  • then the time T to perform N operations is
  • T = N (Fs Ts + Fv Tv)

17
  • Given that Fs + Fv = 1, then
  • T = N [(1 - Fv) Ts + Fv Tv]
  • Normalizing to Ts = 1 and defining the vector
    speedup VS = Ts / Tv
  • Then
  • T = N [(1 - Fv) + Fv / VS]
      = N [1 - Fv (VS - 1) / VS]

18
  • Now let performance be defined as the number of
    operations performed per unit time
  • P = N / T
      = 1 / [1 - Fv (VS - 1) / VS]
    (a worked sketch of this formula follows below)
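A worked check of this formula (the Fv and VS values are assumed for illustration, not taken from the slides) shows how sharply performance depends on the vector fraction Fv:

#include <stdio.h>

/* Normalised performance from the formula above:
 * P = 1 / (1 - Fv * (VS - 1) / VS).  Inputs below are illustrative. */
static double perf(double Fv, double VS)
{
    return 1.0 / (1.0 - Fv * (VS - 1.0) / VS);
}

int main(void)
{
    printf("Fv=0.70 VS=10: P = %.2f\n", perf(0.70, 10.0));  /* about 2.7 */
    printf("Fv=0.90 VS=10: P = %.2f\n", perf(0.90, 10.0));  /* about 5.3 */
    printf("Fv=0.99 VS=10: P = %.2f\n", perf(0.99, 10.0));  /* about 9.2 */
    return 0;
}

Even with a vector unit ten times faster than scalar mode, 70% vectorization yields less than a 3x overall gain.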

19


20
  • Amdahl's Law for Parallel Processors
  • Amdahl's Law assumes that the time to run a
    parallel program on N processors depends on the
    fraction α of the program that is inherently
    serial, and the fraction (1 - α) that is
    inherently parallel.
  • That is, TN = α T1 + T1 (1 - α) / N
    and SA = T1 / TN = N / (α N + (1 - α))

21
  • Scalability
  • Suppose that we define any algorithm that can be
    run N times faster when run on N processors as a
    scalable algorithm. Then S = O(N).
  • According to Amdahl's Law SA = O(1) (it approaches
    1/α as N grows), so it is not scalable...

22
  • Gustafson-Barsis
  • Uses the same ratio for speedup but different
    assumptions about T1 and TN
  • Suppose we normalize TN to 1; then
    T1 = α + (1 - α) N, and by substitution
    SGB = N - (N - 1) α
  • This is O(N), which is scalable.

23
  • Scalability
  • Suppose α = 0.5; then SA = 2N / (N + 1), i.e. as N
    increases the speedup goes asymptotically to 2,
    but SGB = (N + 1) / 2, i.e. speedup is proportional
    to N (both are tabulated in the sketch after this
    list)
  • Why?
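A minimal sketch (illustrative, not from the slides) that tabulates both formulas for α = 0.5 makes the contrast concrete:

#include <stdio.h>

/* Amdahl speedup: S_A = N / (alpha*N + 1 - alpha) */
static double amdahl(double alpha, double N)
{
    return N / (alpha * N + (1.0 - alpha));
}

/* Gustafson-Barsis speedup: S_GB = N - (N - 1) * alpha */
static double gustafson(double alpha, double N)
{
    return N - (N - 1.0) * alpha;
}

int main(void)
{
    const double alpha = 0.5;                /* the value used on this slide */
    for (int N = 1; N <= 1024; N *= 4)
        printf("N=%4d  S_A=%5.2f  S_GB=%7.2f\n",
               N, amdahl(alpha, N), gustafson(alpha, N));
    return 0;                                /* S_A flattens near 2; S_GB grows as (N+1)/2 */
}

The loop quadruples N each step so the asymptotic behaviour is visible within a few lines of output.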

24
  • late 80s and early 90s - Mini-supers
  • Achieved High Peak performance by large scale
    parallelism of cheap components
  • First use of Commodity Components
  • Many Different Interconnect Designs
  • Only successful in limited areas
  • Not balanced!
  • Cheap (and Nasty?)

25
  • Examples
  • CM (Connection Machine)
  • Tree structured
  • CM-1: 65,536 1-bit processors
  • Intel iPSC
  • Hypercube
  • Initially i386 later i860 processors
  • MasPar MP-1
  • 2D Mesh
  • 16384 4-bit processors

26
  • Most successful
  • Intel Paragon - many produced (at a loss?)
  • 2D mesh architecture
  • ASCI Red is last example
  • Cray T3E
  • most successful massively parallel machine ever
  • 3D torus interconnection
  • uses Alpha chips

27
  • COTS Systems
  • Commodity Off The Shelf Systems
  • Use Commodity Processors and Commodity
    Interconnects
  • May be physically distributed as in a Workstation
    Farm or packaged as a single system like IBM SP-2
  • Very good price/peak-performance ratio
  • Build-your-own supercomputers
  • Beowulf systems.

28

29

30

31

32

33

34

35
  • Next week - Architectures.

36