Title: Parallel
1 High-Performance Grid Computing and Research Networking
Concurrent Computers
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
2 Acknowledgements
- The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova
- Principles of High Performance Computing
- http://navet.ics.hawaii.edu/casanova
- henric_at_hawaii.edu
- Kai Wang
- Department of Computer Science
- University of South Dakota
- http://www.usd.edu/Kai.Wang
- Andrew Tanenbaum
3 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- For example, On-Chip Parallelism
- Within a Box
- For example, Coprocessors and Multiprocessors
- Across Boxes
- For example, Multicomputers, Clusters, and Grids
4 Parallel Computer Architectures
- (a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
5 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- Within a Box
- Across Boxes
6 Concurrency within a CPU
- (Figure: a CPU (registers, ALUs, and hardware to decode instructions and do all types of useful things) with its caches, connected over busses to RAM, controllers and adapters, and I/O devices such as displays, keyboards, and networks.)
7 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- RISC architectures
- Pipelined functional units
- ILP
- Vector units
- On-Chip Multithreading
- Let's look at them briefly
8 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- RISC architectures
- Pipelined functional units
- ILP
- Vector units
- On-Chip Multithreading
- Let's look at them briefly
9 Pipelining
- If one has a sequence of tasks to do
- If each task consists of the same n steps or stages
- If different steps can be done simultaneously
- Then one can have a pipelined execution of the tasks
- e.g., as on an assembly line
- Goal: higher throughput (i.e., number of tasks per time unit)
- (Figure: pipelined execution timeline.) Time to do 1 task: 9; 2 tasks: 13; 3 tasks: 17; 4 tasks: 21; 10 tasks: 45; 100 tasks: 409. Pays off if there are many tasks (a timing model is sketched below).
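- A hedged sketch of the timing model behind these figures (not spelled out on the slide): if the pipeline needs a fill time of T_fill for the first task and its slowest stage takes t_slowest time units, then

```latex
T(n) \;\approx\; T_{\mathrm{fill}} + (n-1)\,t_{\mathrm{slowest}},
\qquad
\mathrm{throughput}(n) \;=\; \frac{n}{T(n)} \;\longrightarrow\; \frac{1}{t_{\mathrm{slowest}}}
\quad \text{as } n \to \infty .
```

- With T_fill = 9 and t_slowest = 4, this reproduces the 9, 13, 17, 21, and 45 figures above; the payoff only shows up when there are many tasks.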
10 Pipelining
- Each step goes only as fast as the slowest stage
- Therefore, the asymptotic throughput (i.e., the throughput when the number of tasks tends to infinity) is equal to 1 / (duration of the slowest stage)
- Therefore, in an ideal pipeline, all stages would be identical (balanced pipeline)
- Question: can we make computer instructions all consist of the same number of stages, where all stages take the same number of clock cycles?
11 RISC
- Having all instructions doable in the same number of stages of the same duration is the RISC idea
- Example: the MIPS architecture (see THE architecture book by Patterson and Hennessy)
- 5 stages
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Instruction Execute (EX)
- Memory access (MEM)
- Register Write Back (WB)
- Each stage takes one clock cycle
12 Pipelined Functional Units
- Although the RISC idea is attractive, some operations are just too expensive to be done in one clock cycle (during the EX stage)
- Common example: floating-point operations
- Solution: implement them as a sequence of stages, so that they can be pipelined
- (Figure: between ID and MEM/WB, the EX stage offers an integer unit, a 7-stage pipelined FP/integer multiply unit (M1-M7), and a 4-stage pipelined FP/integer add unit (A1-A4).)
13 Pipelining Today
- Pipelined functional units are common
- Fallacy: all computers today are RISC
- RISC was of course one of the most fundamental new ideas in computer architecture
- x86: the most commonly used Instruction Set Architecture today
- Kept around for backwards-compatibility reasons, because it's easy to implement (not to program for)
- BUT: modern x86 processors decode instructions into micro-ops, which are then executed in a RISC manner
- Bottom line: pipelining is a pervasive (and conveniently hidden) form of concurrency in computers today
- Take a computer architecture course to learn all about it
14 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- ILP
- Vector units
- On-Chip Multithreading
15 Instruction Level Parallelism
- Instruction Level Parallelism (ILP) is the set of techniques by which the performance of a pipelined processor can be pushed even further
- ILP can be done by the hardware
- Dynamic instruction scheduling
- Dynamic branch prediction
- Multi-issue superscalar processors
- ILP can be done by the compiler
- Static instruction scheduling
- Multi-issue VLIW (Very Long Instruction Word) processors
- with multiple functional units
- Broad concept: more than one instruction is issued per clock cycle
- e.g., an 8-way multi-issue processor
16 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- ILP
- Vector units
- On-Chip Multithreading
17 Vector Units
- A functional unit that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
- These are specified as operations on vector registers
- A vector processor comes with some number of such registers
- MMX extension on x86 architectures
- (Figure: two vector registers of "elts" elements added element-wise into a third; all "elts" adds happen in parallel.)
18 Vector Units
- Typically, a vector register holds 32-64 elements
- But the number of elements is always larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4 (a back-of-the-envelope model follows below)
- (Figure: the same vector add, but only "elts / pipes" additions happen in parallel per cycle.)
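- A hedged back-of-the-envelope model (the notation is mine, not the slide's): with e elements per vector register and p pipes/lanes, one vector instruction occupies the unit for roughly

```latex
t_{\text{vector}} \;\approx\; t_{\text{startup}} + \left\lceil \frac{e}{p} \right\rceil \, t_{\text{cycle}},
```

- so with, say, e = 64 and p = 4, a single instruction keeps the lanes busy for about 16 cycles after startup.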
19 MMX Extension
- Many techniques that are initially implemented in the supercomputer market find their way to the mainstream
- Vector units were pioneered in supercomputers
- Supercomputers are mostly used for scientific computing
- Scientific computing uses tons of arrays (to represent mathematical vectors) and often does regular computation with these arrays
- Therefore, scientific code is easy to vectorize, i.e., to generate assembly that uses the vector registers and the vector instructions
- Intel's MMX or PowerPC's AltiVec
- MMX vector registers
- eight 8-bit elements
- four 16-bit elements
- two 32-bit elements
- AltiVec: twice the lengths
- Used for multi-media applications
- image processing
- rendering
- ...
20 Vectorization Example
- Conversion from RGB to YUV
- Y = (9798*R + 19235*G + 3736*B) / 32768
- U = (-4784*R - 9437*G + 14221*B) / 32768 + 128
- V = (20218*R - 16941*G - 3277*B) / 32768 + 128
- This kind of code is perfectly parallel, as all pixels can be computed independently (see the scalar sketch below)
- Can be done easily with MMX vector capabilities
- Load 8 R values into an MMX vector register
- Load 8 G values into an MMX vector register
- Load 8 B values into an MMX vector register
- Do the *, +, and / in parallel
- Repeat
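- A minimal scalar C sketch of this conversion (the function name and signature are illustrative, not from the slide). Every pixel is independent, so a vectorizing compiler, or hand-written MMX/SSE code, can process 8 pixels per iteration exactly as outlined above:

```c
#include <stdint.h>

/* Clamp an integer to the 0..255 range of an 8-bit channel. */
static inline uint8_t clamp_u8(int32_t x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Fixed-point RGB -> YUV conversion with the coefficients from the slide.
 * Every iteration is independent, so the loop is trivially vectorizable. */
void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, uint8_t *u, uint8_t *v, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t R = r[i], G = g[i], B = b[i];
        y[i] = clamp_u8(( 9798 * R + 19235 * G +  3736 * B) / 32768);
        u[i] = clamp_u8((-4784 * R -  9437 * G + 14221 * B) / 32768 + 128);
        v[i] = clamp_u8((20218 * R - 16941 * G -  3277 * B) / 32768 + 128);
    }
}
```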
21 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- ILP
- Vector units
- On-Chip Multithreading
22 Multi-threaded Architectures
- Computer architecture is a difficult field to make innovations in
- Who's going to spend money to manufacture your new idea?
- Who's going to be convinced that a new compiler can/should be written?
- Who's going to be convinced of a new approach to computing?
- One of the cool innovations in the last decade has been the concept of a Multi-threaded Architecture
23 On-Chip Multithreading
- Multithreading has been around for years, so what's new about this?
- Here we're talking about hardware support for threads
- Simultaneous Multi-Threading (SMT)
- SuperThreading
- HyperThreading
- Let's try to understand what all of these mean before looking at multi-threaded supercomputers
24 Single-threaded Processor
- The processor provides the illusion of concurrent execution
- Front-end: fetching/decoding/reordering
- Execution core: actual execution
- Multiple programs in memory
- Only one executes at a time
- 4-issue CPU with bubbles
- 7-unit CPU with pipeline bubbles
- Time-slicing via context switching
25 Single-threaded SMP?
- Two threads execute at once, so threads spend less time waiting
- The number of bubbles is also doubled
- Twice as much speed and twice as much waste
26 Super-threading
- Principle: the processor can execute more than one thread at a time
- Also called time-slice multithreading
- The processor is then called a multithreaded processor
- Requires more hardware cleverness
- logic switches at each cycle
- Leads to less waste
- A thread can run during a cycle while another thread is waiting for memory
- Just a finer grain of interleaving
- But there is a restriction
- Each stage of the front end or the execution core only runs instructions from ONE thread!
- Does not help with poor instruction parallelism within one thread
- Does not reduce bubbles within a row
27 Hyper-threading
- Principle: the processor can execute more than one thread at a time, even within a single clock cycle!
- Requires even more hardware cleverness
- logic switches within each cycle
- On the diagram: only two threads execute simultaneously
- Intel's hyper-threading only adds 5% to the die area
- Some people argue that two is not "hyper"
- Finest level of interleaving
- From the OS perspective, there are two logical processors
28 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- Within a Box
- Across Boxes
29 Concurrency within a Box
- Two main techniques
- SMP
- Multi-core
- Let's look at both of them
30 SMPs
- Symmetric Multi-Processors (often mislabeled as Shared-Memory Processors, which has now become tolerated)
- Processors are all connected to a single memory
- Symmetric: each memory cell is equally close to all processors
- Many dual-proc and quad-proc systems
- e.g., for servers
31 Distributed caches
- The problem with distributed caches is that of memory consistency
- Intuitive memory model
- Reading an address should return the last value written to that address
- Easy to do in uniprocessors
- although there may be some I/O issues
- But difficult in multi-processor / multi-core systems
- Memory consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979] (illustrated by the sketch below)
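- A hedged illustration (not from the slides) of what sequential consistency rules out. Under sequential consistency, at least one of the two threads below must observe the other's write, so (r1, r2) = (0, 0) is impossible; on real hardware and compilers with weaker memory models, it can and does happen.

```c
#include <pthread.h>
#include <stdio.h>

/* Classic Dekker-style litmus test. Plain ints are used on purpose:
 * the code deliberately contains a data race to make the point. */
int x = 0, y = 0;
int r1, r2;

void *thread_a(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
void *thread_b(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* r1 == 0 && r2 == 0 betrays a memory model weaker than SC. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```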
32 Cache Coherency
- Memory consistency is jeopardized by having multiple caches
- P1 and P2 both have a cached copy of a data item
- P1 writes to it, possibly writing through to memory
- At this point P2 owns a stale copy
- When designing a multi-processor system, one must ensure that this cannot happen
- By defining protocols for cache coherence
33 Snoopy Cache-Coherence
- (Figure: processors P0 ... Pn with their caches sit on a shared memory bus together with the memory modules; each cache controller snoops the memory operations issued by the other processors.)
- The memory bus is a broadcast medium
- Caches contain information on which addresses they store
- The cache controller snoops all transactions on the bus
- A transaction is a relevant transaction if it involves a cache block currently contained in this cache
- Take action to ensure coherence
- invalidate, update, or supply the value
34 Limits of Snoopy Coherence
- Assume
- a 4 GHz processor
- => 16 GB/s instruction bandwidth per processor (32-bit instructions)
- => 9.6 GB/s data bandwidth at 30% load-store instructions on 8-byte elements
- Suppose a 98% instruction hit rate and a 90% data hit rate
- => 320 MB/s instruction bandwidth per processor
- => 960 MB/s data bandwidth per processor
- => 1.28 GB/s combined bandwidth
- Assuming 10 GB/s bus bandwidth
- 8 processors will saturate the bus (the arithmetic is recapped below)
- (Figure: each PROC demands 25.6 GB/s from its cache, but only 1.28 GB/s of misses reach the shared bus and MEM.)
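- A hedged recap of the arithmetic behind these numbers (all inputs are from the slide):

```latex
\begin{aligned}
\text{instruction demand} &= 4\,\text{GHz} \times 4\,\text{B} = 16\ \text{GB/s},\\
\text{data demand} &= 4\,\text{GHz} \times 0.30 \times 8\,\text{B} = 9.6\ \text{GB/s},\\
\text{bus traffic per processor} &= 16 \times (1-0.98) + 9.6 \times (1-0.90) \approx 1.28\ \text{GB/s},\\
\text{processors to saturate a 10 GB/s bus} &\approx 10 / 1.28 \approx 8 .
\end{aligned}
```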
35 Sample Machines
- Intel Pentium Pro Quad
- Coherent
- 4 processors
- Sun Enterprise server
- Coherent
- Up to 16 processor and/or memory-I/O cards
36 Directory-based Coherence
- Idea: implement a directory that keeps track of where each copy of a data item is stored
- The directory acts as a filter
- processors must ask permission before loading data from memory into their cache
- when an entry is changed, the directory either updates or invalidates the cached copies
- Eliminates the overhead of broadcasting/snooping, and thus bandwidth consumption
- But it is slower in terms of latency
- Used to scale up to numbers of processors that would saturate the memory bus
37 Example machine
- SGI Altix 3000
- A node contains up to 4 Itanium 2 processors and 32 GB of memory
- Uses a mixture of snoopy and directory-based coherence
- Up to 512 processors that are cache coherent (a global address space is possible for larger machines)
38 Sequential Consistency?
- A lot of hardware and technology to ensure cache coherence
- But the sequential consistency model may be broken anyway
- The compiler reorders/removes code
- Prefetch instructions cause reordering
- The network may reorder two write messages
- Basically, a bunch of things can happen
- Virtually all commercial systems give up on the idea of maintaining strong sequential consistency
- The programmer must program with weaker memory models than sequential consistency
- Done with some rules
- Avoid race conditions
- Use system-provided synchronization primitives
39 Weaker models
- The programmer must program with weaker memory models than sequential consistency
- Done with some rules
- Avoid race conditions
- Use system-provided synchronization primitives (a minimal sketch follows below)
- We will see how to program shared-memory machines
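- A hedged example of these rules in practice: a shared counter protected by a system-provided primitive (POSIX threads here; the choice of API is mine, not the slide's).

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Without the mutex, counter++ is a racy read-modify-write and the
 * final value would depend on timing and on the memory model. */
void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}
```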
40 Concurrency within a Box
- Two main techniques
- SMP
- Multi-core
41 Moore's Law!
- Many people interpret Moore's law as "computers get twice as fast every 18 months"
- which is not technically true
- it's all about microprocessor density
- But this interpretation is no longer true
- We should have 10 GHz processors right now
- And we don't!
42 No more Moore?
- We are used to getting faster CPUs all the time
- We are used to them keeping up with ever more demanding software
- Known as "Andy giveth, and Bill taketh away"
- Andy Grove
- Bill Gates
- It's a nice way to force people to buy computers often
- But basically, our computers get better, do more things, and it just happens automatically
- Some people call this the "performance free lunch"
- Conventional wisdom: "Not to worry, tomorrow's processors will have even more throughput, and anyway today's applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they're often I/O-bound, network-bound, database-bound)."
43 Commodity improvements
- There are three main ways in which commodity processors keep improving
- Higher clock rates
- More aggressive instruction reordering and more concurrent units
- Bigger/faster caches
- All applications can easily benefit from these improvements
- at the cost of perhaps a recompilation
- Unfortunately, the first two are hitting their limits
- Higher clock rates lead to high heat and power consumption
- No more instruction reordering without compromising correctness
44 Is Moore's law not true?
- Ironically, Moore's law is still true
- The density indeed still doubles
- But its common (wrong) interpretation is not
- Clock rates do not double anymore
- But we can't let this happen: computers have to get more powerful
- Therefore, the industry has thought of new ways to improve them: multi-core
- Multiple CPUs on a single chip
- Multi-core adds another level of concurrency
- But unlike, say, multiple functional units, it is hard for compilers to exploit it automatically
- Therefore, applications must be rewritten to benefit from the (nowadays expected) performance increase
- "Concurrency is the next major revolution in how we write software" (Dr. Dobb's Journal, 30(3), March 2005)
45 Multi-core processors
- In addition to putting concurrency in the public's eye, multi-core architectures will have a deep impact
- Languages will be forced to deal well with concurrency
- New language designs?
- New language extensions?
- New compilers?
- Efficiency and performance optimization will become more important: write code that is fast on one core with a limited clock rate
- The CPU may very well become a bottleneck (again) for single-core programs
- Other factors will improve, but not the clock rate
- Prediction: many companies will be hiring people to (re)write concurrent applications (a small example of such code follows below)
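- As a hedged glimpse of what such a rewrite can look like, here is a minimal OpenMP loop in C (OpenMP is one common option; the slides do not prescribe it):

```c
#include <stdio.h>
#include <omp.h>

/* Scale a vector using however many cores the runtime provides.
 * Compile with: gcc -fopenmp scale.c */
int main(void)
{
    enum { N = 1000000 };
    static double a[N];

    /* Iterations are independent, so they can be split across cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f, threads available = %d\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}
```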
46 Multi-Core
- Quote from PC World Magazine, Summer 2005:
- "Don't expect dual-core to be the top performer today for games and other demanding single-threaded applications. But that will change as applications are rewritten. For example, by year's end, Unreal Tournament should have released a new game engine that takes advantage of dual-core processing."
47 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- Within a Box
- Across Boxes
48 Multiple boxes together
- Example
- Take four boxes
- e.g., four Intel Itaniums bought from Dell
- Hook them up to a network
- e.g., a switch bought from Cisco, Myricom, etc.
- Install software that allows you to write/run applications that can utilize these four boxes concurrently (a minimal sketch follows below)
- This is a simple way to achieve concurrency across computer systems
- Everybody has heard of clusters by now
- They are basically like the above example and can be purchased already built from vendors
- We will talk about this kind of concurrent platform at length during this class
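- A hedged sketch of the kind of software layer mentioned above: a minimal MPI program in C (MPI is the de facto standard for such clusters, though the slide does not name a specific library).

```c
#include <stdio.h>
#include <mpi.h>

/* Each box (or core) runs one copy of this program; the MPI library
 * wires the copies together. Run with, e.g.: mpirun -np 4 ./hello */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```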
49 Multiple Boxes Together
- Why do we use multiple boxes?
- Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs
- The problem is that single boxes do not scale to meet the needs of many scientific applications
- Can't have enough processors or powerful enough cores
- Can't have enough memory
- But if you can live with a single box, do it!
- We will see that single-box programming is much easier than multi-box programming
50 Where does this leave us?
- So far we have seen many ways in which concurrency can be achieved/implemented in computer systems
- Within a box
- Across boxes
- So we could look at a system and just list all the ways in which it does concurrency
- It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems
- It would provide simple names that everybody can use and understand quickly
51 Taxonomy of parallel machines?
- It's not going to happen
- Just last year, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be
- Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multi-dimensional scheme!
- Both papers agree that most terms are conflated, misused, etc. (e.g., MPP)
- Complicated by the fact that concurrency appears at so many levels
- Example: a 16-node cluster, where each node is a 4-way multi-processor, where each processor is hyper-threaded, has vector units, and is fully pipelined with multiple, pipelined functional units
52 Taxonomy of platforms?
- We'll look at one traditional taxonomy
- We'll look at current categorizations from the Top500
- We'll look at examples of platforms
- We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture
53 The Flynn taxonomy
- Proposed in 1966!
- A functional taxonomy based on the notion of streams of information: data and instructions
- Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above
- Four possibilities
- SISD (sequential machine)
- SIMD
- MIMD
- MISD (rare, no commercial system... systolic arrays)
54 Taxonomy of Parallel Computers
- Flynn's taxonomy of parallel computers.
55 SIMD
- PEs can be deactivated and activated on-the-fly
- Vector processing (e.g., vector add) is easy to implement on SIMD
- Debate: is a vector processor a SIMD machine?
- often confused
- strictly not true according to the taxonomy (it's really SISD with pipelined operations)
- but it's convenient to think of the two as equivalent
56 MIMD
- Most general category
- Pretty much every supercomputer in existence today is a MIMD machine at some level
- This limits the usefulness of the taxonomy
- But you had to have heard of it at least once, because people keep referring to it, somehow...
- Other taxonomies have been proposed, none very satisfying
- Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway
57 Taxonomy of Parallel Computers
- A taxonomy of parallel computers.
58 A host of parallel machines
- There are (and have been) many kinds of parallel machines
- For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500 list
- It is a good source of information about what machines are (and were) and how they have evolved
- Note that it's really about supercomputers
- http://www.top500.org
59 LINPACK Benchmark?
- LINPACK: LINear algebra PACKage
- A FORTRAN library: matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc.
- LINPACK Benchmark
- Dense linear system solve with LU factorization
- (2/3) n^3 + O(n^2) floating-point operations
- Measures MFlop/s (see the small example below)
- The problem size n can be chosen
- You have to report the best performance for the best n, and the n that achieves half of the best performance.
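- A hedged illustration of how the reported rate follows from the operation count above (the helper function is mine, not part of the benchmark; the lower-order term is ignored):

```c
#include <stdio.h>

/* Rate achieved by an LU-based solve of an n x n dense system,
 * using the (2/3) n^3 leading term of the operation count. */
double linpack_gflops(double n, double seconds)
{
    double flops = (2.0 / 3.0) * n * n * n;
    return flops / seconds / 1e9;
}

int main(void)
{
    /* e.g., n = 10000 solved in 45 s -> about 14.8 GFlop/s */
    printf("%.1f GFlop/s\n", linpack_gflops(10000.0, 45.0));
    return 0;
}
```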
60 What can we find on the Top500?
61 Pies
62 Top Ten Computers (http://www.top500.org)
63 Top 500 Computers -- Countries (http://www.top500.org)
64 Top 500 Computers -- Manufacturers (http://www.top500.org)
65 Top 500 Computers -- Manufacturers Trend (http://www.top500.org)
66 Top 500 Computers -- Operating Systems (http://www.top500.org)
67 Top 500 Computers -- Operating Systems Trend (http://www.top500.org)
68 Top 500 Computers -- Processors (http://www.top500.org)
69 Top 500 Computers -- Processors Trend (http://www.top500.org)
70 Top 500 Computers -- Customers (http://www.top500.org)
71 Top 500 Computers -- Customers Trend (http://www.top500.org)
72 Top 500 Computers -- Applications (http://www.top500.org)
73 Top 500 Computers -- Applications Trend (http://www.top500.org)
74 Top 500 Computers -- Architecture (http://www.top500.org)
75 Top 500 Computers -- Architecture Trend (http://www.top500.org)
76 SIMD
- ILLIAC-IV, TMC CM-1, MasPar MP-1
- Expensive logic for the control unit, but there is only one
- Cheap logic for the PEs, and there can be a lot of them
- 32 procs on 1 chip of the MasPar; a 1024-proc system with 32 chips fits on a single board!
- 65,536 processors for the CM-1
- Thinking Machines' gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine
- CM-5
- hybrid SIMD and MIMD
- Death
- The machines are no longer popular, but the programming model is.
- Vector processors are often labeled SIMD because that's in effect what they do, but they are not SIMD machines
- Led to the MPP terminology (Massively Parallel Processor)
- Ironic because none of today's MPPs are SIMD
77 SMPs
- Symmetric Multi-Processors (often mislabeled as Shared-Memory Processors, which has now become tolerated)
- Processors all connected to a (large) memory
- UMA: Uniform Memory Access, which makes them easy to program
- Symmetric: all memory is equally close to all processors
- Difficult to scale to many processors (< 32 typically)
- Cache coherence via snoopy caches or directories
78 Distributed Shared Memory
- Memory is logically shared, but physically distributed in banks
- Any processor can access any address in memory
- Cache lines (or pages) are passed around the machine
- Cache coherence: distributed directories
- NUMA: Non-Uniform Memory Access (some processors may be closer to some banks)
- The SGI Origin2000 is a canonical example
- Scales to 100s of processors
- Hypercube topology for the memory (later)
- (Figure: processors P1 ... Pn, each with a local memory bank, connected by a network.)
79 Clusters, Constellations, MPPs
- These are the only 3 categories today in the Top500
- They all belong to the distributed-memory model (MIMD) (with many twists)
- Each processor/node has its own memory and cache but cannot directly access another processor's memory
- nodes may be SMPs
- Each node has a network interface (NI) for all communication and synchronization
- So what are these 3 categories?
80 Clusters
- 58.2% of the Top500 machines are labeled as clusters
- Definition: a parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other standalone purposes
- A commodity cluster is one in which both the network and the compute nodes are available in the market
- In the Top500, "cluster" means "commodity cluster"
- A well-known type of commodity cluster is the Beowulf-class PC cluster, or Beowulf
81 What is Beowulf?
- An experiment in parallel computing systems
- Established a vision of low-cost, high-end computing with public-domain software (and led to software development)
- Tutorials and a book on best practices for building such platforms
- Today, by "Beowulf cluster" one means a commodity cluster that runs Linux and GNU-type software
- Project initiated by T. Sterling and D. Becker at NASA in 1994
82 Constellations???
- Commodity clusters that differ from the previous ones by the dominant level of parallelism
- Clusters consist of nodes, and nodes are typically SMPs
- If there are more processors in a node than nodes in the cluster, then we have a constellation
- Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP
- To be honest, this term is not very useful and not widely used.
83 MPP????????
- Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs "massively parallel"?)
- May use proprietary networks or vector processors, as opposed to commodity components
- The Cray T3E, Cray X1, and Earth Simulator are distributed-memory machines, but the nodes are SMPs
- Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500
- Let's look at these non-commodity things
- People's definitions of "commodity" vary
84 Vector Processors
- Vector architectures were based on a single processor
- Multiple functional units
- All performing the same operation
- Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
- Historically important
- Overtaken by MPPs in the 90s, as seen in the Top500
- Re-emerging in recent years
- At a large scale in the Earth Simulator (NEC SX-6) and Cray X1
- At a small scale in SIMD media extensions to microprocessors
- SSE, SSE2 (Intel: Pentium/IA64)
- AltiVec (IBM/Motorola/Apple: PowerPC)
- VIS (Sun: Sparc)
- Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to
85 Vector Processors
- Advantages
- quick fetch and decode of a single instruction for multiple operations
- the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
- The compiler does the work for you, of course
- Memory-to-memory machines
- no registers
- can process very long vectors, but startup time is large
- appeared in the 70s and died in the 80s
- Cray, Fujitsu, Hitachi, NEC
86 Global Address Space
- Cray T3D, T3E, X1, and HP AlphaServer clusters
- The network interface supports Remote Direct Memory Access
- The NI can directly access memory without interrupting the CPU
- One processor can read/write memory with one-sided operations (put/get)
- Not just a load/store as on a shared-memory machine
- Remote data is typically not cached locally
87 Cray X1: Parallel Vector Architecture
- Cray combines several technologies in the X1
- 12.8 Gflop/s vector processors (MSP)
- Shared caches (unusual on earlier vector machines)
- 4-processor nodes sharing up to 64 GB of memory
- Single System Image to 4096 processors
- Remote put/get between nodes (faster than explicit messaging)
88 Cray X1: the MSP
- The Cray X1 building block is the MSP
- Multi-Streaming vector Processor
- 4 SSPs (each a 2-pipe vector processor)
- The compiler will (try to) vectorize/parallelize across the MSP, achieving streaming
- (Figure: an MSP built from custom blocks: four scalar units (S) and eight vector pipes (V), delivering 12.8 Gflop/s at 64-bit and 25.6 Gflop/s at 32-bit precision; four 0.5 MB shared caches (2 MB Ecache) at 400/800 MHz; 25-41 GB/s of cache bandwidth; and 25.6 GB/s / 12.8-20.5 GB/s to local memory and the network. Figure source: J. Levesque, Cray.)
89 Cray X1: a node
- Shared memory
- 32 network links and four I/O links per node
90 Cray X1: 32 nodes
- (Figure: 32 nodes connected by a fast switch.)
91 Cray X1: 128 nodes
92 Cray X1: Parallelism
- Many levels of parallelism
- Within a processor: vectorization
- Within an MSP: streaming
- Within a node: shared memory
- Across nodes: message passing
- Some are automated by the compiler, some require work by the programmer
- This is a common theme
- The more complex the architecture, the more difficult it is for the programmer to exploit it
- Hard to fit this machine into a simple taxonomy
- Similar story for the Earth Simulator
93 The Earth Simulator (NEC)
- Each node
- Shared memory (16 GB)
- 8 vector processors + an I/O processor
- 640 nodes fully connected by a 640x640 crossbar switch
- Total: 5120 processors at 8 GFlop/s each -> about 40 TFlop/s peak
94 Blue Gene/L
- 65,536 processors
- Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1024 nodes in the same rack)
- Besides, processor speed is on par with the memory speed, so a faster clock rate does not help
- 2-way SMP nodes (really different from the X1)
- Several networks
- 64x32x32 3-D torus for point-to-point communication
- a tree for collective operations and for I/O
- plus others: Ethernet, etc.
95 BlueGene
- The BlueGene/L custom processor chip.
96 BlueGene
- The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.
97 Red Storm
- Packaging of the Red Storm components.
98 Red Storm
- The Red Storm system as viewed from above.
99 If you like dead Supercomputers
- Lots of old supercomputers with pictures
- http://www.geocities.com/Athens/6270/superp.html
- Dead Supercomputers
- http://www.paralogos.com/DeadSuper/Projects.html
- e-Bay
- Cray Y-MP/C90, 1993
- sold for $45,100.70
- by the Pittsburgh Supercomputing Center, which wanted to get rid of it to make space in its machine room
- Original cost: $35,000,000
- Weight: 30 tons
- It cost $400,000 to make it work at the buyer's ranch in Northern California
100 Network Topologies
- People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines
- Examples include
- Ring: KSR (1991)
- 2-D grid: Intel Paragon (1992)
- Torus
- Hypercube: nCube, Intel iPSC/860; used in the SGI Origin 2000 for memory
- Fat-tree: IBM Colony and Federation interconnects (SP-x)
- Arrangement of switches
- pioneered with Butterfly networks, like in the BBN TC2000 in the early 1990s
- 200 MHz processors in a multi-stage network of switches
- Virtually shared distributed memory (NUMA)
- I actually worked with that one!
101 Hypercube
- Defined by its dimension, d
- (Figure: hypercubes of dimension 1, 2, 3, and 4.)
102 Hypercube
- Properties
- Has 2^d nodes
- The number of hops between two nodes is at most d
- The diameter of the network grows logarithmically with the number of nodes, which was the key reason for interest in hypercubes
- But each node needs d neighbors, which is a problem
- Routing and Addressing
- d-bit addresses
- routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination (a small sketch follows below)
- reminiscent of some p2p things
- TONS of hypercube research (even today!!)
- (Figure: a 4-D hypercube with nodes labeled 0000 through 1111.)
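- A hedged sketch of this routing rule in C (node addresses as d-bit integers; the helper name is mine, not from the slide): at each hop, flip one bit in which the current address differs from the destination, reducing the Hamming distance by exactly one.

```c
#include <stdio.h>

/* Next hop on a d-dimensional hypercube: flip the lowest-order bit
 * in which the current node differs from the destination, so the
 * route takes at most d hops. */
unsigned next_hop(unsigned current, unsigned dest)
{
    unsigned diff = current ^ dest;
    if (diff == 0)
        return current;              /* already at the destination */
    return current ^ (diff & -diff); /* flip the lowest differing bit */
}

int main(void)
{
    unsigned cur = 0x0; /* 0000 */
    unsigned dst = 0xB; /* 1011 */
    while (cur != dst) {
        unsigned nxt = next_hop(cur, dst);
        printf("%X -> %X\n", cur, nxt);
        cur = nxt;
    }
    return 0;
}
```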
103 Conclusion
- Concurrency appears at all levels
- Both in commodity systems and in supercomputers
- The distinction is rather annoying
- When needing performance, one has to exploit concurrency to the fullest
- e.g., as a developer of a geophysics application to run on a 10,000-processor heavy-iron supercomputer at the Sandia national lab
- e.g., as a game developer on an 8-way multi-core, hyper-threaded desktop system sold by Dell
- In this course we'll gain a hands-on understanding of how to write concurrent/parallel software
- Using the GCB cluster