Title: Parallel
1 High-Performance Grid Computing and Research Networking
Concurrent Computers
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/sadjadi/Teaching/
sadjadi At cs Dot fiu Dot edu
2 Acknowledgements
- The content of many of the slides in these lecture notes has been adapted from online resources prepared previously by the people listed below. Many thanks!
- Henri Casanova
- Principles of High Performance Computing
- http://navet.ics.hawaii.edu/casanova
- henric_at_hawaii.edu
- Kai Wang
- Department of Computer Science
- University of South Dakota
- http://www.usd.edu/Kai.Wang
- Andrew Tanenbaum
3 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- For example, On-Chip Parallelism
- Within a Box
- For example, Coprocessors and Multiprocessors
- Across Boxes
- For example, Multicomputers, Clusters, and Grids
4 Parallel Computer Architectures
- (a) On-chip parallelism. (b) A coprocessor. (c) A multiprocessor. (d) A multicomputer. (e) A grid.
5 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- Within a Box
- Across Boxes
6 Concurrency within a CPU
- (Figure: a CPU (registers, ALUs, and hardware to decode instructions and do all types of useful things) with its caches, connected over busses to RAM, controllers and adapters, and I/O devices such as displays, keyboards, and networks.)
7 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- RISC architectures
- Pipelined functional units
- ILP
- Vector units
- On-Chip Multithreading
- Let's look at them briefly
8 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- RISC architectures
- Pipelined functional units
- ILP
- Vector units
- On-Chip Multithreading
- Let's look at them briefly
9 Pipelining
- If one has a sequence of tasks to do
- If each task consists of the same n steps or stages
- If different steps can be done simultaneously
- Then one can have a pipelined execution of the tasks
- e.g., as on an assembly line
- Goal: higher throughput (i.e., number of tasks per time unit)
- (Figure: pipelined execution timeline.) Time to do 1 task: 9; 2 tasks: 13; 3 tasks: 17; 4 tasks: 21; 10 tasks: 45; 100 tasks: 409. Pays off if there are many tasks (a timing model is sketched below).
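- A hedged sketch of the timing model behind these figures (not spelled out on the slide): if the pipeline needs a fill time of T_fill for the first task and its slowest stage takes t_slowest time units, then

```latex
T(n) \;\approx\; T_{\mathrm{fill}} + (n-1)\,t_{\mathrm{slowest}},
\qquad
\mathrm{throughput}(n) \;=\; \frac{n}{T(n)} \;\longrightarrow\; \frac{1}{t_{\mathrm{slowest}}}
\quad \text{as } n \to \infty .
```

- With T_fill = 9 and t_slowest = 4, this reproduces the 9, 13, 17, 21, and 45 figures above; the payoff only shows up when there are many tasks.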
10 Pipelining
- Each step goes only as fast as the slowest stage
- Therefore, the asymptotic throughput (i.e., the throughput when the number of tasks tends to infinity) is equal to 1 / (duration of the slowest stage)
- Therefore, in an ideal pipeline, all stages would be identical (balanced pipeline)
- Question: can we make computer instructions all consist of the same number of stages, where all stages take the same number of clock cycles?
11 RISC
- Having all instructions doable in the same number of stages of the same duration is the RISC idea
- Example: the MIPS architecture (see THE architecture book by Patterson and Hennessy)
- 5 stages
- Instruction Fetch (IF)
- Instruction Decode (ID)
- Instruction Execute (EX)
- Memory access (MEM)
- Register Write Back (WB)
- Each stage takes one clock cycle
12 Pipelined Functional Units
- Although the RISC idea is attractive, some operations are just too expensive to be done in one clock cycle (during the EX stage)
- Common example: floating-point operations
- Solution: implement them as a sequence of stages, so that they can be pipelined
- (Figure: between ID and MEM/WB, the EX stage offers an integer unit, a 7-stage pipelined FP/integer multiply unit (M1-M7), and a 4-stage pipelined FP/integer add unit (A1-A4).)
13 Pipelining Today
- Pipelined functional units are common
- Fallacy: all computers today are RISC
- RISC was of course one of the most fundamental new ideas in computer architecture
- x86: the most commonly used Instruction Set Architecture today
- Kept around for backwards-compatibility reasons, because it's easy to implement (not to program for)
- BUT: modern x86 processors decode instructions into micro-ops, which are then executed in a RISC manner
- Bottom line: pipelining is a pervasive (and conveniently hidden) form of concurrency in computers today
- Take a computer architecture course to learn all about it
14 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- ILP
- Vector units
- On-Chip Multithreading
15 Instruction Level Parallelism
- Instruction Level Parallelism (ILP) is the set of techniques by which the performance of a pipelined processor can be pushed even further
- ILP can be done by the hardware
- Dynamic instruction scheduling
- Dynamic branch prediction
- Multi-issue superscalar processors
- ILP can be done by the compiler
- Static instruction scheduling
- Multi-issue VLIW (Very Long Instruction Word) processors
- with multiple functional units
- Broad concept: more than one instruction is issued per clock cycle
- e.g., an 8-way multi-issue processor
16 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- ILP
- Vector units
- On-Chip Multithreading
17 Vector Units
- A functional unit that can do element-wise operations on entire vectors with a single instruction, called a vector instruction
- These are specified as operations on vector registers
- A vector processor comes with some number of such registers
- MMX extension on x86 architectures
- (Figure: two vector registers of "elts" elements added element-wise into a third; all "elts" adds happen in parallel.)
18 Vector Units
- Typically, a vector register holds 32-64 elements
- But the number of elements is always larger than the amount of parallel hardware, called vector pipes or lanes, say 2-4 (a back-of-the-envelope model follows below)
- (Figure: the same vector add, but only "elts / pipes" additions happen in parallel per cycle.)
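- A hedged back-of-the-envelope model (the notation is mine, not the slide's): with e elements per vector register and p pipes/lanes, one vector instruction occupies the unit for roughly

```latex
t_{\text{vector}} \;\approx\; t_{\text{startup}} + \left\lceil \frac{e}{p} \right\rceil \, t_{\text{cycle}},
```

- so with, say, e = 64 and p = 4, a single instruction keeps the lanes busy for about 16 cycles after startup.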
19 MMX Extension
- Many techniques that are initially implemented in the supercomputer market find their way to the mainstream
- Vector units were pioneered in supercomputers
- Supercomputers are mostly used for scientific computing
- Scientific computing uses tons of arrays (to represent mathematical vectors) and often does regular computation with these arrays
- Therefore, scientific code is easy to vectorize, i.e., to generate assembly that uses the vector registers and the vector instructions
- Intel's MMX or PowerPC's AltiVec
- MMX vector registers
- eight 8-bit elements
- four 16-bit elements
- two 32-bit elements
- AltiVec: twice the lengths
- Used for multi-media applications
- image processing
- rendering
- ...
20 Vectorization Example
- Conversion from RGB to YUV
- Y = (9798*R + 19235*G + 3736*B) / 32768
- U = (-4784*R - 9437*G + 14221*B) / 32768 + 128
- V = (20218*R - 16941*G - 3277*B) / 32768 + 128
- This kind of code is perfectly parallel, as all pixels can be computed independently (see the scalar sketch below)
- Can be done easily with MMX vector capabilities
- Load 8 R values into an MMX vector register
- Load 8 G values into an MMX vector register
- Load 8 B values into an MMX vector register
- Do the *, +, and / in parallel
- Repeat
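- A minimal scalar C sketch of this conversion (the function name and signature are illustrative, not from the slide). Every pixel is independent, so a vectorizing compiler, or hand-written MMX/SSE code, can process 8 pixels per iteration exactly as outlined above:

```c
#include <stdint.h>

/* Clamp an integer to the 0..255 range of an 8-bit channel. */
static inline uint8_t clamp_u8(int32_t x)
{
    return (uint8_t)(x < 0 ? 0 : (x > 255 ? 255 : x));
}

/* Fixed-point RGB -> YUV conversion with the coefficients from the slide.
 * Every iteration is independent, so the loop is trivially vectorizable. */
void rgb_to_yuv(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                uint8_t *y, uint8_t *u, uint8_t *v, int n)
{
    for (int i = 0; i < n; i++) {
        int32_t R = r[i], G = g[i], B = b[i];
        y[i] = clamp_u8(( 9798 * R + 19235 * G +  3736 * B) / 32768);
        u[i] = clamp_u8((-4784 * R -  9437 * G + 14221 * B) / 32768 + 128);
        v[i] = clamp_u8((20218 * R - 16941 * G -  3277 * B) / 32768 + 128);
    }
}
```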
21 Concurrency within a CPU
- Several techniques allow concurrency within a single CPU
- Pipelining
- ILP
- Vector units
- On-Chip Multithreading
22 Multi-threaded Architectures
- Computer architecture is a difficult field to make innovations in
- Who's going to spend money to manufacture your new idea?
- Who's going to be convinced that a new compiler can/should be written?
- Who's going to be convinced of a new approach to computing?
- One of the cool innovations in the last decade has been the concept of a Multi-threaded Architecture
23 On-Chip Multithreading
- Multithreading has been around for years, so what's new about this?
- Here we're talking about hardware support for threads
- Simultaneous Multi-Threading (SMT)
- SuperThreading
- HyperThreading
- Let's try to understand what all of these mean before looking at multi-threaded supercomputers
24 Single-threaded Processor
- The processor provides the illusion of concurrent execution
- Front-end: fetching/decoding/reordering
- Execution core: actual execution
- Multiple programs in memory
- Only one executes at a time
- 4-issue CPU with bubbles
- 7-unit CPU with pipeline bubbles
- Time-slicing via context switching
25 Single-threaded SMP?
- Two threads execute at once, so threads spend less time waiting
- The number of bubbles is also doubled
- Twice as much speed and twice as much waste
26 Super-threading
- Principle: the processor can execute more than one thread at a time
- Also called time-slice multithreading
- The processor is then called a multithreaded processor
- Requires more hardware cleverness
- logic switches at each cycle
- Leads to less waste
- A thread can run during a cycle while another thread is waiting for memory
- Just a finer grain of interleaving
- But there is a restriction
- Each stage of the front end or the execution core only runs instructions from ONE thread!
- Does not help with poor instruction parallelism within one thread
- Does not reduce bubbles within a row
27 Hyper-threading
- Principle: the processor can execute more than one thread at a time, even within a single clock cycle!
- Requires even more hardware cleverness
- logic switches within each cycle
- On the diagram: only two threads execute simultaneously
- Intel's hyper-threading only adds 5% to the die area
- Some people argue that two is not "hyper"
- Finest level of interleaving
- From the OS perspective, there are two logical processors
28 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- Within a Box
- Across Boxes
29 Concurrency within a Box
- Two main techniques
- SMP
- Multi-core
- Let's look at both of them
30 SMPs
- Symmetric Multi-Processors (often mislabeled as Shared-Memory Processors, which has now become tolerated)
- Processors are all connected to a single memory
- Symmetric: each memory cell is equally close to all processors
- Many dual-proc and quad-proc systems
- e.g., for servers
31 Distributed caches
- The problem with distributed caches is that of memory consistency
- Intuitive memory model
- Reading an address should return the last value written to that address
- Easy to do in uniprocessors
- although there may be some I/O issues
- But difficult in multi-processor / multi-core systems
- Memory consistency: "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979] (illustrated by the sketch below)
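- A hedged illustration (not from the slides) of what sequential consistency rules out. Under sequential consistency, at least one of the two threads below must observe the other's write, so (r1, r2) = (0, 0) is impossible; on real hardware and compilers with weaker memory models, it can and does happen.

```c
#include <pthread.h>
#include <stdio.h>

/* Classic Dekker-style litmus test. Plain ints are used on purpose:
 * the code deliberately contains a data race to make the point. */
int x = 0, y = 0;
int r1, r2;

void *thread_a(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }
void *thread_b(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, thread_a, NULL);
    pthread_create(&b, NULL, thread_b, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* r1 == 0 && r2 == 0 betrays a memory model weaker than SC. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}
```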
32 Cache Coherency
- Memory consistency is jeopardized by having multiple caches
- P1 and P2 both have a cached copy of a data item
- P1 writes to it, possibly writing through to memory
- At this point P2 owns a stale copy
- When designing a multi-processor system, one must ensure that this cannot happen
- By defining protocols for cache coherence
33 Snoopy Cache-Coherence
- (Figure: processors P0 ... Pn with their caches sit on a shared memory bus together with the memory modules; each cache controller snoops the memory operations issued by the other processors.)
- The memory bus is a broadcast medium
- Caches contain information on which addresses they store
- The cache controller snoops all transactions on the bus
- A transaction is a relevant transaction if it involves a cache block currently contained in this cache
- Take action to ensure coherence
- invalidate, update, or supply the value
34 Limits of Snoopy Coherence
- Assume
- a 4 GHz processor
- => 16 GB/s instruction bandwidth per processor (32-bit instructions)
- => 9.6 GB/s data bandwidth at 30% load-store instructions on 8-byte elements
- Suppose a 98% instruction hit rate and a 90% data hit rate
- => 320 MB/s instruction bandwidth per processor
- => 960 MB/s data bandwidth per processor
- => 1.28 GB/s combined bandwidth
- Assuming 10 GB/s bus bandwidth
- 8 processors will saturate the bus (the arithmetic is recapped below)
- (Figure: each PROC demands 25.6 GB/s from its cache, but only 1.28 GB/s of misses reach the shared bus and MEM.)
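- A hedged recap of the arithmetic behind these numbers (all inputs are from the slide):

```latex
\begin{aligned}
\text{instruction demand} &= 4\,\text{GHz} \times 4\,\text{B} = 16\ \text{GB/s},\\
\text{data demand} &= 4\,\text{GHz} \times 0.30 \times 8\,\text{B} = 9.6\ \text{GB/s},\\
\text{bus traffic per processor} &= 16 \times (1-0.98) + 9.6 \times (1-0.90) \approx 1.28\ \text{GB/s},\\
\text{processors to saturate a 10 GB/s bus} &\approx 10 / 1.28 \approx 8 .
\end{aligned}
```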
35 Sample Machines
- Intel Pentium Pro Quad
- Coherent
- 4 processors
- Sun Enterprise server
- Coherent
- Up to 16 processor and/or memory-I/O cards
36 Directory-based Coherence
- Idea: implement a directory that keeps track of where each copy of a data item is stored
- The directory acts as a filter
- processors must ask permission before loading data from memory into their cache
- when an entry is changed, the directory either updates or invalidates the cached copies
- Eliminates the overhead of broadcasting/snooping, and thus bandwidth consumption
- But it is slower in terms of latency
- Used to scale up to numbers of processors that would saturate the memory bus
37 Example machine
- SGI Altix 3000
- A node contains up to 4 Itanium 2 processors and 32 GB of memory
- Uses a mixture of snoopy and directory-based coherence
- Up to 512 processors that are cache coherent (a global address space is possible for larger machines)
38 Sequential Consistency?
- A lot of hardware and technology to ensure cache coherence
- But the sequential consistency model may be broken anyway
- The compiler reorders/removes code
- Prefetch instructions cause reordering
- The network may reorder two write messages
- Basically, a bunch of things can happen
- Virtually all commercial systems give up on the idea of maintaining strong sequential consistency
- The programmer must program with weaker memory models than sequential consistency
- Done with some rules
- Avoid race conditions
- Use system-provided synchronization primitives
39 Weaker models
- The programmer must program with weaker memory models than sequential consistency
- Done with some rules
- Avoid race conditions
- Use system-provided synchronization primitives (a minimal sketch follows below)
- We will see how to program shared-memory machines
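- A hedged example of these rules in practice: a shared counter protected by a system-provided primitive (POSIX threads here; the choice of API is mine, not the slide's).

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   1000000

long counter = 0;
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Without the mutex, counter++ is a racy read-modify-write and the
 * final value would depend on timing and on the memory model. */
void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld (expected %d)\n", counter, NTHREADS * NITERS);
    return 0;
}
```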
40 Concurrency within a Box
- Two main techniques
- SMP
- Multi-core
41 Moore's Law!
- Many people interpret Moore's law as "computers get twice as fast every 18 months"
- which is not technically true
- it's all about microprocessor density
- But this interpretation is no longer true
- We should have 10 GHz processors right now
- And we don't!
42 No more Moore?
- We are used to getting faster CPUs all the time
- We are used to them keeping up with ever more demanding software
- Known as "Andy giveth, and Bill taketh away"
- Andy Grove
- Bill Gates
- It's a nice way to force people to buy computers often
- But basically, our computers get better, do more things, and it just happens automatically
- Some people call this the "performance free lunch"
- Conventional wisdom: "Not to worry, tomorrow's processors will have even more throughput, and anyway today's applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they're often I/O-bound, network-bound, database-bound)."
43 Commodity improvements
- There are three main ways in which commodity processors keep improving
- Higher clock rates
- More aggressive instruction reordering and more concurrent units
- Bigger/faster caches
- All applications can easily benefit from these improvements
- at the cost of perhaps a recompilation
- Unfortunately, the first two are hitting their limits
- Higher clock rates lead to high heat and power consumption
- No more instruction reordering without compromising correctness
44 Is Moore's law not true?
- Ironically, Moore's law is still true
- The density indeed still doubles
- But its common (wrong) interpretation is not
- Clock rates do not double anymore
- But we can't let this happen: computers have to get more powerful
- Therefore, the industry has thought of new ways to improve them: multi-core
- Multiple CPUs on a single chip
- Multi-core adds another level of concurrency
- But unlike, say, multiple functional units, it is hard for compilers to exploit it automatically
- Therefore, applications must be rewritten to benefit from the (nowadays expected) performance increase
- "Concurrency is the next major revolution in how we write software" (Dr. Dobb's Journal, 30(3), March 2005)
45 Multi-core processors
- In addition to putting concurrency in the public's eye, multi-core architectures will have a deep impact
- Languages will be forced to deal well with concurrency
- New language designs?
- New language extensions?
- New compilers?
- Efficiency and performance optimization will become more important: write code that is fast on one core with a limited clock rate
- The CPU may very well become a bottleneck (again) for single-core programs
- Other factors will improve, but not the clock rate
- Prediction: many companies will be hiring people to (re)write concurrent applications (a small example of such code follows below)
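- As a hedged glimpse of what such a rewrite can look like, here is a minimal OpenMP loop in C (OpenMP is one common option; the slides do not prescribe it):

```c
#include <stdio.h>
#include <omp.h>

/* Scale a vector using however many cores the runtime provides.
 * Compile with: gcc -fopenmp scale.c */
int main(void)
{
    enum { N = 1000000 };
    static double a[N];

    /* Iterations are independent, so they can be split across cores. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * i;

    printf("a[N-1] = %f, threads available = %d\n",
           a[N - 1], omp_get_max_threads());
    return 0;
}
```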
46 Multi-Core
- Quote from PC World Magazine, Summer 2005:
- "Don't expect dual-core to be the top performer today for games and other demanding single-threaded applications. But that will change as applications are rewritten. For example, by year's end, Unreal Tournament should have released a new game engine that takes advantage of dual-core processing."
47 Concurrency and Computers
- We will see computer systems designed to allow concurrency (for performance benefits)
- Concurrency occurs at many levels in computer systems
- Within a CPU
- Within a Box
- Across Boxes
48 Multiple boxes together
- Example
- Take four boxes
- e.g., four Intel Itaniums bought from Dell
- Hook them up to a network
- e.g., a switch bought from Cisco, Myricom, etc.
- Install software that allows you to write/run applications that can utilize these four boxes concurrently (a minimal sketch follows below)
- This is a simple way to achieve concurrency across computer systems
- Everybody has heard of clusters by now
- They are basically like the above example and can be purchased already built from vendors
- We will talk about this kind of concurrent platform at length during this class
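- A hedged sketch of the kind of software layer mentioned above: a minimal MPI program in C (MPI is the de facto standard for such clusters, though the slide does not name a specific library).

```c
#include <stdio.h>
#include <mpi.h>

/* Each box (or core) runs one copy of this program; the MPI library
 * wires the copies together. Run with, e.g.: mpirun -np 4 ./hello */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}
```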
49 Multiple Boxes Together
- Why do we use multiple boxes?
- Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs
- The problem is that single boxes do not scale to meet the needs of many scientific applications
- Can't have enough processors or powerful enough cores
- Can't have enough memory
- But if you can live with a single box, do it!
- We will see that single-box programming is much easier than multi-box programming
50 Where does this leave us?
- So far we have seen many ways in which concurrency can be achieved/implemented in computer systems
- Within a box
- Across boxes
- So we could look at a system and just list all the ways in which it does concurrency
- It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems
- It would provide simple names that everybody can use and understand quickly
51 Taxonomy of parallel machines?
- It's not going to happen
- Just last year, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be
- Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be, and proposing a new multi-dimensional scheme!
- Both papers agree that most terms are conflated, misused, etc. (e.g., MPP)
- Complicated by the fact that concurrency appears at so many levels
- Example: a 16-node cluster, where each node is a 4-way multi-processor, where each processor is hyper-threaded, has vector units, and is fully pipelined with multiple, pipelined functional units
52 Taxonomy of platforms?
- We'll look at one traditional taxonomy
- We'll look at current categorizations from the Top500
- We'll look at examples of platforms
- We'll look at interesting/noteworthy architectural features that one should know as part of one's parallel computing culture
53 The Flynn taxonomy
- Proposed in 1966!
- A functional taxonomy based on the notion of streams of information: data and instructions
- Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above
- Four possibilities
- SISD (sequential machine)
- SIMD
- MIMD
- MISD (rare, no commercial system... systolic arrays)
54 Taxonomy of Parallel Computers
- Flynn's taxonomy of parallel computers.
55 SIMD
- PEs can be deactivated and activated on-the-fly
- Vector processing (e.g., vector add) is easy to implement on SIMD
- Debate: is a vector processor a SIMD machine?
- often confused
- strictly not true according to the taxonomy (it's really SISD with pipelined operations)
- but it's convenient to think of the two as equivalent
56 MIMD
- Most general category
- Pretty much every supercomputer in existence today is a MIMD machine at some level
- This limits the usefulness of the taxonomy
- But you had to have heard of it at least once, because people keep referring to it, somehow...
- Other taxonomies have been proposed, none very satisfying
- Shared- vs. distributed-memory is a common distinction among machines, but these days many are hybrid anyway
57 Taxonomy of Parallel Computers
- A taxonomy of parallel computers.
58 A host of parallel machines
- There are (and have been) many kinds of parallel machines
- For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of the Top500 list
- It is a good source of information about what machines are (and were) and how they have evolved
- Note that it's really about supercomputers
- http://www.top500.org
59 LINPACK Benchmark?
- LINPACK: LINear algebra PACKage
- A FORTRAN library: matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc.
- LINPACK Benchmark
- Dense linear system solve with LU factorization
- (2/3) n^3 + O(n^2) floating-point operations
- Measures MFlop/s (see the small example below)
- The problem size n can be chosen
- You have to report the best performance for the best n, and the n that achieves half of the best performance.
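- A hedged illustration of how the reported rate follows from the operation count above (the helper function is mine, not part of the benchmark; the lower-order term is ignored):

```c
#include <stdio.h>

/* Rate achieved by an LU-based solve of an n x n dense system,
 * using the (2/3) n^3 leading term of the operation count. */
double linpack_gflops(double n, double seconds)
{
    double flops = (2.0 / 3.0) * n * n * n;
    return flops / seconds / 1e9;
}

int main(void)
{
    /* e.g., n = 10000 solved in 45 s -> about 14.8 GFlop/s */
    printf("%.1f GFlop/s\n", linpack_gflops(10000.0, 45.0));
    return 0;
}
```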
60 What can we find on the Top500?
61 Pies
62 Top Ten Computers (http://www.top500.org)
63 Top 500 Computers -- Countries (http://www.top500.org)
64 Top 500 Computers -- Manufacturers (http://www.top500.org)
65 Top 500 Computers -- Manufacturers Trend (http://www.top500.org)
66 Top 500 Computers -- Operating Systems (http://www.top500.org)
67 Top 500 Computers -- Operating Systems Trend (http://www.top500.org)
68 Top 500 Computers -- Processors (http://www.top500.org)
69 Top 500 Computers -- Processors Trend (http://www.top500.org)
70 Top 500 Computers -- Customers (http://www.top500.org)
71 Top 500 Computers -- Customers Trend (http://www.top500.org)
72 Top 500 Computers -- Applications (http://www.top500.org)
73 Top 500 Computers -- Applications Trend (http://www.top500.org)
74 Top 500 Computers -- Architecture (http://www.top500.org)
75 Top 500 Computers -- Architecture Trend (http://www.top500.org)
76 SIMD
- ILLIAC-IV, TMC CM-1, MasPar MP-1
- Expensive logic for the control unit, but there is only one
- Cheap logic for the PEs, and there can be a lot of them
- 32 procs on 1 chip of the MasPar; a 1024-proc system with 32 chips fits on a single board!
- 65,536 processors for the CM-1
- Thinking Machines' gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine
- CM-5
- hybrid SIMD and MIMD
- Death
- The machines are no longer popular, but the programming model is.
- Vector processors are often labeled SIMD because that's in effect what they do, but they are not SIMD machines
- Led to the MPP terminology (Massively Parallel Processor)
- Ironic because none of today's MPPs are SIMD
77 SMPs
- Symmetric Multi-Processors (often mislabeled as Shared-Memory Processors, which has now become tolerated)
- Processors all connected to a (large) memory
- UMA: Uniform Memory Access, which makes them easy to program
- Symmetric: all memory is equally close to all processors
- Difficult to scale to many processors (< 32 typically)
- Cache coherence via snoopy caches or directories
78 Distributed Shared Memory
- Memory is logically shared, but physically distributed in banks
- Any processor can access any address in memory
- Cache lines (or pages) are passed around the machine
- Cache coherence: distributed directories
- NUMA: Non-Uniform Memory Access (some processors may be closer to some banks)
- The SGI Origin2000 is a canonical example
- Scales to 100s of processors
- Hypercube topology for the memory (later)
- (Figure: processors P1 ... Pn, each with a local memory bank, connected by a network.)
79 Clusters, Constellations, MPPs
- These are the only 3 categories today in the Top500
- They all belong to the distributed-memory model (MIMD) (with many twists)
- Each processor/node has its own memory and cache but cannot directly access another processor's memory
- nodes may be SMPs
- Each node has a network interface (NI) for all communication and synchronization
- So what are these 3 categories?
80 Clusters
- 58.2% of the Top500 machines are labeled as clusters
- Definition: a parallel computer system comprising an integrated collection of independent nodes, each of which is a system in its own right, capable of independent operation and derived from products developed and marketed for other standalone purposes
- A commodity cluster is one in which both the network and the compute nodes are available in the market
- In the Top500, "cluster" means "commodity cluster"
- A well-known type of commodity cluster is the Beowulf-class PC cluster, or Beowulf
81 What is Beowulf?
- An experiment in parallel computing systems
- Established a vision of low-cost, high-end computing with public-domain software (and led to software development)
- Tutorials and a book on best practices for building such platforms
- Today, by "Beowulf cluster" one means a commodity cluster that runs Linux and GNU-type software
- Project initiated by T. Sterling and D. Becker at NASA in 1994
82 Constellations???
- Commodity clusters that differ from the previous ones by the dominant level of parallelism
- Clusters consist of nodes, and nodes are typically SMPs
- If there are more processors in a node than nodes in the cluster, then we have a constellation
- Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP
- To be honest, this term is not very useful and not widely used.
83 MPP????????
- Probably the most imprecise term for describing a machine (isn't a 256-node cluster of 4-way SMPs "massively parallel"?)
- May use proprietary networks or vector processors, as opposed to commodity components
- The Cray T3E, Cray X1, and Earth Simulator are distributed-memory machines, but the nodes are SMPs
- Basically, everything that's fast and not commodity is an MPP, in terms of today's Top500
- Let's look at these non-commodity things
- People's definitions of "commodity" vary
84 Vector Processors
- Vector architectures were based on a single processor
- Multiple functional units
- All performing the same operation
- Instructions may specify large amounts of parallelism (e.g., 64-way), but the hardware executes only a subset in parallel
- Historically important
- Overtaken by MPPs in the 90s, as seen in the Top500
- Re-emerging in recent years
- At a large scale in the Earth Simulator (NEC SX-6) and Cray X1
- At a small scale in SIMD media extensions to microprocessors
- SSE, SSE2 (Intel: Pentium/IA64)
- AltiVec (IBM/Motorola/Apple: PowerPC)
- VIS (Sun: Sparc)
- Key idea: the compiler does some of the difficult work of finding parallelism, so the hardware doesn't have to
85 Vector Processors
- Advantages
- quick fetch and decode of a single instruction for multiple operations
- the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion
- The compiler does the work for you, of course
- Memory-to-memory machines
- no registers
- can process very long vectors, but startup time is large
- appeared in the 70s and died in the 80s
- Cray, Fujitsu, Hitachi, NEC
86 Global Address Space
- Cray T3D, T3E, X1, and HP AlphaServer clusters
- The network interface supports Remote Direct Memory Access
- The NI can directly access memory without interrupting the CPU
- One processor can read/write memory with one-sided operations (put/get)
- Not just a load/store as on a shared-memory machine
- Remote data is typically not cached locally
87 Cray X1: Parallel Vector Architecture
- Cray combines several technologies in the X1
- 12.8 Gflop/s vector processors (MSP)
- Shared caches (unusual on earlier vector machines)
- 4-processor nodes sharing up to 64 GB of memory
- Single System Image to 4096 processors
- Remote put/get between nodes (faster than explicit messaging)
88 Cray X1: the MSP
- The Cray X1 building block is the MSP
- Multi-Streaming vector Processor
- 4 SSPs (each a 2-pipe vector processor)
- The compiler will (try to) vectorize/parallelize across the MSP, achieving streaming
- (Figure: an MSP built from custom blocks: four scalar units (S) and eight vector pipes (V), delivering 12.8 Gflop/s at 64-bit and 25.6 Gflop/s at 32-bit precision; four 0.5 MB shared caches (2 MB Ecache) at 400/800 MHz; 25-41 GB/s of cache bandwidth; and 25.6 GB/s / 12.8-20.5 GB/s to local memory and the network. Figure source: J. Levesque, Cray.)
89 Cray X1: a node
- Shared memory
- 32 network links and four I/O links per node
90 Cray X1: 32 nodes
- (Figure: 32 nodes connected by a fast switch.)
91 Cray X1: 128 nodes
92 Cray X1: Parallelism
- Many levels of parallelism
- Within a processor: vectorization
- Within an MSP: streaming
- Within a node: shared memory
- Across nodes: message passing
- Some are automated by the compiler, some require work by the programmer
- This is a common theme
- The more complex the architecture, the more difficult it is for the programmer to exploit it
- Hard to fit this machine into a simple taxonomy
- Similar story for the Earth Simulator
93 The Earth Simulator (NEC)
- Each node
- Shared memory (16 GB)
- 8 vector processors + an I/O processor
- 640 nodes fully connected by a 640x640 crossbar switch
- Total: 5120 processors at 8 GFlop/s each -> about 40 TFlop/s peak
94 Blue Gene/L
- 65,536 processors
- Relatively modest clock rates, so that power consumption is low, cooling is easy, and space is small (1024 nodes in the same rack)
- Besides, processor speed is on par with the memory speed, so a faster clock rate does not help
- 2-way SMP nodes (really different from the X1)
- Several networks
- 64x32x32 3-D torus for point-to-point communication
- a tree for collective operations and for I/O
- plus others: Ethernet, etc.
95 BlueGene
- The BlueGene/L custom processor chip.
96 BlueGene
- The BlueGene/L. (a) Chip. (b) Card. (c) Board. (d) Cabinet. (e) System.
97 Red Storm
- Packaging of the Red Storm components.
98 Red Storm
- The Red Storm system as viewed from above.
99 If you like dead Supercomputers
- Lots of old supercomputers with pictures
- http://www.geocities.com/Athens/6270/superp.html
- Dead Supercomputers
- http://www.paralogos.com/DeadSuper/Projects.html
- e-Bay
- Cray Y-MP/C90, 1993
- sold for $45,100.70
- by the Pittsburgh Supercomputing Center, which wanted to get rid of it to make space in its machine room
- Original cost: $35,000,000
- Weight: 30 tons
- It cost $400,000 to make it work at the buyer's ranch in Northern California
100 Network Topologies
- People have experimented with different topologies for distributed-memory machines, or to arrange memory banks in NUMA shared-memory machines
- Examples include
- Ring: KSR (1991)
- 2-D grid: Intel Paragon (1992)
- Torus
- Hypercube: nCube, Intel iPSC/860; used in the SGI Origin 2000 for memory
- Fat-tree: IBM Colony and Federation interconnects (SP-x)
- Arrangement of switches
- pioneered with Butterfly networks, like in the BBN TC2000 in the early 1990s
- 200 MHz processors in a multi-stage network of switches
- Virtually shared distributed memory (NUMA)
- I actually worked with that one!
101 Hypercube
- Defined by its dimension, d
- (Figure: hypercubes of dimension 1, 2, 3, and 4.)
102 Hypercube
- Properties
- Has 2^d nodes
- The number of hops between two nodes is at most d
- The diameter of the network grows logarithmically with the number of nodes, which was the key reason for interest in hypercubes
- But each node needs d neighbors, which is a problem
- Routing and Addressing
- d-bit addresses
- routing from xxxx to yyyy: just keep going to a neighbor that has a smaller Hamming distance to the destination (a small sketch follows below)
- reminiscent of some p2p things
- TONS of hypercube research (even today!!)
- (Figure: a 4-D hypercube with nodes labeled 0000 through 1111.)
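- A hedged sketch of this routing rule in C (node addresses as d-bit integers; the helper name is mine, not from the slide): at each hop, flip one bit in which the current address differs from the destination, reducing the Hamming distance by exactly one.

```c
#include <stdio.h>

/* Next hop on a d-dimensional hypercube: flip the lowest-order bit
 * in which the current node differs from the destination, so the
 * route takes at most d hops. */
unsigned next_hop(unsigned current, unsigned dest)
{
    unsigned diff = current ^ dest;
    if (diff == 0)
        return current;              /* already at the destination */
    return current ^ (diff & -diff); /* flip the lowest differing bit */
}

int main(void)
{
    unsigned cur = 0x0; /* 0000 */
    unsigned dst = 0xB; /* 1011 */
    while (cur != dst) {
        unsigned nxt = next_hop(cur, dst);
        printf("%X -> %X\n", cur, nxt);
        cur = nxt;
    }
    return 0;
}
```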
103 Conclusion
- Concurrency appears at all levels
- Both in commodity systems and in supercomputers
- The distinction is rather annoying
- When needing performance, one has to exploit concurrency to the fullest
- e.g., as a developer of a geophysics application to run on a 10,000-processor heavy-iron supercomputer at the Sandia national lab
- e.g., as a game developer on an 8-way multi-core, hyper-threaded desktop system sold by Dell
- In this course we'll gain a hands-on understanding of how to write concurrent/parallel software
- Using the GCB cluster