Transcript and Presenter's Notes

Title: Parallel


1
High-Performance Grid Computing and Research
Networking
Concurrent Computers
Instructor: S. Masoud Sadjadi
http://www.cs.fiu.edu/~sadjadi/Teaching/
sadjadi@cs.fiu.edu
2
Acknowledgements
  • The content of many of the slides in these
    lecture notes has been adapted from online
    resources prepared previously by the people
    listed below. Many thanks!
  • Henri Casanova
  • Principles of High Performance Computing
  • http://navet.ics.hawaii.edu/~casanova
  • henric@hawaii.edu
  • Kai Wang
  • Department of Computer Science
  • University of South Dakota
  • http://www.usd.edu/Kai.Wang
  • Andrew Tanenbaum

3
Concurrency and Computers
  • We will see computer systems designed to allow
    concurrency (for performance benefits)
  • Concurrency occurs at many levels in computer
    systems
  • Within a CPU
  • For example, On-Chip Parallelism
  • Within a Box
  • For example, Coprocessor and Multiprocessor
  • Across Boxes
  • For example, Multicomputers, Clusters, and Grids

4
Parallel Computer Architectures
  • (a) On-chip parallelism. (b) A coprocessor. (c) A
    multiprocessor.
  • (d) A multicomputer. (e) A grid.

5
Concurrency and Computers
  • We will see computer systems designed to allow
    concurrency (for performance benefits)
  • Concurrency occurs at many levels in computer
    systems
  • Within a CPU
  • Within a Box
  • Across Boxes

6
Concurrency within a CPU
(Diagram: a CPU containing registers, ALUs, and hardware to decode instructions and do all kinds of useful things; caches; busses; RAM; controllers and adapters; and I/O devices such as displays, keyboards, and networks.)
7
Concurrency within a CPU
  • Several techniques to allow concurrency within a
    single CPU
  • Pipelining
  • RISC architectures
  • Pipelined functional units
  • ILP
  • Vector units
  • On-Chip Multithreading
  • Let's look at them briefly

8
Concurrency within a CPU
  • Several techniques to allow concurrency within a
    single CPU
  • Pipelining
  • RISC architectures
  • Pipelined functional units
  • ILP
  • Vector units
  • On-Chip Multithreading
  • Let's look at them briefly

9
Pipelining
  • If one has a sequence of tasks to do
  • If each task consists of the same n steps or
    stages
  • If different steps can be done simultaneously
  • Then one can have a pipelined execution of the
    tasks
  • e.g., an assembly line
  • Goal: higher throughput (i.e., number of tasks
    per time unit)

Time to do 1 task: 9
Time to do 2 tasks: 13
Time to do 3 tasks: 17
Time to do 4 tasks: 21
Time to do 10 tasks: 45
Time to do 100 tasks: 409
Pays off if there are many tasks
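A minimal sketch of the arithmetic behind these numbers, assuming (illustratively) three stages of 3, 4, and 2 time units; the pipelined time for n tasks is (sum of stage durations) + (n - 1) x (slowest stage duration). The exact stage durations used in the slide's figure may differ slightly, but the trend is the same.

    /* Sequential vs. pipelined execution time for n identical tasks,
     * with assumed stage durations of 3, 4, and 2 time units. */
    #include <stdio.h>

    int main(void) {
        int stages[] = {3, 4, 2};
        int nstages = 3, sum = 0, slowest = 0;
        for (int i = 0; i < nstages; i++) {
            sum += stages[i];
            if (stages[i] > slowest) slowest = stages[i];
        }
        int tasks[] = {1, 2, 3, 4, 10, 100};
        for (int k = 0; k < 6; k++) {
            int n = tasks[k];
            printf("%3d tasks: sequential %4d, pipelined %4d\n",
                   n, n * sum, sum + (n - 1) * slowest);
        }
        return 0;
    }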
10
Pipelining
  • Each step goes as fast as the slowest stage
  • Therefore, the asymptotic throughput (i.e., the
    throughput when the number of tasks tends to
    infinity) is equal to
  • 1 / (duration of the slowest stage)
  • Therefore, in an ideal pipeline, all stages would
    be identical (balanced pipeline)
  • Question: Can we make computer instructions all
    consist of the same number of stages, where all
    stages take the same number of clock cycles?

11
RISC
  • Having all instructions doable in the same number
    of stages of the same durations is the RISC idea
  • Example
  • MIPS architecture (See THE architecture book by
    Patterson and Hennessy)
  • 5 stages
  • Instruction Fetch (IF)
  • Instruction Decode (ID)
  • Instruction Execute (EX)
  • Memory accesses (MEM)
  • Register Write Back (WB)
  • Each stage takes one clock cycle

12
Pipelined Functional Units
  • Although the RISC idea is attractive, some
    operations are just too expensive to be done in
    one clock cycle (during the EX stage)
  • Common example: floating point operations
  • Solution: implement them as a sequence of stages,
    so that they can be pipelined

(Diagram: the MIPS pipeline with IF, ID, MEM, and WB stages, where the EX stage is split into parallel functional units: an integer unit, a seven-stage pipelined FP/integer multiplier (M1-M7), and a four-stage pipelined FP/integer adder (A1-A4).)
13
Pipelining Today
  • Pipelined functional units are common
  • Fallacy: All computers today are RISC
  • RISC was of course one of the most fundamental
    new ideas in computer architectures
  • x86: the most commonly used Instruction Set
    Architecture today
  • Kept around for backwards compatibility reasons,
    because it's easy to implement (not to program
    for)
  • BUT: modern x86 processors decode instructions
    into micro-ops, which are then executed in a
    RISC manner
  • Bottom line: pipelining is a pervasive (and
    conveniently hidden) form of concurrency in
    computers today
  • Take a computer architecture course to know all
    about it

14
Concurrency within a CPU
  • Several techniques to allow concurrency within a
    single CPU
  • Pipelining
  • ILP
  • Vector units
  • On-Chip Multithreading

15
Instruction Level Parallelism
  • Instruction Level Parallelism is the set of
    techniques by which performance of a pipelined
    processor can be pushed even further
  • ILP can be done by the hardware
  • Dynamic instruction scheduling
  • Dynamic branch prediction
  • Multi-issue superscalar processors
  • ILP can be done by the compiler
  • Static instruction scheduling
  • Multi-issue VLIW (Very Long Instruction Word)
    processors
  • with multiple functional units
  • Broad concept: more than one instruction is
    issued per clock cycle
  • e.g., an 8-way multi-issue processor

16
Concurrency within a CPU
  • Several techniques to allow concurrency within a
    single CPU
  • Pipelining
  • ILP
  • Vector units
  • On-Chip Multithreading

17
Vector Units
  • A functional unit that can do element-wise
    operations on entire vectors with a single
    instruction, called a vector instruction
  • These are specified as operations on vector
    registers
  • A vector processor comes with some number of
    such registers
  • MMX extension on x86 architectures

(Diagram: two vector registers of #elts elements are added element-wise into a result register; all #elts additions happen in parallel.)
18
Vector Units
  • Typically, a vector register holds 32-64
    elements
  • But the number of elements is always larger than
    the amount of parallel hardware, called vector
    pipes or lanes, say 2-4

(Diagram: the same element-wise vector add, but only #elts / #pipes additions happen in parallel at any one time.)
19
MMX Extension
  • Many techniques that are initially implemented in
    the supercomputer market find their way to the
    mainstream
  • Vector units were pioneered in supercomputers
  • Supercomputers are mostly used for scientific
    computing
  • Scientific computing uses tons of arrays (to
    represent mathematical vectors) and often does
    regular computation with these arrays
  • Therefore, scientific code is easy to
    vectorize, i.e., to generate assembly that uses
    the vector registers and the vector instructions
  • Intel's MMX or PowerPC's AltiVec
  • MMX vector registers:
  • eight 8-bit elements
  • four 16-bit elements
  • two 32-bit elements
  • AltiVec twice the lengths
  • Used for multi-media applications
  • image processing
  • rendering
  • ...

20
Vectorization Example
  • Conversion from RGB to YUV
  • Y = ( 9798R + 19235G +  3736B) / 32768
  • U = (-4784R -  9437G +  4221B) / 32768 + 128
  • V = (20218R - 16941G -  3277B) / 32768 + 128
  • This kind of code is perfectly parallel as all
    pixels can be computed independently
  • Can be done easily with MMX vector capabilities
  • Load 8 R values into an MMX vector register
  • Load 8 G values into an MMX vector register
  • Load 8 B values into an MMX vector register
  • Do the *, +, and / in parallel
  • Repeat
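A minimal C sketch of this computation written as a plain loop over pixels, using the fixed-point coefficients from the slide; since every iteration is independent, a vectorizing compiler (or hand-written MMX/SSE intrinsics) can process 8 pixels of each channel at a time. Function name and types are illustrative, not from the slides.

    #include <stdint.h>

    /* RGB -> YUV conversion; each pixel is independent, so the loop
     * vectorizes cleanly onto 8-wide integer vector registers. */
    void rgb_to_yuv(const int16_t *R, const int16_t *G, const int16_t *B,
                    int16_t *Y, int16_t *U, int16_t *V, int n)
    {
        for (int i = 0; i < n; i++) {
            int32_t r = R[i], g = G[i], b = B[i];
            Y[i] = (int16_t)(( 9798 * r + 19235 * g +  3736 * b) / 32768);
            U[i] = (int16_t)((-4784 * r -  9437 * g +  4221 * b) / 32768 + 128);
            V[i] = (int16_t)((20218 * r - 16941 * g -  3277 * b) / 32768 + 128);
        }
    }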

21
Concurrency within a CPU
  • Several techniques to allow concurrency within a
    single CPU
  • Pipelining
  • ILP
  • Vector units
  • On-Chip Multithreading

22
Multi-threaded Architectures
  • Computer architecture is a difficult field to
    make innovations in
  • Who's going to spend money to manufacture your
    new idea?
  • Who's going to be convinced that a new compiler
    can/should be written?
  • Who's going to be convinced of a new approach to
    computing?
  • One of the cool innovations in the last decade
    has been the concept of a Multi-threaded
    Architecture

23
On-Chip Multithreading
  • Multithreading has been around for years, so
    what's new about this???
  • Here we're talking about Hardware Support for
    threads
  • Simultaneous Multi Threading (SMT)
  • SuperThreading
  • HyperThreading
  • Let's try to understand what all of these mean
    before looking at multi-threaded Supercomputers

24
Single-threaded Processor
  • The processor provides the illusion of concurrent
    execution
  • Front-end: fetching/decoding/reordering
  • Execution core: actual execution
  • Multiple programs in memory
  • Only one executes at a time
  • 4-issue CPU with bubbles
  • 7-unit CPU with pipeline bubbles
  • Time-slicing via context switching

25
Single-threaded SMP?
  • Two threads execute at once, so threads spend
    less time waiting
  • The number of bubbles is also doubled
  • Twice as much speed and twice as much waste

26
Super-threading
  • Principle: the processor can execute more than
    one thread at a time
  • Also called time-slice multithreading
  • The processor is then called a multithreaded
    processor
  • Requires more hardware cleverness
  • logic switches at each cycle
  • Leads to less Waste
  • A thread can run during a cycle while another
    thread is waiting for the memory
  • Just a finer grain of interleaving
  • But there is a restriction
  • Each stage of the front end or the execution core
    only runs instructions from ONE thread!
  • Does not help with poor instruction parallelism
    within one thread
  • Does not reduce bubbles within a row

27
Hyper-threading
  • Principle: the processor can execute more than
    one thread at a time, even within a single clock
    cycle!!
  • Requires even more hardware cleverness
  • logic switches within each cycle
  • On the diagram, only two threads execute
    simultaneously
  • Intel's hyper-threading only adds 5% to the die
    area
  • Some people argue that two is not "hyper"
  • Finest level of interleaving
  • From the OS perspective, there are two logical
    processors

28
Concurrency and Computers
  • We will see computer systems designed to allow
    concurrency (for performance benefits)
  • Concurrency occurs at many levels in computer
    systems
  • Within a CPU
  • Within a Box
  • Across Boxes

29
Concurrency within a Box
  • Two main techniques
  • SMP
  • Multi-core
  • Let's look at both of them

30
SMPs
  • Symmetric Multi-Processors
  • often mislabeled as Shared-Memory Processors,
    which has now become tolerated
  • Processors are all connected to a single memory
  • Symmetric: each memory cell is equally close to
    all processors
  • Many dual-proc and quad-proc systems
  • e.g., for servers

31
Distributed caches
  • The problem with distributed caches is that of
    memory consistency
  • Intuitive memory model
  • Reading an address should return the last value
    written to that address
  • Easy to do in uniprocessors
  • although there may be some I/O issues
  • But difficult in multi-processor / multi-core
  • Memory consistency: "A multiprocessor is
    sequentially consistent if the result of any
    execution is the same as if the operations of all
    the processors were executed in some sequential
    order, and the operations of each individual
    processor appear in this sequence in the order
    specified by its program." (Lamport, 1979)
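To make this concrete, here is a minimal, hypothetical two-thread C sketch (not from the slides): under sequential consistency, the consumer can never see the flag set while the data is still stale, but a weaker memory model (or the compiler) may reorder the accesses unless synchronization or atomics are used.

    #include <pthread.h>
    #include <stdio.h>

    int data = 0, flag = 0;

    static void *producer(void *arg) {
        (void)arg;
        data = 42;          /* store the data ...           */
        flag = 1;           /* ... then publish it          */
        return NULL;
    }

    static void *consumer(void *arg) {
        (void)arg;
        int r2 = flag;      /* may observe the two stores in a different order */
        int r1 = data;
        if (r2 == 1)
            printf("flag seen, data = %d\n", r1);  /* may print 0 on weak models */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, producer, NULL);
        pthread_create(&t2, NULL, consumer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }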

32
Cache Coherency
  • Memory consistency is jeopardized by having
    multiple caches
  • P1 and P2 both have a cached copy of a data item
  • P1 writes to it, possibly write-through to memory
  • At this point P2 owns a stale copy
  • When designing a multi-processor system, one must
    ensure that this cannot happen
  • By defining protocols for cache coherence

33
Snoopy Cache-Coherence
(Diagram: processors P0 ... Pn, each with its own cache, attached to a shared memory bus and memory modules; every cache snoops the bus, so a memory operation from any processor is seen by all the others.)
  • Memory bus is a broadcast medium
  • Caches contain information on which addresses
    they store
  • Cache Controller snoops all transactions on the
    bus
  • A transaction is a relevant transaction if it
    involves a cache block currently contained in
    this cache
  • Take action to ensure coherence
  • invalidate, update, or supply value

34
Limits of Snoopy Coherence
  • Assume:
  • 4 GHz processor
  • => 16 GB/s instruction BW per processor (32-bit)
  • => 9.6 GB/s data BW at 30% load-store of 8-byte
    elements
  • Suppose 98% instruction hit rate and 90% data hit
    rate
  • => 320 MB/s instruction BW per processor
  • => 960 MB/s data BW per processor
  • => 1.28 GB/s combined BW per processor
  • Assuming 10 GB/s bus bandwidth
  • => 8 processors will saturate the bus

(Diagram: each processor demands 25.6 GB/s from its cache, but only 1.28 GB/s per processor reaches memory over the shared bus after the caches filter out hits.)
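The slide's bandwidth arithmetic, written out as a small C sketch; the constants simply restate the assumptions above.

    #include <stdio.h>

    int main(void) {
        double inst_bw   = 4.0e9 * 4;             /* 4 GHz * 4-byte instructions = 16 GB/s   */
        double data_bw   = 4.0e9 * 0.30 * 8;      /* 30% load/store of 8-byte data = 9.6 GB/s */
        double inst_miss = inst_bw * (1 - 0.98);  /* 2% instruction misses -> 320 MB/s        */
        double data_miss = data_bw * (1 - 0.90);  /* 10% data misses       -> 960 MB/s        */
        double per_proc  = inst_miss + data_miss; /* 1.28 GB/s of bus traffic per processor   */
        double bus_bw    = 10.0e9;
        printf("per-processor bus traffic: %.2f GB/s\n", per_proc / 1e9);
        printf("processors to saturate a 10 GB/s bus: %.1f (about 8)\n",
               bus_bw / per_proc);
        return 0;
    }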
35
Sample Machines
  • Intel Pentium Pro Quad
  • Coherent
  • 4 processors
  • Sun Enterprise server
  • Coherent
  • Up to 16 processor and/or memory-I/O cards

36
Directory-based Coherence
  • Idea: implement a directory that keeps track of
    where each copy of a data item is stored
  • The directory acts as a filter
  • processors must ask permission for loading data
    from memory to cache
  • when an entry is changed, the directory either
    updates or invalidates the cached copies
  • Eliminates the overhead of broadcasting/snooping,
    and thus the bandwidth consumption
  • But it is slower in terms of latency
  • Used to scale up to numbers of processors that
    would saturate the memory bus
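A hypothetical sketch of what a full-map directory entry might look like, just to make the idea concrete; this is illustrative and not the protocol of any particular machine.

    #include <stdint.h>

    #define MAX_PROCS 64

    typedef enum { UNCACHED, SHARED, EXCLUSIVE } dir_state_t;

    typedef struct {
        dir_state_t state;    /* how the memory block may be cached            */
        uint64_t    sharers;  /* bit i set => processor i holds a copy         */
    } dir_entry_t;

    /* On a write by processor p: invalidate every other sharer, then mark
     * the block exclusive to p. */
    static void on_write(dir_entry_t *e, int p) {
        uint64_t others = e->sharers & ~(1ULL << p);
        for (int i = 0; i < MAX_PROCS; i++)
            if (others & (1ULL << i)) {
                /* send_invalidate(i) would go here (hypothetical message) */
            }
        e->sharers = 1ULL << p;
        e->state   = EXCLUSIVE;
    }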

37
Example machine
  • SGI Altix 3000
  • A node contains up to 4 Itanium 2 processors and
    32GB of memory
  • Uses a mixture of snoopy and directory-based
    coherence
  • Up to 512 processors that are cache coherent
    (global address space is possible for larger
    machines)

38
Sequential Consistency?
  • A lot of hardware and technology to ensure cache
    coherence
  • But the sequential consistency model may be
    broken anyway
  • The compiler reorders/removes code
  • Prefetch instructions cause reordering
  • The network may reorder two write messages
  • Basically, a bunch of things can happen
  • Virtually all commercial systems give up on the
    idea of maintaining strong sequential consistency
  • The programmer must program with weaker memory
    models than Sequential Consistency
  • Done with some rules
  • Avoid race conditions
  • Use system-provided synchronization primitives

39
Weaker models
  • The programmer must program with weaker memory
    models than Sequential Consistency
  • Done with some rules
  • Avoid race conditions
  • Use system-provided synchronization primitives
  • We will see how to program shared-memory machines
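As a concrete illustration of "use system-provided synchronization primitives", here is a minimal pthread sketch (hypothetical example, not from the slides) in which a mutex protects a shared counter, so correctness does not depend on the machine's memory model.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);    /* the lock orders the accesses ...   */
            counter++;                    /* ... so there is no race on counter */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* always 400000 */
        return 0;
    }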

40
Concurrency within a Box
  • Two main techniques
  • SMP
  • Multi-core

41
Moore's Law!
  • Many people interpret Moore's law as "computers
    get twice as fast every 18 months"
  • which is not technically true
  • it's really a statement about transistor density
  • But this popular interpretation no longer holds
    anyway
  • We should have 10 GHz processors right now
  • And we don't!

42
No more Moore?
  • We are used to getting faster CPUs all the time
  • We are used to them keeping up with ever more
    demanding software
  • Known as "Andy giveth, and Bill taketh away"
  • Andy Grove
  • Bill Gates
  • It's a nice way to force people to buy computers
    often
  • But basically, our computers get better, do more
    things, and it just happens automatically
  • Some people call this the performance "free
    lunch"
  • Conventional wisdom: "Not to worry, tomorrow's
    processors will have even more throughput, and
    anyway today's applications are increasingly
    throttled by factors other than CPU throughput
    and memory speed (e.g., they're often I/O-bound,
    network-bound, database-bound)."

43
Commodity improvements
  • There are three main ways in which commodity
    processors keep improving
  • Higher clock rate
  • More aggressive instruction reordering and
    concurrent units
  • Bigger/faster caches
  • All applications can easily benefit from these
    improvements
  • at the cost of perhaps a recompilation
  • Unfortunately, the first two are hitting their
    limit
  • Higher clock rates lead to more heat and power
    consumption
  • No more instruction reordering without
    compromising correctness

44
Is Moore's law not true?
  • Ironically, Moore's law is still true
  • The density indeed still doubles
  • But its popular (wrong) interpretation is not
  • Clock rates no longer double
  • But we can't let this happen: computers have to
    get more powerful
  • Therefore, the industry has thought of new ways
    to improve them: multi-core
  • Multiple CPUs on a single chip
  • Multi-core adds another level of concurrency
  • But unlike, say, multiple functional units, it is
    hard to compile for it automatically
  • Therefore, applications must be rewritten to
    benefit from the (nowadays expected) performance
    increase
  • "Concurrency is the next major revolution in how
    we write software" (Dr. Dobb's Journal, 30(3),
    March 2005)

45
Multi-core processors
  • In addition to putting concurrency in the
    public's eye, multi-core architectures will have
    a deep impact
  • Languages will be forced to deal well with
    concurrency
  • New language designs?
  • New language extensions?
  • New compilers?
  • Efficiency and performance optimization will
    become more important: write code that is fast on
    one core with a limited clock rate
  • The CPU may very well become a bottleneck (again)
    for single-core programs
  • Other factors will improve, but not the clock
    rate
  • Prediction: many companies will be hiring people
    to (re)write concurrent applications

46
Multi-Core
  • Quote from PC World Magazine, Summer 2005:
  • Don't expect dual-core to be the top performer
    today for games and other demanding
    single-threaded applications. But that will
    change as applications are rewritten. For
    example, by year's end, Unreal Tournament should
    have released a new game engine that takes
    advantage of dual-core processing.

47
Concurrency and Computers
  • We will see computer systems designed to allow
    concurrency (for performance benefits)
  • Concurrency occurs at many levels in computer
    systems
  • Within a CPU
  • Within a Box
  • Across Boxes

48
Multiple boxes together
  • Example
  • Take four boxes
  • e.g., four Intel Itaniums bought at Dell
  • Hook them up to a network
  • e.g., a switch bought at CISCO, Myricom, etc.
  • Install software that allows you to write/run
    applications that can utilize these four boxes
    concurrently (e.g., an MPI library; a sketch
    follows this list)
  • This is a simple way to achieve concurrency
    across computer systems
  • Everybody has heard of clusters by now
  • They are basically like the above example and can
    be purchased already built from vendors
  • We will talk about this kind of concurrent
    platform at length during this class
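A minimal sketch of the kind of software layer meant above: an MPI "hello world" in C that runs one process per box and lets them identify each other. This is illustrative only; the launch command (e.g., mpirun -np 4 ./hello) depends on the MPI installation, and MPI programming itself is covered later in the course.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the parallel runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?        */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total?  */
        printf("hello from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }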

49
Multiple Boxes Together
  • Why do we use multiple boxes?
  • Every programmer would rather have an
    SMP/multi-core architecture that provides all the
    power/memory she/he needs
  • The problem is that single boxes do not scale to
    meet the needs of many scientific applications
  • Can't have enough processors or powerful enough
    cores
  • Can't have enough memory
  • But if you can live with a single box, do it!
  • We will see that single-box programming is much
    easier than multi-box programming

50
Where does this leave us?
  • So far we have seen many ways in which
    concurrency can be achieved/implemented in
    computer systems
  • Within a box
  • Across boxes
  • So we could look at a system and just list all
    the ways in which it does concurrency
  • It would be nice to have a great taxonomy of
    parallel platforms in which we can pigeon-hole
    all (past and present) systems
  • Provides simple names that everybody can use and
    understand quickly

51
Taxonomy of parallel machines?
  • It's not going to happen
  • Only last year, Gordon Bell and Jim Gray
    published an article in Comm. of the ACM
    discussing what the taxonomy should be
  • Dongarra, Sterling, etc. answered telling them
    they were wrong and saying what the taxonomy
    should be, and proposing a new multi-dimensional
    scheme!
  • Both papers agree that most terms are conflated,
    misused, etc. (MPP)
  • Complicated by the fact that concurrency appears
    at so many levels
  • Example A 16-node cluster, where each node is a
    4-way multi-processor, where each processor is
    hyperthreaded, has vector units, and is fully
    pipelined with multiple, pipelined functional
    units

52
Taxonomy of platforms?
  • We'll look at one traditional taxonomy
  • We'll look at current categorizations from the
    Top500
  • We'll look at examples of platforms
  • We'll look at interesting/noteworthy
    architectural features that one should know as
    part of one's parallel computing culture

53
The Flynn taxonomy
  • Proposed in 1966!!!
  • Functional taxonomy based on the notion of
    streams of information: data and instructions
  • Platforms are classified according to whether
    they have a single (S) or multiple (M) stream of
    each of the above
  • Four possibilities
  • SISD (sequential machine)
  • SIMD
  • MIMD
  • MISD (rare, no commercial system... systolic
    arrays)

54
Taxonomy of Parallel Computers
  • Flynn's taxonomy of parallel computers.

55
SIMD
  • PEs can be deactivated and activated on-the-fly
  • Vector processing (e.g., vector add) is easy to
    implement on SIMD
  • Debate: is a vector processor an SIMD machine?
  • often confused
  • strictly not true according to the taxonomy (it's
    really SISD with pipelined operations)
  • but it's convenient to think of the two as
    equivalent

56
MIMD
  • Most general category
  • Pretty much every supercomputer in existence
    today is a MIMD machine at some level
  • This limits the usefulness of the taxonomy
  • But you had to have heard of it at least once
    because people keep referring to it, somehow...
  • Other taxonomies have been proposed, none very
    satisfying
  • Shared- vs. Distributed- memory is a common
    distinction among machines, but these days many
    are hybrid anyway

57
Taxonomy of Parallel Computers
  • A taxonomy of parallel computers.

58
A host of parallel machines
  • There are (have been) many kinds of parallel
    machines
  • For the last 12 years their performance has been
    measured and recorded with the LINPACK benchmark,
    as part of Top500
  • It is a good source of information about what
    machines are (were) and
    how they have evolved
  • Note that it's really about supercomputers
  • http://www.top500.org

59
LINPACK Benchmark?
  • LINPACK: LINear algebra PACKage
  • A FORTRAN library
  • Matrix multiply, LU/QR/Cholesky factorizations,
    eigensolvers, SVD, etc.
  • LINPACK Benchmark:
  • Dense linear system solve with LU factorization
  • 2/3 n^3 + O(n^2) operations
  • Measure: MFlops
  • The problem size can be chosen
  • You have to report the best performance for the
    best n, and the n that achieves half of the best
    performance.
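A minimal sketch of how the reported rate follows from the flop count; the problem size and time below are made-up illustrative values, not benchmark results.

    #include <stdio.h>

    int main(void) {
        double n = 10000.0;                      /* chosen problem size        */
        double time_in_seconds = 120.0;          /* hypothetical measured time */
        double flops = (2.0 / 3.0) * n * n * n;  /* + O(n^2), ignored here     */
        printf("Rate: %.1f MFlop/s\n", flops / time_in_seconds / 1e6);
        return 0;
    }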

60
What can we find on the Top500?
61
Pies
62
Top Ten Computers (http://www.top500.org)
63
Top 500 Computers -- Countries (http://www.top500.org)
64
Top 500 Computers -- Manufacturers (http://www.top500.org)
65
Top 500 Computers -- Manufacturers Trend (http://www.top500.org)
66
Top 500 Computers -- Operating Systems (http://www.top500.org)
67
Top 500 Computers -- Operating Systems Trend (http://www.top500.org)
68
Top 500 Computers -- Processors (http://www.top500.org)
69
Top 500 Computers -- Processors Trend (http://www.top500.org)
70
Top 500 Computers -- Customers (http://www.top500.org)
71
Top 500 Computers -- Customers Trend (http://www.top500.org)
72
Top 500 Computers -- Applications (http://www.top500.org)
73
Top 500 Computers -- Applications Trend (http://www.top500.org)
74
Top 500 Computers -- Architecture (http://www.top500.org)
75
Top 500 Computers -- Architecture Trend (http://www.top500.org)
76
SIMD
  • ILLIAC-IV, TMC CM-1, MasPar MP-1
  • Expensive logic for CU, but there is only one
  • Cheap logic for PEs and there can be a lot of
    them
  • 32 procs on 1 chip of the MasPar, 1024-proc
    system with 32 chips that fit on a single board!
  • 65,536 processors for the CM-1
  • Thinking Machines' gimmick was that the human
    brain consists of many simple neurons that are
    turned on and off, and so was their machine
  • CM-5
  • hybrid SIMD and MIMD
  • Death:
  • the machines are not popular anymore, but the
    programming model is
  • Vector processors are often labeled SIMD
    because that's in effect what they do,
    but they are not SIMD machines
  • Led to the MPP terminology (Massively Parallel
    Processor)
  • Ironic because none of today's MPPs are SIMD

77
SMPs
  • Symmetric MultiProcessors (often mislabeled as
    Shared-Memory Processors, which has now become
    tolerated)
  • Processors all connected to a (large) memory
  • UMA: Uniform Memory Access, which makes it easy
    to program
  • Symmetric: all memory is equally close to all
    processors
  • Difficult to scale to many processors (<32
    typically)
  • Cache Coherence via snoopy caches or
    directories

78
Distributed Shared Memory
  • Memory is logically shared, but physically
    distributed in banks
  • Any processor can access any address in memory
  • Cache lines (or pages) are passed around the
    machine
  • Cache coherence: distributed directories
  • NUMA: Non-Uniform Memory Access (some processors
    may be closer to some banks)
  • SGI Origin2000 is a canonical example
  • Scales to 100s of processors
  • Hypercube topology for the memory (later)

(Diagram: processors P1, P2, ..., Pn, each with a local memory bank, connected by a network; together the banks form one logically shared memory.)
79
Clusters, Constellations, MPPs
  • These are the only 3 categories today in the
    Top500
  • They all belong to the Distributed Memory model
    (MIMD) (with many twists)
  • Each processor/node has its own memory and cache
    but cannot directly access another processor's
    memory
  • nodes may be SMPs
  • Each node has a network interface (NI) for all
    communication and synchronization.
  • So what are these 3 categories?

80
Clusters
  • 58.2% of the Top500 machines are labeled as
    clusters
  • Definition: "Parallel computer system comprising
    an integrated collection of independent nodes,
    each of which is a system in its own right,
    capable of independent operation and derived from
    products developed and marketed for other
    standalone purposes"
  • A commodity cluster is one in which both the
    network and the compute nodes are available in
    the market
  • In the Top500, "cluster" means "commodity
    cluster"
  • A well-known type of commodity cluster is the
    Beowulf-class PC cluster, or "Beowulf"

81
What is Beowulf?
  • An experiment in parallel computing systems
  • Established vision of low cost, high end
    computing, with public domain software (and led
    to software development)
  • Tutorials and book for best practice on how to
    build such platforms
  • Today, by "Beowulf cluster" one means a
    commodity cluster that runs Linux and
    GNU-type software
  • Project initiated by T. Sterling and D.
    Becker at NASA in 1994

82
Constellations???
  • Commodity clusters that differ from the previous
    ones by the dominant level of parallelism
  • Clusters consist of nodes, and nodes are
    typically SMPs
  • If there are more procs in a node than nodes in
    the cluster, then we have a constellation
  • Typically, constellations are space-shared among
    users, with each user running openMP on a node,
    although an app could run on the whole machine
    using MPI/openMP
  • To be honest, this term is not very useful and
    not widely used

83
MPP????????
  • Probably the most imprecise term for describing a
    machine (isn't a 256-node cluster of 4-way SMPs
    massively parallel?)
  • May use proprietary networks or vector
    processors, as opposed to commodity components
  • Cray T3E, Cray X1, and Earth Simulator are
    distributed memory machines, but the nodes are
    SMPs
  • Basically, everything that's fast and not
    commodity is an MPP, in terms of today's Top500
  • Let's look at these non-commodity things
  • People's definition of "commodity" varies

84
Vector Processors
  • Vector architectures were based on a single
    processor
  • Multiple functional units
  • All performing the same operation
  • Instructions may specify large amounts of
    parallelism (e.g., 64-way) but hardware executes
    only a subset in parallel
  • Historically important
  • Overtaken by MPPs in the 90s as seen in Top500
  • Re-emerging in recent years
  • At a large scale in the Earth Simulator (NEC SX6)
    and Cray X1
  • At a small scale in SIMD media extensions to
    microprocessors
  • SSE, SSE2 (Intel Pentium/IA64)
  • Altivec (IBM/Motorola/Apple PowerPC)
  • VIS (Sun Sparc)
  • Key idea: the compiler does some of the difficult
    work of finding parallelism, so the hardware
    doesn't have to

85
Vector Processors
  • Advantages
  • quick fetch and decode of a single instruction
    for multiple operations
  • the instruction provides the processor with a
    regular source of data, which can arrive at each
    cycle and be processed in a pipelined fashion
  • The compiler does the work for you of course
  • Memory-to-memory
  • no registers
  • can process very long vectors, but startup time
    is large
  • appeared in the 70s and died in the 80s
  • Cray, Fujitsu, Hitachi, NEC

86
Global Address Space
  • Cray T3D, T3E, X1, and HP Alphaserver cluster
  • Network interface supports Remote Direct Memory
    Access
  • NI can directly access memory without
    interrupting the CPU
  • One processor can read/write memory with
    one-sided operations (put/get)
  • Not just a load/store as on a shared memory
    machine
  • Remote data is typically not cached locally

87
Cray X1 Parallel Vector Architecture
  • Cray combines several technologies in the X1
  • 12.8 Gflop/s Vector processors (MSP)
  • Shared caches (unusual on earlier vector
    machines)
  • 4 processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • Remote put/get between nodes (faster than
    explicit messaging)

88
Cray X1 the MSP
  • Cray X1 building block is the MSP
  • Multi-Streaming vector Processor
  • 4 SSPs (each a 2-pipe vector processor)
  • Compiler will (try to) vectorize/parallelize
    across the MSP, achieving streaming

(Diagram of an MSP, built from custom blocks: four SSPs (S), each with two vector pipes (V), running at 400/800 MHz; 12.8 Gflops at 64 bit and 25.6 Gflops at 32 bit; four 0.5 MB shared caches form a 2 MB Ecache with 25-41 GB/s of cache bandwidth; 25.6 GB/s to local memory and 12.8-20.5 GB/s to the network. Figure source: J. Levesque, Cray.)
89
Cray X1 A node
  • Shared memory
  • 32 network links and four I/O links per node

90
Cray X1 32 nodes
Fast Switch
91
Cray X1 128 nodes
92
Cray X1 Parallelism
  • Many levels of parallelism
  • Within a processor: vectorization
  • Within an MSP: streaming
  • Within a node: shared memory
  • Across nodes: message passing
  • Some are automated by the compiler, some require
    work by the programmer
  • This is a common theme
  • The more complex the architecture, the more
    difficult it is for the programmer to exploit it
  • Hard to fit this machine into a simple taxonomy
  • Similar story for the Earth Simulator

93
The Earth Simulator (NEC)
  • Each node:
  • Shared memory (16 GB)
  • 8 vector processors + an I/O processor
  • 640 nodes fully connected by a 640x640 crossbar
    switch
  • Total: 5120 8-GFlop processors => 40 TFlop/s peak

94
Blue Gene/L
  • 65,536 processors
  • Relatively modest clock rates, so that power
    consumption is low, cooling is easy, and space is
    small (1024 nodes in the same rack)
  • Besides, processor speed is on par with the
    memory speed so faster clock rate does not help
  • 2-way SMP nodes (really different from the X1)
  • several networks
  • 64x32x32 3-D torus for point-to-point
  • tree for collective operations and for I/O
  • plus others: Ethernet, etc.

95
BlueGene
  • The BlueGene/L custom processor chip.

96
BlueGene
  • The BlueGene/L. (a) Chip. (b) Card. (c) Board.
  • (d) Cabinet. (e) System.

97
Red Storm
  • Packaging of the Red Storm components.

98
Red Storm
  • The Red Storm system as viewed from above.

99
If you like dead Supercomputers
  • Lots of old supercomputers w/ pictures
  • http://www.geocities.com/Athens/6270/superp.html
  • Dead Supercomputers
  • http://www.paralogos.com/DeadSuper/Projects.html
  • e-Bay
  • Cray Y-MP/C90, 1993
  • $45,100.70
  • From the Pittsburgh Supercomputer Center, which
    wanted to get rid of it to make space in its
    machine room
  • Original cost: $35,000,000
  • Weight: 30 tons
  • It cost $400,000 to make it work at the buyer's
    ranch in Northern California

100
Network Topologies
  • People have experimented with different
    topologies for distributed memory machines, or to
    arrange memory banks in NUMA shared-memory
    machines
  • Examples include:
  • Ring: KSR (1991)
  • 2-D grid: Intel Paragon (1992)
  • Torus
  • Hypercube: nCube, Intel iPSC/860; used in the
    SGI Origin 2000 for memory
  • Fat-tree: IBM Colony and Federation interconnects
    (SP-x)
  • Arrangement of switches
  • pioneered with Butterfly networks like the
    BBN TC2000 in the early 1990s
  • 200 MHz processors in a multi-stage network of
    switches
  • Virtually Shared Distributed memory (NUMA)
  • I actually worked with that one!

101
Hypercube
  • Defined by its dimension, d

(Figure: hypercubes of dimension 1D, 2D, 3D, and 4D.)
102
Hypercube
  • Properties:
  • Has 2^d nodes
  • The number of hops between two nodes is at most d
  • The diameter of the network grows logarithmically
    with the number of nodes, which was the key
    reason for interest in hypercubes
  • But each node needs d neighbors, which is a
    problem
  • Routing and Addressing

  • d-bit node addresses
  • routing from xxxx to yyyy: just keep going to a
    neighbor that has a smaller Hamming distance to
    the destination (see the sketch below)
  • reminiscent of some p2p schemes
  • TONS of Hypercube research (even today!!)

(Figure: a 4-D hypercube with nodes labeled by their 4-bit addresses, 0000 through 1111.)
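A minimal C sketch of this routing rule (the helper names are hypothetical, not from the slides): at each hop, flip one bit in which the current node differs from the destination, so the remaining Hamming distance drops by one per link crossed.

    #include <stdio.h>

    /* Print a node's d-bit address, most significant bit first. */
    static void print_addr(unsigned node, int d) {
        for (int b = d - 1; b >= 0; b--)
            putchar(((node >> b) & 1) ? '1' : '0');
    }

    /* Route from src to dst in a d-dimensional hypercube by repeatedly
     * flipping the lowest bit in which the current node differs from dst. */
    static void route(unsigned src, unsigned dst, int d) {
        unsigned cur = src;
        print_addr(cur, d);
        while (cur != dst) {
            unsigned diff = cur ^ dst;
            cur ^= diff & -diff;        /* flip the lowest differing bit */
            printf(" -> ");
            print_addr(cur, d);
        }
        putchar('\n');
    }

    int main(void) {
        route(0x0, 0xF, 4);   /* e.g., 0000 -> 0001 -> 0011 -> 0111 -> 1111 */
        return 0;
    }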
103
Conclusion
  • Concurrency appears at all levels
  • Both in commodity systems and in
    supercomputers
  • The distinction is rather annoying
  • When needing performance, one has to exploit
    concurrency to the best of the platform's
    capabilities
  • e.g., as a developer of a geophysics application
    to run on a 10,000-processor heavy-iron
    supercomputer at the Sandia national lab
  • e.g., as a game developer on an 8-way multi-core
    hyper-threaded desktop system sold by Dell
  • In this course we'll gain a hands-on
    understanding of how to write concurrent/parallel
    software
  • Using the GCB cluster