Introduction to Distributed-Memory Computing

Provided by: henrica
1
Introduction to Distributed-Memory Computing
2
More Concurrency
  • So far we have talked about concurrency within a
    Box
  • Within a processor
  • Pipelining
  • Multiple functional units
  • Instruction Level Parallelism
  • Hyper-Threading
  • Across processors
  • Multi-proc systems
  • Multi-core systems
  • Multi-proc/core systems
  • But this can only get us so far for many
    applications
  • We're limited by the number of processors we can
    put in a single box
  • We're limited by the size of the memory we can
    put in a single box

3
Toward Distributed Memory
  • Although for many applications one would rather
    write simple non-concurrent code, one has to go
    to concurrency because we need more cycles and
    thus have to work with multi-core/proc systems
  • Number of cycles in a non-concurrent processor is
    limited by technology and cost
  • Although for many applications one would rather
    write code that runs within a single system, one
    has to use multiple systems because we need
    (even) more cycles and/or a larger memory than
    available in a single system
  • The size of memory is limited by technology and
    cost
  • The number of processor cores is limited by
    technology and cost
  • Therefore, we often have to use multiple systems
  • Note that that's because we're forced to do it,
    not because we want to do it (although it's
    intellectually challenging and sometimes
    considered fun and cool)

4
Distributed Memory Computing
  • Distributed memory platforms
  • so-called supercomputers
  • Issues when writing distributed memory programs

5
A host of parallel machines
  • There are (have been) many kinds of parallel
    machines
  • For the last 12 years their performance has been
    measured and recorded with the LINPACK benchmark,
    as part of Top500
  • It is a good source of information about what
    machines are and how they have evolved
  • Note that it's really about supercomputers
  • http://www.top500.org

6
LINPACK Benchmark?
  • LINPACK: LINear algebra PACKage
  • A FORTRAN library
  • Matrix multiply, LU/QR/Choleski factorizations,
    eigensolvers, SVD, etc.
  • LINPACK Benchmark
  • Dense linear system solve with LU factorization
  • 2/3 n^3 + O(n^2) floating-point operations
  • Measures Mflop/s
  • The problem size can be chosen
  • You have to report the best performance for the
    best n, and the n that achieves half of the best
    performance.
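
As a rough illustration of how a benchmark figure follows from the operation count above, the sketch below converts a problem size and a measured solve time into Mflop/s. The function names are hypothetical, and the sketch assumes only the leading 2/3 n^3 term is counted (the lower-order term is dropped).

```c
/* Nominal operation count for an n x n dense solve.
   Assumption of this sketch: only the leading 2/3 n^3 term is counted. */
double linpack_flops(double n)
{
    return 2.0 / 3.0 * n * n * n;
}

/* Achieved rate in Mflop/s for an n x n solve that took `seconds`. */
double linpack_mflops(double n, double seconds)
{
    return linpack_flops(n) / seconds / 1e6;
}
```

For example, a solve with n = 3000 that takes 18 seconds corresponds to roughly 1000 Mflop/s under this count.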

7
What can we find on the Top500?
8
Pies
9
Pies
10
Pies
11
Pies
12
Pies
13
Platform Architectures
14
Clusters, Constellations, MPPs
  • These are the only 3 categories today in the
    Top500
  • They all belong to the Distributed Memory model
    (MIMD) (with many twists)
  • Each processor/node has its own memory and cache
    but cannot directly access another processor's
    memory.
  • Nodes may be SMPs
  • Each node has a network interface (NI) for all
    communication and synchronization.

15
Clusters
  • 80% of the Top500 machines are labeled as
    clusters
  • Definition: Parallel computer system comprising
    an integrated collection of independent nodes,
    each of which is a system in its own right,
    capable of independent operation and derived from
    products developed and marketed for other
    standalone purposes
  • A commodity cluster is one in which both the
    network and the compute nodes are available in
    the market
  • In the Top500, "cluster" means "commodity
    cluster"
  • A well-known type of commodity cluster is the
    Beowulf-class PC cluster, or "Beowulf"

16
What is Beowulf?
  • An experiment in parallel computing systems
  • Established vision of low cost, high end
    computing, with public domain software (and led
    to software development)
  • Tutorials and book for best practice on how to
    build such platforms
  • Today, by "Beowulf cluster" one means a
    commodity cluster that runs Linux and
    GNU-type software
  • Project initiated by T. Sterling and D.
    Becker at NASA in 1994

17
MPP????????
  • Probably the most imprecise term for describing a
    machine (isn't a 256-node cluster of 4-way SMPs
    massively parallel?)
  • May use proprietary networks, vector processors,
    as opposed to commodity components
  • Basically, everything that's fast and not
    commodity is an MPP, in terms of today's Top500.
  • Let's look at these non-commodity things
  • People's definition of commodity varies

18
Cray X1 Parallel Vector Architecture
  • Cray combines several technologies in the X1
  • 12.8 Gflop/s Vector processors (MSP)
  • Shared caches (unusual on earlier vector
    machines)
  • 4 processor nodes sharing up to 64 GB of memory
  • Single System Image to 4096 Processors
  • Remote put/get between nodes (faster than
    explicit messaging)

19
Cray X1: the MSP
  • Cray X1 building block is the MSP
  • Multi-Streaming vector Processor
  • 4 SSPs (each a 2-pipe vector processor)
  • Compiler will (try to) vectorize/parallelize
    across the MSP, achieving streaming

[Figure: MSP block diagram built from custom blocks. Four scalar units (S) and eight vector units (V); 12.8 Gflops (64 bit), 25.6 Gflops (32 bit); four 0.5 MB shared caches forming a 2 MB Ecache; frequency of 400/800 MHz. Bandwidths shown: 25-41 GB/s; to local memory and network: 25.6 GB/s and 12.8 - 20.5 GB/s. Figure source: J. Levesque, Cray]
20
Cray X1: A node
  • Shared memory
  • 32 network links and four I/O links per node

21
Cray X1: 32 nodes
Fast Switch
22
Cray X1: 128 nodes
23
Cray X1: Parallelism
  • Many levels of parallelism
  • Within a processor: vectorization
  • Within an MSP: streaming
  • Within a node: shared memory
  • Across nodes: message passing
  • Some are automated by the compiler, some require
    work by the programmer
  • This is a common theme
  • The more complex the architecture, the more
    difficult it is for the programmer to exploit it
  • Hard to fit this machine into a simple taxonomy
  • Similar story for the Earth Simulator

24
The Earth Simulator (NEC)
  • Each node:
  • Shared memory (16GB)
  • 8 vector processors + I/O processor
  • 640 nodes fully-connected by a 640x640 crossbar
    switch
  • Total: 5120 8-GFlop processors -> 40 TFlop peak

25
Blue Gene/L
  • 65,536 processors
  • Relatively modest clock rates, so that power
    consumption is low, cooling is easy, and space is
    small (1024 nodes in the same rack)
  • Besides, processor speed is on par with the
    memory speed so faster clock rate does not help
  • 2-way SMP nodes (really different from the X1)
  • several networks
  • 64x32x32 3-D torus for point-to-point
  • tree for collective operations and for I/O
  • plus other Ethernet, etc.

26
If you like dead Supercomputers
  • Lots of old supercomputers w/ pictures
  • http://www.geocities.com/Athens/6270/superp.html
  • Dead Supercomputers
  • http://www.paralogos.com/DeadSuper/Projects.html
  • eBay
  • Cray Y-MP/C90, 1993
  • $45,100.70
  • From the Pittsburgh Supercomputing Center, who
    wanted to get rid of it to make space in their
    machine room
  • Original cost: $35,000,000
  • Weight: 30 tons
  • Cost $400,000 to make it work at the buyer's
    ranch in Northern California

27
Distributed Memory Programming
  • So this is all well and good, we can put tons of
    machines together
  • The big question is: how do we write code for
    something like this?
  • The application now consists of multiple
    processes running on different machines
  • Each process can consist of multiple threads!
  • Let's look at a picture

28
Distributed Memory Platform
[Figure labels: hyper-threaded processor core, dual-core chip, dual-core system, L1 caches]
29
Distributed Memory Platform
[Figure labels: hyper-threaded processor core, dual-core chip, dual-core system, L1 caches, 8-way switch, cluster of dual-core systems]
30
Distributed Memory Program
8-way Switch
  • 8 processes
  • Each process may contain 4 threads
  • 2 threads are running on each core using hyper
    threading
  • Each process may contain more or fewer threads

31
Distributed Memory Program
8-way Switch
  • Each process stores some data in the memory of
    its box
  • Let's see an example

32
Distributed-Memory Heat Equation
  • Say you want to solve the Heat Transfer equation
  • This application really looks like an image
    processing filter
  • You just run it multiple times in a row

[Figure: each new value is obtained by applying f to a cell and its neighbors]
33
Sample Stencil App Code
  • The code could look something like this:

  int a[N][N], a_new[N][N];
  for (i = 1; i < N-1; i++)
    for (j = 1; j < N-1; j++)
      a_new[i][j] = f(a[i][j],
                      a[i-1][j], a[i+1][j],
                      a[i][j-1], a[i][j+1]);

  • Probably with threads, etc.
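
To make the stencil sweep concrete, here is a minimal runnable sketch. The slides do not define f, so the averaging function below is a hypothetical stand-in, and the grid is stored as a flat row-major array of doubles (both are assumptions of this sketch, not the author's code).

```c
#include <stddef.h>

/* Hypothetical update function: average of a cell and its four neighbors. */
static double f(double c, double up, double down, double left, double right)
{
    return (c + up + down + left + right) / 5.0;
}

/* One sweep over the interior of an n x n grid stored row-major.
   Border cells of a_new are left untouched. */
void stencil_step(size_t n, const double *a, double *a_new)
{
    for (size_t i = 1; i + 1 < n; i++)
        for (size_t j = 1; j + 1 < n; j++)
            a_new[i * n + j] = f(a[i * n + j],
                                 a[(i - 1) * n + j], a[(i + 1) * n + j],
                                 a[i * n + j - 1], a[i * n + j + 1]);
}
```

Running the filter "multiple times in a row" then amounts to calling stencil_step repeatedly while swapping the roles of the two arrays.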

34
Too Large?
  • This is all well and good, but what if my array
    requires 8GB of memory and I only have 1GB of
    RAM?
  • I could think of just relying on virtual memory
  • This is bound to be very slow
  • I could manage the reads and writes to disk
    myself
  • Could be a bit faster than virtual memory if I am
    really clever, but would be complicated and still
    slow
  • Called an "out-of-core" implementation
  • Or, I could use 8 machines with 1GB RAMs and run
    fast without really ever swapping between the
    memory and the disk!
  • For instance, we can use a cluster!

35
How do we write the program?
  • Of course, the big question is how do we write
    the code
  • We cannot have a declaration of an NxN array any
    more, because that would not fit in memory
  • Each process (running on a different system) must
    handle an array of size N x N/8
  • Each process allocates memory for 1/8 of the
    overall array
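
A minimal sketch of the per-process allocation described above, assuming the number of processes divides N evenly. The helper names are hypothetical, and the block is stored as a flat array of doubles.

```c
#include <stdlib.h>

/* Rows owned by each process, assuming num_procs divides n evenly. */
size_t local_rows(size_t n, size_t num_procs)
{
    return n / num_procs;
}

/* Allocate one process's share of the n x n array:
   local_rows x n doubles, zero-initialized. Returns NULL on failure. */
double *alloc_local_block(size_t n, size_t num_procs)
{
    return calloc(local_rows(n, num_procs) * n, sizeof(double));
}
```

With 8 processes, each call reserves one eighth of the memory the full array would need; the caller is responsible for calling free() on the result.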

36
Data Distribution
37
Data Distribution
38
Data Distribution
process 1
process 3
process 2
  • Each piece of the image is stored in the memory
    of a different system
  • A process running on one system can only see
    (i.e., address) the local image piece, and has no
    way to address other pieces
  • This is what makes distributed memory programming
    MUCH harder than shared-memory programming

39
Boundaries!
  • One of the problems now is what happens at the
    boundaries/edges of the image tiles?
  • Process 1 needs pixels from process 2
  • Process 2 needs pixels from process 1
  • The two processes cannot share memory because
    they're on different systems!
  • We cannot just turn them into threads as in
    shared-memory programming

process 1
process 2
40
Message-Passing
  • Since processes cannot share memory, they have to
    exchange messages
  • "Here are the pixels you need from me, give me
    the ones I need from you"
  • This type of programming is called
    message-passing
  • Uses network communication
  • e.g., sockets and TCP
  • So your code will have special function calls
  • Send(...)
  • Receive(...)
  • We're getting further away from simple
    shared-memory programming

41
SPMD Program
  • So at this point, we could
  • implement 8 different programs
  • start them up somehow on different nodes of our
    cluster (for instance)
  • have them all somehow identify their left and
    right neighbors, if any
  • Turns out that this is really cumbersome
  • And if I want to use 1000 processes, I have to
    write 1000 programs?
  • Typically one uses/implements the notion of a
    process rank

42
Process Ranks
  • To identify the processes participating in the
    computation, each process is assigned an index
    (its rank) from 0 to P-1, where P is the number
    of processes
  • And each process can find out what its rank is
    and how many processes there are in total

0 1 2 3 4 5 6 7
43
Communication Patterns
0 1 2 3 4 5 6 7
  • Process 0 will send to 1 and receive from 1
  • Process 1 will send to 0, receive from 0, send to
    2, and receive from 2
  • ...
  • Process 7 will receive from 6 and send to 6
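
The pattern above can be expressed as a function of the rank alone. The sketch below is one way to do it; the function names and the NO_NEIGHBOR sentinel are hypothetical, and "above" means the next lower rank, as in the strip layout used in these slides.

```c
/* Hypothetical sentinel meaning "no neighbor on this side". */
#define NO_NEIGHBOR (-1)

/* Rank of the neighbor above (rank - 1), or NO_NEIGHBOR for rank 0. */
int above_neighbor(int rank)
{
    return rank > 0 ? rank - 1 : NO_NEIGHBOR;
}

/* Rank of the neighbor below (rank + 1), or NO_NEIGHBOR for the last rank. */
int below_neighbor(int rank, int num_procs)
{
    return rank < num_procs - 1 ? rank + 1 : NO_NEIGHBOR;
}
```

With 8 processes, rank 0 has only a below neighbor (1), rank 7 has only an above neighbor (6), and every other rank has both.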

44
SPMD Programming
  • If every process can find out its rank and the
    total number of processes, then one can write a
    Single Program to operate on Multiple pieces of
    Data simultaneously (SPMD)

  int main() {
    if (my_rank() == 0) {
      // talk to my below neighbor
    } else if (my_rank() == num_processes() - 1) {
      // talk to my above neighbor
    } else {
      // talk to my above and below neighbors
    }
  }
45
Ranks and Number of Processes
  • For now we're going to assume we have the
    my_rank() and the num_processes() functions, and
    that all the logistics of starting up the
    processes are taken care of
  • We will see later that there are standard ways to
    make this happen
  • But this can also be implemented by hand if
    necessary
  • At any rate, the way to write distributed memory
    programs is to rely on the process ranks
    assumption

46
Writing the SPMD Program
  • The pseudo-code of the SPMD program could then
    look like:

  int main() {
    int M = N / num_processes();  // assumed to be integer!
    double original_image[M][N];
    double new_image[M][N];
    // load my part of the image from disk
    // compute all the pixels that do not require communication
    // send pixels to my neighbor(s)
    // receive pixels from my neighbor(s)
    // compute the remaining pixels
    // save the new image to file
  }
47
Writing the SPMD Program
  • For now, let's ignore the issue of
    loading/writing files to disk
  • There are a lot of options here, simple/slow
    ones, and complex/fast ones
  • Let's focus on computation and communication

48
Computing the easy pixels
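[Figure: the image, N pixels wide, is cut into horizontal strips of M rows, one per process (0, 1, 2, ...). Within each strip, the interior rows can be computed without communication; the first and last rows require pixels from neighbors.]
(Note that process 0 and the last process can compute one more row than the others without any communication.)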
49
Computing the easy pixels
  for (j = 0; j < N; j++) {
    // (the edge columns j = 0 and j = N-1 would need special handling,
    //  which is glossed over here)
    if (my_rank() == 0)  // top process can compute an extra row
      new_image[0][j] = f(original_image[0][j],
                          original_image[0][j-1], original_image[0][j+1],
                          original_image[1][j]);
    if (my_rank() == num_processes()-1)  // bottom process can compute an extra row
      new_image[M-1][j] = f(original_image[M-1][j],
                            original_image[M-1][j-1], original_image[M-1][j+1],
                            original_image[M-2][j]);
    for (i = 1; i < M-1; i++)  // Everybody computes the middle M-2 rows
      new_image[i][j] = f(original_image[i][j],
                          original_image[i+1][j], original_image[i-1][j],
                          original_image[i][j-1], original_image[i][j+1]);
  }
50
Global/Local Index
  • One of the reasons distributed memory
    programming is difficult is the discrepancy
    between global and local indices
  • When I think globally of the whole image, I
    know where the pixel at coordinates (100,100) is
  • But when I write the code, I will not reference
    the pixel as image[100][100]!
  • Let's look at this in an example

51
Global/Local Index
Process 0
Process 1
  • The red pixel's global coordinates are (5,1)
  • The pixel on the 6th row and the 2nd column of
    the big array
  • But when Process 1 references it, it must use
    coordinates (1,1)
  • The pixel on the 2nd row and the 2nd column of
    the tile that's stored in Process 1
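
The conversion between global and local row indices can be sketched as below. The names are hypothetical, and the sketch assumes every process holds the same number of consecutive rows; the example on this slide (global row 5 is local row 1 on Process 1) is consistent with 4 rows per process, which is an inference rather than something the slide states. Column indices are unchanged because the image is split by rows.

```c
/* Location of a global row: which rank owns it and its row index there. */
struct local_pos {
    int rank;
    int row;
};

/* Map a global row to (owner rank, local row), assuming every process
   holds rows_per_proc consecutive rows. */
struct local_pos global_to_local(int global_row, int rows_per_proc)
{
    struct local_pos p = { global_row / rows_per_proc,
                           global_row % rows_per_proc };
    return p;
}

/* Inverse mapping: global row of local_row on the given rank. */
int local_to_global(int rank, int local_row, int rows_per_proc)
{
    return rank * rows_per_proc + local_row;
}
```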

52
Message Passing
  • Let's assume that we have a send() function that
    takes as arguments
  • The rank of the destination process
  • An address in local memory
  • A size (in bytes)
  • Let's assume that we have a recv() function that
    takes as arguments
  • An address in local memory
  • A size (in bytes)

53
A Process's Memory
  • original_image (MxN): first row sent to above
    neighbor, middle rows not communicated, last row
    sent to below neighbor
  • buffer_top (1xN): received from above neighbor
  • buffer_bottom (1xN): received from below neighbor
  • new_image (MxN): first and last rows updated with
    received data, middle rows updated w/o using
    received data
54
Sending/Receiving Pixels
  double buffer_top[N], buffer_bottom[N];
  if (my_rank() != 0) {  // send to and receive from above neighbor
    send(my_rank()-1, &(original_image[0][0]), sizeof(double)*N);
    recv(buffer_top, sizeof(double)*N);
  }
  if (my_rank() != num_processes()-1) {  // send to and receive from below neighbor
    send(my_rank()+1, &(original_image[M-1][0]), sizeof(double)*N);
    recv(buffer_bottom, sizeof(double)*N);
  }
  // assumes non-blocking sending
55
Computing Remaining Pixels
  if (my_rank() != 0) {  // update top pixels
    for (j = 0; j < N; j++)
      new_image[0][j] = f(original_image[0][j],
                          original_image[1][j], buffer_top[j],
                          original_image[0][j-1], original_image[0][j+1]);
  }
  if (my_rank() != num_processes()-1) {  // update bottom pixels
    for (j = 0; j < N; j++)
      new_image[M-1][j] = f(original_image[M-1][j],
                            buffer_bottom[j], original_image[M-2][j],
                            original_image[M-1][j-1], original_image[M-1][j+1]);
  }

56
We're done!
  • At this point, we have written the whole code
  • What's missing is I/O
  • Read the image in
  • Write the image out
  • Dealing with I/O (efficiently) is a difficult
    problem, and we won't really talk about it in
    depth
  • And of course we need to use a tool that provides
    the my_rank(), the num_processes(), the send()
    and the recv() functions
  • Each process allocates 1xN + 1xN + 2(M/P)xN =
    (2M/P + 2)N pixels, where P is the number of
    processors (M here is the number of rows of the
    whole image)
  • Therefore, the total number of pixels allocated
    is 2MN + 2NP
  • 2NP more pixels are allocated than in the
    sequential version
  • But it's insignificant when spread across
    multiple systems
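
The memory accounting above can be checked numerically with a small sketch. The function names are hypothetical; m is the number of rows of the whole image, n its width, and p the number of processes, with p assumed to divide m.

```c
/* Pixels allocated by one process: two 1 x n receive buffers plus two
   (m/p) x n tiles (original and new). Assumes p divides m. */
long pixels_per_process(long m, long n, long p)
{
    return 2 * n + 2 * (m / p) * n;
}

/* Pixels allocated across all p processes: 2mn + 2np. */
long pixels_total(long m, long n, long p)
{
    return p * pixels_per_process(m, n, p);
}

/* Pixels allocated by the sequential version: two m x n arrays. */
long pixels_sequential(long m, long n)
{
    return 2 * m * n;
}
```

For a 1024 x 1024 image on 8 processes, the overhead is 2 * 1024 * 8 = 16384 pixels on top of 2097152, which is under 1%.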

57
The Code
  int main() {
    int i, j, M = N / num_processes();  // assumed to be integer!
    double original_image[M][N], new_image[M][N];
    double buffer_top[N], buffer_bottom[N];
    for (j = 0; j < N; j++) {
      if (my_rank() == 0)  // top process can compute an extra row
        new_image[0][j] = f(original_image[0][j],
                            original_image[0][j-1], original_image[0][j+1],
                            original_image[1][j]);
      if (my_rank() == num_processes()-1)  // bottom process can compute an extra row
        new_image[M-1][j] = f(original_image[M-1][j],
                              original_image[M-1][j-1], original_image[M-1][j+1],
                              original_image[M-2][j]);
      for (i = 1; i < M-1; i++)  // Everybody computes the middle M-2 rows
        new_image[i][j] = f(original_image[i][j],
                            original_image[i+1][j], original_image[i-1][j],
                            original_image[i][j-1], original_image[i][j+1]);
    }
    if (my_rank() != 0) {  // send to and receive from above neighbor
      send(my_rank()-1, &(original_image[0][0]), sizeof(double)*N);
      recv(buffer_top, sizeof(double)*N);
    }
    ...

58
The OpenMP Code
  int main() {
    int i, j;
    int original_image[N][N], new_image[N][N];
    #pragma omp parallel for private(i,j) shared(original_image, new_image)
    for (i = 1; i < N-1; i++)
      for (j = 1; j < N-1; j++)
        new_image[i][j] = f(original_image[i][j],
                            original_image[i+1][j], original_image[i-1][j],
                            original_image[i][j-1], original_image[i][j+1]);
  }
  • And in the distributed memory code we made the
    simplifying assumption that P divides N;
    removing that assumption would increase the
    complexity of the distributed memory code (but
    not of the shared-memory code!)
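
To give a sense of the extra bookkeeping needed when P does not divide N, here is a sketch of one common block distribution, in which the first N mod P ranks each take one extra row. This scheme and the function names are illustrative assumptions; the slides do not say how the general case would be handled.

```c
/* Rows owned by `rank` when n rows are split over p ranks as evenly as
   possible: the first n % p ranks get one extra row. */
int rows_for_rank(int n, int p, int rank)
{
    return n / p + (rank < n % p ? 1 : 0);
}

/* Global index of the first row owned by `rank` under the same scheme. */
int first_row_of_rank(int n, int p, int rank)
{
    int extra = n % p;
    return rank * (n / p) + (rank < extra ? rank : extra);
}
```

With 10 rows on 4 processes, the ranks own 3, 3, 2 and 2 rows, starting at global rows 0, 3, 6 and 8. Every loop bound and message size in the distributed code would then depend on the rank, whereas the shared-memory loop is unchanged.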

59
Overlapping Comp and Comm
  • One of the complexities of writing distributed
    memory programs is to hide the cost of
    communication
  • Again, we'd like to pretend we have a big
    shared-memory machine without a network at all
  • It's all very similar to what we did with the
    image processing application (HW5)
  • In the previous example, as opposed to doing
  • compute easy pixels, send, recv, compute
    remaining pixels
  • one should do
  • send, compute easy pixels, recv, compute
    remaining pixels

60
Hybrid Parallelism
  • In a cluster, individual systems are
    multi-proc/core
  • Therefore, one should use multiple threads within
    each system
  • This can be done by adding a few deft OpenMP
    pragmas to the distributed memory code
  • For instance:

  #pragma omp parallel for private(i,j) shared(original_image, new_image)
  for (i = 1; i < M-1; i++)  // Everybody computes the middle M-2 rows
    for (j = 0; j < N; j++)
      new_image[i][j] = f(original_image[i][j],
                          original_image[i+1][j], original_image[i-1][j],
                          original_image[i][j-1], original_image[i][j+1]);

61
Conclusion
  • Writing distributed memory code is much more
    complex than writing shared memory code
  • One must identify what must be communicated
  • One must keep a mental picture of the memory
    across systems
  • All this in addition to all the concerns we have
    mentioned in class
  • e.g., cache reuse, synchronization among threads
  • And the typical problems of shared memory are
    still there
  • There can be communication deadlocks, for
    instance
  • An in-depth study of distributed-memory
    programming belongs to a graduate-level class
  • But it's likely that you'll end up at some point
    writing distributed applications with data
    distribution among disjoint processes