Shared Memory and Message Passing

Slides: 33

Provided by: csewe4

Learn more at: https://cseweb.ucsd.edu

1
Shared Memory and Message Passing

WA 1.2.1, 1.2.2, 1.2.3, 2.1, 2.1.2, 2.1.3, 8.3,
9.2.1, 9.2.2; Akl 2.5.2

2
Models for Communication

Parallel program = program composed of tasks
(processes) which communicate to accomplish an
overall computational goal
Two prevalent models for communication
Message passing (MP)
Shared memory (SM)
This lecture will focus on MP and SM computing

3
Message Passing Communication

Processes in a message passing program communicate
by passing messages
Basic message passing primitives
Send(parameter list)
Receive(parameter list)
Parameters depend on the software and can be
complex

[Figure: process A and process B]
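
The two primitives above can be sketched in a few lines. This is only an illustration, assuming Python threads stand in for processes A and B and a FIFO queue stands in for the network; the names `mailboxes`, `send`, `receive`, and `run_exchange` are hypothetical and not taken from the slides or from any message passing library.

```python
import queue
import threading

# Hypothetical channel model: one FIFO mailbox per receiving process.
mailboxes = {"A": queue.Queue(), "B": queue.Queue()}


def send(dest, message):
    """Deposit a message in the destination's mailbox."""
    mailboxes[dest].put(message)


def receive(me):
    """Wait until a message addressed to `me` arrives, then return it."""
    return mailboxes[me].get()


def run_exchange(value):
    """Process A sends `value` to B; B doubles it and sends it back."""
    def process_b():
        x = receive("B")
        send("A", 2 * x)

    worker = threading.Thread(target=process_b)
    worker.start()
    send("B", value)        # A -> B
    reply = receive("A")    # B -> A
    worker.join()
    return reply
```

Real parameter lists are richer than this (source or destination, tags, buffers, data types), which is what the slide means by parameters that depend on the software.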
4
Flavors of message passing

Synchronous: used for routines that return when
the message transfer has been completed
Synchronous send waits until the complete message
can be accepted by the receiving process before
sending the message (send suspends until receive)
Synchronous receive waits until the message
it is expecting arrives (receive suspends until
message sent)
Also called blocking

[Figure: synchronous send between A and B: request to send, acknowledgement, message]
5
Nonblocking message passing

Nonblocking sends return whether or not the
message has been received
If the receiving processor is not ready, the message
may be stored in a message buffer
The message buffer holds messages sent
by A until they are accepted by a receive in B
MPI terminology:
Routines that use a message buffer and return
after their local actions complete are blocking
(even though the message transfer may not be
complete)
Routines that return immediately are non-blocking
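
The difference between a send that returns once the message is buffered and a send that suspends until the receiver accepts the message can be sketched as follows. This is a minimal model, assuming threads for processes A and B; the `Channel` class and its method names are hypothetical and are not the MPI API.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way channel with a message buffer."""

    def __init__(self):
        self._buffer = queue.Queue()

    def buffered_send(self, message):
        # Returns as soon as the message is in the buffer (local action done).
        self._buffer.put((message, None))

    def synchronous_send(self, message):
        # Returns only after the receiver has accepted the message.
        accepted = threading.Event()
        self._buffer.put((message, accepted))
        accepted.wait()

    def receive(self):
        # Waits until a message is available, then acknowledges it if needed.
        message, accepted = self._buffer.get()
        if accepted is not None:
            accepted.set()
        return message


def demo():
    """Return the order of events when B starts receiving late."""
    chan = Channel()
    log = []
    receiver_ready = threading.Event()

    def process_b():
        receiver_ready.wait()          # B is not ready to receive at first
        for _ in range(2):
            log.append("B received " + chan.receive())

    b = threading.Thread(target=process_b)
    b.start()
    chan.buffered_send("m1")           # returns although B has not received m1
    log.append("A: buffered_send(m1) returned")
    receiver_ready.set()
    chan.synchronous_send("m2")        # suspends until B accepts m2
    log.append("A: synchronous_send(m2) returned")
    b.join()
    return log
```

In the log returned by `demo()`, the buffered send always completes before B receives anything, while the synchronous send can only complete after B has reached its second receive.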

6
Architectural support for MP

Interconnection network should provide
connectivity, low latency, high bandwidth
Many interconnection networks developed over
last 2 decades
Hypercube
Mesh, torus
Ring, etc.

7
Shared Memory Communication

Processes in a shared memory program communicate by
accessing shared variables and data structures
Basic shared memory primitives
Read from a shared variable
Write to a shared variable

8
Accessing Shared Variables

Conflicts may arise if multiple processes want to
write to a shared variable at the same time.
Programmer, language, and/or architecture must
provide means of resolving conflicts

Processes A and B: read x; compute x + 1; write x
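
One common way to resolve such conflicts is mutual exclusion around the read, compute, write sequence. The sketch below assumes Python threads as the processes and reads the slide's update as x + 1; the function name `increment_shared` is hypothetical.

```python
import threading


def increment_shared(num_threads=4, increments=1000):
    """Each thread repeats: read x, compute x + 1, write x, under a lock."""
    shared = {"x": 0}
    lock = threading.Lock()

    def worker():
        for _ in range(increments):
            with lock:                   # one process at a time may update x
                value = shared["x"]      # read x
                value = value + 1        # compute x + 1
                shared["x"] = value      # write x

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared["x"]
```

Without the lock, two threads could read the same value of x and both write back x + 1, so one update would be lost.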
9
Architectural Support for SM

4 basic types of interconnection media
Bus
Crossbar switch
Multistage network
Interconnection network with distributed shared
memory

10
Limited Scalability Media I

Bus
Bus acts as a party line between processors and
shared memories
Bus provides uniform access to shared memory
(UMA)
When bus saturates, performance of system
degrades
For this reason, bus-based systems do not scale
to more than 30-40 processors (e.g., Sequent Symmetry,
Balance)

11
Limited Scalability Media II

Crossbar
Crossbar switch connects m processors and n
memories with distinct paths between each
processor/memory pair
Crossbar provides uniform access to shared memory
(UMA)
O(mn) switches required for m processors and n
memories
Crossbar is scalable in terms of performance but not
in terms of cost; used as the basic switching
mechanism in the SP2

12
Multistage Networks

Multistage networks provide more scalable
performance than a bus and are less costly to scale
than a crossbar
Typically max(log n, log m) stages connect n
processors and m shared memories
Omega networks (butterfly, shuffle-exchange)
commonly used for multistage network
Multistage network used for CM-5 (fat-tree
connects processor/memory pairs), BBN Butterfly
(butterfly), IBM RP3 (omega)
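
The slides do not describe routing, but the standard destination-tag scheme for an omega network shows why log n stages suffice: each stage applies a perfect shuffle and then a 2x2 switch chooses its output from one bit of the destination address. The sketch below assumes 2^k inputs and k stages; `omega_route` is a hypothetical helper name.

```python
def omega_route(src, dst, k):
    """Return the line positions visited when routing src -> dst
    through a k-stage omega network with 2**k inputs."""
    n = 1 << k
    if not (0 <= src < n and 0 <= dst < n):
        raise ValueError("src and dst must be in [0, 2**k)")
    pos = src
    path = [pos]
    for stage in range(k):
        # Perfect shuffle: rotate the k-bit position left by one bit.
        pos = ((pos << 1) | (pos >> (k - 1))) & (n - 1)
        # 2x2 switch: the output port is selected by the next destination
        # bit, most significant bit first.
        bit = (dst >> (k - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path
```

For example, with k = 3 the call `omega_route(5, 2, 3)` visits lines 5, 2, 5, 2 and ends at the destination after three stages.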

13
Omega Networks

Butterfly multistage
Used for BBN Butterfly, TC2000

Shuffle multistage
Used for RP3, SP2 high performance switch

14
Fat-tree Interconnect

Bandwidth is increased towards the root
Used for data network for CM-5 (MIMD MPP)
4 leaf nodes, internal nodes have 2 or 4 children
To route from leaf A to leaf B, pick random
switch C in the least common ancestor fat node of
A and B, take unique tree route from A to C and
from C to B

Binary fat-tree in which all internal nodes have
two children
15
Distributed Shared Memory

Memory is physically distributed but programmed
as shared memory
Programmers find shared memory paradigm
desirable
Shared memory distributed among processors,
accesses may be sent as messages
Access to local memory and global shared memory
creates NUMA (non-uniform memory access)
architectures
BBN Butterfly is a NUMA shared memory
multiprocessor

[Figure: BBN Butterfly interconnect]
16
Alphabet Soup

Terms for shared memory multiprocessors
NUMA: non-uniform memory access
(e.g., BBN Butterfly, Cray T3E, Origin 2000)
UMA: uniform memory access
(e.g., Sequent, Sun HPC 1000)
COMA: cache-only memory architecture
(e.g., KSR)
(NORMA: no remote memory access;
message-passing MPPs)

17
Using both SM and MP together

Common for platforms to support one model at a
time: SM or MP
Clusters of SMPs may be effectively programmed
using both SM and MP
SM used within a multiple processor machine/node
MP used between nodes

18
SM Program: Prefix Sums

Problem: Given n processes P_i and n data items
a_i, compute the prefix sums
A_1i = a_1 + ... + a_i such that A_1i is in P_i upon
termination of the algorithm.
We'll look at an O(log n) SM parallel algorithm
which computes the prefix sums of n data items on n
processors

19
Data Movement for Prefix Sums Algorithm

A_ij = a_i + a_(i+1) + ... + a_j

20
Pseudo-code for Prefix Sums Algorithm

Pseudo-code
Procedure ALLSUMS(a_1, ..., a_n)
Initialize P_i with data a_i = A_ii
for j = 0 to (log n) - 1 do
forall i = 2^j + 1 to n do (parallel for)
Processor P_i
(i) obtains contents of P_(i-2^j) through shared
memory and
(ii) replaces contents of P_i with contents of
P_(i-2^j) + current contents of P_i
end forall
end for
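
The pseudo-code can be checked with a sequential simulation. The sketch below is an assumption-laden model, not the slides' implementation: it uses 0-based indices, a list `p` standing in for the contents of the n processors, and a snapshot per level to model all reads of a level happening before any write. The name `allsums` mirrors the procedure name but is otherwise hypothetical.

```python
def allsums(a):
    """Simulate ALLSUMS: result[i] == a[0] + ... + a[i] on return."""
    n = len(a)
    p = list(a)                       # p[i] models the contents of P_(i+1)
    rounds = (n - 1).bit_length() if n > 0 else 0   # ceil(log2 n) levels
    for j in range(rounds):
        offset = 1 << j               # 2^j
        old = list(p)                 # snapshot: every read precedes every write
        for i in range(offset, n):    # 0-based form of "i = 2^j + 1 to n"
            p[i] = old[i - offset] + old[i]
    return p
```

For instance, `allsums([1, 2, 3, 4, 5, 6, 7, 8])` finishes in three levels and returns `[1, 3, 6, 10, 15, 21, 28, 36]`.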

21
Programming Issues

Algorithm assumes that all additions with the
same offset (i.e. for each level) are performed
at the same time
Need some way of tagging or synchronizing
computations
May be cost-effective to do a barrier
synchronization (all processors must reach a
barrier before proceeding to the next level )
between levels
For this algorithm, there are no write conflicts
within a level: one of the values is already
in the shared variable, and the other value need only
be summed with the existing value
If two values had to be written to an existing
variable, we would need to establish a
well-defined protocol for handling conflicting
writes
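
A barrier between levels can be sketched with one thread per processor. This is a minimal illustration assuming Python's `threading.Barrier`; the function name `allsums_threads` and the split into a read step and a write step per level are choices made for the sketch, not something the slides prescribe.

```python
import threading


def allsums_threads(a):
    """Prefix sums with one thread per element and barriers between levels."""
    n = len(a)
    if n == 0:
        return []
    shared = list(a)                    # shared[i] is written only by thread i
    rounds = (n - 1).bit_length()       # ceil(log2 n) levels
    barrier = threading.Barrier(n)

    def worker(i):
        for j in range(rounds):
            offset = 1 << j
            # Read step: fetch the neighbour's value before anyone writes.
            incoming = shared[i - offset] if i >= offset else None
            barrier.wait()              # all reads of this level are done
            if incoming is not None:
                shared[i] = incoming + shared[i]
            barrier.wait()              # all writes of this level are done

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Each thread writes only its own element, so the barriers are what guarantee that a value is not read after its owner has already moved on to the next level.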

22
MP Program: Sorting

Problem: sorting a list of numbers/keys
(rearranging them so as to be in non-decreasing
order)
Basic sorting operation: compare/exchange
(compare/swap)
In the serial computation (RAM) model, optimal
sorting for n keys is O(n log n)

[Figure: compare/exchange with 1 active and 1 passive processor]
23
Odd-Even Transposition Sort

Parallel version of bubblesort: many
compare-exchanges done simultaneously
Algorithm consists of Odd Phases and Even Phases
In even phase, even-numbered processes exchange
numbers (via messages) with their right neighbor
In odd phase, odd-numbered processes exchange
numbers (via messages) with their right neighbor
Algorithm alternates odd phase and even phase for
O(n) iterations
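
The alternation of phases can be simulated sequentially. The sketch below assumes the even phase comes first and runs n phases in total; the function name and the returned list of states are illustrative choices.

```python
def odd_even_transposition_sort(keys):
    """Return the states T0, T1, ..., Tn produced by n alternating phases."""
    a = list(keys)
    n = len(a)
    states = [list(a)]                 # T0: initial placement
    for step in range(n):
        start = step % 2               # even phase first, then odd phase
        # All pairs (i, i + 1) of a phase are disjoint, so on a parallel
        # machine these compare-exchanges happen simultaneously.
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
        states.append(list(a))
    return states
```

Running it on the keys 3, 10, 4, 8, 1 gives the final state `[1, 3, 4, 8, 10]` after five phases.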

24
Odd-Even Transposition Sort

Data Movement

[Figure: data movement among P0 ... P4 over time steps T1 ... T5; general pattern for n = 5]
25
Odd-Even Transposition Sort

Example

      P0  P1  P2  P3  P4
T0     3  10   4   8   1
T1     3  10   4   8   1
T2     3   4  10   1   8
T3     3   4   1  10   8
T4     3   1   4   8  10
T5     1   3   4   8  10

General pattern for n = 5
26
Odd-Even Transposition Code

Compare-exchange accomplished through message
passing

Even Phase
P_i, i = 0, 2, 4, ..., n-2:  recv(A, P_(i+1)); send(B, P_(i+1)); if (A < B) B = A;
P_i, i = 1, 3, 5, ..., n-1:  send(A, P_(i-1)); recv(B, P_(i-1)); if (A < B) A = B;

Odd Phase
P_i, i = 1, 3, 5, ..., n-3:  recv(A, P_(i+1)); send(B, P_(i+1)); if (A < B) B = A;
P_i, i = 2, 4, 6, ..., n-2:  send(A, P_(i-1)); recv(B, P_(i-1)); if (A < B) A = B;

[Figure: processes P0 ... P4 in each phase]
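
The message passing version can be sketched with one thread per process and one FIFO queue per directed link. This is an illustration under stated assumptions: threads stand in for processes, `queue.Queue` stands in for the network, the even phase runs first, and the names `mp_odd_even_sort`, `links`, `send`, and `recv` are hypothetical. The left member of each pair receives and then sends, and the right member sends and then receives, following the order on the slide.

```python
import queue
import threading


def mp_odd_even_sort(keys):
    """Odd-even transposition sort with one thread per key."""
    n = len(keys)
    result = [None] * n
    # links[(src, dst)]: hypothetical FIFO channel from process src to dst.
    links = {}
    for i in range(n - 1):
        links[(i, i + 1)] = queue.Queue()
        links[(i + 1, i)] = queue.Queue()

    def send(value, src, dst):
        links[(src, dst)].put(value)

    def recv(src, dst):
        return links[(src, dst)].get()      # suspends until the message arrives

    def process(i):
        mine = keys[i]
        for phase in range(n):
            # Even phase: even i pairs with i + 1. Odd phase: odd i does.
            partner = i + 1 if i % 2 == phase % 2 else i - 1
            if partner < 0 or partner >= n:
                continue                    # no partner in this phase
            if partner > i:                 # left member keeps the smaller key
                other = recv(partner, i)
                send(mine, i, partner)
                mine = min(mine, other)
            else:                           # right member keeps the larger key
                send(mine, i, partner)
                other = recv(partner, i)
                mine = max(mine, other)
        result[i] = mine

    threads = [threading.Thread(target=process, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return result
```

No barrier is used here: a given pair of neighbours only ever exchanges messages in phases of one parity, and each link is FIFO, so the blocking receive is enough to keep the phases of that pair in step.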
27
Programming Issues

Algorithm assumes that odd phases and even phases are done in
sequence: how to synchronize?
Synchronous execution
Need to have a barrier between phases
Barrier synchronization costs may be high
Asynchronous execution
Need to tag iteration and phase so that the correct
values are combined
Program may be implemented as SPMD (single
program, multiple data); see HW

28
Programming Issues

Algorithm must be mapped to underlying platform
If communication costs >> computation costs, it
may be more cost-effective to map multiple
processes to a single processor and bundle
communication
Granularity (ratio of time required for a basic
communication operation to the time required for
a basic computation) of the underlying platform is
required to determine the best mapping

[Figure: processes P0 ... P4 mapped onto Processor A and Processor B]
29
Is Odd-Even Transposition Sort Optimal?

What is optimal?
An algorithm is optimal if there is a lower bound
for the problem it addresses, with respect to the
basic operation being counted, which equals the
upper bound given by the algorithm's complexity
function, i.e. lower bound = upper bound

30
Odd-Even Transposition Sort is optimal on linear
array

Upper bound: O(n)
Lower bound: Ω(n)
Consider sorting algorithms on a linear array where
the basic operation being counted is compare-exchange
If the minimum key is in the rightmost array element, it
must move, over the course of any algorithm,
to the leftmost array element
Compare-exchange operations only allow keys to
move one process to the left each time-step.
Therefore, any sorting algorithm requires
Ω(n) time-steps to move the minimum key to
the first position

[Figure: linear array holding 8, 10, 5, 7, 1, with the minimum key in the rightmost element]
31
Optimality

Ω(n log n) lower bound for serial sorting
algorithms on RAM with respect to comparisons
Ω(n) lower bound for parallel sorting algorithms
on linear array with respect to compare-exchange
No conflict since the platforms/computing
environments are different: apples vs. oranges
Note that in the parallel world, there are different lower
bounds for sorting in different environments
Ω(log n) lower bound on PRAM (Parallel RAM)
Ω(n^(1/2)) lower bound on 2D array, etc.

32
Optimality on a 2D Array

Same argument as for the linear array works for the lower
bound
If we want to exchange X and Y, we must do at
least Ω(n^(1/2)) steps on an n^(1/2) x n^(1/2) array
Upper bound: Thompson and Kung, "Sorting on a
Mesh-Connected Parallel Computer", CACM, Vol. 20
(April), 1977
