Shared Memory and Message Passing
Transcript of a PowerPoint presentation (source: https://cseweb.ucsd.edu)
1
Shared Memory and Message Passing
  • WA 1.2.1, 1.2.2, 1.2.3, 2.1, 2.1.2, 2.1.3, 8.3,
    9.2.1, 9.2.2 Akl 2.5.2

2
Models for Communication
  • Parallel program = program composed of tasks
    (processes) which communicate to accomplish an
    overall computational goal
  • Two prevalent models for communication
  • Message passing (MP)
  • Shared memory (SM)
  • This lecture will focus on MP and SM computing

3
Message Passing Communication
  • Processes in message passing program communicate
    by passing messages
  • Basic message passing primitives
  • Send(parameter list)
  • Receive(parameter list)
  • Parameters depend on the software and can be
    complex

[Diagram: process A sends a message to process B]
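The send/receive pairing above can be sketched with Python threads, using a queue.Queue as the channel between a hypothetical process A and process B (the channel name and tuple payload are illustrative, not from the slides):

```python
import threading
import queue

channel = queue.Queue()  # message channel from process A to process B

def process_a():
    # send(parameter list): here the "parameters" are just a tagged payload
    channel.put(("hello", 42))

def process_b(results):
    # receive(parameter list): Queue.get() blocks until a message arrives
    msg = channel.get()
    results.append(msg)

results = []
b = threading.Thread(target=process_b, args=(results,))
b.start()
process_a()
b.join()
print(results)  # [('hello', 42)]
```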
4
Flavors of message passing
  • Synchronous: used for routines that return only
    when the message transfer has completed
  • Synchronous send waits until the complete message
    can be accepted by the receiving process before
    sending the message (send suspends until receive)
  • Synchronous receive waits until the message
    it is expecting arrives (receive suspends until
    message sent)
  • Also called blocking

[Diagram: synchronous exchange. A sends a request to send,
B replies with an acknowledgement, then A sends the message]
5
Nonblocking message passing
  • Nonblocking sends return whether or not the
    message has been received
  • If the receiving processor is not ready, the
    message may be stored in a message buffer
  • The message buffer holds messages sent by A prior
    to being accepted by the receive in B
  • In MPI:
  • routines that use a message buffer and return
    after their local actions complete are blocking
    (even though message transfer may not be
    complete)
  • routines that return immediately are non-blocking
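A minimal sketch of buffered, nonblocking sends, again using Python's queue.Queue as the message buffer (buffer name and message values are illustrative):

```python
import queue

buffer = queue.Queue()  # unbounded message buffer between sender A and receiver B

# A's nonblocking sends: put() on an unbounded queue returns immediately,
# whether or not B has posted a matching receive
for i in range(3):
    buffer.put(i)

# later, B's receives drain the buffer in FIFO order
received = [buffer.get() for _ in range(3)]
print(received)  # [0, 1, 2]
```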

6
Architectural support for MP
  • Interconnection network should provide
    connectivity, low latency, high bandwidth
  • Many interconnection networks have been developed
    over the last two decades
  • Hypercube
  • Mesh, torus
  • Ring, etc.

7
Shared Memory Communication
  • Processes in shared memory program communicate by
    accessing shared variables and data structures
  • Basic shared memory primitives
  • Read to a shared variable
  • Write to a shared variable

8
Accessing Shared Variables
  • Conflicts may arise if multiple processes want to
    write to a shared variable at the same time.
  • Programmer, language, and/or architecture must
    provide means of resolving conflicts

[Diagram: processes A and B each read x, compute x+1, and
write x; the final value of x depends on the order of writes]
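One common way to resolve such conflicts is a mutual-exclusion lock around the read-modify-write. A minimal sketch using Python threads as a stand-in for processes sharing memory (the counter and lock names are illustrative):

```python
import threading

counter = 0                      # the shared variable "x"
lock = threading.Lock()          # resolves conflicting writes

def increment(n):
    global counter
    for _ in range(n):
        with lock:               # read-modify-write happens atomically
            counter += 1

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 200000; without the lock, interleaved updates could be lost
```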
9
Architectural Support for SM
  • 4 basic types of interconnection media
  • Bus
  • Crossbar switch
  • Multistage network
  • Interconnection network with distributed shared
    memory

10
Limited Scalability Media I
  • Bus
  • Bus acts as a party line between processors and
    shared memories
  • Bus provides uniform access to shared memory
    (UMA)
  • When the bus saturates, performance of the system
    degrades
  • For this reason, bus-based systems do not scale
    to more than 30-40 processors (e.g. Sequent
    Symmetry, Balance)

11
Limited Scalability Media II
  • Crossbar
  • Crossbar switch connects m processors and n
    memories with distinct paths between each
    processor/memory pair
  • Crossbar provides uniform access to shared memory
    (UMA)
  • O(mn) switches required for m processors and n
    memories
  • Crossbar is scalable in terms of performance but
    not in terms of cost; used as the basic switching
    mechanism in the IBM SP2

12
Multistage Networks
  • Multistage networks provide more scalable
    performance than a bus but are less costly to
    scale than a crossbar
  • Typically max(log n, log m) stages connect n
    processors and m shared memories
  • Omega networks (butterfly, shuffle-exchange) are
    commonly used for the multistage network
  • Multistage networks used in the CM-5 (fat-tree
    connects processor/memory pairs), BBN Butterfly
    (butterfly), IBM RP3 (omega)

13
Omega Networks
  • Butterfly multistage
  • Used for BBN Butterfly, TC2000
  • Shuffle multistage
  • Used for RP3, SP2 high performance switch
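The shuffle and exchange connections behind these networks are simple bit manipulations on node indices. A sketch, assuming n = 2^k nodes with k-bit indices (function names are illustrative):

```python
def shuffle(i, k):
    """Perfect-shuffle connection for n = 2**k nodes: rotate the
    k-bit index of node i left by one position."""
    n = 1 << k
    return ((i << 1) | (i >> (k - 1))) & (n - 1)

def exchange(i):
    """Exchange connection: flip the low-order bit, pairing each
    node with its partner at a 2x2 switch."""
    return i ^ 1

# shuffle wiring for n = 8 (k = 3)
print([shuffle(i, 3) for i in range(8)])  # [0, 2, 4, 6, 1, 3, 5, 7]
```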

14
Fat-tree Interconnect
  • Bandwidth is increased towards the root
  • Used for data network for CM-5 (MIMD MPP)
  • 4 leaf nodes, internal nodes have 2 or 4 children
  • To route from leaf A to leaf B, pick random
    switch C in the least common ancestor fat node of
    A and B, take unique tree route from A to C and
    from C to B

[Figure: binary fat-tree in which all internal nodes have
two children]
15
Distributed Shared Memory
  • Memory is physically distributed but programmed
    as shared memory
  • Programmers find the shared memory paradigm
    desirable
  • Shared memory is distributed among processors, so
    accesses may be sent as messages
  • Access to local memory and global shared memory
    creates NUMA (non-uniform memory access)
    architectures
  • The BBN Butterfly is a NUMA shared memory
    multiprocessor

[Figure: BBN Butterfly interconnect]
16
Alphabet Soup
  • Terms for shared memory multiprocessors
  • NUMA: non-uniform memory access
  • BBN Butterfly, Cray T3E, Origin 2000
  • UMA: uniform memory access
  • Sequent, Sun HPC 1000
  • COMA: cache-only memory access
  • KSR
  • (NORMA: no remote memory access
  • message-passing MPPs)

17
Using both SM and MP together
  • Common for platforms to support one model at a
    time: SM or MP
  • Clusters of SMPs may be effectively programmed
    using both SM and MP
  • SM used within a multiple processor machine/node
  • MP used between nodes

18
SM Program: Prefix Sums
  • Problem: Given n processes P_i and n data items
    a_i, we want to compute the prefix sums
    A_{1..i} = a_1 + a_2 + ... + a_i such that
    A_{1..i} is in P_i upon termination of the
    algorithm.
  • We'll look at an O(log n) SM parallel algorithm
    which computes the prefix sums of n data items on
    n processors

19
Data Movement for Prefix Sums Algorithm
  • A_{i..j} = a_i + a_{i+1} + ... + a_j

20
Pseudo-code for Prefix Sums Algorithm
  • Pseudo-code
  • Procedure ALLSUMS(a_1, ..., a_n)
  • Initialize P_i with data a_i = A_{i..i}
  • for j = 0 to (log n) - 1 do
  • forall i = 2^j + 1 to n do (parallel for)
  • Processor P_i
  • (i) obtains contents of P_{i-2^j} through shared
    memory and
  • (ii) replaces contents of P_i with contents of
    P_{i-2^j} + current contents of P_i
  • end forall
  • end for
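A sequential Python simulation of ALLSUMS, with a 0-indexed list standing in for the processors' shared variables (the slides use 1-indexing; the function name is illustrative):

```python
def all_sums(a):
    """Simulate ALLSUMS: in the round with offset 2**j, every position
    i >= 2**j adds the value held 2**j places to its left.  After
    ceil(log2 n) rounds, position i holds a[0] + a[1] + ... + a[i]."""
    n = len(a)
    p = list(a)                  # p[i] models P_i's shared variable
    offset = 1
    while offset < n:
        # all reads complete before any write, as in a synchronous round
        reads = {i: p[i - offset] for i in range(offset, n)}
        for i in range(offset, n):
            p[i] += reads[i]
        offset <<= 1
    return p

print(all_sums([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```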

21
Programming Issues
  • Algorithm assumes that all additions with the
    same offset (i.e. for each level) are performed
    at the same time
  • Need some way of tagging or synchronizing
    computations
  • May be cost-effective to do a barrier
    synchronization (all processors must reach a
    barrier before proceeding to the next level )
    between levels
  • For this algorithm, there are no write conflicts
    within a level: one of the values is already in
    the shared variable, so the other value need only
    be summed with the existing value
  • If two values had to be written to the same shared
    variable, we would need to establish a
    well-defined protocol for handling conflicting
    writes

22
MP Program: Sorting
  • Problem: Sorting a list of numbers/keys
    (rearranging them so as to be in non-decreasing
    order)
  • Basic sorting operation: compare/exchange
    (compare/swap)
  • In the serial computation (RAM) model, optimal
    sorting for n keys is O(n log n)

[Diagram: compare-exchange between two processors, one
active and one passive]
23
Odd-Even Transposition Sort
  • Parallel version of bubblesort many
    compare-exchanges done simultaneously
  • Algorithm consists of Odd Phases and Even Phases
  • In even phase, even-numbered processes exchange
    numbers (via messages) with their right neighbor
  • In odd phase, odd-numbered processes exchange
    numbers (via messages) with their right neighbor
  • Algorithm alternates odd phase and even phase for
    O(n) iterations

24
Odd-Even Transposition Sort
  • Data Movement

[Diagram: compare-exchange pattern for n = 5 over time steps
T1..T5; even phases pair (P0,P1) and (P2,P3), odd phases pair
(P1,P2) and (P3,P4)]
25
Odd-Even Transposition Sort
  • Example

        P0  P1  P2  P3  P4
  T0:    3  10   4   8   1   (initial values)
  T1:    3  10   4   8   1   (pairs (P0,P1),(P2,P3): no swaps needed)
  T2:    3   4  10   1   8   (pairs (P1,P2),(P3,P4))
  T3:    3   4   1  10   8   (pairs (P0,P1),(P2,P3))
  T4:    3   1   4   8  10   (pairs (P1,P2),(P3,P4))
  T5:    1   3   4   8  10   (pairs (P0,P1),(P2,P3): sorted)

General pattern for n = 5
26
Odd-Even Transposition Code
  • Compare-exchange accomplished through message
    passing
  • Even Phase (even process active with right neighbor)
    P_i = 0, 2, 4, ..., n-2:      P_i = 1, 3, 5, ..., n-1:
      recv(A, P_{i+1})              send(A, P_{i-1})
      send(B, P_{i+1})              recv(B, P_{i-1})
      if (A < B) B = A              if (A < B) A = B
  • Odd Phase (odd process active with right neighbor)
    P_i = 1, 3, 5, ..., n-3:      P_i = 2, 4, 6, ..., n-2:
      recv(A, P_{i+1})              send(A, P_{i-1})
      send(B, P_{i+1})              recv(B, P_{i-1})
      if (A < B) B = A              if (A < B) A = B
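A sequential Python simulation of the algorithm, with list positions standing in for the processors; the message exchange collapses to an in-place compare-exchange (the function name is illustrative):

```python
def odd_even_transposition_sort(keys):
    """n alternating phases; in each phase, the left process of every
    active pair keeps the minimum and the right keeps the maximum."""
    a = list(keys)
    n = len(a)
    for phase in range(n):
        start = phase % 2    # pairs (P0,P1),(P2,P3),... then (P1,P2),(P3,P4),...
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]  # compare-exchange
    return a

print(odd_even_transposition_sort([3, 10, 4, 8, 1]))  # [1, 3, 4, 8, 10]
```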
27
Programming Issues
  • Algorithm requires that odd phases and even phases
    be done in sequence; how do we synchronize?
  • Synchronous execution
  • Need to have barrier between phases
  • Barrier synchronization costs may be high
  • Asynchronous execution
  • Need to tag iteration, phase so that correct
    values combined
  • Program may be implemented as SPMD (single
    program, multiple data) see HW

28
Programming Issues
  • Algorithm must be mapped to underlying platform
  • If communication costs >> computation costs, it
    may be more cost-effective to map multiple
    processes to a single processor and bundle
    communication
  • Granularity (ratio of time required for a basic
    communication operation to the time required for
    a basic computation) of the underlying platform is
    required to determine the best mapping

[Diagram: processes P0..P4 mapped onto processors A and B,
before and after bundling multiple processes per processor]
29
Is Odd-Even Transposition Sort Optimal?
  • What is optimal?
  • An algorithm is optimal if the lower bound for
    the problem it addresses, with respect to the
    basic operation being counted, equals the upper
    bound given by the algorithm's complexity
    function, i.e. lower bound = upper bound

30
Odd-Even Transposition Sort is optimal on linear
array
  • Upper bound: O(n)
  • Lower bound: Ω(n)
  • Consider sorting algorithms on a linear array
    where the basic operation being counted is
    compare-exchange
  • If the minimum key is in the rightmost array
    element, it must move over the course of any
    algorithm to the leftmost array element
  • Compare-exchange operations only allow keys to
    move one process to the left each time-step.
  • Therefore, any sorting algorithm requires at
    least n-1 time-steps, i.e. Ω(n), to move the
    minimum key to the first position

[Example: array 8 10 5 7 1; the minimum key 1 starts in the
rightmost cell and must travel to the leftmost]
31
Optimality
  • Ω(n log n) lower bound for serial sorting
    algorithms on RAM wrt comparisons
  • Ω(n) lower bound for parallel sorting algorithms
    on linear array wrt compare-exchange
  • No conflict since the platforms/computing
    environments are different, apples vs. oranges
  • Note that in the parallel world, there are
    different lower bounds for sorting in different
    environments
  • Ω(log n) lower bound on PRAM (Parallel RAM)
  • Ω(n^{1/2}) lower bound on 2D array, etc.

32
Optimality on a 2D Array
  • Same argument as for the linear array works for
    the lower bound
  • If we want to exchange X and Y, we must do at
    least Ω(n^{1/2}) steps on an n^{1/2} x n^{1/2}
    array
  • Upper bound: Thompson and Kung, "Sorting on a
    Mesh-Connected Parallel Computer," CACM, Vol 20
    (April), 1977