Communication Operations - PowerPoint PPT Presentation


Slides: 45
Provided by: tsbsud

Transcript and Presenter's Notes

Title: Communication Operations


1


CS / IS C422 Parallel Computing
  • Lecture 15
  • Communication Operations

2
Recap of Lecture 14
  • Directory Based
    • Centralized
    • Distributed
  • Communication
    • Store-and-Forward
    • Cut-Through (CT)
    • Costs

3
Packet Routing
  • There are two basic approaches to routing
    packets, based on what a switch does with a
    packet as its flits begin to arrive
    • Store-and-forward
    • Cut-through
      • Virtual cut-through
      • Wormhole

4
Communication time
  • The communication time is the sum of three costs
  • 1. Startup time (ts)
    • The time required to handle a message at the
      sending processor
  • 2. Per-hop time (th), incurred on each of the l
    links that the message passes through
    • After a message leaves a processor, it takes a
      finite amount of time to reach the next
      processor in its path
  • 3. Per-word transfer time (tw), for a message of
    m words
    • If the channel bandwidth is r words per second,
      then each word takes time tw = 1/r to traverse
      the link
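The three parameters combine into the standard cost model for an m-word message crossing l links. The minimal sketch below assumes the usual textbook formulas, t_SF = ts + (m·tw + th)·l for store-and-forward and t_CT = ts + l·th + m·tw for cut-through; the function names and the parameter values in the demo are hypothetical.

```python
def store_and_forward_time(ts: float, th: float, tw: float, m: int, l: int) -> float:
    """Every one of the l hops must receive the whole m-word message before forwarding it."""
    return ts + (m * tw + th) * l


def cut_through_time(ts: float, th: float, tw: float, m: int, l: int) -> float:
    """The header pays th per hop; the m words are pipelined behind it and pay tw only once."""
    return ts + l * th + m * tw


if __name__ == "__main__":
    # Hypothetical parameters (arbitrary time units): ts = 10, th = 1, tw = 0.5, m = 100 words.
    for hops in (1, 4, 8):
        sf = store_and_forward_time(10.0, 1.0, 0.5, m=100, l=hops)
        ct = cut_through_time(10.0, 1.0, 0.5, m=100, l=hops)
        print(f"l={hops}: SF={sf:.1f}  CT={ct:.1f}")
```

With a single hop the two models give the same time; as l grows, store-and-forward pays m·tw on every link while cut-through pays it only once.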

5
The 2 main communication schemes
6
Plan for Today
7
Plan for Today
  • Message Passing mechanisms
  • Routing Mechanisms for ICNs
  • Deterministic Routing
  • XY-routing
  • E-cube Routing
  • Adaptive Routing
  • One to All Broadcast
  • All to One Reduce, All Reduce
  • All to All Broadcast
  • Scatter, Gather

8
Collisions
What happens if a stream of flits arrives at a
switch, and the desired output port is busy?
  • Store whole packet in a buffer
    (called virtual cut-through)
  • Block in place across multiple switches
    (called wormhole routing)
  • Drop the data
    • Resources are lost!!!
  • Misroute: keep moving, but in the wrong direction

9
Virtual Cut-Through
  • What to do if output port is blocked?
  • Allow the tail to continue when the head is
    blocked, absorbing the whole message into a
    single switch
  • Requires a buffer large enough to hold the
    largest packet
  • Degenerates to store-and-forward with high
    contention

10
Wormhole
  • When the head of the message is blocked, the
    message stays strung out over the network
  • Potentially blocks other messages, but each switch
    needs to buffer only the piece of the packet that
    is sent between switches
  • The CM-5 used it, with a switch buffer of 4 bits
    per port
  • Myrinet uses it
  • Can cause tree saturation

11
Deadlocks
  • In wormhole routing, packets hold switch
    resources while they move
    • Flit buffers
    • Output ports
  • Another packet may arrive that needs the same
    resources
  • Cyclic dependencies may lead to deadlock
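One way to make "cyclic dependencies" concrete is a channel dependency graph: one vertex per channel, and an edge a → b whenever a packet may hold channel a while waiting for channel b. An acyclic graph is a standard sufficient condition for deadlock freedom. The sketch below is an illustration under stated assumptions, not a router: it assumes unidirectional (clockwise) routing on p ≥ 3 nodes and compares a ring with a linear array; the function names and the channel labels c0, c1, ... are hypothetical.

```python
from typing import Dict, Set

Graph = Dict[str, Set[str]]


def ring_channel_dependencies(p: int, wraparound: bool) -> Graph:
    """Channel c{k} is the link from node k to node k+1 (one direction only).

    A packet in c{k} that must continue holds c{k} while waiting for c{k+1},
    giving the dependency edge c{k} -> c{k+1}. Without wraparound the last
    node has no outgoing channel, so the chain of dependencies stops there.
    """
    n_channels = p if wraparound else p - 1
    deps: Graph = {f"c{k}": set() for k in range(n_channels)}
    for k in range(n_channels):
        nxt = (k + 1) % p if wraparound else k + 1
        if nxt < n_channels:
            deps[f"c{k}"].add(f"c{nxt}")
    return deps


def has_cycle(graph: Graph) -> bool:
    """Three-colour DFS: reaching a vertex that is still on the stack means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}

    def visit(v: str) -> bool:
        colour[v] = GREY
        for w in graph[v]:
            if colour[w] == GREY:
                return True
            if colour[w] == WHITE and visit(w):
                return True
        colour[v] = BLACK
        return False

    return any(colour[v] == WHITE and visit(v) for v in graph)
```

On the ring the dependencies close into c0 → c1 → ... → c(p-1) → c0, so deadlock is possible; removing the wraparound link breaks the cycle.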

12
Deadlocks
13
Dependencies
  • Deadlocks are the most dramatic problems
  • But can also just lead to inefficiency
  • A blocked packet still holds its channels
    (because flits need to stay contiguous to
    maintain routing)
  • Another packet may have been able to use these
    channels, but they now sit idle

14
Inefficiency
15
Virtual Channels
  • Divide the buffers in each switch into several
    virtual channels
  • Each virtual channel also has its own state and
    routing information
  • Virtual channels share the use of physical
    resources

Dally, IEEE Trans. Par. Dist. Syst., 1992
16
Efficiency!
Red packet occupies some (not all!!!) buffer space
Green packet actually uses link
17
Deadlock Free Routing
  • Virtual Channels
    • Not to be confused with virtual cut-through
    • Add buffers so flits of wormhole packets can be
      interleaved
    • You can read about this in Dally's paper
  • Up-Down
    • Number switches higher the farther away they are
      from the processors
    • Route up, make one turn, route down
  • Turn Model Routing
    • Restrict the order of turns
      • West first
      • North last
      • Negative first
    • Can increase the number of hops
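West-first routing forbids any turn into the west direction, so a legal path makes all of its westward hops before anything else. The sketch below assumes a 2D mesh with x growing to the east and y growing to the north, and uses hypothetical helper names. Its router fixes one order after the west hops purely for illustration; a real west-first router may choose among north, south and east adaptively.

```python
from typing import List, Tuple


def west_first_route(src: Tuple[int, int], dst: Tuple[int, int]) -> List[str]:
    """One minimal west-first route: all westward hops first, then the rest.

    After the west hops this sketch goes east, then north/south; any
    interleaving of E, N and S would also be legal under west-first.
    """
    (x1, y1), (x2, y2) = src, dst
    moves: List[str] = []
    if x2 < x1:
        moves += ["W"] * (x1 - x2)
    else:
        moves += ["E"] * (x2 - x1)
    moves += ["N" if y2 > y1 else "S"] * abs(y2 - y1)
    return moves


def obeys_west_first(moves: List[str]) -> bool:
    """Legal iff no hop turns into the west direction after a non-west hop."""
    seen_non_west = False
    for m in moves:
        if m == "W" and seen_non_west:
            return False
        if m != "W":
            seen_non_west = True
    return True
```

For example, a route that goes north and then turns west is rejected, which is exactly the restricted turn that breaks the dependency cycles.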

18
Routing Algorithm
  • How do I know where a packet should go?
  • Topology does NOT determine routing (e.g., many
    paths through a torus)
  • Many routing algorithms exist
    • Arithmetic
    • Source-based
    • Table lookup
    • Adaptive: route based on network state (e.g.,
      contention)

19
(1) Arithmetic Routing
  • For regular topology, use simple arithmetic to
    determine route
  • E.g., 3D Torus
    • Packet header contains signed offset to
      destination (per dimension)
    • At each hop, the switch adds or subtracts 1 to
      reduce the offset in a dimension
    • When Δx = 0, Δy = 0 and Δz = 0, the packet is at
      the correct processor
  • Drawbacks
    • Requires an ALU in the switch
    • Must re-compute the CRC at each hop (because the
      header changes)
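The per-hop arithmetic can be sketched as follows, assuming a torus with k nodes in every dimension and dimensions corrected in the order x, y, z. The header holds one signed offset per dimension (taking the shorter way round each ring); the loop stands in for the successive switches, each of which adds or subtracts 1 and rewrites the header. Function names are hypothetical.

```python
from typing import List, Tuple

Coord = Tuple[int, ...]


def signed_offset(src: int, dst: int, k: int) -> int:
    """Shortest signed distance from src to dst on a ring of k nodes."""
    d = (dst - src) % k  # forward distance in [0, k)
    return d if d <= k // 2 else d - k


def torus_route(src: Coord, dst: Coord, k: int) -> List[Coord]:
    """Correct one dimension at a time, stepping +1/-1 until its offset is zero."""
    offsets = [signed_offset(s, d, k) for s, d in zip(src, dst)]  # the packet header
    node = list(src)
    path = [tuple(node)]
    for dim in range(len(offsets)):
        while offsets[dim] != 0:
            step = 1 if offsets[dim] > 0 else -1
            node[dim] = (node[dim] + step) % k  # wraparound link
            offsets[dim] -= step  # header rewritten, hence the CRC recompute
            path.append(tuple(node))
    return path
```

On a 4-ary 3-cube, going from (0, 0, 0) to (3, 1, 0) uses the wraparound link in x (offset -1 rather than +3) and then one hop in y.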

20
(2) Source-Based and (3) Table Lookup Routing
  • Source-Based
    • Source specifies the output port for each switch
      in the route
    • Very simple switches
      • No control state
      • Each switch strips its output port off the
        header
    • Myrinet uses this
    • Can't be made adaptive
  • Table Lookup
    • Very small header, used as an index into a table
      for the output port
    • Big tables, must be kept up-to-date
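The difference between the two schemes shows up in what the header carries. The sketch below uses a hypothetical chain of three switches (S0, S1, S2) feeding processors P0 to P3; the port numbers and routing tables are made up for illustration. In source routing each switch pops its port off the header; in table lookup the header carries only the destination and every switch consults its own table.

```python
from typing import Dict, List

# Hypothetical topology: for each switch, output-port number -> next node.
LINKS: Dict[str, Dict[int, str]] = {
    "S0": {0: "P0", 1: "S1"},
    "S1": {0: "P1", 1: "S2"},
    "S2": {0: "P2", 1: "P3"},
}

# Hypothetical per-switch tables for table-lookup routing: destination -> output port.
TABLES: Dict[str, Dict[str, int]] = {
    "S0": {"P0": 0, "P1": 1, "P2": 1, "P3": 1},
    "S1": {"P1": 0, "P2": 1, "P3": 1},
    "S2": {"P2": 0, "P3": 1},
}


def source_route(first_switch: str, header: List[int]) -> List[str]:
    """Each switch strips the first port number off the header and forwards."""
    hops = [first_switch]
    node = first_switch
    header = list(header)  # copy: the packet's header shrinks as it travels
    while header:
        port = header.pop(0)  # no routing state needed in the switch
        node = LINKS[node][port]
        hops.append(node)
    return hops


def table_route(first_switch: str, dest: str) -> List[str]:
    """Each switch looks up the destination in its own table."""
    hops = [first_switch]
    node = first_switch
    while node in TABLES:  # processors (P*) have no table, so we stop there
        node = LINKS[node][TABLES[node][dest]]
        hops.append(node)
    return hops
```

The source-routed header grows with the path length, while the table-lookup header stays one destination wide at the cost of keeping every table consistent.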

21
Deterministic, E-cube Routing
  • Deterministic: follows a pre-specified route
  • k-ary d-cube: dimension-order routing
    • (x1, y1) → (x2, y2)
    • First Δx = x2 - x1,
    • Then Δy = y2 - y1
  • Tree: route through the common ancestor
  • E-cube: the route is determined by dimension k,
    where k is the position of the LS (or MS) nonzero
    bit in (source or current routing node) XOR
    destination
    • E.g., 000 to 111
    • 010 to 111

[Figure: 3-dimensional hypercube with nodes labeled 000 through 111]
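Both dimension-order schemes on this slide can be sketched in a few lines. The XY function assumes a 2D mesh without wraparound; the e-cube function corrects the least significant differing bit first (the most-significant-first variant is symmetric). Function names are hypothetical. For the slide's examples this gives 000 → 001 → 011 → 111 and 010 → 011 → 111.

```python
from typing import List, Tuple


def xy_route(src: Tuple[int, int], dst: Tuple[int, int]) -> List[Tuple[int, int]]:
    """Dimension-order routing on a 2D mesh: correct x fully, then y."""
    (x, y), (x2, y2) = src, dst
    path = [(x, y)]
    while x != x2:  # first the x offset
        x += 1 if x2 > x else -1
        path.append((x, y))
    while y != y2:  # then the y offset
        y += 1 if y2 > y else -1
        path.append((x, y))
    return path


def ecube_route(src: int, dst: int) -> List[int]:
    """E-cube routing on a hypercube, least significant differing bit first."""
    path = [src]
    node = src
    while node != dst:
        diff = node ^ dst  # bits that still have to be corrected
        k = (diff & -diff).bit_length() - 1  # position of the LS nonzero bit
        node ^= 1 << k  # traverse dimension k
        path.append(node)
    return path
```

Because each node computes the next dimension from its own label XOR the destination, the route is fully determined by the two endpoints, which is what makes it deterministic.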
22
(4) Adaptive Routing
  • Essential for fault tolerance
  • At least multipath
  • Can improve utilization of the network
  • Simple deterministic algorithms easily run into
    bad permutations
  • Fully/partially adaptive, minimal/non-minimal
  • Can introduce complexity or anomalies
  • A little adaptation goes a long way!

23
Hot Potato Routing
  • Every cycle, each switch takes each input and
    routes it to an output
  • But not necessarily to the desired output
  • No switch buffering!
  • Possibility of livelock if no precautions taken
  • E.g., could grant priority based on age of packet

24
Real Machines
25
Basic Communication Operations (Ch 4)
26
One to All Broadcast / All to One Reduce
  • Initially, only the source processor has the data
    of size m that needs to be broadcast. When the
    procedure terminates, there are p copies of the
    initial data, one residing at each processor.
  • Reduce is the reverse of broadcast and uses the
    same algorithms: the messages travel in the
    opposite direction and are combined with an
    associative operator at each step, leaving the
    result at a single processor.
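As a sketch of reduce being broadcast run in reverse, the function below combines one value per node onto node 0 of a hypercube. It assumes p is a power of two and that the matching broadcast uses the highest dimension first, so the reduction uses the lowest dimension first. The function name and the default operator are hypothetical choices; any associative operator can be passed in.

```python
from operator import add
from typing import Callable, List


def hypercube_reduce(values: List[int], op: Callable[[int, int], int] = add) -> int:
    """All-to-one reduction onto node 0 of a hypercube with len(values) = 2**d nodes."""
    p = len(values)
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    partial = list(values)
    d = p.bit_length() - 1
    for k in range(d):  # lowest dimension first: broadcast order reversed
        for node in range(p):
            low = node & ((1 << (k + 1)) - 1)  # bits 0..k of the label
            if low == (1 << k):  # bit k set and all lower bits clear: send now
                partner = node ^ (1 << k)
                partial[partner] = op(partial[partner], partial[node])
    return partial[0]
```

For example, `hypercube_reduce(list(range(1, 9)))` sums 1 to 8 on node 0 in log 8 = 3 steps, and passing `max` instead finds the maximum the same way.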

27
Broadcast on ring (Store and Forward)
If the sender sends the message consecutively to
the p-1 other processors, it takes p-1 steps. By
sending in both directions around the ring and
having each processor that receives the message
forward it, this can be reduced to ⌊p/2⌋ steps.
E.g., an 8-processor ring requires 4 steps.
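The ⌊p/2⌋ count can be checked with a small simulation. It assumes bidirectional links and that a processor may forward to both neighbours in the same step; the function name is hypothetical.

```python
def ring_broadcast_sf_steps(p: int, source: int = 0) -> int:
    """Count store-and-forward steps when the message spreads both ways round a p-node ring.

    In each step every processor that holds the message passes it one hop
    to each neighbour (only the newest holders actually add anything).
    """
    have = {source}
    steps = 0
    while len(have) < p:
        have |= {(n + d) % p for n in have for d in (-1, 1)}
        steps += 1
    return steps
```

For p = 8 the message reaches distance 1, 2, 3 and finally 4 on both sides of the source, so the farthest processor is informed after 4 steps.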
28
NS diagram for broadcast on ring
29
Ring network, Cut-Through routing
  • With cut-through routing, messages can be sent
    faster to nodes that are multiple hops away in
    the network. Using this, we send the message
    first to the farthest node.

In general, in a p-processor ring the source
processor first sends the data to the processor
at distance p/2, then both processors send the
message to the processors at distance p/4 in
the same direction, then to p/8, etc., so the
broadcast completes in log p steps.
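A sketch of this schedule, assuming p is a power of two and the source is node 0. It records the (sender, receiver) pairs of each step and prices each step with the cut-through model ts + l·th + m·tw, where l = p/2, p/4, ..., 1. The function names are hypothetical. Summing the steps gives ts·log p + tw·m·log p + th·(p - 1), since the distances add up to p - 1.

```python
from typing import List, Tuple


def ring_broadcast_ct(p: int) -> List[List[Tuple[int, int]]]:
    """Recursive-halving broadcast from node 0 on a p-node ring (p a power of two).

    In each step every current holder sends to the node `dist` positions
    further along the same direction; dist goes p/2, p/4, ..., 1.
    """
    assert p > 0 and p & (p - 1) == 0, "sketch assumes p is a power of two"
    holders = [0]
    schedule = []
    dist = p // 2
    while dist >= 1:
        step = [(s, (s + dist) % p) for s in holders]
        holders += [r for _, r in step]
        schedule.append(step)
        dist //= 2
    return schedule


def ct_broadcast_time(p: int, m: int, ts: float, th: float, tw: float) -> float:
    """Add up ts + l*th + m*tw over the log2(p) steps, with l = p/2, p/4, ..., 1."""
    log_p = p.bit_length() - 1
    return sum(ts + (p >> (i + 1)) * th + m * tw for i in range(log_p))
```

For p = 8 the schedule is 0 → 4, then 0 → 2 and 4 → 6, then four distance-1 sends; the messages within a step use disjoint links, so they do not contend.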
30
Broadcast on ring (Cut-Through )
31
Broadcast on mesh (Store and Forward)
Most of the optimised communication algorithms on
a mesh are simple extensions of their ring
counterparts, by consecutively applying the ring
algorithm on each dimension of the mesh.
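Applying the ring algorithm one dimension at a time can be sketched as below. The sketch assumes a q × q mesh with wraparound links, so that every row and column is a ring, and reuses the both-directions store-and-forward ring broadcast: first along the source's row, then along all columns in parallel, for 2⌊q/2⌋ steps in total. The function name is hypothetical.

```python
from typing import Set, Tuple


def mesh_broadcast_sf_steps(q: int, source: Tuple[int, int] = (0, 0)) -> int:
    """Broadcast on a q x q wraparound mesh, one dimension (ring) at a time."""
    have: Set[Tuple[int, ...]] = {source}
    steps = 0
    for dim in (0, 1):  # phase 1: along the row; phase 2: along every column
        while len(have) < q ** (dim + 1):
            nxt = set(have)
            for node in have:
                for d in (-1, 1):  # forward one hop in both directions
                    moved = list(node)
                    moved[dim] = (moved[dim] + d) % q
                    nxt.add(tuple(moved))
            have = nxt
            steps += 1
    return steps
```

For a 16-processor (4 × 4) mesh this takes 2 steps along the row and 2 down the columns, 4 in total.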
32
Broadcast on mesh (C-T)
33
Hypercube
  • The regular binary structure of the hypercube
    plays an important role in optimising
    communication.
  • Here, a broadcast is performed by sending the
    message along a different dimension at each step.
    This results in log p = d steps.
  • log p is the minimal number of steps on any
    network (if each processor sends one message per
    step): the number of processors holding the
    message can at most double in each step.
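A sketch of the hypercube broadcast, assuming p = 2^d nodes and that the dimensions are used from highest to lowest (any fixed order works). The number of senders doubles every step, which is exactly the doubling behind the log p lower bound. The function name is hypothetical.

```python
from typing import List, Tuple


def hypercube_broadcast(d: int, source: int = 0) -> List[List[Tuple[int, int]]]:
    """One-to-all broadcast on a 2**d-node hypercube, one dimension per step.

    Returns the (sender, receiver) pairs of each step; every holder sends
    across dimension k, so the set of holders doubles each time.
    """
    holders = [source]
    schedule = []
    for k in reversed(range(d)):  # highest dimension first
        step = [(s, s ^ (1 << k)) for s in holders]
        holders += [r for _, r in step]
        schedule.append(step)
    return schedule
```

For d = 3 the steps contain 1, 2 and 4 messages, and all 8 nodes hold the data after 3 steps.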

34
(No Transcript)
35
Broadcast on hypercube (SF)
36
Broadcast on binary tree (C-T)
37
Gossiping
All-to-All Communication
38
Gossiping on Ring (Store and Forward)
39
Gossiping on Mesh (Store and Forward)
40
Gossiping on Hypercube (SF)
41
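Gossiping on Ring (and Mesh), Cut-Through Routing
  • Each processor sends m(p-1) words of data because
    it has an m-word packet for every other processor
  • If all packets travel in the same direction, the
    average distance that an m-word packet travels is
    $\frac{1}{p-1}\sum_{i=1}^{p-1} i = \frac{p}{2}$
  • Since there are p processors, each performing the
    same type of communication, the total traffic on
    the network is
    $m(p-1) \cdot \frac{p}{2} \cdot p = \frac{m(p-1)p^2}{2}$
  • The total number of communication channels in the
    network to share this load is p, so the
    communication time is at least
    $t_w \cdot \frac{m(p-1)p^2/2}{p} = \frac{t_w\, m(p-1)p}{2}$

Hence this procedure cannot be improved by using
CT routing: sending the packets directly does not
beat the store-and-forward ring procedure.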
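The traffic count above can be checked by brute force. The sketch assumes every processor sends its m-word packet directly to every other processor, all travelling clockwise on a unidirectional ring; it sums words × hops and compares the result with m(p-1)p²/2. The function names are hypothetical.

```python
def ring_traffic(p: int, m: int) -> int:
    """Total words x links when every node sends an m-word packet to every
    other node directly, all going clockwise round a p-node ring."""
    total = 0
    for src in range(p):
        for dst in range(p):
            if dst != src:
                total += m * ((dst - src) % p)  # hops travelled clockwise
    return total


def ct_time_lower_bound(p: int, m: int, tw: float) -> float:
    """Spread the total traffic evenly over the p links: time >= tw * traffic / p."""
    return tw * ring_traffic(p, m) / p
```

Dividing by the p links gives the tw·m(p-1)p/2 bound; for p > 2 this is larger than the tw·m(p-1) term of the store-and-forward gossiping procedure on a ring.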
42
Gossiping on Hypercube (CT routing)
43
Others (later)
  • Scatter
  • Gather

44
Next Class
  • Parallel Algorithms
  • Task Dependency Graphs
  • Data Decomposition