Title: Communication Operations
1 CS / IS C422 Parallel Computing
- Lecture 15
- Communication Operations
2 Recap of Lecture 14
- Directory Based
- Centralized
- Distributed
- Communication
- Store-and-Forward
- Cut-Through (CT)
- Costs
3 Packet Routing
- There are two basic approaches to routing packets, based on what a switch does with a packet as its flits begin to arrive
- Store-and-forward
- Cut-through
- Virtual cut-through
- Wormhole
4 Communication time
- Communication involves 3 costs
- 1. Static start-up time (ts)
- The time required to handle a message at the sending processor
- 2. Per-hop time (th), with l the number of links that the message traverses
- It takes a finite amount of time for a message to reach the next processor in its path after it leaves a processor.
- 3. Per-word transfer time (tw), with m the message size in words
- If the channel bandwidth is r words per second, then each word takes time tw = 1/r to traverse the link.
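The three parameters above are usually combined into one cost expression per switching scheme. The sketch below uses the common textbook model (store-and-forward pays the full message time on every link, cut-through pays only the per-hop time on every link); the slides do not spell these formulas out, so the function names and the sample values are illustrative assumptions.

```python
def store_and_forward_time(ts, th, tw, m, l):
    """Textbook model: the whole m-word message is received at each of the
    l hops before it is forwarded, so every link costs (th + m*tw)."""
    return ts + (m * tw + th) * l


def cut_through_time(ts, th, tw, m, l):
    """Textbook model: only the header pays the per-hop time on each of the
    l links; the m words follow in a pipeline and are paid for once."""
    return ts + l * th + m * tw


# Illustrative values in arbitrary time units (assumed, not from the slides).
ts, th, tw = 100.0, 2.0, 1.0
m, l = 64, 4
print(store_and_forward_time(ts, th, tw, m, l))  # 364.0
print(cut_through_time(ts, th, tw, m, l))        # 172.0
```

With these numbers the difference comes entirely from the m*tw term, which store-and-forward multiplies by l and cut-through does not.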
5 The 2 main communication schemes
6 Plan for Today
7 Plan for Today
- Message Passing mechanisms
- Routing Mechanisms for ICNs
- Deterministic Routing
- XY-routing
- E-cube Routing
- Adaptive Routing
- One to All Broadcast
- All to One Reduce, All Reduce
- All to All Broadcast
- Scatter, Gather
8 Collisions
What happens if a stream of flits arrives at a switch, and the desired output port is busy?
- Store whole packet in a buffer
- (called virtual cut-through)
- Block in place across multiple switches
- (called wormhole routing)
- Drop the data
- Resources are lost!
- Misroute: keep moving, but in the wrong direction
9 Virtual Cut-Through
- What to do if output port is blocked?
- Allow the tail to continue when the head is blocked, absorbing the whole message into a single switch
- Requires a buffer large enough to hold the largest packet
- Degenerates to store-and-forward under high contention
10 Wormhole
- When the head of the message is blocked, the message stays strung out over the network
- Potentially blocks other messages (but each switch needs to buffer only the piece of the packet that is sent between switches).
- CM-5 used it, with each switch buffer being 4 bits per port
- Myrinet uses it
- Can cause tree saturation
11 Deadlocks
- In wormhole routing, packets hold switch resources while they move
- Flit buffers
- Output ports
- Another packet may arrive that needs the same resources
- Cyclic dependencies may lead to deadlock
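One way to make "cyclic dependencies" concrete is to model channels as graph nodes, with an edge from channel A to channel B when a packet holding A is waiting for B. A cycle in that graph means a possible deadlock. The sketch below is a plain depth-first cycle check; the graph encoding and the channel names c0..c3 are assumptions made for illustration.

```python
def has_cycle(waits_for):
    """Return True if the directed graph {node: [nodes it waits for]}
    contains a cycle (depth-first search with three colours)."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        colour[node] = GREY                      # node is on the current DFS path
        for nxt in waits_for.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY:                    # back edge: cycle found
                return True
            if state == WHITE and visit(nxt):
                return True
        colour[node] = BLACK                     # fully explored, no cycle through it
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(waits_for))


# Four packets on a ring, each holding one channel and waiting for the next.
ring_wait = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
# Same situation with the last dependency removed.
chain_wait = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"]}
print(has_cycle(ring_wait))   # True
print(has_cycle(chain_wait))  # False
```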
12 Deadlocks
13 Dependencies
- Deadlocks are the most dramatic problem
- But dependencies can also just lead to inefficiency
- A blocked packet still holds its channels
- (because flits need to stay contiguous to maintain routing)
- Another packet might otherwise be able to use these channels
14 Inefficiency
15 Virtual Channels
- Divide the buffers in each switch into several virtual channels
- Each virtual channel also has its own state and routing information
- Virtual channels share the use of physical resources
Dally, IEEE Trans. Par. Dist. Syst., 1992
16 Efficiency!
Red packet occupies some (not all!) buffer space
Green packet actually uses link
17 Deadlock Free Routing
- Virtual Channels
- Not to be confused with virtual cut-through
- Add buffers so flits of wormhole packets can be interleaved
- You can read about this in Dally's paper
- Up-Down
- Number switches higher the farther they are from processors
- Route up, make one turn, route down
- Turn Model Routing
- Restrict order of turns
- West first
- North last
- Negative first
- Can increase number of hops
18 Routing Algorithm
- How do I know where a packet should go?
- Topology does NOT determine routing (e.g., many paths through a torus)
- Many routing algorithms exist
- Arithmetic
- Source-based
- Table lookup
- Adaptive: route based on network state (e.g., contention)
19 (1) Arithmetic Routing
- For regular topology, use simple arithmetic to determine route
- E.g., 3D Torus
- Packet header contains signed offset to destination (per dimension)
- At each hop, switch increments or decrements (+/-) to reduce the offset in a dimension
- When all offsets are zero (e.g., x = 0 and y = 0), the packet is at the correct processor
- Drawbacks
- Requires ALU in switch
- Must re-compute CRC at each hop
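The offset-based scheme above can be sketched for a torus with k nodes per dimension. The header holds one signed offset per dimension and each hop moves it one step toward zero. Resolving the dimensions in order and choosing the shorter way around each ring are assumptions of this sketch, not something the slide specifies.

```python
def signed_offset(src, dst, k):
    """Shortest signed offset from src to dst on a ring of k nodes."""
    d = (dst - src) % k
    return d - k if d > k // 2 else d


def torus_route(src, dst, k):
    """Route between coordinate tuples on a torus with k nodes per dimension.
    Returns the list of nodes visited, including src and dst."""
    pos = list(src)
    header = [signed_offset(s, d, k) for s, d in zip(src, dst)]
    path = [tuple(pos)]
    for dim in range(len(header)):            # assumed: fix one dimension at a time
        while header[dim] != 0:
            step = 1 if header[dim] > 0 else -1
            pos[dim] = (pos[dim] + step) % k  # move one hop, wrapping around
            header[dim] -= step               # offset shrinks toward zero
            path.append(tuple(pos))
    return path


print(torus_route((0, 0, 0), (3, 1, 0), 4))
# [(0, 0, 0), (3, 0, 0), (3, 1, 0)]
```

The first hop uses the wrap-around link (offset -1 instead of +3), which is the reason the header offset is signed.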
20 (2) Source Based, (3) Table Lookup Routing
- Source Based
- Source specifies output port for each switch in route
- Very simple switches
- No control state
- Strip output port off header
- Myrinet uses this
- Can't be made adaptive
- Table Lookup
- Very small header, index into table for output port
- Big tables, must be kept up-to-date
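A minimal sketch of the source-based scheme: the header is a list of output ports, and each switch strips the first entry and forwards the rest. The topology dictionary, the switch names S0..S3, and the function name are hypothetical and exist only for this illustration.

```python
# Hypothetical topology: {switch: {output_port: next_switch}}.
DEMO_SWITCHES = {
    "S0": {0: "S1", 1: "S2"},
    "S1": {0: "S3", 1: "S0"},
    "S2": {0: "S3"},
    "S3": {},
}


def source_route(switches, start, header):
    """Follow a source-specified list of output ports from `start`.
    Returns the list of switches visited."""
    node, visited = start, [start]
    header = list(header)          # copy so the caller's list is not modified
    while header:
        port = header.pop(0)       # the switch strips its port off the header
        node = switches[node][port]
        visited.append(node)
    return visited


print(source_route(DEMO_SWITCHES, "S0", [0, 0]))  # ['S0', 'S1', 'S3']
print(source_route(DEMO_SWITCHES, "S0", [1, 0]))  # ['S0', 'S2', 'S3']
```

The switches keep no routing state here, which matches the "very simple switches" point; the cost is that the route is fixed at the source.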
21 Deterministic, E-cube Routing
- Deterministic: follows a pre-specified route
- K-ary d-cube: dimension-order routing
- (x1, y1) → (x2, y2)
- First Δx = x2 - x1,
- Then Δy = y2 - y1,
- Tree: common ancestor
- E-cube: route determined by dimension k, where k is the position of the least/most significant nonzero bit in
- (source or current route node) XOR (destination)
- Ex: 000 to 111
- 010 to 111
(Figure: 3-dimensional hypercube with nodes labelled 000 to 111)
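Both deterministic schemes on this slide fit in a few lines. The sketch below assumes the least significant nonzero bit is corrected first in E-cube routing (the slide allows either end), and that XY-routing finishes the x dimension before touching y. Function names are illustrative.

```python
def ecube_route(src, dst):
    """E-cube routing on a hypercube: repeatedly flip the least significant
    bit in which the current node and the destination differ."""
    path, node = [src], src
    diff = node ^ dst
    while diff:
        k = (diff & -diff).bit_length() - 1   # position of lowest nonzero bit
        node ^= 1 << k                        # cross dimension k
        path.append(node)
        diff = node ^ dst
    return path


def xy_route(src, dst):
    """XY (dimension-order) routing on a 2D mesh: move along x, then along y."""
    (x, y), (x2, y2) = src, dst
    path = [(x, y)]
    while x != x2:
        x += 1 if x2 > x else -1
        path.append((x, y))
    while y != y2:
        y += 1 if y2 > y else -1
        path.append((x, y))
    return path


print([format(n, "03b") for n in ecube_route(0b000, 0b111)])  # ['000', '001', '011', '111']
print([format(n, "03b") for n in ecube_route(0b010, 0b111)])  # ['010', '011', '111']
print(xy_route((0, 0), (2, 1)))  # [(0, 0), (1, 0), (2, 0), (2, 1)]
```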
22 (4) Adaptive Routing
- Essential for fault tolerance
- At least multipath
- Can improve utilization of the network
- Simple deterministic algorithms easily run into bad permutations
- Fully/partially adaptive, minimal/non-minimal
- Can introduce complexity or anomalies
- A little adaptation goes a long way!
23 Hot Potato Routing
- Every cycle, each switch takes each input and routes it to an output
- But not necessarily to the desired output
- No switch buffering!
- Possibility of livelock if no precautions taken
- E.g., could grant priority based on age of packet
24 Real Machines
25 Basic Communication Operations (Ch 4)
26 One to All Broadcast / All to One Reduce
- Initially, only the source processor has the data of size m that needs to be broadcast. When the procedure terminates, there are p copies of the initial data, one residing at each processor.
- Reduce is the reverse of Broadcast: the same algorithms are used with the direction of communication reversed.
27 Broadcast on ring (Store and Forward)
If the sender sends the message consecutively to each of the p-1 other processors, it takes p-1 steps. By forwarding the message around the ring in both directions, we can reduce this to p/2 steps. E.g., an 8-processor ring requires 4 steps.
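The step count for the store-and-forward ring broadcast can be checked with a small schedule calculation. The sketch assumes each processor sends at most one message per step, that the source starts the clockwise wave in step 1 and the counter-clockwise wave in step 2, and that every other processor forwards in the step after it receives. These are modelling assumptions; the function name is illustrative.

```python
def sf_ring_broadcast(p, source=0):
    """Return {node: step at which it receives the message} for a
    store-and-forward broadcast on a p-processor ring."""
    arrival = {source: 0}
    cw_hops = p // 2              # nodes covered by the clockwise wave
    ccw_hops = p - 1 - cw_hops    # remaining nodes, covered counter-clockwise
    for i in range(1, cw_hops + 1):
        arrival[(source + i) % p] = i        # wave left the source in step 1
    for i in range(1, ccw_hops + 1):
        arrival[(source - i) % p] = i + 1    # wave left the source in step 2
    return arrival


arrival = sf_ring_broadcast(8)
print(max(arrival.values()))  # 4
```

For the 8-processor ring this gives 4 steps, matching the slide; for odd p the same schedule takes the ceiling of p/2.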
28 NS diagram for broadcast on ring
29 Ring network, Cut-Through routing
- With cut-through routing, messages can be sent faster to nodes that are multiple hops away in the network. We exploit this by sending the message first to the most distant node.
In general, in a p-processor ring the source processor first sends the data to the processor at distance p/2, then both processors send the message to the processors at distance p/4 in the same direction, then to p/8, etc.
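The halving pattern described above can be written out as a schedule of (sender, receiver) pairs per step. This sketch assumes p is a power of two and numbers the processors 0..p-1 around the ring; the function name is illustrative.

```python
def ct_ring_broadcast(p, source=0):
    """Cut-through broadcast schedule on a p-processor ring (p a power of two).
    Returns a list of steps; each step is a list of (sender, receiver) pairs."""
    holders = [source]
    schedule = []
    dist = p // 2
    while dist >= 1:
        # Every processor that already has the message sends it `dist` away.
        step = [(h, (h + dist) % p) for h in holders]
        holders += [receiver for _, receiver in step]
        schedule.append(step)
        dist //= 2
    return schedule


for step in ct_ring_broadcast(8):
    print(step)
# [(0, 4)]
# [(0, 2), (4, 6)]
# [(0, 1), (4, 5), (2, 3), (6, 7)]
```

The number of steps is log2(p), here 3 instead of the 4 needed with store-and-forward.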
30 Broadcast on ring (Cut-Through)
31 Broadcast on mesh (Store and Forward)
Most of the optimised communication algorithms on a mesh are simple extensions of their ring counterparts, obtained by applying the ring algorithm to each dimension of the mesh in turn.
32 Broadcast on mesh (C-T)
33 Hypercube
- The regular binary structure of the hypercube plays an important role in optimising communication.
- Here, a broadcast is performed by sending the message along a different dimension at each step. This results in log p = d steps.
- log p is also the minimum number of steps on any network when each processor sends to one other processor per step, because the number of processors holding the message can at most double in each step.
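The dimension-by-dimension broadcast can be sketched as follows. Going from the highest dimension down is one common presentation and is an assumption here; any order of dimensions gives the same number of steps. The function name is illustrative.

```python
def hypercube_broadcast(d, source=0):
    """Broadcast schedule on a d-dimensional hypercube (p = 2**d nodes).
    Returns a list of d steps; each step is a list of (sender, receiver)."""
    holders = {source}
    schedule = []
    for k in range(d - 1, -1, -1):            # one dimension per step
        # Every current holder sends to its neighbour across dimension k.
        step = [(h, h ^ (1 << k)) for h in sorted(holders)]
        holders |= {receiver for _, receiver in step}
        schedule.append(step)
    return schedule


for step in hypercube_broadcast(3):
    print(step)
# [(0, 4)]
# [(0, 2), (4, 6)]
# [(0, 1), (2, 3), (4, 5), (6, 7)]
```

The number of holders doubles in every step, so all 2**d nodes are reached after d steps.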
35 Broadcast on hypercube (SF)
36 Broadcast on binary tree (C-T)
37 Gossiping
All-to-All Communication
38 Gossiping on Ring (Store and Forward)
39 Gossiping on Mesh (Store and Forward)
40 Gossiping on Hypercube (SF)
41 Gossiping on Ring (and Mesh): Cut-Through Routing
- Each process sends m(p-1) words of data because it has an m-word packet for every other processor
- The average distance that an m-word packet travels is p/2
- Since there are p processors, each performing the same type of communication, the total traffic on the network is m(p-1) x p/2 x p
- The total number of communication channels in the network to share this load is p.
Hence this procedure cannot be improved by using CT routing
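The counting argument above can be worked through for the ring case. The formulas below follow the standard textbook version of this argument and are an assumption about what the slide's missing expressions showed; they take a ring in which the p channels are each used in one direction, so a packet travels distance 1, 2, ..., p-1 to reach the other processors.

```latex
\begin{align*}
\text{average distance} &= \frac{1}{p-1}\sum_{i=1}^{p-1} i = \frac{p}{2} \\
\text{total traffic} &= m(p-1)\cdot\frac{p}{2}\cdot p \\
T &\ge \frac{t_w\, m(p-1)\, p^2/2}{p} = \frac{t_w\, m\,(p-1)\,p}{2}
\end{align*}
```

The bound depends only on the traffic and the number of channels, not on how far a single message can travel in one step, so the tw term is the same as for the store-and-forward procedure.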
42 Gossiping on Hypercube (CT routing)
43 Others (later)
44 Next Class
- Parallel Algorithms
- Task Dependency Graphs
- Data Decomposition