Communication Operations - PowerPoint PPT Presentation

About This Presentation

Title:

Communication Operations

Slides: 45

Provided by: tsbsud

Transcript and Presenter's Notes

Title: Communication Operations

1

CS / IS C422 Parallel Computing

Lecture 15
Communication Operations

2
Recap of Lecture 14

Directory Based
Centralized
Distributed
Communication
Store Forward
CT
Costs

3
Packet Routing

There are two basic approaches to routing
packets, based on what a switch does with a
packet as its flits begin to arrive
Store-and-forward
Cut-through
Virtual cut-through
Wormhole

4
Communication time

Communication time has 3 cost components
1. Static start-up time (ts)
It is the time required to handle a message at
the sending processor
2. Per-hop time (th), with l the number of links
that the message passes
It takes a finite amount of time for a message to
reach the next processor in its path after it
leaves a processor.
3. Per-word transfer time (tw), with m the
message size in words
If the channel bandwidth is r words per second,
then each word takes time tw = 1/r to traverse the
link.
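
The three costs above can be combined into a total message time. The sketch below is illustrative only: it assumes the usual textbook cost models for store-and-forward and cut-through routing, which this slide does not state explicitly, and the function names are hypothetical.

```python
def store_and_forward_time(ts, th, tw, m, l):
    """Assumed model: every one of the l hops receives the whole
    m-word message (m * tw) and pays the per-hop time th before
    forwarding it."""
    return ts + (m * tw + th) * l


def cut_through_time(ts, th, tw, m, l):
    """Assumed model: only the header pays the per-hop time on each of
    the l links; the m words are pipelined behind it."""
    return ts + l * th + m * tw


# Hypothetical numbers: ts=100, th=2, tw=1 time units, 64 words, 4 links.
print(store_and_forward_time(100, 2, 1, 64, 4))
print(cut_through_time(100, 2, 1, 64, 4))
```

Under these assumed models, the gap between the two schemes grows with both the message size m and the number of links l.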

5
The 2 main communication schemes
6
Plan for Today
7
Plan for Today

Message Passing mechanisms
Routing Mechanisms for ICNs
Deterministic Routing
XY-routing
E-cube Routing
Adaptive Routing
One to All Broadcast
All to One Reduce, All Reduce
All to All Broadcast
Scatter, Gather

8
Collisions
What happens if a stream of flits arrives at a
switch, and the desired output port is busy?

Store whole packet in a buffer
(called virtual cut through)
Block in-place across multiple switches
(called wormhole routing)
Drop the data
Resources are lost!!!
Misroute: keep moving, but in the wrong direction

9
Virtual Cut-Through

What to do if output port is blocked?
Allow the tail to continue when the head is
blocked, absorbing the whole message into a
single switch
Requires a buffer large enough to hold the
largest packet
Degenerates to store-and-forward with high
contention

10
Wormhole

When the head of the message is blocked, the
message stays strung out over the network
Potentially blocks other messages (but each switch
needs to buffer only the piece of the packet that
is sent between switches).
CM-5 used it, with each switch buffer being 4
bits per port
Myrinet uses it
Can cause tree saturation

11
Deadlocks

In wormhole routing, packets hold switch
resources while they move
Flit buffers
Output ports
Another packet may arrive that needs the same
resources
Cyclic dependencies may lead to deadlock

12
Deadlocks
13
Dependencies

Deadlocks are the most dramatic problems
But can also just lead to inefficiency
A blocked packet still holds its channels
(because flits need to stay contiguous to
maintain routing)
Another packet may be able to utilize these
channels

14
Inefficiency
15
Virtual Channels

Divide the buffers in each switch into several
virtual channels
Each virtual channel also has its own state and
routing information
Virtual channels share the use of physical
resources

Dally, IEEE Trans. Par. Dist. Syst., 1992
16
Efficiency!
Red packet occupies some (not all!!!) buffer space
Green packet actually uses link
17
Deadlock Free Routing

Virtual Channels
Not to be confused with virtual cut-through
Add buffers so flits of wormhole packets can be
interleaved
You can read about this in Dally's paper
Up-Down
Number switches higher farther away from
processors
Route up, make one turn, route down
Turn Model Routing
Restrict order of turns
West first
North last
Negative first
Can increase number of hops

18
Routing Algorithm

How do I know where a packet should go?
Topology does NOT determine routing (e.g., many
paths thru torus)
Many routing algorithms exist
Arithmetic
Source-based
Table lookup
Adaptive: route based on network state (e.g.,
contention)

19
(1) Arithmetic Routing

For regular topology, use simple arithmetic to
determine route
E.g., 3D Torus
Packet header contains signed offset to
destination (per dimension)
At each hop, the switch applies +/- 1 to reduce the
offset in a dimension
When x = 0 and y = 0, then at correct processor
Drawbacks
Requires ALU in switch
Must re-compute CRC at each hop
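
A minimal sketch of the arithmetic routing idea above, assuming a k-ary torus with the same radix k in every dimension and assuming the offsets are consumed one dimension at a time (the slide does not fix the order). The function names are hypothetical.

```python
def signed_offset(src, dst, k):
    """Shortest signed offset from src to dst on a ring of k nodes."""
    d = (dst - src) % k
    return d if d <= k // 2 else d - k


def torus_route(src, dst, k):
    """Return the nodes visited from src to dst in a k-ary torus.

    The header holds one signed offset per dimension; each hop moves
    one step in a dimension and shrinks that offset towards zero.
    """
    header = [signed_offset(s, d, k) for s, d in zip(src, dst)]
    node = list(src)
    path = [tuple(node)]
    for dim in range(len(header)):
        while header[dim] != 0:
            step = 1 if header[dim] > 0 else -1
            node[dim] = (node[dim] + step) % k  # wrap around the torus
            header[dim] -= step                 # the switch's +/- 1 update
            path.append(tuple(node))
    return path


# 4 x 4 x 4 torus: the x offset wraps around (-1) instead of taking 3 hops.
print(torus_route((0, 0, 0), (3, 1, 0), 4))
```

The packet has arrived once every offset in the header is zero.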

20
(2) Source Based (3) Table Lookup Routing

Source Based
Source specifies output port for each switch in
route
Very simple switches
No control state
Strip output port off header
Myrinet uses this
Can't be made adaptive
Table Lookup
Very small header, index into table for output
port
Big tables, must be kept up-to-date
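
The source-based scheme above can be sketched as follows. This is a toy illustration only: the switch names, port numbers and the `links` map are hypothetical, and it is not a description of how Myrinet encodes its headers.

```python
def follow_source_route(start, header, links):
    """Walk a source route: each switch strips the first output port
    off the header and forwards the packet on that port."""
    node = start
    remaining = list(header)
    hops = [node]
    while remaining:
        port = remaining.pop(0)      # strip output port off header
        node = links[(node, port)]   # the switch keeps no routing state
        hops.append(node)
    return hops


# Hypothetical wiring: (switch, output port) -> next node.
links = {("S0", 1): "S1", ("S1", 0): "S2", ("S2", 2): "P7"}
print(follow_source_route("S0", [1, 0, 2], links))
```

A table-lookup switch would instead keep a per-switch table from destination index to output port, so the header can shrink to a single index.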

21
Deterministic, E-cube Routing

Deterministic: follows a pre-specified route
K-ary d-cube: dimension-order routing
(x1, y1) → (x2, y2)
First Δx = x2 - x1,
Then Δy = y2 - y1
Tree: common ancestor
E-cube: route determined by dimension k, where k is
the position of the LS/MS nonzero bit in
(Source/Route node) XOR Destn
Ex: 000 to 111
010 to 111

[Figure: 3-dimensional hypercube with nodes labelled 000 to 111]
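
A minimal sketch of E-cube routing for the two examples on this slide, assuming the variant that always corrects the least significant differing bit first (the slide allows either LS or MS). The function name is hypothetical.

```python
def ecube_route(src, dst):
    """Return the hypercube nodes visited from src to dst.

    At each node, XOR the current label with the destination and flip
    the least significant nonzero bit of the result.
    """
    node = src
    path = [node]
    diff = node ^ dst
    while diff:
        k = (diff & -diff).bit_length() - 1  # position of lowest set bit
        node ^= 1 << k                       # move along dimension k
        path.append(node)
        diff = node ^ dst
    return path


for s, d in [(0b000, 0b111), (0b010, 0b111)]:
    print(" -> ".join(format(n, "03b") for n in ecube_route(s, d)))
```

The number of hops equals the number of bits in which source and destination differ.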
22
(4) Adaptive Routing

Essential for fault tolerance
At least multipath
Can improve utilization of the network
Simple deterministic algorithms easily run into
bad permutations
Fully/partially adaptive, minimal/non-minimal
Can introduce complexity or anomalies
A little adaptation goes a long way!

23
Hot Potato Routing

Every cycle, each switch takes each input and
routes it to an output
But not necessarily to the desired output
No switch buffering!
Possibility of livelock if no precautions taken
E.g., could grant priority based on age of packet

24
Real Machines
25
Basic Communication Operations (Ch 4)
26
One to All Broadcast / All to One Reduce

Initially, only the source processor has the data
of size m that needs to be broadcast. When the
procedure terminates, there are P copies of the
initial data, one residing at each processor.
Reduce is the reverse of Broadcast and uses the
same algorithms.

27
Broadcast on ring (Store and Forward)
If the sender sends the message consecutively to
the p-1 other processors, it takes p-1 steps. By
sending in both directions around the ring, we can
reduce this to p/2 steps. E.g., an 8-processor
ring requires 4 steps
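
A small simulation of the p/2 step count, assuming the optimisation is to forward the message around the ring in both directions at once, one hop per step. The function name is hypothetical.

```python
def ring_broadcast_steps(p, source=0):
    """Count store-and-forward steps until all p ring nodes have the
    message, forwarding one hop clockwise and one hop anticlockwise
    in every step."""
    have = {source}
    left = right = source
    steps = 0
    while len(have) < p:
        right = (right + 1) % p
        left = (left - 1) % p
        have.update({left, right})
        steps += 1
    return steps


print(ring_broadcast_steps(8))
```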
28
NS diagram for broadcast on ring
29
Ring network, Cut-Through routing

With cut-through routing, messages can be sent
faster to nodes that are multiple hops away in
the network. By using this, we send the message
first to the outermost node.

In general, in a p-processor ring the source
processor first sends the data to the processor
at distance p/2, then both processors send the
message to the processors at distance p/4 in
the same direction, then to p/8, etc.
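
The halving pattern described above can be sketched as a schedule of (sender, receiver) pairs, assuming p is a power of two. The function name is hypothetical.

```python
def ring_ct_broadcast_schedule(p, source=0):
    """Return one list of (sender, receiver) pairs per step.

    Every node that already holds the message sends it to the node at
    the current distance, and the distance halves after each step.
    """
    schedule = []
    holders = [source]
    dist = p // 2
    while dist >= 1:
        step = [(s, (s + dist) % p) for s in holders]
        schedule.append(step)
        holders += [r for _, r in step]
        dist //= 2
    return schedule


for i, step in enumerate(ring_ct_broadcast_schedule(8), start=1):
    print(f"step {i}: {step}")
```

For p = 8 this gives 3 steps (log p) instead of the 4 needed with store-and-forward.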
30
Broadcast on ring (Cut-Through )
31
Broadcast on mesh (Store and Forward)
Most of the optimised communication algorithms on
a mesh are simple extensions of their ring
counterparts, by consecutively applying the ring
algorithm on each dimension of the mesh.
32
Broadcast on mesh (C-T)
33
Hypercube

The regular binary structure of the hypercube
plays an important role in optimising
communication.
Here, a broadcast is performed by sending the
message along each dimension at each step. This
results in log p or d steps.
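If each processor sends to one other processor per
step, the number of processors holding the message
can at most double in each step, so log p is the
minimal number of steps for every network.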
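
A minimal sketch of the hypercube broadcast, assuming the dimensions are used from the highest to the lowest (the slide does not fix the order). The function name is hypothetical.

```python
def hypercube_broadcast(d, source=0):
    """Return one list of (sender, receiver) pairs per step for a
    one-to-all broadcast on a d-dimensional hypercube."""
    have = {source}
    schedule = []
    for i in range(d - 1, -1, -1):
        # Every current holder sends across dimension i.
        step = [(n, n ^ (1 << i)) for n in sorted(have)]
        have.update(r for _, r in step)
        schedule.append(step)
    return schedule


for i, step in enumerate(hypercube_broadcast(3), start=1):
    print(f"step {i}: {step}")
```

The number of holders doubles in every step, so p = 2^d processors are covered in d = log p steps.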

35
Broadcast on hypercube (SF)
36
Broadcast on binary tree (C-T)
37
Gossiping
All-to-All Communication
38
Gossiping on Ring (Store and Forward)
39
Gossiping on Mesh (Store and Forward)
40
Gossiping on Hypercube (SF)
41
Gossiping on Ring (and Mesh): Cut-Through Routing

Each process sends m(p-1) words of data because
it has an m-word packet for every other processor
The average distance that an m-word packet
travels is p/2
Since there are p processors, each performing the
same type of communication, the total traffic on
the network is m(p-1) x p/2 x p
The total number of communication channels in the
network to share this load is p.

Hence this procedure cannot be improved by using
CT routing
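
The traffic argument can be checked numerically. The sketch below rests on assumptions not spelled out in the transcript: packets travel in one direction around the ring, so the average distance is p/2, and the total load is shared evenly by the p channels at tw per word. The function name is hypothetical.

```python
def ring_gossip_lower_bound(p, m, tw):
    """Lower bound on communication time from total traffic / channels."""
    avg_distance = sum(range(1, p)) / (p - 1)        # equals p / 2
    total_traffic = m * (p - 1) * avg_distance * p   # word-hops on the ring
    channels = p
    return tw * total_traffic / channels


# Hypothetical numbers: 8 processors, 1-word packets, tw = 1.
print(ring_gossip_lower_bound(8, 1, 1))
```

Because this bound follows from link capacity alone, it holds whichever routing mechanism is used.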
42
Gossiping on Hypercube (CT routing)
43
Others (later)