Title: Dynamic Interconnect
1 Dynamic Interconnect
2 Multistage Network -- Omega Network
- Motivation: simulate a crossbar network, but with fewer links
- Components
  - N processors to connect, using N log(N) links
  - log(N) stages, with a shuffle connection feeding each stage
  - Each stage has N/2 2x2 switch boxes
[Figure: omega network connecting processors P0-P7 (inputs) to P0-P7 (outputs)]
3 Omega Network -- Routing
- Distributed control
  - At each stage, the switch checks that stage's bit of the destination address: if it is 0, connect to the upper port; otherwise connect to the lower port
- Not all permutations are possible
  - What if 010 connects to 110, 110 to 100, and 000 to 101?
[Figure: omega network connecting P0-P7 to P0-P7, illustrating routing]
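The routing rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not a definitive model: it assumes N = 8, assumes each stage is preceded by a perfect shuffle (a left rotation of the 3-bit line number), and assumes stage i examines the i-th most significant bit of the destination. The names `shuffle`, `route`, and `find_conflicts` are hypothetical.

```python
from collections import defaultdict

N_BITS = 3              # log2(N) stages for N = 8 processors (assumption)
N = 1 << N_BITS


def shuffle(line):
    """Perfect shuffle: rotate the N_BITS-bit line number left by one."""
    return ((line << 1) | (line >> (N_BITS - 1))) & (N - 1)


def route(src, dst):
    """Return the output line used after each stage on the way src -> dst."""
    line, path = src, []
    for stage in range(N_BITS):
        line = shuffle(line)
        bit = (dst >> (N_BITS - 1 - stage)) & 1   # stage-th MSB of dst
        line = (line & ~1) | bit                  # 0 = upper port, 1 = lower port
        path.append(line)
    return path


def find_conflicts(pairs):
    """Map (stage, line) -> list of messages that need that same link."""
    users = defaultdict(list)
    for src, dst in pairs:
        for stage, line in enumerate(route(src, dst)):
            users[(stage, line)].append((src, dst))
    return {key: msgs for key, msgs in users.items() if len(msgs) > 1}


# The three requests asked about on the slide.
requests = [(0b010, 0b110), (0b110, 0b100), (0b000, 0b101)]
conflicts = find_conflicts(requests)
for (stage, line), msgs in sorted(conflicts.items()):
    print(f"stage {stage}, output line {line:03b}: {msgs}")
```

Under these assumptions the three requests cannot all be routed at once: 010 -> 110 and 110 -> 100 both need the same output of a first-stage switch.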
4 Comparisons between Dynamic Networks
- Bus System
  - Assume n processors on the bus; bus width is w bits
  - Data transfer latency: constant
  - Bandwidth per processor: O(w/n) to O(w)
  - Wire complexity: O(w)
  - Switching complexity: O(n)
  - Routing capability: only one-to-one, one at a time
- Advantage
  - Cheap to build
- Disadvantage
  - Low bandwidth available to each processor
  - Prone to failure
5 Comparisons between Dynamic Networks
- Crossbar Switch
  - Assume an n x n crossbar with line width of w bits
  - Data transfer latency: constant
  - Bandwidth per processor: O(w) to O(nw)
  - Wire complexity: O(n^2 w)
  - Switching complexity: O(n^2)
  - Routing capability: all permutations, one at a time
- Advantage
  - Highest bandwidth
  - Highest routing capability
- Disadvantage
  - High hardware cost
6 Comparisons between Dynamic Networks
- Multistage network
  - Assume an n x n network (n inputs, n outputs) with line width of w bits, built from 2 x 2 switches
  - Data transfer latency: O(log n)
  - Bandwidth per processor: O(w) to O(nw)
  - Wire complexity: O(nw log n)
  - Switching complexity: O(n log n)
  - Routing capability: some permutations and broadcast
- Advantage
  - Scalability with modular construction
  - Medium cost
- Disadvantage
  - Long latency
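To get a feel for how the wire and switching complexities on the three comparison slides grow, the leading terms can be evaluated for a few sizes. This is only a rough sketch: it assumes every hidden constant factor is 1 and uses a hypothetical width of w = 32 bits, so the numbers show trends, not real hardware counts.

```python
def bus(n, w):
    """Leading terms for a bus: O(w) wires, O(n) switching."""
    return {"wires": w, "switches": n}


def crossbar(n, w):
    """Leading terms for an n x n crossbar: O(n^2 w) wires, O(n^2) switching."""
    return {"wires": n * n * w, "switches": n * n}


def multistage(n, w):
    """Leading terms for a multistage network: O(n w log n), O(n log n)."""
    stages = n.bit_length() - 1          # log2(n) for n a power of two
    return {"wires": n * w * stages, "switches": n * stages}


W = 32  # hypothetical line width in bits
for n in (8, 64, 1024):
    print(f"n={n:5d}  bus={bus(n, W)}  "
          f"crossbar={crossbar(n, W)}  multistage={multistage(n, W)}")
```

The crossbar terms grow quadratically in n, while the multistage terms stay within a log factor of linear, which is the "medium cost" trade-off listed above.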
7 Message Transfer Mechanisms
- A message typically consists of
  - A header, which contains information about the destination
  - The data that needs to be transmitted
  - A trailer, which signals the end of the message
- The switching strategy determines how message data is actually transferred across network links in the chosen message route
- Three components of message transfer cost
  - Startup time ($t_s$): cost of handling the message at the sending processor
  - Per-hop time ($t_p$): time taken by the header to traverse a link
  - Per-word transfer time ($t_w$): time taken for a word to traverse a link
8 Dynamic Network -- Switching Strategy
- Circuit switching
  - A circuit path is established from the source to the destination
  - Like the telephone system
  - Requires setup time and gives poor bandwidth, but has short latency
  - Latency for routing an m-word message over l hops: $t = t_s + t_p + m t_w \approx t_s + m t_w$
[Figure: network connecting P0-P7 to P0-P7, illustrating circuit switching]
9 Dynamic Network -- Switching Strategy
- Store-and-forward (packet switching)
  - The message travels one link at a time, advancing when the next link is free
  - The message is buffered when the link is not free
  - Like the postal system
  - No setup time and better bandwidth, but longer latency
  - Only one link on the path can be active at a time
  - Latency over l hops: $t = l(t_s + m t_w)$
[Figure: network connecting P0-P7 to P0-P7, illustrating store-and-forward; annotation at an intermediate point: "Whole package buffered here"]
10 Dynamic Network -- Switching Strategy
- Cut-through
  - Similar to store-and-forward, but
    - The message is broken into parcels
    - All the links on the path can be active at once
  - Also called wormhole routing
  - Small setup time
  - Latency over l hops: $t = l(t_s + t_p) + m t_w \approx l t_p + m t_w$
[Figure: network connecting P0-P7 to P0-P7, illustrating cut-through; annotation at an intermediate point: "Parcels are buffered here"]
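The latency expressions on the three switching-strategy slides can be compared numerically. The sketch below takes the formulas as they appear on the slides (the circuit-switching expression with a single per-hop term, and the store-and-forward multiplier read as the number of hops). The parameter values are hypothetical and chosen only for illustration.

```python
def circuit_switching(ts, tp, tw, m):
    """t = ts + tp + m*tw, as written on the circuit-switching slide."""
    return ts + tp + m * tw


def store_and_forward(ts, tw, m, hops):
    """t = hops * (ts + m*tw): the whole message is resent on every link."""
    return hops * (ts + m * tw)


def cut_through(ts, tp, tw, m, hops):
    """t = hops * (ts + tp) + m*tw: only the header cost scales with hops."""
    return hops * (ts + tp) + m * tw


# Hypothetical parameters (arbitrary time units), not measurements.
ts, tp, tw = 10, 1, 2      # startup, per-hop, per-word times
m, hops = 100, 4           # message length in words, path length in links

print("circuit switching:", circuit_switching(ts, tp, tw, m))
print("store-and-forward:", store_and_forward(ts, tw, m, hops))
print("cut-through:      ", cut_through(ts, tp, tw, m, hops))
```

With these values the store-and-forward latency is several times larger than the other two, because the m*tw term is paid once per hop rather than once per message.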
11 Static Network vs. Dynamic Network
- Static Network
  - There are point-to-point links between processors
  - Parallel system expansion is easy
  - Some processors may be closer than others
  - Generally used for message-passing machine interconnects
- Dynamic Network
  - Paths are established as needed between processors
  - System expansion is difficult
  - Processors are usually equidistant
  - Usually used for shared-memory machine interconnects
12 One-to-all broadcast
- Algorithms often require a processor to send identical data to all other processors or to a subset of them. This operation is called one-to-all broadcast or single-node broadcast
- At the start of a single-node broadcast, the source processor has m words of data that need to be sent. At the end there are p copies of this data, one on each processor
- The dual of a broadcast operation is an all-to-one reduction, or single-node reduction
- All-to-one reduction
  - At the start of a single-node reduction, each processor has m words of data; the reduction combines the data from all processors using an associative operator to produce m words at the receiver
- A naive single-node broadcast or reduction uses p-1 steps
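The naive p-1 step versions of both operations can be sketched as a small single-process simulation. This is an illustration only: processors are modeled as entries of a Python list, and the function names `naive_broadcast` and `naive_reduction` are hypothetical.

```python
import operator


def naive_broadcast(data, p, source=0):
    """The source sends its m words to each other processor, one per step."""
    buffers = [None] * p
    buffers[source] = list(data)
    steps = 0
    for dest in range(p):
        if dest != source:
            buffers[dest] = list(buffers[source])   # one message per step
            steps += 1
    return buffers, steps


def naive_reduction(buffers, op=operator.add, root=0):
    """The root combines one m-word contribution per step, element-wise."""
    result = list(buffers[root])
    steps = 0
    for rank, words in enumerate(buffers):
        if rank != root:
            result = [op(a, b) for a, b in zip(result, words)]
            steps += 1
    return result, steps


copies, bcast_steps = naive_broadcast([1, 2, 3], p=4)
total, reduce_steps = naive_reduction(copies)
print(copies, bcast_steps)
print(total, reduce_steps)
```

Both loops run p-1 times, which is the step count the later ring and hypercube algorithms improve on.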
13 One-to-all Broadcast
[Figure: broadcast copies message M from one processor to processors 0, 1, ..., p-1; the reverse operation is reduction (accumulation)]
14 Store-and-forward Routing on Ring
- The source sends the message on both outgoing links in the first two steps
- All other processors receive on one link and transmit on the other
- It takes p/2 steps
- Cost: $T = (t_s + m t_w)\,p/2$
- What if we use circuit-switching routing?
[Figure: 8-processor ring (nodes 0-7); edge labels 1-4 give the step in which each link carries the message]
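The p/2 step count for the ring can be checked with a short schedule calculation. The sketch assumes the source sends in one direction in step 1 and in the other direction in step 2, and that every other node forwards the message one step after receiving it. The function names are hypothetical.

```python
def ring_broadcast_steps(p, source=0):
    """Return {node: step in which the node receives the message}."""
    arrival = {source: 0}
    clockwise = p // 2                 # nodes served by the first chain
    counter = p - 1 - clockwise        # nodes served by the second chain
    for k in range(1, clockwise + 1):
        arrival[(source + k) % p] = k          # chain started in step 1
    for k in range(1, counter + 1):
        arrival[(source - k) % p] = k + 1      # chain started in step 2
    return arrival


def ring_broadcast_cost(p, m, ts, tw):
    """Cost = (number of steps) * (ts + m*tw), one full message per step."""
    steps = max(ring_broadcast_steps(p).values())
    return steps * (ts + m * tw)


arrival = ring_broadcast_steps(8)
print(sorted(arrival.items()))
print("steps:", max(arrival.values()))
print("cost :", ring_broadcast_cost(8, m=100, ts=10, tw=2))
```

For p = 8 the last node is reached in step 4, matching p/2.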
15 Store-and-forward Routing on Hypercube
- Takes log(p) steps for a p-processor hypercube
- In the ith step, all processors that have the message transmit it to the neighboring processor that differs in the ith most significant bit
- Cost: $T = (t_s + m t_w)\log p$
[Figure: 3-dimensional hypercube (nodes 000-111); edge labels 1-3 give the step in which each link carries the message]
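The hypercube schedule described on this slide can be generated directly from the node labels. This is a minimal sketch, assuming p is a power of two and nodes are numbered 0 to p-1 in binary; the function name `hypercube_broadcast` is hypothetical.

```python
def hypercube_broadcast(p, source=0):
    """Return the list of (sender, receiver) pairs for each step."""
    d = p.bit_length() - 1                     # dimension, log2(p)
    if 1 << d != p:
        raise ValueError("p must be a power of two")
    have = {source}
    schedule = []
    for i in range(1, d + 1):
        mask = 1 << (d - i)                    # i-th most significant bit
        sends = [(node, node ^ mask) for node in sorted(have)]
        have.update(dst for _, dst in sends)   # receivers now hold the message
        schedule.append(sends)
    return schedule


schedule = hypercube_broadcast(8)
for step, sends in enumerate(schedule, start=1):
    print(step, [(f"{s:03b}", f"{r:03b}") for s, r in sends])
```

The number of senders doubles in every step, so all p nodes hold the message after log(p) steps.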
16 Homework
- Due next lecture
- Assume there is a mesh interconnect network with p = N x N nodes, using store-and-forward routing.
  - (a) Find the node that has the highest complexity for the one-to-all broadcast operation.
  - (b) Describe your routing algorithm (using pseudocode).
  - (c) What is the broadcast cost?
17 Cut-through Routing on Ring
- Algorithm takes log(p) steps
- In step i, the message is sent to the processor at distance $p/2^i$
- All messages flow in the same direction
- Cost: $T = (t_s + m t_w)\log p + t_p(p-1)$
[Figure: 8-processor ring (nodes 0-7); labels 1-3 give the step in which each transfer takes place]
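The cut-through ring algorithm and its cost can be sketched as follows. The sketch assumes p is a power of two and that the per-hop term counts one path length per step (p/2 + p/4 + ... + 1 = p-1 links in total). The function names are hypothetical.

```python
def ring_cut_through_broadcast(p, source=0):
    """Return, per step, a list of (sender, receiver, distance) triples."""
    steps = p.bit_length() - 1                 # log2(p) for p a power of two
    have = [source]
    schedule = []
    for i in range(1, steps + 1):
        dist = p >> i                          # p / 2**i
        sends = [(node, (node + dist) % p, dist) for node in have]
        have += [dst for _, dst, _ in sends]   # same direction for every send
        schedule.append(sends)
    return schedule


def ring_cut_through_cost(p, m, ts, tp, tw):
    """(ts + m*tw) per step, plus tp for each link on the per-step path."""
    schedule = ring_cut_through_broadcast(p)
    hops = sum(step[0][2] for step in schedule)    # p - 1 in total
    return len(schedule) * (ts + m * tw) + tp * hops


schedule = ring_cut_through_broadcast(8)
for step, sends in enumerate(schedule, start=1):
    print(step, sends)
print("cost:", ring_cut_through_cost(8, m=100, ts=10, tp=1, tw=2))
```

Compared with the store-and-forward ring, the (ts + m*tw) term is multiplied by log(p) rather than p/2; only the small per-hop term grows linearly with p.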
18 Cut-through Routing on 2D Torus
- Apply the ring algorithm to the sender's processor row
- Then use the ring algorithm on all processor columns
- $2\log\sqrt{p} = \log p$ steps
- Cost: $T = (t_s + m t_w)\log p + 2 t_p(\sqrt{p} - 1)$
- This algorithm works for a 2D mesh too
[Figure: 4 x 4 torus (nodes 0-15); labels 1-4 give the step in which each transfer takes place]
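The row-then-columns idea can be sketched by reusing the ring schedule twice. This is an illustrative simulation, assuming a square torus whose side is a power of two and identifying nodes by (row, column) pairs rather than the 0-15 numbering in the figure. The helper names are hypothetical.

```python
def ring_sends(members):
    """Cut-through ring broadcast along one ring, starting at members[0]."""
    k = len(members)
    have = [0]                                  # indices that hold the message
    steps = []
    dist = k // 2
    while dist >= 1:
        steps.append([(members[i], members[(i + dist) % k]) for i in have])
        have += [(i + dist) % k for i in have]
        dist //= 2
    return steps


def torus_broadcast(side, source=(0, 0)):
    """Broadcast along the source row, then along every column in parallel."""
    r0, c0 = source
    row = [(r0, (c0 + j) % side) for j in range(side)]
    schedule = ring_sends(row)
    per_column = [
        ring_sends([((r0 + i) % side, c) for i in range(side)])
        for c in range(side)
    ]
    for parts in zip(*per_column):              # columns proceed in lockstep
        schedule.append([send for part in parts for send in part])
    return schedule


schedule = torus_broadcast(4)
reached = {(0, 0)} | {dst for step in schedule for _, dst in step}
print("steps:", len(schedule), "nodes reached:", len(reached))
```

For a 4 x 4 torus this takes 2 steps along the row and 2 along the columns, 4 steps in total, which equals log(p) for p = 16.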
19 Cut-through Routing on Hypercube
- Takes log(p) steps for a p-processor hypercube
- In the ith step, all processors that have the message transmit it to the neighboring processor that differs in the ith most significant bit
- Cost: $T = (t_s + m t_w)\log p$
- Cut-through does not provide benefits here because each transfer uses only a single link
[Figure: 3-dimensional hypercube (nodes 000-111); edge labels 1-3 give the step in which each link carries the message]
20 Summary
- Switching strategies
  - Circuit switching
  - Store-and-forward
  - Cut-through (wormhole)
- One-to-all broadcasting on
  - Ring
    - Using store-and-forward
    - Using cut-through
  - Hypercube
    - Using store-and-forward
    - Using cut-through