Title: Dynamic Networks
1Dynamic Networks
- L.N. Bhuyan
- Partly from Berkeley Notes
2What is Dynamic Network
- A dynamic network is a network that can connect any input to any output by enabling or disabling some switches in the network
- Examples
  - Shared Bus: The bus arbiter connects a processor to a memory
  - Crossbar: Consists of many switching elements, which can be enabled to connect many inputs to many outputs simultaneously
  - Multistage Network: Consists of several stages of switches that are enabled to set up connections
  - The nodes in static networks (like Mesh) also consist of dynamic crossbars
3Crossbar Switch Design
- Complexity $O(N^2)$ for an N x N crossbar. Why? See next page
4How do you build a crossbar
[Figure: crossbar built from multiplexers; select inputs labeled "From Control"]
- $N^2$ switches => Cost $O(N^2)$
- Time taken by the arbiter: $O(N^2)$
- Multiplexers are controlled from the controller
5Crossbar Contd.
- An N x N crossbar allows all N inputs to be connected simultaneously to all N outputs
- It allows all one-to-one mappings, called permutations. Number of permutations = N!
- When two or more inputs request the same output, only one of them is connected and the others are either dropped or buffered
- When processors access memories through a crossbar, this situation is called a memory access conflict
- Given p as the probability of a request by a processor per cycle, and assuming that a processor's request is uniformly directed to all N memories, the average number of connections allowed per cycle, called bandwidth (BW), is
- $BW = N\left[1 - \left(1 - \frac{p}{N}\right)^{N}\right]$. Derive this!
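One way to derive the bandwidth expression: a given processor requests a given memory with probability p/N, so that memory receives no request from any of the N processors with probability (1 - p/N)^N. It is therefore busy with probability 1 - (1 - p/N)^N, and summing over the N memories gives BW = N[1 - (1 - p/N)^N]. The sketch below compares this closed form with a small Monte Carlo estimate. The function names, the cycle count, and the seed are illustrative assumptions, not part of the original slides.

```python
import random


def crossbar_bandwidth(n, p):
    """Closed form: expected number of busy memories per cycle."""
    return n * (1.0 - (1.0 - p / n) ** n)


def simulate_bandwidth(n, p, cycles=20000, seed=1):
    """Monte Carlo estimate: count distinct memories requested per cycle."""
    rng = random.Random(seed)
    total = 0
    for _ in range(cycles):
        requested = set()
        for _ in range(n):                 # each of the N processors
            if rng.random() < p:           # issues a request with probability p
                requested.add(rng.randrange(n))  # to a uniformly chosen memory
        total += len(requested)            # one connection per requested memory
    return total / cycles


if __name__ == "__main__":
    for n in (2, 8, 64):
        print(n, crossbar_bandwidth(n, 1.0), simulate_bandwidth(n, 1.0))
```

For N = 2 and p = 1 the closed form gives 2[1 - (1/2)^2] = 1.5 connections per cycle.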
6Input buffered switch
- Independent routing logic per input: FSM
- Scheduler logic arbitrates each output: priority, FIFO, random
- Head-of-line blocking problem: The head packet in a buffer cannot depart because the output is busy with another packet. The second packet may be destined to an output that is free, but cannot depart because it is blocked by the first packet => One solution is to create multiple input queues, one per output, called Virtual Output Queuing, adopted in most routers.
- Scheduler design: How to ensure the maximum number of simultaneous connections is a challenging research area.
7Problems with Input-Buffered Switch
- FIFO input buffers give rise to the head-of-line (HOL) blocking problem
- Current routers employ a separate input queue for each output, called a virtual output queue (VOQ)
- Then how should the packets from different VOQs be scheduled for transmission?
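The difference between a single FIFO per input and per-output VOQs can be seen in a one-cycle sketch. The scheduler here is a deliberately simple fixed-priority greedy pass (lowest input index first), chosen only for illustration; it is an assumption and not the scheduler of any particular router. All function names are hypothetical.

```python
from collections import deque


def fifo_cycle(queues):
    """One cycle with a single FIFO per input: only head packets compete."""
    taken, sent = set(), []
    for i, q in enumerate(queues):
        if q and q[0] not in taken:
            taken.add(q[0])
            sent.append((i, q.popleft()))   # (input, output)
    return sent


def to_voqs(queues, n_outputs):
    """Split each input FIFO into one queue per output (the VOQs)."""
    voqs = [[deque() for _ in range(n_outputs)] for _ in queues]
    for i, q in enumerate(queues):
        for out in q:
            voqs[i][out].append(out)
    return voqs


def voq_cycle(voqs):
    """One cycle with VOQs: any non-empty VOQ whose output is free may send."""
    taken, sent = set(), []
    for i, per_output in enumerate(voqs):
        for out, q in enumerate(per_output):
            if q and out not in taken:
                taken.add(out)
                q.popleft()
                sent.append((i, out))
                break                        # an input sends at most one packet
    return sent


if __name__ == "__main__":
    # Input 0 holds packets for outputs 0 then 1; input 1 for outputs 0 then 2.
    traffic = [[0, 1], [0, 2]]
    print(fifo_cycle([deque(q) for q in traffic]))             # [(0, 0)]
    print(voq_cycle(to_voqs([deque(q) for q in traffic], 3)))  # [(0, 0), (1, 2)]
```

With FIFOs, input 1 is blocked behind its head packet even though output 2 is idle; with VOQs the same traffic yields two connections in the cycle.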
8VOQ-based Input Buffered Switch
9Scheduling in Input Buffered Switch
- n independent arbitration problems?
- static priority, random, round-robin
- simplifications due to routing algorithm?
- general case is max bipartite matching
- Iterative algorithms: iSLIP in Cisco
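To make the "max bipartite matching" view concrete, the sketch below treats inputs and outputs as the two sides of a bipartite graph, with an edge wherever an input has a queued cell for an output. It uses the textbook augmenting-path method, which is not iSLIP; it only shows the optimum that iterative schedulers try to approach. The function name and the request format are assumptions made for this example.

```python
def max_bipartite_matching(requests):
    """requests[i] is the set of outputs for which input i has queued cells.

    Returns (size, {input: output}) for a maximum matching.
    """
    n_outputs = 1 + max((o for r in requests for o in r), default=-1)
    input_of = [None] * n_outputs            # output -> matched input

    def try_assign(i, visited):
        # Try to give input i an output, re-routing earlier inputs if needed.
        for o in sorted(requests[i]):
            if o in visited:
                continue
            visited.add(o)
            if input_of[o] is None or try_assign(input_of[o], visited):
                input_of[o] = i
                return True
        return False

    size = sum(try_assign(i, set()) for i in range(len(requests)))
    matching = {i: o for o, i in enumerate(input_of) if i is not None}
    return size, matching


if __name__ == "__main__":
    # Input 0 wants outputs 0 or 1, input 1 only output 0, input 2 outputs 1 or 2.
    print(max_bipartite_matching([{0, 1}, {0}, {1, 2}]))
```

In this example a greedy choice of output 0 for input 0 would leave input 1 unmatched; the augmenting step moves input 0 to output 1 so that all three inputs are connected.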
10Iterative Matching A 3-step Procedure
15Fair Scheduling in Crossbar (Infocom 2002)
- Motivation
  - Current routers employ fair scheduling at the output link, but with high link speeds there are very few packets in the output buffer. These packets were selected by the crossbar with equal probability from the input buffers.
  - Many more packets are waiting in the input queues. Choosing packets during arbitration depending on the reservation will ensure better QoS among competing flows at the input buffers.
16The iFS Algorithm
Initially, all inputs and outputs are considered unmatched and none of the inputs have any candidates. Then in each iteration:
- Grant stage: Each unmatched output selects the flow with the smallest virtual time for its head-of-line cell and marks the cell as a candidate for the corresponding input. A grant signal is then given to the input.
- Accept stage: Each unmatched input examines its candidate set, selects a winner according to age, and sends an accept signal to its output. The input and output are then considered matched. The candidate set is reset to empty.
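The grant and accept stages can be sketched as follows. This is a simplified reading of the description above, not the published iFS implementation: it assumes one head-of-line cell per flow, that "age" means the earliest arrival time wins, and a fixed cap on the number of iterations. The `Cell` fields and the function name are hypothetical.

```python
from collections import namedtuple

# One head-of-line cell per flow (field names are illustrative).
Cell = namedtuple("Cell", "flow input output vtime arrival")


def ifs_match(cells, n_outputs, max_iters=4):
    """Return {input: output} after iterating grant and accept stages."""
    matched_inputs = {}        # input -> accepted Cell
    matched_outputs = set()
    for _ in range(max_iters):
        candidates = {}        # candidate sets start empty in every iteration
        # Grant stage: each unmatched output picks the smallest virtual time
        # among cells from still-unmatched inputs.
        for out in range(n_outputs):
            if out in matched_outputs:
                continue
            eligible = [c for c in cells
                        if c.output == out and c.input not in matched_inputs]
            if eligible:
                granted = min(eligible, key=lambda c: c.vtime)
                candidates.setdefault(granted.input, []).append(granted)
        if not candidates:
            break
        # Accept stage: each input with grants accepts its oldest candidate.
        for inp, cand in candidates.items():
            winner = min(cand, key=lambda c: c.arrival)
            matched_inputs[inp] = winner
            matched_outputs.add(winner.output)
    return {inp: c.output for inp, c in matched_inputs.items()}


if __name__ == "__main__":
    cells = [
        Cell("A", 0, 0, vtime=1.0, arrival=5),
        Cell("B", 0, 1, vtime=2.0, arrival=3),
        Cell("C", 1, 0, vtime=4.0, arrival=1),
        Cell("D", 1, 1, vtime=3.0, arrival=2),
    ]
    print(ifs_match(cells, n_outputs=2))   # {0: 1, 1: 0}
```

In the example both outputs grant to input 0 in the first iteration; input 0 accepts the older cell B (output 1), and in the second iteration output 0 grants to input 1.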
17Output/Shared Buffered Switch
Shared Buffer
RAM speed has to be N times the link speed.
An output buffered switch has buffers at the output to store packets. There is always a minimal transmitting buffer at the input. What happens if there are 2 or more packets to the same output at the same time? In order to capture all of them, the switch speed has to be N times the link speed => Difficult to design.
18Shared Buffer Switch IBM SP Vulcan switch
- Many gigabit Ethernet switches use a similar design without the cut-through
- 128 8-byte chunks in central queue, LRU per
19SGI SPIDER IEEE Micro Jan 1997
20Flow Control
- What do you do when push comes to shove?
- Ethernet: collision detection and retry after delay
- FDDI, token ring: arbitration token
- TCP/WAN: buffer, drop, adjust rate
- any solution must adjust to output rate
- Link-level flow control
- Short Links
- long links
- several flits on the wire
22Multistage Interconnection Network
- A network consisting of multiple stages of crossbar switches has the following properties.
- N x N network for $N = 2^n$
- Consists of $\log_2 N$ stages of 2x2 switches
- Has N/2 2x2 switches per stage
- Cost $O(N \log N)$ instead of $O(N^2)$ for a crossbar
- For $N = a^n$, a MIN can be similarly designed with a x a switches
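The cost comparison can be checked with a few lines of arithmetic. The sketch below counts stages, switches, and crosspoints for a MIN built from a x a switches, assuming each a x a switch is itself a small crossbar with a^2 crosspoints. The function name and the returned field names are illustrative.

```python
def min_cost(n_ports, a=2):
    """Stage and switch counts for an N x N MIN built from a x a switches."""
    stages, size = 0, 1
    while size < n_ports:          # integer log base a of n_ports
        size *= a
        stages += 1
    if size != n_ports:
        raise ValueError("n_ports must be a power of the switch size a")
    switches_per_stage = n_ports // a
    switches = stages * switches_per_stage
    return {
        "stages": stages,
        "switches_per_stage": switches_per_stage,
        "switches": switches,
        "crosspoints": switches * a * a,        # a^2 crosspoints per switch
        "crossbar_crosspoints": n_ports ** 2,   # single N x N crossbar
    }


if __name__ == "__main__":
    print(min_cost(64))        # 6 stages of 32 switches: 768 vs 4096 crosspoints
    print(min_cost(64, a=4))   # 3 stages of 16 switches
```

For N = 64 with 2x2 switches this gives 6 stages of 32 switches, 768 crosspoints in total, against 4096 for a single 64 x 64 crossbar.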
23Multistage interconnection networks
Omega Network Complexity: $O(N \log_2 N)$
[Figure: (a) Perfect shuffle; (b) Inverse perfect shuffle]
Shuffle interconnection: $S(a_{n-1} a_{n-2} \ldots a_1 a_0) = (a_{n-2} a_{n-3} \ldots a_0 a_{n-1})$
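On n-bit addresses the perfect shuffle is a one-bit left rotation, and the inverse perfect shuffle is a one-bit right rotation. A minimal sketch, with hypothetical function names:

```python
def shuffle(addr, n):
    """Perfect shuffle: rotate the n-bit address left by one bit."""
    msb = (addr >> (n - 1)) & 1
    return ((addr << 1) & ((1 << n) - 1)) | msb


def inverse_shuffle(addr, n):
    """Inverse perfect shuffle: rotate the n-bit address right by one bit."""
    lsb = addr & 1
    return (addr >> 1) | (lsb << (n - 1))


if __name__ == "__main__":
    n = 3
    print([shuffle(x, n) for x in range(1 << n)])
    # [0, 2, 4, 6, 1, 3, 5, 7]
```

For N = 8 the lower half of the inputs (0 to 3) lands on the even positions and the upper half (4 to 7) on the odd positions, which is the interleaving of a card shuffle.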
25Omega Network
- Every stage of switches is preceded by a perfect shuffle interconnection
- $S(a_{n-1} a_{n-2} \ldots a_1 a_0) = (a_{n-2} a_{n-3} \ldots a_0 a_{n-1})$
- An input can be connected to a straight or exchange output in a 2x2 switch.
- $E(a_{n-1} a_{n-2} \ldots a_1 a_0) = (a_{n-1} a_{n-2} \ldots a_1 \bar{a}_0)$
- To route a message/packet in an Omega network, the destination tag, which is the binary representation of the destination, $(d_{n-1} d_{n-2} \ldots d_1 d_0)$, is used. The ith bit $d_i$ controls the routing at the ith stage counted from the right, with $0 \le i \le n-1$. If $d_i = 0$, the input is connected to the upper output. If $d_i = 1$, it is connected to the lower output.
26Self Routing
- A processor generates a tag that is the binary representation of the destination
- The MSB controls the leftmost stage and the LSB controls the rightmost stage of the Omega network. A small controller inside the 2 x 2 switch senses this bit and enables the connection
- If bit $c_i = 0$, the request is to the upper output; if it is 1, the request is to the lower output.
- Routing is based on a digit (rather than a bit) if the switch size is greater than 2
- Network conflict: select round robin
- Less bandwidth than a crossbar, but more cost-effective
- What about QoS? Future research
27Theorem The Omega network is self routing
- Let the source be $(s_{n-1} s_{n-2} \ldots s_2 s_1 s_0)$ and the destination be $(d_{n-1} d_{n-2} \ldots d_2 d_1 d_0)$. Before Stage 1, the source is switched to position $(s_{n-2} s_{n-3} \ldots s_1 s_0 s_{n-1})$ by the perfect shuffle connection. After Stage 1 it is switched to $(s_{n-2} s_{n-3} \ldots s_1 s_0 d_{n-1})$ as per the (n-1)th bit of the destination.
- Before the 2nd stage of switches, the source is connected to $(s_{n-3} \ldots s_0 d_{n-1} s_{n-2})$, and after the 2nd stage it becomes $(s_{n-3} \ldots s_0 d_{n-1} d_{n-2})$
- If we continue like this for n stages, the source position becomes $(d_{n-1} d_{n-2} \ldots d_i \ldots d_1 d_0)$, which is the destination.
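The argument can be traced in code: each stage rotates the current position left by one bit (the perfect shuffle) and then the 2x2 switch overwrites the lowest bit with the next destination bit, starting from the MSB. The sketch below records the position after every stage; the function name is an assumption, and conflicts between simultaneous messages are not modeled.

```python
def omega_route(src, dst, n):
    """Positions of a message after each of the n stages of an Omega network."""
    mask = (1 << n) - 1
    pos, path = src, []
    for stage in range(n):
        # Perfect shuffle before the stage: rotate the n-bit position left.
        pos = ((pos << 1) & mask) | (pos >> (n - 1))
        # The switch selects upper (0) or lower (1) output from the tag bit,
        # which replaces the lowest bit of the position.
        bit = (dst >> (n - 1 - stage)) & 1
        pos = (pos & ~1) | bit
        path.append(pos)
    return path


if __name__ == "__main__":
    # Source 101 to destination 010 in an 8 x 8 network.
    print([format(p, "03b") for p in omega_route(0b101, 0b010, 3)])
    # ['010', '101', '010']
```

After n stages every source bit has been rotated out and replaced by a destination bit, so the final position equals the destination for any source.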
28Example SP
- 8-port switch, 40 MB/s per link, 8-bit phit, 16-bit flit, single 40 MHz clock
- packet sw, cut-through, no virtual channel, source-based routing
- variable packet < 255 bytes, 31 byte fifo per input, 7 bytes per output, 16 phit links
- Routing Algorithms restrict the set of routes within the topology
  - simple mechanism selects turn at each hop
  - arithmetic, selection, lookup
- Deadlock-free if channel dependence graph is acyclic
  - limit turns to eliminate dependences
  - add separate channel resources to break dependences
  - combination of topology, algorithm, and switch design
- Deterministic vs. adaptive routing
- Switch design issues
  - input/output/pooled buffering, routing logic, selection logic
- Flow control
- Real networks are a package of design choices
30Protocols HW/SW Interface
- Internetworking: allows computers on independent and incompatible networks to communicate reliably and efficiently
- Enabling technologies: SW standards that allow reliable communications without reliable networks
- Hierarchy of SW layers, giving each layer responsibility for a portion of the overall communications task, called protocol families or protocol suites
- Transmission Control Protocol/Internet Protocol (TCP/IP)
- This protocol family is the basis of the Internet
- IP makes best effort to deliver; TCP guarantees delivery
- TCP/IP is used even when communicating locally: NFS uses IP even though communicating across a homogeneous LAN
31TCP/IP packet
- Application sends message
- TCP breaks into 64KB segments, adds 20B header
- IP adds 20B header, sends to network
- If Ethernet, broken into 1500B packets with headers, trailers
- Headers, trailers have length field, destination, window number, version, ...
[Figure: packet layout. IP Header followed by IP Data; the IP Data holds the TCP Header followed by TCP data (up to 64KB)]
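The accounting on this slide can be turned into a rough count of segments and Ethernet packets for a given message. The sketch follows the slide's simplified model only (64KB segments, a 20B TCP header and a 20B IP header per segment, then 1500B chunks); it deliberately ignores real TCP MSS negotiation, IP size limits, and per-fragment headers, and the function name and parameters are assumptions.

```python
import math


def packetize(message_bytes, segment_bytes=64 * 1024,
              tcp_hdr=20, ip_hdr=20, frame_payload=1500):
    """Return (TCP segments, Ethernet packets) under the slide's simple model."""
    segments = math.ceil(message_bytes / segment_bytes)
    frames = 0
    remaining = message_bytes
    for _ in range(segments):
        data = min(segment_bytes, remaining)   # last segment may be shorter
        remaining -= data
        datagram = data + tcp_hdr + ip_hdr     # TCP header, then IP header
        frames += math.ceil(datagram / frame_payload)
    return segments, frames


if __name__ == "__main__":
    print(packetize(100_000))   # (2, 68)
```

A 100,000-byte message becomes two segments (65,536 and 34,464 bytes of data), which with headers need 44 and 24 packets of 1500B respectively.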
32Communicating with the Server The O/S Wall
- Problems
  - O/S overhead to move a packet between network and application level => Protocol Stack (TCP/IP)
  - O/S interrupt
  - Data copying from kernel space to user space and vice versa
  - Oh, the PCI Bottleneck!
33The Send/Receive Operation
- The application writes the transmit data to the TCP/IP sockets interface for transmission in payload sizes ranging from 4 KB to 64 KB.
- The data is copied from the user space to the kernel space.
- The OS segments the data into maximum transmission unit (MTU) size packets, and then adds TCP/IP header information to each packet.
- The OS copies the data onto the network interface card (NIC) send queue.
- The NIC performs the direct memory access (DMA) transfer of each data packet from the TCP buffer space to the NIC, and interrupts CPU activities to indicate completion of the transfer.
34Transmitting data across the memory bus using a
standard NIC
35Timing Measurement in UDP Communication
X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," JPDC, October 2005
36I/O Acceleration Techniques
- TCP Offload: Offload TCP/IP checksum and segmentation to interface hardware or a programmable device (e.g., TOEs). A TOE-enabled NIC using Remote Direct Memory Access (RDMA) can use zero-copy algorithms to place data directly into application buffers.
- O/S Bypass: User-level software techniques to bypass the protocol stack: Zero Copy Protocol (needs a programmable device in the NIC for direct user-level memory access and virtual-to-physical memory mapping, e.g., VIA)
- Architectural Techniques: Instruction set optimization, multithreading, copy engines, onloading, prefetching, etc.
37Comparing standard TCP/IP and TOE enabled TCP/IP
38Chelsio 10 Gb/s TOE
39Cluster (Network) of Workstations/PCs
40Myrinet Interface Card
41InfiniBand Interconnection
- Zero-copy mechanism. The zero-copy mechanism enables a user-level application to perform I/O on the InfiniBand fabric without being required to copy data between user space and kernel space.
- RDMA. RDMA facilitates transferring data from remote memory to local memory without the involvement of host CPUs.
- Reliable transport services. The InfiniBand architecture implements reliable transport services so the host CPU is not involved in protocol-processing tasks like segmentation, reassembly, NACK/ACK, etc.
- Virtual lanes. InfiniBand architecture provides 16 virtual lanes (VLs) to multiplex independent data lanes into the same physical lane, including a dedicated VL for management operations.
- High link speeds. InfiniBand architecture defines three link speeds, which are characterized as 1X, 4X, and 12X, yielding data rates of 2.5 Gbps, 10 Gbps, and 30 Gbps, respectively.
- Reprinted from Dell Power Solutions, October
42InfiniBand system fabric
43UDP Communication Life of a Packet
- X. Zhang, L. Bhuyan and W. Feng, "Anatomy of UDP and M-VIA for Cluster Communication," Journal of Parallel and Distributed Computing (JPDC), Special issue on Design and Performance of Networks for Super-, Cluster-, and Grid-Computing, Vol. 65, Issue 10, October 2005, pp. 1290-1298.