Title: High-Capacity Packet Switches
1. High-Capacity Packet Switches
2. Switches with Input Buffers (Cisco)
3. Packet Switches with Input Buffers
- Switching fabric
- Electronic chips (Mindspeed, AMCC, Vitesse)
- Space-wavelength selector (NEC, Alcatel)
- Fast tunable lasers (Lucent)
- Waveguide arrays (Chiaro)
- Scheduler
- Packets compete not only with the packets
destined for the same output but also with the
packets sourced by the same input. Scheduling
might become a bottleneck in a switch with
hundreds of ports and gigabit line bit-rates.
4. Optical Packet Cross-bar (NEC, Alcatel)
- "A 2.56 Tb/s multiwavelength and scalable switch-fabric for fast packet-switching network," PTL 1998, 1999, NEC
5. Optical Packet Cross-bar (Lucent)
- "A fast 100 channel wavelength tunable transmitter for optical packet switching," PTL 2001, Bell Labs
6. Scheduling Algorithms for Packet Switches with Input Buffers
- In parallel iterative matching (PIM), SLIP, or dual round-robin (DRR), inputs send requests to outputs, outputs grant inputs, and inputs then accept grants, all in one iteration. It was proven that PIM finds a maximal matching after log2(N) + 4/3 iterations on average.
- Maximum weighted matching and maximum matching algorithms maximize the weight and the number of connected pairs, respectively; they achieve 100% throughput for i.i.d. traffic but have complexities O(N^3 log2(N)) and O(N^2.5).
- Sequential greedy scheduling is a maximal matching algorithm that is simple to implement. A maximal matching algorithm does not leave an input-output pair unmatched when a packet waits between them.
7. PIM, SLIP and DRR
- In PIM and SLIP each input sends requests to all outputs for which it has packets; in DRR each input sends a request only to one chosen output. SLIP and DRR use round-robin choices.
- Theorem: PIM finds a maximal matching after log2(N) + 4/3 iterations on average.
- Proof: Let n inputs request output Q, and let k of these inputs receive no grants. With probability k/n all requests are resolved, and with probability 1 - k/n at most k requests remain unresolved. The average number of unresolved requests is therefore at most (1 - k/n)k <= n/4. So if there are N^2 requests at the beginning, the expected number of unresolved requests after i iterations is at most N^2/4^i.
8. PIM, SLIP and DRR
- Proof (cont.): Let C be the step in which the last request is resolved. Then E[C] = sum_{i>=1} P(C >= i) <= log4(N^2) + sum_{i > log4(N^2)} N^2/4^(i-1) = log2(N) + 4/3.
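The request-grant-accept handshake of the proof can be sketched in a few lines; the following is a minimal Python sketch, assuming requests are kept as a per-input set of output indices (the names pim_iteration and pim are illustrative, not from the slides).

```python
import random

def pim_iteration(requests, free_inputs, free_outputs):
    """One request-grant-accept iteration of parallel iterative matching (PIM).

    requests[i] is the set of outputs for which input i has queued cells;
    free_inputs and free_outputs are the still-unmatched ports."""
    # Grant phase: each free output grants one requesting free input at random.
    grants = {}
    for j in free_outputs:
        requesting = [i for i in free_inputs if j in requests[i]]
        if requesting:
            grants[j] = random.choice(requesting)
    # Accept phase: each input accepts one of the grants it received at random.
    granted = {}
    for j, i in grants.items():
        granted.setdefault(i, []).append(j)
    return [(i, random.choice(outs)) for i, outs in granted.items()]

def pim(requests, n, iterations):
    """Run several PIM iterations; matched ports drop out between iterations."""
    free_inputs, free_outputs = set(range(n)), set(range(n))
    matching = []
    for _ in range(iterations):
        for i, j in pim_iteration(requests, free_inputs, free_outputs):
            matching.append((i, j))
            free_inputs.discard(i)
            free_outputs.discard(j)
    return matching
```

Per the theorem above, about log2(N) + 4/3 iterations suffice on average, so a caller would typically set iterations to roughly log2(N) plus a small constant.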
9. Typical Central Controllers (Cisco)
10. SGS Implementation
- All inputs, one after another, choose outputs; SGS is a maximal matching algorithm
11. SGS Uses Pipelining
I_i -> T_k: Input i chooses an output for time slot k
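A minimal sketch of the sequential choice on slides 10 and 11, assuming a boolean occupancy matrix (the function name and representation are illustrative). Because input i + 1 can choose for slot k while input i already works on slot k + 1, the sequential loop pipelines naturally across time slots.

```python
def sgs(has_cell, n):
    """Sequential greedy scheduling (SGS): inputs choose outputs strictly one
    after another, so the resulting matching is maximal.

    has_cell[i][j] is True when input i has a cell queued for output j."""
    free_outputs = set(range(n))
    matching = {}
    for i in range(n):                  # inputs decide in sequence
        for j in sorted(free_outputs):  # first free output with a queued cell
            if has_cell[i][j]:
                matching[i] = j
                free_outputs.remove(j)
                break
    return matching
```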
12. Bandwidth Reservations: Packet Switches with Input Buffers
- Anderson et al.: Time is divided into frames of F time slots. A schedule is calculated in each frame by a statistical matching algorithm.
- Stiliadis and Varma: Counters are loaded per frame. Queues with positive counters are served with priority according to parallel iterative matching (PIM), and their counters are then decremented by 1. DRR proposed by Chao et al. could be used as well.
- Kam et al.: A counter is incremented according to the negotiated bandwidth and decremented by 1 when the queue is served. A maximal weighted matching algorithm is applied.
- Smiljanic: Counters are loaded per frame. Queues with positive counters are served with priority according to a maximal matching algorithm, preferably the sequential greedy scheduling (SGS) algorithm, where inputs sequentially choose outputs to transmit packets to.
13. Weighted Sequential Greedy Scheduling
- i = 1
- Input i chooses an output j from O_k for which it has a packet to send; remove i from I_k and j from O_k
- If i < N, set i = i + 1 and go to the previous step
14. Weighted Sequential Greedy Scheduling
- If k = 1 mod F then c_ij = a_ij; I_k = {1,...,N}, O_k = {1,...,N}, i = 1
- Input i chooses an output j from O_k for which it has a packet to send such that c_ij > 0; remove i from I_k and j from O_k, and set c_ij = c_ij - 1
- If i < N, set i = i + 1 and go to the previous step
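The two loops above combine into one frame-long schedule; here is a minimal Python sketch. The counter reload at frame start and the positive-counter priority follow slides 13 and 14; the best-effort second pass for leftover inputs is an assumption based on the "served with priority" wording of slide 12, and all names are illustrative.

```python
def wsgs_frame(a, has_cell, n, F):
    """One frame of weighted sequential greedy scheduling (WSGS).

    a[i][j] is the number of slots per frame negotiated for pair (i, j).
    has_cell is treated as static here for brevity; a real scheduler would
    track queue occupancy slot by slot."""
    c = [row[:] for row in a]          # c_ij = a_ij at frame start (k = 1 mod F)
    schedule = []
    for k in range(F):
        free_outputs = set(range(n))   # O_k = {1,...,N}
        matching = {}
        # Priority pass: inputs sequentially pick an output with a queued
        # cell and a positive counter, then decrement the counter.
        for i in range(n):
            for j in sorted(free_outputs):
                if has_cell[i][j] and c[i][j] > 0:
                    matching[i] = j
                    free_outputs.remove(j)
                    c[i][j] -= 1
                    break
        # Best-effort pass (assumed): leftover inputs grab leftover outputs,
        # keeping the matching maximal so no capacity is wasted.
        for i in range(n):
            if i not in matching:
                for j in sorted(free_outputs):
                    if has_cell[i][j]:
                        matching[i] = j
                        free_outputs.remove(j)
                        break
        schedule.append(matching)
    return schedule
```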
15. Performance of WSGS
Theorem: The WSGS protocol ensures a_ij time slots per frame to input-output pair (i,j) if T_i + R_j - a_ij <= F, where T_i is the number of slots reserved for input i, and R_j is the number of slots reserved for output j.
Proof: Note that pair (i,j) can be blocked in a slot only when input i serves another output or output j serves another input, which happens in at most (T_i - a_ij) + (R_j - a_ij) slots; at least a_ij of the F slots therefore remain for the pair whenever T_i + R_j - a_ij <= F.
16. Analogy with Circuit Switches
- Inputs <-> switches in the first stage
- Time slots in a frame <-> switches in the middle stage
- Outputs <-> switches in the last stage
- Non-blocking condition: F >= max(T_i, R_j)
- Strictly non-blocking condition: F >= T_i + R_j - a_ij
17. Admission Control for WSGS
The WSGS protocol ensures a_ij time slots per frame to input-output pair (i,j) if T_i + R_j - a_ij <= F, i.e., t_i + r_j - a_ij/F <= 1.
F is the frame length, T_i the number of slots reserved for input i, R_j the number of slots reserved for output j; t_i = T_i/F and r_j = R_j/F are the normalized T_i, R_j.
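Admission control then reduces to constant-time bookkeeping per request; a minimal sketch, assuming the reconstructed condition T_i + R_j - a_ij <= F from slide 15 (function and variable names are illustrative).

```python
def can_admit(T, R, F, i, j, a_ij):
    """Test a new reservation of a_ij slots per frame for pair (i, j).

    With updated totals T_i' = T_i + a_ij and R_j' = R_j + a_ij, the
    condition T_i' + R_j' - a_ij <= F simplifies to the check below.
    On success the caller updates the books: T[i] += a_ij; R[j] += a_ij."""
    return T[i] + R[j] + a_ij <= F
```

Because each port checks only its own totals, this test can run independently at every input and output, which is what makes distributed admission control possible (see the conclusions).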
18. Non-blocking Nature of WSGS
- A maximal matching algorithm does not leave an input or output unmatched if there is a packet to be transmitted from the input to the output in question.
- It can be proven that a cross-bar with a speedup of two, run by a maximal matching algorithm, passes all the traffic as long as the outputs are not overloaded.
19. Rate and Delay Guaranteed by WSGS
- Assume a coarse synchronization on a frame-by-frame basis, where a frame is the policing interval comprising F cell time slots of duration Tc.
- Then, a delay of D = 2FTc is provided at a utilization of 50%. Or, this delay and a utilization of 100% are provided for a fabric with a speedup of 2.
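As a worked check under the slide's formula: with the Tc = 50 ns slot duration used later on slide 36 and an assumed frame of F = 10^4 slots, the guaranteed delay is D = 2FTc = 2 x 10^4 x 50 ns = 1 ms.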
20. Port Congestion Due to Multicasting
Solution: Packets should be forwarded through the switch by the multicast destination ports.
21. Forwarding Multicast Traffic
22. Forwarding Multicast Traffic
23. Forwarding Multicast Traffic
24. Adding the Port to the Multicast Tree
25. Adding the Port to the Multicast Tree
26. Adding the Port to the Multicast Tree
27. Removing the Port from the Multicast Tree
28. Removing the Port from the Multicast Tree
29. Removing the Port from the Multicast Tree
30. Removing the Port from the Multicast Tree
31. Removing the Port from the Multicast Tree
32. Admission Control for Modified WSGS
where E_i is the number of forwarded packets per frame
33. Admission Control for Modified WSGS
for
34. Admission Control for Modified WSGS
The modified WSGS protocol ensures the negotiated bandwidths to input-output pairs if:
I
II
F is the frame length, P the forwarding fan-out, T_i the number of slots reserved for input i, R_i the number of slots reserved for output i; t_i and r_i are the normalized T_i, R_i.
35. Rate and Delay Guaranteed by Modified WSGS
- Assume again a coarse synchronization on a frame-by-frame basis.
- Then, a delay of D = FTc is provided at a utilization of 1/(P+2), where P is the forwarding fan-out. Or, this delay and a utilization of 100% are provided for a fabric speedup of P+2.
36. Quality of Service, P = 2, S = 4, B = 10 Gb/s, Tc = 50 ns
37. Clos Packet Switches
38. Load Balancing in Packet Switches
- J. Turner introduced load balancing of multicast sessions in Benes packet switches (INFOCOM 1994).
- C. Chang et al. propose load balancing in the two-stage Birkhoff-von Neumann switch, while Iyer et al. analyze the performance of the parallel plane switch (PPS), which applies load balancing.
- Keslassy et al. propose the implementation of the high-capacity PPS or Birkhoff-von Neumann architecture.
- Smiljanic examines rate and delay guarantees in three-stage Clos packet switches based on load balancing. These switches provide a larger number of lower-speed ports.
39. Load Balancing Algorithms
- Packets are split into cells, and cells are grouped into flows.
- Cells of each flow are balanced over the center SEs.
- Balancing of a flow can be implemented in the following way (sketched in code below):
- One counter is associated with each flow.
- When a cell of the flow arrives, it is marked to be transmitted through the center SE whose designation equals the counter value, and the counter is then incremented (decremented) modulo l, where l is the number of center SEs.
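A minimal sketch of the per-flow counter just described; the class name and the dictionary representation of per-flow state are illustrative, not from the slides.

```python
class FlowBalancer:
    """Per-flow round-robin balancing over l center switching elements (SEs):
    each flow keeps one counter; a new cell goes to the center SE named by
    the counter, which then advances modulo l."""

    def __init__(self, l):
        self.l = l            # number of center SEs
        self.counters = {}    # flow id -> next center SE

    def route(self, flow_id):
        se = self.counters.get(flow_id, 0)
        self.counters[flow_id] = (se + 1) % self.l  # increment modulo l
        return se             # center SE that carries this cell
```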
40. Load Balancing Algorithms
- A flow comprises cells determined by different rules, but that have the same input port or input switching element (SE), and the same output port or output SE. Examples:
- SEs with input buffers:
  - Cells sourced by the same input
  - Cells sourced by the same input bound for the same output
  - Cells sourced by the same input bound for the same output SE
- SEs with shared buffers:
  - Cells sourced by the same input SE bound for the same output
  - Cells sourced by the same input SE bound for the same output SE
41. Non-Blocking Load Balancing
- Non-blocking if l >= n, so no speedup is needed
42. Rate and Delay Guarantees
- Let us assume an implementation with coarse synchronization of the switching elements (SEs), i.e.:
- the switching elements are synchronized on a frame-by-frame basis
- in each frame any SE passes the cells that arrived to this SE in the previous frame
- The delay through a three-stage Clos network with such coarse synchronization, including the packet reordering delay, is D = 4FTc.
- Note that if multicasting is accomplished by the described packet forwarding, the utilization is decreased 3 times, and the delay is increased log_P(N) times.
43. Utilization Formula
- Utilization under which the delay is guaranteed to be below D:
- where S is the switching fabric speedup, Nf is the number of flows whose cells pass the internal fabric link, and Tc is the cell time slot duration.
44. Derivation of Utilization
- The maximum number of cells transmitted over a given link from an input to a center SE, Fc, fulfills
- where f_ig is the number of cells per frame in flow g of cells from input SE i, and Fu is the maximum number of cells assigned to some port
- If Nf - n flows have one cell per frame, the remaining n flows are assigned max(0, n*Fu - Nf + n) cells per frame
45. Derivation of Utilization
- So
- The same expression holds for Fc and Ua over the links from the center to the output SEs
- Since F = D/(4Tc)
46. Speedup Formula
- The delay of D is guaranteed at 100% utilization for the speedup of
- where Nf is the maximum number of flows whose cells pass any internal fabric link, and Tc is the cell time slot duration.
47. Derivation of Speedup
- We put Ua = 1 in the formula for utilization, and readily obtain the expression for the required speedup
48. Counter Synchronization
- The utilization was decreased because all flows may be balanced starting from the same center SE, so this SE will not be able to deliver all the passing cells within a frame.
- Higher utilization can be achieved if the counters of different flows are synchronized.
- The counter of flow g sourced by input SE_1i is reset at the beginning of each frame to c_ig = (i + g) mod l, where l is the number of center SEs. And the counter of flow g bound for output SE_3j is reset at the beginning of each frame to c_jg = (j + g) mod l.
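A sketch of the frame-start reset, assuming the reconstructed offsets (i + g) mod l and (j + g) mod l above; the Flow record and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    g: int             # flow index
    se: int            # i of the source SE_1i, or j of the destination SE_3j
    counter: int = 0   # next center SE

def reset_counters(flows, l):
    """Frame-start reset with synchronized counters: flow g sourced by input
    SE_1i restarts at (i + g) mod l, and flow g bound for output SE_3j at
    (j + g) mod l. Staggering the offsets keeps the flows from all starting
    a frame at the same center SE."""
    for f in flows:
        f.counter = (f.se + f.g) % l
```

During the frame each counter still advances modulo l per transmitted cell, as in the balancer sketch on slide 39; only the starting point is pinned down.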
49. Utilization Formula when Counters are Synchronized
- Utilization under which the delay is guaranteed to be below D:
- where S is the switching fabric speedup, Nf is the maximum number of flows whose cells pass any internal fabric link, and Tc is the cell time slot duration.
50. Derivation of Utilization when Counters are Synchronized
- The maximum number of cells transmitted over a given link from an input SE to center SE_2(l-1), Fc, fulfills
- where f_ig denotes the number of cells in flow g that are balanced starting from input SE_1i
51. Derivation of Utilization when Counters are Synchronized
- Fc is maximized for
- where y_ig > 0 are integers.
- Fc is maximized if
- And Fc is then equal to
52. Derivation of Utilization when Counters are Synchronized
- If Fu < l*Nf/(2n), it holds that, for some z < l,
- In this case, Fc is maximized for
- and equal to
53. Derivation of Utilization when Counters are Synchronized
- From Fc <= nSF/l, it follows that
54. Derivation of Utilization when Counters are Synchronized
55. Speedup Formula when Counters are Synchronized
- The delay of D is guaranteed at 100% utilization for the speedup of
- where Nf is the maximum number of flows whose cells pass any internal fabric link, and Tc is the cell time slot duration.
56. Derivation of Speedup when Counters are Synchronized
- The speedup providing 100% utilization of the transmission capacity is derived by setting Fu = F in the inequality Fc <= nSF/l
57. Derivation of Speedup when Counters are Synchronized
- Because F >= Nf > (10Nf)/(8N) for N >= 2.
58. Utilization vs. Number of Ports, Tc = 50 ns, n = m = l
[Plot: utilization vs. number of ports, counters in sync vs. out of sync, for Nf = nN and Nf = N]
59. Speedup vs. Number of Ports, Tc = 50 ns, n = m = l
[Plot: speedup vs. number of ports, counters in sync vs. out of sync, for Nf = nN and Nf = N]
60. Utilization and Speedup vs. Number of Ports, D = 3 ms, Nf = ml = 640
[Plot: utilization (UTIL) and speedup (SPEED) vs. number of ports, counters in sync vs. out of sync, with a 1 ms delay curve]
61. Conclusions: Scalable Implementations
- Switches with input buffers require simple implementation: pipelining relaxes the processing of the output selector, and the central controller has a linear structure.
- Clos packet switches based on load balancing are even more scalable: they require neither synchronization on a cell-by-cell basis across the whole fabric nor a high-capacity fabric.
62. Conclusions: Performance Advantages
- Both examined architectures are non-blocking with moderate fabric speedups, i.e., the fabric passes all the traffic as long as the outputs are not overloaded.
- Rate and delay are guaranteed even to the most sensitive applications.
- Due to the non-blocking nature of the fabric, the admission control can be distributed, and therefore more agile.
63. References
- T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, "High-speed switch scheduling for local-area networks," ACM Transactions on Computer Systems, vol. 11, no. 4, November 1993, pp. 319-352.
- N. McKeown et al., "The Tiny Tera: A packet switch core," IEEE Micro, vol. 17, no. 1, Jan.-Feb. 1997, pp. 26-33.
- A. Smiljanic, "Flexible bandwidth allocation in high-capacity packet switches," IEEE/ACM Transactions on Networking, April 2002, pp. 287-293.
64. References
- A. Smiljanic, "Scheduling of multicast traffic in high-capacity packet switches," IEEE Communications Magazine, November 2002, pp. 72-77.
- A. Smiljanic, "Performance of load balancing algorithms in Clos packet switches," Proceedings of IEEE HPSR, April 2004, pp. 304-308.
- J. S. Turner, "An optimal nonblocking multicast virtual circuit switch," Proceedings of INFOCOM 1994, vol. 1, pp. 298-305.