Title: High-Capacity Packet Switches
1. High-Capacity Packet Switches
2. Switches with Input Buffers (Cisco)
3. Packet Switches with Input Buffers
- Switching fabric
- Electronic chips (Mindspeed, AMCC, Vitesse)
- Space-wavelength selector (NEC, Alcatel)
- Fast tunable lasers (Lucent)
- Waveguide arrays (Chiaro)
- Scheduler
- Packets compete not only with the packets
destined for the same output but also with the
packets sourced by the same input. Scheduling
might become a bottleneck in a switch with
hundreds of ports and gigabit line bit-rates.
4. Optical Packet Cross-bar (NEC, Alcatel)
- "A 2.56 Tb/s multiwavelength and scalable switch-fabric for fast packet-switching network," PTL 1998, 1999, NEC
5. Optical Packet Cross-bar (Lucent)
- "A fast 100 channel wavelength tunable transmitter for optical packet switching," PTL 2001, Bell Labs
6. Scheduling Algorithms for Packet Switches with Input Buffers
- In parallel iterative matching (PIM), SLIP, or dual round-robin (DRR), inputs send requests to outputs, outputs grant inputs, and inputs then accept grants, all in one iteration. It was proven that PIM finds a maximal matching after log2(N) + 4/3 iterations on average.
- Maximum weighted matching and maximum matching algorithms maximize the weight and the number of connected pairs, respectively; they achieve 100% throughput for i.i.d. traffic but have complexities O(N^3 log2(N)) and O(N^2.5).
- Sequential greedy scheduling is a maximal matching algorithm that is simple to implement. A maximal matching algorithm does not leave an input-output pair unmatched when a packet waits between them.
7. PIM, SLIP and DRR
- In PIM and SLIP each input sends requests to all outputs for which it has packets; in DRR each input sends a request only to one chosen output. SLIP and DRR use round-robin choices.
- Theorem: PIM finds a maximal matching after log2(N) + 4/3 iterations on average.
- Proof: Let n inputs request output Q, and let k of these inputs receive no grants. With probability k/n all requests are resolved, and with probability 1 - k/n at most k requests remain unresolved. The average number of unresolved requests is therefore at most (1 - k/n)k <= n/4. So if there are N^2 requests at the beginning, the expected number of unresolved requests after i iterations is at most N^2/4^i.
8. PIM, SLIP and DRR
- Proof (cont.): Let C be the step in which the last request is resolved. Then E[C] = sum_{i>=1} P(C >= i) <= log4(N^2) + sum_{i > log4(N^2)} N^2/4^(i-1) = log2(N) + 4/3.
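The request-grant-accept handshake of the proof can be sketched in a few lines; the following is a minimal Python sketch, assuming requests are kept as a per-input set of output indices (the names pim_iteration and pim are illustrative, not from the slides).

```python
import random

def pim_iteration(requests, free_inputs, free_outputs):
    """One request-grant-accept iteration of parallel iterative matching (PIM).

    requests[i] is the set of outputs for which input i has queued cells;
    free_inputs and free_outputs are the still-unmatched ports."""
    # Grant phase: each free output grants one requesting free input at random.
    grants = {}
    for j in free_outputs:
        requesting = [i for i in free_inputs if j in requests[i]]
        if requesting:
            grants[j] = random.choice(requesting)
    # Accept phase: each input accepts one of the grants it received at random.
    granted = {}
    for j, i in grants.items():
        granted.setdefault(i, []).append(j)
    return [(i, random.choice(outs)) for i, outs in granted.items()]

def pim(requests, n, iterations):
    """Run several PIM iterations; matched ports drop out between iterations."""
    free_inputs, free_outputs = set(range(n)), set(range(n))
    matching = []
    for _ in range(iterations):
        for i, j in pim_iteration(requests, free_inputs, free_outputs):
            matching.append((i, j))
            free_inputs.discard(i)
            free_outputs.discard(j)
    return matching
```

Per the theorem above, about log2(N) + 4/3 iterations suffice on average, so a caller would typically set iterations to roughly log2(N) plus a small constant.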
9. Typical Central Controllers (Cisco)
10. SGS Implementation
- All inputs, one after another, choose outputs; SGS is a maximal matching algorithm
11. SGS Uses Pipelining
I_i -> T_k: Input i chooses an output for time slot k
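A minimal sketch of the sequential choice on slides 10 and 11, assuming a boolean occupancy matrix (the function name and representation are illustrative). Because input i + 1 can choose for slot k while input i already works on slot k + 1, the sequential loop pipelines naturally across time slots.

```python
def sgs(has_cell, n):
    """Sequential greedy scheduling (SGS): inputs choose outputs strictly one
    after another, so the resulting matching is maximal.

    has_cell[i][j] is True when input i has a cell queued for output j."""
    free_outputs = set(range(n))
    matching = {}
    for i in range(n):                  # inputs decide in sequence
        for j in sorted(free_outputs):  # first free output with a queued cell
            if has_cell[i][j]:
                matching[i] = j
                free_outputs.remove(j)
                break
    return matching
```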
12. Bandwidth Reservations: Packet Switches with Input Buffers
- Anderson et al.: Time is divided into frames of F time slots. A schedule is calculated in each frame by a statistical matching algorithm.
- Stiliadis and Varma: Counters are loaded per frame. Queues with positive counters are served with priority according to parallel iterative matching (PIM), and their counters are then decremented by 1. DRR proposed by Chao et al. could be used as well.
- Kam et al.: A counter is incremented according to the negotiated bandwidth and decremented by 1 when the queue is served. A maximal weighted matching algorithm is applied.
- Smiljanic: Counters are loaded per frame. Queues with positive counters are served with priority according to a maximal matching algorithm, preferably the sequential greedy scheduling (SGS) algorithm, where inputs sequentially choose outputs to transmit packets to.
13. Weighted Sequential Greedy Scheduling
- i = 1
- Input i chooses an output j from O_k for which it has a packet to send; remove i from I_k and j from O_k
- If i < N, set i = i + 1 and go to the previous step
14. Weighted Sequential Greedy Scheduling
- If k = 1 mod F then c_ij = a_ij; I_k = {1,...,N}, O_k = {1,...,N}, i = 1
- Input i chooses an output j from O_k for which it has a packet to send such that c_ij > 0; remove i from I_k and j from O_k, and set c_ij = c_ij - 1
- If i < N, set i = i + 1 and go to the previous step
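The two loops above combine into one frame-long schedule; here is a minimal Python sketch. The counter reload at frame start and the positive-counter priority follow slides 13 and 14; the best-effort second pass for leftover inputs is an assumption based on the "served with priority" wording of slide 12, and all names are illustrative.

```python
def wsgs_frame(a, has_cell, n, F):
    """One frame of weighted sequential greedy scheduling (WSGS).

    a[i][j] is the number of slots per frame negotiated for pair (i, j).
    has_cell is treated as static here for brevity; a real scheduler would
    track queue occupancy slot by slot."""
    c = [row[:] for row in a]          # c_ij = a_ij at frame start (k = 1 mod F)
    schedule = []
    for k in range(F):
        free_outputs = set(range(n))   # O_k = {1,...,N}
        matching = {}
        # Priority pass: inputs sequentially pick an output with a queued
        # cell and a positive counter, then decrement the counter.
        for i in range(n):
            for j in sorted(free_outputs):
                if has_cell[i][j] and c[i][j] > 0:
                    matching[i] = j
                    free_outputs.remove(j)
                    c[i][j] -= 1
                    break
        # Best-effort pass (assumed): leftover inputs grab leftover outputs,
        # keeping the matching maximal so no capacity is wasted.
        for i in range(n):
            if i not in matching:
                for j in sorted(free_outputs):
                    if has_cell[i][j]:
                        matching[i] = j
                        free_outputs.remove(j)
                        break
        schedule.append(matching)
    return schedule
```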
15. Performance of WSGS
Theorem: The WSGS protocol ensures a_ij time slots per frame to input-output pair (i,j) if T_i + R_j - a_ij <= F, where T_i is the number of slots reserved for input i, and R_j is the number of slots reserved for output j.
Proof: Note that pair (i,j) can be blocked in a slot only when input i serves another output or output j serves another input, which happens in at most (T_i - a_ij) + (R_j - a_ij) slots; at least a_ij of the F slots therefore remain for the pair whenever T_i + R_j - a_ij <= F.
16. Analogy with Circuit Switches
- Inputs <-> switches in the first stage
- Time slots in a frame <-> switches in the middle stage
- Outputs <-> switches in the last stage
- Non-blocking condition: F >= max(T_i, R_j)
- Strictly non-blocking condition: F >= T_i + R_j - a_ij
17. Admission Control for WSGS
The WSGS protocol ensures a_ij time slots per frame to input-output pair (i,j) if T_i + R_j - a_ij <= F, i.e., t_i + r_j - a_ij/F <= 1.
F is the frame length, T_i the number of slots reserved for input i, R_j the number of slots reserved for output j; t_i = T_i/F and r_j = R_j/F are the normalized T_i, R_j.
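Admission control then reduces to constant-time bookkeeping per request; a minimal sketch, assuming the reconstructed condition T_i + R_j - a_ij <= F from slide 15 (function and variable names are illustrative).

```python
def can_admit(T, R, F, i, j, a_ij):
    """Test a new reservation of a_ij slots per frame for pair (i, j).

    With updated totals T_i' = T_i + a_ij and R_j' = R_j + a_ij, the
    condition T_i' + R_j' - a_ij <= F simplifies to the check below.
    On success the caller updates the books: T[i] += a_ij; R[j] += a_ij."""
    return T[i] + R[j] + a_ij <= F
```

Because each port checks only its own totals, this test can run independently at every input and output, which is what makes distributed admission control possible (see the conclusions).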
18. Non-blocking Nature of WSGS
- A maximal matching algorithm does not leave an input or output unmatched if there is a packet to be transmitted from the input to the output in question.
- It can be proven that a cross-bar with a speedup of two, run by a maximal matching algorithm, passes all the traffic as long as the outputs are not overloaded.
19. Rate and Delay Guaranteed by WSGS
- Assume a coarse synchronization on a frame-by-frame basis, where a frame is the policing interval comprising F cell time slots of duration Tc.
- Then, a delay of D = 2FTc is provided at a utilization of 50%. Or, this delay and a utilization of 100% are provided for a fabric with a speedup of 2.
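As a worked check under the slide's formula: with the Tc = 50 ns slot duration used later on slide 36 and an assumed frame of F = 10^4 slots, the guaranteed delay is D = 2FTc = 2 x 10^4 x 50 ns = 1 ms.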
20. Port Congestion Due to Multicasting
Solution: Packets should be forwarded through the switch by the multicast destination ports.
21. Forwarding Multicast Traffic
22. Forwarding Multicast Traffic
23. Forwarding Multicast Traffic
24. Adding the Port to the Multicast Tree
25. Adding the Port to the Multicast Tree
26. Adding the Port to the Multicast Tree
27. Removing the Port from the Multicast Tree
28. Removing the Port from the Multicast Tree
29. Removing the Port from the Multicast Tree
30. Removing the Port from the Multicast Tree
31. Removing the Port from the Multicast Tree
32. Admission Control for Modified WSGS
where E_i is the number of forwarded packets per frame
33. Admission Control for Modified WSGS
for
34. Admission Control for Modified WSGS
The modified WSGS protocol ensures the negotiated bandwidths to input-output pairs if:
I
II
F is the frame length, P the forwarding fan-out, T_i the number of slots reserved for input i, R_i the number of slots reserved for output i; t_i and r_i are the normalized T_i, R_i.
35. Rate and Delay Guaranteed by Modified WSGS
- Assume again a coarse synchronization on a frame-by-frame basis.
- Then, a delay of D = FTc is provided at a utilization of 1/(P+2), where P is the forwarding fan-out. Or, this delay and a utilization of 100% are provided for a fabric speedup of P+2.
36. Quality of Service, P = 2, S = 4, B = 10 Gb/s, Tc = 50 ns
37. Clos Packet Switches
38. Load Balancing in Packet Switches
- J. Turner introduced load balancing of multicast sessions in Benes packet switches (INFOCOM 1994).
- C. Chang et al. propose load balancing in the two-stage Birkhoff-von Neumann switch, while Iyer et al. analyze the performance of the parallel plane switch (PPS), which applies load balancing.
- Keslassy et al. propose the implementation of the high-capacity PPS or Birkhoff-von Neumann architecture.
- Smiljanic examines rate and delay guarantees in three-stage Clos packet switches based on load balancing. These switches provide a larger number of lower-speed ports.
39. Load Balancing Algorithms
- Packets are split into cells, and cells are grouped into flows.
- Cells of each flow are balanced over the center SEs.
- Balancing of a flow can be implemented in the following way (sketched in code below):
- One counter is associated with each flow.
- When a cell of the flow arrives, it is marked to be transmitted through the center SE whose designation equals the counter value, and the counter is then incremented (decremented) modulo l, where l is the number of center SEs.
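A minimal sketch of the per-flow counter just described; the class name and the dictionary representation of per-flow state are illustrative, not from the slides.

```python
class FlowBalancer:
    """Per-flow round-robin balancing over l center switching elements (SEs):
    each flow keeps one counter; a new cell goes to the center SE named by
    the counter, which then advances modulo l."""

    def __init__(self, l):
        self.l = l            # number of center SEs
        self.counters = {}    # flow id -> next center SE

    def route(self, flow_id):
        se = self.counters.get(flow_id, 0)
        self.counters[flow_id] = (se + 1) % self.l  # increment modulo l
        return se             # center SE that carries this cell
```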
40. Load Balancing Algorithms
- A flow comprises cells determined by different rules, but that have the same input port or input switching element (SE), and the same output port or output SE. Examples:
- SEs with input buffers:
  - Cells sourced by the same input
  - Cells sourced by the same input bound for the same output
  - Cells sourced by the same input bound for the same output SE
- SEs with shared buffers:
  - Cells sourced by the same input SE bound for the same output
  - Cells sourced by the same input SE bound for the same output SE
41. Non-Blocking Load Balancing
- Non-blocking if l >= n, so no speedup is needed
42. Rate and Delay Guarantees
- Let us assume an implementation with coarse synchronization of the switching elements (SEs), i.e.:
- the switching elements are synchronized on a frame-by-frame basis
- in each frame any SE passes the cells that arrived to this SE in the previous frame
- The delay through a three-stage Clos network with such coarse synchronization, including the packet reordering delay, is D = 4FTc.
- Note that if multicasting is accomplished by the described packet forwarding, the utilization is decreased 3 times, and the delay is increased log_P(N) times.
43. Utilization Formula
- Utilization under which the delay is guaranteed to be below D:
- where S is the switching fabric speedup, Nf is the number of flows whose cells pass the internal fabric link, and Tc is the cell time slot duration.
44. Derivation of Utilization
- The maximum number of cells transmitted over a given link from an input to a center SE, Fc, fulfills
- where f_ig is the number of cells per frame in flow g of cells from input SE i, and Fu is the maximum number of cells assigned to some port
- If Nf - n flows have one cell per frame, the remaining n flows are assigned max(0, n*Fu - Nf + n) cells per frame
45. Derivation of Utilization
- So
- The same expression holds for Fc and Ua over the links from the center to the output SEs
- Since F = D/(4Tc)
46. Speedup Formula
- The delay of D is guaranteed at 100% utilization for the speedup of
- where Nf is the maximum number of flows whose cells pass any internal fabric link, and Tc is the cell time slot duration.
47. Derivation of Speedup
- We put Ua = 1 in the formula for utilization, and readily obtain the expression for the required speedup
48. Counter Synchronization
- The utilization was decreased because all flows may be balanced starting from the same center SE, so this SE will not be able to deliver all the passing cells within a frame.
- Higher utilization can be achieved if the counters of different flows are synchronized.
- The counter of flow g sourced by input SE_1i is reset at the beginning of each frame to c_ig = (i + g) mod l, where l is the number of center SEs. And the counter of flow g bound for output SE_3j is reset at the beginning of each frame to c_jg = (j + g) mod l.
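A sketch of the frame-start reset, assuming the reconstructed offsets (i + g) mod l and (j + g) mod l above; the Flow record and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class Flow:
    g: int             # flow index
    se: int            # i of the source SE_1i, or j of the destination SE_3j
    counter: int = 0   # next center SE

def reset_counters(flows, l):
    """Frame-start reset with synchronized counters: flow g sourced by input
    SE_1i restarts at (i + g) mod l, and flow g bound for output SE_3j at
    (j + g) mod l. Staggering the offsets keeps the flows from all starting
    a frame at the same center SE."""
    for f in flows:
        f.counter = (f.se + f.g) % l
```

During the frame each counter still advances modulo l per transmitted cell, as in the balancer sketch on slide 39; only the starting point is pinned down.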
49. Utilization Formula when Counters are Synchronized
- Utilization under which the delay is guaranteed to be below D:
- where S is the switching fabric speedup, Nf is the maximum number of flows whose cells pass any internal fabric link, and Tc is the cell time slot duration.
50. Derivation of Utilization when Counters are Synchronized
- The maximum number of cells transmitted over a given link from an input SE to center SE_2(l-1), Fc, fulfills
- where f_ig denotes the number of cells in flow g that are balanced starting from input SE_1i
51. Derivation of Utilization when Counters are Synchronized
- Fc is maximized for
- where y_ig > 0 are integers.
- Fc is maximized if
- And Fc is then equal to
52. Derivation of Utilization when Counters are Synchronized
- If Fu < l*Nf/(2n), it holds that, for some z < l,
- In this case, Fc is maximized for
- and equal to
53. Derivation of Utilization when Counters are Synchronized
- From Fc <= nSF/l, it follows that
54. Derivation of Utilization when Counters are Synchronized
55. Speedup Formula when Counters are Synchronized
- The delay of D is guaranteed at 100% utilization for the speedup of
- where Nf is the maximum number of flows whose cells pass any internal fabric link, and Tc is the cell time slot duration.
56. Derivation of Speedup when Counters are Synchronized
- The speedup providing 100% utilization of the transmission capacity is derived by setting Fu = F in the inequality Fc <= nSF/l
57. Derivation of Speedup when Counters are Synchronized
- Because F >= Nf > (10Nf)/(8N) for N >= 2.
58. Utilization vs. Number of Ports, Tc = 50 ns, n = m = l
[Plot: utilization vs. number of ports, counters in sync vs. out of sync, for Nf = nN and Nf = N]
59. Speedup vs. Number of Ports, Tc = 50 ns, n = m = l
[Plot: speedup vs. number of ports, counters in sync vs. out of sync, for Nf = nN and Nf = N]
60. Utilization and Speedup vs. Number of Ports, D = 3 ms, Nf = ml = 640
[Plot: utilization (UTIL) and speedup (SPEED) vs. number of ports, counters in sync vs. out of sync, with a 1 ms delay curve]
61. Conclusions: Scalable Implementations
- Switches with input buffers require simple implementation: pipelining relaxes the processing of the output selector, and the central controller has a linear structure.
- Clos packet switches based on load balancing are even more scalable: they require neither synchronization on a cell-by-cell basis across the whole fabric nor a high-capacity fabric.
62. Conclusions: Performance Advantages
- Both examined architectures are non-blocking with moderate fabric speedups, i.e., the fabric passes all the traffic as long as the outputs are not overloaded.
- Rate and delay are guaranteed even to the most sensitive applications.
- Due to the non-blocking nature of the fabric, the admission control can be distributed, and therefore more agile.
63. References
- T. E. Anderson, S. S. Owicki, J. B. Saxe, and C. P. Thacker, "High-speed switch scheduling for local-area networks," ACM Transactions on Computer Systems, vol. 11, no. 4, November 1993, pp. 319-352.
- N. McKeown et al., "The Tiny Tera: A packet switch core," IEEE Micro, vol. 17, no. 1, Jan.-Feb. 1997, pp. 26-33.
- A. Smiljanic, "Flexible bandwidth allocation in high-capacity packet switches," IEEE/ACM Transactions on Networking, April 2002, pp. 287-293.
64. References
- A. Smiljanic, "Scheduling of multicast traffic in high-capacity packet switches," IEEE Communications Magazine, November 2002, pp. 72-77.
- A. Smiljanic, "Performance of load balancing algorithms in Clos packet switches," Proceedings of IEEE HPSR, April 2004, pp. 304-308.
- J. S. Turner, "An optimal nonblocking multicast virtual circuit switch," Proceedings of INFOCOM 1994, vol. 1, pp. 298-305.