Transcript and Presenter's Notes

Title: High-Capacity Packet Switches


1
High-Capacity Packet Switches
2
Switches with Input Buffers (Cisco)
3
Packet Switches with Input Buffers
  • Switching fabric
  • Electronic chips (Mindspeed, AMCC, Vitesse)
  • Space-wavelength selector (NEC, Alcatel)
  • Fast tunable lasers (Lucent)
  • Waveguide arrays (Chiaro)
  • Scheduler
  • Packets compete not only with the packets
    destined for the same output but also with the
    packets sourced by the same input. Scheduling
    might become a bottleneck in a switch with
    hundreds of ports and gigabit line bit-rates.

4
Optical Packet Cross-bar (NEC, Alcatel)
  • A 2.56 Tb/s multiwavelength and scalable
    switch-fabric for fast packet-switching network,
    PTL 1998, 1999, NEC

5
Optical Packet Cross-bar (Lucent)
  • A fast 100 channel wavelength tunable transmitter
    for optical packet switching, PTL 2001, Bell Labs

6
Scheduling Algorithms for Packet Switches with
Input Buffers
  • In parallel iterative matching (PIM), SLIP, or
    dual round-robin (DRR), inputs send requests to
    outputs, outputs grant inputs, and inputs then
    accept grants, all in one iteration. It was proven
    that PIM finds a maximal matching after
    log2N + 4/3 iterations on average.
  • Maximum weighted matching and maximum matching
    algorithms maximize the weight or the number of
    the connected pairs, and achieve 100% throughput
    for i.i.d. traffic, but have complexities
    O(N^3 log2N) and O(N^2.5), respectively.
  • Sequential greedy scheduling (SGS) is a maximal
    matching algorithm that is simple to implement.
    A maximal matching algorithm does not leave an
    input-output pair unmatched if there is a packet
    to be transmitted between them.

7
PIM, SLIP and DRR
  • In PIM and SLIP each input sends requests to all
    outputs for which it has packets, while in DRR it
    sends a request only to one chosen output. SLIP
    and DRR use round-robin choices (a sketch of one
    PIM iteration follows after the proof below).
  • Theorem: PIM finds a maximal matching after
    log2N + 4/3 iterations on average.
  • Proof: Let n inputs request output Q, and let k
    of these inputs receive no grants. With
    probability k/n all requests are resolved, and
    with probability 1 - k/n at most k requests are
    unresolved. The average number of unresolved
    requests is therefore at most (1 - k/n)k ≤ n/4.
    So if there are N^2 requests at the beginning,
    the expected number of unresolved requests after
    i iterations is at most N^2/4^i.

8
PIM, SLIP and DRR
  • Proof (cont.): Let C be the step in which the
    last request is resolved. Then
    E[C] = Σ_{i≥0} P(C > i) ≤ Σ_{i≥0} min(1, N^2/4^i)
    ≤ log4(N^2) + 4/3 = log2N + 4/3.
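As a rough illustration of the request-grant-accept handshake used by PIM, the following Python sketch runs one iteration over a virtual-output-queue occupancy matrix. The data layout and names are assumptions made for illustration; this is not Anderson et al.'s implementation.

```python
import random

def pim_iteration(queues, matched_in, matched_out):
    """One PIM iteration: unmatched inputs request all outputs they hold
    cells for, each output grants one request at random, and each input
    accepts one of its grants at random. The matched_in / matched_out
    dictionaries (input -> output, output -> input) are updated in place."""
    N = len(queues)
    # Request phase: free inputs request every free output they have cells for.
    requests = {j: [i for i in range(N)
                    if i not in matched_in and queues[i][j] > 0]
                for j in range(N) if j not in matched_out}
    # Grant phase: every requested output picks one requesting input.
    grants = {}
    for j, reqs in requests.items():
        if reqs:
            grants.setdefault(random.choice(reqs), []).append(j)
    # Accept phase: every granted input picks one granting output.
    for i, outs in grants.items():
        j = random.choice(outs)
        matched_in[i] = j
        matched_out[j] = i
```

Iterating until no new pairs are added yields a maximal matching; the theorem above bounds the expected number of iterations by log2N + 4/3.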

9
Typical Central Controllers (Cisco)
10
SGS Implementation
  • All inputs, one after another, choose outputs;
    SGS is therefore a maximal matching algorithm
    (see the sketch below).
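A minimal Python sketch of SGS, assuming virtual output queues represented by an occupancy matrix; the names are illustrative only, not from the slides.

```python
def sgs_schedule(queues):
    """Inputs 1..N choose, one after another, a still-free output for
    which they hold a packet; the result is a maximal matching."""
    N = len(queues)
    free_outputs = set(range(N))
    match = {}                     # input index -> chosen output index
    for i in range(N):             # inputs decide strictly sequentially
        j = next((j for j in sorted(free_outputs) if queues[i][j] > 0), None)
        if j is not None:
            match[i] = j
            free_outputs.remove(j)
    return match
```

Because each input's choice depends only on the outputs left free by the inputs before it, consecutive decisions can be pipelined, as the next slide illustrates.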

11
SGS Uses Pipelining
I_i -> T_k: Input i chooses the output for time slot k
12
Bandwidth Reservations: Packet Switches with Input
Buffers
  • Anderson et al.: Time is divided into frames of F
    time slots. A schedule is calculated in each frame
    by a statistical matching algorithm.
  • Stiliadis and Varma: Counters are loaded per
    frame. Queues with positive counters are served
    with priority according to parallel iterative
    matching (PIM), and their counters are then
    decremented by 1. DRR, proposed by Chao et al.,
    could be used as well.
  • Kam et al.: A counter is incremented according to
    the negotiated bandwidth and decremented by 1 when
    the queue is served. A maximal weighted matching
    algorithm is applied.
  • Smiljanic: Counters are loaded per frame. Queues
    with positive counters are served with priority
    according to a maximal matching algorithm,
    preferably the sequential greedy scheduling
    algorithm (SGS), in which inputs sequentially
    choose outputs to transmit packets to.

13
Weighted Sequential Greedy Scheduling
  • i = 1
  • Input i chooses output j from O_k for which
    it has a packet to send; remove i from I_k and
    j from O_k
  • If i < N, set i = i + 1 and
    go to the previous step

14
Weighted Sequential Greedy Scheduling
  • If k ≡ 1 mod F then c_ij = a_ij,
    I_k = {1,...,N}, O_k = {1,...,N}, i = 1
  • Input i chooses output j from O_k for which
    it has a packet to send and c_ij > 0;
    remove i from I_k and j from O_k; c_ij = c_ij - 1
  • If i < N, set i = i + 1 and go to the previous
    step (a sketch of one WSGS time slot follows below)
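Below is a minimal Python sketch of one WSGS time slot following the pseudocode above. The variable names (queues, counters, reservations) are assumptions; the fallback to queues whose counters are exhausted reflects the "served with priority" wording of slide 12 and is also an assumption about how unreserved traffic is handled.

```python
def wsgs_slot(k, F, queues, counters, reservations):
    """Schedule one time slot k of weighted sequential greedy scheduling."""
    N = len(queues)
    if k % F == 1:                            # start of a frame: reload counters
        for i in range(N):
            for j in range(N):
                counters[i][j] = reservations[i][j]   # c_ij = a_ij
    free_outputs = set(range(N))
    match = {}
    for i in range(N):                        # inputs choose sequentially
        # Queues with positive counters are served with priority.
        j = next((j for j in sorted(free_outputs)
                  if queues[i][j] > 0 and counters[i][j] > 0), None)
        if j is None:                         # otherwise any backlogged queue
            j = next((j for j in sorted(free_outputs) if queues[i][j] > 0), None)
        if j is not None:
            match[i] = j
            free_outputs.remove(j)
            if counters[i][j] > 0:
                counters[i][j] -= 1
    return match
```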

15
Performance of WSGS
Theorem: The WSGS protocol ensures a_ij time slots
per frame to input-output pair (i,j) if
T_i + R_j - a_ij ≤ F, where T_i is the number of
slots reserved for input i, and R_j is the number of
slots reserved for output j.
Proof: Note that input i is scheduled for other
outputs in at most T_i - a_ij slots of a frame, and
output j is scheduled for other inputs in at most
R_j - a_ij slots, so at least
F - (T_i - a_ij) - (R_j - a_ij) ≥ a_ij slots remain
in which the maximal matching must serve pair (i,j).
16
Analogy with Circuit Switches
  • Inputs ↔ switches in the first stage
  • Time slots in a frame ↔ switches in the middle
    stage
  • Outputs ↔ switches in the last stage

Non-blocking condition
Strictly non-blocking condition
17
Admission Control for WSGS
The WSGS protocol ensures a_ij time slots per frame
to input-output pair (i,j) under the admission
condition stated above, where
F = frame length, T_i = the number of slots reserved
for input i, R_j = the number of slots reserved for
output j, and t_i, r_j are the normalized T_i, R_j.
18
Non-blocking Nature of WSGS
  • A maximal matching algorithm does not leave an
    input and an output unmatched if there is a packet
    to be transmitted from the input to the output in
    question.
  • It can be proven that a cross-bar with a speedup
    of two run by a maximal matching algorithm passes
    all the traffic, as long as the outputs are not
    overloaded.

19
Rate and Delay Guaranteed by WSGS
  • Assume a coarse synchronization on a frame-by-frame
    basis, where a frame is the policing interval
    comprising F cell time slots of duration Tc.
  • Then, the delay of D = 2FTc is provided for a
    utilization of 50%. Alternatively, this delay and
    a utilization of 100% are provided for a fabric
    with a speedup of 2 (a worked example follows
    below).
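For a rough sense of scale (the frame length F = 1000 below is an assumed value, not one given on these slides): with F = 1000 cell slots and Tc = 50 ns, the cell duration used later on slide 36, the bound is D = 2FTc = 2 × 1000 × 50 ns = 100 µs, available at 50% utilization without speedup or at 100% utilization with a speedup of 2.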

20
Port Congestion Due to Multicasting
Solution: Packets should be forwarded through the
switch by the multicast destination ports.
21
Forwarding Multicast Traffic
22
Forwarding Multicast Traffic
23
Forwarding Multicast Traffic
24
Adding the Port to the Multicast Tree
25
Adding the Port to the Multicast Tree
26
Adding the Port to the Multicast Tree
27
Removing the Port from the Multicast Tree
28
Removing the Port from the Multicast Tree
29
Removing the Port from the Multicast Tree
30
Removing the Port from the Multicast Tree
31
Removing the Port from the Multicast Tree
32
Admission Control for Modified WSGS
where Ei is the number of forwarded packets per
frame
33
Admission Control for Modified WSGS
for

34
Admission Control for Modified WSGS
Modified WSGS protocol ensures negotiated
bandwidths to input-output pairs if conditions I and
II hold, where
F = frame length, P = forwarding fan-out,
T_i = the number of slots reserved for input i,
R_i = the number of slots reserved for output i, and
t_i, r_i are the normalized T_i, R_i.
35
Rate and Delay Guaranteed by Modified WSGS
  • Assume again a coarse synchronization on a
    frame-by-frame basis.
  • Then, the delay of D = FTc is provided for a
    utilization of 1/(P+2), where P is the forwarding
    fan-out. Alternatively, this delay and a
    utilization of 100% are provided for a fabric
    speedup of P+2.

36
Quality of Service, P=2, S=4, B=10 Gb/s, Tc=50 ns
37
Clos Packet Switches
38
Load Balancing in Packet Switches
  • J. Turner introduces load balancing of multicast
    sessions in Benes packet switches, INFOCOM 1994.
  • C. Chang et al. propose load balancing in the
    two-stage Birkhoff-von Neumann switch, while Iyer
    et al. analyze the performance of the parallel
    plane switch (PPS), which applies load balancing.
  • Keslassy et al. propose the implementation of a
    high-capacity PPS or Birkhoff-von Neumann
    architecture.
  • Smiljanic examines rate and delay guarantees in
    three-stage Clos packet switches based on load
    balancing. These switches provide a larger number
    of lower-speed ports.

39
Load Balancing Algorithms
  • Packets are split into cells, and cells are
    grouped into flows.
  • Cells of each flow are balanced over the center
    SEs.
  • Balancing of a flow can be implemented in the
    following way (a sketch follows below):
  • One counter is associated with each flow.
  • When a cell of the flow arrives, it is marked to
    be transmitted through the center SE whose
    designation equals the counter value, and the
    counter is then incremented (or decremented)
    modulo l, where l is the number of center SEs.
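A minimal Python sketch of the per-flow balancing counter described above; the class name and flow-identifier scheme are assumptions made for illustration.

```python
from collections import defaultdict

class FlowBalancer:
    """One counter per flow; an arriving cell is sent to the center SE
    given by the counter, which is then incremented modulo l."""
    def __init__(self, l):
        self.l = l                          # number of center SEs
        self.counters = defaultdict(int)    # flow id -> next center SE

    def assign_center_se(self, flow_id):
        se = self.counters[flow_id]
        self.counters[flow_id] = (se + 1) % self.l
        return se
```

Consecutive cells of the same flow are thus spread round-robin over the l center SEs, which keeps the internal links evenly loaded.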

40
Load Balancing Algorithms
  • A flow comprises cells grouped by different
    rules, but all sharing the same input port or
    input switching element (SE), and the same output
    port or output SE. Examples:
  • SEs with input buffers
  • Cells sourced by the same input
  • Cells sourced by the same input bound for the
    same output
  • Cells sourced by the same input bound for the
    same output SE
  • SEs with shared buffers
  • Cells sourced by the same input SE bound for the
    same output
  • Cells sourced by the same input SE bound for the
    same output SE

41
Non-Blocking Load Balancing
Non-blocking, with no speedup needed, if the number
of center SEs l is sufficiently large.
42
Rate and Delay Guarantees
  • Let us assume the implementation with the coarse
    synchronization of the switching elements (SEs),
    i.e.
  • the switching elements are synchronized on a
    frame-by-frame basis
  • in each frame any SE passes cells that arrived at
    this SE in the previous frame
  • The delay through a three-stage Clos network with
    such coarse synchronization, including the packet
    reordering delay, is D = 4FTc.
  • Note that if multicasting is accomplished by the
    described packet forwarding, the utilization is
    decreased 3 times, and the delay is increased
    log_P N times.

43
Utilization Formula
  • Utilization under which the delay is guaranteed
    to be below D
  • where S is the switching fabric speedup, Nf is
    the number of flows whose cells pass the internal
    fabric link, and Tc is the cell time slot
    duration.

44
Derivation of Utilization
  • The maximum number of cells transmitted over a
    given link from an input to a center SE, Fc,
    fulfills
  • where f_ig is the number of cells per frame in
    flow g from input SE i, and Fu is the maximum
    number of cells per frame assigned to some port
  • If Nf - n flows have one cell per frame, and the
    remaining n flows are assigned max(0, nFu - Nf + n)
    cells per frame

45
Derivation of Utilization
  • So
  • The same expression holds for Fc and Ua over the
    links from center to output SEs
  • Since F = D/(4Tc)

46
Speedup Formula
  • The delay of D is guaranteed for 100% utilization
    for the speedup of
  • where Nf is the maximum number of flows whose
    cells pass any internal fabric link, and Tc is
    the cell time slot duration.

47
Derivation of Speedup
  • We put Ua = 1 in the formula for utilization, and
    readily obtain the expression for the required
    speedup

48
Counter Synchronization
  • The utilization was decreased because all flows
    may be balanced starting from the same center SE,
    so this SE will not be able to deliver all the
    passing cells within a frame.
  • Higher utilization can be achieved if the
    counters of different flows are synchronized.
  • The counter of flow g sourced by input SE1_i is
    reset at the beginning of each frame to
    c_ig = (i + g) mod l, where l is the number of
    center SEs, and the counter of flow g bound for
    output SE3_j is reset at the beginning of each
    frame to c_jg = (j + g) mod l (a sketch follows
    below).
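A possible sketch of the per-frame reset, extending the FlowBalancer sketch of slide 39; it assumes flows are identified by (input SE index i, flow index g) pairs and uses the (i + g) mod l offset reconstructed above, so the exact rule should be treated as an assumption.

```python
def reset_counters(balancer, flows):
    """Re-seed the balancing counters at the start of a frame so that
    different flows begin their round-robin at different center SEs."""
    for (i, g) in flows:                     # (input SE index, flow index)
        balancer.counters[(i, g)] = (i + g) % balancer.l  # assumed offset rule
```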

49
Utilization Formula when Counters are
Synchronized
  • Utilization under which the delay is guaranteed
    to be below D
  • where S is the switching fabric speedup, Nf is
    the maximum number of flows whose cells pass any
    internal fabric link, and Tc is the cell time
    slot duration.

50
Derivation of Utilization when Counters are
Synchronized
  • The maximum number of cells transmitted over a
    given link from an input SE to center SE2_(l-1),
    Fc, fulfills
  • where f_ig denotes the number of cells in flow g
    that are balanced starting from input SE1_i

51
Derivation of Utilization when Counters are
Synchronized
  • Fc is maximized for
  • where y_ig > 0 are integers.
  • Fc is maximized if
  • And Fc is then equal to

52
Derivation of Utilization when Counters are
Synchronized
  • If Fu < lNf/(2n), it holds that for some z < l
  • In this case, Fc is maximized for
  • and equal to

53
Derivation of Utilization when Counters are
Synchronized
  • From Fc ≤ nSF/l, it follows that

54
Derivation of Utilization when Counters are
Synchronized
  • And

55
Speedup Formula when Counters are Synchronized
  • The delay of D is guaranteed for 100% utilization
    for the speedup of
  • where Nf is the maximum number of flows whose
    cells pass any internal fabric link, and Tc is
    the cell time slot duration.

56
Derivation of Speedup when Counters are
Synchronized
  • Speedup providing 100% utilization of the
    transmission capacity is derived when Fu = F in
    the inequality Fc ≤ nSF/l

57
Derivation of Speedup when Counters are
Synchronized
  • Because F ≥ Nf > (10Nf)/(8N), for N ≥ 2.

58
Utilization vs. Number of Ports, Tc=50 ns, n=m=l
Figure: utilization curves for counters in sync and
counters out of sync, for Nf=nN and Nf=N.
59
Speedup vs. Number of Ports, Tc=50 ns, n=m=l
Figure: speedup curves for counters out of sync and
counters in sync, for Nf=nN and Nf=N.
60
Utilization and Speedup vs. Number of Ports,
D=3 ms, Nf=ml=640
Figure: utilization (UTIL) and speedup (SPEED)
curves for counters out of sync and counters in
sync; a 1 ms delay case is also marked.
61
Conclusions: Scalable Implementations
  • Switches with input buffers require a simple
    implementation: pipelining relaxes the processing
    of the output selector, and the central controller
    has a linear structure.
  • Clos packet switches based on load balancing are
    even more scalable: they require neither
    synchronization on a cell-by-cell basis across
    the whole fabric nor a single high-capacity
    fabric.

62
Conclusions: Performance Advantages
  • Both examined architectures are non-blocking with
    moderate fabric speedups, i.e. the fabric passes
    all the traffic as long as the outputs are not
    overloaded.
  • Rate and delay are guaranteed even to the most
    sensitive applications.
  • Due to the nonblocking nature of the fabric, the
    admission control can be distributed, and
    therefore more agile.

63
References
  • T. E. Anderson, S. S. Owicki, J. B. Saxe, and C.
    P. Thacker, "High-speed switch scheduling for
    local-area networks," ACM Transactions on
    Computer Systems, vol. 11, no. 4, November 1993,
    pp. 319-352.
  • N. McKeown et al., "The Tiny Tera: A packet
    switch core," IEEE Micro, vol. 17, no. 1,
    Jan.-Feb. 1997, pp. 26-33.
  • A. Smiljanic, "Flexible bandwidth allocation in
    high-capacity packet switches," IEEE/ACM
    Transactions on Networking, April 2002, pp.
    287-293.

64
References
  • A. Smiljanic, "Scheduling of multicast traffic in
    high-capacity packet switches," IEEE
    Communications Magazine, November 2002, pp. 72-77.
  • A. Smiljanic, "Performance of load balancing
    algorithms in Clos packet switches," Proceedings
    of IEEE HPSR, April 2004, pp. 304-308.
  • J. S. Turner, "An optimal nonblocking multicast
    virtual circuit switch," Proceedings of INFOCOM
    1994, vol. 1, pp. 298-305.