Title: Scalable Multimodule Switches with Quality of Service Thesis Defense
1. Scalable Multi-module Switches with Quality of Service: Thesis Defense
- Santosh Krishnan
- sk_at_cs.columbia.edu
- May 1, 2006
- Advisor: Prof. Henning G. Schulzrinne
- Co-advisor: Dr. Fabio M. Chiussi
2. Outline
- Problem Definition
- Motivations, list of contributions
- Switching Model: Components
- Related work: Formal methods in switching
- Buffered Clos Switches
- Concept of functional equivalence
- BCS: Throughput and Quality of Service
- Single-path BCS: CIOQ, aggregation, pipelining
- Multi-path BCS: Parallelization
- Conclusions
3. Problem Definition
Goals
- How to methodically construct a high-capacity switch?
- How to design high-performance algorithms for such switches?
Importance
- Physical layer improvements: 10-G Ethernet, OC-768
- Converged network requiring QoS: IPTV, MPLS VPN
- Case for modular design: component reuse
What exists
- Ad-hoc approach to switch design
- No benchmarks, varying performance satisfaction
- Non-blocking, 100% throughput, nominal capacity
4. Contributions
- Taxonomy of multi-module switches: Buffered Clos Switches
- Performance framework: Functional equivalence with ideal switch
Mimics circuit-switching rigor
Applications
Combined I/O Queueing
- QoS: Online maximal matching
- Throughput: Critical matching
- Strict stability: Maximal matching, SOQF
- Switched Fair Airport matching
Aggregation
- Shadow CIOQ and Decompose
- Virtual Element Queueing
Pipelining
- Striping and Equal Dispatch
- Concurrent Dispatch: 3D matching
Parallelization
- Flow-based PPS: Clos fitting
- Cell-based PPS: Striping, Equal Dispatch
Memory Space Memory
- Combination methods
- Recursive BCS
5. Switching Model
[Figure: switching model with the CPU on the slow path, and PPUs at the inputs and outputs of the switch fabric on the fast path]
- Basic property: Contention
- Flows: Guaranteed QoS, Best-effort
- Ideal Switch: Provide bandwidth trunks, sustain link capacity
- Black box for network engineering purposes
6. Switching Model: Components
[Figure: memory element and space element. Labels: Buffers, Matching (2D), Link Scheduling, Mesh, Conflict-free property, Matching complexity. Constraints: Memory bandwidth, Full-mesh circuitry. Monolithic: OQ Switch (Ideal), IQ Switch]
- Architecture: Interconnect memory and space elements
- Algorithms: Meaningfully emulate the ideal switch for throughput and QoS
7. Background: Clos Networks
[Figure: Clos network between inputs and outputs, with dimension labels M and K and one circuit highlighted]
Recognize
- Space-time duality
- Fitting: matrix decomposition
- Strictly non-blocking: K ≥ 2M − 1 (Clos theorem)
- Re-arrangeable: K ≥ M (Slepian-Duguid)
Fitting Algorithms
Inspiration: Replace selected elements with memory
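The two classical conditions, K ≥ 2M − 1 for a strictly non-blocking network and K ≥ M for a rearrangeable one, can be checked mechanically. The sketch below assumes M is the number of inputs on each first-stage element and K is the number of middle-stage elements; the function name `clos_class` is illustrative and does not come from the slides.

```python
def clos_class(m: int, k: int) -> str:
    """Classify a symmetric three-stage Clos network.

    Assumption: m = inputs per first-stage element,
                k = number of middle-stage elements.
    """
    if m < 1 or k < 1:
        raise ValueError("m and k must be positive")
    if k >= 2 * m - 1:
        return "strictly non-blocking"   # Clos theorem
    if k >= m:
        return "rearrangeable"           # Slepian-Duguid theorem
    return "blocking"


if __name__ == "__main__":
    for k in (3, 4, 7):
        print(k, clos_class(4, k))
```

With M = 4, seven middle elements are needed to avoid rearranging existing circuits, while four suffice if a fitting algorithm may reassign them.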
8. Background: CIOQ Switches
Pro
Con
- Complexity of matching
- Switch size
- Frequency
- Reconfiguration rate
[Figure: example queue state matrix and the switch configuration computed from it]
- Offline Templates
- Maximum, Maximal, Critical
- Heuristics
What performance results when applied to a changing queue state?
9. Background: CIOQ Switch Results
Based on combinatorics and stability theory
QoS
(Weller-Hajek 97)
Throughput
Auxiliary Results: Envelope matching (Kar 00), Packet-mode matching (Marsan 02)
10. Framework: Buffered Clos Switches
Definition
- Switch size
- Type of elements
- Number in first stage
- Number in second
- Speedup
Parallelize: Pool memory resources (PPS)
Aggregate: Smaller elements (CIOQ-A, G-MSM)
Pipeline: Lower speed, complexity (CIOQ-P, G-MSM)
- Isomorphism: Non-blocking Clos network
- Properties: Multi-stage, fully connected, symmetric, uniform
11. Framework: Functional Equivalence
Characterize relative performance: Functional equivalence
f1: Allocate known rates (Shape: Bandwidth trunks)
f2: Relative stability for admissible traffic (Literature: 100% throughput)
f3: Per-output relative stability (Work conserving)
f4: Strict relative stability, all pairs
f5: Exact emulation
- Emulate an ideal switch: exact, asymptotic
- Bandwidth trunks, independent throughput optimization
12. CIOQ: Bandwidth Trunks
Shaping plus online matching is sufficient for bandwidth guarantees
Offline: Rate Matrix, BVN, Templates
Cons: Template storage, centralized rate processing
Online: Arbitrary Arrivals, Shape/Batch, VOQ, Weight Scheduler
Online Maximal (s = 2), Online Critical (s = 1)
Split time into intervals of T = 1/GCD(R); batch traffic in each interval; simple counters
- Extension of Weller-Hajek maximal matching theorem
- Clos analogy: Maximal matching as a strategy for orderly assignments
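The batching arithmetic can be illustrated with a small sketch. It assumes the reserved rates are rationals expressed as fractions of the link rate, that GCD(R) is the largest rational dividing every non-zero rate, and that the interval is T = 1/GCD(R) slots. The helper names `rate_gcd` and `batch_quota` are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def rate_gcd(rates):
    """Largest rational g such that every non-zero rate is an integer multiple of g."""
    flat = [Fraction(r) for row in rates for r in row if r != 0]
    if not flat:
        raise ValueError("rate matrix has no non-zero entries")
    # Bring all rates to a common denominator, then take the GCD of the numerators.
    den = reduce(lambda a, b: a * b // gcd(a, b), (f.denominator for f in flat), 1)
    num = reduce(gcd, (int(f * den) for f in flat), 0)
    return Fraction(num, den)


def batch_quota(rates):
    """Return (T, quota): interval length T = 1/GCD(R) and cells per pair per interval."""
    interval = 1 / rate_gcd(rates)
    quota = [[int(Fraction(r) * interval) for r in row] for row in rates]
    return interval, quota


if __name__ == "__main__":
    # Hypothetical 2x2 rate matrix; each row and column sums to 3/4 (admissible).
    R = [[Fraction(1, 2), Fraction(1, 4)],
         [Fraction(1, 4), Fraction(1, 2)]]
    T, quota = batch_quota(R)
    print(T, quota)
```

For this rate matrix the interval is 4 slots and the per-interval quotas are 2 and 1 cells, so each pair needs only a small integer counter per batch.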
13. CIOQ: Admissible Traffic
Best Throughput Results
- No speedup: MWM (McKeown et al.); Speedup 2: Maximal (Dai-Prabhakar)
- Can a simple maximum size matching suffice for admissible traffic?
Red Herring!
Critical matching suffices for asymptotic 100% throughput (f2)
[Figure: queue state and critical matching example; labels: Augment, MSM]
Intuition: 2x2 line buckets (R1, R2, C1, C2, Max)
14. CIOQ: Strict Relative Stability
- Maximal matching: Keeps under-subscribed outputs stable (f3) (s = 2)
- Shortest Output-Queue First (f4) (s = 3)
- Output element scheduler: Identical to the one in emulated switch
- Intuition: Give preference to less congested pairs at the output
- Asymptotic emulation of an ideal switch: long-term fairness
15. Switched Fair Airport
- Integrate two policies: M1 and M2
- M1: Provides bandwidth trunks given rate reservations
- M2: Optimize throughput independent of above rates
[Table: speedup required by M1 and M2 under the multi-phase combination and the exclusive combination]
Maximal matching is additive to any other policy, hence needs the least speedup
16. CIOQ-A: Aggregation
Advantages: Smaller space element, lower arbitration complexity, heterogeneous subports
- Shadow-Decompose: CIOQ emulation (f5)
- VEQ Matching: Less complex, only for admissible traffic (f2)
17. CIOQ-P: Pipelining
- Sequential Dispatch: CIOQ emulation (f5)
- Concurrent Dispatch
- Limited candidates, stale-state issues
- 3D Maximal Matching for relative stability
- Striping: Shadow on envelope basis
- Equal Dispatch
- Explicitly equalize load
- Separate occupancy counters for each SE
Implement arbitrarily complex policies!
Advantages: Slower space element, lower arbitration complexity
18. G-MSM: Combination
Combination methods: CIOQ-A/P. No need for independent analysis. Recursion possible.
19. PPS: Architecture
[Figure: PPS with Demux, Core, and Mux stages]
Advantages: Reuse low-capacity core switch. Implement arbitrarily slow memories!
provided
Memoryless first and third stages. Performance: Emulates OQ switch
- Pool the resources on several switching paths
- Dual of a CIOQ-P switch
- Matching algorithm replaced by load balancing
- Sequence control might be necessary
20. PPS: Flow-based
- Model for clustered routers
- Per-flow path assignment: explicit or hashed
- No need for sequence control
- Memory in first stage
- High speedup (Clos fitting)
- Unbalanced load assignment
- Requires knowledge of loads
Split flows
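A hashed per-flow path assignment can be sketched in a few lines. The flow identifier format (a tuple of addresses and ports) and the choice of CRC32 are assumptions made for illustration; any deterministic hash would serve, and the function name `core_for_flow` is hypothetical.

```python
import zlib


def core_for_flow(flow_id: tuple, num_cores: int) -> int:
    """Map a flow to one core element; every cell of the flow takes the same path."""
    if num_cores < 1:
        raise ValueError("num_cores must be positive")
    key = repr(flow_id).encode("utf-8")
    return zlib.crc32(key) % num_cores


if __name__ == "__main__":
    flow = ("10.0.0.1", "10.0.0.2", 5000, 80)   # hypothetical flow identifier
    print(core_for_flow(flow, 4))
```

Because all cells of a flow traverse the same core element, no sequence control is needed; the cost is that the hash ignores flow rates, which is one way the unbalanced load assignment noted above can arise.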
21. PPS: Cell-based
- Uniformly distribute the load of each flow
- Premise: Each core element receives 1/K of the cells of each flow
- Equal dispatch and striping suffice for asymptotic OQ emulation
- Bandwidth trunks: Large buffers required
22. Summary: A Recipe Book
- Taxonomy of multi-module switches: Buffered Clos Switches
- Performance framework: Functional equivalence with ideal switch
Applications
Combined I/O Queueing
- QoS: Online maximal matching
- Throughput: Critical matching
- Strict stability: Maximal matching, SOQF
- Switched Fair Airport matching
Aggregation
- Shadow and Decompose
- Virtual Element Queueing
Pipelining
- Striping and Equal Dispatch
- Concurrent Dispatch: 3D matching
Parallelization
- Flow-based PPS: Clos fitting
- Cell-based PPS: Striping, Equal Dispatch
Memory Space Memory
- Combination methods
- Recursive BCS
23. Avenues for Follow-on Research
- Efficient policies for multicast
- Similar treatment on other interconnection networks
- Theory of backpressure
- Recent interest in buffered crossbars
- Quality of stability: Average delay analysis
- Short-timescale equivalence
- Emulation of a finite-memory ideal switch
- Interplay of buffer management with matching algorithms
24. Supporting Slides
25. Relevant Publications
Switching Algorithms
- Dynamic Partitioning Switch Memory Management, Infocom 99
- Packet Switches with QoS Support, Hot Interconnects 00
- Feedback Control for Distributed Scheduling, Globecom 00
- Buffered Clos Switches, Columbia TR 02
Parallel Switches
- Inverse Multiplexing for Switches, Globecom 98
- Switched Connections Inverse Multiplexing, Intl. Conf. ATM 99
- Recognition of Parallel Packet Switches, GBN, Infocom 01
- Stability Analysis of Parallel Packet Switches, ICC 01
- Open-loop Schemes for Multi-path Switches, ICC 03
26. Proposal Conjectures
Proposal: six conjectures
- Maximal matching is sufficient to isolate oversubscribed outputs (DONE)
- SOQF is sufficient for strict relative stability (DONE)
- Equal dispatch for strict stability in CIOQ-P (DONE)
- Equal dispatch plus decomposition for strict stability in G-MSM (DONE)
- Rate shaping plus maximal matching suffices for QoS in CIOQ (DONE)
- SOQF suffices for long-term fairness in CIOQ (DONE)
- Plus many more to round out the work
27. Additional Contributions
Background: Survey of formal methods in switching, a new perspective
Applications
Combined I/O Queueing
Aggregation
- Maximal Matching: Delay analysis
- Perfect Sequences: Uniform Traffic
- Multicast support using Recycling
- Batch Decomposition (Optical)
- Support for Heterogeneous Subports
Pipelining
- Concurrent Dispatch: BVN and SPS
Parallelization
- SMM Switches: PPS without backpressure
- Fractional Dispatch for memoryless inputs
28. Matching Flavors
- Maximal matching: Non-idling, greedy
- Maximum-size matching: Maximum flow in a bipartite graph
- Ford-Fulkerson, Hopcroft-Karp
Invariant: At least one connection in the marked lines
[Figure: 3x3 queue state matrix (extracted values: 3 0 6 / 0 7 1 / 5 0 0) with the lines through a non-empty entry marked]
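The difference between a maximal and a maximum-size matching can be seen in a short sketch. It reads the nine queue-state values on this slide as three rows, which is an assumption about the original figure, and uses a row-major greedy scan, which is only one of many ways to obtain a maximal matching. The function names are illustrative.

```python
def greedy_maximal_matching(queue):
    """Scan the queue-state matrix in row-major order and add every
    non-empty (input, output) pair whose input and output are both free."""
    n = len(queue)
    match = {}                 # input -> output
    used_outputs = set()
    for i in range(n):
        for j in range(n):
            if queue[i][j] > 0 and i not in match and j not in used_outputs:
                match[i] = j
                used_outputs.add(j)
    return match


def is_maximal(queue, match):
    """Invariant: every non-empty pair has its input or its output matched."""
    outputs = set(match.values())
    return all(i in match or j in outputs
               for i, row in enumerate(queue)
               for j, q in enumerate(row) if q > 0)


if __name__ == "__main__":
    Q = [[3, 0, 6],
         [0, 7, 1],
         [5, 0, 0]]            # assumed row layout of the slide's values
    m = greedy_maximal_matching(Q)
    print(m, is_maximal(Q, m))
```

On this matrix the greedy scan connects only two pairs, (0, 0) and (1, 1), yet satisfies the invariant. A maximum-size matching would connect three pairs, for example (0, 2), (1, 1), and (2, 0), at the cost of a flow computation.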
29. Matching Flavors (continued)
- Critical Matching: Covers all critical rows and columns
- Critical line: A line with the maximum sum
- Perfect Matching: Each configuration is a permutation
- Maximum Weight Matching: Use queue length as weights
- Optimization problem: simplex method
- Template Matchings
- BVN: Decompose rate matrix as convex combination of permutations
- Double: Lower number of permutations, wasted slots
- Min: N permutations will cover all entries, large number of wasted slots
- Stable Matching: Gale-Shapley algorithm
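The definition of a critical line (a row or column with the maximum sum) translates directly into code. The sketch below finds the critical lines of a queue-state matrix and checks whether a given matching covers them; the example matrix and the function names are hypothetical, and the sketch does not show how a critical matching is constructed.

```python
def critical_lines(queue):
    """Return (rows, cols) whose sums equal the maximum line sum."""
    row_sums = [sum(row) for row in queue]
    col_sums = [sum(col) for col in zip(*queue)]
    peak = max(row_sums + col_sums)
    if peak == 0:
        return [], []          # empty switch: nothing is critical
    rows = [i for i, s in enumerate(row_sums) if s == peak]
    cols = [j for j, s in enumerate(col_sums) if s == peak]
    return rows, cols


def is_critical(queue, match):
    """True if the matching (input -> output) covers every critical line."""
    rows, cols = critical_lines(queue)
    outputs = set(match.values())
    return all(i in match for i in rows) and all(j in outputs for j in cols)


if __name__ == "__main__":
    Q = [[2, 0, 3],
         [0, 4, 1],
         [3, 0, 0]]            # hypothetical queue state
    print(critical_lines(Q))
    print(is_critical(Q, {0: 2, 1: 1, 2: 0}))
    print(is_critical(Q, {0: 2, 1: 1}))
```

Here rows 0 and 1 and column 0 all sum to 5, so a matching that leaves output 0 idle is not critical even though it serves both critical rows.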
30. Stability Theory
- Lyapunov functions: Kumar-Meyn 95
- Mechanism to extend Foster's criterion to a system of queues
- Weighted Cartesian product of queue lengths
- Symmetric and co-positive
- Fluid limits: Dai-Prabhakar 00
- Function of discrete time: Interpolate
- Limit: Scale time to infinity
- The scaling parameter may be drawn from an increasing sequence r_n
$F(t) = \lim_{r \to \infty} \frac{1}{r} f(rt)$
31. CIOQ: Bandwidth Trunks
Arrivals into GQ: Bounded, admissible
Bandwidth Trunk: Timescale 1/GCD(R)
Covers all entries in GQ before next batch
- Delay comparable to BVN rate decomposition
32. CIOQ: Perfect Sequences
- Sub-maximal: Perfect Sequence
- A sequence of N permutations that covers the unit matrix
- A repeating sequence guarantees 1/N to each pair
- Suffices for 100% throughput to uniform traffic
- Simple implementation: Staggered round-robin
- Not even maximal!
Concurrent SPS for CIOQ-P: K turns in KN slots
Basis for iSLIP
Basis for Atlanta arbitration
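A staggered round-robin is one simple way to build such a sequence. The sketch below assumes that in slot t input i is connected to output (i + t) mod N, and then checks that the N configurations are permutations that together serve every input-output pair exactly once. The function names are illustrative.

```python
def staggered_round_robin(n):
    """Slot t connects input i to output (i + t) mod n, for t = 0..n-1."""
    return [[(i + t) % n for i in range(n)] for t in range(n)]


def covers_all_pairs(sequence):
    """True if the configurations together serve all n*n input-output pairs."""
    n = len(sequence)
    served = {(i, out) for perm in sequence for i, out in enumerate(perm)}
    return len(served) == n * n


if __name__ == "__main__":
    seq = staggered_round_robin(4)
    for t, perm in enumerate(seq):
        print(t, perm)
    print(covers_all_pairs(seq))
```

Repeating the sequence gives each pair one slot in every N, which is the 1/N guarantee. The schedule ignores queue state, so a given slot may connect an empty pair while a backlogged one waits, which is why it is not even maximal.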
33. Hierarchical Scheduling
34. CIOQ-P: Equal Dispatch
Explicitly equalize the load for each input-output pair
Implemented as counters. No mis-sequencing issues.
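A counter-based dispatcher along these lines might look as follows. This is a minimal sketch under stated assumptions: one counter per (input, output, space element), each cell sent to the space element with the smallest counter for its pair, and ties broken by lowest index. The class name `EqualDispatcher` is hypothetical, and the counters here record cumulative dispatches; tracking occupancy by decrementing on departure would be a variation.

```python
class EqualDispatcher:
    """Spread the cells of every input-output pair evenly over K space elements."""

    def __init__(self, num_ports, num_elements):
        self.k = num_elements
        # One counter per (input, output, space element).
        self.count = [[[0] * num_elements for _ in range(num_ports)]
                      for _ in range(num_ports)]

    def dispatch(self, inp, out):
        """Pick the least-loaded space element for this pair (ties: lowest index)."""
        counters = self.count[inp][out]
        se = min(range(self.k), key=counters.__getitem__)
        counters[se] += 1
        return se


if __name__ == "__main__":
    d = EqualDispatcher(num_ports=2, num_elements=3)
    print([d.dispatch(0, 1) for _ in range(7)])
    print(d.count[0][1])
```

For a single pair the choice degenerates to a round-robin over the space elements, so the per-pair counters never differ by more than one cell.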
35. CIOQ-P: 3D Maximal Matching
Concurrent traversal of queue state matrix
Pointers do not coincide with each other
36. Recursive G-MSM
[Figure: recursive G-MSM; labels: Any matching, SPS, SPS]
Memory element of a G-MSM: Replace with a CIOQ switch
Virtual Element Queues: Organized per space element
37. PPS: Data Path