Title: Hot interconnects tutorial
1High Performance Switches and Routers: Theory and Practice
- Hot Interconnects 7
- August 20, 1999
- Stanford University
Nick McKeown, Assistant Professor of Electrical Engineering and Computer Science
nickm_at_stanford.edu
http://www.stanford.edu/nickm
2Tutorial Outline
- Introduction: What is a Packet Switch?
- Packet Lookup and Classification: Where does a packet go next?
- Switching Fabrics: How does the packet get there?
3Introduction: What is a Packet Switch?
- Basic Architectural Components
- Some Example Packet Switches
- The Evolution of IP Routers
4Basic Architectural Components
- Control: Routing, Reservation, Admission Control, Congestion Control
- Datapath (per-packet processing): Policing, Switching, Output Scheduling
5Basic Architectural Components: Datapath per-packet processing
1. Forwarding Decision (one per input, each with its own Forwarding Table)
2. Interconnect
3. Output Scheduling
6Where high performance packet switches are used
- Carrier Class Core Router
- ATM Switch
- Frame Relay Switch
[Figure: The Internet Core]
7Introduction: What is a Packet Switch?
- Basic Architectural Components
- Some Example Packet Switches
- The Evolution of IP Routers
8ATM Switch
- Lookup cell VCI/VPI in VC table.
- Replace old VCI/VPI with new.
- Forward cell to outgoing interface.
- Transmit cell onto link.
9Ethernet Switch
- Lookup frame DA in forwarding table.
- If known, forward to correct port.
- If unknown, broadcast to all ports.
- Learn SA of incoming frame.
- Forward frame to outgoing interface.
- Transmit frame onto link.
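The steps on this slide can be sketched as a toy learning switch. This is a minimal illustration, not any vendor's implementation: the class name, the port numbering, and the choice to exclude the arrival port when flooding are assumptions that do not appear on the slide.

```python
class LearningSwitch:
    """Toy model of the Ethernet switch steps: learn SA, look up DA, forward or flood."""

    def __init__(self, num_ports):
        self.num_ports = num_ports
        self.table = {}  # MAC address -> port on which it was last seen

    def receive(self, in_port, src, dst):
        """Return the list of ports the frame is transmitted on."""
        # Learn the source address of the incoming frame.
        self.table[src] = in_port
        # Look up the destination address in the forwarding table.
        out_port = self.table.get(dst)
        if out_port is None:
            # Unknown DA: flood (assumption: every port except the arrival port).
            return [p for p in range(self.num_ports) if p != in_port]
        if out_port == in_port:
            return []  # destination sits on the arrival port; nothing to forward
        return [out_port]


switch = LearningSwitch(4)
first = switch.receive(0, "A", "B")   # B unknown: flooded
reply = switch.receive(1, "B", "A")   # A was learned on port 0
again = switch.receive(0, "A", "B")   # B is now known on port 1
```

After the first two frames both stations are in the table, so later frames between them go out on a single port.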
10IP Router
- Lookup packet DA in forwarding table.
- If known, forward to correct port.
- If unknown, drop packet.
- Decrement TTL, update header Cksum.
- Forward packet to outgoing interface.
- Transmit packet onto link.
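The "Decrement TTL, update header Cksum" step can be sketched as follows. This is an illustrative sketch that recomputes the whole IPv4 header checksum; the function names are hypothetical, and a real datapath would more likely update the checksum incrementally.

```python
import struct


def ipv4_checksum(header):
    """One's-complement of the one's-complement sum of the header's 16-bit words."""
    total = 0
    for (word,) in struct.iter_unpack("!H", header):
        total += word
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def decrement_ttl(header):
    """Return the header to forward, or None if the packet must be dropped."""
    hdr = bytearray(header)
    if hdr[8] <= 1:                          # byte 8 is the TTL; it would reach zero
        return None
    hdr[8] -= 1
    hdr[10:12] = b"\x00\x00"                 # clear the checksum field (bytes 10-11)
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)


# A 20-byte header with TTL 0x40 and a valid checksum (0xb861).
original = bytes.fromhex("4500007300004000" "4011b861c0a80001" "c0a800c7")
forwarded = decrement_ttl(original)
```

Running the checksum over a header that already carries a correct checksum yields zero, which is a convenient validity check.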
11Introduction: What is a Packet Switch?
- Basic Architectural Components
- Some Example Packet Switches
- The Evolution of IP Routers
12First-Generation IP Routers
Shared Backplane
Buffer Memory
CPU
13Second-Generation IP Routers
Buffer Memory
CPU
14Third-Generation Switches/Routers
[Figure: Line Cards, each with a MAC and Local Buffer Memory, and a CPU Card, connected by a Switched Backplane]
15Fourth-Generation Switches/Routers: Clustering and Multistage
[Figure: clustered multistage switch with ports numbered 1 to 32]
16Packet Switches: References
- J. Giacopelli, M. Littlewood, W.D. Sincoskie, "Sunshine: A high performance self-routing broadband packet switch architecture," ISS '90.
- J. S. Turner, "Design of a Broadcast packet switching network," IEEE Trans Comm, June 1988, pp. 734-743.
- C. Partridge et al., "A Fifty Gigabit per second IP Router," IEEE Trans Networking, 1998.
- N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, M. Horowitz, "The Tiny Tera: A Packet Switch Core," IEEE Micro Magazine, Jan-Feb 1997.
17Tutorial Outline
- Introduction: What is a Packet Switch?
- Packet Lookup and Classification: Where does a packet go next?
- Switching Fabrics: How does the packet get there?
18Basic Architectural Components: Datapath per-packet processing
1. Forwarding Decision (one per input, each with its own Forwarding Table)
2. Interconnect
3. Output Scheduling
19Forwarding Decisions
- ATM and MPLS switches
- Direct Lookup
- Bridges and Ethernet switches
- Associative Lookup
- Hashing
- Trees and tries
- IP Routers
- CIDR
- Patricia trees/tries
- Other methods
- Caching
- Packet Classification
20ATM and MPLS Switches: Direct Lookup
[Figure: the incoming VCI is used as the Address into a Memory; the Data read out is the outgoing (Port, VCI)]
21Forwarding Decisions
- ATM and MPLS switches
- Direct Lookup
- Bridges and Ethernet switches
- Associative Lookup
- Hashing
- Trees and tries
- IP Routers
- CIDR
- Patricia trees/tries
- Other methods
- Caching
- Packet Classification
22Bridges and Ethernet Switches: Associative Lookups
[Figure: 48-bit Search Data is presented to an Associative Memory or CAM holding Network Address and Associated Data pairs]
23Bridges and Ethernet Switches: Hashing
[Figure: 48-bit Search Data passes through a Hashing Function to give a 16-bit Memory Address; the Memory returns the Data]
24Lookups Using Hashing: An example
[Figure: 48-bit Search Data is hashed with CRC-16 to a 16-bit index into Memory; entries that collide are chained in Linked lists]
25Lookups Using Hashing: Performance of simple example
26Lookups Using Hashing
- Advantages
- Simple
- Expected lookup time can be small
- Disadvantages
- Non-deterministic lookup time
- Inefficient use of memory
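The hashing example (a CRC hash of the 48-bit address selecting a bucket, with collisions chained) might be sketched like this. It is a hedged illustration: `binascii.crc_hqx` computes a 16-bit CRC with the CCITT polynomial, which may not be the CRC-16 variant the slide had in mind, a dictionary of Python lists stands in for the memory and its linked lists, and the class name is hypothetical.

```python
import binascii


class HashedMacTable:
    """Exact-match table: 16-bit CRC of the MAC address picks a bucket, collisions are chained."""

    def __init__(self, index_bits=16):
        self.mask = (1 << index_bits) - 1
        self.buckets = {}  # sparse stand-in for a memory with 2**index_bits entries

    def _index(self, mac):
        return binascii.crc_hqx(mac, 0) & self.mask

    def insert(self, mac, port):
        chain = self.buckets.setdefault(self._index(mac), [])
        for i, (stored, _) in enumerate(chain):
            if stored == mac:
                chain[i] = (mac, port)   # refresh an existing entry
                return
        chain.append((mac, port))

    def lookup(self, mac):
        """Return (port, probes); probes counts the chain entries examined."""
        probes = 0
        for stored, port in self.buckets.get(self._index(mac), []):
            probes += 1
            if stored == mac:
                return port, probes
        return None, probes


# Deliberately tiny index (2 bits) so that collisions are guaranteed.
table = HashedMacTable(index_bits=2)
macs = [bytes([0x00, 0x10, 0xA4, 0, 0, i]) for i in range(10)]
for port, mac in enumerate(macs):
    table.insert(mac, port)
results = [table.lookup(mac) for mac in macs]
```

The probe count makes the slide's "non-deterministic lookup time" concrete: it depends on how many addresses happen to share a bucket.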
27Trees and Tries
[Figure: Binary Search Tree; each node branches on < and >]
28Trees and Tries: Multiway tries
[Figure: 16-ary Search Trie; each node holds entries from (0000, ptr) to (1111, ptr), with unused entries shown as (0000, 0); example keys 000011110000 and 111111111111]
29Trees and Tries: Multiway tries
Table produced from 2^15 randomly generated 48-bit addresses
30Forwarding Decisions
- ATM and MPLS switches
- Direct Lookup
- Bridges and Ethernet switches
- Associative Lookup
- Hashing
- Trees and tries
- IP Routers
- CIDR
- Patricia trees/tries
- Other methods
- Caching
- Packet Classification
31IP Routers: Class-based addresses
[Figure: the IP Address Space divided into Class A, Class B, Class C, and D]
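32IP Routers: CIDR
[Figure: the address space from 0 to 2^32-1 shown twice. Class-based: divided into A, B, C, D. Classless: arbitrary prefixes such as 65/24 and 128.9/16, with address 128.9.16.14 marked]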
33IP Routers: CIDR
[Figure: the address space from 0 to 2^32-1 with prefix 128.9/16 and address 128.9.16.14 marked]
34IP Routers: Metrics for Lookups
- Lookup time
- Storage space
- Update time
- Preprocessing time
35IP Router Lookup
- IPv4 unicast destination address based lookup
36Need more than IPv4 unicast lookups
- Multicast
  - PIM-SM
    - Longest Prefix Matching on the source and group address
    - Try (S,G) followed by (*,G) followed by (*,*,RP)
    - Check Incoming Interface
  - DVMRP
    - Incoming Interface Check followed by (S,G) lookup
- IPv6
  - 128-bit destination address field
  - Exact address architecture not yet known
37Lookup Performance Required
- Gigabit Ethernet (84B packets): 1.49 Mpps
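The figure above follows from dividing the line rate by the bits each packet occupies on the wire. A short check, assuming a line rate of exactly 10^9 bit/s and taking the 84 bytes as the per-packet wire occupancy (the function name is hypothetical):

```python
def packets_per_second(line_rate_bps, bytes_per_packet):
    """Worst-case packet arrival rate for back-to-back packets of the given size."""
    return line_rate_bps / (bytes_per_packet * 8)


gige_mpps = packets_per_second(1e9, 84) / 1e6   # millions of packets per second
lookup_budget_ns = 84 * 8 / 1e9 * 1e9           # time available per lookup, in ns
```

At this rate a forwarding decision has to complete roughly every 672 ns.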
38Size of the Routing Table
- Source: http://www.telstra.net/ops/bgptable.html
39Method 1: Ternary CAMs
Associative Memory

Value     | Mask            | Next Hop
10.0.0.0  | 255.0.0.0       | R1
10.1.0.0  | 255.255.0.0     | R2
10.1.1.0  | 255.255.255.0   | R3
10.1.3.0  | 255.255.255.0   | R4
10.1.3.1  | 255.255.255.255 | R4

The match lines feed a Priority Encoder.
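A software model of the ternary CAM lookup, using the value, mask, and next-hop entries from the slide. The row ordering (most specific entry first, so the priority encoder's first hit is the longest match) is an assumption; the slide does not state which end of the table has priority. Function names are hypothetical.

```python
import ipaddress


def ip(text):
    """Dotted-quad string to a 32-bit integer."""
    return int(ipaddress.IPv4Address(text))


# (value, mask, next hop), ordered most specific first (assumed priority order).
TCAM = [
    (ip("10.1.3.1"), ip("255.255.255.255"), "R4"),
    (ip("10.1.3.0"), ip("255.255.255.0"), "R4"),
    (ip("10.1.1.0"), ip("255.255.255.0"), "R3"),
    (ip("10.1.0.0"), ip("255.255.0.0"), "R2"),
    (ip("10.0.0.0"), ip("255.0.0.0"), "R1"),
]


def tcam_lookup(address):
    key = ip(address)
    # In hardware every row is compared at once; here the comparison is a list.
    match_lines = [(key & mask) == value for value, mask, _ in TCAM]
    # Priority encoder: the lowest-numbered matching row wins.
    for row, hit in enumerate(match_lines):
        if hit:
            return TCAM[row][2]
    return None
```

For example, `tcam_lookup("10.1.2.5")` matches only the /16 and /8 rows, and the /16 row wins because it sits earlier in the table.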
40Method 2: Binary Tries
Example Prefixes
a) 00001
b) 00010
c) 00011
d) 001
e) 0101
f) 011
g) 100
h) 1010
i) 1100
j) 11110000
[Figure: binary trie with branches labeled 0 and 1, holding the prefixes a to j]
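A binary trie over the example prefixes a to j can be sketched as below. Addresses and prefixes are written as bit strings to match the slide; the class and function names are hypothetical, and this is a plain one-bit-per-level trie with no compression.

```python
class TrieNode:
    __slots__ = ("child", "label")

    def __init__(self):
        self.child = [None, None]  # subtrees for bit 0 and bit 1
        self.label = None          # set when a prefix ends at this node


def build_trie(prefixes):
    root = TrieNode()
    for label, bits in prefixes.items():
        node = root
        for b in bits:
            i = int(b)
            if node.child[i] is None:
                node.child[i] = TrieNode()
            node = node.child[i]
        node.label = label
    return root


def longest_prefix_match(root, address_bits):
    """Walk one bit per step, remembering the last prefix seen on the way down."""
    node, best = root, root.label
    for b in address_bits:
        node = node.child[int(b)]
        if node is None:
            break
        if node.label is not None:
            best = node.label
    return best


PREFIXES = {
    "a": "00001", "b": "00010", "c": "00011", "d": "001", "e": "0101",
    "f": "011", "g": "100", "h": "1010", "i": "1100", "j": "11110000",
}
trie = build_trie(PREFIXES)
```

The walk costs one memory access per address bit in the worst case, which is what the multi-bit tries on the following slide aim to reduce.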
41Four-way tries
Reduced number of memory accesses
But greater wasted space...
42Method 3: Patricia Tree
Example Prefixes
a) 00001
b) 00010
c) 00011
d) 001
e) 0101
f) 011
g) 100
h) 1010
i) 1100
j) 11110000
[Figure: Patricia tree with branches labeled 0 and 1 and a node labeled Skip=5, holding the prefixes a to j]
Advantages
- Extensible to wider fields
Disadvantages
- Pointers take a lot of space (Total storage for 40K entries is 2MB)
43Method 4: Level Compressed Tries
[Figure: level-compressed trie holding the example prefixes a to j]
- Expected depth of a trie: log n (1log (logn))
- For Bernoulli-type distributions, expected depth O(log log n)
- Achieves approx. 0.5 Mpps on a Pentium with a 40k routing table, occupying less than 0.8 MB
Advantages
Disadvantages
- No practical performance gain
- Handling updates is complex
44Method 5: Compacting Forwarding Tables
- Optimize the data structure to store 40,000 routing table entries in about 150-160 kBytes.
- Rely on the compacted data structure residing in the primary or secondary cache of a fast processor.
- Achieves approx. 2 Mpps.
Advantages
- Good software solution for low speeds and small routing tables.
Disadvantages
- Scalability to larger tables
- Handling updates is complex
45Method 6: A Hash-based Scheme
Store a hash table for each prefix length.
Example Prefixes: 10.0.0.0/8, 10.1.0.0/16, 10.1.1.0/24, 10.1.2.0/24, 10.2.3.0/24
Example Addrs: 10.1.1.4, 10.4.4.3, 10.2.3.9, 10.2.4.8

Length | Hash table contents
8      | 10
12     | (none)
16     | 10.1, 10.2
24     | 10.1.1, 10.1.2, 10.2.3
46A Hash-Based Scheme (contd.)
- Binary search of the prefix lengths: O(log2 N) hashes
- Need to provide intermediate markers
- But then we need precomputation per marker
- Performance is about 2.2 Mpps in the worst case for a 33K table.
Advantages
- Good software solution for low speeds and small routing tables.
Disadvantages
- Scalability to larger tables
- Handling updates is complex
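The one-hash-table-per-prefix-length idea can be sketched with the example prefixes and addresses from the previous slide. To stay short, this sketch probes the lengths linearly from longest to shortest instead of binary-searching them, so it needs no markers; it is a simplification, not the scheme measured on the slide. The next-hop names and function names are hypothetical.

```python
import ipaddress

ROUTES = {  # next-hop labels are made up for illustration
    "10.0.0.0/8": "hop-8",
    "10.1.0.0/16": "hop-16",
    "10.1.1.0/24": "hop-24a",
    "10.1.2.0/24": "hop-24b",
    "10.2.3.0/24": "hop-24c",
}


def build_tables(routes):
    """One dict (hash table) per prefix length, keyed by the prefix bits."""
    tables = {}
    for cidr, hop in routes.items():
        net = ipaddress.ip_network(cidr)
        key = int(net.network_address) >> (32 - net.prefixlen)
        tables.setdefault(net.prefixlen, {})[key] = hop
    return tables


def lookup_by_length(tables, address):
    addr = int(ipaddress.IPv4Address(address))
    for length in sorted(tables, reverse=True):      # longest length first
        hop = tables[length].get(addr >> (32 - length))
        if hop is not None:
            return hop
    return None


tables = build_tables(ROUTES)
```

With W distinct lengths this costs up to W hash probes; the binary search on the slide cuts that to about log2 W probes at the price of the markers and precomputation it mentions.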
47Method 8: Routing Lookups in Hardware
[Figure: Number of prefixes versus Prefix length]
Most prefixes are 24 bits or shorter
48Routing Lookups in Hardware
[Figure: a table of 2^24 = 16M entries holds prefixes up to 24 bits; address 142.19.6.14 is split into 142.19.6, which indexes the table, and 14]
49Routing Lookups in Hardware
[Figure: lookup of 128.3.72.44, split into 128.3.72 and 44, in the table of prefixes up to 24 bits; labels: 1, Next Hop]
50Routing Lookups in Hardware (Contd.)
[Figure: generalized scheme. A first table with 2^n entries holds prefixes up to n bits; a second table holds longer prefixes and yields the Next Hop. Labels: i, j, 0, N, N M]
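A two-level table in the spirit of these slides can be sketched as follows: a first table indexed directly by the leading address bits, plus second-level blocks indexed by the remaining bits for longer prefixes. The details here (building by prefix expansion, storing a Python list in the first table as the "pointer") are assumptions made for illustration, not the hardware design itself. The slides use 24 leading bits of a 32-bit address; the usage below uses 8 of 16 bits so the tables stay small.

```python
class TwoLevelTable:
    def __init__(self, routes, addr_bits, first_bits):
        """routes: iterable of (prefix_value, prefix_length, next_hop), host bits zero."""
        self.rest_bits = addr_bits - first_bits
        self.first_bits = first_bits
        self.addr_bits = addr_bits
        self.first = [None] * (1 << first_bits)  # next hop, or a list (second-level block)
        # Insert short prefixes first so that longer ones overwrite them.
        for value, length, hop in sorted(routes, key=lambda r: r[1]):
            self._insert(value, length, hop)

    def _insert(self, value, length, hop):
        if length <= self.first_bits:
            base = value >> self.rest_bits
            for i in range(base, base + (1 << (self.first_bits - length))):
                self.first[i] = hop               # prefix expansion in the first table
            return
        index = value >> self.rest_bits
        if not isinstance(self.first[index], list):
            # Allocate a block, pre-filled with the shorter match found so far.
            self.first[index] = [self.first[index]] * (1 << self.rest_bits)
        block = self.first[index]
        low = value & ((1 << self.rest_bits) - 1)
        for j in range(low, low + (1 << (self.addr_bits - length))):
            block[j] = hop

    def lookup(self, address):
        entry = self.first[address >> self.rest_bits]        # first memory access
        if isinstance(entry, list):
            return entry[address & ((1 << self.rest_bits) - 1)]  # second access
        return entry


# Scaled-down example: 16-bit addresses, 8-bit first table.
table = TwoLevelTable([(0x0A00, 8, "R1"), (0x0A10, 12, "R2")], addr_bits=16, first_bits=8)
```

Every lookup takes at most two memory reads regardless of table size, which is the property that makes the approach attractive in hardware; the cost is the large, mostly redundant first table.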
51Routing Updates
[Figure: nested prefixes 10.0.0.0 (Depth 1), 10.4.0.0 (Depth 2), and 10.4.24.0 (Depth 3)]
- Disadvantages
- Large memory required
- Depends on prefix length distribution
- Advantages
- 20 Mpps with 50ns DRAM
- Easy to implement in hardware
52IP Router Lookups: References
- A. Brodnik, S. Carlsson, M. Degermark, S. Pink, "Small Forwarding Tables for Fast Routing Lookups," Sigcomm 1997, pp. 3-14.
- B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search," Infocom 1998, pp. 1248-56, vol. 3.
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high speed IP routing lookups," Sigcomm 1997, pp. 25-36.
- P. Gupta, S. Lin, N. McKeown, "Routing lookups in hardware at memory access speeds," Infocom 1998, pp. 1241-1248, vol. 3.
- S. Nilsson, G. Karlsson, "Fast address lookup for Internet routers," IFIP Intl Conf on Broadband Communications, Stuttgart, Germany, April 1-3, 1998.
- V. Srinivasan, G. Varghese, "Fast IP lookups using controlled prefix expansion," Sigmetrics, June 1998.
53Caching Addresses
Slow Path
Buffer Memory
CPU
Fast Path
54Caching Addresses
55Forwarding Decisions
- ATM and MPLS switches
- Direct Lookup
- Bridges and Ethernet switches
- Associative Lookup
- Hashing
- Trees and tries
- IP Routers
- CIDR
- Patricia trees/tries
- Other methods
- Caching
- Packet Classification
56Providing Value-Added Services: Some examples
- Differentiated services
  - Regard traffic from AS33 as platinum-grade
- Access Control Lists
  - Deny udp host 194.72.72.33 194.72.6.64 0.0.0.15 eq snmp
- Committed Access Rate
  - Rate limit WWW traffic from subinterface739 to 10Mbps
- Policy-based Routing
  - Route all voice traffic through the ATM network
- Peering Arrangements
  - Restrict the total amount of traffic of precedence 7 from MAC address N to 20 Mbps between 10 am and 5 pm
- Accounting and Billing
  - Generate hourly reports of traffic from MAC address M
57Flow Classification
58A Packet Classifier
Given a classifier, find the action associated
with the highest priority rule (here, the lowest
numbered rule) matching an incoming packet.
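The definition above corresponds directly to a linear search over the rules in priority order. The sketch below is that baseline, not one of the faster schemes the tutorial surveys; the two-field rules and their actions are hypothetical examples.

```python
import ipaddress

# (name, source prefix, destination prefix, action); lowest index = highest priority.
RULES = [
    ("R1", "128.16.46.23/32", "0.0.0.0/0", "deny"),
    ("R2", "144.24.0.0/16", "64.0.0.0/24", "permit"),
    ("R3", "0.0.0.0/0", "0.0.0.0/0", "default"),
]


def classify(src, dst, rules=RULES):
    """Return (rule name, action) of the first rule whose every field matches."""
    src_addr = ipaddress.ip_address(src)
    dst_addr = ipaddress.ip_address(dst)
    for name, src_prefix, dst_prefix, action in rules:
        if (src_addr in ipaddress.ip_network(src_prefix)
                and dst_addr in ipaddress.ip_network(dst_prefix)):
            return name, action
    return None
```

Linear search costs time proportional to the number of rules per packet, which is why the proposed schemes trade memory or preprocessing for faster lookups.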
59Geometric Interpretation in 2D
[Figure: rules R1 to R7 drawn as regions in the plane of Field 1 and Field 2, e.g. (144.24/16, 64/24) and (128.16.46.23, *)]
60Proposed Schemes
61Proposed Schemes (Contd.)
62Proposed Schemes (Contd.)
63Packet Classification: References
- T.V. Lakshman, D. Stiliadis, "High speed policy based packet forwarding using efficient multi-dimensional range matching," Sigcomm 1998, pp. 191-202.
- V. Srinivasan, S. Suri, G. Varghese, M. Waldvogel, "Fast and scalable layer 4 switching," Sigcomm 1998, pp. 203-214.
- V. Srinivasan, G. Varghese, S. Suri, "Fast packet classification using tuple space search," to be presented at Sigcomm 1999.
- P. Gupta, N. McKeown, "Packet classification using intelligent hierarchical cuttings," Hot Interconnects VII, 1999.
- P. Gupta, N. McKeown, "Packet classification on multiple fields," Sigcomm 1999.
64Tutorial Outline
- Introduction: What is a Packet Switch?
- Packet Lookup and Classification: Where does a packet go next?
- Switching Fabrics: How does the packet get there?
65Switching Fabrics
- Output and Input Queueing
- Output Queueing
- Input Queueing
- Scheduling algorithms
- Combining input and output queues
- Multicast traffic
- Other non-blocking fabrics
- Multistage Switches
66Basic Architectural Components: Datapath per-packet processing
1. Forwarding Decision (one per input, each with its own Forwarding Table)
2. Interconnect
3. Output Scheduling
67Interconnects: Two basic techniques
- Input Queueing: usually a non-blocking switch fabric (e.g. crossbar)
- Output Queueing: usually a fast bus
68Interconnects: Output Queueing
[Figure: Individual Output Queues and Centralized Shared Memory, each serving ports 1, 2, ..., N]
69Output Queueing: The ideal
70Output Queueing: How fast can we make centralized shared memory?
[Figure: ports 1, 2, ..., N connected to a Shared Memory built from 5ns SRAM over a 200 byte bus]
- 5ns per memory operation
- Two memory operations per packet
- Therefore, up to 160Gb/s
- In practice, closer to 80Gb/s
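The 160Gb/s figure follows from the bus width and the memory cycle time: each packet is written once and read once, so two 5ns operations move one bus-width of data through the switch. A short check (the function name is hypothetical):

```python
def shared_memory_throughput_gbps(bus_bytes, access_ns, ops_per_packet=2):
    """Peak throughput when every packet costs one write plus one read."""
    bits_per_access = bus_bytes * 8
    # bits per nanosecond is numerically the same as Gb/s
    return bits_per_access / (access_ns * ops_per_packet)


peak = shared_memory_throughput_gbps(bus_bytes=200, access_ns=5)
```

Here 200 bytes is 1600 bits, moved every 10 ns, giving 160 Gb/s. The slide's lower practical figure is not derived here.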
71Switching Fabrics
- Output and Input Queueing
- Output Queueing
- Input Queueing
- Scheduling algorithms
- Combining input and output queues
- Multicast traffic
- Other non-blocking fabrics
- Multistage Switches
72Interconnects: Input Queueing with Crossbar
Scheduler
Data In
Data Out
configuration
73Input Queueing: Head of Line Blocking
[Figure: Delay versus Load; the load axis runs to 100%, with 58.6% marked]
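The 58.6% figure is the well-known saturation throughput of FIFO input queues under uniform random traffic, 2 - sqrt(2), reached as the number of ports grows. A small simulation can illustrate it. The model below is a sketch under stated assumptions: every input always has a cell waiting, destinations are uniform and independent, and each output serves one randomly chosen contender per slot. The function name and parameters are hypothetical.

```python
import random


def hol_throughput(n_ports, slots, seed=0):
    """Fraction of output capacity used by saturated FIFO input queues."""
    rng = random.Random(seed)
    # Destination of the cell at the head of each input queue.
    hol = [rng.randrange(n_ports) for _ in range(n_ports)]
    served = 0
    for _ in range(slots):
        contenders = {}
        for inp, out in enumerate(hol):
            contenders.setdefault(out, []).append(inp)
        for out, inputs in contenders.items():
            winner = rng.choice(inputs)           # one cell per output per slot
            hol[winner] = rng.randrange(n_ports)  # next cell moves to the head
            served += 1
        # Losers keep their head-of-line cell and block the cells behind it.
    return served / (n_ports * slots)


measured = hol_throughput(n_ports=32, slots=3000)
limit = 2 - 2 ** 0.5
```

For a finite switch the measured value sits slightly above the limit and approaches it as the port count increases.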
74Head of Line Blocking
77Input Queueing: Virtual output queues
78Input Queues: Virtual Output Queues
[Figure: Delay versus Load; the load axis runs to 100%]
79Input Queueing
Scheduler
80Input Queueing: Scheduling
81Input Queueing: Scheduling
[Figure: Request Graph between inputs 1 to 4 and outputs 1 to 4, with a weight on each edge]
Question: Maximum weight or maximum size?
82Input Queueing: Scheduling
- Maximum Size
- Maximizes instantaneous throughput
- Does it maximize long-term throughput?
- Maximum Weight
- Can clear most backlogged queues
- But does it sacrifice long-term throughput?
83Input Queueing: Scheduling
84Input Queueing: Longest Queue First or Oldest Cell First
- Weight: Queue Length or Waiting Time
- 100%
85Input Queueing: Why is serving long/old queues better than serving maximum number of queues?
- When traffic is uniformly distributed, servicing the maximum number of queues leads to 100% throughput.
- When traffic is non-uniform, some queues become longer than others.
- A good algorithm keeps the queue lengths matched, and services a large number of queues.
86Input Queueing: Practical Algorithms
- Maximal Size Algorithms
- Wave Front Arbiter (WFA)
- Parallel Iterative Matching (PIM)
- iSLIP
- Maximal Weight Algorithms
- Fair Access Round Robin (FARR)
- Longest Port First (LPF)
87Wave Front Arbiter
[Figure: Requests and resulting Match between inputs 1 to 4 and outputs 1 to 4]
88Wave Front Arbiter
Requests
Match
89Wave Front Arbiter: Implementation
Combinational Logic Blocks
90Wave Front Arbiter: Wrapped WFA (WWFA)
N steps instead of 2N-1
Requests
Match
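A wave front arbiter can be modelled as a sweep over the diagonals of the request matrix: all cells on one diagonal lie in different rows and columns, so they can be decided together. The sketch below covers both the plain sweep (2N-1 steps) and the wrapped variant (N steps). It is a software illustration with a fixed starting diagonal; the function name and the fixed priority are assumptions.

```python
def wave_front_arbiter(requests, wrapped=False):
    """requests[i][j] is truthy if input i has a cell for output j.
    Returns the list of matched (input, output) pairs."""
    n = len(requests)
    row_free = [True] * n   # input not yet matched
    col_free = [True] * n   # output not yet matched
    match = []
    waves = n if wrapped else 2 * n - 1
    for wave in range(waves):
        # Cells of one wave share no row or column, so hardware evaluates them in parallel.
        for i in range(n):
            j = (wave - i) % n if wrapped else wave - i
            if 0 <= j < n and requests[i][j] and row_free[i] and col_free[j]:
                row_free[i] = col_free[j] = False
                match.append((i, j))
    return match


requests = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
]
plain = wave_front_arbiter(requests)
wrapped = wave_front_arbiter(requests, wrapped=True)
```

In this example the arbiter matches two pairs although a match of size three exists, (0,1), (1,0), (2,2): the result is maximal, not maximum.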
91Input Queueing: Practical Algorithms
- Maximal Size Algorithms
- Wave Front Arbiter (WFA)
- Parallel Iterative Matching (PIM)
- iSLIP
- Maximal Weight Algorithms
- Fair Access Round Robin (FARR)
- Longest Port First (LPF)
92Parallel Iterative Matching
Random Selection
Random Selection
Requests
93Parallel Iterative Matching: Maximal is not Maximum
Requests
94Parallel Iterative Matching: Analytical Results
Number of iterations to converge
95Parallel Iterative Matching
96Parallel Iterative Matching
97Parallel Iterative Matching
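Parallel Iterative Matching can be sketched as repeated request, grant, and accept rounds with random selection at both the outputs and the inputs. The code below is a sequential model of what the hardware does in parallel; the function name and data layout are assumptions.

```python
import random


def pim(requests, iterations, rng):
    """requests[i][j] is truthy if input i has a cell for output j.
    Returns in_match, where in_match[i] is the output matched to input i or None."""
    n = len(requests)
    in_match = [None] * n
    out_match = [None] * n
    for _ in range(iterations):
        # Grant: each unmatched output picks one requesting unmatched input at random.
        grants = {}
        for j in range(n):
            if out_match[j] is None:
                candidates = [i for i in range(n)
                              if in_match[i] is None and requests[i][j]]
                if candidates:
                    grants.setdefault(rng.choice(candidates), []).append(j)
        if not grants:
            break  # nothing left to add: the match is maximal
        # Accept: each input that received grants picks one at random.
        for i, outputs in grants.items():
            j = rng.choice(outputs)
            in_match[i] = j
            out_match[j] = i
    return in_match


rng = random.Random(1)
n = 8
requests = [[rng.random() < 0.4 for _ in range(n)] for _ in range(n)]
result = pim(requests, iterations=n, rng=rng)
```

Each productive iteration adds at least one pair, so N iterations always suffice for a maximal match; the point of the analysis on these slides is that far fewer are needed on average.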
98Input Queueing: Practical Algorithms
- Maximal Size Algorithms
- Wave Front Arbiter (WFA)
- Parallel Iterative Matching (PIM)
- iSLIP
- Maximal Weight Algorithms
- Fair Access Round Robin (FARR)
- Longest Port First (LPF)
99iSLIP
Round-Robin Selection
Round-Robin Selection
Requests
100iSLIP: Properties
- Random under low load
- TDM under high load
- Lowest priority to MRU
- 1 iteration: fair to outputs
- Converges in at most N iterations. On average < log2 N
- Implementation: N priority encoders
- Up to 100% throughput for uniform traffic
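The round-robin behaviour behind these properties can be sketched as follows. Each output grants to the first requesting input at or after its grant pointer, each input accepts the first granting output at or after its accept pointer, and pointers advance one past the matched port only when a grant is accepted in the first iteration. This is a simplified software model; the function name and data layout are assumptions.

```python
def islip(requests, grant_ptr, accept_ptr, iterations=1):
    """One cell time of iSLIP. grant_ptr and accept_ptr are updated in place.
    Returns in_match, where in_match[i] is the output matched to input i or None."""
    n = len(requests)
    in_match = [None] * n
    out_match = [None] * n
    for it in range(iterations):
        grants = {}
        for j in range(n):
            if out_match[j] is not None:
                continue
            for k in range(n):                      # round-robin from the pointer
                i = (grant_ptr[j] + k) % n
                if in_match[i] is None and requests[i][j]:
                    grants.setdefault(i, []).append(j)
                    break
        for i, outputs in grants.items():
            j = min(outputs, key=lambda o: (o - accept_ptr[i]) % n)
            in_match[i] = j
            out_match[j] = i
            if it == 0:                             # pointers move only on first-iteration accepts
                grant_ptr[j] = (i + 1) % n
                accept_ptr[i] = (j + 1) % n
    return in_match


n = 3
all_requests = [[1] * n for _ in range(n)]          # every input has cells for every output
g, a = [0] * n, [0] * n
history = [islip(all_requests, g, a) for _ in range(5)]
```

Starting from synchronized pointers, the first cell time matches a single pair, but the pointers then drift apart and later cell times match every port: the "TDM under high load" behaviour.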
101iSLIP
102iSLIP
103iSLIP: Implementation
[Figure: N Grant arbiters and N Accept arbiters, numbered 1 to N. Each is a Programmable Priority Encoder with N inputs, log2 N bits of State, and a log2 N-bit Decision]
104Input Queueing: References
- M. Karol et al., "Input vs Output Queueing on a Space-Division Packet Switch," IEEE Trans Comm., Dec 1987, pp. 1347-1356.
- Y. Tamir, "Symmetric Crossbar arbiters for VLSI communication switches," IEEE Trans Parallel and Dist Sys., Jan 1993, pp. 13-27.
- T. Anderson et al., "High-Speed Switch Scheduling for Local Area Networks," ACM Trans Comp Sys., Nov 1993, pp. 319-352.
- N. McKeown, "The iSLIP scheduling algorithm for Input-Queued Switches," IEEE Trans Networking, April 1999, pp. 188-201.
- C. Lund et al., "Fair prioritized scheduling in an input-buffered switch," Proc. of IFIP-IEEE Conf., April 1996, pp. 358-69.
- A. Mekkittikul et al., "A Practical Scheduling Algorithm to Achieve 100% Throughput in Input-Queued Switches," IEEE Infocom '98, April 1998.
105Switching Fabrics
- Output and Input Queueing
- Output Queueing
- Input Queueing
- Scheduling algorithms
- Combining input and output queues
- Multicast traffic
- Other non-blocking fabrics
- Multistage Switches
106Input Queueing: Speedup
- Input queued switches can not easily control delay
- But output queued switches can.
- How can we emulate the behavior of an output queued switch?
107Output Queueing: The ideal
108Using Speedup
109Using Speedup
Output Queued Switch
1
N
N
N
110Using Speedup
Theorem: For a switch with combined input and output queueing to exactly mimic an output queued switch, for all types of traffic, a speedup of 2-1/N is necessary and sufficient.
111Switching Fabrics
- Output and Input Queueing
- Output Queueing
- Input Queueing
- Scheduling algorithms
- Combining input and output queues
- Multicast traffic
- Other non-blocking fabrics
- Multistage Switches
112Multicast Traffic
113Multicast Traffic
- Virtual output (fanout) queues are not practical for multicast.
- Fanout splitting leads to a large increase in throughput.
- Scheduling is simpler than for unicast.
114Multicast Traffic: Fanout splitting
115Multicast Traffic: Scheduling
[Figure: Requests, Grant, and Match stages between inputs 1 to 4 and outputs 1 to 4]
116Switching Fabrics
- Output and Input Queueing
- Output Queueing
- Input Queueing
- Scheduling algorithms
- Combining input and output queues
- Multicast traffic
- Other non-blocking fabrics
- Multistage Switches
117Other Non-Blocking Fabrics: Clos Network
118Other Non-Blocking Fabrics: Clos Network
119Other Non-Blocking Fabrics: Self-Routing Networks
[Figure: 8x8 self-routing network with inputs and outputs labeled 000 to 111]
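Self-routing means each switching element picks its output from one bit of the destination address, with no central control. The sketch below assumes an omega (shuffle-exchange) wiring, one member of the banyan family; the network drawn on the slide may be wired differently, and the function name is hypothetical.

```python
def self_route(src, dst, n_bits=3):
    """Positions visited by a cell from input src to output dst, one per stage."""
    size = 1 << n_bits
    pos = src
    path = [pos]
    for stage in range(n_bits):
        # Stage k examines destination bit k, most significant bit first.
        bit = (dst >> (n_bits - 1 - stage)) & 1
        # Perfect shuffle (shift left), then the 2x2 element sets the low bit.
        pos = ((pos << 1) & (size - 1)) | bit
        path.append(pos)
    return path


example = self_route(0b010, 0b101)
```

After log2 N stages every source bit has been shifted out and replaced by a destination bit, so the cell arrives at the requested output whatever its starting input. Two cells can still need the same internal link, which is why such a network alone is blocking.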
120Other Non-Blocking Fabrics: Self-Routing Networks
The Non-blocking Batcher Banyan Network
[Figure: a Batcher Sorter followed by a Self-Routing Network with outputs 000 to 111; cells labeled with destinations 0 to 7 are shown at each stage of the sorter]
- Fabric can be used as scheduler.
- Batcher-Banyan network is blocking for multicast.
121Switching Fabrics
- Output and Input Queueing
- Output Queueing
- Input Queueing
- Scheduling algorithms
- Combining input and output queues
- Multicast traffic
- Other non-blocking fabrics
- Multistage Switches
122Multistage switches: Self-Routing
[Figure: 8x8 multistage self-routing network with ports labeled 000 to 111]
Stage-by-stage flow-control
123Multistage switches: Self-Routing
Buffered multistage switch
Multicast copy network
124Tutorial Outline
- Introduction: What is a Packet Switch?
- Packet Lookup and Classification: Where does a packet go next?
- Switching Fabrics: How does the packet get there?
125Basic Architectural Components
- Control: Routing, Reservation, Admission Control, Congestion Control
- Datapath (per-packet processing): Policing, Switching, Output Scheduling
126Basic Architectural Components: Datapath per-packet processing
1. Forwarding Decision (one per input, each with its own Forwarding Table)
2. Interconnect
3. Output Scheduling