Title: High Performance Routing
1. High Performance Routing
Nick McKeown
Assistant Professor of Electrical Engineering and Computer Science, Stanford University
Abrizio/PMC-Sierra Inc.
nickm@stanford.edu
http://www.stanford.edu/nickm
2. Outline
- Review: What is a Router?
- The Evolution of Routers
- Single-stage switching: The Fork-Join Router
3. Outline
- Switching is the bottleneck in a router.
- The trend has been to overcome limitations in memory bandwidth:
  - Shared memory → single-stage, crossbar-based, combined input and output queueing (CIOQ).
- ... and to reduce power per rack and per system:
  - Single-box systems → multi-rack systems (LCS).
4. Outline (2)
- What comes next?
- Multistage switches solve the wrong problem:
  - N² is not the problem.
  - Multistage switches are more blocking, more power-hungry, and less predictable.
- Parallel single-stage switches (e.g. the Fork-Join Router) are non-blocking, use less power, can achieve capacity that is just as high, and can be predictable.
5. Outline
- Review: What is a Router?
- The Evolution of Routers
- Single-stage switching: The Fork-Join Router
6. Basic Architectural Components
- Control plane: routing protocols and the routing table.
- Datapath (per-packet processing): forwarding table and switching.
7. Basic Architectural Components: Datapath (per-packet processing)
1. Ingress: per-linecard forwarding table and classifier table lookups, policing and access control, and the forwarding decision (a minimal code sketch of this pipeline follows below).
2. Interconnect.
3. Egress.
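The ingress steps above can be read as a small pipeline. Below is a minimal, illustrative sketch in Python; the table layouts, field names, and the token-style policer are assumptions of mine, not anything specified in the slides.

from dataclasses import dataclass

@dataclass
class Packet:
    dst_prefix: str   # simplified: exact-match "prefix" instead of a longest-prefix lookup
    flow_id: int
    length: int

def ingress(pkt, forwarding_table, classifier_table, tokens):
    """One pass through the ingress datapath: lookup, classify, police, decide."""
    egress_port = forwarding_table.get(pkt.dst_prefix)         # forwarding table lookup
    if egress_port is None:
        return None                                            # no route: drop
    service_class = classifier_table.get(pkt.flow_id, "BE")    # classifier table
    if tokens.get(pkt.flow_id, 0) < pkt.length:                # policing / access control
        return None                                            # out of profile: drop
    tokens[pkt.flow_id] -= pkt.length
    return (egress_port, service_class, pkt)                   # forwarding decision -> interconnect

# Example use (all values hypothetical)
fw, cls, tok = {"10.0.0.0/8": 3}, {42: "EF"}, {42: 1500}
print(ingress(Packet("10.0.0.0/8", 42, 64), fw, cls, tok))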
8. Outline
- Review: What is a Router?
- The Evolution of Routers
- Single-stage switching: The Fork-Join Router
9. First Generation Routers
(Figure: line interfaces attached to a shared backplane. Packets cross the backplane as fixed-length DMA blocks or cells and are reassembled on the egress linecard; the external interfaces carry fixed-length cells or variable-length packets.)
Typically <0.5 Gb/s aggregate capacity.
10. First Generation Routers: Queueing Structure (Shared Memory)
- A large body of work has shown that shared-memory queueing can provide:
  - Fairness
  - Delay guarantees
  - Delay variation control
  - Loss guarantees
  - Statistical guarantees
(Figure: inputs 1..N and outputs 1..N sharing one large, dynamically allocated memory buffer.)
- The buffer must sustain N writes and N reads per cell time, so capacity is limited by memory bandwidth (see the arithmetic below).
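To make the memory-bandwidth limit concrete, here is a back-of-the-envelope calculation of my own that borrows the 5 ns SRAM and 64-byte access width from the buffer-memory slide later in the deck. With one 64-byte cell written and one read for every port in each cell time, the aggregate throughput can never exceed half the raw memory bandwidth:

\[
N \cdot R \;\le\; \frac{64 \times 8\ \text{bits}}{2 \times 5\ \text{ns}} \;=\; 51.2\ \text{Gb/s,}
\]

independent of the number of ports N.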
11. Second Generation Routers
(Figure: a central CPU with route table and buffer memory on a shared bus; each line card has its own MAC, buffer memory, forwarding cache, drop policy or backpressure, and output link scheduling.)
Typically <5 Gb/s aggregate capacity.
12. Second Generation Routers: As caching became ineffective
(Figure: the central CPU becomes an exception processor holding the route table; each line card now carries a full forwarding table alongside its MAC and buffer memory.)
13. Second Generation Routers: Queueing Structure
- Combined input and output queueing, with the line cards connected by a shared bus.
14. Third Generation Routers
(Figure: line cards and a CPU card connected by a switched backplane. Each line card has its own line interface, MAC, local buffer memory, and forwarding table; the CPU card holds the routing table.)
Typically <50 Gb/s aggregate capacity.
15. Third Generation Routers: Queueing Structure
(Figure: the queueing structure with the line cards interconnected by a switch.)
16. Third Generation Routers
- Size-constrained: 19" or 23" wide racks.
- Power-constrained: <6 kW (the arithmetic is sketched below).
- QoS-unfriendly: input congestion.
(Figure: a 7-foot rack, 19" or 23" wide, supplied with at most 100 A/200 A at 48 V.)
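A quick sanity check on these numbers (my arithmetic, not the slide's): at 48 V the quoted supply currents correspond to

\[
48\ \text{V} \times 100\ \text{A} = 4.8\ \text{kW}, \qquad 48\ \text{V} \times 200\ \text{A} = 9.6\ \text{kW},
\]

which brackets the roughly 6 kW per-rack power budget quoted above.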
17. Fourth Generation Routers/Switches
(Figure: racks of linecards connected to a separate switch core over optical links.)
18. Fourth Generation Routers/Switches: The LCS Protocol
- What is LCS?
  - Credit-based flow control enables separation.
  - Label-based multicast enables scaling.
- Its benefits:
  - A large number of ports: separation enables a large number of ports spread across multiple racks.
  - Minimized switch core complexity and power: the switch core can be bufferless and lossless, with QoS, discard, etc. performed on the linecard.
19. Fourth Generation Routers/Switches: Queueing Structure
- Virtual output queues on each linecard, requiring only 1 write and 1 read per cell time (a small code sketch follows below).
- Each linecard performs the lookup, drop policy, and output scheduling.
- The switch core is a bufferless switch fabric with centralized switch arbitration.
Typically <5 Tb/s aggregate capacity.
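A minimal sketch of virtual output queueing on a linecard, assuming fixed-size cells; the class and method names are mine, not a real product's API. Each arriving cell is written once into the queue for its destination output, and the arbiter's grant triggers at most one read per cell time, which is why the linecard memory only needs 1 write and 1 read per cell time.

from collections import deque

class VOQLinecard:
    """Illustrative linecard with one virtual output queue (VOQ) per output port."""

    def __init__(self, num_outputs):
        self.voqs = [deque() for _ in range(num_outputs)]

    def enqueue(self, cell, output_port):
        # One memory write per cell time: the cell goes straight into the queue
        # for its destination output, so cells heading to a congested output never
        # block cells heading elsewhere (no head-of-line blocking).
        self.voqs[output_port].append(cell)

    def dequeue(self, granted_output):
        # One memory read per cell time: the switch arbiter grants this input
        # permission to send to one output, and we pop from that VOQ.
        q = self.voqs[granted_output]
        return q.popleft() if q else None

    def occupancy(self):
        return [len(q) for q in self.voqs]

# Example: a linecard in a 4-port switch
lc = VOQLinecard(num_outputs=4)
lc.enqueue("cell-A", output_port=2)
lc.enqueue("cell-B", output_port=0)
print(lc.occupancy())    # [1, 0, 1, 0]
print(lc.dequeue(2))     # 'cell-A'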
20. Myths about CIOQ-based crossbar switches
- 1. "Input-queued crossbars have low throughput."
  - An input-queued crossbar can have as high a throughput as any switch.
- 2. "Crossbars don't support multicast traffic well."
  - A crossbar inherently supports multicast efficiently.
- 3. "Crossbars don't scale well."
  - Today it is the number of chip I/Os, not the number of crosspoints, that limits the size of a switch fabric. Expect 5 Tb/s crossbar switches (an illustrative calculation follows below).
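An illustrative calculation of my own (the port count, line rate, and serdes speed are assumptions, not the slide's): a 64-port crossbar has only

\[
N^2 = 64^2 = 4096\ \text{crosspoints,}
\]

which is trivial on a modern chip, but carrying 64 ports of 10 Gb/s in and out over 2.5 Gb/s serial links requires about

\[
\frac{2 \times 64 \times 10\ \text{Gb/s}}{2.5\ \text{Gb/s}} = 512\ \text{chip I/Os,}
\]

and it is this I/O count, not the crosspoint count, that bounds a single-chip fabric.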
21. Myths about CIOQ-based crossbar switches (2)
- 4. "Crossbar switches can't support delay/QoS guarantees."
  - With an internal speedup of 2, a CIOQ switch can precisely emulate a shared-memory switch for all traffic.
22. What makes sense today?
23. What makes sense tomorrow?
- Single-stage (if possible):
  - Reduces complexity.
  - Minimizes interconnect bandwidth.
  - Minimizes power.
24. Outline
- Review: What is a Router?
- The Evolution of Routers
- Single-stage switching: The Fork-Join Router
25. Buffer Memory: How Fast Can I Make a Packet Buffer?
(Figure: a 5 ns SRAM buffer memory with 64-byte wide read and write buses.)
- Rough estimate:
  - 5 ns per memory operation.
  - Two memory operations per packet.
  - Therefore, a maximum of 51.2 Gb/s, as worked out below.
  - In practice, closer to 40 Gb/s.
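The 51.2 Gb/s figure follows directly from the numbers above: 64 bytes per memory operation, one write plus one read per packet, and 5 ns per operation give

\[
\frac{64 \times 8\ \text{bits}}{2 \times 5\ \text{ns}} \;=\; \frac{512\ \text{bits}}{10\ \text{ns}} \;=\; 51.2\ \text{Gb/s}.
\]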
26. Buffer Memory: Is It Going to Get Better?
(Figure: Specmarks, memory size, and gate density plotted against time.)
27. Fork-Join Router (Sponsored by NSF and ITRI)
- How can we:
  - Increase capacity.
  - Reduce power per subsystem.
- While at the same time:
  - Keeping the system simple.
  - Supporting line rates faster than the memory bandwidth.
  - Supporting guaranteed services.
- Approach: increase parallelism. Multiple racks. Single-stage buffering. Packet-by-packet load balancing. Hmmm...?
28. The Fork-Join Router
(Figure: N external ports at rate R; each port's traffic is spread by a bufferless demultiplexer over k internal routers and recombined at the outputs.)
29. The Fork-Join Router
- Advantages:
  - A single stage of buffering.
  - A k-fold reduction in power per subsystem.
  - A k-fold reduction in memory bandwidth per subsystem.
  - A k-fold reduction in forwarding-table lookup rate per subsystem (an illustrative calculation follows below).
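An illustrative combination of the deck's buffer-memory estimate with numbers of my own (the R = 160 Gb/s line rate and k = 4 are hypothetical): each subsystem only has to run at

\[
\frac{R}{k} = \frac{160\ \text{Gb/s}}{4} = 40\ \text{Gb/s},
\]

which is within the roughly 40 Gb/s that slide 25 says a single packet buffer can sustain in practice.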
30. The Fork-Join Router
- Questions:
  - Switching: what is the performance?
  - Forwarding lookups: how do they work?
31. A Parallel Packet Switch
- Each arriving packet is tagged with its egress port.
(Figure: N inputs and N outputs at rate R, connected through k parallel output-queued switches.)
32. Performance Questions
- Can it be work-conserving?
- Can it emulate a single big output-queued switch?
- Can it support delay guarantees, strict priorities, WFQ, ...?
33. Work Conservation
(Figure: each external port is connected to the k output-queued switches by internal links running at rate R/k. This imposes an input link constraint on how often an input may send a cell to a given layer, and an output link constraint on how often an output may fetch a cell from a given layer; a toy sketch of the bookkeeping follows below.)
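A toy sketch of what the two constraints mean, under assumed parameters; this only illustrates the bookkeeping and is not the dispatch algorithm behind the theorems that follow. With internal links at rate S·R/k, a given input (or output) can start a new cell on a given layer only once every ⌈k/S⌉ external cell times, so a dispatcher must pick a layer whose links are currently free at both the input and the cell's output.

import math

class ToyPPSDispatcher:
    """Toy bookkeeping for the two link constraints in a parallel packet switch
    with k layers and speedup S. Tracks, per (input, layer) and per (output, layer),
    the cell time at which that internal link becomes free again.
    Illustrative only: not the dispatch algorithm used in the theorems."""

    def __init__(self, num_ports, k, speedup):
        self.k = k
        self.busy_in = [[0] * k for _ in range(num_ports)]    # input link constraint
        self.busy_out = [[0] * k for _ in range(num_ports)]   # output link constraint
        # An internal link at rate S*R/k needs ceil(k/S) external cell times per cell.
        self.link_cell_time = math.ceil(k / speedup)

    def dispatch(self, now, inp, outp):
        """Pick a layer whose links to this input and this output are both free."""
        for layer in range(self.k):
            if self.busy_in[inp][layer] <= now and self.busy_out[outp][layer] <= now:
                self.busy_in[inp][layer] = now + self.link_cell_time
                self.busy_out[outp][layer] = now + self.link_cell_time
                return layer
        return None   # no layer satisfies both constraints in this cell time

# Example: k = 4 layers, speedup S = 2 -> each internal link is busy for 2 cell times per cell.
d = ToyPPSDispatcher(num_ports=4, k=4, speedup=2)
print(d.dispatch(now=0, inp=0, outp=3))   # layer 0
print(d.dispatch(now=1, inp=0, outp=3))   # layer 1 (layer 0's links still busy)
print(d.dispatch(now=2, inp=0, outp=3))   # layer 0 again (its links have freed up)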
34. Work Conservation
(Figure: a worked example in which numbered cells are spread across the layers subject to the output link constraint.)
35. Work Conservation
(Figure: the same parallel packet switch with the internal links sped up to S(R/k).)
36. Precise Emulation of an Output Queued Switch
(Figure: the reference system, a single N-port output-queued switch, which the parallel packet switch is asked to emulate.)
37. Parallel Packet Switch: Theorems
- 1. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can be work-conserving for all traffic.
- 2. If S > 2k/(k+2) ≈ 2, then a parallel packet switch can precisely emulate a FCFS output-queued switch for all traffic.
(The bound is evaluated below.)
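For concreteness, evaluating the bound myself for a sample number of layers:

\[
k = 8:\quad \frac{2k}{k+2} = \frac{16}{10} = 1.6, \qquad \frac{2k}{k+2} \to 2 \ \text{as } k \to \infty.
\]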
38. Parallel Packet Switch: Theorems
- 3. If S > 3k/(k+3) ≈ 3, then a parallel packet switch can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
39. Parallel Packet Switch: Theorems
- 4. If S > 2, then a parallel packet switch with a small co-ordination buffer running at rate R can precisely emulate a switch with WFQ, strict priorities, and other types of QoS, for all traffic.
40. The Fork-Join Router
- Questions:
  - Switching: what is the performance?
  - Forwarding lookups: how do they work?
41. The Fork-Join Router: Lookahead Forwarding Table Lookups
- Each packet is tagged with its egress port at the next router.
- The lookup is performed in parallel, at rate R/k (a toy sketch follows below).
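A toy sketch of how I read these two bullets (the data structures, field names, and round-robin spreading are assumptions of mine, not a published algorithm): lookups are spread over k engines so each engine only needs to run at rate R/k, and each engine writes its result into the packet as a tag naming the egress port to use at the next router.

import itertools

class LookaheadLookup:
    """Toy model: k lookup engines, each handling every k-th packet, so each
    engine only needs to run at rate R/k. The result is carried in the packet
    as a tag naming its egress port at the next router.
    (Illustrative reading of the slide, not a published algorithm.)"""

    def __init__(self, forwarding_tables):
        self.engines = forwarding_tables                      # k copies of the forwarding table
        self.next_engine = itertools.cycle(range(len(forwarding_tables)))

    def tag(self, packet):
        engine = next(self.next_engine)                       # spread lookups over the k engines
        packet["next_hop_egress_port"] = self.engines[engine].get(packet["dst"])
        return packet

# Example with k = 2 identical table copies (all values hypothetical)
tables = [{"10.1.0.0/16": 7}, {"10.1.0.0/16": 7}]
ll = LookaheadLookup(tables)
print(ll.tag({"dst": "10.1.0.0/16"}))   # {'dst': '10.1.0.0/16', 'next_hop_egress_port': 7}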
42. The Fork-Join Router
(Figure: the same Fork-Join architecture, with N external ports at rate R spread over k internal routers.)
Expect >50 Tb/s aggregate capacity.
43. Conclusions
- The main problems are power (supply and dissipation) and memory bandwidth.
- Multi-stage switches solve the wrong problem.
- Single-stage switches are here to stay.
- Very high capacity single-stage electronic routers are feasible.