Title: High Speed Router Design
1. High Speed Router Design
- Shivkumar Kalyanaraman
- Rensselaer Polytechnic Institute
- shivkuma@ecse.rpi.edu
- http://www.ecse.rpi.edu/Homepages/shivkuma
- Based in part on slides of Nick McKeown (Stanford), S. Keshav (Ensim), Douglas Comer (Purdue), Raj Yavatkar (Intel), Cyriel Minkenberg (IBM Zurich)
2. Overview
- Introduction
- Evolution of High-Speed Routers
- High Speed Router Components
- Lookup Algorithm
- Classification
- Switching
3. What do switches/routers look like?
- Access routers, e.g. ISDN, ADSL
- Core router, e.g. OC48c POS
- Core ATM switch
4. Dimensions, Power Consumption
[Figure: Cisco GSR 12416 (capacity 160 Gb/s, power 4.2 kW) vs. Juniper M160 (capacity 80 Gb/s, power 2.6 kW); both in 19-inch racks, roughly 3-6 ft tall and 2-2.5 ft deep]
5. Where High-Performance Packet Switches Are Used
- Carrier-class core router
- ATM switch
- Frame Relay switch
- The Internet core
6. Where are routers? Answer: at Points of Presence (POPs)
7. Why the Need for Big/Fast/Large Routers?
[Figure: a POP built from many smaller routers vs. a POP built from a few large routers]
- Interfaces: price > $200k, power > 400 W
- Space, power, and interface-cost economics!
- About 50-60% of interfaces are used for interconnection within the POP.
- The industry trend is towards a large, single router per POP.
8. Job of the Router Architect
- For a given set of features, optimize the performance metrics on the next slide.
9. Performance Metrics
- Capacity: maximize C, s.t. volume < 2 m^3 and power < 5 kW
- Throughput: maximize usage of expensive long-haul links; trivial with work-conserving output-queued routers
- Controllable delay: some users would like predictable delay; this is feasible with output queueing plus weighted fair queueing (WFQ), sketched below
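To make the WFQ bullet concrete, here is a minimal finish-time sketch in C. It is a sketch under assumptions, not any router's implementation: flows have fixed weights w[i], the system virtual time V is assumed to be maintained elsewhere (its update as packets depart is elided), and all names are illustrative.

```c
#include <stddef.h>

/* Minimal WFQ stamping sketch: a packet of length len on flow i gets
 * finish time F[i] = max(V, F[i]) + len / w[i]; the scheduler always
 * transmits the queued packet with the smallest stamp. */
#define NFLOWS 4

static double V;                              /* system virtual time (updated elsewhere) */
static double F[NFLOWS];                      /* last finish stamp per flow */
static const double w[NFLOWS] = {1, 2, 4, 8}; /* flow weights */

double wfq_stamp(size_t flow, size_t len)
{
    double start = (V > F[flow]) ? V : F[flow];
    F[flow] = start + (double)len / w[flow];
    return F[flow];                           /* key for the priority queue */
}
```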
10. The Problem
- Output-queued switches are impractical
[Figure: N inputs at line rate R share one DRAM buffer, so the memory must sustain N x R of writes plus N x R of reads]
11. Memory Bandwidth: Commercial DRAM
- Memory speed is not keeping up with Moore's Law.
- DRAM: 1.1x / 18 months
- Moore's Law: 2x / 18 months
- Router capacity: 2.2x / 18 months
- Line capacity: 2x / 7 months
12. Packet Processing Is Getting Harder
[Figure: CPU instructions available per minimum-length packet, since 1996]
13. Basic Ideas
14. Forwarding Functions: ATM Switch
- Lookup the cell's VCI/VPI in the VC table.
- Replace the old VCI/VPI with the new one.
- Forward the cell to the outgoing interface.
- Transmit the cell onto the link.
15. Functions: Ethernet (L2) Switch
- Lookup the frame's destination address (DA) in the forwarding table.
- If known, forward to the correct port.
- If unknown, broadcast to all ports.
- Learn the source address (SA) of the incoming frame.
- Forward the frame to the outgoing interface.
- Transmit the frame onto the link.
16. Functions: IP Router
- Lookup the packet's DA in the forwarding table.
- If known, forward to the correct port.
- If unknown, drop the packet.
- Decrement TTL, update the header checksum (see the sketch below).
- Forward the packet to the outgoing interface.
- Transmit the packet onto the link.
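The "decrement TTL, update header checksum" step is normally done incrementally rather than by re-summing the whole header. A minimal sketch, following the RFC 1141-style update and assuming a standard 20-byte IPv4 header in network byte order:

```c
#include <stdint.h>

/* Decrement TTL and patch the IPv4 header checksum incrementally.
 * TTL sits at byte offset 8; the checksum at bytes 10-11. Because the
 * TTL is the high byte of its 16-bit word, decrementing it by one
 * raises the one's-complement checksum by 0x0100. */
void decrement_ttl(uint8_t *hdr)
{
    uint32_t sum = ((hdr[10] << 8) | hdr[11]) + 0x0100;
    sum += sum >> 16;                 /* fold the end-around carry */
    hdr[8]--;                         /* TTL */
    hdr[10] = (uint8_t)(sum >> 8);
    hdr[11] = (uint8_t)sum;
}
```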
17. Basic Architectural Components
[Figure: control-plane functions (routing, reservation, admission control, congestion control) above the per-packet datapath (policing, switching, output scheduling)]
18. Basic Architectural Components
[Figure: the per-packet datapath: (1) forwarding decision via a forwarding table on each line card, (2) interconnect, (3) output scheduling]
19. Generic Router Architecture
[Figure: per-packet header processing: lookup IP address, update header, queue packet]
20. Generic Router Architecture
[Figure: each port pairs a buffer manager with its buffer memory]
21. Simplest Design: Software Router using PCs!
- Idea: add special-purpose software to general-purpose hardware. Cheap, but slow.
- Measure of speed: aggregate data rate or aggregate packet rate
- Limits the number and type of interfaces, topologies, etc.
- E.g., a 400 Mbps aggregate rate will allow four 100 Mbps Ethernet interfaces, but no GbE!
- E.g., MIT's Click router
22. Aggregate Packet vs. Bit Rates
[Figure: aggregate packet rate vs. bit rate for 64-byte and 1518-byte packets]
23. Per-Packet Processing Time Budget
- MIT's Click router claims 435 Kpps with 64-byte packets! See http://www.pdos.lcs.mit.edu/click/
- (So it can handle 100 Mbps, but not GbE interfaces!)
24. Solution: Decentralization/Parallelism
- Fine-grained parallelism: instruction-level
- Symmetric coarse-grain parallelism: multi-processors
- Asymmetric coarse-grain parallelism: multi-processors
- Co-processors (ASICs): operate under the control of the CPU; move expensive ops to hardware
- NICs with on-board processing: attack the I/O bottleneck by moving processing to the NIC (ASIC or embedded RISC), which handles only one interface rather than the aggregate rate!
- Smart NICs with onboard stacks
- Cell switching: design protocols to suit hardware speeds!
- Data pipelines
25. Optimizations (contd.)
26. Demultiplexing vs. Classification
- De-multiplexing in a layered model provides freedom to use arbitrary protocols without transmission overhead, but imposes sequential processing limitations.
- Packet classification collapses demuxing from a sequence of operations at multiple layers into an operation at one layer!
- Overall goal: flow segregation
27. Classification Example
28. Hardware Optimization of Classification
29. Hybrid Hardware/Software Classifier
30. Conceptual Bindings
- Connectionless network
31. Second-Generation Network Systems
32. Switch Fabric Concept
- A data path (aka backplane) that provides parallelism
- Connects the NICs, which have on-board processing
33. Desired Switch Fabric Properties
34. Space-Division Fabric
- The asynchronous design arose from the multi-processor context
- Data can be sent across the fabric at arbitrary times
35. Blocking and Port Contention
- Even if internally non-blocking (i.e., fully interconnected), port contention can occur! Why?
- Need blocking circuits at input and output ports
36. Crossbar Switched Interconnections
- Use switches between each input and output instead of separate paths: an active switch lets data flow from input I to output O
- Total number of paths required: N + M
- Number of switching points: N x M
37. Crossbar Switched Interconnections
- A centralized switch controller handles port contention
- Allows transfers in parallel (up to min(N, M) paths)
- Note: port hardware can operate much slower!
- Issues: number of switches, switch controller
- Port contention still exists
38. Queuing: Input and Output Buffers
39. Time-Division Switching Fabrics
- Aka a bus! (i.e., a single shared link)
- Low cost and low speed (used in computers!)
- Need an arbitration mechanism
- E.g., fixed time slots or data blocks, fixed cells, variable packets
40. Time-Division Switching: Telephony
- Key idea: when de-multiplexing, position in frame determines the output trunk
- Time-division switching interchanges sample positions within a frame: time slot interchange (TSI); see the sketch below
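A minimal TSI sketch, assuming an N-slot frame of one-byte samples and a slot map set up at call-establishment time; all names are illustrative:

```c
#include <stdint.h>

#define SLOTS 24                 /* e.g., a T1-like frame */

static uint8_t frame_buf[SLOTS];
static int out_slot_of[SLOTS];   /* map: input slot -> output slot,
                                  * filled in at call setup */

void tsi_write_frame(const uint8_t *in)
{
    for (int s = 0; s < SLOTS; s++)
        frame_buf[s] = in[s];                /* sequential write */
}

void tsi_read_frame(uint8_t *out)
{
    for (int s = 0; s < SLOTS; s++)
        out[out_slot_of[s]] = frame_buf[s];  /* permuted read: the
                                              * slot interchange */
}
```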
41. Time-Division: Shared-Memory Fabrics
- Memory interface hardware is expensive => many ports share fewer memory interfaces
- E.g., dual-ported memory
- Separate low-speed bus lines for the controller
43. Multi-Stage Fabrics
- Compromise between pure time-division and pure space-division
- Attempt to combine the advantages of each: lower cost from time-division, higher performance from space-division
- Technique: limited sharing
- E.g., the Banyan switch
- Features: scalable; self-routing, i.e. no central controller; packet queues allowed, but not required
- Note: multi-stage switches share the crosspoints, which have now become expensive resources
44. Banyan Switch Fabric (contd.)
- Basic building block: a 2x2 switch whose outputs are labelled 0/1
- Can be synchronous or asynchronous
- Asynchronous => packets can arrive at arbitrary times
- A synchronous banyan offers TWICE the effective throughput!
- Worst case: all inputs receive packets with the same label
45. Banyan Fabric
- More on switching later
46. Forwarding, a.k.a. Port Mapping
47. Basic Architectural Components: Forwarding Decision
[Figure: the per-packet datapath: (1) forwarding decision via a forwarding table on each line card, (2) interconnect, (3) output scheduling]
48. ATM and MPLS Switches: Direct Lookup
[Figure: the incoming VCI is used directly as the memory address; the entry read out holds the new (port, VCI); a minimal sketch follows]
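A minimal sketch of the direct lookup above, assuming a 16-bit VCI space so the label can index the table directly; field names are illustrative:

```c
#include <stdint.h>

struct vc_entry {
    uint16_t out_port;
    uint16_t out_vci;   /* label to rewrite into the cell header */
};

static struct vc_entry vc_table[1 << 16];

struct vc_entry lookup_vci(uint16_t vci)
{
    return vc_table[vci];  /* direct index: one memory access, no search */
}
```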
49. Bridges and Ethernet Switches: Associative Lookups
[Figure: the 48-bit search key (the MAC address) is presented to an associative memory (CAM), which returns the associated data]
50. Bridges and Ethernet Switches: Hashing
[Figure: the 48-bit search key is hashed down to a 16-bit memory address; the memory entry holds the associated data]
51. Lookups Using Hashing: An Example
[Figure: the 48-bit key is hashed with CRC-16 to a 16-bit bucket index; colliding entries hang off each bucket in linked lists]
52. Lookups Using Hashing: Performance of the Simple Example
53. Lookups Using Hashing
- Advantages
- Simple
- Expected lookup time can be small
- Disadvantages
- Non-deterministic lookup time
- Inefficient use of memory
- A minimal chained-hash sketch follows.
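A sketch under assumptions: the slides hash with CRC-16, but a cheap XOR fold stands in here, and all names are illustrative. The chain walk is exactly what makes lookup time non-deterministic:

```c
#include <stdint.h>
#include <stdlib.h>

struct mac_entry {
    uint64_t mac;             /* 48-bit address in the low bits */
    int out_port;
    struct mac_entry *next;   /* collision chain */
};

static struct mac_entry *buckets[1 << 16];

static unsigned hash16(uint64_t mac)
{
    return (unsigned)((mac ^ (mac >> 16) ^ (mac >> 32)) & 0xffff);
}

int mac_lookup(uint64_t mac)  /* returns port, or -1 => flood */
{
    for (struct mac_entry *e = buckets[hash16(mac)]; e; e = e->next)
        if (e->mac == mac)
            return e->out_port;
    return -1;
}

void mac_learn(uint64_t mac, int port)
{
    unsigned h = hash16(mac);
    struct mac_entry *e = malloc(sizeof *e);
    e->mac = mac;
    e->out_port = port;
    e->next = buckets[h];     /* push onto the bucket's chain */
    buckets[h] = e;
}
```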
54. Per-Packet Processing in an IP Router
- 1. Accept the packet arriving on an incoming link.
- 2. Lookup the packet's destination address in the forwarding table to identify the outgoing port(s).
- 3. Manipulate the packet header: e.g., decrement TTL, update the header checksum.
- 4. Send (switch) the packet to the outgoing port(s).
- 5. Classify and buffer the packet in the queue.
- 6. Transmit the packet onto the outgoing link.
55. Caching Addresses
[Figure: cached destinations take the fast path on the line card; misses take the slow path through the CPU and buffer memory]
56. Caching Addresses
57. IP Router Lookup
- IPv4 unicast destination-address-based lookup
58. Lookup and Forwarding Engine
[Figure: the destination address from the packet header is looked up in the routing data structure to yield an outgoing port]
Forwarding table (dest-network -> port):
- 65.0.0.0/8 -> 3
- 128.9.0.0/16 -> 1
- 149.12.0.0/19 -> 7
59. Example Forwarding Table
- 65.0.0.0/8 -> port 3
- 128.9.0.0/16 -> port 1
- 142.12.0.0/19 -> port 7
- Prefix length: an IP prefix is 0-32 bits long.
[Figure: the three prefixes as ranges on the number line from 0 to 2^32 - 1; e.g., 65.0.0.0/8 spans 65.0.0.0 to 65.255.255.255]
60. Prefixes Can Overlap
[Figure: 128.9.16.0/21, 128.9.172.0/21 and 128.9.176.0/24 nested inside 128.9.0.0/16, alongside 65.0.0.0/8 and 142.12.0.0/19, on the number line from 0 to 2^32 - 1]
- Routing lookup: find the longest matching prefix (aka the most specific route) among all prefixes that match the destination address.
61. Difficulty of Longest Prefix Match
- It is a 2-dimensional search: prefix length and prefix value
[Figure: the example prefixes plotted by prefix value against prefix length (8 up to 32)]
62. IP Routers: Metrics for Lookups
- Lookup time
- Storage space
- Update time
- Preprocessing time
- Example lookup key: 128.9.16.14
63. Lookup Rates Required
Line     Year      Line rate (Gb/s)   40B packets (Mpps)
OC12c    1998-99   0.622              1.94
OC48c    1999-00   2.5                7.81
OC192c   2000-01   10.0               31.25
OC768c   2002-03   40.0               125
64. Update Rates Required
- Recent BGP studies show that updates can be:
- Bursty: several hundred routes updated/withdrawn at once => insert/delete operations
- Frequent: an average of 100 updates per second
- Need the data structure to be efficient in terms of lookup as well as update (insert/delete) operations.
65. Size of the Forwarding Table
[Figure: number of prefixes vs. year (1995-2000), growing at roughly 10,000 prefixes/year, with renewed exponential growth at the end]
- Renewed growth is due to multi-homing of enterprise networks!
- Source: http://www.telstra.net/ops/bgptable.html
66. Potential Hyper-Exponential Growth!
[Figure: global routing table size vs. Moore's law since 1999 (01/99 to 04/01, 50,000 to 160,000 prefixes), with curves for global prefixes, Moore's law, and doubling growth]
67. Trees and Tries
[Figure: a binary search tree (branch on </>) vs. a binary search trie (branch on the next bit: 0 = left, 1 = right), storing e.g. the keys 010 and 111; a trie-lookup sketch follows]
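A minimal longest-prefix-match sketch over a binary trie like the one above: one bit per level, and the deepest prefix seen on the path is the answer. Names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

struct trie {
    struct trie *child[2];
    int next_hop;            /* -1 if no prefix ends here */
};

static struct trie *new_node(void)
{
    struct trie *t = calloc(1, sizeof *t);
    t->next_hop = -1;
    return t;
}

/* Insert prefix/len with the given next hop (prefix in host order,
 * most-significant bit first). */
void trie_insert(struct trie *root, uint32_t prefix, int len, int hop)
{
    struct trie *t = root;
    for (int i = 0; i < len; i++) {
        int b = (prefix >> (31 - i)) & 1;
        if (!t->child[b])
            t->child[b] = new_node();
        t = t->child[b];
    }
    t->next_hop = hop;
}

/* Walk the trie, remembering the last prefix seen: that is the LPM. */
int trie_lookup(struct trie *root, uint32_t addr)
{
    int best = -1;
    struct trie *t = root;
    for (int i = 0; i < 32 && t; i++) {
        if (t->next_hop >= 0)
            best = t->next_hop;
        t = t->child[(addr >> (31 - i)) & 1];
    }
    if (t && t->next_hop >= 0)
        best = t->next_hop;
    return best;
}
```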
68. Trees and Tries: Multiway Tries
[Figure: a 16-ary search trie consumes 4 bits per level; each node holds 16 (key, pointer) pairs, e.g. looking up 000011110000 vs. 111111111111]
69. Lookup in Multiway Tries: Tradeoffs
- Table produced from 2^15 randomly generated 48-bit addresses
70. Routing Lookups in Hardware
[Figure: histogram of the number of prefixes vs. prefix length]
- Most prefixes are 24 bits or shorter
71. Routing Lookups in Hardware
- Prefixes up to 24 bits fit in a table of 2^24 = 16M entries
[Figure: the top 24 bits of 142.19.6.14 (142.19.6) index the table directly; the low byte (14) is not needed for short prefixes]
72. Routing Lookups in Hardware
[Figure: for prefixes longer than 24 bits, the first-table entry indexed by the top 24 bits of 128.3.72.44 (128.3.72) carries a marker bit; the remaining 8 bits (44) then index a second table holding the next hop; a sketch follows]
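A software sketch in the spirit of the two-table scheme above (as in the DIR-24-8 approach): most lookups take one memory access into the 16M-entry table, and a marker bit diverts the few longer-than-/24 prefixes to a second access. Table sizes and field packing are illustrative assumptions:

```c
#include <stdint.h>

#define LONG_FLAG 0x8000u

static uint16_t tbl24[1 << 24];      /* indexed by top 24 bits: next hop,
                                      * or LONG_FLAG | chunk index       */
static uint16_t tbllong[4096 * 256]; /* up to 4096 chunks of 256 entries */

uint16_t route_lookup(uint32_t addr)
{
    uint16_t e = tbl24[addr >> 8];            /* first memory access  */
    if (!(e & LONG_FLAG))
        return e;                             /* next hop directly    */
    uint32_t chunk = e & 0x0fffu;             /* which /24 chunk      */
    return tbllong[chunk * 256 + (addr & 0xff)]; /* second access for
                                                  * prefixes > /24    */
}
```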
73. Switching, a.k.a. Interconnect
74. Basic Architectural Components: Interconnect
[Figure: the per-packet datapath again: (1) forwarding decision via a forwarding table per line card, (2) interconnect, (3) output scheduling]
75. First-Generation IP Routers
[Figure: a CPU and buffer memory on a shared backplane]
- Most Ethernet switches and cheap packet routers
- The bottleneck can be the CPU, the host adaptor, or the I/O bus
- What is costly? Bus? Memory? Interface? CPU?
76. Second-Generation IP Routers
- Port-mapping intelligence in the line cards
- Higher hit rate in the local lookup cache
- What is costly? Bus? Memory? Interface? CPU?
77. Third-Generation Switches/Routers
[Figure: line cards with local buffer memory and MACs, plus a CPU card, connected by a switched backplane]
- A third-generation switch provides parallel paths (a fabric)
- What's costly? Bus? Memory? CPU?
78. Fourth-Generation Switches/Routers: Clustering and Multistage
[Figure: 32 line cards clustered through a multistage interconnect]
79. Switching Goals (Telephony and Data)
80. Circuit Switch
- A switch that can handle N calls has N logical inputs and N logical outputs
- N up to 200,000
- Moves 8-bit samples from an input to an output port
- Recall that samples have no headers
- The destination of a sample depends on the time at which it arrives at the switch
- In practice, input trunks are multiplexed
- Multiplexed trunks carry frames: sets of samples
- Goal: extract samples from the frame and, depending on position in the frame, switch to the output
- Each incoming sample has to get to the right output line and the right slot in the output frame
81. Call Blocking
- Can't find a path from input to output
- Internal blocking: a slot in the output frame exists, but there is no path to it
- Output blocking: no slot in the output frame is available
- Output blocking is reduced in transit switches
- Need to put a sample in one of several slots going to the desired next hop
82. Multiplexors and Demultiplexors
- Most trunks time-division multiplex voice samples
- At a central office, the trunk is demultiplexed and distributed to active circuits
- Addressing is not required
- Synchronous multiplexor: N input lines
- The output runs N times as fast as each input
[Figure: N lines (1..N) into a MUX, one shared fast line, then a DEMUX back out to N lines]
83. Switching: What Does a Switch Do?
- Transfers data from an input to an output
- Many ports (density), high speeds
- E.g., a crossbar
84. Circuit Switch
85. Issue: Call Blocking
86. Time-Division Switching
- Key idea: when de-multiplexing, position in frame determines the output trunk
- Time-division switching interchanges sample positions within a frame: time slot interchange (TSI)
87. Scaling Issues with TSI
88. Space-Division Switching
- Each sample takes a different path through the switch, depending on its destination
89. Crossbar
- The simplest possible space-division switch
- Crosspoints can be turned on or off, long enough to transfer a packet from an input to an output
- Internally nonblocking, but needs N^2 crosspoints
- The time to set the crosspoints grows quadratically
90. Multistage Crossbar
- In a crossbar, during each switching time only one crosspoint per row or column is active
- Can save crosspoints if a crosspoint can attach to more than one input line (why?)
- This is done in a multistage crossbar
- Need to rearrange connections every switching time
91. Multistage Crossbar
- Can suffer internal blocking, unless there is a sufficient number of second-level stages
- Number of crosspoints < N^2
- Finding a path from input to output requires a depth-first search
- Scales better than a crossbar, but still not too well
- A 120,000-call switch needs about 250 million crosspoints
92. Time-Space Switching
93. Time-Space-Time (TST) Switching
- Telephone switches like the 5ESS use multiple space stages, e.g. TSSST
94. Packet Switches
- In a circuit switch, the path of a sample is determined at the time of connection establishment
- No need for a sample header: position in frame is used
- In a packet switch, packets carry a destination field or label
- Need to look up the destination port on the fly
- Datagram switches: lookup based on the entire destination address (longest-prefix match)
- Cell or label switches: lookup based on VCIs or labels
95. Blocking in Packet Switches
- Can have both internal and output blocking
- Internal: no path to the output
- Output: trunk unavailable
- Unlike a circuit switch, cannot predict whether packets will block (why?)
- If a packet is blocked => must either buffer or drop it
96. Dealing with Blocking in Packet Switches
- Over-provisioning: internal links much faster than inputs
- Buffers: at input or output
- Backpressure: if the switch fabric doesn't have buffers, prevent a packet from entering until a path is available
- Parallel switch fabrics: increase the effective switching capacity
97. Switch Fabrics: Buffered Crossbar
- What happens if packets at two inputs both want to go to the same output?
- Can defer one in an input buffer
- Or buffer the crosspoints: complex arbiter
98. Switch Fabric Element
- Goal: towards building self-routing fabrics
- Can build complicated fabrics from a simple element
- Routing rule: if the routing bit is 0, send the packet to the upper output, else to the lower output (see the sketch below)
- If both packets want the same output, buffer or drop one
99. Banyan
- The simplest self-routing recursive fabric
- What if two packets both want to go to the same output? Output blocking
100. Features of Multi-Stage Switches
- Issue: output blocking, i.e. two packets want to go to the same output port
101. Blocking in the Banyan Fabric
102. Blocking in Banyan Switches: Sorting
- Can avoid blocking by choosing the order in which packets appear at the input ports
- If we can: present packets at the inputs sorted by output; remove duplicates; remove gaps; and precede the banyan with a perfect-shuffle stage, then there is no internal blocking
- For example: X, 010, 010, X, 011, X, X, X
- Sort => 010, 010, 011, X, X, X, X, X
- Remove dups => 010, 011, X, X, X, X, X, X
- Shuffle => 010, X, 011, X, X, X, X, X
- Need sort, shuffle, and trap networks
103. Sorting Using Merging
- Build sorters from merge networks
- Assume we can merge two sorted lists
- Sort pairwise, merge, recurse (see the sketch below)
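The "sort pairwise, merge, recurse" structure is Batcher's odd-even merge sort; the standard recursive construction is sketched below for n a power of two. It runs in software here, but each compare_exchange corresponds to a 2x2 comparator in the hardware network:

```c
#include <stdio.h>

static void compare_exchange(int *a, int i, int j)
{
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge the sorted halves of a[lo..lo+n-1]; r is the comparator stride. */
static void oddeven_merge(int *a, int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        oddeven_merge(a, lo, n, m);      /* even subsequence */
        oddeven_merge(a, lo + r, n, m);  /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            compare_exchange(a, i, i + r);
    } else {
        compare_exchange(a, lo, lo + r);
    }
}

static void oddeven_mergesort(int *a, int lo, int n)
{
    if (n > 1) {
        int m = n / 2;
        oddeven_mergesort(a, lo, m);     /* sort pairwise ...  */
        oddeven_mergesort(a, lo + m, m);
        oddeven_merge(a, lo, n, 1);      /* ... then merge     */
    }
}

int main(void)
{
    int a[8] = {3, 7, 2, 5, 6, 0, 1, 4};  /* illustrative inputs */
    oddeven_mergesort(a, 0, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```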
104. Putting Together the Batcher-Banyan
105. Non-Blocking Batcher-Banyan
[Figure: a Batcher sorter feeding a self-routing banyan; an example set of cells with 3-bit destinations (000-111) is sorted by the Batcher stage and then delivered without internal blocking]
- The fabric can be used as a scheduler.
- The Batcher-Banyan network is blocking for multicast.
106. Queuing, Buffer Management, Classification
107. Basic Architectural Components: Queuing, Classification
[Figure: the per-packet datapath once more: (1) forwarding decision, (2) interconnect, (3) output scheduling]
108. Queuing: Two Basic Techniques
- Input queueing: usually with a non-blocking switch fabric (e.g. a crossbar)
- Output queueing: usually with a fast bus
109. Queuing: Output Queueing
[Figure: two variants: individual output queues per port (1..N), or a centralized shared memory]
110. Input Queuing
111. Input Queueing: Head-of-Line Blocking
[Figure: average delay vs. offered load (up to 100%); with FIFO input queues, HOL blocking saturates throughput at about 58% under uniform traffic]
112. Solution: Input Queueing with Virtual Output Queues (VOQs)
113. Head-of-Line (HOL) Blocking in Input Queuing
114. Input Queues: Virtual Output Queues
[Figure: average delay vs. offered load; with VOQs the switch can approach 100% load; a VOQ bookkeeping sketch follows]
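A minimal VOQ bookkeeping sketch: each input keeps one FIFO per output, so a cell blocked on output j never holds up cells bound elsewhere. Depths and names are illustrative, and the matching scheduler (e.g. an iSLIP-style arbiter) is out of scope here:

```c
#include <stdint.h>

#define NPORTS 8
#define QDEPTH 64

struct fifo {
    uint32_t cells[QDEPTH];
    int head, tail, len;
};

/* voq[i][j]: cells at input i destined for output j */
static struct fifo voq[NPORTS][NPORTS];

int voq_enqueue(int in, int out, uint32_t cell)
{
    struct fifo *q = &voq[in][out];
    if (q->len == QDEPTH)
        return -1;                     /* full: drop or backpressure */
    q->cells[q->tail] = cell;
    q->tail = (q->tail + 1) % QDEPTH;
    q->len++;
    return 0;
}

/* Each cell time, a matching scheduler picks at most one non-empty
 * VOQ per input and at most one per output, then dequeues those. */
```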
115. Output Queuing
116. Packet Classification
[Figure: the incoming packet's header is matched against a rule table to select an action]
117. Multi-Field Packet Classification
- Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet (a linear-scan sketch follows).
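A deliberately naive linear-scan sketch of the problem statement above, for two prefix fields with rules stored in priority order (first match wins). Real classifiers use smarter data structures; all names are illustrative:

```c
#include <stdint.h>

struct rule {
    uint32_t src, src_mask;   /* field 1: source prefix      */
    uint32_t dst, dst_mask;   /* field 2: destination prefix */
    int action;
};

int classify(const struct rule *rules, int n, uint32_t s, uint32_t d)
{
    for (int i = 0; i < n; i++)       /* highest priority first */
        if ((s & rules[i].src_mask) == rules[i].src &&
            (d & rules[i].dst_mask) == rules[i].dst)
            return rules[i].action;
    return -1;                        /* default action */
}
```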
118. Prefix Matching: A 1-D Range Problem
[Figure: the prefix 128.9/16 as a range on the number line from 0 to 2^32 - 1, containing the address 128.9.16.14]
119. Classification: A 2-D Geometry Problem
[Figure: rules R1-R7 drawn as rectangles in the (field 1, field 2) plane, e.g. (144.24/16, 64/24); a packet, e.g. (128.16.46.23, *), is a point, and the highest-priority rectangle containing it wins]
120. Network Processors: A Building Block for Programmable Networks
- Slides from Raj Yavatkar, raj.yavatkar@intel.com
121. Intel IXP Network Processors
- Microengines: RISC processors optimized for packet processing, with hardware support for multi-threading; they run the fast path
- Embedded StrongARM/XScale: runs an embedded OS and handles exception tasks; the slow path / control plane
122. Various Forms of Processors
- Embedded processor (run-to-completion)
- Parallel architecture
- Pipelined architecture
123. Software Architectures
124. Division of Functions
125. Packet Flow Through the Hierarchy
126. Scaling Network Processors
127. Memory Scaling
128. Memory Scaling (contd.)
129. Memory Types
130. Memory Caching and CAM
- Cache
- Content Addressable Memory (CAM)
131. CAM and Ternary CAM
- CAM operation
- Ternary CAM (T-CAM)
132. Ternary CAMs
Value      Mask             Next hop
10.0.0.0   255.0.0.0        R1
10.1.0.0   255.255.0.0      R2
10.1.1.0   255.255.255.0    R3
10.1.3.0   255.255.255.0    R4
10.1.3.1   255.255.255.255  R4
- All entries are compared associatively in parallel; a priority encoder selects the highest-priority matching row.
- T-CAMs can also be used for classification (a software sketch follows).
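A software sketch of the T-CAM table above: hardware compares (key & mask) == value in every row in parallel and a priority encoder picks the winner; here the row order encodes priority, with the longest prefixes first:

```c
#include <stdint.h>

struct tcam_entry { uint32_t value, mask; const char *next_hop; };

/* Longest prefixes first, so the first hit is the longest match. */
static const struct tcam_entry tcam[] = {
    {0x0A010301, 0xFFFFFFFF, "R4"},  /* 10.1.3.1/32 */
    {0x0A010300, 0xFFFFFF00, "R4"},  /* 10.1.3.0/24 */
    {0x0A010100, 0xFFFFFF00, "R3"},  /* 10.1.1.0/24 */
    {0x0A010000, 0xFFFF0000, "R2"},  /* 10.1.0.0/16 */
    {0x0A000000, 0xFF000000, "R1"},  /* 10.0.0.0/8  */
};

const char *tcam_lookup(uint32_t addr)
{
    for (unsigned i = 0; i < sizeof tcam / sizeof tcam[0]; i++)
        if ((addr & tcam[i].mask) == tcam[i].value)
            return tcam[i].next_hop;  /* priority-encoder role */
    return "no match";
}
```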
133. IXP: A Building Block for Network Systems
- Example: IXP2800
- 16 micro-engines + XScale core
- Up to 1.4 GHz ME speed
- 8 HW threads/ME
- 4K control store per ME
- Multi-level memory hierarchy
- Multiple inter-processor communication channels
- NPU vs. GPU tradeoffs
- Reduced core complexity
- No hardware caching
- Simpler instructions => shallow pipelines
- Multiple cores with HW multi-threading per chip
134. IXP 2400 Block Diagram
135. IXP2800 Features
- Half-duplex OC-192 / 10 Gb/s Ethernet network processor
- XScale core: 700 MHz (half the ME clock); 32 KB instruction cache / 32 KB data cache
- Media / switch fabric interface: 2 x 16-bit LVDS transmit and receive; configured as CSIX-L2 or SPI-4
- PCI interface: 64-bit / 66 MHz interface for control; 3 DMA channels
- QDR interface (with parity): four 36-bit SRAM channels (QDR or co-processor); Network Processor Forum LookAside-1 standard interface; using a clamshell topology, both memory and co-processor can be instantiated on the same channel
- RDR interface: three independent Direct Rambus DRAM interfaces; supports 4 banks or 16 interleaved banks; supports 16/32-byte bursts
136. Hardware Features to Ease Packet Processing
- Ring buffers
- For inter-block communication/synchronization
- Producer-consumer paradigm
- Next-neighbor registers and signaling
- Allow single-cycle transfer of context to the next logical micro-engine to dramatically improve performance
- Simple, easy transfer of state
- Distributed data caching within each micro-engine
- Allows all threads to keep processing even when multiple threads are accessing the same data
137. XScale Core Processor
- Compliant with the ARM V5TE architecture
- Support for ARM's Thumb instructions
- Support for Digital Signal Processing (DSP) enhancements to the instruction set
- Intel's improvements to the internal pipeline improve the memory-latency-hiding abilities of the core
- Does not implement the floating-point instructions of the ARM V5 instruction set
138. Microengines: RISC Processors
- IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
- 40-bit x 4K control store
- Six-stage pipeline; an instruction takes one cycle to execute on average
- Each ME has eight hardware-assisted threads of execution
- Can be configured to use either all eight threads or only four threads
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order
139. MicroEngine v2
[Figure: ME block diagram: 4K-instruction control store; 640-word local memory; two banks of 128 GPRs; 128 next-neighbor, 128 S-transfer and 128 D-transfer registers in and out; a 32-bit execution data path with add/shift/logical, multiply, find-first-bit, CRC and pseudo-random units; a 16-entry CAM with LRU status logic; local CSRs, timers and timestamp; S/D push and pull buses]
140. Why Multi-Threading?
141. Packet Processing Using Multi-Threading Within a MicroEngine
142. Registers Available to Each ME
- Four different types of registers: general purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
- 256 32-bit GPRs; can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers; used to read/write all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers; divided equally into read-only and write-only; used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs: the ME can continue processing with GPRs while other functional units read and write the transfer registers
143. Different Types of Memory
Type of memory   Logical width (bytes)  Size (bytes)  Approx. unloaded latency (cycles)  Special notes
Local to ME      4                      2560          3     Indexed addressing, post incr/decr
On-chip scratch  4                      16K           60    Atomic ops; 16 rings with atomic get/put
SRAM             4                      256M          150   Atomic ops; 64-element q-array
DRAM             8                      2G            300   Direct path to/from MSF
144. IXA Software Framework
[Figure: control-plane protocol stacks on external processors talk through the Control Plane PDK to core components on the XScale core (written in C/C++ using the core component and resource manager libraries); the microengine pipeline runs microblocks written in Microengine C on top of the microblock, utility, protocol, and hardware abstraction libraries]
145. Microengine C Compiler
- C language constructs: basic types, pointers, bit fields
- In-line assembly code support
- Aggregates: structs, unions, arrays
146. What is a Microblock?
- Data-plane packet processing on the microengines is divided into logical functions called microblocks
- Coarse-grained and stateful
- Examples: 5-tuple classification, IPv4 forwarding, NAT
- Several microblocks running on a microengine thread can be combined into a microblock group
- A microblock group has a dispatch loop that defines the dataflow for packets between microblocks
- A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component
147. Core Components and Microblocks
[Figure: core libraries and user-written core components on the XScale core, paired with microblocks (user-written or Intel/3rd-party) from the microblock library on the microengines]
148. Applications of Network Processors
- Fully programmable architecture: implement any packet-processing application
- Examples from customers: routing/switching, VPN, DSLAM, multi-service switch, storage, content processing, intrusion detection (IDS), and RMON
- Use as a research platform: experiment with new algorithms and protocols
- Use as a teaching tool: understand architectural issues; gain hands-on experience with networking systems
149. Technical and Business Challenges
- Technical challenges
- Shift from the ASIC-based paradigm to software-based apps
- Challenges in programming an NPU
- Trade-off between power, board cost, and the number of NPUs
- How to add co-processors for additional functions?
- Business challenges
- Reliance on an outside supplier for the key component
- Preserving intellectual-property advantages
- Add value and differentiation through software algorithms in data-plane, control-plane, and services-plane functionality
- Must decrease time-to-market (TTM) to be competitive
150. Challenges in Modern Terabit-Class Switch Design
151. Goals
- Design of a terabit-class system
- Several Tb/s aggregate throughput
- 2.5 Tb/s: 256x256 OC-192 or 64x64 OC-768
- OEM
- Achieve wide coverage of the application spectrum
- Single-stage
- Electronic fabric
152. System Architecture
153. Power
- Requirement: do not exceed the per-shelf (2 kW), per-board (150 W), and per-chip (20 W) budgets
- Forced-air cooling; avoid hot-spots
- More throughput at the same power: Gb/s/W density is increasing
- I/O uses an increasing fraction of power (> 50%)
- Electrical I/O technology has not kept pace with capacity demand
- Low-power, high-density I/O technology is a must
- CMOS density increases faster than W/gate decreases
- Functionality/chip is constrained by power rather than density
- Power determines the number of chips and boards
- The architecture must be able to be distributed accordingly
154. Packaging
- Requirement: NEBS compliance
- Constrained by standard form factors and the power budget at chip, card, and rack level
- Switch core: link, connector, and chip packaging technology; connector density (pins/inch)
- CMOS density doubles, but the number of pins grows only 5-10% per generation
- This determines the maximum per-chip and per-card throughput
- Line cards: increasing port counts; prevalent line-rate granularity OC-192 (10 Gb/s); 1 adapter/card
- > 1 Tb/s systems require multi-rack solutions
- Long cables instead of a backplane (30 to 100 m)
- Interconnect accounts for a large part of system cost
155. Packaging
- 2.5 Tb/s with 1.6x speedup over 2.5 Gb/s links with 8b/10b coding => about 4000 links (differential pairs)
156. Switch-Internal Round-Trip (RT)
- Physical system size: a direct consequence of packaging
- CMOS technology: clock speeds are increasing much more slowly than density; more parallelism is required to increase throughput
- Shrinking packet cycle: line rates have gone up drastically (OC-3 through OC-768), while the minimum packet size has remained constant
- Large round-trip (RT) in terms of minimum packet duration: can be (many) tens of packets per port
- Used to be only a node-to-node issue; now it also arises inside the node
- System-wide clocking and synchronization
[Figure: evolution of RT]
157. Switch-Internal Round-Trip (RT)
[Figure: line cards 1..N connect through switch-fabric interface chips to the switch core]
- Consequences: performance impact? All buffers must be scaled by RT; fabric-internal flow control becomes an important issue
158. Speed-Up
- Requirement: industry-standard 2x speed-up
- Three flavors: utilization (compensate for SAR overhead), performance (compensate for scheduling inefficiencies), OQ speed-up (memory access time)
- Switch-core speed-up S is very costly
- Bandwidth is a scarce resource: COST and POWER
- Core buffers must run S times faster
- The core scheduler must run S times faster
- Is it really needed?
- SAR overhead reduction: variable-length packet switching is hard to implement, but may be more cost-effective
- Performance: does the gain in performance justify the increase in cost and power? Depends on the application
- Low Internet utilization
159. Multicast
- Requirement: full multicast support
- Many multicast groups, full link utilization, no blocking, QoS
- Complicates everything: buffering, queuing, scheduling, flow control, QoS
- Is sophisticated multicast support really needed?
- Expensive; often disabled in the field
- Complexity, billing, potential for abuse, etc.
- Again, depends on the application
160. Packet Size
- Requirement: support very short packets (32-64 B)
- 40 B @ OC-768 = 8 ns
- The short packet duration determines the speed of the control section (queues and schedulers), implies a longer RT, and forces wider data paths
- Do we have to switch short packets individually?
- Aggregation techniques: burst, envelope, or container switching; packing
- Single-stage, multi-path switches
- Parallel packet switch
161. 100 Tb/s Optical Router: Stanford University Research Project
- Collaboration: 4 professors at Stanford (Mark Horowitz, Nick McKeown, David Miller and Olav Solgaard), and our groups
- Objective: to determine the best way to incorporate optics into routers
- Push technology hard to expose new issues: photonics, electronics, system design
- Motivating example: the design of a 100 Tb/s Internet router
- Challenging but not impossible (100x current commercial systems)
- It identifies some interesting research problems
162. 100 Tb/s Optical Router
[Figure: 625 electronic linecards (line termination, IP packet processing, packet buffering), each at 160-320 Gb/s, connect through an optical switch; 40 Gb/s external and 160 Gb/s internal links; request/grant arbitration. 100 Tb/s = 625 x 160 Gb/s]
163. Research Problems
- Linecard: memory bottleneck in address lookup and packet buffering
- Architecture: arbitration computation complexity
- Switch fabric: optics (fabric scalability and speed), electronics (switch control and link electronics), packaging (the three-surface problem)
164. 160 Gb/s Linecard: Packet Buffering
[Figure: a queue manager with on-chip SRAM fronting multiple off-chip DRAMs, with 160 Gb/s streams in and out]
- Problem: the packet buffer needs the density of DRAM (40 Gbits) and the speed of SRAM (2 ns per packet)
- Solution: a hybrid scheme uses on-chip SRAM and off-chip DRAM
- Identified optimal algorithms that minimize the size of the SRAM (12 Mbits)
- Precisely emulates the behavior of a 40 Gbit, 2 ns SRAM
165. The Arbitration Problem
- A packet switch fabric is reconfigured for every packet transfer
- At 160 Gb/s, a new IP packet can arrive every 2 ns
- The configuration is picked to maximize throughput and not waste capacity
- Known algorithms are too slow
166. 100 Tb/s Router
[Figure: racks of 160 Gb/s linecards connected by optical links to the optical switch fabric]
167. Racks with 160 Gb/s Linecards
168. Passive Optical Switching
[Figure: ingress linecards 1..n reach egress linecards 1..n through midstage linecards and an integrated AWGR or diffraction-grating-based wavelength router]
169. Predictions: Core Internet Routers
- The need for more capacity for a given power and volume budget will mean:
- Fewer functions in routers: little or no optimization for multicast; continued over-provisioning will lead to little or no support for QoS, DiffServ, ...
- Fewer unnecessary requirements: mis-sequencing will be tolerated; latency requirements will be relaxed
- Less programmability in routers, and hence no network processors (NPs used at the edge)
- Greater use of optics to reduce power in the switch
170. Likely Events
- The need for capacity and reliability will mean:
- Widespread replacement of core routers with transport switching based on circuits
- Circuit switches have proved simpler, more reliable, lower power, higher capacity, and lower cost per Gb/s. Eventually, this is going to matter.
- The Internet will evolve to become edge routers interconnected by a rich mesh of WDM circuit switches
171. Summary
- High-speed routers: lookup, switching, classification, buffer management
- Lookup: range matching, tries, multi-way tries
- Switching: circuit switching, crossbar, Batcher-Banyan, ...
- Queuing: input/output queuing issues
- Classification: a multi-dimensional geometry problem
- Road ahead: to 100 Tb/s routers