Title: High Speed Router Design
1. High Speed Router Design
- Shivkumar Kalyanaraman
- Rensselaer Polytechnic Institute
- shivkuma@ecse.rpi.edu
- http://www.ecse.rpi.edu/Homepages/shivkuma
- Based in part on slides of Nick McKeown (Stanford), S. Keshav (Ensim), Douglas Comer (Purdue), Raj Yavatkar (Intel), Cyriel Minkenberg (IBM Zurich)
2. Overview
- Introduction
- Evolution of High-Speed Routers
- High Speed Router Components
- Lookup Algorithm
- Classification
- Switching
3. What do switches/routers look like?
- Access routers, e.g. ISDN, ADSL
- Core router, e.g. OC48c POS
- Core ATM switch
4. Dimensions, Power Consumption
[Figure: Cisco GSR 12416 (capacity 160 Gb/s, power 4.2 kW) vs. Juniper M160 (capacity 80 Gb/s, power 2.6 kW); both in 19-inch racks, roughly 3-6 ft tall and 2-2.5 ft deep]
5. Where High-Performance Packet Switches Are Used
- Carrier-class core router
- ATM switch
- Frame Relay switch
- The Internet core
6. Where are routers? Answer: at Points of Presence (POPs)
7. Why the Need for Big/Fast/Large Routers?
[Figure: a POP built from many smaller routers vs. a POP built from a few large routers]
- Interfaces: price > $200k, power > 400 W
- Space, power, and interface-cost economics!
- About 50-60% of interfaces are used for interconnection within the POP.
- The industry trend is towards a large, single router per POP.
8. Job of the Router Architect
- For a given set of features, optimize the performance metrics on the next slide.
9. Performance Metrics
- Capacity: maximize C, s.t. volume < 2 m^3 and power < 5 kW
- Throughput: maximize usage of expensive long-haul links; trivial with work-conserving output-queued routers
- Controllable delay: some users would like predictable delay; this is feasible with output queueing plus weighted fair queueing (WFQ), sketched below
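To make the WFQ bullet concrete, here is a minimal finish-time sketch in C. It is a sketch under assumptions, not any router's implementation: flows have fixed weights w[i], the system virtual time V is assumed to be maintained elsewhere (its update as packets depart is elided), and all names are illustrative.

```c
#include <stddef.h>

/* Minimal WFQ stamping sketch: a packet of length len on flow i gets
 * finish time F[i] = max(V, F[i]) + len / w[i]; the scheduler always
 * transmits the queued packet with the smallest stamp. */
#define NFLOWS 4

static double V;                              /* system virtual time (updated elsewhere) */
static double F[NFLOWS];                      /* last finish stamp per flow */
static const double w[NFLOWS] = {1, 2, 4, 8}; /* flow weights */

double wfq_stamp(size_t flow, size_t len)
{
    double start = (V > F[flow]) ? V : F[flow];
    F[flow] = start + (double)len / w[flow];
    return F[flow];                           /* key for the priority queue */
}
```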
10. The Problem
- Output-queued switches are impractical
[Figure: N inputs at line rate R share one DRAM buffer, so the memory must sustain N x R of writes plus N x R of reads]
11. Memory Bandwidth: Commercial DRAM
- Memory speed is not keeping up with Moore's Law.
- DRAM: 1.1x / 18 months
- Moore's Law: 2x / 18 months
- Router capacity: 2.2x / 18 months
- Line capacity: 2x / 7 months
12. Packet Processing Is Getting Harder
[Figure: CPU instructions available per minimum-length packet, since 1996]
13. Basic Ideas
14. Forwarding Functions: ATM Switch
- Lookup the cell's VCI/VPI in the VC table.
- Replace the old VCI/VPI with the new one.
- Forward the cell to the outgoing interface.
- Transmit the cell onto the link.
15. Functions: Ethernet (L2) Switch
- Lookup the frame's destination address (DA) in the forwarding table.
- If known, forward to the correct port.
- If unknown, broadcast to all ports.
- Learn the source address (SA) of the incoming frame.
- Forward the frame to the outgoing interface.
- Transmit the frame onto the link.
16. Functions: IP Router
- Lookup the packet's DA in the forwarding table.
- If known, forward to the correct port.
- If unknown, drop the packet.
- Decrement TTL, update the header checksum (see the sketch below).
- Forward the packet to the outgoing interface.
- Transmit the packet onto the link.
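The "decrement TTL, update header checksum" step is normally done incrementally rather than by re-summing the whole header. A minimal sketch, following the RFC 1141-style update and assuming a standard 20-byte IPv4 header in network byte order:

```c
#include <stdint.h>

/* Decrement TTL and patch the IPv4 header checksum incrementally.
 * TTL sits at byte offset 8; the checksum at bytes 10-11. Because the
 * TTL is the high byte of its 16-bit word, decrementing it by one
 * raises the one's-complement checksum by 0x0100. */
void decrement_ttl(uint8_t *hdr)
{
    uint32_t sum = ((hdr[10] << 8) | hdr[11]) + 0x0100;
    sum += sum >> 16;                 /* fold the end-around carry */
    hdr[8]--;                         /* TTL */
    hdr[10] = (uint8_t)(sum >> 8);
    hdr[11] = (uint8_t)sum;
}
```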
17. Basic Architectural Components
[Figure: control-plane functions (routing, reservation, admission control, congestion control) above the per-packet datapath (policing, switching, output scheduling)]
18. Basic Architectural Components
[Figure: the per-packet datapath: (1) forwarding decision via a forwarding table on each line card, (2) interconnect, (3) output scheduling]
19. Generic Router Architecture
[Figure: per-packet header processing: lookup IP address, update header, queue packet]
20. Generic Router Architecture
[Figure: each port pairs a buffer manager with its buffer memory]
21. Simplest Design: Software Router using PCs!
- Idea: add special-purpose software to general-purpose hardware. Cheap, but slow.
- Measure of speed: aggregate data rate or aggregate packet rate
- Limits the number and type of interfaces, topologies, etc.
- E.g., a 400 Mbps aggregate rate will allow four 100 Mbps Ethernet interfaces, but no GbE!
- E.g., MIT's Click router
22. Aggregate Packet vs. Bit Rates
[Figure: aggregate packet rate vs. bit rate for 64-byte and 1518-byte packets]
23. Per-Packet Processing Time Budget
- MIT's Click router claims 435 Kpps with 64-byte packets! See http://www.pdos.lcs.mit.edu/click/
- (So it can handle 100 Mbps, but not GbE interfaces!)
24. Solution: Decentralization/Parallelism
- Fine-grained parallelism: instruction-level
- Symmetric coarse-grain parallelism: multi-processors
- Asymmetric coarse-grain parallelism: multi-processors
- Co-processors (ASICs): operate under the control of the CPU; move expensive ops to hardware
- NICs with on-board processing: attack the I/O bottleneck by moving processing to the NIC (ASIC or embedded RISC), which handles only one interface rather than the aggregate rate!
- Smart NICs with onboard stacks
- Cell switching: design protocols to suit hardware speeds!
- Data pipelines
25. Optimizations (contd.)
26. Demultiplexing vs. Classification
- De-multiplexing in a layered model provides freedom to use arbitrary protocols without transmission overhead, but imposes sequential processing limitations.
- Packet classification collapses demuxing from a sequence of operations at multiple layers into an operation at one layer!
- Overall goal: flow segregation
27. Classification Example
28. Hardware Optimization of Classification
29. Hybrid Hardware/Software Classifier
30. Conceptual Bindings
- Connectionless network
31. Second-Generation Network Systems
32. Switch Fabric Concept
- A data path (aka backplane) that provides parallelism
- Connects the NICs, which have on-board processing
33. Desired Switch Fabric Properties
34. Space-Division Fabric
- The asynchronous design arose from the multi-processor context
- Data can be sent across the fabric at arbitrary times
35. Blocking and Port Contention
- Even if internally non-blocking (i.e., fully interconnected), port contention can occur! Why?
- Need blocking circuits at input and output ports
36. Crossbar Switched Interconnections
- Use switches between each input and output instead of separate paths: an active switch lets data flow from input I to output O
- Total number of paths required: N + M
- Number of switching points: N x M
37. Crossbar Switched Interconnections
- A centralized switch controller handles port contention
- Allows transfers in parallel (up to min(N, M) paths)
- Note: port hardware can operate much slower!
- Issues: number of switches, switch controller
- Port contention still exists
38. Queuing: Input and Output Buffers
39. Time-Division Switching Fabrics
- Aka a bus! (i.e., a single shared link)
- Low cost and low speed (used in computers!)
- Need an arbitration mechanism
- E.g., fixed time slots or data blocks, fixed cells, variable packets
40. Time-Division Switching: Telephony
- Key idea: when de-multiplexing, position in frame determines the output trunk
- Time-division switching interchanges sample positions within a frame: time slot interchange (TSI); see the sketch below
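A minimal TSI sketch, assuming an N-slot frame of one-byte samples and a slot map set up at call-establishment time; all names are illustrative:

```c
#include <stdint.h>

#define SLOTS 24                 /* e.g., a T1-like frame */

static uint8_t frame_buf[SLOTS];
static int out_slot_of[SLOTS];   /* map: input slot -> output slot,
                                  * filled in at call setup */

void tsi_write_frame(const uint8_t *in)
{
    for (int s = 0; s < SLOTS; s++)
        frame_buf[s] = in[s];                /* sequential write */
}

void tsi_read_frame(uint8_t *out)
{
    for (int s = 0; s < SLOTS; s++)
        out[out_slot_of[s]] = frame_buf[s];  /* permuted read: the
                                              * slot interchange */
}
```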
41. Time-Division: Shared-Memory Fabrics
- Memory interface hardware is expensive => many ports share fewer memory interfaces
- E.g., dual-ported memory
- Separate low-speed bus lines for the controller
43. Multi-Stage Fabrics
- Compromise between pure time-division and pure space-division
- Attempt to combine the advantages of each: lower cost from time-division, higher performance from space-division
- Technique: limited sharing
- E.g., the Banyan switch
- Features: scalable; self-routing, i.e. no central controller; packet queues allowed, but not required
- Note: multi-stage switches share the crosspoints, which have now become expensive resources
44. Banyan Switch Fabric (contd.)
- Basic building block: a 2x2 switch whose outputs are labelled 0/1
- Can be synchronous or asynchronous
- Asynchronous => packets can arrive at arbitrary times
- A synchronous banyan offers TWICE the effective throughput!
- Worst case: all inputs receive packets with the same label
45. Banyan Fabric
- More on switching later
46. Forwarding, a.k.a. Port Mapping
47. Basic Architectural Components: Forwarding Decision
[Figure: the per-packet datapath: (1) forwarding decision via a forwarding table on each line card, (2) interconnect, (3) output scheduling]
48. ATM and MPLS Switches: Direct Lookup
[Figure: the incoming VCI is used directly as the memory address; the entry read out holds the new (port, VCI); a minimal sketch follows]
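A minimal sketch of the direct lookup above, assuming a 16-bit VCI space so the label can index the table directly; field names are illustrative:

```c
#include <stdint.h>

struct vc_entry {
    uint16_t out_port;
    uint16_t out_vci;   /* label to rewrite into the cell header */
};

static struct vc_entry vc_table[1 << 16];

struct vc_entry lookup_vci(uint16_t vci)
{
    return vc_table[vci];  /* direct index: one memory access, no search */
}
```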
49. Bridges and Ethernet Switches: Associative Lookups
[Figure: the 48-bit search key (the MAC address) is presented to an associative memory (CAM), which returns the associated data]
50. Bridges and Ethernet Switches: Hashing
[Figure: the 48-bit search key is hashed down to a 16-bit memory address; the memory entry holds the associated data]
51. Lookups Using Hashing: An Example
[Figure: the 48-bit key is hashed with CRC-16 to a 16-bit bucket index; colliding entries hang off each bucket in linked lists]
52. Lookups Using Hashing: Performance of the Simple Example
53. Lookups Using Hashing
- Advantages
- Simple
- Expected lookup time can be small
- Disadvantages
- Non-deterministic lookup time
- Inefficient use of memory
- A minimal chained-hash sketch follows.
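A sketch under assumptions: the slides hash with CRC-16, but a cheap XOR fold stands in here, and all names are illustrative. The chain walk is exactly what makes lookup time non-deterministic:

```c
#include <stdint.h>
#include <stdlib.h>

struct mac_entry {
    uint64_t mac;             /* 48-bit address in the low bits */
    int out_port;
    struct mac_entry *next;   /* collision chain */
};

static struct mac_entry *buckets[1 << 16];

static unsigned hash16(uint64_t mac)
{
    return (unsigned)((mac ^ (mac >> 16) ^ (mac >> 32)) & 0xffff);
}

int mac_lookup(uint64_t mac)  /* returns port, or -1 => flood */
{
    for (struct mac_entry *e = buckets[hash16(mac)]; e; e = e->next)
        if (e->mac == mac)
            return e->out_port;
    return -1;
}

void mac_learn(uint64_t mac, int port)
{
    unsigned h = hash16(mac);
    struct mac_entry *e = malloc(sizeof *e);
    e->mac = mac;
    e->out_port = port;
    e->next = buckets[h];     /* push onto the bucket's chain */
    buckets[h] = e;
}
```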
54. Per-Packet Processing in an IP Router
- 1. Accept the packet arriving on an incoming link.
- 2. Lookup the packet's destination address in the forwarding table to identify the outgoing port(s).
- 3. Manipulate the packet header: e.g., decrement TTL, update the header checksum.
- 4. Send (switch) the packet to the outgoing port(s).
- 5. Classify and buffer the packet in the queue.
- 6. Transmit the packet onto the outgoing link.
55. Caching Addresses
[Figure: cached destinations take the fast path on the line card; misses take the slow path through the CPU and buffer memory]
56. Caching Addresses
57. IP Router Lookup
- IPv4 unicast destination-address-based lookup
58. Lookup and Forwarding Engine
[Figure: the destination address from the packet header is looked up in the routing data structure to yield an outgoing port]
Forwarding table (dest-network -> port):
- 65.0.0.0/8 -> 3
- 128.9.0.0/16 -> 1
- 149.12.0.0/19 -> 7
59. Example Forwarding Table
- 65.0.0.0/8 -> port 3
- 128.9.0.0/16 -> port 1
- 142.12.0.0/19 -> port 7
- Prefix length: an IP prefix is 0-32 bits long.
[Figure: the three prefixes as ranges on the number line from 0 to 2^32 - 1; e.g., 65.0.0.0/8 spans 65.0.0.0 to 65.255.255.255]
60. Prefixes Can Overlap
[Figure: 128.9.16.0/21, 128.9.172.0/21 and 128.9.176.0/24 nested inside 128.9.0.0/16, alongside 65.0.0.0/8 and 142.12.0.0/19, on the number line from 0 to 2^32 - 1]
- Routing lookup: find the longest matching prefix (aka the most specific route) among all prefixes that match the destination address.
61. Difficulty of Longest Prefix Match
- It is a 2-dimensional search: prefix length and prefix value
[Figure: the example prefixes plotted by prefix value against prefix length (8 up to 32)]
62. IP Routers: Metrics for Lookups
- Lookup time
- Storage space
- Update time
- Preprocessing time
- Example lookup key: 128.9.16.14
63. Lookup Rates Required
Line     Year      Line rate (Gb/s)   40B packets (Mpps)
OC12c    1998-99   0.622              1.94
OC48c    1999-00   2.5                7.81
OC192c   2000-01   10.0               31.25
OC768c   2002-03   40.0               125
64. Update Rates Required
- Recent BGP studies show that updates can be:
- Bursty: several hundred routes updated/withdrawn at once => insert/delete operations
- Frequent: an average of 100 updates per second
- Need the data structure to be efficient in terms of lookup as well as update (insert/delete) operations.
65. Size of the Forwarding Table
[Figure: number of prefixes vs. year (1995-2000), growing at roughly 10,000 prefixes/year, with renewed exponential growth at the end]
- Renewed growth is due to multi-homing of enterprise networks!
- Source: http://www.telstra.net/ops/bgptable.html
66. Potential Hyper-Exponential Growth!
[Figure: global routing table size vs. Moore's law since 1999 (01/99 to 04/01, 50,000 to 160,000 prefixes), with curves for global prefixes, Moore's law, and doubling growth]
67. Trees and Tries
[Figure: a binary search tree (branch on </>) vs. a binary search trie (branch on the next bit: 0 = left, 1 = right), storing e.g. the keys 010 and 111; a trie-lookup sketch follows]
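A minimal longest-prefix-match sketch over a binary trie like the one above: one bit per level, and the deepest prefix seen on the path is the answer. Names are illustrative:

```c
#include <stdint.h>
#include <stdlib.h>

struct trie {
    struct trie *child[2];
    int next_hop;            /* -1 if no prefix ends here */
};

static struct trie *new_node(void)
{
    struct trie *t = calloc(1, sizeof *t);
    t->next_hop = -1;
    return t;
}

/* Insert prefix/len with the given next hop (prefix in host order,
 * most-significant bit first). */
void trie_insert(struct trie *root, uint32_t prefix, int len, int hop)
{
    struct trie *t = root;
    for (int i = 0; i < len; i++) {
        int b = (prefix >> (31 - i)) & 1;
        if (!t->child[b])
            t->child[b] = new_node();
        t = t->child[b];
    }
    t->next_hop = hop;
}

/* Walk the trie, remembering the last prefix seen: that is the LPM. */
int trie_lookup(struct trie *root, uint32_t addr)
{
    int best = -1;
    struct trie *t = root;
    for (int i = 0; i < 32 && t; i++) {
        if (t->next_hop >= 0)
            best = t->next_hop;
        t = t->child[(addr >> (31 - i)) & 1];
    }
    if (t && t->next_hop >= 0)
        best = t->next_hop;
    return best;
}
```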
68. Trees and Tries: Multiway Tries
[Figure: a 16-ary search trie consumes 4 bits per level; each node holds 16 (key, pointer) pairs, e.g. looking up 000011110000 vs. 111111111111]
69. Lookup in Multiway Tries: Tradeoffs
- Table produced from 2^15 randomly generated 48-bit addresses
70. Routing Lookups in Hardware
[Figure: histogram of the number of prefixes vs. prefix length]
- Most prefixes are 24 bits or shorter
71. Routing Lookups in Hardware
- Prefixes up to 24 bits fit in a table of 2^24 = 16M entries
[Figure: the top 24 bits of 142.19.6.14 (142.19.6) index the table directly; the low byte (14) is not needed for short prefixes]
72. Routing Lookups in Hardware
[Figure: for prefixes longer than 24 bits, the first-table entry indexed by the top 24 bits of 128.3.72.44 (128.3.72) carries a marker bit; the remaining 8 bits (44) then index a second table holding the next hop; a sketch follows]
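A software sketch in the spirit of the two-table scheme above (as in the DIR-24-8 approach): most lookups take one memory access into the 16M-entry table, and a marker bit diverts the few longer-than-/24 prefixes to a second access. Table sizes and field packing are illustrative assumptions:

```c
#include <stdint.h>

#define LONG_FLAG 0x8000u

static uint16_t tbl24[1 << 24];      /* indexed by top 24 bits: next hop,
                                      * or LONG_FLAG | chunk index       */
static uint16_t tbllong[4096 * 256]; /* up to 4096 chunks of 256 entries */

uint16_t route_lookup(uint32_t addr)
{
    uint16_t e = tbl24[addr >> 8];            /* first memory access  */
    if (!(e & LONG_FLAG))
        return e;                             /* next hop directly    */
    uint32_t chunk = e & 0x0fffu;             /* which /24 chunk      */
    return tbllong[chunk * 256 + (addr & 0xff)]; /* second access for
                                                  * prefixes > /24    */
}
```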
73. Switching, a.k.a. Interconnect
74. Basic Architectural Components: Interconnect
[Figure: the per-packet datapath again: (1) forwarding decision via a forwarding table per line card, (2) interconnect, (3) output scheduling]
75. First-Generation IP Routers
[Figure: a CPU and buffer memory on a shared backplane]
- Most Ethernet switches and cheap packet routers
- The bottleneck can be the CPU, the host adaptor, or the I/O bus
- What is costly? Bus? Memory? Interface? CPU?
76. Second-Generation IP Routers
- Port-mapping intelligence in the line cards
- Higher hit rate in the local lookup cache
- What is costly? Bus? Memory? Interface? CPU?
77. Third-Generation Switches/Routers
[Figure: line cards with local buffer memory and MACs, plus a CPU card, connected by a switched backplane]
- A third-generation switch provides parallel paths (a fabric)
- What's costly? Bus? Memory? CPU?
78. Fourth-Generation Switches/Routers: Clustering and Multistage
[Figure: 32 line cards clustered through a multistage interconnect]
79. Switching Goals (Telephony and Data)
80. Circuit Switch
- A switch that can handle N calls has N logical inputs and N logical outputs
- N up to 200,000
- Moves 8-bit samples from an input to an output port
- Recall that samples have no headers
- The destination of a sample depends on the time at which it arrives at the switch
- In practice, input trunks are multiplexed
- Multiplexed trunks carry frames: sets of samples
- Goal: extract samples from the frame and, depending on position in the frame, switch to the output
- Each incoming sample has to get to the right output line and the right slot in the output frame
81. Call Blocking
- Can't find a path from input to output
- Internal blocking: a slot in the output frame exists, but there is no path to it
- Output blocking: no slot in the output frame is available
- Output blocking is reduced in transit switches
- Need to put a sample in one of several slots going to the desired next hop
82. Multiplexors and Demultiplexors
- Most trunks time-division multiplex voice samples
- At a central office, the trunk is demultiplexed and distributed to active circuits
- Addressing is not required
- Synchronous multiplexor: N input lines
- The output runs N times as fast as each input
[Figure: N lines (1..N) into a MUX, one shared fast line, then a DEMUX back out to N lines]
83. Switching: What Does a Switch Do?
- Transfers data from an input to an output
- Many ports (density), high speeds
- E.g., a crossbar
84. Circuit Switch
85. Issue: Call Blocking
86. Time-Division Switching
- Key idea: when de-multiplexing, position in frame determines the output trunk
- Time-division switching interchanges sample positions within a frame: time slot interchange (TSI)
87. Scaling Issues with TSI
88. Space-Division Switching
- Each sample takes a different path through the switch, depending on its destination
89. Crossbar
- The simplest possible space-division switch
- Crosspoints can be turned on or off, long enough to transfer a packet from an input to an output
- Internally nonblocking, but needs N^2 crosspoints
- The time to set the crosspoints grows quadratically
90. Multistage Crossbar
- In a crossbar, during each switching time only one crosspoint per row or column is active
- Can save crosspoints if a crosspoint can attach to more than one input line (why?)
- This is done in a multistage crossbar
- Need to rearrange connections every switching time
91. Multistage Crossbar
- Can suffer internal blocking, unless there is a sufficient number of second-level stages
- Number of crosspoints < N^2
- Finding a path from input to output requires a depth-first search
- Scales better than a crossbar, but still not too well
- A 120,000-call switch needs about 250 million crosspoints
92. Time-Space Switching
93. Time-Space-Time (TST) Switching
- Telephone switches like the 5ESS use multiple space stages, e.g. TSSST
94. Packet Switches
- In a circuit switch, the path of a sample is determined at the time of connection establishment
- No need for a sample header: position in frame is used
- In a packet switch, packets carry a destination field or label
- Need to look up the destination port on the fly
- Datagram switches: lookup based on the entire destination address (longest-prefix match)
- Cell or label switches: lookup based on VCIs or labels
95. Blocking in Packet Switches
- Can have both internal and output blocking
- Internal: no path to the output
- Output: trunk unavailable
- Unlike a circuit switch, cannot predict whether packets will block (why?)
- If a packet is blocked => must either buffer or drop it
96. Dealing with Blocking in Packet Switches
- Over-provisioning: internal links much faster than inputs
- Buffers: at input or output
- Backpressure: if the switch fabric doesn't have buffers, prevent a packet from entering until a path is available
- Parallel switch fabrics: increase the effective switching capacity
97. Switch Fabrics: Buffered Crossbar
- What happens if packets at two inputs both want to go to the same output?
- Can defer one in an input buffer
- Or buffer the crosspoints: complex arbiter
98. Switch Fabric Element
- Goal: towards building self-routing fabrics
- Can build complicated fabrics from a simple element
- Routing rule: if the routing bit is 0, send the packet to the upper output, else to the lower output (see the sketch below)
- If both packets want the same output, buffer or drop one
99. Banyan
- The simplest self-routing recursive fabric
- What if two packets both want to go to the same output? Output blocking
100. Features of Multi-Stage Switches
- Issue: output blocking, i.e. two packets want to go to the same output port
101. Blocking in the Banyan Fabric
102. Blocking in Banyan Switches: Sorting
- Can avoid blocking by choosing the order in which packets appear at the input ports
- If we can: present packets at the inputs sorted by output; remove duplicates; remove gaps; and precede the banyan with a perfect-shuffle stage, then there is no internal blocking
- For example: X, 010, 010, X, 011, X, X, X
- Sort => 010, 010, 011, X, X, X, X, X
- Remove dups => 010, 011, X, X, X, X, X, X
- Shuffle => 010, X, 011, X, X, X, X, X
- Need sort, shuffle, and trap networks
103. Sorting Using Merging
- Build sorters from merge networks
- Assume we can merge two sorted lists
- Sort pairwise, merge, recurse (see the sketch below)
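The "sort pairwise, merge, recurse" structure is Batcher's odd-even merge sort; the standard recursive construction is sketched below for n a power of two. It runs in software here, but each compare_exchange corresponds to a 2x2 comparator in the hardware network:

```c
#include <stdio.h>

static void compare_exchange(int *a, int i, int j)
{
    if (a[i] > a[j]) { int t = a[i]; a[i] = a[j]; a[j] = t; }
}

/* Merge the sorted halves of a[lo..lo+n-1]; r is the comparator stride. */
static void oddeven_merge(int *a, int lo, int n, int r)
{
    int m = r * 2;
    if (m < n) {
        oddeven_merge(a, lo, n, m);      /* even subsequence */
        oddeven_merge(a, lo + r, n, m);  /* odd subsequence  */
        for (int i = lo + r; i + r < lo + n; i += m)
            compare_exchange(a, i, i + r);
    } else {
        compare_exchange(a, lo, lo + r);
    }
}

static void oddeven_mergesort(int *a, int lo, int n)
{
    if (n > 1) {
        int m = n / 2;
        oddeven_mergesort(a, lo, m);     /* sort pairwise ...  */
        oddeven_mergesort(a, lo + m, m);
        oddeven_merge(a, lo, n, 1);      /* ... then merge     */
    }
}

int main(void)
{
    int a[8] = {3, 7, 2, 5, 6, 0, 1, 4};  /* illustrative inputs */
    oddeven_mergesort(a, 0, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}
```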
104. Putting Together the Batcher-Banyan
105. Non-Blocking Batcher-Banyan
[Figure: a Batcher sorter feeding a self-routing banyan; an example set of cells with 3-bit destinations (000-111) is sorted by the Batcher stage and then delivered without internal blocking]
- The fabric can be used as a scheduler.
- The Batcher-Banyan network is blocking for multicast.
106. Queuing, Buffer Management, Classification
107. Basic Architectural Components: Queuing, Classification
[Figure: the per-packet datapath once more: (1) forwarding decision, (2) interconnect, (3) output scheduling]
108. Queuing: Two Basic Techniques
- Input queueing: usually with a non-blocking switch fabric (e.g. a crossbar)
- Output queueing: usually with a fast bus
109. Queuing: Output Queueing
[Figure: two variants: individual output queues per port (1..N), or a centralized shared memory]
110. Input Queuing
111. Input Queueing: Head-of-Line Blocking
[Figure: average delay vs. offered load (up to 100%); with FIFO input queues, HOL blocking saturates throughput at about 58% under uniform traffic]
112. Solution: Input Queueing with Virtual Output Queues (VOQs)
113. Head-of-Line (HOL) Blocking in Input Queuing
114. Input Queues: Virtual Output Queues
[Figure: average delay vs. offered load; with VOQs the switch can approach 100% load; a VOQ bookkeeping sketch follows]
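A minimal VOQ bookkeeping sketch: each input keeps one FIFO per output, so a cell blocked on output j never holds up cells bound elsewhere. Depths and names are illustrative, and the matching scheduler (e.g. an iSLIP-style arbiter) is out of scope here:

```c
#include <stdint.h>

#define NPORTS 8
#define QDEPTH 64

struct fifo {
    uint32_t cells[QDEPTH];
    int head, tail, len;
};

/* voq[i][j]: cells at input i destined for output j */
static struct fifo voq[NPORTS][NPORTS];

int voq_enqueue(int in, int out, uint32_t cell)
{
    struct fifo *q = &voq[in][out];
    if (q->len == QDEPTH)
        return -1;                     /* full: drop or backpressure */
    q->cells[q->tail] = cell;
    q->tail = (q->tail + 1) % QDEPTH;
    q->len++;
    return 0;
}

/* Each cell time, a matching scheduler picks at most one non-empty
 * VOQ per input and at most one per output, then dequeues those. */
```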
115. Output Queuing
116. Packet Classification
[Figure: the incoming packet's header is matched against a rule table to select an action]
117. Multi-Field Packet Classification
- Given a classifier with N rules, find the action associated with the highest-priority rule matching an incoming packet (a linear-scan sketch follows).
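A deliberately naive linear-scan sketch of the problem statement above, for two prefix fields with rules stored in priority order (first match wins). Real classifiers use smarter data structures; all names are illustrative:

```c
#include <stdint.h>

struct rule {
    uint32_t src, src_mask;   /* field 1: source prefix      */
    uint32_t dst, dst_mask;   /* field 2: destination prefix */
    int action;
};

int classify(const struct rule *rules, int n, uint32_t s, uint32_t d)
{
    for (int i = 0; i < n; i++)       /* highest priority first */
        if ((s & rules[i].src_mask) == rules[i].src &&
            (d & rules[i].dst_mask) == rules[i].dst)
            return rules[i].action;
    return -1;                        /* default action */
}
```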
118. Prefix Matching: A 1-D Range Problem
[Figure: the prefix 128.9/16 as a range on the number line from 0 to 2^32 - 1, containing the address 128.9.16.14]
119. Classification: A 2-D Geometry Problem
[Figure: rules R1-R7 drawn as rectangles in the (field 1, field 2) plane, e.g. (144.24/16, 64/24); a packet, e.g. (128.16.46.23, *), is a point, and the highest-priority rectangle containing it wins]
120. Network Processors: A Building Block for Programmable Networks
- Slides from Raj Yavatkar, raj.yavatkar@intel.com
121. Intel IXP Network Processors
- Microengines: RISC processors optimized for packet processing, with hardware support for multi-threading; they run the fast path
- Embedded StrongARM/XScale: runs an embedded OS and handles exception tasks; the slow path / control plane
122. Various Forms of Processors
- Embedded processor (run-to-completion)
- Parallel architecture
- Pipelined architecture
123. Software Architectures
124. Division of Functions
125. Packet Flow Through the Hierarchy
126. Scaling Network Processors
127. Memory Scaling
128. Memory Scaling (contd.)
129. Memory Types
130. Memory Caching and CAM
- Cache
- Content Addressable Memory (CAM)
131. CAM and Ternary CAM
- CAM operation
- Ternary CAM (T-CAM)
132. Ternary CAMs
Value      Mask             Next hop
10.0.0.0   255.0.0.0        R1
10.1.0.0   255.255.0.0      R2
10.1.1.0   255.255.255.0    R3
10.1.3.0   255.255.255.0    R4
10.1.3.1   255.255.255.255  R4
- All entries are compared associatively in parallel; a priority encoder selects the highest-priority matching row.
- T-CAMs can also be used for classification (a software sketch follows).
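A software sketch of the T-CAM table above: hardware compares (key & mask) == value in every row in parallel and a priority encoder picks the winner; here the row order encodes priority, with the longest prefixes first:

```c
#include <stdint.h>

struct tcam_entry { uint32_t value, mask; const char *next_hop; };

/* Longest prefixes first, so the first hit is the longest match. */
static const struct tcam_entry tcam[] = {
    {0x0A010301, 0xFFFFFFFF, "R4"},  /* 10.1.3.1/32 */
    {0x0A010300, 0xFFFFFF00, "R4"},  /* 10.1.3.0/24 */
    {0x0A010100, 0xFFFFFF00, "R3"},  /* 10.1.1.0/24 */
    {0x0A010000, 0xFFFF0000, "R2"},  /* 10.1.0.0/16 */
    {0x0A000000, 0xFF000000, "R1"},  /* 10.0.0.0/8  */
};

const char *tcam_lookup(uint32_t addr)
{
    for (unsigned i = 0; i < sizeof tcam / sizeof tcam[0]; i++)
        if ((addr & tcam[i].mask) == tcam[i].value)
            return tcam[i].next_hop;  /* priority-encoder role */
    return "no match";
}
```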
133. IXP: A Building Block for Network Systems
- Example: IXP2800
- 16 micro-engines + XScale core
- Up to 1.4 GHz ME speed
- 8 HW threads/ME
- 4K control store per ME
- Multi-level memory hierarchy
- Multiple inter-processor communication channels
- NPU vs. GPU tradeoffs
- Reduced core complexity
- No hardware caching
- Simpler instructions => shallow pipelines
- Multiple cores with HW multi-threading per chip
134. IXP 2400 Block Diagram
135. IXP2800 Features
- Half-duplex OC-192 / 10 Gb/s Ethernet network processor
- XScale core: 700 MHz (half the ME clock); 32 KB instruction cache / 32 KB data cache
- Media / switch fabric interface: 2 x 16-bit LVDS transmit and receive; configured as CSIX-L2 or SPI-4
- PCI interface: 64-bit / 66 MHz interface for control; 3 DMA channels
- QDR interface (with parity): four 36-bit SRAM channels (QDR or co-processor); Network Processor Forum LookAside-1 standard interface; using a clamshell topology, both memory and co-processor can be instantiated on the same channel
- RDR interface: three independent Direct Rambus DRAM interfaces; supports 4 banks or 16 interleaved banks; supports 16/32-byte bursts
136. Hardware Features to Ease Packet Processing
- Ring buffers
- For inter-block communication/synchronization
- Producer-consumer paradigm
- Next-neighbor registers and signaling
- Allow single-cycle transfer of context to the next logical micro-engine to dramatically improve performance
- Simple, easy transfer of state
- Distributed data caching within each micro-engine
- Allows all threads to keep processing even when multiple threads are accessing the same data
137. XScale Core Processor
- Compliant with the ARM V5TE architecture
- Support for ARM's Thumb instructions
- Support for Digital Signal Processing (DSP) enhancements to the instruction set
- Intel's improvements to the internal pipeline improve the memory-latency-hiding abilities of the core
- Does not implement the floating-point instructions of the ARM V5 instruction set
138. Microengines: RISC Processors
- IXP 2800 has 16 microengines, organized into 4 clusters (4 MEs per cluster)
- ME instruction set specifically tuned for processing network data
- 40-bit x 4K control store
- Six-stage pipeline; an instruction takes one cycle to execute on average
- Each ME has eight hardware-assisted threads of execution
- Can be configured to use either all eight threads or only four threads
- The non-preemptive hardware thread arbiter swaps between threads in round-robin order
139. MicroEngine v2
[Figure: ME block diagram: 4K-instruction control store; 640-word local memory; two banks of 128 GPRs; 128 next-neighbor, 128 S-transfer and 128 D-transfer registers in and out; a 32-bit execution data path with add/shift/logical, multiply, find-first-bit, CRC and pseudo-random units; a 16-entry CAM with LRU status logic; local CSRs, timers and timestamp; S/D push and pull buses]
140. Why Multi-Threading?
141. Packet Processing Using Multi-Threading Within a MicroEngine
142. Registers Available to Each ME
- Four different types of registers: general purpose, SRAM transfer, DRAM transfer, next-neighbor (NN)
- 256 32-bit GPRs; can be accessed in thread-local or absolute mode
- 256 32-bit SRAM transfer registers; used to read/write all functional units on the IXP2xxx except the DRAM
- 256 32-bit DRAM transfer registers; divided equally into read-only and write-only; used exclusively for communication between the MEs and the DRAM
- Benefit of having separate transfer registers and GPRs: the ME can continue processing with GPRs while other functional units read and write the transfer registers
143. Different Types of Memory
Type of memory   Logical width (bytes)  Size (bytes)  Approx. unloaded latency (cycles)  Special notes
Local to ME      4                      2560          3     Indexed addressing, post incr/decr
On-chip scratch  4                      16K           60    Atomic ops; 16 rings with atomic get/put
SRAM             4                      256M          150   Atomic ops; 64-element q-array
DRAM             8                      2G            300   Direct path to/from MSF
144. IXA Software Framework
[Figure: control-plane protocol stacks on external processors talk through the Control Plane PDK to core components on the XScale core (written in C/C++ using the core component and resource manager libraries); the microengine pipeline runs microblocks written in Microengine C on top of the microblock, utility, protocol, and hardware abstraction libraries]
145. Microengine C Compiler
- C language constructs: basic types, pointers, bit fields
- In-line assembly code support
- Aggregates: structs, unions, arrays
146. What is a Microblock?
- Data-plane packet processing on the microengines is divided into logical functions called microblocks
- Coarse-grained and stateful
- Examples: 5-tuple classification, IPv4 forwarding, NAT
- Several microblocks running on a microengine thread can be combined into a microblock group
- A microblock group has a dispatch loop that defines the dataflow for packets between microblocks
- A microblock group runs on each thread of one or more microengines
- Microblocks can send and receive packets to/from an associated XScale core component
147. Core Components and Microblocks
[Figure: core libraries and user-written core components on the XScale core, paired with microblocks (user-written or Intel/3rd-party) from the microblock library on the microengines]
148. Applications of Network Processors
- Fully programmable architecture: implement any packet-processing application
- Examples from customers: routing/switching, VPN, DSLAM, multi-service switch, storage, content processing, intrusion detection (IDS), and RMON
- Use as a research platform: experiment with new algorithms and protocols
- Use as a teaching tool: understand architectural issues; gain hands-on experience with networking systems
149. Technical and Business Challenges
- Technical challenges
- Shift from the ASIC-based paradigm to software-based apps
- Challenges in programming an NPU
- Trade-off between power, board cost, and the number of NPUs
- How to add co-processors for additional functions?
- Business challenges
- Reliance on an outside supplier for the key component
- Preserving intellectual-property advantages
- Add value and differentiation through software algorithms in data-plane, control-plane, and services-plane functionality
- Must decrease time-to-market (TTM) to be competitive
150. Challenges in Modern Terabit-Class Switch Design
151. Goals
- Design of a terabit-class system
- Several Tb/s aggregate throughput
- 2.5 Tb/s: 256x256 OC-192 or 64x64 OC-768
- OEM
- Achieve wide coverage of the application spectrum
- Single-stage
- Electronic fabric
152. System Architecture
153. Power
- Requirement: do not exceed the per-shelf (2 kW), per-board (150 W), and per-chip (20 W) budgets
- Forced-air cooling; avoid hot-spots
- More throughput at the same power: Gb/s/W density is increasing
- I/O uses an increasing fraction of power (> 50%)
- Electrical I/O technology has not kept pace with capacity demand
- Low-power, high-density I/O technology is a must
- CMOS density increases faster than W/gate decreases
- Functionality/chip is constrained by power rather than density
- Power determines the number of chips and boards
- The architecture must be able to be distributed accordingly
154. Packaging
- Requirement: NEBS compliance
- Constrained by standard form factors and the power budget at chip, card, and rack level
- Switch core: link, connector, and chip packaging technology; connector density (pins/inch)
- CMOS density doubles, but the number of pins grows only 5-10% per generation
- This determines the maximum per-chip and per-card throughput
- Line cards: increasing port counts; prevalent line-rate granularity OC-192 (10 Gb/s); 1 adapter/card
- > 1 Tb/s systems require multi-rack solutions
- Long cables instead of a backplane (30 to 100 m)
- Interconnect accounts for a large part of system cost
155. Packaging
- 2.5 Tb/s with 1.6x speedup over 2.5 Gb/s links with 8b/10b coding => about 4000 links (differential pairs)
156. Switch-Internal Round-Trip (RT)
- Physical system size: a direct consequence of packaging
- CMOS technology: clock speeds are increasing much more slowly than density; more parallelism is required to increase throughput
- Shrinking packet cycle: line rates have gone up drastically (OC-3 through OC-768), while the minimum packet size has remained constant
- Large round-trip (RT) in terms of minimum packet duration: can be (many) tens of packets per port
- Used to be only a node-to-node issue; now it also arises inside the node
- System-wide clocking and synchronization
[Figure: evolution of RT]
157. Switch-Internal Round-Trip (RT)
[Figure: line cards 1..N connect through switch-fabric interface chips to the switch core]
- Consequences: performance impact? All buffers must be scaled by RT; fabric-internal flow control becomes an important issue
158. Speed-Up
- Requirement: industry-standard 2x speed-up
- Three flavors: utilization (compensate for SAR overhead), performance (compensate for scheduling inefficiencies), OQ speed-up (memory access time)
- Switch-core speed-up S is very costly
- Bandwidth is a scarce resource: COST and POWER
- Core buffers must run S times faster
- The core scheduler must run S times faster
- Is it really needed?
- SAR overhead reduction: variable-length packet switching is hard to implement, but may be more cost-effective
- Performance: does the gain in performance justify the increase in cost and power? Depends on the application
- Low Internet utilization
159. Multicast
- Requirement: full multicast support
- Many multicast groups, full link utilization, no blocking, QoS
- Complicates everything: buffering, queuing, scheduling, flow control, QoS
- Is sophisticated multicast support really needed?
- Expensive; often disabled in the field
- Complexity, billing, potential for abuse, etc.
- Again, depends on the application
160. Packet Size
- Requirement: support very short packets (32-64 B)
- 40 B @ OC-768 = 8 ns
- The short packet duration determines the speed of the control section (queues and schedulers), implies a longer RT, and forces wider data paths
- Do we have to switch short packets individually?
- Aggregation techniques: burst, envelope, or container switching; packing
- Single-stage, multi-path switches
- Parallel packet switch
161. 100 Tb/s Optical Router: Stanford University Research Project
- Collaboration: 4 professors at Stanford (Mark Horowitz, Nick McKeown, David Miller and Olav Solgaard), and our groups
- Objective: to determine the best way to incorporate optics into routers
- Push technology hard to expose new issues: photonics, electronics, system design
- Motivating example: the design of a 100 Tb/s Internet router
- Challenging but not impossible (100x current commercial systems)
- It identifies some interesting research problems
162. 100 Tb/s Optical Router
[Figure: 625 electronic linecards (line termination, IP packet processing, packet buffering), each at 160-320 Gb/s, connect through an optical switch; 40 Gb/s external and 160 Gb/s internal links; request/grant arbitration. 100 Tb/s = 625 x 160 Gb/s]
163. Research Problems
- Linecard: memory bottleneck in address lookup and packet buffering
- Architecture: arbitration computation complexity
- Switch fabric: optics (fabric scalability and speed), electronics (switch control and link electronics), packaging (the three-surface problem)
164. 160 Gb/s Linecard: Packet Buffering
[Figure: a queue manager with on-chip SRAM fronting multiple off-chip DRAMs, with 160 Gb/s streams in and out]
- Problem: the packet buffer needs the density of DRAM (40 Gbits) and the speed of SRAM (2 ns per packet)
- Solution: a hybrid scheme uses on-chip SRAM and off-chip DRAM
- Identified optimal algorithms that minimize the size of the SRAM (12 Mbits)
- Precisely emulates the behavior of a 40 Gbit, 2 ns SRAM
165. The Arbitration Problem
- A packet switch fabric is reconfigured for every packet transfer
- At 160 Gb/s, a new IP packet can arrive every 2 ns
- The configuration is picked to maximize throughput and not waste capacity
- Known algorithms are too slow
166. 100 Tb/s Router
[Figure: racks of 160 Gb/s linecards connected by optical links to the optical switch fabric]
167. Racks with 160 Gb/s Linecards
168. Passive Optical Switching
[Figure: ingress linecards 1..n reach egress linecards 1..n through midstage linecards and an integrated AWGR or diffraction-grating-based wavelength router]
169. Predictions: Core Internet Routers
- The need for more capacity for a given power and volume budget will mean:
- Fewer functions in routers: little or no optimization for multicast; continued over-provisioning will lead to little or no support for QoS, DiffServ, ...
- Fewer unnecessary requirements: mis-sequencing will be tolerated; latency requirements will be relaxed
- Less programmability in routers, and hence no network processors (NPs used at the edge)
- Greater use of optics to reduce power in the switch
170. Likely Events
- The need for capacity and reliability will mean:
- Widespread replacement of core routers with transport switching based on circuits
- Circuit switches have proved simpler, more reliable, lower power, higher capacity, and lower cost per Gb/s. Eventually, this is going to matter.
- The Internet will evolve to become edge routers interconnected by a rich mesh of WDM circuit switches
171. Summary
- High-speed routers: lookup, switching, classification, buffer management
- Lookup: range matching, tries, multi-way tries
- Switching: circuit switching, crossbar, Batcher-Banyan, ...
- Queuing: input/output queuing issues
- Classification: a multi-dimensional geometry problem
- Road ahead: to 100 Tb/s routers