1
Internet Routers Case Study
  • Eric Keller
  • 4/19/07 and 4/24/07

2
Outline
  • Overview/Background
  • Landscape
  • Router components
  • RED/WRED
  • MPLS
  • 5 Example systems (2 Cisco, Juniper, Avici,
    Foundry)
  • Software Routers

3
Choices, choices
4
Interface Speeds
  • What I'll focus on
  • Most interesting architectures
  • Lower end (I think) will all be mostly software
  • I'll talk about Click for that

Source: Chidamber Kulkarni
5
The US backbone
Core routers are supposed to be as fast as possible. Edge routers are supposed to have the features. But core routers, seemingly, have all the same functionality as edge routers, just faster (due to the blurring of the edge/core roles).
Source: High Performance Switches and Routers, by H. Jonathan Chao and Bin Liu
6
Internet Routers Components
  • 4 basic components, common to 4 of the 5 systems studied (the other combined the first 3 onto a single card)
  • Interface Cards
  • Packet Processing cards
  • Switch Fabric Cards
  • Control Plane Cards

7
Data Path Functions
  • Network Processor: parse, identify flow, determine egress port, mark QoS parameters, append TM or SF header
  • Ingress Traffic Manager: police, manage congestion (WRED), queue packets in class-based VOQs (see the VOQ sketch below), segment packets into switch cells
  • Switch Fabric: queue cells in class-based VOQs, flow-control the TM per class-based VOQ, schedule class-based VOQs to egress ports
  • Egress Traffic Manager: reassemble cells into packets, shape outgoing traffic, schedule egress traffic
Source: Vahid Tabatabaee
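
The "class-based VOQs" above are per-(egress port, traffic class) queues. A minimal sketch of that data structure follows; the type names and layout are illustrative only, not taken from any of the systems studied.

    // One FIFO per (egress port, traffic class) pair, so a congested egress
    // port cannot head-of-line block traffic headed elsewhere.
    #include <cstdint>
    #include <deque>
    #include <utility>
    #include <vector>

    struct Pkt { std::vector<uint8_t> bytes; };

    class ClassBasedVoqs {
        int classes_;
        std::vector<std::deque<Pkt>> q_;              // q_[port * classes_ + cls]
    public:
        ClassBasedVoqs(int ports, int classes) : classes_(classes), q_(ports * classes) {}

        void enqueue(int port, int cls, Pkt p) { q_[port * classes_ + cls].push_back(std::move(p)); }
        bool empty(int port, int cls) const    { return q_[port * classes_ + cls].empty(); }
        Pkt dequeue(int port, int cls) {              // the scheduler picks (port, class)
            auto &dq = q_[port * classes_ + cls];
            Pkt p = std::move(dq.front());
            dq.pop_front();
            return p;
        }
    };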
8
RED/WRED
  • Tail drop: drop packets when queues are full or nearly full
  • Causes TCP global synchronization, as all TCP connections "hold back" simultaneously and then step forward simultaneously
  • RED (random early detection): uses probabilistic dropping (details on next slide; a small sketch follows below)
  • Goal: mark packets at fairly evenly spaced intervals to avoid global synchronization and biases, and frequently enough to keep the average queue size down
  • WRED: RED for multiple queues (each with different drop probabilities)
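
As a concrete illustration of RED's probabilistic dropping (the curve on the next slides), here is a minimal sketch; the thresholds, maximum drop probability, and EWMA weight are made-up illustrative values, and WRED keeps a separate set of them per class/queue.

    #include <cstdlib>

    struct Red {
        double min_th = 50, max_th = 150;   // average-queue thresholds (packets)
        double max_p  = 0.1;                // drop probability as avg approaches max_th
        double w      = 0.002;              // EWMA weight
        double avg    = 0;                  // running average queue size

        // Called per arriving packet; returns true if it should be dropped (marked).
        bool should_drop(int instantaneous_queue_len) {
            avg = (1 - w) * avg + w * instantaneous_queue_len;
            if (avg < min_th)  return false;                       // accept
            if (avg >= max_th) return true;                        // hard drop
            double p = max_p * (avg - min_th) / (max_th - min_th); // linear ramp
            return (std::rand() / (double)RAND_MAX) < p;           // probabilistic early drop
        }
    };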

9
RED/WRED
10
RED graphs
11
Multi-Protocol Label Switching (MPLS)
  • Emulates some properties of a circuit-switched network over a packet-switched network
  • A 32-bit label header is used for forwarding instead of the IP address (which needs longest-prefix matching); an encoding sketch follows below
  • Labels are swapped at each hop (pushed at ingress, popped at egress)
  • Has quality-of-service capabilities
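
For reference, the 32-bit MPLS shim header (RFC 3032) packs a 20-bit label, a 3-bit traffic class, a bottom-of-stack bit, and an 8-bit TTL. The field layout below is standard; the helper names are mine.

    #include <cstdint>
    #include <cstdio>

    uint32_t mpls_encode(uint32_t label, uint8_t tc, bool bos, uint8_t ttl) {
        return ((label & 0xFFFFFu) << 12) | ((tc & 0x7u) << 9) | ((bos ? 1u : 0u) << 8) | ttl;
    }
    uint32_t mpls_label(uint32_t shim) { return shim >> 12; }        // exact-match lookup key
    uint8_t  mpls_ttl(uint32_t shim)   { return (uint8_t)(shim & 0xFF); }

    int main() {
        uint32_t shim = mpls_encode(100, 0, true, 64);
        // A label-switching router does an exact-match lookup on the 20-bit
        // label and swaps it, instead of a longest-prefix match on the IP
        // destination address.
        std::printf("label=%u ttl=%u\n", mpls_label(shim), mpls_ttl(shim));
    }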

12
MPLS
13
Internet Backbone Core Routers
  • Cisco CRS-1 (2004)
  • Cisco 12000 (prev generation)
  • Juniper T-Series (2004)
  • Avici TSR (2000)
  • Foundry XMR (2006)
  • (many failed companies)

14
Cisco CRS-1
  • Cisco's top-end router for the Internet backbone
  • Modular and distributed routing system
  • Scales up to 92 Tbps
  • Supports OC768c/STM-256c (40 Gbps)
  • The fastest link the backbone carries today
  • 100 Gbps ready

15
Models
Each slot: 40 Gbps. Some math: 4 x 40 Gbps = 160 Gbps, but they say 320, why?
Fabric shelf: in a single-shelf config, all switching is contained on cards in this system. In a multi-shelf config, all switching is in its own rack (the fabric card shelf).
16
Recall 4 main components
  • Interface Cards
  • Packet Processing cards
  • Switch Fabric Cards
  • Control Plane Cards

17
Cisco CRS-1 example: 4-slot shelf
  • Interface Cards
  • Packet Processing cards
  • Switch Fabric Cards
  • Control Plane Cards

Route Processor
Switch Fabric Cards on back
Modular Services Cards
Interface Cards (4 port OC192c/STM-64c)
18
Route Processor
  • Performs control-plane routing protocols (e.g. BGP)
  • Can control any line card on any shelf (recall: you can connect up to 72 shelves)
  • 1 redundant in each shelf
  • One 1.2-GHz PowerPC, or two 800-MHz PowerPCs with symmetric multiprocessing (SMP)
  • CPUs can only communicate through switch fabric
    as if they were on a separate card.
  • Connectivity
  • Console port (RJ-45 connector)
  • Auxiliary port (RJ-45 connector)
  • One 10/100/1000 Ethernet port (RJ-45 connector)
  • Two 10/100/1000 Ethernet ports for control plane
    connectivity
  • Memory/storage
  • 4 GB of route memory per processor
  • 64 MB of boot Flash
  • 2 MB of nonvolatile RAM (NVRAM)
  • One 1-GB PCMCIA card (internal)
  • One 40-GB hard drive

19
Modular Service Card (MSC)
  • The packet processing engine
  • 1 for each interface module
  • Connected via a midplane (built into the chassis)
    to interface cards and switch fabric cards
  • Configurable with 2GB of route table memory (but
    the route processor has 4GB??)
  • GB of packet buffer memory per side
    (ingress/egress)
  • Two SPPs (one per direction), each with 188 Tensilica CPUs

20
Silicon Packet Processor (SPP)
16 Clusters of 12 PPEs
From Eatherton, ANCS '05
21
From Eatherton, ANCS '05
22
From Eatherton, ANCS '05
23
Switching Fabric
  • 3-stage, dynamically self-routed Benes topology
  • Before more details, here's a picture of a Benes network

24
Switching Fabric
  • 3-stage, dynamically self-routed Benes topology switching fabric
  • Stage 1 (S1): distributes traffic to Stage 2 of the fabric plane. Stage 1 elements receive cells from the ingress MSC and distribute the cells to Stage 2 (S2) of the fabric plane.
  • Cells are distributed to S2 elements in round-robin fashion: one cell goes to the first S2 element, the next cell goes to the next S2 element, and so on
  • Stage 2 (S2): performs switching and provides 2x speedup of cells (two output links for every input link). Stage 2 elements receive cells from Stage 1 and route them toward the appropriate
  • egress MSC and PLIM (single-shelf system)
  • egress line card chassis (multishelf system)
  • Stage 3 (S3): performs switching, provides 2x speedup of cells, and performs a second level of the multicast function. Stage 3 elements receive cells from Stage 2 and perform the switching necessary to route each cell to the appropriate egress MSC
  • Buffering at both S2 and S3
  • Uses backpressure, carried in the cell header

Max 1152 ports?
25
Switch Fabric (some more info)
  • 8 planes, 1 redundant
  • Cells are sent round-robin between the planes (a small spraying sketch follows below)
  • Supports multicast, up to 1 million groups
  • Separate virtual channels/queues for different priorities
  • Single-shelf system: fabric cards contain all 3 stages
  • Multi-shelf system: fabric cards contain only stage 2; line cards contain stages 1 and 3
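
A minimal sketch of the per-cell round-robin spraying across planes described above. The plane count is illustrative; the real fabric also handles the redundant plane, backpressure, and egress resequencing.

    #include <cstdio>

    class PlaneSprayer {
        int planes_, next_ = 0;
    public:
        explicit PlaneSprayer(int planes) : planes_(planes) {}
        int pick_plane() {                       // plane index for the next cell
            int p = next_;
            next_ = (next_ + 1) % planes_;
            return p;
        }
    };

    int main() {
        PlaneSprayer sprayer(8);                 // CRS-1-style: 8 planes
        for (int cell = 0; cell < 10; ++cell)
            std::printf("cell %d -> plane %d\n", cell, sprayer.pick_plane());
    }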

26
XYZ selects Cisco CRS-1
  • T-Com (division of Deutsche Telekom)
  • KT, Korea's leading service provider
  • SOFTBANK BB - for "Yahoo! BB" Super Backbone
  • Telstra Australia
  • Comcast
  • China Telecom
  • Free (Iliad Group): fiber to the home in France
  • National LambdaRail

27
Cisco 12000 (GSR) series
Internal name: BFR. (What about the CRS-1?)
Chassis options: 4-slot, 6-slot, 10-slot, and 16-slot
Depending on the model: 2.5 Gbps/slot, 10 Gbps/slot, or 40 Gbps/slot (so max 1.28 Tbps)
28
Switch Fabric
  • Crossbar switch fabric
  • The 2.5 Gbps fabric has a 16 x 16 crossbar and uses the ESLIP algorithm for scheduling
  • The 10 Gbps fabric has a 64 x 64 crossbar and uses a multichannel matching algorithm for scheduling
  • Not sure about the 40 Gbps fabric
  • 64-byte cells are used within the switching fabric
  • 8-byte header, 48-byte payload, and 8-byte CRC
  • It takes roughly 160 nanoseconds to transmit a cell (see the arithmetic check below)
  • Unicast and multicast data and routing-protocol packets are transmitted over the fabric
  • Multicast packets are replicated within the fabric and transmitted to the destination line cards by means of partial fulfillment (busy line cards are sent copies later, when they are not busy)
  • Local traffic on a line card still has to transit the fabric
  • e.g. a 40 Gbps slot could have 4 x 10 Gbps ports

http://cisco.cluepon.net
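
A quick back-of-the-envelope check of the cell numbers above, under the assumption that the 160 ns cell time refers to the 2.5 Gbps-per-slot fabric generation:

    #include <cstdio>

    int main() {
        const double cell_bits    = 64 * 8;   // 8 B header + 48 B payload + 8 B CRC
        const double cell_time_ns = 160;
        // bits per nanosecond == Gbit/s: 512 / 160 = 3.2 Gbps per crossbar port,
        // a modest speedup over the 2.5 Gbps line rate.
        std::printf("per-port rate = %.2f Gbps\n", cell_bits / cell_time_ns);
    }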
29
SCA - Scheduler Control ASIC
  • During each cell clock period (160 ns), the following exchange runs (a simplified single-iteration sketch follows below):
  • Sending line cards send a fabric request to the SCA
  • The SCA runs the ESLIP scheduling algorithm
  • The SCA returns a fabric grant to the line card
  • The line card responds with a fabric grant accept
  • The SCA sets the crossbar for that cell clock
  • The SCA listens for fabric backpressure to stop scheduling for a particular line card

http://cisco.cluepon.net
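
The request/grant/accept exchange above is essentially an iSLIP-style matching. The sketch below shows a single iteration of that handshake; ESLIP itself adds multicast support, priorities, and multiple iterations, none of which are modeled here.

    #include <vector>

    struct Match { std::vector<int> out_for_in; };     // matched output per input, -1 if none

    Match schedule_one_iteration(
            const std::vector<std::vector<bool>>& request, // request[in][out]: VOQ(in,out) non-empty
            std::vector<int>& grant_ptr,                   // per-output round-robin pointer
            std::vector<int>& accept_ptr) {                // per-input round-robin pointer
        const int N = (int)request.size();
        std::vector<int> grant_for_out(N, -1);
        // Grant phase: each output grants the requesting input closest to its pointer.
        for (int out = 0; out < N; ++out)
            for (int k = 0; k < N; ++k) {
                int in = (grant_ptr[out] + k) % N;
                if (request[in][out]) { grant_for_out[out] = in; break; }
            }
        // Accept phase: each input accepts the granting output closest to its pointer.
        Match m{std::vector<int>(N, -1)};
        for (int in = 0; in < N; ++in)
            for (int k = 0; k < N; ++k) {
                int out = (accept_ptr[in] + k) % N;
                if (grant_for_out[out] == in) {
                    m.out_for_in[in] = out;
                    grant_ptr[out]  = (in + 1) % N;   // pointers move only on an accepted grant
                    accept_ptr[in]  = (out + 1) % N;
                    break;
                }
            }
        return m;
    }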
30
Juniper T-series
  • TX Matrix
  • Connects up to 4 T640
  • Total 2.56 Tbps
  • T640
  • 16 slots (40 Gbps each)
  • OC768c
  • Total 640 Gbps
  • T320
  • 8 slots (40 Gbps each)
  • Total 320 Gbps

31
T640
Control Plane Card
Interface Cards
Packet Processing Cards
Switch Fabric Cards
32
Control Plane Card
  • 1.6-GHz Pentium IV processor with integrated
    256-KB Level 2 cache
  • 2-GB DRAM
  • 256-MB Compact flash drive for primary storage
  • 30-GB IDE hard drive for secondary storage
  • 10/100 Base-T auto-sensing RJ-45 Ethernet port
    for out-of-band management
  • Two RS-232 (DB9 connector) asynchronous serial
    ports for console and remote management

33
Packet Processing Card
  • L2/L3 Packet Processing ASICs remove Layer 2
    packet headers, segment incoming packets into 64
    Byte data cells for internal processing,
    reassemble data cells into L3 packets before
    transmission on the egress network interface, and
    perform L2 egress packet encapsulation.
  • A T-Series Internet Processor ASIC performs
    forwarding table lookups.
  • Queuing and Memory Interface ASICs manage the
    buffering of data cells in system memory and the
    queuing of egress packet notifications.
  • Priority queue into switch
  • Switch Interface ASICs manage the forwarding of
    data cells across the T640 routing node switch
    fabric.
  • Switch interface bandwidth considerably higher
    than network interface

34
Switch Fabric
  • For a single-T640 configuration, uses a 16-port crossbar (8 slots, each with 2 PFEs)
  • Request/grant protocol
  • Used for flow control and fault detection
  • 4 parallel switch planes + 1 redundant plane
  • Cell-by-cell distribution among planes (round robin)
  • Sequence numbers and a reorder buffer at egress to maintain packet order (a small sketch follows below)
  • Fair bandwidth allocation (e.g. for when multiple ingress ports write to the same egress port)
  • Graceful degradation (if 1 plane fails, just don't use it)
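
A minimal sketch of the egress resequencing implied by "sequence numbers and reorder buffer": each source tags cells with a sequence number, and the egress holds out-of-order arrivals until the gap is filled. Names and structure are illustrative, not Juniper's actual design.

    #include <cstdint>
    #include <map>
    #include <vector>

    struct Cell { uint32_t seq; std::vector<uint8_t> bytes; };

    class Resequencer {
        uint32_t next_expected_ = 0;
        std::map<uint32_t, Cell> pending_;   // out-of-order cells, keyed by sequence number
    public:
        // Push one arriving cell; returns any cells that are now deliverable in order.
        std::vector<Cell> push(Cell c) {
            std::vector<Cell> ready;
            const uint32_t s = c.seq;
            pending_.emplace(s, std::move(c));
            for (auto it = pending_.find(next_expected_); it != pending_.end();
                 it = pending_.find(next_expected_)) {
                ready.push_back(std::move(it->second));
                pending_.erase(it);
                ++next_expected_;
            }
            return ready;
        }
    };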

35
Switch Fabric
  • For a multiple-T640 configuration, uses a Clos switch (next slide)
  • The TX Matrix performs the middle stage
  • The 64x64 switch is built from the same 16x16 crossbars as the T640
  • 4 switching planes + 1 redundant plane

36
Clos networks
  • 3-stage network (m, n, r)
  • m = number of middle-stage switches
  • n = number of input ports on each input switch = number of output ports on each output switch
  • r = number of input/output switches
  • Strictly non-blocking for unicast traffic iff m >= 2n - 1
  • Rearrangeably non-blocking iff m >= n
  • (A small checker for both conditions follows below)

What would you expect Juniper's to be?
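
A tiny checker for the two conditions above (m and n as defined in the list; the example values are arbitrary, not any vendor's actual configuration):

    #include <cstdio>

    bool strictly_nonblocking(int m, int n)      { return m >= 2 * n - 1; }  // unicast
    bool rearrangeably_nonblocking(int m, int n) { return m >= n; }

    int main() {
        std::printf("m=5,n=3 strict=%d\n", strictly_nonblocking(5, 3));       // 1
        std::printf("m=3,n=3 rearr=%d\n",  rearrangeably_nonblocking(3, 3));  // 1
    }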
37
Avici TSR
  • Scales from 40 Gbps to 5 Tbps
  • Each rack (14 racks max) has:
  • 40 router module slots
  • 4 route controller slots (no details)

38
Multi Service Connect (MSC) Line Cards
  • Interface Ports
  • Up to OC192c
  • Packet Processing (lookup)
  • Intel IXP 2400 network processor (next slide)
  • Meant for 2.5 Gbps processing
  • ASIC for QoS
  • Switch Fabric
  • Router node for the interconnect (in a couple
    slides)

Note: this is 3 of the 4 main components on a single board (which one is missing?)
39
Intel IXP2400
40
Interconnect
  • Bill Dally must have had some input (he authored a white paper for Avici)
  • Topology: 3D folded torus; 2x4x5 (40 nodes) in a single rack, 14x8x5 (560) in the maximal configuration
  • 10 Gbps links
  • Routing: source routing, with random selection among 24 minimal paths (limited non-minimal routing is supported)
  • Flow control: 2 virtual channels for each output port (1120 max), each with its own buffers; one for best-effort and one for guaranteed-rate traffic

41
Topology
Passive backplane; a 6x4x5 system is shown (3 racks of 2x4x5).
On the right, each circle is 5 line cards (in the z direction); the backplane connects the 4 quadrants, jumpers connect adjacent backplanes, and loop-back connectors (jumpers) are placed at the edge machines. So each line represents 5 bidirectional channels (or 10 unidirectional).
In a fully expanded 14x8x5 (560 line card) system, one set of short cables is used to carry the y-dimension channels between two rows of racks.
42
Bisection Bandwidth Scaling
  • Claim: the 3D torus can be upgraded 1 line card at a time (compare to crossbar, Clos, Benes)
  • Claims a Benes can only double (but the Cisco CRS-1 scales to 1152 nodes)

Figure: speedup vs. configuration. The 2x2 x-y bisection stays constant as the z dimension is populated from 2x2x2 to 2x2x5; the 4x5 y-z bisection stays constant as the x dimension is populated from 5x4x5 to 8x4x5; the 8x5 y-z bisection stays constant as the x dimension is populated from 8x8x5 to 14x8x5.
43
High Path Diversity
  • A 3D torus has (dx+dy+dz)! / (dx! dy! dz!) minimal paths between nodes that are dx, dy, dz hops apart (a sketch that computes this follows below)
  • 8x8x8: 90 minimal paths for a 6-hop message (the average message, not the longest path)
  • At least 2 of them are edge-disjoint
  • Load balance across paths
  • Routing randomly selects among 24 of the paths
  • Compare the ability, and the need, to load balance for a crossbar or Clos?
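
A small sketch that computes the minimal-path count from the multinomial above and reproduces the 90-path figure for the average 6-hop message:

    #include <cstdint>
    #include <cstdio>

    uint64_t factorial(int n) { uint64_t f = 1; for (int i = 2; i <= n; ++i) f *= i; return f; }

    // Number of minimal paths when the destination is dx, dy, dz hops away in
    // the three dimensions: the multinomial (dx+dy+dz)! / (dx! dy! dz!).
    uint64_t minimal_paths(int dx, int dy, int dz) {
        return factorial(dx + dy + dz) / (factorial(dx) * factorial(dy) * factorial(dz));
    }

    int main() {
        // Average message in an 8x8x8 torus: about 2 hops per dimension, 6 total.
        std::printf("%llu\n", (unsigned long long)minimal_paths(2, 2, 2));  // 90
    }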

44
Virtual Networks
  • 2 virtual channels per output port (best-effort, and guaranteed bit rate at 33 us)
  • Max 1120 (14x8x5 torus with 2 per output port)
  • A separate set of flit buffers at each channel for each virtual channel
  • Acts as an output-queued crossbar
  • Makes the torus non-blocking
  • Shared physical links
  • Never loaded to more than 2/3, due to load balancing and speedup
  • 72-byte flits
  • Worst-case expected waiting time to access a link is 60 ns per hop

45
Foundry NetIron XMR (cleverly named XMR 4000, XMR 8000, XMR 16000, XMR 32000)
  • 4-, 8-, 16-, and 32-slot racks
  • 40 Gbps per slot
  • (3 Tbps total capacity)
  • Up to 10 GigE (can be connected to SONET/SDH networks, but no built-in optical)
  • As of March 2007, they do offer POS interfaces
  • Highest single rack switching capacity

46
Architecture
47
Packet Processing
  • Intel or AMCC network processor with offload
  • NetLogic NL6000
  • IPv4/IPv6 multilayer packet/flow classification
  • Policy-based routing and Policy enforcement (QoS)
  • Longest Prefix Match (CIDR)
  • Differentiated Services (DiffServ)
  • IP Security (IPSec)
  • Server Load Balancing
  • Transaction verification

48
Switch Fabric
  • Clos with data striping (same idea as striping across parallel planes)
  • Input queuing
  • Multiple priority queues for each output
  • 256k virtual queues
  • Output pulls data
  • Supports Multicast

49
Forwarding Tables (just to give some idea of sizes)
  • NetIron XMR: industry-leading scalability
  • 10 million BGP routes and up to 500 BGP peers
  • 1 million IPv4 routes in hardware (FIB)
  • 240,000 IPv6 routes in hardware (FIB)
  • 2,000 BGP/MPLS VPNs and up to 1 million VPN
    routes
  • 16,000 VLLs/VPLSes and up to 1 million VPLS MAC
    addresses
  • 4094 VLANs, and up to 2 million MAC addresses

50
Power Consumption (again, just to give some idea)
51
Cisco + Juniper > 90%
  • Some recent (past 5 years) failed companies; I couldn't find any details on their architectures
  • Chiaro
  • Axiowave
  • Pluris Inc
  • Procket (assets bought by Cisco)

52
Software Architectures, or Bus-Based Architectures
53
Click Modular Router
54
General idea
  • Extensible toolkit for writing packet processors
  • Architecture centered on elements
  • Small building blocks
  • Perform simple operations, e.g. decrement TTL
  • Written in C++
  • Click routers
  • Directed graphs of elements
  • Comes with a library of 300 elements; others have contributed many more
  • Text files
  • Open source
  • Runs on Linux and BSD

From Bart Braem, Michael Voorhaen
55
Click graph
  • Elements connected by edges
  • Output ports to input ports
  • Describes possible packet flows
  • FromDevice(eth0) -> Counter -> Discard

From Bart Braem, Michael Voorhaen
56
Elements
  • Class
  • Element type (reuse!); a C++ element sketch follows after this list
  • Configuration string
  • Initializes this instance
  • Input port(s)
  • Interface where packets arrive
  • Drawn as triangles
  • Output port(s)
  • Interface where packets leave
  • Drawn as squares
  • Instances can be named
  • myTee :: Tee

From Bart Braem, Michael Voorhaen
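
To make the element idea concrete, here is a rough sketch of a user-written element using Click's C++ Element API (class_name, port_count, and simple_action are real hooks, but exact signatures and macros vary across Click versions, so treat this as illustrative only). It assumes the IP header annotation has already been set upstream (e.g. by CheckIPHeader).

    // Minimal element that decrements the IP TTL and drops expired packets.
    #include <click/config.h>
    #include <click/element.hh>
    #include <clicknet/ip.h>
    CLICK_DECLS

    class MyDecTTL : public Element {
    public:
        const char *class_name() const { return "MyDecTTL"; }
        const char *port_count() const { return "1/1"; }   // one input, one output

        Packet *simple_action(Packet *p) {
            WritablePacket *q = p->uniqueify();            // get a writable copy
            if (!q) return 0;
            click_ip *ip = q->ip_header();
            if (ip && ip->ip_ttl > 1) {
                --ip->ip_ttl;                              // note: checksum update omitted
                return q;
            }
            q->kill();                                     // TTL expired (or no IP header): drop
            return 0;
        }
    };

    CLICK_ENDDECLS
    EXPORT_ELEMENT(MyDecTTL)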
57
Push and pull ports
  • Push port
  • Filled square or triangle
  • Source initiates packet transfer
  • Event based packet flow
  • Pull port
  • Empty square or triangle
  • Destination initiates packet transfer
  • Used with polling, scheduling,
  • Agnostic port
  • Square-in-square or triangle-in-triangle
  • Becomes push or pull (inner square or triangle
    filled or empty)

From Bart Braem, Michael Voorhaen
58
Push and pull violations
  • Push port
  • Has to be connected to push or agnostic port
  • Conversion from push to pull
  • With push-to-pull element
  • E.g. queue
  • Pull port
  • Has to be connected to pull or agnostic port
  • Conversion from pull to push
  • With pull-to-push element
  • E.g. unqueue

From Bart Braem, Michael Voorhaen
59
Compound elements
  • Group elements in larger elements
  • Configuration with variables
  • Pass configuration to the internal elements
  • Can be anything (constant, integer, elements, IP address, ...)
  • Motivates reuse

From Bart Braem, Michael Voorhaen
60
Packets
  • Packet consists of
  • Payload
  • char *
  • Access with a struct
  • Annotations (metadata to simplify processing)
  • post-it
  • IP header information
  • TCP header information
  • Paint annotations
  • User defined annotations

From Bart Braem, Michael Voorhaen
61
Click scripts
  • Text files describing the Click graph
  • Elements with their configurations
  • Compound elements
  • Connections
  • src :: FromDevice(eth0);
    ctr :: Counter;
    sink :: Discard;
    src -> ctr;
    ctr -> sink;
  • Equivalent shorthand: FromDevice(eth0) -> Counter -> Discard

From Bart Braem, Michael Voorhaen
62
Click scripts (cont)
  • Input and output ports are identified by number (0, 1, ...)
  • Input port: -> [nr1]Element ->
  • Output port: -> Element[nr2] ->
  • Both: -> [nr1]Element[nr2] ->
  • If there is only one port, the number can be omitted
  • mypackets :: IPClassifier(dst host myaddr, -);
    FromDevice(eth0) -> mypackets;
    mypackets[0] -> Print("mine") -> [0]Discard;
    mypackets[1] -> Print("the others") -> Discard;

From Bart Braem, Michael Voorhaen
63
Compound elements in Click Scripts
  • elementclass DumbRouter { $myaddr |
      mypackets :: IPClassifier(dst host $myaddr, -);
      input[0] -> mypackets;
      mypackets[0] -> [1]output;
      mypackets[1] -> [0]output;
    }
    u :: DumbRouter(1.2.3.4);
    FromDevice(eth0) -> u;
    u[0] -> Discard;
    u[1] -> ToDevice(eth0);

From Bart Braem, Michael Voorhaen
64
Running Click
  • Multiple possibilities
  • Kernel module
  • Completely overrides Linux routing
  • High speed, requires root permissions
  • Userlevel
  • Runs as a daemon on a Linux system
  • Easy to install and still fast
  • Recommended
  • nsclick
  • Runs as a routing agent within the ns-2 network
    simulator
  • Multiple routers on 1 system
  • Difficult to install but less hardware needed

From Bart Braem, Michael Voorhaen
65
Where Click Is Used
  • MIT Roofnet (now Meraki Networks)
  • Wireless mesh networks
  • Mazu Networks
  • Network monitoring
  • Princeton's VINI
  • Software-defined radio (Univ. of Colorado)
  • Implemented on NPUs (by a group at Berkeley), FPGAs (by Xilinx and Colorado), and multiprocessors (MIT)

66
The End