Title: Routing algorithms
1Routingalgorithms
- Yaakov (J) Stein Sept 2009
- Chief Scientist
- RAD Data Communications
2Outline
- Control Plane
- principles of IP routing
- longest prefix match
- RIBs and FIBs
-
- Forwarding plane
- classification and lookup mechanisms
- switching fabrics
3 4Routers
- A router is a combination of hardware, software,
and memory - that is responsible for forwarding packets
towards their destinations - Routers generally work at ISO layer 3 (network
layer) - but can also function at layer 2.5 (for MPLS)
- and may inspect higher layers, but only for
optimization - (QoS management, load balancing, etc.)
- Note that Ethernet switches technically filter
rather than forward - In order to correctly fulfill their function
(i.e. to know where to forward) - routers usually run routing protocols
- to exchange information between themselves
- Ethernet switches do not need such protocols as
they learn how to filter - So the router performs 2 distinct algorithms
- forwarding algorithm (forwarding component)
- routing algorithm (control component)
5What does a router do ?
- Control plane (routing algorithm)
- run routing protocols
- identify interface and next hop L2 addresses
- populate RIBs (if Link State, perform SPFs)
- scan all RIBs, and produce FIB (entries map FEC
to NH) - Data plane (forwarding algorithm)
- deframing (CRC/checksum/defragmentation/reassembly
/demapping) - parsing (pulling values from appropriate fields
simple IPv4 DA,
complex finding URL or MIB variable) - FEC classification (add metadata, based on DA,
DAToS, MPLS, ) - lookup / search
- packet modification and replication
- framing
- traffic management and queuing
- compression, encryption, etc.
6IP networks
- IP networks are made up of
- hosts
- middleboxes (e.g., firewalls, NATs, NAPTs,
Application Layer GWs) - routers (obsolete terminology gateways)
- It will be useful to differentiate between
- core routers (connect to other routers)
- edge routers (connect to hosts)
- We will see shortly that it is more complex than
that - To understand how a router is different from
other network elements - we need to know the basic principles the IP
protocol architecture - We will mainly deal with IPv4 unicast forwarding
Routing Slide 6
7The basics (1)
- The first principle of IP is the end-to-end (E2E)
principle - All functionality should be implemented only with
the knowledge - and help of the application at the end points
- The second principle is the hourglass model
- IP (l3) is the common layer
- below IP (L3) is not part of IP suite, above is
- Thus
- most functionality and state is in the hosts
- middlebox functionality is severely limited
- routers are limited to forwarding packets
- without extensive packet manipulation (exception
- TTL) - The third principle is that forwarding is
- connectionless
- on a hop-by-hop basis
Routing Slide 7
8The basics (2)
- The fourth principle is that unicast IP
forwarding is performed - based on a Destination Address (DA)
- Addresses must usually be unique (end-to-end
principle) - Hosts usually have a single IP address, routers
have many addresses - It is the responsibility of a service provider
(SP) to allocate addresses - IP addresses are not arbitrary, like Ethernet MAC
addresses - The fifth principle is that IP addresses are
aggregated into subnetworks - All addresses in a subnetwork share a common
prefix - Subnetworks may be further aggregated
- The sixth principle is that it is the
responsibility of the router - to forward towards the hosts subnet
- but it is not its responsibility to deliver the
packet on the subnet - the IP suite starts above L2
- subnets L2 (e.g., Ethernet, PPP) delivers the
packet to the host
9IP Routing types
- Distance Vector (Bellman-Ford), e.g. RIP, RIPv2,
- IGRP, EIGRP
- send ltaddr,costgt to neighbors
- routers maintain cost to all destinations
- need to solve count to ? problem
- Path Vector, e.g. BGP
- send ltaddr,cost,pathgt to neighbors
- similar to distance vector, but w/o count to ?
problem - like distance vector has slow convergence
- doesnt require consistent topology
- can support hierarchical topology gt exterior
protocol (EGP) - Link State, e.g. OSPF, IS-IS
- send ltneighbor-addr,costgt to all routers
- determine entire flat network topology (SPF -
Dijkstras algorithm) - fast convergence, guaranteed loopless gt
interior routing protocol (IGP) - convergence time is the time taken until all
routers work consistently - before convergence is complete packets may be
misforwarded, and there may be loops
1.1
10 11FIBs
- Based on the 6 principles we can understand what
a router does - The router looks at the packets DA
- It deduces to which subnet the packet belongs
- If the router can directly interface that subnet
- it must use the appropriate L2 to send the
packet to the host - Otherwise it must retrieve the next hop (router)
- that sends the packet towards the subnet
- The next router does the same
- If routing has converged there will be no loops
or black holes - but there may be during transients
- The information needed by the router to properly
forward packets - is stored in the Forwarding Information Base
(FIB) - The FIB associates address prefixes with Next
Hops (NHs) - (and, to save an additional lookup, usually with
L2 addresses as well) - Do not confuse the FIB with a Routing Information
Base (RIB)
12More on FIBs
- Simple (and primitive) routers have a routing
table - modern large routers have several different
databases - The FIB is designed to be fast to search
- we will talk about its data structure later
- RIBs are designed to be fast to update
- which is quite a different structure
- There may be many RIBs, one for each routing
protocol running - and static routes may be entered into any of
them - There are sometime other databases as well
- for example link state routing protocols
require a LSDB - from which the RIB is built
- The basic idea is that we build the FIB from the
RIBs
13Router interfaces
- Routers connect to hosts and to other routers via
interfaces - (from 1 to many thousands of interfaces per
router) - Routers are responsible for forwarding packets
- arriving at an ingress interface
- to an egress interface
- Interfaces have layer 3 and above properties
- and also contain layer a and 2 properties (ports)
- Each interface is assigned a unique IP address
- Interfaces are grouped into subnetworks
- All interfaces on a subnetwork share the same
prefix
14Prefixes and masks
- Since 1993 (RFC 1519 - CIDR) subnets can have any
length prefix - There are two ways of specifying the prefix
length - slash notation, e.g., 192.168.16.0/20
- note unspecified bits are set to zero
- mask notation, e.g., 192.168.16.0 with mask
255.255.240.0 - Note that 192.168.16.0/20 means all addresses
- from 192.168.16.0 through 192.168.31.255
- Note that it contains 192.168.16.0/21,
192.168.24.0/22, etc. - since they are in the range and have longer
prefixes (larger masks) - /32 are fully qualified IP addresses
- 0.0.0.0/0 matches every IP address
- it is the default route
- route taken when there is no matching entry in
FIB - the gateway of last resort
15Prefix tables
slash mask
A.B.0.0/16 255.255.0.0
A.B. 0.0/15 255.254.0.0
A.B. 0.0/14 255.252.0.0
A.B. 0.0/13 255.248.0.0
A.B. 0.0/12 255.240.0.0
A.B. 0.0/11 255.224.0.0
A.B. 0.0/10 255.192.0.0
A.B.0.0/9 255.128.0.0
A.0.0.0/8 255.0.0.0
A.0.0.0/7 254. 0.0.0
A.0.0.0/6 252. 0.0.0
A.0.0.0/5 248. 0.0.0
A.0.0.0/4 240. 0.0.0
A.0.0.0/3 224. 0.0.0
A.0.0.0/2 192. 0.0.0
A.0.0.0/1 128. 0.0.0
0.0.0.0/0 0.0.0.0
slash Mask
A.B.C.D/32 255.255.255.255
A.B.C.D/31 255.255.255.254
A.B.C.D/30 255.255.255.252
A.B.C.D/29 255.255.255.248
A.B.C.D/28 255.255.255.240
A.B.C.D/27 255.255.255.224
A.B.C.D/26 255.255.255.192
A.B.C.D/25 255.255.255.128
A.B.C.0/24 255.255.255.0
A.B.C.0/23 255.255.254.0
A.B.C.0/22 255.255.252.0
A.B.C.0/21 255.255.248.0
A.B.C.0/20 255.255.240.0
A.B.C.0/19 255.255.224.0
A.B.C.0/18 255.255.192.0
A.B.0.0/17 255.255.128.0
Note for /25 D0 or 128 for /26 D 0, 64,
128, or 192 etc.
16Special IP addresses
- Some IP addresses are reserved for special
purposes - they are not assigned by IANA
- and may require special treatment by router
prefix range purpose
0.0.0.0/8 0.0.0.0 0.255.255.255 defaults
10.0.0.0/8 10.0.0.0 10.255.255.255 private addresses
127.0.0.0/8 127.0.0.0 127.255.255.255 loopback addresses
169.254.0.0/16 169.254.0.0 - 169.254.255.255 zeroconf
172.16.0.0/12 172.16.0.0 - 172.31.255.255 private addresses
192.0.2.0/24 192.0.2.0 - 192.0.2.255 Documentation
192.88.99.0/24 192.88.99.0 - 192.88.99.255 IPv6-IPv4 relay
192.168.0.0/16 192.168.0.0 - 192.168.255.255 private addresses
198.18.0.0/15 198.18.0.0 - 198.19.255.255 device benchmark
224.0.0.0/4 224.0.0.0 239.255.255.255 multicast
240.0.0.0/4 240.0.0.0 255.255.255.255 reserved
17Longest prefix match and FECs
- To find the subnet, we need to look at the
packets DA - and to find the best match for the DA that is
known - find known the FIB entry that matches the longest
prefix of the DA (LPM) - All packets that are forwarded in the same way
are grouped - into a Forwarding Equivalence Class (FEC)
- In the simplest case, a FEC is simply a known IP
address prefix - i.e., the packets subnet
- In more complex cases it might get more complex
- for example, ToS field, source address, etc.
- Every packet has to be looked up in the FIB and
classified to a FEC - and this forwarding has to be done fast
18Example
- Assume the following FIB
- and lets look up 192.0.2.131 (last byte
10000011) - It matches the first entry with prefix length 0
(everything matches) - It matches the second with length 24 (first three
bytes) - It matches the third entry with prefix length 25
(last byte 1xxxxxxx) - It does not match the fourth entry (11xxxxxx )
- So the packet is forwarded to next hop C
prefix next hop IP address
0.0.0.0/0 (gateway of last resort) A
192.0.2.0/24 B
192.0.2.128/25 C
192.0.2.192/26 D
19Packet processing time
How much time do we have to process a packet
? disregarding L2 overhead, IPGs, etc. So
the FIB data structure has to be optimized for
fast look-up
100 Mbps 1 Gbps 10 Gbps
64 B 5 msec 500 nanosec 50 nanosec
256 B 20 msec 2 msec 200 nanosec
1500 B 120msec 12 msec 1.2 msec
20Policy and autonomy
- The FIB information is based on
- static routes
- dynamic routes determined by routing protocols
- gateway of last resort (default route) 0.0.0.0/0
- In general we need to apply policy, since
- there are conflicting sources of information
- may not want to use, or even believe, information
received from peers - it is all a matter of autonomy
- an Autonomous System can request service from
another - but can not force it to provide service
21Autonomous systems
- Routers are grouped into Autonomous Systems (ASs)
- ASs may be grouped into domains
- AS look to the outside world as single entity
(they usually have an AS ID) - Routers in the same AS obey a common policy, and
trust each other - AS are truly autonomous
- one AS can request another to forward a packet,
but can not force it to - Inside ASs we run Internal Gateway Protocols
(IGPs) - e.g. OSPF, IS-IS
- Between ASs we run External Gateway Protocols
(EGPs) - e.g., BGP
- A router that runs both is called an AS Border
Router (ASBR)
22More of the story
- Actually, it can get a lot more complicated
- In general, a router will be running multiple
routing protocols - For example
- one or more IGPs (RIP,OSPF, IS-IS) between
routers in the same AS - internal BGP (iBGP) between routers in the same
AS - (usually a full mesh, but when too complex we
can use route reflectors) - external BGP (eBGP) between ASBRs in different
ASs - How does a router know if a BGP session is iBGP
or eBGP ? - by the AS number !
- IGP is used to find a path to another router
(including ASBR) in the same AS - eBGP is used by ASBRs to learn / distribute
routes to other ASs - iBGP is used for ASBR to inform core routers of
external routes
23Simplest example
- Stub ASs (my home router)
- single homed to outside world
- single internal subnet, so dont need IGP
- single homed, so dont need to run BGP to ISP
- dont need to have an AS number
0.0.0.0/0
.1
192.168.0.0/24
.101
.102
.104
.105
.103
24More complex example
- Connecting to a server connected to another ISP
with dual homing - Routers 3 and 6 learn from eBGP how to reach
A.B.C.D - Policy determines that 3 will be used (see later)
- Router 1 learns from iBGP session that A.B.C.D is
reachable via router 3 - Router 1 learns from IGP that router 3 is
reachable via router 2 - Router 2 knows how to directly reach router 3
because of IGP adjacency - Packet from a.b.c.d is forwarded via 1-2-3 to AS
2 and to A.B.C.D
1
a.b.c.d
2
3
7
4
5
6
AS 2
AS 1
eBGP sessions
Full mesh of iBGP sessions
A.B.C.D
25Even more complex example
- Three ASs, with one possibly acting as a transit
domain
AS 1 would like to hand off the traffic to AS
2 AS 2 has no economic incentive to carry this
traffic But AS 2 gets the route from AS3 What can
AS 2 do to stop this ? (remember autonomy!) What
can AS 1 then do ?
26Rules for customer ASs
- Stub AS
- Single-homed AS does not need to learn routes
from provider - It only has to send all traffic via its unique
exit point (0.0.0.0/0) - Provider gets routes from static or IGP or
private-AS eBGP - Multihomed Nontransit AS
- AS advertises only its own routes to both SPs
- AS filters out traffic for foreign routes that
reach it via static/default routing - eBGP is not needed, but recommended for route
propagation and filtering - Multihomed Transit AS
- Uses eBGP to SPs and iBGP for transit traffic
27BGP rules
- There are
- internal routes
- external routes
- customer routes
- When eBGP learns a route it is repeated via iBGP
to all others in AS - thus all routers in AS learn it
- When iBGP learns a route it is repeated only to
externals via eBGP - since internals also get it directly
- When there is another ASBR that can reach the
same other AS - a second route is repeated by iBGP to the ASBR
- The ASBR will then make the decision as to which
to use !
28IGP rules
- IGPs are used between routers in the same AS
- so IGPs do not have sophisticated policy control
- routers usually blindly accept all information
received - For proper operation (no routing loops)
- all routers in AS must have the same IGP RIB
- for link state protocols (OSPF, IS-IS)
- there is a Link State Data Base (LSDB)
- from which IGP RIBs can be constructed (will be
explained shortly) - Because all routers have the same LSDB
- Although the forwarding is hop-by-hop
- the result is the same as if there were
coordination - IGPs
- do not scale to inifinity
- require complete knowledge
- are not suitable for interaction with
non-trusted routers - since a single misconfiguration can be fatal
29LSDBs
- We said before that LS routing protocols have
another database - LSDB contains representation of every router and
link in the AS - implicitly holding the complete topology of the
network - In addition, the LSDB associate costs (metric)
with every link - RIP - the metric is always hop count, no
non-trivial metric - OSPF - the metric is more general, for example
link length -
- These costs form a matrix
- The topology is symmetric, but the costs need not
be
From \ to Router A Router B Router C
Router A M(A,B) M(A,C)
Router B M(B,A) M(B,C)
Router C M(C,A) M(C,B)
30LSDBs and IGP RIBs
- Each router can independently calculate the
least-cost path - to every other router in the AS
- A Shortest Path First (SPF) algorithm (e.g.,
Dijkstras algorithm) - is used to compute a tree of the shortest paths
to all destinations - Each route in the SPF tree is an entire path
- but for each router we can extract the next hop
- and build the RIB for that router (each router
has its own RIB) - From the RIBs we build the FIB needed for
efficient forwarding of packets
31Graph search algorithms
- There are many algorithms for search on graphs
- Breadth first
- Bellman-Ford
- Iterative deepening
- Depth first (backtracking)
- Depth limited
- Best first
- Greedy algorithms
- Dijkstras algorithm
- Beam search
- A
- B
- etc. etc.
- We will discuss graphs, trees, etc. later on
32Dijkstras algorithm
- Graph search algorithm first described by Edsger
Dijkstra in 1959 - It assumes additive, non-negative, costs for each
link in graph - It is a best-first greedy algorithm
- Think of a city street map
- We want to from initial intersection to
destination one with the least walking - Start at the initial intersection its distance
is zero - Measure and label the distances to all adjacent
intersections (breadth first) - Choose the closest one (this is the greedy step)
- Consider all the neighbors of the chosen
intersection - If the distance (sum of the distance to the
chosen intersection and the distance from chosen
intersection to neighbor) is the shortest known
way to get to that neighbor - then remember that distance (not a tree!)
- Once you have considered all neighbors of the
intersection - mark the chosen intersection as visited (its
distance is now known) - Choose the unvisited intersection with shortest
distance - Continue until all intersections have been visited
33Dijkstras algorithm - formal
- Let's call the node we are starting with an
initial node. - The COST of node X will be the distance from the
initial node, - i.e., the sum of distances of all links along the
path from the initial node to X - Initialization
- Set initial nodes COST to zero, all others to
infinity - Set initial node as current, all other as
unvisited - Main step
- For all unvisited neighbours of current node
- Calculate their distances from the initial node
as - COST(neighbor) COST(current) DIST(current
to neighbor) - If this is less than what is presently marked,
overwrite the marking - When all neighbors have been considered, mark
current node as visited - (once visited, this nodes COST is final)
- Recurse
- Select the unvisited node with the smallest COST
as current node - Go to Main step
34Implementation issues
- If our graph has N nodes and L links
- In a straightforward implementation of Dijkstras
algorithm - finding the unvisited node with lowest cost takes
O(N) - this is done N times
- so the total computation for this is O(N2)
- the computation of the distances to every node
takes O(L) - since each link is followed once (to the node it
lands on) - So the total complexity is O(N2 L) O(N2)
- By using more sophisticated data structures
(Fibonacci heap) - this can be reduced to O(N log N L) O(N log
N)
35RIBs to FIB
- So we are finally ready to see how the FIB is
populated - First rejection rules are applied, for example
- do not accept routes from ASs without agreements
- do not accept routes that loop
- (e.g., BGP advertisements with AS number in the
AS-PATH) - Then install FIB entries according to policy, for
example - first install Static routes
- then routes from IGP RIB
- choose eBGP before iBGP
(hot potato
rule- get it out of my network - let someone else
handle) - if there are different routes from BGP
choose
the route with highest local preference - if routes have equal local preference
choose the
route with the shortest AS-PATH - if routes have equal AS-PATHs
choose the route
with the lowest origin number - if still equal choose highest BGP peer address
36Advertisement
- Not everything received is accepted for inclusion
in the FIB - Not everything accepted for inclusion in the FIB
is further advertised - Never advertise information not accepted to FIB !
37 38Forwarding
- Now that we have a FIB installed, lets forward
some packets ! - There are main steps to forwarding
- classification
- switching
- misc. (scheduling, queuing, QoS, compression,
encryption, ) - The operations
- may be stateless vs. statefull
- may change over time
- may involve peeking into the packet (Deep Packet
Inspection) - may be recursive
- Packet forwarding can be done by SW or HW or some
combination - We normally differentiate between
- the fast path (simple forwarding)
- the slow path (control protocol packets)
39Lookup and data structures
- Lookup comes in several varieties, such as
- Exact match (e.g., MAC addresses, VLANs, IP
multicast) - Longest Prefix Matching (LPM) (needed for
searching FIBs) - Range matching (e.g., ports, firewalls)
- In order to optimize lookup, we use appropriate
data structures - Wirths law Programs algorithms
data structures - We need to perform the following on our data
structures - Insert
- Delete
- Modify
- Search
- and to check the following metrics
- Time complexity for each of the above
- Size (spatial) efficiency
- Scalability
40LookUp Table
- The simplest (and fastest) data structure is the
LookUp Table (LUT) - AKA indexed array, Location Addressable Memory
(LAM) - The incoming address is used as an index to
access the (NH) information - We can put in more info, e.g., L2 type and
address, to save further lookups - Example
- Limitations
- only for exact match, not LPM
- limitation only for small number of possible
addresses, e.g. VLANs - We can use LUTs after other data structures that
return a key as an index
address interface NH L2 type L2 NH address
0 1 192.0.2.0 Ethernet 00-17-42-F7-14-14
1 2 192.0.2.16 PPP -
2 3 192.0.2.128 SDH VC ID
41Hash tables
- Hash tables enable handling a large number of
potential addresses - A hashing function is a function
- from a large variable (large number of bits or
large number of bytes) - to a small variable (small number of bits or
bytes) - which is white (small changes in the input create
widely different outputs) - Hashing long addresses returns a short index
- The problem is that the hashing function is not
11 - so there will always be the probability of hash
clashes (collisions) - Solutions
- perfect hashing only when addresses are known
ahead of time - index in table returns list of all addresses
stored - multiple hashing linear probing, quadratic
probing - Hashing is very good for exact match (e.g. MAC
addresses) - but is not suitable for LPM
42Hash implementation
- From an efficiency point of view
- hash tables are between LUTs and search tables
(to be discussed next) - To control collisions, we need a relatively large
table (birthday paradox !) - Example
- 99 probability of collision
- when 3000 entries are put into a hash table of
size 1 million - Using multiple hashing
- average computational load is O(1
keys/table-size) - A primitive hash function is the modulo hash
- H(key) addr mod table-size
- table-size should be prime must not be a power
of 2 or close - A better hash function is (for appropriate
integer m and fraction f) - H(key) Trunc m Frac( f addr)
- The quality of the hashing function depends on f
- For Fibonacci hashing f 1 / ?
- ? is the golden ratio ½ (v5 2) 1.618
23 people ½ 57 people 99
43Search table
- The most spatially efficient data structure is
the search table - It is similar to a LUT
- but the addresses being looked up are not
indexes - rather, we need to sequentially search the table
for the address - We can reduce the search time by ordering the
table - Search tables are good for LPM !
- For example
- Order FIB from most specific (longest prefix) to
least (0.0.0.0/0) - Loop through FIB list until find the first match
row prefix interface NH
0 192.168.16.0/24 1 10.10.1.1
1 192.168.196.0/20 2 10.16.54.2
2 192.168.0.0/17 2 10.16.1.16
3 0.0.0.0/0 3 10.1.1.0
44Limitations of search tables
- Search tables are not limited in size like LUTs
- but this comes at a price
- it is expensive to search for an address
- it is expensive to modify the information for an
address - it is very expensive to insert or delete
addresses (copy!) - Search table FIBs can be rebuilt from RIBs each
time - For exact match it is possible to speed up by
binary search - order by address
- guess position in middle of table range where key
should be - choose new range by comparison
45Linked lists
- Search tables are hard to maintain
- if we need to insert or delete an element we
need extensive moving - Linked lists are designed to simplify mechanics
of such updates - still need exhaustive search to find where to
insert - Linked lists can be singly or doubly linked
- Skip lists increase efficiency by enabling
skipping over ranges
46Linked list implementation
- If we already know where to insert or what to
delete or whom to modify - then linked lists are very efficient
- However, search is O(N) where N is the number of
entries - worst case N
- average case N / 2
- Properly constructed multi-level skip lists take
O(log N) on average - but are still O(N) in the worst case
- But if we already need to use double pointers
- then trees are better
47Tree structures
- A graph is a collection of nodes and edges
connecting them - In a directed graph the edges have direction
- and so or every edge there is a father node and
a child node - Nodes without children are called leaves
- A forest is a directed graph without loops (only
one path between 2 nodes) - A tree is a forest with a single root -- it thus
defines a partial order - (it is conventional to draw the tree upsidedown)
- A binary tree has no more than 2 output edges per
node - Trees can be implemented using arrays, pointers
(like linked lists), heaps, etc.
48Search trees
- Search trees may store data
- in their nodes
- in leaves only
- on edges
- combinations of the above
- For general search trees searching can be
breadth-first or depth-first - Breadth-first
- start at the root
- find all children of root and check for desired
data - if not found, find all childrens children
- recurse
- Depth first
- start at the root
- find first child and check for desired data
- if not found, find first child of first child
- recurse until data found or leaf
- if leaf backtrack to father and try the next
child
49Binary trees
- Binary Search Trees simplify storage and
manipulation - We call the children of a node L and R
- Perfect binary trees have exactly 2 children for
each internal node - Thus there are exactly 2H leaves, where H is the
height of the tree - Altogether there are (2H1 1) nodes
- To store keys in a binary search tree
- place keys in all nodes
- for every intermediate node N
- key(L) lt key(N)
- key(R) gt key(N)
- To search for a key in a BST
- start at root (set current node to root)
- check if key is stored in current node
- if not if key lt key(N) then set current to L
else set current to R - In the worst case this takes only H comparisons
(for perfect trees O(log N))
50Balanced binary trees
- Since the search complexity is proportional to
the BSTs height H - we would like to use BSTs with minimal height
- Balanced BSTs are not perfect BSTs, but as close
as possible to perfect - It is more complex to build a balanced BST, but
faster to search
51Tries
- From retrieve, but usually pronounced try
- A trie is an ordered tree with
- subvalues on edges
- values at leaves
- Tries are related to prefix search
representations - Tries were originally developed for exact match
- Tries can be binary or not
- Tries are good for all lexicographical searches
O(log n) - but particularly efficient for LPM
- Trie variants are the fastest known lookups -
O(log log n) - LC-trie used in modern Linux router
implementations
52Example trie
- Assume the following FIB
- 10.0.128.0/16
- 10.0.0.0/8
- 192.168.0.0/16
- 192.168.128.0/24
- 0.0.0.0/0
- Note
- in the plain trie,
- all IP prefixes have the same length
53Example compact trie
- Same FIB
- 10.0.128.0/16
- 10.0.0.0/8
- 192.168.0.0/16
- 192.168.128.0/24
- 0.0.0.0/0
- Now maximum length is 2
- We could also save memory by exploiting common
suffixes
54More trie variants
- Patricia trie
- Practical Algorithm to Retrieve Information Coded
in Alphanumeric - Binary trie which store in nodes the number of
bits - to skip before next decision point
- For LPM store prefixes in internal nodes
- Bucket tries
- Store multiple keys in leaves
- Multibit tries (M-tries)
- Fewer branching decisions than binary tries
- Multibit strides (usually variable strides)
- Level-compressed tries (LC-tries)
- Replace perfect subtrees in binary trie with
single degree 2k nodes - LPC trie level and path compressed
- And there are many more (Lulea, full-tree, )
55Example LC-trie
56Content Addressable Memory
- CAM (AKA associative memory)
- Addressable by content, rather than by location
(LAM) - Special purpose hardware
- Fastest possible lookup (essentially searches
entire table in one clock) - but limited in size
- usually drives regular memory for additional
storage - Binary CAM (BCAM) stores 0 or 1 in each bit
- Ternary CAM (TCAM) allows wildcards
- Can be used for LPM
- Can prioritize solution by
- number of bits matched
- order in table
- CAM technology today
- 32 to 144 bit keys
- 128K 512 K memories
- hundreds of millions searches per second
57Example 3 bit BCAM
encode output to access LAM
58Other uses of lookup
- Deep Packet Inspection
- URL lookup (often partial or with wildcards)
- can use Trees and tries
- XML information, patterns, etc.
- multiple encapsulations
- e.g. Ethernet in IP, MPLS over IP, etc.
- there are special-purpose languages to describe
such cases - Firewalls, Access Control
- source/destination IP addresses
source/destination ports 4-tuples - TCP/UDP ports in ranges
59Switching
- Once the packet has been classified we need to
properly forward it - The classifier result (the FEC) becomes a
metadata field - that controls the switch
- A switch fabric is combination of HW and SW
components - that enable moving the packet from input
interface to output interface - The metaphor is borrowed from woven material
- At low packets per second (pps) switching can be
all software - At high speeds highly parallel hardware is needed
60Blocking
- The major design constraint in switches is
blocking - Blocking occurs when some resource is being used
- and the present packet can not be processed
- We differentiate three types of blocking
- Output blocking (the egress port is in use)
- Internal blocking (some internal resource of the
switch is in use) - Input (head of line) blocking (packet can not
enter the switch) - A switch which is designed such that there is
never internal blocking - is called a nonblocking switch
- To alleviate blocking we can add buffers to store
the waiting packets - According to the type of blocking we wish to
avoid we have - output queues
- internal buffers
- input queues
61Switch types
- In the TDM days there were two switching
mechanism types - time domain switches (time slot interchange)
- spatial domain switches
- For packet networks there are no time slots
- but time domain switching can still be used
- shared memory switches (packets from all inputs
placed in single memory) - shared bus switches (all inputs and outputs share
a ring/hypercube, etc. ) - In spatial switching
- the packets destined for different output
interfaces - travel different internal paths
- The simplest spatial switch is the crossbar
- invented for analog telephone circuits
- does not scale well O(N2) possible connections
- Multilayer crossbars can scale better
62Shared memory/bus
- Shared memory switches are simple and cheap to
implement - The memory is the heart of the fabric
- it determines the throughput and delay
- The memory must run N times faster than the
ingress rates - Packets may be divided into frame buffers
- Scalability can be achieved by parallelism (bit
or byte slicing) - The architecture breaks down at high speeds
- Shared bus switches are similar in principle to
shared memory ones - The shared medium is the heart of the switch
- and determines its characteristics
- Scalability can be achieved by parallelism
(parallel buses) - The architecture breaks down at high speeds
63Multistage crossbars
- Crossbars are fast, but
- are not nonblocking
- do not scale
- These problems are solved by multistage crossbars
- The only blocking of a single stage crossbar is
output blocking - more than one packet needs to leave the egress
interface - Time-space switches solve this by sorting the
frames before switching - More complex architectures are time-space-time
switches - There are many multistage solutions to the
scaling problem
64Clos network
- The Clos network is more efficient than a single
NN crossbar - We divide N into r groups of n N n r
- It has 3 stages of crossbars
- the first stage has r groups of nn crossbars
- the second stage has n groups of rr crossbars
- the third stage has r groups of nn crossbars
- The shuffle rule Xth output of Yth crossbar
connects to Yth input of Xth crossbar - Instead of N2 n2r2 connections
- just r2 n 2 r n2
- For example
- if r n vN
- then instead of n4 just 3n 3
- a reduction by 3/n 3 / vN
65Benes network
- The Benes network contains only 22 crossbars
- and is built recursively
- For N 2n, the network has 2n-1 stages and 4 N
log (N-1/2) connections - The shuffle rule is that 1st outputs go to the
1st block, 2nd to the 2nd
Benes(n-1)
Benes(n-1)
66Banyan network
- The Banyan network is a binary tree of 22
crossbars - with exactly one path from each ingress
interface to each egress interface - At each stage the next bit of the identifier
determines the switching - Banyan switches are time and space efficient
- O(log N) delay
- ½ N log N 22 crossbars
- The main problem is internal blocking at any of
the crossbars - One way to relieve this is to speed up the
crossbars and to add buffers
shuffle
shuffle
67Batchers sort
- To prevent internal blocking in a Banyan switch
- we can first sort the frames according to egress
interface - When this is done with Batchers parallel sorter
- ½ log N ( log N 1) stages of simple sorters
- we get the Batcher-Banyan switch