Data plane algorithms in routers

Slides: 31

Provided by: ComputerS110

Learn more at: http://archive.dimacs.rutgers.edu

1
Data plane algorithms in routers

From prefix lookup to deep packet inspection
Cristian Estan, University of Wisconsin-Madison

2
What is the data plane?

The part of the router handling the traffic
Data plane algorithms are applied to every packet
Successive packets are typically treated independently
Example: deciding on which link to send a packet
Throughput, defined as the number of packets or bytes handled per second, is very important
Line speed: keeping up with the rate at which traffic can be transmitted over the wire or fiber
Example: a 10 Gbps router has 32 ns to handle a 40-byte packet
Memory usage is limited by technology and costs
Can afford at most tens of megabits of fast on-chip memory
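
The 32 ns figure follows directly from the link rate and the packet size. A minimal sketch of that arithmetic (the function name `time_budget_ns` is made up for illustration):

```python
def time_budget_ns(link_rate_bps: float, packet_bytes: int) -> float:
    """Time it takes to transmit one packet at line speed, in nanoseconds."""
    return packet_bytes * 8 * 1e9 / link_rate_bps


# 40-byte packet on a 10 Gbps link: 320 bits at 10 bits per nanosecond
print(time_budget_ns(10e9, 40))
```

Every per-packet algorithm in the rest of the talk has to fit its memory accesses into this budget.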

3
A generic data plane problem

Router has many directives, each composed of a guard and an associated action (all guards distinct)
There is a simple procedure for testing how well a guard matches a packet
For each packet, find the guard that matches best and take the associated action
Example: routing table lookup
Each guard is an IP prefix (between 0 and 32 bits)
Matching procedure: is the guard a prefix of the 32-bit destination IP address?
Best: defined as longest matching prefix
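
As a baseline, the generic problem can be solved by testing every guard in turn. The sketch below does this for longest prefix matching. It assumes prefixes and addresses are written as bit strings, and it borrows the nine-prefix routing table from the trie figure later in the talk, with the labels P1 to P9 standing in for actions; the function name `best_match` is hypothetical.

```python
# Guards are bit-string prefixes; actions are the labels P1..P9.
TABLE = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}


def best_match(guards, addr_bits):
    """Test every guard in sequence and keep the longest matching one."""
    best_prefix, best_action = None, None
    for prefix, action in guards.items():
        if addr_bits.startswith(prefix):
            if best_prefix is None or len(prefix) > len(best_prefix):
                best_prefix, best_action = prefix, action
    return best_action


print(best_match(TABLE, "11000010"))
```

The cost grows linearly with the number of guards, which is why the data structures on the following slides exist.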

4
The rules of the game

Matching against all guards in sequence is too slow
We build a data structure that captures the semantics of all guards and use it for matching
Primary metrics
How fast the matching algorithm is
How much memory the data structure needs
Time to build the data structure also has some importance
We can cheat (but we won't today) by
Using binary or ternary content-addressable memories
Using other forms of hardware support

5
Measuring algorithm complexity

Execution cost measured in number of memory accesses to read data structure
Actual data manipulation operations typically very simple
On some platforms we can read wide words
Worst case performance most important
Worst case defined with respect to input, not guards
Caching has been proven ineffective for many settings
Using algorithms with good amortized complexity but bad worst case requires large buffers

6
Overview

Longest matching prefix
Trie-based algorithms
Uni-bit and multi-bit tries (fixed stride and
variable stride)
Leaf pushing
Bitmap compression of multi-bit trie nodes
Tree bitmap representation for multi-bit trie
nodes
Binary search on ranges
Binary search on prefix lengths
Classification on multiple fields
Signature matching

7
Longest matching prefix

Used in routing table lookup (a.k.a. forwarding) for finding the link on which to send a packet
Guard: a bit string of 0 to w bits called IP prefix
Action: a single byte interface identifier
Input: a w-bit string representing the destination IP address of the packet (w is 32 for IPv4, 128 for IPv6)
Output: the interface associated with the longest guard matching the input
Size of problem: hundreds of thousands of prefixes

8
[Figure: one routing table shown as a uni-bit trie, expanded by controlled prefix expansion with stride 3, and stored as a multi-bit trie with fixed stride, with leaf pushing, and with variable stride]

Routing table
P1 0
P2 1
P3 100
P4 1000
P5 100000
P6 101
P7 110
P8 11001
P9 111

Controlled prefix expansion with stride 3
P1 000, 001, 010, 011
P2 100, 101, 110, 111 (all replaced by longer prefixes)
P3 100
P4 100000, 100001, 100010, 100011 (100000 replaced by P5)
P5 100000
P6 101
P7 110
P8 110010, 110011
P9 111

Leaf pushing reduces memory usage but increases update time
Given a maximum trie height h and a routing table of size n, a dynamic programming algorithm computes the optimal variable stride trie in O(n w^2 h)
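
The uni-bit trie from the figure can be sketched as follows. This is a minimal illustration, assuming bit-string prefixes and the routing table above; the names `Node`, `build_unibit_trie` and `lookup` are hypothetical.

```python
TABLE = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}


class Node:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and for bit 1
        self.label = None             # prefix label stored at this node


def build_unibit_trie(table):
    root = Node()
    for prefix, label in table.items():
        node = root
        for bit in prefix:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        node.label = label
    return root


def lookup(root, addr_bits):
    """Walk one bit at a time, remembering the last label seen."""
    node, best = root, root.label
    for bit in addr_bits:
        node = node.children[int(bit)]
        if node is None:
            break
        if node.label is not None:
            best = node.label
    return best


root = build_unibit_trie(TABLE)
print(lookup(root, "11000010"))
```

Each level costs one memory access, so a lookup can take up to w accesses; the multi-bit variants trade memory for fewer levels.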
9
[Figure: same routing table and tries as on the previous slide, with a lookup traced through them]
Input: 11000010
Longest matching prefix (matches found so far): P2, P7
10
[Figure: same routing table and tries as on the previous two slides, with the lookup completed]
Input: 11000010
Longest matching prefix: P7
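
A fixed-stride multi-bit trie built with controlled prefix expansion can be sketched as below. This is an illustration under stated assumptions, not the layout from the figure: it uses stride 3, does no leaf pushing (each entry keeps both a label and a child pointer), represents nodes as Python lists, and only examines whole strides of the address. All names are hypothetical.

```python
STRIDE = 3
TABLE = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}


def new_node():
    # One entry per stride value: [label, length of the prefix bits in this node, child]
    return [[None, -1, None] for _ in range(1 << STRIDE)]


def insert(root, prefix, label):
    node = root
    while len(prefix) > STRIDE:              # descend one full stride at a time
        entry = node[int(prefix[:STRIDE], 2)]
        if entry[2] is None:
            entry[2] = new_node()
        node, prefix = entry[2], prefix[STRIDE:]
    free = STRIDE - len(prefix)              # bits to fill in by expansion
    base = (int(prefix, 2) << free) if prefix else 0
    for i in range(base, base + (1 << free)):
        if len(prefix) > node[i][1]:         # a longer original prefix wins
            node[i][0], node[i][1] = label, len(prefix)


def lookup(root, addr_bits):
    node, best, pos = root, None, 0
    while node is not None and pos + STRIDE <= len(addr_bits):
        label, _, child = node[int(addr_bits[pos:pos + STRIDE], 2)]
        if label is not None:
            best = label
        node, pos = child, pos + STRIDE
    return best


root = new_node()
for p, name in TABLE.items():
    insert(root, p, name)
print(lookup(root, "11000010"))
```

For the input 11000010 the root entry 110 yields P7 and the child entry 000 is empty, so the lookup touches two nodes instead of up to six in the uni-bit trie.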
11
Lulea bitmap compression

Compressed node
Repeating entries are stored only once in the compressed array. An auxiliary bitmap is needed to find the right entry in the compressed node. It stores a 0 for positions that do not differ from the previous one.
[Figure: node with entries 000 P1, 001 P1, 010 P1, 011 P1, 100 (no label), 101 P5, 110 (no label), 111 P9; auxiliary bitmap 10001111; compressed array P1, (no label), P5, (no label), P9]

Bitmap supporting fast counting
When the compression bitmaps are large it is expensive to count bits during lookup. The bitmap is divided into chunks and a pre-computed auxiliary array stores the number of bits set before each chunk. The lookup algorithm needs to count only bits set within one chunk.
[Figure: 32-bit bitmap 10001110 10101001 10101110 00001110 divided into four 8-bit chunks; auxiliary array with the number of bits set before each chunk: 00 -> 0, 01 -> 4, 10 -> 8, 11 -> 13]

Representing node as tree bitmap
Pointers to children and prefixes are stored in separate structures. Prefixes of all lengths are stored, thus leaf pushing is not needed and update is fast. Bitmaps have 1s corresponding to entries that are not empty.
[Figure: root node for the routing table P1 to P9 with input 11001010. Prefix bitmap: 0 -> 1, 1 -> 1, 00 to 11 -> 0, 000 to 011 -> 0, 100 to 111 -> 1; child bitmap over 000 to 111: 00001010; stored prefixes P1, P2, P3, P6, P7, P9. The figure marks P2 and P7 under "Longest matching prefix"]
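
The chunked counting scheme can be sketched as follows, using the 32-bit bitmap and the 8-entry node from the figure. This is a simplified illustration: bitmaps are kept as strings, whereas a real implementation would use machine words and a population-count instruction. The function names are hypothetical.

```python
CHUNK = 8
BITMAP = "10001110" "10101001" "10101110" "00001110"


def build_counts(bitmap, chunk=CHUNK):
    """Auxiliary array: number of bits set before each chunk."""
    counts, total = [], 0
    for start in range(0, len(bitmap), chunk):
        counts.append(total)
        total += bitmap[start:start + chunk].count("1")
    return counts


def compressed_index(bitmap, counts, pos, chunk=CHUNK):
    """Index in the compressed array of the entry that covers position pos."""
    start = pos - pos % chunk
    ones = counts[pos // chunk] + bitmap[start:pos + 1].count("1")
    return ones - 1  # position 0 always has its bit set


COUNTS = build_counts(BITMAP)
print(COUNTS)

# The 8-entry node from the figure: a single chunk, so its auxiliary array is [0].
NODE_BITMAP = "10001111"
NODE_ARRAY = ["P1", None, "P5", None, "P9"]  # None marks entries without a label
print(NODE_ARRAY[compressed_index(NODE_BITMAP, [0], 0b011)])
```

Only the bits of one chunk are counted at lookup time, so the cost is independent of the total bitmap size.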

12
Binary search on ranges

Divide w-bit address space into maximal
continuous ranges covered by same prefix
Build array or balanced (binary) search tree with
boundaries of ranges
At lookup time perform O(log(n)) search
Not better than multi-bit tries with compression,
but it is not covered by patents
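
A minimal sketch of the range approach, assuming the nine-prefix routing table from the trie slides and an 8-bit address space. The table of range starts is built naively here (one linear scan per boundary); only the lookup is meant to be representative. Names such as `build_ranges` are hypothetical.

```python
from bisect import bisect_right

TABLE = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}


def build_ranges(table, w):
    """Return sorted range start addresses and the label covering each range."""
    points = {0}
    for prefix in table:
        free = w - len(prefix)
        lo = (int(prefix, 2) << free) if prefix else 0
        points.add(lo)                        # first address covered by the prefix
        if lo + (1 << free) < (1 << w):
            points.add(lo + (1 << free))      # first address after the prefix
    starts, labels = [], []
    for point in sorted(points):
        bits = format(point, "0%db" % w)
        match = max((p for p in table if bits.startswith(p)), key=len, default=None)
        label = table.get(match)
        if not labels or labels[-1] != label:  # merge neighbours with equal label
            starts.append(point)
            labels.append(label)
    return starts, labels


def lookup(starts, labels, addr):
    """O(log n) search for the range that contains addr."""
    return labels[bisect_right(starts, addr) - 1]


STARTS, LABELS = build_ranges(TABLE, 8)
print(lookup(STARTS, LABELS, 0b11000010))
```

With n prefixes there are at most 2n + 1 range boundaries, so the array stays linear in the size of the routing table.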

13
Binary search on prefix lengths

Core idea: for each prefix length represented in the routing table, have a hash table with the prefixes
Can find longest matching prefix after looking up in each hash table the prefix of the address with corresponding length
Binary search on prefix lengths is faster
Simple but wrong algorithm: if you find a prefix at length x, store it as best match and look for longer matching prefixes, otherwise look for shorter prefixes
Problem: what if there is both a shorter and a longer prefix, but no prefix at length x?
Solution: insert a marker at length x when there are longer prefixes. Must store with the marker the longest matching shorter prefix. Markers lead to moderate increase in memory usage.
Promising algorithm for IPv6 (w = 128)
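
The marker scheme can be sketched as below. This is a simplified illustration, assuming bit-string prefixes, Python dictionaries as the hash tables, and the routing table from the trie slides; the best matching prefix stored with each marker is computed naively at build time. All names are hypothetical.

```python
TABLE = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}


def build(table):
    lengths = sorted({len(p) for p in table})
    tables = {n: {} for n in lengths}         # one hash table per prefix length

    def best_prefix_label(bits):
        match = max((p for p in table if bits.startswith(p)), key=len, default=None)
        return table.get(match)

    for prefix in table:
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:                       # replay the search that must find prefix
            mid = (lo + hi) // 2
            n = lengths[mid]
            if n < len(prefix):
                # Marker: tells the search to continue with longer lengths and
                # carries the longest real prefix matching the marker string.
                tables[n].setdefault(prefix[:n], best_prefix_label(prefix[:n]))
                lo = mid + 1
            elif n > len(prefix):
                hi = mid - 1
            else:
                tables[n][prefix] = table[prefix]
                break
    return lengths, tables


def lookup(lengths, tables, addr_bits):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        n = lengths[mid]
        entry = tables[n].get(addr_bits[:n], KeyError)
        if entry is KeyError:                 # miss: only shorter lengths can match
            hi = mid - 1
        else:                                 # prefix or marker: try longer lengths
            if entry is not None:
                best = entry
            lo = mid + 1
    return best


LENGTHS, TABLES = build(TABLE)
print(lookup(LENGTHS, TABLES, "11000010"))
```

For the input 11000010 the first probe (length 4) hits the marker 1100 left by P8, which carries P7; the next probe at length 5 misses, and P7 is returned without ever probing length 3.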

14
Papers on longest matching prefix

G. Varghese, Network algorithmics: an interdisciplinary approach to designing fast networked devices, chapter 11, Morgan Kaufmann, 2005
V. Srinivasan, G. Varghese, Faster IP lookups using controlled prefix expansion, ACM Trans. on Comp. Sys., Feb. 1999
M. Degermark, A. Brodnik, S. Carlsson, S. Pink, Small forwarding tables for fast routing lookups, ACM SIGCOMM, 1997
W. Eatherton, Z. Dittia, G. Varghese, Tree Bitmap: Hardware / Software IP Lookups with Incremental Updates, http://www-cse.ucsd.edu/~varghese/PAPERS/willpaper.pdf
B. Lampson, V. Srinivasan, G. Varghese, IP lookups using multiway and multicolumn search, IEEE Infocom, 1998
M. Waldvogel, G. Varghese, J. Turner, B. Plattner, Scalable high-speed IP lookups, ACM Trans. on Comp. Sys., Nov. 2001

15
Overview

Longest matching prefix
Classification on multiple fields
Solution for two-dimensional case: grid of tries
Bit vector linear search
Cross-producting
Decision tree approaches
Signature matching

16
Packet classification problem

Required for security, recognizing packets with quality of service requirements
Guard: prefixes or ranges for k header fields
Typically source and destination prefix, source and destination port range, and exact value or wildcard (*) for protocol
All fields must match for rule to apply
Action: drop, forward, map to a certain traffic class
Input: a tuple with the values of the k header fields
Output: the action associated with the first rule that matches the packet (rules are strictly ordered)
Size of problem: thousands of classification rules

17
Example of classification rule set
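[Figure: the network Net, which contains the mail gateway M and the internal time server TI, is connected to the Internet through a router that filters traffic; the external time server TO and the secondary name server S are outside Net]

Destination IP  Source IP  Dest Port  Src Port  Protocol  Action
M               *          25         *         *         R1
M               *          53         *         UDP       R2
M               S          53         *         *         R3
M               *          23         *         *         R4
TI              TO         123        123       UDP       R5
*               Net        *          *         *         R6
Net             *          *          *         TCP/ACK   R7
*               *          *          *         *         R8
(* matches any value)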
18
A geometric view of packet classification
[Figure: rules R1, R2 and R3 drawn as overlapping regions in the plane spanned by the source address space and the destination address space]

In theory the number of regions defined can be much larger than the number of rules
Any algorithm that guarantees O(n) space for all rule sets of size n needs O(log(n)^(k-1)) time for classification

19
The two-dimensional case: source and destination IP addresses

For each destination prefix in the rule set, link to the corresponding node in the destination IP trie a trie with the source prefixes of the rules using this destination prefix
Matching algorithm must use backtracking to visit all source tries
Grid of tries: by pre-computing switch pointers in destination tries and propagating some information about more general rules, matching may proceed without backtracking
Memory used proportional to number of rules
Matching time O(w) with constant depending on stride
Extended grid of tries handles 5 fields and has good run time and memory in practice

20
[Figure: per-field bitmaps for the example rule set (rule set as on slide 17, with the protocol of R7 shown as TCP). Each bitmap has one bit per rule, R1 to R8 from left to right; * stands for any other value]

Dest IP
M    11110111
TI   00001111
Net  00000111
*    00000101

Source IP
S    11110011
TO   11011011
Net  11010111
*    11010011

Dest Port
25   10000111
53   01100111
23   00010111
123  00001111
*    00000111

Src Port
123  11111111
*    11110111

Proto
UDP  11111101
TCP  10110111
*    10110101

Example lookup: 00000101 AND 11010011 AND 00000111 AND 11110111 AND 10110111 = 00000001, so the first matching rule is R8

Bit vector approaches do linear search through the rule set
For each field we pre-compute a structure (e.g. trie) to find the most specific prefix or range distinguished by the rule set
For each rule, a single bit represents whether a given most specific prefix matches the rule or not
We associate with each range a bitmap of size n encoding which of the rules may match a packet in that prefix or range
Classification algorithm first computes for each field of the packet the most specific prefix/range it belongs to
By then AND-ing together the k bitmaps of size n we find matching rules
Works well for hardware solutions that allow wide memory reads
Scales poorly to large rule sets
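
The AND step can be sketched with the per-field bitmaps of the example rule set. The sketch assumes the per-field lookup has already mapped each header field to its most specific value (the dictionary keys below); the field names `dst_ip`, `src_ip` and so on, and the function `classify`, are hypothetical.

```python
N = 8  # number of rules; the leftmost bit of each bitmap is R1
BITMAPS = {
    "dst_ip":   {"M": "11110111", "TI": "00001111", "Net": "00000111", "*": "00000101"},
    "src_ip":   {"S": "11110011", "TO": "11011011", "Net": "11010111", "*": "11010011"},
    "dst_port": {"25": "10000111", "53": "01100111", "23": "00010111",
                 "123": "00001111", "*": "00000111"},
    "src_port": {"123": "11111111", "*": "11110111"},
    "proto":    {"UDP": "11111101", "TCP": "10110111", "*": "10110101"},
}


def classify(keys):
    """AND the k bitmaps and return the first (highest priority) matching rule."""
    acc = (1 << N) - 1
    for field, key in keys.items():
        acc &= int(BITMAPS[field][key], 2)
    if acc == 0:
        return None
    # The most significant set bit corresponds to the first matching rule.
    return "R%d" % (N - acc.bit_length() + 1)


packet = {"dst_ip": "*", "src_ip": "*", "dst_port": "*", "src_port": "*", "proto": "TCP"}
print(classify(packet))
```

Each bitmap is n bits wide, so the work per packet is linear in the number of rules, divided by the memory word width.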

21
[Figure: cross-producting on the example rule set (rule set as on slide 17, with the protocol of R7 shown as TCP); * stands for any other value]

Values recognized per field
Dest IP: M, TI, Net, *
Src IP: S, TO, Net, *
Dest Port: 25, 53, 23, 123, *
Src Port: 123, *
Proto: UDP, TCP, *

Cross product table (4 x 4 x 5 x 2 x 3 = 480 entries)
Index  Cross product     Action
0      M,S,25,123,UDP    R1
1      M,S,25,123,TCP    R1
2      M,S,25,123,*      R1
3      M,S,25,*,UDP      R1
4      M,S,25,*,TCP      R1
5      M,S,25,*,*        R1
...
478    *,*,*,*,TCP       R8
479    *,*,*,*,*         R8

Cross-producting performs longest prefix matching separately for all fields and combines the results in a single step by looking up the matching rule in a pre-computed table explicitly listing the first matching rule for each element of the cross-product. The size of this table is the product of the numbers of recognized prefixes/ranges for the individual fields. Due to its memory requirements this method is not feasible.

Equivalenced cross-producting (a.k.a. recursive flow classification or RFC) combines the results of the per-field longest matching prefix operations two by two. The pairs of values are grouped in equivalence classes and in general there are far fewer equivalence classes than pairs of values. This leads to significant memory savings as compared to simple cross-producting. This algorithm provides fast packet classification, but compared to other algorithms, the memory requirements are relatively large (but feasible in some settings).

Dest IP - Src IP table (16 entries, 8 distinct classes)
Index  Dest IP,Src IP  Rule bitmap  Class
0      M,S             11110011     C1
1      M,TO            11010011     C2
2      M,Net           11010111     C3
3      M,*             11010011     C2
4      TI,S            00000011     C4
5      TI,TO           00001011     C5
6      TI,Net          00000111     C6
7      TI,*            00000011     C4
8      Net,S           00000011     C4
9      Net,TO          00000011     C4
10     Net,Net         00000111     C6
11     Net,*           00000011     C4
12     *,S             00000001     C7
13     *,TO            00000001     C7
14     *,Net           00000101     C8
15     *,*             00000001     C7

[Figure: the per-field results for Src IP, Dest IP, Src Port, Dest Port and Proto are combined two by two until a final result is reached]
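
The first combination step of equivalenced cross-producting can be sketched with the Dest IP and Src IP bitmaps of the example rule set. Two pairs fall in the same class when the AND of their rule bitmaps is identical. The function name `combine` is hypothetical, and classes are numbered from 0 in order of first appearance rather than C1 to C8.

```python
DST = {"M": 0b11110111, "TI": 0b00001111, "Net": 0b00000111, "*": 0b00000101}
SRC = {"S": 0b11110011, "TO": 0b11011011, "Net": 0b11010111, "*": 0b11010011}


def combine(left, right):
    """Map every (left, right) pair to an equivalence class id."""
    class_of_bitmap = {}
    table = {}
    for lkey, lbits in left.items():
        for rkey, rbits in right.items():
            bitmap = lbits & rbits
            # Pairs with the same combined bitmap share one class.
            cls = class_of_bitmap.setdefault(bitmap, len(class_of_bitmap))
            table[(lkey, rkey)] = cls
    classes = {cls: bitmap for bitmap, cls in class_of_bitmap.items()}
    return table, classes


PAIRS, CLASSES = combine(DST, SRC)
print(len(PAIRS), len(CLASSES))
```

The next step would combine the class ids (with their bitmaps) with the result of another field pair in the same way, so later tables are indexed by class ids instead of raw value pairs.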
22
Decision tree approaches

At each node of the tree, test a bit in a field or perform a range test
Large fan-out leads to shallow trees and fast classification
Leaves contain a few rules traversed linearly
Interior nodes may also contain rules that match
Tests may look at bits from multiple fields
A rule may appear in multiple nodes of the decision tree; this can lead to increased memory usage
Tree built using heuristics that pick fields to compare on that divide remaining rules relatively evenly among descendants
Fast and compact on rule sets used today

23
Papers on packet classification

G. Varghese, Network algorithmics, chapter 12
V. Srinivasan, G. Varghese, S. Suri, M. Waldvogel, Fast and Scalable Layer Four Switching, ACM SIGCOMM, Sep. 1998
F. Baboescu, S. Singh, G. Varghese, Packet classification for core routers: Is there an alternative to CAMs?, IEEE Infocom, 2003
P. Gupta, N. McKeown, Packet classification on multiple fields, ACM SIGCOMM, 1999
T. Woo, A modular approach to packet classification: Algorithms and results, IEEE Infocom, 2000
S. Singh, F. Baboescu, G. Varghese, Packet classification using multidimensional cutting, SIGCOMM, 2003

24
Overview

Longest matching prefix
Classification on multiple fields
Signature matching
String matching
Regular expression matching w/ DFAs and D2FAs

25
Signature matching

Used in intrusion prevention/detection, application classification, load balancing
Guard: a byte string or a regular expression
Action: drop packet, log alert, set priority, direct to specific server
Input: byte string from the payload of packet(s)
Hence the name "deep packet inspection"
Output: the positions at which various signatures match, or the identifier of the highest priority signature that matches
Size of problem: hundreds of signatures per protocol

26
String matching

Most widely used early form of deep packet inspection, but the more expressive regular expressions have superseded strings by now
Still used as pre-filter to more expensive matching operations by popular open source IDS/IPS Snort
Matching multiple strings: a well-studied problem
A. Aho, M. Corasick, Efficient string matching: An aid to bibliographic search, Communications of the ACM, June 1975
Many hardware-based solutions published in last decade
Matching time independent of number of strings, memory requirements proportional to sum of their sizes
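
A compact sketch of the Aho-Corasick automaton cited above. It works on Python strings for readability (packet payloads would be bytes), stores transitions in dictionaries, and follows failure links at match time instead of pre-computing a full transition table. The function names and the pattern set are illustrative assumptions.

```python
from collections import deque


def build_automaton(patterns):
    goto, fail, out = [{}], [0], [set()]
    for pat in patterns:                       # phase 1: trie of all patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pat)
    queue = deque(goto[0].values())            # phase 2: failure links, breadth first
    while queue:
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]             # inherit matches ending at the suffix
    return goto, fail, out


def search(text, automaton):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


AUTOMATON = build_automaton(["he", "she", "his", "hers"])
print(sorted(search("ushers", AUTOMATON)))
```

The text is scanned once regardless of how many patterns are loaded, and the automaton has at most one state per pattern character, which matches the time and memory claims on the slide.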

27
Regular expression matching

Deterministic and non-deterministic finite
automata (DFAs and NFAs) can match regular
expressions
NFAs more compact but require backtracking or
keeping track of sets of states during matching
Both representations used in hardware and
software solutions, but only DFA based solutions
can guarantee throughput in software
DFAs have a state space explosion problem
From DFAs recognizing individual signatures we
can build a DFA that recognizes entire signature
set in a single pass
Size of combined DFA much larger than sum of
sizes for DFAs recognizing individual signatures
Multiple combined DFAs are used to match
signature set

28
S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection, ACM SIGCOMM, September 2006

[Figure: transition tables for states 0 to 3 of a deterministic finite automaton (DFA) and of the corresponding delayed input DFA (D2FA) with default transitions, shown with an input and the current state]

If the current state variable meets an acceptance condition (e.g. whether the state identifier is larger than a given threshold), the automaton raises an alert.

D2FAs build on the observation that for many pairs of states, the transition tables are very similar and it is enough to store the differences. The lookup algorithm may need to follow multiple default transitions until it finds a state that explicitly stores a pointer to the next state it needs to transition to. Since this is a throughput concern, the algorithm for constructing D2FAs allows the user to set a limit on the length of the maximum default path.

                      D2FAs with no bound on default path length   D2FAs d.p.l. <= 4
Set of regular expr.  Avg. d.p.l.  Max d.p.l.  Memory              Memory
Cisco590              18.32        57          0.80                1.56
Cisco103              16.65        54          0.98                1.54
Cisco7                19.61        61          2.58                3.31
Linux56               7.68         30          1.64                1.87
Linux10               5.14         20          8.59                9.08
Snort11               5.86         9           1.57                1.66
Bro648                6.45         17          0.45                0.51
(d.p.l. = default path length)

The memory columns report the ratio between the number of transitions used by the D2FA and the corresponding DFA.
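
The default-transition idea can be sketched on a toy automaton. The 4-state DFA below (over the alphabet {a, b}, tracking progress towards the string "aba") is a made-up example, and `compress` is a deliberately simple stand-in for the construction in the paper: every state defaults to state 0 and stores only the transitions that differ from state 0, so default paths have length at most 1.

```python
DFA = {
    0: {"a": 1, "b": 0},
    1: {"a": 1, "b": 2},
    2: {"a": 3, "b": 0},
    3: {"a": 1, "b": 2},
}


def compress(dfa, root=0):
    """Return state -> (default state, explicitly stored transitions)."""
    d2fa = {root: (None, dict(dfa[root]))}     # the root keeps its full table
    for state, row in dfa.items():
        if state != root:
            diff = {c: t for c, t in row.items() if dfa[root][c] != t}
            d2fa[state] = (root, diff)
    return d2fa


def step(d2fa, state, ch):
    """Follow default transitions until a state stores ch explicitly."""
    default, explicit = d2fa[state]
    while ch not in explicit:
        default, explicit = d2fa[default]
    return explicit[ch]


def run(next_state, text, state=0):
    trace = []
    for ch in text:
        state = next_state(state, ch)
        trace.append(state)
    return trace


D2FA = compress(DFA)
stored = sum(len(explicit) for _, explicit in D2FA.values())
print(stored, "explicit transitions instead of", sum(len(r) for r in DFA.values()))
print(run(lambda s, c: step(D2FA, s, c), "abababaab"))
```

Following a default transition does not consume input, which is why a long default path costs extra memory accesses per byte and why the construction bounds its length.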
29
Conclusions

Networking devices implement more and more
complex data plane processing to better control
traffic
The algorithms and data structures used have big
performance impact
Often set of rules to be matched against has
specific structure
Algorithms exploiting this structure may give
good performance even if it is impossible to find
an algorithm that gives good performance on all
possible rule sets

30
That's all folks!
