Title: Data Plane Algorithms in Network Processing Systems
1. Data Plane Algorithms in Network Processing Systems
- Lec. 24/25: from prefix lookup to deep packet inspection
- Slides adapted from Cristian Estan, University of Wisconsin-Madison
- Contents from George Varghese, Network Algorithmics, and ACM/IEEE papers
2. What is the data plane?
- The part of the router handling the traffic
- Data plane algorithms applied to every packet
  - Successive packets typically treated independently
  - Example: deciding on which link to send a packet
- Throughput, defined as number of packets or bytes handled per second, is very important
  - Line speed: keeping up with the rate at which traffic can be transmitted over the wire or fiber
  - Example: a 10 Gbps router has 32 ns to handle a 40-byte packet
- Memory usage limited by technology and costs
  - Can afford at most tens of megabits of fast on-chip memory
- The network processor is one of the executing engines of the router's data plane
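The 32 ns figure follows directly from the link rate and the packet size. A minimal sketch of the arithmetic, assuming back-to-back minimum-size packets and ignoring framing overhead (the function name is illustrative):

```python
def per_packet_budget_ns(link_gbps: float, packet_bytes: int) -> float:
    """Nanoseconds available per packet when packets arrive back to back."""
    bits = packet_bytes * 8
    # A 1 Gbps link carries one bit per nanosecond, so time = bits / Gbps.
    return bits / link_gbps


print(per_packet_budget_ns(10, 40))  # 32.0
```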
3. A generic data plane problem
- Router has many directives, each composed of a guard and an associated action (all guards distinct)
- There is a simple procedure for testing how well a guard matches a packet
- For each packet, find the guard that matches best and take the associated action
- Example: routing table lookup
  - Each guard is an IP prefix (between 0 and 32 bits)
  - Matching procedure: is the guard a prefix of the 32-bit destination IP address?
  - Best defined as longest matching prefix
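The generic problem can be written down as a linear scan over all directives. This is only a reference sketch of the problem statement (the slow baseline the later slides improve on); the directive list, the link names and the helper names are made up for illustration.

```python
def best_match(directives, key, match_score):
    """Return the action of the best-matching guard, or None if none match."""
    best_action, best_score = None, -1
    for guard, action in directives:
        score = match_score(guard, key)  # negative score means "no match"
        if score > best_score:
            best_action, best_score = action, score
    return best_action


def prefix_score(guard, addr_bits):
    """Longest-prefix scoring: the guard length if it is a prefix, else -1."""
    return len(guard) if addr_bits.startswith(guard) else -1


# Hypothetical directives: (guard as bit string, action).
directives = [("0", "link A"), ("1", "link B"), ("110", "link C")]
addr = "11000010" + "0" * 24  # a 32-bit destination address
print(best_match(directives, addr, prefix_score))  # link C
```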
4. The rules of the game
- Matching against all guards in sequence is too slow
- We build a data structure that captures the semantics of all guards and use it for matching
- Primary metrics
  - How fast the matching algorithm is
  - How much memory the data structure needs
  - Time to build the data structure also has some importance
- We can cheat (but we won't today) by
  - Using binary or ternary content-addressable memories
  - Using other forms of hardware support
5. Measuring algorithm complexity
- Execution cost measured in number of memory accesses to read the data structure
  - Actual data manipulation operations typically very simple
  - On some platforms we can read wide words
- Worst-case performance most important
  - Worst case defined with respect to input, not guards
  - Caching has been proven ineffective for many settings
  - Using algorithms with good amortized complexity but bad worst case requires large buffers
6. Overview
- Longest matching prefix
  - Trie-based algorithms
    - Uni-bit and multi-bit tries (fixed stride and variable stride)
    - Leaf pushing
    - Bitmap compression of multi-bit trie nodes
    - Tree bitmap representation for multi-bit trie nodes
  - Binary search on ranges
  - Binary search on prefix lengths
- Classification on multiple fields
- Signature matching
7. Longest matching prefix
- Used in routing table lookup (a.k.a. forwarding) for finding the link on which to send a packet
- Guard: a bit string of 0 to w bits called an IP prefix
- Action: a single-byte interface identifier
- Input: a w-bit string representing the destination IP address of the packet (w is 32 for IPv4, 128 for IPv6)
- Output: the interface associated with the longest guard matching the input
- Size of problem: hundreds of thousands of prefixes
8. Controlled prefix expansion with stride 3
[Figure: one routing table shown as a uni-bit trie, as a prefix list after controlled prefix expansion with stride 3, as a multi-bit trie with fixed stride, as the same trie with leaf pushing, and as a multi-bit trie with variable stride.]
Routing table
P1 0
P2 1
P3 100
P4 1000
P5 100000
P6 101
P7 110
P8 11001
P9 111
Expanded prefixes (stride 3)
P1 000, 001, 010, 011
P2 100, 101, 110, 111 (all four overridden by the longer prefixes P3, P6, P7, P9)
P3 100
P4 100000, 100001, 100010, 100011 (100000 overridden by P5)
P5 100000
P6 101
P7 110
P8 110010, 110011
P9 111
- Leaf pushing reduces memory usage but increases update time
- Given a maximum trie height h and a routing table of size n, a dynamic programming algorithm computes the optimal variable-stride trie in O(n w^2 h)
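A uni-bit trie lookup over the routing table P1 to P9 can be sketched as follows. This is a minimal illustration, not the slide author's code; the class and function names are assumptions, and addresses are passed as bit strings for readability.

```python
ROUTES = {"P1": "0", "P2": "1", "P3": "100", "P4": "1000", "P5": "100000",
          "P6": "101", "P7": "110", "P8": "11001", "P9": "111"}


class Node:
    def __init__(self):
        self.children = [None, None]  # child for bit 0 and for bit 1
        self.label = None             # prefix ending at this node, if any


def build_unibit_trie(routes):
    root = Node()
    for label, prefix in routes.items():
        node = root
        for bit in prefix:
            i = int(bit)
            if node.children[i] is None:
                node.children[i] = Node()
            node = node.children[i]
        node.label = label
    return root


def lookup(root, addr_bits):
    """Walk one bit at a time, remembering the last prefix seen on the path."""
    node, best = root, root.label
    for bit in addr_bits:
        node = node.children[int(bit)]
        if node is None:
            break
        if node.label is not None:
            best = node.label
    return best


trie = build_unibit_trie(ROUTES)
print(lookup(trie, "11000010"))  # P7
print(lookup(trie, "10000000"))  # P5
```

Each lookup costs up to one memory access per address bit, which is what multi-bit tries reduce.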
9. Controlled prefix expansion with stride 3 (lookup example)
[Figure: the tries of slide 8 with a lookup traced through them; routing table as on slide 8.]
- Input: 11000010
- Longest matching prefix candidates found along the way: P2, then P7
- Leaf pushing reduces memory usage but increases update time
- Given a maximum trie height h and a routing table of size n, a dynamic programming algorithm computes the optimal variable-stride trie in O(n w^2 h)
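A fixed-stride multi-bit trie built by controlled prefix expansion can be sketched as below. It is an illustration under simplifying assumptions: no leaf pushing, each entry holds both a label and a child pointer, and the lookup only consumes whole strides (so the address length is assumed to cover the longest prefix in whole strides). All names are made up for the example.

```python
STRIDE = 3
ROUTES = {"P1": "0", "P2": "1", "P3": "100", "P4": "1000", "P5": "100000",
          "P6": "101", "P7": "110", "P8": "11001", "P9": "111"}


def new_node():
    # One entry per stride value: [label, length of the expanded prefix, child].
    return [[None, -1, None] for _ in range(1 << STRIDE)]


def insert(root, prefix, label):
    node = root
    while len(prefix) > STRIDE:  # descend while more than one stride remains
        entry = node[int(prefix[:STRIDE], 2)]
        if entry[2] is None:
            entry[2] = new_node()
        node, prefix = entry[2], prefix[STRIDE:]
    # Expand the remaining bits to a full stride.
    free = STRIDE - len(prefix)
    base = (int(prefix, 2) if prefix else 0) << free
    for i in range(base, base + (1 << free)):
        entry = node[i]
        if len(prefix) > entry[1]:  # the longer original prefix wins
            entry[0], entry[1] = label, len(prefix)


def lookup(root, addr_bits):
    node, best, pos = root, None, 0
    while node is not None and pos + STRIDE <= len(addr_bits):
        entry = node[int(addr_bits[pos:pos + STRIDE], 2)]
        if entry[0] is not None:
            best = entry[0]
        node, pos = entry[2], pos + STRIDE
    return best


root = new_node()
for label, prefix in ROUTES.items():
    insert(root, prefix, label)
print(lookup(root, "11000010"))  # P7
```

For the input 11000010 the lookup reads two nodes instead of walking up to six bits one at a time.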
10. Controlled prefix expansion with stride 3 (lookup example, result)
[Figure: the tries of slide 8 with the completed lookup; routing table as on slide 8.]
- Input: 11000010
- Longest matching prefix: P7
11. Lulea bitmap compression
[Figure: a leaf-pushed multi-bit trie node, its compressed form, a 32-bit bitmap supporting fast counting, and the root node represented as a tree bitmap; routing table and expanded prefixes as on slide 8.]
Compressed node
- Repeating entries are stored only once in the compressed array. An auxiliary bitmap is needed to find the right entry in the compressed node. It stores a 0 for positions that do not differ from the previous one.
- Example: an 8-entry node whose first four entries are identical has bitmap 10001111 and a compressed array of 5 entries
Bitmap supporting fast counting
- When the compression bitmaps are large it is expensive to count bits during lookup. The bitmap is divided into chunks and a pre-computed auxiliary array stores the number of bits set before each chunk. The lookup algorithm needs to count only bits set within one chunk.
- Example: bitmap 10001110 10101001 10101110 00001110, split into four 8-bit chunks (00, 01, 10, 11) with auxiliary counts 0, 4, 8, 13
Representing node as tree bitmap
- Pointers to children and prefixes are stored in separate structures. Prefixes of all lengths are stored, thus leaf pushing is not needed and update is fast. Bitmaps have 1s corresponding to entries that are not empty.
- Example root node: the prefix bitmap has 1s for 0 (P1), 1 (P2), 100 (P3), 101 (P6), 110 (P7) and 111 (P9); the child bitmap has 1s for 100 and 110
- Lookup example input: 11001010
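The chunked counting trick can be sketched as follows, using the 32-bit bitmap and the 8-bit chunk size from the slide. Bitmaps are kept as strings for readability; a real implementation would use machine words and a popcount instruction. The function names and the entry labels in the small node example are assumptions.

```python
BITMAP = "10001110" "10101001" "10101110" "00001110"
CHUNK = 8


def build_counts(bitmap, chunk):
    """Number of 1 bits before each chunk, computed once at build time."""
    counts, total = [], 0
    for start in range(0, len(bitmap), chunk):
        counts.append(total)
        total += bitmap[start:start + chunk].count("1")
    return counts


def compressed_index(bitmap, counts, chunk, pos):
    """Index in the compressed array of the entry covering position pos."""
    c = pos // chunk
    # Only the bits of one chunk are counted at lookup time.
    ones_in_chunk = bitmap[c * chunk:pos + 1].count("1")
    return counts[c] + ones_in_chunk - 1


counts = build_counts(BITMAP, CHUNK)
print(counts)  # [0, 4, 8, 13]

# Small node: four equal entries followed by four distinct ones.
node_bitmap = "10001111"
compressed = ["e0", "e1", "e2", "e3", "e4"]  # hypothetical entry labels
node_counts = build_counts(node_bitmap, CHUNK)
print(compressed[compressed_index(node_bitmap, node_counts, CHUNK, 3)])  # e0
print(compressed[compressed_index(node_bitmap, node_counts, CHUNK, 5)])  # e2
```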
12. Binary search on ranges
- Divide the w-bit address space into maximal contiguous ranges covered by the same prefix
- Build an array or balanced (binary) search tree with the boundaries of the ranges
- At lookup time perform an O(log n) search
- Not better than multi-bit tries with compression, but it is not covered by patents
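A minimal sketch of binary search on ranges, assuming an 8-bit address space and the routing table P1 to P9 used in the trie slides. The build step uses a naive scan to label each range, which is acceptable offline; names are illustrative.

```python
from bisect import bisect_right

ROUTES = {"P1": "0", "P2": "1", "P3": "100", "P4": "1000", "P5": "100000",
          "P6": "101", "P7": "110", "P8": "11001", "P9": "111"}


def build_ranges(routes, w):
    """Return sorted range start points and the best prefix for each range."""
    points = {0}
    for prefix in routes.values():
        start = int(prefix.ljust(w, "0"), 2)
        end = start + (1 << (w - len(prefix)))  # first address past the prefix
        points.add(start)
        if end < (1 << w):
            points.add(end)

    def naive(addr):
        bits = format(addr, f"0{w}b")
        matches = [l for l, p in routes.items() if bits.startswith(p)]
        return max(matches, key=lambda l: len(routes[l]), default=None)

    starts, labels = [], []
    for p in sorted(points):
        label = naive(p)
        if not labels or labels[-1] != label:  # merge equal neighbours
            starts.append(p)
            labels.append(label)
    return starts, labels


def lookup(starts, labels, addr):
    # O(log n) search for the last range start that is <= addr.
    return labels[bisect_right(starts, addr) - 1]


starts, labels = build_ranges(ROUTES, 8)
print(lookup(starts, labels, 0b11000010))  # P7
```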
13. Binary search on prefix lengths
- Core idea: for each prefix length represented in the routing table, have a hash table with the prefixes of that length
- Can find the longest matching prefix by looking up, in each hash table, the prefix of the address with the corresponding length
- Binary search on prefix lengths is faster
  - Simple but wrong algorithm: if you find a prefix at length x, store it as best match and look for longer matching prefixes; otherwise look for shorter prefixes
  - Problem: what if there is both a shorter and a longer prefix, but no prefix at length x?
  - Solution: insert a marker at length x when there are longer prefixes. Must store with the marker the longest matching shorter prefix. Markers lead to a moderate increase in memory usage.
- Promising algorithm for IPv6 (w = 128)
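The marker scheme can be sketched as follows. This is an illustration assuming Python dicts as the per-length hash tables and the routing table P1 to P9 from the trie slides; every stored entry (real prefix or marker) carries its precomputed best matching prefix, so a failed probe at a longer length never requires backtracking. Function names are assumptions.

```python
ROUTES = {"P1": "0", "P2": "1", "P3": "100", "P4": "1000", "P5": "100000",
          "P6": "101", "P7": "110", "P8": "11001", "P9": "111"}


def build_tables(routes):
    lengths = sorted({len(p) for p in routes.values()})
    tables = {l: {} for l in lengths}

    def bmp(s):  # best matching prefix of the bit string s (build time only)
        matches = [l for l, p in routes.items() if s.startswith(p)]
        return max(matches, key=lambda l: len(routes[l]), default=None)

    for prefix in routes.values():
        target = lengths.index(len(prefix))
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:  # replay the binary search that must reach this prefix
            mid = (lo + hi) // 2
            key = prefix[:lengths[mid]]
            if mid == target:
                tables[lengths[mid]][key] = bmp(key)  # the prefix itself
                break
            if mid < target:
                tables[lengths[mid]][key] = bmp(key)  # marker: "go longer"
                lo = mid + 1
            else:
                hi = mid - 1
    return lengths, tables


def lookup(lengths, tables, addr_bits):
    lo, hi, best = 0, len(lengths) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        key = addr_bits[:lengths[mid]]
        if key in tables[lengths[mid]]:
            best = tables[lengths[mid]][key] or best
            lo = mid + 1   # prefix or marker found: try longer lengths
        else:
            hi = mid - 1   # nothing here: try shorter lengths
    return best


lengths, tables = build_tables(ROUTES)
print(lookup(lengths, tables, "11000010"))  # P7
```

For 11000010 the first probe at length 4 hits the marker 1100 (inserted because of P8), which carries P7; the probe at length 5 fails and the search ends with P7.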
14. Papers on longest matching prefix
- G. Varghese, Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices, chapter 11, Morgan Kaufmann, 2005
- V. Srinivasan, G. Varghese, Faster IP lookups using controlled prefix expansion, ACM Trans. on Comp. Sys., Feb. 1999
- M. Degermark, A. Brodnik, S. Carlsson, S. Pink, Small forwarding tables for fast routing lookups, ACM SIGCOMM, 1997
- W. Eatherton, Z. Dittia, G. Varghese, Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates, http://www-cse.ucsd.edu/varghese/PAPERS/willpaper.pdf
- B. Lampson, V. Srinivasan, G. Varghese, IP lookups using multiway and multicolumn search, IEEE Infocom, 1998
- M. Waldvogel, G. Varghese, J. Turner, B. Plattner, Scalable high-speed IP lookups, ACM Trans. on Comp. Sys., Nov. 2001
15. Overview
- Longest matching prefix
- Classification on multiple fields
  - Solution for two-dimensional case: grid of tries
  - Bit vector linear search
  - Cross-producting
  - Decision tree approaches
- Signature matching
16. Packet classification problem
- Required for security, recognizing packets with quality of service requirements
- Guard: prefixes or ranges for k header fields
  - Typically source and destination prefix, source and destination port range, and exact value or wildcard for protocol
  - All fields must match for the rule to apply
- Action: drop, forward, map to a certain traffic class
- Input: a tuple with the values of the k header fields
- Output: the action associated with the first rule that matches the packet (rules are strictly ordered)
- Size of problem: thousands of classification rules
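17. Example of classification rule set
[Figure: a router that filters traffic sits between the Internet and the internal network Net. Mail gateway M and internal time server TI are inside Net; external time server TO and secondary name server S are reached through the Internet.]
Wildcard fields are shown as *.
Destination IP  Source IP  Dest Port  Src Port  Protocol  Action
M               *          25         *         *         R1
M               *          53         *         UDP       R2
M               S          53         *         *         R3
M               *          23         *         *         R4
TI              TO         123        123       UDP       R5
*               Net        *          *         *         R6
Net             *          *          *         TCP/ACK   R7
*               *          *          *         *         R8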
18. A geometric view of packet classification
[Figure: rules R1, R2 and R3 drawn as overlapping rectangles, with source address space on one axis and destination address space on the other.]
- In theory the number of regions defined can be much larger than the number of rules
- Any algorithm that guarantees O(n) space for all rule sets of size n needs O((log n)^(k-1)) time for classification
19. The two-dimensional case: source and destination IP addresses
- For each destination prefix in the rule set, link to the corresponding node in the destination IP trie a trie with the source prefixes of the rules using this destination prefix
  - Matching algorithm must use backtracking to visit all source tries
- Grid of tries: by pre-computing switch pointers in destination tries and propagating some information about more general rules, matching may proceed without backtracking
  - Memory used proportional to number of rules
  - Matching time O(w) with constant depending on stride
- Extended grid of tries handles 5 fields and has good run time and memory in practice
20. Bit vector linear search
[Figure: per-field bit vectors for the example rule set of slide 17 (R7 shown with Proto TCP). Bit i of a vector, counted from the left, is 1 if rule Ri may match a packet with that field value; * is the wildcard.]
Dest IP
M    11110111
TI   00001111
Net  00000111
*    00000101
Source IP
S    11110011
TO   11011011
Net  11010111
*    11010011
Dest Port
25   10000111
53   01100111
23   00010111
123  00001111
*    00000111
Src Port
123  11111111
*    11110111
Proto
UDP  11111101
TCP  10110111
*    10110101
Example: 00000101 AND 11010011 AND 00000111 AND 11110111 AND 10110111 = 00000001, so the first matching rule is R8
- Bit vector approaches do linear search through the rule set
  - For each field we pre-compute a structure (e.g. trie) to find the most specific prefix or range distinguished by the rule set
  - For each rule, a single bit represents whether a given most specific prefix matches the rule or not
  - We associate with each prefix or range a bitmap of size n encoding which of the rules may match a packet in that prefix or range
- Classification algorithm first computes for each field of the packet the most specific prefix/range it belongs to
  - By then AND-ing together the k bitmaps of size n we find the matching rules
- Works well for hardware solutions that allow wide memory reads
- Scales poorly to large rule sets
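The AND step can be sketched with the per-field bit vectors of the example rule set. The sketch assumes the per-field lookups (tries or range searches) have already produced the most specific value for each field, so the packet is given directly as those values; the dictionary layout and function name are illustrative.

```python
N_RULES = 8
VECTORS = {
    "dest_ip":   {"M": "11110111", "TI": "00001111", "Net": "00000111", "*": "00000101"},
    "src_ip":    {"S": "11110011", "TO": "11011011", "Net": "11010111", "*": "11010011"},
    "dest_port": {"25": "10000111", "53": "01100111", "23": "00010111",
                  "123": "00001111", "*": "00000111"},
    "src_port":  {"123": "11111111", "*": "11110111"},
    "proto":     {"UDP": "11111101", "TCP": "10110111", "*": "10110101"},
}


def classify(packet):
    """AND the k per-field bitmaps; the leftmost 1 is the first matching rule."""
    acc = (1 << N_RULES) - 1
    for field, table in VECTORS.items():
        acc &= int(table[packet[field]], 2)
    if acc == 0:
        return None
    return f"R{N_RULES - acc.bit_length() + 1}"


pkt = {"dest_ip": "*", "src_ip": "*", "dest_port": "*", "src_port": "*", "proto": "TCP"}
print(classify(pkt))  # R8
mail = {"dest_ip": "M", "src_ip": "*", "dest_port": "25", "src_port": "*", "proto": "TCP"}
print(classify(mail))  # R1
```

The cost of the AND grows linearly with the number of rules, which is why the approach depends on wide memory reads.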
21. Cross-producting
[Figure: cross-producting and equivalenced cross-producting on the example rule set of slide 17 (R7 shown with Proto TCP). Values recognized per field: Dest IP (M, TI, Net, *), Src IP (S, TO, Net, *), Dest Port (25, 53, 23, 123, *), Src Port (123, *), Proto (UDP, TCP, *).]

Cross-producting performs longest prefix matching separately for all fields and combines the results in a single step by looking up the matching rule in a pre-computed table explicitly listing the first matching rule for each element of the cross-product. The size of this table is the product of the numbers of recognized prefixes/ranges for the individual fields. Due to its memory requirements this method is not feasible.

Cross-product table: 4 x 4 x 5 x 2 x 3 = 480 entries
Index  Cross product     Action
0      M,S,25,123,UDP    R1
1      M,S,25,123,TCP    R1
2      M,S,25,123,*      R1
3      M,S,25,*,UDP      R1
4      M,S,25,*,TCP      R1
5      M,S,25,*,*        R1
...
478    *,*,*,*,TCP       R8
479    *,*,*,*,*         R8

Equivalenced cross-producting (a.k.a. recursive flow classification or RFC) combines the results of the per-field longest matching prefix operations two by two. The pairs of values are grouped in equivalence classes, and in general there are far fewer equivalence classes than pairs of values. This leads to significant memory savings as compared to simple cross-producting. This algorithm provides fast packet classification, but compared to other algorithms the memory requirements are relatively large (but feasible in some settings).

Dest IP x Src IP: 16 entries, 8 distinct classes
Index  Dest IP, Src IP  Rule bitmap  Class
0      M,S              11110011     C1
1      M,TO             11010011     C2
2      M,Net            11010111     C3
3      M,*              11010011     C2
4      TI,S             00000011     C4
5      TI,TO            00001011     C5
6      TI,Net           00000111     C6
7      TI,*             00000011     C4
8      Net,S            00000011     C4
9      Net,TO           00000011     C4
10     Net,Net          00000111     C6
11     Net,*            00000011     C4
12     *,S              00000001     C7
13     *,TO             00000001     C7
14     *,Net            00000101     C8
15     *,*              00000001     C7

[Figure: the per-field results (Src IP, Dest IP, Src Port, Dest Port, Proto) are combined two by two until a single final result remains.]
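One combination step of equivalenced cross-producting can be sketched as follows, using the Dest IP and Src IP bit vectors of the example rule set. Pairs whose AND-ed rule bitmaps are identical fall in the same equivalence class. The function name and the 0-based class ids are assumptions made for the illustration.

```python
from math import prod

DEST_IP = {"M": 0b11110111, "TI": 0b00001111, "Net": 0b00000111, "*": 0b00000101}
SRC_IP = {"S": 0b11110011, "TO": 0b11011011, "Net": 0b11010111, "*": 0b11010011}


def combine(field_a, field_b):
    """Map every pair of values to an equivalence class keyed by rule bitmap."""
    classes = {}  # rule bitmap -> class id
    table = {}    # (value_a, value_b) -> class id
    for a, bitmap_a in field_a.items():
        for b, bitmap_b in field_b.items():
            bitmap = bitmap_a & bitmap_b
            table[(a, b)] = classes.setdefault(bitmap, len(classes))
    return table, classes


table, classes = combine(DEST_IP, SRC_IP)
print(len(table), len(classes))  # 16 8

# Size of the plain cross-product table for the five fields of the example.
print(prod([4, 4, 5, 2, 3]))  # 480
```

The class ids, not the raw pairs, feed the next combination step, which is where the memory saving comes from.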
22. Decision tree approaches
- At each node of the tree test a bit in a field or perform a range test
  - Large fan-out leads to shallow trees and fast classification
  - Leaves contain a few rules traversed linearly
  - Interior nodes may also contain rules that match
- Tests may look at bits from multiple fields
- A rule may appear in multiple nodes of the decision tree; this can lead to increased memory usage
- Tree built using heuristics that pick fields to compare on that divide the remaining rules relatively evenly among descendants
- Fast and compact on rule sets used today
23. Papers on packet classification
- G. Varghese, Network Algorithmics, chapter 12
- V. Srinivasan, G. Varghese, S. Suri, M. Waldvogel, Fast and Scalable Layer Four Switching, ACM SIGCOMM, Sep. 1998
- F. Baboescu, S. Singh, G. Varghese, Packet classification for core routers: Is there an alternative to CAMs?, IEEE Infocom, 2003
- P. Gupta, N. McKeown, Packet classification on multiple fields, ACM SIGCOMM, 1999
- T. Woo, A modular approach to packet classification: Algorithms and results, IEEE Infocom, 2000
- S. Singh, F. Baboescu, G. Varghese, Packet classification using multidimensional cutting, SIGCOMM, 2003
24. Overview
- Longest matching prefix
- Classification on multiple fields
- Signature matching
  - String matching
  - Regular expression matching w/ DFAs and D2FAs
25. Signature matching
- Used in intrusion prevention/detection, application classification, load balancing
- Guard: a byte string or a regular expression
- Action: drop packet, log alert, set priority, direct to specific server
- Input: byte string from the payload of packet(s)
  - Hence the name deep packet inspection
- Output: the positions at which various signatures match, or the identifier of the highest priority signature that matches
- Size of problem: hundreds of signatures per protocol
26. String matching
- Most widely used early form of deep packet inspection, but the more expressive regular expressions have superseded strings by now
  - Still used as a pre-filter to more expensive matching operations by the popular open source IDS/IPS Snort
- Matching multiple strings is a well-studied problem
  - A. Aho, M. Corasick, Efficient string matching: An aid to bibliographic search, Communications of the ACM, June 1975
  - Many hardware-based solutions published in the last decade
  - Matching time independent of number of strings, memory requirements proportional to sum of their sizes
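A compact sketch of multi-string matching in the style of Aho and Corasick: a trie of the patterns plus failure links, so the text is scanned once regardless of the number of patterns. This is a textbook-style illustration over Python strings, not the code used by any particular IDS; the function names and the sample patterns are assumptions.

```python
from collections import deque


def build_automaton(patterns):
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:  # build the trie of all patterns
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pat)
    queue = deque(goto[0].values())  # depth-1 states fail to the root
    while queue:  # breadth-first computation of failure links
        s = queue.popleft()
        for ch, t in goto[s].items():
            queue.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] = out[t] + out[fail[t]]  # inherit matches ending here
    return goto, fail, out


def search(text, automaton):
    """Return (start position, pattern) for every occurrence in text."""
    goto, fail, out = automaton
    state, hits = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pat in out[state]:
            hits.append((i - len(pat) + 1, pat))
    return hits


automaton = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", automaton))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```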
27. Regular expression matching
- Deterministic and non-deterministic finite automata (DFAs and NFAs) can match regular expressions
  - NFAs are more compact but require backtracking or keeping track of sets of states during matching
  - Both representations used in hardware and software solutions, but only DFA-based solutions can guarantee throughput in software
- DFAs have a state space explosion problem
  - From DFAs recognizing individual signatures we can build a DFA that recognizes the entire signature set in a single pass
  - Size of combined DFA much larger than sum of sizes for DFAs recognizing individual signatures
  - Multiple combined DFAs are used to match the signature set
28. Delayed Input DFA (D2FA)
- S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection, ACM SIGCOMM, September 2006

[Figure: transition tables for states 0 to 3 of a deterministic finite automaton (DFA) and of the equivalent delayed input DFA (D2FA) with default transitions; a lookup is traced on the input 410052, showing the current state at each step.]

If the current state variable meets an acceptance condition (e.g. whether the state identifier is larger than a given threshold), the automaton raises an alert.

D2FAs build on the observation that for many pairs of states, the transition tables are very similar and it is enough to store the differences. The lookup algorithm may need to follow multiple default transitions until it finds a state that explicitly stores a pointer to the next state it needs to transition to. Since this is a throughput concern, the algorithm for constructing D2FAs allows the user to set a limit on the length of the maximum default path.

d.p.l. = default path length
                      D2FAs with no bound on d.p.l.       D2FAs with d.p.l. ≤ 4
Set of regular expr.  Avg. d.p.l.  Max d.p.l.  Memory     Memory
Cisco590              18.32        57          0.80       1.56
Cisco103              16.65        54          0.98       1.54
Cisco7                19.61        61          2.58       3.31
Linux56               7.68         30          1.64       1.87
Linux10               5.14         20          8.59       9.08
Snort11               5.86         9           1.57       1.66
Bro648                6.45         17          0.45       0.51

The memory columns report the ratio between the number of transitions used by the D2FA and the corresponding DFA.
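The default-transition idea can be sketched on a toy automaton. The 4-state DFA below (recognizing the string "abc" over the alphabet a, b, c) and the hand-picked default pointers are assumptions for illustration; the construction algorithm of the paper, which chooses defaults to maximize savings under a path-length bound, is not reproduced here.

```python
DFA = {
    0: {"a": 1, "b": 0, "c": 0},
    1: {"a": 1, "b": 2, "c": 0},
    2: {"a": 1, "b": 0, "c": 3},
    3: {"a": 1, "b": 0, "c": 0},
}
DEFAULT = {1: 0, 2: 0, 3: 0}  # hand-picked; state 0 keeps its full table
ACCEPT_THRESHOLD = 2          # states with a larger identifier raise an alert


def compress(dfa, default):
    """Keep only the transitions that differ from the default state's."""
    d2fa = {}
    for s, trans in dfa.items():
        if s in default:
            parent = dfa[default[s]]
            d2fa[s] = {ch: t for ch, t in trans.items() if parent[ch] != t}
        else:
            d2fa[s] = dict(trans)
    return d2fa


def step(d2fa, default, state, ch):
    """Follow default transitions until a state stores ch explicitly."""
    while ch not in d2fa[state]:
        state = default[state]
    return d2fa[state][ch]


def run(d2fa, default, text):
    state, alerts = 0, []
    for i, ch in enumerate(text):
        state = step(d2fa, default, state, ch)
        if state > ACCEPT_THRESHOLD:
            alerts.append(i)
    return alerts


D2FA = compress(DFA, DEFAULT)
stored = sum(len(t) for t in D2FA.values()) + len(DEFAULT)
print(stored, sum(len(t) for t in DFA.values()))  # 8 12
print(run(D2FA, DEFAULT, "aabcabcb"))  # [3, 6]
```

Here every default path has length 1; longer default chains save more memory but cost more memory accesses per input byte, which is the trade-off the bound on default path length controls.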
29. Conclusions
- Networking devices implement more and more complex data plane processing to better control traffic
- The algorithms and data structures used have a big performance impact
- Often the set of rules to be matched against has specific structure
  - Algorithms exploiting this structure may give good performance even if it is impossible to find an algorithm that gives good performance on all possible rule sets
30. That's all folks!