Data plane algorithms in routers - PowerPoint PPT Presentation
1
Data plane algorithms in routers
  • From prefix lookup to deep packet inspection
  • Cristian Estan, University of Wisconsin-Madison

2
What is the data plane?
  • The part of the router that handles the traffic
  • Data plane algorithms are applied to every packet
  • Successive packets are typically treated independently
  • Example: deciding on which link to send a packet
  • Throughput, defined as the number of packets or bytes handled per second, is very important
  • Line speed: keeping up with the rate at which traffic can be transmitted over the wire or fiber
  • Example: a 10Gbps router has 32 ns to handle a 40-byte packet
  • Memory usage is limited by technology and cost
  • Can afford at most tens of megabits of fast on-chip memory

3
A generic data plane problem
  • The router has many directives, each composed of a guard and an associated action (all guards distinct)
  • There is a simple procedure for testing how well a guard matches a packet
  • For each packet, find the guard that matches best and take the associated action
  • Example: routing table lookup
  • Each guard is an IP prefix (between 0 and 32 bits)
  • Matching procedure: is the guard a prefix of the 32-bit destination IP address?
  • Best defined as the longest matching prefix
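The matching procedure for this example can be sketched as a naive linear scan over all guards (a minimal sketch, not from the slides; the routing table is the one used in the trie examples later, with the prefix names doubling as actions):

```python
# Naive longest-matching-prefix: test every guard in sequence and keep the
# longest one that matches. O(n) memory accesses per packet.
def matches(prefix, addr_bits):
    """A guard matches if it is a prefix of the destination address bits."""
    return addr_bits.startswith(prefix)

def longest_match(table, addr_bits):
    best_prefix, best_action = None, None
    for prefix, action in table.items():
        if matches(prefix, addr_bits) and (best_prefix is None or len(prefix) > len(best_prefix)):
            best_prefix, best_action = prefix, action
    return best_action

table = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}
print(longest_match(table, "11001010"))  # P8: longer than the matches "1" and "110"
```

This per-packet linear scan is exactly what the next slide rules out as too slow, motivating the data structures that follow.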

4
The rules of the game
  • Matching against all guards in sequence is too
    slow
  • We build a data structure that captures the
    semantics of all guards and use it for matching
  • Primary metrics:
  • How fast the matching algorithm is
  • How much memory the data structure needs
  • Time to build the data structure also has some importance
  • We can cheat (but we won't today) by:
  • Using binary or ternary content-addressable memories
  • Using other forms of hardware support

5
Measuring algorithm complexity
  • Execution cost measured in the number of memory accesses needed to read the data structure
  • Actual data manipulation operations are typically very simple
  • On some platforms we can read wide words
  • Worst case performance matters most
  • Worst case defined with respect to the input, not the guards
  • Caching has proven ineffective in many settings
  • Using algorithms with good amortized complexity but bad worst case requires large buffers

6
Overview
  • Longest matching prefix
  • Trie-based algorithms
  • Uni-bit and multi-bit tries (fixed stride and
    variable stride)
  • Leaf pushing
  • Bitmap compression of multi-bit trie nodes
  • Tree bitmap representation for multi-bit trie
    nodes
  • Binary search on ranges
  • Binary search on prefix lengths
  • Classification on multiple fields
  • Signature matching

7
Longest matching prefix
  • Used in routing table lookup (a.k.a. forwarding) for finding the link on which to send a packet
  • Guard: a bit string of 0 to w bits called an IP prefix
  • Action: a single byte interface identifier
  • Input: a w-bit string representing the destination IP address of the packet (w is 32 for IPv4, 128 for IPv6)
  • Output: the interface associated with the longest guard matching the input
  • Size of problem: hundreds of thousands of prefixes

8
Controlled prefix expansion with stride 3
[Figure: the routing table below shown as a uni-bit trie, a multi-bit trie with fixed stride 3, a multi-bit trie with variable stride, and leaf-pushed versions of the expanded nodes; example input 11000010, whose longest matching prefix is P7]
Routing table:
  P1 0*       P2 1*       P3 100*
  P4 1000*    P5 100000*  P6 101*
  P7 110*     P8 11001*   P9 111*
  • Leaf pushing reduces memory usage but increases update time
  • Given a maximum trie height h and a routing table of size n, a dynamic programming algorithm computes the optimal variable stride trie in O(n·w²·h)
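The uni-bit trie from the figure can be sketched as follows (an assumed implementation, not taken from the slides; one bit of the address is consumed per node, i.e. one memory access per level):

```python
# Uni-bit trie: the last prefix seen on the path down is the longest match.
class Node:
    def __init__(self):
        self.child = [None, None]   # children for bit 0 and bit 1
        self.action = None          # interface if a prefix ends here

def insert(root, prefix, action):
    node = root
    for bit in prefix:
        i = int(bit)
        if node.child[i] is None:
            node.child[i] = Node()
        node = node.child[i]
    node.action = action

def lookup(root, addr_bits):
    """Walk the trie bit by bit, remembering the last prefix matched."""
    best, node = None, root
    for bit in addr_bits:
        node = node.child[int(bit)]
        if node is None:
            break
        if node.action is not None:
            best = node.action
    return best

root = Node()
for prefix, action in [("0", "P1"), ("1", "P2"), ("100", "P3"), ("1000", "P4"),
                       ("100000", "P5"), ("101", "P6"), ("110", "P7"),
                       ("11001", "P8"), ("111", "P9")]:
    insert(root, prefix, action)
print(lookup(root, "11001010"))  # P8
```

Multi-bit tries with stride k apply the same idea but consume k bits per node, trading memory (expanded entries) for fewer memory accesses.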
11
Lulea bitmap compression
[Figure: a leaf-pushed multi-bit trie node for the same routing table, its compression bitmap, and the compressed array; example input 11001010]
Expanded node:  000 P1, 001 P1, 010 P1, 011 P1, 100 -, 101 P5, 110 -, 111 P9
Bitmap:         1 0 0 0 1 1 1 1
Compressed:     P1, -, P5, -, P9
  • Repeating entries are stored only once in the compressed array. An auxiliary bitmap is needed to find the right entry in the compressed node: it stores a 0 for each position that does not differ from the previous one.
  • When the compression bitmaps are large, it is expensive to count bits during lookup. The bitmap is divided into chunks, and a pre-computed auxiliary array stores the number of bits set before each chunk. The lookup algorithm then needs to count only the bits set within one chunk.
Representing a node as a tree bitmap
  • Pointers to children and prefixes are stored in separate structures. Prefixes of all lengths are stored, so leaf pushing is not needed and updates are fast. Bitmaps have 1s for the entries that are not empty.
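The Lulea-style compression and chunked bit counting can be sketched as follows (helper names are mine; the node contents are the ones from the slide):

```python
# Compress an expanded trie node: a 1 in the bitmap marks entries that differ
# from the previous one; only those entries are kept. Counting 1s up to an
# index recovers the position in the compressed array.
CHUNK = 4  # chunk size in bits; precomputed counts avoid scanning the whole bitmap

def build(expanded):
    bitmap, compressed = [], []
    for i, action in enumerate(expanded):
        changed = i == 0 or action != expanded[i - 1]
        bitmap.append(1 if changed else 0)
        if changed:
            compressed.append(action)
    # number of 1s before each chunk, so lookup counts bits in one chunk only
    before = [sum(bitmap[:c]) for c in range(0, len(bitmap), CHUNK)]
    return bitmap, compressed, before

def lookup(bitmap, compressed, before, index):
    ones = before[index // CHUNK] + sum(bitmap[(index // CHUNK) * CHUNK : index + 1])
    return compressed[ones - 1]

node = ["P1", "P1", "P1", "P1", None, "P5", None, "P9"]  # entries from the slide
bitmap, compressed, before = build(node)
print(lookup(bitmap, compressed, before, 5))  # entry 101 -> P5
```

In hardware or with wide words, the per-chunk sum becomes a single popcount over one memory word, which is what makes the scheme fast.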
12
Binary search on ranges
  • Divide the w-bit address space into maximal contiguous ranges covered by the same prefix
  • Build an array or balanced (binary) search tree with the boundaries of the ranges
  • At lookup time perform an O(log(n)) search
  • Not better than multi-bit tries with compression, but it is not covered by patents
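The range construction and lookup can be sketched as follows (a sketch with assumed details, on a toy 8-bit address space and the routing table from the earlier slides):

```python
# Binary search on ranges: each prefix covers an interval of the address
# space; boundaries at every interval start and end cut the space into
# maximal ranges with a single longest-prefix answer each.
import bisect

W = 8  # toy address width

def prefix_interval(prefix):
    lo = int(prefix.ljust(W, "0"), 2)
    hi = int(prefix.ljust(W, "1"), 2)
    return lo, hi

def build(table):
    """Sorted boundaries plus the longest-prefix answer for each range."""
    points = {0}
    for p in table:
        lo, hi = prefix_interval(p)
        points.update([lo, hi + 1])
    bounds = sorted(x for x in points if x <= 2 ** W - 1)
    answers = []
    for b in bounds:
        covering = [p for p in table
                    if prefix_interval(p)[0] <= b <= prefix_interval(p)[1]]
        answers.append(table[max(covering, key=len)] if covering else None)
    return bounds, answers

def lookup(bounds, answers, addr):
    return answers[bisect.bisect_right(bounds, addr) - 1]

table = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}
bounds, answers = build(table)
print(lookup(bounds, answers, 0b11001010))  # P8, as in the trie example
```

Because every boundary is an interval endpoint, all addresses within one range share the same longest matching prefix, so the answer can be precomputed per range.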

13
Binary search on prefix lengths
  • Core idea: for each prefix length represented in the routing table, have a hash table with the prefixes of that length
  • Can find the longest matching prefix by looking up, in each hash table, the prefix of the address with the corresponding length
  • Binary search on prefix lengths is faster
  • Simple but wrong algorithm: if you find a prefix at length x, store it as the best match and look for longer matching prefixes; otherwise look for shorter prefixes
  • Problem: what if there is both a shorter and a longer matching prefix, but no prefix at length x?
  • Solution: insert a marker at length x when there are longer prefixes. The marker must store the longest matching shorter prefix. Markers lead to a moderate increase in memory usage.
  • Promising algorithm for IPv6 (w = 128)
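A sketch of the scheme with markers (the helper structure and names are my own simplification of the Waldvogel et al. algorithm):

```python
# Binary search on prefix lengths: one hash table per distinct length.
# Markers are pre-inserted along each prefix's binary-search path, and each
# marker stores the best matching real prefix shorter than it.
def build(table):
    lengths = sorted({len(p) for p in table})
    tables = {l: {} for l in lengths}
    for p, action in table.items():
        tables[len(p)][p] = action
    for p in table:                     # insert markers for each prefix
        lo, hi = 0, len(lengths) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            L = lengths[mid]
            if L < len(p):
                key = p[:L]
                if key not in tables[L]:
                    shorter = [q for q in table if len(q) <= L and key.startswith(q)]
                    tables[L][key] = table[max(shorter, key=len)] if shorter else None
                lo = mid + 1
            elif L > len(p):
                hi = mid - 1
            else:
                break
    return lengths, tables

def lookup(lengths, tables, addr):
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        L = lengths[mid]
        entry = tables[L].get(addr[:L], "miss")
        if entry == "miss":
            hi = mid - 1      # nothing here: only shorter prefixes can match
        else:
            if entry is not None:
                best = entry  # real prefix action, or the marker's stored match
            lo = mid + 1      # a marker promises possible longer matches
    return best

table = {"0": "P1", "1": "P2", "100": "P3", "1000": "P4", "100000": "P5",
         "101": "P6", "110": "P7", "11001": "P8", "111": "P9"}
lengths, tables = build(table)
print(lookup(lengths, tables, "11001010"))  # P8 in log(number of lengths) probes
```

With 5 distinct lengths this needs at most 3 hash probes per lookup instead of 5, which is where the IPv6 appeal comes from (log 128 = 7 probes in the worst case).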

14
Papers on longest matching prefix
  • G. Varghese, "Network Algorithmics: An Interdisciplinary Approach to Designing Fast Networked Devices", chapter 11, Morgan Kaufmann, 2005
  • V. Srinivasan, G. Varghese, "Faster IP lookups using controlled prefix expansion", ACM Trans. on Comp. Sys., Feb. 1999
  • M. Degermark, A. Brodnik, S. Carlsson, S. Pink, "Small forwarding tables for fast routing lookups", ACM SIGCOMM, 1997
  • W. Eatherton, Z. Dittia, G. Varghese, "Tree Bitmap: Hardware/Software IP Lookups with Incremental Updates", http://www-cse.ucsd.edu/varghese/PAPERS/willpaper.pdf
  • B. Lampson, V. Srinivasan, G. Varghese, "IP lookups using multiway and multicolumn search", IEEE Infocom, 1998
  • M. Waldvogel, G. Varghese, J. Turner, B. Plattner, "Scalable high-speed IP lookups", ACM Trans. on Comp. Sys., Nov. 2001

15
Overview
  • Longest matching prefix
  • Classification on multiple fields
  • Solution for two-dimensional case grid of tries
  • Bit vector linear search
  • Cross-producting
  • Decision tree approaches
  • Signature matching

16
Packet classification problem
  • Required for security and for recognizing packets with quality of service requirements
  • Guard: prefixes or ranges for k header fields
  • Typically source and destination prefix, source and destination port range, and an exact value or wildcard for the protocol
  • All fields must match for a rule to apply
  • Action: drop, forward, map to a certain traffic class
  • Input: a tuple with the values of the k header fields
  • Output: the action associated with the first rule that matches the packet (rules are strictly ordered)
  • Size of problem: thousands of classification rules

17
Example of classification rule set
[Figure: a router that filters traffic connects the Internet to internal network Net; hosts: mail gateway M, internal time server TI, external time server TO, secondary name server S]
Destination IP  Source IP  Dest Port  Src Port  Protocol  Action
M               *          25         *         *         R1
M               *          53         *         UDP       R2
M               S          53         *         *         R3
M               *          23         *         *         R4
TI              TO         123        123       UDP       R5
*               Net        *          *         *         R6
Net             *          *          *         TCP/ACK   R7
*               *          *          *         *         R8
18
A geometric view of packet classification
[Figure: rules R1, R2, R3 drawn as rectangles over the two-dimensional (source address, destination address) space; overlapping rectangles carve the plane into many distinct regions]
  • In theory the number of regions defined can be much larger than the number of rules
  • Any algorithm that guarantees O(n) space for all rule sets of size n needs O((log n)^(k-1)) time for classification

19
The two-dimensional case: source and destination IP addresses
  • For each destination prefix in the rule set, link to the corresponding node in the destination IP trie a trie with the source prefixes of the rules using this destination prefix
  • The matching algorithm must use backtracking to visit all source tries
  • Grid of tries: by pre-computing "switch pointers" in the source tries and propagating some information about more general rules, matching may proceed without backtracking
  • Memory used is proportional to the number of rules
  • Matching time O(w) with the constant depending on the stride
  • The extended grid of tries handles 5 fields and has good run time and memory in practice

20
[Figure: pre-computed bitmaps for the rule set of the earlier example (rules R1 to R8); bit i from the left corresponds to rule Ri]
Dest IP:   M 11110111    TI 00001111   Net 00000111   * 00000101
Source IP: S 11110011    TO 11011011   Net 11010111   * 11010011
Dest Port: 25 10000111   53 01100111   23 00010111    123 00001111   * 00000111
Src Port:  123 11111111  * 11110111
Proto:     UDP 11111101  TCP 10110111  * 10110101
Example: for a TCP packet matching no specific field value, AND-ing 00000101, 11010011, 00000111, 11110111 and 10110111 leaves 00000001, i.e. only rule R8.
  • Bit vector approaches do a linear search through the rule set
  • For each field we pre-compute a structure (e.g. a trie) to find the most specific prefix or range distinguished by the rule set
  • With each such prefix or range we associate a bitmap of size n encoding which of the rules may match a packet whose field value falls in it
  • The classification algorithm first computes, for each field of the packet, the most specific prefix/range it belongs to
  • By then AND-ing together the k bitmaps of size n we find the matching rules
  • Works well for hardware solutions that allow wide memory reads
  • Scales poorly to large rule sets
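The AND step can be sketched directly from the bitmaps on the slide (per-field lookup is shown as plain dictionaries rather than tries, to keep the sketch short):

```python
# Bit-vector classification: one n-bit bitmap per field value, ANDed together;
# the highest-order 1 in the result is the first (highest priority) rule.
def bv(s):
    return int(s, 2)

dest_ip   = {"M": bv("11110111"), "TI": bv("00001111"),
             "Net": bv("00000111"), "*": bv("00000101")}
src_ip    = {"S": bv("11110011"), "TO": bv("11011011"),
             "Net": bv("11010111"), "*": bv("11010011")}
dest_port = {"25": bv("10000111"), "53": bv("01100111"), "23": bv("00010111"),
             "123": bv("00001111"), "*": bv("00000111")}
src_port  = {"123": bv("11111111"), "*": bv("11110111")}
proto     = {"UDP": bv("11111101"), "TCP": bv("10110111"), "*": bv("10110101")}

def classify(d, s, dp, sp, pr):
    """AND the five per-field bitmaps and report the first matching rule."""
    v = dest_ip[d] & src_ip[s] & dest_port[dp] & src_port[sp] & proto[pr]
    return "R%d" % (9 - v.bit_length()) if v else None  # leftmost set bit

print(classify("M", "S", "25", "*", "*"))    # R1: mail traffic to the gateway
print(classify("*", "*", "*", "*", "TCP"))   # R8: only the default rule matches
```

In hardware the five ANDs over wide words are cheap; the cost that scales poorly is storing one n-bit bitmap per distinguished value of every field.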

21
[Figure: per-field lookup structures feeding cross-producting and recursive flow classification for the same rule set]
Cross product table (excerpt):
  0   M,S,25,123,UDP   R1
  1   M,S,25,123,TCP   R1
  2   M,S,25,123,*     R1
  3   M,S,25,*,UDP     R1
  4   M,S,25,*,TCP     R1
  5   M,S,25,*,*       R1
  ...
  478 *,*,*,*,TCP      R8
  479 *,*,*,*,*        R8
Cross-producting performs longest prefix matching separately for all fields and combines the results in a single step by looking up the matching rule in a pre-computed table explicitly listing the first matching rule for each element of the cross-product. The size of this table is the product of the numbers of recognized prefixes/ranges of the individual fields (here 4 · 4 · 5 · 2 · 3 = 480). Due to its memory requirements this method is not feasible.
Dest IP - Src IP pairs (16 entries, 8 distinct classes):
  0  M,S      11110011  C1
  1  M,TO     11010011  C2
  2  M,Net    11010111  C3
  3  M,*      11010011  C2
  4  TI,S     00000011  C4
  5  TI,TO    00001011  C5
  6  TI,Net   00000111  C6
  7  TI,*     00000011  C4
  8  Net,S    00000011  C4
  9  Net,TO   00000011  C4
  10 Net,Net  00000111  C6
  11 Net,*    00000011  C4
  12 *,S      00000001  C7
  13 *,TO     00000001  C7
  14 *,Net    00000100  C8
  15 *,*      00000001  C7
Equivalenced cross-producting (a.k.a. recursive flow classification or RFC) combines the results of the per-field longest matching prefix operations two by two. The pairs of values are grouped into equivalence classes, and in general there are far fewer equivalence classes than pairs of values. This leads to significant memory savings compared to simple cross-producting. The algorithm provides fast packet classification but, compared to other algorithms, its memory requirements are relatively large (though feasible in some settings).
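The pairwise combination step of RFC can be sketched as follows (helper names are mine; the bitmaps are the Dest IP and Source IP bitmaps from the bit vector slide):

```python
# Equivalenced cross-producting: cross two fields' lookup results and give
# every distinct AND-bitmap one equivalence class id. Later stages combine
# class ids instead of raw field values, which keeps the tables small.
def bv(s):
    return int(s, 2)

dest_ip = {"M": bv("11110111"), "TI": bv("00001111"),
           "Net": bv("00000111"), "*": bv("00000101")}
src_ip  = {"S": bv("11110011"), "TO": bv("11011011"),
           "Net": bv("11010111"), "*": bv("11010011")}

def combine(fa, fb):
    """Cross two fields; pairs with equal AND-bitmaps share one class."""
    classes, table = {}, {}
    for a, ba in fa.items():
        for b, bb in fb.items():
            bits = ba & bb
            cid = classes.setdefault(bits, len(classes))
            table[(a, b)] = cid
    return classes, table

classes, table = combine(dest_ip, src_ip)
print(len(table), "entries,", len(classes), "distinct classes")  # 16 entries, 8 distinct classes
```

Repeating `combine` on pairs of class tables until one table remains yields the final classifier: each lookup is then a fixed sequence of small table indexings.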
22
Decision tree approaches
  • At each node of the tree, test a bit in a field or perform a range test
  • Large fan-out leads to shallow trees and fast classification
  • Leaves contain a few rules traversed linearly
  • Interior nodes may also contain rules that match
  • Tests may look at bits from multiple fields
  • A rule may appear in multiple nodes of the decision tree; this can lead to increased memory usage
  • The tree is built using heuristics that pick fields to compare on so that the remaining rules are divided relatively evenly among the descendants
  • Fast and compact on the rule sets used today
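A toy sketch of the idea (entirely my own construction; the rules, fields `dst`/`src`, and the fixed test order are hypothetical, and rule duplication into both subtrees is the memory cost the slide mentions):

```python
# Decision tree over bit tests: each node tests one bit of one field; rules
# that do not constrain that bit are copied into both subtrees; small leaves
# are scanned linearly in priority order.
LEAF_SIZE = 2

def bit_of(rule, field, pos):
    p = rule[1].get(field, "")
    return p[pos] if pos < len(p) else None   # None: bit unconstrained

def build(rules, tests):
    if len(rules) <= LEAF_SIZE or not tests:
        return rules                           # leaf: linear search
    field, pos = tests[0]
    zero = [r for r in rules if bit_of(r, field, pos) in (None, "0")]
    one  = [r for r in rules if bit_of(r, field, pos) in (None, "1")]
    return (field, pos, build(zero, tests[1:]), build(one, tests[1:]))

def classify(node, packet):
    while isinstance(node, tuple):
        field, pos, zero, one = node
        node = one if packet[field][pos] == "1" else zero
    for name, guard in node:                   # first match = highest priority
        if all(packet[f].startswith(p) for f, p in guard.items()):
            return name
    return None

rules = [("R1", {"dst": "00"}), ("R2", {"dst": "0"}),
         ("R3", {"src": "1"}), ("R4", {})]
tests = [("dst", 0), ("src", 0)]
tree = build(rules, tests)
print(classify(tree, {"dst": "01", "src": "10"}))  # R2
```

Real schemes (e.g. the multidimensional cutting paper cited on the next slide) pick the tests adaptively per node instead of using a fixed order.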

23
Papers on packet classification
  • G. Varghese, "Network Algorithmics", chapter 12
  • V. Srinivasan, G. Varghese, S. Suri, M. Waldvogel, "Fast and Scalable Layer Four Switching", ACM SIGCOMM, Sep. 1998
  • F. Baboescu, S. Singh, G. Varghese, "Packet classification for core routers: Is there an alternative to CAMs?", IEEE Infocom, 2003
  • P. Gupta, N. McKeown, "Packet classification on multiple fields", ACM SIGCOMM, 1999
  • T. Woo, "A modular approach to packet classification: Algorithms and results", IEEE Infocom, 2000
  • S. Singh, F. Baboescu, G. Varghese, "Packet classification using multidimensional cutting", ACM SIGCOMM, 2003

24
Overview
  • Longest matching prefix
  • Classification on multiple fields
  • Signature matching
  • String matching
  • Regular expression matching w/ DFAs and D2FAs

25
Signature matching
  • Used in intrusion prevention/detection, application classification, load balancing
  • Guard: a byte string or a regular expression
  • Action: drop packet, log alert, set priority, direct to a specific server
  • Input: byte string from the payload of packet(s)
  • Hence the name deep packet inspection
  • Output: the positions at which various signatures match, or the identifier of the highest priority signature that matches
  • Size of problem: hundreds of signatures per protocol

26
String matching
  • The most widely used early form of deep packet inspection, but the more expressive regular expressions have superseded strings by now
  • Still used as a pre-filter to more expensive matching operations by the popular open source IDS/IPS Snort
  • Matching multiple strings is a well-studied problem
  • A. Aho, M. Corasick, "Efficient string matching: An aid to bibliographic search", Communications of the ACM, June 1975
  • Many hardware-based solutions published in the last decade
  • Matching time independent of the number of strings, memory requirements proportional to the sum of their sizes
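The Aho-Corasick algorithm cited above can be sketched compactly (a standard construction in my own encoding: a trie of all strings plus failure links gives matching time independent of the number of strings):

```python
# Aho-Corasick: goto is the trie, fail holds failure links, out the strings
# recognized at each state. search makes one transition per input character.
from collections import deque

def build(patterns):
    goto, fail, out = [{}], [0], [set()]
    for p in patterns:
        s = 0
        for ch in p:
            if ch not in goto[s]:
                goto.append({}); fail.append(0); out.append(set())
                goto[s][ch] = len(goto) - 1
            s = goto[s][ch]
        out[s].add(p)
    q = deque(goto[0].values())
    while q:                                  # BFS order sets failure links
        s = q.popleft()
        for ch, t in goto[s].items():
            q.append(t)
            f = fail[s]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[t] = goto[f].get(ch, 0)
            out[t] |= out[fail[t]]            # inherit matches of the suffix state
    return goto, fail, out

def search(automaton, text):
    goto, fail, out = automaton
    s, hits = 0, []
    for i, ch in enumerate(text):
        while s and ch not in goto[s]:
            s = fail[s]
        s = goto[s].get(ch, 0)
        hits += [(i, p) for p in out[s]]      # (end position, pattern)
    return hits

print(sorted(search(build(["he", "she", "his"]), "ushers")))  # [(3, 'he'), (3, 'she')]
```

Memory is proportional to the sum of the pattern sizes, matching the complexity claim on the slide.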

27
Regular expression matching
  • Deterministic and non-deterministic finite automata (DFAs and NFAs) can match regular expressions
  • NFAs are more compact but require backtracking or keeping track of sets of states during matching
  • Both representations are used in hardware and software solutions, but only DFA-based solutions can guarantee throughput in software
  • DFAs have a state space explosion problem
  • From the DFAs recognizing individual signatures we can build a DFA that recognizes the entire signature set in a single pass
  • The size of the combined DFA is much larger than the sum of the sizes of the DFAs recognizing the individual signatures
  • Multiple combined DFAs are used to match the signature set
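The DFA throughput guarantee comes from the matching loop doing exactly one transition per input byte, as in this minimal sketch (the DFA here is hand-built for the toy signature "ab", not taken from the slides):

```python
# DFA-based signature matching: a dense transition table indexed by state
# and input symbol; acceptance is checked after every transition.
def run_dfa(delta, accepting, data):
    """Return the positions at which a signature match is reported."""
    state, hits = 0, []
    for i, ch in enumerate(data):
        state = delta[state][ch]
        if state in accepting:
            hits.append(i)
    return hits

# DFA over the alphabet {a, b} for the signature "ab" occurring anywhere
delta = {
    0: {"a": 1, "b": 0},   # 0: no progress
    1: {"a": 1, "b": 2},   # 1: just saw 'a'
    2: {"a": 1, "b": 0},   # 2: accepting, just saw "ab"
}
accepting = {2}
print(run_dfa(delta, accepting, "aababb"))  # matches end at positions 2 and 4
```

Combining many signatures multiplies the number of states the table must distinguish, which is the state space explosion the slide refers to.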

28
S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, "Algorithms to Accelerate Multiple Regular Expressions Matching for Deep Packet Inspection", ACM SIGCOMM, September 2006
Delayed Input DFA (D2FA)
[Figure: a DFA and the equivalent D2FA; default transitions let most states store only the transitions that differ from their default-transition target]
  • D2FAs build on the observation that for many pairs of states the transition tables are very similar, so it is enough to store the differences. The lookup algorithm may need to follow multiple default transitions until it finds a state that explicitly stores a pointer to the next state it needs to transition to. Since this is a throughput concern, the algorithm for constructing D2FAs allows the user to set a limit on the maximum default path length.
  • If the current state variable meets an acceptance condition (e.g. the state identifier is larger than a given threshold), the automaton raises an alert.
Set of regular expr.  Avg. d.p.l.  Max d.p.l.  Memory (no bound)  Memory (d.p.l. <= 4)
Cisco590              18.32        57          0.80               1.56
Cisco103              16.65        54          0.98               1.54
Cisco7                19.61        61          2.58               3.31
Linux56               7.68         30          1.64               1.87
Linux10               5.14         20          8.59               9.08
Snort11               5.86         9           1.57               1.66
Bro648                6.45         17          0.45               0.51
The memory columns report the ratio between the number of transitions used by the D2FA and the corresponding DFA.
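The D2FA lookup can be sketched as follows (the representation is assumed, not taken from the paper: each state stores only the transitions that differ from its default-transition target, plus one default transition):

```python
# D2FA lookup: follow default transitions until a state explicitly labels
# the current input. Each default hop is an extra memory access, which is
# why construction bounds the default path length.
def next_state(states, s, ch):
    while ch not in states[s]["labeled"]:
        s = states[s]["default"]   # root-like states must label every input
    return states[s]["labeled"][ch]

# Toy D2FA equivalent to the "ab" DFA above: state 0 stores the full table;
# states 1 and 2 store only where they differ from state 0.
states = {
    0: {"default": None, "labeled": {"a": 1, "b": 0}},
    1: {"default": 0, "labeled": {"b": 2}},   # on 'a', defer to state 0
    2: {"default": 0, "labeled": {}},         # all inputs defer to state 0
}
s = 0
for ch in "aab":
    s = next_state(states, s, ch)
print(s)  # state 2, the same state the full DFA reaches on "aab"
```

Here states 1 and 2 store one and zero explicit transitions instead of two each; the table on the slide measures exactly this kind of saving, as a ratio of transition counts.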
29
Conclusions
  • Networking devices implement more and more complex data plane processing to better control traffic
  • The algorithms and data structures used have a big performance impact
  • Often the set of rules to be matched against has specific structure
  • Algorithms exploiting this structure may give good performance even if it is impossible to find an algorithm that performs well on all possible rule sets

30
That's all, folks!