Eric Keller

1
Multi-Level Architecture for Data Plane
Virtualization

Eric Keller
Oral General Exam
5/5/08

2
The Internet (and IP)

Usage of the Internet is continuously evolving
The way packets are forwarded (IP) hasn't
Meant for communication between machines
Address tied to fixed location
Hierarchical addressing
Best-effort delivery
Addresses easy to spoof
Great innovation at the edge (Skype/VoIP,
BitTorrent)
Programmability of hosts at application layer
Can't add any functionality into the network

3
Proposed Modifications

Many proposals to modify some aspect of IP
No single one is best
Difficult to deploy
Publish/Subscribe mechanism for objects
Instead of routing on machine address, route on
object ID
e.g. DONA (Data oriented network architecture),
scalable simulation
Route through intermediary points
Instead of communication between machines
e.g. i3 (internet indirection infrastructure),
DOA (delegation oriented architecture)
Flat Addressing to separate location from ID
Instead of hierarchical based on location
e.g. ROFL (routing on flat labels), SEIZE
(scalable and efficient, zero-configuration
enterprise)

4
Challenges

Want to Innovate in the Network
Can't, because networks are closed
Need to lower barrier for who innovates
Allow individuals to create a network and define
its functionality
Virtualization as a possible solution
For both network of future and overlay networks
Programmable and sharable
Examples: PlanetLab, VINI

5
Network Virtualization

Running multiple virtual networks at the same
time over a shared physical infrastructure
Each virtual network composed of virtual routers
having custom functionality

Physical machine
Virtual router
Virtual network: e.g. blue virtual routers plus
blue links
6
Virtual Network Tradeoffs

Goal: Enable custom data planes per virtual
network
Challenge: How to create the shared network nodes

Programmability
Performance
Isolation
7
Virtual Network Tradeoffs

Goal: Enable custom data planes per virtual
network
Challenge: How to create the shared network nodes

Programmability
How easy is it to add new functionality? What is
the range of new functionality that can be
added? Does it extend beyond software routers?
Performance
Isolation
8
Virtual Network Tradeoffs

Goal: Enable custom data planes per virtual
network
Challenge: How to create the shared network nodes

Programmability
Performance
Isolation
Does resource usage by one virtual network have
an effect on others? Faults? How secure is it
given a shared substrate?
9
Virtual Network Tradeoffs

Goal: Enable custom data planes per virtual
network
Challenge: How to create the shared network nodes

Programmability
Performance
How much overhead is there for sharing? What is
the forwarding rate? Throughput? Latency?
Isolation
10
Virtual Network Tradeoffs

Network Containers
Duplicate stack or data structures
e.g. Trellis, OpenVZ, Logical Router
Extensible Routers
Assemble custom routers from common functions
e.g. Click, Router Plug Ins, Scout
Virtual Machines
Run operating system on top of another operating
system
e.g. Xen, PL-VINI (Linux-VServer)

Each approach compared on Programmability, Isolation, Performance
11
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond commodity hardware
Evaluation
Conclusion/Future Work

12
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond commodity hardware
Evaluation
Conclusion/Future Work

13
User Experience (Creating a virtual network)

Custom functionality
Custom user environment on each node (for
controlling virtual router)
Specify a single node's packet handling as graph of
common functions
Isolated from others sharing same node
Allocated share of resources (e.g. CPU, memory,
bandwidth)
Protected from faults in others (e.g. another
virtual router crashing)
Highest performance possible

For example
User control environment: determine shortest path, populate routing tables
Config/Query interface
Packet processing graph: From devices, Check Header, Destination Lookup, elements A1 to A5, To devices
14
Lightweight Virtualization

Combine graphs into single graph
Provides lightweight virtualization
Add extra packet processing (e.g. mux/demux)
Needed to direct packets to the correct graph
Add resource accounting

Graph 1 and Graph 2: combine into master graph
Master graph: input port, Graph 1, Graph 2, output port
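
The combine step above can be sketched in a few lines. This is only an illustration under stated assumptions: each virtual network's graph is reduced to a single Python callable, packets carry a hypothetical `vnet` tag used as the demux key, and "resource accounting" is just a per-network packet count. The names `MasterGraph`, `add_graph`, and `process` are invented for this sketch and are not from Click or the talk.

```python
from collections import Counter


class MasterGraph:
    """Combines per-network graphs behind one demux (illustrative only)."""

    def __init__(self):
        self.graphs = {}           # vnet id -> packet handler (the "graph")
        self.packets = Counter()   # very simple per-network accounting

    def add_graph(self, vnet, handler):
        self.graphs[vnet] = handler

    def process(self, packet):
        vnet = packet["vnet"]      # demux key; a hypothetical tag
        handler = self.graphs.get(vnet)
        if handler is None:
            return None            # no graph claims this packet: drop it
        self.packets[vnet] += 1    # account before handing off
        return handler(packet)     # result goes back to the shared output


master = MasterGraph()
master.add_graph("blue", lambda p: {**p, "out": "eth1"})
master.add_graph("red", lambda p: {**p, "out": "eth2"})
results = [master.process({"vnet": v, "payload": i})
           for i, v in enumerate(["blue", "red", "blue", "green"])]
```

Here the packet tagged `green` is dropped because no graph was registered for it, and `master.packets` holds the per-network counts that a real accounting mechanism would replace with CPU or bandwidth usage.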
15
Increasing Performance and Isolation

Partition into multiple graphs across multiple
targets
Each target with different capabilities
Performance, Programmability, Isolation
Add connectivity between targets
Unified run-time interface (it appears as a
single graph)
To query and configure the forwarding capabilities

Graph 1 and Graph 2: combine into master graph
Master graph: partition into Target0 graph, Target1 graph, Target2 graph
16
Examples of Multi-Level

Fast Path/Slow Path
IPv4 forwarding in fast path, exceptions in slow
path
i3: Chord ring lookup function in fast path,
handling requests in slow path
Preprocessing
IPSec: do encryption/decryption in HW, rest in
SW
Offloading
TCP Offload
TCP Splicing
Pipeline of coarse grain services
e.g. transcoding, firewall
SoftRouter from Bell Labs

17
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond commodity hardware
Evaluation
Conclusion/Future Work

18
Implementation

Each network has custom functionality
Specified as graph of common functions
Click modular router
Each network allocated share of resources
e.g. CPU
Linux-VServer: single resource accounting for
both control and packet processing
Each network protected from faults in others
Library of elements considered safe
Container for unsafe elements
Highest performance possible
FPGA for modules with HW option, Kernel for
modules without

19
Click Background: Overview

Software architecture for building flexible and
configurable routers
Widely used commercially and in research
Easy to use, flexible, high performance (missing
sharable)
Routers assembled from packet processing modules
(Elements)
Simple and Complex
Processing is a directed graph
Includes a scheduler
Schedules tasks (a series of elements)

20
Linux-VServer
21
Linux-VServer + Click + NetFPGA
click
click
click
Coordinating Process
Installer
Installer
Installer
Click
Click on NetFPGA
22
Outline

Architecture
Implementation
Virtualizing Click in the Kernel
Challenges with kernel execution
Extending beyond software routers
Evaluation
Conclusion/Future Work

23
Virtual Kernel Mode Click

Want to run in Kernel mode
Close to 10x higher performance than user mode
Use library of safe elements
Since Kernel is shared execution space
Need resource accounting
Click scheduler does not do resource accounting
Want resource accounting system-wide (i.e. not
just inside of packet processing)

24
Resource Accounting with VServer

Purpose of Resource Accounting
Provides isolation between virtual networks
Unified resource accounting
For packet processing and control
VServer's Token Bucket Extension to the Linux
Scheduler
Controls eligibility of processes/threads to run
Integrating with Click
Each individual Click configuration assigned to
its own thread
Each thread associated with VServer context
Basic mechanism is to manipulate the task_struct

25
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond software routers
Evaluation
Conclusion/Future Work

26
Unyielding Threads

Linux kernel threads are cooperative (i.e. must
yield)
Token scheduler controls when eligible to start
Single long task can have short term disruptions
Affecting delay and jitter on other virtual
networks
Token bucket does not go negative
Long term, a virtual network can get more than
its share

Tokens added (rate A)
Size of Bucket (S)
Min tokens to exec (M)
Tokens consumed (1 per scheduler tick)
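
The effect of a bucket that does not go negative can be illustrated with a small simulation. This is a sketch under an assumed model, not VServer's actual scheduler code: `rate` tokens (A) are added per tick up to `size` (S), a thread may start a task only when it holds at least `min_tokens` (M), a started task runs to completion without yielding and pays 1 token per tick, and the bucket is clamped at zero. The function name `cpu_share` and the parameter values are hypothetical.

```python
def cpu_share(rate, size, min_tokens, task_ticks, total_ticks=10_000):
    """Fraction of ticks used by one cooperative thread under a token bucket.

    Assumes the CPU is free whenever the thread is eligible, so the result
    shows only what the token bucket itself permits.
    """
    tokens = float(size)
    remaining = 0      # ticks left in the task currently running
    used = 0
    for _ in range(total_ticks):
        tokens = min(size, tokens + rate)
        if remaining == 0 and tokens >= min_tokens:
            remaining = task_ticks          # eligible: start a new task
        if remaining > 0:
            tokens = max(0.0, tokens - 1)   # clamp: the debt is forgotten
            remaining -= 1
            used += 1
    return used / total_ticks


# Same bucket, nominal share of 25% (rate=0.25 tokens per tick).
short = cpu_share(rate=0.25, size=10, min_tokens=1, task_ticks=1)
long_ = cpu_share(rate=0.25, size=10, min_tokens=1, task_ticks=20)
```

With one-tick tasks the thread settles close to its nominal 25% share. With twenty-tick tasks the thread keeps running after the bucket is empty, so under this model its long-term share rises well above the nominal rate.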
27
Unyielding Threads (solution)

Determine maximum allowable execution time
e.g. from token bucket parameters, network
guarantees
Determine pipeline's execution time
Elements from library have known execution times
Custom elements' execution times are unknown
Break pipeline up (for known)
Execute inside of container (for unknown)

elem1
elem2
elem3
elem1
elem2
elem3
elem1
elem2
elem3
From Kern
To User
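
One way to picture the "break pipeline up" step is a greedy cut on profiled cycle counts. The sketch below is an assumption about how such a split could work, not the talk's implementation: `split_pipeline`, the 1500-cycle budget, and the element name `MyCustom` are hypothetical, while the cycle counts are the profiled values quoted on the later element-profile slide.

```python
def split_pipeline(elements, known_cycles, max_cycles):
    """Greedily cut a pipeline into tasks that each fit in max_cycles.

    Returns (tasks, unknown). tasks is a list of element-name lists to run
    in the kernel; unknown lists elements with no profiled cost, which are
    candidates for the user-space container instead. An element whose own
    cost exceeds max_cycles still ends up alone in its own task.
    """
    tasks, current, cost, unknown = [], [], 0, []
    for name in elements:
        if name not in known_cycles:
            unknown.append(name)
            continue
        c = known_cycles[name]
        if current and cost + c > max_cycles:
            tasks.append(current)       # close the task before it overruns
            current, cost = [], 0
        current.append(name)
        cost += c
    if current:
        tasks.append(current)
    return tasks, unknown


known = {"CheckLength": 400, "Counter": 700,
         "HashSwitch": 450, "RadixIPLookup": 1000}
pipeline = ["CheckLength", "RadixIPLookup", "MyCustom",
            "Counter", "HashSwitch"]
tasks, unknown = split_pipeline(pipeline, known, max_cycles=1500)
```

In this run the known elements fall into two tasks of 1400 and 1150 cycles, and `MyCustom` is reported as unknown.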
28
Custom Elements Written in C

Elements have access to global state
Kernel state/functions
Click global state
Could
Pre-compile in user mode
Pre-compile with restricted header files
Not perfect
With C, you can manipulate pointers
Instead, custom elements are unknown (unsafe)
Execute in container in user space

29
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond commodity hardware
Evaluation
Conclusion/Future Work

30
Extending beyond commodity HW

PC + Programmable NIC (e.g. NetFPGA)
FPGA on PCI card
4 GigE ports
On board SRAM and DRAM
Jon Turner's Pool of Processing Elements with
crossbar
PEs can be GPP, NPU, FPGA
Switch Fabric Crossbar

Partition between FPGA and Software
Generalize: Partition among PEs
31
FPGA Click

Two previous approaches
Cliff: Click graph to Verilog, standard
interface on modules
CUSP: Optimize Click graph by parallelizing
internal statements.
Our approach
Build on Cliff by integrating FPGAs into Click
(the tool)
Software Analogies
Connection to outside environment
Packet Transfer
Element specification and implementation
Run-time querying and configuration
Memory
Notifiers
Annotations

FromDevice (eth0)
Element (LEN 5)
ToDevice (eth0)
32
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond commodity hardware
Evaluation
Conclusion/Future Work

33
Experimental Evaluation

Is multi-level the right approach?
i.e. is it worth the effort to support kernel and
FPGA?
Does programmability imply less performance?
What is the overhead of virtualization?
From container when you need to go to user
space.
From using multiple threads when running in
kernel.
Are the virtual networks isolated in terms of
resource usage?
What is the maximum short-term disruption from
unyielding threads?
How long can a task run without leading to
long-term unfairness?

34
Setup
PC3000 on Emulab: 3GHz, 2GB RAM
Nodes: n0, n1, n2, n3, rtr
Generates packets from n0 to n1, tagged with
time
n1: Modify header (IP and ETH) to be from n1 to n2
Receives packets, diffs the current time and
packet time (and stores avg in mem)
rtr: The router under test (Linux or a Click config)
35
Is multi-Level the right approach?

Performance benefit going from user to kernel,
and from kernel to FPGA
Does programmability imply less performance?
Not sacrificing performance by introducing
programmability

36
What is the overhead of virtualization? From
container

When you must go to user space, what is the cost
of executing in a container?
Overhead of executing in a VServer is minimal

37
What is the overhead of virtualization? From
using multiple threads
Thread (each runs X tasks/yield)
Put same Click graph in each thread
Round-robin traffic between them
4portRouter (compound element)
RoundRobin
PollDevice
ToDevice
4portRouter (compound element)
38
How long to run before yielding

tasks per yield
Low => high context switching, I/O executes often
High => low context switching, I/O executes
infrequently

39
What is the overhead of virtualization? From
using multiple threads

Given the sweet spot for each of the virtual networks
Increasing number of virtual networks from 1 to
10 does not hurt aggregate performance
significantly
Alternatives to consider
Single threaded with VServer
Single threaded, modify Click to do resource
accounting
Integrate polling into threads

40
What is the maximum short-term disruption from
unyielding threads?

Profile of (some) Elements
Standard N port router example - 5400 cycles
(1.8us)
RadixIPLookup (167k entries) - 1000 cycles
Simple Elements
CheckLength - 400 cycles
Counter - 700 cycles
HashSwitch - 450 cycles
Maximum Disruption is length of longest task
Possible to break up pipelines

RoundTrip CycleCount
Infinite Source
Elem
Discard NoFree
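
The cycle counts above translate into time by dividing by the clock rate. As a small worked example, assuming the 3GHz clock of the PC3000 nodes from the Setup slide (the helper name `cycles_to_us` is made up for this sketch):

```python
CLOCK_HZ = 3e9  # assumed: the 3GHz PC3000 nodes from the Setup slide


def cycles_to_us(cycles, clock_hz=CLOCK_HZ):
    """Convert a cycle count to microseconds at the given clock rate."""
    return cycles / clock_hz * 1e6


router_us = cycles_to_us(5400)   # standard N port router example
lookup_us = cycles_to_us(1000)   # RadixIPLookup
```

The 5400-cycle router example comes out at about 1.8 microseconds, matching the figure on the slide, and the 1000-cycle lookup at roughly a third of a microsecond.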
41
How long can a task run without leading to
long-term unfairness?
Chewy
4portRouter (compound element)
Infinite Source
Discard
Limited to 15
4portRouter (compound element)
Infinite Source
Discard
Count cycles
42
How long can a task run without leading to
long-term unfairness?
10k extra cycles / task
Zoomed In

Tasks longer than 1 token can lead to unfairness
Run long executing elements in user-space
performance overhead of user-space is not as big
of an issue

43
Outline

Architecture
Implementation
Virtualizing Kernel
Challenges with kernel execution
Extending beyond commodity hardware
Evaluation
Conclusion/Future Work

44
Conclusion

Goal: Enable custom data planes per virtual
network
Tradeoffs
Performance
Isolation
Programmability
Built a multi-level version of Click
FPGA
Kernel
Container

45
Future Work

Scheduler
Investigate alternatives to improve efficiency
Safety
Process to certify element as safe (can it be
automated?)
Applications
Deploy on VINI testbed
Virtual router migration
HW/SW Codesign Problem
Partition decision making
Specification of elements (G language)

46
Questions
47
Backup
48
Signs of Openness

There are signs that network owners and equipment
providers are opening up
Peer-to-peer and network provider collaboration
Allowing intelligent selection of peers
e.g. Pando/Verizon (P4P), BitTorrent/Comcast
Router Vendor API
allowing creation of software to run on routers
e.g. Juniper PSDP, Cisco AXP
Cheap and easy access to compute power
Define functionality and communication between
machines
e.g. Amazon EC2, Sun Grid

49
Example 1: User/Kernel Partition

Execute unsafe elements in container
Add communication elements

container
u1
fk
tk
u1
User Kernel
s1
s2
s3
tu
fu
Safe (s1, s2, s3) Unsafe (u1)
s1
s2
s3
ToUser (tu), FromKernel (fk) ToKernel(tk),
FromUser (fu)
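
The insertion of communication elements can be sketched as a pass over a linear pipeline. This is an illustration only: the function name `partition` is hypothetical, the pipeline is assumed to be a simple chain, and the position of u1 between s1 and s2 is a guess, since the original figure's ordering is not recoverable from the transcript. The element names ToUser, FromKernel, ToKernel, and FromUser are those listed in the legend above.

```python
def partition(pipeline, unsafe):
    """Insert communication elements wherever the pipeline changes domain.

    Safe elements stay in the kernel; elements named in `unsafe` run in the
    user-space container. Returns a list of (domain, element) pairs.
    """
    out, domain = [], "kernel"
    for elem in pipeline:
        target = "user" if elem in unsafe else "kernel"
        if target != domain:
            if target == "user":
                # Kernel hands the packet up to the container.
                out += [("kernel", "ToUser"), ("user", "FromKernel")]
            else:
                # Container hands the packet back down to the kernel.
                out += [("user", "ToKernel"), ("kernel", "FromUser")]
            domain = target
        out.append((target, elem))
    return out


plan = partition(["s1", "u1", "s2", "s3"], unsafe={"u1"})
```

For this assumed chain, `plan` places s1, ToUser, FromUser, s2, and s3 in the kernel, and FromKernel, u1, and ToKernel in the container.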
50
Example 2: Non-Commodity HW

PC + Programmable NIC (e.g. NetFPGA)
FPGA on PCI card
4 GigE ports
On board SRAM and DRAM
Jon Turner's Pool of Processing Elements with
crossbar
PEs can be GPP, NPU, FPGA
Switch Fabric Crossbar

Partition between FPGA and Software
Generalize: Partition among PEs
51
Example 2: Non-Commodity HW

Redrawing the picture for FPGA/SW
Elements can have HW implementation, SW
implementation, or both (choose one)

fd
td
sw1
sw1
Software FPGA
hw1
hw2
hw3
tc
fc
Software (sw1) Hardware (hw1, hw2, hw3)
hw1
hw2
hw3
ToCPU (tc), FromDevice (fd) ToDevice(td), FromCPU
(fc)
52
Connection to outside environment

In Linux, the board is a set of devices (e.g.
eth0)
Can query Linux for what's available
Network driver (to read/write packets)
Inter process communication (for comm with
handlers)
FPGA is a chip on a board
Using eth0 needs
Pins to connect to
Some on chip logic (in form of IP Core)
Board API
Specify available devices
Specify size of address block - used by char
driver
Provide elaborate() function
Generates a top level Verilog module
Generates a UCF file (pin assignments)

53
Packet Transfer

In software it is a function call
In FPGA use a pipeline of elements with a
standard interface
Option 1: Stream packet through, 1 word at a time
Could just be the header
Push/Pull a bit tricky
Option 2: Pass pointer
But would have to go to memory (inefficient)

data
Element1
Element2
ctrl
valid
ready
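
The streaming option can be modeled in software to show how the valid and ready signals gate each word. The sketch assumes the common convention that a word moves from Element1 to Element2 only in a cycle where both valid and ready are high; the talk does not spell out the exact handshake, and the function name `stream` is invented here.

```python
def stream(words, ready_pattern):
    """Transfer words one per cycle when both valid and ready are high.

    `ready_pattern` gives the downstream element's ready signal per cycle.
    Returns a list of (cycle, word) pairs for the words actually received.
    """
    received, i = [], 0
    for cycle, ready in enumerate(ready_pattern):
        valid = i < len(words)          # upstream has a word to offer
        if valid and ready:
            received.append((cycle, words[i]))
            i += 1
    return received


transfers = stream([0xAA, 0xBB, 0xCC], [True, False, True, True, True])
```

In this run the second word waits one cycle because the downstream element is not ready in cycle 1, then the remaining words move in consecutive cycles.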
54
Element specification and implementation

Need
Meta-data
Specify packet processing
Specify run-time querying handling (next slide)
Meta-data
Use Click C API
Ports
Registers to use specific devices
e.g. FromDevice(eth0) registers to use eth0
Packet Processing
Use C to print out Verilog
Specialized based on instantiation parameters
(config. string)
Standard interface for packet
Standard interface for handler
Currently memory mapped register

55
Run-time querying and configuration

Query state and update configuration in elements
e.g. add ADDR/MASK GW OUT

user

When Creating Element
Request Addr Block
Specify software handlers
Uses read/write methods to get data
Allocating Addresses
Given total size, and
size of each element's requested block
Generating Decode Logic

click
telnet
char driver
kernel
PCI
FPGA
decode
elem1
elem2
elem3
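
Address allocation and the decode step can be sketched as follows. This is a minimal illustration under assumptions: blocks are packed back to back in request order, whereas real decode logic would likely prefer power-of-two aligned blocks, and the names `allocate`, `decode`, and the block sizes are hypothetical.

```python
def allocate(total_size, requests):
    """Assign each element a contiguous [base, base + size) address block."""
    blocks, base = {}, 0
    for name, size in requests:
        if base + size > total_size:
            raise ValueError(f"address space exhausted at {name}")
        blocks[name] = (base, size)
        base += size
    return blocks


def decode(blocks, addr):
    """Return (element, offset) for an address, or None if unmapped."""
    for name, (base, size) in blocks.items():
        if base <= addr < base + size:
            return name, addr - base
    return None


blocks = allocate(256, [("elem1", 16), ("elem2", 64), ("elem3", 32)])
hit = decode(blocks, 20)     # falls inside elem2's block
miss = decode(blocks, 200)   # beyond the last allocated block
```

A read or write arriving from the char driver would be routed to the element returned by `decode`, with the offset selecting a register inside that element's block.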
56
Memory

In software
malloc
static arrays
Share table through global variables or passing
pointer
Elements that do no packet processing (passed as
configuration string to elements)
In FPGA
Elements have local memory (registers/BRAM)
Unshared (off-chip) memories: treat like a
device
Shared (off-chip) global memories (Unimplemented)
Globally shared vs. Shared between subset of
elements
Elements that do no packet processing
(Unimplemented)

57
Notifiers, Annotations

Notifiers
Element registers as listener or notifier
In FPGA, create extra signal(s) from notifier to
listener
Annotations
Extra space in Packet data structure
Used to mark packet with info not in packet
Which input port packet arrived in
Result of lookup
In software
fixed byte array
In FPGA
packet is streamed through, so adding extra bytes
is simple

58
User/Kernel Communication

Add communication elements
Use mknod for each direction
ToUser/FromUser store packets and provide file
functions
ToKernel/FromKernel use file I/O

container
u1
fk
tk
u1
User Kernel
s1
s2
s3
tu
fu
Safe (s1, s2, s3) Unsafe (u1)
s1
s2
s3
ToUser (tu), FromKernel (fk) ToKernel(tk),
FromUser (fu)
59
FPGA/Software Communication

Add communication elements
ToCPU/FromCPU uses device that communicates with
Linux over PCI bus
Network driver in Linux
ToDevice/FromDevice: standard Click elements

fd
td
sw1
sw1
Software FPGA
hw1
hw2
hw3
tc
fc
hw1
hw2
hw3
Software (sw1) Hardware (hw1, hw2, hw3)
ToCPU (tc), FromDevice (fd) ToDevice(td), FromCPU
(fc)
