Title: Supercharged PlanetLab Platform, Control Overview
1 Supercharged PlanetLab Platform, Control Overview
- Fred Kuhns
- fredk_at_arl.wustl.edu
- Applied Research Laboratory
- Washington University in St. Louis
2 Prototype Organization
- One NP blade (with RTM) implements Line Card
  - separate ingress/egress pipelines
- Second NP hosts multiple slice fast-paths
  - multiple static code options for diverse slices
  - configurable filters and queues
- GPEs run standard PlanetLab OS with vServers
3 Connecting an SPP
[Figure: SPP connectivity. Labels: East Coast, Local/Regional, West Coast; Host; plab/SPP; SPP; Ethernet SW; sw; point-to-point links; ARP endstations and intermediate routers.]
For now assume there is just a single connection to the public Internet.
4 System Block Diagram
[Figure: block diagram of an SPP node. Labels recovered from the diagram:]
- External: PLC; Power Control Unit (has own IP Address); External Interfaces (RTM, 10 x 1GbE)
- Blades in the SPP Node: LC, NPE, GPE; Standalone GPEs
- Daemons: Substrate Control Daemon (SCD); Boot and Configuration Control (BCC)
- Line card tables: ARP Table, FIB; NAT and Tunnel filters (in/out); flow stats (netflow)
- Boot files: bootcd; cacert.pem, boot_server, plnode.txt; sppnode.txt
- Software labels: pl_netflow, user slivers, vnet
- Hardware labels: xscale, NPU-A, NPU-B, TCAM, GE, SPI, PCI, interfaces
- Hub: Fabric Ethernet Switch (10Gbps, data path); Base Ethernet Switch (1Gbps, control); I2C (IPMI); Shelf manager
- Control Processor (CP): tftp, dhcpd, sshd, httpd, routed; Resource DB, nodeconf.xml, route DB, boot files, Slivers DB, user info, flow stats
- Notes on the diagram: "ReBoot how??"; "move pl_netflow to cp?"; "manage LC Tables"
All flow monitoring done at Line Card
5 Software Components
- Utilities: parts of BCC to generate config and distribution files
  - Node configuration and management: generate config files, dhcp, tftp, ramdisk
  - Boot CD and distribution file management (images, RPM and tar files) for GPEs and CP.
- Control processor
  - Boot and Configuration Control (BCC)
  - System Resource Manager (SRM)
  - System Node Manager (SNM)
  - user authentication and ssh forwarding daemon
  - http daemon providing a node specific interface to netflow data (planetflow)
  - Routing protocol daemon (BGP/OSPF/RIP) for maintaining FIB in Line Card
- General Purpose Element (GPE)
  - Local Boot Manager (LBM): modified BootManager running on the GPEs
  - Resource Manager Proxy (RMP)
  - Node Manager Proxy (NMP), that is, the required changes to existing Node Manager software.
- Network Processor Element (NPE)
  - Substrate Control Daemon (SCD, formerly known as wuserv)
  - kernel module to read/write memory locations (wumod)
  - Command interpreter for configuring NPU memory (wucmd)
  - Modified Radisys and Intel source, ramdisk, Linux kernel
6 Boot and Configuration Control
7 Boot and Configuration Control
- Read config file and allocate IP subnets and addresses for substrate
- Initialize Hub (delegate to SRM)
  - base and fabric switches
  - Initialize any switches not within the chassis
- Create dhcp configuration file and start daemon
  - assigns control IP subnets and addresses
  - assigns internal substrate IP subnet on fabric Ethernet
- Initialize Line Card to forward all traffic to CP
  - Use the control interface, base or front panel (Base only connected to NPUA).
  - All ingress traffic sent to CP
  - What about egress traffic when we are multi-homed, either through different physical ports or one port with more than one next hop?
    - We could assume only one physical port and one next hop.
    - This is a general issue; the general solution is to run routing protocols on the CP and keep the line card's TCAM up to date.
- Start remaining system level services (i.e. daemons)
  - wuarl daemons
  - system daemons: sshd, httpd, routed
  - System Node Manager maintains user login information for ssh forwarding
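The address allocation and dhcp configuration steps above could look roughly like the following sketch. It is illustrative only: the subnets, blade names, and MAC addresses are placeholders (documentation ranges), and the function names are hypothetical rather than part of the BCC. It assumes an ISC-dhcpd-style configuration file with one fixed-address host entry per blade.

```python
import ipaddress

# Hypothetical inputs: placeholder subnets, blade names, and MAC addresses,
# not values taken from a real SPP configuration file.
CONTROL_NET = ipaddress.ip_network("192.168.10.0/24")    # base switch (control)
SUBSTRATE_NET = ipaddress.ip_network("192.168.20.0/24")  # fabric switch (data)
BLADES = {
    "cp": "00:00:5e:00:53:01",
    "lc": "00:00:5e:00:53:02",
    "npe1": "00:00:5e:00:53:03",
    "gpe1": "00:00:5e:00:53:04",
}


def allocate_addresses(blades, control_net, substrate_net):
    """Give every blade one control address and one substrate address."""
    control_hosts = iter(control_net.hosts())
    substrate_hosts = iter(substrate_net.hosts())
    return {
        name: {
            "mac": mac,
            "control": next(control_hosts),
            "substrate": next(substrate_hosts),
        }
        for name, mac in blades.items()
    }


def dhcpd_conf(plan, control_net):
    """Render a dhcpd-style config: one subnet declaration plus one host per blade."""
    lines = [
        f"subnet {control_net.network_address} netmask {control_net.netmask} {{",
        "}",
    ]
    for name, entry in plan.items():
        lines += [
            f"host {name} {{",
            f"  hardware ethernet {entry['mac']};",
            f"  fixed-address {entry['control']};",
            "}",
        ]
    return "\n".join(lines) + "\n"


plan = allocate_addresses(BLADES, CONTROL_NET, SUBSTRATE_NET)
print(dhcpd_conf(plan, CONTROL_NET))
```

Addresses are handed out in the order the blades are listed, so the plan is reproducible across restarts as long as the blade list does not change.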
8 Boot and Configuration Control
- Assist GPE in booting
  - Download from PLC the SPP specific version of the BootManager and NodeManager tar/rpm distributions.
  - Downloads/maintains PlanetLab bootstrap distribution
- Updated BootCD
  - The boot CD contains the SPP config file with the CP address, spp_config.
  - No modifications to initial boot scripts; they contact the BCC over the fabric interface (using the substrate IP subnet) and download the next stage.
- GPEs obtain distribution files from the BCC on the CP
  - SPP changes are confined to the BootManager and NodeManager sources (that is the plan)
  - PLC Database updated to place all SPP nodes in the SPP Node Group; we use this to trigger additional special processing.
  - Modified BootManager scripts configure control interfaces (Base) and 2 Fabric interfaces (2 per Hub).
  - Creates/Updates spp_config file on GPE node
  - Installs BootStrap source then overwrites the NodeManager with our modified version.
9 Node Manager
10 System Node Manager
- Logically the top-half of the PlanetLab Node Manager
- PLC API method GetSlivers()
  - periodically call PLC for current list of slices assigned to this node
  - assign system slivers to each GPE, then split application slivers across available GPEs
  - keep persistent tables to handle daemon crashes or local device reboots
- Local GetSlivers() (xmlrpc interface) to GPEs
  - Node Manager Proxies (per GPE): list of allocated slivers along with other node specific data: timestamp, list of configuration files, node id, node groups, network addresses, assigned slivers
- Resource management across GPEs
  - Manage Pool and VM RSpec assignment for each GPE
  - opportunity to extend RSpecs to account for distributed resources.
  - Perform top-half processing of the per GPE NMP api (exported to slivers on this GPE only). Calls on one GPE may impact resource assignments or sliver status on a different GPE
    - Ticket(), GetXIDs(), GetSSHKeys(), Create(), Destroy(), Start(), Stop(), GetEffectiveRSpec(), GetRSpec(), GetLoans(), validate_loans(), SetLoans()
- Currently the node manager uses CA Certs and SSH keys when communicating with PLC; we will need to do the same. But we can relax security between SNM and the NMPs.
- Tightly coupled with the System Resource Manager
  - Maintain a globally unique (to the node) Sliver ID which corresponds to what we call the meta-router ID and make available to SRM when enabling fast path processing (VLANs, UDP Port numbers etc).
  - must request/maintain list of available GPEs and resource availability on each. Used for allocating slivers to GPEs and handling RSpecs.
  - SRM may delegate GPE management to SNM.
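One way to read the "assign system slivers to each GPE, then split application slivers" step is sketched below. The least-loaded placement policy, the `system` flag on each sliver record, and the function name are assumptions made for illustration; the slides do not specify the actual policy. The `previous` argument stands in for the persistent table, so existing placements survive a daemon restart.

```python
def assign_slivers(slivers, gpes, previous=None):
    """Place slivers on GPEs.

    slivers:  list of {"name": str, "system": bool} (hypothetical record shape)
    gpes:     list of available GPE names
    previous: earlier {sliver name: GPE} placements, e.g. from a persistent table
    Returns {GPE name: [sliver names]}.
    """
    previous = previous or {}
    placement = {gpe: [] for gpe in gpes}
    app_load = {gpe: 0 for gpe in gpes}
    pending = []

    for sliver in slivers:
        name = sliver["name"]
        if sliver["system"]:
            # System slivers run on every GPE.
            for gpe in gpes:
                placement[gpe].append(name)
        elif previous.get(name) in placement:
            # Keep an existing placement if its GPE is still available.
            gpe = previous[name]
            placement[gpe].append(name)
            app_load[gpe] += 1
        else:
            pending.append(name)

    # New application slivers go to the GPE with the fewest application slivers.
    for name in pending:
        gpe = min(gpes, key=lambda g: app_load[g])
        placement[gpe].append(name)
        app_load[gpe] += 1
    return placement


slivers = [
    {"name": "pl_netflow", "system": True},
    {"name": "slice_a", "system": False},
    {"name": "slice_b", "system": False},
    {"name": "slice_c", "system": False},
]
print(assign_slivers(slivers, ["gpe1", "gpe2"]))
```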
11 SNM Questions
- Robustness -- not contemplating for this version
  - If a GPE goes down do we migrate slivers to remaining GPEs?
  - If a GPE is added do we migrate some slivers to new GPE to load balance?
  - Intermediate solution
    - If GPE goes down then mark the corresponding slices as unmapped and reassign to remaining GPEs
    - No migration of slivers when GPEs are added, just assign new slivers to the new GPE
- Do we need to intercept any of the API calls made against the PLC?
- What about the boot manager api calls and the uploading of boot log files (alpina boot logs)?
- implementation of the remote reboot command and console logging.
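The intermediate solution can be sketched as two small operations on a sliver-to-GPE table. This is a hedged illustration, not the SNM design: the table shape, the function names, and the least-loaded tie-breaking are all assumptions.

```python
def handle_gpe_down(mapping, failed, remaining):
    """Mark the failed GPE's slivers as unmapped (None), then reassign them.

    mapping:   {sliver name: GPE name or None}
    failed:    name of the GPE that went down
    remaining: list of GPEs still available
    """
    updated = {s: (None if g == failed else g) for s, g in mapping.items()}
    load = {g: 0 for g in remaining}
    for g in updated.values():
        if g in load:
            load[g] += 1
    for sliver in sorted(s for s, g in updated.items() if g is None):
        if not remaining:
            break  # nothing left to host it; the sliver stays unmapped
        target = min(remaining, key=lambda g: load[g])
        updated[sliver] = target
        load[target] += 1
    return updated


def place_new_sliver(mapping, sliver, gpes):
    """No migration when a GPE is added: only a new sliver lands on it."""
    load = {g: 0 for g in gpes}
    for g in mapping.values():
        if g in load:
            load[g] += 1
    updated = dict(mapping)
    updated[sliver] = min(gpes, key=lambda g: load[g])
    return updated


table = {"slice_a": "gpe1", "slice_b": "gpe2", "slice_c": "gpe2"}
after_failure = handle_gpe_down(table, "gpe2", ["gpe1", "gpe3"])
after_new = place_new_sliver(after_failure, "slice_d", ["gpe1", "gpe3", "gpe4"])
print(after_failure, after_new)
```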
12 Node Manager Proxy
- Bottom-half of existing Node Manager
  - modify GetSlivers() to call the System Node Manager.
  - use base interface and different security (currently they wrap xmlrpc calls with a curl command which includes the PLC's certified public key).
- Forward GPE oriented sliver resource operations to SNM; see API list in SNM description
13 System Resource Manager
14 System Resource Manager
[Figure: System Resource Manager context. Labels: LC; GPE; NMP; RMP; root context; planetlab OS; Resource DB; NPE; SCD; FPk (shown three times).]
15 System Resource Manager
- Maintains table describing system hardware components and their attributes
  - NPEs: code-options, memory blocks, counters, TCAM entries
  - GPEs and HW attributes
- Sliver attributes corresponding to internal representations and control mechanisms
  - unique Sliver ID (aka meta-router ID)
  - global port space across assigned IP addresses
  - fast path VLAN assignment and corresponding IP Subnets
- HUB Management
  - Manage fabric Ethernet switches (including any used external to the Chassis or in a multi-chassis scenario)
  - Manage base SW
  - Manage line card table entries??
16 System Resource Management
- Allocate Global port space
  - input: Slice ID, Global IP address=0, proto=UDP, Port=0
  - actions: allocate port
  - output: IP Address, Port, Proto, or 0 if it can't allocate
- Allocate Sliver ID
  - input: Slice name
  - actions:
    - Allocate unique Sliver ID and assign to slice
    - allocate VLAN ID (1-to-1 map of sliver ID to VLAN)
  - output: Sliver ID, VLAN ID
- Allocate NPE code option (internal)
  - input: Sliver ID, code option id
  - action: Assign NPE slot to slice
    - Allocate code option instance from an eligible NPE: NPE, instance ID
    - Allocate memory block for instance (the instance ID is just an index into an array of preallocated memory blocks).
  - output: NPE Instance = NPE ID, Slot Number
- Allocate Stats Index
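The three allocators above can be sketched as operations on a few in-memory tables. This is a minimal sketch under stated assumptions: a single global IP address, a VLAN ID derived as a fixed offset from the Sliver ID (one possible 1-to-1 map), and a hypothetical class and method naming. The real SRM interfaces are not given in the slides.

```python
class ResourceTables:
    """Hypothetical in-memory stand-in for the SRM allocation tables."""

    def __init__(self, global_ip, ports, vlan_base, npe_slots):
        self.global_ip = global_ip
        self.free_ports = set(ports)
        self.port_owner = {}   # (ip, port, proto) -> slice id
        self.vlan_base = vlan_base
        self.sliver_ids = {}   # slice name -> sliver id
        # npe_slots: {npe id: {code option id: number of preallocated slots}}
        self.slots = {
            npe: {opt: [None] * count for opt, count in options.items()}
            for npe, options in npe_slots.items()
        }

    def allocate_port(self, slice_id, ip=0, proto="UDP", port=0):
        """ip=0 / port=0 mean 'any'. Returns (ip, port, proto), or 0 on failure."""
        ip = ip or self.global_ip
        if port == 0:
            if not self.free_ports:
                return 0
            port = min(self.free_ports)
        elif port not in self.free_ports:
            return 0
        self.free_ports.remove(port)
        self.port_owner[(ip, port, proto)] = slice_id
        return (ip, port, proto)

    def allocate_sliver_id(self, slice_name):
        """Returns (sliver id, VLAN id); repeated calls return the same pair."""
        if slice_name not in self.sliver_ids:
            self.sliver_ids[slice_name] = len(self.sliver_ids) + 1
        sliver_id = self.sliver_ids[slice_name]
        return sliver_id, self.vlan_base + sliver_id

    def allocate_code_option(self, sliver_id, code_option):
        """Returns (npe id, slot number) from the first eligible NPE, or None."""
        for npe_id, options in self.slots.items():
            instances = options.get(code_option, [])
            for slot, owner in enumerate(instances):
                if owner is None:
                    # The slot number doubles as the index of the memory block.
                    instances[slot] = sliver_id
                    return npe_id, slot
        return None


srm = ResourceTables("192.0.2.1", range(20000, 20002), 100, {"npe1": {"ipv4": 1}})
sliver_id, vlan_id = srm.allocate_sliver_id("slice_a")
print(srm.allocate_port("slice_a"), sliver_id, vlan_id,
      srm.allocate_code_option(sliver_id, "ipv4"))
```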
17 System Resource Manager
- Add Tunnel (aka Meta-Interface) to NPE Instance
  - input: Sliver ID, NPE Instance, IP Address, UDP Port
  - actions:
    - Add mapping to NPE demux table: (VLAN, IP Addr, UDP Port) <-> Instance ID
    - Update instance's attribute block: tunnel fields, exception/local delivery, QID, physical port, Ethernet addr for NPE/LC
    - Update next hop table (result index map to next hop tunnel)
    - Set default QM weights, number of queues, thresholds.
    - Update Line Card Ingress and Egress lookup tables: tunnel, NPE Ethernet address, physical port, QIDs etc.??
    - Update LC ingress and egress queue attributes for tunnel??
- Create NPE Sliver instance
  - input: Slice ID; IP address, UDP Port; Interface ID, Physical Port; SRAM block; number of filter table entries; number of queues; number of packet buffers; code option; amount of SRAM required; total reserved bandwidth
  - actions:
    - Allocate NPE code option
    - Add tunnel to NPE Instance
    - enable Sliver VLAN on associated fabric interface ports
    - delegate to RMP: configure GPE vnet module (via RMP) to accept Sliver's VLAN traffic. Open UDP Port for data and control in root context and pass back to client.
  - output: (NPE code option) Instance number
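The demux mapping in the Add Tunnel step is a two-way association between a (VLAN, IP address, UDP port) key and a code option instance. A minimal sketch, assuming plain dictionaries in place of the NPE's actual table format (which the slides do not describe); the class and method names are hypothetical.

```python
class NpeDemuxTable:
    """Hypothetical model of the NPE demux table."""

    def __init__(self):
        self.by_key = {}       # (vlan, ip, udp_port) -> instance id
        self.by_instance = {}  # instance id -> set of (vlan, ip, udp_port)

    def add_tunnel(self, vlan, ip, udp_port, instance_id):
        """Bind a tunnel (meta-interface) to an instance; reject conflicts."""
        key = (vlan, ip, udp_port)
        if self.by_key.get(key, instance_id) != instance_id:
            raise ValueError("tunnel already bound to another instance")
        self.by_key[key] = instance_id
        self.by_instance.setdefault(instance_id, set()).add(key)

    def lookup(self, vlan, ip, udp_port):
        """Forward direction: which instance handles this packet? None if unknown."""
        return self.by_key.get((vlan, ip, udp_port))

    def tunnels_of(self, instance_id):
        """Reverse direction: all tunnels bound to an instance."""
        return sorted(self.by_instance.get(instance_id, ()))


demux = NpeDemuxTable()
demux.add_tunnel(101, "192.0.2.1", 20000, instance_id=0)
demux.add_tunnel(101, "192.0.2.1", 20001, instance_id=0)
print(demux.lookup(101, "192.0.2.1", 20000), demux.tunnels_of(0))
```

Keeping both directions makes it cheap to remove every tunnel of an instance when the instance is deleted.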
18 Resource Manager Proxy
- Act as intermediary between client virtual machines and the node control infrastructure.
  - all exported interfaces are implemented by the RMP
- managing the life cycle of an NPE code instance
- accessing instance data and memory locations
  - read/write to code option instance's memory block
  - get/set queue attributes: threshold, weight
  - get/add/remove/update lookup table entries (i.e. TCAM filters)
  - get/clear pre/post queue counters, for a given stats index
    - one-time or periodic get
  - get packet/byte counter for tunnel at Line card
  - allocate/release local Port
19 Example Scenarios
20 Default Traffic Configurations
[Figure: default traffic paths. Labels: NPE; GPE; NMP; RMP; root context; planetlab OS; Substrate; 10GbE (fabric, data); 1GbE (base, control); CP; LC; user login info; Resource DB; sliver tbl; PLC.]
- Control messages are sent over an isolated base Ethernet switch, for isolation and security.
- Line card performs NAT-like function for traffic from vservers.
- Default traffic forwarded to CP over 10Gbps Ethernet switch (aka fabric)
21 Logging Into a Slice
[Figure: login path. Labels: NPE; GPE; NMP; RMP; root context; planetlab OS; Host (located within node); Substrate; 10GbE (fabric, data); 1GbE (base, control); CP; LC; ssh fwder; user login info; Resource DB; sliver tbl; PLC.]
- ssh connection directed to CP for user authentication
- Once authenticated, session forwarded to appropriate GPE and vserver.
22 Update Local Slice Definitions
[Figure: slice update path. Labels: NPE; GPE; NMP; RMP; root context; planetlab OS; Host (located within node); Substrate; 10GbE (fabric, data); 1GbE (base, control); CP; LC; user login info; Resource DB; sliver tbl; PLC.]
- retrieve/update slice descriptions
- update local database, allocate slice instances (slivers) to GPE nodes
23 Creating Local Slice Instance
[Figure: slice creation path. Labels: NPE; GPE; NMP; RMP; root context; planetlab OS; Host (located within node); Substrate; 10GbE (fabric, data); 1GbE (base, control); CP; LC; user login info; Resource DB; sliver tbl; PLC.]
- retrieve/update slice descriptions
- create new slice
24 Allocating NPE (Creating Meta-Router)
[Figure: NPE allocation path. Labels: NPE; FPk (FP - fast path); GPE; NMP; RMP; root context; planetlab OS; Host (located within node); VLANk; MI1; Substrate; 10GbE (fabric, data); 1GbE (base, control); CP; LC; user login info; Resource DB; sliver tbl; PLC.]
Annotations on the figure:
- Allocate NPE sliver: code option, SRAM, Interfaces/Ports, etc
- Forward request to System Resource Manager
- Allocate shared NPE resources, associate with new slice fast path: SRAM block; number of filter table entries; number of queues; number of packet buffers; code option; amount of SRAM required; total reserved bandwidth
- Allocate and enable VLAN to isolate internal slice traffic, VLANk
- Allocate global UDP port for requested interface(s); configure Line card.
- Returns status and assigned global Port number
- Open local socket for exception and local delivery traffic; return to client vserver
25 Managing the Data Path
- Allocate or Delete NPE Slice instance
- Add, remove or alter filters
  - each slice is allocated a portion of the NPE's TCAM
- Read or write to per slice memory blocks in SRAM
  - each slice is allocated a block of SRAM
- Read counters
  - one time or periodic
- Set Queue rate or threshold.
- Get queue lengths
[Figure: data path management. Labels: NPE; GPE; NMP; RMP; SCD; DPl; FPk (FP - fast path); root context; planetlab OS; 10GbE (fabric, data); 1GbE (base, control); CP; user login info; Resource DB; sliver tbl.]
26 Misc Functions
27 Other LC Functions
- Line Card Table maintenance
  - multi-homed SPP node must be able to send packets to the correct next hop router/endsystem
  - random traffic from/to the GPE must be handled correctly
  - tunnels represent point-to-point connections so it may be alright to explicitly indicate which of possibly several interfaces and next (Ethernet) hop devices the tunnel should be bound to
  - alternatively, if we are running the routing protocols, we could provide the user with the output port as a utility program.
  - But there are problems with running routing protocols: we could forward all route updates to the CP, but standard implementations assume the interfaces are physically connected to the endsystem.
    - We could play tricks as VINI does.
  - or we assume that there is only one interface connected to one Ethernet device.
- NAT Functions
  - traffic originating from within SPP
  - may also want to selectively map global proto/port number to specific GPEs?
- ARP and FIB on Line card
  - route daemon runs on CP and keeps FIB up to date
  - ARP runs on xscale and maps FIB next hop entries to their corresponding Ethernet destination addresses.
- netflow
  - flow-based statistics collection
  - SRM collects periodically and posts via web
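The split between FIB and ARP described above amounts to a two-step lookup: longest-prefix match to find the next hop and output port, then an ARP lookup to find the next hop's Ethernet address. The sketch below models both tables in software with hypothetical names and documentation-range addresses; the line card itself would hold these entries in its own lookup tables, and nothing here reflects its real format.

```python
import ipaddress


class LineCardTables:
    """Hypothetical software model of the line card FIB and ARP tables."""

    def __init__(self):
        self.fib = []  # (network, next hop, physical port), longest prefix first
        self.arp = {}  # next hop address -> Ethernet address

    def add_route(self, prefix, next_hop, port):
        """Route daemon side: install or refresh a FIB entry."""
        entry = (ipaddress.ip_network(prefix), ipaddress.ip_address(next_hop), port)
        self.fib.append(entry)
        self.fib.sort(key=lambda route: route[0].prefixlen, reverse=True)

    def learn_arp(self, ip, mac):
        """ARP side: record the Ethernet address of a next hop."""
        self.arp[ipaddress.ip_address(ip)] = mac

    def resolve(self, dst):
        """Return (port, next hop, Ethernet address or None), or None if no route."""
        dst = ipaddress.ip_address(dst)
        for network, next_hop, port in self.fib:
            if dst in network:
                return port, next_hop, self.arp.get(next_hop)
        return None


lc = LineCardTables()
lc.add_route("0.0.0.0/0", "198.51.100.1", port=0)
lc.add_route("203.0.113.0/24", "198.51.100.2", port=1)
lc.learn_arp("198.51.100.2", "00:00:5e:00:53:10")
print(lc.resolve("203.0.113.9"))
```

A missing Ethernet address (`None`) marks a next hop that still needs an ARP request before packets can be sent to it.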
28 Other Functions
- vnet
  - isolation based on VLAN IDs
  - support port reservations
- ssh forwarding
  - maintain user login information on CP
  - modify ssh daemon (or have wrapper) to forward user logins to correct GPE
- rebooting Node (spp), even when line card fails??
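The ssh forwarding decision reduces to a table lookup on the CP once the user is authenticated. A minimal sketch, assuming the login name identifies the slice and that the CP keeps a slice-to-GPE table plus each GPE's control address; the function name, table shapes, and addresses are all hypothetical.

```python
def forwarding_target(slice_login, sliver_table, gpe_control_addr):
    """Pick the GPE and vserver for an authenticated login.

    slice_login:      login name presented to the CP (assumed to name the slice)
    sliver_table:     {slice name: GPE name}, kept on the CP
    gpe_control_addr: {GPE name: control address on the base network}
    Returns a dict describing where to forward the session, or None.
    """
    gpe = sliver_table.get(slice_login)
    if gpe is None or gpe not in gpe_control_addr:
        return None  # unknown slice, or its GPE is not currently reachable
    return {"gpe": gpe, "address": gpe_control_addr[gpe], "vserver": slice_login}


sliver_table = {"slice_a": "gpe1", "slice_b": "gpe2"}
gpe_control_addr = {"gpe1": "192.168.10.4", "gpe2": "192.168.10.5"}
print(forwarding_target("slice_b", sliver_table, gpe_control_addr))
```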