Title: Networking in the Linux Kernel
1. Networking in the Linux Kernel
2. Introduction
- Overview of the Linux networking implementation
- Covered:
  - Data path through the kernel
  - Quality of Service features
  - Hooks for extensions (netfilter, KIDS, protocol demux placement)
  - VLAN tag processing
  - Virtual interfaces
- Not covered:
  - Kernels prior to 2.4.20, or 2.6
  - Specific protocol implementations
  - Detailed analysis of existing protocols, such as TCP; these are covered only in enough detail to see how they link to higher/lower layers
3. OSI Model
- The Linux kernel adheres closely to the OSI 7-layer networking model
4. OSI Model (Interplay)
- Layers generally interact in the same manner, no matter where placed
- [Figure: layer N+1 adds a header and/or trailer to its data (Layer N+1 control + Layer N+1 data), then passes the result to layer N as raw data (Layer N data)]
5. Socket Buffer
- When discussing the data path through the Linux kernel, the data being passed is stored in sk_buff structures (socket buffers), which hold:
  - Packet data
  - Management information
- The sk_buff is first created incomplete, then filled in during passage through the kernel, both for received packets and for sent packets.
- Packet data is normally never copied; we just pass around pointers to the sk_buff and change structure members.
6. Socket Buffer
- All sk_buffs are members of a queue (next, prev, and list pointers)
- Cloned sk_buffs share packet data, but not control information
- [Figure: sk_buff layout — the head, data, tail, and end pointers delimit the packet data (MAC header, IP header, TCP header); sk points to the associated socket; dev is the associated device; dev_rx is the source device]
- struct sk_buff is defined in include/linux/skbuff.h
7. Socket Buffer
- sk_buff features:
  - Reference counts for cloned buffers
  - Separate allocation pool and support
  - Functions for manipulating the data space
- Very feature-rich: this is a complex, detailed structure, encapsulating information from protocols at multiple layers.
- There are also numerous support functions for queues of sk_buffs.
8. Data Path Overview
[Figure: the data path, bottom to top — Network Device (hardware); Driver and DMA rings (kernel); Queue Discipline; net_rx_action() softirq; Layer 3 protocol demux into IP and other protocols; Layer 4 protocol demux into TCP and UDP; socket demux; and finally the sockets (user space)]
9. OSI Layers 1-2: Data Link
- The code presented resides mostly in the following files:
  - include/linux/netdevice.h
  - net/core/skbuff.c
  - net/core/dev.c
  - arch/i386/irq.c
  - drivers/net/net_init.c
  - net/sched/sch_generic.c
  - net/ethernet/eth.c (for Layer 3 demux)
10. Data Link: Data Path
[Figure: the Layer 2 receive path — the Network Device (hardware) DMAs packets into the DMA rings; the driver's interrupt handler net_interrupt() (net_rx, net_tx, net_error) calls netif_rx_schedule(), which adds a device pointer to the poll_queue; the net_rx_action() softirq calls dev->poll(), which calls netif_receive_skb() for Layer 3 demux up to IP; on the transmit side, enqueue() feeds the Queue Discipline]
11. Data Link Features
- NAPI
  - The old API would reach interrupt livelock under 60 MBps
  - The new API ensures the earliest possible drop under overload
- Packet reception at the NIC:
  - The NIC copies the packet to the DMA ring (struct sk_buff rx_ring)
  - The NIC raises an interrupt; the handler calls netif_rx_schedule()
  - Further interrupts are blocked
  - The clock-based softirq calls net_rx_action(), which calls dev->poll()
  - dev->poll() calls netif_receive_skb(), which does protocol demux (usually calling ip_rcv())
- Backward compatibility for non-DMA interfaces is maintained:
  - All legacy devices use the same backlog queue (equivalent to a DMA ring)
  - The backlog queue is treated just like any other modern device
- Per-CPU poll_list of devices to poll
  - Ensures no packet re-ordering is necessary
- No memory copies in the kernel: the packet stays in the sk_buff at the same memory location until passed to user space
12. Data Link: Transmission
- The packet is sent from the IP layer to the Queue Discipline
  - Any appropriate QoS happens in the qdisc (discussed later)
- The qdisc notifies the network driver when it is time to send, calling hard_start_xmit()
  - Places all ready sk_buff pointers in the tx_ring
  - Notifies the NIC that packets are ready to send
- The NIC signals (via interrupt) when packet(s) were successfully transmitted (exactly when the interrupt arrives is highly variable!)
- The interrupt handler queues transmitted packets for deallocation
- At the next softirq, all packets in the completion_queue are deallocated
13. Data Link: VLAN Features
- Still dependent on individual NICs
  - Not all NICs implement VLAN filtering
  - A partial list is available at need (not included here)
- For non-VLAN NICs, Linux filters in software and passes packets to the appropriate virtual interface for ingress prioritization and Layer 3 protocol demux
  - net/8021q/vlan_dev.c (and others in this directory)
- The virtual interface passes through to the real interface
- No VID-based demux is needed for received packets, as different VLANs are irrelevant to the IP layer
- Some changes in 2.6 (still need to research this)
14. OSI Layer 3: Internet
- The code presented resides mostly in the following files:
  - net/ipv4/ip_input.c: processes packet arrivals
  - net/ipv4/ip_output.c: processes packet departures
  - net/ipv4/ip_forward.c: processes packet traversal
  - net/ipv4/ip_fragment.c: IP packet fragmentation
  - net/ipv4/ip_options.c: IP options
  - net/ipv4/ipmr.c: IP multicast
  - net/ipv4/ipip.c: IP over IP; also a good virtual interface example
15. Internet Data Path
[Figure: chart copied from the DataTAG report "A Map of the Networking Code in the Linux Kernel"]
16. Internet Features
- Netfilter hooks in many places
  - INPUT, OUTPUT, FORWARD (iptables)
  - NF_IP_PRE_ROUTING: ip_rcv()
  - NF_IP_LOCAL_IN: ip_local_deliver()
  - NF_IP_FORWARD: ip_forward()
  - NF_IP_LOCAL_OUT: ip_build_and_send_pkt()
  - NF_IP_POST_ROUTING: ip_finish_output()
- Connection tracking is done in IPv4, not in TCP/UDP/ICMP
  - Used for NAT, which must maintain connection state in violation of OSI layering
  - Can also gather statistics on networking usage, but all of this functionality comes from the netfilter module
17. Socket Structure and System Call Mapping
- The following files are useful:
  - include/linux/net.h
  - net/socket.c
- There are two significant data structures involved: the socket and the net_proto_family. Both involve arrays of function pointers to handle each relevant system call type.
18. System Call: socket
- From user space, an application calls socket(family, type, protocol)
- The kernel calls sys_socket(), which calls sock_create()
- sock_create() references net_families[family], an array of network protocol families, to find the corresponding protocol family, loading any necessary modules on the fly
  - If a module must be loaded, it is loaded as net_pf_<num>, where the protocol family number is used directly in the string. For TCP, the family is PF_INET (was AF_INET), and the type is SOCK_STREAM
  - Note: Linux has a hard limit of 32 protocol families (these include PF_INET, PF_PACKET, PF_NETLINK, PF_INET6, etc.)
  - Layer 4 protocols are registered via inet_add_protocol() (include/net/protocol.h), and socket interfaces are registered via inet_register_protosw(). Raw IP datagram sockets are registered like any other Layer 4 protocol
- Once the correct family is found, sock_create() allocates an empty socket, obtains a mutex, and calls net_families[family]->create(). This call is protocol-specific and fills in the socket structure. The socket structure includes another function array, ops, which maps all system calls valid on file descriptors
- sys_socket() calls sock_map_fd() to map the new socket to a file descriptor, and returns it
19. Other socket System Calls
- Subsequent socket system calls are passed to the appropriate function in socket->ops. Technically, Linux offers only one socket system call, sys_socketcall(), which multiplexes to all the other socket calls via its first parameter. This means that socket-based protocols could provide new and different system calls via a library and a mux, although this is never done in practice.
- The ops calls include (exhaustive list):
  - release
  - bind
  - connect
  - socketpair
  - accept
  - getname
  - poll
  - ioctl
  - listen
  - shutdown
  - setsockopt
  - getsockopt
  - sendmsg
  - recvmsg
  - mmap
  - sendpage
20. PF_PACKET
- A brief word on the PF_PACKET protocol family:
- PF_PACKET creates a socket bound directly to a network device. The call may specify a packet type. All packets sent to this socket are sent directly over the device, and all incoming packets of this type are delivered directly to the socket. No processing is done in the kernel. Thus, this interface can be, and is, used to create user-space protocol implementations. (E.g., PPPoE uses this with packet type ETH_P_PPP_DISC.)
21. Quality of Service Mechanisms
- Linux has two QoS mechanisms:
- Traffic Control
  - Provides for multiple queues, and priority schemes within those queues, between the IP layer and the network device
  - Defaults are 100-packet queues with 3 priorities and FIFO ordering
- KIDS (Karlsruhe Implementation architecture of Differentiated Services)
  - Designed to be component-extensible at runtime
  - Consists of a set of components with similar interfaces that can be plugged together into almost arbitrarily complex constructions
- Neither mechanism implements higher-level traffic agreements, such as Traffic Conditioning Agreements (TCAs). MPLS is offered in Linux 2.6.
22. Traffic Control
- Traffic Control consists of three types of components:
- Queue Disciplines
  - These implement the actual enqueue() and dequeue()
  - Also have child components
- Filters
  - Filters classify traffic received at a Queue Discipline into Classes
  - Normally children of a Queuing Discipline
- Classes
  - These hold the packets classified by Filters, and have associated queuing disciplines to determine the queuing order
  - Normally children of a Filter and parents of Queuing Disciplines
- Components are connected into structures called trees, although technically they aren't true trees because they allow upward (cyclical) links.
23. Traffic Control Example
This is a typical TC tree. The top-level Queuing Discipline is the only access point from the outside: the "outer queue". From external access, this looks like a single queue structure. Internally, packets received at the outer queue are matched against each filter in order; the first match wins, with a final default case. Dequeue requests to the outer queue are passed along recursively to the inner queues to find a packet ready for sending.
24. Traffic Control (Cont'd)
- The TC architecture supports a number of pre-built filters, classes, and disciplines, found in net/sched/: the cls_* files are filters, whereas the sch_* files are disciplines (classes are collocated with disciplines)
- Some disciplines:
  - ATM
  - Class-Based Queuing
  - Clark-Shenker-Zhang
  - Differentiated Services marker
  - FIFO
  - RED
  - Hierarchical Fair Service Curve (SIGCOMM '97)
  - Hierarchical Token Bucket
  - Network emulator (for protocol testing)
  - Priority (3 levels)
  - Generic RED
  - Stochastic Fairness Queuing
  - Token Bucket
  - Equalizer (for equalizing line rates of different links)
25. KIDS
- KIDS establishes 5 general component types (by interface):
  - Operative Components: receive a packet and run an algorithm on it; the packet may be modified or simply examined. E.g., Token Bucket, RED, Shaper
  - Queue Components: data structures used to enqueue/dequeue. Includes FIFO, Earliest-Deadline-First (EDF), etc.
  - Enqueuing Components: enqueue packets based on special methods (tail-enqueue, head-enqueue, EDF-enqueue, etc.)
  - Dequeuing Components: dequeue based on special methods
  - Strategic Components: strategies for handling dequeue requests. E.g., WFQ, Round Robin
26. KIDS (Cont'd)
- KIDS has 8 different hook points in the Linux kernel, 5 at the IP layer and 3 at Layer 2:
  - IP_LOCAL_IN: just prior to delivery to Layer 4
  - IP_LOCAL_OUT: just after leaving Layer 4
  - IP_FORWARD: packet being forwarded (router)
  - IP_PRE_ROUTING: packet newly arrived at the IP layer from an interface
  - IP_POST_ROUTING: packet routed from IP to Layer 2
  - L2_INPUT_<dev>: packet has just arrived from the interface
  - L2_ENQUEUE_<dev>: packet is being queued at Layer 2
  - L2_DEQUEUE_<dev>: packet is being transmitted by Layer 2