1
Networking in the Linux Kernel
2
Introduction
  • Overview of the Linux networking implementation
  • Covered:
  • Data path through the kernel
  • Quality of Service features
  • Hooks for extensions (netfilter, KIDS, protocol
    demux placement)
  • VLAN tag processing
  • Virtual interfaces
  • Not covered:
  • Kernels prior to 2.4.20, or 2.6
  • Specific protocol implementations
  • Detailed analysis of existing protocols, such as
    TCP; these are covered only in enough detail to
    see how they link to higher/lower layers.

3
OSI Model
  • The Linux kernel adheres closely to the OSI
    7-layer networking model

4
OSI Model (Interplay)
  • Layers generally interact in the same manner, no
    matter where placed

[Diagram: layer interplay. Layer N+1's data has a
header and/or trailer added (the Layer N+1 control
information); the result is passed to layer N,
which treats the whole unit as raw data.]
5
Socket Buffer
  • When discussing the data path through the Linux
    kernel, the data being passed is stored in
    sk_buff structures (socket buffers), which hold
  • Packet data
  • Management information
  • The sk_buff is first created incomplete, then
    filled in during its passage through the kernel,
    both for received packets and for sent packets.
  • Packet data is normally never copied: pointers
    to the sk_buff are passed around and structure
    members are changed (a simplified excerpt of the
    structure follows)
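
A simplified excerpt of the structure, based on the 2.4-era
definition in include/linux/skbuff.h; most fields are omitted
here:

    /* Simplified excerpt of struct sk_buff (2.4-era layout);
     * many members omitted. */
    struct sk_buff {
            struct sk_buff      *next;   /* next buffer in queue            */
            struct sk_buff      *prev;   /* previous buffer in queue        */
            struct sk_buff_head *list;   /* queue this buffer is on         */
            struct sock         *sk;     /* owning socket, if any           */
            struct net_device   *dev;    /* device we arrived on / leave by */
            unsigned int        len;     /* length of valid data            */
            atomic_t            users;   /* reference count (for clones)    */
            unsigned char       *head;   /* start of allocated buffer       */
            unsigned char       *data;   /* start of valid data             */
            unsigned char       *tail;   /* end of valid data               */
            unsigned char       *end;    /* end of allocated buffer         */
    };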

6
Socket Buffer
All sk_buffs are members of a queue, linked by next and
prev, with list pointing to the queue itself.
[Diagram: sk_buff layout. head, data, tail, and end point
into the packet data area, which here holds the MAC, IP,
and TCP headers; sk points to the associated socket;
dev_rx is the source device; dev is the associated
device. Cloned sk_buffs share packet data, but not
control information.]
struct sk_buff is defined in include/linux/skbuff.h
7
Socket Buffer
  • sk_buff features
  • Reference counts for cloned buffers
  • A separate allocation pool and support functions
  • Functions for manipulating the data space
    (sketched below)
  • Very feature-rich: this is a complex, detailed
    structure, encapsulating information from
    protocols at multiple layers
  • There are also numerous support functions for
    queues of sk_buffs.
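
A hedged sketch of those data-space helpers from
include/linux/skbuff.h; the function name, sizes, and payload
pointer are illustrative, not from the slides:

    /* Sketch: building an outgoing packet with the standard
     * sk_buff helpers. */
    static struct sk_buff *build_packet(const void *payload,
                                        unsigned int len,
                                        unsigned int hdr_len)
    {
            struct sk_buff *skb = alloc_skb(hdr_len + len, GFP_ATOMIC);
            if (skb == NULL)
                    return NULL;
            skb_reserve(skb, hdr_len);  /* leave headroom for headers   */
            memcpy(skb_put(skb, len),   /* append payload; moves tail   */
                   payload, len);
            skb_push(skb, hdr_len);     /* prepend header space at data */
            return skb;
            /* on receive, skb_pull(skb, hdr_len) strips a header instead */
    }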

8
Data Path Overview
[Diagram: data path overview. User space: applications talk
to sockets. Kernel: a socket demux maps data to sockets;
below it, the Layer 4 protocol demux selects a protocol
(TCP, UDP, etc.); the Layer 3 protocol demux selects IP or
another protocol; net_rx_action() runs as a softirq; a
Queue Discipline sits between IP and the DMA rings managed
by the driver. Hardware: the network device.]
9
OSI Layers 1 & 2: Data Link
  • The code presented resides mostly in the
    following files
  • include/linux/netdevice.h
  • net/core/skbuff.c
  • net/core/dev.c
  • arch/i386/kernel/irq.c
  • drivers/net/net_init.c
  • net/sched/sch_generic.c
  • net/ethernet/eth.c (for layer 3 demux)

10
Data Link Data Path
[Diagram: Layer 2 receive path. The network device
(hardware) DMAs packets into the driver's DMA rings. The
driver's interrupt handler, net_interrupt() (dispatching
net_rx, net_tx, net_error), calls netif_rx_schedule(),
which adds the device pointer to poll_queue. The softirq
net_rx_action() then calls dev->poll(), which calls
netif_receive_skb() to hand each packet up to Layer 3
(IP). On the transmit side, the Queue Discipline's
enqueue() feeds packets toward the same DMA rings.]
11
Data Link Features
  • NAPI
  • The old API would reach interrupt livelock under
    60 Mbps
  • The new API ensures the earliest possible drop
    under overload
  • Packet received at the NIC
  • NIC copies it into the DMA ring (the rx_ring of
    sk_buffs)
  • NIC raises an interrupt; the handler calls
    netif_rx_schedule()
  • Further interrupts are blocked
  • The clock-driven softirq (net_rx_action()) calls
    dev->poll() (see the sketch after this list)
  • dev->poll() calls netif_receive_skb(), which does
    protocol demux (usually calling ip_rcv())
  • Backward compatibility for non-DMA interfaces is
    maintained
  • All legacy devices share one backlog queue
    (equivalent to a DMA ring)
  • The backlog queue is treated just like any
    modern device
  • Per-CPU poll_list of devices to poll
  • Ensures no packet re-ordering is necessary
  • No memory copies in the kernel: the packet stays
    in the sk_buff at the same memory location until
    passed to user space
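
A hedged sketch of a 2.4-era NAPI poll method; the mydev_*
helpers are hypothetical stand-ins for a real driver's
ring bookkeeping:

    /* The RX interrupt handler masks further RX interrupts and
     * calls netif_rx_schedule(dev), putting the device on the
     * per-CPU poll list. The softirq then invokes this method. */
    static int mydev_poll(struct net_device *dev, int *budget)
    {
            int done  = 0;
            int quota = (*budget < dev->quota) ? *budget : dev->quota;

            while (done < quota && mydev_rx_ring_nonempty(dev)) {
                    struct sk_buff *skb = mydev_rx_ring_take(dev); /* hypothetical */
                    skb->protocol = eth_type_trans(skb, dev);
                    netif_receive_skb(skb);   /* layer 3 protocol demux */
                    done++;
            }
            *budget    -= done;
            dev->quota -= done;

            if (!mydev_rx_ring_nonempty(dev)) {
                    netif_rx_complete(dev);   /* off the poll list...    */
                    mydev_enable_rx_irq(dev); /* ...re-enable interrupts */
                    return 0;                 /* all work done           */
            }
            return 1;                         /* stay on the poll list   */
    }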

12
Data Link transmission
  • Transmission
  • Packet is sent from the IP layer to the Queue
    Discipline
  • Any appropriate QoS is applied in the qdisc
    (discussed later)
  • The qdisc notifies the network driver when it is
    time to send, by calling hard_start_xmit()
    (sketched below)
  • hard_start_xmit() places the ready sk_buff
    pointers in tx_ring
  • and notifies the NIC that packets are ready to
    send
  • The NIC signals (via interrupt) when packet(s)
    have been successfully transmitted. (Exactly when
    this interrupt arrives is highly variable!)
  • The interrupt handler queues transmitted packets
    for deallocation
  • At the next softirq, all packets in
    completion_queue are deallocated
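
A hedged sketch of a driver transmit routine, registered as
dev->hard_start_xmit; the mydev_* helpers are hypothetical:

    static int mydev_hard_start_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
    {
            if (mydev_tx_ring_full(dev)) {  /* hypothetical */
                    netif_stop_queue(dev);  /* stop qdisc dequeuing    */
                    return 1;               /* non-zero: requeue skb   */
            }
            mydev_tx_ring_add(dev, skb);    /* sk_buff ptr -> tx_ring  */
            mydev_kick_nic(dev);            /* tell NIC it can send    */
            dev->trans_start = jiffies;
            return 0;
    }
    /* The tx-complete interrupt later calls dev_kfree_skb_irq(skb),
     * which places the buffer on the completion_queue for freeing
     * at the next softirq. */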

13
Data Link VLAN Features
  • Still dependent on individual NICs
  • Not all NICs implement VLAN filtering
  • A partial list is available if needed (not
    included here)
  • For non-VLAN NICs, Linux filters in software and
    passes frames to the appropriate virtual
    interface for ingress prioritization and layer 3
    protocol demux
  • net/8021q/vlan_dev.c (and others in this
    directory)
  • The virtual interface passes through to the real
    interface
  • No VID-based demux is needed for received
    packets, as different VLANs are irrelevant to
    the IP layer.
  • Some changes in 2.6; still need to research this

14
OSI Layer 3: Internet
  • The code presented resides mostly in the
    following files
  • net/ipv4/ip_input.c (packet arrivals)
  • net/ipv4/ip_output.c (packet departures)
  • net/ipv4/ip_forward.c (packet traversal)
  • net/ipv4/ip_fragment.c (IP packet fragmentation)
  • net/ipv4/ip_options.c (IP options)
  • net/ipv4/ipmr.c (IP multicast)
  • net/ipv4/ipip.c (IP over IP; also a good virtual
    interface example)

15
Internet Data Path
Note: chart copied from DataTAG's "A Map of the
Networking Code in the Linux Kernel".
16
Internet Features
  • Netfilter hooks in many places
  • INPUT, OUTPUT, FORWARD (iptables)
  • NF_IP_PRE_ROUTING: ip_rcv()
  • NF_IP_LOCAL_IN: ip_local_deliver()
  • NF_IP_FORWARD: ip_forward()
  • NF_IP_LOCAL_OUT: ip_build_and_send_pkt()
  • NF_IP_POST_ROUTING: ip_finish_output()
  • Connection tracking is done in IPv4, not in
    TCP/UDP/ICMP.
  • Used for NAT, which must maintain connection
    state in violation of OSI layering
  • Can also gather statistics on network usage, but
    all of this functionality comes from the
    netfilter module (a registration sketch follows)
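
A hedged sketch of registering a function at one of these
hooks with the 2.4-era API from linux/netfilter_ipv4.h; the
my_* names are hypothetical:

    static unsigned int my_hook(unsigned int hooknum,
                                struct sk_buff **pskb,
                                const struct net_device *in,
                                const struct net_device *out,
                                int (*okfn)(struct sk_buff *))
    {
            /* inspect or modify *pskb here */
            return NF_ACCEPT;        /* or NF_DROP, NF_STOLEN, ... */
    }

    static struct nf_hook_ops my_ops = {
            .hook     = my_hook,
            .pf       = PF_INET,
            .hooknum  = NF_IP_PRE_ROUTING,  /* runs inside ip_rcv() */
            .priority = NF_IP_PRI_FIRST,
    };

    /* module init: nf_register_hook(&my_ops);
     * module exit: nf_unregister_hook(&my_ops); */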

17
Socket Structure and System Call Mapping
  • The following files are useful
  • include/linux/net.h
  • net/socket.c
  • There are two significant data structures
    involved, the socket and the net_proto_family.
    Both involve arrays of function pointers to
    handle each system call type that is relevant.

18
System Call socket
  • From user space, an application calls
    socket(family, type, protocol)
  • The kernel calls sys_socket(), which calls
    sock_create()
  • sock_create() references net_families[family], an
    array of network protocol families, to find the
    corresponding protocol family, loading any
    necessary modules on the fly.
  • If a module must be loaded, it is requested as
    net_pf_<num>, where the protocol family number
    is used directly in the string. For TCP, the
    family is PF_INET (was AF_INET), and the type is
    SOCK_STREAM
  • Note: Linux has a hard limit of 32 protocol
    families. (These include PF_INET, PF_PACKET,
    PF_NETLINK, PF_INET6, etc.)
  • Layer 4 protocols are registered with
    inet_add_protocol() (include/net/protocol.h), and
    socket interfaces are registered by
    inet_register_protosw(). Raw IP datagram sockets
    are registered like any other Layer 4 protocol.
  • Once the correct family is found, sock_create()
    allocates an empty socket, obtains a mutex, and
    calls net_families[family]->create(). This call is
    protocol-specific and fills in the socket
    structure. The socket structure includes another
    function array, ops, which maps all system calls
    valid on file descriptors.
  • sys_socket() calls sock_map_fd() to map the new
    socket to a file descriptor, and returns it. (A
    registration sketch follows this list.)
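
A hedged sketch of how a family plugs into net_families[]
via sock_register() (net/socket.c); the my_* names are
hypothetical, and the real PF_INET example is
inet_family_ops in net/ipv4/af_inet.c:

    static int my_create(struct socket *sock, int protocol)
    {
            /* allocate and initialize a struct sock, point
             * sock->ops at this family's proto_ops table,
             * set sock->state, etc. */
            return 0;
    }

    static struct net_proto_family my_family_ops = {
            .family = MY_PF,     /* hypothetical family number (< 32) */
            .create = my_create,
    };

    /* sock_register(&my_family_ops) stores the entry in
     * net_families[]; later, sys_socket() -> sock_create()
     * -> net_families[family]->create(). */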

19
Other socket System Calls
  • Subsequent socket system calls are passed to the
    appropriate function in socket->ops. These
    include (exhaustive list; a sample ops table
    follows the list)

Technically, Linux offers only one socket system
call, sys_socketcall(), which multiplexes to all
the other socket calls via its first parameter.
This means that socket-based protocols could
provide new and different system calls via a
library and a mux, although this is never done in
practice.
  • release
  • bind
  • connect
  • socketpair
  • accept
  • getname
  • poll
  • ioctl
  • listen
  • shutdown
  • setsockopt
  • getsockopt
  • sendmsg
  • recvmsg
  • mmap
  • sendpage
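
For TCP sockets, net/ipv4/af_inet.c fills this table in as
inet_stream_ops; approximately (2.4-era source, abridged,
using the old gcc initializer style of that tree):

    struct proto_ops inet_stream_ops = {
            family:     PF_INET,
            release:    inet_release,
            bind:       inet_bind,
            connect:    inet_stream_connect,
            socketpair: sock_no_socketpair, /* unsupported: returns an error */
            accept:     inet_accept,
            getname:    inet_getname,
            poll:       tcp_poll,
            ioctl:      inet_ioctl,
            listen:     inet_listen,
            shutdown:   inet_shutdown,
            setsockopt: inet_setsockopt,
            getsockopt: inet_getsockopt,
            sendmsg:    inet_sendmsg,
            recvmsg:    inet_recvmsg,
            mmap:       sock_no_mmap,       /* unsupported: returns an error */
            sendpage:   tcp_sendpage,
    };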

20
PF_PACKET
  • A brief word on the PF_PACKET protocol family
  • PF_PACKET creates a socket bound directly to a
    network device. The call may specify a packet
    type. All packets sent to this socket are sent
    directly over the device, and all incoming
    packets of this type are delivered directly to
    the socket. No processing is done in the kernel.
    Thus, this interface can be (and is) used to
    create user-space protocol implementations.
    (E.g., PPPoE uses this with packet type
    ETH_P_PPP_DISC. A user-space sketch follows.)
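
A minimal user-space sketch (requires root); eth0 and
ETH_P_ALL are illustrative, and a PPPoE implementation
would bind to ETH_P_PPP_DISC instead:

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>        /* htons                */
    #include <linux/if_packet.h>   /* struct sockaddr_ll   */
    #include <linux/if_ether.h>    /* ETH_P_ALL            */
    #include <net/if.h>            /* if_nametoindex       */

    int main(void)
    {
            int fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
            if (fd < 0) { perror("socket"); return 1; }

            struct sockaddr_ll sll;
            memset(&sll, 0, sizeof(sll));
            sll.sll_family   = AF_PACKET;
            sll.sll_protocol = htons(ETH_P_ALL);
            sll.sll_ifindex  = if_nametoindex("eth0");
            if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
                    perror("bind"); return 1;
            }

            unsigned char frame[2048];
            ssize_t n = recv(fd, frame, sizeof(frame), 0); /* one raw frame */
            printf("got %zd bytes\n", n);
            close(fd);
            return 0;
    }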

21
Quality of Service Mechanisms
  • Linux has two QoS mechanisms
  • Traffic Control
  • Provides multiple queues, and priority schemes
    within those queues, between the IP layer and
    the network device
  • The default is a 100-packet queue with 3
    priorities and FIFO ordering.
  • KIDS (Karlsruhe Implementation architecture of
    Differentiated Services)
  • Designed to be component-extensible at runtime.
  • Consists of a set of components with similar
    interfaces that can be plugged together into
    almost arbitrarily complex constructions
  • Neither mechanism implements higher-level
    traffic agreements, such as Traffic Conditioning
    Agreements (TCAs). MPLS is offered in Linux 2.6.

22
Traffic Control
  • Traffic Control consists of three types of
    components
  • Queue Disciplines
  • These implement the actual enqueue() and
    dequeue() (sketched after this list)
  • They also have child components
  • Filters
  • Filters classify traffic received at a Queue
    Discipline into Classes
  • Normally children of a Queuing Discipline
  • Classes
  • These hold the packets classified by Filters, and
    have associated queuing disciplines to determine
    the queuing order.
  • Normally children of a Filter and parents of
    Queuing Disciplines
  • Components are connected into structures called
    trees, although technically they aren't true
    trees because they allow upward (cyclical) links.
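
A hedged sketch of the discipline side: the enqueue()/dequeue()
pair a qdisc exposes, in the style of net/sched/sch_fifo.c.
The myfifo_* names and MYFIFO_LIMIT are hypothetical:

    static int myfifo_enqueue(struct sk_buff *skb, struct Qdisc *sch)
    {
            if (sch->q.qlen < MYFIFO_LIMIT) {      /* hypothetical limit */
                    __skb_queue_tail(&sch->q, skb);
                    return NET_XMIT_SUCCESS;
            }
            kfree_skb(skb);                        /* queue full: drop   */
            return NET_XMIT_DROP;
    }

    static struct sk_buff *myfifo_dequeue(struct Qdisc *sch)
    {
            return __skb_dequeue(&sch->q);         /* NULL if empty      */
    }

    static struct Qdisc_ops myfifo_qdisc_ops = {
            .id      = "myfifo",
            .enqueue = myfifo_enqueue,
            .dequeue = myfifo_dequeue,
    };
    /* register_qdisc(&myfifo_qdisc_ops) makes it available to tc(8). */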

23
Traffic Control Example
This is a typical TC tree. The top-level Queuing
Discipline is the only access point from the
outside: the "outer queue". Seen from outside,
this is a single queue structure. Internally,
packets received at the outer queue are matched
against each filter in order; the first match
wins, with a final default case. Dequeue requests
to the outer queue are passed along recursively to
the inner queues to find a packet ready for
sending.
24
Traffic Control (Contd)
  • The TC architecture supports a number of
    pre-built filters, classes, and disciplines,
    found in net/sched/: the cls_* files are
    filters, whereas the sch_* files are disciplines
    (classes are collocated with disciplines).
  • Some disciplines
  • ATM
  • Class-Based Queuing
  • Clark-Shenker-Zhang
  • Differentiated Services mark
  • FIFO
  • RED
  • Hierarchical Fair Service Curve (SIGCOMM '97)
  • Hierarchical Token Bucket
  • Network Emulator (For protocol testing)
  • Priority (3 levels)
  • Generic RED
  • Stochastic Fairness Queuing
  • Token Bucket
  • Equalizer (for equalizing line rates of different
    links)

25
KIDS
  • KIDS establishes 5 general component types (by
    interface)
  • Operative Components: receive a packet and run
    an algorithm on it. The packet may be modified
    or simply examined. E.g., Token Bucket, RED,
    Shaper
  • Queue Components: data structures used to
    enqueue/dequeue. Includes FIFO,
    Earliest-Deadline-First (EDF), etc.
  • Enqueuing Components: enqueue packets based on
    special methods (tail-enqueue, head-enqueue,
    EDF-enqueue, etc.)
  • Dequeuing Components: dequeue based on special
    methods
  • Strategic Components: strategies for serving
    dequeue requests. E.g., WFQ, Round Robin

26
KIDS (Contd)
  • KIDS has 8 different hook points in the Linux
    kernel: 5 at the IP layer and 3 at Layer 2
  • IP_LOCAL_IN: just prior to delivery to Layer 4
  • IP_LOCAL_OUT: just after leaving Layer 4
  • IP_FORWARD: packet being forwarded (router)
  • IP_PRE_ROUTING: packet newly arrived at the IP
    layer from an interface
  • IP_POST_ROUTING: packet routed from IP to Layer
    2
  • L2_INPUT_<dev>: packet has just arrived from the
    interface
  • L2_ENQUEUE_<dev>: packet is being queued at
    Layer 2
  • L2_DEQUEUE_<dev>: packet is being transmitted by
    Layer 2