Title: INF5061: Multimedia data communication using network processors
1 Introduction
INF5061: Multimedia data communication using network processors
2 Overview
- Course topic and scope
- Background
- software-based network systems
- challenges and new requirements
- evolution of network processors
- (Very) short overview of some example network processors
3 INF5061: The Course
4 Lecturers
- Carsten Griwodz, email: griff _at_ ifi
- Pål Halvorsen, email: paalh _at_ ifi
5 About INF5061: Topic and Scope
- Content: The course gives
- an overview of network processor cards (architectures and use)
- an introduction to programming Intel IXP network processors
- some ideas of how to use network processors
6 About INF5061: Topic and Scope
- Lab assignments: An important part of the course is the lab assignments, in which the students write programs for the Intel IXP2400 network processor
- wwpingbump: download and run
- protocol statistics: extend wwpingbump to give processor, interface and protocol statistics
- packet bridge with ARP support: forward each packet to the correct interface (of 3 available)
- transparent load balancer: balance load and forward packets to the right machine in a cluster of two with the same IP address
- HTTP protocol translator: add support in the transparent load balancer for HTTP streaming with an RTSP/RTP server
7 About INF5061: Exam (10sp)
- Prerequisite: mandatory assignments
- lab assignment 2: protocol statistics
- presentation of a relevant paper
- Graded assignments
- lab assignment 4: transparent load balancer
- deliver code
- short demo/explanation of code (to lecturers only)
- lab assignment 5: HTTP protocol translator
- deliver code and a short report
- present and demonstrate to the class at the end of the course
- Final exam: oral exam (???/12-2005)
- selected chapters from the Comer book and IXP documentation
- lecture slides (including slides from presented papers)
- content of lab assignments
8 About INF5060: Exam (5sp)
- Mandatory assignment
- lab assignment 5: HTTP protocol translator
- deliver code and a short report
- present and demonstrate to the class at the end of the course
- an approved assignment gives a passed course (INF5060)
9 Available Resources
- Book: Douglas E. Comer, "Network Systems Design using Network Processors, Intel IXP2xxx Version", Pearson Prentice Hall, 2004
- Other resources will be placed at
- http://www.ifi.uio.no/paalh/INF5061
- Login: inf5061
- Password: ixp
- Manuals for IXP2400: /paalh/INF5061/IXP2400
- Code: /paalh/INF5061/code
10 Disclaimer
- In the field of network processors, I am a tyro
- Definition: Tyro \Tyro\, n.; pl. Tyros. A beginner in learning; one who is in the rudiments of any branch of study; a person imperfectly acquainted with a subject; a novice
- Then, by definition, in the field of network processors, we are all tyros
- In our defense, when it comes to network processors, everyone is a tyro
11 Background and Motivation
12 Software-Based Network System
- Uses conventional, shared hardware (e.g., a PC)
- Software
- runs the entire system
- allocates memory
- controls I/O devices
- performs all protocol processing
- First generation network systems
13 Review of General Data Path on Conventional Computer Hardware Architectures
[Figure: data paths for sending, receiving and forwarding through the application and the communication system layers: transport (TCP/UDP), network (IP) and link]
14 Review of Conventional Computer Hardware Architectures
[Figure: Intel D850MD motherboard, Intel Hub Architecture (850 chipset): CPU socket, system bus, Memory Controller Hub, RDRAM interface and connectors, hub interface, I/O Controller Hub, PCI bus and connectors]
15 Forwarding Example for an Intermediate Node
[Figure: forwarding path on the Intel Hub Architecture: network card, communication system (kernel space), application (user space), Pentium 4 processor with registers and cache(s)]
- Note: a single average MPEG-2 DVD stream requires 330-660 packets per second of 1500 bytes (4-8 Mbps); then use smaller packets, add concurrent clients, other applications, ...
16 Main Packet Processing Costs
- Copying: used when moving a packet from one memory location to another
- expensive (proportional to packet size)
- should be avoided whenever possible (use pointers)
- Checksumming: used to detect errors
- expensive (proportional to packet size)
- transport layer: payload and header
- network layer: header
- Fragmentation/reassembly: needed when a packet is larger than the smallest MTU
- generate headers and header checksums
- receiving many small data fragments
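To make the checksumming cost concrete, here is a minimal sketch of the 16-bit one's-complement Internet checksum in the style of RFC 1071. The loop touches every 16-bit word once, which is why the cost grows with packet size. The function name and the sample header bytes are illustrative assumptions, not taken from the course material.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum over data (RFC 1071 style)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):         # one addition per 16-bit word
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# Illustrative 20-byte IPv4 header with the checksum field (bytes 10-11) zeroed.
HEADER = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")

if __name__ == "__main__":
    print(hex(internet_checksum(HEADER)))
```

A receiver runs the same sum over the header including the stored checksum; a result of zero means no error was detected.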
17 Question
- Which is growing faster?
- network bandwidth
- processing power
- Note: if network bandwidth is growing faster
- CPU may be the bottleneck
- need special-purpose hardware
- conventional hardware will become irrelevant
- Note: if processing power is growing faster
- no problems with processing
- network/busses will be bottlenecks
18 Growth Of Technologies
[Figure: growth of technologies, Mbps plotted against year]
19 Packet Rates and Software Processing
- Packet rates (packets per second)

                         64 B          1500 B
  10BASE-T (10 Mbps)     19,531        833
  1000BASE-T (1 Gbps)    1,953,125     83,333
  OC-192 (9.95 Gbps)     19,439,453    829,416

- Packet processing (MIPS, assuming 5K instructions per packet)

                         64 B          1500 B
  10BASE-T (10 Mbps)     97.65         4.17
  1000BASE-T (1 Gbps)    9,765.63      416.67
  OC-192 (9.95 Gbps)     97,197.27     4,147.08

- the Comer book uses 10K instructions as an upper bound per packet
- it varies according to which protocols are used, implementation, data size, etc.
- more if moved through a firewall
- engineering rule: 1 GHz general-purpose CPU per 1 Gbps network data rate
- Note: this is only processing; time must be added to handle interrupts and move data into memory
- Thus, software running on a general-purpose processor is insufficient to handle high-speed networks because the aggregate packet rate exceeds the capabilities of the CPU
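The figures in the two tables above follow directly from link rate and packet size. The sketch below recomputes them. The function names are illustrative, and framing overhead (preamble, inter-frame gap) is ignored, as it appears to be in the tables. Note that the OC-192 rows are reproduced with a rate of 9.953 Gbps rather than the rounded 9.95 Gbps.

```python
def packets_per_second(link_bps: float, packet_bytes: int) -> float:
    """Packets per second on a fully loaded link, ignoring framing overhead."""
    return link_bps / (packet_bytes * 8)


def required_mips(link_bps: float, packet_bytes: int,
                  instructions_per_packet: int = 5_000) -> float:
    """Million instructions per second needed to keep up with the link."""
    return packets_per_second(link_bps, packet_bytes) * instructions_per_packet / 1e6


LINKS = {"10BASE-T": 10e6, "1000BASE-T": 1e9, "OC-192": 9.953e9}

if __name__ == "__main__":
    for name, bps in LINKS.items():
        for size in (64, 1500):
            print(f"{name:11s} {size:5d} B  "
                  f"{packets_per_second(bps, size):14,.0f} pps  "
                  f"{required_mips(bps, size):12,.2f} MIPS")
```

Passing `instructions_per_packet=10_000` gives the upper bound used in the Comer book, which doubles every MIPS figure.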
20 The Network System Challenges
- Data rates in general keep increasing
- Network rate > CPU rate > memory, busses and I/O interfaces
- Protocols and applications keep evolving
- System design, implementation and testing are time consuming and expensive
- Systems often contain errors
- Special-purpose hardware (ASIC) designed for one type of system can usually not be reused
- Host machine must inspect all incoming packets
- Challenge: find ways to improve the design and manufacture of complex networking systems
21 Statement of Hope
- If there is hope, it lies in
- 1990: faster CPUs
- 1995: the application-specific integrated circuit (ASIC) designers
- 2002: the programmers!
- Programmability
- we need a programmable device with more capability than a conventional CPU
- key to low-cost hardware for next-generation network systems
- compared to ASIC designs, it is more flexible, easier and faster to upgrade, and thus less expensive
22 First Generation
- General idea: To optimize computation, move operations that account for the most CPU time from software into hardware
- Onboard
- address recognition and filtering
- onboard buffering
- DMA
- buffer and operation chaining
- Add hardware to NIC
- off-the-shelf chips for layer 2
- ASICs for layer 3
- Allows each NIC to operate independently
- effectively a multiprocessor
- total processing power increased dramatically
23 Second Generation (early 1990s)
- Designed for greater scale
- Decentralized architecture
- additional computational power on each NIC
- NIC implements classification and forwarding
- High-speed internal interconnection mechanism
- interconnects NICs
- provides fast data path
- Multiple network interfaces
- High-speed hardware interconnects NICs
- General-purpose processor only handles exceptions
- Sufficient for medium speed interfaces (100 Mbps)
24 Third Generation (late 1990s)
- Almost all packet processing off-loaded from CPU
- Special-purpose ASICs handle lower layer functions
- Embedded (RISC) processor handles layer 4
- CPU only handles low-demand processing
- Functionality partitioned further
- Additional hardware on each NIC
- Onboard
- classification
- forwarding
- traffic policing
- monitoring and statistics
25 Third Generation (late 1990s)
- So, are third-generation systems sufficient?
- Almost!
- But not quite! :-(
- What's the problem?
- high cost
- long time to market
- difficult to test
- expensive and time-consuming to change
- even trivial changes require silicon respin
- 18-20 month development cycle
- little reuse across products and versions
- require in-house expertise (ASIC designers)
26 Network Processors: The Idea in a Nutshell
- Devise new hardware building blocks, but make them programmable
- Include support for protocol processing and I/O
- General-purpose processor(s) for control tasks
- Special-purpose processor(s) for packet processing and table lookup
- Include functional units for tasks such as checksum computation, hashing, ...
- Integrate as much as possible onto one chip
- Call the result a network processor
27 Review of Conventional Computer Hardware Architectures
[Figure: Intel D850MD motherboard, Intel Hub Architecture (850 chipset): CPU socket, system bus, Memory Controller Hub, RDRAM interface and connectors, hub interface, I/O Controller Hub, PCI bus and connectors]
28 Network Processors: Main Idea
- Traditional system: slow, resource demanding, shared with other operations
- Network processors: a computer within the computer; special, programmable hardware; offloads host resources
29 Designing a Network Processor
- Depends on
- operations the network processor will perform
- role of the network processor in the overall system
- Goals
- generality: sufficient for all protocols, all protocol processing tasks and all possible networks
- high speed: scale to high bit rates and high packet rates
- Key point: A network processor is not designed to process a specific protocol or part of a protocol. Instead, designers seek a minimal set of instructions that is sufficient to handle an arbitrary protocol processing task at high speed
30 Where to Place Network Processors
[Figure: performance vs. cost for software on a conventional processor, network processors and ASIC designs]
- Goal: increase performance and reduce costs
- Thus, network processors are somewhere in the middle
- Increase performance
- known issues
- must partition packet processing into separate functions
- to achieve highest speed, must handle each function with separate hardware
- unknown issues
- which functions to choose
- what hardware building blocks to use
- how to interconnect building blocks
- Decrease costs
- Economics driving a gold rush
- NPs will dramatically lower production costs for network systems
- good NP designs are worth lots of money
31 Explosion of Commercial Products
- 1990 → 2000: network processors transformed from interesting curiosity to mainstream product
- used to reduce both overall costs and time to market
- 2002: over 30 vendors with a wide range of architectures, e.g.,
- Multi-Chip Pipeline (Agere)
- Augmented RISC Processor (Alchemy)
- Embedded Processor Plus Coprocessors (Applied Micro Circuits Corporation)
- Pipeline of Homogeneous Processors (Cisco)
- Pipeline of Heterogeneous Processors (EZchip)
- Configurable Instruction Set Processors (Cognigine)
- Extensive And Diverse Processors (IBM)
- Flexible RISC Plus Coprocessors (Motorola)
- Internet Exchange Processor (Intel)
32 Agere PayloadPlus: A Short Overview
33 Agere PayloadPlus (APP)
- Agere PayloadPlus (APP)
- consists of both programmable hardware and software
- consists of both data and control planes (i.e., fast and slow plane)
- APP defines HW architectures, SW mechanisms, interconnection mechanisms and interfaces, BUT does not specify how to implement them
- Several versions of APP exist, differing in the number and types of functional units, degree of parallelism and internal bandwidth (2nd generation: 5 models)
34 APP Conceptual Pipeline
- State engine
- initiate, configure and control classifier and traffic manager
- receives control from classifier
- update statistics (e.g., packet count)
- check packets against profiles (and inform classifier)
- Forwarder
- get packet from classifier
- perform traffic shaping and management
- fragment packet (if necessary)
- modify headers (if necessary)
- Classifier
- extract packets from ingress
- classify packet
- send statistics to state engine
- reassemble blocks
- pass packet to forwarder together with classification decision
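The division of labour in the conceptual pipeline can be modelled in a few lines. This is only a toy model in Python, not APP code: the class names mirror the slide, but the size-based classification rule, the per-class packet limit used as a "profile", and the MTU value are all assumptions made for illustration.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass
class Packet:
    payload: bytes
    ttl: int = 64


class StateEngine:
    """Updates statistics and checks packets against a (hypothetical) profile."""

    def __init__(self, max_packets_per_class: int):
        self.max_packets_per_class = max_packets_per_class
        self.packet_count = Counter()

    def update(self, traffic_class: str) -> bool:
        self.packet_count[traffic_class] += 1
        # True while the class is still within its profile.
        return self.packet_count[traffic_class] <= self.max_packets_per_class


class Classifier:
    def __init__(self, state_engine: StateEngine):
        self.state_engine = state_engine

    def classify(self, blocks):
        packet = Packet(payload=b"".join(blocks))              # reassemble blocks
        traffic_class = "small" if len(packet.payload) <= 64 else "large"
        in_profile = self.state_engine.update(traffic_class)   # send statistics
        return packet, traffic_class, in_profile               # packet plus decision


class Forwarder:
    def __init__(self, mtu: int):
        self.mtu = mtu
        self.queues = defaultdict(list)                        # one output queue per class

    def forward(self, packet, traffic_class, in_profile):
        if not in_profile:
            return []                                          # out-of-profile: drop
        ttl = packet.ttl - 1                                   # modify headers
        data = packet.payload
        fragments = [Packet(data[i:i + self.mtu], ttl)         # fragment if necessary
                     for i in range(0, len(data), self.mtu)]
        self.queues[traffic_class].extend(fragments)
        return fragments
```

A packet arriving as two blocks would flow through as `Forwarder(mtu=4).forward(*classifier.classify([b"abc", b"defgh"]))`, where `classifier` is a `Classifier` built around a `StateEngine`.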
35 APP550 Chip
36 APP550 Chip
- Memory interfaces
- two types of physical memory
- fast cycle RAM (FCRAM) for fast memory accesses
- double data rate SRAM (DDR-SRAM) for high throughput
- the different memory types are usually used like this
37 APP550 Chip
- Media interfaces
- several to form fast data paths
- two external connections
- cell-oriented (ATM)
- packet-oriented (Ethernet)
38 APP550 Chip
- Scheduling interface
- an external scheduling interface
- external logic can use information about queues
- PCI bus interfaces
- allows communication with host CPU
- mainly to control the whole operation
39 APP550 Chip
- Coprocessor interfaces
- APP550 should be able to process a packet
- BUT, to accommodate special cases, e.g., adding additional headers, a co-processor interface is provided
40 APP550 Chip
41 APP550 Chip
- Stream Editor (SED)
- two parallel engines
- modify outgoing packets (e.g., checksum, TTL, ...)
- configurable, but not programmable
- Packet (protocol data unit) assembler
- collect all blocks of a frame
- not programmable
- Pattern Processing Engine
- patterns specified by programmer
- programmable using a special high-level language
- only pattern matching instructions
- parallelism by hardware using multiple copies and several sets of variables
- access to different memories
- Reorder Buffer Manager
- transfers data between classifier and traffic manager
- ensure packet order due to parallelism and variable processing time in the pattern processing
- Traffic Manager
- schedule packets and shape traffic flow
- programmable via scripts
- sends packets to output interface
- according to implemented policy
- discard packets
- choose queue
- State Engine
- gather information (statistics) for scheduling
- verify flow within bounds
- provide an interface to the host
- configure and control other functional units
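The reorder buffer manager's job, restoring packet order after parallel pattern processing with variable processing time, can be sketched as follows. This is a minimal model under stated assumptions: packets are tagged with consecutive sequence numbers on arrival, and `ReorderBuffer` and `complete` are hypothetical names, not part of the APP550.

```python
class ReorderBuffer:
    """Releases packets in arrival order even if processing finishes out of order."""

    def __init__(self):
        self.next_seq = 0      # sequence number of the next packet to release
        self.pending = {}      # finished packets still waiting for earlier ones

    def complete(self, seq: int, packet):
        """Record that packet `seq` is done; return the packets now releasable."""
        self.pending[seq] = packet
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released


if __name__ == "__main__":
    rb = ReorderBuffer()
    for seq, pkt in [(1, "b"), (0, "a"), (2, "c")]:   # completion order differs from arrival
        print(seq, rb.complete(seq, pkt))
```

A packet that finishes early is simply held until every earlier packet has been released, so the traffic manager always sees the original order.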
42 APP550 Full Duplex
- Clock rate for APP550 is 233 MHz
- One chip cannot manage packets at wire speed in both directions; often two are used in parallel (one for each direction)
- are all features needed in both directions?
- classification only in one direction → checks outgoing packets and enqueues using a special queue
43 Intel IXP1200 / 2400: A Short Overview
44 IXA: Internet Exchange Architecture
- IXA is a broad term to describe the Intel network architecture (HW and SW, control and data plane)
- IXP: Internet Exchange Processor
- processor that implements IXA
- IXP1200 is the first IXP chip (4 versions)
- IXP2xxx has now replaced the first version
- IXP1200 basic features
- 1 embedded 232 MHz StrongARM
- 6 packet 232 MHz µengines
- onboard memory
- 4 x 100 Mbps Ethernet ports
- multiple, independent busses
- low-speed serial interface
- interfaces for external memory and I/O busses
- IXP2400 basic features
- 1 embedded 600 MHz XScale
- 8 packet 600 MHz µengines
- 3 x 1 Gbps Ethernet ports
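As a back-of-the-envelope illustration of what these numbers mean for the programmer, the sketch below estimates how many microengine cycles are available per packet when all listed Ethernet ports are saturated. The assumptions are deliberately crude and are not from the slides: minimum-size 64-byte packets, no framing overhead, and every cycle of every microengine available for packet processing.

```python
def cycle_budget(engines: int, clock_hz: float, ports: int, port_bps: float,
                 packet_bytes: int = 64) -> float:
    """Aggregate microengine cycles available per packet at full line rate."""
    packets_per_second = ports * port_bps / (packet_bytes * 8)
    return engines * clock_hz / packets_per_second


if __name__ == "__main__":
    # Figures taken from the feature lists above.
    print("IXP1200:", round(cycle_budget(6, 232e6, 4, 100e6)))   # about 1782 cycles
    print("IXP2400:", round(cycle_budget(8, 600e6, 3, 1e9)))     # about 819 cycles
```

Under these assumptions the IXP2400 has more and faster microengines but, because its ports are so much faster, fewer cycles per minimum-size packet. Memory latency is not modelled here.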
45 IXP1200 Architecture
- PCI bus
- allow IXP to connect to I/O devices
- enable use of host CPU
- rate: 2.2 Gbps
- SRAM bus
- shared bus (several external units)
- usually control rather than data
- rate: 3.71 Gbps
- Serial line
- connects to the RISC
- intended for control and management
- rate: 38 Kbps
- SDRAM bus
- provide access to external SDRAM memory
- used to store packets
- can also pass addresses, control/store operations, etc.
- rate: 7.42 Gbps
- IX (Intel eXchange) bus
- enable higher rates compared to PCI
- form fast path (IXP and high-speed interfaces)
- interface to other IXP cards
- rate: 4.4 Gbps
46 IXP1200 Architecture
- RISC processor
- StrongARM running Linux
- control, higher layer protocols and exceptions
- 232 MHz
- Access units
- coordinate access to external units
- Scratchpad
- on-chip memory
- used for IPC and synchronization
- Microengines
- low-level devices with limited set of instructions
- transfers between memory devices
- packet processing
- 232 MHz
47 IXP1200 Processor Hierarchy
- General-purpose processor
- used for control and management
- running general applications
- RISC processor
- chip configuration interface (serial line)
- control, higher layer protocols and exceptions
- I/O processors (microengines)
- transfers between memory devices
- packet processing
- Coprocessors
- real-time clock and timers
- IX bus controller
- hashing unit
- ...
- Physical interface processors
- implement layer 1 and 2 processing
48 IXP1200 Memory Hierarchy
49 IXP1200 Memory Hierarchy
- Different memory types
- are organized into different addressable data units (words or longwords)
- have different access times
- are connected to different busses
- Therefore, to achieve optimal performance, programmers must understand the organization and allocate items from the appropriate type
50 IXP Performance Improvement: Forwarding
- Linux 2.4 vs. IXP1200
- Intel P4 host machine
- The forwarding latency improvement itself may only be relevant to very time-sensitive interactive applications
- Offloading is at least equally important
51 IXP1200 → IXP2400
[Figure: IXP1200 block diagram: embedded RISC CPU (StrongARM), PCI access and PCI bus, SRAM access and SRAM bus, SRAM, FLASH, memory-mapped I/O, scratch memory, SDRAM access, DRAM and DRAM bus, IX access and IX bus]
52 IXP2400 Architecture
[Figure: IXP2400 block diagram: embedded RISC CPU (XScale), microengines, PCI access and PCI bus, SRAM access and SRAM bus, SRAM, coprocessor, scratch memory, FLASH, slowport access, SDRAM access, DRAM and DRAM bus, MSF access, receive bus and transmit bus]
- RISC processor
- StrongARM → XScale
- 232 MHz → 600 MHz
- Microengines
- 6 → 8
- 232 MHz → 600 MHz
- Coprocessors
- hash unit
- 4 timers
- general purpose I/O pins
- external JTAG connections
- several bulk ciphers (IXP2850 only)
- checksum (IXP2850 only)
- Media Switch Fabric
- forms fast path for transfers
- interconnect for several IXP2xxx
- Slowport
- shared interface to external units
- used for FlashROM during bootstrap
- Receive/transmit buses
- shared bus → separate busses
53 IXP2400 Architecture
- Memory
- generally more of everything
- generally larger gap between CPUs and memory access in terms of cycles
- local memory on each microengine
- saving temporary results
- private per packet processor
- small (2560 bytes)
- low latency (one cycle)
- accessed through special registers
54 IXP2400 Packet Processing
[Figure: packet path through the IXP2400: embedded RISC CPU (XScale), PCI access and PCI bus, SRAM access and SRAM bus, SRAM, coprocessor, scratch memory, FLASH, slowport access, SDRAM access, DRAM and DRAM bus, MSF access, receive bus and transmit bus]
55 IXP2400 Use
- Easier to use and understand
- Pure Linux environment (except when using the workbench)
- More stable
- Faster to reset
56 Summary
- The network challenges are many
- Challenge: find ways to improve the design and manufacture of complex networking systems
- Hope (2002 version) lies in the programmers and network processors
- We will use the Intel IXP2400 as an example, which offers
- an embedded processor plus parallel packet processors
- connections to external memories and buses
- Next time: how to start programming these monsters