Title: Architectural Characterization of TCP/IP Processing on the Intel Pentium M Processor
1. Architectural Characterization of TCP/IP Processing on the Intel Pentium M Processor
- Srihari Makineni, Ravi Iyer
- Communications Technology Lab
- Intel Corp.
- srihari.makineni@intel.com, ravishankar.iyer@intel.com
HPCA-10
2. Outline
- Motivation
- Overview of TCP/IP
- Setup and Configuration
- TCP/IP Performance Characteristics
- Throughput and CPU Utilization
- Architectural Characterization
- TCP/IP in server workloads
- Ongoing work
3. Motivation
- Why TCP/IP?
- TCP/IP is the protocol of choice for data communications
- What is the problem?
- So far, system capabilities have allowed TCP/IP to process data at Ethernet speeds
- But Ethernet speeds are jumping rapidly (1 to 10 Gbps)
- Requires efficient processing to scale to these speeds
- Why architectural characterization?
- Analyze performance characteristics and identify processor architectural features that impact TCP/IP processing
4. TCP/IP Overview
[Figure: transmit path. The application buffer is handed through the sockets interface to the TCP/IP stack (TCB lookup, TCP/IP/ETH header processing for segments Tx 1-3); the driver fills descriptors (Desc 1, Desc 2) and the network hardware DMAs the resulting Ethernet packets onto the wire.]
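To make the transmit path concrete, here is a minimal user-space sketch in C (generic BSD sockets, not the Windows/NTttcp code used in the paper): everything below the send() call, namely segmentation, descriptor setup, and DMA, happens inside the kernel stack and driver as in the figure above.

```c
/* Transmit-side sketch: the application's view of the Tx path.
 * Illustrative only; error handling is minimal. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

int main(void) {
    int s = socket(AF_INET, SOCK_STREAM, 0);       /* TCP socket */
    struct sockaddr_in dst = {0};
    dst.sin_family = AF_INET;
    dst.sin_port   = htons(5001);                  /* example port */
    inet_pton(AF_INET, "192.168.0.2", &dst.sin_addr);

    if (connect(s, (struct sockaddr *)&dst, sizeof dst) != 0) {
        perror("connect");
        return 1;
    }

    static char buf[64 * 1024];                    /* 64KB application buffer */
    memset(buf, 'x', sizeof buf);

    /* send() moves the user buffer into kernel socket buffers; the stack
     * then cuts it into MSS-sized segments (or hands larger chunks to the
     * NIC when Large Segment Offload is enabled). */
    ssize_t sent = send(s, buf, sizeof buf, 0);
    printf("queued %zd bytes for transmission\n", sent);
    close(s);
    return 0;
}
```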
5. TCP/IP Overview
[Figure: receive path. The network hardware DMAs the incoming Ethernet packet's payload into a driver buffer described by a descriptor; the TCP/IP stack processes the segments (Rx 1-3), and the payload is copied, with a signal to the application, from the kernel buffer to the application buffer through the sockets interface.]
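The receive side looks like this from user space (again a generic BSD-sockets sketch, not the paper's Windows code): each recv() call performs the kernel-to-user copy labelled in the figure, which is a large part of the Rx cost analyzed later.

```c
/* Receive-side sketch: each recv() copies data that the NIC has already
 * DMAed into kernel buffers out to the application buffer. */
#include <sys/types.h>
#include <sys/socket.h>

/* 'conn' is an already-connected TCP socket (see the Tx sketch above). */
long drain_connection(int conn) {
    static char buf[64 * 1024];       /* application receive buffer */
    long total = 0;
    ssize_t n;
    /* Loop until the peer closes; every iteration is one kernel->user copy. */
    while ((n = recv(conn, buf, sizeof buf, 0)) > 0)
        total += n;
    return total;
}
```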
6. Setup and Configuration
- Test setup
- System Under Test (SUT)
- Intel Pentium M processor @ 1600MHz, 1MB L2 cache (64B line)
- 2 Clients
- Four-way Itanium 2 processor @ 1GHz, 3MB L3 cache (128B line)
- Operating System
- Microsoft Windows 2003 Enterprise Edition
- Network
- SUT: 4Gbps total (2 dual-port Gigabit NICs)
- Clients: 2Gbps per client (1 dual-port Gigabit NIC)
7. Setup and Configuration
- Tools
- NTttcp: Microsoft application to measure TCP/IP performance
- Tool to extract CPU performance counters
- Settings
- 16 connections (4 per NIC port)
- Overlapped I/O (see the sketch after this list)
- Large Segment Offload (LSO)
- Regular Ethernet frames (1518 bytes)
- Checksum offload to NIC
- Interrupt coalescing
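For reference, overlapped (asynchronous) socket I/O on Windows is posted roughly as below. This is a minimal illustrative Winsock sketch, not the NTttcp source; socket setup, completion handling, and error paths are omitted.

```c
/* Posting an overlapped receive with Winsock. Assumes 's' is a connected
 * SOCKET, WSAStartup() has been called, and 'ov' carries a completion
 * event or is associated with an I/O completion port. */
#include <winsock2.h>

void post_overlapped_recv(SOCKET s, char *buf, unsigned long len,
                          WSAOVERLAPPED *ov) {
    WSABUF wsabuf;
    DWORD  flags = 0;

    wsabuf.buf = buf;
    wsabuf.len = len;

    /* WSARecv returns immediately; the data lands in 'buf' later and the
     * completion is reported through 'ov'. */
    if (WSARecv(s, &wsabuf, 1, NULL, &flags, ov, NULL) == SOCKET_ERROR &&
        WSAGetLastError() != WSA_IO_PENDING) {
        /* a real error: the receive could not be posted */
    }
}
```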
8. Throughput and CPU Utilization
- Lower Rx performance for >512 byte buffer sizes
- Rx and Tx (no LSO) CPU utilization is 100%
- Benefit of LSO is significant (~250% for the 64KB buffer size)
- Lower throughput for <1KB buffers is due to buffer locking
TCP/IP processing @ 1Gbps with 1460-byte buffers requires >1 CPU
9. Processing Efficiency
- 64 byte buffer
- Tx (LSO): 17.13, Rx: 13.7 cycles per bit
- 64 KB buffer
- Tx (LSO): 0.212, Tx (no LSO): 0.53, Rx: 1.12 cycles per bit
Several cycles are needed to move a bit, especially for Rx
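As a rough sanity check on what these figures imply (assuming, as the takeaway suggests, that they are CPU cycles per payload bit), the sketch below converts them into the line rate one 1.6GHz core could sustain at 100% utilization:

```c
/* Back-of-the-envelope: achievable throughput ~= core frequency / cycles-per-bit.
 * Figures are taken from the slide above; illustrative only. */
#include <stdio.h>

int main(void) {
    double freq_hz = 1.6e9;    /* Pentium M at 1600 MHz */
    double rx_64b  = 13.7;     /* Rx cycles/bit, 64-byte buffers */
    double rx_64k  = 1.12;     /* Rx cycles/bit, 64KB buffers    */

    printf("Rx, 64B buffers : ~%.0f Mbps per core\n", freq_hz / rx_64b / 1e6);
    printf("Rx, 64KB buffers: ~%.0f Mbps per core\n", freq_hz / rx_64k / 1e6);
    /* ~117 Mbps vs ~1429 Mbps: small buffers fall far short of multi-Gbps rates. */
    return 0;
}
```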
10. Architectural Characterization
- Rx CPI is higher than Tx for >512 byte buffers
- Tx (LSO) CPI is higher than Tx (no LSO)!
CPI needs to come down to achieve TCP/IP scaling
11. Architectural Characterization
- Rx pathlength increase is significant beyond 1460 byte buffer sizes
- For 64KB, the TCP/IP stack has to receive and process 45 packets
- Lower CPI for Tx (no LSO) over Tx (LSO) is due to higher pathlength (PL)
High PL shows that there is room for stack optimizations
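To connect these quantities (a hedged back-of-the-envelope, not the paper's exact accounting): total cycles per operation are pathlength x CPI, and a 64KB receive at the standard 1460-byte MSS spans 45 packets, each of which pays per-packet stack and driver costs:

```c
/* Why a 64KB Rx buffer means ~45 packets' worth of stack work. */
#include <stdio.h>

int main(void) {
    int buffer_bytes = 64 * 1024;
    int mss = 1460;                                /* standard Ethernet MSS */
    int packets = (buffer_bytes + mss - 1) / mss;  /* ceiling division -> 45 */
    printf("%d packets per 64KB receive\n", packets);
    return 0;
}
```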
12. Architectural Characterization
- Last Level Cache Performance
- Rx has higher misses
- Primary reason for the higher CPI
- Many compulsory misses
- Source buffer, descriptors, and possibly the destination buffer
- Tx (no LSO) has slightly higher misses per bit
Rx performance does not scale with cache size (many compulsory misses)
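A rough illustration of why the Rx compulsory misses are unavoidable (assumptions: 64B lines as on the SUT's caches, and one cache line per NIC descriptor; real descriptor sizes vary by NIC):

```c
/* Rough count of cache lines a received packet touches for the first time
 * (compulsory misses): the freshly DMAed payload plus its descriptor. */
#include <stdio.h>

int main(void) {
    int line = 64;                                    /* cache line size (bytes) */
    int payload = 1460;                               /* one full-sized segment  */
    int payload_lines = (payload + line - 1) / line;  /* = 23 lines              */
    int descriptor_lines = 1;                         /* assumed: 1 line/descriptor */

    printf("~%d new cache lines per received packet\n",
           payload_lines + descriptor_lines);
    return 0;
}
```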
13. Architectural Characterization
- L1 Data Cache Performance
- 32KB of data cache in the Pentium M processor
- As expected, L1 data cache misses are higher for Rx
- For Rx, 68% to 88% of L1 misses resulted in L2 hits
Larger L1 data cache has limited impact on TCP/IP
14. Architectural Characterization
- L1 Instruction Cache Performance
- 32KB instruction cache in the Pentium M processor
- Tx (no LSO) MPI is lower because of code temporal locality
- Rx code path generates L1 instruction capacity misses
Larger L1 instruction cache helps Rx processing
15. Architectural Characterization
- TLB Performance
- Size: 128 instruction and 128 data TLB entries
- iTLB misses increase faster than dTLB misses
16. Architectural Characterization
- 19-21% of instructions are branches
- Misprediction rate is higher in Tx than Rx for <512 byte buffer sizes
>98% accuracy in branch prediction
17. Architectural Characterization
- CPI Contributors
- Rx is more memory intensive than Tx
- Frequency Scaling
- Poor frequency scaling due to memory latency overhead
Frequency scaling alone will not deliver a 10x gain
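A simple way to see why (a hedged toy model, not the paper's measured data): raising the clock shrinks only the core-execution component of each operation, while memory stall time, fixed in nanoseconds, stays put, so overall speedup saturates.

```c
/* Toy model of frequency scaling with a fixed memory-stall component.
 * The 50/50 split between core time and stall time is assumed for illustration. */
#include <stdio.h>

int main(void) {
    double core  = 0.5;    /* time spent executing (scales with frequency) */
    double stall = 0.5;    /* time stalled on memory (does not scale)      */
    double base  = core + stall;

    for (double f = 1.0; f <= 8.0; f *= 2.0) {
        double t = core / f + stall;               /* only the core part shrinks */
        printf("%.0fx frequency -> %.2fx speedup\n", f, base / t);
    }
    return 0;
}
```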
18. TCP/IP in Server Workloads
- Web server
- TCP/IP data path overhead is 28%
- Back-end (database server with iSCSI)
- TCP/IP data path overhead is 35%
- Front-end (e-commerce server)
- TCP/IP data path overhead is 29%
TCP/IP processing is significant in commercial server workloads
19. Conclusions
- Major Observations
- TCP/IP processing @ 1Gbps with 1460-byte buffers requires >1 CPU
- CPI needs to come down to achieve TCP/IP scaling
- High PL shows that there is room for stack optimizations
- Rx performance does not scale with cache size (many compulsory misses)
- Larger L1 data cache has limited impact on TCP/IP
- Larger L1 instruction cache helps Rx processing
- >98% accuracy in branch prediction
- Frequency scaling alone will not deliver a 10x gain
- TCP/IP processing is significant in commercial server workloads
- Key Issues
- Memory stall time overhead
- Pathlength (OS overhead, etc.)
20. Ongoing Work
- Investigating solutions to the memory latency overhead
- Copy Acceleration (see the sketch after this list)
- Low-cost synchronous/asynchronous copy engine
- DCA (Direct Cache Access)
- Incoming data is pushed into the processor's cache instead of memory
- Lightweight threads to hide memory access latency
- Switch-on-event threads: small context, low switching overhead
- Smart Caching
- Cache structures and policies for networking
- Partitioning
- Optimized TCP/IP stack running on dedicated processor(s) or core(s)
- Other Studies
- Connection processing, bi-directional data
- Application interference
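To illustrate what a low-cost asynchronous copy engine could look like to software (a hypothetical sketch; the names and interface are invented for illustration and stubbed with memcpy, not an Intel API): the CPU posts the payload copy and keeps doing protocol work instead of stalling on the memory-to-memory copy.

```c
/* Hypothetical asynchronous copy-engine interface (illustrative only;
 * names are invented, and the "engine" below is stubbed with memcpy). */
#include <string.h>

typedef struct { int done; } copy_req_t;

/* In real hardware this would program a DMA-style copy engine and return
 * immediately; here it is stubbed synchronously so the sketch compiles. */
static copy_req_t copy_submit(void *dst, const void *src, size_t len) {
    copy_req_t req;
    memcpy(dst, src, len);
    req.done = 1;
    return req;
}

static void copy_wait(copy_req_t *req) {
    while (!req->done) { /* poll engine completion status */ }
}

/* Receive-path usage sketch: overlap the payload copy with protocol work. */
void rx_deliver(void *user_buf, const void *kernel_buf, size_t len) {
    copy_req_t req = copy_submit(user_buf, kernel_buf, len);

    /* ... TCP processing (ACK generation, TCB updates) would proceed here,
     *     hiding the memory latency of the copy ... */

    copy_wait(&req);   /* ensure the payload has landed before signalling
                          the application */
}
```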
21. Q&A