1
Architectural Characterization of TCP/IP
Processing on the Intel Pentium M Processor
  • Srihari Makineni, Ravi Iyer
  • Communications Technology Lab
  • Intel Corp.
  • {srihari.makineni, ravishankar.iyer}@intel.com

HPCA-10
2
Outline
  • Motivation
  • Overview of TCP/IP
  • Setup and Configuration
  • TCP/IP Performance Characteristics
  • Throughput and CPU Utilization
  • Architectural Characterization
  • TCP/IP in server workloads
  • Ongoing work

3
Motivation
  • Why TCP/IP?
  • TCP/IP is the protocol of choice for data
    communications
  • What is the problem?
  • So far system capabilities allowed TCP/IP to
    process data at Ethernet speeds
  • But Ethernet speeds are jumping rapidly (1 to 10
    Gbps)
  • Requires efficient processing to scale to these
    speeds
  • Why architectural characterization?
  • Analyze performance characteristics and identify
    processor architectural features that impact
    TCP/IP processing

4
TCP/IP Overview
  • Transmit

[Diagram: transmit path — the application buffer crosses the sockets interface from user to kernel space; the TCP/IP stack (with the TCB) splits it into Tx segments and adds TCP, IP, and ETH headers; the driver posts descriptors (Desc 1, Desc 2) and the network hardware DMA-transfers the resulting Ethernet packets.]
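The transmit path begins with an ordinary blocking send from user space. A minimal Python sketch of that first step (host, port, and payload are illustrative placeholders, not from the study):

```python
import socket

def transmit(host: str, port: int, payload: bytes) -> int:
    """Hand an application buffer to the kernel TCP/IP stack.

    Everything after sendall() returns -- segmentation, TCP/IP/ETH
    header construction, descriptor setup, and the DMA to the NIC --
    happens inside the kernel and driver, as the slide shows.
    """
    with socket.create_connection((host, port)) as sock:
        # sendall() loops until the whole buffer is queued in kernel
        # space; a single send() may accept fewer bytes than offered.
        sock.sendall(payload)
    return len(payload)
```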
5
TCP/IP Overview
  • Receive

[Diagram: receive path — the network hardware DMAs the Ethernet packet payload into kernel buffers via descriptors; the driver hands Rx segments to the TCP/IP stack; a signal/copy through the sockets interface delivers the data into the application's user-space buffer.]
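On the receive side, the copies in the diagram surface in user space as successive recv() calls. A hedged sketch (the connection object is assumed to come from an ordinary accept()):

```python
import socket

def receive_all(conn: socket.socket, bufsize: int = 4096) -> bytes:
    """Drain a connection until the peer closes it.

    Each recv() triggers the signal/copy step shown on the slide:
    payload the NIC has already DMA-ed into kernel buffers is copied
    into the application's user-space buffer.
    """
    chunks = []
    while True:
        chunk = conn.recv(bufsize)
        if not chunk:  # peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)
```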
6
Setup and Configuration
  • Test setup
  • System Under Test (SUT)
  • Intel Pentium M processor @ 1600 MHz, 1MB (64B
    line) L2 cache
  • 2 Clients
  • Four-way Itanium 2 processors @ 1 GHz, 3MB (128B
    line) L3 cache
  • Operating System
  • Microsoft Windows 2003 Enterprise Edition
  • Network
  • SUT: 4 Gbps total (2 dual-port Gigabit NICs)
  • Clients: 2 Gbps per client (1 dual-port Gigabit
    NIC)

7
Setup and Configuration
  • Tools
  • NTttcp
  • Microsoft application to measure TCP/IP
    performance
  • Tool to extract CPU performance counters
  • Settings
  • 16 connections (4 per NIC port)
  • Overlapped I/O
  • Large Segment Offload (LSO)
  • Regular Ethernet frames (1518 bytes)
  • Checksum offload to NIC
  • Interrupt coalescing

8
Throughput and CPU Utilization
  • Lower Rx performance for >512-byte buffer sizes
  • Rx and Tx (no LSO) CPU utilization is 100%
  • Benefit of LSO is significant (250% for the 64KB
    buffer)
  • Lower throughput for <1KB buffers is due to
    buffer locking

TCP/IP processing @ 1 Gbps with 1460-byte buffers
requires >1 CPU
9
Processing Efficiency
  • Hz/bit
  • 64 byte buffer
  • Tx (LSO) 17.13 and Rx 13.7
  • 64 KB buffer
  • Tx (LSO) 0.212, Tx (no LSO) 0.53 and Rx 1.12

Several cycles are needed to move a bit,
especially for Rx
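The Hz/bit metric is cycles consumed per bit moved, derivable from clock rate, CPU utilization, and throughput. A sketch of the arithmetic with assumed numbers (the 2 Gbps figure below is illustrative, not a measurement from the paper):

```python
def cycles_per_bit(cpu_hz: float, utilization: float,
                   throughput_bps: float) -> float:
    # Cycles spent on TCP/IP per bit transferred; utilization is a
    # fraction in [0, 1]. Lower is more efficient.
    return cpu_hz * utilization / throughput_bps

# A 1.6 GHz Pentium M fully busy while sustaining an assumed 2 Gbps
# would spend 0.8 cycles on every bit transferred.
```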
10
Architectural Characterization
  • CPI
  • Rx CPI higher than Tx for >512-byte buffers
  • Tx (LSO) CPI is higher than Tx (no LSO)!!!

CPI needs to come down to achieve TCP/IP scaling
11
Architectural Characterization
  • Pathlength
  • Rx pathlength increase is significant after 1460
    byte buffer sizes
  • For 64KB, TCP/IP stack has to receive and process
    45 packets
  • Lower CPI for Tx (no LSO) over Tx (LSO) is due to
    higher PL

High PL shows that there is room for stack
optimizations
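The 45-packet figure follows directly from the 1460-byte MSS carried by standard 1518-byte Ethernet frames. A one-line check:

```python
import math

def packets_per_buffer(buffer_bytes: int, mss: int = 1460) -> int:
    # A standard Ethernet frame carries at most 1460 bytes of TCP
    # payload, so a large application buffer arrives as many packets,
    # each paying per-packet header-processing cost on Rx.
    return math.ceil(buffer_bytes / mss)
```

For a 64KB buffer, packets_per_buffer(65536) gives 45, matching the slide.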
12
Architectural Characterization
  • Last level Cache Performance
  • Rx has higher misses
  • Primary reason for higher CPI
  • Lots of compulsory misses
  • Source buffer, descriptors, maybe destination
    buffer
  • Tx (no LSO) has slightly higher misses per bit

Rx performance does not scale with cache size
(many compulsory misses)
13
Architectural Characterization
  • L1 Data Cache Performance
  • 32KB of data cache in Pentium M processor
  • As expected, L1 data cache misses are higher for Rx
  • For Rx, 68% to 88% of L1 misses resulted in L2
    hits

Larger L1 data cache has limited impact on TCP/IP
14
Architectural Characterization
  • L1 Instruction Cache Performance
  • 32KB instruction cache in Pentium M processor
  • Tx (no LSO) MPI is lower because of code temporal
    locality
  • Rx code path generated L1 instruction capacity
    misses

Larger L1 instruction cache helps Rx processing
15
Architectural Characterization
  • TLB Performance
  • Size
  • 128 instruction and 128 data TLB entries
  • iTLB misses increase faster than dTLB misses

16
Architectural Characterization
  • Branch Behavior
  • 19-21% branch instructions
  • Misprediction rate is higher in Tx than Rx for
    <512-byte buffer sizes

>98% accuracy in branch prediction
17
Architectural Characterization
  • CPI Contributors
  • Rx is more memory intensive than Tx
  • Frequency Scaling
  • Poor Frequency scaling due to memory latency
    overhead

Frequency Scaling alone will not deliver 10x gain
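Why frequency scaling falls short can be seen in a simple CPI model: memory latency is fixed in nanoseconds, so its cost in cycles grows with clock rate. A sketch with illustrative numbers (the core CPI, miss rate, and latency below are assumptions, not measurements from the paper):

```python
def effective_cpi(core_cpi: float, misses_per_instr: float,
                  mem_latency_ns: float, freq_hz: float) -> float:
    # The memory-stall component scales linearly with frequency,
    # eating into any clock-speed gain.
    miss_penalty_cycles = mem_latency_ns * 1e-9 * freq_hz
    return core_cpi + misses_per_instr * miss_penalty_cycles

# With an assumed 100 ns memory latency and 0.01 misses/instruction,
# doubling the clock from 1.6 GHz to 3.2 GHz doubles the stall term,
# so instruction throughput improves well under 2x.
```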
18
TCP/IP in Server Workloads
  • Webserver
  • TCP/IP data path overhead is 28%
  • Back-End (database server with iSCSI)
  • TCP/IP data path overhead is 35%
  • Front-End (e-commerce server)
  • TCP/IP data path overhead is 29%

TCP/IP Processing is significant in commercial
server workloads
19
Conclusions
  • Major Observations
  • TCP/IP processing @ 1 Gbps with 1460-byte
    buffers requires >1 CPU
  • CPI needs to come down to achieve TCP/IP scaling
  • High PL shows that there is room for stack
    optimizations
  • Rx performance does not scale w/ cache size
    (many compulsory misses)
  • Larger L1 data cache has limited impact on TCP/IP
  • Larger L1 instruction cache helps Rx processing
  • >98% accuracy in branch prediction
  • Frequency Scaling alone will not deliver 10x gain
  • TCP/IP Processing is significant in commercial
    server workloads
  • Key Issues
  • Memory Stall Time Overhead
  • Pathlength (O/S Overhead, etc)

20
Ongoing work
  • Investigating Solutions to the Memory Latency
    Overhead
  • Copy Acceleration
  • Low cost synchronous/asynchronous copy engine
  • DCA
  • Incoming data is pushed into the processor's
    cache instead of memory
  • Light weight Threads to hide memory access
    latency
  • Switch-on-event threads: small context, low
    switching overhead
  • Smart Caching
  • Cache structures and policies for networking
  • Partitioning
  • Optimized TCP/IP stack running on dedicated
    processor(s) or core(s)
  • Other Studies
  • Connection processing, bi-directional data
  • Application interference

21
Q&A