Title: A TCP Tuning Daemon
1. A TCP Tuning Daemon
Tom Dunigan (thd_at_ornl.gov), Matt Mathis (mathis_at_psc.edu), Brian Tierney (bltierney_at_lbl.gov)
2. Roadmap
- Motivation
- Net100 project
- Web100
- network probes/sensors
- protocol analysis
- A TCP tuning daemon
- Tuning experiments
www.net100.org
- and now a word from our sponsors
- DOE-funded project (Office of Science)
- $1M/yr, 3 yrs beginning 9/01
- LBL, ORNL, PSC, NCAR
- Net100 project objectives (network-aware operating systems)
- measure, understand, and improve end-to-end network/application performance
- tune network protocols and applications (grid and bulk transfer)
- first-year emphasis: TCP bulk transfer over high delay/bandwidth nets
3. Motivation
- Poor network application performance
- High bandwidth paths, but apps slow
- Is it application? OS? network? Yes
- Often need a network wizard
- Changing bandwidths
- 9.6 Kb/s → 1.5 Mb/s → 45 Mb/s → 100 Mb/s → 1000 Mb/s → ? Gb/s
- Unchanging TCP
- speed of light (RTT)
- MTU (still 1500 bytes)
- TCP congestion avoidance
- TCP is lossy by design!
- 2x overshoot at startup, sawtooth
- recovery after a loss can be very slow on today's high delay/bandwidth links
- recovery rate proportional to MSS/RTT²
[Figure: ORNL to NERSC ftp, GigE/OC12, 80 ms RTT. Instantaneous vs. average bandwidth; early startup losses; linear recovery at only 0.5 Mb/s toward 8 Mb/s.]
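A back-of-the-envelope calculation makes the recovery problem concrete (my arithmetic, assuming a 1460-byte MSS on the OC-12, 80 ms path above): after a loss, standard TCP halves its window and then adds one MSS per RTT, so

\[
T_{\text{recover}} \approx \frac{W/2}{\text{MSS}} \cdot \text{RTT}, \qquad W = \text{BW} \cdot \text{RTT}
\]

\[
W \approx 622\ \text{Mb/s} \times 80\ \text{ms} \approx 6.2\ \text{MB}, \qquad
T_{\text{recover}} \approx \frac{3.1\ \text{MB}}{1460\ \text{B}} \times 80\ \text{ms} \approx 170\ \text{s}
\]

i.e. roughly three minutes of linear recovery from a single loss, before counting the extra factor of two that delayed ACKs add.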
4. TCP tuning
- set optimal (?) buffer size (see the buffer-sizing sketch after this slide)
- need buffer = bandwidth × RTT; ORNL/NERSC (80 ms, OC12) needs ~6 MB
- avoid losses
- modified slow-start
- reduce bursts
- anticipate loss (ECN, Vegas?)
- reorder threshold
- speed recovery
- bigger MTU or virtual MSS
- modified AIMD (0.5,1)
- delayed ACKs and initial window
- avoid congestion collapse
- be fair (?): intranets, QoS
ns simulation: 500 Mb/s link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover!
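To make the buffer-sizing bullet concrete, here is a minimal, self-contained sketch (not the WAD itself, which tunes unmodified applications through the Web100 kernel interface) of an application setting its own socket buffers to roughly the bandwidth-delay product before connecting; the ~6 MB value matches the ORNL/NERSC path above:

#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Sketch: size the socket buffers to ~bandwidth * RTT before connect().
 * 622 Mb/s * 80 ms is about 6 MB for the ORNL/NERSC OC-12 path. */
int tuned_connect(const char *ip, unsigned short port, int bdp_bytes)
{
    struct sockaddr_in addr;
    int s = socket(AF_INET, SOCK_STREAM, 0);

    if (s < 0)
        return -1;

    /* Buffers must be set before connect() so a large enough
     * window scale is negotiated on the SYN. */
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bdp_bytes, sizeof(bdp_bytes));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bdp_bytes, sizeof(bdp_bytes));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    inet_pton(AF_INET, ip, &addr.sin_addr);

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }
    return s;   /* e.g., tuned_connect("10.5.128.74", 21, 6 * 1024 * 1024) */
}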
5. Net100 components for tuning
- TCP protocol analysis
- simulation/emulation
- kernel tuning extensions
- Web100 Linux kernel (NSF) www.web100.org
- instrumented TCP stack (IETF MIB draft)
- 100 variables per flow (/proc/web100); see the sketch after this slide
- socket open/close event notification
- API and tools for tracing and tuning, e.g., bw tester http://firebird.ccs.ornl.gov:7123
- Path characterization
- Network Tuning and Analysis Framework (NTAF)
- both active and passive measurement
- iperf, pipechar
- schedule probes and distribute/archive results
- database of measurements
- NTAF/Net100 hosts at PSC, NCAR, LBL, ORNL, NERSC, CERN, UT, SLAC
- TCP tuning daemon
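For a feel of the per-flow instrumentation, here is a rough sketch of walking /proc/web100 and printing one variable per connection. The per-connection directory layout and the variable name "CurCwnd" are assumptions for illustration; real tools go through the Web100 userland API rather than parsing /proc directly:

#include <stdio.h>
#include <dirent.h>

/* Sketch only: assumes one directory per instrumented connection under
 * /proc/web100 and a readable per-variable file named "CurCwnd".
 * The actual layout and variable names come from the Web100 library/MIB. */
int main(void)
{
    DIR *d = opendir("/proc/web100");
    struct dirent *e;

    if (!d) {
        perror("/proc/web100");
        return 1;
    }
    while ((e = readdir(d)) != NULL) {
        char path[512];
        unsigned long cwnd;
        FILE *f;

        if (e->d_name[0] == '.')
            continue;
        snprintf(path, sizeof(path), "/proc/web100/%s/CurCwnd", e->d_name);
        f = fopen(path, "r");
        if (!f)
            continue;               /* not a connection directory */
        if (fscanf(f, "%lu", &cwnd) == 1)
            printf("conn %s: cwnd = %lu bytes\n", e->d_name, cwnd);
        fclose(f);
    }
    closedir(d);
    return 0;
}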
6. TCP Tuning Daemon
WAD config file:
  bob
    src_addr  0.0.0.0
    src_port  0
    dst_addr  10.5.128.74
    dst_port  0
    mode      1
    sndbuf    2000000
    rcvbuf    100000
    wadai     6
    wadmd     0.3
    maxssth   100
    divide    1
    reorder   9
    sendstall 0
    delack    0
    floyd     1
- Work-around Daemon (WAD)
- tune unknowing sender/receiver at startup and/or during flow
- Web100 kernel extensions
- pre-set windowscale to allow dynamic tuning
- uses netlink to alert the daemon of socket open/close (or poll)
- besides existing Web100 buffer tuning, new tuning options using WAD_ variables
- knobs to disable Linux 2.4 caching, burst mgt., and sendstall
- config file with static tuning data
- mode specifies dynamic tuning (Floyd AIMD, NTAF buffer size, concurrent streams)
- daemon periodically polls NTAF for fresh tuning data
- written in C (also a Python version); see the config-lookup sketch below
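When notified of a new socket, the daemon finds the matching config entry and applies its WAD_ variables through the Web100 interface. Below is a minimal, self-contained sketch of just the lookup step; the struct mirrors the sample config above, and the wildcard semantics (0.0.0.0 / port 0 match anything) are my assumption:

#include <stdio.h>
#include <string.h>

/* Sketch: matching a WAD-style config entry to a new flow. */
struct wad_entry {
    const char *name;
    const char *src_addr, *dst_addr;   /* "0.0.0.0" = any (assumed) */
    int src_port, dst_port;            /* 0 = any (assumed) */
    long sndbuf, rcvbuf;
    int wadai;                         /* additive increase, segments/RTT */
    double wadmd;                      /* multiplicative decrease */
};

static int addr_match(const char *pat, const char *addr)
{
    return strcmp(pat, "0.0.0.0") == 0 || strcmp(pat, addr) == 0;
}

const struct wad_entry *wad_lookup(const struct wad_entry *tbl, int n,
                                   const char *src, int sport,
                                   const char *dst, int dport)
{
    for (int i = 0; i < n; i++)
        if (addr_match(tbl[i].src_addr, src) &&
            addr_match(tbl[i].dst_addr, dst) &&
            (tbl[i].src_port == 0 || tbl[i].src_port == sport) &&
            (tbl[i].dst_port == 0 || tbl[i].dst_port == dport))
            return &tbl[i];
    return NULL;
}

int main(void)
{
    struct wad_entry cfg[] = {
        { "bob", "0.0.0.0", "10.5.128.74", 0, 0, 2000000, 100000, 6, 0.3 },
    };
    const struct wad_entry *e =
        wad_lookup(cfg, 1, "192.168.1.10", 5001, "10.5.128.74", 21);
    if (e)
        printf("flow matches '%s': sndbuf=%ld wadai=%d wadmd=%.2f\n",
               e->name, e->sndbuf, e->wadai, e->wadmd);
    return 0;
}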
7. Experimental results
- Evaluating the tuning daemon in the wild
- emphasis: bulk transfers over high delay/bandwidth nets (Internet2, ESnet)
- tests over 10GigE, OC48, OC12, OC3, ATM/VBR, GigE, FDDI, 100/10T, cable, ISDN, wireless (802.11b), dialup
- tests over NistNet 100T testbed
- Various TCP tuning options
- buffer tuning
- AIMD mods (including Floyd, both in-kernel and in WAD)
- slow-start mods
- parallel vs single
- Results are anecdotal
- more systematic testing is ongoing
- Your mileage may vary.
Network professionals on a closed course.
Do not attempt this at home.
8. WAD tuning results
- Classic buffer tuning
- ORNL to PSC, OC12, 80 ms RTT
- network-challenged app gets 10 Mb/s
- same app with WAD/NTAF-tuned buffer gets 143 Mb/s
- Virtual MSS
- tune TCP's additive increase (WAD_AI); see the arithmetic below
- add k segments per RTT during recovery
- k = 6 is like a GigE jumbo frame, but
- interrupt rate not reduced
- doesn't do k segments for the initial window
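Restating the virtual-MSS idea in congestion-avoidance terms (my restatement; the k = 6 value is from the slide):

\[
\text{standard: } cwnd \leftarrow cwnd + \frac{\text{MSS}^2}{cwnd} \text{ per ACK} \approx 1\ \text{MSS per RTT},
\qquad
\text{WAD\_AI} = k: \; cwnd \leftarrow cwnd + \frac{k\,\text{MSS}^2}{cwnd} \approx k\ \text{MSS per RTT}
\]

so linear recovery runs about k times faster, much as if the MSS were k times larger (k = 6 roughly matches a 9000-byte jumbo frame), without the reduced interrupt rate that real jumbo frames give.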
9. Tuning around Linux (2.4) TCP
Amsterdam-Chicago GigE via 10GigE, 100 ms RTT
- Tunable ssthresh caching
- Tunable sendstall (TXQUEUELEN)
- Floyd AIMD: as cwnd grows, increase AI and decrease MD; do the reverse when cwnd shrinks
- added to the Net100 kernel and to the WAD (WAD tunable); see the table-driven sketch below
[Plot: throughput approaching 600 Mb/s with Floyd AIMD vs. standard AIMD; sendstalls and a UDP event marked.]
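One way to picture the Floyd-style adjustment is a table keyed by window size, with larger windows getting a larger additive increase and a smaller multiplicative decrease. The breakpoints and values below are illustrative only, not the actual Floyd or Net100 table:

/* Sketch of cwnd-dependent AIMD in the spirit of Floyd's proposal:
 * larger windows get a larger additive increase (ai, segments per RTT)
 * and a smaller multiplicative decrease (md). */
struct aimd_step { int cwnd_segs; int ai; double md; };

static const struct aimd_step aimd_table[] = {
    {    0,  1, 0.50 },   /* standard TCP at small windows */
    {   38,  2, 0.45 },
    {  500,  8, 0.35 },
    { 5000, 30, 0.22 },
};

void floyd_aimd(int cwnd_segs, int *ai, double *md)
{
    /* walk the table and keep the last step this window qualifies for */
    int n = sizeof(aimd_table) / sizeof(aimd_table[0]);
    *ai = aimd_table[0].ai;
    *md = aimd_table[0].md;
    for (int i = 1; i < n; i++) {
        if (cwnd_segs < aimd_table[i].cwnd_segs)
            break;
        *ai = aimd_table[i].ai;
        *md = aimd_table[i].md;
    }
}

A table keeps the per-ACK cost low, and standard TCP behavior is preserved below the first breakpoint, so small flows are unaffected.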
10. WAD tuning
- Modified slow-start and AI
- ORNL to NERSC, OC12, 80 ms RTT
- often losses in slow-start
- WAD tuned Floyd slow-start and fixed AI (6)
- WAD-tuned AIMD and slow-start
- ORNL to CERN, OC12, 150 ms RTT
- parallel streams: AIMD (1/(2k), k)
- WAD-tuned single stream: (0.125, 4); see the note below
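The single-stream parameters follow the usual parallel-stream equivalence argument (my restatement):

\[
k \text{ standard streams} \;\approx\; \text{one stream with } \text{AI} = k,\ \text{MD} = \tfrac{1}{2k},
\qquad k = 4 \;\Rightarrow\; (\text{MD}, \text{AI}) = (0.125,\ 4)
\]

since k streams together add k segments per RTT, and a single loss halves only one of the k windows, i.e. about 1/(2k) of the aggregate.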
11. GridFTP tuning
Can a tuned single stream compete with parallel streams?
Mostly not with equivalence tuning, but sometimes; parallel streams have a slow-start advantage. The WAD can divide the buffer among concurrent flows: fairer/faster? Tests are inconclusive so far, and testing on the real Internet is problematic.
Is there a congestion metric? Per unit of time?
Flow       Mb/s   congestion   re-xmits
untuned      28            4         30
tuned        74            5        295
parallel     52           30        401
untuned      25            7         25
tuned        67            2        420
parallel     88           17        440

Buffers: 64K I/O, 4 MB TCP (untuned: 64K TCP, 8 Mb/s, 200 s)
Data/plots from Web100 tracer
12. Future TCP tuning
- Reorder threshold
- seeing more out-of-order packets
- WAD tunes a bigger reorder threshold for the path
- 40x improvement!
- Linux 2.4 does a good job already
- adjusts and caches the reorder threshold
- undoes congestion avoidance when retransmits prove unneeded
- LBL to ORNL (using our TCP-over-UDP): the dup3 case had 289 retransmits, but all were unneeded!
- Delayed ACKs
- WAD could turn off delayed ACKs: 2x improvement in recovery rate and slow-start (see the note below)
- Linux 2.4 already turns off delayed ACKs for the initial slow-start
ns simulation: 500 Mb/s link, 80 ms RTT. Packet loss early in slow start; standard TCP with delayed ACKs takes 10 minutes to recover! Note: aggressive static AIMD (Floyd pre-tune).
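The factor of two comes from ACK counting (my restatement of the slide's claim): with delayed ACKs the receiver acknowledges every other segment, so the sender sees roughly half as many ACKs per RTT:

\[
\text{congestion avoidance: } \Delta cwnd \approx \tfrac{1}{2}\,\text{MSS per RTT (vs. 1 MSS)},
\qquad
\text{slow start: } cwnd \to 1.5\,cwnd \text{ per RTT (vs. } 2\,cwnd)
\]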
13. Futures
- Net100
- analyze effectiveness/fairness of current tuning options
- simulation
- emulation
- on the net (systematic tests)
- NTAF probes: characterizing a path to tune a flow
- router data (passive)
- monitoring applications with Web100
- additional tuning algorithms
- Vegas, ECN
- non-TCP
- identify non-congestive loss?
- parallel/multipath selection/tuning
- WAD-to-WAD tuning
- jumbo frame experiments: the quest for bigger and bigger MTUs
- more user-friendly
- Web100 extensions
- refine user interface and API
- port to other OSs
14. Summary
www.net100.org
- Novel approaches
- non-invasive dynamic tuning of legacy applications
- using TCP to tune TCP (Web100)
- tuning on a per-flow/per-destination basis
- Effective evaluation framework
- protocol analysis and tuning; net/app/OS debugging
- out-of-kernel tuning
- Beneficial interactions
- TCP protocols (Floyd, Wu Feng (DRS), Web100, parallel/non-TCP)
- Path characterization research (SciDAC, CAIDA, PingER)
- Scientific application and Data grids (SciDAC, CERN)
- Performance improvements