Title: TCP/IP Masterclass, or "So TCP works, but still the users ask: Where is my throughput?"

1  TCP/IP Masterclass, or "So TCP works, but still the users ask: Where is my throughput?"
Richard Hughes-Jones, The University of Manchester
www.hep.man.ac.uk/rich/ then Talks
2  Layers: IP
3  The Transport Layer 4: TCP
- TCP: RFC 793, RFC 1122. Provides:
  - Connection-orientated service over IP
  - During setup the two ends agree on details
  - Explicit teardown
  - Multiple connections allowed
- Reliable end-to-end byte-stream delivery over an unreliable network
- It takes care of:
  - Lost packets
  - Duplicated packets
  - Out-of-order packets
- TCP provides:
  - Data buffering
  - Flow control
  - Error detection & handling
  - Limits on network congestion
4  The TCP Segment Format
[Header diagram: the fixed TCP header is 20 bytes]
5  TCP Segment Format (cont.)
- Source/Dest port: TCP port numbers that identify the applications at both ends of the connection
- Sequence number: position of the first byte in this segment within the sender's byte stream
- Acknowledgement: identifies the number of the byte the sender of this (ACK) segment expects to receive next
- Code bits: determine the segment's purpose, e.g. SYN, ACK, FIN, URG
- Window: advertises how much data this station is willing to accept; can depend on the buffer space remaining
- Options: used for window scaling, SACK, timestamps, maximum segment size etc.
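The fixed header fields described above can be unpacked directly from the wire format. A minimal sketch with Python's struct module (the sample segment, ports and values are invented for illustration):

```python
import struct

# Unpack the 20-byte fixed TCP header (layout per RFC 793).
def parse_tcp_header(data: bytes) -> dict:
    src, dst, seq, ack, off_flags, window, checksum, urg = struct.unpack(
        "!HHIIHHHH", data[:20])
    return {
        "src_port": src,
        "dst_port": dst,
        "seq": seq,                  # first byte of this segment in the stream
        "ack": ack,                  # next byte the ACK sender expects
        "data_offset": (off_flags >> 12) * 4,   # header length in bytes
        "flags": {name: bool(off_flags & bit) for name, bit in
                  [("URG", 0x20), ("ACK", 0x10), ("PSH", 0x08),
                   ("RST", 0x04), ("SYN", 0x02), ("FIN", 0x01)]},
        "window": window,            # receiver's advertised window
    }

# A hand-built SYN segment: port 40000 -> 80, seq 1000, window 65535
hdr = struct.pack("!HHIIHHHH", 40000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
info = parse_tcp_header(hdr)
```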
6  TCP providing reliability
- Positive acknowledgement (ACK) of each received segment
- Sender keeps a record of each segment sent
- Sender awaits an ACK: "I am ready to receive byte 2048 and beyond"
- Sender starts a timer when it sends a segment, so it can re-transmit on timeout
- Inefficient: the sender has to wait
7  Flow Control: Sender - Congestion Window
- Uses the congestion window, cwnd, a sliding window to control the data flow
- A byte count giving the highest byte that can be sent without an ACK
- Transmit buffer size and advertised receive buffer size are important
- The ACK gives the next sequence number to receive AND the available space in the receive buffer
- A timer is kept for each packet
8  Flow Control: Receiver - Lost Data
- If new data is received with a sequence number ≠ the next byte expected, a duplicate ACK is sent carrying the expected sequence number
9  How it works: TCP Slowstart
- Probe the network to get a rough estimate of the optimal congestion window size
- The larger the window size, the higher the throughput: Throughput ≈ Window size / Round-trip Time
- Exponentially increase the congestion window size until a packet is lost
- cwnd initially 1 MTU, then increased by 1 MTU for each ACK received
- Send 1st packet, get 1 ACK: increase cwnd to 2
- Send 2 packets, get 2 ACKs: increase cwnd to 4
- Time to reach cwnd size W: T_W = RTT × log2(W)  (not exactly slow!)
- Rate doubles each RTT
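The doubling behaviour above is easy to verify with a few lines of Python (a sketch of the growth rule only, ignoring ssthresh and losses):

```python
import math

# Slow start: cwnd starts at 1 MTU and gains 1 MTU per ACK,
# so it doubles every round trip.
def rtts_to_reach(target_cwnd: int) -> int:
    cwnd, rtts = 1, 0
    while cwnd < target_cwnd:
        cwnd += cwnd      # one ACK per outstanding segment -> cwnd doubles
        rtts += 1
    return rtts

# Time to reach window W is therefore about RTT * log2(W)
def time_to_reach(w: int, rtt: float) -> float:
    return rtt * math.log2(w)
```

For example, reaching a 1024-segment window over a 200 ms path takes only about 10 round trips, i.e. 2 seconds.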
10  How it works: TCP Congestion Avoidance
- Additive increase: starting from the rough estimate, linearly increase the congestion window size to probe for additional available bandwidth
- cwnd increased by 1 segment per rtt
- cwnd increased by 1/cwnd for each ACK: linear increase in rate
- TCP takes packet loss as an indication of congestion!
- Multiplicative decrease: cut the congestion window size aggressively if a packet is lost
- Standard TCP reduces cwnd by a factor of 0.5
- The slow start to congestion avoidance transition is determined by ssthresh
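The two rules above can be sketched as a pair of update functions (units are segments; a minimal model, not a full TCP state machine):

```python
# AIMD congestion avoidance:
# a = 1 (additive increase per RTT), b = 0.5 (multiplicative decrease on loss).
def on_ack(cwnd: float, a: float = 1.0) -> float:
    return cwnd + a / cwnd          # +1/cwnd per ACK ~ +1 segment per RTT

def on_loss(cwnd: float, b: float = 0.5) -> float:
    return cwnd * (1 - b)           # standard TCP halves the window

cwnd = 10.0
for _ in range(10):                 # one RTT's worth of ACKs at cwnd ~ 10
    cwnd = on_ack(cwnd)
# after ~10 ACKs the window has grown by about 1 segment
```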
11  TCP Fast Retransmit & Recovery
- Duplicate ACKs are due to lost segments or segments out of order
- Fast Retransmit: if the receiver transmits 3 duplicate ACKs (i.e. it received 3 additional segments without getting the one expected):
  - Sender re-transmits the missing segment
  - Set ssthresh to 0.5 × cwnd, so enter the congestion avoidance phase
  - Set cwnd = 0.5 × cwnd + 3 (the 3 dup ACKs)
  - Increase cwnd by 1 segment for each further duplicate ACK
  - Keep sending new data if allowed by cwnd
  - Set cwnd to half the original value on a new ACK
  - No need to go into slow start again
- At the steady state, cwnd oscillates around the optimal window size
- With a retransmission timeout, slow start is triggered again
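The fast-recovery arithmetic above amounts to a few simple updates (a sketch in segments; real stacks track this per-connection state alongside retransmission):

```python
# Fast retransmit / recovery window arithmetic.
# On the 3rd duplicate ACK: ssthresh = cwnd/2, cwnd = ssthresh + 3;
# each further dup ACK inflates cwnd by 1; a new ACK deflates to ssthresh.
def enter_fast_recovery(cwnd: float):
    ssthresh = cwnd / 2
    return ssthresh, ssthresh + 3    # +3 for the three segments that left

def on_dup_ack(cwnd: float) -> float:
    return cwnd + 1                  # another segment has left the network

def on_new_ack(ssthresh: float) -> float:
    return ssthresh                  # deflate; continue in congestion avoidance

ssthresh, cwnd = enter_fast_recovery(16.0)   # ssthresh 8, cwnd 11
cwnd = on_dup_ack(cwnd)                      # a 4th dup ACK arrives
cwnd = on_new_ack(ssthresh)                  # new ACK: back to 8, no slow start
```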
12  TCP Simple Tuning - Filling the Pipe
- Remember, TCP has to hold a copy of the data in flight
- The optimal (TCP buffer) window size depends on:
  - Bandwidth end to end, i.e. min(BW of the links), AKA bottleneck bandwidth
  - Round Trip Time (RTT)
- The number of bytes in flight to fill the entire path:
  - Bandwidth×Delay Product: BDP = RTT × BW
- Setting the buffers correctly can increase bandwidth by orders of magnitude
- Windows are also used for flow control
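The buffer arithmetic above is worth doing explicitly; a one-function sketch (the 120 ms transatlantic rtt is just an illustrative value):

```python
# To fill the pipe the TCP window must cover the bandwidth-delay product.
def bdp_bytes(bandwidth_bps: float, rtt_s: float) -> float:
    return bandwidth_bps * rtt_s / 8     # bits in flight -> bytes

# 1 Gbit/s with a 120 ms rtt needs a ~15 MByte window -- orders of
# magnitude above a typical default socket buffer (e.g. 64 kBytes).
window = bdp_bytes(1e9, 0.120)
```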
13  Standard TCP (Reno) - What's the problem?
- TCP has 2 phases:
  - Slowstart: probe the network to estimate the available BW; exponential growth
  - Congestion Avoidance: the main data transfer phase; transfer rate grows slowly
- AIMD and High Bandwidth Long Distance networks
  - Poor performance of TCP in high bandwidth wide area networks is due in part to the TCP congestion control algorithm
  - For each ACK in an RTT without loss:
    cwnd → cwnd + a/cwnd   (Additive Increase, a = 1)
  - For each window experiencing loss:
    cwnd → cwnd - b×cwnd   (Multiplicative Decrease, b = 1/2)
- Packet loss is a killer!!
14  TCP (Reno) - Details of problem #1
- Time for TCP to recover its throughput from 1 lost 1500 byte packet is given by: τ ≈ C × RTT² / (2 × MSS)
- For an rtt of 200 ms at 1 Gbit/s: ~28 min
- UK (rtt 6 ms): 1.6 s   Europe (rtt 25 ms): 26 s   USA (rtt 150 ms): ~16 min
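The recovery times above follow directly from the AIMD rules: after one loss, cwnd is halved and regains one MSS per rtt. A small sketch, assuming a 1500 byte MSS:

```python
# Time to recover full rate after a single loss on a clean 1 Gbit/s path.
def recovery_time(link_bps: float, rtt_s: float, mss_bytes: int = 1500) -> float:
    window_pkts = link_bps * rtt_s / (mss_bytes * 8)   # packets in flight
    # cwnd drops to window/2, then grows 1 packet per rtt back to window
    return (window_pkts / 2) * rtt_s

uk = recovery_time(1e9, 0.006)       # ~1.5 s
europe = recovery_time(1e9, 0.025)   # ~26 s
usa = recovery_time(1e9, 0.150)      # ~940 s, i.e. ~16 min
```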
15  Investigation of new TCP Stacks
- The AIMD algorithm - Standard TCP (Reno)
  - For each ACK in an RTT without loss:
    cwnd → cwnd + a/cwnd   (Additive Increase, a = 1)
  - For each window experiencing loss:
    cwnd → cwnd - b×cwnd   (Multiplicative Decrease, b = 1/2)
- High Speed TCP
  - a and b vary depending on the current cwnd, using a table
  - a increases more rapidly with larger cwnd: returns to the optimal cwnd size sooner for the network path
  - b decreases less aggressively and, as a consequence, so does the cwnd; the effect is that there is not such a decrease in throughput
- Scalable TCP
  - a and b are fixed adjustments for the increase and decrease of cwnd
  - a = 1/100: the increase is greater than TCP Reno
  - b = 1/8: the decrease on loss is less than TCP Reno
  - Scalable over any link speed
- Fast TCP
  - Uses round trip time as well as packet loss to indicate congestion, with rapid convergence to a fair equilibrium for throughput
- HSTCP-LP, H-TCP, BiC-TCP
16  Let's check out this theory about new TCP stacks
- Does it matter?
- Does it work?
17  Packet Loss with new TCP Stacks
- TCP Response Function
- Throughput vs loss rate: further to the right means faster recovery
- Drop packets in the kernel
- MB-NG rtt 6 ms; DataTAG rtt 120 ms
18  Packet Loss and new TCP Stacks
- TCP Response Function
- UKLight London-Chicago-London, rtt 177 ms
- 2.6.6 kernel
- Agreement with theory good
- Some new stacks good at high loss rates
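The response-function curves plotted in these slides come from measurement, but their shape for standard TCP is well approximated by the Mathis et al. formula, rate ≈ (MSS/RTT) × sqrt(3/2p). A sketch (the loss rate 10⁻⁵ is an illustrative value, not a measured one):

```python
import math

# Mathis approximation for the TCP Reno response function.
def reno_rate_bps(mss_bytes: int, rtt_s: float, loss_prob: float) -> float:
    return (mss_bytes * 8 / rtt_s) * math.sqrt(1.5 / loss_prob)

# Same loss rate on two of the paths above: longer rtt -> far lower throughput.
mbng = reno_rate_bps(1500, 0.006, 1e-5)      # rtt 6 ms
uklight = reno_rate_bps(1500, 0.177, 1e-5)   # rtt 177 ms
```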
19  High Throughput Demonstrations
[Network diagram: Manchester (rtt 6.2 ms) and London to (Geneva)/(Chicago) (rtt 128 ms); dual Xeon 2.2 GHz PCs at each end; Cisco GSR and Cisco 7609 routers; 1 GEth access links; 2.5 Gbit SDH MB-NG core]
20  High Performance TCP - MB-NG
- Drop 1 in 25,000
- rtt 6.2 ms
- Recover in 1.6 s
- Stacks: Standard, HighSpeed, Scalable
21  High Performance TCP - DataTAG
- Different TCP stacks tested on the DataTAG network
- rtt 128 ms
- Drop 1 in 10^6
- High-Speed: rapid recovery
- Scalable: very fast recovery
- Standard: recovery would take ~20 mins
22  FAST demo via OMNInet and DataTAG
[Network diagram: CERN (Geneva) workstations - 2 x GE - CERN Cisco 7609 - OC-48 DataTAG via Alcatel 1670 pair (7,000 km) - CalTech Cisco 7609 - StarLight-Chicago Nortel Passport 8600 - 10GE OMNInet - NU-E (Leverone) Nortel Passport 8600 - 2 x GE - workstations; FAST display in San Diego]
FAST Demo: Cheng Jin, David Wei (Caltech); A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN)
23  FAST TCP vs newReno
24  Problem #2: Is TCP fair?
- Look at: Round Trip Times & Max Transfer Unit
25  MTU and Fairness
- Two TCP streams share a 1 Gb/s bottleneck
- RTT = 117 ms
- MTU = 3000 bytes: avg. throughput over a period of 7000 s = 243 Mb/s
- MTU = 9000 bytes: avg. throughput over a period of 7000 s = 464 Mb/s
- Link utilization: 70.7%
Sylvain Ravot, DataTag 2003
26  RTT and Fairness
- Two TCP streams share a 1 Gb/s bottleneck
- CERN <-> Sunnyvale, RTT = 181 ms: avg. throughput over a period of 7000 s = 202 Mb/s
- CERN <-> Starlight, RTT = 117 ms: avg. throughput over a period of 7000 s = 514 Mb/s
- MTU = 9000 bytes
- Link utilization: 71.6%
Sylvain Ravot, DataTag 2003
27  Problem #n: Do TCP Flows Share the Bandwidth?
28  Test of TCP Sharing: Methodology (1 Gbit/s)
- Chose 3 paths from SLAC (California):
  - Caltech (10 ms), Univ Florida (80 ms), CERN (180 ms)
- Used iperf/TCP and UDT/UDP to generate traffic
- Each run was 16 minutes, in 7 regions
Les Cottrell, PFLDnet 2005
29  TCP Reno - single stream
Les Cottrell, PFLDnet 2005
- Low performance on fast long-distance paths
- AIMD (add a = 1 pkt to cwnd per RTT; decrease cwnd by factor b = 0.5 on congestion)
- Net effect: recovers slowly, does not effectively use the available bandwidth, so poor throughput
- Unequal sharing
SLAC to CERN
30  FAST
- As well as packet loss, FAST uses RTT to detect congestion
- RTT is very stable: σ(RTT) ~9 ms vs 370.14 ms for the others
31  Hamilton TCP
- One of the best performers:
  - Throughput is high
  - Big effects on RTT when it achieves best throughput
  - Flows share equally
32  Problem #n+1: To SACK or not to SACK?
33  The SACK Algorithm
- SACK rationale:
  - Non-contiguous blocks of data can be ACKed
  - Sender retransmits just the lost packets
  - Helps when multiple packets are lost in one TCP window
- The SACK processing is inefficient for large bandwidth-delay products:
  - The sender write queue (a linked list) is walked for each SACK block:
    - To mark lost packets
    - To re-transmit
  - Processing takes so long that the input queue becomes full
  - Get timeouts
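The cost described above scales with both the number of SACK blocks and the number of packets in flight. A toy model (the counts are illustrative; real stacks walk a real retransmit queue, not a counter):

```python
# Each SACK block forces a walk of the sender's write queue, so the work per
# ACK grows with the window -- painful at large bandwidth-delay products.
def sack_walk_cost(packets_in_flight: int, sack_blocks: int) -> int:
    steps = 0
    for _ in range(sack_blocks):        # for each SACK block in the ACK
        steps += packets_in_flight      # walk the whole queue to mark/retransmit
    return steps

short_path = sack_walk_cost(100, 3)      # small window: cheap
long_fat_path = sack_walk_cost(12500, 3) # ~1 Gbit/s x 150 ms: 125x the work
```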
34  SACK
- Look into what's happening at the algorithmic level with web100
- Strange hiccups in cwnd: the only correlation is with SACK arrivals
Scalable TCP on MB-NG with 200 Mbit/s CBR background. Yee-Ting Li
35  Real Applications on Real Networks
- Disk-to-disk applications on real networks
  - Memory-to-memory tests
  - Transatlantic disk-to-disk at Gigabit speeds
- Remote Computing Farms
  - The effect of TCP
  - The effect of distance
- Radio Astronomy e-VLBI
  - Left for Ralph's talk
36  iperf Throughput + Web100
- SuperMicro on MB-NG network
- HighSpeed TCP
- Line speed 940 Mbit/s
- DupACKs < 10 (expect ~400)
37  Applications: Throughput Mbit/s
- HighSpeed TCP
- 2 GByte file, RAID5
- SuperMicro + SuperJANET
- bbcp
- bbftp
- Apache
- GridFTP
- Previous work used RAID0 (not disk limited)
38  bbftp: What else is going on?
- Scalable TCP
- SuperMicro + SuperJANET
- Instantaneous rate 0 - 550 Mbit/s
- Congestion window & duplicate ACKs
- Throughput variation not TCP related?
  - Disk speed / bus transfer
  - Application architecture
39  Transatlantic Disk to Disk Transfers with UKLight - SuperComputing 2004
40  SC2004 UKLIGHT Overview
[Network diagram: SC2004 SLAC booth (Cisco 6509) and Caltech booth (UltraLight IP, Caltech 7600) connected via NLR Lambda NLR-PITT-STAR-10GE-16 and Chicago Starlight to UKLight 10G (four 1GE channels), through ULCC UKLight to Manchester (MB-NG 7600 OSR) and the UCL network / UCL HEP; Surfnet/EuroLink 10G carries two 1GE channels]
41  Transatlantic Ethernet: TCP Throughput Tests
- Supermicro X5DPE-G2 PCs
- Dual 2.9 GHz Xeon CPU, FSB 533 MHz
- 1500 byte MTU
- 2.6.6 Linux kernel
- Memory-to-memory TCP throughput
- Standard TCP
- Wire-rate throughput of 940 Mbit/s
- First 10 sec
- Work in progress to study:
  - Implementation detail
  - Advanced stacks
  - Effect of packet loss
42  SC2004 Disk-Disk bbftp
- bbftp file transfer program uses TCP/IP
- UKLight path: London-Chicago-London; PCs: Supermicro + 3Ware RAID0
- MTU 1500 bytes, socket size 22 Mbytes, rtt 177 ms, SACK off
- Move a 2 GByte file
- Web100 plots
- Standard TCP: average 825 Mbit/s (bbcp: 670 Mbit/s)
- Scalable TCP: average 875 Mbit/s (bbcp: 701 Mbit/s, ~4.5 s of overhead)
- Disk-TCP-Disk at 1 Gbit/s
43  Network & Disk Interactions (work in progress)
- Hosts:
  - Supermicro X5DPE-G2 motherboards
  - Dual 2.8 GHz Xeon CPUs with 512 kbyte cache and 1 Mbyte memory
  - 3Ware 8506-8 controller on a 133 MHz PCI-X bus, configured as RAID0
  - Six 74.3 GByte Western Digital Raptor WD740 SATA disks, 64 kbyte stripe size
- Measure memory to RAID0 transfer rates with & without UDP traffic
44  Remote Computing Farms in the ATLAS TDAQ Experiment
45  ATLAS Remote Farms - Network Connectivity
46  ATLAS Application Protocol
- Event Request:
  - EFD requests an event from SFI
  - SFI replies with the event (~2 Mbytes)
- Processing of event
- Return of computation:
  - EF asks SFO for buffer space
  - SFO sends OK
  - EF transfers the results of the computation
- tcpmon: an instrumented TCP request-response program that emulates the Event Filter EFD to SFI communication
47  tcpmon: TCP Activity - Manc-CERN Req-Resp
- Round trip time 20 ms
- 64 byte request (green), 1 Mbyte response (blue)
- TCP in slow start
- 1st event takes 19 rtt or ~380 ms
48  tcpmon: TCP Activity - Manc-CERN Req-Resp, TCP stack tuned
- Round trip time 20 ms
- 64 byte request (green), 1 Mbyte response (blue)
- TCP starts in slow start
- 1st event takes 19 rtt or ~380 ms
- TCP congestion window grows nicely
- Response takes 2 rtt after ~1.5 s
- Rate 10/s (with 50 ms wait)
- Transfer achievable throughput grows to 800 Mbit/s
- Data is transferred WHEN the application requires it
49  tcpmon: TCP Activity - Alberta-CERN Req-Resp, TCP stack tuned
- Round trip time 150 ms
- 64 byte request (green), 1 Mbyte response (blue)
- TCP starts in slow start
- 1st event takes 11 rtt or ~1.67 s
- TCP congestion window in slow start to ~1.8 s, then congestion avoidance
- Response in 2 rtt after ~2.5 s
- Rate 2.2/s (with 50 ms wait)
- Transfer achievable throughput grows slowly from 250 to 800 Mbit/s
50  Summary & Conclusions
- Standard TCP is not optimum for high throughput, long distance links
- Packet loss is a killer for TCP
  - Check on campus links & equipment, and access links to backbones
  - Users need to collaborate with the Campus Network Teams
  - Dante PERT
- New stacks are stable and give better response & performance
  - Still need to set the TCP buffer sizes!
  - Check other kernel settings, e.g. window-scale maximum
  - Watch for TCP stack implementation enhancements
- TCP tries to be fair
  - Large MTU has an advantage
  - Short distances (small RTT) have an advantage
- TCP does not share bandwidth well with other streams
- The end hosts themselves matter
51  More Information - Some URLs (1)
- UKLight web site: http://www.uklight.ac.uk
- MB-NG project web site: http://www.mb-ng.net/
- DataTAG project web site: http://www.datatag.org/
- UDPmon / TCPmon kit + writeup: http://www.hep.man.ac.uk/rich/net
- Motherboard and NIC tests:
  - http://www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt
  - http://datatag.web.cern.ch/datatag/pfldnet2003/
  - "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality Motherboards", FGCS Special Issue 2004
  - http://www.hep.man.ac.uk/rich/
- TCP tuning information may be found at:
  - http://www.ncne.nlanr.net/documentation/faq/performance.html
  - http://www.psc.edu/networking/perf_tune.html
- TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast Long-Distance Production Networks", Journal of Grid Computing 2004
- PFLDnet: http://www.ens-lyon.fr/LIP/RESO/pfldnet2005/
- Dante PERT: http://www.geant2.net/server/show/nav.00d00h002
52  More Information - Some URLs (2)
- Lectures, tutorials etc. on TCP/IP:
  - www.nv.cc.va.us/home/joney/tcp_ip.htm
  - www.cs.pdx.edu/jrb/tcpip.lectures.html
  - www.raleigh.ibm.com/cgi-bin/bookmgr/BOOKS/EZ306200/CCONTENTS
  - www.cisco.com/univercd/cc/td/doc/product/iaabu/centri4/user/scf4ap1.htm
  - www.cis.ohio-state.edu/htbin/rfc/rfc1180.html
  - www.jbmelectronics.com/tcp.htm
- Encyclopaedia:
  - http://www.freesoft.org/CIE/index.htm
- TCP/IP Resources:
  - www.private.org.il/tcpip_rl.html
- Understanding IP addresses:
  - http://www.3com.com/solutions/en_US/ncs/501302.html
- Configuring TCP (RFC 1122):
  - ftp://nic.merit.edu/internet/documents/rfc/rfc1122.txt
- Assigned protocols, ports etc. (RFC 1010):
  - http://www.es.net/pub/rfcs/rfc1010.txt and /etc/protocols
55  Latency Measurements
- UDP/IP packets sent between back-to-back systems
  - Processed in a similar manner to TCP/IP
  - Not subject to flow control & congestion avoidance algorithms
  - Used the UDPmon test program
- Latency:
  - Round trip times measured using request-response UDP frames
  - Latency as a function of frame size
  - Slope is given by the sum of the per-byte times: mem-mem copy(s) + PCI + Gig Ethernet + PCI + mem-mem copy(s)
  - Intercept indicates processing times + HW latencies
- Histograms of singleton measurements
- Tells us about:
  - Behavior of the IP stack
  - The way the HW operates
56  Throughput Measurements
- UDP Throughput
- Send a controlled stream of UDP frames spaced at
regular intervals
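A udpmon-style paced sender can be sketched in a few lines. This is only an illustration of the pacing idea, not the UDPmon tool itself; the destination host/port are placeholders, and a busy-wait is used because sleep() is too coarse for microsecond spacing:

```python
import socket, time

# Send a controlled stream of UDP packets spaced at a fixed interval.
def send_paced_stream(dest, n_packets, packet_len, spacing_s):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    payload = bytes(packet_len)
    start = time.perf_counter()
    for i in range(n_packets):
        sock.sendto(payload, dest)
        # spin until the next transmit slot
        while time.perf_counter() - start < (i + 1) * spacing_s:
            pass
    sock.close()
    return time.perf_counter() - start   # total elapsed time

# e.g. send_paced_stream(("127.0.0.1", 9000), 1000, 1472, 100e-6)
```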
57  PCI Bus & Gigabit Ethernet Activity
- PCI activity measured with a logic analyzer:
  - PCI probe cards in the sending PC
  - Gigabit Ethernet fiber probe card
  - PCI probe cards in the receiving PC
58  Server Quality Motherboards
- SuperMicro P4DP8-2G (P4DP6)
  - Dual Xeon
  - 400/533 MHz front side bus
  - 6 PCI & PCI-X slots
  - 4 independent PCI buses:
    - 64 bit 66 MHz PCI
    - 100 MHz PCI-X
    - 133 MHz PCI-X
  - Dual Gigabit Ethernet
  - Adaptec AIC-7899W dual channel SCSI
  - UDMA/100 bus master/EIDE channels
    - data transfer rates of 100 MB/sec burst
59  Server Quality Motherboards
- Boston/Supermicro H8DAR
  - Two Dual Core Opterons
  - 200 MHz DDR memory
    - Theoretical BW: 6.4 Gbit/s
  - HyperTransport
  - 2 independent PCI buses
    - 133 MHz PCI-X
  - 2 Gigabit Ethernet
  - SATA
  - (PCI-e)
60  Network switch limits behaviour
- End-to-end UDP packets from udpmon
- Only 700 Mbit/s throughput
- Lots of packet loss
- Packet loss distribution shows throughput limited
61  10 Gigabit Ethernet: UDP Throughput
- 1500 byte MTU gives ~2 Gbit/s
- Used 16144 byte MTU, max user length 16080
- DataTAG Supermicro PCs
  - Dual 2.2 GHz Xeon CPU, FSB 400 MHz
  - PCI-X mmrbc 512 bytes
  - Wire-rate throughput of 2.9 Gbit/s
- CERN OpenLab HP Itanium PCs
  - Dual 1.0 GHz 64 bit Itanium CPU, FSB 400 MHz
  - PCI-X mmrbc 4096 bytes
  - Wire rate of 5.7 Gbit/s
- SLAC Dell PCs
  - Dual 3.0 GHz Xeon CPU, FSB 533 MHz
  - PCI-X mmrbc 4096 bytes
  - Wire rate of 5.4 Gbit/s
62  10 Gigabit Ethernet: Tuning PCI-X
- 16080 byte packets every 200 µs
- Intel PRO/10GbE LR adapter
- PCI-X bus occupancy vs mmrbc
- Measured times
- Times based on PCI-X times from the logic analyser
- Expected throughput ~7 Gbit/s
- Measured 5.7 Gbit/s
63  UDP Datagram Format
[Diagram: Frame header | IP header | UDP header (8 bytes) | Application data | FCS]
- Source/destination port: port numbers identify the sending & receiving processes
  - Port number + IP address allow any application on the Internet to be uniquely identified
- Ports can be static or dynamic
  - Static (< 1024): assigned centrally, known as well-known ports
  - Dynamic
- Message length: in bytes, includes the UDP header and data (min 8, max 65,535)
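The 8-byte header described above is simple enough to unpack by hand; a sketch with Python's struct module (the sample ports and payload are invented):

```python
import struct

# Unpack the 8-byte UDP header: src port, dst port, length, checksum.
def parse_udp_header(data: bytes) -> dict:
    src, dst, length, checksum = struct.unpack("!HHHH", data[:8])
    return {"src_port": src, "dst_port": dst,
            "length": length,        # header + data, in bytes
            "checksum": checksum}

payload = b"hello"
# Length field covers the 8-byte header plus the data.
hdr = struct.pack("!HHHH", 5001, 53, 8 + len(payload), 0)
info = parse_udp_header(hdr + payload)
```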
64  Congestion Control: ACK clocking
65  End Hosts & NICs: CERN-nat-Manc.
Throughput, Packet Loss, Re-Order, Request-Response Latency
- Use UDP packets to characterise the host, NIC & network
- SuperMicro P4DP8 motherboard
  - Dual Xeon 2.2 GHz CPU
  - 400 MHz system bus
  - 64 bit 66 MHz PCI / 133 MHz PCI-X bus
- The network can sustain 1 Gbps of UDP traffic
- The average server can lose smaller packets
- Packet loss caused by lack of power in the PC receiving the traffic
- Out-of-order packets due to WAN routers
- Lightpaths look like extended LANs: have no re-ordering
66  tcpdump / tcptrace
- tcpdump: dump all TCP header information for a specified source/destination
  - ftp://ftp.ee.lbl.gov/
- tcptrace: format tcpdump output for analysis using xplot
  - http://www.tcptrace.org/
- NLANR TCP Testrig: a nice wrapper for the tcpdump and tcptrace tools
  - http://www.ncne.nlanr.net/TCP/testrig/
- Sample use:
  - tcpdump -s 100 -w /tmp/tcpdump.out host hostname
  - tcptrace -Sl /tmp/tcpdump.out
  - xplot /tmp/a2b_tsg.xpl
67  tcptrace and xplot
- X axis is time
- Y axis is sequence number
- The slope of this curve gives the throughput over time
- The xplot tool makes it easy to zoom in
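The slope computation above is just a linear fit of sequence number against time. A sketch using a simple least-squares gradient (the sample points are synthetic, standing in for real tcptrace output):

```python
# Throughput is the gradient of the sequence-number-vs-time curve.
def throughput_bps(samples):
    """samples: list of (time_s, seq_bytes) pairs from a tcptrace-style plot."""
    n = len(samples)
    mt = sum(t for t, _ in samples) / n
    ms = sum(s for _, s in samples) / n
    num = sum((t - mt) * (s - ms) for t, s in samples)
    den = sum((t - mt) ** 2 for t, _ in samples)
    return (num / den) * 8            # bytes/s -> bits/s

# Sequence space advancing 100 kBytes every 100 ms -> 1 MByte/s -> 8 Mbit/s
pts = [(i * 0.1, i * 100_000) for i in range(11)]
rate = throughput_bps(pts)
```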
68  Zoomed In View
- Green line: ACK values received from the receiver
- Yellow line: tracks the receive window advertised by the receiver
- Green ticks: track the duplicate ACKs received
- Yellow ticks: track the window advertisements that were the same as the last advertisement
- White arrows: represent segments sent
- Red arrows (R): represent retransmitted segments
69  TCP Slow Start