Title: AOL Visit to Caltech
1 - AOL Visit to Caltech
- Discussion of Advanced Networking
- Tuesday, September 23, 2003
- 10:00 AM - 12:30 PM
- 248 Lauritsen
- ravot@caltech.edu
2 - Agenda
- Overview of LHCnet
- High TCP performance over wide area networks
- Problem Statement
- Fairness
- Solutions
- Awards
- Internet2 land speed record
- Fastest and biggest in the West (CENIC award)
- IPv6 Internet2 land speed record
- Demos at Telecom World 2003, SC2003, WSIS
3 - LHCnet (I)
- CERN - US production traffic
- A test-bed to experiment with massive file transfers across the Atlantic
- Provide high-performance protocols for gigabit networks underlying data-intensive Grids
- Guarantee interoperability between several major Grid projects in Europe and the USA
(Diagrams: Feb. 2003 and Sept. 2003 setups)
4 - LHCnet (II)
New setup
- Unique multi-platform / multi-technology optical transatlantic test-bed
- Layer-2 and layer-3 capabilities
- Cisco, Juniper, Alcatel, Extreme Networks, Procket
- Powerful Linux farms
- Native IPv6, QoS, LBE
- New levels of service: MPLS, GMPLS
- Get hands-on experience with the operation of gigabit networks
- Stability and reliability of hardware and software
- Interoperability
5 - LHCnet peering and optical connectivity
- Excellent relationships and connectivity with research and academic networks
- UCAID, CENIC and NLR in particular
- Extension of the LHCnet to Sunnyvale during SC2002
- Cisco and Level(3) loan
- Internet2 Land Speed Record
- 22 TeraBytes transferred in 6 hours from Baltimore to Sunnyvale
- The optical triangle will be extended to the UK, forming an optical quadrangle
6 - High TCP performance over wide area networks
7 - Problem Statement
- End-user's perspective
- Using TCP as the data-transport protocol for Grids leads to poor bandwidth utilization in high-speed WANs
- Network protocol designer's perspective
- TCP is inefficient in high bandwidth-delay product networks
- TCP's congestion control algorithm (AIMD) is not suited to gigabit networks (see the sketch after this list)
- When the window size is 1, a loss-free RTT gives a 100% increase in window size
- When the window size is 1000, it gives only a 0.1% increase in window size
- Due to TCP's limited feedback mechanisms, line errors are interpreted as congestion
- RFC 2581 (which gives the formula for increasing cwnd) forgot delayed ACKs
- The future performance of computational grids looks bad if we continue to rely on the widely-deployed TCP Reno
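A minimal sketch (mine, not from the slides) of the AIMD point above: in congestion avoidance Reno adds one MSS per loss-free RTT, so the relative growth is 1/cwnd, which is where the 100% and 0.1% figures come from.

# Sketch: relative per-RTT growth of a Reno flow in congestion avoidance,
# where cwnd (counted in segments) grows by one segment per loss-free RTT.
def relative_increase_percent(cwnd_segments: int) -> float:
    """Percentage growth of the congestion window over one loss-free RTT."""
    return 100.0 / cwnd_segments

for cwnd in (1, 10, 100, 1000, 10000):
    print(f"cwnd = {cwnd:5d} segments -> +{relative_increase_percent(cwnd):.3f}% per RTT")
# cwnd = 1 gives +100% per RTT; cwnd = 1000 gives only +0.1% per RTT.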
8 - Single TCP stream performance under periodic losses
MSS = 1,500 Bytes, C = 1.22
- Loss rate = 0.01%
- LAN BW utilization: 99%
- WAN BW utilization: 1.2% (reproduced in the model sketch below)
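These utilization figures can be reproduced with the Mathis et al. steady-state model, throughput <= (MSS / RTT) * C / sqrt(p) with C = 1.22. The sketch below assumes the 1 Gb/s capacity and ~120 ms transatlantic RTT quoted on the next slide, plus a LAN RTT of roughly 1 ms (my assumption).

# Sketch: Mathis et al. steady-state TCP throughput model,
#   throughput <= (MSS / RTT) * C / sqrt(p),  with C ~= 1.22.
# Assumed parameters: 1 Gb/s link, LAN RTT ~= 1 ms, WAN RTT = 120 ms.
from math import sqrt

C_CONST = 1.22          # Mathis formula constant
MSS_BITS = 1500 * 8     # MSS = 1,500 bytes
LINK_BPS = 1e9          # 1 Gb/s bottleneck
LOSS_RATE = 1e-4        # 0.01% periodic loss

def utilization(rtt_s: float) -> float:
    throughput = (MSS_BITS / rtt_s) * C_CONST / sqrt(LOSS_RATE)
    return min(throughput, LINK_BPS) / LINK_BPS * 100.0

print(f"LAN (RTT ~ 1 ms):   {utilization(0.001):.0f}% utilization")   # link-limited, ~99-100%
print(f"WAN (RTT = 120 ms): {utilization(0.120):.1f}% utilization")   # ~1.2%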
9 - Single TCP stream
TCP connection between Geneva and Chicago: C = 1 Gbit/s, MSS = 1,460 Bytes, RTT = 120 ms
10 - Responsiveness (I)
- The responsiveness r measures how quickly we go back to using the network link at full capacity after experiencing a loss, if we assume that the congestion window size is equal to the bandwidth-delay product when the packet is lost:

  r = (C . RTT^2) / (2 . MSS)

  where C is the capacity of the link.
11 - Responsiveness (II)
The Linux kernel 2.4.x implements delayed acknowledgments. Due to delayed acknowledgments, the responsiveness is multiplied by two. Therefore, the values above have to be multiplied by two! (A worked example follows.)
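A worked example of the responsiveness formula, using the Geneva - Chicago parameters from the earlier slide (C = 1 Gbit/s, RTT = 120 ms, MSS = 1,460 bytes):

# Sketch: responsiveness r = C * RTT^2 / (2 * MSS), i.e. the time Reno needs
# to grow cwnd back to the bandwidth-delay product after a single loss.
C_BPS = 1e9            # link capacity, 1 Gbit/s
RTT_S = 0.120          # round-trip time, 120 ms
MSS_BITS = 1460 * 8    # MSS = 1,460 bytes

r = C_BPS * RTT_S ** 2 / (2 * MSS_BITS)
print(f"responsiveness: {r:.0f} s (~{r / 60:.0f} min)")           # ~616 s, about 10 minutes
print(f"with delayed ACKs (Linux 2.4.x): ~{2 * r / 60:.0f} min")  # about 20 minutes

A single loss therefore costs on the order of ten minutes of ramp-up on this path, which is why WAN utilization collapses.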
12 - Measurements with Different MTUs
13 - MTU and Fairness
(Diagram: Host 1 and Host 2 at CERN (GVA) and at Starlight (Chi), each attached at 1 GE to a GbE switch, with a POS 2.5 Gbps transatlantic link; the 1 GE attachment is the bottleneck.)
- Two TCP streams share a 1 Gb/s bottleneck
- RTT = 117 ms
- MTU = 1,500 Bytes: avg. throughput over a period of 4000 s = 50 Mb/s
- MTU = 9,000 Bytes: avg. throughput over a period of 4000 s = 698 Mb/s
- A factor of 14!
14 - RTT and Fairness
(Diagram: hosts at CERN (GVA), Starlight (Chi) and Sunnyvale attached at 1 GE / 10GE to a GbE switch; POS 2.5 Gb/s and POS 10 Gb/s wide-area links; the shared 1 GE link is the bottleneck.)
- Two TCP streams share a 1 Gb/s bottleneck
- CERN <-> Sunnyvale: RTT = 181 ms, avg. throughput over a period of 7000 s = 202 Mb/s
- CERN <-> Starlight: RTT = 117 ms, avg. throughput over a period of 7000 s = 514 Mb/s
- MTU = 9,000 bytes
- Link utilization = 71.6%
(A toy AIMD model of the MTU and RTT biases follows.)
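A toy model (my sketch, not from the slides) of why both biases appear: two Reno-like flows share a bottleneck, each adds one MSS per RTT and both halve their windows on a shared loss. The flow with the larger MSS or the shorter RTT regrows faster and takes the larger share. The parameters mirror the two experiments above; this idealized synchronized-loss model only shows the direction of the bias, not the exact measured factors.

# Toy sketch: two AIMD (Reno-like) flows sharing a 1 Gb/s bottleneck with
# synchronized losses. Each flow adds one MSS per RTT; both halve cwnd when
# the aggregate rate exceeds the link capacity.
LINK_BPS = 1e9

def share(flows, duration_s=4000.0, dt=0.01):
    """flows: list of (mss_bytes, rtt_s). Returns the average rate of each flow in b/s."""
    cwnd = [mss * 8.0 for mss, _ in flows]          # start at one segment (in bits)
    totals = [0.0 for _ in flows]
    t = 0.0
    while t < duration_s:
        rates = [w / rtt for w, (_, rtt) in zip(cwnd, flows)]
        if sum(rates) > LINK_BPS:                   # shared congestion event
            cwnd = [w / 2.0 for w in cwnd]
        else:
            for i, (mss, rtt) in enumerate(flows):
                totals[i] += rates[i] * dt
                cwnd[i] += mss * 8.0 * (dt / rtt)   # +1 MSS per RTT
        t += dt
    return [tot / duration_s for tot in totals]

mtu_case = share([(1500, 0.117), (9000, 0.117)])            # MTU bias (slide 13 setup)
rtt_case = share([(9000, 0.117), (9000, 0.181)], 7000.0)    # RTT bias (slide 14 setup)
print("MTU 1500 vs 9000:  ", [f"{r / 1e6:.0f} Mb/s" for r in mtu_case])
print("RTT 117 vs 181 ms: ", [f"{r / 1e6:.0f} Mb/s" for r in rtt_case])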
15 - Why does TCP perform better in a LAN?
- Better reactivity (see previous slides)
- Buffering capacity
(Diagram: cwnd oscillates between W/2 and W around the BDP, with the buffering capacity above the BDP; Area 1 where cwnd < BDP, Area 2 where cwnd > BDP; the corresponding RTT plot grows in Area 2.)
- Area 1
- cwnd < BDP => throughput < bandwidth
- RTT constant
- Throughput = cwnd / RTT
- Area 2
- cwnd > BDP => throughput = bandwidth
- RTT increases (proportionally to cwnd)
(Numeric sketch of the two regimes below.)
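A small numeric sketch of the two regimes (round numbers assumed, not taken from a specific slide): below the BDP the throughput is cwnd/RTT and the RTT stays flat; above it the link saturates and the excess window only inflates the RTT.

# Sketch: throughput and RTT as a function of cwnd around the bandwidth-delay
# product, for a single bottleneck with buffering. Area 1: cwnd < BDP;
# Area 2: cwnd > BDP (the excess sits in buffers).
C_BPS = 1e9           # bottleneck capacity, 1 Gb/s (assumed)
BASE_RTT = 0.120      # propagation RTT, 120 ms (assumed)
BDP_BITS = C_BPS * BASE_RTT

def state(cwnd_bits: float):
    if cwnd_bits <= BDP_BITS:                      # Area 1: link not yet full
        return cwnd_bits / BASE_RTT, BASE_RTT      # throughput = cwnd/RTT, RTT constant
    queue = cwnd_bits - BDP_BITS                   # Area 2: excess queues in buffers
    return C_BPS, BASE_RTT + queue / C_BPS         # throughput = C, RTT inflates

for frac in (0.25, 0.5, 1.0, 1.5):
    thr, rtt = state(frac * BDP_BITS)
    print(f"cwnd = {frac:4.2f} x BDP: throughput = {thr / 1e6:6.0f} Mb/s, RTT = {rtt * 1e3:.0f} ms")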
16 - Why does TCP perform better in a LAN?
(Same diagram and Area 1 / Area 2 analysis as the previous slide.)
17 - Solution?
- GRID DT (see the sketch after this list)
- Increase TCP responsiveness
- Higher additive increase
- Smaller backoff
- Reduce the strong penalty imposed by a loss
- Better fairness
- between flows with different RTT
- between flows with different MTU (virtual increase of the MTU)
- FAST TCP
- Uses end-to-end delay and loss
- Achieves any desired fairness, expressed by a utility function
- Very high utilization (99% in theory)
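A rough sketch of the Grid DT idea as presented here. Grid DT is a Linux kernel modification and its exact parameters are not given on this slide, so the scaling rule below (additive increment proportional to RTT squared and to a "virtual MTU", plus a gentler backoff) is an illustrative assumption based on the stated goals, not the actual patch.

# Hedged sketch of a Grid DT-style tunable AIMD. The RTT^2 scaling follows the
# fairness derivation on the "Effect of the RTT" slide; the virtual MTU lets a
# standard-MTU flow grow as fast as a jumbo-frame one. All values are
# illustrative assumptions, not the real Grid DT parameters.
REF_RTT = 0.010        # reference RTT the increment is normalized to (assumed)
VIRTUAL_MSS = 9000     # "virtual MTU" in bytes: grow as if using jumbo frames

def tuned_increment(rtt_s: float) -> float:
    """Bytes added to cwnd per RTT, instead of Reno's fixed one real MSS."""
    return VIRTUAL_MSS * (rtt_s / REF_RTT) ** 2

def on_loss(cwnd_bytes: float, backoff: float = 0.875) -> float:
    """Smaller backoff than Reno's 0.5 reduces the penalty of a single loss."""
    return cwnd_bytes * backoff

# A 117 ms, 1500-byte-MTU flow now grows per RTT as if it were a short-RTT
# jumbo-frame flow, compensating both the RTT and the MTU bias.
print(f"{tuned_increment(0.117):.0f} bytes added to cwnd per RTT")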
18 - Internet2 / CENIC Awards
- Current Internet2 Land Speed Record, IPv4 class
- On Feb. 27, a Terabyte of data was transferred in 3,700 seconds between the Level(3) PoP in Sunnyvale, near SLAC, and CERN, from memory to memory, as a single TCP/IP stream at an average rate of 2.38 Gbps. This beat the former record by a factor of 2.5 and used the US-CERN link at 99% efficiency.
- Current Internet2 Land Speed Record, IPv6 class
- On May 2, Caltech and CERN set a new Internet2 Land Speed Record using next-generation Internet Protocol (IPv6) by achieving 983 megabits per second with a single IPv6 stream for more than an hour across a distance of 7,067 kilometers (more than 4,000 miles) from Geneva, Switzerland to Chicago.
- CENIC award
- The Biggest, Fastest in the West Award honors the fastest and most scalable high-performance networking application/technology.
One Terabyte of data transferred in less than an hour
Geneva - Sunnyvale: 10,037 km
19 - Single stream TCP performance
20 - Telecom World 2003 / Internet2 Fall Members' meeting
- SC2003
- World Summit on the Information Society
- Caltech, CERN/DataTAG, Internet2, CENIC, Starlight, Cisco, Intel, Level(3)
21 - LHCnet: Geneva - Los Angeles 10 Gbps path
22 - LHCnet: Telecom World 2003 / Internet2 Fall Members' meeting
23 - LHCnet: SC2003
24 - Conclusion
- Transcontinental testbed: Geneva - Chicago - Los Angeles
- The future performance of computational grids looks bad if we continue to rely on the widely-deployed TCP Reno
- Grid DT
- Virtual MTU
- RTT bias correction
- Achieve multi-stream performance with a single stream
- How to define fairness?
- Taking into account the MTU
- Taking into account the RTT
- Larger packet size (Jumbogram: payload larger than 64 KB)
- Is the standard MTU the largest bottleneck?
- J. Cain (Cisco): "It's very difficult to build switches to switch large packets such as jumbograms"
- Our vision of the network
- The network, once viewed as an obstacle for virtual collaborations and distributed computing in grids, can now start to be viewed as a catalyst instead. Grid nodes distributed around the world will simply become depots for dropping off information for computation or storage, and the network will become the fundamental fabric for tomorrow's computational grids and virtual supercomputers.
25
26 - UltraLight
- Integrated packet-switched and circuit-switched hybrid experimental research network
- 10 GE backbone across the US, (G)MPLS, PHY-TAG, larger MTU
- End-to-end monitoring
- Dynamic bandwidth provisioning
- Agent-based services spanning all layers of the system, from the optical cross-connects to the applications
- Three flagship application areas
- Particle physics experiments exploring the frontiers of matter and spacetime (LHC)
- Astrophysics projects studying the most distant objects and the early universe (e-VLBI)
- Medical teams distributing high-resolution real-time images
27 - FAST TCP
- Equilibrium properties
- Uses end-to-end delay and loss (window update sketched below)
- Achieves any desired fairness, expressed by a utility function
- Very high utilization (99% in theory)
- Stability properties
- Stability for arbitrary delay, capacity, routing, load
- Robust to heterogeneity, evolution, ...
- Good performance
- Negligible queueing delay and loss
- Fast response
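For reference, the window update published for FAST TCP (Jin, Wei and Low) moves the window toward (baseRTT/RTT) * w + alpha, so each flow aims to keep about alpha packets queued in the path. The gamma and alpha values below are illustrative assumptions, not tuned settings.

# Sketch of the FAST TCP window update as described in the FAST TCP papers.
GAMMA = 0.5      # smoothing factor in (0, 1]  (illustrative)
ALPHA = 200      # target number of packets buffered in the path per flow (illustrative)

def fast_update(w: float, base_rtt: float, rtt: float) -> float:
    """One periodic FAST window update (w in packets)."""
    target = (base_rtt / rtt) * w + ALPHA
    return min(2.0 * w, (1.0 - GAMMA) * w + GAMMA * target)

# Example: while rtt == base_rtt (no queueing delay) the window keeps growing;
# once queueing delay appears, the update heads toward an equilibrium.
w, base_rtt = 100.0, 0.120
for rtt in (0.120, 0.120, 0.125, 0.130, 0.130, 0.130):
    w = fast_update(w, base_rtt, rtt)
    print(f"RTT = {rtt * 1e3:.0f} ms -> w = {w:.0f} packets")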
28 - FAST TCP vs Reno
(Plots comparing the two protocols; annotations: utilization 90% and utilization 70%.)
29 - FAST demo via OMNInet and DataTAG
(Diagram: FAST demo setup. Workstations and a FAST display at NU-E (Leverone) and in San Diego, attached by 2 x GE to Nortel Passport 8600 switches on OMNInet; 10GE to StarLight, Chicago; the OC-48 DataTAG circuit between the CalTech Cisco 7609 and the CERN Cisco 7609 via a pair of Alcatel 1670s; workstations at CERN, Geneva on 2 x GE; about 7,000 km end to end.)
FAST demo: Cheng Jin, David Wei (Caltech)
Credits: A. Adriaanse, C. Jin, D. Wei (Caltech); J. Mambretti, F. Yeh (Northwestern); S. Ravot (Caltech/CERN)
30 - Effect of the RTT on the fairness
- Objective: improve fairness between two TCP streams with different RTTs and the same MTU
- We can adapt the model proposed by Matt Mathis by taking into account a higher additive increment
- Assumptions
- Approximate a packet loss probability p by assuming that each flow delivers 1/p consecutive packets followed by one drop
- Under these assumptions, the congestion window of each flow oscillates with a period T0
- If the receiver acknowledges every packet, then the congestion window opens by x (the additive increment) packets each RTT
(Diagram: CWND evolution under periodic loss; a sawtooth between W/2 and W with period T0, whose area gives the number of packets delivered by each stream in one period.)
By modifying the congestion increment dynamically according to the RTT, we can guarantee fairness among TCP connections (see the reconstruction below).
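A numeric reconstruction of where that rule comes from under the slide's assumptions (the algebra is not spelled out on the slide): with an increment of x packets per RTT and every packet acknowledged, the window sawtooths between W/2 and W, one period delivers about 3W^2/(8x) packets = 1/p, and the average rate is proportional to MSS * sqrt(x) / RTT. Two flows with the same MSS therefore get equal throughput when x scales as RTT^2.

# Reconstruction sketch of the Mathis-style model used on this slide:
# additive increment x packets/RTT, multiplicative decrease 1/2, every packet
# ACKed, one loss every 1/p packets. Shows that choosing x proportional to
# RTT^2 equalizes the average throughput of two flows with the same MSS.
from math import sqrt

def avg_rate_mbps(x: float, rtt_s: float, p: float, mss_bytes: int = 1460) -> float:
    # Packets per period: 3*W^2/(8*x) = 1/p  =>  peak window W in packets:
    W = sqrt(8.0 * x / (3.0 * p))
    avg_window = 0.75 * W                    # mean of the W/2..W sawtooth
    return avg_window * mss_bytes * 8 / rtt_s / 1e6

p = 1e-5                                     # assumed loss probability
rtt1, rtt2 = 0.117, 0.181                    # Starlight vs Sunnyvale RTTs
print("standard increment (x = 1):",
      f"{avg_rate_mbps(1, rtt1, p):.0f} vs {avg_rate_mbps(1, rtt2, p):.0f} Mb/s")
x2 = (rtt2 / rtt1) ** 2                      # scale the increment as RTT^2
print("RTT^2-scaled increment:   ",
      f"{avg_rate_mbps(1, rtt1, p):.0f} vs {avg_rate_mbps(x2, rtt2, p):.0f} Mb/s")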
31 - Linux farms (Summary)
- CENIC PoP (LA)
- 1 Dual Opteron 1.8 GHz with 10GE Intel card (disk server, 2 TeraBytes)
- 2 Dual Xeon 3 GHz with 10GE Intel card (disk server, 2 TeraBytes)
- 12 Dual Xeon 2.4 GHz with 1 GE SysKonnect cards
- Starlight (CHI)
- 3 Dual Xeon 2.4 GHz with 10GE Intel card (disk server, 2 TeraBytes)
- 6 Dual Xeon 2.2 GHz with 2 x 1GE SysKonnect cards each
- CERN computer center (GVA)
- 4 Dual Xeon 2.4 GHz with 2 x 1GE SysKonnect cards each
- OpenLab Itanium systems ???
- Convention Center (GVA)
- 2 Dual Itanium 1.5 GHz with 10GE Intel cards
- 1 Dual Xeon 2.4 GHz with 10GE Intel cards (disk server, 2 TeraBytes, to be sent from Starlight)
- 1 Dual Xeon 2.4 GHz with 10GE Intel cards (disk server, 2 TeraBytes, to be sent from Caltech)
- 1 Dual Xeon 3 GHz with 10GE Intel card (disk server, 2 TeraBytes, to be sent from Caltech)
- 2 Dual Xeon 2.2 GHz with 2 x 1GE SysKonnect cards each