
Transcript and Presenter's Notes

Title: Bringing High-Performance Networking to HEP users


1
Bringing High-Performance Networking to HEP users
Richard Hughes-Jones, Stephen Dallison, Nicola Pezzi, Yee-Ting Lee
UKLight
2
The Bandwidth Challenge at SC2003
  • Peak bandwidth 23.21 Gbit/s
  • 6.6 TBytes in 48 minutes
  • Phoenix - Amsterdam
  • 4.35 Gbit/s HighSpeed TCP
  • rtt 175 ms, window 200 MB

3
TCP (Reno): What's the problem?
  • TCP has two phases: Slow Start and Congestion Avoidance
  • AIMD and high-bandwidth, long-distance networks
  • Poor performance of TCP in high-bandwidth wide area networks is due in
    part to the TCP congestion control algorithm
  • For each ACK in an RTT without loss:
  • cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  • For each window experiencing loss:
  • cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = ½)
  • Time to recover from 1 lost packet for a round trip time of 100 ms (a
    back-of-envelope calculation follows below)
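A quick back-of-envelope calculation shows the scale of the problem. Only the
100 ms round trip time comes from the slide; the 1 Gbit/s link speed and
1500-byte segment size below are assumptions for illustration.

    # Sketch: time for standard TCP Reno to regain full rate after one loss.
    rate_bps = 1e9           # assumed link speed (1 Gbit/s)
    mss_bytes = 1500         # assumed segment size
    rtt_s = 0.100            # round trip time quoted on the slide

    cwnd_pkts = rate_bps * rtt_s / (mss_bytes * 8)   # window that fills the pipe
    # After a loss cwnd is halved, then grows by ~1 segment per RTT (a = 1),
    # so it takes cwnd/2 round trips to get back to full speed.
    recovery_s = (cwnd_pkts / 2) * rtt_s
    print(f"cwnd ~ {cwnd_pkts:.0f} segments, recovery ~ {recovery_s:.0f} s")
    # ~417 s at 1 Gbit/s; the same arithmetic at 10 Gbit/s gives over an hour.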

4
Investigation of new TCP Stacks
  • The AIMD algorithm: Standard TCP (Reno) (the rules for each stack are
    sketched in code after this list)
  • For each ACK in an RTT without loss:
  • cwnd -> cwnd + a/cwnd (Additive Increase, a = 1)
  • For each window experiencing loss:
  • cwnd -> cwnd - b*cwnd (Multiplicative Decrease, b = ½)
  • High Speed TCP
  • a and b vary depending on the current cwnd, using a table
  • a increases more rapidly with larger cwnd, so the stack returns to the
    optimal cwnd size sooner for the network path
  • b decreases less aggressively and, as a consequence, so does cwnd; the
    effect is that throughput does not fall as far on loss
  • Scalable TCP
  • a and b are fixed adjustments for the increase and decrease of cwnd
  • a = 1/100: the increase is greater than TCP Reno
  • b = 1/8: the decrease on loss is less than TCP Reno
  • Scalable over any link speed
  • Fast TCP
  • Uses round trip time as well as packet loss to indicate congestion, with
    rapid convergence to a fair equilibrium for throughput
  • HSTCP-LP, H-TCP, BiC-TCP
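A minimal sketch of how the per-ACK and per-loss rules above differ between
stacks. Only Reno's a = 1, b = ½ and Scalable's a = 1/100, b = 1/8 come from
the slide; HighSpeed TCP really takes a and b from a table indexed by cwnd, so
the values used above its 38-segment Reno-compatible region are illustrative.

    # Sketch of the cwnd update rules described above (cwnd in segments).
    def on_ack(cwnd: float, stack: str) -> float:
        """Growth applied for each ACK in an RTT without loss."""
        if stack == "reno":
            return cwnd + 1.0 / cwnd        # cwnd -> cwnd + a/cwnd, a = 1
        if stack == "scalable":
            return cwnd + 1.0 / 100         # fixed per-ACK increase, a = 1/100
        if stack == "highspeed":
            a = 1.0 if cwnd < 38 else 10.0  # a grows with cwnd; 10 is illustrative
            return cwnd + a / cwnd
        raise ValueError(stack)

    def on_loss(cwnd: float, stack: str) -> float:
        """Decrease applied once per window that experiences loss."""
        if stack == "reno":
            return cwnd * (1 - 0.5)         # b = 1/2
        if stack == "scalable":
            return cwnd * (1 - 1.0 / 8)     # b = 1/8
        if stack == "highspeed":
            b = 0.5 if cwnd < 38 else 0.35  # b shrinks with cwnd; 0.35 illustrative
            return cwnd * (1 - b)
        raise ValueError(stack)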

5
Packet Loss with new TCP Stacks
  • TCP response function
  • Throughput vs loss rate: the further a curve lies to the right, the
    faster the recovery (a numerical illustration follows below)
  • Packets are dropped in the kernel to control the loss rate
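For standard TCP the response function being plotted is well approximated by
the Mathis relation, rate ~ (MSS/RTT) * 1.22/sqrt(p). The sketch below
evaluates it with an assumed 1500-byte MSS and a DataTAG-like 120 ms RTT; the
numbers are illustrative, not measurements from the plots.

    # Standard-TCP response function (Mathis approximation), for illustration.
    from math import sqrt

    mss_bits = 1500 * 8        # assumed segment size
    rtt_s = 0.120              # assumed DataTAG-like round trip time
    for p in (1e-3, 1e-6, 1e-9):
        rate_mbit = (mss_bits / rtt_s) * 1.22 / sqrt(p) / 1e6
        print(f"loss rate {p:g}: ~{rate_mbit:.0f} Mbit/s for standard TCP")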

[Plots: MB-NG, rtt 6 ms; DataTAG, rtt 120 ms]
6
High Throughput Demonstrations
[Diagram: Dual Xeon 2.2 GHz hosts at Manchester (Geneva) and London (Chicago),
each connected by 1 GEth to a Cisco 7609 and a Cisco GSR, across the
2.5 Gbit SDH MB-NG core]
7
High Performance TCP: MB-NG
  • Drop 1 in 25,000
  • rtt 6.2 ms
  • Recover in 1.6 s
  • Plots: Standard, HighSpeed and Scalable TCP

8
High Performance TCP: DataTAG
  • Different TCP stacks tested on the DataTAG network
  • rtt 128 ms
  • Drop 1 in 10^6
  • High-Speed: rapid recovery
  • Scalable: very fast recovery
  • Standard: recovery would take ~20 mins

9
  • End Systems: NICs and Disks

10
End Hosts and NICs: SuperMicro P4DP6
  • Use UDP packets to characterise the host and NIC (a minimal sketch of the
    idea follows the plots below)
  • SuperMicro P4DP6 motherboard
  • Dual Xeon 2.2 GHz CPU
  • 400 MHz system bus
  • 66 MHz, 64 bit PCI bus

[Plots: throughput, latency, PCI bus activity]
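The UDP measurements were made with the UDPmon kit listed on the URLs slide;
the fragment below is only a minimal sketch of the idea (send a stream of
fixed-size UDP datagrams and report the offered rate), not UDPmon itself, and
the host, port and packet count are placeholders.

    # Minimal UDP throughput probe in the spirit of the tests above: blast
    # fixed-size datagrams at a receiver and report the achieved send rate.
    import socket, time

    def udp_send_test(host="192.168.0.2", port=5001, size=1400, count=100000):
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        payload = b"\x00" * size
        start = time.perf_counter()
        for _ in range(count):
            sock.sendto(payload, (host, port))
        elapsed = time.perf_counter() - start
        rate_mbit = count * size * 8 / elapsed / 1e6
        print(f"{size}-byte datagrams: {rate_mbit:.0f} Mbit/s offered")

    # A matching receiver would timestamp and count the datagrams to give the
    # received rate, one-way latency histograms and loss, as in the plots above.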
11
Host, PCI & RAID Controller Performance
  • RAID5 (striped with redundancy)
  • 3Ware 7506 Parallel, 66 MHz; 3Ware 7505 Parallel, 33 MHz
  • 3Ware 8506 Serial ATA, 66 MHz; ICP Serial ATA, 33/66 MHz
  • Tested on a Dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
  • Disk: Maxtor 160 GB 7200 rpm, 8 MB cache
  • Read-ahead kernel tuning: /proc/sys/vm/max-readahead = 512 (see the
    sketch below)
  • RAID0 (striped): Read 1040 Mbit/s, Write 800 Mbit/s
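A minimal sketch of the read-ahead tuning mentioned above, for the 2.4-series
kernels used in these tests; it simply writes the value quoted on the slide
into the sysctl file and needs root.

    # Set the VM read-ahead sysctl quoted on the slide (2.4-series kernels).
    READAHEAD = "/proc/sys/vm/max-readahead"

    with open(READAHEAD, "w") as f:     # requires root
        f.write("512\n")
    with open(READAHEAD) as f:
        print("max-readahead is now", f.read().strip())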

12
The performance of the end host / disks: BaBar Case Study. RAID BW and PCI Activity
  • 3Ware 7500-8 RAID5, parallel EIDE
  • The 3Ware card forces the PCI bus to 33 MHz
  • BaBar Tyan to MB-NG SuperMicro, network memory-to-memory: 619 Mbit/s
  • Disk-to-disk throughput with bbcp: 40-45 MBytes/s (320-360 Mbit/s)
  • PCI bus effectively full! (see the arithmetic below)
  • User throughput ~250 Mbit/s
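Rough arithmetic behind "PCI bus effectively full". The Tyan server has a
single 64-bit PCI bus (see the backup slide on the BaBar disk server) which
the 3Ware card drives at 33 MHz, and the disk and network traffic share it;
the halving for the double bus crossing is my reading, not a statement from
the slides.

    # Why a few hundred Mbit/s of user data can fill a 64-bit, 33 MHz PCI bus.
    bus_bytes_per_s = 33e6 * 8              # 64-bit @ 33 MHz: ~264 MB/s peak
    bus_gbit = bus_bytes_per_s * 8 / 1e9    # ~2.1 Gbit/s theoretical

    # Each byte of a disk-to-network transfer crosses the bus twice
    # (RAID controller -> memory, then memory -> NIC), and real PCI
    # efficiency with CSR accesses and arbitration is well below peak.
    print(f"bus peak ~{bus_gbit:.1f} Gbit/s, "
          f"<= ~{bus_gbit / 2:.1f} Gbit/s of user data before overheads")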

[Plots: PCI activity reading from and writing to the RAID5 disks]
13
  • Data Transfer Applications

14
The Tests (being) Made
Each application (Iperf, bbcp, bbftp, apache, GridFTP) is run with each TCP
stack (Standard, HighSpeed, Scalable) in three configurations: SuperMicro on
MB-NG, SuperMicro on SuperJANET4, and BaBar on SuperJANET4.
15
Topology of the MB-NG Network
16
Topology of the Production Network
[Diagram: Manchester domain (3 routers, 2 switches) and RAL domain (routers
and switches). Key: Gigabit Ethernet, 2.5 Gbit POS access, 10 Gbit POS]
17
Average Transfer Rates (Mbit/s)
App TCP Stack SuperMicro on MB-NG SuperMicro on SuperJANET4 BaBar on SuperJANET4
Iperf Standard 940 350-370 425
Iperf HighSpeed 940 510 570
Iperf Scalable 940 580-650 605
bbcp Standard 434 290-310 290
bbcp HighSpeed 435 385 360
bbcp Scalable 432 400-430 380
bbftp Standard 400-410 325 320
bbftp HighSpeed 370-390 380
bbftp Scalable 430 345-532 380
apache Standard 425 260 300-360
apache HighSpeed 430 370 315
apache Scalable 428 400 317
Gridftp Standard 405 240
Gridftp HighSpeed 320
Gridftp Scalable 335
18
iperf Throughput and Web100
  • SuperMicro on the MB-NG network
  • HighSpeed TCP
  • Line speed: 940 Mbit/s
  • DupACKs < 10 (expect ~400)

19
bbftp: Host and Network Effects
  • 2 GByte file, RAID5 disks
  • 1200 Mbit/s read
  • 600 Mbit/s write
  • Scalable TCP
  • BaBar host on SuperJANET
  • Instantaneous rate 220-625 Mbit/s
  • SuperMicro host on SuperJANET
  • Instantaneous rate 400-665 Mbit/s for 6 s
  • Then 0-480 Mbit/s
  • SuperMicro host on MB-NG
  • Instantaneous rate 880-950 Mbit/s for 1.3 s
  • Then 215-625 Mbit/s

20
bbftp: What else is going on?
  • Scalable TCP
  • BaBar host on SuperJANET
  • SuperMicro host on SuperJANET
  • Congestion window and dupACKs
  • Is the variation not TCP related?
  • Disk speed / bus transfer
  • Application

21
Applications: Throughput (Mbit/s)
  • HighSpeed TCP
  • 2 GByte file, RAID5
  • SuperMicro host on SuperJANET
  • bbcp
  • bbftp
  • Apache
  • GridFTP
  • Previous work used RAID0 (not disk limited)

22
Summary, Conclusions and Thanks
  • Motherboards, NICs, RAID controllers and disks matter
  • The NICs should be well designed
  • The NIC should use 64 bit, 133 MHz PCI-X (66 MHz PCI can be OK)
  • NIC/drivers: CSR access / clean buffer management / good interrupt handling
  • Worry about the CPU-memory bandwidth as well as the PCI bandwidth
  • Data crosses the memory bus at least 3 times
  • Separate the data transfers: use motherboards with multiple 64 bit PCI-X buses
  • 32 bit, 33 MHz is too slow for Gigabit rates
  • 64 bit, 33 MHz is > 80% used
  • Choose a modern high-throughput RAID controller
  • Consider software RAID0 across RAID5 hardware controllers
  • Need plenty of CPU power for sustained 1 Gbit/s transfers
  • Work with campus network engineers to eliminate bottlenecks and packet loss
  • High bandwidth link to your server
  • Look for access-link overloading / old Ethernet equipment / flow limitation policies
  • Use of jumbo frames, interrupt coalescence and tuning the PCI-X bus helps
  • New TCP stacks are stable and run with 10 Gigabit Ethernet NICs
  • New stacks give better response and performance
  • Still need to set the TCP buffer sizes (a minimal sketch follows below)
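As a reminder of what setting the TCP buffer sizes means in practice: make the
socket buffers at least the bandwidth-delay product of the path. A minimal
sketch with an assumed 1 Gbit/s, 120 ms path; the kernel limits
(net.core.rmem_max / wmem_max, net.ipv4.tcp_rmem / tcp_wmem) must also allow
buffers this large.

    # Sketch: size TCP socket buffers to the bandwidth-delay product.
    import socket

    rate_bps = 1e9                            # assumed path rate
    rtt_s = 0.120                             # assumed round trip time
    bdp_bytes = int(rate_bps * rtt_s / 8)     # ~15 MB for this path

    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Set before connect/listen so the TCP window scale is negotiated to match.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp_bytes)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp_bytes)
    # Linux reports back a (possibly doubled or clamped) effective value.
    print("requested", bdp_bytes, "bytes, granted",
          sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF))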

23
More Information: Some URLs
  • MB-NG project web site: http://www.mb-ng.net/
  • DataTAG project web site: http://www.datatag.org/
  • UDPmon / TCPmon kit and writeup: http://www.hep.man.ac.uk/rich/net
  • Motherboard and NIC tests:
    www.hep.man.ac.uk/rich/net/nic/GigEth_tests_Boston.ppt and
    http://datatag.web.cern.ch/datatag/pfldnet2003/
  • "Performance of 1 and 10 Gigabit Ethernet Cards with Server Quality
    Motherboards", FGCS Special Issue, 2004
  • TCP tuning information may be found at
    http://www.ncne.nlanr.net/documentation/faq/performance.html and
    http://www.psc.edu/networking/perf_tune.html
  • TCP stack comparisons: "Evaluation of Advanced TCP Stacks on Fast
    Long-Distance Production Networks", Journal of Grid Computing, 2004

24
  • Backup Slides

25
SuperMicro P4DP6 Throughput: Intel Pro/1000
  • Motherboard: SuperMicro P4DP6; Chipset: Intel E7500 (Plumas)
  • CPU: Dual Xeon (Prestonia) 2.2 GHz; PCI: 64 bit, 66 MHz
  • RedHat 7.2, kernel 2.4.19
  • Max throughput 950 Mbit/s
  • No packet loss
  • CPU utilisation on the receiving PC was ~25% for packets greater than
    1000 bytes
  • 30-40% for smaller packets

26
SuperMicro P4DP6 Latency: Intel Pro/1000
  • Motherboard: SuperMicro P4DP6; Chipset: Intel E7500 (Plumas)
  • CPU: Dual Xeon (Prestonia) 2.2 GHz; PCI: 64 bit, 66 MHz
  • RedHat 7.2, kernel 2.4.19
  • Some steps
  • Slope 0.009 us/byte
  • Slope of flat sections 0.0146 us/byte
  • Expect 0.0118 us/byte (see the calculation sketched below)
  • No variation with packet size
  • FWHM 1.5 us
  • Confirms timing is reliable
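One way to reproduce the expected 0.0118 us/byte figure (my reconstruction,
not spelled out on the slide) is to add the per-byte cost of the Gigabit
Ethernet wire to one PCI crossing in each host:

    # Reconstruction of the expected latency slope: wire + one PCI crossing
    # per host. The combination matching 0.0118 us/byte is an assumption.
    gige_us_per_byte = 8 / 1e9 * 1e6           # 1 Gbit/s wire: 0.0080 us/byte
    pci_us_per_byte = 1 / (66e6 * 8) * 1e6     # 64 bit @ 66 MHz: ~0.0019 us/byte
    expected = gige_us_per_byte + 2 * pci_us_per_byte
    print(f"expected slope ~ {expected:.4f} us/byte")   # ~0.0118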

27
SuperMicro P4DP6 PCI: Intel Pro/1000
  • Motherboard: SuperMicro P4DP6; Chipset: Intel E7500 (Plumas)
  • CPU: Dual Xeon (Prestonia) 2.2 GHz; PCI: 64 bit, 66 MHz
  • RedHat 7.2, kernel 2.4.19
  • 1400 bytes sent
  • Wait 12 us
  • 5.14 us on the send PCI bus
  • PCI bus 68% occupancy
  • 3 us on the PCI bus for data receive
  • CSR access inserts PCI STOPs
  • The NIC takes ~1 us per CSR
  • The CPU is faster than the NIC!
  • Similar effect with the SysKonnect NIC

28
Raid0 Performance (1)
  • 3Ware 7500-8 RAID0, parallel EIDE
  • Maxtor 3.5 Series DiamondMax Plus 9, 120 GB ATA/133
  • Raid stripe size 64 bytes
  • Write: slight increase with the number of disks
  • Read: 3 disks OK
  • Write ~100 MBytes/s
  • Read ~130 MBytes/s

29
Raid0 Performance (2)
  • Maxtor 3.5 Series DiamondMax Plus 9, 120 GB ATA/133
  • No difference for write
  • A larger stripe lowers the read performance
  • Write ~100 MBytes/s
  • Read ~120 MBytes/s

30
Raid5 Disk Performance vs readahead_max
  • BaBar disk server
  • Tyan Tiger S2466N motherboard
  • One 64 bit, 66 MHz PCI bus
  • Athlon MP2000 CPU
  • AMD-760 MPX chipset
  • 3Ware 7500-8 RAID5
  • 8 x 200 GB Maxtor IDE 7200 rpm disks
  • Note the VM parameter readahead_max
  • Disk to memory (read): max throughput 1.2 Gbit/s (150 MBytes/s)
  • Memory to disk (write): max throughput 400 Mbit/s (50 MBytes/s), not as
    fast as Raid0

31
Host, PCI & RAID Controller Performance
  • RAID0 (striped) and RAID5 (striped with redundancy)
  • 3Ware 7506 Parallel, 66 MHz; 3Ware 7505 Parallel, 33 MHz
  • 3Ware 8506 Serial ATA, 66 MHz; ICP Serial ATA, 33/66 MHz
  • Tested on a Dual 2.2 GHz Xeon Supermicro P4DP8-G2 motherboard
  • Disk: Maxtor 160 GB 7200 rpm, 8 MB cache
  • Read-ahead kernel tuning: /proc/sys/vm/max-readahead

32
Serial ATA RAID Controllers: RAID5
  • 3Ware 66 MHz PCI
  • ICP 66 MHz PCI

33
RAID Controller Performance
[Plots: write speed and read speed, for RAID0 and RAID5]
34
GridFTP Throughput and Web100
  • RAID0 disks
  • 960 Mbit/s read
  • 800 Mbit/s write
  • Throughput (Mbit/s):
  • Throughput alternates between 600-800 Mbit/s and zero
  • Data rate 520 Mbit/s
  • Cwnd smooth
  • No dupACKs / send stalls / timeouts

35
HTTP data transfers with HighSpeed TCP
  • Same hardware
  • RAID0 disks
  • Bulk data moved by web servers
  • Apache web server out of the box!
  • Prototype client using the curl http library
  • 1 MByte TCP buffers
  • 2 GByte file
  • Throughput ~720 Mbit/s
  • Cwnd shows some variation
  • No dupACKs / send stalls / timeouts

36
bbcp and GridFTP Throughput
  • RAID5, 4 disks, Manchester to RAL
  • 2 GByte file transferred
  • bbcp
  • Mean 710 Mbit/s
  • DataTAG altAIMD kernel in BaBar and ATLAS
  • GridFTP
  • See many zeros
  • Mean 620 Mbit/s