Title: ATLAS Canada Lightpath
1 ATLAS Canada Lightpath Data Transfer Trial
Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (University of Alberta), Wade Hong (Carleton)
6 Brownie: 2.5 TeraByte RAID array
- 16 x 160 GB IDE disks (5400 rpm, 2 MB cache)
- Hot-swap capable
- Dual Ultra160 SCSI interface to host
- Maximum transfer 65 MB/sec
- Triple hot-swap power supplies
- ~CAN$ 15k
- Arrives July 8th 2002
7 What to do while waiting for the server to arrive
- IBM PRO6850 IntelliStation (on loan)
  - Dual 2.2 GHz Xeons
  - 2 x PCI 64-bit/66 MHz
  - 4 x PCI 32-bit/33 MHz
  - 1.5 GB RAMBUS
- Add 2 Promise Ultra100 IDE controllers and 5 disks
  - Each disk on its own IDE controller for maximum IO
- Begin Linux software RAID performance tests (see the sketch after this list)
  - 170/130 MB/sec read/write
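For illustration, a minimal sketch of the kind of streaming read/write test behind numbers like these, assuming a software RAID0 device /dev/md0 mounted on /raid0 (names taken from the later RAID details slide; the ~4 GB file size is an arbitrary choice, picked to be larger than RAM so the page cache does not flatter the result):

    # Sequential write: stream ~4 GB of zeros onto the array and time it
    time dd if=/dev/zero of=/raid0/ddtest bs=1M count=4096
    # Remount to flush the page cache, then time a sequential read of the same file
    umount /raid0 && mount /dev/md0 /raid0
    time dd if=/raid0/ddtest of=/dev/null bs=1M
    rm /raid0/ddtest      # divide 4096 MB by the elapsed times for MB/sec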
8 The Long Road to High Disk IO
- IBM cluster x330s, RH7.2: disk IO 15 MB/sec (slow??)
  - Expect ~45 MB/sec for any modern single drive
- Need the 2.4.18 Linux kernel to support >1 TB filesystems
- IBM cluster x330s, RH7.3: disk IO 3 MB/sec - what is going on?
  - Red Hat's modified ServerWorks driver broke DMA on the x330s
  - The x330s has an ATA 100 drive, BUT the controller is only UDMA 33
- Promise controllers are capable of UDMA 100, but need the latest kernel patches for 2.4.18 before the drives are recognised at UDMA 100
- Finally drives and controller both working at UDMA 100: 45 MB/sec (see the hdparm sketch after this list)
- Linux software RAID0: 2 drives 90 MB/sec, 3 drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives 175 MB/sec
- Now we are ready to start network transfers
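A sketch of the sort of hdparm checks involved in this kind of DMA debugging (the device name /dev/hde is illustrative; -X69 requests UDMA mode 5, i.e. UDMA 100, and should only be forced on a drive/controller pair that supports it):

    hdparm -i /dev/hde          # report the transfer modes the drive claims to support
    hdparm -d /dev/hde          # show whether DMA is currently enabled
    hdparm -d1 -X69 /dev/hde    # enable DMA and request UDMA mode 5 (UDMA 100)
    hdparm -tT /dev/hde         # time buffered and cached reads to confirm ~45 MB/sec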
10 So what are we going to do? ...what did we do?
- Demonstrate a manually provisioned e2e lightpath
- Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
- Test out 10GbE technology and channel bonding (see the bonding sketch after this list)
- Establish a new benchmark for high performance disk-to-disk throughput over a large distance
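A minimal sketch of Linux 2.4-era channel bonding of two GbE ports, as one way the "channel bonding" item above could be configured (the interface names eth1/eth2, the bonding mode and the address are illustrative assumptions, not the configuration actually used in the trial):

    # Load the bonding driver; mode 0 = round-robin across the slave links
    modprobe bonding mode=0 miimon=100
    # Bring up the bonded interface (address is illustrative)
    ifconfig bond0 192.168.2.1 netmask 255.255.255.0 up
    # Enslave the two GbE ports to bond0
    ifenslave bond0 eth1 eth2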
11 Comparative Results (TRIUMF to CERN)
12 What is an e2e Lightpath?
- Core design principle of CA*net 4
- Ultimately, to give control of lightpath creation, teardown and routing to the end user
  - Hence, Customer Empowered Networks
- Provides a flexible infrastructure for emerging grid applications
- Alas, can only do things manually today
14 CA*net 4 Layer 1 Topology
15 The Chicago Loopback
- Need to test TCP/IP and Tsunami protocols over long distances, so arrange an optical loop via StarLight
  - (TRIUMF-BCNET-Chicago-BCNET-TRIUMF)
  - 91 ms RTT
- TRIUMF-CERN RTT is ~200 ms, so we told Damir we really needed a double loopback - no problem
  - Loopback2 was set up a few days later (RTT 193 ms; see the bandwidth-delay sketch after this list)
  - (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)
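For reference, a back-of-envelope sketch (not from the original slides) of the bandwidth-delay product these RTTs imply for a single GbE stream, which is what drives the large TCP buffer settings shown later:

    # BDP = 1e9 bit/s * 0.193 s / 8 bytes-per-bit-group = ~24 MB in flight,
    # so the TCP send/receive windows must be of that order to fill the pipe.
    echo "1000000000 * 0.193 / 8" | bc     # prints ~24125000 (bytes)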
16 TRIUMF Server
- SuperMicro P4DL6 (dual Xeon 2 GHz)
- 400 MHz front side bus
- 1 GB DDR2100 RAM
- Dual-channel Ultra160 onboard SCSI
- SysKonnect 9843 SX GbE
- 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
- 3ware 7850 RAID controller
- 2 Promise Ultra 100 TX2 controllers
17 CERN Server
- SuperMicro P4DL6 (dual Xeon 2 GHz)
- 400 MHz front side bus
- 1 GB DDR2100 RAM
- Dual-channel Ultra160 onboard SCSI
- SysKonnect 9843 SX GbE
- 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
- 2 x 3ware 7850 RAID controllers, 6 IDE drives on each 3ware controller
- RH7.3 on a 13th drive connected to the on-board IDE
- WD Caviar 120 GB drives with 8 MByte cache
- RMC4D chassis from HARDDATA
18 TRIUMF Backup Server
- SuperMicro P4DL6 (dual Xeon 1.8 GHz)
- Supermicro 742I-420 17" 4U chassis, 420 W power supply
- 400 MHz front side bus
- 1 GB DDR2100 RAM
- Dual-channel Ultra160 onboard SCSI
- SysKonnect 9843 SX GbE
- 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
- 2 Promise Ultra 133 TX2 controllers, 1 Promise Ultra 100 TX2 controller
19 Back-to-back tests over the 12,000 km loopback using the designated servers
20 Operating System
- Red Hat 7.3 based, Linux kernel 2.4.18-3
  - Needed to support filesystems > 1 TB
- Upgrades and patches
  - Patched to 2.4.18-10
  - Intel Pro 10GbE Linux driver (early stable)
  - SysKonnect 9843 SX Linux driver (latest)
  - Ported Sylvain Ravot's TCP tuning patches
21 Intel 10GbE Cards
- Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their alpha program
- Based on the Intel 82597EX 10 Gigabit Ethernet Controller
- Note the length of the card!
22 Extreme Networks (figures: TRIUMF, CERN)
23 EXTREME NETWORK HARDWARE
24 IDE Disk Arrays (figures: CERN receive host, TRIUMF send host)
25 Disk Read/Write Performance
- TRIUMF send host
  - 1 x 3ware 7850 and 2 Promise Ultra 100 TX2 PCI controllers
  - 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
  - Tuned for optimal read performance (227/174 MB/s)
- CERN receive host
  - 2 x 3ware 7850 64-bit/33 MHz PCI IDE controllers
  - 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
  - Tuned for optimal write performance (295/210 MB/s)
26 THUNDER RAID DETAILS
Commands:
    raidstop /dev/md0
    mkraid -R /dev/md0
    mkfs -t ext3 /dev/md0
    mount -t ext2 /dev/md0 /raid0

/root/raidtab:
    raiddev /dev/md0
        raid-level              0
        nr-raid-disks           12
        persistent-superblock   1
        chunk-size              512   # kbytes
        device /dev/sdc
        raid-disk 0
        device /dev/sdd
        raid-disk 1
        device /dev/sde
        raid-disk 2
        device /dev/sdf
        raid-disk 3
        device /dev/sdg
        raid-disk 4
        device /dev/sdh
        raid-disk 5
        device /dev/sdi
        raid-disk 6
        device /dev/sdj
        raid-disk 7
        device /dev/hde
        raid-disk 8
        device /dev/hdg
        raid-disk 9
        device /dev/hdi
        raid-disk 10
        device /dev/hdk
        raid-disk 11

8 drives on the 3ware controller, 4 drives on the 2 Promise controllers
27 Black Magic
- We are novices in the art of optimizing system performance
- It is also time consuming
- We followed most conventional wisdom, much of which we don't yet fully understand
28 Testing Methodologies
- Began testing with a variety of bandwidth characterization tools
  - pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.
- Evaluated high performance file transfer applications
  - bbftp, bbcp, tsunami, pftp
- Developed scripts to automate and to scan the parameter space for a number of the tools (a sketch of such a scan follows this list)
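A minimal sketch of the kind of parameter-scan script referred to above, assuming an iperf server is already running on the far host (the host name, window sizes and stream counts are illustrative; this is not the actual script used in the trial):

    #!/bin/sh
    # Scan TCP window size and parallel-stream count with iperf, logging the summary line.
    HOST=cern-10g                      # illustrative far-end host
    for win in 256K 1M 4M 8M; do
        for streams in 1 2 4 8; do
            echo "window=$win streams=$streams"
            iperf -c $HOST -w $win -P $streams -t 30 | tail -1
        done
    done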
29 Disk I/O Black Magic
- min/max readahead on both systems
  - sysctl -w vm.min-readahead=127
  - sysctl -w vm.max-readahead=256
- bdflush on the receive host
  - sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
  - or
  - echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush
- bdflush on the send host
  - sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
  - or
  - echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush
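The commands above take effect immediately but do not survive a reboot; a minimal sketch of the equivalent persistent form in /etc/sysctl.conf (2.4-kernel tunable names, receive-host values copied from this slide, applied with sysctl -p):

    # /etc/sysctl.conf fragment (receive host) - loaded at boot or with: sysctl -p
    vm.min-readahead = 127
    vm.max-readahead = 256
    vm.bdflush = 2 500 0 0 500 1000 60 20 0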
30 Misc. Tuning and Other Tips
- /sbin/elvtune -r 512 /dev/sdc (same for the other 11 disks)
- /sbin/elvtune -w 1024 /dev/sdc (same for the other 11 disks)
  - -r sets the max latency that the I/O scheduler will provide on each read
  - -w sets the max latency that the I/O scheduler will provide on each write
- When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
  - umount -l /raid (lazy unmount; then mount, umount)
31 Disk I/O Black Magic
- Disk I/O elevators (minimal impact noticed)
  - /sbin/elvtune
  - Allows some control of latency vs throughput
  - Read latency set to 512 (default 8192)
  - Write latency set to 1024 (default 16384)
- noatime
  - Disables updating the time a file was last accessed (typically for file servers)
  - mount -t ext2 -o noatime /dev/md0 /raid
- Typically, ext3 writes at ~90 Mbytes/sec while ext2 writes at ~190 Mbytes/sec
  - Reads are minimally affected. We always used ext2
32 Disk I/O Black Magic
- IRQ affinity
  - [root@thunder root]# more /proc/interrupts

               CPU0       CPU1
      0:   15723114          0   IO-APIC-edge   timer
      1:         12          0   IO-APIC-edge   keyboard
      2:          0          0   XT-PIC         cascade
      8:          1          0   IO-APIC-edge   rtc
     10:          0          0   IO-APIC-level  usb-ohci
     14:         22          0   IO-APIC-edge   ide0
     15:     227234          2   IO-APIC-edge   ide1
     16:        126          0   IO-APIC-level  aic7xxx
     17:         16          0   IO-APIC-level  aic7xxx
     18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
     20:         14          0   IO-APIC-level  ide2, ide3
     22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
     24:          2          0   IO-APIC-level  eth3
     26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
     30:   26640812          0   IO-APIC-level  eth0
    NMI:          0          0

- Need to have PROCESS affinity as well, but this requires a 2.5 kernel
- Steering an IRQ to a CPU:

    echo 1 > /proc/irq/18/smp_affinity                          # use CPU0
    echo 2 > /proc/irq/18/smp_affinity                          # use CPU1
    echo 3 > /proc/irq/18/smp_affinity                          # use either
    cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity     # reset to default
33 TCP Black Magic
- Typically suggested TCP and net buffer tuning:
  - sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
  - sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
  - sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
  - sysctl -w net.core.rmem_default=65535
  - sysctl -w net.core.rmem_max=8388608
  - sysctl -w net.core.wmem_default=65535
  - sysctl -w net.core.wmem_max=8388608
34 TCP Black Magic
- Sylvain Ravot's tcp tune patch parameters
  - sysctl -w net.ipv4.tcp_tune="115 115 0"
- Linux 2.4 "retentive" TCP
  - Caches TCP control information for a destination for 10 mins
  - To avoid caching: sysctl -w net.ipv4.route.flush=1
35 We are live continent to continent!
- e2e lightpath up and running Friday Sept 20, 21:45 CET

    traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
     1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
36 BBFTP Transfer (figure: Vancouver ONS ons-van01 enet_15/1 and enet_15/2 traffic)
37 BBFTP Transfer (figure: Chicago ONS GigE port 1 and port 2 traffic)
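For context, a hedged sketch of what a multi-stream bbftp transfer of this kind can look like (host name, user, file paths and stream count are illustrative assumptions, not the exact invocation used in the trial):

    # Push one large file from the TRIUMF send host to the CERN receive host
    # over several parallel TCP streams.
    bbftp -u atlas -p 10 -e "put /raid0/mcdata/run001.tar /raid/mcdata/run001.tar" cern-10g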
38 Tsunami Transfer (figure: Vancouver ONS ons-van01 enet_15/1 and enet_15/2 traffic)
39 Tsunami Transfer (figure: Chicago ONS GigE port 1 and port 2 traffic)
40 Sunday Nite Summaries
41 Exceeding 1 Gbit/sec (using Tsunami)
42 What does it mean for TRIUMF in the long term?
- Established a relationship with a grid of people for future networking projects
- Upgraded WAN connection from 100 Mbit to 4 x 1 GbE connections directly to BCNET
  - CANARIE educational/research network
  - WestGrid GRID computing
  - Commercial Internet
  - Spare (research & development)
- Recognition that TRIUMF has the expertise and the network connectivity for the large scale, high speed data transfers necessary for upcoming scientific programs: ATLAS, WestGrid, etc.
43 Lessons Learned 1
- Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems
- One controller for each drive; the more disk spindles the better
- More than 2 Promise controllers per machine possible (100/133 MHz)
- Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance
  - A single 2.8 GHz is likely to outperform a dual 2.0 GHz for a single-purpose machine like our fileservers
- The more memory the better
44 Misc. Comments
- No hardware failures, even for the 50 disks!
- Largest file transferred: 114 Gbytes (Sep 24)
- Tar, compressing, etc. take longer than the transfer
- Deleting files can take a lot of time
- Low cost of project: ~CAN$ 20,000, with most of that recycled
45 (figure: ~220 Mbytes/sec, 175 Mbytes/sec)
46 Acknowledgements
- CANARIE
  - Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian
- ATLAS Canada
  - Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler
- HEPnet Canada
  - Dean Karlen
- TRIUMF
  - Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / WestGrid)
- BCNET
  - Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don McWilliams
47 Acknowledgements
- Extreme Networks
  - Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner
- Intel Corporation
  - Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
48 Acknowledgements
- Indiana University
  - Mark Meiss, Stephen Wallace
- Caltech
  - Sylvain Ravot, Harvey Newman
- CERN
  - Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin
- SURFnet / Universiteit van Amsterdam
  - Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
49 Acknowledgements
- Yotta Yotta
  - Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
- BCIT
  - Bill Rutherford
- Jalaam
  - Loki Jorgensen
- Netera
  - Gary Finley
50 ATLAS Canada (map: Alberta, SFU, Montreal, Victoria, UBC, Carleton, York, TRIUMF, Toronto)
51 LHC Data Grid Hierarchy (slide courtesy H. Newman, Caltech)
- CERN/outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~1:1:1
- Online System (experiment) to Tier 0+1 at CERN (700k SI95, 1 PB disk, tape robot, HPSS): ~1 PByte/sec off the detector, 100-400 MBytes/sec into Tier 0
- Tier 1 centres (e.g. FNAL: 200k SI95, 600 TB; IN2P3, INFN, RAL centres) linked at 2.5 Gbps
- Tier 2 centres linked at 2.5 Gbps
- Tier 3: institutes (~0.25 TIPS); each institute has ~10 physicists working on one or more analysis channels
- Tier 4: physics data caches and workstations at 0.1-1 Gbps
52 The ATLAS Experiment (figure: detector with Canadian contributions labelled "Canada")