Title: Atlas Canada Lightpath
1Atlas Canada Lightpath Data Transfer Trial
Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (University of Alberta), Wade Hong (Carleton)
2ATLAS CANADA TRIUMF-CERN LIGHTPATH DATA TRANSFER TRIAL FOR IGRID2002
Two 1 Gigabit optical fibre circuits (two wavelengths, or colours)
- What was accomplished?
- Established a relationship with a grid of people for future networking projects
- Demonstrated a manually provisioned 12,000 km lightpath
- Transferred 1 TB of ATLAS Monte Carlo data to CERN (equiv. to 1500 CDs)
- Established record rates (1 CD in 8 seconds, or 1 DVD in <60 seconds)
- Demonstrated innovative use of existing technology
- Largely used low-cost commodity software and hardware
- Participants
- TRIUMF
- University of Alberta
- Carleton
- CERN
- Canarie
- BCNET
- SURFnet
- Acknowledgements
- Netera
- Atlas Canada
- WestGrid
- HEPnet Canada
- Indiana University
- Caltech
- Extreme Networks
- Intel Corporation
3(No Transcript)
4(No Transcript)
5(No Transcript)
6(No Transcript)
7Brownie 2.5 TeraByte RAID array
- 16 x 160 GB IDE disks (5400 rpm, 2 MB cache)
- Hot-swap capable
- Dual Ultra160 SCSI interface to host
- Maximum transfer 65 MB/sec
- Triple hot-swap power supplies
- CAN $15k
- Arrives July 8th, 2002
8What to do while waiting for the server to arrive
- IBM IntelliStation PRO6850 (loan)
- Dual 2.2 GHz Xeons
- 2 PCI slots, 64 bit/66 MHz
- 4 PCI slots, 32 bit/33 MHz
- 1.5 GB RAMBUS
- Add 2 Promise Ultra100 IDE controllers and 5 disks
- Each disk on its own IDE controller for maximum I/O
- Begin Linux software RAID performance tests
170/130 MB/sec read/write
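Read/write rates of this kind can be reproduced with a simple dd test against the software RAID; a minimal sketch, assuming the array is mounted at /raid0 (mount point and test file name are illustrative, not from the trial):

# Sequential write test: stream ~4 GB to the software RAID, then sync
# so buffered data is flushed to disk before timing stops
time ( dd if=/dev/zero of=/raid0/ddtest bs=1M count=4096 ; sync )

# Sequential read test: read the file back; it is larger than RAM,
# so the page cache cannot serve the reads
time dd if=/raid0/ddtest of=/dev/null bs=1M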
9The Long Road to High Disk I/O
- IBM cluster x330s, RH7.2: disk I/O 15 MB/sec (slow??)
- Expect 45 MB/sec for any modern single drive
- Need the 2.4.18 Linux kernel to support >1 TB filesystems
- IBM cluster x330s, RH7.3: disk I/O 3 MB/sec
- What is going on?
- Red Hat's modified ServerWorks driver broke DMA on the x330s
- The x330s has an ATA 100 drive, BUT its controller is only UDMA 33
- Promise controllers are capable of UDMA 100, but need the latest kernel patches for 2.4.18 before the drives are recognised as UDMA 100
- Finally drives/controller both working at UDMA 100: 45 MB/sec
- Linux software RAID0: 2 drives 90 MB/sec, 3 drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives 175 MB/sec
- Now we are ready to start network transfers
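The DMA/UDMA state described above can be inspected and forced per drive with hdparm; a minimal sketch (using /dev/hde, one of the Promise-attached drives, as the example):

# Show current drive settings, including whether DMA is on (using_dma)
hdparm /dev/hde

# Enable DMA and select UDMA mode 5 (-X takes 64+n for UDMA mode n,
# so 69 = UDMA/100), then time buffered reads to confirm ~45 MB/sec
hdparm -d1 -X 69 /dev/hde
hdparm -t /dev/hde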
10(No Transcript)
11So what are we going to do? (did we?)
- Demonstrate a manually provisioned e2e lightpath
- Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
- Test out 10GbE technology and channel bonding
- Establish a new benchmark for high-performance disk-to-disk throughput over a large distance
12Comparative Results (TRIUMF to CERN)
13What is an e2e Lightpath?
- Core design principle of CA*net 4
- Ultimately, to give control of lightpath creation, teardown and routing to the end user
- Hence, Customer Empowered Networks
- Provides a flexible infrastructure for emerging grid applications
- Alas, can only do things manually today
14(No Transcript)
15CA*net 4 Layer 1 Topology
16The Chicago Loopback
- Need to test TCP/IP and Tsunami protocols over long distances, so arrange an optical loop via StarLight
- (TRIUMF-BCNET-Chicago-BCNET-TRIUMF)
- 91 ms RTT
- TRIUMF-CERN RTT is 200 ms, so we told Damir we really needed a double loopback - "No problem"
- Loopback 2 was set up a few days later (RTT 193 ms)
- (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)
17TRIUMF Server
SuperMicro P4DL6 (Dual Xeon 2 GHz)
400 MHz front side bus
1 GB DDR2100 RAM
Dual Channel Ultra160 onboard SCSI
SysKonnect 9843 SX GbE
2 independent PCI buses, 6 PCI-X 64 bit/133 MHz capable
3ware 7850 RAID controller
2 Promise Ultra 100 TX2 controllers
18CERN Server
SuperMicro P4DL6 (Dual Xeon 2 GHz)
400 MHz front side bus
1 GB DDR2100 RAM
Dual Channel Ultra160 onboard SCSI
SysKonnect 9843 SX GbE
2 independent PCI buses, 6 PCI-X 64 bit/133 MHz capable
2 3ware 7850 RAID controllers
6 IDE drives on each 3ware controller
RH7.3 on a 13th drive connected to the on-board IDE
WD Caviar 120 GB drives with 8 MB cache
RMC4D from HARDDATA
19TRIUMF Backup Server
SuperMicro P4DL6 (Dual Xeon 1.8 GHz)
Supermicro 742I-420 17" 4U chassis, 420 W power supply
400 MHz front side bus
1 GB DDR2100 RAM
Dual Channel Ultra160 onboard SCSI
SysKonnect 9843 SX GbE
2 independent PCI buses, 6 PCI-X 64 bit/133 MHz capable
2 Promise Ultra 133 TX2 controllers
1 Promise Ultra 100 TX2 controller
20Back-to-back tests over the 12,000 km loopback using designated servers
21Operating System
- Red Hat 7.3 based, Linux kernel 2.4.18-3
- Needed to support filesystems > 1 TB
- Upgrades and patches
- Patched to 2.4.18-10
- Intel Pro 10GbE Linux driver (early stable)
- SysKonnect 9843 SX Linux driver (latest)
- Ported Sylvain Ravot's TCP tuning patches
22Intel 10GbE Cards
- Intel kindly loaned us 2 of their PRO/10GbE LR server adapter cards, despite the end of their alpha program
- Based on the Intel 82597EX 10 Gigabit Ethernet Controller
- Note the length of the card!
23Extreme Networks
TRIUMF
CERN
24EXTREME NETWORK HARDWARE
25IDE Disk Arrays
CERN Receive Host
TRIUMF Send Host
26Disk Read/Write Performance
- TRIUMF send host
- 1 3ware 7850 and 2 Promise Ultra 100 TX2 PCI controllers
- 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
- Tuned for optimal read performance (227/174 MB/s)
- CERN receive host
- 2 3ware 7850 64-bit/33 MHz PCI IDE controllers
- 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
- Tuned for optimal write performance (295/210 MB/s)
27THUNDER RAID DETAILS
raidstop /dev/md0
mkraid -R /dev/md0
mkfs -t ext3 /dev/md0
mount -t ext2 /dev/md0 /raid0

/root/raidtab:
raiddev /dev/md0
    raid-level              0
    nr-raid-disks           12
    persistent-superblock   1
    chunk-size              512    # kbytes
    device                  /dev/sdc
    raid-disk               0
    device                  /dev/sdd
    raid-disk               1
    device                  /dev/sde
    raid-disk               2
    device                  /dev/sdf
    raid-disk               3
    device                  /dev/sdg
    raid-disk               4
    device                  /dev/sdh
    raid-disk               5
    device                  /dev/sdi
    raid-disk               6
    device                  /dev/sdj
    raid-disk               7
    device                  /dev/hde
    raid-disk               8
    device                  /dev/hdg
    raid-disk               9
    device                  /dev/hdi
    raid-disk               10
    device                  /dev/hdk
    raid-disk               11

8 drives (sdc-sdj) on the 3ware controller
4 drives (hde, hdg, hdi, hdk) on the 2 Promise controllers
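For reference only (the trial used the raidtools commands shown above), roughly the same 12-disk stripe could be assembled on a newer system with mdadm; a minimal sketch:

# Create the 12-disk RAID0 array with 512 KB chunks, same member disks
mdadm --create /dev/md0 --level=0 --raid-devices=12 --chunk=512 \
      /dev/sd[c-j] /dev/hde /dev/hdg /dev/hdi /dev/hdk

# As in the trial: format with ext3, mount as ext2 (no journal) for speed
mkfs -t ext3 /dev/md0
mount -t ext2 -o noatime /dev/md0 /raid0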
28Black Magic
- We are novices in the art of optimizing system performance
- It is also time consuming
- We followed most conventional wisdom, much of which we don't yet fully understand
29Testing Methodologies
- Began testing with a variety of bandwidth characterization tools
- pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.
- Evaluated high-performance file transfer applications
- bbftp, bbcp, tsunami, pftp
- Developed scripts to automate and to scan parameter space for a number of the tools (see the sketch below)
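As an illustration of the kind of parameter scan that was scripted, a minimal iperf sweep over TCP window sizes and parallel-stream counts might look as follows (the value ranges are illustrative, not the trial's actual script; cern-10g is the far-end host named in the traceroute later):

#!/bin/sh
# Sweep iperf window size and stream count against a host already
# running "iperf -s"; log one result block per combination
HOST=cern-10g
for WIN in 256K 1M 4M 8M; do
    for STREAMS in 1 2 4 8; do
        echo "window=$WIN streams=$STREAMS"
        iperf -c $HOST -w $WIN -P $STREAMS -t 30
    done
done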
30Disk I/O Black Magic
- min/max readahead on both systems
- sysctl -w vm.min-readahead=127
- sysctl -w vm.max-readahead=256
- bdflush on receive host
- sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
- or
- echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush
- bdflush on send host
- sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
- or
- echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush
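It is worth recording the existing values before applying these, so the stock settings can be restored later; a small sketch:

# Save the current VM tuning values for later restoration
sysctl vm.min-readahead vm.max-readahead vm.bdflush > vm-tuning.orig
cat vm-tuning.orig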
31Misc. Tuning and other tips
/sbin/elvtune -r 512 /dev/sdc   (same for the other 11 disks)
/sbin/elvtune -w 1024 /dev/sdc  (same for the other 11 disks)
-r sets the max latency that the I/O scheduler will provide on each read
-w sets the max latency that the I/O scheduler will provide on each write

When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
umount -l /raid   (a "lazy" unmount; then mount and umount again)
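A short loop of the following sort saves typing the elvtune lines for each of the 12 member disks; a sketch, not taken from the trial scripts:

#!/bin/sh
# Apply the elevator read/write latency settings to all 12 RAID disks
for DEV in /dev/sd[c-j] /dev/hde /dev/hdg /dev/hdi /dev/hdk; do
    /sbin/elvtune -r 512 -w 1024 $DEV
done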
32Disk I/O Black Magic
- Disk I/O elevators (minimal impact noticed)
- /sbin/elvtune
- Allows some control of latency vs throughput
- Read latency set to 512 (default 8192)
- Write latency set to 1024 (default 16384)
- noatime
- Disables updating the last time a file was accessed (typical for file servers)
- mount -t ext2 -o noatime /dev/md0 /raid
- Typically, ext3 writes ~90 Mbytes/sec while ext2 writes ~190 Mbytes/sec
- Reads minimally affected. We always used ext2
33Disk I/O Black Magic
- IRQ Affinity
[root@thunder root]# more /proc/interrupts
            CPU0       CPU1
   0:   15723114          0   IO-APIC-edge   timer
   1:         12          0   IO-APIC-edge   keyboard
   2:          0          0   XT-PIC         cascade
   8:          1          0   IO-APIC-edge   rtc
  10:          0          0   IO-APIC-level  usb-ohci
  14:         22          0   IO-APIC-edge   ide0
  15:     227234          2   IO-APIC-edge   ide1
  16:        126          0   IO-APIC-level  aic7xxx
  17:         16          0   IO-APIC-level  aic7xxx
  18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
  20:         14          0   IO-APIC-level  ide2, ide3
  22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
  24:          2          0   IO-APIC-level  eth3
  26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
  30:   26640812          0   IO-APIC-level  eth0
 NMI:          0          0

echo 1 > /proc/irq/18/smp_affinity                          # use CPU0
echo 2 > /proc/irq/18/smp_affinity                          # use CPU1
echo 3 > /proc/irq/18/smp_affinity                          # use either
cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity     # reset to default

Need to have PROCESS affinity as well, but this requires a 2.5 kernel
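On the 2.5/2.6 kernels that followed, process affinity became settable from user space with the schedutils taskset tool; a minimal sketch of the idea (the iperf invocation is illustrative, not from the trial):

# Pin the sending process to CPU1 (mask 0x2), leaving CPU0 free to
# service the interrupt pinned there above
taskset 0x2 iperf -c cern-10g -w 4M -t 60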
34TCP Black Magic
- Typically suggested TCP and net buffer tuning
- sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
- sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
- sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
- sysctl -w net.core.rmem_default=65535
- sysctl -w net.core.rmem_max=8388608
- sysctl -w net.core.wmem_default=65535
- sysctl -w net.core.wmem_max=8388608
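These sysctl settings are lost on reboot; the same values can be kept in /etc/sysctl.conf and reloaded with "sysctl -p". A sketch of the equivalent file fragment:

# /etc/sysctl.conf fragment matching the commands above
net.ipv4.tcp_rmem = 4096 4194304 4194304
net.ipv4.tcp_wmem = 4096 4194304 4194304
net.ipv4.tcp_mem = 4194304 4194304 4194304
net.core.rmem_default = 65535
net.core.rmem_max = 8388608
net.core.wmem_default = 65535
net.core.wmem_max = 8388608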
35TCP Black Magic
- Sylvain Ravot's tcp tune patch parameters
- sysctl -w net.ipv4.tcp_tune="115 115 0"
- Linux 2.4 retentive TCP
- Caches TCP control information for a destination for 10 mins
- To avoid caching
- sysctl -w net.ipv4.route.flush=1
36We are live continent to continent!
- e2e lightpath up and running Friday, Sept 20, 21:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
 1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
37BBFTP Transfer
- Vancouver ONS
- ons-van01(enet_15/1)
- Vancouver ONS
- ons-van01(enet_15/2)
38BBFTP Transfer
- Chicago ONS
- GigE Port 1
- Chicago ONS
- GigE Port 2
39Tsunami Transfer
- Vancouver ONS
- ons-van01(enet_15/1)
- Vancouver ONS
- ons-van01(enet_15/2)
40Tsunami Transfer
- Chicago ONS
- GigE Port 1
- Chicago ONS
- GigE Port 2
41Sunday Nite Summaries
42Exceeding 1 Gbit/sec (using Tsunami)
43What does it mean for TRIUMF in the long term?
- Established a relationship with a grid of people for future networking projects
- Upgraded WAN connection from 100 Mbit to 4 x 1 Gigabit Ethernet connections directly to BCNET:
- Canarie educational/research network
- WestGrid GRID computing
- Commercial Internet
- Spare (research and development)
- Recognition that TRIUMF has the expertise and the network connectivity for the large-scale, high-speed data transfers necessary for upcoming scientific programs: ATLAS, WestGrid, etc.
44Lessons Learned 1
- Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems
- One controller for each drive; the more disk spindles the better
- More than 2 Promise controllers per machine are possible (100/133 MHz)
- Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance
- A single 2.8 GHz CPU is likely to outperform dual 2.0 GHz CPUs for a single-purpose machine like our fileservers
- The more memory the better
45Misc. comments
- No hardware failures, even with the 50 disks!
- Largest file transferred: 114 Gbytes (Sep 24)
- Tar, compressing, etc. take longer than the transfer
- Deleting files can take a lot of time
- Low cost of project: ~$20,000, with most of that recycled
46~220 Mbytes/sec
175 Mbytes/sec
47Acknowledgements
- Canarie
- Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian
- Atlas Canada
- Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler
- HEPnet Canada
- Dean Karlen
- TRIUMF
- Renée Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / WestGrid)
- BCNET
- Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don McWilliams
48Acknowledgements
- Extreme Networks
- Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner
- Intel Corporation
- Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
49Acknowledgements
- Indiana University
- Mark Meiss, Stephen Wallace
- Caltech
- Sylvain Ravot, Harvey Newman
- CERN
- Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin
- SURFnet/Universiteit van Amsterdam
- Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
50Acknowledgements
- Yotta Yotta
- Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
- BCIT
- Bill Rutherford
- Jalaam
- Loki Jorgensen
- Netera
- Gary Finley
51ATLAS Canada
Alberta
SFU
Montreal
Victoria
UBC
Carleton
York
TRIUMF
Toronto
52LHC Data Grid Hierarchy
CERN/Outside Resource Ratio ~1:2; Tier0/(Σ Tier1)/(Σ Tier2) ~1:1:1
Online System (Experiment) at ~PByte/sec, feeding Tier 0+1 at CERN (~700k SI95, ~1 PB disk, tape robot, HPSS) at 100-400 MBytes/sec
Tier 1 centres over 2.5 Gbps links: FNAL (~200k SI95, 600 TB), IN2P3 Center, INFN Center, RAL Center
Tier 2 centres, 2.5 Gbps links
Tier 3: institutes (~0.25 TIPS); physicists work on analysis channels, each institute has ~10 physicists working on one or more channels
Tier 4 (0.1-1 Gbps links): physics data cache, workstations
Slide courtesy H. Newman (Caltech)
53The ATLAS Experiment
Canada