Transcript and Presenter's Notes

Title: Atlas Canada Lightpath


1
Atlas Canada Lightpath Data Transfer Trial
Corrie Kost, Steve McDonald (TRIUMF) Bryan Caron
(UofAlberta), Wade Hong (Carleton)
2
ATLAS CANADA TRIUMF-CERN LIGHTPATH DATA
TRANSFER TRIAL FOR IGRID2002
Two 1Gigabit optical fibre circuits (colours)
  • What was accomplished?
  • Established relationship with grid of people
    for future networking projects
  • Demonstrated a manually provisioned 12,000Km
    lightpath
  • Transferred 1TB of ATLAS Monte-Carlo data to CERN
    (equiv. to 1500 CDs)
  • Established record rates (1 CD in 8 seconds or 1
    DVD in <60 seconds)
  • Demonstrated innovative use of existing
    technology
  • Largely used low-cost commodity software and
    hardware.
  • Participants
  • TRIUMF
  • University of Alberta
  • Carleton
  • CERN
  • Canarie
  • BCNET
  • SURFnet
  • Acknowledgements
  • Netera
  • Atlas Canada
  • WestGrid
  • HEPnet Canada
  • Indiana University
  • Caltech
  • Extreme Networks
  • Intel Corporation

3
(No Transcript)
4
(No Transcript)
5
(No Transcript)
6
(No Transcript)
7
Brownie 2.5 TeraByte RAID array
  • 16 x 160 GB IDE disks (5400 rpm 2MB cache)
  • hot swap capable
  • Dual ultra160 SCSI interface to host
  • Maximum transfer 65 MB/sec
  • Triple hot swap power supplies
  • CAN$15k
  • Arrives July 8th 2002

8
What to do while waiting for the server to arrive
  • IBM PRO6850 Intellistation (Loan)
  • Dual 2.2 GHz Xeons
  • 2 PCI 64bit/66MHz
  • 4 PCI 32bit/33MHz
  • 1.5 GB RAMBUS
  • Add 2 Promise Ultra100
  • IDE controllers and 5 Disks
  • Each disk on its own IDE controller for maximum
    IO
  • Begin Linux Software RAID performance tests
    170/130 MB/sec Read/Write

9
The Long Road to High Disk IO
  • IBM cluster x330s RH7.2 disk io 15 MB/sec
    (slow??)
  • expect 45 MB/sec for any modern single drive
  • Need 2.4.18 Linux kernel to support >1TB
    filesystems
  • IBM cluster x330s RH7.3 disk io 3 MB/sec
  • What is going on?
  • Red Hat modified serverworks driver broke DMA on
    x330s
  • x330s ATA 100 drive, BUT controller is only
    UDMA 33
  • Promise controllers capable of UDMA 100 but need
    latest kernel patches for 2.4.18 before drives
    recognise UDMA100
  • Finally drives/controller both working at
    UDMA100 45MB/sec
  • Linux software raid0: 2 drives 90 MB/sec, 3
    drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives
    175 MB/sec (a benchmark sketch follows this list)
  • Now we are ready to start network transfers

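A minimal sketch of the kind of raw-throughput check behind the numbers above, assuming the striped array is mounted at /raid0 (a hypothetical path here) and using a test file far larger than the 1.5 GB of RAM so the page cache does not flatter the result; hdparm and dd are the standard tools on RH7.x.

# Buffered and cached sequential read rate of one drive
hdparm -tT /dev/hde

# Sequential write to the software RAID (10 GB, well above RAM size)
time dd if=/dev/zero of=/raid0/testfile bs=1024k count=10000

# Sequential read of the same file back
time dd if=/raid0/testfile of=/dev/null bs=1024k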
10
(No Transcript)
11
So what are we going to do? (did we do?)
----------------------------------
  • Demonstrate a manually provisioned e2e
    lightpath
  • Transfer 1TB of ATLAS MC data generated in Canada
    from TRIUMF to CERN
  • Test out 10GbE technology and channel bonding
  • Establish a new benchmark for high performance
    disk to disk throughput over a large distance

12
Comparative Results (TRIUMF to CERN)
13
What is an e2e Lightpath
  • Core design principle of CA*net 4
  • Ultimately to give control of lightpath
    creation, teardown and routing to the end
    user
  • Hence, Customer Empowered Networks
  • Provides a flexible infrastructure for emerging
    grid applications
  • Alas, can only do things manually today

14
(No Transcript)
15
CA*net 4 Layer 1 Topology
16
The Chicago Loopback
  • Need to test TCP/IP and Tsunami protocols over
    long distances, arrange optical loop via
    StarLight
  • ( TRIUMF-BCNET-Chicago-BCNET-TRIUMF )
  • 91ms RTT
  • TRIUMF - CERN RTT is 200ms. Told Damir we really
    needed to have a double loopback
  • No problem
  • The double loopback was set up a few days later
    (RTT 193ms); an RTT check sketch follows this list
  • (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)

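A quick sanity check of the loopback RTT before running the transfer tools; the address below is a hypothetical placeholder for the far end of the loop, not one recorded in the slides.

# 20 ICMP echo probes; the avg figure in the summary line is the RTT
ping -c 20 10.1.1.2        # placeholder far-end address

# a traceroute over the lightpath should show a single hop with the same RTT
traceroute 10.1.1.2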
17
TRIUMF Server
  • SuperMicro P4DL6 (Dual Xeon 2GHz)
  • 400 MHz front side bus
  • 1 GB DDR2100 RAM
  • Dual Channel Ultra 160 onboard SCSI
  • SysKonnect 9843 SX GbE
  • 2 independent PCI buses, 6 PCI-X 64 bit/133 MHz capable
  • 3ware 7850 RAID controller
  • 2 Promise Ultra 100 TX2 controllers
18
CERN Server
  • SuperMicro P4DL6 (Dual Xeon 2GHz)
  • 400 MHz front side bus
  • 1 GB DDR2100 RAM
  • Dual Channel Ultra 160 onboard SCSI
  • SysKonnect 9843 SX GbE
  • 2 independent PCI buses, 6 PCI-X 64 bit/133 MHz capable
  • 2 3ware 7850 RAID controllers
  • 6 IDE drives on each 3ware controller
  • RH7.3 on 13th drive connected to on-board IDE
  • WD Caviar 120GB drives with 8MByte cache
  • RMC4D from HARDDATA
19
TRIUMF Backup Server
  • SuperMicro P4DL6 (Dual Xeon 1.8GHz)
  • Supermicro 742I-420 17 4U chassis, 420W power supply
  • 400 MHz front side bus
  • 1 GB DDR2100 RAM
  • Dual Channel Ultra 160 onboard SCSI
  • SysKonnect 9843 SX GbE
  • 2 independent PCI buses, 6 PCI-X 64bit/133 MHz capable
  • 2 Promise Ultra 133 TX2 controllers
  • 1 Promise Ultra 100 TX2 controller
20
Back-to-back tests over 12,000km loopback using
designated servers
21
Operating System
  • Red Hat 7.3 based, Linux kernel 2.4.18-3
  • Needed to support filesystems > 1TB
  • Upgrades and patches
  • Patched to 2.4.18-10
  • Intel Pro 10GbE Linux driver (early stable)
  • SysKonnect 9843 SX Linux driver (latest)
  • Ported Sylvain Ravot's TCP tuning patches

22
Intel 10GbE Cards
  • Intel kindly loaned us 2 of their Pro/10GbE LR
    server adapter cards despite the end of their
    Alpha program
  • based on Intel 82597EX 10 Gigabit Ethernet
    Controller
  • Note length of card!

23
Extreme Networks
TRIUMF
CERN
24
EXTREME NETWORK HARDWARE
25
IDE Disk Arrays
CERN Receive Host
TRIUMF Send Host
26
Disk Read/Write Performance
  • TRIUMF send host
  • 1 3ware 7850 and 2 Promise Ultra 100TX2 PCI
    controllers
  • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4
    TB)
  • Tuned for optimal read performance (227/174 MB/s)
  • CERN receive host
  • 2 3ware 7850 64-bit/33 MHz PCI IDE controllers
  • 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4
    TB)
  • Tuned for optimal write performance (295/210 MB/s)

27
THUNDER RAID DETAILS
raidstop /dev/md0
mkraid -R /dev/md0
mkfs -t ext3 /dev/md0
mount -t ext2 /dev/md0 /raid0

/root/raidtab:
raiddev /dev/md0
    raid-level              0
    nr-raid-disks           12
    persistent-superblock   1
    chunk-size              512    # kbytes
    device    /dev/sdc
    raid-disk 0
    device    /dev/sdd
    raid-disk 1
    device    /dev/sde
    raid-disk 2
    device    /dev/sdf
    raid-disk 3
    device    /dev/sdg
    raid-disk 4
    device    /dev/sdh
    raid-disk 5
    device    /dev/sdi
    raid-disk 6
    device    /dev/sdj
    raid-disk 7
    device    /dev/hde
    raid-disk 8
    device    /dev/hdg
    raid-disk 9
    device    /dev/hdi
    raid-disk 10
    device    /dev/hdk
    raid-disk 11

8 drives on 3-ware
4 drives on 2 Promise
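For comparison only, a sketch of the same 12-disk stripe built with mdadm instead of the raidtools/raidtab shown above; device names and chunk size are taken from the raidtab.

# Create the striped array (RAID0, 512 KB chunks, 12 members)
mdadm --create /dev/md0 --level=0 --chunk=512 --raid-devices=12 \
    /dev/sd[c-j] /dev/hde /dev/hdg /dev/hdi /dev/hdk

# Filesystem and mount, mirroring the commands above
mkfs -t ext3 /dev/md0
mount -t ext2 -o noatime /dev/md0 /raid0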
28
Black Magic
  • We are novices in the art of optimizing system
    performance
  • It is also time consuming
  • We followed most conventional wisdom, much of
    which we don't yet fully understand

29
Testing Methodologies
  • Began testing with a variety of bandwidth
    characterization tools
  • pipechar, pchar, ttcp, iperf, netpipe, pathchar,
    etc.
  • Evaluated high performance file transfer
    applications
  • bbftp, bbcp, tsunami, pftp
  • Developed scripts to automate and to scan
    parameter space for a number of the tools (an
    iperf scan sketch follows this list)

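A minimal sketch of such a parameter-scanning script, assuming classic iperf with a server (iperf -s) already running on the far end; the host name cern-10g is taken from the traceroute slide later in the deck.

#!/bin/sh
# Scan TCP window sizes against the far-end iperf server
for win in 256K 512K 1M 2M 4M 8M; do
    echo "=== TCP window $win ==="
    iperf -c cern-10g -w $win -t 30
done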
30
Disk I/O Black Magic
  • min / max read ahead on both systems
  • sysctl -w vm.min-readahead=127
  • sysctl -w vm.max-readahead=256
  • bdflush on receive host
  • sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
  • or
  • echo 2 500 0 0 500 1000 60 20 0 >/proc/sys/vm/bdflush
  • bdflush on send host
  • sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
  • or
  • echo 30 500 0 0 500 3000 60 20 0 >/proc/sys/vm/bdflush
  • (a /etc/sysctl.conf persistence sketch follows this list)

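A sketch of persisting the receive-host values above in /etc/sysctl.conf so they survive a reboot (loaded at boot, or immediately with sysctl -p); the send host would use the "30 ... 3000" bdflush line instead.

# /etc/sysctl.conf fragment, receive host
vm.min-readahead = 127
vm.max-readahead = 256
vm.bdflush = 2 500 0 0 500 1000 60 20 0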
31
Misc. Tuning and other tips
/sbin/elvtune -r 512 /dev/sdc (same for the other 11 disks)
/sbin/elvtune -w 1024 /dev/sdc (same for the other 11 disks;
a loop over all 12 follows below)
-r sets the max latency that the I/O scheduler will provide on each read
-w sets the max latency that the I/O scheduler will provide on each write

When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
umount -l /raid (lazy unmount; then mount and umount again)
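A minimal loop applying the same elvtune settings to all twelve array members, with device names taken from the raidtab slide.

#!/bin/sh
# Set the read/write latency targets on every member of the stripe
for dev in /dev/sd[c-j] /dev/hde /dev/hdg /dev/hdi /dev/hdk; do
    /sbin/elvtune -r 512 -w 1024 $dev
done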
32
Disk I/O Black Magic
  • Disk I/O elevators (minimal impact noticed)
  • /sbin/elvtune
  • Allows some control of latency vs throughput
  • Read_latency set to 512 (default 8192)
  • Write_latency set to 1024 (default 16384)
  • atime
  • Disables updating the last time a file has been
    accessed (typically for file servers)
  • mount -t ext2 -o noatime /dev/md0 /raid (an fstab
    sketch follows this list)
  • Typically, ext3 writes ~90 Mbytes/sec while ext2
    writes ~190 Mbytes/sec
  • Reads minimally affected. We always used ext2

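A sketch of the matching /etc/fstab line so the noatime ext2 mount comes back after a reboot; device and mount point as on the slide.

# /etc/fstab entry for the striped array
/dev/md0    /raid    ext2    defaults,noatime    0 0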
33
Disk I/O Black Magic
  • IRQ Affinity
  • [root@thunder root]# more /proc/interrupts

            CPU0       CPU1
   0:   15723114          0   IO-APIC-edge   timer
   1:         12          0   IO-APIC-edge   keyboard
   2:          0          0   XT-PIC         cascade
   8:          1          0   IO-APIC-edge   rtc
  10:          0          0   IO-APIC-level  usb-ohci
  14:         22          0   IO-APIC-edge   ide0
  15:     227234          2   IO-APIC-edge   ide1
  16:        126          0   IO-APIC-level  aic7xxx
  17:         16          0   IO-APIC-level  aic7xxx
  18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
  20:         14          0   IO-APIC-level  ide2, ide3
  22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
  24:          2          0   IO-APIC-level  eth3
  26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
  30:   26640812          0   IO-APIC-level  eth0
 NMI:          0          0

Need to have PROCESS affinity - but this requires a 2.5 kernel

echo 1 >/proc/irq/18/smp_affinity                        # use CPU0
echo 2 >/proc/irq/18/smp_affinity                        # use CPU1
echo 3 >/proc/irq/18/smp_affinity                        # use either
cat /proc/irq/prof_cpu_mask >/proc/irq/18/smp_affinity   # reset to default
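One plausible pinning based on the IRQ numbers in the listing above (disk controllers to CPU0, the two SysKonnect GbE ports to CPU1); this split is our illustration, not necessarily what was used during the trial.

# 3ware / Promise IDE controller interrupts to CPU0 (mask 0x1)
echo 1 >/proc/irq/18/smp_affinity
echo 1 >/proc/irq/20/smp_affinity

# SysKonnect GbE interrupts to CPU1 (mask 0x2)
echo 2 >/proc/irq/22/smp_affinity
echo 2 >/proc/irq/26/smp_affinity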
34
TCP Black Magic
  • Typically suggested TCP and net buffer tuning
  • sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
  • sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
  • sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
  • sysctl -w net.core.rmem_default=65535
  • sysctl -w net.core.rmem_max=8388608
  • sysctl -w net.core.wmem_default=65535
  • sysctl -w net.core.wmem_max=8388608

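A back-of-the-envelope check (ours, not from the slides) of what these buffer sizes imply: single-stream TCP throughput is bounded by window/RTT, which is one reason multi-stream bbftp and the UDP-based Tsunami appear in the transfer tests.

# Bandwidth-delay product of the 193 ms loopback path at 1 Gbit/s:
#   125 MB/s x 0.193 s  =  ~24 MB in flight
# A single stream capped at a 4 MB window can therefore carry at most
#   4 MB / 0.193 s  =  ~21 MB/s  (~170 Mbit/s)
echo "scale=1; 4 / 0.193" | bc   # MB/s per 4 MB-window stream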
35
TCP Black Magic
  • Sylvain Ravot's tcp tune patch parameters
  • sysctl -w net.ipv4.tcp_tune="115 115 0"
  • Linux 2.4 retentive TCP
  • Caches TCP control information for a destination
    for 10 mins
  • To avoid caching
  • sysctl -w net.ipv4.route.flush=1

36
We are live continent to continent!
  • e2e lightpath up and running Friday Sept 20,
    21:45 CET

traceroute to cern-10g (192.168.2.2), 30 hops
max, 38 byte packets 1 cern-10g (192.168.2.2)
161.780 ms 161.760 ms 161.754 ms
37
BBFTP Transfer
  • Vancouver ONS
  • ons-van01(enet_15/1)
  • Vancouver ONS
  • ons-van01(enet_15/2)

38
BBFTP Transfer
  • Chicago ONS
  • GigE Port 1
  • Chicago ONS
  • GigE Port 2

39
Tsunami Transfer
  • Vancouver ONS
  • ons-van01(enet_15/1)
  • Vancouver ONS
  • ons-van01(enet_15/2)

40
Tsunami Transfer
  • Chicago ONS
  • GigE Port 1
  • Chicago ONS
  • GigE Port 2

41
Sunday Nite Summaries
42
Exceeding 1 Gbit/sec (using Tsunami)
43
What does it mean for TRIUMF in the long term?
  • Established a relationship with a grid of
    people for future networking projects
  • Upgraded WAN connection from 100 Mbit to
  • 4 x 1 GbE connections directly to BCNET
  • Canarie educational/research network
  • Westgrid GRID computing
  • Commercial Internet
  • Spare (research & development)
  • Recognition that TRIUMF has the expertise and the
    network connectivity for the large-scale, high-speed
    data transfers necessary for upcoming scientific
    programs (ATLAS, WESTGRID, etc.)

44
Lessons Learned 1
  • Linux software RAID faster than most conventional
    SCSI and IDE hardware RAID based systems.
  • One controller for each drive; the more disk
    spindles the better
  • More than 2 Promise controllers per machine
    possible (100/133 MHz)
  • Unless programs are multi-threaded or the kernel
    permits process locking, dual CPUs will not give
    the best performance.
  • A single 2.8 GHz is likely to outperform a dual 2.0
    GHz for a single-purpose machine like our
    fileservers.
  • The more memory the better

45
Misc. comments
  • No hardware failure even for the 50 disks!
  • Largest file transferred 114 Gbytes (Sep 24)
  • Tar, compression, etc. take longer than the transfer
  • Deleting files can take a lot of time
  • Low cost of project - $20,000 with most of that
    recycled

46
(Chart: ~220 Mbytes/sec and 175 Mbytes/sec)
47
Acknowledgements
  • Canarie
  • Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas
    Tam, Jun Jian
  • Atlas Canada
  • Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka
    Sinervo, Gerald Oakham, Bob Orr, Michel Lefebrve,
    Richard Keeler
  • HEPnet Canada
  • Dean Karlen
  • TRIUMF
  • Renee Poutissou, Konstantin Olchanski, Mike
    Vetterli (SFU / Westgrid)
  • BCNET
  • Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don
    McWilliams

48
Acknowledgements
  • Extreme Networks
  • Amyn Pirmohamed, Steven Flowers, John Casselman,
    Darrell Clarke, Rob Bazinet, Damaris Soellner
  • Intel Corporation
  • Hugues Morin, Caroline Larson, Peter Molnar,
    Harrison Li, Layne Flake, Jesse Brandeburg

49
Acknowledgements
  • Indiana University
  • Mark Meiss, Stephen Wallace
  • Caltech
  • Sylvain Ravot, Harvey Newman
  • CERN
  • Olivier Martin, Paolo Moroni, Martin Fluckiger,
    Stanley Cannon, J.P. Martin-Flatin
  • SURFnet/Universiteit van Amsterdam
  • Pieter de Boer, Dennis Paus, Erik Radius,
    Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de
    Laat

50
Acknowledgements
  • Yotta Yotta
  • Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
  • BCIT
  • Bill Rutherford
  • Jalaam
  • Loki Jorgensen
  • Netera
  • Gary Finley

51
ATLAS Canada
Alberta
SFU
Montreal
Victoria
UBC
Carleton
York
TRIUMF
Toronto
52
LHC Data Grid Hierarchy
(Tier diagram; slide courtesy H. Newman, Caltech)
  • CERN/Outside resource ratio ~1:2; Tier0 / (Σ Tier1) / (Σ Tier2) ~1:1:1
  • Experiment -> ~PByte/sec -> Online System -> 100-400 MBytes/sec ->
    Tier 0+1 at CERN (~700k SI95, ~1 PB disk, tape robot, HPSS)
  • Tier 1 (2.5 Gbps links): FNAL (~200k SI95, 600 TB), IN2P3 Center,
    INFN Center, RAL Center
  • Tier 2 (2.5 Gbps links)
  • Tier 3 (2.5 Gbps links): Institutes (~0.25 TIPS); physicists work on
    analysis channels, each institute has ~10 physicists working on one
    or more channels
  • Tier 4 (0.1-1 Gbps): physics data cache, workstations
53
The ATLAS Experiment
Canada