Title: ATLAS Canada Lightpath
1 ATLAS Canada Lightpath Data Transfer Trial
Corrie Kost, Steve McDonald (TRIUMF), Bryan Caron (University of Alberta), Wade Hong (Carleton)
6 Brownie: 2.5 TeraByte RAID array
- 16 x 160 GB IDE disks (5400 rpm, 2 MB cache)
- Hot-swap capable
- Dual Ultra160 SCSI interface to host
- Maximum transfer 65 MB/sec
- Triple hot-swap power supplies
- ~CAN$ 15k
- Arrives July 8th 2002
7 What to do while waiting for the server to arrive
- IBM PRO6850 IntelliStation (on loan)
  - Dual 2.2 GHz Xeons
  - 2 x PCI 64-bit/66 MHz
  - 4 x PCI 32-bit/33 MHz
  - 1.5 GB RAMBUS
- Add 2 Promise Ultra100 IDE controllers and 5 disks
  - Each disk on its own IDE controller for maximum IO
- Begin Linux software RAID performance tests (see the sketch after this list)
  - 170/130 MB/sec read/write
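For illustration, a minimal sketch of the kind of streaming read/write test behind numbers like these, assuming a software RAID0 device /dev/md0 mounted on /raid0 (names taken from the later RAID details slide; the ~4 GB file size is an arbitrary choice, picked to be larger than RAM so the page cache does not flatter the result):

    # Sequential write: stream ~4 GB of zeros onto the array and time it
    time dd if=/dev/zero of=/raid0/ddtest bs=1M count=4096
    # Remount to flush the page cache, then time a sequential read of the same file
    umount /raid0 && mount /dev/md0 /raid0
    time dd if=/raid0/ddtest of=/dev/null bs=1M
    rm /raid0/ddtest      # divide 4096 MB by the elapsed times for MB/sec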
8 The Long Road to High Disk IO
- IBM cluster x330s, RH7.2: disk IO 15 MB/sec (slow??)
  - Expect ~45 MB/sec for any modern single drive
- Need the 2.4.18 Linux kernel to support >1 TB filesystems
- IBM cluster x330s, RH7.3: disk IO 3 MB/sec - what is going on?
  - Red Hat's modified ServerWorks driver broke DMA on the x330s
  - The x330s has an ATA 100 drive, BUT the controller is only UDMA 33
- Promise controllers are capable of UDMA 100, but need the latest kernel patches for 2.4.18 before the drives are recognised at UDMA 100
- Finally drives and controller both working at UDMA 100: 45 MB/sec (see the hdparm sketch after this list)
- Linux software RAID0: 2 drives 90 MB/sec, 3 drives 125 MB/sec, 4 drives 155 MB/sec, 5 drives 175 MB/sec
- Now we are ready to start network transfers
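A sketch of the sort of hdparm checks involved in this kind of DMA debugging (the device name /dev/hde is illustrative; -X69 requests UDMA mode 5, i.e. UDMA 100, and should only be forced on a drive/controller pair that supports it):

    hdparm -i /dev/hde          # report the transfer modes the drive claims to support
    hdparm -d /dev/hde          # show whether DMA is currently enabled
    hdparm -d1 -X69 /dev/hde    # enable DMA and request UDMA mode 5 (UDMA 100)
    hdparm -tT /dev/hde         # time buffered and cached reads to confirm ~45 MB/sec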
10 So what are we going to do? ...what did we do?
- Demonstrate a manually provisioned e2e lightpath
- Transfer 1 TB of ATLAS MC data generated in Canada from TRIUMF to CERN
- Test out 10GbE technology and channel bonding (see the bonding sketch after this list)
- Establish a new benchmark for high performance disk-to-disk throughput over a large distance
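A minimal sketch of Linux 2.4-era channel bonding of two GbE ports, as one way the "channel bonding" item above could be configured (the interface names eth1/eth2, the bonding mode and the address are illustrative assumptions, not the configuration actually used in the trial):

    # Load the bonding driver; mode 0 = round-robin across the slave links
    modprobe bonding mode=0 miimon=100
    # Bring up the bonded interface (address is illustrative)
    ifconfig bond0 192.168.2.1 netmask 255.255.255.0 up
    # Enslave the two GbE ports to bond0
    ifenslave bond0 eth1 eth2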
11 Comparative Results (TRIUMF to CERN)
12 What is an e2e Lightpath?
- Core design principle of CA*net 4
- Ultimately, to give control of lightpath creation, teardown and routing to the end user
  - Hence, Customer Empowered Networks
- Provides a flexible infrastructure for emerging grid applications
- Alas, can only do things manually today
14 CA*net 4 Layer 1 Topology
15 The Chicago Loopback
- Need to test TCP/IP and Tsunami protocols over long distances, so arrange an optical loop via StarLight
  - (TRIUMF-BCNET-Chicago-BCNET-TRIUMF)
  - 91 ms RTT
- TRIUMF-CERN RTT is ~200 ms, so we told Damir we really needed a double loopback - no problem
  - Loopback2 was set up a few days later (RTT 193 ms; see the bandwidth-delay sketch after this list)
  - (TRIUMF-BCNET-Chicago-BCNET-Chicago-BCNET-TRIUMF)
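For reference, a back-of-envelope sketch (not from the original slides) of the bandwidth-delay product these RTTs imply for a single GbE stream, which is what drives the large TCP buffer settings shown later:

    # BDP = 1e9 bit/s * 0.193 s / 8 bytes-per-bit-group = ~24 MB in flight,
    # so the TCP send/receive windows must be of that order to fill the pipe.
    echo "1000000000 * 0.193 / 8" | bc     # prints ~24125000 (bytes)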
16 TRIUMF Server
- SuperMicro P4DL6 (dual Xeon 2 GHz)
- 400 MHz front side bus
- 1 GB DDR2100 RAM
- Dual-channel Ultra160 onboard SCSI
- SysKonnect 9843 SX GbE
- 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
- 3ware 7850 RAID controller
- 2 Promise Ultra 100 TX2 controllers
17 CERN Server
- SuperMicro P4DL6 (dual Xeon 2 GHz)
- 400 MHz front side bus
- 1 GB DDR2100 RAM
- Dual-channel Ultra160 onboard SCSI
- SysKonnect 9843 SX GbE
- 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
- 2 x 3ware 7850 RAID controllers, 6 IDE drives on each 3ware controller
- RH7.3 on a 13th drive connected to the on-board IDE
- WD Caviar 120 GB drives with 8 MByte cache
- RMC4D chassis from HARDDATA
18 TRIUMF Backup Server
- SuperMicro P4DL6 (dual Xeon 1.8 GHz)
- Supermicro 742I-420 17" 4U chassis, 420 W power supply
- 400 MHz front side bus
- 1 GB DDR2100 RAM
- Dual-channel Ultra160 onboard SCSI
- SysKonnect 9843 SX GbE
- 2 independent PCI buses, 6 PCI-X 64-bit/133 MHz capable slots
- 2 Promise Ultra 133 TX2 controllers, 1 Promise Ultra 100 TX2 controller
19 Back-to-back tests over the 12,000 km loopback using the designated servers
20 Operating System
- Red Hat 7.3 based, Linux kernel 2.4.18-3
  - Needed to support filesystems > 1 TB
- Upgrades and patches
  - Patched to 2.4.18-10
  - Intel Pro 10GbE Linux driver (early stable)
  - SysKonnect 9843 SX Linux driver (latest)
  - Ported Sylvain Ravot's TCP tuning patches
21 Intel 10GbE Cards
- Intel kindly loaned us 2 of their Pro/10GbE LR server adapter cards despite the end of their alpha program
- Based on the Intel 82597EX 10 Gigabit Ethernet Controller
- Note the length of the card!
22 Extreme Networks (figures: TRIUMF, CERN)
23 EXTREME NETWORK HARDWARE
24 IDE Disk Arrays (figures: CERN receive host, TRIUMF send host)
25 Disk Read/Write Performance
- TRIUMF send host
  - 1 x 3ware 7850 and 2 Promise Ultra 100 TX2 PCI controllers
  - 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
  - Tuned for optimal read performance (227/174 MB/s)
- CERN receive host
  - 2 x 3ware 7850 64-bit/33 MHz PCI IDE controllers
  - 12 WD 7200 rpm UDMA 100 120 GB hard drives (1.4 TB)
  - Tuned for optimal write performance (295/210 MB/s)
26 THUNDER RAID DETAILS
Commands:
    raidstop /dev/md0
    mkraid -R /dev/md0
    mkfs -t ext3 /dev/md0
    mount -t ext2 /dev/md0 /raid0

/root/raidtab:
    raiddev /dev/md0
        raid-level              0
        nr-raid-disks           12
        persistent-superblock   1
        chunk-size              512   # kbytes
        device /dev/sdc
        raid-disk 0
        device /dev/sdd
        raid-disk 1
        device /dev/sde
        raid-disk 2
        device /dev/sdf
        raid-disk 3
        device /dev/sdg
        raid-disk 4
        device /dev/sdh
        raid-disk 5
        device /dev/sdi
        raid-disk 6
        device /dev/sdj
        raid-disk 7
        device /dev/hde
        raid-disk 8
        device /dev/hdg
        raid-disk 9
        device /dev/hdi
        raid-disk 10
        device /dev/hdk
        raid-disk 11

8 drives on the 3ware controller, 4 drives on the 2 Promise controllers
27 Black Magic
- We are novices in the art of optimizing system performance
- It is also time consuming
- We followed most conventional wisdom, much of which we don't yet fully understand
28 Testing Methodologies
- Began testing with a variety of bandwidth characterization tools
  - pipechar, pchar, ttcp, iperf, netpipe, pathchar, etc.
- Evaluated high performance file transfer applications
  - bbftp, bbcp, tsunami, pftp
- Developed scripts to automate and to scan the parameter space for a number of the tools (a sketch of such a scan follows this list)
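A minimal sketch of the kind of parameter-scan script referred to above, assuming an iperf server is already running on the far host (the host name, window sizes and stream counts are illustrative; this is not the actual script used in the trial):

    #!/bin/sh
    # Scan TCP window size and parallel-stream count with iperf, logging the summary line.
    HOST=cern-10g                      # illustrative far-end host
    for win in 256K 1M 4M 8M; do
        for streams in 1 2 4 8; do
            echo "window=$win streams=$streams"
            iperf -c $HOST -w $win -P $streams -t 30 | tail -1
        done
    done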
29 Disk I/O Black Magic
- min/max readahead on both systems
  - sysctl -w vm.min-readahead=127
  - sysctl -w vm.max-readahead=256
- bdflush on the receive host
  - sysctl -w vm.bdflush="2 500 0 0 500 1000 60 20 0"
  - or
  - echo 2 500 0 0 500 1000 60 20 0 > /proc/sys/vm/bdflush
- bdflush on the send host
  - sysctl -w vm.bdflush="30 500 0 0 500 3000 60 20 0"
  - or
  - echo 30 500 0 0 500 3000 60 20 0 > /proc/sys/vm/bdflush
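The commands above take effect immediately but do not survive a reboot; a minimal sketch of the equivalent persistent form in /etc/sysctl.conf (2.4-kernel tunable names, receive-host values copied from this slide, applied with sysctl -p):

    # /etc/sysctl.conf fragment (receive host) - loaded at boot or with: sysctl -p
    vm.min-readahead = 127
    vm.max-readahead = 256
    vm.bdflush = 2 500 0 0 500 1000 60 20 0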
30 Misc. Tuning and Other Tips
- /sbin/elvtune -r 512 /dev/sdc (same for the other 11 disks)
- /sbin/elvtune -w 1024 /dev/sdc (same for the other 11 disks)
  - -r sets the max latency that the I/O scheduler will provide on each read
  - -w sets the max latency that the I/O scheduler will provide on each write
- When the /raid disk refuses to dismount (works for kernels 2.4.11 or later):
  - umount -l /raid (lazy unmount; then mount, umount)
31 Disk I/O Black Magic
- Disk I/O elevators (minimal impact noticed)
  - /sbin/elvtune
  - Allows some control of latency vs throughput
  - Read latency set to 512 (default 8192)
  - Write latency set to 1024 (default 16384)
- noatime
  - Disables updating the time a file was last accessed (typically for file servers)
  - mount -t ext2 -o noatime /dev/md0 /raid
- Typically, ext3 writes at ~90 Mbytes/sec while ext2 writes at ~190 Mbytes/sec
  - Reads are minimally affected. We always used ext2
32 Disk I/O Black Magic
- IRQ affinity
  - [root@thunder root]# more /proc/interrupts

               CPU0       CPU1
      0:   15723114          0   IO-APIC-edge   timer
      1:         12          0   IO-APIC-edge   keyboard
      2:          0          0   XT-PIC         cascade
      8:          1          0   IO-APIC-edge   rtc
     10:          0          0   IO-APIC-level  usb-ohci
     14:         22          0   IO-APIC-edge   ide0
     15:     227234          2   IO-APIC-edge   ide1
     16:        126          0   IO-APIC-level  aic7xxx
     17:         16          0   IO-APIC-level  aic7xxx
     18:         91          0   IO-APIC-level  ide4, ide5, 3ware Storage Controller
     20:         14          0   IO-APIC-level  ide2, ide3
     22:    2296662          0   IO-APIC-level  SysKonnect SK-98xx
     24:          2          0   IO-APIC-level  eth3
     26:    2296673          0   IO-APIC-level  SysKonnect SK-98xx
     30:   26640812          0   IO-APIC-level  eth0
    NMI:          0          0

- Need to have PROCESS affinity as well, but this requires a 2.5 kernel
- Steering an IRQ to a CPU:

    echo 1 > /proc/irq/18/smp_affinity                          # use CPU0
    echo 2 > /proc/irq/18/smp_affinity                          # use CPU1
    echo 3 > /proc/irq/18/smp_affinity                          # use either
    cat /proc/irq/prof_cpu_mask > /proc/irq/18/smp_affinity     # reset to default
33 TCP Black Magic
- Typically suggested TCP and net buffer tuning:
  - sysctl -w net.ipv4.tcp_rmem="4096 4194304 4194304"
  - sysctl -w net.ipv4.tcp_wmem="4096 4194304 4194304"
  - sysctl -w net.ipv4.tcp_mem="4194304 4194304 4194304"
  - sysctl -w net.core.rmem_default=65535
  - sysctl -w net.core.rmem_max=8388608
  - sysctl -w net.core.wmem_default=65535
  - sysctl -w net.core.wmem_max=8388608
34 TCP Black Magic
- Sylvain Ravot's tcp tune patch parameters
  - sysctl -w net.ipv4.tcp_tune="115 115 0"
- Linux 2.4 "retentive" TCP
  - Caches TCP control information for a destination for 10 mins
  - To avoid caching: sysctl -w net.ipv4.route.flush=1
35 We are live continent to continent!
- e2e lightpath up and running Friday Sept 20, 21:45 CET

    traceroute to cern-10g (192.168.2.2), 30 hops max, 38 byte packets
     1  cern-10g (192.168.2.2)  161.780 ms  161.760 ms  161.754 ms
36 BBFTP Transfer (figure: Vancouver ONS ons-van01 enet_15/1 and enet_15/2 traffic)
37 BBFTP Transfer (figure: Chicago ONS GigE port 1 and port 2 traffic)
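For context, a hedged sketch of what a multi-stream bbftp transfer of this kind can look like (host name, user, file paths and stream count are illustrative assumptions, not the exact invocation used in the trial):

    # Push one large file from the TRIUMF send host to the CERN receive host
    # over several parallel TCP streams.
    bbftp -u atlas -p 10 -e "put /raid0/mcdata/run001.tar /raid/mcdata/run001.tar" cern-10g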
38 Tsunami Transfer (figure: Vancouver ONS ons-van01 enet_15/1 and enet_15/2 traffic)
39 Tsunami Transfer (figure: Chicago ONS GigE port 1 and port 2 traffic)
40 Sunday Nite Summaries
41 Exceeding 1 Gbit/sec (using Tsunami)
42 What does it mean for TRIUMF in the long term?
- Established a relationship with a grid of people for future networking projects
- Upgraded WAN connection from 100 Mbit to 4 x 1 GbE connections directly to BCNET
  - CANARIE educational/research network
  - WestGrid GRID computing
  - Commercial Internet
  - Spare (research & development)
- Recognition that TRIUMF has the expertise and the network connectivity for the large scale, high speed data transfers necessary for upcoming scientific programs: ATLAS, WestGrid, etc.
43 Lessons Learned 1
- Linux software RAID is faster than most conventional SCSI and IDE hardware RAID based systems
- One controller for each drive; the more disk spindles the better
- More than 2 Promise controllers per machine possible (100/133 MHz)
- Unless programs are multi-threaded or the kernel permits process locking, dual CPUs will not give the best performance
  - A single 2.8 GHz is likely to outperform a dual 2.0 GHz for a single-purpose machine like our fileservers
- The more memory the better
44 Misc. Comments
- No hardware failures, even for the 50 disks!
- Largest file transferred: 114 Gbytes (Sep 24)
- Tar, compressing, etc. take longer than the transfer
- Deleting files can take a lot of time
- Low cost of project: ~CAN$ 20,000, with most of that recycled
45 (figure: ~220 Mbytes/sec, 175 Mbytes/sec)
46 Acknowledgements
- CANARIE
  - Bill St. Arnaud, Rene Hatem, Damir Pobric, Thomas Tam, Jun Jian
- ATLAS Canada
  - Mike Vetterli, Randall Sobie, Jim Pinfold, Pekka Sinervo, Gerald Oakham, Bob Orr, Michel Lefebvre, Richard Keeler
- HEPnet Canada
  - Dean Karlen
- TRIUMF
  - Renee Poutissou, Konstantin Olchanski, Mike Vetterli (SFU / WestGrid)
- BCNET
  - Mike Hrybyk, Marilyn Hay, Dennis O'Reilly, Don McWilliams
47 Acknowledgements
- Extreme Networks
  - Amyn Pirmohamed, Steven Flowers, John Casselman, Darrell Clarke, Rob Bazinet, Damaris Soellner
- Intel Corporation
  - Hugues Morin, Caroline Larson, Peter Molnar, Harrison Li, Layne Flake, Jesse Brandeburg
48 Acknowledgements
- Indiana University
  - Mark Meiss, Stephen Wallace
- Caltech
  - Sylvain Ravot, Harvey Newman
- CERN
  - Olivier Martin, Paolo Moroni, Martin Fluckiger, Stanley Cannon, J.P. Martin-Flatin
- SURFnet / Universiteit van Amsterdam
  - Pieter de Boer, Dennis Paus, Erik Radius, Erik-Jan Bos, Leon Gommans, Bert Andree, Cees de Laat
49 Acknowledgements
- Yotta Yotta
  - Geoff Hayward, Reg Joseph, Ying Xie, E. Siu
- BCIT
  - Bill Rutherford
- Jalaam
  - Loki Jorgensen
- Netera
  - Gary Finley
50 ATLAS Canada (map: Alberta, SFU, Montreal, Victoria, UBC, Carleton, York, TRIUMF, Toronto)
51 LHC Data Grid Hierarchy (slide courtesy H. Newman, Caltech)
- CERN/outside resource ratio ~1:2; Tier0 : (sum of Tier1) : (sum of Tier2) ~1:1:1
- Online System (experiment) to Tier 0+1 at CERN (700k SI95, 1 PB disk, tape robot, HPSS): ~1 PByte/sec off the detector, 100-400 MBytes/sec into Tier 0
- Tier 1 centres (e.g. FNAL: 200k SI95, 600 TB; IN2P3, INFN, RAL centres) linked at 2.5 Gbps
- Tier 2 centres linked at 2.5 Gbps
- Tier 3: institutes (~0.25 TIPS); each institute has ~10 physicists working on one or more analysis channels
- Tier 4: physics data caches and workstations at 0.1-1 Gbps
52 The ATLAS Experiment (figure: detector with Canadian contributions labelled "Canada")