Title: High Speed Physics Data Transfers using UltraLight
1. High Speed Physics Data Transfers using UltraLight
- Julian Bunn
- (thanks to Yang Xia and others for material in this talk)
- UltraLight Collaboration Meeting
- October 2005
2. Disk to Disk (Newisys) 2004
System Vendor: Newisys 4300 AMD Opteron Enterprise Server with 3 AMD-8131 chipsets
CPU: Quad Opteron 848, 2.2 GHz
Memory: 16 GB PC2700 DDR ECC
Network Interface: S2io 10GE in a 64-bit/133 MHz PCI-X slot
RAID Controllers: 3 x Supermicro Marvell SATA
Hard Drives: 24 x 250 GB WDC 7200 rpm SATA
OS: Win2K3 AMD64, Service Pack 1, v.1185

Result: 550 MBytes/sec
3. Tests with rootd
- Physics analysis files are typically in ROOT format
- We would like to serve these files over the network as quickly as possible
- At least three possibilities:
  - Use rootd
  - Use Clarens
  - Use a Web server
- Use of rootd is simple:
  - On the client, use root://123.456.789.012/dir/root.file
  - On the server, run rootd
4. rootd
- On the server:
  - root@dhcp-116-157 rootdata# ./rootd -p 5000 -f -noauth
  - main: running in foreground mode, sending output to stderr
  - ROOTD_PORT=5000
- On the client, add the following to .rootrc (corrects an issue in the current ROOT):
  - XNet.ConnectDomainAllowRE
  - Plugin.TFile: root TNetFile Core "TNetFile(const char*,Option_t*,const char*,Int_t,Int_t)"
- In the C++ code, access the files like this:
  - TChain *ch = new TChain("Analysis");
  - ch->Add("root://10.1.1.1:5000/../raid/rootdata/zpr200gev.mumu.root");
  - ch->Add("root://10.1.1.1:5000/../raid/rootdata/zpr500gev.mumu.root");
5. rootd (measured performance)
Compression makes a big difference: the ROOT file is 282 MBytes, but the ROOT object data amounts to 655 MBytes! Thus the physics data rate delivered to the application is roughly twice the reported network rate (655/282 ≈ 2.3; for this test the wire rate was 22 MBytes/sec, so roughly 50 MBytes/sec of object data reached the application).

Application: Real time 0:00:14, CP time 12.790
655167999 Bytes
Rootd: rd=2.81415e+08, wr=0, rx=478531, tx=2.81671e+08
  Int_t nbytes = 0, nb = 0;
  TStopwatch s;
  for (Long64_t jentry=0; jentry<nentries; jentry++) {
    Long64_t ientry = LoadTree(jentry);
    if (ientry < 0) break;
    nb = fChain->GetEntry(jentry);
    nbytes += nb;
  }
  s.Stop();
  s.Print();
  Long64_t fileBytes = gFile->GetBytesRead();
  Double_t mbSec = (Double_t)(fileBytes/1024/1024);
  mbSec /= s.RealTime();
  cout << nbytes << " Bytes (uncompressed) " << fileBytes
       << " Bytes (in file) " << mbSec << " MBytes/sec" << endl;
6. Tests with Clarens/ROOT
- Using Dimitri's analysis (ROOT files containing Higgs -> muon data at various energies)
- The ROOT client requests objects from files of size a few hundred MBytes
- In this analysis, not all the objects in the file are read, so care is required in computing the network data rate
- Clarens serves data to the ROOT client at approx. 60 MBytes/sec
- Compare with using a wget pull of the ROOT file from Clarens/Apache: 125 MBytes/sec with a cold cache, 258 MBytes/sec with a warm cache (a rough sketch of such a pull follows below)
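For reference, the wget pull can be timed without touching disk by writing to /dev/null; a minimal sketch, with hypothetical host, port and file name:

  wget -O /dev/null http://clarens-host.caltech.edu:8080/rootdata/zpr200gev.mumu.root

wget reports the achieved transfer rate on completion, which is the figure compared above.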
7. Tests with GridFTP
- GridFTP may work well, if you can manage to install it and work within its security constraints
- Michael Thomas' experience:
  - Installed on a laptop successfully, but needed a Grid certificate for the host, and a reverse DNS lookup. Had neither, so couldn't use it
  - Installed on osg-discovery.caltech.edu successfully, but could not use it for testing, since it is a production machine
  - Attempted an install on the UltraLight dual-core Opterons at Caltech, but no host certificates, no reverse lookup, and no support for x86_64
- Summary: installation/deployment constraints severely restrict the usefulness of GridFTP
8. Tests with bbftp
- bbftp is supported by IN2P3
  - The time difference makes support less interactive than for bbcp
- Operates with an FTP-like client/server setup
- Tested bbftp v3.2.0 between LAN Opterons
- Example localhost copy:
  - bbftp -e 'put /tmp/julian/example.session /tmp/julian/junk.dat' localhost -u root
- Some problems:
  - Segmentation faults when using IP numbers rather than names (an x86_64 issue?)
  - Transfers fail with a reported routing error, but the routes are OK
  - By default, files are copied to a temporary location on the target machine, then copied to the correct location. This is not what is wanted when targeting a high speed RAID array! It can be avoided with "setoption notmpfile" (see the sketch after this list)
  - Sending files to /dev/null did not seem to work:
    >> USER root PASS
    << bbftpd version 3.2.0 OK
    >> COMMAND : setoption notmpfile
    << OK
    >> COMMAND : put OneGB.dat /dev/null
    BBFTP-ERROR-00100 : Disk quota excedeed or No Space left on device
    << Disk quota excedeed or No Space left on device
9. bbcp
- http://www.slac.stanford.edu/~abh/bbcp/
- Developed as a tool for BaBar file transfers
- The work of Andy Hanushevsky (SLAC)
- Peer-to-peer architecture; supports third party transfers
- Simple to install: just need the bbcp executable in the path on the remote machine(s)
- Works with all standard methods of authentication
10. Tests with bbcp
- The goal is to transfer data files at 10 Gbits/sec in the WAN
- We use Opteron systems with two dual-core CPUs each, 8GB or 16GB RAM, S2io 10Gbit NICs, and RHEL with a 2.6 kernel
- We use a stepwise approach, starting with the easiest data transfers (a command sketch follows this list):
  - Memory to bit bucket (/dev/zero to /dev/null)
  - Ramdisk to bit bucket (/mnt/rd to /dev/null)
  - Ramdisk to ramdisk (/mnt/rd to /mnt/rd)
  - Disk to bit bucket (/disk/file to /dev/null)
  - Disk to ramdisk
  - Disk to disk
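The stepwise tests map onto bbcp invocations like the following; a minimal sketch with illustrative stream count, window size, file names and hostname, not the exact commands used (bbcp's -s sets the number of streams, -w the window size, -P the progress interval in seconds):

  # ramdisk to bit bucket
  bbcp -s 2 -w 8m -P 2 /mnt/rd/bigfile.dat remotehost:/dev/null

  # ramdisk to ramdisk
  bbcp -s 2 -w 8m -P 2 /mnt/rd/bigfile.dat remotehost:/mnt/rd/

  # disk to disk between the RAID arrays
  bbcp -s 2 -w 8m -P 2 /raid/bigfile.dat remotehost:/raid/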
11. bbcp LAN Rates
- Goal: bbcp rates should match or exceed iperf rates
- Single bbcp process:
  - a) 1 stream: max rate 523 MBytes/sec
  - b) 2 streams: max rate 522 MBytes/sec
  - c) 4 streams: max rate 473 MBytes/sec
  - d) 8 streams: max rate 460 MBytes/sec
  - e) 16 streams: max rate 440 MBytes/sec
  - f) 32 streams: max rate 417 MBytes/sec
- 3 simultaneous bbcp processes:
  - P1) bbcp: At 050922 08:58:14 copy 99% complete; 348432.0 KB/s
  - P2) bbcp: At 050922 08:58:15 copy 54% complete; 192539.5 KB/s
  - P3) bbcp: At 050922 08:58:15 copy 30% complete; 194359.9 KB/s
  - Aggregate utilization of 735 MBytes/sec (6 Gbits/sec)
- Conclusion: bbcp can match iperf in the LAN. Use one or two streams, and several bbcp processes if you can (a sketch follows below)
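A minimal sketch of launching several single-stream bbcp processes in parallel; file names and hostname are illustrative:

  # three concurrent single-stream copies, progress reported every second
  for f in file1.dat file2.dat file3.dat; do
    bbcp -s 1 -P 1 /mnt/rd/$f remotehost:/dev/null &
  done
  wait   # block until all three copies finish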
12. bbcp WAN Rates
Memory to memory (sender running FAST TCP, with Web100): 785 MBytes/sec
13. Performance Killers
- Checklist (items 2, 4 and 9 are gathered into a sketch after this list):
  - 1) Make sure you're using the right interface! Check with ifconfig
  - 2) Do a cat /proc/sys/net/ipv4/tcp_rmem and make sure the numbers are big, like 1610612736 1610612736 1610612736
  - 3) If not, tune the interface using /usr/local/src/s2io//s2io_perf.sh
  - 4) Flush existing routes: sysctl -w net.ipv4.route.flush=1
  - 5) Sometimes a route has to be configured manually, and added to /etc/sysconfig/network-scripts/route-ethX for the future
  - 6) Sometimes commands like sysctl and ifconfig are not in the PATH
  - 7) Check the route is OK with traceroute in both directions
  - 8) Check the machine is reachable with ping
  - 9) Sometimes the 10Gbit adapter does not have a 9000 byte MTU... but instead has the default of 1500
  - 10) If in doubt, reboot
  - 11) If still in doubt, rebuild your application, and go to 10)
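Items 2, 4 and 9 gathered into one sketch; the interface name eth2 is illustrative, and the matching tcp_wmem setting is an assumption by analogy, not part of the checklist:

  # item 2: enlarge the TCP buffers (tcp_wmem added by analogy, as an assumption)
  sysctl -w net.ipv4.tcp_rmem="1610612736 1610612736 1610612736"
  sysctl -w net.ipv4.tcp_wmem="1610612736 1610612736 1610612736"

  # item 4: flush existing cached routes
  sysctl -w net.ipv4.route.flush=1

  # item 9: enable jumbo frames on the 10GE interface
  ifconfig eth2 mtu 9000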
14. Ramdisks and the SHC
- Avoid disk I/O by using ramdisks; it works:
  - mount -t ramfs none /mnt/rd
- Allows physics data files to be placed in system RAM
- Finesses the new Bandwidth Challenge rule disallowing iperf/artificial data
- In CACR's new Shared Heterogeneous Cluster (>80 dual Opteron HP nodes) we intend to populate ramdisks on all nodes with ROOT files, and transfer them using bbcp to nodes in the Caltech booth at SC2005 (a staging sketch follows below)
- The SHC is connected to the WAN via a Black Diamond switch, with two bonded 10Gbit links to Caltech's UltraLight Force10.
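A minimal per-node staging sketch; the ROOT file and booth-node names are illustrative:

  # create the ramdisk and stage a ROOT file into RAM
  mount -t ramfs none /mnt/rd
  cp /raid/rootdata/zpr200gev.mumu.root /mnt/rd/

  # push it across the WAN to a booth node's ramdisk with bbcp
  bbcp -s 2 -P 2 /mnt/rd/zpr200gev.mumu.root boothnode:/mnt/rd/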
15. SC2005 Bandwidth Challenge
- The Caltech-CERN-Florida-FNAL-Michigan-Manchester-SLAC entry will demonstrate high speed transfers of physics data between host labs and collaborating institutes in the USA and worldwide. Caltech and FNAL are major participants in the CMS collaboration at CERN's Large Hadron Collider (LHC). SLAC is the host of the BaBar collaboration. Using state of the art WAN infrastructure and Grid-based Web Services based on the LHC Tiered Architecture, our demonstration will show real-time particle event analysis requiring transfers of Terabyte-scale datasets. We propose to saturate at least fifteen lambdas at Seattle, full duplex (potentially over 300 Gbps of scientific data). The lambdas will carry traffic between SLAC, Caltech and other partner Grid Service sites including UKlight, UERJ, FNAL and AARnet. We will monitor the WAN performance using Caltech's MonALISA agent-based system. The analysis software will use a suite of Grid-enabled Analysis tools developed at Caltech and the University of Florida. There will be a realistic mixture of streams: those due to the transfer of the Terabyte event datasets, and those due to a set of background flows of varied character absorbing the remaining capacity. The intention is to simulate the environment in which distributed physics analysis will be carried out at the LHC. We expect to easily beat our SC2004 record of 100 Gbits/sec (roughly equivalent to downloading 1000 DVDs in less than an hour).
16-20. (No transcript: image-only slides)
21. Summary
- Seeking the fastest ways of moving physics data in the 10 Gbps WAN
- The disk to disk WAN record is held by the Newisys machines in 2004: >500 MBytes/sec
- ROOT files can be served to ROOT clients at decent rates (>60 MBytes/sec). ROOT compression helps by a factor of >2
- ROOT files can be served by rootd, xrootd, Clarens, and vanilla Web servers
- For file transfers, bbftp and GridFTP are hard to deploy and test
- bbcp is easy to deploy, well supported, and can match iperf speeds in the LAN (7 Gbits/sec) and the WAN (6.3 Gbits/sec) for memory to memory data transfers
- Optimistically, bbcp should be able to copy disk-resident files in the WAN at the same speeds, given:
  - Powerful servers
  - Fast disks
- Although we are not there yet, we are aiming to be by SC2005!