Title: Storage resources management and access at TIER1 CNAF
1. Storage resources management and access at TIER1 CNAF
Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo
on behalf of INFN TIER1 staff - pierpaolo.ricci@cnaf.infn.it
- ACAT 2005
- May 22-27 2005
- DESY Zeuthen, Germany
2. TIER1 INFN CNAF Storage
- HSM (400 TB):
  - STK L5500 robot (5500 slots), 6 IBM LTO-2 drives, 2 (4) STK 9940B drives
  - STK180 with 100 LTO-1 (10 TByte native)
  - CASTOR HSM servers (H.A.)
  - W2003 Server with LEGATO Networker (backup)
- NAS (20 TB):
  - NAS1, NAS4: 3ware IDE SAS, 1800/3200 GByte
  - PROCOM 3600 FC NAS2, 9000 GByte
  - PROCOM 3600 FC NAS3, 4700 GByte
- SAN 1 (200 TB) and SAN 2 (40 TB):
  - Diskservers with Qlogic FC HBA 2340
  - IBM FastT900 (DS 4500), 3/4 x 50000 GByte, 4 FC interfaces
  - STK BladeStore, about 25000 GByte, 4 FC interfaces
  - AXUS BROWIE, about 2200 GByte, 2 FC interfaces
  - Infortrend A16F-R1A2-M1, 4 x 3200 GByte SATA
  - Infortrend A16F-R1211-M2, 5 x 6400 GByte SATA (JBOD)
  - 2 Brocade Silkworm 3900 32 port FC switches
  - 2 Gadzoox Slingshot 4218 18 port FC switches
- Clients: Linux SL 3.0 (100-1000 nodes), access over the WAN or TIER1 LAN via NFS, RFIO, GridFTP and other protocols
3. CASTOR HSM
- Point-to-point FC 2Gb/s connections
- STK L5500 library: 2000+3500 mixed slots, 6 LTO-2 drives (20-30 MB/s each), 2 9940B drives (25-30 MB/s each), 1300 LTO-2 cartridges (200 GB native), 650 9940B cartridges (200 GB native)
- 8 tapeservers, Linux RH AS 3.0, Qlogic 2300 HBA
- ACSLS 7.0 running on a Sun Blade v100 (2 internal IDE disks in software RAID-0), Solaris 9.0 OS
- 1 CASTOR (CERN) central services server, RH AS 3.0
- 1 ORACLE 9i rel. 2 DB server, RH AS 3.0
- 6 stagers with diskserver, RH AS 3.0, 15 TB local staging area
- 8 or more rfio diskservers, RH AS 3.0, min. 20 TB staging area
- SAN 1 and SAN 2 attached with full-redundancy FC 2Gb/s connections (dual controller HW and Qlogic SANsurfer Path Failover SW); access from the WAN or TIER1 LAN

EXPERIMENT           Staging area (TB)   Tape pool (TB native)
ALICE                8                   12
ATLAS                6                   20
CMS                  2                   15
LHCb                 18                  30
BABAR, AMS, others   2                   4
4. CASTOR HSM (2)
- In general we obtained:
  - Good performance when writing into the staging area (disk buffer) and from the staging area to tapes (2 parallel streams on tape give about 40 MB/s)
  - Generally good reliability of the stager service (every LHC experiment has its own dedicated stager and policies) and high reliability of the central CASTOR services
  - Bad reliability of the LTO-2 drives in both writing and reading. This results in tapes marked read-only or disabled when writing, and in locking or failures when trying to stage in files in random order.
- Together with the experiment coordination we could trigger a temporary increase of the staging area (disk buffer) and an optimized sequential stage-in of the data just before the analysis phase. The analysis jobs could then run directly over rfio or Grid tools on CASTOR with a high probability of finding the files already on disk (LHCb). After the end of the analysis phase the disk buffer could be re-assigned to another experiment.
- We decided to acquire and use more STK 9940B drives for random access to the data.
- Access to the CASTOR HSM system is:
  - Direct, using rf<command> directly on the user interfaces or on the WNs (rfcp, rfrm or the API...)
  - Through front-ends with a GridFTP interface to CASTOR, and SRM
5. DISK access
- Generic diskserver: Supermicro 1U, 2 Xeon 3.2 GHz, 4 GB RAM, GB Ethernet, 1 or 2 Qlogic 2300 HBA, Linux AS or CERN SL 3.0 OS
- GB Ethernet connections to the WAN or TIER1 LAN serving nfs, rfio, xrootd, GPFS and GridFTP
- Clients: farms of rack-mountable 1U biprocessor nodes (currently about 1000 nodes for 1300 kSpecInt2000)
- 1 or 2 2Gb FC connections for every diskserver; the storage LUNs appear as local devices (LUN0 -> /dev/sda, LUN1 -> /dev/sdb, ...)
- 2 Brocade Silkworm 3900 32 port FC switches, zoned (one 50 TB unit with 4 diskservers), 2 x 2Gb interlink connections
- 50 TB IBM FastT 900 (DS 4500) with dual redundant controllers (A, B) and internal MiniHubs (1, 2), RAID5 arrays exported as 2 TB logical disks (LUN0, LUN1, ...)
- FC path failover HA:
  - Qlogic SANsurfer
  - IBM or STK RDAC for Linux
- Application HA:
  - NFS server, rfio server with Red Hat Cluster AS 3.0 (*)
  - GPFS with NSD primary/secondary configuration (see the sketch after this list):
    - /dev/sda: primary Diskserver1, secondary Diskserver2
    - /dev/sdb: primary Diskserver2, secondary Diskserver3
    - ...
  - (*) tested but not currently used in production
- 4 diskservers for every 50 TB unit -> every controller can perform a maximum of 120 MByte/s read-write
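The NSD primary/secondary assignment above follows a simple rotation: each device has the next diskserver in the list as its backup server. A purely illustrative sketch of that pattern is given below; the server and device names are hypothetical, and in practice the mapping is fed to the GPFS administration tools rather than produced by a script.

```python
# Illustrative sketch of the NSD primary/secondary rotation described above.
# Server and device names are hypothetical placeholders.
devices = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]
servers = ["diskserver1", "diskserver2", "diskserver3", "diskserver4"]

for i, dev in enumerate(devices):
    primary = servers[i % len(servers)]
    secondary = servers[(i + 1) % len(servers)]   # the next server acts as backup
    print(f"{dev}: primary {primary}, secondary {secondary}")
```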
6. DISK access (2)
- We have different protocols in production for accessing the disk storage. On our diskservers and Grid SE front-ends we currently have:
  - NFS on local filesystem. ADV: easy client implementation, good compatibility, and possibility of failover (RH 3.0). DIS: bad performance scalability for a high number of accesses (1 client 30 MB/s, 100 clients 15 MB/s throughput).
  - RFIO on local filesystem. ADV: good performance, compatibility with Grid tools, and possibility of failover. DIS: no scalability of front-ends for a single filesystem, no possibility of load balancing.
  - Grid SE: GridFTP/rfio over GPFS (CMS, CDF). ADV: separation between the GPFS servers (accessing the disks) and the SE GPFS clients; load balancing and HA on the GPFS servers and the possibility to implement the same on the Grid SE services (see next slide). DIS: GPFS layer requirements on OS and certified hardware for support.
  - Xrootd (BABAR). ADV: good performance. DIS: no possibility of load balancing for the single filesystem backends, not Grid compliant (at present...).
- NOTE: IBM GPFS 2.2 is a CLUSTERED FILESYSTEM, so many front-ends (i.e. GridFTP or rfio servers) can access the SAME filesystem simultaneously. It also allows bigger filesystem sizes (we use 8-12 TB).
7. CASTOR Grid Storage Element
- GridFTP access through the castorgrid SE, a DNS cname pointing to 3 servers
- DNS round-robin for load balancing
- During LCG Service Challenge 2 we also introduced a load-average based selection: every M minutes the IP of the most loaded server is replaced in the cname (see graph, and the sketch below)
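A minimal sketch of that selection logic, assuming the 1-minute load average of each front-end is read via ssh and that the round-robin alias is rebuilt from a list of IPs; hostnames and addresses are placeholders, and the actual cname update mechanism used in production is not shown.

```python
import subprocess

# Hypothetical pool of GridFTP front-ends behind the castorgrid cname
# (placeholder names and TEST-NET addresses).
SERVERS = {"castorgrid1": "192.0.2.1",
           "castorgrid2": "192.0.2.2",
           "castorgrid3": "192.0.2.3"}

def load_average(host):
    """Read the 1-minute load average of a remote host via ssh."""
    out = subprocess.run(["ssh", host, "cat", "/proc/loadavg"],
                         capture_output=True, text=True, check=True).stdout
    return float(out.split()[0])

def select_cname_members(servers):
    """Drop the most loaded server from the DNS round-robin set."""
    loads = {h: load_average(h) for h in servers}
    most_loaded = max(loads, key=loads.get)
    return [ip for h, ip in servers.items() if h != most_loaded]

# The resulting IP list would then be written into the zone file for the
# castorgrid alias and the DNS server reloaded (mechanism not shown here).
print(select_cname_members(SERVERS))
```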
8. Monitoring/notifications (Nagios)
9. Plots: LHCb CASTOR tape pool; processes on a CMS disk SE; eth0 traffic through a CASTOR LCG SE
10. Disk accounting
Plots: pure disk space (TB); CASTOR disk space (TB)
11. Parallel Filesystem Test
- Test goal: evaluation and comparison of parallel filesystems (GPFS, Lustre) for the implementation of a powerful disk I/O infrastructure for the TIER1 INFN CNAF
- A moderately high-end testbed has been used:
  - 6 IBM xSeries 346 file servers connected via FC SAN to 3 IBM FAStT 900 (DS4500), providing a total of 24 TB
  - Maximum available throughput to the client nodes (30) using Gb Ethernet: 6 Gbps
- PHASE 1: generic tests and tuning
- PHASE 2: realistic physics analysis jobs reading data from a parallel filesystem
- Dedicated tools for the tests (PHASE 1) and for monitoring have been written
  - The benchmarking tool allows the user to start, stop and monitor the tests on all the clients from a single point
  - Completely automated
12. PHASE 1: Generic Benchmark
- GPFS: very stable, reliable, fault tolerant, indicated for storage of critical data, and no charge for educational or research use
- Lustre: commercial product, easy to install, but fairly invasive (needs a patched kernel) and has a per-node license cost
- PHASE 1 generic benchmark (a simple measurement sketch follows below):
  - Sequential write/read from a variable number of clients simultaneously performing I/O with 3 different protocols (native GPFS, rfio over GPFS, nfs over GPFS)
  - 1 to 30 Gigabit Ethernet clients, 1 to 4 processes per client
  - Sequential write/read of zeroed files by means of dd
  - File sizes ranging from 1 MB to 1 GB
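As an illustration of the kind of single-stream measurement performed, here is a minimal sketch that writes a zeroed file with dd on a mounted filesystem, reads it back, and reports the throughput. The mount point and file size are placeholders; the real benchmark coordinates many such streams across the clients and takes additional care to avoid client- and server-side caching.

```python
import subprocess, time, os

MOUNT_POINT = "/gpfs/testfs"      # placeholder: filesystem under test
FILE_SIZE_MB = 1024               # 1 GB test file, as in the benchmark
TEST_FILE = os.path.join(MOUNT_POINT, "ddtest.bin")

def run_dd(args):
    """Run dd and return the elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.run(["dd"] + args, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.time() - start

# Sequential write of a zeroed file (dd if=/dev/zero).
t_write = run_dd(["if=/dev/zero", f"of={TEST_FILE}", "bs=1M", f"count={FILE_SIZE_MB}"])
# Flush dirty buffers before the read phase (the real tests take additional
# care so that reads come from disk and not from the page cache).
subprocess.run(["sync"], check=True)
# Sequential read back of the same file.
t_read = run_dd([f"if={TEST_FILE}", "of=/dev/null", "bs=1M"])

print(f"write: {FILE_SIZE_MB / t_write:.1f} MB/s, read: {FILE_SIZE_MB / t_read:.1f} MB/s")
```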
13. Generic Benchmark
Plots: raw ethernet throughput vs time (20 x 1 GB file simultaneous reads with Lustre); results of read/write (1 GB, different files)
14. Generic Benchmark (here shown for 1 GB files)

                          WRITE (MB/s)                    READ (MB/s)
# of simultaneous
client processes          1     5     10    50    120     1     5     10    50    120
GPFS 2.3.0-1 native       114   160   151   147   147     85    301   301   305   305
GPFS 2.3.0-1 NFS          102   171   171   159   158     114   320   366   322   292
GPFS 2.3.0-1 RFIO         79    171   158   166   166     79    320   301   320   321
Lustre 1.4.1 native       102   512   512   488   478     73    366   640   453   403
Lustre 1.4.1 RFIO         93    301   320   284   281     68    269   269   314   349

- Numbers are reproducible with small fluctuations
- Lustre tests with NFS export not yet performed
15. PHASE 2: Realistic analysis
- We focus on analysis jobs since they are generally the most I/O bound processes of the experiment activity
- A realistic LHCb analysis algorithm runs on 8 TB of data served by RFIO daemons running on the GPFS parallel filesystem servers
- The analysis algorithm performs the selection of an LHCb physics channel by reading input DST (Data Summary Tape) files sequentially and producing ntuple files in output
- Analysis jobs are submitted to the production LSF batch system of the TIER1 INFN (RFIO was the simplest and most effective choice)
- 14000 jobs submitted, 500 jobs in simultaneous RUN state
- Steps of each job (see the sketch below):
  - RFIO-copy the file to be processed to the local WN disk
  - Analyze the data
  - RFIO-copy the output of the algorithm back
  - Clean up the files from the local disk
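A minimal sketch of such a job wrapper, using the rfcp copy command named above; the CASTOR paths and the analysis executable name are purely hypothetical placeholders.

```python
import subprocess, os

# Hypothetical input/output locations and analysis executable.
INPUT = "/castor/cnaf.infn.it/lhcb/dst/file0001.dst"   # placeholder CASTOR path
OUTPUT_DIR = "/castor/cnaf.infn.it/lhcb/ntuples"       # placeholder CASTOR path
WORKDIR = os.environ.get("TMPDIR", "/tmp")

local_in = os.path.join(WORKDIR, os.path.basename(INPUT))
local_out = os.path.join(WORKDIR, "ntuple.root")

# 1. RFIO-copy the input DST file to the local worker-node disk.
subprocess.run(["rfcp", INPUT, local_in], check=True)

# 2. Run the analysis algorithm on the local copy (hypothetical executable).
subprocess.run(["./analysis.exe", local_in, local_out], check=True)

# 3. RFIO-copy the output ntuple back.
subprocess.run(["rfcp", local_out, os.path.join(OUTPUT_DIR, "ntuple0001.root")], check=True)

# 4. Clean up the local files.
for f in (local_in, local_out):
    os.remove(f)
```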
16. Realistic analysis results
- 8.1 TB of data processed in 7 hours, all 14000 jobs completed successfully
- > 3 Gb/s raw sustained read throughput from the file servers with GPFS (about 320 MByte/s effective)
- Write throughput of the output data negligible (just 1 MB per job)
- The results are very satisfactory and give us a good impression of the whole infrastructure layout
- Tests of the Lustre configuration are in progress (we don't expect a big difference when using the rfio protocol over a parallel filesystem)
17. Conclusions
- In these slides we presented:
  - A general overview of the Italian TIER1 INFN CNAF storage resources, hardware and access methods
    - HSM software (CERN CASTOR) for the tape library mass storage
    - Disk over SAN with different software protocols
  - Some simple management implementations for monitoring and optimizing access to our storage resources
  - Results from clustered parallel filesystem (Lustre/GPFS) performance measurements
    - Step 1: generic filesystem benchmark
    - Step 2: realistic LHC analysis jobs results
- Thank you all for your attention
18. Benchmarking tools
- Dedicated tools for benchmarking and monitoring have been written
- The benchmarking tool allows the user to start, stop and monitor the evolution of simultaneous read/write operations from an arbitrary number of clients, reporting the aggregated throughput at the end of the test
  - Realized as a set of bash scripts and C programs
  - The tool implements network bandwidth measurements by means of the netperf suite and sequential read/write with dd
  - Intended to be of general use; it can be reused with minimal effort for any kind of storage benchmark
- Completely automated (a minimal orchestration sketch follows below)
  - The user does not need to install anything on the target nodes: all the software is copied by the tool via ssh (and also removed at the end)
  - The user only has to issue a few commands from the shell prompt to control everything
  - Complex unattended and personalized tests can be performed by means of very simple scripts; the tool collects and saves all the results and produces plots
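The real tool is a set of bash scripts and C programs; the following Python sketch only illustrates the orchestration idea (copy a test script to each client via ssh/scp, run the tests in parallel, and sum the per-client throughputs). Hostnames, paths and the remote script are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

CLIENTS = [f"node{i:02d}" for i in range(1, 31)]   # hypothetical client hostnames
REMOTE_SCRIPT = "run_dd_test.sh"                   # hypothetical per-client test script

def run_on_client(host):
    """Copy the test script to a client, run it and parse the reported MB/s."""
    subprocess.run(["scp", REMOTE_SCRIPT, f"{host}:/tmp/"], check=True)
    out = subprocess.run(["ssh", host, f"sh /tmp/{REMOTE_SCRIPT}"],
                         capture_output=True, text=True, check=True).stdout
    subprocess.run(["ssh", host, f"rm -f /tmp/{REMOTE_SCRIPT}"], check=True)
    return float(out.strip())                      # the script prints its throughput in MB/s

# Start all clients in parallel and aggregate their throughput.
with ThreadPoolExecutor(max_workers=len(CLIENTS)) as pool:
    results = list(pool.map(run_on_client, CLIENTS))

print(f"aggregate throughput: {sum(results):.1f} MB/s over {len(CLIENTS)} clients")
```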
19. Monitoring tools
- The monitoring tools measure the time dependence of the raw network traffic of each server with a granularity of one second
- Following the time dependence of the I/O gives important insights and can be very important for a detailed understanding and tuning of the network and parallel filesystem operational parameters
- The existing tools didn't provide such a low granularity, so we wrote our own, reusing work done for the LHCb online farm monitoring (consider that writing/reading one 1 GB file from a single client takes just a few seconds); a sampling sketch is given below
- The tool automatically produces a plot of the aggregated network traffic of the file servers for each test, in PDF format
- The network traffic data corresponding to each file server are saved to ASCII files, in case one wants to make a detailed per-server analysis
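A minimal sketch of per-second traffic sampling on a Linux server; it assumes the counters are read from /proc/net/dev for a given interface, which is one standard way to obtain this information (the production tool, derived from the LHCb online farm monitoring, may work differently).

```python
import time

def rx_tx_bytes(iface="eth0"):
    """Return (received, transmitted) byte counters for an interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])   # rx bytes, tx bytes
    raise ValueError(f"interface {iface} not found")

# Sample the counters once per second and print the rates in MB/s.
prev = rx_tx_bytes()
while True:
    time.sleep(1)
    cur = rx_tx_bytes()
    rx_rate = (cur[0] - prev[0]) / 1e6
    tx_rate = (cur[1] - prev[1]) / 1e6
    print(f"{time.strftime('%H:%M:%S')}  rx {rx_rate:6.1f} MB/s  tx {tx_rate:6.1f} MB/s")
    prev = cur
```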
20. GPFS features
- Very stable, reliable, fault tolerant, indicated for storage of critical data, and no charge for educational or research use
- Commercial product, initially developed by IBM for the SP series and then ported to Linux
- Advanced command line interface for configuration and management
- Easy to install, not invasive:
  - Distributed as binaries in RPM packages
  - No patches to standard kernels are required, just a few kernel modules for POSIX I/O to be compiled against the running kernel
- Data and metadata striping
- Possibility to have data and metadata redundancy
  - An expensive solution, as it requires replication of the whole files; indicated for storage of critical data
- Data recovery for filesystem corruption available
- Fault tolerance features oriented to SAN, and internal health monitoring through network heartbeat
21. Lustre features
- Commercial product, easy to install, but fairly invasive
  - Distributed as binaries and sources in RPM packages
  - Requires its own Lustre patches to standard kernels, but binary distributions of patched kernels are made available
- Aggressive commercial effort: the developers sell it as an "Intergalactic Filesystem" scalable to 10000 nodes
- Advanced interface for configuration and management
- Possibility to have metadata redundancy and Metadata Server fault tolerance
- Data recovery for filesystem corruption available
- POSIX I/O
22. Sequential Write/Read benchmarks
- Sequential write/read from a variable number of clients simultaneously performing I/O
- 1 to 30 Gigabit Ethernet clients, 1 to 4 processes per client
- Sequential write/read of zeroed files by means of dd
- File sizes ranging from 1 MB to 1 GB
- After having been written, the files are read back
- Particular attention is paid to reading the whole files from disk (i.e. no caching at all on the client side nor on the server side)
- Before starting the tests, appropriate syncs are issued to flush the operating system buffers in order not to have interference between consecutive tests
23. Hardware testbed
- Disk storage:
  - 3 IBM FAStT 900 (DS4500)
  - Each FAStT 900 serves 2 RAID5 arrays of 4 TB each (17 x 250 GB disks + 1 hot spare)
  - Each RAID5 array is further subdivided into two LUNs of 2 TB each
  - In total 12 LUNs and 24 TB of disk space (102 x 250 GB disks + 8 hot spares)
- File system servers:
  - 6 IBM xSeries 346, dual Xeon, 2 GB RAM, Gigabit NIC
  - A QLogic fibre channel PCI card on each server, connected to the DS4500s via a Brocade switch
  - 6 Gb/s available bandwidth to/from the clients
- Clients:
  - 30 SuperMicro nodes, dual Xeon, 2 GB RAM, Gigabit NIC
24. Realistic analysis results (with graphs)
- 8 TB of data processed in about 7 hours; about 14000 jobs submitted, all completed successfully
- 500 analysis jobs in simultaneous RUN state, the rest PENDING
- 3 Gb/s sustained read throughput from the file servers (with RFIO on top of GPFS)
- Write throughput of the output data negligible (just about 1 MB per job)
Plot: LHCb LSF batch queue occupancy during the tests
25. Abstract
Title: Storage resources management and access at TIER1 CNAF
Abstract: At present at the LCG TIER1 CNAF we have 2 main different mass storage systems for archiving the HEP experiment data: an HSM software system (CASTOR) and about 200 TB of different storage devices over SAN. This paper briefly describes our hardware and software environment and summarizes the simple technical improvements we have implemented in order to obtain better availability and the best data access throughput from the front-end machines. Some test results for different file systems over SAN are also reported.