Title: Storage resources management and access at TIER1 CNAF
1. Storage resources management and access at TIER1 CNAF
Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo
on behalf of INFN TIER1 staff - pierpaolo.ricci@cnaf.infn.it
- ACAT 2005
- May 22-27 2005
- DESY Zeuthen, Germany
2. TIER1 INFN CNAF Storage
- HSM (400 TB):
  - STK L5500 robot (5500 slots), 6 IBM LTO-2 drives, 2 (4) STK 9940B drives
  - STK180 with 100 LTO-1 (10 TByte native)
  - CASTOR HSM servers (H.A.)
  - W2003 Server with LEGATO Networker (backup)
- NAS (20 TB):
  - NAS1, NAS4: 3ware IDE SAS, 1800/3200 GByte
  - PROCOM 3600 FC NAS2, 9000 GByte
  - PROCOM 3600 FC NAS3, 4700 GByte
- SAN 1 (200 TB) and SAN 2 (40 TB):
  - Diskservers with Qlogic FC HBA 2340
  - IBM FastT900 (DS 4500), 3/4 x 50000 GByte, 4 FC interfaces
  - STK BladeStore, about 25000 GByte, 4 FC interfaces
  - AXUS BROWIE, about 2200 GByte, 2 FC interfaces
  - Infortrend A16F-R1A2-M1, 4 x 3200 GByte SATA
  - Infortrend A16F-R1211-M2, 5 x 6400 GByte SATA (JBOD)
  - 2 Brocade Silkworm 3900 32 port FC switches
  - 2 Gadzoox Slingshot 4218 18 port FC switches
- Clients: Linux SL 3.0 (100-1000 nodes), access over the WAN or TIER1 LAN via NFS, RFIO, GridFTP and other protocols
3. CASTOR HSM
- Point-to-point FC 2Gb/s connections
- STK L5500 library: 2000+3500 mixed slots, 6 LTO-2 drives (20-30 MB/s each), 2 9940B drives (25-30 MB/s each), 1300 LTO-2 cartridges (200 GB native), 650 9940B cartridges (200 GB native)
- 8 tapeservers, Linux RH AS 3.0, Qlogic 2300 HBA
- ACSLS 7.0 running on a Sun Blade v100 (2 internal IDE disks in software RAID-0), Solaris 9.0 OS
- 1 CASTOR (CERN) central services server, RH AS 3.0
- 1 ORACLE 9i rel. 2 DB server, RH AS 3.0
- 6 stagers with diskserver, RH AS 3.0, 15 TB local staging area
- 8 or more rfio diskservers, RH AS 3.0, min. 20 TB staging area
- SAN 1 and SAN 2 attached with full-redundancy FC 2Gb/s connections (dual controller HW and Qlogic SANsurfer Path Failover SW); access from the WAN or TIER1 LAN

EXPERIMENT           Staging area (TB)   Tape pool (TB native)
ALICE                8                   12
ATLAS                6                   20
CMS                  2                   15
LHCb                 18                  30
BABAR, AMS, others   2                   4
4. CASTOR HSM (2)
- In general we obtained:
  - Good performance when writing into the staging area (disk buffer) and from the staging area to tapes (2 parallel streams on tape give about 40 MB/s)
  - Generally good reliability of the stager service (every LHC experiment has its own dedicated stager and policies) and high reliability of the central CASTOR services
  - Bad reliability of the LTO-2 drives in both writing and reading. This results in tapes marked read-only or disabled when writing, and in locking or failures when trying to stage in files in random order.
- Together with the experiment coordination we could trigger a temporary increase of the staging area (disk buffer) and an optimized sequential stage-in of the data just before the analysis phase. The analysis jobs could then run directly over rfio or Grid tools on CASTOR with a high probability of finding the files already on disk (LHCb). After the end of the analysis phase the disk buffer could be re-assigned to another experiment.
- We decided to acquire and use more STK 9940B drives for random access to the data.
- Access to the CASTOR HSM system is:
  - Direct, using rf<command> directly on the user interfaces or on the WNs (rfcp, rfrm or the API...)
  - Through front-ends with a GridFTP interface to CASTOR, and SRM
5. DISK access
- Generic diskserver: Supermicro 1U, 2 Xeon 3.2 GHz, 4 GB RAM, GB Ethernet, 1 or 2 Qlogic 2300 HBA, Linux AS or CERN SL 3.0 OS
- GB Ethernet connections to the WAN or TIER1 LAN serving nfs, rfio, xrootd, GPFS and GridFTP
- Clients: farms of rack-mountable 1U biprocessor nodes (currently about 1000 nodes for 1300 kSpecInt2000)
- 1 or 2 2Gb FC connections for every diskserver; the storage LUNs appear as local devices (LUN0 -> /dev/sda, LUN1 -> /dev/sdb, ...)
- 2 Brocade Silkworm 3900 32 port FC switches, zoned (one 50 TB unit with 4 diskservers), 2 x 2Gb interlink connections
- 50 TB IBM FastT 900 (DS 4500) with dual redundant controllers (A, B) and internal MiniHubs (1, 2), RAID5 arrays exported as 2 TB logical disks (LUN0, LUN1, ...)
- FC path failover HA:
  - Qlogic SANsurfer
  - IBM or STK RDAC for Linux
- Application HA:
  - NFS server, rfio server with Red Hat Cluster AS 3.0 (*)
  - GPFS with NSD primary/secondary configuration (see the sketch after this list):
    - /dev/sda: primary Diskserver1, secondary Diskserver2
    - /dev/sdb: primary Diskserver2, secondary Diskserver3
    - ...
  - (*) tested but not currently used in production
- 4 diskservers for every 50 TB unit -> every controller can perform a maximum of 120 MByte/s read-write
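The NSD primary/secondary assignment above follows a simple rotation: each device has the next diskserver in the list as its backup server. A purely illustrative sketch of that pattern is given below; the server and device names are hypothetical, and in practice the mapping is fed to the GPFS administration tools rather than produced by a script.

```python
# Illustrative sketch of the NSD primary/secondary rotation described above.
# Server and device names are hypothetical placeholders.
devices = ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]
servers = ["diskserver1", "diskserver2", "diskserver3", "diskserver4"]

for i, dev in enumerate(devices):
    primary = servers[i % len(servers)]
    secondary = servers[(i + 1) % len(servers)]   # the next server acts as backup
    print(f"{dev}: primary {primary}, secondary {secondary}")
```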
6. DISK access (2)
- We have different protocols in production for accessing the disk storage. On our diskservers and Grid SE front-ends we currently have:
  - NFS on local filesystem. ADV: easy client implementation, good compatibility, and possibility of failover (RH 3.0). DIS: bad performance scalability for a high number of accesses (1 client 30 MB/s, 100 clients 15 MB/s throughput).
  - RFIO on local filesystem. ADV: good performance, compatibility with Grid tools, and possibility of failover. DIS: no scalability of front-ends for a single filesystem, no possibility of load balancing.
  - Grid SE: GridFTP/rfio over GPFS (CMS, CDF). ADV: separation between the GPFS servers (accessing the disks) and the SE GPFS clients; load balancing and HA on the GPFS servers and the possibility to implement the same on the Grid SE services (see next slide). DIS: GPFS layer requirements on OS and certified hardware for support.
  - Xrootd (BABAR). ADV: good performance. DIS: no possibility of load balancing for the single filesystem backends, not Grid compliant (at present...).
- NOTE: IBM GPFS 2.2 is a CLUSTERED FILESYSTEM, so many front-ends (i.e. GridFTP or rfio servers) can access the SAME filesystem simultaneously. It also allows bigger filesystem sizes (we use 8-12 TB).
7. CASTOR Grid Storage Element
- GridFTP access through the castorgrid SE, a DNS cname pointing to 3 servers
- DNS round-robin for load balancing
- During LCG Service Challenge 2 we also introduced a load-average based selection: every M minutes the IP of the most loaded server is replaced in the cname (see graph, and the sketch below)
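A minimal sketch of that selection logic, assuming the 1-minute load average of each front-end is read via ssh and that the round-robin alias is rebuilt from a list of IPs; hostnames and addresses are placeholders, and the actual cname update mechanism used in production is not shown.

```python
import subprocess

# Hypothetical pool of GridFTP front-ends behind the castorgrid cname
# (placeholder names and TEST-NET addresses).
SERVERS = {"castorgrid1": "192.0.2.1",
           "castorgrid2": "192.0.2.2",
           "castorgrid3": "192.0.2.3"}

def load_average(host):
    """Read the 1-minute load average of a remote host via ssh."""
    out = subprocess.run(["ssh", host, "cat", "/proc/loadavg"],
                         capture_output=True, text=True, check=True).stdout
    return float(out.split()[0])

def select_cname_members(servers):
    """Drop the most loaded server from the DNS round-robin set."""
    loads = {h: load_average(h) for h in servers}
    most_loaded = max(loads, key=loads.get)
    return [ip for h, ip in servers.items() if h != most_loaded]

# The resulting IP list would then be written into the zone file for the
# castorgrid alias and the DNS server reloaded (mechanism not shown here).
print(select_cname_members(SERVERS))
```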
8. Monitoring/notifications (Nagios)
9. Plots: LHCb CASTOR tape pool; processes on a CMS disk SE; eth0 traffic through a CASTOR LCG SE
10. Disk accounting
Plots: pure disk space (TB); CASTOR disk space (TB)
11. Parallel Filesystem Test
- Test goal: evaluation and comparison of parallel filesystems (GPFS, Lustre) for the implementation of a powerful disk I/O infrastructure for the TIER1 INFN CNAF
- A moderately high-end testbed has been used:
  - 6 IBM xSeries 346 file servers connected via FC SAN to 3 IBM FAStT 900 (DS4500), providing a total of 24 TB
  - Maximum available throughput to the client nodes (30) using Gb Ethernet: 6 Gbps
- PHASE 1: generic tests and tuning
- PHASE 2: realistic physics analysis jobs reading data from a parallel filesystem
- Dedicated tools for the tests (PHASE 1) and for monitoring have been written
  - The benchmarking tool allows the user to start, stop and monitor the tests on all the clients from a single point
  - Completely automated
12. PHASE 1: Generic Benchmark
- GPFS: very stable, reliable, fault tolerant, indicated for storage of critical data, and no charge for educational or research use
- Lustre: commercial product, easy to install, but fairly invasive (needs a patched kernel) and has a per-node license cost
- PHASE 1 generic benchmark (a simple measurement sketch follows below):
  - Sequential write/read from a variable number of clients simultaneously performing I/O with 3 different protocols (native GPFS, rfio over GPFS, nfs over GPFS)
  - 1 to 30 Gigabit Ethernet clients, 1 to 4 processes per client
  - Sequential write/read of zeroed files by means of dd
  - File sizes ranging from 1 MB to 1 GB
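As an illustration of the kind of single-stream measurement performed, here is a minimal sketch that writes a zeroed file with dd on a mounted filesystem, reads it back, and reports the throughput. The mount point and file size are placeholders; the real benchmark coordinates many such streams across the clients and takes additional care to avoid client- and server-side caching.

```python
import subprocess, time, os

MOUNT_POINT = "/gpfs/testfs"      # placeholder: filesystem under test
FILE_SIZE_MB = 1024               # 1 GB test file, as in the benchmark
TEST_FILE = os.path.join(MOUNT_POINT, "ddtest.bin")

def run_dd(args):
    """Run dd and return the elapsed wall-clock time in seconds."""
    start = time.time()
    subprocess.run(["dd"] + args, check=True,
                   stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return time.time() - start

# Sequential write of a zeroed file (dd if=/dev/zero).
t_write = run_dd(["if=/dev/zero", f"of={TEST_FILE}", "bs=1M", f"count={FILE_SIZE_MB}"])
# Flush dirty buffers before the read phase (the real tests take additional
# care so that reads come from disk and not from the page cache).
subprocess.run(["sync"], check=True)
# Sequential read back of the same file.
t_read = run_dd([f"if={TEST_FILE}", "of=/dev/null", "bs=1M"])

print(f"write: {FILE_SIZE_MB / t_write:.1f} MB/s, read: {FILE_SIZE_MB / t_read:.1f} MB/s")
```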
13. Generic Benchmark
Plots: raw ethernet throughput vs time (20 x 1 GB file simultaneous reads with Lustre); results of read/write (1 GB, different files)
14. Generic Benchmark (here shown for 1 GB files)

                          WRITE (MB/s)                    READ (MB/s)
# of simultaneous
client processes          1     5     10    50    120     1     5     10    50    120
GPFS 2.3.0-1 native       114   160   151   147   147     85    301   301   305   305
GPFS 2.3.0-1 NFS          102   171   171   159   158     114   320   366   322   292
GPFS 2.3.0-1 RFIO         79    171   158   166   166     79    320   301   320   321
Lustre 1.4.1 native       102   512   512   488   478     73    366   640   453   403
Lustre 1.4.1 RFIO         93    301   320   284   281     68    269   269   314   349

- Numbers are reproducible with small fluctuations
- Lustre tests with NFS export not yet performed
15. PHASE 2: Realistic analysis
- We focus on analysis jobs since they are generally the most I/O bound processes of the experiment activity
- A realistic LHCb analysis algorithm runs on 8 TB of data served by RFIO daemons running on the GPFS parallel filesystem servers
- The analysis algorithm performs the selection of an LHCb physics channel by reading input DST (Data Summary Tape) files sequentially and producing ntuple files in output
- Analysis jobs are submitted to the production LSF batch system of the TIER1 INFN (RFIO was the simplest and most effective choice)
- 14000 jobs submitted, 500 jobs in simultaneous RUN state
- Steps of each job (see the sketch below):
  - RFIO-copy the file to be processed to the local WN disk
  - Analyze the data
  - RFIO-copy the output of the algorithm back
  - Clean up the files from the local disk
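A minimal sketch of such a job wrapper, using the rfcp copy command named above; the CASTOR paths and the analysis executable name are purely hypothetical placeholders.

```python
import subprocess, os

# Hypothetical input/output locations and analysis executable.
INPUT = "/castor/cnaf.infn.it/lhcb/dst/file0001.dst"   # placeholder CASTOR path
OUTPUT_DIR = "/castor/cnaf.infn.it/lhcb/ntuples"       # placeholder CASTOR path
WORKDIR = os.environ.get("TMPDIR", "/tmp")

local_in = os.path.join(WORKDIR, os.path.basename(INPUT))
local_out = os.path.join(WORKDIR, "ntuple.root")

# 1. RFIO-copy the input DST file to the local worker-node disk.
subprocess.run(["rfcp", INPUT, local_in], check=True)

# 2. Run the analysis algorithm on the local copy (hypothetical executable).
subprocess.run(["./analysis.exe", local_in, local_out], check=True)

# 3. RFIO-copy the output ntuple back.
subprocess.run(["rfcp", local_out, os.path.join(OUTPUT_DIR, "ntuple0001.root")], check=True)

# 4. Clean up the local files.
for f in (local_in, local_out):
    os.remove(f)
```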
16. Realistic analysis results
- 8.1 TB of data processed in 7 hours, all 14000 jobs completed successfully
- > 3 Gb/s raw sustained read throughput from the file servers with GPFS (about 320 MByte/s effective)
- Write throughput of the output data negligible (just 1 MB per job)
- The results are very satisfactory and give us a good impression of the whole infrastructure layout
- Tests of the Lustre configuration are in progress (we don't expect a big difference when using the rfio protocol over a parallel filesystem)
17. Conclusions
- In these slides we presented:
  - A general overview of the Italian TIER1 INFN CNAF storage resources, hardware and access methods
    - HSM software (CERN CASTOR) for the tape library mass storage
    - Disk over SAN with different software protocols
  - Some simple management implementations for monitoring and optimizing access to our storage resources
  - Results from clustered parallel filesystem (Lustre/GPFS) performance measurements
    - Step 1: generic filesystem benchmark
    - Step 2: realistic LHC analysis jobs results
- Thank you all for your attention
18. Benchmarking tools
- Dedicated tools for benchmarking and monitoring have been written
- The benchmarking tool allows the user to start, stop and monitor the evolution of simultaneous read/write operations from an arbitrary number of clients, reporting the aggregated throughput at the end of the test
  - Realized as a set of bash scripts and C programs
  - The tool implements network bandwidth measurements by means of the netperf suite and sequential read/write with dd
  - Intended to be of general use; it can be reused with minimal effort for any kind of storage benchmark
- Completely automated (a minimal orchestration sketch follows below)
  - The user does not need to install anything on the target nodes: all the software is copied by the tool via ssh (and also removed at the end)
  - The user only has to issue a few commands from the shell prompt to control everything
  - Complex unattended and personalized tests can be performed by means of very simple scripts; the tool collects and saves all the results and produces plots
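The real tool is a set of bash scripts and C programs; the following Python sketch only illustrates the orchestration idea (copy a test script to each client via ssh/scp, run the tests in parallel, and sum the per-client throughputs). Hostnames, paths and the remote script are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

CLIENTS = [f"node{i:02d}" for i in range(1, 31)]   # hypothetical client hostnames
REMOTE_SCRIPT = "run_dd_test.sh"                   # hypothetical per-client test script

def run_on_client(host):
    """Copy the test script to a client, run it and parse the reported MB/s."""
    subprocess.run(["scp", REMOTE_SCRIPT, f"{host}:/tmp/"], check=True)
    out = subprocess.run(["ssh", host, f"sh /tmp/{REMOTE_SCRIPT}"],
                         capture_output=True, text=True, check=True).stdout
    subprocess.run(["ssh", host, f"rm -f /tmp/{REMOTE_SCRIPT}"], check=True)
    return float(out.strip())                      # the script prints its throughput in MB/s

# Start all clients in parallel and aggregate their throughput.
with ThreadPoolExecutor(max_workers=len(CLIENTS)) as pool:
    results = list(pool.map(run_on_client, CLIENTS))

print(f"aggregate throughput: {sum(results):.1f} MB/s over {len(CLIENTS)} clients")
```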
19. Monitoring tools
- The monitoring tools measure the time dependence of the raw network traffic of each server with a granularity of one second
- Following the time dependence of the I/O gives important insights and can be very important for a detailed understanding and tuning of the network and parallel filesystem operational parameters
- The existing tools didn't provide such a low granularity, so we wrote our own, reusing work done for the LHCb online farm monitoring (consider that writing/reading one 1 GB file from a single client takes just a few seconds); a sampling sketch is given below
- The tool automatically produces a plot of the aggregated network traffic of the file servers for each test, in PDF format
- The network traffic data corresponding to each file server are saved to ASCII files, in case one wants to make a detailed per-server analysis
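A minimal sketch of per-second traffic sampling on a Linux server; it assumes the counters are read from /proc/net/dev for a given interface, which is one standard way to obtain this information (the production tool, derived from the LHCb online farm monitoring, may work differently).

```python
import time

def rx_tx_bytes(iface="eth0"):
    """Return (received, transmitted) byte counters for an interface from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            if line.strip().startswith(iface + ":"):
                fields = line.split(":", 1)[1].split()
                return int(fields[0]), int(fields[8])   # rx bytes, tx bytes
    raise ValueError(f"interface {iface} not found")

# Sample the counters once per second and print the rates in MB/s.
prev = rx_tx_bytes()
while True:
    time.sleep(1)
    cur = rx_tx_bytes()
    rx_rate = (cur[0] - prev[0]) / 1e6
    tx_rate = (cur[1] - prev[1]) / 1e6
    print(f"{time.strftime('%H:%M:%S')}  rx {rx_rate:6.1f} MB/s  tx {tx_rate:6.1f} MB/s")
    prev = cur
```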
20. GPFS features
- Very stable, reliable, fault tolerant, indicated for storage of critical data, and no charge for educational or research use
- Commercial product, initially developed by IBM for the SP series and then ported to Linux
- Advanced command line interface for configuration and management
- Easy to install, not invasive:
  - Distributed as binaries in RPM packages
  - No patches to standard kernels are required, just a few kernel modules for POSIX I/O to be compiled against the running kernel
- Data and metadata striping
- Possibility to have data and metadata redundancy
  - An expensive solution, as it requires replication of the whole files; indicated for storage of critical data
- Data recovery for filesystem corruption available
- Fault tolerance features oriented to SAN, and internal health monitoring through network heartbeat
21. Lustre features
- Commercial product, easy to install, but fairly invasive
  - Distributed as binaries and sources in RPM packages
  - Requires its own Lustre patches to standard kernels, but binary distributions of patched kernels are made available
- Aggressive commercial effort: the developers sell it as an "Intergalactic Filesystem" scalable to 10000 nodes
- Advanced interface for configuration and management
- Possibility to have metadata redundancy and Metadata Server fault tolerance
- Data recovery for filesystem corruption available
- POSIX I/O
22. Sequential Write/Read benchmarks
- Sequential write/read from a variable number of clients simultaneously performing I/O
- 1 to 30 Gigabit Ethernet clients, 1 to 4 processes per client
- Sequential write/read of zeroed files by means of dd
- File sizes ranging from 1 MB to 1 GB
- After having been written, the files are read back
- Particular attention is paid to reading the whole files from disk (i.e. no caching at all on the client side nor on the server side)
- Before starting the tests, appropriate syncs are issued to flush the operating system buffers in order not to have interference between consecutive tests
23. Hardware testbed
- Disk storage:
  - 3 IBM FAStT 900 (DS4500)
  - Each FAStT 900 serves 2 RAID5 arrays of 4 TB each (17 x 250 GB disks + 1 hot spare)
  - Each RAID5 array is further subdivided into two LUNs of 2 TB each
  - In total 12 LUNs and 24 TB of disk space (102 x 250 GB disks + 8 hot spares)
- File system servers:
  - 6 IBM xSeries 346, dual Xeon, 2 GB RAM, Gigabit NIC
  - A QLogic fibre channel PCI card on each server, connected to the DS4500s via a Brocade switch
  - 6 Gb/s available bandwidth to/from the clients
- Clients:
  - 30 SuperMicro nodes, dual Xeon, 2 GB RAM, Gigabit NIC
24. Realistic analysis results (with graphs)
- 8 TB of data processed in about 7 hours; about 14000 jobs submitted, all completed successfully
- 500 analysis jobs in simultaneous RUN state, the rest PENDING
- 3 Gb/s sustained read throughput from the file servers (with RFIO on top of GPFS)
- Write throughput of the output data negligible (just about 1 MB per job)
Plot: LHCb LSF batch queue occupancy during the tests
25. Abstract
Title: Storage resources management and access at TIER1 CNAF
Abstract: At present at the LCG TIER1 CNAF we have 2 main different mass storage systems for archiving the HEP experiment data: an HSM software system (CASTOR) and about 200 TB of different storage devices over SAN. This paper briefly describes our hardware and software environment and summarizes the simple technical improvements we have implemented in order to obtain better availability and the best data access throughput from the front-end machines. Some test results for different file systems over SAN are also reported.