Storage resources management and access at TIER1 CNAF

Storage resources management and access at TIER1
Ricci Pier Paolo, Lore Giuseppe, Vagnoni Vincenzo
on behalf of INFN TIER1 Staff pierpaolo.ricci_at_cnaf
  • ACAT 2005
  • May 22-27 2005
  • DESY Zeuthen, Germany

HSM (400 TB)
NAS (20TB)
STK180 with 100 LTO-1 (10Tbyte Native)
NAS1, NAS4 3ware IDE SAS 1800+3200 Gbyte
Linux SL 3.0 clients (100-1000 nodes)
W2003 Server with LEGATO Networker (Backup)
PROCOM 3600 FC NAS3 4700 Gbyte
CASTOR HSM servers
PROCOM 3600 FC NAS2 9000 Gbyte
STK L5500 robot (5500 slots) 6 IBM LTO-2, 2 (4) STK 9940B drives
NFS-RFIO-GridFTP oth...
SAN 1 (200TB)
SAN 2 (40TB)
Diskservers with Qlogic FC HBA 2340
Infortrend 4 x 3200 GByte SATA A16F-R1A2-M1
IBM FastT900 (DS 4500) 3/4 x 50000 GByte 4 FC
2 Brocade Silkworm 3900 32 port FC Switch
2 Gadzoox Slingshot 4218 18 port FC Switch
AXUS BROWIE About 2200 GByte 2 FC interface
STK BladeStore About 25000 GByte 4 FC interfaces
Infortrend 5 x 6400 GByte SATA A16F-R1211-M2
Point to Point FC 2Gb/s connections
STK L5500 2000+3500 mixed slots
6 drives LTO2 (20-30 MB/s)
2 drives 9940B (25-30 MB/s)
1300 LTO2 (200 GB native)
650 9940B (200 GB native)
8 tapeservers Linux RH AS3.0 HBA Qlogic 2300
Sun Blade v100 with 2 internal IDE disks with software raid-0 running ACSLS 7.0 OS Solaris 9.0
1 CASTOR (CERN) Central Services server RH AS3.0
1 ORACLE 9i rel 2 DB server RH AS 3.0
EXPERIMENT          Staging area (TB)   Tape pool (TB native)
ALICE                8                  12
ATLAS                6                  20
CMS                  2                  15
LHCb                18                  30
BABAR, AMS + oth.    2                   4
6 stagers with diskserver RH AS3.0 15 TB local staging area
8 or more rfio diskservers RH AS 3.0 min 20TB staging area
Indicates full redundancy FC 2Gb/s connections
(dual controller HW and Qlogic SANsurfer Path Failover SW)
  • In general we obtained
  • Good performance when writing into the staging
    area (disk buffer) and from the staging area to tape
    (2 parallel streams to tape give about 40 MB/s)
  • Generally good reliability of the stager service
    (every LHC experiment has its own dedicated
    stager and policies) and high reliability of the
    central CASTOR services
  • Poor reliability of the LTO-2 drives when writing and
    reading. This results in tapes marked read-only or
    disabled when writing, and in locking or failures
    when trying to stage in files in random order.
  • Together with the experiment coordination, we could
    trigger a temporary increase of the staging
    area (disk buffer) and an optimized sequential
    stage-in of data just before the analysis phase. The
    analysis jobs could then run directly over rfio or
    grid tools on CASTOR with a high probability of
    finding the files directly on disk (LHCb). After the
    end of the analysis phase the disk buffer could
    be reassigned to another experiment.
  • We decided to acquire and use more STK 9940B
    drives for random access to the data
  • Access to the CASTOR HSM system is
  • Direct, using rf<command> on the user
    interfaces or on the WN (rfcp, rfrm or API...)
  • Through a front-end with a gridftp interface to
    CASTOR and SRM

DISK access
GB Eth. connections nfs,rfio,xrootd,GPFS, GRID
Generic Diskserver Supermicro 1U 2 Xeon 3.2 Ghz
4GB Ram,GB eth. 1 or 2 Qlogic 2300 HBA Linux AS
or CERN SL 3.0 OS
1 or 2 2Gb FC connections every Diskserver
LUN0 -> /dev/sda LUN1 -> /dev/sdb ...
2 Brocade Silkworm 3900 32 port FC Switch
ZONED (50TB Unit with 4 Diskservers)
2 x 2GB Interlink connections
Farms of rack-mountable 1U biprocessor nodes
(currently about 1000 nodes for 1300 KSpecInt2000)
2Gb FC connections
  • FC Path Failover HA
  • Qlogic SANsurfer
  • IBM or STK Rdac for Linux

2TB Logical Disk LUN0 LUN1 ...
50 TB IBM FastT 900 (DS 4500) Dual redundant
Controllers (A,B) Internal MiniHub (1,2)
  • Application HA
  • NFS server, rfio server with Red Hat Cluster AS
  • GPFS with configuration NSD Primary Secondary
  • /dev/sda Primary Diskserver 1 Secondary
  • /dev/sdb Primary Diskserver 2 Secondary
  • ..... () tested but not actually used in

4 Diskservers every 50TB Unit -> every controller
can perform a maximum of 120MByte/s R-W
DISK access (2)
  • We have different protocols in production for
    accessing the disk storage. On our diskservers
    and Grid SE front-ends we currently have
  • NFS on local filesystem. ADV. Easy client
    implementation, compatibility and possibility
    of failover (RH 3.0). DIS. Poor performance
    scalability for a high number of accesses (1
    client 30MB/s, 100 clients 15MB/s throughput)
  • RFIO on local filesystem. ADV. Good performance,
    compatibility with Grid tools and possibility
    of failover. DIS. No scalability of front-ends
    for the single filesystem, no possibility of
  • Grid SE Gridftp/rfio over GPFS (CMS, CDF). ADV.
    Separation between GPFS servers (accessing the
    disks) and SE GPFS clients. Load balancing and HA
    on the GPFS servers and possibility to implement
    the same on the Grid SE services (see next
    slide). DIS. GPFS layer requirements on OS and
    certified hardware for support.
  • Xrootd (BABAR). ADV. Good performance. DIS. No
    possibility of load-balancing for the single
    filesystem backends, not grid compliant (at
    so is possible from many front-ends (i.e. gridftp
    or rfio server) to access simultaneously the SAME
    filesystem. Also can use bigger filesystem size
    (we use 8-12TB).

CASTOR Grid Storage Element
  • GridFTP access through the castorgrid SE, a DNS
    cname pointing to 3 servers.
  • DNS round-robin for load balancing
  • During LCG Service Challenge 2 we also introduced a
    load-average selection: every M minutes the IP of
    the most loaded server is replaced in the cname
    (see graph)
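
The load-average selection described above can be sketched roughly as follows. This is only an illustration and not the script used at CNAF: the host names, the idea of a pool of spare servers, and the function name rebalance_alias are all hypothetical, and swapping the most loaded alias member for the least loaded spare is an assumption about what "replaced" means here.

```python
def rebalance_alias(alias, load):
    """Return a new alias list in which the most loaded member is swapped
    for the least loaded server outside the alias, if that server is less loaded.

    alias: list of host names currently published behind the DNS alias
    load:  dict mapping every known host name to its load average
    """
    spares = [host for host in load if host not in alias]
    if not alias or not spares:
        return list(alias)
    worst = max(alias, key=lambda host: load[host])
    best = min(spares, key=lambda host: load[host])
    if load[best] >= load[worst]:
        return list(alias)  # no spare is better than the current worst member
    return [best if host == worst else host for host in alias]


# Hypothetical snapshot of load averages, taken every M minutes.
loads = {"se1": 0.8, "se2": 7.5, "se3": 1.2, "se4": 0.3}
print(rebalance_alias(["se1", "se2", "se3"], loads))
```

In a real setup the returned list would then be pushed to the DNS server; that step is left out because the source does not say how it was done.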

Monitoring/notifications (Nagios)
LHCb CASTOR tape pool
processes on a CMS disk SE
eth0 traffic through a CASTOR LCG SE
Disk accounting
Pure disk space (TB)
CASTOR disk space (TB)
Parallel Filesystem Test
  • Test goal: evaluation and comparison of parallel
    filesystems (GPFS, Lustre) for the implementation
    of a powerful disk I/O infrastructure for the
  • A moderately high-end testbed has been used
  • 6 IBM xseries 346 file servers connected via FC
    SAN to 3 IBM FAStT 900 (DS4500) providing a total
    of 24TB
  • Maximum available throughput to the client nodes (30)
    using Gb Ethernet: 6 Gbps
  • PHASE 1: Generic tests and tuning
  • PHASE 2: Realistic physics analysis jobs reading
    data from a parallel filesystem
  • Dedicated tools for testing (PHASE 1) and monitoring
    have been written
  • The benchmarking tool allows the user to start,
    stop and monitor the test on all the clients from
    a single point
  • Completely automated

PHASE 1 Generic Benchmark
  • GPFS: very stable, reliable, fault tolerant,
    suited to storage of critical data, and no
    charge for educational or research use
  • Lustre: commercial product, easy to install, but
    fairly invasive (needs a patched kernel) and has a
    node license cost
  • PHASE1 Generic Benchmark
  • Sequential write/read from a variable number of
    clients simultaneously performing I/O with 3
    different protocols (native GPFS, rfio over GPFS,
    nfs over GPFS).
  • 1 to 30 Gb clients, 1 to 4 processes per client
  • Sequential write/read of zeroed files by means of
  • File sizes ranging from 1 MB to 1 GB

Generic Benchmark
Raw ethernet throughput vs time(20 x 1GB file
simultaneous reads with Lustre)
Results of read/write(1GB different files)
Generic Benchmark(here shown for 1 GB files)
                        # of simultaneous client processes    # of simultaneous client processes
                           1     5    10    50   120             1     5    10    50   120
GPFS 2.3.0-1 native      114   160   151   147   147            85   301   301   305   305
GPFS 2.3.0-1 NFS         102   171   171   159   158           114   320   366   322   292
GPFS 2.3.0-1 RFIO         79   171   158   166   166            79   320   301   320   321
Lustre 1.4.1 native      102   512   512   488   478            73   366   640   453   403
Lustre 1.4.1 RFIO         93   301   320   284   281            68   269   269   314   349
  • Numbers are reproducible with small fluctuations
  • Lustre tests with NFS export not yet performed

PHASE 2 Realistic analysis
  • We focus on the Analysis Jobs since they are
    generally the most I/O bound processes of the
    experiment activity.
  • Realistic LHCb analysis algorithm runs on 8 TB of
    data served by RFIO daemons running on GPFS
    parallel filesystem servers
  • The analysis algorithm performs a selection of an
    LHCb physics channel by reading sequentially
    input DST (Data Summary Tape) files and producing
    ntuple files in output
  • Analysis jobs submitted to the production LSF
    batch system of TIER1 INFN (RFIO was the simplest
    and most effective choice)
  • 14000 jobs submitted, 500 jobs in simultaneous
    RUN state
  • Steps of the jobs
  • RFIO-copy to the local WN disk the file to be
  • Analyze the data
  • RFIO-copy back the output of the algorithm
  • Clean up files from the local disk
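
The four job steps listed above could be expressed along these lines. This is a hedged sketch rather than the actual job wrapper: the paths, the local file names and the analysis command are hypothetical placeholders, and only the rfcp command itself comes from the slides. The runner is injectable so that the command sequence can be inspected without a CASTOR installation.

```python
import subprocess


def job_commands(remote_in, remote_out, workdir, analysis_cmd):
    """Build the command sequence of one analysis job.

    remote_in / remote_out: RFIO paths of the input DST and the output ntuple
    workdir: scratch directory on the local worker node disk
    analysis_cmd: list with the analysis executable and its options (placeholder)
    """
    local_in = f"{workdir}/input.dst"        # hypothetical local file names
    local_out = f"{workdir}/output.ntuple"
    return [
        ["rfcp", remote_in, local_in],            # 1. copy the input to the WN disk
        analysis_cmd + [local_in, local_out],     # 2. analyze the data
        ["rfcp", local_out, remote_out],          # 3. copy the output back
        ["rm", "-f", local_in, local_out],        # 4. clean up the local disk
    ]


def run_job(commands, runner=subprocess.run):
    """Run the commands in order, stopping at the first failure."""
    for cmd in commands:
        runner(cmd, check=True)


# Print the sequence for a made-up job instead of executing it.
for cmd in job_commands("/castor/in.dst", "/castor/out.ntuple", "/tmp/job", ["analysis"]):
    print(" ".join(cmd))
```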

Realistic analysis results
  • 8.1 TB of data processed in 7 hours, all 14000
    jobs completed successfully
  • > 3 Gb/s raw sustained read throughput from the
    file servers with GPFS (about effective
  • Write throughput of output data negligible
  • Just 1 MB per job
  • The results are very satisfactory and give us a
    good impression of the whole infrastructure
  • Tests of the Lustre configuration are in
    progress (we don't expect a big difference using
    the rfio protocol over a parallel filesystem)
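
As a quick cross-check of the figures above, the average data rate implied by 8.1 TB in 7 hours can be computed directly. The sketch assumes decimal units (1 TB = 10^12 bytes) and averages over the whole run, so it is not the same quantity as the raw sustained rate quoted on the slide.

```python
def average_throughput(total_bytes, seconds):
    """Return the average rate as (MB/s, Gb/s), using decimal units."""
    bytes_per_second = total_bytes / seconds
    return bytes_per_second / 1e6, bytes_per_second * 8 / 1e9


# 8.1 TB processed in 7 hours, as reported for the analysis run.
mb_s, gb_s = average_throughput(8.1e12, 7 * 3600)
print(f"{mb_s:.0f} MB/s, {gb_s:.2f} Gb/s")  # prints "321 MB/s, 2.57 Gb/s"
```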

  • In these slides we presented
  • A general overview of the Italian TIER1 INFN CNAF
    storage resources hardware and access methods
  • HSM Software (CERN CASTOR) for Tape Library Mass
  • Disk over SAN with different software protocols
  • Some simple management implementations for
    monitoring and optimizing our storage resources
  • Results from Clustered Parallel Filesystem
    (Lustre/GPFS) performance measurements
  • Step 1 Generic Filesystem Benchmark
  • Step 2 Realistic LHC analysis jobs results
  • Thank you to everybody for your attention

Benchmarking tools
  • Dedicated tools for benchmarking and monitoring
    have been written
  • The benchmarking tools allow the user to start,
    stop and monitor the evolution of simultaneous
    read/write operations from an arbitrary number of
    clients, reporting at the end of the test the
    aggregated throughput
  • Implemented as a set of bash scripts and C programs
  • The tool implements network bandwidth measurements
    by means of the netperf suite and sequential
    read/write with dd
  • Designed for general use, can be reused with
    minimal effort for any kind of storage benchmark
  • Completely automated
  • The user does not need to install anything on the
    target nodes as all the software is copied by the
    tool via ssh (and also removed in the end)
  • The user has only to issue a few commands from
    the shell prompt to control everything
  • Can perform complex unattended and personalized
    tests by means of very simple scripts, collect
    and save all the results and produce plots

Monitoring tools
  • The monitoring tools measure the time
    dependence of the raw network traffic of each
    server with a granularity of one second
  • Following the time dependence of the I/O gives
    important insights and can be very important for
    a detailed understanding and tuning of the
    network and parallel filesystem operational
  • The existing tools didn't provide such a fine
    granularity, so we wrote our own, reusing
    work done for the LHCb online farm monitoring
    (consider that writing/reading one file of 1 GB
    from a single client requires just a few seconds)
  • The tool automatically produces a plot of the
    aggregated network traffic of the file servers
    for each test in pdf format
  • The network traffic data files corresponding to
    each file server are saved to ascii files in case
    one wants to make a detailed per-server analysis
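
One-second traffic sampling of the kind described above can be sketched on Linux by differencing the byte counters in /proc/net/dev. This is not the authors' tool, whose implementation is not shown in the slides; reading /proc/net/dev is an assumption, and the function names are hypothetical.

```python
import time


def parse_net_dev(text):
    """Parse /proc/net/dev content into {interface: (rx_bytes, tx_bytes)}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        # Field 0 is received bytes, field 8 is transmitted bytes.
        counters[name.strip()] = (int(fields[0]), int(fields[8]))
    return counters


def rates(previous, current, interval=1.0):
    """Return {interface: (rx_bytes_per_s, tx_bytes_per_s)} between two snapshots."""
    return {
        name: ((current[name][0] - previous[name][0]) / interval,
               (current[name][1] - previous[name][1]) / interval)
        for name in current if name in previous
    }


def sample(count, interval=1.0, path="/proc/net/dev"):
    """Yield per-interface rates `count` times, one reading per interval."""
    with open(path) as f:
        previous = parse_net_dev(f.read())
    for _ in range(count):
        time.sleep(interval)
        with open(path) as f:
            current = parse_net_dev(f.read())
        yield rates(previous, current, interval)
        previous = current
```

Calling `list(sample(10))` on a file server would give ten one-second readings, which could then be written to an ASCII file per server as the slide describes.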

GPFS features
  • Very stable, reliable, fault tolerant, indicated
    for storage of critical data and no charge for
    educational or research use.
  • Commercial product, initially developed by IBM
    for the SP series and then ported to Linux
  • Advanced command line interface for configuration
    and management
  • Easy to install, not invasive
  • Distributed as binaries in RPM packages
  • No patches to standard kernels are required, just
    a few kernel modules for POSIX I/O to be compiled
    for the running kernel
  • Data and metadata striping
  • Possibility to have data and metadata redundancy
  • Expensive solution, as it requires the
    replication of the whole files, indicated for
    storage of critical data
  • Data recovery for filesystem corruption available
  • Fault tolerant features oriented to SAN and
    internal health monitoring through network

Lustre features
  • Commercial product, easy to install, but fairly
  • Distributed as binaries and sources in RPM
  • Requires its own Lustre patches to standard kernels,
    but binary distributions of patched kernels are
    made available
  • Aggressive commercial effort, the developers sell
    it as an Intergalactic Filesystem scalable to
    10000 nodes
  • Advanced interface for configuration and
  • Possibility to have Metadata redundancy and
    Metadata Server fault tolerance
  • Data recovery for filesystem corruption available

Sequential Write/Read benchmarks
  • Sequential write/read from a variable number of
    clients simultaneously performing I/O
  • 1 to 30 Gb clients, 1 to 4 processes per client
  • Sequential write/read of zeroed files by means of
  • File sizes ranging from 1 MB to 1 GB
  • After having been written, files are read back
  • Particular care is taken to read the whole files from
    disk (i.e. no caching at all on the client side
    nor on the server side)
  • Before starting tests, appropriate syncs are
    issued to unload the operating system buffers in
    order not to have interference between
    consecutive tests
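
A single-client version of the sequential write/read measurement can be sketched as below. The original tests used dd driven by bash scripts; this Python equivalent is only an illustration, the function names are hypothetical, and it does not bypass the page cache, so on a real system the read-back figure is only meaningful if caches are emptied between the two phases as the slide describes.

```python
import os
import tempfile
import time


def sequential_write(path, size, block=1 << 20):
    """Write `size` zero bytes sequentially; return throughput in MB/s."""
    zeros = bytes(block)
    start = time.perf_counter()
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            chunk = min(block, remaining)
            f.write(zeros[:chunk])
            remaining -= chunk
        f.flush()
        os.fsync(f.fileno())  # make sure the data actually reached the disk
    elapsed = max(time.perf_counter() - start, 1e-9)
    return size / elapsed / 1e6


def sequential_read(path, block=1 << 20):
    """Read the whole file sequentially; return (bytes_read, MB/s)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(block)
            if not data:
                break
            total += len(data)
    elapsed = max(time.perf_counter() - start, 1e-9)
    return total, total / elapsed / 1e6


# Small demonstration on a 4 MiB temporary file.
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "zeroed.dat")
    write_rate = sequential_write(target, 4 << 20)
    read_bytes, read_rate = sequential_read(target)
    print(f"write {write_rate:.1f} MB/s, read {read_rate:.1f} MB/s, {read_bytes} bytes")
```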

Hardware testbed
  • Disk storage
  • 3 IBM FAStT 900 (DS4500)
  • Each FAStT 900 serves 2 RAID5 arrays, 4 TB each
    (17 x 250 GB disks + 1 hot spare)
  • Each RAID5 is further subdivided into two LUNs of 2
    TB each
  • In total 12 LUNs and 24 TB of disk space (102 x
    250GB disks + 8 hot spares)
  • File System Servers
  • 6 IBM xseries 346, dual Xeon, 2 GB RAM, Gigabit
  • QLogic fiber channel PCI card on each server
    connected to the DS4500 via a Brocade switch
  • 6 Gb/s available bandwidth to/from the clients
  • Clients
  • 30 SuperMicro nodes, dual Xeon, 2 GB RAM, Gigabit
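
The disk-storage figures in this slide can be checked with a few lines of arithmetic. The sketch assumes that the 17 disks quoted per array are the RAID5 members, that one disk's worth of capacity goes to parity, and that decimal units are used; the constant names are illustrative.

```python
DISK_GB = 250          # capacity of one disk
DISKS_PER_ARRAY = 17   # RAID5 members per array, hot spare not counted
ARRAYS_PER_UNIT = 2    # RAID5 arrays served by each FAStT 900
UNITS = 3              # number of FAStT 900 units
LUNS_PER_ARRAY = 2     # each array is split into two LUNs

# RAID5 keeps one disk's worth of parity per array.
array_tb = (DISKS_PER_ARRAY - 1) * DISK_GB / 1000
total_tb = array_tb * ARRAYS_PER_UNIT * UNITS
total_luns = LUNS_PER_ARRAY * ARRAYS_PER_UNIT * UNITS
total_disks = DISKS_PER_ARRAY * ARRAYS_PER_UNIT * UNITS

print(array_tb, total_tb, total_luns, total_disks)  # prints "4.0 24.0 12 102"
```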

Realistic analysis results (with graphs)
  • 8 TB of data processed in about 7 hours, about
    14000 jobs submitted, all completed successfully
  • 500 analysis jobs in simultaneous RUN state, the
    rest in PENDING
  • 3 Gb/s sustained read throughput from the file
    servers (with RFIO on top of GPFS)
  • Write throughput of output data negligible
  • Just about 1 MB per job

LHCb LSF batch queue occupancy during tests
Title: Storage resources management and access at
TIER1 CNAF
Abstract: At present, at the LCG TIER1 CNAF we have
2 main mass storage systems for archiving the HEP
experiment data: an HSM software system (CASTOR) and
about 200TB of different storage devices over SAN.
This paper briefly describes our hardware and software
environment and summarizes the simple technical
improvements we have implemented in order to
obtain better availability and the best data
access throughput from the front-end machines.
Some test results for different file systems
over SAN are also reported.