Hepix 2005 Trip Reports - Transcript
1
Hepix 2005 Trip Reports
  • Dantong Yu
  • June 27, 2005

2
Highlights
  • HEPiX was asked, and agreed, to act as technical
    advisor to IHEPCCC on specific questions where
    it has expertise. Examples given were the status
    of Linux in HEP and the idea of a virtual
    organisation for HEP physicists.
  • The most recent successful HEP collaboration is on
    the distribution and widespread acceptance of
    Scientific Linux. Discussions focused on which
    versions would need to be supported for LHC in the
    next two years.
  • LEMON, NGOP and SLAC Nagios Monitoring.
  • Service Challenge Preparation.
  • Batch Queue Systems.
  • DoE budget cuts affected SLAC, FNAL, and BNL.

3
Disk, Tape, Storage and File Systems
4
Fermilab Mass Storage
  • ENSTORE, dCache and SRM for CDF, D0 and CMS
  • Name space: PNFS.
  • Provides a hierarchical namespace for users'
    files in Enstore.
  • Manages file metadata.
  • Looks like an NFS-mounted file system from user
    nodes.
  • Stands for Perfectly Normal File System.
  • Written at DESY.
  • ENSTORE hardware: 6 silos; tape drives: 9 LTO1,
    14 LTO2, 20 9940, 52 9940B, 8 DLT (4000/8000);
    and 127 commodity PCs.
  • 2.6 Petabytes of data, 10.8 million files, 25,000
    volumes, and a record rate of 27 Terabytes/day.

5
dCache
  • dCache is deployed on top of ENSTORE or
    standalone.
  • Improves performance by serving files from disk
    caches instead of reading from tape each time
    they are needed.
  • 100 pool nodes with 225 Terabytes of disk.
  • Lessons learned:
  • Use the XFS filesystem on the pool disks.
  • Use direct I/O when accessing files on the
    local dCache disk (see the sketch below).
  • Users will push the system to its limits. Be
    prepared.
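
A minimal sketch of the "direct I/O on the pool disks" advice, not dCache code:
it reads a file with O_DIRECT on Linux so the page cache is bypassed. The file
path is hypothetical; an anonymous mmap is used to get the aligned buffer that
O_DIRECT requires.

# Minimal sketch (not dCache code): reading a file with O_DIRECT on Linux.
import mmap
import os

BLOCK = 1024 * 1024                       # 1 MiB chunks, aligned for O_DIRECT

def read_direct(path):
    fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
    buf = mmap.mmap(-1, BLOCK)            # anonymous mmap is page-aligned
    total = 0
    try:
        while True:
            n = os.readv(fd, [buf])       # fill the aligned buffer from disk
            if n == 0:
                break
            total += n
    finally:
        os.close(fd)
    return total

if __name__ == "__main__":
    print(read_direct("/pool/data/example.root"), "bytes read")  # hypothetical path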

6
SRM
  • Provides a uniform interface for access to multiple
    storage systems via the SRM protocol.
  • SRM is a broker that works on top of other
    storage systems (conceptual sketch below):
  • dCache
  • UNIX filesystem
  • Enstore (in development)
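
A conceptual sketch only, not the real SRM API: it illustrates the "broker on
top of several storage back-ends" idea using invented class and method names
and hypothetical hosts.

# Conceptual sketch of a storage broker; all names here are hypothetical.
from abc import ABC, abstractmethod

class StorageBackend(ABC):
    @abstractmethod
    def stage_in(self, path: str) -> str:
        """Make the file available and return a transfer URL."""

class DCacheBackend(StorageBackend):
    def stage_in(self, path: str) -> str:
        return f"gsiftp://dcache.example.org{path}"     # hypothetical door

class UnixFSBackend(StorageBackend):
    def stage_in(self, path: str) -> str:
        return f"file://{path}"                         # already on local disk

class SRMBroker:
    """Single entry point; picks the back-end that owns the requested path."""
    def __init__(self, backends: dict[str, StorageBackend]):
        self.backends = backends

    def get(self, prefix: str, path: str) -> str:
        return self.backends[prefix].stage_in(path)

broker = SRMBroker({"dcache": DCacheBackend(), "unix": UnixFSBackend()})
print(broker.get("dcache", "/pnfs/example/file1"))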

7
CASPUR file systems: Lustre, AFS, Panasas
8
Components
  • High-end Linux units for both servers and
    clients
  • Servers: 2-way Intel Nocona 3.4 GHz, 2 GB
    RAM, 2 QLA2310 2 Gbit HBAs
  • Clients: 2-way Intel Xeon 2.4 GHz, 1 GB RAM
  • OS: SuSE SLES 9 on servers, SLES 9 /
    RHEL 3 on clients
  • Network: non-blocking GigE switches
  • Cisco 3570G-24TS (24 ports)
  • Extreme Networks Summit 400-48t (48 ports)
  • SAN
  • QLogic SANbox 5200, 32 ports
  • Appliances
  • 3 Panasas ActiveScale shelves

9
Panasas Storage Cluster Components
(Diagram: shelf front with 1 DirectorBlade and 10 StorageBlades; shelf rear
with integrated GE switch and battery module with 2 power units; the midplane
routes GE and power.)
10
Test setup (NFS, AFS, Lustre)
(Diagram: load farm of 16 biprocessor nodes at 2.4 GHz connected over Gigabit
Ethernet (Cisco/Extreme) to the servers; Lustre MDS; QLogic 5200 SAN.)
On each server, 2 Gigabit Ethernet NICs were
bonded (bonding-ALB). LD1 and LD2 could be IFT or
DDN. Each LD was zoned to a distinct HBA.
11
What we measured
1) Massive aggregate I/O (large files, lmdd)
   - All 16 clients were unleashed together; file sizes varied in the
     range 5-10 GB
   - Gives a good idea of the system's overall throughput
2) Pileup (see the sketch below)
   - This special benchmark was developed at CERN by R. Többicke
   - Emulation of an important use case foreseen in one of the LHC
     experiments
   - Several (64-128) 2 GB files are first prepared on the file system
     under test
   - The files are then read by a growing number of reader threads
     (ramp-up)
   - Every thread randomly selects one file out of the list
   - In a single read act, an arbitrary offset within the file is
     calculated, and 50-60 KB are read starting at this offset
   - Output is the number of operations times bytes read per time
     interval
   - Pileup results are important for future service planning
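
A minimal sketch of the Pileup read pattern described above, not R. Többicke's
original benchmark: reader threads repeatedly pick a random file, seek to a
random offset and read roughly 50-60 KB; the aggregate operation and byte
counts per interval approximate the reported metric. The file list and run
duration are made up.

# Sketch of the Pileup read pattern; file paths are hypothetical.
import os
import random
import threading
import time

FILES = [f"/testfs/pileup/file{i:03d}" for i in range(64)]   # ~2 GB files
READ_SIZE = 56 * 1024                                        # ~50-60 KB per read
counters = {"ops": 0, "bytes": 0}
lock = threading.Lock()
stop = threading.Event()

def reader():
    while not stop.is_set():
        path = random.choice(FILES)
        size = os.path.getsize(path)
        offset = random.randrange(max(size - READ_SIZE, 1))
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(READ_SIZE)
        with lock:
            counters["ops"] += 1
            counters["bytes"] += len(data)

def run(n_threads, seconds):
    threads = [threading.Thread(target=reader) for _ in range(n_threads)]
    for t in threads:
        t.start()
    time.sleep(seconds)
    stop.set()
    for t in threads:
        t.join()
    return counters["ops"], counters["bytes"] / seconds      # ops, bytes/s

if __name__ == "__main__":
    print(run(n_threads=16, seconds=60))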
12
A typical Pileup curve
13
3) Emulation of a DAQ Data Buffer (see the sketch below)
   - A very common scenario in HEP DAQ architecture
   - Data is constantly arriving from the detector and has to end up on
     tertiary storage (tapes)
   - A temporary storage area on the way of the data to tapes serves for
     reorganization of streams, preliminary real-time analysis, and as a
     security buffer to hold data across interruptions of the archival
     system
   - Of big interest for service planning: the general throughput of a
     balanced Data Buffer
   - A DAQ manager may moderate the data influx (for instance, by tuning
     certain trigger rates), thus balancing it with the outflux
   - We ran 8 writers and 8 readers, one process per client; each file
     was accessed at any given moment by one and only one process; on
     writer nodes we could moderate the writer speed by adding some
     dummy CPU eaters
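
A minimal sketch of the buffer emulation above, under assumptions: hypothetical
paths, a fixed file size, and a sleep per chunk standing in for the "dummy CPU
eaters" that throttled the writers. Eight writer and eight reader processes
share a queue of file names, and a file is handed to a reader only after its
writer has finished, so each file is touched by exactly one process at a time.

# Sketch of a DAQ data-buffer emulation; directory and sizes are assumptions.
import multiprocessing as mp
import os
import time

CHUNK = 8 * 1024 * 1024          # 8 MiB writes/reads
FILE_SIZE = 2 * 1024**3          # 2 GiB per file
BUFFER_DIR = "/databuffer"       # hypothetical temporary storage area

def writer(idx, ready_q, throttle_s):
    path = os.path.join(BUFFER_DIR, f"stream{idx}.dat")
    with open(path, "wb") as f:
        written = 0
        while written < FILE_SIZE:
            f.write(os.urandom(CHUNK))
            written += CHUNK
            time.sleep(throttle_s)        # moderate the influx ("CPU eater")
    ready_q.put(path)                     # hand the finished file to a reader

def reader(ready_q):
    path = ready_q.get()                  # drain the buffer towards "tape"
    with open(path, "rb") as f:
        while f.read(CHUNK):
            pass
    os.remove(path)

if __name__ == "__main__":
    q = mp.Queue()
    writers = [mp.Process(target=writer, args=(i, q, 0.01)) for i in range(8)]
    readers = [mp.Process(target=reader, args=(q,)) for _ in range(8)]
    for p in writers + readers:
        p.start()
    for p in writers + readers:
        p.join()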
14
DAQ Data Buffer
15
Results for 8 GigE outlets
Two remarks:
- Each of the storage nodes had 2 GigE NICs. We tried adding a third NIC to
  see if we could get more out of the node; there was a modest improvement of
  less than 10 percent, so we decided to use 8 NICs on 4 nodes per run.
- The Panasas shelf had 4 NICs, and we report its results multiplied by 2 to
  be able to compare it with all other 8-NIC configurations.
16
Conclusions
1) With 8 GigE NICs in the system, one would expect a throughput in excess
   of 800 MB/sec for large streaming I/O. Lustre and Panasas can clearly
   deliver this, and NFS is also doing quite well. The very fact that we
   were operating around 800 MB/sec with this hardware means that our
   storage nodes were well balanced (no bottlenecks; we may even have had a
   reserve of 100 MB/sec per setup).
2) Pileup results were relatively good for AFS, and best in the case of
   Panasas. The outcome of this benchmark is correlated with the number of
   spindles in the system. The two Panasas shelves had 40 spindles, while
   the 4 storage nodes used 64 spindles, so the Panasas file system was
   doing a much better job per spindle than any other solution we tested
   (NFS, AFS, Lustre).

17
USCMS Tier 1 Update on Use of IBRIX Fusion FS
  • CMS decided to implement the IBRIX solution. Why?
  • No specialized hardware required
  • Enabling us to redeploy the hardware if this
    solution did not work
  • Made the cost of the product less than others with
    hardware components
  • Provided NFS access
  • A universal protocol which requires no client
    side code
  • Initial decision to only use NFS access because
    of this. IBRIX was very responsive to our requests
    and to issues we found during the evaluation
  • Purely software solution; no specialized hardware
    dependencies
  • Comprised of:
  • A highly scalable, POSIX-compliant parallel file
    system
  • A logical volume manager based on LVM
  • High availability
  • A comprehensive management interface which includes
    a GUI

18
Current Status
  • Thin client is working very well and the system is
    stable
  • User and group quotas are working as part of our
    managed disk plan; a few requests for enhancements
    to the quota subsystem to improve quota management
  • Working on getting users to migrate off all
    NFS-mounted work areas and data disks
  • IBRIX file systems are served via NFS in limited
    numbers but this has been stable. Refining admin
    documentation and knowledge
  • Will add 2 more segment servers and another
    2.7 TB of disk
  • Plan for 20 TB by start of data taking
  • Thin-client RPMs are kernel-version dependent;
    IBRIX is committed to providing RPM updates for
    security releases in a timely fashion
  • No performance data is available because the
    current focus is on functionality.

19
SATA Evaluation at FNAL
  • SATA is found in commodity or mid-level storage
    configurations (as opposed to enterprise-level)
    and cannot be expected to give the same
    performance as more expensive architectures. SATA
    controllers can be FC or PCI/PCI-X.
  • A SATA configuration can bring imperfect firmware,
    misleading claims by vendors, untested
    configurations and disruptive upgrades: you get
    what you pay for.
  • A number of suggestions were made:
  • careful selection of vendor
  • firmware upgrades
  • consider paying more if you can be sure of
    reduced ongoing maintenance costs (including
    human costs)
  • understand your needs properly and estimate the
    cost and effect of disk losses
  • decide whether data loss is acceptable or not

20
Experiences Deploying Xrootd at RAL
21
What is xrootd?
  • xrootd (eXtended Root Daemon) was written at SLAC
    and INFN Padova as part of the work to migrate
    the BaBar event store from Objectivity to Root
    I/O
  • It's a fully functional suite of tools for
    serving data, including server daemons and
    clients which talk to each other using the xroot
    protocol

22
xrootd Architecture
(Layer diagram, top to bottom; a client usage sketch follows below.)
  • Application
  • Protocol manager: xrd
  • Protocol layer: xrootd (xroot protocol), plus authentication
  • Filesystem logical layer: ofs, plus authorization (optional)
  • Filesystem physical layer: oss (included in distribution)
  • Filesystem implementation: mss, _fs
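
For context, a minimal client-side usage sketch with a hypothetical server and
file path: a ROOT/PyROOT application reaches an xrootd server simply by opening
a root:// URL, and the redirection between servers stays transparent to the
user code.

# Client-side sketch; server host and path are hypothetical.
import ROOT

f = ROOT.TFile.Open("root://xrootd.example.org//store/data/run123/events.root")
if f and not f.IsZombie():
    f.ls()        # list the objects in the remote file
    f.Close()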
23
Load Balanced Example with MSS
24
Benefits
  • For Users
  • Jobs don't crash if a disk/server goes down; they
    back off, contact the olb manager and get the
    data from somewhere else
  • Queues aren't stopped just because 2% of the data
    is offline
  • For Admins
  • No need for heroic efforts to recover damaged
    filesystems
  • Much easier to schedule maintenance

25
Conclusion
  • Xrootd has proved to be easy to configure and
    link to our MSS. Initial indications are that the
    production service is both reliable and
    performant
  • This should improve the lives of both users
    and sysadmins, with huge advances in both the
    robustness of the system and its maintainability,
    without sacrificing performance
  • Talks, software (binaries and source),
    documentation and example configurations are
    available at http://xrootd.slac.stanford.edu

26
Grid Middleware and Service Challenge 3
27
BNL Service Challenge 3 (skipped)
28
IdF Background
  • Need to start setting up the Tier2 now to be
    ready on time
  • LCG (France) effort has concentrated on the Tier1
    till now
  • Technical and financial challenges require time
    to be solved
  • French HEP institutes are quite small
  • 100-150 persons, small computing manpower
  • IdF (Ile-de-France, Paris region) has a large
    concentration of big HEP labs and physicists
  • 6 labs, among which DAPNIA (600) and LAL (350)
  • DAPNIA and LAL involved in the Grid effort since
    the beginning of EDG
  • 3 EGEE contracts (2 for operation support)

29
Objectives
  • Build a Tier2 facility for simulation and
    analysis
  • 80% LHC (4 experiments), 20% EGEE and local
  • 2/3 for LHC analysis
  • Analysis requires a large amount of storage
  • Be ready at LHC startup (2nd half of 2007)
  • Resource goals
  • CPU: 1500 x 1 kSI2K (roughly a P4 Xeon 2.8 GHz),
    max CMS 800
  • Storage: 350 TB of disk (no MSS planned), max
    CMS 220
  • Network: 10 Gb/s backbone inside the Tier2, 1 or
    10 Gb/s external link
  • Need 1.6 M euros

30
Storage Challenge
  • Efficient use and management of a large amount of
    storage is seen as the main challenge
  • Plan to participate in SC3.
  • 2006: mini Tier2; 2007: production Tier2

31
SC2 Summary
  • SC2 met its throughput goals (achieved >600 MB/s
    daily average for 10 days), and with more sites
    than originally planned!
  • A big improvement over SC1
  • But we still don't have something we can call a
    service
  • Monitoring is better
  • We see outages when they happen, and we
    understand why they happen
  • First steps towards operations guides
  • Some advances in infrastructure and software will
    happen before SC3
  • gLite transfer software
  • SRM service more widely deployed
  • We have to understand how to incorporate these
    elements

32
Service Challenge 3 - Phases
  • High-level view
  • Setup phase (includes Throughput Test)
  • 2 weeks sustained in July 2005
  • Obvious target: GDB of July 20th
  • Primary goals:
  • 150 MB/s disk-to-disk to Tier1s
  • 60 MB/s disk (T0) to tape (T1s)
  • Secondary goals:
  • Include a few named T2 sites (T2 -> T1 transfers)
  • Encourage remaining T1s to start disk-to-disk
    transfers
  • Service phase
  • September - end 2005
  • Start with ALICE and CMS, add ATLAS and LHCb in
    October/November
  • All offline use cases except for analysis
  • More components: WMS, VOMS, catalogs,
    experiment-specific solutions
  • Implies production setup (CE, SE, ...)

33
SC3 Milestone Decomposition
  • File transfer goals:
  • Build up disk-to-disk transfer speeds to 150 MB/s
    with 1 GB/s out of CERN
  • SC2 was 100 MB/s, agreed by site
  • Include tape transfer speeds of 60 MB/s with
    300 MB/s out of CERN
  • Tier1 goals:
  • Bring in additional Tier1 sites wrt SC2 (at least
    wrt the original plan)
  • PIC and Nordic most likely added later (SC4?)
  • Tier2 goals:
  • Start to bring Tier2 sites into the challenge
  • Agree on the services T2s offer / require
  • On-going plan (more later) to address these
  • Add additional components:
  • Catalogs, VOs, experiment-specific solutions etc.,
    3D involvement, ...
  • Choice of software components, validation,
    fallback, ...

34
LCG Service Challenges: Planning for Tier2 Sites
Update for the HEPiX meeting
  • Jamie Shiers
  • IT-GD, CERN

35
Executive Summary
  • Tier2 issues have been discussed extensively
    since early this year
  • The role of Tier2s and the services they offer and
    require have been clarified
  • The data rates for MC data are expected to be
    rather low (limited by available CPU resources)
  • The data rates for analysis data depend heavily
    on the analysis model (and the feasibility of
    producing new analysis datasets, IMHO)
  • LCG needs to provide
  • Installation guide / tutorials for DPM, FTS, LFC
  • Tier1s need to assist Tier2s in establishing
    services

36
Tier2 and Base S/W Components
  • Disk Pool Manager (of some flavour)
  • e.g. dCache, DPM, ...
  • gLite FTS client (and T1 services)
  • Possibly also a local catalog, e.g. LFC, FiReMan, ...
  • Experiment-specific s/w and services ('agents')

37
Tier2s and SC3
  • Initial goal is for a small number of Tier2-Tier1
    partnerships to set up agreed services and gain
    experience
  • This will be input to a wider deployment model
  • Need to test transfers in both directions:
  • MC upload
  • Analysis data download
  • Focus is on service rather than throughput
    tests
  • As an initial goal, would propose running transfers
    over at least several days
  • e.g. using 1 GB files, show sustained rates of 3
    files / hour T2 -> T1 (roughly 0.8 MB/s)
  • More concrete goals for the Service Phase will be
    defined together with the experiments in the coming
    weeks
  • Definitely no later than the June 13-15 workshop

38
T2s Concrete Target
  • We need a small number of well-identified T2/T1
    partners for SC3, as listed above
  • Do not strongly couple T2-T1 transfers to the
    T0-T1 throughput goals of the SC3 setup phase
  • Nevertheless, target one week of reliable
    transfers T2 -> T1 involving at least two T1 sites,
    each with at least two T2s, by end July 2005

39
The LCG File Catalog (LFC)
  • Jean-Philippe Baud, Sophie Lemaitre
  • IT-GD, CERN
  • May 2005

40
LCG File Catalog
  • Based on lessons learned in the Data Challenges (2004)
  • Fixes performance and scalability problems seen
    in the EDG Catalogs
  • Cursors for large queries
  • Timeouts and retries from the client
  • Provides more features than the EDG Catalogs
  • User-exposed transaction API
  • Hierarchical namespace and namespace operations
  • Integrated GSI Authentication and Authorization
  • Access Control Lists (Unix permissions and POSIX
    ACLs)
  • Checksums
  • Based on an existing code base
  • Supports Oracle and MySQL database backends

41
Relationships in the Catalog
42
Features
  • Namespace operations
  • All names are in a hierarchical namespace
  • mkdir(), opendir(), etc.
  • Also chdir()
  • A GUID is attached to every directory and file
  • Security: GSI Authentication and Authorization
    (illustrative sketch after this list)
  • Mapping done from client DN to uid/gid pair
  • Authorization done in terms of uid/gid
  • VOMS will be integrated (collaboration with
    INFN/NIKHEF)
  • VOMS roles appear as a list of gids
  • Ownership of files is stored in the catalog
  • Permissions implemented:
  • Unix (user, group, all) permissions
  • POSIX ACLs (groups and users)
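
An illustrative sketch only, not LFC code or its client API: it mimics the
security model described above, where a GSI-authenticated client DN is mapped
to a uid/gid pair and authorization is done against Unix permission bits on
catalog entries. All names, the DN and the mapping table are invented.

# Hypothetical model of DN -> uid/gid mapping and a permission check.
import stat

DN_MAP = {"/C=CH/O=CERN/CN=Some User": (10234, 1307)}   # DN -> (uid, gid)

class CatalogEntry:
    def __init__(self, guid, owner_uid, owner_gid, mode):
        self.guid = guid            # every file and directory carries a GUID
        self.owner_uid = owner_uid
        self.owner_gid = owner_gid
        self.mode = mode            # Unix permission bits (ACLs omitted here)

def can_read(entry, dn):
    uid, gid = DN_MAP[dn]           # GSI auth yields the DN; map to uid/gid
    if uid == entry.owner_uid:
        return bool(entry.mode & stat.S_IRUSR)
    if gid == entry.owner_gid:
        return bool(entry.mode & stat.S_IRGRP)
    return bool(entry.mode & stat.S_IROTH)

namespace = {"/grid/atlas/file1": CatalogEntry("guid-0001", 10234, 1307, 0o640)}
print(can_read(namespace["/grid/atlas/file1"], "/C=CH/O=CERN/CN=Some User"))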

43
LFC Tests
  • LFC has been tested and shown to be scalable to
    at least:
  • 40 million entries
  • 100 client threads
  • Performance improved in comparison with the RLSs
  • Stable:
  • Continuous running at high load for extended
    periods of time with no crashes
  • Based on code which has been in production for >
    4 years
  • Tuning required to improve bulk performance

44
FiReMan Performance - Insert
  • Comparison with LFC

(Chart: inserts per second vs. number of threads (1-100) for Fireman single
entry, Fireman bulk 100, and LFC; y-axis 0-350 inserts/second.)
45
FiReMan Performance - Queries
  • Comparison with LFC

(Chart: entries returned per second vs. number of threads (1-100) for Fireman
single entry, Fireman bulk 100, and LFC; y-axis 0-1200 entries/second.)
46
Tests Conclusion
  • Both LFC and FiReMan offer large improvements
    over RLS
  • Still some issues remaining
  • Scalability of FiReMan
  • Bulk Operations for LFC
  • More work needed to understand performance and
    bottlenecks
  • Need to test some real Use Cases

47
File Transfer Software and Service SC3
  • Gavin McCance
  • LHC service challenge

48
FTS service
  • It provides point-to-point movement of SURLs
  • Aims to provide reliable file transfer between
    sites, and that's it!
  • Allows sites to control their resource usage
  • Does not do routing (e.g. like PhEDEx)
  • Does not deal with GUIDs, LFNs, Datasets,
    Collections
  • It's a fairly simple service that provides sites
    with a reliable and manageable way of serving
    file movement requests from their VOs
  • Together with the experiments, we are working out
    the places in the software where extra
    functionality can be plugged in:
  • How the VO software frameworks can load the
    system with work
  • Places where VO-specific operations (such as
    cataloguing) can be plugged in, if required
49
Single channel
50
Multiple channels
A single set of servers can manage multiple
channels from a site
51
What you need to run the server
  • An Oracle database to hold the state
  • MySQL is on the list but low priority unless
    someone screams
  • A transfer server to run the transfer agents
  • Agents responsible for assigning jobs to channels
    managed by that site
  • Agents responsible for actually running the
    transfer (or for delegating the transfer to
    srm-cp)
  • An application server (tested with Tomcat 5)
  • To run the submission and monitoring portal,
    i.e. the thing you use to talk to the system (see
    the hypothetical agent sketch below)
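
A hypothetical sketch, not the gLite FTS code: it shows the idea of a transfer
agent that pulls queued jobs from the state database and runs them for the one
channel (source site, destination site pair) it manages. The table and column
names are invented, sqlite3 stands in for the Oracle database, and the call to
globus-url-copy is illustrative only.

# Sketch of a per-channel transfer agent; schema and states are assumptions.
import sqlite3            # stand-in for the Oracle database holding the state
import subprocess

CHANNEL = ("CERN", "RAL")                 # this agent manages the CERN-RAL channel

def fetch_jobs(db):
    cur = db.execute(
        "SELECT id, src_surl, dst_surl FROM jobs "
        "WHERE src_site=? AND dst_site=? AND state='Submitted'",
        CHANNEL)
    return cur.fetchall()

def run_transfer(src_surl, dst_surl):
    # Delegate the actual copy to an external tool (illustrative choice).
    return subprocess.call(["globus-url-copy", src_surl, dst_surl]) == 0

def agent_pass(db):
    for job_id, src, dst in fetch_jobs(db):
        db.execute("UPDATE jobs SET state='Active' WHERE id=?", (job_id,))
        ok = run_transfer(src, dst)
        db.execute("UPDATE jobs SET state=? WHERE id=?",
                   ("Done" if ok else "Failed", job_id))
        db.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, src_site TEXT, "
                 "dst_site TEXT, src_surl TEXT, dst_surl TEXT, state TEXT)")
    agent_pass(conn)      # nothing queued yet, so this pass is a no-op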

52
Initial use models considered
  • Tier-0 to Tier-1 distribution
  • Proposal: put server at Tier-0
  • This was the model used in SC2
  • Tier-1 to Tier-2 distribution
  • Proposal: put server at Tier-1 (push)
  • This is analogous to the SC2 model
  • Tier-2 to Tier-1 upload
  • Proposal: put server at Tier-1 (pull)

53
Summary
  • Propose server at Tier-0 and Tier-1
  • Oracle DB, Tomcat application server, transfer
    node
  • Propose client tools at T0, T1 and T2
  • This is a UI / WN type install
  • Evaluation setup:
  • Initially at CERN T0, interacting with T1 a la
    SC2
  • Expand to a few agreed T1s interacting with agreed
    T2s

54
Disk Pool Manager aims
  • Provide a solution for Tier-2s in LCG-2
  • This implies a few tens of Terabytes in 2005
  • Focus on manageability
  • Easy to install
  • Easy to configure
  • Low effort for ongoing maintenance
  • Easy to add/remove resources
  • Support for multiple physical partitions
  • On one or more disk server nodes
  • Support for different space types: volatile and
    permanent
  • Support for multiple replicas of hot files within
    the disk pools

55
Manageability
  • Few daemons to install
  • No central configuration files
  • Disk nodes request to add themselves to the DPM
  • All states are kept in a DB (easy to restore
    after a crash)
  • Easy to remove disks and partitions
  • Allows simple reconfiguration of the Disk Pools
  • Administrator can temporarily remove file systems
    from the DPM if a disk has crashed and is being
    repaired
  • DPM automatically configures a file system as
    unavailable when it is not contactable

56
Features
  • DPM access via different interfaces
  • Direct Socket interface
  • SRM v1
  • SRM v2 Basic
  • Also offers a large part of SRM v2 Advanced:
  • Global Space Reservation (next version)
  • Namespace operations
  • Permissions
  • Copy and Remote Get/Put (next version)
  • Data Access
  • GridFTP, rfio (rootd, xrootd could easily be
    added)

57
Security
  • GSI Authentication and Authorization
  • Mapping done from Client DN to uid/gid pair
  • Authorization done in terms of uid/gid
  • Ownership of files is stored in DPM catalog,
    while the physical files on disk are owned by the
    DPM
  • Permissions implemented on files and directories
  • Unix (user, group, other) permissions
  • POSIX ACLs (group and users)
  • Propose to use SRM as the interface to set the
    permissions in the Storage Elements (requires v2.1
    minimum, with Directory and Permission methods)
  • VOMS will be integrated
  • VOMS roles appear as a list of gids

58
Architecture
  • The Light Weight Disk Pool Manager consists of:
  • The Disk Pool Manager with its configuration and
    request DB
  • The Disk Pool Name Server
  • The SRM servers
  • The RFIOD and DPM-aware GsiFTP servers
  • How many machines?
  • DPM, DPNS, and SRM can be installed on the same
    machine
  • RFIOD on each disk server managed by the DPM
  • GsiFTP on each disk server managed by the DPM

59
Status
  • DPM will be part of the LCG 2.5.0 release but is
    available now for testing
  • Satisfies the gLite requirement for an SRM
    interface at Tier-2

60
Thank You