Title: HEPiX 2005 Trip Reports
Highlights
- HEPiX was asked, and agreed, to act as technical advisor to IHEPCCC on specific questions where it has expertise. Examples given were the status of Linux in HEP and the idea of a virtual organisation for HEP physicists.
- The most recent successful HEP-wide collaboration is the distribution and widespread acceptance of Scientific Linux. Discussions focused on which versions would need to be supported for LHC over the next two years.
- LEMON, NGOP and SLAC Nagios monitoring.
- Service Challenge Preparation.
- Batch Queue Systems.
- DoE budget cuts affected SLAC, FNAL, and BNL.
Disk, Tape, Storage and File Systems
Fermilab Mass Storage
- ENSTORE, dCache and SRM for CDF, D0 and CMS.
- Name space: PNFS.
  - Provides a hierarchical namespace for users' files in Enstore.
  - Manages file metadata.
  - Looks like an NFS-mounted file system from user nodes.
  - Stands for "Perfectly Normal File System".
  - Written at DESY.
- ENSTORE hardware: 6 silos; tape drives: 9 LTO1, 14 LTO2, 20 9940, 52 9940B, 8 DLT (4000/8000); and 127 commodity PCs.
- 2.6 Petabytes of data, 10.8 million files, 25,000 volumes; record rate 27 Terabytes/day.
dCache
- dCache is deployed on top of ENSTORE or stand-alone.
- Improves performance by serving files from disk caches instead of reading them from tape each time they are needed.
- 100 pool nodes with 225 Terabytes of disk.
- Lessons learned:
  - Use the XFS filesystem on the pool disks.
  - Use direct I/O when accessing files on the local dCache disk (see the sketch below).
  - Users will push the system to its limits. Be prepared.
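To make the direct-I/O advice concrete, here is a minimal sketch, assuming a Linux client and a locally mounted pool file; the path is a made-up example and this is not part of dCache itself. It opens a file with O_DIRECT and reads it through a page-aligned buffer, bypassing the page cache.

```python
# Minimal sketch of reading a file with O_DIRECT on Linux, in the spirit of
# the pool-node advice above. The path is hypothetical; O_DIRECT requires
# block-aligned buffers and transfer sizes.
import mmap
import os

PATH = "/pool/data/example.file"   # hypothetical local pool file
BLOCK = 1 << 20                    # 1 MiB, a multiple of the 4 KiB block size

fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)         # anonymous mmap is page-aligned, as O_DIRECT needs
total = 0
try:
    while True:
        n = os.readv(fd, [buf])    # read directly into the aligned buffer
        if n == 0:
            break
        total += n
finally:
    buf.close()
    os.close(fd)

print("read %d bytes with direct I/O" % total)
```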
SRM
- Provides a uniform interface for access to multiple storage systems via the SRM protocol.
- SRM is a broker that works on top of other storage systems (sketched below):
  - dCache
  - UNIX filesystem
  - Enstore (in development)
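To picture the broker idea, here is a minimal sketch of a uniform front end dispatching to per-backend implementations. All class and method names are invented for illustration; this is not the SRM API itself.

```python
# Illustrative sketch of the "uniform interface on top of several storage
# systems" idea behind SRM. All names here are hypothetical, not the real
# SRM protocol or any product API.
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    @abstractmethod
    def prepare_to_get(self, path: str) -> str:
        """Stage the file if necessary and return a transfer URL."""


class DCacheBackend(StorageBackend):
    def prepare_to_get(self, path: str) -> str:
        # A real implementation would ask dCache to stage the file from Enstore.
        return "gsiftp://dcache-door.example.org" + path


class UnixBackend(StorageBackend):
    def prepare_to_get(self, path: str) -> str:
        # Plain filesystem: nothing to stage.
        return "file://" + path


class StorageBroker:
    """Single entry point that hides which backend actually owns the data."""

    def __init__(self) -> None:
        self._backends: dict[str, StorageBackend] = {}

    def register(self, prefix: str, backend: StorageBackend) -> None:
        self._backends[prefix] = backend

    def get(self, path: str) -> str:
        for prefix, backend in self._backends.items():
            if path.startswith(prefix):
                return backend.prepare_to_get(path)
        raise KeyError("no backend registered for " + path)


broker = StorageBroker()
broker.register("/pnfs/", DCacheBackend())
broker.register("/data/", UnixBackend())
print(broker.get("/pnfs/fnal.gov/some/file"))
```

The point is that clients only ever see the broker's single interface, regardless of which storage system actually holds the data.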
CASPUR file systems: Lustre, AFS, Panasas
Components
- High-end Linux units for both servers and clients
  - Servers: 2-way Intel Nocona 3.4 GHz, 2 GB RAM, 2 QLA2310 2 Gbit HBAs
  - Clients: 2-way Intel Xeon 2.4 GHz, 1 GB RAM
  - OS: SuSE SLES 9 on servers, SLES 9 / RHEL 3 on clients
- Network: non-blocking GigE switches
  - Cisco 3570G-24TS (24 ports)
  - Extreme Networks Summit 400-48t (48 ports)
- SAN
  - QLogic SANbox 5200 (32 ports)
- Appliances
  - 3 Panasas ActiveScale shelves
Panasas Storage Cluster Components
- Integrated GE switch
- Battery module (2 power units)
- Shelf front: 1 DirectorBlade (DB), 10 StorageBlades (SB)
- Shelf rear: midplane routes GE and power
Test setup (NFS, AFS, Lustre)
- Load farm: 16 biprocessor nodes at 2.4 GHz
- Gigabit Ethernet: Cisco/Extreme switches
- MDS (Lustre)
- SAN: QLogic 5200
- On each server, 2 Gigabit Ethernet NICs were bonded (bonding-ALB). LD1 and LD2 could be IFT or DDN. Each LD was zoned to a distinct HBA.
What we measured
1) Massive aggregate I/O (large files, lmdd)
   - All 16 clients were unleashed together; file sizes varied in the range 5-10 GB.
   - Gives a good idea of the system's overall throughput.
2) Pileup. This special benchmark was developed at CERN by R. Többicke (a sketch of the access pattern follows this list).
   - Emulation of an important use case foreseen in one of the LHC experiments.
   - Several (64-128) 2 GB files are first prepared on the file system under test.
   - The files are then read by a growing number of reader threads (ramp-up).
   - Every thread randomly selects one file out of the list.
   - In a single read, an arbitrary offset within the file is calculated, and 50-60 KB are read starting at this offset.
   - Output is the number of operations times bytes read per time interval.
   - Pileup results are important for future service planning.
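A rough re-creation of the Pileup access pattern described above: random 50-60 KB reads at arbitrary offsets in pre-created 2 GB files, with a ramp-up of reader threads. This is not Többicke's original benchmark, just a minimal sketch under the stated assumptions; the file locations are hypothetical.

```python
# Minimal re-creation of the Pileup access pattern: each reader thread
# repeatedly picks a random file, seeks to a random offset and reads
# 50-60 KB. Paths, durations and thread counts are illustrative only.
import random
import threading
import time

FILES = ["/mnt/test/pileup_%03d.dat" % i for i in range(64)]  # pre-created 2 GB files
FILE_SIZE = 2 * 1024**3
DURATION = 60          # seconds per measurement step
bytes_read = 0
lock = threading.Lock()


def reader(stop_at: float) -> None:
    global bytes_read
    while time.time() < stop_at:
        path = random.choice(FILES)
        size = random.randint(50 * 1024, 60 * 1024)
        offset = random.randrange(FILE_SIZE - size)
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        with lock:
            bytes_read += len(data)


# Ramp-up: grow the number of concurrent readers and report the aggregate rate.
for nthreads in (1, 2, 4, 8, 16, 32, 64):
    bytes_read = 0
    stop_at = time.time() + DURATION
    threads = [threading.Thread(target=reader, args=(stop_at,)) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%3d readers: %.1f MB/s" % (nthreads, bytes_read / DURATION / 1e6))
```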
A typical Pileup curve
3) Emulation of a DAQ data buffer
   - A very common scenario in HEP DAQ architectures.
   - Data is constantly arriving from the detector and has to end up on tertiary storage (tapes).
   - A temporary storage area on the data's way to tape serves for reorganisation of streams, for preliminary real-time analysis, and as a safety buffer to ride out interruptions of the archival system.
   - Of big interest for service planning: the general throughput of a balanced data buffer.
   - A DAQ manager may moderate the data influx (for instance, by tuning certain trigger rates), thus balancing it with the outflux.
   - We ran 8 writers and 8 readers, one process per client. Each file was accessed at any given moment by one and only one process. On writer nodes we could moderate the writer speed by adding dummy CPU eaters (a schematic sketch follows this list).
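A schematic, single-node version of the buffer exercise described above: 8 writers fill a buffer area while 8 readers drain finished files to a stand-in for tape, with an optional CPU eater to throttle the writers. This is an illustration of the setup, not the actual benchmark code; all paths, sizes and knobs are assumptions.

```python
# Schematic single-node version of the DAQ data-buffer exercise: writers
# fill a buffer area while readers drain files to "tape". Paths, file
# sizes and the throttling knob are illustrative assumptions.
import os
import queue
import threading

BUFFER_DIR = "/mnt/buffer"      # temporary storage area (assumed)
TAPE_DIR = "/mnt/fake_tape"     # stand-in for the archival system (assumed)
FILE_SIZE = 256 * 1024**2       # 256 MiB per file
CHUNK = 8 * 1024**2
CPU_EATER_LOOPS = 0             # raise to moderate the writer speed

ready = queue.Queue()           # files fully written and ready to be drained


def writer(wid: int, nfiles: int) -> None:
    for i in range(nfiles):
        path = os.path.join(BUFFER_DIR, "w%d_%04d.raw" % (wid, i))
        with open(path, "wb") as f:
            written = 0
            while written < FILE_SIZE:
                f.write(os.urandom(CHUNK))
                written += CHUNK
                for _ in range(CPU_EATER_LOOPS):  # dummy CPU eater
                    pass
        ready.put(path)          # hand the finished file to exactly one reader


def reader() -> None:
    while True:
        path = ready.get()
        if path is None:
            break
        dest = os.path.join(TAPE_DIR, os.path.basename(path))
        with open(path, "rb") as src, open(dest, "wb") as dst:
            while chunk := src.read(CHUNK):
                dst.write(chunk)
        os.remove(path)          # free the buffer space


writers = [threading.Thread(target=writer, args=(w, 4)) for w in range(8)]
readers = [threading.Thread(target=reader) for _ in range(8)]
for t in writers + readers:
    t.start()
for t in writers:
    t.join()
for _ in readers:
    ready.put(None)              # tell each reader to stop
for t in readers:
    t.join()
```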
DAQ Data Buffer
Results for 8 GigE outlets
Two remarks:
- Each of the storage nodes had 2 GigE NICs. We tried adding a third NIC to see if we could get more out of the node. There was a modest improvement of less than 10 percent, so we decided to use 8 NICs on 4 nodes per run.
- The Panasas shelf had 4 NICs; we report its results multiplied by 2 so that they can be compared with all the other 8-NIC configurations.
Conclusions
1) With 8 GigE NICs in the system, one would expect a throughput in excess of 800 MB/s for large streaming I/O (each GigE link carries roughly 100-120 MB/s). Lustre and Panasas can clearly deliver this, and NFS also does quite well. The very fact that we were operating around 800 MB/s with this hardware means that our storage nodes were well balanced (no bottlenecks; we may even have had a reserve of 100 MB/s per setup).
2) Pileup results were relatively good for AFS, and best in the case of Panasas. The outcome of this benchmark is correlated with the number of spindles in the system. Two Panasas shelves had 40 spindles, while the 4 storage nodes used 64 spindles, so the Panasas file system was doing a much better job per spindle than any other solution we tested (NFS, AFS, Lustre).
USCMS Tier 1: Update on Use of the IBRIX Fusion FS
- CMS decided to implement the IBRIX solution. Why?
  - No specialized hardware required, enabling us to redeploy the hardware if this solution did not work.
  - Made the cost of the product less than others with hardware components.
  - Provided NFS access, a universal protocol which required no client-side code.
  - Initial decision was to use only NFS access because of this.
- IBRIX was very responsive to our requests and to issues we found during the evaluation.
- Purely software solution, no specialized hardware dependencies.
- Comprised of:
  - Highly scalable, POSIX-compliant parallel file system.
  - Logical volume manager based on LVM.
  - High availability.
  - Comprehensive management interface which includes a GUI.
Current Status
- Thin client is working very well and the system is stable.
- User and group quotas are working as part of our managed disk plan; a few requests for enhancements to the quota subsystem to improve quota management.
- Working on getting users to migrate off all NFS-mounted work areas and data disks.
- IBRIX file systems served via NFS in limited numbers have been stable.
- Refining admin documentation and knowledge.
- Will add 2 more segment servers and another 2.7 TB of disk; plan for 20 TB by the start of data taking.
- Thin-client RPMs are kernel-version dependent; IBRIX has committed to providing RPM updates for security releases in a timely fashion.
- No performance data is available because the current focus is on functionality.
SATA Evaluation at FNAL
- SATA is found in commodity or mid-level storage configurations (as opposed to enterprise-level) and cannot be expected to give the same performance as more expensive architectures. SATA controllers can be FC or PCI/PCI-X.
- Problems seen with SATA configurations: imperfect firmware, misleading claims by vendors, untested configurations, disruptive upgrades; you get what you pay for.
- A number of suggestions were made:
  - Careful selection of vendor.
  - Firmware upgrades.
  - Consider paying more if you can be sure of reduced ongoing maintenance costs (including human costs).
- Understand your needs properly and estimate the cost and effect of disk losses: is data loss acceptable or not? (A back-of-the-envelope example follows this list.)
Experiences Deploying Xrootd at RAL
What is xrootd?
- xrootd (eXtended Root Daemon) was written at SLAC and INFN Padova as part of the work to migrate the BaBar event store from Objectivity to Root I/O.
- It's a fully functional suite of tools for serving data, including server daemons and clients which talk to each other using the xroot protocol.
xrootd Architecture (layers, from top to bottom)
- Application
- Protocol manager: xrd
- Protocol layer: xrootd (the xroot protocol), with authentication
- Filesystem logical layer: ofs, with authorization (optional)
- Filesystem physical layer: oss (included in the distribution)
- Filesystem implementation: mss, _fs
Load Balanced Example with MSS
Benefits
- For users
  - Jobs don't crash if a disk/server goes down; they back off, contact the olb manager and get the data from somewhere else.
  - Queues aren't stopped just because 2% of the data is offline.
- For admins
  - No need for heroic efforts to recover damaged filesystems.
  - Much easier to schedule maintenance.
Conclusion
- Xrootd has proved to be easy to configure and link to our MSS. Initial indications are that the production service is both reliable and performant.
- This should improve the lives of both users and sysadmins, with huge advances in the robustness of the system and in its maintainability, without sacrificing performance.
- Talks, software (binaries and source), documentation and example configurations are available at http://xrootd.slac.stanford.edu
Grid Middleware and Service Challenge 3
BNL Service Challenge 3 (skipped)
IdF Background
- Need to start setting up the Tier2 now to be ready on time.
- The LCG (France) effort has concentrated on the Tier1 until now.
- Technical and financial challenges require time to be solved.
- French HEP institutes are quite small: 100-150 persons, small computing manpower.
- IdF (Ile-de-France, the Paris region) has a large concentration of big HEP labs and physicists.
  - 6 labs, among which DAPNIA (600 people) and LAL (350).
  - DAPNIA and LAL have been involved in the Grid effort since the beginning of EDG.
  - 3 EGEE contracts (2 for operation support).
Objectives
- Build a Tier2 facility for simulation and analysis.
  - 80% LHC (4 experiments), 20% EGEE and local use.
  - 2/3 LHC analysis.
  - Analysis requires a large amount of storage.
- Be ready at LHC startup (2nd half of 2007).
- Resource goals
  - CPU: 1500 x 1 kSI2K (P4 Xeon 2.8 GHz), max CMS 800.
  - Storage: 350 TB of disk (no MSS planned), max CMS 220.
  - Network: 10 Gb/s backbone inside the Tier2, 1 or 10 Gb/s external link.
- Need 1.6 M Euros.
Storage Challenge
- Efficient use and management of a large amount of storage is seen as the main challenge.
- Plan to participate in SC3.
- 2006: mini Tier2; 2007: production Tier2.
SC2 Summary
- SC2 met its throughput goals (achieved >600 MB/s daily average for 10 days), and with more sites than originally planned!
- A big improvement over SC1.
- But we still don't have something we can call a service.
- Monitoring is better.
  - We see outages when they happen, and we understand why they happen.
- First step towards operations guides.
- Some advances in infrastructure and software will happen before SC3.
  - gLite transfer software.
  - SRM service more widely deployed.
- We have to understand how to incorporate these elements.
Service Challenge 3 - Phases
High-level view:
- Setup phase (includes throughput test)
  - 2 weeks sustained in July 2005.
  - Obvious target: GDB of July 20th.
  - Primary goals:
    - 150 MB/s disk-to-disk to Tier1s.
    - 60 MB/s disk (T0) to tape (T1s).
  - Secondary goals:
    - Include a few named T2 sites (T2 -> T1 transfers).
    - Encourage remaining T1s to start disk-to-disk transfers.
- Service phase
  - September to end 2005.
  - Start with ALICE and CMS, add ATLAS and LHCb in October/November.
  - All offline use cases except for analysis.
  - More components: WMS, VOMS, catalogs, experiment-specific solutions.
  - Implies a production setup (CE, SE, ...).
SC3 Milestone Decomposition
- File transfer goals
  - Build up disk-to-disk transfer speeds to 150 MB/s per site, with 1 GB/s out of CERN (SC2 was 100 MB/s, as agreed per site).
  - Include tape transfer speeds of 60 MB/s, with 300 MB/s out of CERN.
- Tier1 goals
  - Bring in additional Tier1 sites with respect to SC2 (at least with respect to the original plan).
  - PIC and Nordic most likely added later (SC4?).
- Tier2 goals
  - Start to bring Tier2 sites into the challenge.
  - Agree which services T2s offer / require.
  - On-going plan (more later) to address these.
- Additional components
  - Catalogs, VOs, experiment-specific solutions, 3D involvement, ...
  - Choice of software components, validation, fallback, ...
LCG Service Challenges: Planning for Tier2 Sites
Update for the HEPiX meeting
Executive Summary
- Tier2 issues have been discussed extensively since early this year.
- The role of Tier2s, and the services they offer and require, has been clarified.
- The data rates for MC data are expected to be rather low (limited by the available CPU resources).
- The data rates for analysis data depend heavily on the analysis model (and on the feasibility of producing new analysis datasets, IMHO).
- LCG needs to provide installation guides / tutorials for DPM, FTS and LFC.
- Tier1s need to assist Tier2s in establishing services.
Tier2 and Base S/W Components
- Disk pool manager (of some flavour), e.g. dCache, DPM, ...
- gLite FTS client (and T1 services)
- Possibly also a local catalog, e.g. LFC, FiReMan, ...
- Experiment-specific s/w and services ("agents")
Tier2s and SC3
- The initial goal is for a small number of Tier2-Tier1 partnerships to set up agreed services and gain experience.
  - This will be input to a wider deployment model.
- Need to test transfers in both directions:
  - MC upload.
  - Analysis data download.
- The focus is on service rather than on throughput tests.
- As an initial goal, we would propose running transfers over at least several days, e.g. using 1 GB files, showing sustained rates of 3 files/hour T2 -> T1 (see the worked estimate after this list).
- More concrete goals for the service phase will be defined together with the experiments in the coming weeks.
  - Definitely no later than the June 13-15 workshop.
T2s: Concrete Target
- We need a small number of well-identified T2/T1 partners for SC3, as listed above.
- Do not strongly couple T2-T1 transfers to the T0-T1 throughput goals of the SC3 setup phase.
- Nevertheless, target one week of reliable T2 -> T1 transfers involving at least two T1 sites, each with at least two T2s, by end of July 2005.
The LCG File Catalog (LFC)
- Jean-Philippe Baud, Sophie Lemaitre
- IT-GD, CERN
- May 2005
LCG File Catalog
- Based on lessons learned in the 2004 Data Challenges (DCs).
  - Fixes performance and scalability problems seen in the EDG catalogs:
    - Cursors for large queries.
    - Timeouts and retries from the client (a generic sketch of this pattern follows the list).
- Provides more features than the EDG catalogs:
  - User-exposed transaction API.
  - Hierarchical namespace and namespace operations.
  - Integrated GSI authentication and authorization.
  - Access control lists (Unix permissions and POSIX ACLs).
  - Checksums.
- Based on an existing code base.
- Supports Oracle and MySQL database backends.
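The "timeouts and retries from the client" point can be illustrated with a generic sketch. This is not LFC client code; the query function is a hypothetical stand-in, and only the retry-with-backoff pattern is the point.

```python
# Generic client-side timeout-and-retry pattern of the kind described above.
# `query_catalog` is a hypothetical stand-in for a real catalog call.
import random
import socket
import time


def query_catalog(path: str, timeout: float) -> list[str]:
    """Placeholder for a real catalog lookup; may raise socket.timeout."""
    raise NotImplementedError


def query_with_retries(path: str, retries: int = 3, timeout: float = 30.0) -> list[str]:
    delay = 1.0
    for attempt in range(1, retries + 1):
        try:
            return query_catalog(path, timeout=timeout)
        except (socket.timeout, ConnectionError):
            if attempt == retries:
                raise
            # Exponential backoff with a little jitter before retrying.
            time.sleep(delay + random.random())
            delay *= 2
    return []  # unreachable, kept for completeness
```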
Relationships in the Catalog
Features
- Namespace operations
  - All names are in a hierarchical namespace.
  - mkdir(), opendir(), etc., and also chdir().
  - A GUID is attached to every directory and file.
- Security: GSI authentication and authorization
  - Mapping is done from the client DN to a uid/gid pair (a toy model follows this list).
  - Authorization is done in terms of uid/gid.
  - VOMS will be integrated (collaboration with INFN/NIKHEF); VOMS roles appear as a list of gids.
  - Ownership of files is stored in the catalog.
- Permissions implemented
  - Unix (user, group, all) permissions.
  - POSIX ACLs (groups and users).
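A toy model of the security scheme above, mapping a client DN to a uid/gid pair and then checking Unix-style permissions, is sketched below. It is purely illustrative; the real LFC uses its own mapping tables and VOMS integration, and the DNs shown are made up.

```python
# Toy model of the GSI security scheme described above: a client DN is
# mapped to a uid/gid pair and authorization is then a plain Unix-style
# permission check. Purely illustrative, not the real LFC implementation.
import stat

# Hypothetical mapping table from certificate DN to (uid, gid).
DN_MAP = {
    "/DC=ch/DC=cern/OU=Users/CN=alice": (10001, 2000),
    "/DC=ch/DC=cern/OU=Users/CN=bob": (10002, 2001),
}


def authenticate(dn: str) -> tuple[int, int]:
    return DN_MAP[dn]


def may_read(uid: int, gids: list[int], owner: int, group: int, mode: int) -> bool:
    """Unix-style read check; VOMS roles would simply extend `gids`."""
    if uid == owner:
        return bool(mode & stat.S_IRUSR)
    if group in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


uid, gid = authenticate("/DC=ch/DC=cern/OU=Users/CN=alice")
print(may_read(uid, [gid], owner=10001, group=2000, mode=0o640))  # True
```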
LFC Tests
- The LFC has been tested and shown to be scalable to at least 40 million entries and 100 client threads.
- Performance is improved compared to the RLS.
- Stable: continuous running at high load for extended periods of time with no crashes.
- Based on code which has been in production for more than 4 years.
- Tuning is required to improve bulk performance.
FiReMan Performance - Insert
(Chart: insert rate in inserts/second, up to about 350, versus number of client threads from 1 to 100, for Fireman single entry, Fireman bulk 100, and LFC.)
FiReMan Performance - Queries
(Chart: entries returned per second, up to about 1200, versus number of client threads from 1 to 100, for Fireman single entry, Fireman bulk 100, and LFC.)
Tests Conclusion
- Both LFC and FiReMan offer large improvements over RLS.
- Still some issues remaining:
  - Scalability of FiReMan.
  - Bulk operations for LFC.
  - More work needed to understand performance and bottlenecks.
- Need to test some real use cases.
File Transfer Software and Service for SC3
- Gavin McCance
- LHC Service Challenge
FTS Service
- It provides point-to-point movement of SURLs.
  - Aims to provide reliable file transfer between sites, and that's it!
  - Allows sites to control their resource usage.
  - Does not do routing (e.g. like PhEDEx).
  - Does not deal with GUIDs, LFNs, datasets or collections.
- It's a fairly simple service that provides sites with a reliable and manageable way of serving file-movement requests from their VOs.
- Together with the experiments, we are working out the places in the software where extra functionality can be plugged in:
  - How the VO software frameworks can load the system with work.
  - Places where VO-specific operations (such as cataloguing) can be plugged in, if required.
Single channel
Multiple channels
A single set of servers can manage multiple channels from a site.
What you need to run the server
- An Oracle database to hold the state.
  - MySQL is on the list, but low priority unless someone screams.
- A transfer server to run the transfer agents (a minimal sketch of the agent roles follows this list).
  - Agents responsible for assigning jobs to the channels managed by that site.
  - Agents responsible for actually running the transfer (or for delegating the transfer to srm-cp).
- An application server (tested with Tomcat 5).
  - To run the submission and monitoring portal, i.e. the thing you use to talk to the system.
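As an illustration of the two agent roles above, here is a minimal sketch of a channel-based transfer agent loop. The names, the in-memory queues and the faked copy step are assumptions for illustration only; the real agents keep their state in the Oracle database and hand the actual transfer to gridftp or srm-cp.

```python
# Minimal sketch of the two agent roles described above: one assigns queued
# jobs to the channels this site manages, the other executes transfers on a
# channel. Everything here is an illustrative assumption, not the real FTS.
from collections import defaultdict, deque

CHANNELS = {("CERN", "RAL"), ("CERN", "FNAL")}       # channels managed by this site
pending_jobs = deque()                               # jobs submitted via the portal
channel_queues = defaultdict(deque)


def submit(source_surl: str, dest_surl: str, source_site: str, dest_site: str) -> None:
    pending_jobs.append((source_surl, dest_surl, source_site, dest_site))


def allocator_pass() -> None:
    """Assign each pending job to a channel managed by this site, if any."""
    for _ in range(len(pending_jobs)):
        job = pending_jobs.popleft()
        channel = (job[2], job[3])
        if channel in CHANNELS:
            channel_queues[channel].append(job)
        else:
            pending_jobs.append(job)                 # not ours; leave it queued


def transfer_pass(channel: tuple, max_concurrent: int = 5) -> None:
    """Run up to max_concurrent transfers on one channel (copy step faked)."""
    for _ in range(min(max_concurrent, len(channel_queues[channel]))):
        src, dst, *_ = channel_queues[channel].popleft()
        print("copy %s -> %s" % (src, dst))          # real agent: gridftp / srm-cp


submit("srm://cern.ch/f1", "srm://ral.ac.uk/f1", "CERN", "RAL")
allocator_pass()
transfer_pass(("CERN", "RAL"))
```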
Initial use models considered
- Tier-0 to Tier-1 distribution
  - Proposal: put the server at Tier-0.
  - This was the model used in SC2.
- Tier-1 to Tier-2 distribution
  - Proposal: put the server at Tier-1 (push).
  - This is analogous to the SC2 model.
- Tier-2 to Tier-1 upload
  - Proposal: put the server at Tier-1 (pull).
Summary
- Propose server at Tier-0 and Tier-1.
  - Oracle DB, Tomcat application server, transfer node.
- Propose client tools at T0, T1 and T2.
  - This is a UI / WN type install.
- Evaluation setup:
  - Initially at CERN T0, interacting with T1s a la SC2.
  - Expand to a few agreed T1s interacting with agreed T2s.
Disk Pool Manager Aims
- Provide a solution for Tier-2s in LCG-2.
  - This implies a few tens of terabytes in 2005.
- Focus on manageability:
  - Easy to install.
  - Easy to configure.
  - Low effort for ongoing maintenance.
  - Easy to add/remove resources.
- Support for multiple physical partitions, on one or more disk server nodes.
- Support for different space types: volatile and permanent.
- Support for multiple replicas of hot files within the disk pools.
Manageability
- Few daemons to install.
- No central configuration files.
  - Disk nodes request to add themselves to the DPM.
- All state is kept in a DB (easy to restore after a crash).
- Easy to remove disks and partitions.
  - Allows simple reconfiguration of the disk pools.
  - An administrator can temporarily remove file systems from the DPM if a disk has crashed and is being repaired.
  - The DPM automatically marks a file system as unavailable when it is not contactable.
Features
- DPM access via different interfaces:
  - Direct socket interface.
  - SRM v1.
  - SRM v2 Basic.
  - Also offers a large part of SRM v2 Advanced:
    - Global space reservation (next version).
    - Namespace operations.
    - Permissions.
    - Copy and remote get/put (next version).
- Data access:
  - GridFTP, rfio (ROOTD, XROOTD could easily be added).
Security
- GSI authentication and authorization.
  - Mapping is done from the client DN to a uid/gid pair.
  - Authorization is done in terms of uid/gid.
- Ownership of files is stored in the DPM catalog, while the physical files on disk are owned by the DPM.
- Permissions implemented on files and directories:
  - Unix (user, group, other) permissions.
  - POSIX ACLs (groups and users).
- Propose to use SRM as the interface to set permissions in the Storage Elements (requires v2.1 minimum, with the Directory and Permission methods).
- VOMS will be integrated.
  - VOMS roles appear as a list of gids.
Architecture
- The lightweight Disk Pool Manager consists of:
  - The Disk Pool Manager, with its configuration and request DB.
  - The Disk Pool Name Server.
  - The SRM servers.
  - The RFIOD and DPM-aware GsiFTP servers.
- How many machines?
  - DPM, DPNS and SRM can be installed on the same one.
  - RFIOD on each disk server managed by the DPM.
  - GsiFTP on each disk server managed by the DPM.
Status
- DPM will be part of the LCG 2.5.0 release, but is available from now on for testing.
- Satisfies the gLite requirement for an SRM interface at Tier-2.
Thank You