Title: HEPiX 2005 Trip Reports
Highlights
- HEPiX was asked, and agreed, to act as technical advisor to IHEPCCC on specific questions where it has expertise. Examples given were the status of Linux in HEP and the idea of a virtual organisation for HEP physicists.
- The most recent successful HEP-wide collaboration is the distribution and widespread acceptance of Scientific Linux. Discussions focused on which versions would need to be supported for LHC over the next two years.
- LEMON, NGOP and SLAC Nagios monitoring.
- Service Challenge Preparation.
- Batch Queue Systems.
- DoE budget cuts affected SLAC, FNAL, and BNL.
Disk, Tape, Storage and File Systems
Fermilab Mass Storage
- ENSTORE, dCache and SRM for CDF, D0 and CMS.
- Name space: PNFS.
  - Provides a hierarchical namespace for users' files in Enstore.
  - Manages file metadata.
  - Looks like an NFS-mounted file system from user nodes.
  - Stands for "Perfectly Normal File System".
  - Written at DESY.
- ENSTORE hardware: 6 silos; tape drives: 9 LTO1, 14 LTO2, 20 9940, 52 9940B, 8 DLT (4000/8000); and 127 commodity PCs.
- 2.6 Petabytes of data, 10.8 million files, 25,000 volumes; record rate 27 Terabytes/day.
dCache
- dCache is deployed on top of ENSTORE or stand-alone.
- Improves performance by serving files from disk caches instead of reading them from tape each time they are needed.
- 100 pool nodes with 225 Terabytes of disk.
- Lessons learned:
  - Use the XFS filesystem on the pool disks.
  - Use direct I/O when accessing files on the local dCache disk (see the sketch below).
  - Users will push the system to its limits. Be prepared.
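To make the direct-I/O advice concrete, here is a minimal sketch, assuming a Linux client and a locally mounted pool file; the path is a made-up example and this is not part of dCache itself. It opens a file with O_DIRECT and reads it through a page-aligned buffer, bypassing the page cache.

```python
# Minimal sketch of reading a file with O_DIRECT on Linux, in the spirit of
# the pool-node advice above. The path is hypothetical; O_DIRECT requires
# block-aligned buffers and transfer sizes.
import mmap
import os

PATH = "/pool/data/example.file"   # hypothetical local pool file
BLOCK = 1 << 20                    # 1 MiB, a multiple of the 4 KiB block size

fd = os.open(PATH, os.O_RDONLY | os.O_DIRECT)
buf = mmap.mmap(-1, BLOCK)         # anonymous mmap is page-aligned, as O_DIRECT needs
total = 0
try:
    while True:
        n = os.readv(fd, [buf])    # read directly into the aligned buffer
        if n == 0:
            break
        total += n
finally:
    buf.close()
    os.close(fd)

print("read %d bytes with direct I/O" % total)
```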
SRM
- Provides a uniform interface for access to multiple storage systems via the SRM protocol.
- SRM is a broker that works on top of other storage systems (sketched below):
  - dCache
  - UNIX filesystem
  - Enstore (in development)
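To picture the broker idea, here is a minimal sketch of a uniform front end dispatching to per-backend implementations. All class and method names are invented for illustration; this is not the SRM API itself.

```python
# Illustrative sketch of the "uniform interface on top of several storage
# systems" idea behind SRM. All names here are hypothetical, not the real
# SRM protocol or any product API.
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    @abstractmethod
    def prepare_to_get(self, path: str) -> str:
        """Stage the file if necessary and return a transfer URL."""


class DCacheBackend(StorageBackend):
    def prepare_to_get(self, path: str) -> str:
        # A real implementation would ask dCache to stage the file from Enstore.
        return "gsiftp://dcache-door.example.org" + path


class UnixBackend(StorageBackend):
    def prepare_to_get(self, path: str) -> str:
        # Plain filesystem: nothing to stage.
        return "file://" + path


class StorageBroker:
    """Single entry point that hides which backend actually owns the data."""

    def __init__(self) -> None:
        self._backends: dict[str, StorageBackend] = {}

    def register(self, prefix: str, backend: StorageBackend) -> None:
        self._backends[prefix] = backend

    def get(self, path: str) -> str:
        for prefix, backend in self._backends.items():
            if path.startswith(prefix):
                return backend.prepare_to_get(path)
        raise KeyError("no backend registered for " + path)


broker = StorageBroker()
broker.register("/pnfs/", DCacheBackend())
broker.register("/data/", UnixBackend())
print(broker.get("/pnfs/fnal.gov/some/file"))
```

The point is that clients only ever see the broker's single interface, regardless of which storage system actually holds the data.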
CASPUR file systems: Lustre, AFS, Panasas
Components
- High-end Linux units for both servers and clients
  - Servers: 2-way Intel Nocona 3.4 GHz, 2 GB RAM, 2 QLA2310 2 Gbit HBAs
  - Clients: 2-way Intel Xeon 2.4 GHz, 1 GB RAM
  - OS: SuSE SLES 9 on servers, SLES 9 / RHEL 3 on clients
- Network: non-blocking GigE switches
  - Cisco 3570G-24TS (24 ports)
  - Extreme Networks Summit 400-48t (48 ports)
- SAN
  - QLogic SANbox 5200 (32 ports)
- Appliances
  - 3 Panasas ActiveScale shelves
Panasas Storage Cluster Components
- Integrated GE switch
- Battery module (2 power units)
- Shelf front: 1 DirectorBlade (DB), 10 StorageBlades (SB)
- Shelf rear: midplane routes GE and power
Test setup (NFS, AFS, Lustre)
- Load farm: 16 biprocessor nodes at 2.4 GHz
- Gigabit Ethernet: Cisco/Extreme switches
- MDS (Lustre)
- SAN: QLogic 5200
- On each server, 2 Gigabit Ethernet NICs were bonded (bonding-ALB). LD1 and LD2 could be IFT or DDN. Each LD was zoned to a distinct HBA.
What we measured
1) Massive aggregate I/O (large files, lmdd)
   - All 16 clients were unleashed together; file sizes varied in the range 5-10 GB.
   - Gives a good idea of the system's overall throughput.
2) Pileup. This special benchmark was developed at CERN by R. Többicke (a sketch of the access pattern follows this list).
   - Emulation of an important use case foreseen in one of the LHC experiments.
   - Several (64-128) 2 GB files are first prepared on the file system under test.
   - The files are then read by a growing number of reader threads (ramp-up).
   - Every thread randomly selects one file out of the list.
   - In a single read, an arbitrary offset within the file is calculated, and 50-60 KB are read starting at this offset.
   - Output is the number of operations times bytes read per time interval.
   - Pileup results are important for future service planning.
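A rough re-creation of the Pileup access pattern described above: random 50-60 KB reads at arbitrary offsets in pre-created 2 GB files, with a ramp-up of reader threads. This is not Többicke's original benchmark, just a minimal sketch under the stated assumptions; the file locations are hypothetical.

```python
# Minimal re-creation of the Pileup access pattern: each reader thread
# repeatedly picks a random file, seeks to a random offset and reads
# 50-60 KB. Paths, durations and thread counts are illustrative only.
import random
import threading
import time

FILES = ["/mnt/test/pileup_%03d.dat" % i for i in range(64)]  # pre-created 2 GB files
FILE_SIZE = 2 * 1024**3
DURATION = 60          # seconds per measurement step
bytes_read = 0
lock = threading.Lock()


def reader(stop_at: float) -> None:
    global bytes_read
    while time.time() < stop_at:
        path = random.choice(FILES)
        size = random.randint(50 * 1024, 60 * 1024)
        offset = random.randrange(FILE_SIZE - size)
        with open(path, "rb") as f:
            f.seek(offset)
            data = f.read(size)
        with lock:
            bytes_read += len(data)


# Ramp-up: grow the number of concurrent readers and report the aggregate rate.
for nthreads in (1, 2, 4, 8, 16, 32, 64):
    bytes_read = 0
    stop_at = time.time() + DURATION
    threads = [threading.Thread(target=reader, args=(stop_at,)) for _ in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print("%3d readers: %.1f MB/s" % (nthreads, bytes_read / DURATION / 1e6))
```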
A typical Pileup curve
3) Emulation of a DAQ data buffer
   - A very common scenario in HEP DAQ architectures.
   - Data is constantly arriving from the detector and has to end up on tertiary storage (tapes).
   - A temporary storage area on the data's way to tape serves for reorganisation of streams, for preliminary real-time analysis, and as a safety buffer to ride out interruptions of the archival system.
   - Of big interest for service planning: the general throughput of a balanced data buffer.
   - A DAQ manager may moderate the data influx (for instance, by tuning certain trigger rates), thus balancing it with the outflux.
   - We ran 8 writers and 8 readers, one process per client. Each file was accessed at any given moment by one and only one process. On writer nodes we could moderate the writer speed by adding dummy CPU eaters (a schematic sketch follows this list).
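A schematic, single-node version of the buffer exercise described above: 8 writers fill a buffer area while 8 readers drain finished files to a stand-in for tape, with an optional CPU eater to throttle the writers. This is an illustration of the setup, not the actual benchmark code; all paths, sizes and knobs are assumptions.

```python
# Schematic single-node version of the DAQ data-buffer exercise: writers
# fill a buffer area while readers drain files to "tape". Paths, file
# sizes and the throttling knob are illustrative assumptions.
import os
import queue
import threading

BUFFER_DIR = "/mnt/buffer"      # temporary storage area (assumed)
TAPE_DIR = "/mnt/fake_tape"     # stand-in for the archival system (assumed)
FILE_SIZE = 256 * 1024**2       # 256 MiB per file
CHUNK = 8 * 1024**2
CPU_EATER_LOOPS = 0             # raise to moderate the writer speed

ready = queue.Queue()           # files fully written and ready to be drained


def writer(wid: int, nfiles: int) -> None:
    for i in range(nfiles):
        path = os.path.join(BUFFER_DIR, "w%d_%04d.raw" % (wid, i))
        with open(path, "wb") as f:
            written = 0
            while written < FILE_SIZE:
                f.write(os.urandom(CHUNK))
                written += CHUNK
                for _ in range(CPU_EATER_LOOPS):  # dummy CPU eater
                    pass
        ready.put(path)          # hand the finished file to exactly one reader


def reader() -> None:
    while True:
        path = ready.get()
        if path is None:
            break
        dest = os.path.join(TAPE_DIR, os.path.basename(path))
        with open(path, "rb") as src, open(dest, "wb") as dst:
            while chunk := src.read(CHUNK):
                dst.write(chunk)
        os.remove(path)          # free the buffer space


writers = [threading.Thread(target=writer, args=(w, 4)) for w in range(8)]
readers = [threading.Thread(target=reader) for _ in range(8)]
for t in writers + readers:
    t.start()
for t in writers:
    t.join()
for _ in readers:
    ready.put(None)              # tell each reader to stop
for t in readers:
    t.join()
```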
DAQ Data Buffer
Results for 8 GigE outlets
Two remarks:
- Each of the storage nodes had 2 GigE NICs. We tried adding a third NIC to see if we could get more out of the node. There was a modest improvement of less than 10 percent, so we decided to use 8 NICs on 4 nodes per run.
- The Panasas shelf had 4 NICs; we report its results multiplied by 2 so that they can be compared with all the other 8-NIC configurations.
Conclusions
1) With 8 GigE NICs in the system, one would expect a throughput in excess of 800 MB/s for large streaming I/O (each GigE link carries roughly 100-120 MB/s). Lustre and Panasas can clearly deliver this, and NFS also does quite well. The very fact that we were operating around 800 MB/s with this hardware means that our storage nodes were well balanced (no bottlenecks; we may even have had a reserve of 100 MB/s per setup).
2) Pileup results were relatively good for AFS, and best in the case of Panasas. The outcome of this benchmark is correlated with the number of spindles in the system. Two Panasas shelves had 40 spindles, while the 4 storage nodes used 64 spindles, so the Panasas file system was doing a much better job per spindle than any other solution we tested (NFS, AFS, Lustre).
USCMS Tier 1: Update on Use of the IBRIX Fusion FS
- CMS decided to implement the IBRIX solution. Why?
  - No specialized hardware required, enabling us to redeploy the hardware if this solution did not work.
  - Made the cost of the product less than others with hardware components.
  - Provided NFS access, a universal protocol which required no client-side code.
  - Initial decision was to use only NFS access because of this.
- IBRIX was very responsive to our requests and to issues we found during the evaluation.
- Purely software solution, no specialized hardware dependencies.
- Comprised of:
  - Highly scalable, POSIX-compliant parallel file system.
  - Logical volume manager based on LVM.
  - High availability.
  - Comprehensive management interface which includes a GUI.
Current Status
- Thin client is working very well and the system is stable.
- User and group quotas are working as part of our managed disk plan; a few requests for enhancements to the quota subsystem to improve quota management.
- Working on getting users to migrate off all NFS-mounted work areas and data disks.
- IBRIX file systems served via NFS in limited numbers have been stable.
- Refining admin documentation and knowledge.
- Will add 2 more segment servers and another 2.7 TB of disk; plan for 20 TB by the start of data taking.
- Thin-client RPMs are kernel-version dependent; IBRIX has committed to providing RPM updates for security releases in a timely fashion.
- No performance data is available because the current focus is on functionality.
SATA Evaluation at FNAL
- SATA is found in commodity or mid-level storage configurations (as opposed to enterprise-level) and cannot be expected to give the same performance as more expensive architectures. SATA controllers can be FC or PCI/PCI-X.
- Problems seen with SATA configurations: imperfect firmware, misleading claims by vendors, untested configurations, disruptive upgrades; you get what you pay for.
- A number of suggestions were made:
  - Careful selection of vendor.
  - Firmware upgrades.
  - Consider paying more if you can be sure of reduced ongoing maintenance costs (including human costs).
- Understand your needs properly and estimate the cost and effect of disk losses: is data loss acceptable or not? (A back-of-the-envelope example follows this list.)
Experiences Deploying Xrootd at RAL
What is xrootd?
- xrootd (eXtended Root Daemon) was written at SLAC and INFN Padova as part of the work to migrate the BaBar event store from Objectivity to Root I/O.
- It's a fully functional suite of tools for serving data, including server daemons and clients which talk to each other using the xroot protocol.
xrootd Architecture (layers, from top to bottom)
- Application
- Protocol manager: xrd
- Protocol layer: xrootd (the xroot protocol), with authentication
- Filesystem logical layer: ofs, with authorization (optional)
- Filesystem physical layer: oss (included in the distribution)
- Filesystem implementation: mss, _fs
Load Balanced Example with MSS
Benefits
- For users
  - Jobs don't crash if a disk/server goes down; they back off, contact the olb manager and get the data from somewhere else.
  - Queues aren't stopped just because 2% of the data is offline.
- For admins
  - No need for heroic efforts to recover damaged filesystems.
  - Much easier to schedule maintenance.
Conclusion
- Xrootd has proved to be easy to configure and link to our MSS. Initial indications are that the production service is both reliable and performant.
- This should improve the lives of both users and sysadmins, with huge advances in the robustness of the system and in its maintainability, without sacrificing performance.
- Talks, software (binaries and source), documentation and example configurations are available at http://xrootd.slac.stanford.edu
Grid Middleware and Service Challenge 3
BNL Service Challenge 3 (skipped)
IdF Background
- Need to start setting up the Tier2 now to be ready on time.
- The LCG (France) effort has concentrated on the Tier1 until now.
- Technical and financial challenges require time to be solved.
- French HEP institutes are quite small: 100-150 persons, small computing manpower.
- IdF (Ile-de-France, the Paris region) has a large concentration of big HEP labs and physicists.
  - 6 labs, among which DAPNIA (600 people) and LAL (350).
  - DAPNIA and LAL have been involved in the Grid effort since the beginning of EDG.
  - 3 EGEE contracts (2 for operation support).
Objectives
- Build a Tier2 facility for simulation and analysis.
  - 80% LHC (4 experiments), 20% EGEE and local use.
  - 2/3 LHC analysis.
  - Analysis requires a large amount of storage.
- Be ready at LHC startup (2nd half of 2007).
- Resource goals
  - CPU: 1500 x 1 kSI2K (P4 Xeon 2.8 GHz), max CMS 800.
  - Storage: 350 TB of disk (no MSS planned), max CMS 220.
  - Network: 10 Gb/s backbone inside the Tier2, 1 or 10 Gb/s external link.
- Need 1.6 M Euros.
Storage Challenge
- Efficient use and management of a large amount of storage is seen as the main challenge.
- Plan to participate in SC3.
- 2006: mini Tier2; 2007: production Tier2.
SC2 Summary
- SC2 met its throughput goals (achieved >600 MB/s daily average for 10 days), and with more sites than originally planned!
- A big improvement over SC1.
- But we still don't have something we can call a service.
- Monitoring is better.
  - We see outages when they happen, and we understand why they happen.
- First step towards operations guides.
- Some advances in infrastructure and software will happen before SC3.
  - gLite transfer software.
  - SRM service more widely deployed.
- We have to understand how to incorporate these elements.
Service Challenge 3 - Phases
High-level view:
- Setup phase (includes throughput test)
  - 2 weeks sustained in July 2005.
  - Obvious target: GDB of July 20th.
  - Primary goals:
    - 150 MB/s disk-to-disk to Tier1s.
    - 60 MB/s disk (T0) to tape (T1s).
  - Secondary goals:
    - Include a few named T2 sites (T2 -> T1 transfers).
    - Encourage remaining T1s to start disk-to-disk transfers.
- Service phase
  - September to end 2005.
  - Start with ALICE and CMS, add ATLAS and LHCb in October/November.
  - All offline use cases except for analysis.
  - More components: WMS, VOMS, catalogs, experiment-specific solutions.
  - Implies a production setup (CE, SE, ...).
SC3 Milestone Decomposition
- File transfer goals
  - Build up disk-to-disk transfer speeds to 150 MB/s per site, with 1 GB/s out of CERN (SC2 was 100 MB/s, as agreed per site).
  - Include tape transfer speeds of 60 MB/s, with 300 MB/s out of CERN.
- Tier1 goals
  - Bring in additional Tier1 sites with respect to SC2 (at least with respect to the original plan).
  - PIC and Nordic most likely added later (SC4?).
- Tier2 goals
  - Start to bring Tier2 sites into the challenge.
  - Agree which services T2s offer / require.
  - On-going plan (more later) to address these.
- Additional components
  - Catalogs, VOs, experiment-specific solutions, 3D involvement, ...
  - Choice of software components, validation, fallback, ...
LCG Service Challenges: Planning for Tier2 Sites
Update for the HEPiX meeting
Executive Summary
- Tier2 issues have been discussed extensively since early this year.
- The role of Tier2s, and the services they offer and require, has been clarified.
- The data rates for MC data are expected to be rather low (limited by the available CPU resources).
- The data rates for analysis data depend heavily on the analysis model (and on the feasibility of producing new analysis datasets, IMHO).
- LCG needs to provide installation guides / tutorials for DPM, FTS and LFC.
- Tier1s need to assist Tier2s in establishing services.
Tier2 and Base S/W Components
- Disk pool manager (of some flavour), e.g. dCache, DPM, ...
- gLite FTS client (and T1 services)
- Possibly also a local catalog, e.g. LFC, FiReMan, ...
- Experiment-specific s/w and services ("agents")
Tier2s and SC3
- The initial goal is for a small number of Tier2-Tier1 partnerships to set up agreed services and gain experience.
  - This will be input to a wider deployment model.
- Need to test transfers in both directions:
  - MC upload.
  - Analysis data download.
- The focus is on service rather than on throughput tests.
- As an initial goal, we would propose running transfers over at least several days, e.g. using 1 GB files, showing sustained rates of 3 files/hour T2 -> T1 (see the worked estimate after this list).
- More concrete goals for the service phase will be defined together with the experiments in the coming weeks.
  - Definitely no later than the June 13-15 workshop.
T2s: Concrete Target
- We need a small number of well-identified T2/T1 partners for SC3, as listed above.
- Do not strongly couple T2-T1 transfers to the T0-T1 throughput goals of the SC3 setup phase.
- Nevertheless, target one week of reliable T2 -> T1 transfers involving at least two T1 sites, each with at least two T2s, by end of July 2005.
The LCG File Catalog (LFC)
- Jean-Philippe Baud, Sophie Lemaitre
- IT-GD, CERN
- May 2005
LCG File Catalog
- Based on lessons learned in the 2004 Data Challenges (DCs).
  - Fixes performance and scalability problems seen in the EDG catalogs:
    - Cursors for large queries.
    - Timeouts and retries from the client (a generic sketch of this pattern follows the list).
- Provides more features than the EDG catalogs:
  - User-exposed transaction API.
  - Hierarchical namespace and namespace operations.
  - Integrated GSI authentication and authorization.
  - Access control lists (Unix permissions and POSIX ACLs).
  - Checksums.
- Based on an existing code base.
- Supports Oracle and MySQL database backends.
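The "timeouts and retries from the client" point can be illustrated with a generic sketch. This is not LFC client code; the query function is a hypothetical stand-in, and only the retry-with-backoff pattern is the point.

```python
# Generic client-side timeout-and-retry pattern of the kind described above.
# `query_catalog` is a hypothetical stand-in for a real catalog call.
import random
import socket
import time


def query_catalog(path: str, timeout: float) -> list[str]:
    """Placeholder for a real catalog lookup; may raise socket.timeout."""
    raise NotImplementedError


def query_with_retries(path: str, retries: int = 3, timeout: float = 30.0) -> list[str]:
    delay = 1.0
    for attempt in range(1, retries + 1):
        try:
            return query_catalog(path, timeout=timeout)
        except (socket.timeout, ConnectionError):
            if attempt == retries:
                raise
            # Exponential backoff with a little jitter before retrying.
            time.sleep(delay + random.random())
            delay *= 2
    return []  # unreachable, kept for completeness
```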
Relationships in the Catalog
Features
- Namespace operations
  - All names are in a hierarchical namespace.
  - mkdir(), opendir(), etc., and also chdir().
  - A GUID is attached to every directory and file.
- Security: GSI authentication and authorization
  - Mapping is done from the client DN to a uid/gid pair (a toy model follows this list).
  - Authorization is done in terms of uid/gid.
  - VOMS will be integrated (collaboration with INFN/NIKHEF); VOMS roles appear as a list of gids.
  - Ownership of files is stored in the catalog.
- Permissions implemented
  - Unix (user, group, all) permissions.
  - POSIX ACLs (groups and users).
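A toy model of the security scheme above, mapping a client DN to a uid/gid pair and then checking Unix-style permissions, is sketched below. It is purely illustrative; the real LFC uses its own mapping tables and VOMS integration, and the DNs shown are made up.

```python
# Toy model of the GSI security scheme described above: a client DN is
# mapped to a uid/gid pair and authorization is then a plain Unix-style
# permission check. Purely illustrative, not the real LFC implementation.
import stat

# Hypothetical mapping table from certificate DN to (uid, gid).
DN_MAP = {
    "/DC=ch/DC=cern/OU=Users/CN=alice": (10001, 2000),
    "/DC=ch/DC=cern/OU=Users/CN=bob": (10002, 2001),
}


def authenticate(dn: str) -> tuple[int, int]:
    return DN_MAP[dn]


def may_read(uid: int, gids: list[int], owner: int, group: int, mode: int) -> bool:
    """Unix-style read check; VOMS roles would simply extend `gids`."""
    if uid == owner:
        return bool(mode & stat.S_IRUSR)
    if group in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


uid, gid = authenticate("/DC=ch/DC=cern/OU=Users/CN=alice")
print(may_read(uid, [gid], owner=10001, group=2000, mode=0o640))  # True
```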
LFC Tests
- The LFC has been tested and shown to be scalable to at least 40 million entries and 100 client threads.
- Performance is improved compared to the RLS.
- Stable: continuous running at high load for extended periods of time with no crashes.
- Based on code which has been in production for more than 4 years.
- Tuning is required to improve bulk performance.
FiReMan Performance - Insert
(Chart: insert rate in inserts/second, up to about 350, versus number of client threads from 1 to 100, for Fireman single entry, Fireman bulk 100, and LFC.)
FiReMan Performance - Queries
(Chart: entries returned per second, up to about 1200, versus number of client threads from 1 to 100, for Fireman single entry, Fireman bulk 100, and LFC.)
Tests Conclusion
- Both LFC and FiReMan offer large improvements over RLS.
- Still some issues remaining:
  - Scalability of FiReMan.
  - Bulk operations for LFC.
  - More work needed to understand performance and bottlenecks.
- Need to test some real use cases.
File Transfer Software and Service for SC3
- Gavin McCance
- LHC Service Challenge
FTS Service
- It provides point-to-point movement of SURLs.
  - Aims to provide reliable file transfer between sites, and that's it!
  - Allows sites to control their resource usage.
  - Does not do routing (e.g. like PhEDEx).
  - Does not deal with GUIDs, LFNs, datasets or collections.
- It's a fairly simple service that provides sites with a reliable and manageable way of serving file-movement requests from their VOs.
- Together with the experiments, we are working out the places in the software where extra functionality can be plugged in:
  - How the VO software frameworks can load the system with work.
  - Places where VO-specific operations (such as cataloguing) can be plugged in, if required.
Single channel
Multiple channels
A single set of servers can manage multiple channels from a site.
What you need to run the server
- An Oracle database to hold the state.
  - MySQL is on the list, but low priority unless someone screams.
- A transfer server to run the transfer agents (a minimal sketch of the agent roles follows this list).
  - Agents responsible for assigning jobs to the channels managed by that site.
  - Agents responsible for actually running the transfer (or for delegating the transfer to srm-cp).
- An application server (tested with Tomcat 5).
  - To run the submission and monitoring portal, i.e. the thing you use to talk to the system.
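As an illustration of the two agent roles above, here is a minimal sketch of a channel-based transfer agent loop. The names, the in-memory queues and the faked copy step are assumptions for illustration only; the real agents keep their state in the Oracle database and hand the actual transfer to gridftp or srm-cp.

```python
# Minimal sketch of the two agent roles described above: one assigns queued
# jobs to the channels this site manages, the other executes transfers on a
# channel. Everything here is an illustrative assumption, not the real FTS.
from collections import defaultdict, deque

CHANNELS = {("CERN", "RAL"), ("CERN", "FNAL")}       # channels managed by this site
pending_jobs = deque()                               # jobs submitted via the portal
channel_queues = defaultdict(deque)


def submit(source_surl: str, dest_surl: str, source_site: str, dest_site: str) -> None:
    pending_jobs.append((source_surl, dest_surl, source_site, dest_site))


def allocator_pass() -> None:
    """Assign each pending job to a channel managed by this site, if any."""
    for _ in range(len(pending_jobs)):
        job = pending_jobs.popleft()
        channel = (job[2], job[3])
        if channel in CHANNELS:
            channel_queues[channel].append(job)
        else:
            pending_jobs.append(job)                 # not ours; leave it queued


def transfer_pass(channel: tuple, max_concurrent: int = 5) -> None:
    """Run up to max_concurrent transfers on one channel (copy step faked)."""
    for _ in range(min(max_concurrent, len(channel_queues[channel]))):
        src, dst, *_ = channel_queues[channel].popleft()
        print("copy %s -> %s" % (src, dst))          # real agent: gridftp / srm-cp


submit("srm://cern.ch/f1", "srm://ral.ac.uk/f1", "CERN", "RAL")
allocator_pass()
transfer_pass(("CERN", "RAL"))
```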
Initial use models considered
- Tier-0 to Tier-1 distribution
  - Proposal: put the server at Tier-0.
  - This was the model used in SC2.
- Tier-1 to Tier-2 distribution
  - Proposal: put the server at Tier-1 (push).
  - This is analogous to the SC2 model.
- Tier-2 to Tier-1 upload
  - Proposal: put the server at Tier-1 (pull).
Summary
- Propose server at Tier-0 and Tier-1.
  - Oracle DB, Tomcat application server, transfer node.
- Propose client tools at T0, T1 and T2.
  - This is a UI / WN type install.
- Evaluation setup:
  - Initially at CERN T0, interacting with T1s a la SC2.
  - Expand to a few agreed T1s interacting with agreed T2s.
Disk Pool Manager Aims
- Provide a solution for Tier-2s in LCG-2.
  - This implies a few tens of terabytes in 2005.
- Focus on manageability:
  - Easy to install.
  - Easy to configure.
  - Low effort for ongoing maintenance.
  - Easy to add/remove resources.
- Support for multiple physical partitions, on one or more disk server nodes.
- Support for different space types: volatile and permanent.
- Support for multiple replicas of hot files within the disk pools.
Manageability
- Few daemons to install.
- No central configuration files.
  - Disk nodes request to add themselves to the DPM.
- All state is kept in a DB (easy to restore after a crash).
- Easy to remove disks and partitions.
  - Allows simple reconfiguration of the disk pools.
  - An administrator can temporarily remove file systems from the DPM if a disk has crashed and is being repaired.
  - The DPM automatically marks a file system as unavailable when it is not contactable.
Features
- DPM access via different interfaces:
  - Direct socket interface.
  - SRM v1.
  - SRM v2 Basic.
  - Also offers a large part of SRM v2 Advanced:
    - Global space reservation (next version).
    - Namespace operations.
    - Permissions.
    - Copy and remote get/put (next version).
- Data access:
  - GridFTP, rfio (ROOTD, XROOTD could easily be added).
Security
- GSI authentication and authorization.
  - Mapping is done from the client DN to a uid/gid pair.
  - Authorization is done in terms of uid/gid.
- Ownership of files is stored in the DPM catalog, while the physical files on disk are owned by the DPM.
- Permissions implemented on files and directories:
  - Unix (user, group, other) permissions.
  - POSIX ACLs (groups and users).
- Propose to use SRM as the interface to set permissions in the Storage Elements (requires v2.1 minimum, with the Directory and Permission methods).
- VOMS will be integrated.
  - VOMS roles appear as a list of gids.
Architecture
- The lightweight Disk Pool Manager consists of:
  - The Disk Pool Manager, with its configuration and request DB.
  - The Disk Pool Name Server.
  - The SRM servers.
  - The RFIOD and DPM-aware GsiFTP servers.
- How many machines?
  - DPM, DPNS and SRM can be installed on the same one.
  - RFIOD on each disk server managed by the DPM.
  - GsiFTP on each disk server managed by the DPM.
Status
- DPM will be part of the LCG 2.5.0 release, but is available from now on for testing.
- Satisfies the gLite requirement for an SRM interface at Tier-2.
Thank You