Title: Our Work at CERN
1. Our Work at CERN
- Gang CHEN, Yaodong CHENG
- Computing center, IHEP
- November 2, 2004
2. Outline
- Conclusion of CHEP04
- computing fabric
- New developments of CASTOR
- Storage Resource Manager
- Grid File System
- GGF-WG, GFAL
- ELFms
- Quattor, Lemon, Leaf
- Others
- AFS, wireless network, Oracle, Condor, SLC, InDiCo
- Lyon visit, CERN Open Day (Oct. 16)
3. CHEP04
4. Conclusion of CHEP04
- CHEP04, from Sept. 26 to Oct. 1
- Plenary sessions every morning
- Seven parallel sessions each afternoon
- Online computing
- Event processing
- Core software
- Distributed computing services
- Distributed computing systems and experiences
- Computing fabrics
- Wide area network
- Documents: www.chep2004.org
- Our presentations (one talk per person)
- Two on Sept. 27 and one on Sept. 30
5. Computing fabrics
- Computing nodes, disk servers, tape servers, network bandwidth at different HEP institutes
- Fabrics at Tier0, Tier1 and Tier2
- Installation, configuration, maintenance and management of large Linux farms
- Grid software installation
- Monitoring of computing fabrics
- OS choice: move to RHES3/Scientific Linux
- Storage observations
6. Storage stack
[Stack diagram, reconstructed as layers:]
- Expose to WAN: SRB, gfarm, SRM, StoRM
- Expose to LAN: gfarm, SRB, NFS v2/v3, Chimera/PNFS (dCache), Lustre, StoRM, GoogleFS, PVFS, CASTOR
- Local network: 10Gb Ethernet, Infiniband, 1Gb Ethernet
- File systems: GPFS, ext2/3, XFS, SAN FS
- Disk organisation: HW RAID 5, HW RAID 1, SW RAID 5, SW RAID 0, SATA array direct connect
- Disks: FibreChannel/SATA SAN, EIDE/SATA in a box, iSCSI
- Tape store: dCache/TSM, CASTOR, ENSTORE, JASMine, HPSS
7. Storage observations
- CASTOR and dCache are in full growth
- Growing numbers of adopters outside the development sites
- SRM supports all major managers
- SRB at Belle (KEK)
- Not always going for the largest disks (capacity driver); already choosing smaller ones for performance
- Key issue for LHC
- Cluster file system comparisons
- SW-based solutions allow HW reuse
8. Architecture Choice
- 64 bits are coming soon and HEP is not really ready for it!
- Infiniband for HPC
- Low latency
- High bandwidth (>700 MB/s for CASTOR/RFIO)
- Balance of CPU to disk resources
- Security issues
- Which servers are exposed to users or the WAN?
- High performance data access and computing support
- Gfarm file system (Japan)
9. New CASTOR developments
10. CASTOR Current Status
- Usage at CERN
- 370 disk servers, 50 stagers (disk pool managers)
- 90 tape drives, more than 3 PB in total
- Dev team (5), operations team (4)
- Associated problems
- Management is more and more difficult
- Performance
- Scalability
- I/O request scheduling
- Optimal use of resources
11. Challenge for CASTOR
- LHC is a big challenge
- A single stager should scale up to handle peak rates of 500/1000 requests per second
- Expected system configuration
- 4 PB of disk cache, 10 PB stored on tape per year
- Tens of millions of disk-resident files
- Peak rate of 4 GB/s from online
- 10,000 disks, 150 tape drives
- Increase of small files
- The current CASTOR stager cannot do it
12. Vision
- With clusters of 100s of disk and tape servers, automated storage management faces more and more the same problems as CPU cluster management:
- (Storage) Resource management
- (Storage) Resource sharing
- (Storage) Request access scheduling
- Configuration
- Monitoring
- The stager is the main gateway to all resources managed by CASTOR
Vision: a Storage Resource Sharing Facility
13. Ideas behind the new stager
- Pluggable framework rather than total solution
- True request scheduling: third-party schedulers, e.g. Maui or LSF
- Policy attributes: externalize the policy engines governing the resource matchmaking; move toward full-fledged policy languages, e.g. GUILE
- Restricted access to storage resources
- All requests are scheduled
- No random rfiod eating up the resources behind the back of the scheduling system
- Database-centric architecture (see the sketch after this list)
- Stateless components: all transactions and locking provided by the DB system
- Allows for easy stopping/restarting of components
- Facilitates development/debugging
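As a rough illustration of the database-centric idea above (not the actual CASTOR schema or code), the following sketch shows how a stateless scheduler component could claim a pending request purely through DB transactions and row locks, here using the MySQL C API; the table and column names are hypothetical.

    #include <mysql/mysql.h>
    #include <stdio.h>

    /* Hypothetical sketch: a stateless scheduler thread claims the oldest
     * pending request.  All locking is delegated to the database, so the
     * component keeps no state and can be stopped/restarted freely.
     * Table and column names are illustrative, not the real CASTOR schema. */
    int claim_next_request(MYSQL *db)
    {
        if (mysql_query(db, "START TRANSACTION")) return -1;

        /* The row lock is held by the DB until COMMIT; concurrent schedulers block here. */
        if (mysql_query(db,
            "SELECT id FROM requests WHERE status = 'PENDING' "
            "ORDER BY creation_time LIMIT 1 FOR UPDATE")) {
            mysql_query(db, "ROLLBACK");
            return -1;
        }
        MYSQL_RES *res = mysql_store_result(db);
        MYSQL_ROW row = res ? mysql_fetch_row(res) : NULL;
        if (!row) {
            if (res) mysql_free_result(res);
            mysql_query(db, "ROLLBACK");
            return 0;                      /* nothing to schedule */
        }

        char sql[128];
        snprintf(sql, sizeof sql,
                 "UPDATE requests SET status = 'SCHEDULED' WHERE id = %s", row[0]);
        mysql_free_result(res);
        if (mysql_query(db, sql)) {
            mysql_query(db, "ROLLBACK");
            return -1;
        }
        return mysql_query(db, "COMMIT") ? -1 : 1;
    }

Because no state lives in the component itself, stopping it at any point simply rolls back the open transaction, which is what makes restart and debugging easy.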
14. New Stager Architecture
15. Architecture: request handling and scheduling
[Diagram: requests are authenticated by the fabric authentication service (e.g. Kerberos V), handled by the RequestHandler (thread pool), stored in the request repository (Oracle, MySQL), and dispatched by the scheduler according to scheduling policies, using the catalogue and the job dispatcher.]
16. Security
- Implementing strong authentication
- (Encryption is not planned for the moment)
- Developed a plugin system based on the GSSAPI so as to use the GSI and KRB5 mechanisms (see the sketch after this list)
- Also supports KRB4 for backward compatibility
- Modifying various CASTOR components to integrate the security layer
- Impact on the configuration of machines (need for service keys, etc.)
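A hedged sketch of the kind of client-side GSSAPI call sequence such a plugin performs to start the authentication handshake; the service name is an illustrative placeholder, and the real CASTOR plugin selects the GSI or KRB5 mechanism and carries the tokens over its own connection.

    #include <gssapi/gssapi.h>
    #include <string.h>

    /* First leg of a GSSAPI context establishment, as a client would do it. */
    int start_handshake(void)
    {
        OM_uint32 maj, min;

        /* Target service name -- placeholder, not a real CASTOR service. */
        gss_buffer_desc name_buf;
        name_buf.value  = (void *)"castor@diskserver.example.org";
        name_buf.length = strlen((const char *)name_buf.value);

        gss_name_t target = GSS_C_NO_NAME;
        maj = gss_import_name(&min, &name_buf, GSS_C_NT_HOSTBASED_SERVICE, &target);
        if (GSS_ERROR(maj)) return -1;

        gss_ctx_id_t ctx = GSS_C_NO_CONTEXT;
        gss_buffer_desc out_tok = GSS_C_EMPTY_BUFFER;

        /* Produces a token to send to the server; later calls feed its replies back in. */
        maj = gss_init_sec_context(&min, GSS_C_NO_CREDENTIAL, &ctx, target,
                                   GSS_C_NO_OID,           /* default mechanism */
                                   GSS_C_MUTUAL_FLAG, 0,
                                   GSS_C_NO_CHANNEL_BINDINGS,
                                   GSS_C_NO_BUFFER, NULL, &out_tok, NULL, NULL);
        if (GSS_ERROR(maj)) { gss_release_name(&min, &target); return -1; }

        /* ... send out_tok.value / out_tok.length to the server here ... */
        gss_release_buffer(&min, &out_tok);
        gss_release_name(&min, &target);
        return maj == GSS_S_CONTINUE_NEEDED ? 1 : 0;
    }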
17. CASTOR GUI Client
- A prototype was developed by LIU Aigui on the Kylix 3 platform
- If possible, it will be made downloadable on the CASTOR web site
- Many problems still exist
- It needs to be optimized
- Functionality and performance tests are very necessary
18. Storage Resource Manager
19. Introduction to SRM
- SRMs are middleware components that manage shared storage resources on the Grid and provide
- Uniform access to heterogeneous storage
- Protocol negotiation
- Dynamic transfer URL allocation
- Access to permanent and temporary types of storage
- Advanced space and file reservation
- Reliable transfer services
- Storage resources refer to
- DRM: disk resource managers
- TRM: tape resource managers
- HRM: hierarchical resource managers
20. SRM Collaboration
- Jefferson Lab: Bryan Hess, Andy Kowalski, Chip Watson
- Fermilab: Don Petravick, Timur Perelmutov
- LBNL: Arie Shoshani, Alex Sim, Junmin Gu
- EU DataGrid WP2: Peter Kunszt, Heinz Stockinger, Kurt Stockinger, Erwin Laure
- EU DataGrid WP5: Jean-Philippe Baud, Stefano Occhetti, Jens Jensen, Emil Knezo, Owen Synge
21. SRM versions
- Two SRM interface specifications
- SRM v1.1 provides
- Data access/transfer
- Implicit space reservation
- SRM v2.1 adds
- Explicit space reservation
- Namespace discovery and manipulation
- Access permissions manipulation
- Fermilab SRM implements the SRM v1.1 specification
- SRM v2.1 by the end of 2004
- Reference: http://sdm.lbl.gov/srm-wg
22. High Level View of SRM
[Diagram: user applications and other clients go through Grid middleware to per-site SRM interfaces, which front the Enstore, dCache, JASMine and CASTOR storage systems.]
23. Role of SRM on the GRID
24. Main Advantages of using SRM
- Provides smooth synchronization between shared resources
- Eliminates unnecessary burden from the client
- Insulates clients from storage system failures
- Transparently deals with network failures
- Enhances the efficiency of the grid, eliminating unnecessary file transfers by sharing files
- Provides a streaming model to the client
25. Grid File System
26. Introduction
- There can be many hundreds of petabytes of data in grids, among which a very large percentage is stored in files
- A standard mechanism to describe and organize file-based data is essential for facilitating access to this large amount of data
- GGF GFS-WG
- GFAL: Grid File Access Library
27. GGF GFS-WG
- Global Grid Forum, Grid File System Working Group
- Two goals (two documents)
- File System Directory Services
- Manages the namespace for files, access control, and metadata management
- Architecture for Grid File System Services
- Provides the functionality of a virtual file system in a grid environment
- Facilitates federation and sharing of virtualized data
- Uses File System Directory Services and standard access protocols
- The two documents will be submitted at GGF13 and GGF14 (2005)
28. GFS view
- Transparent access to dispersed file data in a Grid
- POSIX I/O APIs
- Applications can access the Gfarm file system without any modification, as if it were mounted at /gfs (see the sketch after this list)
- Automatic and transparent replica selection for fault tolerance and access-concentration avoidance
[Diagram: a virtual directory tree is mapped through file system metadata onto the Grid file system, with file replica creation.]
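A minimal sketch of the "unmodified application" point above: plain POSIX calls against the /gfs mount point; the path below is only an illustrative placeholder.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        ssize_t n, total = 0;

        /* Ordinary open()/read(): replica selection happens transparently below the mount. */
        int fd = open("/gfs/experiment/run001/data.dat", O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        while ((n = read(fd, buf, sizeof buf)) > 0)
            total += n;

        close(fd);
        printf("read %zd bytes\n", total);
        return n < 0 ? 1 : 0;
    }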
29. GFAL
- Grid File Access Library
- Grid storage interactions today require using several existing software components:
- the replica catalog services, to locate valid replicas of files
- the SRM software, to ensure files exist on disk or that space is allocated on disk for new files
- GFAL hides these interactions and presents a POSIX interface for the I/O operations. The currently supported protocols are: file for local access, dcap (the dCache access protocol) and rfio (the CASTOR access protocol).
30. Compile and Link
- The function names are obtained by prepending gfal_ to the POSIX names, for example gfal_open, gfal_read, gfal_close... The argument lists and the values returned by the functions are identical (see the example after this list).
- The header file gfal_api.h needs to be included in the application source code
- Link with libGFAL.so
- The security libraries libcgsi_plugin_gsoap_2.3, libglobus_gssapi_gsi_gcc32dbg and libglobus_gss_assist_gcc32dbg are used internally
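A minimal usage sketch of the POSIX-style calls named above (gfal_open, gfal_read, gfal_close), including gfal_api.h as described; the rfio path shown is only an illustrative placeholder, not a real CASTOR file.

    #include <fcntl.h>
    #include <stdio.h>
    #include "gfal_api.h"

    int main(void)
    {
        char buf[4096];

        /* Same argument lists and return values as open(2)/read(2)/close(2). */
        int fd = gfal_open("rfio:///castor/cern.ch/user/x/xyz/somefile", O_RDONLY, 0);
        if (fd < 0) { fprintf(stderr, "gfal_open failed\n"); return 1; }

        ssize_t n;
        while ((n = gfal_read(fd, buf, sizeof buf)) > 0) {
            /* process n bytes here, e.g. write them to a local copy */
        }
        gfal_close(fd);
        return n < 0 ? 1 : 0;
    }

Building would then look something like gcc prog.c -lGFAL (include/library paths and the exact library set depend on the installation; the security libraries listed above are pulled in internally).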
31. Basic Design
[Diagram: physics applications use GFAL through POSIX I/O or a VFS layer; GFAL combines a replica catalog client, an SRM client, local file I/O and the ROOT I/O, rfio and dCap open()/read() drivers, which talk to the RC, SRM, RFIO, dCap and MSS services over the wide area or to the local disk.]
32. File system implementation
- Two options have been considered to offer a file system view
- a way to run standard applications without modifying the source and without re-linking
- The Pluggable File System (PFS), built on top of Bypass and developed by the University of Wisconsin
- The Linux Userland File System (LUFS)
- File system view: /grid/vo/
- CASTORfs, based on LUFS
- I developed it
- Available
- Low efficiency
33. Extremely Large Fabric management system
34. ELFms
- ELFms: Extremely Large Fabric management system
- Subsystems
- QUATTOR: system installation and configuration tool suite
- LEMON: monitoring framework
- LEAF: hardware and state management
35. Deployment at CERN
- ELFms manages and controls most of the nodes in the CERN CC
- 2100 nodes out of 2400, to be scaled up to >8000 in 2006-08 (LHC)
- Multiple functionalities and cluster sizes (batch nodes, disk servers, tape servers, DB, web, ...)
- Heterogeneous hardware (CPU, memory, HD size, ...)
- Linux (RH) and Solaris (9)
36. Quattor
- Quattor takes care of the configuration, installation and management of fabric nodes
- A Configuration Database holds the desired state of all fabric elements
- Node setup (CPU, HD, memory, software RPMs/PKGs, network, system services, location, audit info)
- Cluster (name and type, batch system, load-balancing info)
- Defined in templates arranged in hierarchies: common properties are set only once
- Autonomous management agents running on the node for
- Base installation
- Service (re-)configuration
- Software installation and management
- Quattor was developed in the scope of EU DataGrid; development and maintenance are now coordinated by CERN/IT
37. Quattor Architecture
- Configuration Management
- Configuration Database
- Configuration access and caching
- Graphical and command line interfaces
- Node and Cluster Management
- Automated node installation
- Node Configuration Management
- Software distribution and management
38. LEMON
- Monitoring sensors and agent
- Large number of metrics (10 sensors implementing 150 metrics)
- Plug-in architecture: new sensors and metrics can easily be added
- Asynchronous push/pull protocol between sensors and agent
- Available for Linux and Solaris
- Repository
- Data insertion via TCP or UDP (a hedged sketch of a UDP push follows after this list)
- Data retrieval via SOAP
- Backend implementations for text file and Oracle SQL
- Keeps current and historical samples: no aging out of data, but archiving on TSM and CASTOR
- Correlation engines and self-healing fault recovery
- Allows plug-in correlations accessing collected metrics and external information (e.g. quattor CDB, LSF), and also launching configured recovery actions
- E.g. average number of users on LXPLUS, total number of active LCG batch nodes
- E.g. cleaning up /tmp if occupancy > x, restarting daemon D if dead, ...
- LEMON is an EDG development, now maintained by CERN/IT
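Purely as an illustration of the "data insertion via TCP or UDP" path, here is a hedged sketch of a sensor pushing one sample to the repository over UDP. The host, port and the plain-text "node metric value" message format are assumptions made for the example, not the real LEMON wire protocol.

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <unistd.h>

    /* Send one metric sample as a UDP datagram (hypothetical text format). */
    int push_sample(const char *repo_host, int repo_port,
                    const char *node, int metric_id, double value)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0) return -1;

        struct sockaddr_in addr;
        memset(&addr, 0, sizeof addr);
        addr.sin_family = AF_INET;
        addr.sin_port   = htons(repo_port);
        if (inet_pton(AF_INET, repo_host, &addr.sin_addr) != 1) { close(fd); return -1; }

        char msg[256];
        int len = snprintf(msg, sizeof msg, "%s %d %f\n", node, metric_id, value);

        /* Fire-and-forget: UDP keeps the sensor lightweight; occasional losses are tolerated. */
        ssize_t sent = sendto(fd, msg, (size_t)len, 0,
                              (struct sockaddr *)&addr, sizeof addr);
        close(fd);
        return sent == len ? 0 : -1;
    }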
39. LEMON Architecture
- LEMON stands for LHC Era Monitoring
40. LEAF (LHC Era Automated Fabric)
- Collection of workflows for automated node hardware and state management
- HMS (Hardware Management System)
- e.g. installation, moves, vendor calls, retirement
- Automatically requests installs, retirements etc. from technicians
- GUI to locate equipment physically
- SMS (State Management System)
- Automated handling of high-level configuration steps, e.g. reconfigure, reboot, reallocate nodes
- Extensible framework: plug-ins for site-specific operations are possible
- Issues all necessary (re)configuration commands on top of quattor CDB and NCM
- HMS and SMS interface to Quattor and LEMON for setting/getting node information respectively
41. LEAF screenshot
42. Other Activities
- AFS
- AFS documents download
- AFS DB servers configuration
- Wireless network deployment
- Oracle license for LCG
- Condor deployment at some HEP institutes
- SLC: Scientific Linux CERN version
- Lyon visit (Oct. 27, CHEN Gang)
- CERN Open Day (Oct. 16)
44. Thank you!!