Title: LHCb SC3 Experience
LHCb Data Replication During SC3
Author: Andrew C. Smith
Abstract: LHCb's
participation in LCG's Service Challenge 3
involves testing the bulk data transfer
infrastructure developed to allow high bandwidth
distribution of data across the grid in
accordance with the computing model. To enable
reliable bulk replication of data, LHCb's DIRAC
system has been integrated with gLite's File
Transfer Service middleware component to make use
of dedicated network links between LHCb computing
centres. DIRAC's Data Management tools previously
allowed the replication, registration and
deletion of files on the grid. For SC3
supplementary functionality has been added to
allow bulk replication of data (using FTS) and
efficient mass registration to the LFC replica
catalog. Provisional performance results have
shown that the system developed can meet the
expected data replication rate required by the
computing model in 2007. This paper details the
experience and results of integration and
utilisation of DIRAC with the SC3 transfer
machinery.
- LHCb Transfer Aims During SC3
- The extended Service Phase of SC3 was intended to allow the experiments to test their specific software and validate their computing models using the platform of machinery provided. LHCb's Data Replication goals during SC3 can be summarised as:
- Replication of 1 TB of stripped DST data from CERN to all Tier-1s.
- Replication of 8 TB of digitised data from CERN/Tier-0 to the participating LHCb Tier-1 centres in parallel.
- Removal of 50k replicas (via LFN) from all Tier-1 centres.
- Moving 4 TB of data from Tier-1 centres to Tier-0 and to other participating Tier-1 centres.
- Introduction to DIRAC Data Management Architecture
- The DIRAC architecture is split into three main component types:
- Services - independent functionalities deployed and administered centrally on machines accessible by all other DIRAC components
- Resources - GRID compute and storage resources at remote sites
- Agents - lightweight software components that request jobs from the central Services for a specific purpose
- The DIRAC Data Management System is made up of an assortment of these components (a sketch of the Agent/Service pattern is given below).
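- To illustrate this split only (the class and method names here are hypothetical, not the DIRAC API), a minimal Python sketch of a lightweight Agent polling a central Service for work:

    import time

    class TransferService:
        """Stand-in for a centrally deployed DIRAC Service (hypothetical names)."""
        def __init__(self):
            self._queues = {"transfer": ["replicate /lhcb/sc3/file_0001.dst to CNAF-disk"]}

        def requestJob(self, purpose):
            """Hand out the next waiting job of the requested type, or None."""
            queue = self._queues.get(purpose, [])
            return queue.pop(0) if queue else None

    class TransferAgent:
        """Stand-in for a lightweight Agent that polls the central Service."""
        def __init__(self, service, poll_interval=1):
            self.service = service
            self.poll_interval = poll_interval

        def run_once(self):
            job = self.service.requestJob("transfer")
            if job is None:
                time.sleep(self.poll_interval)   # nothing waiting; poll again later
            else:
                print("acting on grid Resources:", job)

    agent = TransferAgent(TransferService())
    for _ in range(2):
        agent.run_once()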
- Integration of DIRAC with FTS
- The SC3 replication machinery utilised gLite's File Transfer Service (FTS):
- the lowest-level data movement service defined in the gLite architecture
- offers reliable point-to-point bulk file transfers
- moves physical files (SURLs) between SRM-managed SEs
- accepts source-destination SURL pairs
- assigns file transfers to dedicated transfer channels
- takes advantage of the dedicated networking between CERN and the Tier-1s
- routing of transfers is not provided
- A higher-level service is therefore required to resolve SURLs and hence decide on routing; the DIRAC Data Management System is employed for these tasks (see the sketch below).
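- The sketch below illustrates this division of labour with made-up endpoint hostnames and channel names (not real FTS configuration): FTS only maps an already-resolved SURL pair onto a dedicated point-to-point channel, while choosing which replica to use as the source is left to the layer above.

    from urllib.parse import urlparse

    # Made-up channel table: (source host, destination host) -> dedicated channel name
    CHANNELS = {
        ("srm.cern.ch", "srm.cnaf.infn.it"): "CERN-CNAF",
        ("srm.cern.ch", "gridka-dcache.fzk.de"): "CERN-GRIDKA",
    }

    def assign_channel(source_surl, target_surl):
        """Pick the dedicated transfer channel for one source-destination SURL pair."""
        src_host = urlparse(source_surl).hostname
        dst_host = urlparse(target_surl).hostname
        channel = CHANNELS.get((src_host, dst_host))
        if channel is None:
            raise ValueError("no channel serves this pair; routing is not provided by FTS")
        return channel

    pair = ("srm://srm.cern.ch/grid/lhcb/sc3/file_0001.dst",
            "srm://srm.cnaf.infn.it/grid/lhcb/sc3/file_0001.dst")
    print(assign_channel(*pair))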
- Main components of the DIRAC Data Management System
- Storage Element
- abstraction of GRID storage resources
- actual access is performed by protocol-specific plug-ins
- srm, gridftp, bbftp, sftp and http are supported
- namespace management, file upload/download, deletion etc.
- Replica Manager
- provides an API for the available data management operations
- the point of contact for users of the data management system
- removes direct interaction with Storage Elements and File Catalogs
- uploading/downloading of files to/from a GRID SE, replication of files, file registration, file removal
- File Catalog
- a standard API exposed for a variety of available catalogs
- allows redundancy across several catalogs (see the sketch after this list)
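- A simplified sketch of this layering, assuming hypothetical class and method names rather than the actual DIRAC interfaces: the Replica Manager acts as a facade over a protocol-specific Storage Element plug-in and a set of redundant File Catalogs.

    class SRMStorage:
        """Stand-in for a protocol-specific Storage Element plug-in (srm, gridftp, ...)."""
        def put(self, local_path, surl):
            print(f"srm: copying {local_path} -> {surl}")

    class FileCatalog:
        """Stand-in for one replica catalog behind the common catalog API."""
        def __init__(self, name):
            self.name, self.entries = name, {}
        def register(self, lfn, surl):
            self.entries[lfn] = surl
            print(f"{self.name}: registered {lfn}")

    class ReplicaManager:
        """Facade: users call this instead of talking to SEs and catalogs directly."""
        def __init__(self, storage, catalogs):
            self.storage, self.catalogs = storage, catalogs
        def putAndRegister(self, lfn, local_path, se_endpoint):
            surl = se_endpoint + lfn             # build the physical file name
            self.storage.put(local_path, surl)   # upload via the SE plug-in
            for catalog in self.catalogs:        # mirror the entry in every catalog
                catalog.register(lfn, surl)

    rm = ReplicaManager(SRMStorage(), [FileCatalog("LFC"), FileCatalog("redundant-catalog")])
    rm.putAndRegister("/lhcb/sc3/file_0001.dst", "file_0001.dst", "srm://srm.cern.ch/grid")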
- Integration requirements
- new methods were developed in the Replica Manager
- previous Data Management operations were single-file and blocking
- bulk operation functionality was added to the Transfer Agent/Request
- monitoring of asynchronous FTS jobs is required
- the information needed for monitoring is stored within the Request DB entry (see the sketch below)
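- As an illustration only (the record fields and helper below are assumptions, not the DIRAC schema), a bulk request entry might carry the FTS job identifier and the per-file states needed to monitor an asynchronous FTS job:

    from dataclasses import dataclass, field

    @dataclass
    class BulkTransferRequest:
        """Hypothetical Request DB entry for one bulk FTS submission."""
        lfns: list
        source_se: str
        target_se: str
        fts_guid: str = ""                                # identifier returned by FTS
        file_status: dict = field(default_factory=dict)   # per-LFN state, e.g. Waiting/Done/Failed

    def update_from_fts(request, fts_report):
        """Fold an asynchronous FTS status report back into the stored request."""
        for lfn, state in fts_report.items():
            request.file_status[lfn] = state
        return all(s in ("Done", "Failed") for s in request.file_status.values())

    req = BulkTransferRequest(["/lhcb/sc3/a.dst", "/lhcb/sc3/b.dst"],
                              "CERN-disk", "CNAF-disk", fts_guid="example-guid-0000")
    finished = update_from_fts(req, {"/lhcb/sc3/a.dst": "Done", "/lhcb/sc3/b.dst": "Waiting"})
    print(req.file_status, "finished:", finished)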
- Operation of DIRAC Bulk Transfer Mechanics
- The DIRAC integration with FTS was deployed centrally:
- a managed machine at CERN
- services all data replication jobs for SC3
- Lifetime of a bulk replication job (a sketch of the Transfer Agent's handling is given after this list):
- bulk replication requests are submitted to the DIRAC WMS
- a JDL file with an input sandbox containing an XML file
- the XML contains the important parameters, e.g. the LFNs and the source/target SE
- the DIRAC WMS populates the Request DB of the central machine with the XML
- the Transfer Agent polls the Request DB periodically for waiting requests
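- A minimal, self-contained sketch of one such Transfer Agent cycle, using stub classes and assumed names in place of the real DIRAC and FTS client interfaces (the individual steps are itemised in the block that follows):

    import xml.etree.ElementTree as ET

    # Hypothetical SE endpoint table used to turn LFNs into SURLs (illustration only)
    ENDPOINTS = {"CERN-disk": "srm://srm.cern.ch/grid",
                 "CNAF-disk": "srm://srm.cnaf.infn.it/grid"}

    class StubFTSClient:
        """Stand-in for the FTS client; submit() returns a job identifier (GUID)."""
        def submit(self, surl_pairs):
            for src, dst in surl_pairs:
                print("queued:", src, "->", dst)
            return "00000000-0000-0000-0000-000000000000"

    def process_request(request_xml, fts_client):
        """One Transfer Agent cycle: parse the request, build SURL pairs, submit to FTS."""
        root = ET.fromstring(request_xml)
        source_se, target_se = root.get("sourceSE"), root.get("targetSE")
        pairs = [(ENDPOINTS[source_se] + e.text, ENDPOINTS[target_se] + e.text)
                 for e in root.findall("lfn")]
        guid = fts_client.submit(pairs)   # asynchronous bulk submission
        return guid                       # stored back in the request for monitoring

    request_xml = """<request sourceSE="CERN-disk" targetSE="CNAF-disk">
      <lfn>/lhcb/sc3/file_0001.digi</lfn>
      <lfn>/lhcb/sc3/file_0002.digi</lfn>
    </request>"""
    print("FTS GUID:", process_request(request_xml, StubFTSClient()))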
- Once the Transfer Agent obtains a Request XML file:
- the replica information for the LFNs is obtained
- the replicas are matched against the source SE and target SE
- SURL pairs are resolved using the endpoint information
- the SURL pairs are then submitted via the FTS client
- the FTS GUID and other information on the job are stored in the XML file
- Results of the T0-T1 replication exercise
- a combined rate of 40 MB/s from CERN to the 6 LHCb Tier-1s was required to meet the SC3 goals
- the aggregated daily rate was obtained
- the overall SC3 machinery was not completely stable
- the target rate was not sustained over the required period
- peak rates of 100 MB/s were observed over several hours
- a rerun of the exercise is planned to demonstrate the required rates
- Bulk File Removal Operations
- Bulk removal of files was performed on completion of the T0-T1 replication:
- the bulk operation of srm-advisory-delete was used
- takes a list of SURLs and removes the physical files
- functionality was added to the Replica Manager and Storage Element
- additions were required for the SRM Storage Element plug-in
- the Replica Manager SURL resolution tools were reused
- Different interpretations of the SRM standard have led to different underlying behaviour between SRM solutions.
- Initially bulk removal operations were executed by a single central agent
- the SC3 goal of 50k replicas in 24 hours was shown to be unattainable
- Several parallel agents were instantiated (see the sketch after this list)
- each performing physical and catalog removal for a specific SE
- 10k replicas were removed from 5 sites in 28 hours
- a performance loss was observed in replica deletion on the LFC (see below)
- unnecessary SSL authentications are CPU intensive
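- An outline of the parallel removal scheme, assuming stub functions in place of the real srm-advisory-delete call and catalog client, with one agent per SE handling both physical and catalog removal:

    from concurrent.futures import ThreadPoolExecutor

    # Replicas to remove, grouped by the SE that holds them (illustrative data)
    REMOVAL_QUEUE = {
        "CNAF-disk": ["srm://srm.cnaf.infn.it/grid/lhcb/sc3/file_0001.dst"],
        "PIC-disk":  ["srm://srm.pic.es/grid/lhcb/sc3/file_0001.dst"],
    }

    def remove_physical(surl):
        """Stub for bulk physical removal (srm-advisory-delete in the real system)."""
        print("physical removal:", surl)

    def unregister(surl):
        """Stub for removing the corresponding replica entry from the catalog."""
        print("catalog removal:", surl)

    def removal_agent(se_name, surls):
        """One agent handles both physical and catalog removal for a single SE."""
        for surl in surls:
            remove_physical(surl)
            unregister(surl)
        return se_name, len(surls)

    # One agent per SE, running in parallel rather than through a single central agent
    with ThreadPoolExecutor(max_workers=len(REMOVAL_QUEUE)) as pool:
        for se, n in pool.map(lambda item: removal_agent(*item), REMOVAL_QUEUE.items()):
            print(f"{se}: {n} replicas removed")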
- Tier-1 to Tier-1 Replication Activity (On-going)
- During the T0-T1 replication, FTS was found to be most efficient when replicating files pre-staged on disk:
- dedicated disk pools were set up at the Tier-1 sites for seed files
- 1.5 TB of seed files were transferred to the dedicated disk
- FTS servers were installed by the Tier-1 sites
- channels were set up directly between sites
- Replication activity is ongoing with this exercise; the current status of this setup is shown below.