Title: DIRAC LHCb MC production system
DIRAC LHCb MC production system
A. Tsaregorodtsev, for the LHCb Data Management team
CHEP 2003, La Jolla, 20 March
Outline
- Introduction
- DIRAC architecture
- Implementation details
- Deploying agents on the DataGRID
- Conclusions
LHCb Data Management team: E. van Herwijnen, J. Closier, M. Frank, C. Gaspar, F. Loverre, S. Ponce (CERN), R. Graciani Diaz (Barcelona), D. Galli, U. Marconi, V. Vagnoni (Bologna), N. Brook (Bristol), A. Buckley, K. Harrison (Cambridge), M. Schmelling (GRIDKA, Karlsruhe), U. Egede (Imperial College London), A. Tsaregorodtsev, V. Garonne (IN2P3, Marseille), A. Bogdanchikov (INP, Novosibirsk), I. Korolko (ITEP, Moscow), A. Washbrook, J.P. Palacios (Liverpool), S. Klous (Nikhef and Vrije Universiteit Amsterdam), J.J. Saborido (Santiago de Compostela), A. Khan (ScotGrid, Edinburgh), A. Pickford (ScotGrid, Glasgow), A. Soroko (Oxford), V. Romanovski (Protvino), G.N. Patrick, G. Kuznetsov (RAL), M. Gandelman (UFRJ, Rio de Janeiro)
What is it all about?
DIRAC: Distributed Infrastructure with Remote Agent Control
- Distributed MC production system for LHCb
  - production task definition and steering
  - software installation on production sites
  - job scheduling and monitoring
  - data transfers
- Automates most of the production tasks
  - minimum participation of local production managers
- PULL rather than PUSH concept for task scheduling
  - different from the DataGRID architecture
DIRAC architecture
Advantages of the PULL approach
- Better use of resources
  - no idle or forgotten CPU power
  - natural load balancing: a more powerful centre gets more work automatically
- Less burden on the central production service
  - deals only with production task definition and bookkeeping
  - does not need to know about particular production sites
- No need for direct access to local disks from the central service
  - AFS is not used
  - no RPC calls
- Easy introduction of new sites into the production schema
  - no information about local sites is needed at the central site
(A minimal sketch of the central-service side of this model follows.)
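To make the last point concrete, here is a minimal Python sketch of the service side of the pull model. The class, method names and the in-memory job list are hypothetical stand-ins for the real service and its ORACLE database; only the shape of the interaction follows the talk.

    # Minimal sketch of the central-service side of the pull model;
    # names and the in-memory job list are hypothetical stand-ins.
    import SimpleXMLRPCServer  # XML-RPC server from the Python standard library

    class ProductionService:
        def __init__(self):
            self.jobs = []    # stands in for the ORACLE production database
            self.status = {}

        def requestJob(self, queue):
            # Hand out the next pending job; nothing is known about the
            # site beyond what the agent sends in its request.
            if self.jobs:
                return self.jobs.pop(0)
            return ''  # no work available

        def setJobStatus(self, job_id, new_status):
            self.status[job_id] = new_status
            return 0

    server = SimpleXMLRPCServer.SimpleXMLRPCServer(('', 8000))
    server.register_instance(ProductionService())
    server.serve_forever()

Note that the service never initiates a connection to a production site; it only answers requests, which is what removes the need for central knowledge of local sites.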
Job description
Web-based editors
Agent operations
The Production agent mediates between the central services (Production, SW distribution, Monitoring, Bookkeeping), the local batch system and Castor. A production cycle proceeds as follows:
1. isQueueAvailable() - the agent checks the local batch queue
2. requestJob(queue) - the agent asks the Production service for a job
3. installPackage() - missing software is installed from the SW distribution service
4. submitJob(queue) - the job is submitted to the local batch system
5. setJobStatus(step 1) ... setJobStatus(step n) - the running job reports each workflow step to the Monitoring service
6. sendBookkeeping() - bookkeeping records are sent to the Bookkeeping service
7. sendFileToCastor() - output data are transferred to Castor
8. addReplica() - the new replica is registered
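Read as code, the agent side of this cycle might look like the following minimal sketch. The URLs, the site_config module and its helper functions are hypothetical placeholders, not the real DIRAC code; the XML-RPC proxies mirror the calls above.

    # Minimal sketch of the agent pull cycle; URLs, module and helper
    # names are hypothetical placeholders, not the real DIRAC code.
    import time
    import xmlrpclib  # XML-RPC client from the Python standard library

    # Site-specific hooks, defined in the local configuration script
    # (see "Agent customization at a production site" below).
    from site_config import is_queue_available, install_packages, submit_job

    PAUSE = 300  # seconds to wait when there is nothing to do

    def run_agent(production_url, monitoring_url, queue):
        production = xmlrpclib.ServerProxy(production_url)
        monitoring = xmlrpclib.ServerProxy(monitoring_url)
        while 1:
            # PULL model: ask for work only when the local queue can take it.
            if not is_queue_available(queue):
                time.sleep(PAUSE)
                continue
            job = production.requestJob(queue)
            if not job:
                time.sleep(PAUSE)  # no jobs defined centrally right now
                continue
            install_packages(job)            # SW distribution service
            job_id = submit_job(queue, job)  # local batch system
            monitoring.setJobStatus(job_id, 'submitted')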
Implementation details
- Central web services
  - XML-RPC communication protocol
  - web-based editing and visualization
  - ORACLE production and bookkeeping databases
- Agent: a set of collaborating Python classes
  - Python 1.5.2, to be sure it is compatible with all the sites
  - standard Python library XML-RPC implementation for passing messages
  - easily extendable: for new (GAUDI) applications and for new tools, e.g. file transport
  - site-specific functionality can easily be added via callbacks (see the sketch below)
  - data and log files are transferred using bbftp
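As an illustration of the callback idea, the sketch below shows how a generic agent step could delegate local details to a function supplied by the site. The class name, hook signature, rfcp invocation and paths are illustrative assumptions, not the actual DIRAC interfaces.

    # Illustrative only: a generic transfer step parameterized by a
    # site-supplied callback (not the actual DIRAC class).
    import os

    class MassStorageUpload:
        def __init__(self, copy_callback):
            # copy_callback(source, destination) implements the local copy
            # convention, e.g. rfcp for Castor at CERN, plain cp elsewhere.
            self.copy = copy_callback

        def execute(self, files, destination):
            for f in files:
                self.copy(f, destination)

    # Supplied by the site configuration script at CERN (hypothetical):
    def castor_copy(source, destination):
        os.system('rfcp %s %s' % (source, destination))

    step = MassStorageUpload(castor_copy)
    step.execute(['run1234.dst'], '/castor/cern.ch/lhcb/prod')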
Agent customization at a production site
- Ease of setting up a production site is crucial to absorb all available resources
- One Python script where all the local configuration is defined:
  - checking the local batch queue availability
  - the local job submission command
  - copying to/from local mass storage
  - the data and log file transfer policy
- The agent distribution comes with examples of typical cases (a sketch follows this list):
  - CERN: LSF and Castor
  - PBS and a data directory
  - standalone PC
  - DataGRID worker node
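As an example of the shape such a script might take, here is a minimal sketch for the PBS-plus-data-directory case. The queue name, directory and queue-availability policy are all hypothetical; only the list of hooks comes from the slide.

    # Hypothetical site configuration script for a PBS site with a
    # plain data directory standing in for mass storage.
    import os
    import commands  # Python 1.x/2.x module for running shell commands

    QUEUE = 'lhcb'
    DATA_DIR = '/data/lhcb/prod'

    def is_queue_available(queue=QUEUE):
        # Illustrative policy: pull new work only while qstat succeeds
        # for the configured queue.
        status, output = commands.getstatusoutput('qstat -Q %s' % queue)
        return status == 0

    def submit_job(queue, script):
        # Local job submission command; qsub prints the PBS job identifier.
        status, output = commands.getstatusoutput(
            'qsub -q %s %s' % (queue, script))
        return output.strip()

    def install_packages(job):
        # Nothing to do at this site: the software area is assumed
        # to be pre-installed.
        pass

    def copy_to_storage(source):
        # "Mass storage" is just a shared directory at this site.
        os.system('cp %s %s' % (source, DATA_DIR))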
Dealing with failures
- A job is rescheduled if the local system fails to run it
  - other sites can then pick it up
- Journaling
  - all the sensitive files (logs, bookkeeping, job descriptions) are kept at the production sites
  - bookkeeping update files are kept in the bookkeeping service cache
- A job can be restarted from where it failed
  - in case of system failure (lack of memory, general power cut, etc.)
  - RAW simulation data are retained so that the simulation step can be skipped
- File transfers are automatically retried after a predefined pause in case of failures (a sketch of the retry policy follows)
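The retry behaviour can be pictured with a small sketch; the pause length, retry count and the transfer hook are illustrative assumptions, not the real DIRAC values.

    # Illustrative retry wrapper around a file transfer hook, such as
    # a bbftp command; pause and retry count are made-up values.
    import time

    RETRY_PAUSE = 600   # predefined pause between attempts, in seconds
    MAX_RETRIES = 5

    def transfer_with_retries(transfer, source, destination):
        # transfer(source, destination) returns 0 on success.
        for attempt in range(MAX_RETRIES):
            if transfer(source, destination) == 0:
                return 1  # success
            time.sleep(RETRY_PAUSE)
        return 0  # still failing; leave the files for a later recovery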
Working experience
- The DIRAC production system was deployed on 17 LHCb production sites
  - 2 hours to 2 days of work for customization:
    - dealing with firewalls
    - interfaces to the local batch system and mass storage
    - site policy definition
- Smooth running for routine MC production tasks
  - long jobs with no input data
- Much less burden for local production managers
  - automatic data upload to CERN/Castor
  - log files automatically available through a Web interface
  - automatic recovery from common failures (job submission, data transfers)
- The current Data Challenge production using DIRAC is running ahead of schedule
DIRAC deployment on the DataGRID testbed
DIRAC on the DataGRID
Deploying agents on the DataGRID
- The JDL InputSandbox contains:
  - the job XML description
  - a launcher script
- The EDG replica_manager is used for data transfer to CERN/Castor
- Log files are passed back via the OutputSandbox
On the worker node, the launcher fetches and starts the agent:
wget http://.../distribution/dmsetup
dmsetup --local DataGRID shoot_agent job.xml
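For illustration, a JDL file following this description might look like the sketch below. The file names are hypothetical; only the sandbox structure comes from the slide.

    // Illustrative JDL sketch; file names are hypothetical.
    Executable    = "launcher.sh";
    Arguments     = "job.xml";
    InputSandbox  = {"launcher.sh", "job.xml"};
    OutputSandbox = {"agent.log", "job.log"};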
Tests on the DataGRID testbed
- Standard LHCb production jobs were used for the tests
  - jobs of different statistics with an 8-step workflow
- Jobs were submitted to 4 EDG testbed Resource Brokers
  - 50 jobs per broker
- A total of 300K events was produced (minimum bias and B inclusive)
Lessons learnt
- Difficult start, with a lot of subtle details to learn
- DataGRID instability problems persist:
  - MDS information system failures
  - site misconfiguration
- Outbound IP connectivity is not available on all sites
  - needed for the LHCb software installation on a per-job basis
  - needed for jobs exchanging messages with the production services
- Data transfer
  - bbftp file transport replaced by the replica manager tools
- The next run of tests will hopefully be more successful
  - better understanding of the system
  - use another schema for job delivery to a worker node
DIRAC on the DataGRID (2)
Conclusions
- The DIRAC production system is now routinely running in production at 17 sites
- It is of great help to local production managers and a key to the success of the LHCb Data Challenge 2003
- The DataGRID testbed is integrated into the DIRAC production schema; extensive tests are in progress
DIRAC components: Production run preparation
(Component diagram: the production manager (Prod.Mgr) uses the Workflow Editor, Production Editor and Application Packager to edit workflows, instantiate them as production runs and create the application tar files; the resulting production data and scripts are stored in the ORACLE Production DB behind the Production Service, which serves the Production Agents; the Bookkeeping Service runs on its own ORACLE database.)
DIRAC components: Job execution
(Same component diagram, now showing job execution: the Production Agent sends job requests to the Production Service and returns status updates; the service is backed by the ORACLE Production DB.)