Title: DIRAC LHCb MC production system
DIRAC LHCb MC production system
A. Tsaregorodtsev, for the LHCb Data Management team
CHEP 2003, La Jolla, 20 March
Outline
- Introduction
- DIRAC architecture
- Implementation details
- Deploying agents on the DataGRID
- Conclusions
LHCb Data Management team: E. van Herwijnen, J. Closier, M. Frank, C. Gaspar, F. Loverre, S. Ponce (CERN), R. Graciani Diaz (Barcelona), D. Galli, U. Marconi, V. Vagnoni (Bologna), N. Brook (Bristol), A. Buckley, K. Harrison (Cambridge), M. Schmelling (GRIDKA, Karlsruhe), U. Egede (Imperial College London), A. Tsaregorodtsev, V. Garonne (IN2P3, Marseille), A. Bogdanchikov (INP, Novosibirsk), I. Korolko (ITEP, Moscow), A. Washbrook, J.P. Palacios (Liverpool), S. Klous (Nikhef and Vrije Universiteit Amsterdam), J.J. Saborido (Santiago de Compostela), A. Khan (ScotGrid, Edinburgh), A. Pickford (ScotGrid, Glasgow), A. Soroko (Oxford), V. Romanovski (Protvino), G.N. Patrick, G. Kuznetsov (RAL), M. Gandelman (UFRJ, Rio de Janeiro)
What is it all about?
DIRAC: Distributed Infrastructure with Remote Agent Control
- Distributed MC production system for LHCb
  - production task definition and steering
  - software installation on production sites
  - job scheduling and monitoring
  - data transfers
- Automates most of the production tasks
  - minimum participation of local production managers
- PULL rather than PUSH concept for task scheduling
  - different from the DataGRID architecture
DIRAC architecture
Advantages of the PULL approach
- Better use of resources
  - no idle or forgotten CPU power
  - natural load balancing: a more powerful centre gets more work automatically
- Less burden on the central production service
  - deals only with production task definition and bookkeeping
  - does not need to know about particular production sites
- No need for direct access to local disks from the central service
  - AFS is not used
  - no RPC calls
- Easy introduction of new sites into the production schema
  - no information about local sites is needed at the central site
(A minimal sketch of the central-service side of this model follows.)
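To make the last point concrete, here is a minimal Python sketch of the service side of the pull model. The class, method names and the in-memory job list are hypothetical stand-ins for the real service and its ORACLE database; only the shape of the interaction follows the talk.

    # Minimal sketch of the central-service side of the pull model;
    # names and the in-memory job list are hypothetical stand-ins.
    import SimpleXMLRPCServer  # XML-RPC server from the Python standard library

    class ProductionService:
        def __init__(self):
            self.jobs = []    # stands in for the ORACLE production database
            self.status = {}

        def requestJob(self, queue):
            # Hand out the next pending job; nothing is known about the
            # site beyond what the agent sends in its request.
            if self.jobs:
                return self.jobs.pop(0)
            return ''  # no work available

        def setJobStatus(self, job_id, new_status):
            self.status[job_id] = new_status
            return 0

    server = SimpleXMLRPCServer.SimpleXMLRPCServer(('', 8000))
    server.register_instance(ProductionService())
    server.serve_forever()

Note that the service never initiates a connection to a production site; it only answers requests, which is what removes the need for central knowledge of local sites.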
Job description
Web-based editors
Agent operations
The Production agent mediates between the central services (Production, SW distribution, Monitoring, Bookkeeping), the local batch system and Castor. A production cycle proceeds as follows:
1. isQueueAvailable() - the agent checks the local batch queue
2. requestJob(queue) - the agent asks the Production service for a job
3. installPackage() - missing software is installed from the SW distribution service
4. submitJob(queue) - the job is submitted to the local batch system
5. setJobStatus(step 1) ... setJobStatus(step n) - the running job reports each workflow step to the Monitoring service
6. sendBookkeeping() - bookkeeping records are sent to the Bookkeeping service
7. sendFileToCastor() - output data are transferred to Castor
8. addReplica() - the new replica is registered
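Read as code, the agent side of this cycle might look like the following minimal sketch. The URLs, the site_config module and its helper functions are hypothetical placeholders, not the real DIRAC code; the XML-RPC proxies mirror the calls above.

    # Minimal sketch of the agent pull cycle; URLs, module and helper
    # names are hypothetical placeholders, not the real DIRAC code.
    import time
    import xmlrpclib  # XML-RPC client from the Python standard library

    # Site-specific hooks, defined in the local configuration script
    # (see "Agent customization at a production site" below).
    from site_config import is_queue_available, install_packages, submit_job

    PAUSE = 300  # seconds to wait when there is nothing to do

    def run_agent(production_url, monitoring_url, queue):
        production = xmlrpclib.ServerProxy(production_url)
        monitoring = xmlrpclib.ServerProxy(monitoring_url)
        while 1:
            # PULL model: ask for work only when the local queue can take it.
            if not is_queue_available(queue):
                time.sleep(PAUSE)
                continue
            job = production.requestJob(queue)
            if not job:
                time.sleep(PAUSE)  # no jobs defined centrally right now
                continue
            install_packages(job)            # SW distribution service
            job_id = submit_job(queue, job)  # local batch system
            monitoring.setJobStatus(job_id, 'submitted')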
Implementation details
- Central web services
  - XML-RPC communication protocol
  - web-based editing and visualization
  - ORACLE production and bookkeeping databases
- Agent: a set of collaborating Python classes
  - Python 1.5.2, to be sure it is compatible with all the sites
  - standard Python library XML-RPC implementation for passing messages
  - easily extendable: for new (GAUDI) applications and for new tools, e.g. file transport
  - site-specific functionality can easily be added via callbacks (see the sketch below)
  - data and log files are transferred using bbftp
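As an illustration of the callback idea, the sketch below shows how a generic agent step could delegate local details to a function supplied by the site. The class name, hook signature, rfcp invocation and paths are illustrative assumptions, not the actual DIRAC interfaces.

    # Illustrative only: a generic transfer step parameterized by a
    # site-supplied callback (not the actual DIRAC class).
    import os

    class MassStorageUpload:
        def __init__(self, copy_callback):
            # copy_callback(source, destination) implements the local copy
            # convention, e.g. rfcp for Castor at CERN, plain cp elsewhere.
            self.copy = copy_callback

        def execute(self, files, destination):
            for f in files:
                self.copy(f, destination)

    # Supplied by the site configuration script at CERN (hypothetical):
    def castor_copy(source, destination):
        os.system('rfcp %s %s' % (source, destination))

    step = MassStorageUpload(castor_copy)
    step.execute(['run1234.dst'], '/castor/cern.ch/lhcb/prod')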
Agent customization at a production site
- Ease of setting up a production site is crucial to absorb all available resources
- One Python script where all the local configuration is defined:
  - checking the local batch queue availability
  - the local job submission command
  - copying to/from local mass storage
  - the data and log file transfer policy
- The agent distribution comes with examples of typical cases (a sketch follows this list):
  - CERN: LSF and Castor
  - PBS and a data directory
  - standalone PC
  - DataGRID worker node
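As an example of the shape such a script might take, here is a minimal sketch for the PBS-plus-data-directory case. The queue name, directory and queue-availability policy are all hypothetical; only the list of hooks comes from the slide.

    # Hypothetical site configuration script for a PBS site with a
    # plain data directory standing in for mass storage.
    import os
    import commands  # Python 1.x/2.x module for running shell commands

    QUEUE = 'lhcb'
    DATA_DIR = '/data/lhcb/prod'

    def is_queue_available(queue=QUEUE):
        # Illustrative policy: pull new work only while qstat succeeds
        # for the configured queue.
        status, output = commands.getstatusoutput('qstat -Q %s' % queue)
        return status == 0

    def submit_job(queue, script):
        # Local job submission command; qsub prints the PBS job identifier.
        status, output = commands.getstatusoutput(
            'qsub -q %s %s' % (queue, script))
        return output.strip()

    def install_packages(job):
        # Nothing to do at this site: the software area is assumed
        # to be pre-installed.
        pass

    def copy_to_storage(source):
        # "Mass storage" is just a shared directory at this site.
        os.system('cp %s %s' % (source, DATA_DIR))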
Dealing with failures
- A job is rescheduled if the local system fails to run it
  - other sites can then pick it up
- Journaling
  - all the sensitive files (logs, bookkeeping, job descriptions) are kept at the production sites
  - bookkeeping update files are kept in the bookkeeping service cache
- A job can be restarted from where it failed
  - in case of system failure (lack of memory, general power cut, etc.)
  - RAW simulation data are retained so that the simulation step can be skipped
- File transfers are automatically retried after a predefined pause in case of failures (a sketch of the retry policy follows)
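The retry behaviour can be pictured with a small sketch; the pause length, retry count and the transfer hook are illustrative assumptions, not the real DIRAC values.

    # Illustrative retry wrapper around a file transfer hook, such as
    # a bbftp command; pause and retry count are made-up values.
    import time

    RETRY_PAUSE = 600   # predefined pause between attempts, in seconds
    MAX_RETRIES = 5

    def transfer_with_retries(transfer, source, destination):
        # transfer(source, destination) returns 0 on success.
        for attempt in range(MAX_RETRIES):
            if transfer(source, destination) == 0:
                return 1  # success
            time.sleep(RETRY_PAUSE)
        return 0  # still failing; leave the files for a later recovery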
Working experience
- The DIRAC production system was deployed on 17 LHCb production sites
  - 2 hours to 2 days of work for customization:
    - dealing with firewalls
    - interfaces to the local batch system and mass storage
    - site policy definition
- Smooth running for routine MC production tasks
  - long jobs with no input data
- Much less burden for local production managers
  - automatic data upload to CERN/Castor
  - log files automatically available through a Web interface
  - automatic recovery from common failures (job submission, data transfers)
- The current Data Challenge production using DIRAC is running ahead of schedule
DIRAC deployment on the DataGRID testbed
DIRAC on the DataGRID
Deploying agents on the DataGRID
- The JDL InputSandbox contains:
  - the job XML description
  - a launcher script
- The EDG replica_manager is used for data transfer to CERN/Castor
- Log files are passed back via the OutputSandbox
On the worker node, the launcher fetches and starts the agent:
wget http://.../distribution/dmsetup
dmsetup --local DataGRID shoot_agent job.xml
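For illustration, a JDL file following this description might look like the sketch below. The file names are hypothetical; only the sandbox structure comes from the slide.

    // Illustrative JDL sketch; file names are hypothetical.
    Executable    = "launcher.sh";
    Arguments     = "job.xml";
    InputSandbox  = {"launcher.sh", "job.xml"};
    OutputSandbox = {"agent.log", "job.log"};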
Tests on the DataGRID testbed
- Standard LHCb production jobs were used for the tests
  - jobs of different statistics with an 8-step workflow
- Jobs were submitted to 4 EDG testbed Resource Brokers
  - 50 jobs per broker
- A total of 300K events was produced (minimum bias and B inclusive)
Lessons learnt
- Difficult start, with a lot of subtle details to learn
- DataGRID instability problems persist:
  - MDS information system failures
  - site misconfiguration
- Outbound IP connectivity is not available on all sites
  - needed for the LHCb software installation on a per-job basis
  - needed for jobs exchanging messages with the production services
- Data transfer
  - bbftp file transport replaced by the replica manager tools
- The next run of tests will hopefully be more successful
  - better understanding of the system
  - use another schema for job delivery to a worker node
DIRAC on the DataGRID (2)
Conclusions
- The DIRAC production system is now routinely running in production at 17 sites
- It is of great help to local production managers and a key to the success of the LHCb Data Challenge 2003
- The DataGRID testbed is integrated into the DIRAC production schema; extensive tests are in progress
DIRAC components: Production run preparation
(Component diagram: the production manager (Prod.Mgr) uses the Workflow Editor, Production Editor and Application Packager to edit workflows, instantiate them as production runs and create the application tar files; the resulting production data and scripts are stored in the ORACLE Production DB behind the Production Service, which serves the Production Agents; the Bookkeeping Service runs on its own ORACLE database.)
DIRAC components: Job execution
(Same component diagram, now showing job execution: the Production Agent sends job requests to the Production Service and returns status updates; the service is backed by the ORACLE Production DB.)