1
DIRAC LHCb MC production system
A. Tsaregorodtsev, for the LHCb Data Management
team
CHEP 2003, La Jolla, 20 March 2003
2
Outline
  • Introduction
  • DIRAC architecture
  • Implementation details
  • Deploying agents on the DataGRID
  • Conclusions

LHCb Data Management team: E. van Herwijnen, J.
Closier, M. Frank, C. Gaspar, F. Loverre, S.
Ponce (CERN), R. Graciani Diaz (Barcelona), D.
Galli, U. Marconi, V. Vagnoni (Bologna), N. Brook
(Bristol), A. Buckley, K. Harrison (Cambridge),
M. Schmelling (GRIDKA, Karlsruhe), U. Egede
(Imperial College London), A. Tsaregorodtsev, V.
Garonne (IN2P3, Marseille), A. Bogdanchikov (INP,
Novosibirsk), I. Korolko (ITEP, Moscow), A.
Washbrook, J.P. Palacios (Liverpool), S. Klous
(Nikhef and Vrije Universiteit Amsterdam), J.J.
Saborido (Santiago de Compostela), A. Khan
(ScotGrid, Edinburgh), A. Pickford (ScotGrid,
Glasgow), A. Soroko (Oxford), V. Romanovski
(Protvino), G.N. Patrick, G. Kuznetsov (RAL), M.
Gandelman (UFRJ, Rio de Janeiro) 
3
What is it all about?
DIRAC: Distributed Infrastructure with Remote
Agent Control
  • Distributed MC production system for LHCb
  • Production tasks definition and steering
  • Software installation on production sites
  • Job scheduling and monitoring
  • Data transfers
  • Automates most of the production tasks
  • minimal participation of local production
    managers
  • PULL rather than PUSH concept for task
    scheduling
  • Different from the DataGRID architecture.

4
DIRAC architecture
5
Advantages of the PULL approach
  • Better use of resources
  • no idle or forgotten CPU power
  • natural load balancing: a more powerful center
    gets more work automatically
  • Less burden on the central production service
  • deals only with production task definitions and
    bookkeeping
  • does not need to know about particular production
    sites
  • No need for direct access to local disks from
    central service
  • AFS is not used
  • no RPC calls
  • Easy introduction of new sites into the
    production schema
  • no information on local sites necessary at the
    central site

6
Job description
Web-based editors
7
Agent operations
(Sequence of calls between the Production agent, the
local batch system, the central Production, SW
distribution, Monitoring and Bookkeeping services,
and Castor; a sketch follows the list.)
  • isQueueAvailable() check of the local batch queue
  • requestJob(queue) call to the Production service
  • installPackage() call to the SW distribution
    service
  • submitJob(queue) to the local batch system
  • setJobStatus(step 1) ... setJobStatus(step n)
    updates sent to the Monitoring service by the
    running job
  • sendBookkeeping() to the Bookkeeping service
  • sendFileToCastor() data upload to Castor
  • addReplica() registration in the Bookkeeping
    service
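A minimal sketch of this cycle, written in modern Python for readability
(the original agent targeted Python 1.5.2 and xmlrpclib); the service URL,
the job record fields and the local helper functions are illustrative
assumptions, not the actual DIRAC interface:

# Sketch of the agent cycle above, not the original DIRAC code.
import xmlrpc.client

PRODUCTION_URL = "http://dirac.example.org:8080"   # hypothetical endpoint

def queue_available(queue):
    # Placeholder for the local batch queue check (isQueueAvailable)
    return True

def install_package(name):
    # Placeholder for fetching and unpacking the application tar file
    pass

def submit_job(job, queue):
    # Placeholder for the local batch submission command
    pass

def agent_cycle(queue="lhcb"):
    service = xmlrpc.client.ServerProxy(PRODUCTION_URL)
    if not queue_available(queue):
        return
    job = service.requestJob(queue)            # PULL a task from the center
    if not job:
        return
    install_package(job["application"])        # SW distribution step
    submit_job(job, queue)                     # hand over to the batch system
    service.setJobStatus(job["id"], "step 1")  # later updates are sent by
                                               # the running job itself

if __name__ == "__main__":
    agent_cycle()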
8
Implementation details
  • Central web services
  • XML-RPC communication protocol
  • Web-based editing and visualization
  • ORACLE production and bookkeeping databases.
  • Agent: a set of collaborating Python classes
  • Python 1.5.2, to ensure compatibility with all
    the sites
  • Standard Python library XML-RPC implementation
    for passing messages (sketched below)
  • Easily extendable
  • for new (GAUDI) applications
  • for new tools, e.g. file transport.
  • Site-specific functionality can be easily added
    via callbacks
  • Data and log file transfer using bbftp.
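A minimal sketch of the message-passing side, using only the standard
library XML-RPC server; the port, the in-memory job store and the method
bodies are assumptions, not the actual central service:

# Sketch of an XML-RPC production service endpoint (illustrative only;
# the real service is backed by the ORACLE production database).
from xmlrpc.server import SimpleXMLRPCServer

pending_jobs = {"lhcb": [{"id": 1, "application": "app-v1"}]}  # stand-in

def requestJob(queue):
    # Hand the next pending job for this queue to a pulling agent
    jobs = pending_jobs.get(queue, [])
    return jobs.pop(0) if jobs else False      # False: nothing to do

def setJobStatus(job_id, status):
    # Record a status update sent by an agent or a running job
    print("job %s -> %s" % (job_id, status))
    return True

server = SimpleXMLRPCServer(("0.0.0.0", 8080))
server.register_function(requestJob)
server.register_function(setJobStatus)
server.serve_forever()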

9
Agent customization at a production site
  • Ease of setting up a production site is crucial
    to absorb all available resources
  • One Python script where all the local
    configuration is defined (see the sketch below)
  • Checking for the local batch queue availability
  • Local job submission command
  • Copying to/from local mass storage
  • Data and log file transfer policy.
  • Agent distribution comes with examples of
    typical cases
  • CERN: LSF + Castor
  • PBS + data directory
  • Standalone PC
  • DataGRID worker node.
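A sketch of what such a site configuration script could look like, loosely
in the spirit of the CERN LSF/Castor example; the queue name, commands and
paths are hypothetical:

# Hypothetical site configuration hooks (LSF + Castor flavour).
# Commands, queue name and paths are illustrative only.
import subprocess

QUEUE = "lhcb_long"                      # assumed local batch queue name

def is_queue_available():
    # Check the local batch queue before pulling a new job
    out = subprocess.run(["bqueues", QUEUE], capture_output=True, text=True)
    return "Open:Active" in out.stdout

def submit_job(script_path):
    # Local job submission command (a PBS site would call qsub instead)
    subprocess.run(["bsub", "-q", QUEUE, script_path], check=True)

def copy_to_mass_storage(local_file, castor_path):
    # Copy produced data to the local mass storage (Castor at CERN)
    subprocess.run(["rfcp", local_file, castor_path], check=True)

def transfer_policy(job):
    # Site policy: e.g. ship data files to CERN, keep log files locally
    return {"data": "transfer-to-CERN", "logs": "keep-local"}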

10
Dealing with failures
  • A job is rescheduled if the local system fails
    to run it
  • Other sites can then pick it up
  • Journaling
  • all the sensitive files (logs, bookkeeping, job
    descriptions) are kept at the production sites
  • bookkeeping update files are kept in the
    bookkeeping service cache
  • A job can be restarted from where it failed
  • in case of system failure (lack of memory,
    general power cut, etc.)
  • RAW simulation data should be retained so that
    the simulation step can be skipped.
  • File transfers are automatically retried after a
    predefined pause in case of failures (see the
    sketch below)
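A minimal sketch of that retry behaviour; the pause length, the attempt
limit and the bbftp invocation are assumptions:

# Automatic retry of a file transfer after a predefined pause
# (illustrative values and bbftp invocation).
import subprocess
import time

RETRY_PAUSE = 600           # seconds between attempts (assumed)
MAX_ATTEMPTS = 5            # assumed cut-off

def transfer_with_retry(local_file, remote_path, remote_host):
    for attempt in range(MAX_ATTEMPTS):
        result = subprocess.run(
            ["bbftp", "-e", "put %s %s" % (local_file, remote_path),
             remote_host])
        if result.returncode == 0:
            return True                      # transfer succeeded
        time.sleep(RETRY_PAUSE)              # predefined pause, then retry
    return False                             # give up after MAX_ATTEMPTS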

11
Working experience
  • The DIRAC production system has been deployed at
    17 LHCb production sites
  • 2 hours to 2 days of work for customization
  • Dealing with firewalls
  • Interfaces to the local batch system and mass
    storage
  • Site policy definition.
  • Smooth running for routine MC production tasks
  • Long jobs with no input data
  • Much less burden on local production managers
  • automatic data upload to CERN/Castor
  • log files automatically available through the Web
    interface
  • automatic recovery from common failures (job
    submission, data transfers)
  • The current Data Challenge production using DIRAC
    is proceeding ahead of schedule.

12
DIRAC deployment on the DataGRID testbed
13
DIRAC on the DataGRID
14
Deploying agents on the DataGRID
  • JDL InputSandbox contains
  • the job XML description
  • the launcher script
  • EDG replica_manager is used for data transfer to
    CERN/Castor
  • Log files are passed back via the OutputSandbox.

wget http///distribution/dmsetup
dmsetup --local DataGRID
shoot_agent job.xml
15
Tests on the DataGRID testbed
  • Standard LHCb production jobs were used for the
    tests
  • Jobs of different statistics with an 8-step
    workflow
  • Jobs were submitted to 4 EDG testbed Resource
    Brokers
  • 50 jobs per broker

A total of 300K events was produced (minimum bias
and B inclusive)
16
Lessons learnt
  • Difficult start with a lot of subtle details to
    learn
  • DataGRID instability problems persist
  • MDS information system failures
  • Site misconfiguration
  • Outbound IP connectivity is not available at all
    sites
  • Needed for the LHCb software installation on a
    per-job basis
  • Needed for jobs exchanging messages with the
    production services.
  • Data transfer
  • bbftp file transport was replaced by the replica
    manager tools
  • The next run of tests will hopefully be more
    successful
  • better understanding of the system
  • use another scheme for job delivery to the worker
    node.

17
DIRAC on the DataGRID (2)
18
Conclusions
  • The DIRAC production system is now routinely
    running in production at 17 sites
  • It is of great help to local production managers
    and a key to the success of the LHCb Data
    Challenge 2003
  • The DataGRID testbed is integrated into the DIRAC
    production schema; extensive tests are in
    progress.

19
DIRAC components: Production run preparation
(Component diagram: the Production Manager uses the
Workflow Editor and the Production Editor to edit the
production data and scripts held in the ORACLE
Production DB; the Application Packer creates the
application tar file; the instantiated workflow is
passed to the Production Service, which is polled by
Production Agents; bookkeeping information goes to
the ORACLE Bookkeeping Service.)
20
DIRAC components: Job execution
(The same component diagram, now showing the
Production Agent at work: it sends job requests to
the Production Service and returns status updates,
with the ORACLE databases behind the service.)