Title: Simulation in a Distributed Computing Environment
1 Simulation in a Distributed Computing Environment
- S. Guatelli (1), J. Moscicki (2), M.G. Pia (1)
- (1) INFN Genova, Italy
- (2) CERN, Geneva, Switzerland
2 Speed of Monte Carlo simulation
- Speed of execution is often a concern in Monte Carlo simulation
- Often a trade-off between precision of the simulation and speed of execution
Typical use cases
- Semi-interactive response
  - Detector design
  - Optimisation
  - Oncological radiotherapy
- Very long execution time
  - High statistics simulation
  - High precision simulation
Methods for faster simulation response
- Fast simulation
- Variance reduction techniques (event biasing)
- Inverse Monte Carlo methods
- Parallelisation
3 Features of this study
- Geant4 application in a distributed computing environment
  - Architecture
  - Implications on simulation applications
- Environments
  - PC farm
  - GRID
- Two use cases: Geant4 Advanced Examples
  - semi-interactive response (brachytherapy)
  - high statistics (medical_linac)
- By-product: results for Geant4 medical applications
- Quantitative study
  - results to be submitted for publication
4 Requirements
Architectural requirements
- Transparent execution in sequential/parallel mode
- Transparent execution on a PC farm and on the Grid
Semi-interactive simulation
- Geant4 brachytherapy
  - Execution time for 20 M events: 5 hours
  - Goal: execution time of a few minutes
High statistics simulation
- Geant4 medical_linac
  - Execution time for 10^9 events: 10 days
  - Goal: execution time of a few hours
Reference: sequential mode on a Pentium IV, 3 GHz
5 Parallel mode: local cluster / GRID
- Both applications have the same computing model
  - a job consists of a number of independent tasks which may be executed in parallel
  - the result of each task is a small data packet (a few kB), which is merged as the job runs
- In a cluster
  - computing resources are used for parallel execution
  - the user connects to a possibly remote cluster
  - input data for the job must be available on the site
  - typically there is a shared file system and a queuing system
  - the network is fast
- GRID computing uses resources from multiple computing centres
  - typically there is no shared file system
  - (parts of) input data must be replicated at remote sites
  - the network connection is slower than within a cluster
6 Overview
- Architectural issues
  - DIANE
  - How to dianize a Geant4 application
- Performance tests
  - On a single CPU
  - On clusters
  - On the GRID
- Conclusions
  - Lessons learned
  - Outlook
Quantitative, documented results
Publicly distributed DIANE Geant4 application code
7 DIANE
http://cern.ch/DIANE
Developed by J. Moscicki, CERN/IT
- R&D project
  - started in 2001 in CERN/IT with very limited resources
  - collaboration with Geant4 groups at CERN, INFN, ESA
  - successful prototypes running on LSF and EDG
Master-Worker architectural pattern
- Parallel cluster processing
  - make fine tuning and customisation easy
  - transparently using GRID technology
  - application independent
8 Practical example: Geant4 simulation with analysis
- Each task produces a file with histograms
- The job result is the sum of histograms produced by tasks
- Master-worker model
  - client starts a job
  - workers perform tasks and produce histograms
  - master integrates the results
- Distributed Processing for Geant4 Applications
  - task = N events
  - job = M tasks
  - tasks may be executed in parallel
  - tasks produce histograms/ntuples
  - task output is automatically combined (add histograms, append ntuples)
- Master-Worker Model
  - Master steers the execution of the job, automatically splits the job and merges the results
  - Worker initializes the Geant4 application and executes macros
  - Client gets the results
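The job/task split and the merging of task histograms can be sketched in plain Python. This is an illustrative toy only: it is neither DIANE nor Geant4 code. The names `run_task`, `merge_histograms` and `run_job`, the uniform random numbers standing in for simulated events, and the local thread pool standing in for remote workers are all assumptions made for the example.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

N_BINS = 10  # hypothetical histogram: 10 equal bins over [0, 1)

def run_task(task_id, n_events, base_seed=12345):
    """Worker side: 'simulate' n_events and return one histogram (list of counts)."""
    rng = random.Random(base_seed + task_id)  # distinct seed per task
    hist = [0] * N_BINS
    for _ in range(n_events):
        x = rng.random()               # stand-in for one simulated event
        hist[int(x * N_BINS)] += 1
    return hist

def merge_histograms(total, part):
    """Master side: add one task's histogram to the running job result."""
    return [a + b for a, b in zip(total, part)]

def run_job(n_tasks, events_per_task, n_workers=4):
    """Split a job into n_tasks tasks, run them in parallel, merge results as they arrive."""
    total = [0] * N_BINS
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        futures = [pool.submit(run_task, t, events_per_task) for t in range(n_tasks)]
        for fut in as_completed(futures):
            total = merge_histograms(total, fut.result())
    return total

job_result = run_job(n_tasks=8, events_per_task=1000)
```

Because histogram addition is commutative, the merged result does not depend on the order in which tasks finish, which is what allows the master to merge while the job is still running.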
9 Simulation with DIANE
UML Deployment Diagram for Geant4 applications
- Completely transparent to the user: same Geant4 application code
- G4Simulation class is responsible for managing the simulation
  - manage random number seeds
  - Geant4 initialisation
  - macros to be executed in batch mode
  - termination
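The four responsibilities listed for G4Simulation (seeds, initialisation, batch macros, termination) suggest a simple per-task life cycle. The sketch below is a hypothetical stand-in written for illustration, not the actual G4Simulation interface: the class `ToySimulation`, its method names, the seed scheme `base_seed + i` and the macro name `run.mac` are all assumptions.

```python
class ToySimulation:
    """Hypothetical stand-in for the role the slides assign to G4Simulation."""

    def __init__(self):
        self.seed = None
        self.initialised = False
        self.log = []

    def set_seed(self, seed):
        # Each task needs its own seed so that tasks are statistically independent.
        self.seed = seed
        self.log.append(f"seed {seed}")

    def initialize(self):
        # A real worker would set up the Geant4 application here.
        self.initialised = True
        self.log.append("initialize")

    def execute_macro(self, macro_name):
        # A real worker would run the macro in batch mode here.
        if not self.initialised:
            raise RuntimeError("initialize() must be called before execute_macro()")
        self.log.append(f"macro {macro_name}")

    def finish(self):
        self.initialised = False
        self.log.append("finish")


def task_seeds(base_seed, n_tasks):
    """Derive one distinct seed per task from a single job-level seed (simple assumed scheme)."""
    return [base_seed + i for i in range(n_tasks)]


def run_task(seed, macro_name):
    """Run the full life cycle for one task and return its log."""
    sim = ToySimulation()
    sim.set_seed(seed)
    sim.initialize()
    sim.execute_macro(macro_name)
    sim.finish()
    return sim.log


seeds = task_seeds(base_seed=1000, n_tasks=3)
logs = [run_task(s, "run.mac") for s in seeds]
```

The point of the sketch is the ordering and the seed handling: if two tasks shared a seed they would produce identical events, and summing their histograms would not add statistics.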
10 Development costs
- Strategy to minimise the cost of migrating a Geant4 simulation to a distributed environment
- DIANE Active Workflow framework
  - provides automatic communication/synchronization mechanisms
  - application is glued to the framework using a small Python module
  - in most cases no code changes to the original application are required
  - load balancing and error recovery policies may be plugged in as simple Python functions
  - transparent adaptation for clusters/GRIDs, shared/local file systems, shared/private queues
- Development/modification of application code
  - original source code unmodified
  - addition of an interface class which binds together application and M-W framework
The application developer is shielded from the complexity of the underlying technology via DIANE
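To illustrate what "policies plugged in as simple Python functions" can look like, here is a minimal sketch of an error recovery policy passed to a task loop. It does not use or reproduce the DIANE API: `run_with_policy`, `retry_up_to_three`, `never_retry` and `flaky_execute` are hypothetical names invented for this example, and the loop runs tasks sequentially to keep the idea visible.

```python
def retry_up_to_three(task_id, attempts):
    """Error recovery policy: allow a failed task at most 3 attempts in total."""
    return attempts < 3

def never_retry(task_id, attempts):
    """Alternative policy: give up on the first failure."""
    return False

def run_with_policy(task_ids, execute, should_retry):
    """Run tasks one by one; when a task fails, ask the policy whether to retry it."""
    done, failed = [], []
    for task_id in task_ids:
        attempts = 0
        while True:
            attempts += 1
            try:
                execute(task_id, attempts)
                done.append(task_id)
                break
            except RuntimeError:
                if not should_retry(task_id, attempts):
                    failed.append(task_id)
                    break
    return done, failed

def flaky_execute(task_id, attempt):
    """Toy task: task 2 fails on its first attempt, everything else succeeds."""
    if task_id == 2 and attempt == 1:
        raise RuntimeError("worker set-up problem")

done, failed = run_with_policy([0, 1, 2, 3], flaky_execute, retry_up_to_three)
```

Swapping `retry_up_to_three` for `never_retry` changes the behaviour without touching either the task code or the loop, which is the separation the slide describes.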
11 Test results
- Performance of the execution of the dianized Brachytherapy example
  - Test on a single CPU
  - Test on a dedicated farm (60 CPUs)
  - Test on a farm shared with other users (LSF, CERN)
  - Test on the GRID (LCG)
Tools and libraries
- Simulation toolkit: Geant4 7.0.p01
- Analysis tools: AIDA 3.2.1 and PI 1.3.3
- DIANE: DIANE 1.4.2
- CLHEP 1.9.1.2
- G4EMLOW 2.3
12 Overhead at initialisation/termination
- Test on a single dedicated CPU (Intel Pentium IV, 3.00 GHz)
- Study execution via DIANE w.r.t. sequential execution
  - run 1 event
Standalone application: 4.6 ± 0.2 s
Application via DIANE, simulation only: 8.8 ± 0.8 s
Application via DIANE, with analysis integration: 9.5 ± 0.5 s
Overhead: about 5 s, negligible in a high statistics job
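The quoted overhead follows from the difference between the DIANE and standalone timings. The short calculation below reproduces it; combining the uncertainties in quadrature assumes the measurements are independent, which the slides do not state. The 16646 s figure is the sequential time for 20 M events reported elsewhere in these slides.

```python
import math

# Measured times for a 1-event run, from the slide: (value, uncertainty) in seconds
standalone = (4.6, 0.2)
diane_sim_only = (8.8, 0.8)
diane_with_analysis = (9.5, 0.5)

def difference(a, b):
    """Return a - b, with uncertainties added in quadrature (assumes independent errors)."""
    value = a[0] - b[0]
    sigma = math.sqrt(a[1] ** 2 + b[1] ** 2)
    return value, sigma

overhead_sim, sigma_sim = difference(diane_sim_only, standalone)
overhead_full, sigma_full = difference(diane_with_analysis, standalone)

# Relative weight of the full overhead in the 20 M event sequential run (16646 s)
fraction = overhead_full / 16646.0

print(f"simulation only: {overhead_sim:.1f} +/- {sigma_sim:.1f} s")
print(f"with analysis:   {overhead_full:.1f} +/- {sigma_full:.1f} s")
print(f"fraction of a 16646 s job: {fraction:.2%}")
```

Under these assumptions the overhead with analysis integration is about 4.9 ± 0.5 s, roughly 0.03% of the full 20 M event run.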
13 Overhead due to DIANE
- Test on a single dedicated CPU (Intel Pentium IV, 3.00 GHz)
- Study execution via DIANE w.r.t. sequential execution
[Figure: execution time vs. number of events in the job, and ratio with respect to the number of events]
The overhead of DIANE is negligible in high statistics jobs
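Why a fixed overhead becomes negligible can be seen with a simple linear timing model. This is an assumed model, not the measured curve from the slides: it takes the 1-event timings (4.6 s standalone, 9.5 s via DIANE with analysis) as fixed start-up costs and derives a per-event time from the 16646 s sequential run of 20 M events.

```python
T_INIT_SEQ = 4.6                 # s, standalone 1-event run
T_INIT_DIANE = 9.5               # s, 1-event run via DIANE with analysis integration
# s/event; the start-up time included in the 16646 s total is ignored as negligible
T_PER_EVENT = 16646.0 / 20e6

def time_ratio(n_events):
    """Modelled (time via DIANE) / (time in sequential mode) on a single CPU."""
    t_seq = T_INIT_SEQ + n_events * T_PER_EVENT
    t_diane = T_INIT_DIANE + n_events * T_PER_EVENT
    return t_diane / t_seq

for n in (1, 10_000, 1_000_000, 20_000_000):
    print(f"{n:>10d} events: ratio = {time_ratio(n):.4f}")
```

In this model the ratio starts near 2 for a single event and tends to 1 as the number of events grows, since the constant start-up difference is divided by an ever larger total time.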
14 Farm execution time and efficiency
- Dedicated farm: 30 identical bi-processors (Pentium IV, 3 GHz)
  - Thanks to Regional Operation Centre (ROC) Team, Taiwan
  - Thanks to Hurng-Chun Lee (Academia Sinica Grid Computing Center, Taiwan)
- Load balancing: optimisation of the number of tasks and workers
15 Optimizing the number of tasks
- The job ends when all the tasks are executed in the workers
- If the job is split into a higher number of tasks, the chance that the workers finish their tasks at the same time is higher
- Note: the overall time of the job is determined by the last worker to finish the last task
Example of good job balancing
Example of a job that can be improved from a performance point of view
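The effect of task granularity on the job time can be illustrated with a toy scheduling model in which idle workers pull the next task from a shared queue. The worker speeds and the amount of work below are invented for illustration; they are not taken from the farm tests.

```python
import heapq

def makespan(total_work, n_tasks, speeds):
    """Job time when workers pull equal-size tasks from a shared queue.

    total_work: total amount of work in the job (arbitrary units)
    n_tasks:    number of equal tasks the job is split into
    speeds:     work units per second for each worker
    """
    task_work = total_work / n_tasks
    # Heap of (time at which the worker becomes free, worker index)
    free_at = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_tasks):
        t, i = heapq.heappop(free_at)        # next worker to become idle
        done = t + task_work / speeds[i]     # it takes the next task
        finish = max(finish, done)           # job ends with the last task
        heapq.heappush(free_at, (done, i))
    return finish

speeds = [1.0, 1.0, 1.0, 0.5]   # hypothetical farm: one worker is half as fast

coarse = makespan(120.0, 4, speeds)    # one large task per worker
fine = makespan(120.0, 40, speeds)     # ten times more, smaller tasks
print(f"4 tasks:  {coarse:.1f} s")
print(f"40 tasks: {fine:.1f} s")
```

With one task per worker the slow worker determines the job time; with many small tasks the fast workers absorb more of the work and all workers finish closer together. The model ignores per-task overhead, which is the other side of the trade-off.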
16 Farm shared with other users
Real-life case: farm shared with other users
Execution in parallel mode on 5 workers of CERN LSF
DIANE used as intermediate layer
Preliminary!
The load of the cluster changes quickly in time
The conditions of the test are not reproducible
Highly variable performance
17 Parallel execution in a PC farm
- Required production of Brachytherapy: 20 M events
- 20 M events in sequential mode
  - 16646 s (about 4 h 37 min) on an Intel Pentium IV, 3.00 GHz
- The same simulation runs in 5 minutes in parallel on 56 CPUs
  - appropriate for clinical usage
- Similar results for Geant4 medical_linac Advanced Example
  - production can become compatible with usage for the verification of IMRT treatment planning
  - sequential execution requires 10 days to obtain significant results
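The figures above translate into a speedup and a parallel efficiency as follows. Since the parallel time is quoted only as "5 minutes", the results are approximate; the variable names are chosen for this example.

```python
t_sequential = 16646.0    # s, 20 M events on one Pentium IV, 3.00 GHz
t_parallel = 5 * 60.0     # s, "5 minutes" on the farm (rounded value)
n_cpus = 56

speedup = t_sequential / t_parallel     # how many times faster than sequential
efficiency = speedup / n_cpus           # fraction of the ideal speedup (1.0 = ideal)

print(f"speedup    = {speedup:.1f}")
print(f"efficiency = {efficiency:.2f}")
```

With these inputs the speedup is about 55 on 56 CPUs, i.e. close to ideal scaling. The efficiency is sensitive to the rounding of the parallel time: a true value of 5.5 minutes would give about 0.90 instead.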
18 Running on the Grid (LCG)
- G4Brachy executed on the GRID (LCG)
  - nodes located in Spain, Russia, Italy, Germany, Switzerland
- Conditions of the test
  - The load of the GRID changes quickly in time
  - The conditions of the test are not reproducible
- Efficiency
  - The evaluation of the efficiency with the same criterion as in a dedicated farm does not make much sense in this context
  - Study the efficiency of DIANE as automated job management w.r.t. manual submission through simple scripts
19 Test results
[Figure, two panels: time (seconds) vs. worker number]
- Execution on the GRID through DIANE: 20 M events, 180 tasks, 30 workers
  - All the tasks are executed successfully on 22 workers
  - Not all the workers are initialized and used: on-going investigation
- Execution on the GRID, without DIANE
  - 2 jobs not successfully executed due to set-up problems of the workers
20 How the GRID load changes
- Execution time of Brachytherapy in two different conditions of the GRID
- DIANE used as intermediate layer
[Figure, two panels: time (seconds) vs. worker number]
20 M events, 60 workers initialized, 360 tasks
Very different result!
21 Farm/GRID execution
- Brachy, 20 M events, 180 tasks
- Taipei cluster
  - 29 machines, 734 s (about 12 minutes)
- GRID
  - 27 machines, 1517 s (about 25 minutes)
Preliminary indication: the conditions are not reproducible
22 Lessons learned
- DIANE as intermediate layer
  - Transparency
  - Good separation of the subsystems
  - Good management of CPU resources
  - Negligible overhead
- Load balancing
  - A relatively large number of tasks increases the efficiency of parallel execution in a farm
  - Trade-off between optimisation of task splitting and overhead introduced
- Controlled and real-life situations are quite different in a farm
  - need a dedicated farm for critical usage (e.g. hospital)
- Grid
  - highly variable environment
  - not mature yet for critical usage
  - automated management through a smart system is mandatory
  - work in progress, details still to be understood quantitatively
23 Conclusions
- General approach to the execution of Geant4 simulation in a distributed computing environment
  - transparent sequential/parallel application
  - transparent execution on a local farm or on the Grid
  - user code is the same
- Quantitative, documented results
  - reference for users and for further improvement
  - on-going work to understand details