Title: ATLAS DC2 Phase I
1. ATLAS DC2 Phase I
- ATLAS Software Week
- 6th December 2004
- Gilbert Poulard (CERN PH-ATC)
- on behalf of ATLAS DC Grid and Operations teams
2. ATLAS-DC2 operation
- Consider DC2 as a three-part operation
  - Part I: production of simulated data (July-November 2004)
    - Running on 3 Grids, worldwide
  - Part II: test of Tier-0 operation (November-December 2004)
    - Do in 10 days what should be done in 1 day when real data-taking starts
    - Input is Raw Data-like
    - Output (ESD, AOD) will be distributed to Tier-1s in real time for analysis
  - Part III: test of distributed analysis on the Grid (early 2005)
    - Access to event and non-event data from anywhere in the world, both in organized and chaotic ways
- Requests
  - Physics channels (10 million events)
  - Several million events for calibration (single particles and physics samples (di-jets))
3. DC2 Phase I: Data preparation
- DC2 Phase I (the chain is sketched after this list)
  - Part 1: event generation
    - Physics processes -> 4-momenta of particles
  - Part 2: detector simulation
    - Tracking of particles through the detector
    - Records interactions of particles with the sensitive elements of the detector
  - Part 3: pile-up and digitization
    - Pile-up: superposition of background events with the signal event
    - Digitization: response of the sensitive elements of the detector
    - Output, called byte-stream data, looks like Raw Data
- DC2 Phase II
  - Part 4: data transfer to CERN Tier-0
  - Part 5: event mixing
  - Part 6: Tier-0 exercise
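The Phase I chain above can be read as a fixed sequence of transformations. Below is a minimal sketch of that flow in Python; the stage names, data formats and dataset name are assumptions for illustration, not the actual ATLAS production transformations.

```python
# Sketch of the DC2 Phase I chain described above.
# Stage names, formats and the dataset name are assumptions, not the real transformations.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    consumes: str
    produces: str

PHASE_I_CHAIN = [
    Stage("event generation", "physics process definition", "generated events (4-momenta)"),
    Stage("detector simulation", "generated events (4-momenta)", "simulated hits"),
    Stage("pile-up + digitization", "simulated hits", "byte-stream-like raw data"),
]

def run_chain(dataset: str) -> None:
    """Trace a dataset through the Phase I stages, checking that each
    stage consumes exactly what the previous one produced."""
    current = PHASE_I_CHAIN[0].consumes
    for stage in PHASE_I_CHAIN:
        assert stage.consumes == current, "stages must be chained in order"
        print(f"{dataset}: {stage.name}: {stage.consumes} -> {stage.produces}")
        current = stage.produces

run_chain("dc2.sample")  # hypothetical dataset name
```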
4. DC2 Phase II
- Data preparation
  - Transfer of data to CERN (100K files, 25 TB)
  - Event mixing
    - 30 physics channels
    - Originally planned to produce ByteStream, but decided to use RDO (well tested)
- Tier-0 exercise
  - Reconstruction: ESD and AOD
    - Reconstruction from RDO
    - Creates ESD (Event Summary Data)
    - In a 2nd step produces AOD (Analysis Object Data) in 10 different streams and event collections
  - In parallel, distributes ESD and AOD to Tier-1s in real time (sketched below)
    - ESD: 2 Tier-1s
    - AOD: all Tier-1s
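The Tier-0 fan-out in the last bullets amounts to a simple distribution rule. The sketch below illustrates it with placeholder Tier-1 and file names; the real transfers went through the ATLAS data-management system, whose interface is not reproduced here.

```python
# Sketch of the ESD/AOD fan-out: ESD to 2 Tier-1s, AOD to all Tier-1s.
# Site names and the "replicate" step are placeholders, not the real Don Quijote calls.
TIER1_SITES = ["Tier1-A", "Tier1-B", "Tier1-C", "Tier1-D"]  # hypothetical list

def distribution_plan(esd_file: str, aod_file: str) -> dict:
    plan = {
        esd_file: TIER1_SITES[:2],    # ESD goes to 2 Tier-1s
        aod_file: list(TIER1_SITES),  # AOD goes to all Tier-1s
    }
    for replica, sites in plan.items():
        for site in sites:
            print(f"replicate {replica} -> {site}")  # real system: data-management service call
    return plan

distribution_plan("run.ESD.pool.root", "run.AOD.pool.root")  # illustrative file names
```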
5. DC2 Phase I
- Started in July and effectively completed
- On 3 Grids
  - LCG
    - Including some non-ATLAS sites
    - Using the LCG-Grid-Canada interface in production mode
      - 3 sites are accessible through this interface at TRIUMF: Uni. Victoria, Uni. Alberta and WestGrid (SFU/TRIUMF)
  - NorduGrid
    - Several Scandinavian supercomputer resources
    - Sites in Australia, Germany, Slovenia, Switzerland
  - Grid3
    - Also using computing resources that are not dedicated to ATLAS (e.g. US-CMS sites)
6. Grid3 participating sites
- Sep 04
- 30 sites, multi-VO
- shared resources
- 3000 CPUs (shared)
7. NorduGrid & Co. participating sites
- Totals
  - 7 countries
  - 22 sites
  - 3000 CPUs (about 600 dedicated)
  - 7 Storage Services (in RLS), plus a few more storage facilities
  - 12 TB
- 1 FTE (1-3 persons) in charge of production
- 2-3 executor instances
8. LCG-2
9. ATLAS Production system
[Diagram: architecture of the ATLAS production system. A common supervisor (Windmill, run as several "super" instances) takes job definitions from the production database (prodDB), uses the AMI metadata catalogue and the Don Quijote data-management system (dms), and communicates over SOAP/Jabber with Grid-specific executors: Lexor (LCG exe), Dulcinea (NG exe), Capone (G3 exe) and an LSF executor. Each Grid (LCG, NorduGrid, Grid3) has its own RLS file catalogue; LSF covers local batch resources.]
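To make the diagram concrete, here is a minimal sketch of the supervisor/executor split; the class and method names are invented for illustration, and the real Windmill, Lexor, Dulcinea and Capone interfaces and their SOAP/Jabber messaging are not reproduced.

```python
# Sketch of the supervisor/executor split from the diagram above.
# Class and method names are invented for illustration only.
class Executor:
    """One executor per Grid flavour (e.g. Lexor/LCG, Dulcinea/NorduGrid,
    Capone/Grid3, LSF executor)."""
    def __init__(self, grid: str):
        self.grid = grid

    def submit(self, job: dict) -> str:
        # A real executor would translate the job definition into the
        # Grid's own job description language and submit it there.
        return f"{self.grid}-job-{job['id']}"

class Supervisor:
    """Takes job definitions (as Windmill takes them from prodDB) and
    hands them to a Grid-specific executor."""
    def __init__(self, executor: Executor):
        self.executor = executor

    def run(self, pending_jobs):
        for job in pending_jobs:
            grid_id = self.executor.submit(job)  # in DC2 this exchange went over SOAP/Jabber
            print(f"job {job['id']} dispatched to {self.executor.grid} as {grid_id}")

Supervisor(Executor("LCG")).run([{"id": 1}, {"id": 2}])
```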
10. LCG dedicated resources (services)
- Initial underestimate of ATLAS needs for DC2
  - Only a UI/RB/BDII/DQ combo machine!
- Several service resources currently dedicated to ATLAS
  - 2 User Interfaces (lxb0725, lxb0726)
  - 2 Resource Brokers (lxb0728, lxb0729)
  - 1 MyProxy server (lxb0727)
  - ATLAS-BDII (load share) (lxb2005, lxb2011)
  - DQ server (lxn1190)
  - ATLAS-dedicated services across sites (IFIC, CNAF, Milano)
- Some of those resources were used to saturation
- Initially some latency in providing resources (coordination with the security team)
- Services (RB, BDII) have always been kept up to date with new patches/bug fixes
11. DC2 Phase I operation
- Main difficulties in the initial phase
  - For all Grids
    - Debugging the Production System
    - On LCG and Grid3, several instances of the Supervisor had to be run to better cope with the instability of the system. As a consequence the Production System was more difficult to handle.
  - LCG
    - Mis-configuration of sites; Information System (wrong or missing information); job submission and Resource Broker (leak due to EDG-WP1); job ranking
    - Data management (copy and register); stage-in/out problems
  - NorduGrid
    - Replica Location Service (Globus) hanging several times per day
    - Mis-configuration of sites
    - Access to the conditions database
  - Grid3
    - Data Management: RLS interactions
    - Software distribution problems
    - Load on gatekeepers
    - Some problems with certificates (causing jobs to abort)
- Good collaboration with the Grid teams to solve the problems
- On the other hand, the Athena framework and Geant4 were extremely stable (only a handful of crashes in >10 M events)
12. DC2 Phase I problems
- Non-initial problems (not always fixed)
  - NorduGrid
    - Access to the conditions database; site-specific accidents (e.g. storage elements died)
  - Grid3
    - Trying to avoid single points of failure (adding new servers)
    - Lack of storage management at some sites
  - LCG
    - Workload Management System
      - Resource Broker (slow at rejecting jobs when too busy)
      - Site ranking based on too few parameters
      - Uneven job distribution
      - Lack of normalized CPU units (jobs going to wrong queues; see the sketch after this list)
    - Data Management System
      - Failure to get input files
      - Failure to store and/or register output files
      - Output files correctly registered but data corrupted during transfer
  - For all
    - Slowness of the response of the Production Database
      - Problem that appeared after 6 weeks of running
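As a toy illustration of the missing CPU normalization noted above (the speed ratio, numbers and function are assumptions, not the actual Resource Broker matchmaking): a job's CPU requirement can only be compared with a queue's time limit after scaling by the site's CPU speed.

```python
# Toy illustration of the CPU-normalization problem: without scaling to a
# common reference, jobs are matched to queues that are too short for them.
def fits_queue(job_cpu_sec_on_ref: float, site_speed_ratio: float,
               queue_limit_sec: float) -> bool:
    """job_cpu_sec_on_ref: CPU time the job needs on a reference CPU.
    site_speed_ratio: this site's CPU speed relative to the reference."""
    needed_sec = job_cpu_sec_on_ref / site_speed_ratio
    return needed_sec <= queue_limit_sec

# Same job, same nominal queue limit, two sites of different speed:
print(fits_queue(40_000, site_speed_ratio=1.0, queue_limit_sec=43_200))  # True
print(fits_queue(40_000, site_speed_ratio=0.5, queue_limit_sec=43_200))  # False: needs 80,000 s
```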
13. ATLAS DC2 production
14. ATLAS DC2 production
15. Jobs on LCG
30 November 2004: 31 sites, 90,000 jobs
16. LCG successful jobs
17. LCG failed jobs
Production database also used for testing!
18. LCG failure rate
Production database also used for testing!
19. Jobs on Grid3
30 November 2004: 19 sites, 93,000 jobs
20. Status of GRID3 Jobs
To do: extra A9 simulation, some digitization and some B1 pile-up. Note: also waiting for some B3 and B4 input evgen files from LCG.
21. Job Success Rate on GRID3
22. Grid3 successful jobs
23. Grid3 failed jobs
24. Grid3 failure rate
25. Jobs on NorduGrid
30 November 2004: 19 sites, 93,000 jobs
26. Jobs on NorduGrid
27. NorduGrid successful jobs
28. NorduGrid failed jobs
29. NorduGrid failure rate
30. NorduGrid failure reasons
31. Jobs Total
30 November 2004: 69 sites, 276,000 jobs
32. G4 Simulation
Physics channels only, 30 November 2004
33. Digitization
Physics channels only, 30 November 2004
34. Pile-up
Physics channels only, 30 November 2004
35. Summary (1)
- All DC2 operations have been done on the Grid
- Grid systems are not easy to use and debug
  - It's difficult to know where the problems are
- Production required more human resources than expected
  - DC1 in 2002 ran on non-Grid European sites with one production manager per site
  - DC2 in 2004 ran on LCG sites with 4-5 people for the central operation, plus the LCG support team
  - Grid3 has a production team
    - Should we generalize the concept?
  - DC2 on NorduGrid was run by 2 people
36. Summary (2)
- The current production system is not user-friendly
  - It was fragile at the beginning
  - It became more robust after several weeks of running and is stable now
  - A review is scheduled for mid-January 2005
- The schedule was driven by the availability (and robustness) of many different components (middleware, Production System, software, database)
  - All systems are under development and need to be stabilized
- Nevertheless
  - Phase I is over
  - Phase II is running, and the Tier-0 exercise will be repeated when we are in more stable conditions