Title: ATLAS DC2 Phase I
1. ATLAS DC2 Phase I
- ATLAS Software Week
- 6th December 2004
- Gilbert Poulard (CERN PH-ATC)
- on behalf of ATLAS DC Grid and Operations teams
2. ATLAS-DC2 operation
- Consider DC2 as a three-part operation
  - Part I: production of simulated data (July-November 2004)
    - Running on 3 Grids, worldwide
  - Part II: test of Tier-0 operation (November-December 2004)
    - Do in 10 days what should be done in 1 day when real data-taking starts
    - Input is Raw Data-like
    - Output (ESD, AOD) will be distributed to Tier-1s in real time for analysis
  - Part III: test of distributed analysis on the Grid (early 2005)
    - Access to event and non-event data from anywhere in the world, both in organized and chaotic ways
- Requests
  - Physics channels (10 million events)
  - Several million events for calibration (single particles and physics samples (di-jets))
3. DC2 Phase I: Data preparation
- DC2 Phase I (the chain is sketched after this list)
  - Part 1: event generation
    - Physics processes -> 4-momenta of particles
  - Part 2: detector simulation
    - Tracking of particles through the detector
    - Records interactions of particles with the sensitive elements of the detector
  - Part 3: pile-up and digitization
    - Pile-up: superposition of background events with the signal event
    - Digitization: response of the sensitive elements of the detector
    - Output, called byte-stream data, looks like Raw Data
- DC2 Phase II
  - Part 4: data transfer to CERN Tier-0
  - Part 5: event mixing
  - Part 6: Tier-0 exercise
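The Phase I chain above can be read as a fixed sequence of transformations. Below is a minimal sketch of that flow in Python; the stage names, data formats and dataset name are assumptions for illustration, not the actual ATLAS production transformations.

```python
# Sketch of the DC2 Phase I chain described above.
# Stage names, formats and the dataset name are assumptions, not the real transformations.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    consumes: str
    produces: str

PHASE_I_CHAIN = [
    Stage("event generation", "physics process definition", "generated events (4-momenta)"),
    Stage("detector simulation", "generated events (4-momenta)", "simulated hits"),
    Stage("pile-up + digitization", "simulated hits", "byte-stream-like raw data"),
]

def run_chain(dataset: str) -> None:
    """Trace a dataset through the Phase I stages, checking that each
    stage consumes exactly what the previous one produced."""
    current = PHASE_I_CHAIN[0].consumes
    for stage in PHASE_I_CHAIN:
        assert stage.consumes == current, "stages must be chained in order"
        print(f"{dataset}: {stage.name}: {stage.consumes} -> {stage.produces}")
        current = stage.produces

run_chain("dc2.sample")  # hypothetical dataset name
```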
4. DC2 Phase II
- Data preparation
  - Transfer of data to CERN (100K files, 25 TB)
  - Event mixing
    - 30 physics channels
    - Originally planned to produce ByteStream, but decided to use RDO (well tested)
- Tier-0 exercise
  - Reconstruction: ESD and AOD
    - Reconstruction from RDO
    - Creates ESD (Event Summary Data)
    - In a 2nd step produces AOD (Analysis Object Data) in 10 different streams and event collections
  - In parallel, distributes ESD and AOD to Tier-1s in real time (sketched below)
    - ESD: 2 Tier-1s
    - AOD: all Tier-1s
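The Tier-0 fan-out in the last bullets amounts to a simple distribution rule. The sketch below illustrates it with placeholder Tier-1 and file names; the real transfers went through the ATLAS data-management system, whose interface is not reproduced here.

```python
# Sketch of the ESD/AOD fan-out: ESD to 2 Tier-1s, AOD to all Tier-1s.
# Site names and the "replicate" step are placeholders, not the real Don Quijote calls.
TIER1_SITES = ["Tier1-A", "Tier1-B", "Tier1-C", "Tier1-D"]  # hypothetical list

def distribution_plan(esd_file: str, aod_file: str) -> dict:
    plan = {
        esd_file: TIER1_SITES[:2],    # ESD goes to 2 Tier-1s
        aod_file: list(TIER1_SITES),  # AOD goes to all Tier-1s
    }
    for replica, sites in plan.items():
        for site in sites:
            print(f"replicate {replica} -> {site}")  # real system: data-management service call
    return plan

distribution_plan("run.ESD.pool.root", "run.AOD.pool.root")  # illustrative file names
```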
5. DC2 Phase I
- Started in July and effectively completed
- On 3 Grids
  - LCG
    - Including some non-ATLAS sites
    - Using the LCG-Grid-Canada interface in production mode
      - 3 sites are accessible through this interface at TRIUMF: Uni. Victoria, Uni. Alberta and WestGrid (SFU/TRIUMF)
  - NorduGrid
    - Several Scandinavian supercomputer resources
    - Sites in Australia, Germany, Slovenia, Switzerland
  - Grid3
    - Also using computing resources that are not dedicated to ATLAS (e.g. US-CMS sites)
6. Grid3 participating sites
- Sep 04
- 30 sites, multi-VO
- shared resources
- 3000 CPUs (shared)
7. NorduGrid & Co. participating sites
- Totals
  - 7 countries
  - 22 sites
  - 3000 CPUs (about 600 dedicated)
  - 7 Storage Services (in RLS), plus a few more storage facilities
  - 12 TB
- 1 FTE (1-3 persons) in charge of production
- 2-3 executor instances
8. LCG-2
9. ATLAS Production system
[Diagram: architecture of the ATLAS production system. A common supervisor (Windmill, run as several "super" instances) takes job definitions from the production database (prodDB), uses the AMI metadata catalogue and the Don Quijote data-management system (dms), and communicates over SOAP/Jabber with Grid-specific executors: Lexor (LCG exe), Dulcinea (NG exe), Capone (G3 exe) and an LSF executor. Each Grid (LCG, NorduGrid, Grid3) has its own RLS file catalogue; LSF covers local batch resources.]
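To make the diagram concrete, here is a minimal sketch of the supervisor/executor split; the class and method names are invented for illustration, and the real Windmill, Lexor, Dulcinea and Capone interfaces and their SOAP/Jabber messaging are not reproduced.

```python
# Sketch of the supervisor/executor split from the diagram above.
# Class and method names are invented for illustration only.
class Executor:
    """One executor per Grid flavour (e.g. Lexor/LCG, Dulcinea/NorduGrid,
    Capone/Grid3, LSF executor)."""
    def __init__(self, grid: str):
        self.grid = grid

    def submit(self, job: dict) -> str:
        # A real executor would translate the job definition into the
        # Grid's own job description language and submit it there.
        return f"{self.grid}-job-{job['id']}"

class Supervisor:
    """Takes job definitions (as Windmill takes them from prodDB) and
    hands them to a Grid-specific executor."""
    def __init__(self, executor: Executor):
        self.executor = executor

    def run(self, pending_jobs):
        for job in pending_jobs:
            grid_id = self.executor.submit(job)  # in DC2 this exchange went over SOAP/Jabber
            print(f"job {job['id']} dispatched to {self.executor.grid} as {grid_id}")

Supervisor(Executor("LCG")).run([{"id": 1}, {"id": 2}])
```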
10. LCG dedicated resources (services)
- Initial underestimate of ATLAS needs for DC2
  - Only a UI/RB/BDII/DQ combo machine!
- Several service resources currently dedicated to ATLAS
  - 2 User Interfaces (lxb0725, lxb0726)
  - 2 Resource Brokers (lxb0728, lxb0729)
  - 1 MyProxy server (lxb0727)
  - ATLAS-BDII (load share) (lxb2005, lxb2011)
  - DQ server (lxn1190)
  - ATLAS-dedicated services across sites (IFIC, CNAF, Milano)
- Some of those resources were used to saturation
- Initially some latency in providing resources (coordination with the security team)
- Services (RB, BDII) have always been kept up to date with new patches/bug fixes
11. DC2 Phase I operation
- Main difficulties in the initial phase
  - For all Grids
    - Debugging the Production System
    - On LCG and Grid3, several instances of the Supervisor had to be run to better cope with the instability of the system. As a consequence the Production System was more difficult to handle.
  - LCG
    - Mis-configuration of sites; Information System (wrong or missing information); job submission and Resource Broker (leak due to EDG-WP1); job ranking
    - Data management (copy and register); stage-in/out problems
  - NorduGrid
    - Replica Location Service (Globus) hanging several times per day
    - Mis-configuration of sites
    - Access to the conditions database
  - Grid3
    - Data Management: RLS interactions
    - Software distribution problems
    - Load on gatekeepers
    - Some problems with certificates (causing jobs to abort)
- Good collaboration with the Grid teams to solve the problems
- On the other hand, the Athena framework and Geant4 were extremely stable (only a handful of crashes in >10 M events)
12. DC2 Phase I problems
- Non-initial problems (not always fixed)
  - NorduGrid
    - Access to the conditions database; site-specific accidents (e.g. storage elements died)
  - Grid3
    - Trying to avoid single points of failure (adding new servers)
    - Lack of storage management at some sites
  - LCG
    - Workload Management System
      - Resource Broker (slow at rejecting jobs when too busy)
      - Site ranking based on too few parameters
      - Uneven job distribution
      - Lack of normalized CPU units (jobs going to wrong queues; see the sketch after this list)
    - Data Management System
      - Failure to get input files
      - Failure to store and/or register output files
      - Output files correctly registered but data corrupted during transfer
  - For all
    - Slowness of the response of the Production Database
      - Problem that appeared after 6 weeks of running
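As a toy illustration of the missing CPU normalization noted above (the speed ratio, numbers and function are assumptions, not the actual Resource Broker matchmaking): a job's CPU requirement can only be compared with a queue's time limit after scaling by the site's CPU speed.

```python
# Toy illustration of the CPU-normalization problem: without scaling to a
# common reference, jobs are matched to queues that are too short for them.
def fits_queue(job_cpu_sec_on_ref: float, site_speed_ratio: float,
               queue_limit_sec: float) -> bool:
    """job_cpu_sec_on_ref: CPU time the job needs on a reference CPU.
    site_speed_ratio: this site's CPU speed relative to the reference."""
    needed_sec = job_cpu_sec_on_ref / site_speed_ratio
    return needed_sec <= queue_limit_sec

# Same job, same nominal queue limit, two sites of different speed:
print(fits_queue(40_000, site_speed_ratio=1.0, queue_limit_sec=43_200))  # True
print(fits_queue(40_000, site_speed_ratio=0.5, queue_limit_sec=43_200))  # False: needs 80,000 s
```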
13. ATLAS DC2 production
14. ATLAS DC2 production
15. Jobs on LCG
30 November 2004: 31 sites, 90,000 jobs
16. LCG successful jobs
17. LCG failed jobs
Production database also used for testing!
18. LCG failure rate
Production database also used for testing!
19. Jobs on Grid3
30 November 2004: 19 sites, 93,000 jobs
20. Status of GRID3 Jobs
To do: extra A9 simulation, some digitization and some B1 pile-up. Note: also waiting for some B3 and B4 input evgen files from LCG.
21. Job Success Rate on GRID3
22. Grid3 successful jobs
23. Grid3 failed jobs
24. Grid3 failure rate
25. Jobs on NorduGrid
30 November 2004: 19 sites, 93,000 jobs
26. Jobs on NorduGrid
27. NorduGrid successful jobs
28. NorduGrid failed jobs
29. NorduGrid failure rate
30. NorduGrid failure reasons
31. Jobs Total
30 November 2004: 69 sites, 276,000 jobs
32. G4 Simulation
Physics channels only, 30 November 2004
33. Digitization
Physics channels only, 30 November 2004
34. Pile-up
Physics channels only, 30 November 2004
35. Summary (1)
- All DC2 operations have been done on the Grid
- Grid systems are not easy to use and debug
  - It's difficult to know where the problems are
- Production required more human resources than expected
  - DC1 in 2002 ran on non-Grid European sites with one production manager per site
  - DC2 in 2004 ran on LCG sites with 4-5 people for the central operation, plus the LCG support team
  - Grid3 has a production team
    - Should we generalize the concept?
  - DC2 on NorduGrid was run by 2 people
36. Summary (2)
- The current production system is not user-friendly
  - It was fragile at the beginning
  - It became more robust after several weeks of running and is stable now
  - A review is scheduled for mid-January 2005
- The schedule was driven by the availability (and robustness) of many different components (middleware, Production System, software, database)
  - All systems are under development and need to be stabilized
- Nevertheless
  - Phase I is over
  - Phase II is running, and the Tier-0 exercise will be repeated when we are in more stable conditions