1
ATLAS DC2 Phase I
  • ATLAS Software Week
  • 6th December 2004
  • Gilbert Poulard (CERN PH-ATC)
  • on behalf of ATLAS DC Grid and Operations teams

2
ATLAS-DC2 operation
  • Consider DC2 as a three-part operation
  • part I: production of simulated data
    (July-November 2004)
  • running on 3 Grids, worldwide
  • part II: test of Tier-0 operation
    (November-December 2004)
  • Do in 10 days what should be done in 1 day when
    real data-taking starts
  • Input is Raw-Data-like
  • output (ESD/AOD) will be distributed to Tier-1s
    in real time for analysis
  • part III: test of distributed analysis on the
    Grid (Early 2005)
  • access to event and non-event data from anywhere
    in the world, both in organized and chaotic ways
  • Requests
  • Physics channels (10 million events)
  • Several million events for calibration
    (single particles and physics samples (di-jets))

3
DC2 Phase I: Data preparation
  • DC2 Phase I
  • Part 1: event generation
  • Physics processes -> 4-momenta of particles
  • Part 2: detector simulation
  • Tracking of particles through the detector
  • Records interactions of particles with sensitive
    elements of the detector
  • Part 3: pile-up and digitization
  • Pile-up: superposition of background events
    with the signal event
  • Digitization: response of the sensitive elements
    of the detector
  • Output, called byte-stream data, looks like
    Raw Data (the chain is sketched after this list)
  • DC2 Phase II
  • Part 4: data transfer to CERN Tier-0
  • Part 5: event mixing
  • Part 6: Tier-0 exercise
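
As a reading aid, the three Phase I steps form a simple chain in which
each step consumes the output of the previous one. The sketch below is
illustrative only: the Python function and file names are invented and
do not correspond to the actual ATLAS transformations.

    # Illustrative sketch of the DC2 Phase I chain (not the real ATLAS tools).
    def generate_events(channel, n_events):
        """Part 1: event generation -> 4-momenta of the generated particles."""
        return f"{channel}.evgen.pool.root"            # hypothetical file name

    def simulate_detector(evgen_file):
        """Part 2: track particles through the detector, record hits."""
        return evgen_file.replace(".evgen.", ".simul.")

    def pileup_and_digitize(hits_file, background_file):
        """Part 3: superimpose background events, emulate detector response."""
        return hits_file.replace(".simul.", ".bytestream.")  # Raw-Data-like output

    evgen = generate_events("dijets", n_events=1000)
    hits = simulate_detector(evgen)
    raw_like = pileup_and_digitize(hits, background_file="minbias.simul.pool.root")
    print("Raw-Data-like output:", raw_like)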

4
DC2-Phase II
  • Data preparation
  • Transfer of data to CERN (100K files, 25 TB)
  • Event mixing
  • 30 Physics channels
  • Originally planned to produce ByteStream, but
    decided to use RDO (well tested)
  • Tier-0 exercise
  • Reconstruction: ESD, AOD
  • Reconstruction from RDO
  • Creates ESD (Event Summary Data)
  • In a 2nd step produces AOD (Analysis Object Data)
    in 10 different streams and Event collections
  • In parallel, distributes ESD and AOD to Tier-1s in
    real time (the fan-out is sketched after this list)
  • ESD: 2 Tier-1s
  • AOD: all Tier-1s
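
The output fan-out of the Tier-0 exercise can be summarized in a few
lines. The sketch below is illustrative only: the Tier-1 names and the
file-naming scheme are placeholders, not the actual DC2 tooling.

    # Illustrative sketch of the Tier-0 reconstruction and distribution policy.
    TIER1S = ["T1_A", "T1_B", "T1_C", "T1_D"]          # hypothetical Tier-1 names

    def reconstruct(rdo_file):
        """Reconstruction from RDO: first ESD, then AOD in 10 streams."""
        esd = rdo_file.replace(".rdo.", ".esd.")
        aods = [esd.replace(".esd.", f".aod.stream{i}.") for i in range(10)]
        return esd, aods

    def distribute(esd, aods):
        """ESD goes to 2 Tier-1s; every AOD stream goes to all Tier-1s."""
        transfers = [(esd, t1) for t1 in TIER1S[:2]]
        transfers += [(aod, t1) for aod in aods for t1 in TIER1S]
        return transfers

    esd, aods = reconstruct("dijets.rdo.pool.root")
    for f, site in distribute(esd, aods):
        print("send", f, "->", site)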

5
DC2 Phase I
  • Started in July and effectively completed
  • On 3 Grids
  • LCG
  • Including some non-ATLAS sites
  • Using the LCG-Grid-Canada interface in production
    mode
  • 3 sites are accessible through this interface at
    TRIUMF: Uni. Victoria, Uni. Alberta and WestGrid
    (SFU/TRIUMF)
  • NorduGrid
  • Several Scandinavian super-computer resources
  • Sites in Australia, Germany, Slovenia,
    Switzerland
  • Grid3
  • Also using computing resources that are not
    dedicated to ATLAS (e.g. US-CMS sites)

6
Grid3 participating sites
  • Sep 04
  • 30 sites, multi-VO
  • shared resources
  • 3000 CPUs (shared)

7
NorduGrid & Co. participating sites
  • Totals
  • 7 countries
  • 22 sites
  • 3000 CPUs
  • 600 dedicated
  • 7 Storage Services (in RLS)
  • a few more storage facilities
  • 12 TB
  • 1 FTE (1-3 persons) in charge of production
  • 2-3 executor instances

8
LCG-2
9
ATLAS Production system
  [Diagram: ATLAS production system. Windmill supervisor ("super")
  instances talk over SOAP and Jabber to one executor per flavour -
  Lexor (LCG exe), Dulcinea (NG exe), Capone (G3 exe) and an LSF exe -
  and interact with the production database (prodDB), AMI and the
  Don Quijote data management system (dms); each Grid (LCG, NorduGrid,
  Grid3) has its own RLS catalogue.]
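
In essence this is a supervisor-executor pattern: each supervisor pulls
job definitions from the production database, hands them to a
Grid-specific executor for submission, and records the outputs. The
sketch below is a minimal illustration of that loop; all class and
method names are invented and are not the real Windmill, executor or
Don Quijote interfaces.

    # Minimal sketch of the supervisor/executor pattern (all names illustrative).
    class Executor:
        """One executor per flavour: Lexor (LCG), Dulcinea (NG), Capone (Grid3), LSF."""
        def __init__(self, grid):
            self.grid = grid
        def submit(self, job):
            # A real executor would translate the job into a Grid-specific submission.
            print(f"[{self.grid}] submitting {job['id']}")
            return {"status": "finished", "outputs": [job["id"] + ".pool.root"]}

    class Supervisor:
        """Pulls pending jobs from prodDB, dispatches them, registers the outputs."""
        def __init__(self, proddb, executor, catalogue):
            self.proddb, self.executor, self.catalogue = proddb, executor, catalogue
        def run_once(self):
            for job in self.proddb:
                result = self.executor.submit(job)
                if result["status"] == "finished":
                    for f in result["outputs"]:
                        self.catalogue[f] = self.executor.grid  # data-management role

    proddb = [{"id": "dc2.003456._00001"}, {"id": "dc2.003456._00002"}]
    catalogue = {}
    Supervisor(proddb, Executor("LCG"), catalogue).run_once()
    print(catalogue)
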
10
LCG dedicated resources (services)
  • Initial underestimate of ATLAS needs for DC2
  • Only a combined UI/RB/BDII/DQ machine!
  • Several service resources currently dedicated to
    ATLAS
  • 2 User Interfaces (lxb0725, lxb0726)
  • 2 Resource Brokers (lxb0728, lxb0729)
  • 1 MyProxy server (lxb0727)
  • ATLAS-BDII (load share) (lxb2005, lxb2011)
  • DQ server (lxn1190)
  • ATLAS dedicated services across sites (IFIC,
    CNAF, Milano)
  • Some of those resources were used to saturation
  • Initially some latency in providing resources
    (coordination with the security team)
  • Services (RB, BDII) have always been kept up to
    date with new patches/bug fixes

11
DC2 Phase I operation
  • Main difficulties in the initial phase
  • For all Grids
  • Debugging the Production System
  • On LCG and Grid3, several instances of the
    Supervisor had to be run to cope better with the
    instability of the system. As a consequence the
    Production System was more difficult to handle.
  • LCG
  • Mis-configuration of sites; Information System
    (wrong or missing information); job submission and
    Resource Broker (leak due to EDG-WP1); job ranking
  • Data management (copy & register); stage-in/out
    problems
  • NorduGrid
  • Replica Location Service (Globus) hanging several
    times per day
  • Mis-configuration of sites
  • Access to the conditions database
  • Grid3
  • Data Management - RLS interactions
  • Software distribution problems
  • Load on gatekeepers
  • Some problems with certificates (causing jobs to
    abort)
  • Good collaboration with Grid teams to solve the
    problems
  • On the other hand, the Athena framework and
    Geant4 were extremely stable (only a handful of
    crashes in >10 M events)

12
DC2 Phase I problems
  • Problems beyond the initial phase (not always fixed)
  • NorduGrid
  • Access to conditions database; site-specific
    accidents (e.g. storage elements died)
  • Grid3
  • Trying to avoid single points of failure (adding
    new servers)
  • Lack of storage management in some sites
  • LCG
  • Workload Management System
  • Resource Broker (slow to reject jobs if too
    busy)
  • Site ranking based on too few parameters
  • Uneven job distribution
  • Lack of normalized CPU units (jobs going to wrong
    queues)
  • Data management system (a defensive
    copy-and-register pattern is sketched after this
    list)
  • Failure to get input file
  • Failure to store and/or register output files
  • Correctly registered output files, but data
    corrupted during transfer
  • For all
  • Slow response of the Production Database (a
    problem that appeared after 6 weeks of running)
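
Several of the data-management failures listed above (output stored but
not registered, or registered but corrupted in transit) are the usual
motivation for verifying and retrying each copy-and-register step. The
sketch below shows that defensive pattern in generic terms; it is not
the Don Quijote implementation, and the copy and checksum callables are
placeholders for the real transfer tools.

    # Generic copy-and-verify-and-register pattern (illustrative, not Don Quijote).
    import hashlib
    import os

    def local_checksum(path):
        """MD5 of a local file, used to detect corruption during transfer."""
        h = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return h.hexdigest()

    def copy_and_register(src, dst, catalogue, copy, remote_checksum, retries=3):
        """Copy src to dst, verify the checksum, then register the replica.
        `copy` and `remote_checksum` stand in for the real transfer commands."""
        expected = local_checksum(src)
        for _ in range(retries):
            copy(src, dst)
            if remote_checksum(dst) == expected:      # register only verified replicas
                catalogue[os.path.basename(src)] = dst
                return True
        return False                                  # leave the job for a later retry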

13
ATLAS DC2 production
14
ATLAS DC2 production
15
Jobs on LCG
30 November 2004
31 sites, 90000 jobs
16
LCG successful jobs
17
LCG failed jobs
Production database also used for testing!
18
LCG failure rate
Production database also used for testing!
19
Jobs on Grid3
30 November 2004
19 sites, 93000 jobs
20
Status of GRID3 Jobs
To do: extra A9 simulation, some digitization
and some B1 pile-up. Note: also waiting for some
B3 and B4 input evgen files from LCG.
21
Job Success Rate on GRID3
22
Grid3 successful jobs
23
Grid3 failed jobs
24
Grid3 failure rate
25
Jobs on NorduGrid
30 November 2004
19 sites, 93000 jobs
26
Jobs on NorduGrid
27
NorduGrid successful Jobs
28
NorduGrid failed jobs
29
NorduGrid failure rate
30
NorduGrid failure reasons
31
Jobs Total
30 November 2004
69 sites, 276000 jobs
32
G4-Simulation
Physics channels only (30 November 2004)
33
Digitization
Physics channels only (30 November 2004)
34
Pile-up
Physics channels only (30 November 2004)
35
Summary (1)
  • All DC2 operations have been done on the Grid
  • Grid systems are not easy to use and debug
  • It's difficult to know where problems are
  • Production required more human resources than
    expected
  • DC1 in 2002 ran on non-Grid European sites with
    one production manager per site
  • DC2 in 2004 ran on LCG sites with 4-5 people for
    the central operation, plus the LCG support team
  • Grid3 has a production team
  • Should we generalize the concept?
  • DC2 on NorduGrid was run by 2 people

36
Summary (2)
  • Current production system is not user friendly
  • It was fragile at the beginning
  • It became more robust after several weeks of
    running and is stable now
  • A review is scheduled for mid-January 2005
  • Schedule was driven by the availability (and
    robustness) of many different components
    (Middleware, Production System, software,
    database)
  • All systems are under development and need to be
    stabilized
  • Nevertheless
  • Phase I is over
  • Phase II is running, and the Tier-0 exercise will
    be repeated when we are in more stable conditions