Title: DØ Computing Model
1. DØ Computing Model, Monte Carlo, Data Reprocessing
Gavin Davies, Imperial College London
DOSAR Workshop, São Paulo, September 2005
2. Outline
- Operational status
  - Globally we continue to do well
  - A view shared by the recent Run II Computing Review
- DØ computing model
  - Ongoing, long-established plan
- Production computing
  - Monte Carlo
  - Reprocessing of Run II data: 10⁹ events reprocessed on the grid, the largest HEP grid effort
- Looking forward
- Conclusions
3. Snapshot of Current Status
- Reconstruction keeping up with data taking
- Data handling is performing well
- Production computing is off-site and grid-based; it continues to grow and to work well
  - Over 75 million Monte Carlo events produced in the last year
  - Run IIa data set (10⁹ events) being reprocessed on the grid
- Analysis CPU power has been expanded
- Globally doing well
  - A view shared by the recent Run II Computing Review
4. Computing Model
- Started with distributed computing, with evolution to automated use of common tools/solutions on the grid (SAM-Grid) for all tasks
  - Scalable
- Not alone: a joint effort with others at FNAL and elsewhere, including the LHC
- 1997 original plan:
  - All Monte Carlo to be produced off-site
  - SAM to be used for all data handling, providing a data-grid
- Now: Monte Carlo and data reprocessing with SAM-Grid
- Next: other production tasks (e.g. fixing), then user analysis
- Use the concept of Regional Centres
  - DOSAR one of the pioneers
  - Builds local expertise
5. Reconstruction Release
- Periodically update the version of the reconstruction code
  - As new / more refined algorithms are developed
  - As understanding of the detector improves
  - Frequency of releases decreases with time
- One major release in the last year: p17
  - Basis for the current Monte Carlo (MC) and data reprocessing
- Benefits of p17:
  - Reco speed-up
  - Full calorimeter calibration
  - Fuller description of detector material
  - Use of zero-bias overlay for MC
- (More details: http://cdinternal.fnal.gov/RUNIIRev/runIIMP05.asp)
6. Data Handling - SAM
- SAM continues to perform well, providing a data-grid
  - 50 SAM sites worldwide
  - Over 2.5 PB (50B events) consumed in the last year
  - Up to 300 TB moved per month
  - Larger SAM cache solved tape-access issues (the cache idea is sketched below)
- Continued success of SAM shifters
  - Often remote collaborators
  - Form the first line of defense
- SAMTV monitors SAM and the SAM stations
http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
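To make the cache point concrete, here is a minimal Python sketch of a disk cache sitting in front of tape; the class and method names are invented for illustration and are not the real SAM interface:

```python
from collections import OrderedDict

class CacheStation:
    """Toy model (not SAM code): a station's disk cache in front of tape."""

    def __init__(self, capacity_bytes):
        self.cache = OrderedDict()   # filename -> size, kept in LRU order
        self.capacity = capacity_bytes
        self.used = 0
        self.tape_reads = 0          # each miss costs one slow tape access

    def fetch(self, filename, size):
        if filename in self.cache:                   # hit: serve from disk
            self.cache.move_to_end(filename)
            return f"/cache/{filename}"
        self.tape_reads += 1                         # miss: stage from tape
        while self.used + size > self.capacity and self.cache:
            _, old = self.cache.popitem(last=False)  # evict least recent
            self.used -= old
        self.cache[filename] = size
        self.used += size
        return f"/cache/{filename}"
```

With a larger capacity, fewer files are evicted between reuses, so `tape_reads` drops; that is the effect the slide reports.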
7. SAM-Grid
More than 10 DØ execution sites: http://samgrid.fnal.gov:8080/
SAM (data handling) + JIM (job submission and monitoring) = SAM-Grid
http://samgrid.fnal.gov:8080/list_of_schedulers.php
http://samgrid.fnal.gov:8080/list_of_resources.php
8. Remote Production Activities: Monte Carlo - I
- Over 75M events produced in the last year, at more than 10 sites
  - More than double last year's production
  - Vast majority on shared sites
  - DOSAR a major part of this
- SAM-Grid introduced in spring '04, becoming the default
  - Based on the request system and jobmanager-mc_runjob
  - MC software package retrieved via SAM, the same way as on the central farm
- Average production efficiency ~90%
  - Average inefficiency due to grid infrastructure: 1-5% (bookkeeping sketched below)
  - http://www-d0.fnal.gov/computing/grid/deployment-issues.html
- Continued move to common tools
  - DOSAR sites continue the move from McFarm to SAM-Grid (from '04)
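As an illustration of the efficiency bookkeeping above, a small Python sketch; the job counts are invented, and only the ~90% / 1-5% split mirrors the slide:

```python
# Illustrative bookkeeping only (numbers below are invented, not DØ data):
# how an overall production efficiency splits into grid-infrastructure
# losses and everything else (code crashes, site problems, ...).

def efficiency_report(submitted, grid_failures, other_failures):
    succeeded = submitted - grid_failures - other_failures
    return {
        "overall efficiency": succeeded / submitted,
        "grid-infrastructure loss": grid_failures / submitted,
        "other loss": other_failures / submitted,
    }

# Example: 10000 jobs, 3% lost to grid infrastructure, 7% to other causes,
# giving the ~90% overall / 1-5% grid-loss picture quoted on the slide.
print(efficiency_report(10_000, grid_failures=300, other_failures=700))
```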
9. Remote Production Activities: Monte Carlo - II
- Beyond just shared resources
- More than 17M events produced directly on LCG via submission from Nikhef
  - A good example of a remote site driving the development
- Similar momentum building on/for OSG
  - Two good site examples within the p17 reprocessing
10. Remote Production Activities: Reprocessing - I
- After significant improvements to the reconstruction, reprocess old data
- p14: winter 2003/04
  - 500M events, 100M remotely, from DST
  - Based around mc_runjob
  - Distributed computing rather than grid
- p17: end of March to October
  - x10 larger, i.e. 1000M events, 250 TB
  - Basically all remote
  - From raw data, i.e. use of DB proxy servers
  - SAM-Grid as the default (using mc_runjob)
  - ~3200 1 GHz PIIIs for 6 months (back-of-envelope check below)
- Massive activity - the largest grid activity in HEP
http://www-d0.fnal.gov/computing/reprocessing/p17/
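A rough consistency check of these numbers; the per-event cost is derived here, not stated on the slide:

```python
# Back-of-envelope check of the slide's numbers (a sketch only).

events = 1_000_000_000          # 10^9 events in the p17 reprocessing
cpus = 3200                     # ~3200 1 GHz PIII-equivalent CPUs
months = 6
seconds = months * 30 * 24 * 3600

cpu_seconds_available = cpus * seconds
implied_s_per_event = cpu_seconds_available / events
print(f"available: {cpu_seconds_available:.2e} CPU-seconds")
print(f"implied reconstruction cost: ~{implied_s_per_event:.0f} s/event "
      f"on a 1 GHz PIII (ignoring inefficiency and merging overhead)")
```

This comes out at roughly 50 CPU-seconds per event, i.e. the quoted farm size is consistent with reconstructing 10⁹ events in about six months.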
11. Reprocessing - II
- A grid job spawns many batch jobs (fan-out and merge sketched below)
- [Diagram: production and merging workflows]
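A minimal sketch of that fan-out and the merging step, with invented function and file names:

```python
# Hypothetical sketch: one grid job over a dataset fans out into many
# batch jobs, whose outputs are then merged into fewer large files.

def split_into_batch_jobs(files, files_per_job):
    """One grid job -> many batch jobs, each processing a file slice."""
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]

def merge(outputs, outputs_per_merge):
    """Merging step: combine many small outputs into fewer large files."""
    return [f"merged({', '.join(chunk)})"
            for chunk in (outputs[i:i + outputs_per_merge]
                          for i in range(0, len(outputs), outputs_per_merge))]

dataset = [f"raw_{n:04d}.evpack" for n in range(10)]
jobs = split_into_batch_jobs(dataset, files_per_job=3)   # 4 batch jobs
outputs = [f"tmb_{n:04d}" for n in range(10)]
print(len(jobs), "batch jobs;", merge(outputs, outputs_per_merge=5))
```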
12. Reprocessing - III
- SAM-Grid provides:
  - A common environment and operation scripts at each site
  - Effective book-keeping
  - SAM avoids data duplication and defines recovery jobs (idea sketched below)
  - JIM's XML database used to ease bug tracing
- Tough deploying a product that is still evolving, with limited manpower, to new sites (we are a running experiment)
  - Very significant improvements in JIM (scalability) during this period
- Certification of sites - need to check:
  - SAM-Grid vs usual production
  - Remote sites vs central site
  - Merged vs unmerged files
- [Plot: FNAL vs SPRACE certification comparison]
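The recovery-job idea in miniature (helper names invented; the real SAM bookkeeping is richer):

```python
# Compare the files a project was asked to process with the files
# recorded as successfully consumed, and define a recovery job over the
# difference. Processing only the difference, rather than resubmitting
# everything, is also what avoids duplicated output.

def define_recovery_job(requested_files, consumed_files):
    remaining = set(requested_files) - set(consumed_files)
    return sorted(remaining)            # input list for the recovery job

requested = ["f001", "f002", "f003", "f004", "f005"]
consumed = ["f001", "f003", "f004"]     # f002, f005 failed or were lost
print(define_recovery_job(requested, consumed))  # ['f002', 'f005']
```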
13. Reprocessing - IV
http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi
- Monitoring (illustration)
  - Overall efficiency, speed, or by site
- Status into the end-game
  - Data sets all allocated; moving to clean-up
  - Must now push on the Monte Carlo
- 855M events done
14. SAM-Grid Interoperability
- Need access to greater resources as data sets grow
- Ongoing programme on LCG and OSG interoperability
- Step 1 (co-existence): use shared resources with a SAM-Grid head-node
  - Widely done for both reprocessing and MC
  - OSG co-existence shown for data reprocessing
- Step 2: SAMGrid-LCG interface
  - SAM does the data handling; JIM, the job submission
  - Basically a forwarding mechanism (sketched below)
  - Prototype established at IN2P3/Wuppertal
  - Extending to production level
- OSG activity increasing, building on the LCG experience
- Team work between core developers / sites
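A hypothetical sketch of the forwarding idea, with all names invented: the user submits once via SAM-Grid, a forwarding node translates the job for the target grid, and SAM still handles the data in every case:

```python
# Not the real SAMGrid-LCG interface; a toy illustration of forwarding.

def forward_job(job, target_grid):
    """Translate a SAM-Grid job description for the target grid."""
    translators = {
        "samgrid": lambda j: {"scheduler": "jim", **j},
        "lcg": lambda j: {"jdl": f"Executable = \"{j['executable']}\";"},
        "osg": lambda j: {"condor_submit": {"executable": j["executable"]}},
    }
    submission = translators[target_grid](job)
    # Data handling is unchanged: input files still come via SAM.
    submission["input_dataset"] = job["sam_dataset"]
    return submission

job = {"executable": "d0reco", "sam_dataset": "p17-reprocess-chunk-42"}
print(forward_job(job, "lcg"))
```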
15. Looking Forward
- Increased data sets require increased resources for MC, reprocessing, etc.
- The route to these is increased use of the grid and common tools
- Have an ongoing joint programme, but work to do:
  - Continue development of SAM-Grid
    - Automated production job submission by shifters
  - Deployment team
    - Bring in new sites in a manpower-efficient manner
    - The benefit of a new site goes well beyond a CPU count; we appreciate and value this
  - Full interoperability
    - Ability to access all shared resources efficiently
- Additional resources for the above recommended by the Taskforce
16. Conclusions
- The computing model continues to be successful
  - Based around grid-like computing, using common tools
- A key part of this is the production computing: MC and reprocessing
- Significant advances this year
  - Continued migration to common tools
  - Progress on interoperability, both LCG and OSG
    - Two reprocessing sites operating under OSG
- p17 reprocessing a tremendous success
  - Strongly praised by the Review Committee
  - DOSAR a major part of this
  - More general contribution also strongly acknowledged
- Thank you
- Let's all keep up the good work
17. Back-up
18. Terms
- Tevatron
  - Approximately equivalent challenge to the LHC in today's money
  - Running experiments
- SAM (Sequential Access to Metadata)
  - Well-developed metadata and distributed data replication system
  - Originally developed by DØ and FNAL-CD
- JIM (Job Information and Monitoring)
  - Handles job submission and monitoring (all but data handling)
  - SAM + JIM = SAM-Grid, a computational grid
- Tools
  - Runjob - handles job workflow management
  - dØtools - user interface for job submission
  - dØrte - specification of runtime needs
19. Reminder of Data Flow
- Data acquisition (raw data in evpack format)
  - Currently limited to a 50 Hz Level-3 accept rate
  - Request to increase to 100 Hz, as planned for Run IIb (see later)
- Reconstruction (tmb/DST in evpack format)
  - Additional information added to the tmb, so only the tmb is kept (DST format stopped)
  - Sufficient for complex corrections, including track fitting
- Fixing (tmb in evpack format)
  - Improvements / corrections coming after the cut of a production release
  - Centrally performed
- Skimming (tmb in evpack format)
  - Centralised event streaming based on reconstructed physics objects
  - Selection procedures regularly improved
- Analysis (output: ROOT histograms)
  - Common root-based Analysis Format (CAF) introduced in the last year
  - tmb format remains (full chain restated below)
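The same chain, restated compactly; the stages and formats are taken from the bullets above, while the layout itself is only illustrative:

```python
# Compact restatement of the DØ processing chain described on this slide.
DATA_FLOW = [
    # stage,             output format
    ("data acquisition", "raw, evpack"),
    ("reconstruction",   "tmb, evpack"),
    ("fixing",           "tmb, evpack"),
    ("skimming",         "tmb, evpack"),
    ("analysis",         "ROOT histograms (CAF)"),
]

for stage, fmt in DATA_FLOW:
    print(f"{stage:17s} -> {fmt}")
```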
20. Remote Production Activities: Monte Carlo
21. The Good and Bad of the Grid
- The only viable way to go
  - Increase in resources (CPU and potentially manpower)
  - Work with, not against, the LHC
  - Still limited
- BUT
  - Need to conform to standards: dependence on others
  - Long-term solutions must be favoured over short-term idiosyncratic convenience
    - Or we won't be able to maintain adequate resources
  - Must maintain a production-level service (papers) while increasing functionality
  - As transparent as possible to non-experts