DØ Computing Model - Presentation Transcript
1
DØ Computing Model: Monte Carlo & Data Reprocessing
Gavin Davies, Imperial College London
DOSAR Workshop, São Paulo, September 2005
2
Outline
  • Operational status
    • Globally we continue to do well
    • A view shared by the recent Run II Computing Review
  • DØ computing model
    • An ongoing, long-established plan
  • Production computing
    • Monte Carlo
    • Reprocessing of Run II data
    • 10⁹ events reprocessed on the grid - the largest
      HEP grid effort
  • Looking forward
  • Conclusions

3
Snapshot of Current Status
  • Reconstruction keeping up with data taking
  • Data handling is performing well
  • Production computing is off-site and grid based;
    it continues to grow and to work well
  • Over 75 million Monte Carlo events produced in
    the last year
  • Run IIa data set (10⁹ events) being reprocessed
    on the grid
  • Analysis CPU power has been expanded
  • Globally we are doing well
  • A view shared by the recent Run II Computing Review

4
Computing Model
  • Started with distributed computing, with evolution
    to automated use of common tools/solutions on the
    grid (SAM-Grid) for all tasks
  • Scalable
  • Not alone: a joint effort with others at FNAL and
    elsewhere, including the LHC experiments
  • Original plan (1997):
    • All Monte Carlo to be produced off-site
    • SAM to be used for all data handling, providing a
      data-grid
  • Now: Monte Carlo and data reprocessing with
    SAM-Grid
  • Next: other production tasks, e.g. fixing, and then
    user analysis
  • Use concept of Regional Centres
    • DOSAR one of the pioneers
    • Builds local expertise

5
Reconstruction Release
  • Periodically update the version of the
    reconstruction code
    • As new / more refined algorithms are developed
    • As understanding of the detector improves
  • Frequency of releases decreases with time
  • One major release in the last year: p17
    • Basis for the current Monte Carlo (MC) and data
      reprocessing
  • Benefits of p17:
    • Reco speed-up
    • Full calorimeter calibration
    • Fuller description of detector material
    • Use of zero-bias overlay for MC (sketched below)
  • (More details: http://cdinternal.fnal.gov/RUNIIRev/runIIMP05.asp)
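The zero-bias overlay listed above pairs each simulated event with a randomly triggered (zero-bias) data event, so that noise and pile-up come from real running conditions rather than from simulation. A minimal Python sketch of the idea, with hypothetical event structures (the real overlay operates on DØ evpack records):

```python
import random

def overlay_zero_bias(mc_event, zero_bias_events):
    """Overlay a randomly chosen zero-bias data event on a simulated event.

    Events are modelled here as {cell_id: energy} calorimeter maps; the
    real DØ overlay works on full evpack records. Illustrative only.
    """
    zb_event = random.choice(zero_bias_events)
    overlaid = dict(mc_event)                # start from the simulated deposits
    for cell, energy in zb_event.items():    # add real noise / pile-up energy
        overlaid[cell] = overlaid.get(cell, 0.0) + energy
    return overlaid

# Example: one MC event overlaid with one of two recorded zero-bias events.
mc = {101: 5.2, 102: 1.1}
zero_bias = [{101: 0.3, 205: 0.7}, {102: 0.1}]
print(overlay_zero_bias(mc, zero_bias))
```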

6
Data Handling - SAM
  • SAM continues to perform well, providing a
    data-grid (see the toy sketch below)
  • 50 SAM sites worldwide
  • Over 2.5 PB (50B events) consumed in the last year
  • Up to 300 TB moved per month
  • Larger SAM cache solved tape access issues
  • Continued success of SAM shifters
    • Often remote collaborators
    • Form the 1st line of defense
  • SAMTV monitors SAM and the SAM stations

http://d0db-prd.fnal.gov/sm_local/SamAtAGlance/
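As a rough illustration of what a SAM station provides: files are resolved by dataset name through a metadata catalogue and served from a local cache, staging in from mass storage only on a miss. The toy model below uses invented class and method names; it is not the SAM API:

```python
# Toy model of SAM-style dataset lookup and cached delivery.
# All names here are illustrative, not the real SAM interface.
class ToySamStation:
    def __init__(self, catalogue):
        self.catalogue = catalogue   # dataset name -> list of logical files
        self.cache = {}              # logical file -> local path

    def get_dataset(self, name):
        """Resolve a dataset by metadata name and deliver each file,
        fetching from mass storage only on a cache miss."""
        for lfn in self.catalogue[name]:
            if lfn not in self.cache:              # cache miss: stage in
                self.cache[lfn] = f"/local/cache/{lfn}"
            yield self.cache[lfn]

station = ToySamStation(
    {"p17-reprocess-input": ["raw_0001.evpack", "raw_0002.evpack"]})
for path in station.get_dataset("p17-reprocess-input"):
    print(path)
```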
7
SAMGrid
More than 10 DØ execution sites: http://samgrid.fnal.gov:8080/
SAM (data handling) + JIM (job submission and monitoring): SAM + JIM → SAM-Grid
http://samgrid.fnal.gov:8080/list_of_schedulers.php
http://samgrid.fnal.gov:8080/list_of_resources.php
8
Remote Production Activities: Monte Carlo - I
  • Over 75M events produced in the last year, at more
    than 10 sites
    • More than double last year's production
  • Vast majority on shared sites
    • DOSAR a major part of this
  • SAM-Grid introduced in spring '04, becoming the
    default
    • Based on the request system and jobmanager-mc_runjob
    • MC software package retrieved via SAM in the same
      way at every site, including the central farm
  • Average production efficiency 90%
    • Average inefficiency due to grid infrastructure:
      1-5%
    • http://www-d0.fnal.gov/computing/grid/deployment-issues.html
  • Continued move to common tools
    • DOSAR sites continue the move from McFarm to SAM-Grid

9
Remote Production Activities: Monte Carlo - II
  • Beyond just shared resources
  • More than 17M events produced directly on LCG
    via submission from Nikhef
    • A good example of a remote site driving the
      development
  • Similar momentum building on/for OSG
    • Two good site examples within the p17 reprocessing

10
Remote Production Activities: Reprocessing - I
  • After significant improvements to the
    reconstruction, reprocess the old data
  • p14: Winter 2003/04
    • 500M events, 100M remotely, from DST
    • Based around mc_runjob
    • Distributed computing rather than Grid
  • p17: end of March → October
    • ×10 larger, i.e. 1000M events, 250 TB
    • Basically all remote
    • From raw data, i.e. use of DB proxy servers
    • SAM-Grid as default (using mc_runjob)
    • 3200 1 GHz PIIIs for 6 months
  • Massive activity - the largest grid activity in HEP

http://www-d0.fnal.gov/computing/reprocessing/p17/
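(A rough sanity check on these numbers: 10⁹ events in ~6 months on 3200 1 GHz CPUs is about 5×10¹⁰ CPU-seconds, i.e. roughly 50 CPU-seconds per event.)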
11
Reprocessing - II
Each grid job spawns many batch jobs, in two stages:
production, then merging (see the schematic sketch below).
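A schematic Python sketch of this fan-out, with invented helper names (the real splitting and merging are handled by SAM-Grid and mc_runjob):

```python
def split_into_batch_jobs(input_files, files_per_job=10):
    """One grid-level request fans out into many batch jobs (production),
    whose outputs are then combined by a merging job. Illustrative only."""
    return [input_files[i:i + files_per_job]
            for i in range(0, len(input_files), files_per_job)]

def run_production(batch):
    # Stand-in for running the reconstruction on each file of the batch.
    return [f.replace("raw", "reco") for f in batch]

inputs = [f"raw_{n:04d}.evpack" for n in range(25)]
outputs = [out for batch in split_into_batch_jobs(inputs)
           for out in run_production(batch)]
merged = f"merged({len(outputs)} files)"    # merging stage
print(len(split_into_batch_jobs(inputs)), "batch jobs ->", merged)
```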
12
Reprocessing - III
  • SAMGrid provides:
    • A common environment and operation scripts at each
      site
    • Effective book-keeping
    • SAM avoids data duplication and defines recovery
      jobs (sketched below)
    • JIM's XML-DB used to ease bug tracing
  • Deploying a product that is still evolving, with
    limited manpower, to new sites is tough (we are a
    running experiment)
    • Very significant improvements in JIM
      (scalability) during this period
  • Certification of sites - need to check:
    • SAMGrid vs usual production
    • Remote sites vs central site
    • Merged vs unmerged files

(Certification comparison plots: FNAL vs SPRACE)
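In principle, SAM's book-keeping reduces a recovery job to a set difference: the input files of a dataset with no successfully declared output are resubmitted. A toy sketch of that logic, with invented file-naming conventions:

```python
def define_recovery_job(dataset_files, declared_outputs):
    """Recovery job = dataset files with no successfully declared output.
    A toy version of SAM's book-keeping; names are illustrative."""
    done = {out.replace("reco", "raw") for out in declared_outputs}
    return sorted(set(dataset_files) - done)

dataset = ["raw_0001.evpack", "raw_0002.evpack", "raw_0003.evpack"]
declared = ["reco_0001.evpack", "reco_0003.evpack"]
print(define_recovery_job(dataset, declared))   # -> ['raw_0002.evpack']
```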
13
Reprocessing - IV
http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi
  • Monitoring (illustration)
    • Overall efficiency, speed, or by site (see the
      sketch below)
  • Status: into the end-game
    • Data sets all allocated; moving to clean-up
    • Must now push on the Monte Carlo

(Status plot: 855M events done)
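The per-site efficiency shown by such plots is simple book-keeping over job records; a sketch of the calculation, assuming a hypothetical (site, success) record format:

```python
from collections import defaultdict

def efficiency_by_site(job_records):
    """Fraction of successful jobs per site, as in the SAM-Grid
    efficiency plots. The record format here is hypothetical."""
    totals, successes = defaultdict(int), defaultdict(int)
    for site, ok in job_records:
        totals[site] += 1
        successes[site] += ok
    return {site: successes[site] / totals[site] for site in totals}

records = [("SPRACE", 1), ("SPRACE", 1), ("SPRACE", 0), ("GridKa", 1)]
print(efficiency_by_site(records))   # e.g. {'SPRACE': 0.67, 'GridKa': 1.0}
```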
14
SAM-Grid Interoperability
  • Need access to greater resources as data sets
    grow
  • Ongoing programme on LCG and OSG interoperability
  • Step 1 (co-existence): use shared resources with a
    SAM-Grid head-node
    • Widely done for both reprocessing and MC
    • OSG co-existence shown for data reprocessing
  • Step 2: a SAMGrid-LCG interface
    • SAM does the data handling, JIM the job submission
    • Basically a forwarding mechanism (sketched below)
    • Prototype established at IN2P3/Wuppertal
    • Extending to production level
  • OSG activity increasing, building on the LCG
    experience
  • Team work between core developers / sites
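A schematic sketch of such a forwarding node: SAM stays responsible for the data handling, while the job itself is re-submitted to the target grid's submission system. All names below are invented; this is not the real SAMGrid-LCG interface code:

```python
class ForwardingNode:
    """Toy SAMGrid->LCG forwarding node: SAM handles the data,
    the job itself is re-submitted to the target grid. Illustrative only."""
    def __init__(self, stage_dataset, submit_to_lcg):
        self.stage_dataset = stage_dataset   # SAM side: deliver input files
        self.submit_to_lcg = submit_to_lcg   # LCG side: run the job

    def forward(self, job):
        files = self.stage_dataset(job["dataset"])           # SAM data handling
        return self.submit_to_lcg(job["executable"], files)  # LCG execution

node = ForwardingNode(
    stage_dataset=lambda ds: [f"{ds}/file_{n}" for n in range(3)],
    submit_to_lcg=lambda exe, files: f"LCG job: {exe} on {len(files)} files",
)
print(node.forward({"dataset": "p17-mc-request-42", "executable": "d0reco"}))
```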

15
Looking Forward
  • Increased data sets require increased resources
    for MC, reprocessing, etc.
    • The route to these is increased use of the grid
      and of common tools
  • Have an ongoing joint programme, but work to do…
  • Continue development of SAM-Grid
    • Automated production job submission by shifters
  • Deployment team
    • Bring in new sites in a manpower-efficient manner
    • The benefit of a new site goes well beyond a CPU
      count; we appreciate and value this
  • Full interoperability
    • Ability to access all shared resources efficiently
  • Additional resources for the above recommended by
    the Taskforce

16
Conclusions
  • The computing model continues to be successful
    • Based around grid-like computing, using common
      tools
    • A key part of this is the production computing:
      MC and reprocessing
  • Significant advances this year
    • Continued migration to common tools
    • Progress on interoperability, with both LCG and OSG
    • Two reprocessing sites operating under OSG
  • p17 reprocessing a tremendous success
    • Strongly praised by the Review Committee
    • DOSAR a major part of this
    • Its more general contribution also strongly
      acknowledged
  • Thank you
  • Let's all keep up the good work

17
Back-up
18
Terms
  • Tevatron
    • Approximately equivalent challenge to the LHC in
      today's money
    • Running experiments
  • SAM (Sequential Access to Metadata)
    • Well-developed metadata and distributed data
      replication system
    • Originally developed by DØ and FNAL-CD
  • JIM (Job Information and Monitoring)
    • Handles job submission and monitoring (all but
      data handling)
    • SAM + JIM → SAM-Grid, a computational grid
  • Tools
    • Runjob - handles job workflow management
    • dØtools - user interface for job submission
    • dØrte - specification of runtime needs

19
Reminder of Data Flow
  • Data acquisition (raw data in evpack format)
    • Currently limited to a 50 Hz Level-3 accept rate
    • Requested increase to 100 Hz, as planned for Run
      IIb (see later)
  • Reconstruction (tmb/DST in evpack format)
    • Additional information added to the tmb (the DST
      format has been stopped)
    • Sufficient for complex corrections, including
      track fitting
  • Fixing (tmb in evpack format)
    • Improvements / corrections coming after the cut of
      a production release
    • Centrally performed
  • Skimming (tmb in evpack format)
    • Centralised event streaming based on
      reconstructed physics objects (see the toy sketch
      below)
    • Selection procedures regularly improved
  • Analysis (output: ROOT histograms)
    • Common ROOT-based Analysis Format (CAF)
      introduced in the last year
    • The tmb format remains
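A toy version of such object-based streaming, routing each event into every skim whose selection it passes; the stream names and selection criteria below are invented for illustration:

```python
def skim(events, streams):
    """Route each event into every stream whose selection it passes,
    mimicking centralised object-based streaming. Criteria are invented."""
    out = {name: [] for name in streams}
    for event in events:
        for name, selects in streams.items():
            if selects(event):
                out[name].append(event)
    return out

streams = {
    "2MU": lambda e: e["n_muons"] >= 2,                    # di-muon skim
    "EMqcd": lambda e: e["n_em"] >= 1 and e["met"] < 10.0, # EM + low MET skim
}
events = [{"n_muons": 2, "n_em": 0, "met": 5.0},
          {"n_muons": 0, "n_em": 1, "met": 3.0}]
print({name: len(passed) for name, passed in skim(events, streams).items()})
```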

20
Remote Production Activities: Monte Carlo
21
The Good and Bad of the Grid
  • The only viable way to go
    • Increase in resources (CPU and potentially
      manpower)
    • Work with, not against, the LHC
    • Still limited
  • BUT
    • Need to conform to standards → dependence on
      others…
    • Long-term solutions must be favoured over
      short-term idiosyncratic convenience
      • Or we won't be able to maintain adequate resources
    • Must maintain a production-level service (papers),
      while increasing functionality
    • As transparent as possible to the non-expert