Title: CMS Distributed Data Analysis Challenges
1. CMS Distributed Data Analysis Challenges
- Claudio Grandi
- on behalf of the CMS Collaboration
2. Outline
- CMS Computing Environment
- CMS Computing Milestones
- OCTOPUS CMS Production System
- 2002 Data productions
- 2003 Pre-Challenge production (PCP03)
- PCP03 on grid
- 2004 Data Challenge (DC04)
- Summary
3. CMS Computing Environment
4. CMS computing context
- LHC will produce 40 million bunch crossings per second in the CMS detector (1000 TB/s)
- The on-line system will reduce the rate to 100 events per second (100 MB/s raw data)
- Level-1 trigger hardware
- High-level trigger on-line farm
- Raw data (1 MB/evt) will be
- archived on persistent storage (1 PB/year; see the rate check after this slide)
- reconstructed to DST (0.5 MB/evt) and AOD (20 kB/evt)
- Reconstructed data (and part of the raw data) will be
- distributed to computing centers of collaborating institutes
- analyzed by physicists at their own institutes
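The figures above fit together; as a quick illustration, the Python sketch below recomputes them assuming roughly 10^7 seconds of effective data taking per year (an assumption of this note, not a number from the slide).

```python
# Back-of-the-envelope check of the HLT output rate and yearly raw-data volume.
# Assumption: ~1e7 s of effective data taking per year (not stated on the slide).

HLT_RATE_HZ = 100            # events per second after the high-level trigger
RAW_EVENT_MB = 1.0           # raw event size in MB
LIVE_SECONDS_PER_YEAR = 1e7  # assumed effective data-taking time

raw_rate_mb_s = HLT_RATE_HZ * RAW_EVENT_MB                    # -> 100 MB/s
raw_volume_tb = raw_rate_mb_s * LIVE_SECONDS_PER_YEAR / 1e6   # MB -> TB

print(f"raw data rate : {raw_rate_mb_s:.0f} MB/s")
print(f"raw data/year : {raw_volume_tb:.0f} TB (~1 PB)")
```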
5. CMS Data Production at LHC
- 40 MHz (1000 TB/s) at the detector
- Level-1 Trigger: 75 kHz (50 GB/s)
- High Level Trigger: 100 Hz (100 MB/s)
- Data recording and offline analysis
6. CMS Distributed Computing Model
[Diagram: tiered computing model. The Online System at the experiment produces ~1 PB/s of raw detector data and delivers 100-1500 MB/s to the CERN Tier-0/Tier-1 center (PBs of disk, tape robot). Tier-1 centers (FNAL, IN2P3, INFN, RAL) connect at 2.5-10 Gbps; Tier-2 centers connect at 2.5-10 Gbps; Tier-3 institutes hold physics data caches; Tier-4 workstations connect at 0.1 to 10 Gbps.]
7. CMS software for Data Simulation
- Event Generation
- Pythia and other generators
- Generally Fortran programs. Produce N-tuple files (HEPEVT format)
- Detector simulation
- CMSIM (uses GEANT-3)
- Fortran program. Produces Formatted Zebra (FZ) files from N-tuples
- OSCAR (uses GEANT-4 and the CMS COBRA framework)
- C++ program. Produces POOL files (hits) from N-tuples
- Digitization (DAQ simulation)
- ORCA (uses the CMS COBRA framework)
- C++ program. Produces POOL files (digis) from hits POOL files or FZ
- Trigger simulation
- ORCA
- Reads digis POOL files
- Normally run as part of the reconstruction phase
8. CMS software for Data Analysis
- Reconstruction
- ORCA
- Produces POOL files (DST and AOD) from hits or digis POOL files
- Analysis
- ORCA
- Reads POOL files in (hits, digis,) DST, AOD formats
- IGUANA (uses ORCA and OSCAR as back-end)
- Visualization program (event display, statistical analysis)
9. CMS software chain
[Diagram: Pythia and other generators produce HEPEVT N-tuples; CMSIM (GEANT3) turns them into Zebra files with hits, which the ORCA/COBRA hit formatter loads into the hits database (POOL); OSCAR/COBRA (GEANT4) writes hits to POOL directly; ORCA/COBRA digitization merges signal and pile-up into the digis database (POOL); ORCA reconstruction or user analysis reads the POOL databases and produces N-tuples or ROOT files; IGUANA provides interactive analysis.]
10. CMS Computing Milestones
11. CMS computing milestones
- DAQ TDR (Technical Design Report), Spring 2002: data production, software baselining
- Computing Core Software TDR: 2003 data production (PCP04), 2004 Data Challenge (DC04)
- Physics TDR: 2004/05 data production (DC05), data analysis for the Physics TDR
- Readiness Review: 2005 data production (PCP06), 2006 Data Challenge (DC06)
- Commissioning
12. Size of CMS Data Challenges
- 1999: 1 TB, 1 month, 1 person
- 2000-2001: 27 TB, 12 months, 30 persons
- 2002: 20 TB, 2 months, 30 persons
- 2003: 175 TB, 6 months, <30 persons
13. World-wide Distributed Productions
[Map: locations of CMS Production Regional Centres and CMS Distributed Production Regional Centres.]
14. CMS Computing Challenges
- CMS computing challenges include
- production of simulated data for studies on
- Detector design
- Trigger and DAQ design and validation
- Physics system setup
- definition and set-up of analysis infrastructure
- definition of computing infrastructure
- validation of computing model
- Distributed system
- Increasing size and complexity
- Tied to other CMS activities
- provide computing support for all CMS activities
15. OCTOPUS: CMS Production System
16. OCTOPUS Data Production System
[Diagram: a physics group asks for a new dataset; the Production Manager defines assignments in the RefDB; a Site Manager starts an assignment with McRunjob (CMSProd plug-in), which produces shell scripts for the Local Batch Manager; data-level queries go to the RefDB and job-level queries to the BOSS DB.]
17. Remote connections to databases
[Diagram: on the worker node a job wrapper instruments the user job; job input and output pass through a journal writer into a journal catalog; a remote or asynchronous updater propagates the journal entries to the metadata DB, either directly from the worker node or via the User Interface.]
- Metadata DBs are RLS/POOL, RefDB, BOSS DB
18. Job production
- MCRunJob (a schematic plug-in interface is sketched after this slide)
- Modular: provides plug-ins for
- reading from RefDB
- reading from a simple GUI
- submitting to a local resource manager
- submitting to DAGMan/Condor-G (MOP)
- submitting to the EDG/LCG scheduler
- producing derivations in the Chimera Virtual Data Catalogue
- Runs on the user (e.g. site manager) host
- Also defines the sandboxes needed by the job
- If needed, the specific submission plug-in takes care of
- preparing the XML POOL catalogue with input file information
- moving the sandbox files to the worker nodes
- CMSProd
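To make the plug-in idea concrete, here is a minimal sketch of what a modular submission layer could look like. The class and method names (SubmissionPlugin, prepare_sandbox, submit) and the use of qsub are purely illustrative assumptions and do not reproduce the real McRunJob or CMSProd interfaces.

```python
# Hypothetical plug-in scheme in the spirit of the one described above.
# One plug-in per back-end: local batch, DAGMan/Condor-G, EDG/LCG scheduler, ...

import subprocess
from abc import ABC, abstractmethod


class SubmissionPlugin(ABC):
    """Common interface every submission back-end has to implement."""

    @abstractmethod
    def prepare_sandbox(self, job_spec: dict) -> list[str]:
        """Return the list of files that must be shipped with the job."""

    @abstractmethod
    def submit(self, job_spec: dict) -> str:
        """Submit the job and return a back-end specific job identifier."""


class LocalBatchPlugin(SubmissionPlugin):
    def prepare_sandbox(self, job_spec):
        # executable wrapper plus the XML POOL catalogue describing the input files
        return [job_spec["wrapper"], job_spec["pool_xml_catalog"]]

    def submit(self, job_spec):
        # hand the wrapper script to a local resource manager (here: a PBS-style qsub)
        out = subprocess.run(["qsub", job_spec["wrapper"]],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
```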
19. Job Metadata management
- Job parameters that describe the running status of a job are stored in a dedicated database
- when did the job start?
- is it finished?
- but also
- how many events has it produced so far?
- BOSS is a CMS-developed system that does this by extracting the information from the job's standard input/output/error streams (a minimal sketch of the idea follows this slide)
- The remote updater is based on MySQL
- Remote updaters are now being developed based on
- R-GMA (still has scalability problems)
- Clarens (just started)
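The sketch below illustrates the general idea of this kind of job instrumentation: a wrapper scans the job's output stream with user-supplied regular expressions and turns matches into database updates. The filter patterns, table and column names are hypothetical; this is not the BOSS code or schema.

```python
# Minimal sketch of BOSS-style job instrumentation (illustrative only).

import re
import subprocess
import sys

# Hypothetical filters: map a DB column to a pattern found in the job's stdout.
FILTERS = {
    "events_done": re.compile(r"processed event\s+(\d+)"),
    "status":      re.compile(r"^JOB STATUS:\s+(\w+)"),
}

def run_and_track(cmd: list[str], job_id: int) -> None:
    """Run the user job, mirror its stdout, and record extracted parameters."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)
    for line in proc.stdout:
        sys.stdout.write(line)                      # keep the original stream intact
        for column, pattern in FILTERS.items():
            m = pattern.search(line)
            if m:
                update(job_id, column, m.group(1))  # asynchronous in the real system
    proc.wait()
    update(job_id, "exit_code", str(proc.returncode))

def update(job_id: int, column: str, value: str) -> None:
    # Placeholder for the remote updater; here the SQL is only printed out.
    print(f"UPDATE jobs SET {column}='{value}' WHERE id={job_id};")
```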
20. Dataset Metadata management
- Dataset metadata are stored in the RefDB
- which (logical) files is the dataset made of?
- but also
- what input parameters were given to the simulation programs?
- how many events have been produced so far?
- Information may be updated in the RefDB in many ways
- manual Site Manager operation
- automatic e-mail from the job
- remote updaters based on R-GMA and Clarens (similar to those developed for BOSS) will be developed
- Mapping of logical names to physical file names will be done on the grid by RLS/POOL (a toy lookup is sketched after this slide)
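As a toy illustration of this logical-to-physical mapping, the snippet below resolves a logical file name to a site-local replica. The in-memory catalogue, the example file name and the choose_replica helper are hypothetical; a real job would go through the POOL file catalogue backed by the RLS.

```python
# Toy replica lookup (illustrative; not the RLS/POOL API).

REPLICA_CATALOG = {
    # hypothetical logical file name (LFN) -> physical file names (PFNs) at different sites
    "example_dataset.digi.0001.root": [
        "rfio://castor.cern.ch/cms/digi/example_dataset.digi.0001.root",
        "gsiftp://se.cnaf.infn.it/cms/digi/example_dataset.digi.0001.root",
    ],
}

def choose_replica(lfn: str, preferred_site: str) -> str:
    """Pick a PFN for the job, preferring a replica at the site where it runs."""
    pfns = REPLICA_CATALOG[lfn]
    for pfn in pfns:
        if preferred_site in pfn:
            return pfn
    return pfns[0]   # fall back to any replica

print(choose_replica("example_dataset.digi.0001.root", "cnaf.infn.it"))
```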
21. 2002 Data Productions
22. 2002 production statistics
- Used Objectivity/DB for persistency
- 11 Regional Centers, more than 20 sites, about 30 site managers
- Spring 2002 Data production
- Generation and detector simulation
- 6 million events in 150 physics channels
- Digitization
- >13 million events with different configurations (luminosity)
- about 200 KSI2000 months
- more than 20 TB of digitized data
- Fall 2002 Data production
- 10 million events, full chain (small output)
- about 300 KSI2000 months
- Also productions on grid!
23. Spring 2002 production history
[Plot: CMSIM production rate, about 1.5 million events per month.]
24. Fall 2002 CMS grid productions
- CMS/EDG Stress Test on the EDG testbed and CMS sites
- Top-down approach: more functionality, but less robust and large manpower needed
- 260,000 events in 3 weeks
- USCMS IGT Production in the US
- Bottom-up approach: less functionality, but more stable and little manpower needed
- 1.2 million events in 2 months
25. 2003 Pre-Challenge Production
26. PCP04 production statistics
- Started in July; supposed to end by Christmas
- Generation and simulation
- 48 million events with CMSIM
- 50-150 KSI2K s/event, 2000 KSI2K months
- 1 MB/event, 50 TB
- hit-formatting in progress; the POOL format reduces the size by a factor of 2!
- 6 million events with OSCAR
- 100-200 KSI2K s/event, 350 KSI2K months (in progress)
- Digitization just starting
- need to digitize 70 million events; not all in time for DC04! Estimated:
- 30-40 KSI2K s/event, 950 KSI2K months
- 1.5 MB/event, 100 TB
- Data movement to CERN
- 1 TB/day for 2 months
27. PCP 2003 production history
[Plot: CMSIM production rate, about 13 million events per month.]
28. PCP04 on grid
29. US DPE production system
- Running on Grid2003
- 2000 CPUs
- Based on VDT 1.1.11
- EDG VOMS for authentication
- GLUE Schema for MDS Information Providers
- MonALISA for monitoring
- MOP for production control (a toy DAG-generation example follows this slide)
- MOP System (US DPE production on Grid2003)
- DAGMan and Condor-G for specification and submission
- Condor-based match-making process selects resources
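A rough sketch of how production steps can be chained with DAGMan, in the spirit of the MOP system mentioned above: one Condor submit description per step plus a DAG that enforces the ordering. The file names, the three-step chain and the vanilla universe are placeholders; the real MOP submitted through Condor-G to remote gatekeepers.

```python
# Generate a toy DAGMan workflow: simulation -> hit formatting -> digitization.

from pathlib import Path

STEPS = ["cmsim", "hitformat", "digitize"]

def write_submit(step: str) -> str:
    """Write a minimal Condor submit description for one production step."""
    sub = Path(f"{step}.sub")
    sub.write_text(
        "universe   = vanilla\n"
        f"executable = run_{step}.sh\n"
        f"output     = {step}.out\n"
        f"error      = {step}.err\n"
        "log        = production.log\n"
        "queue\n"
    )
    return sub.name

with open("production.dag", "w") as dag:
    for step in STEPS:
        dag.write(f"JOB {step} {write_submit(step)}\n")
    for parent, child in zip(STEPS, STEPS[1:]):
        dag.write(f"PARENT {parent} CHILD {child}\n")

# Submit with: condor_submit_dag production.dag
```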
30. Performance of US DPE
- USMOP Regional Center
- 7.7 Mevts Pythia: 30,000 jobs of 1.5 min each, 0.7 KSI2000 months
- 2.3 Mevts CMSIM: 9,000 jobs of 10 hours each, 90 KSI2000 months
- 3.5 TB data
- Now running OSCAR productions
[Plot: CMSIM production history.]
31. CMS/LCG-0 testbed
- CMS/LCG-0 is a CMS-wide testbed based on the LCG pilot distribution (LCG-0), owned by CMS
- joint CMS, DataTAG-WP4 and LCG-EIS effort
- started in June 2003
- Components from VDT 1.1.6 and EDG 1.4.X (LCG pilot)
- Components from DataTAG (GLUE schemas and info providers)
- Virtual Organization Management: VOMS
- RLS in place of the replica catalogue (uses rlscms by CERN/IT)
- Monitoring: GridICE by DataTAG
- tests with R-GMA (as BOSS transport layer for specific tests)
- no MSS direct access (bridge to SRB at CERN)
- About 170 CPUs, 4 TB disk
- Bari, Bologna, Bristol, Brunel, CERN, CNAF, Ecole Polytechnique, Imperial College, ISLAMABAD-NCP, Legnaro, Milano, NCU-Taiwan, Padova, U.Iowa
- Allowed CMS software integration to proceed while LCG-1 was not yet out
32. CMS/LCG-0 Production system
- OCTOPUS installed on the User Interface
- CMS software installed on the Computing Elements as RPMs
[Diagram: McRunjob/ImpalaLite on the User Interface reads assignments from the RefDB, produces JDL and submits to the Grid (LCG) Scheduler, which uses the Grid Information System (MDS) to dispatch jobs to Computing Elements; output data go to Storage Elements and are registered in the RLS; job status is tracked in the BOSS DB.]
33. CMS/LCG-0 performance
- CMS-LCG Regional Center (based on CMS/LCG-0)
- 0.5 Mevts heavy Pythia: 2000 jobs of 8 hours each, 10 KSI2000 months
- 1.5 Mevts CMSIM: 6000 jobs of 10 hours each, 55 KSI2000 months
- 2.5 TB data
- Inefficiency estimation
- 5% to 10% due to site misconfiguration and local failures
- 0% to 20% due to RLS unavailability
- few errors in execution of the job wrapper
- Overall inefficiency: 5% to 30%
- Now used as a playground for CMS grid-tools development
[Plot: Pythia and CMSIM production history.]
34. Data Challenge 2004 (DC04)
35. 2004 Data Challenge
- Test the CMS computing system at a rate corresponding to 5% of the full LHC luminosity
- corresponds to 25% of the LHC start-up luminosity
- for one month (February or March 2004)
- 25 Hz data-taking rate at a luminosity of 0.2 x 10^34 cm^-2 s^-1
- 50 million events (completely simulated up to digis during PCP03) used as input
- Main tasks (the corresponding rates are worked through after this slide)
- Reconstruction at Tier-0 (CERN) at 25 Hz (40 MB/s)
- Distribution of the DSTs to Tier-1 centers (5 sites)
- Re-calibration at selected Tier-1 centers
- Physics-group analysis at the Tier-1 centers
- User analysis from the Tier-2 centers
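A quick consistency check of the DC04 numbers (a sketch only; the event sizes of 1.5 MB per digitized event and 0.5 MB per DST event are taken from the slides that follow):

```python
# Recompute the DC04 throughput figures from the quoted event count and sizes.

EVENTS  = 50e6    # digitized events prepared during the pre-challenge production
RATE_HZ = 25      # sustained reconstruction rate
DIGI_MB = 1.5     # input (digi) event size
DST_MB  = 0.5     # output (DST) event size

read_mb_s    = RATE_HZ * DIGI_MB             # ~37.5 MB/s served to the farm (~40 MB/s quoted)
archive_mb_s = RATE_HZ * (DIGI_MB + DST_MB)  # ~50 MB/s into the CERN MSS
input_tb     = EVENTS * DIGI_MB / 1e6        # ~75 TB of input digis
output_tb    = EVENTS * DST_MB / 1e6         # ~25 TB of DST
days         = EVENTS / RATE_HZ / 86400      # ~23 days of continuous running

print(read_mb_s, archive_mb_s, input_tb, output_tb, round(days, 1))
```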
36. DC04 T0, calibration and analysis challenges
[Diagram: events passing the HLT filter enter a 40 TB CERN disk pool (20 days of data); raw data at 25 Hz x 1 MB/evt and reconstructed DSTs at 25 Hz x 0.5 MB/evt flow through the disk cache to archive storage on the CERN tape archive, feeding the T0, calibration and analysis challenges.]
37. Tier-0 challenge
- Data-serving pool to serve digitized events at 25 Hz to the computing farm with 20/24-hour operation
- 40 MB/s
- Adequate buffer space (at least 1/4 of the digi sample in the disk buffer)
- Pre-staging software: file locking while in use, buffer cleaning and restocking as files are processed (a minimal locking sketch follows this slide)
- Computing farm of approximately 400 CPUs
- jobs running 20/24 hours; 500 events/job, 3 hours/job
- Files in the buffer locked until successful job completion
- No dead-time can be introduced into the DAQ; latencies must be no more than of order 6-8 hours
- CERN MSS: 50 MB/s archiving rate
- archive 1.5 MB x 25 Hz raw data (digis)
- archive 0.5 MB x 25 Hz reconstructed events (DST)
- File catalog: POOL/RLS
- Secure and complete catalog of all data inputs/products
- Accessible and/or replicable to the other computing centers
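A minimal sketch of the buffer bookkeeping described above: files are pre-staged into the disk pool, locked while a reconstruction job uses them, and removed only after successful completion so the pre-stager can restock. The path, lock-file convention and function names are illustrative assumptions, not the actual DC04 tools.

```python
# Toy disk-buffer locking scheme (illustrative only).

import os
from pathlib import Path

BUFFER = Path("/data/dc04_buffer")   # assumed disk pool mount point

def lock_path(f: Path) -> Path:
    return f.with_suffix(f.suffix + ".lock")

def acquire(f: Path) -> None:
    """Mark a pre-staged file as in use by a job (fails if already locked)."""
    fd = os.open(lock_path(f), os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.close(fd)

def release(f: Path, job_succeeded: bool) -> None:
    """Unlock the file; remove it from the buffer only on successful completion."""
    lock_path(f).unlink()
    if job_succeeded:
        f.unlink()   # free space so the pre-stager can restock the buffer
```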
38. Data distribution challenge
- Replication of the DSTs and part of the raw data at one or more Tier-1 centers
- possibly using the LCG replication tools
- some event duplication is foreseen
- At CERN, 3 GB/s traffic without inefficiencies (about 1/5 of that at a Tier-1)
- Tier-0 catalog accessible by all sites
- Replication of calibration samples (DST/raw) to selected Tier-1 centers
- Transparent access of jobs at the Tier-1 sites to the local data, whether in MSS or on disk buffer
- Replication of any Physics-Group (PG) data produced at the Tier-1 sites to the other Tier-1 sites and interested Tier-2 sites
- Monitoring of data transfer activities
- e.g. with MonALISA
39. Calibration challenge
- Selected sites will run calibration procedures
- Rapid distribution of the calibration samples (within hours at most) to the Tier-1 site, with automatically scheduled jobs to process the data as it arrives
- Publication of the results in an appropriate form that can be returned to the Tier-0 for incorporation in the calibration database
- Ability to switch the calibration database at the Tier-0 on the fly, and to track from the metadata which calibration table has been used
40. Tier-1 analysis challenge
- All data distributed from Tier-0 safely inserted into local storage
- Management and publication of a local catalog indicating the status of locally resident data
- define tools and procedures to synchronize a variety of catalogs with the CERN RLS catalog (EDG-RLS, Globus-RLS, SRB-Mcat, ...)
- Tier-1 catalog accessible to at least the associated Tier-2 centers
- Operation of the physics-group (PG) productions on the imported data
- production-like activity
- Local computing facilities made available to Tier-2 users
- possibly via the LCG job submission system
- Export of the PG data to requesting sites (Tier-0, -1 or -2)
- Registration of the data produced locally in the Tier-0 catalog to make it available to at least selected sites
- possibly via the LCG replication tools
41. Tier-2 analysis challenge
- Point of access to computing resources for physicists
- Pulling of data from peered Tier-1 sites as defined by the local Tier-2 activities
- Analysis on the local PG data produces plots and/or summary tables
- Analysis on distributed PG data or DSTs available at least at the reference Tier-1 and associated Tier-2 centers
- Results made available to selected remote users, possibly via the LCG data replication tools
- Private analysis on distributed PG data or DSTs is outside the DC04 scope but will be kept as a low-priority milestone
- use of a Resource Broker and Replica Location Service to gain access to appropriate resources without knowing where the input data are
- distribution of user code to the executing machines
- user-friendly interface to prepare, submit and monitor jobs and to retrieve results
42. Summary of DC04 scale
- Tier-0
- Reconstruction and DST production at CERN
- 75 TB input data
- 180 KSI2K months: 400 CPUs at 24-hour operation, at 500 SI2K/CPU (see the check after this slide)
- 25 TB output data
- 1-2 TB/day data distribution from CERN to the sum of the T1 centers
- Tier-1
- Assume all CMS Tier-1s (except CERN) participate
- CNAF, FNAL, Lyon, Karlsruhe, RAL
- Share the T0 output DSTs between them (5-10 TB each)
- 200 GB/day transfer from CERN (per T1)
- Perform scheduled analysis-group production
- 100 KSI2K months in total: 50 CPUs per T1 (24 hrs/30 days)
- Tier-2
- Assume about 5-8 T2 centers
- may be more
- Store some of the PG data at each T2 (500 GB-1 TB)
- Estimate 20 CPUs at each center for 1 month
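As a sanity check on the Tier-0 numbers above, the sketch below redoes the CPU arithmetic from the per-job figures quoted on the Tier-0 challenge slide (500 events/job, 3 hours/job); the result comes out of the same order as the 180 KSI2K months quoted here.

```python
# Rough Tier-0 CPU budget check from the slide figures (rounded numbers).

CPUS            = 400
KSI2K_PER_CPU   = 0.5        # 500 SI2K per CPU
EVENTS          = 50e6
EVENTS_PER_JOB  = 500
HOURS_PER_JOB   = 3
HOURS_PER_MONTH = 30 * 24

cpu_hours   = EVENTS / EVENTS_PER_JOB * HOURS_PER_JOB       # ~300,000 CPU-hours needed
ksi2k_month = cpu_hours / HOURS_PER_MONTH * KSI2K_PER_CPU   # ~210 KSI2K-months of work
farm_month  = CPUS * KSI2K_PER_CPU                          # ~200 KSI2K-months available per month

print(round(cpu_hours), round(ksi2k_month), farm_month)
```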
43. Summary
- Computing is a CMS-wide activity
- 18 regional centers, 50 sites
- Committed to support other CMS activities
- support analysis for DAQ, Trigger and Physics studies
- Increasing in size and complexity
- 1 TB in 1 month at 1 site in 1999
- 170 TB in 6 months at 50 sites today
- Ready for full LHC size in 2007
- Exploiting new technologies
- Grid paradigm adopted by CMS
- Close collaboration with LCG and EU and US grid projects
- Grid tools assuming more and more importance in CMS