Slide 1
The CMS Computing Software and Analysis Challenge 2006
N. De Filippis
Department of Physics and INFN Bari
On behalf of the CMS collaboration
Slide 3
- A 50 million event exercise to test the workflow and dataflow as defined in the CMS computing model
- A test at 25% of the capacity needed in 2008
- Main components:
  - Preparation of large MC simulated datasets (some with HLT-tags)
  - Prompt reconstruction at Tier-0:
    - reconstruction at 40 Hz (out of 150 Hz) using CMSSW
    - application of calibration constants from the offline DB
    - generation of RECO, AOD, and AlCaReco datasets
    - splitting of an HLT-tagged sample into 10 streams
  - Distribution of all AOD and some FEVT to all participating Tier-1s
  - Calibration jobs on AlCaReco datasets at some Tier-1s and the CAF
  - Re-reconstruction performed at Tier-1s
  - Skim jobs at some Tier-1s with data propagated to Tier-2s
  - Physics jobs at Tier-2s and Tier-1s on AOD and RECO

Italian contribution
Slide 4 (schedule)
- June 1: computing systems ready for Service Challenge SC4
- June 14: first version of the detector and physics reconstruction SW for CSA06
- June 15: physics simulation validation complete
- July 1: start of MC production
- Aug. 15: calibration, alignment, HLT (and first version of the L1 simulation), reconstruction, and analysis tools ready
- Aug. 30: 50 Mevt produced, 5M with HLT pre-processing
- Sep. 1: computing systems ready for CSA
- Sep. 15: start of CSA06
- Oct. 1: smooth operation for CSA06
- Oct. 30: end of smooth operation for CSA06
- Nov. 15: finish of CSA06
Slide 5
- Most of the performance metrics of CSA06:
  - number of participating Tier-1s: goal 7, threshold 5
  - number of participating Tier-2s: goal 20, threshold 15
  - weeks of running at sustained rate: goal 4, threshold 2
  - Tier-0 efficiency: goal 80%, threshold 30%, measured as the unattended uptime fraction over the 2 best weeks of the running period
  - running grid jobs (Tier-1 + Tier-2) per day (2h jobs typ.): goal 50K, threshold 30K
  - grid job efficiency: goal 90%, threshold 70%
  - data serving capability at each participating site, from the disk storage to the CPU: goal 1 MB/s per execution slot, threshold 400 MB/s (Tier-1) or 100 MB/s (Tier-2)
  - data transfer Tier-0 to Tier-1 to tape: individual goals (threshold at 50% of goal); for CNAF it was 25 MB/s
  - data transfer Tier-1 to Tier-2: goal 20 MB/s into each Tier-2, threshold 5 MB/s
- Overall "success" is to have 50% of participants at or above goal and 90% above threshold (see the sketch below).
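A minimal sketch (my own illustration, not part of the original exercise) of how the overall success rule on this slide could be evaluated for one metric; the site rates below are invented placeholder numbers:

    # Hypothetical check of the CSA06 overall success rule:
    # >= 50% of participants at or above goal, >= 90% above threshold.
    def csa06_success(measurements, goal, threshold):
        """measurements: per-site measured values for one metric."""
        n = len(measurements)
        at_goal = sum(1 for m in measurements if m >= goal)
        above_thr = sum(1 for m in measurements if m > threshold)
        return at_goal / n >= 0.50 and above_thr / n >= 0.90

    # Placeholder Tier-1 -> Tier-2 transfer rates in MB/s (goal 20 MB/s
    # into each Tier-2, threshold 5 MB/s, as quoted on this slide).
    rates = [22.0, 18.5, 25.1, 7.3, 21.0, 19.8, 30.2, 6.1, 24.4, 20.5]
    print(csa06_success(rates, goal=20.0, threshold=5.0))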
Slide 6
- Tier-0 (CERN):
  - 1.4M SI2K (1400 CPUs at CERN)
  - 240 TB
- Tier-1 (7 sites):
  - 2500 CPUs in total
  - 70 TB disk + tape as a minimum to participate
- Tier-2 (25 sites):
  - 2400 CPUs in total
  - 10 TB disk on average at each participating Tier-2
Slide 8
- ProdAgent tool used to automatise the production:
  - consists of many agents running in parallel: JobCreator, JobSubmitter, JobTracking, MergeSensor, etc.
- Output files are registered in the Data Bookkeeping Service (DBS); blocks of files are registered in the Data Location Service (DLS), which takes care of the mapping between file blocks and the storage elements where they exist
- Files are merged to optimum size before transfer to CERN
- CMS software (CMSSW) installed via grid tools, or directly by site admins, at remote sites. A local catalogue is used to map LFNs to local PFNs via a set of rules (see the sketch below)
- Storage technologies deployed: CASTOR, dCache, DPM
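As an illustration of the rule-based LFN-to-PFN mapping mentioned above, a minimal sketch in the spirit of a site-local catalogue; the rules and paths here are invented examples, not any site's real configuration:

    import re

    # Hypothetical catalogue rules: (LFN regex, PFN template). A real site
    # catalogue carries protocol-specific rules; these paths are made up.
    RULES = [
        (r"^/store/(.*)$", "rfio:///castor/example.site/cms/store/{0}"),
        (r"^/unmerged/(.*)$", "rfio:///castor/example.site/cms/unmerged/{0}"),
    ]

    def lfn_to_pfn(lfn):
        """Return the PFN from the first catalogue rule matching the LFN."""
        for pattern, template in RULES:
            m = re.match(pattern, lfn)
            if m:
                return template.format(*m.groups())
        raise LookupError("no catalogue rule matches %s" % lfn)

    print(lfn_to_pfn("/store/CSA06/minbias/RECO/file0001.root"))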
Slide 9
- 4 production teams active:
  - 1 for OSG, contact person Ajit Mohapatra, Wisconsin (taking care of the 7 OSG CMS Tier-2s)
  - 3 for LCG:
    - LCG(1), contact person Jose Hernandez, Madrid (Spain, France, Belgium, CERN)
    - LCG(2), contact person Carsten Hof, Aachen (Germany, Estonia, Taiwan, Russia, Switzerland, FNAL)
    - LCG(3), contact person Nicola De Filippis, Bari (Italy, UK, Hungary)
- Large participation of CMS T1s and T2s
Slide 10
Maximum rate per day: 1.15 M events
Slide 11
[Plots: production at the T1 CNAF and at the Pisa, LNL, and Bari Tier-2s]
Most of the failures at CNAF were related to stage-out and stage-in problems with CASTOR2
Slide 12
Total: 66 M events; total FEVT: O(150) TB
1. Minimum bias (40M)
2. Z→µµ (2M)
3. t-tbar (6M)
   - all decays
4. W→eν (4M)
   - events selected in a narrow range to illuminate 2 supermodules (SMs)
5. Electroweak soup (5M)
   - W→lν + Drell-Yan (m > 15 GeV) + WW + H→WW
6. HLT soup (5M): 10 effective MC HLT triggers (no taus pass)
   - W (leptons) + Drell-Yan (leptons) + t-tbar (all modes) + dijets
7. Jet calibration soup (1M)
   - dijet + Z+jet, various pt-hat ranges
8. Soft muon soup (2M)
   - inclusive muons in minbias + J/ψ production
9. Exotics soup (1M)
   - LM1 SUSY, Z′ (700 GeV), and excited quark (2000 GeV), all decays

12 M events produced by the LCG(3) team
Slide 13
- Efficiency:
  - overall efficiency: 88%
    - probability for a job to end successfully once it is submitted
  - grid efficiency: 95%
    - aborted jobs: jobs not submitted because requirements were not met (merge jobs), or jobs that, once submitted, fail for Grid infrastructure reasons
- Problems:
  - stage-out was the main cause of job failures. More robust checks were implemented, more attempts to stage, a fallback strategy, etc. (see the sketch below)
  - merge jobs typically caused an overload of the storage system because of the high rate of read access; CASTOR2 at CNAF was tuned to cope with the needs of the production (D. Bonacorsi and CNAF admins)
  - site validation: storage, software tag, software mount points, matching of the CE
  - consistency between fileblocks/files in DBS/DLS and the reality at sites

Support of the Italian Tier-1 and Tier-2s very effective, also in August
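A minimal sketch of the kind of stage-out retry-and-fallback logic described above; the function and the storage endpoints are hypothetical, not ProdAgent's actual code:

    import shutil, time

    # Hypothetical stage-out with retries and fallback destinations. A real
    # grid job would call a grid copy tool rather than shutil.copy.
    def stage_out(local_file, destinations, attempts=3, wait_s=30):
        for dest in destinations:            # primary SE first, then fallbacks
            for _ in range(attempts):
                try:
                    shutil.copy(local_file, dest)
                    return dest              # success: report where it landed
                except OSError:
                    time.sleep(wait_s)       # transient failure: retry
        raise RuntimeError("stage-out failed for %s" % local_file)

    # Usage with invented paths: try the local SE, then fall back.
    # stage_out("out.root", ["/mnt/se_local/store/", "/mnt/se_fallback/store/"])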
Slide 15
- Reconstruction with CMSSW_1_0_x (x ≤ 6)
- All main reconstruction components included:
  - detector-specific local reconstruction and clustering
  - tracking (only 1 algorithm used), vertexing, standalone µ, jets
  - global µ (with tracker), electrons, photons, b and τ tagging
- Reconstruction time small (no pile-up!): 4.5 s/ev for minimum bias, 20 s/ev for t-tbar
  - the computing model assumes 25 s/ev (see the check after this list)
- Calibration/alignment:
  - ability to pull in constants from the offline DB included for ECAL, Tracker, and Muon reconstruction
  - direct access to Oracle, or via the Frontier cache
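A quick back-of-the-envelope check (my own arithmetic, using only numbers quoted in these slides) of how the per-event times map onto the Tier-0 CPU budget:

    # At the computing-model assumption of 25 s/ev, sustaining the 40 Hz
    # prompt-reconstruction rate of the exercise needs about 40 * 25 = 1000
    # busy cores, consistent in order of magnitude with the ~1250 CPUs
    # quoted for prompt reconstruction on the next slide.
    rate_hz = 40.0
    sec_per_event = 25.0
    print(rate_hz * sec_per_event)  # ~1000 execution slots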
Slide 16
- Processing for CSA officially launched October 2
  - first week mostly minbias (with some EWK) using CMSSW_1_0_2, while bugs were fixed to improve robustness on the signal samples
  - second week processing included signal samples, at rates generally matched to the T1 bandwidth metrics, using CMSSW_1_0_3
  - after having run for about 23 days (120M events at 100% uptime), decided to increase the scale for the last days
  - reprocessed all signal samples in 5 days using CMSSW_1_0_6 and maximum CPU usage
    - useful to re-do some samples (FEVT, RECO, AOD, AlCaReco) because of some problems/mistakes in the earlier generation (missing files, missing muon objects)
- Performance:
  - 160 Hz processing rate, peaking at 300 Hz
    - signals, minbias, and HLT split samples
  - 1250 CPUs for prompt reconstruction
  - 150 CPUs for AOD and AlCaReco production (a separate step)
  - all constants pulled from Frontier
    - i.e. the full complexity of the CSA exercise
  - 4 weeks uptime (goal), 207M events processed
Slide 17
- Calibration/alignment tasks:
  - specialized tasks to align/calibrate subsystems using start-up miscalibrated samples, e.g.:
    - align a portion of the Tracker with the HIP algorithm using the Z→µµ sample on the central analysis facility (CAF) for prompt calibration/alignment
    - intercalibrate the ECAL crystals by phi symmetry in minbias events, by π0/η, or by isolated electrons from W/Z
- Specialized reduced RECO data format (AlCaReco) to be used for the calibration/alignment stream from Tier-0
- Mechanism to write constants back into the offline DB to be used
- Re-reconstruction at Tier-1 required to test the new constants
- Propose that miscalibration is applied at RECO
- Dataset for the alignment exercise: Z→µµ
Slide 18
- CSA06 misalignment scenario: TIB dets and TOB rods misaligned by applying
  - random shifts, drawn from a flat distribution of width ±100 µm, in (x,y,z) for the double-sided modules and in x (the sensitive coordinate) for the single-sided ones
  - random rotations, drawn from a flat distribution of width ±10 mrad, in (alpha,beta,gamma) for all the modules

[Plot: TIB double-sided dets positions]

- Alignment exercise (see the sketch below):
  - read the objects from the DB and apply the initial misalignment
  - run the iterative HIP algorithm and determine the alignment constants
  - 1M events used and 10 iterations
  - jobs running in parallel on 20 CPUs on a dedicated queue at Tier-0
  - new constants inserted into the DB
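A minimal sketch (my own illustration, not the CMSSW implementation) of drawing the flat random shifts and rotations described above:

    import random

    # Illustrative generation of the CSA06-style misalignment: flat random
    # shifts of +/-100 um and rotations of +/-10 mrad per module. Units:
    # cm for positions and rad for angles.
    SHIFT = 0.0100   # 100 um expressed in cm
    ROT   = 0.010    # 10 mrad expressed in rad

    def misalign(module, double_sided):
        # Double-sided modules are shifted in x, y and z; single-sided
        # ones only in x, the sensitive coordinate.
        axes = ("x", "y", "z") if double_sided else ("x",)
        shifts = {a: random.uniform(-SHIFT, SHIFT) for a in axes}
        rots = {a: random.uniform(-ROT, ROT) for a in ("alpha", "beta", "gamma")}
        return shifts, rots

    print(misalign("TIB_module_001", double_sided=True))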
Slide 19
- Tomcat and squids (caching servers) in place and tested before CSA
- DB populated with some sets of constants:
  - no miscalibration, start-up miscalibration (4), etc.
- But multiple failures in the first tests:
  - crashes (needed a CORAL patch)
  - logging of 28K queries/job killed the servers (disabled)
- Successfully in CSA by Oct. 24

[Plot: test history, showing the in-CSA period, good tests, and failed tests]
Slide 20
- All 7 Tier-1 centers participated in the challenge, performing very well
  - some storage element software or hardware problems at individual sites
  - but all have recovered and rapidly cleared any accumulated backlogs
  - the longest downtime at any site has been about 18 hours
- Files are injected into the CMS data transfer system PhEDEx and transferred using FTS
- One central-service failure
  - recovery was rapid
- Highest rate from CERN was 550 MB/s

First 3-week average rates:

  Site     Rate
  ASGC     14.3 MB/s
  CNAF     18.0 MB/s
  FNAL     47.8 MB/s
  GridKa   21.7 MB/s
  IN2P3    14.6 MB/s
  PIC      14.4 MB/s
  RAL      16.4 MB/s
  Total    147 MB/s
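A small cross-check (my own arithmetic) of the table against the transfer metric on slide 5, where the Tier-0 to CNAF goal was 25 MB/s with the threshold at 50% of goal:

    # Per-site 3-week average rates in MB/s, copied from the table above.
    rates = {"ASGC": 14.3, "CNAF": 18.0, "FNAL": 47.8, "GridKa": 21.7,
             "IN2P3": 14.6, "PIC": 14.4, "RAL": 16.4}
    print(sum(rates.values()))        # ~147 MB/s, matching the quoted total

    # CNAF against its individual goal of 25 MB/s (threshold 50% of goal):
    cnaf_goal = 25.0
    print(rates["CNAF"] / cnaf_goal)  # 0.72: above threshold, below goal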
Slide 21
...after the prompt reconstruction at Tier-0: the transfer to the Tier-1 CNAF was overall successful
Slide 22
- To fit data at the T2s, and to reduce the primary datasets to manageable sizes, skim jobs had to be run at the T1s to select events according to the analyses
- Skim configuration files prepared according to the RECO and AOD formats (also including some MC truth information)
- Organized skim jobs ran with ProdAgent
  - the different skim procedures prepared by the users for running on the same dataset were unified in a single skim job producing different streams (see the sketch below)
- 10 filters prepared by the Italian group to cover the planned analyses
- 4 teams for running skim jobs at the Tier-1s:
  - N. De Filippis: electroweak soup (RAL, CNAF, ASGC, IN2P3)
  - D. Mason: jets (FNAL)
  - C. Hof: t-tbar (FZK and FNAL)
  - J. Hernandez: Z→µµ (PIC and CNAF)
- Skim job output files shipped to the Tier-2s for end-user analyses
- Oct. 9: T1 skim jobs started
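A minimal sketch of the unified-skim idea described above: one pass over the input dataset applies all user filters, and each accepted event goes to that filter's own output stream. The filter names and predicates are invented placeholders:

    # One pass over events; every filter writes to its own stream.
    def skim(events, filters, writers):
        for event in events:
            for name, accept in filters.items():
                if accept(event):
                    writers[name].append(event)   # stand-in for file output

    filters = {
        "zmumu": lambda e: e.get("n_muons", 0) >= 2,
        "wenu":  lambda e: e.get("n_electrons", 0) >= 1 and e.get("met", 0) > 20,
    }
    writers = {name: [] for name in filters}
    skim([{"n_muons": 2}, {"n_electrons": 1, "met": 35.0}], filters, writers)
    print({k: len(v) for k, v in writers.items()})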
Slide 23
- First RECO/AOD definition completed for the CSA06 production
- RECO content:
  - Tracker clusters
    - rec-hits skipped for disk space reasons; they can be recomputed from the clusters
  - ECAL/HCAL/Muon RecHits
  - track core plus extra attached RecHits
    - refitting is straightforward from the attached hits
  - vertices, ECAL clusters, calo towers
  - high-level objects:
    - photons, electrons (links with tracks missing), muons, jets, MET (from calo towers and generator)
    - tau tagging
  - HLT output summary
    - trigger bits + links to high-level objects (as candidates)
  - HepMC generator
  - Geant4 tracks/vertices
- AOD content: a proper subset of RECO
  - clusters and hits are dropped
  - track core only saved
Slide 25
- Problems related to:
  - wrong configuration of Tier-2 sites
  - wrong setup of the download agents with FTS
  - CNAF-related problems (FTS server, CASTOR)
Slide 26
Exceeded 1 PB in 1 month!
Slide 27
[Figure slide, credit: P. Govoni]
Slide 28
- All INFN Tier-2s took part in the last step of CSA06, the physics analyses, starting from the output of the skim procedures:
  - Legnaro/Padua: W→µν selection
  - Pisa: tau validation
  - Study of minimum bias/underlying event
  - Rome: electron reconstruction
  - Bari: tracker misalignment
Slide 29
- Three analyses, with the goals:
  - to study the electron reconstruction in Z→ee events (Meridiani)
  - to measure the W mass in W→eν events (Tabarelli de Fatis, Malberti, CMS NOTE 2006-061)
  - to run a simple calibration with W→eν events (Govoni)
- Electron and Z mass reconstruction using the hybrid supercluster energy (barrel only)

[Plots: efficiency vs η, efficiency vs pT, and the reconstructed mZ]
Slide 30
- The general idea is to simulate an "early data taking" activity of the τ group:
  - the goal is to study the tau tag efficiency from Z→ττ events (as described in CMS AN 2006/074)
  - and to study the misidentification using the recoiling jet in Z+jet, Z→µµ events
- In addition, run the τ validation package on skimmed events
- The τ validation package has been run on a pure di-tau sample and on the skimmed t-tbar sample (S. Gennai, G. Bagliesi).

[Plots: pT of the jet; isolation efficiency vs isolation cone]
Slide 31
Study of minimum bias/underlying event (Fanò, Ambroglini, Bartalini)
- Monte Carlo tuning for the LHC
- Pile-up understanding
- UE contribution measurements in MB events

[Plots: minimum bias and underlying event distributions]
Slide 32
Goal: to study the W→µν preselection with different Monte Carlo data samples.

Two data samples were considered (Torassa, Margoni, Gasparini):
(1) the electroweak soup (3.4 M evts, 50% W→µν and 50% Drell-Yan)
(2) the soft muons (1.8 M evts, 50% minimum bias and 50% J/ψ, pT(µ) > 4 GeV)

[Plots, EWK soup: the transverse momentum, and the efficiency vs η and vs pT, as obtained with the GlobalMuon reconstructor (to be compared with the standalone one)]
Slide 33
- Goals: to study the effect of tracker misalignment on track reconstruction performance (De Filippis):
  - with the perfect tracker geometry
  - in the short-term and long-term misalignment scenarios, reading the misaligned positions and errors via Frontier/squid from the offline database ORCAOFF
  - using the tracker module positions and errors as obtained from the output of the alignment process run at the CERN T0
- Data samples used: Z→µµ and t-tbar (the second for computing the fake rate)
Slide 34
- CRAB_1_4_0 used to submit 1.8k jobs
  - grid efficiency 99%, application efficiency 94%
- Bunches of 150 jobs run in different time slots
  - max 45 jobs run in parallel
- The configuration of squid was tuned to ensure that the alignment data were read from the local squid cache via the Frontier client rather than from CERN (blue histogram).

⇒ Frontier/squid works as expected at the Tier-2 Bari when accessing alignment data (see the log-checking sketch below)
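One simple way to verify that reads are served by the local cache is to count hit and miss entries in the squid access log. A minimal sketch, assuming a standard squid access.log format; this is my illustration, not the procedure actually used in CSA06:

    from collections import Counter

    # Count squid cache results (TCP_HIT, TCP_MISS, ...) from access.log.
    # The log path is a typical default, not necessarily the CSA06 setup.
    def cache_stats(path="/var/log/squid/access.log"):
        counts = Counter()
        with open(path) as log:
            for line in log:
                fields = line.split()
                if len(fields) > 3:
                    result = fields[3].split("/")[0]  # e.g. "TCP_HIT/200"
                    counts[result] += 1
        return counts

    # A high TCP_HIT fraction means queries were served by the local
    # squid rather than going back to CERN.
    # print(cache_stats())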
Slide 35
- Goals:
  - to demonstrate re-reconstruction from some RAW data at the Tier-1s as part of the calibration exercise
- Status:
  - access to the offline database via Frontier working
  - re-reconstruction demonstrated at ASGC, FNAL, IN2P3, PIC, and CNAF
  - running at RAL, and further tests at CNAF

[Plot: re-reconstruction jobs at PIC]
Slide 36
- Problems with CMSSW:
  - the "reasonability" of the code was not taken much into account. Operations were driven by computing, and the feeling was "whatever you run, we do not care, as long as it does not crash".
  - as often happens in such cases, the release schedule was crazy. The initial milestones were also somewhat crazy, and it meant really hard work to cope with them.
  - CSA06 meant blocking developments for some time, to make sure we maintained backward compatibility. But it also meant a lot of code had to live either in the head or in pre-releases for some time. It would be better to have two releases ongoing at a time: a production one and a development one.
- The framework proved to be usable for T0 reconstruction. HLT was not attempted at CSA06, so no conclusions on that.
Slide 37
- Storage systems:
  - CASTOR and DPM support (in general, rfio access) for the CMS application had a lot of problems (libdpm patched; files > 2 GB required a patch)
  - CASTOR updates were too critical for operations during CSA06; they caused a lot of problems and an emergency status for CNAF
- Integration issues:
  - all the pieces of CSA06 worked (e.g. CMSSW releases, ProdAgent, skim jobs, DBS/DLS interactions), but
  - a lot of operation-team effort was needed to integrate them with each other
  - the ProdAgent tool required a lot of distributed expertise, a dedicated hw/sw setup (at least three machines), and real-time monitoring
  - the CMS SW installation at remote sites was problematic
- LCG/OSG performance very good
Slide 38
- CSA06 was successful at INFN (all the steps were executed), but thanks to the 100% work of a few experts and to the coordinated effort of many people at the Tier-1 and Tier-2 sites.
- CSA06 was supposed to be a challenge to commission the computing/software/analysis system, but in some cases it also required the development/deployment of the tools.
- The CSA06 analysis exercises could serve as the ramp-up for the physics program/organization in Italy.
- A new CSA would be best for 2007, with simulated and real data, focusing on start-up operations (calibration and alignment) and analysis preparation.
Slide 40
[Diagram: production monitoring setup — productions PA_035, PA_041, PA_045, PA_047 monitored, managed by different ProdAgent versions; pccms30 hosting the test and backup setup, PhEDEx injection, and the ProdAgent UI]
Slide 41
The first prototype of the monitoring was developed by the Bari team
Slide 43
Overwhelming response from the CSA analysis demonstrations: about 25 filters producing 37 (plus 21 jet) datasets! A variety of outputs and sizes: FEVT, RECOSim, AlCaReco
Slide 44
- Goals: to study the effect of tracker misalignment on track reconstruction performance:
  a) with the perfect tracker geometry
  b) in the short-term and long-term misalignment scenarios, reading the misaligned positions and errors via Frontier/squid from the offline database ORCAOFF. This step requires refitting the tracks with the misaligned geometry, but it can be done at the T2. The effect of the alignment position error (APE) to be checked.
  c) using the tracker module positions and errors as obtained from the output of the alignment process run at the CERN T0, to verify the efficiency of the alignment procedure on the track reconstruction. The refit of the tracks is to be done at the T2.
- The global efficiency of track reconstruction, the track parameter resolutions, and the fake rate are compared in the a), b), and c) cases.
- The same analysis was performed in ORCA. Plots and documents at:
  http://webcms.ba.infn.it/cms-software/cms-grid/index.php/Main/StudiesOfCMSTrackerMisalignment
- Data samples needed: Z→µµ and t-tbar (the second for computing the fake rate)
Slide 45
- Z→µµ and t-tbar samples produced during the CSA06 pre-production with CMSSW_0_8_2.
- CSA06 events reconstructed at the T0 with CMSSW_1_0_3 (and Z→µµ with CMSSW_1_0_5, in transfer).
- 2 skim cfg files used for skimming the Z→µµ and t-tbar samples. Skim jobs were run at the T1s with CMSSW_1_0_4 and CMSSW_1_0_5, and output data in the reduced format RECOSIM were produced. RECOSIM includes enough information for the misalignment analysis.
- Z→µµ filter to select HepMC muons from the Z decay with |η| < 2.55, pT > 5 GeV/c, and 50 < m(µµ) < 130 GeV/c². Filter efficiency between 50% and 60% (see the sketch below).
- t-tbar filter to select events with two muons with |η| < 2.5 and pT > 15 GeV/c.
- RECOSIM produced with CMSSW_1_0_4 transferred to T2-Bari, and the misalignment analysis run over RECOSIM with CMSSW_1_0_6.
- ¼ of the full statistics already analyzed at T2-Bari; waiting for the full statistics of the samples.
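A minimal stand-alone sketch of the generator-level selection described above (my own illustration, not the actual CSA06 cfg file); the opposite-charge requirement and the example kinematics are my assumptions:

    import math

    # Invariant mass of a muon pair from (pt, eta, phi), neglecting the
    # muon mass, which is negligible at these momenta.
    def dimuon_mass(mu1, mu2):
        px = sum(m["pt"] * math.cos(m["phi"]) for m in (mu1, mu2))
        py = sum(m["pt"] * math.sin(m["phi"]) for m in (mu1, mu2))
        pz = sum(m["pt"] * math.sinh(m["eta"]) for m in (mu1, mu2))
        e  = sum(m["pt"] * math.cosh(m["eta"]) for m in (mu1, mu2))
        return math.sqrt(max(e * e - px * px - py * py - pz * pz, 0.0))

    def zmumu_filter(muons):
        """Accept if any opposite-charge pair passes the slide's cuts."""
        good = [m for m in muons if abs(m["eta"]) < 2.55 and m["pt"] > 5.0]
        for i, m1 in enumerate(good):
            for m2 in good[i + 1:]:
                if m1["q"] * m2["q"] < 0 and 50.0 < dimuon_mass(m1, m2) < 130.0:
                    return True
        return False

    # Example event with invented kinematics (pt in GeV/c):
    mus = [{"pt": 40.0, "eta": 0.5, "phi": 0.1, "q": +1},
           {"pt": 38.0, "eta": -0.7, "phi": 3.0, "q": -1}]
    print(zmumu_filter(mus))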
Slide 46
- Selection:
  - track seeding, building, ambiguity resolution, smoothing with the Kalman filter (KF)
  - ctfWithMaterialTracks refit after applying the alignment uncertainties
  - track associator by χ² to match sim tracks with reco tracks
- Efficiency: number of reco tracks matching simulated tracks / number of simulated tracks (see the sketch below)
  - simulated tracks: pT > 0.9 GeV/c, 0 < |η| < 2.5, d0 < 3 cm, z0 < 30 cm, nhit > 0
  - reco tracks: pT > 0.7 GeV/c, 0 < |η| < 2.6, d0 < 120 cm, z0 < 170 cm, nhit ≥ 8
- Fake rate: number of reco tracks not associated to simulated tracks / number of reco tracks
  - simulated tracks: pT > 0.7 GeV/c, 0 < |η| < 2.6, d0 < 300 cm, z0 < 300 cm; nhit > 8 not used, because SimTrack does not provide the number of simhits (TrackingParticle will, but TP is not compatible with the CSA data samples)
  - reco tracks: pT > 0.9 GeV/c, 0 < |η| < 2.5, d0 < 3 cm, z0 < 30 cm, nhit ≥ 8
- Track parameter resolutions: sigma of a Gaussian fit to the distribution of residuals
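A sketch of the efficiency and fake-rate bookkeeping defined above. In the real analysis the matches come from the χ²-based track associator; here they are a plain set of (sim_id, reco_id) pairs as an invented stand-in:

    # Efficiency: matched sim tracks over all sim tracks (passing cuts).
    def efficiency(sim_tracks, matches):
        matched_sim = {s for s, _ in matches}
        return sum(1 for s in sim_tracks if s in matched_sim) / len(sim_tracks)

    # Fake rate: unmatched reco tracks over all reco tracks (passing cuts).
    def fake_rate(reco_tracks, matches):
        matched_reco = {r for _, r in matches}
        return sum(1 for r in reco_tracks if r not in matched_reco) / len(reco_tracks)

    sim, reco = [0, 1, 2, 3], [10, 11, 12]
    matches = {(0, 10), (1, 11)}
    print(efficiency(sim, matches))  # 0.5
    print(fake_rate(reco, matches))  # 1/3: one reco track unmatched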
Slide 48
- Misalignment affects the global track reconstruction efficiency in the first-data-taking scenario.
- The effect of tracker misalignment is quite significant for the track parameter resolutions (a factor of 2-3 of degradation).
Slide 49
- A factor between 2 and 3 in the impact parameter resolutions due to misalignment
Slide 50
Using the CSA06 Z→µµ sample: the Z mass resolution worsens by more than a factor of 2 in the first-data-taking scenario (RMS from 1.3 to 2.8)