Title: CMS Plans for Use of LCG-1
1. CMS Plans for Use of LCG-1
- David Stickland, CMS Core Software and Computing
- (Heavily based on recent talks by Ian Fisk, Claudio Grandi and Tony Wildish)
2. Computing TDR Strategy
Technologies: evaluation and evolution
Estimated available resources (no cost book for computing)
- Physics Model
- Data model
- Calibration
- Reconstruction
- Selection streams
- Simulation
- Analysis
- Policy/priorities
- Computing Model
- Architecture (grid, OO, ...)
- Tier 0, 1, 2 centres
- Networks, data handling
- System/grid software
- Applications, tools
- Policy/priorities
Iterations / scenarios
Required resources
Validation of Model
DC04 Data Challenge: copes with 25 Hz at 2x10^33 cm^-2 s^-1 for 1 month
Simulations: model system usage patterns
- C-TDR
- Computing model (scenarios)
- Specific plan for initial systems
- (Non-contractual) resource planning
3. Schedule: PCP, DC04, C-TDR
- 2003 Milestones
- June: Switch to OSCAR (critical path)
- July: Start GEANT4 production
- Sept: Software baseline for DC04
- 2004 Milestones
- April: DC04 done (incl. post-mortem)
- April: First draft C-TDR
- Oct: C-TDR submission
4. PCP and DC04
- Two quite different stages
- Pre-Challenge Production (PCP)
- The important thing is to get it done; how it's done is not the big issue
- But anything that can reduce manpower here is good; this will be a 6-month process
- We intend to run a hybrid (non-GRID and GRID) operation, with migration from the first to the second as tools mature
- Data Challenge (DC04)
- Predicated on GRID operation for moving data and for running jobs (and/or pseudo-jobs) at the appropriate locations
- Exercise some calibration and analysis scenarios
5. DC04 Workflow
- Process data at 25 Hz at the Tier-0 (see the rate estimate below)
- Reconstruction produces DST and AOD
- AOD replicated to all Tier-1s (assume 4 centers)
- Sent on to participating Tier-2s (pull)
- DST replicated to at least one Tier-1
- Assume Digis are already replicated at at least one Tier-1
- No bandwidth to transfer Digis synchronously
- Archive Digis to tape library
- Express lines transferred to selected Tier-1s
- Calibration streams, Higgs analysis stream, ...
- Analysis / recalibration
- Produce new calibration data at selected Tier-1s and update the Conditions Database
- Analysis from the Tier-2s on AOD, DST, occasionally on Digis
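For scale (an illustrative back-of-the-envelope estimate, not a number from the slides), sustaining 25 Hz continuously through the challenge month implies roughly:

    # Rough event count for DC04, assuming 25 Hz is sustained for ~30 days.
    rate_hz = 25
    events_per_day = rate_hz * 86400            # ~2.2 million events/day
    events_per_month = events_per_day * 30      # ~65 million events
    print(f"{events_per_day / 1e6:.1f} M events/day, "
          f"{events_per_month / 1e6:.0f} M events/month")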
6. Data Flow in DC04
[Diagram: DC04 data flow, covering the T0 challenge, the calibration challenge and the analysis challenge. Raw data from the HLT filter arrive at 25 Hz x 1 MB/evt into a 40 TByte CERN disk pool (about 20 days of data); reconstruction produces DST at 25 Hz x 0.5 MB/evt; data pass through a disk cache to archive storage in the CERN tape archive. A rough sizing cross-check follows.]
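The 40 TByte / 20 days figure is consistent with a simple rate estimate (illustrative only, assuming the pool buffers the raw stream):

    # Sizing of the CERN disk pool from the quoted rates.
    rate_hz = 25
    raw_tb_per_day = rate_hz * 1.0 * 86400 / 1e6    # 1 MB/evt raw, ~2.2 TB/day
    dst_tb_per_day = rate_hz * 0.5 * 86400 / 1e6    # 0.5 MB/evt DST, ~1.1 TB/day
    pool_tb = 40
    print(f"raw: {raw_tb_per_day:.1f} TB/day, DST: {dst_tb_per_day:.1f} TB/day")
    print(f"{pool_tb} TB pool ~ {pool_tb / raw_tb_per_day:.0f} days of raw data")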
7. DC04 Strategy (partial...)
- DC04 is focused on preparations for the first months of data, not on operation in 2010
- Grid enters mainly in data distribution and analysis
- Express lines pushed from Tier-0 to Tier-1s
- AOD, DST published by Tier-0 and pulled by Tier-1s
- Use Replica Manager services to locate and move the data
- Use a Workload Management System to select resources
- Use a Grid-wide monitoring system
- Conditions DB segmented in read-only Calibration Sets
- Versioned
- Metadata stored in the RefDB
- Temporary solution
- Need specific middleware for read-write data management
- Client-server analysis: Clarens?
- How does it interface to LCG-1 information and data management systems?
8. Boundary conditions for PCP
- CMS persistency is changing
- POOL (by LCG) is replacing Objectivity/DB
- CMS Compiler is changing
- gcc 3.2.2 is replacing 2.95.2
- Operating system is changing
- Red Hat 7.3 is replacing 6.1.1
- Grid middleware structure is changing
- EDG on top of VDT
- ?
- Flexibility to deal with a dynamic environment during the Pre-Challenge Production!
9. Perspective on Spring 02 Production
- Strong control at each site
- Complex machinery to install and commission at each site
- Steep learning curve
- Jobs coupled by Objectivity
- Dataset-oriented production
- Not well suited to most sites' capabilities
10. Perspective for PCP04
- Must be able to run on grid and on non-grid resources
- Have less (no!) control at non-dedicated sites
- Simplify requirements at each site
- Fewer tools to install
- Less configuration
- Fewer site-local services
- Simpler recovery procedures
- Opportunistic approach
- Assignment ≠ dataset
- Several short assignments make one large dataset
- Allows splitting assignments across sites
- Absolute central definition of assignment
- Use RefDB instead of COBRA metadata to control run numbers etc. (sketched below)
- Random numbers and events per run from RefDB
- Expect (intend) to allow greater mobility
- One assignment ≠ one site
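To illustrate the assignment-splitting idea (a minimal sketch under stated assumptions; the field names and splitting policy here are hypothetical, not the actual RefDB schema or production tools):

    # Hypothetical illustration: one centrally defined assignment is split into
    # short runs, with run numbers and random seeds fixed centrally so that any
    # site can execute any chunk and the pieces still form one large dataset.
    from dataclasses import dataclass

    @dataclass
    class RunSpec:
        run_number: int
        n_events: int
        random_seed: int

    def split_assignment(first_run, total_events, events_per_run, base_seed):
        """Return the list of runs making up one assignment."""
        runs, run, remaining = [], first_run, total_events
        while remaining > 0:
            n = min(events_per_run, remaining)
            # Seed derived deterministically from the run number, so the central
            # definition alone fixes every job, wherever it runs.
            runs.append(RunSpec(run, n, base_seed + run))
            run += 1
            remaining -= n
        return runs

    # Example: a 1M-event assignment split into 250-event runs.
    chunks = split_assignment(first_run=1000, total_events=1_000_000,
                              events_per_run=250, base_seed=12345)
    print(len(chunks), "runs; first:", chunks[0])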
11. Catalogs, File Systems, etc.
- A reasonably functioning Storage Element needs (summarized in the sketch below):
- A data catalog
- Replication management
- Transfer capabilities
- On the WAN side, at least GridFTP
- On the LAN side, POSIX-compliant I/O (some more discussion here, I expect)
- CMS is installing SRB at CMS centers to solve the immediate data management issues
- We aim to integrate with an LCG-supported SE as soon as reasonable functionality exists
- CMS would like to install dCache at those centers doing specialized tasks such as high-luminosity pile-up
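The Storage Element requirements above can be read as an interface; the sketch below is purely illustrative (hypothetical class and method names, not any actual LCG, SRB or dCache API):

    # Illustrative-only view of the services a Storage Element must expose.
    from abc import ABC, abstractmethod

    class StorageElement(ABC):
        @abstractmethod
        def lookup(self, lfn: str) -> list:
            """Data catalog: map a logical file name to its physical replicas."""

        @abstractmethod
        def replicate(self, lfn: str, destination_se: str) -> None:
            """Replication management between sites."""

        @abstractmethod
        def wan_url(self, lfn: str) -> str:
            """WAN transfer access, e.g. a gsiftp:// URL served by GridFTP."""

        @abstractmethod
        def posix_path(self, lfn: str) -> str:
            """LAN access: a POSIX path the application can open() directly."""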
12. Specific operations
- CMKIN/CMSIM
- Can run now
- Would help us commission sites with McRunjob etc.
- OSCAR (G4)
- Need to plug the gaps; be sure everything is tightly controlled and recorded from assignment request to delivered dataset
- ORCA (Reconstruction)
- Presumably need the same control exercise as for OSCAR, to keep up with the latest versions
- Digitisation
- Don't yet know how to do this with pile-up on a grid
- Looks like dCache may be able to replace the functionality of AMS/RRP (AMS/RRP is a fabric-level tool, not middleware or an experiment application)
13. Operations (II)
- Data movement will be our killer problem
- Can easily generate > 1 TB per day in the RCs
- Especially if we don't work to reduce the event size
- Can we expect to import 2 TB/day to the T0? (see the estimate below)
- Yes, for a day or two, but for several months?
- Not without negotiating with Castor and the network authorities
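For scale (an illustrative conversion, not a figure from the slides), a sustained 2 TB/day import corresponds to roughly 185 Mbit/s around the clock:

    # Network rate needed to import 2 TB/day into the Tier-0, sustained.
    tb_per_day = 2
    mbit_per_s = tb_per_day * 1e12 * 8 / 86400 / 1e6
    print(f"{tb_per_day} TB/day ~ {mbit_per_s:.0f} Mbit/s sustained")   # ~185 Mbit/s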
14. Timelines
- Start ramping up standalone production in the next 2 weeks
- Exercise the tools and users
- Bring sites up to speed a few at a time
- Start deploying SRB and testing it on a larger scale
- Production-wide by end of June
- Start testing LCG-<n> production in June
- Use CMS/LCG-0 as available
- Expect to be ready for PCP in July
- Partitioning of HW resources will depend on the state of LCG and the eventual workload
15. PCP strategy
- PCP cannot be allowed to fail (otherwise no DC04)
- Hybrid GRID and non-GRID operation
- Minimum baseline strategy is to be able to run on dedicated, fully controllable resources without the need for grid tools (local productions)
- We plan to use LCG and other Grids wherever they can operate with reasonable efficiency
- Jobs will run in a limited sandbox
- Input data local to the job
- Local XML POOL catalogue (prepared by the production tools)
- Output data/metadata and job monitoring data produced locally and moved to the site manager asynchronously
- Synchronous components optionally update central catalogs; if they fail, the job continues and the catalogs are updated asynchronously
- Reduce dependencies on the external environment and improve robustness
- (Conversely, DC04 can be allowed to fail. It is the Milestone)
16. Limited-sandbox environment
[Diagram: on the Worker Node, a Job Wrapper (job instrumentation) runs the User Job, staging job input and collecting job output; a Journal writer records a local journal, and a Remote/Asynchronous updater ships the journal back to the Metadata DB at the User's Site.]
- File transfers, if needed, are managed by external tools (EDG-JSS, additional DAG nodes, etc...)
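A minimal sketch of the wrapper logic described above (a hypothetical illustration, not the actual CMS job instrumentation; file names and the catalog-update hook are made up): the journal is always written locally, and the synchronous catalog update is attempted but never blocks job completion.

    # Illustrative limited-sandbox wrapper: run the user job with local input,
    # journal everything locally, and treat central-catalog updates as optional.
    import json, subprocess, time

    def run_wrapped_job(cmd, journal_path="journal.json"):
        events = [{"t": time.time(), "msg": "job start", "cmd": cmd}]
        result = subprocess.run(cmd, capture_output=True, text=True)
        events.append({"t": time.time(), "msg": "job end", "rc": result.returncode})
        # The journal is always written locally ...
        with open(journal_path, "w") as f:
            json.dump(events, f, indent=2)
        # ... and the synchronous catalog update is best-effort only:
        try:
            update_central_catalog(journal_path)   # hypothetical updater hook
        except Exception as err:
            # A failure here does not fail the job; the journal is shipped to the
            # site manager later and the catalogs are updated asynchronously.
            print("catalog update deferred:", err)
        return result.returncode

    def update_central_catalog(journal_path):
        raise RuntimeError("central catalog unreachable (stand-in for a real updater)")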
17. Grid vs. Local Productions
- OCTOPUS runs only on the User Interface
- No CMS know-how is needed at CE/SE sites
- UI installation doesn't require full middleware
- Little Grid know-how is needed on the UI
- CMS programs (ORCA, OSCAR) pre-installed on CEs (RPM, DAR, PACMAN)
- A site that does local productions is the combination of a CE/SE and a UI that submits only to that CE/SE
- Switching between grid and non-grid production only requires reconfiguring the production software on the UI (sketched below)
- OCTOPUS: Overtly Contrived Toolkit Of Previously Unrelated Stuff
- Soon to have a CVS software repository too
- Release McRunjob, DAR, BOSS, RefDB, BODE, RMT, CMSprod as a coherent whole
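To illustrate the single switch point (a hypothetical configuration fragment; this is not the actual OCTOPUS or McRunjob configuration format, and the commands are placeholders):

    # Hypothetical UI-side production configuration: the same workflow goes either
    # to the local batch manager or to the grid scheduler; nothing else changes.
    PRODUCTION_CONFIG = {
        "submit_mode": "grid",      # "local" -> submit only to the site's own CE/SE
        "local_batch": {"system": "PBS", "queue": "cms_prod"},
        "grid": {"scheduler": "EDG", "virtual_organisation": "cms"},
    }

    def submission_command(job_script, cfg=PRODUCTION_CONFIG):
        """Build a submission command (placeholders, for illustration only)."""
        if cfg["submit_mode"] == "local":
            return ["qsub", "-q", cfg["local_batch"]["queue"], job_script]
        # Grid case: the production tools would instead generate a JDL file and
        # hand it to the EDG scheduler (actual command omitted here).
        return ["<edg-submit>", job_script]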
18. Hybrid production model
[Diagram: at the User's Site (or grid UI), a physics group asks for an official dataset and the Production Manager defines assignments in the RefDB; the Site Manager starts an assignment (or a user starts a private production) through McRunJob, which can produce shell scripts for a Local Batch Manager, JDL for the EDG Scheduler running on CMS/LCG-0, DAGMan (MOP) submissions, or Chimera VDL feeding a Virtual Data Catalogue and Planner.]
19. A Port in every GRID
- McRunJob is compatible with opportunistic use of almost any GRID environment
- We can run purely on top of VDT (USCMS IGT tests, Fall 03)
- We can run in an EDG environment (EDG Stress Tests)
- We can run in CONDOR pools
- GriPhyN/iVDGL have used CMS production as a test case
- (We think this is a common approach of all experiments)
- Within reason, if a country/center wants to experiment with different components above the base GRID, we can accommodate that
- We think it even makes sense for centers to explore the tools available!
20. Other Grids
- The US already deploys a VDT-based GRID environment that has been very useful for our productions (probably not just a US concern)
- They will want to explore new VDT versions on a timescale that may differ from LCG's
- For good reason, they may have different higher-level middleware
- This makes sense to allow/encourage
- For the PCP the goal is to get the work done any way we can
- If a region wants to work with LCG-1 subsets or extensions, so be it
- (Certainly in the US case there will also be pure LCG functionality)
- There is not one and only one GRID
- Collaborative exploration makes sense
- But for DC04, we expect to validate pure LCG-<n>
21. CMS Development Environment
- CMS/LCG-0 is a CMS-wide testbed based on the LCG pilot distribution, owned by CMS
- Before LCG-1 (ready in July):
- Gain experience with existing tools before the start of PCP
- Feedback to GDA
- Develop production/analysis tools to be deployed on LCG-1
- Test new tools of potential interest to CMS
- Common environment for all CMS productions
- Also used as a base configuration for non-grid productions
22. dCache
- What is dCache?
- dCache is a disk caching system developed at DESY as a front end for Mass Storage Systems
- It now has significant developer support from FNAL and is used in several running experiments
- We are using it as a way to utilize disk space on the worker nodes and efficiently supply data to intensive applications like simulation with pile-up
- Applications access the data in dCache space over a POSIX-compliant interface; from the user's perspective the dCache directory (/pnfs) looks like any other cross-mounted file system
- Since this was designed as a front end to MSS, files cannot be appended once closed
- Very promising set of features for load balancing and error recovery
- dCache can replicate data between servers if the load is too high
- If a server fails, dCache can create a new pool and the application can wait until data is available
23. High Luminosity Pile-Up
- This has been a bugbear of all our productions. It needs specially configured data serving (not for every center)
- Tried to make a realistic test; in the interest of time we generated a fairly small minimum-bias dataset
- Created 10k CMSIM minimum-bias events, writing the fz files directly into dCache
- Hit-formatted all events (4 times)
- Experimented with writing events to dCache and to local disk
- Placed all ROOT-IO files for Hits, THits, MCInfo, etc. into dCache and the META data on local disk
- Used soft links from the local disk to /pnfs dCache space (illustrated below)
- Digi files are written into local disk space
- Minimum-bias Hit files are distributed across the cluster
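The local-disk-plus-soft-link layout might look roughly as follows (an illustrative sketch with made-up paths; the real /pnfs namespace and file names differ):

    # Illustrative only: bulk ROOT-IO files live in dCache (/pnfs), metadata and
    # Digi output stay on local disk, and soft links make the dCache files appear
    # in the local working directory as ordinary POSIX paths.
    import os

    local_dir = "/data/work/pileup_job"             # hypothetical local path
    pnfs_dir = "/pnfs/example.site/cms/minbias"     # hypothetical dCache path

    os.makedirs(local_dir, exist_ok=True)
    for name in ("minbias_hits_001.root", "minbias_hits_002.root"):
        target = os.path.join(pnfs_dir, name)
        link = os.path.join(local_dir, name)
        if not os.path.islink(link):
            os.symlink(target, link)                # read via POSIX I/O through dCache
    # Closed dCache files cannot be appended, hence metadata is kept locally.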
24. writeAllDigis Case
- Pile-up is the most complicated case
- Pile-up events are stored across the pools
- Many applications can run in parallel, each writing to its own metadata but reading the same minimum-bias ROOT-IO files
[Diagram: a writeAllDigis application on a Worker Node, linked against libpdcap.so, reads pile-up events from many dCache Pool Nodes through the dCache server and the PNFS namespace, while writing its own output to Local Disk.]
25. dCache Request
- dCache successfully made use of disk space from the computing elements, efficiently reading and writing events for CMSIM
- No drawbacks were observed
- Writing into dCache with writeHits jobs worked properly
- Performance is reduced for writing metadata; reading fz files worked well
- The lack of ability to append files makes storing metadata inconvenient
- Moving closed files into dCache and soft-linking works fine
- High-luminosity pile-up worked very well, reading events directly from dCache
- Data rate is excellent
- System stability is good
- Tests were quite successful and generally quite promising
- We think this could be useful at many centers (but we are not requiring this)
- It looks like it will be required at least at the centers where we do digitization (typically at some Tier-1s)
- But it can probably be set up in a corner if it is not part of the center model
26. Criteria for LCG-1 Evaluation
- We can achieve a reasonable fraction of the PCP on LCG resources
- We continue to see a tendency towards reducing the (CMS) expertise required at each site
- We see the ability to handle the DC04 scales:
- 50-100k files
- 200 TB
- 1 MSI2k
- 10-20 users in PCP, 50-100 in DC04 (from any country, using any resource)
- Final criteria for DC04/LCG success will be set in September/October
27. Summary
- Deploying LCG-1 is clearly a major task
- Expansion must be clearly controlled: expand at constant efficiency
- CMS needs LCG-1 for DC04 itself
- And to prepare for DC04 during Q3/Q4 of this year
- The pre-challenge production will use LCG-1 as much as feasible
- Helps burn in LCG-1
- PCP can run 'anywhere'; it is not critically dependent on LCG-1
- CMS-specific extensions to the LCG-1 environment will be minimal and non-intrusive to LCG-1 itself
- SRB CMS-wide for the PCP
- dCache where we want to digitize with pile-up