Title: The CMS Integration Grid Testbed and Distributed Processing Environment
1. The CMS Integration Grid Testbed and Distributed Processing Environment
- Greg Graham
- Fermilab CD/CMS
- 16-Jan-2003
2. Goals
- CMS must have a working worldwide distributed system enabling effective collaboration among US-based physicists and their colleagues worldwide.
- Issues of scale
  - Thousands of people
  - Petabytes of distributed data
  - Increasing complexity
- Tools must handle the scale in an integrated fashion
- The Grid has shown promise as a general framework in pursuit of this goal.
3. Some Significant Questions
- How does CMS move Grid technology effectively from the drawing board to real production services?
- How does CMS select from among many possible emerging technologies?
- How does CMS SC ramp up from the current level of effort to that required for a production Grid
  - in time for DC04?
  - in time for real data taking?
- How can USCMS make an effective contribution?
4. Strategic Focus
- Maintain production-quality services with maximum possible flexibility
  - because new technology is coming on line all the time
- Complement the expected LCG services with an effective regime of Grid prototyping.
- Cooperate with the Trillium groups and directly with middleware providers.
  - USCMS has a very successful and strong relationship with the Condor Team.
- We need to stay in close contact with the LCG
  - We aim to provide what the LCG will provide before they announce what they will provide.
5. What is Needed
- A focused, CMS-oriented R&D program is needed (in addition to external Grid research projects)
  - Prototyping: a rolling prototype with emphasis on high availability (HA)
  - Integration of Grid tools with existing CMS environments
    - starting with Monte Carlo production
  - Gaining experience with Grid middleware
- A middleware support plan with the required level of effort
- Training and documentation
- A guiding management plan and an effective WBS structure
  - The structure should contain mechanisms for change
  - The plan should be comprehensive
6. How the IGT Will Help
- The Integration Grid Testbed (IGT) complements the existing Grids in USCMS
  - DGT, the Development Grid Testbed (the initial state)
    - Speculative development
    - New tools, APIs, software layers, etc.
  - PG, the Production Grid (the final state)
    - No middleware development
    - Production-quality services
- The IGT is a transitional state where new technologies are integrated into existing environments
  - We expect to integrate Trillium/LCG-provided software here
- This is the industry-recognized development-integration-release cycle
7. The Current IGT - Hardware

  Site         DGT                                     IGT
  -----------  --------------------------------------  ---------------------------------------
  CERN (LCG)   --                                      participates with 72 2.4 GHz CPUs, RH7
  Fermilab     40 dual 750 MHz nodes, 2 servers, RH6   --
  Florida      40 dual 1 GHz nodes, 1 server, RH6      --
  UCSD         20 dual 800 MHz nodes, 1 server, RH6    new: 20 dual 2.4 GHz nodes, 1 server, RH7
  Caltech      20 dual 800 MHz nodes, 1 server, RH6    new: 20 dual 2.4 GHz nodes, 1 server, RH7
  UW Madison   not a prototype Tier-2 center; provides support
  -----------  --------------------------------------  ---------------------------------------
  Total        240 0.8 GHz-equivalent RH6 CPUs         152 2.4 GHz RH7 CPUs
8. How the DPE Will Help
- The Distributed Processing Environment (DPE) comprises the software that implements the rolling prototype
- The DPE is a container for software that is developed externally (i.e., we have no developers)
- The DPE is a structure within which we do integration testing
- The WBS structure of the DPE is comprehensive
  - Effort is reported from outside where applicable
  - Helps throw focus on areas where further effort is needed
- Rolling prototype
  - The DPE prototype must never be seriously broken.
  - Maximum flexibility to schedule rapid deployment of some Grid tools
  - Provides a continual baseline to limit exposure to missing Grid tools
9. Major Areas of the DPE
- 1.3.1 Self Evaluations
- 1.3.2 Evaluations of External Software
  - This is where Grid software can be explicitly tracked
- 1.3.3 Integration Rolling Prototype
- 1.3.4 Support and Transitioning
- 1.3.5 Milestones
- 1.3.6 Tier-0/Tier-1/Tier-2 Integration
- The next few slides highlight recent progress in the DPE.
  - There is not time to show it all...
10. DPE Progress
- 1.3.1 Evaluations of Current Practice
  - 1.3.1.1 Production Processing Tools Review (Bi-Annual)
    - Preliminary draft Nov. 2002
  - 1.3.1.2 Analysis Tools Review (Bi-Annual)
    - First will take place next summer.
  - 1.3.1.3 Domain Analysis (Semi-Annual)
    - First will take place in Spring 2003.
  - 1.3.1.4 CMS Software Tutorial (Annual)
    - UCSD tutorial, Spring 2002
    - Next will coincide with the LHC Workshop at FNAL this Spring.
11. DPE Progress
- 1.3.2 Evaluations of External Software Developments
  - 1.3.2.1 Grid Integration Task (Annual)
    - Provided by CCS (C. Grandi); first report out Jan 2002
  - 1.3.2.2 Testbed Deployments of Tools and Systems (Ongoing)
    - Testbed deployment of the Virtual Data Toolkit (VDT) in preparation for the Integration Grid Testbed and Production Grid.
  - 1.3.2.3 LHC Computing Grid (Ongoing)
    - Liaison established; the LCG participates in the USCMS-led Integration Grid Testbed (IGT).
  - 1.3.2.4 Chimera (Ongoing)
    - Chimera v1.0 deployed and tested successfully on the DGT.
    - CMS MCRunJob production tool successfully integrated with the Virtual Data Language (VDL).
12. DPE Prototype Progress
- 1.3.3.1 Overall Architecture (Semi-Annual)
  - DPE 1.0 defined (to be released at the end of January)
- 1.3.3.2 Distributed Process Management/Batch Job Scheduling
  - Defined the Micro/Mini/Group DAG structure of jobs.
- 1.3.3.4 Virtual Organization
  - New VO management plan crafted Dec 2002: what is needed to deploy VOMS and the EDG gatekeeper in the FNAL security environment.
- 1.3.3.5 Monitoring
  - Interface of MDS and MonALISA
  - Interface of Ganglia and MonALISA
  - Configuration and application monitoring tools needed
    - Application monitoring can be provided by BOSS
13. DPE Prototype Progress
- 1.3.3.6 Dataset Tracking
  - 1.3.3.6.1 Metadata Definition
    - Started Nov. 2002
  - 1.3.3.6.2 Replica Catalogues
    - Started investigations into SRB, Nov 2002
    - Plan PACMAN deployment of SRB in Feb. 2003
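At its core, a replica catalogue (SRB's MCAT included) is a mapping from a logical file name to one or more physical locations. A minimal in-memory sketch of the idea; the class, method names, and file names are invented for illustration and the real SRB interface is far richer:

```python
# Minimal in-memory model of a replica catalogue:
# logical file name (LFN) -> set of physical file names (PFNs).
# Interface and example names are invented for this sketch.
from collections import defaultdict

class ReplicaCatalogue:
    def __init__(self):
        self._replicas = defaultdict(set)

    def register(self, lfn, pfn):
        """Record one more physical copy of a logical file."""
        self._replicas[lfn].add(pfn)

    def lookup(self, lfn):
        """Return all known physical copies of a logical file."""
        return sorted(self._replicas.get(lfn, ()))

rc = ReplicaCatalogue()
rc.register("lfn:run42.ntpl", "gsiftp://fnal.example/data/run42.ntpl")
rc.register("lfn:run42.ntpl", "srb://ucsd.example/cms/run42.ntpl")
print(rc.lookup("lfn:run42.ntpl"))
```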
- 1.3.3.7 Data Movement
  - 1.3.3.7.1 Storage System Interfaces
    - Collaboration with Fermilab/CCF on dCache with the GridFTP protocol
    - Part of work towards a more general MSS-to-MSS API
  - 1.3.3.7.3 Performance Optimization
    - Window sizes adjusted to their optimal value in globus-url-copy, in the context of investigations into many tools
- 1.3.3.8 Resource Brokering
  - Discussions underway with the Condor Team
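The "optimal value" for the TCP window in the transfer tuning above comes from the bandwidth-delay product: a stream cannot fill the pipe unless its window is at least bandwidth times round-trip time. A quick sketch of the arithmetic, with example figures rather than measured IGT link parameters:

```python
# TCP window sizing via the bandwidth-delay product (BDP).
# The link figures below are illustrative, not measured IGT numbers.

def bdp_bytes(bandwidth_mbps, rtt_ms):
    """Minimum TCP window (bytes) to keep a link of the given
    bandwidth and round-trip time fully utilized."""
    bytes_per_sec = bandwidth_mbps * 1e6 / 8
    return int(bytes_per_sec * rtt_ms / 1e3)

# e.g. a 622 Mb/s transatlantic link with 120 ms RTT
window = bdp_bytes(622, 120)
print(window)  # bytes needed to fill the pipe
```

In practice this value would be handed to globus-url-copy through its TCP buffer-size option rather than left at the system default.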
14. DPE Prototype Progress
- 1.3.3.9 Production Processing Support Tools
  - 1.3.3.9.1 Request Handlers/Trackers
    - Provided by CCS: the RefDB.
  - 1.3.3.9.2 Job Builders
    - Maintained the Impala bash-script-based tools.
    - Released the MCRunJob Python-based tools, which replace Impala.
  - 1.3.3.9.3 User Interfaces
    - Version 0.9 of the MCRunJob GUI
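A job builder in the MCRunJob style expands a production request description into runnable job scripts. A schematic of that metadata-driven approach; the request fields and script template are invented for this sketch and do not reflect the real MCRunJob configurator API:

```python
# Schematic metadata-driven job builder in the spirit of MCRunJob.
# Request fields, commands, and hosts are invented for illustration.

def build_job_script(request):
    """Expand a production request into a runnable batch script."""
    lines = [
        "#!/bin/sh",
        f"# request {request['id']}: {request['dataset']}",
        f"cmsim --dataset {request['dataset']}"
        f" --first-event {request['first_event']}"
        f" --num-events {request['num_events']}",
        f"globus-url-copy file:///tmp/out.fz {request['destination']}",
    ]
    return "\n".join(lines)

req = {"id": 42, "dataset": "example_bigjets",
       "first_event": 1, "num_events": 500,
       "destination": "gsiftp://se.example.org/cms/out.fz"}
print(build_job_script(req))
```

The point of the Python rewrite over the Impala shell scripts is visible even here: the request is structured data, so the same builder can target different schedulers or stage-out methods.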
- 1.3.3.10 Analysis Tools
  - Provided by CAIGEE
- 1.3.3.11 Internal Milestones
  - SC2002 milestone met, Nov 2002: soup-to-nuts production/analysis in a Grid environment
  - Production Grid milestone: DPE 1.0 release at the end of January
15. DPE Prototype Progress
- 1.3.3.13 Software Quality Assurance
  - 1.3.3.13.2 Implementation of Software Quality Assurance Tools
    - Started regular test procedures before releases of production tools.
  - 1.3.3.13.3 Technical Writing Assistance
    - 0.1 FTE assigned in Jan 2003
- 1.3.3.15 Prototype Evaluations (Ongoing)
  - Problems section written in the IGT-long.ps document.
16. DPE Progress
- 1.3.4 Software Support and Transitioning
  - 1.3.4.1 Software Environment
    - Provided by the VDT this year
  - 1.3.4.2 Release Management (Semi-Annual)
    - To begin in Jan 2003
  - 1.3.4.3 Release Notes (Semi-Annual)
    - To begin in Jan 2003
  - 1.3.4.4 Deployment Support
    - Provided by the VDT; Tier-1 site support coming soon
- 1.3.5 External Milestones
  - LCG 24x7 Production Grid
  - DC04 pre-challenge production
  - DC04 itself
17. DPE Planning
- In addition to the WBS dictionary, there is an ambitious meeting schedule
  - Weekly short-term focus meetings
  - Monthly milestone meetings
  - Semi-annual WBS and project review, release status
  - Bi-annual
    - Production Tools Review
    - Analysis Tools Review
- More detailed plans are being drawn up for
  - providing production Grid services
  - a VOMS-based (and FNAL-compliant) VO structure
    - EDG is producing this, probably(?) to be adopted by the LCG
18. SC2002 Highlights
- The SC2002 soup-to-nuts demonstration was proposed in April 2002.
- Production phase: generate Monte Carlo with MCRunJob on the Grid.
- Analysis phase: analysis of distributed ROOT files using CLARENS, live on the show floor.
19. DPE in Practice on the IGT
- The story begins before the IGT
  - The USMOP site was commissioned in Spring 2002
  - Middleware was found lacking
- The IGT was commissioned in October 2002
  - September engineering run with 50K events
  - The middleware was declared to be DPE 0.99
- 1.5M official CMS events were produced
  - 1 FTE of sustained effort, peaking at 2.5 (PPDG)
  - But functionality was light
- Documentation
  - Lots already written (and being integrated)
  - Papers coming out
20. Conclusions
- The IGT is a necessary addition to the DGT and Production Grid services
  - A CMS-oriented layer focused on preparing releases for production
  - coming from the LCG or in addition to the LCG!
- It relies heavily on developers and expertise outside of USCMS
  - This is a risk
  - We would have liked to explore configuration monitoring, scheduling, and production tools development in more detail.
- The DPE is a necessary structure that allows us to track what is installed on the IGT and on the PG.
  - An aid to planning for USCMS, not in competition with the LCG
    - though we would like to see more cooperation with the LCG
  - It may be useful as a vehicle to provide support for Grid tools
21. Acknowledgements
- Many thanks to the Condor Team and to the Development Grid Testbed
  - Especially Rick Cavanaugh, Anzar Afaq
- For further reference:
  - http://www.uscms.org/scpages/subsystems/DPE/index.html
  - http://computing.fnal.gov/cms/Monitor/cms_production.html