Title: The CMS Integration Grid Testbed
1. The CMS Integration Grid Testbed
- Greg Graham
- Fermilab CD/CMS
- 17-January-2003
2. The Integration Grid Testbed
- Controlled environment on which to test the DPE in preparation for release
- Currently, the IGT uses USCMS Tier-1/Tier-2 designated resources
- Soon (end of this month), most of the IGT resources will be turned over to a production grid
- The IGT will retain a small number of resources deemed necessary to do integration testing
- VO management should be flexible enough to allow the production grid (PG) to loan resources to the IGT when needed for scalability tests
- In the meantime, the IGT has been commissioned with real production assignments
- Testing Grid operations, troubleshooting procedures, and scalability issues
3. The Current IGT - Hardware
DGT/IGT sites and hardware:
- CERN LCG: participates with 72 2.4 GHz CPUs at RH7
- Fermilab: 40 dual 750 MHz nodes, 2 servers, RH6
- Florida: 40 dual 1 GHz nodes, 1 server, RH6
- UCSD: 20 dual 800 MHz nodes, 1 server, RH6; new: 20 dual 2.4 GHz nodes, 1 server, RH7
- Caltech: 20 dual 800 MHz nodes, 1 server, RH6; new: 20 dual 2.4 GHz nodes, 1 server, RH7
- UW Madison: not a prototype Tier-2 center; provides support
- Total: 240 RH6 CPUs (0.8 GHz equivalent) and 152 2.4 GHz RH7 CPUs
4. Grid Middleware in the DPE
- Based on the Virtual Data Toolkit 1.1.3
- VDT Client
- Globus Toolkit 2.0
- Condor-G 6.4.3
- VDT Server
- Globus Toolkit 2.0
- mkgridmap
- Condor 6.4.3
- ftsh
- GDMP 3.0.7
- Virtual Organisation Management
- LDAP Server deployed at Fermilab
- Contains the DNs for all US-CMS Grid Users
- GroupMAN (from PPDG, adapted from EDG) is used to manage the VO (a minimal grid-mapfile sketch follows this list)
- Investigating/evaluating the use of VOMS from the EDG
- Use D.O.E. Science Grid certificates
- Accept EDG and Globus certificates
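The end result of the VO machinery is local authorization: the DNs published by the VO server end up in each site's grid-mapfile. The following Python fragment is only a minimal sketch of that last step, not the actual GroupMAN or mkgridmap code; the input file uscms_vo_dns.txt and the pool account uscms01 are illustrative assumptions.

# Minimal sketch: turn a list of member DNs exported from the VO server
# into grid-mapfile entries mapping each DN to a local pool account.
def write_gridmap(dn_file, gridmap_path, local_account="uscms01"):
    with open(dn_file) as f:
        dns = [line.strip() for line in f if line.strip()]
    with open(gridmap_path, "w") as out:
        for dn in dns:
            # grid-mapfile format: "<quoted certificate DN>" <local user>
            out.write('"%s" %s\n' % (dn, local_account))

if __name__ == "__main__":
    write_gridmap("uscms_vo_dns.txt", "grid-mapfile")

In the real deployment the DN list comes from the LDAP server at Fermilab rather than a flat file, and the mapping to local accounts can be richer than this single pool account.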
5. DPE 1.0 Architecture - Layer View
- Value added at each layer
- Job Creation: chains many applications together in a tree-like structure; MCRunJob keeps track of functional dependencies among processing nodes (a small dependency-tree sketch follows this list)
- DAG creation: wraps applications in generic DAGs for co-scheduling; MOP contains the structure of the generic DAGs
- DAGMan/Condor: scheduling layer; on the IGT, scheduling is still a human decision (the DZero version has scheduling)
- Globus and Job Manager: Grid interface with VO services
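To make the "tree-like structure" of the Job Creation layer concrete, here is a toy Python sketch of a processing chain with functional dependencies; the step names are only an example, not the actual IGT configuration.

# Toy sketch: a multi-step processing chain as a dependency tree, ordered so
# that every step runs only after the steps it depends on.
deps = {
    "generation": [],
    "simulation": ["generation"],
    "digitization": ["simulation"],
    "reconstruction": ["digitization"],
}

def topo_order(deps):
    order, done = [], set()
    def visit(step):
        if step in done:
            return
        for parent in deps[step]:
            visit(parent)          # resolve upstream dependencies first
        done.add(step)
        order.append(step)
    for step in deps:
        visit(step)
    return order

print(topo_order(deps))
# ['generation', 'simulation', 'digitization', 'reconstruction']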
6. DPE 1.0 Architecture - Layer View
- Monitoring not shown
- Monitoring information is scattered and unwieldy
- Health monitoring is more or less under control
- Application monitoring can be provided by BOSS
- Configuration monitoring is a new concept
- Monitoring information is not used at any level yet
- Local: Ganglia, MDS, SNMP, Hawkeye (a gmond-polling sketch follows this list)
- Grid: MonALISA, MDS, Hawkeye
- Scheduling not shown
- A roadmap for demonstrating where scheduling decisions can be made
- resource broker vs. scheduler
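As an illustration of what "health monitoring is more or less under control" looks like in practice, the sketch below polls a Ganglia gmond daemon for its XML report and prints one load metric per host. The default TCP port 8649 and the metric name load_one are assumptions about a stock Ganglia setup; host names and ports would differ per site.

# Rough sketch: read the XML report published by a Ganglia gmond daemon and
# print the one-minute load for every host it knows about.
import socket
import xml.etree.ElementTree as ET

def gmond_loads(host="localhost", port=8649):
    with socket.create_connection((host, port)) as s:
        chunks = []
        while True:
            data = s.recv(65536)
            if not data:
                break
            chunks.append(data)
    root = ET.fromstring(b"".join(chunks))
    for node in root.iter("HOST"):
        for metric in node.iter("METRIC"):
            if metric.get("NAME") == "load_one":
                print(node.get("NAME"), metric.get("VAL"))

if __name__ == "__main__":
    gmond_loads()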
7. DPE Architecture - Component View
- MCRunJob uses Configurators to manage
- metadata associated with each production step in a complex tree of multi-step processing
- metadata associated with different runtime environments
- functional dependencies
- Also in use at DZero
- mop_submitter defines generic DAGs
- wraps jobs into DAGs (a DAG-wrapping sketch follows the diagram below)
- submits to DAGMan for execution
- Condor-G
- runs DAG nodes as Globus jobs
- (in lieu of a Condor backend)
- Results are returned to the submit site using the GridFTP protocol
- Could be returned anywhere in practice
- Monitoring (IGT) information is returned using
- Ganglia-to-MonALISA interfaces
- MDS-to-MonALISA interfaces
- Not used in any automatic system
[Diagram: the VDT Client (MCRunJob with Config, Linker, and ScriptGen modules; mop-submitter; DAGMan/Condor-G; GridFTP; Globus) submits requests from the master through Globus to VDT Servers 1..N, each running Condor, Globus, and GridFTP; results return via GridFTP.]
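The following Python sketch shows the flavor of the mop_submitter step: wrap one production job in a generic stage-in / run / stage-out DAG and emit the DAGMan and Condor-G submit files. It is not the real MOP code; the gatekeeper contact string, wrapper script names, and job name are placeholders.

# Sketch in the spirit of mop_submitter: write a generic three-node DAG
# (stage-in -> run -> stage-out) plus one Condor-G submit file per node.
GATEKEEPER = "gatekeeper.example.edu/jobmanager-condor"   # placeholder

SUBMIT_TEMPLATE = """universe        = globus
globusscheduler = {gatekeeper}
executable      = {executable}
output          = {name}.out
error           = {name}.err
log             = {name}.log
queue
"""

def write_generic_dag(job, dag_path="mop.dag"):
    stages = ["stagein", "run", "stageout"]
    with open(dag_path, "w") as dag:
        for stage in stages:
            name = "%s_%s" % (job, stage)
            with open(name + ".sub", "w") as sub:
                sub.write(SUBMIT_TEMPLATE.format(
                    gatekeeper=GATEKEEPER,
                    executable="%s.sh" % stage,   # placeholder wrapper scripts
                    name=name))
            dag.write("JOB %s %s.sub\n" % (name, name))
        # enforce stage-in -> run -> stage-out ordering
        dag.write("PARENT %s_stagein CHILD %s_run\n" % (job, job))
        dag.write("PARENT %s_run CHILD %s_stageout\n" % (job, job))

if __name__ == "__main__":
    write_generic_dag("job0001")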
8. MCRunJob
- MCRunJob was developed at DZero to assist in managing large Monte Carlo productions.
- It has also been used at the tail end of Spring02 Monte Carlo production in CMS, and was used in the IGT production to chain different processing steps together into a complex workflow description.
- It will be used in future CMS productions.
- MCRunJob has a modular architecture which is metadata oriented.
- Implemented in Python
- Receives input from many different sources
- Targets many different processing steps
- Targets many different runtime environments
9. MCRunJob Architecture in Brief
- Metadata containers are called Configurators
- Configurators can communicate with each other in structured ways
- namely, when a dependency relationship is declared
- Configurators can explicitly declare metadata elements to depend on other elements
- All external entities are represented in MCRunJob as Configurators (i.e., as metadata collections)
- The RefDB at CERN, SAM, an application in a processing chain
- Scripts are generated by registered objects with the ScriptGen interface (not shown)
- You guessed it: Configurators! (a toy Configurator sketch follows this list)
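The sketch below is a toy rendering of the Configurator idea, not the real MCRunJob classes: a Configurator is a named bag of metadata, dependencies are declared explicitly, and metadata lookups may follow declared dependencies. The step names and numbers are invented.

# Toy Configurator sketch: metadata container with declared dependencies.
class Configurator:
    def __init__(self, name, **metadata):
        self.name = name
        self.metadata = dict(metadata)
        self.depends_on = []

    def add_dependency(self, other):
        # Communication between Configurators is only along declared dependencies.
        self.depends_on.append(other)

    def lookup(self, key):
        # Resolve a metadata element locally or from declared dependencies.
        if key in self.metadata:
            return self.metadata[key]
        for parent in self.depends_on:
            try:
                return parent.lookup(key)
            except KeyError:
                pass
        raise KeyError(key)

# Example: a simulation step takes its run number from the generator step.
generator = Configurator("generator", run_number=2433, events=250)
simulation = Configurator("simulation", application="simhits")
simulation.add_dependency(generator)
print(simulation.lookup("run_number"))   # 2433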
10. MCRunJob Architecture in Brief
- Special Configurators called ScriptGenerators manage building of concrete executable scripts to implement the workplan (a sketch follows this list)
- Currently targets DZero production, Impala (CMS production), and the Virtual Data Language (Chimera)
- Incidentally, this means that MCRunJob can translate among these environments.
- Users and machines communicate with MCRunJob through a macro language
- In CMS, the macro language has been implemented as registered external functions on top of the Configurator API
- Thus the macro language itself is extensible to meet the needs of different experiments
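To show the shape of the ScriptGenerator and macro-language ideas, here is a small self-contained Python sketch. The command lines it emits are placeholders, not real Impala, DZero, or Chimera/VDL output, and the macro name and metadata values are invented for illustration.

# Toy ScriptGenerator: take per-step metadata and emit one concrete script.
def generate_script(steps, path="run_chain.sh"):
    # steps: list of dicts with 'application', 'run_number', 'events'
    with open(path, "w") as out:
        out.write("#!/bin/sh\n")
        for step in steps:
            out.write("%(application)s --run %(run_number)s --events %(events)s\n" % step)

# A macro language can be layered on top by registering functions by name;
# extensibility then just means registering new commands.
MACROS = {"generate": generate_script}

def run_macro(line, steps):
    parts = line.split()
    MACROS[parts[0]](steps, *parts[1:])

run_macro("generate run_chain.sh", [
    {"application": "generator", "run_number": 2433, "events": 250},
    {"application": "simhits",   "run_number": 2433, "events": 250},
])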
11. IGT Production Results
- The IGT progress was remarkably consistent.
- Compare to Spring 2002 official production
- However, we did not do production with pileup.
- Two flatliners:
- SC2002 conference
- the infamous // bug
- Holidays
- Eid, the end of Ramadan (Anzar Afaq was the man behind the curtain.)
- Word just in from Globusworld:
- The IGT has garnered some attention in plenary sessions there!
12. IGT Production Results
- Efficiency Estimates (a back-of-the-envelope check follows this list)
- Single events took about 430 sec on 750 MHz processors.
- Theoretical IGT maximum throughput was therefore 45K events per day, adjusting all processors to 750 MHz equivalents
- Efficiency was calculated as throughput normalized to the maximum.
- Manpower estimate: 1 FTE, peaking at 2.5 during troubleshooting
- This is above the normal admin support; we did not have a helpdesk, etc.
- Data gathered by informal survey.
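The arithmetic behind the ceiling can be reproduced with a few lines of Python. Only the 430 sec/event and 45K events/day figures come from this talk; the CPU count below is solved from those two numbers and is therefore an inference, and the 30K events/day in the example call is made up.

# Back-of-the-envelope reproduction of the efficiency estimate.
SECONDS_PER_EVENT = 430.0        # per event on a 750 MHz processor
SECONDS_PER_DAY = 86400.0
MAX_EVENTS_PER_DAY = 45000.0     # stated theoretical IGT maximum

# 750 MHz-equivalent processors implied by the stated ceiling:
implied_cpus = MAX_EVENTS_PER_DAY * SECONDS_PER_EVENT / SECONDS_PER_DAY
print("~%.0f equivalent CPUs" % implied_cpus)        # ~224

def efficiency(observed_events_per_day):
    # Throughput normalized to the theoretical maximum.
    return observed_events_per_day / MAX_EVENTS_PER_DAY

print("%.0f%%" % (100 * efficiency(30000)))          # example: 30K/day -> 67%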
13. A Sampler of IGT Lessons
- The infamous (and amusing!) // problem (a toy illustration follows this list)
- UNIX filenames allow // in pathnames; the CMS C++ applications interpret // as a comment. Instant mayhem.
- This bug was first identified, incorrectly, as a middleware problem.
- Symptom: the application dumps 230 MB of binary data to stdout, which caused the GASS cache to fail.
- A production expert identified the problem within 5 minutes.
- But only after days of middleware troubleshooting
- Lessons
- Problems come in all shapes and sizes!!!
- We have learned yet again that there's always something new that can go wrong.
- Need better application monitoring
- Can be provided by BOSS
- Need better error reporting and problem routing
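A toy version of the bug, far removed from the actual CMS application but showing the mechanism: a card-file reader that strips C++-style // comments silently truncates a perfectly legal UNIX path that happens to contain a double slash. The path and parameter name below are invented.

# Toy illustration of the // problem (not the actual CMS parser).
def strip_cpp_comment(line):
    # Treat everything after "//" as a comment, C++-style.
    return line.split("//", 1)[0].rstrip()

card_line = "OutputFile = /data/igt//run2433/hits.root"   # '//' is legal in a path
print(strip_cpp_comment(card_line))
# -> "OutputFile = /data/igt"  ... the rest of the path is silently dropped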
14. A Sampler of IGT Lessons
- Automatic scheduling should be done in an environment with reliable health monitoring information (a sketch of the failure mode follows this list)
- Poor man's script-based queue-length scheduling failed because of particular middleware failures
- Job managers often lose contact with jobs in failure mode
- Condor has no independent way of verifying job status and assumes the job is dead
- Queue-length-based scheduling leads to broken farms!
- New scalability problems always lurk behind the next modest increase in scale
- 200 or so jobs was OK; when the CERN IGT site joined, we discovered a new GAHP server scale limitation of 250 jobs per MOP master site.
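The failure mode is easy to see in a toy scheduler. In the sketch below (site names and numbers are made up), a farm whose job manager has lost track of its jobs reports an empty queue, so a naive queue-length rule keeps feeding the broken site; adding a health check avoids it.

# Why queue-length scheduling breaks without health monitoring.
sites = {
    "site_a": {"queued": 180, "healthy": True},
    "site_b": {"queued": 150, "healthy": True},
    "site_c": {"queued": 0,   "healthy": False},  # jobmanager lost its jobs
}

def pick_site_by_queue_length(sites):
    # Naive rule: send work wherever the queue looks shortest.
    return min(sites, key=lambda s: sites[s]["queued"])

def pick_site_with_health_check(sites):
    # Same rule, restricted to sites whose health monitoring looks sane.
    alive = {s: v for s, v in sites.items() if v["healthy"]}
    return min(alive, key=lambda s: alive[s]["queued"])

print(pick_site_by_queue_length(sites))    # site_c -> the broken farm gets the jobs
print(pick_site_with_health_check(sites))  # site_b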
15. Conclusions
- The Integration Grid Testbed has achieved a measure of success using the USCMS designated DPE.
- 1.5 M events were produced for CMS
- The Grid environment operated in continuous fashion for over 2 months
- Operational experiences of running the Grid were documented
- Functionality was limited
- But it matched the available manpower well
- Much can be addressed in MCRunJob also
- Success is recognized by demos at SC2002 and a poster at Globusworld.
16. Acknowledgements
- Many thanks to Peter Couvares (Condor), Anzar Afaq (PPDG), and Rick Cavanaugh (iVDGL/GriPhyN) for heroic efforts!
- As usual, many more that I am not mentioning here...
- For further reference:
- http://www.uscms.org/scpages/subsystems/DPE/index.html
- http://computing.fnal.gov/cms/Monitor/cms_production.html
- http://home.fnal.gov/ggraham/MCRunJob_Presentations/