Title: GRID Nation: building National Grid Infrastructure for Science
1. GRID Nation: building National Grid Infrastructure for Science
- Rob Gardner
- University of Chicago
2. Grid3: an application grid laboratory
[Figure: the virtual data grid laboratory, drawing on the CERN LHC US ATLAS and US CMS testbeds and data challenges, end-to-end HENP applications, and virtual data research.]
3. Grid3 at a Glance
- Grid environment built from core Globus and Condor middleware, as delivered through the Virtual Data Toolkit (VDT)
  - GRAM, GridFTP, MDS, RLS, VDS (see the submission sketch below)
- Equipped with VO and multi-VO security, monitoring, and operations services
- Allows federation with other grids where possible, e.g. the CERN LHC Computing Grid (LCG)
  - US ATLAS GriPhyN VDS execution on LCG sites
  - US CMS storage element interoperability (SRM/dCache)
- Delivering the US LHC Data Challenges
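
In practice these VDT components are driven by VO-level tools, but the basic unit of work is a GRAM job sent to a site gatekeeper. The following is a minimal sketch only, assuming the GT2-era globus-job-run client shipped with the VDT and an already-initialized grid proxy; the gatekeeper contact string is a hypothetical example, not a real Grid3 site.

    # Illustrative only: submit a trivial test job to a Grid3 compute element
    # through its GT2 GRAM gatekeeper using globus-job-run (delivered with the
    # VDT). Assumes a valid grid proxy already exists (grid-proxy-init).
    import subprocess

    def run_test_job(contact="ce.example.edu/jobmanager-condor"):
        """Run /bin/hostname on the remote CE and return its output."""
        result = subprocess.run(
            ["globus-job-run", contact, "/bin/hostname"],
            capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        print("Test job ran on worker node:", run_test_job())
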
4. Grid3 Design
- Simple approach
  - Sites consisting of
    - Computing element (CE)
    - Storage element (SE)
    - Information and monitoring services
  - VO level, and multi-VO
    - VO information services
    - Operations (iGOC)
- Minimal use of grid-wide systems
  - No centralized workload manager, replica or data management catalogs, or command-line interface
  - Higher-level services are provided by individual VOs (see the site-record sketch after this list)
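
Because Grid3 keeps no grid-wide catalogs, each VO maintains its own view of the sites it can use. A minimal sketch of what such a per-VO site record might hold; the field names, hostnames, and paths are illustrative, not the actual Grid3 site catalog schema.

    # Illustrative per-VO site record: one entry per Grid3 site, covering the
    # computing element, storage element, and information endpoints.
    # All hostnames, paths, and field names are hypothetical examples.
    from dataclasses import dataclass

    @dataclass
    class Grid3Site:
        name: str          # site name as published to the Grid3 index
        gatekeeper: str    # GRAM contact for the computing element (CE)
        batch_system: str  # jobmanager type behind the gatekeeper
        gridftp: str       # GridFTP endpoint of the storage element (SE)
        giis: str          # site information service (GIIS) URL
        app_dir: str       # shared application install area
        tmp_dir: str       # shared working/temporary area

    sites = [
        Grid3Site(
            name="UC_Example",
            gatekeeper="grid.example.uchicago.edu/jobmanager-condor",
            batch_system="condor",
            gridftp="gsiftp://grid.example.uchicago.edu/",
            giis="ldap://grid.example.uchicago.edu:2135",
            app_dir="/share/grid3/app",
            tmp_dir="/share/grid3/tmp",
        ),
    ]

    # A VO's own workload system picks a CE from this list and stages data to
    # the matching SE itself, since Grid3 provides no central broker.
    condor_ces = [s.gatekeeper for s in sites if s.batch_system == "condor"]
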
5. Site Services and Installation
- Goal is to install and configure with minimal human intervention
  - Use Pacman and distributed software caches (scripted in the sketch below)
- Registers the site with VO-level and Grid3-level services
- Accounts, application install areas, working directories

    pacman -get iVDGL:Grid3

[Figure: a Grid3 site, showing the compute element (VDT, VO services, GIIS registration, information providers, Grid3 schema, log management), the shared application (app) and temporary (tmp) areas, and the storage element.]
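
A sketch of how the install step might be scripted to keep human intervention minimal, assuming the Pacman client is already on the path. The install location and shared-area paths are hypothetical; the real setup also performs the VO and Grid3 registration steps noted above.

    # Illustrative bootstrap for a Grid3 site install: run the Pacman command
    # shown on the slide, then sanity-check the shared areas the site is
    # expected to provide. Paths below are hypothetical examples.
    import os
    import subprocess

    INSTALL_DIR = "/opt/grid3"                       # hypothetical install location
    SHARED_AREAS = ["/share/grid3/app", "/share/grid3/tmp"]

    def install_grid3():
        # Pacman installs into the current working directory.
        subprocess.run(["pacman", "-get", "iVDGL:Grid3"],
                       cwd=INSTALL_DIR, check=True)

    def check_shared_areas():
        missing = [d for d in SHARED_AREAS if not os.path.isdir(d)]
        if missing:
            raise RuntimeError(f"missing shared areas: {missing}")

    if __name__ == "__main__":
        os.makedirs(INSTALL_DIR, exist_ok=True)
        install_grid3()
        check_shared_areas()
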
6. Multi-VO Security Model
- DOEGrids Certificate Authority
  - PPDG or iVDGL Registration Authority
- Authorization service: VOMS
- Each Grid3 site generates a Globus grid-map file with an authenticated SOAP query to each VO service (simplified in the sketch below)
  - Site-specific adjustments or mappings
  - Group accounts to associate VOs with jobs
[Figure: VOMS servers for US ATLAS, US CMS, SDSS, BTeV, LSC, and iVDGL feed the grid-map file generated at each Grid3 site.]
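
A simplified sketch of the grid-map step. In reality each site runs a VDT-provided tool that makes the authenticated SOAP query to every VO's VOMS server; here the per-VO DN lists are assumed to have been fetched already, and the sketch only shows merging them into group-account mappings plus local overrides. Apart from the VO names, all accounts, DNs, and paths are hypothetical.

    # Simplified illustration of building a Globus grid-map file: map every
    # member DN of each VO to that VO's group account, then apply the
    # site-specific overrides. Fetching the DN lists (the authenticated SOAP
    # query to each VOMS server) is assumed to have happened already.
    GROUP_ACCOUNTS = {        # VO -> local group account (hypothetical names)
        "usatlas": "usatlas1",
        "uscms":   "uscms01",
        "sdss":    "sdss01",
        "ivdgl":   "ivdgl01",
    }

    LOCAL_OVERRIDES = {       # site-specific mapping, e.g. a personal account
        "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 12345": "jdoe",
    }

    def build_gridmap(vo_members, path="grid-mapfile"):
        """vo_members: dict mapping VO name -> list of member DNs."""
        entries = {}
        for vo, dns in vo_members.items():
            account = GROUP_ACCOUNTS[vo]
            for dn in dns:
                entries[dn] = account
        entries.update(LOCAL_OVERRIDES)          # site adjustments win
        with open(path, "w") as f:
            for dn, account in sorted(entries.items()):
                f.write(f'"{dn}" {account}\n')   # standard grid-map file format

    build_gridmap({
        "usatlas": ["/DC=org/DC=doegrids/OU=People/CN=A. Physicist 67890"],
        "uscms":   [],
        "sdss":    [],
        "ivdgl":   [],
    })
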
7. iVDGL Operations Center (iGOC)
- Co-located with the Abilene NOC (Indianapolis)
- Hosts/manages multi-VO services
  - Top-level Ganglia and GIIS collectors
  - MonALISA web server and archival service
  - VOMS servers for iVDGL, BTeV, SDSS
  - Site Catalog service, Pacman caches
  - Trouble ticket systems
    - Phone (24 hr), web, and email based collection and reporting system
- Investigation and resolution of grid middleware problems, at the level of 30 contacts per week
- Weekly operations meetings for troubleshooting
8. Grid3: a snapshot of sites
- Sep 04
- 30 sites, multi-VO
- shared resources
- 3000 CPUs (shared)
9. Grid3 Monitoring Framework
cf. M. Mambelli, B. Kim et al., 490
10. Monitors
[Figures: data I/O (MonALISA), job queues (MonALISA), and metrics (MDViewer).]
11. Use of Grid3 led by US LHC
- 7 scientific applications and 3 CS demonstrators
- A third HEP experiment and two biology experiments also participated
- Over 100 users authorized to run on Grid3
- Application execution performed by dedicated individuals
- Typically only a few users from a particular experiment ran its applications
12. US CMS Data Challenge DC04
[Figure: events produced vs. day, comparing CMS-dedicated Grid3 resources (red) with opportunistic use of non-CMS resources (blue).]
cf. A. Fanfani, 497
13. Ramp-up of ATLAS DC2
cf. R. Gardner et al., 503
14. Shared infrastructure, last 6 months
15. ATLAS DC2 production on Grid3: a joint activity with LCG and NorduGrid
[Figure: validated jobs per day (total); G. Poulard, 9/21/04.]
cf. L. Goossens, 501; O. Smirnova, 499
16. Typical job distribution on Grid3
[Figure: typical job distribution on Grid3; G. Poulard, 9/21/04.]
17. Operations Experience
- iGOC and the US ATLAS Tier1 (BNL) developed an operations response model in support of DC2
  - Tier1 center
    - Core services; an on-call person is always available
    - Response protocol developed
  - iGOC
    - Coordinates problem resolution for the Tier1 during off hours
    - Trouble handling for non-ATLAS Grid3 sites; problems resolved at weekly iVDGL operations meetings
- 600 trouble tickets (generic), 20 ATLAS DC2 specific
- Extensive use of email lists
18. Not major problems
- Bringing sites into single-purpose grids
- Simple computational grids for highly portable applications
- Specific workflows as defined by today's JDL and/or DAG approaches
- Centralized, project-managed grids, up to a particular scale (beyond that, yet to be seen)
19. Major problems: two perspectives
- Site service provider perspective
  - Maintaining multiple logical grids with a given resource; maintaining robustness; long-term management; dynamic reconfiguration; platforms
  - Complex resource sharing policies (department, university, projects, collaborations) and user roles
- Application developer perspective
  - Challenge of building integrated distributed systems
  - End-to-end debugging of jobs and understanding faults
  - Common workload and data management systems developed separately for each VO
20. Grid3 is evolving into OSG
- Main features/enhancements
  - Storage Resource Management
  - Improve authorization service
  - Add data management capabilities
  - Improve monitoring and information services
  - Service challenges and interoperability with other grids
- Timeline
  - Current Grid3 remains stable through 2004
  - Service development continues
    - Grid3dev platform
21. Consortium Architecture
[Figure: OSG process framework, linking the Consortium Board (1), Technical Groups (0..n, small), joint committees (0..N, small), and activities (0..N, large) with service providers, sites, researchers, VO organizations, campuses, labs, enterprise, and research grid projects.]
- Participants provide resources, management, project steering groups
22. OSG deployment landscape
[Figure: VO applications and the technical groups (MonInfo, Policy, Storage, Security, Support Centers), through their chairs, provide architecture, MIS, and policy input to the OSG deployment.]
23. OSG Integration Activity
- Integrate middleware services from technology providers targeted for the OSG
- Develop processes for validation and certification of OSG deployments
- Provide a testbed for evaluation and testing of new services and applications
- Test and exercise installation and distribution methods
- Devise best practices for configuration management
24. OSG Integration, 2
- Establish a framework for releases
- Provide feedback to service providers and VO application developers
- Allow contributions and testing of new, interoperable technologies alongside established baseline services
- Supply requirements and feedback for new tools, technology, and practices in all these areas
25. Validation and Certification
- We will need to develop the model for bringing new services and distributed applications into the OSG
  - Validation criteria will be specified by the service providers (experts)
  - Coherence with the OSG deployment is paramount
  - VOs and experiments will specify what constitutes acceptable functionality
- Validation space (see the probe sketch below)
  - Deployment and configuration process
  - Functionality
  - Scale
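
A sketch of the kind of deployment check a validation pass might automate, assuming the standard GT2-era service ports; the hostname is a placeholder, and a real certification would also submit test jobs and transfers to cover functionality and scale.

    # Illustrative site validation probe: confirm that the services a CE is
    # expected to expose are at least listening on their standard ports.
    # A fuller check would run a GRAM test job and a GridFTP transfer.
    import socket

    SERVICES = {
        "GRAM gatekeeper": 2119,
        "GridFTP": 2811,
        "MDS/GIIS": 2135,
    }

    def probe(host, port, timeout=5.0):
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def validate_site(host):
        return {name: probe(host, port) for name, port in SERVICES.items()}

    if __name__ == "__main__":
        for name, ok in validate_site("ce.example.edu").items():
            print(f"{name:16s} {'OK' if ok else 'FAILED'}")
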
26. Conclusions
- Grid3 taught us many lessons about how to deploy and run a production grid
- Breakthrough in the demonstrated use of opportunistic resources enabled by grid technologies
- Grid3 will be a critical resource for continued data challenges through 2004, and an environment in which to learn how to operate and upgrade large-scale production grids
- Grid3 is evolving into the OSG with enhanced capabilities