Title: The Open Science Grid
1. The Open Science Grid
Miron Livny, OSG Facility Coordinator, University of Wisconsin-Madison
2. Some History and Background
3. U.S. Trillium Grid Partnership
- Trillium = PPDG + GriPhyN + iVDGL
  - PPDG (Particle Physics Data Grid): $18M (DOE), 1999-2006
  - GriPhyN: $12M (NSF), 2000-2005
  - iVDGL: $14M (NSF), 2001-2006
- Basic composition (150 people)
  - PPDG: 4 universities, 6 labs
  - GriPhyN: 12 universities, SDSC, 3 labs
  - iVDGL: 18 universities, SDSC, 4 labs, foreign partners
  - Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO, SDSS/NVO
- Complementarity of projects
  - GriPhyN: CS research, Virtual Data Toolkit (VDT) development
  - PPDG: end-to-end Grid services, monitoring, analysis
  - iVDGL: Grid laboratory deployment using the VDT
  - Experiments provide frontier challenges
- Unified entity when collaborating internationally
4. From Grid3 to OSG
[Timeline figure: Grid3 (11/03) evolving through the OSG 0.2.1, 0.4.0, 0.4.1, and 0.6.0 releases over 2/05-7/06]
5. What is OSG?
- The Open Science Grid is a US national distributed computing facility that supports scientific computing via an open collaboration of science researchers, software developers and computing, storage and network providers. The OSG Consortium is building and operating the OSG, bringing resources and researchers from universities and national laboratories together and cooperating with other national and international infrastructures to give scientists from a broad range of disciplines access to shared resources worldwide.
6. The OSG Project
- Co-funded by DOE and NSF at an annual rate of $6M for 5 years starting FY-07
- Currently the main stakeholders are from physics - the US LHC experiments, LIGO, the STAR experiment, the Tevatron Run II and Astrophysics experiments
- A mix of DOE-lab and campus resources
- Active engagement effort to add new domains and resource providers to the OSG consortium
7. OSG Consortium
8. OSG Project Execution
- OSG PI: Miron Livny
- Executive Director: Ruth Pordes
- Deputy Executive Directors: Rob Gardner, Doug Olson
- OSG Executive Board
- External Projects (role includes provision of middleware)
- Resources Managers: Paul Avery, Albert Lazzarini
- Facility Coordinator: Miron Livny
- Applications Coordinators: Torre Wenaus, Frank Würthwein
- Education, Training, Outreach Coordinator: Mike Wilde
- Security Officer: Don Petravick
- Engagement Coordinator: Alan Blatecky
- Operations Coordinator: Leigh Grundhoefer
- Software Coordinator: Alain Roy
9. OSG Principles
- Characteristics
  - Provide guaranteed and opportunistic access to shared resources.
  - Operate a heterogeneous environment, both in services available at any site and for any VO, and multiple implementations behind common interfaces.
  - Interface to campus and regional grids.
  - Federate with other national/international grids.
  - Support multiple software releases at any one time.
- Drivers
  - Delivery to the schedule, capacity and capability of LHC and LIGO.
  - Contributions to/from and collaboration with the US ATLAS, US CMS, and LIGO software and computing programs.
  - Support for/collaboration with other physics/non-physics communities.
  - Partnerships with other grids - especially EGEE and TeraGrid.
  - Evolution by deployment of externally developed new services and technologies.
10. Grid of Grids - from Local to Global
[Diagram: national, campus, and community grids]
11. Who are you?
- A resource can be accessed by a user via the campus, community or national grid.
- A user can access a resource with a campus, community or national grid identity.
12. OSG sites
13. Running (and monitored) OSG jobs in 06/06
14. Example GADU run in 04/06
15. CMS Experiment - an exemplar community grid
[Map figure: CMS sites spanning OSG and EGEE - CERN, USA (UNL, MIT), Germany, France]
Data and jobs move locally, regionally and globally within the CMS grid, transparently across grid boundaries from campus to global.
16. The CMS Grid of Grids
- Job submission
  - 16,000 jobs per day submitted across EGEE and OSG via the INFN Resource Broker (RB).
- Data transfer (a rough conversion between these volumes and line rates follows below)
  - Peak I/O of 5 Gbps from FNAL to 32 EGEE and 7 OSG sites.
  - All 7 OSG sites have reached the 5 TB/day goal.
  - 3 OSG sites (Caltech, Florida, UCSD) exceeded 10 TB/day.
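For context, a back-of-the-envelope conversion between the per-day volume goals and the quoted peak rates - a rough sanity check added here, not figures from the talk:

```python
# Rough conversion between daily transfer volume and sustained line rate.
# Assumes decimal units (1 TB = 10**12 bytes) and a full 86400-second day.
def tb_per_day_to_gbps(tb_per_day: float) -> float:
    bits = tb_per_day * 1e12 * 8          # bytes -> bits
    return bits / 86400 / 1e9             # per second, in Gbit/s

print(f"5 TB/day  ~ {tb_per_day_to_gbps(5):.2f} Gbps sustained")   # ~0.46 Gbps
print(f"10 TB/day ~ {tb_per_day_to_gbps(10):.2f} Gbps sustained")  # ~0.93 Gbps
# So the 5 Gbps peak is roughly 10x the average rate implied by the 5 TB/day goal.
```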
17. CMS Xfer on OSG
All sites have exceeded 5 TB per day in June.
18. CMS Xfer FNAL to World
- The US CMS center at FNAL transfers data to 39 sites worldwide in the CMS global transfer challenge.
- Peak transfer rates of 5 Gbps are reached.
19. EGEE-OSG Inter-operability
- Agree on a common Virtual Organization Management System (VOMS).
- Active joint security groups leading to common policies and procedures.
- Condor-G interfaces to multiple remote job execution services (GRAM, Condor-C); see the submit sketch below.
- File transfers using GridFTP.
- SRM V1.1 for managed storage access; SRM V2.1 in test.
- Publish the OSG BDII to a shared BDII so Resource Brokers can route jobs across the two grids.
- Automate ticket routing between GOCs.
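To illustrate the Condor-G bullet above, here is a minimal grid-universe submit description of the kind Condor-G accepts for a pre-WS GRAM gatekeeper, written as a small Python helper. The gatekeeper host, file names, and proxy path are placeholders, not sites or settings from the talk.

```python
# Sketch: generate a Condor-G submit description that sends a job to a
# remote GRAM gatekeeper (grid universe).  The gatekeeper host, executable,
# and proxy path below are hypothetical placeholders.
import subprocess

SUBMIT = """\
universe        = grid
grid_resource   = gt2 gatekeeper.example.edu/jobmanager-condor
executable      = analyze.sh
arguments       = run42
output          = analyze.out
error           = analyze.err
log             = analyze.log
x509userproxy   = /tmp/x509up_u1000
should_transfer_files   = YES
when_to_transfer_output = ON_EXIT
queue
"""

with open("gram_job.sub", "w") as f:
    f.write(SUBMIT)

# Hand the description to Condor-G; swapping the grid_resource line to
# "condor <schedd> <central-manager>" would target a Condor-C endpoint instead.
subprocess.run(["condor_submit", "gram_job.sub"], check=True)
```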
20. OSG Middleware Layering
- Applications: CMS Services & Framework; CDF, D0 SAMGrid Framework; ATLAS Services & Framework; LIGO Data Grid
- OSG Release Cache: VDT plus configuration, validation, VO management
- Virtual Data Toolkit (VDT): common services - NMI plus VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ
- Infrastructure
21. OSG Middleware Pipeline
- Inputs: domain science requirements; Condor, Globus, EGEE, etc.; OSG stakeholders and middleware developer (joint) projects
- Test on a VO-specific grid
- Integrate into a VDT release; deploy on the OSG integration grid
- Test interoperability with EGEE and TeraGrid
- Provision in an OSG release; deploy to OSG production
22. The Virtual Data Toolkit
Alain Roy, OSG Software Coordinator, Condor Team, University of Wisconsin-Madison
23. What is the VDT?
- A collection of software
  - Grid software (Condor, Globus and lots more)
  - Virtual Data System (origin of the name VDT)
  - Utilities
- An easy installation
  - Goal: push a button, everything just works
  - Two methods
    - Pacman: installs and configures it all
    - RPM: installs some of the software, no configuration
- A support infrastructure
24. How much software?
25. Who makes the VDT?
- The VDT is a product of the Open Science Grid (OSG)
  - The VDT is used on all OSG grid sites
- OSG is new, but the VDT has been around since 2002
  - Originally, the VDT was a product of GriPhyN/iVDGL
  - The VDT was used on all Grid2003 sites
26. Who makes the VDT?
- 1 mastermind: Miron Livny
- 3 FTEs: Alain Roy, Tim Cartwright, Andy Pavlo
27. Who uses the VDT?
- Open Science Grid
- LIGO Data Grid
- LCG (LHC Computing Grid, from CERN)
- EGEE (Enabling Grids for E-Science)
28. Why should you care?
- The VDT gives insight into the technical challenges in building a large grid
  - What software do you need?
  - How do you build it?
  - How do you test it?
  - How do you deploy it?
  - How do you support it?
29. What software is in the VDT?
- Security
  - VOMS (VO membership)
  - GUMS (local authorization)
  - mkgridmap (local authorization)
  - MyProxy (proxy management)
  - GSI SSH
  - CA CRL updater
- Monitoring
  - MonALISA
  - gLite CEMon
- Accounting
  - OSG Gratia
- Job Management
  - Condor (including Condor-G and Condor-C)
  - Globus GRAM
- Data Management
  - GridFTP (data transfer)
  - RLS (replica location)
  - DRM (storage management)
  - Globus RFT
- Information Services
  - Globus MDS
  - GLUE schema providers
Note: the type, quantity, and variety of software is more important to my talk today than the specific software I'm naming. (A small example using the security clients follows below.)
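To make the security entries above concrete, a minimal sketch of the user-side credential flow with these tools; the VO name and lifetime are hypothetical, and exact flags depend on the installed VOMS client version.

```python
# Sketch: obtain a VOMS-extended proxy and inspect it, using the command-line
# clients shipped in the VDT's security stack.  "myvo" is a placeholder VO.
import subprocess

# Create a short-lived proxy carrying VOMS attributes for the VO "myvo".
subprocess.run(["voms-proxy-init", "-voms", "myvo", "-valid", "12:00"], check=True)

# Show the proxy's subject, VO attributes (FQANs), and remaining lifetime.
subprocess.run(["voms-proxy-info", "-all"], check=True)
```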
30. What software is in the VDT?
- Client tools
  - Virtual Data System
  - SRM clients (V1 and V2)
  - UberFTP (GridFTP client)
- Developer tools
  - PyGlobus
  - PyGridWare
- Testing
  - NMI Build & Test
  - VDT tests
- Support
  - Apache
  - Tomcat
  - MySQL (with MyODBC)
  - Non-standard Perl modules
  - Wget
  - Squid
  - Logrotate
  - Configuration scripts
  - And more!
31. Building the VDT
- We distribute binaries
  - Expecting everyone to build from source is impractical
  - Essential to be able to build on many platforms, and to replicate builds
- We build all binaries with the NMI Build and Test infrastructure
32. Building the VDT
[Build-flow diagram: contributors' sources (CVS) are built and tested on the NMI Build & Test Condor pool (70 computers); the resulting binaries are tested, packaged, and patched by the VDT team, then published to the Pacman cache and as RPM downloads for users.]
33. Testing the VDT
- Every night, we test
  - Full VDT install
  - Subsets of the VDT
  - The current release (you might be surprised how often things break after release!)
  - The upcoming release
- On all supported platforms
  - Supported means we test it every night
  - The VDT works on some unsupported platforms
- We care about interactions between the software (a sketch of such a nightly matrix follows)
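A rough sketch of what such a nightly matrix amounts to - looping over platforms and install configurations and recording failures. The platform list, driver script name, and reporting are illustrative, not the actual VDT test harness.

```python
# Sketch of a nightly test matrix: every supported platform x every install
# configuration.  Platform names, the driver script, and the report format
# are placeholders, not the real VDT nightly infrastructure.
import itertools, subprocess, datetime

PLATFORMS = ["rhas-3", "rhas-4", "fc-4", "debian-3.1"]        # illustrative
CONFIGS   = ["full", "client-only", "server-only"]            # illustrative

failures = []
for platform, config in itertools.product(PLATFORMS, CONFIGS):
    # In the real system each combination runs on a matching NMI build host;
    # here we just shell out to a hypothetical driver script.
    result = subprocess.run(
        ["./run-vdt-test.sh", "--platform", platform, "--config", config])
    if result.returncode != 0:
        failures.append((platform, config))

# The daily e-mail reminder mentioned on the "Tests" slide.
stamp = datetime.date.today().isoformat()
print(f"[{stamp}] {len(failures)} failing combinations: {failures}")
```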
34. Supported Platforms
- Fedora Core 3
- Fedora Core 4
- Fedora Core 4/x86-64
- ROCKS 3.3
- SuSE 9/ia64
- RedHat 7
- RedHat 9
- Debian 3.1
- RHAS 3
- RHAS 3/ia64
- RHAS 3/x86-64
- RHAS 4
- Scientific Linux 3
- The number of Linux distributions grows constantly, and they have important differences
- People ask for new platforms, but rarely ask to drop platforms
- System administration for heterogeneous systems is a lot of work
35. Tests
- Results on web
- Results via email
- A daily reminder!
36. Deploying the VDT
- We want to support root and non-root installations
- We want to assist with configuration
- We want it to be simple
- Our solution: Pacman (an illustrative install command follows below)
  - Developed by Saul Youssef, BU
  - Downloads and installs with one command
  - Asks questions during install (optionally)
  - Does not require root
  - Can install multiple versions at the same time
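For flavor, roughly what that "one command" looks like for a Pacman-based install, driven here from Python. The cache URL and package name are illustrative placeholders rather than an exact VDT cache address; a real install would use the cache published at http://vdt.cs.wisc.edu for the desired VDT version.

```python
# Sketch: drive a Pacman-based VDT install from Python.  The cache URL and
# package name are placeholders, not the actual VDT cache for any release.
import os, subprocess

install_dir = os.path.expanduser("~/vdt")      # non-root install location
os.makedirs(install_dir, exist_ok=True)
os.chdir(install_dir)

# One command: Pacman fetches, installs, and configures the named package
# (and everything it depends on) into the current directory.
subprocess.run(
    ["pacman", "-get", "http://vdt.example.org/vdt_cache:VDT-Client"],
    check=True)
```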
37. Challenges we struggle with
- How should we smoothly update a production service?
  - In-place vs. on-the-side
  - Preserve old configuration while making big changes
  - As easy as we try to make it, it still takes hours to fully install and set up from scratch
- How do we support more platforms?
  - It's a struggle to keep up with the onslaught of Linux distributions
  - Mac OS X? Solaris?
38. More challenges
- Improving testing
  - We care about interactions between the software: when using a VOMS proxy with Condor-G, can we run a GT4 job with GridFTP transfer, keeping the proxy in MyProxy, while using PBS as the backend batch system?
- Some people want native packaging formats
  - RPM
  - Deb
- What software should we have?
  - New storage management software
39. One more challenge
- Hiring
  - We need high-quality software developers
  - Creating the VDT involves all aspects of software development
  - But developers prefer writing new code instead of:
    - Writing lots of little bits of code
    - Thorough testing
    - Lots of debugging
    - User support
40. Where do you learn more?
- http://vdt.cs.wisc.edu
- Support
  - Alain Roy: roy@cs.wisc.edu
  - Miron Livny: miron@cs.wisc.edu
  - Official support: vdt-support@ivdgl.org
41. Security Infrastructure
- Identity: X.509 certificates
  - OSG is a founding member of the US TAGPMA.
  - DOEGrids provides script utilities for bulk requests of host certs, CRL checking, etc.
  - The VDT downloads CA information from the IGTF.
- Authentication and authorization using VOMS extended attribute certificates.
  - DN-to-account mapping done at the site (across multiple CEs and SEs) by GUMS (a toy mapping example follows below).
  - Standard authorization callouts to Prima (CE) and gPlazma (SE).
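A toy illustration of the DN-to-account mapping idea behind grid-mapfiles and GUMS group mappings; the DNs, VO names, and pool-account scheme are invented for the example and do not reflect any site's actual policy.

```python
# Toy sketch of DN -> local account mapping in the spirit of a grid-mapfile /
# GUMS group mapping.  All DNs, VO names, and account names are invented.
from typing import Optional

# Per-VO policy: map every member of a VO to a shared or pool account.
VO_ACCOUNT_POLICY = {
    "cmsvo":  "cmsuser",      # shared group account
    "ligovo": "ligo{:03d}",   # pool accounts ligo001, ligo002, ...
}

_pool_counters = {}

def map_user(dn: str, vo: str) -> Optional[str]:
    """Return the local Unix account for a given certificate DN and VO."""
    policy = VO_ACCOUNT_POLICY.get(vo)
    if policy is None:
        return None                      # VO not supported at this site
    if "{" not in policy:
        return policy                    # shared group account
    n = _pool_counters.get(vo, 0) + 1    # hand out the next pool account
    _pool_counters[vo] = n
    return policy.format(n)

print(map_user("/DC=org/DC=doegrids/OU=People/CN=Alice Example", "ligovo"))
# -> ligo001
```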
42. Security Infrastructure
- Security process modeled on NIST procedural controls, starting from an inventory of the OSG assets
  - Management: risk assessment, planning, service auditing and checking
  - Operational: incident response, awareness and training, configuration management
  - Technical: authentication and revocation, auditing and analysis
- End-to-end trust in the quality of code executed on remote CPUs - signatures?
43. User and VO Management
- A VO registers with the Operations Center
  - Provides the URL of its VOMS service, to be propagated to the sites.
  - Several VOMSes are shared with EGEE as part of WLCG.
- A user registers through VOMRS or a VO administrator
  - The user is added to the VOMS of one or more VOs.
  - The VO is responsible for getting users to sign the AUP.
  - The VO is responsible for VOMS service support.
- A site registers with the Operations Center
  - Signs the Service Agreement.
  - Decides which VOs to support (striving for default admit).
  - Populates GUMS from the VOMSes of all supported VOs; chooses an account UID policy for each VO role.
- VOs and sites provide a Support Center contact and take part in joint operations.
- For WLCG: US ATLAS and US CMS Tier-1s are directly registered with WLCG; other support centers are propagated through the OSG GOC to WLCG.
44. Operations and User Support
- Virtual Organization (VO)
  - Group of one or more researchers
- Resource Provider (RP)
  - Operates Compute Elements and Storage Elements
- Support Center (SC)
  - Provides support for one or more VOs and/or RPs
- VO support centers
  - Provide end-user support, including triage of user-related trouble tickets
- Community support
  - Volunteer effort to provide an SC for RPs and VOs without their own SC, plus a general help and discussion mailing list
45. Operations Model
- Real support organizations often play multiple roles.
- Lines represent communication paths and, in our model, agreements. We have not progressed very far with agreements yet.
- Gray shading indicates that OSG Operations is composed of effort from all the support centers.
46. OSG Release Process
- Applications -> Integration -> Provision -> Deploy
- The Integration Testbed (ITB, 15-20 sites) feeds the production OSG (50 sites)
47. Integration Testbed
- As reported in the GridCat status catalog service
[Screenshot: facility view showing the ITB release, per-site status, operations map, and Tier-2 sites]
48. Release Schedule
[Timeline figure, 01/06 through 9/08: OSG 0.4.1 -> OSG 0.6.0 -> OSG 0.8.0 -> OSG 1.0.0, with incremental (minor) updates between major releases and increasing functionality, aligned against external milestones such as SC4, CMS CSA06, WLCG Service Commissioned, ATLAS Cosmic Ray Run, and Advanced LIGO.]
49. OSG Release Timeline
[Timeline figure: production releases OSG 0.2.1, 0.4.0, 0.4.1, and 0.6.0 (2/05-7/06), with corresponding integration releases ITB 0.1.2, 0.1.6, 0.3.0, 0.3.4, 0.3.7, and 0.5.0.]
50. Deployment and Maintenance
- Distribute software through the VDT and OSG caches.
- Progress technically via weekly VDT office hours - problems, help, planning - fed from multiple sources (Ops, Int, VDT-Support, mail, phone).
- Publish plans and problems through the VDT to-do list, the Integration Twiki and ticket systems.
- Critical updates and patches follow Standard Operating Procedures.
51. Release Functionality
- OSG 0.6, Fall 2006
  - Accounting
  - Squid (web caching in support of software distribution and database information)
  - SRM V2 and AuthZ
  - CEMon/ClassAd-based resource selection
  - Support for MDS-4
- OSG 0.8, Spring 2007
  - VM-based Edge Services
  - Just-in-time job scheduling, pull-mode Condor-C
  - Support for sites to run pilot jobs and/or glide-ins using gLExec for identity changes
- OSG 1.0, end of 2007
52. Inter-operability with Campus Grids
- FermiGrid is an interesting example of the challenges we face when making the resources of a campus grid (in this case a DOE laboratory) accessible to the OSG community.
53. OSG Principles
- Characteristics
  - Provide guaranteed and opportunistic access to shared resources.
  - Operate a heterogeneous environment, both in services available at any site and for any VO, and multiple implementations behind common interfaces.
  - Interface to campus and regional grids.
  - Federate with other national/international grids.
  - Support multiple software releases at any one time.
- Drivers
  - Delivery to the schedule, capacity and capability of LHC and LIGO.
  - Contributions to/from and collaboration with the US ATLAS, US CMS, and LIGO software and computing programs.
  - Support for/collaboration with other physics/non-physics communities.
  - Partnerships with other grids - especially EGEE and TeraGrid.
  - Evolution by deployment of externally developed new services and technologies.
54. OSG Middleware Layering
- Applications: CMS Services & Framework; CDF, D0 SAMGrid Framework; ATLAS Services & Framework; LIGO Data Grid
- OSG Release Cache: VDT plus configuration, validation, VO management
- Virtual Data Toolkit (VDT): common services - NMI plus VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ
- Infrastructure
55. Summary
- The OSG facility opened July 22nd, 2005.
- The OSG facility is under steady use
  - 2,000-3,000 jobs at all times
  - Mostly HEP, but large Bio/Eng/Med use occasionally
  - Moderate other physics (Astro/Nuclear); LIGO expected to ramp up.
- OSG project
  - 5-year proposal to DOE and NSF, funded starting 9/06
  - Facility: Improve / Expand / Extend / Interoperate; education and outreach
- Off to a running start, but lots more to do
  - Routinely exceeding 1 Gbps at 3 sites; scale by x4 by 2008, and many more sites
  - Routinely exceeding 1,000 running jobs per client; scale by at least x10 by 2008
  - Have reached a 99% success rate for 10,000-job-per-day submissions; need to reach this routinely, even under heavy load
56. EGEE-OSG Inter-operability
- Agree on a common Virtual Organization Management System (VOMS).
- Active joint security groups leading to common policies and procedures.
- Condor-G interfaces to multiple remote job execution services (GRAM, Condor-C).
- File transfers using GridFTP.
- SRM V1.1 for managed storage access; SRM V2.1 in test.
- Publish the OSG BDII to a shared BDII so Resource Brokers can route jobs across the two grids.
- Automate ticket routing between GOCs.
57. What is FermiGrid?
- Integrates resources across most (soon all) owners at Fermilab.
- Supports jobs from Fermilab organizations running on any/all accessible campus (FermiGrid) and national (Open Science Grid) resources.
- Supports jobs from the OSG being scheduled onto any/all Fermilab sites.
- Unified and reliable common interface and services through the FermiGrid gateway - including security, job scheduling, user management, and storage.
- More information is available at http://fermigrid.fnal.gov
58. Job Forwarding and Resource Sharing
- The gateway currently interfaces 5 Condor pools with diverse file systems and >1000 job slots; plans to grow to 11 clusters (8 Condor, 2 PBS and 1 LSF).
- Job scheduling policies and in-place sharing agreements allow fast response to changes in resource needs by Fermilab and OSG users.
- The gateway provides the single bridge between the OSG wide-area distributed infrastructure and the FermiGrid local sites. It consists of a Globus gatekeeper and a Condor-G.
- Each cluster has its own Globus gatekeeper.
- Storage and job execution policies are applied through site-wide managed security and authorization services.
59. Access to FermiGrid
[Diagram: the FermiGrid gateway (Globus gatekeeper plus Condor-G) forwards jobs to the Globus gatekeepers of the individual Fermilab clusters.]
60. GLOW - UW Enterprise Grid
- Condor pools at various departments integrated into a campus-wide grid: the Grid Laboratory of Wisconsin
- Older private Condor pools at other departments
  - 1000 1-GHz Intel CPUs at CS
  - 100 2-GHz Intel CPUs at Physics
- Condor jobs flock from on-campus and off-campus to GLOW
- Excellent utilization
  - Especially when the Condor Standard Universe is used: preemption, checkpointing, job migration
61. Grid Laboratory of Wisconsin
- 2003 initiative funded by NSF/UW. Six GLOW sites:
  - Computational Genomics, Chemistry
  - Amanda, IceCube, Physics/Space Science
  - High Energy Physics/CMS, Physics
  - Materials by Design, Chemical Engineering
  - Radiation Therapy, Medical Physics
  - Computer Science
- GLOW phases 1 and 2 plus non-GLOW-funded nodes have 1000 Xeons and 100 TB of disk
62. How does it work?
- Each of the six sites manages a local Condor pool with its own collector and matchmaker.
- Through the High Availability Daemon (HAD) service offered by Condor, one of these matchmakers is elected to manage all GLOW resources (see the configuration sketch below).
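As an illustration, a minimal sketch of the kind of Condor configuration that enables HAD-based matchmaker failover, written as a Python helper that emits the config fragment. The host names and port are placeholders, the exact macro set depends on the Condor version, and this is not the actual GLOW configuration.

```python
# Sketch: write a condor_config fragment enabling HAD-based failover of the
# central manager (negotiator/matchmaker).  Host names and port are
# placeholders; this is not GLOW's real configuration.
HAD_FRAGMENT = """\
CENTRAL_MANAGER1 = cm1.glow.example.edu
CENTRAL_MANAGER2 = cm2.glow.example.edu
HAD_PORT = 51450

# Run HAD alongside the usual central-manager daemons on both candidates.
DAEMON_LIST = MASTER, COLLECTOR, NEGOTIATOR, HAD

# All HAD instances that take part in the election, in priority order.
HAD_LIST = $(CENTRAL_MANAGER1):$(HAD_PORT), $(CENTRAL_MANAGER2):$(HAD_PORT)
HAD_USE_PRIMARY = true

# Let HAD decide which machine's negotiator (matchmaker) is the active one.
MASTER_NEGOTIATOR_CONTROLLER = HAD
"""

with open("condor_config.local.had", "w") as f:
    f.write(HAD_FRAGMENT)
```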
63. GLOW Deployment
- GLOW is fully commissioned and is in constant use
- CPU
  - 66 GLOW + 50 ATLAS + 108 other nodes @ CS
  - 74 GLOW + 66 CMS nodes @ Physics
  - 93 GLOW nodes @ ChemE
  - 66 GLOW nodes @ LMCG, MedPhys, Physics
  - 95 GLOW nodes @ MedPhys
  - 60 GLOW nodes @ IceCube
  - Total CPUs: 1339
- Storage
  - Head nodes at all sites
  - 45 TB each @ CS and Physics
  - Total storage: 100 TB
- GLOW resources are used at the 100% level
  - The key is to have multiple user groups
- GLOW continues to grow
64. GLOW Usage
- GLOW nodes are always running hot!
- CS guests
  - Largest user
  - Serving guests - many cycles delivered to guests!
- ChemE
  - Largest community
- HEP/CMS
  - Production for the collaboration
  - Production and analysis by local physicists
- LMCG
  - Standard Universe
- Medical Physics
  - MPI jobs
- IceCube
  - Simulations
65. GLOW Usage, 3/04-9/05
- Leftover cycles available for others
- Takes advantage of shadow jobs
- Takes advantage of checkpointing jobs
- Over 7.6 million CPU-hours (865 CPU-years) served!
66. Example Uses
- ATLAS
  - Over 15 million proton collision events simulated, at 10 minutes each
- CMS
  - Over 70 million events simulated, reconstructed and analyzed (a total of 10 minutes per event) in the past year
- IceCube / Amanda
  - Data filtering used 12 years of GLOW CPU in one month
- Computational Genomics
  - Prof. Shwartz asserts that GLOW has opened up a new paradigm of work patterns in his group: they no longer think about how long a particular computational job will take - they just do it
- Chemical Engineering
  - Students do not know where the computing cycles are coming from - they just do it - largest user group
67. Open Science Grid and GLOW
- OSG jobs can run on GLOW
  - The gatekeeper routes jobs to a local Condor cluster
  - Jobs flock campus-wide, including to the GLOW resources (see the flocking sketch below)
  - The dCache storage pool is also a registered OSG storage resource
  - Beginning to see some use
- Now actively working on rerouting GLOW jobs to the rest of the OSG
  - Users do NOT have to adapt to the OSG interface or separately manage their OSG jobs
  - New Condor code development
68. Elevating from GLOW to OSG
[Diagram: a "Schedd On The Side" - a specialized scheduler operating on the schedd's job queue - watches jobs 1-5 in the local queue and picks one (Job 4) to elevate to the grid.]
69. The Grid Universe
[Diagram: vanilla-universe jobs stay in the local pool; grid-universe jobs are sent to site X.]
- Easier to live with private networks
- May use non-Condor resources
- Restricted Condor feature set (e.g. no standard universe over grid)
- Must pre-allocate jobs between the vanilla and grid universes
70. Dynamic Routing of Jobs
- Dynamic allocation of jobs between the vanilla and grid universes.
- Not every job is appropriate for transformation into a grid job (see the sketch below).
[Diagram: vanilla jobs in the local pool are routed on demand to sites X, Y, and Z.]
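A toy sketch of the transformation such a routing scheduler performs - copying a candidate vanilla job and rewriting it as a grid-universe job aimed at a remote site. The attribute names mimic Condor job-ad fields, but the selection policy and site list are invented for illustration; the production mechanism along these lines is conceptually similar to Condor's later JobRouter.

```python
# Toy sketch of "schedd on the side" routing: take an idle vanilla-universe
# job ad and emit a grid-universe copy pointed at a remote site.  Attribute
# names mimic Condor job ads, but the policy and site list are invented.
import copy, itertools

SITES = itertools.cycle([
    "gt2 gatekeeper.siteX.example.org/jobmanager-condor",
    "gt2 gatekeeper.siteY.example.org/jobmanager-pbs",
])

def routable(job: dict) -> bool:
    """Only route jobs that are idle and don't need local-only features."""
    return job["JobStatus"] == 1 and job.get("WantCheckpoint", False) is False

def to_grid_job(job: dict) -> dict:
    """Return a grid-universe copy of a vanilla job, bound to one site."""
    routed = copy.deepcopy(job)
    routed["JobUniverse"] = 9                 # 9 = grid universe (5 = vanilla)
    routed["GridResource"] = next(SITES)
    routed["RoutedFromJobId"] = job["ClusterId"]
    return routed

queue = [{"ClusterId": 4, "JobStatus": 1, "JobUniverse": 5}]  # one idle vanilla job
routed_jobs = [to_grid_job(j) for j in queue if routable(j)]
print(routed_jobs)
```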
71. Final Observation
- A production grid is the product of a complex interplay of many forces
  - Resource providers
  - Users
  - Software providers
  - Hardware trends
  - Commercial offerings
  - Funding agencies
  - The culture of all parties involved
  - ...