Title: The Open Science Grid
1 The Open Science Grid
2 The Vision
- Practical support for end-to-end community systems in a heterogeneous global environment.
- Transform compute- and data-intensive science through a national cyberinfrastructure that includes organizations from the smallest to the largest.
3 The Scope
4 Community Systems
5The History
iVDGL
(NSF)
OSG
Trillium
Grid3
GriPhyN
(DOENSF)
(NSF)
PPDG
(DOE)
1999
2000
2001
2002
2005
2003
2004
2006
2007
2008
2009
6 Goals of OSG
- Enable scientists to use and share a greater fraction of the available compute cycles.
- Help scientists use distributed systems (storage, processors and software) with less effort.
- Enable more sharing and reuse of software, and reduce duplication of effort, by providing effort in integration and extensions.
- Establish an open-source community working together to communicate knowledge and experience, and to lower the overheads for new participants.
7 The Leaders
- High Energy Nuclear Physics (HENP) collaborations - global communities with large distributed systems in Europe as well as the US.
- Condor Project - distributed computing across diverse clusters.
- Globus - Grid security, data movement and information services software.
- Laser Interferometer Gravitational Wave Observatory - legacy data grid with large data collections.
- DOE HENP facilities.
- University groups and researchers.
8 Institutions Involved
Project staff at:
Boston U
Brookhaven National Lab
CalTech
(Clemson)
Columbia
Cornell
FermiLab
ISI, U of Southern California
Indiana U
LBNL
(Nebraska)
RENCI
SLAC
UCSD
U of Chicago
U of Florida
U of Illinois Urbana-Champaign/NCSA
U of Wisconsin Madison
Sites on OSG: 46 separate institutions, many with >1 resource (some with no physics use):
Academia Sinica, Boston U., Brookhaven National Lab, Caltech, Cinvestav (Mexico City), Clemson U., Dartmouth U., Fermilab, Florida International U., Florida State U., Hampton U., Indiana University, Iowa State, Kansas State, LBNL, Lehigh University, Louisiana Tech, Louisiana University, McGill U., MIT, Nebraska, Notre Dame, Oklahoma U., Penn State U., Purdue U., Rice U., SLAC, Southern Methodist U., TTU, UCSD, UERJ (Brazil), U. of Arkansas, U. California at Riverside, U. of Chicago, U. of Florida, U. Illinois Chicago, U. of Iowa, U. of Michigan, U. New Mexico, U. of Sao Paulo, U. Texas at Arlington, U. Virginia, U. Wisconsin Madison, U. Wisconsin Milwaukee, Vanderbilt U., Wayne State U.
9 The Value Proposition
- Increased usage of CPUs and infrastructure alone (i.e. the cost of processing cycles) is not the persuasive cost-benefit argument.
- The benefits come from reducing risk in, and sharing support for, large, complex systems which must be run for many years with a short-lifetime workforce.
- Savings in effort for integration, system and software support.
- Opportunity and flexibility to distribute load and address peak needs.
- Maintenance of an experienced workforce in a common system.
- Lowering the cost of entry for new contributors.
- Enabling new computational opportunities for communities that would not otherwise have access to such resources.
10 OSG Does Not
- Own the resources: the farms and storage are contributed by the Consortium members, which use commodity (and research) networks.
- Own the software: middleware and applications are developed by contributors and external projects.
- Make a one-size-fits-all solution: we define interfaces which people can interface to, and provide a reference software stack which people may use.
- Take responsibility for security outside of our own assets.
11 OSG Does
- Release, deploy and support software.
- Integrate and test new software at the system level.
- Support operations and Grid-wide services.
- Provide security operations and policy.
- Troubleshoot end-to-end user and system problems.
- Engage and help new communities.
- Extend capability and scale.
12 And OSG Does Training
- Grid Schools train students, teachers and new entrants to use grids.
- 2-3 days of training with hands-on workshops and a core curriculum (based on the iVDGL annual week-long schools).
- 3 held already, several more this year (2 scheduled); some as participants in international schools.
- 20-60 in each class; each class regionally based with a broad catchment area.
- Gathering an online repository of training material.
- End-to-end application training in collaboration with user communities.
13 The Implementation Architecture
- VOs and Sites
- Sites control their use and policy.
- VOs manage their users and policy.
- VOs share resources and services.
14 Virtual Organizations
- A Virtual Organization is a collection of people (VO members).
- A VO has responsibilities to manage its members and the services it runs on their behalf.
- A VO may own resources and be prepared to share in their use.
15 VOs
Campus Grids (5):
Georgetown University Grid (GUGrid)
Grid Laboratory of Wisconsin (GLOW)
Grid Research and Education Group at Iowa (GROW)
University of New York at Buffalo (GRASE)
Fermi National Accelerator Center (Fermilab)
Self-Operated Research VOs (15):
Collider Detector at Fermilab (CDF)
Compact Muon Solenoid (CMS)
CompBioGrid (CompBioGrid)
D0 Experiment at Fermilab (DZero)
Dark Energy Survey (DES)
Functional Magnetic Resonance Imaging (fMRI)
Geant4 Software Toolkit (geant4)
Genome Analysis and Database Update (GADU)
International Linear Collider (ILC)
Laser Interferometer Gravitational-Wave Observatory (LIGO)
nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)
Sloan Digital Sky Survey (SDSS)
Solenoidal Tracker at RHIC (STAR)
Structural Biology Grid (SBGrid)
United States ATLAS Collaboration (USATLAS)
Regional Grids (4):
NYSGRID
Distributed Organization for Scientific and Academic Research (DOSAR)
Great Plains Network (GPN)
Northwest Indiana Computational Grid (NWICG)
OSG-Operated VOs (4):
Engagement (Engage)
Open Science Grid (OSG)
OSG Education Activity (OSGEDU)
OSG Monitoring Operations
16 Sites
- A Site is a collection of commonly administered computing and/or storage resources and services.
- Resources can be owned by and shared among VOs.
17 A Compute Element
- Processing farms accessed through Condor-G submissions to the Globus GRAM interface, which supports many different local batch systems (a minimal submission sketch follows below).
- Priorities and policies are set through the assignment of VO roles mapped to accounts and batch-queue priorities, modified by Site policies and priorities.
[Diagram: an OSG gateway machine and its services connect the site (from 20-CPU department computers to 10,000-CPU supercomputers) to the network and other OSG resources; jobs run under any local batch system.]
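For illustration only, a minimal sketch of what a Condor-G submission to such a gateway might look like, driven from Python. The gatekeeper host, job manager (PBS here) and executable names are placeholder assumptions, not real OSG endpoints.

```python
# Minimal sketch of a Condor-G submission to a Compute Element gateway.
# The gatekeeper host, job manager and executable are placeholders.
import subprocess
import textwrap

submit_description = textwrap.dedent("""\
    universe      = grid
    grid_resource = gt2 ce.example-site.edu/jobmanager-pbs
    executable    = analyze_events.sh
    arguments     = run42.dat
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue
""")

with open("analyze.sub", "w") as handle:
    handle.write(submit_description)

# condor_submit hands the job to Condor-G, which forwards it to the site's
# Globus GRAM gateway; the local batch system (PBS here) actually runs it.
subprocess.run(["condor_submit", "analyze.sub"], check=True)
```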
18 Storage Element
- Storage services: access storage through the Storage Resource Manager (SRM) interface and GridFTP (a transfer sketch follows below).
- Allocation of shared storage through agreements between the Site and VO(s), facilitated by OSG.
[Diagram: an OSG SE gateway connects the site's shared storage (from a 20-GByte disk cache to 4-Petabyte robotic tape systems) to the network and other OSG resources.]
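As an illustration of the GridFTP access described above, here is a minimal sketch of pulling a file out of a Storage Element with the standard globus-url-copy client; the host name and paths are placeholders, and a valid grid proxy is assumed.

```python
# Sketch: copy a file from a Storage Element using GridFTP (globus-url-copy).
# Host name and paths are placeholders; a valid grid proxy is assumed.
import subprocess

source = "gsiftp://se.example-site.edu/data/run42/events.root"
destination = "file:///local/scratch/events.root"

subprocess.run(["globus-url-copy", source, destination], check=True)
```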
19 How are VOs supported?
- Virtual Organization Management services (VOMS) allow registration, administration and control of the members of the group (see the proxy sketch below).
- Facilities trust and authorize VOs, not individual users.
- Storage and compute services prioritize according to VO group.
[Diagram: VO middleware and applications, backed by a VO Management Service, reach resources that trust the VO across the network and other OSG resources.]
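As a sketch of how a VO member picks up these VO attributes in practice, the commands below obtain and inspect a VOMS proxy; the VO name "myvo" and the "production" role are placeholders.

```python
# Sketch: obtain an X.509 proxy carrying VOMS group/role attributes.
# The VO name "myvo" and the role "production" are placeholders.
import subprocess

# voms-proxy-init contacts the VO's VOMS server and embeds signed attributes
# (group membership, role) in a short-lived proxy certificate.
subprocess.run(["voms-proxy-init", "-voms", "myvo:/myvo/Role=production"], check=True)

# Sites map these attributes to local accounts and batch priorities.
subprocess.run(["voms-proxy-info", "-all"], check=True)
```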
20 Running Jobs
- Condor-G client.
- Pre-WS or WS GRAM as the Site gateway.
- Priority through VO role and policy, mitigated by Site policy.
- Pilot jobs submitted through the regular gateway can then bring down multiple user jobs until the batch-slot resources are used up. glexec, modelled on Apache suexec, allows jobs to run under the user's identity (see the pilot sketch below).
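A rough sketch of the pilot-job pattern described above: the pilot lands in a batch slot, keeps pulling user jobs from the VO's own queue, and runs each payload under the user's identity via glexec. fetch_next_user_job() and the job structure are hypothetical stand-ins for the VO's scheduler, and the glexec invocation is simplified.

```python
# Rough sketch of a pilot job: pull user payloads from the VO's queue and run
# each one under the user's mapped identity with glexec. fetch_next_user_job()
# is a hypothetical stand-in for the VO's own scheduler interface.
import subprocess

def fetch_next_user_job():
    """Placeholder: ask the VO's job broker for the next waiting user job."""
    return None  # e.g. {"proxy": "/tmp/user_proxy.pem", "command": ["./payload.sh"]}

def run_pilot():
    while True:
        job = fetch_next_user_job()
        if job is None:
            break  # no more work: give the batch slot back to the site
        # glexec (modelled on Apache suexec) re-runs the payload under the
        # identity the site maps to the user's proxy, not the pilot's identity.
        env = {"GLEXEC_CLIENT_CERT": job["proxy"]}
        subprocess.run(["glexec"] + job["command"], env=env, check=True)

if __name__ == "__main__":
    run_pilot()
```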
21 Data and Storage
- GridFTP data transfer.
- Storage Resource Manager (SRM) to manage shared and common storage.
- Environment variables on the site let VOs know where to put and leave files (see the sketch below).
- dCache: a large-scale, high-I/O disk caching system for large sites.
- DRM: an NFS-based disk management system for small sites.
- NFS v4? GPFS?
23 Resource Management
- Many resources are owned by, or statically allocated to, one user community.
- The institutions which own resources typically have ongoing relationships with (a few) particular user communities (VOs).
- The remainder of an organization's available resources can be used by everyone or anyone else.
- Organizations can decide against supporting particular VOs.
- OSG staff are responsible for monitoring and, if needed, managing this usage.
- Our challenge is to maximize good - successful - output from the whole system.
24 An Example of Opportunistic Use
- D0's own resources are committed to the processing of newly acquired data and analysis of the processed datasets.
- In Nov 06, D0 asked to use 1500-2000 CPUs for 2-4 months for re-processing of an existing dataset (500 million events) for science results for the summer conferences in July 07.
- The Executive Board estimated there were currently sufficient opportunistically available resources on OSG to meet the request; we also looked into the local storage and I/O needs.
- The Council members agreed to contribute resources to meet this request.
25 How did D0 Reprocessing Go?
- D0 had 2-3 months of smooth production running using >1,000 CPUs and met their goal by the end of May.
- To achieve this:
  - D0 testing of the integrated software system took until February.
  - OSG staff and D0 then worked closely together as a team to reach the needed throughput goals, facing and solving problems in:
    - sites: hardware, connectivity, software configurations
    - application software: performance, error recovery
    - scheduling of jobs to a changing mix of available resources.
26 D0 Throughput
[Charts: D0 event throughput and D0 CPU-hours per week on OSG.]
27 What did this teach us?
- Consortium members contributed significant opportunistic resources, as promised.
- VOs can use a significant number of sites they don't own to achieve a large effective throughput.
- Combined teams make large production runs effective.
- How does this scale?
  - How are we going to support multiple requests that oversubscribe the resources? We anticipate this may happen soon.
28 Use by Non-Physics
- Rosetta@Kuhlman lab: in production across 15 sites since April.
- Weather Research and Forecasting (WRF): MPI job running on 1 OSG site, more to come.
- CHARMM: molecular dynamics simulation applied to the problem of water penetration in staphylococcal nuclease.
- Genome Analysis and Database Update system (GADU): portal across OSG and TeraGrid; runs BLAST.
- nanoHUB at Purdue: BioMOCA and nanowire production.
29 Scale Needed in 2008/2009
- 20-30 Petabytes of tertiary automated tape storage at 12 centers worldwide for physics and other scientific collaborations.
- High availability (365x24x7) and high data access rates (1 GByte/sec) locally and remotely.
- Evolving and scaling smoothly to meet evolving requirements.
- E.g. for a single experiment:
30 CMS Data Transfer Analysis
31 Software
Layered software stack (top to bottom):
- User science codes and interfaces.
- VO middleware and applications: Biology (portals, databases etc.), HEP (data and workflow management etc.), Astrophysics (data replication etc.).
- OSG Release Cache: OSG-specific configurations, utilities etc.
- Virtual Data Toolkit (VDT): core technologies plus software needed by stakeholders; many components shared with EGEE.
- Infrastructure: core grid technology distributions (Condor, Globus, MyProxy) shared with TeraGrid and others; existing operating systems, batch systems and utilities.
32 Horizontal and Vertical Integrations
[Diagram: the same layered stack (user science codes and interfaces; applications for Biology, HEP and Astrophysics; infrastructure), illustrating horizontal integration across application domains and vertical integration from user codes down to the infrastructure.]
33 The Virtual Data Toolkit Software
- Pre-built, integrated and packaged set of software which is easy to download, install and use to access OSG (see the installation sketch below).
- Client, Server, Storage and Service versions.
- Automated Build and Test: integration and regression testing.
- Software included:
  - Grid software: Condor, Globus, dCache, authorization (VOMS/PRIMA/GUMS), accounting (Gratia).
  - Utilities: monitoring, authorization, configuration.
  - Common components, e.g. Apache.
- Built for >10 flavors/versions of Linux.
- Support structure.
- Software acceptance structure.
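For flavor, a sketch of how a VDT/OSG client install was typically driven with the pacman installer; the cache label "OSG:client" and the install directory are placeholders recalled from memory rather than a definitive recipe.

```python
# Sketch: install the OSG client software stack with the pacman installer used
# by the VDT. The cache label "OSG:client" and the target directory are
# placeholders, not a definitive recipe.
import os
import subprocess

install_dir = "/opt/osg-client"
os.makedirs(install_dir, exist_ok=True)

# pacman -get <cache>:<package> fetches and configures the packaged stack.
subprocess.run(["pacman", "-get", "OSG:client"], cwd=install_dir, check=True)
```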
34 How we get to a Production Software Stack
35 How we get to a Production Software Stack
Validation/integration takes months and is the result of the work of many people.
36 How we get to a Production Software Stack
The VDT is used by others beyond OSG: TeraGrid, Enabling Grids for E-sciencE (Europe), APAC, ...
37 Security
- Operational security is a priority:
  - incident response
  - signed agreements, template policies
  - auditing, assessment and training.
- Parity of Sites and VOs:
  - a Site trusts the VOs that use it
  - a VO trusts the Sites it runs on
  - VOs trust their users.
- Infrastructure is X.509 certificate based, with extended attributes for authorization.
38 Illustrative example of trust model
[Diagram: a User submits jobs through the VO's infrastructure to a Site's Compute Element and its worker nodes. The Site trusts that an arriving job is for the VO; the VO trusts that it is the user's job; the Storage service trusts that requests come from the VO (or its agent); for the Data, it trusts that the requester is the user.]
39 Operations, Troubleshooting, Support
- Well-established Grid Operations Center at Indiana University.
- User support is distributed, and includes osg-general@opensciencegrid community support.
- A Site coordinator supports the team of sites.
- Accounting and Site Validation are required services of sites.
- Troubleshooting looks at targeted end-to-end problems.
- Partnering with LBNL's Troubleshooting work for auditing and forensics.
40 National Activities
- Run the OSG Distributed Facility.
- Interoperate and collaborate with TeraGrid.
- Help campuses make their own local common infrastructures.
- Help campuses interface to the OSG.
- Participate in network research for high-throughput data users.
41 Partnering with other organizations
42 Campus Grids
43 Campus Grids
- Sharing across compute clusters is a change and a challenge for many universities.
- OSG, TeraGrid, Internet2 and Educause are working together on CI Days:
  - work with the CIO, faculty and IT organizations on a 1-day meeting where we all come and talk about the needs, the ideas and, yes, the next steps.
44 OSG and TeraGrid
- Complementary and interoperating infrastructures:
  - TeraGrid networks supercomputer centers; OSG includes small to large clusters and organizations.
  - TeraGrid is based on the Condor + Globus software stack built at Wisconsin Build and Test; OSG is based on the same versions of Condor + Globus in the Virtual Data Toolkit.
  - TeraGrid develops user portals/science gateways; OSG supports jobs/data from TeraGrid science gateways.
  - TeraGrid currently relies mainly on remote login; OSG has no login access, and many sites expect VO attributes in the proxy certificate.
  - Training covers both OSG and TeraGrid usage.
45 International Activities
- Interoperate with Europe for large physics users.
- Deliver the US-based infrastructure for the Worldwide LHC Computing Grid Collaboration (WLCG) in support of the LHC experiments.
- Include off-shore sites when approached.
- Help bring common interfaces and best practices to the standards forums.
46 Federation and Gateway
[Diagram: a VO or user that acts across grids reaches both OSG and a(nother) grid, e.g. EGEE, through a gateway/adaptor that bridges the equivalent Service-X on each side.]
47 The Practical Parts of OSG
- Deliver to LHC, LIGO, STAR needs.
- Make OSG usable, robust and performant.
- Show use to other sciences.
- Show positive value proposition.
48The Vision parts of OSG
- Bring campus into a pervasive distributed
infrastructure. - Bring research into a ubiquitous appreciation of
the value of (distributed, opportunistic)
computation - Teach people to fish!