The Open Science Grid

1
The Open Science Grid
  • Ruth Pordes
  • Fermilab

2
The Vision
  • Practical support for end-to-end community
    systems in a heterogeneous global environment to
  • Transform compute- and data-intensive science
    through a national cyberinfrastructure that
    includes organizations from the smallest to the
    largest.

3
The Scope
4
Community Systems
5
The History

[Timeline figure, 1999-2009: the PPDG (DOE), GriPhyN (NSF) and iVDGL
(NSF) projects, working together as Trillium, led to Grid3 and then to
OSG (DOE and NSF).]
6
Goals of OSG
  • Enable scientists to use and share a greater
    fraction of the available compute cycles.
  • Help scientists to use distributed systems
    (storage, processors and software) with less
    effort.
  • Enable more sharing and reuse of software, and
    reduce duplication of effort, by providing effort
    for integration and extensions.
  • Establish an open-source community that works
    together to communicate knowledge and experience
    and to lower the overheads for new participants.

7
The Leaders
  • High Energy Nuclear Physics (HENP)
    Collaborations - Global communities with large
    distributed systems in Europe as well as the US
  • Condor Project - distributed computing across
    diverse clusters.
  • Globus - Grid security, data movement and
    information services software.
  • Laser Interferometer Gravitational-Wave
    Observatory (LIGO) - legacy data grid with large
    data collections.
  • DOE HENP Facilities
  • University Groups and researchers

8
Institutions Involved
Project Staff at
Boston U
Brookhaven National Lab
CalTech
(Clemson)
Columbia
Cornell
FermiLab
ISI, U of Southern California
Indiana U
LBNL
(Nebraska)
RENCI
SLAC
UCSD
U of Chicago
U of Florida
U of Illinois Urbana-Champaign / NCSA
U of Wisconsin Madison
Sites on OSG: many with >1 resource; 46 separate institutions.
Academia Sinica, Boston U., Brookhaven National Lab, Caltech,
Cinvestav (Mexico City), Clemson U., Dartmouth U., Fermilab,
Florida International U., Florida State U., Hampton U., Indiana
University, Iowa State, Kansas State, LBNL, Lehigh University,
Louisiana University, Louisiana Tech, McGill U., MIT, Nebraska,
Notre Dame, Oklahoma U., Penn State U., Purdue U., Rice U., SLAC,
Southern Methodist U., TTU, U. of Arkansas, U. California at
Riverside, U. of Chicago, U. of Florida, U. of Iowa, U. Illinois
Chicago, U. of Michigan, U. New Mexico, U. of Sao Paulo, U. Texas
at Arlington, U. Virginia, U. Wisconsin Madison, U. Wisconsin
Milwaukee, UCSD, UERJ (Brazil), Vanderbilt U., Wayne State U.
9
The Value Proposition
  • Increased usage of CPUs and infrastructure alone
    (i.e. the cost of processing cycles) is not the
    persuasive cost-benefit value.
  • The benefits come from reducing risk in, and
    sharing support for, large, complex systems which
    must be run for many years with a short-lifetime
    workforce.
  • Savings in effort for integration, system and
    software support.
  • Opportunity and flexibility to distribute load
    and address peak needs.
  • Maintenance of an experienced workforce in a
    common system.
  • Lowering the cost of entry to new contributors.
  • Enabling of new computational opportunities to
    communities that would not otherwise have access
    to such resources.

10
OSG Does Not
  • Own the resources. The farms and storage are
    contributed by the Consortium members; OSG uses
    commodity (and research) networks.
  • Own the software. Middleware and applications are
    developed by contributors and external projects.
  • Make one size fit all. We define interfaces which
    people can interface to, and provide a reference
    software stack which people may use.
  • Take responsibility for Security outside of our
    own assets.

11
OSG Does
  • Release, deploy and support Software.
  • Integrate and test new software at the system
    level.
  • Support operations and Grid-wide services.
  • Provide Security operations and policy.
  • Troubleshoot end to end user and system problems.
  • Engage and help new communities.
  • Extend capability and scale.

12
And OSG Does Training
  • Grid Schools train students, teachers and new
    entrants to use grids
  • 2-3 day training with hands on workshops and core
    curriculum (based on iVDGL annual weeklong
    schools).
  • 3 held already, several more this year (2
    scheduled); some as participants in international
    schools.
  • 20-60 in each class. Each class is regionally
    based with a broad catchment area.
  • Gathering an online repository of training
    material.
  • End-to-end application training in collaboration
    with user communities.

13
The Implementation Architecture
  • VOs and Sites
  • Sites control their use and policy.
  • VOs manage their users and policy.
  • VOs share resources and services.

14
Virtual Organizations
  • A Virtual Organization is a collection of people
    (VO members).
  • A VO has responsibilities to manage its members
    and the services it runs on their behalf.
  • A VO may own resources and be prepared to share
    in their use.

15
VOs
Campus Grids (5):
Georgetown University Grid (GUGrid)
Grid Laboratory of Wisconsin (GLOW)
Grid Research and Education Group at Iowa (GROW)
University of New York at Buffalo (GRASE)
Fermi National Accelerator Laboratory (Fermilab)
Self-Operated Research VOs (15):
Collider Detector at Fermilab (CDF)
Compact Muon Solenoid (CMS)
CompBioGrid (CompBioGrid)
D0 Experiment at Fermilab (DZero)
Dark Energy Survey (DES)
Functional Magnetic Resonance Imaging (fMRI)
Geant4 Software Toolkit (geant4)
Genome Analysis and Database Update (GADU)
International Linear Collider (ILC)
Laser Interferometer Gravitational-Wave Observatory (LIGO)
nanoHUB Network for Computational Nanotechnology (NCN) (nanoHUB)
Sloan Digital Sky Survey (SDSS)
Solenoidal Tracker at RHIC (STAR)
Structural Biology Grid (SBGrid)
United States ATLAS Collaboration (USATLAS)
Regional Grids (4):
NYSGRID
Distributed Organization for Scientific and Academic Research (DOSAR)
Great Plains Network (GPN)
Northwest Indiana Computational Grid (NWICG)
OSG Operated VOs (4):
Engagement (Engage)
Open Science Grid (OSG)
OSG Education Activity (OSGEDU)
OSG Monitoring Operations
16
Sites
  • A Site is a collection of commonly administered
    computing and/or storage resources and services.
  • Resources can be owned by and shared among VOs

17
A Compute Element
  • Processing farms are accessed through Condor-G
    submissions to the Globus GRAM interface, which
    supports many different local batch systems (see
    the sketch below).
  • Priorities and policies are set through the
    assignment of VO Roles, mapped to accounts and
    batch queue priorities, and modified by Site
    policies and priorities.

[Diagram: an OSG gateway machine hosts the site services and connects
the processing farm to the network and other OSG resources. Sites range
from 20-CPU department computers to 10,000-CPU supercomputers; jobs run
under any local batch system.]
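A minimal sketch of such a Condor-G submission, assuming the HTCondor grid universe; the gatekeeper host (ce.example.edu), the jobmanager name and the file names are placeholders, not real OSG endpoints.

```python
# Minimal Condor-G sketch: write a grid-universe submit description and
# hand it to condor_submit. The gatekeeper host "ce.example.edu" and its
# jobmanager are placeholders, not real OSG endpoints.
import subprocess

submit_description = """\
universe      = grid
grid_resource = gt2 ce.example.edu/jobmanager-condor
executable    = analyze.sh
output        = job.out
error         = job.err
log           = job.log
queue
"""

with open("osg_job.sub", "w") as handle:
    handle.write(submit_description)

# Condor-G forwards the job to the site's Globus GRAM gateway, which
# submits it to whatever local batch system the site runs.
subprocess.run(["condor_submit", "osg_job.sub"], check=True)
```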
18
Storage Element
  • Storage services - storage is accessed through
    the Storage Resource Manager (SRM) interface and
    GridFTP (see the sketch below).
  • Allocation of shared storage is through
    agreements between the Site and VO(s), facilitated
    by OSG.

[Diagram: an OSG SE gateway connects the site storage to the network and
other OSG resources. Storage ranges from a 20-GByte disk cache to
4-Petabyte robotic tape systems; any shared storage can be used.]
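A minimal sketch of staging a file into a Storage Element over GridFTP, assuming a valid grid proxy already exists; the host name and paths are placeholders.

```python
# Minimal GridFTP sketch: copy a local file into a Storage Element.
# "se.example.edu" and the destination path are placeholders; a valid
# grid proxy certificate is assumed to exist already.
import subprocess

local_url = "file:///home/user/results.tar.gz"
remote_url = "gsiftp://se.example.edu/data/myvo/results.tar.gz"

# globus-url-copy is the standard GridFTP command-line client.
subprocess.run(["globus-url-copy", local_url, remote_url], check=True)
```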
19
How are VOs supported?
  • Virtual Organization Management services (VOMS)
    allow registration, administration and control of
    members of the group.
  • Facilities trust and authorize VOs, not
    individual users.
  • Storage and Compute Services prioritize according
    to VO group.

[Diagram: VO middleware and applications use the VO Management Service,
over the network, to reach other OSG resources that trust the VO.]
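A minimal sketch of how a VO member obtains the attributes this authorization relies on; the VO name and role are placeholders for a real VO registration.

```python
# Minimal VOMS sketch: create a proxy certificate carrying VO attributes.
# "myvo" and the Role value are placeholders for a real VO registration.
import subprocess

# voms-proxy-init contacts the VO's VOMS server and embeds signed
# VO/group/role attributes in the proxy; sites authorize on these
# attributes rather than on individual user identities.
subprocess.run(
    ["voms-proxy-init", "-voms", "myvo:/myvo/Role=production"],
    check=True,
)
```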
20
Running Jobs
  • Condor-G client
  • Pre-WS or WS GRAM as the Site gateway.
  • Priority through VO role and policy, modified by
    Site policy.
  • Pilot jobs submitted through the regular gateway
    can then pull down multiple user jobs until the
    batch slot resources are used up. glexec, modelled
    on Apache suexec, allows those jobs to run under
    the user's identity (see the sketch after this
    list).

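A minimal sketch of the pilot pattern described above; fetch_user_job() and the payload fields are hypothetical stand-ins for VO-specific machinery, and the glexec call illustrates the identity switch.

```python
# Hypothetical pilot-job sketch: the pilot occupies a batch slot, then
# pulls and runs user payloads until no work is left. fetch_user_job()
# and the payload fields are placeholders for VO-specific machinery.
import subprocess

def fetch_user_job():
    """Ask the VO's own task queue for the next payload; None when empty."""
    return None  # placeholder: a real pilot would contact the VO service

def run_pilot():
    while True:
        payload = fetch_user_job()
        if payload is None:
            break
        # glexec (modelled on Apache suexec) re-runs the payload under
        # the submitting user's identity instead of the pilot's identity.
        subprocess.run(
            ["glexec", payload["executable"], *payload["arguments"]],
            check=True,
        )

if __name__ == "__main__":
    run_pilot()
```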
21
Data and Storage
  • GridFTP data transfer
  • Storage Resource Manager to manage shared and
    common storage
  • Environment variables on the site let VOs know
    where to put and leave files (see the sketch after
    this list).
  • dCache - large scale, high I/O disk caching
    system for large sites
  • DRM - NFS based disk management system for small
    sites.
  • Open questions: NFS v4? GPFS?

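A minimal sketch of a job using such variables, assuming the conventional names OSG_APP and OSG_DATA; the exact names a site advertises should be taken from its documentation.

```python
# Hypothetical sketch: discover site-provided storage locations from the
# environment. The variable names OSG_APP and OSG_DATA are assumptions
# about the site's conventions; confirm them with the site documentation.
import os
import shutil

app_dir = os.environ.get("OSG_APP", "/tmp")    # VO-installed software area
data_dir = os.environ.get("OSG_DATA", "/tmp")  # shared data area

# Stage an output file into the shared data area for later pickup by
# SRM/GridFTP tools.
destination = os.path.join(data_dir, "myvo")
os.makedirs(destination, exist_ok=True)
shutil.copy("results.root", destination)
```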
23
Resource Management
  • Many resources are owned or statically allocated
    to one user community.
  • The institutions which own resources typically
    have ongoing relationships with (a few)
    particular user communities (VOs)
  • The remainder of an organization's available
    resources can be used by anyone else.
  • Organizations can decide against supporting
    particular VOs.
  • OSG staff are responsible for monitoring and, if
    needed, managing this usage.
  • Our challenge is to maximize good - successful -
    output from the whole system.

24
An Example of Opportunistic use
  • D0's own resources are committed to the
    processing of newly acquired data and analysis of
    the processed datasets.
  • In Nov 06 D0 asked to use 1500-2000 CPUs for 2-4
    months for re-processing of an existing dataset
    (500 million events) for science results for the
    summer conferences in July 07.
  • The Executive Board estimated there were
    currently sufficient opportunistically available
    resources on OSG to meet the request. We also
    looked into the local storage and I/O needs.
  • The Council members agreed to contribute
    resources to meet this request.

25
How did D0 Reprocessing Go?
  • D0 had 2-3 months of smooth production running
    using >1,000 CPUs and met their goal by the end
    of May.
  • To achieve this:
  • D0 testing of the integrated software system took
    until February.
  • OSG staff and D0 then worked closely together as
    a team to reach the needed throughput goals,
    facing and solving problems in:
  • sites - hardware, connectivity, software
    configurations
  • application software - performance, error
    recovery
  • scheduling of jobs to a changing mix of available
    resources.

26
D0 Throughput
[Plots: D0 event throughput and D0 OSG CPU-hours per week.]
27
What did this teach us?
  • Consortium members contributed significant
    opportunistic resources as promised.
  • VOs can use a significant number of sites they
    don't own to achieve a large effective
    throughput.
  • Combined teams make large production runs
    effective.
  • How does this scale?
  • How are we going to support multiple requests
    that oversubscribe the resources? We anticipate
    this may happen soon.

28
Use by non-Physics
  • Rosetta (Kuhlman lab) in production across 15
    sites since April.
  • Weather Research and Forecasting (WRF) MPI job
    running on 1 OSG site; more to come.
  • CHARMM molecular dynamics simulation applied to
    the problem of water penetration in
    staphylococcal nuclease.
  • Genome Analysis and Database Update (GADU) portal
    across OSG and TeraGrid. Runs BLAST.
  • nanoHUB at Purdue: BioMOCA and Nanowire
    production.

29
Scale needed in 2008/2009
  • 20-30 Petabytes of tertiary automated tape
    storage at 12 centers world-wide for physics and
    other scientific collaborations.
  • High availability (365x24x7) and high data
    access rates (1GByte/sec) locally and remotely.
  • Evolving and scaling smoothly to meet evolving
    requirements.
  • E.g. for a single experiment

30
CMS Data Transfer Analysis
31
Software
[Layered software stack, top to bottom:]
User science codes and interfaces.
VO middleware and applications: Biology (portals, databases, etc.),
HEP (data and workflow management, etc.), Astrophysics (data
replication, etc.).
OSG Release Cache: OSG-specific configurations, utilities, etc.
Virtual Data Toolkit (VDT): core technologies and software needed by
stakeholders; many components shared with EGEE.
Infrastructure: core grid technology distributions (Condor, Globus,
MyProxy) shared with TeraGrid and others.
Existing operating systems, batch systems and utilities.
32
Horizontal and Vertical Integrations
[Diagram: the same layered stack, illustrating vertical integration from
user science codes and interfaces down through VO middleware and
applications (Biology portals and databases, HEP data and workflow
management, Astrophysics data replication) to the infrastructure, and
horizontal integration across the application domains.]
33
The Virtual Data Toolkit Software
  • Pre-built, integrated and packaged set of
    software which is easy to download, install and
    use to access OSG (see the installation sketch
    after this list).
  • Client, Server, Storage, Service versions.
  • Automated build and test; integration and
    regression testing.
  • Software included:
  • Grid software: Condor, Globus, dCache,
    authorization (VOMS/PRIMA/GUMS), accounting
    (Gratia).
  • Utilities: monitoring, authorization,
    configuration.
  • Common components, e.g. Apache.
  • Built for >10 flavors/versions of Linux.
  • Support structure.
  • Software acceptance structure.

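A minimal installation sketch, assuming the VDT's pacman-based distribution; the cache alias and package name below are assumptions, so the real values should be taken from the VDT/OSG installation documentation.

```python
# Hypothetical VDT installation sketch. The VDT is distributed through
# the pacman package manager; the cache alias "VDT" and package name
# "Client" are placeholders, not verified values - consult the VDT/OSG
# installation documentation for the real ones.
import subprocess

# pacman -get fetches a package (and its dependencies) from the named
# cache and installs it into the current directory.
subprocess.run(["pacman", "-get", "VDT:Client"], check=True)
```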
34
How we get to a Production Software Stack
35
How we get to a Production Software Stack
Validation/Integration takes months and is the
result of the work of many people.
36
How we get to a Production Software Stack
The VDT is used by others than OSG: TeraGrid, Enabling
Grids for E-sciencE (EGEE, Europe), APAC, ...
37
Security
  • Operational security is a priority.
  • Incident response
  • Signed agreements, template policies
  • Auditing, assessment and training
  • Parity of Sites and VOs
  • A Site trusts the VOs that use it.
  • A VO trusts the Sites it runs on.
  • VOs trust their users.
  • Infrastructure is X.509 certificate based, with
    extended attributes for authorization.

38
Illustrative example of trust model
[Diagram: a User submits jobs through the VO infrastructure to a Site's
Compute Element, which fans them out to worker nodes; data flows to
Storage. Trust statements in the figure: the Site trusts that a job is
for the VO; the VO trusts that it is the user's job; the Storage trusts
that it is the VO (or its agent) and, for the Data, that it is the
user.]
39
Operations Troubleshooting Support
  • Well established Grid Operations Center at
    Indiana University
  • User support is distributed, and includes
    osg-general@opensciencegrid community support.
  • A Site coordinator supports the team of sites.
  • Accounting and Site Validation are required
    services of sites.
  • Troubleshooting looks at targeted end-to-end
    problems.
  • Partnering with LBNL on troubleshooting work for
    auditing and forensics.

40
National Activities
  • Run the OSG Distributed Facility.
  • Interoperate and collaborate with TeraGrid.
  • Help campuses to build their own local common
    infrastructures.
  • Help Campuses to interface to the OSG.
  • Participate in network research for high
    throughput data users.

41
Partnering with other organizations
42
Campus Grids
43
Campus Grids
  • Sharing across compute clusters is a change and a
    challenge for many Universities.
  • OSG, TeraGrid, Internet2 and EDUCAUSE are working
    together on CI Days.
  • We work with CIOs, faculty and IT organizations
    on a 1-day meeting where we all come and talk
    about the needs, the ideas and, yes, the next
    steps.

44
OSG and TeraGrid
  • Complementary and interoperating infrastructures

TeraGrid: Networks supercomputer centers. OSG: Includes small to large clusters and organizations.
TeraGrid: Based on the Condor + Globus s/w stack built at the Wisconsin Build and Test facility. OSG: Based on the same versions of Condor + Globus in the Virtual Data Toolkit.
TeraGrid: Development of user portals / science gateways. OSG: Supports jobs/data from TeraGrid science gateways.
TeraGrid: Currently relies mainly on remote login. OSG: No login access; many sites expect VO attributes in the proxy certificate.
Both: Training covers OSG and TeraGrid usage.
45
International Activities
  • Interoperate with Europe for large physics users.
  • Deliver the US-based infrastructure for the
    Worldwide LHC Computing Grid Collaboration (WLCG)
    in support of the Large Hadron Collider (LHC)
    experiments.
  • Include off-shore sites when approached.
  • Help bring common interfaces and best practices
    to the standards forums.

46
Federation and Gateway
[Diagram: a VO or user that acts across grids works with both OSG and
a(nother) grid, e.g. EGEE; a gateway with an adaptor translates between
the two grids' implementations of a given Service-X.]
47
The Practical Parts of OSG
  • Deliver to LHC, LIGO, STAR needs.
  • Make OSG usable, robust and performant.
  • Show its use to other sciences.
  • Show positive value proposition.

48
The Vision parts of OSG
  • Bring campus into a pervasive distributed
    infrastructure.
  • Bring research into a ubiquitous appreciation of
    the value of (distributed, opportunistic)
    computation
  • Teach people to fish!