Transcript and Presenter's Notes

Title: The Open Science Grid


1
The Open Science Grid
Miron Livny, OSG Facility Coordinator, University of Wisconsin-Madison
2
  • Some history and background

3
U.S. Trillium Grid Partnership
  • Trillium = PPDG + GriPhyN + iVDGL
  • Particle Physics Data Grid (PPDG): $18M (DOE) (1999-2006)
  • GriPhyN: $12M (NSF) (2000-2005)
  • iVDGL: $14M (NSF) (2001-2006)
  • Basic composition (150 people)
  • PPDG: 4 universities, 6 labs
  • GriPhyN: 12 universities, SDSC, 3 labs
  • iVDGL: 18 universities, SDSC, 4 labs, foreign
    partners
  • Experiments: BaBar, D0, STAR, JLab, CMS, ATLAS, LIGO,
    SDSS/NVO
  • Complementarity of projects
  • GriPhyN: CS research, Virtual Data Toolkit (VDT)
    development
  • PPDG: end-to-end Grid services, monitoring,
    analysis
  • iVDGL: Grid laboratory deployment using VDT
  • Experiments provide frontier challenges
  • Unified entity when collaborating internationally

4
From Grid3 to OSG
[Timeline graphic: Grid3 (11/03) evolving into OSG releases 0.2.1, 0.4.0,
0.4.1 and 0.6.0, with milestones at 2/05, 4/05, 9/05, 12/05, 2/06, 4/06
and 7/06.]
5
What is OSG?
  • The Open Science Grid is a US national
    distributed computing facility that supports
    scientific computing via an open collaboration of
    science researchers, software developers and
    computing, storage  and network providers. The
    OSG Consortium is building and operating the OSG,
    bringing resources and researchers from
    universities and national laboratories together
    and cooperating with other national and
    international infrastructures to give scientists
    from a broad range of disciplines access to
    shared resources worldwide.

6
The OSG Project
  • Co-funded by DOE and NSF at an annual rate of
    $6M for 5 years, starting FY-07
  • Currently the main stakeholders are from physics: the US
    LHC experiments, LIGO, the STAR experiment, the
    Tevatron Run II and astrophysics experiments
  • A mix of DOE-lab and campus resources
  • Active engagement effort to add new domains and
    resource providers to the OSG consortium

7
OSG Consortium
8
OSG Project Execution
(Organization chart)
OSG PI: Miron Livny
Executive Director: Ruth Pordes
Deputy Executive Directors: Rob Gardner, Doug Olson
OSG Executive Board
Resources Managers: Paul Avery, Albert Lazzarini
Facility Coordinator: Miron Livny
Applications Coordinators: Torre Wenaus, Frank Würthwein
Education, Training, Outreach Coordinator: Mike Wilde
Security Officer: Don Petravick
Engagement Coordinator: Alan Blatecky
Operations Coordinator: Leigh Grundhoefer
Software Coordinator: Alain Roy
External Projects (role includes provision of middleware)
9
OSG Principles
  • Characteristics:
  • Provide guaranteed and opportunistic access to
    shared resources.
  • Operate a heterogeneous environment both in
    services available at any site and for any VO,
    and multiple implementations behind common
    interfaces.
  • Interface to Campus and Regional Grids.
  • Federate with other national/international Grids.
  • Support multiple software releases at any one
    time.
  • Drivers:
  • Delivery to the schedule, capacity and capability
    of LHC and LIGO
  • Contributions to/from and collaboration with the
    US ATLAS, US CMS, LIGO software and computing
    programs.
  • Support for/collaboration with other
    physics/non-physics communities.
  • Partnerships with other Grids - especially EGEE
    and TeraGrid.
  • Evolution by deployment of externally developed
    new services and technologies.

10
Grid of Grids - from Local to Global
[Diagram: campus, community and national grids.]
11
Who are you?
  • A resource can be accessed by a user via the
    campus, community or national grid.
  • A user can access a resource with a campus,
    community or national grid identity.

12
OSG sites
13
running (and monitored) OSG jobs in 06/06.
14
Example GADU run in 04/06
15
CMS Experiment - an exemplar community grid
[Map: CMS sites on OSG (USA, e.g. UNL, MIT) and EGEE (CERN, Germany, France).]
Data and jobs move locally, regionally and globally
within the CMS grid, transparently across grid
boundaries from campus to global.
16
The CMS Grid of Grids
  • Job submission
  • 16,000 jobs per day submitted across EGEE and OSG
    via the INFN Resource Broker (RB).
  • Data transfer
  • Peak I/O of 5 Gbps from FNAL to 32 EGEE and 7 OSG
    sites.
  • All 7 OSG sites have reached the 5 TB/day goal.
  • 3 OSG sites (Caltech, Florida, UCSD) exceeded
    10 TB/day.

17
CMS Xfer on OSG
All sites have exceeded 5TB per day in June.
18
CMS Xfer FNAL to World
  • The US CMS center at FNAL transfers data to
    39 sites worldwide in the CMS global transfer
    challenge.
  • Peak transfer rates of 5 Gbps are reached.

19
EGEE-OSG inter-operability
  • Agree on a common Virtual Organization Management
    System (VOMS)
  • Active Joint Security groups leading to common
    policies and procedures.
  • Condor-G interfaces to multiple remote job
    execution services (GRAM, Condor-C).
  • File Transfers using GridFTP.
  • SRM V1.1 for managed storage access. SRM V2.1 in
    test.
  • Publish OSG BDII to shared BDII for Resource
    Brokers to route jobs across the two grids.
  • Automate ticket routing between GOCs.

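To make the Condor-G item above concrete, here is a minimal sketch of a grid-universe submit file targeting a remote GRAM gatekeeper. The gatekeeper contact string, jobmanager name and file names are invented placeholders, not actual OSG or EGEE endpoints.

```bash
# Minimal sketch of a Condor-G submission to a remote GRAM service.
# The gatekeeper contact string and file names are invented for illustration.
cat > osg_job.sub <<'EOF'
universe      = grid
grid_resource = gt2 gatekeeper.example.edu/jobmanager-condor
executable    = analysis.sh
output        = job.out
error         = job.err
log           = job.log
queue
EOF
condor_submit osg_job.sub
```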
20
OSG Middleware Layering

[Layered diagram]
Applications: CMS Services & Framework; CDF, D0 SamGrid & Framework;
ATLAS Services & Framework; LIGO Data Grid
OSG Release Cache: VDT + Configuration, Validation, VO management
Virtual Data Toolkit (VDT): Common Services + NMI +
VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ
Infrastructure
21
OSG Middleware Pipeline
Domain science requirements → OSG stakeholders and middleware
developer (joint) projects (Condor, Globus, EGEE, etc.) →
test on VO-specific grid → integrate into VDT release and
deploy on the OSG integration grid → test interoperability
with EGEE and TeraGrid → provision in OSG release and
deploy to OSG production.
22
The Virtual Data Toolkit
  • Alain Roy
  • OSG Software Coordinator
  • Condor Team
  • University of Wisconsin-Madison

23
What is the VDT?
  • A collection of software
  • Grid software (Condor, Globus and lots more)
  • Virtual Data System (origin of the name "VDT")
  • Utilities
  • An easy installation
  • Goal: push a button, everything just works
  • Two methods:
  • Pacman installs and configures it all
  • RPM installs some of the software, no
    configuration
  • A support infrastructure

24
How much software?
25
Who makes the VDT?
  • The VDT is a product of Open Science Grid (OSG)
  • VDT is used on all OSG grid sites
  • OSG is new, but VDT has been around since 2002
  • Originally, VDT was a product of the
    GriPhyN/iVDGL
  • VDT was used on all Grid2003 sites

26
Who makes the VDT?
1 Mastermind + 3 FTEs
Miron Livny
Alain Roy
Tim Cartwright
Andy Pavlo
27
Who uses the VDT?
  • Open Science Grid
  • LIGO Data Grid
  • LCG
  • LHC Computing Grid, from CERN
  • EGEE
  • Enabling Grids for E-Science

28
Why should you care?
  • The VDT gives insight into technical challenges
    in building a large grid
  • What software do you need?
  • How do you build it?
  • How do you test it?
  • How do you deploy it?
  • How do you support it?

29
What software is in the VDT?
  • Security
  • VOMS (VO membership)
  • GUMS (local authorization)
  • mkgridmap (local authorization)
  • MyProxy (proxy management)
  • GSI SSH
  • CA CRL updater
  • Monitoring
  • MonALISA
  • gLite CEMon
  • Accounting
  • OSG Gratia
  • Job Management
  • Condor (including Condor-G and Condor-C)
  • Globus GRAM
  • Data Management
  • GridFTP (data transfer)
  • RLS (replication location)
  • DRM (storage management)
  • Globus RFT
  • Information Services
  • Globus MDS
  • GLUE schema providers

Note: the type, quantity, and variety of software
is more important to my talk today than the
specific software I'm naming.
30
What software is in the VDT?
  • Client tools
  • Virtual Data System
  • SRM clients (V1 and V2)
  • UberFTP (GridFTP client)
  • Developer Tools
  • PyGlobus
  • PyGridWare
  • Testing
  • NMI Build & Test
  • VDT Tests
  • Support
  • Apache
  • Tomcat
  • MySQL (with MyODBC)
  • Non-standard Perl modules
  • Wget
  • Squid
  • Logrotate
  • Configuration Scripts
  • And More!

31
Building the VDT
  • We distribute binaries
  • Expecting everyone to build from source is
    impractical
  • Essential to be able to build on many platforms,
    and replicate builds
  • We build all binaries with NMI Build and Test
    infrastructure

32
Building the VDT
NMI
VDT
RPM downloads
Build Test Condor pool (70 computers)
Sources (CVS)
Test
Users
Package
Patching
Pacman Cache
Build
Binaries
Test
Build
Binaries

Contributors
33
Testing the VDT
  • Every night, we test
  • Full VDT install
  • Subsets of VDT
  • Current release: you might be surprised how often
    things break after release!
  • Upcoming release
  • On all supported platforms
  • "Supported" means we test it every night
  • VDT works on some unsupported platforms
  • We care about interactions between the software

34
Supported Platforms
  • Fedora Core 3
  • Fedora Core 4
  • Fedora Core 4/x86-64
  • ROCKS 3.3
  • SuSE 9/ia64
  • RedHat 7
  • RedHat 9
  • Debian 3.1
  • RHAS 3
  • RHAS 3/ia64
  • RHAS 3/x86-64
  • RHAS 4
  • Scientific Linux 3
  • The number of Linux distributions grows
    constantly, and they have important differences
  • People ask for new platforms, but rarely ask to
    drop platforms
  • System administration for heterogeneous systems
    is a lot of work

35
Tests
  • Results on web
  • Results via email
  • A daily reminder!

36
Deploying the VDT
  • We want to support root and non-root
    installations
  • We want to assist with configuration
  • We want it to be simple
  • Our solution: Pacman (see the sketch after this list)
  • Developed by Saul Youssef, BU
  • Downloads and installs with one command
  • Asks questions during install (optionally)
  • Does not require root
  • Can install multiple versions at same time

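As a rough illustration of the "one command" install, a Pacman-based VDT install might look like the sketch below. The cache URL and package name are placeholders, so the exact strings should be taken from the VDT web page.

```bash
# Hedged sketch of a Pacman install of a VDT package into a fresh directory
# (cache URL and package name are placeholders; no root access required).
mkdir vdt && cd vdt
pacman -get http://vdt.cs.wisc.edu/vdt_cache:Condor
# A VDT install writes environment setup scripts into the install directory:
source setup.sh
```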
37
Challenges we struggle with
  • How should we smoothly update a production
    service?
  • In-place vs. on-the-side
  • Preserve old configuration while making big
    changes.
  • As easy as we try to make it, it still takes
    hours to fully install and set up from scratch
  • How do we support more platforms?
  • It's a struggle to keep up with the onslaught of
    Linux distributions
  • Mac OS X? Solaris?

38
More challenges
  • Improving testing
  • We care about interactions between the software:
    when using a VOMS proxy with Condor-G, can we
    run a GT4 job with GridFTP transfer, keeping the
    proxy in MyProxy, while using PBS as the backend
    batch system?
  • Some people want native packaging formats
  • RPM
  • Deb
  • What software should we have?
  • New storage management software

39
One more challenge
  • Hiring
  • We need high quality software developers
  • Creating the VDT involves all aspects of software
    development
  • But developers prefer writing new code instead
    of:
  • Writing lots of little bits of code
  • Thorough testing
  • Lots of debugging
  • User support

40
Where do you learn more?
  • http://vdt.cs.wisc.edu
  • Support
  • Alain Roy: roy@cs.wisc.edu
  • Miron Livny: miron@cs.wisc.edu
  • Official support: vdt-support@ivdgl.org

41
Security Infrastructure
  • Identity: X.509 certificates
  • OSG is a founding member of the US TAGPMA.
  • DOEGrids provides script utilities for bulk
    requests of host certs, CRL checking, etc.
  • VDT downloads CA information from the IGTF.
  • Authentication and authorization using VOMS
    extended attribute certificates (see the sketch
    after this list).
  • DN → account mapping done at the site (multiple CEs,
    SEs) by GUMS.
  • Standard authorization callouts to Prima (CE) and
    gPlazma (SE).

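For illustration, the sketch below shows how a user might create and inspect a VOMS-extended proxy; the VO name and role are invented examples, not a statement about any particular VO's configuration.

```bash
# Create a proxy carrying VOMS attribute certificates for an (invented) VO and role.
voms-proxy-init -voms myvo:/myvo/Role=analysis
# Inspect the proxy: the DN plus these VO/role attributes are what a site's
# GUMS service uses to decide the local account mapping.
voms-proxy-info -all
```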
42
Security Infrastructure
  • Security process modeled on NIST procedural
    controls, starting from an inventory of the OSG
    assets
  • Management: risk assessment, planning, service
    auditing and checking
  • Operational: incident response, awareness and
    training, configuration management
  • Technical: authentication and revocation,
    auditing and analysis. End-to-end trust in the
    quality of code executed on remote CPUs
    (signatures?)

43
User and VO Management
  • A VO registers with the Operations Center
  • Provides the URL of its VOMS service, to be
    propagated to the sites.
  • Several VOMSes are shared with EGEE as part of
    the WLCG.
  • A user registers through VOMRS or a VO administrator
  • User is added to the VOMS of one or more VOs.
  • The VO is responsible for having its users sign the AUP.
  • The VO is responsible for VOMS service support.
  • A site registers with the Operations Center
  • Signs the service agreement.
  • Decides which VOs to support (striving for
    default admit).
  • Populates GUMS from the VOMSes of all VOs. Chooses
    an account UID policy for each VO and role (see the
    mapping sketch after this list).
  • VOs and sites provide Support Center contacts and
    joint operations.
  • For WLCG: the US ATLAS and US CMS Tier-1s are directly
    registered with WLCG. Other support centers are
    propagated through the OSG GOC to WLCG.

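GUMS serves DN-to-account mappings to the site's CEs and SEs dynamically, but conceptually the result is equivalent to grid-mapfile entries like the hedged sketch below; the DNs and account names are invented.

```bash
# Conceptual sketch only: the kind of DN -> local-account mapping a site's GUMS
# produces from the VOs' VOMS data and its own per-VO/role UID policy.
# (DNs and accounts are invented; real mappings are generated, not hand-edited.)
cat <<'EOF'
"/DC=org/DC=doegrids/OU=People/CN=Jane Physicist 12345"   uscms01
"/DC=org/DC=doegrids/OU=People/CN=John Astronomer 67890"  ligo
EOF
```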
44
Operations and User Support
  • Virtual Organization (VO)
  • Group of one or more researchers
  • Resource provider (RP)
  • Operates Compute Elements and Storage Elements
  • Support Center (SC)
  • SC provides support for one or more VO and/or RP
  • VO support centers
  • Provide end user support including triage of
    user-related trouble tickets
  • Community Support
  • Volunteer effort to provide SC for RP for VOs
    without their own SC, and general help discussion
    mail list

45
Operations Model
Real support organizations often play multiple
roles.
Lines represent communication paths and, in our
model, agreements. We have not progressed very
far with agreements yet.
Gray shading indicates that OSG Operations is
composed of effort from all the support centers.
46
OSG Release Process
  • Applications → Integration → Provision → Deploy
  • Integration Testbed (15-20 sites), Production (50 sites)

[Diagram: ITB feeding OSG production.]
47
Integration Testbed
  • As reported in the GridCat status catalog

[Screenshot: GridCat status page showing ITB sites (facility, service,
ITB release, site status) and an Ops map including Tier-2 sites.]
48
Release Schedule
[Timeline, 01/06-9/08: OSG 0.4.1 → OSG 0.6.0 → OSG 0.8.0 → OSG 1.0.0,
with incremental updates (minor releases) between major releases.
Functionality milestones along the way include SC4, CMS CSA06,
WLCG Service Commissioned, the ATLAS Cosmic Ray Run and Advanced LIGO.]
49
OSG Release Timeline
[Timeline, 2/05-7/06: production releases OSG 0.2.1, OSG 0.4.0,
OSG 0.4.1 and OSG 0.6.0, paralleled by integration releases
ITB 0.1.2, 0.1.6, 0.3.0, 0.3.4, 0.3.7 and 0.5.0.]
50
Deployment and Maintenance
  • Distribute software through the VDT and OSG caches.
  • Progress technically via weekly VDT office hours
    - problems, help, planning - fed from multiple
    sources (Ops, Int, VDT-Support, mail, phone).
  • Publish plans and problems through the VDT to-do
    list, Int-Twiki and ticket systems.
  • Critical updates and patches follow Standard
    Operating Procedures.

51
Release Functionality
  • OSG 0.6 (Fall 2006)
  • Accounting
  • Squid (web caching in support of software
    distribution and database information)
  • SRM V2 and AuthZ
  • CEMon ClassAd-based resource selection.
  • Support for MDS-4.
  • OSG 0.8 (Spring 2007)
  • VM-based Edge Services
  • Just-in-time job scheduling, pull-mode Condor-C.
  • Support for sites to run pilot jobs and/or
    glide-ins, using gLExec for identity changes.
  • OSG 1.0 (end of 2007)

52
Inter-operability with Campus grids
  • FermiGrid is an interesting example of the
    challenges we face when making the resources of a
    campus grid (in this case, a DOE laboratory)
    accessible to the OSG community.

53
OSG Principles
  • Characteristics:
  • Provide guaranteed and opportunistic access to
    shared resources.
  • Operate a heterogeneous environment both in
    services available at any site and for any VO,
    and multiple implementations behind common
    interfaces.
  • Interface to Campus and Regional Grids.
  • Federate with other national/international Grids.
  • Support multiple software releases at any one
    time.
  • Drivers:
  • Delivery to the schedule, capacity and capability
    of LHC and LIGO
  • Contributions to/from and collaboration with the
    US ATLAS, US CMS, LIGO software and computing
    programs.
  • Support for/collaboration with other
    physics/non-physics communities.
  • Partnerships with other Grids - especially EGEE
    and TeraGrid.
  • Evolution by deployment of externally developed
    new services and technologies.

54
OSG Middleware Layering

[Layered diagram]
Applications: CMS Services & Framework; CDF, D0 SamGrid & Framework;
ATLAS Services & Framework; LIGO Data Grid
OSG Release Cache: VDT + Configuration, Validation, VO management
Virtual Data Toolkit (VDT): Common Services + NMI +
VOMS, CEMon (common EGEE components), MonALISA, Clarens, AuthZ
Infrastructure
55
Summary
  • OSG facility opened July 22nd, 2005.
  • OSG facility is under steady use
  • 2,000-3,000 jobs at all times
  • Mostly HEP, but large Bio/Eng/Med runs occasionally
  • Moderate other physics (Astro/Nuclear); LIGO
    expected to ramp up.
  • OSG project
  • 5-year proposal to DOE and NSF, funded starting 9/06
  • Facility: Improve/Expand/Extend/Interoperate,
    plus education and outreach
  • Off to a running start, but lots more to do.
  • Routinely exceeding 1 Gbps at 3 sites
  • Scale by x4 by 2008, and many more sites
  • Routinely exceeding 1,000 running jobs per client
  • Scale by at least x10 by 2008
  • Have reached 99% success rate for 10,000-job-per-day
    submissions
  • Need to reach this routinely, even under heavy
    load

56
EGEE-OSG inter-operability
  • Agree on a common Virtual Organization Management
    System (VOMS)
  • Active Joint Security groups leading to common
    policies and procedures.
  • Condor-G interfaces to multiple remote job
    execution services (GRAM, Condor-C).
  • File Transfers using GridFTP.
  • SRM V1.1 for managed storage access. SRM V2.1 in
    test.
  • Publish OSG BDII to shared BDII for Resource
    Brokers to route jobs across the two grids.
  • Automate ticket routing between GOCs.

57
What is FermiGrid?
  • Integrates resources across most (soon all)
    owners at Fermilab.
  • Supports jobs from Fermilab organizations to run
    on any/all accessible campus (FermiGrid) and
    national (Open Science Grid) resources.
  • Supports jobs from the OSG being scheduled onto
    any/all Fermilab sites.
  • Unified and reliable common interface and
    services for the FermiGrid gateway, including
    security, job scheduling, user management and
    storage.
  • More information is available at
    http://fermigrid.fnal.gov

58
Job Forwarding and Resource Sharing
  • The gateway currently interfaces 5 Condor pools with
    diverse file systems and >1000 job slots. Plans
    to grow to 11 clusters (8 Condor, 2 PBS and 1
    LSF).
  • Job scheduling policies and in-place agreements
    for sharing allow fast response to changes in
    resource needs by Fermilab and OSG users.
  • The gateway provides a single bridge between the OSG
    wide-area distributed infrastructure and the
    FermiGrid local sites. It consists of a Globus
    gatekeeper and a Condor-G (see the sketch after
    this list).
  • Each cluster has its own Globus gatekeeper.
  • Storage and job execution policies are applied
    through site-wide managed security and
    authorization services.

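As a hedged sketch of how this layering can be exercised from the outside, the commands below run a trivial job through a gatekeeper. The hostnames are invented and do not refer to the real FermiGrid endpoints.

```bash
# Invented hostnames; a valid grid (or VOMS) proxy is assumed.
grid-proxy-init
# Submit through the gateway gatekeeper; its Condor-G forwards the job
# to one of the member clusters.
globus-job-run fg-gateway.example.fnal.gov/jobmanager-condor /bin/hostname
# Each member cluster also runs its own gatekeeper and can be addressed directly:
globus-job-run cluster1.example.fnal.gov/jobmanager-condor /bin/hostname
```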
59
Access to FermiGrid
[Diagram: the FermiGrid gateway (Globus gatekeeper + Condor-G) in front
of the per-cluster Globus gatekeepers.]
60
GLOW UW Enterprise Grid
  • Condor pools at various departments, integrated
    into a campus-wide grid
  • Grid Laboratory of Wisconsin (GLOW)
  • Older private Condor pools at other departments
  • 1000 1-GHz Intel CPUs at CS
  • 100 2-GHz Intel CPUs at Physics
  • Condor jobs flock from on-campus and off-campus to
    GLOW
  • Excellent utilization
  • Especially when the Condor standard universe is
    used
  • Preemption, checkpointing, job migration

61
Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW. Six GLOW sites:
  • Computational Genomics, Chemistry
  • Amanda, Ice-cube, Physics/Space Science
  • High Energy Physics/CMS, Physics
  • Materials by Design, Chemical Engineering
  • Radiation Therapy, Medical Physics
  • Computer Science

GLOW phases 1-2 plus non-GLOW funded nodes have
1000 Xeons and 100 TB of disk.
62
How does it work?
  • Each of the six sites manages a local Condor pool
    with its own collector and matchmaker
  • Through the High Availability Daemon (HAD) service
    offered by Condor, one of these matchmakers is
    elected to manage all GLOW resources (see the
    configuration sketch after this list)

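A heavily simplified sketch of the kind of Condor configuration this relies on is shown below. The hostnames and port numbers are invented, and the exact knob names and values should be taken from the Condor High Availability documentation for the version in use.

```bash
# Sketch only: enable HAD on each candidate central manager so that a single
# matchmaker is elected at a time. Hostnames/ports are invented placeholders.
cat >> /etc/condor/condor_config.local <<'EOF'
DAEMON_LIST      = MASTER, COLLECTOR, NEGOTIATOR, HAD, REPLICATION
HAD_LIST         = cm1.glow.example.edu:51450, cm2.glow.example.edu:51450
REPLICATION_LIST = cm1.glow.example.edu:41450, cm2.glow.example.edu:41450
HAD_USE_PRIMARY  = False
MASTER_NEGOTIATOR_CONTROLLER = HAD
EOF
condor_restart
```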
63
GLOW Deployment
  • GLOW is fully commissioned and is in constant use
  • CPU
  • 66 GLOW + 50 ATLAS + 108 other nodes at CS
  • 74 GLOW + 66 CMS nodes at Physics
  • 93 GLOW nodes at ChemE
  • 66 GLOW nodes at LMCG, MedPhys, Physics
  • 95 GLOW nodes at MedPhys
  • 60 GLOW nodes at IceCube
  • Total CPU: 1339
  • Storage
  • Head nodes at all sites
  • 45 TB each at CS and Physics
  • Total storage: 100 TB
  • GLOW resources are used at the 100% level
  • Key is to have multiple user groups
  • GLOW continues to grow

64
GLOW Usage
  • GLOW Nodes are always running hot!
  • CS Guests
  • Largest user
  • Serving guests - many cycles delivered to guests!
  • ChemE
  • Largest community
  • HEP/CMS
  • Production for collaboration
  • Production and analysis of local physicists
  • LMCG
  • Standard Universe
  • Medical Physics
  • MPI jobs
  • IceCube
  • Simulations

65
GLOW Usage, 3/04-9/05
Leftover cycles available for others
Takes advantage of shadow jobs
Takes advantage of checkpointing jobs
Over 7.6 million CPU-Hours (865 CPU-Years) served!
66
Example Uses
  • ATLAS
  • Over 15 Million proton collision events simulated
    at 10 minutes each
  • CMS
  • Over 70 million events simulated, reconstructed
    and analyzed (in total 10 minutes per event) in the
    past year
  • IceCube / Amanda
  • Data filtering used 12 years of GLOW CPU in one
    month
  • Computational Genomics
  • Prof. Shwartz asserts that GLOW has opened up a
    new paradigm of work patterns in his group
  • They no longer think about how long a particular
    computational job will take - they just do it
  • Chemical Engineering
  • Students do not know where the computing cycles
    are coming from - they just do it - largest user
    group

67
Open Science Grid and GLOW
  • OSG jobs can run on GLOW
  • The gatekeeper routes jobs to a local Condor cluster
  • Jobs flock campus-wide, including to the GLOW
    resources
  • The dCache storage pool is also a registered OSG
    storage resource
  • Beginning to see some use
  • Now actively working on rerouting GLOW jobs to
    the rest of OSG
  • Users do NOT have to adapt to the OSG interface and
    separately manage their OSG jobs
  • New Condor code development

68
Elevating from GLOW to OSG
A specialized scheduler ("schedd on the side") operates on the
schedd's jobs.
[Diagram: jobs 1-5 in the schedd's job queue; the schedd-on-the-side
picks up Job 4.]
69
The Grid Universe
[Diagram: vanilla jobs running locally, grid jobs routed to site X.]
  • easier to live with private networks
  • may use non-Condor resources
  • restricted Condor feature set (e.g. no standard
    universe over grid)
  • must pre-allocate jobs between vanilla and grid
    universes

70
Dynamic Routing Jobs
  • dynamic allocation of jobs between vanilla and
    grid universes (see the sketch below).
  • not every job is appropriate for transformation
    into a grid job.

[Diagram: vanilla jobs running locally, grid jobs routed to sites X, Y and Z.]
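To make the vanilla/grid distinction concrete, the hedged sketch below shows the same job written for each universe; under the routing described above, the second form would be produced automatically from the first. The site contact string is invented, and this is not the actual interface of the new Condor code mentioned earlier.

```bash
# The job as the user writes it: plain vanilla universe, runs inside GLOW.
cat > vanilla.sub <<'EOF'
universe   = vanilla
executable = simulate
arguments  = run42
output     = run42.out
log        = run42.log
queue
EOF

# The grid-universe form it could be routed into for an OSG site
# (contact string is invented).
cat > routed.sub <<'EOF'
universe      = grid
grid_resource = gt2 ce.site-x.example.org/jobmanager-condor
executable    = simulate
arguments     = run42
output        = run42.out
log           = run42.log
queue
EOF
```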
71
Final Observation
  • A production grid is the product of a complex
    interplay of many forces
  • Resource providers
  • Users
  • Software providers
  • Hardware trends
  • Commercial offerings
  • Funding agencies
  • Culture of all parties involved