FermiGrid - PRIMA, VOMS, GUMS - PowerPoint PPT Presentation


Title: FermiGrid - PRIMA, VOMS, GUMS


1
FermiGrid - PRIMA, VOMS, GUMS & SAZ
  • Keith Chadwick
  • Fermilab
  • chadwick_at_fnal.gov

2
What is FermiGrid?
  • FermiGrid is:
  • The Fermilab campus Grid.
  • A set of common services to support the campus
    Grid
  • The site globus gateway, VOMS, VOMRS, GUMS, SAZ,
    MyProxy, Gratia Accounting, etc.
  • A forum for promoting stakeholder
    interoperability and resource sharing within
    Fermilab.
  • The portal from the Open Science Grid to Fermilab
    Compute and Storage Services
  • Production: fermigrid1, fngp-osg, fcdfosg1,
    fcdfosg2, docabosg2, sdss-tam, FNAL_FERMIGRID_SE
    (public dcache), stken, etc.
  • Integration: fgtest1, fnpcg, etc.
  • FermiGrid Web Site & Additional Documentation:
  • http://fermigrid.fnal.gov/

3
FermiGrid - Infrastructure
  • Site Globus Gateway
  • Job forwarding gateway using Condor-G and CEMon.
  • Makes use of the "accept limited" globus
    gatekeeper option.
  • VOMS & VOMRS
  • VO Membership Service & VO Management
    Registration Service.
  • Allows user to select roles.
  • GUMS
  • Grid User Management System.
  • Maps FQAN in x509 proxy to site-specific UID/GID.
  • SAZ
  • Site AuthoriZation Service.
  • Allows site to make fine-grained job
    authorization decisions.
  • MyProxy
  • Service to securely store and retrieve signed
    x509 proxies.

4
Site Gatekeeper Job Forwarding
  • Why?
  • Single point of control.
  • Hide site internal details.
  • Facilitate resource sharing.
  • Allow (some) load balancing
  • Support specification of user job requirements
    (via ClassAds).
  • Why not?
  • Complicates problem diagnosis.
  • Non-standard configuration.
  • Can confuse users.

5
Site Gateway Job Forwarding with CEMon and
BlueArc - Animation
  • (Animation; diagram labels: VOMS Server, GUMS
    Server, SAZ Server, Site Gateway, BlueArc, CMS
    WC1, CDF OSG1, CDF OSG2, D0 CAB2, SDSS TAM, GP
    Farm, LQCD)
6
Globus gatekeeper - GUMS & SAZ interface
  • GUMS and SAZ are interfaced to the globus
    gatekeeper through the gsi_authz callout:
  • /etc/grid-security/gsi_authz.conf
  • PRIMA:
  • globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
  • SAZ:
  • globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
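
As a rough illustration of how the two callout entries on this slide are laid out, the sketch below reads each entry as three whitespace-separated fields: callout type, library path, and symbol name. The function name `parse_gsi_authz` and the handling of blank lines and `#` comments are assumptions made for this example, not part of Globus.

```python
def parse_gsi_authz(text):
    """Map each callout type to its (library, symbol) pair.

    Assumes one entry per line with exactly three fields; blank lines
    and lines starting with '#' are skipped (an assumption).
    """
    entries = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        callout_type, library, symbol = line.split()
        entries[callout_type] = (library, symbol)
    return entries


EXAMPLE_CONF = """\
globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
"""

callouts = parse_gsi_authz(EXAMPLE_CONF)
```

Here `callouts["globus_mapping"]` holds the PRIMA entry and `callouts["globus_authorization"]` holds the SAZ entry.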

7
SAZ - Site AuthoriZation Service
  • We deployed the Fermilab Site AuthoriZation (SAZ)
    service on the Fermilab Site Globus Gatekeeper
    (fermigrid1) on Monday October 2, 2006.
  • SAZ allows Fermilab to make Grid job
    authorization decisions for the Fermilab site
    based on the DN, VO, Role and CA information
    contained in the proxy certificate provided by
    the user.
  • Fermilab has currently configured SAZ to operate
    in a default accept mode for user proxy
    credentials that are associated with VOs (user
    proxy credentials generated by voms-proxy-init).
  • Users that continue to use grid-proxy-init may no
    longer be able to execute on Fermilab Compute
    Elements.

8
SAZ Database Table Structure
  • DN
  • user_name, enabled, trusted, changedAt
  • VO
  • vo_name, enabled, trusted, changedAt
  • Role
  • role_name, enabled, trusted, changedAt
  • CA
  • ca_name, enabled, trusted, changedAt
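
The four tables above could be sketched with Python's built-in `sqlite3` module as follows. Only the column names come from the slide. The table names (`saz_user`, `saz_vo`, `saz_role`, `saz_ca`), the column types, and the 'Y'/'F' flag defaults are assumptions for illustration, not the actual SAZ schema.

```python
import sqlite3

# Hypothetical table name -> name column taken from the slide.
SAZ_TABLES = {
    "saz_user": "user_name",
    "saz_vo": "vo_name",
    "saz_role": "role_name",
    "saz_ca": "ca_name",
}


def create_saz_tables(conn):
    """Create one table per authorization attribute (DN, VO, Role, CA)."""
    for table, name_col in SAZ_TABLES.items():
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            f" {name_col} TEXT PRIMARY KEY,"
            " enabled TEXT NOT NULL DEFAULT 'Y',"    # assumed default
            " trusted TEXT NOT NULL DEFAULT 'F',"    # assumed default
            " changedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
        )
    conn.commit()


demo_conn = sqlite3.connect(":memory:")
create_saz_tables(demo_conn)
```

All four tables share the same three flag and timestamp columns, so a single loop is enough to create them.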

9
SAZ - Site AuthoriZation Pseudo-Code
  • Site authorization callout on globus gateway
    sends SAZ authorization request (example):
  • user: /DC=org/DC=doegrids/OU=People/CN=Keith
    Chadwick 800325
  • VO: fermilab
  • Role: /fermilab/Role=NULL/Capability=NULL
  • CA: /DC=org/DC=DOEGrids/OU=Certificate
    Authorities/CN=DOEGrids CA 1
  • SAZ server on fermigrid4 receives SAZ
    authorization request, and
  • 1. Verifies certificate and trust chain.
  • 2. If the certificate does not verify or the
    trust chain is invalid then
  • SAZ returns "Not-Authorized"
  • fi
  • 3. Issues select on "user" against the SAZDB
    user table
  • 4. if the select on "user" fails then
  • a record corresponding to the "user" is
    inserted into the SAZDB user table with
    (user.enabled=Y, user.trusted=F)
  • fi
  • 5. Issues select on "VO" against the local SAZDB
    vo table
  • 6. if the select on "VO" fails then
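
Steps 1 through 4 of the pseudo-code can be sketched in Python as shown below. This is a minimal illustration, not the SAZ server code: certificate verification is reduced to a boolean supplied by the caller, the table name `saz_user` and the helper names are hypothetical, and the steps that the transcript cuts off after step 6 are deliberately left out.

```python
import sqlite3

NOT_AUTHORIZED = "Not-Authorized"


def open_demo_db():
    """Create an in-memory stand-in for the SAZDB user table (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE saz_user ("
        " user_name TEXT PRIMARY KEY, enabled TEXT, trusted TEXT)"
    )
    return conn


def check_user(conn, user_dn, certificate_ok):
    """Return NOT_AUTHORIZED, or the user's enabled/trusted record."""
    # Steps 1-2: a failed certificate or trust-chain check ends the request.
    if not certificate_ok:
        return NOT_AUTHORIZED
    # Step 3: select on "user" against the user table.
    row = conn.execute(
        "SELECT enabled, trusted FROM saz_user WHERE user_name = ?",
        (user_dn,),
    ).fetchone()
    # Step 4: an unknown user is inserted as enabled=Y, trusted=F.
    if row is None:
        row = ("Y", "F")
        conn.execute(
            "INSERT INTO saz_user (user_name, enabled, trusted) VALUES (?, ?, ?)",
            (user_dn, *row),
        )
        conn.commit()
    return {"enabled": row[0], "trusted": row[1]}
```

Calling `check_user` twice with the same DN inserts the record on the first call and only reads it on the second, which mirrors the select-then-insert pattern of steps 3 and 4.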

10
SAZ - Animation
  • (Animation; diagram labels: DN, VO, Role, CA,
    Gatekeeper)
11
SAZ - A Couple of Caveats
  • What about grid-proxy-init or voms-proxy-init
    without a VO?
  • The NULL VO is specifically disabled
    (vo.enabled=F, vo.trusted=F).
  • If a user has user.trusted=Y in their user
    record then
  • >>> we allow them to execute jobs without VO
    sponsorship <<<.
  • This granting of user.trusted=Y is not
    automatic.
  • The number of users with this privilege will be
    VERY limited.
  • What about pilot jobs / glide-in operation?
  • To comply with the (draft) Fermilab policy on
    pilot jobs, VOs that submit pilot jobs will
    shortly be required to use glexec to launch the
    user portion of their glide-in jobs.
  • SAZ authorization requests from glexec may
    require the VO to have role.trusted=Y in
    the VO specific role record that they are using
    for glide-in operations.
  • The granting of role.trusted=Y will not be
    automatic.
  • Authorization for trusted=Y flags in the SAZ
    database tables is granted and revoked by the
    Fermilab Computer Security Executive based on
    explicit trust relationships.
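
The first caveat on this slide (a disabled NULL VO that only trusted users may bypass) can be expressed as a small decision function. This is one reading of the slide rather than the SAZ implementation: the name `"NULL"` for the NULL VO, the record layout, and the rule that any other disabled VO is simply refused are all assumptions.

```python
NULL_VO = "NULL"  # assumed name for "no VO in the proxy"


def vo_check(vo_name, vo_record, user_record):
    """Return True if the request passes the VO sponsorship check."""
    if vo_record["enabled"] == "Y":
        return True
    # The NULL VO is disabled, so only users explicitly flagged as
    # trusted may run without VO sponsorship.
    if vo_name == NULL_VO:
        return user_record["trusted"] == "Y"
    # Assumption: a disabled, non-NULL VO is refused for every user.
    return False


ordinary_user = {"enabled": "Y", "trusted": "F"}
trusted_user = {"enabled": "Y", "trusted": "Y"}
null_vo = {"enabled": "F", "trusted": "F"}
fermilab_vo = {"enabled": "Y", "trusted": "F"}
```

With these records, `vo_check("NULL", null_vo, ordinary_user)` is refused, while the same call with `trusted_user` passes.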

12
SAZ - Open Issues
  • Extra /CN=<random number> in DN.
  • Examples:
  • /DC=org/DC=doegrids/OU=People/CN=Leigh
    Grundhoefer (GridCat) 693100/CN=1173547087
  • /DC=org/DC=doegrids/OU=People/CN=Leigh
    Grundhoefer (GridCat) 693100/CN=1642479879
  • /DC=org/DC=doegrids/OU=People/CN=Leigh
    Grundhoefer (GridCat) 693100/CN=1769868279
  • Result of user issuing grid-proxy-init.
  • Does not occur in voms-proxy-init.
  • Looking at code changes to handle extra CN
    problem.
  • Condor fails to properly delegate the full voms
    proxy attributes.
  • This can be worked around in condor_config by
    setting:
  • DELEGATE_JOB_GSI_CREDENTIALS=FALSE
  • A ticket on this issue has been opened with the
    Condor developers.
  • Testing by Chris Green and John Weigand shows that
    Reliable File Transfer (RFT) with WS-Gram is also
    failing to properly delegate the full voms
    attributes.
  • RFT is using the full voms proxy for the first
    transaction, but uses a cached copy without the
    role information for the second transaction.
  • A ticket on this issue has been opened with the
    Globus developers.
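
One possible way to handle the extra numeric CN listed first on this slide is to strip a trailing all-digit CN component before looking the DN up. The slide only says that code changes are being considered, so the sketch below is a hypothetical approach with an invented function name, not the fix that was adopted.

```python
import re

# A trailing "/CN=<digits>" component, as in the examples on this slide.
_TRAILING_NUMERIC_CN = re.compile(r"/CN=\d+$")


def strip_proxy_cn(dn):
    """Remove one trailing all-digit /CN= component, if present."""
    return _TRAILING_NUMERIC_CN.sub("", dn)


base_dn = "/DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100"
proxy_dn = base_dn + "/CN=1173547087"
normalized = strip_proxy_cn(proxy_dn)
```

A DN whose last CN merely ends in digits, such as the `693100` suffix above, is left unchanged because the pattern requires the whole component to be numeric.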

13
Draft Fermilab VO Trust Relationship Policy
  • Fermilab will only accept jobs from Virtual
    Organizations (VOs) which have established trust
    relationships in good standing. Trust
    relationships can be requested by VO management
    by contacting Fermilab Computer Security, and are
    granted and revoked by the Fermilab Computer
    Security Executive.
  • Some VOs such as CDF, D0, MINOS, LQCD, already
    possess a valid trust relationship with Fermilab
    due to overlap of staff or the umbrella of
    Fermilab's own operational and management
    controls. Other VOs will be expected to
    establish the trust relationship as described
    below in order to continue using Fermilab
    resources.
  • Criteria for Establishing Trust Relationships
  • Policies and practices for mutual security are
    continually adjusted to meet changes in risk
    perceptions. (NIST)
  • Acceptable use of Fermilab resources is governed
    by both the VO's and Fermilab's Acceptable Use
    Policies. The Open Science Grid's User AUP (V2.0,
    February 9, 2006) is an example of an AUP
    acceptable to Fermilab and applies to users
    operating under OSG's auspices.
  • A VO must describe and operate its technical
    infrastructure in a transparent manner which
    permits verification of its functioning.
  • A VO must have an operational organization with
    an appropriate number of staff members who
    respond to Fermilab requests (email and/or phone
    calls) within a reasonable time, generally during
    the normal business hours of its home site.
  • A VO must have an established and published
    response plan to deal with security incidents and
    reports of unauthorized use, and the staff to
    implement the plan.
  • Non-compliance with site policies by a VO or its
    members may trigger early or frequent
    re-examination of the trust relationship with the
    VO.

14
Draft Pilot Job Policy
  • A Pilot Job (also called a glide-in or
    late-binding job) is a batch job which starts on
    a grid worker node but loads some other job,
    termed the User Job, which has been created by
    another user.
  • Rules
  • Pilot Jobs will only be acceptable from VOs whose
    trust relationships with Fermilab include
    authorization to use them.
  • A Pilot Job must use the site provided glexec
    facility to map the application and data files to
    the actual owner of the User Job. glexec will
    perform the necessary callout to the Grid User
    Management System (GUMS) and Site Authorization
    Service (SAZ), and the Pilot Job must respect the
    result of these Policy Decision Points.
  • A Pilot Job and the User Job will not attempt to
    circumvent job accounting or limits placed on
    system resources by the batch system.
  • A Pilot Job may launch multiple User Jobs in
    serial fashion, but must not attempt to maintain
    data files between jobs belonging to different
    users.
  • When transferring a User Job into the worker
    node, the Pilot Job will use a level of security
    equivalent to that of the original job submission
    process.
  • Consequences
  • Fermilab reserves the right to terminate any
    batch jobs that appear to be operating beyond
    their authorization, including Pilot Jobs and
    User Jobs not in compliance with this policy.
  • The DN of the Job Manager or the entire VO may be
    placed on the Site Black List until the situation
    is rectified.
  • Fermilab expects any VO authorized to run Pilot
    Jobs to assure compliance by its users.

15
glexec
  • Joint development by David Groep / Gerben
    Venekamp / Oscar Koeroo (NIKHEF) and Dan Yocum /
    Igor Sfiligoi (Fermilab).
  • Integrated (via plugins) with LCAS / LCMAPS
    infrastructure (for LCG) and GUMS / SAZ
    infrastructure (for OSG).
  • glexec is currently deployed on a couple of small
    clusters at Fermilab, moving towards a
    significant deployment at Fermilab this week.
  • Will be included in Condor 6.9.x.

16
glexec block diagram
17
High Availability / Service Redundancy Plans
  • Gatekeeper
  • Redundant Condor_Master and Condor_Negotiator.
  • VOMS
  • Sticky problem.
  • Have requested a change to VOMRS that will make
    things much easier.
  • GUMS
  • Have a test active/standby GUMS service operating
    with Linux-HA.
  • Believe that we know how to implement an
    active/active service.
  • SAZ
  • Can implement either active/standby or
    active/active.
  • MyProxy
  • Need for MyProxy will be eliminated by new CEMon
    based job forwarding mechanism.

18
Metrics
  • In addition to the normal operational effort of
    installing, running and upgrading the various
    FermiGrid services over the past year, we have
    spent significant effort to collect and publish
    operational metrics. Examples:
  • Globus gatekeeper calls by jobmanager per day
  • Globus gatekeeper IP connections per day
  • VOMS calls per day
  • VOMS server IP connections per day
  • GUMS calls per day
  • GUMS server IP connections per day
  • GUMS server unique Certificates and Mappings per
    day
  • SAZ Authorizations and Rejections per day
  • SAZ server IP connections per day
  • SAZ server unique DN, VO, Role & CA per day.
  • Metrics collection scripts run once a day and
    collect information for the previous day.
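
A once-a-day collection script of the kind described above might look like the sketch below. It assumes, purely for illustration, that every log line starts with an ISO date (YYYY-MM-DD) and that events can be recognized by a keyword in the line. The real gatekeeper, VOMS, GUMS and SAZ log formats differ, and the function names are hypothetical.

```python
from datetime import date, timedelta


def previous_day(today=None):
    """Return the day the daily run should report on."""
    return (today or date.today()) - timedelta(days=1)


def count_for_day(lines, day, keywords):
    """Count lines stamped with `day` that contain each keyword."""
    prefix = day.isoformat()
    counts = {keyword: 0 for keyword in keywords}
    for line in lines:
        if not line.startswith(prefix):
            continue
        for keyword in keywords:
            if keyword in line:
                counts[keyword] += 1
    return counts


sample_log = [
    "2007-03-01T10:00:00 Authorized user A",
    "2007-03-01T11:30:00 Rejected user B",
    "2007-03-02T00:00:01 Authorized user C",
]
daily = count_for_day(sample_log, date(2007, 3, 1), ["Authorized", "Rejected"])
```

Filtering on the date prefix is what restricts each run to the previous day's entries.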

19
Metrics - fermigrid1
20
Service Monitoring
  • Service Monitor scripts run multiple times per
    day (typically once per hour).
  • They gather detailed information about the
    service that they are monitoring.
  • They also verify the health of the service that
    they are monitoring (together with any dependent
    services), notify administrators and
    automatically restart the service(s) as necessary
    to ensure continuous operations.
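
The check, notify and restart cycle described above can be sketched as a single monitoring pass. The health checks, restart actions and notification channel are passed in as plain callables because the slide does not say how they are implemented. Dependent services are not modelled, and all names here are hypothetical.

```python
def monitor_once(services, notify):
    """Run one monitoring pass.

    `services` maps a service name to an (is_healthy, restart) pair of
    callables; `notify` receives a message for each failed check.
    Returns a dict of service name -> "ok", "restarted" or "failed".
    """
    report = {}
    for name, (is_healthy, restart) in services.items():
        if is_healthy():
            report[name] = "ok"
            continue
        notify(f"{name} failed its health check; restarting")
        restart()
        # Re-check so the report reflects whether the restart helped.
        report[name] = "restarted" if is_healthy() else "failed"
    return report
```

A scheduler such as cron could call `monitor_once` hourly, matching the "typically once per hour" cadence on the slide.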

21
Service Monitor - fermigrid1
22
Areas of Current Work within FermiGrid
  • SAZ and glexec - nearing completion.
  • BlueArc storage and public dcache storage element
    - ongoing.
  • Further Metrics and Service Monitor Development -
    ongoing.
  • Gratia Accounting.
  • Web Services.
  • XEN.
  • Service Failover
  • Research, Development & Deployment of future ITBs
    and OSG releases.

23
Parting Comments
  • Extracting metrics and service monitor
    information needs to be easier - trolling through
    (globus gatekeeper, voms, gums, saz) log files is
    not an efficient method.
  • Having a uniform standard time format (and some
    sort of unique process/thread id) is essential.
  • Problem diagnosis is also very difficult (our job
    forwarding gateway does compound this problem).
  • David Bianco from Jefferson Lab gave a
    presentation on Sguil at the Fall 2006 HEPiX
    conference. Having a similar common interface
    for the globus gatekeepers and services log files
    together with the ability to correlate events
    from multiple sources would significantly improve
    problem diagnosis.
  • https://indico.fnal.gov/conferenceDisplay.py?confId=384
  • https://indico.fnal.gov/materialDisplay.py?contribId=9&sessionId=17&materialId=slides&confId=384

24
fin
  • Any questions?