Title: FermiGrid - PRIMA, VOMS, GUMS
1 FermiGrid - PRIMA, VOMS, GUMS and SAZ
- Keith Chadwick
- Fermilab
- chadwick_at_fnal.gov
2 What is FermiGrid?
- FermiGrid is:
  - The Fermilab campus Grid.
  - A set of common services to support the campus Grid:
    - The site globus gateway, VOMS, VOMRS, GUMS, SAZ, MyProxy, Gratia Accounting, etc.
  - A forum for promoting stakeholder interoperability and resource sharing within Fermilab.
  - The portal from the Open Science Grid to Fermilab Compute and Storage Services:
    - Production: fermigrid1, fngp-osg, fcdfosg1, fcdfosg2, docabosg2, sdss-tam, FNAL_FERMIGRID_SE (public dcache), stken, etc.
    - Integration: fgtest1, fnpcg, etc.
- FermiGrid Web Site and Additional Documentation:
  - http://fermigrid.fnal.gov/
3 FermiGrid - Infrastructure
- Site Globus Gateway
  - Job forwarding gateway using Condor-G and CEMon.
  - Makes use of the "accept limited" globus gatekeeper option.
- VOMS and VOMRS
  - VO Membership Service and VO Management Registration Service.
  - Allows users to select roles.
- GUMS
  - Grid User Mapping Service.
  - Maps the FQAN in an x509 proxy to a site-specific UID/GID.
- SAZ
  - Site AuthoriZation Service.
  - Allows the site to make fine-grained job authorization decisions.
- MyProxy
  - Service to securely store and retrieve signed x509 proxies.
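The GUMS mapping step can be pictured as a lookup from FQAN to a local account. The sketch below is only an illustration of that idea, not the GUMS implementation: the table contents, the UID/GID values and the names `Account`, `FQAN_MAP` and `map_fqan` are all hypothetical.

```python
from typing import NamedTuple, Optional


class Account(NamedTuple):
    uid: int
    gid: int


# Hypothetical mapping table: these UIDs and GIDs are invented for illustration.
FQAN_MAP = {
    "/fermilab/Role=NULL/Capability=NULL": Account(uid=10001, gid=5000),
    "/cms/Role=production/Capability=NULL": Account(uid=10002, gid=5001),
}


def map_fqan(fqan: str) -> Optional[Account]:
    """Return the site-specific account for an FQAN, or None if no mapping exists."""
    return FQAN_MAP.get(fqan)


print(map_fqan("/fermilab/Role=NULL/Capability=NULL"))
```

A real service would also take the user DN into account and would load its mappings from configuration rather than a literal dictionary.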
4 Site Gatekeeper Job Forwarding
- Why?
  - Single point of control.
  - Hide site internal details.
  - Facilitate resource sharing.
  - Allow (some) load balancing.
  - Support specification of user job requirements (via ClassAds).
- Why not?
  - Complicates problem diagnosis.
  - Non-standard configuration.
  - Can confuse users.
5 Site Gateway Job Forwarding with CEMon and BlueArc - Animation
[Animated diagram. Labels: VOMS Server, GUMS Server, SAZ Server, Site Gateway, BlueArc, CMS WC1, CDF OSG1, CDF OSG2, D0 CAB2, SDSS TAM, GP Farm, LQCD.]
6 Globus gatekeeper - GUMS and SAZ interface
- GUMS and SAZ are interfaced to the globus gatekeeper through the gsi_authz callout:
  - /etc/grid-security/gsi_authz.conf
- PRIMA
  - globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
- SAZ
  - globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
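Each gsi_authz.conf entry shown on this slide has three whitespace-separated fields: the callout type, the library path and the symbol name. As a rough sketch for inspecting such a file, the following parser splits each line into those fields. The names `Callout` and `parse_gsi_authz` are hypothetical, and skipping blank lines and `#` comment lines is an assumption about the file format.

```python
from typing import List, NamedTuple


class Callout(NamedTuple):
    abstract_type: str  # e.g. globus_mapping
    library: str        # library path as written in the file
    symbol: str         # function looked up inside the library


def parse_gsi_authz(text: str) -> List[Callout]:
    """Parse gsi_authz.conf-style text into a list of Callout entries."""
    callouts = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # assumed comment convention
            continue
        fields = line.split()
        if len(fields) != 3:
            raise ValueError(f"expected 3 fields, got {len(fields)}: {line!r}")
        callouts.append(Callout(*fields))
    return callouts


EXAMPLE = """\
globus_mapping /usr/local/vdt/prima/lib/libprima_authz_module_gcc32dbg globus_gridmap_callout
globus_authorization /usr/local/vdt/saz/client/lib/libSAZ-gt3.2_gcc32dbg globus_saz_access_control_callout
"""

for entry in parse_gsi_authz(EXAMPLE):
    print(entry.abstract_type, "->", entry.symbol)
```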
7 SAZ - Site AuthoriZation Service
- We deployed the Fermilab Site AuthoriZation (SAZ) service on the Fermilab Site Globus Gatekeeper (fermigrid1) on Monday October 2, 2006.
- SAZ allows Fermilab to make Grid job authorization decisions for the Fermilab site based on the DN, VO, Role and CA information contained in the proxy certificate provided by the user.
- Fermilab has currently configured SAZ to operate in a default accept mode for user proxy credentials that are associated with VOs (user proxy credentials generated by voms-proxy-init).
- Users that continue to use grid-proxy-init may no longer be able to execute on Fermilab Compute Elements.
8 SAZ Database Table Structure
- DN
  - user_name, enabled, trusted, changedAt
- VO
  - vo_name, enabled, trusted, changedAt
- Role
  - role_name, enabled, trusted, changedAt
- CA
  - ca_name, enabled, trusted, changedAt
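The four tables share the same shape, which a small schema sketch makes concrete. This uses an in-memory SQLite database purely for illustration; the table names (`user`, `vo`, `role`, `ca`), the column types and the Y/F constraint are assumptions, and only the column names come from the slide.

```python
import sqlite3

# Table name -> key column. Table names and column types are assumptions;
# only the column names are taken from the slide.
TABLES = {"user": "user_name", "vo": "vo_name", "role": "role_name", "ca": "ca_name"}


def create_sazdb(conn: sqlite3.Connection) -> None:
    """Create one table per attribute, each with enabled/trusted flags and a timestamp."""
    for table, key in TABLES.items():
        conn.execute(
            f"CREATE TABLE {table} ("
            f"{key} TEXT PRIMARY KEY, "
            "enabled TEXT NOT NULL CHECK (enabled IN ('Y', 'F')), "
            "trusted TEXT NOT NULL CHECK (trusted IN ('Y', 'F')), "
            "changedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP)"
        )


conn = sqlite3.connect(":memory:")
create_sazdb(conn)
# Example row: a VO record that is neither enabled nor trusted.
conn.execute("INSERT INTO vo (vo_name, enabled, trusted) VALUES ('NULL', 'F', 'F')")
print(conn.execute("SELECT vo_name, enabled, trusted FROM vo").fetchall())
```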
9 SAZ - Site AuthoriZation Pseudo-Code
- Site authorization callout on globus gateway sends SAZ authorization request (example):
  - user: /DC=org/DC=doegrids/OU=People/CN=Keith Chadwick 800325
  - VO: fermilab
  - Role: /fermilab/Role=NULL/Capability=NULL
  - CA: /DC=org/DC=DOEGrids/OU=Certificate Authorities/CN=DOEGrids CA 1
- SAZ server on fermigrid4 receives SAZ authorization request, and:
  - 1. Verifies certificate and trust chain.
  - 2. If the certificate does not verify or the trust chain is invalid then
    - SAZ returns "Not-Authorized"
  - fi
  - 3. Issues select on "user" against the SAZDB user table
  - 4. if the select on "user" fails then
    - a record corresponding to the "user" is inserted into the SAZDB user table with (user.enabled=Y, user.trusted=F)
  - fi
  - 5. Issues select on "VO" against the local SAZDB vo table
  - 6. if the select on "VO" fails then
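Steps 2 through 4 of the pseudo-code can be sketched as a select-or-insert against the user table. This is a minimal illustration under stated assumptions, not the SAZ server code: it uses an in-memory SQLite table with an assumed layout, takes the result of certificate verification as a boolean argument, and signals "Not-Authorized" by raising `PermissionError` (a choice made for this sketch). The function name `select_or_insert_user` is hypothetical.

```python
import sqlite3


def select_or_insert_user(conn, dn, chain_valid):
    """Return (enabled, trusted) for a DN, registering unknown DNs as enabled=Y, trusted=F."""
    # Step 2: reject requests whose certificate or trust chain did not verify.
    if not chain_valid:
        raise PermissionError("Not-Authorized")
    # Step 3: select on "user" against the user table.
    row = conn.execute(
        "SELECT enabled, trusted FROM user WHERE user_name = ?", (dn,)
    ).fetchone()
    # Step 4: if the select finds nothing, insert a default record.
    if row is None:
        conn.execute(
            "INSERT INTO user (user_name, enabled, trusted) VALUES (?, 'Y', 'F')",
            (dn,),
        )
        row = ("Y", "F")
    return row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (user_name TEXT PRIMARY KEY, enabled TEXT, trusted TEXT)")
dn = "/DC=org/DC=doegrids/OU=People/CN=Keith Chadwick 800325"
print(select_or_insert_user(conn, dn, chain_valid=True))
```

The same pattern would presumably repeat for the VO lookup that begins at step 5, but the slide does not show the remaining steps, so they are left out here.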
10 SAZ - Animation
[Animated diagram. Labels: DN, VO, Role, Gatekeeper, CA.]
11 SAZ - A Couple of Caveats
- What about grid-proxy-init or voms-proxy-init without a VO?
  - The NULL VO is specifically disabled (vo.enabled=F, vo.trusted=F).
  - If a user has user.trusted=Y in their user record then
    - >>> we allow them to execute jobs without VO sponsorship <<<.
  - This granting of user.trusted=Y is not automatic.
  - The number of users with this privilege will be VERY limited.
- What about pilot jobs / glide-in operation?
  - To comply with the (draft) Fermilab policy on pilot jobs, VOs that submit pilot jobs will shortly be required to use glexec to launch their user portion of the glide-in jobs.
  - SAZ authorization requests from glexec may require the VO to have role.trusted=Y in the VO-specific role record that they are using for glide-in operations.
  - The granting of role.trusted=Y will not be automatic.
- Authorization for trusted=Y flags in the SAZ database tables is granted and revoked by the Fermilab Computer Security Executive based on explicit trust relationships.
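The two caveats amount to a pair of flag checks, which the following sketch spells out. It is an illustration only and covers nothing beyond these two rules. The names `Flags` and `passes_caveats` are hypothetical, and treating the glexec rule as strictly enforced is an assumption, since the slide only says that role.trusted=Y "may" be required.

```python
from dataclasses import dataclass


@dataclass
class Flags:
    enabled: str  # "Y" or "F"
    trusted: str  # "Y" or "F"


def passes_caveats(user: Flags, vo: Flags, role: Flags, via_glexec: bool = False) -> bool:
    """Apply only the two caveat rules; other authorization checks are out of scope."""
    # A disabled VO (such as the NULL VO) is acceptable only for a trusted user.
    if vo.enabled != "Y" and user.trusted != "Y":
        return False
    # Assumed strict here: requests arriving through glexec need a trusted role.
    if via_glexec and role.trusted != "Y":
        return False
    return True


null_vo = Flags(enabled="F", trusted="F")
ordinary_user = Flags(enabled="Y", trusted="F")
plain_role = Flags(enabled="Y", trusted="F")
print(passes_caveats(ordinary_user, null_vo, plain_role))
```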
12 SAZ - Open Issues
- Extra /CN=<random number> in DN.
  - Examples:
    - /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1173547087
    - /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1642479879
    - /DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1769868279
  - Result of user issuing grid-proxy-init.
  - Does not occur in voms-proxy-init.
  - Looking at code changes to handle extra CN problem.
- Condor fails to properly delegate the full voms proxy attributes.
  - This can be worked around in condor_config by setting:
    - DELEGATE_JOB_GSI_CREDENTIALS=FALSE
  - A ticket on this issue has been opened with the Condor developers.
- Testing by Chris Green and John Weigand shows that Reliable File Transfer (RFT) with WS-Gram is also failing to properly delegate the full voms attributes.
  - RFT is using the full voms proxy for the first transaction, but uses a cached copy without the role information for the second transaction.
  - A ticket on this issue has been opened with the Globus developers.
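One possible way to handle the extra CN problem is to strip trailing all-numeric CN components before looking the DN up. The sketch below shows that idea only; it is not the code change under consideration for SAZ, and the name `strip_proxy_cn` is hypothetical.

```python
import re

# One or more trailing "/CN=<digits>" components at the end of the DN.
_TRAILING_NUMERIC_CN = re.compile(r"(?:/CN=\d+)+$")


def strip_proxy_cn(dn: str) -> str:
    """Remove trailing all-numeric CN components from a DN string."""
    return _TRAILING_NUMERIC_CN.sub("", dn)


print(strip_proxy_cn(
    "/DC=org/DC=doegrids/OU=People/CN=Leigh Grundhoefer (GridCat) 693100/CN=1173547087"
))
```

Because the pattern requires the whole CN value to be digits, a user CN such as "Leigh Grundhoefer (GridCat) 693100" is left intact. A DN whose only CN is purely numeric would be emptied of that component, so a production version would need a guard for that case.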
13 Draft Fermilab VO Trust Relationship Policy
- Fermilab will only accept jobs from Virtual Organizations (VOs) which have established trust relationships in good standing. Trust relationships can be requested by VO management by contacting Fermilab Computer Security, and are granted and revoked by the Fermilab Computer Security Executive.
- Some VOs such as CDF, D0, MINOS, LQCD, already possess a valid trust relationship with Fermilab due to overlap of staff or the umbrella of Fermilab's own operational and management controls. Other VOs will be expected to establish the trust relationship as described below in order to continue using Fermilab resources.
- Criteria for Establishing Trust Relationships
  - Policies and practices for mutual security are continually adjusted to meet changes in risk perceptions. (NIST)
  - Acceptable use of Fermilab resources is governed by both the VO's and Fermilab's Acceptable Use Policies. The Open Science Grid's User AUP (V2.0, February 9, 2006) is an example of an AUP acceptable to Fermilab and applies to users operating under OSG's auspices.
  - A VO must describe and operate its technical infrastructure in a transparent manner which permits verification of its functioning.
  - A VO must have an operational organization with an appropriate number of staff members who respond to Fermilab requests (email and/or phone calls) within a reasonable time, generally during the normal business hours of its home site.
  - A VO must have an established and published response plan to deal with security incidents and reports of unauthorized use, and the staff to implement the plan.
- Non-compliance with site policies by a VO or its members may trigger early or frequent re-examination of the trust relationship with the VO.
14 Draft Pilot Job Policy
- A Pilot Job (also called a glide-in or late-binding job) is a batch job which starts on a grid worker node but loads some other job, termed the User Job, which has been created by another user.
- Rules
  - Pilot Jobs will only be acceptable from VOs whose trust relationships with Fermilab include authorization to use them.
  - A Pilot Job must use the site provided glexec facility to map the application and data files to the actual owner of the User Job. glexec will perform the necessary callout to the Grid User Management System (GUMS) and Site Authorization Service (SAZ), and the Pilot Job must respect the result of these Policy Decision Points.
  - A Pilot Job and the User Job will not attempt to circumvent job accounting or limits placed on system resources by the batch system.
  - A Pilot Job may launch multiple User Jobs in serial fashion, but must not attempt to maintain data files between jobs belonging to different users.
  - When transferring a User Job into the worker node, the Pilot Job will use a level of security equivalent to that of the original job submission process.
- Consequences
  - Fermilab reserves the right to terminate any batch jobs that appear to be operating beyond their authorization, including Pilot Jobs and User Jobs not in compliance with this policy.
  - The DN of the Job Manager or the entire VO may be placed on the Site Black List until the situation is rectified.
  - Fermilab expects any VO authorized to run Pilot Jobs to assure compliance by its users.
15 glexec
- Joint development by David Groep / Gerben Venekamp / Oscar Koeroo (NIKHEF) and Dan Yocum / Igor Sfiligoi (Fermilab).
- Integrated (via plugins) with LCAS / LCMAPS infrastructure (for LCG) and GUMS / SAZ infrastructure (for OSG).
- glexec is currently deployed on a couple of small clusters at Fermilab, moving towards a significant deployment at Fermilab this week.
- Will be included in Condor 6.9.x.
16 glexec block diagram
17 High Availability / Service Redundancy Plans
- Gatekeeper
  - Redundant Condor_Master and Condor_Negotiator.
- VOMS
  - Sticky problem.
  - Have requested a change to VOMRS that will make things much easier.
- GUMS
  - Have a test active/standby GUMS service operating with Linux-HA.
  - Believe that we know how to implement an active/active service.
- SAZ
  - Can implement either active/standby or active/active.
- MyProxy
  - Need for MyProxy will be eliminated by new CEMon based job forwarding mechanism.
18 Metrics
- In addition to the normal operation effort of installing, running and upgrading the various FermiGrid services over the past year, we have spent significant effort to collect and publish operational metrics. Examples:
  - Globus gatekeeper calls by jobmanager per day
  - Globus gatekeeper IP connections per day
  - VOMS calls per day
  - VOMS server IP connections per day
  - GUMS calls per day
  - GUMS server IP connections per day
  - GUMS server unique Certificates and Mappings per day
  - SAZ Authorizations and Rejections per day
  - SAZ server IP connections per day
  - SAZ server unique DN, VO, Role and CA per day.
- Metrics collection scripts run once a day and collect information for the previous day.
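A once-a-day collection script of this kind boils down to counting the previous day's log entries per service. The sketch below illustrates that under a deliberately simple, hypothetical log format (`<YYYY-MM-DD> <service> <message>`); the real gatekeeper, VOMS, GUMS and SAZ logs each have their own formats, and the function name `count_previous_day` is invented for this example.

```python
from collections import Counter
from datetime import date, timedelta


def count_previous_day(lines, today):
    """Count log lines per service whose date field is the day before `today`."""
    target = (today - timedelta(days=1)).isoformat()
    counts = Counter()
    for line in lines:
        # Hypothetical format: "<YYYY-MM-DD> <service> <message>"
        parts = line.split(maxsplit=2)
        if len(parts) >= 2 and parts[0] == target:
            counts[parts[1]] += 1
    return counts


sample = [
    "2007-03-04 gums mapping issued",
    "2007-03-04 saz authorized",
    "2007-03-04 saz rejected",
    "2007-03-05 saz authorized",
]
print(dict(count_previous_day(sample, date(2007, 3, 5))))
```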
19 Metrics - fermigrid1
20 Service Monitoring
- Service Monitor scripts run multiple times per day (typically once per hour).
- They gather detailed information about the service that they are monitoring.
- They also verify the health of the service that they are monitoring (together with any dependent services), notify administrators and automatically restart the service(s) as necessary to ensure continuous operations.
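The check, notify and restart cycle described above can be sketched as a single monitoring pass. This is a generic illustration, not the FermiGrid monitor scripts: the health checks, restart actions and notification channel are passed in as callables, and the names `Service` and `monitor_once` are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple

# (name, is_healthy, restart): callables are supplied by the caller.
Service = Tuple[str, Callable[[], bool], Callable[[], None]]


def monitor_once(services: Iterable[Service], notify: Callable[[str], None]) -> List[str]:
    """Check each service once; notify and restart any that fail. Returns restarted names."""
    restarted = []
    for name, is_healthy, restart in services:
        if is_healthy():
            continue
        notify(f"{name} failed its health check; restarting")
        restart()
        restarted.append(name)
    return restarted


# Usage with stand-in callables: "saz" is reported unhealthy and gets restarted.
restart_log = []
services = [
    ("gums", lambda: True, lambda: restart_log.append("gums")),
    ("saz", lambda: False, lambda: restart_log.append("saz")),
]
print(monitor_once(services, notify=print))
```

An hourly scheduler such as cron would invoke a pass like this; checking dependent services first would be a natural extension.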
21 Service Monitor - fermigrid1
22 Areas of Current Work within FermiGrid
- SAZ and glexec - nearing completion.
- BlueArc storage and public dcache storage element - ongoing.
- Further Metrics and Service Monitor Development - ongoing.
- Gratia Accounting.
- Web Services.
- XEN.
- Service Failover.
- Research, Development and Deployment of future ITBs and OSG releases.
23 Parting Comments
- Extracting metrics and service monitor information needs to be easier - trolling through (globus gatekeeper, voms, gums, saz) log files is not an efficient method.
  - Having a uniform standard time format (and some sort of unique process/thread id) is essential.
- Problem diagnosis is also very difficult (our job forwarding gateway does compound this problem).
  - David Bianco from Jefferson Lab gave a presentation on Sguil at the Fall 2006 HEPiX conference. Having a similar common interface for the globus gatekeepers and services log files together with the ability to correlate events from multiple sources would significantly improve problem diagnosis.
  - https://indico.fnal.gov/conferenceDisplay.py?confId=384
  - https://indico.fnal.gov/materialDisplay.py?contribId=9&sessionId=17&materialId=slides&confId=384
24 fin