Title: A Simple Virtual Organisation Model and Practical Implementation
1 A Simple Virtual Organisation Model and Practical Implementation
- AusGrid05
- Feb 2005
- Lyle Winton, University of Melbourne
2 VO Origins and Definition
- The Anatomy of the Grid (Foster, Kesselman, and Tuecke)
- The real problem underlying the Grid is coordinated resource sharing in dynamic, multi-institutional virtual organisations (VOs)
- Conditions and rules for resource sharing
- negotiated between resource providers and consumers
- They define the VO as the set of individuals/institutions for which these apply.
- The VO is only half the picture!
- A VO may have access to facilities it does not own or manage
- Facilities are organisations with security and policy concerns.
- Facilities may span multiple VOs and other user communities.
- Observation: a Grid is a complex network of organisations and connecting policies.
- user communities (traditionally called VOs)
- resource facilities
- negotiated policies for resource access and sharing
- certifying authorities (CAs, for GSI-based Grids)
3 VO Requirements
- Requirements from the community (EU DataGrid WP6, 2002; Foster, Kesselman, and Tuecke)
- Users may be members of any number of VOs
- A resource can participate in one or more VOs
- A user may have any number of roles within a given VO
- VOs must be able to specify membership policy
- A user's VO membership must remain confidential
- A resource owner must be able to allow authorisation by VO and VO role membership
- It should be possible to list resources and actions to which a VO member or role has access
- It should be possible to list resources to which a VO member or role has access to carry out specific actions
- Authorisation decisions must be consistent within a VO
- It must be possible to disable a user's VO authorisation
- The VO must be able to specify security requirements on any resource for specific roles
- A user must be able to select and deselect VOs and roles
- It must be possible to assign job priorities within resources
- Looking at the requirements, we can identify some base objects
- The VO, users/members, groups or roles, resources, priorities, authorities
- Authorities and priorities are important areas, often overlooked
4 VO In Practice
- Grid Middleware
- In the literature, the term VO is used to describe groups of users and sometimes resources
- In a few cases information systems have been developed to represent the VO
- VO Information Systems
- Allow the development of tools to coordinate resource sharing
- Configuration tools for resources
- Complete authentication systems
- Most do not take into account
- Certifying authorities
- Facilities' policies and priorities (most systems are geared towards users)
- The VO's internal task and role priorities
5 Why CAs in VO - LCG Example
- Users with Single Credential from Trusted Network of Authorities
- LHC Computing Grid (99 Compute Elements)
6 Why CAs in VO - LCG Example
- Users with Single Credential from Trusted Network of Authorities
- LHC Computing Grid (28 Certificate Authorities)
- Problem: a new CA is added to the network. Resource configuration for CAs is shipped with Grid middleware and updates.
7 Why CAs in VO - Generic Case
- Users with Single Credential from Multiple Authorities
- Problem: a new user/resource with a certificate from a different CA is added to the VO. The new CA trust policy must be deployed to all resources.
8 Why CAs in VO
- CAs affect the conditions for resource sharing
- VO-owned services/tools must trust all members' CAs
- Participating resources must trust all VO members' CAs
- VO services/tools and participating resources must trust all resources' CAs
- Why should resources trust the VO's list of CAs?
- If they want to participate, they must.
- Generally, the VO knows its member institutions and who can best certify their members.
- If a CA starts issuing bad certificates the VO can untrust it (until CRLs are fixed?)
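The trust rules above amount to a small set computation. The following is a minimal sketch only, assuming a hypothetical in-memory description of the VO; the function name `required_trusted_cas` and the CA names are illustrative and do not come from any Grid middleware.

```python
def required_trusted_cas(member_cas, resource_cas, untrusted_cas=frozenset()):
    """Return the CAs that every participating resource and VO service must trust.

    member_cas:    CAs that certified the VO's members (hypothetical input)
    resource_cas:  CAs that certified the participating resources
    untrusted_cas: CAs the VO has decided to untrust
    """
    return (set(member_cas) | set(resource_cas)) - set(untrusted_cas)


# Hypothetical CA names, for illustration only.
member_cas = {"CA-A", "CA-B"}
resource_cas = {"CA-B", "CA-C"}
trusted = required_trusted_cas(member_cas, resource_cas, untrusted_cas={"CA-C"})
print(sorted(trusted))
```

Because the list is derived from VO membership, adding a member from a new CA changes one VO-level record instead of the configuration of every resource.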
9 Existing VO Implementations
- EU DataGrid and NorduGrid VO
- LDAP-based VO Information System
- LDAP structure represents Users, Groups, and Roles
- mkGridmap tool queries the LDAP service, then generates the local resource grid-mapfile
- NorduGridmap tool extended by the NorduGrid community
- Problems
- Very simple VO model ignores CAs and VO priorities
- Limited resource policies: mapping of VO members to shared accounts or ranges of accounts
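To illustrate the shared-account mapping described above, here is a minimal sketch that turns a list of member certificate DNs into grid-mapfile text (one quoted DN followed by a local account per line). It is not the mkGridmap code: the member list is passed in directly rather than queried from LDAP, and the function name `make_gridmap`, the DNs, and the account name are hypothetical.

```python
def make_gridmap(member_dns, shared_account):
    """Return grid-mapfile text mapping every member DN to one shared local account."""
    lines = []
    for dn in sorted(set(member_dns)):  # de-duplicate and keep output stable
        lines.append(f'"{dn}" {shared_account}')
    return "\n".join(lines) + "\n"


# Hypothetical VO members, for illustration only.
members = [
    "/C=AU/O=Example/CN=Bob Example",
    "/C=AU/O=Example/CN=Alice Example",
]
gridmap = make_gridmap(members, "belle")
print(gridmap, end="")
```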
10 Existing VO Implementations
- VOMS (VO Membership Service)
- Extends Globus GSI security
- 3 components
- Server holds user, group, and role info
- Client generates credentials (proxy) containing additional role information
- VOMS-enabled gatekeeper service
- Authorisation split into 2 areas of responsibility
- User's relationship with the VO (VOMS server level)
- User's access and usage of resources (resource level)
- Problems
- No CA information is available from VOMS
- Not clear: users' resource usage (e.g. priorities) determined solely by resource facilities
- Urgent changes in VO priorities may require renegotiation with resource facilities.
11 Existing VO Implementations
- CAS (Community Authorization Service)
- 3 components
- CAS server contains info on users, groups, resources, and access policies
- Client requests authentication from the server for a specific action/role (using proxy)
- CAS-enabled gatekeeper service
- Server returns a signed policy assertion embedded into new credentials (proxy); the gatekeeper reads this
- Policies consistently control access (to data etc.), eliminating the least-common-denominator problem
- VOs can allocate/deallocate resource blocks to individuals and groups (coarse-grained priority management)
- Problems
- Still need to manually configure CAs at each resource
- Initially CAS delegated its own credentials on the user's behalf
- Claimed that allowing VOs to specify access/usage policies breaks the Grid model! (VOMS claim, but not true)
12 Experiences
- Grid2003 HPC challenge
- Project led by Raj Buyya
- Attempt to construct the largest testbed
- Grew from several resources at the University of Melbourne to 218 resources in 50 locations across 21 countries.
- Australian Belle Production Grid
- Belle experiment, KEK B-factory in Japan.
- Collaboration of 400 physicists from 50 institutions.
- Australia took part in 4 x 10^9 event MC production during 2004
13 Experiences - Grid2003
- Updating user and CA configuration was time-consuming and problematic.
- Manual configuration led to errors.
- Automation of this was recognised as desirable.
14 Experiences - Belle Production
- Accessible resources for Belle
- Access to around 120 CPUs (over 2 GHz)
- APAC, AC3, VPAC, ARC
- not all Grid accessible
- much production performed without Grid middleware
- Access to ANUSF petabyte storage facility (via SRB)
- Will request 10 TB for Belle data.
- Problems
- Simplest access method is to share one account at each facility. Some resource policies forbid this.
- Each facility has an account application procedure. Typically, this is a manual process and requires intervention from several people.
15 A Simple VO Model
- start with all information necessary
- User/Service identification (certificate ID)
- Groups and Roles (user and service collections)
- Trusted user/service certifying bodies (CAs)
- Trusted resource certifying bodies (CAs)
- Untrusted certifying bodies and identities.
- Priorities assigned to Users, Groups, Roles
- extended the efforts from EU DataGrid and NorduGrid
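The information listed above can be pictured as a single record per VO. The sketch below is an assumption-laden illustration, not the LDAP schema used by the author: the class `VO`, its field names, and the `accepts` check are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VO:
    """Hypothetical in-memory picture of the information a simple VO model holds."""
    name: str
    users: dict = field(default_factory=dict)         # certificate DN -> issuing CA
    groups: dict = field(default_factory=dict)        # group name -> set of DNs
    roles: dict = field(default_factory=dict)         # role name -> set of DNs
    trusted_user_cas: set = field(default_factory=set)
    trusted_resource_cas: set = field(default_factory=set)
    untrusted: set = field(default_factory=set)       # untrusted CA names or DNs
    priorities: dict = field(default_factory=dict)    # user/group/role name -> priority

    def accepts(self, dn):
        """True if dn is a member, issued by a trusted CA, and not marked untrusted."""
        ca = self.users.get(dn)
        if ca is None or dn in self.untrusted or ca in self.untrusted:
            return False
        return ca in self.trusted_user_cas


# Hypothetical members and CAs, for illustration only.
belle = VO(
    name="Belle",
    users={"/CN=Alice": "CA-A", "/CN=Mallory": "CA-X", "/CN=Eve": "CA-A"},
    trusted_user_cas={"CA-A"},
    untrusted={"/CN=Eve"},
)
print(belle.accepts("/CN=Alice"), belle.accepts("/CN=Mallory"), belle.accepts("/CN=Eve"))
```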
16 A Simple VO Model
[Diagram labels: Belle, Uni. of Melbourne, Analysis Work Group]
17 A Simple VO Model
- Resource Configuration Manager
- Diagram???
18 A Simple VO Model
- Resource Configuration Manager (GridMgr v3.0)
- Rewritten from NorduGridmap, originally from EU DataGrid
- Available resource usage policies
- Map VO/groups/roles/users to shared accounts (existing functionality)
- Manually map individuals (existing functionality)
- Map VO/groups/roles/users to a range of accounts
- Restriction of mappings to local Unix groups
- Mapping of users to individual accounts via full name
- Denial of access to VO/group/role/users
- Security requirements
- Valid full name matching only one account
- No new or modified system accounts (without approval)
- Admin approval of new/modified Users and CAs
- Valid account group (e.g. can specify non-root)
- No shared accounts (optional)
- Allow or Deny by pattern match
- Reporting and notification of users failing requirements
- Update of certificate revocation lists (CRLs) provided by CAs
- Advance notification of host certificate expiry
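Two of the requirements above, "valid full name matching only one account" and "deny by pattern match", can be sketched as follows. This is not GridMgr code: the helper names, the shape of the `accounts` table (account name to full name), and the shell-style deny patterns are assumptions made for illustration.

```python
from fnmatch import fnmatchcase


def common_name(dn):
    """Return the first CN component of a slash-separated certificate DN, or None."""
    for part in dn.split("/"):
        if part.startswith("CN="):
            return part[3:]
    return None


def map_by_full_name(dn, accounts, deny_patterns=()):
    """Return the single local account whose full name equals the DN's CN, else None.

    accounts:      dict of local account name -> full name (hypothetical input)
    deny_patterns: shell-style patterns; a DN matching any of them is refused
    """
    if any(fnmatchcase(dn, pattern) for pattern in deny_patterns):
        return None
    cn = common_name(dn)
    matches = [acct for acct, full in accounts.items() if full == cn]
    # Refuse both "no such user" and ambiguous names that match several accounts.
    return matches[0] if len(matches) == 1 else None


# Hypothetical local accounts, for illustration only.
accounts = {"alice": "Alice Example", "bob1": "Bob Example", "bob2": "Bob Example"}
print(map_by_full_name("/C=AU/O=Example/CN=Alice Example", accounts))
```

A user who fails either check would be reported to the administrator rather than silently mapped.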
19 A Simple VO Model
- What we've got so far
- VO Information System allows a VO to define its structure, authorisation, and security policy (limited) independent of resources.
- Resource Configuration Manager allows a resource to easily maintain a range of local security and access policies for multiple VOs
- What's left
- How does a VO manage its priorities?
- e.g. some tasks might be critical to the VO!
- The problem is yet to become apparent
- many production Grids are effectively single user
- within some Grids resources are specifically allocated or underutilised
20 Managing Priorities
- Traditional Cluster Computing
- Locally managed queue determines users' job priorities
- Jobs execute (are pulled from the queue) when resources become available
- (Globus) Grid Computing
- A resource's local priority cannot be determined until jobs are submitted/completed
- Jobs are submitted (pushed) to the local resource queues
- Problems with a push mechanism
- Jobs waiting in long queues could potentially be run elsewhere. (submit each job to multiple places?)
- Heavily utilised resources may never appear free, but may have short queue times.
- Large, fast, or apparently free resources may have a low local priority for your job.
21 Managing Priorities
- Alternative mechanism
- Allow job consumers to pull jobs when resources become available or queues become short.
- Eliminates the need for users to determine local resource priority
- Jobs consumed from a central VO Managed Queue
- VO priorities can be managed by allowing some jobs to be consumed first
- Jobs only leave the VO queue when they will be run. No idle time on resource queues; jobs can still be run anywhere.
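The pull mechanism above can be sketched with a priority queue. This is a minimal illustration, not the prototype service: the class `VOQueue` and its methods are hypothetical, and first-in-first-out ordering among jobs of equal priority is an assumption.

```python
import heapq
import itertools


class VOQueue:
    """Central VO queue: users submit jobs, resources pull the highest priority first."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker: FIFO within equal priority

    def submit(self, job_id, priority):
        # heapq is a min-heap, so the priority is negated to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._counter), job_id))

    def pull(self):
        """Called by a resource that has a free slot; returns a job id or None."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = VOQueue()
queue.submit("job-a", priority=10)
queue.submit("job-b", priority=50)
queue.submit("job-c", priority=50)
order = [queue.pull() for _ in range(4)]
print(order)
```

A job stays in the VO queue until some resource is ready to run it, so it is never stranded in a long local queue.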
22 VO Managed Job Queue
- VO Managed Queue Service (Prototype)
- Web service with simple authentication
- User submission of a job with an optional role (also job management: list jobs, status of jobs, delete jobs)
- Overall job priority determined by the VO Information Server: user or group priority, or specified role priority
- facility for a Resource to pull jobs, highest priority first
- facility for a Resource to reserve/release a job for execution
- and a Resource can flag a job failed/completed
- Resource-level Job Consumer (Simulated)
- Extract jobs and priorities from multiple VO queues
- a single VO can host multiple queues for scalability
- VO Info Server ensures consistent priorities
- Convert VO priority to local priority
- each VO has a local priority range
- priorities from 0 to infinity are attenuated to within the range
- simple formula maps priorities below 200 to the lower 80% of the range
- Simulate allocating a successfully reserved job to CPU resources
- Simulate job completion
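The slides do not give the formula used to attenuate VO priorities into a local range. One function with the stated properties (0 maps to the bottom of the range, priorities below 200 fall in the lower 80%, and arbitrarily large priorities approach the top) is p / (p + 50). The sketch below uses that form purely as an assumption; the name `vo_to_local` and the constant 50 are illustrative and not taken from the prototype.

```python
def vo_to_local(vo_priority, low, high, half_scale=50.0):
    """Map a VO priority in [0, infinity) into the local range [low, high).

    Assumed attenuation: p / (p + half_scale). With half_scale = 50,
    a VO priority of 200 lands exactly at 80% of the range.
    """
    if vo_priority < 0:
        raise ValueError("VO priority must be non-negative")
    return low + (high - low) * vo_priority / (vo_priority + half_scale)


# Hypothetical local priority range 10..20 assigned to one VO by a resource.
for p in (0, 50, 200, 10_000):
    print(p, round(vo_to_local(p, 10, 20), 3))
```

Because every VO is confined to its own local range, a VO cannot raise its jobs above the ceiling the facility has assigned, however large its internal priorities become.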
23 VO Queue - Simulation Results
- Simple Simulation Run
- 3 VOs with 10 Users each of varying priority
- 1 VO Queue for each VO, each attached to a VO Info System
- 10 resources of 10-50 job slots, accepting jobs from all VOs with varying local priority for each
- Each user periodically submitted 10-50 jobs
- Average of 3 times more jobs than slots!
- Saturated Grid: queue time is non-zero
24 VO Queue - Simulation Results
[Plot panels]
- Completed jobs: all VOs, all Resources
- Completed jobs: all VOs, one Resource
- Completed jobs: all VOs, one Resource (VO at 27.5 is missing)
- Incomplete (queued) jobs: all VOs, one Resource
25 VO Queue - Simulation Results
- Brief outline of results
- Queue-time/Queue-size vs. VO Job Priority was too complex to analyse.
- Focusing on a single resource: low VO priority jobs tended to take longer
- However, low VO priority jobs were more likely to be left in the queue and not complete!
- Lock-out occurred for low priority jobs
- In fact, looking at Mean Resource Priority, one resource (mean priority of 27.5) was locked out completely!
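Lock-out of this kind is easy to reproduce in a toy model. The sketch below is not the author's simulation: it assumes a single resource, strict highest-priority-first pulling, and three hypothetical users who together submit three times more jobs than there are slots.

```python
import heapq


def simulate(ticks, slots_per_tick, submissions):
    """Count completed jobs per user under strict priority pulling.

    submissions: dict of user -> (priority, jobs submitted per tick)
    """
    heap = []
    completed = {user: 0 for user in submissions}
    seq = 0  # submission order, used as a FIFO tie-breaker
    for _ in range(ticks):
        for user, (priority, count) in submissions.items():
            for _ in range(count):
                heapq.heappush(heap, (-priority, seq, user))
                seq += 1
        # The resource pulls as many jobs as it has free slots this tick.
        for _ in range(min(slots_per_tick, len(heap))):
            _, _, user = heapq.heappop(heap)
            completed[user] += 1
    return completed


done = simulate(
    ticks=100,
    slots_per_tick=10,
    submissions={"high": (100, 10), "mid": (50, 10), "low": (10, 10)},
)
print(done)
```

While the highest-priority user alone can fill every slot, fixed priorities give the other users no chance to run, which is the motivation for dynamic priorities.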
26 VO Queue - Future Development
- Dynamic Job Priorities (perhaps Fairshare?)
- VOs can specify a target fraction of resource usage or job submission
- E.g. 20% fairshare target for a particular Working Group
- If few jobs are submitted by the group (< 20%), job priority is increased
- If too many jobs have been submitted (> 20%), job priorities are decreased relatively
- Facilities can specify a target fraction of resource usage for each VO
- Advantages
- help prevent job lock-out
- VO can specify fine-grained allocation of resources, without allocating specific resources
- Facilities can implement SLAs based on resource usage
- VO Queue deficiencies
- Advanced resource brokering is needed
- Is a job appropriate or a good match for resources? (data access)
- Do jobs constantly fail (silently) on a resource? (black hole effect)
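The fairshare idea above can be sketched as a priority that scales with the ratio of target share to actual share. This is one possible scheme under stated assumptions, not a design from the slides: the function `fairshare_priority`, the ratio rule, and the `max_boost` cap are all hypothetical.

```python
def fairshare_priority(base, target, used, total, max_boost=10.0):
    """Scale a base priority by how far a group is from its target share.

    base:   the group's static VO priority
    target: target fraction of usage or submissions, e.g. 0.2 for 20%
    used:   jobs attributed to this group so far
    total:  jobs attributed to all groups so far
    """
    if not 0 < target <= 1:
        raise ValueError("target must be in (0, 1]")
    if total == 0:
        return float(base)  # no history yet, so no adjustment
    share = used / total
    # Below target the factor exceeds 1 (boost); above target it drops below 1.
    factor = max_boost if share == 0 else min(target / share, max_boost)
    return base * factor


# A working group with a 20% target, at 10%, 20%, and 40% of submissions.
for used in (10, 20, 40):
    print(used, fairshare_priority(100, 0.2, used, 100))
```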
27 VO Queue - Future Development
- Integration with ATLAS (LCG) tools?
- Don Quijote and Windmill (ATLAS)
- Supervisor (central queue), Executor (job consumer: EDG/LCG, Grid3, NorduGrid)
- Executor requests a number of jobs, Supervisor pushes jobs
- ATLAS Data Challenge Production across 3 Grids
- AtCom (ATLAS Commander)
- submit multiple jobs, tightly coupled with AMIdb
- EDG/LCG test plugin, NorduGrid production plugin
- select data or param sweep -> select operation (transform)
- No Resource Broker or Scheduler, allocate by hand!
28 Summary
- The Simple VO Model
- Easy to implement a VO Information System (via OpenLDAP)
- Proved sufficient for development of tools aiding deployment and configuration
- Provided encouraging results towards the independent management of VO and facility priorities.
- Developed Tools (for VO Information System)
- GridMgr v3.0 (production ready)
- allows a wide range of facility security policies to co-exist with VO membership policies
- VO Managed Queueing System (prototype)
- could help coordinate the use of resources based on VO priorities assigned to groups, roles, and users (supported by simulation)
- future investigation required to prevent lock-out and better allocate resource fractions.