Title: OxGrid, a campus grid environment for Oxford
1. OxGrid, a campus grid environment for Oxford
- Dr. David Wallom
- Technical Manager
2. Outline
- The aims of OxGrid
- How we have built OxGrid
- Central systems
- Software
- Resources
- Users
- The future direction of the project
3. Aim
- To develop and deploy Grid technology to:
  - increase utilisation of current and future university resources
  - substantially increase the research computing power available
  - capitalise on our status as a core node of the NGS and evangelise usage of grids
  - lead international best-practice efforts at running production grids
  - provide user authentication through either UK e-Science certificates or the university single sign-on system
4. OxGrid, a University Campus Grid
- Single entry point for users to shared and dedicated resources
- Seamless access to the NGS and OSC for registered users
5. OxGrid Central System
- Resource Broker
  - Distribution of submitted tasks
  - User access point
- Information Service
  - Central repository of system status information
- Virtual Organisation Management and Resource Usage Service
  - User/resource control within the grid
  - Records and analyses accounting information so that full-system as well as single-resource usage can be recorded
- Systems monitoring
  - Monitoring system for support, providing the first point of user contact
- Storage
  - Dynamic virtual file system, independent of the rest of the system
6. Grid Middleware
- Virtual Data Toolkit, which contains:
  - Globus Toolkit version 2.4 with several enhancements
  - GSI-enhanced OpenSSH
  - MyProxy client and server
- Has a defined support structure
7. Resource Broker
- Built on top of Condor-G
  - Allows a Globus (i.e. remote) resource to be treated as a local resource
  - Command-line tools available for job management (submit, query, cancel, etc.) with detailed logging
  - Simple job submission language that is translated into the remote scheduler's specific language
- Custom script for determination of resource status and priority
- Integrates the Condor resource description mechanism with the Globus Monitoring and Discovery Service
  - Automated resource discovery
  - Underlying capability matched to the system database, including such information as installed software and system load high watermark
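The Condor-G layer underneath the broker accepts a simple submit description and handles the Globus-level details itself. A minimal sketch of such a description for a GT2 gatekeeper follows; the hostname and file names are illustrative assumptions, not OxGrid's actual configuration:

```
# Hypothetical Condor-G submit description (hostname illustrative)
universe      = grid
grid_resource = gt2 ngs.oerc.ox.ac.uk/jobmanager-pbs
executable    = simulate
arguments     = input.dat
transfer_executable = true
output        = sim.out
error         = sim.err
log           = sim.log
queue
```

The `jobmanager-pbs` suffix is what makes the translation into the remote scheduler's language transparent to the user: the same description targeted at a `jobmanager-sge` endpoint would be rewritten for SGE instead.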
8. Resource Broker Operation
9. Currently registered systems
10. Job Submission Script
- Separates users from the underlying Condor system
- Requires the following information to submit a task:
  - Executable name
  - Whether to transfer the executable (y/n)
  - Command-line arguments to the executable
  - Input files
  - Output files
  - Necessary installed software
  - Extra Globus RSL parameters, e.g. for an MPI job and its number of concurrent processes
- When a job is submitted, the script contacts the system database to retrieve the list of systems the user has permission to use
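Given those fields, the script's translation into a Condor-G submit description can be sketched as follows; the field names, host, and exact output format are illustrative assumptions, not the script's real internals:

```python
# Sketch of the translation a job submission script of this kind performs:
# a user's high-level job description becomes a Condor-G submit file.
def build_submit_description(job):
    """Render a user job specification as a Condor-G submit description."""
    lines = [
        "universe      = grid",
        f"grid_resource = gt2 {job['resource']}/jobmanager-{job['scheduler']}",
        f"executable    = {job['executable']}",
        f"transfer_executable = {'true' if job['transfer_exe'] else 'false'}",
    ]
    if job.get("arguments"):
        lines.append("arguments     = " + " ".join(job["arguments"]))
    if job.get("input_files"):
        lines.append("transfer_input_files = " + ",".join(job["input_files"]))
    if job.get("rsl"):  # extra Globus RSL, e.g. for MPI jobs
        lines.append("globus_rsl    = " + job["rsl"])
    lines.append("queue")
    return "\n".join(lines)

job = {
    "resource": "ngs.oerc.ox.ac.uk",   # hypothetical host
    "scheduler": "pbs",
    "executable": "simulate",
    "transfer_exe": True,
    "arguments": ["input.dat"],
    "input_files": ["input.dat"],
    "rsl": "(jobType=mpi)(count=8)",
}
print(build_submit_description(job))
```

Permission checking would then filter the candidate `resource` values against the list returned by the system database before any submission happens.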
11. Additional User Tools
- oxgrid_certificate_import
  - Simplifies the installation of a user's digital certificate to a single command
- oxgrid_q
  - Displays the user's current queue at the resource broker, with an option to show the full task queue
- oxgrid_status
  - Displays the resources available to the user, with an option to show all resources currently registering with the resource broker
- oxgrid_cleanup
  - Removes either a single submitted process or a range of child processes together with their master
12. Virtual Organisation Management
- Globus uses a mapping between a certificate Distinguished Name (DN) and a local username on each resource
  - It is important that, for each resource a user expects to use, their DN is mapped locally
  - Make sure the correct resources are registered
- OxVOM
  - Postgres database, web server, CGI scripts
  - Custom in-house Web-based user interface
  - Persistent information stored in relational databases
  - User DN lists retrieved by remote resources, using standard tools, from an LDAP database
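On each resource the DN-to-username mapping ends up in the standard Globus grid-mapfile. A sketch of the format, with invented DNs and account names:

```
# grid-mapfile: "certificate DN" local_username
"/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=a. researcher" oxgrid_user01
"/C=UK/O=eScience/OU=Oxford/L=OeSC/CN=b. student"    oxgrid_user02
```

An OxVOM-compatible installation script would regenerate this file from the LDAP-published DN list, so resource administrators never edit it by hand.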
13. OxVOM
14. Resource Usage Service and Accounting
- Jobmanagers altered to include commands that determine job start and stop times, and that interface with the host scheduling system
- Information returned from the client to the RUS server when a job completes, and stored in a relational database
- Stored information for a single job includes:
  - Start and end times
  - Execution host and scheduler
  - CPU and wall-clock time
  - Memory used
  - Resource-owner-controlled cost variable, used to tune usage from the campus grid
- Version 2 will use the GGF Usage Record standard
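As a sketch of how such per-job records might sit in a relational database, the following uses SQLite with assumed column names; the actual OxGrid schema is not shown in this talk:

```python
# Minimal sketch of a RUS-style accounting table; columns are assumptions.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE job_usage (
        job_id      TEXT PRIMARY KEY,
        user_dn     TEXT,
        exec_host   TEXT,
        scheduler   TEXT,
        start_time  INTEGER,   -- seconds since epoch
        end_time    INTEGER,
        cpu_secs    REAL,
        wall_secs   REAL,
        memory_mb   REAL,
        cost_factor REAL       -- resource-owner controlled cost variable
    )""")
conn.execute(
    "INSERT INTO job_usage VALUES (?,?,?,?,?,?,?,?,?,?)",
    ("job-001", "/C=UK/O=eScience/CN=a. researcher", "ngs.oerc.ox.ac.uk",
     "pbs", 1160000000, 1160003600, 3500.0, 3600.0, 512.0, 1.5),
)
# One possible charging model: walltime weighted by the owner's cost factor.
row = conn.execute(
    "SELECT SUM(wall_secs * cost_factor) FROM job_usage WHERE user_dn LIKE ?",
    ("%a. researcher%",),
).fetchone()
print(row[0])  # 5400.0
```

Aggregating over `user_dn` or `exec_host` gives the single-resource and full-system views mentioned above, and the per-user sum is exactly what a charging model needs.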
15. Resource Usage Service
- Enables presentation of system use to users as well as system owners
- Can form the basis of a charging model
16. Overall system interactions with the VOM Database
[Diagram: the VOM web interface (user/system addition, removal and listing), remote system connections, user-task RUS information and system accounting, the Resource Broker (user job submission, user queue queries, system listing queries, local system connections) and the user command-line interface all interacting with the central database.]
17. Core Resources
- Available to all users of the campus grid
- Individual departmental clusters (PBS, SGE)
  - Grid software interfaces installed
  - Management of users through pool accounts or manual account creation
- Clusters of PCs
  - Running Condor/SGE
  - A single master running up to 500 nodes
  - Masters run either by owners or by OeRC
  - Execution environment on a second OS (Linux), Windows, or a virtual machine
18. External Resources
- Only accessible to users that have registered with them
- National Grid Service
  - Peered access with individual systems
- OSC
  - Gatekeeper system
  - User management done through standard account-issuing procedures and manual DN mapping
  - Controlled grid submission to the Oxford Supercomputing Centre
- Some departmental resources
  - Used as a method to bring new resources online initially
  - Show the benefits of joining the grid
  - Limited accessibility to resources donated by other departments, to maintain the incentive to become full participants
19. Services necessary to connect to OxGrid
- For a system to connect to OxGrid, it must support a minimum software set, without which it is impossible to submit jobs from the Resource Broker:
  - GT2 GRAM with the RUS-reconfigured jobmanager
  - An MDS-compatible information provider
- Desirable, though not mandated:
  - OxVOM-compatible grid-mapfile installation scripts
  - A scheduling system giving fair shares to users of the resource
20. Environmental Condor
- Cost and environmental considerations of using spare resources (£7000/yr for the OUCS Condor pool)
- New daemon for Condor
  - The system starts and stops registering machines depending on the currently queued tasks
  - Currently only works with Linux Condor systems
21. Current Compute System Layout
- Central management services running on a single server
- Current resources
  - All users:
    - OUCS Linux Pool (Condor, 250 CPUs)
    - Oxford NGS node (PBS, 128 CPUs)
    - Condensed Matter Physics (Condor, 10 CPUs)
    - Theoretical Physics (SGE, 14 CPUs)
    - OeRC cluster (SGE, 5 CPUs)
    - High Energy Physics (LCG-PBS, 120 CPUs), not registering with the RB
  - Registered users:
    - OSC (Zuse, 40 CPUs)
    - NGS (all nodes, 342 CPUs)
    - Biochemistry (SGE, 30 CPUs)
    - Materials Science (SGE, 20 CPUs)
22. Planned System Additions
- Physics Mac teaching laboratory (end Nov)
- OUCS Mac systems (end Nov; have agreement, just need time!)
- Humanities cluster (Nov)
- Statistics cluster (end Dec)
- Biochemistry, remaining two clusters (end Dec)
- OSC SRIF3 cluster, tranche 1 (2007)
- Chemistry clusters (department contacted)
- NGS2, all resources (2007)
23. Data Management
- Engage data users as well as computational users
- Provide a remote store for those groups that cannot resource their own
- Distribute the client software as widely as possible, to departments that are not currently engaged in e-Research
24. Data Management
- Two possible candidates for creation of the system:
- Storage Resource Broker (SRB), to create a large virtual datastore
  - Through a central metadata catalogue, users interface with a single virtual file system, though the physical volumes may be spread across several networked resources
  - Built-in metadata capability
- Disk Pool Manager (DPM)
  - Similar virtual disk presentation
  - Internationally recognised, using the SRM standard interface
  - No metadata capability
  - Integrates easily with the VO server
25. Supporting OxGrid
- First point of contact is the OUCS Helpdesk, via the support email address
  - Given a preset list of questions to ask and of log files to request, if available
  - Not expected to do any actual debugging
- Problems are passed on to Grid experts, who pass them on a system-by-system basis to each system's own maintenance staff
- Significant cluster support expertise within OeRC
- As one of the UK e-Science Centres, we also have access to the Grid Operations and Support Centre
26. Users
- Focused on users with serial computation problems and on individual researchers:
  - Statistics (1 user)
  - Materials Science (3 users)
  - Inorganic Chemistry (3 users)
  - Theoretical Chemistry (4 users)
  - Biochemistry (8 users)
  - Computational Biology (2 users)
  - Condensed Matter Physics (2 users)
  - Quantum Computational Physics (1 user)
27. User Code Porting
- The user forwards to OeRC code that operates either on a single node or on a cluster
- We design a wrapper script that:
  - Creates a scratch directory in which all operations occur
  - Formats configuration information for each child process from the main configuration
  - Creates an execution script and a zip file for remote execution
  - Submits the child processes onto the grid
  - Waits until all child processes have completed
  - Collates results and archives temporary files etc.
  - Deposits the scratch directory into the SRB repository
  - (Can remove the scratch directory from the Resource Broker if asked)
- The code is handed back to the user as an example of a real computational task they want to do, and as a possible basis for further code porting by themselves
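The fan-out step of such a wrapper, splitting one main configuration into per-child configurations and laying out the scratch directory, can be sketched as follows; the file layout and names are assumptions, not the actual OeRC script:

```python
# Sketch of the wrapper's fan-out: deal work items from the main
# configuration round-robin across child processes, then create a scratch
# directory with one subdirectory (and config file) per child.
import os
import tempfile

def split_config(main_config, n_children):
    """Split a main configuration into n_children child configurations."""
    children = [dict(main_config, work_items=[]) for _ in range(n_children)]
    for i, item in enumerate(main_config["work_items"]):
        children[i % n_children]["work_items"].append(item)
    return children

def prepare_scratch(children):
    """Lay out a scratch directory: one subdirectory per child process.
    This is the directory the wrapper later archives and deposits in SRB."""
    scratch = tempfile.mkdtemp(prefix="oxgrid_")
    for n, child in enumerate(children):
        child_dir = os.path.join(scratch, f"child_{n:03d}")
        os.mkdir(child_dir)
        with open(os.path.join(child_dir, "config.txt"), "w") as f:
            f.write("\n".join(str(w) for w in child["work_items"]))
    return scratch

main = {"work_items": list(range(10))}
kids = split_config(main, 3)
print([k["work_items"] for k in kids])  # [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```

Each `child_nnn` directory would then be zipped with the execution script and submitted as one grid task; collation is the reverse walk over the same directory tree once every child has reported back.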
28. OxGrid Users
- Simulation of the quantum dynamics of correlated electrons in a laser field: "OxGrid/NGS made serious computational power easily available and was crucial for making the simulating algorithm work." Dr Dmitrii Shalashilin (Theoretical Chemistry)
- Orbitals and electron charge distribution in boron nitride nanostructures. Dr Amanda Barnard (Materials Science)
- Molecular evolution of a large antigen gene family in African trypanosomes: "OxGrid has been key to my research and has allowed me to complete within a few weeks calculations which would have taken months to run on my desktop." Dr Jay Taylor (Statistics)
29. Problems
- Sociological
  - Getting academics to share resources
  - IT officers in departments and colleges
- Technical
  - Minimal firewall problems
  - Information servers
  - OS versions
  - Programming languages
- Time
30. The Future
- Improve the central service software
  - RB usage algorithm
  - Remove the central information server: the resource broker querying individual remote systems is actually more efficient
  - Update Condor-G to the latest version to allow a seamless transition from pre-WS to WS-based middleware
- Design and construct user training courses
31. The Future, 2
- Develop Windows/Linux Condor pools so that all shared systems can be included
- Develop an experimental system to harvest spare disk space, so as to ensure complete ROI on shared systems
- Connect an MS Windows Cluster system
- Package the central server modules for public distribution
  - Already running on systems at the universities of Porto and Barcelona, as well as at Monash University
- Continue contacting users to expand the user base
32. Conclusions
- Users are already able to log onto the Resource Broker and schedule work onto the NGS and the OUCS Condor systems
- We are working as quickly as possible to engage more users
- These users will encourage their local system owners (in departments and colleges) to donate resources!
- We need these users to then go out and evangelise
33. Thanks
- Co-designer of parts of the system
  - Jon Wakelin (CeRB)
- Oxford system administrators
  - Ian Atkin (OUCS)
  - Jon Lockley (OSC)
  - Steven Young (Ox NGS)
- Users
  - Dr Amanda Barnard (Materials Science)
  - Dr Jay Taylor (Statistics)
  - Dr Dmitrii Shalashilin (Theoretical Chemistry)
34. Contact
- Email: david.wallom@oerc.ox.ac.uk
- Telephone: 01865 283378