Title: Grid Deployment
1. Grid Deployment Operations
20th September 2005
Jeremy Coles, GridPP Production Manager and UKI Operations Manager for EGEE (J.Coles@rl.ac.uk)
2. Overview
1. Project background (to EGEE, LCG and GridPP)
2. The middleware and its deployment
3. Structures developed in response to operating a large grid
4. How the infrastructure is being used
5. Particular problems being faced
6. Summary
3. A reminder of the Enabling Grids for E-sciencE project
32 million euros of EU funding over 2 years, starting 1st April 2004:
- 48% service activities (Grid Operations, Support and Management; Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management; Dissemination and Outreach; User Training and Education; Application Identification and Support; Policy and International Cooperation)
The emphasis in EGEE is on operating a production grid and supporting the end-users.
(From Bob Jones's talk, AHM 2004)
7. The UK/Ireland contribution to SA1 deployment operations
- Consists of 3 partners:
  - Grid Ireland
  - The National Grid Service (NGS): Leeds/Manchester/Oxford/RAL
  - GridPP
    - Currently the lead partner
    - Based on a Tier-2 structure within the Large Hadron Collider Grid Project (LCG). See T. Doyle's talk tomorrow, 11am, CR2.
- The UKI structure:
  - Regional Operations Centre (ROC)
    - Helpdesk
    - Communications
    - Liaison with ROCs and CICs
    - Monitoring of resources
  - Core Infrastructure Centre (CIC)
    - Team take shifts to monitor core services and follow up on site problems
8. GridPP is a major contributor to the growth of EGEE resources
9. When sites join EGEE, the ROC:
- Records site details in a central Grid Operations Centre DataBase (GOCDB), with certificate-controlled access
- Ensures that the site has agreed to and signed the Acceptable Use and Incident Response procedures
- Runs tests against the site to ensure that the setup is correctly configured
NB: page access requires an appropriate grid certificate.
10. Experience has revealed growing requirements for the GOCDB
- ROC manager control: to be able to update site information, change the monitoring status of sites, or remove sites
- A structure that allows easy population of structured views (such as accounting according to regional structures)
- The ability to differentiate pure production sites from test resources (e.g. preproduction services)
11. EGEE middleware is still evolving based on operational needs
12. An overview of the (changing) middleware release process
Site deployment of middleware:
- YAIM: a bash script. Simple and transparent; much preferred by administrators.
- Quattor: steep learning curve, but allows tighter control over installation.
- Patches: functionality vs stability!
- Porting to non-standard LCG operating systems
13. A mixed infrastructure is inevitable, and local variations must be manageable
- Releases take time to be adopted; how will more frequent updates be tagged and handled?
- Grid Ireland has a completely different deployment model to GridPP (central vs site-based)
14. Additional components are added, such as for managed storage
- Storage Resource Management (SRM) interface
  - Provides a protocol for large-scale storage systems on the grid
  - Clients can retrieve and store files, and control file lifetimes and filespace
- Sites will need to offer an SRM-compliant storage element (SE) to VOs
- These SEs are basically filesystem mount points on specific servers
- There are few solutions available, and deployment at test sites has proved time consuming (integration at sites, understanding the hardware setup; documentation improving)
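The store/retrieve/lifetime behaviour described above can be sketched in miniature. This is a hedged illustration of SRM-style semantics only, not a real SRM client: the class `TinySRM` and its method names are invented for this example.

```python
import time

# Hypothetical sketch of SRM-style interactions: clients store and
# retrieve files and control file lifetimes on a storage element.
# Class and method names are illustrative, not a real SRM API.
class TinySRM:
    def __init__(self):
        self.files = {}  # logical file name -> (data, expiry timestamp)

    def put(self, lfn, data, lifetime_s=3600):
        """Store a file with a finite lifetime (pinning-like semantics)."""
        self.files[lfn] = (data, time.time() + lifetime_s)

    def extend_lifetime(self, lfn, extra_s):
        """Lengthen the lifetime of an already-stored file."""
        data, expiry = self.files[lfn]
        self.files[lfn] = (data, expiry + extra_s)

    def get(self, lfn):
        """Retrieve a file only if its lifetime has not expired."""
        data, expiry = self.files.get(lfn, (None, 0.0))
        if time.time() > expiry:
            raise KeyError(f"{lfn}: not found or lifetime expired")
        return data

se = TinySRM()
se.put("/grid/atlas/run42.dat", b"events", lifetime_s=60)
assert se.get("/grid/atlas/run42.dat") == b"events"
```

Real SRM implementations add space reservation and transfer-URL negotiation on top of this; the lifetime bookkeeping alone already hints at why integration at sites took time.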
17. Once sites are part of the grid they are actively monitored
- The Site Functional Tests (SFTs) are a series of jobs reporting whether a site is able to do basic transfers, publishes the required information, etc.
- These have recently been updated, as certain critical tests gave a misleading impression of a site
- The tests are being used (and expanded) by Virtual Organisations (VOs) to select stable sites (to improve efficiency)
- They have proved very useful to sites, and can now be run by them on demand
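The distinction between critical and non-critical tests above can be sketched as follows. This is an illustrative aggregation only; the test names and the three-state status are assumptions, not the actual SFT implementation.

```python
# Illustrative sketch of aggregating Site Functional Test results:
# a site only "fails" when a critical test fails, so a cosmetic
# failure no longer gives a misleading impression of the site.
# Test names are invented for this example.
CRITICAL = {"job_submit", "replica_copy"}

def site_status(results):
    """results: dict of test name -> bool (passed).

    Returns 'FAIL' on any critical failure, 'WARN' on a
    non-critical failure, otherwise 'OK'.
    """
    if any(not ok for name, ok in results.items() if name in CRITICAL):
        return "FAIL"
    if any(not ok for ok in results.values()):
        return "WARN"  # flagged for the site, but VOs may still use it
    return "OK"

assert site_status({"job_submit": True, "replica_copy": True,
                    "version_check": False}) == "WARN"
```

A VO selecting stable sites would then schedule only to sites reporting `OK` (or perhaps `WARN`), which is the efficiency gain the slide refers to.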
18. The tests form part of a suite of information used by the Core Infrastructure Centres (CICs)
- There are currently 5 CICs in EGEE
- Introduction of a CIC-on-duty rota (whereby each CIC oversees EGEE operations for 1 week at a time) saw a great improvement in grid stability
- Available information is captured in a trouble ticket and sent to problem sites (and their ROC) informing them that there is a problem
- Tickets are automatically escalated if not resolved
- Core services are monitored in addition to sites
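The automatic escalation of unresolved tickets can be sketched as a set of age-based deadlines. The specific thresholds and escalation targets below are illustrative assumptions, not the actual CIC-on-duty procedure.

```python
from datetime import datetime, timedelta

# Hedged sketch of automatic ticket escalation: a ticket left
# unresolved past successive deadlines triggers stronger actions.
# The deadlines and action names are invented for illustration.
STEPS = [
    (timedelta(days=3), "remind site"),
    (timedelta(days=5), "escalate to ROC"),
    (timedelta(days=9), "raise at operations meeting"),
]

def escalation_actions(opened, now, resolved=False):
    """Return every escalation action due for a ticket of this age."""
    if resolved:
        return []
    age = now - opened
    return [action for deadline, action in STEPS if age >= deadline]

opened = datetime(2005, 9, 1)
assert escalation_actions(opened, datetime(2005, 9, 7)) == [
    "remind site", "escalate to ROC"]
```

The point of the mechanism is that a problem site cannot silently ignore a ticket: inaction alone moves it up the chain.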
19. Good, reliable and easy-to-access information has been extremely useful to sites and ROC staff
At a glance we can see, for each site:
- whether it passes or fails the functional tests
- if there are configuration errors (via sanity checks)
- what middleware version is deployed
- the total job slots available and used, as published by the site
- basic storage information
- average and maximum published job slots, showing deviations
20. With a rapidly growing number of sites and wider geographic coverage, many tools have had to evolve
21. ...and new ones have been developed. EGEE and LCG metrics are an increasing area of focus: how else are we to manage?
22. We need to develop a better understanding of grid dynamics
Is this the result of a loss of the Tier-1 scheduler, or just a problem with the tests?
Is this several sites with large farms upgrading?
23. The good news is that UKI is currently the largest contributor to EGEE resources
24. ...and resource usage is growing (at 55% for August, and 26% for the period from June 04)
- Utilisation may worry some people, but note that the majority of resources are being deployed for High Energy Physics experiments, which will ramp up usage quickly in 2007
- Recent activity is partly due to a Biomedical data challenge in August
25. Several sites have been running full for July/August. The plot below is for the Tier-1 in August.
26. However, "full" does not always mean well used!
- The plot shows weighted job efficiencies for the ATLAS VO in July 2005
- Straight-line structures show jobs which ran for a period of time before blocking on an external resource and eventually being killed by an elapsed-time limit
- Clusters at low efficiency probably show performance problems on external storage elements
- Many problems seen here are NOW FIXED
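The quantity behind such a plot is typically CPU time divided by elapsed (wall) time, weighted so that long jobs count for more. This sketch is an assumption about the metric, not the exact definition used for the plot.

```python
# Sketch of a "weighted job efficiency" metric: per-job efficiency is
# CPU seconds over wall-clock seconds, and jobs are weighted by wall
# time when summarising a site. Field layout is illustrative.
def weighted_efficiency(jobs):
    """jobs: list of (cpu_seconds, wall_seconds) tuples."""
    total_wall = sum(wall for _, wall in jobs)
    if total_wall == 0:
        return 0.0
    total_cpu = sum(cpu for cpu, _ in jobs)
    return total_cpu / total_wall

# A job blocked on an external resource burns wall time but almost no
# CPU, dragging the site's weighted efficiency down:
jobs = [(3600, 3600), (100, 36000)]
assert weighted_efficiency(jobs) < 0.10
```

This is why a farm can look "full" in the occupancy plot while delivering little useful work: blocked jobs hold slots without consuming CPU.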
27. ...and some sites have specific scheduling requirements
Grid scheduling (using user-specified requirements to select resources)
vs
local policies (the site prefers certain VOs)
28. The user community is expanding, creating new problems
- Over 900 users in some 60 VOs
- UK sites support about 10 VOs
- Opening up resources for non-traditional site VOs/users requires effort
- Negotiation between VOs and the regional sites has required the creation of an Operational Advisory Group
- New Acceptable Use policies which apply across countries and are agreeable (and actually readable) are taking time to develop
30. Aggregation of job accounting is recording VO usage, but...
(Accounting data flows from each site to the GOC, with a web summary view of the data.)
- Not all batch systems are covered
- Not all sites are publishing data
- Farm normalisation factors are not consistent
- Publishing across grids is yet to be tackled (but the solution in EGEE does use a GGF schema)
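The farm-normalisation point above is that raw CPU seconds from farms with different processors are not comparable, so accounting scales them by a per-farm benchmark rating. The reference value and ratings below are invented for illustration.

```python
# Sketch of CPU-time normalisation in grid accounting: raw CPU seconds
# are scaled by the farm's benchmark rating relative to a reference
# CPU, so heterogeneous farms can be summed fairly. If sites publish
# inconsistent ratings, the aggregated totals are skewed.
REFERENCE_RATING = 1000.0  # benchmark rating of the reference CPU (assumed)

def normalised_cpu(cpu_seconds, farm_rating):
    """Convert raw CPU seconds into reference-CPU seconds."""
    return cpu_seconds * farm_rating / REFERENCE_RATING

# One hour of work on a fast farm counts as two reference hours:
assert normalised_cpu(3600, 2000.0) == 7200.0
assert normalised_cpu(3600, 500.0) == 1800.0
```

The inconsistency flagged on the slide is exactly the case where two sites with identical hardware publish different `farm_rating` values.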
31. GridPP data is reasonably complete for recent months
Note the usage by non-particle-physics organisations. This is what the EGEE grid is all about.
32. Support is proving difficult because the project is so large and diverse
Site administrators, users and experiments/VOs currently reach support through many parallel channels:
- GOSC (Footprints)
- LCG-ROLLOUT and TB-SUPPORT mailing lists
- Grid-Ireland helpdesk (Request Tracker)
- GGUS (Remedy)
- UKI ROC ticket tracking system (Footprints)
- Savannah bug tracking
- Tier-1 helpdesk (Request Tracker)
- Regional services, the CIC-on-duty and the sites themselves
This is ONLY the view for the UKI operations centre; there are 9 ROCs.
33. The EGEE model uses a central helpdesk facility and Ticket Process Managers
- "I need help!" The user sends e-mail to vo-user-support@ggus.org
- The e-mail is automatically converted into a GGUS ticket, which can be addressed to the VO TPM only, the TPM only, or both
- Ticket Process Manager (TPM): monitors ticket assignments, directs tickets to the correct support unit, and notifies users of specific actions and ticket status
- VO TPM: support people from the VOs. They receive and follow VO-related tickets, solve or forward VO-specific problems, and recognise grid-related problems, assigning them to specific support units or back to the TPM
- Support units include: VO support units, ROC support units, middleware support units, other grids' support units, the CIC support unit, and mailing lists
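The routing flow above can be sketched as two small decisions: who first sees a new ticket, and where a VO TPM forwards it. The function names, ticket fields and unit names are illustrative assumptions, not the GGUS implementation.

```python
# Hedged sketch of GGUS-style ticket routing. A new ticket goes to the
# VO TPM, the TPM, or both; the VO TPM keeps VO problems and hands
# grid problems on to the appropriate support unit. Names are invented.
def initial_assignees(vo_related, grid_related):
    """Decide which TPMs first see a newly created ticket."""
    assignees = []
    if vo_related:
        assignees.append("TPM-VO")
    if grid_related or not vo_related:
        assignees.append("TPM")
    return assignees

def route(ticket):
    """Forward a triaged ticket to a support unit by problem kind."""
    if ticket["kind"] == "vo":
        return "VO support unit"
    if ticket["kind"] == "middleware":
        return "Middleware support unit"
    return "ROC support unit"  # site/operations problems go via the ROC

assert initial_assignees(vo_related=True, grid_related=True) == ["TPM-VO", "TPM"]
assert route({"kind": "middleware"}) == "Middleware support unit"
```

The value of the TPM layer is that the user never has to know this routing table; the cost, as the next slide notes, is that someone has to staff it.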
34. The EGEE model uses a central helpdesk facility and ticket process managers, but...
- Some users are confused by mixed messages
- The central GGUS facility is taking time to become stable
- Ticket Process Managers are difficult to provide; EGEE funding did not account for them
- VOs still have independent support lists and routes, especially the larger VOs
- Linking up ROC helpdesks is taking time, and getting VOs to populate their follow-up lists is not happening quickly
- VO support unit mailing lists are very active on their own!
35. Interoperability is another area to be developed
- In terms of:
  - Operations
  - Support
  - Job submission
  - Job monitoring
- Currently the VOs/experiments develop their own solutions to this problem.
36. Some other areas which are talks in themselves!
- Security
  - Getting all sites to adopt best practices:
    - check patches
    - check port changes
    - review log files
  - Scanning for grid-wide intrusion
- Network monitoring
  - Aggregation of data from site network boxes
  - Mediator for integrated network checks
37. Going forward, one of the main drivers pushing the service is a series of service challenges in LCG
- Main UK site connected to CERN via UKLIGHT
- Up to 650 Mb/s sustained transfers
- 3 Tier-2 centres deployed an SRM and managed sustained data transfer rates of up to 550 Mb/s over SJ4; one connected via UKLIGHT
38. Summary
1. UKI has a strong presence in EGEE and LCG
2. Our grid management tools are now evolving rapidly
3. Grid utilisation is improving as we start to look at the dynamics
4. Growing focus areas include support and interoperation (and gLite!)
5. There is a lot of work not covered here: fabric, security, networking
6. Come and visit the GridPP (PPARC) and CCLRC stands!
39. gLite vs LCG-2 components at a site (with VOMS)
- Catalogue and access control: LFC (LCG) vs FIREMAN (gLite). Data from LCG is owned by VO and role; the gLite-IO service owns gLite data.
- Workload management: RB (LCG) vs gLite WLM.
- myProxy: shared between the two stacks.
- Information system: a BD-II on each side; R-GMA on each side, and the R-GMAs can be merged (security ON); gLite can run an independent IS.
- Accounting: APEL (LCG) vs DGAS (gLite).
- UIs: separate for each stack.
- Data access: gLite-IO (gLite).
- Compute: LCG CE vs gLite-CE; the CEs use the same batch system and WNs.
- File transfer: FTS, shared; FTS for LCG uses the user proxy, while gLite uses a service certificate.
- Storage: a shared SRM-SE.