Title: Grid Deployment
1. Grid Deployment Operations
20th September 2005
Jeremy Coles, GridPP Production Manager and UKI Operations Manager for EGEE (J.Coles@rl.ac.uk)
2. Overview
1. Project background (to EGEE, LCG and GridPP)
2. The middleware and its deployment
3. Structures developed in response to operating a large grid
4. How the infrastructure is being used
5. Particular problems being faced
6. Summary
3. A reminder of the Enabling Grids for E-sciencE project
32 million euros of EU funding over 2 years, starting 1st April 2004:
- 48% service activities (Grid Operations, Support and Management; Network Resource Provision)
- 24% middleware re-engineering (Quality Assurance, Security, Network Services Development)
- 28% networking (Management; Dissemination and Outreach; User Training and Education; Application Identification and Support; Policy and International Cooperation)
The emphasis in EGEE is on operating a production grid and supporting the end-users.
(From Bob Jones's talk, AHM 2004)
7. The UK/Ireland contribution to SA1 deployment operations
- Consists of 3 partners:
  - Grid Ireland
  - The National Grid Service (NGS): Leeds/Manchester/Oxford/RAL
  - GridPP
    - Currently the lead partner
    - Based on a Tier-2 structure within the Large Hadron Collider Grid Project (LCG). See T. Doyle's talk tomorrow, 11am, CR2.
- The UKI structure:
  - Regional Operations Centre (ROC)
    - Helpdesk
    - Communications
    - Liaison with ROCs and CICs
    - Monitoring of resources
  - Core Infrastructure Centre (CIC)
    - Team take shifts to monitor core services and follow up on site problems
8. GridPP is a major contributor to the growth of EGEE resources
9. When sites join EGEE, the ROC:
- Records site details in a central Grid Operations Centre DataBase (GOCDB), with certificate-controlled access
- Ensures that the site has agreed to and signed the Acceptable Use and Incident Response procedures
- Runs tests against the site to ensure that the setup is correctly configured
NB: page access requires an appropriate grid certificate.
10. Experience has revealed growing requirements for the GOCDB
- ROC manager control: to be able to update site information, change the monitoring status of sites, or remove sites
- A structure that allows easy population of structured views (such as accounting according to regional structures)
- The ability to differentiate pure production sites from test resources (e.g. preproduction services)
11. EGEE middleware is still evolving based on operational needs
12. An overview of the (changing) middleware release process
Site deployment of middleware:
- YAIM: a bash script. Simple and transparent; much preferred by administrators.
- Quattor: steep learning curve, but allows tighter control over installation.
- Patches: functionality vs stability!
- Porting to non-standard LCG operating systems
13. A mixed infrastructure is inevitable, and local variations must be manageable
- Releases take time to be adopted; how will more frequent updates be tagged and handled?
- Grid Ireland has a completely different deployment model to GridPP (central vs site-based)
14. Additional components are added, such as for managed storage
- Storage Resource Management (SRM) interface
  - Provides a protocol for large-scale storage systems on the grid
  - Clients can retrieve and store files, and control file lifetimes and filespace
- Sites will need to offer an SRM-compliant storage element (SE) to VOs
- These SEs are basically filesystem mount points on specific servers
- There are few solutions available, and deployment at test sites has proved time consuming (integration at sites, understanding the hardware setup; documentation improving)
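The store/retrieve/lifetime behaviour described above can be sketched in miniature. This is a hedged illustration of SRM-style semantics only, not a real SRM client: the class `TinySRM` and its method names are invented for this example.

```python
import time

# Hypothetical sketch of SRM-style interactions: clients store and
# retrieve files and control file lifetimes on a storage element.
# Class and method names are illustrative, not a real SRM API.
class TinySRM:
    def __init__(self):
        self.files = {}  # logical file name -> (data, expiry timestamp)

    def put(self, lfn, data, lifetime_s=3600):
        """Store a file with a finite lifetime (pinning-like semantics)."""
        self.files[lfn] = (data, time.time() + lifetime_s)

    def extend_lifetime(self, lfn, extra_s):
        """Lengthen the lifetime of an already-stored file."""
        data, expiry = self.files[lfn]
        self.files[lfn] = (data, expiry + extra_s)

    def get(self, lfn):
        """Retrieve a file only if its lifetime has not expired."""
        data, expiry = self.files.get(lfn, (None, 0.0))
        if time.time() > expiry:
            raise KeyError(f"{lfn}: not found or lifetime expired")
        return data

se = TinySRM()
se.put("/grid/atlas/run42.dat", b"events", lifetime_s=60)
assert se.get("/grid/atlas/run42.dat") == b"events"
```

Real SRM implementations add space reservation and transfer-URL negotiation on top of this; the lifetime bookkeeping alone already hints at why integration at sites took time.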
17. Once sites are part of the grid they are actively monitored
- The Site Functional Tests (SFTs) are a series of jobs reporting whether a site is able to do basic transfers, publishes the required information, etc.
- These have recently been updated, as certain critical tests gave a misleading impression of a site
- The tests are being used (and expanded) by Virtual Organisations (VOs) to select stable sites (to improve efficiency)
- They have proved very useful to sites, and can now be run by them on demand
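The distinction between critical and non-critical tests above can be sketched as follows. This is an illustrative aggregation only; the test names and the three-state status are assumptions, not the actual SFT implementation.

```python
# Illustrative sketch of aggregating Site Functional Test results:
# a site only "fails" when a critical test fails, so a cosmetic
# failure no longer gives a misleading impression of the site.
# Test names are invented for this example.
CRITICAL = {"job_submit", "replica_copy"}

def site_status(results):
    """results: dict of test name -> bool (passed).

    Returns 'FAIL' on any critical failure, 'WARN' on a
    non-critical failure, otherwise 'OK'.
    """
    if any(not ok for name, ok in results.items() if name in CRITICAL):
        return "FAIL"
    if any(not ok for ok in results.values()):
        return "WARN"  # flagged for the site, but VOs may still use it
    return "OK"

assert site_status({"job_submit": True, "replica_copy": True,
                    "version_check": False}) == "WARN"
```

A VO selecting stable sites would then schedule only to sites reporting `OK` (or perhaps `WARN`), which is the efficiency gain the slide refers to.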
18. The tests form part of a suite of information used by the Core Infrastructure Centres (CICs)
- There are currently 5 CICs in EGEE
- Introduction of a CIC-on-duty rota (whereby each CIC oversees EGEE operations for 1 week at a time) saw a great improvement in grid stability
- Available information is captured in a trouble ticket and sent to problem sites (and their ROC) informing them that there is a problem
- Tickets are automatically escalated if not resolved
- Core services are monitored in addition to sites
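The automatic escalation of unresolved tickets can be sketched as a set of age-based deadlines. The specific thresholds and escalation targets below are illustrative assumptions, not the actual CIC-on-duty procedure.

```python
from datetime import datetime, timedelta

# Hedged sketch of automatic ticket escalation: a ticket left
# unresolved past successive deadlines triggers stronger actions.
# The deadlines and action names are invented for illustration.
STEPS = [
    (timedelta(days=3), "remind site"),
    (timedelta(days=5), "escalate to ROC"),
    (timedelta(days=9), "raise at operations meeting"),
]

def escalation_actions(opened, now, resolved=False):
    """Return every escalation action due for a ticket of this age."""
    if resolved:
        return []
    age = now - opened
    return [action for deadline, action in STEPS if age >= deadline]

opened = datetime(2005, 9, 1)
assert escalation_actions(opened, datetime(2005, 9, 7)) == [
    "remind site", "escalate to ROC"]
```

The point of the mechanism is that a problem site cannot silently ignore a ticket: inaction alone moves it up the chain.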
19. Good, reliable and easy-to-access information has been extremely useful to sites and ROC staff
At a glance we can see, for each site:
- whether it passes or fails the functional tests
- if there are configuration errors (via sanity checks)
- what middleware version is deployed
- the total job slots available and used, as published by the site
- basic storage information
- average and maximum published job slots, showing deviations
20. With a rapidly growing number of sites and wider geographic coverage, many tools have had to evolve
21. ...and new ones have been developed. EGEE and LCG metrics are an increasing area of focus: how else are we to manage?
22. We need to develop a better understanding of grid dynamics
Is this the result of a loss of the Tier-1 scheduler, or just a problem with the tests?
Is this several sites with large farms upgrading?
23. The good news is that UKI is currently the largest contributor to EGEE resources
24. ...and resource usage is growing (at 55% for August, and 26% for the period from June 04)
- Utilisation may worry some people, but note that the majority of resources are being deployed for High Energy Physics experiments, which will ramp up usage quickly in 2007
- Recent activity is partly due to a Biomedical data challenge in August
25. Several sites have been running full for July/August. The plot below is for the Tier-1 in August.
26. However, "full" does not always mean well used!
- The plot shows weighted job efficiencies for the ATLAS VO in July 2005
- Straight-line structures show jobs which ran for a period of time before blocking on an external resource and eventually being killed by an elapsed-time limit
- Clusters at low efficiency probably show performance problems on external storage elements
- Many problems seen here are NOW FIXED
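The quantity behind such a plot is typically CPU time divided by elapsed (wall) time, weighted so that long jobs count for more. This sketch is an assumption about the metric, not the exact definition used for the plot.

```python
# Sketch of a "weighted job efficiency" metric: per-job efficiency is
# CPU seconds over wall-clock seconds, and jobs are weighted by wall
# time when summarising a site. Field layout is illustrative.
def weighted_efficiency(jobs):
    """jobs: list of (cpu_seconds, wall_seconds) tuples."""
    total_wall = sum(wall for _, wall in jobs)
    if total_wall == 0:
        return 0.0
    total_cpu = sum(cpu for cpu, _ in jobs)
    return total_cpu / total_wall

# A job blocked on an external resource burns wall time but almost no
# CPU, dragging the site's weighted efficiency down:
jobs = [(3600, 3600), (100, 36000)]
assert weighted_efficiency(jobs) < 0.10
```

This is why a farm can look "full" in the occupancy plot while delivering little useful work: blocked jobs hold slots without consuming CPU.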
27. ...and some sites have specific scheduling requirements
Grid scheduling (using user-specified requirements to select resources)
vs
local policies (the site prefers certain VOs)
28. The user community is expanding, creating new problems
- Over 900 users in some 60 VOs
- UK sites support about 10 VOs
- Opening up resources for non-traditional site VOs/users requires effort
- Negotiation between VOs and the regional sites has required the creation of an Operational Advisory Group
- New Acceptable Use policies which apply across countries and are agreeable (and actually readable) are taking time to develop
30. Aggregation of job accounting is recording VO usage, but...
(Accounting data flows from each site to the GOC, with a web summary view of the data.)
- Not all batch systems are covered
- Not all sites are publishing data
- Farm normalisation factors are not consistent
- Publishing across grids is yet to be tackled (but the solution in EGEE does use a GGF schema)
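The farm-normalisation point above is that raw CPU seconds from farms with different processors are not comparable, so accounting scales them by a per-farm benchmark rating. The reference value and ratings below are invented for illustration.

```python
# Sketch of CPU-time normalisation in grid accounting: raw CPU seconds
# are scaled by the farm's benchmark rating relative to a reference
# CPU, so heterogeneous farms can be summed fairly. If sites publish
# inconsistent ratings, the aggregated totals are skewed.
REFERENCE_RATING = 1000.0  # benchmark rating of the reference CPU (assumed)

def normalised_cpu(cpu_seconds, farm_rating):
    """Convert raw CPU seconds into reference-CPU seconds."""
    return cpu_seconds * farm_rating / REFERENCE_RATING

# One hour of work on a fast farm counts as two reference hours:
assert normalised_cpu(3600, 2000.0) == 7200.0
assert normalised_cpu(3600, 500.0) == 1800.0
```

The inconsistency flagged on the slide is exactly the case where two sites with identical hardware publish different `farm_rating` values.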
31. GridPP data is reasonably complete for recent months
Note the usage by non-particle-physics organisations. This is what the EGEE grid is all about.
32. Support is proving difficult because the project is so large and diverse
Site administrators, users and experiments/VOs currently reach support through many parallel channels:
- GOSC (Footprints)
- LCG-ROLLOUT and TB-SUPPORT mailing lists
- Grid-Ireland helpdesk (Request Tracker)
- GGUS (Remedy)
- UKI ROC ticket tracking system (Footprints)
- Savannah bug tracking
- Tier-1 helpdesk (Request Tracker)
- Regional services, the CIC-on-duty and the sites themselves
This is ONLY the view for the UKI operations centre; there are 9 ROCs.
33. The EGEE model uses a central helpdesk facility and Ticket Process Managers
- "I need help!" The user sends e-mail to vo-user-support@ggus.org
- The e-mail is automatically converted into a GGUS ticket, which can be addressed to the VO TPM only, the TPM only, or both
- Ticket Process Manager (TPM): monitors ticket assignments, directs tickets to the correct support unit, and notifies users of specific actions and ticket status
- VO TPM: support people from the VOs. They receive and follow VO-related tickets, solve or forward VO-specific problems, and recognise grid-related problems, assigning them to specific support units or back to the TPM
- Support units include: VO support units, ROC support units, middleware support units, other grids' support units, the CIC support unit, and mailing lists
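The routing flow above can be sketched as two small decisions: who first sees a new ticket, and where a VO TPM forwards it. The function names, ticket fields and unit names are illustrative assumptions, not the GGUS implementation.

```python
# Hedged sketch of GGUS-style ticket routing. A new ticket goes to the
# VO TPM, the TPM, or both; the VO TPM keeps VO problems and hands
# grid problems on to the appropriate support unit. Names are invented.
def initial_assignees(vo_related, grid_related):
    """Decide which TPMs first see a newly created ticket."""
    assignees = []
    if vo_related:
        assignees.append("TPM-VO")
    if grid_related or not vo_related:
        assignees.append("TPM")
    return assignees

def route(ticket):
    """Forward a triaged ticket to a support unit by problem kind."""
    if ticket["kind"] == "vo":
        return "VO support unit"
    if ticket["kind"] == "middleware":
        return "Middleware support unit"
    return "ROC support unit"  # site/operations problems go via the ROC

assert initial_assignees(vo_related=True, grid_related=True) == ["TPM-VO", "TPM"]
assert route({"kind": "middleware"}) == "Middleware support unit"
```

The value of the TPM layer is that the user never has to know this routing table; the cost, as the next slide notes, is that someone has to staff it.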
34. The EGEE model uses a central helpdesk facility and ticket process managers, but...
- Some users are confused by mixed messages
- The central GGUS facility is taking time to become stable
- Ticket Process Managers are difficult to provide; EGEE funding did not account for them
- VOs still have independent support lists and routes, especially the larger VOs
- Linking up ROC helpdesks is taking time, and getting VOs to populate their follow-up lists is not happening quickly
- VO support unit mailing lists are very active on their own!
35. Interoperability is another area to be developed
- In terms of:
  - Operations
  - Support
  - Job submission
  - Job monitoring
- Currently the VOs/experiments develop their own solutions to this problem.
36. Some other areas which are talks in themselves!
- Security
  - Getting all sites to adopt best practices:
    - check patches
    - check port changes
    - review log files
  - Scanning for grid-wide intrusion
- Network monitoring
  - Aggregation of data from site network boxes
  - Mediator for integrated network checks
37. Going forward, one of the main drivers pushing the service is a series of service challenges in LCG
- Main UK site connected to CERN via UKLIGHT
- Up to 650 Mb/s sustained transfers
- 3 Tier-2 centres deployed an SRM and managed sustained data transfer rates of up to 550 Mb/s over SJ4; one connected via UKLIGHT
38. Summary
1. UKI has a strong presence in EGEE and LCG
2. Our grid management tools are now evolving rapidly
3. Grid utilisation is improving as we start to look at the dynamics
4. Growing focus areas include support and interoperation (and gLite!)
5. There is a lot of work not covered here: fabric, security, networking
6. Come and visit the GridPP (PPARC) and CCLRC stands!
39. gLite vs LCG-2 components at a site (with VOMS)
- Catalogue and access control: LFC (LCG) vs FIREMAN (gLite). Data from LCG is owned by VO and role; the gLite-IO service owns gLite data.
- Workload management: RB (LCG) vs gLite WLM.
- myProxy: shared between the two stacks.
- Information system: a BD-II on each side; R-GMA on each side, and the R-GMAs can be merged (security ON); gLite can run an independent IS.
- Accounting: APEL (LCG) vs DGAS (gLite).
- UIs: separate for each stack.
- Data access: gLite-IO (gLite).
- Compute: LCG CE vs gLite-CE; the CEs use the same batch system and WNs.
- File transfer: FTS, shared; FTS for LCG uses the user proxy, while gLite uses a service certificate.
- Storage: a shared SRM-SE.