Transcript and Presenter's Notes

Title: Grid Deployment


1
Grid Deployment Operations
  • EGEE, LCG and GridPP

20th September 2005
Jeremy Coles, GridPP Production Manager and UKI
Operations Manager for EGEE (J.Coles@rl.ac.uk)
2
Overview
1 Project Background (to EGEE, LCG and GridPP)
2 The middleware and its deployment
3 Structures developed in response to operating a
large grid
4 How the infrastructure is being used
5 Particular problems being faced
6 Summary
3
A reminder of the Enabling Grids for E-sciencE
project
32 million euros of EU funding over 2 years starting
1st April 2004
  • 48% service activities (Grid Operations, Support
    and Management, Network Resource Provision)
  • 24% middleware re-engineering (Quality
    Assurance, Security, Network Services
    Development)
  • 28% networking (Management, Dissemination and
    Outreach, User Training and Education,
    Application Identification and Support, Policy
    and International Cooperation)

Emphasis in EGEE is on operating a
production grid and supporting the end-users
From Bob Jones's talk at AHM 2004!
4-7
The UK Ireland contribution to SA1 deployment
operations
  • Consists of 3 partners
  • Grid Ireland
  • The National Grid Service (NGS)
  • GridPP
  • Currently the lead partner
  • Based on a Tier-2 structure within the LHC
    Computing Grid project (LCG) - see T Doyle's
    talk tomorrow, 11am, CR2
  • The UKI structure
  • Regional Operations Centre (ROC)
  • Helpdesk
  • Communications
  • Liaison with ROCs and CICs
  • Monitoring of resources
  • Core Infrastructure Centre (CIC)
  • Team take shifts to
  • Monitor core services and
  • Follow up on site problems

8
GridPP is a major contributor to the growth of
EGEE resources
9
When sites join EGEE the ROC
  • Records site details in a central Grid
    Operations Centre DataBase (GOCDB), with access
    controlled by grid certificate (a rough sketch
    of such a record follows below)
  • Ensures that the site has agreed to and signed
    the Acceptable Use and Incident Response
    procedures
  • Runs tests against the site to ensure that the
    setup is correctly configured

NB. Page access requires appropriate grid
certificate
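
For illustration only, here is the kind of record the GOCDB might hold per site, as a Python sketch. The field names are invented, not the real GOCDB schema; the monitored/production flags anticipate the requirements listed on the next slide.

```python
# Illustrative sketch only - field names are invented,
# not the real GOCDB schema.
from dataclasses import dataclass

@dataclass
class SiteRecord:
    name: str
    roc: str                  # owning Regional Operations Centre
    contact_email: str
    aup_signed: bool = False  # Acceptable Use / Incident Response agreed
    monitored: bool = True    # monitoring status, toggled by the ROC manager
    production: bool = True   # distinguishes production sites from test resources
```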
10
Experience has revealed growing requirements for
the GOCDB
  • ROC manager control - to be able to update site
    information and change the monitoring status of,
    or remove, sites
  • A structure that allows easy population of
    structured views (such as accounting according to
    regional structures)
  • To be able to differentiate pure production
    sites from test resources (e.g. preproduction
    services)

11
EGEE middleware is still evolving based on
operational needs
12
An overview of the (changing) middleware release
process
Site deployment of middleware:
  • YAIM - a bash script. Simple and transparent.
    Much preferred by administrators. (See the
    sketch below.)
  • Quattor - steep learning curve but allows
    tighter control over installation.
Patches: functionality vs stability!
Porting to non-standard LCG operating systems
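
YAIM drives site configuration from a file of shell-style KEY=value settings (site-info.def). As a minimal sketch only, assuming a simplified file with no shell expansion, such settings could be read like this; the key name in the usage comment is an example, not a complete list.

```python
# Minimal sketch: read simplified KEY=value settings of the
# site-info.def kind that YAIM consumes (ignores shell
# expansion and quoting edge cases).
def read_site_info(path: str) -> dict:
    settings = {}
    with open(path) as f:
        for raw in f:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blanks and comments
            key, sep, value = line.partition("=")
            if sep:
                settings[key.strip()] = value.strip().strip('"')
    return settings

# e.g. read_site_info("site-info.def").get("SITE_NAME")
```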
13
A mixed infrastructure is inevitable and local
variations must be manageable
  • Releases take time to be adopted; how will more
    frequent updates be tagged and handled?
  • Grid Ireland has a completely different
    deployment model to GridPP (central vs
    site-based)

14
Additional components are added, such as for
managed storage
  • Storage Resource Management interface
  • Provides a protocol for large scale storage
    systems on the grid
  • Clients can retrieve and store files, control
    file lifetimes and filespace
  • Sites will need to offer an SRM-compliant
    storage element to VOs
  • These SEs are basically filesystem mount points
    on specific servers
  • There are few solutions available, and
    deployment at test sites has proved time
    consuming (integration at sites, understanding
    hardware setup; documentation improving). A
    hypothetical client sketch follows below.
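
SRM is a protocol rather than a library, but its core operations map naturally onto a small client interface. Below is a hypothetical Python sketch of the operations named above (store, retrieve, lifetime control); the class and method names are invented, not a real SRM client API (real clients of the era were command-line tools such as srmcp).

```python
# Hypothetical sketch of the SRM operations described above.
# All names are invented.
from dataclasses import dataclass

@dataclass
class SrmFile:
    surl: str        # storage URL, e.g. "srm://se.example.org/some/file"
    lifetime_s: int  # requested file lifetime in seconds

class SrmClient:
    def put(self, local_path: str, f: SrmFile) -> None:
        """Reserve space, store the file, pin it for f.lifetime_s seconds."""
        raise NotImplementedError

    def get(self, f: SrmFile, local_path: str) -> None:
        """Stage the file (possibly from tape) and copy it locally."""
        raise NotImplementedError

    def extend_lifetime(self, f: SrmFile, extra_s: int) -> None:
        """Control file lifetime: keep the file pinned for longer."""
        raise NotImplementedError
```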

15-17
Once sites are part of the grid they are actively
monitored
  • The Site Functional Tests (SFTs) are a series of
    jobs reporting whether a site is able to perform
    basic transfers, publish required information,
    etc.
  • These have recently been updated, as certain
    critical tests gave a misleading impression of
    a site
  • The tests are being used (and expanded) by
    Virtual Organisations (VOs) to select stable
    sites (to improve efficiency)
  • They have proved very useful to sites and can now
    be run by them on demand (a minimal harness is
    sketched below)
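
As flagged in the last bullet, here is a minimal sketch of an SFT-style harness. It is illustrative only: the individual checks are stubs, and the critical/non-critical split is an assumption based on the remark above that certain critical tests gave a misleading impression of a site.

```python
# Minimal illustrative SFT-style harness; checks are stubs.
def check_job_submission(site: str) -> bool:
    return True  # stub: a real test would submit and track a grid job

def check_replica_transfer(site: str) -> bool:
    return True  # stub: a real test would copy a file to/from the SE

def check_info_published(site: str) -> bool:
    return True  # stub: a real test would query the information system

TESTS = [
    # (name, test function, counts towards overall pass/fail?)
    ("job submission",   check_job_submission,   True),
    ("replica transfer", check_replica_transfer, True),
    ("info published",   check_info_published,   False),
]

def run_sft(site: str) -> bool:
    overall = True
    for name, test, critical in TESTS:
        passed = test(site)
        print(f"{site}: {name}: {'PASS' if passed else 'FAIL'}")
        if critical and not passed:
            overall = False  # non-critical failures are reported only
    return overall
```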

18
The tests form part of a suite of information
used by the Core Infrastructure Centres (CICs)
  • There are currently 5 CICs in EGEE
  • Introduction of a CIC on Duty rota (whereby
    each CIC oversees EGEE operations for 1 week at a
    time) saw a great improvement in grid stability
  • Available information is captured in a Trouble
    Ticket and sent to problem sites (and their ROC)
    informing them that there is a problem
  • Tickets are automatically escalated if not
    resolved (see the sketch below)
  • Core services are monitored in addition to sites
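
The automatic escalation in the bullet above might look like the following sketch. The step names and deadlines are invented, not the actual CIC-on-duty procedure.

```python
# Invented escalation ladder - illustrative of automatic
# escalation only, not the real CIC-on-duty rules.
from datetime import datetime, timedelta

ESCALATION_STEPS = [
    (timedelta(days=3),  "remind the site administrators"),
    (timedelta(days=7),  "notify the site's ROC"),
    (timedelta(days=14), "raise at the weekly operations meeting"),
]

def due_escalations(opened: datetime, resolved: bool, now: datetime) -> list:
    """Return every escalation step whose deadline has passed."""
    if resolved:
        return []
    age = now - opened
    return [action for deadline, action in ESCALATION_STEPS if age >= deadline]
```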

19
Good, reliable and easy to access information has
been extremely useful to sites and ROC staff
  • At a glance we can see for each site
  • whether it passes or fails the functional tests
  • if there are configuration errors (via sanity
    checks)
  • what middleware version is deployed
  • the total job slots available and used as
    published by the site
  • basic storage information
  • average and maximum published job slots, showing
    deviations

20
With a rapidly growing number of sites and
geographic coverage many tools have had to evolve
21
And new ones developed. EGEE and LCG metrics are
an increasing area of focus: how else are we to
manage?
22
We need to develop a better understanding of grid
dynamics
Is this the result of a loss of the Tier-1
scheduler, or just a problem with the tests?
Is this several sites with large farms upgrading?
23
The good news is that UKI is currently the
largest contributor to EGEE resources
24
and resource usage is growing (at 55% for
August and 26% for the period from June 04)
  • Utilisation may worry some people, but note that
    the majority of resources are being deployed for
    High Energy Physics experiments, which will ramp
    up usage quickly in 2007
  • Recent activity is partly due to a
    Biomedical data challenge in August

25
Several sites have been running full for
July/August. The plot below is for the Tier-1 in
August
26
However full does not always mean well used!
  • The plot shows weighted job efficiencies for the
    ATLAS VO in July 2005 (the definition assumed
    here is sketched below)
  • Straight line structures show jobs which ran for
    a period of time before blocking on an external
    resource and eventually being killed by an
    elapsed time limit
  • Clusters at low efficiency probably show
    performance problems on external storage elements
  • Many problems seen here are NOW FIXED
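
For reference, here is the quantity presumably plotted: job efficiency as CPU time over wall-clock time, aggregated with wall-time weighting. This is a common definition, assumed rather than taken from the talk.

```python
# Assumed definition: efficiency = CPU time / wall-clock time.
def job_efficiency(cpu_s: float, wall_s: float) -> float:
    return cpu_s / wall_s if wall_s > 0 else 0.0

def weighted_efficiency(jobs: list) -> float:
    """Aggregate over (cpu_s, wall_s) pairs, weighting by wall time."""
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_wall = sum(wall for _, wall in jobs)
    return total_cpu / total_wall if total_wall > 0 else 0.0

# A job that blocks on an external resource accrues wall time but
# little CPU time, so its efficiency decays until the elapsed-time
# limit kills it - one reading of the straight-line structures
# in the plot.
```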

27
and some sites have specific scheduling
requirements
Grid scheduling (using user-specified requirements
to select resources)
vs
local policies (the site prefers certain VOs)
A toy matchmaker contrasting the two is sketched below.
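
This sketch is illustrative only: the grid side filters sites on the user's stated requirements, then a local policy ranks sites that prefer the submitting VO ahead of sites that merely accept it. All field names and the policy itself are invented.

```python
# Toy matchmaker - all field names and the policy are invented.
def match_sites(sites: list, req_cpus: int, vo: str) -> list:
    # Grid scheduling: keep only sites meeting the user's requirements
    candidates = [s for s in sites
                  if s["free_cpus"] >= req_cpus and vo in s["supported_vos"]]
    # Local policy: sites that *prefer* this VO rank ahead of sites
    # that merely accept it; ties broken by free capacity
    candidates.sort(key=lambda s: (vo not in s["preferred_vos"],
                                   -s["free_cpus"]))
    return [s["name"] for s in candidates]
```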
28
The user community is expanding, creating new
problems
  • Over 900 users in some 60 VOs
  • UK sites support about 10 VOs
  • Opening up resources for non-traditional site
    VOs/users requires effort
  • Negotiation between VOs and the regional sites
    has required the creation of an Operational
    Advisory Group
  • New Acceptable Use policies which apply across
    countries and are agreeable (and actually
    readable) are taking time to develop

29-30
Aggregation of job accounting is recording VO
usage, but
[Diagram: sites publish accounting data to the GOC,
with a web summary view of the data]
  • Not all batch systems are covered
  • Not all sites are publishing data
  • Farm normalisation factors are not consistent
    (see the sketch below)
  • Publishing across grids yet to be tackled (but
    the solution in EGEE does use a GGF schema)
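
One way to see why inconsistent normalisation factors matter: accounting typically scales raw CPU time by a per-site benchmark rating (SpecInt2000 in this era) to a common reference. A sketch, with an invented reference value:

```python
# Sketch of benchmark normalisation; the reference rating
# is invented.
REFERENCE_SI2K = 1000.0  # hypothetical reference machine rating

def normalised_cpu_hours(cpu_hours: float, site_si2k: float) -> float:
    """Scale raw CPU hours to hours on the reference machine."""
    return cpu_hours * (site_si2k / REFERENCE_SI2K)

# Two farms reporting identical raw hours but applying different
# (or no) scaling will skew any aggregated VO usage built from
# their records.
```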

31
GridPP data is reasonably complete for recent
months
Note the usage by non-particle-physics
organisations. This is what the EGEE grid is all
about.
32
Support is proving difficult because the project
is so large and diverse
[Diagram: the support landscape seen by the UKI ROC.
Site administrators, users and experiments/VOs raise
problems against sites, regional services and the
CIC-on-duty through many parallel systems: GOSC
(Footprints), the LCG-ROLLOUT and TB-SUPPORT mailing
lists, the Grid-Ireland helpdesk (Request Tracker),
GGUS (Remedy), the UKI ROC ticket tracking system
(Footprints), Savannah bug tracking and the Tier-1
helpdesk (Request Tracker).]
  • This is ONLY the view for the UKI operations
    centre. There are 9 ROCs

33
The EGEE model uses a central helpdesk facility
and Ticket Process Managers
"I need help!" - the user sends an e-mail to
vo-user-support@ggus.org
The e-mail is automatically converted into a GGUS
ticket, which can be addressed to the TPM VO only,
to the TPM only, or to both
Ticket Process Manager (TPM): monitors ticket
assignments, directs tickets to the correct support
unit, and notifies users of specific actions and
ticket status
TPM VO: support people from the VOs. They receive
VO-related tickets and follow them, solve or
forward VO-specific problems, and recognise
grid-related problems, assigning them to specific
support units or back to the TPM
Support units: VO, ROC, middleware, other grids,
CIC, and mailing lists
A toy dispatcher for this routing is sketched below.
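
The support-unit names in this sketch follow the slide, but the classification fields and logic are invented; it only illustrates the shape of the routing decision a TPM makes.

```python
# Toy TPM dispatcher - classification fields are invented.
def route_ticket(ticket: dict) -> str:
    """Return the support unit a TPM might assign this ticket to."""
    if ticket.get("vo_specific"):
        return "VO support unit"              # via the TPM VO people
    if ticket.get("site"):
        return f"ROC support unit ({ticket['site']})"
    if ticket.get("middleware_component"):
        return "Middleware support unit"
    if ticket.get("other_grid"):
        return "Other grids support unit"
    return "CIC support unit"                 # core-service problems
```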
34
The EGEE model uses a central helpdesk facility
and ticket process managers, but
Some users are confused - mixed messages
The central GGUS facility is taking time to
become stable
Ticket Process Managers are difficult to provide;
EGEE funding did not account for them
VOs still have independent support lists and
routes, especially the larger VOs
Linking up ROC helpdesks is taking time. Getting
VOs to populate their follow-up lists is not
happening quickly
VO Support Unit mailing lists are very active on
their own!
35
Interoperability is another area to be developed
  • In terms of
  • Operations
  • Support
  • Job submission
  • Job monitoring
  • Currently the VOs/experiments develop their own
    solutions to this problem.

36
Some other areas which are talks in themselves!
  • Security
  • Getting all sites to adopt best practices
  • checking patches
  • checking port changes
  • reviewing log files
  • Scanning for grid-wide intrusion
  • Network monitoring
  • Aggregation of data from site network boxes
  • Mediator for integrated network checks

37
Going forward, one of the main drivers pushing
the service is a series of service challenges in
LCG
  • Main UK site connected to CERN via UKLIGHT
  • Up to 650 Mb/s sustained transfers
  • 3 Tier-2 centres deployed an SRM and managed
    sustained data transfer rates up to 550 Mb/s over
    SJ4. One connected via UKLIGHT
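(For scale: 650 Mb/s sustained is roughly 81 MB/s,
or about 7 TB of data per day.)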


38
Summary
1 UKI has a strong presence in EGEE and LCG
2 Our grid management tools are now evolving
rapidly
3 Grid utilisation is improving as we start to
look at the dynamics
4 Growing focus areas include support and
interoperation (and gLite!)
5 There is a lot of work not covered here!
Fabric, Security, Networking
6 Come and visit the GridPP (PPARC) and CCLRC
stands!
39
gLite vs LCG-2 Components
[Diagram comparing the two middleware stacks at a site:]
  • Shared services: VOMS, myProxy, BD-II, UIs, WNs
    and an SRM-SE
  • Catalogue and access control: LFC (LCG) vs
    FIREMAN (gLite)
  • Workload management: RB (LCG) vs gLite WLM
  • Accounting: APEL (LCG) vs DGAS (gLite)
  • Information system: independent R-GMA instances;
    R-GMAs can be merged (security ON)
  • Compute element: LCG CE vs gLite-CE; the CEs use
    the same batch system
  • File transfer: FTS for LCG uses the user proxy,
    gLite FTS uses a service certificate
  • Data access: data from LCG is owned by VO and
    role; the gLite-IO service owns gLite data