GridPP: Running a Production Grid

1
GridPP Running a Production Grid
  • Stephen Burke
  • CLRC/RAL
  • On behalf of the GridPP Deployment & Operations
    Team
  • UK e-Science All Hands Meeting, Nottingham, 21st
    September 2006

2
Overview
  • EGEE, LCG and GridPP
  • Middleware
  • Deployment & Operations
  • Conclusions

3
  • EGEE, LCG and GridPP

4
EGEE
  • Major EU Grid project 2004-08 (in two phases)
  • Successor to the European DataGrid (EDG) project,
    2001-04
  • 32 countries, 91 partners, €37 million matching
    funding
  • Associated with several Grid projects outside
    Europe
  • Expected to be succeeded by a permanent European
    e-infrastructure
  • Supports many areas of e-science, but currently
    High Energy Physics is the major user
  • Biomedical research is also a pioneer
  • Currently 3000 users in 200 Virtual
    Organisations
  • Currently 195 sites, 28689 CPUs, 18.4 PB of
    storage
  • Values taken from the information system, so
    beware of GIGO (garbage in, garbage out)!

5
EGEE/LCG Google map
6
(W)LCG
  • The computing services for the LHC (Large Hadron
    Collider) at CERN in Geneva are provided by the
    LHC Computing Grid (LCG) project
  • LHC starts running in 1 year
  • Four experiments, all very large
  • 5000 users at 500 sites worldwide, 15 year
    lifetime
  • Expect 15 PB/year, plus similar volumes of
    simulated data
  • Processing requirement is 100,000 CPUs
  • Must transfer 100 Mbyte/sec/site sustained for
    15 years!
  • Running a series of Service Challenges to ramp up
    to full scale
  • LCG uses the EGEE infrastructure, but also the
    Open Science Grid (OSG) in the US and other Grid
    infrastructures
  • Hence WLCG: the Worldwide LCG

7
Organisation
  • EGEE sites are organised by region
  • GridPP is part of UK/Ireland
  • Also NGS and Grid-Ireland
  • Each region has a Regional Operation Centre (ROC)
    to look after the sites in the region
  • Overall operations co-ordination rotates weekly
    between ROCs
  • LCG divides sites into Tier 1/2/3
  • CERN as Tier 0
  • The tier is a function of size and QOS
  • Tier 1 needs 97% availability, max 24-hour
    response
  • Tier 2 needs 95% availability, 72-hour response
  • Tier 3 are local facilities, no specific targets
  • ROC + Tier 1: RAL is both

8
GridPP
  • Grid for UK Particle Physics
  • Two phases: 2001-04 and 2004-07
  • Proposal for phase 3, running to 2011
  • Part of EGEE and LCG
  • Working towards interoperability with NGS
  • 20 sites, 4354 CPUs, 298 TB of storage
  • Currently supports 33 VOs, including some non-PP
  • But not many non-PP VOs from the UK: any
    volunteers?
  • For LCG, sites are grouped into four virtual
    Tier 2s
  • Plus RAL as Tier 1
  • Grouping is largely administrative, the Grid
    sites remain separate
  • Runs UK-Ireland ROC (with NGS)
  • Grid Operations Centre (GOC) @ RAL (with NGS)
  • Grid-wide configuration, monitoring and
    accounting repository/portal
  • Operations and User Support shifts (working hours
    only)

9
GridPP sites
10
Virtual Organisations
  • Users are grouped into Virtual Organisations
  • Users/VO varies from 1 to 792!
  • Broadly four classes of VO
  • LHC experiments
  • EGEE supported
  • Worldwide (mainly non-LHC particle physics)
  • Local/regional
  • Sites can choose which VOs to support, subject to
    MOU/funding commitments
  • Most GridPP sites support 10-20 VOs
  • GridPP nominally allocates 1% of resources to
    EGEE VOs

11
  • Middleware

12
Site services
  • Basis is Globus (still GT2, GT4 soon) and Condor,
    as packaged in the Virtual Data Toolkit (VDT),
    which is also used by NGS
  • EGEE/LCG/EDG middleware distribution now under
    the gLite brand name
  • Computing Element (CE) = Globus gatekeeper +
    batch system + batch workers
  • In transition from Globus to Condor-C
  • Storage Element (SE) = Storage Resource Manager
    (SRM) + GridFTP + other data transports + storage
    system (disk-only or disk+tape)
  • Three SRM implementations in GridPP
  • Berkeley Database Information Index (BDII): an
    LDAP server publishing CE, SE and site service
    information according to the GLUE schema (see the
    query sketch after this list)
  • Relational Grid Monitoring Architecture (R-GMA):
    a server publishing GLUE schema, monitoring,
    accounting and user information
  • VOBOX: a container for VO-specific services (aka
    edge services)
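
For flavour, the information a BDII publishes can be read with
any LDAP client. Below is a minimal sketch (not from the talk)
using the Python ldap3 library; the hostname is hypothetical
and the attribute names follow the GLUE 1.x schema:

    # Query a BDII for Computing Element state via anonymous LDAP.
    from ldap3 import ALL, Connection, Server

    server = Server("bdii.example.ac.uk", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # BDIIs allow anonymous reads

    # GLUE 1.x information sits under the base DN "o=grid"
    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs",
                    "GlueCEStateWaitingJobs"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs,
              entry.GlueCEStateWaitingJobs)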

13
Core services
  • Workload Management System (WMS), aka Resource
    Broker: accepts jobs, dispatches them to sites
    and manages their lifecycle (matchmaking is
    sketched after this list)
  • Logging & Bookkeeping: primarily logs lifecycle
    events for jobs
  • MyProxy: stores long-lived credentials
  • LCG File Catalogue (LFC): maps logical file names
    to local names on SEs
  • File Transfer Service (FTS): provides managed,
    reliable file transfers
  • BDII: aggregates information from site BDIIs
  • R-GMA schema/registry: stores table definitions
    and lists of producers/consumers
  • VO Membership Service (VOMS) server: stores VO
    group/role assignments
  • User Interface (UI): provides user client tools
    for the Grid services
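
The heart of the WMS is matchmaking: comparing a job's
requirements against what each CE publishes in the GLUE schema
and ranking the matches. A toy Python sketch of the idea with
invented data (the real WMS evaluates JDL requirements through
Condor matchmaking):

    # Toy matchmaking: keep CEs that support the job's VO, then
    # rank them by free CPUs. All values are invented.
    ces = [
        {"id": "ce1.example.ac.uk", "vos": {"atlas", "lhcb"}, "free_cpus": 120},
        {"id": "ce2.example.ac.uk", "vos": {"biomed"}, "free_cpus": 40},
        {"id": "ce3.example.ac.uk", "vos": {"atlas"}, "free_cpus": 15},
    ]

    def match_and_rank(job_vo, ces):
        """Return the CEs supporting job_vo, most free CPUs first."""
        candidates = [ce for ce in ces if job_vo in ce["vos"]]
        return sorted(candidates, key=lambda ce: ce["free_cpus"], reverse=True)

    for ce in match_and_rank("atlas", ces):
        print(ce["id"], ce["free_cpus"])  # ce1 (120), then ce3 (15)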

14
Grid services
  • Some extra services are needed to allow the Grid
    to be operated effectively
  • Mostly unique instances, not part of the gLite
    distribution
  • Grid Operations Centre DataBase (GOCDB): stores
    information about each site, including contact
    details, status and a node list
  • Queried by other tools to generate configuration,
    monitoring etc.
  • Accounting (APEL): publishes information about
    CPU and storage use
  • Various monitoring tools, including
  • gstat (Grid status) - collects data from the
    information system, does sanity checks
  • Site Availability Monitoring (SAM) - runs regular
    test jobs at every site, raises alerts and
    measures availability over time (see the sketch
    after this list)
  • GridView - collects and displays information
    about file transfers
  • Real Time Monitor - displays job movements and
    records statistics
  • Freedom of Choice for Resources (FCR) allows the
    view of resources in a BDII to be filtered
    according to VO-specific criteria, e.g. SAM test
    failures
  • Operations portal - aggregates monitoring and
    operational information: broadcast email tool,
    news, VO information, etc.
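
As a flavour of how the SAM results are used, availability over
a period is essentially the fraction of test rounds a site
passes. A minimal Python sketch with invented results:

    # Availability as the pass fraction of regular SAM-style tests.
    results = [  # (hour, did all critical tests pass?)
        (0, True), (1, True), (2, False), (3, True),
        (4, True), (5, True), (6, False), (7, True),
    ]

    def availability(results):
        """Fraction of test rounds in which the site passed."""
        return sum(1 for _, ok in results if ok) / len(results)

    print(f"availability = {availability(results):.0%}")  # 75%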

15
SAM monitoring
16
GridView
17
Middleware issues
  • We need to operate a large production system with
    24×7×365 availability
  • Middleware development is usually done on small,
    controlled test systems, but the production
    system is much larger in many dimensions, more
    heterogeneous and not under any central control
  • Much of the middleware is still immature, with a
    significant number of bugs, and developing
    rapidly
  • Documentation is sometimes lacking or out of date
  • There are therefore a number of issues which must
    be managed by deployment and operational
    procedures, for example
  • The rapid rate of change and sometimes lack of
    backward compatibility requires careful
    management of code deployment
  • Porting to new hardware, operating systems etc
    can be time consuming
  • Components are often developed in isolation, so
    integration of new components can take time
  • Configuration can be very complex, and only a
    small subset of possible configurations produce a
    working system
  • Fault tolerance, error reporting and logging are
    in need of improvement
  • Remote management and diagnostic tools are
    generally undeveloped

18
  • Deployment & Operations

19
Configuration
  • We have tried many installation and configuration
    tools over the years
  • Configuration is complex, but system managers
    don't like complex tools!
  • Most configuration flexibility needs to be
    frozen
  • Admins don't understand all the options anyway
  • Many configuration changes will break something
  • The more an admin has to type, the more chances
    for a mistake
  • Current method preferred by most sites is YAIM
    (Yet Another Installation Method); an excerpt is
    sketched after this list
  • bash scripts
  • simple configuration of key parameters only
  • doesn't always have enough flexibility, but good
    enough for most cases
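
For flavour, YAIM drives everything from a single file of
key=value shell variables (site-info.def). An illustrative
excerpt; the values are hypothetical and the exact variable set
varies between releases:

    # site-info.def excerpt (hypothetical values)
    SITE_NAME=UKI-EXAMPLE
    CE_HOST=ce.example.ac.uk
    SE_HOST=se.example.ac.uk
    BDII_HOST=bdii.example.ac.uk
    VOS="atlas lhcb dteam"
    QUEUES="short long"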

20
Release management
  • There is a constant tension between the desire to
    upgrade to get new features, and the desire to
    have a stable system
  • Need to be realistic about how long it takes to
    get new things into production
  • We have so far had a few 'big bang' releases per
    year, but these have some disadvantages
  • Anything which misses a release has to wait for a
    long time, hence there is pressure to include
    untested code
  • Releases can be held up by problems in any area,
    hence are usually late
  • They involve a lot of work for system managers,
    so it may be several months before all sites
    upgrade
  • We are now moving to incremental releases,
    updating each component as it completes
    integration and testing
  • Have to avoid dependencies between component
    upgrades
  • Releases go first to a 10%-scale pre-production
    Grid
  • Updates every couple of weeks
  • The system becomes more heterogeneous
  • Still some 'big bangs', e.g. a new OS
  • Seems OK so far - time will tell!

21
VO support
  • If sites are going to support a large number of
    VOs the configuration has to be done in a
    standard way
  • Largely true, but not perfect: adding a VO needs
    changes in several areas
  • Configuration parameters for VOs should be
    available on the operations portal, although many
    VOs still need to add their data
  • It needs to be possible to install VO-specific
    software, and maybe services, in a standard way
  • Software is OK: an NFS-shared area, writeable by
    specific VO members, with publication in the
    information system
  • Services are still under discussion: concerns
    about security and support
  • VOs often expect to have dedicated contacts at
    sites (and vice versa)
  • May be necessary in some cases but does not scale
  • Operations portal stores contacts, but site-to-VO
    contact may not reach the right people; we need
    contacts by area
  • Not too bad, but still needs some work to find a
    good modus vivendi

22
Availability
  • LCG requires high availability, but the intrinsic
    failure rate is high
  • Most of the middleware does not deal gracefully
    with failures
  • Some failure modes can lead to 'black holes'
    (e.g. a broken site that swallows every job)
  • Must fix/mask failures via operational tools so
    users don't see them
  • Several monitoring tools have been developed,
    including test jobs run regularly at sites
  • On-duty operators look for problems, and submit
    tickets to sites
  • Currently 50 tickets per week (cf. 200 sites)
  • FCR tool allows sites failing specified tests to
    be made 'invisible' (sketched after this list)
  • New sites must be certified before they become
    visible
  • Persistently failing sites can be decertified
  • Sites can be removed temporarily for scheduled
    downtime
  • Performance is monitored over time
  • The situation has improved a lot, but we still
    have some way to go
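
In effect, FCR is a per-VO filter over the site list. A toy
Python sketch, with invented site and test names:

    # Hide sites failing any test the VO marks as critical.
    sam_results = {
        "site-a": {"job-submit": True, "replica-mgmt": True},
        "site-b": {"job-submit": False, "replica-mgmt": True},
        "site-c": {"job-submit": True, "replica-mgmt": False},
    }

    def visible_sites(sam_results, critical_tests):
        """Sites passing every test the VO considers critical."""
        return [site for site, tests in sam_results.items()
                if all(tests.get(t, False) for t in critical_tests)]

    print(visible_sites(sam_results, ["job-submit"]))  # site-a, site-c
    print(visible_sites(sam_results, ["job-submit", "replica-mgmt"]))  # site-a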

23
Resource allocation
  • Need to be able to assign quotas and priorities
    to VOs and groups inside VOs, and measure what is
    delivered
  • VOMS provides group/role information in the proxy
  • Tools to control quotas and priorities in site
    services being developed
  • So far only at whole-VO level
  • Maui batch scheduler is very flexible; it is easy
    to map to groups/roles if VOs can define what
    they want (the fairshare idea is sketched after
    this list)
  • Can publish VO/group-specific values in GLUE
    schema, hence the RB can use them for scheduling
  • Accounting tool (APEL) measures CPU use
  • Storage accounting currently being added
  • Privacy issues around user-level accounting are
    being solved by encryption
  • Most of the pieces are in place, just need to fit
    them together
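
As a sketch of the scheduling side: fairshare, which Maui
implements, boosts groups below their agreed share of the farm
and penalises those above it. A toy Python version of the idea
(FS_WEIGHT echoes Maui's FSWEIGHT parameter name, but the
algorithm here is deliberately simplified and all numbers are
invented):

    # Priority grows with the gap between target and recent usage.
    targets = {"atlas": 0.50, "lhcb": 0.30, "biomed": 0.20}  # share of farm
    usage = {"atlas": 0.60, "lhcb": 0.15, "biomed": 0.10}  # recent actual use

    FS_WEIGHT = 1000  # scale factor, analogous to Maui's FSWEIGHT

    def fairshare_priority(group):
        """Positive when under target (boost), negative when over."""
        return FS_WEIGHT * (targets[group] - usage[group])

    for group in sorted(targets, key=fairshare_priority, reverse=True):
        print(f"{group:7s} priority {fairshare_priority(group):+5.0f}")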

24
CPU time used by ATLAS
25
User Support
  • Becoming vital as the number of users grows
  • But not much effort available in the various
    projects
  • Global Grid User Support (GGUS) portal at
    Karlsruhe provides a central ticket interface
  • Tickets are classified by an on-duty Ticket
    Process Manager, and assigned to an appropriate
    support unit
  • GGUS has a web-service interface to ticketing
    systems at each ROC
  • Other support units are just mailing lists
  • Mostly best-effort support, working hours only
  • Currently tens of tickets/week
  • Just about manageable, but may not scale much
    further
  • Some tickets slip through the net
  • Will need more manpower

26
Documentation & Training
  • Need documentation and training for both system
    managers and users
  • Mostly expert users up to now, but user community
    is expanding
  • Induction of new VOs is a particular problem: no
    peer support
  • EGEE is running User Fora for users to share
    experience
  • Next is in Manchester in May '07 (with OGF)
  • EGEE has a dedicated training activity run by
    NESC/Edinburgh
  • Documentation is often a low priority, little
    dedicated effort
  • The rapid pace of change means that material is
    often out of date
  • Effort on documentation is now increasing
  • GridPP has appointed a documentation officer
  • GridPP web site, wiki
  • Installation manual for admins is good
  • There is also a wiki for admins to share
    experience
  • Focus is now on user documentation
  • New EGEE web site coming soon

27
  • Conclusions

28
Lessons learnt
  • 'Good enough' is not good enough
  • Grids are good at magnifying problems, so must
    try to fix everything
  • Exceptions are the norm
  • 15,000 nodes × an MTBF of 5 years = 8 failures a
    day (15,000 / (5 × 365) ≈ 8)
  • Also 15,000 ways to be misconfigured!
  • Something somewhere will always be broken
  • But middleware developers tend to assume that
    everything will work
  • It needs a lot of manpower to keep a big system
    going
  • Bad error reporting can cost a lot of time
  • And reduce people's confidence
  • Very few people understand how the whole system
    works
  • Or even a large subset of it
  • Easy to do things which look reasonable but have
    a bad side-effect
  • Communication between sites and users is an n×m
    problem
  • Need to collapse it to n+m

29
Summary
  • LHC turns on in 1 year; we must focus on
    delivering a high QOS
  • Grid middleware is still immature, developing
    rapidly and in many cases a fair way from
    production quality
  • Experience is that new middleware developments
    take 2 years to reach the production system, so
    LHC will start with what we have now
  • The underlying failure rate is high; this will
    always be true with so many components, so
    middleware and operational procedures must allow
    for it
  • We need procedures which can manage the
    underlying problems, and present users with a
    system which appears to work smoothly at all
    times
  • Considerable progress has been made, but there is
    more to do
  • GridPP is running a major part of the EGEE/LCG
    Grid, which is now a very large system operated
    as a high-quality service, 24×7×365
  • We are living in interesting times!