GridPP: Running a Production Grid

1
GridPP Running a Production Grid
  • Stephen Burke
  • CLRC/RAL
  • On behalf of the GridPP Deployment & Operations
    Team
  • UK e-Science All Hands Meeting, Nottingham, 21st
    September 2006

2
Overview
  • EGEE, LCG and GridPP
  • Middleware
  • Deployment & Operations
  • Conclusions

3
  • EGEE, LCG and GridPP

4
EGEE
  • Major EU Grid project 2004-08 (in two phases)
  • Successor to the European DataGrid (EDG) project,
    2001-04
  • 32 countries, 91 partners, €37 million matching
    funding
  • Associated with several Grid projects outside
    Europe
  • Expected to be succeeded by a permanent European
    e-infrastructure
  • Supports many areas of e-science, but currently
    High Energy Physics is the major user
  • Biomedical research is also a pioneer
  • Currently 3000 users in 200 Virtual
    Organisations
  • Currently 195 sites, 28689 CPUs, 18.4 PB of
    storage
  • Values taken from the information system, so
    beware of GIGO (garbage in, garbage out)!

5
EGEE/LCG Google map
6
(W)LCG
  • The computing services for the LHC (Large Hadron
    Collider) at CERN in Geneva are provided by the
    LHC Computing Grid (LCG) project
  • LHC starts running in 1 year
  • Four experiments, all very large
  • 5000 users at 500 sites worldwide, 15 year
    lifetime
  • Expect 15 PB/year, plus similar volumes of
    simulated data
  • Processing requirement is 100,000 CPUs
  • Must transfer 100 Mbyte/sec/site sustained for
    15 years!
  • Running a series of Service Challenges to ramp up
    to full scale
  • LCG uses the EGEE infrastructure, but also the
    Open Science Grid (OSG) in the US and other Grid
    infrastructures
  • Hence WLCG: the Worldwide LCG

7
Organisation
  • EGEE sites are organised by region
  • GridPP is part of UK/Ireland
  • Also NGS and Grid-Ireland
  • Each region has a Regional Operation Centre (ROC)
    to look after the sites in the region
  • Overall operations co-ordination rotates weekly
    between ROCs
  • LCG divides sites into Tier 1/2/3
  • CERN as Tier 0
  • The tier is a function of size and QOS
  • Tier 1 needs 97% availability, max 24-hour
    response
  • Tier 2 needs 95% availability, 72-hour response
  • Tier 3 are local facilities, no specific targets
  • ROC + Tier 1: RAL is both

8
GridPP
  • Grid for UK Particle Physics
  • Two phases: 2001-04 and 2004-07
  • Proposal for phase 3, running to 2011
  • Part of EGEE and LCG
  • Working towards interoperability with NGS
  • 20 sites, 4354 CPUs, 298 TB of storage
  • Currently supports 33 VOs, including some non-PP
  • But not many non-PP VOs from the UK: any
    volunteers?
  • For LCG, sites are grouped into four virtual
    Tier 2s
  • Plus RAL as Tier 1
  • Grouping is largely administrative, the Grid
    sites remain separate
  • Runs UK-Ireland ROC (with NGS)
  • Grid Operations Centre (GOC) @ RAL (with NGS)
  • Grid-wide configuration, monitoring and
    accounting repository/portal
  • Operations and User Support shifts (working hours
    only)

9
GridPP sites
10
Virtual Organisations
  • Users are grouped into Virtual Organisations
  • Users/VO varies from 1 to 792!
  • Broadly four classes of VO
  • LHC experiments
  • EGEE supported
  • Worldwide (mainly non-LHC particle physics)
  • Local/regional
  • Sites can choose which VOs to support, subject to
    MOU/funding commitments
  • Most GridPP sites support 10-20 VOs
  • GridPP nominally allocates 1% of resources to
    EGEE VOs

11
  • Middleware

12
Site services
  • Basis is Globus (still GT2, GT4 soon) and Condor,
    as packaged in the Virtual Data Toolkit (VDT),
    which is also used by NGS
  • EGEE/LCG/EDG middleware distribution now under
    the gLite brand name
  • Computing Element (CE) = Globus gatekeeper +
    batch system + batch workers
  • In transition from Globus to Condor-C
  • Storage Element (SE) = Storage Resource Manager
    (SRM) + GridFTP + other data transports + storage
    system (disk-only or disk+tape)
  • Three SRM implementations in GridPP
  • Berkeley Database Information Index (BDII): an
    LDAP server publishing CE, SE and site service
    information according to the GLUE schema (see the
    query sketch after this list)
  • Relational Grid Monitoring Architecture (R-GMA):
    a server publishing GLUE schema, monitoring,
    accounting and user information
  • VOBOX: a container for VO-specific services (aka
    edge services)
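
For flavour, the information a BDII publishes can be read with
any LDAP client. Below is a minimal sketch (not from the talk)
using the Python ldap3 library; the hostname is hypothetical
and the attribute names follow the GLUE 1.x schema:

    # Query a BDII for Computing Element state via anonymous LDAP.
    from ldap3 import ALL, Connection, Server

    server = Server("bdii.example.ac.uk", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)  # BDIIs allow anonymous reads

    # GLUE 1.x information sits under the base DN "o=grid"
    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs",
                    "GlueCEStateWaitingJobs"],
    )
    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs,
              entry.GlueCEStateWaitingJobs)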

13
Core services
  • Workload Management System (WMS), aka Resource
    Broker: accepts jobs, dispatches them to sites
    and manages their lifecycle (matchmaking is
    sketched after this list)
  • Logging & Bookkeeping: primarily logs lifecycle
    events for jobs
  • MyProxy: stores long-lived credentials
  • LCG File Catalogue (LFC): maps logical file names
    to local names on SEs
  • File Transfer Service (FTS): provides managed,
    reliable file transfers
  • BDII: aggregates information from site BDIIs
  • R-GMA schema/registry: stores table definitions
    and lists of producers/consumers
  • VO Membership Service (VOMS) server: stores VO
    group/role assignments
  • User Interface (UI): provides user client tools
    for the Grid services
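
The heart of the WMS is matchmaking: comparing a job's
requirements against what each CE publishes in the GLUE schema
and ranking the matches. A toy Python sketch of the idea with
invented data (the real WMS evaluates JDL requirements through
Condor matchmaking):

    # Toy matchmaking: keep CEs that support the job's VO, then
    # rank them by free CPUs. All values are invented.
    ces = [
        {"id": "ce1.example.ac.uk", "vos": {"atlas", "lhcb"}, "free_cpus": 120},
        {"id": "ce2.example.ac.uk", "vos": {"biomed"}, "free_cpus": 40},
        {"id": "ce3.example.ac.uk", "vos": {"atlas"}, "free_cpus": 15},
    ]

    def match_and_rank(job_vo, ces):
        """Return the CEs supporting job_vo, most free CPUs first."""
        candidates = [ce for ce in ces if job_vo in ce["vos"]]
        return sorted(candidates, key=lambda ce: ce["free_cpus"], reverse=True)

    for ce in match_and_rank("atlas", ces):
        print(ce["id"], ce["free_cpus"])  # ce1 (120), then ce3 (15)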

14
Grid services
  • Some extra services are needed to allow the Grid
    to be operated effectively
  • Mostly unique instances, not part of the gLite
    distribution
  • Grid Operations Centre DataBase (GOCDB): stores
    information about each site, including contact
    details, status and a node list
  • Queried by other tools to generate configuration,
    monitoring etc.
  • Accounting (APEL): publishes information about
    CPU and storage use
  • Various monitoring tools, including
  • gstat (Grid status) - collects data from the
    information system, does sanity checks
  • Site Availability Monitoring (SAM) - runs regular
    test jobs at every site, raises alerts and
    measures availability over time (see the sketch
    after this list)
  • GridView - collects and displays information
    about file transfers
  • Real Time Monitor - displays job movements and
    records statistics
  • Freedom of Choice for Resources (FCR) allows the
    view of resources in a BDII to be filtered
    according to VO-specific criteria, e.g. SAM test
    failures
  • Operations portal - aggregates monitoring and
    operational information: broadcast email tool,
    news, VO information, etc.
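
As a flavour of how the SAM results are used, availability over
a period is essentially the fraction of test rounds a site
passes. A minimal Python sketch with invented results:

    # Availability as the pass fraction of regular SAM-style tests.
    results = [  # (hour, did all critical tests pass?)
        (0, True), (1, True), (2, False), (3, True),
        (4, True), (5, True), (6, False), (7, True),
    ]

    def availability(results):
        """Fraction of test rounds in which the site passed."""
        return sum(1 for _, ok in results if ok) / len(results)

    print(f"availability = {availability(results):.0%}")  # 75%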

15
SAM monitoring
16
GridView
17
Middleware issues
  • We need to operate a large production system with
    24×7×365 availability
  • Middleware development is usually done on small,
    controlled test systems, but the production
    system is much larger in many dimensions, more
    heterogeneous and not under any central control
  • Much of the middleware is still immature, with a
    significant number of bugs, and developing
    rapidly
  • Documentation is sometimes lacking or out of date
  • There are therefore a number of issues which must
    be managed by deployment and operational
    procedures, for example
  • The rapid rate of change and sometimes lack of
    backward compatibility requires careful
    management of code deployment
  • Porting to new hardware, operating systems etc
    can be time consuming
  • Components are often developed in isolation, so
    integration of new components can take time
  • Configuration can be very complex, and only a
    small subset of possible configurations produce a
    working system
  • Fault tolerance, error reporting and logging are
    in need of improvement
  • Remote management and diagnostic tools are
    generally undeveloped

18
  • Deployment & Operations

19
Configuration
  • We have tried many installation and configuration
    tools over the years
  • Configuration is complex, but system managers
    don't like complex tools!
  • Most configuration flexibility needs to be
    frozen
  • Admins don't understand all the options anyway
  • Many configuration changes will break something
  • The more an admin has to type, the more chances
    for a mistake
  • Current method preferred by most sites is YAIM
    (Yet Another Installation Method); an excerpt is
    sketched after this list
  • bash scripts
  • simple configuration of key parameters only
  • doesn't always have enough flexibility, but good
    enough for most cases
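
For flavour, YAIM drives everything from a single file of
key=value shell variables (site-info.def). An illustrative
excerpt; the values are hypothetical and the exact variable set
varies between releases:

    # site-info.def excerpt (hypothetical values)
    SITE_NAME=UKI-EXAMPLE
    CE_HOST=ce.example.ac.uk
    SE_HOST=se.example.ac.uk
    BDII_HOST=bdii.example.ac.uk
    VOS="atlas lhcb dteam"
    QUEUES="short long"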

20
Release management
  • There is a constant tension between the desire to
    upgrade to get new features, and the desire to
    have a stable system
  • Need to be realistic about how long it takes to
    get new things into production
  • We have so far had a few 'big bang' releases per
    year, but these have some disadvantages
  • Anything which misses a release has to wait for a
    long time, hence there is pressure to include
    untested code
  • Releases can be held up by problems in any area,
    hence are usually late
  • They involve a lot of work for system managers,
    so it may be several months before all sites
    upgrade
  • We are now moving to incremental releases,
    updating each component as it completes
    integration and testing
  • Have to avoid dependencies between component
    upgrades
  • Releases go first to a 10%-scale pre-production
    Grid
  • Updates every couple of weeks
  • The system becomes more heterogeneous
  • Still some 'big bangs', e.g. a new OS
  • Seems OK so far - time will tell!

21
VO support
  • If sites are going to support a large number of
    VOs the configuration has to be done in a
    standard way
  • Largely true, but not perfect: adding a VO needs
    changes in several areas
  • Configuration parameters for VOs should be
    available on the operations portal, although many
    VOs still need to add their data
  • It needs to be possible to install VO-specific
    software, and maybe services, in a standard way
  • Software is OK: an NFS-shared area, writeable by
    specific VO members, with publication in the
    information system
  • Services are still under discussion: concerns
    about security and support
  • VOs often expect to have dedicated contacts at
    sites (and vice versa)
  • May be necessary in some cases but does not scale
  • Operations portal stores contacts, but site-to-VO
    contact may not reach the right people; we need
    contacts by area
  • Not too bad, but still needs some work to find a
    good modus vivendi

22
Availability
  • LCG requires high availability, but the intrinsic
    failure rate is high
  • Most of the middleware does not deal gracefully
    with failures
  • Some failure modes can lead to 'black holes'
    (e.g. a broken site that swallows every job)
  • Must fix/mask failures via operational tools so
    users don't see them
  • Several monitoring tools have been developed,
    including test jobs run regularly at sites
  • On-duty operators look for problems, and submit
    tickets to sites
  • Currently 50 tickets per week (cf. 200 sites)
  • FCR tool allows sites failing specified tests to
    be made 'invisible' (sketched after this list)
  • New sites must be certified before they become
    visible
  • Persistently failing sites can be decertified
  • Sites can be removed temporarily for scheduled
    downtime
  • Performance is monitored over time
  • The situation has improved a lot, but we still
    have some way to go
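
In effect, FCR is a per-VO filter over the site list. A toy
Python sketch, with invented site and test names:

    # Hide sites failing any test the VO marks as critical.
    sam_results = {
        "site-a": {"job-submit": True, "replica-mgmt": True},
        "site-b": {"job-submit": False, "replica-mgmt": True},
        "site-c": {"job-submit": True, "replica-mgmt": False},
    }

    def visible_sites(sam_results, critical_tests):
        """Sites passing every test the VO considers critical."""
        return [site for site, tests in sam_results.items()
                if all(tests.get(t, False) for t in critical_tests)]

    print(visible_sites(sam_results, ["job-submit"]))  # site-a, site-c
    print(visible_sites(sam_results, ["job-submit", "replica-mgmt"]))  # site-a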

23
Resource allocation
  • Need to be able to assign quotas and priorities
    to VOs and groups inside VOs, and measure what is
    delivered
  • VOMS provides group/role information in the proxy
  • Tools to control quotas and priorities in site
    services being developed
  • So far only at whole-VO level
  • Maui batch scheduler is very flexible; it is easy
    to map to groups/roles if VOs can define what
    they want (the fairshare idea is sketched after
    this list)
  • Can publish VO/group-specific values in GLUE
    schema, hence the RB can use them for scheduling
  • Accounting tool (APEL) measures CPU use
  • Storage accounting currently being added
  • Privacy issues around user-level accounting are
    being solved by encryption
  • Most of the pieces are in place, just need to fit
    them together
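
As a sketch of the scheduling side: fairshare, which Maui
implements, boosts groups below their agreed share of the farm
and penalises those above it. A toy Python version of the idea
(FS_WEIGHT echoes Maui's FSWEIGHT parameter name, but the
algorithm here is deliberately simplified and all numbers are
invented):

    # Priority grows with the gap between target and recent usage.
    targets = {"atlas": 0.50, "lhcb": 0.30, "biomed": 0.20}  # share of farm
    usage = {"atlas": 0.60, "lhcb": 0.15, "biomed": 0.10}  # recent actual use

    FS_WEIGHT = 1000  # scale factor, analogous to Maui's FSWEIGHT

    def fairshare_priority(group):
        """Positive when under target (boost), negative when over."""
        return FS_WEIGHT * (targets[group] - usage[group])

    for group in sorted(targets, key=fairshare_priority, reverse=True):
        print(f"{group:7s} priority {fairshare_priority(group):+5.0f}")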

24
CPU time used by ATLAS
25
User Support
  • Becoming vital as the number of users grows
  • But not much effort available in the various
    projects
  • Global Grid User Support (GGUS) portal at
    Karlsruhe provides a central ticket interface
  • Tickets are classified by an on-duty Ticket
    Process Manager, and assigned to an appropriate
    support unit
  • GGUS has a web-service interface to ticketing
    systems at each ROC
  • Other support units are just mailing lists
  • Mostly best-effort support, working hours only
  • Currently tens of tickets/week
  • Just about manageable, but may not scale much
    further
  • Some tickets slip through the net
  • Will need more manpower

26
Documentation & Training
  • Need documentation and training for both system
    managers and users
  • Mostly expert users up to now, but user community
    is expanding
  • Induction of new VOs is a particular problem: no
    peer support
  • EGEE is running User Fora for users to share
    experience
  • Next is in Manchester in May '07 (with OGF)
  • EGEE has a dedicated training activity run by
    NESC/Edinburgh
  • Documentation is often a low priority, little
    dedicated effort
  • The rapid pace of change means that material is
    often out of date
  • Effort on documentation is now increasing
  • GridPP has appointed a documentation officer
  • GridPP web site, wiki
  • Installation manual for admins is good
  • There is also a wiki for admins to share
    experience
  • Focus is now on user documentation
  • New EGEE web site coming soon

27
  • Conclusions

28
Lessons learnt
  • 'Good enough' is not good enough
  • Grids are good at magnifying problems, so must
    try to fix everything
  • Exceptions are the norm
  • 15,000 nodes × an MTBF of 5 years = 8 failures a
    day (15,000 / (5 × 365) ≈ 8)
  • Also 15,000 ways to be misconfigured!
  • Something somewhere will always be broken
  • But middleware developers tend to assume that
    everything will work
  • It needs a lot of manpower to keep a big system
    going
  • Bad error reporting can cost a lot of time
  • And reduce people's confidence
  • Very few people understand how the whole system
    works
  • Or even a large subset of it
  • Easy to do things which look reasonable but have
    a bad side-effect
  • Communication between sites and users is an n×m
    problem
  • Need to collapse it to n+m

29
Summary
  • LHC turns on in 1 year; we must focus on
    delivering a high QOS
  • Grid middleware is still immature, developing
    rapidly and in many cases a fair way from
    production quality
  • Experience is that new middleware developments
    take 2 years to reach the production system, so
    LHC will start with what we have now
  • The underlying failure rate is high; this will
    always be true with so many components, so
    middleware and operational procedures must allow
    for it
  • We need procedures which can manage the
    underlying problems, and present users with a
    system which appears to work smoothly at all
    times
  • Considerable progress has been made, but there is
    more to do
  • GridPP is running a major part of the EGEE/LCG
    Grid, which is now a very large system operated
    as a high-quality service, 24×7×365
  • We are living in interesting times!