1
EGEE-II and Operations in Northern Europe
  • Per Öster, EGEE-II Northern Europe Regional
    Operating Centre Manager
  • KTH
  • per_at_pdc.kth.se

Presentation pack contributors: Bob Jones, Maite
Barosso Lopez, Per Öster
2
EGEE: What do we deliver?
  • Infrastructure operation
  • Currently includes >200 sites across 39 countries
  • Continuous monitoring of grid services in a
    distributed global infrastructure
  • Automated site configuration/management
  • Middleware
  • Production quality middleware distributed under
    business friendly open source licence
  • User Support - Managed process from first contact
    through to production usage
  • Training
  • Documentation
  • Expertise in grid-enabling applications
  • Online helpdesk
  • Networking events (User Forum, Conferences etc.)
  • Future
  • Expand on interoperability with related
    infrastructures

3
From EGEE to EGEE-II
  • EGEE-II started 1 April 2006
  • 2 years duration
  • EGEE final review 23-24 May at CERN
  • EGEE-II a natural continuation of EGEE
  • 91 Partners, Budget 56 M€, EU contribution 36 M€
  • Emphasis on providing an infrastructure
  • increased support for applications
  • interoperate with other infrastructures
  • more involvement from Industry
  • Expanded consortium
  • 91 partners
  • 11 Joint Research Units
  • EGEE/EGEE-II transition meeting
  • CERN 12-13 April

4
Related EU projects
5
EGEE-II Expertise Resources
  • 32 countries
  • 12 federations
  • Major and national Grid projects in Europe,
    USA, Asia
  • 27 countries through related projects
  • BalticGrid
  • EELA
  • EUChinaGrid
  • EUIndiaGrid
  • EUMedGrid
  • SEE-GRID

6
EGEE-II Activities
  • Service Activities
  • SA1 Grid Operations, Support and Management
    (CERN)
  • SA2 Networking Support (CNRS)
  • SA3 Integration, Testing and Certification
    (CERN)
  • Joint Research Activities
  • JRA1 Middleware Re-engineering (INFN)
  • JRA2 Quality Assurance (CS-SI)
  • Networking Activities
  • NA1 Management (CERN)
  • NA2 Dissemination, Outreach and Communication
    (CERN)
  • NA3 Training and Induction (UEdin)
  • NA4 Application Identification and Support
    (CNRS)
  • NA5 Policy and International Cooperation
    (GRNET)

7
Main changes from EGEE to EGEE-II
[Diagram: Integration, Testing and Certification shown against JRA1 and SA1 for EGEE and EGEE II]
8
EGEE-II Cross-Activities Groups
  • Security
  • MWSG
  • Consistent usage of security framework in m/w
  • Coordination with related projects
  • JSPG
  • Operational security issues
  • Coordination with related projects
  • VO Support
  • OAG
  • Resource allocation to VOs
  • UIG
  • Organization of EGEE documentation
  • VO Managers Group NEW!
  • Link between the project and VOs
  • Industry Relations
  • Industry Forum
  • Industry Task Force NEW!
  • Technical Coordination TCG NEW!
  • See later

9
EGEE-II Technical Coordination
  • The EGEE-II proposal defines a Technical
    Coordination Group (TCG)
  • The TCG brings together the technical activities
    within the project in order to ensure the
    oversight and coordination of the technical
    direction of the project, and to ensure that the
    technical work progresses according to plan.
  • Basically coordinating the work of technical
    activities
  • Membership from all these activities but still
    remain a small team
  • Additional experts will join based on the topic
    of discussion
  • Working groups will be spawned off to solve
    specific problems
  • Focus on practical short term solutions
  • Long term projects will be outsourced to
    middleware providers
  • The group defines the technical direction of EGEE
  • Not just a discussion forum!
  • Decisions taken by the group must be honoured by
    the affected activities

10
EGEE-II Technical Coordination
  • TCG gives input to Middleware shopping based on
    input from stakeholders
  • SA3 Middleware component shopping
  • TCG verifies short term work plans
  • SA3 does integration, debugging, packaging,
    certification and removal of obsolete components

11
Middleware structure
  • Higher-Level Grid Services may or may not be used
    by the applications
  • should help them but not be mandatory
  • Foundation Grid Middleware is deployed on the
    infrastructure
  • should not assume the use of Higher-Level Grid
    Services
  • must be complete and robust
  • should allow interoperation with other major grid
    infrastructures

12
gLite 3.0
  • Merge existing LCG and gLite to a single
    middleware distribution called gLite. The first
    version will be gLite 3.0
  • Process controlled by the Technical Coordination
    Group
  • gLite 1.5 and LCG 2.7 have been the last
    independent releases
  • Components in gLite 3.0
  • Certified
  • All components already in LCG 2.7 plus upgrades
  • this already includes new versions of VOMS, R-GMA
    and FTS
  • The Workload Management System (with LB, CE, UI)
    of gLite 1.5
  • Tested to some degree and with limited deployment
    support
  • The DGAS accounting system
  • Data management tools as needed by the Biomed
    community
  • Hydra, AMGA, secure access to data
  • Will be deployed in production by June '06
  • Deployment started this week (8 May) to first
    group of WLCG Tier-1s

13
Industry and EGEE-II
  • Industry Task Force
  • Group of industry partners in the project
  • Links related industry projects (NESSI, BEinGRID,
    ...)
  • Works with EGEE's Technical Coordination Group
    (TCG) to place industry requirements on equal
    footing
  • Collaboration with CERN OpenLab project
  • IT industry partnerships for hardware and
    software development
  • EGEE Business Associates (EBA)
  • Companies sponsoring work on joint-interest
    subjects
  • Technical developments
  • Market Surveys
  • Business modelling
  • Exploitation strategies
  • Transfer of know-how and services to industry
  • Industry Forum (representatives in most European
    countries)

14
Important EGEE-II events
  • 1 April 2006: EGEE-II start
  • 27 April 2006: Industry day (Paris)
  • 23-24 May 2006: EGEE final review (CERN)
  • 25-29 Sept 2006: 1st Project Conference (Geneva)
  • Q1 2007: 1st User Forum event
  • May 2007: EGEE-II periodic review
  • Autumn 2007: 2nd Project Conference
  • Q1 2008: 2nd User Forum event
  • 31st March 2008: EGEE-II completion
  • May 2008: EGEE-II final review
15
EGEE06 Conference
  • EGEE06: Capitalising on e-infrastructures
  • Demos
  • Related Projects
  • Industry
  • International community
  • 25-29 September 2006
  • Geneva, Switzerland
  • http://www.cern.ch/egee-intranet/conferences/EGEE06

16
Sustainability Beyond EGEE-II
  • Need to prepare for permanent Grid infrastructure
  • Maintain Europe's leading position in global
    science Grids
  • Ensure a reliable and adaptive support for all
    sciences
  • Independent of short project funding cycles
  • Modelled on success of GÉANT
  • Infrastructure managed in collaboration with
    national grid initiatives

Sustainable e-Infrastructure
17
e-Infrastructure for Europe
  • The Vision (1)
  • An environment where research resources (H/W,
    S/W and content) can be readily shared and
    accessed wherever this is necessary to promote
    better and more effective research
  • (1) A European vision for a Universal
    e-Infrastructure for Research by Malcolm Read
    http://www.e-irg.org/meetings/2005-UK/A_European_vision_for_a_Universal_e-Infrastructure_for_Research.pdf

18
e-Infrastructure for Europe - Mission
  • Infrastructure
  • Co-ordination of production e-Infrastructure open
    to all user communities and service providers
  • Interoperate with e-Infrastructure projects
    around the globe
  • Contribute to Grid standardisation and policy
    efforts
  • Support applications from diverse communities
  • Astrophysics
  • Computational Chemistry
  • Earth Sciences
  • Finance
  • Fusion
  • Geophysics
  • High Energy Physics
  • Life Sciences
  • Material Sciences
  • Multimedia
  • Business

Encourage inter-disciplinary research and
increase data inter-operability
19
e-Infrastructure - Key Services
  • Based on experience gathered during EGEE, key
    services have been found necessary for a central
    organisation in coordination with the National
    Grid Initiatives
  • Coordination of infrastructure operations
  • Middleware testing and certification
  • Application support
  • Dissemination and outreach
  • Training
  • Now working with European Commission and member
    states, national grid representatives and user
    communities to develop the details of such a
    structure and how it can be put in place

20
Grid Operations
21
EGEE Grid Sites Q1 2006
EGEE: > 200 sites, 40 countries, > 24,000
processors, 5 PB storage
22
Where are we now?
  • EGEE has achieved a lot in its first 2 years
  • 200 sites, 25 kCPU
  • sustained regular workloads of 20K jobs/day
  • massive data transfers > 1.5 GB/s
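
As a rough worked figure for the rates quoted on this slide, the sketch below converts them to other units. It assumes decimal units (1 TB = 1000 GB) and a rate held constant for a full day, which the slide itself does not claim; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_gb_per_s: float) -> float:
    """Data moved in one day at a constant rate, in TB (assuming 1 TB = 1000 GB)."""
    return rate_gb_per_s * SECONDS_PER_DAY / 1000


def jobs_per_second(jobs_per_day: float) -> float:
    """Average job submission rate implied by a daily job count."""
    return jobs_per_day / SECONDS_PER_DAY
```

Under these assumptions, 1.5 GB/s sustained for 24 hours corresponds to about 130 TB per day, and 20K jobs/day averages roughly one job every four to five seconds.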

23
EGEE Operations Structure
  • Operations Coordination Centre (OCC)
  • Regional Operations Centres (ROC)
  • Front-line support for user and operations issues
  • Provide local knowledge and adaptations
  • One in each region, many distributed (inc. A-P)
  • Manage daily grid operations: oversight,
    troubleshooting
  • Operator on Duty
  • Run infrastructure services
  • User Support Centre (GGUS)
  • At FZK: provides a single point of contact
    (service desk) portal.

24
Operations support workflows
[Workflow diagram; labels: OSCT, Grid Operator on-duty, 1st Level support, Regional Operations Centre, 2nd Level support, ROC and Site work to resolve the problem]
25
EGEE Operations Process
  • Grid operator on duty
  • 6 teams working in weekly rotation
  • CERN, IN2P3, INFN, UK/I, Ru, Taipei
  • Crucial in improving site stability and
    management
  • Expanding to all ROCs in EGEE-II
  • Operations coordination
  • Weekly operations meetings
  • Regular ROC managers meetings
  • Series of EGEE Operations Workshops
  • Nov 04, May 05, Sep 05, (June 06)
  • Geographically distributed responsibility for
    operations
  • There is no central operation
  • Tools are developed/hosted at different sites
  • GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC
    Portal (Lyon)
  • Procedures described in Operations Manual
  • Introducing new sites
  • Site downtime scheduling
  • Suspending a site
  • Escalation procedures
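
The weekly operator-on-duty rotation described on this slide can be sketched as a simple round-robin. The team names come from the slide; their order in the rotation, the start date and the function name are assumptions made for illustration.

```python
from datetime import date

# Team names are taken from the slide; their order in the rotation is assumed.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]


def team_on_duty(day: date, rotation_start: date, teams=TEAMS) -> str:
    """Return the team on duty for `day`, rotating weekly from `rotation_start`."""
    if day < rotation_start:
        raise ValueError("day precedes the start of the rotation")
    weeks_elapsed = (day - rotation_start).days // 7
    return teams[weeks_elapsed % len(teams)]
```

Expanding the rotation to more ROCs, as planned for EGEE-II, would only mean passing a longer `teams` list.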

26
Operations tools: Dashboard
  • Dashboard provides top level view of problems
  • Integrated view of monitoring tools (SFT, GStat)
    shows only failures and assigned tickets
  • Single tool for ticket creation and notification
    emails with detailed problem categorisation and
    templates
  • Detailed site view with table of open tickets and
    links to monitoring results
  • Ticket browser highlighting expired tickets

Developed and operated by CC-IN2P3
http://cic.in2p3.fr/
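
A minimal sketch of how a ticket browser might pick out expired tickets. The `Ticket` fields and the idea of a per-ticket deadline are hypothetical; the slides do not describe the dashboard's actual data model.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Ticket:
    # All fields are illustrative assumptions, not the portal's schema.
    ticket_id: int
    site: str
    deadline: datetime
    closed: bool = False


def expired_tickets(tickets, now):
    """Open tickets whose deadline has passed, oldest deadline first."""
    overdue = [t for t in tickets if not t.closed and t.deadline < now]
    return sorted(overdue, key=lambda t: t.deadline)
```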
27
Site Functional Tests
  • Site Functional Tests (SFT)
  • Framework to test (sample) services at all sites
  • Shows results matrix
  • Detailed test log available for troubleshooting
    and debugging
  • History of individual tests is kept
  • Can include VO-specific tests (e.g. sw
    environment)
  • Normally >80% of sites pass SFTs
  • NB: of 180 sites, some are not well managed
  • Very important in stabilising sites
  • Apps use only good sites
  • Bad sites are automatically excluded
  • Sites work hard to fix problems
  • Extending to service availability
  • measure availability by service, site, VO
  • each service has associated service class
    defining required availability (Critical, highly
    available, etc.)
  • First approach to SLA
  • Use to generate alarms
  • generate trouble tickets
  • call out support staff
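
The pass-rate, site-exclusion and availability ideas on this slide can be sketched as below. The availability targets per service class are hypothetical (the slide names the classes but gives no numbers), and the input shapes are assumptions for illustration rather than the SFT framework's real interface.

```python
# Hypothetical availability targets per service class; the slides name the
# classes but give no numbers.
REQUIRED_AVAILABILITY = {"critical": 0.99, "highly available": 0.95}


def usable_sites(results):
    """Sites that ran at least one test and passed all of them.

    `results` maps site name -> list of booleans (one per test).
    Failing sites are excluded from the returned list.
    """
    return sorted(site for site, tests in results.items() if tests and all(tests))


def pass_rate(results):
    """Fraction of sites that passed every test."""
    if not results:
        return 0.0
    return len(usable_sites(results)) / len(results)


def availability(samples):
    """Fraction of monitoring samples (booleans) in which the service was up."""
    return sum(samples) / len(samples) if samples else 0.0


def needs_alarm(samples, service_class):
    """True when measured availability falls below the class target."""
    return availability(samples) < REQUIRED_AVAILABILITY[service_class]
```

An alarm raised this way could then feed ticket creation or staff call-out, as the last bullets suggest.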

28
Checklist for a new service
  • User support procedures (GGUS)
  • Troubleshooting guides and FAQs
  • User guides
  • Operations Team Training
  • Site admins
  • CIC personnel
  • GGUS personnel
  • Monitoring
  • Service status reporting
  • Performance data
  • Accounting
  • Usage data
  • Service Parameters
  • Scope - Global/Local/Regional
  • SLAs
  • Impact of service outage
  • Security implications
  • Contact Info
  • Developers
  • First level support procedures
  • How to start/stop/restart service
  • How to check it's up
  • Which logs are useful to send to CIC/Developers
    and where they are
  • SFT Tests
  • Client validation
  • Server validation
  • Procedure to analyse these
  • error messages and likely causes
  • Tools for ROC to spot problems
  • GIIS monitor validation rules (e.g. only one
    global component)
  • Definition of normal behaviour
  • Metrics
  • ROC Dashboard
  • Alarms
  • Deployment Info
  • RPM list
  • Configuration details
  • This is what it takes to make a reliable
    production service from a middleware component
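
One way to make such a checklist actionable is to track it as data and report what is still missing. The sketch below is an illustration only: the choice of top-level headings is an assumption, because the transcript flattens the slide's bullet hierarchy, and the function names are hypothetical.

```python
# Top-level checklist headings as read from the slide; the grouping is an
# assumption because the transcript flattens the bullet hierarchy.
CHECKLIST = (
    "user support procedures",
    "operations team training",
    "monitoring",
    "accounting",
    "service parameters",
    "first level support procedures",
    "sft tests",
    "tools for roc to spot problems",
    "deployment info",
)


def missing_items(completed, checklist=CHECKLIST):
    """Checklist entries not yet marked complete, in checklist order."""
    done = {item.strip().lower() for item in completed}
    return [item for item in checklist if item not in done]


def ready_for_production(completed, checklist=CHECKLIST):
    """True only when every checklist entry has been completed."""
    return not missing_items(completed, checklist)
```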

29
Release preparation and deployment
30
Process to deployment
[Process diagram; labels: Middleware providers, JRA1, VDT/OSG, OMII-Europe, SA3, Integration, Testing and Certification, Certification activities (SA3, SA1), Pre-production service, Production service, SA1, Support, analysis, debugging]
31
Certification test bed
  • Certification test bed
  • simulates deployment environments
  • large (80 machines)
  • runs functional and stress tests (regression
    testing)
  • partly distributed
  • Pre-production service
  • run as a service preview of next production
    versions
  • fully distributed (10-20 sites)
  • application integration and testing
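
The path from certification through pre-production to production can be sketched as a small promotion rule. The linear stage order follows the slides; the rule that a release stays where it is when its tests fail, and the function name, are assumptions for illustration.

```python
# Stage order as described in the slides.
STAGES = ["certification", "pre-production", "production"]


def next_stage(current: str, tests_passed: bool) -> str:
    """Advance a release one stage if its tests passed, else keep it in place."""
    index = STAGES.index(current)  # raises ValueError for an unknown stage
    if not tests_passed or index == len(STAGES) - 1:
        return current
    return STAGES[index + 1]
```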

32
User Support
33
User Support Goals
  • A single access point for support
  • A portal with well-structured information and
    up-to-date documentation
  • Knowledgeable experts
  • Correct, complete and responsive support
  • Tools to help resolve problems
  • search engines
  • monitoring applications
  • resources status
  • Examples, templates, specific distributions for
    software of interest
  • Interface with other Grid support systems
  • Connection with developers, deployment, operation
    teams
  • Assistance during production use of the grid
    infrastructure

34
The Support Model
"Regional Support with Central Coordination"
The ROCs, VOs and other project-wide groups such
as the middleware groups (JRA), network groups
(NA), service groups (SA) are connected via a
central integration platform provided by GGUS.
[Diagram; labels: Regional Support units (ROC 1, ROC 10, ROC), Central Application (GGUS), TPM, User Support units, Technical Support units, Operations Support, Deployment Support, Middleware Support, VO Support, Network Support]
35
The GGUS System
36
GGUS Portal: user services
  • Browseable tickets
  • Search through solved tickets
  • Useful links (Wiki FAQ)
  • Broadcast tools
  • Latest News
  • GGUS Search Engine
  • Updated documentation (Wiki FAQ)
37
Policy and Security
38
CAs and Authentication
  • Authentication
  • Use of GSI, X.509 certificates
  • Generally issued by national certification
    authorities
  • Agreed network of trust
  • International Grid Trust Federation (IGTF)
  • EUGridPMA
  • APGridPMA
  • TAGPMA
  • All EGEE sites will usually trust all IGTF root
    CAs
  • Security Groups (Operations)
  • Joint Security Policy Group
  • EUGridPMA
  • Operational Security Coordination Team
  • Vulnerability Group

39
Security Policy
  • Policy Revisions
  • Grid Acceptable Use Policy (AUP)
  • https://edms.cern.ch/document/428036/
  • common, general and simple AUP
  • for all VO members using many Grid
    infrastructures
  • EGEE, OSG, SEE-GRID, DEISA, national Grids
  • VO Security
  • https://edms.cern.ch/document/573348/
  • responsibilities for VO managers and members
  • VO AUP to tie members to Grid AUP accepted at
    registration
  • Incident Handling and Response
  • https://edms.cern.ch/document/428035/
  • defines basic communications paths
  • defines requirements (MUSTs) for IR
  • not to replace or interfere with local response
    plans
  • Joint Security Policy Group
  • EGEE with strong input from OSG
  • Policy Set

40
Northern Europe Regional Operating Centre
41
NE Sites
42
Partners and People
  • CSC (0.5 FTE)
  • Dan Still
  • Timo Kervinen
  • FOM/NIKHEF (4 FTE)
  • David Groep
  • Jeff Templon
  • Ronald Starink
  • Wim Heubers
  • NN
  • RU-RUG (0.5 FTE)
  • Arnold Meijster
  • Hans Gankema
  • SARA (4 FTE)
  • Jules Wolfrat
  • Fokke Dijkstra
  • Ron Trompert
  • Alexander Verkooijen
  • Ramon Bastiaans
  • Martin Pels
  • Mark van de Sanden
  • Jurriaan Saathof
  • UKBH (1 FTE)
  • Michael Grønager
  • Anders Wäänänen
  • NN
  • VR/SNIC (5.5 FTE)
  • Per Öster
  • Anders Selander
  • Lars Malinowsky
  • Åke Sandgren
  • Mattias Wadenstien
  • Johan Gunnarsson
  • Leif Nixon

43
SA1 Tasks for NE ROC
  • TSA1.1 Operate a production and pre-production
    service
  • TSA1.1.1 ROC management
  • TSA1.1.2 Pre-production service site
  • TSA1.2 Middleware deployment and support
  • TSA1.2.1 Coordination and support for middleware
    deployment
  • TSA1.2.2 Regional certification of middleware
    releases
  • TSA1.3. Grid operations and support
  • TSA1.3.1 1st line support for operational
    problems in region
  • TSA1.3.2 Oversight and management of operational
    problems
  • TSA1.3.3 Run essential regional grid services
  • TSA1.3.5 Grid services for infrastructure or VOs
  • TSA1.4 Grid security and incident response
  • TSA1.4.1 Grid incident response coordination in
    region
  • TSA1.4.2 Security vulnerability and risk
    analysis
  • TSA1.4.3 CA management
  • TSA1.4.5 Coordinate EUGridPMA
  • TSA1.5. Virtual organisations, applications and
    user support
  • TSA1.5.2 Call centre, helpdesk for ROC
  • TSA1.5.3 VO support, integration support
  • TSA1.5.4 User training in region
  • TSA1.5.5 Site admin training in region
  • TSA1.6 Grid Management
  • TSA1.6.2 Accounting coordination in region
  • TSA1.7 Interoperation
  • TSA1.7.1 National and regional grid project
    coordination
  • TSA1.7.2 International grid projects
  • TSA1.8 Application <-> resource provider
    coordination
  • TSA1.8.1 ROC management of resources/SLAs
  • TSA1.9 Application/resource provider/mw provider
    coord
  • TSA1.9.1 ROC representation in coordination

44
TSA1.A General tasks
  • TSA1.A.1 Deliverables formal review
  • TSA1.A.2 Activity Coordination (Internal
    meeting, Activity workshop, cross activities
    meeting, TCG, ...)
  • TSA1.A.3 EGEE conferences (preparation and
    attendance)
  • TSA1.A.4 EU reviews (preparation and attendance)
  • TSA1.A.5 Participation in standardisation bodies
    (GGF, ...)
  • TSA1.A.6 EGEE publications (journal papers, ...)
  • TSA1.A.7 Dissemination (other related
    conferences, press, ...)
  • TSA1.A.8 EGEE training (as a trainee)
  • TSA1.A.9 EGEE training (as a trainer including
    preparation)
  • TSA1.A.10 Partner related tasks (administration,
    timesheet, ...)

45
TSA1.2 Middleware deployment and support
  • Deployment of the SA3-produced middleware
    distribution to all the sites. ROCs responsible
    in each region for coordinating and ensuring the
    agreed schedule is maintained.
  • Core services require coherent installation
    across the Grid, interface with local fabric
    (e.g. CE, SE, local Grid catalogues, etc). Core
    services have the longest update cycles (1 or 2
    per year)
  • Other services (central catalogues, information
    system components, monitoring tools, resource
    brokers) have shorter update cycles and may not
    need to be present at all sites
  • Client tools (on WN) installable in user-space,
    can be updated on the fly by a central team.

46
TSA1.3. Grid operations and support
  • Manage the Grid operation, including support for
    the sites for all operations aspects and issues.

47
TSA1.4 Grid security and incident response
  • Security Coordination Group
  • Responsibility for maintaining the Security and
    Availability Policy and policies related to
    acceptable use by users and VOs,
  • Ensuring the continued existence of a federated
    identity trust domain, and encouraging the
    integration of national or community based
    authorisation schemes,
  • Analysis of security risks and vulnerabilities in
    the procedures and software,
  • Responding to security incidents.
  • Joint Security Policy Group
  • This group is coordinated by one ROC, UK/I
    (CCLRC), with a significant contribution by the
    OCC and contributions by other ROCs.
  • The Joint Security Policy Group is a body that
    provides policy and site requirements to the
    deployment and middleware engineering activities.

48
TSA1.5. Virtual organisation, application and
user support
  • A central helpdesk keeps track of all service
    requests and assigns them to the appropriate
    support groups.
  • In this way, formal communication between all
    support groups is possible
  • To enable this, each group has to build only one
    interface between its internal support structure
    and the central
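
The idea of one central point that records every request and hands it to a single support group can be sketched as follows. The unit names echo the support model slide; the category keys, the class name and the choice of TPM as a fallback for unclassified requests are assumptions, not a description of the GGUS implementation.

```python
# Category keys are hypothetical; unit names echo the support model slide.
SUPPORT_UNITS = {
    "middleware": "Middleware Support",
    "network": "Network Support",
    "deployment": "Deployment Support",
    "operations": "Operations Support",
    "vo": "VO Support",
}


class Helpdesk:
    """Toy central helpdesk: records every request and assigns it to one unit."""

    def __init__(self, units=SUPPORT_UNITS, fallback="TPM"):
        self.units = dict(units)
        self.fallback = fallback  # assumed: unclassified requests go to triage
        self.log = []             # (ticket_id, assigned_unit) in arrival order

    def submit(self, category: str) -> str:
        """Record a request and return the support unit it was assigned to."""
        ticket_id = len(self.log) + 1
        unit = self.units.get(category, self.fallback)
        self.log.append((ticket_id, unit))
        return unit
```

Because every group talks only to the central log, adding a new support unit means adding one entry to the mapping rather than a link to every other group.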

49
TSA1.7. Interoperation
  • SA1 will ensure interoperability/collaboration/
    coordination with other Grid projects. There are
    two distinct aspects:
  • National and regional Grid projects in EGEE-II
    regions (UK, Italy, Nordic, SEE-Grid, BalticGrid,
    EUChinaGrid, EUMedGrid, EELA).
  • Other projects that together with EGEE-II support
    international VOs (e.g. Open Science Grid in the
    US).
  • The EGEE infrastructure will be expanded through
    extension and collaboration with associated
    projects (EUChinaGrid, EUMedGrid, BalticGrid,
    EELA)

50
TSA1.8. Application/resource provider coordination
  • This task ensures that sufficient resources are
    identified for the supported applications by
    negotiating with the resources providers. The
    ROCs take this as an ongoing responsibility to
    ensure that those commitments are fulfilled.
    This commitment will be included as part of an
    SLA with each site.
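
Checking that commitments are fulfilled amounts to comparing what each site promised with what it delivered. A minimal sketch, assuming both quantities are expressed in the same unit (for example CPU count); the function name and data shapes are hypothetical.

```python
def unmet_commitments(committed, delivered):
    """Return {site: shortfall} for sites delivering less than they committed.

    Both arguments map site name -> resource amount in the same unit.
    A site missing from `delivered` is treated as having delivered nothing.
    """
    shortfalls = {}
    for site, promised in committed.items():
        got = delivered.get(site, 0)
        if got < promised:
            shortfalls[site] = promised - got
    return shortfalls
```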

51
TSA1.9. Application/resource-provider/
middleware-provider coordination
  • SA1 will participate in the Technical
    Coordination Group (TCG)
  • Operational and service management
  • Security and vulnerability
  • Fabric and site management
  • Auditing, accounting, and accountability
  • Monitoring
  • Support issues.
  • The SA1 representation in the TCG itself will be
    a member of the OCC at CERN together with a
    representative of the ROC coordinators

52
Summary
  • Grids are all about sharing: it is a means of
    working with groups across Europe and beyond
  • Operations procedures and tools under constant
    evolution
  • We have gained significant experience in what it
    takes to deploy, operate and manage a large
    distributed infrastructure
  • Much is being learned but there remains much to
    be done to achieve long term sustainability
  • EGEE Infrastructure: the world's largest
    multi-science production grid service
  • EGEE-II is the opportunity to expand on this
    existing base both in terms of scale and usage
  • Need to prepare for the long term
  • EGEE, related projects, national grid initiatives
    and user communities are working to define a
    model for a sustainable grid infrastructure that
    is independent of short project cycles