Title: EGEE-II and Operations in Northern Europe
1EGEE-II and Operations in Northern Europe
- Per Öster, EGEE-II Northern Europe Regional
Operating Centre Manager - KTH
- per_at_pdc.kth.se
Presentation pack contributors: Bob Jones, Maite Barosso Lopez, Per Öster
2EGEE What do we deliver?
- Infrastructure operation
- Currently includes >200 sites across 39 countries
- Continuous monitoring of grid services in a distributed global infrastructure
- Automated site configuration/management
- Middleware
- Production quality middleware distributed under business friendly open source licence
- User Support
- Managed process from first contact through to production usage
- Training
- Documentation
- Expertise in grid-enabling applications
- Online helpdesk
- Networking events (User Forum, Conferences etc.)
- Future
- Expand on interoperability with related
infrastructures
3From EGEE to EGEE-II
- EGEE-II started 1 April 2006
- 2 years duration
- EGEE final review 23-24 May at CERN
- EGEE-II a natural continuation of EGEE
- 91 Partners, Budget 56 M€, EU contribution 36 M€
- Emphasis on providing an infrastructure
- increased support for applications
- interoperate with other infrastructures
- more involvement from Industry
- Expanded consortium
- 91 partners
- 11 Joint Research Units
- EGEE/EGEE-II transition meeting
- CERN 12-13 April
4Related EU projects
5EGEE-II Expertise Resources
- 32 countries
- 12 federations
- Major and national Grid projects in Europe, USA, Asia
- 27 countries through related projects
- BalticGrid
- EELA
- EUChinaGrid
- EUIndiaGrid
- EUMedGrid
- SEE-GRID
6EGEE-II Activities
- Service Activities
- SA1 Grid Operations, Support and Management (CERN)
- SA2 Networking Support (CNRS)
- SA3 Integration, Testing and Certification (CERN)
- Joint Research Activities
- JRA1 Middleware Re-engineering (INFN)
- JRA2 Quality Assurance (CS-SI)
- Networking Activities
- NA1 Management (CERN)
- NA2 Dissemination, Outreach and Communication (CERN)
- NA3 Training and Induction (UEdin)
- NA4 Application Identification and Support (CNRS)
- NA5 Policy and International Cooperation (GRNET)
7Main changes from EGEE to EGEE-II
[Diagram: placement of Integration, Testing and Certification across JRA1 and SA1, compared between EGEE and EGEE II]
8EGEE-II Cross-Activities Groups
- Security
- MWSG
- Consistent usage of security framework in m/w
- Coordination with related projects
- JSPG
- Operational security issues
- Coordination with related projects
- VO Support
- OAG
- Resource allocation to VOs
- UIG
- Organization of EGEE documentation
- VO Managers Group NEW!
- Link between the project and VOs
- Industry Relations
- Industry Forum
- Industry Task Force NEW!
- Technical Coordination TCG NEW!
- See later
9EGEE-II Technical Coordination
- The EGEE-II proposal defines a Technical Coordination Group (TCG)
- The TCG brings together the technical activities within the project in order to ensure the oversight and coordination of the technical direction of the project, and to ensure that the technical work progresses according to plan.
- Basically coordinating the work of technical activities
- Membership from all these activities while still remaining a small team
- Additional experts will join based on the topic of discussion
- Working groups will be spun off to solve specific problems
- Focus on practical short term solutions
- Long term projects will be outsourced to middleware providers
- The group defines the technical direction of EGEE
- Not just a discussion forum!
- Decisions taken by the group must be honoured by the affected activities
10EGEE-II Technical Coordination
- TCG gives input to middleware shopping based on input from stakeholders
- SA3: middleware component shopping
- TCG verifies short term work plans
- SA3 does integration, debugging, packaging, certification and removal of obsolete components
11Middleware structure
- Higher-Level Grid Services may or may not be used by the applications
- should help them but not be mandatory
- Foundation Grid Middleware is deployed on the infrastructure
- should not assume the use of Higher-Level Grid Services
- must be complete and robust
- should allow interoperation with other major grid infrastructures
12gLite 3.0
- Merge existing LCG and gLite into a single middleware distribution called gLite. The first version will be gLite 3.0
- Process controlled by the Technical Coordination Group
- gLite 1.5 and LCG 2.7 have been the last independent releases
- Components in gLite 3.0
- Certified
- All components already in LCG 2.7 plus upgrades
- this already includes new versions of VOMS, R-GMA and FTS
- The Workload Management System (with LB, CE, UI) of gLite 1.5
- Tested to some degree and with limited deployment support
- The DGAS accounting system
- Data management tools as needed by the Biomed community
- Hydra, AMGA, secure access to data
- Will be deployed in production by June 2006
- Deployment started this week (8 May) to first group of WLCG Tier-1s
13Industry and EGEE-II
- Industry Task Force
- Group of industry partners in the project
- Links related industry projects (NESSI, BEinGRID, ...)
- Works with EGEE's Technical Coordination Group (TCG) to place industry requirements on equal footing
- Collaboration with CERN OpenLab project
- IT industry partnerships for hardware and software development
- EGEE Business Associates (EBA)
- Companies sponsoring work on joint-interest subjects
- Technical developments
- Market Surveys
- Business modelling
- Exploitation strategies
- Transfer of know-how and services to industry
- Industry Forum (representatives in most European countries)
14Important EGEE-II events
- 1 April 2006: EGEE-II start
- 27 April 2006: Industry day (Paris)
- 23-24 May 2006: EGEE final review (CERN)
- 25-29 Sept 2006: 1st Project Conference (Geneva)
- Q1 2007: 1st User Forum event
- May 2007: EGEE-II periodic review
- Autumn 2007: 2nd Project Conference
- Q1 2008: 2nd User Forum event
- 31st March 2008: EGEE-II completion
- May 2008: EGEE-II final review
15EGEE06 Conference
- EGEE06 Capitalising on e-infrastructures
- Demos
- Related Projects
- Industry
- International community
- 25-29 September 2006
- Geneva, Switzerland
- http://www.cern.ch/egee-intranet/conferences/EGEE06
16Sustainability Beyond EGEE-II
- Need to prepare for permanent Grid infrastructure
- Maintain Europe's leading position in global science Grids
- Ensure a reliable and adaptive support for all sciences
- Independent of short project funding cycles
- Modelled on success of GÉANT
- Infrastructure managed in collaboration with national grid initiatives
Sustainable e-Infrastructure
17e-Infrastructure for Europe
- The Vision (1)
- An environment where research resources (H/W, S/W & content) can be readily shared and accessed wherever this is necessary to promote better and more effective research
- (1) A European vision for a Universal e-Infrastructure for Research by Malcolm Read
http://www.e-irg.org/meetings/2005-UK/A_European_vision_for_a_Universal_e-Infrastructure_for_Research.pdf
18e-Infrastructure for Europe - Mission
- Infrastructure
- Co-ordination of production e-Infrastructure open to all user communities and service providers
- Interoperate with e-Infrastructure projects around the globe
- Contribute to Grid standardisation and policy efforts
- Support applications from diverse communities
- Astrophysics
- Computational Chemistry
- Earth Sciences
- Finance
- Fusion
- Geophysics
- High Energy Physics
- Life Sciences
- Material Sciences
- Multimedia
-
- Business
Encourage inter-disciplinary research and increase data inter-operability
19e-Infrastructure - Key Services
- Based on experience gathered during EGEE, key services have been found necessary for a central organisation in coordination with the National Grid Initiatives
- Coordination of infrastructure operations
- Middleware testing and certification
- Application support
- Dissemination and outreach
- Training
- Now working with European Commission and member states, national grid representatives and user communities to develop the details of such a structure and how it can be put in place
20Grid Operations
21EGEE Grid Sites Q1 2006
EGEE: > 200 sites, 40 countries, > 24,000 processors, 5 PB storage
22Where are we now?
- EGEE has achieved a lot in first 2 years
- 200 sites, 25 kCPU
- sustained regular workloads of 20K jobs/day
- massive data transfers > 1.5 GB/s
23 EGEE Operations Structure
- Operations Coordination Centre (OCC)
- Regional Operations Centres (ROC)
- Front-line support for user and operations issues
- Provide local knowledge and adaptations
- One in each region; many are distributed (inc. A-P)
- Manage daily grid operations: oversight, troubleshooting
- Operator on Duty
- Run infrastructure services
- User Support Centre (GGUS)
- At FZK; provides single point of contact (service desk) and portal.
24Operations support workflows
OSCT
Grid Operator on-duty
1st Level support
Regional Operations Centre
2nd Level support
ROC and Site work to resolve the problem
25EGEE Operations Process
- Grid operator on duty
- 6 teams working in weekly rotation
- CERN, IN2P3, INFN, UK/I, Ru, Taipei
- Crucial in improving site stability and management
- Expanding to all ROCs in EGEE-II
- Operations coordination
- Weekly operations meetings
- Regular ROC managers meetings
- Series of EGEE Operations Workshops
- Nov 04, May 05, Sep 05, (June 06)
- Geographically distributed responsibility for operations
- There is no central operation
- Tools are developed/hosted at different sites
- GOC DB (RAL), SFT (CERN), GStat (Taipei), CIC Portal (Lyon)
- Procedures described in Operations Manual
- Introducing new sites
- Site downtime scheduling
- Suspending a site
- Escalation procedures
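The weekly operator-on-duty rotation can be illustrated with a minimal sketch. The team order follows the slide; the rotation start date and the function name `team_on_duty` are assumptions made for illustration, not the actual EGEE scheduling tool.

```python
from datetime import date

# Team order as listed on the slide; who starts when is an assumption.
TEAMS = ["CERN", "IN2P3", "INFN", "UK/I", "Ru", "Taipei"]
ROTATION_START = date(2006, 1, 2)  # hypothetical Monday on which TEAMS[0] begins


def team_on_duty(day, teams=TEAMS, start=ROTATION_START):
    """Return the team on duty for the week that contains `day`."""
    # Whole weeks elapsed since the rotation start (floor division also
    # handles dates before the start consistently).
    weeks_elapsed = (day - start).days // 7
    return teams[weeks_elapsed % len(teams)]


if __name__ == "__main__":
    for d in (date(2006, 1, 2), date(2006, 1, 9), date(2006, 2, 13)):
        print(d.isoformat(), team_on_duty(d))
```

With six teams, each team is on duty one week in six; adding ROCs to the list lengthens the cycle without changing the logic.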
26Operations tools Dashboard
- Dashboard provides top level view of problems
- Integrated view of monitoring tools (SFT, GStat) shows only failures and assigned tickets
- Single tool for ticket creation and notification emails with detailed problem categorisation and templates
- Detailed site view with table of open tickets and links to monitoring results
- Ticket browser highlighting expired tickets
Developed and operated by CC-IN2P3
http://cic.in2p3.fr/
27Site Functional Tests
- Site Functional Tests (SFT)
- Framework to test (sample) services at all sites
- Shows results matrix
- Detailed test log available for troubleshooting and debugging
- History of individual tests is kept
- Can include VO-specific tests (e.g. sw environment)
- Normally >80% of sites pass SFTs
- NB: of 180 sites, some are not well managed
- Very important in stabilising sites
- Apps use only good sites
- Bad sites are automatically excluded
- Sites work hard to fix problems
- Extending to service availability
- measure availability by service, site, VO
- each service has associated service class defining required availability (Critical, highly available, etc.)
- First approach to SLA
- Use to generate alarms
- generate trouble tickets
- call out support staff
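The idea of measuring availability per site and service and raising alarms against a service class can be sketched as follows. This is an illustrative sketch only: the class names, the threshold values and the functions `availability` and `alarms` are assumptions, not part of the SFT framework.

```python
from collections import defaultdict

# Hypothetical availability targets per service class (illustrative values).
REQUIRED_AVAILABILITY = {"critical": 0.99, "highly_available": 0.95, "standard": 0.90}


def availability(results):
    """Map (site, service) to the fraction of passed tests.

    `results` is an iterable of (site, service, passed) tuples.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, service, ok in results:
        total[(site, service)] += 1
        if ok:
            passed[(site, service)] += 1
    return {key: passed[key] / total[key] for key in total}


def alarms(results, service_class):
    """Return (site, service, availability) entries below their class target."""
    raised = []
    for (site, service), value in sorted(availability(results).items()):
        target = REQUIRED_AVAILABILITY[service_class.get(service, "standard")]
        if value < target:
            raised.append((site, service, value))
    return raised


sample = [
    ("site-a", "CE", True), ("site-a", "CE", True),
    ("site-a", "SE", True), ("site-a", "SE", False),
    ("site-b", "CE", False), ("site-b", "CE", True),
]
classes = {"CE": "critical", "SE": "standard"}

if __name__ == "__main__":
    for site, service, value in alarms(sample, classes):
        print(f"ALARM {site} {service}: availability {value:.0%}")
```

In a real deployment each alarm would feed ticket creation or a call-out; here it is only printed.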
28Checklist for a new service
- User support procedures (GGUS)
- Troubleshooting guides and FAQs
- User guides
- Operations Team Training
- Site admins
- CIC personnel
- GGUS personnel
- Monitoring
- Service status reporting
- Performance data
- Accounting
- Usage data
- Service Parameters
- Scope - Global/Local/Regional
- SLAs
- Impact of service outage
- Security implications
- Contact Info
- Developers
- First level support procedures
- How to start/stop/restart service
- How to check it's up
- Which logs are useful to send to CIC/Developers and where they are
- SFT Tests
- Client validation
- Server validation
- Procedure to analyse these
- error messages and likely causes
- Tools for ROC to spot problems
- GIIS monitor validation rules (e.g. only one global component)
- Definition of normal behaviour
- Metrics
- ROC Dashboard
- Alarms
- Deployment Info
- RPM list
- Configuration details
- This is what it takes to make a reliable production service from a middleware component
29Release preparation deployment
30Process to deployment
[Diagram labels: Middleware providers (JRA1, VDT/OSG, OMII-Europe); Integration; Testing and Certification (SA3); Certification activities (SA3 and SA1); Pre-production service; Production service (SA1); Support, analysis, debugging]
31Certification test bed
- Certification test bed
- simulates deployment environments
- large (80 machines)
- runs functional and stress tests (regression testing)
- partly distributed
- Pre-production service
- run as a service preview of next production versions
- fully distributed (10-20 sites)
- application integration and testing
32User Support
33User Support Goals
- A single access point for support
- A portal with well structured information and updated documentation
- Knowledgeable experts
- Correct, complete and responsive support
- Tools to help resolve problems
- search engines
- monitoring applications
- resources status
- Examples, templates, specific distributions for software of interest
- Interface with other Grid support systems
- Connection with developers, deployment, operation teams
- Assistance during production use of the grid infrastructure
34The Support Model
"Regional Support with Central Coordination"
Regional Support units
The ROCs, VOs and other project-wide groups such
as the middleware groups (JRA), network groups
(NA), service groups (SA) are connected via a
central integration platform provided by GGUS.
Operations Support
ROC 1
ROC 10
ROC
Deployment Support
Central Application (GGUS)
TPM
Middleware Support
VO Support
Network Support
User Support units
Technical Support units
35The GGUS System
36GGUS Portal user services
- Browseable tickets
- Search through solved tickets
- Useful links (Wiki FAQ)
- Broadcast tools
- Latest News
- GGUS Search Engine
- Updated documentation (Wiki FAQ)
37Policy Security
38CAs Authentication
- Authentication
- Use of GSI, X.509 certificates
- Generally issued by national certification authorities
- Agreed network of trust
- International Grid Trust Federation (IGTF)
- EUGridPMA
- APGridPMA
- TAGPMA
- All EGEE sites will usually trust all IGTF root
CAs
- Security Groups (Operations)
- Joint Security Policy Group
- EUGridPMA
- Operational Security Coordination Team
- Vulnerability Group
39Security Policy
- Policy Revisions
- Grid Acceptable Use Policy (AUP)
- https://edms.cern.ch/document/428036/
- common, general and simple AUP
- for all VO members using many Grid infrastructures
- EGEE, OSG, SEE-GRID, DEISA, national Grids
- VO Security
- https://edms.cern.ch/document/573348/
- responsibilities for VO managers and members
- VO AUP to tie members to Grid AUP accepted at registration
- Incident Handling and Response
- https://edms.cern.ch/document/428035/
- defines basic communications paths
- defines requirements (MUSTs) for IR
- not to replace or interfere with local response plans
- Joint Security Policy Group
- EGEE with strong input from OSG
- Policy Set
40Northern Europe Regional Operating Centre
41NE Sites
42Partners and People
- CSC (0.5 FTE)
- Dan Still
- Timo Kervinen
- FOM/NIKHEF (4 FTE)
- David Groep
- Jeff Templon
- Ronald Starink
- Wim Heubers
- NN
- RU-RUG (0.5 FTE)
- Arnold Meijster
- Hans Gankema
- SARA (4 FTE)
- Jules Wolfrat
- Fokke Dijkstra
- Ron Trompert
- Alexander Verkooijen
- Ramon Bastiaans
- Martin Pels
- Mark van de Sanden
- Jurriaan Saathof
- UKBH (1 FTE)
- Michael Grønager
- Anders Wäänänen
- NN
- VR/SNIC (5.5 FTE)
- Per Öster
- Anders Selander
- Lars Malinowsky
- Åke Sandgren
- Mattias Wadenstien
- Johan Gunnarsson
- Leif Nixon
43SA1 Tasks for NE ROC
- TSA1.1 Operate a production and pre-production service
- TSA1.1.1 ROC management
- TSA1.1.2 Pre-production service site
- TSA1.2 Middleware deployment and support
- TSA1.2.1 Coordination and support for middleware deployment
- TSA1.2.2 Regional certification of middleware releases
- TSA1.3. Grid operations and support
- TSA1.3.1 1st line support for operational problems in region
- TSA1.3.2 Oversight and management of operational problems
- TSA1.3.3 Run essential regional grid services
- TSA1.3.5 Grid services for infrastructure or VOs
- TSA1.4 Grid security and incident response
- TSA1.4.1 Grid incident response coordination in region
- TSA1.4.2 Security vulnerability and risk analysis
- TSA1.4.3 CA management
- TSA1.4.5 Coordinate EUGridPMA
- TSA1.5. Virtual organisations, applications and user support
- TSA1.5.2 Call centre, helpdesk for ROC
- TSA1.5.3 VO support, integration support
- TSA1.5.4 User training in region
- TSA1.5.5 Site admin training in region
- TSA1.6 Grid Management
- TSA1.6.2 Accounting coordination in region
- TSA1.7 Interoperation
- TSA1.7.1 National and regional grid project coordination
- TSA1.7.2 International grid projects
- TSA1.8 Application <-> resource provider coordination
- TSA1.8.1 ROC management of resources/SLAs
- TSA1.9 Application/resource provider/mw provider coord
- TSA1.9.1 ROC representation in coordination
44TSA1.A General tasks
- TSA1.A.1 Deliverables formal review
- TSA1.A.2 Activity Coordination (Internal meeting, Activity workshop, cross activities meeting, TCG, ...)
- TSA1.A.3 EGEE conferences (preparation and attendance)
- TSA1.A.4 EU reviews (preparation and attendance)
- TSA1.A.5 Participation in standardisation bodies (GGF, ...)
- TSA1.A.6 EGEE publications (journal papers, ...)
- TSA1.A.7 Dissemination (other related conferences, press, ...)
- TSA1.A.8 EGEE training (as a trainee)
- TSA1.A.9 EGEE training (as a trainer including preparation)
- TSA1.A.10 Partner related tasks (administration, timesheet, ...)
45TSA1.2 Middleware deployment and support
- Deployment of the SA3-produced middleware distribution to all the sites. ROCs are responsible in each region for coordinating and ensuring the agreed schedule is maintained.
- Core services require coherent installation across the Grid and interface with the local fabric (e.g. CE, SE, local Grid catalogues, etc). Core services have the longest update cycles (1 or 2 per year)
- Other services (central catalogues, information system components, monitoring tools, resource brokers) have shorter update cycles and may not need to be present at all sites
- Client tools (on WN) are installable in user-space and can be updated on the fly by a central team.
46TSA1.3. Grid operations and support
- Manage the Grid operation, including support for
the sites for all operations aspects and issues.
47TSA1.4 Grid security and incident response
- Security Coordination Group
- Responsibility for maintaining the Security and Availability Policy and policies related to acceptable use by users and VOs,
- Ensuring the continued existence of a federated identity trust domain, and encouraging the integration of national or community based authorisation schemes,
- Analysis of security risks and vulnerabilities in the procedures and software,
- Responding to security incidents.
- Joint Security Policy Group
- This group is coordinated by one ROC, UK/I (CCLRC), with a significant contribution by the OCC and contributions by other ROCs.
- The Joint Security Policy Group is a body that provides policy and site requirements to the deployment and middleware engineering activities.
48TSA1.5. Virtual organisation, application and user support
- Central helpdesk keeps track of all service requests and assigns them to the appropriate support groups.
- In this way, formal communication between all support groups is possible
- To enable this, each group has to build only one interface between its internal support structure and the central
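The "one interface per support group" model can be sketched as a small dispatcher. Everything here is hypothetical and for illustration only: the class `CentralHelpdesk`, the category names and the fallback to a triage unit are assumptions, not the GGUS implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Ticket:
    ticket_id: int
    category: str
    description: str
    history: list = field(default_factory=list)


class CentralHelpdesk:
    """Toy central helpdesk: records every ticket and assigns it to one unit."""

    def __init__(self, fallback_unit):
        self.units = {}        # category -> handler (the unit's single interface)
        self.tickets = {}      # ticket_id -> Ticket, so all requests are tracked
        self.fallback_unit = fallback_unit
        self._next_id = 1

    def register_unit(self, category, handler):
        """Each support group plugs in exactly one handler."""
        self.units[category] = handler

    def submit(self, category, description):
        """Record the request and pass it to the matching support unit."""
        ticket = Ticket(self._next_id, category, description)
        self._next_id += 1
        self.tickets[ticket.ticket_id] = ticket
        handler = self.units.get(category, self.fallback_unit)
        ticket.history.append(handler(ticket))
        return ticket


helpdesk = CentralHelpdesk(fallback_unit=lambda t: "assigned to triage")
helpdesk.register_unit("middleware", lambda t: "assigned to middleware support")
helpdesk.register_unit("roc-ne", lambda t: "assigned to ROC Northern Europe")

t1 = helpdesk.submit("middleware", "job submission fails")
t2 = helpdesk.submit("unknown", "cannot log in")

if __name__ == "__main__":
    print(t1.ticket_id, t1.history)
    print(t2.ticket_id, t2.history)
```

Because every group registers a single handler with the central application, adding a new support unit needs no changes to the other units.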
49TSA1.7. Interoperation
- SA1 will ensure interoperability/collaboration/coordination with other Grid projects. There are two distinct aspects:
- National and regional Grid projects in EGEE-II regions (UK, Italy, Nordic, SEE-Grid, BalticGrid, EUChinaGrid, EUMedGrid, EELA).
- Other projects that together with EGEE-II support international VOs (e.g. Open Science Grid in the US).
- The EGEE infrastructure will be expanded through extension and collaboration with associated projects (EU-ChinaGrid, EUMedGrid, BalticGrid, EELA)
50TSA1.8. Application/resource provider coordination
- This task ensures that sufficient resources are
identified for the supported applications by
negotiating with the resources providers. The
ROCs take this as an ongoing responsibility to
ensure that those commitments are fulfilled.
This commitment will be included as part of an
SLA with each site.
51TSA1.9. Application/resource-provider/middleware-provider coordination
- SA1 will participate in the Technical Coordination Group (TCG)
- Operational and service management
- Security and vulnerability
- Fabric and site management
- Auditing, accounting, and accountability
- Monitoring
- Support issues.
- The SA1 representation in the TCG itself will be a member of the OCC at CERN together with a representative of the ROC coordinators
52Summary
- Grids are all about sharing: it is a means of working with groups across Europe and beyond
- Operations procedures and tools under constant evolution
- We have gained significant experience in what it takes to deploy, operate and manage a large distributed infrastructure
- Much is being learned but there remains much to be done to achieve long term sustainability
- EGEE Infrastructure: world's largest multi-science production grid service
- EGEE-II is the opportunity to expand on this existing base both in terms of scale and usage
- Need to prepare for the long term
- EGEE, related projects, national grid initiatives and user communities are working to define a model for a sustainable grid infrastructure that is independent of short project cycles