Title: Deployment Aspects of LCG
1 Deployment Aspects of LCG
- Ian Bird
- LCG Deployment Area Manager
- Presentation to HEP-CCC Meeting
- 18-Oct-2002
2 Summary
- Introduction
- LCG areas of activity
- Deployment Goals and Timescale
- Deployment Activities
- Technology
- Testing and certification
- Support
- Resources
- Operations
- Coordination, collaboration
- Conclusions
3 Project Goals
Goal: Prepare and deploy the LHC computing environment
- applications: tools, frameworks, environment, persistency
- computing system → global grid service
- cluster → automated fabric
- collaborating computer centres → grid
- CERN-centric analysis → global analysis environment
- central role of data challenges
This is not another grid technology project; it is a grid deployment project
4 LCG Level 1 Milestones proposed to LHCC
5 LCG-1 Timescale in a nutshell
- LCG-1 must be defined end 2002
- 2 major areas to be addressed
- Define LCG-1 in terms of required functionality and services
- Deployment schedule
- Set up distributed organisational structure
- Resources and scheduling
- Policies: security, authentication, etc.
- Operational agreements and responsibilities
- Support services
- End November: Level 1 and 2 milestones in a quantifiable form
- LCG-1 service must be in place July 2003
- 6 months testing, integration, certification, packaging and deployment
- Need to demonstrate performance end 2003
- This should include adding current production services into LCG
- Provide production service for data challenges in 2004
6 LCG Activities
7 LCG and its interactions
[Diagram. Labels: Experiments; Grid Projects; HEPCAL; PPDG; iVDGL (VDT); GriPhyN; Globus; GLUE; EDG; NorduGrid; GDB; AliEn; Regional Centres; CERN]
8 Multi-dimensional problem
- Regional Centres
- Host one or more experiments
- Different RCs deploy different grid middleware in existing testbeds
- Have different operational and security policies
- Experiments
- Use middleware from various grid projects
- Run at many regional centres
- Provide applications that rely on specific middleware
- Grid projects
- Provide middleware that does not often (yet) interoperate
- Starting to collaborate on common solutions and interoperability
→ The Deployment area of LCG ties these all together
9 Grid Technology
- Short term (next 3-4 months)
- Define LCG-1 in terms of minimum functionality and services to be provided
- Recommend how to provide them
- GTA, GDB, using HEPCAL document as a basis
- Longer term
- ensuring that the LCG requirements are known to current and potential Grid projects
- active lobbying for suitable solutions, influencing plans and priorities
- negotiating support for tools developed by Grid projects
- Essential for a production service!
- developing a plan to supply solutions that do not emerge from other sources
- BUT this must be done with caution: important to avoid HEP-specific solutions
10 Technology
- A base set of requirements has been defined (HEPCAL)
- 43 use cases
- 2/3 of which should be satisfied 2003 by currently funded projects
- LCG plans to use the technology emerging from some of the many Grid projects receiving substantial national and EU R&D funding, and perhaps later from industry
- Today
- many of these projects are led by, or strongly influenced by, HEP
- are built on the Globus toolkit
- and form two main groups
- around the (European) DataGrid project
- subscribing to the (US) Virtual Data Toolkit (VDT)
- rapidly growing interest and investment from other sciences, industry
- HEP (LHC data challenges, BaBar, LCG, ...) an early adopter
- Tomorrow
- must remain in the main line: leverage the massive investments being made
- increasingly difficult for HEP to influence direction
- expect several major architectural changes before things mature
- LCG must adapt and evolve both in functionality and in technology
11 Deployment
12 RC MoU and Requirements Catalog
- The SC2 RTAG on Regional Centre categorisation recommended:
- The GDB should work out the MoUs
- Requirements to be considered
- Quality of service
- Policy of use
- Network connectivity
- Compatibility
- User support and training
- Consultation and problem tracking
- Operating conditions
13 Grid Deployment goals of LCG-1
- Production service for Data Challenges in 2H03 and 2004
- Focused on batch production work
- Experience in close collaboration between the Regional Centres
- Should have wide enough participation to understand the issues, but not too many initially
- Learn how to maintain and operate a global grid
- Focus on a production-quality service and all that implies
- Robustness, fault-tolerance, predictability, and supportability take precedence over functionality
- But minimum functionality to be of value
- This requires
- a middleware support group with integration, certification, testing, packaging etc. responsibilities
- A support structure
- LCG should be integrated into the sites' physics computing services; should not be something apart
- This requires coordination between participating sites in
- Policies and collaborative agreements
- Resource planning and scheduling
14 What might LCG-1 look like?
- Users' perspective: requires
- Functionality adequate to provide advantage over not using distributed model
- Straightforward to use
- Well defined services
- Advice on how to use the system
- Help with problems
- Failures should be understandable
- Ability to determine status of jobs and data
- Sites' perspective
- Integrated into computer centre/IT (inc. security) infrastructures
- Able to support service
- Able to allocate and manage resources: local autonomy where needed
- Overall service perspective
- Performance and problem monitoring
- Accounting
- Etc.
15 Grid Deployment
- Grid Deployment Board
- Representatives from experiments, Regional Centres, LCG
- Define LCG-1
- Put in place agreements and policies to enable the deployment and operation of LCG
- Coordinate planning of resources for computing and physics data challenges
- Initial meeting Oct 4, Milano
- Grid Deployment Area
- Certification and Testing
- System support
- Operations
- User Support
- Resources planning and scheduling
16 Grid Deployment Board
- 1st Meeting in Milano Oct 4, 2002
- Set up Technical Working groups
- WG1: Define LCG-1 functionality and services, recommend how to provide them. Define priorities and schedule for additional functionality.
- WG2: Define the regional centres in LCG-1, and the resources that should be available in each. Schedule for rolling out the infrastructure and resources. Propose metrics to be used for allocation, accounting, and reporting.
- WG3: Define a straightforward security and authentication model to be used in LCG-1, and identify the technical issues. Set up agreements and MoUs. Propose simple mechanism for authorization.
- WG4: Define ops procedures and responsibilities. Make agreements to ensure coordination of these activities. Define the requirements for a Grid Operations Centre to coordinate operational activities.
- WG5: Propose a support model for LCG-1, including the scope of responsibilities for call centre/helpdesk, and specify requirements for problem resolution and tracking.
- Follow ups: 4 meetings (2 by phone) before end 2002
17 Grid Deployment Teams: the plan
[Diagram. Labels: suppliers' integration teams provide tested releases (common applications s/w; Trillium, US grid middleware; DataGrid middleware); certification, build and distribution; LCG infrastructure coordination and operation; user support; grid operation; call centre; LCG; fabric operation at regional centres A, B, ..., X, Y]
18 Certification and Testing
- Function shared between EDG and LCG
- Groups
- Installation Team Group (iTeam)
- Mostly EDG members, 1 LCG
- Testing group (TSTG)
- Mainly LCG
- Certification Group (CTG)
- Mainly LCG
- Management of Certification and Testing: Zdenek Sekera (LCG)
19 iTeam responsibilities
- the software can be built and packaged without obvious errors
- the software passes the integration test suite
- ensure that all fixes/features introduced into the software have an entry in Bugzilla; verify that the entry has been updated when the fix/feature is checked into the software tree
- build all standard configurations (with different compilers, libraries, etc.) as defined by the GDB and test them with the integration test suite
- report all problems via Bugzilla
- follow up with development problems reported in Bugzilla
20 Test Group responsibilities
- Testing basic grid functionality
- Responsible for collecting and creating tests to provide
- testing grid services
- testing security
- testing information
- testing resource brokering
- testing data catalogue and replication
- testing connectivity
- testing configurability
- testing basic grid functionality
- testing error recovery, fault tolerance
- organize and perform complete geographically distributed tests as defined by GDB
- make sure all new features come with documentation
- maintain the automated test suite
- create and perform destructive tests
- report every problem via Bugzilla
21 Certification group responsibilities
- Certify that the software satisfies the functional and stability requirements, including adequate documentation
- set up, configure and maintain certification testbeds
- verify the TSTG tests are complete and OK
- follow up other GDB requirements for the Grid certification, create appropriate certification tests
- ensure all certification tests run
- pay attention to performance issues
- work with ATG (Application Test Group) to ensure the complete Grid production testing environment is valid
- create complete release package(s), integrating the up-to-date documentation (the documentation will come from other sources such as User Support Group)
- create CDs etc. for Grid software distribution
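The division of labour between the three groups (iTeam, TSTG, CTG) can be pictured as a chain of gates that a candidate release passes through. The following is only a minimal sketch under stated assumptions: the strictly sequential ordering, the function `run_pipeline`, the stage list `STAGES` and the release fields (`builds`, `integration_ok`, `functional_ok`, `certification_ok`, `documented`) are all hypothetical names for illustration, not part of any EDG or LCG tool.

```python
from typing import Callable, Dict, List, Tuple

# A stage is a (label, check) pair; the check inspects a release record.
Stage = Tuple[str, Callable[[Dict[str, bool]], bool]]

# Hypothetical gates, loosely mirroring the responsibilities listed above.
STAGES: List[Stage] = [
    ("iTeam: build and integration suite",
     lambda r: r.get("builds", False) and r.get("integration_ok", False)),
    ("TSTG: grid functionality tests",
     lambda r: r.get("functional_ok", False)),
    ("CTG: certification tests and documentation",
     lambda r: r.get("certification_ok", False) and r.get("documented", False)),
]


def run_pipeline(release: Dict[str, bool],
                 stages: List[Stage] = STAGES) -> Tuple[bool, List[str]]:
    """Run the gates in order and stop at the first failure.

    Returns (certified, log), where log has one line per gate attempted.
    """
    log: List[str] = []
    for name, check in stages:
        ok = check(release)
        log.append(f"{name}: {'pass' if ok else 'FAIL'}")
        if not ok:
            # A failed gate would be reported back to the developers.
            return False, log
    return True, log


if __name__ == "__main__":
    candidate = {"builds": True, "integration_ok": True,
                 "functional_ok": False, "certification_ok": True,
                 "documented": True}
    certified, log = run_pipeline(candidate)
    print(certified)
    print("\n".join(log))
```

Stopping at the first failed gate reflects the idea that a release is not handed to the next group until the previous one is satisfied; a real process could equally run the groups' tests in parallel.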
22 Certification and Test Activities
- Current activities
- Prepare EDG November release
- Recreate EDG, EDT and iVDGL interoperability demos on LCG testbeds
- Evaluation of software, GLUE results etc., with GDB WG1
- Training and experience for new team members
- Certification, testing, validation
- Will be and will remain a significant activity of LCG
- This is what will make LCG a production level service
23 Testbeds and Services
[Diagram of testbeds and services over 2002-2003. Labels: developers' testbeds; development testbed (α- and β-testing, integrating and preparing a middleware release); DataGrid; production testbed (stable, maintained service for applications), shown twice; demonstration testbed; certification testbed (controlled changes, in-depth application testing); LCG; production service (stable, maintained, 24x7 service for applications)]
24 Operations team
- Responsible for operating and maintaining the grid infrastructure and associated services
- Gateways, information services, resource broker etc., i.e. grid specific services
- Provide Grid Operations Centre
- Leverage existing experience (iVDGL, etc.)
- Assemble monitoring, reporting etc. tools
- Authorisation, Authentication services and infrastructure inc. CAs
- Accounting
- Security operations: incident response etc.
25 Grid Operation
[Diagram. Labels: queries; monitoring; alarms; corrective actions; User; Local operation; Local user support; Local site; Call Centre; Grid Operations Centre; Grid information service; Grid operations; Virtual Organisation; Grid logging and bookkeeping; Network Operations Centre]
26 User Support
- Essential for a production service
- Two aspects
- Experiment integration/consultancy
- Work directly with the experiments' computing projects to ensure efficient use of LCG services, and optimum use of resources
- Act as liaison to ensure experiment specific issues are resolved
- User support
- Helpdesk/call centre operation
- Globally distributed 24x7, ensure single point of contact for user
- Collaborative and distributed operation
- Documentation
- Training
27 Resource Planning and Scheduling
- Must tie together
- global experiment requirements including some review process
- Regional centre (and other) resource planning
- Constraints, e.g. some resources may be dedicated to specific experiments (if we succeed this should go away)
- Optimise use of resources at centres
- Ensure experiment needs are satisfied
- Try to smooth out peaks of demand: sharing of resources between experiments
- Eventually be able to make use of non-HEP resources
- Activity to build a database of requirements and available resources has begun
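To make the planning problem above concrete, here is a minimal sketch of matching experiment requirements against regional centre resources while honouring the "dedicated to a specific experiment" constraint. Everything in it is an assumption for illustration: the record types `Resource` and `Requirement`, the single abstract `cpu` unit, the greedy strategy in `allocate`, and the centre and experiment names are hypothetical and do not describe the actual LCG database.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Resource:
    centre: str
    cpu: int                            # capacity in arbitrary units (assumed)
    dedicated_to: Optional[str] = None  # experiment name, or None if shared


@dataclass
class Requirement:
    experiment: str
    cpu: int


def allocate(requirements: List[Requirement], resources: List[Resource]
             ) -> Tuple[List[Tuple[str, str, int]], Dict[str, int]]:
    """Greedy allocation: an experiment's dedicated resources are used
    first, then shared ones. Returns (allocations, shortfalls)."""
    remaining = {i: r.cpu for i, r in enumerate(resources)}
    allocations: List[Tuple[str, str, int]] = []
    shortfalls: Dict[str, int] = {}
    for req in requirements:
        needed = req.cpu
        # Stable sort puts resources dedicated to this experiment first.
        order = sorted(range(len(resources)),
                       key=lambda i: resources[i].dedicated_to != req.experiment)
        for i in order:
            res = resources[i]
            if res.dedicated_to not in (None, req.experiment):
                continue  # dedicated to another experiment
            take = min(needed, remaining[i])
            if take > 0:
                allocations.append((req.experiment, res.centre, take))
                remaining[i] -= take
                needed -= take
            if needed == 0:
                break
        if needed:
            shortfalls[req.experiment] = needed  # unmet demand
    return allocations, shortfalls


if __name__ == "__main__":
    resources = [Resource("RC-A", 100, dedicated_to="ExpX"),
                 Resource("RC-B", 50)]
    requirements = [Requirement("ExpX", 120), Requirement("ExpY", 60)]
    allocations, shortfalls = allocate(requirements, resources)
    print(allocations)
    print(shortfalls)
```

In this toy case the dedicated capacity at the hypothetical centre RC-A cannot be used by the second experiment, which ends up with a shortfall; that is the kind of inefficiency the slide expects to disappear once resources are shared.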
28 Coordination and Collaboration
- There are many opportunities for common solutions, which should be actively pursued
- GLUE
- Schema definitions, interoperability work
- HICB JTB, proposed new collaborative activities
- Validation and Test Suites
- Distribution and Meta-Packaging
- Interoperable distribution and configuration utilities identified as a definite need by all the recent trans-Atlantic demonstration and validation work
- Support for this group comes from
- LCG, EDG, EDT, Trillium, DataTAG
- Other opportunities
- Storage interfaces, e.g. SRM
- Grid operations centre
- Authentication, authorisation and security
- HEPiX as collaborative vehicle for RC managers, site coordinators
- E.g. certification process for operating environments, upgrade procedures, configuration management, helpdesk tools, etc.
29 Deployment Summary
- Deploy middleware to support essential functionality, but goal is to evolve and incrementally add functionality
- Added value is to robustify, support and make into a 24x7 production service
- How?
- Certification and test procedure: tight feedback to developers
- must develop support agreements with grid projects to ensure this
- Define missing functionality: require from providers
- Provide documentation and training
- Provide missing operational services
- Provide a 24x7 Operations and Call Centre
- Guarantee to respond
- Single point of contact for a user
- Make software easy to install: facilitate new centres joining
30 Conclusions
- Deployment is a major activity of LCG
- Encompasses all operational and practical aspects of a grid
- Timescales are relatively short for LCG-1
- But there is a lot of work already done that must be leveraged
- Many opportunities for synergy and collaboration
- E.g. certification and testing EDG/LCG
- We will succeed if we use these opportunities