Title: DataGrid WP4: Fabric Management
1DataGrid WP4: Fabric Management
- India @ LCG workshop
- Maite Barroso, German Cancio
2Outline
- Overview of DataGrid WP4
- Current Status
- Shortcomings
- Node installation
- Checkpointing
- Monitoring QoS correlation engines
3DataGrid work packages
- WP1 Workload Management
- WP2 Grid Data Management
- WP3 Grid Monitoring Services
- WP4 Fabric management
- WP5 Mass Storage Management
- WP6 Integration Testbed: Production-quality International Infrastructure
- WP7 Network Services
- WP8 High-Energy Physics Applications
- WP9 Earth Observation Science Applications
- WP10 Biology Science Applications
- WP11 Information Dissemination and Exploitation
- WP12 Project Management
4WP4 - Background information
- WP4's objective: deliver the necessary tools to manage a computing fabric providing grid services on clusters scaling up to thousands of nodes.
- Main scope
  - User job management (Grid and local)
  - Fabric (system administration) management
- Official participants: CERN (leading partner), INFN, NIKHEF, University of Heidelberg, ZIB (Berlin) and University of Edinburgh/PPARC
5Functionality
- Provision for running Grid jobs
  - Authorization according to local policies
  - Mapping Grid credentials to local ones
  - Publication of fabric resources and job information
- Provision for running local jobs
  - Sharing of resources according to local policies
- Enterprise system administration, scalable to O(10K) nodes
  - Automated installation and maintenance of nodes
  - Resource management (batch, interactive)
  - Monitoring of events and performance
  - Fault tolerance and recovery actions
  - Fabric configuration management
6DataGrid Architecture
[Diagram: layered DataGrid architecture, from local computing down to the fabric]
- Local Computing: Local Application, Local Database
- Grid Application Layer: Data Management, Metadata Management, Object to File Mapping, Job Management
- Collective Services: Information and Monitoring, Replica Manager, Grid Scheduler
- Underlying Grid Services: Computing Element Services, Authorization Authentication and Accounting, Replica Catalog, Storage Element Services, Service Index, SQL Database Services
- Fabric services: Node Installation and Management, Monitoring and Fault Tolerance, Fabric Storage Management, Configuration Management, Resource Management (the diagram marks the WP4 tasks in this layer)
7WP4 Architecture overview
- Monitoring and Fault Tolerance
  - Provides the tools for gathering monitoring information on fabric nodes
  - Central measurement repository stores all monitoring information
  - Fault tolerance correlation engines detect failures and trigger recovery actions
- Gridification
  - Interface between Grid-wide services and the local fabric
  - Provides local authentication, authorization and mapping of grid credentials
- Resource Management
  - Provides transparent access to different cluster batch systems
  - Enhanced capabilities (extended scheduling policies, advance reservation, local accounting)
- Installation and Node Management
  - Provides the tools to install and manage all software running on the fabric nodes
  - Agent to install, upgrade, remove and configure software packages on the nodes
  - Bootstrap services and software repositories
- Configuration Management
  - Provides central storage and management of all fabric configuration information
  - Central DB and a set of protocols and APIs to store and retrieve information
8Current Status (I)
- Installation
  - Prototype available, based on a tool originally developed by Edinburgh University: LCFG (Local ConFiGuration system)
  - Main features
    - Automatic installation of the O.S.
    - Installation/upgrade/removal of software packages
    - Extensible to configure and manage custom application software
- Configuration Management
  - High Level Configuration Description Language: a declarative way of describing the configuration of computer systems. First draft available.
  - Compiler from the High Level Configuration Language to the Low Level Configuration Language. Alpha prototype available.
  - Central Configuration Database (CDB): central store for all fabric configuration information. Being designed.
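As a rough illustration of the compile step described above (a declarative high-level description turned into a low-level per-node configuration), the following Python sketch merges layered templates into a flat profile. The template contents, the path-style keys and the `compile_profile` function are hypothetical and do not reflect the actual WP4 language or compiler.

```python
def flatten(tree, prefix=""):
    """Turn a nested dict into flat 'a/b/c' -> value pairs."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compile_profile(templates, node_overrides):
    """Merge templates in order, then node-specific values; later entries win."""
    profile = {}
    for template in templates:
        profile.update(flatten(template))
    profile.update(flatten(node_overrides))
    return profile


# Hypothetical high-level descriptions shared by many nodes.
base = {"os": {"name": "linux"}, "packages": {"openssh": "installed"}}
batch_worker = {"packages": {"pbs-client": "installed"}, "role": "worker"}

# Only the node-specific part (here an example IP address) differs per node.
profile = compile_profile([base, batch_worker], {"network": {"ip": "192.0.2.10"}})
for path in sorted(profile):
    print(path, "=", profile[path])
```

Keeping the shared templates separate from the small node-specific part is what lets one description serve many nodes.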
9Current Status (II)
- Monitoring
  - A Monitoring Agent running on each node samples the configured metrics via sensors. First version available.
  - The samples are sent to a central Monitoring Repository and stored. Current prototype: a simple DB based on a flat file system.
- Fault Tolerance
  - Prototype which periodically checks the CPU/chipset temperatures as well as the fan speeds.
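The agent, flat-file repository and periodic temperature/fan check described above could be sketched roughly as follows. This is a minimal illustration only: the class and function names, the JSON-lines file format, the metric names and the threshold values are all assumptions, and the sensors are fake callables standing in for real hardware probes.

```python
import json
import os
import tempfile
import time


class MonitoringAgent:
    """Samples configured sensors and appends the records to a flat file."""

    def __init__(self, node, sensors, repository_path):
        self.node = node
        self.sensors = sensors  # metric name -> callable returning a number
        self.repository_path = repository_path

    def sample_once(self):
        records = []
        with open(self.repository_path, "a") as repo:
            for metric, read_sensor in self.sensors.items():
                record = {"node": self.node, "metric": metric,
                          "value": read_sensor(), "time": time.time()}
                repo.write(json.dumps(record) + "\n")
                records.append(record)
        return records


def check_thresholds(records, limits):
    """Return the records whose value lies outside its (low, high) limit."""
    alarms = []
    for record in records:
        low, high = limits.get(record["metric"], (float("-inf"), float("inf")))
        if not low <= record["value"] <= high:
            alarms.append(record)
    return alarms


# Demo with fake sensors and made-up limits.
repository = os.path.join(tempfile.mkdtemp(), "repository.jsonl")
agent = MonitoringAgent(
    node="node01",
    sensors={"cpu_temp_c": lambda: 82.0, "fan_rpm": lambda: 4200},
    repository_path=repository,
)
records = agent.sample_once()
alarms = check_thresholds(records, {"cpu_temp_c": (0, 70), "fan_rpm": (1000, 10000)})
for alarm in alarms:
    print("ALARM", alarm["node"], alarm["metric"], alarm["value"])
```

In a real deployment the alarm list would feed a recovery action rather than a print statement.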
10Current Status (III)
- Resource Management
  - Working on the first prototype of the Resource Management Subsystem. Batch system proxies for both LSF and PBS already available.
- Gridification
  - Enhancing the Globus gatekeeper with plug-in authorization and credential mapping components.
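To illustrate the idea of batch system proxies giving uniform access to LSF and PBS, here is a small Python sketch. The class names and the `submit_command` interface are hypothetical and are not the WP4 proxy API; the sketch only builds command lines using the basic `-q` queue option of `bsub` and `qsub`, and does not execute anything.

```python
from abc import ABC, abstractmethod


class BatchProxy(ABC):
    """Common interface hiding the differences between batch systems."""

    @abstractmethod
    def submit_command(self, script, queue):
        """Return the command line that would submit `script` to `queue`."""


class LSFProxy(BatchProxy):
    def submit_command(self, script, queue):
        return ["bsub", "-q", queue, script]


class PBSProxy(BatchProxy):
    def submit_command(self, script, queue):
        return ["qsub", "-q", queue, script]


# Registry keyed by an assumed batch system label.
PROXIES = {"lsf": LSFProxy(), "pbs": PBSProxy()}


def submit_command(batch_system, script, queue):
    """Look up the proxy for the local batch system and build the command."""
    return PROXIES[batch_system].submit_command(script, queue)


print(submit_command("lsf", "./job.sh", "short"))
print(submit_command("pbs", "./job.sh", "short"))
```

Callers only name the batch system and never see its command syntax, which is the transparency the proxies are meant to provide.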
11Shortcomings (I)
- Installation of computing nodes
  - With LCFG, each node is autonomous
  - The configuration database contains profiles for each node
  - Each node performs upgrades by pulling configuration information and the software packages from the central repositories
  - This approach is well suited to heterogeneous environments (e.g. desktops, small farms)
  - But, in big computer centres, large fractions of farm nodes are identical in terms of both setup (HW and SW) and functionality
  - Such sets of identical nodes could be managed together
  - A reference node is set up and cloned onto the slave nodes
  - Slave nodes only contain minimal specific information (e.g. IP address)
  - Cloning can be done by pushing changes to the slave nodes
  - Upcoming standards (IP multicast) would keep network traffic low.
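The reference-node idea above can be illustrated with a short Python sketch that compares each slave node's package set with the reference node and groups nodes needing the same change set, so that each distinct set of changes is prepared once. The function names, node names and package versions are hypothetical, and the sketch only plans the push; it does not transfer anything.

```python
import json


def plan_sync(reference, slave):
    """Compare package name -> version maps; return actions for the slave."""
    install = {n: v for n, v in reference.items() if n not in slave}
    upgrade = {n: v for n, v in reference.items() if n in slave and slave[n] != v}
    remove = sorted(n for n in slave if n not in reference)
    return {"install": install, "upgrade": upgrade, "remove": remove}


def push_plan(reference, slaves):
    """Group slaves by identical plan so each change set is built once."""
    groups = {}
    for name, packages in slaves.items():
        key = json.dumps(plan_sync(reference, packages), sort_keys=True)
        groups.setdefault(key, []).append(name)
    return [(json.loads(key), sorted(nodes)) for key, nodes in groups.items()]


# Made-up package inventories for the reference node and three slaves.
reference = {"kernel": "2.4.9", "openssh": "3.1"}
slaves = {
    "node01": {"kernel": "2.4.7", "openssh": "3.1"},
    "node02": {"kernel": "2.4.7", "openssh": "3.1"},
    "node03": {"kernel": "2.4.9", "openssh": "3.1", "games": "1.0"},
}

for plan, nodes in push_plan(reference, slaves):
    print(nodes, plan)
```

Because identical nodes end up in the same group, one pushed change set can serve all of them, which is where a multicast transport would save traffic.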
12Shortcomings (II)
- Application Checkpointing Library
  - The current model lacks communication between user jobs and changes to the state of the execution environment
  - In case of urgent system interventions, running user jobs have to be terminated ungracefully
  - The HEP experiments do not yet offer application checkpointing
  - Experiments use different software architecture frameworks
  - Data access, persistency model and state management differ
  - There is a need to investigate how to offer a generic application checkpointing library
  - Checkpointing at user level
  - Flexible plug-in design for interaction with applications
  - HEP experiments can help by
    - providing requirements
    - giving their view on how/where to fit checkpointing into their software architecture
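One possible shape for a user-level checkpointing library with a plug-in design is sketched below in Python. Everything here is an assumption made for illustration: the `Checkpointable` interface, the `CheckpointManager` class, the JSON checkpoint file and the toy `EventLoop` application are not part of any existing WP4 or experiment software.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class Checkpointable(ABC):
    """Plug-in interface an application component would implement."""

    @abstractmethod
    def save_state(self):
        """Return a JSON-serialisable description of the current state."""

    @abstractmethod
    def restore_state(self, state):
        """Resume from a state previously returned by save_state()."""


class CheckpointManager:
    def __init__(self, path):
        self.path = path
        self.plugins = {}

    def register(self, name, plugin):
        self.plugins[name] = plugin

    def checkpoint(self):
        states = {name: p.save_state() for name, p in self.plugins.items()}
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(states, f)
        os.replace(tmp, self.path)  # swap in the new checkpoint atomically

    def restart(self):
        with open(self.path) as f:
            states = json.load(f)
        for name, state in states.items():
            self.plugins[name].restore_state(state)


class EventLoop(Checkpointable):
    """Toy application that only remembers the next event to process."""

    def __init__(self):
        self.next_event = 0

    def process(self, count):
        self.next_event += count

    def save_state(self):
        return {"next_event": self.next_event}

    def restore_state(self, state):
        self.next_event = state["next_event"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
job = EventLoop()
manager = CheckpointManager(path)
manager.register("event_loop", job)
job.process(500)
manager.checkpoint()  # e.g. requested before an urgent system intervention

restarted = EventLoop()
new_manager = CheckpointManager(path)
new_manager.register("event_loop", restarted)
new_manager.restart()
print(restarted.next_event)
```

Each experiment framework would supply its own plug-ins for data access and state, while the manager stays generic.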
13Shortcomings (III)
- Quality of Service (QoS) Correlation Engines
  - The WP4 monitoring subsystem provides the necessary infrastructure for
    - gathering raw monitoring data on farm nodes
    - storing and retrieving monitoring data in a central repository
    - a framework for correlation engines that analyze the raw data
  - Correlation engines are needed for, e.g.,
    - determining load peaks and averages (CPU, I/O) on batch services
    - disk and tape access patterns for mass storage
    - home directory access
    - network access
  - This correlated data would allow for more precise SLAs based on Quality of Service criteria.
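A correlation engine for load peaks and averages could, in its simplest form, look like the Python sketch below. The sample layout `(node, timestamp, cpu_load)`, the function name `load_summary` and the peak threshold are assumptions chosen for the example, not the format of the WP4 monitoring repository.

```python
from collections import defaultdict
from statistics import mean


def load_summary(samples, peak_threshold):
    """Summarise raw (node, timestamp, cpu_load) samples per node."""
    by_node = defaultdict(list)
    for node, timestamp, load in samples:
        by_node[node].append((timestamp, load))

    summary = {}
    for node, series in by_node.items():
        loads = [load for _, load in series]
        summary[node] = {
            "average": mean(loads),
            "peak": max(loads),
            # Timestamps at which the load reached the threshold.
            "peak_times": [t for t, load in series if load >= peak_threshold],
        }
    return summary


# Made-up raw samples, as they might be read back from a central repository.
samples = [
    ("node01", 0, 0.5), ("node01", 60, 1.5), ("node01", 120, 1.0),
    ("node02", 0, 0.25), ("node02", 60, 0.25),
]
summary = load_summary(samples, peak_threshold=1.2)
for node in sorted(summary):
    print(node, summary[node])
```

Summaries of this kind are the sort of derived figures a QoS-based SLA could be checked against.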
14Contacts and more information
- DataGrid project
  - http://cern.ch/eu-datagrid
- DataGrid WP4
  - http://cern.ch/hep-proj-grid-fabric