DataGrid WP4: Fabric Management

Provided by: germancan

1
DataGrid WP4: Fabric Management

India @ LCG workshop
Maite Barroso, German Cancio

2
Outline

Overview of DataGrid WP4
Current Status
Shortcomings
Node installation
Checkpointing
Monitoring: QoS correlation engines

3
DataGrid work packages

WP1: Workload Management
WP2: Grid Data Management
WP3: Grid Monitoring Services
WP4: Fabric Management
WP5: Mass Storage Management
WP6: Integration Testbed (Production Quality International Infrastructure)
WP7: Network Services
WP8: High-Energy Physics Applications
WP9: Earth Observation Science Applications
WP10: Biology Science Applications
WP11: Information Dissemination and Exploitation
WP12: Project Management

4
WP4 - Background information

WP4's objective: deliver the tools needed to manage a computing fabric that provides grid services on clusters scaling up to thousands of nodes.
Main scope:
- User job management (Grid and local)
- Fabric (system administration) management
Official participants: CERN (leading partner), INFN, NIKHEF, University of Heidelberg, ZIB (Berlin) and University of Edinburgh/PPARC

5
Functionality

Provision for running Grid jobs
- Authorization according to local policies
- Mapping Grid credentials to local ones
- Publication of fabric resources and job information
Provision for running local jobs
- Sharing of resources according to local policies
Enterprise system administration, scalable to O(10K) nodes
- Automated installation and maintenance of nodes
- Resource management (batch, interactive)
- Monitoring of events and performance
- Fault tolerance, recovery actions
- Fabric Configuration Management

6
DataGrid Architecture

[Diagram: layered DataGrid architecture, listed from top to bottom]
Local Computing: Local Application; Local Database
Grid Application Layer: Data Management; Metadata Management; Object to File Mapping; Job Management
Collective Services: Information and Monitoring; Replica Manager; Grid Scheduler
Underlying Grid Services: Computing Element Services; Authorization, Authentication and Accounting; Replica Catalog; Storage Element Services; Service Index; SQL Database Services
Fabric Services (WP4 tasks): Node Installation and Management; Monitoring and Fault Tolerance; Fabric Storage Management; Configuration Management; Resource Management

7
WP4 Architecture overview

Monitoring and Fault Tolerance
- Provides the tools for gathering monitoring information on fabric nodes
- A central measurement repository stores all monitoring information
- Fault tolerance correlation engines detect failures and trigger recovery actions

Gridification
- Interface between Grid-wide services and the local fabric
- Provides local authentication, authorization and mapping of grid credentials

Resource Management
- Provides transparent access to different cluster batch systems
- Enhanced capabilities (extended scheduling policies, advance reservation, local accounting)

Installation Management
- Provides the tools to install and manage all software running on the fabric nodes
- Agent to install, upgrade, remove and configure software packages on the nodes
- Bootstrap services and software repositories

Configuration Management
- Provides central storage and management of all fabric configuration information
- Central DB and a set of protocols and APIs to store and retrieve information

8
Current Status (I)

Installation
Prototype available, based on LCFG (Local ConFiGuration system), a tool originally developed at Edinburgh University.
Main features:
- automatic installation of the O.S.
- installation/upgrade/removal of software packages
- extensible to configure and manage custom application software
Configuration Management
- High Level Configuration Description Language: a declarative way of describing the configuration of computer systems. First draft available.
- Compiler from the High Level Configuration Language to the Low Level Configuration Language. Alpha prototype available.
- Central Configuration Database (CDB): central store for all fabric configuration information. Being designed.
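
To make the idea of compiling a high-level description into a low-level one concrete, here is a minimal sketch. It does not use the WP4 language syntax, which the slides do not show. The template names, the keys and the `compile_profile` function are all hypothetical. It assumes templates can include other templates and override their values, and that the low-level output is a flat key/value profile per node.

```python
# Hypothetical high-level templates: each may include others and override values.
TEMPLATES = {
    "base": {
        "include": [],
        "values": {"os": "linux", "ntp.server": "ntp.example.org"},
    },
    "batch_worker": {
        "include": ["base"],
        "values": {"batch.system": "lsf", "batch.slots": 2},
    },
    "node042": {
        "include": ["batch_worker"],
        "values": {"network.ip": "192.0.2.42", "batch.slots": 4},
    },
}


def compile_profile(name, templates, _seen=()):
    """Flatten a template and everything it includes into one flat profile.

    Included templates are applied first, then the template's own values,
    so the most specific definition of a key wins.
    """
    if name in _seen:
        raise ValueError("include cycle: " + " -> ".join(_seen + (name,)))
    template = templates[name]
    profile = {}
    for parent in template["include"]:
        profile.update(compile_profile(parent, templates, _seen + (name,)))
    profile.update(template["values"])
    return profile
```

Calling `compile_profile("node042", TEMPLATES)` yields one flat dictionary in which the node-specific `batch.slots` value replaces the one inherited from `batch_worker`.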

9
Current Status (II)

Monitoring
- A Monitoring Agent running on each node samples the configured metrics via sensors. First version available.
- The samples are sent to a central Monitoring Repository and stored. Current prototype: a simple DB based on a flat file system.
Fault Tolerance
- Prototype which periodically checks the CPU/chipset temperatures as well as the fan speeds.
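
The agent, sensor and periodic-check ideas above can be sketched as follows. This is an illustration only, not the WP4 prototype: the class and function names, the metric names and the limit values are all assumptions. Sensors are modelled as plain callables so the sketch runs without hardware access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Sample:
    node: str
    metric: str
    value: float
    timestamp: float


class MonitoringAgent:
    """Reads every configured sensor once per call to sample()."""

    def __init__(self, node: str, sensors: Dict[str, Callable[[], float]]):
        self.node = node
        self.sensors = sensors

    def sample(self, now: float) -> List[Sample]:
        return [Sample(self.node, name, read(), now)
                for name, read in self.sensors.items()]


# Hypothetical limits as (minimum, maximum); None means unbounded.
LIMITS: Dict[str, Tuple[Optional[float], Optional[float]]] = {
    "cpu_temp_c": (None, 70.0),
    "fan_rpm": (2000.0, None),
}


def find_faults(samples: List[Sample], limits=LIMITS) -> List[str]:
    """Return one message per sample that falls outside its limits."""
    faults = []
    for s in samples:
        low, high = limits.get(s.metric, (None, None))
        if low is not None and s.value < low:
            faults.append(f"{s.node}: {s.metric}={s.value} below {low}")
        if high is not None and s.value > high:
            faults.append(f"{s.node}: {s.metric}={s.value} above {high}")
    return faults
```

In a real deployment the samples would also be shipped to the central repository, and a fault would trigger a recovery action rather than just a message.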

10
Current Status (III)

Resource Management
Working on first prototype of the Resource
Management Subsystem. Batch system proxies for
both LSF and PBS already available.
Gridification
Enhancing the Globus gatekeeper with plug-in
authorization and credential mapping components.
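
As an illustration of what a credential-mapping component does, the sketch below maps a grid certificate subject (DN) to a local account and refuses unknown or banned users. It assumes a simplified map file with one quoted DN and one account per line, in the style of a Globus grid-mapfile. The function names, the ban list and the example DNs are hypothetical and are not the gatekeeper's plug-in interface.

```python
import shlex


def parse_map(text):
    """Parse lines of the form: "<certificate subject DN>" <local account>."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        dn, account = shlex.split(line)  # shlex handles the quoted DN
        mapping[dn] = account
    return mapping


def authorize_and_map(dn, mapping, banned=frozenset()):
    """Apply the local policy, then return the local account for a grid DN."""
    if dn in banned:
        raise PermissionError(f"banned by local policy: {dn}")
    if dn not in mapping:
        raise PermissionError(f"no local mapping for: {dn}")
    return mapping[dn]


EXAMPLE_MAP = '''
# hypothetical entries
"/O=Grid/OU=example.org/CN=Alice Example" alice
"/O=Grid/OU=example.org/CN=Bob Example" atlas001
'''
```

Keeping the authorization decision and the mapping in one small function mirrors the slide's point that both are local-policy concerns layered on top of grid authentication.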

11
Shortcomings (I)

Installation of computing nodes
With LCFG, each node is autonomous:
- The configuration database contains a profile for each node
- Each node performs upgrades by pulling configuration information and software packages from the central repositories
This approach is well suited to heterogeneous environments (e.g. desktops, small farms)
But in big computer centers, large fractions of farm nodes are identical in terms of both setup (HW and SW) and functionality
Such sets of identical nodes could be managed together:
- A reference node is set up and cloned by the slave nodes
- Slave nodes contain only minimal node-specific information (e.g. IP address)
- Cloning can be done by pushing changes to the slave nodes
- Upcoming standards (IP multicast) would keep network traffic low
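
The reference-node idea can be sketched as a comparison between the package set of the reference node and that of each slave. This is a hedged illustration, not an LCFG or WP4 mechanism: `plan_sync`, `push_update` and the package data are hypothetical, and any version difference is simply treated as an upgrade.

```python
def plan_sync(reference, slave):
    """Return the package actions that make `slave` match `reference`.

    Both arguments map package name -> installed version.
    """
    actions = []
    for name, version in sorted(reference.items()):
        if name not in slave:
            actions.append(("install", name, version))
        elif slave[name] != version:
            # Any version difference is treated as an upgrade in this sketch.
            actions.append(("upgrade", name, version))
    for name in sorted(set(slave) - set(reference)):
        actions.append(("remove", name, slave[name]))
    return actions


def push_update(reference, slaves):
    """Compute one action plan per slave node; identical slaves get identical plans."""
    return {node: plan_sync(reference, packages) for node, packages in slaves.items()}
```

Because identical slaves produce identical plans, the same change set could be sent once to the whole group, which is where a multicast transport would reduce traffic.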

12
Shortcomings (II)

Application Checkpointing Library
The current model lacks communication between user jobs and changes in the state of the execution environment
- In case of urgent system interventions, running user jobs have to be terminated ungracefully
The HEP experiments do not yet offer application checkpointing
- Experiments use different software architecture frameworks
- Data access, persistency model and state management differ
There is a need to investigate how to offer a generic application checkpointing library
- Checkpointing at user level
- Flexible plug-in design for interaction with applications
HEP experiments can help by providing
- requirements
- their view on how/where to fit checkpointing into their software architecture
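
One possible shape for a user-level checkpointing library with a plug-in design is sketched below. It is an assumption-laden illustration, not a proposed WP4 API: the `CheckpointManager` class, the plug-in signature (a `save` callable returning JSON-serializable state and a `restore` callable accepting it) and the `demo` function are all hypothetical.

```python
import json
import tempfile
from pathlib import Path


class CheckpointManager:
    """Collects state from registered plug-ins and writes it to one file."""

    def __init__(self, path):
        self.path = Path(path)
        self._plugins = {}

    def register(self, name, save, restore):
        """Each application component registers how to save and restore itself."""
        self._plugins[name] = (save, restore)

    def checkpoint(self):
        state = {name: save() for name, (save, _) in self._plugins.items()}
        tmp = self.path.with_suffix(self.path.suffix + ".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(self.path)  # rename so a partial file never replaces a good one

    def restore(self):
        """Return False if there is no checkpoint, True after restoring."""
        if not self.path.exists():
            return False
        state = json.loads(self.path.read_text())
        for name, (_, restore) in self._plugins.items():
            if name in state:
                restore(state[name])
        return True


def demo():
    """Simulate a job that is checkpointed, killed and restarted."""
    job = {"event": 0}
    with tempfile.TemporaryDirectory() as workdir:
        manager = CheckpointManager(Path(workdir) / "job.ckpt")
        manager.register("event_loop",
                         lambda: job["event"],
                         lambda value: job.update(event=value))
        job["event"] = 1500   # the job has processed 1500 events
        manager.checkpoint()  # triggered before a system intervention
        job["event"] = 0      # the job is terminated and restarted
        restored = manager.restore()
    return restored, job["event"]
```

The plug-in registration is what lets experiments with different frameworks and persistency models decide for themselves what state to save, which is the flexibility the slide asks for.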

13
Shortcomings (III)

Quality of Service (QoS) Correlation Engines
The WP4 monitoring subsystem provides the necessary infrastructure for
- gathering raw monitoring data on farm nodes
- storing and retrieving monitoring data in a central repository
- a framework for correlation engines that analyze the raw data
Correlation engines are needed for, e.g.,
- determining load peaks and averages (CPU, I/O) on batch services
- disk and tape access patterns for mass storage
- home directory access
- network access
This correlated data would allow for more precise SLAs based on Quality of Service criteria.
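
A very small correlation engine of the kind described above might reduce raw samples to averages and peaks per service, then compare them against SLA targets. The sketch below is illustrative only: the function names, the `(service, metric, value)` sample layout and the target values are assumptions.

```python
from collections import defaultdict
from statistics import mean


def summarize(samples):
    """Reduce raw (service, metric, value) samples to average and peak values."""
    grouped = defaultdict(list)
    for service, metric, value in samples:
        grouped[(service, metric)].append(value)
    return {key: {"avg": mean(values), "peak": max(values)}
            for key, values in grouped.items()}


def sla_violations(summary, targets):
    """Return the (service, metric) pairs whose average exceeds the SLA target.

    `targets` maps (service, metric) -> maximum acceptable average.
    """
    return sorted(key for key, limit in targets.items()
                  if key in summary and summary[key]["avg"] > limit)
```

Feeding it samples read back from the central repository would give the per-service averages and peaks on which QoS criteria could be defined.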