DataGrid WP4: Fabric Management

1
DataGrid WP4: Fabric Management
  • India @ LCG workshop
  • Maite Barroso, German Cancio

2
Outline
  • Overview of DataGrid WP4
  • Current Status
  • Shortcomings
  • Node installation
  • Checkpointing
  • Monitoring: QoS correlation engines

3
DataGrid work packages
  • WP1 Workload Management
  • WP2 Grid Data Management
  • WP3 Grid Monitoring Services
  • WP4 Fabric management
  • WP5 Mass Storage Management
  • WP6 Integration Testbed: Production-quality
    International Infrastructure
  • WP7 Network Services
  • WP8 High-Energy Physics Applications
  • WP9 Earth Observation Science Applications
  • WP10 Biology Science Applications
  • WP11 Information Dissemination and Exploitation
  • WP12 Project Management

4
WP4 - Background information
  • WP4's objective: deliver the necessary tools to
    manage a computing fabric providing grid services
    on clusters scaling up to thousands of nodes.
  • Main scope:
  • User job management (Grid and local)
  • Fabric (system administration) management
  • Official participants: CERN (leading partner),
    INFN, NIKHEF, University of Heidelberg, ZIB
    (Berlin) and University of Edinburgh/PPARC

5
Functionality
  • Provision for running Grid jobs:
  • Authorization according to local policies
  • Mapping Grid credentials to local ones
  • Publication of fabric resources and job
    information
  • Provision for running local jobs:
  • Sharing of resources according to local policies
  • Enterprise system administration, scalable to
    O(10K) nodes:
  • Automated installation and maintenance of nodes
  • Resource management (batch, interactive)
  • Monitoring of events and performance
  • Fault tolerance and recovery actions
  • Fabric Configuration Management
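
The mapping of Grid credentials to local ones can be sketched as a lookup from certificate subject to local account, with a local policy check in front of it. This is only an illustration, assuming a Globus-style map file (a quoted certificate subject followed by a local account name); the subjects, account names and the helpers `parse_mapfile` and `map_credential` are hypothetical and are not taken from the WP4 software.

```python
import shlex

# Hypothetical map file content: quoted certificate subject, then local account.
GRID_MAPFILE = '''
# subject                              local account
"/O=Grid/O=CERN/CN=Jane Doe"           atlas001
"/O=Grid/O=NIKHEF/CN=John Smith"       lhcb007
'''


def parse_mapfile(text):
    """Return a dict mapping certificate subjects to local account names."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        subject, account = shlex.split(line)  # shlex handles the quoted subject
        mapping[subject] = account
    return mapping


def map_credential(subject, mapping, banned=()):
    """Apply local policy, then return the local account for a Grid subject."""
    if subject in banned:
        raise PermissionError(f"subject banned by local policy: {subject}")
    if subject not in mapping:
        raise PermissionError(f"no local mapping for: {subject}")
    return mapping[subject]
```

Keeping the policy check (`banned`) separate from the mapping table mirrors the split between authorization and credential mapping in the list above.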

6
DataGrid Architecture
[Figure: layered architecture diagram, top to bottom]
  • Local Computing: Local Application, Local
    Database
  • Grid, Grid Application Layer: Data Management,
    Metadata Management, Object to File Mapping, Job
    Management
  • Grid, Collective Services: Information Monitoring,
    Replica Manager, Grid Scheduler
  • Grid, Underlying Grid Services: Computing Element
    Services, Authorization Authentication and
    Accounting, Replica Catalog, Storage Element
    Services, Service Index, SQL Database Services
  • Fabric, Fabric services (the WP4 tasks): Node
    Installation Management, Monitoring and Fault
    Tolerance, Fabric Storage Management,
    Configuration Management, Resource Management

7
WP4 Architecture overview
  • Monitoring and Fault Tolerance
  • Provides the tools for gathering monitoring
    information on fabric nodes.
  • A central measurement repository stores all
    monitoring information.
  • Fault tolerance correlation engines detect
    failures and trigger recovery actions.
  • Gridification
  • Interface between Grid-wide services and the
    local fabric.
  • Provides local authentication, authorization and
    mapping of grid credentials.
  • Resource Management
  • Provides transparent access to different cluster
    batch systems.
  • Enhanced capabilities (extended scheduling
    policies, advanced reservation, local
    accounting).
  • Installation
  • Provides the tools to install and manage all
    software running on the fabric nodes.
  • Agent to install, upgrade, remove and configure
    software packages on the nodes.
  • Bootstrap services and software repositories.
  • Configuration Management
  • Provides central storage and management of all
    fabric configuration information.
  • Central DB and a set of protocols and APIs to
    store and retrieve information.

8
Current Status (I)
  • Installation
  • Prototype available, based on LCFG (Local
    ConFiGuration system), a tool originally
    developed by Edinburgh University.
  • Main Features
  • automatic installation of O.S.
  • installation/upgrade/removal of software packages
  • extensible to configure and manage custom
    application software
  • Configuration Management
  • High Level Configuration Description Language: a
    declarative way of describing the configuration
    of computer systems. First draft available.
  • Compiler from the High Level Configuration
    Language to the Low Level Configuration Language.
    Alpha prototype available.
  • Central Configuration Database (CDB) (central
    store for all fabric configuration information).
    Being designed.
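
The step from a high-level declarative description to a low-level per-node configuration can be illustrated with a toy compiler. The slides do not show either language, so this sketch stands in nested Python dictionaries for the high-level description and flat path/value pairs for the low-level form; all names, package versions and addresses are made up for illustration.

```python
# Hypothetical high-level description shared by a whole cluster.
CLUSTER_TEMPLATE = {
    "system": {"os": "linux", "packages": {"openssh": "3.1"}},
    "network": {"gateway": "192.0.2.1"},
}

# Hypothetical node-specific additions.
NODE_OVERRIDES = {
    "node042": {"network": {"ip": "192.0.2.42"}},
}


def merge(base, override):
    """Recursively overlay 'override' on 'base' without modifying either."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def flatten(tree, prefix=""):
    """Turn a nested description into flat 'path -> value' pairs."""
    flat = {}
    for key, value in tree.items():
        path = f"{prefix}/{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path))
        else:
            flat[path] = value
    return flat


def compile_profile(template, overrides):
    """Produce the low-level profile for one node."""
    return flatten(merge(template, overrides))
```

For example, `compile_profile(CLUSTER_TEMPLATE, NODE_OVERRIDES["node042"])` yields one flat profile that a node-side agent could consume without knowing about templates.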

9
Current Status (II)
  • Monitoring
  • A Monitoring Agent running on each node samples
    the metrics that have been configured via
    sensors. First version available.
  • The samples are sent to a central Monitoring
    Repository and stored. The current prototype is
    a simple DB based on a flat file system.
  • Fault Tolerance
  • Prototype which periodically checks the
    CPU/chipset temperatures as well as the fan
    speeds.
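
The monitoring and fault tolerance prototypes described above can be pictured roughly as follows: an agent reads each configured sensor, the samples are appended to a flat file, and a check compares them against limits. This is a sketch under assumptions, not the WP4 code: the sensor functions return fixed values, and the metric names, limits and record layout are hypothetical.

```python
import csv
import io
import time


# Hypothetical sensors; real ones would read hardware, these return fixed values.
def cpu_temperature():
    return 71.5  # degrees Celsius


def fan_speed():
    return 900  # revolutions per minute


SENSORS = {"cpu_temp_c": cpu_temperature, "fan_rpm": fan_speed}

# Assumed acceptable ranges per metric: (lowest allowed, highest allowed).
LIMITS = {"cpu_temp_c": (None, 70.0), "fan_rpm": (1500, None)}


def sample(node, sensors, now=None):
    """Read every configured sensor once and return (time, node, metric, value) records."""
    timestamp = time.time() if now is None else now
    return [(timestamp, node, metric, read()) for metric, read in sensors.items()]


def store(samples, repository):
    """Append samples to a flat-file repository (any writable text file object)."""
    csv.writer(repository).writerows(samples)


def check(samples, limits):
    """Return (node, metric, value) for every sample outside its allowed range."""
    alarms = []
    for _, node, metric, value in samples:
        low, high = limits.get(metric, (None, None))
        too_low = low is not None and value < low
        too_high = high is not None and value > high
        if too_low or too_high:
            alarms.append((node, metric, value))
    return alarms
```

A periodic loop would call `sample`, `store` and `check` in turn; the alarms returned by `check` are where a recovery action would be triggered.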

10
Current Status (III)
  • Resource Management
  • Working on first prototype of the Resource
    Management Subsystem. Batch system proxies for
    both LSF and PBS already available.
  • Gridification
  • Enhancing the Globus gatekeeper with plug-in
    authorization and credential mapping components.
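
The idea of batch system proxies, one common interface in front of LSF and PBS, can be sketched as below. The class and method names are hypothetical and the interface of the actual WP4 proxies is not shown in the slides; the sketch only builds command lines using the standard `bsub`/`bkill` (LSF) and `qsub`/`qdel` (PBS) commands and does not execute them.

```python
from abc import ABC, abstractmethod


class BatchProxy(ABC):
    """Common interface hiding the differences between batch systems."""

    @abstractmethod
    def submit_command(self, script, queue):
        """Return the command line that submits 'script' to 'queue'."""

    @abstractmethod
    def cancel_command(self, job_id):
        """Return the command line that cancels job 'job_id'."""


class LSFProxy(BatchProxy):
    def submit_command(self, script, queue):
        return ["bsub", "-q", queue, script]

    def cancel_command(self, job_id):
        return ["bkill", str(job_id)]


class PBSProxy(BatchProxy):
    def submit_command(self, script, queue):
        return ["qsub", "-q", queue, script]

    def cancel_command(self, job_id):
        return ["qdel", str(job_id)]


# The caller picks a proxy by name and never touches system-specific commands.
PROXIES = {"lsf": LSFProxy(), "pbs": PBSProxy()}
```

Returning the command as a list keeps the sketch side-effect free; a real proxy would run it and parse the job identifier from the output.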

11
Shortcomings (I)
  • Installation of computing nodes
  • With LCFG, each node is autonomous
  • The configuration database contains profiles for
    each node
  • Each node performs upgrades by pulling
    configuration information and the software
    packages from the central repositories
  • This approach is well suited to heterogeneous
    environments (e.g. desktops, small farms)
  • But, in big Computer Centers, large fractions of
    farm nodes are identical in terms of both setup
    (HW and SW) and functionality
  • Such sets of identical nodes could be managed
    together
  • A reference node is set up, and cloned by the
    other slave nodes
  • Slave nodes only contain minimal specific
    information (e.g. IP address)
  • Cloning can be done by pushing changes to the
    slave nodes
  • Upcoming standards (IP multicast) would keep
    network traffic low.
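
The cloning approach outlined above amounts to computing, for each slave node, the changes needed to match the reference node and then pushing them. A minimal sketch, assuming the node state can be reduced to a package-to-version map; node names, package names and versions are hypothetical, and node-specific data such as the IP address is deliberately left out of the comparison.

```python
def plan_sync(reference, slave):
    """Return the actions that make a slave's package set match the reference."""
    actions = []
    for name, version in sorted(reference.items()):
        if name not in slave:
            actions.append(("install", name, version))
        elif slave[name] != version:
            actions.append(("upgrade", name, version))
    for name in sorted(set(slave) - set(reference)):
        actions.append(("remove", name, slave[name]))
    return actions


def push(reference, slaves):
    """Compute one plan per slave node; identical nodes get an empty plan."""
    return {node: plan_sync(reference, packages) for node, packages in slaves.items()}


# Hypothetical example data.
REFERENCE = {"kernel": "2.4.18", "openssh": "3.1"}
SLAVES = {
    "node001": {"kernel": "2.4.18", "openssh": "3.1"},
    "node002": {"kernel": "2.4.9", "telnet": "0.17"},
}
```

Because every slave receives changes relative to the same reference, the same payload could be sent to many nodes at once, which is where multicast would help.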

12
Shortcomings (II)
  • Application Checkpointing Library
  • The current model lacks communication between
    user jobs and changes to the state of the
    execution environment
  • In case of urgent system interventions, running
    user jobs have to be terminated ungracefully
  • The HEP experiments do not yet offer application
    checkpointing
  • Experiments use different software architecture
    frameworks
  • Data access, persistency model and state
    management differ
  • There is a need to investigate how to offer a
    generic application checkpointing library
  • Checkpointing at user level
  • Flexible Plug-in design for interaction with
    applications
  • HEP experiments can help by:
  • providing requirements
  • giving their view on how/where to fit
    checkpointing into their software architecture
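
User-level checkpointing with a plug-in design could look roughly like this: the application implements a small save/restore interface, and a generic library writes and reads the state. This is a sketch of the idea only, since the library discussed above does not exist yet; the interface, the `EventLoop` stand-in for a user job and the JSON file format are all assumptions.

```python
import json
import os
import tempfile


class CheckpointPlugin:
    """Interface an application implements to take part in checkpointing."""

    def save_state(self):
        raise NotImplementedError

    def restore_state(self, state):
        raise NotImplementedError


class EventLoop(CheckpointPlugin):
    """Toy user job: processes events 0..n_events-1 and sums their numbers."""

    def __init__(self, n_events):
        self.n_events = n_events
        self.next_event = 0
        self.total = 0

    def step(self):
        self.total += self.next_event
        self.next_event += 1

    def done(self):
        return self.next_event >= self.n_events

    def save_state(self):
        return {"next_event": self.next_event, "total": self.total}

    def restore_state(self, state):
        self.next_event = state["next_event"]
        self.total = state["total"]


def write_checkpoint(app, path):
    """Write the state to a temporary file, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(app.save_state(), handle)
    os.replace(tmp, path)  # a crash never leaves a half-written checkpoint


def read_checkpoint(app, path):
    """Load a previously written state back into the application."""
    with open(path) as handle:
        app.restore_state(json.load(handle))
```

Before an urgent intervention, the fabric would ask the job to call `write_checkpoint`; after restart, a fresh job calls `read_checkpoint` and continues from the saved event.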

13
Shortcomings (III)
  • Quality of Service (QoS) Correlation Engines
  • The WP4 monitoring subsystem provides the
    necessary infrastructure for
  • Gathering raw monitoring data on farm nodes
  • storing and retrieving monitoring data in a
    central repository
  • framework for correlation engines for analyzing
    the raw data
  • Correlation Engines are needed, e.g., for:
  • Determining load peaks and averages (CPU, I/O) on
    batch services
  • Disk and tape access patterns for mass storage
  • Home directory access
  • Network access
  • This correlated data would allow for more precise
    SLAs based on Quality of Service criteria.
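
As an illustration of what a simple correlation engine might do with raw monitoring data, the sketch below derives per-node load averages and peaks and compares them with a service-level target. The record layout, metric name, sample values and threshold are hypothetical; the slides do not define the engine framework's interface.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples: (timestamp, node, metric, value).
SAMPLES = [
    (0, "node01", "cpu_load", 0.5),
    (60, "node01", "cpu_load", 1.5),
    (120, "node01", "cpu_load", 1.0),
    (0, "node02", "cpu_load", 3.0),
    (60, "node02", "cpu_load", 5.0),
    (0, "node02", "io_wait", 0.25),
]


def load_summary(samples, metric):
    """Group one metric by node and compute its average and peak."""
    per_node = defaultdict(list)
    for _, node, name, value in samples:
        if name == metric:
            per_node[node].append(value)
    return {
        node: {"average": mean(values), "peak": max(values)}
        for node, values in per_node.items()
    }


def meets_sla(summary, max_average):
    """Flag, per node, whether the average stays within the agreed limit."""
    return {node: stats["average"] <= max_average for node, stats in summary.items()}
```

Calling `meets_sla(load_summary(SAMPLES, "cpu_load"), max_average=2.0)` shows how derived figures, rather than raw samples, could back a Quality of Service criterion.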

14
Contacts and more information
  • DataGrid project
  • http://cern.ch/eu-datagrid
  • DataGRID WP4
  • http://cern.ch/hep-proj-grid-fabric