1
Lemon
  • Computer Monitoring at CERN
  • Miroslav Siket, German Cancio, David Front,
    Maciej Stepniewski
  • Presented by Harry Renshall
  • CERN-IT/FIO-FS

2
Outline
  • Lemon
  • Structure
  • Deployment at CERN
  • Use cases
  • Alarms
  • Web visualization
  • Summary

3
Lemon: LHC Era Monitoring
  • Lemon is a software package containing tools for
    monitoring status and performance of computers
  • Distributed monitoring system scalable to 10k
    nodes
  • Provides active monitoring of software and
    hardware in the Computer Center on centrally
    managed clusters
  • Facilitates early error detection and problem
    prevention
  • Provides persistent storage of the monitoring
    data
  • Executes corrective actions and sends
    notifications
  • Offers a framework for further creation of
    sensors for monitoring
  • Most of the functionality is site independent
  • It is used at CERN by:
  • System administrators, service managers, cluster
    responsibles
  • Developers and service/data challenges
  • Managers and general users
  • Link: http://cern.ch/lemon

4
Lemon - schema

5
Components
  • MSA - Monitoring Sensor Agent
  • Spawns multiple Monitoring Sensors (MS) to
    measure data in defined intervals and sends data
    to Monitoring Repository
  • MS - Monitoring Sensor
  • Uses a standard C and Perl API, so it is easy to
    write your own sensor
  • Several sensors exist for performance, process,
    hardware and software monitoring, grid VO job
    reporting, database monitoring, security and
    alarms (260 metrics in total)
  • MR - Monitoring Repository
  • Stores the full data history in an Oracle
    database, backed up to tape in Castor
  • Flat file version available as well (with most
    functionality preserved)
  • We run two repositories on two independent
    machines, with two databases and failover (aiming
    for High Availability with Oracle Real
    Application Clusters)
  • LRF - Lemon RRD Framework
  • Caches the data in an easily accessible form
    (RRD files) for web graphics
  • Together with the Quattor Configuration DB, it
    provides service and cluster overviews
  • RRD stands for Round Robin Database (time-aging
    data with predefined binning), developed by
    Tobias Oetiker at ETH Zurich
    (http://www.rrdtool.org)
  • LAG - Lemon Alarm Gateway
  • Generic gateway for alarms
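
The "time-aging data with predefined binning" idea behind RRD files can be illustrated with a small sketch. This is not RRDtool or Lemon code: the class name, the 60-second bin width and the three-slot size are assumptions chosen for the example. Samples are averaged into fixed-width time bins, and once the archive is full the oldest bin is discarded, so storage never grows.

```python
from collections import deque


class RoundRobinArchive:
    """Fixed-size archive: each slot holds the average of one time bin."""

    def __init__(self, bin_seconds, slots):
        self.bin_seconds = bin_seconds
        self.bins = deque(maxlen=slots)  # oldest bin drops off when full
        self._current_bin = None         # index of the bin being filled
        self._sum = 0.0
        self._count = 0

    def update(self, timestamp, value):
        """Add one sample; timestamps are assumed to arrive in order."""
        bin_index = int(timestamp // self.bin_seconds)
        if self._current_bin is not None and bin_index != self._current_bin:
            self._flush()
        self._current_bin = bin_index
        self._sum += value
        self._count += 1

    def _flush(self):
        """Close the bin being filled and store its average."""
        start = self._current_bin * self.bin_seconds
        self.bins.append((start, self._sum / self._count))
        self._sum = 0.0
        self._count = 0

    def closed_bins(self):
        """Return (bin_start, average) pairs, oldest first."""
        return list(self.bins)


# Hypothetical load samples as (timestamp in seconds, value).
archive = RoundRobinArchive(bin_seconds=60, slots=3)
for t, load in [(0, 1.0), (30, 3.0), (60, 2.0), (120, 4.0), (180, 6.0), (240, 8.0)]:
    archive.update(t, load)
print(archive.closed_bins())  # [(60, 2.0), (120, 4.0), (180, 6.0)]
```

The first bin (average 2.0 at t=0) has already aged out because only three slots are kept. Unlike real RRD files, this sketch does not fill bins that received no samples.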

6
Lemon at CERN
  • Lemon monitors about 2200 computers in 100
    clusters
  • On average it collects about 70 metrics from each
    host
  • Part of the ELFms tools
  • Integrated with Sure alarm system
  • Collecting about 1.5 GB/day
  • Integrated with CDB for configuration
  • Leaf (LHC-Era Automated Fabric) for scheduling of
    interventions
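
As a rough back-of-the-envelope check of the figures above, the daily volume can be divided by the number of hosts and metric series. The only assumption is that "GB" means decimal gigabytes; the slide does not say what the 1.5 GB/day includes, so the per-series figure is indicative only.

```python
hosts = 2200
metrics_per_host = 70
bytes_per_day = 1.5e9  # assumption: decimal gigabytes

series = hosts * metrics_per_host          # distinct host/metric pairs
bytes_per_host = bytes_per_day / hosts     # daily volume per host
bytes_per_series = bytes_per_day / series  # daily volume per metric series

print(series)                   # 154000
print(round(bytes_per_host))    # 681818
print(round(bytes_per_series))  # 9740
```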

[Diagram labels: Node, Configuration Management, Node Management]
  • Configuration
  • Derived from Configuration Database (CDB)
  • individual configuration per cluster/host
  • hierarchical structure
  • monitoring configuration is derived from CDB
  • Leaf tools allow scheduled downtimes,
    interventions, on demand changes
  • Alarm system
  • Sure: legacy system receiving alarms from Lemon
  • Integration with new LASER system (LHC alarm
    system) is ongoing

7
Computer Center Overview
  • Entry page displays status overview of the key
    services
  • Allows choosing the individual cluster, rack,
    host or other categories

8
Use(ful) cases (I)
Reboot occurrence history graph
  • Kernel upgrade
  • Kernel version is measured on the boot of the
    machine
  • Automatic tools for upgrading the kernel on a
    cluster retrieve information from Lemon and
    schedule reboot of a machine based on this info
  • Web interface allows monitoring of the progress
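
The kernel upgrade use case can be sketched as follows. This is an illustration only, not the actual CERN tooling: the host names, kernel version strings and function names are made up, and the dictionary stands in for the kernel versions that Lemon records at boot.

```python
def hosts_needing_reboot(running_kernels, target_kernel):
    """Return hosts whose kernel measured at boot differs from the target."""
    return sorted(h for h, k in running_kernels.items() if k != target_kernel)


def upgrade_progress(running_kernels, target_kernel):
    """Return the fraction of hosts already running the target kernel."""
    done = sum(1 for k in running_kernels.values() if k == target_kernel)
    return done / len(running_kernels)


# Hypothetical data: kernel version last measured on each host.
measured = {
    "node01": "2.4.21-20",
    "node02": "2.4.21-27",
    "node03": "2.4.21-20",
    "node04": "2.4.21-20",
}
print(hosts_needing_reboot(measured, "2.4.21-27"))  # ['node01', 'node03', 'node04']
print(upgrade_progress(measured, "2.4.21-27"))      # 0.25
```

An upgrade tool could schedule reboots for the first list, while the second value is the kind of progress figure a web page could display.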

9
Use(ful) cases (II)
  • Searching for a host
  • High load, network usage,
  • Metric distributions allow identification of
    hosts with problematic performance
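
One simple way to pick problematic hosts out of a metric distribution is to flag values far above the median. The sketch below uses the median absolute deviation (MAD) as the spread measure; this choice, the factor of 3 and the host names are assumptions for illustration, not a description of how the Lemon web interface does it.

```python
from statistics import median


def outlier_hosts(metric_by_host, factor=3.0):
    """Return hosts whose value lies more than factor * MAD above the median.

    If the MAD is zero, every host above the median is returned.
    """
    values = list(metric_by_host.values())
    centre = median(values)
    mad = median(abs(v - centre) for v in values)
    threshold = centre + factor * mad
    return sorted(h for h, v in metric_by_host.items() if v > threshold)


# Hypothetical load averages for five hosts.
loads = {"node01": 0.8, "node02": 1.1, "node03": 0.9, "node04": 12.5, "node05": 1.0}
print(outlier_hosts(loads))  # ['node04']
```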

10
Integration of Web interface
  • Through various plug-ins, the web interface has
    been adapted to accommodate additional
    information and links that help manage the
    computer center
  • Examples:
  • Configuration database browser (browses external
    XML config files)
  • ITCM (Remedy) tickets: external error tracking
    database
  • CC tracker (synoptic view of the computer
    center): XML-defined geometry
  • Alarm display
  • Metric information display
  • Raw data grapher (JpGraph)
  • External functionalities are customizable

11
Computer Center display
  • Lemon Web Interface is interfaced with Computer
    Center database of objects
  • Provides search of objects as well as listing
  • Interfaced through the XML defined geometry of
    the computer center
  • Generic design

12
Automatic recovery actions
  • Alarm Sensor
  • For defined values of measured metrics an
    actuator is called with predefined action
  • An example: ssh daemon dead; action:
    /sbin/service sshd start
  • Definition: metric X, field Y != reference value
    Z => call actuator
  • If success, log only
  • Else call action up to max times
  • Each occurrence is logged in the Monitoring
    Repository
  • Already about 70 predefined alarms with automatic
    recovery actions
  • After the first month of deployment, it had
    halved the number of problem tickets
  • Correlation engine
  • Allows wide definition of alarms and recovery
    actions (in development)
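
The alarm and actuator logic described on this slide can be sketched as below. It is a minimal illustration under stated assumptions, not the Lemon alarm sensor: the function and variable names are invented, the list `log` stands in for the Monitoring Repository, and `restart_sshd` only simulates "/sbin/service sshd start" (it pretends the first attempt fails).

```python
import operator

COMPARATORS = {"!=": operator.ne, "==": operator.eq, ">": operator.gt, "<": operator.lt}


def check_alarm(value, comparator, reference, actuator, max_attempts, log):
    """Call the actuator up to max_attempts times while the alarm condition holds.

    value is a callable that re-measures the metric field.
    actuator is a callable that attempts the recovery action.
    Returns True if the alarm condition is clear at the end.
    """
    condition = COMPARATORS[comparator]
    if not condition(value(), reference):
        return True  # no alarm, nothing to do
    for attempt in range(1, max_attempts + 1):
        actuator()
        recovered = not condition(value(), reference)
        log.append((attempt, "recovered" if recovered else "still failing"))
        if recovered:
            return True
    return False


# Simulated host state: no sshd process is running.
state = {"sshd_processes": 0, "calls": 0}


def restart_sshd():
    # Stand-in for "/sbin/service sshd start"; the first try has no effect.
    state["calls"] += 1
    if state["calls"] >= 2:
        state["sshd_processes"] = 1


log = []
ok = check_alarm(lambda: state["sshd_processes"], "!=", 1, restart_sshd, 3, log)
print(ok, log)  # True [(1, 'still failing'), (2, 'recovered')]
```

Here the alarm fires because the measured value differs from the reference value 1, the actuator is called twice, and each attempt is logged.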

13
Remedy Ticket tracking
ITCM (Remedy) tickets occurrence
  • Error trending metric with values on number of
    interventions/occurrences of problems
  • Several categories created by
  • Hardware
  • Software
  • Clustered by contract type/cluster
  • Reports whether problems were scheduled or not
    and whether the system was rebooted
  • Allows tracking of interventions per type of
    problem
  • Web interface to show the trend

14
Database (Oracle) Monitoring
  • In cooperation with the ADC group at CERN, we
    have developed a sensor for measuring performance
    entities in an Oracle database
  • Number of logons, cursors, logical and physical
    I/O, user commits, index usage, parse statistics,
  • Allows identification of bottlenecks and gives
    overview of the stability of the system
  • Works with both the 9i and 10g versions of Oracle
  • Integration into services/RAC
  • Configuration of service integrated with Oracle
    Enterprise Repository

15
Service challenges, GRID VOs
  • Lemon allows:
  • Virtual clusters
  • Clusters defined on request by service managers
  • Or defined by scripts updated dynamically on
    demand
  • Or defined for a specific purpose
  • An example: Atlas DC04 challenge, network
    challenges,
  • Clusters defined dynamically
  • An example: hosts on the batch cluster running
    GRID jobs that belong to the given Virtual
    Organization
  • Provides hooks in Lemon for defining any dynamic
    grouping of hosts
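
A dynamically defined cluster of this kind amounts to regrouping hosts whenever the underlying data changes. The sketch below groups hosts by the Virtual Organization of the jobs they run; the host names, VO names and the (host, VO) input format are assumptions for illustration, not Lemon's actual hook interface.

```python
from collections import defaultdict


def group_hosts_by_vo(running_jobs):
    """Map each VO to the sorted list of hosts currently running its jobs."""
    clusters = defaultdict(set)
    for host, vo in running_jobs:
        clusters[vo].add(host)
    return {vo: sorted(hosts) for vo, hosts in clusters.items()}


# Hypothetical snapshot of batch jobs as (host, VO) pairs.
jobs = [
    ("node01", "atlas"),
    ("node02", "cms"),
    ("node01", "cms"),
    ("node03", "atlas"),
]
print(group_hosts_by_vo(jobs))
# {'atlas': ['node01', 'node03'], 'cms': ['node01', 'node02']}
```

A host can appear in more than one virtual cluster, as node01 does here, and rerunning the function on a fresh snapshot redefines the clusters.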

16
Summary
  • Lemon serves to provide monitoring information
    about the computers in the Computer Center at
    CERN
  • Thanks to its integration with Sure (alarm
    system) it allows fast and easy identification
    and repair of problems. We will convert to a new
    accelerator alarm system this year (LASER).
    Lemon provides LAG (Lemon Alarm Gateway) to feed
    alarms into arbitrary alarm systems.
  • Together with CDB, it allows an easier overview
    of services and visualisation of their
    performance
  • Together with Remedy (ITCM problem tracking), it
    allows an overview of the problems for a given
    service
  • It has been a useful tool for general monitoring
    of performance and also for system administrators
    in debugging problems
  • Lemon is also used and developed elsewhere: BARC
    institute in India, Accelerator department at
    CERN, CMS is adopting it for its online farm
    monitoring,
  • Lemon is used for GridICE and can provide data to
    MonALISA