NGOP Overview - PowerPoint PPT Presentation

Slides: 24
Provided by: jimf8
Learn more at: https://www.jlab.org
Transcript and Presenter's Notes

1
NGOP Overview
  • Jim Fromm
  • Farms and Clustered Systems Group
  • Computing Division
  • Fermilab

2
People
  • Integrated Systems Development Department
      • Don Petravick
      • Krzysztof Genser
      • Jim Fromm
      • Tanya Levshina
      • Igor Mandrichenko
      • Terry Jones
  • Operating Systems Support Dept.
      • Troy Dawson
      • Lisa Giachetti
      • Ken Schumacher
      • Marc Mengel
  • Computing Services Dept.
      • Jeff Mack
      • Rick Thies
      • Rich Thompson

3
Goals
  • NGOP working group charged with the task of
    developing a Distributed Management System (DMS)
    that would scale to the anticipated requirements
    for Run II farms.
  • The future size of the farms requires that the DMS be
    pro-active. The system should take corrective
    action when possible.
  • Must detect hardware, system, and application
    problems.
  • Problem diagnostics should eliminate noise, or
    false alarms.
  • Should provide tools to do performance analysis.

4
NGOP History
  • Summer 1999: NGOP group created to gather
    requirements for a Distributed
    Management System capable of efficiently
    monitoring the Fermilab computing facility for Run
    II.
  • Sept 1999: Requirement gathering completed.
  • Dec 1999: Evaluation of available products
    presented.
  • Jan 2000: Decision made to develop a custom DMS.
  • Today: Development of prototype underway.
    Completion is expected before year end.

5
We are not alone
  • As computer farms get larger, other HEP sites are
    looking at a similar problem.
  • March 2000: CERN and BNL visited Fermilab to
    exchange ideas on lessons learned. SLAC, JLAB,
    and IN2P3 participated via video conference.
  • July 2000: Fermilab visited CERN to follow up on
    the March meetings.

6
Some Terminology
  • Monitored Object: One of the following:
      • Host: A computer identified by its full domain
        name.
      • Cluster: A collection of hosts.
      • Component: An atomic element that has a
        well-defined behavior.
      • System: A collection of components.
  • Condition: A pre-defined state of a Monitored
    Object.
  • Event: A description of a detected condition.
  • Action: An activity initiated by the NGOP system
    based on an event.
  • Alarm: An asynchronous indicator initiated by
    NGOP.
  • Status: Shows the level of functionality of the
    monitored element.
  • Monitoring Agent: A software component that
    generates events based on conditions and performs
    actions.

7
NGOP Requirements
  • Essential Features
      • Should detect hardware, network, system, and
        application errors:
          • System daemon status (inetd, mbatchd).
          • Unreachable hosts.
          • Security breaches.
          • /tmp full.
      • Should run on all Fermilab supported operating
        systems.
      • Scalable to 1000s of hosts.
      • Must be multi-user, must support different
        authorization levels.
      • Provide an interface for user-written monitoring
        tools.
      • Generate different levels of alarms
        (Warning, Info, etc.).
      • Perform actions based on alarms and events
        (email, page, restart daemon).
      • Provide a hierarchical view of the monitored
        system.
      • Dynamic configuration.
      • Provide monitoring capabilities via a web
        browser, GUI, and command line interface.
      • Provide special states for monitored objects such
        as known bad.
  • Desirable Features
      • Ability to have overlapping clusters.
      • Ability to generate reports based on selection
        criteria.

8
Product Evaluation
  • Some Evaluated Products
      • Patrol
          • Not scalable for centralized monitoring.
          • One level of hierarchy.
          • No overlapping clusters.
          • No filtering of events.
          • No GUI/UI.
      • Tkined/Scotty
          • Not scalable for multiple users.
          • System monitored only while the GUI is running.
          • Only one level of alarms.
      • Nocol
          • No notion of hierarchy or clusters.
          • Web and GUI (curses) interfaces have limited
            customization.
          • Very limited filtering of events.
      • Netlogger
          • Limited off-the-shelf functionality.
          • No customization for monitoring agents.
          • Very limited way to create hierarchy.

9
Product Evaluation Summary
  • Many commercial and open-source products try to
    solve the problem in many different ways.
  • None of the evaluated products met the basic
    requirements at Fermilab.
  • Discussions with others who chose the commercial
    route were not encouraging. Many bad experiences
    were documented.
  • The decision was made to develop our own custom DMS.

10
Design Summary: Key System Components
  • Monitoring Agent: Monitors a monitored
    object and generates events based on certain
    conditions.
  • Sensor Agent: Similar to a monitoring agent, but
    this process collects performance data and
    generates events at a higher rate than a
    monitoring agent.
  • NGOP Central Server (NCS): The central daemon
    process that gathers events from MAs, provides
    users with requested information, and dumps
    persistent data into the Archive Server.
  • NGOP Configuration File Management Service:
    Provides a mechanism to centrally locate system
    configuration and rules. Allows for dynamic
    reconfiguration of the system.
  • Archive Server: Daemon that handles archive
    storage. Provides a means to write, read, and
    query the data.
  • Monitoring Client: Communicates with the NCS using
    an API to display system status in a meaningful
    manner.

11
NGOP Architecture
  • [Architecture diagram; only its labels survive in this
    transcript.]
  • Components labeled: Central Server, Configuration File
    Management Service, Persistent Config. Data, Archive
    Service, Archive, Performance Storage Service, Performance
    Data, Router, Monitor, Administrator, Action Client, Report
    Generator, Data Analyzer.
  • Monitoring Agents (MA) and Sensor Agents (s) are drawn
    across Cluster A, Cluster B, Cluster B1, and Cluster B2.
  • Legend: Monitored Objects (Host, Element, Cluster, System),
    NGOP Components (Sensor Agent, Server, Monitoring Agent),
    Monitoring Data Storage, Clients, Connections (TCP, UDP,
    connection between Monitored Element and MA), Not
    implemented in prototype yet.

12
Monitoring Agents: The Hook into NGOP
  • A monitoring agent (MA) is the process that
    monitors an object and generates events when a
    condition is met. A message describing this
    event is sent to the NGOP Central Server (NCS).
  • NGOP defines the protocol to exchange information
    with the central server.
  • A set of basic MAs will be deployed with the
    NGOP system; users are free to write their own.
  • An API (C, C++, Perl, Python) will be provided to
    allow for development of MAs.
  • MAs should send info to the NCS:
      • When current characteristics of a monitored
        object meet a condition.
      • When the condition is no longer satisfied.
      • Periodically, as heartbeat messages that let the
        NCS know the MA is still alive.
  • Examples
      • Monitor whether or not a batch system is running.
      • Monitor the size of a file system, issuing alarms
        when it is 90% full.
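
The reporting rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not show the NGOP protocol or API, so the class name `FileSystemAgent`, the message dictionaries, the `send` callback, and the 60-second heartbeat default are all hypothetical.

```python
import shutil
import time
from typing import Callable, Dict, Optional

Message = Dict[str, object]


class FileSystemAgent:
    """Hypothetical MA: watches one file system and reports state changes."""

    def __init__(self, path: str, send: Callable[[Message], None],
                 threshold: float = 0.90, heartbeat_secs: float = 60.0):
        self.path = path
        self.send = send                  # stand-in for the real NGOP transport
        self.threshold = threshold        # used fraction that triggers the event
        self.heartbeat_secs = heartbeat_secs
        self.condition_met = False
        self.last_heartbeat: Optional[float] = None

    def used_fraction(self) -> float:
        usage = shutil.disk_usage(self.path)
        return usage.used / usage.total

    def poll(self, now: Optional[float] = None,
             fraction: Optional[float] = None) -> None:
        """Run one monitoring cycle; `fraction` can be injected for testing."""
        now = time.time() if now is None else now
        fraction = self.used_fraction() if fraction is None else fraction
        met = fraction >= self.threshold
        if met != self.condition_met:
            # Report only transitions: condition met, or no longer satisfied.
            self.condition_met = met
            self.send({"type": "event", "object": self.path,
                       "condition": "fs_full", "active": met})
        if (self.last_heartbeat is None
                or now - self.last_heartbeat >= self.heartbeat_secs):
            self.last_heartbeat = now
            self.send({"type": "heartbeat", "time": now})
```

Calling `poll()` in a loop with a short sleep would give a working agent; passing `print` as `send` is enough to watch the messages it would emit.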

13
Sensor Agents
  • Sensor Agents send performance data to the
    Performance Storage Service.
  • The rate of this data is expected to be much
    higher than that of the MAs.
  • Examples
      • Monitor the temperature of a computer every
        second.
      • Monitor the CPU utilization continuously.
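
A sensor agent's fixed-rate sampling loop might look like the sketch below. The function names and the `(timestamp, metric, value)` tuple layout are assumptions made for illustration; how samples actually reach the Performance Storage Service is not described in the slides.

```python
import os
import time
from typing import Callable, Iterator, Tuple

Sample = Tuple[float, str, float]   # (timestamp, metric name, value)


def read_load_average() -> float:
    """1-minute load average; os.getloadavg() is available on Unix only."""
    return os.getloadavg()[0]


def sample(metric: str, read: Callable[[], float], count: int,
           interval_secs: float = 1.0) -> Iterator[Sample]:
    """Yield `count` timestamped readings, one every `interval_secs`."""
    for i in range(count):
        yield (time.time(), metric, read())
        if i < count - 1:
            time.sleep(interval_secs)
```

For example, `list(sample("load1", read_load_average, 5))` collects five one-second readings on a Unix host.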

14
NGOP Central Server
  • NCS is the process that gets messages sent from
    MAs, stores them via the Archive Server, and
    provides monitoring clients (GUI for example)
    requested information.
  • One instance of the NCS will be running in the
    system.
  • NCS must handle many (10,000) MAs and 50
    clients.
  • NCS should:
      • Update object characteristics when an MA reports a
        change.
      • Determine if an MA is dead, and forward this info
        along to the relevant monitoring client.
      • Forward event and action messages to the Archive
        Server.
      • Forward event messages to subscribed monitoring
        clients.
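
The NCS duties listed on this slide can be modeled in memory as follows. This is a hedged sketch, not the NGOP implementation: `CentralServerSketch`, the message fields, the callback-based archive and clients, and the 180-second timeout are hypothetical, and a real server would also push dead-agent notices to clients instead of only listing them.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Message = Dict[str, object]


class CentralServerSketch:
    """Hypothetical in-memory model of the NCS duties listed above."""

    def __init__(self, archive: Callable[[Message], None],
                 timeout_secs: float = 180.0):
        self.archive = archive                  # stand-in for the Archive Server
        self.timeout_secs = timeout_secs        # silence after which an MA is presumed dead
        self.last_seen: Dict[str, float] = {}   # MA id -> time of last message
        self.state: Dict[str, Message] = {}     # monitored object -> latest event
        self.subscribers: Dict[str, List[Callable[[Message], None]]] = defaultdict(list)

    def subscribe(self, obj: str, client: Callable[[Message], None]) -> None:
        self.subscribers[obj].append(client)

    def receive(self, ma_id: str, msg: Message, now: float) -> None:
        self.last_seen[ma_id] = now             # any message proves the MA is alive
        if msg["type"] == "heartbeat":
            return
        obj = str(msg["object"])
        self.state[obj] = msg                   # update object characteristics
        self.archive(msg)                       # forward to the archive
        for client in self.subscribers.get(obj, []):
            client(msg)                         # forward to subscribed clients

    def dead_agents(self, now: float) -> List[str]:
        """MAs that have been silent for longer than the timeout."""
        return sorted(ma for ma, seen in self.last_seen.items()
                      if now - seen > self.timeout_secs)
```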

15
NGOP Configuration File Management Service
  • Responsible for providing a central repository
    for system configuration and monitoring rules.
  • Allows for dynamic reconfiguration of the system.
  • Configuration files are written in XML.
  • Central repository is implemented using CVS in
    the prototype.
  • Only authorized users can update.
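
As an illustration of XML-based configuration, the sketch below parses a small file with Python's standard `xml.etree.ElementTree`. The element and attribute names (`ngop`, `cluster`, `host`, `rule`, and so on) are invented for this example; the actual NGOP schema is not shown in the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real NGOP schema may differ entirely.
EXAMPLE_CONFIG = """
<ngop>
  <cluster name="farmA">
    <host name="node01.example.org"/>
    <host name="node02.example.org"/>
  </cluster>
  <rule object="node01.example.org" condition="fs_full"
        status="Bad" alarm="Warning"/>
</ngop>
"""


def load_config(text: str):
    """Return (clusters, rules) parsed from the hypothetical schema above."""
    root = ET.fromstring(text.strip())
    clusters = {c.get("name"): [h.get("name") for h in c.findall("host")]
                for c in root.findall("cluster")}
    rules = [dict(r.attrib) for r in root.findall("rule")]
    return clusters, rules
```

Because the configuration is plain text, keeping it in a version control system such as CVS gives change history and access control without extra tooling.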

16
Rules
  • Rules define the status and the alarm level
    associated with monitored objects.
  • Rules describe the condition that should be
    satisfied in order for a monitored object to have
    a given status and alarm level.
  • Master rules are stored in the Configuration File
    Management Service (CFMS).
  • Users can create their own rules and store them
    locally. Users with permission can store these
    rules in the CFMS.
  • Dependency rules are a mechanism to filter out
    noise. For example, a batch system can be
    dependent on the power supply. If the power goes
    out on a machine, no separate alarm is raised
    for the batch system being down.
  • Alarm/Action rules define the condition that will
    cause an alarm/action to be performed.
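
The noise filtering that dependency rules provide can be sketched as below. The function name and the dictionary-based dependency map are assumptions for illustration; the slides do not say how NGOP evaluates these rules.

```python
from typing import Dict, List, Set


def alarms_to_raise(failed: Set[str],
                    depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return failed objects whose failure is not explained by a failed dependency."""

    def has_failed_dependency(obj: str, seen: Set[str]) -> bool:
        for dep in depends_on.get(obj, []):
            if dep in seen:
                continue                 # guard against dependency cycles
            if dep in failed or has_failed_dependency(dep, seen | {dep}):
                return True
        return False

    return {obj for obj in failed if not has_failed_dependency(obj, {obj})}
```

With `depends_on = {"batch": ["power"]}`, a simultaneous failure of both objects yields an alarm for the power supply only, while a batch failure on its own is still reported.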

17
Monitoring Clients
  • Monitoring clients will be developed with an API
    that allows determination of the status of each
    node in a hierarchy, based on rules and current
    information obtained from the NCS.
  • Monitoring clients will initiate action requests.
  • Monitoring clients determine the state of the
    system and monitored elements based on
    information gathered from the NCS.
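
One way a client could derive the status of each node in a hierarchy is to let a group take the worst status among its children. That roll-up rule, the severity ordering, and the function name are assumptions for this sketch; the slides only say that status is determined from rules and NCS information.

```python
from typing import Dict, List

# Assumed ordering, least to most severe; status names follow the NGOP Monitor legend.
SEVERITY = {"Undefined": 0, "Good": 1, "Warning": 2, "Bad": 3}


def rolled_up_status(node: str, children: Dict[str, List[str]],
                     leaf_status: Dict[str, str]) -> str:
    """Status of `node`: its own status if it is a leaf, else the worst child status."""
    kids = children.get(node, [])
    if not kids:
        return leaf_status.get(node, "Undefined")
    statuses = [rolled_up_status(k, children, leaf_status) for k in kids]
    return max(statuses, key=SEVERITY.__getitem__)
```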

18
Archiver/Performance Storage Service
  • The Archive/Performance Storage Service (PSS) is
    responsible for storing and retrieving messages
    generated by the NGOP system. These messages
    represent event, sensor, or action data.
  • Components
      • Archive Server
      • Archive Retriever
      • Performance Storage Subsystem (PSS)
      • PSS Retriever
      • Archive Database Interface
      • Database (Oracle)
      • DBArchiver
  • The PSS is simply another instance of the Archive
    Server.
  • Performance data will need to be consolidated.
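
The slides do not say how performance data will be consolidated. One common approach, shown here purely as an assumption, is to average high-rate samples into fixed-width time buckets; the function name and tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def consolidate(samples: Iterable[Tuple[float, float]],
                bucket_secs: float) -> List[Tuple[float, float]]:
    """Average (timestamp, value) samples into fixed-width time buckets."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        buckets[int(timestamp // bucket_secs)].append(value)
    # Each output row is (bucket start time, mean value in that bucket).
    return [(index * bucket_secs, sum(vals) / len(vals))
            for index, vals in sorted(buckets.items())]
```

Averaging loses peaks, so a production system might keep the minimum and maximum per bucket as well.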

19
NGOP Prototype
  • NGOP prototype development is currently underway.
    The prototype consists of the following modules:
      • NGOP Central Server
      • Configuration File Management Service
      • Monitoring Agents
          • OS Health: Monitors specific system daemons, file
            system existence and size, CPU load, and free
            memory.
          • Ping Agent: Monitors node reachability.
          • FBSNG Agent: Monitors the FBSNG batch system.
      • NGOP Client API
          • Determines the status of each monitored
            element based on pre-defined rules and current
            information received from the NGOP Central Server.
      • NGOP Monitor
          • Graphical representation of monitored element
            status.
          • Provides means to see and acknowledge events
            and alarms that have occurred.
          • Provides limited configuration options.
      • Archive Server
          • Stores event and action messages to local disk.
          • The Archive Database Interface moves the messages
            from local disk to an Oracle database.

20
NGOP Monitor
  • [Screenshot of the NGOP Monitor. Labels: Alarm; Status
    (Bad, Warning, Good, Undefined); Event description.]

21
NGOP Monitor (Event Acknowledgment, Known-Status Modification)
  • [Screenshot. Label: Monitored Element Info.]

22
NGOP Monitor (Configuration Options)
  • Default icons for known object types
  • Default colors for status representation
  • Selecting elements for top level display

23
Summary
  • Building a DMS is a complex problem.
  • Various commercial and open source systems were
    analyzed. None met the basic requirements for
    the NGOP project at Fermilab.
  • Prototype system is under development.
  • See http://www-isd.fnal.gov/ngop for project
    details.