Title: NGOP Overview
1NGOP Overview
- Jim Fromm
- Farms and Clustered Systems Group
- Computing Division
- Fermilab
2People
- Integrated Systems Development Department
- Don Petravick
- Krzysztof Genser
- Jim Fromm
- Tanya Levshina
- Igor Mandrichenko
- Terry Jones
- Operating Systems Support Dept.
- Troy Dawson
- Lisa Giachetti
- Ken Schumacher
- Marc Mengel
- Computing Services Dept.
- Jeff Mack
- Rick Thies
- Rich Thompson
3Goals
- The NGOP working group is charged with developing a Distributed Management System (DMS) that would scale to the anticipated requirements for Run II farms.
- The future size of the farms requires that the DMS be pro-active. The system should take corrective action when possible.
- Must detect hardware, system, and application problems.
- Problem diagnostics should eliminate noise, or false alarms.
- Should provide tools to do performance analysis.
4NGOP History
- Summer 1999: NGOP group created to gather requirements for a Distributed Management System capable of efficiently monitoring the Fermilab computing facility for Run II.
- Sept 1999: Requirement gathering completed.
- Dec 1999: Evaluation of available products presented.
- Jan 2000: Decision to develop a custom DMS made.
- Today: Development of prototype underway. Completion is expected before year end.
5We are not alone
- As computer farms get larger, other HEP sites are looking at a similar problem.
- March 2000: CERN and BNL visited Fermilab to exchange ideas on lessons learned. SLAC, JLAB, and IN2P3 participated via video conference.
- July 2000: Fermilab visited CERN to follow up on the March meetings.
6Some Terminology
- Monitored Object is one of the following:
- Host: A computer identified by its full domain name.
- Cluster: A collection of hosts.
- Component: An atomic element that has a well defined behavior.
- System: A collection of components.
- Condition: A pre-defined state of a Monitored Object.
- Event: A description of a detected condition.
- Action: An activity initiated by the NGOP system based on an event.
- Alarm: An asynchronous indicator initiated by NGOP.
- Status: Shows the level of functionality of the monitored element.
- Monitoring Agent: A software component that generates events based on conditions and performs actions.
7NGOP Requirements
- Essential Features
- Should detect hardware, network, system, and application errors.
- System daemon status (inetd, mbatchd).
- Unreachable hosts.
- Security breaches.
- /tmp full.
- Should run on all Fermilab supported operating systems.
- Scalable to 1000s of hosts.
- Must be multi-user, must support different authorization levels.
- Provide an interface for user written monitoring tools.
- Generate different levels of alarms (Warning, Info, etc).
- Perform actions based on alarms and events (email, page, restart daemon).
- Provide a hierarchical view of the monitored system.
- Dynamic configuration.
- Provide monitoring capabilities via a web browser, GUI, and command line interface.
- Provide special states for monitored objects such as known bad.
- Desirable Features
- Ability to have overlapping clusters.
- Ability to generate reports based on selection criteria.
8Products Evaluation
- Some Evaluated Products
- Patrol
- Not scalable for centralized monitoring
- One level of hierarchy
- No overlapping clusters
- No filtering of events
- No GUI/UI
- Tkined/Scotty
- Not scalable for multiple users
- System monitored only while GUI running
- Only one level of alarms
- Nocol
- No notion of hierarchy or clusters.
- Web and GUI (curses) interfaces have limited customization.
- Very limited filtering of events
- Netlogger
- Limited off-the-shelf functionality
- No customization for monitoring agents
- Very limited way to create hierarchy.
9Product Evaluation Summary
- Many commercial and open-source products try to solve the problem in many different ways.
- None of the evaluated products met the basic requirements at Fermilab.
- Discussions with others who chose the commercial route were not encouraging. Many bad experiences were documented.
- Decision was made to develop our own custom DMS.
10Design Summary Key System Components
- Monitoring Agent (MA): Monitors a monitored object and generates events based on certain conditions.
- Sensor Agent: Similar to a monitoring agent, but this process collects performance data and generates events at a higher rate than a monitoring agent.
- NGOP Central Server (NCS): The central daemon process that gathers events from MAs, provides users with requested information, and dumps persistent data into the Archive Server.
- NGOP Configuration File Management Service: Provides a mechanism to centrally locate system configuration and rules. Allows for dynamic reconfiguration of the system.
- Archive Server: Daemon that handles archive storage. Provides a means to write, read, and query the data.
- Monitoring Client: Communicates with the NCS using an API to display system status in a meaningful manner.
11NGOP Architecture
(Figure: NGOP architecture diagram. Labeled elements: Cluster A, Cluster B, Cluster B1, Cluster B2, repeated "MA" and "s" node labels, Central Server, Configuration File Management Service, Persistent Config. Data, Archive Service, Archive, Performance Storage Service, Performance Data, Monitor, Administrator, Action Client, Report Generator, Data Analyzer, Router. Legend entries: Monitored Objects (Host, Element, Cluster, System); NGOP Components (Sensor Agent, Server, Monitoring Agent, Monitoring Data Storage, Clients); Connections (TCP, UDP, connection between Monitored Element and MA); Not implemented in prototype yet.)
12Monitoring Agents The hook into NGOP
- The monitoring agent (MA) is the process that monitors an object and generates events when a condition is met. A message describing this event is sent to the NGOP Central Server (NCS).
- NGOP defines the protocol to exchange information with the central server.
- A set of basic MAs will be deployed with the NGOP system; users are free to write their own.
- An API (C, C++, Perl, Python) will be provided to allow for development of MAs.
- MAs should send info to the NCS:
- When current characteristics of a monitored object meet a condition.
- When the condition is no longer satisfied.
- Heartbeat messages are sent periodically to let the NCS know the MA is still alive.
- Examples
- Monitor whether or not a batch system is running.
- Monitor the size of a file system, issuing alarms when it is 90% full.
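The reporting pattern above (an event when a condition is met, another when it clears, plus periodic heartbeats) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `FileSystemAgent`, the message fields, and the `send` callback are hypothetical and do not represent the actual NGOP API or protocol.

```python
import shutil
import time


def disk_used_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the file system holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


class FileSystemAgent:
    """Hypothetical MA that watches how full a file system is."""

    def __init__(self, path, send, threshold=0.90,
                 heartbeat_interval=60.0, probe=disk_used_fraction):
        self.path = path
        self.send = send                  # stand-in for the transport to the NCS
        self.threshold = threshold        # 0.90 matches the "90% full" example
        self.heartbeat_interval = heartbeat_interval
        self.probe = probe                # replaceable so the agent can be tested
        self.condition_active = False
        self.last_heartbeat = None

    def poll(self, now=None):
        """Check the condition once and send any messages that are due."""
        now = time.time() if now is None else now
        full = self.probe(self.path) >= self.threshold
        # Report only transitions: condition met, or condition no longer met.
        if full != self.condition_active:
            self.condition_active = full
            self.send({"type": "event", "object": self.path,
                       "condition": "fs_full",
                       "state": "set" if full else "cleared",
                       "time": now})
        # Heartbeat so the central server knows this agent is still alive.
        if (self.last_heartbeat is None
                or now - self.last_heartbeat >= self.heartbeat_interval):
            self.last_heartbeat = now
            self.send({"type": "heartbeat", "time": now})
```

A real agent would call `poll()` in a loop with a sleep between iterations and would replace `send` with whatever transport the NGOP protocol defines.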
13Sensor Agents
- Sensor Agents send performance data to the Performance Storage Service.
- The rate of this data is expected to be much higher than that of the MAs.
- Examples
- Monitor the temperature of a computer every second.
- Monitor the CPU utilization continuously.
14NGOP Central Server
- NCS is the process that gets messages sent from MAs, stores them via the Archive Server, and provides monitoring clients (GUI for example) with requested information.
- One instance of the NCS will be running in the system.
- NCS must handle many (10,000) MAs, and 50 clients.
- NCS should:
- Update object characteristics when an MA reports a change.
- Determine if an MA is dead, and forward this info along to the relevant monitoring client.
- Forward event and action messages to the Archive Server.
- Forward event messages to subscribed monitoring clients.
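One way the NCS might decide that an MA is dead is to track the time of each agent's last heartbeat and flag agents that have been silent for too long. The sketch below assumes this timeout approach; the `HeartbeatTracker` class, the agent identifiers, and the timeout value are hypothetical and are not taken from the NGOP design.

```python
class HeartbeatTracker:
    """Hypothetical bookkeeping for MA liveness inside a central server."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before an MA counts as dead
        self.last_seen = {}         # agent id -> time of the last heartbeat

    def record(self, agent_id, now):
        """Note that a heartbeat from agent_id arrived at time now."""
        self.last_seen[agent_id] = now

    def dead_agents(self, now):
        """Return the ids of agents whose last heartbeat is older than the timeout."""
        return sorted(agent for agent, seen in self.last_seen.items()
                      if now - seen > self.timeout)
```

With a timeout of a couple of heartbeat intervals, a single lost message does not mark an agent as dead.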
15NGOP Configuration File Management Service
- Responsible for providing a central repository for system configuration and monitoring rules.
- Allows for dynamic reconfiguration of the system.
- Configuration files are written in XML.
- Central repository is implemented using CVS in the prototype.
- Only authorized users can update.
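As an illustration of XML-based configuration, the sketch below parses a small cluster and host description with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`ngop_config`, `cluster`, `host`, `name`) and the host names are invented for this example; the actual NGOP configuration schema is not described here.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration layout, not the real NGOP schema.
EXAMPLE_CONFIG = """
<ngop_config>
  <cluster name="farm_a">
    <host name="node1.example.org"/>
    <host name="node2.example.org"/>
  </cluster>
  <cluster name="farm_b">
    <host name="node3.example.org"/>
  </cluster>
</ngop_config>
"""


def load_clusters(xml_text):
    """Return a dict mapping each cluster name to its list of host names."""
    root = ET.fromstring(xml_text.strip())
    return {cluster.get("name"): [host.get("name")
                                  for host in cluster.findall("host")]
            for cluster in root.findall("cluster")}
```

Calling `load_clusters(EXAMPLE_CONFIG)` yields the cluster-to-hosts mapping that a monitoring client or agent could use to decide what to watch.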
16Rules
- Rules define the status and the alarm level associated with monitored objects.
- Rules describe the condition that should be satisfied in order for a monitored object to have a given status and alarm level.
- Master rules are stored in the Configuration File Management Service (CFMS).
- Users can create their own rules and store them locally. Users with permission can store these rules in the CFMS.
- Dependency rules are a mechanism to filter out noise. For example, a batch system can be dependent on the power supply. If the power goes out on a machine, the fact that the batch system is down will not be raised.
- Alarm/Action rules define the condition that will cause an alarm to be raised or an action to be performed.
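The dependency-rule idea can be sketched as a filter over the set of failing components: a failure is suppressed when something it depends on is also failing. The function name and the dictionary representation of the rules are assumptions made for this example, not the NGOP rule format.

```python
def filter_by_dependencies(down, depends_on):
    """Return the failures that should actually be raised.

    down:       set of component names that are currently failing
    depends_on: dict mapping a component to the component it depends on
    A failure is suppressed if anything in its dependency chain is also down.
    """
    raised = set()
    for component in down:
        seen = {component}                  # guards against cyclic rules
        parent = depends_on.get(component)
        suppressed = False
        while parent is not None and parent not in seen:
            if parent in down:
                suppressed = True
                break
            seen.add(parent)
            parent = depends_on.get(parent)
        if not suppressed:
            raised.add(component)
    return raised
```

With the rule `{"batch_system": "power_supply"}`, a power failure that also takes down the batch system raises only the power supply problem, while a batch system failure on its own is still raised.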
17Monitoring Clients
- Monitoring clients will be developed with an API that allows determination of the status of each node in a hierarchy, based on rules and current information obtained from the NCS.
- Monitoring clients will initiate action requests.
- Monitoring clients determine the state of the
system and monitored elements based on
information gathered from the NCS.
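A minimal sketch of how a client could derive the status of each node in a hierarchy: here a parent simply takes the worst status found among its children. That "worst child wins" rule and the severity ordering are assumptions for illustration only; in NGOP the status is determined by configurable rules. The status names are the ones shown in the NGOP Monitor.

```python
# Assumed ordering from least to most severe; not specified by NGOP.
SEVERITY = {"Good": 0, "Undefined": 1, "Warning": 2, "Bad": 3}


def rollup(node, children, leaf_status):
    """Return the status of node in a hierarchy.

    children:    dict mapping a node to the list of its child nodes
    leaf_status: dict mapping leaf nodes to their reported status
    Leaves with no reported status are treated as "Undefined".
    """
    kids = children.get(node, [])
    if not kids:
        return leaf_status.get(node, "Undefined")
    # A parent takes the most severe status among its children.
    return max((rollup(kid, children, leaf_status) for kid in kids),
               key=SEVERITY.__getitem__)
```

For a farm made of two clusters, a single host in the Warning state is enough to show the whole farm as Warning in the top-level view.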
18Archiver/Performance Storage Service
- The Archive/Performance Storage Service (PSS) is responsible for storing and retrieving messages generated by the NGOP system. These messages represent event, sensor, or action data.
- Components
- Archive Server
- Archive Retriever
- Performance Storage Subsystem (PSS)
- PSS Retriever
- Archive Database Interface
- Database (Oracle)
- DBArchiver
- The PSS is simply another instance of the Archive Server.
- Performance data will need to be consolidated.
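The slides do not say how performance data would be consolidated. One common approach, assumed here purely for illustration, is to average high-rate samples into fixed time buckets so that older data takes less space. The function name and the sample format are hypothetical.

```python
from collections import defaultdict


def consolidate(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) pairs sorted by time.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        # Align each sample to the start of the bucket that contains it.
        start = int(timestamp // bucket_seconds) * bucket_seconds
        buckets[start].append(value)
    return [(start, sum(values) / len(values))
            for start, values in sorted(buckets.items())]
```

For example, one-second temperature readings could be reduced to one averaged value per minute by calling `consolidate(samples, 60)`.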
19NGOP Prototype
- NGOP prototype development is currently underway. The prototype consists of the following modules:
- NGOP Central Server
- Configuration File Management Service
- Monitoring Agents
- OS Health: Monitors specific system daemons, file system existence and size, CPU load, and free memory.
- Ping Agent: Monitors node reachability.
- FBSNG Agent: Monitors the FBSNG batch system.
- NGOP Client API
- Determines the status of each monitored element based on pre-defined rules and current information received from the NGOP Central Server.
- NGOP Monitor
- Graphical representation of the status of monitored elements.
- Provides means to see and acknowledge events and alarms that have occurred.
- Provides limited configuration options.
- Archive Server
- Stores event and action messages to local disk.
- The Archive Database Interface moves the messages from local disk to an Oracle database.
20 NGOP Monitor
(Screenshot: NGOP Monitor display. Callouts: Alarm; Status levels Bad, Warning, Good, Undefined; Event description.)
21 NGOP Monitor (event acknowledgment, known-status modification)
(Screenshot callout: Monitored Element Info)
22NGOP Monitor (Configuration Options)
Default icons for known object types
Default colors for status representation
Selecting elements for top level display
23Summary
- Building a DMS is a complex problem.
- Various commercial and open source systems were analyzed. None met the basic requirements for the NGOP project at Fermilab.
- Prototype system is under development.
- See http://www-isd.fnal.gov/ngop for project details.