Title: Tools for Monitoring the Grid
1Tools for Monitoring the Grid
- Dr Mark Baker
- Distributed Systems Group
- University of Portsmouth
- http://dsg.port.ac.uk/mab/Talks/ICCSA04/
2Outline
- The Grid.
- Monitoring: a brief overview.
- GridRM resource monitoring.
- jGMA.
- Summary and Conclusions.
3Enter Grid Technologies
- Infrastructure (middleware) for establishing, managing, and evolving multi-organisational federations:
- Dynamic, autonomous, domain independent,
- On-demand, ubiquitous access to computing, data, and services.
- Mechanisms for creating and managing workflow within such federations:
- New capabilities constructed dynamically and transparently from distributed services,
- Service-oriented and virtualisation.
4Elements of the Problem
- Resource sharing:
- Computers, storage, sensors, networks,
- Sharing is always conditional: issues of trust, policy, negotiation, payment, security.
- Coordinated problem solving:
- Beyond client-server: distributed data analysis, computation, workflow, collaboration.
- Dynamic, multi-institutional virtual organisations:
- Community overlays on classic organisational structures (CERN, for example),
- Large or small, static or dynamic.
5Why Monitor the Big Picture
- The Grid is a dynamic, heterogeneous, globally distributed complex system with no:
- Central authority,
- Means of control,
- Universally accepted means of knowing whether a resource or service is up and available for use by applications.
- Lack of knowledge about the status of the resources and services available in any distributed system will hamper strategies for optimal scheduling, allocation and use.
- So, like any other distributed system, the Grid needs monitoring mechanisms and tools to help avoid it becoming chaotic and unusable.
6Monitoring A Brief Overview
7Some Motivation
- Use hardware and/or software tools to observe the activities on a given resource, service or application:
- Analyse, predict and tune performance,
- Fault detection and identification,
- Identify bottlenecks,
- Scheduling the best resources:
- Where to get/put the data.
- Where to execute the job.
- Use data as input into higher level services:
- Such as forecasting, autonomic computing, knowledge, intelligence.
8Monitor What?
- Functional reasons:
- Resources (computer/network/device) up/down,
- Services functioning: yes, no, partially,
- SLA being met or not,
- QoS being fulfilled,
- Policies being followed.
- Non-functional reasons:
- Typically performance related:
- Bandwidth/latency,
- CPU/memory/disk,
- Device performance,
- And so on.
9Functionality Required
- Data acquisition:
- e.g., instrumentation, harvesting, pre-processing, and data delivery.
- Data processing:
- e.g., analysis, presentation, visualization, problem detection, problem location, and control.
- Active and passive monitoring:
- Read-write functionality:
- Management/configuration tasks, and gathering.
- Read-only:
- Just gathering...
- Classes of monitored entities:
- Resources: typically hardware,
- Services: providing services to applications and clients,
- Applications: the normal.
10Monitoring Projects
- Many monitoring projects with a range of goals and purposes.
- Broadly speaking they monitor:
- Applications,
- Services,
- Resources.
- In the first instance we are interested in monitoring resources and services:
- Main difference is the quantity of data produced that needs to be passed around a system.
11Important Features
- Wide-area monitoring.
- Scalable architecture.
- Standardised protocols and APIs.
- Standard security mechanisms.
- Runtime extensibility.
- Normalised view of heterogeneous data sources.
- Filtering/fusing of data?
- Multi-Agent API monitoring.
- Search capability (class, functionality, capability).
- Software dependencies.
- Available, active and supported.
- Open source.
12Motivation
Note 2: Custom agent or sensor must be installed on monitored resources.
- No current system had quite the idealised features we wanted for our purposes, so we were motivated to develop our own system.
13The Ideal Situation
- A scalable wide-area monitoring framework.
- A standards-based approach that is independent of underlying middleware.
- A secure system.
- A means of harvesting heterogeneous data and presenting it in a homogeneous form.
- Asynchronous/synchronous, push/pull data movement.
- An extensible infrastructure that can be bound to diverse data sources (legacy and emerging ones).
- A dynamic, configurable, usable system.
14GridRM
- A data gathering framework for monitoring and
managing the Grid
15Background
- We wanted a ubiquitous framework that provides information about the health and status of Grid resources and services.
- Gathering resource information, such as:
- Compute (nodes, CPU, memory),
- Network (inter-site communications links, network devices),
- Sensors (specialised devices, Web cam, microphone),
- Software services (information services, schedulers).
- NOTE: Not applications.
- Need a generic system that does not need yet another local agent, but can utilise whatever exists:
- SNMP, Network Weather Service, NetLogger, Ganglia, /proc, MDS,
16Overview
- GridRM is a resource monitoring system that provides:
- A scalable monitoring framework for the Grid.
- A standards-based information system that is independent of underlying middleware.
- A means of harvesting heterogeneous data and presenting it in a homogeneous form.
- Data collection: push/pull, streaming/events,
- An extensible infrastructure that can be bound to diverse data sources (legacy and emerging ones).
- A means of gathering data that can be used for a range of purposes such as scheduling, autonomic computing, verification of SLA, and creating higher-level knowledge, which is the interesting area.
17GridRM Structure
- Global layer of peer-related gateways,
- Which in turn have a local layer that interacts with the local data sources, and/or a hierarchy of child gateways.
18GridRM Architecture
19GridRM Global Layer
20jGMA
- Needed a lightweight implementation of the GGF Grid Monitoring Architecture in Java.
- Others:
- R-GMA,
- pyGMA,
- Autopilot, MDS, NWS, CODE
- Found that existing systems were heavyweight, complex or not standalone.
- Decided to produce our own version.
- Aims:
- GMA compliant,
- Easy to install and use,
- Easy to program and extend,
- Java-based.
21jGMA Architecture
- GMA Compliance
- 21 features,
- GGF document is only a guide,
- It is very easy to claim to be compliant,
- For now jGMA is GMA like.
22GridRM Local Layer
23Data Management Issues
[Diagram labels: JSP, WS, OGSA, Scheduler, KB Agent; Global Layer; jGMA; Local Layer; Agent API; data sources SNMP, NWS, NetL, /proc, Ganglia; agents XYZ, SNMP, NWS, NetL, /proc, Ganglia.]
24Data Management Issues
- Need to produce:
- A simple and expressive API,
- Device drivers and manager for each Agent,
- A means of describing the monitored data:
- Implies an XML-based schema and an ontology.
[Diagram labels: Resource Markup Language; Ontology and Schema; API; Agent API; Agent Driver Manager; Driver Manager; Common Agent API; Agent Devices; agents XYZ, SNMP, NWS, NetL, Ganglia, /proc.]
25GridRM Detailed View
26Solutions Simple API
- Producing an API is fairly simple, but creating one that will be taken up and accepted is another matter.
- We are using an API based on JDBC from Java.
- Example of API:
- Agent Driver Interface:
- forName("GridRM.sql.agent.NWSDriver")
- forName("GridRM.sql.agent.SNMPv1Driver")
- Connection Interface:
- String agentURL = "GridRMNWS/barney5550/PerfData"
- Connection con = DriverManager.getConnection(agentURL)
- Statement Interface:
- Statement stmt = con.createStatement()
- ResultSet rs = stmt.executeQuery("get CPU table")
- Manipulating Results:
- ResultSet is another interface; it contains a handful of methods for manipulating the data returned from the agent.
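The JDBC-style calling pattern above (register a driver, resolve it from a URL, run a query, read rows) can be sketched in plain Java. This is only an illustration under stated assumptions: `AgentDriver`, `AgentRegistry`, `StubNwsDriver`, the `stub:nws:` URL prefix and the canned row are all hypothetical stand-ins and are not the GridRM API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical driver contract, loosely modelled on java.sql.Driver.
interface AgentDriver {
    boolean acceptsURL(String url);                          // can this driver handle the URL?
    List<Map<String, String>> query(String url, String q);   // rows as column -> value maps
}

// Hypothetical registry playing the role that DriverManager plays in JDBC.
final class AgentRegistry {
    private static final List<AgentDriver> DRIVERS = new ArrayList<>();

    static void register(AgentDriver driver) {
        DRIVERS.add(driver);
    }

    // Return the first registered driver that claims the URL.
    static AgentDriver driverFor(String url) {
        for (AgentDriver d : DRIVERS) {
            if (d.acceptsURL(url)) {
                return d;
            }
        }
        throw new IllegalArgumentException("No driver for " + url);
    }
}

// Stub driver returning canned data instead of contacting a real agent.
final class StubNwsDriver implements AgentDriver {
    @Override
    public boolean acceptsURL(String url) {
        return url.startsWith("stub:nws:");
    }

    @Override
    public List<Map<String, String>> query(String url, String q) {
        return List.of(Map.of("host", "barney", "cpuAvailable", "0.75"));
    }
}

final class AgentApiDemo {
    // Resolve a driver for the URL, run the query and read one column of the first row.
    static String firstValue(String url, String query, String column) {
        AgentDriver driver = AgentRegistry.driverFor(url);
        return driver.query(url, query).get(0).get(column);
    }

    public static void main(String[] args) {
        AgentRegistry.register(new StubNwsDriver());
        System.out.println(firstValue("stub:nws:barney", "get CPU table", "cpuAvailable"));
    }
}
```

The client code only ever sees the registry and the row maps, which is the property that lets new data sources be added by registering another driver.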
27Solutions Naming Schema
- No single naming schema for this area at the moment.
- We needed something that can mark up the information that can be gathered by the local agents:
- Static and dynamic information:
- Name/IP/OS/Processor/NIC/
- CPU load/memory available/disk space/network/
- Did not want to produce our own schema, so chose an emerging one that is increasingly being used: the Grid Laboratory Uniform Environment (GLUE) schema:
- A schema that defines the attributes of computer system resources (CE/NE/).
- Others: CIM, UNICORE, etc.
28Local Layer - Detail
29GridRM Drivers and Manager
- The GridRM Driver Manager gets data from the Agent API and translates it into something that the local agents can understand.
- The Driver Manager also provides other functionality that is particular to GridRM, such as configuration, caching, streaming or pushing/pulling data to/from clients.
- The Driver Manager includes a simple low-level API to interact with the local agents, based on a common subset of information that can be retrieved from all the agents.
30Local Layer Use of SQL
- SQL used extensively throughout the framework.
- All resources are seen as databases and queried using SQL.
- Resource queries enter the framework as SQL syntax.
- Pluggable resource drivers are implemented as JDBC drivers:
- Translate SQL requests into native protocol.
- Normalise results according to selected schema.
- Framework benefits from a single, flexible approach to resource interaction.
- Makes for a simple, extensible framework.
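A minimal sketch of the "translate SQL requests into native protocol" step, assuming a driver keeps a lookup table from schema columns to native identifiers. The class `SnmpMapping` and its table layout are hypothetical; the OIDs are the standard UCD-SNMP-MIB `laLoad` entries for the 5 and 15 minute load averages, and the tiny parser handles only the `select a, b from Table` shape used in this talk.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class SnmpMapping {
    // Hypothetical mapping table: "Table.column" -> SNMP OID (UCD-SNMP-MIB laLoad).
    private static final Map<String, String> OIDS = Map.of(
            "ProcessorLoad.Last5min", ".1.3.6.1.4.1.2021.10.1.3.2",
            "ProcessorLoad.Last15min", ".1.3.6.1.4.1.2021.10.1.3.3");

    // Translate "select c1, c2 from Table" into the OIDs a driver would fetch.
    static List<String> toOids(String sql) {
        String trimmed = sql.trim();
        String lower = trimmed.toLowerCase(Locale.ROOT);
        int from = lower.indexOf(" from ");
        if (!lower.startsWith("select ") || from < "select ".length()) {
            throw new IllegalArgumentException("Unsupported query: " + sql);
        }
        String table = trimmed.substring(from + " from ".length()).trim();
        List<String> oids = new ArrayList<>();
        for (String column : trimmed.substring("select ".length(), from).split(",")) {
            String oid = OIDS.get(table + "." + column.trim());
            if (oid == null) {
                throw new IllegalArgumentException("No mapping for " + column.trim());
            }
            oids.add(oid);
        }
        return oids;
    }

    public static void main(String[] args) {
        System.out.println(toOids("select Last5min, Last15min from ProcessorLoad"));
    }
}
```

A real driver would issue one SNMP get per returned OID and then normalise the replies; only the lookup is shown here.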
31GridRM Portal
- The GridRM Portal (gridrm.org) is a demonstration of gateways, data sources, SQL and data normalisation.
- An example of the use of GridRM, particularly its ability to discover and utilise resource data.
- An example of a GridRM client which:
- Allows the use of GridRM with no knowledge of the underlying technologies.
- Hides details like SQL, XML, etc.
- Provides an abstraction everyone can use (clickerity click!).
32GridRM Portal
33Portal Navigation (1)
- Provides a view of GridRM Gateways through a hierarchy of maps.
- Point and click navigation, colour coded status.
34Portal Navigation (2)
- Textual equivalent to maps
- Data in XML, presentation in XSLT and CSS.
- XML means not just for human consumption!
- Portal provides a single point of access for
non-GMA aware clients to monitor the Grid.
35Portal Navigation (3)
- Navigation via the map or textually leads to gateway metadata for the selected site(s).
- Remote access and administration for gateway and underlying data source drivers.
- Portal security infrastructure not completed, so a placeholder is currently in position.
36Portal Simple Queries
- Query resources across configured gateways.
- Pre-defined query options: point and click!
- A Web form shields the user from SQL syntax and other low-level details.
- An example query:
- Retrieve details of all compute resources:
- The user is authorised to use,
- From the selected Grid sites,
- That meet specified constraints for:
- Memory, processor load, architecture.
37Portal Simple Query Form (2)
38Portal Simple Query Results (1)
Highlighting the drivers used to retrieve
results.
GLUE formatted data
Controlling gateways exposed.
39Portal Simple Query Results (2)
- Comparative view of resource processor loads from the query in the previous slide.
- Missing bars indicate idle machines.
40Portal Technologies Exposed
- The portal's Advanced Queries page shows:
- All data sources are viewed as relational databases.
- SQL is the uniform way of interacting with these diverse data sources.
- How data sources can be located across gateways.
- How data sources can be queried dynamically to determine the types and format of data they can provide.
- GLUE plays an important role in:
- Providing a consistent view of the way data is organised at a resource.
- Ensuring a homogeneous view of data is achieved from heterogeneous sources.
- Contrasts with the simplicity of the point-and-click queries!
41Portal SQL Interface (1)
- Aim: Query a given data source, using SQL, for performance attributes, but without knowing the monitoring protocol it natively supports.
- The SQL commands:
- show gateways: Get a list of registered gateways.
- use gateway: Select a gateway to query (DSG Workgroup Gateway).
- show datasources: Get a list of all data sources registered with the gateway.
- use datasource: Select a particular data source (homer.dsg.port.ac.uk).
- show databases: Retrieve a list of naming schemas supported by this data source driver.
- use GLUECE_host: Query a host resource's data using the GLUE Computing Element schema.
42Portal SQL Interface (2)
- show tables: Observe the way the naming schema partitions resource data.
- desc architecture: Describe what an architecture table will provide.
- select * from Architecture: Read the architecture type for this data source.
- select Last5min, Last15min from ProcessorLoad: Read the 5 and 15 minute load averages.
43Portal SQL Interface (3)
- show gateways: Get a list of registered gateways.
- We need to find the list of gateways currently available to us.
- Enter show gateways into the command line.
44Portal SQL Interface (4)
- The list of gateways is returned.
- Select a gateway to query with the use gateway syntax.
- We enter the DSG gateway.
45Portal SQL Interface (5)
- The show datasources command returns a list of
all data sources registered with the gateway
- The gateway is selected and we now want to see
data sources
- We choose the homer data source
46Portal SQL Interface ()
- Keep going and eventually
47Portal SQL Interface (6)
- Query the resource driver for Architecture and CPU load averages for the last 5 and 15 minutes:
- select * from Architecture
- select Last5min, Last15min from ProcessorLoad
48Normalisation (1)
- GridRM clients can:
- Query an arbitrary data source,
- Retrieve useful resource data in a standard way,
- Without knowing anything about the underlying resource's protocol or operation.
- Pre-requisites:
- Appropriate drivers exist,
- A translation schema exists.
- Clearly data returned from a GridRM query can be very different from the native data produced.
- Examples are given here to highlight some of the work that goes on within the GridRM drivers...
49Normalisation (2)
- Generic GridRM Syntax:
- We have seen that regardless of the underlying data source, the format of the GridRM SQL request and result remain the same.
- Behind the scenes a GridRM data source driver translates the SQL command into the format required by the data source's native protocol and submits it.
- Example:
- Query a resource's memory capacity and availability, using the GLUE CE Schema:
- > use GLUECE_host
- > select * from MainMemory
- What happens next depends on the underlying data source.
50Normalisation (3)
- Given a range of data sources:
- The granularity of native requests can differ greatly,
- SNMP has fine granularity:
- Distinct data items can be selected and returned to the driver.
- Whereas Ganglia responds to a single generic request:
- An XML document describes an entire Grid or cluster.
- On an individual basis GridRM drivers request, fuse and filter data as appropriate for their associated data source.
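The coarse-grained Ganglia case can be sketched as follows: the driver receives one XML document for the whole cluster and filters out the single value it needs. This is an illustrative sketch only. `GangliaFilter` is a hypothetical class, the sample document is a heavily simplified stand-in for gmond output (element and attribute names `HOST`, `METRIC`, `NAME`, `VAL` are assumed to follow the Ganglia XML layout), and the metric values are invented.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class GangliaFilter {
    // Simplified, invented sample document describing two hosts in one cluster.
    static final String SAMPLE =
            "<GANGLIA_XML><CLUSTER NAME=\"dsg\">"
            + "<HOST NAME=\"homer\"><METRIC NAME=\"load_five\" VAL=\"0.02\"/>"
            + "<METRIC NAME=\"mem_total\" VAL=\"515000\"/></HOST>"
            + "<HOST NAME=\"barney\"><METRIC NAME=\"load_five\" VAL=\"0.40\"/></HOST>"
            + "</CLUSTER></GANGLIA_XML>";

    // Return the VAL attribute of the named metric for the named host, or null if absent.
    static String metric(String xml, String host, String name) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
        NodeList hosts = doc.getElementsByTagName("HOST");
        for (int i = 0; i < hosts.getLength(); i++) {
            Element h = (Element) hosts.item(i);
            if (!host.equals(h.getAttribute("NAME"))) {
                continue; // filter: skip every host except the one queried
            }
            NodeList metrics = h.getElementsByTagName("METRIC");
            for (int j = 0; j < metrics.getLength(); j++) {
                Element m = (Element) metrics.item(j);
                if (name.equals(m.getAttribute("NAME"))) {
                    return m.getAttribute("VAL");
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(metric(SAMPLE, "barney", "load_five"));
    }
}
```

With SNMP the selection happens in the request itself; here it has to happen after the whole document has been fetched, which is why the filtering logic lives in the driver.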
51Normalisation (4)
- SNMP requests could be formed of:
- Multiple get, getNext, getBulk requests,
- To obtain appropriate data for the GridRM response.
- Ganglia requests are of the form:
- telnet <hostname> 8649 (gmond)
- telnet <hostname> 8651 (gmetad)
- /proc requests open a specified file(s).
- NetLogger requests read from a file.
- Drivers use mappings to translate the incoming SQL request into an appropriate form that can be used to query the data source.
52Normalisation (5)
- Native results could be of the form:
- SNMP single result string, e.g. CPU Load over Last 5 minutes:
>> snmpget -c public 192.168.100.2 .1.3.6.1.4.1.2021.10.1.3.2
UCD-SNMP-MIB::laLoad.2 = STRING: 0.02
53Normalisation (6)
- or multiple SNMP result strings, e.g. CPU Load over last 1, 5, 15 minutes as integer and floating point values:
>> snmpwalk -c public 192.168.100.2 .1.3.6.1.4.1.2021.10.1
UCD-SNMP-MIB::laConfig.3 = STRING: 14.00
UCD-SNMP-MIB::laConfig.2 = STRING: 14.00
UCD-SNMP-MIB::laConfig.3 = STRING: 14.00
UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 8
UCD-SNMP-MIB::laLoadInt.2 = INTEGER: 2
UCD-SNMP-MIB::laLoadInt.3 = INTEGER: 1
UCD-SNMP-MIB::laLoadFloat.1 = Opaque: Float: 0.080000
UCD-SNMP-MIB::laLoadFloat.2 = Opaque: Float: 0.020000
UCD-SNMP-MIB::laLoadFloat.3 = Opaque: Float: 0.010000
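Before any normalisation, a driver has to turn native result lines like these into typed values. The sketch below assumes the usual Net-SNMP command-line output shape, `MIB::object.index = TYPE: value`, and extracts the trailing number. `SnmpLine` is a hypothetical helper, not part of GridRM.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class SnmpLine {
    // Matches e.g. "UCD-SNMP-MIB::laLoad.2 = STRING: 0.02"
    // and "UCD-SNMP-MIB::laLoadFloat.1 = Opaque: Float: 0.080000".
    private static final Pattern LINE =
            Pattern.compile("^(\\S+)::(\\w+)\\.(\\d+) = (?:\\w+: )+(\\S+)$");

    private static Matcher match(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognised line: " + line);
        }
        return m;
    }

    // Object name and index, e.g. "laLoad.2".
    static String object(String line) {
        Matcher m = match(line);
        return m.group(2) + "." + m.group(3);
    }

    // Numeric value at the end of the line, whatever the declared SNMP type.
    static double value(String line) {
        return Double.parseDouble(match(line).group(4));
    }

    public static void main(String[] args) {
        String line = "UCD-SNMP-MIB::laLoad.2 = STRING: 0.02";
        System.out.println(object(line) + " -> " + value(line));
    }
}
```

Note that the same load average arrives as a string, an integer or an opaque float depending on which object is read, so the type tag is skipped and only the value is parsed.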
54Normalisation (7)
- Other native results could be of the form:
- Ganglia:
- telnet <hostname> 8649 (gmond)
55Normalisation (6)
- Normalised Data:
- On retrieving native data, the driver normalises the data according to rules in the translation schema.
- A simple example would be to read memory values natively in Kbytes and convert into Mbytes as required by the selected translation schema.
- Clearly more complex normalisations occur in line with the driver response and the translation schema,
- e.g. select a single item out of a page of returned information.
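The Kbytes-to-Mbytes case can be written down directly. This is a minimal sketch that assumes 1 Mbyte = 1024 Kbytes and rounds down to whole Mbytes; the class `MemoryNormaliser` and the field names used in `main` (`RAMSize`, `RAMAvailable`) are illustrative assumptions, as are the sample values.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class MemoryNormaliser {
    // Assumption: binary units, 1 Mbyte = 1024 Kbytes.
    private static final long KBYTES_PER_MBYTE = 1024;

    // Convert a native Kbyte reading to whole Mbytes, rounding down.
    static long kbytesToMbytes(long kbytes) {
        if (kbytes < 0) {
            throw new IllegalArgumentException("Negative size: " + kbytes);
        }
        return kbytes / KBYTES_PER_MBYTE;
    }

    // Apply the conversion to every field of a native record, preserving field order.
    static Map<String, Long> normalise(Map<String, Long> nativeKbytes) {
        Map<String, Long> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : nativeKbytes.entrySet()) {
            out.put(e.getKey(), kbytesToMbytes(e.getValue()));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> nativeRecord = new LinkedHashMap<>();
        nativeRecord.put("RAMSize", 524288L);       // invented sample values in Kbytes
        nativeRecord.put("RAMAvailable", 131072L);
        System.out.println(normalise(nativeRecord));
    }
}
```

In a real driver the unit and rounding rule would come from the translation schema rather than being hard-coded.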
56Normalisation (7)
- Normalised results are encoded in XML (GLUE) when they leave a gateway en route through the GridRM infrastructure.
57Normalisation (8)
- Normalised resulting XML contains additional metadata:
- That refers to the schema (i.e. GLUECE) used for data normalisation,
- To promote abstraction, SQL data types describe schema field value types,
- To determine the meaning of the returned value, reference to the translation schema is required:
- This requirement is under review,
- i.e. could add elements of the translation schema metadata into the query result.
61Portal Current GUI
62GridRM GUI
Homogeneous view of the data sources
63Summary
- Heterogeneous information returned from a diverse range of possible data sources.
- Need to harvest data into a homogeneous form:
- Hide underlying complexity from clients.
- Provide data in a format that meets a client's requirements.
- Combine legacy resources with modern cluster and Grid information servers to provide:
- An over-arching Grid information system.
- Independent of particular middleware and services.
- GridRM promotes homogeneity through:
- JDBC-like data source driver,
- Standard SQL syntax,
- The GLUE naming schemas,
- Request translation and result normalisation,
- GLUE XML encoded results.
64Summary
- If we want to provide a global infrastructure that application developers and users can safely and effectively use, then we need to provide a variety of monitoring tools and services.
- We need these to:
- Analyse, predict and tune performance,
- Detect and identify faults,
- Identify bottlenecks,
- Schedule the best resources,
- Use data as input into higher level services.
- Monitoring, like security, needs to be built into systems from the design stage.
- Without monitoring the dream of the Grid cannot be realised.
65Future Work
- Provide an example of a job submission system using GridRM, several options:
- Other schedulers, Condor, NGE,
- Further security:
- Integrate UK e-Science certificates for resource access control.
- Secure driver propagation, i.e., code signing, trust mechanisms.
- Performance and scalability testing.
- More translation schemas for different resources:
- DBMS, telescope, surf conditions!
- Use of portlet technologies to provide a better Web interface: GridSphere?!
- Integrate KBS Intelligent Agents, and virtual registry.
66Virtual Meta-Registry
67Testbed Status (1)
- Currently have 12 sites across 8 countries:
- 5 gateways operational,
- 1 installed but with problems,
- 2 being set up (OS, firewall issues),
- 4 waiting for accounts to be provided.
68Testbed Status (2)
- List of sites and status: http://gridrm.org/testbed.html
- World domination is nigh!!
69Acknowledgements
- DSG students
- GridRM, Garry.Smith_at_computer.org,
- jGMA, Matthew.Grove_at_port.ac.uk
70More Information
- GridRM, http://gridrm.org
- jGMA, http://dsg.port.ac.uk/projects/jGMA/
- GMA WG, http://www-didc.lbl.gov/GGF-PERF/GMA-WG/
- GLUE, http://www.hicb.org/glue/glue.htm
- JDBC, http://java.sun.com/products/jdbc/
- Ganglia, http://ganglia.sourceforge.net/
- SNMP, http://net-snmp.sourceforge.net/
- NetLogger, http://www-didc.lbl.gov/NetLogger/
- Network Weather Service, http://nws.cs.ucsb.edu