Tools for Monitoring the Grid - PowerPoint PPT Presentation

1
Tools for Monitoring the Grid
  • Dr Mark Baker
  • Distributed Systems Group
  • University of Portsmouth
  • http://dsg.port.ac.uk/mab/Talks/ICCSA04/

2
Outline
  • The Grid.
  • Monitoring: a brief overview.
  • GridRM: resource monitoring.
  • jGMA.
  • Summary and Conclusions.

3
Enter Grid Technologies
  • Infrastructure (middleware) for establishing,
    managing, and evolving multi-organisational
    federations
  • Dynamic, autonomous, domain independent,
  • On-demand, ubiquitous access to computing, data,
    and services.
  • Mechanisms for creating and managing workflow
    within such federations
  • New capabilities constructed dynamically and
    transparently from distributed services,
  • Service-oriented and virtualisation.

4
Elements of the Problem
  • Resource sharing
  • Computers, storage, sensors, networks,
  • Sharing always conditional: issues of trust,
    policy, negotiation, payment, security,
  • Coordinated problem solving
  • Beyond client-server: distributed data analysis,
    computation, workflow, collaboration,
  • Dynamic, multi-institutional virtual orgs
  • Community overlays on classic organisational
    structures (CERN for example),
  • Large or small, static or dynamic.

5
Why Monitor: the Big Picture
  • The Grid is a dynamic, heterogeneous, globally
    distributed complex system with no:
  • Central authority,
  • Means of control,
  • Universally accepted means of knowing if a
    resource or service is up and available for use
    by applications.
  • Lack of knowledge about the status of the
    resources and services available in any
    distributed system will hamper strategies for
    optimal scheduling, allocation and use.
  • So, like any other distributed system, the Grid
    needs monitoring mechanisms and tools to help
    avoid it becoming chaotic and unusable.

6
Monitoring: A Brief Overview
7
Some Motivation
  • Use hardware and/or software tools to observe the
    activities on a given resource, service or
    application
  • Analyse, predict and tune performance,
  • Fault detection and identification,
  • Identify bottlenecks,
  • Scheduling the best resources:
  • Where to get/put the data.
  • Where to execute a job.
  • Use data as input into higher level services
  • Such as, forecasting, autonomic computing,
    knowledge, intelligence

8
Monitor What?
  • Functional reasons
  • Resources (computer/network/device) up/down,
  • Services functioning, yes, no, partially,
  • SLA being met or not,
  • QoS being fulfilled,
  • Policies being followed,
  • Non-functional reasons
  • Typically performance related
  • Bandwidth/latency,
  • CPU/memory/disk,
  • Device performance,
  • And so on.

9
Functionality Required
  • Data acquisition
  • e.g., Instrumentation, harvesting,
    pre-processing, and data delivery,
  • Data processing
  • e.g., Analysis, presentation, visualization,
    problem detection, problem location, and control.
  • Active and passive monitoring
  • Read-write functionality
  • Management/configuration tasks, and gathering
  • Read-only
  • Just gathering...
  • Classes of monitored entities
  • Resources: typically hardware,
  • Services: providing services to applications and
    clients.
  • Applications: the normal.

10
Monitoring Projects
  • Many monitoring projects with a range of goals
    and purposes.
  • Broadly speaking they monitor
  • Applications,
  • Services,
  • Resources.
  • In the first instance we are interested in
    monitoring resources and services
  • Main difference is the quantity of data produced
    that needs to be passed around a system.

11
Important Features
  • Wide-area monitoring.
  • Scalable architecture.
  • Standardised protocols and APIs.
  • Standard security mechanisms.
  • Runtime extensibility.
  • Normalised view of heterogeneous data sources.
  • Filtering/fusing of data?
  • Multi-Agent API monitoring.
  • Search capability (class, functionality,
    capability).
  • Software dependencies.
  • Available, active and supported.
  • Open source.

12
Motivation
Note 2: A custom agent or sensor must be installed
on monitored resources.
  • No existing system had quite the idealised
    features we wanted for our purposes, so we were
    motivated to develop our own system.

13
The Ideal Situation
  • A scalable wide-area monitoring framework.
  • A standards-based approach that is independent
    of underlying middleware.
  • A secure system.
  • A means of harvesting heterogeneous data and
    presenting it in a homogeneous form.
  • Asynchronous/synchronous, push/pull data
    movement.
  • An extensible infrastructure that can be bound to
    diverse data sources (legacy, and emerging ones).
  • A dynamic, configurable, useable system.

14
GridRM
  • A data gathering framework for monitoring and
    managing the Grid

15
Background
  • We wanted a ubiquitous framework that provides
    information about the health and status of Grid
    resources and services.
  • Gathering resource information, such as
  • Compute (nodes, CPU, memory),
  • Network (inter-site communications links, network
    devices),
  • Sensors (specialised devices, Web cam,
    microphone),
  • Software services (information services,
    schedulers).
  • NOTE: not applications.
  • Need a generic system that does not need yet
    another local agent, but can utilise whatever
    exists
  • SNMP, Network Weather Service, NetLogger,
    Ganglia, /proc, MDS,

16
Overview
  • GridRM is a resource monitoring system that
    provides
  • A scalable monitoring framework for the Grid.
  • A standards-based information system that is
    independent of underlying middleware.
  • A means of harvesting heterogeneous data and
    presenting it in a homogeneous form.
  • Data collection: push/pull, streaming/events,
  • An extensible infrastructure that can be bound to
    diverse data sources (legacy, and emerging ones).
  • A means of gathering data that can be used for a
    range of purposes such as scheduling, autonomic
    computing, verification of SLA, and creating
    higher-level knowledge which is the interesting
    area

17
GridRM Structure
  • Global layer of peer-related gateways
  • Which in turn have a local layer that interacts
    with the local data sources, and/or a hierarchy
    of child gateways.

18
GridRM Architecture
19
GridRM Global Layer
20
jGMA
  • Needed a lightweight implementation of the GGF
    Grid Monitoring Architecture in Java.
  • Others
  • R-GMA,
  • pyGMA,
  • Autopilot, MDS, NWS, CODE
  • Found that existing systems were heavyweight,
    complex or not standalone.
  • Decided to produce our own version
  • Aims
  • GMA compliant,
  • Easy to install and use,
  • Easy to program and extend,
  • Java-based.

21
jGMA Architecture
  • GMA Compliance
  • 21 features,
  • GGF document is only a guide,
  • It is very easy to claim to be compliant,
  • For now jGMA is GMA like.

22
GridRM Local Layer
23
Data Management Issues
[Figure: layered architecture diagram. Labels: JSP, WS, OGSA, Scheduler, KB Agent; Global Layer (jGMA); Local Layer (Agent API); agents for SNMP, NWS, NetL, /proc, Ganglia and XYZ.]
24
Data Management Issues
  • Need to produce
  • A simple and expressive API,
  • Device drivers and manager for each Agent,
  • A means of describing the monitored data
  • Implies an XML-based schema and an ontology.

[Figure: Local Layer components. Labels: Resource Markup Language (Ontology and Schema); API; Agent API; Agent Driver Manager; Driver Manager; Common Agent API; Agent Devices (XYZ, SNMP, NWS, NetL, Ganglia and /proc Agents).]
25
GridRM Detailed View
26
Solutions: Simple API
  • Producing an API is fairly simple, but creating
    one that will be taken up and accepted is another
    matter.
  • We are using an API based on JDBC from Java.
  • Example of API
  • Agent Driver Interface
  • Class.forName("GridRM.sql.agent.NWSDriver")
  • Class.forName("GridRM.sql.agent.SNMPv1Driver")
  • Connection Interface
  • String agentURL = "GridRMNWS/barney5550/PerfData"
  • Connection con = DriverManager.getConnection(agentURL)
  • Statement Interface
  • Statement stmt = con.createStatement()
  • ResultSet rs = stmt.executeQuery("get CPU table")
  • Manipulating Results
  • ResultSet is another interface; it contains a
    handful of methods for manipulating the data
    returned from the agent.
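
The JDBC-style pattern on this slide (load a driver, resolve it from a URL, run a request) can be sketched without any GridRM code. This is only a minimal illustration of the driver-registration and URL-dispatch idea. `AgentDriver`, `AgentRegistry`, `FakeNwsDriver`, the `demo:nws:` URL prefix and the returned values are all hypothetical stand-ins, not GridRM classes or real measurements.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical driver contract, loosely modelled on java.sql.Driver. */
interface AgentDriver {
    boolean acceptsURL(String url);
    List<Map<String, String>> query(String url, String request);
}

/** Hypothetical registry playing the role of a driver manager. */
final class AgentRegistry {
    private static final List<AgentDriver> DRIVERS = new ArrayList<>();

    static void register(AgentDriver driver) {
        DRIVERS.add(driver);
    }

    /** Returns the first registered driver that accepts the URL. */
    static AgentDriver driverFor(String url) {
        for (AgentDriver driver : DRIVERS) {
            if (driver.acceptsURL(url)) {
                return driver;
            }
        }
        throw new IllegalArgumentException("No driver accepts " + url);
    }
}

/** Fake driver with canned rows; a real one would speak the agent's protocol. */
final class FakeNwsDriver implements AgentDriver {
    @Override
    public boolean acceptsURL(String url) {
        return url.startsWith("demo:nws:");
    }

    @Override
    public List<Map<String, String>> query(String url, String request) {
        // Made-up values for illustration only, not real measurements.
        return List.of(Map.of("host", "example-host", "cpuAvailable", "0.75"));
    }
}

final class AgentApiDemo {
    /** Registers the fake driver, resolves it by URL and runs one request. */
    static List<Map<String, String>> run() {
        AgentRegistry.register(new FakeNwsDriver());
        String agentUrl = "demo:nws://example-host/PerfData";
        AgentDriver driver = AgentRegistry.driverFor(agentUrl);
        return driver.query(agentUrl, "get CPU table");
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The client only ever names a URL and a request; which driver answers is decided by the registry, which is what lets new agent types be plugged in without changing client code.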

27
Solutions: Naming Schema
  • No single naming schema for this area at the
    moment.
  • We needed something that can markup the
    information that can be gathered by the local
    agents
  • Static and dynamic information
  • Name/IP/OS/Processor/NIC/
  • CPU load/memory available/disk space/network/
  • Did not want to produce our own schema, so chose
    an emerging one that is increasingly being used:
    the Grid Laboratory Uniform Environment (GLUE)
    schema
  • A schema that defines the attributes of computer
    system resources (CE/NE/)
  • Others: CIM, UNICORE, etc.

28
Local Layer - Detail
29
GridRM Drivers and Manager
  • The GridRM Driver Manager gets data from the
    Agent API and translates it into something that
    the local agents can understand.
  • The Driver Manager also provides other
    functionality that is particular to GridRM, such
    as configuration, caching, streaming or
    pushing/pulling data to/from clients.
  • The driver manager includes a simple low-level
    API to interact with the local agents based on
    a common sub-set of information that can be
    retrieved from all the agents.

30
Local Layer: Use of SQL
  • SQL used extensively throughout the framework.
  • All resources are seen as databases and queried
    using SQL.
  • Resource queries enter the framework as SQL
    syntax.
  • Pluggable resource drivers are implemented as
    JDBC drivers
  • Translate SQL requests into native protocol.
  • Normalise results according to selected schema.
  • Framework benefits from a single, flexible
    approach to resource interaction.
  • Makes for a simple, extensible framework.

31
GridRM Portal
  • The GridRM Portal (gridrm.org) is a demonstration
    of gateways, data sources, SQL and data
    normalisation.
  • An example of the use of GridRM, particularly its
    ability to discover and utilise resource data.
  • An example of a GridRM client which
  • Allows the use of GridRM with no knowledge of the
    underlying technologies.
  • Hides details like SQL, XML, etc.
  • Provides an abstraction everyone can use
    (clickerity click!).

32
GridRM Portal
33
Portal Navigation (1)
  • Provides a view of GridRM Gateways through a
    hierarchy of maps.
  • Point and click navigation, colour coded status.

34
Portal Navigation (2)
  • Textual equivalent to maps
  • Data in XML, presentation in XSLT and CSS.
  • XML means not just for human consumption!
  • Portal provides a single point of access for
    non-GMA aware clients to monitor the Grid.

35
Portal Navigation (3)
  • Navigation via the map or textually leads to
    gateway metadata for the selected site(s)
  • Remote access and administration for gateway and
    underlying data source drivers.
  • Portal security infrastructure not completed, so
    place holder currently in position.

36
Portal Simple Queries
  • Query resources across configured gateways.
  • Pre-defined query options: point and click!
  • A Web form shields the user from SQL syntax and
    other low-level details.
  • An example query
  • Retrieve details of all compute resources
  • The user is authorised to use,
  • From the selected Grid sites,
  • That meet specified constraints for
  • Memory, processor load, architecture.

37
Portal Simple Query Form (2)
38
Portal Simple Query Results (1)
  • Highlighting the drivers used to retrieve
    results.
  • GLUE formatted data.
  • Controlling gateways exposed.
39
Portal Simple Query Results (2)
  • Comparative view of resource processor loads from
    the query in the previous slide.
  • Missing bars indicate idle machines.

40
Portal Technologies Exposed
  • The portal's Advanced Queries page shows:
  • All data sources are viewed as relational
    databases.
  • SQL is the uniform way of interacting with these
    diverse data sources.
  • How data sources can be located across gateways.
  • How data sources can be queried dynamically to
    determine the types and format of data they can
    provide.
  • GLUE plays an important role in
  • Providing a consistent view of the way data is
    organised at a resource.
  • Ensuring a homogeneous view of data is achieved
    from heterogeneous sources.
  • Contrasts with the simplicity of the
    point-and-click queries!

41
Portal SQL Interface (1)
  • Aim: query a given data source, using SQL, for
    performance attributes, but without knowing the
    monitoring protocol it natively supports.
  • The SQL commands are:
  • show gateways: get a list of registered gateways.
  • use gateway: select a gateway to query (here, the
    DSG Workgroup Gateway).
  • show datasources: get a list of all data sources
    registered with the gateway.
  • use datasource: select a particular data source
    (here, homer.dsg.port.ac.uk).
  • show databases: retrieve a list of naming schemas
    supported by this data source driver.
  • use GLUECE_host: query a host's resource data using
    the GLUE Computing Element schema.

42
Portal SQL Interface (2)
  • show tables: observe the way the naming schema
    partitions resource data.
  • desc architecture: describe what an architecture
    table will provide.
  • select * from Architecture: read the
    architecture type for this data source.
  • select Last5min, Last15min from ProcessorLoad:
    read the 5 and 15 minute load averages.

43
Portal SQL Interface (3)
  • show gateways: get a list of registered gateways.
  • We need to find the list of gateways currently
    available to us.
  • Enter "show gateways" into the command line.

44
Portal SQL Interface (4)
  • The list of gateways is returned.
  • Select a gateway to query with the use gateway
    syntax
  • We enter the DSG gateway.

45
Portal SQL Interface (5)
  • The show datasources command returns a list of
    all data sources registered with the gateway
  • The gateway is selected and we now want to see
    data sources
  • We choose the homer data source

46
Portal SQL Interface ()
  • Keep going and eventually

47
Portal SQL Interface (6)
  • Query the resource driver for Architecture and
    CPU load averages for the last 5 and 15 minutes:
  • select * from Architecture
  • select Last5min, Last15min from ProcessorLoad

48
Normalisation (1)
  • GridRM clients can
  • Query an arbitrary data source,
  • Retrieve useful resource data in a standard way
  • Without knowing anything about the underlying
    resource's protocol or operation.
  • Pre-requisites
  • Appropriate drivers exist,
  • A translation schema exists.
  • Clearly data returned from a GridRM query can be
    very different from the native data produced.
  • Examples are given here to highlight some of the
    work that goes on within the GridRM drivers...

49
Normalisation (2)
  • Generic GridRM Syntax
  • We have seen that regardless of the underlying
    data source, the format of the GridRM SQL request
    and result remain the same.
  • Behind the scenes a GridRM data source driver
    translates and submits the SQL command into the
    format required by the data source's native
    protocol.
  • Example
  • Query a resource's memory capacity and
    availability, using the GLUE CE Schema:
  • > use GLUECE_host
    > select * from MainMemory
  • What happens next depends on the underlying data
    source

50
Normalisation (3)
  • Given a range of data sources
  • The granularity of native requests can differ
    greatly,
  • SNMP has fine granularity
  • Distinct data items can be selected and returned
    to the driver.
  • Whereas Ganglia responds to a single generic
    request
  • An XML document describes an entire Grid or
    cluster.
  • On an individual basis GridRM drivers request,
    fuse and filter data as appropriate for their
    associated data source.
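
To make the coarse-grained case concrete, the following sketch filters one metric for one host out of a cluster-wide XML report. It assumes a simplified document modelled on gmond-style output (`HOST` and `METRIC` elements with `NAME` and `VAL` attributes). The sample data and the `GangliaFilter` class are hypothetical and are not taken from the GridRM drivers.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

final class GangliaFilter {
    // Made-up sample, modelled on a gmond-style cluster report.
    static final String SAMPLE =
            "<GANGLIA_XML><CLUSTER NAME='demo'>"
          + "<HOST NAME='node1'>"
          + "<METRIC NAME='load_five' VAL='0.02'/>"
          + "<METRIC NAME='mem_free' VAL='204800'/>"
          + "</HOST>"
          + "<HOST NAME='node2'><METRIC NAME='load_five' VAL='0.50'/></HOST>"
          + "</CLUSTER></GANGLIA_XML>";

    /** Returns the VAL of the named metric on the named host, or null if absent. */
    static String metricValue(String xml, String host, String metric) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
        NodeList hosts = doc.getElementsByTagName("HOST");
        for (int i = 0; i < hosts.getLength(); i++) {
            Element h = (Element) hosts.item(i);
            if (!host.equals(h.getAttribute("NAME"))) {
                continue; // Skip every host except the one requested.
            }
            NodeList metrics = h.getElementsByTagName("METRIC");
            for (int j = 0; j < metrics.getLength(); j++) {
                Element m = (Element) metrics.item(j);
                if (metric.equals(m.getAttribute("NAME"))) {
                    return m.getAttribute("VAL");
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(metricValue(SAMPLE, "node2", "load_five"));
    }
}
```

The whole document is fetched and parsed even though one value is wanted, which is the opposite trade-off to SNMP's item-by-item requests.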

51
Normalisation (4)
  • SNMP requests could be formed of
  • Multiple get, getNext, getBulk requests,
  • To obtain appropriate data for the GridRM
    response.
  • Ganglia requests are of the form
  • telnet <hostname> 8649 (gmond)
  • telnet <hostname> 8651 (gmetad)
  • /proc requests open the specified file(s).
  • NetLogger requests read from a file.
  • Drivers use mappings to translate the incoming
    SQL request into an appropriate form that can be
    used to query the data source.
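
One way to picture such a mapping is a lookup table from schema table and column names to native identifiers. The sketch below assumes a very small SQL subset ("select a, b from T") and maps the `ProcessorLoad` columns used in the portal examples to the UCD-SNMP-MIB `laLoad` OIDs for the 5 and 15 minute averages. The `SnmpRequestMapper` class and its parsing rules are hypothetical and do not show how the GridRM drivers are written.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SnmpRequestMapper {
    // "Table.Column" -> OID. laLoad.2 and laLoad.3 hold the 5 and 15 minute loads.
    private static final Map<String, String> OIDS = new LinkedHashMap<>();
    static {
        OIDS.put("ProcessorLoad.Last5min", ".1.3.6.1.4.1.2021.10.1.3.2");
        OIDS.put("ProcessorLoad.Last15min", ".1.3.6.1.4.1.2021.10.1.3.3");
    }

    /** Translates "select col1, col2 from Table" into the OIDs to fetch. */
    static List<String> oidsFor(String sql) {
        String[] parts = sql.trim().split("(?i)\\s+from\\s+");
        if (parts.length != 2 || !parts[0].toLowerCase().startsWith("select ")) {
            throw new IllegalArgumentException("Unsupported query: " + sql);
        }
        String table = parts[1].trim();
        String columns = parts[0].substring("select ".length());
        List<String> oids = new ArrayList<>();
        for (String column : columns.split(",")) {
            String oid = OIDS.get(table + "." + column.trim());
            if (oid == null) {
                throw new IllegalArgumentException("No mapping for " + column.trim());
            }
            oids.add(oid);
        }
        return oids;
    }

    public static void main(String[] args) {
        System.out.println(oidsFor("select Last5min, Last15min from ProcessorLoad"));
    }
}
```

Each OID returned would then become one SNMP get request; a driver for a coarser source would instead map the whole query to a single fetch followed by filtering.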

52
Normalisation (5)
  • Native results could be of the form
  • SNMP single result string, e.g. CPU load over
    the last 5 minutes:

>> snmpget -c public 192.168.100.2 .1.3.6.1.4.1.2021.10.1.3.2
UCD-SNMP-MIB::laLoad.2 = STRING: 0.02
53
Normalisation (6)
  • or multiple SNMP result strings, e.g. CPU load
    over the last 1, 5 and 15 minutes as integer and
    floating-point values:

>> snmpwalk -c public 192.168.100.2 .1.3.6.1.4.1.2021.10.1
UCD-SNMP-MIB::laConfig.3 = STRING: 14.00
UCD-SNMP-MIB::laConfig.2 = STRING: 14.00
UCD-SNMP-MIB::laConfig.3 = STRING: 14.00
UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 8
UCD-SNMP-MIB::laLoadInt.2 = INTEGER: 2
UCD-SNMP-MIB::laLoadInt.3 = INTEGER: 1
UCD-SNMP-MIB::laLoadFloat.1 = Opaque: Float: 0.080000
UCD-SNMP-MIB::laLoadFloat.2 = Opaque: Float: 0.020000
UCD-SNMP-MIB::laLoadFloat.3 = Opaque: Float: 0.010000
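
Before any normalisation, a driver has to pull the value out of result strings like these. The sketch below assumes the usual net-snmp line layout, `NAME = TYPE: value`, where the type part may itself contain a colon (as in `Opaque: Float:`). `SnmpLineParser` is a hypothetical helper for illustration, not part of GridRM or net-snmp.

```java
final class SnmpLineParser {
    /**
     * Extracts the numeric value from a line such as
     * "UCD-SNMP-MIB::laLoad.2 = STRING: 0.02".
     */
    static double numericValue(String line) {
        int eq = line.indexOf(" = ");
        if (eq < 0) {
            throw new IllegalArgumentException("Not a NAME = TYPE: value line: " + line);
        }
        // Right-hand side, e.g. "STRING: 0.02" or "Opaque: Float: 0.080000".
        String rhs = line.substring(eq + 3).trim();
        // The value is whatever follows the last colon of the type prefix.
        String value = rhs.substring(rhs.lastIndexOf(':') + 1).trim();
        return Double.parseDouble(value);
    }

    public static void main(String[] args) {
        System.out.println(numericValue("UCD-SNMP-MIB::laLoadInt.1 = INTEGER: 8"));
    }
}
```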
54
Normalisation (7)
  • Other native results could be of the form
  • Ganglia
  • telnet <hostname> 8649 (gmond)

55
Normalisation (6)
  • Normalised Data
  • On retrieving native data, the driver normalises
    the data according to rules in the translation
    schema.
  • A simple example would be to read memory values
    natively in Kbytes and convert into Mbytes as
    required by the selected translation schema.
  • Clearly more complex normalisations occur in line
    with the driver response and the translation
    schema.
  • e.g. select a single item out of a page of
    returned information.
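
The Kbytes-to-Mbytes case can be sketched as a rule table applied to native readings. This is an illustration under stated assumptions: the `Rule` type, the native keys `mem_total` and `mem_free`, the schema field names `RAMSize` and `RAMAvailable`, the factor of 1024 and the sample readings are all chosen for the example and are not taken from a GridRM translation schema.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class Normaliser {
    /** Hypothetical translation rule: schema field = native value / divisor. */
    static final class Rule {
        final String nativeKey;
        final long divisor;

        Rule(String nativeKey, long divisor) {
            this.nativeKey = nativeKey;
            this.divisor = divisor;
        }
    }

    /** Applies each rule to the native data; fields with no native value are omitted. */
    static Map<String, Long> normalise(Map<String, Long> nativeData, Map<String, Rule> rules) {
        Map<String, Long> result = new LinkedHashMap<>();
        for (Map.Entry<String, Rule> entry : rules.entrySet()) {
            Rule rule = entry.getValue();
            Long raw = nativeData.get(rule.nativeKey);
            if (raw != null) {
                result.put(entry.getKey(), raw / rule.divisor);
            }
        }
        return result;
    }

    /** Converts made-up Kbyte readings into Mbytes (1 Mbyte = 1024 Kbytes assumed). */
    static Map<String, Long> demo() {
        Map<String, Rule> rules = new LinkedHashMap<>();
        rules.put("RAMSize", new Rule("mem_total", 1024));
        rules.put("RAMAvailable", new Rule("mem_free", 1024));
        return normalise(Map.of("mem_total", 524288L, "mem_free", 204800L), rules);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

Keeping the rules as data rather than code is one way a single driver could serve several naming schemas, by swapping the rule table.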

56
Normalisation (7)
  • Normalised results are encoded in XML (GLUE) when
    they leave a gateway en route through the GridRM
    infrastructure.

57
Normalisation (8)
  • Normalised resulting XML contains additional
    metadata
  • That refers to the schema (i.e. GLUECE) used for
    data normalisation,
  • To promote abstraction, SQL data types describe
    schema field value types,
  • To determine the meaning of the returned value,
    reference to the translation schema is required
  • This requirement is under review
  • i.e. could add elements of the translation schema
    metadata into the query result.

61
Portal Current GUI
62
GridRM GUI
Homogeneous view of the data sources
63
Summary
  • Heterogeneous information returned from a diverse
    range of possible data sources.
  • Need to harvest data into a homogeneous form:
  • Hide underlying complexity from clients.
  • Provide data in a format that meets a client's
    requirements.
  • Combine legacy resources with modern cluster and
    Grid information servers to provide
  • An over-arching Grid information system.
  • Independent of particular middleware and
    services.
  • GridRM promotes homogeneity through
  • JDBC-like data source driver,
  • Standard SQL syntax,
  • The GLUE naming schemas,
  • Request translation and result normalisation,
  • GLUE XML encoded results.

64
Summary
  • If we want to provide a global infrastructure
    that applications developers and users can safely
    and effectively use, then we need to provide a
    variety of monitoring tools and services.
  • We need these to:
  • Analyse, predict and tune performance,
  • Detect and identify faults,
  • Identify bottlenecks,
  • Schedule the best resources,
  • Use data as input into higher-level services.
  • Monitoring, like security, needs to be built into
    systems from the design stage.
  • Without monitoring the dream of the Grid cannot
    be realised.

65
Future Work
  • Provide an example of a job submission system
    using GridRM, several options
  • Other schedulers, Condor, NGE,
  • Further security
  • Integrate UK e-Science certificates for resource
    access control.
  • Secure driver propagation, i.e., code signing,
    trust mechanisms.
  • Performance and scalability testing.
  • More translation schemas for different resources:
  • DBMS, telescope, surf conditions!
  • Use of portlet technologies to provide a better
    Web interface - GridSphere?!
  • Integrate KBS Intelligent Agents, and virtual
    registry.

66
Virtual Meta-Registry
67
Testbed Status (1)
  • Currently have 12 sites across 8 countries
  • 5 gateways operational
  • 1 installed but with problems,
  • 2 being set up (OS, firewall issues),
  • 4 waiting for accounts to be provided.

68
Testbed Status (2)
  • List of sites and status: http://gridrm.org/testbed.html
  • World domination is nigh!!

69
Acknowledgements
  • DSG students
  • GridRM, Garry.Smith_at_computer.org,
  • jGMA, Matthew.Grove_at_port.ac.uk

70
More Information
  • GridRM, http://gridrm.org
  • jGMA, http://dsg.port.ac.uk/projects/jGMA/
  • GMA WG, http://www-didc.lbl.gov/GGF-PERF/GMA-WG/
  • GLUE, http://www.hicb.org/glue/glue.htm
  • JDBC, http://java.sun.com/products/jdbc/
  • Ganglia, http://ganglia.sourceforge.net/
  • SNMP, http://net-snmp.sourceforge.net/
  • NetLogger, http://www-didc.lbl.gov/NetLogger/
  • Network Weather Service, http://nws.cs.ucsb.edu