Extensible Scalable Monitoring for Clusters of Computers

1
Extensible Scalable Monitoring for Clusters of
Computers
  • Eric Anderson
  • U.C. Berkeley
  • Summer 1997 NOW Retreat

2
Overall Problem
  • Monitoring a cluster of cooperating computers
  • Different from client-server where only servers
    matter
  • Requires substantial information from all
    machines
  • 100s-1000s of nodes
  • Client-server becomes subset of this problem

3
Problems & Solutions
  • Cluster software and hardware are constantly
    evolving
  • Monitoring software must be extensible and
    flexible
  • Use relational tables
  • Failures will occur in the cluster
  • Monitoring software must detect and recover from
    failures
  • Use timestamps for weak synchronization
  • Scalability needed to hundreds of nodes
  • Need to efficiently transfer data from sources to
    sinks
  • Use hierarchy & hybrid push-pull protocol
  • Need to display statistics and information from
    all nodes
  • Use statistical aggregation & color, shade to
    minimize information loss

4
Overview
  • Details of solutions
  • Handling evolving software
  • Detecting and recovering from failures
  • Scaling data management
  • Scaling visualization
  • Implementation
  • Architecture
  • Programs
  • Snapshot
  • Experience
  • Conclusion & Future Work

5
Problem: Clusters Evolve
  • Solution: Relational tables
  • Increases flexibility by decoupling data users
    from data providers
  • Increases extensibility by structuring data into
    independent tables
  • Increases extensibility by allowing additional
    columns in tables without breaking old programs
  • Retains performance through transparent use of
    indices
  • Improvement over tree structures in previous
    systems
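
A minimal sketch of the extensibility point above: an old program that names only the columns it needs keeps working after a data provider adds a column. It uses Python's built-in sqlite3 module as a stand-in for the deck's database, and the table and column names (node_load, free_mem_kb) are hypothetical.

```python
import sqlite3

# Hypothetical schema for illustration; not taken from the original system.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node_load (host TEXT, load REAL, updated INTEGER)")
# The index is transparent to readers: queries do not mention it.
db.execute("CREATE INDEX node_load_host ON node_load (host)")
db.execute("INSERT INTO node_load VALUES ('u1', 0.5, 100)")


def old_reader(conn):
    """An 'old program' that names only the columns it uses."""
    return conn.execute(
        "SELECT host, load FROM node_load WHERE host = ?", ("u1",)
    ).fetchall()


before = old_reader(db)

# A newer data provider extends the table with an extra column.
db.execute("ALTER TABLE node_load ADD COLUMN free_mem_kb INTEGER")
db.execute("INSERT INTO node_load VALUES ('u2', 1.5, 101, 2048)")

after = old_reader(db)
print(before, after)
```

The old reader returns the same rows before and after the schema change because it never depended on the full column list.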

6
Problem: Failures Occur
  • Solution: Use timestamps
  • Loss of periodic updates to timestamps allows
    remote nodes to detect failures
  • Timestamps allow weak synchronization between
    databases
  • Better availability during failures, simpler
    recovery
  • Timestamps allow stale data to be eliminated
  • Only requires purges run every so often rather
    than relying on programs to clean up after
    themselves
  • Reasons 2 & 3 are useful even in normal operation
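
The timestamp ideas above can be sketched as follows. This is an illustration under assumptions, not the original implementation: it uses sqlite3 as a stand-in database, and the table name, the 30-second staleness threshold, and the helper names are all hypothetical. Time is passed in explicitly so the example is deterministic.

```python
import sqlite3

STALE_AFTER = 30  # seconds without an update before a node counts as down (assumed)

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE heartbeat (host TEXT PRIMARY KEY, updated INTEGER)")


def record_heartbeat(conn, host, now):
    """Periodic update: each node refreshes its own timestamp."""
    conn.execute("INSERT OR REPLACE INTO heartbeat VALUES (?, ?)", (host, now))


def down_nodes(conn, now):
    """Failure detection: hosts whose timestamp stopped advancing."""
    rows = conn.execute(
        "SELECT host FROM heartbeat WHERE updated < ? ORDER BY host",
        (now - STALE_AFTER,),
    )
    return [host for (host,) in rows]


def purge(conn, now, max_age):
    """Stale-data elimination: run occasionally, no per-program cleanup."""
    cur = conn.execute(
        "DELETE FROM heartbeat WHERE updated < ?", (now - max_age,)
    )
    return cur.rowcount


record_heartbeat(db, "u1", 100)
record_heartbeat(db, "u2", 100)
record_heartbeat(db, "u1", 140)  # u1 keeps updating; u2 has gone silent

suspects = down_nodes(db, now=150)
removed = purge(db, now=150, max_age=40)
remaining = [h for (h,) in db.execute("SELECT host FROM heartbeat")]
print(suspects, removed, remaining)
```

Detection and purging are both plain comparisons against a timestamp column, which is why neither needs cooperation from the failed node.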

7
Problem: Scalable Data Access
  • Solution: Hierarchy & efficient protocol
  • Hierarchy allows
  • Batching of data from different nodes (all data
    from routers)
  • Specialization to particular data (all data on
    processes)
  • Efficient protocol (Hybrid of push/pull)
  • Sink sends (SQL select command, interval, count)
    to source
  • Changed data is extracted via SQL every interval
    seconds and forwarded to the sink count times
  • Sink can cancel requests at any time
  • Achieves the best of pull and push protocols in
    terms of wasted data transfers, freshness, and
    network bandwidth
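
A single-threaded sketch of the hybrid protocol described above. Several details are assumptions made for illustration: the select statement carries one "?" placeholder for the last timestamp already sent, the timestamp is the last selected column, "changed" means a newer timestamp, and count is decremented once per interval even when nothing changed. The class and method names are hypothetical, and sqlite3 stands in for the real database.

```python
import sqlite3
from dataclasses import dataclass


@dataclass
class Request:
    select: str       # SQL select with one "?" for the last-sent timestamp
    interval: int     # seconds between extractions
    count: int        # extractions remaining
    next_due: int = 0
    last_seen: int = -1


class Source:
    """Source side: pulls from the local database, pushes to the sink."""

    def __init__(self, conn, send):
        self.conn = conn
        self.send = send          # callback standing in for the network
        self.requests = {}
        self.next_id = 0

    def register(self, select, interval, count, now):
        rid = self.next_id
        self.next_id += 1
        self.requests[rid] = Request(select, interval, count, next_due=now)
        return rid

    def cancel(self, rid):
        # The sink can cancel a request at any time.
        self.requests.pop(rid, None)

    def tick(self, now):
        for rid, req in list(self.requests.items()):
            if now < req.next_due:
                continue
            rows = self.conn.execute(req.select, (req.last_seen,)).fetchall()
            if rows:  # forward only changed data
                self.send(rid, rows)
                req.last_seen = max(row[-1] for row in rows)
            req.next_due = now + req.interval
            req.count -= 1
            if req.count <= 0:
                del self.requests[rid]


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE node_load (host TEXT, load REAL, updated INTEGER)")
received = []
src = Source(db, lambda rid, rows: received.append(rows))
rid = src.register(
    "SELECT host, load, updated FROM node_load WHERE updated > ?",
    interval=5, count=3, now=0,
)

db.execute("INSERT INTO node_load VALUES ('u1', 0.5, 1)")
src.tick(0)    # first extraction: u1 is new, so it is pushed
src.tick(3)    # not due yet
src.tick(5)    # due, but nothing changed: no transfer
db.execute("INSERT INTO node_load VALUES ('u2', 1.5, 7)")
src.tick(10)   # third and last extraction: u2 is pushed
src.tick(15)   # request has expired
print(received)
```

One request message from the sink yields several pushes, and intervals with no changes cost no transfer, which is the trade-off the slide attributes to the hybrid design.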

8
Problem: Scalable Visualization
  • Solution: Statistical aggregation & use of shade &
    color to minimize information loss
  • Aggregate across similar variables (average load
    of 10 machines); show dispersion (std. dev.) as
    shade
  • Aggregate across variables from one node
    (utilization = max(disk, network, cpu))
  • Both forms of aggregation at the same time;
    hierarchical aggregation
  • Use color to draw attention to special things
    (nodes down) to limit visual overload
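
The two forms of aggregation above can be sketched in a few lines. The mapping from standard deviation to a gray level, the max_std scale, and the function names are assumptions for illustration; the deck does not specify how shade is computed.

```python
from statistics import mean, pstdev


def aggregate_group(values):
    """Aggregate one variable across similar nodes: (average, dispersion)."""
    return mean(values), pstdev(values)


def shade(std, max_std=1.0):
    """Map dispersion to a gray level, 255 (uniform) down to 0 (widely spread).

    The linear mapping and max_std are assumed, not from the original system.
    """
    frac = min(std / max_std, 1.0)
    return round(255 * (1 - frac))


def node_utilization(disk, network, cpu):
    """Aggregate across variables from one node: the busiest resource."""
    return max(disk, network, cpu)


# Both forms at once: per-node utilization, then a group average and shade.
nodes = [
    {"disk": 0.2, "network": 0.9, "cpu": 0.5},
    {"disk": 0.1, "network": 0.3, "cpu": 0.7},
]
per_node = [node_utilization(**n) for n in nodes]
group_avg, group_std = aggregate_group(per_node)
print(per_node, group_avg, shade(group_std))
```

Showing the average together with a shade for its dispersion lets one cell stand for many machines without hiding the fact that they disagree.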

9
Implementation: Architecture

10
Implementation: Details
  • Databases are MiniSQL
  • Freely available with source code
  • Implements subset of SQL
  • Forwarder implements source part of hybrid
    protocol
  • Uses polling to get data from database
  • Joinpush implements merging part of hierarchy
  • Control of merge sources external to the program
  • Both forwarder & joinpush implemented in threaded
    C
  • Simpler implementation for blocking operations
  • Could be merged in with the database

11
Implementation: Details, cont.
  • Gather implemented in perl
  • Simpler to add new data sources, but would like
    threading
  • Somewhat inefficient, might re-implement in C
  • Javaserver implemented in perl
  • Easier to extend with additional aggregation
    forms
  • Application-level proxy because Java can't access
    network
  • Javaclient implemented in Java
  • Allows clients to run in browser anywhere in the
    world
  • Weak feedback to javaserver to control
    information displayed

12
Implementation: Snapshot

13
Experience
  • Configuration information should be in database
  • Had it in random files; database collects it
    together
  • Reset-world operation very important
  • Puts system in known state
  • Useful for default destination of statistics of
    remote database
  • Minimizes load on monitored nodes
  • Potentially reduces fault tolerance
  • Browser user interface very useful
  • Limitations of Java very obnoxious

14
Conclusion
  • Four problems & solutions important for any
    cluster monitoring system
  • Evolution inherent in uses of clusters
  • Independent failures occur in all clusters
  • Scalability of data management needed for large
    clusters
  • Scalability of visualization also needed for
    large clusters
  • Implementation works and is initially useful;
    further deployment needed
  • Experience identified problems, places for
    improvements.

15
Future Work
  • Automatic identification of statistics relevant
    to problems
  • Expect to be able to use Boolean disjunction
    learning algorithms
  • Tracking of long term trends and statistical
    measures
  • Self tuning of specialized databases based on
    usage
  • Addition of notification, repair components
  • Gathering of more statistics (via SNMP for
    example)
  • Distribution of system to external sites