Keeping a Hawkeye on The Grid - PowerPoint PPT Presentation

Transcript and Presenter's Notes


1
Keeping a Hawkeye on The Grid
  • Nick LeRoy
  • Computer Sciences Department
  • University of Wisconsin-Madison
  • nleroy_at_cs.wisc.edu
  • http://www.cs.wisc.edu/condor/hawkeye

2
The Grid Idea
  • Large scale distributed computing
  • Solve massive computational problems

3
The Grid Reality
  • Sites go down
  • Updates need to be synchronized
  • Firewalls get in the way
  • Human errors occur
  • Separate administrative domains cause
    inconsistencies

4
The Emperor's New Grid
5
Some Observations
  • A lot of these problems can't be solved by
    technology alone
  • Human errors
  • Separate administrative domains
  • Detecting problems is often quite difficult and
    time-consuming

6
More Observations
  • Can't fix problems before we're aware of them
  • Impossible to prevent all classes of problems

7
So...
  • Often more cost-effective to detect and work around
    problems
  • Even when prevention is possible
  • Detection is always required
  • Automation is our friend

8
Watching the Grid
  • We need a monitoring system
  • That can automate detection
  • That is flexible
  • That is easy to deploy
  • That can alert us of problems
  • In a timely manner

9
Hawkeye
  • Hawkeye is a monitoring tool
  • Designed for grid and distributed applications
  • Provides Automated detection
  • Very Flexible
  • Is Easy to deploy
  • Provides Timely Alerts of problems

10
Watching the Grid
11
Hawkeye Uses
  • Can be used for
  • Monitoring system load, I/O, usage, etc.
  • Watching for run-away processes
  • Monitoring the health of your pool
  • Watching the health of a grid site
  • ...

12
Some Details about Hawkeye
  • Distributed monitoring system
  • Uses a push data model
  • Built on Condor technology
  • Uses ClassAds match-making
  • Every Condor has Hawkeye built-in
  • Stable, production quality

13
Hawkeye UI
  • Alert you when things go wrong
  • When virtually any condition is found
  • When various problems are found
  • My checkpoint server's disk is full
  • Joe has had a CVS lock for 20 minutes
  • Help you visualize what's going on
  • Plotting via RRDT

14
Why would I run Hawkeye?
  • Make system administration easier
  • Simplify pool maintenance
  • Condor
  • Other batch systems
  • Scalable solution

15
Hawkeye Architecture
16
Hawkeye Monitoring Agent
Hawkeye Job Manager
ClassAd
17
Hawkeye ClassAds
  • Hawkeye uses Condor ClassAds to represent
    collected data
  • Schema-free data representation
  • Provides matching mechanism
  • Represent whatever data you gather in a way that
    works best for you

18
Hawkeye ClassAds
  • Example ClassAd snippet
  • RAM_MemFree = 841932800
  • RAM_MemShared = 0
  • RAM_MemTotal = 1055367168
  • RAM_SwapCached = 0
  • RAM_SwapFree = 2147483647
  • RAM_SwapTotal = 2147483647
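
A minimal sketch, in Python, of how a flat list of attribute/value pairs like the snippet above could be read into a dictionary and used. It assumes one `Name = value` pair per line with integer values only. It is not a ClassAd parser, and the name `parse_flat_ad` is hypothetical rather than part of Hawkeye or Condor.

```python
def parse_flat_ad(text):
    """Parse lines of the form 'Name = integer' into a dict.

    Assumption: one attribute per line, integer values only. Real
    ClassAds also allow strings, booleans, and expressions.
    """
    ad = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, _, value = line.partition("=")
        ad[name.strip()] = int(value.strip())
    return ad


SNIPPET = """
RAM_MemFree = 841932800
RAM_MemTotal = 1055367168
RAM_SwapFree = 2147483647
RAM_SwapTotal = 2147483647
"""

ad = parse_flat_ad(SNIPPET)
# Fraction of physical memory currently free.
free_fraction = ad["RAM_MemFree"] / ad["RAM_MemTotal"]
```

Because the representation is schema-free, a derived value such as `free_fraction` needs nothing beyond the attribute names the module chose to publish.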

19
Hawkeye Modules
  • Current library of modules monitors
  • Processes, CPU usage, etc.
  • RAM, I/O, VM Statistics, etc.
  • Disk space
  • CVS repository
  • GASS Cache statistics
  • etc.

20
Hawkeye and Condor
  • Hawkeye has Condor specific tools
  • Developed to help us run our pool

21
Condor Node Module
  • Run on each node of the pool
  • Watches the Condor daemons
  • Monitors multiple virtual machines
  • Can identify run-away or orphaned jobs / processes

22
Condor Pool Module
  • Run on just one host
  • Reports overall pool health
  • Watches for absent nodes
  • Lots of data on
  • Job Submitters
  • Running Jobs
  • CPUs in the pool
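
The slides do not say how the pool module decides that a node is absent. One plausible approach, sketched here in Python under that caveat, is to compare the set of expected nodes against the time each one last reported. The function name, the arguments, and the age threshold are all hypothetical.

```python
def absent_nodes(expected, last_seen, now, max_age):
    """Return expected nodes with no report newer than `max_age` seconds.

    expected:  iterable of node names that should be in the pool
    last_seen: dict mapping node name -> timestamp of its last report
    """
    return sorted(
        node for node in expected
        # A node that never reported is treated as infinitely stale.
        if now - last_seen.get(node, float("-inf")) > max_age
    )
```

The length of the returned list is the kind of count an alert could then test against.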

23
Other Condor Modules
  • Checkpoint server module
  • Watches checkpoints, disk space, etc.
  • Job history module
  • Number and types of jobs, etc.

24
Custom Hawkeye Modules
  • Hawkeye allows you to run your own custom
    modules to gather data
  • Simple text to stdout
  • Can be a shell one-liner
  • Can be a 100-line Perl program
  • All current modules are in Perl
  • Can be a 10k-line C program
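
To illustrate the "simple text to stdout" idea, here is a minimal sketch of a custom module in Python. The slides do not specify the exact output format Hawkeye expects. `Name = value` lines are assumed here by analogy with ClassAd attributes, and the `Disk_` prefix and function names are hypothetical.

```python
import shutil
import sys


def format_attributes(prefix, values):
    """Render a dict as 'Prefix_Name = value' lines, sorted by name."""
    return [f"{prefix}_{name} = {value}" for name, value in sorted(values.items())]


def disk_attributes(path="/"):
    """Collect total and free bytes for one filesystem path."""
    usage = shutil.disk_usage(path)
    return {"Total": usage.total, "Free": usage.free}


if __name__ == "__main__":
    # A module's whole job: print attributes to stdout and exit.
    sys.stdout.write("\n".join(format_attributes("Disk", disk_attributes())) + "\n")
```

Keeping data collection (`disk_attributes`) separate from formatting (`format_attributes`) makes the output easy to check without touching the real filesystem.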

25
Hawkeye Alerts
  • Hawkeye allows you to set your own custom
    alerts
  • On attributes generated by standard and/or custom
    modules
  • Flexible, uses ClassAd Match-making
  • Used to generate dynamic web pages

26
Hawkeye Matchmaking
  • Hawkeye alerts are done using ClassAd
    match-making.

27
Sample Alert Trigger
  • AlertTrigger = ( MyType == "Pool" &&
    Absent.count > 5 )
  • AlertSeverity = ( Absent.count > 5 ) ? 1 : 0
  • Name = "Absent Nodes"
  • AlertText = StrCat(Absent.count,
    " machines are missing in ", Name)
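
The logic of the sample trigger can be mirrored in a few lines of Python. This is only a sketch of the comparison it expresses, not of ClassAd match-making itself. It assumes `Absent.count` can be modeled as a nested dictionary entry and that `Name` in the alert text refers to the name of the pool ad. `evaluate_alert` is a hypothetical helper.

```python
def evaluate_alert(ad, threshold=5):
    """Mirror the sample trigger against a dict-based ad.

    Assumption: 'Absent.count' is modeled as ad["Absent"]["count"].
    Returns None when the trigger does not fire.
    """
    absent = ad.get("Absent", {}).get("count", 0)
    if ad.get("MyType") != "Pool" or absent <= threshold:
        return None
    return {
        "Name": "Absent Nodes",
        "AlertSeverity": 1,
        "AlertText": f"{absent} machines are missing in {ad.get('Name', 'pool')}",
    }
```

An ad of any other type, or a pool ad with five or fewer absent machines, yields no alert.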

28
Advanced Trigger Tool
  • ClassAd based trigger system with state
  • Example: Take some action if a machine has been
    heavily loaded for a certain amount of time
  • Much more flexible
  • You specify the action to take
  • Maintains state information

29
More on Advanced Trigger
  • Both current and previous state can be used in
    generating a trigger
  • Example: Send me an email when the system has
    been heavily loaded for a specified time, but
    don't flood my inbox with them...
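
The email example combines two pieces of remembered state: when the load first went high, and when the last alert was sent. A minimal Python sketch of that idea follows. The class name, parameters, and thresholds are hypothetical, and the actual trigger tool is ClassAd based rather than written like this.

```python
class SustainedLoadTrigger:
    """Fire when load stays above a threshold for `hold` seconds,
    then stay quiet for `cooldown` seconds to avoid repeated alerts."""

    def __init__(self, threshold, hold, cooldown):
        self.threshold = threshold
        self.hold = hold
        self.cooldown = cooldown
        self.high_since = None   # previous state: when load first went high
        self.last_fired = None   # previous state: when we last alerted

    def update(self, now, load):
        """Feed one sample; return True if an alert should be sent."""
        if load <= self.threshold:
            self.high_since = None
            return False
        if self.high_since is None:
            self.high_since = now
        if now - self.high_since < self.hold:
            return False
        if self.last_fired is not None and now - self.last_fired < self.cooldown:
            return False
        self.last_fired = now
        return True


trigger = SustainedLoadTrigger(threshold=4.0, hold=600, cooldown=3600)
samples = [(0, 5.0), (300, 5.2), (600, 5.1), (900, 5.3), (1200, 1.0)]
fired = [t for t, load in samples if trigger.update(t, load)]
```

With these samples the trigger fires once the load has been high for the full hold period, and the later high sample falls inside the cooldown and is suppressed.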

30
Hawkeye Extras
  • Currently available
  • Tool to set up Condor to easily install and run
    Hawkeye modules
  • In development
  • Grid Exerciser module
  • Data plotting tool

31
Hawkeye at UW
  • Currently at UW CS, we're using Hawkeye
    extensively
  • To monitor our 1400-CPU Condor cluster
  • To aid in detecting and correcting cluster
    problems
  • Hawkeye is one of our main tools for pool
    administration

35
What is the status of Hawkeye?
  • Version 1.0 Release Candidate 5 (RC5)
  • Version 1.0 real soon
  • Available from http://cs.wisc.edu/condor/hawkeye
  • Get help
  • Condor: condor-admin_at_cs.wisc.edu