Globus Heartbeat Monitor and MicroGrids - PowerPoint PPT Presentation

Provided by: Andre524

1
Globus Heartbeat Monitor and MicroGrids

Announcements/Review
HW2, Due Saturday, 2/6
Project assignment part I, out today, due 2/17
(Wednesday)
Last Time
Grid Security Architecture Issues
What are the unique challenges for Grid Security
A Framework for Grid Security
Perspective
Global Access to Secondary Storage
Motivation and Requirements
Grid Applications and Distributed Filesystems
GASS operations and performance

2
Today's Outline

Globus Heartbeat Monitor
Motivation, Mechanisms, Capability
MicroGrid Environments
Motivation
Goals
Elements

3
Fault Tolerance in Grid Applications

What are the requirements?
What are the existing mechanisms?
The Heartbeat Monitor

4
Requirements

Run large HPDC applications reliably
Requirements depend on the nature of the
application
Time to detection of failures
Time to recover, appropriate method of recovery
Fundamental aspects
Mechanism for detecting failures (system,
application process, service)
Mechanisms for recovery

5
Failure Detector Requirements

Scalability (key Grid requirement)
1,000s to 1,000,000s of nodes
Flexibility (key Grid requirement)
Correct policy depends intimately on the
application
Accuracy
Timeliness
Low overhead

6
Failure Detection and Recovery Mechanisms

Heartbeat Monitors
Algorithms for Agreeing on an action
Process replication (Active or Passive)
Active replica takes over
Logging and Reconstruction
Checkpoint and restart
Snapshots of process state
Reload in case of failure and recompute
Idempotence and decoupled interaction
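
The checkpoint-and-restart item above can be illustrated with a minimal sketch. It is not taken from the lecture or from Globus: the function names, the pickle file format, and the `fail_at` hook that simulates a crash are all assumptions made for illustration.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write a snapshot of `state` to `path` atomically."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # os.replace is atomic on one filesystem, so a crash during the write
    # never leaves a half-written checkpoint at `path`.
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run(path, total_steps, checkpoint_every, fail_at=None):
    """Sum 0..total_steps-1, saving a snapshot every `checkpoint_every` steps.

    `fail_at` is a hypothetical hook that simulates a crash at that step.
    """
    state = load_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state["acc"]
```

After a failure, calling `run` again with the same `path` reloads the last snapshot and recomputes only the steps that followed it. Work done since the last snapshot is lost, which is the overhead versus recovery-time tradeoff the later slides keep asking about.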

7
Grid Application Examples

Distributed Supercomputing
Real-time Distributed Instrumentation
Data-Intensive Computing
Tele-Immersion

8
Distributed Supercomputing
[Figure: application stages LHSF, LOG-D, and ASY. Recovered labels: 230 Mflops, 115 Mflops; LHSF 512 MB, 400 MB output, 3.5 C90 hours; 250 Mflops (Cray), 1280 Mflops (MPP machine); LOG-D 512 MB, 1.5 C90 hours]

Excessive computational requirements, met only by
combining multiple high capacity resources
(national/world supercomputer)
Applications exploit capacity, special
performance character, and significant bandwidth
hierarchies to deliver performance over wide area
networks

9
Distributed Supercomputing Fault Tolerance

Intermediate results stored in files:
independent failures can simply result in reruns
loose coupling amongst the sites
Intermediate results transient on the network,
completely or in part:
tighter coupling, failures will require reruns in
multiple or all locations
What overhead is tolerable for fault tolerance in
this type of application?

10
Real-time Distributed Instrumentation

Instruments: Advanced Light Source, particle
accelerator, electron microscope, MRI machine
Computation for data processing, control, remote
data storage and processing

11
RTDI Fault Tolerance

Online processing of data
failure of sensor inputs: find another or fail the
application?
failure of repository: defer commit to archive?
failure of computational or network resources:
find another set of resources?
-> some things can continue and some cannot
-> loss of data?
real-time fault detection and recovery (hard)
What overhead is tolerable?

12
Data Intensive Computing

Manipulating 10s of Terabytes, coupling to high
speed computation
Data Mining, Sequence Matching, Cross data set
analyses

13
Data Intensive Fault Tolerance

Large data archives -- may be replicated or not
can substitute if possible, but I/O capability
may or may not be available
Compute and network resources are the easy things
to deal with
Often the applications are data parallel and
intermediate results may be
amenable to checkpoint and restart
simple interfaces may support this
What overhead is tolerable?

14
Tele-immersion

Combination of immersive virtual reality over a
network where any element can be remote
Avatars, natural/artificial interaction, many
modes
Computationally and network intensive

15
Tele-immersion Fault Tolerance

Endpoint failures -> how to continue?
Networks and Computational resources
would like to fail over
perhaps reconnect is good enough (if infrequent)
What overhead is tolerable?

16
Globus Heartbeat Monitor

Simple set of tools for process monitoring
Local monitors (observe and generate heartbeats),
use simple local process monitoring mechanisms
Applications register with the local monitor if
they are going to be monitored
APIs for applications to be notified of process
monitoring events
Data collector API which allows applications to
filter and handle the monitoring events
Applications decide and implement appropriate
action based on the events received
Applications must filter the information (false
positives) and also take appropriate global
action (dealing with zombies, etc.)
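
The division of labor on this slide (monitors report, applications filter and decide) can be sketched as follows. This is not the Globus HBM API: `DataCollector`, `Event`, `make_filter`, and every method name are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Event:
    process: str
    silent_for: float  # seconds since the last heartbeat was seen


@dataclass
class DataCollector:
    """Hypothetical collector: tracks heartbeats and notifies subscribers."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)
    callbacks: List[Callable[[Event], None]] = field(default_factory=list)

    def register(self, process: str, now: float) -> None:
        self.last_seen[process] = now

    def heartbeat(self, process: str, now: float) -> None:
        # Heartbeats from processes that never registered are ignored.
        if process in self.last_seen:
            self.last_seen[process] = now

    def subscribe(self, callback: Callable[[Event], None]) -> None:
        self.callbacks.append(callback)

    def poll(self, now: float) -> None:
        """Report every registered process that has been silent too long."""
        for process, seen in self.last_seen.items():
            silent = now - seen
            if silent > self.timeout:
                for callback in self.callbacks:
                    callback(Event(process, silent))


def make_filter(tolerance: float, suspected: List[str]) -> Callable[[Event], None]:
    """Application-side policy: suspect a process only after `tolerance` seconds."""
    def on_event(event: Event) -> None:
        if event.silent_for >= tolerance and event.process not in suspected:
            suspected.append(event.process)
    return on_event
```

The collector reports anything silent for longer than its own timeout, and the application applies a stricter tolerance before acting. That keeps policy in the application, as the slide describes, at the cost of each application writing its own filter.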

17
Implementation

One monitoring process per GUSTO resource
UDP packets (lower latency)
10 second heartbeat intervals (empirically
chosen)
System Oriented Monitoring (part of the testbed)
seems this is clearly important
minimum status for schedulers, application
planners, etc.
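
A UDP heartbeat sender and receiver along the lines of this slide might look like the sketch below. The wire format (magic bytes, sequence number, timestamp) is invented for illustration and is not the format used by the Globus HBM. Only the 10-second interval comes from the slide.

```python
import socket
import struct
import time

# Hypothetical wire format: 4-byte magic, 32-bit sequence number,
# 64-bit float timestamp, all in network byte order (16 bytes total).
HEARTBEAT = struct.Struct("!4sId")
MAGIC = b"HBM0"
INTERVAL_SECONDS = 10.0  # the interval quoted on the slide


def encode(seq, timestamp):
    """Pack one heartbeat into a datagram payload."""
    return HEARTBEAT.pack(MAGIC, seq, timestamp)


def decode(packet):
    """Return (seq, timestamp), or None for a malformed packet."""
    if len(packet) != HEARTBEAT.size:
        return None
    magic, seq, timestamp = HEARTBEAT.unpack(packet)
    return (seq, timestamp) if magic == MAGIC else None


def send_heartbeats(collector_addr, count, interval=INTERVAL_SECONDS):
    """Send `count` heartbeats to `collector_addr`, one per `interval` seconds."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for seq in range(count):
            sock.sendto(encode(seq, time.time()), collector_addr)
            if seq + 1 < count:
                time.sleep(interval)


def receive_heartbeat(sock, timeout):
    """Wait up to `timeout` seconds; return (seq, timestamp) or None."""
    sock.settimeout(timeout)
    try:
        # Read more than one heartbeat's worth so oversized packets are rejected.
        packet, _sender = sock.recvfrom(64)
    except socket.timeout:
        return None
    return decode(packet)
```

UDP gives no delivery guarantee, so a missing heartbeat may mean a lost datagram rather than a dead process. The sequence number lets a receiver tell the two apart once a later heartbeat arrives.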

18
HBM Statistics

Testbed spanning Midwest to Southern California
almost national
Heartbeat period
source period of 10 seconds, 1 overhead
interarrival distributions with significant
weight over 200 seconds
Claims
35 seconds, false positives of 1 in 100s
240 seconds, false positives of 1 in 1,000,000s
Are these numbers stable? What level of false
positives is acceptable? What latencies for
failure detection are useful?
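
The tradeoff behind these claims (a longer timeout means fewer false positives but slower detection) can be estimated from measured heartbeat interarrival times. The sketch below is an illustrative calculation, not the method behind the quoted figures, and the sample gaps in the usage note are made up.

```python
def false_positive_rate(interarrivals, timeout):
    """Fraction of observed heartbeat gaps longer than `timeout` seconds.

    Each such gap would be reported as a failure even though the sender
    was alive, so it counts as a false positive.
    """
    if not interarrivals:
        raise ValueError("need at least one interarrival sample")
    late = sum(1 for gap in interarrivals if gap > timeout)
    return late / len(interarrivals)


def timeout_for_target(interarrivals, target_rate):
    """Smallest observed gap that, used as the timeout, meets `target_rate`."""
    if target_rate < 0:
        raise ValueError("target_rate must be non-negative")
    for candidate in sorted(set(interarrivals)):
        if false_positive_rate(interarrivals, candidate) <= target_rate:
            return candidate
    # Not reached for non-empty input: the largest gap always gives rate 0.
    raise ValueError("no candidate timeout found")
```

For the made-up gaps `[10, 10, 20, 40, 10, 250, 10, 10, 30, 10]`, a 35-second timeout misclassifies two of ten gaps and a 240-second timeout misclassifies one. Measured rates like these are only as stable as the network conditions during measurement, which is the concern the slide raises.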

19
Is Application Fault tolerance a good idea?

Imposes the complexity of management on the
application (or library for the application
domain)
Enables customized lighter weight solutions that
allow more appropriate action to be taken
What useful capabilities does the Globus HBM
provide?
Other perspectives?

20
MicroGrid Software Environments

Motivation
Reduced Experimentation effort
How much work is it to configure an experiment?
What resources will be made available to just
experiment?
How many folks can realistically experiment at
national (or Global) scale?
All of this in an environment we depend on for
many things?
Enable virtualization of a set of Grid
resources, thereby enabling experiments to be run
at significantly lower effort
lower entry barrier to experimentation (enable
graduate students!)
accelerate the rate of progress in development
and knowledge of the issues in how to build
computational grids

21
MicroGrids (cont)

Motivation (cont)
Scientific study: how do we know what will work
in a grid environment?
Real grids are uncontrollable environments
events cannot be produced (coverage)
events may not be reproducible
separation of instrumentation impact on system
(e.g. perturbation) may not be possible
For example, how can we study the dynamical
properties of a new dynamic resource management
algorithm? What if it makes the system unstable?
How can we study catastrophic network events?
(Internet backbone failure)
How can we study the impact of correlated events
(like 1,000,000 folks subscribing to the
Super Bowl multicast)?

22
A Framework for MicroGrids
[Figure labels: Application Program; Virtual Grid; Virtual Machine Interface; Actual System Configuration]

Extend Virtual Machine idea to Virtual Grids
(we call them MicroGrids)
Implement control mechanisms to enable emulation
(near real time)
Interface to directory and information mechanisms
to enable real grid software and real
applications to run
Build scripting, instrumentation, and logging
tools
Host on high performance clusters (emulate almost
anything)
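
One way to think about the control mechanism for emulation is as a CPU-share calculation when several virtual hosts run on one physical cluster node. The sketch below is an assumption made for illustration, not the MicroGrid design. The function name, the Mflops ratings, and the uniform-slowdown rule are all hypothetical.

```python
def emulation_plan(virtual_mflops, physical_mflops):
    """Return (slowdown, shares) for virtual hosts sharing one physical host.

    `slowdown` is how much slower than real time the emulation must run
    (1.0 means real time). `shares` maps each virtual host to the fraction
    of the physical CPU it should receive.
    """
    if physical_mflops <= 0:
        raise ValueError("physical_mflops must be positive")
    if not virtual_mflops or any(v <= 0 for v in virtual_mflops.values()):
        raise ValueError("need at least one virtual host with positive speed")
    demand = sum(virtual_mflops.values())
    # If the virtual hosts ask for more than the node has, slow every
    # virtual host down by the same factor so relative speeds are kept.
    slowdown = max(1.0, demand / physical_mflops)
    shares = {
        name: speed / (physical_mflops * slowdown)
        for name, speed in virtual_mflops.items()
    }
    return slowdown, shares
```

With virtual hosts rated 300 and 100 Mflops on a 200 Mflops node, the plan runs at half real-time speed with CPU shares of 0.75 and 0.25. On a 500 Mflops node with virtual hosts of 250 and 125 Mflops, it runs in real time with shares of 0.5 and 0.25.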

23
Dimensions of Grid Modeling

Network Attributes and Communication Performance
connectivity, security, max rate, latency
Computing Power, Memory Capacity, Storage
Power/Capacity
type, organization (parallel/sequential), power,
etc.
Directory structures and resource management
static configuration information and dynamic
information
scripting capability for dynamic resource
evolution, reproducibility, other dynamic models
for space exploration (e.g. randomization, etc.)
Security structures
trust domain, protocol configuration,
authentication services
Instrumentation and Simulation
collect performance information for the grid researcher
when emulation is not quite possible
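
The modeling dimensions above could be captured in a small configuration structure, with a seeded script for reproducible dynamic events. This is a sketch under stated assumptions: every class, field, and function name is hypothetical, and the transfer-time formula (latency plus size over rate) is a deliberately simple model.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Link:
    max_rate_mbps: float  # megabits per second
    latency_ms: float


@dataclass(frozen=True)
class Host:
    mflops: float
    memory_mb: int
    storage_gb: int
    trust_domain: str


@dataclass
class VirtualGrid:
    hosts: Dict[str, Host] = field(default_factory=dict)
    links: Dict[Tuple[str, str], Link] = field(default_factory=dict)

    def connect(self, a: str, b: str, link: Link) -> None:
        # Store links under a sorted key so (a, b) and (b, a) are one link.
        self.links[tuple(sorted((a, b)))] = link

    def transfer_seconds(self, a: str, b: str, megabytes: float) -> float:
        """Simple model: one-way latency plus size divided by max rate."""
        link = self.links[tuple(sorted((a, b)))]
        return link.latency_ms / 1000.0 + megabytes * 8.0 / link.max_rate_mbps


def failure_script(grid: VirtualGrid, n_events: int, horizon_s: float,
                   seed: int) -> List[Tuple[float, str]]:
    """Seeded, time-ordered list of (time, host) failure events."""
    rng = random.Random(seed)
    names = sorted(grid.hosts)
    events = [(rng.uniform(0.0, horizon_s), rng.choice(names))
              for _ in range(n_events)]
    return sorted(events)
```

Reusing the same seed reproduces the same event sequence, which addresses the reproducibility problem raised for real grids. Changing the seed gives the randomized exploration mentioned above.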

24
Initial Project Assignments

Organize yourselves into groups of 2-4 students
to build a key element of a MicroGrid
environment
if distribution of interest is unbalanced, will
resolve choices by a lottery
develop an initial design for resource control
(construction of a virtual grid for that
attribute)
develop an initial design for interfacing to
Globus grid applications
Example communication -- interface to Nexus
communication services
Example computing -- interface to scheduling /
process initiation, and allocation services
Example directory structure -- interface to MDS
services for static and dynamic information
Desirable design attributes: portability,
flexibility, simplicity, and of course reality.

25
Summary

Globus Heartbeat Monitor
Basic failure detection mechanism
Based on UDP heartbeats and Application event
interpretation
Long time scales for detection and response
Leaves much of the management and policy to the
applications
MicroGrid Environments
Enabling convenient and scientific grid
experimentation
Key elements: communication, computation/memory/IO,
directory information services, security
Elements of a basic testbed as the course project

26
Next Time