Title: Globus Heartbeat Monitor and MicroGrids
1Globus Heartbeat Monitor and MicroGrids
- Announcements/Review
- HW2, Due Saturday, 2/6
- Project assignment part I, out today, due 2/17 (Wednesday)
- Last Time
- Grid Security Architecture Issues
- What are the unique challenges for Grid Security
- A Framework for Grid Security
- Perspective
- Global Access to Secondary Storage
- Motivation and Requirements
- Grid Applications and Distributed Filesystems
- Gass operations and performance
2Today's Outline
- Globus Heartbeat Monitor
- Motivation, Mechanisms, Capability
- MicroGrid Environments
- Motivation
- Goals
- Elements
3Fault Tolerance in Grid Applications
- What are the requirements?
- What are the existing mechanisms?
- The Heartbeat Monitor
4Requirements
- Run large HPDC applications reliably
- Requirements depend on the nature of the
application
- Time to detection of failures
- Time to recover, appropriate method of recovery
- Fundamental aspects
- Mechanism for detecting failures (system,
application process, service)
- Mechanisms for recovery
5Failure Detector Requirements
- Scalability (key Grid requirement)
- 1000s to 1000000s of nodes
- Flexibility (key Grid requirement)
- Correct policy depends intimately on the
application
- Accuracy
- Timeliness
- Low overhead
6Failure Detection and Recovery Mechanisms
- Heartbeat Monitors
- Algorithms for Agreeing on an action
- Process replication (Active or Passive)
- Active replica takes over
- Logging and Reconstruction
- Checkpoint and restart
- Snapshots of process state
- Reload in case of failure and recompute
- Idempotence and decoupled interaction
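
The checkpoint-and-restart item above can be sketched in a few lines. This is a minimal illustration, not anything from Globus: the function names (`save_checkpoint`, `load_checkpoint`, `run_sum`), the pickle file format, and the toy summation workload are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Snapshot state atomically: write a temp file, then rename it over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)  # a crash never leaves a half-written checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists yet."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def run_sum(path, n, fail_at=None):
    """Toy workload: sum 0..n-1, checkpointing after every step.

    fail_at (hypothetical test hook) simulates a crash before that step runs.
    """
    state = load_checkpoint(path, {"i": 0, "total": 0})
    while state["i"] < n:
        if fail_at is not None and state["i"] == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += state["i"]
        state["i"] += 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_sum` again with the same path after a failure reloads the snapshot and recomputes only the remaining steps. Checkpointing every step is the simplest policy; a real application would trade checkpoint frequency against overhead.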
7Grid Application Examples
- Distributed Supercomputing
- Real-time Distributed Instrumentation
- Data-Intensive Computing
- Tele-Immersion
8Distributed Supercomputing
[Figure: diagram with stages LHSF, ASY, and LOG-D. Labels: 230 Mflops, 115 Mflops; LHSF 512 MB, 400 MB output, 3.5 C90 hours; 250 Mflops (Cray), 1280 Mflops (MPP machine); LOG-D 512 MB, 1.5 C90 hours]
- Excessive computational requirements, met only by
combining multiple high capacity resources
(national/world supercomputer)
- Applications exploit capacity, special
performance character, and significant bandwidth
hierarchies to deliver performance over wide area
networks
9Distributed Supercomputing Fault Tolerance
- Intermediate results stored on files
- independent failures can simply result in reruns
- loose coupling amongst the sites
- Intermediate results transient on the network
- completely or in part
- tighter coupling, failures will require reruns in
multiple or all locations
- What overhead is tolerable for fault tolerance in
this type of application?
10Real-time Distributed Instrumentation
- Instruments: Advanced Light Source, particle
accelerator, electron microscope, MRI machine
- Computation for data processing, control, remote
data storage and processing
11 RTDI Fault Tolerance
- Online processing of data
- failure of sensor inputs: find another or fail
application?
- failure of repository: defer commit to archive?
- failure of computational or network resources:
find another set of resources?
- => things can continue with some failures and not with others
- => loss of data?
- real-time fault detection and recovery (hard)
- What overhead is tolerable?
12Data Intensive Computing
- Manipulating 10s of Terabytes, coupling to high
speed computation
- Data Mining, Sequence Matching, Cross data set
analyses
13Data Intensive Fault Tolerance
- Large data archives -- may be replicated or not
- can substitute if possible, but I/O capability
may or may not be available
- Compute and network resources are the easy things
to deal with
- Often the applications are data parallel and
intermediate results may be
- amenable to checkpoint and restart
- simple interfaces may support this
- What overhead is tolerable?
14Tele-immersion
- Combination of immersive virtual reality over a
network where any element can be remote
- Avatars, natural/artificial interaction, many
modes
- Computationally and network intensive
15Tele-immersion Fault Tolerance
- Endpoint Failures -> how to continue?
- Networks and Computational resources
- would like to fail over
- perhaps reconnect is good enough (if infrequent)
- What overhead is tolerable?
16Globus Heartbeat Monitor
- Simple set of tools for process monitoring
- Local monitors (observe and generate heartbeats),
use simple local process monitoring mechanisms
- Applications register with the local monitor if
they are going to be monitored
- APIs for applications to be notified of process
monitoring events
- Data collector API which allows applications to
filter and handle the monitoring events
- Applications decide and implement appropriate
action based on the events received
- Applications must filter the information (false
positives) and also take appropriate global
action (dealing with zombies, etc.)
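
The register / notify / filter pattern described above can be sketched as a small timeout-based detector. This is an illustrative sketch only and not the Globus HBM API: the class name `HeartbeatCollector`, its methods, and the callback signature are all hypothetical. Time is passed in explicitly so the behavior is deterministic.

```python
class HeartbeatCollector:
    """Hypothetical collector: tracks the last heartbeat per registered process."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence before a process is suspected
        self.last_seen = {}         # process id -> time of last heartbeat
        self.callbacks = []         # application handlers for monitoring events

    def register(self, process_id, now):
        """An application opts in to monitoring."""
        self.last_seen[process_id] = now

    def subscribe(self, callback):
        """callback(process_id, silence_seconds) is invoked for each suspect."""
        self.callbacks.append(callback)

    def heartbeat(self, process_id, now):
        """Record a heartbeat; heartbeats from unregistered processes are ignored."""
        if process_id in self.last_seen:
            self.last_seen[process_id] = now

    def check(self, now):
        """Report every process whose silence exceeds the timeout."""
        suspects = [pid for pid, seen in self.last_seen.items()
                    if now - seen > self.timeout]
        for pid in suspects:
            for callback in self.callbacks:
                callback(pid, now - self.last_seen[pid])
        return suspects
```

A suspect is only a report of silence, not a confirmed failure. As the slide says, the application's callback still has to filter false positives and decide what global action to take.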
17Implementation
- One monitoring process per GUSTO resource
- UDP packets (lower latency)
- 10 second heartbeat intervals (empirically
chosen)
- System Oriented Monitoring (part of the testbed)
- seems this is clearly important
- minimum status for schedulers, application
planners, etc.
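
To make the UDP heartbeat idea concrete, here is a minimal sketch of sending one heartbeat datagram over the loopback interface. The packet layout (process id, sequence number, send time) and all function names are assumptions for illustration; the slides do not describe the actual HBM wire format.

```python
import socket
import struct
import time

# Hypothetical packet layout: unsigned process id, unsigned sequence number,
# and send time as a double, all in network byte order (16 bytes total).
HEARTBEAT_FORMAT = "!IId"


def encode_heartbeat(pid, seq, timestamp):
    return struct.pack(HEARTBEAT_FORMAT, pid, seq, timestamp)


def decode_heartbeat(data):
    return struct.unpack(HEARTBEAT_FORMAT, data)


def loopback_demo():
    """Send one heartbeat to a local receiver and return the decoded fields."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    receiver.settimeout(2.0)          # UDP is unreliable, so never block forever
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        packet = encode_heartbeat(1234, 1, time.time())
        sender.sendto(packet, receiver.getsockname())
        data, _addr = receiver.recvfrom(64)
        return decode_heartbeat(data)
    finally:
        sender.close()
        receiver.close()
```

Because UDP gives no delivery guarantee, a missing heartbeat can mean a dead process, a lost packet, or a delayed one. The sequence number lets a receiver tell loss apart from reordering.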
18HBM Statistics
- Testbed spanning Midwest to Southern California
- almost national
- Heartbeat period
- source period of 10 seconds, 1 overhead
- interarrival distributions with significant
weight over 200 seconds
- Claims
- 35 seconds, false positives of 1 in 100s
- 240 seconds, false positives of 1 in 1,000,000s
- Are these numbers stable? What level of false
positives is acceptable? What latencies for
failure detection are useful?
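
The trade-off between detection latency and false positives can be explored with a small calculation: given observed heartbeat interarrival gaps from a live process, the fraction exceeding a timeout estimates how often that timeout would raise a false alarm. The function name and the sample gaps below are invented for illustration and are not measurements from the testbed.

```python
def false_positive_rate(interarrivals, timeout):
    """Fraction of interarrival gaps (seconds) that exceed the timeout.

    Each such gap would have been flagged as a failure even though the
    process was alive, so this estimates the per-heartbeat false positive rate.
    """
    if not interarrivals:
        raise ValueError("need at least one interarrival sample")
    late = sum(1 for gap in interarrivals if gap > timeout)
    return late / len(interarrivals)


# Made-up sample: mostly 10 s gaps with a few long delays in the tail.
sample_gaps = [10.0] * 97 + [40.0, 60.0, 250.0]
short_timeout_rate = false_positive_rate(sample_gaps, 35)    # 3 of 100 gaps are late
long_timeout_rate = false_positive_rate(sample_gaps, 240)    # only the 250 s gap is late
```

A longer timeout lowers the false positive rate but delays detection of real failures by the same amount, which is the tension behind the questions above.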
19Is Application Fault tolerance a good idea?
- Imposes the complexity of management on the
application (or library for the application
domain)
- Enables customized lighter weight solutions that
allow more appropriate action to be taken
- What useful capabilities does the Globus HBM
provide?
- Other perspectives?
20MicroGrid Software Environments
- Motivation
- Reduced Experimentation effort
- How much work is it to configure an experiment?
- What resources will be made available to just
experiment?
- How many folks can realistically experiment at
national (or Global) scale?
- All of this in an environment we depend on for
many things?
- Enable virtualization of a set of Grid
resources, thereby enabling experiments to be run
at significantly lower effort
- lower entry barrier to experimentation (enable
graduate students!)
- accelerate the rate of progress in development
and knowledge of the issues in how to build
computational grids
21MicroGrids (cont)
- Motivation (cont)
- Scientific study: how do we know what will work
in a grid environment?
- Real grids are uncontrollable environments
- events cannot be produced (coverage)
- events may not be reproducible
- separation of instrumentation impact on system
(e.g. perturbation) may not be possible
- For example, how can we study the dynamical
properties of a new dynamic resource management
algorithm? What if it makes the system unstable?
- How can we study catastrophic network events?
(Internet backbone failure)
- How can we study the impact of correlated events
(like 1,000,000 folks subscribing to the
Super Bowl multicast)?
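
One way an emulated grid can make events both producible and reproducible is to generate them from a seeded random source. The sketch below is an assumption about how such a script might look, not part of any MicroGrid design in these slides; the function name, the event tuple layout, and the exponential gap model are all hypothetical.

```python
import random


def generate_events(seed, count, mean_gap_s, resources):
    """Produce a failure schedule that is identical for identical seeds.

    Returns a list of (time_s, resource, kind) tuples with exponentially
    distributed gaps between events.
    """
    rng = random.Random(seed)       # private generator: no hidden global state
    t = 0.0
    events = []
    for _ in range(count):
        t += rng.expovariate(1.0 / mean_gap_s)
        events.append((t, rng.choice(resources), "fail"))
    return events
```

Rerunning an experiment with the same seed replays exactly the same failures, while changing the seed explores a different part of the event space.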
22A Framework for MicroGrids
[Figure: layers Application Program / Virtual Grid / Virtual Machine Interface / Actual System Configuration]
- Extend Virtual Machine idea to Virtual Grids
(we call them MicroGrids)
- Implement control mechanisms to enable emulation
(near real time)
- Interface to directory and information mechanisms
to enable real grid software and real
applications to run
- Build scripting, instrumentation, and logging
tools
- Host on high performance clusters (emulate almost
anything)
23Dimensions of Grid Modeling
- Network Attributes and Communication Performance
- connectivity, security, max rate, latency
- Computing Power, Memory Capacity, Storage
Power/Capacity
- type, organization (parallel/sequential), power,
etc.
- Directory structures and resource management
- static configuration information and dynamic
information
- scripting capability for dynamic resource
evolution, reproducibility, other dynamic models
for space exploration (e.g. randomization, etc.)
- Security structures
- trust domain, protocol configuration,
authentication services
- Instrumentation and Simulation
- collect performance information for the grid researcher
- when emulation is not quite possible
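
A few of these modeling dimensions can be written down as a data structure with a first-order cost model. This is a minimal sketch under stated assumptions: the class and field names are hypothetical, and the formulas (latency plus size over rate, work over power) are the simplest possible models, not anything prescribed by the slides.

```python
from dataclasses import dataclass


@dataclass
class VirtualLink:
    """Network attributes of one emulated link."""
    latency_s: float
    max_rate_bytes_per_s: float

    def transfer_time(self, size_bytes):
        # First-order model: one latency plus serialization at the maximum rate.
        return self.latency_s + size_bytes / self.max_rate_bytes_per_s


@dataclass
class VirtualHost:
    """Computing power and memory capacity of one emulated machine."""
    name: str
    mflops: float
    memory_mb: int

    def compute_time(self, mflop_count):
        # Seconds to execute mflop_count million floating point operations.
        return mflop_count / self.mflops
```

For example, `VirtualLink(0.05, 1e6).transfer_time(2_000_000)` models a 2 MB transfer over a 1 MB/s link with 50 ms latency. An emulator would enforce such parameters on real traffic rather than just compute them.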
24Initial Project Assignments
- Organize yourselves into groups of 2-4 students
to build a key element of a MicroGrid
environment
- if distribution of interest is unbalanced, will
resolve choices by a lottery
- develop an initial design for resource control
(construction of a virtual grid for that
attribute)
- develop an initial design for interfacing to
Globus grid applications
- Example: communication -- interface to Nexus
communication services
- Example: computing -- interface to scheduling /
process initiation, and allocation services
- Example: directory structure -- interface to MDS
services for static and dynamic information
- Desirable design attributes: portability,
flexibility, simplicity, and of course reality.
25Summary
- Globus Heartbeat Monitor
- Basic failure detection mechanism
- Based on UDP heartbeats and Application event
interpretation
- Long time scales for detection and response
- Leaves much of the management and policy to the
applications
- MicroGrid Environments
- Enabling convenient and scientific grid
experimentation
- Key elements: communication, computation/memory/IO,
directory information services, security
- Elements of a basic testbed as the course project
26Next Time
- No assigned readings at this time.