Globus Heartbeat Monitor and MicroGrids

Provided by: Andre524

1
Globus Heartbeat Monitor and MicroGrids
  • Announcements/Review
  • HW2, Due Saturday, 2/6
  • Project assignment part I, out today, due 2/17
    (Wednesday)
  • Last Time
  • Grid Security Architecture Issues
  • What are the unique challenges for Grid Security
  • A Framework for Grid Security
  • Perspective
  • Global Access to Secondary Storage
  • Motivation and Requirements
  • Grid Applications and Distributed Filesystems
  • GASS operations and performance

2
Today's Outline
  • Globus Heartbeat Monitor
  • Motivation, Mechanisms, Capability
  • MicroGrid Environments
  • Motivation
  • Goals
  • Elements

3
Fault Tolerance in Grid Applications
  • What are the requirements?
  • What are the existing mechanisms?
  • The Heartbeat Monitor

4
Requirements
  • Run large HPDC applications reliably
  • Requirements depend on the nature of the
    application
  • Time to detection of failures
  • Time to recover, appropriate method of recovery
  • Fundamental aspects
  • Mechanism for detecting failures (system,
    application process, service)
  • Mechanisms for recovery

5
Failure Detector Requirements
  • Scalability (key Grid requirement)
  • 1,000s to 1,000,000s of nodes
  • Flexibility (key Grid requirement)
  • Correct policy depends intimately on the
    application
  • Accuracy
  • Timeliness
  • Low overhead

6
Failure Detection and Recovery Mechanisms
  • Heartbeat Monitors
  • Algorithms for Agreeing on an action
  • Process replication (Active or Passive)
  • Active replica takes over
  • Logging and Reconstruction
  • Checkpoint and restart
  • Snapshots of process state
  • Reload in case of failure and recompute
  • Idempotence and decoupled interaction

7
Grid Application Examples
  • Distributed Supercomputing
  • Real-time Distributed Instrumentation
  • Data-Intensive Computing
  • Tele-Immersion

8
Distributed Supercomputing
[Figure: distributed application with stages LHSF, ASY, and LOG-D. Labels: 230 Mflops; 115 Mflops; LHSF 512MB, 400MB output, 3.5 C90 hours; 250 Mflops (Cray); 1280 Mflops (MPP machine); LOG-D 512MB, 1.5 C90 hours]
  • Excessive computational requirements, met only by
    combining multiple high capacity resources
    (national/world supercomputer)
  • Applications exploit capacity, special
    performance character, and significant bandwidth
    hierarchies to deliver performance over wide area
    networks

9
Distributed Supercomputing Fault Tolerance
  • Intermediate results stored in files
  • independent failures can simply result in reruns
  • loose coupling amongst the sites
  • Intermediate results transient on the network
  • completely or in part
  • tighter coupling, failures will require reruns in
    multiple or all locations
  • What overhead is tolerable for fault tolerance in
    this type of application?

10
Real-time Distributed Instrumentation
  • Instruments: Advanced Light Source, particle
    accelerator, electron microscope, MRI machine
  • Computation for data processing, control, remote
    data storage and processing

11
RTDI Fault Tolerance
  • Online processing of data
  • failure of sensor inputs: find another or fail
    application?
  • failure of repository: defer commit to archive?
  • failure of computational or network resources:
    find another set of resources?
  • -> things we can continue with and things we cannot
  • -> loss of data?
  • real-time fault detection and recovery (hard)
  • What overhead is tolerable?

12
Data Intensive Computing
  • Manipulating 10s of Terabytes, coupling to high
    speed computation
  • Data Mining, Sequence Matching, Cross data set
    analyses

13
Data Intensive Fault Tolerance
  • Large data archives -- may be replicated or not
  • can substitute if possible, but I/O capability
    may or may not be available
  • Compute and network resources are the easy things
    to deal with
  • Often the applications are data parallel and
    intermediate results may be
  • amenable to checkpoint and restart
  • simple interfaces may support this
  • What overhead is tolerable?

14
Tele-immersion
  • Combination of immersive virtual reality over a
    network where any element can be remote
  • Avatars, natural/artificial interaction, many
    modes
  • Computationally and network intensive

15
Tele-immersion Fault Tolerance
  • Endpoint failures -> how to continue?
  • Networks and Computational resources
  • would like to fail over
  • perhaps reconnect is good enough (if infrequent)
  • What overhead is tolerable?

16
Globus Heartbeat Monitor
  • Simple set of tools for process monitoring
  • Local monitors (observe and generate heartbeats),
    use simple local process monitoring mechanisms
  • Applications register with the local monitor if
    they are going to be monitored
  • APIs for applications to be notified of process
    monitoring events
  • Data collector API which allows applications to
    filter and handle the monitoring events
  • Applications decide and implement appropriate
    action based on the events received
  • Applications must filter the information (false
    positives) and also take appropriate global
    action (dealing with zombies, etc.)
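The roles listed above can be sketched in a few lines: a local monitor that applications register with, a data collector that receives heartbeats, and an application callback that gets the monitoring events. Every class and method name here is hypothetical, chosen for illustration; this is not the Globus HBM API, and time is simulated rather than read from a clock.

```python
class DataCollector:
    """Hypothetical collector: records heartbeats, reports silent processes."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}     # (host, process) -> time of last heartbeat
        self.callbacks = []     # application handlers for monitoring events

    def subscribe(self, callback):
        self.callbacks.append(callback)

    def receive(self, host, name, now):
        self.last_seen[(host, name)] = now

    def check(self, now):
        # Report every process whose silence exceeds the timeout. The
        # application, not the collector, decides what the event means.
        for key, seen in self.last_seen.items():
            if now - seen > self.timeout:
                for callback in self.callbacks:
                    callback(key, now - seen)

class LocalMonitor:
    """Hypothetical local monitor: one per host, emits heartbeats."""
    def __init__(self, host, collector):
        self.host = host
        self.collector = collector
        self.registered = {}    # process name -> local liveness probe

    def register(self, name, is_alive):
        self.registered[name] = is_alive

    def beat(self, now):
        # One heartbeat per registered process that still looks alive locally.
        for name, is_alive in self.registered.items():
            if is_alive():
                self.collector.receive(self.host, name, now)

def demo():
    events = []
    collector = DataCollector(timeout=35)
    collector.subscribe(lambda key, silence: events.append((key, silence)))
    alive = {"worker": True}
    monitor = LocalMonitor("hostA", collector)
    monitor.register("worker", lambda: alive["worker"])
    for now in range(0, 100, 10):       # simulated clock, 10-second ticks
        if now == 30:
            alive["worker"] = False     # the process dies after its beat at t=20
        monitor.beat(now)
        collector.check(now)
    return events
```

The collector only reports silence; filtering false positives and choosing a global action stay in the application's callback, matching the division of labor on the slide.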

17
Implementation
  • One monitoring process per GUSTO resource
  • UDP packets (lower latency)
  • 10 second heartbeat intervals (empirically
    chosen)
  • System Oriented Monitoring (part of the testbed)
  • seems this is clearly important
  • minimum status for schedulers, application
    planners, etc.
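As a rough illustration of UDP heartbeats, the sketch below sends numbered datagrams over the loopback interface and treats a receive timeout as "heartbeat missing". The message format, function names, and the 10 ms demo interval are assumptions for this example only; the slide states a 10-second interval and does not describe the packet contents.

```python
import socket
import time

def send_heartbeats(sock, dest, name, count, interval):
    """Send `count` numbered heartbeat datagrams, `interval` seconds apart."""
    for seq in range(count):
        sock.sendto(f"{name} {seq}".encode(), dest)
        time.sleep(interval)

def collect(sock, timeout):
    """Receive heartbeats until none arrives within `timeout` seconds."""
    sock.settimeout(timeout)
    received = []
    while True:
        try:
            data, _addr = sock.recvfrom(1024)
        except socket.timeout:
            return received, True    # silence: the sender is now suspect
        received.append(data.decode())

def demo():
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # The slide's interval is 10 s; the demo uses 10 ms so it finishes quickly.
        send_heartbeats(sender, receiver.getsockname(), "worker", 3, 0.01)
        return collect(receiver, timeout=0.5)
    finally:
        sender.close()
        receiver.close()
```

UDP gives no delivery guarantee, so a missing datagram may mean a lost packet rather than a dead process. That is one source of the false positives the application is expected to filter.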

18
HBM Statistics
  • Testbed spanning Midwest to Southern California
  • almost national
  • Heartbeat period
  • source period of 10 seconds, 1% overhead
  • interarrival distributions with significant
    weight over 200 seconds
  • Claims
  • 35 seconds, false positives of 1 in 100s
  • 240 seconds, false positives of 1 in 1,000,000s
  • Are these numbers stable? What level of false
    positives is acceptable? What latencies for
    failure detection are useful?
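The trade-off behind these claims can be made concrete: a longer timeout declares fewer live processes dead, at the cost of slower detection. The sketch below estimates a false-positive rate as the fraction of heartbeat inter-arrival gaps that exceed a timeout. The gap values are invented for illustration and are not measurements from the HBM testbed.

```python
def false_positive_rate(gaps, timeout):
    """Fraction of heartbeat inter-arrival gaps longer than `timeout`.

    Each such gap would have made a live process look failed.
    """
    late = sum(1 for gap in gaps if gap > timeout)
    return late / len(gaps)

# Made-up inter-arrival gaps in seconds; NOT data from the testbed.
gaps = [10, 10, 11, 10, 12, 10, 40, 10, 10, 250]

# Longer timeouts trade detection latency for fewer false alarms.
rates = {timeout: false_positive_rate(gaps, timeout) for timeout in (35, 240, 300)}
```

With real traces, the same calculation would show whether the claimed rates are stable over time and across sites, which is the first question on the slide.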

19
Is Application Fault tolerance a good idea?
  • Imposes the complexity of management on the
    application (or library for the application
    domain)
  • Enables customized lighter weight solutions that
    allow more appropriate action to be taken
  • What useful capabilities does the Globus HBM
    provide?
  • Other perspectives?

20
MicroGrid Software Environments
  • Motivation
  • Reduced Experimentation effort
  • How much work is it to configure an experiment?
  • What resources will be made available to just
    experiment?
  • How many folks can realistically experiment at
    national (or Global) scale?
  • All of this in an environment we depend on for
    many things?
  • Enable virtualization of a set of Grid
    resources, thereby enabling experiments to be run
    at significantly lower effort
  • lower entry barrier to experimentation (enable
    graduate students!)
  • accelerate the rate of progress in development
    and knowledge of the issues in how to build
    computational grids

21
MicroGrids (cont)
  • Motivation (cont)
  • Scientific study: how do we know what will work
    in a grid environment?
  • Real grids are uncontrollable environments
  • events cannot be produced (coverage)
  • events may not be reproducible
  • separation of instrumentation impact on system
    (e.g. perturbation) may not be possible
  • For example, how can we study the dynamical
    properties of a new dynamic resource management
    algorithm? What if it makes the system unstable?
  • How can we study catastrophic network events?
    (Internet backbone failure)
  • How can we study the impact of correlated events
    (like 1,000,000 folks subscribing to the
    Super Bowl multicast)?

22
A Framework for MicroGrids
Application Program
Virtual Grid
Virtual machine Interface
Actual system configuration
  • Extend Virtual Machine idea to Virtual Grids
    (we call them MicroGrids)
  • Implement control mechanisms to enable emulation
    (near real time)
  • Interface to directory and information mechanisms
    to enable real grid software and real
    applications to run
  • Build scripting, instrumentation, and logging
    tools
  • Host on high performance clusters (emulate almost
    anything)

23
Dimensions of Grid Modeling
  • Network Attributes and Communication Performance
  • connectivity, security, max rate, latency
  • Computing Power, Memory Capacity, Storage
    Power/Capacity
  • type, organization (parallel/sequential), power,
    etc.
  • Directory structures and resource management
  • static configuration information and dynamic
    information
  • scripting capability for dynamic resource
    evolution, reproducibility, other dynamic models
    for space exploration (e.g. randomization, etc.)
  • Security structures
  • trust domain, protocol configuration,
    authentication services
  • Instrumentation and Simulation
  • collect performance information for grid researchers
  • when emulation is not quite possible

24
Initial Project Assignments
  • Organize yourselves into groups of 2-4 students
    to build a key element of a MicroGrid
    environment
  • if distribution of interest is unbalanced, will
    resolve choices by a lottery
  • develop an initial design for resource control
    (construction of a virtual grid for that
    attribute)
  • develop an initial design for interfacing to
    Globus grid applications
  • Example: communication -- interface to Nexus
    communication services
  • Example: computing -- interface to scheduling /
    process initiation, and allocation services
  • Example: directory structure -- interface to MDS
    services for static and dynamic information
  • Desirable design attributes: portability,
    flexibility, simplicity, and of course reality.

25
Summary
  • Globus Heartbeat Monitor
  • Basic failure detection mechanism
  • Based on UDP heartbeats and Application event
    interpretation
  • Long time scales for detection and response
  • Leaves much of the management and policy to the
    applications
  • MicroGrid Environments
  • Enabling convenient and scientific grid
    experimentation
  • Key elements: communication, computation/memory/IO,
    directory information services, security
  • Elements of a basic testbed as the course project

26
Next Time
  • No assigned readings at this time.