Reflections on Reliability Issues in OGSA

1
Reflections on Reliability Issues in OGSA
  • Matti Hiltunen
  • AT&T Labs - Research
  • Florham Park, NJ, USA
  • hiltunen_at_research.att.com

Disclaimer: This presentation does not reflect
the views of AT&T.
2
Business Requirements for Grid Computing
  • Grid applications
  • Traditional batch applications (HPC)
  • Interactive applications
  • Existing COTS components
  • JVMs, web servers, application servers, etc.
  • Virtual machines
  • Application-specific requirements
  • E.g., very high availability required for
    customer-facing applications (e-commerce).
  • Cost per minute of unavailability

3
Current approach (before grid)
  • Each business service/application implemented on
    its own silo of designated resources.
  • Typically, no single point of failure.
  • Constructing highly available services is
    expensive!
  • Different services fail independently of one
    another.

[Figure: two silos, each with its own router, firewall, load balancer, and web server]
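
The "no single point of failure" design is also what makes each silo expensive: every component is duplicated. A minimal sketch of the usual back-of-the-envelope calculation follows. It assumes replicas fail independently, and the 0.99 per-component availability is a hypothetical figure, not one taken from the slides.

```python
def replicated_availability(component_availability: float, replicas: int) -> float:
    """Availability of `replicas` redundant copies, assuming independent failures.

    The group is unavailable only when every replica is down at the same time.
    """
    if not 0.0 <= component_availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("need at least one replica")
    return 1.0 - (1.0 - component_availability) ** replicas


if __name__ == "__main__":
    # Hypothetical figure: each web server is up 99% of the time.
    for n in (1, 2, 3):
        print(n, round(replicated_availability(0.99, n), 6))
```

Each extra replica multiplies the probability of a total outage by the single-component failure probability, which is why duplication buys availability quickly but at the cost of mostly idle hardware.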
4
Cheaper or Faster (from Globus Alliance)
5
Future approach (with grid)
  • Different business services/applications share
    (some) resources.
  • All applications depend on OGSA being reliable
    (otherwise, shared resources are unavailable).
  • Applications are no longer quite as independent
    (they compete for the same resources).
[Diagram label: OGSA]
6
OGSA: A complicated architecture
  • LOTS of different services
    • Job Manager, Execution Planning Service,
      Candidate Set Generator, Reservation, Deployment
      and Configuration, Security, Information, Naming
      Services, ...
  • To start the execution of an application on a
    grid, half a dozen (or more) services may need to
    interact.
  • Possible failures
    • Server (hardware or OS) running the service
      crashes
      • More servers => more failures
      • More services on one server => one service may
        affect another (e.g., memory leaks)
    • Service fails (software bug)
      • More services => more code => more bugs
      • Each service smaller => more manageable =>
        hopefully fewer bugs
    • Service interaction failures
      • More services => more possibilities for
        interaction failures.
  • Current commercial grid solutions are much
    simpler!
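
The "more services => more failures" point can be made concrete with a small calculation. This is a sketch under two assumptions that are not in the slides: services fail independently, and a job start needs every service in the chain to be up. The 0.999 availability and the function names are hypothetical.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60


def chain_availability(service_availabilities):
    """Availability of a request that needs ALL listed services to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(service_availabilities)


def expected_downtime_minutes(availability):
    """Expected minutes of unavailability per (non-leap) year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Hypothetical: each service is individually "three nines" available.
    for count in (1, 3, 6):
        total = chain_availability([0.999] * count)
        print(count, round(total, 6), round(expected_downtime_minutes(total), 1))
```

Under these assumptions, the expected downtime grows roughly in proportion to the number of services that must cooperate, even when each one is individually reliable.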

7
  • OGSA: An architecture where the failure of a
    service you have never heard of prevents you from
    running your applications?

OGSA: An architecture where an unknown failure of an
unknown service prevents you from running your
applications?
8
What can standards do?
  • Not very much
    • Standards define interfaces and protocols.
    • Reliability of OGSA services is more of an
      implementation issue.
    • Reliability may provide differentiation between
      different standards-compliant implementations.
  • What can be standardized for each OGSA service:
    • Error messages and processing of error messages.
      (A invokes B, which invokes C; C fails. What kind
      of error message will A get?)
    • Protocols that completely handle error cases (any
      partner may fail at any time during protocol
      execution, or timeouts may occur).
    • Rich monitoring interfaces.
    • Tracking interfaces: APIs for submitting test
      requests that will be tracked through the OGSA
      architecture.
  • Fault-tolerance-related services
    • Monitoring Service, replication, checkpointing,
      etc.
    • Help with fault tolerance of jobs, but not the
      fault tolerance of OGSA services themselves.
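
The "A invokes B which invokes C" question can be illustrated with a small sketch of one possible answer: a structured error that keeps the originating service and the call path as it propagates. The `ServiceError` class and the `call` helper are hypothetical names made up for this example; they are not part of OGSA or any grid toolkit.

```python
class ServiceError(Exception):
    """Hypothetical structured error: which service failed, and via which path."""

    def __init__(self, origin, message, path=None):
        super().__init__(message)
        self.origin = origin          # service where the failure happened
        self.message = message        # human-readable cause
        self.path = path or [origin]  # call chain, outermost caller first


def call(service_name, operation):
    """Run `operation` on behalf of `service_name`.

    On failure, re-raise with the caller prepended to the path, keeping the
    original error attached as the cause (`raise ... from`).
    """
    try:
        return operation()
    except ServiceError as err:
        raise ServiceError(err.origin, err.message, [service_name] + err.path) from err


def service_c():
    raise ServiceError("C", "disk full")


def service_b():
    return call("B", service_c)


def service_a():
    return call("A", service_b)


def describe_failure():
    """What the client of A sees when C fails."""
    try:
        service_a()
    except ServiceError as err:
        return f"{err.message} (origin: {err.origin}, path: {' -> '.join(err.path)})"
    return "ok"


if __name__ == "__main__":
    print(describe_failure())
```

With a convention like this, the client of A learns that C failed and how the request reached it, instead of receiving a generic error from A.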

9
Suggestions (Implementation)
  • Fast failure detection and recovery mechanisms.
  • End-to-end monitoring (submit test jobs).
  • Combine or co-locate OGSA services on same
    fault-tolerant servers to reduce cost of fault
    tolerance.
  • Hardware fault tolerance, OS-level transparent
    replication, service-level replication.
  • Highly replicated service implementation (P2P
    style).
  • Failure-aware services: do not give up too easily
    (e.g., locate an alternate service provider).
  • Planned graceful degradation path.
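
Two of the suggestions above, "do not give up too easily" and "planned graceful degradation path", can be sketched together. This is an illustrative sketch only: `call_with_failover`, the provider functions, and the job identifier are hypothetical, and a real service would also need timeouts and failure detection.

```python
def call_with_failover(providers, request, degraded=None):
    """Try each (name, callable) provider in order; return (name, result).

    If every provider fails and a `degraded` handler is given, use it
    instead of failing outright (the planned degradation path).
    """
    failed = []
    for name, provider in providers:
        try:
            return name, provider(request)
        except Exception as exc:  # broad on purpose: any failure triggers failover
            failed.append(f"{name}: {exc}")
    if degraded is not None:
        return "degraded", degraded(request)
    raise RuntimeError("all providers failed: " + "; ".join(failed))


# Hypothetical providers used only to exercise the sketch.
def primary(request):
    raise ConnectionError("primary unreachable")


def secondary(request):
    return f"scheduled {request}"


def queue_for_later(request):
    return f"queued {request} for later"


if __name__ == "__main__":
    print(call_with_failover([("primary", primary), ("secondary", secondary)], "job-1"))
    print(call_with_failover([("primary", primary)], "job-1", degraded=queue_for_later))
```

The ordering of `providers` encodes the preference among alternates, and the `degraded` handler makes the fallback behavior an explicit design decision rather than an accident of whichever call fails first.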

10
Conclusions
  • Reliability is a MUST for enterprise use of grid
    computing.
  • Complexity of OGSA is scary from a reliability
    point of view.
  • Standards have a limited role in the reliability
    of OGSA services.
  • Implementations and/or deployments of OGSA
    services must be highly available.