Reflections on Reliability Issues in OGSA

Slides: 11

Provided by: mattihi

Transcript and Presenter's Notes

Title: Reflections on Reliability Issues in OGSA

1
Reflections on Reliability Issues in OGSA

Matti Hiltunen
AT&T Labs - Research
Florham Park, NJ, USA
hiltunen_at_research.att.com

Disclaimer: This presentation does not reflect the views of AT&T.
2
Business Requirements for Grid Computing

Grid applications
Traditional batch applications (HPC)
Interactive applications
Existing COTS components
JVMs, web servers, application servers, etc.
Virtual machines
Application-specific requirements
E.g., very high availability required for customer-facing applications (e-commerce).
Cost per minute of unavailability

3
Current approach (before grid)

Each business service/application implemented on
its own silo of designated resources.
Typically, no single point of failure.
Constructing highly available services is
expensive!
Different services fail independently of one another.

[Figure: two separate silos, each with its own router, firewall, load balancer, and web server]
4
Cheaper or Faster (from Globus Alliance)
5
Future approach (with grid)
Different business services/applications share (some) resources.
All applications depend on OGSA being reliable (otherwise, shared resources are unavailable).
Different applications are no longer quite as independent (they compete for the same resources).
[Figure: applications sharing resources through OGSA]
6
OGSA: A complicated architecture

LOTS of different services:
Job Manager, Execution Planning Service, Candidate Set Generator, Reservation, Deployment and Configuration, Security, Information, Naming Services, etc.
To start the execution of an application on a grid, half a dozen (or more) services may need to interact.
Possible failures:
Server (hardware or OS) running the service crashes
More servers -> more failures
More services on one server -> one service may affect another (e.g., memory leaks)
Service fails (software bug)
More services -> more code -> more bugs
Each service smaller -> more manageable -> hopefully fewer bugs
Service interaction failures
More services -> more possibilities for interaction failures
Current commercial grid solutions are much simpler!
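
The "more services -> more failures" point can be made concrete with a small calculation. The sketch below is illustrative only: it assumes that every service in the chain must be up for a request to succeed and that services fail independently, and the 99.9% per-service availability is a made-up figure, not a number from the presentation.

```python
def chain_availability(per_service: float, n_services: int) -> float:
    """Availability of a request that needs all n services to be up.

    Assumes independent failures, which is a simplification.
    """
    return per_service ** n_services


def downtime_minutes_per_year(availability: float) -> float:
    """Expected minutes of unavailability in a 365-day year."""
    return (1.0 - availability) * 365 * 24 * 60


# Hypothetical figure: each service is up 99.9% of the time.
for n in (1, 6, 12):
    a = chain_availability(0.999, n)
    print(f"{n:2d} services: availability {a:.4f}, "
          f"about {downtime_minutes_per_year(a):.0f} minutes down per year")
```

Under these assumptions, the availability seen by an application falls as the number of services on its critical path grows, even when each individual service is quite reliable.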

7
OGSA: Architecture where the failure of a service you have never heard of prevents you from running your applications?

OGSA: Architecture where an unknown failure of an unknown service prevents you from running your applications?
8
What can standards do?

Not very much:
standards define interfaces and protocols;
reliability of OGSA services is more of an implementation issue;
reliability may provide differentiation between different standards-compliant implementations.
What can be standardized for each OGSA service:
error messages and processing of error messages (A invokes B, which invokes C; C fails; what kind of error message will A get?);
protocols that completely handle error cases (any partner may fail at any time during protocol execution, or timeouts may occur);
rich monitoring interfaces;
tracking interfaces: APIs for submitting test requests that will be tracked through the OGSA architecture.
Fault tolerance related services:
Monitoring Service, replication, checkpointing, etc.
These help with the fault tolerance of jobs, but not the fault tolerance of OGSA services themselves.
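
One way to picture the "A invokes B which invokes C" question is an error type that keeps the whole failure path, so A can see that the root cause was in C. This is a minimal sketch, not an OGSA interface: the `ServiceError` class, the three service functions, and the error texts are all hypothetical names invented for illustration.

```python
class ServiceError(Exception):
    """Hypothetical structured error that records which service failed and why."""

    def __init__(self, service, message, cause=None):
        super().__init__(f"{service}: {message}")
        self.service = service
        self.cause = cause  # the ServiceError from the service we called, if any

    def path(self):
        """Service names from the outermost caller down to the root cause."""
        names, err = [], self
        while err is not None:
            names.append(err.service)
            err = err.cause
        return names

    def origin(self):
        """The innermost error, i.e. where the failure actually started."""
        err = self
        while err.cause is not None:
            err = err.cause
        return err


def service_c():
    raise ServiceError("C", "reservation store unavailable")


def service_b():
    try:
        service_c()
    except ServiceError as exc:
        # Wrap instead of swallowing, so the caller still sees C's failure.
        raise ServiceError("B", "could not plan execution", cause=exc) from exc


def service_a():
    try:
        service_b()
    except ServiceError as exc:
        raise ServiceError("A", "job submission failed", cause=exc) from exc


try:
    service_a()
except ServiceError as err:
    print(" -> ".join(err.path()), "| root cause:", err.origin())
```

The design choice here is that every layer wraps the error it received rather than replacing it, which is the kind of behavior a standard could require of each service.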

9
Suggestions (Implementation)

Fast failure detection and recovery mechanisms.
End-to-end monitoring (submit test jobs).
Combine or co-locate OGSA services on same
fault-tolerant servers to reduce cost of fault
tolerance.
Hardware fault tolerance, OS-level transparent
replication, service-level replication.
Highly replicated service implementation (P2P
style).
Failure-aware services: do not give up too easily (e.g., locate an alternate service provider).
Planned graceful degradation path.
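
The last two suggestions (try an alternate provider, then degrade gracefully) can be sketched as a small client-side helper. This is only an illustration under assumptions: `call_with_failover`, the provider functions, and the job name are hypothetical, and a real client would also need timeouts and narrower error handling.

```python
def call_with_failover(providers, request, fallback=None):
    """Try each (name, callable) provider in order.

    If every provider fails, use the degraded fallback when one is given;
    otherwise raise an error listing the providers that were tried.
    """
    failed = []
    for name, call in providers:
        try:
            return name, call(request)
        except Exception:  # a real client would catch narrower error types
            failed.append(name)
    if fallback is not None:
        return "fallback", fallback(request)
    raise RuntimeError(f"all providers failed: {failed}")


def primary(request):
    # Hypothetical provider that is currently down.
    raise ConnectionError("primary planner is down")


def alternate(request):
    # Hypothetical alternate provider offering the same service.
    return f"plan for {request}"


used, result = call_with_failover(
    [("primary", primary), ("alternate", alternate)], "job-42"
)
print(used, result)
```

Passing a `fallback` callable is where a planned degradation path would plug in, for example queuing the request for later instead of rejecting it.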

10
Conclusions