Title: Reflections on Reliability Issues in OGSA
1Reflections on Reliability Issues in OGSA
- Matti Hiltunen
- AT&T Labs - Research
- Florham Park, NJ, USA
- hiltunen_at_research.att.com
Disclaimer: This presentation does not reflect the views of AT&T.
2Business Requirements for Grid Computing
- Grid applications
- Traditional batch applications (HPC)
- Interactive applications
- Existing COTS components
- JVMs, web servers, application servers, etc
- Virtual machines
- Application specific requirements
- E.g., very high availability required for customer-facing applications (e-commerce).
- Cost per minute of unavailability
3Current approach (before grid)
- Each business service/application is implemented on its own silo of designated resources.
- Typically, no single point of failure.
- Constructing highly available services is expensive!
- Different services fail independently of one another.
[Figure: two separate silos, each with its own Router, Firewall, Loadbalancer, and Web server]
4Cheaper or Faster (from Globus Alliance)
5Future approach (with grid)
- Different business services/applications share (some) resources.
- All applications depend on OGSA being reliable (otherwise, shared resources are unavailable).
- Different applications are no longer quite as independent (they compete for the same resources).
OGSA
6OGSA: A complicated architecture
- LOTS of different services
- Job Manager, Execution Planning Service, Candidate Set Generator, Reservation, Deployment and Configuration, Security, Information, Naming Services, etc.
- To start the execution of an application on a grid, half a dozen (or more) services may need to interact.
- Possible failures
- Server (hardware or OS) running the service crashes
- More servers => more failures
- More services on one server => one service may affect another (e.g., memory leaks)
- Service fails (software bug)
- More services => more code => more bugs
- Each service smaller => more manageable => hopefully fewer bugs
- Service interaction failures
- More services => more possibilities for interaction failures.
- Current commercial grid solutions are much simpler!
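The point that more interacting services mean more chances of failure can be made concrete with a small calculation. The sketch below is illustrative only: it assumes failures are independent (optimistic when services share a server), and the 99.9% per-service availability is a hypothetical figure, not a measurement from any OGSA deployment.

```python
from math import prod


def chain_availability(availabilities):
    """Availability of a request that needs every listed service to be up.

    Assumes independent failures, so the individual availabilities multiply.
    """
    return prod(availabilities)


# Hypothetical figure: each service is up 99.9% of the time.
per_service = 0.999

for n in (1, 6, 12):
    # n services must all be up to start one application.
    print(n, chain_availability([per_service] * n))
```

Under these assumptions, six services in the path bring availability down to roughly 99.4%, so every added service has to be noticeably more reliable than the target for the whole.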
7- OGSA: Architecture where the failure of a service you have never heard of prevents you from running your applications?
- OGSA: Architecture where an unknown failure of an unknown service prevents you from running your applications?
8What can standards do?
- Not very much
- Standards define interfaces and protocols.
- Reliability of OGSA services is more of an implementation issue.
- Reliability may provide differentiation between different standards-compliant implementations.
- What can be standardized for each OGSA service:
- Error messages and processing of error messages. (A invokes B, which invokes C, and C fails: what kind of error message will A get?)
- Protocols that completely handle error cases (any partner may fail at any time during the protocol execution, or timeouts may occur).
- Rich monitoring interfaces.
- Tracking interfaces: APIs for submitting test requests that will be tracked through the OGSA architecture.
- Fault tolerance related services
- Monitoring Service, replication, checkpointing, etc.
- Help with fault tolerance of jobs, but not the fault tolerance of OGSA services themselves.
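One way to picture the "A invokes B which invokes C" question is an error type that carries the failing service's name and the underlying cause up the call chain. This is a minimal sketch, not anything defined by OGSA: the `ServiceError` class, its fields, and the failure reasons are all hypothetical.

```python
class ServiceError(Exception):
    """Hypothetical structured error: which service failed, why, and what caused it."""

    def __init__(self, service, reason, cause=None):
        super().__init__(f"{service}: {reason}")
        self.service = service
        self.reason = reason
        self.cause = cause

    def path(self):
        """Names of the services the error passed through, outermost first."""
        names, err = [], self
        while err is not None:
            names.append(err.service)
            err = err.cause
        return names

    def root(self):
        """The original failure at the bottom of the chain."""
        err = self
        while err.cause is not None:
            err = err.cause
        return err


def service_c():
    # C fails outright (made-up reason).
    raise ServiceError("C", "reservation store unreachable")


def service_b():
    # B wraps C's error instead of hiding it.
    try:
        service_c()
    except ServiceError as e:
        raise ServiceError("B", "could not reserve resources", cause=e) from e


def service_a():
    """A invokes B and reports what it can learn from the error it receives."""
    try:
        service_b()
        return "ok"
    except ServiceError as e:
        return f"failed via {' -> '.join(e.path())}; root cause: {e.root().reason}"


print(service_a())
```

With an agreed error format of this kind, A can tell that the failure originated in C even though it only ever talked to B.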
9Suggestions (Implementation)
- Fast failure detection and recovery mechanisms.
- End-to-end monitoring (submit test jobs).
- Combine or co-locate OGSA services on the same fault-tolerant servers to reduce the cost of fault tolerance.
- Hardware fault tolerance, OS-level transparent replication, service-level replication.
- Highly replicated service implementation (P2P style).
- Failure-aware services: do not give up too easily (e.g., locate an alternate service provider).
- Planned graceful degradation path.
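The "do not give up too easily" suggestion can be sketched as a client that walks a list of alternate providers before reporting failure. The code below is an assumption-laden illustration: the provider names and functions are hypothetical, and each provider call is assumed to enforce its own timeout and raise `TimeoutError` or `ConnectionError` when it cannot answer.

```python
def call_with_failover(providers, request):
    """Try each (name, callable) provider in order; return the first success.

    Returns (provider_name, result). Raises RuntimeError only after every
    provider has failed, listing what went wrong with each one.
    """
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except (ConnectionError, TimeoutError) as exc:
            # Record the failure and move on to the next provider.
            errors.append((name, type(exc).__name__))
    raise RuntimeError(f"all providers failed: {errors}")


# Hypothetical providers for illustration only.
def primary(request):
    raise ConnectionError("primary is down")


def backup(request):
    return f"accepted {request}"


print(call_with_failover([("primary", primary), ("backup", backup)], "job-1"))
```

The final `RuntimeError` is where a planned degradation path would hook in, for example by queueing the request instead of rejecting it.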
10Conclusions
- Reliability is a MUST for enterprise use of grid computing.
- Complexity of OGSA is scary from a reliability point of view.
- Standards have a limited role in the reliability of OGSA services.
- Implementations and/or deployments of OGSA services must be highly available.