Fault Tolerance Computing

1
Fault Tolerance Computing

Adnan Agbaria

2
System Model and Basic Concepts
3
Staff

Dr. Adnan Agbaria
adnan_at_il.ibm.com
Office Hours
Right after the class
Monday 8:30-10:00
Course URL
http://cs.haifa.ac.il/courses/ftc

4
Materials

Textbooks
Distributed Systems, 2nd edition. Sape Mullender
(Editor). ACM Press Frontier Series,
Addison-Wesley.
K. Birman. Building Secure and Reliable Network
Applications. Manning Publishing Company and
Prentice Hall, December 1996.
J.-C. Laprie. Dependability: Basic Concepts and
Terminology. Springer-Verlag, 1992.
P. Jalote. Fault Tolerance in Distributed
Systems. Prentice-Hall, Inc., 1994.
Research papers
See the list at the web site

5
Grading and Prerequisites

Grading
Participation: 20%
Presentation: 40%
Home assignments: 40%
Prerequisites
Operating systems
Networking
Algorithms

6
Course Outline

Definition and basic concepts
Replications
Group Communication and Virtual Synchrony
Consensus and Byzantine Agreement
Checkpoint/Restart: basic concepts
Distributed Checkpointing

7
Course Outline (Cont'd)

Student presentations: replication
Student presentations: failure detection
Student presentations: group communication and
virtual synchrony
Student presentations: distributed checkpointing
Network computer security and intrusion tolerance

8
Student Presentations

Every student should send me an email with
two preferred papers to present
The 1st paper is the most wanted!
Each presentation is
30 min presentation
15 min Q&A
Homework questions may include material from the
student presentations as well as from the
lecturer's presentations.

9
Outline

Motivation
Concepts, definitions, notations, and system
model
Fault model
Synchronous and asynchronous models
Time in distributed systems

10
Motivation

The cost of system downtime is very high
$4 billion annually: the estimated cost of
system downtime in North American companies
(source: Computer Economics Infocorp.
Consulting)
Availability is still low
"Despite the Internet driving a significantly
increased desire for continuous availability,
through 2005, fewer than 20 percent of
mission-critical Web-based applications will
achieve it. Around 40 percent will achieve high
availability at lower cost" (source: The Gartner
Group)

11
Motivation (Cont'd)

The impact of failures is VERY costly.
In terms of money, time, and lives.
Examples: banking, air traffic control, telephone
systems, weather forecasting, etc.
There is no way to prevent failures
So what can we do? Fault tolerance
Goals
High availability and reliability.
Ways
Fault tolerance

12
Distributed System - Definition

A distributed system consists of a collection of
autonomous computers, connected through a network
and distribution middleware, which enables
computers to coordinate their activities and to
share the resources of the system, so that users
perceive the system as a single, integrated
computing facility.

13
Why Distributed Systems?

Information and Hardware Sharing
Scalability
Availability
Fault Tolerance
Price/performance

14
Types of Distributed Systems

Client/Server
Web (HTTP), NFS, Automatic Teller Machines
Group computing
Distributed/replicated servers
Pub/sub and messaging-based systems
Collaborative computing (CSCW)
Teleteaching, telemedicine, video-conferencing,
Lotus Notes, shared windows sessions
Parallel (cluster) computing in distributed
environments
Message passing interface (MPI)
Distributed Shared Memory

15
System Model

Distributed system with n processes,
denoted by P1, P2, ..., Pn
Each process has local memory and a CPU.
Processes communicate over an asynchronous network
through send/receive events.
Processes are asynchronous too
They do not share a global clock.

16
Basic Events
Computation
Send(M)
Network
Receive(M)
Recovery
17
A Drawing Conception
[Figure: processes P1 and P2 exchanging messages m1, m2, and m3]
18
Fault Tolerance

Hardware, software, and networks fail!
Sources of failures
Human, radiation, etc.
Faults are either
Intentional: mainly caused by attacks,
viruses/worms, etc.
Non-intentional: mainly caused by
Bugs in the code
Incorrect configuration and deployment
Environment
The rate of failures is still too high
Impossible to prevent failures!
So, what can we do?

19
Fault Tolerance (Cont'd)

Distributed systems must maintain availability
even at low levels of hardware/software/network
reliability.
Fault tolerance is achieved, mainly, by
Recovery
Recover the machine, system, or application upon
any failure
Failures should be detected.
Where to recover from?
Start everything from scratch, or
Restart from a pre-captured state.
Cost and performance
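
The "restart from a pre-captured state" option above can be illustrated with a minimal checkpoint/restart sketch. Everything here is an assumption made for illustration and not part of the course material: the helper names (`save_checkpoint`, `restore_checkpoint`, `run`), the use of `pickle` on a local file, and the toy computation (summing step numbers).

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write `state` to a temp file, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    # Rename in one step so a crash does not leave a half-written checkpoint.
    os.replace(tmp_path, path)


def restore_checkpoint(path, initial_state):
    """Return the last saved state, or `initial_state` if none exists."""
    if not os.path.exists(path):
        return initial_state        # start everything from scratch
    with open(path, "rb") as f:
        return pickle.load(f)       # restart from the pre-captured state


def run(path, total_steps, crash_at=None):
    """Sum 0..total_steps-1, checkpointing after every step."""
    state = restore_checkpoint(path, {"step": 0, "acc": 0})
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated crash")
        state["acc"] += state["step"]
        state["step"] += 1
        save_checkpoint(state, path)
    return state["acc"]
```

Calling `run(path, 10, crash_at=5)` and then `run(path, 10)` resumes at step 5 instead of step 0. Checkpointing after every step keeps the lost work small but pays the write cost each time, which is the cost/performance trade-off the slide points at.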

20
Fault Tolerance (Cont'd)

Redundancy
If one component fails, there is a spare to take
over.
How does the spare know when to take over?
How often do we update the spare?
How many spares do we need?
Cost and performance.
Self-stabilization
If the system is in a faulty state, it should
detect this and go back to the normal state.
How does the system do that?
We do not consider this technique in this course.

21
Reliability

Means that the system continuously delivers
correct service.
The reliability R(t) of a system SYS can be
expressed as
R(t) = Prob(SYS is fully functioning in [0, t])
A metric for reliability R(t) is MTTF, the Mean
Time To Failure
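
To make the relation between R(t) and MTTF concrete, here is a small sketch under one explicit assumption that the slide does not state: failures arrive at a constant rate (the exponential model), so R(t) = exp(-rate * t) and MTTF, the integral of R(t) over all t >= 0, equals 1 / rate. The function names and the example rate are hypothetical.

```python
import math


def reliability(t, failure_rate):
    """R(t) under the assumed constant failure rate: P(no failure in [0, t])."""
    return math.exp(-failure_rate * t)


def mttf(failure_rate):
    """Mean Time To Failure for a constant failure rate (integral of R(t))."""
    return 1.0 / failure_rate


rate = 1.0 / 1000.0              # hypothetical: one failure per 1000 hours on average
print(mttf(rate))                # mean time to failure, in hours
print(reliability(1000.0, rate)) # probability of surviving one full MTTF
```

Under this model the probability of running failure-free for a whole MTTF is exp(-1), about 37%, which shows that MTTF alone says little about reliability over a specific mission time.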

22
Availability

Means that the system delivers service whenever
authorized users require it
The availability A(t) of a system SYS can be
expressed as
A(t) = Prob(SYS is fully functioning at time t)
A metric for the average, steady-state
availability is MTTF / (MTTF + MTTR), where MTTR
is the Mean Time To Repair
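
As a worked example, steady-state availability is commonly computed as the ratio MTTF / (MTTF + MTTR), where MTTR is the Mean Time To Repair. The sketch below assumes that textbook ratio; the function names and the sample figures are hypothetical.

```python
def steady_state_availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up, assuming MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(availability, hours_per_year=365 * 24):
    """Expected yearly downtime implied by a given availability."""
    return (1.0 - availability) * hours_per_year


# Hypothetical system: fails every 999 hours on average, takes 1 hour to repair.
a = steady_state_availability(999.0, 1.0)
print(a)
print(downtime_hours_per_year(a))
```

With these sample numbers the availability is 0.999 ("three nines"), which still allows roughly 8.76 hours of downtime per year. Shortening repair time raises availability just as effectively as lengthening the time between failures.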

23
Failure, Error, and Fault

Failure: transition from proper to improper
service
Error: the part of the system state that is liable
to lead to a subsequent failure
Fault: the hypothesized cause of error(s)

Fault --(activation)--> Error --(propagation)--> Failure --(causation)--> Fault
24
Failure Types

Crash: fail-stop mode. The process is no longer
active
Omission: the process fails to send/receive a message
Transient: a temporary failure that may affect
system functionality
Byzantine: the process exhibits arbitrary (random)
behavior
Malicious: an intentional failure, usually caused
by attacks

25
Means to attain Availability and Reliability

Fault prevention
Try to prevent faults before they happen.
Examples
Using a strongly typed programming language
Firewalls for preventing intrusions (for
intrusion tolerance)
Fault tolerance
Handling failures and trying to continue to
provide correct functionality.
Examples
Checkpoint/Restart, Replication, and
Self-Stabilization.
Replication with Byzantine Agreement (for
intrusion tolerance)
Fault detection
Detecting and removing the faults.
Examples
Timeout-based detection: for detecting
crash failures.
Anomaly-based detection: Intrusion Detection
Systems (IDSs)
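
Timeout-based detection of crash failures can be sketched as a heartbeat monitor: each process reports periodically, and a process that stays silent longer than a timeout is suspected. The class name, method names, and the injectable `clock` parameter are assumptions made for this illustration, not an API from the course.

```python
import time


class TimeoutFailureDetector:
    """Suspects a process if no heartbeat arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock            # injectable so tests can control time
        self.last_heartbeat = {}      # process id -> time of last heartbeat

    def heartbeat(self, process_id):
        """Record that `process_id` was alive at the current time."""
        self.last_heartbeat[process_id] = self.clock()

    def suspected(self):
        """Return the set of processes that have been silent for too long."""
        now = self.clock()
        return {
            pid
            for pid, seen in self.last_heartbeat.items()
            if now - seen > self.timeout
        }
```

A suspicion is only a guess: in an asynchronous system a slow process or a delayed message looks exactly like a crash, so the timeout trades detection speed against false suspicions.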

26
Correct vs. Faulty

Look at a complete run (execution)
An external observer's view
A process that does not fail in a run is correct
in that run
Otherwise, the process is faulty in the run
A process that fails at any time in the run is
faulty throughout the entire run

27
Threshold Failure Model

t out of n processes may fail
t is usually given as a function of n, e.g.,
t < n
2t < n
3t < n
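
Each bound above has the form k*t < n, so the largest tolerable number of faulty processes for a given n follows directly. A minimal sketch (the function name `max_faulty` is hypothetical):

```python
def max_faulty(n, k):
    """Largest integer t with k * t < n, for a bound of the form k*t < n."""
    if n < 1 or k < 1:
        raise ValueError("n and k must be positive integers")
    return (n - 1) // k


for k in (1, 2, 3):
    # With n = 7 processes: how many failures does each bound allow?
    print(k, max_faulty(7, k))
```

For n = 7 this gives t = 6 under t < n, t = 3 under 2t < n, and t = 2 under 3t < n. Read the other way, tolerating a single failure under 3t < n already requires at least 4 processes.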

28
Examples
29
A Database System

Transactions
Initiate a connection with the bank server and
ask for a financial transaction on your account.
The server updates the database.
The server sends a confirmation message to the
user.

30
A Database System (Cont'd)

Possible Failures
The server may crash
One of the connections may be down (cut).

31
Synchronous vs. Asynchronous

Synchrony assumptions
Message latency is bounded
Processes have synchronized clocks
Processing times are bounded
Asynchrony: no such assumptions
Asynchronous models are more practical.
The Internet is an asynchronous system.

32
Example: The Coordinated Attack Problem

Definition
Two armies (red and blue) surround a town.
The two armies want to coordinate their attack on
the town.
Victory is achieved if and only if the two armies
attack simultaneously. Otherwise, the attacking
army will be defeated.
The generals (red and blue) communicate by
messengers.
Messengers can be captured (message loss) and/or
can take arbitrarily long.

33
The Coordinated Attack Problem (Cont'd)

There is no solution for the problem in the
asynchronous model.
There is a solution in the synchronous model (?)
Can we add some requirements to the asynchronous
model to solve the problem?

34
Time in Distributed Systems

Logical time
Causality
Similar to visible knowledge that advances at the
speed of light
Wall clock time / real time / global time
Clock skew
Local clocks drift w.r.t. each other, so their
readings diverge. The drift rate depends on clock
quality, but also on temperature, magnetic field,
etc.
It is possible to obtain time from GPS or radio
clocks, but the latency (both of the signal and of
handling the signal inside the computer) can vary
a bit. Also, the signal may not be available when
there is no line of sight to the sky.
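
One common way to realize logical time that respects causality, without any shared wall clock, is a Lamport clock. The slide does not name a specific scheme, so the choice of Lamport clocks here is an assumption, as are the class and method names in this sketch.

```python
class LamportClock:
    """Logical clock: a counter that only moves forward."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local computation or send event: advance the clock by one."""
        self.time += 1
        return self.time

    def receive(self, sent_time):
        """Receive event: move past both the local and the sender's time."""
        self.time = max(self.time, sent_time) + 1
        return self.time


p1, p2 = LamportClock(), LamportClock()
p1.tick()                      # local computation at P1
stamp = p1.tick()              # P1 sends m1, stamped with its clock value
received = p2.receive(stamp)   # P2 receives m1
print(stamp, received)
```

The receive event always gets a larger timestamp than the matching send, so if one event causally precedes another, its timestamp is smaller. The converse does not hold: a smaller timestamp alone does not prove causality.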