Title: Fault Tolerance Computing
1Fault Tolerance Computing
2System Model and Basic Concepts
3Staff
- Dr. Adnan Agbaria
- adnan_at_il.ibm.com
- Office Hours
- Right after the class
- Monday 8:30-10:00
- Course URL
- http://cs.haifa.ac.il/courses/ftc
4Materials
- Textbooks
- Distributed Systems, 2nd edition. Sape Mullender
(Editor), ACM Press Frontier Series, Addison-Wesley.
- K. Birman. Building Secure and Reliable Network
Applications. Manning Publishing Company and
Prentice Hall, December 1996.
- J.-C. Laprie. Dependability: Basic Concepts and
Terminology. Springer-Verlag, 1992.
- P. Jalote. Fault Tolerance in Distributed
Systems. Prentice-Hall, Inc., 1994.
- Research papers
- See the list at the web site
5Grading and Prerequisites
- Grading
- Participation: 20%
- Presentation: 40%
- Home assignments: 40%
- Prerequisites
- Operating systems
- Networking
- Algorithms
6Course Outline
- Definition and basic concepts
- Replications
- Group Communication and Virtual Synchrony
- Consensus and Byzantine Agreement
- Checkpoint/Restart: Basic concepts
- Distributed Checkpointing
7Course Outline (Cont'd)
- Student presentations: Replications
- Student presentations: Failure detection
- Student presentations: Group communication and
virtual synchrony
- Student presentations: Distributed checkpointing
- Network computer security and intrusion tolerance
8Student Presentations
- Every student should send me an email with
- Two preferred papers to present
- The first paper listed is the most preferred one
- Each presentation consists of
- A 30-minute presentation
- 15 minutes of Q&A
- Homework questions may include material from the
student presentations as well as from the
lecturer's presentations.
9Outline
- Motivation
- Concepts, definitions, notations, and system
model
- Fault model
- Synchronous and asynchronous models
- Time in distributed systems
10Motivation
- The cost of system downtime is very high
- $4 billion annually: the estimated cost of
system downtime in North American companies
(source: Computer Economics Infocorp.
Consulting)
- Availability is still low
- "despite the Internet driving a significantly
increased desire for continuous availability,
through 2005, fewer than 20 percent of
mission-critical Web-based applications will
achieve it. Around 40 percent will achieve high
availability at lower cost" (source: The Gartner
Group)
11Motivation (Cont'd)
- The impact of failures is VERY costly.
- In terms of money, time, and lives.
- Examples: banks, air traffic control, telephone
systems, weather forecasting, etc.
- There is no way to prevent failures
- So what can we do? Fault tolerance
- Goals
- High availability and reliability.
- Ways
- Fault tolerance
12Distributed System - Definition
- A distributed system consists of a collection of
autonomous computers, connected through a network
and distribution middleware, which enables
computers to coordinate their activities and to
share the resources of the system, so that users
perceive the system as a single, integrated
computing facility.
13Why Distributed Systems?
- Information and Hardware Sharing
- Scalability
- Availability
- Fault Tolerance
- Price/performance
14Types of Distributed Systems
- Client/Server
- Web (HTTP), NFS, Automatic Teller Machines
- Group computing
- Distributed/replicated servers
- Pub/sub and messaging-based systems
- Collaborative computing (CSCW)
- Teleteaching, telemedicine, video-conferencing,
Lotus Notes, shared window sessions
- Parallel (cluster) computing in distributed
environments
- Message Passing Interface (MPI)
- Distributed Shared Memory
15System Model
- A distributed system with n processes,
- Denoted by P1, P2, ..., Pn
- Each process has local memory and a CPU.
- Processes communicate via an asynchronous network
by send/receive events.
- Processes are asynchronous too
- They do not share a global clock.
16Basic Events
Computation
Send(M)
Network
Receive(M)
Recovery
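The system model and its basic events can be sketched in a few lines of Python. This is only an illustration: the names `Network`, `Process`, and `deliver_one` are assumptions made here and do not come from the slides, and recovery events are left out. Each process keeps a purely local event log, and the network delivers pending messages in an arbitrary order to mimic asynchrony.

```python
import random


class Network:
    """Asynchronous network: messages in transit are delivered in arbitrary order."""

    def __init__(self, seed=0):
        self.in_transit = []
        self.rng = random.Random(seed)

    def send(self, src, dst, msg):
        self.in_transit.append((src, dst, msg))

    def deliver_one(self, processes):
        """Deliver one pending message chosen at random; return False if none remain."""
        if not self.in_transit:
            return False
        index = self.rng.randrange(len(self.in_transit))
        src, dst, msg = self.in_transit.pop(index)
        processes[dst].receive(src, msg)
        return True


class Process:
    """A process with local memory only (no shared state, no global clock)."""

    def __init__(self, pid, network):
        self.pid = pid
        self.network = network
        self.log = []  # local history of computation, send, and receive events

    def compute(self, note):
        self.log.append(("compute", note))

    def send(self, dst, msg):
        self.log.append(("send", dst, msg))
        self.network.send(self.pid, dst, msg)

    def receive(self, src, msg):
        self.log.append(("receive", src, msg))


net = Network(seed=1)
procs = {pid: Process(pid, net) for pid in ("P1", "P2")}
procs["P1"].send("P2", "m1")
procs["P2"].send("P1", "m2")
procs["P1"].send("P2", "m3")
while net.deliver_one(procs):
    pass  # keep delivering until nothing is in transit
```

Every message is eventually delivered here, but the order in which P2 sees m1 and m3 depends on the network, not on the sender.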
17A Drawing Conception
[Figure: processes P1 and P2 exchanging messages m1, m2, and m3]
18Fault Tolerance
- Hardware, software, and networks fail!
- Sources of failures
- Human, radiation, etc.
- There are
- Intentional faults: mainly caused by attacks,
viruses/worms, etc.
- Non-intentional faults: mainly caused by
- Bugs in the code
- Incorrect configuration and deployment
- Environment
- The rate of failures is still too high
- Impossible to prevent failures!
- So, what can we do?
19Fault Tolerance (Cont'd)
- Distributed systems must maintain availability
even at low levels of hardware/software/network
reliability.
- Fault tolerance is achieved, mainly, by
- Recovery
- Recover the machine, system, or application upon
any failure
- Failures should be detected.
- Where to recover from?
- Start everything from scratch, or
- Restart from a pre-captured state.
- Cost and performance
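The trade-off between starting from scratch and restarting from a pre-captured state can be illustrated with a toy computation. The sketch below is an assumption-laden example, not a real checkpointing mechanism: the function `run`, its parameters, and the single simulated crash are all hypothetical. It counts how many steps are executed in total, so the recovery cost is visible.

```python
import copy


def run(total_steps, crash_at=None, checkpoint_every=None):
    """Sum 0..total_steps-1, crashing once when the step counter reaches crash_at.

    Returns (result, steps_executed). Without checkpoint_every, the only saved
    state is the initial one, so recovery restarts everything from scratch.
    """
    state = {"step": 0, "acc": 0}
    saved = copy.deepcopy(state)  # the initial state acts as the first checkpoint
    executed = 0
    crashed = False
    while state["step"] < total_steps:
        if crash_at is not None and not crashed and state["step"] == crash_at:
            crashed = True
            state = copy.deepcopy(saved)  # recovery: roll back to the saved state
            continue
        state["acc"] += state["step"]
        state["step"] += 1
        executed += 1
        if checkpoint_every and state["step"] % checkpoint_every == 0:
            saved = copy.deepcopy(state)  # capture a new checkpoint
    return state["acc"], executed


no_failure = run(10)
from_scratch = run(10, crash_at=7)
from_checkpoint = run(10, crash_at=7, checkpoint_every=5)
```

In this toy run the result is the same in all three cases, but restarting from scratch re-executes all 7 lost steps while restarting from the checkpoint taken at step 5 re-executes only 2. A real system also pays for taking each checkpoint, which is the cost and performance question above.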
20Fault Tolerance (Cont'd)
- Redundancy
- If one component fails, there is a spare to take
over.
- How does the spare know when to take over?
- How often do we update the spare?
- How many spares do we need?
- Cost and performance.
- Self-stabilization
- If the system is in a faulty state, it should
detect this and go back to the normal state.
- How does the system do that?
- We do not consider this technique in this course.
21Reliability
- Means that the system continuously produces
correct service.
- The reliability R(t) of a system SYS can be
expressed as
- R(t) = Prob(SYS is fully functioning in [0, t])
- A metric for reliability R(t) is MTTF, the Mean
Time To Failure
22Availability
- Means that the system provides service whenever
it is required by authorized users
- The availability A(t) of a system SYS can be
expressed as
- A(t) = Prob(SYS is fully functioning at time t)
- A metric for the average, steady-state
availability is
- A = MTTF / (MTTF + MTTR), where MTTR is the Mean
Time To Repair
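As a worked illustration, the sketch below evaluates both metrics numerically. It rests on two assumptions not stated in the slides: a constant failure rate (exponentially distributed time to failure), so that R(t) = exp(-t / MTTF), and the standard steady-state formula A = MTTF / (MTTF + MTTR), with MTTR the Mean Time To Repair. The function names and the numbers are made up for the example.

```python
import math


def reliability(t, mttf):
    """R(t) under the assumed constant failure rate 1/MTTF."""
    return math.exp(-t / mttf)


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time the system is up."""
    return mttf / (mttf + mttr)


mttf_hours = 1000.0  # hypothetical mean time to failure
mttr_hours = 2.0     # hypothetical mean time to repair

r_100 = reliability(100.0, mttf_hours)
a = steady_state_availability(mttf_hours, mttr_hours)
print(f"R(100 h) = {r_100:.4f}")
print(f"A        = {a:.4f}")
```

With these numbers the system is highly available (it is repaired quickly) even though the chance of running 100 hours without any failure is only about 90 percent, which shows that the two metrics measure different things.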
23Failure, Error, and Fault
- Failure: transition from proper to improper
service
- Error: that part of the system state that is
liable to lead to subsequent failure
- Fault: the hypothesized cause of error(s)
[Figure: Fault -> (activation) -> Error -> (propagation) -> Failure -> (causation) -> Fault]
24Failure Types
- Crash: fail-stop mode. The process is no longer
active
- Omission: failure to send/receive a message
- Transient: a temporary failure that may affect
the system functionality
- Byzantine: the process exhibits random
(arbitrary) behavior
- Malicious: an intentional failure, usually caused
by attacks
25Means to attain Availability and Reliability
- Fault prevention
- Try to prevent faults before they happen.
- Examples
- Using a strongly typed programming language
- Firewalls for preventing intrusions (for
intrusion tolerance)
- Fault tolerance
- Handling failures and trying to continue
providing correct functionality.
- Examples
- Checkpoint/Restart, Replication, and
Self-Stabilization.
- Replication with Byzantine Agreement (for
intrusion tolerance)
- Fault detection
- Detecting and removing the faults.
- Examples
- Timeout-based detection: this is for detecting
crash failures.
- Anomaly-based detection: Intrusion Detection
Systems (IDSs)
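Timeout-based detection of crash failures, mentioned in the list above, can be sketched with heartbeats. The class `TimeoutFailureDetector` and its methods are hypothetical names introduced for this example, and time is passed in explicitly so the behavior is deterministic rather than tied to a real clock.

```python
class TimeoutFailureDetector:
    """Suspects a process when no heartbeat has arrived within `timeout` time units."""

    def __init__(self, processes, timeout, now=0.0):
        self.timeout = timeout
        # Treat the start time as the last time each process was heard from.
        self.last_heartbeat = {p: now for p in processes}

    def heartbeat(self, process, now):
        """Record that a heartbeat from `process` arrived at time `now`."""
        self.last_heartbeat[process] = now

    def suspected(self, now):
        """Return the set of processes whose last heartbeat is older than the timeout."""
        return {
            p for p, last in self.last_heartbeat.items()
            if now - last > self.timeout
        }


fd = TimeoutFailureDetector(["P1", "P2", "P3"], timeout=3.0)
fd.heartbeat("P1", now=1.0)
fd.heartbeat("P2", now=1.0)
fd.heartbeat("P1", now=3.5)
suspects = fd.suspected(now=5.0)  # P2 and P3 have been silent for too long
```

A timeout only produces a suspicion: in an asynchronous system a slow process or a delayed message is indistinguishable from a crash, so a suspected process may still be correct.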
26Correct vs. Faulty
- Look at a complete run (execution)
- External observer's view
- A process that does not fail in a run is correct
in that run
- Otherwise, the process is faulty in the run
- A process that fails at any time in the run is
faulty throughout the entire run
27Threshold Failure Model
- t out of n processes may fail
- t is usually given as a function of n, e.g.,
- t < n
- 2t < n
- 3t < n
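To make the thresholds concrete, the following sketch computes the largest number of faulty processes t that each bound allows for a given n. The helper name `max_faulty` is an assumption made for this example; the arithmetic is simply the largest integer t satisfying k*t < n.

```python
def max_faulty(n, k):
    """Largest integer t such that k * t < n (for k = 1, 2, 3)."""
    return (n - 1) // k


for n in (4, 7, 10):
    bounds = {f"{k}t < n": max_faulty(n, k) for k in (1, 2, 3)}
    print(n, bounds)
```

For example, with n = 7 processes the bound 2t < n tolerates at most 3 failures, while 3t < n tolerates at most 2.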
28Examples
29A Database System
- Transactions
- Initiate a connection with the bank server and
ask for a financial transaction on your account.
- The server updates the database.
- The server sends a confirmation message to the
user.
30A Database System (Cont'd)
- Possible Failures
- The server may crash
- One of the connections may be down (cut).
31Synchronous vs. Asynchronous
- Synchrony assumptions
- Message latency is bounded
- Processes have synchronized clocks
- Processing times are bounded
- Asynchrony: no assumptions
- Asynchronous models are more practical.
- The Internet is an asynchronous system.
32Example: The Coordinated Attack Problem
- Definition
- Two armies (red and blue) surround a town.
- The two armies want to coordinate to attack the
town.
- Victory is achieved if and only if the two armies
attack simultaneously. Otherwise, the attacking
army will be defeated.
- The generals (red and blue) communicate by
messengers.
- Messengers can be captured (message loss) and/or
can take arbitrarily long.
33The Coordinated Attack Problem (Cont'd)
- There is no solution to the problem in the
asynchronous model.
- There is a solution in the synchronous model (?)
- Can we add some requirements to the asynchronous
model to solve the problem?
34Time in Distributed Systems
- Logical time
- Causality
- Similar to visible knowledge that advances at the
speed of light
- Wall clock time / real time / global time
- Clock skew
- The rate at which local clocks drift w.r.t. each
other. Depends on the clocks' quality, but also on
temperature, magnetic field, etc.
- It is possible to obtain time from GPS or radio
clocks, but the latency (both of the signal and
of handling the signal inside the computer) can
vary a bit. Also, it may not be available when
there is no line of sight to the sky.
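Logical time, the first item on this slide, is commonly realized with Lamport clocks. The slides do not name a specific scheme, so the sketch below is an illustration under that assumption, and the class and method names are hypothetical. Each process keeps a local counter, and a received timestamp pushes the receiver's counter forward so that causally related events get increasing timestamps without any shared wall clock.

```python
class LamportClock:
    """Logical clock: a per-process counter that respects causality."""

    def __init__(self):
        self.time = 0

    def local_event(self):
        """Advance the clock for a computation event."""
        self.time += 1
        return self.time

    def send(self):
        """Advance the clock and return the timestamp to attach to the message."""
        self.time += 1
        return self.time

    def receive(self, msg_time):
        """Move past both the local clock and the timestamp carried by the message."""
        self.time = max(self.time, msg_time) + 1
        return self.time


p1, p2 = LamportClock(), LamportClock()
p1.local_event()        # p1 clock becomes 1
ts = p1.send()          # p1 clock becomes 2; the message carries timestamp 2
p2.local_event()        # p2 clock becomes 1
recv = p2.receive(ts)   # p2 clock becomes max(1, 2) + 1 = 3
```

The receive event is stamped later than the send event that caused it, even though the two processes never compared physical clocks.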