Fault Tolerance I - PowerPoint PPT Presentation

1
Fault Tolerance (I)
2
Topics
  • Basic concepts
  • Physical Redundancy
  • Information Redundancy
  • Timing Redundancy
  • RAID

3
Readings
  • Tanenbaum 7.1, 7.2

4
Introduction
  • A characteristic feature of distributed systems
    that distinguishes them from single-machine
    systems is the notion of partial failure
  • A partial failure happens when one component in a
    distributed system fails.
  • The failure may affect the proper operation of
    some components while leaving others completely
    unaffected.

5
Introduction
  • An important goal in design is to construct the
    system in such a way that it can automatically
    recover from partial failures without seriously
    affecting the overall performance.
  • The distributed system should continue to operate
    in an acceptable way while repairs are being
    made.

6
By the way.
  • Computing systems are not very reliable
  • OS crashes frequently (Windows), buggy software,
    unreliable hardware, software/hardware
    incompatibilities
  • Until recently computer users were tech savvy
  • Could depend on users to reboot, troubleshoot
    problems

7
By the way.
  • Computing systems are not very reliable (cont)
  • Growing popularity of Internet/World Wide Web
  • Novice users
  • Need to build more reliable/dependable systems
  • Example: what if your TV (or car) broke down
    every day?
  • Users don't want to restart the TV or fix it (by
    opening it up)
  • Need to make computing systems more reliable

8
Characterizing Dependable Systems
  • Dependable systems are characterized by:
  • Availability
  • The percentage of time the system may be used
    immediately
  • Reliability
  • Mean time to failure (MTTF): the average time the
    system runs before failing
  • Safety
  • How serious is the impact of a failure?
  • Maintainability
  • How long does it take to repair the system?
  • Security

9
Characterizing Dependable Systems
  • Availability and reliability are not the same
    thing.
  • If a system goes down for a millisecond every
    hour, it has an availability of over 99.9999
    percent, but it is still highly unreliable.
  • A system that never crashes but is shut down for
    two weeks every August has high reliability but
    only 96 percent availability.
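The arithmetic behind these two figures can be checked directly; a quick sketch in Python, using the downtime numbers from the slide:

```python
# 1 ms of downtime every hour: availability is uptime / total time.
downtime_fraction = 0.001 / 3600
avail_frequent_crash = 1 - downtime_fraction   # over 99.9999 percent

# Never crashes, but shut down two weeks every August:
avail_annual_shutdown = 1 - 2 / 52             # roughly 96 percent

print(f"{avail_frequent_crash:.5%}")
print(f"{avail_annual_shutdown:.1%}")
```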

10
Definitions
  • A system fails when it does not perform according
    to its specification.
  • An error is part of a system state that may lead
    to a failure.
  • A fault is the cause of an error.

11
Definitions
  • Types of Faults
  • Transient
  • Occur once and then disappear.
  • If the operation is repeated, the fault goes
    away.
  • Example: a bird flying through the beam of a
    microwave transmitter may cause lost bits on some
    network (not to mention a roasted bird).

12
Definitions
  • Types of Faults (continued)
  • Intermittent
  • Occurs and then vanishes of its own accord, then
    reappears, etc.
  • A loose connector will often cause an
    intermittent fault.
  • Permanent
  • Continues to exist until the faulty component is
    repaired.
  • Burnt-out chips, software bugs, and disk head
    crashes.
  • A fault tolerant system does not fail in the
    presence of faults.

13
Server Failure Models
14
Server Failure Models
  • Crash Failure (fail-stop)
  • A server halts, but is working correctly until it
    halts.
  • Example: an OS that comes to a grinding halt and
    for which there is only one solution: reboot

15
Server Failure Models
  • Omission Failure
  • This occurs when a server fails to respond to
    incoming requests or fails to receive incoming
    messages or fails to send messages.
  • There are many reasons for an omission failure,
    including:
  • The connection between a client and a server has
    been correctly established, but there was no
    thread listening for incoming requests.
  • A send buffer overflows: the server may need to
    be prepared for the client to reissue its
    previous request.
  • An infinite loop where each iteration causes a
    forked process.

16
Server Failure Models
  • Timing Failures
  • A server's response lies outside the specified
    time interval.
  • An e-commerce site may state that the response to
    a user should be no more than 5 seconds (actually
    this is too long).
  • In a video-on-demand application, the client is
    to receive frames at 25 frames per second give or
    take 2 frames.
  • Timing failures are very difficult to deal with.

17
Server Failure Models
  • Response Failure
  • A server's response is incorrect: a wrong reply
    to a request is returned, or the server reacts
    unexpectedly to an incoming request.
  • Example: a search engine that systematically
    returns web pages not related to any of the
    search terms used.
  • Example: a server receives a message that it
    cannot recognize.

18
Server Failure Models
  • Arbitrary (Byzantine) Failures
  • Arbitrary failures occur
  • Server is producing output it should never have
    produced, but which cannot be detected as being
    incorrect.
  • A faulty server may even be maliciously working
    together with other servers to produce
    intentionally wrong answers.

19
Server Failure Models
  • Ideally, we want fail-stop processes.
  • A fail-stop process will simply stop producing
    output in such a way that its halting can be
    detected by other processes.
  • The server may be so friendly as to announce it
    is about to crash.
  • The reality is that processes are not that
    friendly.
  • We rely on other processes to detect the failure.

20
Server Failure Models
  • Problem: how to tell the difference between a
    process that has halted and a process that is
    just slow.
  • Timeouts help, but theoretically you cannot
    place an exact bound on when to expect a response.
  • If the timeout interval is set too high, you
    delay the system from reacting to the failure.

21
Failure Masking by Redundancy
  • If a system is to be fault tolerant, the best it
    can do is to try to hide the occurrence of
    failures from other processes.
  • Key technique: use redundancy
  • Types of redundancy:
  • Information redundancy
  • Physical redundancy
  • Time redundancy

22
Physical Redundancy
  • Extra equipment or processes are added to make it
    possible for the system as a whole to tolerate
    the loss or malfunctioning of some components.
  • Physical redundancy can thus be done in either
    hardware or in software.
  • Examples in hardware:
  • Aircraft: a 747 has 4 engines but can fly on 3.
  • Space shuttle: has 5 computers.
  • Electronic circuits

23
Physical Redundancy
  • Triple modular redundancy.

24
Physical Redundancy
  • For electronic circuits, each device is
    replicated three times.
  • Following each stage in the circuit is a
    triplicated voter.
  • Each voter is a circuit that has three inputs and
    one output.
  • If two or three of the inputs are the same, the
    output is equal to that input.
  • If all three inputs are different, the output is
    undefined.
  • This kind of design is known as TMR (Triple
    Modular Redundancy).
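A voter of this kind is easy to sketch in software; a minimal illustration (this is not a hardware circuit, and returning `None` for the all-differ case is my own choice for "undefined"):

```python
def vote(a, b, c):
    """Majority voter with three inputs and one output.
    If two or three inputs are the same, the output equals that input;
    if all three differ, the output is undefined (None here)."""
    if a == b or a == c:
        return a
    if b == c:
        return b
    return None  # all three inputs differ: undefined

# One faulty replica is completely masked by the majority:
print(vote(1, 1, 0))  # 1
```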

25
Physical Redundancy
  • TMR can be applied to any hardware unit.
  • The TMR can completely mask the failure of one
    hardware unit.
  • No explicit actions need to be performed for
    error detection, recovery, etc.
  • Particularly suitable for transient failures if
    we assume the basic TMR scheme (one voter, three
    replicas).

26
Physical Redundancy
  • This scheme can't handle the failure of two
    units.
  • Once a unit fails, it is essential that both
    remaining units continue to work correctly.
  • The TMR scheme depends critically on the voting
    element. The voting element is typically a
    simple circuit and highly reliable circuits of
    this complexity can be built.
  • The failure of a single voter cannot be tolerated.

27
Physical Redundancy
  • The TMR approach can be generalized to
    replicating N units. This is called the NMR
    approach.
  • The larger N is, the higher the number of faults
    that can be completely masked.

28
Physical Redundancy
  • The basic TMR/NMR scheme is often complemented
    with sparing.
  • Sparing is often referred to as stand-by
    redundancy since the redundant or spare units
    usually are not operating online.
  • The restoring organ for sparing is a switch.
  • An error detector is also required to determine
    when the on-line unit has failed.
  • Failed units may be replaced by a spare.

29
Physical Redundancy
  • Some reliability results
  • Overall reliability decreases when the degree of
    redundancy is increased above a certain amount.
  • TMR provides the least potential for reliability
    improvement.
  • NMR systems with spares provide the highest
    reliability.

30
Information Redundancy
  • Coding is often used in information redundancy.
  • Coding has been extensively used for improving
    the reliability of communication.
  • The basic idea is to add check bits to the
    information bits such that errors in some bits
    can be detected, and if possible corrected.
  • The process of adding check bits to information
    bits is called encoding.
  • The reverse process of extracting information
    from the encoded data is called decoding.

31
Information Redundancy
  • Detectability/Correctability of a Code
  • A code defines a set of words that are possible
    for that code.
  • The Hamming distance of a code is the minimum
    number of bit positions in which any two words in
    the code differ.
  • If d is the Hamming distance, D is the number of
    bit errors that can be detected, and C is the
    number of bit errors that can be corrected, then
    the following relation always holds:
  • d ≥ C + D + 1, with D ≥ C
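The Hamming distance of a code can be found by brute force over all pairs of codewords; a small sketch (the three-word code below is a made-up example):

```python
from itertools import combinations

def hamming_distance(w1, w2):
    # Number of bit positions in which the two words differ.
    return sum(b1 != b2 for b1, b2 in zip(w1, w2))

def code_distance(code):
    # Minimum distance over all pairs of codewords in the code.
    return min(hamming_distance(a, b) for a, b in combinations(code, 2))

# A hypothetical code with distance 3: it can detect two errors
# (D = 2, C = 0) or detect and correct one error (D = C = 1).
print(code_distance(["0000000", "1110000", "0001111"]))  # 3
```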

32
Information Redundancy
  • Detectability/Correctability of a Code
  • Let's say that you have a code that looks like
    this:
  • 000
  • 001
  • 010
  • 011
  • 100
  • 101
  • 110
  • 111
  • The Hamming distance is one. You can't detect an
    error.
  • Why? Let's say that a fault transforms 001 to
    011. How do you know this is a fault rather than
    011 being the correct word?

33
Information Redundancy
  • Detectability/Correctability of a Code
  • On the other hand, let's say that you have the
    following code of 3 codewords:
  • 0000 0011 1100
  • If a fault changes one bit in a correct word, it
    will result in a word that is not in the above
    list. This is not true if two bits are changed.
    Hence, the above code can only detect a
    single-bit fault.
  • You can't correct. Let's say that 0000 changes
    to 0010. You know there is an error, but how do
    you know it should go back to 0000 and not 0011?

34
Information Redundancy
  • Simple Parity Bits
  • Simple parity bits have been in common use in
    computer systems for many years.
  • The parity bit is selected so that the total
    number of 1s in the codeword is odd (even) for
    an odd-parity (even-parity) code.
  • This means that the Hamming distance is 2.
  • The parity bit can only detect single bit errors.
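A sketch of odd-parity encoding and checking (the function names are my own):

```python
def add_parity(bits):
    """Append a parity bit so the total number of 1s is odd (odd parity)."""
    return bits + ("1" if bits.count("1") % 2 == 0 else "0")

def parity_ok(codeword):
    """True iff the codeword has an odd number of 1s."""
    return codeword.count("1") % 2 == 1

print(add_parity("000"))               # 0001
word = add_parity("001")               # 0010
assert parity_ok(word)
word_with_error = "0110"               # one bit of 0010 flipped
assert not parity_ok(word_with_error)  # single-bit error detected
```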

35
Information Redundancy
  • Simple Parity Bits
  • Example (assume odd parity):
  • Codeword is 000: the parity bit is 1
  • Codeword is 001: the parity bit is 0
  • Codeword is 010: the parity bit is 0
  • Let's say that 000 is transmitted as 0001: the
    parity bit is set to 1, which results in an odd
    number of ones (remember, we require an odd
    number of ones).

36
Information Redundancy
  • Simple Parity Bits
  • All errors involving an odd number of bits can be
    detected because such errors will produce an
    incorrect parity.

37
Information Redundancy
  • Hamming Codes
  • Multiple parity bits are added such that each
    parity bit is a parity of a subset of information
    bits. The code can detect and also correct
    errors.
  • Widely used in semiconductor memory and in disk
    arrays.

38
Information Redundancy
  • Hamming Codes
  • Parity bits occupy bit positions 1, 2, 4, …
    (powers of 2) in the encoding. The remaining
    positions are the data positions.
  • Let k be the number of parity bits.
  • Let m be the number of data bits.
  • The length of the encoded word is m + k.

39
Information Redundancy
  • Hamming Codes Example
  • Let k = 3 and m = 4.
  • Bits in positions 1, 2, 4 are the parity bits.
    Label these as c1, c2, and c3.
  • Bits in positions 3, 5, 6, 7 are the data bits.
    Label these as d1, d2, d3, and d4.
  • The values of the parity bits are defined by the
    following relations:
  • c1 = d1 ⊕ d2 ⊕ d4
  • c2 = d1 ⊕ d3 ⊕ d4
  • c3 = d2 ⊕ d3 ⊕ d4

Positions (binary): 1 (001) c1, 2 (010) c2, 3 (011)
d1, 4 (100) c3, 5 (101) d2, 6 (110) d3, 7 (111) d4
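A sketch of the encoder implied by these relations, using the (7,4) layout from the slides with check bits in positions 1, 2, 4:

```python
def hamming74_encode(d1, d2, d3, d4):
    """Encode 4 data bits into a 7-bit word: positions 1, 2, 4 hold
    the check bits c1, c2, c3; positions 3, 5, 6, 7 hold d1..d4."""
    c1 = d1 ^ d2 ^ d4
    c2 = d1 ^ d3 ^ d4
    c3 = d2 ^ d3 ^ d4
    return [c1, c2, d1, c3, d2, d3, d4]  # positions 1..7

# The slides' example word 1011 (d1=1, d2=0, d3=1, d4=1):
print(hamming74_encode(1, 0, 1, 1))  # [0, 1, 1, 0, 0, 1, 1]
```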
40
Information Redundancy
  • Hamming Codes Example
  • Let the word to be transmitted be 1011.
  • Positions 001–111 hold c1 c2 d1 c3 d2 d3 d4.
  • c1 = d1 ⊕ d2 ⊕ d4 = 1 ⊕ 0 ⊕ 1 = 0
  • c2 = d1 ⊕ d3 ⊕ d4 = 1 ⊕ 1 ⊕ 1 = 1
  • c3 = d2 ⊕ d3 ⊕ d4 = 0 ⊕ 1 ⊕ 1 = 0
  • Encoded word: 0110011
41
Information Redundancy
  • Hamming Codes Example
  • How do we come up with these relations?
  • A Hamming code generator computes the check bits
    according to the following scheme.
  • The binary representation of the position number
    j is j(k-1) … j1 j0.
  • The value of check bit ci is chosen to give odd
    (or even) parity over all bit positions j such
    that ji = 1.
  • Thus each bit of the data word participates in
    several different check bits.

42
Information Redundancy
  • Hamming Codes Example
  • Assume the data word received is 1111: bit d2
    was transmitted improperly; it was originally a
    zero.
43
Information Redundancy
  • Hamming Codes Example
  • Locating the bit in error
  • The check bits recomputed from the relations
    given above are XORed with the check bits
    actually received in the code.
  • Recomputed from the received data bits (1111):
  • c1 = d1 ⊕ d2 ⊕ d4 = 1
  • c2 = d1 ⊕ d3 ⊕ d4 = 1
  • c3 = d2 ⊕ d3 ⊕ d4 = 1
  • e1 = c1(received) ⊕ c1 = 0 ⊕ 1 = 1
  • e2 = c2(received) ⊕ c2 = 1 ⊕ 1 = 0
  • e3 = c3(received) ⊕ c3 = 0 ⊕ 1 = 1
  • e3 e2 e1 = 101 = 5, so the bit in position 5
    (d2) is in error.
  • Correction is done by simply complementing the
    bit.

If every error bit ei is 0, there is no error;
otherwise the error bits specify the location of
the bit in error. Here d2 is the data bit covered
by c1 and c3 but not by c2.
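The error-locating step can be sketched the same way; the syndrome e3 e2 e1, read as a binary number, gives the position of the bad bit:

```python
def hamming74_syndrome(word):
    """word holds positions 1..7 as [c1, c2, d1, c3, d2, d3, d4].
    Recompute each check bit from the received data bits and XOR it
    with the received check bit; the result e3 e2 e1, read as a
    binary number, is the position of the bit in error (0 = no error)."""
    c1, c2, d1, c3, d2, d3, d4 = word
    e1 = c1 ^ d1 ^ d2 ^ d4
    e2 = c2 ^ d1 ^ d3 ^ d4
    e3 = c3 ^ d2 ^ d3 ^ d4
    return 4 * e3 + 2 * e2 + e1

# 1011 encodes to [0,1,1,0,0,1,1]; flip d2 (position 5) in transit:
received = [0, 1, 1, 0, 1, 1, 1]
pos = hamming74_syndrome(received)
print(pos)              # 5
received[pos - 1] ^= 1  # correct by complementing the bad bit
```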
44
Information Redundancy
  • Hamming Codes Example
  • The use of Hamming codes becomes more efficient,
    in terms of numbers of bits needed relative to
    the number of data bits, as the word size
    increases.
  • If the data word length is 8 bits, the number of
    check bits will be 4. This overhead is 50 percent.
  • If the word length is 84 bits, the number of
    check bits will be 7 giving an overhead of 9
    percent.

45
Information Redundancy
  • Cyclic Redundancy Code (CRC)
  • These codes are applied to a block of data,
    rather than independent words.
  • CRCs are commonly used in detecting errors in
    data communication.
  • A sequence of bits is represented as a
    polynomial; encoding uses a fixed generator
    polynomial.

46
Information Redundancy
  • Cyclic Redundancy Code (CRC)
  • If the kth bit is 1, then the polynomial contains
    x^k.
  • Example: 1100101101
  • x^9 + x^8 + x^5 + x^3 + x^2 + 1
  • Encoding:
  • To the data bit sequence, add (k+1) bits at the
    end.
  • The extended data sequence is divided (modulo 2)
    by the generator polynomial.
  • The final remainder is added to the data sequence
    to form the encoded data.
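Modulo-2 division is just repeated XOR; a sketch (the data word and degree-3 generator below are made-up examples, and real CRCs add details such as bit ordering and initial values):

```python
def crc_remainder(data, gen):
    """Append len(gen)-1 zero bits to the data, divide modulo 2 by
    the generator bit pattern, and return the remainder (check bits)."""
    buf = data + [0] * (len(gen) - 1)
    for i in range(len(data)):
        if buf[i]:                    # leading bit set: subtract (XOR)
            for j, g in enumerate(gen):
                buf[i + j] ^= g
    return buf[len(data):]

# Data 1101 with generator x^3 + x + 1 (bit pattern 1011):
rem = crc_remainder([1, 1, 0, 1], [1, 0, 1, 1])
print(rem)  # [0, 0, 1]; the encoded data is 1101 followed by 001
```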

47
Information Redundancy
  • Cyclic Redundancy Code (CRC)
  • Decoding:
  • The extra (k+1) bits are just discarded to obtain
    the original data bits.
  • Error checking: the data bits are again divided
    by the generator polynomial, and the final
    remainder is checked against the last (k+1) bits
    of the received data.
  • If there is a difference, an error has occurred.

48
Information Redundancy
  • Cyclic Redundancy Code (CRC)
  • Through proper selection of the generating
    polynomial, CRC codes will:
  • Detect all single bit errors in the data stream
  • Detect all double bit errors in the data stream
  • Detect any odd number of errors in the data
    stream
  • Detect any burst error for which the length of
    the burst is less than the length of the
    generating polynomial
  • Detect most larger burst errors

49
Time Redundancy
  • An action is performed and if the need arises, it
    is performed again.
  • Example If a transaction aborts, it can be
    redone with no harm.
  • This is especially useful when the faults are
    transient or intermittent.
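A minimal sketch of time redundancy: retry the action a few times before giving up (the flaky action below is a hypothetical stand-in for a transiently failing operation):

```python
import time

def with_retry(action, attempts=3, delay=0.01):
    """Perform an action; if it fails, perform it again.
    This helps only when the fault is transient or intermittent."""
    for i in range(attempts):
        try:
            return action()
        except Exception:
            if i == attempts - 1:
                raise           # looks permanent: give up
            time.sleep(delay)   # the transient fault may have cleared

calls = {"n": 0}
def flaky():
    """Fails twice with a transient fault, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("transient fault")
    return "ok"

print(with_retry(flaky))  # "ok" on the third attempt
```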

50
Case Study
  • Let's look at RAID (Redundant Array of
    Inexpensive Disks).
  • Motivation
  • Improve disk access time by using arrays of disks
  • Disks are getting inexpensive.
  • Lower cost disks
  • Less capacity.
  • But cheaper, smaller, and lower power.

51
Disk Organization 1
  • Interleaving disks.
  • Supercomputing applications.
  • Transfer of large blocks of data at high rates.

(Figure: a grouped read, a single read spread over
multiple disks)
52
Disk Organization 1
  • What is interleaving?
  • Assume you have 4 disks.
  • Byte interleaving means that byte N is on disk (N
    mod 4).
  • Block interleaving means that block N is on disk
    (N mod 4).
  • All reads and writes involve all disks, which is
    great for large transfers
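The mapping is a one-liner; a sketch for the 4-disk case:

```python
def disk_for_block(n, num_disks=4):
    """Block interleaving: block N lives on disk (N mod num_disks)."""
    return n % num_disks

# A large transfer of blocks 0..7 touches each of the 4 disks twice,
# so all disks stream data in parallel:
print([disk_for_block(n) for n in range(8)])  # [0, 1, 2, 3, 0, 1, 2, 3]
```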

53
Disk Organization 2
  • Independent disks.
  • Transaction processing applications.
  • Database partitioned across disks.
  • Concurrent access to independent items.

(Figure: independent concurrent reads and writes to
separate disks)
54
Problem Reliability
  • Disk unreliability causes frequent backups.
  • Fault tolerance is needed, otherwise disk arrays
    are too unreliable to be useful.
  • RAID: use of extra disks containing redundant
    information.
  • Similar to redundant transmission of data.

55
RAID Levels
  • Different levels provide different reliability,
    cost, and performance.
  • The mean time to failure (MTTF) is a function of
    total number of disks, number of data disks in a
    group (G), number of check disks per group (C),
    and number of groups.
  • The number C is determined by RAID level.

56
First RAID Level
  • Mirrors
  • Most expensive approach.
  • All disks duplicated (G = 1 and C = 1).
  • Every write to data disk results in write to
    check disk.
  • Reads can be from either disk.
  • Double cost and half capacity.

57
Second RAID Level
  • Data is split at the bit level and spread over
    data and redundancy (check) disks.
  • Redundant bits are computed using Hamming code
    and placed in the redundancy disk.
  • Interleave data across disks in a group.
  • Add enough check disks to detect/correct error.
  • Single parity disk detects single error.
  • Makes sense for large data transfers.
  • Small transfers mean all disks must be accessed
    (to check if data is correct).

58
Third and Fourth RAID Level
  • The third RAID level is similar to the second
    RAID level except that splitting of data is at
    the byte level. There is one parity disk.
  • The fourth RAID level is similar to the third
    RAID level except that splitting of data is at
    the block level. There is one parity disk.
  • The fifth RAID level is similar to the fourth
    RAID level except that check bits are distributed
    across multiple disks.
  • There are 8 RAID levels.
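The parity used in these levels is a bitwise XOR across the data blocks of a group; a sketch of computing parity and rebuilding a lost block (the block contents are made-up):

```python
from functools import reduce

def parity_block(blocks):
    """Bitwise XOR of corresponding bytes across the blocks; this is
    what the parity (check) disk stores for the group."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(surviving, parity):
    """A lost block is the XOR of the parity block and the survivors."""
    return parity_block(surviving + [parity])

data = [b"\x0f\x01", b"\xf0\x10", b"\xaa\x55"]  # three data disks
p = parity_block(data)
# Disk 1 fails; its contents come back from parity plus the survivors:
assert reconstruct([data[0], data[2]], p) == data[1]
```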

59
Process Resilience
  • The key approach to tolerating a faulty process
    is to organize several identical processes in a
    group.
  • Design issues include the following:
  • When a message is sent to the group itself, all
    members of the group receive it.
  • Dealing with process groups

60
Problems of Agreement
  • A set of processes need to agree on a value
    (decision), after one or more processes have
    proposed what that value (decision) should be
  • Examples
  • mutual exclusion, election, transactions
  • Processes may be correct, crashed, or they may
    exhibit arbitrary (Byzantine) failures
  • Messages are exchanged on a one-to-one basis,
    and they are not signed

61
Problems of Agreement
  • The general goal of distributed agreement
    algorithms is to have all the nonfaulty processes
    reach consensus on some issue and to establish
    that consensus within a finite number of steps.
  • What if processes exhibit Byzantine failures?
  • This is often compared to armies in the Byzantine
    Empire, in which conspiracies, intrigue, and
    untruthfulness were alleged to be common in
    ruling circles.

62
The Two-Army Problem
  • How can two perfect processes reach agreement
    about 1 bit of information?
  • Over an unreliable communication channel.
  • Red army: 5000 troops
  • Blue armies 1 and 2: 3000 troops each
  • How can the blue armies reach agreement on when
    to attack?
  • Their only means of communication is by sending
    messengers
  • that may be captured by the enemy!
  • No solution!

63
The Two-Army Problem
  • Proof by contradiction: assume there is a
    solution with a minimum number of messages.
  • Suppose the commander of blue army 1 is General
    Alexander and the commander of blue army 2 is
    General Bonaparte.
  • General Alexander sends a message to General
    Bonaparte reading "I have a plan: let's attack
    at dawn tomorrow."
  • The messenger gets through, and Bonaparte sends
    him back with a note saying "Splendid idea,
    Alex. See you at dawn tomorrow."
  • The messenger gets back.

64
The Two-Army Problem
  • Proof by contradiction (cont)
  • Alexander wants to make sure that Bonaparte does
    know that the messenger got back safely so that
    Bonaparte is confident that Alexander will
    attack.
  • Alexander tells the messenger to go tell
    Bonaparte that his message arrived and the battle
    is set.
  • The messenger gets through, but now Bonaparte
    worries that Alexander does not know if the
    acknowledgement got through.
  • Bonaparte acknowledges the acknowledgement.
  • Etc., etc., etc.

65
History Lesson The Byzantine Empire
  • Time: 330–1453 AD.
  • Place: the Balkans and modern Turkey.
  • Endless conspiracies, intrigue, and
    untruthfulness were alleged to be common
    practice in the ruling circles of the day.
  • That is, it was typical for intentionally wrong
    and malicious activity to occur among the ruling
    group. A similar occurrence can surface in a DS,
    and is known as Byzantine failure.
  • Question: how do we deal with such malicious
    group members within a distributed system?

66
Byzantine Generals Problem
  • Now assume that the communication is perfect but
    the processes are not.
  • This problem also occurs in military settings and
    is called the Byzantine Generals Problem.
  • We still have the red army, but n blue generals.
  • Communication is done pairwise by phone; it is
    instantaneous and perfect.
  • m of the generals are traitors (faulty) and are
    actively trying to prevent the loyal generals
    from reaching agreement by feeding them incorrect
    and contradictory information.
  • Is agreement still possible?

67
Byzantine Generals Problem
  • We will illustrate by example where there are 4
    generals, where one is a traitor (analogous to a
    faulty process).
  • Step 1
  • Every general sends a (reliable) message to every
    other general announcing his troop strength.
  • Loyal generals tell the truth.
  • Traitors tell every other general a different
    lie.
  • Example: general 1 reports 1K troops, general 2
    reports 2K troops, general 3 lies to everyone
    (giving x, y, z respectively), and general 4
    reports 4K troops.

68
Byzantine Generals Problem
69
Byzantine Generals Problem
  • Step 2
  • The results of the announcements of step 1 are
    collected together in the form of vectors.

70
Byzantine Generals Problem
71
Byzantine Generals Problem
  • Step 3
  • Consists of every general passing his vector from
    the previous step to every other general.
  • Each general gets three vectors from each other
    general.
  • General 3 hasn't stopped lying. He invents 12
    new values, a through l.

72
Byzantine Generals Problem
73
Byzantine Generals Problem
  • Step 4
  • Each general examines the ith element of each of
    the newly received vectors.
  • If any value has a majority, that value is put
    into the result vector.
  • If no value has a majority, the corresponding
    element of the result vector is marked UNKNOWN.
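The step above can be sketched as a per-position majority vote over the received vectors (the values below are illustrative, not the slides' figures):

```python
from collections import Counter

def majority_vector(vectors):
    """For each element position, keep the value reported by a majority
    of the vectors; if no value has a majority, mark it UNKNOWN."""
    result = []
    for column in zip(*vectors):
        value, count = Counter(column).most_common(1)[0]
        result.append(value if count > len(column) // 2 else "UNKNOWN")
    return result

# A loyal general's view: vectors relayed by two loyal generals and
# one traitor, whose relayed vector is entirely invented:
vecs = [(1, 2, "a", 4),        # from a loyal general
        ("p", "q", "r", "s"),  # from the traitor
        (1, 2, "b", 4)]        # from another loyal general
print(majority_vector(vecs))   # [1, 2, 'UNKNOWN', 4]
```

The traitor's own troop strength (position 3) stays UNKNOWN, but the loyal generals' values still win a majority.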

74
Byzantine Generals Problem
  • The same as in previous example, except now with
    2 loyal generals and one traitor.

75
Byzantine Generals Problem
  • With m faulty processes, agreement is possible
    only if 2m + 1 processes function correctly.
  • The total is 3m + 1.
  • If messages cannot be guaranteed to be delivered
    within a known, finite time, no agreement is
    possible if even one process is faulty.
  • Why? Slow processes are indistinguishable from
    crashed ones.

76
Byzantine Generals Problem
  • Let f be the number of faults to be tolerated.
  • The algorithm needs f + 1 rounds.
  • In each round, a process sends to all the other
    processes the values that it received in the
    previous round. The number of messages sent is
    on the order of
  • O(N^(f+1)), where N is the number of generals.
  • If you do not assume Byzantine faults, then you
    need a lot less infrastructure.