1
What Good Are Models and What Are Models Good For?
  • From Distributed Systems, Chapter 2
  • Fred Schneider
  • Edited by Sape Mullender
  • Presentation by Scott McManus
  • February 14, 2007

2
Overview
  • Distributed systems are tough to design and
    understand.
  • Goal is to develop intuitions for constructing
    them.
  • Concepts and goals of modeling are defined.
  • Poor intuitions for even simple distributed
    systems exist, and an example is discussed.
  • Models for distributed systems' attributes are
    given.
  • Synchronous versus Asynchronous Systems
  • Failure Modes

3
Two Traditional Approaches
  • Experimental Observation
  • Build and observe, gather experience, and build
    similar things.
  • This doesn't necessarily explain why something
    works.
  • Modeling and Analysis
  • Simplify and postulate rules for a model.
  • Analyze the model and infer characteristics.

4
Tension Between Two Approaches
  • Similar to theory versus practice argument.
  • Experimental observations may not be addressing
    the right problem or incorrectly generalizing.
  • Theorists may provide oversimplified models, and
    not much can be learned from them.

5
Good Models
  • Model
  • A collection of attributes plus a set of rules for
    attribute interaction.
  • Accuracy
  • Model analysis yields output similar to object
    being modeled.
  • Tractability
  • Degree to which analysis is possible.
  • Accurate and tractable models are difficult to
    define.

6
Two Key Problems in Modeling
  • Feasibility
  • What classes of problems can be solved?
  • Avoid wasted effort on unsolvable problems.
  • Cost
  • How expensive are solutions to a solvable
    problem?
  • E.g., Avoid protocols that are expensive or slow.

7
Coordination Problem
  • At publication time (1993), education typically
    stressed single-processor computation and
    algorithm analysis.
  • This does not match up with goals for distributed
    systems.
  • The coordination problem is a simple example of
    where intuition can go wrong.

8
Coordination Problem (Continued)
  • Problem: Two processors communicate with one
    another. Neither one can fail, but the channel
    can fail. Devise a protocol in which one of two
    actions is possible, both processors take the
    same action, and neither takes both actions.
  • Proof is by infinite descent: there must be a
    shortest protocol that solves the Coordination
    Problem, but its final message cannot be
    acknowledged, so neither processor's action can
    depend on it. Removing it yields a shorter
    protocol, which is a contradiction. (A small
    illustration of this last-message dependence
    follows below.)
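A minimal sketch of the last-message dependence the proof exploits, assuming
a fixed two-message handshake over a lossy channel (the protocol and names
below are hypothetical illustrations, not the chapter's formalism):

# Hypothetical illustration: a two-message handshake over a lossy channel.
# If the final message is dropped, the two processors take different
# actions, which is exactly the inconsistency the proof exploits.

def run_handshake(final_message_lost: bool):
    # Message 1: P1 -> P2, "let's take action A" (assume it arrives).
    p2_received_proposal = True

    # Message 2: P2 -> P1, "acknowledged" (this one may be lost).
    p1_received_ack = not final_message_lost

    # Each processor decides based only on the messages it has received.
    p2_action = "A" if p2_received_proposal else "none"
    p1_action = "A" if p1_received_ack else "none"  # P1 will not act without the ack
    return p1_action, p2_action

print(run_handshake(final_message_lost=False))  # ('A', 'A'): agreement
print(run_handshake(final_message_lost=True))   # ('none', 'A'): disagreement

Adding more acknowledgments does not help: the same argument applies to
whichever message happens to be last.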

9
Coordination Problem (Continued)
  • A simple model underlies a problem that most
    would assume is feasible, yet it is not.
  • All protocols between two processors are
    equivalent to a series of messages.
  • Actions taken by a process depend only on the
    sequence of messages it has received.
  • The point is that modeling at the right
    granularity can give rise to analysis that may
    yield nonintuitive results.
  • Model can now be changed and reanalyzed.

10
Synchronous versus Asynchronous Systems
  • Asynchronous
  • No assumptions are made about timing.
  • Synchronous
  • Relative speeds bounded so that a specific time
    ordering is induced.
  • Every system is asynchronous if one simply makes
    no timing assumptions about it.
  • Therefore, models for asynchronous problems are
    suitable for synchronous problems (but the
    converse will likely not be true).

11
Synchronous versus Asynchronous Systems Election
Protocols
  • Asserting a system is synchronous can make
    solutions less costly, but at the cost of
    flexibility.
  • An election protocol is an example of the
    tradeoffs in this model.
  • Definition: All processors have a unique id and
    need to elect a leader. All processors start
    simultaneously and can use broadcasts.
  • The goal is to elect a leader among the
    processors.

12
Election Protocols (Cont.)
  • Asynchronous Solution
  • Each process broadcasts its unique id.
  • Each process selects the process with the lowest
    id as leader.
  • Synchronous Solution
  • Each process waits an amount of time proportional
    to its id before broadcasting.
  • Each process selects the leader based on the
    first message received.
  • (A sketch of both solutions follows below.)
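A minimal sketch of the two solutions, assuming a fixed set of unique ids,
a lossless broadcast, and that every broadcast is eventually heard; the
function names and the delta parameter are illustrative assumptions, not
from the chapter:

# Hypothetical sketch of the two election protocols described above.

def elect_asynchronous(ids):
    # Every processor broadcasts its id; after hearing from all of them,
    # each processor picks the smallest id as leader. No timing assumptions.
    received = list(ids)  # stand-in for "all broadcasts have been received"
    return min(received)

def elect_synchronous(ids, delta=1.0):
    # Each processor waits a time proportional to its id before broadcasting,
    # so the smallest id broadcasts first and everyone adopts that message.
    first_to_broadcast = min(ids, key=lambda i: i * delta)
    return first_to_broadcast

ids = [7, 3, 12]
print(elect_asynchronous(ids))  # 3
print(elect_synchronous(ids))   # 3

The cost difference is in messages and waiting: the asynchronous protocol
must hear from every processor, while the synchronous one can stop after
the single earliest broadcast, at the price of the timing assumption.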

13
Election Protocols
  • Again, by strengthening the assumptions the model
    makes about the system, a more efficient solution
    is obtained.
  • The model also rules out the case where the
    channel fails intermittently, which keeps the
    solution less costly.

14
Failure Models
  • A system is t-fault tolerant if it can satisfy
    its specification provided at most t components
    are faulty.
  • This model ran counter to typical system analyses
    of the time.
  • Those analyses were typically statistical in
    nature, e.g., the Mean Time Between Failures
    (MTBF).
  • Combining the t-fault tolerant model with
    per-component failure probabilities yields the
    same information (see the sketch below).
15
Failure Models (Cont.)
  • The t-fault tolerant model is good in the sense
    that it can be used to predict the same
    information as empirical observation.
  • However, attributing failures to components can
    be tricky.
  • In networking, the sender, receiver, or channel
    can be at fault.
  • Cost will depend heavily on what must be
    replicated so that faults can be tolerated.
    (Claim: failures can only be tolerated by using
    replication.)

16
Failure Models
  • There are four common failure models, ordered
    from least disruptive to most disruptive.
  • Failstop: A processor fails and stays in that
    state. A failstop failure can be detected by
    other processors.
  • Crash: The same as failstop, but the failure may
    not be detectable by other processors.
  • Message loss: A processor fails by failing to
    receive and/or send some messages.
  • Byzantine: The processor fails by exhibiting
    arbitrary behavior. In practice, this is the most
    likely scenario.

17
Failure Models (Cont.)
  • Why use these general models instead of defining
    failures based on the component (e.g.,
    radiation-induced bit errors in memory)?
  • Analyzing all failures of a component and their
    interactions will likely be infeasible.
  • It is partly a matter of taste in abstractions.
  • Describing failures in terms of sending and
    receiving messages is more general and hides
    component behavior that is irrelevant to the
    analysis.

18
Fault Tolerance and Distributed Systems
  • Coexistence of the two is necessary
  • More components means a higher probability of
    failure in a single component in a distributed
    system.
  • Fault tolerance is likewise dependent on
    techniques used in distributed systems (e.g.,
    replication, physical isolation of resources).

19
Fault Tolerance and Distributed Systems (Cont.)
  • Replication is necessary in distributed systems.
  • Replication in space
  • Components are physically and electrically
    separated.
  • Replication in time
  • A device repeats the same computation and
    compares the results (see the sketch below).
  • Only valid for transient failures.
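A minimal sketch of replication in time, assuming the computation is
deterministic when fault-free (the names are hypothetical):

# Hypothetical sketch: repeat the same computation and compare results.
# A transient fault shows up as a disagreement between repetitions; a
# permanent fault corrupts every repetition identically and is not caught.

def replicate_in_time(compute, repetitions=3):
    results = [compute() for _ in range(repetitions)]
    if all(r == results[0] for r in results):
        return results[0]  # all repetitions agree
    raise RuntimeError("transient fault detected: repetitions disagree")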

20
What happens when failures are detected in
replication?
  • A big tradeoff in cost and flexibility occurs
    based on the failure model assumed.
  • Byzantine failures
  • For a majority voting scheme, t faulty outputs
    must be outvoted by t + 1 correct components, so
    a t-fault tolerant system requires 2t + 1
    components. (This is similar to arguments in
    coding theory.)
  • Failstop model
  • Each processor can be detected as having failed,
    so even if t components have stopped, only one
    needs to have survived; t-fault tolerant systems
    therefore require t + 1 components. (A voting
    sketch follows below.)
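A minimal sketch of the majority-voting argument, assuming replica outputs
can simply be compared for equality (the names are hypothetical):

# Hypothetical sketch of majority voting over replica outputs.
# With n = 2t + 1 replicas, up to t arbitrary (Byzantine) outputs are
# always outvoted by the t + 1 correct ones.
from collections import Counter

def vote(outputs):
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:  # a strict majority is required
        return value
    raise RuntimeError("no majority: too many faulty replicas")

t = 1
correct = ["ok"] * (t + 1)     # the t + 1 correct replicas
faulty = ["garbage"] * t       # up to t Byzantine replicas
print(vote(correct + faulty))  # "ok": the majority wins

Under the failstop model no vote is needed: failed replicas are detectable
and can be ignored, so t + 1 replicas suffice.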

21
Which Model When?
  • The key attributes of a problem must be known in
    order to identify the dimensions of the problem.
  • Programs can treat a model as an interface
    definition or specification.
  • Critical applications may assume a Byzantine
    failure model.
  • The failure model can usually be relaxed.
  • When the model doesn't fit, a program may be set
    up to induce the desired kind of failure mode.
  • E.g., forcing a failstop when sanity tests on
    components fail rather than waiting for Byzantine
    faults (see the sketch below).
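A minimal sketch of that idea, assuming a sanity check can be run on a
component's output (the wrapper and all names are hypothetical):

# Hypothetical sketch: force a fail-stop by halting a component as soon as
# a sanity check on its output fails, instead of letting arbitrary
# (Byzantine) behavior propagate further.

class FailStopWrapper:
    def __init__(self, component, sanity_check):
        self.component = component
        self.sanity_check = sanity_check
        self.stopped = False  # other processors can observe this flag

    def call(self, *args):
        if self.stopped:
            raise RuntimeError("component has fail-stopped")
        result = self.component(*args)
        if not self.sanity_check(result):
            self.stopped = True  # halt permanently on the first violation
            raise RuntimeError("sanity check failed: forcing fail-stop")
        return result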

22
Models as Limiting Cases
  • Models should respect the bounds of real systems.
  • That way, the model neither includes component
    failure cases that will almost never happen nor
    excludes very basic faults.
  • Cost and feasibility are then easier to derive.
  • The model is more inclusive of the system's
    components.

23
Questions and Discussion
  • The text uses the term "processors", but these
    are not necessarily interchangeable with
    processes.
  • Processor failure is equivalent to an entire node
    failing.
  • Failure models do not cover all functionality.
  • It may be within specifications to have a certain
    amount of expected downtime even when services
    are functionally duplicated. This is true in some
    fields, such as telecommunications and
    networking.
  • Replication may only be necessary for continuous
    operation.