1. What Good Are Models and What Are Models Good For?
- From Distributed Systems, Chapter 2
- Fred Schneider
- Edited by Sape Mullender
- Presentation by Scott McManus
- February 14, 2007
2. Overview
- Distributed systems are tough to design and understand.
- The goal is to develop intuitions for constructing them.
- Concepts and goals of modeling are defined.
- Poor intuitions exist for even simple distributed systems, and an example is discussed.
- Models for distributed systems' attributes are given:
- Synchronous versus Asynchronous Systems
- Failure Modes
3. Two Traditional Approaches
- Experimental Observation
- Build and observe, gather experience, and build similar things.
- This doesn't necessarily explain why something works.
- Modeling and Analysis
- Simplify and postulate rules for a model.
- Analyze the model and infer characteristics.
4. Tension Between the Two Approaches
- Similar to the theory versus practice argument.
- Experimental observations may not address the right problem, or may generalize incorrectly.
- Theorists may provide oversimplified models from which not much can be learned.
5. Good Models
- Model
- A collection of attributes and a set of rules for attribute interaction.
- Accuracy
- Model analysis yields output similar to the object being modeled.
- Tractability
- The degree to which analysis is possible.
- Models that are both accurate and tractable are difficult to define.
6. Two Key Problems in Modeling
- Feasibility
- What classes of problems can be solved?
- Avoid wasted effort on unsolvable problems.
- Cost
- How expensive are solutions to a solvable problem?
- E.g., avoid protocols that are expensive or slow.
7. Coordination Problem
- At publication time (1993), education typically stressed single-processor computation and algorithm analysis.
- This does not match the goals of distributed systems.
- The coordination problem is a simple example of where intuition can go wrong.
8. Coordination Problem (Continued)
- Problem: Two processors communicate with one another. Neither processor can fail, but the channel can. Devise a protocol in which one of two actions is possible, but both processors take the same action and neither takes both actions.
- Proof is by infinite descent: there must be a shortest handshake in a protocol that solves the coordination problem, but the final message of that handshake cannot be acknowledged as received. The sender's action therefore cannot depend on its delivery, so the message is useless, yielding a shorter handshake, and a contradiction is formed.
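The key step of the descent argument can be sketched as code: the sender of the final message has exactly the same view of the run whether or not that message is delivered, so its action cannot depend on that delivery. (A minimal illustrative sketch; the function name and message numbering are mine, not from the chapter.)

```python
def sender_view(k: int, last_delivered: bool) -> tuple:
    """Messages received by the sender of message k in a k-message handshake.

    Messages alternate between the two processors, so the sender of message k
    has received messages k-1, k-3, ... addressed to it. The last_delivered
    flag is deliberately unused: message k is one the sender transmits, not
    receives, so its view is identical in both runs.
    """
    return tuple(range(k - 1, 0, -2))

# The sender of the final message cannot distinguish the two runs, so its
# action cannot depend on that delivery; the message is useless and the
# handshake can be shortened, contradicting minimality.
assert sender_view(3, last_delivered=True) == sender_view(3, last_delivered=False)
```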
9. Coordination Problem (Continued)
- This problem had a simple model underlying a simple problem that most would think is feasible.
- All protocols between two processors are equivalent to a series of messages.
- Actions taken by a process depend only on the sequence of messages it has received.
- The point is that modeling at the right granularity can give rise to analysis that yields nonintuitive results.
- The model can now be changed and reanalyzed.
10. Synchronous versus Asynchronous Systems
- Asynchronous
- No assumptions are made about timing.
- Synchronous
- Relative speeds are bounded so that a specific time ordering is induced.
- Every system is an asynchronous system; simply make no timing assumptions about it.
- Therefore, models for asynchronous problems are suitable for synchronous systems (but the converse will likely not be true).
11. Synchronous versus Asynchronous Systems: Election Protocols
- Asserting that a system is synchronous can make solutions less costly, but at the cost of flexibility.
- An election protocol is an example of the tradeoffs in this model.
- Definition: All processors have a unique id and need to elect a leader. All processors start simultaneously and use broadcasts.
- The goal is to elect a leader among the processors.
12. Election Protocols (Cont.)
- Asynchronous Solution
- Each process broadcasts its identity (user id).
- Each process selects the process with the lowest user id.
- Synchronous Solution
- Each process waits an amount of time proportional to its user id before broadcasting.
- Each process selects the leader based on the first message received.
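The two solutions above can be sketched in Python (an illustrative sketch: the names `elect_async` and `elect_sync` are mine, and the synchronous timing is simulated deterministically rather than with real clocks):

```python
def elect_async(ids):
    """Asynchronous solution: every process broadcasts its id, and every
    process independently selects the lowest id, so n broadcasts are needed."""
    return min(ids)

def elect_sync(ids, delay_unit=1.0):
    """Synchronous solution: each process waits delay_unit * id before
    broadcasting; whoever broadcasts first becomes leader and the rest stay
    silent, so only one message is ever sent."""
    broadcast_times = {i: i * delay_unit for i in ids}
    return min(broadcast_times, key=broadcast_times.get)

# Both elect the process with the lowest id, but at different message costs.
assert elect_async([7, 3, 9]) == elect_sync([7, 3, 9]) == 3
```

The synchronous variant buys its lower message cost with a timing assumption: the bound on relative speeds is what guarantees the first broadcast arrives before anyone else's timer expires.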
13. Election Protocols
- Again, by adding assumptions to the model of the system, a more efficient solution is obtained.
- The model also avoids the case where the channel fails intermittently, which keeps the solution less costly.
14. Failure Models
- A system is t-fault tolerant if it can satisfy its specification provided that at most t components are faulty.
- This model contradicted typical system analyses.
- Analyses were typically statistical in nature, such as Mean Time Between Failures (MTBF).
- By combining the t-fault tolerant model with probabilities of component failure, the same information can be derived.
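That last bullet can be illustrated: given a per-component failure probability p, a t-fault tolerant system of n components survives exactly when at most t components fail, which recovers the kind of statistical figure an MTBF-style analysis produces. (A sketch under an independence assumption the slides do not spell out; the function name is mine.)

```python
from math import comb

def survival_probability(n, t, p):
    """Probability that at most t of n components fail, assuming each
    component fails independently with probability p over the interval."""
    return sum(comb(n, k) * (p ** k) * ((1 - p) ** (n - k))
               for k in range(t + 1))

# A 1-fault tolerant triple with 1% component failure probability:
# it survives unless two or more components fail.
print(round(survival_probability(3, 1, 0.01), 6))  # 0.999702
```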
15. Failure Models (Cont.)
- The t-fault tolerant model is good in the sense that it can be used to predict the same information as empirical observation.
- However, attributing failures to components can be tricky.
- In networking, the sender, receiver, or channel can be at fault.
- Cost will depend heavily on what is being replicated so that faults can be tolerated. (Claim: failures can only be masked by using replication.)
16. Failure Models
- There are four common failure models, ordered from least disruptive to most disruptive:
- Failstop: a processor fails and stays in that state. A failstop failure can be detected by other processors.
- Crash: the same as failstop, but the failure may not be detectable.
- Message Loss: a processor fails by failing to receive and/or send messages.
- Byzantine: a processor fails by exhibiting arbitrary behavior. This is the most general model, subsuming the others.
17. Failure Models (Cont.)
- Why use these general models instead of defining failures based on the component (e.g., radiation-induced bit errors in memory)?
- Analyzing all failures of a component and their interactions will likely be infeasible.
- It is a matter of taste in abstractions.
- Sending/receiving messages is more general than what may be irrelevant behavior in a component.
18. Fault Tolerance and Distributed Systems
- Coexistence of the two is necessary.
- More components mean a higher probability that some single component fails in a distributed system.
- Fault tolerance likewise depends on techniques used in distributed systems (e.g., replication, physical isolation of resources).
19. Fault Tolerance and Distributed Systems (Cont.)
- Replication is necessary in distributed systems.
- Replication in space
- Components are physically and electrically separated.
- Replication in time
- A device repeats the same computation and compares results.
- Only valid for transient failures.
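Replication in time can be sketched as rerunning a computation and comparing the results (an illustrative sketch with names of my choosing; note it masks only transient faults, since a permanent fault corrupts every rerun identically):

```python
def run_replicated_in_time(compute, x, runs=2):
    """Repeat the same computation and compare; disagreement signals a
    transient fault. A permanent fault fools this scheme because every
    run fails in the same way."""
    results = [compute(x) for _ in range(runs)]
    if any(r != results[0] for r in results):
        raise RuntimeError("transient fault detected: results disagree")
    return results[0]

assert run_replicated_in_time(lambda v: v * v, 6) == 36
```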
20. What Happens When Failures Are Detected in Replication?
- A big tradeoff in cost and flexibility occurs based on the failure model assumed.
- Byzantine failures
- For a majority voting scheme, t faulty components must be outvoted by t+1 correct ones. So a t-fault tolerant system requires 2t+1 components. (This is similar to arguments in coding theory.)
- Failstop model
- Each processor can be detected as having failed. So if t components have stopped, only one component needs to have not failed. So t-fault tolerant systems require t+1 components.
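The two replica-count bounds can be written out directly (an illustrative sketch; the function names are mine, not from the chapter):

```python
from collections import Counter

def replicas_needed(t, byzantine):
    """t-fault tolerance needs 2t+1 replicas under Byzantine failures (t
    faulty votes must be outvoted by t+1 correct ones) but only t+1 under
    failstop (failures are detectable, so one surviving replica suffices)."""
    return 2 * t + 1 if byzantine else t + 1

def majority_vote(outputs):
    """Mask up to t Byzantine replicas among 2t+1 by taking the value
    reported by a strict majority."""
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise RuntimeError("no strict majority")
    return value

assert replicas_needed(1, byzantine=True) == 3
assert majority_vote([42, 42, 7]) == 42  # one Byzantine replica outvoted
```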
21. Which Model When?
- Key attributes of a problem must be known in order to find the dimensions of the problem.
- Programs can treat a model as an interface definition or specification.
- Critical applications may assume a Byzantine failure model.
- The failure model can usually be relaxed.
- When a model doesn't fit, a program may be set up to induce the desired failure mode.
- E.g., forcing a failstop when sanity tests on components fail, rather than waiting for Byzantine faults.
22. Models as Limiting Cases
- Models should reflect the bounds of real systems.
- Then there aren't component failure cases that will almost never happen, and there aren't very basic faults that are excluded.
- Cost and feasibility are then easier to derive.
- The model is more inclusive of the system's components.
23. Questions and Discussion
- The text uses the term "processors," but they are not necessarily interchangeable with processes.
- A processor failure is equivalent to an entire node failing.
- Failure models do not cover all functionality.
- It may be within specification to have a certain amount of expected downtime even when services are functionally duplicated. This is true in some fields, such as telecommunications and networking.
- Replication may only be necessary for continuous operation.