1
What Good Are Models and What Are Models Good For?
  • From Distributed Systems, Chapter 2
  • Fred Schneider
  • Edited by Sape Mullender
  • Presentation by Scott McManus
  • February 14, 2007

2
Overview
  • Distributed systems are tough to design and
    understand.
  • Goal is to develop intuitions for constructing
    them.
  • Concepts and goals of modeling are defined.
  • Poor intuitions for even simple distributed
    systems exist, and an example is discussed.
  • Models for distributed systems' attributes are
    given.
  • Synchronous versus Asynchronous Systems
  • Failure Modes

3
Two Traditional Approaches
  • Experimental Observation
  • Build and observe, gather experience, and build
    similar things.
  • This doesn't necessarily explain why something
    works.
  • Modeling and Analysis
  • Simplify and postulate rules for a model.
  • Analyze the model and infer characteristics.

4
Tension Between Two Approaches
  • Similar to theory versus practice argument.
  • Experimental observations may not be addressing
    the right problem or incorrectly generalizing.
  • Theorists may provide oversimplified models, and
    not much can be learned from them.

5
Good Models
  • Model
  • A collection of attributes plus a set of rules for
    attribute interaction.
  • Accuracy
  • Model analysis yields output similar to object
    being modeled.
  • Tractability
  • Degree to which analysis is possible.
  • Accurate and tractable models are difficult to
    define.

6
Two Key Problems in Modeling
  • Feasibility
  • What classes of problems can be solved?
  • Avoid wasted effort on unsolvable problems.
  • Cost
  • How expensive are solutions to a solvable
    problem?
  • E.g., Avoid protocols that are expensive or slow.

7
Coordination Problem
  • At publication time (1993), education typically
    stressed single-processor computation and
    algorithm analysis.
  • This does not match up with goals for distributed
    systems.
  • The coordination problem is a simple example of
    where intuition can go wrong.

8
Coordination Problem (Continued)
  • Problem: Two processors communicate with one
    another. Neither one can fail, but the channel
    can fail. Devise a protocol in which one of two
    actions is possible, both processors take the
    same action, and neither takes both actions.
  • Proof is by infinite descent: there must be a
    shortest protocol that solves the Coordination
    Problem, but its final message cannot be
    acknowledged, so neither processor's action can
    depend on it. Removing it yields a shorter
    protocol, which is a contradiction. (A small
    illustration of this last-message dependence
    follows below.)
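A minimal sketch of the last-message dependence the proof exploits, assuming
a fixed two-message handshake over a lossy channel (the protocol and names
below are hypothetical illustrations, not the chapter's formalism):

# Hypothetical illustration: a two-message handshake over a lossy channel.
# If the final message is dropped, the two processors take different
# actions, which is exactly the inconsistency the proof exploits.

def run_handshake(final_message_lost: bool):
    # Message 1: P1 -> P2, "let's take action A" (assume it arrives).
    p2_received_proposal = True

    # Message 2: P2 -> P1, "acknowledged" (this one may be lost).
    p1_received_ack = not final_message_lost

    # Each processor decides based only on the messages it has received.
    p2_action = "A" if p2_received_proposal else "none"
    p1_action = "A" if p1_received_ack else "none"  # P1 will not act without the ack
    return p1_action, p2_action

print(run_handshake(final_message_lost=False))  # ('A', 'A'): agreement
print(run_handshake(final_message_lost=True))   # ('none', 'A'): disagreement

Adding more acknowledgments does not help: the same argument applies to
whichever message happens to be last.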

9
Coordination Problem (Continued)
  • A simple model underlies a problem that most
    would assume is feasible, yet it is not.
  • All protocols between two processors are
    equivalent to a series of messages.
  • Actions taken by a process depend only on the
    sequence of messages it has received.
  • The point is that modeling at the right
    granularity can give rise to analysis that may
    yield nonintuitive results.
  • Model can now be changed and reanalyzed.

10
Synchronous versus Asynchronous Systems
  • Asynchronous
  • No assumptions are made about timing.
  • Synchronous
  • Relative speeds bounded so that a specific time
    ordering is induced.
  • Every system is asynchronous if one simply makes
    no timing assumptions about it.
  • Therefore, models for asynchronous problems are
    suitable for synchronous problems (but the
    converse will likely not be true).

11
Synchronous versus Asynchronous Systems Election
Protocols
  • Asserting a system is synchronous can make
    solutions less costly, but at the cost of
    flexibility.
  • An election protocol is an example of the
    tradeoffs in this model.
  • Definition: All processors have a unique id and
    need to elect a leader. All processors start
    simultaneously and can use broadcasts.
  • The goal is to elect a leader among the
    processors.

12
Election Protocols (Cont.)
  • Asynchronous Solution
  • Each process broadcasts its unique id.
  • Each process selects the process with the lowest
    id as leader.
  • Synchronous Solution
  • Each process waits an amount of time proportional
    to its id before broadcasting.
  • Each process selects the leader based on the
    first message received.
  • (A sketch of both solutions follows below.)
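A minimal sketch of the two solutions, assuming a fixed set of unique ids,
a lossless broadcast, and that every broadcast is eventually heard; the
function names and the delta parameter are illustrative assumptions, not
from the chapter:

# Hypothetical sketch of the two election protocols described above.

def elect_asynchronous(ids):
    # Every processor broadcasts its id; after hearing from all of them,
    # each processor picks the smallest id as leader. No timing assumptions.
    received = list(ids)  # stand-in for "all broadcasts have been received"
    return min(received)

def elect_synchronous(ids, delta=1.0):
    # Each processor waits a time proportional to its id before broadcasting,
    # so the smallest id broadcasts first and everyone adopts that message.
    first_to_broadcast = min(ids, key=lambda i: i * delta)
    return first_to_broadcast

ids = [7, 3, 12]
print(elect_asynchronous(ids))  # 3
print(elect_synchronous(ids))   # 3

The cost difference is in messages and waiting: the asynchronous protocol
must hear from every processor, while the synchronous one can stop after
the single earliest broadcast, at the price of the timing assumption.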

13
Election Protocols
  • Again, by strengthening the assumptions the model
    makes about the system, a more efficient solution
    is obtained.
  • The model also rules out the case where the
    channel fails intermittently, which keeps the
    solution less costly.

14
Failure Models
  • A system is t-fault tolerant if it can satisfy
    its specification provided at most t components
    are faulty.
  • This model ran counter to typical system analyses
    of the time.
  • Those analyses were typically statistical in
    nature, e.g., the Mean Time Between Failures
    (MTBF).
  • Combining the t-fault tolerant model with
    per-component failure probabilities yields the
    same information (see the sketch below).
15
Failure Models (Cont.)
  • The t-fault tolerant model is good in the sense
    that it can be used to predict the same
    information as empirical observation.
  • However, attributing failures to components can
    be tricky.
  • In networking, the sender, receiver, or channel
    can be at fault.
  • Cost will depend heavily on what must be
    replicated so that faults can be tolerated.
    (Claim: failures can only be tolerated by using
    replication.)

16
Failure Models
  • There are four common failure models, ordered
    from least disruptive to most disruptive.
  • Failstop: A processor fails and stays in that
    state. A failstop failure can be detected by
    other processors.
  • Crash: The same as failstop, but the failure may
    not be detectable by other processors.
  • Message loss: A processor fails by failing to
    receive and/or send some messages.
  • Byzantine: The processor fails by exhibiting
    arbitrary behavior. In practice, this is the most
    likely scenario.

17
Failure Models (Cont.)
  • Why use these general models instead of defining
    failures based on the component (e.g.,
    radiation-induced bit errors in memory)?
  • Analyzing all failures of a component and their
    interactions will likely be infeasible.
  • It is partly a matter of taste in abstractions.
  • Describing failures in terms of sending and
    receiving messages is more general and hides
    component behavior that is irrelevant to the
    analysis.

18
Fault Tolerance and Distributed Systems
  • Coexistence of the two is necessary
  • More components means a higher probability of
    failure in a single component in a distributed
    system.
  • Fault tolerance is likewise dependent on
    techniques used in distributed systems (e.g.,
    replication, physical isolation of resources).

19
Fault Tolerance and Distributed Systems (Cont.)
  • Replication is necessary in distributed systems.
  • Replication in space
  • Components are physically and electrically
    separated.
  • Replication in time
  • A device repeats the same computation and
    compares the results (see the sketch below).
  • Only valid for transient failures.
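A minimal sketch of replication in time, assuming the computation is
deterministic when fault-free (the names are hypothetical):

# Hypothetical sketch: repeat the same computation and compare results.
# A transient fault shows up as a disagreement between repetitions; a
# permanent fault corrupts every repetition identically and is not caught.

def replicate_in_time(compute, repetitions=3):
    results = [compute() for _ in range(repetitions)]
    if all(r == results[0] for r in results):
        return results[0]  # all repetitions agree
    raise RuntimeError("transient fault detected: repetitions disagree")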

20
What happens when failures are detected in
replication?
  • A big tradeoff in cost and flexibility occurs
    based on the failure model assumed.
  • Byzantine failures
  • For a majority voting scheme, t faulty outputs
    must be outvoted by t + 1 correct components, so
    a t-fault tolerant system requires 2t + 1
    components. (This is similar to arguments in
    coding theory.)
  • Failstop model
  • Each processor can be detected as having failed,
    so even if t components have stopped, only one
    needs to have survived; t-fault tolerant systems
    therefore require t + 1 components. (A voting
    sketch follows below.)
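A minimal sketch of the majority-voting argument, assuming replica outputs
can simply be compared for equality (the names are hypothetical):

# Hypothetical sketch of majority voting over replica outputs.
# With n = 2t + 1 replicas, up to t arbitrary (Byzantine) outputs are
# always outvoted by the t + 1 correct ones.
from collections import Counter

def vote(outputs):
    value, count = Counter(outputs).most_common(1)[0]
    if count > len(outputs) // 2:  # a strict majority is required
        return value
    raise RuntimeError("no majority: too many faulty replicas")

t = 1
correct = ["ok"] * (t + 1)     # the t + 1 correct replicas
faulty = ["garbage"] * t       # up to t Byzantine replicas
print(vote(correct + faulty))  # "ok": the majority wins

Under the failstop model no vote is needed: failed replicas are detectable
and can be ignored, so t + 1 replicas suffice.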

21
Which Model When?
  • The key attributes of a problem must be known in
    order to identify the dimensions of the problem.
  • Programs can treat a model as an interface
    definition or specification.
  • Critical applications may assume a Byzantine
    failure model.
  • The failure model can usually be relaxed.
  • When the model doesn't fit, a program may be set
    up to induce the desired kind of failure mode.
  • E.g., forcing a failstop when sanity tests on
    components fail rather than waiting for Byzantine
    faults (see the sketch below).
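A minimal sketch of that idea, assuming a sanity check can be run on a
component's output (the wrapper and all names are hypothetical):

# Hypothetical sketch: force a fail-stop by halting a component as soon as
# a sanity check on its output fails, instead of letting arbitrary
# (Byzantine) behavior propagate further.

class FailStopWrapper:
    def __init__(self, component, sanity_check):
        self.component = component
        self.sanity_check = sanity_check
        self.stopped = False  # other processors can observe this flag

    def call(self, *args):
        if self.stopped:
            raise RuntimeError("component has fail-stopped")
        result = self.component(*args)
        if not self.sanity_check(result):
            self.stopped = True  # halt permanently on the first violation
            raise RuntimeError("sanity check failed: forcing fail-stop")
        return result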

22
Models as Limiting Cases
  • Models should respect the bounds of real systems.
  • That way, the model neither includes component
    failure cases that will almost never happen nor
    excludes very basic faults.
  • Cost and feasibility are then easier to derive.
  • The model is more inclusive of the system's
    components.

23
Questions and Discussion
  • The text uses the term "processors", but these
    are not necessarily interchangeable with
    processes.
  • Processor failure is equivalent to an entire node
    failing.
  • Failure models do not cover all functionality.
  • It may be within specifications to have a certain
    amount of expected downtime even when services
    are functionally duplicated. This is true in some
    fields, such as telecommunications and
    networking.
  • Replication may only be necessary for continuous
    operation.