Title: Consensus
1. SECOND PART: Algorithms for UNRELIABLE Distributed Systems (The consensus problem)
2. Failures in Distributed Systems
- Link failure: a link fails and remains inactive forever; the network may get disconnected
- Processor crash failure: at some point, a processor stops taking steps (forever); also in this case, the network may get disconnected
- Processor Byzantine failure: during the execution, a processor changes state arbitrarily and sends messages with arbitrary content (the name dates back to the untrustworthy Byzantine generals of the Byzantine Empire, IV-XV century A.D.); also in this case, the network may get disconnected
3. Normal operating
[Figure: non-faulty links and nodes exchanging messages a, b, c]
4. Link Failures
[Figure: nodes exchanging messages a, b, c over a faulty link]
Some of the messages are not delivered
5. Processor crash failure
[Figure: a faulty processor among nodes exchanging messages a, b]
Some of the messages are not sent
6. [Figure: timeline of rounds 1 to 5, with a failure marked]
After failure the processor disappears from the network
7. Processor Byzantine failure
[Figure: a faulty processor sending altered messages in place of a]
The processor sends arbitrary messages (i.e., they could be either correct or not), and some messages may not be sent at all
8. [Figure: timeline of rounds 1 to 6, with two failures marked]
After failure the processor may continue functioning in the network
9. Consensus Problem
- Every processor has an input x ∈ X (this makes the system non-anonymous), and must decide an output y ∈ Y. Design an algorithm with the following properties:
- Termination: eventually, every non-faulty processor decides on a value y ∈ Y.
- Agreement: all decisions by non-faulty processors must be the same.
- Validity: if all inputs are the same, then the decision of a non-faulty processor must equal the common input (this avoids trivial solutions).
- In the following, we assume that X = Y = N (the natural numbers)
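The three properties can be checked mechanically on a finished execution. The following is a minimal sketch, not part of the original slides: the function name `check_consensus` and the dictionary-based representation of inputs and decisions are assumptions made for illustration.

```python
def check_consensus(inputs, decisions, faulty=frozenset()):
    """Report which consensus properties hold for one finished execution.

    inputs    -- dict: processor id -> input value
    decisions -- dict: processor id -> decided value (missing/None = undecided)
    faulty    -- ids of faulty processors (their decisions are ignored)
    """
    good = [p for p in inputs if p not in faulty]
    decided = [decisions.get(p) for p in good]
    # Termination: every non-faulty processor has decided.
    termination = all(d is not None for d in decided)
    # Agreement: the non-faulty processors decided at most one distinct value.
    agreement = len({d for d in decided if d is not None}) <= 1
    # Validity only constrains executions in which all inputs coincide.
    common = set(inputs.values())
    validity = len(common) != 1 or all(
        d in common for d in decided if d is not None
    )
    return {"termination": termination,
            "agreement": agreement,
            "validity": validity}
```

For instance, an execution where both inputs are 1 but both processors decide 2 satisfies agreement and violates validity.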
10. Agreement
[Figure: Start and Finish configurations of the processors]
Everybody has an initial value
All non-faulty must decide the same value
11. Validity
[Figure: Start and Finish configurations in which all values are equal]
If everybody starts with the same value, then non-faulty must decide that value
12. Negative result for link failures
- Although this is the simplest fault an MPS may face, it is already enough to prevent consensus (just consider that a link failure could disconnect the network!)
- Thus, in general (i.e., there exists at least one instance of the problem for which this holds) it is impossible to reach consensus in case of link failures, even in the synchronous case, and even if one only wants to tolerate a single link failure
- To illustrate this negative result, we present the very famous problem of the 2 generals
13. Consensus under link failures: the 2 generals problem
- There are two generals of the same army who have encamped a short distance apart.
- Their objective is to capture a hill, which is possible only if they attack simultaneously.
- If only one general attacks, he will be defeated.
- The two generals can only communicate (synchronously) by sending messengers, who could be captured, though.
- Is it possible for them to attack simultaneously?
- ⇒ More formally, we are talking about consensus in the following MPS
14. The 2 generals problem
[Figure: generals A and B joined by a single link; one sends the message "Let's attack"]
15. Impossibility of consensus under link failures
- First of all, notice that messages must be exchanged to reach consensus (the generals might have different opinions in mind!)
- Assume the problem can be solved, and let Π be the shortest protocol (i.e., the one with the minimum number of messages) for a given input configuration.
- Suppose now that the last message in Π does not reach its destination. Since Π is correct regardless of link failures, consensus must be reached in any case. This means that the last message was useless, and so Π could not be the shortest!
16. Negative result for processor failures in asynchronous systems
- It is easy to see that a processor failure (both crash and Byzantine) is at least as difficult as a link failure, so the negative result we have just given also holds here
- But even worse, it is not hard to prove that in the asynchronous case it is impossible to reach consensus for any system topology, already for a single crash failure!
- Notice that for the synchronous case no such general negative result can be given ⇒ in search of some positive result, we focus on synchronous systems with specific topologies
17. Positive results: assumptions on the communication model for crash and Byzantine failures
- Complete undirected graph (this implies non-uniformity)
- Synchronous network, synchronous start: w.l.o.g., we assume that messages are sent, delivered and read in the very same round
18. Overview of Consensus Results
- Let f be the maximum number of faulty processors

                              Crash failures        Byzantine failures    Byzantine failures
                                                    (King algorithm)      (exponential tree)
number of rounds              f+1                   2(f+1)                f+1
total number of processors    n ≥ f+1               n ≥ 4f+1              n ≥ 3f+1
message size                  (Pseudo-)Polynomial   (Pseudo-)Polynomial   Exponential
19. A simple algorithm for fault-free consensus
Each processor:
- Broadcasts its input to all processors
- Reads all the incoming messages
- Decides on the minimum
(only one round is needed, since the graph is complete)
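The three steps above can be sketched as follows. This is only an illustration under the slide's assumptions (complete graph, one synchronous round, no faults); the function name `fault_free_consensus` and the dictionary representation are hypothetical.

```python
def fault_free_consensus(inputs):
    """One-round minimum algorithm on a complete graph with no faults.

    inputs -- dict: processor id -> input value
    Returns a dict: processor id -> decided value.
    """
    # Broadcast: the graph is complete, so every processor receives the
    # input of every processor (including its own).
    received = {p: list(inputs.values()) for p in inputs}
    # Decide: each processor takes the minimum of what it received.
    return {p: min(values) for p, values in received.items()}
```

Since every processor computes the minimum over exactly the same set of values, all decisions coincide.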
20. Start
[Figure: five processors with inputs 0, 1, 2, 3, 4]
21. Broadcast values
[Figure: every processor receives the values 0,1,2,3,4]
22. Decide on minimum
[Figure: every processor holds 0,1,2,3,4 and picks 0]
23. Finish
[Figure: all five processors have decided 0]
24. This algorithm satisfies the agreement condition
[Figure: Start and Finish configurations]
All the processors decide the minimum over exactly the same set of values
25. This algorithm satisfies the validity condition
[Figure: Start and Finish configurations]
If everybody starts with the same initial value, everybody decides on that value (minimum)
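26. Consensus with Crash Failures
The simple algorithm doesn't work
[Figure: Start; processors with inputs 0, 1, 2, 3, 4; the processor with input 0 fails and its value reaches only some of the others]
The failed processor doesn't broadcast its value to all processors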
27. Broadcasted values
[Figure: two processors receive 0,1,2,3,4; the other two receive only 1,2,3,4]
28. Decide on minimum
[Figure: the processors that received 0,1,2,3,4 pick 0; those that received 1,2,3,4 pick 1]
29. Finish
[Figure: final decisions 0, 1, 0, 1]
No agreement!!!
30. If an algorithm solves consensus for f faulty (crashing) processors, we say it is an f-resilient consensus algorithm
31. An f-resilient algorithm
Each processor:
Round 1: Broadcast to all (including myself) my value; read all the incoming values
Round 2 to round f+1: Broadcast to all (including myself) any newly received values; read all the incoming values
End of round f+1: Decide on the minimum value ever received.
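A small simulation can make the role of the f+1 rounds concrete. The sketch below is an illustration only: the function `flood_min` and the way a crash is described (the round in which a processor crashes, plus the only recipients it still reaches in that round) are assumptions, not part of the slides.

```python
def flood_min(inputs, rounds, crashes=None):
    """Simulate the flooding algorithm for `rounds` synchronous rounds.

    inputs  -- dict: processor id -> input value
    crashes -- dict: processor id -> (round, recipients); the processor
               crashes in that round after delivering only to `recipients`
    Returns the decisions of the processors that never crashed.
    """
    crashes = crashes or {}
    known = {p: {v} for p, v in inputs.items()}  # values ever received
    fresh = {p: {v} for p, v in inputs.items()}  # values not yet forwarded
    alive = set(inputs)
    for r in range(1, rounds + 1):
        inbox = {p: set() for p in inputs}
        for p in list(alive):
            crash_round, recipients = crashes.get(p, (None, None))
            # A processor crashing in this round reaches only `recipients`.
            targets = recipients if crash_round == r else inputs.keys()
            for q in targets:
                inbox[q] |= fresh[p]
            if crash_round == r:
                alive.discard(p)
        for p in alive:
            fresh[p] = inbox[p] - known[p]  # only new values are re-sent
            known[p] |= inbox[p]
    return {p: min(known[p]) for p in alive}
```

With inputs 0..4 and the processor holding 0 crashing in round 1 after reaching only two others, one round leaves the survivors split between 0 and 1, while f+1 = 2 rounds let everybody learn 0.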
32. Example: f=1 failures, f+1=2 rounds needed
Start
[Figure: five processors with inputs 0, 1, 2, 3, 4]
33. Example: f=1 failures, f+1=2 rounds needed
Round 1
[Figure: the processor with input 0 fails; two processors receive 0,1,2,3,4 (0 is a new value for them), the other two receive only 1,2,3,4]
Broadcast all values to everybody
34. Example: f=1 failures, f+1=2 rounds needed
Round 2
[Figure: all four remaining processors now know 0,1,2,3,4]
Broadcast all new values to everybody
35. Example: f=1 failures, f+1=2 rounds needed
Finish
[Figure: all four remaining processors know 0,1,2,3,4 and decide 0]
Decide on minimum value
36. Example: f=2 failures, f+1=3 rounds needed
Start
[Figure: five processors with inputs 0, 1, 2, 3, 4]
37. Example: f=2 failures, f+1=3 rounds needed
Round 1
[Figure: Failure 1: the processor with input 0 fails; only one processor receives 0,1,2,3,4, the others receive 1,2,3,4]
Broadcast all values to everybody
38. Example: f=2 failures, f+1=3 rounds needed
Round 2
[Figure: Failure 2: a second processor fails while forwarding 0; two processors know 0,1,2,3,4, the others still know only 1,2,3,4]
Broadcast new values to everybody
39. Example: f=2 failures, f+1=3 rounds needed
Round 3
[Figure: all remaining processors now know 0,1,2,3,4]
Broadcast new values to everybody
40. Example: f=2 failures, f+1=3 rounds needed
Finish
[Figure: all remaining processors know 0,1,2,3,4 and decide 0]
Decide on the minimum value
41. Since there are f failures and f+1 rounds, there is at least one round with no failed processors
[Figure: timeline of rounds 1 to 6; example with 5 failures and 6 rounds, one round has no failure]
42. Correctness (1/2)
Lemma: In the algorithm, at the end of the round with no failures, all the processors know the same set of values.
Proof: For the sake of contradiction, assume the claim is false. Let x be a value which is known only to a subset of the (non-faulty) processors. When a processor learned x for the first time, it broadcast it to all in the next round. So the only possibility is that it received x right in this round; otherwise all the others would know x as well. But in this round there are no failures, and so x must have been received by all. QED
43. Correctness (2/2)
Agreement: this holds, since at the end of the round with no failure every (non-faulty) processor has the same knowledge, and this doesn't change until the end of the algorithm ⇒ eventually, everybody will decide the same value!
Remark: we don't know the exact position of the failure-free round, so we have to let the algorithm execute for f+1 rounds
Validity: this holds, since the value decided by each processor is some input value (no exogenous values are introduced)
44. Performance of Crash Consensus Algorithm
- Number of processors: n > f
- f+1 rounds
- O(n²k) messages, where k = O(n) is the number of different inputs. Indeed, each node sends O(n) messages containing a given value in X (the size of such a value might not be polynomial in n, by the way!)
45. A Lower Bound
Theorem: Any f-resilient consensus algorithm requires at least f+1 rounds
Proof sketch: Assume by contradiction that f or fewer rounds are enough
Worst case scenario: there is a processor that fails in each round
46. Worst case scenario
[Figure: round 1; p_i1 sends a to p_i2]
Before processor p_i1 fails, it sends its value a to only one processor p_i2
47. Worst case scenario
[Figure: rounds 1 and 2; p_i2 sends a to p_i3]
Before processor p_i2 fails, it sends the value a to only one processor p_i3
48. Worst case scenario
[Figure: rounds 1, 2, 3, ..., f; p_if sends a to p_i(f+1)]
Before processor p_if fails, it sends the value a to only one processor p_i(f+1). Thus, at the end of round f only one processor knows about a
49. Worst case scenario
[Figure: rounds 1, 2, 3, ..., f followed by the decision; p_i(f+1) holds a, the others hold b]
No agreement: processor p_i(f+1) may decide a, and all other processors may decide another value, say b > a ⇒ contradiction, f rounds are not enough. QED
50. Consensus with Byzantine Failures
An f-resilient (to Byzantine failures) consensus algorithm solves consensus for f failed processors
51. Lower bound on number of rounds
Theorem: Any f-resilient consensus algorithm with Byzantine failures requires at least f+1 rounds
Proof: follows from the crash failure lower bound
52. A Consensus Algorithm
The King algorithm
- solves consensus in 2(f+1) rounds with n processors and f failures, where f < n/4
- Assumption: processors have (distinct) ids, and these are in {1,…,n}; this is common knowledge, i.e., processors cannot cheat about their ids (namely, p_i cannot behave as if it were p_j, i ≠ j)
53. The King algorithm
There are f+1 phases; each phase has 2 rounds, used to update in each processor p_i a preferred value v_i. In the beginning, the preferred value is set to the input value. In each phase there is a different king
⇒ There is a king that is non-faulty!
54. The King algorithm
Phase k
Round 1, processor p_i:
- Broadcast to all (including myself) preferred value v_i
- Let a be the majority of received values (including v_i), and set v_i = a (in case of tie pick an arbitrary value)
55. The King algorithm
Phase k
Round 2, king p_k:
Broadcast new preferred value v_k
Round 2, processor p_i:
If v_i did not have a strong majority (more than n/2 + f votes), then set v_i = v_k
56. The King algorithm
End of Phase f+1:
Each processor decides on preferred value
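The phases above can be simulated in a few lines. This is a sketch under stated assumptions, not the slides' own code: the function `king_consensus`, the choice of processors 1..f+1 as the kings of phases 1..f+1, the strict test "more than n/2 + f votes" for a strong majority, and the modelling of a Byzantine processor as an arbitrary function of (phase, round, receiver) are all illustrative choices.

```python
from collections import Counter

def king_consensus(inputs, f, byzantine=None):
    """Simulate the King algorithm for processors with ids 1..n.

    inputs    -- dict: id -> input value (entries of faulty ids are ignored)
    byzantine -- dict: faulty id -> function(phase, round, receiver) -> value
    Returns the final preferred values of the non-faulty processors.
    """
    byzantine = byzantine or {}
    n = len(inputs)
    ids = sorted(inputs)
    pref = {p: inputs[p] for p in ids if p not in byzantine}

    def sent(p, phase, rnd, q):
        # Value that p sends to q: arbitrary if p is faulty.
        return byzantine[p](phase, rnd, q) if p in byzantine else pref[p]

    for phase in range(1, f + 2):
        # Round 1: everybody broadcasts; each processor takes the majority.
        votes = {}
        for q in pref:
            tally = Counter(sent(p, phase, 1, q) for p in ids)
            votes[q] = tally.most_common(1)[0]  # (value, count), ties arbitrary
        # Round 2: the king of this phase broadcasts its new preferred value.
        king = ids[phase - 1]
        for q in list(pref):
            value, count = votes[q]
            if count > n / 2 + f:          # strong majority: keep own value
                pref[q] = value
            elif king in byzantine:        # faulty king: arbitrary value
                pref[q] = sent(king, phase, 2, q)
            else:                          # honest king: its round-1 majority
                pref[q] = votes[king][0]
    return pref
```

In a run with n = 5, f = 1 and a faulty first king that tells each receiver q the value q mod 2, the non-faulty processors still disagree after phase 1 and agree after phase 2, whose king is non-faulty.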
57. Example: 6 processors, 1 fault
[Figure: processors p1 to p6 with their initial preferred values; king 1 and king 2 are marked, and one processor is faulty]
58. Phase 1, Round 1
[Figure: each non-faulty processor receives either 0,2,1,0,0,1 or 1,2,1,0,0,1]
Everybody broadcasts
59. Phase 1, Round 1
Choose the majority
[Figure: each processor picks the majority of the values it received]
No majority is strong (each has 3 votes, and 3 ≤ n/2 + f = 4)
⇒ In round 2, everybody will choose the king's value
60. Phase 1, Round 2
[Figure: king 1 sends a value to every processor]
The king broadcasts
61. Phase 1, Round 2
[Figure: preferred values after adopting the value received from king 1]
Everybody chooses the king's value
62. Phase 2, Round 1
[Figure: each non-faulty processor receives either 0,3,1,0,0,1 or 1,3,1,0,0,1; king 2 is marked]
Everybody broadcasts
63. Phase 2, Round 1
Choose the majority
[Figure: each processor picks the majority of the values it received]
No majority is strong (each has 3 votes, and 3 ≤ n/2 + f = 4)
⇒ In round 2, everybody will choose the king's value
64. Phase 2, Round 2
[Figure: king 2 sends 0 to every processor]
The king broadcasts
65. Phase 2, Round 2
[Figure: every non-faulty processor now prefers 0]
Everybody chooses the king's value
Final decision and agreement!
66. Correctness of the King algorithm
Lemma 1: At the end of a phase φ where the king is non-faulty, every non-faulty processor decides the same value
Proof: Consider the end of round 1 of phase φ. There are two cases:
Case 1: some node has chosen its preferred value with strong majority (> n/2 + f votes)
Case 2: no node has chosen its preferred value with strong majority
67. Case 1: suppose node p_i has chosen its preferred value a with strong majority (> n/2 + f votes)
⇒ Since at most f of these votes come from faulty nodes, more than n/2 non-faulty nodes must have broadcast a at the start of round 1, and then at the end of round 1 every other non-faulty node must have preferred value a (including the king)
68. At the end of round 2:
If a node keeps its own value, then it decides a
If a node gets the value of the king, then it decides a, since the king has decided a
Therefore: every non-faulty node decides a
69. Case 2: no node has chosen its preferred value with strong majority (> n/2 + f votes)
Every non-faulty node will adopt the value of the king, thus all decide on the same value
END of PROOF
70. Lemma 2: Let a be a common value decided by non-faulty processors at the end of a phase φ. Then, a will be preferred until the end.
Proof: After φ, a will always be preferred with strong majority (i.e., > n/2 + f), since there are at least n-f non-faulty processors and n - f > n/2 + f (because n > 4f). Thus, until the end of phase f+1, every non-faulty processor decides a.
QED
71. Agreement in the King algorithm
- Follows from Lemmas 1 and 2, observing that since there are f+1 phases and at most f failures, there is at least one phase in which the king is non-faulty (and thus, from Lemma 1, at the end of that phase all non-faulty processors decide the same value, and from Lemma 2 this decision will be maintained until the end).
72. Validity in the King algorithm
Follows from the fact that if all (non-faulty) processors have a as input, then in round 1 of phase 1 each non-faulty processor will receive a with strong majority, since n - f > n/2 + f, and so in round 2 of phase 1 this will be the preferred value of non-faulty processors, independently of the king's broadcast value. From Lemma 2, this will be maintained until the end, and will be exactly the decided output!
QED
73. Performance of King Algorithm
- Number of processors: n > 4f
- 2(f+1) rounds
- Θ(n²f) messages. Indeed, each non-faulty node sends Θ(n) messages in each phase, each containing a given preference value (the size of such a value might not be polynomial in n, by the way!)
74. An Impossibility Result
Theorem: There is no f-resilient algorithm for n processors when f ≥ n/3
Proof: First we prove the 3 processors case, and then the general case
75. The 3 processes case
Lemma: There is no 1-resilient algorithm for 3 processors
Proof: Assume by contradiction that there is a 1-resilient algorithm for 3 processors
[Figure: processors A, B, C; each runs its local algorithm on an initial value, written A(0), B(1), C(0)]
76. A first execution
[Figure: A(1) and B(1) are non-faulty; C is faulty and behaves as C(1) on one link and as C(0) on the other]
77. Decision value
[Figure: A and B decide 1; C is faulty]
(validity condition)
78. A second execution
[Figure: B(0) and C(0) are non-faulty; A is faulty and behaves as A(0) on one link and as A(1) on the other]
79. [Figure: B and C decide 0; A is faulty]
(validity condition)
80. A third execution
[Figure: A(1) and C(0) are non-faulty; B is faulty and behaves as B(1) on one link and as B(0) on the other]
81. [Figure: the third execution shown next to the first and second executions]
82. [Figure: in the third execution the two non-faulty processors decide 1 and 0 respectively]
No agreement!!! Contradiction, since the algorithm was supposed to be 1-resilient
83. Therefore: there is no algorithm that solves consensus for 3 processors in which 1 is Byzantine!
84. The n processors case
Assume by contradiction that there is an f-resilient algorithm A for n processors when f ≥ n/3
We will use algorithm A to solve consensus for 3 processors and 1 failure (contradiction)
85. W.l.o.g. let n = 3f; we partition arbitrarily the processors into 3 sets P0, P1, P2, each containing n/3 processors; then, given a 3-processor system Q = <q0,q1,q2>, we associate each q_i with P_i
Each processor q_i simulates algorithm A on n/3 processors
86. [Figure: one processor of Q fails]
When a processor in Q fails, then at most n/3 original processors are affected
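87. Finish of algorithm A
[Figure: all simulated processors of the non-failed q_i decide k; one processor of Q fails]
algorithm A tolerates n/3 failures
88. Final decision
[Figure: the two non-failed processors of Q decide k; the third fails]
We reached consensus with 1 failure
Impossible!!!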
89. Therefore: there is no f-resilient algorithm for n processors, where f ≥ n/3
90. Exponential Tree Algorithm
- This algorithm uses
  - f+1 rounds (optimal)
  - n ≥ 3f+1 processors (optimal)
  - exponential number of messages (sub-optimal), possibly having a content non-polynomial in n
- Each processor keeps a tree data structure in its local state
- Topologically, the tree has height f+1, and all the leaves are at the same level
- Values are filled top-down in the tree during the f+1 rounds
- At the end of round f+1, the values in the tree are used to compute bottom-up the decision.
91. Local Tree Data Structure
- Assumption: similarly to the King algorithm, processors have (distinct) ids in {0,1,…,n-1}, and we denote by p_i the processor with id i; this is common knowledge, i.e., processors cannot cheat about their ids
- Each tree node is labeled with a sequence of unique processor ids in {0,1,…,n-1}
- Root's label is the empty sequence λ (the root has level 0 and height f+1)
- Root has n children, labeled 0 through n-1
- Child node of the root (level 1) labeled i has n-1 children, labeled i0 through i(n-1) (skipping ii)
- Node at level d>1 labeled i_1 i_2 … i_d has n-d children, labeled i_1 i_2 … i_d 0 through i_1 i_2 … i_d (n-1) (skipping any index i_1, i_2, …, i_d)
- Nodes at level f+1 are leaves and have height 0.
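The labelling rule above says that the nodes of level d are exactly the sequences of d distinct processor ids. A minimal sketch that enumerates them, assuming labels are represented as Python tuples (the function name `tree_labels` is hypothetical):

```python
from itertools import permutations

def tree_labels(n, f):
    """Labels of the local tree, grouped by level 0..f+1.

    A label is a tuple of distinct processor ids in 0..n-1;
    the root is the empty tuple.
    """
    # Sequences of `level` distinct ids are the level-length permutations.
    return [list(permutations(range(n), level)) for level in range(f + 2)]
```

For n = 4 and f = 1 the levels contain 1, 4 and 12 nodes, matching the rule that a node at level d has n-d children.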
92. Example of Local Tree
93. Filling-in the Tree Nodes
- Initially store your input in the root (level 0)
- Round 1:
  - send level 0 of your tree (i.e., your input) to all (including yourself)
  - store the value x received from each p_j in the tree node labeled j (level 1); use a default value if necessary
  - the node labeled j in the tree associated with p_i now contains what p_j told p_i about its input
- Round 2:
  - send level 1 of your tree to all, including yourself (this means, send n messages to each processor)
  - let x be the value received from p_j for the node labeled k ≠ j; then store x in the node labeled kj (level 2); use a default value if necessary
  - the node kj in the tree associated with p_i now contains "p_j told p_i that p_k told p_j that its input was x"
94. Filling-in the Tree Nodes (2)
- …
- Round d>2:
  - send level d-1 of your tree to all, including yourself (this means, send n(n-1)…(n-(d-2)) messages to each processor)
  - let x be the value received from p_j for the node of level d-1 labeled i_1 i_2 … i_{d-1}, with i_1, i_2, …, i_{d-1} ≠ j; then store x in the tree node labeled i_1 i_2 … i_{d-1} j (level d); use a default value (known to all) if necessary
- Continue for f+1 rounds
95. Calculating the Decision
- In round f+1, each processor uses the values in its tree to compute its decision.
- Recursively compute the "resolved" value for the root of the tree, resolve(λ), based on the "resolved" values for the other tree nodes:

resolve(π) = value in tree node labeled π, if it is a leaf
resolve(π) = majority{resolve(π') : π' is a child of π}, otherwise (use a default if tied)
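The recursive definition translates almost literally into code. In this sketch the tree is a dictionary from labels (tuples of distinct ids) to stored values; the function signature, the default value 0 and the reading of "majority" as a strict majority of the children are assumptions made for illustration.

```python
from collections import Counter

def resolve(tree, label, n, height, default=0):
    """Resolved value of node `label` in one processor's filled-in tree.

    tree   -- dict: label (tuple of distinct ids) -> stored value
    n      -- number of processors
    height -- level of the leaves, i.e. f+1
    """
    if len(label) == height:
        return tree[label]  # leaf: the stored value is the resolved value
    # Children extend the label by any id not already in it.
    children = [label + (j,) for j in range(n) if j not in label]
    tally = Counter(resolve(tree, c, n, height, default) for c in children)
    top, count = tally.most_common(1)[0]
    # Strict majority of the children, otherwise the default.
    return top if count > len(children) / 2 else default
```

Only the leaves of `tree` are read here, since the resolved value of an inner node depends on its children alone.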
96. Example of Resolving Values
[Figure: a tree of 0/1 values resolved bottom-up by majority, with a fixed default value used for ties]
97. Resolved Values are consistent
- Lemma 1: If p_i and p_j are non-faulty, then p_i's resolved value for the tree node labeled π = π'j is consistent, i.e., it equals what p_j stores in its node π' during the filling-in of the tree (and so the value stored and the value resolved in π by p_i are the same).
- Proof: By induction on the height of the tree node.
- Basis: height = 0 (leaf level). Then, p_i stores in node π what p_j sends to it for π' in the last round. By definition, this is the resolved value by p_i for π.
98. - Induction: π is not a leaf, i.e., it has height h > 0
- By definition, π has at least n-f children, and since n > 3f, this implies n-f > 2f, i.e., it has a majority of non-faulty children (i.e., children whose label ends with the id of a non-faulty processor)
- Let πk = π'jk be a child of π of height h-1 such that p_k is non-faulty.
- Since p_j is non-faulty, it correctly reports a value v stored in its π' node; thus p_k stores it in its π = π'j node.
- By induction, p_i's resolved value for πk equals the value v that p_k stored in its π node.
- So, all of π's non-faulty children resolve to v in p_i's tree, and thus π resolves to v in p_i's tree.
END of PROOF
99. Inductive step by a picture
[Figure: non-faulty p_j stores v in node π' (height h+1); non-faulty p_k stores v in node π = π'j (height h); in non-faulty p_i's tree, each child πk of height h-1 with p_k non-faulty resolves to v by the inductive hypothesis, and so the majority of π's children resolve to v since n > 3f]
Remark: all the non-faulty processors will resolve the very same value in π, namely v
100. Validity
- Suppose all inputs of (non-faulty) processors are v
- Non-faulty processor p_i decides resolve(λ), which is the majority among resolve(j), 0 ≤ j ≤ n-1, based on p_i's tree.
- Since resolved values are consistent, resolve(j) (at p_i), if p_j is non-faulty, is the value stored at the root of p_j's tree, namely p_j's input value, i.e., v.
- Since there is a majority of non-faulty processors, p_i decides v.
101. Agreement: Common Nodes and Frontiers
- A tree node π is common if all non-faulty processors compute the same value of resolve(π).
- To prove agreement, we have to show that the root is common
- A tree node π has a common frontier if every path from π to a leaf contains at least one common node.
102. - Lemma 2: If π has a common frontier, then π is common.
- Proof: By induction on the height of π
- Basis (π is a leaf): since the only path from π to a leaf consists solely of π, the common node of such a path can only be π, and so π is common
- Induction (π is not a leaf): by contradiction, assume π is not common; then:
  - Every child πk of π has a common frontier (this is not true, in general, if π is common)
  - By the inductive hypothesis, πk is common
  - Then, all non-faulty processors resolve the same value for every child πk, and thus all non-faulty processors resolve the same value for π, i.e., π is common: a contradiction.
END of PROOF
103. Agreement: the root has a common frontier
- There are f+2 nodes on a root-leaf path
- The label of each non-root node on a root-leaf path ends in a distinct processor index: i_1, i_2, …, i_{f+1}
- Since there are at most f faulty processors, at least one such node corresponds to a non-faulty processor
- This node, say i_1 i_2 … i_{k-1} i_k, is common (indeed, by Lemma 1 concerning the consistency of resolved values, in all the trees associated with non-faulty processors, the resolved value equals the value stored by the non-faulty processor p_{i_k} in node i_1 i_2 … i_{k-1})
- Thus the root has a common frontier, and so is common (by the previous lemma)
- Therefore, agreement is guaranteed!
104. Complexity
- Exponential tree algorithm uses
  - n > 3f processors
  - f+1 rounds
  - Exponential number of messages (regardless of message content, which may not be polynomial in n)
- In round 1, each (non-faulty) processor sends n messages ⇒ O(n²) total messages
- In round d ≥ 2, each (non-faulty) processor broadcasts to all the level d-1 of its local tree, which contains n(n-1)(n-2)…(n-(d-2)) nodes ⇒ this means a total of O(n·n·n(n-1)(n-2)…(n-(d-2))) = O(n^(d+1)) messages
- When d = f+1, this number is exponential in n if f is more than a constant relative to n
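The message counts above can be evaluated directly. A small sketch, assuming all n processors are non-faulty and counting one message per tree node per recipient (the function name `messages_in_round` is hypothetical; `math.perm` requires Python 3.8 or later):

```python
from math import perm

def messages_in_round(n, d):
    """Total messages sent in round d when all n processors are non-faulty."""
    # Level d-1 of a local tree has n(n-1)...(n-(d-2)) = perm(n, d-1) nodes;
    # each of the n processors sends one message per node to each of the
    # n processors.
    return n * n * perm(n, d - 1)
```

For n = 4 this gives 16 messages in round 1 and 64 in round 2, in line with the O(n²) and O(n^(d+1)) bounds.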
105. - Exercise 1: Show an execution with n = 4 processors and f = 1 for which the King algorithm fails.
- Exercise 2: Show an execution with n = 3 processors and f = 1 for which the exp-tree algorithm fails.