Title: State Machines
1State Machines
2General Problems
- Consensus
- a particular problem
- algorithms and different formulations
- correctness and time analysis
- Application To Data Replication
- replica coordination
- group membership reintegration
- unique identifiers using logical/real clocks
3The Paxos Parliament And The Consensus Problem
- The Paxos Parliament
  - determine the law of the land, defined by the sequence of decrees passed
  - each legislator had his own ledger with decrees, their unique numbers and their contents
  - entries in ledgers could not be modified or deleted
  - legislators could leave the court for very long periods of time and return later
  - communication only by messengers (who could lose a message or deliver it many times)
- Requirements
  - consistency of the ledgers
  - progress, to ensure that some decree will eventually be passed
- The Synod
  - basically the same problem as with the Parliament, except that only a single decree had to be passed
  - the group of priests/legislators asked to vote for a decree was called the quorum
4
- This can be modelled as a consensus problem
  - Agreement: no two ledgers should contain different decrees with the same number (no conflicts among ledgers)
  - Validity: any decree should be written in the standard form
  - Termination (the progress condition)
- Agreement and validity are guaranteed, and progress is possible, if three conditions are satisfied
  - B1: Each ballot has a unique number.
  - B2: The quorums of any two ballots have at least one priest in common.
  - B3: For every ballot, if any priest in its quorum has voted in an earlier ballot, then the decree of the ballot equals the decree of the latest of those earlier ballots.
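The three conditions can be checked mechanically on a recorded set of ballots. The following is a minimal sketch, not part of the original slides: the `Ballot` record and the function names `b1`, `b2`, `b3` are hypothetical, and each ballot is assumed to store its quorum and the subset of the quorum that actually voted.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class Ballot:
    # Hypothetical record of one ballot (names are illustrative only).
    number: int
    decree: str
    quorum: FrozenSet[str]   # priests asked to vote
    voters: FrozenSet[str]   # priests who actually voted


def b1(ballots: List[Ballot]) -> bool:
    """B1: every ballot has a unique number."""
    numbers = [b.number for b in ballots]
    return len(numbers) == len(set(numbers))


def b2(ballots: List[Ballot]) -> bool:
    """B2: the quorums of any two ballots share at least one priest."""
    return all(bool(a.quorum & b.quorum) for a in ballots for b in ballots)


def b3(ballots: List[Ballot]) -> bool:
    """B3: a ballot's decree equals the decree of the latest earlier
    ballot in which some member of its quorum voted (if there is one)."""
    for b in ballots:
        earlier = [e for e in ballots
                   if e.number < b.number and e.voters & b.quorum]
        if earlier:
            latest = max(earlier, key=lambda e: e.number)
            if latest.decree != b.decree:
                return False
    return True
```

For example, a ballot 2 with quorum {B, C} may only carry a decree other than that of ballot 1 if neither B nor C voted in ballot 1.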
5Assumptions About The System
- partially synchronous distributed system in which processes take actions within l time and messages are delivered within d time
- the system does not necessarily exhibit this normal timing behavior
- each process has a direct communication channel with each other process
- allowed failures
  - timing failures (the bounds l and d can occasionally be exceeded)
  - loss, duplication or reordering of messages
  - process stopping
- some stable storage is needed
- process recovery is considered
6The Synod Algorithm
- (1) Priest p chooses a new ballot number b and sends a NextBallot(b) message to some set of priests.
- (2) When a priest q receives NextBallot(b), he checks the notes in the back of his ledger and determines the vote v with the largest ballot number less than b that he has cast. If no such vote exists, the default value null(q) is used.
  - q sends p a LastVote(b,v) message.
- (3) After p receives a LastVote(b,v) message from every priest in a majority set Q, he initiates a new ballot with number b, quorum Q, and decree d chosen according to B3.
  - p records the new ballot and sends BeginBallot(b,d) to every priest in Q.
- (4) If q receives BeginBallot(b,d) and decides to vote, he records the vote in the back of his ledger and sends Voted(b,q) to p.
7
- (5) If p has received a Voted(b,q) message from every q in Q, he writes d in his ledger and sends Success(d) to all priests.
- (6) After receiving Success(d), a priest enters d in his ledger.
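The six steps can be walked through in a small in-memory simulation. This is a sketch under strong assumptions that the slides do not make: messages are delivered reliably and synchronously, no priest fails, and a priest simply refuses to vote in any ballot numbered below one he has already answered. The names `Priest` and `run_ballot` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set, Tuple


@dataclass
class Priest:
    name: str
    votes: Dict[int, str] = field(default_factory=dict)  # ballot number -> decree
    promised: int = -1            # largest b answered with a LastVote
    ledger: Optional[str] = None  # decree written after Success

    def on_next_ballot(self, b: int) -> Optional[Tuple[int, str]]:
        """Step 2: return the latest vote cast in a ballot below b, or None (null)."""
        self.promised = max(self.promised, b)
        earlier = [n for n in self.votes if n < b]
        if not earlier:
            return None
        n = max(earlier)
        return (n, self.votes[n])

    def on_begin_ballot(self, b: int, d: str) -> bool:
        """Step 4: vote, unless an answer to a larger ballot forbids it."""
        if b < self.promised:
            return False
        self.votes[b] = d
        return True


def run_ballot(decree: str, b: int, priests: List[Priest],
               quorum_names: Set[str]) -> Optional[str]:
    quorum = [p for p in priests if p.name in quorum_names]
    if len(quorum) <= len(priests) // 2:
        raise ValueError("Q must be a majority set")
    # Steps 1-2: NextBallot(b) out, LastVote(b, v) back.
    last_votes = [q.on_next_ballot(b) for q in quorum]
    # Step 3: choose the decree according to B3.
    cast = [v for v in last_votes if v is not None]
    d = max(cast, key=lambda v: v[0])[1] if cast else decree
    # Step 4: BeginBallot(b, d) out, Voted(b, q) back.
    if not all([q.on_begin_ballot(b, d) for q in quorum]):
        return None
    # Steps 5-6: Success(d); every priest enters d in his ledger.
    for p in priests:
        p.ledger = d
    return d
```

With three priests, a second ballot whose quorum overlaps the first is forced by B3 to reuse the decree already voted for, which is what keeps the ledgers consistent.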
8Notes on The Synod Algorithm
- to maintain B1, each ballot has to receive a unique number; this can be done by
  - having each priest note the ballots in his ledger
  - partitioning the set of possible ballot numbers among the priests
  - (later we will talk about different implementations)
- a priest should not cast a vote after receiving BeginBallot(b,d) if he has already sent a LastVote(b',v') message for some other ballot b' with v'.bal < b < b'
- It follows that a priest must record
  - the number of every ballot he has initiated
  - every vote he has cast
  - every LastVote message he has sent
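One way to partition the ballot numbers among the priests is to give priest i the numbers congruent to i modulo the number of priests. This is only an illustrative sketch of that idea; the function names and the modular scheme are assumptions, not something the slides prescribe.

```python
from itertools import count, islice
from typing import Iterator, List


def ballot_numbers(priest_index: int, num_priests: int) -> Iterator[int]:
    """Yield the increasing sequence of ballot numbers owned by one priest."""
    for k in count():
        yield k * num_priests + priest_index


def first_ballots(priest_index: int, num_priests: int, how_many: int) -> List[int]:
    """Convenience helper: the first few numbers owned by a priest."""
    return list(islice(ballot_numbers(priest_index, num_priests), how_many))


def next_ballot_above(seen: int, priest_index: int, num_priests: int) -> int:
    """Smallest number owned by priest_index that is strictly larger than seen."""
    k = (seen - priest_index) // num_priests + 1
    return max(k, 0) * num_priests + priest_index
```

Because the residue classes are disjoint, two priests can never pick the same ballot number, so B1 holds without any communication.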
9Stating The Problem in Terms of State Machines
- a state machine consists of
- state variables (encoded in states)
- commands (which transform the states)
- each command is implemented by a deterministic program and its execution is atomic with respect to other commands
- clock I/O automaton: a specific state machine devised by Lynch and Tuttle for modelling, verifying, and analyzing time-based systems
10Clock I/O Automata
- An I/O time automaton A consists of
- a set of states states(A)
- a nonempty set start(A) of start states
  - a set of actions acts(A), partitioned into input, output, internal, and time-passage actions and specified in the signature of A
  - a transition relation steps(A), a subset of states(A) × acts(A) × states(A)
- No input action can be blocked: for every state s and every input action a, there is a state s' such that (s,a,s') is a step of A.
- A time-passage action ν(t) models the passage of real time t.
- A special real-valued variable Clock is included in each state to model the local clock of the process. Clock does not have to track real time.
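For a finite automaton, the requirement that no input action can be blocked can be tested directly from the transition relation. The sketch below assumes states and actions are plain hashable values and that steps(A) is given as a set of (state, action, state) triples; the function name `input_enabled` is hypothetical.

```python
from typing import Hashable, Set, Tuple

Step = Tuple[Hashable, Hashable, Hashable]


def input_enabled(states: Set[Hashable],
                  input_actions: Set[Hashable],
                  steps: Set[Step]) -> bool:
    """True if every input action has at least one step from every state."""
    return all(
        any((s, a, t) in steps for t in states)
        for s in states
        for a in input_actions
    )
```

Real clock automata usually have infinite state spaces, so this check only illustrates the definition on toy examples.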
11The Synod Algorithm In Terms Of Clock GTA
- The Distributed Setting
  - relation with the Paxos problem
    - priest/process
    - law book/state
    - passing a decree/executing a command
  - complete network of n processes with unique identifiers in a totally ordered set known by all processes
  - clock GT automata are used to model both processes and channels; each automaton has a local clock, and the local clock of a channel is used to detect timing failures
- The Algorithm
  - idea: propose values until one of them is accepted by a majority of processes
  - any process may propose a value by initiating a round for that value; it becomes the leader of that round
  - the leader and the other processes are agents
12
- (1) The leader sends a Collect message to all agents.
- (2) If an agent receives a Collect message and is already committed to a round with a bigger round number, it sends an OldRound message; otherwise, it sends a Last message with its information about rounds previously conducted.
- (3) If the leader receives more than n/2 Last messages, it initiates a new round and sends a Begin message to all agents.
- (4) If an agent receives the Begin message and is committed (as in step 2), it sends an OldRound message; otherwise, it accepts the proposed value and responds with an Accept message.
- (5) If the leader receives more than n/2 Accept messages, then the round is successful and the leader's output value is the value of the round.
- (6) The leader broadcasts the reached decision.
- Notes
  - the set of agents from which Last (Accept) messages are received is called the info-quorum (accepting-quorum)
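The agent side of steps (2) and (4), and the majority test of steps (3) and (5), can be sketched as follows. This is an illustration under assumptions of our own: round numbers are modelled as (counter, process id) pairs compared lexicographically, messages are plain tuples, and the names `Agent`, `on_collect`, `on_begin` and `is_quorum` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Round = Tuple[int, int]  # (counter, process id), compared lexicographically


@dataclass
class Agent:
    commit: Round = (0, 0)       # highest round this agent is committed to
    last_round: Round = (0, 0)   # last round in which a value was accepted
    last_value: Optional[str] = None

    def on_collect(self, r: Round) -> tuple:
        """Step 2: answer a Collect message for round r."""
        if self.commit > r:
            return ("OldRound", r, self.commit)
        self.commit = r
        return ("Last", r, self.last_round, self.last_value)

    def on_begin(self, r: Round, value: str) -> tuple:
        """Step 4: answer a Begin message for round r."""
        if self.commit > r:
            return ("OldRound", r, self.commit)
        self.last_round, self.last_value = r, value
        return ("Accept", r)


def is_quorum(replies: List[tuple], kind: str, n: int) -> bool:
    """Steps 3 and 5: more than n/2 replies of the given kind were received."""
    return sum(1 for m in replies if m[0] == kind) > n / 2
```

The Last reply carries the most recently accepted value, which is what lets a new leader keep a value that an earlier round may already have decided.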
13Implementation(1)
14Implementation(2)
- BPLEADER(i) (clock GTA running the leader at process i)
  - Input: NewRound(i), Leader(i), NotLeader(i), Receive(m)(j,i) with m ∈ {Last, Accept, Success, OldRound}
  - Output: Send(m)(j,i) with m ∈ {Collect, Begin}, BeginCast(i), RndSuccess(v)(i)
  - Internal: Collect(i), GatherLast(i), ...
  - Time-passage: ...
- BPAGENT(i) (clock GTA running an agent at process i)
  - Input: Receive(m)(j,i) with m ∈ {Collect, Begin}
  - Output: Send(m)(j,i) with m ∈ {Last, Accept, OldRound}
  - Internal: LastAccept(i), Accept(i), ...
  - Time-passage: ...
15Correctness Proof
- execution fragment: an alternating sequence of states and actions in which every step is allowed by the automaton
- problem specification: a set of allowable behaviors (a behavior is the sequence of external actions of an execution fragment)
- an automaton A solves the problem if each of its behaviors is contained in the problem specification
- safety properties: must hold in every state of a computation
- liveness properties: specify events that must eventually be performed
16Safety/Liveness Properties
- safety property: in any execution of the system, agreement and validity are guaranteed
- liveness property: under some conditions, termination is guaranteed
- an execution fragment is nice if
  - no loss or duplication takes place
  - at each time-passage action the local clock is incremented by the elapsed real time
  - every process is either stopped or alive
  - a majority of processes are alive
- Theorem: If a nice execution fragment starts in a reachable state, has a unique leader, and lasts for more than 16l + 8nl + 9d time units, then by time 16l + 8nl + 9d the leader has reached a decision.
- Note: proofs are based on invariants.
17Other Results On Time Performance
- If a nice execution fragment starts in a reachable state and lasts more than 24l + 10nl + 13d, then
  - the leader decides by time 21l + 8nl + 11d and at most 8n messages are sent
  - all alive processes decide by time 24l + 10nl + 13d and at most 2n additional messages are sent
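To get a feel for the magnitudes, the bounds can be evaluated for concrete parameters. The sketch below reads each bound as a sum of its three terms (for example 24l + 10nl + 13d), which is an assumption about the intended formulas; the function names are hypothetical.

```python
def unique_leader_bound(n: int, l: float, d: float) -> float:
    """Decision time of a unique leader: 16l + 8nl + 9d."""
    return 16 * l + 8 * n * l + 9 * d


def leader_decision_bound(n: int, l: float, d: float) -> float:
    """Time by which the leader decides: 21l + 8nl + 11d."""
    return 21 * l + 8 * n * l + 11 * d


def all_decide_bound(n: int, l: float, d: float) -> float:
    """Time by which all alive processes decide: 24l + 10nl + 13d."""
    return 24 * l + 10 * n * l + 13 * d


def message_bounds(n: int) -> tuple:
    """(messages until the leader decides, additional messages afterwards)."""
    return 8 * n, 2 * n
```

With n = 5 processes, a step bound l = 1 and a delivery bound d = 10 (arbitrary units), the three bounds evaluate to 146, 171 and 204 time units, with at most 40 plus 10 messages.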
18Generalization Of The Synod Protocol: MULTIPAXOS
- consensus has to be reached on a sequence of values
- for each value we run BASICPAXOS
- the automata used for each instance of the algorithm are like the automata in BASICPAXOS, except that an additional parameter (the index of the proposed value) is present in each action
- concurrency: several leaders may concurrently initiate rounds, and these rounds are carried out concurrently
- several leaders initiating values concurrently is an important difference between the Paxos algorithm and the three-phase commit protocol
19Data Replication
- problem: providing distributed and concurrent access to data objects
- simple implementation: maintain the object at a single process accessed by multiple clients
- some disadvantages
  - does not scale well when the number of clients increases
  - not fault-tolerant
- other solution: data replication
  - servers are replicated; each server runs the same state machine
  - clients make requests, which are redirected to specific servers
20Replica Coordination(1)
- Requirements
  - requests should be processed by state machines one at a time
  - the order of processing should be consistent with potential causality
  - outputs are determined only by the sequence of requests, independent of time or any other activity in the system
- Replica coordination
  - agreement: every nonfaulty state machine replica receives every request
  - order: every nonfaulty state machine replica processes the requests it receives in the same relative order
- issues to be considered: fault-tolerance and reconfiguration
- MULTIPAXOS: a possible solution to the problem
21Replica Coordination(2): MULTIPAXOS For Replica Coordination
- each process in the system maintains a copy of the data object
- a client requests an update operation
  - a process proposes the operation in an instance of MULTIPAXOS
  - after some time, the update operation is the output value of that instance of MULTIPAXOS
  - the leader of the round updates its local copy; because of correctness, all the alive processes update their copies, too
  - a report to the client is given
- a client requests a read operation
  - the request is immediately satisfied based on the local copy
- Note
  - a majority is needed to achieve consistency -> majority voting
  - a unique leader is required to achieve termination -> primary copy replication
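Once each instance has produced its output value, every replica can apply the decided updates in instance order and serve reads from its local copy. The sketch below leaves out the consensus step itself and assumes that updates are (key, value) writes and that instances are indexed from 0; the class `Replica` and its methods are hypothetical.

```python
from typing import Any, Dict, Tuple


class Replica:
    """Local copy of the data object, updated from decided instances."""

    def __init__(self) -> None:
        self.data: Dict[str, Any] = {}
        self.decided: Dict[int, Tuple[str, Any]] = {}  # instance index -> update
        self.next_index = 0                            # next instance to apply

    def learn(self, index: int, update: Tuple[str, Any]) -> None:
        """Record the output of one instance; apply updates strictly in order."""
        self.decided[index] = update
        while self.next_index in self.decided:
            key, value = self.decided[self.next_index]
            self.data[key] = value
            self.next_index += 1

    def read(self, key: str) -> Any:
        """Reads are satisfied immediately from the local copy."""
        return self.data.get(key)
```

Two replicas that learn the same outputs in different arrival orders end up with identical copies, because an update is only applied once all lower-indexed instances have been applied.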
22Replica Coordination(3): Order and Stability
- unique identifiers for requests (total order)
- implementation: a replica next processes the stable request with the smallest unique identifier (a request is stable once no request from a correct client with a lower uid can subsequently be delivered to that state machine)
- using logical clocks to ensure order and stability
  - each process has a local counter
  - the local counter is incremented after each event at that process
  - each message sent is timestamped with the local clock
  - upon receipt of a message, the local clock of the receiver becomes 1 + the maximum of the timestamp and the local clock
  - a uid for each event is given by appending a fixed-length bit string (encoding the process id) to the counter value of the process where the event takes place
- using real clocks to ensure order and stability
  - assumptions
    - the degree of clock synchronization is better than the minimum message delivery time
    - a request r will be received by every correct process no later than time uid(r) + Δ
  - stability test: a request r is stable at a state machine if the local clock reads time t and t > uid(r) + Δ
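The logical-clock rules translate almost directly into code. The following is a minimal sketch; the class name, the 8-bit width of the process id field and the `delta` parameter of the real-clock test are assumptions made for illustration.

```python
class LogicalClock:
    """Per-process counter that yields totally ordered unique identifiers."""

    def __init__(self, pid: int, pid_bits: int = 8) -> None:
        if not 0 <= pid < (1 << pid_bits):
            raise ValueError("pid does not fit in pid_bits")
        self.pid = pid
        self.pid_bits = pid_bits
        self.counter = 0

    def tick(self) -> int:
        """Local event: advance the counter."""
        self.counter += 1
        return self.counter

    def send(self) -> int:
        """Sending is an event; the returned value timestamps the message."""
        return self.tick()

    def receive(self, timestamp: int) -> int:
        """Receipt: clock becomes 1 + max(timestamp, local clock)."""
        self.counter = 1 + max(timestamp, self.counter)
        return self.counter

    def uid(self) -> int:
        """Counter value with the fixed-length process id appended."""
        return (self.counter << self.pid_bits) | self.pid


def stable_by_real_clock(uid_time: float, now: float, delta: float) -> bool:
    """Real-clock stability test, assuming a known delivery bound delta."""
    return now > uid_time + delta
```

Appending the process id in the low-order bits breaks ties between events that carry the same counter value at different processes, so the uids form a total order.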
23Replica Coordination(4): Reconfiguration
- at time t there are P(t) processes, F(t) of them faulty
- necessary condition for correct output
  - P(t) > 2F(t) if Byzantine failures are possible (a majority must be correct)
  - P(t) > F(t) if only fail-stop failures occur
- system described by 3 sets: clients (C), state machines (S), and output devices (O); information about them is stored in state variables and changed by commands
- C and O make periodic queries -> better sharing of processors
- messages sent by S always contain information about future reconfiguration -> permanent communication S <-> C and S <-> O
- requests to change the configuration of the system are made by a failure/recovery detector mechanism
24Replica Coordination(6): Integrating A Repaired Object
- goal: integrate element e at request r
- notation: e[r] is the state a non-faulty system element e should be in after processing all the requests up to r
- if processors are fail-stop and logical clocks are implemented, then the cooperation of only one state machine replica is needed (if the sm has not failed, then it is correct, and because of consensus among replicas, its information on the system is correct and complete with respect to the other sm) -> the sm used should have access to enough information
- implementation: e[r] is sent to e before the output produced by processing any request with uid larger than uid(r)
- e in O: e[r] is usually device-specific setup information
  - can be stored in state variables of sm
- e in C: e[r] is usually based on sensor values read
  - use information from C to sm
25Replica Coordination(7): Integrating A Repaired State Machine
- try to use the same algorithm: sm sends to e the values of all its state variables before the output produced by processing any request with uid larger than uid(r)
  - problem: some client request might be received by sm after sending the state e[r], but delivered to e before its repair
- solution: sm must relay to e the requests received from clients
- for how long? as soon as e has received a request directly from a client c, requests from the same c with larger uid need not be relayed to e
- so, e should inform sm of the uid of requests received directly from c
- algorithm
  - (1) sm sends e the values of its state variables and copies of pending requests
  - (2) sm sends to e every subsequent request r received from client c s.t. uid(r) < uid(r_c) (r_c is the first request e has directly received from c after e restarted)
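The relay rule of step (2) amounts to a per-client cutoff kept by the helping replica. The sketch below shows only that bookkeeping, with no networking; the class `HelperReplica` and its method names are hypothetical, and the repaired machine e is assumed to report, for each client c, the uid of the first request it received directly from c.

```python
from typing import Any, Dict, List, Optional, Tuple


class HelperReplica:
    """The working state machine sm that helps reintegrate a repaired one (e)."""

    def __init__(self, state: Dict[str, Any], pending: List[Any]) -> None:
        self.state = dict(state)
        self.pending = list(pending)
        self.first_direct: Dict[str, int] = {}  # client -> uid(r_c) reported by e

    def snapshot(self) -> Tuple[Dict[str, Any], List[Any]]:
        """Step 1: state variables and copies of pending requests for e."""
        return dict(self.state), list(self.pending)

    def note_direct(self, client: str, uid: int) -> None:
        """e reports the uid of the first request it got directly from client."""
        self.first_direct.setdefault(client, uid)

    def should_relay(self, client: str, uid: int) -> bool:
        """Step 2: relay while uid(r) < uid(r_c), or while r_c is still unknown."""
        cutoff: Optional[int] = self.first_direct.get(client)
        return cutoff is None or uid < cutoff
```

Relaying stops per client rather than globally, since different clients may reach the repaired machine directly at different times.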