Title: Reliable Distributed Systems
1. Reliable Distributed Systems
- Fault Tolerance
- (Recoverability vs. High Availability)
2. Reliability and transactions
- Transactions are well matched to the database model and recoverability goals
- Transactions don't work well for non-database applications (general purpose O/S applications) or availability goals (systems that must keep running if applications fail)
- When building high availability systems, encounter the replication issue
3. Types of reliability
- Recoverability
  - Server can restart without intervention in a sensible state
  - Transactions do give us this
- High availability
  - System remains operational during failure
  - Challenge is to replicate critical data needed for continued operation
4. Replicating a transactional server
- Two broad approaches
  - Just use distributed transactions to update multiple copies of each replicated data item
    - We already know how to do this, with 2PC
    - Each server has equal status
  - Somehow treat replication as a special situation
    - Leads to a primary server approach with a warm standby
5. Replication with 2PC
- Our goal will be 1-copy serializability
  - Defined to mean that the multi-copy system behaves indistinguishably from a single-copy system
  - Considerable formal and theoretical work has been done on this
- As a practical matter
  - Replicate each data item
  - Transaction manager
    - Reads any single copy
    - Updates all copies
6. Observation
- Notice that the transaction manager must know where the copies reside
- In fact there are two models
  - Static replication set: basically, the set is fixed, although some members may be down
  - Dynamic: the set changes while the system runs, but only has operational members listed within it
- Today: stick to the static case
7. Replication and Availability
- A series of potential issues
  - How can we update an object during periods when one of its replicas may be inaccessible?
  - How can the 2PC protocol be made fault-tolerant?
    - A topic we'll study in more depth
    - But the bottom line is: we can't!
8. Usual responses?
- Quorum methods
  - Each replicated object has an update quorum (Qu) and a read quorum (Qr)
  - Designed so that Qu + Qr > # replicas and Qu + Qu > # replicas
  - Idea is that any read or update will overlap with the last update
9. Quorum example
- X is replicated at {a, b, c, d, e}
- Possible values?
  - Qu = 1, Qr = 5 (violates Qu + Qu > 5)
  - Qu = 2, Qr = 4 (same issue)
  - Qu = 3, Qr = 3
  - Qu = 4, Qr = 2
  - Qu = 5, Qr = 1 (violates availability)
- Probably prefer Qu = 4, Qr = 2 (checked in the sketch below)
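To make the arithmetic concrete, here is a minimal Python sketch (function and variable names are illustrative, not from the lecture) that checks the overlap conditions against the five-replica example:

```python
def quorum_ok(n: int, qu: int, qr: int) -> bool:
    """Overlap conditions: any read sees the last update (Qu + Qr > n),
    and any two updates overlap (Qu + Qu > n)."""
    return qu + qr > n and 2 * qu > n

if __name__ == "__main__":
    N = 5  # X replicated at a, b, c, d, e
    for qu, qr in [(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)]:
        # Qu = N satisfies the overlap rule but not availability:
        # a single crashed replica then blocks every update.
        print(f"Qu={qu}, Qr={qr}: overlap {'ok' if quorum_ok(N, qu, qr) else 'violated'},"
              f" updates survive a crash: {qu < N}")
```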
10. Things to notice
- Even reading a data item requires that multiple copies be accessed!
  - This could be much slower than normal local access performance
- Also, notice that we won't know if we succeeded in reaching the update quorum until we get responses
  - Implies that any quorum replication scheme needs a 2PC protocol to commit
11. Next issue?
- Now we know that we can solve the availability problem for reads and updates if we have enough copies
- What about for 2PC?
  - Need to tolerate crashes before or during runs of the protocol
  - A well-known problem
12. Availability of 2PC
- It is easy to see that 2PC is not able to guarantee availability
- Suppose that the manager talks to 3 processes
- And suppose 1 process and the manager fail
- The other 2 are stuck and can't terminate the protocol
13. What can be done?
- We'll revisit this issue soon
- Basically,
  - Can extend to a 3PC protocol that will tolerate failures if we have a reliable way to detect them
  - But network problems can be indistinguishable from failures
  - Hence there is no commit protocol that can tolerate failures
  - Anyhow, the cost of 3PC is very high
14. A quandary?
- We set out to replicate data for increased availability
- And concluded that
  - The quorum scheme works for updates
  - But commit is required
  - And represents a vulnerability
- Other options?
15. Other options
- We mentioned primary-backup schemes
- These are a second way to solve the problem
- Based on the log at the data manager
16. Server replication
- Suppose the primary sends the log to the backup server
- It replays the log and applies committed transactions to its replicated state (see the replay sketch below)
- If the primary crashes, the backup soon catches up and can take over
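A minimal sketch of what the backup's replay loop might look like, assuming a simple (txn, kind, payload) log record format invented here for illustration:

```python
# Backup replays the primary's log, applying only transactions whose
# COMMIT record arrived; on takeover it is at worst missing the final
# few updates that were in flight at the primary.
from typing import Dict, List, Tuple

def replay(log: List[Tuple[str, str, object]]) -> Dict[str, object]:
    """Log entries: (txn_id, kind, payload), where kind is 'update'
    (payload = (key, value)) or 'commit' (payload unused)."""
    state: Dict[str, object] = {}
    pending: Dict[str, List[Tuple[str, object]]] = {}
    for txn, kind, payload in log:
        if kind == "update":
            pending.setdefault(txn, []).append(payload)
        elif kind == "commit":
            for key, value in pending.pop(txn, []):
                state[key] = value  # apply only committed work
    return state  # uncommitted transactions left in `pending` are discarded

log = [
    ("t1", "update", ("x", 1)), ("t1", "commit", None),
    ("t2", "update", ("y", 2)),  # commit record never reached the backup
]
print(replay(log))  # {'x': 1} -- t2 is lost; this is the reconciliation problem
```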
17. Primary/backup
[Diagram: primary and backup servers connected by the log; clients attached to the primary]
Clients initially connected to primary, which keeps backup up to date. Backup tracks log.
18. Primary/backup
[Diagram: primary crashed; backup still running]
Primary crashes. Backup sees the channel break, applies committed updates. But it may have missed the last few updates!
19. Primary/backup
[Diagram: clients reconnected to the backup]
Clients detect the failure and reconnect to backup. But some clients may have gone away. Backup state could be slightly stale. New transactions might suffer from this.
20. Issues?
- Under what conditions should the backup take over?
  - Revisits the consistency problem seen earlier with clients and servers
  - Could end up with a split brain
- Also notice that this still needs 2PC to ensure that primary and backup stay in the same state!
21. Split brain reminder
[Diagram: primary and backup connected by the log; clients attached to the primary]
Clients initially connected to primary, which keeps backup up to date. Backup follows log.
22. Split brain reminder
[Diagram: link between primary and backup broken; both still running]
Transient problem causes some links to break, but not all. Backup thinks it is now primary; primary thinks backup is down.
23. Split brain reminder
[Diagram: clients split between primary and backup]
Some clients still connected to primary, but one has switched to backup and one is completely disconnected from both.
24. Implication?
- A strict interpretation of ACID leads to the conclusion that
  - There are no ACID replication schemes that provide high availability
- Most real systems solve this by weakening ACID
25. Real systems
- They use primary-backup with logging
  - But they simply omit the 2PC
- Server might take over in the wrong state (may lag the state of the primary)
- Can use hardware to reduce or eliminate the split brain problem
26. How does hardware help?
- Idea is that primary and backup share a disk
- Hardware is configured so only one can write the disk
- If the backup takes over, it grabs the token (see the fencing sketch below)
- Token loss causes the primary to shut down (if it hasn't actually crashed)
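A software sketch of the token idea, assuming an advisory lock on a shared file stands in for the hardware write token (real deployments typically rely on disk controller support such as SCSI reservations; the path is a placeholder):

```python
# Token-based fencing sketch: whoever holds the exclusive lock is the
# unique writer; losing or failing to get it means "shut down, don't
# risk split brain". Unix-only (uses fcntl file locks).
import fcntl
import sys

def acquire_token(path: str):
    """Try to grab the exclusive write token; return the handle or None."""
    f = open(path, "w")
    try:
        fcntl.flock(f, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return f          # we are the unique writer
    except OSError:
        f.close()
        return None       # someone else holds the token

token = acquire_token("/tmp/shared-disk.token")
if token is None:
    # Another server holds the token: refuse to act as primary.
    sys.exit("token held elsewhere; refusing to act as primary")
print("acting as primary")
```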
27. Reconciliation
- This is the problem of fixing the transactions impacted by the lack of 2PC
- Usually just a handful of transactions
  - They committed, but the backup doesn't know because it never saw the commit record
  - Later, the server recovers and we discover the problem
- Need to apply the missing ones (see the sketch below)
  - Also causes cascaded rollback
- Worst case may require human intervention
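A minimal sketch of the detection step, reusing the illustrative log format from the replay sketch on slide 16 (names are assumptions, not from the lecture):

```python
# Reconciliation sketch: find transactions the primary committed that the
# backup never applied. These must be re-applied, and anything that read
# their data may need cascaded rollback.
from typing import List, Set, Tuple

def missing_commits(primary_log: List[Tuple[str, str, object]],
                    applied_at_backup: Set[str]) -> Set[str]:
    committed = {txn for txn, kind, _ in primary_log if kind == "commit"}
    return committed - applied_at_backup

log = [("t1", "update", ("x", 1)), ("t1", "commit", None),
       ("t2", "update", ("y", 2)), ("t2", "commit", None)]
print(missing_commits(log, {"t1"}))  # {'t2'}: committed but unseen by backup
```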
28. Summary
- Reliability can be understood in terms of
  - Availability: system keeps running during a crash
  - Recoverability: system can recover automatically
- Transactions are best for the latter
- Some systems need both sorts of mechanisms, but there are deep tradeoffs involved
29. Replication and High Availability
- All is not lost!
- Suppose we move away from the transactional model
  - Can we replicate data at lower cost and with high availability?
- Leads to the virtual synchrony model
  - Treats data as the state of a group of participating processes
  - Replicated update done with multicast
30. Steps to a solution
- First look more closely at 2PC, 3PC, failure detection
- 2PC and 3PC both block in real settings
  - But we can replace failure detection by consensus on membership
  - Then these protocols become non-blocking (although solving a slightly different problem)
- Generalized approach leads to ordered atomic multicast in dynamic process groups
31. Non-blocking Commit
- Goal: a protocol that allows all operational processes to terminate the protocol even if some subset crash
- Needed if we are to build high availability transactional systems (or systems that use quorum replication)
32. Definition of problem
- Given a set of processes, one of which wants to initiate an action
- Participants may vote for or against the action
- Originator will perform the action only if all vote in favor; if any votes against (or doesn't vote), we will abort the protocol and not take the action
- Goal is an all-or-nothing outcome
33. Non-triviality
- Want to avoid solutions that do nothing (trivial case of all or none)
- Would like to say that if all vote for commit, the protocol will commit
  - ... but in distributed systems we can't be sure votes will reach the coordinator!
  - Any live protocol risks making a mistake and counting a live process that voted to commit as a failed process, leading to an abort
- Hence, the non-triviality condition is hard to capture
34. Typical protocol
- Coordinator asks all processes if they can take the action
- Processes decide if they can, and send back "ok" or "abort"
- Coordinator collects all the answers (or times out)
- Coordinator computes the outcome and sends it back (sketched below)
35. Commit protocol illustrated
[Diagram: coordinator sends "ok to commit?" to the participants]
36. Commit protocol illustrated
[Diagram: participants reply "ok with us"]
37. Commit protocol illustrated
[Diagram: coordinator sends "commit"]
Note: garbage collection protocol not shown here.
38. Failure issues
- So far, we have implicitly assumed that processes fail by halting (and hence not voting)
- In real systems a process could fail in arbitrary ways, even maliciously
- This has led to work on the Byzantine generals problem, which is a variation on commit set in a synchronous model with malicious failures
39. Failure model impacts costs!
- Byzantine model is very costly: 3t+1 processes needed to overcome t failures; protocol runs in t+1 rounds
- This cost is unacceptable for most real systems, hence these protocols are rarely used
- Main areas of application: hardware fault-tolerance, security systems
- For these reasons, we won't study such protocols
40. Commit with simpler failure model
- Assume processes fail by halting
- Coordinator detects failures (unreliably) using timeouts. It can make mistakes!
- Now the challenge is to terminate the protocol if the coordinator fails instead of, or in addition to, a participant!
41. Commit protocol illustrated
[Diagram: coordinator sends "ok to commit?"; the live participants answer "ok with us", but one participant has crashed; the coordinator times out and sends "abort!"]
Note: garbage collection protocol not shown here.
42. Example of a hard scenario
- Coordinator starts the protocol
- One participant votes to abort, all others to commit
- Coordinator and one participant now fail
- ... we now lack the information to correctly terminate the protocol!
43. Commit protocol illustrated
[Diagram: coordinator sends "ok to commit?" and then fails; two participants answered "ok", but the failed participant's vote is unknown, so the decision is unknown to the survivors]
44. Example of a hard scenario
- Problem is that if the coordinator told the failed participant to abort, all must abort
- If it voted for commit and was told to commit, all must commit
- Surviving participants can't deduce the outcome without knowing how the failed participant voted
- Thus the protocol blocks until recovery occurs
45. Skeen: Three-phase commit
- Seeks to increase availability
- Makes an unrealistic assumption that failures are accurately detectable
- With this, can terminate the protocol even if a failure does occur
46. Skeen: Three-phase commit
- Coordinator starts the protocol by sending a request
- Participants vote to commit or to abort
- Coordinator collects votes, decides on outcome
  - Coordinator can abort immediately
  - To commit, coordinator first sends a "prepare to commit" message
  - Participants acknowledge; commit occurs during a final round of commit messages (sketched below)
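A minimal sketch of the coordinator's three phases, using the same faked transport as the 2PC sketch earlier; it assumes the accurate failure detection noted on the previous slide:

```python
# 3PC coordinator sketch (Skeen): vote, prepare-to-commit, commit.
from typing import Callable, List, Optional

def run_3pc(participants: List[Callable[[str], Optional[str]]]) -> str:
    # Phase 1: collect votes; any missing or negative vote aborts immediately.
    if not all(p("ok to commit?") == "ok" for p in participants):
        for p in participants:
            p("abort")
        return "abort"
    # Phase 2: everyone voted yes; announce "prepare to commit" and collect
    # acknowledgements, so survivors can always deduce the outcome later.
    acks = [p("prepare to commit") for p in participants]
    if not all(a == "prepared" for a in acks):
        for p in participants:
            p("abort")  # still safe: nobody has committed yet
        return "abort"
    # Phase 3: all survivors are prepared; commit is now inevitable.
    for p in participants:
        p("commit")
    return "commit"

good = lambda msg: {"ok to commit?": "ok", "prepare to commit": "prepared"}.get(msg)
print(run_3pc([good, good]))  # commit
```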
47. Three-phase commit protocol illustrated
[Diagram: vote collection ("ok ...."), then the prepare-to-commit round ("prepared..."), then the final commit round]
Note: garbage collection protocol not shown here.
48. Observations about 3PC
- If any process is in "prepare to commit", all voted for commit
- Protocol commits only when all surviving processes have acknowledged "prepare to commit"
- After the coordinator fails, it is easy to run the protocol forward to the commit state (or back to the abort state)
49. Assumptions about failures
- If the coordinator suspects a failure, the failure is real, and the faulty process, if it later recovers, will know it was faulty
- Failures are detectable with bounded delay
- On recovery, a process must go through a reconnection protocol to rejoin the system! (Find out the status of pending protocols that terminated while it was not operational)
50. Problems with 3PC
- With realistic failure detectors (that can make mistakes), the protocol still blocks!
- Bad case arises during "network partitioning", when the network splits the participating processes into two or more sets of operational processes
- Can prove that this problem is not avoidable: there are no non-blocking commit protocols for asynchronous networks
51. Situation in practical systems?
- Most use protocols based on 2PC: 3PC is more costly and, ultimately, still subject to blocking!
- Need to extend with a form of garbage collection mechanism to avoid accumulation of protocol state information (can solve in the background)
- Some systems simply accept the risk of blocking when a failure occurs
- Others reduce the consistency property to make progress, at the risk of inconsistency with failed processes
52. Process groups
- To overcome the cost of replication, we will introduce a dynamic process group model (processes that join and leave while the system is running)
- Will also relax our consistency goal: seek only consistency within a set of processes that all remain operational and members of the system
- In this model, 3PC is non-blocking!
- Yields an extremely cheap replication scheme!
53. Failure detection
- Basic question: how to detect a failure
  - Wait until the process recovers. If it was dead, it tells you
    - "I died, but I feel much better now"
    - Could be a long wait
  - Use some form of probe (sketched below)
    - But might make mistakes
  - Substitute agreement on membership
    - Now, failure is a "soft" concept
    - Rather than "up" or "down", we think about whether a process is behaving acceptably in the eyes of peer processes
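A minimal sketch of a timeout-based probe (host, port, and timeout are placeholders). Note that it can only *suspect* a failure: a partition or a slow peer looks exactly like a crash, which is why we substitute agreement on membership:

```python
# Timeout-based failure probe sketch: tries to open a TCP connection.
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if the peer accepted a TCP connection within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False  # suspected faulty -- but possibly just unreachable

if __name__ == "__main__":
    print("peer responsive:", probe("127.0.0.1", 8080))
```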
54. Architecture
[Layered diagram, top to bottom:
- Applications use replicated data for high availability
- 3PC-like protocols use membership changes instead of failure notification
- Membership: agreement on join/leave and "P seems to be unresponsive" events]
55. Issues?
- How to detect failures
  - Can use timeout
  - Or could use other system monitoring tools and interfaces
  - Sometimes can exploit hardware
- Tracking membership
  - Basically, need a new replicated service
  - System membership lists are the data it manages
  - We'll say it takes join/leave requests as input and produces "views" as output (sketched below)
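A minimal sketch of the view-producing logic, with an event sequence chosen to mirror the figure on the next slide (the event list and names are illustrative):

```python
# Membership service sketch: each join/leave/fail event yields a new view.
from typing import List, Tuple

def views(events: List[Tuple[str, str]]) -> List[List[str]]:
    members: List[str] = []
    out: List[List[str]] = []
    for kind, proc in events:
        if kind == "join":
            members.append(proc)
        else:  # "leave" and "fail" both remove the member
            members.remove(proc)
        out.append(list(members))  # each membership change is a new view
    return out

for v in views([("join", "A"), ("join", "B"), ("join", "D"),
                ("leave", "B"), ("join", "C"), ("fail", "A")]):
    print(v)
# A / A,B / A,B,D / A,D / A,D,C / D,C
```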
56. Architecture
[Diagram: application processes A, B, C, D alongside GMS processes X, Y, Z. B joins and later leaves, C joins, and "A seems to have failed"; the membership views evolve accordingly: A; A,B,D; A,D; A,D,C; D,C]
57. Issues
- Group membership service (GMS) has just a small number of members
  - This core set will track membership for a large number of system processes
- Internally it runs a group membership protocol (GMP)
- Full system membership list is just replicated data managed by the GMS members, updated using multicast
58. GMP design
- What protocol should we use to track the membership of the GMS?
- Must avoid the split-brain problem
- Desire continuous availability
  - We'll see that a version of 3PC can be used
  - But can't always guarantee liveness
59. Reading ahead?
- Read chapters 12, 13
- Thought problem: how important is external consistency (called "dynamic uniformity" in the text)?
- Homework: Read about FLP. Identify other impossibility results for distributed systems. What is the simplest case of an impossibility result that you can identify?