Title: Design of High Availability Systems and Networks: Replication
1. Design of High Availability Systems and Networks: Replication
Ravi K. Iyer
Center for Reliable and High-Performance Computing
Department of Electrical and Computer Engineering and Coordinated Science Laboratory
University of Illinois at Urbana-Champaign
iyer_at_crhc.uiuc.edu
http://www.crhc.uiuc.edu/DEPEND
2. Outline
- Replication
- Example systems
- Example algorithms for replicating multithreaded applications
- Self-checking processes: the ARMOR approach
3. Replication: Basic Idea
- Use space redundancy to achieve FT
- Two basic techniques
- Active replication (fault masking)
- Passive replication (standby spare)
- Use multiple instances of a component so that
independent failures can be assumed
4. Active Replication
- Replicas simultaneously perform the same work
- Replicas start from same initial state
- Replicas get the same ordered set of inputs
- In the absence of faults, replicas produce the same outputs (assumption!)
- Voter produces a single response by majority voting on the replicas' outputs
- Masks potential errors/failures
- Voter is a single point of failure!
- Active replication schemes in SW
- Pass-First: each replica executes client requests independently and sends its reply to the leader, which forwards the first received reply to the client
- Leader Only (semi-active): only the leader sends replies to the client (the other replicas suppress their replies, unless the leader fails)
- Majority Voting: all replicas' replies are voted on, and the client is delivered the majority value (see the sketch after this list)
- Require replica determinism
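A minimal sketch of the majority-voting step under the Majority Voting scheme, assuming three replicas and fixed-size reply buffers; REPLICAS, REPLY_LEN, and majority_vote are illustrative names, not taken from any of the systems discussed here.

```c
#include <string.h>

#define REPLICAS  3      /* illustrative degree of replication */
#define REPLY_LEN 256    /* illustrative fixed reply size */

/* Return the index of a reply that matches a strict majority of the
 * replicas' replies, or -1 if no majority exists (too many replicas
 * failed or diverged). The caller delivers replies[winner] to the client. */
int majority_vote(const unsigned char replies[REPLICAS][REPLY_LEN])
{
    for (int i = 0; i < REPLICAS; i++) {
        int matches = 0;
        for (int j = 0; j < REPLICAS; j++)
            if (memcmp(replies[i], replies[j], REPLY_LEN) == 0)
                matches++;
        if (matches > REPLICAS / 2)
            return i;      /* replies[i] is the majority value */
    }
    return -1;             /* no majority: the fault cannot be masked */
}
```

Whatever component runs this loop is exactly the single point of failure noted above, unless it is replicated as well (e.g., by embedding a voter in each client, as discussed later).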
5. Passive Replication
- Only one component (the primary) does the computation
- Spares (backups) are ready to take over when a fault is detected in the primary
- Spares need access to the primary's state
- Replicas must be both observable and controllable
- Need to have a detection mechanism
- Passive replication schemes in SW (sketched after this list)
- State Cast: the primary multicasts its state to the backups
- Stable Storage: the primary saves its state to a system-accessible stable storage
- State can be sent/saved
- After processing each client request
- Periodically (requires replica determinism)
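A minimal sketch of the Stable Storage scheme with the "after each request" policy; struct app_state, the request parameters, and the checkpoint path are hypothetical, and a production implementation would write the checkpoint atomically (temporary file plus rename) rather than overwriting it in place.

```c
#include <stdio.h>

/* Hypothetical application state; real systems must also deal with
 * infrastructure state and OS-level state. */
struct app_state {
    long last_request_id;
    long balance;
};

/* Primary: apply a client request, then save the state to stable storage
 * before replying to the client. */
int handle_request_and_checkpoint(struct app_state *s, long req_id,
                                  long delta, const char *ckpt_path)
{
    s->last_request_id = req_id;
    s->balance += delta;

    FILE *f = fopen(ckpt_path, "wb");
    if (!f)
        return -1;
    size_t ok = fwrite(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

/* Backup: on failover, reload the last checkpointed state and take over. */
int restore_from_checkpoint(struct app_state *s, const char *ckpt_path)
{
    FILE *f = fopen(ckpt_path, "rb");
    if (!f)
        return -1;
    size_t ok = fread(s, sizeof *s, 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}
```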
6. Replication Issues
- Replicas must see the same ordered set of inputs
- Non-determinism in replica execution may cause replicas to diverge
- Voting
- Re-integration
- Cost/Performance and Complexity
7. Replication Issues
- Replicas must see the same ordered set of inputs
- In HW: use additional wires
- E.g., Integrity S2 (Tandem)
- In SW: use the network
- Group communication protocols (a delivery sketch follows this list)
- Provide reliable broadcast/unicast, total order broadcast (atomic broadcast), group membership service, and virtual synchrony
- Fault models: crashes, Byzantine with signatures, network partitions
- Examples: ISIS/HORUS/Ensemble, Totem, Transis, SecureRing, Rampart
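A sketch of the replica-side contract with such a protocol stack: the group communication layer assigns every client input a global sequence number, and each replica applies inputs strictly in that order, so all replicas see the same ordered input stream. struct message, gcs_deliver, and apply_to_state are illustrative names, not the API of any of the toolkits listed above.

```c
#include <stdio.h>

struct message {
    long seq;       /* global sequence number assigned by the GCS */
    int  payload;   /* illustrative client input */
};

static long next_seq = 0;   /* next sequence number this replica will apply */

/* Deterministic state-machine update; printing stands in for it here. */
static void apply_to_state(const struct message *m)
{
    printf("applying input %ld (payload %d)\n", m->seq, m->payload);
}

/* Delivery callback invoked by the group communication layer. Total order
 * broadcast guarantees gap-free, identically ordered delivery at every
 * replica, so applying in seq order keeps the replicas consistent. */
void gcs_deliver(const struct message *m)
{
    if (m->seq != next_seq)   /* defensive check; the GCS should prevent this */
        return;
    apply_to_state(m);
    next_seq++;
}
```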
8. Replication Issues
- Non-determinism in replica execution may cause replicas to diverge
- In HW: execute the same instruction stream
- Tandem TMR synchronizes CPUs on (1) global memory references, (2) external interrupts, and (3) periodically; for (2) and (3) it uses a cycle counter
- In SW:
- State machine approach: avoid non-determinism (e.g., multithreading, local timers/random number generators, etc.)
- Force the same instruction streams (instruction counter)
- Use a non-preemptive deterministic scheduler
9. Non-Preemptive Deterministic Scheduler
- Eternal (CORBA)
- Intercepts system calls, replacing the C standard library
- Only one logical thread can execute at a time (an RMI blocks the replica process); see the token sketch after this list
- Transactional Drago (Ada)
- Scheduler embedded in the run-time support
- Multiple logical threads can proceed concurrently
- Scheduling points: service requests, selective receptions, lock requests, server calls, and end of execution
- Uses an internal queue and an external queue
- Reads on the external queue must be blocking!
- Non-preemptive → only one physical thread can be running (limits CPU/I/O overlap and SMP use)
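A minimal sketch of the "one logical thread at a time" discipline using a single run token; the yield points correspond to the scheduling points listed above. This is only an illustration of the idea: a real non-preemptive deterministic scheduler (as in Eternal or Transactional Drago) chooses the next logical thread deterministically via its queues, whereas here the hand-off order is left to the pthreads library.

```c
#include <pthread.h>

/* One run token shared by all logical threads of the replica: a logical
 * thread may execute only while holding it, so at most one logical thread
 * runs at a time. */
static pthread_mutex_t run_token = PTHREAD_MUTEX_INITIALIZER;

void logical_thread_begin(void) { pthread_mutex_lock(&run_token); }
void logical_thread_end(void)   { pthread_mutex_unlock(&run_token); }

/* A scheduling point (service request, selective reception, lock request,
 * server call): hand the token over and reacquire it before continuing. */
void scheduling_point(void)
{
    pthread_mutex_unlock(&run_token);
    pthread_mutex_lock(&run_token);
}
```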
10. Replication Issues
- Voting
- Voter is a single point of failure
- In SW: can replicate both clients and servers, embedding a voter in each client/server
- Fundamental assumption:
- Values differ → a replica failed
- In SW, outputs can contain chunks of replica/node-dependent information (see the comparison sketch after this list)
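A sketch of how a software voter can compare only the portion of a reply that must be identical across replicas, skipping the replica/node-dependent chunks; struct reply and its layout are hypothetical.

```c
#include <string.h>

/* Hypothetical reply layout: some fields legitimately differ per replica
 * and must not take part in the vote. */
struct reply {
    char          node_name[32];    /* replica/node dependent */
    long          local_timestamp;  /* replica/node dependent */
    unsigned char payload[256];     /* the result that must match */
};

/* Two replies agree iff their payloads match; node-dependent chunks are
 * ignored, so only genuine divergence counts as a replica failure. */
int replies_agree(const struct reply *a, const struct reply *b)
{
    return memcmp(a->payload, b->payload, sizeof a->payload) == 0;
}
```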
11. Replication Issues
- Re-integration
- Hardware
- Power off the failed unit
- Power on a spare unit (automatically or manually)
- Set the internal state of the new spare equal to that of the other running units
- Synchronize the units and restart the computation
- E.g., CPU re-integration in Integrity S2
- Software
- Need to get/set the internal state (checkpoint); see the sketch after this list
- Replicas must be quiescent: no operation that can change the state should be in progress
- Replica state can be divided into application state, infrastructure state, and OS state
- The application must minimize its dependency on OS resources when a checkpoint is taken (e.g., close open files)
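A sketch of get-state/set-state for re-integrating a software replica; the single state_lock stands in for application-level quiescence (no state-changing operation in progress while the checkpoint is taken or installed), and struct app_state covers only application state, deliberately leaving out infrastructure and OS state such as open files.

```c
#include <pthread.h>

struct app_state {
    long version;
    char data[1024];    /* illustrative application state */
};

static struct app_state state;
static pthread_mutex_t state_lock = PTHREAD_MUTEX_INITIALIZER;

/* Called on a running replica: wait until no state-changing operation is in
 * progress, then copy out a consistent checkpoint. */
void get_state(struct app_state *out)
{
    pthread_mutex_lock(&state_lock);
    *out = state;
    pthread_mutex_unlock(&state_lock);
}

/* Called on the joining replica: install the checkpoint before it starts
 * processing new inputs. */
void set_state(const struct app_state *in)
{
    pthread_mutex_lock(&state_lock);
    state = *in;
    pthread_mutex_unlock(&state_lock);
}
```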
12. Replication Issues
- Cost/Performance and Complexity
- HW (e.g., TANDEM)
- Between HW and OS (e.g., Hypervisor)
- OS (e.g., TARGON/32)
- Between OS and User App (e.g., TFT)
- User App (e.g., FT-CORBA)
- Overhead can range from 20% to 1,000%
13. Examples of Replicated Systems
- Self-Test And Repair (STAR), JPL
- Prototype for a real-time satellite-control computer
- Techniques: error-detection coding, on-line monitoring, standby redundancy, replication with voting, component redundancy, and re-execution
- The hard-core monitor (TARP) is triplicated
- Electronic Switching Systems (ESS), Bell Labs
- Target: 3 minutes of downtime per year (R = 0.999994; see the check after this list)
- Technique: redundant components and a self-checking duplicated processor
- Tolerates all single failures
- Delta-4
- Network attachment controllers (NACs) run an atomic multicast protocol and enforce fail-silence
- Passive, semi-active, and active replication
- CORBA-based replicated systems
- Fault tolerance integrated in the ORB (Orbix+Isis, Maestro, Electra)
- Efficient, but the ORB is non-standard
- Modifies the transport-level mapping in the ORB
- Fault tolerance as services above the ORB
- Inefficient; FT is explicit to the user
- Intercept IIOP messages and send them through a reliable broadcast protocol (Eternal, AQuA)
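As a quick check of the ESS figure (assuming a 365.25-day year), 3 minutes of downtime per year corresponds to

$$A = 1 - \frac{3}{365.25 \times 24 \times 60} = 1 - \frac{3}{525960} \approx 1 - 5.7 \times 10^{-6} \approx 0.999994$$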
14. Summary
- Replication is expensive in terms of cost/complexity and overhead
- Fault models: crash, Byzantine, network partitions
- HW/SW approaches: similar issues, different solutions
- Replicas must see the same ordered set of inputs
- Non-determinism in replica execution may cause replicas to diverge
- Voting
- Re-integration
15. A Preemptive Deterministic Scheduling Algorithm for Multithreaded Replicas
16. Motivations
- Replication offers a high degree of data integrity and system availability
- Replicated systems suffer from large overheads
- Multithreading can help reduce the overhead, but it results in nondeterminism in replica behavior
- Not relevant from the application standpoint, but it must be removed to support fault masking
- Most replication approaches do not support multithreading
- Those that do (e.g., Eternal, Transactional Drago) limit concurrency to at most one executing thread at a time
17. Proposed Approach
- Synchronize replicas only on shared state updates
- Intercept lock/unlock requests (see the wrapper sketch after this list)
- Enforce equivalent thread interleaving across replicas
- Allow multiple physical threads to run at the same time
- MT features that we rely on:
- Updates to shared state are protected via mutexes
- Different mutexes protect different shared variables
- Requirements:
- No other form of nondeterminism is present in the application (e.g., local timers or local random number generators)
- Can replay the application by enforcing the same (causal) order of mutex acquisitions
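A sketch of the interception point, assuming the application's mutex calls are redirected to wrappers like these; scheduler_grant and scheduler_release are placeholders for the replication scheduler's decision logic (PDS on the following slides), not a real API.

```c
#include <pthread.h>

/* Stubs standing in for the scheduler: grant may suspend the calling thread
 * until every replica would allow this acquisition in the same order. */
static void scheduler_grant(int tid, pthread_mutex_t *m)   { (void)tid; (void)m; }
static void scheduler_release(int tid, pthread_mutex_t *m) { (void)tid; (void)m; }

/* Application threads call these wrappers instead of the pthread functions
 * (e.g., via library interposition), so only shared-state updates are
 * synchronized across replicas while all other code runs at full speed. */
int rep_mutex_lock(int tid, pthread_mutex_t *m)
{
    scheduler_grant(tid, m);
    return pthread_mutex_lock(m);
}

int rep_mutex_unlock(int tid, pthread_mutex_t *m)
{
    int rc = pthread_mutex_unlock(m);
    scheduler_release(tid, m);
    return rc;
}
```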
18. Formal Definitions
- Correct Application Assumption
- Each thread releases only mutexes it owns
- A thread executing infinitely often:
- Eventually releases each mutex it acquires
- Requests mutexes infinitely often
- The mutex dependency graph is acyclic (no deadlock)
- Correctness Property
- Internal Correctness
- (Mutual Exclusion) At most one thread holds a given mutex
- (No Lockout) A mutex acquisition request is eventually served
- External Correctness
- (Safety) Two replicas impose the same causal order of mutex acquisitions
- (Liveness) Any mutex acquisition in one replica is eventually performed by the other replica
19. A Solution: Preemptive Deterministic Scheduler (PDS)
- No inter-replica communication
- Basic idea of PDS (a simplified sketch follows this list)
- Assume a total order < on thread ids
- The < relation is the same for all replicas
- Replica execution is broken into a sequence of rounds
- In a round, a thread can acquire at most
- 1 mutex (PDS-1)
- 2 mutexes (PDS-2)
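The ingredient PDS builds on is that every replica grants mutexes in the same, locally computable order, with no inter-replica communication. The sketch below illustrates that ingredient with a much cruder rule than PDS: a strict round-robin over thread ids, one acquisition per thread per round, and no concurrency between the acquisitions themselves. It further assumes a fixed set of NTHREADS threads that all keep requesting mutexes and never hold one mutex while requesting another; it is not the PDS algorithm.

```c
#include <pthread.h>

#define NTHREADS 4      /* illustrative; thread ids 0..NTHREADS-1, lower id first */

static pthread_mutex_t sched_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  sched_cv   = PTHREAD_COND_INITIALIZER;
static int turn = 0;    /* id of the thread allowed to acquire next */

/* Deterministic lock wrapper: acquisitions happen strictly in thread-id
 * order within each round, so every replica observes the same sequence of
 * mutex acquisitions without exchanging any messages. */
void det_lock(int tid, pthread_mutex_t *m)
{
    pthread_mutex_lock(&sched_lock);
    while (tid != turn)                    /* wait for this thread's turn */
        pthread_cond_wait(&sched_cv, &sched_lock);
    pthread_mutex_unlock(&sched_lock);

    pthread_mutex_lock(m);                 /* the acquisition itself */

    pthread_mutex_lock(&sched_lock);
    turn = (turn + 1) % NTHREADS;          /* id NTHREADS-1 closes the round */
    pthread_cond_broadcast(&sched_cv);
    pthread_mutex_unlock(&sched_lock);
}

void det_unlock(int tid, pthread_mutex_t *m)
{
    (void)tid;
    pthread_mutex_unlock(m);               /* releases need no extra ordering */
}
```

PDS-1 relaxes this considerably: within a round, threads whose pending requests target different mutexes acquire them concurrently, thread ids only break ties between requests for the same mutex, and PDS-2 allows two acquisitions per round, as illustrated in the example that follows.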
20-31. Basic PDS Algorithm: Example
[Figure sequence: timeline of threads t1-t4 acquiring mutexes m1-m3 over rounds N and N+1; the slide annotations are kept below]
- Threads t1, t2, and t3 have requested their next mutex acquisitions (m1, m1, and m2, respectively) and have been suspended; thread t4 is still executing
- Round N fires
- Threads t1 and t3 execute concurrently
- t1 releases m1 → t2 acquires m1; threads t1, t2, and t3 execute concurrently
- t3 releases m2 → t4 acquires m2; all threads execute concurrently
- The threads request their next mutex (m3) and round N+1 begins
- t1 releases m3 → t2 acquires m3; threads t1 and t2 execute concurrently
32. PDS Algorithm
- Extend PDS-1 by allowing each thread to acquire up to 2 mutexes per round
- Divide a round into two steps
- NOTE: threads with lower ids always win ties against threads with higher ids
- Avoid suspending a thread whenever it would acquire a mutex first regardless of the other threads' (future) mutex requests
33. Performance/Dependability Experimental Evaluation
- The PDS algorithm was formally specified and proved correct
- Experimental study of the performance/dependability tradeoffs involved in selecting different thread scheduling algorithms
- Studied algorithms:
- Loose Synchronization Algorithm (LSA)
- Proposed in [Basile, SRDS 2002]
- Preemptive Deterministic Scheduler (PDS)
- Non-Preemptive Deterministic Scheduler (NPDS)
- Based on the MTRDS algorithm used in Transactional Drago
34. Experimental Setup
- Triplicated server running a synthetic MT benchmark
- 10 worker threads serve requests from 15 clients
- Serving a client request involves execution of a sequence of:
- Mutex acquisitions (modeling accesses to shared data)
- I/O activities (modeling accesses to persistent storage)
- Majority voting
- Replicas, voter/fanout, and clients run on different machines
35. Performance Evaluation
- Major Results
- LSA provides 5 times more throughput than NPDS
- PDS-2 provides 2 times more throughput than NPDS
- LSA incurs a 40-60% performance overhead w.r.t. the non-replicated benchmark (baseline)
36. Dependability Evaluation Approach
- Use software-based error injection
- Dependability measures:
- Number of catastrophic failures
- Must be minimized in highly available systems
- Ratio between manifested and injected errors
- Provides Pr{failure | error}; given an error arrival rate, one can derive system availability (see the model sketch after this list)
- Ratio between activated and manifested errors
- Provides a closer look into the error sensitivity of the replica code
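One way the second measure feeds an availability estimate, as a back-of-the-envelope model assuming exponentially distributed error arrivals at rate $\lambda_{err}$ and a mean repair time $MTTR$ (neither value is given on the slide):

$$p = \Pr\{\text{failure} \mid \text{error}\}, \qquad \lambda_{fail} = p\,\lambda_{err}, \qquad MTTF = \frac{1}{\lambda_{fail}}, \qquad A = \frac{MTTF}{MTTF + MTTR} = \frac{1}{1 + p\,\lambda_{err}\,MTTR}$$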
37. Error Injection Results
- Uniform injections (text/data/heap)
- PDS is less sensitive to errors than LSA
- A number of manifested errors result in catastrophic failures (0.2% for PDS and 0.6% for LSA)
- The difference between PDS and LSA is due to their different use of the underlying GCS (Ensemble)
- All Ensemble functions used by PDS are also used by LSA
- A number of Ensemble functions are used by LSA but not by PDS
- Targeted injections show that errors injected into Ensemble generate a significant number of catastrophic failures
- 1-3% of manifested failures
38. Lessons Learned
- Performance: LSA > PDS > NPDS
- Dependability:
- LSA is more sensitive to the underlying GCS's fail-silence violations
- NPDS dependability characteristics are similar to PDS
- Errors in the group communication layer do propagate and cause catastrophic failures
- Using interactive consistency will not suffice
- A fault tolerance middleware should itself be fault-tolerant
39. Multithreaded Apache (2.0.35)
- Triplicated system with majority voting
- Pentium III 500 MHz,
- Linux 2.4
- Ensemble 1.40
- Uses Apache's worker module
- 10 server threads.
- 10 concurrent clients each sending 1000 requests
to retrieve a 1.5KB HTML page
40. Lessons Learned (Replication)
- Dedicated hardware solutions such as Tandem:
- Achieve high reliability/availability through extensive hardware redundancy
- Offer only a fixed level of dependability throughout the lifetime of the application
- Require significant investment from the customer
- Software middleware solutions are a vital alternative to hardware-based approaches:
- Provide high reliability/availability services to the end user
- Applications executed in the network system may require varying levels of availability
- Services must adapt to varying availability requirements
- Provide efficient error detection and fast recovery to minimize system downtime
- Ensure minimum error propagation across the network by:
- Self-checking processes and fail-silent nodes
- Protect the entities that provide fault tolerance services
41. Lessons Learned (Self-Checking Middleware)
- Progressive fault injections to stress the error detection and recovery services of the SIFT environment
- The SIFT environment imposes negligible overhead during failure-free execution and < 5% overhead during recovery of ARMOR processes
- Correlated failures involving application and ARMOR processes can impact application availability
- Successful recovery from many correlated failures due to hierarchical error detection and recovery
- Targeted heap injections show that internal self-checks and microcheckpointing are useful in preventing error propagation