Clusters Part 4 - Systems
  • Lars Lundberg
  • The slides in this presentation cover Part 4
    (Chapters 12-15) in Pfister's book. We will,
    however, only present slides for Chapter 12.
  • This part is the most important one in Pfister's
    book.

High Availability
  • What we today call high availability was
    previously called fault tolerance.
  • Traditionally there have been hardware
    fault-tolerant systems. This means that faults are
    entirely handled by the hardware, and the
    software does not have to care.
  • Cluster systems offer fault tolerance in
    software, i.e. they use standard hardware.

Classes of Availability
Measuring Availability
  • Availability is usually measured as the
    percentage of the time that a system is
    available. This assumes that a system is either
    fully available or not available at all.
  • Potential problems when measuring availability:
  • What if the system is partly available?
  • Should we include periods when the system is not
  • Should we include planned outages for maintenance
    etc.? Planned outages can be a real problem in
    non-stop operation environments.
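The percentage measure, and the effect of counting or not counting planned outages, can be illustrated with a small calculation. This is only a sketch: the function name and the outage figures are hypothetical and are not taken from the book.

```python
def availability_percent(total_hours: float, downtime_hours: float) -> float:
    """Percentage of the measurement period during which the system was up."""
    if total_hours <= 0 or not 0 <= downtime_hours <= total_hours:
        raise ValueError("invalid measurement period or downtime")
    return 100.0 * (total_hours - downtime_hours) / total_hours


HOURS_PER_YEAR = 24 * 365   # one year of 24x365 operation

unplanned = 4.0    # hypothetical hours lost to crashes and faults
planned = 20.0     # hypothetical hours of maintenance windows

excluding_planned = availability_percent(HOURS_PER_YEAR, unplanned)
including_planned = availability_percent(HOURS_PER_YEAR, unplanned + planned)
print(f"{excluding_planned:.2f}% vs. {including_planned:.2f}%")
```

The same system reports a noticeably different figure depending on whether the maintenance windows are counted, which is why the questions above matter.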

High Availability vs. Continuous operation
  • If we separate the planned outages (maintenance,
    upgrades etc.) from the unplanned ones (crashes,
    faults etc.), we can make the following distinction:
  • High availability (few and short unplanned
    outages)
  • Continuous operation (few and short planned and
    unplanned outages)
  • High availability and continuous operation are
    not always equally important.

Reasons for unplanned outages
  • Loss of power
  • Application software
  • Operating system software
  • Subsystem software (e.g. databases)
  • Hardware with moving parts (e.g. disks, fans,
  • I/O-adapters
  • Memory
  • Processors, caches etc.

Outage Duration
  • Hardware does not break as often as software,
    but when it does it takes longer to repair.
  • Traditional hardware fault tolerance can recover
    from a fault faster than software fault tolerant
    cluster systems.
  • Very few clusters can recover from a fault in
    less than 30 seconds. It often takes much longer.

Definition of High Availability
  • A system is highly available if:
  • No replaceable piece is a single point of
    failure.
  • The system is sufficiently reliable that you are
    likely to be able to repair or replace any broken
    parts before anything else breaks.
  • A single point of failure is a single element of
    hardware or software which, if it fails, brings
    down the entire system.

Summary of High Availability
  • For 24x365 operation (24 hours a day, 365 days a year),
    you must consider things like cooling, power
    supply, and also provide careful system
  • 24x365 operation also implies dealing with
    planned outages and disasters, not just breakage
    and errors.
  • Disregarding power failure, software causes the
    largest number of outages.
  • The longest unplanned outages are caused as much
    by hardware as software (again disregarding power
    failure).

Summary of High Availability cont.
  • Avoid single points of failure.
  • Clusters can help with planned outages and some
    unplanned errors in hardware and software.
  • Hardware-based fault tolerance fails over
    instantaneously, but does not help with software
    errors and planned outages.
  • There is no industry consensus on what high
    availability and fault tolerance mean.

One computer (Alice) watches another computer
(Bozo). If Bozo dies, Alice takes over Bozo's work.

Failover problems
  • If Alice tries to take over control at the
    same time as Bozo comes back up again, we will
    have two computers struggling for control at
    the same time. This can cause a lot of problems.

Avoiding planned outages
  • If we want to upgrade Bozo we can do the
    following:
  • Do a controlled (forced) failover to Alice
  • Upgrade Bozo while Alice is taking care of
    Bozo's work
  • Do a failback to Bozo
  • Alice can now also be upgraded
  • Consequently, one of the advantages of
    clusters is that we do not have to take the
    system down during upgrades and maintenance.
  • Problems may, however, occur when
  • the upgrade includes a change of data format on
    disk, or
  • the software runs in parallel across the
    cluster nodes
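The upgrade sequence above can be traced with a toy model that records which node is serving at each step. The function and its arguments are hypothetical, and the sketch ignores the two problem cases just mentioned (changed on-disk formats and software running in parallel across nodes).

```python
def rolling_upgrade(primary, standby, versions, new_version):
    """Upgrade both nodes one at a time; return (step, serving node) pairs."""
    steps = []

    serving = standby                        # controlled failover
    steps.append(("failover", serving))

    versions[primary] = new_version          # primary is idle, safe to upgrade
    steps.append(("upgrade " + primary, serving))

    serving = primary                        # failback
    steps.append(("failback", serving))

    versions[standby] = new_version          # standby is idle again
    steps.append(("upgrade " + standby, serving))
    return steps


versions = {"Bozo": 1, "Alice": 1}
steps = rolling_upgrade("Bozo", "Alice", versions, new_version=2)
```

At every step exactly one node is serving, and the node being upgraded is never the serving one.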

Moving resources when failing over
  • When an application is moved from one node to
    another the resources that it needs must also be
    moved, e.g. files and IP-addresses.
  • Early high-availability cluster systems left this
    problem to the user, i.e. the user had to write a
    number of shell scripts that were executed during
    a failover.
  • One way to help the user is to define the
    dependencies between different applications and
    resources. The user then only has to define where
    a certain application should go, and the cluster
    software will move the necessary resources along
    with the application.
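One possible way to use such dependency definitions is to compute, for a given application, which resources must move with it and in what order they can be brought online. The sketch below uses Python's standard graphlib module (Python 3.9 or later). The resource names and the dependency map are hypothetical, and this is not how any particular cluster product is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each resource lists what it needs first.
DEPENDS_ON = {
    "web_app": {"database", "ip_address"},
    "database": {"data_disk"},
    "ip_address": set(),
    "data_disk": set(),
}


def resources_needed(app, depends_on):
    """The application plus everything it depends on, directly or indirectly."""
    needed, stack = set(), [app]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(depends_on.get(name, ()))
    return needed


def bring_online_order(app, depends_on):
    """An order in which every resource comes after its dependencies."""
    needed = resources_needed(app, depends_on)
    graph = {name: depends_on.get(name, set()) for name in needed}
    return list(TopologicalSorter(graph).static_order())


order = bring_online_order("web_app", DEPENDS_ON)
```

The user only names the application; the disk, the database and the IP address follow from the dependency map.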

Potential problems when moving resources
  • Resources may depend on individual cluster nodes,
    e.g. a certain disk may only be accessible on a
    certain node.
  • The procedure for bringing resources on-line may
    depend on the node, e.g. a printer queue may
    already be defined on some nodes, and redefining
    it may cause problems.
  • The information about the resource dependencies
    must be available and consistent throughout the
    cluster nodes, even when the node responsible for
    updating this information crashes.

Moving data - replication vs. switchover
  • Moving data from Bozo to Alice can be done in two
    ways:
  • Replication (separate disks/shared nothing, see
    Figure 108)
  • Bozo and Alice have their own separate disks, and
    the changes made on Bozo are continuously sent to
    Alice.
  • As an alternative, the changes on Bozo could be
    sent in batches at certain time intervals.
  • Switchover (shared disk, see Figure 109)
  • A disk (or other storage device) is connected to
    both Bozo and Alice, and when Bozo crashes, Alice
    takes control over the disk.
  • Switchover is often preferred in high
    availability systems

Replication vs. switchover
  • Replication advantages
  • It is easier to add a new node when using
    replication.
  • It can be difficult to synchronize the disks in
    switchover configurations, e.g. the two systems
    must agree on disk partitions, volume names etc.
  • In switchover the disks are in one place. This
    limits the distance between the nodes, and it can
    also be a problem if the room with the disks is
    flooded or hit by some other disaster.
  • Replication can use simpler storage units.
  • The disks do not need to support dual access.
  • The disks themselves are not a single point of
    failure.

Replication vs. switchover cont.
  • Switchover advantages
  • It is easier to back up the disk.
  • Less disk space is required.
  • Less overhead, i.e. when using replication
    Bozo must send copies of each change to Alice, and
    Alice must write these updates to its local
    disks. This uses CPU and I/O capacity.
  • If Bozo waits for Alice to signal that each
    update has been recorded correctly, the
    performance will be degraded. If Bozo does not
    wait, data may be lost when a failure occurs.
  • Failback is easier.
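The trade-off between waiting and not waiting for Alice's acknowledgement can be made concrete with a toy model. The classes below are hypothetical stand-ins for the two nodes. A real system would send updates over a network, and waiting for the reply is what costs performance.

```python
class Replica:
    """Stands in for Alice: stores the updates it has received."""

    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Stands in for Bozo."""

    def __init__(self, replica, synchronous):
        self.data = {}
        self.replica = replica
        self.synchronous = synchronous
        self.pending = []            # updates not yet sent to the replica

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            self.replica.apply(key, value)       # wait until recorded
        else:
            self.pending.append((key, value))    # send later, in a batch

    def flush(self):
        for key, value in self.pending:
            self.replica.apply(key, value)
        self.pending.clear()


def lost_on_crash(primary):
    """Updates the replica would be missing if the primary died right now."""
    return list(primary.pending)


sync_bozo = Primary(Replica(), synchronous=True)
sync_bozo.write("x", 1)

async_bozo = Primary(Replica(), synchronous=False)
async_bozo.write("x", 1)
```

In the synchronous case nothing is pending after a write. In the asynchronous case the update exists only on Bozo until the next flush.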

Avoiding corrupt data - transactions
  • When Bozo crashes, it might corrupt data or leave
    it in an inconsistent state.
  • Transactions are used for avoiding this problem
  • Transactions are usually implemented by having a
    log file on stable storage (e.g. mirrored disk)
  • No matter what happens (assuming the stable
    storage stays stable) a consistent state of the
    data can be recreated from the log file.
  • In replicated systems, transactions are
    implemented by a technique called two-phase
    commit.
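How a consistent state can be recreated from a log file can be sketched as follows. This is a minimal redo-style illustration with a hypothetical record format, (transaction id, operation, key, value). Real transaction logs are considerably more involved, and this is not necessarily the scheme the book describes.

```python
def recover(log):
    """Rebuild a consistent state from the log.

    Only updates belonging to transactions that have a "commit" record
    are replayed; updates from transactions that were still open when
    the node crashed are ignored.
    """
    committed = {txn for txn, op, _, _ in log if op == "commit"}
    state = {}
    for txn, op, key, value in log:
        if op == "set" and txn in committed:
            state[key] = value
    return state


log = [
    (1, "set", "x", 50),
    (1, "set", "y", 150),
    (1, "commit", None, None),
    (2, "set", "x", 0),      # the crash happens before transaction 2 commits
]
state = recover(log)
```

The half-finished transaction 2 leaves no trace in the recovered state, so the data is never seen in an inconsistent intermediate form.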

Failing over communication
  • When Alice takes over the job from Bozo, the
    communication from the client is redirected using
    IP takeover
  • IP takeover is obtained by resetting one (or
    more) of the communication adapters on Alice to
    respond to the IP address(es) that Bozo was
    using.
  • Since most communication protocols have routines
    for retransmission after a time-out limit, the
    client computers never notice the difference.
    However, the people at the client computers
    probably have to log in again, i.e. their
    sessions are usually aborted at failover.
  • An alternative way of failing over communication
    is that each client has multiple IP addresses:
    one for the primary server, one for the secondary
    server, and so on. If the primary server does not
    respond, the client tries to contact the secondary
    server, and so on.
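The client-side alternative can be sketched as a loop over an ordered server list. The addresses below come from a documentation-only range, and fake_connect is a hypothetical stand-in for a real connection attempt.

```python
def first_responding(servers, try_connect):
    """Return the first server, in priority order, that accepts a connection."""
    for address in servers:
        try:
            try_connect(address)
            return address
        except ConnectionError:
            continue                 # move on to the next server in the list
    raise ConnectionError("no server in the list responded")


SERVERS = ["192.0.2.1", "192.0.2.2"]   # hypothetical primary, then secondary
DOWN = {"192.0.2.1"}                   # pretend the primary has crashed


def fake_connect(address):
    """Stand-in for a real connection attempt."""
    if address in DOWN:
        raise ConnectionError(address)


chosen = first_responding(SERVERS, fake_connect)
```

With this approach the failover logic lives in every client, whereas IP takeover keeps the clients unchanged.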

Time for doing a failover
  • The time for reaching a fully operational state
    after a failover can be substantial. In best case
    scenarios the time can be as low as tens of
  • The failover times can be reduced by having pairs
    of processes
  • There is one process on Alice for each process on
    Bozo.
  • Every time the process on Bozo changes its state,
    that change is reflected on the process on Alice.
  • Tandem has claimed that by using this technique,
    sub-second failover is achievable.

Failover to where?
  • This question becomes interesting when there are
    more than two nodes in the cluster
  • Simple add-on high-availability systems often use
    static schemes, e.g. if Bozo dies, put jobs A and
    B on Alice and the rest on Clara.
  • Sophisticated cluster systems provide mechanisms
    for automatic load balancing (possibly also
    considering some user-defined priorities).
  • Dynamic load balancing is easier in shared-disk
    clusters than in shared-nothing clusters. In
    shared-nothing clusters replication is used, and
    this makes the backup order more static.
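The difference between a static scheme and load-based placement can be illustrated as below. The static table mirrors the example above, with job C standing in for "the rest". The job costs, node loads and the greedy least-loaded rule are assumptions made for illustration, not a description of any particular product.

```python
# Static scheme: a fixed table decides where each of Bozo's jobs goes.
STATIC_PLAN = {"Bozo": {"A": "Alice", "B": "Alice", "C": "Clara"}}


def place_static(failed_node, plan):
    return dict(plan[failed_node])


def place_balanced(jobs, load):
    """Greedy placement: the heaviest job first, each on the least loaded node.

    jobs maps job name to cost; load maps surviving node to current load.
    """
    load = dict(load)                # work on a copy
    placement = {}
    for job, cost in sorted(jobs.items(), key=lambda item: -item[1]):
        target = min(load, key=load.get)
        placement[job] = target
        load[target] += cost
    return placement


static = place_static("Bozo", STATIC_PLAN)
balanced = place_balanced({"A": 5, "B": 3, "C": 2}, {"Alice": 4, "Clara": 0})
```

The static plan ignores how busy Alice already is, while the balanced placement takes the current load into account.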

Global locks
  • In a shared-disk system, one must handle the
    problem of system-wide locks when a node crashes.
  • The processes on the node that crashed were
    probably holding resources that processes on
    other nodes will have to use. If the locks are
    not released, the entire system will lock up.
  • There are two ways of handling this problem:
  • Letting each application keep track of the locks
    that it was holding
  • Letting a global lock manager keep track of the
    locks that the applications on the crashed node
    were holding.
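The second alternative, a global lock manager, can be sketched as a table from lock name to owner, with one operation that frees everything a crashed node was holding. The class and method names are hypothetical. A real lock manager must itself survive node failures and must also make sure that the data protected by the freed locks is consistent.

```python
class GlobalLockManager:
    def __init__(self):
        self.owner = {}              # lock name -> (node, process)

    def acquire(self, lock, node, process):
        """Grant the lock if it is free; return True on success."""
        if lock in self.owner:
            return False
        self.owner[lock] = (node, process)
        return True

    def release(self, lock, node, process):
        if self.owner.get(lock) == (node, process):
            del self.owner[lock]

    def release_node(self, node):
        """Free every lock held by processes on a crashed node."""
        freed = [name for name, (n, _) in self.owner.items() if n == node]
        for name in freed:
            del self.owner[name]
        return sorted(freed)


manager = GlobalLockManager()
manager.acquire("table_1", "Bozo", 101)
manager.acquire("table_2", "Alice", 202)
blocked = not manager.acquire("table_1", "Alice", 202)   # Bozo still holds it
freed = manager.release_node("Bozo")                     # Bozo is declared dead
```

After release_node, processes on the surviving nodes can acquire the locks Bozo was holding instead of waiting forever.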

Heartbeats
  • Heartbeat messages are used for detecting when a
    node is dead.
  • Each node sends short messages to the other
    nodes, telling them that the node is alive
  • If a heartbeat message does not arrive within a
    time-out period, the node is declared dead.
  • One problem with this approach is that the
    message could be delayed for various reasons, and
    in that case a node which is declared dead may be
    OK. This can cause a lot of problems.
  • Another problem with this approach is that the
    node may be OK, but the communication link for
    the heartbeat is not OK. This could also lead to
    the dangerous conclusion that an OK node is dead.
  • In order to improve the reliability of the
    heartbeat method the cluster might send heartbeat
    signals on a number of different channels, e.g.
    normal LAN, RS232 serial links, I/O links etc.
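A time-out based detector that listens on several channels can be sketched as follows. Times are passed in explicitly to keep the example deterministic. The class, the channel names and the rule "dead only if all channels are silent" are assumptions made for illustration.

```python
class HeartbeatMonitor:
    """Tracks the last heartbeat seen from each node on each channel."""

    def __init__(self, timeout):
        self.timeout = timeout       # seconds of silence tolerated
        self.last_seen = {}          # (node, channel) -> arrival time

    def record(self, node, channel, now):
        self.last_seen[(node, channel)] = now

    def is_declared_dead(self, node, now):
        # A node is declared dead only if every channel it has been
        # heard on has been silent for longer than the time-out.
        times = [t for (n, _), t in self.last_seen.items() if n == node]
        return bool(times) and all(now - t > self.timeout for t in times)


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("Bozo", "lan", now=0.0)
monitor.record("Bozo", "serial", now=0.0)
monitor.record("Bozo", "serial", now=4.0)   # LAN is silent, serial still works

alive_at_5 = not monitor.is_declared_dead("Bozo", now=5.0)
dead_at_8 = monitor.is_declared_dead("Bozo", now=8.0)
```

With only the LAN channel, Bozo would have been declared dead at time 5 even though it was still running. The second channel avoids that false verdict, but a delayed message on all channels can still produce one.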

Actions when Bozo is declared dead
  • Establish a new heartbeat chain that excludes
    Bozo
  • Inform parallel subsystems that were running on
    Bozo, such as databases, of what has occurred and
    is about to happen
  • Fence Bozo off from its resources (e.g. disks)
  • Form a cluster-wide, consistent plan defining how
    Bozo's resources should be redistributed
  • Execute the plan, i.e. move the resources etc.
  • Inform the subsystems that the resource
    reallocation has been completed
  • Resume normal operation

Alternatives to heartbeats
  • Instead of heartbeats, one can use the opposite
    approach: a liveness check.
  • This means that Alice will at certain points ask
    Bozo if he is OK.
  • A liveness check suffers from the same kind of
    problems as heartbeats, i.e. it is hard to
    guarantee a response within certain limits.
  • If a cluster node has reasons to believe that the
    rest of the system thinks that the node is dead,
    the node had better commit suicide. This could
    happen when a node detects that its heartbeat
    signals have been delayed beyond the time-out
    period.

IBM RS/6000 Cluster Technology (Phoenix)
  • The purpose of Phoenix is to help the developer
    to build cluster-parallel applications that are
    highly available, i.e. Phoenix is a development
    tool and does not do anything by itself.
  • The product is highly scalable: it is designed for
    512 nodes and has been run on clusters with more
    than 400 nodes.
  • There are three core services in Phoenix (see
    Figure 111):
  • Topology Services: This service has no direct
    interface to the application. It manages
    heartbeats and maintains a dynamic map of the
    state of the other cluster nodes.
  • Group Services: The key interface that helps the
    application to deal with high availability issues
    when some event happens.
  • Event Manager: This service provides a way to
    inform a program running anywhere in the cluster
    when something interesting happens.

Microsoft's Clustering Services (MSCS)
  • MSCS currently supports only two-node
    clusters; later versions will, however, support a
    larger number of nodes.
  • MSCS is, unlike Phoenix, a self-contained
    high-availability cluster product.
  • A key component in MSCS is the quorum resource,
    which is usually a disk. The purpose of the
    quorum resource is to make sure that only one of
    the two nodes thinks that it is in charge of the
    cluster.
  • Each node has access to a dynamic, but
    cluster-wide consistent, configuration database.
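The idea behind a quorum resource, that ownership of one shared object decides which node is in charge, can be sketched as below. This is a toy model only: the class and its methods are hypothetical and do not represent the MSCS interface or its actual arbitration protocol.

```python
import threading


class QuorumResource:
    """Stands in for the quorum disk: at most one node can own it."""

    def __init__(self):
        self._lock = threading.Lock()
        self.owner = None

    def try_reserve(self, node):
        """Return True if the node owns the resource after the call."""
        with self._lock:
            if self.owner is None:
                self.owner = node
            return self.owner == node

    def release(self, node):
        with self._lock:
            if self.owner == node:
                self.owner = None


quorum = QuorumResource()
alice_in_charge = quorum.try_reserve("Alice")
bozo_in_charge = quorum.try_reserve("Bozo")
```

If the two nodes lose contact with each other, both may try to take charge, but only the one that holds the quorum resource is allowed to.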

  • The more nodes there are in a cluster, the less
    you pay for high availability, e.g.
  • The additional cost of handling a node failure
    in a one-node system is 100%, i.e. we need two
    computers instead of one.
  • The additional cost of handling a node failure in
    a four-node system is 25%, i.e. we need five
    computers instead of four.
  • One implication of this is that it is desirable to
    use computers that cannot individually fulfill
    the job requirements.
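The arithmetic behind these figures is simply the number of spare nodes divided by the number of nodes needed for the job. The function name and the spares parameter are introduced here for illustration and assume equally priced nodes.

```python
def redundancy_overhead_percent(nodes_needed, spares=1):
    """Extra cost of the spare nodes, relative to the nodes the job needs."""
    if nodes_needed < 1 or spares < 0:
        raise ValueError("need at least one working node and no negative spares")
    return 100.0 * spares / nodes_needed


one_node = redundancy_overhead_percent(1)     # two computers instead of one
four_node = redundancy_overhead_percent(4)    # five computers instead of four
ten_node = redundancy_overhead_percent(10)    # eleven computers instead of ten
```

The overhead falls as the job is spread over more, smaller nodes, which is the reason for preferring computers that cannot do the whole job on their own.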

Disaster Recovery
  • Disasters differ from ordinary failures in that
    they are distributed over an area, e.g. flooding
    of a room, earthquakes etc.
  • Shared disk switchover solutions will not work
    for disasters.
  • Some crude and simple solutions are often used
  • Sending away a backup tape to a remote location
    at certain intervals
  • Sending away a backup electronically to a remote
    location at certain intervals
  • The key difference between disaster recovery and
    normal clustering is the distance between the
    nodes. This causes delays which can strongly
    affect performance.

SMP and CC-NUMA Availability
  • If one processor node in an SMP or a CC-NUMA
    multiprocessor crashes, the entire system will
    crash.
  • There are a number of reasons for this, e.g.
  • The caches on the processor nodes may contain the
    only valid copy of a certain variable.
  • The data structures in the operating system are
    shared between the processors, and if a processor
    crashes it may corrupt the shared data.
