Title: Clusters Part 4 Systems
1. Clusters Part 4 - Systems
- Lars Lundberg
- The slides in this presentation cover Part 4 (Chapters 12-15) in Pfister's book. We will, however, only present slides for Chapter 12.
- This part is the most important one in Pfister's book!
2. High Availability
- What we today call high availability was previously called fault tolerance.
- Traditionally, there have been hardware fault-tolerant systems. This means that faults are handled entirely by the hardware, and the software does not have to care.
- Cluster systems offer fault tolerance in software, i.e. they use standard hardware.
3. Classes of Availability
4. Measuring Availability
- Availability is usually measured as the percentage of time that a system is available, assuming that the system is either fully available or not available at all.
- Potential problems when measuring availability:
  - What if the system is partly available?
  - Should we include periods when the system is not used?
  - Should we include planned outages for maintenance etc.? Planned outages can be a real problem in non-stop operation environments.
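The percentage measure above can be illustrated with a minimal sketch. The function name and the outage figures are assumptions made up for the example, not numbers from Pfister's book; the point is only that the result depends on whether planned outages are counted.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def availability_percent(total_hours, outage_hours):
    """Return the percentage of the period during which the system was up.

    Assumes the system is either fully available or not available at all.
    """
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    if not 0 <= outage_hours <= total_hours:
        raise ValueError("outage_hours must lie between 0 and total_hours")
    return 100.0 * (total_hours - outage_hours) / total_hours


# Hypothetical outage figures for one year of operation.
unplanned_hours = 4.0
planned_hours = 20.0

# Counting only unplanned outages gives a higher figure than counting both.
unplanned_only = availability_percent(HOURS_PER_YEAR, unplanned_hours)
all_outages = availability_percent(HOURS_PER_YEAR, unplanned_hours + planned_hours)
print(f"unplanned only: {unplanned_only:.3f}%  all outages: {all_outages:.3f}%")
```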
5. High Availability vs. Continuous operation
- If we separate the planned outages (maintenance, upgrades etc.) from the unplanned ones (crashes, faults etc.), we can make the distinction between:
  - High availability (few and short unplanned outages)
  - Continuous operation (few and short planned and unplanned outages)
- High availability and continuous operation are not always equally important.
6. Reasons for unplanned outages
- Loss of power
- Application software
- Operating system software
- Subsystem software (e.g. databases)
- Hardware with moving parts (e.g. disks, fans, printers)
- I/O adapters
- Memory
- Processors, caches etc.
7. Outage Duration
- Hardware does not break as often as software, but when it does, it takes longer to repair.
- Traditional hardware fault tolerance can recover from a fault faster than software fault-tolerant cluster systems.
- Very few clusters can recover from a fault in less than 30 seconds. It often takes much longer.
8. Definition of High Availability
- A system is highly available if:
  - No replaceable piece is a single point of failure.
  - The system is sufficiently reliable that you are likely to be able to repair or replace any broken parts before anything else breaks.
- A single point of failure is a single element of hardware or software which, if it fails, brings down the entire system.
9. Summary of High Availability
- For 24x365 operation (24 hours a day, 365 days a year), you must consider things like cooling and power supply, and also provide careful system management.
- 24x365 operation also implies dealing with planned outages and disasters, not just breakage and errors.
- Disregarding power failure, software causes the largest number of outages.
- The longest unplanned outages are caused as much by hardware as by software (again disregarding power failure).
10. Summary of High Availability cont.
- Avoid single points of failure.
- Clusters can help with planned outages and some unplanned errors in hardware and software.
- Hardware-based fault tolerance fails over instantaneously, but does not help with software errors and planned outages.
- There is no industry consensus on what high availability and fault tolerance mean.
11. Failover
One computer (Alice) is watching another computer (Bozo). If Bozo dies, Alice takes over Bozo's work.
12. Failover problems
- If Alice tries to take over control at the same time as Bozo comes back up again, we will have two computers struggling for control at the same time. This can cause a lot of problems.
13. Avoiding planned outages
- If we want to upgrade Bozo, we can do the following:
  - Do a controlled (forced) failover to Alice
  - Upgrade Bozo while Alice is taking care of business
  - Do a failback to Bozo
  - Alice can now also be upgraded
- Consequently, one of the advantages of clusters is that we do not have to take the system down during upgrades and maintenance.
- Problems may, however, occur when
  - the upgrade includes a change of data format on disk, or
  - the software runs in parallel across the cluster nodes
14. Moving resources when failing over
- When an application is moved from one node to another, the resources that it needs must also be moved, e.g. files and IP addresses.
- Early high-availability cluster systems left this problem to the user, i.e. the user had to write a number of shell scripts that were executed during a failover.
- One way to help the user is to define the dependencies between different applications and resources. The user then only has to define where a certain application should go, and the cluster software will move the necessary resources along with the application.
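As a rough illustration of the dependency idea, the sketch below derives which resources must follow an application and in which order they could be brought up on the new node. The resource names and the dependency map are hypothetical, and the sketch is not how any particular cluster product does it; it uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the names in its set.
DEPENDENCIES = {
    "web_app": {"database", "service_ip"},
    "database": {"data_volume"},
    "service_ip": set(),
    "data_volume": set(),
}


def resources_to_move(app, deps):
    """Return app and everything it needs, dependencies before dependents."""
    # Collect the application and all resources it depends on, transitively.
    needed = set()
    stack = [app]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(deps.get(name, ()))
    # Order them so that each resource comes after the ones it depends on.
    sub_graph = {name: set(deps.get(name, ())) for name in needed}
    return list(TopologicalSorter(sub_graph).static_order())


# The user only names the application; the rest is derived.
database_order = resources_to_move("database", DEPENDENCIES)
web_app_order = resources_to_move("web_app", DEPENDENCIES)
print(database_order)
print(web_app_order)
```

The same order, reversed, could be used to take the resources offline on the old node during a controlled failover.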
15. Potential problems when moving resources
- Resources may depend on individual cluster nodes, e.g. a certain disk may only be accessible on a certain node.
- The procedure for bringing resources on-line may depend on the node, e.g. a printer queue may already be defined on some nodes, and redefining it may cause problems.
- The information about the resource dependencies must be available and consistent throughout the cluster nodes, even when the node responsible for updating this information crashes.
16. Moving data - replication vs. switchover
- Moving data from Bozo to Alice can be done in two ways:
- Replication (separate disks/shared nothing, see Figure 108)
  - Bozo and Alice have their own separate disks, and the changes made on Bozo are continuously sent to Alice.
  - As an alternative, the changes on Bozo could be sent in batches at certain time intervals.
- Switchover (shared disk, see Figure 109)
  - A disk (or other storage device) is connected to both Bozo and Alice, and when Bozo crashes, Alice takes control of the disk.
- Switchover is often preferred in high-availability systems.
17. Replication vs. switchover
- Replication advantages
  - It is easier to add a new node when using replication.
  - It can be difficult to synchronize the disks in switchover configurations, e.g. the two systems must agree on disk partitions, volume names etc.
  - In switchover, the disks are in one place. This limits the distance between the nodes, and can also be a problem if the room with the disks is flooded or hit by some other disaster.
  - Replication can use simpler storage units because
    - The disks do not need to support dual access
    - The disks themselves are not a single point of failure
18. Replication vs. switchover
- Switchover advantages
  - Easier to back up the disk
  - Less disk space is required
  - Less overhead, i.e. when using replication, Bozo must send copies of the changes to Alice, and Alice must write these updates to its local disks. This uses CPU and I/O capacity.
  - If Bozo waits for Alice to signal that each update has been recorded correctly, performance will be degraded. If Bozo does not wait, data may be lost when a failure occurs.
  - Failback is easier.
19. Avoiding corrupt data - transactions
- When Bozo crashes, it might corrupt data or leave it in an inconsistent state.
- Transactions are used to avoid this problem.
- Transactions are usually implemented by having a log file on stable storage (e.g. mirrored disk).
- No matter what happens (assuming the stable storage stays stable), a consistent state of the data can be recreated from the log file.
- In replicated systems, transactions are implemented by a technique called two-phase commit.
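How a consistent state can be recreated from a log can be sketched as follows. This is a deliberately simplified redo-only log held in memory; the record layout (`begin`, `write`, `commit` tuples) and the account names are assumptions for illustration, and a real system would keep the log on stable storage.

```python
def recover(log):
    """Rebuild a consistent state by redoing only committed transactions.

    Each log record is a tuple: ("begin", tx), ("write", tx, key, value)
    or ("commit", tx). Writes of transactions without a commit record
    are ignored, so a half-finished transaction leaves no trace.
    """
    committed = {record[1] for record in log if record[0] == "commit"}
    state = {}
    for record in log:
        if record[0] == "write" and record[1] in committed:
            _, _tx, key, value = record
            state[key] = value
    return state


# Hypothetical log: t1 moved 50 units between accounts and committed,
# t2 had written only one of its updates when the node crashed.
log = [
    ("begin", "t1"),
    ("write", "t1", "account_a", 50),
    ("write", "t1", "account_b", 150),
    ("commit", "t1"),
    ("begin", "t2"),
    ("write", "t2", "account_a", 0),
    # crash: no commit record for t2
]

state = recover(log)
print(state)
```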
20. Failing over communication
- When Alice takes over the job from Bozo, the communication from the client is redirected using IP takeover.
- IP takeover is obtained by resetting one (or more) of the communication adapters on Alice to respond to the IP address(es) that Bozo was using.
- Since most communication protocols have routines for retransmission after a time-out limit, the client computers never know the difference. However, the people at the client computers probably have to log in again, i.e. their sessions are usually aborted at failover.
- An alternative way of failing over communication is that each client has multiple IP addresses: the primary server, the secondary server and so on. If the primary server does not respond, the client tries to contact the secondary server and so on.
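The client-side alternative can be sketched as a loop over an ordered server list. The function name, the time-out value and the addresses are assumptions for illustration; `socket.create_connection` is the standard-library call used by default, and the `connect` parameter exists only so the logic can be exercised without a network.

```python
import socket


def connect_with_failover(servers, timeout=2.0, connect=socket.create_connection):
    """Try each (host, port) in order and return (server, connection).

    The first server in the list is the primary, the next the secondary
    and so on. A server that refuses or times out is skipped.
    """
    last_error = None
    for server in servers:
        try:
            return server, connect(server, timeout=timeout)
        except OSError as exc:  # includes refused connections and time-outs
            last_error = exc
    raise ConnectionError("no server in the list responded") from last_error


# Hypothetical addresses from the documentation range 192.0.2.0/24.
SERVERS = [("192.0.2.1", 8080), ("192.0.2.2", 8080)]
```

Note that, as with IP takeover, any session state held by the failed server is lost, so the user probably still has to log in again.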
21. Time for doing a failover
- The time for reaching a fully operational state after a failover can be substantial. In best-case scenarios the time can be as low as tens of seconds.
- The failover times can be reduced by having pairs of processes:
  - There is one process on Alice for each process on Bozo.
  - Every time the process on Bozo changes its state, that change is reflected in the process on Alice.
- Tandem has claimed that by using this technique, sub-second failover is achievable.
22. Failover to where?
- This question becomes interesting when there are more than two nodes in the cluster.
- Simple add-on high-availability systems often use static schemes, e.g. if Bozo dies, put jobs A and B on Alice and the rest on Clara.
- Sophisticated cluster systems provide mechanisms for automatic load balancing (possibly also considering some user-defined priorities).
- Dynamic load balancing is easier in shared-disk clusters than in shared-nothing clusters. In shared-nothing clusters replication is used, and this makes the backup order more static.
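The difference between a static scheme and load balancing can be sketched as two placement functions. The job names, load figures and the greedy least-loaded rule are assumptions chosen for illustration; real cluster software may use quite different policies.

```python
# Static scheme from the slide: jobs A and B go to Alice, the rest to Clara.
STATIC_PLAN = {"A": "Alice", "B": "Alice"}
DEFAULT_TARGET = "Clara"


def static_placement(jobs):
    """Place Bozo's jobs according to a fixed, pre-configured table."""
    return {job: STATIC_PLAN.get(job, DEFAULT_TARGET) for job in jobs}


def balanced_placement(job_loads, node_loads):
    """Greedy sketch: heaviest job first, each onto the least-loaded node.

    job_loads maps each of Bozo's jobs to its load; node_loads maps each
    surviving node to its current load. Ties are broken by name.
    """
    loads = dict(node_loads)
    placement = {}
    for job, load in sorted(job_loads.items(), key=lambda kv: (-kv[1], kv[0])):
        target = min(loads, key=lambda node: (loads[node], node))
        placement[job] = target
        loads[target] += load
    return placement


# Hypothetical situation when Bozo dies: Alice is already busy, Clara is not.
bozo_jobs = {"A": 3, "B": 2, "C": 1}
survivors = {"Alice": 4, "Clara": 1}

static_result = static_placement(bozo_jobs)
balanced_result = balanced_placement(bozo_jobs, survivors)
print(static_result)
print(balanced_result)
```

With these figures the static table sends both A and B to the already busy Alice, while the balanced variant spreads the jobs according to the current load.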
23. Global locks
- In a shared-disk system, one must handle the problem of system-wide locks when a node crashes.
- The processes on the node that crashed were probably holding resources that processes on other nodes will have to use. If the locks are not released, the entire system will lock up.
- There are two ways of handling this problem:
  - Letting the applications keep track of the locks that they were holding
  - Letting a global lock manager keep track of the locks that the applications on the crashed node were holding.
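The second approach can be sketched as a lock table that records which node holds each lock, so that all locks of a crashed node can be released in one step. The class and method names are hypothetical, and the sketch ignores the hard part, namely making the lock manager itself survive a node failure.

```python
class GlobalLockManager:
    """In-memory sketch of a cluster-wide lock table keyed by resource."""

    def __init__(self):
        self._owner = {}  # resource name -> node currently holding the lock

    def acquire(self, resource, node):
        """Grant the lock if it is free or already held by the same node."""
        holder = self._owner.get(resource)
        if holder is not None and holder != node:
            return False
        self._owner[resource] = node
        return True

    def release(self, resource, node):
        """Release a lock, but only if the given node actually holds it."""
        if self._owner.get(resource) == node:
            del self._owner[resource]

    def release_node(self, node):
        """Release every lock held by a crashed node; return their names."""
        freed = sorted(r for r, owner in self._owner.items() if owner == node)
        for resource in freed:
            del self._owner[resource]
        return freed


manager = GlobalLockManager()
manager.acquire("table_orders", "Bozo")
manager.acquire("table_stock", "Bozo")
blocked = manager.acquire("table_orders", "Alice")   # Bozo still holds it
freed = manager.release_node("Bozo")                 # Bozo is declared dead
granted = manager.acquire("table_orders", "Alice")   # now succeeds
```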
24. Heartbeats
- Heartbeat messages are used for detecting when a node is dead.
- Each node sends short messages to the other nodes, telling them that the node is alive.
- If a heartbeat message does not arrive within a time-out period, the node is declared dead.
- One problem with this approach is that the message could be delayed for various reasons, in which case a node that is declared dead may actually be OK. This can cause a lot of problems.
- Another problem with this approach is that the node may be OK while the communication link for the heartbeat is not. This could also lead to the dangerous conclusion that an OK node is dead.
- In order to improve the reliability of the heartbeat method, the cluster might send heartbeat signals on a number of different channels, e.g. normal LAN, RS232 serial links, I/O links etc.
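The time-out rule and the use of several channels can be sketched as below. The class name, channel names and time values are assumptions for illustration; the clock is passed in explicitly so that the behaviour is easy to follow. A node is only declared dead when no channel has delivered a heartbeat within the time-out, which guards against a single broken link but not against messages that are merely delayed.

```python
class HeartbeatMonitor:
    """Track the most recent heartbeat per (node, channel) pair."""

    def __init__(self, timeout):
        self.timeout = timeout
        self._last_seen = {}  # (node, channel) -> time of last heartbeat

    def record(self, node, channel, now):
        """Note that a heartbeat from node arrived on channel at time now."""
        self._last_seen[(node, channel)] = now

    def is_alive(self, node, now):
        """Alive if any channel delivered a heartbeat within the time-out."""
        return any(
            now - seen <= self.timeout
            for (name, _channel), seen in self._last_seen.items()
            if name == node
        )


monitor = HeartbeatMonitor(timeout=3.0)
monitor.record("Bozo", "lan", now=10.0)
monitor.record("Bozo", "serial", now=12.0)

# At t=14 the LAN heartbeat is too old, but the serial one is recent enough.
alive_at_14 = monitor.is_alive("Bozo", now=14.0)
# At t=16 both channels have been silent for longer than the time-out.
alive_at_16 = monitor.is_alive("Bozo", now=16.0)
```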
25. Actions when Bozo is declared dead
- Establish a new heartbeat chain that excludes Bozo
- Inform parallel subsystems that were running on Bozo, such as databases, of what has occurred and is about to happen
- Fence Bozo off from its resources (e.g. disks)
- Form a cluster-wide, consistent plan defining how Bozo's resources should be redistributed
- Execute the plan, i.e. move the resources etc.
- Inform the subsystems that the resource reallocation has been completed
- Resume normal operation
26. Alternatives to heartbeats
- Instead of heartbeats, one can use the opposite approach: a liveness check.
- This means that Alice will at certain points ask Bozo if he is OK.
- A liveness check suffers from the same kind of problems as heartbeats, i.e. it is hard to guarantee a response within certain limits.
- If a cluster node has reason to believe that the rest of the system thinks that the node is dead, the node had better commit suicide. This could happen when a node detects that its heartbeat signals have been delayed beyond the time-out limit.
27. IBM RS/6000 Cluster Technology (Phoenix)
- The purpose of Phoenix is to help the developer build cluster-parallel applications that are highly available, i.e. Phoenix is a development tool and does not do anything by itself.
- The product is highly scalable: it is designed for 512 nodes and has been run on clusters with more than 400 nodes.
- There are three core services in Phoenix (see Figure 111):
  - Topology Services: This service has no direct interface to the application. It manages heartbeats and maintains a dynamic map of the state of the other cluster nodes.
  - Group Services: The key interface that helps the application deal with high-availability issues when some event happens.
  - Event Manager: This service provides a way to inform a program running anywhere in the cluster when something interesting happens.
28. Microsoft's Clustering Services (MSCS)
- MSCS currently supports only two-node clusters; later versions will, however, support a larger number of nodes.
- MSCS is, unlike Phoenix, a self-contained high-availability cluster product.
- A key component in MSCS is the quorum resource, which is usually a disk. The purpose of the quorum resource is to make sure that only one of the two nodes thinks that it is in charge of the cluster.
- Each node has access to a dynamic, but cluster-wide consistent, configuration database.
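The idea behind a quorum resource can be sketched as an exclusive reservation that at most one node can hold. This is a generic illustration and not the MSCS interface: the class and method names are hypothetical, and an in-process lock stands in for what would in practice be a reservation on a shared disk.

```python
import threading


class QuorumResource:
    """Sketch of a resource that at most one node can reserve at a time."""

    def __init__(self):
        self._guard = threading.Lock()  # stands in for the disk's arbitration
        self._owner = None

    def try_reserve(self, node):
        """Reserve the resource if it is free; report whether node owns it."""
        with self._guard:
            if self._owner is None:
                self._owner = node
            return self._owner == node

    def release(self, node):
        """Give up the reservation, but only if node is the current owner."""
        with self._guard:
            if self._owner == node:
                self._owner = None


quorum = QuorumResource()

# Both nodes lose contact with each other and each tries to take charge.
alice_in_charge = quorum.try_reserve("Alice")
bozo_in_charge = quorum.try_reserve("Bozo")
```

Only the node that obtains the reservation continues to run the cluster; the other one must stand down.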
29. Scaling
- The more nodes there are in a cluster, the less you pay for high availability, e.g.
  - The additional cost for handling a node failure in a one-node system is 100%, i.e. we need two computers instead of one.
  - The additional cost of handling a node failure in a four-node system is 25%, i.e. we need five computers instead of four.
- One implication of this is that it is desirable to use computers that cannot individually fulfill the job requirements.
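The arithmetic behind the two examples can be written out as a small function. It assumes, as the examples do, that all nodes cost the same and that exactly one spare node is added; the function name is made up for the illustration.

```python
def spare_overhead_percent(nodes_needed):
    """Extra cost, in percent, of adding one spare to nodes_needed nodes."""
    if nodes_needed < 1:
        raise ValueError("nodes_needed must be at least 1")
    return 100.0 / nodes_needed


for n in (1, 2, 4, 10):
    print(f"{n} node(s) + 1 spare: {spare_overhead_percent(n):.0f}% extra cost")
```

The overhead falls as 1/N, which is why a job spread over many small computers is cheaper to protect than the same job on one large computer.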
30. Disaster Recovery
- Disasters differ from ordinary failures in that they are distributed over an area, e.g. flooding of a room, earthquakes etc.
- Shared-disk switchover solutions will not work for disasters.
- Some crude and simple solutions are often used:
  - Sending a backup tape to a remote location at certain intervals
  - Sending a backup electronically to a remote location at certain intervals
- The key difference between disaster recovery and normal clustering is the distance between the nodes. This causes delays which can strongly affect performance.
31. SMP and CC-NUMA Availability
- If one processor node in an SMP or a CC-NUMA multiprocessor crashes, the entire system will crash.
- There are a number of reasons for this, e.g.
  - The caches on the processor nodes may contain the only valid copy of a certain variable.
  - The data structures in the operating system are shared between the processors, and if a processor crashes it may corrupt the shared data.