Title: FailSafe SGI
1 FailSafe: SGI's High Availability Solution
- Mayank Vasa
- MTS, Linux FailSafe Gatekeeper
- vasa_at_sgi.com
2 FailSafe - What is it?
- High Availability for business critical applications at a low cost
- User level software running in a clustered environment providing
  - single point of failure recovery
  - cluster administration services (GUI)
  - a simple way to make applications HA aware
3 FailSafe - What it looks like
4 FailSafe - Terminology
- Node: a single Linux image
- Cluster: one or more nodes connected via some interconnect
- Pool: entire set of nodes involved with a group of clusters
- Node Membership: list of nodes in a cluster on which FailSafe can allocate resource groups
5 FailSafe - Terminology (contd.)
- Process Membership: list of process instances in a cluster which form a process group
- Resource: a single physical or logical entity
- Resource Group: collection of inter-dependent resources
  - Cannot overlap
  - Behaves like an atomic unit of failover
  - Must have a unique name throughout the cluster
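The two structural rules for resource groups (no overlap, unique names) can be checked mechanically. The following is a minimal sketch, not FailSafe code: the function name `validate_resource_groups` and the `(name, resources)` input shape are assumptions made for illustration.

```python
def validate_resource_groups(groups):
    """Check a list of (group_name, resources) pairs against the two rules
    from the slide: unique group names and no resource in two groups.

    Returns a list of human-readable error strings (empty if valid).
    """
    errors = []
    seen_names = set()
    owner = {}  # resource -> name of the first group that claimed it
    for name, resources in groups:
        if name in seen_names:
            errors.append(f"duplicate resource group name: {name}")
        seen_names.add(name)
        for res in resources:
            if res in owner and owner[res] != name:
                errors.append(
                    f"resource {res} is in both {owner[res]} and {name}"
                )
            else:
                owner.setdefault(res, name)
    return errors
```

For example, two groups that both list the same IP address would be reported as overlapping, while two disjoint groups with distinct names produce no errors.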
6 FailSafe - Terminology (contd.)
- Failover: process of moving a resource group from one node to another
- Failover Policy: method used by FailSafe to determine the destination node of a failover
- Failover Domain: ordered list of nodes on which a given resource group can be allocated
7 FailSafe - Terminology (contd.)
- Failover Attributes: Auto Failback, Controlled Failback, InPlace Recovery
- Failover policy script: shell script which generates an ordered set of node names on which the resource group can be placed
- Action scripts: scripts which determine how a resource is started, stopped and monitored
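To make the relationship between failover domain, node membership and failover policy script concrete, here is a small sketch. In FailSafe the policy script is a shell script; Python is used here only for illustration, and the function names and the "keep the domain order, drop non-members" rule are assumptions, not the behavior of any shipped policy script.

```python
def ordered_policy(failover_domain, node_membership):
    """Return the nodes of the failover domain, in domain order,
    that are currently in the node membership."""
    members = set(node_membership)
    return [node for node in failover_domain if node in members]


def pick_destination(failover_domain, node_membership, current_owner=None):
    """Pick the first eligible node other than the current owner,
    or None if no node can take the resource group."""
    for node in ordered_policy(failover_domain, node_membership):
        if node != current_owner:
            return node
    return None
```

With a failover domain of `["n1", "n2", "n3"]` and all three nodes up, a resource group running on `n1` would move to `n2` under this rule.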
8 FailSafe - Architecture
9 FailSafe - Acronyms (so many!)
- CMS: Cluster Membership Service
- GCS: Group Communication Service
- SRM: System Resource Manager
- CRS: Cluster Reset Service
- CAD: Cluster Administration Daemon
- CDB: Cluster Database
- CDBD: Cluster Database Daemon
10 FailSafe - Cluster Database
- Repository for all cluster configuration
- Dynamic changes supported
- Consistency is automatically supported
- Replicated in all nodes of the pool
- Provides read and write transactional semantics
11 FailSafe - Cluster Database Daemon
- Controls read and write accesses to the CDB
- Notifies clients of dynamic changes to the CDB
- Keeps global portions of the CDB consistent across the pool
12 FailSafe - Cluster Administration Daemon
- Daemon responsible for dynamically updating the GUI
- CAD is a client of CDBD
- CDBD notifies CAD of any changes
- Provides notification (default: email) of status changes in node, cluster or resource groups
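The CDBD/CAD relationship described on these slides resembles a publish/subscribe arrangement: clients register with the daemon, writes go through it, and every registered client hears about each change. The sketch below is a single-process toy under that assumption. The class and method names are hypothetical and do not reflect the real CDBD interface, and replication across the pool is not modeled.

```python
class ClusterDatabaseDaemon:
    """Toy stand-in for CDBD: guards a dict and notifies subscribers."""

    def __init__(self):
        self._data = {}
        self._subscribers = []

    def subscribe(self, callback):
        """Register a client (e.g. a CAD-like process) for change notices."""
        self._subscribers.append(callback)

    def read(self, key):
        return self._data.get(key)

    def write(self, changes):
        """Apply all changes or none, then notify every subscriber."""
        # Stage on a copy so a bad update leaves the database untouched.
        staged = dict(self._data)
        for key, value in changes.items():
            if not isinstance(key, str):
                raise TypeError("keys must be strings")
            staged[key] = value
        self._data = staged
        for callback in self._subscribers:
            callback(dict(changes))
```

A CAD-like client would call `subscribe` once and refresh its view whenever the callback fires, rather than polling the database.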
13 FailSafe - Cluster Membership Service
- Provides cluster node membership information to its clients
- Node membership information includes
  - Nodes that are currently part of the cluster
  - Node status, i.e. up, down or unknown
  - Node name
  - IP address currently being used for inter-CMSD communication
- Inactive cluster node membership information is also provided
14 FailSafe - Cluster Membership Service (contd.)
- Any change in cluster status results in a node membership message issued by CMSD to its clients on all nodes of the cluster
- CMSD implements failstop and quorum policy
- CMSDs monitor each other by exchanging heartbeat messages directly with each other
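Heartbeat monitoring and a quorum check can be sketched in a few lines. The slides do not state which quorum rule or heartbeat timeout CMSD uses, so the strict-majority rule, the timeout parameter, and both function names below are assumptions for illustration only.

```python
def live_nodes(last_heartbeat, now, timeout):
    """Return the set of nodes whose most recent heartbeat is no older
    than `timeout` seconds. `last_heartbeat` maps node name -> timestamp."""
    return {node for node, seen in last_heartbeat.items() if now - seen <= timeout}


def has_quorum(live, configured):
    """Assumed rule: a strict majority of the configured nodes must be live."""
    configured = set(configured)
    return 2 * len(set(live) & configured) > len(configured)
```

Under this rule a three-node cluster keeps quorum with two live nodes, while a four-node cluster split two and two has quorum on neither side.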
15 FailSafe - Group Communication Service
- Provides a consistent view of process group memberships in presence of process failures, new processes joining, and changing node memberships
- Provides a reliable ordered atomic messaging service to members of the process group under changing node and group memberships
- GCS operates in the context of a cluster as defined by CMS
16 FailSafe - System Resource Manager
- Manages the resources and resource groups in a cluster
- Co-ordinates access to physically shared resources
- Monitors availability of resources
- Performs local failover of resources
- Maps a set of resources into a resource group
- Atomically allocates resource groups
17 FailSafe - FailSafe Daemon
- A policy implementor for Resource Groups (RG)
- Provides the ability to enable/disable monitoring of an application dynamically
- Provides the ability to fail over an application if monitoring fails
- Failover can be either local (restart) or remote
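The choice between local and remote failover can be sketched as a small decision function. This is an illustration under assumptions, not the daemon's actual logic: the function name, the `max_local_restarts` parameter, and the "try a local restart first, then the first remote candidate" order are hypothetical.

```python
def handle_monitor_failure(restart_locally, candidates, max_local_restarts=1):
    """Decide how to recover a resource group whose monitoring failed.

    restart_locally: callable returning True if the local restart worked.
    candidates: ordered list of other nodes that may host the group.
    Returns ("local", None), ("remote", node) or ("offline", None).
    """
    for _ in range(max_local_restarts):
        if restart_locally():
            return ("local", None)
    if candidates:
        return ("remote", candidates[0])
    return ("offline", None)
```

The `candidates` list would come from the failover policy, so the same function works regardless of how the destination order is computed.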
18 FailSafe - FailSafe Daemon (contd.)
- Failover Policy Module (PM)
- PM's components
  - Failover script
  - Initial Failover Domain
  - Attributes
19 FailSafe - Cluster Reset Service
- Provides reset facility in a cluster upon request from one of its clients
- Provides facility to monitor each reset line that connects to a machine that it is expected to reset
- Special reset network to ensure connectivity for resetting remote machines
20 FailSafe - Agents
- Glue between a resource type and the FailSafe daemon
- Collection of action scripts and binaries that the action scripts could be calling
- Goal: make a resource a highly available service
- Examples: a file server agent, a web server agent, an agent for making an IP address, a filesystem or a volume highly available
21 FailSafe - Action Scripts
- Determine how a resource is started, stopped and monitored
- Action scripts are per resource type
- Types: start, stop, monitor, exclusive, restart
- Return status for each resource acted on
- Called by SRM
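The calling convention described above (one action applied to several resources of a type, with a status per resource) can be sketched as a small dispatcher. Real action scripts are separate scripts invoked by SRM; the function name, the handler table, and the 0/1 status values here are assumptions for illustration.

```python
ACTIONS = ("start", "stop", "monitor", "exclusive", "restart")


def run_action(action, resources, handlers):
    """Run one action for every resource and collect a status per resource.

    handlers maps an action name to a callable taking a resource name and
    returning True on success. Status 0 means success, 1 means failure
    (an assumed convention, not FailSafe's documented exit codes).
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    handler = handlers[action]
    statuses = {}
    for resource in resources:
        try:
            statuses[resource] = 0 if handler(resource) else 1
        except Exception:
            # A crashing handler counts as a failure for that resource only.
            statuses[resource] = 1
    return statuses
```

Reporting a status per resource, rather than one status for the whole call, lets the caller act only on the resources that actually failed.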
22 FailSafe - Related HA Technologies
- A journaled file system for fast recovery
  - FailSafe can support multiple journaled filesystems such as XFS, GFS, ext3fs
- Volume manager for disk failures (LVM)
- Network mirroring
- Monitoring tool (mon)
23 FailSafe - Docs, Contacts
- Documentation: http://oss.sgi.com/projects/failsafe/
- Contact: failsafe_at_oss.sgi.com
24 FailSafe - Q & A
- Questions - Sure!
- Answers ... Well, maybe :)