
Title: FailSafe SGI


1
FailSafe - SGI's High Availability Solution
  • Mayank Vasa
  • MTS, Linux FailSafe Gatekeeper
  • vasa_at_sgi.com

2
FailSafe - What is it?
  • High availability for business-critical
    applications at a low cost
  • User-level software running in a clustered
    environment, providing:
    • recovery from single points of failure
    • cluster administration services (GUI)
    • a simple way to make applications HA-aware

3
FailSafe - What it looks like
  • [Diagram not included in transcript]

4
FailSafe - Terminology
  • Node: a single Linux image
  • Cluster: one or more nodes connected via some
    interconnect
  • Pool: the entire set of nodes involved with a group
    of clusters
  • Node Membership: the list of nodes in a cluster on
    which FailSafe can allocate resource groups

5
FailSafe - Terminology (contd.)
  • Process Membership: the list of process instances in
    a cluster which form a process group
  • Resource: a single physical or logical entity
  • Resource Group: a collection of inter-dependent
    resources
    • Resource groups cannot overlap
    • Behaves like an atomic unit of failover
    • Must have a unique name throughout the cluster
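
The two resource group rules above (no overlap, unique names) can be checked mechanically. The following is a minimal Python sketch, not FailSafe code: the function name `validate_resource_groups` and the list-of-pairs input format are assumptions made for illustration.

```python
def validate_resource_groups(groups):
    """Check a list of (group_name, [resources]) pairs.

    Hypothetical helper: returns a list of human-readable problems,
    empty if the groups have unique names and share no resources.
    """
    problems = []
    seen_names = set()
    owners = {}  # resource -> name of the first group that claimed it
    for name, resources in groups:
        if name in seen_names:
            problems.append(f"duplicate group name: {name}")
        seen_names.add(name)
        for resource in resources:
            owner = owners.setdefault(resource, name)
            if owner != name:
                problems.append(f"{resource} is in both {owner} and {name}")
    return problems
```

Calling it with two groups that both list the same filesystem would report the overlap; an empty list means the configuration satisfies both rules.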

6
FailSafe - Terminology (contd.)
  • Failover: the process of moving a resource group
    from one node to another
  • Failover Policy: the method used by FailSafe to
    determine the destination node of a failover
  • Failover Domain: an ordered list of nodes on which
    a given resource group can be allocated

7
FailSafe - Terminology (contd.)
  • Failover Attributes: Auto Failback, Controlled
    Failback, InPlace Recovery
  • Failover policy script: a shell script which
    generates an ordered set of node names on which
    the resource group can be placed
  • Action scripts: scripts which determine how a
    resource is started, stopped and monitored
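
To illustrate what a failover policy script produces, here is a minimal sketch of the ordering logic. Real FailSafe policy scripts are shell scripts; this Python version only mirrors the idea. The function name `ordered_candidates` and its parameters are hypothetical, and the rule used (keep failover-domain order, drop nodes that are not current members) is an assumed simple policy, not necessarily one FailSafe ships.

```python
def ordered_candidates(failover_domain, members, current_node=None):
    """Return the nodes a resource group could be placed on, in order.

    failover_domain: ordered list of node names (highest preference first)
    members:         set of nodes currently in the node membership
    current_node:    node the group is failing away from, if any
    """
    return [
        node
        for node in failover_domain
        if node in members and node != current_node
    ]
```

With a failover domain of `["n1", "n2", "n3"]` and only `n2` and `n3` in the membership, the result is `["n2", "n3"]`, so `n2` would be the preferred destination.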

8
FailSafe - Architecture
  • [Diagram not included in transcript]

9
FailSafe - Acronyms (so many!)
  • CMS: Cluster Membership Service
  • GCS: Group Communication Service
  • SRM: System Resource Manager
  • CRS: Cluster Reset Service
  • CAD: Cluster Administration Daemon
  • CDB: Cluster Database
  • CDBD: Cluster Database Daemon

10
FailSafe - Cluster Database
  • Repository for all cluster configuration
  • Dynamic changes supported
  • Consistency is maintained automatically
  • Replicated on all nodes of the pool
  • Provides read and write transactional semantics

11
FailSafe - Cluster Database Daemon
  • Controls read and write accesses to the CDB
  • Notifies clients of dynamic changes to the CDB
  • Keeps global portions of the CDB consistent
    across the pool

12
FailSafe - Cluster Administration Daemon
  • Daemon responsible for dynamically updating the
    GUI
  • CAD is a client of CDBD
  • CDBD notifies CAD of any changes
  • Provides notification (default email) of status
    changes in node, cluster or resource groups

13
FailSafe - Cluster Membership Service
  • Provides cluster node membership information to
    its clients
  • Node membership information includes:
    • nodes that are currently part of the cluster
    • node status, i.e. up, down or unknown
    • node name
    • IP address currently being used for inter-CMSD
      communication
  • Inactive cluster node membership information is
    also provided

14
FailSafe - Cluster Membership Service (contd.)
  • Any change in cluster status results in a node
    membership message issued by CMSD to its clients
    on all nodes of the cluster
  • CMSD implements failstop and quorum policy
  • CMSDs monitor each other by exchanging heartbeat
    messages directly with each other
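
The heartbeat and quorum ideas can be sketched in a few lines of Python. This is an illustration under stated assumptions, not CMSD's algorithm: the slides do not say how quorum is computed, so a strict-majority rule is assumed here, and the names `live_nodes`, `has_quorum` and the `timeout` parameter are hypothetical.

```python
def live_nodes(last_heartbeat, now, timeout):
    """Return nodes whose most recent heartbeat is within `timeout` seconds.

    last_heartbeat maps node name -> timestamp of the last heartbeat seen.
    """
    return sorted(
        node for node, seen in last_heartbeat.items() if now - seen <= timeout
    )


def has_quorum(live, configured):
    """Assumed policy: quorum means a strict majority of configured nodes."""
    return len(live) > len(configured) / 2
```

Under this assumed rule, two live nodes out of three form a quorum, while one live node out of two does not.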

15
FailSafe - Group Communication Service
  • Provides a consistent view of process group
    memberships in the presence of process failures, new
    processes joining, and changing node memberships
  • Provides a reliable ordered atomic messaging
    service to members of the process group under
    changing node and group memberships
  • GCS operates in the context of a cluster as
    defined by CMS

16
FailSafe - System Resource Manager
  • Manages the resources and resource groups in a
    cluster
  • Co-ordinates access to physically shared
    resources
  • Monitors availability of resources
  • Performs local failover of resources
  • Maps a set of resources into a resource group
  • Atomically allocates resource groups

17
FailSafe - Failsafe Daemon
  • A policy implementor for Resource Groups (RG)
  • Provides the ability to enable/disable monitoring
    of an application dynamically
  • Provides the ability to fail over an application if
    monitoring fails
  • Failover can be either local (restart) or remote
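
The choice between local and remote failover could be sketched as follows. This is a hypothetical decision rule for illustration only: the slides do not describe how the Failsafe daemon chooses, and the restart limit, the `"offline"` fallback and the name `recovery_action` are all assumptions.

```python
def recovery_action(local_restarts, max_local_restarts, candidates):
    """Decide what to do after a monitoring failure.

    Returns ("restart", None) to retry on the same node,
    ("failover", node) to move the resource group to `node`,
    or ("offline", None) when no candidate node is left.
    """
    if local_restarts < max_local_restarts:
        return ("restart", None)
    if candidates:
        # candidates is the ordered node list from the failover policy
        return ("failover", candidates[0])
    return ("offline", None)
```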

18
FailSafe - Failsafe Daemon (contd.)
  • Failover Policy Module (PM)
  • PM components:
    • Failover script
    • Initial Failover Domain
    • Attributes

19
FailSafe - Cluster Reset Service
  • Provides reset facility in a cluster upon request
    from one of its clients
  • Provides facility to monitor each reset line that
    connects to a machine that it is expected to
    reset
  • Special reset network to ensure connectivity for
    resetting remote machines

20
FailSafe - Agents
  • Glue between a resource type and the Failsafe
    daemon
  • Collection of action scripts and binaries that
    the action scripts could be calling
  • Goal: make a resource a highly available service
  • Examples: a file server agent, a web server
    agent, an agent for making an IP address, a
    filesystem or a volume highly available

21
FailSafe - Action Scripts
  • Determine how a resource is started, stopped and
    monitored
  • Action scripts are per resource type
  • Types: start, stop, monitor, exclusive, restart
  • Returns status for each resource acted on
  • Called by SRM
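
As a rough model of how a caller such as SRM might invoke one action type and collect a status per resource, consider the sketch below. Real action scripts are separate scripts per resource type; here they are modelled as Python callables. The name `run_action`, the handler mapping and the 0/1 status codes are assumptions, not FailSafe's actual interface or exit codes.

```python
ACTIONS = ("start", "stop", "monitor", "exclusive", "restart")


def run_action(handlers, action, resources):
    """Run one action for each resource and return {resource: status}.

    handlers maps an action name to a callable taking a resource name
    and returning True on success. Status 0 means success, 1 failure
    (illustrative codes only).
    """
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    handler = handlers[action]
    statuses = {}
    for resource in resources:
        try:
            ok = handler(resource)
        except Exception:
            ok = False  # a crashing handler counts as a failed action
        statuses[resource] = 0 if ok else 1
    return statuses
```

For example, a `monitor` handler that fails for one of two filesystems yields a mapping with status 0 for the healthy one and 1 for the failed one, which matches the "status for each resource acted on" bullet.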

22
FailSafe - Related HA Technologies
  • A journaled file system for fast recovery
  • FailSafe can support multiple journaled
    filesystems such as XFS, GFS, ext3fs
  • Volume manager for disk failures (LVM)
  • Network mirroring
  • Monitoring tool (mon)

23
FailSafe - Docs, Contacts
  • Documentation: http://oss.sgi.com/projects/failsafe/
  • Contact: failsafe_at_oss.sgi.com

24
FailSafe - Q & A
  • Questions - Sure!
  • Answers... Well, maybe :)