Title: LinuxHA Release 2
1Linux-HA Release 2
- Alan Robertson
- IBM Linux Technology Center
- alanr_at_unix.sh
2Linux-HA Release 2
- What is High-Availability (HA) Clustering?
- What can HA do for me?
- What is the Linux-HA project?
- Linux-HA applications
- Linux-HA customers
- Linux-HA release 1 capabilities
- Linux-HA release 2 capabilities
- Comparative Architectures
- Release 2 Details
- Futures
3What Is HA Clustering?
- Putting together a group of computers which trust each other to provide a service even when system components fail
- When one machine goes down, others take over its work
- This involves IP address takeover, service takeover, etc.
- New work comes to the takeover machine
- Not primarily designed for high-performance
4What Can HA Clustering Do For You?
- It cannot achieve 100% availability; nothing can.
- HA clustering is designed to recover from single faults
- It can make your outages very short
- From about a second to a few minutes
- It is like a magician's (illusionist's) trick
- When it goes well, the hand is faster than the eye
- When it goes not-so-well, it can be reasonably visible
- A good HA clustering system adds a "9" to your base availability
- 99% to 99.9%, 99.9% to 99.99%, 99.99% to 99.999%, etc.
- Complexity is the enemy of reliability!
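The "adds a 9" rule of thumb can be made concrete with a little arithmetic. The sketch below is not from the slides; it simply converts an availability percentage into expected downtime per year, assuming a 365-day year. The function name is illustrative.

```python
# Downtime per year implied by an availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def downtime_hours_per_year(availability_percent: float) -> float:
    """Expected unavailable hours per year at the given availability."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for pct in (99.0, 99.9, 99.99, 99.999):
    hours = downtime_hours_per_year(pct)
    print(f"{pct}% available: {hours:.3f} hours "
          f"({hours * 60:.1f} minutes) of downtime per year")
```

Each added 9 cuts the yearly downtime by a factor of ten: roughly 87.6 hours at 99%, 8.76 hours at 99.9%, and under an hour at 99.99%.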
5Single Points of Failure (SPOFs)
- A single point of failure is a component whose failure will cause near-immediate failure of an entire system or service
- Good HA design eliminates single points of failure
6How Does HA work?
- Manage redundancy to improve service availability
- Like a cluster-wide-super-init on steroids
- Even complex services are now "respawn"
- on node (computer) death
- on impairment of nodes
- on loss of connectivity
- for services that aren't working (not necessarily stopped)
- managing very complex dependency relationships
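The "respawn on node death" idea can be illustrated with a toy model. This is a minimal sketch, not how heartbeat itself is implemented: it assumes each node reports a timestamp, a node is declared dead when its last report is older than a `deadtime` threshold, and orphaned services go to the least loaded survivor. All function, node, and service names are hypothetical.

```python
def dead_nodes(last_heartbeat, now, deadtime):
    """Return nodes whose most recent heartbeat is older than deadtime seconds."""
    return sorted(n for n, seen in last_heartbeat.items() if now - seen > deadtime)


def fail_over(placement, nodes, dead):
    """Return a new service -> node map with services moved off dead nodes.

    Each orphaned service goes to the surviving node that currently runs
    the fewest services (ties go to the lowest node name).
    """
    survivors = [n for n in sorted(nodes) if n not in dead]
    if not survivors:
        raise RuntimeError("no surviving node to take over services")
    new_placement = dict(placement)
    for service in sorted(placement):
        if placement[service] in dead:
            # Count services already running on each survivor.
            load = {n: 0 for n in survivors}
            for node in new_placement.values():
                if node in load:
                    load[node] += 1
            new_placement[service] = min(survivors, key=lambda n: load[n])
    return new_placement


# Hypothetical three-node cluster; node2 has been silent for 9 seconds.
last_heartbeat = {"node1": 100.0, "node2": 92.0, "node3": 99.5}
placement = {"web": "node2", "db": "node1", "mail": "node2"}

dead = dead_nodes(last_heartbeat, now=101.0, deadtime=5.0)
print(dead)
print(fail_over(placement, list(last_heartbeat), dead))
```

A real cluster manager also has to handle impaired nodes, lost connectivity, and services that are running but not working, which this sketch leaves out.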
7Redundant Communications
- Intra-cluster communication is critical to HA system operation
- Most HA clustering systems provide mechanisms for redundant internal communication for heartbeats, etc.
- External communication is usually essential to provision of service
- External communication redundancy is usually accomplished through routing tricks
- Having an expert in BGP or OSPF is a help
8Redundant Data Access
- Replicated
- Copies of data are kept updated on more than one computer in the cluster
- Shared
- Typically Fiber Channel Disk (SAN)
- Sometimes shared SCSI
- Back-end Storage (Somebody Else's Problem)
- NFS, SMB
- Back-end database
9The Desire for HA systems
- Who wants low-availability systems?
- Why are so few systems High-Availability?
10Why isn't everything HA?
12The Linux-HA Project
- Linux-HA is the oldest high-availability project for Linux, with the largest associated community
- The core piece of Linux-HA is called heartbeat (though it does much more than heartbeat)
- Linux-HA has been in production since 1999, and is currently in use on about ten thousand sites
- Linux-HA also runs on FreeBSD and Solaris, and is being ported to OpenBSD and others
- Linux-HA is shipped with every major Linux distribution except one.
13Linux-HA Release 1 Applications
- Load Balancers
- Web Servers
- Database Servers
- Custom Applications
- Firewalls
- Retail Point of Sale Solutions
- Authentication
- File Servers
- Proxy Servers
- Medical Imaging
- Almost any type of server application you can think of, except SAP
14Linux-HA customers
- Emageon: medical imaging services
- Contraloria General de la Republica (Colombian government)
- Incredimail bases their mail service on Linux-HA on IBM hardware
- Karstadt uses Linux-HA in each of several hundred stores
- Bavarian Radio Station (Munich): coverage of 2002 Olympics in Salt Lake City
- Circuit City, Autozone, and others use Linux-HA in each of several hundred stores
- Citysavings Bank in Munich (infrastructure)
- University of Toledo (US): 20k-student Computer Aided Instruction system
- Autostrada: 230 clusters across country
- The Weather Channel (weather.com)
- Sony (manufacturing)
- ISO New England manages power grid using 25 Linux-HA clusters
15Linux-HA Release 1 capabilities
- Supports 2-node clusters
- Can use serial, UDP bcast, mcast, ucast comm.
- Fails over on node failure
- Fails over on loss of IP connectivity
- Capability for failing over on loss of SAN connectivity
- Limited command line administrative tools to fail over, query current status, etc.
- Active/Active or Active/Passive
- Simple resource group dependency model
- Requires external tool for resource monitoring
- SNMP monitoring
16Linux-HA Release 2 capabilities
- Built-in resource monitoring
- Support for the OCF resource standard
- Much Larger clusters supported ( 8 nodes)
- Sophisticated dependency model with rich constraint support (resources, groups, incarnations, master/slave) (needed for SAP)
- XML-based resource configuration
- Configuration and monitoring GUI
- Support for GFS cluster filesystem
- Multi-state (master/slave) resource support
- Initially - no IP, SAN monitoring
17Release 2 Credits
- Andrew Beekhof: CRM, CIB
- Gouchun Shi: significant infrastructure improvements
- Sun, Jiang Dong and Huang, Zhen: LRM, Stonithd and testing
- Lars Marowsky-Bree: architecture, PHB :-)
- Alan Robertson: architecture, project leadership, original heartbeat code and testing
18Linux-HA Release 1 Architecture
19Linux-HA Release 2 Architecture (add TE and PE)
20Resource Objects in Release 2
- Release 2 supports "resource objects" which can be any of the following:
- Primitive Resources
- OCF, heartbeat-style, or LSB resource agent scripts
- Resource Incarnations: need n resource objects, somewhere
- Resource groups: a group of resources with implied co-location and linear ordering constraints
- Multi-state resources (master/slave)
- Designed to model master/slave (replication) resources (DRBD, et al)
21Basic Dependencies in Release 2
- Ordering Dependencies
- start before (implies stop after)
- start after (implies stop before)
- Mandatory Co-location Dependencies
- must be co-located with
- cannot be co-located with
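As an illustration of how these two kinds of dependencies behave, here is a minimal sketch using Python's standard `graphlib` module (Python 3.9 or later). It is not the Release 2 implementation: the resource names and the `colocation_violations` helper are hypothetical. Start order is derived from "start after" relations, and stop order is simply the reverse, which is the "implies stop before" part.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical resources: each key must start AFTER every resource in its set.
starts_after = {
    "database": {"filesystem"},
    "webserver": {"database", "ip_address"},
}

# A valid start order puts every resource after the ones it depends on.
start_order = list(TopologicalSorter(starts_after).static_order())
# "start after" implies "stop before": stopping runs in the reverse order.
stop_order = list(reversed(start_order))


def colocation_violations(placement, must, cannot):
    """List every mandatory co-location rule that a placement breaks.

    placement -- dict mapping resource name to node name
    must      -- pairs of resources that must share a node
    cannot    -- pairs of resources that must not share a node
    """
    problems = []
    for a, b in must:
        if placement[a] != placement[b]:
            problems.append(f"{a} must be co-located with {b}")
    for a, b in cannot:
        if placement[a] == placement[b]:
            problems.append(f"{a} cannot be co-located with {b}")
    return problems


placement = {"filesystem": "node1", "database": "node1",
             "webserver": "node2", "ip_address": "node2"}
print(start_order)
print(stop_order)
print(colocation_violations(placement,
                            must=[("database", "filesystem")],
                            cannot=[("database", "webserver")]))
```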
22Resource Location Constraints
- Mandatory Constraints
- Resource Objects can be constrained to run on any selected subset of nodes. Default is none.
- Preferential Constraints
- Resource Objects can also be preferentially constrained to run on specified nodes by providing weightings for arbitrary logical conditions
- The resource object is run on the node which has the highest weight (score)
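The interaction between mandatory and preferential constraints can be sketched as a small scoring function. This is an illustration under stated assumptions, not the Release 2 policy engine: the mandatory constraint is modeled as an explicit set of allowed nodes, each preferential rule is a (predicate, weight) pair, and a node's score is assumed to be the sum of the weights of the rules that match it. Node names, attributes, and weights are made up.

```python
def choose_node(allowed, rules, node_attrs):
    """Pick the allowed node with the highest total weight.

    allowed    -- set of node names the resource may run on (mandatory)
    rules      -- list of (predicate, weight) pairs (preferential)
    node_attrs -- dict mapping node name to a dict of its attributes
    Returns (best_node, scores); best_node is None if no node is allowed.
    """
    scores = {}
    for node in sorted(allowed):
        attrs = node_attrs[node]
        scores[node] = sum(weight for predicate, weight in rules if predicate(attrs))
    if not scores:
        return None, scores
    # max() keeps the first maximum, so ties go to the lowest node name.
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical cluster: n3 has the most memory but is not an allowed node.
node_attrs = {
    "n1": {"name": "n1", "memory_mb": 4096},
    "n2": {"name": "n2", "memory_mb": 16384},
    "n3": {"name": "n3", "memory_mb": 32768},
}
rules = [
    (lambda a: a["memory_mb"] >= 8192, 100),  # prefer nodes with plenty of memory
    (lambda a: a["name"] == "n1", 50),        # mild preference for n1
]

best, scores = choose_node({"n1", "n2"}, rules, node_attrs)
print(best, scores)
```

The mandatory set acts as a filter before any scoring happens, so a node outside it can never win regardless of its weight.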
23Resource Incarnations
- Resource Incarnations allow one to have a resource which runs multiple (n) times on the cluster
- This is useful for managing
- load balancing clusters where you want n of them to be slave servers
- Cluster filesystems
- Cluster Alias IP addresses
24Resource Groups
- Resource Groups provide a shorthand for creating ordering and co-location dependencies
- Each resource object in the group is declared to have linear start-after ordering relationships
- Each resource object in the group is declared to have co-location dependencies on each other
- This is an easy way of converting release 1 resource groups to release 2
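What the group shorthand stands for can be shown by expanding a group into explicit constraints. This is a sketch only: the tuple representation and the `expand_group` name are assumptions, and "co-location dependencies on each other" is read here as every pair of members being co-located.

```python
from itertools import combinations


def expand_group(members):
    """Expand an ordered resource group into explicit constraints.

    Returns (ordering, colocation) where
      ordering   -- (resource, starts_after) pairs forming a linear chain
      colocation -- every pair of members, which must share a node
    """
    ordering = [(later, earlier) for earlier, later in zip(members, members[1:])]
    colocation = list(combinations(members, 2))
    return ordering, colocation


# Hypothetical group listed in start order.
ordering, colocation = expand_group(["ip_address", "filesystem", "apache"])
print(ordering)
print(colocation)
```

For a three-member group this yields two ordering constraints and three co-location constraints, which is the bookkeeping the group syntax saves.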
25Multi-State (master/slave) Resources
- Normal resources can be in one of two stable states:
- running
- stopped
- Multi-state resources can have more than two stable states. For example:
- running-as-master
- running-as-slave
- stopped
- This is ideal for modeling replication resources like DRBD
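One way to picture a multi-state resource is as a small state machine. The sketch below is an assumption-laden illustration: the slides name the three states but not the transitions, so the action names (`start`, `promote`, `demote`, `stop`) and the rule that a resource passes through the slave state on its way to or from master are assumptions made for this example.

```python
# States in order from least to most active.
STATES = ["stopped", "running-as-slave", "running-as-master"]
UP_ACTIONS = ["start", "promote"]    # stopped -> slave -> master
DOWN_ACTIONS = ["stop", "demote"]    # slave -> stopped, master -> slave


def actions_to_reach(current, target):
    """Return the list of actions that moves a resource from current to target."""
    i, j = STATES.index(current), STATES.index(target)
    if i <= j:
        return UP_ACTIONS[i:j]
    # Walk down one state at a time, e.g. master -> slave -> stopped.
    return [DOWN_ACTIONS[k - 1] for k in range(i, j, -1)]


print(actions_to_reach("stopped", "running-as-master"))
print(actions_to_reach("running-as-master", "stopped"))
```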
26Advanced Constraints
- Nodes can have arbitrary attributes associated with them in name=value form
- Attributes have types: int, string, version
- Constraint expressions can use these attributes as well as node names, etc. in largely arbitrary ways
- Operators
- , ! , ,
- defined(attrname), undefined(attrname),
- colocated(resource id), not colocated(resource id)
27Advanced Constraints (cont'd)
- Each constraint is associated with a particular resource, and is evaluated in the context of a particular node.
- A given constraint has a boolean predicate associated with it according to the expressions before, and is associated with a weight, and a condition.
- If the predicate is true, then the condition is used to compute the weight associated with locating the given resource on the given node.
- Supported conditions are (these distinctions may be unneeded?)
- can: same as prefer with MAXINT weight
- cannot: same as prefer with -MAXINT weight
- prefer: positive weight
- prefer not: same as prefer with negative weight
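The mapping from conditions to weights can be written down directly. This is a minimal sketch, not the Release 2 code: `MAXINT` is stood in for by `sys.maxsize` because the slides do not give its value, a constraint whose predicate is false is assumed to contribute nothing, and the constraint dictionaries and the `defined` helper are hypothetical.

```python
import sys

MAXINT = sys.maxsize  # stand-in value; the slides do not define MAXINT


def condition_weight(condition, weight=0):
    """Translate a condition (plus an optional weight) into a signed weight."""
    if condition == "can":
        return MAXINT
    if condition == "cannot":
        return -MAXINT
    if condition == "prefer":
        return abs(weight)
    if condition == "prefer not":
        return -abs(weight)
    raise ValueError(f"unknown condition: {condition!r}")


def evaluate(constraint, node_attrs):
    """Weight one constraint contributes for one node (0 if predicate is false)."""
    if constraint["predicate"](node_attrs):
        return condition_weight(constraint["condition"], constraint.get("weight", 0))
    return 0


def defined(attrname):
    """Predicate that is true when the node has the named attribute."""
    return lambda attrs: attrname in attrs


# Hypothetical constraints for one resource.
prefer_san = {"predicate": defined("san_attached"), "condition": "prefer", "weight": 100}
never_on_dr = {"predicate": lambda a: a.get("site") == "dr", "condition": "cannot"}

node_a = {"san_attached": "yes", "site": "main"}
node_b = {"site": "dr"}
print(evaluate(prefer_san, node_a), evaluate(never_on_dr, node_a))
print(evaluate(prefer_san, node_b), evaluate(never_on_dr, node_b) == -MAXINT)
```

Seen this way, all four conditions reduce to a single signed weight, which is consistent with the slide's remark that the distinctions may be unneeded.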
28Security Considerations
- Cluster: a computer whose backplane is the Internet
- If this isn't frightening, you don't understand...
- You may think you have a secure cluster network
- You're probably mistaken now
- You will be in the future
29Secure Networks are Difficult Because...
- Security is not often well-understood by admins
- Security is well-understood by black hats
- Network security is easy to breach accidentally
- Users bypass it
- Hardware installers don't fully understand it
- Most security breaches come from trusted staff
- Staff turnover is often a big issue
- Virus/Worm/P2P technologies will create new holes, especially for Windows machines
30Security Advice
- Good HA software should be designed to assume insecure networks
- Not all HA software assumes insecure networks
- Good HA installation architects use dedicated (secure?) networks for intra-cluster HA communication
- Crossover cables are reasonably secure; all else is suspect :-)
31References
- http://linux-ha.org/
- http://linux-ha.org/download/
- http://wiki.linux-ha.org/NewHeartbeatDesign
- New Web site content (in progress)
- http://linux-ha.trick.ca/ (pretty - offline!)
- http://wiki.linux-ha.org/ (editable)
- www.linux-mag.com/2003-11/availability_01.html
32Legal Statements
- IBM is a trademark of International Business Machines Corporation.
- Linux is a registered trademark of Linus Torvalds.
- Other company, product, and service names may be trademarks or service marks of others.
- This work represents the views of the author and does not necessarily reflect the views of the IBM Corporation.