Title: Single System Abstractions for Clusters of Workstations
1 Single System Abstractions for Clusters of Workstations
2 What is a cluster?
A collection of loosely connected self-contained
computers cooperating to provide the abstraction
of a single one
Possible System Abstractions

  Characterized by                             System Abstraction
  fine grain parallelism, fast interconnects   massively parallel processor
  coarse grain concurrency, independent nodes  multi-programmed system

Transparency is a goal
3 Question
Compare three approaches to providing the abstraction
of a single system for clusters of workstations,
using the following criteria:
- Transparency
- Availability
- Scalability
4 Contributions
- Improvements to the Microsoft Cluster Service
  - better availability and scalability
- Adaptive Replication
  - automatically adapting replication levels to maintain availability as the cluster grows
5 Outline
- Comparison of approaches
- Transparent remote execution (GLUnix)
- Preemptive load balancing (MOSIX)
- Highly available servers (Microsoft Cluster Service)
- Contributions
  - Improvements to the MS Cluster Service
  - Adaptive Replication
- Conclusions
6 GLUnix
7 GLUnix: Transparent Remote Execution
(Figure: the user runs "glurun make" on the home node; the request Execute(make, env) goes to the master daemon on the master node, which selects a remote node; the node daemon there forks and execs make; stdin, stdout/stderr, and signals are forwarded between the home node and the remote node.)
8 GLUnix: Virtues and Limitations
- Transparency
  - home node transparency limited by user-level implementation
  - interactive jobs supported
  - special commands for running cluster jobs
- Availability
  - detects and masks node failures
  - master process is single point of failure
- Scalability
  - master process is a performance bottleneck
9 MOSIX
10 MOSIX: Preemptive Load Balancing
(Figure: a cluster of five numbered nodes, each running processes; processes migrate between nodes.)
- probabilistic diffusion of load information
- redirects system calls to home node
11 MOSIX: Preemptive Load Balancing
(Flowchart: loop — delay; exchange local load with a random node; consider migrating a process to the node with minimal cost.)
- keeps load information from a fixed number of nodes
- load = average size of the ready queue
- cost = f(cpu time) + f(communication) + f(migration time)
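The cost estimate on this slide can be made concrete. Below is an illustrative sketch of ranking candidate processes by cost = f(cpu time) + f(communication) + f(migration time); the field names, weights, and cost terms are assumptions for illustration, not MOSIX code.

```python
from dataclasses import dataclass

@dataclass
class Process:
    cpu_used: float    # CPU seconds consumed so far
    comm_rate: float   # system calls redirected to the home node, per second
    image_mb: float    # address-space size to transfer on migration

def migration_cost(p: Process, net_mb_per_s: float = 100.0,
                   comm_penalty: float = 0.01) -> float:
    # f(cpu time): long-running processes are better migration candidates,
    # so their cost term shrinks as accumulated CPU time grows.
    f_cpu = 1.0 / (1.0 + p.cpu_used)
    # f(communication): heavy communicators keep paying for system-call
    # redirection to the home node after migration.
    f_comm = comm_penalty * p.comm_rate
    # f(migration time): time to copy the process image over the network.
    f_migr = p.image_mb / net_mb_per_s
    return f_cpu + f_comm + f_migr

# Pick the candidate with minimal estimated cost.
procs = [Process(cpu_used=120.0, comm_rate=5.0, image_mb=64.0),
         Process(cpu_used=1.0, comm_rate=50.0, image_mb=256.0)]
best = min(procs, key=migration_cost)
```

Under these assumed weights, the old, quiet, small process wins over the young, chatty, large one, matching the intuition behind the slide's cost function.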
12 MOSIX: Virtues and Limitations
- Transparency
  - limited home node transparency
- Availability
  - masks node failures
  - no process restart
- preemptive load balancing limits portability and performance
- Scalability
  - flooding and swinging possible
  - low communication overhead
13 Microsoft Cluster Service
14 Microsoft Cluster Service (MSCS): Highly Available Server Processes
(Figure: two nodes, one running a SQL server and one a Web server; the MSCS instances on each node exchange status.)
- replicated consistent node/server status database
- migrates servers from failed nodes
15 Microsoft Cluster Service: Hardware Configuration
(Figure: nodes running the Web and SQL servers exchange status over an Ethernet and share disks on a SCSI bus; the shared disks are single points of failure, and the shared bus is a bottleneck.)
16 MSCS: Virtues and Limitations
- Transparency
  - server migration transparent to clients
- Availability
  - servers migrated from failed nodes
  - shared disks are single points of failure
- Scalability
  - manual static configuration
  - manual static load balancing
  - shared disk bus is a performance bottleneck
17 Summary of Approaches

  System  Transparency           Availability                                           Scalability
  GLUnix  home node limited      single point of failure; masks failures; no fail-over  load balancing; bottleneck
  MOSIX   home node transparent  masks failures; no fail-over                           load balancing
  MSCS    clients                server fail-over; single point of failure              bottleneck
18 Re-designing MSCS
19 Transaction-based Replication
(Figure: a logical operation on an object — write x — is mapped by replication into operations on its copies — write x1, …, write xn — executed as transactions.)
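The mapping in the figure above can be sketched in a few lines: a logical write becomes tentative writes on every copy, wrapped in a transaction so that either all copies change or none does. This is an illustrative sketch with an assumed in-memory representation of copies, not the replicated storage service's API.

```python
class ReplicaTxnError(Exception):
    """Raised when the replicated write aborts; no copy is changed."""

def replicated_write(copies, key, value):
    staged = []  # (copy, old value) pairs, for rollback on abort
    try:
        for copy in copies:            # write x1, ..., write xn
            staged.append((copy, copy.get(key)))
            copy[key] = value          # tentative write on this copy
    except Exception:
        # Abort: undo every tentative write so the copies stay consistent.
        for copy, old in staged:
            if old is None:
                copy.pop(key, None)
            else:
                copy[key] = old
        raise ReplicaTxnError("write aborted; no copy changed")

copies = [{}, {}, {}]                  # three copies of the object store
replicated_write(copies, "x", 42)      # all three copies now hold x = 42
```

A real implementation would use write-ahead logging and a commit protocol rather than in-memory rollback, but the all-or-nothing contract is the same.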
20 Re-designing MSCS
- Idea: new core resource group fixed on every node
  - special disk resource
  - distributed transaction processing resource
  - transactional replicated file storage resource
- Implement consensus with transactions (El-Abbadi-Toueg algorithm)
  - changes to configuration DB
  - cluster membership service
- Improvements
  - eliminates complex global update and regroup protocols
  - switchover not required for application data
  - provides a new generally useful service: transactional replicated object storage
21 Re-designed MSCS with transactional replicated object storage
(Figure: on each node, the Cluster Service — node manager and resource manager — talks over RPC to a Resource Monitor hosting resource DLLs, and to a Replicated Storage Service layered on a Transaction Service; nodes communicate over the network.)
22 Adaptive Replication
23 Adaptive Replication: Problem
What should a replication service do when nodes are added to the cluster?
  replication vs. migration
Goal: maintain availability
Hypothesis
- must alternate migration with replication
- replication (R) should happen significantly less often than migration (M)
24 Replication increases the number of copies of objects
(Figure: with 2 nodes, objects x and y have a copy on each node; when 2 nodes are added, replication places copies of x and y on the new nodes as well, going from 2 to 4 copies per object.)
25 Migration re-distributes objects across all nodes
(Figure: with 2 nodes, x and y have a copy on each node; when 2 nodes are added, migration spreads the existing copies across the 4 nodes, keeping 2 copies per object.)
26 Simplifying Assumptions
- system keeps the same number of copies k of each object
- system has n nodes
- initially n = k
- n increases k nodes at a time
- ignore partitions in computing availability
27 Conjecture: highest availability is obtained if objects are partitioned into q = n / k groups living on disjoint sets of nodes.
Example: k = 3, n = 6, q = 2
(Figure: six nodes in q = 2 groups of k = 3 nodes; each group holds all copies of its objects.)
Let's call this optimal migration.
28 Adaptive Replication Necessary
Let each node have availability p. A group of k copies is unavailable only when all k copies are down, which happens with probability (1 - p)^k, so the availability of the system is A(k, n) = 1 - q(1 - p)^k. Since optimal migration always increases q, migration decreases availability (albeit slowly). Adaptive replication may be necessary to maintain availability.
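The argument above can be checked numerically. This is a sketch using the formula just given; the values of p, k, and n are illustrative.

```python
def availability(p: float, k: int, n: int) -> float:
    # q = n / k disjoint groups of k copies; a group is down only when
    # all k of its copies are down, so A(k, n) = 1 - q * (1 - p)**k.
    q = n // k
    return 1 - q * (1 - p) ** k

p = 0.99                              # per-node availability
a_6  = availability(p, k=3, n=6)      # q = 2
a_12 = availability(p, k=3, n=12)     # q = 4: optimal migration alone
a_r  = availability(p, k=4, n=12)     # replicate up to k = 4 copies
```

Growing the cluster from 6 to 12 nodes by migration alone doubles q and so lowers A slightly, while a round of replication (k = 3 to k = 4) more than restores it — the alternation the hypothesis calls for.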
29 Adaptive Replication: Further Work
- determine when it matters in real situations
- relax assumptions
- formalize arguments
30 Support Slides
31Home Node Single System Image
32 Talk focuses on Coarse Grain Layer

  System        LCM layers supported  Mechanisms Used
  Berkeley NOW  NET, CGP, FGP         active messages, transparent remote execution, message passing API
  MOSIX         NET, CGP              preemptive load balancing, kernel-to-kernel RPC
  MSCS          CGP                   node regroup, resource failover/switchover
  ParaStation   NET, FGP              user-level protocol stack with semaphores
33 GLUnix: Characteristics
- provides special user commands for managing cluster jobs
- both batch and interactive jobs can be executed remotely
- supports dynamic load balancing
34 MOSIX preemptive load balancing
(Flowchart: on a load-balance event — if no less loaded node exists, return; otherwise select the candidate process p with maximal impact on local load; if p can migrate, signal p to consider migration; return.)
35 xFS: distributed log-based file system
(Figure: a client accumulates dirty data blocks into a log segment, then writes the segment sequentially across a stripe group as data stripes plus a parity stripe — writes are always sequential.)
36 xFS: Virtues and Limitations
- exploits aggregate bandwidth of all disks
- no need to buy expensive RAIDs
- no single point of failure
- reliability relies on accumulating dirty blocks to generate large sequential writes
- adaptive replication potentially more difficult
37 Microsoft Cluster Service (MSCS): Goal
(Figure: MSCS wraps an off-the-shelf server application so that it behaves as a cluster-aware, highly available server application.)
38 MSCS: Abstractions
- Node
- Resource
  - e.g. disks, IP addresses, servers
- Resource dependency
  - e.g. a DBMS depends on the disk holding its data
- Resource group
  - e.g. a server and its IP address
- Quorum resource
  - logs configuration data
  - breaks ties during membership changes
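The abstractions above can be sketched as plain data types. This is an illustrative model (the names and the start-order helper are assumptions, not the MSCS API): resources carry dependencies, and a resource group — the unit of fail-over — must bring resources online so that dependencies start first.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str                                       # e.g. a disk, an IP address, a server
    depends_on: List["Resource"] = field(default_factory=list)

def start_order(group: List["Resource"]) -> List[str]:
    """Order in which to bring a group online: dependencies before dependents."""
    order, seen = [], set()
    def visit(r):
        if r.name in seen:
            return
        seen.add(r.name)
        for d in r.depends_on:                      # start what r depends on first
            visit(d)
        order.append(r.name)
    for r in group:
        visit(r)
    return order

disk = Resource("sql-disk")
dbms = Resource("sql-server", depends_on=[disk])    # the DBMS depends on its disk
ip   = Resource("sql-ip")
group = [dbms, ip, disk]                            # fails over to another node as a unit
```

On fail-over, MSCS moves the whole group and restarts its resources in dependency order, which is exactly what a topological walk like `start_order` produces.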
39 MSCS: General Characteristics
- global state of all nodes and resources consistently replicated across all nodes (write-all using an atomic multicast protocol)
- node and resource failures detected
- resources of failed nodes migrated to surviving nodes
- failed resources restarted
40 MSCS System Architecture
(Figure: MSCS system architecture; nodes communicate over the network.)
41 MSCS virtually synchronous regroup operation
Activate
- determine the nodes in its connected component
- determine if its component is the primary
- elect a new tie-breaker
- if this node is the new tie-breaker, broadcast its component as the new membership
Closing
Pruning
- if not in the new membership, halt
Cleanup 1
- install the new membership from the new tie-breaker
- acknowledge ready to commit
Cleanup 2
- if this node owns the quorum disk, log the membership change
end
42 MSCS: Primary Component Determination Rule
A node is in the primary component if one of the following holds:
- the node is connected to a majority of the previous membership
- the node is connected to half (> 2 nodes) of the previous members, and one of those is the tie-breaker
- the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
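The rule above is easy to state as a predicate. The sketch below assumes the node already knows the previous membership, its connected component, the tie-breaker, and whether it owned the quorum resource; the function name and signature are illustrative.

```python
def in_primary_component(connected, prev_members, tie_breaker, owned_quorum):
    """connected: set of previous members this node can reach, including itself."""
    n_prev = len(prev_members)
    # Case 1: connected to a majority of the previous membership.
    if 2 * len(connected) > n_prev:
        return True
    # Case 2: connected to exactly half (> 2 nodes), one of which is the
    # tie-breaker -- the tie-breaker decides which half survives.
    if (2 * len(connected) == n_prev and len(connected) > 2
            and tie_breaker in connected):
        return True
    # Case 3: isolated node of a former 2-node cluster that owned the
    # quorum resource -- the quorum disk breaks the 1-vs-1 tie.
    if len(connected) == 1 and n_prev == 2 and owned_quorum:
        return True
    return False
```

With a 6-node previous membership, a 3-node component containing the tie-breaker wins over the other 3-node component, so at most one side ever continues as primary.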
43 MSCS switchover
(Figure: on a node failure, its disk is switched over to a surviving node — every disk is a single point of failure. Alternative: replication.)
44 Summary of Approaches

  System        Transparency           Availability                                             Performance
  Berkeley NOW  home node limited      single point of failure; no fail-over                    load balancing; bottleneck
  MOSIX         home node transparent  masks failures; no fail-over; tolerates partitions       load balancing; low msg overhead
  MSCS          server                 single point of failure; low MTTR; tolerates partitions  bottleneck
45 Comparing Approaches: Design Goals

  System        LCM layers supported  Mechanisms Used
  Berkeley NOW  NET, CGP, FGP         active messages, transparent remote execution, message passing API
  MOSIX         NET, CGP              preemptive load balancing, kernel-to-kernel RPC
  MSCS          CGP                   cluster membership service, resource fail-over
  ParaStation   NET, FGP              user-level protocol stack, network interface hardware
46 Comparing Approaches: Global Information Management

  System        Approach
  Berkeley NOW  centralized
  MOSIX         distributed, probabilistic
  MSCS          replicated, consistent
47 Comparing Approaches: Fault-tolerance

  System        Single Points of Failure       Possible Solution
  Berkeley NOW  master process                 process pairs
  MOSIX         none                           N.A.
  MSCS          quorum resource, shared disks  virtual partitions replication algorithm
48 Comparing Approaches: Load Balancing

  Approach    Description                                                      System
  manual      sys admin manually assigns processes to nodes                    MSCS
  static      processes statically assigned to processors                      —
  dynamic     uses dynamic load information to assign processes to processors  Berkeley NOW
  preemptive  migrates processes in the middle of their execution              MOSIX
49 Comparing Approaches: Process Migration

  Approach                      Description                                                    System
  none                          processes run to completion once assigned to a processor      Berkeley NOW
  cooperative shutdown/restart  processes brought offline at source and online at destination  MSCS
  transparent                   process migrated at any point during execution                 MOSIX
50 Example: k = 3, n = 3
(Figure: three nodes, each holding a copy of x.)
Each letter (e.g. x above) represents a group of objects with copies in the same subset of nodes.
51 Fail-over / failback
  redundancy
  - switch-over (MSCS)
  - error-correcting codes (RAID, xFS)
  - replication
    - primary copy (HARP)
    - voting (quorum consensus)
    - voting with views (virtual partitions)