Title: Single System Abstractions for Clusters of Workstations
1 Single System Abstractions for Clusters of Workstations
2 What is a cluster?
A collection of loosely connected self-contained
computers cooperating to provide the abstraction
of a single one
Possible System Abstractions

  Characterized by                             System Abstraction
  fine grain parallelism, fast interconnects   massively parallel processor
  coarse grain concurrency, independent nodes  multi-programmed system

Transparency is a goal
3 Question
Compare three approaches to providing the abstraction
of a single system for clusters of workstations,
using the following criteria:
- Transparency
- Availability
- Scalability
4 Contributions
- Improvements to the Microsoft Cluster Service
  - better availability and scalability
- Adaptive Replication
  - automatically adapting replication levels to maintain availability as the cluster grows
5 Outline
- Comparison of approaches
- Transparent remote execution (GLUnix)
- Preemptive load balancing (MOSIX)
- Highly available servers (Microsoft Cluster Service)
- Contributions
  - Improvements to the MS Cluster Service
  - Adaptive Replication
- Conclusions
6 GLUnix
7 GLUnix: Transparent Remote Execution
(Figure: the user runs "glurun make" on the home node; the request Execute(make, env) goes to the master daemon on the master node, which selects a remote node; the node daemon there forks and execs make; stdin, stdout/stderr, and signals are forwarded between the home node and the remote node.)
8 GLUnix: Virtues and Limitations
- Transparency
  - home node transparency limited by user-level implementation
  - interactive jobs supported
  - special commands for running cluster jobs
- Availability
  - detects and masks node failures
  - master process is single point of failure
- Scalability
  - master process is a performance bottleneck
9 MOSIX
10 MOSIX: Preemptive Load Balancing
(Figure: a cluster of five numbered nodes, each running processes; processes migrate between nodes.)
- probabilistic diffusion of load information
- redirects system calls to home node
11 MOSIX: Preemptive Load Balancing
(Flowchart: loop — delay; exchange local load with a random node; consider migrating a process to the node with minimal cost.)
- keeps load information from a fixed number of nodes
- load = average size of the ready queue
- cost = f(cpu time) + f(communication) + f(migration time)
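The cost estimate on this slide can be made concrete. Below is an illustrative sketch of ranking candidate processes by cost = f(cpu time) + f(communication) + f(migration time); the field names, weights, and cost terms are assumptions for illustration, not MOSIX code.

```python
from dataclasses import dataclass

@dataclass
class Process:
    cpu_used: float    # CPU seconds consumed so far
    comm_rate: float   # system calls redirected to the home node, per second
    image_mb: float    # address-space size to transfer on migration

def migration_cost(p: Process, net_mb_per_s: float = 100.0,
                   comm_penalty: float = 0.01) -> float:
    # f(cpu time): long-running processes are better migration candidates,
    # so their cost term shrinks as accumulated CPU time grows.
    f_cpu = 1.0 / (1.0 + p.cpu_used)
    # f(communication): heavy communicators keep paying for system-call
    # redirection to the home node after migration.
    f_comm = comm_penalty * p.comm_rate
    # f(migration time): time to copy the process image over the network.
    f_migr = p.image_mb / net_mb_per_s
    return f_cpu + f_comm + f_migr

# Pick the candidate with minimal estimated cost.
procs = [Process(cpu_used=120.0, comm_rate=5.0, image_mb=64.0),
         Process(cpu_used=1.0, comm_rate=50.0, image_mb=256.0)]
best = min(procs, key=migration_cost)
```

Under these assumed weights, the old, quiet, small process wins over the young, chatty, large one, matching the intuition behind the slide's cost function.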
12 MOSIX: Virtues and Limitations
- Transparency
  - limited home node transparency
- Availability
  - masks node failures
  - no process restart
- preemptive load balancing limits portability and performance
- Scalability
  - flooding and swinging possible
  - low communication overhead
13 Microsoft Cluster Service
14 Microsoft Cluster Service (MSCS): Highly Available Server Processes
(Figure: two nodes, one running a SQL server and one a Web server; the MSCS instances on each node exchange status.)
- replicated consistent node/server status database
- migrates servers from failed nodes
15 Microsoft Cluster Service: Hardware Configuration
(Figure: nodes running the Web and SQL servers exchange status over an Ethernet and share disks on a SCSI bus; the shared disks are single points of failure, and the shared bus is a bottleneck.)
16 MSCS: Virtues and Limitations
- Transparency
  - server migration transparent to clients
- Availability
  - servers migrated from failed nodes
  - shared disks are single points of failure
- Scalability
  - manual static configuration
  - manual static load balancing
  - shared disk bus is a performance bottleneck
17 Summary of Approaches

  System  Transparency           Availability                                           Scalability
  GLUnix  home node limited      single point of failure; masks failures; no fail-over  load balancing; bottleneck
  MOSIX   home node transparent  masks failures; no fail-over                           load balancing
  MSCS    clients                server fail-over; single point of failure              bottleneck
18 Re-designing MSCS
19 Transaction-based Replication
(Figure: a logical operation on an object — write x — is mapped by replication into operations on its copies — write x1, …, write xn — executed as transactions.)
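The mapping in the figure above can be sketched in a few lines: a logical write becomes tentative writes on every copy, wrapped in a transaction so that either all copies change or none does. This is an illustrative sketch with an assumed in-memory representation of copies, not the replicated storage service's API.

```python
class ReplicaTxnError(Exception):
    """Raised when the replicated write aborts; no copy is changed."""

def replicated_write(copies, key, value):
    staged = []  # (copy, old value) pairs, for rollback on abort
    try:
        for copy in copies:            # write x1, ..., write xn
            staged.append((copy, copy.get(key)))
            copy[key] = value          # tentative write on this copy
    except Exception:
        # Abort: undo every tentative write so the copies stay consistent.
        for copy, old in staged:
            if old is None:
                copy.pop(key, None)
            else:
                copy[key] = old
        raise ReplicaTxnError("write aborted; no copy changed")

copies = [{}, {}, {}]                  # three copies of the object store
replicated_write(copies, "x", 42)      # all three copies now hold x = 42
```

A real implementation would use write-ahead logging and a commit protocol rather than in-memory rollback, but the all-or-nothing contract is the same.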
20 Re-designing MSCS
- Idea: new core resource group fixed on every node
  - special disk resource
  - distributed transaction processing resource
  - transactional replicated file storage resource
- Implement consensus with transactions (El-Abbadi-Toueg algorithm)
  - changes to configuration DB
  - cluster membership service
- Improvements
  - eliminates complex global update and regroup protocols
  - switchover not required for application data
  - provides a new generally useful service: transactional replicated object storage
21 Re-designed MSCS with transactional replicated object storage
(Figure: on each node, the Cluster Service — node manager and resource manager — talks over RPC to a Resource Monitor hosting resource DLLs, and to a Replicated Storage Service layered on a Transaction Service; nodes communicate over the network.)
22 Adaptive Replication
23 Adaptive Replication: Problem
What should a replication service do when nodes are added to the cluster?
  replication vs. migration
Goal: maintain availability
Hypothesis
- must alternate migration with replication
- replication (R) should happen significantly less often than migration (M)
24 Replication increases the number of copies of objects
(Figure: with 2 nodes, objects x and y have a copy on each node; when 2 nodes are added, replication places copies of x and y on the new nodes as well, going from 2 to 4 copies per object.)
25 Migration re-distributes objects across all nodes
(Figure: with 2 nodes, x and y have a copy on each node; when 2 nodes are added, migration spreads the existing copies across the 4 nodes, keeping 2 copies per object.)
26 Simplifying Assumptions
- system keeps the same number of copies k of each object
- system has n nodes
- initially n = k
- n increases k nodes at a time
- ignore partitions in computing availability
27 Conjecture: highest availability is obtained if objects are partitioned into q = n / k groups living on disjoint sets of nodes.
Example: k = 3, n = 6, q = 2
(Figure: six nodes in q = 2 groups of k = 3 nodes; each group holds all copies of its objects.)
Let's call this optimal migration.
28 Adaptive Replication Necessary
Let each node have availability p. A group of k copies is unavailable only when all k copies are down, which happens with probability (1 - p)^k, so the availability of the system is A(k, n) = 1 - q(1 - p)^k. Since optimal migration always increases q, migration decreases availability (albeit slowly). Adaptive replication may be necessary to maintain availability.
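The argument above can be checked numerically. This is a sketch using the formula just given; the values of p, k, and n are illustrative.

```python
def availability(p: float, k: int, n: int) -> float:
    # q = n / k disjoint groups of k copies; a group is down only when
    # all k of its copies are down, so A(k, n) = 1 - q * (1 - p)**k.
    q = n // k
    return 1 - q * (1 - p) ** k

p = 0.99                              # per-node availability
a_6  = availability(p, k=3, n=6)      # q = 2
a_12 = availability(p, k=3, n=12)     # q = 4: optimal migration alone
a_r  = availability(p, k=4, n=12)     # replicate up to k = 4 copies
```

Growing the cluster from 6 to 12 nodes by migration alone doubles q and so lowers A slightly, while a round of replication (k = 3 to k = 4) more than restores it — the alternation the hypothesis calls for.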
29 Adaptive Replication: Further Work
- determine when it matters in real situations
- relax assumptions
- formalize arguments
30 Support Slides
31Home Node Single System Image
32 Talk focuses on Coarse Grain Layer

  System        LCM layers supported  Mechanisms Used
  Berkeley NOW  NET, CGP, FGP         active messages, transparent remote execution, message passing API
  MOSIX         NET, CGP              preemptive load balancing, kernel-to-kernel RPC
  MSCS          CGP                   node regroup, resource failover/switchover
  ParaStation   NET, FGP              user-level protocol stack with semaphores
33 GLUnix: Characteristics
- provides special user commands for managing cluster jobs
- both batch and interactive jobs can be executed remotely
- supports dynamic load balancing
34 MOSIX preemptive load balancing
(Flowchart: on a load-balance event — if no less loaded node exists, return; otherwise select the candidate process p with maximal impact on local load; if p can migrate, signal p to consider migration; return.)
35 xFS: distributed log-based file system
(Figure: a client accumulates dirty data blocks into a log segment, then writes the segment sequentially across a stripe group as data stripes plus a parity stripe — writes are always sequential.)
36 xFS: Virtues and Limitations
- exploits aggregate bandwidth of all disks
- no need to buy expensive RAIDs
- no single point of failure
- reliability relies on accumulating dirty blocks to generate large sequential writes
- adaptive replication potentially more difficult
37 Microsoft Cluster Service (MSCS): Goal
(Figure: MSCS wraps an off-the-shelf server application so that it behaves as a cluster-aware, highly available server application.)
38 MSCS: Abstractions
- Node
- Resource
  - e.g. disks, IP addresses, servers
- Resource dependency
  - e.g. a DBMS depends on the disk holding its data
- Resource group
  - e.g. a server and its IP address
- Quorum resource
  - logs configuration data
  - breaks ties during membership changes
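The abstractions above can be sketched as plain data types. This is an illustrative model (the names and the start-order helper are assumptions, not the MSCS API): resources carry dependencies, and a resource group — the unit of fail-over — must bring resources online so that dependencies start first.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Resource:
    name: str                                       # e.g. a disk, an IP address, a server
    depends_on: List["Resource"] = field(default_factory=list)

def start_order(group: List["Resource"]) -> List[str]:
    """Order in which to bring a group online: dependencies before dependents."""
    order, seen = [], set()
    def visit(r):
        if r.name in seen:
            return
        seen.add(r.name)
        for d in r.depends_on:                      # start what r depends on first
            visit(d)
        order.append(r.name)
    for r in group:
        visit(r)
    return order

disk = Resource("sql-disk")
dbms = Resource("sql-server", depends_on=[disk])    # the DBMS depends on its disk
ip   = Resource("sql-ip")
group = [dbms, ip, disk]                            # fails over to another node as a unit
```

On fail-over, MSCS moves the whole group and restarts its resources in dependency order, which is exactly what a topological walk like `start_order` produces.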
39 MSCS: General Characteristics
- global state of all nodes and resources consistently replicated across all nodes (write-all using an atomic multicast protocol)
- node and resource failures detected
- resources of failed nodes migrated to surviving nodes
- failed resources restarted
40 MSCS System Architecture
(Figure: MSCS system architecture; nodes communicate over the network.)
41 MSCS virtually synchronous regroup operation
Activate
- determine the nodes in its connected component
- determine if its component is the primary
- elect a new tie-breaker
- if this node is the new tie-breaker, broadcast its component as the new membership
Closing
Pruning
- if not in the new membership, halt
Cleanup 1
- install the new membership from the new tie-breaker
- acknowledge ready to commit
Cleanup 2
- if this node owns the quorum disk, log the membership change
end
42 MSCS: Primary Component Determination Rule
A node is in the primary component if one of the following holds:
- the node is connected to a majority of the previous membership
- the node is connected to half (> 2 nodes) of the previous members, and one of those is the tie-breaker
- the node is isolated, the previous membership had two nodes, and the node owned the quorum resource during the previous membership
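The rule above is easy to state as a predicate. The sketch below assumes the node already knows the previous membership, its connected component, the tie-breaker, and whether it owned the quorum resource; the function name and signature are illustrative.

```python
def in_primary_component(connected, prev_members, tie_breaker, owned_quorum):
    """connected: set of previous members this node can reach, including itself."""
    n_prev = len(prev_members)
    # Case 1: connected to a majority of the previous membership.
    if 2 * len(connected) > n_prev:
        return True
    # Case 2: connected to exactly half (> 2 nodes), one of which is the
    # tie-breaker -- the tie-breaker decides which half survives.
    if (2 * len(connected) == n_prev and len(connected) > 2
            and tie_breaker in connected):
        return True
    # Case 3: isolated node of a former 2-node cluster that owned the
    # quorum resource -- the quorum disk breaks the 1-vs-1 tie.
    if len(connected) == 1 and n_prev == 2 and owned_quorum:
        return True
    return False
```

With a 6-node previous membership, a 3-node component containing the tie-breaker wins over the other 3-node component, so at most one side ever continues as primary.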
43 MSCS switchover
(Figure: on a node failure, its disk is switched over to a surviving node — every disk is a single point of failure. Alternative: replication.)
44 Summary of Approaches

  System        Transparency           Availability                                             Performance
  Berkeley NOW  home node limited      single point of failure; no fail-over                    load balancing; bottleneck
  MOSIX         home node transparent  masks failures; no fail-over; tolerates partitions       load balancing; low msg overhead
  MSCS          server                 single point of failure; low MTTR; tolerates partitions  bottleneck
45 Comparing Approaches: Design Goals

  System        LCM layers supported  Mechanisms Used
  Berkeley NOW  NET, CGP, FGP         active messages, transparent remote execution, message passing API
  MOSIX         NET, CGP              preemptive load balancing, kernel-to-kernel RPC
  MSCS          CGP                   cluster membership service, resource fail-over
  ParaStation   NET, FGP              user-level protocol stack, network interface hardware
46 Comparing Approaches: Global Information Management

  System        Approach
  Berkeley NOW  centralized
  MOSIX         distributed, probabilistic
  MSCS          replicated, consistent
47 Comparing Approaches: Fault-tolerance

  System        Single Points of Failure       Possible Solution
  Berkeley NOW  master process                 process pairs
  MOSIX         none                           N.A.
  MSCS          quorum resource, shared disks  virtual partitions replication algorithm
48 Comparing Approaches: Load Balancing

  Approach    Description                                                      System
  manual      sys admin manually assigns processes to nodes                    MSCS
  static      processes statically assigned to processors                      —
  dynamic     uses dynamic load information to assign processes to processors  Berkeley NOW
  preemptive  migrates processes in the middle of their execution              MOSIX
49 Comparing Approaches: Process Migration

  Approach                      Description                                                    System
  none                          processes run to completion once assigned to a processor      Berkeley NOW
  cooperative shutdown/restart  processes brought offline at source and online at destination  MSCS
  transparent                   process migrated at any point during execution                 MOSIX
50 Example: k = 3, n = 3
(Figure: three nodes, each holding a copy of x.)
Each letter (e.g. x above) represents a group of objects with copies in the same subset of nodes.
51 Fail-over / failback
  redundancy
  - switch-over (MSCS)
  - error-correcting codes (RAID, xFS)
  - replication
    - primary copy (HARP)
    - voting (quorum consensus)
    - voting with views (virtual partitions)