Title: Distributed Systems: Message Passing, Clusters, and Implementation of Clusters in Representative Operating Systems
2. Distributed message passing
- Communication and synchronization mechanisms in distributed systems
  - Distributed message passing
  - Remote procedure call
- An implementation approach for message passing
  - Use the services of a message-passing module
  - Service is requested in the form of primitives and parameters
3. Distributed message passing (cont.)
- Send primitive
  - Parameters
    - Destination process identifier
    - The message contents
  - Operation
    - Sending process uses the Send primitive (destination, message contents)
    - Message-passing module constructs a data unit with destination and contents
    - Data unit is sent to the destination machine using a communication facility (e.g., TCP/IP)
    - Data unit is received by the destination machine and is routed by the communication facility to the message-passing module
    - The message-passing module stores the message in the buffer for the destination process
- Receive primitive
  - Operation
    - Destination process assigns a buffer area for messages and issues the Receive primitive to the message-passing module
    - Alternatively, the message-passing module signals the destination process with a Receive signal and places the message in a shared buffer
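
The Send and Receive steps above can be sketched in a few lines. This is a minimal single-machine illustration, not a real message-passing module: the class name `MessagePassingModule`, its method names, and the process identifier `"proc-B"` are assumptions made for the example, and delivery is a direct function call rather than a transfer over TCP/IP.

```python
from collections import defaultdict, deque


class MessagePassingModule:
    """Toy stand-in for the message-passing module (hypothetical API)."""

    def __init__(self):
        # One FIFO buffer per destination process identifier.
        self.buffers = defaultdict(deque)

    def send(self, destination, contents):
        # Build the data unit from the two Send parameters.
        data_unit = {"destination": destination, "contents": contents}
        # A real system would hand the data unit to a communication
        # facility such as TCP/IP; here delivery is a direct call.
        self._deliver(data_unit)

    def _deliver(self, data_unit):
        # Store the message in the buffer of the destination process.
        self.buffers[data_unit["destination"]].append(data_unit["contents"])

    def receive(self, process_id):
        # Return the oldest buffered message, or None if nothing is waiting.
        buffer = self.buffers[process_id]
        return buffer.popleft() if buffer else None


module = MessagePassingModule()
module.send("proc-B", "hello")
module.send("proc-B", "world")
first = module.receive("proc-B")
second = module.receive("proc-B")
empty = module.receive("proc-B")
```

Messages come out in the order they were sent, and a Receive on an empty buffer returns `None` instead of suspending the caller.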
4. Distributed message passing (cont.)
- Design issues
  - Reliability vs. unreliability
  - Blocking vs. non-blocking
- Reliability vs. unreliability
  - Reliable message passing
    - Guarantees delivery if possible
    - Uses a reliable transport protocol
    - Performs error checking, acknowledgment, retransmission, and reordering of messages delivered out of sequence
    - Reports to the sending process whether delivery succeeded or failed (e.g., because of a network failure)
  - Unreliable message passing
    - Message-passing facility sends the message without reporting success or failure
    - Message-passing facility has a simple design and low overhead
    - Applications may use Request and Reply to acknowledge delivery
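
The difference between the two services can be illustrated with a simulated channel. The sketch below is an assumption-laden toy: `LossyChannel`, `reliable_send`, `unreliable_send`, and the limit of three attempts are all invented for the example, and the "network" is an object that drops a fixed number of transmissions.

```python
class LossyChannel:
    """Hypothetical channel that drops the first `drops` transmissions."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = {}          # sequence number -> payload

    def transmit(self, seq, payload):
        if self.drops > 0:
            self.drops -= 1
            return False             # lost: no acknowledgment comes back
        # Ignore duplicates so a retransmission is delivered only once.
        self.delivered.setdefault(seq, payload)
        return True                  # acknowledgment received


def reliable_send(channel, seq, payload, max_attempts=3):
    """Retransmit until acknowledged; report success or failure to the sender."""
    for attempt in range(1, max_attempts + 1):
        if channel.transmit(seq, payload):
            return True, attempt
    return False, max_attempts


def unreliable_send(channel, seq, payload):
    """Send once and report nothing back to the caller."""
    channel.transmit(seq, payload)


ok, attempts = reliable_send(LossyChannel(drops=2), seq=1, payload="data")
failed, _ = reliable_send(LossyChannel(drops=5), seq=1, payload="data")
```

With two drops the third attempt is acknowledged; with five drops the sender is told that delivery failed.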
5. Distributed message passing (cont.)
- Blocking vs. non-blocking
  - Blocking or synchronous primitives
    - Blocking Send does not return control to the sending process (the process is suspended) until
      - Message has been transmitted (unreliable service), or
      - Message has been sent and an acknowledgment received (reliable service)
    - Blocking Receive does not return control to the receiving process until
      - Message has been placed in the allocated buffer
6. Distributed message passing (cont.)
- Blocking vs. non-blocking
  - Non-blocking or asynchronous primitives
    - Send primitive does not suspend the process
      - Control is returned to the process as soon as the message has been queued for transmission or a copy has been made
      - After the message has been transmitted or copied to a safe place for later transmission, the sending process is interrupted to be informed that the message buffer is available
    - Receive primitive does not suspend the process
      - Process is sent an interrupt upon message arrival, or the process can poll periodically for messages
    - Advantages/disadvantages
      - Efficient use of the message-passing mechanism
      - Difficult to test and debug: time-dependent sequences can lead to obscure bugs
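
Blocking and non-blocking Receive can be contrasted using Python's standard `queue.Queue` as the message buffer. This is only an in-process analogy under the assumption that a thread stands in for the remote sender; the names `mailbox` and `sender` are illustrative.

```python
import queue
import threading
import time

mailbox = queue.Queue()

# Non-blocking receive: returns immediately, so the process must poll.
try:
    polled = mailbox.get_nowait()
except queue.Empty:
    polled = None


def sender():
    time.sleep(0.05)                 # message arrives a little later
    mailbox.put("ready")             # put() on an unbounded queue does not block


threading.Thread(target=sender).start()

# Blocking receive: the caller is suspended until a message is in the buffer
# (the timeout only guards the example against hanging forever).
received = mailbox.get(timeout=2)
```

The poll finds nothing because the message has not arrived yet, while the blocking call waits for it.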
7. Remote procedure calls
- Provide access to remote services through simple procedure call/return semantics, similar to those used for local services
- Advantages
  - The procedure call is used extensively
  - Remote interfaces can be specified and clearly documented as a set of named operations with designated types
  - The interface is standardized
  - The communication code for an application can be generated automatically
  - Client/server modules can be easily ported between different OSs and target systems
- Example of a procedure call in the calling program
  - CALL P (X, Y)
  - where P = procedure name
    - X = passed arguments
    - Y = returned values
8. Remote procedure calls (cont.)
- Dummy or stub procedure on the local machine
  - Included in the caller's address space or dynamically linked at call time
  - Creates a message identifying the remote procedure and includes the parameters
  - Sends the message to the remote system and waits for a reply
  - When the reply arrives, it returns to the calling program, providing the returned values
- Dummy or stub procedure on the remote machine
  - Upon receiving the message, generates a local CALL P (X, Y)
  - Returns the reply
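
The two stubs can be sketched as a pair of functions. This is an illustrative toy, not a real RPC framework: JSON is an assumed message format, the procedure `add` and the names `client_stub` and `server_stub` are hypothetical, and the "network" is a direct function call.

```python
import json


# Procedure available on the "remote" machine (hypothetical example).
def add(x, y):
    return x + y


REMOTE_PROCEDURES = {"add": add}


def server_stub(request_bytes):
    """Unpack the message, make the local call P(X), and pack the reply Y."""
    request = json.loads(request_bytes.decode("utf-8"))
    result = REMOTE_PROCEDURES[request["procedure"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


def client_stub(procedure, *args, transport=server_stub):
    """Looks like a local call; builds a message and waits for the reply."""
    request = json.dumps({"procedure": procedure, "args": args}).encode("utf-8")
    reply_bytes = transport(request)   # a network round trip in a real system
    return json.loads(reply_bytes.decode("utf-8"))["result"]


y = client_stub("add", 2, 3)
```

The caller sees an ordinary function call, while the message construction and reply handling stay hidden inside the stubs.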
9. Remote procedure calls (cont.)
- Design issues
  - Parameter passing
    - Call by value (parameters passed as values)
      - Parameters copied into the message and sent to the remote system
      - Easy to implement for RPCs
    - Call by reference (pointers to a location that contains the value)
      - More difficult to implement for RPCs
  - Parameters and results representation
    - No problem if the calling and called programs use the same language and run on the same type of OS and machine
    - If there are differences, the remote procedure call mechanism must convert values to and from a standardized format for common objects (e.g., integers, characters)
  - Client/server binding
    - A client/server binding is established after the two applications have made a logical connection and are ready to exchange commands and data
    - Non-persistent binding: logical connection between the two processes is established at the time of the RPC and disconnected after the values are returned
    - Persistent binding: connection set up for the RPC remains up after the return
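
As an illustration of a standardized representation, the sketch below converts an integer and a character to a fixed wire format with Python's `struct` module. The choice of format (a 4-byte big-endian integer followed by one byte) and the function names `marshal` and `unmarshal` are assumptions for the example, not a format prescribed by any particular RPC system.

```python
import struct

# "!" selects network (big-endian) byte order with standard sizes:
# i = 4-byte signed integer, c = 1-byte character.
WIRE_FORMAT = "!ic"


def marshal(number, char):
    """Convert local values to the agreed wire representation."""
    return struct.pack(WIRE_FORMAT, number, char.encode("ascii"))


def unmarshal(data):
    """Convert the wire representation back to local values."""
    number, raw = struct.unpack(WIRE_FORMAT, data)
    return number, raw.decode("ascii")


wire = marshal(258, "A")
decoded = unmarshal(wire)
```

Because both sides agree on byte order and field sizes, the same bytes decode to the same values regardless of the machine's native layout.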
10. Remote procedure calls (cont.)
- Design issues (cont.)
  - Synchronous vs. asynchronous
    - Synchronous RPC
      - Calling process waits for the returned values
      - Traditional; functions like a subroutine call
      - Easy to understand and test but leads to lower performance
    - Asynchronous RPC
      - Calling process is not blocked
      - Methods for synchronizing the client and the server
        - Higher-layer applications in both client and server initiate the exchange and then verify that all actions have been completed
        - Client uses a series of asynchronous RPCs followed by a synchronous RPC
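
The pattern of several asynchronous calls followed by a synchronous one can be sketched with `concurrent.futures`. The function `remote_square` is a hypothetical stand-in for a remote procedure, and threads in one process are assumed in place of a real client and server.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def remote_square(x):
    """Stand-in for a remote procedure; the sleep mimics network latency."""
    time.sleep(0.01)
    return x * x


with ThreadPoolExecutor(max_workers=4) as pool:
    # Asynchronous RPCs: submit() returns at once with a future.
    futures = [pool.submit(remote_square, n) for n in range(4)]
    # Synchronous RPC: result() blocks the caller until the reply is back.
    final = pool.submit(remote_square, 10).result()
    # Verify that every earlier asynchronous call has completed.
    results = [f.result() for f in futures]
```

The caller keeps working while the first four calls are in flight and only waits when it actually needs a value.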
11. Remote procedure calls (cont.)
- Design issues (cont.)
  - Object-oriented mechanisms
    - Operation
      - Client sends a request to an object request broker
      - Broker acts as a directory of all remote services on the network; it calls the appropriate remote object and passes the data
      - Remote object services the request and replies to the broker, which returns the response to the client
    - Competing approaches
      - Common Object Request Broker Architecture (CORBA) from the Object Management Group, backed by IBM, Apple, Sun
      - Component Object Model (COM), the basis for Object Linking and Embedding (OLE) from Microsoft
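
The broker's role as a directory and intermediary can be shown with a small in-process sketch. It does not use CORBA or COM; `ObjectRequestBroker`, `Thermometer`, and the service name `"thermo"` are hypothetical names chosen for the example.

```python
class ObjectRequestBroker:
    """Toy broker: a directory of services plus request forwarding."""

    def __init__(self):
        self.directory = {}           # service name -> remote object

    def register(self, name, obj):
        self.directory[name] = obj

    def request(self, name, method, *args):
        # Look up the object, invoke it, and relay its reply to the client.
        target = self.directory[name]
        return getattr(target, method)(*args)


class Thermometer:                    # hypothetical remote object
    def read(self, scale):
        return 20.0 if scale == "C" else 68.0


broker = ObjectRequestBroker()
broker.register("thermo", Thermometer())
reading = broker.request("thermo", "read", "C")
```

The client names a service and an operation; it never holds a direct reference to the object that does the work.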
12. Clusters
- Cluster: group of interconnected computers (nodes) working together as a unified computing resource and creating the illusion of being one machine
- Advantages of clusters
  - Absolute scalability
    - Clusters can consist of hundreds of machines, each being a multiprocessor
  - Incremental scalability
    - A cluster can grow in small increments with minimum service disruption
  - High availability
    - Fault-tolerant operation in software
  - Favorable price/performance ratio
    - Off-the-shelf building blocks
13. Clusters (cont.)
- Cluster configurations
  - Passive standby
    - Active system processes the entire load; the standby takes over if the primary fails
    - Active sends heartbeat messages to standby to indicate continued operation
    - High cost: no task sharing
    - Easy to implement
  - Active secondary
    - Secondary server is also used for processing tasks
    - Reduced cost due to task sharing
    - Increased complexity
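
The heartbeat mechanism in the passive standby configuration can be sketched as a timeout check on the standby. The class `StandbyMonitor`, the 3-second timeout, and the explicit `now` arguments (used instead of a real clock so the example is deterministic) are assumptions for illustration.

```python
class StandbyMonitor:
    """Standby side of a passive-standby pair (hypothetical sketch)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.last_heartbeat = 0.0
        self.active = False           # True once the standby has taken over

    def on_heartbeat(self, now):
        # Record the arrival time of the primary's latest heartbeat.
        self.last_heartbeat = now

    def check(self, now):
        # Take over if the primary has been silent longer than the timeout.
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


monitor = StandbyMonitor(timeout=3.0)
monitor.on_heartbeat(now=1.0)
before = monitor.check(now=3.5)   # 2.5 s of silence: primary presumed alive
after = monitor.check(now=4.5)    # 3.5 s of silence: standby takes over
```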
14. Clusters (cont.)
- Cluster configurations (cont.)
  - Separate servers
    - Each server has its own disk, no disks shared
    - Data copied between servers periodically
    - Scheduling assigns client requests to servers to balance the load
    - High availability
    - High server and network overhead due to data copying
  - Shared disks, non-shared volumes (shared nothing)
    - Common disks are partitioned into volumes, each volume owned by only one computer
    - On computer failure, the cluster is reconfigured to assign the failed computer's volumes to the remaining computers
  - Shared disks, shared volumes
    - Each computer has access to all volumes on all disks
    - Locking mechanism used to ensure that data is accessed by one computer at a time
15. Clusters (cont.)
- OS design issues
  - Failure management
    - Highly available clusters
      - High probability that all resources will be in service
      - In case of failure, the queries in progress are lost
      - If retried, the query will be serviced by another computer in the cluster
    - Fault-tolerant clusters
      - Redundant shared disks and fault-tolerant operations
    - Fail-over: switching an application from a failed system to an alternative system
    - Fail-back: the restoration of applications and data resources to the failed system after recovery
  - Load balancing
    - Load must be balanced among available computers
    - When a new computer is added to the cluster, the load needs to be rebalanced to include the new computer
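
Rebalancing when a node joins can be illustrated with a simple round-robin assignment. Round-robin is an assumed policy chosen for brevity (real cluster schedulers typically weigh actual load), and the function name `rebalance` and the job and node names are hypothetical.

```python
def rebalance(jobs, nodes):
    """Assign jobs to nodes round-robin and return node -> list of jobs."""
    assignment = {node: [] for node in nodes}
    for i, job in enumerate(jobs):
        assignment[nodes[i % len(nodes)]].append(job)
    return assignment


jobs = ["j1", "j2", "j3", "j4", "j5", "j6"]
before = rebalance(jobs, ["A", "B"])
after = rebalance(jobs, ["A", "B", "C"])   # node C joins the cluster
```

With two nodes each holds three jobs; once node C joins, each holds two.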
16. Clusters (cont.)
- OS design issues (cont.)
  - Parallelizing computation: executing software from a single application in parallel
    - Parallelizing compiler
      - It is determined, at compile time, which parts of the application can be run in parallel
      - The parallel parts are assigned to different computers in the cluster
    - Parallelized application
      - The application is designed to run on the cluster and uses message passing for communication
      - Can be the most powerful approach to exploiting clusters
    - Parametric computing
      - Useful for programs that must be executed a large number of times, each time with a different set of parameters (e.g., a simulation model)
      - Parametric processing tools are needed to organize, run, and manage the jobs
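
Parametric computing amounts to running one program over a grid of parameter sets as independent jobs. The sketch below assumes a thread pool in place of cluster nodes; `simulate` is a made-up model (compound growth) and the parameter values are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def simulate(rate, steps):
    """Hypothetical simulation model: compound growth of 1.0 over `steps`."""
    value = 1.0
    for _ in range(steps):
        value *= 1.0 + rate
    return value


# Every combination of parameter values becomes one independent job.
parameter_sets = list(product([0.0, 0.5, 1.0], [1, 2]))

with ThreadPoolExecutor(max_workers=3) as pool:
    # map() preserves input order, so outputs line up with parameter_sets.
    outputs = list(pool.map(lambda p: simulate(*p), parameter_sets))

results = dict(zip(parameter_sets, outputs))
```

Since the jobs share no state, they can be scheduled on any node in any order, which is what makes this style a good fit for clusters.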
17. Clusters (cont.)
- Cluster computer architecture
  - All computers are interconnected by a high-speed LAN or switch
  - Each computer is capable of operating independently
  - A middleware layer of software runs on each computer to implement the cluster functionality
    - Provides a unified system image to the user, called a single-system image
    - Is responsible for providing load balancing and high availability
- Middleware services and functions
  - Single entry point: a user logs into the cluster, not onto a specific computer
  - Single file hierarchy: the user sees only a single file hierarchy, under one root directory
  - Single control point: a default workstation is used for cluster management and control
  - Single virtual networking: there is a single virtual network connecting the cluster computers, even if it consists of multiple interconnected networks
18. Clusters (cont.)
- Middleware services and functions (cont.)
  - Single memory space: a distributed shared memory is used to share variables
  - Single job-management system: the cluster has a job scheduler, and jobs are submitted to the cluster and not to individual computers
  - Single user interface: a common graphic interface is used for all users, regardless of the workstation they use to enter the cluster
  - Single I/O space: any node can access any I/O device
  - Single process space: a process on any node can create or communicate with any other process in the cluster
  - Check-pointing: process states and intermediate results are saved periodically, permitting rollback recovery after failures
  - Process migration: processes can move inside the cluster to provide load balancing
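
Check-pointing with rollback recovery can be sketched with `pickle`. The job here (summing integers), the function `run`, the `fail_at` switch that simulates a node failure, and the file name `job.ckpt` are all assumptions made for the example.

```python
import os
import pickle
import tempfile


def run(total_steps, checkpoint_path, fail_at=None):
    """Sum 0..total_steps-1, saving state each step; resume from a checkpoint."""
    state = {"step": 0, "total": 0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, "rb") as f:
            state = pickle.load(f)            # rollback recovery
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated node failure")
        state["total"] += state["step"]
        state["step"] += 1
        with open(checkpoint_path, "wb") as f:
            pickle.dump(state, f)             # periodic checkpoint
    return state["total"]


path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run(10, path, fail_at=6)                  # fails after finishing steps 0..5
except RuntimeError:
    pass
resumed_total = run(10, path)                 # continues from step 6
```

The second run does not repeat steps 0 to 5; it restores the saved state and finishes the remaining work.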
19. Clusters (cont.)
- Clusters compared with SMPs
  - SMPs
    - Easier to manage and configure than clusters
    - Much closer to the original uniprocessor model
      - Major difference from the uniprocessor is the scheduler function
    - Use less physical space and require less energy than a comparable cluster
    - SMP products are well established and stable
  - Clusters
    - Far superior to SMPs in terms of absolute and incremental scalability
    - Far superior in terms of availability
    - Clusters are likely to dominate the high-performance server market
20. Windows 2000 Cluster Server
- Windows 2000 Cluster Server (initially code-named Wolfpack) is a shared-nothing cluster, in which each volume and other resource is owned by a single system at a time
- Main concepts
  - Cluster Service
    - The software on each node responsible for cluster-specific activities
  - Resource
    - An item managed by the cluster service
    - Resources are objects representing either physical hardware devices (e.g., disk drives, network cards) or logical items (e.g., disk volumes, IP addresses, applications, databases)
    - Resources are implemented as dynamically linked libraries (DLLs) and managed by a resource monitor
  - Online
    - A resource is online at a node if it provides a service at that node
  - Group
    - A collection of resources that are managed as a single entity
    - Consists of all elements needed to run a specific application and to allow the client systems to connect to the service provided by that application
    - Operations can be performed on the entire group (e.g., transfer to another node)
22. Windows 2000 Cluster Server (cont.)
- The W2K Cluster Server components and their relationship in a single node of a cluster
  - Node manager
    - Responsible for maintaining this node's membership in the cluster
    - It sends periodic heartbeat messages to the node managers of the other nodes in the same cluster
    - If it detects the loss of heartbeat messages from another node
      - It broadcasts a message to the entire cluster
      - All members exchange messages to verify their view of current cluster membership
      - If a node manager does not reply, its node is removed from the cluster and its active groups are transferred to one or more of the other nodes in the cluster
  - Configuration database manager
    - Responsible for the cluster configuration database
    - The database has information about all cluster resources, groups, and node ownership of groups
    - Database managers on all nodes communicate with each other to maintain a consistent view of configuration information in the cluster
    - The integrity of the database is maintained by using fault-resistant software for all changes to cluster configuration
23. Windows 2000 Cluster Server (cont.)
- The W2K Cluster Server components and their relationship in a single node of a cluster (cont.)
  - Resource manager / fail-over manager
    - Responsible for management of resource groups
    - Initiates actions such as startup, reset, and fail-over
    - In case of fail-over, the fail-over managers on the active nodes negotiate the redistribution of resource groups from the failed node to the remaining active ones
    - When the node that failed has recovered, the fail-over managers may decide to move some groups back to it
  - Event processor
    - Connects all the components of the cluster service
    - Handles common operations
    - Controls cluster service initialization
  - Communications manager
    - Provides the facilities for message exchange with other nodes in the cluster
  - Global update manager
    - Provides an update service for other components
24. Sun Cluster
- Solaris UNIX has been extended to make the Sun Cluster distributed operating system
- It appears to users and applications as a single computer running the Solaris OS
- Components
  - Object and communications support
  - Process management
  - Networking
  - Global distributed file system
25. Sun Cluster (cont.)
- Object and communications support
  - Object-oriented: uses the CORBA object model to define objects and the remote procedure call (RPC) mechanism
- Global process management
  - The location of a process is transparent to the user
  - Each process has a unique identifier within the cluster
  - Process migration is possible: a process can move from node to node to achieve load balancing and for fail-over (caveat: the threads of a single process must be on the same node)
- Networking
  - Strategy
    - A packet filter is used to route packets to the proper node
    - Cluster appears externally as a single server with a single IP address
  - Operation
    - Incoming packets are received on the node that has the network adapter, filtered, and delivered over the cluster interconnect to the correct target node for protocol processing
    - For outgoing packets, the originating node performs protocol processing, then transfers the packet over the cluster interconnect to the node that has the physical connection to the external network
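
The packet-filter idea can be sketched as a lookup performed on the node that owns the external adapter. The source does not say what the filter keys on, so routing by destination port is purely an assumption here, as are the class name `PacketFilter`, the node names, and the fallback `"default-node"`.

```python
class PacketFilter:
    """Runs on the node that owns the external network adapter (toy sketch)."""

    def __init__(self, port_to_node):
        # Assumed routing rule: destination port -> target node.
        self.port_to_node = port_to_node

    def route(self, packet):
        # Choose the node that will do protocol processing for this packet.
        return self.port_to_node.get(packet["dst_port"], "default-node")


packet_filter = PacketFilter({80: "node-2", 22: "node-3"})
web_target = packet_filter.route({"dst_ip": "10.0.0.1", "dst_port": 80})
other_target = packet_filter.route({"dst_ip": "10.0.0.1", "dst_port": 9999})
```

All packets carry the same cluster IP address; only the filter decides which internal node handles each one.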
26. Sun Cluster (cont.)
- Global file system
  - Like standard Solaris, the Sun Cluster is based on the concepts of the virtual node (vnode) and the virtual file system (vfs)
  - Standard Solaris
    - Vnode
      - The vnode structure is used to provide a general-purpose interface to all types of file systems
      - A vnode provides mapping to an object in any file system type (by contrast, an inode in UNIX can provide mapping to UNIX files only)
      - The vnode interface accepts general-purpose file manipulation commands (e.g., read, write) and translates them into the actions appropriate for the respective file system
    - Vfs
      - Vfs structures are used to describe entire file systems
      - The vfs interface accepts general-purpose commands that operate on entire files and translates them into actions appropriate for a particular file system
27. Sun Cluster (cont.)
- Global file system (cont.)
  - Global file access
    - The global file system provides a uniform interface to files distributed over the cluster
    - Processes on all nodes use the same pathname to locate a file and can open any file
  - Implementation
    - A proxy file system was built on top of the existing Solaris file system at the vnode interface
    - Vfs/vnode operations are converted by the proxy layer into object invocations
    - The invoked object may reside on any node in the cluster; it performs a local vnode/vfs operation on the underlying file system
    - Caching is used for file contents, directory information, and file attributes
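
The proxy layer can be pictured as an object that offers the same read interface as a local vnode but forwards the call to an object that may live on another node, caching the result. This is a conceptual sketch only, not the Solaris vnode API; `MemoryVnode`, `ProxyVnode`, and the `reads` counter are hypothetical.

```python
class MemoryVnode:
    """Hypothetical vnode backed by an in-memory file on some node."""

    def __init__(self, data):
        self.data = data
        self.reads = 0                # counts operations on the underlying file

    def read(self):
        self.reads += 1
        return self.data


class ProxyVnode:
    """Same read() interface; forwards to a possibly remote object and caches."""

    def __init__(self, remote_object):
        self.remote_object = remote_object
        self.cache = None

    def read(self):
        if self.cache is None:
            # Object invocation; a network call in a real cluster.
            self.cache = self.remote_object.read()
        return self.cache


backing = MemoryVnode(b"cluster data")
proxy = ProxyVnode(backing)
first_read, second_read = proxy.read(), proxy.read()
```

The second read is served from the cache, so the underlying file system is touched only once.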
28. Beowulf and Linux clusters
- Beowulf
  - Beowulf project
    - Initiated under the NASA High Performance Computing and Communications (HPCC) project
    - Goal: expand the capabilities of clustered PCs for performing important computational tasks
    - Widely implemented and perhaps the most important cluster technology available
  - Beowulf features
    - Use of off-the-shelf components, no custom components, available from many vendors
    - Dedicated processors
    - Dedicated private network (LAN or WAN or inter-networked combination)
    - Scalable I/O
    - Free software base and distributed computing tools
    - Return of the design and improvements to the community
30. Beowulf and Linux clusters (cont.)
- Most Beowulf implementations use a cluster of Linux workstations or PCs
- A representative Linux implementation of Beowulf contains
  - A number of workstations (not necessarily the same platform), all running Linux
  - Secondary storage at each workstation that can be made available for distributed access (e.g., distributed file sharing)
  - Linux nodes interconnected with an off-the-shelf network (e.g., an Ethernet switch or an interconnected set of Ethernet switches)
- Beowulf software
  - Open-source Beowulf software
  - Beowulf tools and utilities
  - Linux kernel, modified to allow the individual nodes to participate in a number of global namespaces
31. Beowulf and Linux clusters (cont.)
- Examples of Beowulf system software
  - Beowulf distributed process space (BPROC)
    - Allows a process ID space to span multiple nodes in a cluster environment
    - Provides a mechanism for starting a process on another node without logging in to that node
    - Makes all remote processes visible in the process table of the cluster's front-end node
  - Beowulf Ethernet channel bonding
    - Mechanism that joins multiple networks into a single logical network with higher bandwidth
    - Distributes packets over the available device transmit queues
    - Provides load balancing over multiple Ethernets connected to Linux workstations
  - PVMSYNC
    - Provides a synchronization mechanism and shared data objects within a cluster
  - EnFusion
    - Set of tools for parametric computing, i.e., execution of a program as a large number of jobs, each with different parameters
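
The way channel bonding spreads packets over several transmit queues can be sketched with a round-robin distributor. Round-robin is an assumed policy for illustration, not necessarily what the Beowulf driver does, and the function name `bond` and the interface names are hypothetical.

```python
from itertools import cycle


def bond(packets, interfaces):
    """Distribute packets over the interfaces' transmit queues in turn."""
    queues = {name: [] for name in interfaces}
    for packet, name in zip(packets, cycle(interfaces)):
        queues[name].append(packet)
    return queues


queues = bond(["p0", "p1", "p2", "p3", "p4"], ["eth0", "eth1"])
```

Each interface carries roughly half of the traffic, which is how the bonded link offers more bandwidth than either network alone.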