Distributed Systems: Message Passing, Clusters, and Implementation of Clusters in Representative Ope presentation

Slides: 32

Provided by: marius3

Learn more at: http://www.cs.iit.edu

1
Distributed Systems: Message Passing,
Clusters, and Implementation of Clusters in
Representative Operating Systems

2
Distributed message passing

Communication and synchronization mechanisms in
distributed systems
Distributed message passing
Remote procedure call
An implementation approach for message passing
Use the services of a message-passing module
Service is requested in the form of primitives
and parameters

3
Distributed message passing (cont.)

Send primitive
Parameters
Destination process identifier
The message contents
Operation
Sending process uses Send primitive
(destination, message contents)
Message-passing module constructs data unit with
destination and contents
Data unit is sent to the destination machine
using communication facility (e.g., TCP/IP)
Data unit is received by the destination machine
and is routed by the communication facility to
the message-passing module
The message-passing module stores the message in
the buffer for the destination process
Receive primitive
Operation
Destination process assigns buffer area for
messages and issues a Receive primitive to the
message-passing module
Alternatively, message-passing module signals
destination process with a Receive signal and
places message in shared buffer
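
The Send and Receive steps on slides 2 and 3 can be sketched in a few lines. This is a minimal single-machine illustration, not an implementation from the slides: the class MessagePassingModule, its method names, and the process identifier "P2" are all hypothetical, and the network hop is replaced by a direct function call.

```python
import queue


class MessagePassingModule:
    """Toy stand-in for the message-passing module (hypothetical names)."""

    def __init__(self):
        # One FIFO buffer per destination process identifier.
        self.buffers = {}

    def register(self, pid):
        # The destination process assigns a buffer area for its messages.
        self.buffers[pid] = queue.Queue()

    def send(self, destination, contents):
        # Build a data unit from the destination and the contents.
        data_unit = {"dest": destination, "body": contents}
        self._deliver(data_unit)

    def _deliver(self, data_unit):
        # Assumption: in a real system this hop crosses the network
        # (e.g., over TCP/IP); here it is a local call.
        self.buffers[data_unit["dest"]].put(data_unit["body"])

    def receive(self, pid):
        # Return the oldest buffered message for this process.
        return self.buffers[pid].get()


mpm = MessagePassingModule()
mpm.register("P2")
mpm.send("P2", "hello")
first = mpm.receive("P2")
```

Keeping one queue per process identifier mirrors the slide's "buffer for the destination process"; a real module would also need to handle unknown destinations and full buffers.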

4
Distributed message passing (cont.)

Design issues
Reliability vs. unreliability
Blocking vs. non-blocking
Reliability vs. unreliability
Reliable message passing
Guarantees delivery if possible
Uses a reliable transport protocol
Performs error checking, acknowledgment,
retransmission, and reordering of messages if
delivered out of sequence
Acknowledgment to the sending process that
delivery was either successful or it failed (e.g.
network failure)
Unreliable message passing
Message-passing facility sends the message
without reporting success or failure
Message passing facility has a simple design and
low overhead
Applications may use Request and Reply to
acknowledge delivery
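
The difference between the two services can be illustrated with a simulated channel. The sketch below is an assumption-laden toy: LossyChannel, unreliable_send, and reliable_send are hypothetical names, loss is simulated by dropping the first few transmissions, and the acknowledgment is modeled as a return value rather than a separate message.

```python
class LossyChannel:
    """Hypothetical channel that drops the first `drops` transmissions."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []

    def transmit(self, seq, payload):
        if self.drops > 0:
            self.drops -= 1
            return None  # lost: no acknowledgment comes back
        if seq not in [s for s, _ in self.delivered]:
            self.delivered.append((seq, payload))  # ignore duplicates
        return seq  # the acknowledgment carries the sequence number


def unreliable_send(channel, seq, payload):
    # Fire and forget: success or failure is not reported.
    channel.transmit(seq, payload)


def reliable_send(channel, seq, payload, max_tries=3):
    # Retransmit until acknowledged, then report the outcome to the sender.
    for attempt in range(1, max_tries + 1):
        if channel.transmit(seq, payload) == seq:
            return True, attempt
    return False, max_tries


channel = LossyChannel(drops=2)
ok, tries = reliable_send(channel, 0, "data")
```

With two drops the third attempt succeeds; with more drops than max_tries the sender is told that delivery failed, which is the "either successful or it failed" report on the slide.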

5
Distributed message passing (cont.)

Blocking vs. non-blocking
Blocking or synchronous primitives
Blocking Send does not return control to the
sending process (process suspended)
until
Message has been transmitted (unreliable
service), or
Message has been sent and an acknowledgment
received (reliable service)
Blocking Receive does not return control to the
receiving process until
Message has been placed in the allocated buffer

6
Distributed message passing (cont.)

Blocking vs. non-blocking
Non-blocking or asynchronous primitives
Send primitive does not suspend process
Control returned to the process as soon as the
message has been queued for transmission or a
copy has been made
After the message has been transmitted or copied
to a safe place for later transmission, sending
process is interrupted to be informed that the
message buffer is available
Receive primitive does not suspend process
Process is sent an interrupt upon message arrival
or process can poll periodically for messages
Advantages/disadvantages
Efficient use of message passing mechanism
Difficult to test and debug: time-dependent
sequences can lead to obscure bugs
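
As a rough illustration of the two Receive styles, Python's standard queue module offers both a blocking get and a non-blocking get_nowait. The function names and the 50 ms delay below are illustrative assumptions; a timer thread stands in for a remote sender.

```python
import queue
import threading

mailbox = queue.Queue()


def nonblocking_receive(box):
    # Poll: return a message if one is buffered, otherwise None at once.
    try:
        return box.get_nowait()
    except queue.Empty:
        return None


def blocking_receive(box, timeout=None):
    # Suspend the caller until a message has been placed in the buffer.
    return box.get(timeout=timeout)


polled_empty = nonblocking_receive(mailbox)  # nothing has arrived yet

# A timer thread plays the sender and delivers a message 50 ms later.
sender = threading.Timer(0.05, mailbox.put, args=("late message",))
sender.start()
blocked_result = blocking_receive(mailbox, timeout=2)
sender.join()
```

The polling call returns immediately with nothing, while the blocking call waits for the message; the timing dependence of the first style is exactly what makes it harder to test.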

7
Remote procedure calls

Provides access to remote services by providing
simple procedure call/return semantics, similar
to those used for local services
Advantages
The procedure call is used extensively
Remote interfaces can be specified and clearly
documented as a set of named operations with
designated types
The interface is standardized
The communication code for an application can be
generated automatically
Client/server modules can be easily ported
between different OSs and target systems
Example of procedure call for the calling program
CALL P (X, Y)
where P = procedure name
X = passed arguments
Y = returned values

8
Remote procedure calls (cont.)

Dummy or stub procedure on the local machine
Included in the caller's address space or
dynamically linked at call time
Creates message identifying remote procedure and
includes parameters
Sends message to remote system and waits for
reply
When reply arrives, it returns to the calling
program providing the returned values
Dummy or stub procedure on the remote machine
Upon receiving the message, generates a local
CALL P (X, Y)
Returns reply
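
The stub mechanism can be sketched as a pair of functions. This is only an illustration under stated assumptions: JSON is used as the message format, the transport is a direct function call instead of a network, and the names add, PROCEDURES, client_stub, and server_stub are hypothetical.

```python
import json


def add(x, y):
    # The remote procedure P.
    return x + y


PROCEDURES = {"add": add}  # procedures exported by the server


def server_stub(request_bytes):
    # Unpack the message, make the local call, and pack the reply.
    request = json.loads(request_bytes.decode("utf-8"))
    result = PROCEDURES[request["proc"]](*request["args"])
    return json.dumps({"result": result}).encode("utf-8")


def client_stub(proc, *args, transport=server_stub):
    # Build a message naming the procedure and carrying its parameters,
    # send it, wait for the reply, and hand back the returned value.
    message = json.dumps({"proc": proc, "args": list(args)}).encode("utf-8")
    reply = transport(message)
    return json.loads(reply.decode("utf-8"))["result"]


total = client_stub("add", 2, 3)
```

From the caller's point of view client_stub("add", 2, 3) looks like an ordinary call, which is the point of the stub; swapping the transport argument for a socket-based function would not change the calling code.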

9
Remote procedure calls (cont.)

Design issues
Parameter passing
Call by value (parameters passed as values)
Parameters copied into the message and sent to
remote system
Easy to implement for RPCs
Call by reference (pointers to a location that
contains the value)
More difficult to implement for RPCs
Parameters and results representation
No problem if the calling and called programs use
the same language and run on the same type of OSs
and machines
If there are differences, the remote procedure
call mechanism must provide conversion to a
standardized format for common objects (e.g.,
integers, characters)
Client/server binding
A client/server binding is established after the
two applications have made a logical connection
and are ready to exchange commands and data
Non-persistent binding: logical connection
between the two processes established at the time
of RPC and disconnected after the values are
returned
Persistent binding: connection set up for RPC
remains up after return
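
One way to picture call by value with a standardized representation is Python's struct module with network byte order. The wire layout chosen here (one 32-bit integer followed by one character) and the names WIRE_FORMAT, marshal, and unmarshal are assumptions made for the example.

```python
import struct

# "!" selects network (big-endian) byte order with standard sizes, so
# both ends agree regardless of their native representation.
WIRE_FORMAT = "!ic"  # one 32-bit signed integer, one single-byte character


def marshal(number, char):
    # Copy the parameter values into the message in the agreed format.
    return struct.pack(WIRE_FORMAT, number, char.encode("ascii"))


def unmarshal(data):
    # Convert from the agreed format back into local values.
    number, raw = struct.unpack(WIRE_FORMAT, data)
    return number, raw.decode("ascii")


wire = marshal(258, "A")
```

The values are copied into the message, so the callee never sees the caller's memory; passing a pointer this way would be meaningless on the remote machine, which is why call by reference is harder.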

10
Remote procedure calls (cont.)

Design issues (cont.)
Synchronous vs. asynchronous
Synchronous RPC
Calling process waits for the returned values
Traditional, functions like a subroutine call
Easy to understand and test but leads to lower
performance
Asynchronous RPC
Calling process is not blocked
Methods for synchronizing the client and the
server
Higher-layer applications in both client and
server initiate the exchange and then verify
that all actions have been completed
Client uses a series of asynchronous RPCs
followed by a synchronous RPC
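
The second synchronization method, a series of asynchronous RPCs followed by a synchronous one, can be imitated with a thread pool. In this sketch a single worker thread plays the server and remote_append is a hypothetical remote procedure; no real network is involved.

```python
from concurrent.futures import ThreadPoolExecutor

log = []


def remote_append(item):
    # Stand-in for a remote procedure that does some work on the server.
    log.append(item)
    return len(log)


# One worker thread: requests are served one at a time, in arrival order.
with ThreadPoolExecutor(max_workers=1) as server:
    # Asynchronous RPCs: submit() returns at once, the caller is not blocked.
    pending = [server.submit(remote_append, i) for i in range(3)]
    # Synchronous RPC: result() blocks until the server has replied.
    count = server.submit(remote_append, "done").result()
```

Because the server handles requests in order, the reply to the final synchronous call also tells the client that the earlier asynchronous calls have completed.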

11
Remote procedure calls (cont.)

Design issues (cont.)
Object-oriented mechanisms
Operation
Client sends request to an object request broker
Broker acts as a directory of all remote services
on the network. Broker calls appropriate remote
object and passes data.
Remote object services request, replies to
broker, which returns response to client
Competing approaches
Common Object Request Broker Architecture (CORBA)
from the Object Management Group, backed by IBM,
Apple, Sun
Component Object Model (COM), the basis for Object
Linking and Embedding (OLE) from Microsoft
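
The broker's role as a directory and relay can be shown with a toy class. This is not CORBA or COM code; Broker, Thermometer, and the service name "lab/thermometer" are hypothetical, and all objects live in one process.

```python
class Broker:
    """Toy object request broker: a directory of named remote objects."""

    def __init__(self):
        self.directory = {}

    def register(self, name, obj):
        self.directory[name] = obj

    def request(self, name, method, *args):
        # Look up the object, invoke it, and relay its reply to the client.
        target = self.directory[name]
        return getattr(target, method)(*args)


class Thermometer:
    def read(self, unit):
        return "21 C" if unit == "C" else "70 F"


orb = Broker()
orb.register("lab/thermometer", Thermometer())
reading = orb.request("lab/thermometer", "read", "C")
```

The client only knows the service name and the operation; where the object lives is the broker's concern.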

12
Clusters

Cluster: a group of interconnected computers
(nodes) working together as a unified computing
resource and creating the illusion of being one
machine
Advantages of clusters
Absolute scalability
Clusters can consist of hundreds of machines,
each being a multiprocessor
Incremental scalability
A cluster can grow in small increments with
minimum service disruption
High availability
Fault-tolerant operation in software
Superior price/performance
Off-the-shelf building blocks

13
Clusters (cont.)

Cluster configurations
Passive standby
Active system processes the entire load, the
standby takes over in case of failure of primary
Active sends heartbeat messages to standby to
indicate continued operation
High cost: no task sharing
Easy to implement
Active secondary
Secondary server is also used for processing
tasks
Reduced cost due to task sharing
Increased complexity
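
The heartbeat logic of a passive standby can be sketched as a timeout check. The class Standby, the 3-second timeout, and the use of explicit timestamps instead of a real clock are assumptions chosen to keep the example deterministic.

```python
class Standby:
    """Passive standby that takes over when heartbeats stop arriving."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated
        self.last_heartbeat = 0.0
        self.active = False

    def on_heartbeat(self, now):
        # Called whenever a heartbeat message arrives from the primary.
        self.last_heartbeat = now

    def check(self, now):
        # Take over if the primary has been silent longer than the timeout.
        if now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active


standby = Standby(timeout=3.0)
standby.on_heartbeat(now=1.0)
still_passive = standby.check(now=3.5)  # 2.5 s of silence
took_over = standby.check(now=4.5)      # 3.5 s of silence
```

Choosing the timeout is the main design decision: too short and a slow network triggers a false takeover, too long and clients wait while the primary is already down.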

14
Clusters (cont.)

Cluster configurations (cont.)
Separate servers
Each server has its own disk, no disks shared
Data copied between servers periodically
Scheduling assigns client requests to servers to
balance the load
High availability
High server and network overhead due to data
copying
Shared disks, non-shared volumes (shared nothing)
Common disks are partitioned into volumes, each
volume owned by only one computer
On computer failure, cluster is reconfigured to
assign volumes to remaining computers
Shared disks, shared volumes
Each computer has access to all volumes on all
disks
Locking mechanism used to ensure that data is
accessed by one computer at a time

15
Clusters (cont.)

OS design issues
Failure management
Highly available clusters
High probability that all resources will be in
service
In case of failure, the queries in progress are
lost
If retried, the query will be serviced by another
computer in the cluster
Fault-tolerant clusters
Redundant shared disks and fault-tolerant
operations
Fail-over: switching an application from a failed
system to an alternative
Fail-back: the restoration of applications and
data resources to the failed system after
recovery
Load balancing
Load must be balanced among available computers
When a new computer is added to the cluster,
the load needs to be rebalanced to include the new
computer
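
Rebalancing when a computer joins can be illustrated with a simple greedy placement. The policy used here (largest task first, onto the least-loaded node) is a common heuristic chosen for the example, not something the slides prescribe; the task names and costs are made up.

```python
def balance(task_costs, nodes):
    """Greedy assignment: give each task, largest first, to the node
    with the smallest load so far."""
    load = {n: 0 for n in nodes}
    placement = {}
    ordered = sorted(task_costs.items(), key=lambda kv: (-kv[1], kv[0]))
    for task, cost in ordered:
        target = min(nodes, key=lambda n: load[n])
        placement[task] = target
        load[target] += cost
    return placement, load


tasks = {"t1": 4, "t2": 3, "t3": 2, "t4": 1}
_, load_two = balance(tasks, ["n1", "n2"])
# A third computer joins the cluster, so the load is recomputed.
_, load_three = balance(tasks, ["n1", "n2", "n3"])
```

With two nodes each carries a load of 5; after the third node joins, the heaviest node carries 4. A production scheduler would also weigh the cost of moving running work.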

16
Clusters (cont.)

OS design issues (cont.)
Parallelizing computation: executing software
from a single application in parallel
Parallelizing compiler
It is determined, at compile time, which parts of
the application can be run in parallel
The parallel parts are assigned to different
computers in the cluster
Parallelized application
The application is designed to run on the cluster
and uses message passing for communication
Most powerful approach to exploit clusters
Parametric computing
Useful for programs that must be executed a large
number of times, each time with a different set
of parameters (e.g., a simulation model)
Parametric processing tools are needed to
organize, run, and manage the jobs

17
Clusters (cont.)

Cluster computer architecture
All computers are interconnected by a high-speed
LAN or switch
Each computer is capable of operating
independently
A middleware layer of software runs on each
computer to implement the cluster functionality
Provides a unified system image to the user,
called a single-system image
Is responsible for providing load balancing and
high availability
Middleware services and functions
Single entry point: a user logs into the cluster,
not on a specific computer
Single file hierarchy: the user sees only a
single file hierarchy, under one root directory
Single control point: a default workstation is
used for cluster management and control
Single virtual networking: there is a single
virtual network connecting the cluster computers,
even if it consists of multiple interconnected
networks

18
Clusters (cont.)

Middleware services and functions (cont.)
Single memory space: a distributed shared memory
is used to share variables
Single job-management system: the cluster has a
job scheduler and jobs are submitted to the
cluster and not to individual computers
Single user interface: a common graphic interface
is used for all users, regardless of the
workstation they use to enter the cluster
Single I/O space: any node can access any I/O
device
Single process space: a process on any node can
create or communicate with any other process in
the cluster
Check-pointing: process states and intermediate
results are saved periodically, permitting
rollback recovery after failures
Process migration: processes can move inside the
cluster to provide load balancing
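
Check-pointing with rollback recovery can be demonstrated on a trivial job. The class CheckpointedJob, the checkpoint interval, and the simulated failure are all assumptions for illustration; a real system would write checkpoints to stable storage rather than keep them in memory.

```python
import copy


class CheckpointedJob:
    """Sums a list of numbers, saving its state every `interval` items."""

    def __init__(self, data, interval):
        self.data = data
        self.interval = interval
        self.state = {"index": 0, "total": 0}
        self.checkpoint = copy.deepcopy(self.state)

    def run(self, fail_at=None):
        while self.state["index"] < len(self.data):
            i = self.state["index"]
            if i == fail_at:
                raise RuntimeError("simulated node failure")
            self.state["total"] += self.data[i]
            self.state["index"] = i + 1
            if self.state["index"] % self.interval == 0:
                # Save process state and intermediate result.
                self.checkpoint = copy.deepcopy(self.state)
        return self.state["total"]

    def rollback(self):
        # Discard work done since the last checkpoint.
        self.state = copy.deepcopy(self.checkpoint)


job = CheckpointedJob([1, 2, 3, 4, 5], interval=2)
try:
    job.run(fail_at=3)
except RuntimeError:
    job.rollback()  # back to the checkpoint taken after the second item
resumed_from = job.state["index"]
total = job.run()
```

Only the work done after the last checkpoint is repeated, so the interval trades checkpoint overhead against the amount of recomputation after a failure.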

19
Clusters (cont.)

Clusters compared with SMPs
SMPs
Easier to manage and configure than clusters
Much closer to the original uniprocessor model
Major difference from the uniprocessor is the
scheduler function
Uses less physical space and requires less energy
than a comparable cluster
SMP products are well established and stable
Clusters
Far superior to SMPs in terms of absolute and
incremental scalability
Far superior in terms of availability
Clusters are likely to dominate the
high-performance server market

20
Windows 2000 Cluster Server

The configuration is a shared-nothing cluster,
where each volume and other resource is owned
by a single system at a time (the product was
initially code-named Wolfpack)
Main concepts
Cluster Service
The software on each node responsible for
cluster-specific activities
Resource
These are the resources managed by the cluster
service
They are objects representing either physical
hardware devices (e.g., disk drives, network
cards) or logical items (e.g., disk volumes, IP
addresses, applications, databases)
Resources are implemented as dynamically linked
libraries (DLLs) and managed by a resource
monitor
Online: a resource is online at a node if it
provides a service at that node
Group
A collection of resources that are managed as a
single entity
Consists of all elements needed to run a specific
application and to allow the client systems to
connect to the service provided by that
application
Operations can be performed on the entire group
(e.g., transfer to another node)

21
Windows 2000 Cluster Server (cont.)
22
Windows 2000 Cluster Server (cont.)

The W2K Cluster Server components and their
relationship in a single node of a cluster
Node manager
Responsible for maintaining this node's
membership in the cluster
It sends periodic heartbeat messages to the node
managers of the other nodes in the same cluster
If it detects the loss of heartbeat messages from
another node
It broadcasts a message to the entire cluster
All members exchange messages to verify their
view of current cluster membership
If a node manager does not reply, its node is
removed from the cluster and its active groups are
transferred to one or more of the other nodes in
the cluster
Configuration database manager
Responsible for the cluster configuration
database
The database has information about all cluster
resources, groups, and node ownership of groups
Database managers on all nodes communicate with
each other to maintain a consistent view of
configuration information in the cluster
The integrity of the database is maintained by
using fault-resistant software for all changes to
cluster configuration

23
Windows 2000 Cluster Server (cont.)

The W2K Cluster Server components and their
relationship in a single node of a cluster
(cont.)
Resource manager / fail-over manager
Responsible for management of resource groups
Initiates actions such as startup, reset, and
fail-over
In case of fail-over, the fail-over managers on
the active nodes negotiate the redistribution of
resource groups from the failed node to the
remaining active ones
When the node that failed has recovered, the
fail-over managers may decide to move back some
groups
Event processor
Connects all the components of the cluster
service
Handles common operations
Controls cluster service initialization
Communications manager
Provides the facilities for message exchange with
other nodes in the cluster
Global update manager
Provides an update service for other components

24
Sun cluster

Solaris UNIX has been extended to make the Sun
Cluster distributed operating system
It appears to users and applications as a single
computer running the Solaris OS
Components
Object and communications support
Process management
Networking
Global distributed file system

25
Sun cluster (cont.)

Object and communications support
Object-oriented: uses the CORBA object model to
define objects and the remote procedure call
(RPC) mechanism
Global process management
The location of a process is transparent to the
user
Each process has a unique identifier within the
cluster
Process migration is possible: a process can move
from node to node to achieve load balancing and
for fail-over (caveat: the threads of a single
process must be on the same node)
Networking
Strategy
A packet filter is used to route packets to the
proper node
Cluster appears externally as a single server
with a single IP address
Operation
Incoming packets are received on the node that
has the network adapter, filtered, and delivered
to the correct target node for protocol
processing over cluster interconnect
For outgoing packets, originating node performs
protocol processing, transfers packet over
cluster interconnect to the node that has
external network physical connection
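
The incoming path can be pictured as a lookup performed by the packet filter. The slides do not say what the filter keys on, so routing by destination port is an assumption here, as are the class PacketFilter, the address 10.0.0.1, and the node names.

```python
class PacketFilter:
    """Runs on the node that owns the external network adapter."""

    def __init__(self, cluster_ip):
        self.cluster_ip = cluster_ip
        self.service_node = {}  # destination port -> target node

    def bind(self, port, node):
        # Record which node does protocol processing for this port.
        self.service_node[port] = node

    def route(self, packet):
        # Packets for other addresses or unbound ports are dropped (None).
        if packet["dst_ip"] != self.cluster_ip:
            return None
        return self.service_node.get(packet["dst_port"])


packet_filter = PacketFilter("10.0.0.1")
packet_filter.bind(80, "node-b")
packet_filter.bind(22, "node-c")
target = packet_filter.route({"dst_ip": "10.0.0.1", "dst_port": 80})
```

External clients address only the single cluster IP; the chosen node then receives the packet over the cluster interconnect for protocol processing.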

26
Sun cluster (cont.)

Global file system
Like standard Solaris, the Sun Cluster is
based on the concepts of virtual node (vnode)
and the virtual file system (vfs)
Standard Solaris
Vnode
The vnode structure is used to provide a
general-purpose interface to all types of file
systems
A vnode provides mapping to an object in any file
system type (by contrast, an inode in UNIX can
provide mapping to UNIX files only)
The vnode interface accepts general-purpose file
manipulation commands (e.g., read, write) and
translates them into the actions appropriate for
the respective file system
Vfs
Vfs structures are used to describe entire file
systems
The Vfs interface accepts general-purpose
commands that operate on entire files and
translates them into actions appropriate for a
particular file system

27
Sun cluster (cont.)

Global file system (cont.)
Global file access
The global file system provides a uniform
interface to files distributed over the cluster
Processes on all nodes use the same pathname to
locate a file and can open any file
Implementation
A proxy file system was built on top of the
existing Solaris file system at the vnode
interface
Vfs/vnode operations are converted by the proxy
layer into object invocations
The invoked object may reside on any node in the
cluster; it performs a local vnode/vfs operation
on the underlying file system
Caching is used for file contents, directory
information, and file attributes
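
The proxy idea can be sketched with two small classes. This is a loose analogy rather than Solaris code: LocalFS plays the per-node file system behind a vnode-like interface, ProxyFS plays the proxy layer, and the shared location map is an assumption about how the owning node is found. Caching is not modeled.

```python
class LocalFS:
    """Stand-in for the file system on one node."""

    def __init__(self):
        self.files = {}

    def write(self, path, data):
        self.files[path] = data

    def read(self, path):
        return self.files[path]


class ProxyFS:
    """Forwards each operation to the node that holds the file."""

    def __init__(self, nodes, location):
        self.nodes = nodes        # node name -> LocalFS
        self.location = location  # path -> node name

    def write(self, path, data, default_node):
        # New files are placed on default_node; existing ones stay put.
        node = self.location.setdefault(path, default_node)
        self.nodes[node].write(path, data)

    def read(self, path):
        # Invoke the local operation on whichever node owns the file.
        return self.nodes[self.location[path]].read(path)


nodes = {"n1": LocalFS(), "n2": LocalFS()}
location = {}  # shared by every proxy in the cluster (assumption)
proxy_on_n1 = ProxyFS(nodes, location)
proxy_on_n2 = ProxyFS(nodes, location)
proxy_on_n1.write("/data/report.txt", "hello", default_node="n1")
seen_from_n2 = proxy_on_n2.read("/data/report.txt")
```

A process on the second node opens the file by the same pathname even though the data is stored on the first node.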

28
Beowulf and Linux clusters

Beowulf
Beowulf project
Initiated under the NASA High Performance
Computing and Communications (HPCC) project
Goal: expand the capabilities of clustered PCs
for performing important computational tasks
Widely implemented, the most important new
cluster technology available
Beowulf features
Use of off-the-shelf components, no custom
components, available from many vendors
Dedicated processors
Dedicated private network (LAN or WAN or
inter-networked combination)
Scalable I/O
Free software base and distributed computing
tools
Return of the design and improvements to the
community

29
Beowulf and Linux clusters (cont.)
30
Beowulf and Linux clusters (cont.)

Most Beowulf implementations use a cluster of
Linux workstations or PCs
A representative Linux implementation of Beowulf
contains
A number of workstations (not necessarily the
same platform) all running Linux
Secondary storage at each workstation can be
available for distributed access (e.g.,
distributed file sharing)
The Linux nodes are interconnected with an
off-the-shelf network (e.g., Ethernet switch or
an interconnected set of Ethernet switches)
Beowulf software
Open-source Beowulf software
Beowulf tools and utilities
Linux kernel, modified to allow the individual
nodes to participate in a number of global
namespaces

31
Beowulf and Linux clusters (cont.)

Examples of Beowulf system software
Beowulf distributed process space (BPROC)
Allows a process ID space to span multiple nodes
in a cluster environment
Provides a mechanism for starting a process on
another node without logging in to that node
Makes all remote processes visible in the process
table of the cluster's front-end node
Beowulf Ethernet channel bonding
Mechanism that joins multiple networks into a
single logical network with high bandwidth
Distributes packets over the available device
transmit queues
Provides load balancing over multiple Ethernets
connected to Linux workstations
PVMSYNC
Provides a synchronization mechanism and shared
data objects within a cluster
EnFusion
Set of tools for parametric computing, i.e.,
execution of a program as a large number of jobs,
each with different parameters
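
The packet distribution performed by Ethernet channel bonding, described above, can be illustrated with a round-robin scheduler. Round-robin is an assumed policy for the sketch, and the function bond and the device names eth0 and eth1 are illustrative; this is not the Linux bonding driver's interface.

```python
from itertools import cycle


def bond(packets, devices):
    """Distribute packets over device transmit queues in round-robin order."""
    queues = {d: [] for d in devices}
    for packet, device in zip(packets, cycle(devices)):
        queues[device].append(packet)
    return queues


queues = bond(["p0", "p1", "p2", "p3", "p4"], ["eth0", "eth1"])
```

Alternating between devices spreads the traffic evenly, so the logical link offers roughly the combined bandwidth of the physical ones.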
