Title: SHARCnet
1. Introduction
2. Outline
- Definitions
- Examples
- Hardware concepts
- Software concepts
- Readings: Chapter 1
- Acknowledgements: Grid notes from UCSD
3. Definition of a Distributed System (Tanenbaum and van Steen)
- A distributed system is a piece of software that ensures that a collection of independent computers appears to its users as a single coherent system.
4. Definition of a Distributed System (Coulouris)
- A distributed system is one in which hardware and software components located at networked computers communicate and coordinate their actions only by passing messages.
5. Definition of a Distributed System (Lamport)
- A distributed system is one in which I cannot get
something done because a machine I've never heard
of is down.
6. Primary Characteristics of a Distributed System
- Multiple computers
- Concurrent execution
- Independent operation and failures
- Communications
- Ability to communicate
- No tight synchronization
- Relatively easy to expand or scale
- Transparency
7. Example: A Typical Intranet (Coulouris)
8. Example: A Typical Portion of the Internet (Coulouris)
9. Example: Portable and Handheld Devices in a Distributed System (Coulouris)
10. Motivation for Building Distributed Systems
- Economics
- Share resources
- Relatively easy to expand or scale
- Speed: A distributed system may have more total computing power than a mainframe.
- Cost
- Personalize environments
- Location Independence
- People and information are distributed
- Expandability
- Availability and Reliability
- If a machine crashes, the system as a whole can
survive.
11. Distributed Application Examples
- Automated banking systems
- Retail
- Air-traffic control
- The World Wide Web
- Student Record System
- Distributed Calendar
- Gnutella and Napster
- GAUL
12. Examples in More Detail
- Air-Traffic Control
- This is not an Internet application.
- In many countries, airspace is divided into areas, which in turn may be divided into sectors.
- Each area is managed by a control center.
- Control centers communicate with tower control and with other control centers (to allow a plane to cross boundaries).
- The planes and air-traffic controllers are distributed. A single centralized system is not feasible.
13. Examples in More Detail
- World Wide Web
- Shared resources: documents
- Unique identification using URLs
- Users interested in the documents are distributed.
- The documents are also distributed.
- Banking
- Clients may access their accounts from ATM machines.
- Multiple clients may attempt to access their accounts simultaneously.
- Multiple copies of account information allow quicker access.
14. Examples in More Detail
- Retail
- Stores are located near their customer base.
- Point of Sale (POS) terminals are used for customer interactions, while mobile units are used for inventory control.
- These units talk to a local processor, which in turn may communicate with remote processors.
15. Examples in More Detail
- Gnutella and Napster
- What is being shared are files.
- GAUL
- What is being shared includes disk space, the e-mail server, the web server, and software.
16. Key Design Goals
- Connectivity
- Transparency
- Reliability
- Consistency
- Security
- Openness
- Scalability
17. Connectivity
- It should be easy for users to access remote resources and to share them with other users in a controlled fashion.
- Resources that can be shared include printers, storage facilities, data, files, web pages, etc.
- Why? Economics.
- Connecting users and resources makes collaboration and the exchange of information easier.
- Just look at e-mail.
18. Transparency
- A distributed system that is able to present itself to users and applications as if it were only a single computer system is said to be transparent.
- It is very difficult to make distributed systems completely transparent.
- You may not want to, since transparency often comes at the cost of performance.
19. Transparency in a Distributed System
Transparency | Description
Access       | Hide differences in data representation and how a resource is accessed
Location     | Hide where a resource is located
Migration    | Hide that a resource may move to another location
Relocation   | Hide that a resource may be moved to another location while in use
Replication  | Hide that a resource is replicated
Concurrency  | Hide that a resource may be shared by several competitive users
Failure      | Hide the failure and recovery of a resource
Persistence  | Hide whether a (software) resource is in memory or on disk
Different forms of transparency in a distributed system.
20. Degree of Transparency
- The goal of full transparency is not always desirable.
- Users may be located on different continents; distribution is apparent and not something you want to hide.
- Completely hiding failures of networks and nodes is (theoretically and practically) impossible.
- You cannot distinguish a slow computer from a failing one.
- You can never be sure that a server actually performed an operation before a crash.
- Full transparency will cost in performance.
- Keeping Web caches exactly up to date with the master copy
- Immediately flushing write operations to disk for fault tolerance
21. Openness
- An open distributed system allows for interaction with services from other open systems, irrespective of the underlying environment.
- Systems should conform to well-defined interfaces.
- Systems should support portability of applications.
- Systems should easily interoperate. Interoperability is characterized by the extent to which two implementations of systems or components from different manufacturers can co-exist and work together.
- Example: In computer networks there are rules that govern the format, contents, and meaning of messages sent and received.
22. Scalability
- There are three dimensions to scalability:
- The number of users and processes (size scalability)
- The maximum distance between nodes (geographical scalability)
- The number of administrative domains (administrative scalability)
23. Techniques for Scaling
- Partition data and computations across multiple machines
- Move computations to clients (Java applets)
- Decentralized naming services (DNS)
- Decentralized information systems (WWW)
- Make copies of data available at different machines
- Replicated file servers (for fault tolerance)
- Replicated databases
- Mirrored web sites
- Allow client processes to access local copies
- Web caches (browser/Web proxy)
- File caching (at server and client)
24. Scaling: The Problem
- Applying scaling techniques is easy, except for the following:
- Having multiple copies (cached or replicated) leads to inconsistencies: modifying one copy makes that copy different from the rest.
- Always keeping copies consistent requires global synchronization.
- Global synchronization is expensive with respect to performance.
- We have learned to tolerate some inconsistencies.
25. Challenges
- Heterogeneity
- Networks
- Hardware
- Operating systems
- Programming languages
26. Challenges
- Failure Handling
- Partial failures
- Can non-failed components continue operation?
- Can the failed components easily recover?
- Detecting failures
- Recovery
- Replication
27. Hardware Concepts
- Multiprocessors
- Multicomputers
- Networks of computers
28. Multiprocessors and Multicomputers
(Figure 1.6: Different basic organizations and memories in distributed computer systems)
29. Shared Memory
- Coherent memory
- Each CPU write is written through and reflected at the other CPUs immediately.
- Note the use of cache memory for efficiency.
- Limited to a small number of processors
30. Shared Memory
- All processors share access to a common memory.
- Each CPU write is written through and reflected at the other CPUs immediately.
- Scaling requires a memory hierarchy of some kind. Note the use of cache memory for efficiency.
31. Shared Memory
- (a) A crossbar switch
- (b) An omega switching network
32. Shared Memory
- Shared memory is considered an efficient implementation of message passing.
- Problem: Cache consistency is difficult to maintain if there are a large number of processors, since the probability of an inconsistency between processors increases.
- Problem: The bus can become a bottleneck.
- Problem: Switch technology requires a lot of hardware, which can be expensive.
- Usually these systems have a relatively small number of processors.
- Example applications: Real-time entertainment applications, since they are sensitive to image quality and performance.
- Examples: Silicon Graphics Challenge, Sequent Symmetry
33. Multicomputer Systems
- A multicomputer system comprises a number of independent machines linked by an interconnection network.
- Each computer executes its own program, which may access its local memory and may send and receive messages over the network.
- The nature of the interconnection network has been a major topic of research for both academia and industry.
34. Multicomputer Systems
- Pipelined architecture
- A pipelined program is divided into a series of tasks that have to be completed one after the other.
- Each task is executed by a separate pipeline stage.
- Data is streamed from stage to stage to form the computation.
35. Multicomputer Systems
- Pipelined architecture
- Computation consists of data streaming through pipeline stages.
36. Multicomputer Systems
- Take a list of integers greater than 1 and produce a list of primes.
- E.g., for input 2 3 4 5 6 7 8 9 10, the output is 2 3 5 7.
- A pipelined approach:
- Assume that processors are labeled P2, P3, P4, ...
- Processor Pi divides each input by i.
- If the input is not divisible, it is forwarded.
- The last processor only forwards primes.
- To find all primes up to N, about sqrt(N) processors suffice, since any composite number up to N has a divisor no larger than sqrt(N).
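The pipelined sieve above can be sketched in a few lines of Python. This is a single-process simulation, not a real multicomputer program: each generator plays the role of one processor Pi, keeping the first value it sees (a prime) and forwarding only values it cannot divide. The function names are my own.

```python
def stage(prime, upstream):
    """One pipeline stage: forward only values not divisible by `prime`."""
    for n in upstream:
        if n % prime != 0:
            yield n

def pipelined_primes(numbers):
    """Grow the pipeline dynamically: the first value to emerge from the
    current pipeline is prime, and it becomes a new filtering stage."""
    primes = []
    stream = iter(numbers)
    while True:
        try:
            p = next(stream)          # first survivor is prime
        except StopIteration:
            return primes
        primes.append(p)
        stream = stage(p, stream)     # append a new stage for p

print(pipelined_primes(range(2, 11)))  # -> [2, 3, 5, 7]
```

On a real multicomputer, each stage would run on its own node and the generator hand-offs would be network messages.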
37. Multicomputer Systems
- Other example interconnection networks:
- Grid
- Hypercube
38. Using a Grid (or Systolic Array)
- Problem: Multiply two n x n matrices A = [aij] and B = [bij]. The product matrix will be R = [rij].
- One solution uses an array with n x n cells.
39. Using a Grid (or Systolic Array)
- Let A and B be the following 4 x 4 matrices:

  a11 a12 a13 a14     b11 b12 b13 b14
  a21 a22 a23 a24     b21 b22 b23 b24
  a31 a32 a33 a34     b31 b32 b33 b34
  a41 a42 a43 a44     b41 b42 b43 b44

- The product of A and B is calculated as follows:
- r11 = a11*b11 + a12*b21 + a13*b31 + a14*b41
- r12 = a11*b12 + a12*b22 + a13*b32 + a14*b42
- r21 = a21*b11 + a22*b21 + a23*b31 + a24*b41
- r22 = a21*b12 + a22*b22 + a23*b32 + a24*b42
40. Using a Grid
(Figure: a 4 x 4 grid of cells P11 through P44. The rows of A enter from the left, one row per grid row, each row skewed one time step behind the one above it; the columns of B enter from the top, one column per grid column, similarly skewed.)
41. Using a Grid
- Each cell updates at each time step: cell Pij holds rij (initialized to 0), adds the product of the a value arriving from the left and the b value arriving from above, then passes the a value to the right and the b value downward.
45. Multicomputer Systems
- Hypercube: why?
- Let's say that you had a hypercube of 8 nodes.
- Their addresses are 000, 001, 010, 011, 100, 101, 110, 111.
- Nodes whose addresses differ in exactly one bit are adjacent.
- Let's say you wanted to route a message from 000 to 111.
- This is easily done in three hops.
- You go from 000 to 001, then 001 to 011, and then 011 to 111.
- Routing is simple and fast (certainly simpler than on the Internet).
- By the way, the number of nodes is always a power of 2.
46. Multicomputer Systems
(Figure: a 3-dimensional hypercube with nodes labeled 000, 001, 010, 011, 100, 101, 110, 111; edges connect addresses that differ in one bit.)
47. Multicomputer Systems
- Hypercube
- Pipelines and grids can be embedded into a hypercube system.
- Example (a pipeline of 8 stages; consecutive addresses differ in one bit):
- 000 001 011 010 110 111 101 100
- Example (a 2 x 4 grid):
- 000 001 011 010
- 100 101 111 110
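The 8-stage pipeline ordering above is the reflected Gray code, which can be generated directly. This sketch (function name my own) uses the standard identity that the i-th Gray code word is i XOR (i >> 1):

```python
def gray_code(dim):
    """Reflected Gray code over dim bits: an ordering of all 2**dim
    hypercube addresses in which consecutive addresses differ in
    exactly one bit, so consecutive pipeline stages land on adjacent
    hypercube nodes."""
    return [i ^ (i >> 1) for i in range(2 ** dim)]

# Reproduces the 8-stage embedding from the slide.
print([format(n, '03b') for n in gray_code(3)])
# -> ['000', '001', '011', '010', '110', '111', '101', '100']
```

Because consecutive codewords differ in one bit, this is a Hamiltonian path through the hypercube, which is exactly what a pipeline embedding needs.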
48. Multiprocessor Usage
- Scientific and engineering applications often require loops over large vectors, e.g., matrix elements or points in a grid or 3D mesh. Applications include:
- Computational fluid dynamics
- Scheduling (airline)
- Health and biological modeling
- Economics and financial modelling (e.g., option pricing)
49. Multiprocessor Usage
- It should be noted that people have been developing clusters of machines connected using Ethernet for parallel applications.
- The first such cluster (developed by two researchers at NASA) had 16 486-based machines and was connected using 10 Mb/s Ethernet.
- This is known as the Beowulf approach to parallel computing, and the clusters are sometimes called Beowulf clusters.
50. Sharcnet
- UWO has taken a leading role in North America in exploiting the concepts behind the Beowulf cluster.
- High-performance clusters: Beowulf on steroids
- Powerful off-the-shelf computational elements
- Advanced communications
- Geographical separation (local use)
- Connect clusters using emerging optical communications
- This is referred to as the Shared Hierarchical Academic Research Computing Network, or Sharcnet.
51. Sharcnet
- One cluster is called Great White.
- Processors:
- 4 Alpha processors at 833 MHz (4p SMP)
- 4 GB of memory
- 38 SMPs, for a total of 152 processors
- Communications:
- 1 Gb/sec Ethernet
- 1.6 Gb/sec Quadrics interconnect
- November 2001: ranked 183rd in the world
- Fastest academic computer in Canada
- 6th fastest academic computer in North America
52. Sharcnet
Great White (in Western Science Building)
53. Sharcnet
- Extend the Beowulf approach to clusters of high-performance clusters.
- Connect clusters: clusters of clusters
- Build on emerging optical communications.
- The initial configuration used optical equipment from the telecommunications industry.
- Collectively, a supercomputer!
54. Sharcnet
Clusters across Universities (initial cluster)
55. Sharcnet
- In 2004, UWO received an investment of 56 million dollars from the government and private industry (HP) to expand Sharcnet.
- With the new capabilities, Sharcnet could be in the top 100 or 150 supercomputers.
- It will be the fastest supercomputer of its kind, i.e., a distributed system whose nodes are clusters.
57. Sharcnet
- Applications running on Sharcnet come from all sorts of domains, including:
- Chemistry
- Bioinformatics
- Economics
- Astrophysics
- Materials Science and Engineering
58. Networks of Computers
- High degree of node heterogeneity
- Nodes include PCs, workstations, multimedia workstations, palmtops, laptops.
- High degree of network heterogeneity
- This includes local-area Ethernet, ATM, and wireless connections.
- A distributed system should try to hide these differences.
- In this course, the focus really is on networks of computers.
59. Software Concepts
System     | Description | Main Goal
DOS        | Tightly-coupled operating system for multiprocessors and homogeneous multicomputers | Hide and manage hardware resources
NOS        | Loosely-coupled operating system for heterogeneous multicomputers (LAN and WAN) | Offer local services to remote clients
Middleware | Additional layer atop a NOS implementing general-purpose services | Provide distribution transparency
- An overview of:
- DOS (Distributed Operating Systems)
- NOS (Network Operating Systems)
- Middleware
60. Distributed Operating System
- The OS on each computer knows about the other computers.
- The OS on different computers is generally the same.
- Services are generally (transparently) distributed across computers.
61. Distributed Operating System
- This is harder to implement than a traditional operating system. Why?
- Memory is not shared.
- There is no simple global communication.
- There are no simple system-wide synchronization mechanisms.
- The OS may be required to maintain a global memory map in software.
- There is no central point where resource allocation decisions can be made.
- Only very few truly multicomputer operating systems exist.
62. Network Operating System
- Each computer has its own operating system with networking facilities.
- Computers work independently, i.e., they may even have different operating systems.
- Services are tied to individual nodes (ftp, telnet, www).
- Highly file oriented
63. Middleware
- The OS on each computer need not know about the other computers.
- The OS on different computers may be different.
- Services are generally (transparently) distributed across computers.
64. Middleware and Openness
- In an open middleware-based distributed system, the protocols used by each middleware layer should be the same, as well as the interfaces they offer to applications.
65. Middleware Services
- Communication services
- Hide primitive socket programming
- Data management in a distributed system
- Naming services
- Directory services (e.g., LDAP, search engines)
- Location services for tracking mobile objects
- Persistent storage facilities
- Data caching and replication
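A small illustration of the first item, "hide primitive socket programming," using Python's standard xmlrpc module as a stand-in for a middleware communication service (the choice of XML-RPC here is mine, not something the slides specify): the client calls what looks like a local function, and the library handles connections, marshalling, and message formats underneath.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer

# Server side: register an ordinary function; no socket code in sight.
server = SimpleXMLRPCServer(("localhost", 0), logRequests=False)  # 0 = any free port
port = server.server_address[1]
server.register_function(lambda a, b: a + b, "add")
threading.Thread(target=server.serve_forever, daemon=True).start()

# Client side: the proxy makes the remote call look like a local one.
proxy = ServerProxy(f"http://localhost:{port}")
print(proxy.add(2, 3))  # -> 5
```

Neither side touches sockets, message framing, or data encoding directly; that is exactly the kind of distribution transparency a middleware layer is meant to provide.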
66. Middleware Services
- Services giving applications control over when, where, and how they access data
- Distributed transaction processing
- Code migration
- Services for securing processing and communication
- Authentication and authorization services
- Simple encryption services
- Auditing services
- There are varying levels of success in being able to provide these types of middleware services.
67. Summary
- Distributed systems consist of autonomous computers that work together.
- When properly designed, distributed systems can scale well with respect to the size of the underlying network.
- Sometimes the lines are blurred between a distributed system and a system that can support parallel processing.