Title: Reliable Distributed Systems
1. Reliable Distributed Systems
2. Some terminology
- A program is the code you type in.
- A process is what you get when you run it.
- A message is used to communicate between processes. It can be of arbitrary size.
- A packet is a fragment of a message that might travel on the wire. Its size is variable but limited, usually to 1400 bytes or less.
- A protocol is an algorithm by which processes cooperate to do something using message exchanges.
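The message/packet distinction above can be illustrated with a minimal sketch. It assumes the 1400-byte limit quoted on this slide; the function names `fragment` and `reassemble` and the use of a sequence number are illustrative choices, not part of any particular protocol.

```python
MAX_PACKET_PAYLOAD = 1400  # bytes; the per-packet limit quoted on the slide


def fragment(message, limit=MAX_PACKET_PAYLOAD):
    """Split a message (bytes) into (sequence_number, payload) packets."""
    return [
        (seq, message[offset:offset + limit])
        for seq, offset in enumerate(range(0, len(message), limit))
    ]


def reassemble(packets):
    """Rebuild the message, tolerating packets that arrive out of order."""
    return b"".join(payload for _, payload in sorted(packets))


message = bytes(3000)                 # an arbitrary-size message
packets = fragment(message)
print([len(p) for _, p in packets])   # [1400, 1400, 200]
print(reassemble(list(reversed(packets))) == message)  # True
```

The sequence number is what lets the receiver put packets back in order; a real protocol would also need to cope with packets that never arrive.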
3. More terminology
- A network is the infrastructure that links the computers, workstations, terminals, servers, etc.
  - It consists of routers.
  - They are connected by communication links.
- A network application is one that fetches needed data from servers over the network.
- A distributed system is a more complex application designed to run on a network. Such a system has multiple processes that cooperate to do something.
4. A network is like a mostly reliable post office
5. Why isn't it totally reliable?
- Links can corrupt messages.
  - This is rare on the high-quality links of the Internet backbone.
  - It is more common with wireless connections and cable modems.
- Routers can get overloaded.
  - When this happens they drop messages.
  - This is very common.
- But protocols that retransmit lost packets can increase reliability.
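As a rough illustration of the last point, here is a sketch of a stop-and-wait sender that resends each packet until it gets through. The lossy link is simulated with a seeded random number generator; `lossy_send`, the 30% loss rate, and the retry limit are all assumptions made for the example, not measurements of any real network.

```python
import random


def lossy_send(packet, rng, loss_rate):
    """Hypothetical link: True if the packet (and its ack) got through."""
    return rng.random() >= loss_rate


def send_reliably(packets, rng, loss_rate=0.3, max_tries=20):
    """Stop-and-wait: resend each packet until it is acknowledged."""
    delivered, transmissions = [], 0
    for packet in packets:
        for _ in range(max_tries):
            transmissions += 1
            if lossy_send(packet, rng, loss_rate):
                delivered.append(packet)
                break
        else:
            # Every attempt was lost; a real protocol would report failure.
            raise TimeoutError(f"gave up on {packet!r}")
    return delivered, transmissions


rng = random.Random(42)
delivered, transmissions = send_reliably(["p0", "p1", "p2", "p3"], rng)
print(delivered)       # all four packets, in order
print(transmissions)   # at least 4; the excess is retransmissions
```

Retransmission trades extra traffic and delay for reliability: the receiver sees every packet, but the sender may have to transmit some of them several times.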
6. How do distributed systems differ from network applications?
- Distributed systems may have many components but are often designed to mimic a single, non-distributed process running at a single place.
- State is spread around in a distributed system.
- A networked application is free-standing and centered around the user or computer where it runs (e.g., a web browser).
- A distributed system is spread out and decentralized (e.g., an air traffic control system).
7. Web connectivity
- The browser is independent: it fetches the data you request when you ask for it.
- Web servers do not keep track of who is using them. Each request is self-contained and treated independently of all others.
  - Cookies do not count: they are stored on the client machine.
  - And the database of account info doesn't count either: this is ancient history, nothing recent.
- So the web has two network applications that talk to each other:
  - The browser on your machine
  - The web server it happens to connect with, which has a database behind it
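A small sketch of what "stateless server, state in the cookie" can look like. It uses Python's standard `http.cookies` module; the handler function, the `visits` cookie name, and the `/catalog` path are hypothetical and chosen only for illustration.

```python
from http.cookies import SimpleCookie


def handle_request(path, cookie_header=""):
    """Stateless handler: everything it knows arrives with the request."""
    cookie = SimpleCookie(cookie_header)
    visits = int(cookie["visits"].value) if "visits" in cookie else 0
    visits += 1

    reply_cookie = SimpleCookie()
    reply_cookie["visits"] = str(visits)   # the reply updates the cookie
    body = f"{path}: visit number {visits}"
    return body, reply_cookie["visits"].OutputString()


# The "browser" stashes the cookie and sends it back with the next request.
body, stashed = handle_request("/catalog")
body, stashed = handle_request("/catalog", stashed)
print(body)     # /catalog: visit number 2
print(stashed)  # visits=2
```

The server function keeps no variables between calls, so any server replica could have handled the second request.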
8. Web (cont'd)
[Diagram labels: Web browser with stashed cookies; HTTP request; Database]
- The cookie identifies this user and encodes past preferences.
- Web servers are kept current by the database but usually don't talk to it when your request comes in.
9. Web (cont'd)
- Web servers immediately forget the interaction.
- The reply updates the cookie.
10. Web (cont'd)
- Web servers have no memory of the interaction.
- A purchase is a transaction on the database.
11. Web (cont'd)
- The data center that serves your request may be a complex distributed system:
  - Many servers and perhaps multiple physical sites
  - Opinions about which clients should talk to which servers
  - Data replicated for load balancing and high availability
  - Complex security and administration policies
- So we have a networked application talking to a distributed system.
12. Other examples of distributed systems
- Air traffic control system with workstations for the controllers
- Banking/brokerage trading system that coordinates trading (risk management) at multiple locations
- Factory floor control system that monitors devices and re-plans work as they go on/offline
13. Is the Web reliable?
- We want to build distributed systems that can be relied upon to do the correct thing and to provide services according to the user's expectations.
- Not all systems need reliability.
  - If a web site doesn't respond, you just try again later.
- Reliability is a growing requirement in critical settings, but these remain a small percentage of the overall market for networked computers.
14. Reliability is a broad term
- Fault-tolerance: remains correct despite failures
- High or continuous availability: resumes service after failures, doesn't wait for repairs
- Performance: provides desired responsiveness
- Recoverability: can restart failed components
- Consistency: coordinates actions by multiple components, so they mimic a single one
- Security: authenticates access to data, services
- Privacy: protects identity, locations of users
15. Failure also has many meanings
- Halting failures: component simply stops
- Fail-stop: halting failures with notifications
- Omission failures: failure to send or receive a message
- Network failures: network link breaks
- Network partition: network fragments into two or more disjoint subnetworks
- Timing failures: action happens early or late, clock fails, etc.
- Byzantine failures: arbitrary, possibly malicious behavior
16. Examples of failures
- My PC suddenly freezes up while running a text processing program. No damage is done. This is a halting failure.
- A network file server tells its clients that it is about to shut down, then goes offline. This is a fail-stop failure. (The notification can be trusted.)
- An intruder hacks the network and replaces some parts with fakes. This is a Byzantine failure.
17. More terminology
- A real-world network is what we work on. It has computers, links that can fail, and some problems synchronizing time. But this is hard to model in a formal way.
- An asynchronous distributed system is a theoretical model of a network with no notion of time.
- A synchronous distributed system, in contrast, has perfect clocks and places time bounds on all events, such as message passing.
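One practical consequence of the two models is how far a timeout can be trusted. The sketch below assumes a peer that sends periodic "I am alive" messages and a parameter `max_delay` covering the sending interval plus the message-delay bound; both the function name and the numbers are illustrative assumptions.

```python
def suspect_crashed(last_heard, now, max_delay):
    """Suspect a peer if nothing has arrived from it within max_delay seconds."""
    return (now - last_heard) > max_delay


# Synchronous model: max_delay is a guaranteed bound, so silence beyond it
# can only mean the peer has halted.
print(suspect_crashed(last_heard=10.0, now=10.4, max_delay=0.5))  # False
print(suspect_crashed(last_heard=10.0, now=11.0, max_delay=0.5))  # True

# Asynchronous model: no such bound exists, so any max_delay is a guess and
# a slow (but live) peer may be suspected wrongly.
```

The code is the same in both models; what changes is whether a `True` result is a fact or only a suspicion.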
18. ISO protocol layers: an oft-cited standard
- ISO is tied to a TCP-style of connection
- Match with modern protocols is poor
- We are mostly at layer 4 (the transport layer)
19. Internet protocol suite
- Can be understood in terms of ISO
- Defines an addressing standard, a basic network layer (IP packets, limited to 1400 bytes), and transport protocols (TCP, UDP, UDP-multicast)
  - For example, TCP is a transport protocol
- Includes a standard domain name service (DNS) that maps host names to IP addresses
  - DNS itself is tree-structured and caches data
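The "tree-structured and caches data" remark can be pictured with a toy resolver. This is a deliberately simplified sketch, not how real DNS servers are implemented: the tree is a nested dictionary, the host names are made up, and the addresses come from the 192.0.2.0/24 range reserved for documentation.

```python
class ToyResolver:
    """Toy name service: a tree of zones plus a client-side cache."""

    def __init__(self, tree):
        self.tree = tree    # nested dicts, root at the top
        self.cache = {}     # host name -> IP address
        self.lookups = 0    # number of full tree walks performed

    def resolve(self, name):
        if name in self.cache:
            return self.cache[name]
        self.lookups += 1
        node = self.tree
        # Walk from the root: "www.example.com" -> com, example, www
        for label in reversed(name.split(".")):
            node = node[label]
        self.cache[name] = node
        return node


tree = {"com": {"example": {"www": "192.0.2.10", "mail": "192.0.2.25"}}}
resolver = ToyResolver(tree)
print(resolver.resolve("www.example.com"))  # 192.0.2.10
print(resolver.resolve("www.example.com"))  # 192.0.2.10 (served from cache)
print(resolver.lookups)                     # 1
```

The second lookup never touches the tree, which is the point of caching; real DNS additionally expires cached entries after a time-to-live.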
20. Typical hardware options
- Ethernet: 10 Mbit/s CSMA/CD technology, limited to 1400-byte packets. Uses a single coax cable.
- FDDI: dual-ring technology, normally over optical fiber (a twisted-pair variant exists), self-repairing if a cable breaks
- Bridged Ethernet: common in big LANs, ring with multiple Ethernet segments
- Fast Ethernet: 100 Mbit/s version of Ethernet
- ATM: switching technology for fiber-optic paths. Can run at 155 Mbit/s or more. Very reliable, but mostly used in telephone systems.
21. Implications for reliability?
- Protocol designers have problems predicting the properties of local-area networks
  - Latencies and throughput may vary widely even in a single installation
  - Hardware properties differ widely; often, one must assume the least common denominator
- Packet loss is a minor problem in the hardware itself
22. Technology trends
Did the sudden growth in LAN speed give us the Web?
Source: Scientific American, Sept. 1995
23. Typical latencies (milliseconds)
WAN and disk latencies are fairly constant due to physical limitations.
Note the dramatic drop in LAN latencies over ATM. This is the hardware used in telephone systems.
24. O/S latency: the most expensive overhead on LAN communication!
25. Broad observations
- A discontinuity is currently occurring in WAN communication speeds!
  - Especially in military systems, where ATM networking hardware has been deployed widely
- Other performance curves are all similar
- Disks have maxed out and hence are looking slower and slower
- Memory of remote computers looks closer and closer
- O/S-imposed communication latencies have risen in relative terms over the past decade!
26. Implications?
- The revolution in WAN communication we are now seeing is not surprising and will continue
- Look for a shift from disk storage towards more use of access to remote objects over the network
- O/S overhead is already by far the main obstacle to low latency, and this problem will seem worse and worse unless O/S communication architectures evolve in major ways.
27. More implications
- Look for full-motion video to the workstation by around 2010 or 2015; today we already see this in bits and pieces but not as a routine option
- Low LAN latencies: an unexploited niche
- One puzzle: what to do with extremely high data throughput but relatively high WAN latencies
- O/S architecture and the whole concept of an O/S must change to better exploit the pool of memory of a cluster of machines; otherwise, disk latencies will loom higher and higher
28. Discovery
- Consider the problem of discovering the right server to connect with
- Your computer needs current map data for some place, perhaps an amusement park
- Can think of it in terms of layers: the basic park layout, overlaid with extra data from various services, such as "length of the line for the Cyclone Coaster" or "options for vegetarian dining near here"
29. Why is discovery hard?
- Client has opinions
  - You happen to like vegetarian food, but not spicy food. So your search is partly controlled by client goals
- But a given service might have multiple servers (e.g., Amazon might have data centers in Europe and in the US) and may want your request to go to a particular one
- Once we find the server name we need to map it to an IP address
- And the Internet itself has routing opinions too
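The first three "opinions" above can be sketched as three lookup steps. Everything in this example is hypothetical: the service names, tags, regions, host names, and addresses are invented for illustration, and Internet routing (the fourth opinion) is outside what application code controls.

```python
# All names, regions, and addresses below are hypothetical.
SERVICES = [
    {"name": "veggie-dining", "tags": {"vegetarian", "mild"}},
    {"name": "spicy-grill", "tags": {"spicy"}},
    {"name": "coaster-queue", "tags": {"rides"}},
]
SERVERS = {"veggie-dining": {"eu": "eu.veggie.example", "us": "us.veggie.example"}}
ADDRESSES = {"eu.veggie.example": "192.0.2.1", "us.veggie.example": "192.0.2.2"}


def discover(wanted, unwanted, client_region):
    # 1. Client opinion: keep services that match the client's goals.
    matches = [
        s for s in SERVICES
        if wanted <= s["tags"] and not (unwanted & s["tags"])
    ]
    if not matches:
        raise LookupError("no service matches the client's goals")
    service = matches[0]["name"]
    # 2. Service opinion: the service decides which of its servers to use.
    server = SERVERS[service][client_region]
    # 3. Naming: map the chosen server name to an IP address.
    return service, server, ADDRESSES[server]


print(discover({"vegetarian"}, {"spicy"}, "eu"))
# ('veggie-dining', 'eu.veggie.example', '192.0.2.1')
```

Each step is owned by a different party, which is part of why discovery is hard: no single component sees all of the preferences involved.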
30. Fundamental terms: Protocol
- A protocol is a set of rules that end points in a telecommunication system use when exchanging information.
- IP (Internet Protocol) defines an unreliable packet transfer protocol.
- TCP (Transmission Control Protocol) builds on IP to define a reliable data delivery protocol.
- LDAP (Lightweight Directory Access Protocol) builds on TCP to define a query-response protocol for querying the state of a remote database.
- HTTP (Hypertext Transfer Protocol) builds on TCP to facilitate hypertext document exchange.
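To make "HTTP builds on TCP" concrete, the sketch below opens a TCP connection on the local machine and exchanges one hand-written HTTP/1.0 request and reply over it. The tiny server, the `/index.html` path, and the reply text are assumptions made for the example; it also assumes the short request arrives in a single `recv`, which a robust server would not rely on.

```python
import socket
import threading


def serve_once(listener):
    """Accept one TCP connection and answer one HTTP request."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024).decode("ascii")
        path = request.split()[1]          # "GET /index.html HTTP/1.0"
        body = f"you asked for {path}".encode("ascii")
        header = f"HTTP/1.0 200 OK\r\nContent-Length: {len(body)}\r\n\r\n"
        conn.sendall(header.encode("ascii") + body)


listener = socket.create_server(("127.0.0.1", 0))   # port 0: OS picks a port
port = listener.getsockname()[1]
threading.Thread(target=serve_once, args=(listener,), daemon=True).start()

with socket.create_connection(("127.0.0.1", port)) as client:
    client.sendall(b"GET /index.html HTTP/1.0\r\n\r\n")   # HTTP is just text
    reply = b""
    while chunk := client.recv(1024):                      # read until close
        reply += chunk
listener.close()

status_line, _, rest = reply.partition(b"\r\n")
print(status_line.decode())   # HTTP/1.0 200 OK
```

TCP supplies the reliable byte stream; HTTP only defines what the bytes mean.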
31. Fundamental terms: Service
- A service is a network-enabled entity that provides a specific capability.
- Service = Protocol + Behavior
- A service definition permits many implementations.
- Examples: ability to move files, create processes, verify access rights
- An FTP server speaks the File Transfer Protocol and supports remote read and write access to a collection of files.
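The idea that one service definition permits many implementations can be sketched with an abstract interface and two interchangeable implementations. The class names and the `read`/`write` operations are hypothetical, loosely modeled on the file-access example above rather than on FTP itself.

```python
from abc import ABC, abstractmethod


class FileService(ABC):
    """The service definition: which requests exist and what they must do."""

    @abstractmethod
    def read(self, name): ...

    @abstractmethod
    def write(self, name, data): ...


class InMemoryFileService(FileService):
    """One implementation: files kept in a dictionary."""

    def __init__(self):
        self._files = {}

    def read(self, name):
        return self._files[name]

    def write(self, name, data):
        self._files[name] = data


class LoggingFileService(FileService):
    """A second implementation: same behavior, plus an audit log."""

    def __init__(self, inner):
        self._inner = inner
        self.log = []

    def read(self, name):
        self.log.append(("read", name))
        return self._inner.read(name)

    def write(self, name, data):
        self.log.append(("write", name))
        self._inner.write(name, data)


def round_trip(service):
    """Client code depends only on the definition, not the implementation."""
    service.write("notes.txt", b"hello")
    return service.read("notes.txt")


plain = InMemoryFileService()
logged = LoggingFileService(InMemoryFileService())
print(round_trip(plain), round_trip(logged))   # b'hello' b'hello'
```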
32. Fundamental terms: API
- An Application Program Interface (API) defines a standard interface for invoking a specified set of functionality.
- Example: the Generic Security Service (GSS) API defines standard functions for verifying the identity of communicating parties, encrypting messages, and so forth.
33. Client/Server
- Server refers to a process on a networked computer that accepts requests from other (local or remote) processes to perform a service and responds appropriately.
- Client: the requesting process in the above is referred to as the client.
- Request and response are in the form of messages.
- The client is said to invoke an operation on the server.
- Many distributed systems today are constructed out of interacting clients and servers.
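The request/response pattern can be sketched without any networking by treating the transport as a function from request bytes to response bytes. The JSON message layout, the `op`/`args` field names, and the two sample operations are assumptions made for this example.

```python
import json


class Server:
    """Accepts request messages and returns response messages."""

    def __init__(self):
        self.operations = {
            "add": lambda a, b: a + b,
            "upper": lambda s: s.upper(),
        }

    def handle(self, request_bytes):
        request = json.loads(request_bytes)
        operation = self.operations.get(request["op"])
        if operation is None:
            reply = {"ok": False, "error": "unknown operation"}
        else:
            reply = {"ok": True, "result": operation(*request["args"])}
        return json.dumps(reply).encode()


class Client:
    """Invokes operations on a server by exchanging messages."""

    def __init__(self, transport):
        self.transport = transport   # any callable: bytes -> bytes

    def invoke(self, op, *args):
        request = json.dumps({"op": op, "args": args}).encode()
        reply = json.loads(self.transport(request))
        if not reply["ok"]:
            raise RuntimeError(reply["error"])
        return reply["result"]


server = Server()
client = Client(transport=server.handle)   # in-process stand-in for a network
print(client.invoke("add", 2, 3))          # 5
print(client.invoke("upper", "hello"))     # HELLO
```

Because the client only sees a transport, the direct function call could be replaced by a network connection without changing the client's logic.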
34. Middleware (as defined by NSF)
- Middleware refers to the software which is common to multiple applications and builds on the network transport services to enable ready development of new applications and network services.
- Middleware typically includes a set of components such as resources and services that can be utilized by applications either individually or in various subsets.
- Examples of services: security, directory and naming, end-to-end quality of service, support for mobile code.
- OMG's CORBA defines a middleware standard.
  - OMG: Object Management Group
  - CORBA: Common Object Request Broker Architecture
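A toy picture of what a middleware layer offers an application: a naming service to find a component and a proxy that forwards calls to it. This is only a sketch of the general idea, not CORBA's actual interfaces; `Middleware`, `Proxy`, `Thermometer`, and the registered name are all invented for the example, and the forwarding happens in-process where a real system would send network messages.

```python
class Middleware:
    """Toy middleware: a naming service plus request forwarding."""

    def __init__(self):
        self._registry = {}

    def register(self, name, obj):
        self._registry[name] = obj

    def lookup(self, name):
        return Proxy(self, name)

    def _call(self, name, method, args):
        # A real system would marshal this into a message and send it
        # over the network to wherever the target object lives.
        target = self._registry[name]
        return getattr(target, method)(*args)


class Proxy:
    """Stand-in the application calls as if it were the remote object."""

    def __init__(self, middleware, name):
        self._middleware = middleware
        self._name = name

    def __getattr__(self, method):
        # Only reached for names not found on the proxy itself.
        return lambda *args: self._middleware._call(self._name, method, args)


class Thermometer:
    def read_celsius(self):
        return 21.5


middleware = Middleware()
middleware.register("sensors/lobby", Thermometer())

sensor = middleware.lookup("sensors/lobby")   # application side
print(sensor.read_celsius())                  # 21.5
```

The application never learns where the thermometer object lives; that separation is what lets many applications share the same middleware services.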
35. Middleware
[Figure: diagram with components labeled server, client, desktop, middleware, and network]
36. Summary
- In this course, we will study distributed systems at the middleware level: how to define, design, and implement services, and how to use the middleware services in a distributed application.
- We will study Java RMI as a case study for a simple distributed system.
- We will also introduce the web services standard.