Title: Bran Selic Rational Software Canada bselic@rational.com
1Physical Programming Beyond Mere Logic
- Bran Selic, Rational Software Canada, bselic@rational.com
2What I am Hoping For
[Figure text, truncated: ...E THEORY AND PRACTICE OF SOFTWARE]
3The Ideal and the Real
- By focussing on the imperfect world of physical
reality we may miss the essence
- Software seems much closer to the ideal world
4The Software World
- Fundamental design principle: separate program logic from the underlying implementation technology
- separation of concerns
- software portability
[Figure: layered stack, top to bottom: Program Logic; HL Programming Languages; Computing Environment Technology]
5The Real-Time Software World
- Key question: How long will it take?
- The quantitative characteristics of the computing environment encroach upon the purity of the logic
- software design involves engineering tradeoffs
6A Simple Programming Application
- Traverse a transactions log database and print
all transactions pertaining to a specific account
    open (DB)
    for i := 1 to DB.size do
        record := read (DB)
        if (record.acctNo = myAccount) then print (record)
    enddo
    close (DB)
7Porting to a Distributed Environment
- Can it really be this simple?
[Figure: the program and the database separated by a Network]
Local version:
    open (DB)
    for i := 1 to DB.size do
        record := read (DB)
        if (record.acctNo = myAccount) then print (record)
    enddo
    close (DB)
Distributed version:
    RPC_open (DB)
    for i := 1 to DB.size do
        record := RPC_read (DB)
        if (record.acctNo = myAccount) then print (record)
    enddo
    RPC_close (DB)
8Some (Unstated!) Assumptions
- The CPU and database are fast enough for the needs of the application
- e.g. random access database hardware
- The CPU and database fail as a unit
- i.e., no need to contend with failures of the database
- Communication is reliable
- order-preserving
- exactly-once semantics
- A system never has anything more important to do
than what it is doing at the moment
9Partial Failures
- Distributed systems can exhibit partial failures
- fault tolerance: the ability to recover from partial failures
- Issue: failure recovery strategy
- fault detection
- failure recovery
- fault diagnosis
- Issue: how do other sites detect that a site has failed?
- (apparent) lack of activity/response
- how do we distinguish between a failed site and a lost message?
- Timeout is the only general mechanism available
- how long do we wait?
- Tradeoff between responsiveness and degree of certainty
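The timeout tradeoff can be illustrated with a minimal Python sketch. The `TimeoutDetector` class, the site name `db-site`, and the timeout values are hypothetical and not part of the original slides; time is passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutDetector:
    """Hypothetical helper: suspects a site after `timeout` seconds of silence."""
    timeout: float
    last_heard: dict = field(default_factory=dict)

    def heard_from(self, site: str, now: float) -> None:
        # Record the time of the most recent message from this site.
        self.last_heard[site] = now

    def suspects(self, site: str, now: float) -> bool:
        # Silence longer than the timeout is only an *apparent* failure:
        # the site may be alive and its message merely delayed or lost.
        last = self.last_heard.get(site)
        return last is None or (now - last) > self.timeout


fast = TimeoutDetector(timeout=0.5)      # responsive, but less certain
cautious = TimeoutDetector(timeout=5.0)  # more certain, but slow to react

for detector in (fast, cautious):
    detector.heard_from("db-site", now=0.0)

# Assume the site is alive but its next message is delayed.
fast_verdict = fast.suspects("db-site", now=1.0)          # suspects a live site
cautious_verdict = cautious.suspects("db-site", now=1.0)  # still waiting
```

A short timeout reacts quickly but wrongly suspects a slow site; a long timeout avoids that mistake at the cost of a slower reaction to a real failure.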
10A More Realistic Distribution Scenario
- Dealing with partial failures
    DB := locate_database (Network)
        exception abort
    RPC_open (DB)
        exception do
            DB := locate_database (Network)
                exception abort
        enddo
    for i := 1 to DB.size do
        record := RPC_read (DB)
            exception do
                DB := locate_database (Network)
                    exception abort
                for j := 1 to (i-1) do
                    RPC_read (DB)
                        exception abort
                retry
            enddo
        if (record.acctNo = myAccount) then print (record)
    enddo
    RPC_close (DB)
Most of the code is in the exception handlers!
11Asynchronous Events and Fault Tolerance
- Partial system failures are only one kind of event that may need to be handled in the course of execution of a distributed program
- Others:
- high-priority situations (e.g., imminent deadlines)
- aborts
- These events are often unpredictable
- may occur at any point in the execution of a program
- fault tolerance requires that whenever they occur and whatever they are, we need to deal with them
12Revisiting An Old Assumption
- Is the traditional, main-path-focussed programming style appropriate when exceptions are the rule?
13Asynchronous Event Handling
- This is nicely captured by the state-event matrix
of finite state machines
[Figure: state-event matrix with events (Event A, etc., Event S) and handler cells (Handler AN, Handler AN1, Handler AN2)]
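A state-event matrix can be sketched in Python as a table keyed by (state, event) pairs, where each cell names a handler that returns the next state. The states, events, and handlers below are hypothetical, loosely modelled on the database-scan example; they are not taken from the slides.

```python
from enum import Enum, auto


class State(Enum):
    IDLE = auto()
    READING = auto()
    RECOVERING = auto()


class Event(Enum):
    START = auto()
    RECORD = auto()
    TIMEOUT = auto()
    ABORT = auto()


# Each handler logs what it did and returns the next state.
def open_db(log):
    log.append("open")
    return State.READING


def print_record(log):
    log.append("print")
    return State.READING


def relocate(log):
    log.append("relocate")
    return State.RECOVERING


def resume(log):
    log.append("resume")
    return State.READING


def give_up(log):
    log.append("give up")
    return State.IDLE


# The matrix: one cell per (state, event) pair that the program handles.
MATRIX = {
    (State.IDLE, Event.START): open_db,
    (State.READING, Event.RECORD): print_record,
    (State.READING, Event.TIMEOUT): relocate,
    (State.READING, Event.ABORT): give_up,
    (State.RECOVERING, Event.RECORD): resume,
    (State.RECOVERING, Event.TIMEOUT): give_up,
    (State.RECOVERING, Event.ABORT): give_up,
}


def run(events):
    """Feed events through the matrix; empty cells are logged and ignored."""
    state, log = State.IDLE, []
    for event in events:
        handler = MATRIX.get((state, event))
        if handler is None:
            log.append(f"ignored {event.name} in {state.name}")
        else:
            state = handler(log)
    return state, log


final_state, trace = run(
    [Event.START, Event.RECORD, Event.TIMEOUT, Event.RECORD, Event.ABORT]
)
```

In this style there is no privileged main path: a timeout or abort is handled by the same lookup as a normal record, in whatever state the program happens to be.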
14A Conclusion
- In an event-driven and deadline-based application, a state machine-based programming model may be more appropriate than the traditional algorithmic (main path) programming model
- The environment strikes back
- the program logic is strongly affected by the environment
15Communication Media Failures
- Message loss
- due to hardware failures
- due to software failures (e.g., buffer overflow)
- Message reordering
- due to different paths
- due to variable delays (e.g., due to variable message lengths)
- retransmission due to fault-tolerant protocols
- Message duplication
- due to faulty hardware
- retransmission due to fault-tolerant protocols
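One common way for a receiver to cope with these three failure modes is to number messages. The following Python sketch assumes the sender attaches consecutive sequence numbers starting at 0; the `InOrderReceiver` class is hypothetical and is not a protocol described in the slides.

```python
class InOrderReceiver:
    """Hypothetical receiver that hides duplication and reordering."""

    def __init__(self):
        self.next_seq = 0    # next sequence number to deliver
        self.pending = {}    # early arrivals, keyed by sequence number
        self.delivered = []  # payloads handed to the application, in order

    def receive(self, seq, payload):
        # Duplicates (already delivered or already buffered) are dropped.
        if seq < self.next_seq or seq in self.pending:
            return
        self.pending[seq] = payload
        # Deliver every message that is now contiguous with what came before.
        while self.next_seq in self.pending:
            self.delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

    def missing(self):
        # Gaps below the highest buffered number indicate loss (or delay);
        # these are the candidates for a retransmission request.
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending))
                if s not in self.pending]


rx = InOrderReceiver()
for seq, payload in [(1, "b"), (0, "a"), (0, "a"), (3, "d")]:
    rx.receive(seq, payload)
gaps_before = rx.missing()
delivered_before = list(rx.delivered)
rx.receive(2, "c")  # the late or retransmitted message arrives
```

Note that the receiver cannot tell a lost message from a slow one; deciding when to request retransmission again comes down to a timeout.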
16Transmission Delays
- Possibility of out of date status information
17Relativistic Effects
- Relativistic effects
- different observers see different event orderings
(due to different and variable transmission
delays)
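A small Python sketch makes the effect concrete. The event names, site names, and delay values are invented for illustration; each observer simply orders events by arrival time, which is send time plus the transmission delay from the source to that observer.

```python
def observed_order(events, delays):
    """Order events as seen by one observer.

    events: list of (name, send_time, source)
    delays: transmission delay from each source to this observer
    """
    arrivals = sorted((send + delays[src], name) for name, send, src in events)
    return [name for _, name in arrivals]


# e1 really happens before e2 (send times 0.0 and 1.0).
events = [("e1", 0.0, "siteA"), ("e2", 1.0, "siteB")]

# Observer X is equally far from both sites.
order_x = observed_order(events, {"siteA": 0.5, "siteB": 0.5})

# Observer Y has a slow path from siteA, so e1 arrives after e2.
order_y = observed_order(events, {"siteA": 3.0, "siteB": 0.5})
```

Neither observer is wrong from its own point of view, which is why agreeing on a single global ordering requires extra protocol work.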
18Distribution Transparencies
- Providing supporting layers of functionality that shield the application from the undesirable effects of distribution
- e.g., reliable communication protocols
[Figure: client and server]
19Impossibility Result No.1
- It is not possible to guarantee that agreement can be reached in finite time over an asynchronous communication medium, if the medium is lossy or one of the distributed sites can fail
- Fischer, M., Lynch, N., and Paterson, M., "Impossibility of Distributed Consensus with One Faulty Process," Journal of the ACM, 32(2), April 1985.
20Impossibility Result No.2
- Even when communication is fully reliable, it is not possible to guarantee common knowledge if communication delays are unbounded
- Halpern, J.Y., and Moses, Y., "Knowledge and common knowledge in a distributed environment," Journal of the ACM, 37(3), 1990.
21The End-To-End Argument
- Transparency mechanisms are intended to protect the application from observing the undesirable effects of distribution
- Most transparency types require distributed agreement!
- The end-to-end argument (Saltzer et al.)
- if transparency cannot be guaranteed, the application is not really shielded from the effects of distribution
- the overhead of introducing transparency mechanisms may not be justified
22Stepping Back...
- Most distribution problems are a consequence of the encroachment of the physical world into the pliable and limitless logical world of software
- the problem is fundamental (e.g., the end-to-end argument)
- Traditional: Programming = Logic
- Physical: Programming = Logic + Physics
- like traditional engineers, software designers must take into account the raw material out of which they spin their logic
- finite resources, finite delays, finite reliability...
23Quality of Service Concepts
- The physical characteristics of software can be specified using the general notion of Quality of Service (QoS)
- a specification of how well a service is (to be) performed
- e.g. throughput, capacity, response time
- usually a quantitative measure
- QoS specifications are two-sided
- offered QoS: the QoS that is offered to clients
- required QoS: the QoS required by a client
24Resources and Quality of Service
- Resource: an element whose functional capacity is limited, directly or indirectly, by the finite capacities of the underlying physical computing environment
- The services of a resource are characterized by one or more QoS attributes
- capacity, reliability, availability, response time, etc.
[Figure: a Client and a Resource linked by a Resource Demand, annotated with Required QoS and Offered QoS]
Required QoS ≤ Offered QoS
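The comparison between required and offered QoS can be sketched as a per-attribute check. This is a minimal Python illustration, not a definitive method: the attribute names, the `BOUND` table, and the numbers are assumptions. The sketch also makes one point explicit: whether "covered" means at least or at most depends on the attribute (a client wants at least some throughput, but at most some response time).

```python
# Hypothetical attribute catalogue: how a required value bounds the offer.
BOUND = {
    "throughput_per_s": "at_least",  # the offer must be >= the requirement
    "availability": "at_least",
    "response_time_ms": "at_most",   # the offer must be <= the requirement
}


def unmet(required, offered):
    """Return the required attributes that the offer does not cover."""
    failures = []
    for attr, need in required.items():
        have = offered.get(attr)
        if have is None:
            failures.append(attr)  # the resource makes no promise at all
        elif BOUND[attr] == "at_least" and have < need:
            failures.append(attr)
        elif BOUND[attr] == "at_most" and have > need:
            failures.append(attr)
    return failures


required = {"response_time_ms": 5.0, "throughput_per_s": 100}
offered = {"response_time_ms": 4.0, "throughput_per_s": 80, "availability": 0.999}
problems = unmet(required, offered)
```

An empty result means the resource can serve this client; any entry names an attribute on which the contract would be violated.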
25Simple Example
- Concurrent tasks accessing a monitor with known
response time characteristics
[Figure: tasks accessing a monitor; labels: Required QoS, Deadline = 3 ms, MaxExecutionTime = 4 ms, Offered QoS]
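The figure labels suggest a check of a client deadline against the monitor's worst-case execution time. The Python sketch below rests on assumptions that are not stated in the slides: the deadline (3 ms) is read as the client's required QoS, the maximum execution time (4 ms) as the monitor's offered QoS, and the monitor is assumed to serve callers first-come first-served without preemption, so a caller may wait for every competing task once before it runs.

```python
def worst_case_response_ms(max_execution_ms, competing_tasks):
    # Assumed FIFO, non-preemptive monitor: wait for each competitor's
    # full execution, then execute ourselves.
    return (competing_tasks + 1) * max_execution_ms


def meets_deadline(deadline_ms, max_execution_ms, competing_tasks):
    return worst_case_response_ms(max_execution_ms, competing_tasks) <= deadline_ms


# With the numbers from the figure, the deadline is missed even with no
# competing task, because one execution alone can take 4 ms.
alone_ok = meets_deadline(deadline_ms=3, max_execution_ms=4, competing_tasks=0)

# Each additional concurrent task adds up to another 4 ms of waiting.
with_two_others = worst_case_response_ms(max_execution_ms=4, competing_tasks=2)
```

Real schedulability analysis is considerably more involved; the sketch only shows that the answer depends on both the offered figure and the load placed on the resource.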
26Types and Physical Types
- The purpose of types is to tell us about the externally relevant properties of software components so that we can validate whether they are being used appropriately
- Physical types: type specifications that incorporate QoS characteristics
- Answer two key engineering questions
- can this component support the load intended for it?
- what does this component require to support its offered QoS?
27Physical Type Example
- A semaphore type
- class Semaphore
- heap = 10 bytes -- required QoS
- CPU ≥ 5 MIPS -- required QoS
- get(): proc ≤ 0.4CPU us, stack = 4 bytes
- rel(): proc ≤ 0.4CPU us, stack = 4 bytes
- Usage
- mySema : Semaphore
- mySema.get() proc ≤ 3 us -- req. QoS
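The semaphore example can be approximated in Python as a type that carries both what it requires from its environment and what it offers per operation. This is a sketch under assumptions: the slide's notation is partly garbled, so the reading used here (10 bytes of heap and a CPU of at least 5 MIPS required; each operation offered at no more than 0.4 us and 4 bytes of stack) is an interpretation, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperationQoS:
    max_proc_us: float  # offered: worst-case processing time (assumed reading)
    stack_bytes: int    # required: stack used by the call


@dataclass(frozen=True)
class PhysicalType:
    name: str
    heap_bytes: int      # required from the environment
    min_cpu_mips: float  # required from the environment
    operations: dict     # operation name -> OperationQoS


SEMAPHORE = PhysicalType(
    name="Semaphore",
    heap_bytes=10,
    min_cpu_mips=5.0,
    operations={"get": OperationQoS(0.4, 4), "rel": OperationQoS(0.4, 4)},
)


def check_use(ptype, op, required_proc_us, env_cpu_mips, env_free_heap):
    """Return a list of physical type errors for one use of `op`."""
    problems = []
    if env_cpu_mips < ptype.min_cpu_mips:
        problems.append("CPU too slow")
    if env_free_heap < ptype.heap_bytes:
        problems.append("not enough heap")
    if ptype.operations[op].max_proc_us > required_proc_us:
        problems.append(f"{op}() too slow")
    return problems


# Usage from the slide: the client requires get() within 3 us.
ok = check_use(SEMAPHORE, "get", required_proc_us=3.0,
               env_cpu_mips=10.0, env_free_heap=64)
bad = check_use(SEMAPHORE, "get", required_proc_us=0.1,
                env_cpu_mips=2.0, env_free_heap=64)
```

The check answers both engineering questions at once: whether the component can meet the demand placed on it, and whether the environment gives it what it needs to do so.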
28Violation of Encapsulation?
- Aren't the offered QoS characteristics a consequence of the implementation?
- Not necessarily...
- The offered QoS characteristics can and should be defined independently of the implementation
- the worst-case numbers of traditional engineering
- The contractual obligations that the component designer is willing to assume
29Physical Type Checking
- Can physical types be statically checked?
- The good news: Yes, they can (in most cases)
- The bad news: typically requires complex analysis methods (queueing network analysis, schedulability analysis, etc.)
- but then, model checking and theorem proving are not simple either
- Some issues
- Typically, QoS-based analyses cannot be done incrementally -- the full system context is required
- but then, the same holds for many formal verification methods
- Each type of QoS (e.g., bandwidth, CPU performance) combines differently
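To illustrate the last point, the Python sketch below composes three QoS attributes across components placed in series. The composition rules are common textbook conventions rather than anything stated in the slides, and they carry their own assumptions: delays add, bandwidth is limited by the narrowest stage, and availabilities multiply only if the stages fail independently. The attribute names are hypothetical.

```python
import math


def compose_series(stages):
    """Combine the QoS of stages that a request passes through in sequence."""
    return {
        # Delays accumulate along the path.
        "delay_ms": sum(s["delay_ms"] for s in stages),
        # Throughput is capped by the slowest stage.
        "bandwidth_mbps": min(s["bandwidth_mbps"] for s in stages),
        # All stages must be up; assumes independent failures.
        "availability": math.prod(s["availability"] for s in stages),
    }


path = compose_series([
    {"delay_ms": 2, "bandwidth_mbps": 100, "availability": 0.5},
    {"delay_ms": 3, "bandwidth_mbps": 10, "availability": 0.5},
])
```

Because each attribute has its own rule, a single generic "add up the parts" analysis does not work, which is one reason the full system context is needed.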
30Required QoS
- Like all guarantees, the offered QoS is contingent on the component getting what it needs to do its job
- There are two distinct dimensions to this
- the peer dimension
- the layering dimension
31Logical Viewpoint
- Example logical view of aircraft simulator
software
[Figure: logical components: INSTRUCTOR STATION, AIRFRAME, ATMOSPHERE MODEL, PILOT CONTROLS, CONTROL SURFACES, GROUND MODEL, ENGINES]
32Engineering (Realization) Viewpoint
- The realization of a specific set of logical
components using facilities of the run-time
environment
33Viewpoints and Mappings
[Figure: realization mappings]
34The Engineering Viewpoint
- The engineering viewpoint represents the raw material out of which we construct the logical viewpoint
- the quality of the outcome is only as good as the quality of the ingredients that are put in
- as in all true engineering, the quantitative aspects of the logical model are often crucial (How long will it take? How much will be required?)
35Distributed Systems Dilemma
- Dilemma: How can we account for the engineering characteristics of the system without prematurely and possibly unnecessarily committing to a specific technology?
- Proposed solution: Include in the logical model a generic (technology-neutral) specification of the required/expected characteristics of the engineering environment
36Viewpoint Separation
- Required Environment: a technology-neutral environment specification required by the logical elements of a model
[Figure label: Logical Viewpoint]
37Required Environment Specifications
- What a logical component needs in order to
perform its function according to spec
[Figure label: realization mapping]
38Required Environment Partitions
- Logical elements often share common QoS
requirements
[Figure label: QoS domain (e.g., failure unit, uniform comm properties)]
39QoS Domains
- Specify a domain in which certain QoS values apply throughout
- failure characteristics (failure modes, availability, reliability)
- CPU speeds
- communications characteristics (delay, throughput, capacity)
- etc.
- The QoS values of a domain can be compared against those of a concrete engineering environment to see if a given environment is adequate for a specific model
40Physical Programming
- The notions of QoS and QoS domains enable the design of distributed systems that properly account for the effects of distribution and other non-transparent physical phenomena, while allowing for a high degree of portability and technology independence
- They are also the basis for formal verification of realization mappings
- required QoS ≤ QoS of the proposed engineering environment
- May also be used to automatically synthesize engineering environments that satisfy a given QoS specification of a logical model
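Checking a realization mapping, and a very naive form of synthesis, can be sketched in Python as follows. Everything here is hypothetical: the domain names, node names, attributes, and numbers are invented, and the "synthesis" is just a first-fit search that ignores the fact that several domains placed on one node would have to share its capacity.

```python
def adequate(domain, node):
    """True if the node's QoS covers what the QoS domain requires."""
    return (node["cpu_mips"] >= domain["cpu_mips"]
            and node["availability"] >= domain["availability"]
            and node["comm_delay_ms"] <= domain["comm_delay_ms"])


def synthesize(domains, nodes):
    """Map each QoS domain to the first adequate node, or None if none fits."""
    return {
        dname: next((n for n, node in nodes.items() if adequate(d, node)), None)
        for dname, d in domains.items()
    }


# Technology-neutral requirements attached to the logical model.
domains = {
    "control": {"cpu_mips": 50, "availability": 0.999, "comm_delay_ms": 2},
    "logging": {"cpu_mips": 5, "availability": 0.9, "comm_delay_ms": 50},
}

# One candidate engineering environment.
nodes = {
    "edge": {"cpu_mips": 10, "availability": 0.95, "comm_delay_ms": 20},
    "core": {"cpu_mips": 100, "availability": 0.9995, "comm_delay_ms": 1},
}

mapping = synthesize(domains, nodes)
```

The logical model never names "edge" or "core"; swapping in a different environment only requires rerunning the check, which is the portability the QoS domain approach is after.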
41Conclusions and an Appeal...
- The physical aspects of software will not go away
- ignoring them can be perilous, especially when working with distributed systems
- most interesting software systems of the future will be distributed and will have stringent dependability requirements (cannot reboot the Internet)
- What is needed is a proper theoretical framework for dealing with physical types
- The QoS framework described here is currently being incorporated into a profile of UML for real-time applications