Title: Building Stable Software Systems
1 Building Stable Software Systems
- Lui Sha
- lrs_at_cs.uiuc.edu
- June 1, 2005
2 The challenges of building large systems
- The FAA's major modernization project, the Advanced
Automation System (AAS), was originally estimated
to cost $2.5 billion with a completion date of
1996. In 1994, the FAA cancelled the AAS program,
casting aside 11 years of development time and,
according to the GAO, wasting more than $1.5 billion
of taxpayer money.
http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html
- According to a study by IBM, in a typical
commercial development organization, debugging,
testing, and verification activities can easily
range from 50 to 75 percent of the total
development cost.
http://www.research.ibm.com/journal/sj/411/hailpern.html
3 Unexpected interactions
Incompatible Cross Domain Protocols
Implicit and inconsistent assumptions and
abstractions
Incompatible assumptions of HW and SW regarding the
operation of the legs led to the loss of the Mars
Polar Lander
Pathological interaction between RT and
synchronization protocols on Pathfinder caused
repeated resets and nearly doomed the mission
4 System instabilities
Faults and failures in one component cascade
along complex and unexpected dependency relations
Overflow of a velocity variable in a reused
monitor module led to the destruction of the
Ariane 5 rocket
A divide by zero in a third-party component left
a warship adrift at sea
5 Sources of difficulties
- Unexpected interactions resulting from
incompatible abstractions, incorrect or implicit
assumptions in system interfaces, and
incompatible real time, fault tolerance, and
security protocols.
- Inadequate development infrastructure as
reflected in the lack of domain-specific
reference architectures, tools, and
design patterns with known and parameterized real
time, robustness, and security properties.
- System instabilities that result when faults and
failures in one component cascade along complex
and unexpected dependency graphs, resulting in
catastrophic failures in a large part or even an
entire system.
6 What needs to be done
- Interface engineering technologies: Making
semantic assumptions of each component explicit
and machine checkable via component property
interface definitions and tools for two-way
synchronization of code and interface
specifications.
- System integration support: A set of formally
specified and validated coherent real time,
robustness, security and networking protocols; a
set of domain models, reference architectures and
design patterns with parameterized real time,
robustness, and security properties; and tools to
support their use.
- Stable software architecture: Use simplicity to
control complexity; replace depend relations with
use relations whenever possible; ensure proper
criticality ordering along semantic, resource
sharing and timing dependency trees.
7 Focusing on stability
- In the foreseeable future, we can only build a
small number of modest size defect-free
components at great expense. To plan otherwise is
overly optimistic at best.
- We need to learn to build structurally stable
software systems with:
- A small number of defect-free components
- A modest number of nearly defect-free components
- A majority of COTS quality components with
residual bugs
- Indeed, since the dawn of civilization, there has
not been a single defect-free large system. The
important role of stability control in so many
engineering disciplines is not an accident.
8 Building complex and stable systems
- The United States of America is a highly stable and
evolvable system. It has grown and made truly
remarkable progress by the metric of
civilization, even though many problems remain.
But its basic components, human beings, are
complex, error prone, and hard to test or verify.
- There are thousands of residual bugs in the
telecom network, yet it remains highly reliable.
There are perhaps millions of bugs in the World
Wide Web system of systems, but it is remarkably
stable.
- Complex but stable systems are uncommon but can
be and have been built.
9 Some Questions
- What is the definition of stability in a software
system?
- What is the domain of convergence in software
stability control?
- How to safely use unreliable services?
- How can we deal with the infamous state explosion
problem?
- How to build a reliable core service?
- How can we analyze the structural stability of a
software system?
- We shall illustrate these ideas with a simple
example.
10 An example
- Once upon a time, there was an exam on sorting
programs. Grades are given as follows:
- A: Correct and fast, n log(n) in the worst case
- B: Correct but slow
- F: Incorrect
- Joe can verify his bubble sort, but has only a 50%
chance of writing Heap Sort correctly.
- What is his optimal strategy?
11 Stability of a software system
- Often, requirements can be decomposed into:
- Critical (correctness) requirements
- Sorting: output numbers in correct order
- TSP: visit every city exactly once
- Control: stable and controllable
- Performance optimization
- Sorting: faster
- TSP: shorter path
- Control: less time/error/energy
Heap Sort
Bubble Sort
Bounded responses to errors: A stable software
system is one that can maintain key properties in
spite of errors in non-critical components.
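One way to read this definition for the sorting example is sketched below in Python. The names `stable_sort`, `is_sorted` and `buggy_heap_sort` are hypothetical and do not come from the slides. The sketch assumes the critical requirement (correct order) is checked after the untrusted sort runs, and that a verified bubble sort is the fallback. It checks only length and ordering, so an untrusted sort that alters items would need a further restriction.

```python
def bubble_sort(items):
    """Trusted fallback: simple enough to verify, but O(n^2)."""
    result = list(items)
    for end in range(len(result) - 1, 0, -1):
        for j in range(end):
            if result[j] > result[j + 1]:
                result[j], result[j + 1] = result[j + 1], result[j]
    return result


def is_sorted(items):
    """Critical requirement: output is in non-decreasing order."""
    return all(a <= b for a, b in zip(items, items[1:]))


def stable_sort(items, untrusted_sort):
    """Use the fast untrusted sort only when its output passes the check."""
    try:
        candidate = untrusted_sort(list(items))  # untrusted code gets a copy
    except Exception:
        candidate = None  # a crash is treated like a wrong answer
    if (isinstance(candidate, list)
            and len(candidate) == len(items)
            and is_sorted(candidate)):
        return candidate
    return bubble_sort(items)  # bounded response to the error


def buggy_heap_sort(items):
    """Hypothetical faulty component: returns the items reversed."""
    return items[::-1]


print(stable_sort([3, 1, 2], buggy_heap_sort))  # falls back to bubble sort
print(stable_sort([3, 1, 2], sorted))           # accepts the fast result
```

In exam terms, the fallback secures at least the grade for a correct but slow answer, while the fast grade is earned whenever the untrusted sort happens to be right.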
12 Stability control
- What if the untrusted sorting program alters an
item in the input list?
- Create a verified simple primitive called
permute.
- Untrusted sorting software is not allowed to
touch the input list except through the permute
primitive.
- Enforce the restriction using an object with
(only) the method permute.
- Under stability control, the untrusted Heap Sort
can only produce out-of-order application
errors.
The domain of convergence in software error control
is the set of states that satisfy the precondition
of the recovery procedure. Stability control is the
mechanism used to ensure that the preconditions
hold. State explosion in a stability-controlled
component is a non-problem. A stable system allows
for SAFE TESTING of NEW COMPONENTS.
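The permute idea can be sketched in Python as follows. This is an illustration under assumptions, not the author's implementation: the class name `PermutableList` and the read-only helpers `less` and `__len__` are hypothetical additions (a sort needs some way to compare items), and Python does not truly hide `_items`, so real enforcement would need language-level or process-level protection.

```python
class PermutableList:
    """Trusted wrapper: callers can compare and swap, never overwrite."""

    def __init__(self, items):
        self._items = list(items)  # private copy of the input

    def __len__(self):
        return len(self._items)

    def less(self, i, j):
        """Read-only comparison (an assumed helper, not in the slides)."""
        return self._items[i] < self._items[j]

    def permute(self, i, j):
        """The only mutating operation: exchange two positions."""
        self._items[i], self._items[j] = self._items[j], self._items[i]

    def snapshot(self):
        return list(self._items)


def untrusted_heap_sort(view):
    """Heap sort written only in terms of less() and permute()."""
    def sift_down(root, end):
        while 2 * root + 1 < end:
            child = 2 * root + 1
            if child + 1 < end and view.less(child, child + 1):
                child += 1
            if not view.less(root, child):
                return
            view.permute(root, child)
            root = child

    n = len(view)
    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        view.permute(0, end)
        sift_down(0, end)


def trusted_bubble_sort(view):
    """Recovery procedure; it only requires a permutation of the input."""
    for end in range(len(view) - 1, 0, -1):
        for j in range(end):
            if view.less(j + 1, j):
                view.permute(j, j + 1)


def sort_under_stability_control(items, untrusted_sort):
    view = PermutableList(items)
    try:
        untrusted_sort(view)
    except Exception:
        pass  # whatever happened, view still holds a permutation
    if any(view.less(j + 1, j) for j in range(len(view) - 1)):
        trusted_bubble_sort(view)  # out-of-order error: recover
    return view.snapshot()


print(sort_under_stability_control([5, 2, 9, 1, 7], untrusted_heap_sort))
```

Because every reachable state is a permutation of the input, the precondition of the recovery procedure always holds, whatever the untrusted sort does with its swaps.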
13 Stability control for control software
- http://www-rtsl.cs.uiuc.edu/ (click project,
click drii, click telelab download)
14 Transform the depend relation into a USE relation
- Having a reliable controller, we identify the
recovery region within which the controller can
operate successfully. The recovery region is a
subset of the states that are admissible with
respect to operational constraints.
- The largest recovery region can be found using
linear matrix inequalities (LMI). This approach is
applicable to any linearizable system, which
covers most practical control systems.
operational constraints
Recovery Region
Stability envelope
The system under the new complex controller must
stay within the recovery region
15 Simplex Architecture for Control
Stability Monitoring
Trusted simple and reliable controller
Plant
Online upgradeable complex controller
Data Flow Block Diagram
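The switching logic implied by this diagram might look like the Python sketch below. Several things here are assumptions rather than content from the slides: the recovery region is represented as an ellipsoid x^T P x <= 1, the monitor uses a one-step prediction function `predict`, and all function and parameter names are hypothetical.

```python
def in_recovery_region(x, P):
    """True when x^T P x <= 1, i.e. x lies inside the assumed ellipsoid."""
    n = len(x)
    value = sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))
    return value <= 1.0


def simplex_step(x, complex_controller, safe_controller, predict, P):
    """Choose the command for one control period and report its source."""
    try:
        u = complex_controller(x)
        # Accept the complex command only if the predicted next state
        # stays inside the recovery region.
        if in_recovery_region(predict(x, u), P):
            return u, "complex"
    except Exception:
        pass  # a crash in the complex controller is handled like a bad command
    return safe_controller(x), "safe"


# Toy one-dimensional plant, x_next = x + u, with region |x| <= 1.
P = [[1.0]]
predict = lambda x, u: [x[0] + u]
safe = lambda x: -0.5 * x[0]

print(simplex_step([0.5], lambda x: 3.0, safe, predict, P))   # rejected command
print(simplex_step([0.5], lambda x: -0.4, safe, predict, P))  # accepted command
```

The trusted controller never needs the complex controller to be correct; it only uses its output when the monitor finds the result recoverable.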
16 How to build a reliable core service?
- There are two parties of thought:
- Fault avoidance party: Put all the eggs in a
bullet-proof basket.
- Fault tolerance party: Use diversity, e.g.,
N-version programming.
- Which party will you vote for?
17 Complexity, diversity and reliability
- To build a robust software system that can
tolerate arbitrary application software faults,
we must understand the relations between software:
- Complexity: the root cause of software faults
- Diversity: a necessary condition for software
fault tolerance
- Reliability: a function of complexity and
diversity
- We shall begin with postulates based on
self-evident facts.
18 Software development postulates
- We assert that the following postulates are
self-evident:
- P1 Complexity Breeds Bugs: Everything else being
equal, the more complex the software project is,
the harder it is to make it reliable.
- P2 All Bugs are Not Equal: You fix a bunch of
obvious bugs quickly, but finding and fixing the
last few bugs is much harder.
- P3 All Budgets are Finite: There is only a
finite amount of effort (budget) that we can
spend on any project.
- How can we model software complexity?
19 Logical complexity
- Computational complexity: the number of steps
in computation.
- Logical complexity: the number of
steps in verification.
- A program can have different logical and
computational complexities.
- Bubble sort: lower logical complexity but higher
computational complexity.
- Heap sort: the other way around.
- Residual logical complexity: A program could have
high logical complexity initially. However, if it
has been verified and can be used as is, then the
residual complexity is zero.
20 The implications
- P1 Complexity Breeds Bugs: For a given mission
duration t, the reliability of software decreases
as complexity increases.
- P2 All Bugs are Not Equal: For a given degree of
complexity, the reliability function has a
monotonically decreasing rate of improvement with
respect to development effort.
- P3 All Budgets are Finite: Diversity is not free.
That is, if we go for n-version diversity, we
must divide the available effort n ways.
- One simple model that satisfies P1, P2 and P3:
- Sum of efforts used in diversity = available
effort
- Reliability function: R(t) = e^{-k (complexity / effort) t}
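The model above is easy to evaluate numerically; a small Python sketch follows. The exponential reliability function is the one on this slide. The majority-vote combination for n versions (independent failures, perfect voter) and the half-effort split given to the simple core are assumptions chosen for illustration and are not stated in the slides, as are the parameter values.

```python
import math


def reliability(complexity, effort, t, k=1.0):
    """R(t) = exp(-k * (complexity / effort) * t), as on the slide."""
    return math.exp(-k * (complexity / effort) * t)


def n_version_reliability(complexity, total_effort, t, n=3, k=1.0):
    """Majority-vote reliability when n versions share the effort equally.

    Assumes independent failures and a perfect voter (not in the slides).
    """
    r = reliability(complexity, total_effort / n, t, k)
    need = n // 2 + 1  # versions that must be correct for a majority
    return sum(math.comb(n, m) * r**m * (1 - r) ** (n - m)
               for m in range(need, n + 1))


# Hypothetical numbers: unit complexity, unit mission time, 10 units of effort.
C, E, T = 1.0, 10.0, 1.0
print("1-version:", reliability(C, E, T))
print("3-version:", n_version_reliability(C, E, T, n=3))
# A core with 10x lower complexity, given half of the effort (assumed split).
print("simple core:", reliability(C / 10, E / 2, T))
```

Varying `C`, `E` and `k` shows how quickly dividing the effort erodes the reliability of each version under this model.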
21 Diversity, complexity and reliability
3-version programming
1-version programming
A reliable core with 10x complexity reduction
Analysis shows that what really counts is not the
degree of diversity. Rather, it is the existence
of a simple and reliable core that can guarantee
the stability of the system. This result is also
robust against changes in model assumptions.
Source: L. Sha, "Using Simplicity to Control
Complexity," IEEE Software, July/August 2001.
22 Summary: Keys to a stable software system
- A software bug does not fly, nor does it crawl. It
propagates along 3 types of dependency graphs. In
the sorting example:
- Functional: bubble sort USES but does not depend
on heap sort
- Execution: none, if we give each sorting task
separate and protected data, storage and
computation resources
- Timing: bubble sort does not depend on heap sort
if a complexity-based watchdog timer is set
- 1. A simple and reliable core for critical
services
- 2. A simple and well-formed dependency tree
- Maximized USE relations
- Minimized dependency relations
- 3. Safely exploit useful but unreliable services
via stability control
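The timing point in the sorting example can be illustrated with a simplified watchdog in Python. This is a sketch under assumptions: it counts comparisons against a budget proportional to n log2(n) instead of using a real preemptive timer, so code that loops without comparing would not be caught. The names `sort_with_watchdog`, `BudgetExceeded` and `selection_sort`, and the budget factor of 4, are hypothetical.

```python
import math


class BudgetExceeded(Exception):
    """Raised when the untrusted sort uses more comparisons than allowed."""


def sort_with_watchdog(items, untrusted_sort, trusted_sort, factor=4):
    n = len(items)
    # Complexity-based budget: a fast sort should need O(n log n) comparisons.
    budget = int(factor * n * math.log2(n)) if n > 1 else 1
    used = 0

    def less(a, b):
        nonlocal used
        used += 1
        if used > budget:
            raise BudgetExceeded()
        return a < b

    try:
        return untrusted_sort(list(items), less), "untrusted"
    except BudgetExceeded:
        # The trusted sort runs on its own copy with its own time bound.
        return trusted_sort(list(items)), "trusted"


def selection_sort(items, less):
    """Hypothetical slow component: about n^2 / 2 comparisons."""
    for i in range(len(items)):
        smallest = i
        for j in range(i + 1, len(items)):
            if less(items[j], items[smallest]):
                smallest = j
        items[i], items[smallest] = items[smallest], items[i]
    return items


data = list(range(64, 0, -1))
result, source = sort_with_watchdog(data, selection_sort, sorted)
print(source)  # 2016 comparisons needed, budget is 1536, so "trusted"
```

Here `sorted` merely stands in for the verified bubble sort; the point is that the trusted path has its own bound and never waits on the untrusted one.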