Building Stable Software Systems - PowerPoint PPT Presentation

1
Building Stable Software Systems

Lui Sha
lrs_at_cs.uiuc.edu
June 1, 2005

2
The challenges of building large systems

FAA's major modernization project, the Advanced
Automation System (AAS), was originally estimated
to cost $2.5 billion with a completion date of
1996. In 1994, FAA cancelled the AAS program,
casting aside 11 years of development time and,
according to GAO, wasting more than $1.5 billion
of taxpayer money.
http://www.asiaweek.com/asiaweek/98/0717/nat_6_clk.html
According to a study by IBM, in a typical
commercial development organization, debugging,
testing, and verification activities can easily
range from 50 to 75 percent of the total
development cost.
http://www.research.ibm.com/journal/sj/411/hailpern.html

3
Unexpected interactions
Incompatible cross-domain protocols
Implicit and inconsistent assumptions and
abstractions
Incompatible hardware and software assumptions
about the operation of the landing legs led to
the loss of the Mars Polar Lander
Pathological interaction between real-time and
synchronization protocols: on Pathfinder, it
caused repeated resets and nearly doomed the
mission
4
Systems instabilities
Faults and failures in one component cascade
along complex and unexpected dependency relations
Overflow of a velocity variable in a reused
monitor module led to the destruction of the
Ariane 5 rocket
A divide-by-zero in a third-party component left
a warship adrift at sea
5
Sources of difficulties

Unexpected interactions resulting from
incompatible abstractions, incorrect or implicit
assumptions in system interfaces, and
incompatible real time, fault tolerance, and
security protocols.
Inadequate development infrastructure, as
reflected in the lack of domain-specific
reference architectures, tools, and design
patterns with known and parameterized real-time,
robustness, and security properties.
System instabilities that result when faults and
failures in one component cascade along complex
and unexpected dependency graphs resulting in
catastrophic failures in a large part or even an
entire system.

6
What needs to be done

Interface engineering technologies: Making
semantic assumptions of each component explicit
and machine checkable via component property
interface definitions and tools for two-way
synchronization of code and interface
specifications.
System integration supports: A set of formally
specified and validated coherent real-time,
robustness, security and networking protocols. A
set of domain models, reference architectures and
design patterns with parameterized real-time,
robustness, and security properties. And tools to
support their use.
Stable software architecture: Use simplicity to
control complexity; replace depend relations with
use relations whenever possible; ensure proper
criticality ordering along semantic, resource
sharing and timing dependency trees.

7
Focusing on stability

In the foreseeable future, we can only build a
small number of modest-size, defect-free
components, and only at great expense. To plan
otherwise is overly optimistic at best.
We need to learn to build structurally stable
software systems with
A small number of defect-free components
A modest number of nearly defect-free components
A majority of COTS-quality components with
residual bugs
Indeed, since the dawn of civilization, there has
not been a single defect free large system. The
important role of stability control in so many
engineering disciplines is not an accident.

8
Building complex and stable systems

The United States of America is a highly stable
and evolvable system. It has grown and made truly
remarkable progress by the metric of
civilization, even though many problems remain.
But its basic components, human beings, are
complex, error prone, and hard to test or verify.
There are thousands of residual bugs in the
telecom network, yet it remains highly reliable.
There are perhaps millions of bugs in the World
Wide Web system of systems, but it is remarkably
stable.
Complex but stable systems are uncommon, but they
can be and have been built.

9
Some Questions

What is the definition of stability in a software
system?
What is the domain of convergence in software
stability control?
How to safely use unreliable services?
How can we deal with the infamous state explosion
problem?
How to build a reliable core service?
How can we analyze the structural stability of a
software system?
We shall illustrate these ideas with a simple
example.

10
An example

Once upon a time, there was an exam on sorting
programs. Grades were given as follows:
A: Correct and fast, O(n log n) in the worst case
B: Correct but slow
F: Incorrect
Joe can verify his bubble sort, but has only a
50% chance of writing heap sort correctly.
What is his optimal strategy?

11
Stability of a software system

Often, requirements can be decomposed into:
Critical (correctness) requirements
Sorting: output numbers in correct order
TSP: visit every city exactly once
Control: stable and controllable
Performance optimization
Sorting: faster
TSP: shorter path
Control: less time/error/energy

Heap Sort
Bubble Sort
Bounded responses to errors: A stable software
system is one that can maintain key properties in
spite of errors in non-critical components.
12
Stability control

What if the untrusted sorting program alters an
item in the input list?
Create a verified, simple primitive called
permute.
Untrusted sorting software is not allowed to
touch the input list except through the permute
primitive.
Enforce the restriction using an object whose
only method is permute.
Under stability control, the untrusted heap sort
can only produce out-of-order application
errors.

The domain of convergence in software error
control is the set of states that satisfy the
precondition of the recovery procedure. Stability
control is the mechanism used to ensure that the
preconditions will hold. State explosion in a
stability-controlled component is a non-problem.
A stable system allows for SAFE TESTING of NEW
COMPONENTS.
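
The permute idea above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the presenter's implementation: the class name PermuteOnly, the read-only less comparison, the snapshot helper, and the injected bug are all hypothetical, and Python only hides the list by convention (a real system would enforce the boundary with memory protection or a separate process). The point it shows is that untrusted code limited to permute can leave the list out of order but can never change its contents, so the precondition of the bubble-sort recovery always holds.

```python
class PermuteOnly:
    """Trusted wrapper: the only mutation it exposes is swapping two positions."""

    def __init__(self, items):
        self._items = list(items)  # private copy, hidden from untrusted code by convention

    def __len__(self):
        return len(self._items)

    def less(self, i, j):
        # Read-only comparison of two positions (an assumption of this sketch).
        return self._items[i] < self._items[j]

    def permute(self, i, j):
        # The single verified primitive: exchange two items.
        self._items[i], self._items[j] = self._items[j], self._items[i]

    def snapshot(self):
        return list(self._items)


def untrusted_heap_sort(view, buggy=False):
    """Heap sort that touches the data only through less() and permute()."""
    n = len(view)

    def sift_down(root, end):
        while True:
            child = 2 * root + 1
            if child >= end:
                return
            if child + 1 < end and view.less(child, child + 1):
                child += 1
            if not view.less(root, child):
                return
            view.permute(root, child)
            root = child

    for start in range(n // 2 - 1, -1, -1):
        sift_down(start, n)
    for end in range(n - 1, 0, -1):
        view.permute(0, end)
        if not buggy:  # the injected bug forgets to restore the heap
            sift_down(0, end)


def is_sorted(view):
    return all(not view.less(i + 1, i) for i in range(len(view) - 1))


def trusted_bubble_sort(view):
    """Simple recovery procedure; its precondition is 'a permutation of the input'."""
    for end in range(len(view) - 1, 0, -1):
        for i in range(end):
            if view.less(i + 1, i):
                view.permute(i, i + 1)


def stable_sort(items, buggy=False):
    view = PermuteOnly(items)
    try:
        untrusted_heap_sort(view, buggy)
    except Exception:
        pass  # a crash is handled like any other failure
    used_fallback = not is_sorted(view)
    if used_fallback:
        trusted_bubble_sort(view)
    return view.snapshot(), used_fallback
```

Calling stable_sort([5, 3, 8, 1, 9, 2], buggy=True) still returns a correctly sorted list, with the second element of the result reporting that the fallback ran.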
13
Stability control for control software

http://www-rtsl.cs.uiuc.edu/ (click project,
click drii, click telelab download)

14
Transform depend relation to USE relation

Given a reliable controller, we identify the
recovery region within which that controller can
operate successfully. The recovery region is a
subset of the states that are admissible with
respect to operational constraints.
The largest recovery region can be found using
linear matrix inequalities (LMI). This approach
is applicable to any linearizable system, which
covers most practical control systems.

[Figure labels: operational constraints, recovery
region, stability envelope]
The system under the new complex controller must
stay within the recovery region.
15
Simplex Architecture for Control
[Figure: data flow block diagram with blocks
labeled stability monitoring, trusted simple and
reliable controller, online upgradeable complex
controller, and plant]
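
The decision logic suggested by this architecture can be sketched as follows. This is a simplified illustration, assuming a discrete-time linear plant model and an ellipsoidal recovery region of the form x^T P x <= 1. The matrices A, B, and P, the controller gains, and the one-step prediction check are hypothetical choices made for this sketch. In particular, P is a placeholder here and is not computed from an LMI.

```python
def inside_envelope(x, P, level=1.0):
    """True if x^T P x <= level, i.e. x lies in the assumed ellipsoidal region."""
    n = len(x)
    quad = sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))
    return quad <= level


def step(x, u, A, B):
    """One step of the assumed plant model x' = A x + B u (scalar input u)."""
    n = len(x)
    return [sum(A[i][j] * x[j] for j in range(n)) + B[i] * u for i in range(n)]


def simplex_select(x, complex_ctrl, safe_ctrl, A, B, P):
    """Accept the complex controller's command only if the predicted next
    state stays inside the recovery region; otherwise use the safe one."""
    try:
        u = complex_ctrl(x)
        if inside_envelope(step(x, u, A, B), P):
            return u, "complex"
    except Exception:
        pass  # a crashing complex controller is simply ignored
    return safe_ctrl(x), "safe"


# Hypothetical example: a double integrator sampled at 0.1 s.
A = [[1.0, 0.1], [0.0, 1.0]]
B = [0.0, 0.1]
P = [[1.0, 0.0], [0.0, 1.0]]  # placeholder envelope, not derived from an LMI


def safe(x):
    return -1.0 * x[0] - 1.5 * x[1]


def good(x):
    return -2.0 * x[0] - 2.0 * x[1]


def bad(x):
    return 100.0  # a faulty upgrade that commands a huge input


def crashing(x):
    raise RuntimeError("bug in the complex controller")
```

With the state [0.5, 0.0], the command from good is accepted, while bad and crashing both cause the safe controller's command to be used instead. The safe controller is never replaced, only bypassed while the complex one behaves.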
16
How to build reliable core services?

There are two parties of thought:
Fault avoidance party: Put all the eggs in a
bullet-proof basket
Fault tolerance party: Use diversity, e.g.,
N-version programming
Which party will you vote for?

17
Complexity, diversity and reliability

To build a robust software system that can
tolerate arbitrary application software faults,
we must understand the relations between software
Complexity: the root cause of software faults
Diversity: a necessary condition for software
fault tolerance
Reliability: a function of complexity and
diversity
We shall begin with postulates based on
self-evident facts.

18
Software development postulates

We assert that the following postulates are
self-evident:
P1, Complexity Breeds Bugs: Everything else being
equal, the more complex the software project is,
the harder it is to make it reliable.
P2, All Bugs are Not Equal: You fix a bunch of
obvious bugs quickly, but finding and fixing the
last few bugs is much harder.
P3, All Budgets are Finite: There is only a
finite amount of effort (budget) that we can
spend on any project.
How can we model software complexity?

19
Logical complexity

Computational complexity: the number of steps
in computation.
Logical complexity: the number of steps in
verification.
A program can have different logical and
computational complexities.
Bubble sort: lower logical complexity but higher
computational complexity.
Heap sort: the other way around.
Residual logical complexity: A program could have
high logical complexity initially. However, if it
has been verified and can be used as is, then the
residual complexity is zero.

20
The implications

P1, Complexity Breeds Bugs: For a given mission
duration t, the reliability of software decreases
as complexity increases.
P2, All Bugs are Not Equal: For a given degree of
complexity, the reliability function has a
monotonically decreasing rate of improvement with
respect to development effort.
P3, Budgets are Finite: Diversity is not free.
That is, if we go for n-version diversity, we
must divide the available effort n ways.
One simple model that satisfies P1, P2 and P3:
Sum of efforts used in diversity = available
effort
Reliability function: $R(t) = e^{-k \cdot (\text{complexity}/\text{effort}) \cdot t}$
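
The model on this slide, reliability = e^(-k (complexity / effort) t), can be evaluated numerically to compare design choices. The sketch below is illustrative only. The parameter values (k = 1, t = 1, complexity 1, effort 10), the majority-voting formula with independent failures, the 50% share of effort given to the core, and the assumption that the system is at least as reliable as its simple core are all choices made for this example, not results taken from the presentation.

```python
import math


def reliability(complexity, effort, t=1.0, k=1.0):
    """R(t) = exp(-k * (complexity / effort) * t), the model from the slide."""
    return math.exp(-k * (complexity / effort) * t)


def three_version(complexity, effort, t=1.0, k=1.0):
    """Majority vote over 3 versions, each built with one third of the effort.
    Assumes versions fail independently, which is an idealization."""
    r = reliability(complexity, effort / 3.0, t, k)
    return r ** 3 + 3.0 * r ** 2 * (1.0 - r)  # at least 2 of 3 are correct


def with_simple_core(complexity, effort, reduction=10.0, core_share=0.5,
                     t=1.0, k=1.0):
    """Reliability of a simple core whose complexity is reduced by `reduction`
    and which receives `core_share` of the effort. Assumes the system can
    always fall back to the core when the complex part fails."""
    return reliability(complexity / reduction, effort * core_share, t, k)


single = reliability(1.0, 10.0)
voted = three_version(1.0, 10.0)
core = with_simple_core(1.0, 10.0)
```

For these particular numbers the ordering is core > single > voted: splitting a fixed budget three ways costs more reliability than voting recovers, while a core that is ten times simpler does better even on half the budget. Other parameter choices can change the gap, so the numbers should be read as an illustration of the model rather than a general result.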

21
Diversity, complexity and reliability
[Figure labels: 3-version programming, 1-version
programming, a reliable core with 10x complexity
reduction]

Analysis shows that what really counts is not the
degree of diversity. Rather, it is the existence
of a simple and reliable core that can guarantee
the stability of the system. This result is also
robust against changes in model assumptions.
--- L. Sha, "Using Simplicity to Control
Complexity," IEEE Software, July/August 2001
22
Summary: Keys to a stable software system

Software bugs do not fly, nor do they crawl. They
propagate along 3 types of dependency graphs. In
the sorting example:
Functional: bubble sort USES but does not depend
on heap sort
Execution: none, if we give each sorting task
separate and protected data, storage and
computation resources
Timing: bubble sort does not depend on heap sort
if a complexity-based watchdog timer is set.
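
The three dependency cuts in this summary can be sketched together. In this hypothetical example a step counter stands in for a real watchdog timer, with a budget proportional to n log n. The names StepBudget, run_with_watchdog, and the factor of 4 are assumptions of the sketch. A truly untrusted component could not be relied on to call tick itself, so a real system would enforce the budget from outside, for example with a separate process and a timeout.

```python
import heapq
import math
from collections import Counter


class BudgetExceeded(Exception):
    pass


class StepBudget:
    """Complexity-based watchdog: allows about factor * n * log2(n) steps."""

    def __init__(self, n, factor=4.0):
        self.remaining = int(factor * n * math.log2(max(n, 2)))

    def tick(self):
        self.remaining -= 1
        if self.remaining < 0:
            raise BudgetExceeded()


def bubble_sort(items):
    """Trusted, simple fallback."""
    data = list(items)
    for end in range(len(data) - 1, 0, -1):
        for i in range(end):
            if data[i] > data[i + 1]:
                data[i], data[i + 1] = data[i + 1], data[i]
    return data


def is_sorted_permutation(candidate, original):
    ordered = all(a <= b for a, b in zip(candidate, candidate[1:]))
    return ordered and Counter(candidate) == Counter(original)


def run_with_watchdog(untrusted_sort, items, factor=4.0):
    data = list(items)  # execution: the untrusted task works on its own copy
    budget = StepBudget(len(data), factor)
    try:
        candidate = untrusted_sort(list(data), budget.tick)  # timing: bounded
        if is_sorted_permutation(candidate, data):  # functional: use, not depend
            return candidate, "untrusted"
    except Exception:
        pass
    return bubble_sort(data), "trusted"


def heap_sort(items, tick):
    """Well-behaved fast sort that reports one step per heap operation."""
    heap = []
    for x in items:
        tick()
        heapq.heappush(heap, x)
    out = []
    while heap:
        tick()
        out.append(heapq.heappop(heap))
    return out


def runaway_sort(items, tick):
    """Faulty sort that never terminates on its own."""
    while True:
        tick()
```

run_with_watchdog(heap_sort, data) returns the fast result, while run_with_watchdog(runaway_sort, data) is cut off by the budget and falls back to bubble sort. In both cases the caller's list is left untouched.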