Title: Safety Related Systems
1Safety Related Systems
- The second of four lectures on the real world of computing.
- Martyn Thomas
2En Route ATC at Swanwick
3Airspace
4Control Room
5RS/6000 workstations
6A medium sized system
- 114 controller workstations
- 20 supervisory/management positions
- 10 engineering positions
- 48-workstation simulator
- 2 15-workstation test systems
- 2.5 million lines of software
- >500 processors
7Development
- Project start 1989
- Planned operational date 1996
- Actual operational date, Jan 27th 2002
- but Mitre Corp forecast 13 years
- Safety Case audited by CAA
- top hazards
- radio failure
- plausible but incorrect data
8Operational data
- 1,667,381 flights in 2002
- Continuous operation,
- one 3-hour failure
- other flight delays caused by NAS failures at West Drayton
- 10Mb ARM data collected each minute
- key measures better than forecast.
9Challenges for the future
- Current ATC safety depends on the controllers' ability to clear their sector with radio only.
- Future traffic growth requires higher densities; controllers will not be able to maintain a mental picture of the traffic.
- So future ATC will depend on automatic systems, which must not fail.
- Target? At least the avionics standard: 10^-8 pfh
- No current air traffic management systems are built to such standards.
10Some safety-critical systems
- Medical radiotherapy systems
- Therac-25 deaths
- Nuclear power-station control/shutdown
- Avionics (TCAS, A320, Boeing 777, ...)
- Railway signalling
- Weapons systems (torpedo, Vincennes Aegis)
- Control systems (dams, mines, Thames barrier etc)
11Safety principles: ALARP
- Tolerable only if further risk reduction is not practicable (i.e. impossible, or unreasonably expensive).
- ALARP region: the risk may be tolerable if the benefit is sufficiently great to justify it, and if the risk has been reduced As Low As Reasonably Practicable.
- The lower the risk, the less cost it is practicable to spend on reducing it further. This reducing pressure to improve is represented by the shape of the triangle.
- Broadly Acceptable region: no detailed justification required.
- Negligible risk
12Risk
Defined in IEC 61508 Part 4 as the probable rate of occurrence of a hazard causing harm and the degree of severity of harm.

FREQUENCY     CONSEQUENCE
              Catastrophic  Critical  Marginal  Negligible
Frequent      I             I         I         II
Probable      I             I         II        III
Occasional    I             II        III       III
Remote        II            III       III       IV
Improbable    III           III       IV        IV
Incredible    IV            IV        IV        IV

- I: intolerable risk
- II: undesirable risk, and tolerable only if risk reduction is impractical or if the costs are grossly disproportionate to the improvement gained
- III: tolerable risk if the cost of the risk reduction would exceed the improvement gained
- IV: negligible (acceptable) risk
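The frequency/consequence matrix above is in effect a lookup table, and a minimal sketch of it in Python may make the reading of a cell clearer. The names `SEVERITIES`, `RISK_CLASS` and `risk_class` are illustrative choices for this sketch, not taken from IEC 61508; the cell values are copied from the slide.

```python
# Column order of the matrix, as on the slide.
SEVERITIES = ("catastrophic", "critical", "marginal", "negligible")

# One row per frequency category; each tuple follows SEVERITIES order.
RISK_CLASS = {
    "frequent":   ("I",   "I",   "I",   "II"),
    "probable":   ("I",   "I",   "II",  "III"),
    "occasional": ("I",   "II",  "III", "III"),
    "remote":     ("II",  "III", "III", "IV"),
    "improbable": ("III", "III", "IV",  "IV"),
    "incredible": ("IV",  "IV",  "IV",  "IV"),
}


def risk_class(frequency, severity):
    """Return the risk class (I to IV) for a frequency/severity pair."""
    column = SEVERITIES.index(severity.lower())
    return RISK_CLASS[frequency.lower()][column]


if __name__ == "__main__":
    # A remote hazard with catastrophic consequences falls in class II.
    print(risk_class("Remote", "Catastrophic"))
```

Keeping the matrix as data rather than as nested conditionals makes it easy to check each row against the slide.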
13Safety Integrity Levels (SILs)

SIL   Continuous / High Demand Mode (pfh)
4     ≥ 10^-9 to < 10^-8
3     ≥ 10^-8 to < 10^-7
2     ≥ 10^-7 to < 10^-6
1     ≥ 10^-6 to < 10^-5

IEC 61508 indicative probabilities
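As a rough illustration of how the bands above are read, the following sketch maps a pfh value to the SIL whose band contains it. The names `SIL_BANDS` and `sil_for_pfh` are hypothetical; the band limits are the indicative figures from the slide, with the lower limit inclusive and the upper limit exclusive.

```python
# (SIL, lower bound inclusive, upper bound exclusive), in failures per hour.
SIL_BANDS = (
    (4, 1e-9, 1e-8),
    (3, 1e-8, 1e-7),
    (2, 1e-7, 1e-6),
    (1, 1e-6, 1e-5),
)


def sil_for_pfh(pfh):
    """Return the SIL whose band contains pfh, or None if it is outside all bands."""
    for sil, low, high in SIL_BANDS:
        if low <= pfh < high:
            return sil
    return None


if __name__ == "__main__":
    print(sil_for_pfh(5e-9))  # falls in the SIL 4 band
    print(sil_for_pfh(1e-5))  # outside the table, so None
```

A value below 10^-9 also returns `None` here, because the slide's table gives no band for it.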
14Assurance
- Assurance: showing that a system has the required safety
- Much harder than just developing a system that is safe enough
- What evidence is sufficient?
- How safe is a system that has never failed?
- What evidence does testing provide?
- How can we do better?
15How safe is a system that has never failed?
- If it has run for n hours without failure, and if the operating conditions remain much the same, the best estimate for the probability of failure in the next n hours is 0.5.
- So, to show that a system has a pfh of <10^-4 with 50% confidence, we need about 14 months of fault-free testing.
- 10,000 hours is 13.89 months
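The arithmetic behind the 14-month figure can be sketched as follows. This assumes the slide's rule that n failure-free hours support a pfh of about 1/n at 50% confidence, and it assumes 30-day months, which is what makes 10,000 hours come out at 13.89 months. The function name `fault_free_hours_needed` is illustrative only.

```python
HOURS_PER_MONTH = 24 * 30  # assumption: 30-day months


def fault_free_hours_needed(target_pfh):
    """Hours of failure-free operation needed under the slide's 1/n rule."""
    return 1.0 / target_pfh


hours = fault_free_hours_needed(1e-4)
months = hours / HOURS_PER_MONTH

if __name__ == "__main__":
    print(f"{hours:.0f} hours is {months:.2f} months")
```

The same function shows how quickly the requirement grows: each factor of ten in the target pfh multiplies the test time by ten.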
16What evidence does testing provide?
- "Testing shows the presence, not the absence, of bugs" (Dijkstra)
- We cannot test every path.
- Testing functions, or boundary conditions, may find faults, but tests that pass provide no evidence of pfh.
- Statistical testing, under operational conditions, provides evidence of pfh.
- But it takes a very long time.
17Statistical testing
- To show an MTBF of n hours, with 99% confidence, takes around 10n hours of testing with no faults found. So to show the SIL 4 claim (10^-8 pfh) takes around 10^9 hours (>100,000 years).
- With good prior evidence, e.g. from a strong process, using a Bayesian approach may reduce this to <10,000 years.
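A small sketch of the SIL 4 arithmetic, under stated assumptions: the 10^-8 pfh claim is read as an MTBF of 10^8 hours, a year is taken as 8,760 hours, and the slide's "around 10n" is used as a rule of thumb. For comparison only, the sketch also evaluates the classical zero-failure bound T = -ln(1 - C) * n, which holds under an assumed constant failure rate and is not stated in the lecture; it gives a figure of the same order of magnitude. All names are hypothetical.

```python
import math

HOURS_PER_YEAR = 24 * 365  # assumption: non-leap year


def zero_failure_test_hours(mtbf_hours, confidence):
    """Failure-free test time for a given MTBF and confidence,
    assuming a constant failure rate (exponential model)."""
    return -math.log(1.0 - confidence) * mtbf_hours


sil4_mtbf_hours = 1e8                        # 10^-8 pfh read as MTBF = 10^8 hours
rule_of_thumb_hours = 10 * sil4_mtbf_hours   # the slide's "around 10n"
classical_hours = zero_failure_test_hours(sil4_mtbf_hours, 0.99)

rule_of_thumb_years = rule_of_thumb_hours / HOURS_PER_YEAR
classical_years = classical_hours / HOURS_PER_YEAR

if __name__ == "__main__":
    print(f"rule of thumb: {rule_of_thumb_years:,.0f} years")
    print(f"classical bound: {classical_years:,.0f} years")
```

Either way the conclusion is the same: testing alone cannot demonstrate a SIL 4 claim within any practical timescale.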
18One can construct convincing proofs quite
readily of the ultimate futility of exhaustive
testing of a program and even of testing by
sampling. So how can one proceed? The role of
testing, in theory, is to establish the base
propositions of an inductive proof. You should
convince yourself, or other people, as firmly as
possible, that if the program works a certain
number of times on specified data, then it will
always work on any data. This can be done by an
inductive approach to the proof.
C A R Hoare, 1969
19The role of formal methods
- Formal specifications make safety properties explicit and unambiguous.
- Formal proof of safety invariants.
- Formal demonstration of equivalence classes that can be tested.
- Formal analysis of the impact of changes reduces the assurance effort after maintenance.
20Dependability and strong software engineering
- is the subject of next week's lecture.