Title: Safety, Reliability, and Robust Design
1 Safety, Reliability, and Robust Design in Embedded Systems
2 Risk analysis: managing uncertainty
GOAL: be prepared for whatever happens
Risk analysis should be done for ALL PHASES of a project
--- planning phase
--- development phase
--- the product itself
Identify risks
- What could you have done during the planning stage to manage each of these risks?
- How likely is it (what is the probability) that each one will occur?
- How likely is it (what is the probability) that more than one will occur?
- What actions will best manage the risk if it occurs?
3 Risk management: identify, plan for risks
During planning, a Risk Table can be generated
Columns: Risks | Type | Probability | Impact | Plan (Pointer)
Example risks (one row each)
- System not available
- Hardware failure
- Color printer unavailable
- Personnel absent (one meeting)
- Personnel unavailable (several meetings)
- Personnel have left project
Type
- Performance (product won't meet requirements)
- Cost (budget overruns)
- Support (project can't be maintained as planned)
- Schedule (project will fall behind)
Probability: probability of this risk occurring
Impact: e.g., catastrophic, critical, marginal, negligible
4 Risk management: identify, plan for risks
Then the table is sorted by probability and impact, and a cutoff line is defined. Everything above this line must be managed (with a management plan pointed to in the last column).
Useful reference: Embedded Systems Programming, Nov. 2000 (examples), http://www.embedded.com/2000/0011/0011feat1.htm
Additional interesting reference: H. Petroski, To Engineer Is Human: The Role of Failure in Successful Design, Vintage, 1992.
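The sort-and-cutoff step can be sketched in C. This is only an illustration: the slides do not say how probability and impact are combined, so the `risk_score` weighting, the numeric impact scale, and the names `risk_entry` and `rank_risks` are all assumptions.

```c
#include <stdlib.h>

/* Impact categories from the risk table; a larger value is worse.
   The numeric scale is an assumption made for ranking. */
typedef enum { NEGLIGIBLE = 1, MARGINAL, CRITICAL, CATASTROPHIC } impact_t;

typedef struct {
    const char *risk;        /* short description */
    double      probability; /* estimated, 0.0 to 1.0 */
    impact_t    impact;
    const char *plan;        /* pointer to the management plan, or NULL */
} risk_entry;

/* Hypothetical ranking score: probability weighted by impact. */
static double risk_score(const risk_entry *r)
{
    return r->probability * (double)r->impact;
}

/* qsort comparator: highest score first. */
static int by_score_desc(const void *a, const void *b)
{
    double sa = risk_score((const risk_entry *)a);
    double sb = risk_score((const risk_entry *)b);
    return (sa < sb) - (sa > sb);
}

/* Sorts the table in place and returns how many entries lie at or
   above the cutoff, i.e. how many need a management plan. */
size_t rank_risks(risk_entry *table, size_t n, double cutoff)
{
    size_t above = 0;
    qsort(table, n, sizeof table[0], by_score_desc);
    while (above < n && risk_score(&table[above]) >= cutoff)
        above++;
    return above;
}
```

After the call, entries `table[0]` to `table[above - 1]` are the ones whose `plan` field should not be left NULL.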
5 Professional risk analysis is proactive, not reactive
6 Important concepts for embedded systems
Risk = (Probability of failure) x Severity
Increased risk => decreased safety
Safety failures: possible causes
- incorrect or incomplete specification
- bad design
- improper implementation
- faulty component
- improper use
RELIABILITY: what is the probability of failure?
7 Some ways to determine reliability
-- product performs consistently as expected
-- MTBF (mean time between failures) is long
-- system behavior is DETERMINISTIC
-- system responds to, or FAILS GRACEFULLY under, out-of-bounds or unexpected conditions and recovers if possible
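As a small illustration of the MTBF criterion, the mean can be estimated from a log of operating intervals, each of which ended in a failure. The function name `mtbf_hours` and the log layout are assumptions made for this sketch.

```c
#include <stddef.h>

/* Each entry is the operating time, in hours, between one repair and
   the next failure. Returns the mean of those intervals, or -1.0 when
   the log is empty (MTBF cannot be estimated from no failures). */
double mtbf_hours(const double *uptime_hours, size_t n_failures)
{
    double total = 0.0;
    if (uptime_hours == NULL || n_failures == 0)
        return -1.0;
    for (size_t i = 0; i < n_failures; i++)
        total += uptime_hours[i];
    return total / (double)n_failures;
}
```

A long MTBF by itself says nothing about how the system behaves when it does fail, which is why the list above also asks for deterministic behavior and graceful failure.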
8 Definitions
Fault: incorrect or unacceptable state or condition
Fault duration and frequency determine classification
- transient: from unexpected external condition ("soft")
- intermittent: unstable hardware or marginal design; periodic / aperiodic
- permanent: failed component ("hard")
Error: static, inherent characteristic of system
Failure: dynamic, occurs at specific time
Possible fault consequences
- inappropriate action
- timing: event occurs too early or too late
- sequence of events incorrect
- quantity: wrong amount of energy or substance used
9 Achieving reliability
- safe design
- fault detection
- fault management
- fault tolerant: system recovers, fault not detected, e.g., packet transfers
Definition of reliability for embedded system
- probability that a failure is detected by the user is less than a specified threshold
10 Examples: section 8.5 (read these carefully!)
- Ariane 5 rocket: register overflow; a 64-bit floating-point value was converted to a 16-bit signed integer in a reused subsystem
- Mars Pathfinder mission, 1997: lower priority tasks were allowed to hog resources, so higher priority tasks could not execute (priority inversion)
- 2004 Mars mission: file management problems
Many more examples in articles at embedded.com
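The Ariane 5 item comes down to an unchecked narrowing conversion. A minimal sketch of a checked conversion in C follows. It is not the flight code, and the function name `to_int16_checked` is invented for this example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Converts a 64-bit floating-point value to int16_t only when it fits.
   Returns false and leaves *out untouched for NaN or out-of-range
   input, so the caller must decide how to handle the fault. */
bool to_int16_checked(double value, int16_t *out)
{
    /* Written as a negated range test so that NaN, for which every
       comparison is false, is rejected as well. */
    if (!(value >= (double)INT16_MIN && value <= (double)INT16_MAX))
        return false;
    *out = (int16_t)value;   /* in range: truncates toward zero */
    return true;
}
```

The important design choice is that the out-of-range case is visible to the caller as a return value, not silently wrapped and not turned into an unhandled exception.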
11 How do we define safety?
One criterion
- single-point failure: failure of a single component will not lead to an unsafe condition
- common-mode failure: failure of multiple components due to a single failure event will not lead to an unsafe condition
Safety must be considered THROUGHOUT the project
12 Embedded system design: project components (fig_08_00)
Development process (waterfall model)
Alternative process models
Need risk analysis AT EACH INCREMENT
(A = analysis, D = design, I = implement, T = test, M = maintenance)
Basic waterfall model: A-->D-->I-->T-->M
Prototyping: A-->D-->I-->T-->M
Incremental: A-->D-->I-->T-->M-->A-->D-->I-->T--> ... -->M
Component based: A-->D-->Library-->Integrate-->T-->M
13 Specifications
- Identify hazards
- Calculate risk
- Define safety measures
- Specification document should include safety standards and guidelines which the system complies with
  - e.g., Underwriters Laboratories, FCC, FDA, FAA, AEC, NASA, ISO, NHTSA, etc.
- Some industry standards / procedures
  - FAA: DO-178B (and the newer DO-178C)
  - Medical device industry: ISO 14971
  - Nuclear power industry (& others): IEC 61508, "Functional Safety of Electrical/Electronic/Programmable Electronic Safety-related Systems (E/E/PE, or E/E/PES)"
14 Methods
-- Process and Tool Chain evaluation (this is the main focus of DO-178B)
-- Probability-based models
-- Formal methods
-- Traditional methods for code testing, e.g., basis path testing
-- Standard code-checking tools (e.g., avoiding inclusion of redundant code)
15 Design and review process steps (fig_08_01)
16 Coding
- Trade-off
  - traditional efficiency (speed/space) vs. better reliability
- Some examples
  - Array declarations: const may not be required but is preferred, e.g.
      const int size = 5; int myarray[size];
  - Make sure initialization is explicit, do not depend on compiler, e.g.
      int tot = 0; for (int j = 0; j < 10; j++) tot = tot + j;
  - Do not depend on lazy (short-circuit) evaluation, e.g.
      if ((a != 0) && (b/a < 0))  =>  if (a != 0)
                                        if (b/a < 0)
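Two of the coding guidelines above, explicit initialization and an explicit guard in place of short-circuit evaluation, can be written out as small C functions. The function names are made up for this sketch.

```c
#include <stdbool.h>

/* Sum of 0..n-1 with the accumulator initialized explicitly. */
int sum_below(int n)
{
    int tot = 0;                 /* explicit: never rely on a default */
    for (int j = 0; j < n; j++)
        tot = tot + j;
    return tot;
}

/* True when the integer quotient b/a is negative. The division is
   reached only when a != 0, and the nesting makes that guard visible
   without relying on the evaluation order of &&.
   Note: INT_MIN / -1 still overflows and is not handled here. */
bool ratio_is_negative(int a, int b)
{
    if (a != 0) {
        if (b / a < 0)
            return true;
    }
    return false;
}
```

In C the `&&` operator is guaranteed to short-circuit, so the nested form is a readability and review aid more than a correctness fix. It also leaves an obvious place to add handling for the `a == 0` case.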
17 Primitive C error handling: assert (fig_08_02)
May not be sufficient for embedded system
18 Example (fig_08_03)
Good for debugging stage, allows controlled crash
Not robust enough for final code
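The figure is not reproduced in this transcript. As an independent sketch of the same contrast, the two functions below handle a bad input with `assert` and with a status code respectively. The names `scale_debug`, `scale_checked`, `sensor_status`, and the 12-bit `ADC_MAX` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { SENSOR_OK = 0, SENSOR_BAD_ARG, SENSOR_OUT_OF_RANGE } sensor_status;

#define ADC_MAX 4095   /* hypothetical 12-bit converter */

/* Debug style: a bad reading aborts the program (a controlled crash).
   The check disappears entirely when NDEBUG is defined. */
int scale_debug(int raw)
{
    assert(raw >= 0 && raw <= ADC_MAX);
    return raw * 100 / ADC_MAX;          /* percent of full scale */
}

/* Production style: the problem is reported and the caller recovers. */
sensor_status scale_checked(int raw, int *percent)
{
    if (percent == NULL)
        return SENSOR_BAD_ARG;
    if (raw < 0 || raw > ADC_MAX)
        return SENSOR_OUT_OF_RANGE;
    *percent = raw * 100 / ADC_MAX;
    return SENSOR_OK;
}
```

A caller of `scale_checked` can, for example, reuse the last good reading or enter a reduced-capability mode when it receives `SENSOR_OUT_OF_RANGE`. The `assert` version offers no such option.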
19 Jump statements: consequences may not be acceptable (fig_08_04)
20 Example (fig_08_05)
Better: high compiler warning level, variable typing
21 Example system (fig_08_06)
Control, Memory, Data / comm, Power / reset, Peripherals, Clock
22 Basic method: redundancy (triple) (fig_08_07)
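Triple redundancy is usually paired with a majority voter. A minimal software sketch follows, assuming the three channels deliver 32-bit words. The function names are hypothetical and the figure may show a hardware voter instead.

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority: each output bit takes the value that at
   least two of the three channels agree on. */
uint32_t tmr_vote(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Nonzero when at least one channel differs from the voted result,
   so the fault can be logged even though it was masked. */
int tmr_disagreement(uint32_t a, uint32_t b, uint32_t c)
{
    uint32_t v = tmr_vote(a, b, c);
    return (a != v) || (b != v) || (c != v);
}
```

The voter itself remains a single point of failure, which is one reason for the higher-redundancy arrangements that follow.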
23 Higher redundancy (fig_08_08)
24 Reduced capability in case of failure / error (fig_08_09)
25 Alternative: monitor only (fig_08_10)
26 Bussing: interconnection architectures (fig_08_11)
27 Sequential: still can fail at one point (fig_08_12)
28 Better: ring (fig_08_13)
29 Even better: ring with redundancy (fig_08_14)
30 Signal values (fig_08_15)
magnitude, duration: ignore, detect / warn, react
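One way to read the signal-values slide is that the response depends on both how far a signal is out of range and for how long. The sketch below is an interpretation, not a reproduction of the figure. Every threshold and name in it is a hypothetical placeholder.

```c
typedef enum { ACT_IGNORE, ACT_WARN, ACT_REACT } action_t;

/* All thresholds below are hypothetical placeholders. */
#define SOFT_LIMIT     100   /* magnitude above which a sample is suspect */
#define HARD_LIMIT     150   /* magnitude that demands an immediate reaction */
#define WARN_SAMPLES   3u    /* consecutive suspect samples before warning */
#define REACT_SAMPLES  10u   /* consecutive suspect samples before reacting */

typedef struct {
    unsigned over_count;     /* consecutive samples above SOFT_LIMIT */
} excursion_monitor;

/* Classifies one sample, taking magnitude and duration into account. */
action_t monitor_sample(excursion_monitor *m, int value)
{
    if (value <= SOFT_LIMIT) {
        m->over_count = 0;               /* excursion ended */
        return ACT_IGNORE;
    }
    if (m->over_count < REACT_SAMPLES)
        m->over_count++;                 /* saturate, never wrap */
    if (value > HARD_LIMIT || m->over_count >= REACT_SAMPLES)
        return ACT_REACT;
    if (m->over_count >= WARN_SAMPLES)
        return ACT_WARN;
    return ACT_IGNORE;                   /* short glitch */
}
```

The monitor is called once per sample with a zero-initialized `excursion_monitor`, so a brief glitch is ignored while a sustained or extreme excursion escalates.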
31 Data errors: detect / correct (fig_08_16)
Example: errors in 3 bits
32 Error detection example (fig_08_17)
33 Hamming code (review) (fig_08_18)
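As a review aid, here is a minimal Hamming(7,4) encoder and single-error-correcting decoder in C. The bit layout (parity bits at positions 1, 2, and 4) is the textbook one and is assumed here. The figure may use a different ordering, and the function names are invented for this sketch.

```c
#include <stdint.h>

/* Returns the bit at position pos (1..7); position 1 is stored in bit 0. */
static unsigned bit_at(uint8_t word, unsigned pos)
{
    return (word >> (pos - 1u)) & 1u;
}

/* Encodes the low 4 bits of data as a 7-bit codeword.
   Layout by position: 1=p1 2=p2 3=d1 4=p4 5=d2 6=d3 7=d4. */
uint8_t hamming74_encode(uint8_t data)
{
    unsigned d1 = data & 1u,        d2 = (data >> 1) & 1u;
    unsigned d3 = (data >> 2) & 1u, d4 = (data >> 3) & 1u;
    unsigned p1 = d1 ^ d2 ^ d4;     /* covers positions 1,3,5,7 */
    unsigned p2 = d1 ^ d3 ^ d4;     /* covers positions 2,3,6,7 */
    unsigned p4 = d2 ^ d3 ^ d4;     /* covers positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p4 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Corrects at most one flipped bit and returns the 4 data bits.
   *syndrome receives the faulty position (0 means no error seen). */
uint8_t hamming74_decode(uint8_t word, unsigned *syndrome)
{
    unsigned s1 = bit_at(word, 1) ^ bit_at(word, 3) ^ bit_at(word, 5) ^ bit_at(word, 7);
    unsigned s2 = bit_at(word, 2) ^ bit_at(word, 3) ^ bit_at(word, 6) ^ bit_at(word, 7);
    unsigned s4 = bit_at(word, 4) ^ bit_at(word, 5) ^ bit_at(word, 6) ^ bit_at(word, 7);
    unsigned s = s1 | (s2 << 1) | (s4 << 2);
    if (s != 0)
        word ^= (uint8_t)(1u << (s - 1u));   /* flip the faulty position */
    *syndrome = s;
    return (uint8_t)(bit_at(word, 3) | (bit_at(word, 5) << 1) |
                     (bit_at(word, 6) << 2) | (bit_at(word, 7) << 3));
}
```

With this layout the syndrome read as a binary number is the position of the flipped bit. Two flipped bits produce a nonzero syndrome that points at the wrong position, so this code corrects single errors only.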
34 Block codes example: lateral and longitudinal parity (fig_08_22)
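A block protected by lateral (per-byte) and longitudinal (per-column) parity can locate a single flipped bit at the intersection of the failing row and column. The sketch below assumes a 4-byte block and even parity. `BLOCK_LEN`, `block_parity`, and the function names are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

#define BLOCK_LEN 4u   /* hypothetical block size, at most 8 bytes */

typedef struct {
    uint8_t lateral;       /* bit i = even-parity bit of byte i */
    uint8_t longitudinal;  /* XOR of all bytes: one parity bit per column */
} block_parity;

/* Returns 1 when x contains an odd number of 1 bits. */
static unsigned odd_parity(uint8_t x)
{
    x ^= (uint8_t)(x >> 4);
    x ^= (uint8_t)(x >> 2);
    x ^= (uint8_t)(x >> 1);
    return x & 1u;
}

block_parity parity_compute(const uint8_t *data)
{
    block_parity p = { 0, 0 };
    for (size_t i = 0; i < BLOCK_LEN; i++) {
        p.lateral |= (uint8_t)(odd_parity(data[i]) << i);
        p.longitudinal ^= data[i];
    }
    return p;
}

/* Returns 0 if the block matches the parity that was sent, 1 if a
   single-bit data error was located and corrected in place, and -1 if
   the mismatch cannot be explained by one flipped data bit. */
int parity_check(uint8_t *data, block_parity sent)
{
    block_parity now = parity_compute(data);
    uint8_t row = now.lateral ^ sent.lateral;
    uint8_t col = now.longitudinal ^ sent.longitudinal;

    if (row == 0 && col == 0)
        return 0;
    /* exactly one row and exactly one column must disagree */
    if (row == 0 || col == 0 || (row & (row - 1)) || (col & (col - 1)))
        return -1;
    for (size_t i = 0; i < BLOCK_LEN; i++)
        if (row & (1u << i))
            data[i] ^= col;
    return 1;
}
```

The sender calls `parity_compute` and transmits the result with the block. The receiver calls `parity_check` with the received data and the received parity.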
35 (fig_08_23)
36 More complex codes: use the field Z2 (fig_08_24)
37 Shift register for encoding, decoding (fig_08_25)
38 Checking data (fig_08_26)
39 Syndrome calculator (fig_08_27)
40 Encoding (fig_08_28)
41 Some polynomials: must choose correct one (fig_08_29)
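The shift-register encoding and syndrome check over Z2 can be mirrored in software as a bitwise CRC. The sketch below assumes the generator polynomial x^8 + x^2 + x + 1 (0x07) with a cleared initial register. That polynomial is one common choice and is not necessarily one of those shown in the figures.

```c
#include <stddef.h>
#include <stdint.h>

#define CRC8_POLY 0x07u   /* x^8 + x^2 + x + 1, assumed for this sketch */

/* Software model of an 8-bit feedback shift register: each message bit
   is shifted in from the top, and the polynomial is XORed in (addition
   in Z2) whenever a 1 falls out of the register. */
uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t reg = 0;                       /* register cleared at start */
    for (size_t i = 0; i < len; i++) {
        reg ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            if (reg & 0x80u)
                reg = (uint8_t)((reg << 1) ^ CRC8_POLY);
            else
                reg = (uint8_t)(reg << 1);
        }
    }
    return reg;
}
```

Encoding appends `crc8(message)` to the message. Checking runs the same function over the message plus the appended byte, and a zero result plays the role of a zero syndrome.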
42 Power system (fig_08_30)
43 Redundancy and power monitoring (fig_08_31)
44 Potential actions (fig_08_32)
45 Using backups (fig_08_33)
46 Backups: short-term fix (fig_08_34)
47 Bus faults: buffering (fig_08_35)
48 Bus testing (fig_08_36)
49 Interface system monitoring and testing (fig_08_37)
50 Example: common fault analysis (table_08_00)