Title: Safety
1Lecture 12 Safety
2- Safety-Related Faults
- Safe systems
- Random and systematic faults
- Single-point failures
- Common mode failures
- Latent faults
- Fail-safe state
- Achieving safety
3Safe systems
4Faults and Failures
- Failures are events that occur at specific times.
- Errors are more static: they are inherent characteristics of the system, as in design errors.
- A fault is an unsatisfactory system condition or state.
- Failures and errors are different kinds of faults.
5- Many systems present hazards, which the system can identify and address, or ignore.
- However, safety is a system issue.
- The system is either safe or it is not: not the software, not the electronics, not the mechanics.
6Random and systematic faults
Errors are systematic faults: they are intrinsic in the design or implementation. If the system does the wrong thing, it will always do the wrong thing under identical circumstances.
Failures are random faults that occur when a component breaks in the field. Something that once functioned properly no longer does so. The design is fine, but the system no longer meets its design characteristics.
7- Hardware faults may be systematic or random. A hardware design may contain an error, which is a systematic fault. Random faults occur only in physical entities such as mechanical or electronic components. Random faults cannot be designed away. It is possible to add redundancy for fault detection, but no one has ever made a CPU that cannot fail.
- Software faults are always systematic, because software neither breaks down nor wears out. Run-time checking provides the only means within a system to identify systematic faults that occur in rare circumstances.
8Single-point failures
10A safe system continues to be safe in the event of any single-point failure.
11Common mode failures
Example: if a medical system uses a primary CPU to control a radiation dose, none of the safety mechanisms can reside on that CPU. If the primary CPU fails, the safety CPU prevents an overdose.
14Latent faults
If a system cannot routinely validate a safety
measure, then the safety measure cannot be relied
on.
16Fail-safe state
17- Achieving safety
- Firewall pattern: the most fundamental safety design concept is the separation of safety channels and non-safety channels.
- Channel: a static path of data and control. Failure of any component of the channel causes a failure of the entire channel.
- Isolate all non-safety-related software and hardware components from those with safety responsibility.
- Separating the safety-critical components simplifies the design and makes it cheaper.
18Redundancy can be homogeneous or diverse.
Homogeneous redundancy uses exact replicas of channels and protects only against random failures.
Diverse redundancy uses different means to perform the same function and protects against both random and systematic faults.
Example from data storage: homogeneous redundancy stores the data 3 times and compares the copies before use; diverse redundancy stores a second copy as the one's complement of the data.
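The data-storage example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: values are single bytes, and the function names (`store_triple`, `load_triple`, `store_complement`, `load_complement`) are hypothetical, not taken from the lecture.

```python
MASK = 0xFF  # assumption: each stored value is a single byte


def store_triple(value):
    """Homogeneous redundancy: keep three identical copies."""
    return [value, value, value]


def load_triple(copies):
    """Return the value only if all three copies agree."""
    if copies[0] == copies[1] == copies[2]:
        return copies[0]
    raise ValueError("triple-store mismatch: data corrupted")


def store_complement(value):
    """Diverse redundancy: keep the value and its one's complement."""
    return (value & MASK, ~value & MASK)


def load_complement(pair):
    """Return the value only if the second copy is its bitwise inverse."""
    value, inverted = pair
    if value ^ inverted != MASK:
        raise ValueError("complement mismatch: data corrupted")
    return value
```

One difference shows up when a fault forces every cell to the same value, for example all zeros: three identical copies still agree with each other, whereas the pair `(0, 0)` fails the complement check.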
19- Many mechanisms identify data corruption
- Parity: a simple 1-bit parity identifies single-bit errors.
- Hamming codes: multiple parity bits identify n-bit errors and repair (n-1)-bit errors.
- Checksums: sum the data within a block.
- Cyclic redundancy check (CRC): a more complex function computed over the data within a block.
- Homogeneous multiple storage: store the data in multiple locations and compare prior to use.
- Complement multiple storage: store a second copy of the data in one's complement (simple bit-by-bit inversion) and compare before use.
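As a rough illustration of the first few mechanisms, the sketch below computes a parity bit, an additive checksum, and a CRC over a small block, then flips one bit to simulate corruption. The helper names and the 8-bit checksum width are assumptions made for the example; `zlib.crc32` is the CRC-32 function from the Python standard library.

```python
import zlib


def parity_bit(byte):
    """Even parity: 1 if the byte has an odd number of 1 bits, else 0."""
    return bin(byte).count("1") % 2


def checksum(block):
    """Additive checksum over the block, truncated to 8 bits (assumed width)."""
    return sum(block) % 256


block = bytes([0x12, 0x34, 0x56])
stored_parity = [parity_bit(b) for b in block]
stored_sum = checksum(block)
stored_crc = zlib.crc32(block)

# Flip the lowest bit of the first byte to simulate a single-bit fault.
corrupted = bytes([block[0] ^ 0x01]) + block[1:]

parity_detects = parity_bit(corrupted[0]) != stored_parity[0]
checksum_detects = checksum(corrupted) != stored_sum
crc_detects = zlib.crc32(corrupted) != stored_crc
```

All three checks flag this single-bit fault. Parity alone would miss a fault that flips two bits in the same byte, which is one reason the stronger functions exist.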
20- Many mechanisms identify data corruption
- Redundancy with feedback error detection: identifies the fault but does not attempt to correct it; it may redo the processing step that was in error, or terminate processing by moving to a safe shutdown state.
- Redundancy with feed-forward error detection: identifies the fault, tries to correct it, and keeps processing.
- Feed-forward is used when a system does not have a safe shutdown state. A common implementation is the reconstruction of correct data from partially corrupted values.
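The two strategies can be contrasted in a short sketch. It assumes byte values, a hypothetical `read_pair` callable that returns a value together with its one's complement, and a hypothetical `SafeShutdown` exception standing in for the move to a safe shutdown state. The feed-forward part uses a bitwise majority vote over three copies as one possible way to reconstruct data; the lecture does not prescribe a specific method.

```python
class SafeShutdown(Exception):
    """Signals that the system must move to its safe shutdown state."""


def read_with_feedback(read_pair, max_retries=2):
    """Feedback: detect the fault, redo the step, then give up safely."""
    for _ in range(max_retries + 1):
        value, inverted = read_pair()
        if value ^ inverted == 0xFF:  # copies are consistent
            return value
    raise SafeShutdown("persistent read fault")


def majority_bits(a, b, c):
    """Feed-forward: rebuild each bit from the majority of three copies."""
    return (a & b) | (a & c) | (b & c)
```

For example, `majority_bits(0x5A ^ 0x01, 0x5A ^ 0x10, 0x5A)` reconstructs `0x5A` even though two of the three copies are corrupted, because no bit position is wrong in more than one copy.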
21- Safety Architectures
- Single Channel Protected Design (SCPD)
- Multi-Channel Voting Pattern
- Homogeneous Redundancy Pattern
- Diverse Redundancy Pattern
- Monitor-Actuator Pattern
- Watchdog Pattern
- Safety Executive Pattern
22- Pattern: a general solution for a common problem.
- A pattern consists of 3 aspects:
- Problem: the context that the pattern addresses
- Solution: the pattern itself
- Consequences: the implications of using the pattern
23Single Channel Protected Design (SCPD)
Example: a train-braking system
24- Although any stop in the chain could lead to an accident, the channel can be made safe by applying sufficient hazard control within the channel.
- For example, a periodic life tick ensures bus integrity. In case of failure, the brake engages itself and the engine disengages itself automatically.
- The computer watchdog activates the brake if it is not updated in the proper way and at the proper interval.
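A minimal sketch of the life-tick idea follows. The `Watchdog` class, its method names, and the millisecond timestamps are all hypothetical; the sketch checks only the tick interval, not whether the update was made "in the proper way", and it latches the brake once engaged.

```python
class Watchdog:
    """Hypothetical watchdog: expects a life tick at least every timeout_ms."""

    def __init__(self, timeout_ms, now_ms=0):
        self.timeout_ms = timeout_ms
        self.last_tick_ms = now_ms
        self.brake_engaged = False

    def tick(self, now_ms):
        """Called by the control computer to signal that it is alive."""
        self.last_tick_ms = now_ms

    def poll(self, now_ms):
        """Called periodically by independent logic; brakes on a missed tick."""
        if now_ms - self.last_tick_ms > self.timeout_ms:
            self.brake_engaged = True  # fail-safe state: brake on, stays on
        return self.brake_engaged
```

Passing the current time in as an argument keeps the sketch deterministic; a real watchdog would normally be a separate hardware timer so that it does not share a failure mode with the computer it supervises.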
26Multi-Channel Voting Pattern
(homogeneous or diverse)
27- Homogeneous Redundancy Pattern
30- Diverse Redundancy Pattern
41- 8 steps to build safety systems
- Identify the hazards
- Determine the risks
- Define the safety measures
- Create safe requirements
- Create safe design
- Implement safety
- Assure the safety process
- Test
42Identify the hazards
52- Disadvantages of fault tree analysis (FTA)
- Requires that all components are identified and characterized prior to analysis.
- The tree becomes very complex in a non-trivial design. Components can typically fail in many ways.
53Determine the risks
57- Define the safety measures
- A safety measure is a behavior added to a system to handle a hazard.
- There are many ways to handle a hazard:
- Obviation: the hazard can be made physically impossible.
- Education: educate the users so they won't create hazardous conditions through equipment misuse.
- Alarming: announce the hazard to the user, so that they can take appropriate action.
58- Interlocks: the hazard can be removed by using secondary devices or logic to intercede when a hazard presents itself.
- Internal checking: ensure that the system can detect its own malfunction before an accident occurs.
- Safety equipment
- Restriction of access
- Labeling
59- Many considerations arise when controlling a hazard:
- Tolerance time
- Risk level
- Supervision of the device: constant, occasional, unattended
- Skill level of the user
- Environment
- Likelihood of the fault that causes a hazard
60Create safe requirements
Specifying a safe system often means specifying negations: "The system must not energize the laser when the safety interlock is active."
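A negative requirement like this one maps naturally onto a guard in the code. The sketch below uses a hypothetical `Laser` class; the choice to start with the interlock active and to de-energize as soon as it trips are assumptions added for illustration, not part of the stated requirement.

```python
class Laser:
    """Hypothetical laser driver illustrating a negative requirement."""

    def __init__(self):
        self.energized = False
        self.interlock_active = True  # assumption: start in the safe state

    def set_interlock(self, active):
        """Record the interlock state; drop power as soon as it trips."""
        self.interlock_active = active
        if active:
            self.energized = False

    def energize(self):
        """Return True only if the laser was actually energized."""
        if self.interlock_active:
            return False  # requirement: must NOT energize
        self.energized = True
        return True
```

Because the requirement is a negation, a test for it checks that something does not happen: with the interlock active, `energize()` must return `False` and leave `energized` unset.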
61- Create safe design
- Work from safe requirements
- Adopt a fundamentally safe architecture
- Periodically, during design, revisit the hazard analysis to add hazards specific to the design.
- Select programming-in-the-small measures that provide the appropriate level of detection and correction.
- Ensure that independent channels lack common mode failures.
- Adopt a consistent set of strategies to handle faults.
- Build in power-on and periodic run-time tests to identify latent faults.
62POST: power-on self test. BIT: built-in test (periodic or continuous tests).
POST must test all the system features used to detect faults. BIT tests those features that periodically detect faults. BIT typically identifies only hard failures and does not identify systematic or software failures.
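One way to picture such a test is a memory pattern check that is run at power-on and could be repeated periodically. The sketch below is an assumption-laden illustration: `ram_pattern_test` and `run_self_test` are hypothetical names, and a `bytearray` stands in for a block of device memory.

```python
def ram_pattern_test(memory):
    """Write two complementary patterns to each cell and read them back.

    Returns True if every cell holds both patterns. The original contents
    of each passing cell are restored.
    """
    for i in range(len(memory)):
        saved = memory[i]
        for pattern in (0x55, 0xAA):
            memory[i] = pattern
            if memory[i] != pattern:
                return False  # hard failure: cell cannot hold the pattern
        memory[i] = saved
    return True


def run_self_test(checks):
    """Run each (name, check) pair and return the names of failed checks."""
    return [name for name, check in checks if not check()]


ram = bytearray(16)  # stand-in for a block of device memory
post_failures = run_self_test([("ram", lambda: ram_pattern_test(ram))])
```

A check like this finds a stuck memory cell, which is a hard failure. It says nothing about whether the software using that memory is correct, in line with the limitation noted above.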
63- Implement safety
- Implementation decisions affect system safety
- Language choices
  - Compile-time checking
  - Run-time checking
- Exceptions vs. error codes
- Safe language subsets, for example avoiding void pointers.
64- "C treats you like a consenting adult. Pascal treats you like a naughty child. Ada treats you like a criminal."
- Pascal insulates programmers from the most common C programming errors, but its run-time system is larger and programs run slower.
- The following language features improve system safety:
- Strong compile-time checking
- Strong run-time checking
- Support for encapsulation and abstraction
- Exception handling
65Strongly typed languages perform compile-time checks for violations of the language rules, and also execute run-time assertions that catch violations of those rules during execution; these are called program-invariant assertions. For example, it is unacceptable not to check array index validity or subrange value bounds at run time.
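In a language without built-in subrange types, the same run-time check can be written by hand. The `Subrange` class below is a hypothetical sketch of that idea: every assignment goes through a setter that enforces the bounds.

```python
class Subrange:
    """Hypothetical bounded integer: every assignment is range-checked."""

    def __init__(self, low, high, value):
        self.low = low
        self.high = high
        self.value = value  # goes through the checked setter below

    @property
    def value(self):
        return self._value

    @value.setter
    def value(self, new):
        if not self.low <= new <= self.high:
            raise ValueError(f"{new} is outside {self.low}..{self.high}")
        self._value = new
```

With `speed = Subrange(0, 120, 50)`, the assignment `speed.value = 200` raises `ValueError` and leaves the previous value in place, instead of silently storing an impossible value.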
66- The fundamental rules for safe programming are:
- Make it right before you make it fast.
- Verify that it remains right during program execution.
- Explicitly check pre-conditional invariants.
- Explicitly check post-conditional invariants.
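The last two rules can be illustrated with a small function that checks its input before computing and its result afterwards. The function name `checked_isqrt` is hypothetical; `math.isqrt` is the standard-library integer square root (Python 3.8 and later). Explicit `if` checks are used rather than `assert`, since assertions can be disabled when Python runs with `-O`.

```python
import math


def checked_isqrt(n):
    """Integer square root with explicit pre- and post-condition checks."""
    # Pre-condition: the input must be a non-negative integer.
    if not isinstance(n, int) or n < 0:
        raise ValueError("pre-condition violated: n must be a non-negative int")

    root = math.isqrt(n)

    # Post-condition: root is the largest integer whose square is <= n.
    if not (root * root <= n < (root + 1) * (root + 1)):
        raise RuntimeError("post-condition violated: wrong result")
    return root
```

The post-condition check looks redundant when the computation is correct, and that is the point: it verifies that the result remains right during execution, whatever the cause of a wrong value might be.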
67- Assure the safety process
- Many standards exist to ensure a quality process, in which the quality of a product is stable and repeatable from product to product.
- Key process activities to improve product safety:
- Continuously track hazard analysis
- Use peer reviews to assure quality
- Verify design adherence.
- Verify coding standard adherence.
- Identify how each hazard is handled.
68Testing
Fault seeding: simulating all faults that affect safety, to verify that the system acts in a safe, correct manner when a fault occurs.
Unit testing: white-box; the tester must understand the detailed inner workings of the system. Fault seeding during unit tests means violating system invariants by passing out-of-range values or corrupted data.
Integration testing: gray-box; after each component has passed unit testing, integration tests bring the components together. These tests simulate missing components and exercise the interfaces between the components.
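Fault seeding at the unit level might look like the sketch below. The unit under test (`heater_output`), its valid range, and the choice of zero output as the fail-safe value are all assumptions invented for the example; the point is the small harness that feeds out-of-range and corrupted inputs and reports any that are not handled safely.

```python
SAFE_OUTPUT = 0  # assumption: zero heater power is the fail-safe state


def heater_output(temperature_c):
    """Hypothetical unit under test: sensor reading -> heater power (0..100)."""
    is_number = isinstance(temperature_c, (int, float))
    if not is_number or temperature_c != temperature_c:
        return SAFE_OUTPUT  # corrupted reading: wrong type or NaN
    if not -40 <= temperature_c <= 150:
        return SAFE_OUTPUT  # out-of-range reading
    return max(0, min(100, round((60 - temperature_c) * 2)))


def seed_faults(unit, faulty_inputs, safe_value):
    """Feed each seeded fault to the unit; return the inputs it mishandled."""
    return [x for x in faulty_inputs if unit(x) != safe_value]


seeded = [-1000, 1e9, float("nan"), None, "42"]
mishandled = seed_faults(heater_output, seeded, SAFE_OUTPUT)
```

An empty `mishandled` list means every seeded fault drove the unit to its safe output. Any entry in the list identifies an input for which the unit did not behave safely.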
69- Fault seeding during integration tests may be done by:
- Breaking the communication bus
- Forcing component failure
- Sending corrupted messages
- Removing power when unexpected
- Flooding the bus with messages (valid or invalid)
- Validation testing: black-box testing of end-user requirements and environmental conditions.