Title: Industrial Automation
1. Industrial Automation / Automation Industrielle / Industrielle Automation
Dependability - Evaluation
9.2
Estimation de la fiabilité
Verlässlichkeitsabschätzung
Prof. Dr. H. Kirrmann
ABB Research Center, Baden, Switzerland
2. Dependability Evaluation
This part of the course applies to any system that may fail.
- Dependability evaluation (French: fiabilité prévisionnelle; German: Verlässlichkeitsabschätzung) determines:
  - the expected reliability,
  - the requirements on component reliability,
  - the repair and maintenance intervals, and
  - the amount of necessary redundancy.
- Dependability analysis is the basis on which risks are taken and contracts are established.
- Dependability evaluation must be part of the design process; it is of little use once a system has been put into service.
3. 9.2.1 Reliability definitions
9.2.1 Reliability definitions
9.2.2 Reliability of series and parallel systems
9.2.3 Considering repair
9.2.4 Markov models
9.2.5 Availability evaluation
9.2.6 Examples
4. Reliability
Reliability is the probability that a mission is executed successfully. (What counts as success is a matter of definition and of satisfaction.)
Reliability depends on:
- duration (French proverb: "tant va la cruche à l'eau..."; German: "der Krug geht zum Brunnen, bis er bricht"; in English: the pitcher goes to the well until it breaks),
- environment: temperature, vibration, radiation, etc.
$\lim_{t \to \infty} R(t) = 0$
[Figure: R(t), starting at 1.0, versus time (scale 1 to 6), with curves labelled laboratory and vehicle at temperatures of 25 °C, 40 °C and 85 °C.]
Such curves are obtained by observing a large number of systems, or are calculated for a system from the expected behaviour of its elements.
5. Reliability and failure rate - Experimental view
Experiment: observe a large quantity of light bulbs.
[Figure: percentage of remaining good bulbs (100 % at the start) versus time, giving R(t); below it, the failure rate $\lambda(t)$ versus time with the phases infancy, mature and aging.]
Reliability R(t): number of good bulbs remaining at time t, divided by the initial number of bulbs.
Failure rate $\lambda(t)$: number of bulbs that failed in the interval $[t, t + \Delta t]$, divided by the number of remaining bulbs.
6. Reliability R(t) definition
[State diagram: good -> bad through a failure]
Reliability R(t): probability that a system does not enter a terminal state until time t, given that it was in a good state at time t = 0.
$R(0) = 1$, $\lim_{t \to \infty} R(t) = 0$
Failure rate $\lambda(t)$: probability that a (still good) element fails during the next time unit dt.
Definition: $\lambda(t) = -\frac{dR(t)/dt}{R(t)}$
MTTF (mean time to fail): area below R(t).
Definition: $MTTF = \int_0^\infty R(t) \, dt$
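7. Assumption of constant failure rate
Reliability is the probability of not having failed until time t. It is expressed:
- by a discrete expression: $R(t + \Delta t) = R(t) - R(t) \, \lambda(t) \, \Delta t$
- by a continuous expression, which simplifies when $\lambda$ is constant: $R(t) = e^{-\lambda t}$
[Figure: failure rate $\lambda(t)$ versus time, the "bathtub" curve with the phases childhood (burn-in), mature and aging.]
[Figure: $R(t) = e^{-0.001 t}$ ($\lambda$ = 0.001/h), from 1 to 0, compared with R(t) for a bathtub-shaped failure rate; the MTTF is marked on the time axis.]
The assumption of constant $\lambda$ is justified by experience and simplifies computations significantly.
MTTF (mean time to fail) is the area below R(t):
$MTTF = \int_0^\infty e^{-\lambda t} \, dt = \frac{1}{\lambda}$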
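As a quick numerical check of the two formulas above, the following Python sketch evaluates $R(t) = e^{-\lambda t}$ for the slide's example value $\lambda$ = 0.001/h and integrates R(t) numerically to confirm that the area below the curve is close to $1/\lambda$. The function names, the step size and the integration horizon are illustrative assumptions, not part of the course material.

```python
import math

def reliability(t_hours: float, lam: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda (per hour)."""
    return math.exp(-lam * t_hours)

def mttf_numeric(lam: float, step: float = 1.0, horizon_factor: float = 20.0) -> float:
    """Approximate the MTTF as the area below R(t) with the trapezoidal rule.

    The integration stops at horizon_factor / lambda, where R(t) is negligible
    (both parameters are assumptions chosen for this sketch).
    """
    steps = int(horizon_factor / lam / step)
    area = 0.0
    for i in range(steps):
        left = reliability(i * step, lam)
        right = reliability((i + 1) * step, lam)
        area += 0.5 * (left + right) * step
    return area

lam = 0.001                               # failure rate from the slide, per hour
r_at_mttf = reliability(1.0 / lam, lam)   # reliability after one MTTF
mttf = mttf_numeric(lam)

print(f"R(1/lambda) = {r_at_mttf:.3f}")
print(f"MTTF = {mttf:.1f} h, 1/lambda = {1.0 / lam:.1f} h")
```

Note that after one MTTF only about 37 % of the units are still expected to work, which is why the MTTF alone says little about short missions.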
8. Examples of failure rates
To avoid negative exponents, $\lambda$ values are often given in FIT (Failures In Time): 1 fit = $10^{-9}$ h$^{-1}$, i.e. $1/\lambda$ = 114'000 years.

Element              Rating        Failure rate
resistor             0.25 W        0.1 fit
capacitor (dry)      100 nF        0.5 fit
capacitor (elect.)   100 µF        10 fit
processor            486           500 fit
RAM                  4 MB          1 fit
Flash                4 MB          12 fit
FPGA                 5000 gates    80 fit
PLC                  compact       6500 fit
digital I/O          32 points     2000 fit
analog I/O           8 points      1000 fit
battery              per element   400 fit
VLSI                 per package   100 fit
soldering            per point     0.01 fit

These figures can be obtained from catalogues such as MIL Standard 217F or from the manufacturers' data sheets.
Warning: design failures outweigh hardware failures for small series.
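The conversion between FIT values and MTTF can be sketched in a few lines of Python. The helper names and the choice of 8760 hours per year are assumptions made for this illustration; the FIT values are taken from the table above.

```python
HOURS_PER_YEAR = 8760.0  # assumption: 365 days of 24 hours

def fit_to_rate_per_hour(fit: float) -> float:
    """Convert a failure rate in FIT (1e-9 failures per hour) to failures per hour."""
    return fit * 1e-9

def fit_to_mttf_years(fit: float) -> float:
    """MTTF in years of an element with a constant failure rate given in FIT."""
    return 1.0 / fit_to_rate_per_hour(fit) / HOURS_PER_YEAR

# A few entries of the table, as (element, failure rate in FIT)
for element, fit in [("1 fit reference", 1.0), ("processor 486", 500.0), ("PLC compact", 6500.0)]:
    print(f"{element:16s} {fit:8.1f} fit -> MTTF = {fit_to_mttf_years(fit):10.0f} years")
```

For a series system the FIT values of all elements simply add up, so the same helper can be applied to a sum of table entries.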
9. MIL HDBK 217 (1)
- MIL Handbook 217B lists failure rates of common elements.
- Failure rates depend strongly on the environment (temperature, vibration, humidity) and especially on the location:
  - Ground: benign, fixed, mobile
  - Naval: sheltered, unsheltered
  - Airborne: inhabited, uninhabited, cargo, fighter
  - Airborne: rotary, helicopter
  - Space: flight
- Applying MIL HDBK 217 usually gives pessimistic results for the overall system reliability (the computed reliability is lower than the actual reliability).
- To obtain more realistic estimates, it is necessary to collect failure data from the actual application instead of using the generic values of MIL HDBK 217.
10. Failure rate catalogue MIL HDBK 217 (2)
- Stress is expressed by lambda factors.
- Basic models:
  - discrete components (e.g. resistor, transistor etc.): $\lambda = \lambda_b \, \pi_E \, \pi_Q \, \pi_A$
  - integrated components (ICs, e.g. microprocessors etc.): $\lambda = \pi_Q \, \pi_L \, (C_1 \, \pi_T \, \pi_V + C_2 \, \pi_E)$
- The MIL handbook gives curves and rules for the different element types to compute the factors:
  - $\lambda_b$: based on ambient temperature $\Theta_A$ and electrical stress S
  - $\pi_E$: based on environmental conditions
  - $\pi_Q$: based on production quality and burn-in period
  - $\pi_A$: based on component characteristics and usage in the application
  - $C_1$: based on the complexity
  - $C_2$: based on the number of pins and the type of packaging
  - $\pi_T$: based on chip temperature $\Theta_J$ and technology
  - $\pi_V$: based on voltage stress
Example: $\lambda_b$ usually grows exponentially with the temperature $\Theta_A$ (Arrhenius law).
11. What can go wrong
poor soldering (manufacturing)
broken wire (vibrations)
tin whiskers (lead-free soldering)
chip cracking (thermal stress)
broken isolation (assembly)
12. Failures that affect logic circuits
- Thermal stress (different dilatation coefficients, contact creeping)
- Electrical stress (electromagnetic fields)
- Radiation stress (high-energy particles, cosmic rays in the high atmosphere)
Errors that are transient in nature (called soft errors) can be latched in memory and become firm errors. Solid errors do not disappear at restart.
Example: an FPGA with 3 M gates, exposed to $9.3 \times 10^8$ neutrons/cm², exhibited 320 FIT at sea level and 150000 FIT at 20 km altitude (see http://www.actel.com/products/rescenter/ser/index.html).
Things are getting worse with smaller integrated circuit geometries!
13. Exercise: Wearout Failures
The development of $\lambda(t)$ towards the end of the lifetime of a component is usually described by a Weibull distribution: $\lambda(t) = \beta \, \lambda^{\beta} \, t^{\beta - 1}$ with $\beta > 0$.
- a) Draw the functions for the parameters $\beta$ = 1, 2, 3 in a common coordinate system.
- b) Compute the reliability function R(t) from $\lambda(t)$.
- c) Draw the reliability functions for the parameters $\beta$ = 1, 2, 3 in a common coordinate system.
- d) Compare the wearout behaviour with the behaviour assuming a constant failure rate $\lambda(t) = \lambda$.
14. Cold, Warm and Hot redundancy
- Hot redundancy: the reserve element is fully operational and under stress; it has the same failure rate as the operating element.
- Warm redundancy: the reserve element is not operational but can take over in a short time; it has a smaller failure rate.
- Cold redundancy (cold standby): the reserve is switched off and has zero failure rate.
[Figure: two R(t) curves from 1 to 0 over t, the reliability of the redundant element and the reliability of the reserve element, with the switchover at the failure of the primary element marked.]
15. 9.2.2 Reliability of series and parallel systems (combinatorial)
16. Reliability of a system of unreliable elements
[Figure: four elements 1, 2, 3, 4 connected in series]
The reliability of a system consisting of n elements, each of which is necessary for the function of the system and which fail independently, is:
$R_{total} = R_1 \cdot R_2 \cdots R_n = \prod_{i=1}^{n} R_i$
Assuming constant failure rates $\lambda_i$, the failure rate of a system is easily calculated by summing the failure rates of the individual components:
$R_{NooN} = e^{-\sum_i \lambda_i \, t}$
This is the basis for the calculation of the failure rate of systems (MIL-STD-217F).
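17. Example: series system, combinatorial solution
[Figure: a controller and an inverter / power supply driving a motor with encoder; reliability chain: power supply, motor + encoder, controller]
$\lambda_{control}$ = 0.00005 h$^{-1}$
$\lambda_{supply}$ = 0.001 h$^{-1}$
$\lambda_{motor}$ = 0.0001 h$^{-1}$
$R_{tot} = R_{supply} \cdot R_{motor} \cdot R_{control} = e^{-\lambda_{supply} t} \, e^{-\lambda_{motor} t} \, e^{-\lambda_{control} t} = e^{-(\lambda_{supply} + \lambda_{motor} + \lambda_{control}) t}$
$\lambda_{total} = \lambda_{supply} + \lambda_{motor} + \lambda_{control}$ = 0.00115 h$^{-1}$
Warning: this calculation no longer applies to redundant systems!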
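The series calculation above can be reproduced with a short Python sketch. The dictionary keys and function names are illustrative assumptions; the failure rates are those of the example.

```python
import math

# Failure rates of the example, in failures per hour
failure_rates = {"supply": 0.001, "motor": 0.0001, "control": 0.00005}

# For a series system with constant failure rates, the rates add up
lambda_total = sum(failure_rates.values())
mttf_total = 1.0 / lambda_total

def r_series(t_hours: float) -> float:
    """Reliability of the whole chain, computed from the summed failure rate."""
    return math.exp(-lambda_total * t_hours)

def r_product(t_hours: float) -> float:
    """The same reliability, computed as the product of the element reliabilities."""
    return math.prod(math.exp(-lam * t_hours) for lam in failure_rates.values())

print(f"lambda_total = {lambda_total:.5f} per hour, MTTF = {mttf_total:.0f} h")
print(f"R(100 h) = {r_series(100.0):.4f} (product form: {r_product(100.0):.4f})")
```

Both forms give the same value, which illustrates why summing failure rates is sufficient for non-redundant structures.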
18. Exercise: Reliability estimation
An electronic circuit consists of the following elements:

Quantity  Element              MTTF (each)     Pins (each)
1         processor            600 years       48
30        resistors            100000 years    2
6         plastic capacitors   50000 years     2
1         FPGA                 300 years       24
2         tantalum capacitors  10000 years     2
1         quartz               20000 years     2
1         connector            5000 years      16

The MTTF of one solder point (pin) is 200000 years.
- What is the expected Mean Time To Fail of this system?
- Repair of this circuit takes 10 hours; replacing it by a spare takes 1 hour. What is the availability in both cases?
- The machine in which it is used costs 100 per hour, with production 24 hours a day and an installation lifetime of 30 years. What should the price of the spare be?
19. Exercise: MTTF calculation
An embedded controller consists of:
- one microprocessor 486
- 2 x 4 MB RAM
- 1 x Flash EPROM
- 50 dry capacitors
- 5 electrolytic capacitors
- 200 resistors
- 1000 soldering points
- 1 battery for the real-time clock
What is the MTTF of the controller and what is its weakest point? (Use the numbers of a previous slide.)
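20. Redundant, parallel system 1-out-of-2 with no repair - combinatorial solution
Simple redundant system: the system is good if either element (or both) is good.
[Figure: two elements R1 and R2 in parallel; probability square with sides R1, 1-R1 and R2, 1-R2 showing the three "ok" areas: R1 good and R2 good, R1 good and R2 down, R1 down and R2 good]
$R_{1oo2} = R_1 R_2 + R_1 (1 - R_2) + (1 - R_1) R_2$
$R_{1oo2} = 1 - (1 - R_1)(1 - R_2)$
with $R_1 = R_2 = R$: $R_{1oo2} = 2R - R^2$
with $R = e^{-\lambda t}$: $R_{1oo2} = 2 e^{-\lambda t} - e^{-2 \lambda t}$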
21. R(t) for 1oo2 redundancy
[Figure: R(t) for 1oo1 and 1oo2 with $\lambda$ = 1; vertical axis R from 0 to 1, horizontal axis t from 0 to 2 in units of the MTTF.]
22. Combinatorial R1oo2, no repair
Example R1oo2: airplane with two motors.
MTTF of one motor = 1000 hours (this value is rather pessimistic); flight duration t = 2 hours.
- What is the probability that any motor fails?
- What is the probability that both motors have not failed by time t (landing)?
Apply:
- $R_{1oo1} = e^{-\lambda t}$: a single motor does not fail with probability 0.998 (0.2 % chance that it fails)
- $R_{2oo2} = e^{-2 \lambda t}$: no motor fails with probability 0.996 (0.4 % chance that one fails)
- $R_{1oo2} = 2 e^{-\lambda t} - e^{-2 \lambda t}$: both motors fail with a chance of 0.0004 %
This assumes that there is no common mode of failure (bad fuel or oil, hail, birds, ...).
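The figures of the two-motor example can be recomputed with a small Python sketch. The variable names are illustrative assumptions; the formulas are the combinatorial expressions for 1oo1, 2oo2 and 1oo2 with independent failures.

```python
import math

mttf_motor_h = 1000.0        # MTTF of one motor, from the example
flight_h = 2.0               # flight duration, from the example
lam = 1.0 / mttf_motor_h     # constant failure rate per hour

r_single = math.exp(-lam * flight_h)        # one given motor survives the flight
r_2oo2 = r_single ** 2                      # no motor fails
r_1oo2 = 2.0 * r_single - r_single ** 2     # at least one motor survives
p_both_fail = (1.0 - r_single) ** 2         # both motors fail (independent failures)

print(f"single motor survives: {r_single:.3f}")
print(f"no motor fails:        {r_2oo2:.3f}")
print(f"both motors fail:      {100.0 * p_both_fail:.4f} %")
```

The last line expresses the double failure as a percentage, which makes the gap between the single-failure chance (about 0.2 %) and the double-failure chance (about 0.0004 %) visible.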
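23. MIF, ARL, reliability of redundant structures
ARL: Acceptable Reliability Level
[Figure: R(t), starting at 1.0, for a simplex system (R1) and a system with redundancy (R2); the horizontal line at the ARL crosses the two curves at the mission times MT1 and MT2.]
Mission Time Improvement Factor (for a given ARL): MIF = MT2 / MT1
Reliability Improvement Factor (at a given mission time): RIF = $(1 - R_{without}) / (1 - R_{with})$, the quotient of the unreliabilities without and with redundancy.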
24. R1oo2 Reliability Improvement Factor
Reliability improvement factor: RIF = $(1 - R_{without}) / (1 - R_{with})$
[Figure: R(t) from 0 to 1 for 1oo1 and 1oo2 with $\lambda$ = 0.001; the 10-hour mission is marked.]
RIF for a 10-hour mission: $R_{1oo1}$ = 0.990, $R_{1oo2}$ = 0.999901, RIF = 100
but:
$MTTF_{1oo2} = \int_0^\infty (2 e^{-\lambda t} - e^{-2 \lambda t}) \, dt = \frac{3}{2 \lambda}$
No spectacular increase in MTTF!
=> 1oo2 without repair is only suited when the mission time is $\ll 1/\lambda$.
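To make the contrast between the large RIF and the modest MTTF gain concrete, here is a minimal Python sketch. It takes the RIF as the unreliability without redundancy divided by the unreliability with redundancy, and uses $\lambda$ = 0.001 per hour as in the example; the function names are assumptions for this illustration.

```python
import math

lam = 0.001  # failure rate per hour, from the example

def r_simplex(t: float) -> float:
    """Reliability of a single (1oo1) unit."""
    return math.exp(-lam * t)

def r_1oo2(t: float) -> float:
    """Reliability of a 1oo2 system without repair."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)

def rif(t: float) -> float:
    """Reliability improvement factor: unreliability without / with redundancy."""
    return (1.0 - r_simplex(t)) / (1.0 - r_1oo2(t))

rif_10h = rif(10.0)                        # short mission
rif_at_mttf = rif(1.0 / lam)               # mission as long as one MTTF
mttf_gain = (3.0 / (2.0 * lam)) * lam      # MTTF(1oo2) / MTTF(1oo1)

print(f"RIF for 10 h:     {rif_10h:.1f}")
print(f"RIF for 1/lambda: {rif_at_mttf:.2f}")
print(f"MTTF gain:        {mttf_gain:.2f}")
```

The improvement shrinks quickly as the mission time approaches $1/\lambda$, which is the point the slide makes about 1oo2 without repair.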
25. Combinatorial 2 out of three system
E.g. three computers with majority voting.
[Figure: three units R1, R2, R3 feeding a 2/3 voter; work/fail diagram marking the combinations in which the system is ok: R1 good, R2 good, R3 good; R1 bad, R2 good, R3 good; R1 good, R2 bad, R3 good; R1 good, R2 good, R3 bad]
$R_{2oo3} = R_1 R_2 R_3 + (1 - R_1) R_2 R_3 + R_1 (1 - R_2) R_3 + R_1 R_2 (1 - R_3)$
with identical elements $R_1 = R_2 = R_3 = R$: $R_{2oo3} = 3R^2 - 2R^3$
with $R = e^{-\lambda t}$: $R_{2oo3} = 3 e^{-2 \lambda t} - 2 e^{-3 \lambda t}$
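26. 2 out of 3 without repair - combinatorial solution
[Figure: three units R1, R2, R3 feeding a 2/3 voter]
$R_{2oo3} = 3R^2 - 2R^3 = 3 e^{-2 \lambda t} - 2 e^{-3 \lambda t}$
$MTTF_{2oo3} = \int_0^\infty (3 e^{-2 \lambda t} - 2 e^{-3 \lambda t}) \, dt = \frac{5}{6 \lambda}$
[Figure: R(t) from 0 to 1 for 1oo1, 1oo2 and 2oo3]
RIF is below 1 when t > 0.7 MTTF!
2oo3 without repair is not interesting for long missions.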
27. General case: k out of N Redundancy (1)
K-out-of-N computer (KooN):
- N units perform the function in parallel.
- K fault-free units are necessary to achieve a correct result.
- N - K units are reserve units, but they can also participate in the function.
E.g.:
- aircraft with 8 engines: 6 are needed to accomplish the mission;
- voting in computers: if the output is obtained by voting among all N units, N = 2K - 1 (worst-case assumption: all faulty units fail in the same way).
28. What is better?
4 motors, three of which are sufficient to
accomplish the mission (fly 21 days, MTTF
10'000 h per motor)
12 motors, 8 of which are sufficient to
accomplish the mission (fly 21 days, MTTF 5'000
h per motor)
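29. General case: k out of N redundancy (2)
[Figure: example with N = 4, units R1 to R4, showing the cases "no fail", "one of N fail", "two of N fail", ..., "all fail"]
The cases are mutually exclusive and their probabilities sum to 1:
$R^N + \binom{N}{1} (1-R) R^{N-1} + \binom{N}{2} (1-R)^2 R^{N-2} + \dots + \binom{N}{i} (1-R)^i R^{N-i} + \dots + (1-R)^N = 1$
With K the number of units that must work, the system survives as long as at most N - K units fail:
$R_{KooN} = \sum_{i=0}^{N-K} \binom{N}{i} (1-R)^i R^{N-i}$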
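The KooN sum can be evaluated directly. The sketch below is one way to do it in Python, assuming identical, independent units with constant failure rate and a mission of 21 days (504 hours); it also applies the formula to the 4-motor and 12-motor configurations of the "What is better?" slide. The function and variable names are illustrative assumptions.

```python
from math import comb, exp

def r_koon(k: int, n: int, r: float) -> float:
    """Reliability of a K-out-of-N system of identical, independent units.

    The system works as long as at most n - k units have failed.
    """
    return sum(comb(n, i) * (1.0 - r) ** i * r ** (n - i) for i in range(n - k + 1))

# Consistency checks against the closed forms for 1oo2 and 2oo3
r = 0.9
print(r_koon(1, 2, r), 2 * r - r ** 2)
print(r_koon(2, 3, r), 3 * r ** 2 - 2 * r ** 3)

# "What is better?": mission of 21 days
mission_h = 21 * 24
r_3oo4 = r_koon(3, 4, exp(-mission_h / 10000.0))   # 4 motors, MTTF 10'000 h each
r_8oo12 = r_koon(8, 12, exp(-mission_h / 5000.0))  # 12 motors, MTTF 5'000 h each
print(f"3oo4:  {r_3oo4:.4f}")
print(f"8oo12: {r_8oo12:.4f}")
```

Under these assumptions the configuration with more, less reliable motors comes out ahead, because it tolerates four failures instead of one.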
30. Comparison chart
[Figure: R(t) for 1oo1, 1oo2, 2oo3, 1oo4, 2oo4, 3oo4 and 8oo12; vertical axis R from 0 to 1, horizontal axis t from 0 to 2.]
31. What does cross redundancy bring?
[Figure: reliability chains of controllers and networks]
- Separate chains (each controller with its own network): any double fault brings the system down.
- Cross-coupling (each controller connected to both networks): better in principle, since some double faults can be outlived.
- But cross-coupling needs a switchover logic (UL), so availability sinks again.
32. Summary
Assumes that all units have identical failure rates and that the comparison/voting hardware does not fail.
- 1oo1 (non redundant): $R_{1oo1} = R$
- 1oo2 (duplication and error detection): $R_{1oo2} = 2R - R^2$
- 2oo3 (triplication and voting): $R_{2oo3} = 3R^2 - 2R^3$
- KooN (K out of N must work): $R_{KooN} = \sum_{i=K}^{N} \binom{N}{i} R^i (1-R)^{N-i}$
33. Exercise: 2oo3 considering voter unreliability
Compute the MTTF of the following 2-out-of-3 system with the component failure rates:
- redundant units: $\lambda_1$ = 0.1 h$^{-1}$
- voter unit: $\lambda_2$ = 0.001 h$^{-1}$
[Figure: input -> three redundant units R1 in parallel -> 2/3 voter R2 -> output]
34. Complex systems
[Figure: reliability block diagram combining series and parallel blocks R1 to R9]
Reliability is dominated by the non-redundant parts; as a first approximation, the redundant parts can be neglected.
35. Exercise: Reliability of Fault-Tolerant Structures
Assume that all units below have a constant failure rate $\lambda$.
Compute the reliability functions (and the MTTF) for the following structures, assuming perfect ($\lambda_p = 0$) voters, error detection, reconfiguration circuits etc.:
- a) non-redundant
- b) 1/2 system
- c) 2/3 system
- d) Draw all functions in a common coordinate system.
- e) For a railway signalling system, which structure is preferable?
- f) Is the answer different for a space application with a given mission time? Why?
36. 9.2.3 Considering repair
37. Repair
Fault-tolerance does not improve reliability under all circumstances. It is a solution for short mission durations.
Solution: repair (preventive maintenance, off-line repair, on-line repair).
Examples:
- short mission time, high MTTF: pilot, co-pilot
- long mission time, low MTTF: how to reach the stars? (hibernation, reproduction in space)
Problems:
- exchange of faulty parts during operation (safety!)
- reintegration of new parts, teaching and synchronization
38. Preventive maintenance
[Figure: R(t), restored to 1 at each preventive maintenance; MTBPM = Mean Time Between Preventive Maintenance]
Preventive maintenance reduces the probability of failure, but does not prevent it.
In systems with wear, preventive maintenance prevents aging (e.g. replacing oil and filters).
Preventive maintenance is a regenerative process (maintained parts are as good as new).
39. Considering Repair
Beyond combinatorial reliability, more suitable tools are required. The basic tool is the Markov chain (or Markov process).
40. 9.2.4 Markov models
41. Markov
Describe the system through states, with transitions depending on fault-relevant events.
States must be:
- mutually exclusive,
- collectively exhaustive.
Let $p_i(t)$ be the probability of being in state $S_i$ at time t; then $\sum_{\text{all states}} p_i(t) = 1$.
The probability of leaving a state depends only on the current state (it is independent of how much time was spent in the state and of how the state was reached).
Example: protection failure
[State diagram: OK (normal) -> PD (protection not working) with protection failure rate $\lambda$; PD -> OK with repair rate $\mu$; PD -> DG (danger) when lightning strikes, rate $\sigma$; in state OK, lightning strikes (rate $\sigma$) are not dangerous.]
What is the probability that the protection is down when lightning strikes?
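42. Continuous Markov Chains
[State diagram: State 1 (P1) -> State 2 (P2) with rate $\lambda$; State 2 -> State 1 with rate $\mu$]
Time is considered continuous. Instead of transition probabilities, the temporal behaviour is given by transition rates (i.e. transition probabilities per infinitesimal time step). A system remains in the same state unless it goes to a different state.
The relationships between the state probabilities are modelled by differential equations, e.g.
$dP_1/dt = \mu P_2 - \lambda P_1$
$dP_2/dt = \lambda P_1 - \mu P_2$
For any state i (inflow minus outflow):
$\frac{dp_i(t)}{dt} = \sum_k \lambda_k \, p_k(t) - \sum \lambda_i \, p_i(t)$
where the $\lambda_k$ are the rates of the transitions entering state i from the other states k, and the $\lambda_i$ are the rates of the transitions leaving state i.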
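The two-state equations can be integrated numerically to see how the state probabilities settle. The following Python sketch uses a plain explicit Euler scheme, which is an assumption of this illustration (any ODE solver would do), with arbitrary example rates $\lambda$ = 0.1 and $\mu$ = 1.0 that are not taken from the slides.

```python
def simulate_two_state(lam: float, mu: float, t_end: float, dt: float = 0.001):
    """Integrate dP1/dt = mu*P2 - lam*P1 and dP2/dt = lam*P1 - mu*P2.

    Explicit Euler integration, starting in state 1 (P1 = 1, P2 = 0).
    """
    p1, p2 = 1.0, 0.0
    for _ in range(int(round(t_end / dt))):
        flow_1_to_2 = lam * p1 * dt   # probability mass leaving state 1
        flow_2_to_1 = mu * p2 * dt    # probability mass leaving state 2
        p1 += flow_2_to_1 - flow_1_to_2
        p2 += flow_1_to_2 - flow_2_to_1
    return p1, p2

lam, mu = 0.1, 1.0                     # example rates (assumed values)
p1, p2 = simulate_two_state(lam, mu, t_end=20.0)

print(f"P1 = {p1:.5f}, P2 = {p2:.5f}")
print(f"steady state mu / (lam + mu) = {mu / (lam + mu):.5f}")
```

After a few multiples of $1/(\lambda + \mu)$ the probabilities stop changing: inflow and outflow balance, which is the stationary condition used later for availability.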
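43. Markov - hydraulic analogy
[Figure: states drawn as water tanks P1 to P4; flows into P2 with rates $\lambda_{12}$, $\lambda_{32}$, $\lambda_{42}$, and a pump with rate $\mu$. Detail: state S1 with level $p_1(t)$ receives flows from other states (rates $\lambda_i$) and the pumped flow $\mu \, p_2(t)$, and drains into state S2 (level $p_2(t)$) with rate $\lambda_{12}$.]
Output flow = probability of being in a state P × output rate of the state.
Simplification: the output rates $\lambda_j$ are constant (not a critical simplification).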
44. Reliability expressed as state transition
One element:
[State diagram: good (P0) -> fail (P1) with rate $\lambda(t)$]
$dp_0/dt = -\lambda \, p_0$
$dp_1/dt = +\lambda \, p_0$
$R(t) = p_0(t) = e^{-\lambda t}$, with $R(t_0) = 1$
Arbitrary transitions:
[State diagram: non-terminal states (ok, up1, up2, down) and terminal states (fail1, fail2)]
$R(t) = 1 - (p_{fail1} + p_{fail2})$
45. Reliability and Availability expressed in Markov
Reliability:
[State diagram: good -> bad with failure rate $\lambda(t)$; timeline: the state is good until the failure and bad afterwards; the mean duration of the good phase is the MTTF.]
Definition: "probability that an item will perform its required function in the specified manner and under specified or assumed conditions over a given time period"
Availability:
[State diagram: up -> down with failure rate $\lambda$; down -> up with repair rate $\mu$; timeline: the state alternates between up and down; the mean duration of a down phase until repair is the MDT.]
Definition: "probability that an item will perform its required function in the specified manner and under specified or assumed conditions at a given time"
46. Reliable systems have absorbing states: they may include repair, but eventually they will fail.
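47. Redundancy calculation with Markov: 1 out of 2 (no repair)
[State diagram: P0 (good) -> P1 with rate $2\lambda$; P1 -> P2 (fail) with rate $\lambda$; $\lambda$ constant]
What is the probability that the system is in state S0 or S1 until time t?
Linear differential equations with initial conditions:
$dp_0/dt = -2\lambda \, p_0$, with $p_0(0) = 1$ (initially good)
$dp_1/dt = 2\lambda \, p_0 - \lambda \, p_1$, with $p_1(0) = 0$
$dp_2/dt = \lambda \, p_1$, with $p_2(0) = 0$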
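48. Reliable 1-out-of-2 with on-line repair (1oo2)
[State diagram: P0 (good) -> P1 (S1: on-line unit failed) with rate $\lambda_n$; P0 -> P2 (S2: back-up unit failed) with rate $\lambda_b$; P1 -> P0 with repair rate $\mu_n$; P2 -> P0 with repair rate $\mu_b$; P1 -> P3 (fail) when the back-up also fails, rate $\lambda_b$; P2 -> P3 when the on-line unit fails, rate $\lambda_n$]
With $\lambda_n = \lambda_b = \lambda$ and $\mu_n = \mu_b = \mu$:
$dp_0/dt = -2\lambda \, p_0 + \mu \, p_1 + \mu \, p_2$
$dp_1/dt = \lambda \, p_0 - (\lambda + \mu) \, p_1$
$dp_2/dt = \lambda \, p_0 - (\lambda + \mu) \, p_2$
$dp_3/dt = \lambda \, p_1 + \lambda \, p_2$
This is equivalent to:
[State diagram: P0 -> P12 with rate $2\lambda$; P12 -> P0 with rate $\mu$; P12 -> P3 (fail) with rate $\lambda$]
$dp_0/dt = -2\lambda \, p_0 + \mu \, (p_1 + p_2)$
$dp_{12}/dt = 2\lambda \, p_0 - (\lambda + \mu) \, p_{12}$
$dp_3/dt = \lambda \, (p_1 + p_2)$
with $p_{12} = p_1 + p_2$.
It is easier to model with a repair team for each failed unit (no serialization of repair).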
49. Reliable 1-out-of-2 with on-line repair (1oo2)
What is the probability that a system fails while one failed element awaits repair?
[State diagram: P0 -> P1 with failure rate $2\lambda$; P1 -> P0 with repair rate $\mu$; P1 -> P2 (absorbing state) with failure rate $\lambda$]
Linear differential equations with initial conditions:
$dp_0/dt = -2\lambda \, p_0 + \mu \, p_1$, with $p_0(0) = 1$ (initially good)
$dp_1/dt = 2\lambda \, p_0 - (\lambda + \mu) \, p_1$, with $p_1(0) = 0$
$dp_2/dt = \lambda \, p_1$, with $p_2(0) = 0$
Ultimately, the absorbing states will be filled and the non-absorbing states will be empty.
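50. Results: reliability R(t) of 1oo2 with repair rate µ
$R(t) = P_0 + P_1 = \frac{(3\lambda + \mu) + W}{2W} \, e^{-\frac{1}{2}(3\lambda + \mu - W) t} - \frac{(3\lambda + \mu) - W}{2W} \, e^{-\frac{1}{2}(3\lambda + \mu + W) t}$
with $W = \sqrt{\lambda^2 + 6 \lambda \mu + \mu^2}$
[Figure: R(t) from 0 to 1 versus time in hours for $\lambda$ = 0.01: 1oo2 without repair, and 1oo2 with repair rates $\mu$ = 0.1 h$^{-1}$, 1.0 h$^{-1}$ and 10 h$^{-1}$]
Notes: short mission times are not considered; repair does not interrupt the mission.
R(t) is accurate, but not very helpful: the MTTF is a better index for long mission times.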
51. Mean Time To Fail (MTTF)
[State diagram: states P0 to P4, split into non-absorbing states i and absorbing states j]
$MTTF = \int_0^\infty \sum_{\text{non-absorbing states } i} p_i(t) \, dt$
[Figure: R(t) from 1.0 to 0 versus time (0 to 14); the MTTF is the area below the curve.]
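52. MTTF calculation in Laplace (example 1oo2)
Laplace transform of the differential equations, with the initial condition $p_0(t_0) = 1$ (initially good):
$s P_0(s) - p_0(t_0) = -2\lambda \, P_0(s) + \mu \, P_1(s)$
$s P_1(s) - 0 = 2\lambda \, P_0(s) - (\lambda + \mu) \, P_1(s)$
$s P_2(s) - 0 = \lambda \, P_1(s)$
Apply the boundary theorem:
$\lim_{t \to \infty} \int_0^t p(\tau) \, d\tau = \lim_{s \to 0} P(s)$
Only include the non-absorbing states (number of equations = number of non-absorbing states). With s = 0:
$-1 = -2\lambda \, P_0 + \mu \, P_1$
$0 = 2\lambda \, P_0 - (\lambda + \mu) \, P_1$
Solution of the linear equation system:
$MTTF = P_0 + P_1 = \frac{\mu + \lambda}{2 \lambda^2} + \frac{1}{\lambda} = \frac{\mu/\lambda + 3}{2 \lambda}$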
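The two-equation system for the 1oo2 example is small enough to solve by hand, but a short Python sketch shows the mechanics and lets one check the closed form $(\mu/\lambda + 3)/(2\lambda)$. The function name and the test values $\lambda$ = 0.01, $\mu$ = 1 are assumptions for this illustration; the solution uses Cramer's rule for the 2 x 2 case.

```python
def mttf_1oo2_with_repair(lam: float, mu: float) -> float:
    """MTTF of 1oo2 with on-line repair from the s = 0 equations.

    Solves   -1 = -2*lam*P0 + mu*P1
              0 =  2*lam*P0 - (lam + mu)*P1
    with Cramer's rule and returns P0 + P1.
    """
    a11, a12 = -2.0 * lam, mu
    a21, a22 = 2.0 * lam, -(lam + mu)
    b1, b2 = -1.0, 0.0
    det = a11 * a22 - a12 * a21
    p0 = (b1 * a22 - a12 * b2) / det   # time integral of state 0
    p1 = (a11 * b2 - b1 * a21) / det   # time integral of state 1
    return p0 + p1

lam, mu = 0.01, 1.0                    # example values (assumed)
mttf = mttf_1oo2_with_repair(lam, mu)
closed_form = (mu / lam + 3.0) / (2.0 * lam)

print(f"MTTF from linear system: {mttf:.1f} h")
print(f"MTTF from closed form:   {closed_form:.1f} h")
print(f"MTTF without repair:     {3.0 / (2.0 * lam):.1f} h")
```

Setting mu to 0 in the same function reproduces the no-repair value 3/(2 lambda), which is a convenient sanity check on the equations.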
53. General equation for calculating MTTF
1) Set up the differential equations.
2) Identify the terminal (absorbing) states.
3) Set up the Laplace transform for the non-absorbing states: $(-1, 0, 0, \dots)^T = M \cdot P_{na}$. The degree of the equation system is equal to the number of non-absorbing states.
4) Solve the linear equation system.
5) The MTTF of the system is equal to the sum of the non-absorbing state integrals.
6) To compute the probability of not entering a certain state, assign a dummy (very low) repair rate to all other absorbing states and recalculate the matrix.
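54. Example: 1oo2 control computer in stand-by
[Figure: the input feeds an on-line computer (failure rate $\lambda_w$) and a stand-by computer (failure rate $\lambda_s$, idle); each has error detection (E D); both drive one output.]
- The repair rate $\mu$ is the same for both.
- Error detection (also of idle parts) has coverage c.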
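55. Correct diagram for 1oo2
Consider that the failure rate $\lambda$ of a device in a 1oo2 system is divided into two failure rates:
1) a benign failure, immediately discovered, with probability c:
- if the device is on-line, switchover to the stand-by device is successful and repair is called;
- if the device is on stand-by, repair is called;
2) a malicious failure, which is not discovered, with probability (1-c):
- if the device is on-line, switchover to the stand-by device fails and the system fails;
- if the device is on stand-by, switchover will be unsuccessful should the on-line device fail.
States:
- P0: both devices ok
- P1: on-line fails, fault detected (successful switchover and repair), or stand-by fails, fault detected, successful repair
- P2: stand-by fails, fault not detected
- P3: both fail, system down (absorbing state)
[State diagram: P0 -> P1 with rate $(\lambda_w + \lambda_s) \, c$; P0 -> P2 with rate $\lambda_s (1-c)$; P0 -> P3 with rate $\lambda_w (1-c)$; P1 -> P0 with rate $\mu$; P1 -> P3 and P2 -> P3 with the failure rate of the remaining device ($\lambda_s$, $\lambda_w$)]
With $\lambda_w = \lambda_s = \lambda$:
$-1 = -2\lambda \, P_0 + \mu \, P_1$
$0 = 2\lambda c \, P_0 - (\lambda + \mu) \, P_1$
$0 = \lambda (1-c) \, P_0 - \lambda \, P_2$
$MTTF = \frac{(2 + c) + (\mu/\lambda)(2 - c)}{2 \, (\lambda + \mu (1 - c))}$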
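56. Approximation found in the literature
This simplified diagram considers that the undetected failure of the spare immediately causes a system failure (simplified with $\lambda_w = \lambda_s = \lambda$).
[State diagram: P0 -> P1 with rate $2\lambda c$; P1 -> P0 with rate $\mu$; P0 -> P3 (absorbing state) with rate $2\lambda (1-c)$; P1 -> P3 with rate $\lambda$]
Applying Markov:
$-1 = -2\lambda \, P_0 + \mu \, P_1$
$0 = 2\lambda c \, P_0 - (\lambda + \mu) \, P_1$
(The absorbing state, $dp_3/dt = 2\lambda (1-c) \, p_0 + \lambda \, p_1$, is not part of the equation system.)
$MTTF = \frac{(1 + 2c) + \mu/\lambda}{2 \, (\lambda + \mu (1 - c))}$
The results are nearly the same as with the previous four-state model, showing that state 2 has a very short duration.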
57. Influence of coverage (2)
Example: $\lambda = 10^{-5}$ h$^{-1}$ (MTTF = 11.4 years), $\mu$ = 1 h$^{-1}$. MTTF with perfect coverage = 570468 years.
[Figure: MTTF(c) in years (0 to 600000) versus coverage c = 1.000000, 0.999999, 0.999990, 0.999900, 0.999000, 0.990000, 0.900000, 0.600000, 0.000000]
When coverage falls below 60 %, the redundant (1oo2) system performs no better than a simplex one!
Therefore, coverage is a critical success factor for redundant systems! In particular, redundancy is useless if the failure of the spare remains undetected (lurking error).
Limits:
$\lim_{\lambda/\mu \to 0} MTTF = \frac{1}{2 \lambda (1 - c)}$
for perfect coverage (c = 1): $MTTF = \frac{1}{2\lambda} \left( 3 + \frac{\mu}{\lambda} \right)$
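The sensitivity to coverage is easy to tabulate. The sketch below evaluates the MTTF expression of the simplified three-state model, $((1 + 2c) + \mu/\lambda) / (2(\lambda + \mu(1 - c)))$, for the example values $\lambda = 10^{-5}$ h$^{-1}$ and $\mu$ = 1 h$^{-1}$. The function name, the list of coverage values and the use of 8765 hours per year (the figure that appears in the course's availability examples) are assumptions of this illustration.

```python
HOURS_PER_YEAR = 8765.0   # assumption: same year length as in the MTBF examples

def mttf_1oo2_coverage(lam: float, mu: float, c: float) -> float:
    """MTTF in hours of 1oo2 with repair rate mu and coverage c (three-state model)."""
    return ((1.0 + 2.0 * c) + mu / lam) / (2.0 * (lam + mu * (1.0 - c)))

lam, mu = 1e-5, 1.0
simplex_years = 1.0 / lam / HOURS_PER_YEAR

print(f"simplex: {simplex_years:.1f} years")
for c in (1.0, 0.999999, 0.9999, 0.99, 0.9):
    years = mttf_1oo2_coverage(lam, mu, c) / HOURS_PER_YEAR
    print(f"coverage {c:.6f}: {years:12.1f} years")
```

Even a coverage of 99.99 % gives an MTTF several times lower than perfect coverage, because the term $\mu (1 - c)$ quickly dominates the denominator when $\mu$ is much larger than $\lambda$.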
58. Application: 1oo2 for drive-by-wire
[Figure: two controllers, each with a control unit and a self-check unit, with outputs a1 and a2]
- Coverage is assumed to be the probability that the self-check detects an error in the controller.
- When the self-check detects an error, it passivates the controller (the output is disconnected) and the other controller takes control.
- One assumes that an accident occurs if both controllers act differently, i.e. if a computer does not fail to silent behaviour.
- Self-check is not instantaneous, and there is a probability that the self-check logic is not operational and fails in underfunction (overfunction is an availability issue).
59. Results: 1oo2c, applied to drive-by-wire
- $\lambda$: failure rate of one chain (sensor to brake) = $10^{-5}$ h$^{-1}$ (MTTF = 10 years)
- c: coverage, variable (expressed as uncoverage: "3 nines" = 99.9 % detected)
- $\mu$: repair rate, parameter:
  - 1 second: reboot and restart
  - 6 minutes: go to the side and stop
  - 30 minutes: go to the next garage
[Figure: log(MTTF), from 0 to 16, versus uncoverage in number of nines (1 = poor to 10 = excellent) for repair times of 1 second, 6 minutes and 30 minutes; marked: 0.1 % undetected, and 1 Mio years (or once per year on a million vehicles).]
Conclusion: the repair interval does not matter when coverage is poor.
60. Protection system (general)
In protection systems, the dangerous situation occurs when the plant is threatened (e.g. by a short circuit) and the protection device is unable to respond. The threat is a stochastic event; therefore it can be treated as a failure event.
[State diagram: OK (normal) -> PD (protection down) with protection failure rate $\lambda$; PD -> OK with rate $\mu$ (detection and repair); PD -> DG (danger) with rate $\sigma$ (threat to plant); in state OK, a threat to the plant (rate $\sigma$) is not dangerous.]
The repair rate $\mu$ includes the detection time! This directly impacts the maintenance rate. What is an acceptable repair interval?
Note: another way to express the reliability of a protection system will be shown under availability.
61. Protection system: how to compute test intervals
Rates:
- $\lambda_1$: overfunction of protection (immediate)
- $\lambda_2$: lurking overfunction
- $\lambda_3$: lurking underfunction
- $\sigma$: plant suffers attack (threat)
- $\tau$: test rate (e.g. 1 / 6 months)
- $\mu$: repair rate (e.g. 1 / 8 hours)
[State diagram: states P0 (normal) to P6 (danger: protection failed by underfunction, fail-to-trip). It includes the plant-down states for a single fault (immediate overfunction) and for a double fault (protection failed by lurking overfunction: unwanted trip at the next attack), the lurking overfunction and lurking underfunction states, which are left when the error is detected at the test rate, and the unavailable states, which are repaired with rate $\mu$. A second plant threat (rate $\sigma_2$) is unlikely.]
Since there exist back-up protection systems, utilities are more concerned by non-productive states.
62. 9.2.5 Availability evaluation
63. Availability
[State diagram: up -> down with failure rate $\lambda$; down -> up with repair rate $\mu$; timeline alternating between up and down phases]
Availability expresses how often a piece of repairable equipment is functioning; it depends on the failure rate $\lambda$ and the repair rate $\mu$.
- Punctual availability: probability that the system is working at time t (not relevant for most processes).
- Stationary availability: duty cycle (impacts financial results).
$A_\infty = \text{availability} = \lim_{t \to \infty} \frac{\sum \text{up times}}{\sum (\text{up times} + \text{down times})}$
Unavailability is the complement of availability (U = 1.0 - A) and is a convenient expression (e.g. 5 minutes of downtime per year corresponds to an availability of 99.999 %).
64. Assumption behind the model: renewable system
$R(t) \neq A(t)$ due to repair or preventive maintenance (exchange of parts that did not yet fail).
After repair, the system is as good as new.
[Figure: A(t) between 0 and 1 over the lifetime, compared with the stationary availability A]
65. Examples of availability requirements

Application             Requirement                        Downtime
substation automation   availability > 99.95 %             4 hours per year
telecom power supply    unavailability $5 \times 10^{-7}$  15 seconds per year
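66. Availability expressed in Markov states
[State diagram: states P0 to P4, split into up states i and down states j (non-absorbing)]
Availability = $\sum_{\text{up states } i} p_i(t \to \infty)$
Unavailability = $\sum_{\text{down states } j} p_j(t \to \infty)$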
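67. Availability of repairable system
[Markov states: P0 (up) -> P1 with rate $\lambda$; P1 -> P0 with rate $\mu$; P1 is a down state (but not absorbing)]
$dp_0/dt = -\lambda \, p_0 + \mu \, p_1$
$dp_1/dt = +\lambda \, p_0 - \mu \, p_1$
Stationary state ($t \to \infty$): $dp_0/dt = dp_1/dt = 0$. Due to the linear dependency, add the condition $p_0 + p_1 = 1$.
$A = \frac{1}{1 + \lambda/\mu}$
unavailability $U = 1 - A = \frac{1}{1 + \mu/\lambda}$
e.g. MTBF = 100 years -> $\lambda$ = 1 / (100 × 8765) h$^{-1}$; MTTR = 72 h -> $\mu$ = 1/72 h$^{-1}$; -> A = 99.991 %, U = 43 min / year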
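The numerical example can be checked with a few lines of Python. The function name is an assumption; the year length of 8765 hours, the MTBF of 100 years and the MTTR of 72 hours are the values used in the example.

```python
HOURS_PER_YEAR = 8765.0   # year length used in the example

def availability(lam: float, mu: float) -> float:
    """Stationary availability of a single repairable unit: A = mu / (lam + mu)."""
    return mu / (lam + mu)

lam = 1.0 / (100.0 * HOURS_PER_YEAR)   # MTBF of 100 years
mu = 1.0 / 72.0                        # MTTR of 72 hours

a = availability(lam, mu)
u = 1.0 - a
downtime_min_per_year = u * HOURS_PER_YEAR * 60.0

print(f"A = {100.0 * a:.4f} %")
print(f"U = {downtime_min_per_year:.1f} minutes per year")
```

Expressing U as downtime per year is usually easier to communicate than an availability with many nines.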
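68. Example: Availability of 1oo2 (1 out-of-2)
[Markov states: P0 -> P1 with rate $2\lambda$; P1 -> P2 with rate $\lambda$; P1 -> P0 with rate $\mu$; P2 -> P1 with rate $2\mu$; P2 is a down state (but not absorbing)]
Assumption: devices can be repaired independently (little impact when $\lambda \ll \mu$).
$dp_0/dt = -2\lambda \, p_0 + \mu \, p_1$
$dp_1/dt = +2\lambda \, p_0 - (\lambda + \mu) \, p_1 + 2\mu \, p_2$
$dp_2/dt = +\lambda \, p_1 - 2\mu \, p_2$
Stationary state ($t \to \infty$): $dp_0/dt = dp_1/dt = dp_2/dt = 0$. Due to the linear dependency, add the condition $p_0 + p_1 + p_2 = 1$.
unavailability $U = 1 - A = p_2 = \frac{\lambda^2}{(\lambda + \mu)^2} = \frac{1}{(\mu/\lambda)^2 + 2(\mu/\lambda) + 1}$, which tends to $(\lambda/\mu)^2$ for $U \ll 1$
e.g. MTBF = 100 years -> $\lambda$ = 1 / (100 × 8765) h$^{-1}$; MTTR = 72 h -> $\mu$ = 1/72 h$^{-1}$; -> A = 99.9999993 %, U = 0.2 s / year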
69. Availability calculation
1) Set up the differential equations for all states.
2) Identify the up and down states (no absorbing states allowed!).
3) Remove one state equation (arbitrary; for numerical reasons, take an unlikely state).
4) Add as first equation the precondition $1 = \sum p$ (all states): $(1, 0, 0, \dots)^T = M \cdot P_{all}$
5) The degree of the equation system is equal to the number of states.
6) Solve the linear equation system, yielding the percentage of time spent in each state.
7) The unavailability is equal to the sum of the down states.
We do not use Laplace for calculating the availability!
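The procedure above can be sketched in Python for the three-state 1oo2 availability model with independent repair (rates $2\lambda$, $\lambda$ forward and $\mu$, $2\mu$ backward). The small Gaussian-elimination helper and all names are assumptions of this illustration, written with the standard library only; a numerical library would normally be used instead.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                        # back substitution
        known = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - known) / m[r][r]
    return x

def stationary_1oo2(lam, mu):
    """Stationary probabilities (p0, p1, p2) of the 1oo2 availability model."""
    matrix = [
        [1.0, 1.0, 1.0],                  # precondition: p0 + p1 + p2 = 1
        [-2.0 * lam, mu, 0.0],            # dp0/dt = 0
        [2.0 * lam, -(lam + mu), 2.0 * mu],   # dp1/dt = 0 (equation of p2 removed)
    ]
    return solve_linear(matrix, [1.0, 0.0, 0.0])

HOURS_PER_YEAR = 8765.0
lam = 1.0 / (100.0 * HOURS_PER_YEAR)      # MTBF of 100 years
mu = 1.0 / 72.0                           # MTTR of 72 hours
p0, p1, p2 = stationary_1oo2(lam, mu)
downtime_s_per_year = p2 * HOURS_PER_YEAR * 3600.0

print(f"U = p2 = {p2:.3e}, i.e. {downtime_s_per_year:.2f} s per year")
```

The removed equation is the one of the down state, following the hint to drop an unlikely state; the same helper can be reused for larger models by extending the matrix.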
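70. 1oo2 including coverage
[Markov states: P0 -> P1 with rate $2\lambda c$; P0 -> P2 with rate $2\lambda (1-c)$; P1 -> P2 with rate $\lambda$; P1 -> P0 with rate $\mu$; P2 -> P1 with rate $2\mu$; P2 is a down state (but not absorbing)]
Assumption: devices can be repaired independently (little impact when $\lambda \ll \mu$).
$dp_0/dt = -2\lambda \, p_0 + \mu \, p_1$
$dp_1/dt = +2\lambda c \, p_0 - (\lambda + \mu) \, p_1 + 2\mu \, p_2$
$dp_2/dt = +2\lambda (1-c) \, p_0 + \lambda \, p_1 - 2\mu \, p_2$
Stationary state ($t \to \infty$): $dp_0/dt = dp_1/dt = dp_2/dt = 0$. Due to the linear dependency, add the condition $p_0 + p_1 + p_2 = 1$.
unavailability $U = 1 - A = p_2 = \frac{(\lambda/\mu)^2 + (1-c)(\lambda/\mu)}{1 + (3 - c)(\lambda/\mu) + (\lambda/\mu)^2}$
For $\mu/\lambda \gg 1$: $U \approx (1-c)(\lambda/\mu) + (\lambda/\mu)^2$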
71. Exercise
A repairable system has a constant failure rate $\lambda = 10^{-4}$ / h. Its mean time to repair (MTTR) is one hour.
- a) Compute the mean time to failure (MTTF).
- b) Compute the MTBF and compare it with the MTTF.
- c) Compute the stationary availability.
Assume that the unavailability has to be halved. How can this be achieved
- d) by only changing the repair time?
- e) by only changing the failure rate?
- f) Make a drawing that shows how a varying repair time influences availability.
72. 9.2.6 Examples
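73. Exercise: Markov diagram
[State diagram: states 0 to 4 with failure transitions $\lambda_1$, $\lambda_n$, $\lambda_b$ and repair transitions $\mu_1$, $\mu_2$]
- Is this a reliable or an available system?
- Set up the differential equations for this Markov model.
- Compute the probability of not reaching state 4 (set up the equations).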
74. Case study: Swiss Locomotive 460 control system availability
[Figure: pairs of normal (member N) and reserve (member R) units connected through the MVB to the I/O system]
Assumption: each unit has a back-up unit, which is switched on when the on-line unit fails.
The error detection coverage c of each unit is imperfect.
The switchover is not always bumpless: when the back-up unit is not correctly updated, the main switch trips and the locomotive is stuck on the track.
What is the probability that the locomotive gets stuck on the track?
75. Markov model: SBB Locomotive 460 availability
[State diagram: from P0 (all OK), member N or member R fails with rate $\lambda$; the failure is detected with probability c or remains undetected with probability (1-c). After a detected failure of member N, the takeover by member R is bumpless (probability b), unsuccessful (probability s, stuck on track) or leads to a train stop and reboot (probability 1-s-b, reboot rate r). Failed members are repaired with rate $\mu$; undetected failures are found at the periodic maintenance (rate p); a further failure with rate $\lambda$ before repair or maintenance leaves the train stuck on the track.]
Parameters:
- $\lambda$ = $10^{-4}$ h$^{-1}$: failure rate of member N or member R (MTTF is 10000 hours or 1.2 years)
- $\mu$ = 0.1 h$^{-1}$: repair rate for member N or member R (repair takes 10 hours, including travel to the works)
- c = 0.9: probability of detected failure, coverage factor (9 out of 10 errors are detected)
- b = 0.9: probability of bumpless recovery, train continues (9 out of 10 take-overs are successful)
- s = 0.01: probability of unsuccessful recovery, train stuck (1 failure in 100 cannot be recovered)
- r = 10 h$^{-1}$: reboot rate (mean time to reboot and restart the train is 6 minutes)
- p = 1/8765 h$^{-1}$: periodic maintenance check (mean time to periodic maintenance is one year)
76. SBB Locomotive 460 results
How the down-time is shared:
- OK after reboot: 61 %
- stuck, 2nd failure before maintenance: 32 %
- stuck, unsuccessful recovery: 7 %
- stuck, 2nd failure before repair; stuck after reboot: 0.0009 and 0.00045
Under these conditions:
- unavailability will be 0.5 hours a year;
- stuck on track is once every 20 years;
- recovery will be successful 97 % of the time.
Recommendation: increase coverage by using members N and R alternately (at least at every start-up).
77. Example: protection device
[Figure: protection device with current sensor and circuit breaker]
78. Probability to Fail on Demand for safety (protection) system
IEC 61508 characterizes a protection device by its Probability to Fail on Demand (PFD):
PFD = 1 - availability of the non-faulty system (state 0)
[State diagram: P0 (good) -> P1 (underfunction) with rate $u \lambda$; P0 -> P3 (overfunction, plant down) with rate $(1-u) \lambda$; P1 -> P0 and P3 -> P0 with repair rate $\mu_R$; P1 -> P4 (plant damaged)]
u = probability of underfunction
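79. Protection system with error detection (self-test) 1oo1
- $\lambda$: protection failure rate
- u: probability of underfunction (IEC 61508: 50 %)
- c: coverage, probability of failure detection by self-check
States:
- P0: normal
- P1: protection failed in underfunction, failure detected by self-check (instantaneous), repaired with rate $\mu_R$ = 1/MRT
- P2: protection failed in underfunction, failure detected by periodic check with rate $\mu_T$ = 2/TestPeriod
- P3: protection failed in overfunction, plant down
- P4: system threatened, protection inactive, danger
[State diagram: P0 -> P1 with rate $u c \lambda$; P0 -> P2 with rate $u (1-c) \lambda$; P0 -> P3 with rate $\lambda (1-u)$; P1 -> P0 and P3 -> P0 with rate $\mu_R$; P2 -> P0 with rate $\mu_T$; P1, P2 -> P4 (danger)]
$PFD = 1 - P_0 = 1 - \frac{1}{1 + \frac{\lambda u (1-c)}{\mu_T} + \frac{\lambda u c}{\mu_R}} \approx \lambda u \left( \frac{1-c}{\mu_T} + \frac{c}{\mu_R} \right)$
with:
- $\lambda = 10^{-7}$ h$^{-1}$
- MTTR = 8 hours -> $\mu_R$ = 0.125 h$^{-1}$
- test period = 3 months -> $\mu_T$ = 2/4380 h$^{-1}$
- coverage = 90 %
PFD = $1.1 \times 10^{-5}$
For S1 and S2 to have the same probability: c = 99.8 %!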
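The PFD value of the example can be reproduced with a short Python sketch of the expression $PFD = 1 - 1 / (1 + \lambda u (1-c)/\mu_T + \lambda u c/\mu_R)$. The function and parameter names are assumptions for this illustration; u = 0.5 corresponds to the 50 % underfunction share mentioned for IEC 61508.

```python
def pfd_1oo1(lam: float, u: float, c: float, mu_r: float, mu_t: float) -> float:
    """Probability to fail on demand of a 1oo1 protection with self-test.

    lam:  protection failure rate (per hour)
    u:    probability that a failure is an underfunction
    c:    coverage of the self-check
    mu_r: repair rate after detection by self-check (per hour)
    mu_t: detection rate of the periodic test (per hour)
    """
    undetected = lam * u * (1.0 - c) / mu_t   # waiting for the periodic test
    detected = lam * u * c / mu_r             # waiting for repair
    return 1.0 - 1.0 / (1.0 + undetected + detected)

pfd = pfd_1oo1(lam=1e-7, u=0.5, c=0.9, mu_r=0.125, mu_t=2.0 / 4380.0)
pfd_better_coverage = pfd_1oo1(lam=1e-7, u=0.5, c=0.99, mu_r=0.125, mu_t=2.0 / 4380.0)

print(f"PFD with 90 % coverage: {pfd:.2e}")
print(f"PFD with 99 % coverage: {pfd_better_coverage:.2e}")
```

With 90 % coverage the undetected share, which waits for the periodic test, dominates the result; this is why improving the coverage or shortening the test period is more effective than speeding up repair.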
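80. Example: Protection System
[Figure: the inputs feed tripping algorithm 1 and tripping algorithm 2, whose outputs are combined into the trip signal]
- Overfunctions reduced: $P_{over} = P_o^2$
- Underfunctions increased: $P_{under} = 2 P_u - P_u^2$
[Figure: the inputs feed tripping algorithm 1, which produces the trip signal, and tripping algorithm 2; a comparison of the two leads to repair]
- Dynamic modeling necessary.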
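81. Markov Model for a protection system
[State diagram: states OK; detectable error (1 chain, repair); latent overfunction (1 chain, not detectable); latent underfunction (1 chain and 2 chains, not detectable); overfunction; underfunction (not detectable). The transition rates combine the failure rates $\lambda_1$, $\lambda_2$, $\lambda_3$ with the coverage c or (1-c), e.g. $(\lambda_1 + \lambda_2 + \lambda_3) \, c$, $(\lambda_1 + \lambda_2)(1-c)$, $\lambda_1 (1-c)$, $\lambda_2 (1-c)$, $\lambda_3 (1-c)$; further rates are $\sigma_1$, $\sigma_2$ and the repair rate $\mu$.]
Parameters (in 1/year): $\lambda_1$ = 0.01, $\lambda_2 = \lambda_3$ = 0.025, $\sigma_1$ = 5, $\sigma_2$ = 1, $\mu$ = 365, c = 0.9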
82. Analysis Results
[Figure: mean time to underfunction in years (200 to 400) versus mean time to overfunction in years (50 to 5000) for: permanent comparison (SW; assumption: SW error-free), permanent comparison (redundant HW), weekly test, 2-yearly test.]
83. Example: CIGRE model of protection device with self-check
[State diagram: states S1 to S11, including PLANT DOWN (single fault), PLANT DOWN (double fault) and DANGER. The transitions carry the failure rates $\lambda_1$, $\lambda_2$, $\lambda_3$, $\lambda_{e1}$, $\lambda_{e2}$, the self-check coverage c and (1-c) for overfunction and underfunction, the rates $\sigma_1$, $\sigma_2$, $\delta_T$, $\delta_M$ and the repair rate $\mu$.]
Legend:
- P10, P11: failure detectable by self-check
- P4, P3: failure detectable by inspection
- P8, P9: error detection failed
84. Summary: difference reliability - availability
Reliability:
[State diagram: up states (good, ok, up) with transitions into absorbing fail states]
- look for the Mean Time To Fail (integral over time of all non-absorbing states)
- set up the linear equation system with s = 0 and the initial conditions S(t = 0) = 1.0
- solve the linear equation system
Availability:
[State diagram: up states and down states, none of them absorbing]
- look for the stationary availability A ($t \to \infty$) (duty cycle in the UP states)
- set up the differential equations (no absorbing states!)
- the initial condition is irrelevant
- solve the stationary case with $\sum p = 1$
85. Exercise: set up the Markov model for this system
A brake can fail open or fail closed. A car is unable to brake if both brakes fail open. A car is unable to cruise if any of the brakes fails closed. A fail-open brake is detected at the next service (rate $\mu$). There is a hydraulic and an electric brake.
- electric brake: $\lambda_e = 10^{-5}$ h$^{-1}$, $c_e$ = 0.9 (99 fail close)
- hydraulic brake: $\lambda_h = 10^{-6}$ h$^{-1}$, $c_h$ = 0.99 fail close (0.01 fail open)
- $\mu$: service every month