Title: Review last week
1Review last week
- Redundancy in hardware
- Passive (fault masking)
- TMR, NMR
- Active (fault detection)
- Stand-by systems (hot, warm, cold)
- Hybrid
- Self purging redundancy
- NMR with spare
- Redundancy in I/O
- RAID(0),1,2,3,4,5
- Redundancy in time (Re-execution)
- Processes, threads
- Superscalar, SMT, AR-stream
2Today
- Reliability of networks
- Software reliability
- Redundancy in software
3Reliability of Networks
- Based on graph theory nodes represent computers,
branches represent communication links - Simplest model assumes nodes do not fail but
links do - Path is a collection of branches that provide
communications between specific pair of nodes - In general we are interested in knowing
- RallP(all nodes are connected)
- RstP(nodes s and t are connected)
- RkP(k nodes are connected)
4Reliability of Networks
Simple state space enumeration
b
a
1
5
4
6
2
c
3
d
If all links are equal and pprob. of being
up qprob. being down
5An example
For this graph is easier to calculate all paths
that failed instead of listing the state space
- Given a pathset defined by reachability,
- RelAlice,Bob Prob(any path from Alice to Bob)
- 1-Prob(all paths failed)
- 1 (1 - .81)(1 - .81)
- .9639
6Primary Graph Reductions
- We can perform graph reductions to facilitate
calculation - Irrelevant do not contribute to any operational
state remove - Series sequence of edges are required
simultaneously combine with axiom of
probability - P(A?B) P(A)P(B)
- Parallel network is operational if any of these
edges are operational combine with axiom of
probability - P(A?B) P(A) P(B) P(A?B)
Sequential reduction
P(A?B) .81.81-(.81.81) .9639
Parallel reduction
7Reliability of Networks
b
a
- To improve network reliability we can increase p
or add more branches to the network - There are other more efficient methods to compute
network reliability - Factoring aggregate nodes to reduce size of
graph
1
5
4
6
2
c
3
d
8SW and Reliability
- If the software, hardware and human failures are
independent the reliability of a whole computer
system could be described as - Rsystem Rhw Rsw Roperator
- Often this is not the case
9SW and Reliability
- Is the nature of software probabilistic or
deterministic? - If it were possible to perform exhaustive testing
on all possible inputs, software could be
considered of a deterministic nature - Since exhaustive testing is impossible, the
probability gt 0, that a designer had not included
some combination of inputs during testing, makes
the nature of software probabilistic - There is still controversy about this issue
- Software reliability the probability of a
failure free operation over a given time
interval
10Software Engineering
- Software can be designed in many different ways
- constraints are few compared to designing
hardware - Area, power, technology, performance
- Systematic ways for software development have
been proposed - Software Engineering - A systematic approach to
the analysis, design, implementation and
maintenance of software
11SW Development cycle
- Software development is a lengthy, complex
process - Software development process consists of
- Requirements
- Specifications
- Design
- Coding
- Prototypes
- Testing
12SW Development cycle
- Models of software development
Waterfall
Spiral
13Software Engineering/Errors
- Usually in software engineering
- Software problemsoftware errorbug
- Software is normally developed by teams
- New bugs may be produced when integrating
- Coding is about 20 of total development effort,
testing may be 40 - Software errors can occur at
- Specification and requirements stage
- Design
- Program logic
- Software does not wear out as hardware does but
may become obsolete
14Regression testing
- Regression Testing
- is any type of software testing which seeks to
uncover regression bugs. - Regression bugs occur whenever software
functionality that previously worked as desired
stops working or no longer works in the same way
that was previously planned. - Typically regression bugs occur as an unintended
consequence of program changes. - Common methods of regression testing include
re-running previously run tests and checking
whether previously-fixed faults have reemerged. - Standard practice in SW development
- automated tools available (JUnit)
15Error Removal models
- t is the number of months of development time. At
t0 software contains Et errors, as testing
progresses Ec(t) errors are corrected - Er(t)Et-Ec(t)
With no new error generation
16Error Removal models
- Constant error removal rate ?0 errors/month
Er(t)Et-?0t
Et
Errors remaining Er(t)
Errors corrected Ec(t)
t
Constant error removal ?0 rate
17Error Removal models
- Linearly decreasing error-removal rate
dEr(t)/dt -(K1-K2t) as dEr(t)/dt -gt0 at tt0,
K2K1/t0 we get dEr(t)/dt -K1(1-t/t0)
integrating (with K1K) gives Er(t)C-Kt(1-
t/2t0) at t0 Er(t)EtC, therefore Er(t) Et
Kt(1-t/2t0)
Et
Errors remaining Er(t)
Errors corrected Ec(t)
t
Linearly decreasing rate
18Error Removal models
- Exponentially decreasing error-removal rate.
Predicts harder time in finding errors as program
is perfected
dEd(t)/dtadEr(t)/dt -gt assuming remaining
errors are proportional to Ed (errors detected)
with Ed(t)Ec(t) and
Er(t)Et-Ec(t) we get dEc(t)/dt aEt-Ec(t)
solving diff. equation gives Ec(t)Ae-at B with
initial conditions t0gtEc0 gives A-BBEtEc
when t-gt8 Ec Et(1- e-at substituting in
Er(t)Et-Ec(t) gives Er(t) Et e-at
Et
Errors remaining Er(t)
Errors corrected Ec(t)
t
Exponentially decreasing rate
19Software Reliability models
- A software reliability model should be used to
answer questions such as - When should we stop testing?
- Will the software work well and be considered
reliable? - These are issues that software management should
address
20Reliability models
- Constant error removal rate is the simplest
model, assuming failure rate is
z(t)kEr(t) and Er(t)Et-?0t (constant error
removal) z(t)k(Et-?0t) (hazard function or
failure rate)
For fixed values of debugging time t
t2gtt1most debugging
R(t)
t1gtt0medium debugging
t0least debugging
t1/?
t2/?
MTTF approaches infinity when errors (ß)-gt 0!!
The model does not reflect intuition. How to
obtain Et, k, ?0?
21Reliability models
- Exponential error removal rate (fixed values of
debugging time)
R(t)
8.5 months debugging
8 months debugging
.673
6 months debugging
Example start with Et130 errors at t0,
decreasing to 10 errors in t8 months Er(t8)10
errors130e-a8 gt a0.3206 if we require
R(t).673 at t300 hrs after t8 months
debugging we get R(300).673 gt k.000132
t300
This is a better model since agrees with intuition
22Reliability models
- Constants used (e.g. Et,k,a) in a software
reliability model can be estimated empirically by - Handbook (history of bugs in a company)
- Statistical estimations plotting error removal
rate versus t to see which model best fits data
23Reliability models
- Other models have been proposed
- Scheidewind
- Generalized exponential model
- Musa/Okumoto model
- Littlewood/Verral model
24Reliability models
- Reliability models are understood/used only in a
small sector in the software industry - Criticisms
- Some models assume the waterfall software
development model (test is only done in the final
stages) - Do not take into account the complexity of a
software system as a parameter in the reliability
function - Every application and test case is treated
uniformly
25TMR and software
- If a TMR system is implemented using the same
software in each component well have - RsysRTMR Rsoftware ? assuming independence of
Hw/Sw errors - System will be very dependent on software
reliability. We need independent versions of the
software to provide better reliability
26Software analogies to TMR
- N-version programming
- Idea same specification is implemented in a
number of different versions by different teams.
All versions compute simultaneously and the
majority output is selected using a voting
system. - Airbus commercial aircraft uses this technique.
- Recovery blocks
- A number of explicitly different versions of the
same specification are written and executed in
sequence - An acceptance test is used to select the output
to be transmitted
27N-version programming
Assuming version independence
28Output comparison
- As in hardware systems, ideally the output
comparator is a simple piece of software that
uses a voting mechanism to select the output. - Note in real-time systems, there may be a
requirement that the results from the different
versions are all produced within a certain time
frame.
29N-version programming
- Different system versions are designed and
implemented by different teams. - Assumes low probability for the case teams make
the same mistakes.
30Design diversity
- Some problems with design diversity
- Empirical evidence suggests that teams not
culturally diverse tend to tackle problems in the
same way. - Different teams make the same mistakes. Some
parts of an implementation are more difficult
than others so all teams tend to make mistakes in
the same place - Specification errors
- Errors will be reflected in all implementations
- This can be addressed to some extent by using
multiple specification representations.
31N-version programming
Considering common mode dependencies in
requirements and programming
Pi1-prob indep-mode-sw fault
Pcmr1-prob common-mode-req fault Effect of
identical misinterpretation of requirements
pcmr
pcms
Pcms1-prob common-mode-sw fault Effect of
identical or equivalent incorrect designs for
different portions of the problem
pi
32NMR HW/SW
- Space shuttle computer system
Software A
Hardware A
input
Software A
Hardware B
voter
output
primary
Software A
Hardware C
Software A
Hardware D
Hardware E
Software B
backup
33Fault Tolerant Techniques in Software
- Check points and roll backs
- Applications state saved at checkpoint. Roll
back restarts execution from a previous
checkpoint - Recovery Blocks
- Alternates - secondary modules that perform same
function of a primary module - are executed when
primary fails to pass an acceptance test
34Fault Tolerant Techniques in Software
- Check points and roll backs
- Reboot/restart (human) simplest but weakest
since information may be lost - Recovery (reboot initiated automatically by the
system) - Journaling (stores all transactions)
- Retry (immediately after error detection)
- Checkpoints (state is saved only at specific
points)
35Recovery blocks
36Recovery blocks
- Force different algorithms to be used in each
version - idea is to reduce probability of common errors
- Design of the acceptance test is difficult as it
must be independent of the computation used - This approach may not be applicable in real-time
systems - sequential execution of redundant versions