Title: SENG 521 Software Reliability
1SENG 521 Software Reliability & Testing
- Overview of Software Reliability Engineering
Department of Electrical & Computer Engineering, University of Calgary
B.H. Far (far_at_enel.ucalgary.ca)
http://www.enel.ucalgary.ca/far/Lectures/SENG521/01/
2Contents
- About this course.
- What is software reliability?
- What factors affect software quality?
- What is software reliability engineering?
- Software reliability engineering process.
3Section 1
- Basic Concepts
- Definitions
4Realities
- Software development is a very high risk task.
- About 20% of software projects are canceled (missed schedules, etc.).
- About 84% of software projects are incomplete when released (need patches, etc.).
- The costs of almost all software projects exceed initial estimates (cost overrun).
5Software Engineering /1
- Business software has a large number of parts that have many interactions (i.e., complexity).
- Software engineering paradigms provide models and techniques that make it easier to handle complexity.
- A number of contemporary software engineering paradigms have been proposed:
- Object-orientation
- Component-ware
- Design patterns
- Software architectures
- etc.
6Software Engineering /2
- Evolution of software engineering paradigms
- Assembly languages
- Procedural and structured programming
- Object Oriented programming
- Component-ware
- Design patterns
- Software architectures
- Software Agents
(Figure: the paradigms above are ordered along a time axis.)
7What Affects Software?
- Timeliness
- Meeting the project deadline.
- Reaching the market at the right time.
- Cost
- Meeting the anticipated project costs.
- Reliability
- Working fine for the designated period on the
designated system.
8Definition Failure Availability
- Failure: any departure of system behavior in execution from user needs.
- Failure intensity: the number of failures per natural unit or time unit. Failure intensity is a way of expressing reliability.
- Availability: the probability at any given time that a system, or a capability of a system, functions satisfactorily in a specified environment.
- Given an average downtime per failure, a stated availability implies a certain reliability.
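The last point can be illustrated with a minimal sketch. It assumes that the mean uptime per failure is 1/λ for a failure intensity λ, so that availability is A = 1 / (1 + λ·t_m), where t_m is the mean downtime per failure. The function names and the numbers are illustrative assumptions, not part of the course material.

```python
def availability(failure_intensity, mean_downtime):
    """Availability for a failure intensity (failures/hour) and mean downtime (hours/failure)."""
    # Mean uptime per failure is 1/failure_intensity, so
    # A = uptime / (uptime + downtime) = 1 / (1 + failure_intensity * mean_downtime).
    return 1.0 / (1.0 + failure_intensity * mean_downtime)


def failure_intensity_for(target_availability, mean_downtime):
    """Failure intensity (failures/hour) implied by a target availability."""
    return (1.0 - target_availability) / (target_availability * mean_downtime)


# Hypothetical example: 99.9% availability with 6 minutes (0.1 hour) of downtime per failure.
lam = failure_intensity_for(0.999, mean_downtime=0.1)
print(f"implied failure intensity: {lam:.5f} failures/hour")
print(f"availability check:        {availability(lam, 0.1):.4f}")
```

Shortening the downtime per failure lets the same availability be met with a higher failure intensity, which is why the two quantities have to be stated together.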
9Definition Verification Validation
- Verification
- For each development phase or each module, are the outputs and inputs generated correctly, and do they match?
- Validation
- Does the software meet its requirements?
10Definition Reliability
- Reliability is the probability that a system, or a capability of a system, functions without failure for a specified time or number of natural units in a specified environment (Musa et al.).
- A recent survey of software consumers revealed that reliability was the most important quality attribute of application software.
- This course is concerned with the engineering of reliable software products.
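As a hedged illustration of "probability of functioning without failure for a specified time", the sketch below assumes a constant failure intensity λ, in which case R(t) = exp(-λ·t). The constant-intensity assumption and the function names are ours; the slides do not commit to a particular model at this point.

```python
import math


def reliability(failure_intensity, duration):
    """Probability of failure-free operation for `duration`, assuming constant failure intensity."""
    return math.exp(-failure_intensity * duration)


def failure_intensity_from(reliability_target, duration):
    """Constant failure intensity needed to reach a reliability target over `duration`."""
    return -math.log(reliability_target) / duration


# Hypothetical example: 0.01 failures/hour over a 10-hour mission.
print(f"R(10 h) = {reliability(0.01, 10):.4f}")
print(f"intensity for R(10 h) = 0.99: {failure_intensity_from(0.99, 10):.6f} failures/hour")
```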
11About This Course
- The topics discussed include:
- Concepts and relationships
- Analytical models and supporting tools
- Techniques for software reliability improvement, including:
- fault avoidance, fault elimination, fault tolerance
- error detection and repair
- failure detection and retraction
- risk management
12Section 2
13Reliability Natural System
- Natural system life cycle.
- Aging effect: the life span of a natural system is limited by the maximum reproduction rate of the cells.
14Reliability Hardware
- Hardware life cycle.
- Useful life span of a hardware system is limited
by the age (wear out) of the system.
15Reliability Software
- Software life cycle.
- Software systems are changed (updated) many times during their life cycle.
- Each update adds to the structural deterioration of the software system.
16Software vs. Hardware
- Software reliability doesn't decrease with time.
- Hardware faults are mostly physical faults.
- Software faults are mostly design faults, which are harder to measure, model, detect and correct.
17Reliability Science
- Exploring ways of implementing reliability in software products.
- Reliability science's goals:
- Developing models and techniques to build reliable software.
- Testing such models and techniques for adequacy, soundness and completeness.
18Section 3
19What is Engineering?
- Engineering
- Analysis
- Design
- Construction
- Verification
- Management
- What is the problem to be solved?
- What characteristics of the entity are used to solve the problem?
- How will the entity be realized?
- How is it constructed?
- What approach is used to uncover errors in design and construction?
- How will the entity be supported in the long term?
20Reliability Engineering /1
- Engineering of reliability in software products.
- Reliability engineering's goal:
- Developing software to reach the market
- with minimum development time
- with minimum development cost
- with maximum reliability
Software Quality
21Reliability Engineering /2
Software quality means getting the right balance
among development cost, development time and
reliability.
- Pick quantitative representations for the 3
factors (cost, time and reliability) and measure
them!
22What is SRE? /1
- Software Reliability Engineering (SRE) is a multi-faceted discipline covering the software product lifecycle.
- It involves both technical and management activities in three basic areas:
- Software development and maintenance
- Measurement and analysis of reliability data
- Feedback of reliability information into the software lifecycle activities
23What is SRE ? /2
- SRE is a practice for quantitatively planning and guiding software development and test, with emphasis on reliability and availability.
- SRE simultaneously does three things:
- It ensures that product reliability and availability meet user needs.
- It delivers the product to market faster.
- It increases productivity, lowering product life-cycle cost.
- In applying SRE, one can vary the relative emphasis placed on these three factors.
24Section 4
- Software Reliability Engineering (SRE) Process
25SRE Process /1
- There are 5 steps in the SRE process (for each system to test):
- Define necessary reliability
- Develop operational profiles
- Prepare for test
- Execute test
- Apply failure data to guide decisions
26SRE Process /2
- The Define Necessary Reliability, Develop Operational Profiles, and Prepare for Test activities all start during the Requirements and Architecture phases of the software development process.
- They all extend to varying degrees into the Design and Implementation phase, as they can be affected by it.
- The Execute Test and Guide Test activities coincide with the Test phase.
27SRE Necessary Reliability
- Define what failure means for the product.
- Choose a common measure for all failure intensities, either failures per some natural unit or failures per hour.
- Set the total system failure intensity objective (FIO).
- Compute a developed software FIO by subtracting the total of the FIOs of all hardware and acquired software components from the system FIO.
- Use the developed software FIO to track the reliability growth during system test.
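The subtraction step can be sketched as follows. The component names and FIO values are hypothetical, and all intensities are assumed to share one unit (here, failures per 1000 hours), as the second bullet requires.

```python
def developed_software_fio(system_fio, acquired_component_fios):
    """Developed-software FIO = system FIO minus the FIOs of hardware and acquired software."""
    total_acquired = sum(acquired_component_fios.values())
    developed = system_fio - total_acquired
    if developed <= 0:
        # The acquired components alone already use up the whole system objective.
        raise ValueError("acquired components exceed the system FIO")
    return developed


# Hypothetical values, all in failures per 1000 hours.
system_fio = 100.0
acquired = {"hardware": 1.0, "operating system": 4.0, "database": 15.0}
print(developed_software_fio(system_fio, acquired))  # 80.0
```

A non-positive result signals that the system objective cannot be met with the chosen components, so either the objective or the component selection has to change.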
28SRE Operational Profile /1
- An operation is a major system logical task, which returns control to the system when complete.
- An operational profile is a complete set of operations with their probabilities of occurrence.
29SRE Operational Profile /2
- There are four principal steps in developing an operational profile:
- Identify the operation initiators
- List the operations invoked by each initiator
- Determine the occurrence rates
- Determine the occurrence probabilities by dividing the occurrence rates by the total occurrence rate
- There are three kinds of initiators: user types, external systems, and the system itself.
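The last two steps amount to a normalization, sketched below. The operation names and occurrence rates are made up for illustration.

```python
def operational_profile(occurrence_rates):
    """Turn occurrence rates (e.g., operations per hour) into occurrence probabilities."""
    total = sum(occurrence_rates.values())
    if total <= 0:
        raise ValueError("total occurrence rate must be positive")
    return {op: rate / total for op, rate in occurrence_rates.items()}


# Hypothetical occurrence rates, in operations per hour.
rates = {"process call": 9000, "generate report": 900, "administer account": 100}
profile = operational_profile(rates)

# List operations from most to least frequently used.
for op, p in sorted(profile.items(), key=lambda item: item[1], reverse=True):
    print(f"{op:20s} {p:.3f}")
```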
30SRE Operational Profile /3
- Review the operational profile to:
- Review the functionality to be implemented to remove operations that are not likely to be worth their cost
- Suggest operations where opportunities for reuse will be most cost-effective
- Plan a more competitive release strategy using operational development. With operational development, development proceeds operation by operation, ordered by the operational profile. This makes it possible to deliver the most used, most critical capabilities to customers earlier than scheduled.
- Allocate resources for requirements, design, and code reviews among operations to cut schedules and costs
- Allocate system engineering, architectural design, development, and code resources among operations to cut schedules and costs
- Allocate development, code, and test resources among modules to cut schedules and costs
31SRE Prepare for Test
- The Prepare for Test activity uses the operational profiles to prepare test cases and test procedures.
- Test cases are allocated in accordance with the operational profile.
- Test cases are assigned to the operations by selecting from all the possible intra-operation choices with equal probability.
- The test procedure is the controller that invokes test cases during execution.
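Allocating test cases in accordance with the operational profile can be sketched as a proportional split. The largest-remainder rounding used here is our own choice for handling fractional shares, and the profile values are hypothetical.

```python
import math


def allocate_test_cases(profile, total_cases):
    """Split total_cases among operations in proportion to their occurrence probabilities."""
    if abs(sum(profile.values()) - 1.0) > 1e-9:
        raise ValueError("profile probabilities must sum to 1")
    exact = {op: p * total_cases for op, p in profile.items()}
    allocation = {op: math.floor(share) for op, share in exact.items()}
    leftover = total_cases - sum(allocation.values())
    # Give the cases lost to rounding down to the largest fractional remainders.
    by_remainder = sorted(exact, key=lambda op: exact[op] - allocation[op], reverse=True)
    for op in by_remainder[:leftover]:
        allocation[op] += 1
    return allocation


# Hypothetical operational profile and test budget.
profile = {"process call": 0.90, "generate report": 0.09, "administer account": 0.01}
print(allocate_test_cases(profile, 500))
```

A purely proportional split can leave rare operations with very few test cases; whether to add a minimum per operation is a separate planning decision that this sketch does not make.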
32SRE Execute Test
- Allocate test time among the associated systems and types of test (feature, load, regression, etc.).
- Invoke the test cases at random times, choosing operations randomly in accordance with the operational profile.
- Identify failures, along with when they occur.
- This information will be used in Apply Failure Data and Guide Test.
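Choosing operations randomly in accordance with the operational profile is weighted random sampling. The sketch below uses Python's standard `random.choices`; the profile, the seed, and the function name are illustrative assumptions.

```python
import random
from collections import Counter


def select_operations(profile, runs, seed=None):
    """Draw `runs` operations at random, weighted by their occurrence probabilities."""
    rng = random.Random(seed)  # a seed makes a test session reproducible
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return rng.choices(operations, weights=weights, k=runs)


# Hypothetical operational profile.
profile = {"process call": 0.90, "generate report": 0.09, "administer account": 0.01}
sequence = select_operations(profile, runs=1000, seed=521)
print(sequence[:5])
print(Counter(sequence).most_common())
```

Over a long run the observed frequencies approach the profile, so test effort lands on operations roughly in proportion to field use.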
33Types of Test
- Reliability Growth Test
- Certification Test
34SRE Apply Failure Data
- Plot each new failure as it occurs on a reliability demonstration chart.
- Accept or reject software (operations) using the reliability demonstration chart.
- Track reliability growth as faults are removed.
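One common way to construct such a chart, assumed here rather than taken from these slides, is a sequential probability ratio test on failure times normalized by the failure intensity objective. The sketch computes the accept and reject boundaries for the n-th failure; the discrimination ratio and the supplier and consumer risk levels are illustrative defaults.

```python
import math


def demonstration_boundaries(n, discrimination_ratio=2.0,
                             supplier_risk=0.1, consumer_risk=0.1):
    """Return (reject_at_or_below, accept_at_or_above) normalized times for failure number n."""
    a = math.log(consumer_risk / (1.0 - supplier_risk))
    b = math.log((1.0 - consumer_risk) / supplier_risk)
    gamma = discrimination_ratio
    accept = (a - n * math.log(gamma)) / (1.0 - gamma)
    reject = (b - n * math.log(gamma)) / (1.0 - gamma)
    return reject, accept


def classify(n, failure_time, fio):
    """Place the n-th failure, observed at failure_time, in the accept/continue/reject region."""
    normalized = failure_time * fio  # time measured in units of 1/FIO
    reject, accept = demonstration_boundaries(n)
    if normalized >= accept:
        return "accept"
    if normalized <= reject:
        return "reject"
    return "continue"


# Hypothetical certification test with an FIO of 0.1 failures/hour.
for n, t in [(1, 10.0), (2, 25.0), (3, 60.0)]:
    print(n, t, classify(n, t, fio=0.1))
```

With these defaults a reject decision is not possible before the fourth failure, because the reject boundary is negative for the first three.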
35Collect Field Data
- SRE for the software product lifecycle.
- Collect field data to use in succeeding releases, either using automatic reporting routines or manual collection, using a random sample of field sites.
- Collect data on failure intensity and on customer satisfaction, and use this information in setting the failure intensity objective for the next release.
- Measure operational profiles in the field and use this information to correct the operational profiles we estimated.
- Collect information to refine the process of choosing reliability strategies in future projects.
36Section 5
37Definition Fault
- A fault is a cause of either a failure of the program or an internal error (e.g., an incorrect state, incorrect timing).
- A fault must be detected and then removed.
- A fault can be removed without execution (e.g., code inspection, design review).
- Fault removal due to execution depends on the occurrence of the associated failure.
- Occurrence depends on the length of execution time and the operational profile.
38Definition Error
- Error has two meanings:
- A discrepancy between a computed, observed or measured value or condition and the true, specified or theoretically correct value or condition.
- A human action that results in software containing a fault.
- Human errors are the hardest to detect.
39More Definitions
- Defect: refers to either fault (cause) or failure (effect).
- Service: expected behavior of a software system.
- Availability: system uptime divided by the sum of system uptime and downtime.
40Failure Specification /1
Time based failure specification
- Time of failure
- Time interval between failures
- Cumulative failure up to a given time
- Failures experienced in a time interval
Failure no.   Failure time (hours)   Failure interval (hours)
1             10                     10
2             19                     9
3             32                     13
4             43                     11
5             58                     15
6             70                     12
7             88                     18
8             103                    15
9             125                    22
10            150                    25
11            169                    19
12            199                    30
13            231                    32
14            256                    25
15            296                    40
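The two time-based columns carry the same information: each interval is the difference between consecutive failure times, and the failure times are the running sum of the intervals. A minimal sketch using the values from the table (the function names are ours):

```python
from itertools import accumulate

# Failure times in hours, copied from the table above.
failure_times = [10, 19, 32, 43, 58, 70, 88, 103, 125, 150, 169, 199, 231, 256, 296]


def intervals_from_times(times):
    """Time between consecutive failures; the first interval is measured from time 0."""
    previous = [0] + list(times[:-1])
    return [t - p for t, p in zip(times, previous)]


def times_from_intervals(intervals):
    """Cumulative failure times recovered from the inter-failure intervals."""
    return list(accumulate(intervals))


intervals = intervals_from_times(failure_times)
print(intervals)
print(f"average interval: {sum(intervals) / len(intervals):.2f} hours")
```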
41Failure Specification /2
Failure based failure specification
- Time of failure
- Time interval between failures
- Cumulative failure up to a given time
- Failures experienced in a time interval
Time (s)   Cumulative failures   Failures in interval
30         2                     2
60         5                     3
90         7                     2
120        8                     1
150        10                    2
180        11                    1
210        12                    1
240        13                    1
270        14                    1
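Failure-count data can be derived from cumulative counts, or from raw failure times by binning them into fixed-width intervals. The sketch below checks the table's last column and bins a short list of sample failure times; the sample times, the interval width, and the function names are illustrative assumptions.

```python
import math

# Cumulative failures at 30-second steps, copied from the table above.
cumulative = [2, 5, 7, 8, 10, 11, 12, 13, 14]


def failures_per_interval(cumulative_counts):
    """Failures in each interval = difference of consecutive cumulative counts."""
    previous = [0] + list(cumulative_counts[:-1])
    return [c - p for c, p in zip(cumulative_counts, previous)]


def bin_failure_times(failure_times, width, intervals):
    """Count failures falling in (0, width], (width, 2*width], ... for `intervals` bins."""
    counts = [0] * intervals
    for t in failure_times:
        index = math.ceil(t / width) - 1  # a failure exactly at a boundary closes that bin
        if 0 <= index < intervals:
            counts[index] += 1
    return counts


print(failures_per_interval(cumulative))
# Sample failure times (hypothetical input), binned into two 30-unit intervals.
print(bin_failure_times([10, 19, 32, 43, 58], width=30, intervals=2))
```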
42Failure Specification /3
- Many reliability models, and the tools based on them (e.g., SMERFS and CASRE), can estimate model parameters from either failure-count data or time-between-failures data.
43Failure Functions /1
Failure distribution
- Cumulative Failure Function (mean value function)
denotes the average cumulative failures
associated with each time point.
Failures in time period   Probability   Value x Probability
0                         0.10          0.00
1                         0.18          0.18
2                         0.22          0.44
3                         0.16          0.48
4                         0.11          0.44
5                         0.08          0.40
6                         0.05          0.30
7                         0.04          0.28
8                         0.03          0.24
9                         0.02          0.18
10                        0.01          0.10
Mean cumulative failures                3.04
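The value 3.04 is the expected number of failures, i.e., the sum of each failure count multiplied by its probability. A short sketch reproducing the third column and its total from the probabilities in the table (the function name is ours):

```python
# Probability of observing 0, 1, ..., 10 failures in the time period (from the table).
probabilities = [0.10, 0.18, 0.22, 0.16, 0.11, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01]


def mean_failures(probs):
    """Expected number of failures: sum of k * P(k failures)."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(k * p for k, p in enumerate(probs))


# Third column of the table: value times probability, row by row.
for k, p in enumerate(probabilities):
    print(f"{k:2d}  {p:.2f}  {k * p:.2f}")
print(f"mean cumulative failures: {mean_failures(probabilities):.2f}")
```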
44Failure Functions /2
- Failure Intensity Function (FIF): represents the rate of change of the cumulative failure function.
- As faults are removed, failure intensity tends to drop and reliability tends to increase.
45Failure Functions /3
- Mean time to failure (MTTF): the expected time until the next failure is observed, $\mathrm{MTTF} = \int_0^\infty R(x)\,dx$, where $R(x)$ is the reliability.
- Mean time to repair (MTTR): the expected time until the system is repaired.
46Failure Functions /4
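- Failure rate function: the probability per unit time that a failure occurs in the interval $[t, t + \Delta t]$, given that no failure has occurred before $t$.
- Mean time between failures (MTBF):
- $\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$
- Availability can also be defined as $\mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR}) = \mathrm{MTTF} / \mathrm{MTBF}$.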
47Failure Functions /5
Failure(s) in time period   Probability (elapsed time 1 hour)   Probability (elapsed time 5 hours)
0                           0.10                                0.01
1                           0.18                                0.02
2                           0.22                                0.03
3                           0.16                                0.04
4                           0.11                                0.05
5                           0.08                                0.07
6                           0.05                                0.09
7                           0.04                                0.12
8                           0.03                                0.16
9                           0.02                                0.13
10                          0.01                                0.10
11                          0                                   0.07
12                          0                                   0.05
13                          0                                   0.03
14                          0                                   0.02
15                          0                                   0.01
Mean                        3.04                                7.77
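The two means are values of the cumulative failure (mean value) function at 1 hour and at 5 hours. Dividing their difference by the elapsed time gives a rough average failure intensity over that span; the sketch below does this with the table's probabilities. The finite-difference estimate and the function name are our illustration, not a model from the slides.

```python
# Probability of k failures (k = 0, 1, 2, ...) after 1 hour and after 5 hours (from the table).
p_1h = [0.10, 0.18, 0.22, 0.16, 0.11, 0.08, 0.05, 0.04, 0.03, 0.02, 0.01]
p_5h = [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.09, 0.12,
        0.16, 0.13, 0.10, 0.07, 0.05, 0.03, 0.02, 0.01]


def mean_failures(probs):
    """Expected number of failures: sum of k * P(k failures)."""
    return sum(k * p for k, p in enumerate(probs))


mu_1 = mean_failures(p_1h)  # mean cumulative failures at t = 1 hour
mu_5 = mean_failures(p_5h)  # mean cumulative failures at t = 5 hours

# Average rate of change of the mean value function between the two time points.
average_intensity = (mu_5 - mu_1) / (5 - 1)
print(f"mu(1 h) = {mu_1:.2f}, mu(5 h) = {mu_5:.2f}")
print(f"average failure intensity: {average_intensity:.4f} failures/hour")
```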
48Reliability Model
(Figure: factors feeding the Reliability Model)
- Fault removal: failure discovery (e.g., extent of execution, operational profile); quality of repair activity
- Fault introduction: characteristics of the product (e.g., program size); development process (e.g., SE tools and techniques, staff experience, etc.)
- Environment
49Conclusion
- Software Reliability Engineering (SRE) can offer
metrics to help elevate a software development
organization to the upper levels of software
development maturity.