Reliability - PowerPoint PPT Presentation

Slides: 24
Provided by: StephenD46

1
Reliability
2
Starter Questions
  • Q: What does reliability have to do with social
    implications of computing?
  • Q: How reliable is software?
  • Q: What can software developers do to improve
    reliability?
  • Q: Should software developers be held responsible
    for faulty software?

3
Example Problems O' Plenty
  • Disenfranchised Voters
  • Patriot Missile
  • NASA Mars Polar Lander
  • 2003 Power Blackout
  • Therac-25

4
Disenfranchised Voters
  • Florida, 2000
  • Florida was the only state that paid a private
    company to purge the voter file of ineligible
    voters
  • approximately 8,000 voters improperly excluded
    from voting
  • general population: 11% black
  • incorrectly purged from voter registration list:
    88% black
  • Bush beat Gore by 537 votes in the certified count

5
Patriot Missile System
  • On February 25, 1991, the Patriot missile battery
    at Dhahran, Saudi Arabia, had been in operation for
    100 hours, by which time the system's internal
    clock had drifted by one third of a second. For a
    target moving as fast as an inbound TBM, this was
    equivalent to a position error of 600 meters.
  • The radar system had successfully detected the
    Scud and predicted where to look for it next, but
    because of the time error, looked in the wrong
    part of the sky and found no missile. With no
    missile, the initial detection was assumed to be
    a spurious track and the missile was removed from
    the system. No interception was attempted, and
    the missile struck a barracks, killing 28
    soldiers.
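
The drift figure above can be roughly reproduced with a small sketch. It relies on details from widely cited analyses of the incident, not from this slide, so treat them as assumptions: the clock counted ticks of 0.1 s, the constant 0.1 was chopped to about 23 binary digits after the point, and the Scud traveled at about 1676 m/s. The names below are made up for illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23         # assumption: binary digits kept after the point
TICKS_PER_SECOND = 10      # assumption: the clock counts tenths of a second
SCUD_SPEED_M_PER_S = 1676  # assumption: approximate Scud velocity


def chopped_tenth(bits=FRACTION_BITS):
    """Return 0.1 truncated (not rounded) to the given number of binary digits."""
    scale = 2 ** bits
    return Fraction(int(Fraction(1, 10) * scale), scale)


def clock_drift_seconds(hours, bits=FRACTION_BITS):
    """Accumulated timing error after running for the given number of hours."""
    ticks = hours * 3600 * TICKS_PER_SECOND
    error_per_tick = Fraction(1, 10) - chopped_tenth(bits)
    return float(ticks * error_per_tick)


drift = clock_drift_seconds(100)
position_error_m = drift * SCUD_SPEED_M_PER_S
print(f"drift after 100 hours: {drift:.3f} s")
print(f"position error: {position_error_m:.0f} m")
```

Under these assumptions the drift comes out at about a third of a second and the position error at several hundred meters, in line with the figures on the slide. The error per tick is tiny; it only matters because it is multiplied by millions of ticks.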

6
Mars Polar Lander
  • The last telemetry from Mars Polar Lander was
    sent just prior to atmospheric entry on December
    3, 1999. No further signals have been received
    from the lander. The cause of this loss of
    communication is unknown.
  • According to the investigation that followed, the
    most likely cause of the failure of the mission
    was a software error that mistakenly identified
    the vibration caused by the deployment of the
    lander's legs as being caused by the vehicle
    touching down on the Martian surface, resulting
    in the vehicle's descent engines being cut off
    whilst it was still 40 meters above the surface,
    rather than on touchdown as planned.
  • Another possible reason for failure was
    inadequate preheating of catalyst beds for the
    pulsing rocket thrusters.
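
The likely software failure described above can be sketched as a stale-flag bug. This is a simplified, hypothetical model, not the flight software: the class name, the descent profile, and the 1500 m leg-deployment altitude are illustrative assumptions. Only the 40 m cutoff comes from the slide.

```python
from dataclasses import dataclass

SENSING_ENABLE_ALTITUDE_M = 40.0  # from the slide: engines cut off at 40 m


@dataclass
class TouchdownMonitor:
    """Hypothetical model of a latched touchdown indicator."""
    clear_on_enable: bool           # True models the fix, False models the bug
    touchdown_latched: bool = False
    sensing_enabled: bool = False

    def step(self, altitude_m, leg_sensor_signal):
        """Process one sample; return True if the engines should shut down."""
        if not self.sensing_enabled and altitude_m <= SENSING_ENABLE_ALTITUDE_M:
            self.sensing_enabled = True
            if self.clear_on_enable:
                # Discard anything latched before sensing was meaningful.
                self.touchdown_latched = False
        if leg_sensor_signal:
            self.touchdown_latched = True
        return self.sensing_enabled and self.touchdown_latched


# (altitude in meters, leg sensor signal); the first sample is the
# vibration from leg deployment, the last is the real touchdown.
DESCENT = [(1500.0, True), (1000.0, False), (40.0, False), (10.0, False), (0.0, True)]


def engine_cutoff_altitude(monitor, profile):
    """Return the altitude at which the monitor first commands shutdown."""
    for altitude_m, signal in profile:
        if monitor.step(altitude_m, signal):
            return altitude_m
    return None


print(engine_cutoff_altitude(TouchdownMonitor(clear_on_enable=False), DESCENT))
print(engine_cutoff_altitude(TouchdownMonitor(clear_on_enable=True), DESCENT))
```

In this model the buggy monitor shuts the engines down as soon as sensing is enabled, because the flag set during leg deployment was never cleared. The corrected monitor waits for a signal that arrives after sensing is enabled.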

7
Ariane 5 Rocket
  • June 4, 1996 was the first test flight of the
    Ariane 5 launch system. The rocket tore itself
    apart 37 seconds after launch, making the fault
    one of the most expensive computer bugs in
    history.
  • The Ariane 5 software reused the specifications
    from the Ariane 4, but the Ariane 5's flight path
    was considerably different and beyond the range
    for which the reused code had been designed.
    Specifically, the Ariane 5's greater acceleration
    caused the back-up and primary inertial guidance
    computers to crash, after which the launcher's
    nozzles were directed by spurious data.
    Pre-flight tests had never been performed on the
    re-alignment code under simulated Ariane 5 flight
    conditions, so the error was not discovered
    before launch.
  • Because of the different flight path, a data
    conversion from a 64-bit floating point to 16-bit
    signed integer caused a hardware exception (more
    specifically, an arithmetic overflow, as the
    floating point number had a value too large to be
    represented by a 16-bit signed integer).
    Efficiency considerations had led to the
    disabling of the exception handler for this
    error. This led to a cascade of problems,
    culminating in destruction of the entire flight.
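
The conversion problem described above can be illustrated with a minimal sketch. It is written in Python for readability, not the language of the flight software, and the function names and sample values are assumptions chosen for illustration.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def to_int16_checked(value: float) -> int:
    """Convert to a 16-bit signed integer, raising if the value does not fit."""
    result = int(value)
    if not INT16_MIN <= result <= INT16_MAX:
        raise OverflowError(f"{value} does not fit in a 16-bit signed integer")
    return result


def to_int16_saturating(value: float) -> int:
    """Convert to a 16-bit signed integer, clamping out-of-range values."""
    return max(INT16_MIN, min(INT16_MAX, int(value)))


# A value within the range the reused code was designed for converts cleanly.
print(to_int16_checked(1234.5))

# A larger value, like those on the new flight path, overflows.
try:
    to_int16_checked(40000.0)
except OverflowError as exc:
    print("unhandled in flight, this would halt the computer:", exc)

# One possible defensive alternative: clamp instead of raising.
print(to_int16_saturating(40000.0))
```

Whether clamping is the right response depends on how the value is used downstream. The point of the sketch is only that the out-of-range case needs a deliberate decision, which was missing once the handler was disabled.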

8
2003 North America Blackout
  • August 14, 2003
  • 12:15 p.m. Inaccurate data input renders a system
    monitoring tool in Ohio ineffective.
  • 1:31 p.m. The Eastlake, Ohio, generating plant
    shuts down.
  • 2:02 p.m. First 345-kV line in Ohio fails due to
    contact with a tree in Walton Hills, Ohio.
  • 2:14 p.m. An alarm system fails at FirstEnergy's
    control room and is not repaired.
  • 2:27 p.m. Second 345-kV line fails due to tree.
  • 3:05 p.m. A 345-kV transmission line fails in
    Parma, south of Cleveland due to a tree.
  • 3:17 p.m. Voltage dips temporarily on the Ohio
    portion of the grid. Controllers take no action,
    but power shifted by the first failure onto
    another 345-kV power line causes it to sag into a
    tree. While Midwest ISO and FirstEnergy
    controllers try to understand the failures, they
    fail to inform system controllers in nearby
    states.
  • 3:39 p.m. A FirstEnergy 138-kV line fails.
  • 3:41 and 3:46 p.m. Two breakers connecting
    FirstEnergy's grid with American Electric Power
    are tripped as a 345-kV power line and 15 138-kV
    lines fail in northern Ohio. Later analysis
    suggests that this could have been the last
    possible chance to save the grid if controllers
    had cut off power to Cleveland at this time.
  • 4:06 p.m. A sustained power surge on some Ohio
    lines begins an uncontrollable cascade after
    another 345-kV line fails.
  • 4:09:02 p.m. Voltage sags deeply as Ohio draws 2
    GW of power from Michigan.
  • 4:10:34 p.m. Many transmission lines trip out,
    first in Michigan and then in Ohio, blocking the
    eastward flow of power. Generators go down,
    creating a huge power deficit. In seconds, power
    surges out of the East, tripping East coast
    generators to protect them, and the blackout is
    on.
  • 4:10:37 p.m. Eastern Michigan grid disconnects
    from western part of state.
  • 4:10:38 p.m. Cleveland separates from
    Pennsylvania grid.
  • 4:10:39 p.m. 3.7 GW of power flows from the East
    through Ontario to southern Michigan and northern
    Ohio, more than ten times the flow 30 seconds
    earlier, causing a voltage drop across the
    system.
  • 4:10:40 p.m. Flow flips to 2 GW eastward from
    Michigan through Ontario, then flips westward
    again in a half second.
  • 4:10:43 p.m. International connections begin
    failing.
  • 4:10:45 p.m. Western Ontario separates from east
    when power line north of Lake Superior
    disconnects. First Ontario plants go offline in
    response to unstable system. Quebec is protected
    because its lines are DC, not AC.

9
Therac-25 - the problem
  • When operating in soft X-ray mode, the machine
    was designed to rotate three components into the
    path of the electron beam, in order to shape and
    moderate the power of the beam.
  • The accidents occurred when the high-energy
    electron beam was activated without the target
    having been rotated into place. The machine's
    software did not detect that this had occurred,
    and therefore neither determined that the patient
    was receiving a potentially lethal dose of
    radiation nor prevented it.

10
Therac-25 - the reasons
  • The design lacked hardware interlocks to prevent
    the electron-beam from operating in its
    high-energy mode without the target in place.
  • The engineer had reused software from older
    models. These models had hardware interlocks and
    were therefore not as vulnerable to the software
    defects.
  • The hardware provided no way for the software to
    verify that sensors were working correctly.
  • The equipment control task did not properly
    synchronize with the operator interface task, so
    that race conditions occurred if the operator
    changed the setup too quickly. This was evidently
    missed during testing, since it took some
    practice before operators were able to work
    quickly enough for the problem to occur.
  • The software set a flag variable by incrementing
    it. Occasionally an arithmetic overflow occurred,
    causing the software to bypass safety checks.
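
The last reason above can be sketched in a few lines. This is a hypothetical model, assuming a one-byte flag where zero means "no check needed"; the function names are illustrative and do not come from the Therac-25 code.

```python
def set_flag_by_increment(flag):
    """Buggy approach: 'set' a one-byte flag by adding 1, wrapping at 256."""
    return (flag + 1) & 0xFF


def set_flag_by_assignment(flag):
    """Safer approach: assign a fixed nonzero value, so it can never wrap."""
    return 1


def count_skipped_checks(set_flag, passes):
    """Count the passes on which the flag reads zero and the check is skipped."""
    flag = 0
    skipped = 0
    for _ in range(passes):
        flag = set_flag(flag)
        if flag == 0:  # zero is read as "no safety check needed"
            skipped += 1
    return skipped


print(count_skipped_checks(set_flag_by_increment, 1024))
print(count_skipped_checks(set_flag_by_assignment, 1024))
```

In this model the incrementing version skips the check once every 256 passes, which is rare enough to slip through casual testing. Assigning a constant removes the overflow entirely.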

11
Question
  • What are the common features of those famous
    failures?

12
Software Warranties
  • DISCLAIMER OF WARRANTIES. TO THE MAXIMUM EXTENT
    PERMITTED BY APPLICABLE LAW, MICROSOFT AND ITS
    SUPPLIERS PROVIDE TO YOU THE SOFTWARE COMPONENT,
    AND ANY (IF ANY) SUPPORT SERVICES RELATED TO THE
    SOFTWARE COMPONENT ("SUPPORT SERVICES") AS IS AND
    WITH ALL FAULTS AND MICROSOFT AND ITS SUPPLIERS
    HEREBY DISCLAIM WITH RESPECT TO THE SOFTWARE
    COMPONENT AND SUPPORT SERVICES ALL WARRANTIES AND
    CONDITIONS, WHETHER EXPRESS, IMPLIED OR
    STATUTORY, INCLUDING, BUT NOT LIMITED TO, ANY (IF
    ANY) WARRANTIES OR CONDITIONS OF OR RELATED TO
    TITLE, NON-INFRINGEMENT, MERCHANTABILITY, FITNESS
    FOR A PARTICULAR PURPOSE, LACK OF VIRUSES,
    ACCURACY OR COMPLETENESS OF RESPONSES, RESULTS,
    LACK OF NEGLIGENCE OR LACK OF WORKMANLIKE EFFORT,
    QUIET ENJOYMENT, QUIET POSSESSION, AND
    CORRESPONDENCE TO DESCRIPTION. THE ENTIRE RISK
    ARISING OUT OF USE OR PERFORMANCE OF THE SOFTWARE
    COMPONENT AND ANY SUPPORT SERVICES REMAINS WITH
    YOU.

13
Warranty Laws
  • Article 2 of the Uniform Commercial Code
  • What specifically is at issue in many cases are
    the disks you buy with software to load onto your
    computer or the updates which are internally
    loaded when you agree to provisions of what are
    called licensing agreements. Are these
    purchases/updates "transactions in goods" under
    the UCC Article 2?
  • At first the ALI and NCCUSL decided to handle
    this problem by a separate section of the UCC,
    which they would have called Article 2B. However,
    the ALI withdrew from the project when there
    seemed to be no attempt to bring all such
    transactions under the scope of Article 2. Thus,
    the remaining pieces of what was formerly 2B
    became a statute, UCITA ("Uniform Computer
    Information Transactions Act"). Article 2 would
    then be revised to eliminate all reference to
    information, and UCITA would carry the burden on
    that front. In Article 2, the term "goods" does
    not include information. UCITA was supposed to
    pick up the slack. But UCITA ran into a lot of
    difficulties and only two states have approved
    it. That leaves us potentially in legal
    limbo regarding whether these software packages
    and other similar transactions are really Article
    2 transactions.
  • http://www.drbilllong.com/Sales/ScopeII.html

14
Warranty Laws
  • Uniform Computer Information Transactions Act
    (UCITA) allows software manufacturers to:
  • disclaim all liability for defects
  • prevent the transfer of software from person to
    person
  • remotely disable licensed software during a
    dispute
  • UCITA does not apply to embedded systems

15
Warranty Lawsuits
  • Mortenson v. Timberline Software (1993)
  • Mortenson used a TS application when creating a
    bid to build a hospital.
  • The software created a bid that was $2M too low.
  • TS knew about the bug, but had not sent an update
    to Mortenson.
  • The State of Washington Supreme Court ruled in
    favor of TS.

16
Liability
  • Q: Can you be held criminally liable for your
    software's defects?
  • A: Generally, no.
  • You are liable for embedded systems.
  • E.g., Toyota cruise control
  • Software errors are usually covered by other
    laws
  • COPPA - illegal to collect data on users under
    age 13 w/o parental consent
  • FERPA - protects student information
  • HIPAA - protects patients' information
  • FDA examines medical devices for safety

17
How can we improve reliability?
  • Use of Software Engineering practices
  • Make software development more science and less
    art.
  • A science is only as mature as its measurement
    devices.

18
Software Engineering, the early years: The
"Software Crisis"
[Figure: demand for programmers and supply of
programmers plotted over time, from 1960 to today]

19
Software Engineering
covered in CSCI 475/476
  • Methods
  • e.g. how to test modules of code
  • Procedures / Best Practices
  • e.g. successful requirements engineering
  • Metrics
  • what do we do well?
  • what do we do poorly?
  • how productive are we?

20
Software Quality Assurance
covered in CSCI 521
  • Formal Peer Reviews
  • Coding and Design Standards
  • Employee Training
  • Risk Management

21
How much quality is cost effective?
[Figure: costs plotted against software quality.
Labels: SQA, Failure, development costs and SQA
costs, cost of failure, optimal quality level]

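
The idea of an optimal quality level can be sketched numerically: total cost is the sum of a development and SQA cost that rises with quality and a failure cost that falls with it, and the optimum is where the sum is lowest. The curve shapes and constants below are purely hypothetical, chosen only to give the sketch a clear minimum; they are not taken from the slide.

```python
def development_and_sqa_cost(quality):
    """Hypothetical curve: cost grows steeply as quality approaches 1."""
    return quality / (1.0 - quality)


def cost_of_failure(quality):
    """Hypothetical curve: expected failure cost shrinks as quality rises."""
    return 100.0 * (1.0 - quality)


def total_cost(quality):
    """Sum of the two curves; the optimum minimizes this total."""
    return development_and_sqa_cost(quality) + cost_of_failure(quality)


# Search quality levels 0.00, 0.01, ..., 0.99 for the lowest total cost.
candidates = [i / 100 for i in range(100)]
optimal_quality = min(candidates, key=total_cost)
print(f"optimal quality level: {optimal_quality:.2f}")
print(f"total cost there: {total_cost(optimal_quality):.2f}")
```

With these made-up curves the minimum sits at a quality level of 0.90. Pushing quality higher costs more in development and SQA than it saves in failures, and settling for less does the reverse.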
22
How SQA Pays for Itself
[Figure: costs plotted against software quality.
Labels: SQA, Failure, initial cost of SQA,
eventual cost of SQA, cost of failure, optimal
quality level]

23
Next Classes
  • Professional Codes of Ethics
  • Your Project Proposals