Network Fault Tolerance - PowerPoint PPT Presentation

Title: Network Fault Tolerance


1
HUMBOLDT-UNIVERSITÄT ZU BERLIN, INSTITUT FÜR INFORMATIK
DEPENDABLE SYSTEMS
Lecture 1: INTRODUCTION
Winter semester 2000/2001
Instructor: Prof. Dr. Miroslaw Malek
www.informatik.hu-berlin.de/rok/ftc
2
FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
  • Introduction (Unit I)
      • Motivation
      • System views
      • Dependability rings
      • Dependable design methodology
  • Dependability Concepts, Measures and Models (UNIT DCMM)
      • Basic definitions
      • Dependability measures
      • Dependability models
      • Examples
      • Dependability evaluation tools
  • Testing Techniques (UNIT TT)
      • Testing techniques principles
      • Processor testing
      • Memory testing
      • Network testing

3
FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
  • Fault Diagnosis Techniques (UNIT FST)
      • Fault detection techniques
      • Fault location (isolation) methods
  • Fault Recovery and Tolerance Techniques (UNIT FRTT)
    (System Level)
      • Dynamic techniques
      • Static techniques
      • Hybrid techniques
  • Fault-tolerant and Fault-secure Memories (UNIT FRTT)
      • Fault-tolerant techniques in manufacturing
      • Replication
      • Coding
      • Reconfiguration

4
FAULT-TOLERANT COMPUTING SYSTEMS: Topical Outline
  • Network Fault Tolerance (UNIT NFT)
      • Computer networks
      • Basic techniques
      • Example: multistage networks
  • Case Studies (UNIT CS)
      • ESS and 3B20
      • FTMP: Fault-tolerant Multiprocessor
      • SIFT: Software-implemented Fault Tolerance
      • Communication controller
      • Fault-tolerant Building Block Architecture

5
COURSE ACTIVITIES
  • PROJECT
  • PRESENTATION
  • INVITED SPEAKERS
  • CONFERENCES AND WORKSHOPS
  • SOME WEBSITES
      • www.dependability.org
      • www.paradise.caltech.edu
      • www.milan.eas.asu.edu
      • www.crhc.uiuc.edu

6
Major References on Fault-tolerant Computing
(Books/General) 1
  • Chang, H. Y., E.G. Manning and G. Metze, Fault
    Diagnosis in Digital Systems, Wiley
    Interscience, 1970.
  • Friedman, A. D. and P. R. Menon, Fault Detection
    in Digital Circuits, Prentice-Hall, 1971.
  • Breuer, M. A. and A.D. Friedman, Diagnosis and
    Reliable Design of Digital Systems, Computer
    Science Press, 1976.
  • Kraft, G. D. and W. N. Toy, Microprogrammed
    Control and Reliable Design of Small Computers,
    Prentice-Hall, 1981.
  • Anderson, T. and P. A. Lee, Fault Tolerance:
    Principles and Practice, Prentice-Hall, 1982.
  • Siewiorek, D.P. and R. S. Swarz, The Theory and
    Practice of Reliable Systems Design, Digital
    Press, 1982 1995.
  • Lala, P.K., Fault Tolerant and Fault Testable
    Hardware Design, Prentice-Hall International,
    1985.
  • Pradhan, D. K. (ed.), Fault-Tolerant Computing:
    Theory and Techniques, Vols. I and II,
    Prentice-Hall, 1986.

7
Major References on Fault-tolerant Computing
(Books/General) 2
  • Avizienis, A., H. Kopetz and J. C. Laprie (eds.),
    The Evolution of Fault-Tolerant Computing,
    Springer-Verlag, 1987.
  • Johnson, B. W., Design and Analysis of Fault
    Tolerant Digital Systems, Addison-Wesley, 1989.
  • Negrini, R., M. G. Sami and R. Stefanelli, Fault
    Tolerance Through Reconfiguration in VLSI and WSI
    Arrays, MIT Press, 1989.
  • Laprie, J. C. (ed.), Dependable Computing and
    Fault-Tolerant Systems, Vol. 5, Dependability:
    Basic Concepts and Terminology, Springer-Verlag
    Wien New York, 1992.
  • Landwehr, C. E., B. Randell, L. Simoncini (eds.),
    Dependable Computing and Fault-Tolerant Systems,
    Vol. 8, Dependable Computing for Critical
    Applications 3, Springer-Verlag Wien New York,
    1993.
  • Koob, G. M. and C. G. Lau (eds.), Foundations of
    Dependable Computing: System Implementation,
    Kluwer Academic Publishers, 1994.
  • Koob, G. M. and C. G. Lau (eds.), Foundations of
    Dependable Computing: Paradigms for Dependable
    Applications, Kluwer Academic Publishers, 1994.

8
Major References on Fault-tolerant Computing
(Books/General) 3
  • Koob, G. M. and C. G. Lau (eds.), Foundations of
    Dependable Computing: Models and Frameworks for
    Dependable Systems, Kluwer Academic Publishers,
    1994.
  • Malek, M. (ed.), Responsive Computing, Kluwer
    Academic Publishers, 1994.
  • Fussell, D. S. and M. Malek (eds.), Responsive
    Computer Systems: Steps Toward Fault-Tolerant
    Real-Time Systems, Kluwer Academic Publishers,
    1995.
  • Cristian, F., G. Le Lann and T. Lunt (eds.),
    Dependable computing and Fault-Tolerant Systems,
    Vol. 9, Dependable Computing for Critical
    Applications 4, Springer-Verlag Wien New York,
    1995.
  • Pradhan, D. K., Fault-Tolerant Computer System
    Design, Prentice Hall, 1996.
  • Shvartsman, A. A., Fault-Tolerant Parallel
    Computation, Kluwer, 1997.
  • Schneeweiss, W., Die Fehlerbaum-Methode,
    LiLoLe-Verlag, 1999.
  • Montenegro, S., Sichere und fehlertolerante
    Steuerungen, Hanser Muenchen, 1999.

9
Major References on Fault-tolerant Computing
(Books/Reliability Evaluation)
  • Myers, G. J., Software Reliability: Principles and
    Practice, Wiley-Interscience, 1976.
  • Trivedi, K. S., Probability and Statistics with
    Reliability, Queuing, and Computer Science
    Applications, Prentice-Hall, 1982.
  • Ascher, H. and H. Feingold, Repairable Systems
    Reliability, Marcel Dekker, 1984.
  • Musa, J. D., A. Iannino and K. Okumoto, Software
    Reliability: Measurement, Prediction,
    Application, McGraw-Hill, 1987.
  • Schneeweiss, W., Petri Nets for Reliability
    Modeling, LiLoLe, 1999.

10
Major References on Fault-tolerant Computing
(Books/Coding)
  • Sellers, E. F., M. Y. Hsiao and L. W. Bearnson,
    Error Detecting Logic for Digital Computers,
    McGraw-Hill, 1968.
  • Peterson, W. and E. Weldon, Error-Correcting
    Codes (2nd ed.), MIT Press, 1972.
  • Wakerly, J., Error Detecting Codes,
    Self-Checking Circuits and Applications, The
    Computer Science Library, 1978.
  • Lin, S. and D. J. Costello, Error Control Coding:
    Fundamentals and Applications, Prentice-Hall,
    1983.
  • Nagle, H. T., J. D. Irwin and D. Hoffman, Error
    Detecting and Correcting Codes for Computer
    Scientist and Engineers, MacMillan Publishers,
    1986.
  • Rao, T. R. N. and E. Fujiwara, Error-Control
    Coding for Computer Systems, Prentice-Hall, 1989.

11
Major References on Fault-tolerant Computing
(Books/Software)
  • Myers, G. J., The Art of Software Testing,
    Wiley-Interscience, 1970.
  • Deutsch, M. D., Software Verification and
    Validation, Prentice-Hall, 1982.
  • Shooman, M. L., Software Engineering,
    McGraw-Hill, 1983.
  • Beizer, B., Software Testing Techniques, Van
    Nostrand Reinhold, 1983.
  • Bernstein, P. A., V. Hadzilacos and N. Goodman,
    Concurrency Control and Recovery in Database
    Systems, Addison-Wesley, 1987.
  • Neufelder, A. M., Ensuring Software Reliability,
    Marcel Dekker Inc., 1993.
  • Lyu, M. R. (ed.), Software Fault Tolerance, John
    Wiley and Sons, 1995.
  • Lyu, M. R. (ed.), Handbook of Software
    Reliability Engineering, Computer Science Press,
    1995.

12
Major References on Fault-tolerant Computing
(Journals)
  • Special Issue of Proc. of the IEEE, October 1978
  • Special Issue of Computer, October 1979
  • Special Issue of Computer, March 1980
  • Special Issue of Computer, August 1984
  • Special Issue of IEEE Software, May 1995
  • IEEE Trans. on Reliability
  • IEEE Trans. on Software Engineering
  • Computer
  • Design and Test
  • Electronics
  • Proc. of the IEEE
  • Computer Design
  • Journal of Electronic Testing: Theory and
    Applications
  • Journal of Parallel and Distributed Computing
  • IEEE Trans. on Parallel and Distributed Systems
  • Real-Time Systems Journal

13
Major References on Fault-tolerant Computing
(Conference Proceedings)
  • Fault-Tolerant Computing Symposium
  • Reliability and Maintainability Symposium
  • Reliability in Distributed Software and Database
    Systems Symposium
  • Test Conference
  • Distributed Computing Systems Conference
  • Parallel Processing Conference
  • Real-Time Systems Symposium
  • Computer Architecture Symposium

14
INTRODUCTION
  • OBJECTIVES
      • MOTIVATION FOR FAULT-TOLERANT SYSTEMS
      • TO INTRODUCE VARIOUS VIEWS OF COMPUTER SYSTEMS
        AND THEIR RELATIONS TO COMPUTER SYSTEM
        DEPENDABILITY
      • TO PRESENT BASIC CONCEPTS AND APPROACHES
      • TO INTRODUCE DEPENDABLE DESIGN METHODOLOGY
  • CONTENTS
      • MOTIVATION
      • SYSTEM VIEWS
      • SYSTEM DEPENDABILITY CONCEPTS
      • APPROACHES TO DEPENDABLE DESIGN
      • DEPENDABILITY RINGS
      • DEPENDABLE DESIGN METHODOLOGY

15
TYPES OF SYSTEMS
  • Dependable (Reliable) System
      • A system that delivers a required service during
        its lifetime
  • Fault-Tolerant Computer System
      • A system that can continue the correct execution
        of its programs and input/output functions in the
        presence of faults
  • Real-Time Computer System
      • A system that delivers service to a user within a
        specified deadline (physical time, duration, etc.)
  • Responsive Computer System
      • A fault-tolerant real-time system that delivers
        satisfactory service in a timely manner

16
MOTIVATION FOR RELIABLE AND FAULT-TOLERANT
COMPUTING
  • ECONOMIC NECESSITY
  • LIFE SAVING
  • NOVICE USERS
  • HARSH ENVIRONMENTS
  • MORE COMPLEX SYSTEMS

17
DEVICE RELIABILITY AND SYSTEM RELIABILITY
[Figure: equivalent device reliability and system reliability
plotted against technology generation. Vertical axis: Mean Time
Between Failures (MTBF) in years, from 1 to 10^6, with the
minimum acceptable reliability marked. Horizontal axis: 1950 to
1990 (relays, vacuum tubes, semiconductors, SSI, MSI, LSI-VLSI).]
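The gap between device reliability and system reliability can be illustrated with a small calculation. The sketch below is not from the slides: it assumes a series system (any single device failure brings the system down) built from identical devices with constant, independent failure rates, in which case the system MTBF is the device MTBF divided by the device count. The function names are hypothetical.

```python
def system_mtbf_years(device_mtbf_years: float, n_devices: int) -> float:
    """MTBF of a series system of n identical, independent devices.

    Assumes constant failure rates, so failure rates add up and the
    system MTBF is the device MTBF divided by the number of devices.
    """
    if device_mtbf_years <= 0:
        raise ValueError("device_mtbf_years must be positive")
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return device_mtbf_years / n_devices


def required_device_mtbf_years(target_system_mtbf_years: float, n_devices: int) -> float:
    """Device MTBF needed so that a series system of n devices meets the target."""
    if target_system_mtbf_years <= 0:
        raise ValueError("target_system_mtbf_years must be positive")
    if n_devices < 1:
        raise ValueError("n_devices must be at least 1")
    return target_system_mtbf_years * n_devices


if __name__ == "__main__":
    # A device MTBF of 10^6 years shrinks quickly as the device count grows.
    for n in (1, 1_000, 100_000):
        print(n, "devices:", system_mtbf_years(1e6, n), "years")
```

Under these assumptions, growing system complexity erodes the benefit of more reliable devices, which is one reading of why the two curves are shown together.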
18
DEPENDABILITY PERFORMANCE TRADE-OFF
[Figure: availability (0.9 to 0.99999) plotted against throughput
(1 to 100,000 MIPS). Regions shown: ultra-reliable systems,
commercial fault-tolerant systems, and massively
parallel/distributed systems.]
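To make the availability scale on this slide more concrete, the following sketch converts an availability value into expected downtime per year. It is an illustration only: the function name is hypothetical, and a year is taken as 365 days.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years are ignored


def downtime_hours_per_year(availability: float) -> float:
    """Expected downtime per year for a given steady-state availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


if __name__ == "__main__":
    # The five availability levels shown on the chart's axis.
    for a in (0.9, 0.99, 0.999, 0.9999, 0.99999):
        hours = downtime_hours_per_year(a)
        print(f"availability {a}: {hours:.3f} hours/year ({hours * 60:.1f} minutes)")
```

Each additional "nine" cuts the yearly downtime by a factor of ten, from 876 hours at 0.9 to about five minutes at 0.99999.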
19
EXAMPLES
  • DEFENSE SYSTEMS
  • FLIGHT SYSTEMS
  • AIR TRAFFIC CONTROL
  • COMMUNICATION SYSTEMS
  • BANKING SYSTEMS
  • AIRLINE SEAT RESERVATIONS
  • TELEPHONE SYSTEMS
  • HOUSEHOLD APPLIANCES
  • VIDEO GAMES

20
VIEW 1: SYSTEM LIFE CYCLE
  • INPUTS: SYSTEM CONSTRAINTS, NEW TECHNOLOGY,
    OBSOLESCENCE, NEEDS
  • PHASES: CONCEPT FORMULATION, SYSTEM SPECIFICATION,
    DESIGN, PROTOTYPE, PRODUCTION, INSTALLATION,
    OPERATIONAL LIFE, MODIFICATION AND RETIREMENT
  • Notice that testing, verification or validation
    should occur after every phase of the life cycle
  • Very few tools exist, and only for some steps of
    the cycle

21
VIEW 2: PACKAGING LEVELS OF INTEGRATION
  • APPLICATIONS
  • APPLICATIONS MODULES
  • SPECIAL-PURPOSE LANGUAGES
  • STANDARD LANGUAGES
  • OPERATING SYSTEMS
  • CABINETS/FRAMES
  • BOXES/CAGES
  • PRINTED CIRCUIT BOARDS/CARDS, WAFERS, TCMs
  • INTEGRATED CIRCUITS (CHIPS)
  • Dependability must be considered at every level
  • System decomposition (partitioning) may have a
    significant impact on dependability

22
VIEW 3: WORKLOAD VIEW
[Figure: workload view. Labels: LIVEWARE, HARDWARE/SOFTWARE,
USEFUL WORK, PREPARATION, SEMI-USEFUL WORK, IDLING,
FAULT SERVICING.]
  • ELIMINATE IDLING AND USE IT FOR TESTING TO
    IMPROVE DEPENDABILITY

23
VIEW 4: LEVELS OF ABSTRACTION FOR DIGITAL COMPUTERS
  • DEPENDABILITY AND TESTING MUST BE CONSIDERED AT
    EVERY LEVEL

24
VIEW 5: COMPUTER SYSTEM
  • LIVEWARE: maintenance personnel, operators, system
    designers, system analysts, programmers, users
  • SOFTWARE: packages, assemblers, compilers, operating
    systems, utility programs, debugging programs, file
    processing programs
  • FIRMWARE: microprogram, microprogramming systems
  • HARDWARE: CPUs, I/O devices, memories, interconnection
    networks
  • FAULTS ARE ATTRIBUTED TO: hardware 20-65%, software
    20-80%, people 15-40%
  • AT&T's: 20-40-40 (2/3 applications, 1/3 OS)
25
(WARNING!!!) VIEW 6: IF YOU DO NOT FOLLOW A DEPENDABLE
DESIGN METHODOLOGY, YOU MAY END UP WITH THE FOLLOWING
  • SIX PHASES OF A PROJECT
      • ENTHUSIASM
      • DISILLUSIONMENT
      • PANIC AND HYSTERIA
      • SEARCH FOR THE GUILTY
      • PUNISHMENT OF THE INNOCENT
      • PRAISE AND AWARDS FOR THE NON-PARTICIPANTS
  • (Author unknown; found at one of the computer
    companies)

26
SYSTEM DEPENDABILITY CONCEPTS
  • RELIABILITY
      • R(t) is the conditional probability that the
        system performs its intended function without
        failure throughout the interval [0, t], given that
        it was fully operational at time t = 0
  • AVAILABILITY
      • Instantaneous availability A(t) is the probability
        that a system is performing correctly at time t.
        For non-repairable systems it equals reliability:
        A(t) = R(t)
      • Steady-state availability A_s is the probability
        that a system will be operational at any random
        point of time, expressed as the fraction of time
        the system is operational during its expected
        lifetime
  • SURVIVABILITY
      • The probability that a system will deliver the
        required service in the presence of an a priori
        defined set of faults or any subset of it
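The measures above can be turned into numbers once a failure model is chosen. The sketch below is an illustration under assumptions the slide does not state: a constant failure rate (so R(t) = exp(-λt)) and the common textbook expression MTTF / (MTTF + MTTR) for steady-state availability. The function and parameter names are hypothetical.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-failure_rate * t), assuming a constant failure rate.

    t and failure_rate must use matching units (e.g. hours and 1/hours).
    """
    if t < 0 or failure_rate < 0:
        raise ValueError("t and failure_rate must be non-negative")
    return math.exp(-failure_rate * t)


def steady_state_availability(mttf: float, mttr: float) -> float:
    """Fraction of time the system is up: MTTF / (MTTF + MTTR)."""
    if mttf <= 0 or mttr < 0:
        raise ValueError("mttf must be positive and mttr non-negative")
    return mttf / (mttf + mttr)


if __name__ == "__main__":
    rate = 1e-3  # assumed: one failure per 1000 hours on average
    print("R(100 h)  =", reliability(100, rate))
    print("R(1000 h) =", reliability(1000, rate))
    print("A_s       =", steady_state_availability(mttf=999.0, mttr=1.0))
```

With no repair (MTTR undefined), only R(t) is meaningful, which matches the slide's statement that A(t) = R(t) for non-repairable systems.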

27
APPROACHES
  • FAULT INTOLERANCE
  • FAULT TOLERANCE
  • MAINTAINABILITY
  • HARDWARE/SOFTWARE TRADE-OFFS

28
HARDWARE/SOFTWARE CONTINUUM AND VERTICAL MIGRATION
[Figure: continuum from HARDWARE to SOFTWARE]
  • EXAMPLES: M6800, MC68000, VAX-11/780, IBM-30XX,
    CRAY-XMP, C-205, systolic arrays, reconfigurable or
    experimental multicomputers
  • INSTRUCTIONS: integer arithmetic (ADD/SUB, MPY/DIV),
    floating-point arithmetic, vector processing,
    multiprocessing (e.g., submachine set-up)
  • VERTICAL MIGRATION is the transfer of a function's
    implementation from software to firmware and/or
    hardware, or vice versa. Vertical migration can
    improve performance and dependability and reduce
    cost.
29
DEPENDABILITY (RELIABILITY) RINGS FOR FAULT
TOLERANCE
[Figure: concentric dependability rings, from outermost to
innermost, each guarded by an acceptance test]
  • Acceptance Test: Operating System, Languages and
    Application
  • Acceptance Test: System Hardware
  • Acceptance Test: Register-Transfer Level
  • Acceptance Test: Logic Level
Each Dependability Ring should provide measures and
mechanisms for Fault Tolerance (Detection, Location,
Testability and Recovery)
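One way to picture layered acceptance tests is a chain of checks, one per ring, where the first failing check both detects the fault and indicates the level at which it was caught. The sketch below is a loose illustration, not the mechanism described in the lecture: the `Ring` type, the `locate_fault` function, and all four example checks are made up for the purpose.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Ring:
    """One dependability ring with its acceptance test (hypothetical structure)."""
    name: str
    acceptance_test: Callable[[int], bool]


def locate_fault(rings: List[Ring], value: int) -> Optional[str]:
    """Apply each ring's acceptance test, innermost ring first.

    Returns the name of the first ring whose test rejects the value
    (detection plus location), or None if every ring accepts it.
    """
    for ring in rings:
        if not ring.acceptance_test(value):
            return ring.name
    return None


# Invented checks on an 8-bit reading, ordered from innermost to outermost ring.
RINGS = [
    Ring("logic level", lambda v: 0 <= v <= 255),         # value fits in 8 bits
    Ring("register-transfer level", lambda v: v != 255),  # 255 assumed to be an error code
    Ring("system hardware", lambda v: v <= 200),          # assumed device range
    Ring("operating system, languages and application",
         lambda v: 10 <= v <= 150),                       # assumed application limits
]

if __name__ == "__main__":
    for reading in (100, 300, 255, 201, 5):
        print(reading, "->", locate_fault(RINGS, reading))
```

A real design would attach a recovery action to each ring as well; that part is omitted here.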
30
BOOTSTRAP TEST RINGS IN A MULTICOMPUTER SYSTEM
[Figure: concentric test rings, from outermost to innermost]
  • Network
  • Memories
  • Processor
  • Diagnostic and Maintenance Processor(s) (Hardcore)
31
DEPENDABLE DESIGN METHODOLOGY
  • Identify fault classes, fault latency and fault
    impact
  • Determine qualitative and quantitative
    specifications for fault tolerance and evaluate
    your design in its specific environment
  • Identify weak spots and assess potential damage
  • Decompose the system
  • Develop fault and error detection techniques and
    algorithms
  • Develop fault isolation techniques and algorithms
  • Develop recovery/reintegration/restart
  • Evaluate degree of fault tolerance
  • Refine and iterate for improvement: try to
    eliminate weak spots and minimize potential damage

32
REAL-TIME SYSTEMS DESIGN
  • Identify time-critical tasks and specify their
    timing (deadlines, durations, frequency,
    periodicity, if any). Characterize the system
    load and environment.
  • Characterize timing of a system (hardware and
    software).
  • Map the timing specification onto the system
    timing (find the best resource allocation and
    scheduling methods), and incorporate concurrent
    monitoring.
  • Verify and validate the design for quantitative
    and qualitative specifications.
  • Refine, iterate and fine-tune the design.
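Mapping a timing specification onto system timing usually ends in some form of schedulability check. The sketch below shows one well-known check, the Liu and Layland utilization bound for rate-monotonic scheduling of independent periodic tasks on one processor. This choice is an assumption made for illustration; the slide does not name a scheduling method. The test is sufficient but not necessary, so a task set that fails it may still be schedulable.

```python
from typing import List, Tuple

# Each task is (worst-case execution time, period), in the same time unit.
Task = Tuple[float, float]


def utilization(tasks: List[Task]) -> float:
    """Total processor utilization: sum of execution time / period."""
    for wcet, period in tasks:
        if wcet < 0 or period <= 0:
            raise ValueError("execution times must be >= 0 and periods > 0")
    return sum(wcet / period for wcet, period in tasks)


def rm_bound(n: int) -> float:
    """Liu and Layland bound n * (2^(1/n) - 1) for n periodic tasks."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return n * (2 ** (1 / n) - 1)


def passes_rm_utilization_test(tasks: List[Task]) -> bool:
    """True if the task set meets the sufficient rate-monotonic bound."""
    if not tasks:
        return True
    return utilization(tasks) <= rm_bound(len(tasks))


if __name__ == "__main__":
    light = [(1, 4), (1, 5), (2, 10)]  # utilization 0.65
    heavy = [(2, 4), (2, 5)]           # utilization 0.90
    print("light:", passes_rm_utilization_test(light))
    print("heavy:", passes_rm_utilization_test(heavy))
```

The check assumes deadlines equal to periods and ignores blocking and overheads, which a full characterization of system timing would have to add.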

33
RESPONSIVE SYSTEM DESIGN
  • Determine qualitative and quantitative
    specifications for fault tolerance and task
    timeliness which meet user requirements.
  • Determine system timing (hardware and software);
    assess damage, availability and responsiveness.
  • Develop and time fault and error detection
    techniques and algorithms.
  • Develop and time fault isolation techniques and
    algorithms.
  • Develop and time recovery/reintegration/restart.
  • Map timing specification onto system timing under
    appropriate assumptions and incorporate
    concurrent monitoring.
  • Evaluate responsiveness.
  • Refine and iterate for improvement.
  • RESPONSIVE SYSTEMS NEED ARCHITECTS OF SPACE AND
    ARCHITECTS OF TIME
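The steps above ask for fault detection, isolation and recovery to be timed so that they fit within task deadlines. A minimal sketch of such a budget check follows, assuming a single fault per task instance and, optionally, a full re-execution of the task after recovery. The class, function and parameter names are hypothetical and the model is deliberately simple.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FaultHandlingTimes:
    """Worst-case times for handling one fault, all in the same time unit."""
    detection: float
    isolation: float
    recovery: float

    def total(self) -> float:
        return self.detection + self.isolation + self.recovery


def meets_deadline_with_fault(execution_time: float,
                              deadline: float,
                              handling: FaultHandlingTimes,
                              re_execute: bool = True) -> bool:
    """Check whether a task still meets its deadline if one fault occurs.

    Worst case assumed here: the task runs to completion, the fault is
    detected, isolated and recovered from, and (optionally) the task is
    executed once more from the start.
    """
    worst_case = execution_time + handling.total()
    if re_execute:
        worst_case += execution_time
    return worst_case <= deadline


if __name__ == "__main__":
    handling = FaultHandlingTimes(detection=2, isolation=3, recovery=5)
    print(meets_deadline_with_fault(10, 40, handling))  # worst case is 30
    print(meets_deadline_with_fault(10, 25, handling))  # worst case is 30
```

A budget of this kind gives one concrete meaning to evaluating responsiveness: the fault-free timing and the fault-handling timing are checked against the same deadline.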

34
REFERENCES (TEXTBOOK)
  • C. G. Bell, J. C. Mudge and J. E. McNamara, Seven
    Views of Computer Systems, Chapter 1 in the book
    by the same authors titled Computer
    Engineering, Digital Press, 1978.
  • G. J. Lipovski and M. Malek, Parallel Computing:
    Theory and Comparisons, Wiley-Interscience, New
    York, 1987.
  • M. Malek, Parallel Computer Systems Testing and
    Integration, in the book titled Testing and
    Diagnosis of VLSI and ULSI, M. G. Sami and F.
    Lombardi (eds.), Kluwer, 1988.
  • Pankaj Jalote, Fault Tolerance in Distributed
    Systems, Prentice Hall, 1994.
  • Dhiraj K. Pradhan, Fault-Tolerant Computer System
    Design, Prentice Hall, 1996.