1
A Case Study In Reliability Analysis
  • Lewis Sykalski

2
Background (cont.)
  • Net Centric Warfare Data Collector
  • Approximately 180 KLOC
  • Written in Java; makes heavy use of JDBC and RMI
    from the J2EE package
  • CMMI Level 1
  • Utilizes Oracle 9.2 EE OTS DBMS
  • Required reliability: moderate

3
Background
[Diagram: Global Vision Network (GVN) linking four sites: LM Mission Sys (Colorado Springs, CO), Light House (Suffolk, VA), Integrated Warfare Development Center (Fort Worth, TX), and LM Sim Training (Orlando, FL). Node labels: FUSION, CAOC, DC, VBMS, WCS, JSAF, JTAC, JIMM, JABE, Other Simulators, Threat Sims.]
4
Design Diversity (Part I)
  • Part I: Oracle DBMS Design Diversity
  • Acquire 20 bug reports each from Oracle 9.2 and
    Oracle 10.0
  • Bugs had to be date independent, easy to
    reproduce, and type independent
  • Results would then be classified by self-evidence
    and divergence

5
Design Diversity Results 9.2 Bugs
Bug | Type | 9.2 S.E. | 10.0 Fails? | 10.0 S.E. | Divergent
2357784 | Internal Error | X | NO | N/A | X
2299898 | Performance/Hang | X | NO | N/A | X
2202561 | Incorrect Results | | NO | N/A |
2221401 | Incorrect Results | | NO | N/A |
2739068 | Incorrect Results | | NO | N/A |
2683540 | Incorrect Results | | NO | N/A |
2991842 | Incorrect Results | | NO | N/A |
2200057 | Internal Error | X | NO | N/A |
2405258 | Internal Error | X | NO | N/A |
2716265 | Internal Error | X | NO | N/A |
2054241 | Performance/Hang | X | NO | N/A |
2485871 | Internal Error | X | NO | N/A |
2670497 | Internal Error | X | NO | N/A |
2659126 | Internal Error | X | NO | N/A | X
2064478 | Internal Error | X | NO | N/A |
2624737 | Internal Error | X | NO | N/A | X
1918751 | Internal Error | X | NO | N/A |
2286290 | Incorrect Results | | NO | N/A | X
2700474 | Incorrect Results | | NO | N/A |
2576353 | Internal Error | X | NO | N/A |
6
Design Diversity Results 10.0 Bugs
Bug | Type | 10.0 S.E. | 9.2 Fails? | 9.2 S.E. | Divergent
5731063 | Internal Error | X | NO | N/A |
3664284 | Incorrect Results | | NO | N/A |
4582808 | Incorrect Results | | NO | N/A |
3895678 | Internal Error | X | YES | X |
3893571 | Internal Error | X | YES | X |
3903063 | Incorrect Results | | YES | |
3912423 | Internal Error | X | NO | N/A |
4029857 | Engine Crash | X | YES | X |
4156695 | Incorrect Results | | YES | |
2929556 | Internal Error | X | YES | X | X
3255350 | Performance/Hang | X | NO | N/A |
3887704 | Internal Error | X | NO | N/A |
3405237 | Engine Crash | X | YES | X |
3952322 | Feature Unusable | X | YES | X |
4033889 | Incorrect Results | | NO | N/A |
4060997 | Internal Error | X | YES | X |
4134776 | Internal Error | X | NO | N/A |
4149779 | Incorrect Results | | NO | N/A |
2964132 | Internal Error | X | YES | X |
3361118 | Internal Error | X | YES | X |
7
Design Diversity More Analysis
8
Design Diversity Even More Analysis
Total Bug Scripts: 40
Failures: 40
Outcome | S.E. | N.S.E.
1 out of 2 DBMS products failing | 18 | 11
Both DBMS products failing, non-divergent | 8 | 2
Both DBMS products failing, divergent | 1 | 0
  • Bottom Line
  • Not a statistical sample (not enough time)
  • 2/40 (5%) of failures would not be detected across
    both products
  • Out of the 20 failures for Oracle 10.0, 6 were
    N.S.E.; 4 of these 6 failures would be resolved
    by running a past release in tandem with the
    newer release
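
The summary counts can be cross-checked against the two result tables with a short script. The following is a minimal sketch in Python, not the tooling used in the study. The bug IDs and flags are transcribed from the tables; the names (classify, tally, BUGS_10_0 and so on) are hypothetical, and two assumptions are made: a failure counts as S.E. if it is self-evident in at least one product that fails, and divergence is only counted when both products fail.

```python
from collections import Counter

# Oracle 9.2 bug scripts; none of them failed on 10.0.
SE_9_2 = [2357784, 2299898, 2200057, 2405258, 2716265, 2054241, 2485871,
          2670497, 2659126, 2064478, 2624737, 1918751, 2576353]
NSE_9_2 = [2202561, 2221401, 2739068, 2683540, 2991842, 2286290, 2700474]

# Oracle 10.0 bug scripts:
# (bug id, S.E. on 10.0, fails on 9.2, S.E. on 9.2, divergent)
BUGS_10_0 = [
    (5731063, True, False, False, False),
    (3664284, False, False, False, False),
    (4582808, False, False, False, False),
    (3895678, True, True, True, False),
    (3893571, True, True, True, False),
    (3903063, False, True, False, False),
    (3912423, True, False, False, False),
    (4029857, True, True, True, False),
    (4156695, False, True, False, False),
    (2929556, True, True, True, True),
    (3255350, True, False, False, False),
    (3887704, True, False, False, False),
    (3405237, True, True, True, False),
    (3952322, True, True, True, False),
    (4033889, False, False, False, False),
    (4060997, True, True, True, False),
    (4134776, True, False, False, False),
    (4149779, False, False, False, False),
    (2964132, True, True, True, False),
    (3361118, True, True, True, False),
]


def classify(se_own, other_fails, se_other, divergent):
    """Return (outcome, evidence) for one bug script."""
    # Assumption: self-evident if any failing product shows it plainly.
    evidence = "S.E." if (se_own or se_other) else "N.S.E."
    if not other_fails:
        return ("one fails", evidence)
    outcome = "both fail, divergent" if divergent else "both fail, non-divergent"
    return (outcome, evidence)


def tally():
    """Count bug scripts per (outcome, evidence) category."""
    counts = Counter()
    for _ in SE_9_2:
        counts[classify(True, False, False, False)] += 1
    for _ in NSE_9_2:
        counts[classify(False, False, False, False)] += 1
    for _bug, se_own, other_fails, se_other, divergent in BUGS_10_0:
        counts[classify(se_own, other_fails, se_other, divergent)] += 1
    return counts


def nse_10_0_masked_by_9_2():
    """10.0 failures that are N.S.E., and how many of those 9.2 does not share."""
    nse = [row for row in BUGS_10_0 if not row[1]]
    not_on_9_2 = [row for row in nse if not row[2]]
    return len(nse), len(not_on_9_2)


if __name__ == "__main__":
    for key, n in sorted(tally().items()):
        print(key, n)
    print(nse_10_0_masked_by_9_2())
```

Under these assumptions the tally reproduces the summary counts, and the last function reproduces the "4 out of 6" figure for Oracle 10.0.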

9
Reliability Analysis (Part II)
  • Part II: CASRE Reliability Analysis of NCW Data
    Collector
  • Extract the following from failure logs using
    JavaScript: time of program start, time of
    program termination, time of thread terminations,
    and exception or failure messages
  • Parse failures manually into CASRE input format
  • Categorize by severity utilizing chart on next
    slide
  • Compare 2 consecutive events (CALOE08 and MAGTF08)
    as well as 2 consecutive lifecycles within the
    same event (Integration and Execution)

10
Severity
Severity Code | Failure Description
9 | Failure Causes Machine to be Rebooted Causing Catastrophic Loss
8 | Failure Causes Program Abort
7 | Failure Causes Program Thread Abort
5 | Failure Causes Record Not to be Written, Thread Continues
3 | Failure Causes Incorrect Data to be Written, Thread Continues
1 | Failure is Caught, Handled and Recovers Correctly
11
Using CASRE
12
Using CASRE (cont.)
13
CASRE Input Format
TIME BETWEEN FAILURES FORMAT: N/A
FAILURE COUNT FORMAT
Interval Number (int) | Number of Errors (float) | Interval Length (float) | Error Severity (int)
Example:
Hours
1 5.0 40.0 1
1 3.0 40.0 2
1 2.0 40.0 3
2 4.0 40.0 1
2 3.0 40.0 3
3 7.0 40.0 1
4 5.0 40.0 1
5 4.0 40.0 1
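
Producing this layout from raw failure events is mechanical. The following is a minimal Python sketch, assuming each failure is available as a pair (hours since test start, severity code) and that all test intervals have the same length. The function names bucket_failures and format_failure_counts are hypothetical, and the whitespace-separated layout is inferred from the example above rather than from the CASRE documentation.

```python
from collections import Counter


def bucket_failures(events, interval_length):
    """Group (hours_since_start, severity) events into fixed-length intervals.

    Returns records of (interval number, error count, interval length,
    severity), sorted by interval and then by severity. Intervals are
    numbered from 1. Intervals with no failures are omitted in this sketch.
    """
    counts = Counter()
    for hours, severity in events:
        interval = int(hours // interval_length) + 1
        counts[(interval, severity)] += 1
    return [
        (interval, float(n), float(interval_length), severity)
        for (interval, severity), n in sorted(counts.items())
    ]


def format_failure_counts(records, time_unit="Hours"):
    """Render records in the failure count layout: a time-unit line, then
    one 'interval count length severity' line per record."""
    lines = [time_unit]
    for interval, count, length, severity in records:
        lines.append(f"{int(interval)} {float(count):.1f} {float(length):.1f} {int(severity)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up events purely for illustration.
    events = [(3.0, 1), (10.5, 1), (39.9, 3), (40.0, 1), (95.0, 8)]
    print(format_failure_counts(bucket_failures(events, 40.0)), end="")
```

Whether CASRE expects intervals with zero failures to be listed explicitly is not covered here and should be checked against the CASRE user's guide before using output like this.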
14
CASRE Failure Counts
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
15
CASRE Time Between Failures
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
16
CASRE Failure Intensity
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
17
CASRE Cumulative Failures
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
18
CASRE Test Interval Length
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
19
Detecting Reliability Trends
  • Running Average
  • Not as useful for failure count data (unless test
    intervals are of equal length)
  • Computes the running average of the time between
    successive failures for time between failures
    data, or the running average of the number of
    failures per interval for failure count data.
  • For failure count data, a running average that
    decreases with time (fewer failures per test
    interval) indicates reliability growth; for time
    between failures data, an increasing running
    average indicates growth.
  • Laplace Test
  • Not as useful for failure count data (unless test
    intervals are of equal length)
  • Null hypothesis: occurrences of failures follow a
    homogeneous Poisson process
  • If the test statistic decreases with increasing
    failure number, the null hypothesis can be
    rejected in favor of reliability growth at an
    appropriate significance level. If it increases
    with increasing failure number, the null
    hypothesis can be rejected in favor of
    reliability decay.
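
Both trend checks are simple to compute by hand for time between failures data. The following is a minimal Python sketch with hypothetical function names. It uses the common failure-truncated form of the Laplace statistic (observation ends at the last failure), which is an assumption here; CASRE's own computation may differ in detail.

```python
import math


def running_average(values):
    """Running average after each observation: r_k = (v_1 + ... + v_k) / k."""
    averages = []
    total = 0.0
    for k, value in enumerate(values, start=1):
        total += value
        averages.append(total / k)
    return averages


def laplace_statistic(times_between_failures):
    """Laplace trend statistic for failure-truncated data.

    With cumulative failure times s_1..s_n and T = s_n:
        u = (mean(s_1..s_{n-1}) - T/2) / (T * sqrt(1 / (12 * (n - 1))))
    Negative values suggest reliability growth (failures spreading out),
    positive values suggest decay, and values near zero are consistent
    with a homogeneous Poisson process.
    """
    n = len(times_between_failures)
    if n < 2:
        raise ValueError("need at least two failures")
    cumulative = []
    total = 0.0
    for dt in times_between_failures:
        total += dt
        cumulative.append(total)
    end_time = cumulative[-1]
    mean_earlier = sum(cumulative[:-1]) / (n - 1)
    return (mean_earlier - end_time / 2.0) / (end_time * math.sqrt(1.0 / (12.0 * (n - 1))))


if __name__ == "__main__":
    growing = [1.0, 2.0, 3.0, 4.0]  # made-up data: failures getting rarer
    print(running_average(growing))
    print(laplace_statistic(growing))
```

Under the null hypothesis the statistic is approximately standard normal, so it can be compared against normal critical values at the chosen significance level.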

20
Running Average
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
21
Laplace Test
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
22
CASRE Cum Failure Predictions
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
23
CASRE Prediction Setup
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
24
CASRE Reliability Prediction
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
25
CASRE Prequential Likelihood
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
26
CASRE Model-Ranking
[Figures: CALOE and MAGTF Execution; MAGTF Integration and Execution]
27
Reliability Models
  • Haven't been able to get these to run yet.
  • Instruction manual says many of the built-in
    models only work with time between failures data.
  • Doubt there would be much utility with failure
    count data

28
Conclusion/Follow-Up
  • It actually would be QUITE easy to integrate
    auto-generation of Failure Count or Time Between
    Failures output into my environment
  • This would facilitate quick trend analysis
  • Reliability trends, not the actual numbers, are
    what is important