Title: A Case Study In Reliability Analysis
1. A Case Study In Reliability Analysis
2. Background (cont.)
- Net Centric Warfare Data Collector
  - Approximately 180 KLOC
  - Written in Java; heavily uses JDBC and RMI from the J2EE package
  - CMMI Level 1
  - Utilizes Oracle 9.2 EE OTS DBMS
  - Required reliability: Moderate
3. Background
[Diagram: GLOBAL VISION NETWORK (GVN) linking four sites: LM Mission Sys (Colorado Springs, CO), Light House (Suffolk, VA), Integrated Warfare Development Center (Fort Worth, TX), and LM Sim Training (Orlando, FL). Node labels shown in the diagram: FUSION, CAOC, DC, VBMS, WCS, JSAF, JTAC, JIMM, JABE, Other Simulators, Threat Sims.]
4. Design Diversity (Part I)
- Part I: Oracle DBMS Design Diversity
  - Acquire 20 bug reports each from Oracle 9.2 and Oracle 10.0
  - Bugs had to be Date Independent, Easy To Reproduce, Type Independent
  - Results would then be classified by self-evidence and divergence
5. Design Diversity Results: 9.2 Bugs
| Bug | Type | 9.2 S.E. | 10.0 Fails? | 10.0 S.E. | Divergent |
|---|---|---|---|---|---|
| 2357784 | Internal Error | X | NO | N/A | X |
| 2299898 | Performance/Hang | X | NO | N/A | X |
| 2202561 | Incorrect Results | | NO | N/A | |
| 2221401 | Incorrect Results | | NO | N/A | |
| 2739068 | Incorrect Results | | NO | N/A | |
| 2683540 | Incorrect Results | | NO | N/A | |
| 2991842 | Incorrect Results | | NO | N/A | |
| 2200057 | Internal Error | X | NO | N/A | |
| 2405258 | Internal Error | X | NO | N/A | |
| 2716265 | Internal Error | X | NO | N/A | |
| 2054241 | Performance/Hang | X | NO | N/A | |
| 2485871 | Internal Error | X | NO | N/A | |
| 2670497 | Internal Error | X | NO | N/A | |
| 2659126 | Internal Error | X | NO | N/A | X |
| 2064478 | Internal Error | X | NO | N/A | |
| 2624737 | Internal Error | X | NO | N/A | X |
| 1918751 | Internal Error | X | NO | N/A | |
| 2286290 | Incorrect Results | | NO | N/A | X |
| 2700474 | Incorrect Results | | NO | N/A | |
| 2576353 | Internal Error | X | NO | N/A | |
6. Design Diversity Results: 10.0 Bugs
| Bug | Type | 10.0 S.E. | 9.2 Fails? | 9.2 S.E. | Divergent |
|---|---|---|---|---|---|
| 5731063 | Internal Error | X | NO | N/A | |
| 3664284 | Incorrect Results | | NO | N/A | |
| 4582808 | Incorrect Results | | NO | N/A | |
| 3895678 | Internal Error | X | YES | X | |
| 3893571 | Internal Error | X | YES | X | |
| 3903063 | Incorrect Results | | YES | | |
| 3912423 | Internal Error | X | NO | N/A | |
| 4029857 | Engine Crash | X | YES | X | |
| 4156695 | Incorrect Results | | YES | | |
| 2929556 | Internal Error | X | YES | X | X |
| 3255350 | Performance/Hang | X | NO | N/A | |
| 3887704 | Internal Error | X | NO | N/A | |
| 3405237 | Engine Crash | X | YES | X | |
| 3952322 | Feature Unusable | X | YES | X | |
| 4033889 | Incorrect Results | | NO | N/A | |
| 4060997 | Internal Error | X | YES | X | |
| 4134776 | Internal Error | X | NO | N/A | |
| 4149779 | Incorrect Results | | NO | N/A | |
| 2964132 | Internal Error | X | YES | X | |
| 3361118 | Internal Error | X | YES | X | |
7. Design Diversity: More Analysis
8. Design Diversity: Even More Analysis
| Total Bug Scripts | Failures | 1 out of 2 failing, S.E. | 1 out of 2 failing, N.S.E. | Both DBMS products failing, non-divergent, S.E. | Both DBMS products failing, non-divergent, N.S.E. | Both DBMS products failing, divergent, S.E. | Both DBMS products failing, divergent, N.S.E. |
|---|---|---|---|---|---|---|---|
| 40 | 40 | 18 | 11 | 8 | 2 | 1 | 0 |
- Bottom Line
  - Not a statistical sample (not enough time)
  - 2 of the 40 failures (5%) were not detected across both products
  - Of the 20 failures for Oracle 10.0, 6 were N.S.E.; 4 of those 6 would be resolved by running a past release in tandem with the future release
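The summary counts for the "both products failing" columns can be reproduced from the result tables. The following is a minimal sketch, not the script used in the study: the BugResult record and the category function are hypothetical names, the rows are the eleven Oracle 10.0 bugs that also failed on 9.2, and the mapping of each X to a column reflects one reading of the flattened tables. It also assumes a joint failure counts as self-evident if either release shows it self-evidently.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class BugResult:
    bug_id: str
    origin_se: bool          # self-evident on the release the bug was reported against
    other_fails: bool        # does the script also fail on the other release?
    other_se: bool = False   # self-evident on the other release (only if it fails there)
    divergent: bool = False  # the two releases fail in different ways


def category(r):
    """Return the summary-table bucket for one bug script."""
    if not r.other_fails:
        return ("1 of 2 failing", "S.E." if r.origin_se else "N.S.E.")
    # Assumption: self-evident on either release makes the joint failure self-evident.
    evidence = "S.E." if (r.origin_se or r.other_se) else "N.S.E."
    kind = "divergent" if r.divergent else "non-divergent"
    return ("both failing, " + kind, evidence)


# The eleven 10.0 bug scripts marked "9.2 Fails? = YES" in the results table.
rows = [
    BugResult("3895678", True, True, True),
    BugResult("3893571", True, True, True),
    BugResult("3903063", False, True),
    BugResult("4029857", True, True, True),
    BugResult("4156695", False, True),
    BugResult("2929556", True, True, True, True),
    BugResult("3405237", True, True, True),
    BugResult("3952322", True, True, True),
    BugResult("4060997", True, True, True),
    BugResult("2964132", True, True, True),
    BugResult("3361118", True, True, True),
]

tally = Counter(category(r) for r in rows)
for bucket, count in sorted(tally.items()):
    print(bucket, count)
```

The failures that neither product flags on its own are the non-divergent N.S.E. bucket, which is where the "not detected across both products" figure comes from.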
9. Reliability Analysis (Part II)
- Part II: CASRE Reliability Analysis of NCW Data Collector
  - Extract the following from Failure Logs using JavaScript: Time of Program Start, Time of Program Termination, Time of Thread Terminations, and Exception or Failure Messages
  - Parse failures manually into CASRE input format
  - Categorize by severity utilizing chart on next slide
  - Compare 2 consecutive events (CALOE08 and MAGTF08) as well as 2 consecutive lifecycles within the same event (Integration and Execution)
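The extraction step above could look roughly like the sketch below. It is written in Python rather than the JavaScript used in the study, and the log layout (a timestamp followed by START, STOP, THREAD_EXIT or EXCEPTION) is entirely hypothetical, since the slides do not show the real log format. The sample lines are made up for illustration.

```python
import re
from datetime import datetime

# Hypothetical log layout: "YYYY-MM-DD HH:MM:SS KIND optional detail"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(START|STOP|THREAD_EXIT|EXCEPTION)\s*(.*)$"
)


def extract_events(lines):
    """Return (timestamp, kind, detail) tuples for recognised log lines."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip lines that carry no start/stop/failure information
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
        events.append((stamp, match.group(2), match.group(3)))
    return events


# Made-up sample lines in the assumed layout.
sample = [
    "2008-06-02 08:00:00 START",
    "2008-06-02 09:15:30 EXCEPTION java.sql.SQLException: connection reset",
    "2008-06-02 09:15:31 THREAD_EXIT writer-3",
    "debug: heartbeat ok",
    "2008-06-02 17:00:00 STOP",
]

events = extract_events(sample)
run_hours = (events[-1][0] - events[0][0]).total_seconds() / 3600
failures = [e for e in events if e[1] in ("EXCEPTION", "THREAD_EXIT")]
print(run_hours, len(failures))
```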
10. Severity
| Severity Code | Failure Description |
|---|---|
| 9 | Failure Causes Machine to be Rebooted Causing Catastrophic Loss |
| 8 | Failure Causes Program Abort |
| 7 | Failure Causes Program Thread Abort |
| 5 | Failure Causes Record Not to be Written, Thread Continues |
| 3 | Failure Causes Incorrect Data to be Written, Thread Continues |
| 1 | Failure is Caught, Handled and Recovers Correctly |
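If the severity chart were encoded for automated categorization, an integer enumeration would be one option. This is only a sketch: the member names are invented here, and only the numeric codes come from the chart.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity codes from the chart; member names are illustrative."""
    HANDLED = 1                  # caught, handled, recovers correctly
    INCORRECT_DATA_WRITTEN = 3   # thread continues
    RECORD_NOT_WRITTEN = 5       # thread continues
    THREAD_ABORT = 7
    PROGRAM_ABORT = 8
    REBOOT_CATASTROPHIC = 9


def thread_survives(severity):
    """True when the chart says the thread keeps running after the failure."""
    return severity < Severity.THREAD_ABORT


observed = [Severity.HANDLED, Severity.RECORD_NOT_WRITTEN, Severity.THREAD_ABORT]
worst = max(observed)
print(worst.name, int(worst), thread_survives(worst))
```

Because the codes are ordered, comparisons such as max() pick out the worst failure in a run directly.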
11. Using CASRE
12. Using CASRE (cont.)
13. CASRE Input Format
TIME BETWEEN FAILURES FORMAT: N/A
FAILURE COUNT FORMAT
| Interval Number (int) | Number of Errors (float) | Interval Length (float) | Error Severity (int) |
|---|---|---|---|
Example:
Hours
1 5.0 40.0 1
1 3.0 40.0 2
1 2.0 40.0 3
2 4.0 40.0 1
2 3.0 40.0 3
3 7.0 40.0 1
4 5.0 40.0 1
5 4.0 40.0 1
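A failure count file in the layout shown above (a time-unit line followed by four columns per row) could be generated along these lines. The function name is hypothetical, and the single-space column separator is an assumption; the CASRE documentation should be checked for the exact delimiter it accepts. The rows are the slide's own example values.

```python
def format_failure_counts(unit, rows):
    """Build failure-count text: a unit line, then one row per
    (interval number, number of errors, interval length, severity)."""
    lines = [unit]
    for interval, errors, length, severity in rows:
        # Errors and length are floats, interval and severity are ints,
        # matching the column types listed on the slide.
        lines.append(f"{interval} {errors:.1f} {length:.1f} {severity}")
    return "\n".join(lines) + "\n"


example_rows = [
    (1, 5.0, 40.0, 1),
    (1, 3.0, 40.0, 2),
    (1, 2.0, 40.0, 3),
    (2, 4.0, 40.0, 1),
    (2, 3.0, 40.0, 3),
    (3, 7.0, 40.0, 1),
    (4, 5.0, 40.0, 1),
    (5, 4.0, 40.0, 1),
]

text = format_failure_counts("Hours", example_rows)
print(text, end="")
```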
14. CASRE Failure Counts
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
15. CASRE Time Between Failures
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
16. CASRE Failure Intensity
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
17. CASRE Cumulative Failures
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
18. CASRE Test Interval Length
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
19. Detecting Reliability Trends
- Running Average
  - Not as useful for Failure Count data (unless test intervals are equal length)
  - Computes the running average of the time between successive failures for Time Between Failures data, or the running average of the number of failures per interval for Failure Count data
  - For Failure Count data, if the running average decreases with time (fewer failures per test interval), reliability growth is indicated; for Time Between Failures data, growth shows up as an increasing running average
- Laplace Test
  - Not as useful for Failure Count data (unless test intervals are equal length)
  - Null hypothesis: occurrences of failures follow a homogeneous Poisson process
  - If the test statistic decreases with increasing failure number, the null hypothesis can be rejected in favor of reliability growth at an appropriate significance level; if it increases, the null hypothesis can be rejected in favor of reliability decay
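Both trend tests are short enough to compute by hand for failure count data. The sketch below assumes equal-length test intervals and uses the commonly cited grouped-data form of the Laplace factor; it is not confirmed to be the exact computation CASRE performs. The counts are the per-interval totals from the input-format example, not measured results.

```python
import math


def running_average(counts):
    """Running average of failures per interval after each interval."""
    averages, total = [], 0.0
    for k, n in enumerate(counts, start=1):
        total += n
        averages.append(total / k)
    return averages


def laplace_factor(counts):
    """Laplace factor for failure counts over k equal-length intervals.

    u = (sum((i-1) * n_i) - (k-1)/2 * N) / sqrt((k^2 - 1)/12 * N)
    where n_i is the count in interval i and N is the total count.
    Negative values suggest reliability growth, positive values decay.
    """
    k = len(counts)
    total = sum(counts)
    if k < 2 or total == 0:
        raise ValueError("need at least two intervals and one failure")
    # enumerate() starts at 0, which supplies the (i - 1) weight directly.
    weighted = sum(i * n for i, n in enumerate(counts))
    return (weighted - (k - 1) / 2 * total) / math.sqrt((k * k - 1) / 12 * total)


# Per-interval totals from the format example (severities summed per interval).
counts = [10, 7, 7, 5, 4]
print(running_average(counts))
print(laplace_factor(counts))
```

A factor below about -1.645 would reject the homogeneous Poisson hypothesis in favor of growth at the one-sided 5% level, under the usual normal approximation.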
20. Running Average
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
21. Laplace Test
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
22. CASRE Cum Failure Predictions
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
23. CASRE Prediction Setup
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
24. CASRE Reliability Prediction
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
25. CASRE Prequential Likelihood
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
26. CASRE Model-Ranking
[Plots: CALOE/MAGTF Execution; MAGTF Integration/Execution]
27. Reliability Models
- Haven't been able to get these to run yet.
- Instruction manual says many of the built-in models only work with Time Between Failures data.
- Doubt there would be much utility with Failure Count data.
28. Conclusion/Follow-Up
- It actually would be QUITE easy to integrate Failure Count or Time Between Failures output auto-generation into my environment
- This would facilitate quick trend analysis
- Reliability trends, not the actual numbers, are what is important
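As a rough idea of what the auto-generation could involve, the sketch below turns a list of failure times into both data forms. The function names, the choice of hours as the unit, and the sample times are all assumptions for illustration; nothing here comes from the actual environment.

```python
import math


def time_between_failures(failure_times, start=0.0):
    """Gaps between successive failures, the first measured from start."""
    gaps, previous = [], start
    for t in sorted(failure_times):
        gaps.append(t - previous)
        previous = t
    return gaps


def failure_counts(failure_times, interval_length, end, start=0.0):
    """Number of failures in each fixed-length interval between start and end."""
    n_intervals = math.ceil((end - start) / interval_length)
    counts = [0] * n_intervals
    for t in failure_times:
        # A failure exactly at `end` is assigned to the last interval.
        index = min(int((t - start) // interval_length), n_intervals - 1)
        counts[index] += 1
    return counts


# Made-up failure times, in hours since program start.
times = [0.5, 1.5, 1.75, 4.0]
print(time_between_failures(times))
print(failure_counts(times, interval_length=2.0, end=6.0))
```

Fixed-length intervals keep the running average and Laplace test meaningful for failure count data, which is the caveat raised on the trend-detection slide.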