Title: Probabilistic fault detection in network communication
1. Probabilistic fault detection in network communication
- Developing a causal probabilistic network for fault detection
16-06-2006
Gr839b: Anders Nickelsen and Jesper Grønbæk
2. Disposition
- Project background
- Development of the CPN
- Fault detector structure
- Evaluating the CPN
- Conclusion
- Dynamic Bayesian Network
3. Project background
4. Project background
Dependability and fault tolerance
- Basis: high dependability in a car-to-car scenario
- Critical dependability requirements for services
  - Availability and reliability
- Faults and failures
  - The effects of a fault propagate, possibly resulting in a new fault
  - A fault causes another fault
  - Service failure: deviation from the service specification
- The purpose of fault tolerance is to handle faults before they lead to failure.
5. Project background
Case scenario
- Accident scenario
  - Black-box information sent to the road directorate
  - Accident information for insurance and emergency services
- Application layer fault tolerance
  - End-to-end service perspective
  - Faults may be unobservable
- Observations may be
  - Ambiguous
  - Inconsistent
  - Missing
- Multi-layer observations are used to determine which fault has appeared
6. Development of the CPN
7. Development of the CPN
CPN background I
- In general, humans do inference by intuition
  - Combines knowledge of influencing factors and belief
  - Examples: football game, wet grass
- Reasoning under uncertainty
  - Enable computers to make inferences based on observations
  - Artificial Intelligence
- Wet grass example of factors and relations
  - C, S, R, W are stochastic variables (Cloudy, Sprinkler, Rain, Wet grass)
  - Probabilistic relations between the variables
  - Use of probability theory
- Inference of P(S | R, W, C) requires the joint probability of all variables
- Handling the complete joint probability is intractable, as the number of probabilities is O(2^n) for n binary variables
[Figure: wet grass network with the nodes Cloudy, Sprinkler, Rain and Wet grass]
8. Development of the CPN
CPN background II
- Reduce the joint probability to a tractable size
  - Utilize assumptions of variable independence
- A CPN, N = (G, P), consists of
  - G: a directed acyclic graph containing the stochastic variables, with edges representing causal relations (dependencies) between the variables
    - Leaving out irrelevant edges represents independencies
  - P: prior and conditional probabilities representing the strengths of the relations
- Chain rule with Markov property
  - P(C, S, R, W) = P(C) P(S | C) P(R | C) P(W | S, R)
  - Number of probabilities reduced to O(n · 2^k), where k is the maximum number of parents of a variable
  - Inference possible, e.g. P(S | R, W, C)
[Figure: wet grass network. Cloudy, with P(C), is the parent of Sprinkler, with P(S | C), and Rain, with P(R | C); Sprinkler and Rain are the parents of Wet grass, with P(W | S, R)]
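The factorization of the wet grass network can be tried out with a few lines of Python. This is only a minimal sketch: the probability values are illustrative assumptions (the slides give no numbers), and the helper names `joint` and `posterior` are hypothetical. Inference is done by brute-force enumeration, which is only practical for a network this small.

```python
from itertools import product

# Illustrative CPT values (assumed, not taken from the slides); True means "yes".
P_C = {True: 0.5, False: 0.5}
P_S_GIVEN_C = {True: 0.1, False: 0.5}            # P(S = yes | C)
P_R_GIVEN_C = {True: 0.8, False: 0.2}            # P(R = yes | C)
P_W_GIVEN_SR = {(True, True): 0.99, (True, False): 0.9,
                (False, True): 0.9, (False, False): 0.0}   # P(W = yes | S, R)


def bernoulli(p_yes, value):
    """Probability of a binary value, given the probability of "yes"."""
    return p_yes if value else 1.0 - p_yes


def joint(c, s, r, w):
    """Chain rule: P(C, S, R, W) = P(C) P(S | C) P(R | C) P(W | S, R)."""
    return (P_C[c]
            * bernoulli(P_S_GIVEN_C[c], s)
            * bernoulli(P_R_GIVEN_C[c], r)
            * bernoulli(P_W_GIVEN_SR[(s, r)], w))


def posterior(query, evidence):
    """P(query = yes | evidence), summing the joint over unobserved variables."""
    names = ("C", "S", "R", "W")
    numerator = denominator = 0.0
    for values in product((True, False), repeat=4):
        world = dict(zip(names, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this assignment contradicts the evidence
        p = joint(*values)
        denominator += p
        if world[query]:
            numerator += p
    return numerator / denominator


p_sprinkler = posterior("S", {"W": True})
p_sprinkler_rain = posterior("S", {"W": True, "R": True})
print(f"P(S = yes | W = yes)          = {p_sprinkler:.4f}")
print(f"P(S = yes | W = yes, R = yes) = {p_sprinkler_rain:.4f}")
```

With these assumed numbers the factored model needs 1 + 2 + 2 + 4 = 9 probabilities instead of the 15 required for the full joint over four binary variables. Adding the evidence that it rained lowers the belief in the sprinkler, which is the kind of reasoning the CPN is meant to automate.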
9. Development of the CPN
Motivation for considering CPNs for fault detection
- Areas of application
  - Decision support systems for medical diagnosis [Andreassen, 2001]
  - Fault localization in network management [Steinder and Sethi, 2004]
- Medical diagnosis corresponds to fault detection
  - Diseases correspond to network faults
  - Symptoms correspond to observation points
  - Knowledge corresponds to causal relations between faults and observation points
10. Development of the CPN
Steps in development
- Four-step process
  - Explore knowledge domain
    - Fault causes and effects
    - Choice of basic fault model
    - Communication process and observations
  - Develop CPN structure from knowledge
    - Identify stochastic variables
    - Define dependencies
    - Define states
  - Attain probabilities
    - Prior and conditional probability distributions
  - Verify CPN
    - Structure and probabilities
11. Development of the CPN
Knowledge domain
- Case
  - Aim: find the cause of insufficient throughput
  - Link breakdown: causes congestion
  - Source of noise: affects link condition (bad link)
  - Both are considered permanent faults in the case
- Communication controlled by TCP
  - Congestion avoidance and flow control by windowing
  - Reacts to packet loss by reducing packet transmissions, which leads to a reduction in throughput
  - Inherently cannot distinguish between packet loss causes
- Observations
  - Made available by the TCP communication process, e.g. RTT
  - Unreliable: missing, delayed and noisy
[Figure: basic fault model graph]
12. Development of the CPN
Structure development
- Identified variables
  - Model based on faults (hidden hypothesis nodes)
  - Intermediate variables (hidden nodes)
  - Observations
- Other considered elements
  - TCP
  - Application data, router service time, ...
- Each variable has been defined with a finite set of states
  - Hypothesis nodes: based on network states, normal and fault
  - Observations: based on features
[Figure: basic model]
13. Development of the CPN
Obtaining probabilities
- Conditional probability
  - P(FRR = high | bad link = yes)
- Methods for obtaining probabilities
  - Knowledge-based approach
    - Captures the knowledge and experience of experts
    - Used when reliable measurements are not available
  - Data-based approach
    - Fitting distributions to data sets
  - Learning probabilities
    - Based on representative learning cases of the conditionals
    - May be difficult to obtain in reality, e.g. for fault states
    - Offline learning, batch learning
    - Handles hidden nodes and missing observations
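As an illustration of the data-based route, a conditional such as P(FRR = high | bad link = yes) can be estimated from learning cases by counting. The sketch below is a simplification under stated assumptions: the cases are made up, `estimate_cpt` is a hypothetical helper, and incomplete cases are simply skipped. Learning with hidden nodes and missing observations, as mentioned on the slide, would need an iterative method such as EM instead.

```python
from collections import Counter

# Hypothetical learning cases as (bad_link, frr) pairs; None marks a missing value.
cases = [
    ("yes", "high"), ("yes", "high"), ("yes", "low"),
    ("no", "low"), ("no", "low"), ("no", "high"), ("no", "low"),
    ("yes", None),
]

FRR_STATES = ("high", "low")


def estimate_cpt(cases, child_states, alpha=1.0):
    """Estimate P(child | parent) by counting, with Laplace smoothing `alpha`."""
    pair_counts = Counter()
    parent_counts = Counter()
    for parent, child in cases:
        if parent is None or child is None:
            continue  # simplification: incomplete cases are ignored
        pair_counts[(parent, child)] += 1
        parent_counts[parent] += 1
    cpt = {}
    for parent, total in parent_counts.items():
        for child in child_states:
            cpt[(child, parent)] = (pair_counts[(parent, child)] + alpha) / (
                total + alpha * len(child_states))
    return cpt


cpt = estimate_cpt(cases, FRR_STATES)
print("P(FRR = high | bad link = yes) ~", cpt[("high", "yes")])
```

The smoothing term keeps a state that never occurs in the learning cases from receiving probability zero, which matters when fault states are rare in the data.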
14. Development of the CPN
Verifying the CPN
- Verifying structure
  - Arrows and their directions are important
  - Wrong directions and dependencies may lead to wrong conclusions when inferring in the CPN
- Verifying probabilities
  - Difficult to verify probabilities accurately
[Figure: two graphs, each with the nodes Congestion, Bad link and Packet loss]
15. Fault detector structure
16. Fault detector structure
Presenting the components of the detector
- Input: observations based on network traffic
- Output: detection of a particular fault
17. Fault detector structure
Network and observation points I
- Observations collected from network traffic
- Example: NS-2 trace (network)
    ...
    + 1.446678 0 1 tcp 1340 --- 101
    - 1.446678 0 1 tcp 1340 --- 101
    d 1.446678 0 1 tcp 1340 --- 101    (dropped packet, FRR)
    r 1.450758 1 0 ack 40 --- 91       (RTT sample)
    + 1.450758 0 1 tcp 1340 --- 102
    - 1.450758 0 1 tcp 1340 --- 102
    ...
    r 1.520422 1 0 ack 40 --- 100      (2nd dupack, no RTT sample)
    + 1.520422 0 1 tcp 1340 --- 114
    - 1.520422 0 1 tcp 1340 --- 114
    r 1.52471 1 0 ack 40 --- 100       (3rd dupack, no RTT sample)
    + 1.52471 0 1 tcp 1340 --- 101     (packet retransmission, PRR)
    - 1.52471 0 1 tcp 1340 --- 101
    ...
- Observation points: FRR, RTT, PRR
18. Fault detector structure
Network and observation points II
- Congestion fault
- Observations and throughput
[Figure: plots for a congestion fault. Panels: congestion (load), round trip time (ms), packet retransmission rate, frame retransmission rate, throughput (KB/s)]
19. Fault detector structure
Observation processing and evidence I
- Sampling-based approach
- Observation processing
  - Moving average
    - Observations may be missing, delayed or noisy
    - Fixed time window
    - New observation sample at time t
    - An observation consists of zero or more samples
- Discretization of observations
  - Mapping observations to states
  - RTT: discretized into 10 equally sized bins over 62-105 ms (4.3 ms intervals)
  - PRR and FRR
    - High/Low state
    - Threshold setting
- Evidence vector
  - e = {RTT = 5, PRR = High, FRR = Low}
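The observation processing chain can be sketched as follows. The RTT range and bin count come from the slide; everything else is an assumption made for illustration: the class `WindowAverage`, the helper names, the rate threshold of 0.05, and the choice to clamp out-of-range RTT values to the outermost bins.

```python
from collections import deque

RTT_MIN_MS, RTT_MAX_MS, RTT_BINS = 62.0, 105.0, 10   # values from the slide
RATE_THRESHOLD = 0.05   # assumed High/Low threshold, not given in the slides


class WindowAverage:
    """Mean of the samples received within the last `window_s` seconds."""

    def __init__(self, window_s):
        self.window_s = window_s
        self.samples = deque()            # (timestamp, value) pairs

    def add(self, t, value):
        self.samples.append((t, value))

    def mean(self, now):
        while self.samples and self.samples[0][0] <= now - self.window_s:
            self.samples.popleft()        # drop samples that left the window
        if not self.samples:
            return None                   # zero samples: observation is missing
        return sum(v for _, v in self.samples) / len(self.samples)


def rtt_state(rtt_ms):
    """Map an RTT in ms to a bin index 0..9 (assumed: out-of-range is clamped)."""
    if rtt_ms is None:
        return "n/a"
    width = (RTT_MAX_MS - RTT_MIN_MS) / RTT_BINS      # 4.3 ms
    index = int((rtt_ms - RTT_MIN_MS) // width)
    return min(max(index, 0), RTT_BINS - 1)


def rate_state(rate):
    """Map a retransmission rate to the High/Low state."""
    if rate is None:
        return "n/a"
    return "High" if rate >= RATE_THRESHOLD else "Low"


rtt = WindowAverage(window_s=1.0)
for t, sample_ms in [(0.2, 83.0), (0.5, 85.0), (0.9, 87.0)]:
    rtt.add(t, sample_ms)

evidence = {"RTT": rtt_state(rtt.mean(now=1.0)),
            "PRR": rate_state(0.10),
            "FRR": rate_state(0.01)}
print(evidence)
```

Returning `None` for an empty window keeps a missing observation distinct from a low one, so the corresponding node can be left without evidence during inference.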
20. Fault detector structure
Inference in the CPN, decision and detection I
- Evidence is acquired
- Observable nodes are initialized
- Inference is conducted and marginal probabilities are available
- A decision can be made about the network state
- Detection is based on a transition from the normal to the fault state
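The decision and detection step can be sketched as a threshold on the marginal fault probability followed by a check for a normal-to-fault transition. The decision threshold of 0.5, the function name `detect_faults` and the probability sequence are assumptions for illustration; the slides do not state the decision rule that was used.

```python
def detect_faults(posteriors, threshold=0.5):
    """Return the indices where the decided state changes from normal to fault.

    `posteriors` holds P(fault | evidence) for consecutive evidence vectors.
    """
    detections = []
    previous_state = "normal"
    for i, p_fault in enumerate(posteriors):
        state = "fault" if p_fault > threshold else "normal"
        if previous_state == "normal" and state == "fault":
            detections.append(i)
        previous_state = state
    return detections


# Hypothetical marginals P(C = fault | e) produced by inference over time.
p_congestion = [0.05, 0.10, 0.40, 0.70, 0.90, 0.85, 0.30, 0.20, 0.80]
print(detect_faults(p_congestion))
```

Only the transition raises a detection, so a fault that persists over many samples is reported once rather than at every sample.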
21. Evaluating the CPN
22. Evaluating the CPN
Objectives and metrics
- Evaluate the applicability of the CPN for fault detection
  - Parametric analysis
  - Comparison
  - High dependability
- Classification metrics
  - Accuracy = (TN + TP) / (TN + TP + FP + FN)
- Fault detection metrics
  - Reactivity time
  - False alarms
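A small sketch of how these metrics could be computed is given below. The accuracy formula follows the slide. The definition of reactivity time as the delay from fault onset to the first detection is an assumption, as are the function names and the example counts.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy plus the two rates that span a ROC curve."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tn + tp) / total,
        "true_positive_rate": tp / (tp + fn),
        "false_positive_rate": fp / (fp + tn),
    }


def reactivity_time(fault_start, detection_times):
    """Delay from fault onset to the first detection at or after it (None if missed)."""
    later = [t for t in detection_times if t >= fault_start]
    return min(later) - fault_start if later else None


# Hypothetical counts and detection times (seconds), for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
delay = reactivity_time(fault_start=10.0, detection_times=[4.0, 10.875, 12.0])
print(metrics, delay)
```

In this example the detection at 4.0 s precedes the fault and would count as a false alarm, while the detection at 10.875 s determines the reactivity time.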
23. Evaluating the CPN
Important properties
- Observation processing properties
  - Moving average window size
  - Observation node state-space
- Probabilities
  - Prior probabilities of C and BL
- Basic CPN properties
  - Impact of links and nodes
- Fault detection properties
  - Individual faults (congestion vs. bad link)
  - Detection of multiple simultaneous faults
  - Performance compared to Vegas state predictor
- Computational complexity
[Figure: basic CPN with the nodes BL, C, PL, PRR, RTT and FRR]
24. Evaluating the CPN
Observation processing I
- Moving average mean estimator
  - Simple method to implement for observation processing
- Use of window averaging
  - increases accuracy
  - lowers the number of false alarms
  - increases reactivity time
- RTT window size impact on congestion classification
  - Window size change: from 100 ms to 1100 ms
  - Accuracy change: from 82% to 90%
- Impact on reactivity time and false alarms
  - Reactivity time: from 370 ms to 960 ms
  - False alarms: from 14 to 2
25. Evaluating the CPN
Observation properties II
- Utilization of continuous variables from the system in the discrete CPN
  - Features of observations need to be captured by a set of states
  - Features may be few and hard to distinguish
  - PRR and FRR: binary variables
  - RTT: 12 states
- Few and/or alike features: few states
  - Threshold location is critical
  - Number of states less essential in this case
- More and/or distinguishable features: feasible with more states
  - Increases the resolution
  - Threshold location less essential
  - Raises the number of conditional probabilities
- The level of difference in observation statistics is important for defining the number of states and setting thresholds.
26. Evaluating the CPN
Prior probabilities
- P(C) and P(BL) represent prior belief of network states
- Sensitivity control
  - Decreased belief in the fault state requires stronger evidence
- ROC curve
  - Trade-off between true positives and false positives
  - Reactivity time not depicted
    - 50/50 (fault/normal): reactivity time 583 ms
    - 5/95 (fault/normal): reactivity time 1380 ms
  - Represents a comparison foundation for classifiers
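Why a lower prior belief in the fault state demands stronger evidence follows directly from Bayes' rule in odds form. The sketch below uses the two prior settings from the slide; the likelihood ratios are made-up values and `posterior_fault` is a hypothetical helper, not part of the project's implementation.

```python
def posterior_fault(prior_fault, likelihood_ratio):
    """P(fault | e) from the prior and LR = P(e | fault) / P(e | normal)."""
    prior_odds = prior_fault / (1.0 - prior_fault)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)


for prior in (0.50, 0.05):
    beliefs = [round(posterior_fault(prior, lr), 3) for lr in (1, 5, 20, 50)]
    print(f"prior {prior:.2f}: {beliefs}")
```

With a 50/50 prior any evidence that favors the fault state pushes the belief above 0.5, whereas with a 5/95 prior the evidence must be about 19 times more likely under the fault state before the same level is reached. This is consistent with the longer reactivity time reported for the 5/95 setting, since more samples are needed to accumulate that evidence.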
27. Evaluating the CPN
Structure
- Intermediate packet loss node
  - Connects network and node from scenario
  - Needed to introduce e.g. TCP
  - Important for conditional (in)dependence properties and efficient inference [Kjærulff, 2005]
  - Can reduce CPN complexity by reducing the number of conditional probabilities [Andreassen, 2001]
  - Insignificant impact in the basic CPN
- Causal links in the CPN
  - Enable handling of
    - Ambiguity in observations
    - Inconsistent observations
    - Missing observations
  - Verified performance impact
  - RTT quality decisive
[Figure: basic CPN with the nodes BL, C, PL, PRR, RTT and FRR, and the evidence e = {RTT = n/a, PRR = 1, FRR = 1}]
28. Evaluating the CPN
Fault detection evaluation
- Detection of congestion is more accurate than detection of bad link
  - Good quality of RTT and lower quality of PRR and FRR
  - Introduce more observations, e.g. RSSI
  - Better quality in observations
- Detection of simultaneous faults
  - Decrease in accuracy
    - Congestion: from 86% to 82%
    - Bad link: from 79% to 74%
  - No significant impact on reactivity time or false alarms
29. Evaluating the CPN
Fault detection evaluation
- CPN compared to Vegas
  - State predictor functionality in TCP Vegas
  - Predicts the congestion state of the network
  - Simple statistics: RTT and window size
  - Vegas is not designed for application layer fault detection
- CPN outperforms Vegas measured in accuracy and reactivity time
  - Vegas
    - False positives: 5
    - True positives: 12
    - Reactivity time: 1900 ms
  - CPN
    - False positives: 5
    - True positives: 65
    - Reactivity time: 875 ms
- The CPN uses multiple observations and averaging, which means it uses several samples and filters out noise
30. Evaluating the CPN
Computational complexity
- Inference in multiply connected CPNs is NP-hard [Kjærulff, 2005]
- Junction Tree Algorithm (JTA): method to compile a CPN into a singly connected graph
  - Inference performance is exponential in the maximum clique size
  - Optimal compilation (minimum clique size) is NP-hard [Mahadevan, 2002]
- The CPN will become more complex when extended with more faults and observations
  - An additional link increases the clique size
  - Clique size depends on CPN density
    - Low density: linear growth
    - High density: exponential growth
- Exact inference is not feasible in large, complex CPNs
  - Real-time requirement
  - Alternative approximation methods must be investigated
    - Stochastic sampling, model simplification, search-based inference
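The exponential dependence on clique size can be made concrete by counting table entries: a clique's potential table has one entry per joint state of its variables. The sketch below assumes a dense worst case in which all binary fault nodes end up in one clique together with a 12-state RTT node; the function name and the scenario are illustrative assumptions, not measurements from the project.

```python
from math import prod


def clique_table_size(state_counts):
    """Number of entries in a clique potential: product of the variables' state counts."""
    return prod(state_counts)


# Assumed dense case: n binary fault nodes and one 12-state observation in one clique.
for n_faults in (2, 4, 8, 16):
    size = clique_table_size([2] * n_faults + [12])
    print(f"{n_faults:2d} faults -> {size} table entries")
```

Each added binary fault node doubles the table in this dense case, which is why the density of the extended CPN, rather than just its number of nodes, decides whether exact inference stays within a real-time budget.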
31. Evaluating the CPN
Relating to external requirements
- Fault detector in a high dependability context
- Scenario requirements for the fault detector
  - Reactivity time
  - Number of false alarms
- Example: λ_false alarms = 2.37 × 10^-7 FA/s
  - 1 fault/24 h, 99% success rate, 2 false alarms
  - Mean time before failure (false alarm) is approx. 49 days
- Relate performance of the CPN to requirements
  - Common congestion test scenario
    - Basic setup: 6.6 × 10^-2 FA/s
    - Tweaked setup: 0 FA/s
- Is zero good enough?
  - Total test duration is 7.5 minutes
  - Confidence not good enough from the test scenario
  - Total operation time for λ is 49 days
- IEC 61508 part 7, annex D
  - Techniques and measures concerning functional safety in safety-related PESs
  - Test duration t = 3 × 49 days = 147 days, using a 95% confidence interval
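The numbers on this slide can be reproduced with a short calculation. The sketch assumes that false alarms arrive as a Poisson process, so that a test with zero false alarms over time t supports the claimed rate λ at confidence c when exp(-λ t) <= 1 - c. For c = 95% this gives t >= 2.996 / λ, which is consistent with the factor 3 used on the slide. The function names are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def mtbf_days(rate_per_s):
    """Mean time between false alarms, in days, for a constant rate."""
    return 1.0 / rate_per_s / SECONDS_PER_DAY


def required_test_days(rate_per_s, confidence=0.95):
    """Failure-free test time needed to support `rate_per_s` at the given confidence.

    Assumes Poisson arrivals: exp(-rate * t) <= 1 - confidence.
    """
    return -math.log(1.0 - confidence) / rate_per_s / SECONDS_PER_DAY


rate = 2.37e-7   # required false alarm rate from the slide, in FA/s
print(f"MTBF:          {mtbf_days(rate):.1f} days")
print(f"Test duration: {required_test_days(rate):.1f} days")
```

Against this requirement, the 7.5 minute test run covers only a tiny fraction of the needed failure-free time, which is why the observed 0 FA/s cannot yet be taken as evidence that the requirement is met.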
32. Conclusion
33. Conclusion
Achievements in project
- Defined overall fault model
- Defined a basic model with two faults and established knowledge domain
- Developed a CPN structure based on knowledge
- Developed a simulation environment to test the implementation
- Verified the correctness of the structure using an empirical approach
- Learned probabilistic information from observation points in data traffic
- Implemented a fault detector based on the CPN
- Conducted parametric analysis of the implemented detector
34. Conclusion
Pros
- Decision under uncertainty
  - Ambiguous data
  - Inconsistent data
  - Missing data
- Utilizes multiple observations
- Detects multiple simultaneous faults
- Modelling of complex systems using probabilistic relations
  - Graphical models provide a basis to assess dependency relations
  - Models may be based on statistics or knowledge
  - Probabilities can be learned from data and combined with knowledge from experts
- Potential to handle many faults, and take advantage of probabilistic relations between the faults, observations and functionalities of a communication system
35. Conclusion
Cons
- High computational complexity; approximate methods are needed
- Incapable of incorporating dynamics
- Much effort required to construct the model
  - Requires a sound understanding of the problem domain
  - Dependencies and conditional probabilities
  - Few formal design patterns
36. Conclusion
Further investigations
- Efficiency in high dependability scenario
  - Still to be evaluated
- Extend fault model
  - Detection capabilities for more than 2 faults needed
  - Comparison to existing methods and high dependability requirements
- Observations
  - Utilize framework and add more observations
    - Throughput, RSSI, SNR, packet inter-arrival time
  - Draw other (and possibly better) statistics from existing observation points
- Generality
  - Probabilities are expected to differ between scenarios
  - On-line learning [Murphy, 2002]
- Impact of extending the CPN structure
  - Mainly relates to computational complexity
    - Size
    - Structure
  - Unknown how easy it is to implement new functionalities
37. Dynamic Bayesian Network
38. Dynamic Bayesian Network
Motivation to explore Dynamic Bayesian Networks
- Causal Probabilistic Network (CPN), also known as Bayesian Network (BN)
  - Represents a static view of the process
  - Cannot include dynamics or changes in time
  - Sufficient in many decision support systems
    - Medical diagnosis [Andreassen, 2001]
    - Network diagnosis (network management) [Steinder and Sethi, 2004]
- Extension to BNs called Dynamic Bayesian Networks
  - Used to model a dynamic process by including temporal causality
  - Medical example: probability of having the flu today given that I had the flu yesterday
  - Network example: probability of being in a high congestion state currently given that the network was in a high congestion state previously
- Examples of applications
  - Prediction of blood glucose levels [Andreassen, 2001]
  - Fault diagnosis in process monitoring [Lauber, 1999]
  - Speech recognition [Zweig, 1998]
39. Dynamic Bayesian Network
Defining the DBN
- Dynamic Bayesian Network
  - Bayesian Network extended in time, i.e. causal links in time
  - Discrete-time stochastic processes
  - Slice at time t = 1 defines priors
  - Under the assumption of a first-order Markov process, the DBN may be represented by a 2TBN (two-slice temporal Bayesian network)
  - First slice contains no parameters, second slice contains conditional probabilities
  - Parents may exist in slice t or in slice t-1
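A minimal sketch of what the temporal link buys is given below for the simplest possible DBN: one hidden congestion node per slice with one discretized RTT observation, which is structurally a hidden Markov model. All probabilities, state names and the function `forward_filter` are assumptions for illustration and are not taken from the project.

```python
# Assumed parameters for a two-state congestion process (not from the slides).
STATES = ("normal", "high")
PRIOR = {"normal": 0.95, "high": 0.05}                   # slice t = 1
TRANSITION = {"normal": {"normal": 0.9, "high": 0.1},    # P(C_t | C_t-1)
              "high": {"normal": 0.2, "high": 0.8}}
EMISSION = {"normal": {"low": 0.8, "high": 0.2},         # P(RTT_t | C_t)
            "high": {"low": 0.3, "high": 0.7}}


def normalize(dist):
    total = sum(dist.values())
    return {state: p / total for state, p in dist.items()}


def forward_filter(observations):
    """Return P(C_t | RTT_1..t) for each t using the forward recursion."""
    beliefs = []
    belief = None
    for obs in observations:
        if belief is None:
            predicted = PRIOR             # first slice: use the prior
        else:
            predicted = {s: sum(belief[prev] * TRANSITION[prev][s] for prev in STATES)
                         for s in STATES}
        belief = normalize({s: predicted[s] * EMISSION[s][obs] for s in STATES})
        beliefs.append(belief)
    return beliefs


beliefs = forward_filter(["low", "high", "high", "high"])
for t, belief in enumerate(beliefs, start=1):
    print(t, round(belief["high"], 3))
```

The belief in the high congestion state grows with each consecutive high RTT observation, because the previous slice acts as the prior for the current one. A static CPN would have to treat every evidence vector in isolation.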
40. Dynamic Bayesian Network
Improving the basic CPN
- Improve the model to consider the dynamics of the communication process
  - Handle the cycle induced by the feedback of TCP
  - A static view of observables is no longer needed. Possibly take advantage of sequences of observations to improve statistics [Liu et al., 2003]
  - The same methods for learning and inference in CPNs are possible for DBNs [Murphy, 2002]
41. Questions
42. References
- [Andreassen, 2001] Andreassen, S. (2001). Medical decision support systems based on causal probabilistic networks. Technical report, Department of Medical Informatics and Image Analysis, Aalborg University.
- [Steinder and Sethi, 2004] Steinder, M. and Sethi, A. S. (2004). Probabilistic fault localization in communication systems using belief networks. Technical report, IEEE.
- [Kjærulff, 2005] Kjærulff, U. (2005). Constructing Bayesian Networks. Presentation, Group of Machine Intelligence, Department of Computer Science, Aalborg University.
- [Mahadevan, 2002] Mahadevan, S. (2002). The Junction Tree Algorithm. Presentation, University of Massachusetts.
- [Murphy, 2002] Murphy, K. P. (2002). Dynamic Bayesian networks: Representation, inference and learning. Technical report, University of California, Berkeley.
- [Lauber, 1999] Lauber, J., Steger, C. and Weiss, R. Autonomous Agents for Online Diagnosis of a Safety-critical System Based on Probabilistic Causal Reasoning. Technical report, Institute for Technical Informatics, Technical University Graz.
- [Zweig, 1998] Zweig, G. and Russell, S. Speech recognition with Dynamic Bayesian Networks. Computer Science Division, UC Berkeley.
- [Liu et al., 2003] Liu, J., Matta, I. and Crovella, M. (2003). End-to-end inference of loss nature in a hybrid wired/wireless environment from wireless losses. Technical report, Department of Computer Science, University of North Dakota.
43. Additional slides
44. Additional slides
Learning cases
C, BL, RTT, PRR, FRR, PL
...
High, No, n/a, High, Low, n/a
High, No, 6, High, Low, n/a
High, No, 2, High, Low, n/a
High, No, 0, High, Low, n/a
High, No, 4, High, Low, n/a
...
45. Additional slides
Extended model
46. Additional slides
Long congestion simulation run
47. Additional slides
Connection counts
48. Additional slides
Congestion classification