Probabilistic fault detection in network communication

Slides: 49

Provided by: jesper2

1
Probabilistic fault detection in network
communication

Developing a causal probabilistic network for
fault detection

16-06-2006
Gr839b Anders Nickelsen and Jesper Grønbæk
2
Disposition

Project background
Development of the CPN
Fault detector structure
Evaluating the CPN
Conclusion
Dynamic Bayesian Network

3
Project background

Jesper Grønbæk

4
Project background
Dependability and fault tolerance

Basis: high dependability in car-to-car scenario
Critical dependability requirements to services
Availability and reliability
Faults and failures
The effects of a fault propagate, possibly
resulting in a new fault.
A fault causes another fault
Service failure: deviation from the service specification
Purpose of fault tolerance is to handle faults
before they lead to failure.

5
Project background
Case scenario

Scenario of accident
Black-box information sent to road directorate
Accident information for insurance and emergency
services
Application layer fault tolerance
End-to-end service perspective
Faults may be unobservable
Observations
Ambiguous
Inconsistent
Missing
Use of multi-layer observations to determine what
fault has appeared

6
Development of the CPN

Jesper Grønbæk

7
Development of the CPN
CPN background I

In general humans do inference by intuition
Combines knowledge of influencing factors and
belief
Examples are football game, wet grass, ...
Reasoning under uncertainty
Enable computers to make inference based on
observations
Artificial Intelligence
C, S, R, W are stochastic variables
Probabilistic relations between the variables
Use of probability theory
Wet grass: example of factors and relations
Inference of P(S | R, W, C) requires the joint probability of
all variables
Handling the complete joint probability is
intractable, as the number of probabilities is O(2^n)
for n binary variables

[Figure: wet grass network with the nodes Cloudy, Sprinkler, Rain and Wet grass]
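
The inference step on this slide can be written out in terms of the full joint distribution. This is standard probability theory rather than something taken from the slides; the lower-case s is an assumed symbol for a value of S.

```latex
P(S \mid R, W, C) = \frac{P(C, S, R, W)}{\sum_{s} P(C, S = s, R, W)}
```

With n = 4 binary variables the joint table has 2^4 = 16 entries. With n = 30 it would already have more than 10^9 entries, which is why the full table does not scale.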
8
Development of the CPN
CPN background II

Reduce the joint probability to tractable size
Utilize assumptions of variable independence
A CPN, N = (G, P), consists of
G: A directed acyclic graph containing the
stochastic variables and edges representing
causal relations - dependencies - between the
variables
Leaving out irrelevant edges represents
independencies
P: Prior and conditional probabilities represent
the strengths of the relations
Chain rule with Markov property
P(C, S, R, W) = P(C) P(S | C) P(R | C) P(W | S, R)
Number of probabilities reduced to O(n 2^k), k is
max number of parents of a variable
Inference possible: P(S | R, W, C)

[Figure: the network annotated with P(C) at Cloudy, P(S | C) at Sprinkler, P(R | C) at Rain and P(W | S, R) at Wet grass]
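
A minimal sketch of the factorization and of inference by enumeration for the wet grass network. The probability values are assumed for illustration (the slides give none), and the function names `joint` and `query` are hypothetical.

```python
from itertools import product

# Assumed CPT values, illustrative only. 1 = true, 0 = false.
P_C = {1: 0.5, 0: 0.5}
P_S_GIVEN_C = {1: 0.1, 0: 0.5}   # P(S=1 | C)
P_R_GIVEN_C = {1: 0.8, 0: 0.2}   # P(R=1 | C)
P_W_GIVEN_SR = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # P(W=1 | S, R)


def bernoulli(p_true, value):
    """Probability of a binary value given the probability of 1."""
    return p_true if value == 1 else 1.0 - p_true


def joint(c, s, r, w):
    """Chain rule: P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    return (P_C[c]
            * bernoulli(P_S_GIVEN_C[c], s)
            * bernoulli(P_R_GIVEN_C[c], r)
            * bernoulli(P_W_GIVEN_SR[(s, r)], w))


def query(target, evidence):
    """P(target = 1 | evidence), summing the joint over unobserved variables."""
    names = ("C", "S", "R", "W")
    numerator = denominator = 0.0
    for values in product((0, 1), repeat=4):
        assignment = dict(zip(names, values))
        if any(assignment[k] != v for k, v in evidence.items()):
            continue  # inconsistent with the evidence
        p = joint(*values)
        denominator += p
        if assignment[target] == 1:
            numerator += p
    return numerator / denominator


p_full = query("S", {"R": 1, "W": 1, "C": 1})   # all other variables observed
p_partial = query("S", {"W": 1})                # only wet grass observed
print(round(p_full, 4), round(p_partial, 4))
```

The factored form stores 1 + 2 + 2 + 4 = 9 probabilities instead of the 2^4 - 1 = 15 free entries of the full joint table.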
9
Development of the CPN
Motivation for considering CPNs for fault
detection

Areas of application
Decision support systems for medical diagnosis
[Andreassen, 2001]
Fault localization in network management
[Steinder and Sethi, 2004]
Medical diagnosis ↔ fault detection
Diseases ↔ network faults
Symptoms ↔ observation points
Knowledge ↔ causal relations between faults and
observation points

10
Development of the CPN
Steps in development

Four step process
Explore knowledge domain
Fault causes and effects
Choice of basic fault model
Communication process and observations
Develop CPN structure from knowledge
Identify stochastic variables
Define dependencies
Define states
Attain probabilities
Prior and conditional probability distributions
Verify CPN
Structure and probabilities

11
Development of the CPN
Knowledge domain
Case

Aim: find cause for insufficient throughput
Link breakdown → Causes congestion
Source of noise → Affects link condition (Bad
link)
Both considered permanent faults in the case
Communication controlled by TCP
Congestion avoidance and flow control by
windowing.
Reacts to packet loss by reducing packet
transmissions. Leads to reduction in throughput.
Inherently cannot distinguish between packet loss
causes
Observations
Made available by the TCP communication process
e.g. RTT
Unreliable: missing, delayed and noisy

Basic fault model graph
12
Development of the CPN
Structure development

Identified variables
Model based on faults (Hidden hypothesis nodes)
Intermediate variables (Hidden nodes)
Observations
Other considered elements
TCP
Application data, router service time, ...
Each variable has been defined with a finite set
of states
Hypothesis nodes: Based on network states normal
and fault
Observations: Based on features

Basic model
13
Development of the CPN
Obtaining probabilities

Conditional probability
P(FRR = high | bad link = yes)
Methods for obtaining probabilities
Knowledge based approach
Captures the knowledge and experience of experts
Used when reliable measurements are not available
Data based approach
Fitting distributions to data sets
Learning probabilities
Based on representative learning cases of the
conditionals
May be difficult to obtain in reality, e.g. fault
states
Offline learning, batch learning
Handles hidden nodes and missing observations
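
A minimal sketch of the data based approach for one conditional table, P(FRR | bad link), estimated by counting over learning cases. The cases, the function name `learn_cpt` and the Laplace smoothing constant `alpha` are assumptions for illustration. Cases with a missing value are simply skipped here, which is a simplification: the batch learning on the slide is stated to handle hidden nodes and missing observations, and that needs more than plain counting.

```python
from collections import Counter


def learn_cpt(cases, child, parent, child_states, alpha=1.0):
    """Estimate P(child | parent) by counting, with Laplace smoothing alpha."""
    counts = Counter()
    parent_totals = Counter()
    for case in cases:
        c, p = case.get(child), case.get(parent)
        if c is None or p is None:
            continue  # skip cases where either value is missing
        counts[(p, c)] += 1
        parent_totals[p] += 1
    cpt = {}
    for p in parent_totals:
        denom = parent_totals[p] + alpha * len(child_states)
        cpt[p] = {c: (counts[(p, c)] + alpha) / denom for c in child_states}
    return cpt


# Hypothetical learning cases, invented for illustration.
cases = [
    {"BL": "yes", "FRR": "high"},
    {"BL": "yes", "FRR": "high"},
    {"BL": "yes", "FRR": "low"},
    {"BL": "no", "FRR": "low"},
    {"BL": "no", "FRR": "low"},
    {"BL": "no", "FRR": None},   # missing observation, ignored by this sketch
]
cpt = learn_cpt(cases, child="FRR", parent="BL", child_states=("high", "low"))
print(cpt["yes"]["high"], cpt["no"]["low"])
```

The smoothing keeps every state at a non-zero probability, so a state that never occurs in the learning cases is not ruled out during inference.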

14
Development of the CPN
Verifying the CPN

Verifying structure
Arrows and their directions are important
Wrong directions and dependencies may lead to
wrong conclusions when inferring in the CPN.
Verifying probabilities
Difficult to verify probabilities accurately.

[Figure: two example structures over the nodes Congestion, Bad link and Packet loss]
15
Fault detector structure

Jesper Grønbæk

16
Fault detector structure
Presenting the components of the detector

Input: Observations based on network traffic
Output: Detection of a particular fault

17
Fault detector structure
Network and observation points I

Observations collected from network traffic
Example: NS-2 trace (Network)

. . .
+ 1.446678 0 1 tcp 1340 --- 101
- 1.446678 0 1 tcp 1340 --- 101
d 1.446678 0 1 tcp 1340 --- 101    <- Dropped packet, FRR
r 1.450758 1 0 ack 40 --- 91       <- RTT sample
+ 1.450758 0 1 tcp 1340 --- 102
- 1.450758 0 1 tcp 1340 --- 102
. . .
r 1.520422 1 0 ack 40 --- 100      <- 2nd dupack (no RTT sample)
+ 1.520422 0 1 tcp 1340 --- 114
- 1.520422 0 1 tcp 1340 --- 114
r 1.52471 1 0 ack 40 --- 100       <- 3rd dupack (no RTT sample)
+ 1.52471 0 1 tcp 1340 --- 101     <- Packet retransmission, PRR
- 1.52471 0 1 tcp 1340 --- 101
. . .
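
A minimal sketch of how observation events could be pulled out of such a trace. It assumes the abbreviated eight-column layout of the slide excerpt (event, time, source, destination, packet type, size, flags, sequence number) and the NS-2 event codes `+` (enqueue), `-` (dequeue), `r` (receive) and `d` (drop). The names `Event`, `parse_line` and `summarize` are hypothetical, and the short trace below is a reduced version of the excerpt.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str     # '+', '-', 'r' or 'd'
    time: float
    src: int
    dst: int
    ptype: str    # 'tcp' or 'ack'
    size: int
    seq: int


def parse_line(line):
    """Parse one abbreviated trace line: kind time src dst type size flags seq."""
    kind, time, src, dst, ptype, size, _flags, seq = line.split()[:8]
    return Event(kind, float(time), int(src), int(dst), ptype, int(size), int(seq))


def summarize(lines):
    """Count drops, duplicate ACKs and retransmitted data packets."""
    drops = dupacks = retransmissions = 0
    enqueued = set()
    last_ack = None
    for event in map(parse_line, lines):
        if event.kind == "d":
            drops += 1
        elif event.kind == "r" and event.ptype == "ack":
            if event.seq == last_ack:
                dupacks += 1          # same ACK number as the previous ACK
            last_ack = event.seq
        elif event.kind == "+" and event.ptype == "tcp":
            if event.seq in enqueued:
                retransmissions += 1  # sequence number sent before
            enqueued.add(event.seq)
    return {"drops": drops, "dupacks": dupacks, "retransmissions": retransmissions}


trace = [
    "+ 1.446678 0 1 tcp 1340 --- 101",
    "- 1.446678 0 1 tcp 1340 --- 101",
    "d 1.446678 0 1 tcp 1340 --- 101",
    "r 1.450758 1 0 ack 40 --- 91",
    "r 1.520422 1 0 ack 40 --- 100",
    "r 1.524710 1 0 ack 40 --- 100",
    "+ 1.524710 0 1 tcp 1340 --- 101",
    "- 1.524710 0 1 tcp 1340 --- 101",
]
summary = summarize(trace)
print(summary)
```

Dividing such counts by the number of packets sent in a time window would give rates of the kind the slides call FRR and PRR; the exact definitions used in the project are not given in the transcript.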
18
Fault detector structure
Network and observation points II

Congestion fault
Observations and throughput

[Figure: panels for congestion (load), round trip time (time, ms), packet retransmission rate, frame retransmission rate and throughput (KB/s)]
19
Fault detector structure
Observation processing and evidence I

Sampling based approach
Observation processing
Moving Average
Observations missing, delayed or noisy
Fixed time window
New observation sample at time t
An observation consists of zero or more samples
Discretization of observations
Mapping observations to states
RTT discretized into 10 equally sized bins: 62-105
ms, 4.3 ms intervals
PRR and FRR
High/Low state
Threshold setting
Evidence vector

e = {RTT = 5, PRR = High, FRR = Low}
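
A minimal sketch of the observation processing chain: moving average over a fixed time window, then discretization into an evidence vector. The bin range 62-105 ms and the 10 bins come from the slide. The sample values, the rate thresholds, the clamping of out-of-range RTTs and all function names are assumptions for illustration.

```python
def moving_average(samples, now, window):
    """Mean of values with timestamps in (now - window, now]; None if no samples."""
    values = [v for t, v in samples if now - window < t <= now]
    return sum(values) / len(values) if values else None


def discretize_rtt(rtt_ms, low=62.0, high=105.0, bins=10):
    """Map an RTT in ms to a bin index 0..bins-1; out-of-range values are clamped."""
    if rtt_ms is None:
        return None                      # missing observation: no evidence
    width = (high - low) / bins          # 4.3 ms with the defaults
    index = int((rtt_ms - low) // width)
    return min(max(index, 0), bins - 1)


def discretize_rate(rate, threshold):
    """Map a retransmission rate to the High/Low state."""
    if rate is None:
        return None
    return "High" if rate >= threshold else "Low"


# (time in s, RTT in ms); hypothetical samples.
rtt_samples = [(0.10, 80.0), (0.40, 84.0), (0.90, 88.0), (1.00, 86.0)]
evidence = {
    "RTT": discretize_rtt(moving_average(rtt_samples, now=1.0, window=0.7)),
    "PRR": discretize_rate(0.08, threshold=0.05),
    "FRR": discretize_rate(0.01, threshold=0.05),
}
print(evidence)
```

Returning `None` for an empty window lets the caller leave that node without evidence, which matches the slide's point that an observation may consist of zero samples.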
20
Fault detector structure
Inference in the CPN, decision and detection I

Evidence is acquired
Observable nodes are initialized
Inference is conducted and marginal probabilities
are available
A decision can be made of the network state
Detection is based on transition from normal
to fault state
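
A minimal sketch of the decision and detection step, given a sequence of marginal fault probabilities from inference. The 0.5 decision threshold, the example values and the name `detect_faults` are assumptions; the slides do not state the decision rule.

```python
def detect_faults(fault_probabilities, threshold=0.5):
    """Return the sample indices where the decision switches from normal to fault."""
    detections = []
    previous_state = "normal"
    for index, p_fault in enumerate(fault_probabilities):
        state = "fault" if p_fault > threshold else "normal"
        if previous_state == "normal" and state == "fault":
            detections.append(index)   # transition normal -> fault
        previous_state = state
    return detections


posteriors = [0.05, 0.10, 0.40, 0.70, 0.90, 0.85, 0.30, 0.20, 0.65]
detections = detect_faults(posteriors)
print(detections)
```

Only the transition is reported, so a fault that persists over many samples produces a single detection rather than one per sample.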

21
Evaluating the CPN

Anders Nickelsen

22
Evaluating the CPN
Objectives and metrics

Evaluate the applicability of the CPN for fault
detection
Parametric analysis
Comparison
High dependability
Classification metrics
Accuracy = (TN + TP) / (TN + TP + FP + FN)
Fault detection metrics
Reactivity time
False alarms
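
A minimal sketch of two of the metrics. The accuracy formula is the one on the slide. The definition of reactivity time as the delay from fault onset to the first detection is an assumption, since the transcript does not define it, and the example data are invented.

```python
def accuracy(predicted, actual):
    """(TN + TP) / (TN + TP + FP + FN) for two equally long boolean sequences."""
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if p and a)
    tn = sum(1 for p, a in pairs if not p and not a)
    fp = sum(1 for p, a in pairs if p and not a)
    fn = sum(1 for p, a in pairs if not p and a)
    return (tn + tp) / (tn + tp + fp + fn)


def reactivity_time(fault_onset, detection_times):
    """Delay from fault onset to the first detection at or after it, else None."""
    later = [t for t in detection_times if t >= fault_onset]
    return min(later) - fault_onset if later else None


actual = [False, False, True, True, True, False]      # True = fault present
predicted = [False, True, False, True, True, False]   # True = fault classified
acc = accuracy(predicted, actual)
delay = reactivity_time(fault_onset=2.0, detection_times=[1.0, 2.5, 4.0])
print(round(acc, 3), delay)
```

In this example the detection at time 1.0 precedes the fault onset, so it would count as a false alarm rather than as the detection of the fault.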

23
Evaluating the CPN
Important properties

Observation processing properties
Moving average window size
Observation node state-space
Probabilities
Prior probabilities of C and BL
Basic CPN properties
Impact of links and nodes
Fault detection properties
Individual faults (congestion vs. bad link)
Detection of multiple simultaneous faults
Performance compared to Vegas state predictor
Computational complexity

[Figure: basic CPN with the nodes BL, C, PL, PRR, RTT and FRR]
24
Evaluating the CPN
Observation processing I

Moving average mean estimator
Simple method to implement for observation
processing
Use of window averaging
increases accuracy
lowers amount of false alarms
increases reactivity time
RTT window size impact on congestion
classification
Window size change: 100 ms → 1100 ms
Accuracy change: 82 % → 90 %
Impact on reactivity time and false alarms
Reactivity time: 370 ms → 960 ms
False alarms: 14 → 2

25
Evaluating the CPN
Observation properties II

Utilization of continuous variables from the
system in the discrete CPN.
Features of observations need to be captured by a
set of states
Features may be few and hard to distinguish.
PRR and FRR Binary variables
RTT 12 states
Few and/or alike features → few states
Threshold location is critical
Number of states less essential in this case.
More and/or distinguishable features → feasible
with more states
Increases the resolution
Threshold location less essential
Raises the number of conditional probabilities.
Level of difference in observation statistics
important for defining the amount of states and
setting thresholds.

26
Evaluating the CPN
Prior probabilities

P(C) and P(BL) represent prior belief of network
states
Sensitivity control
Decreased belief in fault state requires stronger
evidence
ROC-curve
Trade-off between true positives and false
positives
Reactivity time not depicted
50/50 (fault/normal)
Reactivity time 583 ms
5/95 (fault/normal)
Reactivity time 1380 ms
Represents a comparison foundation for
classifiers
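
A minimal sketch of why a lower prior belief in the fault state requires stronger evidence, using Bayes' rule for a single binary fault hypothesis. The likelihood values and the name `posterior_fault` are assumptions for illustration; the two priors are the 50/50 and 5/95 settings from the slide.

```python
def posterior_fault(prior_fault, likelihood_fault, likelihood_normal):
    """P(fault | e) from Bayes' rule for a binary fault/normal hypothesis."""
    joint_fault = prior_fault * likelihood_fault
    joint_normal = (1.0 - prior_fault) * likelihood_normal
    return joint_fault / (joint_fault + joint_normal)


# Hypothetical likelihoods: the evidence is 4 times more probable under a fault.
for prior in (0.50, 0.05):
    print(prior, round(posterior_fault(prior, 0.8, 0.2), 3))
```

With the 50/50 prior this evidence gives a fault probability of 0.8, while with the 5/95 prior the same evidence gives about 0.17 and would not cross a 0.5 decision threshold. More or stronger evidence has to accumulate first, which is consistent with the longer reactivity time reported for the 5/95 setting.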

27
Evaluating the CPN
Structure

Intermediate packet loss node
Connects network and node from scenario
Needed to introduce e.g. TCP
Important for conditional (in)dependence
properties and efficient inference [Kjærulff,
2005]
Can reduce CPN complexity by reducing amount of
conditional probabilities [Andreassen, 2001]
Insignificant impact in the basic CPN
Causal links in the CPN
Enables handling of
Ambiguity in observations
Inconsistent observations
Missing observations
Verified performance impact
RTT quality decisive

[Figure: basic CPN with the nodes BL, C, PL, PRR, RTT and FRR]
e = {RTT = n/a, PRR = 1, FRR = 1}
28
Evaluating the CPN
Fault detection evaluation

Detection of congestion more accurate than bad
link
Good quality of RTT and lower quality of PRR and
FRR
Introduce more observations, e.g. RSSI
Better quality in observations
Detection of simultaneous faults
Decrease in accuracy
Congestion: 86 % → 82 %
Bad link: 79 % → 74 %
No significant impact on reactivity time or
false alarms

29
Evaluating the CPN
Fault detection evaluation

CPN compared to Vegas
State predictor functionality in TCP Vegas
Predicts congestion network state
Simple statistics: RTT and window size
Vegas not designed for application layer
fault detection
CPN outperforms Vegas measured in accuracy and
reactivity time
Vegas
False positives: 5
True positives: 12
Reactivity time: 1900 ms
CPN
False positives: 5
True positives: 65
Reactivity time: 875 ms
CPN uses multiple observations and averaging
→ uses several samples and filters out noise

30
Evaluating the CPN
Computational complexity

Inference in multiply connected CPNs is NP-hard
[Kjærulff, 2005].
Junction Tree algorithm (JTA): Method to compile
CPN to singly connected graph
Inference performance exponential in maximum
clique-size.
Optimal compilation (minimum clique-size) is
NP-hard [Mahadevan, 2002].
CPN will become more complex when extending with
more faults and observations
Additional link increases clique-size
Clique-size dependent on CPN density
Low density → Linear
High density → Exponential growth
Exact inference not feasible in large, complex
CPNs
Real-time requirement
Alternative approximation methods must be
investigated
Stochastic sampling, model simplification,
search-based inference
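
A minimal sketch of one stochastic sampling method, likelihood weighting, applied to the small wet grass network rather than to the fault detection CPN. The probability values are assumed for illustration, and the function name `estimate_rain_given_wet` is hypothetical. The slides only name stochastic sampling as a direction to investigate; this is not the project's method.

```python
import random

# Assumed CPT values, illustrative only. 1 = true, 0 = false.
P_S_GIVEN_C = {1: 0.1, 0: 0.5}   # P(S=1 | C)
P_R_GIVEN_C = {1: 0.8, 0: 0.2}   # P(R=1 | C)
P_W_GIVEN_SR = {(0, 0): 0.0, (1, 0): 0.9, (0, 1): 0.9, (1, 1): 0.99}  # P(W=1 | S, R)


def estimate_rain_given_wet(num_samples, seed=0):
    """Estimate P(R=1 | W=1) by likelihood weighting."""
    rng = random.Random(seed)
    weighted_rain = total_weight = 0.0
    for _ in range(num_samples):
        # Sample the unobserved variables in topological order.
        c = 1 if rng.random() < 0.5 else 0
        s = 1 if rng.random() < P_S_GIVEN_C[c] else 0
        r = 1 if rng.random() < P_R_GIVEN_C[c] else 0
        # The evidence W=1 is not sampled; it weights the sample instead.
        weight = P_W_GIVEN_SR[(s, r)]
        total_weight += weight
        weighted_rain += weight * r
    return weighted_rain / total_weight


estimate = estimate_rain_given_wet(50_000)
print(round(estimate, 2))
```

The cost per sample is linear in the number of nodes, so the run time is controlled by the number of samples rather than by the clique size. The price is an estimate with sampling error instead of an exact marginal.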

31
Evaluating the CPN
Relating to external requirements

Fault detector in high dependability context
Scenario requirements to the fault detector
Reactivity time
Amount of false alarms
Example: λ_false alarms = 2.37 · 10^-7 FA/s
1 fault/24h, 99 % success-rate, 2 false alarms
Mean time before failure (false alarm) is approx.
49 days
Relate performance of CPN to requirements
Common congestion test scenario
Basic setup: 6.6 · 10^-2 FA/s
Tweaked setup: 0 FA/s
Is zero good enough?
Total test duration is 7.5 minutes
Confidence not good enough from test scenario
Total operation time for λ is 49 days
IEC 61508 part 7, annex D
Techniques and measures concerning functional
safety in safety-related PESs
Test duration t = 3 · 49 days = 147 days, using
a 95 % confidence interval
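
A minimal sketch of where the factor 3 in the test duration can come from. It assumes false alarms arrive as a Poisson process, so that the probability of seeing none during a time t is exp(-rate · t). Whether this matches the exact procedure in IEC 61508-7 annex D is not confirmed here, and the function name is hypothetical.

```python
import math

SECONDS_PER_DAY = 24 * 60 * 60


def required_test_duration(rate_per_s, confidence=0.95):
    """Failure-free observation time needed to support a rate of at most rate_per_s.

    Solves exp(-rate * t) = 1 - confidence for t, under a Poisson assumption.
    """
    return -math.log(1.0 - confidence) / rate_per_s


rate = 2.37e-7                                   # FA/s, from the slide
mean_time_days = 1.0 / rate / SECONDS_PER_DAY    # approx. 49 days
test_days = required_test_duration(rate) / SECONDS_PER_DAY
print(round(mean_time_days, 1), round(test_days, 1))
```

Since -ln(0.05) is about 3.0, the required failure-free duration is about three times the mean time between false alarms, which agrees with the slide's 3 · 49 days. The 7.5 minute test run is far short of that.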

32
Conclusion

Anders Nickelsen

33
Conclusion
Achievements in project

Defined overall fault model
Defined a basic model with two faults and
established knowledge domain
Developed a CPN structure based on knowledge
Developed a simulation environment to test
implementation
Verified the correctness of the structure using
empirical approach
Learned probabilistic information from
observation points from data traffic
Implemented a fault detector based on the CPN
Conducted parametric analysis of implemented
detector

34
Conclusion
Pros

Decision under uncertainty
Ambiguous data
Inconsistent data
Missing data
Utilizes multiple observations
Detects multiple simultaneous faults
Modelling of complex systems using probabilistic
relations
Graphic models provide a basis to assess
dependency relations
Models may be based on statistics or knowledge
Probabilities can be learned from data and
combined with knowledge from experts
Potential to handle many faults, and take
advantage of probabilistic relations between the
faults, observations and functionalities of a
communication system

35
Conclusion
Cons

High computational complexity, approximate
methods are needed
Incapable of incorporating dynamics
Much effort required to construct model
Requires a sound understanding of the problem
domain.
Dependencies and conditional probabilities
Few formal design patterns

36
Conclusion
Further investigations

Efficiency in high dependability scenario
Still to be evaluated
Extend fault model
Detection capabilities of more than 2 faults
needed
Comparison to existing methods and high
dependability requirements
Observations
Utilize framework and add more observations:
throughput, RSSI, SNR, packet inter-arrival time
Draw other (and possibly better) statistics from
existing observation points
Generality
Probabilities are expectedly different with
varying scenarios
On-line learning [Murphy, 2002]
Impact of extending the CPN structure
Mainly relates to computational complexity
Size
Structure
Unknown how easy it is to implement new
functionalities

37
Dynamic Bayesian Network

Jesper Grønbæk

38
Dynamic Bayesian Network
Motivation to explore Dynamic Bayesian Networks

Causal Probabilistic Network (CPN) = Bayesian
Network (BN)
Represent a static view of the process
Cannot include dynamics or changes in time
Sufficient in many decision support systems
Medical diagnoses [Andreassen, 2001]
Network diagnosis (Network management) [Steinder
and Sethi, 2004]
Extension to BNs called Dynamic Bayesian Networks
Used to model a dynamic process by including
temporal causality.
Medical example: Probability of having the flu
today given I had the flu yesterday.
Network example: Probability of being in a high
congestion state currently given the network was
in a high congestion state previously.
Example of applications
Prediction of blood glucose levels [Andreassen,
2001]
Fault diagnosis in process monitoring [Lauber,
1999]
Speech recognition [Zweig, 1998]

39
Dynamic Bayesian Network
Defining the DBN

Dynamic Bayesian Network
Bayesian Network extended in time, i.e. causal
links in time
Discrete-time stochastic processes
Slice at time t = 1 defines priors
Under the assumption of first-order Markov, a DBN may be
represented by a two-slice temporal Bayesian network (2TBN)
First slice contains no parameters, second slice
contains conditional probabilities.
Parents may exist in either slice t or t-1
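
A minimal sketch of forward filtering in a two-slice model with one hidden binary congestion state and one observation per slice. All probabilities, the state names and the function names are assumptions for illustration; the slides do not give a concrete DBN.

```python
def normalize(weights):
    """Scale non-negative weights so that they sum to 1."""
    total = sum(weights.values())
    return {x: w / total for x, w in weights.items()}


def filter_sequence(prior, transition, likelihoods):
    """Forward filtering: returns P(X_t | e_1..e_t) for each time slice t."""
    beliefs = []
    belief = None
    for likelihood in likelihoods:
        if belief is None:
            predicted = prior                      # slice t = 1 uses the prior
        else:
            # Temporal link: sum over the state in slice t-1.
            predicted = {x: sum(belief[prev] * transition[prev][x] for prev in belief)
                         for x in prior}
        belief = normalize({x: predicted[x] * likelihood[x] for x in prior})
        beliefs.append(belief)
    return beliefs


# Hypothetical parameters.
prior = {"normal": 0.95, "congested": 0.05}
transition = {"normal": {"normal": 0.9, "congested": 0.1},
              "congested": {"normal": 0.2, "congested": 0.8}}
p_high_rtt = {"normal": 0.1, "congested": 0.7}     # P(RTT = high | state)

observations = ["high", "high", "high"]
likelihoods = [{x: p_high_rtt[x] if obs == "high" else 1.0 - p_high_rtt[x]
                for x in prior} for obs in observations]
beliefs = filter_sequence(prior, transition, likelihoods)
print([round(b["congested"], 3) for b in beliefs])
```

A single high RTT observation leaves the congestion belief low because of the prior, while a sequence of them raises it step by step. This is the benefit of temporal links that a static CPN cannot express.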

40
Dynamic Bayesian Network
Improving the basic CPN

Improve model to consider dynamics of
communication process
Handle cycle induced by the feedback of TCP
Static view of observables no longer needed.
Possibly take advantage of sequences of
observations to improve statistics [Liu et al.,
2003].
Same methods for learning and inference in CPNs
possible for DBNs [Murphy, 2002]

41
Questions

End of presentation

42
References
[Andreassen, 2001] Andreassen, S. (2001). Medical decision support systems based on causal probabilistic networks. Technical report, Department of Medical Informatics and Image Analysis, Aalborg University.
[Steinder and Sethi, 2004] Steinder, M. and Sethi, A.S. (2004). Probabilistic fault localization in communication systems using belief networks. Technical report, IEEE.
[Kjærulff, 2005] Kjærulff, U. (2005). Constructing Bayesian Networks. Presentation, Group of Machine Intelligence, Department of Computer Science, Aalborg University.
[Mahadevan, 2002] Mahadevan, S. (2002). The Junction Tree Algorithm. Presentation, University of Massachusetts.
[Murphy, 2002] Murphy, K. P. (2002). Dynamic Bayesian networks: Representation, inference and learning. Technical report, University of California, Berkeley.
[Lauber, 1999] Lauber, J., Steger, C. and Weiss, R. (1999). Autonomous Agents for Online Diagnosis of a Safety-critical System Based on Probabilistic Causal Reasoning. Technical report, Institute for Technical Informatics, Technical University Graz.
[Zweig, 1998] Zweig, G. and Russell, S. (1998). Speech recognition with Dynamic Bayesian Networks. Computer Science Division, UC Berkeley.
[Liu et al., 2003] Liu, J., Matta, I. and Crovella, M. (2003). End-to-end inference of loss nature in a hybrid wired/wireless environment from wireless losses. Technical report, Department of Computer Science, University of North Dakota.
43
Additional slides
44
Additional slides
Learning cases
C, BL, RTT, PRR, FRR, PL
...
High, No, n/a, High, Low, n/a
High, No, 6, High, Low, n/a
High, No, 2, High, Low, n/a
High, No, 0, High, Low, n/a
High, No, 4, High, Low, n/a
...
45
Additional slides
Extended model
46
Additional slides
Long congestion simulation run
47
Additional slides
Connection counts
48
Additional slides
Congestion classification
