1
Probabilistic fault detection in network
communication
  • Developing a causal probabilistic network for
    fault detection

16-06-2006
Gr839b: Anders Nickelsen and Jesper Grønbæk
2
Outline
  • Project background
  • Development of the CPN
  • Fault detector structure
  • Evaluating the CPN
  • Conclusion
  • Dynamic Bayesian Network

3
Project background
  • Jesper Grønbæk

4
Project background
Dependability and fault tolerance
  • Basis: high dependability in a car-to-car
    scenario
  • Critical dependability requirements on services
  • Availability and reliability
  • Faults and failures
  • The effects of a fault propagate, possibly
    resulting in a new fault.
  • One fault can cause another fault
  • Service failure: deviation from the service
    specification
  • The purpose of fault tolerance is to handle
    faults before they lead to failure.

5
Project background
Case scenario
  • Scenario of an accident
  • Black-box information sent to the road
    directorate
  • Accident information for insurance and emergency
    services
  • Application-layer fault tolerance
  • End-to-end service perspective
  • Faults may be unobservable
  • Observations
  • Ambiguous
  • Inconsistent
  • Missing
  • Use of multi-layer observations to determine
    which fault has appeared

6
Development of the CPN
  • Jesper Grønbæk

7
Development of the CPN
CPN background I
  • In general, humans do inference by intuition
  • Combines knowledge of influencing factors and
    belief
  • Examples: football game, wet grass
  • Reasoning under uncertainty
  • Enable computers to make inference based on
    observations
  • Artificial intelligence
  • C, S, R, W are stochastic variables
  • Probabilistic relations between the variables
  • Use of probability theory
  • Wet grass: example of factors and relations
  • Inference requires the joint probability of
    all variables
  • P(S | R, W, C)
  • Handling the complete joint probability is
    intractable, as the number of probabilities is
    O(2^n) for n binary variables

[Diagram: wet-grass network with nodes Cloudy, Sprinkler, Rain, Wet grass]
8
Development of the CPN
CPN background II
  • Reduce the joint probability to a tractable size
  • Utilize assumptions of variable independence
  • A CPN, N = (G, P), consists of
  • G: a directed acyclic graph containing the
    stochastic variables and edges representing
    causal relations - dependencies - between the
    variables
  • Leaving out irrelevant edges represents
    independencies
  • P: prior and conditional probabilities
    representing the strengths of the relations
  • Chain rule with the Markov property
  • P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)
  • Number of probabilities reduced to O(2^k n),
    where k is the max number of parents of a variable
  • Inference possible: P(S | R, W, C)

[Diagram: wet-grass network annotated with P(C), P(S|C), P(R|C), P(W|S,R)]
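The chain-rule factorization above can be exercised directly. The sketch below infers P(S | W = true) in the wet-grass network by brute-force enumeration over the joint distribution. The CPT numbers are the common textbook values, assumed here for illustration since the slides give no concrete probabilities.

```python
from itertools import product

# CPTs for the wet-grass network. Numbers are the usual textbook
# values (an assumption; the slides do not state concrete CPTs).
P_C = {True: 0.5, False: 0.5}                       # P(C)
P_S_given_C = {True: 0.1, False: 0.5}               # P(S=True | C)
P_R_given_C = {True: 0.8, False: 0.2}               # P(R=True | C)
P_W_given_SR = {(True, True): 0.99, (True, False): 0.90,
                (False, True): 0.90, (False, False): 0.0}  # P(W=True | S, R)

def joint(c, s, r, w):
    """Chain rule with the Markov property:
    P(C,S,R,W) = P(C) P(S|C) P(R|C) P(W|S,R)."""
    pc = P_C[c]
    ps = P_S_given_C[c] if s else 1 - P_S_given_C[c]
    pr = P_R_given_C[c] if r else 1 - P_R_given_C[c]
    pw = P_W_given_SR[(s, r)] if w else 1 - P_W_given_SR[(s, r)]
    return pc * ps * pr * pw

def posterior_sprinkler_given_wet():
    """P(S=True | W=True) by enumeration over the hidden variables."""
    num = sum(joint(c, True, r, True)
              for c, r in product([True, False], repeat=2))
    den = sum(joint(c, s, r, True)
              for c, s, r in product([True, False], repeat=3))
    return num / den

print(round(posterior_sprinkler_given_wet(), 3))  # ≈ 0.43
```

Enumeration is exactly the O(2^n) cost the slide warns about; the factorization only reduces the number of *stored* probabilities, which is why the junction tree algorithm discussed later is needed for efficient inference.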
9
Development of the CPN
Motivation for considering CPNs for fault
detection
  • Areas of application
  • Decision support systems for medical diagnosis
    [Andreassen, 2001]
  • Fault localization in network management
    [Steinder and Sethi, 2004]
  • Medical diagnosis ↔ fault detection
  • Diseases ↔ network faults
  • Symptoms ↔ observation points
  • Knowledge ↔ causal relations between faults and
    observation points

10
Development of the CPN
Steps in development
  • Four step process
  • Explore knowledge domain
  • Fault causes and effects
  • Choice of basic fault model
  • Communication process and observations
  • Develop CPN structure from knowledge
  • Identify stochastic variables
  • Define dependencies
  • Define states
  • Attain probabilities
  • Prior and conditional probability distributions
  • Verify CPN
  • Structure and probabilities

11
Development of the CPN
Knowledge domain
Case
  • Aim: find the cause of insufficient throughput
  • Link breakdown → causes congestion
  • Source of noise → affects link condition (bad
    link)
  • Both considered permanent faults in the case
  • Communication controlled by TCP
  • Congestion avoidance and flow control by
    windowing
  • Reacts to packet loss by reducing packet
    transmissions, leading to a reduction in
    throughput
  • Inherently cannot distinguish packet loss
    causes
  • Observations
  • Made available by the TCP communication process,
    e.g. RTT
  • Unreliable: missing, delayed and noisy

Basic fault model graph
12
Development of the CPN
Structure development
  • Identified variables
  • Model based on faults (Hidden hypothesis nodes)
  • Intermediate variables (Hidden nodes)
  • Observations
  • Other considered elements
  • TCP
  • Application data, router service time,
  • Each variable has been defined with a finite set
    of states
  • Hypothesis nodes: based on the network states
    normal and fault
  • Observations: based on features

Basic model
13
Development of the CPN
Obtaining probabilities
  • Conditional probability
  • P(FRR = high | bad link = yes)
  • Methods for obtaining probabilities
  • Knowledge based approach
  • Captures the knowledge and experience of experts
  • Used when reliable measurements are not available
  • Data based approach
  • Fitting distributions to data sets
  • Learning probabilities
  • Based on representative learning cases of the
    conditionals
  • May be difficult to obtain in reality, e.g. fault
    states
  • Offline learning, batch learning
  • Handles hidden nodes and missing observations

14
Development of the CPN
Verifying the CPN
  • Verifying structure
  • Arrows and their directions are important
  • Wrong directions and dependencies may lead to
    wrong conclusions when inferring in the CPN.
  • Verifying probabilities
  • Difficult to verify probabilities accurately.

[Diagram: two candidate structures relating Congestion, Bad link and Packet loss]
15
Fault detector structure
  • Jesper Grønbæk

16
Fault detector structure
Presenting the components of the detector
  • Input: observations based on network traffic
  • Output: detection of a particular fault

17
Fault detector structure
Network and observation points I
  • Observations collected from network traffic
  • Example: NS-2 trace (network)

. . .
+ 1.446678 0 1 tcp 1340 --- 101
- 1.446678 0 1 tcp 1340 --- 101
d 1.446678 0 1 tcp 1340 --- 101    (dropped packet, FRR)
r 1.450758 1 0 ack 40   --- 91     (RTT sample)
+ 1.450758 0 1 tcp 1340 --- 102
- 1.450758 0 1 tcp 1340 --- 102
. . .
r 1.520422 1 0 ack 40   --- 100    (2nd dupack, no RTT sample)
+ 1.520422 0 1 tcp 1340 --- 114
- 1.520422 0 1 tcp 1340 --- 114
r 1.52471  1 0 ack 40   --- 100    (3rd dupack, no RTT sample)
+ 1.52471  0 1 tcp 1340 --- 101    (packet retransmission, PRR)
- 1.52471  0 1 tcp 1340 --- 101
. . .
Observation points: FRR, RTT, PRR
18
Fault detector structure
Network and observation points II
  • Congestion fault
  • Observations and throughput

[Plots during congestion (increased load): round trip time (ms), packet retransmission-rate, frame retransmission-rate, throughput (KB/s)]
19
Fault detector structure
Observation processing and evidence I
  • Sampling-based approach
  • Observation processing
  • Moving average
  • Observations: missing, delayed or noisy
  • Fixed time window
  • New observation sample at time t
  • An observation consists of zero or more samples
  • Discretization of observations
  • Mapping observations to states
  • RTT discretized into 10 equally sized bins:
    62-105 ms, 4.3 ms intervals
  • PRR and FRR
  • High/Low states
  • Threshold setting
  • Evidence vector

e = {RTT = 5, PRR = High, FRR = Low}
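The observation-to-evidence mapping can be sketched as follows. The RTT bin edges (62-105 ms in ten 4.3 ms bins) are from the slides; the PRR/FRR threshold values are assumed purely for illustration.

```python
# Discretization of observations into an evidence vector.
# RTT binning follows the slides; the PRR/FRR thresholds below
# are placeholder values, not the project's actual settings.
RTT_MIN, RTT_MAX, RTT_BINS = 62.0, 105.0, 10
PRR_THRESHOLD = 0.05   # assumed
FRR_THRESHOLD = 0.05   # assumed

def discretize_rtt(rtt_ms):
    """Map an RTT sample to one of 10 equally sized bins (0..9),
    clamping values outside the 62-105 ms range; None = missing."""
    if rtt_ms is None:
        return None
    width = (RTT_MAX - RTT_MIN) / RTT_BINS   # 4.3 ms per bin
    idx = int((rtt_ms - RTT_MIN) // width)
    return max(0, min(RTT_BINS - 1, idx))

def evidence(rtt_ms, prr, frr):
    """Build the evidence vector e = {RTT, PRR, FRR}."""
    return {
        "RTT": discretize_rtt(rtt_ms),
        "PRR": "High" if prr > PRR_THRESHOLD else "Low",
        "FRR": "High" if frr > FRR_THRESHOLD else "Low",
    }

print(evidence(85.0, 0.12, 0.01))
# {'RTT': 5, 'PRR': 'High', 'FRR': 'Low'}
```

An 85 ms RTT falls in bin 5, reproducing the evidence vector shown on the slide.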
20
Fault detector structure
Inference in the CPN, decision and detection I
  • Evidence is acquired
  • Observable nodes are initialized
  • Inference is conducted and marginal probabilities
    become available
  • A decision can be made about the network state
  • Detection is based on the transition from the
    normal to a fault state
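A minimal sketch of the decision and detection steps, assuming marginal probabilities have already been produced by inference in the CPN (the inference itself is not shown here):

```python
def decide(marginals):
    """Decision: pick the most probable network state from the
    inferred marginal probabilities."""
    return max(marginals, key=marginals.get)

def detect(marginal_sequence):
    """Detection: report an event whenever the decided state
    changes from 'normal' to a fault state."""
    detections = []
    previous = "normal"
    for t, marginals in enumerate(marginal_sequence):
        state = decide(marginals)
        if previous == "normal" and state != "normal":
            detections.append((t, state))
        previous = state
    return detections

# Hypothetical marginals over {normal, congestion} at three steps.
trace = [{"normal": 0.9, "congestion": 0.1},
         {"normal": 0.4, "congestion": 0.6},
         {"normal": 0.2, "congestion": 0.8}]
print(detect(trace))  # [(1, 'congestion')]
```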

21
Evaluating the CPN
  • Anders Nickelsen

22
Evaluating the CPN
Objectives and metrics
  • Evaluate the applicability of the CPN for fault
    detection
  • Parametric analysis
  • Comparison
  • High dependability
  • Classification metrics
  • Accuracy = (TN + TP) / (TN + TP + FP + FN)
  • Fault detection metrics
  • Reactivity time
  • False alarms
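The accuracy metric can be computed directly from confusion counts; the counts in the example below are made up for illustration.

```python
# Accuracy from a confusion matrix, as defined on the slide:
# Accuracy = (TN + TP) / (TN + TP + FP + FN).
def accuracy(tp, tn, fp, fn):
    return (tn + tp) / (tn + tp + fp + fn)

# Hypothetical counts, for illustration only.
print(accuracy(tp=65, tn=25, fp=5, fn=5))  # 0.9
```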

23
Evaluating the CPN
Important properties
  • Observation processing properties
  • Moving average window size
  • Observation node state-space
  • Probabilities
  • Prior probabilities of C and BL
  • Basic CPN properties
  • Impact of links and nodes
  • Fault detection properties
  • Individual faults (congestion vs. bad link)
  • Detection of multiple simultaneous faults
  • Performance compared to Vegas state predictor
  • Computational complexity

[Diagram: basic CPN with nodes BL, C, PL, PRR, RTT, FRR]
24
Evaluating the CPN
Observation processing I
  • Moving average mean estimator
  • Simple method to implement for observation
    processing
  • Use of window averaging
  • increases accuracy
  • lowers the number of false alarms
  • increases reactivity time
  • RTT window size impact on congestion
    classification
  • Window size change: 100 ms → 1100 ms
  • Accuracy change: 82% → 90%
  • Impact on reactivity time and false alarms
  • Reactivity time: 370 ms → 960 ms
  • False alarms: 14 → 2
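The fixed-time-window moving average described above might look like the sketch below; the window size matches the slides, but the RTT samples are illustrative.

```python
from collections import deque

# Moving-average estimator over a fixed time window: samples older
# than the window are discarded, and an observation may contain
# zero samples (in which case the estimate is unavailable).
class MovingAverage:
    def __init__(self, window_ms):
        self.window_ms = window_ms
        self.samples = deque()           # (timestamp_ms, value)

    def add(self, t_ms, value):
        self.samples.append((t_ms, value))
        self._expire(t_ms)

    def estimate(self, t_ms):
        self._expire(t_ms)
        if not self.samples:
            return None                  # missing observation
        return sum(v for _, v in self.samples) / len(self.samples)

    def _expire(self, now_ms):
        while self.samples and now_ms - self.samples[0][0] > self.window_ms:
            self.samples.popleft()

ma = MovingAverage(window_ms=1100)
for t, rtt in [(0, 80.0), (400, 90.0), (800, 100.0)]:
    ma.add(t, rtt)
print(ma.estimate(800))   # 90.0 (all three samples inside the window)
print(ma.estimate(1400))  # 95.0 (the sample at t=0 has expired)
```

A larger window averages over more samples, which filters noise (fewer false alarms, higher accuracy) but delays the reaction to a state change, exactly the trade-off reported on the slide.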

25
Evaluating the CPN
Observation processing II
  • Utilization of continuous variables from the
    system in the discrete CPN.
  • Features of observations need to be captured by a
    set of states
  • Features may be few and hard to distinguish.
  • PRR and FRR: binary variables
  • RTT: 12 states
  • Few and/or alike features → few states
  • Threshold location is critical
  • Number of states less essential in this case
  • More and/or distinguishable features → feasible
    with more states
  • Increases the resolution
  • Threshold location less essential
  • Raises the number of conditional probabilities
  • The level of difference in observation statistics
    is important for defining the number of states
    and setting thresholds

26
Evaluating the CPN
Prior probabilities
  • P(C) and P(BL) represent prior belief of network
    states
  • Sensitivity control
  • Decreased belief in fault state requires stronger
    evidence
  • ROC curve
  • Trade-off between true positives and false
    positives
  • Reactivity time not depicted
  • 50/50 (fault/normal)
  • Reactivity time: 583 ms
  • 5/95 (fault/normal)
  • Reactivity time: 1380 ms
  • Represents a comparison foundation for
    classifiers

27
Evaluating the CPN
Structure
  • Intermediate packet loss node
  • Connects network and node from scenario
  • Needed to introduce e.g. TCP
  • Important for conditional (in)dependence
    properties and efficient inference [Kjærulff,
    2005]
  • Can reduce CPN complexity by reducing the number
    of conditional probabilities [Andreassen, 2001]
  • Insignificant impact in the basic CPN
  • Causal links in the CPN
  • Enables handling of
  • Ambiguity in observations
  • Inconsistent observations
  • Missing observations
  • Verified performance impact
  • RTT quality decisive

[Diagram: basic CPN with nodes BL, C, PL, PRR, RTT, FRR]

e = {RTT = n/a, PRR = 1, FRR = 1}
28
Evaluating the CPN
Fault detection evaluation
  • Detection of congestion more accurate than bad
    link
  • Good quality of RTT and lower quality of PRR and
    FRR
  • Introduce more observations, e.g. RSSI
  • Better quality in observations
  • Detection of simultaneous faults
  • Decrease in accuracy
  • Congestion: 86% → 82%
  • Bad link: 79% → 74%
  • No significant impact on reactivity time or
    false alarms

29
Evaluating the CPN
Fault detection evaluation
  • CPN compared to Vegas
  • State predictor functionality in TCP Vegas
  • Predicts the congestion network state
  • Simple statistics: RTT and window size
  • Vegas not designed for application-layer
    fault detection
  • CPN outperforms Vegas measured in accuracy and
    reactivity time
  • Vegas
  • False positives: 5%
  • True positives: 12%
  • Reactivity time: 1900 ms
  • CPN
  • False positives: 5%
  • True positives: 65%
  • Reactivity time: 875 ms
  • CPN uses multiple observations and averaging
    → uses several samples and filters out noise

30
Evaluating the CPN
Computational complexity
  • Inference in multiply connected CPNs is NP-hard
    [Kjærulff, 2005]
  • Junction tree algorithm (JTA): method to compile
    the CPN to a singly connected graph
  • Inference performance exponential in the maximum
    clique size
  • Optimal compilation (minimum clique size) is
    NP-hard [Mahadevan, 2002]
  • CPN will become more complex when extended with
    more faults and observations
  • Additional links increase the clique size
  • Clique size depends on CPN density
  • Low density → linear
  • High density → exponential growth
  • Exact inference not feasible in large, complex
    CPNs
  • Real-time requirement
  • Alternative approximation methods must be
    investigated
  • Stochastic sampling, model simplification,
    search-based inference

31
Evaluating the CPN
Relating to external requirements
  • Fault detector in a high-dependability context
  • Scenario requirements on the fault detector
  • Reactivity time
  • Number of false alarms
  • Example: λ_false alarm = 2.37·10^-7 FA/s
  • 1 fault/24 h, 99% success rate, 2% false alarms
  • Mean time before failure (false alarm) is approx.
    49 days
  • Relate performance of the CPN to requirements
  • Common congestion test scenario
  • Basic setup: 6.6·10^-2 FA/s
  • Tweaked setup: 0 FA/s
  • Is zero good enough?
  • Total test duration is 7.5 minutes
  • Confidence not good enough from the test scenario
  • Total operation time for λ is 49 days
  • IEC 61508 part 7, annex D
  • Techniques and measures concerning functional
    safety in safety-related PESs
  • Test duration t = 3 × 49 days = 147 days, using
    a 95% confidence interval
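The false-alarm budget arithmetic on this slide can be reproduced directly; the factor of 3 for a 95% confidence demonstration follows the IEC 61508 rule of thumb cited above.

```python
# Mean time between false alarms implied by the rate on the slide.
SECONDS_PER_DAY = 24 * 3600
rate = 2.37e-7                          # false alarms per second
mtbf_days = 1 / rate / SECONDS_PER_DAY
print(round(mtbf_days, 1))              # 48.8, i.e. approx. 49 days

# Required test duration: roughly 3x the mean time between false
# alarms to demonstrate the rate at 95% confidence.
print(3 * 49)                           # 147 days
```

This makes the slide's point concrete: a 7.5-minute test run, even with zero observed false alarms, cannot demonstrate a rate whose mean time between events is 49 days.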

32
Conclusion
  • Anders Nickelsen

33
Conclusion
Achievements in project
  • Defined the overall fault model
  • Defined a basic model with two faults and
    established the knowledge domain
  • Developed a CPN structure based on knowledge
  • Developed a simulation environment to test the
    implementation
  • Verified the correctness of the structure using
    an empirical approach
  • Learned probabilistic information from
    observation points in data traffic
  • Implemented a fault detector based on the CPN
  • Conducted a parametric analysis of the
    implemented detector

34
Conclusion
Pros
  • Decision under uncertainty
  • Ambiguous data
  • Inconsistent data
  • Missing data
  • Utilizes multiple observations
  • Detects multiple simultaneous faults
  • Modelling of complex systems using probabilistic
    relations
  • Graphical models provide a basis to assess
    dependency relations
  • Models may be based on statistics or knowledge
  • Probabilities can be learned from data and
    combined with knowledge from experts
  • Potential to handle many faults, and take
    advantage of probabilistic relations between the
    faults, observations and functionalities of a
    communication system

35
Conclusion
Cons
  • High computational complexity, approximate
    methods are needed
  • Incapable of incorporating dynamics
  • Much effort required to construct model
  • Requires a sound understanding of the problem
    domain.
  • Dependencies and conditional probabilities
  • Few formal design patterns

36
Conclusion
Further investigations
  • Efficiency in high dependability scenario
  • Still to be evaluated
  • Extend fault model
  • Detection capabilities of more than 2 faults
    needed
  • Comparison to existing methods and high
    dependability requirements
  • Observations
  • Utilize framework and add more observations
  • throughput, RSSI, SNR, packet inter-arrival time
  • Draw other (and possibly better) statistics from
    existing observation points
  • Generality
  • Probabilities are expected to differ across
    scenarios
  • On-line learning [Murphy, 2002]
  • Impact of extending the CPN structure
  • Mainly relates to computational complexity
  • Size
  • Structure
  • Unknown how easy it is to implement new
    functionalities

37
Dynamic Bayesian Network
  • Jesper Grønbæk

38
Dynamic Bayesian Network
Motivation to explore Dynamic Bayesian Networks
  • Causal Probabilistic Network (CPN) = Bayesian
    Network (BN)
  • Represents a static view of the process
  • Cannot include dynamics or changes in time
  • Sufficient in many decision support systems
  • Medical diagnosis [Andreassen, 2001]
  • Network diagnosis (network management) [Steinder
    and Sethi, 2004]
  • Extension to BNs called Dynamic Bayesian Networks
  • Used to model a dynamic process by including
    temporal causality
  • Medical example: probability of having the flu
    today given I had the flu yesterday
  • Network example: probability of being in a high
    congestion state currently given the network was
    in a high congestion state previously
  • Examples of applications
  • Prediction of blood glucose levels [Andreassen,
    2001]
  • Fault diagnosis in process monitoring [Lauber,
    1999]
  • Speech recognition [Zweig, 1998]

39
Dynamic Bayesian Network
Defining the DBN
  • Dynamic Bayesian Network
  • Bayesian Network extended in time, i.e. causal
    links in time
  • Discrete-time stochastic processes
  • The slice at time t = 1 defines the priors
  • Under a first-order Markov assumption, the DBN
    may be represented by a 2TBN (two-slice temporal
    Bayesian network)
  • The first slice contains no parameters; the
    second slice contains the conditional
    probabilities
  • Parents may exist in slice t or slice t-1
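The network example can be sketched as a first-order Markov process: the congestion state is filtered forward through a 2TBN-style transition model plus a per-slice observation model. All probabilities below are assumed for illustration only.

```python
# First-order Markov sketch of the congestion example:
# P(state_t | state_{t-1}) plus P(observation | state),
# filtered forward over time. All numbers are assumed.
STATES = ["normal", "congestion"]
TRANSITION = {"normal": {"normal": 0.95, "congestion": 0.05},
              "congestion": {"normal": 0.10, "congestion": 0.90}}
# Observation model, e.g. a coarsely discretized RTT level.
EMISSION = {"normal": {"low_rtt": 0.8, "high_rtt": 0.2},
            "congestion": {"low_rtt": 0.3, "high_rtt": 0.7}}
PRIOR = {"normal": 0.95, "congestion": 0.05}   # slice t = 1

def filter_forward(observations):
    """Return P(state_t | o_1..o_t) for each time step."""
    belief = dict(PRIOR)
    history = []
    for obs in observations:
        # predict: push the belief through the transition model
        predicted = {s: sum(belief[p] * TRANSITION[p][s] for p in STATES)
                     for s in STATES}
        # update: weight by the observation likelihood, normalize
        unnorm = {s: predicted[s] * EMISSION[s][obs] for s in STATES}
        z = sum(unnorm.values())
        belief = {s: unnorm[s] / z for s in STATES}
        history.append(belief)
    return history

beliefs = filter_forward(["high_rtt", "high_rtt", "high_rtt"])
print(round(beliefs[-1]["congestion"], 2))  # ≈ 0.8 after three slices
```

Repeated high-RTT observations push the belief toward the congestion state, illustrating how temporal causality accumulates evidence that a single static slice cannot.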

40
Dynamic Bayesian Network
Improving the basic CPN
  • Improve the model to consider the dynamics of
    the communication process
  • Handle the cycle induced by the feedback of TCP
  • A static view of observables is no longer needed;
    possibly take advantage of sequences of
    observations to improve statistics [Liu et al.,
    2003]
  • The same methods for learning and inference in
    CPNs are possible for DBNs [Murphy, 2002]

41
Questions
  • End of presentation

42
References
[Andreassen, 2001] Andreassen, S. (2001). Medical
decision support systems based on causal
probabilistic networks. Technical report,
Department of Medical Informatics and Image
Analysis, Aalborg University.
[Steinder and Sethi, 2004] Steinder, M. and Sethi,
A. S. (2004). Probabilistic fault localization in
communication systems using belief networks.
Technical report, IEEE.
[Kjærulff, 2005] Kjærulff, U. (2005). Constructing
Bayesian Networks. Presentation, Machine
Intelligence Group, Department of Computer
Science, Aalborg University.
[Mahadevan, 2002] Mahadevan, S. (2002). The
Junction Tree Algorithm. Presentation, University
of Massachusetts.
[Murphy, 2002] Murphy, K. P. (2002). Dynamic
Bayesian Networks: Representation, Inference and
Learning. PhD thesis, University of California,
Berkeley.
[Lauber, 1999] Lauber, J., Steger, C. and Weiss, R.
(1999). Autonomous Agents for Online Diagnosis of
a Safety-critical System Based on Probabilistic
Causal Reasoning. Technical report, Institute for
Technical Informatics, Technical University Graz.
[Zweig, 1998] Zweig, G. and Russell, S. (1998).
Speech Recognition with Dynamic Bayesian Networks.
Computer Science Division, UC Berkeley.
[Liu et al., 2003] Liu, J., Matta, I. and Crovella,
M. (2003). End-to-end inference of loss nature in
a hybrid wired/wireless environment. Technical
report, Department of Computer Science, Boston
University.
43
Additional slides
44
Additional slides
Learning cases
C,    BL, RTT, PRR,  FRR, PL
...
High, No, n/a, High, Low, n/a
High, No, 6,   High, Low, n/a
High, No, 2,   High, Low, n/a
High, No, 0,   High, Low, n/a
High, No, 4,   High, Low, n/a
...
45
Additional slides
Extended model
46
Additional slides
Long congestion simulation run
47
Additional slides
Connection counts
48
Additional slides
Congestion classification