Title: Automated Fault diagnosis in VoIP
1 Automated Fault diagnosis in VoIP
- 31st March, 2006
- Vishal Kumar Singh and Henning Schulzrinne
2 VoIP Diagnosis
- What is automated VoIP diagnosis
  - Determining failures in the network
  - Automatically finding the root cause of the failure
- Why VoIP diagnosis
  - Networks are complex, making it difficult to troubleshoot problems
  - Automatic fault diagnosis reduces human intervention
- Issues in VoIP diagnosis
  - Detecting failures/faults
  - Finding the cause of failure, determining dependency relationships among different components for diagnosis
- Solution steps and approaches
3 Issues in Automated VoIP Diagnosis
- Increasingly complex and diverse network elements
- Complex interactions/relationships between different network elements
- Different run-time bindings for each application usage instance, e.g., different calls may use different DNS servers, SIP proxy servers, and media paths
- A problem in one network element may manifest itself as a user-perceived failure of another element
4 Fault Identification
- Service unavailability reporting
  - Node/device/UA generates faults (failure events), e.g., SNMP traps, failure messages
  - Monitoring application, e.g., an SNMP-based application, detects service unavailability and reports the failure event
  - Affected user reports service unavailability, e.g., by e-mail, by calling the helpdesk, or automatically by pressing a button on the phone while in a call and experiencing echo
  - Dependent application detects service unavailability and generates a fault (failure event)
5 Fault Localization: Determining the Source of the Problem
- Fault classification: local vs. global (does it affect only me, or does it affect others also?)
- Global failures
  - Server failures, e.g., SIP proxy, DNS, DB failures
  - Network failures
- Local failures
  - Specific source failure, e.g., node A cannot make a call to anyone
  - Specific destination or participant failure, e.g., no one can make a call to node B
- Locally observed but global failures, e.g., the DNS service failed, but only B observed it
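The local-vs-global question can be sketched as a comparison of peer answers. This is a minimal illustration, assuming each peer simply reports whether it reproduced the failure; the function name `classify_fault` and the labels it returns are hypothetical and not from the slides.

```python
from typing import Dict


def classify_fault(peer_sees_failure: Dict[str, bool]) -> str:
    """Classify a reported failure from peers' answers to "do you see it too?".

    peer_sees_failure maps a peer name to True if that peer reproduced the
    failure. The returned labels are illustrative only.
    """
    if not peer_sees_failure:
        return "unknown"  # no peer answered, nothing to compare against
    confirming = [peer for peer, seen in peer_sees_failure.items() if seen]
    if not confirming:
        return "local"    # only the reporting node is affected
    if len(confirming) == len(peer_sees_failure):
        return "global"   # every peer asked sees the same failure
    return "partial"      # some peers see it, e.g. one subnet or domain
```

A "partial" answer is where locality information (LAN, domain, subnet) would be needed to narrow the fault further.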
6 Solution Approach
- DYSWIS: "Do You See What I See" [1]
  - Peers (nodes) perform diagnostic tests when another peer reports or detects a failure
  - Nodes can choose the diagnostic test depending on the dependency relationship, encoded as a decision tree
  - Nodes (at least some) will be initially preloaded with the dependency relationship in some format (e.g., XML based)
  - Nodes (at least some) may build and update the dependency relationship based on statistical and temporal analysis of the failure events they receive and the diagnostic tests they perform
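Choosing tests from a dependency relationship encoded as a decision tree can be sketched as follows. This is an assumption-laden illustration: the `ProbeNode` class, the test names, and the diagnosis strings are hypothetical, and the slides do not specify the tree format beyond "e.g., XML based".

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple, Union


@dataclass
class ProbeNode:
    test: str                           # name of the diagnostic test to run
    on_pass: Union["ProbeNode", str]    # next test, or a final diagnosis
    on_fail: Union["ProbeNode", str]


def diagnose(root: ProbeNode, run_test: Callable[[str], bool]) -> Tuple[str, List[str]]:
    """Walk the tree, running one test per level, until a diagnosis is reached."""
    node: Union[ProbeNode, str] = root
    executed: List[str] = []
    while isinstance(node, ProbeNode):
        executed.append(node.test)
        node = node.on_pass if run_test(node.test) else node.on_fail
    return node, executed


# Hypothetical tree: check DNS first, then transport, then the SIP proxy.
TREE = ProbeNode(
    "dns_lookup",
    on_pass=ProbeNode(
        "tcp_connect",
        on_pass=ProbeNode("sip_ping", "no fault found", "SIP proxy failure"),
        on_fail="network or host failure",
    ),
    on_fail="DNS failure",
)
```

Here `run_test` stands in for whatever mechanism a peer uses to execute a probe; passing a lookup of canned results is enough to exercise the walk.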
7 Solution Approach
- Store context information of past failures experienced by each node
  - E.g., the specific server that was acting as the proxy server (for my call which failed)
- Store locality of past failure instances
  - LAN, domain, subnet
  - First hop at each layer, e.g., switch (MAC), default gateway (IP), domain's proxy (application layer)
- Failure count for each network element (statistical)
- Last failure timestamp for each network element
- Last successfully seen timestamp for each network element ("why do I need to test the proxy for you? My call just went through")
- Temporal correlation of past failures (the proxy seems to fail after DNS fails)
- Each node has a runtime dependency list based on past failures and diagnostic tests
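The per-element bookkeeping listed above (failure count, last failure timestamp, last successfully seen timestamp) might be held in a structure like the following. The class and method names are hypothetical; the "recently OK" rule is one possible reading of "my call just went through", not a rule stated in the slides.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class ElementHistory:
    failure_count: int = 0
    last_failure: Optional[float] = None   # timestamp of last observed failure
    last_success: Optional[float] = None   # timestamp element was last seen working


class FailureHistory:
    def __init__(self) -> None:
        self.elements: Dict[str, ElementHistory] = {}

    def record(self, element: str, ok: bool, now: Optional[float] = None) -> None:
        """Record one observation (success or failure) of a network element."""
        now = time.time() if now is None else now
        hist = self.elements.setdefault(element, ElementHistory())
        if ok:
            hist.last_success = now
        else:
            hist.failure_count += 1
            hist.last_failure = now

    def recently_ok(self, element: str, max_age: float, now: Optional[float] = None) -> bool:
        """True if the element worked within max_age seconds and has not failed since."""
        now = time.time() if now is None else now
        hist = self.elements.get(element)
        if hist is None or hist.last_success is None:
            return False
        if hist.last_failure is not None and hist.last_failure > hist.last_success:
            return False
        return now - hist.last_success <= max_age
```

A peer asked to test the proxy could consult `recently_ok` first and skip the probe when its own call just succeeded through that proxy.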
8 Solution Architecture
Nodes in different domains cooperating to determine the cause of failure
9 Solution Architecture: Logical View
[Figure labels: failures in network; dependency graph generation (Bayesian network based, inference, other models); test results; decision tree updates; triggers to perform tests (peer selection and probe selection); dependency relationships and tests (XML)]
The above figure shows the logical entities and the separation between dependency graph generation and the distributed diagnostic infrastructure (enclosed in blue).
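The slides name Bayesian-network-based inference as one option for the dependency graph side but give no model. As a rough illustration only, the sketch below uses a much simpler stand-in: a single-fault, naive-Bayes style posterior over candidate root causes given test outcomes. The function name and the input layout are assumptions; any priors and likelihoods fed to it would be invented by the user, not measured data.

```python
from typing import Dict


def root_cause_posterior(
    priors: Dict[str, float],
    fail_prob: Dict[str, Dict[str, float]],
    observed_failed: Dict[str, bool],
) -> Dict[str, float]:
    """Posterior P(cause | test outcomes), assuming exactly one cause and
    tests that are independent given the cause.

    priors:          cause -> P(cause)
    fail_prob:       cause -> {test -> P(test fails | cause)}
    observed_failed: test  -> True if the test failed
    """
    scores: Dict[str, float] = {}
    for cause, prior in priors.items():
        p = prior
        for test, failed in observed_failed.items():
            p_fail = fail_prob[cause][test]
            p *= p_fail if failed else (1.0 - p_fail)
        scores[cause] = p
    total = sum(scores.values())
    if total == 0.0:
        raise ValueError("observations are impossible under every candidate cause")
    return {cause: score / total for cause, score in scores.items()}
```

A full Bayesian network would also model dependencies between causes (for example, the proxy failing because DNS failed), which this single-fault simplification ignores.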
10 Solution Requirements
- Request-response protocol between the node that experiences the failure and the peer nodes
- Nodes' capability to perform diagnostic tests (probes); probe selection based on cost/result
- Encoding the dependency relationship into a decision tree (given as input by an expert, e.g., as XML)
- Peer node discovery, based on
  - Location (local network, domain)
  - Capability to perform tests (based on specific tests)
- Dependency graph generation and updating, based on
  - Network failure events
  - Diagnostic test results correlated with failures
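The request-response exchange between the failing node and a peer could look roughly like this. The slides do not define a wire format, so the JSON encoding, the field names, and the functions `make_request` and `handle_request` are all assumptions made for illustration.

```python
import json
import uuid
from typing import Callable, Dict


def make_request(reporter: str, test: str, target: str) -> str:
    """Build a diagnostic request asking a peer to run `test` against `target`."""
    return json.dumps({
        "type": "diag-request",
        "id": str(uuid.uuid4()),   # lets the reporter match responses to requests
        "reporter": reporter,
        "test": test,
        "target": target,
    })


def handle_request(raw: str, tests: Dict[str, Callable[[str], bool]]) -> str:
    """Peer side: run the requested test if supported and report the outcome."""
    request = json.loads(raw)
    test_fn = tests.get(request["test"])
    if test_fn is None:
        result = "unsupported"     # this peer lacks the capability for that test
    else:
        result = "pass" if test_fn(request["target"]) else "fail"
    return json.dumps({
        "type": "diag-response",
        "id": request["id"],
        "test": request["test"],
        "target": request["target"],
        "result": result,
    })
```

The "unsupported" result reflects the capability-based peer discovery requirement: a reporter can use it to pick a different peer for that test.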
11 Test/Probe Selection
- Which diagnostic probe to run, network layer or application layer, and for what kind of failures
- A probe covering a broad range of failures can give faster but cruder, less accurate results
  - E.g., PING vs. TCP connect vs. SIP PING tests
- Cost of probe
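One way to picture cost-based probe selection is a table of probes, each with a cost and the failure classes it can reveal, from which the cheapest applicable probe is tried first. The costs, the failure-class names, and the `select_probes` function below are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence


@dataclass(frozen=True)
class Probe:
    name: str
    cost: float                 # relative cost; units are arbitrary here
    detects: FrozenSet[str]     # failure classes a failed probe may indicate


# Illustrative table: SIP PING covers the broadest range but costs the most.
PROBES = (
    Probe("ping", 1.0, frozenset({"network"})),
    Probe("tcp_connect", 2.0, frozenset({"network", "host_port"})),
    Probe("sip_ping", 5.0, frozenset({"network", "host_port", "sip_service"})),
)


def select_probes(suspected: str, probes: Sequence[Probe] = PROBES) -> List[Probe]:
    """Return the probes able to reveal the suspected failure, cheapest first."""
    candidates = [probe for probe in probes if suspected in probe.detects]
    return sorted(candidates, key=lambda probe: probe.cost)
```

A broad probe such as SIP PING answers "is anything wrong" in one step, while narrower probes are needed afterwards to say which layer failed.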
12 Dependency Classifications
- Functional dependency
  - At the generic service level, e.g., SIP proxy depends on DB service and DNS service
- Structural dependency
  - Configuration time, e.g., the Columbia CS SIP proxy is configured to use a MySQL database on metro-north
- Operational dependency
  - Runtime dependencies or run-time bindings, e.g., the call which failed was using a failover SIP server obtained from DNS, which was running on host a.b.c.d in the IRT lab
13 Dependency Classifications: Layered Approach
- Vertical and lateral dependencies: applications depend on other application-layer services (e.g., SIP service depends on DB and DNS services) as well as on lower-layer services
- OSI layers as service dependency layers
  - An application-layer service also depends on a transport-layer service, which in turn depends on a network-layer service
  - MAC layer: access point, switch
  - Network layer: router
  - Application layer: DNS, SIP, database
- Topology-based dependency
  - E.g., calls from the CS domain depend on a specific SIP server; calls from lab phones depend on specific switches and routers
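The layered dependencies above can be written down as a small directed graph, from which the full set of elements a service relies on follows by traversal. The graph contents and the helper `transitive_dependencies` are an illustrative sketch; the node names are simplified stand-ins for the services and layers named on the slide.

```python
from collections import deque
from typing import Dict, List

# Edges point from a service to what it depends on (lateral and vertical).
DEPENDS_ON: Dict[str, List[str]] = {
    "sip_service": ["db", "dns", "transport"],   # lateral: DB, DNS; vertical: transport
    "dns": ["transport"],
    "db": ["transport"],
    "transport": ["network"],
    "network": ["mac"],
    "mac": [],
}


def transitive_dependencies(service: str, graph: Dict[str, List[str]]) -> List[str]:
    """Breadth-first list of everything `service` depends on, nearest first."""
    seen = {service}
    order: List[str] = []
    queue = deque([service])
    while queue:
        current = queue.popleft()
        for dep in graph.get(current, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order
```

The resulting list is the candidate set a diagnosis would have to consider when the top-level service fails.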
14 Dependency Graph
15 Dependency Graph Encoded to Decision Tree
16 Diagnostic Tests
- SIP proxy
  - Proxy server availability
    - SIP PING
  - Call routing availability
    - INVITE tests
  - Call path determination
    - SIP TraceRoute
- Media path
  - Quality related
    - Speech quality degradation: MOS
    - Echo
    - Jitter: MOS, PESQ
    - QoS: RTCP
- NAT/firewall
  - Checking binding expiration
  - Firewall failure to open a port: one-way media
  - How to determine which firewall in the path? SIP signaling?
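For the media-path quality tests, two standard calculations are commonly used and can be sketched briefly: the RTP interarrival jitter estimate from RFC 3550 (the value reported in RTCP) and the R-factor to MOS mapping from the ITU-T G.107 E-model. The slides name the metrics but not the formulas, so using these particular ones is an assumption; the function names are hypothetical.

```python
from typing import Sequence


def interarrival_jitter(send_times: Sequence[float], recv_times: Sequence[float]) -> float:
    """Running jitter estimate as in RFC 3550: J += (|D| - J) / 16.

    D is the change in transit time between consecutive packets. Both
    sequences must use the same time unit and have the same length.
    """
    jitter = 0.0
    for i in range(1, len(send_times)):
        d = (recv_times[i] - recv_times[i - 1]) - (send_times[i] - send_times[i - 1])
        jitter += (abs(d) - jitter) / 16.0
    return jitter


def r_to_mos(r: float) -> float:
    """Map an E-model R-factor to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + 7e-6 * r * (r - 60.0) * (100.0 - r)
```

PESQ, by contrast, compares a reference and a degraded speech signal and is not reducible to a short formula like these.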
17 Diagnostic Tests
- DNS tests
- DHCP
- Switch/Router
- ARP/RARP/Multicast
- BGP failures
- Conference mixers
- Gateway
  - Echo return loss: readings, analysis
- DB
- XCAP server tests
- Presence service availability tests
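Two of the simpler availability tests, a DNS lookup and a TCP connect, can be sketched with the Python standard library. The function names and the default timeout are assumptions; a SIP-level test (for example, sending an OPTIONS request) would need a SIP stack and is not shown.

```python
import socket


def dns_resolves(hostname: str) -> bool:
    """True if the system resolver returns at least one address for hostname."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except socket.gaierror:
        return False


def tcp_connects(host: str, port: int, timeout: float = 2.0) -> bool:
    """True if a TCP connection to (host, port) completes within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:   # refused, unreachable, timed out, or name resolution failed
        return False
```

Note that `tcp_connects` returning False does not by itself separate a network failure from a stopped service, which is why such results need to be combined with other probes.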
18 Example
- Call failure: possible causes
  - SIP proxy server
    - Database
    - Authentication
  - Media path failure
  - Gateway
    - Specific call legs: ERL, authentication, etc.
  - DNS server failure
  - End station failure
  - Network failure, e.g., router, switch failure
- Different calls will have different run-time dependencies
19 Mapping to a Human Medical System
- Doctors perform diagnostic tests to find the cause of a disease when symptoms are reported. They may learn new things about the disease as part of the diagnostic tests
  - Failures and triggered tests update the dependency graph
- Medical researchers do different types of tests to learn about new diseases and to determine the cause of a disease and its relationship with other physiological systems
  - A set of tests can run periodically and be used to build the dependency graph independent of failures
20 Solution Evolution
- Learning the dependency graph from failure events and diagnostic tests
- Learning using random/periodic testing to identify failures and determine relationships
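Learning relationships from failure events could start from a simple temporal co-occurrence count: how often a failure of one element is followed by a failure of another within a short window. This is only one possible heuristic, offered as a sketch; the function name, the event format, and the window are assumptions, and a high count suggests a candidate dependency rather than proving one.

```python
from collections import Counter
from typing import Iterable, Tuple


def follow_counts(events: Iterable[Tuple[float, str]], window: float) -> Counter:
    """Count ordered pairs (a, b) where element b fails within `window`
    seconds after element a fails.

    events: (timestamp, element) pairs in any order.
    """
    ordered = sorted(events)
    counts: Counter = Counter()
    for i, (t_a, a) in enumerate(ordered):
        for t_b, b in ordered[i + 1:]:
            if t_b - t_a > window:
                break               # later events are even further away
            if b != a:
                counts[(a, b)] += 1
    return counts
```

For example, repeated ("dns", "proxy") pairs would match the earlier observation that the proxy seems to fail after DNS fails.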
21 Future Directions
- Self-healing
- Predicting failures
- Protocols for labeling failure events, which would enable automatically incorporating new devices/applications into the dependency system
- Decision tree (dependency graph) based event correlation
22 Reference
- [1] User-oriented Management of VoIP Applications (http://www.ibr.cs.tu-bs.de/projects/nmrg/meetings/2005/nancy/dyswis.pdf)