Title: Effective Diagnosis of Routing Disruptions from End Systems
1Effective Diagnosis of Routing Disruptions from End Systems
- Ying Zhang, Z. Morley Mao, Ming Zhang
2Routing disruptions impact application performance
- More applications today have high QoS requirements
- Routing events can cause high loss and long delays
[Figure: Src and Dst connected across the Internet through AS B, AS C, AS D, and AS E]
3Existing approaches to diagnose routing
disruptions are ISP-centric
- Require routing data from many routers in ISPs
- Feldmann04, Teixeira04, Wu05
- Passive and accurate
[Figure: AS B, AS C, and AS D in the Internet]
4Limitations of ISP-centric approaches
- Difficult to gain access to data from many ISPs
- BGP data reflects only the expected data-plane paths, not necessarily the actual ones
[Figure: end systems, an ISP, and AS B, AS C, and AS D in the Internet, with question marks on the parts whose routing state is unknown]
5Can we diagnose entirely from end systems?
- Goal: infer data-plane paths of many routers
[Figure: probing host, ISP A, AS B, AS C, AS D, and Dst]
6Our approach: end-system-based monitoring
- Only require probing from end hosts
- Cover all the PoPs of a target ISP
[Figure: probing host, target ISP, AS B, AS C, AS D, and Dst]
7Our approach: end-system-based monitoring
- Cover most of the destinations on the Internet
[Figure: probing host, ISP A, AS B, AS C, AS D, and several Dst nodes]
8Our approach: end-system-based monitoring
- Identify routing changes by comparing paths measured consecutively
[Figure: probing host, ISP A, AS B, AS C, AS D, and Dst]
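The comparison of consecutively measured paths can be sketched as follows. This is a minimal illustration, not the system's actual implementation: the snapshot layout, the hop labels, and the function name `detect_routing_events` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# A measured path is the ordered hop sequence seen by one probe (labels are hypothetical).
Path = Tuple[str, ...]
# A snapshot maps (probing host, destination) to the path measured in one probing round.
Snapshot = Dict[Tuple[str, str], Path]


def detect_routing_events(prev: Snapshot, curr: Snapshot) -> List[dict]:
    """Return one event per (host, destination) whose path differs between two rounds."""
    events = []
    # Only pairs measured in both rounds can be compared.
    for key in sorted(prev.keys() & curr.keys()):
        if prev[key] != curr[key]:
            host, dst = key
            events.append({"host": host, "dst": dst,
                           "old_path": prev[key], "new_path": curr[key]})
    return events


# Two consecutive rounds from one probing host (made-up data).
round_1 = {("host1", "dst1"): ("ISP A", "AS B", "AS D"),
           ("host1", "dst2"): ("ISP A", "AS C")}
round_2 = {("host1", "dst1"): ("ISP A", "AS C", "AS D"),
           ("host1", "dst2"): ("ISP A", "AS C")}

events = detect_routing_events(round_1, round_2)
print(events)
```

Only the path toward `dst1` changed between the two rounds, so a single event is reported for it.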
9Advantages and challenges
- Advantages
- No need for access to ISP-proprietary data
- Identify actual data-plane paths
- Monitor data plane performance
- Challenges
- Limited resources to probe
- Coverage of probed paths
- Timing granularity
- Measurement noise
10System architecture
Collaborative probing
Event identification and classification
Event correlation and inference
Event impact analysis
Reports
11Outline
- Collaborative probing
- Event identification and classification
- Event correlation and inference
- Result and validation
12Collaborative probing
- Using a set of hosts
- To learn the routing state
- To improve coverage
- To reduce overhead
[Figure: probing hosts, ISP A, AS B, AS C, and AS D]
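One way to trade coverage against probing overhead is a greedy covering heuristic. The sketch below is an assumption for illustration only; the slides do not state how probes are assigned. The names `select_probes` and `budget`, and the PoP labels, are hypothetical.

```python
def select_probes(candidates, budget):
    """Greedily pick (host, destination) probes that cover the most new PoPs.

    candidates: dict mapping (host, destination) -> set of PoPs the probe traverses
    budget:     maximum number of probes to select (models limited probing resources)
    """
    covered = set()
    chosen = []
    remaining = dict(candidates)
    while remaining and len(chosen) < budget:
        # Sorting first makes tie-breaking deterministic.
        best = max(sorted(remaining), key=lambda p: len(remaining[p] - covered))
        if not remaining[best] - covered:
            break  # no remaining probe adds coverage
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Made-up candidate probes and the PoPs each one is assumed to traverse.
candidates = {
    ("h1", "d1"): {"NYC", "CHI"},
    ("h1", "d2"): {"NYC"},
    ("h2", "d1"): {"LAX", "CHI"},
    ("h2", "d3"): {"LAX", "SEA"},
}
chosen, covered = select_probes(candidates, budget=3)
print(chosen, sorted(covered))
```

In this example two probes already cover all four PoPs, so the third slot in the budget goes unused.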
14Event classification
- Classify events according to ingress/egress changes
- Type 1: ingress PoP changes
- Type 2: ingress PoP same, egress PoP different
- Type 3: ingress PoP same, egress PoP same
[Figure: probing host, target ISP, and destination prefix P]
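The three event types can be expressed as a small decision procedure. This is a sketch under assumptions: the `PathView` record and its field names are hypothetical, and it presumes each measured path has already been mapped to its ingress and egress PoP in the target ISP.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathView:
    """How one measured path crosses the target ISP (field names are hypothetical)."""
    ingress_pop: str
    egress_pop: str
    hops: tuple  # full hop sequence, used to tell Type 3 apart from "no change"


def classify_event(old: PathView, new: PathView) -> str:
    """Classify a path change by which of ingress PoP / egress PoP changed."""
    if old.ingress_pop != new.ingress_pop:
        return "Type 1"  # ingress PoP changes
    if old.egress_pop != new.egress_pop:
        return "Type 2"  # ingress PoP same, egress PoP different
    if old.hops != new.hops:
        return "Type 3"  # ingress and egress PoP same, path still differs
    return "no change"


old = PathView("NYC", "CHI", ("NYC", "DEN", "CHI"))
print(classify_event(old, PathView("BOS", "CHI", ("BOS", "DEN", "CHI"))))  # Type 1
print(classify_event(old, PathView("NYC", "LAX", ("NYC", "DEN", "LAX"))))  # Type 2
print(classify_event(old, PathView("NYC", "CHI", ("NYC", "ATL", "CHI"))))  # Type 3
```

Checking the ingress PoP first mirrors the type definitions: the egress PoP is only examined once the ingress PoP is known to be unchanged.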
16Likely causes: link failures
[Figure: old path and new path from the probing host through the target ISP, via the old and new egress PoPs, toward the neighbor AS and destination prefix P]
17Likely causes: internal distance changes
- Hot potato changes
- Cost of old internal path increases
- Cost of new internal path decreases
[Figure: probing host, old and new egress PoPs, and neighbor AS, with internal path distances labeled 120, 80, 100, and 120]
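A hot potato change can be illustrated with a few lines of code: the ISP hands traffic to the egress PoP with the smallest internal distance, so a cost increase on the old internal path moves the egress. The distance values and the names `pick_egress`, `egress_old`, and `egress_new` are illustrative assumptions.

```python
def pick_egress(distances):
    """Return the egress PoP with the smallest internal distance (ties: name order)."""
    return min(sorted(distances), key=lambda pop: distances[pop])


# Internal distances from the ingress PoP to each candidate egress PoP (made-up values).
before = {"egress_old": 80, "egress_new": 100}
# The cost of the old internal path increases from 80 to 120.
after = {"egress_old": 120, "egress_new": 100}

print(pick_egress(before))  # egress_old
print(pick_egress(after))   # egress_new
```

The same switch would occur if the new internal path became cheaper instead, which is the second case listed above.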
18Event correlation
- Spatial correlation: a single network failure often affects multiple routers
- Temporal correlation: routing events occurring close together are likely due to only a few causes
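Temporal correlation can be approximated by grouping events whose timestamps fall close together. The gap-based rule and the `max_gap` parameter below are assumptions for illustration; the slides do not specify how the time window is chosen.

```python
def cluster_by_time(timestamps, max_gap):
    """Group timestamps so that consecutive events in a cluster are at most max_gap apart."""
    clusters = []
    for t in sorted(timestamps):
        if clusters and t - clusters[-1][-1] <= max_gap:
            clusters[-1].append(t)  # close to the previous event: same cluster
        else:
            clusters.append([t])    # gap too large: start a new cluster
    return clusters


# Event times in seconds (made-up data).
clusters = cluster_by_time([100, 105, 109, 300, 302, 900], max_gap=10)
print(clusters)  # [[100, 105, 109], [300, 302], [900]]
```

Each resulting cluster is then a candidate group of events to explain with a small number of causes.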
19Inference methodology
- Evidence: an event that supports the cause
[Figure: cause "link L is down"; probing hosts, target ISP, link L, new egress, and new path toward destination prefix P]
20Inference methodology
- Conflict: a measurement trace that conflicts with the cause
[Figure: cause "link L is down"; probing hosts, target ISP, link L, new egress, and new path toward destination prefix P]
21Inference methodology
[Figure: AS 1, AS 2, AS 3, and AS 4, with a withdrawal]
Evidence node: 1,2,3 -> 1,2,4
Cause: node 3 withdraws the route
Cause: link 2-3 down
22Inference methodology
Evidence graph
[Figure: AS 0, AS 1, AS 2, AS 3, and AS 4, with a withdrawal]
Evidence node: 1,2,3 -> 1,2,4
Evidence node: 0,2,3 -> 0,2,4
Cause: node 3 withdraws the route
Cause: link 2-3 down
23Inference methodology
Conflict graph
[Figure: AS 0, AS 1, AS 2, AS 3, and AS 6]
Conflict node: 1,2,3,6
Conflict node: 0,2,3,6
Conflict node: 0,2,3
Cause: link 2-3 down
Cause: node 3 withdraws the route
24Inference methodology
Evidence graph
Evidence node: 1,2,3 -> 1,2,4
Evidence node: 0,2,3 -> 0,2,4
Conflict graph
Conflict node: 1,2,3,6
Conflict node: 0,2,3,6
Conflict node: 0,2,3
Evidence: 2, conflicts: 3
Evidence: 2, conflicts: 0
- Greedy algorithm: find a minimum set of causes that explains all the evidence while minimizing conflicts
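The greedy selection can be sketched as below. Two things are assumptions rather than the authors' stated method: the scoring rule (newly explained evidence minus conflicts), and the reading that the three conflict traces, all of which traverse link 2-3, count against "link 2-3 down" while none count against "node 3 withdraws the route". The function name `greedy_causes` is hypothetical.

```python
def greedy_causes(evidence_of, conflicts_of):
    """Pick causes greedily until every piece of evidence is explained.

    evidence_of:  cause -> set of evidence nodes the cause explains
    conflicts_of: cause -> set of conflict nodes that contradict the cause
    """
    unexplained = set().union(*evidence_of.values())
    chosen = []
    while unexplained:
        def score(cause):
            # Assumed scoring rule: reward new evidence, penalize conflicts.
            gained = len(evidence_of[cause] & unexplained)
            return gained - len(conflicts_of.get(cause, set()))

        candidates = [c for c in evidence_of
                      if c not in chosen and evidence_of[c] & unexplained]
        best = max(sorted(candidates), key=score)
        chosen.append(best)
        unexplained -= evidence_of[best]
    return chosen


evidence = {"1,2,3 -> 1,2,4", "0,2,3 -> 0,2,4"}
evidence_of = {
    "link 2-3 down": set(evidence),
    "node 3 withdraws the route": set(evidence),
}
conflicts_of = {
    "link 2-3 down": {"1,2,3,6", "0,2,3,6", "0,2,3"},  # traces that still cross link 2-3
    "node 3 withdraws the route": set(),
}

causes = greedy_causes(evidence_of, conflicts_of)
print(causes)  # ['node 3 withdraws the route']
```

Both causes explain the same two evidence nodes, so the conflict count decides: the withdrawal at node 3 is selected and the link failure is rejected.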
26ISPs studied
27Results of event classification
- Many events are internal changes
- Abilene has many ingress changes
28Validation with BGP-based approach [Wu05]
- Hot potato changes: egress point changes due to internal distance changes
[Figure: number of incidences identified by both methods, by our method, and by the BGP method, with the resulting false negatives and false positives]
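Given the sets of incidences reported by each method, the overlap and error rates can be computed as sketched below. Treating the BGP-based method as the reference is an assumption made for this example, as are the rate definitions, the function name `compare_detections`, and the event identifiers.

```python
def compare_detections(ours, bgp):
    """Compare incidences found by end-system probing against a BGP-based reference."""
    both = ours & bgp
    # False negatives: reference incidences that our method missed.
    fn_rate = 1 - len(both) / len(bgp) if bgp else 0.0
    # False positives: our incidences that the reference does not confirm.
    fp_rate = 1 - len(both) / len(ours) if ours else 0.0
    return {"both": len(both),
            "false_negative_rate": fn_rate,
            "false_positive_rate": fp_rate}


# Made-up incidence identifiers.
ours = {"e1", "e2", "e3", "e5"}
bgp = {"e1", "e2", "e3", "e4"}

result = compare_detections(ours, bgp)
print(result)
```

Here three incidences are found by both methods, one is missed (`e4`), and one is reported only by the end-system method (`e5`).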
29Validation with BGP-based approach
- Session resets: peering link up/down
- Reasons for inaccuracy
- Limited coverage
- Coarse-grained probing
- Measurement noise
30System performance
- Can keep up with generated routing state
- Applicable for real-time diagnosis and mitigation
- Reactive: construct alternate paths to bypass the problem
- Proactive: avoid paths with many historical routing disruptions
31Conclusion
- Developed the first system to diagnose routing disruptions purely from end systems
- Used a simple greedy algorithm on two bipartite graphs to infer causes
- Comprehensively validated the accuracy
32Thank you!
33Performance impact analysis
- End-to-end latency changes caused by different
types of routing events
34Validation with BGP data
- BGP feeds from RouteViews, RIPE, Abilene, and 29 BGP feeds from a Tier-1 ISP
- The destination prefix coverage and the routing event detection rate
35Event classification: same ingress PoP, different egress PoP
- Policy changes
- Local preference in the old route decreases
- Local preference in the new route increases
[Figure: old and new paths from the probing host through the target ISP, via the old and new egress PoPs, to the neighbor AS; labels "Local Pref 60 -> 110" and "Local Pref 100 -> 50"]
36Event classification: same ingress PoP, different egress PoP
- External routing changes
- Old route worsens due to external factors (withdrawal, longer AS path)
- New route improves due to external factors
[Figure: old and new paths from the probing host through the target ISP, via the old and new egress PoPs, to AS A and AS B; AS path labels "ABCD -> ABEFD" and "BCEFD -> BEFD"]
37Event classification: same ingress PoP, same egress PoP
- Internal PoP path changes
- Cost of old internal path increases
- Cost of new internal path decreases
- External AS path changes
[Figure: old and new paths from the probing host through the target ISP to destination prefix P]
38Results of cause inference
- Effectiveness of inference algorithm
- Clusters a group of events with the same root
cause
39Event identification
- A routing event: a path change
- Event identification: comparing consecutive routing snapshots
[Figure: probing host, ISP A, AS B, AS C, AS D, and Dst]