Title: Internet Routing (COS 598A) Today: Root-Cause Analysis
1Internet Routing (COS 598A) Today: Root-Cause Analysis
- Jennifer Rexford
- http://www.cs.princeton.edu/~jrex/teaching/spring2005
- Tuesdays/Thursdays 11:00am-12:20pm
2Outline
- Network troubleshooting
- Motivation for network troubleshooting
- Investigating from the edge vs. inside
- Active probing
- Traceroute
- Mapping IP addresses to AS numbers
- Passive monitoring
- Analyzing BGP update streams
- Identifying location and cause of routing change
- Limitations of the approach
3Network Troubleshooting
Why can't I reach www.cnn.com?
Why is the performance bad?
Internet
www.cnn.com
4Reachability Problems: What Could Be Wrong?
- End-host problem
- Web server down
- DNS server down, or misconfigured
- Forwarding-path problem
- Packet filter or firewall restricting access
- Mismatch in Maximum Transmission Unit (MTU)
- Routing problem
- User or server disconnected from Internet
- Blackhole dropping all packets
- Persistent loop
5Performance Problem: What Could Be Wrong?
- End-host problems
- Overloaded Web server
- Overloaded DNS server
- Overloaded user machine
- Forwarding-path problem
- High round-trip time
- Link congestion
- Routing problem
- Long-term routing instability
- Transient disruption during convergence
6Motivation for Troubleshooting
- Improving performance
- Detect, diagnose, and fix the problem
- Pick a path through another provider
- Pick a different path in any overlay network
- Establishing accountability
- Enforce Service Level Agreements
- Rate service providers
- Characterizing the Internet
- Understand causes of performance problems
- Understand challenges of troubleshooting
7Troubleshooting: Outside vs. Inside
- Outside: from the network edge
- Who: users, researchers, and operators troubleshooting problems outside their network
- Data: ping/traceroute, public feeds of BGP updates, and public measurement platforms
- Challenges: inference from very limited data
- Inside: from inside the network
- Who: operators running a network
- Data: SNMP, fault data, traffic measurement, route monitors, and router configuration files
- Challenges: collecting and joining the data
Today
8Active Probing
9Pros and Cons of Active Probing
- Advantages
- Can run from any end system
- Measure the actual forwarding path
- See black-holes, loops, and delays directly
- Disadvantages
- Effects of routing changes, not the cause
- Current path, not the path used in the past
- Requires frequent probes to observe the changes
- Shows only properties of round-trip path
- Hard to tell if problem is on forward vs. reverse
10Traceroute: Measuring the Forwarding Path
- Time-To-Live field in IP packet header
- Source sends a packet with a TTL of n
- Each router along the path decrements the TTL
- Router sends an ICMP Time Exceeded message when the TTL reaches 0
- Traceroute tool exploits this TTL behavior
[Figure: path of routers from source to destination]
Send packets with TTL=1, 2, 3, ... and record the source of each Time Exceeded message
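The TTL mechanism above can be sketched as a small simulation. This is a minimal illustration that sends no real packets: the function names `send_probe` and `traceroute`, the `silent` parameter, and the example addresses are all assumptions made for this sketch.

```python
def send_probe(path, ttl, silent=frozenset()):
    """Forward one probe along `path` (router addresses, destination last).

    Returns the address that answers the probe, or None if nobody answers.
    """
    for index, router in enumerate(path):
        if index == len(path) - 1:
            return router              # the probe reached the destination
        ttl -= 1                       # each router decrements the TTL
        if ttl == 0:
            # The router would send Time Exceeded, unless it stays silent.
            return None if router in silent else router
    return None


def traceroute(path, max_ttl=30, silent=frozenset()):
    """Send probes with TTL = 1, 2, 3, ... and record who answers each one."""
    if not path:
        return []
    hops = []
    for ttl in range(1, max_ttl + 1):
        reply = send_probe(path, ttl, silent)
        hops.append(reply)
        if reply == path[-1]:          # stop once the destination answers
            break
    return hops
```

For example, `traceroute(["10.0.0.1", "10.0.0.2", "192.0.2.9"], silent={"10.0.0.2"})` records a missing response at the second hop, which corresponds to an unanswered hop in real traceroute output.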
11Example Traceroute Output (Berkeley to CNN)
Hop  IP address       DNS name
 1   169.229.62.1     inr-daedalus-0.CS.Berkeley.EDU
 2   169.229.59.225   soda-cr-1-1-soda-br-6-2
 3   128.32.255.169   vlan242.inr-202-doecev.Berkeley.EDU
 4   128.32.0.249     gigE6-0-0.inr-666-doecev.Berkeley.EDU
 5   128.32.0.66      qsv-juniper--ucb-gw.calren2.net
 6   209.247.159.109  POS1-0.hsipaccess1.SanJose1.Level3.net
 7   *                ?
 8   64.159.1.46      ?
 9   209.247.9.170    pos8-0.hsa2.Atlanta2.Level3.net
10   66.185.138.33    pop2-atm-P0-2.atdn.net
11   *                ?
12   66.185.136.17    pop1-atl-P4-0.atdn.net
13   64.236.16.52     www4.cnn.com
12Example Troubleshooting Results
- No packets go beyond your gateway
- Gateway's connection to the Internet is dead
- Traceroute stops at intermediate point
- Perhaps a blackhole
- Traceroute path has a loop
- Transient or persistent forwarding loop
- Traceroute shows a very long path
- Routing anomaly, route hijacking, etc.
- Traceroute shows very long delays
- Delay or congestion on forward or reverse path
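A rough sketch of how the first three symptoms above might be recognized from a list of traceroute responders. The function name `diagnose`, its return strings, and the order in which the checks run are assumptions for illustration. Long paths and long delays are not covered, since they need AS-level or timing data.

```python
def diagnose(hops, gateway, destination):
    """Classify a traceroute result.

    `hops` lists the responding address for each TTL, with None for a
    missing response. Returns a short description of the symptom.
    """
    answered = [hop for hop in hops if hop is not None]
    if len(answered) != len(set(answered)):
        # The same address answered more than once: the probe went in circles.
        return "forwarding loop"
    if destination in answered:
        return "destination reached"
    if all(hop == gateway for hop in answered):
        return "no packets beyond gateway"
    return "stops at intermediate point (possible blackhole)"
```

A single run cannot tell a transient loop from a persistent one, so repeating the probe over time would be needed to distinguish them.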
13Problems with Traceroute
- Missing responses
- Routers might not send Time-Exceeded
- Firewalls may drop the probe packets
- Time-Exceeded reply may be dropped
- Misleading responses
- Probes taken while the path is changing
- Name not in DNS, or DNS entry misconfigured
- Mapping IP addresses
- Mapping interfaces to a common router
- Mapping interface/router to Autonomous System
14Map Traceroute Hops to ASes
Traceroute output (hop number, IP)
 1   169.229.62.1
 2   169.229.59.225
 3   128.32.255.169
 4   128.32.0.249
 5   128.32.0.66
 6   209.247.159.109
 7   *
 8   64.159.1.46
 9   209.247.9.170
10   66.185.138.33
11   *
12   66.185.136.17
13   64.236.16.52
Need accurate IP-to-AS mappings (for network equipment).
15Candidate Ways to Get IP-to-AS Mapping
- Routing address registry
- Voluntary public registry such as whois.radb.net
- Used by prtraceroute and NANOG traceroute
- Incomplete and quite out-of-date
- Mergers, acquisitions, delegation to customers
- Origin AS in BGP paths
- Public BGP routing tables such as RouteViews
- Used to translate traceroute data to an AS graph
- Incomplete and inaccurate but usually right
- Multiple Origin ASes, no mapping, wrong mapping
16Example BGP Table (show ip bgp at RouteViews)
   Network          Next Hop          Metric LocPrf Weight Path
   3.0.0.0/8        205.215.45.50                        0 4006 701 80 i
                    167.142.3.6                          0 5056 701 80 i
                    157.22.9.7                           0 715 1 701 80 i
                    195.219.96.239                       0 8297 6453 701 80 i
                    195.211.29.254                       0 5409 6667 6427 3356 701 80 i
>                   12.127.0.249                         0 7018 701 80 i
                    213.200.87.254       929             0 3257 701 80 i
   9.184.112.0/20   205.215.45.50                        0 4006 6461 3786 i
                    195.66.225.254                       0 5459 6461 3786 i
>                   203.62.248.4                         0 1221 3786 i
                    167.142.3.6                          0 5056 6461 6461 3786 i
                    195.219.96.239                       0 8297 6461 3786 i
                    195.211.29.254                       0 5409 6461 3786 i
AS 80 is General Electric, AS 701 is UUNET, AS 7018 is AT&T, AS 3786 is DACOM (Korea), AS 1221 is Telstra
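A minimal sketch of deriving an IP-to-AS mapping from such a table, assuming the origin AS is the last AS number in each path and that lookups use longest-prefix match. The names `build_ip_to_as` and `lookup` are hypothetical, and the routes are a small subset of the table above.

```python
import ipaddress


def build_ip_to_as(routes):
    """Map each prefix to the set of origin ASes seen in its AS paths."""
    table = {}
    for prefix, as_path in routes:
        origin = int(as_path.split()[-1])      # last AS in the path
        table.setdefault(ipaddress.ip_network(prefix), set()).add(origin)
    return table


def lookup(table, address):
    """Return the origin AS set of the longest matching prefix, if any."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in table if ip in net]
    if not matches:
        return set()                           # address has no mapping
    return table[max(matches, key=lambda net: net.prefixlen)]


ROUTES = [
    ("3.0.0.0/8", "4006 701 80"),
    ("3.0.0.0/8", "7018 701 80"),
    ("9.184.112.0/20", "1221 3786"),
    ("9.184.112.0/20", "5056 6461 6461 3786"),
]
TABLE = build_ip_to_as(ROUTES)
```

Returning a set rather than a single AS number keeps prefixes with multiple origin ASes visible instead of silently picking one. An empty set marks an address with no mapping.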
17Why Would IP-to-AS Mapping Be Wrong?
- IP addresses of equipment
- Interfaces on the routers, not end hosts
- Identifies equipment in routing protocols
- Doesn't need to be globally visible or consistent
- Three reasons the mappings may be wrong
- Addresses of Internet Exchange Points
- Sibling ASes that share address space
- ASes that don't announce their addresses
- Look at traceroute path vs. BGP AS path
- Traceroute path after IP-to-AS mapping
- BGP AS path taken from the BGP table
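The comparison above might be sketched as follows, assuming each traceroute hop has already been mapped to an AS number (or to None when no mapping exists). The helper names `collapse` and `extra_hops` are hypothetical, and the AS numbers in the usage note come from the range reserved for documentation.

```python
def collapse(as_hops):
    """Turn per-hop AS numbers into an AS path.

    Unmapped hops (None) are skipped and consecutive hops in the same AS
    are merged into one entry.
    """
    path = []
    for asn in as_hops:
        if asn is not None and (not path or path[-1] != asn):
            path.append(asn)
    return path


def extra_hops(traceroute_path, bgp_path):
    """ASes that appear in the traceroute AS path but not in the BGP AS path."""
    return [asn for asn in traceroute_path if asn not in bgp_path]
```

For instance, `collapse([64500, 64500, None, 64501, 64502, 64502, 64503])` gives `[64500, 64501, 64502, 64503]`. Comparing that against a BGP AS path of `[64500, 64501, 64503]` flags 64502 as an extra hop that needs an explanation.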
18Extra AS due to Internet eXchange Points
- IXP: shared place where providers meet
- E.g., Mae-East, Mae-West, PAIX
- Large number of fan-in and fan-out ASes
[Figure: the same AS topology (A through G) drawn twice, labeled Traceroute AS path and BGP AS path]
Ignore extra traceroute AS hop with high fan-in and fan-out
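One way the fan-in/fan-out heuristic could look in code, as a sketch only: an extra traceroute AS is flagged when it has many distinct neighbors on both sides across the collected paths. The function name `find_ixp_candidates`, the `min_fan` threshold, and its default of 2 are assumptions, not values from the lecture.

```python
from collections import defaultdict


def find_ixp_candidates(traceroute_paths, bgp_paths, min_fan=2):
    """Return extra traceroute ASes with high fan-in and fan-out.

    `traceroute_paths` and `bgp_paths` are parallel lists of AS paths
    for the same source/destination pairs.
    """
    fan_in = defaultdict(set)
    fan_out = defaultdict(set)
    for tr_path, bgp_path in zip(traceroute_paths, bgp_paths):
        for i in range(1, len(tr_path) - 1):
            asn = tr_path[i]
            if asn not in bgp_path:            # extra hop in traceroute
                fan_in[asn].add(tr_path[i - 1])
                fan_out[asn].add(tr_path[i + 1])
    return {
        asn for asn in fan_in
        if len(fan_in[asn]) >= min_fan and len(fan_out[asn]) >= min_fan
    }
```

With traceroute paths `["A", "D", "F"]` and `["B", "D", "G"]` against BGP paths `["A", "F"]` and `["B", "G"]`, the extra hop `"D"` has two predecessors and two successors, so it is reported as a likely exchange point.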
19Extra AS due to Sibling ASes
- Sibling organizations with multiple ASes
- E.g., Sprint AS 1239 and AS 1791
- One AS numbers its equipment with addresses of another
[Figure: the same AS topology drawn twice, labeled Traceroute AS path and BGP AS path]
Merge sibling ASes that belong together, as if they were one AS.
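Merging siblings could be sketched by rewriting each AS number to a canonical sibling before the paths are compared. The function name `merge_siblings` and the choice of AS 1239 as the canonical number for the Sprint example are assumptions.

```python
def merge_siblings(as_path, siblings):
    """Rewrite sibling ASes to one canonical AS and drop repeated entries.

    `siblings` maps an AS number to the canonical AS number of its group.
    """
    merged = []
    for asn in as_path:
        canonical = siblings.get(asn, asn)
        if not merged or merged[-1] != canonical:
            merged.append(canonical)
    return merged


# Sprint example from the slide; 1239 is arbitrarily chosen as canonical.
SIBLINGS = {1791: 1239}
```

After this step, a traceroute path that crosses both AS 1239 and AS 1791 shows a single hop, so it can match a BGP AS path that lists only one of them.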
20Unannounced Infrastructure Addresses
[Figure: ASes A, B, and C, and the prefix 12.0.0.0/8]
C does not announce part of its address space in BGP (e.g., 12.1.2.0/24)
Fix the IP-to-AS map to associate 12.1.2.0/24 with C
21Refining Initial IP-to-AS Mapping
- Start with initial IP-to-AS mapping
- Mapping from BGP tables is usually correct
- Good starting point for computing the mapping
- Collect many BGP and traceroute paths
- Signaling and forwarding AS path usually match
- Good way to identify mistakes in IP-to-AS map
- Successively refine the IP-to-AS mapping
- Find add/change/delete that makes big difference
- Base these edits on operational realities
http://www.cs.princeton.edu/~jrex/papers/sigcomm03.pdf
http://www.cs.princeton.edu/~jrex/papers/infocom04.pdf
22Research Areas
- Better version of traceroute
- Router support for active measurement
- IPPM (IP Performance Metrics)
- http://www1.ietf.org/mail-archive/web/imrg/current/msg00154.html
- Peer-to-peer troubleshooting
[Figure: peers answering Yes or No for www.cnn.com]
23Passive Monitoring
24Limitations of Active Measurements
- Active measurements: traceroute-like tools
- Can't probe in the past
- Shows the effect, not the cause
[Figure: topology with ASes 1, 2, 3, and 4, User (s), and Web Server (d)]
25Appealing to Peek Inside
- Passive measurements: public BGP data
[Figure: BGP update feeds -> Data Collection (RouteViews, RIPE) -> Data Correlation -> root cause]
26Inspect BGP Routing Changes
- Changes in paths to reach destination d
- AS 1: 1 3 4 -> 1 2 4
- AS 2: 2 4 (no change)
- AS 3: 3 4 -> 3 1 2 4
- AS 4: 4 (no change)
[Figure: topology with ASes 1, 2, 3, and 4, User (s), and Web Server (d)]
27Idea 1: ASes in Paths Undergoing Change
- Key assumption
- The AS responsible for the change appears in the old and/or the new AS path to the destination.
- If an AS has a routing change
- All ASes in old and new paths may be responsible
- Call these ASes the suspect set
- Combining across vantage points
- Consider all ASes that had a routing change
- Perform the intersection across the suspect sets
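Idea 1 can be sketched directly with set operations. The function names are hypothetical; the example data is the set of path changes toward destination d listed on the earlier slide (AS 1: 1 3 4 -> 1 2 4, AS 3: 3 4 -> 3 1 2 4, ASes 2 and 4 unchanged).

```python
def suspect_set(old_path, new_path):
    """All ASes on the old or the new path may be responsible."""
    return set(old_path) | set(new_path)


def intersect_suspects(changes):
    """Intersect suspect sets across vantage points that saw a change.

    `changes` maps a vantage-point AS to its (old_path, new_path) pair.
    Vantage points whose path did not change contribute nothing here.
    """
    sets = [suspect_set(old, new)
            for old, new in changes.values() if old != new]
    return set.intersection(*sets) if sets else set()


CHANGES = {
    1: ([1, 3, 4], [1, 2, 4]),
    2: ([2, 4], [2, 4]),
    3: ([3, 4], [3, 1, 2, 4]),
    4: ([4], [4]),
}
```

On this data the intersection is still {1, 2, 3, 4}: both changed vantage points saw all four ASes, so Idea 1 alone does not narrow the suspects here.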
28Idea 2: Excluding ASes in Non-Changing Paths
- Key assumption
- If an AS has no routing change, the ASes in the path are not responsible and can be excluded.
- Example
- AS 1: 1 2 4 -> 1 2 3 4, suspects 1, 2, 3, 4
- AS 2: 2 4 -> 2 3 4, suspects 2, 3, 4
- AS 3: 3 4 (no change), non-suspects 3, 4
[Figure: topology with ASes 1, 2, 3, and 4]
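The example above can be worked through in a few lines. This sketch combines the intersection of suspect sets with the exclusion of ASes on unchanged paths; the function name `root_cause_candidates` is hypothetical.

```python
def root_cause_candidates(observations):
    """Intersect suspects from changed paths, then remove cleared ASes.

    `observations` is a list of (old_path, new_path) pairs, one per
    vantage point. An unchanged path clears every AS on it.
    """
    suspects = None
    cleared = set()
    for old, new in observations:
        if old == new:
            cleared |= set(old)                # non-suspects
        else:
            current = set(old) | set(new)
            suspects = current if suspects is None else suspects & current
    return (suspects or set()) - cleared


OBSERVATIONS = [
    ([1, 2, 4], [1, 2, 3, 4]),   # AS 1: suspects 1, 2, 3, 4
    ([2, 4], [2, 3, 4]),         # AS 2: suspects 2, 3, 4
    ([3, 4], [3, 4]),            # AS 3: no change, clears 3 and 4
]
```

Here the intersection of the suspect sets is {2, 3, 4}, and removing the cleared ASes 3 and 4 leaves AS 2 as the only candidate.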
29Idea 3: Blaming the ASes in the Better Path
- Key assumption
- The better path is the one that contains the AS responsible for the change.
- Example
- 1 2 4 -> 1 2 3 4: better path to worse path, with ASes 1, 2, 4 as the suspects (not AS 3)
- Heuristics for identifying the better path
- E.g., the shorter AS path
[Figure: topology with ASes 1, 2, 3, and 4]
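A small sketch of Idea 3 using the shorter-AS-path heuristic mentioned above. The function names are hypothetical, and treating the old path as better when the lengths tie is an arbitrary assumption of this sketch.

```python
def better_path(old_path, new_path):
    """Pick the better path using the shorter-AS-path heuristic.

    Ties go to the old path, which is an arbitrary choice.
    """
    return old_path if len(old_path) <= len(new_path) else new_path


def suspects_from_better_path(old_path, new_path):
    """Blame only the ASes on the better of the two paths."""
    return set(better_path(old_path, new_path))
```

For the change 1 2 4 -> 1 2 3 4, the old path is shorter, so `suspects_from_better_path([1, 2, 4], [1, 2, 3, 4])` yields ASes 1, 2, and 4 and leaves out AS 3.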
30Idea 4: Combining Across Destinations
- Key assumption
- All destinations experiencing routing changes in a short period of time have a common cause.
- Exploiting the observation
- Form suspect sets for each destination
- Perform intersections of the sets across the destinations
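One possible reading of "a short period of time" is a fixed window measured from the first change in a burst. The sketch below groups per-destination suspect sets that way and intersects them within each group. The function name `group_and_intersect`, the event tuple layout, and the window length are all assumptions.

```python
def group_and_intersect(events, window):
    """Group routing changes that are close in time and intersect suspects.

    `events` is a list of (timestamp, destination, suspect_set) tuples.
    A change joins the current group if it occurs within `window` seconds
    of the first change in that group.
    """
    groups = []
    for timestamp, destination, suspects in sorted(events, key=lambda e: e[0]):
        if groups and timestamp - groups[-1]["start"] <= window:
            groups[-1]["destinations"].add(destination)
            groups[-1]["suspects"] &= suspects
        else:
            groups.append({
                "start": timestamp,
                "destinations": {destination},
                "suspects": set(suspects),
            })
    return groups


EVENTS = [
    (0, "d1", {1, 2, 4}),
    (5, "d2", {2, 3, 4}),
    (500, "d3", {7, 8}),
]
```

With a 60-second window, the changes for d1 and d2 fall into one group whose common suspects are ASes 2 and 4, while the later change for d3 is treated as a separate event.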
31Difficulties With Root-Cause Analysis
- Misleading BGP routing changes
- Responsible AS not on old or new path
- Looking across destinations doesn't resolve the cause
- Missing routing changes
- Some routers in an AS don't have a change
- Some subnets are not visible in BGP
- Some internal changes are not visible in BGP
32Misleading BGP Changes
Myth: The AS responsible for the change appears in the old or the new AS path.
[Figure: BGP data collection point; old path 1,2,8,9,10 and new path 1,4,5,6,7,10]
33Misleading BGP Changes
Myth: Looking at routing changes across prefixes resolves causes.
[Figure: ASes 1, 2, and 3 with destinations d1, d2, and d3; labels A, B, C, 7, and 10]
Changes for d2, but not for d1 and d3
34Missing Routing Changes
Myth: The BGP updates from a single router accurately represent the AS.
[Figure: AS 1 and AS 2 with destination dst; labels 7, 6, 10, and 12]
35Missing Routing Changes
Myth: BGP data from a router accurately represents changes on that router.
[Figure: router A and BGP data collection; prefixes 12.1.1.0/24 and 12.1.0.0/16]
36Missing Routing Changes
Myth: Routing changes visible in eBGP have greater end-to-end impact than changes with local scope.
[Figure: AS 1 and AS 2 with destination dst; labels 5, 7, 6, 10, and 12]
37Hybrid of Active and Passive Monitoring
[Figure: topology with ASes 1, 2, 3, and 4, User (s), Web Server (d), nodes i and j, and Omni 1 through Omni 4]
38Research Questions
- Understanding if root-cause analysis can work
- How many vantage points are needed?
- Do the assumptions usually hold?
- Can algorithms tolerate occasional violations?
- Can some additional information help?
- Distributed algorithms for root-cause analysis
- Can ASes cooperate in distributed fashion?
- How to prevent or detect ASes that cheat?
- Do all ASes have to participate?
- Other hybrids of active and passive monitoring?
39Conclusions
- Troubleshooting is important
- Detect, diagnose, and fix problems
- Accountability and service-level agreements
- Troubleshooting is hard
- Active measurement (e.g., traceroute) not enough
- Root-cause analysis techniques are not enough
- New innovation necessary
- Hybrid active/passive approaches
- Router support for active measurement
- Routing protocol extensions for troubleshooting
40For Next Time: From Inside an AS
- Two papers
- OSPF monitoring: Architecture, design, and deployment experience
- Finding a needle in a haystack: Pinpointing significant BGP routing changes in an IP network
- Optional reading
- Materials from Packet Design and Ipsum Networks
- Review only of first paper
- Summary
- Why accept
- Why reject
- Future work