Title: Routing Measurements: Three Case Studies
1Routing Measurements: Three Case Studies
2Motivations for Measuring the Routing System
- Characterizing the Internet
- Internet path properties
- Demands on Internet routers
- Routing convergence
- Improving Internet health
- Protocol design problems
- Protocol implementation problems
- Configuration errors or attacks
- Operating a network
- Detecting and diagnosing routing problems
- Traffic shifts, routing attacks, flaky equipment,
3Techniques for Measuring Internet Routing
- Active probing
- Inject probes along path through the data plane
- E.g., using traceroute
- Passive route monitoring
- Capture control-plane messages between routers
- E.g., using tcpdump or a software router
- E.g., dumping the routing table on a router
- Injecting network events
- Cause failure/recovery at planned time and place
- E.g., BGP route beacon, or planned maintenance
4Challenges in Measuring Routing
- Data vs. control plane
- Understand relationship between routing protocol messages and the impact on data traffic
- Cause vs. effect
- Identify the root cause for a change in the forwarding path or control-plane messages
- Visibility and representativeness
- Collect routing data from many vantage points
- Across many Autonomous Systems, or within
- Large volume of data
- Many end-to-end paths
- Many prefixes and update measurements
5Measurement Tools Traceroute
- Traceroute tool exploits TTL-limited probes
- Observation of the forwarding path
- Useful, but introduces many challenges
- Path changes
- Non-participating nodes
- Inaccurate, two-way measurements
- Hard to map interfaces to routers and ASes
[Figure: path from source to destination]
Send packets with TTL = 1, 2, 3, ... and record the source of each "time exceeded" message
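To make the TTL-limited probing idea concrete, here is a minimal simulation rather than a real traceroute (which needs raw sockets and a live network). The path is a hypothetical list of interface addresses; `None` stands for a non-participating node that sends no reply, and the names `send_probe` and `traceroute` are illustrative only.

```python
from typing import List, Optional, Tuple


def send_probe(path: List[Optional[str]], ttl: int) -> Tuple[Optional[str], bool]:
    """Forward one probe along `path`; return (replying hop, reached destination)."""
    for i, hop in enumerate(path):
        ttl -= 1                          # each hop decrements the TTL
        last = (i == len(path) - 1)
        if ttl == 0 or last:
            return hop, last              # hop is None if the node stays silent
    raise ValueError("empty path")


def traceroute(path: List[Optional[str]], max_ttl: int = 30) -> List[str]:
    """Send probes with TTL = 1, 2, 3, ... and record who answers each one."""
    hops = []
    for ttl in range(1, max_ttl + 1):
        hop, reached = send_probe(path, ttl)
        hops.append(hop if hop is not None else "*")   # "*" marks a missing reply
        if reached:
            break
    return hops


# Hypothetical four-hop path whose second router does not respond.
example = traceroute(["10.0.0.1", None, "10.0.2.1", "192.0.2.9"])
```

The sketch ignores the challenges listed above: it assumes the path stays fixed while the probes are sent, which is exactly what a real measurement cannot guarantee.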
6Measurement Intradomain Route Monitoring
- OSPF is a flooding protocol
- Every link-state advertisement is sent on every link
- Very helpful for simplifying the monitor
- Can participate in the protocol
- Shared media (e.g., Ethernet)
- Join multicast group and listen to LSAs
- Point-to-point links
- Establish an adjacency with a router
- or passively monitor packets on a link
- Tap a link and capture the OSPF packets
7Measurement Interdomain Route Monitoring
Establish a passive BGP session (over TCP) from a workstation running BGP software
(+) BGP table dumps do not burden operational routers
(-) Receives only best routes from BGP neighbor
(+) Update dynamics captured
(+) Not restricted to interfaces provided by vendors
Talk to operational routers using SNMP or telnet at command line
(-) BGP table dumps are expensive
(+) Table dumps show all alternate routes
(-) Update dynamics lost
(-) Restricted to interfaces provided by vendors
8Collect BGP Data From Many Routers
[Figure: map of routers in U.S. cities (Seattle, Chicago, New York, Los Angeles, Houston, and others) and a Route Monitor]
BGP is not a flooding protocol
9Two Kinds of BGP Monitoring Data
- Wide-area, from many ASes
- RouteViews or RIPE-NCC data
- Pro: available from many vantage points
- Con: often just one or two views per AS
- Single AS, from many routers
- Abilene and GEANT public repositories
- Proprietary data at individual ISPs
- Pro: comprehensive view of a single AS
- Con: limited public examples, mostly research nets
10Measurement Injecting Events
- Equipment failure/recovery
- Unplug/reconnect the equipment
- Packet filters that block all packets
- Knowing when planned event will take place
- Shutting down a routing-protocol adjacency
- Injecting route announcements
- Acquire some blocks of IP addresses
- Acquire a routing-protocol adjacency to a router
- Announce/withdraw routes on a schedule
- Beacons: http://psg.com/~zmao/BGPBeacon.html
11Two Papers for Today
- Both early measurement studies
- Initially appeared at SIGCOMM'96 and '97
- Both won the best student paper award
- Early glimpses into the health of Internet routing
- Early wave of papers on Internet measurement
- Differences in emphasis
- Paxson96: end-to-end active probing to measure the characteristics of the data plane
- Labovitz97: passive monitoring of BGP update messages from several ISPs to characterize (in)stability of the interdomain routing system
12Paxson Study Forwarding Loops
- Forwarding loop
- Packet returns to same router multiple times
- May cause traceroute to show a loop
- If loop lasted long enough
- So many packets traverse the loopy path
- Traceroute may reveal false loops
- Path change that leads to a longer path
- Causing later probe packets to hit same nodes
- Heuristic solution
- Require traceroute to return same path 3 times
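The heuristic above can be sketched in a few lines. This is an illustration under stated assumptions, not the procedure from the study: a traceroute result is a list of hop addresses, `"*"` marks a missing reply, and the function names are hypothetical.

```python
from typing import List


def has_loop(hops: List[str]) -> bool:
    """True if some responding hop appears more than once in one traceroute."""
    seen = set()
    for hop in hops:
        if hop == "*":            # ignore probes that got no reply
            continue
        if hop in seen:
            return True
        seen.add(hop)
    return False


def confirmed_loop(runs: List[List[str]], required: int = 3) -> bool:
    """Report a loop only if the last `required` runs returned the same looping path."""
    if len(runs) < required:
        return False
    recent = runs[-required:]
    same_path = all(run == recent[0] for run in recent)
    return same_path and has_loop(recent[0])


# Hypothetical measurements: the same looping path seen three times in a row.
loopy = ["a", "b", "c", "b", "c"]
stable_loop = confirmed_loop([loopy, loopy, loopy])
```

A single looping traceroute followed by a clean one is treated as a likely false loop caused by a path change during the measurement.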
13Paxson Study Causes of Loops
- Transient vs. persistent
- Transient: routing-protocol convergence
- Persistent: likely configuration problem
- Challenges
- Appropriate time boundary between the two?
- What about flaky equipment going up and down?
- Determining the cause of persistent loops?
- Anecdote on recent study of persistent loops
- Provider has static route for customer prefix
- Customer has default route to the provider
14Paxson Study Path Fluttering
- Rapid changes between paths
- Multiple paths between a pair of hosts
- Load balancing policies inside the network
- Packet-based load balancing
- Round-robin or random
- Multiple paths for packets in a single flow
- Flow-based load balancing
- Hash of some fields in the packet header
- E.g., IP addresses, port numbers, etc.
- To keep packets in a flow on one path
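The two load-balancing styles can be contrasted with a small sketch. The choice of SHA-256 and of the five header fields is an assumption made for illustration; real routers use their own hash functions and field selections.

```python
import hashlib
from typing import List


def flow_based_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                    proto: str, paths: List[str]) -> str:
    """Pick a path from a hash of header fields, so one flow stays on one path."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:4], "big") % len(paths)
    return paths[index]


def packet_based_path(packet_number: int, paths: List[str]) -> str:
    """Round-robin: consecutive packets of the same flow take different paths."""
    return paths[packet_number % len(paths)]


paths = ["path-1", "path-2", "path-3"]   # hypothetical equal-cost paths
flow_choices = {flow_based_path("192.0.2.1", "198.51.100.7", 40000, 80, "tcp", paths)
                for _ in range(5)}
packet_choices = [packet_based_path(i, paths) for i in range(4)]
```

With packet-based balancing, successive traceroute probes can follow different paths, which is one way fluttering shows up in end-to-end measurements.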
15Paxson Study Routing Stability
- Route prevalence
- Likelihood of observing a particular route
- Relatively easy to measure with sound sampling
- Poisson arrivals see time averages (PASTA)
- Most host pairs have a dominant route
- Route persistence
- How long a route endures before a change
- Much harder to measure through active probes
- Look for cases of multiple observations
- Typical host pair has path persistence of a week
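As a rough illustration of route prevalence, the sketch below draws probe times with exponentially distributed gaps (a Poisson process, so the PASTA property applies) and estimates prevalence as the fraction of observations showing each route. The mean gap, the horizon, and the routes are made-up values.

```python
import random
from collections import Counter
from typing import Dict, Hashable, List, Tuple


def poisson_probe_times(mean_gap: float, horizon: float, seed: int = 0) -> List[float]:
    """Probe times in [0, horizon) with exponentially distributed gaps."""
    rng = random.Random(seed)
    t, times = 0.0, []
    while True:
        t += rng.expovariate(1.0 / mean_gap)
        if t >= horizon:
            return times
        times.append(t)


def route_prevalence(observations: List[Hashable]) -> Dict[Hashable, float]:
    """Fraction of observations in which each route was seen."""
    counts = Counter(observations)
    return {route: n / len(observations) for route, n in counts.items()}


def dominant_route(observations: List[Hashable]) -> Tuple[Hashable, float]:
    """The most frequently observed route and its prevalence."""
    prevalence = route_prevalence(observations)
    route = max(prevalence, key=prevalence.get)
    return route, prevalence[route]


# Hypothetical data: ten probes between one host pair, two distinct routes.
observed = [("A", "B", "C")] * 8 + [("A", "D", "C")] * 2
best = dominant_route(observed)
```

Persistence cannot be read off the same counts, because two consecutive observations of a route do not show whether it changed and changed back in between.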
16Paxson Study Route Asymmetry
- Other causes
- Asymmetric link weights in intradomain routing
- Cold-potato routing, where an AS requests that traffic enter at a particular place
- Consequences
- Lots of asymmetry
- One-way delay is not necessarily half of the
round-trip time
[Figure: early-exit routing between Provider A and Provider B across multiple peering points, with Customer A and Customer B]
17Labovitz Study Interdomain Routing
- AS-level topology
- Destinations are IP prefixes (e.g., 12.0.0.0/8)
- Nodes are Autonomous Systems (ASes)
- Links are connections and business relationships
[Figure: AS-level topology with ASes 1 through 7 between a client and a web server]
18Labovitz Study BGP Background
- Extension of distance-vector routing
- Support flexible routing policies
- Avoid count-to-infinity problem
- Key idea: advertise the entire path
- Distance vector: send distance metric per dest d
- Path vector: send the entire path for each dest d
[Figure: advertisements "d: path (1)" and "d: path (2,1)" propagate among ASes 1, 2, and 3, while data traffic flows toward d]
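The benefit of advertising the entire path can be shown in a short sketch: a node can reject any route whose path already contains its own AS number, which is what avoids count-to-infinity. The function names are illustrative, and policy is left out entirely.

```python
from typing import Tuple

ASPath = Tuple[int, ...]


def advertise(my_as: int, as_path: ASPath) -> ASPath:
    """Prepend our own AS number before passing the route to a neighbor."""
    return (my_as,) + as_path


def accept_route(my_as: int, as_path: ASPath) -> bool:
    """Reject a route whose path already contains our AS (it would loop)."""
    return my_as not in as_path


# AS 1 originates destination d with path (1); AS 2 passes it on as (2, 1).
path_from_as1 = (1,)
path_from_as2 = advertise(2, path_from_as1)
```

Here AS 3 accepts the path (2, 1), while AS 1 would discard it on seeing its own number.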
19Labovitz Study BGP Background
- BGP is an incremental protocol
- In theory, no update messages in steady state
- Two kinds of update messages
- Announcement: advertising a new route
- Withdrawal: withdrawing an old route
- Study saw an alarming number of updates
- At the time, Internet had around 45,000 prefixes
- Routers were exchanging 3-6 million updates/day
- Sometimes as high as 30 million in a day
- Placing a very high load on the routers
20Labovitz Study Classifying Update Messages
- Analyze update messages
- For each (prefix, peer) tuple
- Classify the kinds of routing changes
- Forwarding instability
- WADiff: explicit withdraw, replaced by alternate
- AADiff: implicit withdraw, replaced by alternate
- Pathological
- WADup: explicit withdraw, and then reannounced
- AADup: duplicate announcement
- WWDup: duplicate withdrawal
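A per-(prefix, peer) classifier along these lines might look as follows. It is a sketch, not the tooling used in the study: a route is any hashable value (for example an AS-path tuple), `route=None` stands for a withdrawal, and messages that fit none of the five categories are returned as `None`.

```python
from typing import Dict, Hashable, Optional, Tuple

Key = Tuple[str, str]  # (prefix, peer)


class UpdateClassifier:
    def __init__(self) -> None:
        self.last_route: Dict[Key, Hashable] = {}  # most recently announced route
        self.withdrawn: Dict[Key, bool] = {}       # was the last message a withdrawal?

    def classify(self, prefix: str, peer: str,
                 route: Optional[Hashable] = None) -> Optional[str]:
        """Label one update; route=None means the update is a withdrawal."""
        key = (prefix, peer)
        was_withdrawn = self.withdrawn.get(key)    # None if the key is new
        previous = self.last_route.get(key)
        if route is None:
            # A withdrawal right after another withdrawal is a duplicate.
            label = "WWDup" if was_withdrawn else None
            self.withdrawn[key] = True
            return label
        if previous is None:
            label = None                           # first announcement seen
        elif was_withdrawn:
            label = "WADup" if route == previous else "WADiff"
        else:
            label = "AADup" if route == previous else "AADiff"
        self.last_route[key] = route
        self.withdrawn[key] = False
        return label


c = UpdateClassifier()
labels = [
    c.classify("192.0.2.0/24", "peer1", (1, 2)),   # first announcement
    c.classify("192.0.2.0/24", "peer1", (1, 2)),   # same route again
    c.classify("192.0.2.0/24", "peer1", (1, 3)),   # different route
    c.classify("192.0.2.0/24", "peer1"),           # withdrawal
    c.classify("192.0.2.0/24", "peer1"),           # withdrawal again
    c.classify("192.0.2.0/24", "peer1", (1, 3)),   # same route comes back
]
```

One simplification to keep in mind: a withdrawal for a prefix this classifier has never seen is left unlabeled, because the stream alone does not show whether an announcement was ever sent.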
21Labovitz Study Duplicate Withdrawals
- Time-space trade-off in router implementation
- Common system building technique
- Trade one resource for another
- Can have surprising side effects
- The gory details
- Ideally, you should not send a withdrawal if you never sent a neighbor a corresponding announcement
- Requires remembering what update message you sent to each neighbor
- Easier to just send everyone a withdrawal when your route goes away
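The trade-off can be sketched with two toy speakers. Both classes are hypothetical simplifications: the stateful one spends memory on a per-neighbor record of what it announced, and the stateless one saves that memory by withdrawing to everyone.

```python
from typing import Dict, Iterable, List, Set


class StatefulSpeaker:
    """Remembers which prefixes were announced to which neighbor."""

    def __init__(self, neighbors: Iterable[str]) -> None:
        self.neighbors: List[str] = list(neighbors)
        self.announced: Dict[str, Set[str]] = {n: set() for n in self.neighbors}

    def announce(self, prefix: str, to: Iterable[str]) -> None:
        for n in to:
            self.announced[n].add(prefix)

    def withdraw(self, prefix: str) -> List[str]:
        """Return the neighbors that receive a withdrawal."""
        targets = [n for n in self.neighbors if prefix in self.announced[n]]
        for n in targets:
            self.announced[n].discard(prefix)
        return targets


class StatelessSpeaker:
    """Keeps no per-neighbor memory, so every neighbor gets every withdrawal."""

    def __init__(self, neighbors: Iterable[str]) -> None:
        self.neighbors: List[str] = list(neighbors)

    def announce(self, prefix: str, to: Iterable[str]) -> None:
        pass                                  # nothing is recorded

    def withdraw(self, prefix: str) -> List[str]:
        return list(self.neighbors)


stateful = StatefulSpeaker(["A", "B", "C"])
stateless = StatelessSpeaker(["A", "B", "C"])
for speaker in (stateful, stateless):
    speaker.announce("192.0.2.0/24", to=["A"])
first = (stateful.withdraw("192.0.2.0/24"), stateless.withdraw("192.0.2.0/24"))
```

In this toy run the stateless speaker sends two withdrawals that neighbors B and C never needed, which is the kind of message a monitor would count as WWDup.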
22Labovitz Study Practical Impact
- Stateless BGP is compliant with the standard
- But, it forces other routers to handle more load
- So that you don't have to maintain state
- Arguably very unfair, and bad for global Internet
- One router vendor was largely at fault
- Router vendor modified its implementation
- ISPs then deployed the updated software
23Labovitz Study Still Hard to Diagnose Problems
- Despite having very detailed view into BGP
- Some pathologies were very hard to diagnose
- Possible causes
- Flaky equipment
- Synchronization of BGP timers
- Interaction between BGP and intradomain routing
- Policy oscillation
- These topics were studied in follow-up studies
- Example study of BGP data within a large ISP
- http://www.cs.princeton.edu/~jrex/papers/nsdi05-jian.pdf
24ISP Study Detecting Important Routing Changes
- Large volume of BGP update messages
- Around 2 million/day, and very bursty
- Too much for an operator to manage
- Identify important anomalies
- Lost reachability
- Persistent flapping
- Large traffic shifts
- Not the same as root-cause analysis
- Identify changes and their effects
- Focus on mitigation, rather than diagnosis
- Diagnose causes if they occur in/near the AS
25Challenge 1 Excess Update Messages
- A single routing change
- Leads to multiple update messages
- Affects routing decision at multiple routers
Persistent Flapping Prefixes
Group updates for a prefix with inter-arrival < 70 seconds, and flag prefixes with changes lasting > 10 minutes.
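The grouping rule can be sketched directly from the two thresholds on this slide (70 seconds and 10 minutes). The input is assumed to be a list of update timestamps in seconds for a single prefix; the function names are illustrative.

```python
from typing import List


def group_events(timestamps: List[float], timeout: float = 70.0) -> List[List[float]]:
    """Group updates for one prefix: a gap shorter than `timeout` continues the event."""
    events: List[List[float]] = []
    for t in sorted(timestamps):
        if events and t - events[-1][-1] < timeout:
            events[-1].append(t)      # same event as the previous update
        else:
            events.append([t])        # gap too long, so start a new event
    return events


def persistent_flapping(timestamps: List[float], timeout: float = 70.0,
                        min_duration: float = 600.0) -> bool:
    """Flag the prefix if any event lasts longer than `min_duration` seconds."""
    return any(ev[-1] - ev[0] > min_duration
               for ev in group_events(timestamps, timeout))


# Hypothetical timestamps: two short bursts, then one update per minute for 11 minutes.
bursts = group_events([0, 30, 60, 500, 520])
flapping = persistent_flapping(list(range(0, 700, 60)))
```

The steady one-per-minute stream never leaves a 70-second gap, so it forms a single 660-second event and is flagged.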
26Determine Event Timeout
[Figure: cumulative distribution of BGP update inter-arrival time; labels: BGP beacon, marked point (70, 98)]
27Event Duration Persistent Flapping
[Figure: complementary cumulative distribution of event duration; marked point (600, 0.1)]
28Detecting Persistent Flapping
- Significant persistent flapping
- 15.2% of all BGP update messages
- though a small number of destination prefixes
- Surprising, especially since flap dampening is used
- Types of persistent flapping
- Conservative flap-damping parameters (78.6%)
- Policy oscillations, e.g., MED oscillation (18.3%)
- Unstable interface or BGP session (3.0%)
29Example Unstable eBGP Session
[Figure: network diagram with a peer, AT&T, a customer, and prefix p]
30Challenge 2 Identify Important Events
- Major concerns of network operators
- Changes in reachability
- Heavy load of routing messages on the routers
- Flow of the traffic through the network
Classify events by the type of impact they have on the network
31Event Category No Disruption
[Figure: AT&T network with AS1, AS2, and prefix p; no traffic shift]
No Disruption: each of the border routers has no traffic shift
32Event Category Internal Disruption
[Figure: AT&T network with AS1, AS2, and prefix p; internal traffic shift]
Internal Disruption: all of the traffic shifts are internal traffic shifts
33Event Type Single External Disruption
[Figure: AT&T network with AS1, AS2, and prefix p; external traffic shift]
Single External Disruption: traffic at one exit point shifts to other exit points
34Statistics on Event Classification
Event category                 Events   Updates
No Disruption                  50.3%    48.6%
Internal Disruption            15.6%     3.4%
Single External Disruption     20.7%     7.9%
Multiple External Disruption    7.4%    18.2%
Loss/Gain of Reachability       6.0%    21.9%
35Challenge 3 Multiple Destinations
- A single routing change
- Affects multiple destination prefixes
Group events of same type that occur close in time
36Main Causes of Large Clusters BGP Resets
- External BGP session resets
- Failure/recovery of external BGP session
- E.g., session to another large tier-1 ISP
- Caused single external disruption events
- Validated by looking at syslog reports on routers
[Figure: AT&T network with AS1, AS2, and prefix p]
37Main Causes of Large Clusters Hot Potatoes
- Hot-potato routing changes
- Failure/recovery of an intradomain link
- E.g., leads to changes in IGP path costs
- Caused internal disruption events
- Validated by looking at OSPF measurements
[Figure: ISP network reaching prefix P, with numeric labels 10, 11, and 9]
Hot-potato routing: route to closest egress point
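Hot-potato selection reduces to picking the egress point with the lowest IGP path cost, so an intradomain change can move traffic without any BGP message arriving from outside. The egress names and costs below are hypothetical, and ties are broken by name only to keep the example deterministic.

```python
from typing import Dict, Optional, Tuple


def closest_egress(igp_cost: Dict[str, int]) -> str:
    """Choose the egress point with the smallest IGP path cost."""
    return min(sorted(igp_cost), key=igp_cost.get)   # sorted() makes ties deterministic


def hot_potato_shift(before: Dict[str, int],
                     after: Dict[str, int]) -> Optional[Tuple[str, str]]:
    """Return (old egress, new egress) if an IGP cost change moves the traffic."""
    old, new = closest_egress(before), closest_egress(after)
    return (old, new) if old != new else None


# Hypothetical costs from one ingress router to three egress points.
costs_before = {"egress-A": 10, "egress-B": 11, "egress-C": 9}
# Suppose a link failure raises the cost of the path to egress-C.
costs_after = {"egress-A": 10, "egress-B": 11, "egress-C": 12}
shift = hot_potato_shift(costs_before, costs_after)
```

A shift like this would show up as an internal disruption event, which is why the slide points to OSPF measurements for validation.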
38Challenge 4 Popularity of Destinations
- Impact of event on traffic
- Depends on the popularity of the destinations
Netflow Data
Weight the group of destinations by the traffic
volume
39ISP Study Traffic Impact Prediction
- Traffic weight
- Per-prefix measurements from Netflow
- 10% of prefixes account for 90% of traffic
- Traffic weight of a cluster
- The sum of traffic weight of the prefixes
- Flag clusters with heavy traffic
- A few large clusters have large traffic weight
- Mostly session resets and hot-potato changes
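The weighting step is simple enough to sketch. The per-prefix weights, the cluster contents, and the flagging threshold below are all invented for illustration; in practice the weights would come from Netflow measurements.

```python
from typing import Dict, Iterable, List


def cluster_weight(prefixes: Iterable[str], prefix_weight: Dict[str, float]) -> float:
    """Traffic weight of a cluster: the sum of the weights of its prefixes."""
    return sum(prefix_weight.get(p, 0.0) for p in prefixes)


def heavy_clusters(clusters: Dict[str, List[str]], prefix_weight: Dict[str, float],
                   threshold: float) -> List[str]:
    """Names of clusters whose traffic weight reaches the (assumed) threshold."""
    return [name for name, prefixes in clusters.items()
            if cluster_weight(prefixes, prefix_weight) >= threshold]


# Hypothetical fractions of total traffic per prefix.
weights = {"192.0.2.0/24": 0.5, "198.51.100.0/24": 0.25, "203.0.113.0/24": 0.125}
clusters = {
    "session-reset": ["192.0.2.0/24", "198.51.100.0/24"],
    "small-change": ["203.0.113.0/24"],
}
flagged = heavy_clusters(clusters, weights, threshold=0.5)
```

Prefixes missing from the weight table count as zero, so a large cluster of unpopular destinations is not flagged.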
40ISP Study Summary
41Three Studies, Three Approaches
- End-to-end active probes
- Measure and characterize the forwarding path
- Identify the effects on data traffic
- Wide-area passive route monitoring
- Measure and classify BGP routing churn
- Identify pathologies and improve Internet health
- Intra-AS passive route monitoring
- Detailed measurements of BGP within an AS
- Aggregate data into small set of major events