Title: Infrastructure-based Resilient Routing
1 Infrastructure-based Resilient Routing
- Ben Y. Zhao, Ling Huang, Jeremy Stribling, Anthony Joseph and John Kubiatowicz
- University of California, Berkeley
- ICSI Lunch Seminar, January 2004
- Network connectivity is not reliable
- Disconnections frequent in the Internet (UMichTR98, IMC02)
- 50% of backbone links have MTBF < 10 days
- 20% of faults last longer than 10 mins
- IP-level repair relatively slow
- Wide-area BGP: roughly 3 mins
- Local-area IS-IS: roughly 5 seconds
- Next generation wide-area network applications
- Streaming media, VoIP, B2B transactions
- Low tolerance of delay, jitter and faults
3 The Challenge
- Routing failures are diverse
- Many causes
- Misconfigurations, cut fiber, planned downtime, software bugs
- Occur anywhere with local or global impact
- Single fiber cut can disconnect AS pairs
- One event can lead to complex protocol interactions
- Isolating failures is difficult
- End user symptoms often dynamic or intermittent
- WAN measurement research is ongoing (Rocketfuel, etc.)
- Observations
- Fault detection from multiple distributed vantage points
- In-network decision making necessary for timely response
4 Talk Overview
- Motivation
- A structured overlay approach
- Mechanisms and policy
- Evaluation
- Some questions
5 An Infrastructure Approach
- Our goals
- Resilient overlay to route around failures
- Respond in milliseconds (not seconds)
- Our approach (data and control plane)
- Nodes are observation points (similar to Plato's NEWS service)
- Nodes are also points of traffic redirection (forwarding path determination and data forwarding)
- No edge node involvement
- Fast response time, security focused on infrastructure
- Fully transparent, no application awareness
6 Why Structured Overlays
- Resilient Overlay Networks (MIT)
- Fully connected mesh
- Each node has full knowledge of network
- Fast, independent calculation of routes
- Nodes can construct any path, maximum flexibility
- Cost of flexibility
- Protocol needs to choose the right route/nodes
- Per node O(n) state
- Monitors n - 1 paths
- O(n^2) total path monitoring is expensive
7 The Big Picture
- Locate nearby overlay proxy
- Establish overlay path to destination host
- Overlay routes traffic resiliently
8 Traffic Tunneling
[Figure labels: Legacy Node A, Legacy Node B (A, B are IP addresses), Structured Peer to Peer Overlay]
- Store mapping from end host IP to its proxy's overlay ID
- Similar to approach in Internet Indirection Infrastructure (I3)
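The mapping step on this slide can be illustrated with a small sketch. It is not the authors' implementation: the in-memory `MappingStore` stands in for storage inside the overlay, and `overlay_key`, `ID_DIGITS`, the truncated SHA-1 hash, and the packet fields are all hypothetical choices made for illustration.

```python
import hashlib

ID_DIGITS = 8  # hypothetical overlay ID length, in hex digits


def overlay_key(ip):
    """Hash an end-host IP into the ID space (assumed: truncated SHA-1)."""
    return hashlib.sha1(ip.encode("ascii")).hexdigest()[:ID_DIGITS]


class MappingStore:
    """In-memory stand-in for the overlay's storage: hashed IP -> proxy overlay ID."""

    def __init__(self):
        self._table = {}

    def register(self, host_ip, proxy_id):
        # A proxy announces that it serves this legacy host.
        self._table[overlay_key(host_ip)] = proxy_id

    def lookup(self, host_ip):
        # Returns None when no proxy has registered for the host.
        return self._table.get(overlay_key(host_ip))


def tunnel(store, src_ip, dst_ip, payload):
    """Wrap a legacy IP packet for delivery to the destination's proxy."""
    proxy_id = store.lookup(dst_ip)
    if proxy_id is None:
        return None  # no known proxy: the caller would fall back to plain IP
    return {"overlay_dest": proxy_id, "src": src_ip, "dst": dst_ip, "payload": payload}


store = MappingStore()
store.register("10.0.0.2", "4a7f")  # legacy node B sits behind proxy 4a7f
packet = tunnel(store, "10.0.0.1", "10.0.0.2", b"hello")
```

The end hosts only ever see IP addresses A and B; the overlay ID appears solely in the wrapper that the proxies exchange.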
9 Pros and Cons
- Leverage small neighbor sets
- Fewer neighbor paths to monitor: O(n) → O(log(n))
- Reduction in probing bandwidth
- Faster fault detection
- Actively maintain static route redundancy
- Manageable for small number of paths
- Redirect traffic immediately when a failure is detected
- Eliminate on-the-fly calculation of new routes
- Restore redundancy in background after failure
- Fast fault detection and precomputed paths yield more responsiveness
- Cons: overlay imposes routing stretch (mostly < 2)
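To make the O(n) versus O(log(n)) comparison concrete, here is a rough worked calculation. The structured-overlay figure assumes exactly `ceil(log2(n))` monitored neighbors per node, with a hypothetical constant factor `c` of 1; real neighbor counts depend on the overlay's digit base and on how many backups it keeps, so only the growth trend should be read from this.

```python
def full_mesh_paths(n):
    """(per-node, total) monitored paths when every node probes every other node."""
    return n - 1, n * (n - 1)


def structured_paths(n, c=1):
    """(per-node, total) paths assuming c * ceil(log2(n)) neighbors per node.

    Both the constant c and the base-2 logarithm are illustrative assumptions.
    """
    levels = (n - 1).bit_length()  # integer ceil(log2(n)) for n >= 2
    per_node = c * levels
    return per_node, n * per_node


for n in (64, 300, 1024):
    print(n, full_mesh_paths(n), structured_paths(n))
```

At 300 nodes the full mesh monitors 299 paths per node against 9 under these assumptions, and the gap widens as the overlay grows.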
10 In-network Resiliency Details
- Active periodic probes for fault-detection
- Exponentially weighted moving average link quality estimation
- Avoid route flapping due to short term loss artifacts
- Loss rate: $L_n = (1 - \alpha) \cdot L_{n-1} + \alpha \cdot p$
- Simple approach taken, much ongoing research
- Smart fault-detection / propagation (Zhuang04)
- Intelligent and cooperative path selection (Seshardri04)
- Maintaining backup paths
- Create and store backup routes at node insertion
- Query neighbors after failures to restore redundancy
- Ask any neighbor at or above routing level of faulty node
- e.g. ABCD sees ABDE failed, can ask any AB?? node for info
- Simple policies to choose among redundant paths
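The loss-rate estimate on this slide can be sketched as follows, assuming it takes the standard exponentially weighted moving average form L_n = (1 - α) · L_{n-1} + α · p. Treating p as 1.0 for a lost probe and 0.0 for an answered one is an assumption, and the class and parameter names are hypothetical.

```python
class LinkQualityEstimator:
    """EWMA loss-rate estimate for one overlay link."""

    def __init__(self, alpha, initial_loss=0.0):
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha        # filter constant: weight of the newest probe
        self.loss = initial_loss  # current estimate L_n

    def update(self, probe_loss):
        """Fold one probe result (assumed 1.0 = lost, 0.0 = answered) into the estimate."""
        self.loss = (1 - self.alpha) * self.loss + self.alpha * probe_loss
        return self.loss


est = LinkQualityEstimator(alpha=0.5)
history = [est.update(p) for p in (1.0, 1.0, 0.0)]
```

A smaller filter constant damps the effect of a single lost probe, which reduces route flapping but makes the estimate slower to reflect a real failure.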
11 First Reachable Link Selection (FRLS)
- Use link quality estimation to choose shortest usable path
- Use shortest path with minimal quality > T
- Correlated failures
- Reduce with intelligent topology construction
- Goal: leverage redundancy available
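A minimal sketch of the selection policy described on this slide: walk the routes for an entry from shortest to longest and take the first whose estimated quality exceeds the threshold T. The data layout (an ordered list of route names plus a quality dictionary), the quality scale of 0 to 1, and the example values are assumptions for illustration.

```python
def first_reachable_link(routes, quality, threshold):
    """Return the first route whose estimated quality exceeds the threshold.

    routes:    route names ordered shortest first (primary, then backups)
    quality:   route name -> estimated quality in [0, 1] (assumed scale)
    threshold: the cutoff T
    Returns None when no route qualifies.
    """
    for route in routes:
        # Routes with no estimate yet are treated as unusable.
        if quality.get(route, 0.0) > threshold:
            return route
    return None


routes = ["primary", "backup1", "backup2"]
quality = {"primary": 0.2, "backup1": 0.9, "backup2": 0.95}
chosen = first_reachable_link(routes, quality, threshold=0.7)
```

Here the primary is skipped and the first backup is chosen even though the second backup scores higher, because the policy prefers the shortest usable path rather than the best one.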
- Metrics for evaluation
- How much routing resiliency can we exploit?
- How fast can we adapt to faults (responsiveness)?
- Experimental platforms
- Event-based simulations on transit stub topologies
- Data collected over multiple 5000-node topologies
- PlanetLab measurements
- Microbenchmarks on responsiveness
13 Exploiting Route Redundancy (Sim)
- Simulation of Tapestry, 2 backup paths per routing entry
- Transit-stub topology shown, results from TIER and AS graphs similar
14 Responsiveness to Faults (PlanetLab)
- Two reasonable values for filter constant α
- Response time scales linearly with probe period
15 Link Probing Bandwidth (PlanetLab)
- Bandwidth increases logarithmically with overlay size
- Medium sized routing overlays incur low probing bandwidth
- Trading flexibility for scalability and responsiveness
- Structured routing has low path maintenance costs
- Allows caching of backup paths for quick failover
- Can no longer construct arbitrary paths
- But simple policy exploits available redundancy well
- Fast enough for most interactive applications
- 300ms beacon period → response time < 700ms
- 300 nodes, b/w cost 7KB/s
17 Ongoing Questions
- Is this the right approach?
- Is there a lower bound on desired responsiveness?
- Is this responsive enough for VoIP?
- If not, is multipath routing the solution?
- What about deployment issues?
- How does inter-domain deployment happen?
- A third-party approach? (Akamai for routing)
18 Related Work
- Redirection overlays
- Detour (IEEE Micro 99)
- Resilient Overlay Networks (SOSP 01)
- Internet Indirection Infrastructure (SIGCOMM 02)
- Secure Overlay Services (SIGCOMM 02)
- Topology estimation techniques
- Adaptive probing (IPTPS 03)
- Internet tomography (IMC 03)
- Routing underlay (SIGCOMM 03)
- Many, many other structured peer-to-peer overlays
- Thanks to Dennis Geels / Sean Rhea for their work on BMark
19 Backup Slides
20 Another Perspective on Reachability
Portion of all pair-wise paths where no failure-free paths remain
A path exists, but neither IP nor FRLS can locate the path
Portion of all paths where IP and FRLS both route
FRLS finds path, where short-term IP routing fails
21 Constrained Multicast
- Used only when all paths are below quality threshold
- Send duplicate messages on multiple paths
- Leverage route convergence
- Assign unique messageIDs
- Mark duplicates
- Keep moving window of IDs
- Recognize and drop duplicates
- Limitations
- Assumes loss not from congestion
- Ideal for local area routing
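The duplicate-suppression steps above (unique message IDs, a moving window of IDs, dropping repeats) might look like the following sketch. The window size, the FIFO eviction, and the class name are assumptions; the slides do not say how the window is sized or aged.

```python
from collections import deque


class DuplicateFilter:
    """Remembers the most recent message IDs and rejects repeats."""

    def __init__(self, window_size):
        if window_size < 1:
            raise ValueError("window_size must be at least 1")
        self.window_size = window_size
        self._order = deque()  # IDs in arrival order, oldest on the left
        self._seen = set()     # same IDs, for constant-time membership tests

    def accept(self, message_id):
        """Return True for a new ID, False for a duplicate still in the window."""
        if message_id in self._seen:
            return False
        self._seen.add(message_id)
        self._order.append(message_id)
        if len(self._order) > self.window_size:
            # Slide the window: forget the oldest ID.
            self._seen.discard(self._order.popleft())
        return True


node = DuplicateFilter(window_size=2)
results = [node.accept(m) for m in (1, 1, 2, 3, 1)]
```

A node where duplicate paths converge would forward a message only when `accept` returns True. The last call in the example shows the trade-off of a bounded window: ID 1 has already been evicted, so a very late duplicate is let through.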
22 Latency Overhead of Misrouting
23 Bandwidth Cost of Constrained Multicast