Title: Fault Localization via Analysis of Network Dependency
http://pmon
Victor Bahl, Ranveer Chandra, Albert Greenberg,
Dave Maltz, Ming Zhang (MSR Redmond)
Failure of Management Systems
Mission
Automatically Localizing Faults
- What we have today
  - Interdependent distributed systems with hidden and unknown dependencies
  - Plethora of tools for graphing SNMP values, paucity of tools for tracking relationships
  - Little visibility into effect of network on applications
- What we want
  - Method to map the IT infrastructure, determining which components affect a given client activity
  - Method to localize problems that affect users
[Figure panels: Response time of 17 servers; Response time of 1 web server]
- 10% of requests to internal servers take 10x longer than normal
- Persistent user frustration and high care costs
- Invisible to current management systems
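The "10x longer than normal" measure above can be made concrete with a small sketch. It assumes "normal" means the median response time, which the slides do not specify. The function name `slow_fraction` and the sample values are hypothetical.

```python
from statistics import median

def slow_fraction(response_times, factor=10.0):
    """Fraction of requests slower than `factor` times the median response time.

    Assumption: the median stands in for the "normal" response time.
    """
    if not response_times:
        raise ValueError("need at least one response time")
    baseline = median(response_times)
    slow = sum(1 for t in response_times if t > factor * baseline)
    return slow / len(response_times)

# Hypothetical sample (seconds): nine ordinary requests and one very slow one.
sample = [0.10, 0.12, 0.09, 0.11, 0.10, 0.13, 0.10, 0.11, 0.12, 2.50]
print(slow_fraction(sample))
```

A per-server version of the same calculation would show whether the slow requests concentrate on a few servers or are spread across all of them.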
Automatically Creating Models of Dependencies
Challenges
- A typical large enterprise
  - 100,000 client desktops
  - 10,000 servers
  - 10,000 apps/services
  - 10,000 network devices
- Service alerts for 10 days
  - 120,000 housekeeping
  - 2,000 missed heartbeats from 160 servers
  - 18,000 alerts from 194 categories and 877 hosts
Results
- Algorithm for extraction of dependency models
  - Sniffs and correlates packets between hosts
- Algorithm for flexible, accurate fault localization
  - Scalable to size of large enterprises
  - Localizes both hard and performance faults
  - Finds problems in network, even without data from network routers
- Deployed and evaluated on testbed and several MSIT applications (e.g., msw, itweb)
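One plausible reading of "sniffs and correlates packets between hosts" is a co-occurrence count: if a client usually contacts service A shortly before service B, B probably depends on A. The sketch below illustrates that idea only and is not the authors' algorithm. The 10 ms window, the function name `dependency_probabilities`, and the trace are assumptions made for the example.

```python
from collections import defaultdict

def dependency_probabilities(packets, window=0.01):
    """Estimate how often an access to one server is preceded by another.

    packets: iterable of (timestamp_seconds, client, server), in any order.
    Returns {(earlier, later): fraction of accesses to `later` that the same
    client preceded with an access to `earlier` within `window` seconds}.
    """
    by_client = defaultdict(list)
    for ts, client, server in packets:
        by_client[client].append((ts, server))

    accesses = defaultdict(int)   # server -> total accesses
    co_occur = defaultdict(int)   # (earlier, later) -> preceded accesses

    for events in by_client.values():
        events.sort()
        for i, (ts, server) in enumerate(events):
            accesses[server] += 1
            preceding = set()
            j = i - 1
            # Walk backwards through this client's events inside the window.
            while j >= 0 and ts - events[j][0] <= window:
                if events[j][1] != server:
                    preceding.add(events[j][1])
                j -= 1
            for other in preceding:
                co_occur[(other, server)] += 1

    return {pair: n / accesses[pair[1]] for pair, n in co_occur.items()}

# Hypothetical trace: the second "web" access by c1 skips DNS (cached answer).
trace = [
    (0.000, "c1", "dns"), (0.004, "c1", "web"),
    (1.000, "c1", "web"),
    (2.000, "c2", "dns"), (2.003, "c2", "web"),
    (3.000, "c2", "dns"), (3.002, "c2", "web"),
]
probs = dependency_probabilities(trace)
print(probs)
```

In this trace three of the four web accesses follow a DNS lookup, so the estimated dependency is fractional rather than all-or-nothing, which is one reason a probabilistic model is a natural fit.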
State of the Art
- Management systems do not provide a big picture
- Tools are box-centric not service-centric
- Relationships among servers often undocumented
- Fragmentation results in more mistakes and outages
- Tools do not directly measure user experience
Example Extracted Dependencies
On-Going Work
- Read/Write SML models of applications
  - Automatically generate SML for legacy apps
  - Complement expert-generated SML
- Explore other applications of Inference Graph
  - Upgrade management (who will be affected)
  - Availability analysis (who is being impacted)
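The "who will be affected" question can be read as a reachability query over extracted dependencies. The sketch below is a minimal illustration under that assumption. It uses a plain dictionary rather than the Inference Graph itself, and the function name `affected_by` and all node names are hypothetical.

```python
from collections import deque

def affected_by(depends_on, component):
    """Return every node that directly or transitively depends on `component`.

    depends_on: {node: set of nodes it depends on}.
    """
    # Invert the edges: component -> nodes that depend on it.
    dependents = {}
    for node, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(node)

    seen = set()
    queue = deque([component])
    while queue:  # breadth-first walk up the dependency chain
        current = queue.popleft()
        for node in dependents.get(current, ()):
            if node not in seen:
                seen.add(node)
                queue.append(node)
    seen.discard(component)  # guards against cycles back to the start
    return seen

# Hypothetical dependency map.
graph = {
    "client1:fetch-web-page": {"web-server", "dns"},
    "client2:fetch-file": {"file-server", "dns"},
    "web-server": {"sql-backend"},
}
print(sorted(affected_by(graph, "sql-backend")))
```

Here upgrading the hypothetical `sql-backend` touches the web server and, through it, one client activity, while an upgrade to `dns` would reach both clients.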
Model is probabilistic to cope with caching, load balancing and failover techniques
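To show why probabilities help here, the sketch below weights each dependency by how often a request actually uses it and combines the weights with a noisy-OR. This is a simplified stand-in, not the model from the slides: the independence assumption, the weights, and the names `failure_probability` and `rank_single_faults` are all hypothetical.

```python
def failure_probability(dependencies, down):
    """Probability that one request fails, given a set of failed components.

    dependencies: {component: probability a request relies on it}.
    Assumption (noisy-OR): uses of different components are independent, so
    a request succeeds only if it avoids every failed component.
    """
    p_ok = 1.0
    for component, weight in dependencies.items():
        if component in down:
            p_ok *= 1.0 - weight
    return 1.0 - p_ok

def rank_single_faults(dependencies, observed_failure_rate):
    """Order single-component hypotheses by how well they explain the data."""
    scored = [
        (abs(failure_probability(dependencies, {c}) - observed_failure_rate), c)
        for c in dependencies
    ]
    return [c for _, c in sorted(scored)]

# Hypothetical weights: DNS is skipped 25% of the time (caching), and a load
# balancer splits requests evenly across two web servers.
deps = {"dns": 0.75, "web-1": 0.5, "web-2": 0.5}
print(failure_probability(deps, {"web-1"}))
print(rank_single_faults(deps, observed_failure_rate=0.7))
```

The noisy-OR treats load-balanced replicas as independent, which understates the failure rate when every replica is down. A fuller model would need a dedicated construct for load balancing and failover.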