Title: CSE 598B: Self-* Systems
1CSE 598B Self-* Systems
- Path Based Failure and Evolution Management
- Mike Y. Chen, Anthony Accardi, Emre Kiciman, Jim
Lloyd, Dave Patterson, Armando Fox, Eric Brewer
(UC Berkeley, Stanford U, Tellme Networks, eBay
Inc.)
- Presented by Arjun R. Nath
2The Problem..
- Computing systems are increasing in complexity
- Tending towards large, complex, distributed
systems
- Sometimes there are thousands of machines
involved
- Basic system management is becoming increasingly
difficult.
- Everything from detecting and diagnosing failures
to understanding application behaviour is becoming
very difficult.
3..the Problem
- Existing techniques such as code-level debuggers,
program slicing, process profiling and
application logs fail to characterize overall
system behaviour.
- Distributed debuggers are available but focus on
a homogeneous subset of the system.
4Goal of the paper
- Techniques to help us understand large
distributed systems.
- Improve
- availability
- reliability
- manageability
- Why are we looking at this paper? (Self-*
context)
- This paper is about techniques for monitoring
large, complex, distributed systems.
5Two main principles
- Path-Based Measurement
- Model the system as a collection of paths through
heterogeneous components.
- Make local observations along the paths and store
these. These can be accessed via queries and
visualization techniques.
- (Focus is on correctness rather than performance)
- Statistical Behaviour Analysis
- Large volumes of system requests are stored for
statistical analysis using classical techniques
to identify deviations from normal behaviour.
This can be applied to live systems or used for
offline analysis.
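A minimal sketch of the statistical idea above. The slides do not name the specific test used, so a simple z-score check against a baseline stands in for the "classical techniques"; the function name, the threshold and the latency numbers are all hypothetical.

```python
from statistics import mean, stdev

def find_deviations(baseline, observed, threshold=3.0):
    """Return observed values that lie more than `threshold` sample
    standard deviations away from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A constant baseline: anything different counts as a deviation.
        return [x for x in observed if x != mu]
    return [x for x in observed if abs(x - mu) / sigma > threshold]

# Hypothetical per-request latencies in milliseconds.
baseline_ms = [100, 102, 98, 101, 99, 100, 103, 97]
live_ms = [101, 99, 180, 100]
print(find_deviations(baseline_ms, live_ms))  # [180]
```

The same function could run over stored requests for offline analysis or over a sliding window of live requests.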
6What is a "Path" ?
- Associated with a request
- Control Flow
- Resources
- Paths may have inter-path dependencies: shared
state, shared database tables, shared
filesystems, shared memory.
- Multiple paths may be grouped together in
sessions.
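One possible way to represent the definition above as a data structure. This is a sketch under assumptions: the class and field names (Observation, Path, resources, session_id) are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Observation:
    component: str                      # where the observation was made
    timestamp_ms: int                   # when it was made
    resources: frozenset = frozenset()  # e.g. tables or files touched

@dataclass
class Path:
    request_id: str    # the request this path is associated with
    session_id: str    # paths with the same session_id form a session
    observations: list = field(default_factory=list)

    def control_flow(self):
        """Components in the order the request visited them."""
        ordered = sorted(self.observations, key=lambda o: o.timestamp_ms)
        return [o.component for o in ordered]

    def resources(self):
        """All resources touched anywhere along the path."""
        used = set()
        for o in self.observations:
            used |= o.resources
        return used

def shared_resources(a, b):
    """Resources two paths have in common (a possible inter-path dependency)."""
    return a.resources() & b.resources()

checkout = Path("req-1", "sess-9", [
    Observation("db", 20, frozenset({"orders"})),
    Observation("web", 10),
])
browse = Path("req-2", "sess-9", [
    Observation("web", 5),
    Observation("db", 8, frozenset({"orders", "catalog"})),
])
print(checkout.control_flow())             # ['web', 'db']
print(shared_resources(checkout, browse))  # {'orders'}
```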
7Coarse grained paths
8Fine grained paths
9How do paths help ?
- Failure Management
- Evolution (of the system)
10Failure Management...
- Detection
- Reduce downtime associated with detection delays
- Using paths can help in noticing developing
problems before they become severe
- The key is to define "normal" behaviour
statistically and then check for deviations
- Diagnosis
- Isolate problems using solely the recorded path
observations and then drive the diagnosis process
with the path information.
- Paths help identify which components are involved
in a given failure and aid in identifying causes.
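The diagnosis step can be illustrated with a simple frequency ranking: for each component, what fraction of the paths that used it failed? This is only a sketch of the idea, not necessarily the analysis the paper's tools perform; the function name and the sample paths are hypothetical.

```python
from collections import Counter

def rank_suspects(paths):
    """paths: list of (components, failed) tuples.
    Returns (component, failure_rate) pairs, most suspicious first."""
    total = Counter()
    failed = Counter()
    for components, did_fail in paths:
        for c in set(components):  # count each component once per path
            total[c] += 1
            if did_fail:
                failed[c] += 1
    rates = {c: failed[c] / total[c] for c in total}
    # Sort by descending failure rate, then by name for a stable order.
    return sorted(rates.items(), key=lambda kv: (-kv[1], kv[0]))

recorded = [
    (["web", "auth", "db"], False),
    (["web", "cart", "db"], True),
    (["web", "cart"], True),
    (["web", "auth"], False),
]
print(rank_suspects(recorded))
# [('cart', 1.0), ('db', 0.5), ('web', 0.5), ('auth', 0.0)]
```

Here every path through "cart" failed, so it is ranked first even though "web" appears in just as many failed paths.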
11...Failure Management
- Impact Analysis
- Helps in knowing the scale of the problem ->
estimate time-to-repair
- Which other paths are at risk.
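As a rough sketch of impact analysis, assuming the faulty component is already known and paths are stored as lists of component names (both assumptions, as are the names used here), the at-risk paths are simply those that traverse it:

```python
def impact(paths, faulty_component):
    """paths: dict of request ID -> list of components visited.
    Returns the IDs of paths through the faulty component and the
    fraction of all paths they represent."""
    at_risk = sorted(rid for rid, comps in paths.items()
                     if faulty_component in comps)
    fraction = len(at_risk) / len(paths) if paths else 0.0
    return at_risk, fraction

stored = {
    "req-1": ["web", "auth", "db"],
    "req-2": ["web", "cart"],
    "req-3": ["web", "cart", "db"],
    "req-4": ["web", "search"],
}
print(impact(stored, "db"))  # (['req-1', 'req-3'], 0.5)
```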
12Evolution (of the system)
- It's very difficult to get an overall picture of
how a complex distributed system changes with
time
- Software/hardware upgrades, patches, code
changes etc.
- Systems evolve through changes to their
components and also through changes in how they
interact
- Paths help in revealing system structure and
dependencies and tracking changes.
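One way to picture "tracking changes" is to compare the component-to-component interactions seen in paths recorded before and after an upgrade. The sketch below assumes paths are ordered lists of component names; the function names and data are hypothetical.

```python
def interaction_edges(paths):
    """Collect every (caller, callee) pair seen in consecutive path steps."""
    edges = set()
    for components in paths:
        edges.update(zip(components, components[1:]))
    return edges

def structure_diff(before, after):
    """Interactions that appeared or disappeared between two sets of paths."""
    old, new = interaction_edges(before), interaction_edges(after)
    return {"added": sorted(new - old), "removed": sorted(old - new)}

before_upgrade = [["web", "auth", "db"], ["web", "cart", "db"]]
after_upgrade = [["web", "auth", "db"], ["web", "cart", "cache"],
                 ["web", "cart", "db"]]
print(structure_diff(before_upgrade, after_upgrade))
# {'added': [('cart', 'cache')], 'removed': []}
```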
13Implementation
14Implementation Architecture
15Implementation...
- Tracers: track a request through the target
system.
- Each request has an associated identifier that is
maintained throughout the path
- IDs may be stored in extensible headers (HTTP,
SOAP)
- Tracers are platform specific but can be generic
to applications using the same platform (J2EE,
.NET)
- Pinpoint, ObsLogs, SuperCal all have tracers.
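A minimal sketch of how a tracer might carry a request ID in an extensible header. The header name "X-Request-Id", the function names and the list used as an observation sink are assumptions for illustration; the slides do not say how Pinpoint, ObsLogs or SuperCal implement this.

```python
import uuid

REQUEST_ID_HEADER = "X-Request-Id"  # assumed header name, not from the paper

def ensure_request_id(headers):
    """Reuse the ID already in the headers, or mint one at the entry point."""
    if REQUEST_ID_HEADER not in headers:
        headers[REQUEST_ID_HEADER] = uuid.uuid4().hex
    return headers[REQUEST_ID_HEADER]

def trace(component, headers, sink):
    """Record that `component` handled the request, then pass headers on."""
    request_id = ensure_request_id(headers)
    sink.append((request_id, component))
    return headers

observations = []
headers = {}
for component in ["frontend", "app-server", "database"]:
    headers = trace(component, headers, observations)

# All three observations carry the same request ID.
print(len({request_id for request_id, _ in observations}))  # 1
```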
16Implementation tools..
Three systems that support path-based analysis
17...Implementation
- Aggregator and Repository
- Aggregator receives observations from tracers
- Reconstructs paths using IDs
- Stores these in the Repository
- There may also be a Central Repository that
collects from distributed repositories.
- Analysis Engines and Visualization.
- Single and multi-path analysis
- Dedicated engines for various statistical tests
- Support for some data mining tools
- Visualization: Tukey's boxplots generated using
Octave
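The aggregator's job of reconstructing paths from IDs can be sketched as a group-by on the request ID followed by a sort on time. The tuple layout, function name and use of a plain dict as the repository are assumptions made for this example.

```python
from collections import defaultdict

def reconstruct_paths(observations):
    """observations: iterable of (request_id, timestamp_ms, component),
    arriving in any order. Returns request_id -> ordered component list."""
    by_request = defaultdict(list)
    for request_id, timestamp_ms, component in observations:
        by_request[request_id].append((timestamp_ms, component))
    # Sorting the (timestamp, component) pairs restores the visit order.
    return {rid: [c for _, c in sorted(obs)]
            for rid, obs in by_request.items()}

incoming = [
    ("req-2", 11, "web"),
    ("req-1", 30, "db"),
    ("req-1", 10, "web"),
    ("req-2", 19, "cart"),
    ("req-1", 20, "auth"),
]
repository = reconstruct_paths(incoming)
print(repository["req-1"])  # ['web', 'auth', 'db']
```

Sorting by timestamp assumes the tracers' clocks are close enough to agree on ordering; a real aggregator spanning many machines may need sequence numbers instead.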
18Implementation
A trend specific to recognition time in Tellme
application A suggests a regression in a speech
grammar in that application. The Tukey boxplots
shown illustrate a distribution's center, spread,
and asymmetries by using rectangles to show the
upper and lower quartiles and the median, and
explicitly plotting each outlier.
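The numbers behind a Tukey boxplot can be computed directly. The slides say the plots were generated with Octave; the sketch below uses Python's standard library instead, with the conventional 1.5 x IQR rule for outliers. The quartile method and the sample data are assumptions.

```python
from statistics import quantiles

def tukey_summary(values, whisker=1.5):
    """Quartiles, median and outliers as drawn in a Tukey boxplot."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # height of the box
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    outliers = [v for v in values if v < low or v > high]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}

# Hypothetical measurements with one obvious outlier.
samples = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
summary = tukey_summary(samples)
print(summary["median"], summary["outliers"])  # 5.5 [50]
```

Comparing such summaries across releases is one way a shift in the median or a growing set of outliers would show up as a trend.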
19Limitations and constraints
- Cannot resolve fault causes at a very detailed
level
- Overheads can be high for fine grained paths
- Need to decide which observations to include in
paths. This is an iterative process.
- Can be difficult to implement, especially for
existing systems
20- It's important to understand that path-based
analysis is an aid to fault detection and
recovery and not a solution in itself. It is
meant to be used in combination with traditional
fault handling techniques.
21Conclusion
- As systems get more complex, path-based analysis
tools will have increasing importance.
- Path-based fault analysis complements traditional
techniques
- Hardly any fully functional, path-based, fault
management tools are available.
- This paper
- Has breadth but lacks depth in some places.
- Needs some more data around production
environment experiments
- Should have concentrated on 1 or 2
implementations and included more details.
- Not much info on SuperCal and ObsLogs
22Other related stuff
- Pinpoint project at Stanford: http://swig.stanford.edu/pinpoint.shtml
(Some interesting papers here)
- Magpie project (Microsoft)
- Quest Software JProbe Java performance
profiler
- Borland's OptimizeIt Enterprise Suite
23- That's all folks,
- Thank You