Title: Capturing, indexing and retrieving system history
1 Capturing, indexing and retrieving system history
- Steve Zhang, Armando Fox (Stanford University)
- Ira Cohen, Moises Goldszmidt, Julie Symons, Terence Kelly (HP Labs)
2 Outline
- Goals: what problems are we trying to solve?
- Our approach
- Validation and testing
- Results
- Work-in-Progress
- Improving our approach
- Hurdles
- Initial observations
- Future direction
- Summary
3 Service performance management by Service Level Objectives (SLO)
[Figure label: Unhealthy]
4 Wouldn't it be great if...
- We could identify similar instances of a particular problem?
- And retrieve previous diagnoses/repairs?
5 Wouldn't it be great if...
- We knew how many different problems there are, and their intensities?
- We knew their characteristic symptoms (syndrome)?
6 Contribution of this work
- Reducing system performance diagnosis to an information retrieval framework by
  - Maintaining a long-term memory of past events and actions in a form that can be manipulated and searched by computers
- Enabling
  - Identification of different problems and their intensities
    - Helps operator prioritization
  - Finding similar problems in the past or in related systems
    - Leverages previous repair/root cause efforts
  - Syndrome identification
7 Main Challenge
- Find a representation (signature) that captures the main characteristics of the system behavior and that is
  - Amenable to distance metrics
  - Generated automatically
  - In machine-readable form
  - Therefore requiring little human intervention
8 Representation of system state
- Many metrics available on IT systems
- Assumption: they are sufficient to capture the essence of the problems
- Couldn't we just keep their raw values?
9 Signature construction approach
- Learn the relationship between the metrics and the SLO state
- Based on our OSDI'04 and DSN'05 work
P(SLO, M)
10 Signatures - example
- For a given SLO violation, the models provide a list of metrics that are attributed to the violation.
- A metric has value 1 if it is attributed to the violation, -1 if it is not attributed, and 0 if it is not relevant, e.g.
[Example table not recoverable; surviving column header: Attribution]
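The encoding described on this slide can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the metric list, the `make_signature` helper, and the choice of L1 distance are assumptions made for the example.

```python
from typing import List, Set

# Hypothetical metric names, used only for illustration.
METRICS: List[str] = [
    "gbl_cpu_total_util",
    "gbl_mem_util",
    "app_alive_proc",
    "net_out_rate",
]

def make_signature(attributed: Set[str], relevant: Set[str]) -> List[int]:
    """Encode one SLO violation as a vector over METRICS.

    1  = metric is attributed to the violation
    -1 = metric is relevant (used by the model) but not attributed
    0  = metric is not relevant
    """
    signature = []
    for metric in METRICS:
        if metric not in relevant:
            signature.append(0)
        elif metric in attributed:
            signature.append(1)
        else:
            signature.append(-1)
    return signature

def l1_distance(a: List[int], b: List[int]) -> int:
    """L1 distance, one possible distance metric over signatures (assumed)."""
    return sum(abs(x - y) for x, y in zip(a, b))

relevant = {"gbl_cpu_total_util", "app_alive_proc"}
sig_a = make_signature({"gbl_cpu_total_util"}, relevant)
sig_b = make_signature({"gbl_cpu_total_util", "app_alive_proc"}, relevant)
```

Because every signature is a fixed-length vector over the same metric list, any standard vector distance can be applied to a pair of signatures, which is what makes the representation searchable.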
11 Why do we need metric attribution?
- Take two instances of a known problem and a metric relevant to that problem
- Values are very different, but behavior compared to normal is similar
[Figure: "Gbl app alive proc" metric plotted against time]
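A small Python sketch of the point made on this slide: two instances whose raw values differ widely can still look alike once each is compared with its own normal behavior. The data values and the three-standard-deviation rule are assumptions chosen for illustration; they stand in for the model-based attribution used in the talk and are not the authors' method.

```python
from statistics import mean, stdev
from typing import List

def abnormal(value: float, normal_samples: List[float], k: float = 3.0) -> bool:
    """Flag a value that deviates from the normal-period mean by more than k standard deviations."""
    mu = mean(normal_samples)
    sigma = stdev(normal_samples)
    return abs(value - mu) > k * sigma

# Made-up samples of one metric during normal operation, for two instances of a problem.
normal_1 = [10, 11, 9, 10, 12, 10]
normal_2 = [100, 105, 95, 102, 98, 100]

# Made-up values of the same metric during the two violations.
violation_1 = 25
violation_2 = 160

raw_gap = abs(violation_1 - violation_2)      # large: raw values look unrelated
flag_1 = abnormal(violation_1, normal_1)      # abnormal relative to its own baseline
flag_2 = abnormal(violation_2, normal_2)      # abnormal relative to its own baseline
```

Indexing on `flag_1` and `flag_2` would place the two instances close together, while indexing on the raw values would not.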
12 Creating and using signatures
Leveraging prior diagnosis efforts
[Figure: architecture diagram with components Monitored service; Metrics/SLO Monitoring; Signature construction engine; Signature DB; Retrieval engine; Clustering engine (identifies intensity, identifies recurrence, provides syndromes); Admin (provides annotations in free form)]
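The retrieval step of this architecture can be sketched in Python as a nearest-neighbor lookup over an annotated signature store. The in-memory list standing in for the signature DB, the `retrieve` function, the example vectors, and the L1 distance are all assumptions for illustration, not the authors' implementation.

```python
from typing import List, Optional, Tuple

Signature = List[int]
Entry = Tuple[Signature, Optional[str]]  # (signature, free-form annotation or None)

def l1(a: Signature, b: Signature) -> int:
    """L1 distance between two signatures of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def retrieve(query: Signature, db: List[Entry], k: int = 2) -> List[Entry]:
    """Return the k stored entries whose signatures are closest to the query."""
    ranked = sorted(db, key=lambda entry: l1(query, entry[0]))
    return ranked[:k]

# Hypothetical signature DB: two annotated past problems and one unannotated one.
db: List[Entry] = [
    ([1, 1, 0, -1], "Stuck Thread"),
    ([1, 1, -1, -1], "Stuck Thread"),
    ([-1, 0, 1, 1], None),
]

query = [1, 1, 0, 0]          # signature of a new SLO violation
matches = retrieve(query, db)  # closest past signatures, with their annotations
```

An operator looking at `matches` would see the annotations attached to the most similar past violations, which is how a prior diagnosis gets reused.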
13 Validation and testing of approach
- Experimental testbed
  - Three-tier e-commerce site (Petstore)
  - Three types of induced performance problems
- Service FT
  - Geographically distributed three-tier mission-critical enterprise application
  - 6 weeks of data from all instances of the service
  - One diagnosed (recurring) performance problem in the 6-week period (annotated "Stuck Thread")
14 Service FT
15 Evaluating signature representations
- Retrieval
  - Perform tests with already annotated problems, compute precision and recall of retrieval
  - Precision = C/A, Recall = C/B
  - Precision typically decreases as Recall increases
[Figure: A = retrieved signatures; B = relevant signatures in DB; C = correctly retrieved signatures]
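The two measures defined on this slide can be computed as in the Python sketch below. The signature IDs are made up for illustration, and `precision_recall` is a hypothetical helper name.

```python
from typing import Set, Tuple

def precision_recall(retrieved: Set[int], relevant: Set[int]) -> Tuple[float, float]:
    """Precision = C/A and Recall = C/B, where A = retrieved, B = relevant, C = their overlap."""
    correct = retrieved & relevant  # C: correctly retrieved signatures
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {1, 2, 3, 4}   # B: IDs of signatures annotated with the problem (made up)
retrieved = {1, 2, 5}     # A: IDs returned by the retrieval engine (made up)

precision, recall = precision_recall(retrieved, relevant)
```

Here two of the three retrieved signatures are relevant, and two of the four relevant signatures were found. Retrieving more signatures can only raise recall, but it usually admits more irrelevant ones, which is why precision tends to fall as recall rises.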
16 Precision-Recall graph
Retrieval of "Stuck Thread" problem
Conclusion: retrieval with metric attribution produces significantly better results
17 Evaluating signature representations
- Grouping similar problems (clustering)
  - A good result is one where all members of each cluster have a single annotation (pure)
  - Annotations of the performance problems are available in the case of Petstore
  - We compute the entropy of the clusters with respect to the annotation; low entropy means a good result
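The purity measure described on this slide can be sketched in Python as follows. The slides do not say how per-cluster entropies are combined, so the size-weighted average used here is an assumption, as are the annotation labels.

```python
from collections import Counter
from math import log2
from typing import List

def cluster_entropy(annotations: List[str]) -> float:
    """Shannon entropy (in bits) of the annotation labels inside one cluster."""
    n = len(annotations)
    counts = Counter(annotations)
    return sum((c / n) * log2(n / c) for c in counts.values())

def clustering_entropy(clusters: List[List[str]]) -> float:
    """Size-weighted average of per-cluster entropies (assumed aggregation)."""
    total = sum(len(cluster) for cluster in clusters)
    return sum(len(cluster) / total * cluster_entropy(cluster) for cluster in clusters)

# Hypothetical annotations of the signatures that fell into each cluster.
pure = [["problem_1", "problem_1"], ["problem_2", "problem_2"]]
mixed = [["problem_1", "problem_2"], ["problem_1", "problem_2"]]

pure_score = clustering_entropy(pure)    # every cluster has a single annotation
mixed_score = clustering_entropy(mixed)  # every cluster is an even mix of two
```

A clustering in which each cluster holds a single annotation scores zero, and the score grows as clusters mix annotations.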
18 Entropy of clustering
Entropy of clustering in Petstore
Conclusion: clustering with metric attribution produces significantly better results
19 Identifying intensity and recurrent problems
6-week view of one of the app server instances of Service FT
20 Syndrome identification of problems
Attributed metrics identify the symptoms of the problems
21 Leveraging signatures across data-centers
What happened in Asia-Pacific during the failover from Americas on Dec 18?
22 Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur during the failover?
[Figure annotation: Failover period]
23 Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur in AP during failover?
- NO!
- Signatures are very different
- Symptoms are not related to the app server
- Root cause is related to the database not being primed for a typically unseen type of transaction
24 Improving Signatures
- Ideally
  - Different root causes should have different signatures
  - Similar root causes should have similar signatures
- Realistically
  - Different syndromes should have different signatures
  - Similar syndromes should have similar signatures
- Relationship between root causes and syndromes is hidden
  - Need domain-specific knowledge
  - Can't control it directly anyway
- Signature construction method can control how signatures map to syndromes
25 The Synonym Problem
- Two or more different metrics or sets of metrics that all correlate equally well with SLO state over a specific time period
  - Excludes cases where the correlation is very weak
- Can cause nearly identical syndromes to map to multiple distinct signatures
  - Negatively impacts retrieval and clustering
  - Different signatures no longer imply different syndromes
- Several different causes of synonyms
  - Static vs. dynamic synonyms
26 Static Synonyms
- When different sets of metrics always correlate equally well with SLO state
- Caused by
  - One underlying metric with multiple names
  - Metrics that are linear transformations of one another
- Solution
  - Statically find these synonyms and consistently choose one metric or set of metrics to use from each synonym group
  - May be difficult if comparing different systems under different administrative control
[Figure labels: gbl_cpu_total_util, gbl_cpu_total_time, app_cpu_total_util, app_cpu_total_time]
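One way to find static synonyms of the "linear transformation" kind is to group metrics whose pairwise Pearson correlation has magnitude close to 1, then keep one representative per group. The Python sketch below assumes that approach; the sample values, the tolerance, and the rule of keeping the alphabetically first name are illustrative choices, not taken from the talk.

```python
from math import sqrt
from typing import Dict, List

def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)

def synonym_groups(series: Dict[str, List[float]], tol: float = 1e-6) -> List[List[str]]:
    """Group metrics that are (near-)exact linear transformations of each other."""
    groups: List[List[str]] = []
    for name in sorted(series):
        for group in groups:
            # Compare against the group's representative (its first member).
            if abs(abs(pearson(series[name], series[group[0]])) - 1.0) < tol:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups

# Made-up samples; the first two metrics differ only by a constant factor.
series = {
    "gbl_cpu_total_util": [10, 20, 30, 40],
    "gbl_cpu_total_time": [6, 12, 18, 24],
    "gbl_mem_util": [50, 48, 55, 51],
}

groups = synonym_groups(series)
canonical = [group[0] for group in groups]  # one metric kept per synonym group
```

Choosing the representative by a fixed rule keeps the choice consistent across runs, which is the property the slide asks for. It does not help when two systems name their metrics differently.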
27 Dynamic Synonyms
- When different metrics or sets of metrics only sometimes correlate equally well with SLO state
- Caused by
  - One metric sometimes being driven by another
  - A root cause manifesting itself in multiple ways
  - Multiple root causes coinciding by chance
- Solutions
  - Trying to pick one metric or set of metrics to use from a synonym group will not work
  - Flat vector signature may be inadequate
[Figure labels: AppSrv CPU Hog, AppSrv CPU High, AppSrv Net Out Low]
28 New Signature Format
- Let each signature consist of multiple sub-signatures
  - Includes as many possible explanations as can be found
  - Vary learning algorithm parameters to increase signature coverage
- Retrieval and clustering are more complicated
  - Distance metric can be some combination of the distances of all pairs of sub-signatures
  - Mean, Min, Top-N, etc.
[Figure labels: Sig1, Sig2; nodes A through I]
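The combined distance mentioned on this slide can be sketched in Python as follows, treating each signature as a list of sub-signature vectors. The L1 base distance, the example vectors, and the reading of "Top-N" as the mean of the N smallest pairwise distances are assumptions made for illustration.

```python
from itertools import product
from typing import List

SubSignature = List[int]
MultiSignature = List[SubSignature]

def l1(a: SubSignature, b: SubSignature) -> int:
    """L1 distance between two sub-signatures of equal length."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pairwise(sig1: MultiSignature, sig2: MultiSignature) -> List[int]:
    """Distances of all pairs of sub-signatures, sorted in ascending order."""
    return sorted(l1(a, b) for a, b in product(sig1, sig2))

def dist_min(sig1: MultiSignature, sig2: MultiSignature) -> float:
    return pairwise(sig1, sig2)[0]

def dist_mean(sig1: MultiSignature, sig2: MultiSignature) -> float:
    d = pairwise(sig1, sig2)
    return sum(d) / len(d)

def dist_top_n(sig1: MultiSignature, sig2: MultiSignature, n: int) -> float:
    """Mean of the n smallest pairwise distances (assumed meaning of Top-N)."""
    d = pairwise(sig1, sig2)[:n]
    return sum(d) / len(d)

# Hypothetical signatures with two and three sub-signatures respectively.
sig1 = [[1, 1, 0], [1, 0, -1]]
sig2 = [[1, 1, 0], [-1, 0, 1], [0, 1, 1]]
```

With Min, two signatures are close as soon as any one explanation matches. Mean penalizes every mismatched pair, and Top-N sits between the two.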
29 Summary
- Showed that it is feasible to reduce diagnosis to an information retrieval problem
- Presented a methodology enabling
  - Systematic search over past diagnoses and solutions
  - Automatic retrieval of similar issues
- Key to success is finding a good representation of system/application state
  - Just indexing based on raw values is not good enough
  - Capturing information about the metrics' correlation with the SLO state is the key
- Synonym problem is one roadblock to signatures accurately representing system state