Title: Capturing, indexing and retrieving system history
1 Capturing, indexing and retrieving system history
- Ira Cohen, Moises Goldszmidt, Julie Symons, Terence Kelly - HP Labs
- Steve Zhang, Armando Fox - Stanford University
- "Those who cannot remember the past are condemned to repeat it." - George Santayana
2 Service performance management by Service Level Objectives (SLO)
3 Wouldn't it be great if...
- We could identify similar instances of a particular problem?
- And retrieve previous diagnoses/repairs?
4 Wouldn't it be great if...
- We knew whether these are different problems, and their intensity?
- We knew which are recurrent?
- We knew their characteristic symptoms (syndromes)?
5 Contribution of this work
- Reducing system performance diagnosis to an information retrieval framework by maintaining a long-term memory of past events and actions in a form that can be manipulated and searched by computers.
- Enabling
  - Identification of recurrent problems
  - Prioritization and syndrome identification
  - Finding similar problems in related systems
  - Leveraging previous repair/root-cause efforts
6 Main challenge
- Find a representation (signature) that captures the main characteristics of the system behavior and that is
  - Amenable to distance metrics
  - Generated automatically
  - In machine-readable form
  - Therefore requiring little human intervention
7 Representation of system state
- Many metrics are available on IT systems
- Assumption: they are sufficient to capture the essence of the problems
- Couldn't we just keep their raw values?
8 Signature construction approach
- Learn the relationship between the metrics and the SLO state: P(SLO, M)
- Based on our OSDI '04 and DSN '05 work
9 Signatures - example
- For a given SLO violation, the models provide a list of metrics that are attributed with the violation.
- In the signature, a metric has value 1 if it is attributed with the violation, -1 if it is not attributed, and 0 if it is not relevant; an encoding sketch follows below.
[Example table: metrics from a violation epoch with their attribution values]
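To make this encoding concrete, here is a minimal sketch in Python; the metric names and attribution sets are hypothetical examples, not the SLIC implementation:

```python
# Minimal sketch of the signature encoding described above.
# Metric names and attribution sets are hypothetical examples.
ALL_METRICS = ["cpu_util", "disk_io_wait", "net_retrans", "gbl_app_alive_proc"]

def build_signature(attributed, not_attributed):
    """1 = attributed to the violation, -1 = considered but not attributed,
    0 = not relevant (not used by any model)."""
    signature = {}
    for metric in ALL_METRICS:
        if metric in attributed:
            signature[metric] = 1
        elif metric in not_attributed:
            signature[metric] = -1
        else:
            signature[metric] = 0
    return signature

# Example: a violation where disk I/O wait is implicated.
sig = build_signature(attributed={"disk_io_wait"},
                      not_attributed={"cpu_util", "net_retrans"})
# {'cpu_util': -1, 'disk_io_wait': 1, 'net_retrans': -1, 'gbl_app_alive_proc': 0}
```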
10 Why do we need metric attribution?
- Take two instances of a known problem and a metric relevant to that problem
- The raw values are very different, but the behavior compared to normal is similar
[Plot: the "gbl app alive proc" metric over time for the two problem instances]
11 Creating and using signatures - leveraging prior diagnosis efforts
[Architecture diagram: metrics/SLO monitoring of the monitored service feeds a signature construction engine, which stores signatures in a signature DB; a retrieval engine and a clustering engine operate on the DB to identify intensity and recurrence and to provide syndromes; the admin provides annotations in free form]
12 Validation and testing of the approach
- Experimental testbed
  - Three-tier e-commerce site (Petstore)
  - Three types of induced performance problems
- Service FT
  - Geographically distributed, three-tier, mission-critical enterprise application
  - 6 weeks of data from all instances of the service
  - One diagnosed (recurring) performance problem in the 6-week period (annotated "Stuck Thread")
13 Service FT
14 Evaluating signature representations
- Retrieval
  - Perform tests with already-annotated problems; compute precision and recall of retrieval
  - Precision = C/A, Recall = C/B (see the sketch below)
  - Precision typically decreases as Recall increases
[Venn diagram: B = relevant signatures in the DB, A = retrieved signatures, C = correctly retrieved signatures]
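As a concrete illustration of these definitions (a sketch with made-up signature IDs, not the evaluation code used in the study):

```python
# Sketch of the precision/recall computation illustrated above.
def precision_recall(retrieved, relevant):
    """A = retrieved signatures, B = relevant signatures in the DB,
    C = their intersection (correctly retrieved)."""
    correct = retrieved & relevant                                    # C
    precision = len(correct) / len(retrieved) if retrieved else 0.0  # C/A
    recall = len(correct) / len(relevant) if relevant else 0.0       # C/B
    return precision, recall

# Hypothetical signature IDs annotated "Stuck Thread".
retrieved = {101, 102, 103, 107}
relevant = {101, 103, 104, 105, 106}
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```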
15 Precision-Recall graph
[Plot: precision-recall curves for retrieval of the "Stuck Thread" problem]
Conclusion: Retrieval with metric attribution produces significantly better results
16 Evaluating signature representations
- Grouping similar problems (clustering)
  - A good result is one where all members of each cluster share a single annotation (pure clusters)
    - Annotation of the performance problems in the case of Petstore
    - SLO violation/compliance in the case of Service FT
  - We compute the entropy of the clusters with respect to the annotations => low entropy means a good result (see the sketch below)
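A minimal sketch of this purity measure; the clusters and annotations are hypothetical, and the size-weighted averaging across clusters is an assumption for illustration:

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy (in bits) of the annotations within one cluster; 0 = pure."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def clustering_entropy(clusters):
    """Size-weighted average entropy over all clusters (lower is better)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# Hypothetical clustering of annotated signatures.
clusters = [
    ["Stuck Thread", "Stuck Thread", "Stuck Thread"],  # pure cluster -> entropy 0
    ["DB overload", "Stuck Thread"],                   # mixed cluster -> entropy 1
]
print(clustering_entropy(clusters))  # 0.4
```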
17 Entropy of clustering
[Plot: entropy of clustering in Petstore]
Conclusion: Clustering with metric attribution produces significantly better results
18 Identifying intensity and recurrent problems
[Plot: 6-week view of one of the app server instances of Service FT]
19 Syndrome identification of problems
The attributed metrics identify the symptoms of the problems
20 Leveraging signatures across data-centers
What happened in Asia-Pacific during the failover from the Americas on Dec 18?
21 Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur during the failover?
[Plot: signatures over time, with the failover period marked]
22 Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur in Asia-Pacific during the failover?
No! The signatures are very different: the symptoms are not related to the app server. The root cause is the database not being primed for a typically unseen type of transaction.
23 Related work
- Signatures and retrieval for other domain problems
  - Virus and intrusion detection, e.g., Kephart et al. 1994
  - Codebooks for fault isolation, e.g., SMARTS.com
  - Using computers to diagnose computers, Redstone et al., HotOS '03
- Diagnosis and debugging of performance problems
  - Project 5, Aguilera et al., SOSP '03
  - Magpie, Barham et al., OSDI '04
  - Pinpoint, Chen et al., DSN '02
  - SLIC, OSDI '04, DSN '05
24 Discussion and future work
- Can the signatures be generalized and abstracted to be applicable across different systems/applications?
- What changes are the signatures resilient to?
- How do we handle and represent signatures with metrics that are synonyms of each other?
- How can we leverage the many non-annotated signatures together with the few annotated ones to achieve more accurate retrieval? (semi-supervised learning)
- How do we interact with users?
  - For displaying results
  - For obtaining annotations of signatures
  - For the most relevant feedback, so the system can learn faster (active learning)
- How should we (and should we) incorporate time in the construction of signatures?
25 Summary
- Showed that it is feasible to reduce diagnosis to an information retrieval problem
- Presented a methodology enabling
  - Systematic search over past diagnoses and solutions
  - Automatic retrieval of similar issues
  - Identification of recurrent problems
  - Leveraging knowledge from other IT infrastructure
- The key to success is finding a good representation of system/application state
  - Just indexing based on raw values is not good enough
  - Capturing information about the metrics' correlation with the SLO state is the key
http://www.hpl.hp.com/research/slic
26 Backup slides from here
27 Performance of the approach
- On the HP Labs Matlab implementation
  - One month of data required 30 models
  - Inducing a new model takes 3-5 seconds
  - Inference and update take 1 msec
  - The whole operation is real-time with respect to the 5-minute epoch
- Ensemble accuracy in predicting the SLO state is about 90%
- Model = classifier + probability distribution (a Bayesian network)
  - About 6-10 parameters (real numbers) per metric
  - A handful of metrics (3-5) per model
28 The model: a classifier F(M) → SLO state
- Consists of two parts
  - A probabilistic model of <M, SLO>
  - A decision function on what is more likely: P(s- | M) > P(s+ | M)
- Bayesian networks for representing P
  - Interpretability (attribution): P(m_i | pa(m_i), s-) > P(m_i | pa(m_i), s+)
  - TAN [Friedman, Geiger, Goldszmidt 96-97]
  - Efficiency of representation and computation
  - Can fuse expert knowledge with statistical data
- Measures of success
  - Percentage of patterns captured by the model
  - Balanced accuracy: BA = 0.5 * [P(F(M) = s- | s-) + P(F(M) = s+ | s+)] (see the sketch below)
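To make the decision rule and the balanced-accuracy measure concrete, here is a minimal sketch that substitutes a Gaussian naive Bayes model for the TAN structure used in the papers; it runs on synthetic data and is a simplification, not the SLIC classifier:

```python
import numpy as np

def fit(M, slo):
    """M: (n, d) metric matrix; slo: array of 0 (s+, compliance) / 1 (s-, violation)."""
    params = {}
    for s in (0, 1):
        X = M[slo == s]
        params[s] = (X.mean(axis=0), X.std(axis=0) + 1e-6, len(X) / len(slo))
    return params

def log_joint(params, x, s):
    mu, sd, prior = params[s]
    ll = -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)  # per-metric Gaussian log-likelihood
    return np.log(prior) + ll.sum()

def F(params, x):
    # Decision function: predict violation (s-) iff P(s- | M) > P(s+ | M)
    return int(log_joint(params, x, 1) > log_joint(params, x, 0))

def balanced_accuracy(params, M, slo):
    pred = np.array([F(params, x) for x in M])
    detect_violation = (pred[slo == 1] == 1).mean()   # P(F(M) = s- | s-)
    detect_compliance = (pred[slo == 0] == 0).mean()  # P(F(M) = s+ | s+)
    return 0.5 * (detect_violation + detect_compliance)

# Synthetic data: compliant epochs around 0, violating epochs shifted.
rng = np.random.default_rng(0)
M = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(2, 1, (200, 3))])
slo = np.array([0] * 200 + [1] * 200)
print(balanced_accuracy(fit(M, slo), M, slo))
```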
29 Adapting to change
- Change is frequent (e.g. workload, sw/hw upgrades, configuration, etc.)
- Data is continuously collected as a system runs
- Building a model early on and leaving it unaltered doesn't work
  - Models built using data collected under one condition do not work well on data collected under a different condition
- Possible techniques
  - Rebuild the model periodically with all available data (all-encompassing model)
    - Requires a more complex model to capture multiple behaviors
  - Rebuild the model periodically with the most recent data (single adaptive model)
    - Past behaviors are lost
  - Induce new models periodically and add to the ensemble of models only if necessary
    - Must combine information from multiple models
30 Ensemble algorithm outline
- The size of the window is a parameter
  - Controls the rate of adaptation
  - Larger windows produce better models (up to a point)
  - Typically 6-12 hrs of data
[Flowchart: a new sample arrives every 1-5 minutes; inducing a model takes 2-3 secs; inference takes 1 ms per model, a few ms for hundreds of models; about 20 models per month] (see the sketch below)
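A minimal sketch of the windowed ensemble-update loop this outline describes; the helper names (induce_model, brier_score) and the acceptance threshold are assumptions for illustration, not the published algorithm's exact rules:

```python
from collections import deque

WINDOW = 72        # e.g. 6 hrs of 5-minute epochs (window size is the tuning parameter)
THRESHOLD = 0.2    # hypothetical: worst acceptable Brier score on the recent window

class Ensemble:
    def __init__(self, induce_model, brier_score):
        self.models = []                       # ensemble of models, grown over time
        self.window = deque(maxlen=WINDOW)     # most recent <metrics, SLO state> samples
        self.induce_model = induce_model       # callable: samples -> new model
        self.brier_score = brier_score         # callable: (model, samples) -> score (lower = better)

    def update(self, sample):
        """Called once per monitoring epoch (every 1-5 minutes)."""
        self.window.append(sample)
        if len(self.window) < WINDOW:
            return
        # Add a new model only if no existing model explains the recent window well enough.
        scores = [self.brier_score(m, self.window) for m in self.models]
        if not scores or min(scores) > THRESHOLD:
            self.models.append(self.induce_model(list(self.window)))
```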
31 Metric attribution with an ensemble
- For all models that predict the SLO state correctly:
  - Score all models on the past window (Brier score)
  - For each model, perform metric attribution
    - attr_i = 1 if P(m_i | s-) > P(m_i | s+)
    - attr_i = -1 otherwise
  - For each metric, choose the attribution result from the model with the best score
- For all other metrics, store the value 0 (see the sketch below)
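A sketch of this attribution-merging rule; it assumes each model exposes hypothetical predicts_correctly, brier_score, and attribution helpers, so it illustrates the control flow rather than the exact SLIC code:

```python
def ensemble_attribution(models, sample, window, all_metrics):
    """Combine per-model attributions into one signature vector."""
    signature = {m: 0 for m in all_metrics}      # default: metric not relevant
    best_score = {}                              # best (lowest) Brier score seen per metric
    for model in models:
        if not model.predicts_correctly(sample): # use only models that get the SLO state right
            continue
        score = model.brier_score(window)        # scored on the recent window
        # attribution() returns {metric: +1 if P(m_i | s-) > P(m_i | s+), else -1}
        for metric, attr in model.attribution(sample).items():
            if metric not in best_score or score < best_score[metric]:
                best_score[metric] = score
                signature[metric] = attr
    return signature
```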