Title: Capturing, indexing and retrieving system history
1 Capturing, indexing and retrieving system history
- Ira Cohen, Moises Goldszmidt, Julie Symons, Terence Kelly - HP Labs
- Steve Zhang, Armando Fox - Stanford University
- "Those who cannot remember the past are condemned to repeat it." - George Santayana
2 Service performance management by Service Level Objectives (SLO)
3 Wouldn't it be great if...
- We could identify similar instances of a particular problem?
- And retrieve previous diagnoses/repairs?
4 Wouldn't it be great if...
- We knew whether these are different problems, and their intensity?
- We knew which are recurrent?
- We knew their characteristic symptoms (syndromes)?
5 Contribution of this work
- Reducing system performance diagnosis to an information retrieval framework by maintaining a long-term memory of past events and actions in a form that can be manipulated and searched by computers.
- Enabling
  - Identification of recurrent problems
  - Prioritization and syndrome identification
  - Finding similar problems in related systems
  - Leveraging previous repair/root-cause efforts
6 Main challenge
- Find a representation (signature) that captures the main characteristics of the system behavior and that is
  - Amenable to distance metrics
  - Generated automatically
  - In machine-readable form
  - Therefore requiring little human intervention
7 Representation of system state
- Many metrics are available on IT systems
- Assumption: they are sufficient to capture the essence of the problems
- Couldn't we just keep their raw values?
8 Signature construction approach
- Learn the relationship between the metrics and the SLO state: P(SLO, M)
- Based on our OSDI '04 and DSN '05 work
9 Signatures - example
- For a given SLO violation, the models provide a list of metrics that are attributed with the violation.
- In the signature, a metric has value 1 if it is attributed with the violation, -1 if it is not attributed, and 0 if it is not relevant; an encoding sketch follows below.
[Example table: metrics from a violation epoch with their attribution values]
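To make this encoding concrete, here is a minimal sketch in Python; the metric names and attribution sets are hypothetical examples, not the SLIC implementation:

```python
# Minimal sketch of the signature encoding described above.
# Metric names and attribution sets are hypothetical examples.
ALL_METRICS = ["cpu_util", "disk_io_wait", "net_retrans", "gbl_app_alive_proc"]

def build_signature(attributed, not_attributed):
    """1 = attributed to the violation, -1 = considered but not attributed,
    0 = not relevant (not used by any model)."""
    signature = {}
    for metric in ALL_METRICS:
        if metric in attributed:
            signature[metric] = 1
        elif metric in not_attributed:
            signature[metric] = -1
        else:
            signature[metric] = 0
    return signature

# Example: a violation where disk I/O wait is implicated.
sig = build_signature(attributed={"disk_io_wait"},
                      not_attributed={"cpu_util", "net_retrans"})
# {'cpu_util': -1, 'disk_io_wait': 1, 'net_retrans': -1, 'gbl_app_alive_proc': 0}
```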
10 Why do we need metric attribution?
- Take two instances of a known problem and a metric relevant to that problem
- The raw values are very different, but the behavior compared to normal is similar
[Plot: the "gbl app alive proc" metric over time for the two problem instances]
11 Creating and using signatures - leveraging prior diagnosis efforts
[Architecture diagram: metrics/SLO monitoring of the monitored service feeds a signature construction engine, which stores signatures in a signature DB; a retrieval engine and a clustering engine operate on the DB to identify intensity and recurrence and to provide syndromes; the admin provides annotations in free form]
12 Validation and testing of the approach
- Experimental testbed
  - Three-tier e-commerce site (Petstore)
  - Three types of induced performance problems
- Service FT
  - Geographically distributed, three-tier, mission-critical enterprise application
  - 6 weeks of data from all instances of the service
  - One diagnosed (recurring) performance problem in the 6-week period (annotated "Stuck Thread")
13 Service FT
14 Evaluating signature representations
- Retrieval
  - Perform tests with already-annotated problems; compute precision and recall of retrieval
  - Precision = C/A, Recall = C/B (see the sketch below)
  - Precision typically decreases as Recall increases
[Venn diagram: B = relevant signatures in the DB, A = retrieved signatures, C = correctly retrieved signatures]
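As a concrete illustration of these definitions (a sketch with made-up signature IDs, not the evaluation code used in the study):

```python
# Sketch of the precision/recall computation illustrated above.
def precision_recall(retrieved, relevant):
    """A = retrieved signatures, B = relevant signatures in the DB,
    C = their intersection (correctly retrieved)."""
    correct = retrieved & relevant                                    # C
    precision = len(correct) / len(retrieved) if retrieved else 0.0  # C/A
    recall = len(correct) / len(relevant) if relevant else 0.0       # C/B
    return precision, recall

# Hypothetical signature IDs annotated "Stuck Thread".
retrieved = {101, 102, 103, 107}
relevant = {101, 103, 104, 105, 106}
print(precision_recall(retrieved, relevant))  # (0.5, 0.4)
```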
15 Precision-Recall graph
[Plot: precision-recall curves for retrieval of the "Stuck Thread" problem]
Conclusion: Retrieval with metric attribution produces significantly better results
16 Evaluating signature representations
- Grouping similar problems (clustering)
  - A good result is one where all members of each cluster share a single annotation (pure clusters)
    - Annotation of the performance problems in the case of Petstore
    - SLO violation/compliance in the case of Service FT
  - We compute the entropy of the clusters with respect to the annotations => low entropy means a good result (see the sketch below)
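A minimal sketch of this purity measure; the clusters and annotations are hypothetical, and the size-weighted averaging across clusters is an assumption for illustration:

```python
import math
from collections import Counter

def cluster_entropy(labels):
    """Entropy (in bits) of the annotations within one cluster; 0 = pure."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def clustering_entropy(clusters):
    """Size-weighted average entropy over all clusters (lower is better)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)

# Hypothetical clustering of annotated signatures.
clusters = [
    ["Stuck Thread", "Stuck Thread", "Stuck Thread"],  # pure cluster -> entropy 0
    ["DB overload", "Stuck Thread"],                   # mixed cluster -> entropy 1
]
print(clustering_entropy(clusters))  # 0.4
```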
17 Entropy of clustering
[Plot: entropy of clustering in Petstore]
Conclusion: Clustering with metric attribution produces significantly better results
18 Identifying intensity and recurrent problems
[Plot: 6-week view of one of the app server instances of Service FT]
19 Syndrome identification of problems
The attributed metrics identify the symptoms of the problems
20 Leveraging signatures across data-centers
What happened in Asia-Pacific during the failover from the Americas on Dec 18?
21 Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur during the failover?
[Plot: signatures over time, with the failover period marked]
22 Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur in Asia-Pacific during the failover?
No! The signatures are very different: the symptoms are not related to the app server. The root cause is the database not being primed for a typically unseen type of transaction.
23 Related work
- Signatures and retrieval for other domain problems
  - Virus and intrusion detection, e.g., Kephart et al. 1994
  - Codebooks for fault isolation, e.g., SMARTS.com
  - Using computers to diagnose computers, Redstone et al., HotOS '03
- Diagnosis and debugging of performance problems
  - Project 5, Aguilera et al., SOSP '03
  - Magpie, Barham et al., OSDI '04
  - Pinpoint, Chen et al., DSN '02
  - SLIC, OSDI '04, DSN '05
24 Discussion and future work
- Can the signatures be generalized and abstracted to be applicable across different systems/applications?
- What changes are the signatures resilient to?
- How do we handle and represent signatures with metrics that are synonyms of each other?
- How can we leverage the many non-annotated signatures together with the few annotated ones to achieve more accurate retrieval? (semi-supervised learning)
- How do we interact with users?
  - For displaying results
  - For obtaining annotations of signatures
  - For the most relevant feedback, so the system can learn faster (active learning)
- How should we (and should we) incorporate time in the construction of signatures?
25 Summary
- Showed that it is feasible to reduce diagnosis to an information retrieval problem
- Presented a methodology enabling
  - Systematic search over past diagnoses and solutions
  - Automatic retrieval of similar issues
  - Identification of recurrent problems
  - Leveraging knowledge from other IT infrastructure
- The key to success is finding a good representation of system/application state
  - Just indexing based on raw values is not good enough
  - Capturing information about the metrics' correlation with the SLO state is the key
http://www.hpl.hp.com/research/slic
26 Backup slides from here
27 Performance of the approach
- On the HP Labs Matlab implementation
  - One month of data required 30 models
  - Inducing a new model takes 3-5 seconds
  - Inference and update take 1 msec
  - The whole operation is real-time with respect to the 5-minute epoch
- Ensemble accuracy in predicting the SLO state is about 90%
- Model = classifier + probability distribution (a Bayesian network)
  - About 6-10 parameters (real numbers) per metric
  - A handful of metrics (3-5) per model
28 The model: a classifier F(M) → SLO state
- Consists of two parts
  - A probabilistic model of <M, SLO>
  - A decision function on what is more likely: P(s- | M) > P(s+ | M)
- Bayesian networks for representing P
  - Interpretability (attribution): P(m_i | pa(m_i), s-) > P(m_i | pa(m_i), s+)
  - TAN [Friedman, Geiger, Goldszmidt 96-97]
  - Efficiency of representation and computation
  - Can fuse expert knowledge with statistical data
- Measures of success
  - Percentage of patterns captured by the model
  - Balanced accuracy: BA = 0.5 * [P(F(M) = s- | s-) + P(F(M) = s+ | s+)] (see the sketch below)
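To make the decision rule and the balanced-accuracy measure concrete, here is a minimal sketch that substitutes a Gaussian naive Bayes model for the TAN structure used in the papers; it runs on synthetic data and is a simplification, not the SLIC classifier:

```python
import numpy as np

def fit(M, slo):
    """M: (n, d) metric matrix; slo: array of 0 (s+, compliance) / 1 (s-, violation)."""
    params = {}
    for s in (0, 1):
        X = M[slo == s]
        params[s] = (X.mean(axis=0), X.std(axis=0) + 1e-6, len(X) / len(slo))
    return params

def log_joint(params, x, s):
    mu, sd, prior = params[s]
    ll = -0.5 * ((x - mu) / sd) ** 2 - np.log(sd)  # per-metric Gaussian log-likelihood
    return np.log(prior) + ll.sum()

def F(params, x):
    # Decision function: predict violation (s-) iff P(s- | M) > P(s+ | M)
    return int(log_joint(params, x, 1) > log_joint(params, x, 0))

def balanced_accuracy(params, M, slo):
    pred = np.array([F(params, x) for x in M])
    detect_violation = (pred[slo == 1] == 1).mean()   # P(F(M) = s- | s-)
    detect_compliance = (pred[slo == 0] == 0).mean()  # P(F(M) = s+ | s+)
    return 0.5 * (detect_violation + detect_compliance)

# Synthetic data: compliant epochs around 0, violating epochs shifted.
rng = np.random.default_rng(0)
M = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(2, 1, (200, 3))])
slo = np.array([0] * 200 + [1] * 200)
print(balanced_accuracy(fit(M, slo), M, slo))
```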
29 Adapting to change
- Change is frequent (e.g. workload, sw/hw upgrades, configuration, etc.)
- Data is continuously collected as a system runs
- Building a model early on and leaving it unaltered doesn't work
  - Models built using data collected under one condition do not work well on data collected under a different condition
- Possible techniques
  - Rebuild the model periodically with all available data (all-encompassing model)
    - Requires a more complex model to capture multiple behaviors
  - Rebuild the model periodically with the most recent data (single adaptive model)
    - Past behaviors are lost
  - Induce new models periodically and add to the ensemble of models only if necessary
    - Must combine information from multiple models
30 Ensemble algorithm outline
- The size of the window is a parameter
  - Controls the rate of adaptation
  - Larger windows produce better models (up to a point)
  - Typically 6-12 hrs of data
[Flowchart: a new sample arrives every 1-5 minutes; inducing a model takes 2-3 secs; inference takes 1 ms per model, a few ms for hundreds of models; about 20 models per month] (see the sketch below)
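A minimal sketch of the windowed ensemble-update loop this outline describes; the helper names (induce_model, brier_score) and the acceptance threshold are assumptions for illustration, not the published algorithm's exact rules:

```python
from collections import deque

WINDOW = 72        # e.g. 6 hrs of 5-minute epochs (window size is the tuning parameter)
THRESHOLD = 0.2    # hypothetical: worst acceptable Brier score on the recent window

class Ensemble:
    def __init__(self, induce_model, brier_score):
        self.models = []                       # ensemble of models, grown over time
        self.window = deque(maxlen=WINDOW)     # most recent <metrics, SLO state> samples
        self.induce_model = induce_model       # callable: samples -> new model
        self.brier_score = brier_score         # callable: (model, samples) -> score (lower = better)

    def update(self, sample):
        """Called once per monitoring epoch (every 1-5 minutes)."""
        self.window.append(sample)
        if len(self.window) < WINDOW:
            return
        # Add a new model only if no existing model explains the recent window well enough.
        scores = [self.brier_score(m, self.window) for m in self.models]
        if not scores or min(scores) > THRESHOLD:
            self.models.append(self.induce_model(list(self.window)))
```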
31 Metric attribution with an ensemble
- For all models that predict the SLO state correctly:
  - Score all models on the past window (Brier score)
  - For each model, perform metric attribution
    - attr_i = 1 if P(m_i | s-) > P(m_i | s+)
    - attr_i = -1 otherwise
  - For each metric, choose the attribution result from the model with the best score
- For all other metrics, store the value 0 (see the sketch below)
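A sketch of this attribution-merging rule; it assumes each model exposes hypothetical predicts_correctly, brier_score, and attribution helpers, so it illustrates the control flow rather than the exact SLIC code:

```python
def ensemble_attribution(models, sample, window, all_metrics):
    """Combine per-model attributions into one signature vector."""
    signature = {m: 0 for m in all_metrics}      # default: metric not relevant
    best_score = {}                              # best (lowest) Brier score seen per metric
    for model in models:
        if not model.predicts_correctly(sample): # use only models that get the SLO state right
            continue
        score = model.brier_score(window)        # scored on the recent window
        # attribution() returns {metric: +1 if P(m_i | s-) > P(m_i | s+), else -1}
        for metric, attr in model.attribution(sample).items():
            if metric not in best_score or score < best_score[metric]:
                best_score[metric] = score
                signature[metric] = attr
    return signature
```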