Title: Capturing, indexing and retrieving system history

1
Capturing, indexing and retrieving system history

Steve Zhang, Armando Fox - Stanford University
Ira Cohen, Moises Goldszmidt, Julie Symons,
Terence Kelly - HP Labs

2
Outline

Goals: what problems are we trying to solve?
Our approach
Validation and testing
Results
Work-in-Progress
Improving our approach
Hurdles
Initial observations
Future direction
Summary

3
Service performance management by Service Level
Objectives (SLO)
[Figure; visible label: Unhealthy]
4
Wouldn't it be great if...

We could identify similar instances of a
particular problem?
And retrieve previous diagnosis/repairs?

5
Wouldn't it be great if...

We knew how many different problems there are,
and their intensities?
We knew their characteristic symptoms (syndrome)?

6
Contribution of this work

Reducing system performance diagnosis to an
information retrieval problem by
maintaining a long-term memory of past events
and actions in a form that can be manipulated and
searched by computers.
Enabling:
Identification of different problems and their
intensities
(helps operators prioritize)
Finding similar problems in the past or in related
systems
(leverages previous repair/root-cause efforts)
Syndrome identification

7
Main Challenge

Find a representation (signature) that captures
the main characteristics of the system behavior
that is
Amenable to distance metrics
Generated automatically
In Machine readable form
Therefore requiring little human intervention

8
Representation of system state

Many metrics available on IT-systems
Assumption: they are sufficient to capture the
essence of the problems
Couldn't we just keep their raw values?

9
Signature construction approach

Learn the relationship between the metrics and
the SLO state
Based on our OSDI '04 and DSN '05 work

P(SLO, M)
10
Signatures - example

For a given SLO violation, the models provide a
list of metrics that are attributed to the
violation.
A metric has value 1 if it is attributed to the
violation, -1 if it is not attributed, and 0 if it
is not relevant, e.g.

[Table residue; column header: Attribution]
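
The encoding described on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the metric names are hypothetical, "relevant" is taken to mean "selected by the model", and the function name and input layout (two sets of metric names) are not from the presentation.

```python
def build_signature(all_metrics, relevant, attributed):
    """Encode one SLO violation as a vector over all_metrics.

    relevant   -- metrics the model selected as relevant (assumed input)
    attributed -- subset of relevant metrics attributed to the violation
    """
    signature = []
    for metric in all_metrics:
        if metric not in relevant:
            signature.append(0)    # not relevant
        elif metric in attributed:
            signature.append(1)    # attributed to the violation
        else:
            signature.append(-1)   # relevant, but not attributed
    return signature


# Hypothetical metric names, for illustration only.
metrics = ["cpu_util", "mem_util", "disk_io", "net_out"]
sig = build_signature(
    metrics,
    relevant={"cpu_util", "mem_util", "net_out"},
    attributed={"cpu_util", "net_out"},
)
print(sig)  # [1, -1, 0, 1]
```

Because every signature has the same length and a fixed metric order, two signatures can be compared directly with a standard vector distance.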
11
Why do we need metric attribution?

Take two instances of a known problem and a
relevant metric to that problem
Values are very different, but behavior compared
to normal is similar

[Plot: Gbl app alive proc vs. Time]
12
Creating and using signatures
Leveraging prior diagnosis efforts
Monitored service
Retrieval engine
Signature DB
Signature construction engine
Metrics/SLO Monitoring
Provides annotations in free form
- Identifies intensity
- Identifies recurrence
- Provides syndromes
Clustering engine
Admin
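
As a rough sketch of how a retrieval engine could query a signature database, the following Python ranks stored signatures by distance to a query signature. The L2 distance, the dictionary layout of the database entries, and the function names are assumptions made for illustration; the presentation does not specify them.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length signature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def retrieve(query, signature_db, k=2):
    """Return the k stored entries whose signatures are closest to query."""
    ranked = sorted(
        signature_db,
        key=lambda entry: l2_distance(query, entry["signature"]),
    )
    return ranked[:k]


# Hypothetical database: each entry pairs a signature with a free-form
# annotation (None when nobody has diagnosed that violation yet).
db = [
    {"signature": [1, -1, 0, 1], "annotation": "Stuck Thread"},
    {"signature": [1, -1, 0, -1], "annotation": "Stuck Thread"},
    {"signature": [-1, 1, 1, 0], "annotation": None},
]
hits = retrieve([1, -1, 0, 1], db, k=2)
print([h["annotation"] for h in hits])  # ['Stuck Thread', 'Stuck Thread']
```

The annotations returned with the nearest signatures are what lets an operator reuse an earlier diagnosis.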
13
Validation and testing of approach

Experimental testbed
Three-tier e-commerce site (Petstore)
Three types of induced performance problems
Service FT
Geographically distributed three-tier
mission-critical enterprise application
6 weeks of data from all instances of the
service.
One diagnosed (recurring) performance problem in
the 6-week period (annotated "Stuck Thread")

14
Service FT
15
Evaluating signature representations

Retrieval
Perform tests with already annotated problems,
compute precision and recall of retrieval
Precision = C/A, Recall = C/B
Precision typically decreases as Recall increases

A: retrieved signatures
B: relevant signatures in DB
C: correctly retrieved signatures
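
The two ratios can be computed as in the short Python sketch below, where A is the set of retrieved signatures, B the set of relevant signatures in the database, and C their intersection. The integer signature IDs and the function name are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one retrieval query."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant                                    # C
    precision = len(correct) / len(retrieved) if retrieved else 0.0   # C / A
    recall = len(correct) / len(relevant) if relevant else 0.0        # C / B
    return precision, recall


# Hypothetical signature IDs: 4 retrieved, 5 relevant, 2 in common.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 6, 8, 10])
print(p, r)  # 0.5 0.4
```

Retrieving more signatures can only keep or raise recall, but it usually admits more irrelevant ones, which is why precision tends to fall as recall rises.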
16
Precision-Recall graph
Retrieval of "Stuck Thread" problem
Conclusion: retrieval with metric attribution
produces significantly better results
17
Evaluating signature representations

Grouping similar problems (Clustering)
A good result is one where all members of each
cluster have a single annotation (pure)
Annotation: the performance problem, in the case
of Petstore
We compute the entropy of the clusters with
respect to the annotation
Low entropy means a good result
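
One way to compute such a score is sketched below in Python. It assumes each cluster is given as a list of annotations and that per-cluster entropies are combined as a size-weighted mean; the weighting, the function names, and the example labels are assumptions, not details from the presentation.

```python
import math
from collections import Counter


def cluster_entropy(labels):
    """Shannon entropy (in bits) of the annotations inside one cluster."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def clustering_entropy(clusters):
    """Size-weighted mean entropy over all clusters (assumed weighting)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters)


# Hypothetical annotations: every cluster is pure in the first clustering,
# and every cluster is an even mix of two problems in the second.
pure = [["cpu", "cpu"], ["db", "db"]]
mixed = [["cpu", "db"], ["cpu", "db"]]
```

With these inputs, `clustering_entropy(pure)` evaluates to zero and `clustering_entropy(mixed)` to one bit, matching the rule that lower entropy indicates purer clusters.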

18
Entropy of clustering
Entropy of clustering in Petstore
Conclusion: clustering with metric attribution
produces significantly better results
19
Identifying intensity and recurrent problems
6-week view of one of the app server instances
of Service FT
20
Syndrome identification of problems
Attributed metrics identify the symptoms of the
problems
21
Leveraging signatures across data-centers
What happened in Asia-Pacific during the failover
from Americas on Dec 18?
22
Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur during the
failover?
Failover period
23
Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur in AP during
failover?
NO!
Signatures are very different
Symptoms are not related to the app server
Root cause is related to the database not being
primed for a typically unseen type of transaction
24
Improving Signatures

Ideally
Different root causes should have different
signatures
Similar root causes should have similar
signatures
Realistically
Different syndromes should have different
signatures
Similar syndromes should have similar signatures
Relationship between root causes and syndromes is
hidden
Need domain-specific knowledge
Can't control it directly anyway
Signature construction method can control how
signatures map to syndromes

25
The Synonym Problem

Two or more different metrics or sets of metrics
that all correlate equally well with SLO state
over a specific time period
Excludes cases where the correlation is very weak
Can cause nearly identical syndromes mapping to
multiple distinct signatures.
Negatively impacts retrieval and clustering
Different signatures no longer imply different
syndromes
Several different causes of synonyms
Static vs. dynamic synonyms

26
Static Synonyms

When different sets of metrics always correlate
equally well with SLO state
Caused by
One underlying metric with multiple names
Metrics that are linear transformations of one
another
Solution
Statically find these synonyms and consistently
choose one metric or set of metrics to use from
each synonym group
May be difficult if comparing different systems
under different administrative control

gbl_cpu_total_util
gbl_cpu_total_time
app_cpu_total_util
app_cpu_total_time
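
The proposed solution can be sketched in Python as a fixed mapping from every metric name to one representative per synonym group. The grouping of the four metric names above into two groups is an assumption made for illustration, as are the function names and the rule of picking the alphabetically first name.

```python
# Assumed synonym groups (illustrative only): the util and time variants
# of each counter are treated as linear transformations of one another.
SYNONYM_GROUPS = [
    ["gbl_cpu_total_util", "gbl_cpu_total_time"],
    ["app_cpu_total_util", "app_cpu_total_time"],
]


def build_canonical_map(groups):
    """Map every metric in a group to one consistently chosen name."""
    canonical = {}
    for group in groups:
        representative = sorted(group)[0]  # deterministic choice
        for name in group:
            canonical[name] = representative
    return canonical


def canonicalize(metric_names, canonical):
    """Replace synonyms by their representative, dropping duplicates."""
    seen, result = set(), []
    for name in metric_names:
        rep = canonical.get(name, name)  # unknown metrics pass through
        if rep not in seen:
            seen.add(rep)
            result.append(rep)
    return result


cmap = build_canonical_map(SYNONYM_GROUPS)
selected = canonicalize(
    ["gbl_cpu_total_util", "app_cpu_total_time", "gbl_cpu_total_time"], cmap
)
print(selected)  # ['gbl_cpu_total_time', 'app_cpu_total_time']
```

The choice of representative matters less than its consistency: two systems only produce comparable signatures if they apply the same mapping, which is the difficulty noted for systems under different administrative control.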
27
Dynamic Synonyms

When different metrics or sets of metrics only
sometimes correlate equally well with SLO state
Caused by
One metric being sometimes driven by another
A root cause manifesting itself in multiple ways
Multiple root causes coinciding by chance
Solutions
Trying to pick one metric or set of metrics to
use from a synonym group will not work
Flat vector signature may be inadequate

AppSrv CPU Hog
AppSrv CPU High
AppSrv Net Out Low
28
New Signature Format

Let each signature consist of multiple
sub-signatures
Includes as many possible explanations as can
be found
Vary learning algorithm parameters to increase
signature coverage
Retrieval and clustering more complicated
Distance metric can be some combination of the
distances between all pairs of sub-signatures
(mean, min, top-N, etc.)

[Figure: signatures Sig1 and Sig2, with sub-signatures labeled A through I]
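
A possible form of such a combined distance is sketched below in Python. It treats a signature as a list of sub-signature vectors and combines the L2 distances of all cross pairs. The L2 choice, the function names, and the reading of "Top-N" as the mean of the N smallest pair distances are assumptions, since the presentation leaves these open.

```python
import math
from itertools import product


def l2(a, b):
    """Euclidean distance between two equal-length sub-signatures."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def signature_distance(sig1, sig2, combine="mean", top_n=2):
    """Combine the distances of all sub-signature pairs of sig1 and sig2."""
    pair_dists = sorted(l2(a, b) for a, b in product(sig1, sig2))
    if combine == "min":
        return pair_dists[0]
    if combine == "top_n":
        best = pair_dists[:top_n]          # the N closest pairs (assumed)
        return sum(best) / len(best)
    if combine == "mean":
        return sum(pair_dists) / len(pair_dists)
    raise ValueError(f"unknown combine rule: {combine}")


# Hypothetical signatures: sig_a has two sub-signatures, sig_b has one.
sig_a = [[1, 0], [0, 1]]
sig_b = [[1, 0]]
d_min = signature_distance(sig_a, sig_b, combine="min")
d_mean = signature_distance(sig_a, sig_b, combine="mean")
```

Here `d_min` is zero because one sub-signature of `sig_a` matches `sig_b` exactly, while `d_mean` also counts the non-matching pair. The min rule therefore rewards any shared explanation, and the mean rule penalizes signatures that also carry unrelated ones.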
29
Summary

Showed that it is feasible to reduce diagnosis to
an information retrieval problem
Presented methodology enabling
Systematic search over past diagnosis and
solutions
Automatic retrieval of similar issues
Key to success is in finding a good
representation of system/application state
Just indexing based on raw values is not good
enough
Capturing information about the metrics'
correlation with SLO state is the key.
The synonym problem is one roadblock to signatures
accurately representing system state