Capturing, indexing and retrieving system history

1
Capturing, indexing and retrieving system history
  • Steve Zhang, Armando Fox - Stanford University
  • Ira Cohen, Moises Goldszmidt, Julie Symons,
    Terence Kelly - HP Labs

2
Outline
  • Goals: what problems are we trying to solve?
  • Our approach
  • Validation and testing
  • Results
  • Work-in-Progress
  • Improving our approach
  • Hurdles
  • Initial observations
  • Future direction
  • Summary

3
Service performance management by Service Level
Objectives (SLO)
[Figure: one label reads "Unhealthy"]

4
Wouldn't it be great if...
  • We could identify similar instances of a
    particular problem?
  • And retrieve previous diagnosis/repairs?

5
Wouldn't it be great if...
  • We knew how many different problems there are,
    and their intensities?
  • We knew their characteristic symptoms (syndrome)?

6
Contribution of this work
  • Reducing system performance diagnosis to an
    information retrieval framework by
      • Maintaining a long-term memory of past events
        and actions in a form that can be manipulated
        and searched by computers.
  • Enabling
      • Identification of different problems and their
        intensities
          • Helps operator prioritization
      • Finding similar problems in the past or in
        related systems
          • Leverages previous repair/root cause efforts.
      • Syndrome identification

7
Main Challenge
  • Find a representation (signature) that captures
    the main characteristics of the system behavior
    and that is
      • Amenable to distance metrics
      • Generated automatically
      • In machine-readable form
  • Therefore requiring little human intervention

8
Representation of system state
  • Many metrics available on IT-systems
  • Assumption: they are sufficient to capture the
    essence of the problems
  • Couldn't we just keep their raw values?

9
Signature construction approach
  • Learn the relationship between the metrics and
    the SLO state
  • Based on our OSDI'04 and DSN'05 work

[Figure: model relating SLO state and metrics, P(SLO, M)]

10
Signatures - example
  • For a given SLO violation, the models provide a
    list of metrics that are attributed to the
    violation.
  • Each metric takes the value 1 if it is attributed
    to the violation, -1 if it is not attributed, and
    0 if it is not relevant, e.g.:

[Table: example signature with an "Attribution" column]

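The encoding on this slide can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name build_signature, the split into "relevant" and "attributed" sets, and the reading of "not relevant" as "not used by the model" are all hypothetical, and the slides do not show how the models produce these sets. The metric names are illustrative.

```python
from typing import Dict, Iterable, Set


def build_signature(all_metrics: Iterable[str],
                    relevant: Set[str],
                    attributed: Set[str]) -> Dict[str, int]:
    """Encode one SLO violation as a vector over the metrics.

    1  -> metric is attributed to the violation
    -1 -> metric is relevant but not attributed
    0  -> metric is not relevant (assumed: not used by the model)
    """
    signature = {}
    for metric in all_metrics:
        if metric not in relevant:
            signature[metric] = 0
        elif metric in attributed:
            signature[metric] = 1
        else:
            signature[metric] = -1
    return signature


# Hypothetical inputs for a single violation.
metrics = ["gbl_cpu_total_util", "gbl_app_alive_proc", "net_out"]
sig = build_signature(
    metrics,
    relevant={"gbl_cpu_total_util", "gbl_app_alive_proc"},
    attributed={"gbl_app_alive_proc"},
)
print(sig)
```

Because every signature is a fixed-length vector over the same metrics, two signatures can be compared with an ordinary distance function.
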
11
Why do we need metric attribution?
  • Take two instances of a known problem and a
    metric relevant to that problem
  • Raw values are very different, but behavior
    compared to normal is similar

[Figure: metric "Gbl app alive proc" plotted against time]

12
Creating and using signatures
[Diagram: Leveraging prior diagnosis efforts.
 Components: Monitored service; Metrics/SLO monitoring;
 Signature construction engine; Signature DB;
 Retrieval engine; Clustering engine; Admin.
 Notes in the diagram: "Provides annotations in free
 form"; "Identifies intensity"; "Identifies recurrence";
 "Provides syndromes".]

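A rough sketch of what the retrieval step could look like, in Python. The slides only require that signatures be amenable to distance metrics; the L1 distance, the retrieve function, and the in-memory list standing in for the signature DB are assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Sequence, Tuple

Signature = Sequence[int]            # entries are 1, -1 or 0
Entry = Tuple[str, Signature]        # (free-form annotation, signature)


def l1_distance(a: Signature, b: Signature) -> int:
    """Sum of absolute differences (assumed distance metric)."""
    if len(a) != len(b):
        raise ValueError("signatures must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def retrieve(query: Signature, db: List[Entry], k: int) -> List[Entry]:
    """Return the k stored entries closest to the query signature."""
    ranked = sorted(db, key=lambda entry: l1_distance(query, entry[1]))
    return ranked[:k]


# Hypothetical signature DB with free-form annotations.
db = [
    ("Stuck Thread", [1, 1, -1, 0]),
    ("unannotated", [-1, 0, 1, 1]),
    ("Stuck Thread", [1, 1, 0, 0]),
]
nearest = retrieve([1, 1, -1, 0], db, k=2)
print([annotation for annotation, _ in nearest])
```

The annotations returned with the nearest signatures are what lets an operator reuse an earlier diagnosis.
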
13
Validation and testing of approach
  • Experimental testbed
      • Three-tier e-commerce site (Petstore)
      • Three types of induced performance problems
  • Service FT
      • Geographically distributed, three-tier,
        mission-critical enterprise application
      • 6 weeks of data from all instances of the
        service
      • One diagnosed (recurring) performance problem in
        the 6-week period (annotated "Stuck Thread")

14
Service FT
15
Evaluating signature representations
  • Retrieval
  • Perform tests with already annotated problems,
    compute precision and recall of retrieval
  • Precision = C/A, Recall = C/B
  • Precision typically decreases as Recall increases

[Diagram: A = retrieved signatures; B = relevant
 signatures in DB; C = correctly retrieved signatures]

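The two ratios can be computed directly from sets of signature identifiers. A small Python sketch follows; the identifiers s1 to s5 are made up, and here A, B and C are taken as the sizes of the retrieved set, the relevant set, and their overlap.

```python
from typing import Set, Tuple


def precision_recall(retrieved: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Precision = C/A and Recall = C/B, using set sizes."""
    correct = retrieved & relevant                      # C: correctly retrieved
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall


retrieved = {"s1", "s2", "s3", "s4"}    # A: what the engine returned
relevant = {"s2", "s3", "s5"}           # B: annotated as the same problem
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Retrieving more signatures can only raise recall, but it tends to pull in irrelevant ones, which is why precision typically falls as recall rises.
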
16
Precision-Recall graph
[Figure: precision-recall graph for retrieval of the
 "Stuck Thread" problem]
Conclusion: retrieval with metric attribution
produces significantly better results
17
Evaluating signature representations
  • Grouping similar problems (Clustering)
  • Good result is one where all members of each
    cluster have a single annotation (pure)
  • Annotation of performance problems in the case of
    Petstore
  • We compute the entropy of the clusters with
    respect to the annotation
  • Low entropy means a good result
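One way to score a clustering this way, sketched in Python. The slides do not say how the per-cluster entropies are combined; the size-weighted average used here is an assumption, and the annotation labels are placeholders.

```python
from collections import Counter
from math import log2
from typing import List


def cluster_entropy(annotations: List[str]) -> float:
    """Shannon entropy (bits) of the annotations within one cluster."""
    total = len(annotations)
    counts = Counter(annotations)
    return sum((c / total) * log2(total / c) for c in counts.values())


def clustering_entropy(clusters: List[List[str]]) -> float:
    """Size-weighted average of per-cluster entropies (assumed weighting)."""
    n = sum(len(c) for c in clusters)
    return sum(len(c) / n * cluster_entropy(c) for c in clusters if c)


# Each inner list holds the annotations of the signatures in one cluster.
pure = [["problem_A", "problem_A"], ["problem_B", "problem_B", "problem_B"]]
mixed = [["problem_A", "problem_B"], ["problem_A", "problem_B"]]
print(clustering_entropy(pure), clustering_entropy(mixed))
```

A clustering in which every cluster is pure scores 0, and the score grows as clusters mix signatures with different annotations.
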

18
Entropy of clustering
[Figure: entropy of clustering in Petstore]
Conclusion: clustering with metric attribution
produces significantly better results
19
Identifying intensity and recurrent problems
6 weeks view of one of the App server instances
of Service FT
20
Syndrome identification of problems
Attributed metrics identify the symptoms of the
problems
21
Leveraging signatures across data-centers
What happened in Asia-Pacific during the failover
from Americas on Dec 18?
22
Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur during the
failover?
[Figure label: "Failover period"]
23
Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur in AP during
failover?
NO!
  • Signatures are very different
  • Symptoms are not related to the app server
  • Root cause is related to the database not being
    primed for a typically unseen type of transaction
24
Improving Signatures
  • Ideally
      • Different root causes should have different
        signatures
      • Similar root causes should have similar
        signatures
  • Realistically
      • Different syndromes should have different
        signatures
      • Similar syndromes should have similar signatures
  • Relationship between root causes and syndromes is
    hidden
      • Need domain-specific knowledge
      • Can't control it directly anyway
  • Signature construction method can control how
    signatures map to syndromes

25
The Synonym Problem
  • Two or more different metrics or sets of metrics
    that all correlate equally well with SLO state
    over a specific time period
      • Excludes cases where the correlation is very weak
  • Can cause nearly identical syndromes to map to
    multiple distinct signatures
  • Negatively impacts retrieval and clustering
      • Different signatures no longer imply different
        syndromes
  • Several different causes of synonyms
      • Static vs. dynamic synonyms

26
Static Synonyms
  • When different sets of metrics always correlate
    equally well with SLO state
  • Caused by
      • One underlying metric with multiple names
      • Metrics that are linear transformations of one
        another
  • Solution
      • Statically find these synonyms and consistently
        choose one metric or set of metrics to use from
        each synonym group
      • May be difficult if comparing different systems
        under different administrative control

[Figure: example metric names: gbl_cpu_total_util,
 gbl_cpu_total_time, app_cpu_total_util,
 app_cpu_total_time]

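A hedged Python sketch of the "statically find these synonyms" step. It assumes that linear-transformation synonyms can be detected by a Pearson correlation of magnitude close to 1 between metric time series. The threshold, the helper names, the sample values, and the pairing of the metric names into groups are all illustrative assumptions rather than anything stated in the slides.

```python
from math import sqrt
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient; 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def synonym_groups(series: Dict[str, Sequence[float]],
                   threshold: float = 0.999) -> List[List[str]]:
    """Group metrics that are (nearly) linear transformations of each other.

    The first name in each group (alphabetical order) is its representative,
    so the same metric is chosen consistently across runs.
    """
    groups: List[List[str]] = []
    for name in sorted(series):
        for group in groups:
            representative = group[0]
            if abs(pearson(series[representative], series[name])) >= threshold:
                group.append(name)
                break
        else:
            groups.append([name])
    return groups


# Made-up samples: each *_time series is a scaled copy of its *_util series.
series = {
    "gbl_cpu_total_util": [10.0, 20.0, 40.0, 30.0],
    "gbl_cpu_total_time": [6.0, 12.0, 24.0, 18.0],
    "app_cpu_total_util": [5.0, 5.0, 30.0, 10.0],
    "app_cpu_total_time": [3.0, 3.0, 18.0, 6.0],
}
groups = synonym_groups(series)
print(groups)
```

Keeping only the representative of each group when building signatures means a single syndrome no longer maps to several signatures through static synonyms. This does not help with dynamic synonyms, where the correlation holds only some of the time.
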
27
Dynamic Synonyms
  • When different metrics or sets of metrics only
    sometimes correlate equally well with SLO state
  • Caused by
      • One metric being sometimes driven by another
      • A root cause manifesting itself in multiple ways
      • Multiple root causes coinciding by chance
  • Solutions
      • Trying to pick one metric or set of metrics to
        use from a synonym group will not work
      • Flat vector signature may be inadequate

[Figure labels: AppSrv CPU Hog; AppSrv CPU High;
 AppSrv Net Out Low]

28
New Signature Format
  • Let each signature consist of multiple
    sub-signatures
      • Include as many possible explanations as can
        be found
      • Vary learning algorithm parameters to increase
        signature coverage
  • Retrieval and clustering become more complicated
      • Distance metric can be some combination of the
        distances between all pairs of sub-signatures
      • Mean, Min, Top-N, etc.

[Figure: signatures Sig1 and Sig2, with sub-signatures
 labeled A through I]

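The pairwise combination could be sketched in Python as below. The L1 distance between sub-signatures, the function names, and the reading of "Top-N" as the mean of the N smallest pairwise distances are assumptions made for illustration; the slides name only the combination options.

```python
from itertools import product
from typing import List, Sequence

SubSignature = Sequence[int]


def l1(a: SubSignature, b: SubSignature) -> int:
    """Sum of absolute differences between two sub-signatures."""
    return sum(abs(x - y) for x, y in zip(a, b))


def signature_distance(sig1: List[SubSignature],
                       sig2: List[SubSignature],
                       combine: str = "min",
                       top_n: int = 2) -> float:
    """Combine the distances of all pairs of sub-signatures."""
    pair_distances = sorted(l1(a, b) for a, b in product(sig1, sig2))
    if not pair_distances:
        raise ValueError("both signatures need at least one sub-signature")
    if combine == "min":
        return float(pair_distances[0])
    if combine == "mean":
        return sum(pair_distances) / len(pair_distances)
    if combine == "top_n":
        best = pair_distances[:top_n]       # the N closest pairs (assumed meaning)
        return sum(best) / len(best)
    raise ValueError(f"unknown combination: {combine}")


# Two hypothetical signatures, each holding two sub-signatures.
sig1 = [[1, 1, 0], [1, -1, 0]]
sig2 = [[1, 1, 0], [-1, -1, 1]]
for mode in ("min", "mean", "top_n"):
    print(mode, signature_distance(sig1, sig2, combine=mode))
```

With "min", two signatures match as soon as any one explanation is shared, which tolerates synonyms; "mean" penalizes every unmatched sub-signature; "top_n" sits between the two.
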
29
Summary
  • Showed that it is feasible to reduce diagnosis to
    an information retrieval problem
  • Presented a methodology enabling
      • Systematic search over past diagnoses and
        solutions
      • Automatic retrieval of similar issues
  • Key to success is finding a good
    representation of system/application state
      • Just indexing based on raw values is not good
        enough
      • Capturing information about each metric's
        correlation with the SLO is the key.
  • Synonym problem is one roadblock to signatures
    accurately representing system state