1
Capturing, indexing and retrieving system history
  • Ira Cohen, Moises Goldszmidt, Julie Symons,
    Terence Kelly (HP Labs)
  • Steve Zhang, Armando Fox (Stanford University)
  • "Those who cannot remember the past are condemned
    to repeat it" (George Santayana)

2
Service performance management by Service Level
Objectives (SLO)
3
Wouldn't it be great if…
  • We could identify similar instances of a
    particular problem?
  • And retrieve previous diagnosis/repairs?

4
Wouldn't it be great if…
  • We knew if these are different problems and their
    intensity?
  • We knew which are recurrent?
  • We knew their characteristic symptoms (syndrome)?

5
Contribution of this work
  • Reducing system performance diagnosis to an
    information retrieval problem by
  • Maintaining a long-term memory of past events
    and actions in a form that can be manipulated and
    searched by computers.
  • Enabling
  • Identification of recurrent problems
  • Prioritization and syndrome identification
  • Finding similar problems in related systems
  • Leveraging previous repair/root-cause efforts.

6
Main Challenge
  • Find a representation (signature) that captures
    the main characteristics of the system behavior
    and that is
  • Amenable to distance metrics
  • Generated automatically
  • In machine-readable form
  • Therefore requiring little human intervention

7
Representation of system state
  • Many metrics are available on IT systems
  • Assumption: they are sufficient to capture the
    essence of the problems
  • Couldn't we just keep their raw values?

8
Signature construction approach
  • Learn the relationship between the metrics and
    the SLO state
  • Based on our OSDI'04, DSN'05 work

P(SLO, M)
9
Signatures - example
  • For a given SLO violation, the models provide a
    list of metrics that are attributed to the
    violation.
  • A metric has value 1 if it is attributed to the
    violation, -1 if it is not attributed, and 0 if it
    is not relevant, e.g. (see the example table and
    sketch below)

[Example signature table: metrics with their attribution values.]
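Below is a minimal sketch, not from the slides, of how such a signature might be represented in code; the metric names and values are hypothetical ("gbl_app_alive_proc" echoes a metric mentioned later in the deck).

```python
# Hypothetical signature for one SLO-violation epoch, using the convention
# above: 1 = attributed to the violation, -1 = not attributed, 0 = not relevant.
signature_example = {
    "cpu_utilization":     1,   # selected by a model and attributed
    "disk_read_latency":  -1,   # selected by a model but behaving normally
    "gbl_app_alive_proc":  0,   # not used by any model for this epoch
}

def to_vector(signature, metric_order):
    """Flatten a signature into a fixed-order vector so that standard
    distance metrics (e.g. L2 or Hamming) can be applied when comparing
    or searching signatures."""
    return [signature.get(metric, 0) for metric in metric_order]
```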
10
Why do we need metric attribution?
  • Take two instances of a known problem and a
    metric relevant to that problem
  • The raw values are very different, but the
    behavior compared to normal is similar

[Plot: the "Gbl app alive proc" metric over time for two instances of the
problem.]
11
Creating and using signatures
[Architecture diagram: metrics/SLO monitoring of the monitored service
feeds a signature construction engine, which stores signatures in a
signature DB. A retrieval engine and a clustering engine operate on the DB
to identify intensity and recurrence and to provide syndromes, leveraging
prior diagnosis efforts. The admin provides annotations in free form. A
code sketch of this workflow follows.]
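The sketch below illustrates that workflow under stated assumptions; the class names, fields, and the nearest-neighbor retrieval by L2 distance are illustrative, not the actual system's design.

```python
from dataclasses import dataclass, field

@dataclass
class SignatureRecord:
    signature: dict          # metric -> attribution value (1, -1, 0)
    timestamp: float
    annotation: str = ""     # free-form diagnosis/repair note added by the admin

@dataclass
class SignatureDB:
    records: list = field(default_factory=list)

    def add(self, record: SignatureRecord) -> None:
        # The signature construction engine stores one record per epoch.
        self.records.append(record)

    def annotate(self, index: int, text: str) -> None:
        # The admin attaches a free-form annotation to a stored signature.
        self.records[index].annotation = text

    def retrieve(self, query: dict, metrics: list, k: int = 5) -> list:
        # Retrieval engine: nearest neighbors of the query signature by
        # squared L2 distance over the attribution vectors.
        def dist(rec: SignatureRecord) -> float:
            return sum((rec.signature.get(m, 0) - query.get(m, 0)) ** 2
                       for m in metrics)
        return sorted(self.records, key=dist)[:k]
```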
12
Validation and testing of approach
  • Experimental testbed
  • Three-tier e-commerce site (Petstore)
  • Three types of induced performance problems
  • Service FT
  • Geographically distributed, three-tier,
    mission-critical enterprise application
  • 6 weeks of data from all instances of the
    service
  • One diagnosed (recurring) performance problem in
    the 6-week period (annotated "Stuck Thread")

13
Service FT
14
Evaluating signature representations
  • Retrieval
  • Perform tests with already annotated problems,
    compute precision and recall of retrieval
  • Precision = C/A, Recall = C/B (see the sketch below)
  • Precision typically decreases as Recall increases

[Venn diagram: A = retrieved signatures, B = relevant signatures in the DB,
C = correctly retrieved signatures (the intersection of A and B).]
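A small sketch of these definitions; the signature IDs in the usage example are made up.

```python
def precision_recall(retrieved, relevant):
    """Precision = C/A and Recall = C/B, where A = retrieved signatures,
    B = relevant signatures in the DB, and C = their intersection."""
    retrieved, relevant = set(retrieved), set(relevant)
    correct = retrieved & relevant
    precision = len(correct) / len(retrieved) if retrieved else 0.0
    recall = len(correct) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical example: 4 signatures retrieved, 5 relevant in the DB, 3 correct.
p, r = precision_recall({"s1", "s2", "s3", "s7"},
                        {"s1", "s2", "s3", "s4", "s5"})
assert (p, r) == (0.75, 0.6)
```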
15
Precision-Recall graph
Retrieval of "Stuck Thread" problem
Conclusion: retrieval with metric attribution
produces significantly better results
16
Evaluating signature representations
  • Grouping similar problems (clustering)
  • A good result is one where all members of each
    cluster have a single annotation (pure clusters)
  • Annotation of performance problems in the case of
    Petstore
  • SLO violation/compliance in the case of Service FT
  • We compute the entropy of the clusters with
    respect to the annotations
  • Low entropy = good result (a small sketch follows)
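A small sketch of the entropy computation; the slides do not spell out how per-cluster entropies are combined, so the size-weighted average below is an assumption.

```python
from collections import Counter
from math import log2

def cluster_entropy(annotations):
    """Entropy of the annotation labels within one cluster;
    0.0 means the cluster is pure (a single annotation)."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

def clustering_entropy(clusters):
    """Size-weighted average entropy over all clusters (assumed weighting);
    lower is better."""
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * cluster_entropy(c) for c in clusters)

# A pure cluster has entropy 0; an evenly mixed two-annotation cluster has 1.
assert cluster_entropy(["Stuck Thread"] * 4) == 0.0
assert cluster_entropy(["Stuck Thread", "Other"] * 2) == 1.0
```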

17
Entropy of clustering
Entropy of clustering in Petstore
Conclusion: clustering with metric attribution
produces significantly better results
18
Identifying intensity and recurrent problems
Six-week view of one of the app server instances
of Service FT
19
Syndrome identification of problems
Attributed metrics identify the symptoms of the
problems
20
Leveraging signatures across data-centers
What happened in Asia-Pacific during the failover
from Americas on Dec 18?
21
Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur during the
failover?
Failover period
22
Leveraging signatures across data-centers
Did the "Stuck Thread" problem occur in AP during
failover?
No! The signatures are very different, and the symptoms
are not related to the app server. The root cause is the
database not being primed for a typically unseen type of
transaction.
23
Related work
  • Signatures and retrieval for other domain
    problems
  • Virus and intrusion detection, e.g., Kephart et
    al., 1994
  • Codebooks for fault isolation, e.g., SMARTS.com
  • Using computers to diagnose computers: Redstone
    et al., HotOS'03
  • Diagnosis and debugging of performance problems
  • Project 5: Aguilera et al., SOSP'03
  • Magpie: Barham et al., OSDI'04
  • Pinpoint: Chen et al., DSN'02
  • SLIC: OSDI'04, DSN'05

24
Discussion and future work
  • Can the signatures be generalized and abstracted
    to be applicable across different
    systems/applications?
  • What changes are the signatures resilient to?
  • How to handle and represent signatures with
    metrics that are synonyms of each other?
  • How can we leverage the many non-annotated
    signatures with the few annotated ones to achieve
    more accurate retrieval? (semi-supervised
    learning)
  • How to interact with users?
  • For displaying results
  • For achieving annotation of signatures.
  • For the most relevant feedback, so the system can
    learn faster (active learning)
  • How (and should we) incorporate time in the
    construction of signatures?

25
Summary
  • Showed that it is feasible to reduce diagnosis to
    an information retrieval problem
  • Presented a methodology enabling
  • Systematic search over past diagnoses and
    solutions
  • Automatic retrieval of similar issues
  • Identification of recurrent problems
  • Leveraging of knowledge from other IT
    infrastructure
  • Key to success is in finding a good
    representation of system/application state
  • Just indexing based on raw values is not good
    enough
  • Capturing information about the metrics' correlation
    with the SLO is the key.

http://www.hpl.hp.com/research/slic
26
Backup slides from here
27
Performance of approach
  • On the HP Labs Matlab implementation
  • One month of data required 30 models
  • Inducing a new model takes 3-5 seconds
  • Inference and update take 1 msec
  • The whole operation is real-time with respect to
    the 5-minute epoch
  • Ensemble accuracy in predicting the SLO state is
    about 90%
  • Model = classifier
  • Probability distribution: a Bayesian network
  • About 6-10 parameters (real numbers) per metric
  • A handful of metrics (3-5) per model

28
The model: a classifier F(M) -> SLO state
  • Consists of two parts
  • A probabilistic model of <M, SLO>
  • A decision function on what is more likely:
    P(s- | M) > P(s+ | M)
  • Bayesian networks for representing P
  • Interpretability (attribution):
    P(m_i | pa(m_i), s-) > P(m_i | pa(m_i), s+)
  • TAN: Friedman, Geiger, Goldszmidt '96-'97
  • Efficiency (of representation and computation)
  • Can fuse expert knowledge with statistical data
  • Measures of success
  • Percentage of patterns captured by the model
  • BA = 0.5 [ P(F(M) = s- | s-) + P(F(M) = s+ | s+) ]
    (a sketch follows)
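A minimal sketch of the decision function and the balanced-accuracy measure above, assuming the log-likelihood and log-prior terms are supplied by the fitted Bayesian-network model; the helper names and the worked example are illustrative.

```python
def classify(log_lik_viol, log_lik_comp, log_prior_viol, log_prior_comp):
    """Decision function: predict s- (violation) iff P(s- | M) > P(s+ | M),
    i.e. iff log P(M | s-) + log P(s-) > log P(M | s+) + log P(s+)."""
    return "s-" if log_lik_viol + log_prior_viol > log_lik_comp + log_prior_comp else "s+"

def balanced_accuracy(y_true, y_pred):
    """BA = 0.5 * [ P(F(M) = s- | s-) + P(F(M) = s+ | s+) ]."""
    def class_accuracy(label):
        pairs = [(t, p) for t, p in zip(y_true, y_pred) if t == label]
        return sum(t == p for t, p in pairs) / len(pairs)
    return 0.5 * (class_accuracy("s-") + class_accuracy("s+"))

# Hypothetical epochs: two violations and two compliance periods.
truth = ["s-", "s-", "s+", "s+"]
preds = ["s-", "s+", "s+", "s+"]
assert balanced_accuracy(truth, preds) == 0.75   # 0.5 * (0.5 + 1.0)
```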

29
Adapting to Change
  • Change is frequent (e.g. workload, sw/hw upgrade,
    configuration, etc.)
  • Data is continuously collected as a system runs
  • Building a model early on and leaving it
    unaltered doesn't work
  • Models built using data collected under one
    condition do not work well on data collected
    under a different condition
  • Possible Techniques
  • Rebuild model with all available data
    periodically (All-encompassing model)
  • Requires more complex model for multiple
    behaviors
  • Rebuild model with most recent data periodically
    (Single adaptive model)
  • Past behaviors are lost
  • Induce new models periodically; add to the ensemble
    of models only if necessary
  • Must combine information from multiple models

30
Ensemble Algorithm Outline
  • Size of window is a parameter
  • Controls rate of adaptation
  • Larger windows produce better models (up to a
    point)
  • Typically 6-12 hrs of data

[Flowchart annotations: a new sample arrives every 1-5 minutes; inducing a
model takes 2-3 seconds; inference takes 1 ms per model, a few ms for
hundreds of models; roughly 20 new models per month. A sketch of this
update loop follows.]
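A rough sketch of the update loop, assuming hypothetical train_model and score callables and a made-up accuracy threshold; the slide only states that a new model is added to the ensemble when necessary.

```python
def update_ensemble(ensemble, window, train_model, score, threshold=0.85):
    """One step per data window (e.g. 6-12 hours of samples arriving every
    1-5 minutes). score(model, window) is assumed to return an accuracy
    measure such as balanced accuracy on the window; a new model is induced
    and added only if no existing model captures the current behavior."""
    if not ensemble or max(score(m, window) for m in ensemble) < threshold:
        ensemble.append(train_model(window))
    return ensemble
```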
31
Metric attribution with an Ensemble
  • For all models that correctly predict the SLO
    state, do:
  • Score all models on the past window (Brier score)
  • For each model, perform metric attribution:
    attr_i = 1 if P(m_i | s-) > P(m_i | s+),
    -1 otherwise
  • For each metric, choose the attribution result
    from the model with the best score
  • For all other metrics, store the value 0
    (a sketch follows)
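A sketch of this attribution step; the per-model attributes (predicted_correctly, brier_score, metrics) and the likelihood methods are assumptions about what each ensemble member would expose, not the actual interface.

```python
def ensemble_attribution(models, all_metrics, observation):
    """Build a signature {metric: 1 | -1 | 0} for one SLO-violation epoch."""
    # Only models that predicted this epoch's SLO state correctly take part;
    # the best (lowest) Brier score on the past window wins per metric.
    eligible = sorted((m for m in models if m.predicted_correctly),
                      key=lambda m: m.brier_score)
    signature = {metric: 0 for metric in all_metrics}   # default: not relevant
    decided = set()
    for model in eligible:
        for metric in model.metrics:
            if metric in decided:
                continue            # a better-scored model already attributed it
            value = observation[metric]
            attributed = (model.p_given_violation(metric, value) >
                          model.p_given_compliance(metric, value))
            signature[metric] = 1 if attributed else -1
            decided.add(metric)
    return signature
```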