Title: Management Pack University
1. Getting Manageability Right
- Management Pack University
- Nishtha Soni
2. Agenda
- A manageable application is the first step to creating a useful management pack
- Failure Mode Analysis is designed to increase manageability
3. Who are our customers?
- "Anxious IT Managers Don't Sleep Well"
  - InformationWeek, March 12, 2007
- Two out of three IT managers say that they are kept awake at night worrying about work
- 75 percent admit ongoing anxiety about application performance concerns
- 25 percent of respondents reported suffering physical symptoms, including nausea, headaches, migraines, panic attacks, heart arrhythmia, and muscle twitches. And nightmares.
- Terry Beehr, a Central Michigan University professor of psychology:
  - "If IT goes down, a lot of other departments can't do their work."
  - "IT is 24-by-7, plus that's combined with heavy workloads and work that needs to be done quickly."
4. Operations Roles

Tier 1:
- Where do they work?
  - Centralized help desk
  - Most problem reports go here first
- IT skill level: trained
- What do they do?
  - Spot trends
  - Look at IT health all up
  - First response to problem alerts
  - Decide when to escalate
  - Fix the common stuff
- Low privileges typically
- May specialize
  - DBAs, network, apps
- Managers are responsible for all-up IT health reporting

Tiers 2 and 3:
- Where do they work?
  - Cubicles, offices
  - Dedicated application teams (ERP, LOB)
  - Dedicated technology teams (e.g. security, network, AD)
- IT skill level: specialist
- What do they do?
  - Provision servers, get applications ready to run (power/cooling typically separate)
  - Get things working when Tier 1 cannot
  - Diagnose new problems
  - Systematize new remedies
  - Manage monitoring
  - Make sure outages can be detected and explained
  - Long-term trends and capacity planning
5. Problems our customers have
- After setup, product health is their top concern
- Instrumentation is often inadequate
  - Tier 3-focused instrumentation is prominent
- Noisy monitoring is as bad as no monitoring
  - Alerts when there is nothing to fix increase cost of ownership
- The ideal case is not having to staff monitoring roles
  - The practical reality with MSFT products is that this isn't feasible
- Differentiating maintenance from break fix
  - If maintenance is not performed (admin), break fix occurs
- We invest in administration interfaces, so why not operations interfaces?
  - Ops Manager is the operator's interface
6. Getting Instrumentation Right
- Instrumentation is the largest gating factor to achieving proactive management.
7. Getting Manageability Right
- The Manageability Maturity Model can help us determine the right investment levels for the current state of a product.
- The model focuses on product instrumentation as a factor in gauging maturity.
- Six levels:
  - Level 0: Most instrumentation (events/counters) is undocumented.
  - Level 1: Manual diagnostic info is published for all events (MMD health model).
  - Level 2: Instrumentation is symptomatic; the management pack is rudimentary as a result. Many false alarms and elevated customer costs result.
  - Level 3: The instrumentation approach changes to proactive and cause-based. Knowledge articles focus on how to fix outages, not on diagnosis.
  - Level 4: Root-cause issue detection is fully supported by instrumentation. Tasks are added to the MP to help streamline restore/repair.
  - Level 5: Instrumentation supports predictive management, capacity planning, and efficient data collection with low privilege levels. Customer costs are minimal; TCO is best in class.
- You can't get level 5 results with level 1 instrumentation.
8. Thinking about monitoring
- Most instrumentation seen in the wild today is:
  - Added by the developer for debugging or code-path tracing of problems
  - Unable to tell me whether a service or application is working well
  - Reporting a symptom, and rarely suitable on its own for making a diagnosis or break/fix decision
- Most monitoring today is:
  - Implemented by the operations people who need to manage the IT asset
  - Rarely a part of the up-front system or application design effort
  - A best guess on the part of the person or team who designed the monitoring rules, based on what instrumentation is visible after setting up the application in a test environment
- If an event manifest for the product is available it is helpful, but without deep knowledge of what each event signifies, it is not always useful in a proactive way.
- Failure Mode Analysis helps drive improvements in both areas.
9. What to measure

Stage        | Questions                                        | Observations                                 | Diagnosing
Deployment   | Did it deploy? Ready to run?                     | File counts; smoke test OK                   | What went wrong? How is it configured?
Verification | Is everything in the right place? Right version? | In compliance? Patch level good?             | What is missing? End-to-end trace?
Running      | SLA being met? Resources OK? Performing well?    | Can it be used to do work? Is it responsive? | What is the internal state right now?
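The Deployment column's observations ("file counts, smoke test OK") can be automated. A minimal sketch in Python; the manifest file names are illustrative assumptions, not from the deck:

```python
import os

# Hypothetical manifest: files the installer should have laid down.
EXPECTED_FILES = ["app.exe", "app.config", "business.dll"]

def smoke_test(install_dir, expected_files=EXPECTED_FILES):
    """Deployment-stage observation: file counts and presence.

    'Did it deploy?' is answered by whether anything is missing;
    'What went wrong?' starts from the list of missing files.
    """
    missing = [name for name in expected_files
               if not os.path.isfile(os.path.join(install_dir, name))]
    return {"deployed": not missing, "missing": missing}
```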
10. Life cycle states and roles

Stage        | Questions      | Observations             | Diagnosing
Deployment   | Tier 2         | Tier 2                   | Tier 2 smoke test; Tier 3 deep dive
Verification | Tier 2, Tier 3 | Tier 2, app owner, admin | Tier 2
Running      | Tier 2, Tier 3 | Tier 1                   | Tier 1
11. Three important questions
- Is my application healthy?
  - Use health measures to show there are no customer-impacting issues
  - Look at redundant measures that detect elements that have failed
  - Look at the balance of work across the system
  - Are critical dependencies able to perform in concert without major disruption to users?
- Are the users of my application happy?
  - How fast do your pages load, from request to responsiveness?
  - Look at abandoned-page rates relative to overall traffic
  - Can an end-to-end interaction happen without interruption?
  - Consider artificial transactions as a weak proxy for these
- How well do the parts of my application work together?
  - Look at subsystem measures that signal imbalances
  - Instrument for detecting problems where they occur
  - Be able to follow a call from end to end if necessary
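The "artificial transactions" bullet can be sketched as a scripted probe. A minimal illustration in Python; the step names and the 2-second SLA are assumptions for the example:

```python
import time

def synthetic_transaction(steps, sla_seconds=2.0):
    """Run a scripted end-to-end interaction and time it.

    `steps` is an ordered list of callables standing in for one user
    workflow (load page, sign in, submit). Any step raising means the
    interaction could not complete without interruption.
    """
    start = time.monotonic()
    for step in steps:
        try:
            step()
        except Exception as exc:
            return {"healthy": False, "failed_step": step.__name__,
                    "error": str(exc)}
    elapsed = time.monotonic() - start
    # Responsiveness counts: meeting the SLA is part of 'happy users'.
    return {"healthy": elapsed <= sla_seconds,
            "failed_step": None, "elapsed": elapsed}
```

As the slide says, this is a weak proxy: a passing probe shows the path works, not that real users are fast or satisfied.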
12. Failure Mode Analysis
- Moving up the scale on the Manageability Maturity Model
13. Failure modes: what are they?
- Failure modes drive support incidents
  - Planning helps lower support costs for your product
  - Planning for failure lets you optimize what you instrument and monitor to detect
- No service is free of failure modes
  - Planning for failures makes products more resilient
- Examples
  - Hardware monitoring
    - Fan can fail
    - Disk can fail in a RAID 5 array, or be full
  - Configuration monitoring
    - Configuration file is not in the correct location
    - Access control list to a remote host can be changed, causing a failure
    - Critical bug fixes don't get applied
  - Capacity
    - Database can become full, causing a failure
    - Too much data in the system, making queries slow
    - Too much traffic overusing resources
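Several of these example failure modes can be detected directly by trivial watchdog probes instead of waiting for a downstream symptom. A sketch in Python, assuming hypothetical paths and a 10% free-space threshold:

```python
import os
import shutil

def config_file_present(path):
    """Configuration failure mode: file missing or not in the right place."""
    return os.path.isfile(path)

def disk_has_headroom(path, min_free_ratio=0.10):
    """Capacity failure mode: disk full (or nearly so)."""
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) >= min_free_ratio

def check_failure_modes(config_path, data_path, min_free_ratio=0.10):
    """Return the failure modes currently observed, by name."""
    observed = []
    if not config_file_present(config_path):
        observed.append("config file missing")
    if not disk_has_headroom(data_path, min_free_ratio):
        observed.append("disk nearly full")
    return observed
```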
14. Failure-mode analysis
- Definition
  - An up-front design effort for a monitoring plan that is similar to threat modeling
- Produces
  - Monitoring plan
    - Used to drive management pack technical design
    - Coverage matrix: failure mode coverage
    - Capacity plan: what to collect to enable trending and capacity planning
  - Instrumentation plan
    - Design artifacts used to write code that helps detect failures
    - Derives from the coverage matrix and capacity plan (union)
    - Typically shows up in dev specs; QA tests using the coverage matrix plan
  - Health façade
    - Describes health at the end-user and subsystem level
    - Used to understand the impact of specific types of failures on each subsystem
    - Guides mitigation and recovery documentation
    - Helps drive escalations from Tier 1 to Tier 2
- Driven by
  - Monitoring champion: ensures that monitoring is part of the design process
15. Failure-Mode Analysis Steps
- Process
  - Step 1: List what can go wrong and cause harm to the service
    - Identify all failure modes; list predictable ways to fail
    - Understand whether an item is a way to fail or an effect of a failure
    - Prioritize according to impact on service health, probability, and cost
    - Include physical, software, and network components
  - Step 2: Identify a detection strategy for each failure mode
    - Each high-impact item needs at least two detection methods
    - Detection can be a measure or event, or can require a watchdog (code)
  - Step 3: Add these detection elements to your code effort
    - Some are probes, some are monitors (automate as much as possible)
  - Step 4: Plan your management pack content
    - The result is the basis of the instrumentation and monitoring plans
- Failure modes are root causes
  - Detecting root causes directly is optimal
  - Inferring root causes via symptoms requires correlation
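Step 2's "at least two detection methods" rule is easy to check mechanically once the coverage matrix exists as data. A sketch in Python; the failure modes and detection names are hypothetical:

```python
# Hypothetical coverage matrix: failure mode -> detection methods.
COVERAGE = {
    "config file missing": ["event 200 at startup", "config watchdog probe"],
    "database offline":    ["event 202 on connect", "connection watchdog"],
    "disk full":           ["free-space performance counter"],  # only one
}

def coverage_gaps(matrix, required=2):
    """List failure modes with fewer detection methods than required."""
    return sorted(mode for mode, detections in matrix.items()
                  if len(detections) < required)
```

The same table can then feed dev specs (what instrumentation to emit) and QA tests (verifying each detection actually fires).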
16. Getting it wrong
- Instrumentation that is useful for tracing is not always right for finding and pinpointing issues

"The [name hidden] Server management pack increased my costs by 40 percent. We had to hire more operators just to close all of the non-actionable alerts."
- CENSORED
17. The code-path problem
- Example: data access layer
  - N-tier application with front end, middle tier, and DAL
  - What does error 100 mean?

    public IData getBusinessData( parameters ) {
        try {
            mConfig.open(mConfigPath);
            conn = connectToDB(mConfig.ConnectString);
            data = conn.getDataFromDb(sproc, parameters);
            return data;
        } catch (exception e) {
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }
18. Code-path problem explodes

Front End:
    try { call_middle_Tier(params) }
    catch (exception e) { WriteEventLogEvent(101, E_ExceptionWeb); throw; }

Middle Tier:
    try { call_DAL(params) }
    catch (exception e) { WriteEventLogEvent(102, E_); throw; }

DAL: BAM
- One failure in the DAL now surfaces as a different catch-all event at every tier
19. Failure Modes
- Failure modes are predictable causes
  - Configuration
    - Config file missing
    - Config file permissions
    - Config file corrupt, no defaults
    - Connect string incorrect
  - Database
    - DB availability: database is offline
    - DB permissions: log-in denied
    - DB permissions: execute permission on sproc denied
    - DB data error
  - Environment
    - Network: DNS lookup fails
    - Network: ACL issues (looks a lot like DB availability)
- Instrument these
  - Unique event per root cause / predicted problem
  - Diagnostic event logging for everything else
  - Context is key: know the source; who has the context to diagnose the problem?
  - These go into the management pack as trouble signals

    public IData getBusinessData( parameters ) {
        try {
            mConfig.open(mConfigPath);
            conn = connectToDB(mConfig.ConnectString);
            data = conn.getDataFromDb(sproc, parameters);
            return data;
        } catch (exception e) {
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }
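Restated in runnable form, the contrast with the single catch-all event 100 is that each predicted root cause gets its own event before the exception propagates. A Python sketch; the event IDs, names, and `connect`/`query` callables are illustrative, not real OpsMgr or event-log APIs:

```python
# Hypothetical event IDs: one per predicted root cause.
E_CONFIG_MISSING = 200
E_DB_UNAVAILABLE = 202

EVENT_LOG = []  # stand-in for the real event log

def write_event(event_id, detail):
    EVENT_LOG.append((event_id, detail))

def get_business_data(config_path, connect, query):
    try:
        with open(config_path) as f:          # failure mode: config missing
            connect_string = f.read().strip()
    except FileNotFoundError:
        write_event(E_CONFIG_MISSING, config_path)
        raise
    try:
        conn = connect(connect_string)        # failure mode: DB offline
    except ConnectionError as exc:
        write_event(E_DB_UNAVAILABLE, str(exc))
        raise
    return query(conn)
```

An operator seeing event 200 knows which file to restore; event 100 above only says "something in the DAL threw."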
20. Failure-mode analysis: key message
- Debug instrumentation is high cost to customers
  - Better than NO instrumentation
  - MMD health model is critical to help debug at low maturity levels
  - Contextual failure alerting is predictive (level 4 and up)
- Management pack alerts should be actionable
  - "It might mean A or B" is not a starting point
  - There are only 5 essential operator actions (locked-down environment):
    - Reboot host
    - Stop/start service
    - Run maintenance task
    - Add more or reduce overall capacity (e.g. shut it off)
    - Change configuration
  - Instrumentation should identify the cause and map it to an action
  - Manual diagnosis drives additional expense
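If instrumentation identifies the cause, mapping an alert to one of those five actions can be a lookup rather than a diagnosis. A sketch in Python; the event IDs and the cause comments are hypothetical:

```python
# Hypothetical root-cause events mapped to the five operator actions.
ACTION_FOR_EVENT = {
    200: "change configuration",   # config file missing
    202: "stop/start service",     # service wedged after DB outage
    204: "run maintenance task",   # database nearly full
    205: "add or reduce capacity", # sustained overload
    206: "reboot host",            # unrecoverable host state
}

def recommended_action(event_id):
    """An actionable alert maps to exactly one operator action."""
    return ACTION_FOR_EVENT.get(event_id,
                                "escalate: manual diagnosis required")
```

Anything that falls through to the escalation branch is, by this slide's argument, added expense.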
21. Learning more
- Templates
  - Failure Mode Analysis Template
- White papers
  - Introduction to Operations
  - Thinking Operationally: why we measure