Title: Management Pack University
1. Getting Manageability Right
- Management Pack University
- Nishtha Soni
2. Agenda
- A manageable application is the first step to creating a useful management pack
- Failure Mode Analysis is designed to increase manageability
3. Who are our customers?
- "Anxious IT Managers Don't Sleep Well"
  - InformationWeek, March 12, 2007
- Two out of three IT managers say that they are kept awake at night worrying about work
- 75 percent admit ongoing anxiety about application performance concerns
- 25 percent of respondents reported suffering physical symptoms, including nausea, headaches, migraines, panic attacks, heart arrhythmia, and muscle twitches. And nightmares.
- Terry Beehr, a Central Michigan University professor of psychology:
  - "If IT goes down, a lot of other departments can't do their work."
  - "IT is 24-by-7, plus that's combined with heavy workloads and work that needs to be done quickly."
4. Operations Roles

Tier 1:
- Where do they work?
  - Centralized help desk
  - Most problem reports go here first
- IT skill level: trained
- What do they do?
  - Spot trends
  - Look at IT health all up
  - First response to problem alerts
  - Decide when to escalate
  - Fix the common stuff
- Low privileges typically
- May specialize
  - DBAs, network, apps
- Managers are responsible for all-up IT health reporting

Tiers 2 and 3:
- Where do they work?
  - Cubicles, offices
  - Dedicated application teams (ERP, LOB)
  - Dedicated technology teams (e.g. security, network, AD)
- IT skill level: specialist
- What do they do?
  - Provision servers, get applications ready to run (power/cooling typically separate)
  - Get things working when Tier 1 cannot
  - Diagnose new problems
  - Systematize new remedies
  - Manage monitoring
  - Make sure outages can be detected and explained
  - Long-term trends and capacity planning
5. Problems our customers have
- After setup, product health is their top concern
- Instrumentation is often inadequate
  - Tier 3-focused instrumentation is prominent
- Noisy monitoring is as bad as no monitoring
  - Alerts when there is nothing to fix increase cost of ownership
- The ideal case is not having to staff monitoring roles
  - The practical reality with MSFT products is that this isn't feasible
- Differentiating maintenance from break fix
  - If maintenance is not performed (admin), break fix occurs
- We invest in administration interfaces, so why not operations interfaces?
  - Ops Manager is the operator's interface
6. Getting Instrumentation Right
- Instrumentation is the largest gating factor to achieving proactive management.
7. Getting Manageability Right
- The Manageability Maturity Model can help us determine the right investment levels for the current state of a product.
- The model focuses on product instrumentation as a factor in gauging maturity.
- Six levels:
  - Level 0: Most instrumentation (events/counters) is undocumented.
  - Level 1: Manual diagnostic info is published for all events (MMD health model).
  - Level 2: Instrumentation is symptomatic; the management pack is rudimentary as a result. Many false alarms and elevated customer costs result.
  - Level 3: The instrumentation approach changes to proactive and cause-based. Knowledge articles focus on how to fix outages, not on diagnosis.
  - Level 4: Root-cause issue detection is fully supported by instrumentation. Tasks are added to the MP to help streamline restore/repair.
  - Level 5: Instrumentation supports predictive management, capacity planning, and efficient data collection with low privilege levels. Customer costs are minimal; TCO is best in class.
- You can't get level 5 results with level 1 instrumentation.
8. Thinking about monitoring
- Most instrumentation seen in the wild today is:
  - Added by the developer for debugging or code-path tracing of problems
  - Unable to tell me whether a service or application is working well
  - Reporting a symptom, and rarely suitable on its own for making a diagnosis or break/fix decision
- Most monitoring today is:
  - Implemented by the operations people who need to manage the IT asset
  - Rarely a part of the up-front system or application design effort
  - A best guess on the part of the person or team who designed the monitoring rules, based on what instrumentation is visible after setting up the application in a test environment
- If an event manifest for the product is available it is helpful, but without deep knowledge of what each event signifies, it is not always useful in a proactive way.
- Failure Mode Analysis helps drive improvements in both areas.
9. What to measure

Stage        | Questions                                        | Observations                                 | Diagnosing
Deployment   | Did it deploy? Ready to run?                     | File counts; smoke test OK                   | What went wrong? How is it configured?
Verification | Is everything in the right place? Right version? | In compliance? Patch level good?             | What is missing? End-to-end trace?
Running      | SLA being met? Resources OK? Performing well?    | Can it be used to do work? Is it responsive? | What is the internal state right now?
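The Deployment column's observations ("file counts, smoke test OK") can be automated. A minimal sketch in Python; the manifest file names are illustrative assumptions, not from the deck:

```python
import os

# Hypothetical manifest: files the installer should have laid down.
EXPECTED_FILES = ["app.exe", "app.config", "business.dll"]

def smoke_test(install_dir, expected_files=EXPECTED_FILES):
    """Deployment-stage observation: file counts and presence.

    'Did it deploy?' is answered by whether anything is missing;
    'What went wrong?' starts from the list of missing files.
    """
    missing = [name for name in expected_files
               if not os.path.isfile(os.path.join(install_dir, name))]
    return {"deployed": not missing, "missing": missing}
```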
10. Life cycle states and roles

Stage        | Questions      | Observations             | Diagnosing
Deployment   | Tier 2         | Tier 2                   | Tier 2 smoke test; Tier 3 deep dive
Verification | Tier 2, Tier 3 | Tier 2, app owner, admin | Tier 2
Running      | Tier 2, Tier 3 | Tier 1                   | Tier 1
11. Three important questions
- Is my application healthy?
  - Use health measures to show there are no customer-impacting issues
  - Look at redundant measures that detect elements that have failed
  - Look at the balance of work across the system
  - Are critical dependencies able to perform in concert without major disruption to users?
- Are the users of my application happy?
  - How fast do your pages load, from request to responsiveness?
  - Look at abandoned-page rates relative to overall traffic
  - Can an end-to-end interaction happen without interruption?
  - Consider artificial transactions as a weak proxy for these
- How well do the parts of my application work together?
  - Look at subsystem measures that signal imbalances
  - Instrument for detecting problems where they occur
  - Be able to follow a call from end to end if necessary
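The "artificial transactions" bullet can be sketched as a scripted probe. A minimal illustration in Python; the step names and the 2-second SLA are assumptions for the example:

```python
import time

def synthetic_transaction(steps, sla_seconds=2.0):
    """Run a scripted end-to-end interaction and time it.

    `steps` is an ordered list of callables standing in for one user
    workflow (load page, sign in, submit). Any step raising means the
    interaction could not complete without interruption.
    """
    start = time.monotonic()
    for step in steps:
        try:
            step()
        except Exception as exc:
            return {"healthy": False, "failed_step": step.__name__,
                    "error": str(exc)}
    elapsed = time.monotonic() - start
    # Responsiveness counts: meeting the SLA is part of 'happy users'.
    return {"healthy": elapsed <= sla_seconds,
            "failed_step": None, "elapsed": elapsed}
```

As the slide says, this is a weak proxy: a passing probe shows the path works, not that real users are fast or satisfied.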
12. Failure Mode Analysis
- Moving up the scale on the Manageability Maturity Model
13. Failure modes: what are they?
- Failure modes drive support incidents
  - Planning helps lower support costs for your product
  - Planning for failure lets you optimize what you instrument and monitor to detect
- No service is free of failure modes
  - Planning for failures makes products more resilient
- Examples
  - Hardware monitoring
    - Fan can fail
    - Disk can fail in a RAID 5 array, or be full
  - Configuration monitoring
    - Configuration file is not in the correct location
    - Access control list to a remote host can be changed, causing a failure
    - Critical bug fixes don't get applied
  - Capacity
    - Database can become full, causing a failure
    - Too much data in the system, making queries slow
    - Too much traffic overusing resources
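Several of these example failure modes can be detected directly by trivial watchdog probes instead of waiting for a downstream symptom. A sketch in Python, assuming hypothetical paths and a 10% free-space threshold:

```python
import os
import shutil

def config_file_present(path):
    """Configuration failure mode: file missing or not in the right place."""
    return os.path.isfile(path)

def disk_has_headroom(path, min_free_ratio=0.10):
    """Capacity failure mode: disk full (or nearly so)."""
    usage = shutil.disk_usage(path)
    return (usage.free / usage.total) >= min_free_ratio

def check_failure_modes(config_path, data_path, min_free_ratio=0.10):
    """Return the failure modes currently observed, by name."""
    observed = []
    if not config_file_present(config_path):
        observed.append("config file missing")
    if not disk_has_headroom(data_path, min_free_ratio):
        observed.append("disk nearly full")
    return observed
```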
14. Failure-mode analysis
- Definition
  - An up-front design effort for a monitoring plan that is similar to threat modeling
- Produces
  - Monitoring plan
    - Used to drive management pack technical design
    - Coverage matrix: failure mode coverage
    - Capacity plan: what to collect to enable trending and capacity planning
  - Instrumentation plan
    - Design artifacts used to write code that helps detect failures
    - Derives from the coverage matrix and capacity plan (union)
    - Typically shows up in dev specs; QA tests using the coverage matrix plan
  - Health façade
    - Describes health at the end-user and subsystem level
    - Used to understand the impact of specific types of failures on each subsystem
    - Guides mitigation and recovery documentation
    - Helps drive escalations from Tier 1 to Tier 2
- Driven by
  - Monitoring champion: ensures that monitoring is part of the design process
15. Failure-Mode Analysis Steps
- Process
  - Step 1: List what can go wrong and cause harm to the service
    - Identify all failure modes; list predictable ways to fail
    - Understand whether an item is a way to fail or an effect of a failure
    - Prioritize according to impact on service health, probability, and cost
    - Include physical, software, and network components
  - Step 2: Identify a detection strategy for each failure mode
    - Each high-impact item needs at least two detection methods
    - Detection can be a measure or event, or can require a watchdog (code)
  - Step 3: Add these detection elements to your code effort
    - Some are probes, some are monitors (automate as much as possible)
  - Step 4: Plan your management pack content
    - The result is the basis of the instrumentation and monitoring plans
- Failure modes are root causes
  - Detecting root causes directly is optimal
  - Inferring root causes via symptoms requires correlation
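Step 2's "at least two detection methods" rule is easy to check mechanically once the coverage matrix exists as data. A sketch in Python; the failure modes and detection names are hypothetical:

```python
# Hypothetical coverage matrix: failure mode -> detection methods.
COVERAGE = {
    "config file missing": ["event 200 at startup", "config watchdog probe"],
    "database offline":    ["event 202 on connect", "connection watchdog"],
    "disk full":           ["free-space performance counter"],  # only one
}

def coverage_gaps(matrix, required=2):
    """List failure modes with fewer detection methods than required."""
    return sorted(mode for mode, detections in matrix.items()
                  if len(detections) < required)
```

The same table can then feed dev specs (what instrumentation to emit) and QA tests (verifying each detection actually fires).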
16. Getting it wrong
- Instrumentation that is useful for tracing is not always right for finding and pinpointing issues

"The [name hidden] Server management pack increased my costs by 40 percent. We had to hire more operators just to close all of the non-actionable alerts."
- CENSORED
17. The code-path problem
- Example: data access layer
  - N-tier application with front end, middle tier, and DAL
  - What does error 100 mean?

    public IData getBusinessData( parameters ) {
        try {
            mConfig.open(mConfigPath);
            conn = connectToDB(mConfig.ConnectString);
            data = conn.getDataFromDb(sproc, parameters);
            return data;
        } catch (exception e) {
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }
18. Code-path problem explodes

Front End:
    try { call_middle_Tier(params) }
    catch (exception e) { WriteEventLogEvent(101, E_ExceptionWeb); throw; }

Middle Tier:
    try { call_DAL(params) }
    catch (exception e) { WriteEventLogEvent(102, E_); throw; }

DAL: BAM
- One failure in the DAL now surfaces as a different catch-all event at every tier
19. Failure Modes
- Failure modes are predictable causes
  - Configuration
    - Config file missing
    - Config file permissions
    - Config file corrupt, no defaults
    - Connect string incorrect
  - Database
    - DB availability: database is offline
    - DB permissions: log-in denied
    - DB permissions: execute permission on sproc denied
    - DB data error
  - Environment
    - Network: DNS lookup fails
    - Network: ACL issues (looks a lot like DB availability)
- Instrument these
  - Unique event per root cause / predicted problem
  - Diagnostic event logging for everything else
  - Context is key: know the source; who has the context to diagnose the problem?
  - These go into the management pack as trouble signals

    public IData getBusinessData( parameters ) {
        try {
            mConfig.open(mConfigPath);
            conn = connectToDB(mConfig.ConnectString);
            data = conn.getDataFromDb(sproc, parameters);
            return data;
        } catch (exception e) {
            WriteEventLogEvent(100, E_ExceptionInDal);
            throw;
        }
    }
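Restated in runnable form, the contrast with the single catch-all event 100 is that each predicted root cause gets its own event before the exception propagates. A Python sketch; the event IDs, names, and `connect`/`query` callables are illustrative, not real OpsMgr or event-log APIs:

```python
# Hypothetical event IDs: one per predicted root cause.
E_CONFIG_MISSING = 200
E_DB_UNAVAILABLE = 202

EVENT_LOG = []  # stand-in for the real event log

def write_event(event_id, detail):
    EVENT_LOG.append((event_id, detail))

def get_business_data(config_path, connect, query):
    try:
        with open(config_path) as f:          # failure mode: config missing
            connect_string = f.read().strip()
    except FileNotFoundError:
        write_event(E_CONFIG_MISSING, config_path)
        raise
    try:
        conn = connect(connect_string)        # failure mode: DB offline
    except ConnectionError as exc:
        write_event(E_DB_UNAVAILABLE, str(exc))
        raise
    return query(conn)
```

An operator seeing event 200 knows which file to restore; event 100 above only says "something in the DAL threw."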
20. Failure-mode analysis: key message
- Debug instrumentation is high cost to customers
  - Better than NO instrumentation
  - MMD health model is critical to help debug at low maturity levels
  - Contextual failure alerting is predictive (level 4 and up)
- Management pack alerts should be actionable
  - "It might mean A or B" is not a starting point
  - There are only 5 essential operator actions (locked-down environment):
    - Reboot host
    - Stop/start service
    - Run maintenance task
    - Add more or reduce overall capacity (e.g. shut it off)
    - Change configuration
  - Instrumentation should identify the cause and map it to an action
  - Manual diagnosis drives additional expense
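If instrumentation identifies the cause, mapping an alert to one of those five actions can be a lookup rather than a diagnosis. A sketch in Python; the event IDs and the cause comments are hypothetical:

```python
# Hypothetical root-cause events mapped to the five operator actions.
ACTION_FOR_EVENT = {
    200: "change configuration",   # config file missing
    202: "stop/start service",     # service wedged after DB outage
    204: "run maintenance task",   # database nearly full
    205: "add or reduce capacity", # sustained overload
    206: "reboot host",            # unrecoverable host state
}

def recommended_action(event_id):
    """An actionable alert maps to exactly one operator action."""
    return ACTION_FOR_EVENT.get(event_id,
                                "escalate: manual diagnosis required")
```

Anything that falls through to the escalation branch is, by this slide's argument, added expense.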
21. Learning more
- Templates
  - Failure Mode Analysis Template
- White papers
  - Introduction to Operations
  - Thinking Operationally: why we measure