Implementing Service Monitoring in a Business Context

Slides: 70

Provided by: jasonf9


1
Implementing Service Monitoring in a Business
Context

Tivoli User Group

2
Agenda

Session 1
Phase 0 Assessment & Planning
Phase 1 Foundation
Phase 2 Discovery
Session 2
Phase 3 Service Visibility

3
Phase 0 Assessment
4
The cost of Service Outage

Revenue lost for transactions that fail
Directly impacts profitability, every minute of
every outage (Downtime cost for typical retail
outlet is 7,800 per minute (IDC))
Dissatisfied customers go to competitors, and
potentially never return
Tests at Amazon revealed every 100 ms increase in
load time of Amazon.com decreased sales by 1%
Google found that moving from a 10-result page
loading in 0.4 seconds to a 30-result page
loading in 0.9 seconds decreased traffic and ad
revenues by 20% (Linden 2006).

5
Cost of Service Outage contd.

Business reputation is damaged
Even customers not directly impacted by outage
may hear about the poor service
The Business vilifies IT when it happens
But still change is pushed through with little
consideration of operational needs
We need to break this cycle, and Business Service
Management can be a catalyst

6
How to Start?

Firstly Identify Stakeholders
Critical to ultimate success
Starting in the right way is key to a good
outcome
Set realistic goals and expectations
Don't be drawn into trying to boil the ocean
Work by service, step by step
See the IT service in its Business Context
View our services based on how they support the
overall organisation and its end customers
Essential in converting Business Managers to see
IT as a key enabler, not a cost burden

7
Identify the Service

What constitutes the service?
Even relatively simple Services cross traditional
IT silo boundaries
IT Infrastructure
Servers & Clients
Middleware (Web, Database, Messaging)
Applications
Storage
Security
Processes (Business & IT)
People
i.e. everything supporting transactions
BSM tools help span the boundaries

8
Example Service

For this presentation we will use a standard
e-business application called PlantsByWebSphere

9
PlantsByWebSphere

PlantsByWebSphere is a standard e-business
application. It has:
1 Load Balancer (Linux server)
2 Web Servers (Linux Apache and Windows IIS)
2 WebSphere Servers (1 Unix and 1 Windows acting
as a pair)
1 DB2 Database (Linux)

10
First Steps

Now we have our service we can use the following
structured approach
Review (or establish) SLAs & supporting OLAs
Pull out the Important Business KPIs
Agree relative priority of these
Articulate the KPIs as a measurable IT function
Establish the components that support the
end-to-end Service
Monitor, report, improve

11
Service Level Agreements

Not an essential pre-requisite to BSM
BSM solutions still deliver value without formal
SLAs
Existence will increase the value of the tool
Everyone understands what we are striving to
deliver, and why
Knowing SLA (and thus Business) impact of
incidents helps with recovery prioritisation
Targets (and cost of failure) are clear to all
Facilitates identification of KPIs

12
Service Level Agreements

Useful framework for dialogue with Business
A common understanding of which IT Services are
Business Critical
Establish reasonable, achievable, and measurable
availability and performance targets
Evaluate cost of meeting targets, and business
appetite to invest in suitable infrastructure
Vehicle to justify suitable Test and DR
environments
High availability is expensive and not always
justifiable

13
Objective Measurement

An SLA target that cannot be objectively
qualified is pointless
Directly relate KPIs to IT infrastructure, then
measure and report in Business context
Get the foundations right and build up
Cover all elements of a specific service
Complete coverage of one IT component has little
value to Service Monitoring if others are missing

14
PlantsByWebSphere SLA

Luckily for this exercise we already have an SLA!
Under the SLA, the PlantsByWebSphere service is
allowed 15 minutes of downtime in a calendar month
After this, each further minute of downtime will
cost 100
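The penalty clause can be written as a simple formula. This is a sketch of the arithmetic only; the symbols d (minutes of downtime in the calendar month) and P (penalty, in whatever currency the SLA uses) are introduced here for illustration.

```latex
P(d) = 100 \cdot \max(0,\; d - 15)
```

For example, 20 minutes of downtime in one month gives P = 100 x 5 = 500, while 15 minutes or less gives no penalty.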

15
Key Performance Indicators

Secondly we need our KPIs
KPIs should reflect the Service SLAs in place
They need to be Business focused, objective, and
achievable
It must be possible to monitor and measure IT
component(s) that equate to the KPI
Relative importance should be agreed
Baseline of current performance advantageous

16
PBW Availability KPIs
17
PBW Performance KPIs
18
Identify the End to End Service Infrastructure

BSM solutions are most effective when all
components of a service are monitored
How do we identify them?
Asset Registry
Discovery Tools
CMDB
Bob (who's worked here for years)
The truth is probably a combination of the above

19
End to End Service Infrastructure

Having identified the Service components
Verify which are monitored
Close any gaps in cover
Ensure KPI Monitors are in place
Integrate with the BSM layer
Display in appropriate views (IT and Business)
Don't forget the supporting processes
Incident, Problem, and Change Management
Capacity Planning
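Verifying which components are monitored and closing gaps is essentially a set comparison. A minimal sketch, assuming the component list comes from the discovery step and the monitored list from the monitoring tool; the component names below are illustrative labels for the PlantsByWebSphere tiers, not real host names.

```python
def monitoring_gaps(service_components, monitored):
    """Return the service components that have no monitor, sorted by name."""
    return sorted(set(service_components) - set(monitored))

# Illustrative labels for the PlantsByWebSphere tiers (assumed, not real hosts).
pbw_components = [
    "load-balancer", "web-apache", "web-iis",
    "was-unix", "was-windows", "db2",
]
# Assumed output of "which hosts already have an agent?"
monitored_now = ["load-balancer", "web-apache", "was-unix", "db2"]

gaps = monitoring_gaps(pbw_components, monitored_now)
print(gaps)
```

Anything left in the gap list is a component that can fail without the service view noticing.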

20
Monitor, Report, Improve

Even if everything goes perfectly, you will miss
something (whatever Bob says)
BSM is an iterative process, as there is always
change
Ensure your processes consider your monitoring
solution (Change, Release)
BSM should be scoped in all new projects
Revisit SLAs & KPIs regularly to keep them
relevant

21
Assessment Deliverables

Validate stakeholders and business drivers
Vital to an effective implementation project
Establish service components and KPIs
What to monitor and why
Produce high level solution design
How to monitor and visualise results
Common understanding of costs
Hardware, Software, Services, Operation

22
Assessment Deliverables

Investment justification (Business Case)
Benefits and milestones (ROI)
Agree priorities and phasing
Structure project to deliver early Business Value
Process and Organisational change
The crucial elements that wrap around the
technology implementation
Improved communication and understanding
IT and Business working to common goals

23
Phase 1 Foundation

Simon Barnes

24
Foundation

The Foundation phase uses the deliverables from
the Assessment and establishes core capability to
manage a service end to end
This phase focuses on the capability to provide
base monitoring of all components supporting a
service including
Server
Network
Applications

25
What is included with the IBM BSM Foundation
Solution?
Service Infrastructure
Other
Storage
Business
Systems
Applications
Voice
Mainframe
Network
Security
Wireless
26
What is the aim?

The aim of this phase is to reduce MTTR
That is, the time from when a problem occurs
until it is fixed
80% of MTTR is spent in the Diagnose/Isolate
phase (IDC)
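To track whether MTTR is actually coming down, it helps to compute it the same way every month. A minimal sketch, assuming each incident is recorded as an (occurred, resolved) pair; the timestamps are made-up sample data.

```python
from datetime import datetime, timedelta

def mean_time_to_repair(incidents):
    """Average of (resolved - occurred) over (occurred, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - occurred for occurred, resolved in incidents),
                timedelta())
    return total / len(incidents)

# Made-up sample incidents for illustration only.
incidents = [
    (datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 40)),      # 40 min
    (datetime(2024, 1, 12, 14, 0), datetime(2024, 1, 12, 14, 20)),  # 20 min
]

mttr = mean_time_to_repair(incidents)
# Share of that time spent diagnosing, using the 80% figure quoted above.
diagnose_share = 0.8 * mttr
print(mttr, diagnose_share)
```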

27
Why is this so hard?

How do customers manage their IT infrastructure?

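[Diagram: the incident "Customer unable to place new online order" shown against separate management tools (HP NNM, NetIQ, Microsoft MOM, Tivoli Monitoring, NetView/z) and the areas involved: SOA, Intranet, Oracle, Mainframe, Billing, Web server, Network infrastructure, Security, Customer]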
28
Key Performance Indicators

At this stage we need to know the KPIs
It must be possible to monitor and measure IT
components that equate to the KPI
Our KPIs are
Performance
End to End Transaction Time (Load Balancer ->
Web Server -> WebSphere -> Database)
Component Transaction Time (WebSphere, WebServer,
DB2)
Client Transaction Time (Speed for client to open
application)
Availability
Client Web Site Availability (Failed open
requests)
Component unavailability due to critical failure
(component failure)
Number of Severity One Service IDs

29
Server KPIs

Our Server KPIs are
Availability
Component unavailability due to critical failure
(component failure)
To this end we will monitor
Server Availability
Critical Server Components (Disk, Memory,
Processor)
Critical Processes
However DO NOT monitor too much
As a rule only create alerts for things that have
a corrective action
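The "corrective action" rule can be enforced mechanically when alert definitions are reviewed. A minimal sketch, assuming alert rules are kept as simple records; the rule names, conditions and actions are invented for illustration and are not ITM situation definitions.

```python
def actionable(rules):
    """Keep only alert rules that name a corrective action."""
    return [rule for rule in rules if rule.get("corrective_action")]

# Invented example rules; not taken from any product.
rules = [
    {"name": "disk_full", "condition": "disk used > 95%",
     "corrective_action": "purge old logs or extend the volume"},
    {"name": "cpu_spike", "condition": "cpu > 80% for 1 min",
     "corrective_action": None},
    {"name": "httpd_down", "condition": "web server process missing",
     "corrective_action": "restart the web server"},
]

alerts_to_deploy = [rule["name"] for rule in actionable(rules)]
print(alerts_to_deploy)
```

Rules that fail the check can still be collected as metrics for reporting; they just should not raise alerts.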

30
PlantsByWebSphere

The first aspect of our foundation is to add
Server Monitoring to capture component problems
Server Monitoring is provided by adding a base OS
IBM Tivoli Monitoring 6.2.1 agent to all key
components

31
Network KPIs

Our small example does not have any Network
specific KPIs
However in a normal environment you will need to
know all the network components that affect a
service
Such as
Load Balancers
Switches
Firewalls
Network Monitoring is provided by IBM Tivoli
Network Manager

32
Application Monitoring

Application Monitoring can take many forms
Specialist Agents such as DB2, WebSphere, Web
Server
Logfile and Process Monitoring
SNMP Monitoring
Custom Agent
End to End Transaction Monitoring

33
Why End to End?
End-user experience
High-performing resources don't always translate
into high-performing applications
34
Key Performance Indicators

Application monitoring is the key phase to any
service management project as almost all KPIs are
measured in this way
To this end our KPIs are
Performance
End to End Transaction Time (Load Balancer ->
Web Server -> WebSphere -> Database)
Component Transaction Time
Client Transaction Time
Availability
Client Web Site Availability
Component unavailability due to critical failure
(component failure)

Tools shown against these KPIs on the slide: ITCAM
for RT, ITCAM for Web Resources, ITM for Databases
35
PBW Application Monitoring?
Web Response
36
ITCAM for RT
Any GUI client or Web Transaction can be recorded
and uploaded
37
Client Response Agent

Responds to requests for data from a client
Lots of out of the box integrations
You can record your own to listen for the start
and end API calls of any application

38
Web Response Agent

The Web Response Time agent collects user
response time for HTTP and HTTPS Web
transactions.
HTTP traffic - agent listens locally to TCP/IP
stack and measures the response time of the
transaction.
HTTPS traffic - as it needs to access an
unencrypted HTTP data stream, the agent runs on
the Web server machine and makes use of the Web
server exits to get access to the data stream.
Appliance mode - allows the agent to collect HTTP
traffic from other machines in the same network
segment by enabling collection of network packets
in promiscuous mode.

39
Correlation

Correlation should be included at the beginning
so that root cause identification can be
accelerated from the outset
This can then be added to as a process of
continuous improvement
In the case of the PlantsByWebSphere application,
correlation is provided by OMNIbus
OMNIbus automatically provides features such as
event de-duplication
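De-duplication means repeated occurrences of the same problem update one row instead of creating new ones. The sketch below illustrates the idea only; it is not OMNIbus code, and the event fields (node, summary, time) and the choice of identity key are assumptions made for the example.

```python
def deduplicate(events):
    """Collapse events with the same identity into one row with a tally."""
    rows = {}
    for event in events:
        # Assumed identity of an event: same node and same summary text.
        key = (event["node"], event["summary"])
        if key in rows:
            rows[key]["tally"] += 1
            rows[key]["last_seen"] = event["time"]
        else:
            rows[key] = {
                "node": event["node"],
                "summary": event["summary"],
                "tally": 1,
                "first_seen": event["time"],
                "last_seen": event["time"],
            }
    return list(rows.values())

# Made-up event stream: three repeats of one problem plus one other event.
events = [
    {"node": "web-iis", "summary": "HTTP 500 rate high", "time": 100},
    {"node": "web-iis", "summary": "HTTP 500 rate high", "time": 160},
    {"node": "db2", "summary": "Tablespace 90% full", "time": 170},
    {"node": "web-iis", "summary": "HTTP 500 rate high", "time": 220},
]

rows = deduplicate(events)
for row in rows:
    print(row)
```

Operators then see two rows, one of them with a tally of 3, rather than four separate alerts.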

40
Tivoli Business Bottom-Line Effect
Netcool Advanced Data Processing Delivers
Business Assurance and Increased Operations
Efficiency Through Massive Event Reduction and
Prioritization
[Chart: number of events (>10M, >1k, >100, >10) against Degree of Netcool Advanced Data Processing Implemented]
Stages shown on the chart
OMNIbus: Event Collection/Consolidation, Maximum
Event Generation (Probes and Monitors)
OMNIbus: Probe & Monitor Level Event Filtering &
Suppression
OMNIbus: Auto Clear Events, Event De-duplication,
State-based Correlation, Automated Resolution,
Device-based RCA
ITNM Integration: Topology-Based RCA
41
ITNM Root Cause

Automated discovery and graphic topology
Devices
Device relationships
Real time status and event management
Events and their impacts
Root-cause analysis (RCA)

42
Event Reduction

Because of root cause analysis the number of
events is reduced greatly

Event Reduction 14:1
Event Reduction 52:1
The failed device becomes the root cause for all
connectivity events
The failed device becomes the root cause for all
connectivity events on isolated devices
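The reduction comes from using the discovered topology to separate the failed device from the devices that are merely cut off behind it. A minimal sketch of that idea, not ITNM's algorithm: walk the topology from the polling station, treat the first unreachable device on each path as a root cause, and everything else reporting loss of connectivity as a symptom. The topology and device names are invented.

```python
from collections import deque

def classify_events(topology, poller, down_devices):
    """Split devices with connectivity events into root causes and symptoms.

    topology: dict of device -> list of directly connected devices
    poller: the device monitoring is performed from
    down_devices: devices that currently have a connectivity event
    """
    reachable = {poller}
    queue = deque([poller])
    roots = set()
    while queue:
        device = queue.popleft()
        for neighbour in topology.get(device, []):
            if neighbour in down_devices:
                # First unreachable device on a path from the poller.
                roots.add(neighbour)
            elif neighbour not in reachable:
                reachable.add(neighbour)
                queue.append(neighbour)
    symptoms = set(down_devices) - roots
    return roots, symptoms

# Invented topology: poller - core - two switches - servers.
topology = {
    "poller": ["core"],
    "core": ["poller", "sw1", "sw2"],
    "sw1": ["core", "web1", "web2"],
    "sw2": ["core", "db1"],
    "web1": ["sw1"],
    "web2": ["sw1"],
    "db1": ["sw2"],
}
down = {"sw1", "web1", "web2"}

roots, symptoms = classify_events(topology, "poller", down)
print(roots, symptoms)
```

Here three connectivity events collapse to one root cause (sw1), with web1 and web2 suppressed as isolated devices.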
43
Phase 2 Discovery

Simon Barnes

44
TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
Computer System

Understand what you have

Application Mapping with Dependencies
Agent-less and Credential-free
Discover interdependencies between applications,
middleware, servers and network components

Switch
Infrastructure Application
Business Application
45
TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to

Learn how your CIs are configured (and changing
over time)

Configuration Auditing
Tracks changes in applications
Depicts that information on the map
Depicts that information through reports

Automatically tracks changes to all CI
attribute values over time
Application
46
TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to

Determine if it is compliant

Comparing two instances of an Apache Web Server
to the reference master

Compliance
Compare configuration to reference master
Compare to your standard policy

Values in red and blue are policy violations
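Comparing an instance to a reference master is, at its core, a key-by-key difference of configuration values. A minimal sketch of that comparison, not TADDM's implementation; the settings use real Apache directive names but the values are invented for the example.

```python
def policy_violations(reference, instance):
    """Return {setting: (expected, actual)} for each setting that differs."""
    violations = {}
    for key in sorted(set(reference) | set(instance)):
        expected = reference.get(key)   # None if not in the reference
        actual = instance.get(key)      # None if missing on the instance
        if expected != actual:
            violations[key] = (expected, actual)
    return violations

# Invented values for illustration.
reference_master = {"KeepAlive": "On", "Listen": "80", "MaxClients": "150"}
web_server_a = {"KeepAlive": "On", "Listen": "80", "MaxClients": "150"}
web_server_b = {"KeepAlive": "Off", "Listen": "80"}

print(policy_violations(reference_master, web_server_a))
print(policy_violations(reference_master, web_server_b))
```

The second server reports two violations: a changed value (KeepAlive) and a missing setting (MaxClients).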
47
And what does this mean?

Reduce MTTR
Accurate and comprehensive cross-tier service
visibility
Deep configuration details and interdependencies
Change history data to identify and isolate
application changes.
Improve Operational Efficiency
Make decisions based on accurate operational data
Enhance business availability
Align IT infrastructure with the business through
discovery automation

48
ITM and TADDM Integration

Simon Barnes

49
Monitoring Coverage

You can now use TADDM to display ITM 6 monitoring
coverage

50
Launch In Context to search portal

With 7.1.2, LIC to the search portal handles LTPA
(Lightweight Third Party Authentication) tokens,
enabling SSO to occur.

51
ITNM and TADDM Integration

Simon Barnes

52
Integration Overview
Network Resources
ITNM Discovery
TADDM Discovery
TADDM ITNM GUI
Resource Relationship Data
IDML Book
Bulk Loader
DLA (Discovery Library Adapter)
53
End of Part 1
54
Phase 3 Service Visibility

Simon Barnes

55
Service Visibility

In this second part we will learn how to offer
visibility of the availability and performance of
specific business services to different business
units using Tivoli Business Services Manager
(TBSM)
Also we will show how we can monitor and measure
these against business oriented metrics e.g.
volume of transactions, revenue flow.

56
IBM Tivoli Business Service Manager
TBSM enables a service-centric approach to
management

Capabilities
Custom business views & dashboards
View key performance indicators (KPIs)
Model any service
Service status/health from external sources
Track real-time Service Level Agreements
Advanced numeric rules for calculations
Service definition from CMDB/inventory
Tight BSM product integration
ITCAM for ISM, ITM
TADDM, TSLA
OMNIbus, TEC
Can add value to non-Tivoli monitored
environments!

57
Event Sources
Status
58
Discovery Dependencies
Structure
59
Business Data Processes
Status and Structure
60
How does it work?

TBSM uses service models
A service model describes dependencies of units
of operation within an organization.
Model elements can describe hardware, software,
business processes, transactions and others.
Models can be built manually in TBSM or built
automatically from event and/or external data.

61
Templates and Services

Templates define how Service Instances will
behave.
Services are instantiations of templates.
Web Server Template -> WebServer1
There are multiple ways that instances can be
created in TBSM
Manual configuration via graphical user interface
(GUI).
Auto-population based on events
RAD API.
Data Fetcher or External Service Dependency
Adapter (ESDA) for auto population from an
external source.

62
PlantsByWebSphere
63
Service Status from Events

Services can derive status from incoming OMNIbus
events.
Hundreds of event feeds through OMNIbus!
Propagation of status in a service tree is
defined in template dependency rules.
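Status propagation can be pictured as each service taking its status from its own events and from its children. The sketch below assumes the simplest possible dependency rule, "worst child wins"; real template rules are configurable, so treat this as an illustration. The service tree and names are invented for PlantsByWebSphere.

```python
SEVERITY = {"good": 0, "marginal": 1, "bad": 2}

def service_status(service, children, event_status):
    """Worst of the service's own event status and its children's statuses."""
    statuses = [event_status.get(service, "good")]
    for child in children.get(service, []):
        statuses.append(service_status(child, children, event_status))
    return max(statuses, key=SEVERITY.__getitem__)

# Invented service tree for the example application.
children = {
    "PlantsByWebSphere": ["Web Tier", "App Tier", "Database"],
    "Web Tier": ["Apache", "IIS"],
    "App Tier": ["WAS Unix", "WAS Windows"],
    "Database": ["DB2"],
}

# Status per leaf as derived from incoming events (made-up values).
one_warning = {"IIS": "marginal"}
db_failure = {"IIS": "marginal", "DB2": "bad"}

print(service_status("PlantsByWebSphere", children, one_warning))
print(service_status("PlantsByWebSphere", children, db_failure))
```

A real deployment would more likely use rules such as "bad only if both web servers are bad", which is why the rule lives in the template rather than in each instance.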

64
Service Status from Business Sources: Data
Fetchers

A Data Fetcher is a database poller used to
Retrieve key performance indicator values
Drive status of service instances (similar to
events)
Retrieved data can be used with numerical health
calculations
Useful feed of KPIs for scorecard/dashboard views
Data can be used in SLA calculations
Extends auto-population feature

[Diagram: Data Fetcher returning rows for WebFarm3]
WebServer6   TroubleTkts 2
WebServer13  TroubleTkts 7
WebServer21  TroubleTkts 0
WebServer15  TroubleTkts 4
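The polling idea can be sketched with an in-memory database standing in for the business data source. This is an illustration, not TBSM configuration: the table name, column names and status thresholds are assumptions, while the four rows are the ones shown on the slide.

```python
import sqlite3

def fetch_ticket_counts(conn):
    """Poll the ticket table and return {server: open ticket count}."""
    return dict(conn.execute("SELECT server, trouble_tkts FROM tickets"))

def status_from_tickets(count, marginal_at=3, bad_at=6):
    """Map a ticket count to a status; thresholds are assumed values."""
    if count >= bad_at:
        return "bad"
    if count >= marginal_at:
        return "marginal"
    return "good"

# Stand-in for the external data source (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tickets (server TEXT PRIMARY KEY, trouble_tkts INTEGER)"
)
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("WebServer6", 2), ("WebServer13", 7),
     ("WebServer21", 0), ("WebServer15", 4)],
)

statuses = {
    server: status_from_tickets(count)
    for server, count in fetch_ticket_counts(conn).items()
}
print(statuses)
```

Each returned row can both set the status of an existing service instance and, where auto-population is used, create an instance that does not exist yet.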
65
Numeric Modeling

Metrics can be associated with a service and
optionally be used to determine status.
e.g. Web server response time
Dependencies can be based on numerics
(aggregations).
e.g. Web farm response time as the calculated
average response time for each web server in the
web farm
Predefined aggregation calculations
Average
Maximum, Minimum
Percentile
Sum
Use Netcool/Impact policies to create customized
calculations.
Configure a status threshold that will result in
the service designated as Good/Bad/Marginal based
on the value of the numeric rule.
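The web farm example can be worked through in a few lines: aggregate the child metric, then apply thresholds to get a status. A minimal sketch, assuming response times in milliseconds; the server names, readings and threshold values are invented.

```python
from statistics import mean

def farm_response_time(server_times_ms):
    """Aggregate rule: average response time across the web farm."""
    return mean(server_times_ms.values())

def status_for(value, marginal_above, bad_above):
    """Threshold rule mapping a numeric value to Good/Marginal/Bad."""
    if value > bad_above:
        return "Bad"
    if value > marginal_above:
        return "Marginal"
    return "Good"

# Invented readings in milliseconds.
server_times_ms = {"WebServer1": 400, "WebServer2": 900, "WebServer3": 500}

farm_avg = farm_response_time(server_times_ms)
farm_status = status_for(farm_avg, marginal_above=500, bad_above=800)
print(farm_avg, farm_status)
```

Swapping mean for max would make the farm as bad as its slowest server, which is the kind of choice the predefined aggregations (Average, Maximum, Minimum, Percentile, Sum) expose.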

66
Product Integration - Discovery
TBSM
TADDM
Application Maps (IDML)
Configuration and change history query
Application Detailed Configuration and Change
History Data
Business Systems View

TBSM / TADDM Value Proposition
Accurate and comprehensive application visibility
Cross-tier application topology
Deep configuration details and interdependencies
Automatically create and maintain
application/service groupings
Identify and isolate application changes to
dramatically reduce MTTR

67
Service Level Agreements

Service Level Agreements
Can be defined for
Services
Applications
Devices
3 Types of SLAs
Instance
Cumulative
Violation Count

SLA Metrics
Availability
Downtime (MTTR)
Penalties

68
PlantsByWebSphere SLA


The PlantsByWebSphere SLA is set on the number of
minutes down (cumulative duration) in a calendar
month (1st to 1st).
The service is allowed to be down for 15 minutes.
We would like a warning after 10 minutes.
We also want a penalty of 100 per minute down
beyond the allowance.
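These three settings (allowance, warning point, penalty rate) are enough to evaluate the SLA for any month. A minimal sketch of that evaluation, not TBSM's SLA engine; it assumes the warning starts at exactly 10 minutes and that the violation starts once downtime exceeds 15 minutes.

```python
def evaluate_sla(minutes_down, allowed=15, warn_at=10, penalty_per_minute=100):
    """Return (state, penalty) for cumulative downtime in one calendar month."""
    if minutes_down > allowed:
        # Only the minutes beyond the allowance are charged.
        return "violated", (minutes_down - allowed) * penalty_per_minute
    if minutes_down >= warn_at:
        return "warning", 0
    return "ok", 0

for minutes in (5, 12, 15, 20):
    print(minutes, evaluate_sla(minutes))
```

Because the SLA is cumulative, minutes_down is the running total of all outages since the 1st of the month, not the length of the latest outage.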
