Title: Implementing Service Monitoring in a Business Context
1Implementing Service Monitoring in a Business
Context
2Agenda
- Session 1
- Phase 0 Assessment Planning
- Phase 1 Foundation
- Phase 2 Discovery
- Session 2
- Phase 3 Service Visibility
3Phase 0 Assessment
4The cost of Service Outage
- Revenue lost for transactions that fail
- Directly impacts profitability, every minute of every outage (downtime cost for a typical retail outlet is 7,800 per minute (IDC))
- Dissatisfied customers go to competitors, and potentially never return
- Tests at Amazon revealed that every 100 ms increase in load time of Amazon.com decreased sales by 1%
- Google found that moving from a 10-result page loading in 0.4 seconds to a 30-result page loading in 0.9 seconds decreased traffic and ad revenues by 20% (Linden 2006).
5Cost of Service Outage contd.
- Business reputation is damaged
- Even customers not directly impacted by the outage may hear about the poor service
- The Business vilifies IT when it happens
- But change is still pushed through with little consideration of operational needs
- We need to break this cycle, and Business Service Management can be a catalyst
6How to Start?
- Firstly, identify Stakeholders
- Critical to ultimate success
- Starting in the right way is key to a good outcome
- Set realistic goals and expectations
- Don't be drawn into trying to boil the ocean
- Work by service, step by step
- See the IT service in its Business Context
- View our services based on how they support the overall organisation and its end customers
- Essential in converting Business Managers to see IT as a key enabler, not a cost burden
7Identify the Service
- What constitutes the service?
- Even relatively simple Services cross traditional IT silo boundaries
- IT Infrastructure
- Servers and Clients
- Middleware (Web, Database, Messaging)
- Applications
- Storage
- Security
- Processes (Business and IT)
- People
- I.e. everything supporting transactions
- BSM tools help span the boundaries
8Example Service
- For this presentation we will use a standard e-business application called PlantsByWebSphere
9PlantsByWebSphere
- PlantsByWebSphere is a standard e-business application. It has:
- 1 Load Balancer (Linux server)
- 2 Web Servers (Linux Apache and Windows IIS)
- 2 WebSphere Servers (1 Unix and 1 Windows acting as a pair)
- 1 DB2 Database (Linux)
10First Steps
- Now that we have our service we can use the following structured approach:
- Review (or establish) SLAs and supporting OLAs
- Pull out the important Business KPIs
- Agree relative priority of these
- Articulate the KPIs as a measurable IT function
- Establish the components that support the end-to-end Service
- Monitor, report, improve
11Service Level Agreements
- Not an essential prerequisite to BSM
- BSM solutions still deliver value without formal SLAs
- Existence will increase the value of the tool
- Everyone understands what we are striving to deliver, and why
- Knowing the SLA (and thus Business) impact of incidents helps with recovery prioritisation
- Targets (and cost of failure) are clear to all
- Facilitates identification of KPIs
12Service Level Agreements
- Useful framework for dialogue with the Business
- A common understanding of which IT Services are Business Critical
- Establish reasonable, achievable, and measurable availability and performance targets
- Evaluate cost of meeting targets, and business appetite to invest in suitable infrastructure
- Vehicle to justify suitable Test and DR environments
- High availability is expensive and not always justifiable
13Objective Measurement
- An SLA target that cannot be objectively measured is pointless
- Directly relate KPIs to IT infrastructure, then measure and report in Business context
- Get the foundations right and build up
- Cover all elements of a specific service
- Complete coverage of one IT component has little value to Service Monitoring if others are missing
14PlantsByWebSphere SLA
- Luckily for this exercise we already have an SLA!
- The PlantsByWebSphere service is allowed 15 minutes of downtime in a calendar month
- After this, each further minute of downtime will cost 100
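As a worked illustration of this SLA (a sketch, not part of the original deck), the helper below converts the 15-minute allowance into the availability percentage it implies and computes the penalty for downtime beyond the allowance. The function names and the 30-day month are assumptions; the currency unit is left unspecified, as on the slide.

```python
ALLOWED_DOWNTIME_MIN = 15   # from the SLA: allowed minutes per calendar month
PENALTY_PER_MIN = 100       # from the SLA: cost of each minute beyond that


def availability_target(days_in_month: int = 30) -> float:
    """Availability (in percent) implied by the monthly downtime allowance."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - ALLOWED_DOWNTIME_MIN) / total_minutes


def penalty(downtime_min: int) -> int:
    """Penalty owed for a month with `downtime_min` minutes of downtime."""
    return max(0, downtime_min - ALLOWED_DOWNTIME_MIN) * PENALTY_PER_MIN


if __name__ == "__main__":
    print(f"Implied availability target: {availability_target():.3f}%")
    print("Penalty for 22 minutes down:", penalty(22))
```

Under these assumptions, 15 minutes in a 30-day month corresponds to roughly 99.965% availability, and 22 minutes of downtime would cost 700.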
15Key Performance Indicators
- Secondly we need our KPIs
- KPIs should reflect the Service SLAs in place
- They need to be Business focused, objective, and achievable
- It must be possible to monitor and measure IT component(s) that equate to the KPI
- Relative importance should be agreed
- A baseline of current performance is advantageous
16PBW Availability KPIs
17PBW Performance KPIs
18Identify the End to End Service Infrastructure
- BSM solutions are most effective when all components of a service are monitored
- How do we identify them?
- Asset Registry
- Discovery Tools
- CMDB
- Bob (who's worked here for years)
- The truth is probably a combination of the above
19End to End Service Infrastructure
- Having identified the Service components
- Verify which are monitored
- Close any gaps in cover
- Ensure KPI Monitors are in place
- Integrate with the BSM layer
- Display in appropriate views (IT and Business)
- Don't forget the supporting processes
- Incident, Problem, and Change Management
- Capacity Planning
20Monitor, Report, Improve
- Even if everything goes perfectly, you will miss something (whatever Bob says)
- BSM is an iterative process, as there is always change
- Ensure your processes consider your monitoring solution (Change, Release)
- BSM should be scoped in all new projects
- Revisit SLAs and KPIs regularly to keep them relevant
21Assessment Deliverables
- Validate stakeholders and business drivers
- Vital to an effective implementation project
- Establish service components and KPIs
- What to monitor and why
- Produce high level solution design
- How to monitor and visualise results
- Common understanding of costs
- Hardware, Software, Services, Operation
22Assessment Deliverables
- Investment justification (Business Case)
- Benefits and milestones (ROI)
- Agree priorities and phasing
- Structure project to deliver early Business Value
- Process and Organisational change
- The crucial elements that wrap around the technology implementation
- Improved communication and understanding
- IT and Business working to common goals
23Phase 1 Foundation
24Foundation
- The Foundation phase uses the deliverables from the Assessment and establishes core capability to manage a service end to end
- This phase focuses on the capability to provide base monitoring of all components supporting a service, including:
- Server
- Network
- Applications
25What is included with the IBM BSM Foundation
Solution?
[Diagram: Service Infrastructure domains: Systems, Applications, Network, Storage, Security, Mainframe, Voice, Wireless, Business, Other]
26What is the aim?
- The aim of this phase is to reduce MTTR (mean time to repair)
- That is, the time from when a problem occurs until it is fixed
- 80% of MTTR is spent in the Diagnose/Isolate phase (IDC)
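To make the definition concrete, here is a minimal sketch of how MTTR could be computed from incident records. The record shape (pairs of "occurred" and "fixed" timestamps) and the sample incidents are assumptions for illustration only, not measured data.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average of (fixed - occurred) over a list of (occurred, fixed) pairs."""
    durations = [fixed - occurred for occurred, fixed in incidents]
    return sum(durations, timedelta()) / len(durations)


# Illustrative incidents only; these timestamps are made up.
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 40)),    # 40 minutes
    (datetime(2024, 1, 9, 14, 0), datetime(2024, 1, 9, 15, 20)),  # 80 minutes
]

if __name__ == "__main__":
    print("MTTR:", mean_time_to_repair(incidents))
```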
27Why is this so hard?
- How do customers manage their IT infrastructure?
[Diagram: monitoring tools (HP NNM, NetIQ, Microsoft MOM, Tivoli Monitoring, NetView/z); service components (SOA, Intranet, Oracle, Mainframe, Billing, Web server, Network infrastructure, Security); symptom reported by the Customer: "Customer unable to place new online order"]
28Key Performance Indicators
- At this stage we need to know the KPIs
- It must be possible to monitor and measure IT components that equate to the KPI
- Our KPIs are:
- Performance
- End to End Transaction Time (Load Balancer -> WebServer -> WebSphere -> Database)
- Component Transaction Time (WebSphere, WebServer, DB2)
- Client Transaction Time (Speed for client to open application)
- Availability
- Client Web Site Availability (Failed open requests)
- Component unavailability due to critical failure (component failure)
- Number of Severity One Service IDs
29Server KPIs
- Our Server KPIs are:
- Availability
- Component unavailability due to critical failure (component failure)
- To this end we will monitor:
- Server Availability
- Critical Server Components (Disk, Memory, Processor)
- Critical Processes
- However, DO NOT monitor too much
- As a rule, only create alerts for things that have a corrective action
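The rule above can be sketched as a simple filter over a monitor catalogue. The `Monitor` type, the monitor names and the actions are hypothetical and only illustrate the idea; this is not a Tivoli configuration format.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Monitor:
    name: str
    corrective_action: Optional[str]  # None means nobody can act on the alert


def alertable(monitors):
    """Keep only monitors whose alerts have a defined corrective action."""
    return [m for m in monitors if m.corrective_action]


# Hypothetical monitor catalogue for one server.
monitors = [
    Monitor("disk_space_low", "Purge old logs or extend the filesystem"),
    Monitor("critical_process_down", "Restart the process"),
    Monitor("cpu_above_50_percent", None),  # informational only, so no alert
]

if __name__ == "__main__":
    for m in alertable(monitors):
        print(f"{m.name}: {m.corrective_action}")
```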
30PlantsByWebSphere
- The first aspect of our foundation is to add Server Monitoring to capture component problems
- Server Monitoring is provided by adding a base OS IBM Tivoli Monitoring 6.2.1 agent to all key components
31Network KPIs
- Our small example does not have any Network-specific KPIs
- However, in a normal environment you will need to know all the network components that affect a service
- Such as:
- Load Balancers
- Switches
- Firewalls
- Network Monitoring is provided by IBM Tivoli Network Manager
32Application Monitoring
- Application Monitoring can take many forms
- Specialist Agents such as DB2, WebSphere, Web Server
- Logfile and Process Monitoring
- SNMP Monitoring
- Custom Agent
- End to End Transaction Monitoring
33Why End to End?
End-user experience
High-performing resources don't always translate into high-performing applications
34Key Performance Indicators
- Application monitoring is the key phase of any service management project, as almost all KPIs are measured in this way
- To this end our KPIs are:
- Performance
- End to End Transaction Time (Load Balancer -> WebServer -> WebSphere -> Database)
- Component Transaction Time
- Client Transaction Time
- Availability
- Client Web Site Availability
- Component unavailability due to critical failure (component failure)
[Slide callouts naming the monitoring tools, in slide order: ITCAM for RT; ITCAM for RT and ITCAM for Web Resources; ITCAM for RT; ITCAM for RT; ITM for Databases]
35PBW Application Monitoring?
[Diagram label: Web Response]
36ITCAM for RT
Any GUI client or Web Transaction can be recorded
and uploaded
37Client Response Agent
- Responds to requests for data from a client
- Lots of out of the box integrations
- You can record your own to listen for the start and end API calls of any application
38Web Response Agent
- The Web Response Time agent collects user response time for HTTP and HTTPS Web transactions.
- HTTP traffic - the agent listens locally to the TCP/IP stack and measures the response time of the transaction.
- HTTPS traffic - as it needs to access an unencrypted HTTP data stream, the agent runs on the Web server machine and makes use of the Web server exits to get access to the data stream.
- Appliance mode - allows the agent to collect HTTP traffic from other machines in the same network segment by enabling collection of network packets in promiscuous mode.
39Correlation
- Correlation should be included at the beginning so that root cause identification can be accelerated from the outset
- This can then be added to as a process of continuous improvement
- In the case of the PlantsByWebSphere application it is provided by OMNIbus
- This will automatically provide things like deduplication
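As a generic illustration of what deduplication means (not OMNIbus's actual implementation), the sketch below collapses repeated events into one row with a tally. The event fields and the choice of (node, summary) as the identity of an event are assumptions.

```python
def deduplicate(events):
    """Collapse repeated events into one row with a tally and last-seen time."""
    rows = {}
    for event in events:
        key = (event["node"], event["summary"])  # assumed identity of an event
        if key in rows:
            rows[key]["tally"] += 1
            rows[key]["last_seen"] = event["time"]
        else:
            rows[key] = {**event, "tally": 1, "last_seen": event["time"]}
    return list(rows.values())


# Hypothetical raw event stream: the first two events are duplicates.
events = [
    {"node": "web1", "summary": "Link down", "time": 100},
    {"node": "web1", "summary": "Link down", "time": 130},
    {"node": "db1", "summary": "Disk full", "time": 140},
]

if __name__ == "__main__":
    for row in deduplicate(events):
        print(row["node"], row["summary"], "x", row["tally"])
```

Operators then see two rows instead of three events, with the repeat count kept on the first row.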
40Tivoli Business Bottom-Line Effect
Netcool Advanced Data Processing delivers Business Assurance and increased Operations Efficiency through massive event reduction and prioritization.
[Chart: number of Events (>10M, >1k, >100, >10) against the degree of Netcool Advanced Data Processing implemented. Stages shown:
- OMNIbus: Event Collection/Consolidation, Maximum Event Generation, Probes and Monitors
- OMNIbus: Probe and Monitor Level Event Filtering and Suppression
- OMNIbus: Auto Clear Events, Event De-duplication, State-based Correlation, Automated Resolution, Device-based RCA
- ITNM Integration: Topology-Based RCA]
41ITNM Root Cause
- Automated discovery and graphic topology
- Devices
- Device relationships
- Real time status and event management
- Events and their impacts
- Root-cause analysis (RCA)
42Event Reduction
- Because of root cause analysis the number of events is reduced greatly
Event Reduction 14:1
Event Reduction 52:1
The failed device becomes the root cause for all connectivity events
The failed device becomes the root cause for all connectivity events on isolated devices
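The idea of topology-based root cause analysis can be sketched as a graph walk: a device that is down but adjacent to a still-reachable device is a root cause, and any down device hidden behind it is an isolated symptom. This is a simplified illustration, not ITNM's algorithm; the topology and device names are hypothetical.

```python
from collections import deque


def classify(topology, poller, down):
    """Split unreachable devices into root causes and isolated symptoms."""
    reachable, queue = {poller}, deque([poller])
    while queue:  # breadth-first walk over healthy devices only
        node = queue.popleft()
        for neighbour in topology[node]:
            if neighbour not in down and neighbour not in reachable:
                reachable.add(neighbour)
                queue.append(neighbour)
    # A down device next to a reachable one is where the outage starts.
    roots = {d for d in down if any(n in reachable for n in topology[d])}
    return roots, down - roots


# Hypothetical topology as an undirected adjacency list.
topology = {
    "poller": ["core"],
    "core": ["poller", "edge1", "edge2"],
    "edge1": ["core", "hostA", "hostB"],
    "edge2": ["core", "hostC"],
    "hostA": ["edge1"],
    "hostB": ["edge1"],
    "hostC": ["edge2"],
}

if __name__ == "__main__":
    roots, symptoms = classify(topology, "poller", {"edge1", "hostA", "hostB"})
    print("Root cause:", roots, "Suppressed symptoms:", symptoms)
```

In this example three connectivity events reduce to the single root cause `edge1`.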
43Phase 2 Discovery
44TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
- Application Mapping with Dependencies
- Agent-less and Credential-free
- Discover interdependencies between applications, middleware, servers and network components
[Diagram labels: Computer System, Switch, Infrastructure Application, Business Application]
45TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
- Learn how your CIs are configured (and changing over time)
- Configuration Auditing
- Tracks changes in applications
- Depicts that information on the map
- Depicts that information through reports
- Automatically tracks changes to all CIs' attribute values over time
[Diagram label: Application]
46TADDM Provides 3 Key Benefits - Enabling the IT
Service Mgmt user to
- Determine if it is compliant
- Compliance
- Compare configuration to reference master
- Compare to your standard policy
[Screenshot: comparing two instances of an Apache Web Server to the reference master; values in red and blue are policy violations]
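Comparing an instance with a reference master amounts to a keyed diff of configuration values. The sketch below shows the idea in plain Python; it does not use any TADDM API, and the Apache settings shown are hypothetical.

```python
def violations(reference, instance):
    """Return {setting: (expected, actual)} for every value that differs."""
    keys = reference.keys() | instance.keys()
    return {
        key: (reference.get(key), instance.get(key))
        for key in sorted(keys)
        if reference.get(key) != instance.get(key)
    }


# Hypothetical Apache settings; real discovery data would hold many more.
reference = {"Listen": "80", "KeepAlive": "On", "MaxClients": "256"}
web1 = {"Listen": "80", "KeepAlive": "Off", "MaxClients": "256"}
web2 = {"Listen": "8080", "KeepAlive": "On"}

if __name__ == "__main__":
    print("web1:", violations(reference, web1))
    print("web2:", violations(reference, web2))
```

A missing setting is reported with `None` as its actual value, so omissions count as violations too.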
47And what does this mean?
- Reduce MTTR
- Accurate and comprehensive cross-tier service visibility
- Deep configuration details and interdependencies
- Change history data to identify and isolate application changes.
- Improve Operational Efficiency
- Make decisions based on accurate operational data
- Enhance business availability
- Align IT infrastructure with the business through discovery automation
48ITM and TADDM Integration
49Monitoring Coverage
- You can now use TADDM to display ITM 6 monitoring
coverage
50Launch In Context to search portal
- With 7.1.2, LIC to the search portal handles LTPA
(Lightweight Third Party Authentication) tokens,
enabling SSO to occur.
51ITNM and TADDM Integration
52Integration Overview
[Diagram labels: Network Resources, ITNM Discovery, TADDM Discovery, TADDM and ITNM GUI, Resource and Relationship Data, DLA (Discovery Library Adapter), IDML Book, Bulk Loader]
53End of Part 1
54Phase 3 Service Visibility
55Service Visibility
- In this second part we will learn how to offer visibility of the availability and performance of specific business services to different business units using Tivoli Business Service Manager (TBSM)
- We will also show how we can monitor and measure these against business-oriented metrics, e.g. volume of transactions, revenue flow.
56IBM Tivoli Business Service Manager
TBSM enables a service-centric approach to
management
- Capabilities
- Custom business views and dashboards
- View key performance indicators (KPIs)
- Model any service
- Service status/health from external sources
- Track real-time Service Level Agreements
- Advanced numeric rules for calculations
- Service definition from CMDB/inventory
- Tight BSM product integration
- ITCAM for ISM and ITM
- TADDM, TSLA
- OMNIbus and TEC
- Can add value to non-Tivoli monitored environments!
57Event Sources
Status
58Discovery Dependencies
Structure
59Business Data Processes
Status and Structure
60How does it work?
- TBSM uses service models
- A service model describes dependencies of units of operation within an organization.
- Model elements can describe hardware, software, business processes, transactions and others.
- Models can be built manually in TBSM or built automatically from event and/or external data.
61Templates and Services
- Templates define how Service Instances will behave.
- Services are instantiations of templates.
- e.g. Web Server Template -> WebServer1
- There are multiple ways that instances can be created in TBSM:
- Manual configuration via graphical user interface (GUI).
- Auto-population based on events
- RAD API.
- Data Fetcher or External Service Dependency Adapter (ESDA) for auto-population from an external source.
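The template/instance relationship resembles classes and objects. The sketch below is a conceptual model only: `Template`, `Service` and `auto_populate` are hypothetical names, not the TBSM RAD API, and auto-population is reduced to "create an instance for every new node name seen in an event".

```python
from dataclasses import dataclass, field


@dataclass
class Template:
    name: str
    rules: list = field(default_factory=list)  # behaviour shared by instances

    def instantiate(self, instance_name):
        """Create a service instance that behaves as this template defines."""
        return Service(instance_name, self)


@dataclass
class Service:
    name: str
    template: Template


def auto_populate(template, events, known):
    """Create one instance per node name seen in events, if not known yet."""
    for event in events:
        if event["node"] not in known:
            known[event["node"]] = template.instantiate(event["node"])
    return known


if __name__ == "__main__":
    web = Template("Web Server")
    print(web.instantiate("WebServer1"))
    feed = [{"node": "WebServer1"}, {"node": "WebServer2"}, {"node": "WebServer1"}]
    print(sorted(auto_populate(web, feed, {})))
```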
62PlantsByWebSphere
63Service Status from Events
- Services can derive status from incoming OMNIbus events.
- Hundreds of event feeds through OMNIbus!
- Propagation of status in a service tree is defined in template dependency rules.
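Status propagation can be sketched as a recursive roll-up over the service tree. The rule used here, "the worst child wins", is an assumption chosen for simplicity; real dependency rules may differ (for example, a redundant pair might only go Marginal when one member fails). The tree mirrors the PlantsByWebSphere components listed earlier.

```python
GOOD, MARGINAL, BAD = 0, 1, 2


def status(service, children, event_status):
    """Status of a service: the worst of its own events and its children."""
    own = event_status.get(service, GOOD)
    child_statuses = [
        status(child, children, event_status)
        for child in children.get(service, [])
    ]
    return max([own, *child_statuses])


# Service tree for the example application (parent -> children).
children = {
    "PlantsByWebSphere": ["LoadBalancer", "WebServers", "WebSphere", "DB2"],
    "WebServers": ["Apache", "IIS"],
}

if __name__ == "__main__":
    # A critical event on IIS propagates up to the business service.
    print(status("PlantsByWebSphere", children, {"IIS": BAD}))
```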
64Service Status from Business Sources Data
Fetchers
- Data Fetcher is a database poller to:
- Retrieve key performance indicator values
- Drive status of service instances (similar to events)
- Retrieved data can be used with numerical health calculations
- Useful feed of KPIs for scorecard/dashboard views
- Data can be used in SLA calculations
- Extends auto-population feature
[Diagram: the Data Fetcher returns rows for WebFarm3: WebServer6 TroubleTkts 2; WebServer13 TroubleTkts 7; WebServer21 TroubleTkts 0; WebServer15 TroubleTkts 4]
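A database poller of this kind can be sketched with Python's built-in `sqlite3` module. The table name, column names and status thresholds below are assumptions; only the four rows come from the slide. This illustrates the pattern and is not how a TBSM Data Fetcher is configured.

```python
import sqlite3


def fetch_ticket_counts(conn):
    """Poll once: return {server_name: open_trouble_tickets}."""
    rows = conn.execute("SELECT server, trouble_tkts FROM tickets").fetchall()
    return dict(rows)


def status_for(tickets, marginal_at=3, bad_at=6):
    """Map a ticket count to a status; both thresholds are assumptions."""
    if tickets >= bad_at:
        return "Bad"
    return "Marginal" if tickets >= marginal_at else "Good"


# In-memory stand-in for the external database, loaded with the slide's rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tickets (server TEXT, trouble_tkts INTEGER)")
conn.executemany(
    "INSERT INTO tickets VALUES (?, ?)",
    [("WebServer6", 2), ("WebServer13", 7), ("WebServer21", 0), ("WebServer15", 4)],
)
statuses = {s: status_for(n) for s, n in fetch_ticket_counts(conn).items()}

if __name__ == "__main__":
    print(statuses)
```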
65Numeric Modeling
- Metrics can be associated with a service and optionally be used to determine status.
- e.g. Web server response time
- Dependencies can be based on numerics (aggregations).
- e.g. Web farm response time as the calculated average of the response times of the web servers in the web farm
- Predefined aggregation calculations
- Average
- Maximum, Minimum
- Percentile
- Sum
- Use Netcool/Impact policies to create customized calculations.
- Configure a status threshold that will result in the service being designated as Good/Bad/Marginal based on the value of the numeric rule.
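The web farm example can be sketched as an aggregation followed by a threshold check. The response times and threshold values are hypothetical, and the helper names are illustrative rather than TBSM rule syntax.

```python
from statistics import mean


def aggregate(values, how="average"):
    """Roll child metrics up to the parent using a predefined calculation."""
    calculations = {"average": mean, "maximum": max, "minimum": min, "sum": sum}
    return calculations[how](values)


def to_status(value, marginal, bad):
    """Compare a numeric rule's value with its status thresholds."""
    if value >= bad:
        return "Bad"
    return "Marginal" if value >= marginal else "Good"


# Hypothetical response times (seconds) for the servers in one web farm.
response_times = {"WebServer1": 0.4, "WebServer2": 0.9, "WebServer3": 0.5}
farm_time = aggregate(list(response_times.values()))
farm_status = to_status(farm_time, marginal=0.5, bad=1.0)

if __name__ == "__main__":
    print(f"Web farm average response time: {farm_time:.2f}s -> {farm_status}")
```

With these sample values the farm averages about 0.6 seconds, which falls between the two thresholds and is therefore Marginal.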
66Product Integration - Discovery
[Diagram: TBSM and TADDM exchange Application Maps (IDML), configuration and change history queries, and application detailed configuration and change history data; Business Systems View]
- TBSM / TADDM Value Proposition
- Accurate and comprehensive application visibility
- Cross-tier application topology
- Deep configuration details and interdependencies
- Automatically create and maintain application/service groupings
- Identify and isolate application changes to dramatically reduce MTTR
67Service Level Agreements
- Service Level Agreements
- Can be defined for
- Services
- Applications
- Devices
- 3 Types of SLAs
- Instance
- Cumulative
- Violation Count
- SLA Metrics
- Availability
- Downtime (MTTR)
- Penalties (monetary)
68PlantsByWebSphere SLA
- The PlantsByWebSphere SLA is set on the number of minutes down (cumulative duration) in a calendar month (1st to 1st).
- The service is allowed to be down for 15 minutes
- We would like a warning after 10 minutes.
- We also want to have a penalty of 100 per minute down.
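A cumulative SLA of this shape can be sketched as a running total of outage minutes compared with the warning and violation thresholds. Treating exactly 10 minutes as the start of the warning band is an assumption, and `sla_state` is a hypothetical helper, not a TBSM function.

```python
def sla_state(outage_minutes, warn_at=10, allowed=15):
    """Evaluate a cumulative-downtime SLA for one calendar month.

    `outage_minutes` holds the length of each outage in the month.
    Returns (state, total_downtime_minutes).
    """
    total = sum(outage_minutes)
    if total > allowed:
        return "Violated", total
    if total >= warn_at:
        return "Warning", total
    return "Good", total


if __name__ == "__main__":
    month = []
    for outage in (4, 3, 5, 6):  # illustrative outages, in minutes
        month.append(outage)
        print(month, "->", sla_state(month))
```

Because the SLA is cumulative, several short outages can breach it even though none of them does so individually.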
69End and Thank You