Deal with Production Issues - PowerPoint PPT Presentation

Slides: 42

Provided by: bigapplez

Transcript and Presenter's Notes

1
Deal with Production Issues

Suggestions from ITIL

2
Problems to solve

Long resolution time
Neglected issues
Issues we lose track of until our users remind us
Recurring issues
Inconsistency in response time
Developers are constantly pulled away from their
work to resolve issues

3
Goal

Manage issues in a consistent manner
Fast resolution
Reduce client impact
Proactively resolve issues before they impact
clients

4
Basic Concepts

Incidents
Any event which is not part of the standard
operation of a service and which causes, or may
cause, an interruption to, or a reduction in, the
quality of that service
Problems
A problem is a condition often identified as the
cause of multiple incidents that exhibit common
symptoms.
Known Errors
A known error is a condition identified by
successful diagnosis of the root cause of a
problem, and subsequent development of a
workaround

5
Relationship of the three

A Problem is the root cause of its Incidents
An Incident is the manifestation of an underlying
Problem
One Problem can cause many Incidents
A Known Error is a Problem with a known root cause
and a known workaround

6
Manage Incident vs. Manage Problem

Different goals
Incident Management focuses on restoring the
service operation as quickly as possible
Problem Management focuses on finding and
eliminating the root cause
Different actions
Incident Management applies workarounds or
temporary fixes to quickly restore the services
Problem Management issues a change to
fundamentally eliminate the root cause
Incident Management is reactive and Problem
Management is proactive
Incident Management emphasizes speed and Problem
Management emphasizes quality

7
Common mistakes

Spending tremendous time and effort on finding the
root cause before the service level is recovered
Stopping the investigation after an incident is
fixed by a workaround
Letting the same incident occur repeatedly without
understanding the root cause

8
Solutions from ITIL

Separate out Incident Management and Problem
Management into two independent but related
processes
Handle incidents (restore service) as quickly as
possible
Proactively and independently work on resolving
problems
Wisely manage Known Errors

9
Incident Management

Always remember the goal is to Restore service
level as quickly as possible
How to go fast?
Classification
Match known errors and known workarounds
Appropriate escalation
Go fast, but don't go crazy. Don't skip:
Record
Prioritize
Follow up

10
Incident Management Process
11
Acceptance And Record

Benefits of recording
Helps diagnose new incidents based on known
incidents
Helps Problem Management find the root cause
Makes it easy to determine the impact
Makes it possible to track and control issue
resolution
Incident Reporting Channels
User
System Monitor/Alert
IT person

12
Incident Record

Unique ID
Basic diagnosis info
Timestamp
Symptoms
User info (name, contact info)
Who's responsible
Additional information
Screenshots
Logs
Status
New, Accepted, Scheduled, Assigned, Active,
Suspended, Resolved, Terminated
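
The record fields above can be sketched as a small data structure. This is only an illustration: the field names, the counter-based ID scheme, and the sample values are assumptions, not part of the original slides or of any particular tracking tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    # The eight statuses listed on the slide.
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


_next_id = count(1)  # assumed ID scheme: a simple running counter


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    owner: str  # who's responsible
    incident_id: int = field(default_factory=lambda: next(_next_id))
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    # Additional information
    screenshots: list[str] = field(default_factory=list)  # file paths or links
    logs: list[str] = field(default_factory=list)
    status: Status = Status.NEW


# Hypothetical example values.
incident = IncidentRecord(
    symptoms="Login page returns HTTP 500",
    user_name="A. User",
    user_contact="a.user@example.com",
    owner="Service Desk",
)
incident.status = Status.ACCEPTED
```

Modelling the status as an enumeration keeps free-text values such as "in progress" out of the records, which makes later trend analysis easier.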

13
Classification

Classification
Possible reasons (application, network, database,
business logic, etc.)
Supporting group (application group, database
group, infrastructure group, network group, etc.)
Prioritize
Priority = Impact × Urgency
Determine resolution timeline (resolve within X
hours) based on Service Level Agreement
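
A minimal sketch of the priority calculation follows. The 1 to 3 scales and the SLA thresholds are assumptions chosen for illustration; the slides only say that priority combines impact and urgency and that the resolution timeline comes from the Service Level Agreement.

```python
from datetime import timedelta

# Assumed 1-3 scales (3 = highest); the slides do not define the scales.
IMPACT = {"low": 1, "medium": 2, "high": 3}
URGENCY = {"low": 1, "medium": 2, "high": 3}

# Hypothetical SLA table: minimum priority score -> resolution target.
SLA_TARGETS = [
    (9, timedelta(hours=1)),
    (6, timedelta(hours=4)),
    (3, timedelta(hours=24)),
    (1, timedelta(hours=72)),
]


def priority_score(impact: str, urgency: str) -> int:
    """Priority = Impact x Urgency; a higher score is more pressing."""
    return IMPACT[impact] * URGENCY[urgency]


def resolution_target(score: int) -> timedelta:
    """Return the target of the first threshold the score reaches."""
    for threshold, target in SLA_TARGETS:
        if score >= threshold:
            return target
    raise ValueError(f"score {score} is below every SLA threshold")
```

With these assumed values, a high-impact, medium-urgency incident scores 6 and gets a four-hour target. The real table should be taken from the agreed SLA.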

14
Preliminary Support

Preliminary Response
Acknowledgement of acceptance
Collect basic info
Provide basic help to the user
Service Requests
A Service Request is a standard service, such as
checking a status or resetting a password
Handle service requests through a standard
procedure

15
Match

Match known errors
Known solution
Known workaround
Known resolution procedure
Match existing incidents
Link the new incident with the existing incidents
Increase the impact level of the existing
incident
If the existing one is already being worked on,
inform the responsible person/group
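
The matching step can be sketched as a keyword overlap between the reported symptoms and a catalog of known errors. This is a deliberately simple stand-in for whatever search the tracking tool offers; the `KnownError` fields, the `min_overlap` parameter, and the catalog entries are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    error_id: str
    keywords: set[str]  # words that characterize the symptoms
    workaround: str


def tokenize(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in text.lower())
    return set(cleaned.split())


def match_known_errors(symptoms: str, known_errors: list[KnownError],
                       min_overlap: int = 2) -> list[KnownError]:
    """Return known errors sharing at least min_overlap keywords, best first."""
    words = tokenize(symptoms)
    scored = [(len(words & ke.keywords), ke) for ke in known_errors]
    hits = [(score, ke) for score, ke in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in hits]


# Hypothetical catalog entries.
catalog = [
    KnownError("KE-1", {"login", "timeout", "ldap"}, "Restart the LDAP connector"),
    KnownError("KE-2", {"report", "export", "blank"}, "Export as CSV instead of PDF"),
]
matches = match_known_errors("Login fails with LDAP timeout", catalog)
```

The same function could be pointed at open incidents instead of known errors to find an existing incident to link to.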

16
Investigation and Diagnosis

Escalation
Functional escalation (Technical escalation)
Involve more technical experts, involve teams in
other functional group, or involve external
suppliers
Hierarchical escalation (Management escalation)
Escalate to higher level management team

17
Escalation by Priorities

D (Incident Manager)
E (Division Management)
F (Corporate Management)

A (Service Desk)
B (Second Line)
C (Third Line, Supplier)
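
One way to make escalation time-bound is to tie each level to the share of the resolution target that has already elapsed. The sketch below uses the level labels from the slide; the threshold fractions are purely hypothetical, and in practice they would likely differ per priority.

```python
# Levels as labelled on the slide.
FUNCTIONAL = ["A (Service Desk)", "B (Second Line)", "C (Third Line, Supplier)"]
HIERARCHICAL = [
    "D (Incident Manager)",
    "E (Division Management)",
    "F (Corporate Management)",
]

# Assumed thresholds: fraction of the resolution target that may elapse
# before the next level gets involved.
FUNCTIONAL_THRESHOLDS = [0.25, 0.50]           # A -> B, B -> C
HIERARCHICAL_THRESHOLDS = [0.50, 0.75, 1.00]   # inform D, then E, then F


def functional_level(elapsed_fraction: float) -> str:
    """Support line that should be working on the incident."""
    passed = sum(elapsed_fraction >= t for t in FUNCTIONAL_THRESHOLDS)
    return FUNCTIONAL[passed]


def hierarchical_level(elapsed_fraction: float):
    """Highest management level to inform, or None if none is due yet."""
    passed = sum(elapsed_fraction >= t for t in HIERARCHICAL_THRESHOLDS)
    return HIERARCHICAL[passed - 1] if passed else None
```

For example, with these assumed thresholds an incident that has used 60% of its target would sit with the third line while the Incident Manager is informed.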

18
Investigation Activities

Assign dedicated support person
Collect basic info
Query historical data
Recent releases
Recent changes
Workload trend
Analyze
Again, don't spend too much time finding the
root cause. Find a workaround as soon as
possible!

19
Resolve and recover

Resolution (workarounds or permanent fix)
Create a Request For Change (RFC)
Approve RFC
Implement Change.
Record the analysis, the root cause, the
workaround and the solution
Leave the incident in Open status when a
resolution hasn't been found

20
Termination

Contact the user to confirm incident is resolved
Change the Incident status to Closed
Update the Incident record to reflect the
final priority, impact, user and root cause

21
Track and Monitor

Assign an owner to each incident. Usually it's
the Service Desk person.
Provide feedback to the users after a change
Enforce the escalation based on the priority

22
Problem Management

Problem Control
Find the root cause of a problem
Turn a problem into a Known Error
Error Control
Control and Monitor the Known Errors until they
are appropriately handled
Proactive Problem Management
Resolve problems before they cause any incidents

23
Problem Control
24
Identify Problems

Analyze the trends of incidents
Likely to reoccur
Likely more will occur
Likely to have larger impact
Analyze the weakness of the infrastructure
Availability
Capability
A significant incident (outage)
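
Trend analysis can start with something as simple as counting recent incidents per classification and flagging the categories that keep recurring. In this sketch the 30-day window, the minimum count of 3, and the sample data are assumptions made for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def problem_candidates(incidents, now, window=timedelta(days=30), min_count=3):
    """Return (category, count) pairs for categories recurring within the window.

    incidents: iterable of (category, occurred_at) tuples.
    """
    recent = Counter(
        category
        for category, occurred_at in incidents
        if now - occurred_at <= window
    )
    return [(cat, n) for cat, n in recent.most_common() if n >= min_count]


# Hypothetical incident history.
now = datetime(2024, 6, 30)
incidents = [
    ("database", datetime(2024, 6, 2)),
    ("database", datetime(2024, 6, 15)),
    ("database", datetime(2024, 6, 28)),
    ("network", datetime(2024, 6, 20)),
    ("database", datetime(2024, 3, 1)),  # outside the window
]
candidates = problem_candidates(incidents, now)
```

Categories returned here are only candidates. Whether to open a Problem record still depends on the likely impact, as listed above.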

25
Diagnosis

Recreate incident in testing environment
Link the modules with incidents
Review the latest changes
After the root cause of a problem is found, this
problem becomes a Known Error

26
Temporary Fixes

It's important to find a temporary fix if the
problem causes a significant incident
If the temporary fix involves changes to the
infrastructure, a Request For Change must be
submitted. (Later, another RFC may be submitted
to fix the root cause)
For urgent problems, the Emergency Change Request
Process should be initiated.

27
Error Control
28
Identify and Record Known Error

Identify
Find the root cause of a problem
Link a problem with a known error
Record
Assign an ID
Symptoms
Root cause
Status
Notification
Notify incident management team. They can
associate new incidents with known errors

29
Determine the solution

Evaluate based on
Service Level Agreement
Impact and Urgency
Cost and benefit
Possible solutions
Temporary fixes
Permanent fixes
No fix (cost is greater than benefits)
Record the decision in Problem Database
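
The cost and benefit comparison can be illustrated with a toy decision rule: pick the option with the largest net benefit, and fall back to "No fix" when no option pays for itself. The rule and the numbers in the usage note are assumptions. A real evaluation would also weigh the SLA, impact, and urgency listed above.

```python
def choose_solution(options):
    """Pick the option with the largest net benefit, or "No fix" if none pays off.

    options: list of (name, cost, benefit) tuples, cost and benefit in the same unit.
    """
    worthwhile = [
        (benefit - cost, name)
        for name, cost, benefit in options
        if benefit > cost
    ]
    if not worthwhile:
        return "No fix"
    # max() compares net benefit first; equal nets are broken by name.
    return max(worthwhile)[1]
```

For instance, `choose_solution([("Temporary fix", 2, 5), ("Permanent fix", 10, 20)])` selects the permanent fix because its net benefit (10) exceeds that of the temporary fix (3).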

30
Known Errors from other environments

Known errors from development environment
We may choose to release with some minor known
issues
Known errors from suppliers
Usually reported in the release notes
Record, Monitor and Track those known errors
Relate problems with those known errors

31
PIR (Post Implementation Review)

Normal problems
Confirm all the related incidents are closed
Verify if the problem record is complete
(symptoms, root cause and solutions)
Change the problem status to Resolved
Significant problems
What went well?
What went wrong?
How to do better next time?
How to prevent similar issues from happening
again?

32
Track and Monitor

Track the full lifecycle of each known error
Reevaluate impact and urgency. Adjust the
priorities accordingly.
Monitor the progress of the diagnosis and
implementation of the solution. Monitor the
implementation of the RFC.

33
Proactive Problem Management

Focus on the quality of the service and the
infrastructure
Analyze operational trends
Detect the potential incidents and prevent them
from happening
Find out the weak points of the infrastructure or
the overloaded components

34
Ideas to improve our Production Support process

Idea 1: Create an independent Problem Management
Team
Idea 2: Create a Problem Database
Idea 3: Define the Production Support Procedure
Idea 4: Review and revise the procedures for using
TeamTrack
Idea 5: Enforce Post Implementation Review
Idea 6: Proactively manage problems
Idea 7 (optional): Acquire Service Desk
software to facilitate the process

35
Create an independent Problem Management Team.

Can be a full time team or a part time team
Appoint a Problem Management Manager. This must
be a different person from the Production Support
Manager, because their goals, schedules and
requirements are different.
Responsible for managing all the production
problems (not incidents) for multiple
applications
Identify problems
Record problems
Find and evaluate solutions
Track the progress till closure
Work closely with the existing Production Support
team.

36
Create a Problem Database

An easy-to-search knowledge database
Include problems and known errors
Track symptoms, root causes, temporary fixes,
workarounds, and permanent solutions
Include all the known errors in DEV and
unresolved or deferred defects in QA/RATE
environments
Maintained by the Problem Management Team
Will be used by the Production Support team to
match and quickly resolve incidents

37
Define the Production Support Procedure (Work
Instructions)

Create a formal and detailed document. Train
Production Support Team to follow the new
procedure
Start with ITIL Incident Management Process.
Adjust it to our own situation and tools
Clearly define how to calculate priorities
Clearly define the time-bound escalation
procedure
Clearly define the monitoring and tracking steps

38
Review and define the procedure of using TeamTrack

TeamTrack is our existing Incident Tracking
system
Review the functions of TeamTrack
Redefine the incident escalation process
according to ITIL suggestions
Define the interface between PC Support and IT
Production Support Team
Communication channel
Roles and responsibilities
Escalation
Track and Control
Knowledge sharing

39
Enforce PIR

Contact each user to confirm all the incidents
are closed
Make sure the Problem record is complete and
useful
Identify issues in the Incident and Problem
Management process. Add those to Problem database.

40
Proactively Manage Problems

Responsibility of the Problem Management Team.
Perform the following activities
Analyze incidents to find the trend
Analyze infrastructure to identify possible
bottlenecks
Run fail-over and stress tests
Apply a problem solution across multiple related
applications
Establish and maintain the Production Monitor
System to proactively detect system anomalies
Evaluate how many problems are proactively
identified and resolved
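
The last activity calls for a measurable indicator. One possible definition, assumed here rather than taken from the slides, is the share of resolved problems that were recorded before any incident had been linked to them.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    problem_id: str
    incidents_before_identified: int  # incidents linked when the problem was recorded
    resolved: bool


def proactive_rate(problems) -> float:
    """Share of resolved problems that were identified before any incident."""
    resolved = [p for p in problems if p.resolved]
    if not resolved:
        return 0.0
    proactive = [p for p in resolved if p.incidents_before_identified == 0]
    return len(proactive) / len(resolved)


# Hypothetical problem records.
problems = [
    Problem("P-1", incidents_before_identified=0, resolved=True),
    Problem("P-2", incidents_before_identified=4, resolved=True),
    Problem("P-3", incidents_before_identified=0, resolved=False),
]
rate = proactive_rate(problems)
```

Tracking this rate over time shows whether the Problem Management Team is moving from reactive to proactive work.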

41
Service Desk Software