Title: Deal with Production Issues
1Deal with Production Issues
2Problems to solve
- Long resolution time
- Neglected issues
- Issues we lose track of until our users remind us
- Recurring issues
- Inconsistency in response time
- Developers are constantly distracted by having to
resolve issues
3Goal
- Manage issues in a consistent manner
- Fast resolution
- Reduce client impact
- Proactively resolve issues before they impact
clients
4Basic Concepts
- Incidents
- Any event which is not part of the standard
operation of a service and which causes, or may
cause, an interruption to, or a reduction in, the
quality of that service
- Problems
- A problem is a condition, often identified as the
cause of multiple incidents that exhibit common
symptoms.
- Known Errors
- A known error is a condition identified by
successful diagnosis of the root cause of a
problem, and subsequent development of a
workaround
5Relationship of the three
- A Problem is the root cause of its Incidents
- An Incident is the manifestation of an underlying Problem
- One Problem can cause many Incidents
- A Known Error is a Problem with a known root cause
and a known workaround
6Manage Incident vs. Manage Problem
- Different goals
- Incident Management focuses on restoring the
service operation as quickly as possible
- Problem Management focuses on finding and
eliminating the root cause
- Different actions
- Incident Management applies workarounds or
temporary fixes to quickly restore the services
- Problem Management issues a change to
fundamentally eliminate the root cause
- Incident Management is reactive and Problem
Management is proactive
- Incident Management emphasizes speed and Problem
Management emphasizes quality
7Common mistakes
- Spend tremendous time and effort to find the root
cause before the service level is recovered
- Stop the investigation after an incident is fixed
by a workaround
- Same incident occurs repeatedly without an
understanding of the root cause
8Solutions from ITIL
- Separate out Incident Management and Problem
Management into two independent but related
processes
- Handle incidents (restore service) as quickly as
possible
- Proactively and independently work on resolving
problems
- Wisely manage Known Errors
9Incident Management
- Always remember the goal is to restore the service
level as quickly as possible
- How to go fast?
- Classification
- Match known errors and known workarounds
- Appropriate escalation
- Go fast, but don't go crazy. Don't miss:
- Record
- Prioritize
- Follow up
10Incident Management Process
11Acceptance And Record
- Benefits of recording
- Helps diagnose new incidents based on known
incidents
- Helps Problem Management find the root cause
- Makes it easy to determine the impact
- Makes it possible to track and control issue
resolution
- Incident Reporting Channels
- User
- System Monitor/Alert
- IT person
12Incident Record
- Unique ID
- Basic diagnosis info
- Timestamp
- Symptoms
- User info (name, contact info)
- Who's responsible
- Additional information
- Screenshots
- Logs
- Status
- New, Accepted, Scheduled, Assigned, Active,
Suspended, Resolved, Terminated
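The fields listed above could be captured in a small record type. The following is a minimal sketch, not a description of any particular tracking tool: the class and field names, the in-process ID counter, and the sample values are all assumptions made for illustration. Only the field list and the eight status values come from the slide.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Status(Enum):
    """The eight status values named on the slide."""
    NEW = "New"
    ACCEPTED = "Accepted"
    SCHEDULED = "Scheduled"
    ASSIGNED = "Assigned"
    ACTIVE = "Active"
    SUSPENDED = "Suspended"
    RESOLVED = "Resolved"
    TERMINATED = "Terminated"


# Hypothetical ID source; a real tracking system would assign its own IDs.
_next_id = count(1)


@dataclass
class IncidentRecord:
    # Basic diagnosis info
    symptoms: str
    user_name: str
    user_contact: str
    responsible: str
    incident_id: int = field(default_factory=lambda: next(_next_id))
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    # Additional information
    screenshots: list[str] = field(default_factory=list)
    logs: list[str] = field(default_factory=list)
    status: Status = Status.NEW


# Usage: record a new incident, attach a log, and accept it.
incident = IncidentRecord(
    symptoms="Login page returns an error",
    user_name="Example User",
    user_contact="user@example.com",
    responsible="Service Desk",
)
incident.logs.append("app.log excerpt")
incident.status = Status.ACCEPTED
```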
13Classification
- Classification
- Possible reasons (application, network, database,
business logic, etc.)
- Supporting group (application group, database
group, infrastructure group, network group, etc.)
- Prioritize
- Priority = Impact x Urgency
- Determine resolution timeline (resolve within X
hours) based on the Service Level Agreement
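As an illustration of the priority rule, here is a minimal sketch. The 1 to 3 scales for impact and urgency and the hours in the SLA table are assumptions chosen for the example; the real values would come from the Service Level Agreement.

```python
from datetime import datetime, timedelta

# Hypothetical scales: 1 = low, 2 = medium, 3 = high.
LEVELS = (1, 2, 3)

# Hypothetical SLA table: priority score -> hours allowed for resolution.
SLA_HOURS = {9: 2, 6: 4, 4: 8, 3: 24, 2: 48, 1: 72}


def priority(impact: int, urgency: int) -> int:
    """Priority = Impact x Urgency; a higher score is more pressing."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError("impact and urgency must each be 1, 2, or 3")
    return impact * urgency


def resolution_deadline(reported_at: datetime, impact: int, urgency: int) -> datetime:
    """Latest acceptable resolution time under the assumed SLA table."""
    return reported_at + timedelta(hours=SLA_HOURS[priority(impact, urgency)])
```

For example, under these assumed numbers an incident with high impact and high urgency scores 9 and must be resolved within 2 hours of being reported.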
14Preliminary Support
- Preliminary Response
- Acknowledge acceptance
- Collect basic info
- Provide basic help to the user
- Service Requests
- A Service Request is a standard service such as
checking status, resetting a password, etc.
- Go through the standard procedure to handle service
requests
15Match
- Match known errors
- Known solution
- Known workaround
- Known resolution procedure
- Match existing incidents
- Link the new incident with the existing incidents
- Increase the impact level of the existing
incident
- If the existing one is already being worked on, inform
the responsible person/group
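One simple way to match a new incident against known errors is keyword overlap on the symptoms. The sketch below assumes each known error is stored with a set of symptom keywords; the `KnownError` type, the `min_overlap` threshold, and the sample data are hypothetical, and a real tool would likely use a richer search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    error_id: str
    symptom_keywords: frozenset[str]
    workaround: str


def match_known_errors(symptoms, known_errors, min_overlap=2):
    """Return known errors sharing at least min_overlap keywords, best match first."""
    words = set(symptoms.lower().split())
    scored = [(len(words & ke.symptom_keywords), ke) for ke in known_errors]
    hits = [(score, ke) for score, ke in scored if score >= min_overlap]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [ke for _, ke in hits]


# Usage with made-up known errors.
known = [
    KnownError("KE-1", frozenset({"login", "timeout", "session"}), "Restart the session service"),
    KnownError("KE-2", frozenset({"report", "export", "pdf"}), "Export as CSV instead"),
]
matches = match_known_errors("Login page timeout after password reset", known)
```

If a match is found, its known workaround can be applied right away instead of starting a new investigation.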
16Investigation and Diagnosis
- Escalation
- Functional escalation (Technical escalation):
involve more technical experts, involve teams in
other functional groups, or involve external
suppliers
- Hierarchical escalation (Management escalation):
escalate to a higher-level management team
17Escalation by Priorities
- A (Service Desk)
- B (Second Line)
- C (Third Line, Supplier)
- D (Incident Manager)
- E (Division Management)
- F (Corporate Management)
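A time-bound escalation rule could be sketched as follows. Two things here are assumptions rather than statements from the slides: that levels A to C form the functional ladder and D to F the hierarchical ladder, and that the resolution window is split evenly across the steps of a ladder. The function name and parameters are hypothetical.

```python
# Assumed grouping of the six levels into the two escalation types.
FUNCTIONAL = ["A (Service Desk)", "B (Second Line)", "C (Third Line, Supplier)"]
HIERARCHICAL = [
    "D (Incident Manager)",
    "E (Division Management)",
    "F (Corporate Management)",
]


def escalation_level(ladder, elapsed_hours, sla_hours):
    """Pick the ladder step for an incident that has been open elapsed_hours.

    Hypothetical rule: the SLA window is divided evenly across the steps,
    and the incident stays on the last step once the window is used up.
    """
    if sla_hours <= 0 or elapsed_hours < 0:
        raise ValueError("sla_hours must be positive and elapsed_hours non-negative")
    step = sla_hours / len(ladder)
    index = min(int(elapsed_hours // step), len(ladder) - 1)
    return ladder[index]
```

With a 6-hour window, for instance, this rule keeps an incident at the Service Desk for the first 2 hours, moves it to the Second Line until hour 4, and to the Third Line after that.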
18Investigation Activities
- Assign dedicated support person
- Collect basic info
- Query historical data
- Recent releases
- Recent changes
- Workload trend
- Analyze
- Again, don't spend too much time finding the
root cause. Find a workaround as soon as
possible!
19Resolve and recover
- Resolution (workarounds or permanent fix)
- Create a Request For Change (RFC)
- Approve RFC
- Implement Change.
- Record the analysis, the root cause, the
workaround and the solution
- Leave the incident in Open status when a resolution
hasn't been found
20Termination
- Contact the user to confirm incident is resolved
- Change the Incident status to Closed
- Update the Incident record to reflect the
final priority, impact, user and root cause
21Track and Monitor
- Assign an owner to each incident. Usually it's
the Service Desk person.
- Provide feedback to the users after a change
- Enforce the escalation based on the priority
22Problem Management
- Problem Control
- Find the root cause of a problem
- Turn a problem into a Known Error
- Error Control
- Control and monitor the Known Errors until they
are appropriately handled
- Proactive Problem Management
- Resolve problems before they cause any incidents
23Problem Control
24Identify Problems
- Analyze the trends of incidents
- Likely to reoccur
- Likely more will occur
- Likely to have larger impact
- Analyze the weakness of the infrastructure
- Availability
- Capability
- A significant incident (outage)
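Trend analysis can start with something as simple as counting incidents per category and flagging categories that recur. This is a minimal sketch: the `(category, description)` input shape and the `min_count` threshold are assumptions for illustration.

```python
from collections import Counter


def problem_candidates(incidents, min_count=3):
    """Return (category, count) pairs for categories seen at least min_count times.

    incidents is an iterable of (category, description) tuples; the result is
    ordered from most to least frequent.
    """
    counts = Counter(category for category, _ in incidents)
    return [(cat, n) for cat, n in counts.most_common() if n >= min_count]


# Usage with made-up incident data.
recent = [
    ("database", "connection pool exhausted"),
    ("network", "VPN dropped"),
    ("database", "slow query on orders"),
    ("database", "connection pool exhausted"),
]
candidates = problem_candidates(recent)
```

Categories that pass the threshold are candidates for a Problem record. A single significant outage would still be raised as a Problem regardless of the count.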
25Diagnosis
- Recreate incident in testing environment
- Link the modules with incidents
- Review the latest changes
- After the root cause of a problem is found, this
problem becomes a Known Error
26Temporary Fixes
- It's important to find a temporary fix if the
problem causes a significant incident
- If the temporary fix involves changes in the
infrastructure, a Request For Change must be
submitted. (Later, another RFC may be submitted
to fix the root cause)
- For urgent problems, the Emergency Change Request
Process should be initiated.
27Error Control
28Identify and Record Known Error
- Identify
- Find the root cause of a problem
- Link a problem with a known error
- Record
- Assign an ID
- Symptoms
- Root cause
- Status
- Notification
- Notify incident management team. They can
associate new incidents with known errors
29Determine the solution
- Evaluate based on
- Service Level Agreement
- Impact and Urgency
- Cost and benefit
- Possible solutions
- Temporary fixes
- Permanent fixes
- No fix (cost is greater than benefits)
- Record the decision in Problem Database
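The cost and benefit part of this evaluation could be sketched as a simple decision rule. The rule below is an assumption for illustration: it prefers a permanent fix when its cost is covered by the expected benefit, falls back to a temporary fix, and otherwise chooses no fix. It leaves out the Service Level Agreement and the impact and urgency criteria, which would also feed into a real decision.

```python
from enum import Enum


class Decision(Enum):
    TEMPORARY_FIX = "Temporary fix"
    PERMANENT_FIX = "Permanent fix"
    NO_FIX = "No fix"


def choose_solution(permanent_cost, temporary_cost, expected_benefit):
    """Hypothetical rule comparing each option's cost with the expected benefit."""
    if permanent_cost <= expected_benefit:
        return Decision.PERMANENT_FIX
    if temporary_cost <= expected_benefit:
        return Decision.TEMPORARY_FIX
    # Cost is greater than benefit for every option.
    return Decision.NO_FIX
```

Whatever the outcome, the decision and the figures behind it would be recorded alongside the known error.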
30Known Errors from other environments
- Known errors from development environment
- We may choose to release with some minor known
issues
- Known errors from suppliers
- Usually reported in the release notes
- Record, Monitor and Track those known errors
- Relate problems with those known errors
31PIR (Post Implementation Review)
- Normal problems
- Confirm all the related incidents are closed
- Verify that the problem record is complete
(symptoms, root cause and solutions)
- Change the problem status to Resolved
- Significant problems
- What went well?
- What went wrong?
- How to do better next time?
- How to prevent similar issues from happening
again?
32Track and Monitor
- Track the full lifecycle of each known error
- Reevaluate impact and urgency. Adjust the
priorities accordingly.
- Monitor the progress of the diagnosis and
implementation of the solution. Monitor the
implementation of the RFC.
33Proactive Problem Management
- Focus on the quality of the service and the
infrastructure
- Analyze operational trends
- Detect potential incidents and prevent them
from happening
- Find the weak points of the infrastructure or
the overloaded components
34Ideas to improve our Production Support process
- Idea 1: Create an independent Problem Management
Team
- Idea 2: Create a Problem Database
- Idea 3: Define the Production Support Procedure
- Idea 4: Review and revise the procedures of using
TeamTrack
- Idea 5: Enforce Post Implementation Review
- Idea 6: Proactively manage problems
- Idea 7 (optional): Acquire Service Desk
software to facilitate the process
35Create an independent Problem Management Team.
- Can be a full-time team or a part-time team
- Appoint a Problem Management Manager. Must be
different from the Production Support Manager.
Their goals, schedules and requirements are
different.
- Responsible for managing all the production
problems (not incidents) for multiple
applications
- Identify problems
- Record problems
- Find and evaluate solutions
- Track the progress till closure
- Work closely with the existing Production Support
team.
36Create a Problem Database
- An easy-to-search knowledge database
- Include problems and known errors
- Track symptoms, root causes, temporary fixes,
workarounds, and permanent solutions
- Include all the known errors in DEV and
unresolved or deferred defects in QA/RATE
environments
- Maintained by the Problem Management Team
- Will be used by the Production Support team for
matching and fast resolution of incidents
37Define the Production Support Procedure (Work
Instructions)
- Create a formal and detailed document. Train
the Production Support Team to follow the new
procedure
- Start with the ITIL Incident Management Process.
Adjust it to our own situation and tools
- Clearly define how to calculate priorities
- Clearly define the time-bound escalation
procedure
- Clearly define the monitoring and tracking steps
38Review and define the procedure of using TeamTrack
- TeamTrack is our existing Incident Tracking
system
- Review the functions of TeamTrack
- Redefine the incident escalation process
according to ITIL suggestions
- Define the interface between PC Support and the IT
Production Support Team
- Communication channel
- Roles and responsibilities
- Escalation
- Track and Control
- Knowledge sharing
39Enforce PIR
- Contact each user to confirm all the incidents
are closed
- Make sure the Problem record is complete and
useful
- Identify issues in the Incident and Problem
Management process. Add those to the Problem Database.
40Proactively Manage Problems
- Responsibility of the Problem Management Team.
- Perform the following activities
- Analyze incidents to find the trend
- Analyze infrastructure to identify possible
bottlenecks
- Run fail-over and stress tests
- Apply a problem solution across multiple related
applications
- Establish and maintain the Production Monitor
System to proactively detect system anomalies
- Evaluate how many problems are proactively
identified and resolved
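The last activity above could be tracked with a simple ratio. This is a minimal sketch under stated assumptions: the `Problem` type and its two flags are hypothetical, and "proactively identified" is taken to mean the problem was found before it caused any incident.

```python
from dataclasses import dataclass


@dataclass
class Problem:
    problem_id: str
    found_proactively: bool  # identified before it caused any incident
    resolved: bool


def proactive_share(problems):
    """Fraction of all problems that were both found proactively and resolved."""
    if not problems:
        return 0.0
    hits = sum(1 for p in problems if p.found_proactively and p.resolved)
    return hits / len(problems)


# Usage with made-up problem records.
quarter = [
    Problem("P-1", found_proactively=True, resolved=True),
    Problem("P-2", found_proactively=False, resolved=True),
    Problem("P-3", found_proactively=True, resolved=False),
    Problem("P-4", found_proactively=False, resolved=False),
]
share = proactive_share(quarter)
```

Tracking this share over time gives the Problem Management Team a rough indication of whether proactive work is paying off.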
41Service Desk Software
- Evaluate the existing TeamTrack software and see
if it covers our needs
- Other popular options
- HP OpenView Service Desk
- Remedy Strategic Service Suite
- CA Unicenter Service Desk