1

Cerner Millennium System Stability: Getting to 99.99% Uptime
Eric Ried
Eric.Ried@Aurora.Org
Aurora Health Care, Inc., Milwaukee, Wisconsin
Steve Sonderman
Steve.Sonderman@Aurora.Org
Aurora Health Care, Inc., Milwaukee, Wisconsin
2
Aurora Health Care as an Organization
  • Largest private employer in the state of
    Wisconsin
  • 25,000 employees
  • 660 employed physicians
  • 3,400 physicians on staff
  • Comprised of:
  • 13 hospitals
  • Over 100 clinics
  • Over 120 retail pharmacies
  • Homecare
  • Hospice
  • Other outpatient treatment centers

3
Cerner Millennium Installations Across Aurora
  • Facilities live on Millennium
  • All 13 hospitals
  • 85 Clinics
  • Approximately 3.5 million patients in the record
    system.

4
Millennium Infrastructure at Aurora (Production Environment)
  • Hardware
  • 4 HP GS1280 Alpha Servers
  • Two 32-way
  • Two 24-way
  • Redundant configurations on application nodes.
  • 270 Citrix Servers
  • 8 Chart Servers
  • 3 RRD Servers (12 port total capacity)
  • 7 Multum Servers
  • Other servers include Document Imaging and BMDI

5
Millennium Infrastructure at Aurora (Production Environment)
  • Database
  • Oracle 9i
  • Current size of 6.0 TB
  • Applications
  • 6,500 concurrent users during peak usage times

  • PowerChart
  • ProVision
  • PowerChart Office
  • Clinical Documentation
  • RadNet
  • iNet
  • SurgiNet
  • FirstNet
  • ERM
  • BMDI
  • PharmNet/eMAR/Barcoding
  • Clinical Reporting/RRD/MRP
  • Scheduling
6
Stability History
  • Growth of Organization and Deployment schedule.
  • Aurora was growing as an organization.
  • Strong focus on rolling out the Electronic Health
    Record (Cerner Millennium) to integrate
    facilities.
  • To accommodate this growth, numerous application
    and system changes were made frequently.
  • Frequent Service Packages were taken for new
    functionality.
  • Operating in a break/fix mode.
  • Frequent system outages were occurring.

7
Stability History
8
Stability History
  • July 2006
  • Freeze placed on all non-essential Production
    Changes.
  • Change Control Process Redesigned with creation
    of the Change Control Board (CCB) to oversee and
    approve all changes.
  • All deployment of new functionality or existing
    functionality to new facilities was put on hold.
  • In-depth analysis of current stability issues
    began.
  • Set a short-term goal of 99.75% uptime.

9
Getting to 99.99% Uptime
10
Getting to 99.99% Uptime
11
Dedicated Production Monitoring
  • 7 AM to 6 PM monitoring, Monday through Friday
  • Production monitored by an Engineer/Administrator
    at all times throughout the business day.
  • Primary utilities used in Monitoring
  • SYSMON
  • MON PROC/TOPC
  • SPSMON
  • WATCH_QUOTA
  • BMC Patrol
  • Softek Panther (Panther Sensors)

12
Dedicated Production Monitoring
  • SYSMON
  • Message queuing
  • Sharp drop in connections
  • Terminating servers
  • High connection count
  • MON PROC/TOPC
  • All four nodes
  • Hung processes
  • High CPU usage (poor script performance; sketch
    below)

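The checks above run through OpenVMS/Cerner utilities (SYSMON, MON PROC/TOPC); as a rough, hypothetical sketch of the same idea in more widely available terms, the Python below polls processes and flags sustained high CPU. The psutil package, the 90% threshold, and the polling interval are assumptions for illustration, not Aurora's tooling or values.

```python
"""Hypothetical polling sketch: flag processes running at very high CPU,
in the spirit of the MON PROC/TOPC checks described above."""

import time
import psutil  # assumption: third-party package, not part of the original setup

CPU_ALERT_PCT = 90.0   # illustrative threshold for "poor script performance"
POLL_SECONDS = 30      # illustrative polling interval

def poll_once():
    # Prime per-process CPU counters, wait briefly, then read real samples.
    procs = list(psutil.process_iter(['pid', 'name']))
    for p in procs:
        try:
            p.cpu_percent(None)
        except psutil.Error:
            pass
    time.sleep(1.0)
    alerts = []
    for p in procs:
        try:
            cpu = p.cpu_percent(None)
        except psutil.Error:
            continue  # process exited between samples
        if cpu >= CPU_ALERT_PCT:
            alerts.append((p.pid, p.info['name'], cpu))
    return alerts

if __name__ == '__main__':
    while True:
        for pid, name, cpu in poll_once():
            print(f'ALERT: pid={pid} ({name}) at {cpu:.0f}% CPU')
        time.sleep(POLL_SECONDS)
```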
13
Dedicated Production Monitoring
  • SPSMON
  • Long running processes
  • Track down script in CCL
  • Correlate findings with Oracle statistics
  • Identify user trends

14
Dedicated Production Monitoring
  • Watch_Quota
  • Monitored periodically throughout the business
    day.
  • Quotas are raised if an upward trend is seen over
    time (sketch below).
  • Reduces the potential for memory resource issues.

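To make "raise the quota when a trend appears" concrete, here is a hypothetical check over periodic usage samples for a single quota. The 80% headroom rule and the six-sample window are invented for the example; WATCH_QUOTA itself is the OpenVMS-side utility named above.

```python
"""Hypothetical trend check: flag a quota for review when recent
usage samples consistently sit above a headroom threshold."""

HEADROOM_PCT = 0.80   # illustrative: review when usage exceeds 80% of the limit
WINDOW = 6            # illustrative: require 6 consecutive high samples

def needs_quota_review(samples, quota_limit):
    """samples: chronological in-use values recorded through the day."""
    if len(samples) < WINDOW:
        return False
    return all(value > HEADROOM_PCT * quota_limit for value in samples[-WINDOW:])

# Example with made-up numbers: usage is trending toward an 800 limit.
usage = [610, 650, 690, 720, 745, 760, 770]
print(needs_quota_review(usage, quota_limit=800))  # True -> consider raising the quota
```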
15
Dedicated Production Monitoring
  • Graphs
  • CPU Usage
  • Service Manager BG Device Count
  • Total BG Device Count
  • Monitor for sharp drops or increases in device
    count on one or all nodes (sketch below).
  • Monitor for spikes in CPU usage.

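As a hypothetical illustration of the "sharp drop or spike" rule applied to a series of samples (device counts or CPU), the sketch below flags large swings between consecutive readings; the 25% threshold and the sample values are invented.

```python
"""Hypothetical detector for sudden swings between consecutive samples
of a graphed metric such as BG device count or CPU usage."""

CHANGE_THRESHOLD = 0.25   # illustrative: alert on a >25% change between samples

def sudden_changes(samples):
    """Yield (index, previous, current) for each sharp drop or spike."""
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev and abs(cur - prev) / prev > CHANGE_THRESHOLD:
            yield i, prev, cur

# Example: BG device count on one node, sampled every few minutes (made-up data).
device_counts = [412, 418, 421, 300, 298, 405]
for i, prev, cur in sudden_changes(device_counts):
    print(f'sample {i}: {prev} -> {cur}  (sharp change)')
```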
16
Softek Panther
  • Panther Sensors
  • Shared Service Queue Backlog
  • Service Not Accepting Connections / Server
    Thrashing (probe sketch below)
  • Absence of Server (server deficit)

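Softek Panther raises these alerts itself; purely as a generic illustration of a "service not accepting connections" check, the hypothetical probe below attempts a TCP connection with a timeout. Host names and ports are placeholders, not Aurora's topology.

```python
"""Hypothetical TCP probe illustrating a 'service not accepting
connections' check. Endpoints are placeholders."""

import socket

def accepting_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoints standing in for monitored middleware services.
services = [('app-node-1.example.org', 4040),
            ('app-node-2.example.org', 4040)]
for host, port in services:
    status = 'OK' if accepting_connections(host, port) else 'NOT ACCEPTING CONNECTIONS'
    print(f'{host}:{port} {status}')
```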
17
Getting to 99.99% Uptime
18
Redesigning the Change Control Process
  • Change Control Prior to the Production Freeze.
  • Planned changes were reviewed at a weekly
    meeting; however, they were implemented without
    going through a formal approval process.
  • Changes were often made at the discretion of the
    analysts and engineers.
  • Change windows were reserved for major system and
    application changes.
  • Server cycling was performed ad hoc, upon request.

19
Redesigning the Change Control Process
  • Transitioning from Change Review to true Change
    Control.
  • Change Control Board (CCB) was formed.
  • All application and technical changes examined
    for business need, priority, system impact, and
    risk, then categorized into change types.
  • Exempt
  • Pre-approved
  • Management (CCB) approval.
  • Distinct change windows created to minimize
    impact to the end user.

20
Redesigning the Change Control Process
  • Change Control Today
  • Standardized templates are used to gather
    information on the requested change to assist in
    identifying business need, priority, impact, and
    risk.
  • CCB meets daily, Monday through Friday, to review
    all requested (non-exempt and management-approval)
    changes.
  • No changes are made until approved and scheduled
    into a designated change window.
  • Changes that require server cycling are limited
    to certain windows (sketch below).

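To make the change-type categories and change-window rule concrete, here is a hypothetical sketch; the Exempt / Pre-approved / CCB-approval categories come from the slides, while the data model, the Sunday cycling window, and the scheduling rule are invented for illustration.

```python
"""Hypothetical model of change types and designated change windows.
The Exempt / Pre-approved / CCB-approval categories are from the
presentation; the window times and scheduling rule are invented."""

from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class ChangeType(Enum):
    EXEMPT = 'exempt'                # no CCB review required
    PRE_APPROVED = 'pre-approved'    # standard, low-risk change
    CCB_APPROVAL = 'ccb-approval'    # requires management (CCB) sign-off

@dataclass
class ChangeRequest:
    summary: str
    change_type: ChangeType
    requires_server_cycle: bool
    approved: bool = False

def can_schedule(change: ChangeRequest, window_start: datetime,
                 cycling_window_weekday: int = 6) -> bool:
    """Illustrative rule: non-exempt changes need approval, and
    server-cycling changes may only use the (hypothetical) Sunday window."""
    if change.change_type is not ChangeType.EXEMPT and not change.approved:
        return False
    if change.requires_server_cycle:
        return window_start.weekday() == cycling_window_weekday
    return True

# Example usage with a made-up request and window.
req = ChangeRequest('Cycle chart servers for memory leak', ChangeType.CCB_APPROVAL,
                    requires_server_cycle=True, approved=True)
print(can_schedule(req, datetime(2007, 3, 4, 2, 0)))  # 2007-03-04 is a Sunday -> True
```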
21
Getting to 99.99% Uptime
22
Knowledge Sharing with Cerner
  • In August 2006, began working closely with Cerner
    on major system and stability issues.
  • A series of onsite meetings was held.
  • Began tracking all production issues and events
    which had a negative impact on the end user.
  • In cooperation with Cerner, identified and
    resolved major stability issues.

23
Getting to 99.99% Uptime
24
Service Outage Analysis
  • Tracking events
  • Hung processes, poor-performing scripts, Global
    Service Manager issues, Chart Server and RRD
    problems, Multum issues, network interruptions,
    etc.
  • Any other event that had a negative impact on
    system performance or on the end user.
  • Events were tracked in an Excel spreadsheet.

25
Service Outage Analysis
  • Data Tracked

  • Date/Time of event
  • Affected Node(s)
  • Duration
  • Vendor Log
  • Issue Description
  • S.E. (Employee)
  • Running Process/Script
  • Quota (in use)
  • Database Locks
  • Impact (NI, IF, PD, Outage)
  • Resolution
26
Service Outage Analysis
  • A closer look at Impact (modeled in the sketch
    below)
  • NI (No Impact)
  • Events that do not directly impact system
    performance or the end user experience (e.g., low
    quota on a middleware server, proactive pages for
    a low disk space warning).
  • IF (Incomplete Functionality)
  • A single function or component is not available
    or working properly (e.g., RRD or BMDI is
    unavailable).
  • PD (Performance Degradation)
  • An event that causes a performance issue or
    impacts the end user, but all functions in the
    system are available (e.g., message queuing, a
    server hung using 100% of a CPU).
  • Outage
  • Millennium is unavailable or in a degraded state
    such that users are unable to function. At this
    point, the environment can be taken for
    diagnostics that require users to be off the
    system, and even shut down to resolve the issue.

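A hypothetical sketch of how the tracked fields from the "Data Tracked" slide and the impact categories above could be modeled if the tracking were scripted; the field names mirror the slides, while the types, defaults, and example values are assumptions.

```python
"""Hypothetical model of a tracked event and its impact category.
Field names follow the 'Data Tracked' slide; types and example
values are assumptions for illustration."""

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class Impact(Enum):
    NI = 'No Impact'                  # no direct system or end-user impact
    IF = 'Incomplete Functionality'   # a single function/component unavailable
    PD = 'Performance Degradation'    # degraded, but all functions available
    OUTAGE = 'Outage'                 # Millennium unusable for end users

@dataclass
class TrackedEvent:
    event_time: datetime
    affected_nodes: list              # Affected Node(s)
    duration_minutes: Optional[int]
    vendor_log: Optional[str]         # vendor log / ticket reference
    issue_description: str
    se_employee: str                  # S.E. (Employee)
    running_process_script: Optional[str]
    quota_in_use: Optional[str]
    database_locks: Optional[str]
    impact: Impact
    resolution: str = ''

# Example record with invented values.
evt = TrackedEvent(datetime(2006, 9, 12, 10, 15), ['node3'], 20, None,
                   'Server hung at 100% of a CPU', 'SS', 'example_script',
                   None, None, Impact.PD, 'Cycled the affected server')
print(evt.impact.value)
```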
27
Service Outage Analysis
  • Events imported into an Access 2007 database
    (schema sketch below).
  • Weekly meetings held to review the prior week's
    events.
  • Like events linked to a common issue.
  • Issues examined for common trends and patterns
    (script, user, quota, behavior, etc.).
  • Issues assigned To-Dos for staff follow-up.
  • Quick escalation to Cerner once issues are found.

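The tracking described here and on the next three slides (events linked to issues, issues carrying To-Dos) lived in Access 2007; purely as an illustration of that linkage, below is a hypothetical schema using Python's built-in sqlite3. Table and column names are assumptions.

```python
"""Hypothetical relational sketch of the SOA tracking database:
daily events link to recurring issues, and issues carry To-Do
follow-up tasks. The real system used Access 2007; this uses
sqlite3 purely for illustration, and all names are assumptions."""

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE issues (
    issue_id    INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'open'   -- open / resolved / closed
);
CREATE TABLE events (
    event_id    INTEGER PRIMARY KEY,
    event_time  TEXT NOT NULL,
    impact      TEXT NOT NULL,                 -- NI / IF / PD / Outage
    description TEXT NOT NULL,
    issue_id    INTEGER REFERENCES issues(issue_id)  -- NULL until linked
);
CREATE TABLE todos (
    todo_id     INTEGER PRIMARY KEY,
    issue_id    INTEGER NOT NULL REFERENCES issues(issue_id),
    assigned_to TEXT NOT NULL,
    task        TEXT NOT NULL,
    done        INTEGER NOT NULL DEFAULT 0
);
""")

# Link a repeating daily event to an issue and log a follow-up task.
conn.execute("INSERT INTO issues (title) VALUES ('Script X hangs servers')")
conn.execute("INSERT INTO events (event_time, impact, description, issue_id) "
             "VALUES ('2006-09-12 10:15', 'PD', 'Server hung at 100% CPU', 1)")
conn.execute("INSERT INTO todos (issue_id, assigned_to, task) "
             "VALUES (1, 'engineer', 'Route script X to a dedicated server')")

# Weekly-review style query: open issues with their event counts.
for row in conn.execute("""
        SELECT i.title, COUNT(e.event_id)
        FROM issues i LEFT JOIN events e ON e.issue_id = i.issue_id
        WHERE i.status = 'open' GROUP BY i.issue_id"""):
    print(row)
```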
28
Service Outage Analysis
  • SOA Database (Review of Daily Events)
  • Daily events imported into the database.
  • Review each event and link it to a repeating
    issue (if applicable).
  • Update status.

29
Service Outage Analysis
  • SOA Database (Issue Maintenance)
  • Daily events that repeat and/or need follow-up
    tasks become an Issue.
  • Issues are tracked until a resolution is
    identified.
  • Once the resolution is implemented and the issue
    is verified to be resolved, it's closed.

30
Service Outage Analysis
  • SOA Database (To Dos)
  • Follow-up tasks are logged in the database
    and linked to the issue.
  • To Do reports are generated and sent to
    the necessary staff.

31
Service Outage Analysis
  • Middleware Tuning (Hardening)
  • In-depth look into server configurations.
  • Paging File Limits
  • Number of instances
  • Kill times
  • Request class routing
  • Routed poor-performing or unstable scripts to
    dedicated servers (i.e., scripts that frequently
    hang or terminate servers).
  • Isolating these types of scripts on dedicated
    servers reduced the impact on users (routing
    sketch below).
  • Installed a Service Package to correct an issue
    with the Clinical Event Server hanging, along
    with the Code Cache Synchronization package.
  • Became proactive rather than reactive.

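The isolation itself was done with Cerner middleware request-class routing, whose configuration is not reproduced here; the hypothetical routing table below is only a conceptual stand-in for the idea of sending known-problem scripts to a dedicated server pool. All names are placeholders.

```python
"""Conceptual stand-in for request-class routing: known poor-performing
or unstable scripts are sent to a dedicated server pool so that a hang
or server termination does not affect the general-purpose pool.
Script and pool names are placeholders, not Cerner configuration."""

# Hypothetical list of scripts observed to hang or terminate servers.
ISOLATED_SCRIPTS = {'bad_report_script', 'heavy_query_script'}

SERVER_POOLS = {
    'general': ['srv_a', 'srv_b', 'srv_c'],
    'isolated': ['srv_quarantine_1', 'srv_quarantine_2'],
}

def route(script_name: str) -> str:
    """Return the pool a request for this script should run in."""
    return 'isolated' if script_name in ISOLATED_SCRIPTS else 'general'

for script in ('normal_script', 'bad_report_script'):
    pool = route(script)
    print(f'{script} -> {pool} pool ({SERVER_POOLS[pool]})')
```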
32
Service Outage Analysis
  • Other Tuning (Hardening)
  • Chart Servers were consistently having memory
    resource issues, causing distributions to error
    or hang in-process.
  • Implemented an auto-reboot schedule for the Chart
    Servers.
  • Multum interaction/reaction requests were slow
    and would often hang.
  • Increased the number of Multum Servers and
    installed a corrective component fix.
  • User processes caused issues with Middleware
    processes.
  • Through trend analysis, identified specific tasks
    users were performing that caused inefficient
    calls to the database, servers to hang, etc.
  • Worked with application teams and end users to
    modify their processes, and even their
    preferences, to prevent the issues from recurring.

33
Post-Production Freeze
  • Since the Production Freeze, which was lifted at
    the end of August 2006, we've accomplished the
    following:
  • Code upgrade from 2005.02.24 to 2005.02.53
  • Achieved 99.99% uptime.
  • Provided a stable Millennium system for our end
    users.
  • Secured our ability to continue with scheduled
    deployments and implementing new functionality,
    thus meeting our strategic goals.

34
Current State
35
Current State
  • No unplanned Millennium downtime since mid-August
    2006.
  • How we got here
  • Dedicated Production Monitoring
  • Redesigning Change Control
  • Knowledge Sharing with Cerner
  • Service Outage Analysis

36
Current State
  • No unplanned Millennium downtime since mid-August
    2006.
  • How we got here (continued)
  • Focus shifted from a strong deployment schedule
    to stability.
  • Maintaining a stable code base (commitment to the
    same code base for one year with minimal
    exception packages).

37
Wrapping up
  • Questions?

38
Wrapping up
  • Contact Information
  • Eric Ried
  • Supervisor, Cerner Technical Support
  • Aurora Health Care
  • 414.647.3068
  • Eric.Ried@Aurora.Org
  • Steve Sonderman
  • Software Systems Administrator
  • Aurora Health Care
  • 414.647.6422
  • Steve.Sonderman@Aurora.Org