1. Cerner Millennium System Stability: Getting to 99.99% Uptime
Eric Ried, Eric.Ried_at_Aurora.Org, Aurora Health Care, Inc., Milwaukee, Wisconsin
Steve Sonderman, Steve.Sonderman_at_Aurora.Org, Aurora Health Care, Inc., Milwaukee, Wisconsin
2. Aurora Health Care as an Organization
- Largest private employer in the state of Wisconsin, with 25,000 employees
- 660 employed physicians
- 3,400 physicians on staff
- Comprised of:
- 13 hospitals
- Over 100 clinics
- Over 120 retail pharmacies
- Homecare
- Hospice
- Other outpatient treatment centers
3. Cerner Millennium Installations Across Aurora
- Facilities live on Millennium:
- All 13 hospitals
- 85 clinics
- Approximately 3.5 million patients in the record system.
4. Millennium Infrastructure at Aurora (Production Environment)
- Hardware
- 4 HP GS1280 Alpha Servers
- Two 32-way
- Two 24-way
- Redundant configurations on application nodes.
- 270 Citrix Servers
- 8 Chart Servers
- 3 RRD Servers (12-port total capacity)
- 7 Multum Servers
- Other servers include Document Imaging and BMDI
5. Millennium Infrastructure at Aurora (Production Environment)
- Database
- Oracle 9i
- Current size of 6.0 TB
- Applications
- 6,500 concurrent users during peak usage times
- Powerchart
- ProVision
- Powerchart Office
- Clinical Documentation
- Radnet
- iNet
- Surginet
- Firstnet
- ERM
- BMDI
- Pharmnet/eMar/Barcoding
- Clinical Reporting/RRD/MRP
- Scheduling
6. Stability History
- Growth of the organization and the deployment schedule.
- Aurora was growing as an organization.
- Strong focus on rolling out the Electronic Health Record (Cerner Millennium) to integrate facilities.
- To accommodate this, numerous application and system changes were made frequently.
- Frequent Service Packages taken for new functionality.
- Operating in a break/fix mode.
- Frequent system outages were occurring.
7. Stability History
8. Stability History
- July 2006
- Freeze placed on all non-essential Production changes.
- Change Control Process redesigned with the creation of the Change Control Board (CCB) to oversee and approve all changes.
- All deployment of new functionality, or of existing functionality to new facilities, was put on hold.
- In-depth analysis of current stability issues began.
- Set a short-term goal of 99.75% uptime (a quick downtime-budget calculation follows this slide).
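For context on what these uptime targets mean in practice, here is a minimal sketch (plain Python, not one of the tools used at Aurora) that converts an uptime percentage into a downtime budget. The 99.75% and 99.99% figures come from the slides; everything else is ordinary calendar arithmetic.

```python
# Convert an uptime target into an allowed-downtime budget.
# 99.75% and 99.99% are the targets named in the slides;
# the time spans are plain calendar arithmetic.

MINUTES_PER_DAY = 24 * 60

def downtime_budget_minutes(uptime_pct: float, days: int) -> float:
    """Minutes of allowed downtime over `days` at the given uptime percentage."""
    return (1 - uptime_pct / 100.0) * days * MINUTES_PER_DAY

for target in (99.75, 99.99):
    per_month = downtime_budget_minutes(target, 30)
    per_year = downtime_budget_minutes(target, 365)
    print(f"{target}% uptime -> ~{per_month:.1f} min/month, ~{per_year:.0f} min/year")

# 99.75% allows roughly 108 minutes of downtime in a 30-day month;
# 99.99% allows roughly 4.3 minutes.
```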
9. Getting to 99.99% Uptime
10. Getting to 99.99% Uptime
11. Dedicated Production Monitoring
- 7 AM to 6 PM monitoring, Monday through Friday
- Production monitored by an Engineer/Administrator at all times throughout the business day.
- Primary utilities used in monitoring:
- SYSMON
- MON PROC/TOPC
- SPSMON
- WATCH_QUOTA
- BMC Patrol
- Softek Panther (Panther Sensors)
12. Dedicated Production Monitoring
- SYSMON
- Message queuing
- Sharp drop in connections
- Terminating servers
- High connection count
- MON PROC/TOPC
- All four nodes
- Hung processes
- High CPU usage (poor script performance); a simple threshold-check sketch follows this slide
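The checks above are done interactively with OpenVMS and Cerner utilities (SYSMON, MON PROC/TOPC). As a rough illustration of the kind of thresholding involved, here is a minimal Python sketch that flags a sharp drop in connection count and high-CPU processes from periodic samples; the thresholds, field names, and sample values are assumptions, not anything those tools actually output.

```python
# Illustrative only: the deck's monitoring is done with SYSMON and
# MON PROC/TOPC on the Alpha nodes; this sketch just shows the kind of
# rules being applied. Thresholds and sample structures are made up.

def sharp_connection_drop(prev_count: int, curr_count: int,
                          drop_fraction: float = 0.30) -> bool:
    """Flag when connections fall by more than drop_fraction between samples."""
    if prev_count == 0:
        return False
    return (prev_count - curr_count) / prev_count > drop_fraction

def high_cpu_processes(samples: list[dict], cpu_threshold: float = 90.0) -> list[dict]:
    """Return processes (e.g., hung servers or poor scripts) above the CPU threshold."""
    return [p for p in samples if p["cpu_pct"] >= cpu_threshold]

# Hypothetical samples from one node:
node_samples = [
    {"process": "srv_script_a", "cpu_pct": 98.7},
    {"process": "srv_chart_01", "cpu_pct": 12.4},
]
print(sharp_connection_drop(prev_count=5200, curr_count=3100))  # True -> investigate
print(high_cpu_processes(node_samples))                          # flags srv_script_a
```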
13. Dedicated Production Monitoring
- SPSMON
- Long running processes
- Track down script in CCL
- Correlate findings with Oracle statistics
- Identify user trends
14. Dedicated Production Monitoring
- Watch_Quota
- Monitored periodically throughout the business day.
- Quotas raised if a trend is seen over time (a trend-check sketch follows this slide).
- Reduces the potential for memory resource issues.
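A minimal sketch of the "raise the quota when a trend appears" idea described above. This is not WATCH_QUOTA itself; the quota name, sample values, window size, and 80% threshold are all assumptions for illustration.

```python
# Illustrative sketch of trend-based quota review (not WATCH_QUOTA itself).
# A quota is flagged for an increase when its in-use value stays above a
# chosen fraction of the limit across recent samples. Numbers are made up.

def needs_quota_increase(samples: list[int], limit: int,
                         threshold: float = 0.80, window: int = 5) -> bool:
    """True when the last `window` samples all exceed threshold * limit."""
    recent = samples[-window:]
    return len(recent) == window and all(s > threshold * limit for s in recent)

# Hypothetical in-use samples (e.g., a byte-count quota) taken through the day:
quota_in_use = [60_000, 71_000, 83_000, 85_000, 86_000, 88_000, 90_000]
print(needs_quota_increase(quota_in_use, limit=100_000))  # True -> schedule a raise
```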
15. Dedicated Production Monitoring
- Graphs
- CPU Usage
- Service Manager BG Device Count
- Total BG Device Count
- Monitor for sharp drops or increases in device count on one or all nodes.
- Monitor for spikes in CPU usage.
16. Softek Panther
- Panther Sensors
- Shared Service Queue Backlog
- Service Not Accepting Connections
- Server Thrashing
- Absence of Server (server deficit)
17. Getting to 99.99% Uptime
18. Redesigning the Change Control Process
- Change Control prior to the Production Freeze
- Planned changes were reviewed at a weekly meeting; however, they were implemented without going through a formal approval process.
- Changes were often made at the discretion of the analysts and engineers.
- Change windows were reserved for major system and application changes.
- Server cycling was performed ad hoc, upon request.
19. Redesigning the Change Control Process
- Transitioning from Change Review to true Change Control
- Change Control Board (CCB) was formed.
- All application and technical changes examined for business need, priority, system impact, and risk, then categorized into change types:
- Exempt
- Pre-approved
- Management (CCB) approval
- Distinct change windows created to minimize impact to the end user.
20. Redesigning the Change Control Process
- Change Control today
- Standardized templates are used to gather information on the requested change to assist in identifying business need, priority, impact, and risk (a sketch of such a record follows this slide).
- CCB meets daily, Monday through Friday, to review all requested (non-exempt and management-approval) changes.
- No changes made until approved and scheduled into a designated change window.
- Changes that require server cycling are limited to certain windows.
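As a rough illustration of the standardized template and the three change types described above, here is a minimal sketch of a change request record and how it might be routed to the daily CCB review. The field names and the routing rule are assumptions based on the slides, not the CCB's actual form or process.

```python
# Illustrative sketch of a change request record and its routing.
# Field names and the routing rule are assumptions based on the slides,
# not Aurora's actual CCB template.

from dataclasses import dataclass
from enum import Enum

class ChangeType(Enum):
    EXEMPT = "Exempt"                               # no CCB review required
    PRE_APPROVED = "Pre-approved"                   # standard, low-risk change
    CCB_APPROVAL = "Management (CCB) approval"

@dataclass
class ChangeRequest:
    description: str
    business_need: str
    priority: str           # e.g., "Medium"
    system_impact: str      # e.g., "RRD servers only"
    risk: str               # e.g., "Low"
    change_type: ChangeType
    requires_server_cycle: bool = False  # cycling limited to certain windows

def goes_to_daily_ccb(req: ChangeRequest) -> bool:
    """Non-exempt changes are reviewed at the daily CCB meeting (assumption)."""
    return req.change_type is not ChangeType.EXEMPT

req = ChangeRequest("Add new RRD printer", "New clinic go-live", "Medium",
                    "RRD servers", "Low", ChangeType.CCB_APPROVAL,
                    requires_server_cycle=True)
print(goes_to_daily_ccb(req))  # True -> schedule into a designated change window
```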
21. Getting to 99.99% Uptime
22. Knowledge Sharing with Cerner
- In August 2006, began working closely with Cerner on major system and stability issues.
- A series of onsite meetings was held.
- Began tracking all production issues and events which had a negative impact on the end user.
- In cooperation with Cerner, identified and resolved major stability issues.
23. Getting to 99.99% Uptime
24. Service Outage Analysis
- Tracking events
- Hung processes, poor-performing scripts, Global Service Manager issues, Chart Server and RRD problems, Multum issues, network interruptions, etc.
- Any other event that had a negative impact on system performance or on the end user.
- Events tracked on an Excel spreadsheet.
25. Service Outage Analysis
- Fields tracked for each event:
- Date/Time of event
- Affected Node(s)
- Duration
- Vendor Log
- Issue Description
- S.E. (Employee)
- Running Process/Script
- Quota (in use)
- Database Locks
- Impact (NI, IF, PD, Outage)
- Resolution
26. Service Outage Analysis
- A closer look at Impact
- NI (No Impact)
- Events that do not directly impact system performance or the end user experience (e.g., low quota on a middleware server, proactive pages for a low-disk-space warning).
- IF (Incomplete Functionality)
- A single function or component is not available or working properly (e.g., RRD or BMDI is unavailable).
- PD (Performance Degradation)
- An event that causes a performance issue or impacts the end user, but all functions in the system are available (e.g., message queuing, server hangs using 100% of a CPU).
- Outage
- Millennium is unavailable or in a degraded state such that users are unable to function. At this point, the environment can be taken for diagnostics that require users to be off the system, and even shut down to resolve the issue. (A sketch of an event record using these categories follows this slide.)
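Putting the spreadsheet columns from slide 25 together with these impact categories, here is a minimal sketch of how one event record might be captured before import into the Access database. The field names mirror the columns listed earlier; the example values are hypothetical.

```python
# Illustrative sketch of one Service Outage Analysis event record.
# Fields mirror the spreadsheet columns listed on the earlier slide;
# all example values below are hypothetical.

from dataclasses import dataclass
from enum import Enum

class Impact(Enum):
    NI = "No Impact"
    IF = "Incomplete Functionality"
    PD = "Performance Degradation"
    OUTAGE = "Outage"

@dataclass
class OutageEvent:
    event_time: str          # Date/Time of event
    affected_nodes: str      # Affected Node(s)
    duration_min: int        # Duration
    vendor_log: str          # Vendor Log reference
    description: str         # Issue Description
    system_engineer: str     # S.E. (Employee)
    running_process: str     # Running Process/Script
    quota_in_use: str        # Quota (in use)
    database_locks: bool     # Database Locks
    impact: Impact           # NI, IF, PD, Outage
    resolution: str

evt = OutageEvent("2006-10-03 09:14", "Node 2", 12, "(vendor log ref placeholder)",
                  "Server hung using 100% of a CPU", "on-call S.E.",
                  "example_script", "byte-count quota near limit", False,
                  Impact.PD, "Process stopped; script routed to a dedicated server")
print(evt.impact.value)  # "Performance Degradation"
```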
27. Service Outage Analysis
- Events imported into an Access 2007 database.
- Weekly meetings held to review the prior week's events.
- Like events linked to a common issue.
- Issues examined for common trends and patterns (script, user, quota, behavior, etc.).
- Issues assigned To Dos for staff follow-up.
- Quick escalation to Cerner once issues are found.
28. Service Outage Analysis
- SOA Database (Review of Daily Events)
- Daily events imported into the database.
- Review each event and link it to a repeating issue (if applicable).
- Update status.
29. Service Outage Analysis
- SOA Database (Issue Maintenance)
- Daily events that repeat and/or need follow-up tasks become an Issue.
- Issues are tracked until a resolution is identified.
- Once the resolution is implemented and the issue is verified to be resolved, it is closed.
30. Service Outage Analysis
- SOA Database (To Dos)
- Follow-up tasks are logged in the database and linked to the issue (a minimal schema sketch follows this slide).
- To Do reports are generated and sent to the necessary staff.
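The SOA database itself is Access 2007. As a database-agnostic sketch of the same structure (daily events linked to repeating issues, and issues linked to follow-up To Dos), here is a minimal SQLite version; the table and column names are assumptions, not the actual Access design.

```python
# Illustrative only: the SOA database described in the slides is Access 2007.
# This SQLite sketch just shows the event -> issue -> to-do linkage;
# table and column names are assumptions.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issues (
    issue_id    INTEGER PRIMARY KEY,
    summary     TEXT,
    status      TEXT            -- stays open until the resolution is verified
);
CREATE TABLE events (
    event_id    INTEGER PRIMARY KEY,
    event_time  TEXT,
    impact      TEXT,           -- NI, IF, PD, Outage
    description TEXT,
    issue_id    INTEGER REFERENCES issues(issue_id)   -- link repeats to an issue
);
CREATE TABLE todos (
    todo_id     INTEGER PRIMARY KEY,
    issue_id    INTEGER REFERENCES issues(issue_id),
    assigned_to TEXT,
    task        TEXT,
    done        INTEGER DEFAULT 0
);
""")

# Link a daily event to a repeating issue and log a follow-up task for staff.
conn.execute("INSERT INTO issues (summary, status) VALUES (?, ?)",
             ("Script hangs middleware server", "Open"))
conn.execute("INSERT INTO events (event_time, impact, description, issue_id) "
             "VALUES (?, ?, ?, ?)", ("2006-10-03 09:14", "PD", "Server hang", 1))
conn.execute("INSERT INTO todos (issue_id, assigned_to, task) VALUES (?, ?, ?)",
             (1, "staff member", "Route script to dedicated server"))
print(conn.execute("SELECT COUNT(*) FROM events WHERE issue_id = 1").fetchone())
```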
31. Service Outage Analysis
- Middleware Tuning (Hardening)
- In-depth look into server configurations:
- Paging file limits
- Number of instances
- Kill times
- Request class routing
- Routed poor-performing or unstable scripts to dedicated servers (i.e., scripts that frequently hang or terminate servers); a conceptual routing sketch follows this slide.
- Reduced the impact on users by isolating these types of scripts on dedicated servers.
- Installed a Service Package to correct an issue with the Clinical Event Server hanging, as well as the Code Cache Synchronization package.
- Became proactive rather than reactive.
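As a conceptual illustration of the request-class-routing idea (isolating poorly behaving scripts on dedicated servers so that a hang affects only that pool), here is a minimal sketch. This is not Cerner middleware configuration; the script names and pool names are made up.

```python
# Conceptual sketch of routing unstable scripts to a dedicated server pool.
# This is NOT Cerner middleware configuration; script names and pool names
# are invented to show the isolation idea.

ISOLATED_SCRIPTS = {"script_known_to_hang", "script_heavy_query"}

def choose_pool(script_name: str) -> str:
    """Send known-problem scripts to their own pool; everything else shares."""
    return "dedicated_pool" if script_name in ISOLATED_SCRIPTS else "general_pool"

print(choose_pool("script_known_to_hang"))  # dedicated_pool
print(choose_pool("routine_lookup"))        # general_pool

# If a script in the dedicated pool hangs or terminates its server,
# users served by the general pool are unaffected.
```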
32. Service Outage Analysis
- Other Tuning (Hardening)
- Chart Servers were consistently having memory resource issues, causing distributions to error or hang in-process.
- Implemented an auto-reboot schedule for the Chart Servers.
- Multum interaction/reaction requests were slow and would often hang.
- Increased the number of Multum Servers and installed a corrective component fix.
- User processes caused issues with Middleware processes.
- Through trend analysis, identified specific tasks users were performing that caused inefficient calls to the database, servers to hang, etc.
- Worked with application teams and the end users to modify their processes, and even their preferences, to prevent the issues from occurring.
33. Post-Production Freeze
- Since the Production Freeze, which was lifted at the end of August 2006, we have accomplished the following:
- Code upgrade from 2005.02.24 to 2005.02.53
- Achieved 99.99% uptime.
- Provided a stable Millennium system for our end users.
- Secured our ability to continue with scheduled deployments and implementing new functionality, thus meeting our strategic goals.
34. Current State
35. Current State
- No unplanned Millennium downtime since mid-August 2006.
- How we got here:
- Dedicated Production Monitoring
- Redesigning Change Control
- Knowledge Sharing with Cerner
- Service Outage Analysis
36. Current State
- No unplanned Millennium downtime since mid-August 2006.
- How we got here (continued):
- Focus shifted from a strong deployment schedule to stability.
- Maintaining a stable code base (commitment to the same code base for one year, with minimal exception packages).
37. Wrapping Up
38. Wrapping Up
- Contact Information
- Eric Ried
- Supervisor, Cerner Technical Support
- Aurora Health Care
- 414.647.3068
- Eric.Ried_at_Aurora.Org
- Steve Sonderman
- Software Systems Administrator
- Aurora Health Care
- 414.647.6422
- Steve.Sonderman_at_Aurora.Org