1

Cerner Millennium System Stability: Getting to 99.99% Uptime
Eric Ried
Eric.Ried@Aurora.Org
Aurora Health Care, Inc., Milwaukee, Wisconsin
Steve Sonderman
Steve.Sonderman@Aurora.Org
Aurora Health Care, Inc., Milwaukee, Wisconsin
2
Aurora Health Care as an Organization
  • Largest private employer in the state of
    Wisconsin
  • 25,000 employees
  • 660 employed physicians
  • 3,400 physicians on staff
  • Comprised of:
  • 13 hospitals
  • Over 100 clinics
  • Over 120 retail pharmacies
  • Homecare
  • Hospice
  • Other outpatient treatment centers

3
Cerner Millennium Installations Across Aurora
  • Facilities live on Millennium
  • All 13 hospitals
  • 85 Clinics
  • Approximately 3.5 million patients in the record
    system.

4
Millennium Infrastructure at Aurora (Production Environment)
  • Hardware
  • 4 HP GS1280 Alpha Servers
  • Two 32-way
  • Two 24-way
  • Redundant configurations on application nodes.
  • 270 Citrix Servers
  • 8 Chart Servers
  • 3 RRD Servers (12 port total capacity)
  • 7 Multum Servers
  • Other servers include Document Imaging and BMDI

5
Millennium Infrastructure at Aurora (Production Environment)
  • Database
  • Oracle 9i
  • Current size of 6.0 TB
  • Applications
  • 6,500 concurrent users during peak usage times

  • PowerChart
  • ProVision
  • PowerChart Office
  • Clinical Documentation
  • RadNet
  • iNet
  • SurgiNet
  • FirstNet
  • ERM
  • BMDI
  • PharmNet/eMAR/Barcoding
  • Clinical Reporting/RRD/MRP
  • Scheduling
6
Stability History
  • Growth of Organization and Deployment schedule.
  • Aurora was growing as an organization.
  • Strong focus on rolling out the Electronic Health
    Record (Cerner Millennium) to integrate
    facilities.
  • To accommodate this growth, numerous application
    and system changes were made frequently.
  • Frequent Service Packages were taken for new
    functionality.
  • Operating in a break/fix mode.
  • Frequent system outages were occurring.

7
Stability History
8
Stability History
  • July 2006
  • Freeze placed on all non-essential Production
    Changes.
  • Change Control Process Redesigned with creation
    of the Change Control Board (CCB) to oversee and
    approve all changes.
  • All deployment of new functionality or existing
    functionality to new facilities was put on hold.
  • In-depth analysis of current stability issues
    began.
  • Set a short-term goal of 99.75% uptime.

9
Getting to 99.99% Uptime
10
Getting to 99.99% Uptime
11
Dedicated Production Monitoring
  • 7 AM to 6 PM monitoring, Monday through Friday
  • Production monitored by an Engineer/Administrator
    at all times throughout the business day.
  • Primary utilities used in Monitoring
  • SYSMON
  • MON PROC/TOPC
  • SPSMON
  • WATCH_QUOTA
  • BMC Patrol
  • Softek Panther (Panther Sensors)

12
Dedicated Production Monitoring
  • SYSMON
  • Message queuing
  • Sharp drop in connections
  • Terminating servers
  • High connection count
  • MON PROC/TOPC
  • All four nodes
  • Hung processes
  • High CPU usage (poor script performance; sketch
    below)

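The checks above run through OpenVMS/Cerner utilities (SYSMON, MON PROC/TOPC); as a rough, hypothetical sketch of the same idea in more widely available terms, the Python below polls processes and flags sustained high CPU. The psutil package, the 90% threshold, and the polling interval are assumptions for illustration, not Aurora's tooling or values.

```python
"""Hypothetical polling sketch: flag processes running at very high CPU,
in the spirit of the MON PROC/TOPC checks described above."""

import time
import psutil  # assumption: third-party package, not part of the original setup

CPU_ALERT_PCT = 90.0   # illustrative threshold for "poor script performance"
POLL_SECONDS = 30      # illustrative polling interval

def poll_once():
    # Prime per-process CPU counters, wait briefly, then read real samples.
    procs = list(psutil.process_iter(['pid', 'name']))
    for p in procs:
        try:
            p.cpu_percent(None)
        except psutil.Error:
            pass
    time.sleep(1.0)
    alerts = []
    for p in procs:
        try:
            cpu = p.cpu_percent(None)
        except psutil.Error:
            continue  # process exited between samples
        if cpu >= CPU_ALERT_PCT:
            alerts.append((p.pid, p.info['name'], cpu))
    return alerts

if __name__ == '__main__':
    while True:
        for pid, name, cpu in poll_once():
            print(f'ALERT: pid={pid} ({name}) at {cpu:.0f}% CPU')
        time.sleep(POLL_SECONDS)
```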
13
Dedicated Production Monitoring
  • SPSMON
  • Long running processes
  • Track down script in CCL
  • Correlate findings with Oracle statistics
  • Identify user trends

14
Dedicated Production Monitoring
  • Watch_Quota
  • Monitored periodically throughout the business
    day.
  • Quotas are raised if an upward trend is seen over
    time (sketch below).
  • Reduces the potential for memory resource issues.

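To make "raise the quota when a trend appears" concrete, here is a hypothetical check over periodic usage samples for a single quota. The 80% headroom rule and the six-sample window are invented for the example; WATCH_QUOTA itself is the OpenVMS-side utility named above.

```python
"""Hypothetical trend check: flag a quota for review when recent
usage samples consistently sit above a headroom threshold."""

HEADROOM_PCT = 0.80   # illustrative: review when usage exceeds 80% of the limit
WINDOW = 6            # illustrative: require 6 consecutive high samples

def needs_quota_review(samples, quota_limit):
    """samples: chronological in-use values recorded through the day."""
    if len(samples) < WINDOW:
        return False
    return all(value > HEADROOM_PCT * quota_limit for value in samples[-WINDOW:])

# Example with made-up numbers: usage is trending toward an 800 limit.
usage = [610, 650, 690, 720, 745, 760, 770]
print(needs_quota_review(usage, quota_limit=800))  # True -> consider raising the quota
```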
15
Dedicated Production Monitoring
  • Graphs
  • CPU Usage
  • Service Manager BG Device Count
  • Total BG Device Count
  • Monitor for sharp drops or increases in device
    count on one or all nodes (sketch below).
  • Monitor for spikes in CPU usage.

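As a hypothetical illustration of the "sharp drop or spike" rule applied to a series of samples (device counts or CPU), the sketch below flags large swings between consecutive readings; the 25% threshold and the sample values are invented.

```python
"""Hypothetical detector for sudden swings between consecutive samples
of a graphed metric such as BG device count or CPU usage."""

CHANGE_THRESHOLD = 0.25   # illustrative: alert on a >25% change between samples

def sudden_changes(samples):
    """Yield (index, previous, current) for each sharp drop or spike."""
    for i in range(1, len(samples)):
        prev, cur = samples[i - 1], samples[i]
        if prev and abs(cur - prev) / prev > CHANGE_THRESHOLD:
            yield i, prev, cur

# Example: BG device count on one node, sampled every few minutes (made-up data).
device_counts = [412, 418, 421, 300, 298, 405]
for i, prev, cur in sudden_changes(device_counts):
    print(f'sample {i}: {prev} -> {cur}  (sharp change)')
```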
16
Softek Panther
  • Panther Sensors
  • Shared Service Queue Backlog
  • Service Not Accepting Connections / Server
    Thrashing (probe sketch below)
  • Absence of Server (server deficit)

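Softek Panther raises these alerts itself; purely as a generic illustration of a "service not accepting connections" check, the hypothetical probe below attempts a TCP connection with a timeout. Host names and ports are placeholders, not Aurora's topology.

```python
"""Hypothetical TCP probe illustrating a 'service not accepting
connections' check. Endpoints are placeholders."""

import socket

def accepting_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Placeholder endpoints standing in for monitored middleware services.
services = [('app-node-1.example.org', 4040),
            ('app-node-2.example.org', 4040)]
for host, port in services:
    status = 'OK' if accepting_connections(host, port) else 'NOT ACCEPTING CONNECTIONS'
    print(f'{host}:{port} {status}')
```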
17
Getting to 99.99% Uptime
18
Redesigning the Change Control Process
  • Change Control Prior to the Production Freeze.
  • Planned changes were reviewed at a weekly
    meeting; however, they were implemented without
    going through a formal approval process.
  • Changes were often made at the discretion of the
    analysts and engineers.
  • Change windows were reserved for major system and
    application changes.
  • Server cycling was performed ad hoc, upon request.

19
Redesigning the Change Control Process
  • Transitioning from Change Review to true Change
    Control.
  • Change Control Board (CCB) was formed.
  • All application and technical changes examined
    for business need, priority, system impact, and
    risk, then categorized into change types.
  • Exempt
  • Pre-approved
  • Management (CCB) approval.
  • Distinct change windows created to minimize
    impact to the end user.

20
Redesigning the Change Control Process
  • Change Control Today
  • Standardized templates are used to gather
    information on the requested change to assist in
    identifying business need, priority, impact, and
    risk.
  • CCB meets daily, Monday through Friday, to review
    all requested (non-exempt and management-approval)
    changes.
  • No changes are made until approved and scheduled
    into a designated change window.
  • Changes that require server cycling are limited
    to certain windows (sketch below).

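To make the change-type categories and change-window rule concrete, here is a hypothetical sketch; the Exempt / Pre-approved / CCB-approval categories come from the slides, while the data model, the Sunday cycling window, and the scheduling rule are invented for illustration.

```python
"""Hypothetical model of change types and designated change windows.
The Exempt / Pre-approved / CCB-approval categories are from the
presentation; the window times and scheduling rule are invented."""

from dataclasses import dataclass
from datetime import datetime
from enum import Enum

class ChangeType(Enum):
    EXEMPT = 'exempt'                # no CCB review required
    PRE_APPROVED = 'pre-approved'    # standard, low-risk change
    CCB_APPROVAL = 'ccb-approval'    # requires management (CCB) sign-off

@dataclass
class ChangeRequest:
    summary: str
    change_type: ChangeType
    requires_server_cycle: bool
    approved: bool = False

def can_schedule(change: ChangeRequest, window_start: datetime,
                 cycling_window_weekday: int = 6) -> bool:
    """Illustrative rule: non-exempt changes need approval, and
    server-cycling changes may only use the (hypothetical) Sunday window."""
    if change.change_type is not ChangeType.EXEMPT and not change.approved:
        return False
    if change.requires_server_cycle:
        return window_start.weekday() == cycling_window_weekday
    return True

# Example usage with a made-up request and window.
req = ChangeRequest('Cycle chart servers for memory leak', ChangeType.CCB_APPROVAL,
                    requires_server_cycle=True, approved=True)
print(can_schedule(req, datetime(2007, 3, 4, 2, 0)))  # 2007-03-04 is a Sunday -> True
```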
21
Getting to 99.99% Uptime
22
Knowledge Sharing with Cerner
  • In August 2006, began working closely with Cerner
    on major system and stability issues.
  • A series of onsite meetings was held.
  • Began tracking all production issues and events
    which had a negative impact on the end user.
  • In cooperation with Cerner, identified and
    resolved major stability issues.

23
Getting to 99.99% Uptime
24
Service Outage Analysis
  • Tracking events
  • Hung processes, poor-performing scripts, Global
    Service Manager issues, Chart Server and RRD
    problems, Multum issues, network interruptions,
    etc.
  • Any other event that had a negative impact on
    system performance or on the end user.
  • Events were tracked in an Excel spreadsheet.

25
Service Outage Analysis
  • Data Tracked

  • Date/Time of event
  • Affected Node(s)
  • Duration
  • Vendor Log
  • Issue Description
  • S.E. (Employee)
  • Running Process/Script
  • Quota (in use)
  • Database Locks
  • Impact (NI, IF, PD, Outage)
  • Resolution
26
Service Outage Analysis
  • A closer look at Impact (modeled in the sketch
    below)
  • NI (No Impact)
  • Events that do not directly impact system
    performance or the end user experience (e.g., low
    quota on a middleware server, proactive pages for
    a low disk space warning).
  • IF (Incomplete Functionality)
  • A single function or component is not available
    or working properly (e.g., RRD or BMDI is
    unavailable).
  • PD (Performance Degradation)
  • An event that causes a performance issue or
    impacts the end user, but all functions in the
    system are available (e.g., message queuing, a
    server hung using 100% of a CPU).
  • Outage
  • Millennium is unavailable or in a degraded state
    such that users are unable to function. At this
    point, the environment can be taken for
    diagnostics that require users to be off the
    system, and even shut down to resolve the issue.

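A hypothetical sketch of how the tracked fields from the "Data Tracked" slide and the impact categories above could be modeled if the tracking were scripted; the field names mirror the slides, while the types, defaults, and example values are assumptions.

```python
"""Hypothetical model of a tracked event and its impact category.
Field names follow the 'Data Tracked' slide; types and example
values are assumptions for illustration."""

from dataclasses import dataclass
from datetime import datetime
from enum import Enum
from typing import Optional

class Impact(Enum):
    NI = 'No Impact'                  # no direct system or end-user impact
    IF = 'Incomplete Functionality'   # a single function/component unavailable
    PD = 'Performance Degradation'    # degraded, but all functions available
    OUTAGE = 'Outage'                 # Millennium unusable for end users

@dataclass
class TrackedEvent:
    event_time: datetime
    affected_nodes: list              # Affected Node(s)
    duration_minutes: Optional[int]
    vendor_log: Optional[str]         # vendor log / ticket reference
    issue_description: str
    se_employee: str                  # S.E. (Employee)
    running_process_script: Optional[str]
    quota_in_use: Optional[str]
    database_locks: Optional[str]
    impact: Impact
    resolution: str = ''

# Example record with invented values.
evt = TrackedEvent(datetime(2006, 9, 12, 10, 15), ['node3'], 20, None,
                   'Server hung at 100% of a CPU', 'SS', 'example_script',
                   None, None, Impact.PD, 'Cycled the affected server')
print(evt.impact.value)
```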
27
Service Outage Analysis
  • Events imported into an Access 2007 database
    (schema sketch below).
  • Weekly meetings held to review the prior week's
    events.
  • Like events linked to a common issue.
  • Issues examined for common trends and patterns
    (script, user, quota, behavior, etc.).
  • Issues assigned To-Dos for staff follow-up.
  • Quick escalation to Cerner once issues are found.

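The tracking described here and on the next three slides (events linked to issues, issues carrying To-Dos) lived in Access 2007; purely as an illustration of that linkage, below is a hypothetical schema using Python's built-in sqlite3. Table and column names are assumptions.

```python
"""Hypothetical relational sketch of the SOA tracking database:
daily events link to recurring issues, and issues carry To-Do
follow-up tasks. The real system used Access 2007; this uses
sqlite3 purely for illustration, and all names are assumptions."""

import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
CREATE TABLE issues (
    issue_id    INTEGER PRIMARY KEY,
    title       TEXT NOT NULL,
    status      TEXT NOT NULL DEFAULT 'open'   -- open / resolved / closed
);
CREATE TABLE events (
    event_id    INTEGER PRIMARY KEY,
    event_time  TEXT NOT NULL,
    impact      TEXT NOT NULL,                 -- NI / IF / PD / Outage
    description TEXT NOT NULL,
    issue_id    INTEGER REFERENCES issues(issue_id)  -- NULL until linked
);
CREATE TABLE todos (
    todo_id     INTEGER PRIMARY KEY,
    issue_id    INTEGER NOT NULL REFERENCES issues(issue_id),
    assigned_to TEXT NOT NULL,
    task        TEXT NOT NULL,
    done        INTEGER NOT NULL DEFAULT 0
);
""")

# Link a repeating daily event to an issue and log a follow-up task.
conn.execute("INSERT INTO issues (title) VALUES ('Script X hangs servers')")
conn.execute("INSERT INTO events (event_time, impact, description, issue_id) "
             "VALUES ('2006-09-12 10:15', 'PD', 'Server hung at 100% CPU', 1)")
conn.execute("INSERT INTO todos (issue_id, assigned_to, task) "
             "VALUES (1, 'engineer', 'Route script X to a dedicated server')")

# Weekly-review style query: open issues with their event counts.
for row in conn.execute("""
        SELECT i.title, COUNT(e.event_id)
        FROM issues i LEFT JOIN events e ON e.issue_id = i.issue_id
        WHERE i.status = 'open' GROUP BY i.issue_id"""):
    print(row)
```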
28
Service Outage Analysis
  • SOA Database (Review of Daily Events)
  • Daily events imported into the database.
  • Review each event and link it to a repeating
    issue (if applicable).
  • Update status.

29
Service Outage Analysis
  • SOA Database (Issue Maintenance)
  • Daily events that repeat and/or need follow-up
    tasks become an Issue.
  • Issues are tracked until a resolution is
    identified.
  • Once the resolution is implemented and the issue
    is verified to be resolved, it's closed.

30
Service Outage Analysis
  • SOA Database (To Dos)
  • Follow-up tasks are logged in the database
    and linked to the issue.
  • To Do reports are generated and sent to
    the necessary staff.

31
Service Outage Analysis
  • Middleware Tuning (Hardening)
  • In-depth look into server configurations.
  • Paging File Limits
  • Number of instances
  • Kill times
  • Request class routing
  • Routed poor-performing or unstable scripts to
    dedicated servers (i.e., scripts that frequently
    hang or terminate servers).
  • Isolating these types of scripts on dedicated
    servers reduced the impact on users (routing
    sketch below).
  • Installed a Service Package to correct an issue
    with the Clinical Event Server hanging, along
    with the Code Cache Synchronization package.
  • Became proactive rather than reactive.

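The isolation itself was done with Cerner middleware request-class routing, whose configuration is not reproduced here; the hypothetical routing table below is only a conceptual stand-in for the idea of sending known-problem scripts to a dedicated server pool. All names are placeholders.

```python
"""Conceptual stand-in for request-class routing: known poor-performing
or unstable scripts are sent to a dedicated server pool so that a hang
or server termination does not affect the general-purpose pool.
Script and pool names are placeholders, not Cerner configuration."""

# Hypothetical list of scripts observed to hang or terminate servers.
ISOLATED_SCRIPTS = {'bad_report_script', 'heavy_query_script'}

SERVER_POOLS = {
    'general': ['srv_a', 'srv_b', 'srv_c'],
    'isolated': ['srv_quarantine_1', 'srv_quarantine_2'],
}

def route(script_name: str) -> str:
    """Return the pool a request for this script should run in."""
    return 'isolated' if script_name in ISOLATED_SCRIPTS else 'general'

for script in ('normal_script', 'bad_report_script'):
    pool = route(script)
    print(f'{script} -> {pool} pool ({SERVER_POOLS[pool]})')
```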
32
Service Outage Analysis
  • Other Tuning (Hardening)
  • Chart Servers were consistently having memory
    resource issues, causing distributions to error
    or hang in-process.
  • Implemented an auto-reboot schedule for the Chart
    Servers.
  • Multum interaction/reaction requests were slow
    and would often hang.
  • Increased the number of Multum Servers and
    installed a corrective component fix.
  • User processes caused issues with Middleware
    processes.
  • Through trend analysis, identified specific tasks
    users were performing that caused inefficient
    calls to the database, servers to hang, etc.
  • Worked with application teams and end users to
    modify their processes, and even their
    preferences, to prevent the issues from recurring.

33
Post-Production Freeze
  • Since the Production Freeze, which was lifted at
    the end of August 2006, we've accomplished the
    following:
  • Code upgrade from 2005.02.24 to 2005.02.53
  • Achieved 99.99% uptime.
  • Provided a stable Millennium system for our end
    users.
  • Secured our ability to continue with scheduled
    deployments and implementing new functionality,
    thus meeting our strategic goals.

34
Current State
35
Current State
  • No unplanned Millennium downtime since mid-August
    2006.
  • How we got here
  • Dedicated Production Monitoring
  • Redesigning Change Control
  • Knowledge Sharing with Cerner
  • Service Outage Analysis

36
Current State
  • No unplanned Millennium downtime since mid-August
    2006.
  • How we got here (continued)
  • Focus shifted from a strong deployment schedule
    to stability.
  • Maintaining a stable code base (commitment to the
    same code base for one year with minimal
    exception packages).

37
Wrapping up
  • Questions?

38
Wrapping up
  • Contact Information
  • Eric Ried
  • Supervisor, Cerner Technical Support
  • Aurora Health Care
  • 414.647.3068
  • Eric.Ried@Aurora.Org
  • Steve Sonderman
  • Software Systems Administrator
  • Aurora Health Care
  • 414.647.6422
  • Steve.Sonderman@Aurora.Org