Title: Setting the Standard for DR
1. Setting the Standard for DR
- John Pollard, 23 March 2006
PAS 77 Guide to IT Service Continuity Management
2. PAS 56 Guide to Business Continuity Management
[Diagram labels: Business Continuity Management; Risk Management; IT Disaster Recovery; Facilities Management; Supply Chain Management; Quality Management; Health & Safety; Knowledge Management; Emergency Management; Security; Crisis Communications & PR]
Source: PAS 56:2003 Guide to Business Continuity Management
3. IT Service Continuity Management
Managing an organisation's ability to continue to provide a pre-determined and agreed level of IT Services to support the minimum business requirements.
Source: ITIL Best Practice for Service Delivery
4. Threats
- Loss, damage or denial of access to key infrastructure services
- Failure or non-performance of third parties
- Loss or corruption of key information
- Sabotage, extortion or industrial espionage
- Infiltration or attack on critical information systems
5. Scope
- Generic framework and guidelines for a continuity programme, including:
  - Management structure and responsibilities
  - How to conduct business criticality and risk assessments
  - How to define and create an IT Service Continuity plan
  - How to rehearse an IT Service Continuity plan
  - Solution architectures and design considerations
6. What is a PAS?
Source: BSI
7. Status
[Timeline chart by quarter, Q4 2004 to Q4 2006. Milestones shown: Contracts / Structure / Content; Group formed; First draft; Edit; Revise; External review; Expected release]
8. Contributors
9. ITSC Strategy
- Define direction and high-level methods to meet IT service level objectives
- Agreed at Board level
- Needs to consider 4 stages of major incident:
  - Initial response
  - Service recovery
  - Service delivery (following incident)
  - Normal service resumption
- Enable rehearsal of major incident
10. ITSC Strategy and Plan
[Diagram labels: Business Strategy; Business Criticality; Threat Analysis; IT Service Continuity Strategy; IT Architecture; IT Service Continuity Plan; Rehearsals; Costs; Processes]
11. Maintaining an ITSC Strategy
Monitor
12. Management Structure
- Crisis Management Team (CMT)
- Business Continuity Management Team (BCMT)
- Incident Management Team (IMT)
13. Business Criticality and Risk Assessments
- Identify business units and processes
- Categorise criticality of processes
- Identify IT services supporting the business processes
- Categorise criticality of IT services
- Review:
  - By location
  - By business unit
14. Business Criticality Categories
- Critical
  - Vital to day-to-day operation
- Mandatory
  - Vital to meet statutory requirements
- Strategic
  - Important for implementation of long-term strategy
- Tactical
  - Important for short/medium term objectives
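
The four categories can be captured as a small ordered type. The sketch below is illustrative only: the numeric ranking (Critical highest, Tactical lowest) and the rule that an IT service inherits the highest category of any business process it supports are assumptions, not taken from the slides, and the process and service names are hypothetical.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Business criticality categories; the numeric ranking is an assumption."""
    TACTICAL = 1   # important for short/medium term objectives
    STRATEGIC = 2  # important for implementation of long-term strategy
    MANDATORY = 3  # vital to meet statutory requirements
    CRITICAL = 4   # vital to day-to-day operation


def service_criticality(process_ratings, supported_processes):
    """Return the highest category among the processes a service supports."""
    return max(process_ratings[name] for name in supported_processes)


# Hypothetical business processes and the IT services that support them.
process_ratings = {
    "order_taking": Criticality.CRITICAL,
    "regulatory_reporting": Criticality.MANDATORY,
    "market_research": Criticality.STRATEGIC,
}
service_map = {
    "erp": ["order_taking", "regulatory_reporting"],
    "data_warehouse": ["regulatory_reporting", "market_research"],
}

service_ratings = {
    service: service_criticality(process_ratings, processes)
    for service, processes in service_map.items()
}
for service, rating in service_ratings.items():
    print(f"{service}: {rating.name}")
```

Taking the maximum means a service is never rated lower than the most demanding process that depends on it; an organisation could equally choose a weighted or manually reviewed rating.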
15. Risk Assessment Process
Learn Lessons
16. ITSC Plan
- Part of wider BCM Plan
- Model plan should include:
  - Initial response
  - Incident assessment
  - Roles and responsibilities
  - Procedures
  - Rehearsing the plan
  - Maintaining the plan
17. Recovery Objectives
- Recovery Point Objective (RPO)
  - The point in time to which work is restored, e.g. start of day
- Recovery Time Objective (RTO)
  - The time required to recover service
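
As a rough illustration, the two objectives can be checked against the timings recorded during an incident or rehearsal. This is a minimal sketch under stated assumptions: the RPO is treated as a maximum tolerable gap between the restored point in time and the incident (a common interpretation, not stated on the slide), and all function names, timestamps and targets are hypothetical.

```python
from datetime import datetime, timedelta


def data_loss(incident_time, last_recovery_point):
    """Work lost: the gap between the restored point in time and the incident."""
    return incident_time - last_recovery_point


def meets_objectives(incident_time, last_recovery_point, service_restored_time,
                     rpo, rto):
    """Return (rpo_met, rto_met) for one incident or rehearsal."""
    lost = data_loss(incident_time, last_recovery_point)
    downtime = service_restored_time - incident_time
    return lost <= rpo, downtime <= rto


# Hypothetical figures: the recovery point is start of day (08:00), the
# incident occurs at 14:30 and service is restored at 18:00.
incident = datetime(2006, 3, 23, 14, 30)
start_of_day = datetime(2006, 3, 23, 8, 0)
restored = datetime(2006, 3, 23, 18, 0)

rpo_ok, rto_ok = meets_objectives(
    incident, start_of_day, restored,
    rpo=timedelta(hours=24), rto=timedelta(hours=4),
)
print(f"RPO met: {rpo_ok}, RTO met: {rto_ok}")
```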
18. Balancing Cost and Recovery Objectives
19. IT Architecture Resilience Considerations
- Location and distance between sites
- Number of sites
- Staff access and proximity
- Remote access
- Dark site vs. manned site
- Staff skill levels
- Telecoms connectivity and redundant routing
- Automation required
- Telephony and email
- 3rd party / external links
20. High Level Process Flow
21. Task Summary Sheet
22. Rehearsal
- A body to control and coordinate
- Objectives and success criteria
- Rehearsal plan and scripts
- Staff briefing
- Logs and critique forms
- Observers
- Post-rehearsal review
23. Areas to Rehearse
- Callout
- Walk through reviews
- Walk through exercises
- Component rehearsals
- Integration rehearsals
- Relocation rehearsals
- Failover rehearsals
- Major incident simulations
24. Architectures
25. Site Models
- Active / Contingency
  - Cold site
- Active / Active
  - Service runs from both sites
- Active / Alternate
  - Service can run from either site
- Active / Backup
  - Warm standby site
- Multi-site and other hybrids
26. Data Resilience
Tape/backup
Database
Application
Host
Storage Array
SAN
27. Replication Modes
- Synchronous
  - Increased write latency
  - Typically OK for OLTP
  - May impact batch processing
  - Requires greater inter-site bandwidth than other options
- Snapshot
  - Point in time copy
  - Only valid on completion of transfer
  - Minimal/no performance impact
- Near real-time
  - Frequent snapshots
  - Minimal performance impact
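
The trade-off between these modes can be sketched as a worst-case data exposure calculation. The model below is an assumption for illustration, not something stated on the slide: synchronous replication is taken to lose no acknowledged writes, and because a snapshot is only valid once its transfer completes, the worst case for snapshot-based modes is taken as the snapshot interval plus the transfer time. The function name and all figures are hypothetical.

```python
def worst_case_exposure_minutes(mode, snapshot_interval=0.0, transfer_time=0.0):
    """Estimate the worst-case window of lost work, in minutes (simplified model)."""
    if mode == "synchronous":
        # Every acknowledged write is already on both storage arrays.
        return 0.0
    if mode in ("snapshot", "near-real-time"):
        # The remote copy is unusable until the transfer finishes, so the
        # newest valid copy can be one interval plus one transfer old.
        return snapshot_interval + transfer_time
    raise ValueError(f"unknown replication mode: {mode}")


# Hypothetical schedules: a nightly snapshot taking 2 hours to transfer,
# and near real-time replication with 5-minute snapshots taking 1 minute.
nightly = worst_case_exposure_minutes("snapshot", 24 * 60, 120)
frequent = worst_case_exposure_minutes("near-real-time", 5, 1)
sync = worst_case_exposure_minutes("synchronous")

print(f"snapshot: {nightly} min, near real-time: {frequent} min, synchronous: {sync} min")
```

In this model near real-time replication is simply the snapshot case with a much shorter interval, which is why its exposure falls between the other two.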
28. A Holistic Approach
Service Continuity is much more than technology
29. john.pollard_at_unisys.com
30. Part II - Workshop
Defining the Standard for DR
PAS 77 Guide to IT Service Continuity Management
31. Typical Challenges
- Tape recovery slow
- Manual build is complex
- Complex inter-operation between systems
- Difficult to define critical and non-critical
- Management of failover site
- Keeping sites in step
- Windows Servers
32. Synchronous Write Latency
[Diagram: a Server writes to a Storage Array (Write 1: 0.5 mSec); the data crosses a Communication link (Transfer time) to a second Storage Array (Write 2: 0.5 mSec)]
Latency = 2 × Write Time + Transfer Time
For 200 kilometres using Fibre Channel: Latency = 2 × 0.5 + 4.0 = 5.0 mSec
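
The arithmetic above can be reproduced in a few lines. This is a minimal sketch: the formula and the 0.5 mSec and 4.0 mSec figures come from the slide, while the function name and the derived ceiling on strictly serial writes (each write waiting for the previous one to be acknowledged) are illustrative additions.

```python
def synchronous_write_latency_ms(write_time_ms, transfer_time_ms):
    """Latency = 2 x write time + transfer time, as given on the slide."""
    return 2 * write_time_ms + transfer_time_ms


# Slide figures for 200 kilometres over Fibre Channel.
latency_ms = synchronous_write_latency_ms(write_time_ms=0.5, transfer_time_ms=4.0)

# Illustrative consequence: a single stream of dependent writes cannot
# complete more than 1000 / latency writes per second.
max_serial_writes_per_second = 1000 / latency_ms

print(f"latency: {latency_ms} mSec")
print(f"serial write ceiling: {max_serial_writes_per_second:.0f} writes/s")
```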
33. Site Synchronisation
- Major challenge
- Cultural change is needed
- Critical to successful operation
- DR systems
- Build at recovery time
- Slow / complex recovery
- Maintain ready to use
- How to validate changes
- Live run
- System dependent
34. Windows Servers
- Build DR servers at recovery time
  - Lengthy recovery process
  - Prone to errors
  - Complex; requires higher skill level
- Maintain DR servers ready to use
  - HW does not have to be identical
  - Complex SW change and configuration management
  - How to validate releases
- Boot servers from storage array
  - Requires matching HW
  - SW only installed once
  - Simplifies SW change and configuration management
  - Simplifies failover process / improves recovery
35. Windows Boot from SAN
Production Site
DR Site
Test Server
Live Server
DR Server
Test Data
Live Data
Live OS
Test OS
Data
OS
Storage Array
Storage Array
36. Virtualisation
- Reduced investment
- Fewer servers dedicated for resilience
- Expand/replace if long term outage
- Flexibility
- Allocate/use servers as required
- Potentially reduced capacity
- Depending on system and scale of incident
- Configuration may not have been proved
37. Service Management
- Identify Affected Areas
- Service Desk
- Incident Management
- Problem Management
- Configuration Management
- Change Management
- Release Management
- Testing
38. Operational Assessment
- Understand people and process
- Gap analysis
39. Delivery Approach
Discover
- Business Objectives
- Current Issues or Problems
- Existing/Target Infrastructure
- Success Criteria
- Vision
Model
- Existing Systems, Applications and Services
- Physical As-Is Model
- Logical As-Is Model
- Data profiling
- Security assessment
Design
- To-Be Logical Model
- To-Be Physical Model
- Project plan
- Resource schedule
- Develop business case
Implement
- Implement target environment
- Migrate and consolidate applications
- Application and middleware integration
- Define and implement test strategy
Manage
- Operational assessment and gap analysis
- Implement operational management processes
40. Workshop
- Determine high-level requirements
- Determine Business Drivers
- Determine Success Criteria
- Overview systems and applications
- Identify team members, sponsors, etc.
- Agree timelines
41. Discovery
- Audit and map
- Hardware
- Software
- Services
42. Analysis
- Data
- Applications
- Services
- Group Systems
43. Design
- Systems architecture
- Operational assessment
- Test environment
- Project plan and resource schedule
- Training requirements
44. Transition to Future State
Operational Management
Optimised Architecture
Service Continuity
Application Selection and Development Standards
Data Centre Transformation
Network Design
Storage Design
Training Requirements
Systems Design
Systems Management
Migration Plan
Test Environment and Strategy
45. Implementation
- Methodology
- Call on best practice
- Operational management
- Cultural change
- Keep people informed
46. john.pollard_at_unisys.com