Title: DM235 Building a Robust Business Continuation Plan
Jit Biswas, Systems Director, Prudential
jit.biswas_at_prudential.com
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
A Word about Prudential
- Prudential Market Strengths
- $26.9 billion in revenue
- $195.9 billion in statutory assets
- $1 trillion of coverage in force for individual and group customers
- Prudential is a large, widely distributed, U.S.-based company with an international presence
- Over 30 million customers
- 2,300 firms
- Approximately 65,000 employees
Prudential's Technology Strengths
- Sybase is Prudential's database standard
- $1 billion annual technology budget
- Prudential.com gets 600,000 hits monthly
- Almost 5,000 IT employees
Preparing the Business Continuation Plan
- Remember, a disaster plan is never a fixed, finished document - it evolves
- Be systematic in your plan - don't try to outguess Nature by planning separately for a flood, a hurricane, a fire, etc.
- Appoint a second in command in case the primary contact is injured or unavailable
How to start planning?
- Common elements in any disaster:
- Loss of information
- Loss of access to information facilities
- Loss of people
- Make a matrix with these three as the columns and each of your activities as a row (also include items like "accounts receivable," "payroll," etc., depending on your situation). Then figure out how you would respond to loss of information, access, and/or personnel for each function.
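The matrix described above can be kept as a small data structure and checked for coverage mechanically. A minimal Python sketch - the activity names and responses below are hypothetical examples, not Prudential's actual plan:

```python
# A sketch of the disaster-response matrix: one row per business activity,
# one column per type of loss. All names below are hypothetical examples.
LOSS_TYPES = ("information", "access", "people")

matrix = {
    "payroll": {
        "information": "restore from offsite backup",
        "access": "run payroll from the alternate site",
        "people": "cross-trained backup staff take over",
    },
    "accounts receivable": {
        "information": "re-key from protected paper records",
        "access": "switch to manual invoicing",
        "people": "bring in temporary staff",
    },
}

def gaps(matrix):
    """Return (activity, loss_type) pairs that still lack a planned response."""
    return [(activity, loss) for activity, plans in matrix.items()
            for loss in LOSS_TYPES if loss not in plans]
```

Running `gaps()` over the matrix is a quick way to verify that every activity has a response planned for every loss type before the plan is signed off.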
To Dos
- List individual responsibilities ahead of time, and assign specific people to each task. This includes tasks such as notifying your suppliers where to deliver, calling your most important customers to tell them what has happened, calling your Board members, etc.
- Protect critical paper records - such as "pending" contracts, advertising, research, loan applications, etc. - which exist only on paper.
To Dos
- Set clear priorities among your activities. Not everything will come back to normal at the same time.
- Decide beforehand the longest amount of time you are willing to be "dead in the water" for each of your activities.
- Plan ahead for the loss of your leased lines, or for having to relocate to a different site.
More To Dos
- Keep copies of all of your forms off site. This includes extra checks so that you can buy the emergency supplies you need.
- Keep a copy of your disaster plan at home. Make sure it includes the home phone numbers of the service people you rely on: your insurance agent, plumber, electrician, etc.
Consider the Issues
- What is the greatest risk?
- How are various groups within your company affected by downtime?
- What preventative measures are in place right now?
- How will your data be recovered?
Create a Procedure
- Define how to deal with various aspects of the network, including loss of servers, bridges/routers, etc.
- Specify who arranges for repairs or reconstruction and how the data recovery process occurs.
- Create a checklist or test procedure to verify that everything is back to normal once repairs and data recovery have taken place.
What's your risk?
- A hurricane took out the reservation system of a major airline, forcing agents to write tickets by hand and causing huge losses.
- The World Trade Center bombing: one of the banks in the building lost revenues estimated at US$20 million per day, or $13,889 per minute.
How expensive is downtime?
- Hardware repairs and missed sales opportunities are only the most obvious costs.
- Lost productivity and idle employees
- Increased technical support costs, such as on-site repair
- Missed SLAs
- Loss of customer confidence and goodwill
- Legal liability
Don't Be Caught with Your Data Down
- 67% of companies that go through a disaster lasting more than two weeks are out of business within two years.
- People don't plan to fail; they simply fail to plan.
Average Cost of Downtime
Pyramid of Availability
Hardware Redundancy: The First Line of Defense
- Hardware redundancy protects against computer and disk failure.
- Hardware redundancy options:
- RAID (redundant array of inexpensive disks)
- Disk mirroring
- Hardware redundancy cannot protect against failures that cause corrupted data to be written to both the primary and the redundant disk.
RAID Levels
- How are the disks arranged into logical volumes?
- RAID-0 (striping) increases overall performance, but significantly reduces overall volume reliability.
- Various combinations of RAID-1 (mirroring) and RAID-0 increase performance while also increasing reliability.
- RAID-5 also tends to increase both performance and reliability. Plan approximately two to three times longer for restoring data to a RAID-5 volume than it took to back it up.
Cold Standby: Backup and Restore
- Operational databases are backed up on a daily basis so that all is not lost in the event of a system failure that destroys or corrupts your data.
- One limitation of this approach is the time required to restore a database. While a database is recovering, it is inaccessible to end-users.
Creating a Warm Standby with Sybase Replication Server
- To avoid the problem of potentially corrupted data, you can use Sybase Replication Server to create a warm standby that can be brought up in the event of a system failure. Replication is usually combined with redundant hardware.
- The combination of logical replication and hardware redundancy provides greater protection against loss of availability than either mechanism alone.
Active/Active Hot Standby
- ASE 12.0 includes a Companion Server option: two ASE servers act as companions in either an asymmetric (master/slave) or symmetric (active/active hot standby) configuration.
- In a two-node hardware cluster with two ASE servers running, both servers actively run applications. If server 1 goes down, server 2 opens server 1's devices and brings them online, while continuing to handle its own clients.
Active/Active Configuration with Applications Running on Both Servers
Companion Server Takes Over in the Event of a Failure
Automatic Client Failover
- Clients should not have to reconnect in the event of a failover.
- Open Client 12.0 has been enhanced to automatically try to connect to the companion server in the event of a failover.
- If a transaction is in a partially completed state, an error message is generated saying that a failover has occurred and that the current transaction must be resubmitted.
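Open Client 12.0 performs this retry transparently; the following Python sketch only illustrates the connect-to-companion idea. The server names and the `fake_connect` helper are hypothetical stand-ins, not Open Client APIs:

```python
# Sketch of failover-aware connection logic, loosely modeled on what Open
# Client 12.0 does for you transparently. Server names are hypothetical.
def connect_with_failover(servers, try_connect):
    """Try each server in order (primary first, then its companion);
    return the first successful connection."""
    last_error = None
    for server in servers:
        try:
            return try_connect(server)
        except ConnectionError as error:
            last_error = error          # remember why this server failed
    raise last_error                    # every candidate was unreachable

def fake_connect(server):
    """Stand-in connector for illustration: pretend the primary is down."""
    if server == "PRIMARY":
        raise ConnectionError("PRIMARY is down")
    return "connected to " + server
```

Note that, as on the slide, a connection obtained this way is not a resumed transaction: any transaction that was in flight at failover time must be resubmitted by the application.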
Automatic Client Failover
Failback
- ASE 12.0 shuts down individual databases on one server while the connections that were using that database are held on the companion server. Once the shutdown is complete, the primary server can be restarted, and the companion's proxy databases can be re-established.
- Support for failback enables a seamless move back to the original configuration once the primary server has been restarted.
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Are you backed up?
- Did you create a new file today? Billions of new files are created every day!
- Is your file protected? 82 percent aren't!
- What would it cost you to lose that file?
- $50,000/hr loss to re-create data
- $18,000 is the average hourly cost of downtime for PC networks.
Backup: Your First Layer of Protection
- Create a multi-layered backup schedule: full backup, incremental backup, differential backup
- Rotate the media according to a well-defined schedule
- Grandfather-Father-Son: this scheme uses daily (Son), weekly (Father), and monthly (Grandfather) backup media sets.
- Tower of Hanoi
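Both rotation schemes can be computed mechanically. A minimal Python sketch follows; the GFS conventions chosen here (monthly set on the 1st, weekly set on Fridays) are one common choice, not a standard, and the Tower of Hanoi variant shown is the usual power-of-two rotation:

```python
def hanoi_set(day, num_sets):
    """Tower of Hanoi rotation: which media set (0-based) to use on a
    1-based day. Set 0 is used every other day, set 1 every 4th day,
    set 2 every 8th day, and so on; the last set absorbs the remaining
    (least frequent) slots, so it holds the oldest recoverable data."""
    trailing_zeros = 0
    while day % 2 == 0:          # count factors of two in the day number
        day //= 2
        trailing_zeros += 1
    return min(trailing_zeros, num_sets - 1)

def gfs_set(day_of_month, weekday):
    """Grandfather-Father-Son rotation: monthly set on the 1st, weekly set
    on Fridays (weekday 4, Monday = 0), daily set otherwise. These
    conventions are illustrative assumptions."""
    if day_of_month == 1:
        return "grandfather"
    if weekday == 4:
        return "father"
    return "son"
```

With four Hanoi sets, days 1-8 use sets 0, 1, 0, 2, 0, 1, 0, 3 - each added set doubles the retention depth while using media sets at very different rates.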
Logical Backups: Use Third-Party Tools
- Physical backup: byte-for-byte image
- Faster than logical backups. The entire volume is backed up as a single entity.
- Logical backup:
- Reads the superblock to obtain the names of all the directories in the file system
- Slower than physical backups
- The benefit of logical backups is their ability to restore single tables instead of the whole database
Client Backup
- How many clients are there?
- What types of clients are there?
- Do the clients have their own backup devices?
- How are the clients distributed?
- How autonomous are the client systems?
Tape Environment
- What are the temperature and humidity like? Optimal operating conditions are 10-40 deg C, storage at 16-32 deg C, and humidity between 20-80%.
- How often are the drive heads cleaned?
- How old are the drives and tapes?
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Clusters
- A cluster is a group of computers (referred to as nodes) connected in a way that lets them work as a single, continuously available system.
- Systems that must be highly available and scalable, e.g. intranet servers, which are increasingly relied upon for daily operation, are good candidates for "conversion" into a cluster. The extra nodes help ensure uptime and increase the server's throughput and storage capacity.
Third-Party Cluster Software for ASE
- Adaptive Server Enterprise 12.0's Companion Server option is certified to work with the following high availability solutions:
- Sun Microsystems Sun Cluster
- IBM HACMP
- Hewlett-Packard ServiceGuard
- Compaq TruCluster
- Microsoft Windows NT MSCS
- ASE 12.0 uses these solutions to detect system failure and initiate a failover.
How do clusters work?
- The typical topology for an HA cluster is as follows:
- Two nodes are connected by Ethernet or FDDI.
- A "heartbeat" is passed between the nodes on the private network to monitor the health of each node.
- The storage arrays are redundantly connected to the servers. Only one node "owns" a given logical diskset; the other node can take over ownership in the case of failover.
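The heartbeat and diskset-ownership logic above can be sketched as follows. This is an illustrative sketch only: the timeout value and node names are assumptions, not defaults of any particular cluster product:

```python
def node_alive(last_heartbeat, now, timeout=3.0):
    """A node is presumed failed if no heartbeat has arrived within
    `timeout` seconds. The 3-second value is a hypothetical tunable."""
    return (now - last_heartbeat) <= timeout

def owner_after_failover(diskset_owner, failed_node, surviving_node):
    """If the node owning a logical diskset failed, the surviving node
    takes over ownership; otherwise ownership is unchanged."""
    return surviving_node if diskset_owner == failed_node else diskset_owner
```

The key design point mirrored here is single ownership: at any moment exactly one node owns a diskset, so a failover is an ownership transfer, never shared access.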
How a Cluster Works
- A cluster has two adapters: a primary and a non-primary adapter. The primary adapter is the adapter that controls the RAID arrays.
- When the cluster is first configured and the systems are turned on, the adapter that has the higher unique ID is automatically defined to be the primary adapter.
Failover in a Cluster
- Each adapter periodically checks that it can still communicate with its system, so each adapter can detect that the other has stopped operating.
- If the non-primary adapter detects that it has lost access to the other adapter, the non-primary adapter becomes the primary adapter. Commands that were sent to the original primary adapter after access was lost are sent again, to the new primary adapter. This action is called failover.
Failover of a Cluster
- If writes are in progress to an array (or have occurred within the last 20 seconds) when failover occurs, the array is rebuilt after the new primary adapter has taken control.
- If an array has one of its members missing (that is, the array is in the exposed or degraded state) when a failover occurs, the status of the array becomes offline and an error is logged. Manual intervention is needed to resolve this error.
Failback
- After a failover has occurred and a new adapter has been installed in place of the faulty one, the new adapter might have an ID that is higher than that of the remaining (current primary) adapter. Under that condition, the new adapter becomes the primary adapter. This action is called failback.
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Replication Topologies
- Replicate sites read-only (and do not update primary data)
- Remote primary update with client connection
- Remote site request for primary update, with local changes
- Distributed primary fragments
- Corporate roll-up
- Warm standby
- Peer-to-peer topology
Replicate Sites Read-Only (and do not update primary data)
- Replicate sites just need read-only access to data
- Updates are done at the primary site and are propagated to replicate sites
- Replicate sites use replicate copies in read-only mode
- Only a local client request can update the primary
Replicate Sites Read-Only (and do not update primary data)
Diagram: one-way replication from the primary database to a read-only replicate site.
Remote Primary Update with Client Connections
- Replicate sites:
- Replicate sites wish to change data
- Updates are done at the primary site and are propagated to replicate sites
- Replicate sites use replicate copies in read-only mode
- Updating primary data:
- Replicate sites remotely log in or use direct client connections to make changes to primary data
Remote Primary Update with Client Connections
Diagram: one-way replication from the primary site to a read-only replicate site; writes are made via a remote login to the primary.
Remote Site Request for Primary Update with Local Changes
- Replicate sites:
- Replicate sites wish to change data
- Updates are done at the primary site and are propagated to replicate sites
- Replicate sites use replicate copies in read-only mode, but change a local copy of their replicate data
- Updating primary data:
- Remote (replicate) sites change their own local copy and request changes in the primary to bring everything up to date
Remote Site Request for Primary Update with Local Changes
Diagram: the replicate site sends a request function to the primary; the primary propagates the applied function back to the replicate sites.
Distributed Primary Fragments
- Partitioned primary data:
- Primary data is fragmented into any number of databases in the system
- Only one image of the primary data set exists, but not in a single table in a single server
- Remote sites:
- Remote sites control their own primary data, and can change it
Distributed Primary Fragments
Corporate Roll-Up
- Partitioned primary data:
- Primary data is fragmented into any number of databases in the system
- Only one image of the primary data set exists, but not in a single table in a single server
- Remote sites:
- Remote sites control their primary data, and can change it
- Remote sites may not see the rest of the database
- Updates done on primary data are propagated only to the corporate site
Corporate Roll-Up
Warm Standby
- Redundant primaries:
- Two complete versions of the primary database, for warm standby
- Only one of them acts as the primary at any given time
- Only one primary, the active one, may be updated at a time
- Replicate sites:
- Remote (replicate) sites are read-only
Warm Standby - Switching Over
- Switching primaries:
- When the first primary server is unavailable, the application that updates the primary switches to updating the backup primary site
- Applications must update only one primary at a time
- The backup primary site propagates changes to replicate sites
OpenSwitch
- Client programs migrate seamlessly to a failover database in the event of an unplanned outage. A simple flip of a software switch transfers client connections to backup servers during planned downtime.
Threads Used in the Replication Server
The Replicate Replication Server
LTM (Log Transfer Manager)
- The Replication Server maintains the secondary truncation point within the primary database transaction log.
- The LTM truncation point is a pointer to the oldest active transaction that has not been completely read out by the LTM.
- The LTM uses the RSSD table rs_locater to keep track of this truncation point.
LTM Service Threads
- The LTM coordinates with the data server to persistently store the current LTM truncation point
- The rs_zeroltm stored procedure is used to initialize the locater value if the truncation point is ever invalidated
- At LTM startup, two threads are spawned to speak with the data server and Replication Server:
- Log scan service thread
- Log transfer service thread
LTM (Log Transfer Manager)
- The log transfer service thread submits LTL to the Replication Server containing the complete before and after row images.
- The LTL is composed by executing dbcc logtransfer(scan, normal)
- The log scan thread always synchronizes with the log transfer service thread each time it sends a completed batch, by fetching a new truncation point.
LTM (Log Transfer Manager)
- The LTM has very limited language handling capabilities.
- The LTM is only interested in log records for objects whose sysobjects.sysstat has the -32768 (0x8000) bit set. Other scanned log records are discarded, including the maintenance user transactions, unless the LTM has been started using a -A and/or -W flag.
LTM (Log Transfer Manager)
- The LTM invoking the dbcc logtransfer() command translates to:
- dbcc logtransfer(scan, normal) -> exec_dbcc() -> call_logtransfer()
- call_logtransfer() is the entry point into the SQL Server log scanner.
- When a scan is invoked, the log scan thread reads qualifying records from the transaction log and transmits them to the LTM one record at a time.
SQL Server's Role in Replication
- SQL Server plays a role in three stages during data replication:
- Marking objects for replication, using sp_setreptable and sp_setrepcol
- Marking the log records for these objects with the LHSC_REPLICATE flag
- Transmitting these log records to the LTM
Preparation for Replication
- The stored procedure sp_setreptable is used to mark user tables for replication. The procedure sets the O_REPLICATED flag in the sysobjects row for that table.
- Performance tip: when sp_setrepcol is called on a table with text/image columns, txtimg_upd_table() scans the table, reading each text/image column in each row and updating the text pointer. This is a big overhead for large tables.
Preparation for Replication
- Stored procedures are marked for replication using sp_setrepproc. This system stored procedure sets the O_REPLICATED flag and also the O_PROC_SUBSCRIBABLE flag in the sysobjects row for the stored procedure.
- For replicated stored procedures, SQL Server marks only the EXECBEGIN and EXECEND log records with the LHSC_REPLICATE flag; i.e., the DML log records are not replicated.
- Replication Server does not replicate nested stored procedures.
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Update Everywhere - Peer-to-Peer Topology
- Two-way replication; all sites can use DML statements at all times.
- Update and insert conflicts are resolved by careful application design
- A Replication Server is required at all sites
- Function strings need to be modified for conflict resolution
- One site is given priority at the discretion of the Replication Server manager. Other models use timestamp priority, site priority, or ownership priority.
Peer-to-Peer Topology
The Gameplan
- Deletes on both sides are not an issue
- Updates and inserts pose problems
- Peer-to-peer topology with update conflict resolution
- Two-way replication with version-controlled updates
- Inserts for system-generated numbers and identity fields are given different ranges per site
- Conflicting updates go to log tables and are resolved by the RepServer administrator
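The identity-range idea above - giving each site its own block of generated numbers so inserts can never collide - can be sketched as follows. The range size and 0-based site numbering are illustrative assumptions:

```python
def site_id_range(site_index, range_size=1_000_000):
    """Give each site a non-overlapping block of identity values, so that
    rows inserted at different sites can never share a key. The range
    size of one million is an illustrative choice."""
    start = site_index * range_size + 1
    return (start, start + range_size - 1)

def owning_site(identity_value, range_size=1_000_000):
    """Map an identity value back to the site whose range it falls in."""
    return (identity_value - 1) // range_size
```

Because the ranges are disjoint, insert conflicts are avoided entirely by construction; only updates still need the version-controlled conflict resolution described next.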
Version-Controlled Updates
- Version control is a method of detecting and resolving conflicting updates.
- The version column can be a number that increases with each update, a timestamp, or similar.
- The application must provide the current value of the version column at the primary site. This is passed to a stored procedure.
- Rejected transactions are written into the exceptions log
Version-Controlled Updates (contd.)
- Add two columns, called row version and owner, to each replicated table
- The version column gets incremented by one every time a row is updated
- The owner column gets updated to either US or HK based on who updates the row
- The old owner, the primary key, the new owner, and the new incremented version number are passed to an update conflict resolution stored procedure (one for each replicated table)
- This stored procedure is executed by the rs_update function string
Version-Controlled Updates (contd.)
- This stored procedure checks the versions and owners; the transaction succeeds if the old versions and owners match
- If the row has been updated simultaneously, the data goes to a log table at the replicate site, which fires a trigger to populate a master table of all exceptions.
- The replication administrator applies these transactions at his discretion, logging in as a maintenance user.
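In the real system this check lives in a stored procedure invoked by the rs_update function string; the following Python sketch only illustrates the version/owner comparison, with hypothetical field names:

```python
def apply_update(row, old_version, old_owner, new_owner, new_values):
    """Apply a replicated update only if the row's stored version and owner
    still match what the updating site saw; otherwise reject it so it can
    be resolved manually. Field names are illustrative placeholders."""
    if row["version"] == old_version and row["owner"] == old_owner:
        row.update(new_values)
        row["version"] = old_version + 1   # bump the version on success
        row["owner"] = new_owner           # record who last changed the row
        return "applied"
    return "rejected"  # a conflicting update wins; this one goes to the log table
```

A rejected result corresponds to the conflicting data landing in the exceptions log table for the replication administrator to resolve.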
Pre-installation Tasks
- Get the network topology (connections, bandwidth)
- Hardware capacity planning
- Understanding the business logic
- Schema/ER diagram, list of tables
- Eliminate static tables
- Identify which tables get modified on each side
- Frequency and volume of transactions
- Bulk inserts and updates, if any
Pre-installation Tasks (contd.)
- Triggers on the replicate site
- Identify text columns
- "Select from table_name" clauses have to go
- Triggers updating more than 16 tables in pre-11.5 installations
- Create a test environment
Post-installation Stage
- Test out the entire bi-directional strategy
- What happens when the network goes down?
- Ignore duplicate rows (the default is to stop replication)
- Are log tables getting populated on conflict?
- Empty the RSSD exceptions table with rs_delexception
- Keep a close eye on disk space. We allocated 7 days' worth of transaction space in the stable queue.
Limitations
- Text columns cannot be handled by update conflict resolution
- After-insert triggers force two transactions: additional overhead, additional exceptions
- The network connection should be reliable
- A front-end program had to be written to do the conflict resolution; manual intervention is required
- Needs careful monitoring
Recovery Strategies
- Regular full database backups are suggested
- Transaction log backups during the day
- For tables out of sync, use bcp or rs_subcmp
- Dump-and-load procedure for extreme database corruption
- For large databases, it is advisable to keep options open to connect to the primary server in cases of massive corruption