Title: DM235 Building a Robust Business Continuation Plan
Jit Biswas, Systems Director, Prudential
jit.biswas_at_prudential.com
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
A Word about Prudential
- Prudential Market Strengths
- $26.9 billion in revenue
- $195.9 billion in statutory assets
- $1 trillion of coverage in force for individual and group customers
- Prudential is a large, widely distributed, U.S.-based company with an international presence
- Over 30 million customers
- 2,300 firms
- Approximately 65,000 employees
Prudential's Technology Strengths
- Sybase is Prudential's database standard
- $1 billion annual technology budget
- Prudential.com gets 600,000 hits monthly
- Almost 5,000 IT employees
Preparing the Business Continuation Plan
- Remember, a disaster plan is never a fixed, finished document - it evolves
- Be systematic in your plan - don't try to outguess Nature by planning separately for a flood, a hurricane, a fire, etc.
- Appoint a second in command in case the primary contact is injured or unavailable
How to start planning?
- Common elements in any disaster:
- Loss of information
- Loss of access to information facilities
- Loss of people
- Make a matrix with these three as the columns and each of your activities as a row (also include items like "accounts receivable," "payroll," etc., depending on your situation). Then figure out how you would respond to loss of information, access, and/or personnel for each function.
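The matrix described above can be kept as a small data structure and checked for coverage mechanically. A minimal Python sketch - the activity names and responses below are hypothetical examples, not Prudential's actual plan:

```python
# A sketch of the disaster-response matrix: one row per business activity,
# one column per type of loss. All names below are hypothetical examples.
LOSS_TYPES = ("information", "access", "people")

matrix = {
    "payroll": {
        "information": "restore from offsite backup",
        "access": "run payroll from the alternate site",
        "people": "cross-trained backup staff take over",
    },
    "accounts receivable": {
        "information": "re-key from protected paper records",
        "access": "switch to manual invoicing",
        "people": "bring in temporary staff",
    },
}

def gaps(matrix):
    """Return (activity, loss_type) pairs that still lack a planned response."""
    return [(activity, loss) for activity, plans in matrix.items()
            for loss in LOSS_TYPES if loss not in plans]
```

Running `gaps()` over the matrix is a quick way to verify that every activity has a response planned for every loss type before the plan is signed off.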
To Dos
- List individual responsibilities ahead of time, and assign specific people to each task. This includes tasks such as notifying your suppliers where to deliver, calling your most important customers to tell them what has happened, calling your Board members, etc.
- Protect critical paper records - such as "pending" contracts, advertising, research, loan applications, etc. - which exist only on paper.
To Dos
- Set clear priorities among your activities. Not everything will come back to normal at the same time.
- Decide beforehand the longest amount of time you are willing to be "dead in the water" for each of your activities.
- Plan ahead for the loss of your leased lines, or for having to relocate to a different site.
More To Dos
- Keep copies of all of your forms off site. This includes extra checks so that you can buy the emergency supplies you need.
- Keep a copy of your disaster plan at home. Make sure it includes the home phone numbers of the service people you rely on: your insurance agent, plumber, electrician, etc.
Consider the Issues
- What is the greatest risk?
- How are various groups within your company affected by downtime?
- What preventative measures are in place right now?
- How will your data be recovered?
Create a Procedure
- Define how to deal with various aspects of the network, including loss of servers, bridges/routers, etc.
- Specify who arranges for repairs or reconstruction and how the data recovery process occurs.
- Create a checklist or test procedure to verify that everything is back to normal once repairs and data recovery have taken place.
What's your risk?
- A hurricane took out the reservation system of a major airline, forcing agents to write tickets by hand and causing huge losses.
- The World Trade Center bombing: one of the banks in the building lost revenues estimated at US$20 million per day, or $13,889 per minute.
How expensive is downtime?
- Hardware repairs and missed sales opportunities are only the most obvious costs.
- Lost productivity and idle employees
- Increased technical support costs, such as on-site repair
- Missed SLAs
- Loss of customer confidence and goodwill
- Legal liability
Don't Be Caught with Your Data Down
- 67% of companies that go through a disaster lasting more than two weeks are out of business within two years.
- People don't plan to fail; they simply fail to plan.
Average Cost of Downtime
Pyramid of Availability
Hardware Redundancy: The First Line of Defense
- Hardware redundancy protects against computer and disk failure.
- Hardware redundancy options:
- RAID (redundant array of inexpensive disks)
- Disk mirroring
- Hardware redundancy cannot protect against failures that cause corrupted data to be written to both the primary and the redundant disk.
RAID Levels
- How are the disks arranged into logical volumes?
- RAID-0 (striping) increases overall performance, but significantly reduces overall volume reliability.
- Various combinations of RAID-1 (mirroring) and RAID-0 increase performance while also increasing reliability.
- RAID-5 also tends to increase both performance and reliability. Plan approximately two to three times longer for restoring data to a RAID-5 volume than it took to back it up.
Cold Standby: Backup and Restore
- Operational databases are backed up on a daily basis so that all is not lost in the event of a system failure that destroys or corrupts your data.
- One limitation of this approach is the time required to restore a database. While a database is recovering, it is inaccessible to end-users.
Creating a Warm Standby with Sybase Replication Server
- To avoid the problem of potentially corrupted data, you can use Sybase Replication Server to create a warm standby that can be brought up in the event of a system failure. Replication is usually combined with redundant hardware.
- The combination of logical replication and hardware redundancy provides greater protection against loss of availability than either mechanism alone.
Active/Active Hot Standby
- ASE 12.0 includes a Companion Server option: two ASE servers act as companions in either an asymmetric (master/slave) or symmetric (active/active hot standby) configuration.
- In a two-node hardware cluster with two ASE servers running, both servers actively run applications. If server 1 goes down, server 2 opens server 1's devices and brings them online, while continuing to handle its own clients.
Active/Active Configuration with Applications Running on Both Servers
Companion Server Takes Over in the Event of a Failure
Automatic Client Failover
- Clients should not have to reconnect in the event of a failover.
- Open Client 12.0 has been enhanced to automatically try to connect to the companion server in the event of a failover.
- If a transaction is in a partially completed state, an error message is generated saying that a failover has occurred and that the current transaction must be resubmitted.
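Open Client 12.0 performs this retry transparently; the following Python sketch only illustrates the connect-to-companion idea. The server names and the `fake_connect` helper are hypothetical stand-ins, not Open Client APIs:

```python
# Sketch of failover-aware connection logic, loosely modeled on what Open
# Client 12.0 does for you transparently. Server names are hypothetical.
def connect_with_failover(servers, try_connect):
    """Try each server in order (primary first, then its companion);
    return the first successful connection."""
    last_error = None
    for server in servers:
        try:
            return try_connect(server)
        except ConnectionError as error:
            last_error = error          # remember why this server failed
    raise last_error                    # every candidate was unreachable

def fake_connect(server):
    """Stand-in connector for illustration: pretend the primary is down."""
    if server == "PRIMARY":
        raise ConnectionError("PRIMARY is down")
    return "connected to " + server
```

Note that, as on the slide, a connection obtained this way is not a resumed transaction: any transaction that was in flight at failover time must be resubmitted by the application.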
Automatic Client Failover
Failback
- ASE 12.0 shuts down individual databases on one server while the connections that were using that database are held on the companion server. Once the shutdown is complete, the primary server can be restarted, and the companion's proxy databases can be re-established.
- Support for failback enables a seamless move back to the original configuration once the primary server has been restarted.
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Are you backed up?
- Did you create a new file today? Billions of new files are created every day!
- Is your file protected? 82 percent aren't!
- What would it cost you to lose that file?
- $50,000/hr loss to re-create data
- $18,000 is the average hourly cost of downtime for PC networks.
Backup: Your First Layer of Protection
- Create a multi-layered backup schedule: full backup, incremental backup, differential backup
- Rotate the media according to a well-defined schedule
- Grandfather-Father-Son: this scheme uses daily (Son), weekly (Father), and monthly (Grandfather) backup media sets.
- Tower of Hanoi
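Both rotation schemes can be computed mechanically. A minimal Python sketch follows; the GFS conventions chosen here (monthly set on the 1st, weekly set on Fridays) are one common choice, not a standard, and the Tower of Hanoi variant shown is the usual power-of-two rotation:

```python
def hanoi_set(day, num_sets):
    """Tower of Hanoi rotation: which media set (0-based) to use on a
    1-based day. Set 0 is used every other day, set 1 every 4th day,
    set 2 every 8th day, and so on; the last set absorbs the remaining
    (least frequent) slots, so it holds the oldest recoverable data."""
    trailing_zeros = 0
    while day % 2 == 0:          # count factors of two in the day number
        day //= 2
        trailing_zeros += 1
    return min(trailing_zeros, num_sets - 1)

def gfs_set(day_of_month, weekday):
    """Grandfather-Father-Son rotation: monthly set on the 1st, weekly set
    on Fridays (weekday 4, Monday = 0), daily set otherwise. These
    conventions are illustrative assumptions."""
    if day_of_month == 1:
        return "grandfather"
    if weekday == 4:
        return "father"
    return "son"
```

With four Hanoi sets, days 1-8 use sets 0, 1, 0, 2, 0, 1, 0, 3 - each added set doubles the retention depth while using media sets at very different rates.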
Logical Backups: Use Third-Party Tools
- Physical backup: byte-for-byte image
- Faster than logical backups. The entire volume is backed up as a single entity.
- Logical backup:
- Reads the superblock to obtain the names of all the directories in the file system
- Slower than physical backups
- The benefit of logical backups is their ability to restore single tables instead of the whole database
Client Backup
- How many clients are there?
- What types of clients are there?
- Do the clients have their own backup devices?
- How are the clients distributed?
- How autonomous are the client systems?
Tape Environment
- What are the temperature and humidity like? Optimal operating conditions are 10-40 deg C, storage at 16-32 deg C, and humidity between 20-80%.
- How often are the drive heads cleaned?
- How old are the drives and tapes?
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Clusters
- A cluster is a group of computers (referred to as nodes) connected in a way that lets them work as a single, continuously available system.
- Systems that must be highly available and scalable, e.g. intranet servers, which are increasingly relied upon for daily operation, are good candidates for "conversion" into a cluster. The extra nodes help ensure uptime and increase the server's throughput and storage capacity.
Third-Party Cluster Software for ASE
- Adaptive Server Enterprise 12.0's Companion Server option is certified to work with the following high availability solutions:
- Sun Microsystems Sun Cluster
- IBM HACMP
- Hewlett-Packard ServiceGuard
- Compaq TruCluster
- Microsoft Windows NT MSCS
- ASE 12.0 uses these solutions to detect system failure and initiate a failover.
How do clusters work?
- The typical topology for an HA cluster is as follows:
- Two nodes are connected by Ethernet or FDDI.
- A "heartbeat" is passed between the nodes on the private network to monitor the health of each node.
- The storage arrays are redundantly connected to the servers. Only one node "owns" a given logical diskset; the other node can take over ownership in the case of failover.
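The heartbeat and diskset-ownership logic above can be sketched as follows. This is an illustrative sketch only: the timeout value and node names are assumptions, not defaults of any particular cluster product:

```python
def node_alive(last_heartbeat, now, timeout=3.0):
    """A node is presumed failed if no heartbeat has arrived within
    `timeout` seconds. The 3-second value is a hypothetical tunable."""
    return (now - last_heartbeat) <= timeout

def owner_after_failover(diskset_owner, failed_node, surviving_node):
    """If the node owning a logical diskset failed, the surviving node
    takes over ownership; otherwise ownership is unchanged."""
    return surviving_node if diskset_owner == failed_node else diskset_owner
```

The key design point mirrored here is single ownership: at any moment exactly one node owns a diskset, so a failover is an ownership transfer, never shared access.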
How a Cluster Works
- A cluster has two adapters: a primary and a non-primary adapter. The primary adapter is the adapter that controls the RAID arrays.
- When the cluster is first configured and the systems are turned on, the adapter that has the higher unique ID is automatically defined to be the primary adapter.
Failover in a Cluster
- Each adapter periodically checks that it can still communicate with its system, so each adapter can detect that the other has stopped operating.
- If the non-primary adapter detects that it has lost access to the other adapter, the non-primary adapter becomes the primary adapter. Commands that were sent to the original primary adapter after access was lost are sent again, to the new primary adapter. This action is called failover.
Failover of a Cluster
- If writes are in progress to an array (or have occurred within the last 20 seconds) when failover occurs, the array is rebuilt after the new primary adapter has taken control.
- If an array has one of its members missing (that is, the array is in the exposed or degraded state) when a failover occurs, the status of the array becomes offline and an error is logged. Manual intervention is needed to resolve this error.
Failback
- After a failover has occurred and a new adapter has been installed in place of the faulty one, the new adapter might have an ID that is higher than that of the remaining (current primary) adapter. Under that condition, the new adapter becomes the primary adapter. This action is called failback.
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Replication Topologies
- Replicate sites read-only (and do not update primary data)
- Remote primary update with client connection
- Remote site request for primary update, with local changes
- Distributed primary fragments
- Corporate roll-up
- Warm standby
- Peer-to-peer topology
Replicate Sites Read-Only (and do not update primary data)
- Replicate sites just need read-only access to data
- Updates are done at the primary site and are propagated to replicate sites
- Replicate sites use replicate copies in read-only mode
- Only a local client request can update the primary
Replicate Sites Read-Only (and do not update primary data)
Diagram: one-way replication from the primary database to a read-only replicate site.
Remote Primary Update with Client Connections
- Replicate sites:
- Replicate sites wish to change data
- Updates are done at the primary site and are propagated to replicate sites
- Replicate sites use replicate copies in read-only mode
- Updating primary data:
- Replicate sites remotely log in or use direct client connections to make changes to primary data
Remote Primary Update with Client Connections
Diagram: one-way replication from the primary site to a read-only replicate site; writes are made via a remote login to the primary.
Remote Site Request for Primary Update with Local Changes
- Replicate sites:
- Replicate sites wish to change data
- Updates are done at the primary site and are propagated to replicate sites
- Replicate sites use replicate copies in read-only mode, but change a local copy of their replicate data
- Updating primary data:
- Remote (replicate) sites change their own local copy and request changes in the primary to bring everything up to date
Remote Site Request for Primary Update with Local Changes
Diagram: the replicate site sends a request function to the primary; the primary propagates the applied function back to the replicate sites.
Distributed Primary Fragments
- Partitioned primary data:
- Primary data is fragmented into any number of databases in the system
- Only one image of the primary data set exists, but not in a single table in a single server
- Remote sites:
- Remote sites control their own primary data, and can change it
Distributed Primary Fragments
Corporate Roll-Up
- Partitioned primary data:
- Primary data is fragmented into any number of databases in the system
- Only one image of the primary data set exists, but not in a single table in a single server
- Remote sites:
- Remote sites control their primary data, and can change it
- Remote sites may not see the rest of the database
- Updates done on primary data are propagated only to the corporate site
Corporate Roll-Up
Warm Standby
- Redundant primaries:
- Two complete versions of the primary database, for warm standby
- Only one of them acts as the primary at any given time
- Only one primary, the active one, may be updated at a time
- Replicate sites:
- Remote (replicate) sites are read-only
Warm Standby - Switching Over
- Switching primaries:
- When the first primary server is unavailable, the application that updates the primary switches to updating the backup primary site
- Applications must update only one primary at a time
- The backup primary site propagates changes to replicate sites
OpenSwitch
- Client programs migrate seamlessly to a failover database in the event of an unplanned outage. A simple flip of a software switch transfers client connections to backup servers during planned downtime.
Threads Used in the Replication Server
The Replicate Replication Server
LTM (Log Transfer Manager)
- The Replication Server maintains the secondary truncation point within the primary database transaction log.
- The LTM truncation point is a pointer to the oldest active transaction that has not been completely read out by the LTM.
- The LTM uses the RSSD table rs_locater to keep track of this truncation point.
LTM Service Threads
- The LTM coordinates with the data server to persistently store the current LTM truncation point
- The rs_zeroltm stored procedure is used to initialize the locater value if the truncation point is ever invalidated
- At LTM startup, two threads are spawned to speak with the data server and Replication Server:
- Log scan service thread
- Log transfer service thread
LTM (Log Transfer Manager)
- The log transfer service thread submits LTL to the Replication Server containing the complete before and after row images.
- The LTL is composed by executing dbcc logtransfer(scan, normal)
- The log scan thread always synchronizes with the log transfer service thread each time it sends a completed batch, by fetching a new truncation point.
LTM (Log Transfer Manager)
- The LTM has very limited language handling capabilities.
- The LTM is only interested in log records for objects whose sysobjects.sysstat has the -32768 (0x8000) bit set. Other scanned log records are discarded, including the maintenance user transactions, unless the LTM has been started using a -A and/or -W flag.
LTM (Log Transfer Manager)
- The LTM invoking the dbcc logtransfer() command translates to:
- dbcc logtransfer(scan, normal) -> exec_dbcc() -> call_logtransfer()
- call_logtransfer() is the entry point into the SQL Server log scanner.
- When a scan is invoked, the log scan thread reads qualifying records from the transaction log and transmits them to the LTM one record at a time.
SQL Server's Role in Replication
- SQL Server plays a role in three stages during data replication:
- Marking objects for replication, using sp_setreptable and sp_setrepcol
- Marking the log records for these objects with the LHSC_REPLICATE flag
- Transmitting these log records to the LTM
Preparation for Replication
- The stored procedure sp_setreptable is used to mark user tables for replication. The procedure sets the O_REPLICATED flag in the sysobjects row for that table.
- Performance tip: when sp_setrepcol is called on a table with text/image columns, txtimg_upd_table() scans the table, reading each text/image column in each row and updating the text pointer. This is a big overhead for large tables.
Preparation for Replication
- Stored procedures are marked for replication using sp_setrepproc. This system stored procedure sets the O_REPLICATED flag and also the O_PROC_SUBSCRIBABLE flag in the sysobjects row for the stored procedure.
- For replicated stored procedures, SQL Server marks only the EXECBEGIN and EXECEND log records with the LHSC_REPLICATE flag; i.e., the DML log records are not replicated.
- Replication Server does not replicate nested stored procedures.
Agenda
- Preparing a Business Continuation Plan
- Topologies for High Availability
- Backups
- Clusters
- Warm Standby
- Peer to Peer Replication
Update Everywhere - Peer-to-Peer Topology
- Two-way replication; all sites can use DML statements at all times.
- Update and insert conflicts are resolved by careful application design
- A Replication Server is required at all sites
- Function strings need to be modified for conflict resolution
- One site is given priority at the discretion of the Replication Server manager. Other models use timestamp priority, site priority, or ownership priority.
Peer-to-Peer Topology
The Gameplan
- Deletes on both sides are not an issue
- Updates and inserts pose problems
- Peer-to-peer topology with update conflict resolution
- Two-way replication with version-controlled updates
- Inserts for system-generated numbers and identity fields are given different ranges per site
- Conflicting updates go to log tables and are resolved by the RepServer administrator
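The identity-range idea above - giving each site its own block of generated numbers so inserts can never collide - can be sketched as follows. The range size and 0-based site numbering are illustrative assumptions:

```python
def site_id_range(site_index, range_size=1_000_000):
    """Give each site a non-overlapping block of identity values, so that
    rows inserted at different sites can never share a key. The range
    size of one million is an illustrative choice."""
    start = site_index * range_size + 1
    return (start, start + range_size - 1)

def owning_site(identity_value, range_size=1_000_000):
    """Map an identity value back to the site whose range it falls in."""
    return (identity_value - 1) // range_size
```

Because the ranges are disjoint, insert conflicts are avoided entirely by construction; only updates still need the version-controlled conflict resolution described next.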
Version-Controlled Updates
- Version control is a method of detecting and resolving conflicting updates.
- The version column can be a number that increases with each update, a timestamp, or similar.
- The application must provide the current value of the version column at the primary site. This is passed to a stored procedure.
- Rejected transactions are written into the exceptions log
Version-Controlled Updates (contd.)
- Add two columns, called row version and owner, to each replicated table
- The version column gets incremented by one every time a row is updated
- The owner column gets updated to either US or HK based on who updates the row
- The old owner, the primary key, the new owner, and the new incremented version number are passed to an update conflict resolution stored procedure (one for each replicated table)
- This stored procedure is executed by the rs_update function string
Version-Controlled Updates (contd.)
- This stored procedure checks the versions and owners; the transaction succeeds if the old versions and owners match
- If the row has been updated simultaneously, the data goes to a log table at the replicate site, which fires a trigger to populate a master table of all exceptions.
- The replication administrator applies these transactions at his discretion, logging in as a maintenance user.
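In the real system this check lives in a stored procedure invoked by the rs_update function string; the following Python sketch only illustrates the version/owner comparison, with hypothetical field names:

```python
def apply_update(row, old_version, old_owner, new_owner, new_values):
    """Apply a replicated update only if the row's stored version and owner
    still match what the updating site saw; otherwise reject it so it can
    be resolved manually. Field names are illustrative placeholders."""
    if row["version"] == old_version and row["owner"] == old_owner:
        row.update(new_values)
        row["version"] = old_version + 1   # bump the version on success
        row["owner"] = new_owner           # record who last changed the row
        return "applied"
    return "rejected"  # a conflicting update wins; this one goes to the log table
```

A rejected result corresponds to the conflicting data landing in the exceptions log table for the replication administrator to resolve.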
Pre-installation Tasks
- Get the network topology (connections, bandwidth)
- Hardware capacity planning
- Understanding the business logic
- Schema/ER diagram, list of tables
- Eliminate static tables
- Identify which tables get modified on each side
- Frequency and volume of transactions
- Bulk inserts and updates, if any
Pre-installation Tasks (contd.)
- Triggers on the replicate site
- Identify text columns
- "Select from table_name" clauses have to go
- Triggers updating more than 16 tables in pre-11.5 installations
- Create a test environment
Post-installation Stage
- Test out the entire bi-directional strategy
- What happens when the network goes down?
- Ignore duplicate rows (the default is to stop replication)
- Are log tables getting populated on conflict?
- Empty the RSSD exceptions table with rs_delexception
- Keep a close eye on disk space. We allocated 7 days' worth of transaction space in the stable queue.
Limitations
- Text columns cannot be handled by update conflict resolution
- After-insert triggers force two transactions: additional overhead, additional exceptions
- The network connection should be reliable
- A front-end program had to be written to do the conflict resolution; manual intervention is required
- Needs careful monitoring
Recovery Strategies
- Regular full database backups are suggested
- Transaction log backups during the day
- For tables out of sync, use bcp or rs_subcmp
- Dump-and-load procedure for extreme database corruption
- For large databases, it is advisable to keep options open to connect to the primary server in cases of massive corruption