DM235 Building a Robust Business Continuation Plan

Provided by: sybas
1
DM235 Building a Robust Business Continuation Plan
Jit Biswas, Systems Director, Prudential
jit.biswas_at_prudential.com
2
Agenda
  • Preparing a Business Continuation Plan
  • Topologies for High Availability
  • Backups
  • Clusters
  • Warm Standby
  • Peer to Peer Replication

3
A Word about Prudential
  • Prudential Market Strengths
  • $26.9 billion in revenue
  • $195.9 billion in statutory assets
  • $1 trillion of coverage in force for individual
    and group customers
  • Prudential is a large, widely distributed,
    U.S.-based company with an international presence
  • Over 30 million customers
  • 2,300 firms
  • Approximately 65,000 employees

4
Prudential's Technology Strengths
  • Sybase is Prudential's database standard
  • $1 billion annual technology budget
  • Prudential.com gets 600,000 hits monthly
  • Almost 5,000 IT employees

5
Preparing the Business Continuation Plan
  • Remember, a disaster plan is never a fixed,
    finished document; it evolves
  • Be systematic in your plan. Don't try to
    outguess Nature by planning separately for a
    flood, a hurricane, a fire, etc.
  • Appoint a second in command in case the primary
    contact is injured/unavailable

6
How to start planning?
  • Common elements in any disaster
  • loss of information,
  • loss of access to information facilities,
  • loss of people.
  • Make a matrix, with these three as the columns,
    and each of your activities as a row. (Also
    include things like "accounts receivable,"
    "payroll," etc. depending on your situation).
    Then figure out how you would respond to loss of
    information, access, and/or personnel for each
    function.
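
The matrix described above can be sketched in a few lines of Python. This is only an illustration: the activity names come from the slide, while the function names and the sample responses are hypothetical.

```python
# The three common elements of any disaster, used as matrix columns.
LOSS_TYPES = ("information", "access", "people")

def build_response_matrix(activities):
    """Return {activity: {loss_type: planned response}} with every cell empty."""
    return {a: {loss: None for loss in LOSS_TYPES} for a in activities}

def unplanned_cells(matrix):
    """List (activity, loss_type) pairs that still have no planned response."""
    return [(a, loss) for a, row in matrix.items()
            for loss, response in row.items() if not response]

# Hypothetical rows; replace them with your own business functions.
matrix = build_response_matrix(["accounts receivable", "payroll"])
matrix["payroll"]["information"] = "restore payroll database from off-site tape"
matrix["payroll"]["people"] = "cross-trained backup clerk runs the payroll job"

for activity, loss in unplanned_cells(matrix):
    print(f"No response planned yet: {activity} / loss of {loss}")
```

Listing the empty cells makes it obvious which function and loss combinations still need a decision.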

7
To Dos
  • List individual responsibilities ahead of time,
    and assign specific people to each task. This
    includes tasks such as notifying your suppliers
    where to deliver, calling your most important
    customers to tell them what has happened, calling
    your Board members, etc.
  • Protect critical paper records - such as
    "pending" contracts, advertising, research, loan
    applications, etc. - which only exist on paper.

8
To Dos
  • Set clear priorities among your activities.
  • Not everything will come back to normal at the
    same time
  • Decide beforehand the longest amount of time you
    are willing to be "dead in the water" for each of
    your activities.
  • Plan for the loss of your leased lines, and for
    having to relocate to a different site.

9
More to Dos
  • Keep copies of all of your forms off site. This
    includes extra checks so that you can buy the
    emergency supplies you need.
  • Keep a copy of your disaster plan at home. Make
    sure it includes the home phone numbers of the
    service people you rely on: your insurance agent,
    plumber, electrician, etc.

10
Consider the Issues
  • What is the greatest risk?
  • How are various groups within your company
    affected by downtime?
  • What preventative measures are in place right
    now?
  • How will your data be recovered?

11
Create a Procedure
  • Define how to deal with various aspects of the
    network, including loss of servers,
    bridges/routers, etc.
  • Specify who arranges for repairs or
    reconstruction and how the data recovery process
    occurs.
  • Create a check-list or test procedure to verify
    that everything is back to normal once repairs
    and data recovery have taken place.

12
What's your risk?
  • A hurricane took out the reservation system of a
    major airline, forcing agents to write tickets by
    hand and causing huge losses.
  • The World Trade Center bombing: one of the banks
    in the building lost revenues estimated at US$20
    million per day, or $13,889 per minute.
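
The per-minute figure follows from dividing the daily loss by the 1,440 minutes in a day. A small sketch of that arithmetic, with hypothetical function names and an assumed 90-minute outage as the second example:

```python
MINUTES_PER_DAY = 24 * 60

def loss_per_minute(loss_per_day):
    """Spread a daily revenue loss evenly over every minute of the day."""
    return loss_per_day / MINUTES_PER_DAY

def outage_cost(loss_per_day, outage_minutes):
    """Estimated loss for an outage of the given length, at the same rate."""
    return loss_per_minute(loss_per_day) * outage_minutes

daily_loss = 20_000_000  # the bank's estimated daily loss from the slide
print(f"{loss_per_minute(daily_loss):,.0f} per minute")
print(f"{outage_cost(daily_loss, 90):,.0f} for an assumed 90-minute outage")
```

The even spread is a simplification; real losses usually peak during business hours.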

13
How expensive is downtime?
  • Hardware repairs and missed sales opportunities
    are only the most obvious costs.
  • Lost productivity and idle employees
  • Increased technical support costs such as onsite
    repair
  • Missed SLAs
  • Loss of customer confidence and goodwill
  • Legal liability

14
Don't Be Caught with Your Data Down
  • 67% of companies that go through a disaster
    lasting more than two weeks are out of business
    within two years.
  • People don't plan to fail, they simply fail to
    plan.

15
Average Cost of Downtime
16
Pyramid of Availability
17
Hardware Redundancy: The First Line of Defense
  • Hardware redundancy protects against computer
    and disk failure.
  • Hardware redundancy
  • RAID (redundant array of inexpensive disks)
  • Disk mirroring
  • Hardware redundancy cannot protect against
    failures that can cause corrupted data to be
    written to both the primary and the redundant
    disk.

18
RAID levels
  • How are the disks arranged into logical volumes?
  • RAID-0 (or stripes) increases overall
    performance, but significantly reduces overall
    volume reliability.
  • Various combinations of RAID-1 (mirroring) and
    RAID-0 increase performance while also increasing
    reliability.
  • RAID-5 also tends to increase both performance
    and reliability. Approx. two to three times more
    time should be planned for restoring data to a
    RAID-5 volume than it took to back it up.
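
Why striping lowers reliability and mirroring raises it can be shown with a rough probability sketch. It assumes disks fail independently and uses a made-up per-disk survival probability; the function names are illustrative only.

```python
def stripe_survival(p, n):
    """RAID-0: the volume survives only if all n disks survive."""
    return p ** n

def mirror_survival(p, copies=2):
    """RAID-1: the volume survives unless every copy fails."""
    return 1 - (1 - p) ** copies

def striped_mirrors_survival(p, pairs):
    """RAID-1 plus RAID-0: a stripe across mirrored pairs."""
    return mirror_survival(p) ** pairs

p = 0.95  # assumed probability that one disk survives the period of interest
print("4-disk stripe:        ", round(stripe_survival(p, 4), 4))
print("2-disk mirror:        ", round(mirror_survival(p), 4))
print("4 mirrored pairs:     ", round(striped_mirrors_survival(p, 4), 4))
```

Under these assumptions the stripe is less reliable than a single disk, while the striped mirrors stay close to the reliability of a single mirror.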

19
Cold Standby: Backup and Restore
  • Operational databases are backed up on a daily
    basis so that all will not be lost in the event
    of a system failure that destroys or corrupts
    your data.
  • One limitation to this approach is the time
    required to restore a database. While a database
    is recovering, it is inaccessible to the
    end-users.

20
Creating a Warm Standby with Sybase Replication Server
  • To avoid the problem of potentially corrupted
    data, one can use Sybase Replication Server to
    create a warm standby that can be brought up in
    the event of a system failure. Replication is
    usually combined with redundant hardware.
  • The combination of logical replication and
    hardware redundancy provides greater protection
    against loss of availability than either
    mechanism alone.

21
Active/Active Hot Standby
  • ASE 12.0 includes a Companion Server Option. Two
    ASE servers act as companions in either an
    asymmetric (master-slave) or a symmetric
    (active/active hot standby) configuration.
  • In a two-node hardware cluster with two ASE
    servers running, both servers actively run
    applications. If server 1 goes down, server 2
    opens the devices of server 1 and brings them
    online, while continuing to handle its own
    clients.

22
Active/Active Config with apps running on both servers
23
Companion Server takes over in the event of a failure
24
Automatic Client Failover
  • Clients should not have to reconnect in the event
    of a failover.
  • Open Client 12.0 has been enhanced to
    automatically try to connect to the companion
    server in the event of a failover.
  • If a transaction is in a partially completed
    state, an error message is generated to say a
    failover has occurred and that the current
    transaction must be resubmitted.
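
The resubmit rule can be sketched on the application side as a retry loop. This is not the Open Client API: `FailoverError`, `run_with_resubmit`, and the fake transaction are hypothetical stand-ins for whatever error and call your client library provides.

```python
class FailoverError(Exception):
    """Hypothetical error raised when a failover interrupts a transaction."""

def run_with_resubmit(transaction, max_attempts=3):
    """Run transaction(); resubmit it if a failover interrupts it."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except FailoverError:
            # Assumption: the connection has already moved to the companion
            # server, so only the interrupted transaction must be resubmitted.
            if attempt == max_attempts:
                raise

# Fake transaction that is interrupted once, then succeeds.
calls = {"count": 0}

def transfer_funds():
    calls["count"] += 1
    if calls["count"] == 1:
        raise FailoverError("failover occurred; resubmit the transaction")
    return "committed"

result = run_with_resubmit(transfer_funds)
print(result, "after", calls["count"], "attempts")
```

The transaction must be safe to run again from the start, since the partially completed work was rolled back with the failed server.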

25
Automatic Client Failover
26
Failback
  • ASE 12.0 shuts down individual databases from
    one server while the connections that were using
    that database are held on the companion server.
    Once the shutdown is complete, the primary server
    can be restarted, and the companion's proxy
    databases can be re-established.
  • Support for failback enables a seamless move
    back to the original configuration once the
    primary server has been restarted.

27
Agenda
  • Preparing a Business Continuation Plan
  • Topologies for High Availability
  • Backups
  • Clusters
  • Warm Standby
  • Peer to Peer Replication

28
Are you backed up?
  • Did you create a new file today? Billion new
    files are created every day!
  • Is your file protected? 82 percent are not!
  • What would it cost you to lose that file?
  • $50,000/hr loss to re-create data
  • $18,000 is the average hourly cost of downtime
    for PC networks.

29
Backup: Your First Layer of Protection
  • Create a multi-layered backup schedule.
  • Full backup, Incremental backup, Differential
    backup
  • Rotate the media according to a well-defined
    schedule
  • Grandfather-Father-Son: this scheme uses daily
    (Son), weekly (Father), and monthly (Grandfather)
    backup media sets.
  • Tower of Hanoi
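
Both rotation schemes can be expressed as small functions. The sketch below is illustrative: the calendar rules for Grandfather-Father-Son (monthly on the 1st, weekly on Friday) are assumptions, and the Tower of Hanoi function uses the usual rule that each tape is reused half as often as the one before it.

```python
from datetime import date

def gfs_media_set(day: date) -> str:
    """Pick the media set for a backup date under the assumed calendar rules."""
    if day.day == 1:            # assumption: monthly backup on the 1st
        return "Grandfather"
    if day.weekday() == 4:      # assumption: weekly backup on Friday
        return "Father"
    return "Son"

def hanoi_tape(session: int, tapes: int) -> int:
    """Tape index (0 = most often reused) for backup session 1, 2, 3, ..."""
    # The number of trailing zero bits in the session number selects the tape.
    level = (session & -session).bit_length() - 1
    return min(level, tapes - 1)

print([gfs_media_set(date(2024, 3, d)) for d in (1, 7, 8)])
print([hanoi_tape(s, 3) for s in range(1, 9)])
```

With three tapes the Tower of Hanoi sequence repeats as first, second, first, third, so the third tape always holds the oldest copy.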

30
Logical Backups: Use Third-Party Tools
  • Physical backups: byte-for-byte image
  • Faster than logical backups. Entire volume is
    backed up as a single entity.
  • Logical backup
  • Reads the superblock to obtain the names of all
    the directories in the file system
  • Slower than physical backups
  • Benefit of logical backups is their ability to
    restore single tables instead of the whole
    database

31
Client Backup
  • How many clients are there?
  • What types of clients are there?
  • Do the clients have their own backup devices?
  • How are the clients distributed?
  • How autonomous are the client systems?

32
Tape Environment
  • What are the temperature and humidity like?
    Optimal operating conditions are 10-40 deg C,
    storage at 16-32 deg C, and humidity between
    20% and 80%.
  • How often are the drive heads cleaned?
  • How old are the drives and tapes?

33
Agenda
  • Preparing a Business Continuation Plan
  • Topologies for High Availability
  • Backups
  • Clusters
  • Warm Standby
  • Peer to Peer Replication

34
Clusters
  • A cluster is a group of computers (referred to as
    nodes) connected in a way that lets them work as
    a single, continuously available system.
  • Clusters are highly available and scalable.
    Intranet servers, which are increasingly relied
    upon for daily operation, are good candidates for
    "conversion" into a cluster. The extra nodes help
    ensure uptime and increase the server's
    throughput and storage capacity.

35
Third-Party Cluster Software for ASE
  • Adaptive Server Enterprise 12.0's Companion
    Server option is certified to work with the
    following high availability solutions.
  • Sun Microsystems Sun Cluster
  • IBM HACMP
  • Hewlett-Packard ServiceGuard
  • Compaq TruCluster
  • Microsoft Windows NT MSCS
  • ASE 12.0 uses these solutions to detect system
    failure and initiate a failover.

36
How do clusters work?
  • The typical topology for an HA cluster is as
    follows:
  • Two nodes are connected by Ethernet or FDDI.
  • A "heartbeat" is passed between the nodes on the
    private network to monitor the health of each
    node.
  • The storage arrays are redundantly connected to
    the servers. Only one node "owns" a given logical
    diskset; the other node can take over ownership
    in the case of failover.
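
A minimal sketch of heartbeat monitoring and diskset takeover, assuming a simple timeout rule. The class names, node names, and the 5-second timeout are hypothetical and do not describe any particular cluster product.

```python
class HeartbeatMonitor:
    """Tracks the last time a heartbeat was received from each node."""

    def __init__(self, timeout_seconds):
        self.timeout = timeout_seconds
        self.last_seen = {}

    def record(self, node, now):
        self.last_seen[node] = now

    def is_alive(self, node, now):
        seen = self.last_seen.get(node)
        return seen is not None and (now - seen) <= self.timeout


class Cluster:
    """Two or more nodes; each logical diskset has exactly one owner."""

    def __init__(self, nodes, monitor, owners):
        self.nodes = nodes
        self.monitor = monitor
        self.owner = dict(owners)  # diskset -> owning node

    def check_failover(self, now):
        """Move disksets away from nodes whose heartbeat has timed out."""
        moved = []
        for diskset, owner in list(self.owner.items()):
            if self.monitor.is_alive(owner, now):
                continue
            survivors = [n for n in self.nodes
                         if n != owner and self.monitor.is_alive(n, now)]
            if survivors:
                self.owner[diskset] = survivors[0]
                moved.append(diskset)
        return moved


monitor = HeartbeatMonitor(timeout_seconds=5)
cluster = Cluster(["node1", "node2"], monitor,
                  {"diskset_a": "node1", "diskset_b": "node2"})
monitor.record("node1", now=0)
monitor.record("node2", now=0)
monitor.record("node2", now=8)          # node1 has gone silent
print(cluster.check_failover(now=10))   # disksets that changed owner
print(cluster.owner)
```

Timestamps are passed in explicitly so the example is deterministic; a real monitor would read a clock.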

37
How a Cluster Works
  • A cluster has two adapters, a primary and a
    non-primary adapter. The primary adapter is the
    adapter that controls the RAID arrays.
  • When the cluster is first configured and the
    systems turned on, the adapter that has the
    higher unique ID is automatically defined to be
    the primary adapter.

38
Failover in a cluster
  • Each adapter checks periodically that it can
    still communicate with its system. If an adapter
    stops operating, the other adapter detects this.
  • If the non-primary adapter detects that it has
    lost access to the other adapter, the non-primary
    adapter becomes the primary adapter. Commands
    that were sent to the original primary adapter
    after access was lost are sent again, to the new
    primary adapter. This action is called failover.

39
Failover of a cluster
  • If writes are in progress to an array (or have
    occurred within the last 20 sec.) when failover
    occurs, the array is rebuilt after the new
    primary adapter has taken control.
  • If an array has one of its members missing (that
    is, the array is in the exposed or degraded
    state) when a failover occurs, the status of the
    array becomes offline and an error is logged.
    Manual intervention is needed to resolve this
    error.

40
Failback
  • After a failover has occurred and a new adapter
    has been installed in place of the faulty one,
    the new adapter might have an ID that is higher
    than that of the remaining (current primary)
    adapter. Under that condition, the new adapter
    becomes the primary adapter. This action is
    called failback.
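
The primary-adapter rules from the last few slides (higher ID wins at configuration, the survivor takes over on failover, a higher-ID replacement takes over on failback) can be summarized in a short sketch. The class and the adapter IDs are hypothetical.

```python
class AdapterPair:
    """Models which of two adapters currently acts as the primary."""

    def __init__(self, id_a, id_b):
        self.active = {id_a, id_b}
        # At first configuration the adapter with the higher ID is primary.
        self.primary = max(self.active)

    def adapter_lost(self, adapter_id):
        """Failover: if the primary is lost, the survivor becomes primary."""
        self.active.discard(adapter_id)
        if adapter_id == self.primary and self.active:
            self.primary = next(iter(self.active))

    def adapter_installed(self, adapter_id):
        """Failback: a replacement with a higher ID takes over as primary."""
        self.active.add(adapter_id)
        if adapter_id > self.primary:
            self.primary = adapter_id


pair = AdapterPair(10, 20)
print("configured primary:", pair.primary)
pair.adapter_lost(20)
print("after failover:", pair.primary)
pair.adapter_installed(30)
print("after failback:", pair.primary)
```

A replacement with a lower ID than the current primary leaves the roles unchanged, which is why failback only "might" happen after a repair.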

41
Agenda
  • Preparing a Business Continuation Plan
  • Topologies for High Availability
  • Backups
  • Clusters
  • Warm Standby
  • Peer to Peer Replication

42
Replication Topologies
  • Replicate sites read-only (and do not update
    primary data).
  • Remote primary update with client connection.
  • Remote site request for primary update, with
    local changes.
  • Distributed primary fragments.
  • Corporate roll-up.
  • Warm Standby
  • Peer to Peer Topology

43
Replicate Sites Read Only (and do not update primary data)
  • Replicate sites just need read only access to
    data
  • Updates are done to the primary site and are
    propagated to replicated sites
  • Replicate sites use replicate copies in
    read-only mode.
  • Only a local client request can update the
    primary.

44
Replicate Sites Read Only (and do not update primary data)
[Diagram: primary database, one-way replication to a read-only replicate site]
45
Remote Primary Update with Client Connections
  • Replicate sites
  • Replicate sites wish to change data
  • Updates are done at primary site and are
    propagated to replicate sites
  • Replicate sites use replicate copies in read
    only mode
  • Updating primary data
  • Replicate sites remotely log in or use direct
    client connections to make changes to primary
    data

46
Remote Primary Update with Client Connections
[Diagram: primary site, one-way replication to a read-only replicate site (read); remote login to the primary (write)]
47
Remote Site Request for Primary Update with Local Changes
  • Replicate sites
  • Replicate sites wish to change data
  • Updates are done at primary site and are
    propagated to replicate sites
  • Replicate sites use replicate copies in read
    only mode, but change a local copy of their
    replicate data
  • Updating Primary Data
  • Remote (replicate) sites change their own local
    copy and request changes in the primary to bring
    everything up to date

48
Remote Site Request for Primary Update with Local Changes
[Diagram: request function and applied function]
49
Distributed Primary Fragments
  • Partitioned Primary Data
  • Primary data is fragmented into any number of
    databases in the system
  • Only one image of primary data set exists, but
    not in a single table in a single server.
  • Remote Sites
  • Remote sites control their own primary data, and
    can change it.

50
Distributed Primary Fragments
51
Corporate Roll-Up
  • Partitioned Primary Data
  • Primary data is fragmented into any number of
    databases in the system
  • Only one image of primary data set exists, but
    not in a single table in a single server
  • Remote Sites
  • Remote sites control their primary, and can
    change it
  • Remote sites may not see the rest of the database
  • Updates done on primary data are propagated only
    to corporate site.

52
Corporate Roll Up
53
Warm Standby
  • Redundant Primaries
  • Two complete versions of the primary database,
    for warm standby
  • Only one of them acts as the primary at any
    given time
  • Only one primary, the active one, may be updated
    at a time
  • Replicate Sites
  • Remote (replicate) sites are read-only

54
Warm Standby - Switching Over
  • Switching Primaries
  • When first primary server is unavailable, the
    application that updates the primary switches to
    updating the backup primary site
  • Applications must update only one primary at a
    time
  • Backup primary site propagates changes to
    replicate sites

55
OpenSwitch
  • Client programs would migrate seamlessly to a
    failover database in the event of an unplanned
    outage. A simple flip of a software switch would
    transfer client connections to backup servers
    during planned downtime.

56
Threads used in the Replication Server
57
The Replicate Replication Server
58
LTM (Log Transfer Manager)
  • The RS maintains the secondary truncation point
    within the primary database transaction log.
  • The LTM truncation point is a pointer to the
    OLDEST active transaction that has not been
    completely read out by the LTM.
  • The LTM uses the RSSD table rs_locater to keep
    track of this truncation point.

59
LTM Service Threads
  • The LTM coordinates with the Dataserver to
    persistently store the current LTM truncation
    point.
  • The rs_zeroltm stored proc is used to initialize
    the locater value if the truncation point is ever
    invalidated.
  • At LTM startup, two threads are spawned to talk
    to the Dataserver and the Replication Server:
  • Log Scan service thread
  • Log transfer service thread

60
LTM (Log Transfer Manager)
  • The LT Service thread submits LTL to the
    Replication Server containing the complete before
    and after row images.
  • The LTL is composed by executing dbcc
    logtransfer(scan, normal)
  • The Log Scan thread will always sync with the
    log transfer service thread each time it sends a
    completed batch by fetching a new truncation
    point.

61
LTM (Log Transfer Manager)
  • The LTM has very limited language handling
    capabilities.
  • The LTM is only interested in log records for
    objects that have the sysobjects.sysstat flag
    -32768 (0x8000) set. Other scanned log records
    are discarded, including the maintenance user
    transactions, unless the LTM has been started
    using a -A and/or -W flag.
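
The pairing of -32768 with 0x8000 on this slide is consistent with a single flag bit in a 16-bit signed status value. The sketch below assumes that reading; the function names are illustrative and this is not how the LTM itself is implemented.

```python
REPLICATE_BIT = 0x8000  # the flag value quoted on the slide

def as_signed_16(value):
    """Interpret the low 16 bits of value as a signed two's-complement integer."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def is_marked(sysstat):
    """True if the 0x8000 bit is set in a (possibly negative) status value."""
    return bool(sysstat & REPLICATE_BIT)

print(as_signed_16(REPLICATE_BIT))   # the same bit read as a signed 16-bit number
print(is_marked(-32768), is_marked(3), is_marked(-32768 + 3))
```

Testing the bit with a mask, rather than comparing the whole value, keeps the check correct when other status bits are set as well.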

62
LTM (Log Transfer Manager)
  • The LTM invoking the dbcc logtransfer() command
    translates to
  • dbcc logtransfer(scan, normal) -->
    exec_dbcc() -->
    call_logtransfer()
  • call_logtransfer() is the entry point into
    the SQL Server log scanner.
  • When a scan is invoked, the logscan thread reads
    qualifying records from the transaction log and
    transmits them to the LTM one record at a time.

63
SQL Server's Role in Replication
  • SQL Server plays a role in three stages during
    data replication
  • Marking objects for replication, using
    sp_setreptable and sp_setrepcol
  • SQL Server marks the log records for these
    objects with the LHSC_REPLICATE flag
  • Transmitting these log records to the LTM

64
Preparation for Replication
  • The stored proc sp_setreptable is used to mark
    user tables for replication. The procedure sets
    the O_REPLICATED flag in the sysobjects row for
    that table
  • Performance tip: when sp_setrepcol is called on a
    table with text/image columns, txtimg_upd_table()
    scans the table, reading each text/image column
    in each row and updating the text pointer. This
    is a big overhead for large tables.

65
Preparation for Replication
  • Stored procedures are marked for Replication
    using sp_setrepproc. The system stored proc. sets
    the O_REPLICATED flag and also the
    O_PROC_SUBSCRIBABLE flag in the sysobjects row
    for the stored proc.
  • For replicated stored procedures, SQL Server
    marks only the EXECBEGIN and EXECEND log records
    with the LHSC_REPLICATE flag, i.e. the DML log
    records are not replicated.
  • Replication Server does not replicate nested
    stored procs.

66
Agenda
  • Preparing a Business Continuation Plan
  • Topologies for High Availability
  • Backups
  • Clusters
  • Warm Standby
  • Peer to Peer Replication

67
Update Everywhere - Peer to Peer Topology
  • Two way replication and all sites can use DML
    statements at all times.
  • Update and insert conflicts are resolved by
    careful application design
  • A replication server is required at all sites
  • Function strings need to be modified for conflict
    resolution
  • One site is given priority based on the
    discretion of the Replication Server Manager.
    Other models use timestamp priority, site
    priority, ownership priority.

68
Peer To Peer Topology
69
The Gameplan
  • Deletes on both sides are not an issue
  • Updates and Inserts pose problems
  • Peer to Peer topology with update conflict
    resolution
  • Two way replication with Version controlled
    updates
  • Inserts for system generated numbers and
    identity fields given different ranges
  • Conflicting updates go to log tables which are
    resolved by the RepServer Administrator
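
Giving each site its own range of generated keys means inserts at different sites can never collide. A minimal sketch follows; the US and HK site names appear later in the deck, but the class name and the range boundaries are made up.

```python
class SiteKeyAllocator:
    """Hands out system-generated keys from a range reserved for one site."""

    def __init__(self, site, start, end):
        self.site = site
        self.next_key = start
        self.end = end

    def allocate(self):
        if self.next_key > self.end:
            raise RuntimeError(f"key range exhausted at site {self.site}")
        key = self.next_key
        self.next_key += 1
        return key


# Hypothetical, non-overlapping ranges for the two sites.
us = SiteKeyAllocator("US", 1, 1_000_000)
hk = SiteKeyAllocator("HK", 1_000_001, 2_000_000)

us_keys = [us.allocate() for _ in range(3)]
hk_keys = [hk.allocate() for _ in range(3)]
print(us_keys, hk_keys)
```

Because the ranges are disjoint, a row inserted at one site can be replicated to the other without a duplicate-key conflict.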

70
Version Controlled Updates
  • Version control is a method of detecting and
    resolving conflicting updates.
  • The version column can be a number that increases
    with each update, a timestamp, or
  • Application must provide the current value of the
    version column at the primary site. This is
    passed to a stored procedure.
  • Rejected transactions are written into the
    exceptions log

71
Version Controlled Updates (cont'd)
  • Add two columns, row version and owner, to each
    replicated table
  • The version column gets incremented by one every
    time a row is updated
  • The owner column gets updated to either US or HK
    based on who updates the row
  • The old owner, the primary key, the new owner and
    the new incremented version number are passed to
    an update conflict resolution stored procedure
    (one for each replicated table).
  • This stored proc. is executed by the rs_update
    function string

72
Version Controlled Updates (cont'd)
  • This stored procedure checks the versions and
    owners; the transaction is successful if the old
    versions and owners match
  • If the row has been updated simultaneously, the
    data goes to a log table in the replicate site
    which fires a trigger to populate a master table
    of all exceptions.
  • The Replication administrator applies these
    transactions at their discretion, logging in as
    a maintenance user.
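
The version and owner check can be sketched in Python with dictionaries standing in for the replicated table and the log table. This is an illustration of the idea only, not the stored procedure used in the project; every name below is hypothetical.

```python
def apply_replicated_update(table, log_table, key, old_owner,
                            new_owner, new_version, new_values):
    """Apply an update only if the row still has the version and owner
    that the originating site saw; otherwise log it as a conflict."""
    row = table[key]
    if row["version"] == new_version - 1 and row["owner"] == old_owner:
        row.update(new_values)
        row["version"] = new_version
        row["owner"] = new_owner
        return True
    # Conflict: the row changed in the meantime, so keep the update for
    # the administrator to review instead of applying it.
    log_table.append({"key": key, "old_owner": old_owner,
                      "new_owner": new_owner, "new_version": new_version,
                      "values": new_values})
    return False


hk_table = {42: {"balance": 100, "version": 3, "owner": "US"}}
hk_log = []

# An update made in the US against version 3 arrives and applies cleanly.
first = apply_replicated_update(hk_table, hk_log, 42, "US", "US", 4,
                                {"balance": 120})
# A second update that was also based on version 3 is now stale.
second = apply_replicated_update(hk_table, hk_log, 42, "US", "HK", 4,
                                 {"balance": 90})
print(first, second, hk_table[42], len(hk_log))
```

The stale update is preserved in the log rather than discarded, which matches the manual resolution step described above.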

73
Pre-installation tasks
  • Get the network topology (connections, bandwidth)
  • Hardware capacity planning
  • Understanding the Business Logic
  • Schema/ER diagram, list of tables
  • Eliminate static tables
  • Identify which tables get modified on each side
  • Frequency and volume of transactions
  • Bulk inserts and updates if any

74
Pre-installation tasks (cont'd)
  • Triggers on the replicate site
  • Identify text columns
  • Select from table_name clauses have to go.
  • Triggers updating more than 16 tables in
    pre-11.5 installations
  • Create a test environment

75
Post-installation stage
  • Test out the entire bi-directional strategy
  • What happens when the network goes down?
  • Ignore duplicate rows (defaults to stop
    replication).
  • Are log tables getting populated on conflict?
  • Empty the RSSD exceptions table with
    rs_delexception
  • Keep a close eye on disk space. We allocated 7
    days' worth of transaction space in the stable
    queue.

76
Limitations
  • Text columns cannot be handled by UCR
  • After insert triggers force two transactions,
    additional overhead, additional exceptions
  • Network connection should be reliable
  • Front end program had to be written for doing the
    conflict resolution. Manual intervention
    required.
  • Needs careful monitoring

77
Recovery strategies
  • Regular full database backups suggested
  • Transaction log backups during the day
  • For tables out of sync use bcp or rs_subcmp
  • Dump and load procedure for extreme database
    corruption.
  • For large databases it is advisable to keep
    options to connect to the primary server during
    cases of massive corruption