Transcript and Presenter's Notes

Title: Disaster Recovery Setup at ECMWF

1
Disaster Recovery Setup at ECMWF
  • Francis Dequenne
  • June 2005

2
Acknowledgments
  • The work described here was mostly performed by Janos Nagy.
  • Mike Connally, Kevin Murell, and Francis Dequenne provided some help
    and guidance.

3
Computer hall setup 2004
4
What the DRS used to be
  • The DRS building only contained:
    • A second copy of some ECFS and MARS data, partially stored in a
      robot.
    • Systems backup tapes.
    • A tiny TSM server with:
      • Backups of the critical DHS metadata.
      • Backups of some servers (e.g. NFS servers, general-purpose
        servers).

5
If the computer hall was lost
  • Super-computers:
    • Would require installation of new super-computers (months).
    • In the short term, find a site able to run our models for a while.
  • Other servers:
    • Would require installation of new hardware (weeks), plus bare-metal
      restores from DRS backups.
  • DHS:
    • The critical data was saved, but no hardware to access it would
      have been available.
    • Would require installation of new platforms (weeks), plus
      bare-metal restores of systems and metadata (HPSS, MARS, ECFS).
    • This recovery was never fully tested.

6
There was scope for improvement
  • A disaster in the computer hall could have stopped ECMWF activities
    for weeks.
  • In an ideal world, we would:
    • Create an alternative site in another part of Europe.
    • Distribute or duplicate our equipment to this new site.
    • Duplicate all data to this other site.
    • Install high-speed links between the two sites.
  • But this may be difficult to finance.
  • How do we protect ourselves better while keeping costs under control?

7
There was scope for improvement (continued)
  • The weather community is ready to help:
    • NCEP disaster: the NCEP operational workload was distributed across
      several sites.
    • When ECMWF's Cray C90 burned down, an alternative site was
      identified in a few hours (UK Met Office).
  • Finding alternative super-computer sites is possible.
  • A disaster may bring down only part of our equipment:
    • The computer hall is partitioned.
  • First priority: how do we provide data to the equipment that remains
    available?

8
What we wanted to achieve
  • Provide quick access to the DHS data stored in the DRS:
    • Critical data could be exported to external sites.
    • Data could be provided to unaffected equipment on site.
  • Transfer to other sites:
    • By tape.
    • Possibly, in the future, via a connection of the DRS to the WAN.
  • Provide a minimal DHS service to support the unaffected equipment.
  • Test regularly that we can restore a service.
  • Costs had to be kept low.

9
Computer hall setup
10
New layout (DHS)
[Diagram: new DHS layout across the computer hall and DRS building,
showing the DB, the HPSS core and a test core, mirrored backups, logs
and archive logs, STK silos, two HPSS Movers, the AML/J tape robot, two
ECFS caches, three MARS servers, a test server, and additional LAN
equipment.]
11
New layout (DHS)
  • Time to recover: 4 to 5 hours.
  • Data lost:
    • Old data which is not backed up, in particular:
      • MARS RD data.
      • ECFS data without backup.
    • Recent data not yet copied to DRS tapes.
  • The service is expected to be very limited:
    • 12 tape drives only.
    • Small disk cache.
    • Limited CPU resources.
  • Recovery from disaster (sketched below):
    • Bring the core system image up on the test system.
    • Using the mirror of the DB2 backups and journals, restore HPSS's
      DB2 metadata.
    • Bring up the HPSS Mover functionality on the remaining HPSS server,
      and connect the remaining disk data mirrors and the DRS tape drives
      to it.
    • Restore the ECFS and additional MARS servers.

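For illustration only, a minimal sketch of how the scripted part of this
sequence might look. Every name in it is an assumption: the DB2 database
name, backup directory, volume group, and HPSS start script path are
placeholders rather than ECMWF's actual configuration; only the db2
restore/rollforward and AIX varyonvg command forms are standard.

    #!/usr/bin/env python3
    # Hypothetical sketch of the DHS recovery sequence described above.
    # All database names, paths, and volume group names are placeholders.
    import subprocess

    DB2_BACKUP_DIR = "/drs/db2/backups"  # assumed mirror of the DB2 backup images
    HPSS_DB = "HPSSDB"                   # assumed name of the HPSS DB2 database
    MOVER_VG = "hpssdatavg"              # assumed VG of the surviving disk mirrors

    def run(cmd):
        """Echo and run one shell command, aborting the sequence on failure."""
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

    # Restore HPSS's DB2 metadata from the mirrored backup, then roll
    # forward through the mirrored journals (archive logs).
    run(f"db2 restore database {HPSS_DB} from {DB2_BACKUP_DIR}")
    run(f"db2 rollforward database {HPSS_DB} to end of logs and complete")

    # Activate the surviving disk-data mirrors on the remaining HPSS
    # server so the Mover can use them.
    run(f"varyonvg {MOVER_VG}")

    # Restart HPSS; the start script location is an assumption.
    run("/opt/hpss/bin/rc.hpss start")

In practice each step would be run and verified by an operator rather
than fired off blindly.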
12
New layout (DHS)
The only affected service is one MARS server; it will be restored on one
of the surviving MARS server platforms. Data lost: anything from that
server which was not yet copied to tape, plus ECFS data which was on
un-mirrored disks in the DRS building and not copied to tape. Service
will be affected to some extent.
13
Some Technical Considerations
  • FC SAN:
    • Ability to connect devices physically located in remote buildings.
    • Ability to physically connect several servers to the same devices.
  • FAStT:
    • Ability to dynamically modify the ownership of logical disks.
  • LVM mirroring (see the sketch below).
  • We considered using FAStT remote mirroring, but:
    • It is expensive.
    • It has an impact on performance.

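As a sketch of what the LVM mirroring mentioned above involves on AIX,
assuming the FC SAN already makes a disk in the other building visible
to the server; the volume group and hdisk names are invented
placeholders.

    #!/usr/bin/env python3
    # Sketch: mirror an AIX volume group onto an FC-attached disk located
    # in the other building. VG and hdisk names are placeholders.
    import subprocess

    VG = "ecfsdatavg"        # hypothetical VG holding ECFS cache data
    REMOTE_DISK = "hdisk12"  # hypothetical disk physically in the DRS building

    for cmd in (
        f"extendvg {VG} {REMOTE_DISK}",  # add the remote disk to the VG
        f"mirrorvg {VG} {REMOTE_DISK}",  # put a second copy of each LV on it
        f"syncvg -v {VG}",               # synchronise the fresh mirror copies
    ):
        print("+", cmd)
        subprocess.run(cmd, shell=True, check=True)

Because the second copy lives on a disk in the other building, losing
one building still leaves a complete copy behind, which is what the new
layout relies on.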
14
Interesting issues encountered during testing
  • Naming services.
  • Locking of volume groups.
  • Alt-disk install post-installation steps.
  • DCE.
  • Bypassing of primary copies (a toy model follows below):
    • With the primary tape set down and the disk copy deleted, data can
      still be accessed.
    • Either delete all Level 0 copies, or remove (temporarily) Level 0
      from the hierarchy.
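
To make the last point concrete: in an HPSS-style hierarchy a file has
copies at several levels, and a read is served from the highest level
that is still reachable. The toy model below uses invented data (it is
not HPSS code) to show why data stays accessible once the unreachable
primary copies are bypassed.

    # Toy model of reading through a storage hierarchy; the levels and
    # file contents are invented for illustration only.
    hierarchy = {
        0: {},                      # Level 0 (disk cache): copy deleted
        1: {"fileA": "primary"},    # Level 1 (primary tape): set down
        2: {"fileA": "secondary"},  # Level 2 (secondary tape in the DRS)
    }
    unavailable_levels = {1}        # the primary tape was set down in the test

    def read(name):
        """Return the first reachable copy, scanning levels top-down."""
        for level in sorted(hierarchy):
            if level in unavailable_levels:
                continue
            if name in hierarchy[level]:
                return level, hierarchy[level][name]
        raise FileNotFoundError(name)

    print(read("fileA"))  # (2, 'secondary'): data can still be accessed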
15
Why not HACMP?
  • It adds a level of complexity.
  • Cost.
  • It requires the software of several systems to be kept synchronised
    and symmetrical.
  • We preserve the ability to use the recovery machine as a test machine
    (e.g. for AIX or HPSS upgrade tests).
  • The recovery machine uses a replica of rootvg (see the sketch below),
    so there is less risk of missing something.
  • We can afford to be offline for a few hours.

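For context on the rootvg replica mentioned above: on AIX such a replica
can be produced with alt_disk_install, which clones the running rootvg
onto a spare disk. A one-step sketch, where the target disk name is a
placeholder:

    import subprocess

    TARGET_DISK = "hdisk1"  # hypothetical spare disk for the rootvg replica

    # Clone the running rootvg onto the target disk (alt_disk_install -C).
    subprocess.run(["alt_disk_install", "-C", TARGET_DISK], check=True)
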
16
Current Status (DHS)
  • The first large-scale test was performed in April.
  • Some issues were found; some have already been addressed.
  • We still need to:
    • Iron out the remaining problems.
    • Test the restoration of one MARS server in the computer hall.
    • Evaluate the management of some edge cases.
    • Introduce a regular testing schedule (twice a year?).
  • We are reasonably confident that we would be able to provide a
    service after a computer hall loss.

17
Protection of non-DHS servers: short term
  • Install the supercomputer clusters in different computer halls.
  • Other servers:
    • Work has started on validating that some critical workloads can be
      moved between various servers (e.g. the NFS service; see the sketch
      below).
    • These servers could then be distributed between the two computer
      halls.
  • Resilient LAN connections between the DRS building and both computer
    halls.
  • Split of the telecom area.
  • These proposals are under investigation; no decisions have been taken
    yet.

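As an illustration of the kind of validation involved for the NFS
service, the toy check below assumes a hypothetical floating service
name and simply verifies that the NFS port still answers after the
service has been moved to another server:

    import socket

    SERVICE_HOST = "nfs-service.example"  # hypothetical floating service name
    NFS_PORT = 2049                       # standard NFS TCP port

    def service_reachable(host, port, timeout=5.0):
        """Return True if a TCP connection succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    # The same check should pass both before and after the move.
    print(service_reachable(SERVICE_HOST, NFS_PORT))
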
18
In the future
  • Static subsets of popular data could be distributed to other sites:
    • This is already done for ERA-40 data.
  • Consider an alternative WAN connection to the DRS building.
  • ECMWF may want to investigate the ability to distribute a minimal
    subset of data geographically (e.g. as part of the DEISA project):
    • This may require additional bandwidth to be made available.
  • Distribute the DHS equipment across computer halls.
  • Consider extending or replacing the Disaster Recovery building.
  • An integrated Disaster Recovery action plan will be designed.