Title: Disaster Recovery Setup at ECMWF
1. Disaster Recovery Setup at ECMWF
- Francis Dequenne
- June 2005
2. Acknowledgments
- The work described here was mostly performed by Janos Nagy.
- Mike Connally, Kevin Murell and Francis Dequenne provided some help and guidance.
3. Computer hall setup 2004
4. What DRS used to be
- The DRS building only contained:
  - A second copy of some ECFS and MARS data, partially stored in a robot.
  - Systems backup tapes.
  - A tiny TSM server with:
    - Backups of the critical DHS metadata.
    - Backups of some servers (e.g. NFS servers, General Purpose servers, ...).
5. If the computer hall was lost
- Super-computers
  - Would require the installation of new super-computers (months).
  - In the short term, find a site able to run our models for a while.
- Other servers
  - Would require the installation of new hardware (weeks), plus bare-metal restores from DRS backups.
- DHS
  - The critical data was saved, but no hardware to access it would have been available.
  - Would require the installation of new platforms (weeks), plus bare-metal restores of systems and metadata (HPSS, MARS, ECFS).
  - Never fully tested.
6. There was scope for improvements
- A disaster in the computer hall may have stopped ECMWF activities for weeks.
- In an ideal world:
  - Create an alternative site in another part of Europe.
  - Distribute or duplicate our equipment to this new site.
  - Duplicate all data to this other site.
  - Install high-speed links between the 2 sites.
  - But this may be difficult to finance.
- How can we protect ourselves better while keeping the costs under control?
7. There was scope for improvement
- The weather community is ready to help:
  - After the NCEP disaster, the NCEP operational workload was distributed over several sites.
  - When ECMWF's Cray C90 burned down, an alternative site was identified in a few hours (UK Met Office).
  - Finding alternative super-computer sites is therefore possible.
- A disaster may only bring down part of our equipment:
  - The computer hall is partitioned.
- First priority: how to provide data to the equipment that remains available?
8. What we wanted to achieve
- Provide quick access to the DHS data stored in the DRS:
  - Critical data could be exported to external sites.
  - Data could be provided to unaffected equipment on site.
- Transfer to other sites:
  - By tapes.
  - Possibly, in the future, by connecting the DRS to the WAN.
- Provide a minimal DHS service to support the unaffected equipment.
- Test regularly that we can restore a service.
- Costs had to be kept low.
9. Computer hall setup
10. New layout (DHS)
[Diagram: new DHS layout showing the DB, Test core, HPSS core, BKUP Logs / A. Logs, STK silos, HPSS Movers, AML/J, ECFS caches, MARS servers, Test server and additional LAN equipment.]
11. New layout (DHS)
- Time to recover: 4 to 5 hours.
- Data lost:
  - Old data which is not backed up, in particular MARS RD data and ECFS data without backup.
  - Recent data not yet copied to DRS tapes.
- The service is expected to be very limited:
  - 12 tape drives only.
  - Small disk cache.
  - Limited CPU resources.
- Recovery from disaster (a sketch of these steps follows the diagram below):
  - Bring the core system image up on the test system.
  - Using the mirror of the DB2 backup and journals, restore HPSS's DB2 metadata.
  - Bring up HPSS Mover functionality on the remaining HPSS server, and connect the remaining disk data mirrors and the DRS tape drives to it.
  - Restore ECFS and the additional MARS servers.
[Diagram: recovery layout showing the DB, Test core, HPSS core, BKUP Logs / A. Logs, STK silos, HPSS Movers, AML/J, ECFS caches, MARS servers, Test server and additional LAN equipment.]
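The recovery sequence above could be driven by a simple script. The sketch below is a hypothetical illustration only, assuming AIX LVM and DB2 command-line tools; the volume group names, disk names, database name, backup path and local start-up scripts are invented for the example and are not ECMWF's actual configuration.

```python
# Hypothetical sketch of the DRS recovery sequence described above.
# All names (volume groups, disks, database, paths, scripts) are
# illustrative assumptions.
import subprocess

def run(cmd: str) -> None:
    """Run a shell command, echo it, and stop on the first failure."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

def recover_dhs_core() -> None:
    # 1. Activate the mirrored HPSS metadata disks on the test (recovery) node.
    run("importvg -y hpssmetavg hdisk10")       # assumed VG and disk names
    run("varyonvg hpssmetavg")

    # 2. Restore HPSS's DB2 metadata from the mirrored backup image,
    #    then roll forward through the mirrored journals.
    run("db2 restore database hpssdb from /drs/db2_backups")
    run("db2 rollforward database hpssdb to end of logs and stop")

    # 3. Attach the surviving disk mirrors and the DRS tape drives to the
    #    node providing HPSS Mover functionality, then restart HPSS
    #    (site-specific start-up script assumed).
    run("varyonvg ecfscachevg")
    run("/usr/local/hpss/bin/start_hpss")       # assumed local script

    # 4. Restore the ECFS and MARS server configurations on the surviving
    #    platforms (site-specific procedures assumed).
    run("/usr/local/adm/restore_ecfs_server")   # assumed local script
    run("/usr/local/adm/restore_mars_servers")  # assumed local script

if __name__ == "__main__":
    recover_dhs_core()
```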
12. New layout (DHS)
- The only affected service is one MARS server; it will be restored on one of the surviving MARS server platforms.
- Data lost:
  - Anything from that server which had not yet been copied to tape.
  - ECFS data which was on un-mirrored disks in the DRS building and not copied to tape.
- The service will be affected to some extent.
[Diagram: DHS layout for this scenario showing the DB, Test core, HPSS core, BKUP Logs / A. Logs, STK silos, HPSS Movers, AML/J, ECFS caches, MARS servers and Test server.]
13. Some Technical Considerations
- FC SAN:
  - Ability to connect devices physically located in remote buildings.
  - Ability to physically connect several servers to the same devices.
- FAStT:
  - Ability to dynamically modify the ownership of logical disks.
- LVM mirroring (see the sketch after this list).
- We considered using FAStT remote mirroring:
  - Expensive.
  - Impact on performance.
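As a rough illustration of the LVM mirroring mentioned above, the sketch below mirrors an AIX volume group onto a disk presented over the FC SAN from the other building. The volume group and hdisk names are assumptions for the example, not the actual configuration.

```python
# Hypothetical sketch: mirror every logical volume of a volume group onto
# a FAStT logical disk assumed to be presented from the DRS building.
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

# hdisk20 is the assumed remote (DRS-side) disk seen over the FC SAN.
run("extendvg ecfscachevg hdisk20")   # add the remote disk to the VG
run("mirrorvg ecfscachevg hdisk20")   # create a second copy of each LV on it
run("syncvg -v ecfscachevg")          # synchronise the new mirror copies
```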
14. Interesting issues encountered during testing
- Naming services.
- Locking of Volume Groups.
- Alt-disk install post-installation.
- DCE.
- Bypassing of primary copies (a toy illustration follows this list):
  - With the primary tape set down and the disk copy deleted, the data can still be accessed.
  - Either delete all Level 0 copies, or remove (temporarily) Level 0 from the hierarchy.
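The hierarchy-bypass idea can be pictured with a toy model, shown below. This is not the HPSS API; the class and status values are invented purely to illustrate how a read falls through to the secondary (DRS) copy when Level 0 is down or removed from the hierarchy.

```python
# Toy model (not HPSS) of reading around an unavailable primary copy.
from dataclasses import dataclass, field

@dataclass
class StoredFile:
    name: str
    # hierarchy level -> copy status ("ok", "down" or "deleted")
    copies: dict = field(default_factory=dict)

def readable_level(f: StoredFile, bypassed=()):
    """Return the lowest hierarchy level holding a usable copy, or None."""
    for level in sorted(f.copies):
        if level in bypassed:
            continue                   # level temporarily removed from the hierarchy
        if f.copies[level] == "ok":
            return level
    return None

f = StoredFile("sample_object", {0: "down", 1: "ok"})
print(readable_level(f))               # -> 1: the secondary (DRS) copy is used
print(readable_level(f, bypassed=[0])) # -> 1: same result with Level 0 removed
```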
15. Why not HACMP?
- Additional level of complexity.
- Cost.
- Need to keep the software of several systems synchronised and symmetrical.
- Preserve the ability to use the recovery machine as a test machine (e.g. for AIX or HPSS upgrade tests).
- The recovery machine uses a rootvg replica, so there is less risk of missing something (see the sketch after this list).
- We can afford to be offline for a few hours.
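A rootvg replica of the kind mentioned above could be refreshed with something like the sketch below. The disk name is an assumption, and the exact cloning command depends on the AIX level (alt_disk_copy is assumed here; older releases use alt_disk_install).

```python
# Hypothetical sketch: clone the running rootvg to a spare disk so the
# recovery machine can boot from an up-to-date replica.
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

run("alt_disk_copy -d hdisk1")    # clone rootvg onto the assumed spare disk
run("bootlist -m normal -o")      # verify which disk the system will boot from
```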
16. Current Status (DHS)
- The first large-scale test was performed in April.
- Some issues were found, some of which have already been addressed.
- We still need to:
  - Iron out the remaining problems.
  - Test the restoration of one MARS server in the computer hall.
  - Evaluate the handling of some corner cases.
  - Introduce a regular testing schedule (twice a year?).
- We are reasonably confident that we would be able to provide a service after a computer hall loss.
17. Protection of non-DHS servers (short term)
- Install the supercomputer clusters in different computer halls.
- Other servers:
  - Work has started on validating that some critical workloads can be moved between various servers (e.g. the NFS service); a sketch follows this list.
  - These servers could then be distributed between the 2 computer halls.
- Resilient LAN connections between the DRS building and both computer halls.
- Split of the telecom area.
- These proposals are under investigation; no decisions have been taken yet.
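As an illustration of moving an NFS workload to another server, the sketch below brings a SAN-attached filesystem up on a standby host and re-exports it. The volume group, mount point and export setup are assumptions for the example; client redirection (DNS alias or automount update) is site-specific and not shown.

```python
# Hypothetical sketch: take over an NFS export on a standby AIX server.
import subprocess

def run(cmd: str) -> None:
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)

run("varyonvg nfsdatavg")    # activate the assumed shared volume group
run("mount /export/home")    # mount point assumed to be defined in /etc/filesystems
run("exportfs -a")           # export the filesystems listed in /etc/exports
```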
18. In the future
- Static subsets of popular data could be distributed to other sites:
  - Already done for ERA-40 data.
- Consider an alternative WAN connection to the DRS building.
- ECMWF may want to investigate the ability to distribute a minimal subset of data geographically (e.g. as part of the DEISA project).
  - This may require that additional bandwidth is made available.
- Distribute DHS equipment across the computer halls.
- Consider extending or replacing the Disaster Recovery Building.
- An Integrated Disaster Recovery Action plan will be designed.