Title: Data Handling System at ECMWF
1. Data Handling System at ECMWF
- Francis Dequenne, Stephen Richards
- HUF 2007
- francis.dequenne@ecmwf.int
- Stephen.Richards@ecmwf.int
2. ECMWF Computer Environment
- 2 × 155 IBM p5-575 servers: 4.5 TFlops sustained, 9 TB of memory, 100 TB of disk.
- To be replaced by two considerably more powerful clusters.
- IBM 3584 tape library.
3. Data Handling Applications
- MARS
  - Meteorological Archive and Retrieval System.
  - Bulk of the data, few files.
  - Interfaced through an ECMWF application.
  - Depends heavily on tape get-partials.
- ECFS
  - HSM-like service for ad-hoc files.
  - Millions of files, many very small.
- Both services use HPSS as their underlying archival system.
4. Volume of data stored
- Currently 6 PB of data (plus 2 PB for the second copy).
- Expected to reach 100 PB around 2014.
- NB: These values do not include the second backup copy of our most critical data.
5. HPSS data growth

Typical daily write workload:
  MARS data    5.0 TB/day
  ECFS data    3.2 TB/day
  2nd copy     3.4 TB/day
  Total HPSS  11.6 TB/day

Typical daily read workload:
  MARS (HPSS)  1.0 TB/day  (MARS total 3.5 TB/day)
  ECFS         1.2 TB/day
  Total HPSS   2.2 TB/day
6. Number of files stored

[Chart: growth in the number of files stored — series: ECFS files, ECFS files smaller than ½ MB, MARS files]
7. DHS Services: MARS
- Over 4.5 petabytes of data (plus 1.6 PB of backups).
- 4.5 million files.
- Between 4 and 5.5 TB stored every day.
- Data indexed by an in-house application, providing a powerful virtualisation engine.
- Requires many tape drives able to load and position tapes quickly.
- Medium to long term archive.
- Comprises:
  - MARS Operational (40% of the data, WORM, backed up).
  - MARS Research (60%, WORS, no backup).
- HPSS provides:
  - Good support for partial retrieves from tape.
  - Good metadata query tools.
  - Good scaling.
8. MARS

[Diagram: MARS clients issue requests such as "Get pressures and temperatures over the Atlantic on July 17th, 1989". MARS resolves them against its metadata and a disk cache, and retrieves large files (forecast results, observations, ...) from HPSS through the HPSS API. One user request may map onto multiple file parts and require access to hundreds of tapes.]
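Since one request may touch hundreds of tapes, the retrieval layer benefits from grouping the requested file parts by tape and ordering reads by position, so each tape is mounted once and read in a forward sweep. A minimal sketch of that idea (the tuple layout and field names are invented for illustration; the real MARS index is far richer):

```python
from collections import defaultdict

def plan_tape_reads(field_locations):
    """Group requested field parts by tape so each tape is mounted once.

    field_locations: iterable of (field_id, tape_id, offset) tuples -- a
    hypothetical index format standing in for MARS's real metadata.
    Returns {tape_id: [(offset, field_id), ...]} with offsets sorted so
    the drive can stream forward through each tape.
    """
    plan = defaultdict(list)
    for field_id, tape_id, offset in field_locations:
        plan[tape_id].append((offset, field_id))
    for reads in plan.values():
        reads.sort()
    return dict(plan)

# One request touching three tapes: each tape is visited once,
# and reads within a tape are ordered by position on the medium.
request = [
    ("t850_19890717", "T0042", 120),
    ("p_msl_19890717", "T0042", 40),
    ("t500_19890717", "T0107", 10),
    ("q700_19890717", "T0350", 75),
]
plan = plan_tape_reads(request)
```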
9. DHS Services: ECFS
- 1.3 petabytes of data (plus a 350 TB backup copy).
- 35 million files.
- 2.5 TB of data added daily (with peaks of more than 150 GB/hour).
- Volatile data.
- On average 35,000 retrieves/day; peaks > 70,000 have been observed.
- A lot of small files:
  - 10.5 million files < 512 KB.
  - These represent 30-50% of the ECFS retrieval activity.
- HPSS provides:
  - Ability to support tens of millions of files.
  - Customisation allowing some types of files to stay cached for long periods of time.
  - Good levels of performance.
10. ECFS User View

Users have a logical view of a few remote virtual file systems (e.g. ec:/syf, ec:/rdx), accessed through rcp-like local commands (ecp, els, ecd, ...):

  ecd /syf/dir1
  ecp local ec:remote
  els /rdx/dir2
  ecp ec:/rdx/dir2/remote local2
11. ECFS implementation
- Behind the scenes, one virtual file system is mapped onto several HPSS filesets.
- The back-end archiving-system interfaces are hidden from the clients.

[Diagram: the ECFS client asks the ECFS server "Where should this file be?"; the server selects one of several HPSS filesets, and the ECFS mover gets/puts the file there.]
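The "where should this file be?" decision has to be deterministic so that stores and later retrieves agree on the backing fileset. A minimal sketch of one common approach, hashing the logical path across the filesets (the fileset names are invented; ECMWF's actual placement logic is not described here):

```python
import hashlib

# Hypothetical fileset names -- stand-ins for the real HPSS filesets.
FILESETS = ["ecfs.fs01", "ecfs.fs02", "ecfs.fs03", "ecfs.fs04"]

def fileset_for(logical_path, filesets=FILESETS):
    """Map a logical ECFS path to one of the backing HPSS filesets.

    Hashing the path gives a stable, evenly spread assignment: the same
    path always lands on the same fileset, so no lookup table is needed.
    """
    digest = hashlib.md5(logical_path.encode("utf-8")).hexdigest()
    return filesets[int(digest, 16) % len(filesets)]
```

Hash-based placement keeps the server stateless, at the cost of making it awkward to rebalance when filesets are added; a real implementation would more likely record the chosen fileset in its own metadata.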
12. DHS access

[Diagram: clients access the DHS over a SAN. Disk: 123 TB high-performance, 80 TB high-capacity, 30 TB SATA. Servers: HPSS core server (p570, 4 CPUs); HPSS movers (6M2, 8 CPUs and 6M2, 6 CPUs); ECFS server (6M2, 6 CPUs, also used as a mover).]
13. What is new since the last HUF?
- Version 6.2 of the code was installed.
  - Bye bye, DCE! (eeeeehhhhhhhaaaaaa)
  - A few issues encountered initially; now pretty stable.
- A new IBM TS3500 library has been installed, with LTO-3 drives: CONAN the Librarian.
  - Replaces our old ADIC robot.
  - Used for writing secondary tape copies.
  - A few problems discovered with the PVR, especially check-in processing.
  - Discovered some issues in moving shelved tapes from one robot to another.
- Usage of 3592-XL tapes on TS1120 drives (aka 3592-E05): 700 GB/tape.
  - Some problems with the latest versions of microcode resulted in a few files becoming unreadable.
14. Issues to be addressed real soon now
- Poor performance when writing small files to tape. We are eager to see the new mechanisms which do not require tape marks between each file, expected in version 7.1.
- Need to stop the PVL/PVR/movers to add or modify devices/drives.
- Some administrative functions still require the GUI.
  - Adding 40 disks or modifying the configuration of 30 drives through the GUI is:
    - Error-prone,
    - NOT fun.
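The small-file cost comes from per-file overhead at the tape layer. HPSS 7.1's announced fix works inside HPSS itself, but the same effect can be approximated today on the client side by aggregating many small files into one container before archiving, so the tape layer sees a single large write. A hedged sketch of that workaround using a plain tar container (this is not how HPSS 7.1 implements it):

```python
import io
import tarfile

def build_container(small_files):
    """Pack many small files into one tar stream so the archive layer
    writes a single large object (one tape mark) instead of one per file.

    small_files: {name: bytes}.  Returns the container as bytes, ready to
    be stored as a single bitfile.
    """
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, data in sorted(small_files.items()):
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

The trade-off is that retrieving one member now means fetching (at least part of) the container, so aggregation suits files that are written and recalled together.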
15. Issue: DB2 maintenance tools
- Suggested practice is that DB2 runstats should be run on a regular basis.
  - This updates table access statistics, which are used to optimise table access.
  - It cannot be done on a busy system: errors are reported and the statistics are incorrect.
- One should also reorganise tables from time to time.
  - Running reorgs needs to be done offline.
  - At ECMWF this requires an HPSS downtime of 5 to 10 hours.
- We need to find an efficient, non-disruptive, reliable way to do these operations.
16. More issues
- In a tape-to-tape hierarchy, there is no automatic attempt to access the secondary tape copy if the first copy is unavailable.
- Migration optimisation in disk-to-tape hierarchies:
  - Especially in a family-rich environment.
  - Migration does not seem to handle all files from one family at the same time.
  - This seems to result in tapes being dismounted/remounted unnecessarily.
  - Some tapes are mounted 40 times a day (over 100 times a day in thrashing conditions) for migration purposes.
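The remount problem above is what batching would avoid: if the migration pass collected every pending file of a family before mounting that family's tape, the tape would be mounted once per batch rather than once per file. A minimal sketch of that batching policy (the family names and threshold are invented for illustration; HPSS's actual migration policy engine works differently):

```python
from collections import defaultdict

def batch_by_family(pending, threshold=3):
    """Group files awaiting migration by file family, releasing a
    family's batch only once it is worth mounting a tape for.

    pending: list of (family, filename) pairs.
    Returns {family: [filenames...]} for families with at least
    `threshold` files queued; smaller families keep accumulating.
    """
    by_family = defaultdict(list)
    for family, name in pending:
        by_family[family].append(name)
    return {fam: names for fam, names in by_family.items()
            if len(names) >= threshold}
```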
17. HPSS needs a checksum facility
- The data on your disk subsystem is probably OK, but...
- Errors undetected by hardware are much more common than expected.
  - Some studies have shown that one undetected error appears per 100 TB moved.
  - A technology migration of 10 PB implies 100 errors could be generated!
- We need HPSS to support optional bitfile check-summing:
  - Checksum generated either by the user application or by clients when the data is first introduced into the system.
  - Checksum validated every time the bitfile is moved.
  - Checksums can be used to validate disk contents.
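The requested workflow, compute once at ingest, verify on every subsequent move, can be sketched in a few lines. This is only an illustration of the idea (function names are invented; HPSS had no such facility at the time):

```python
import hashlib

def ingest_checksum(data):
    """Compute the checksum recorded alongside the bitfile at ingest."""
    return hashlib.sha256(data).hexdigest()

def verified_copy(data, expected):
    """Re-validate on every move: refuse to propagate a corrupted bitfile.

    Raises IOError on mismatch so the caller can retry from the other
    copy instead of silently writing bad data to the next level.
    """
    if hashlib.sha256(data).hexdigest() != expected:
        raise IOError("bitfile checksum mismatch: corrupted in transit")
    return data
```

At one undetected error per 100 TB moved, an end-to-end check like this is what turns a silent corruption during a 10 PB technology migration into a detectable, retryable failure.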
18. Other things that we would like to see
- Ability to balance the allocation of scratch tapes across multiple tape robots.
  - Could become a significant issue if we have to start using smaller robots.
- Ability for movers to claim tape-drive ownership as needed.
  - E.g. if a mover shares its platform with an application which writes data straight to tape.
- Undelete facility.
- New methods of indexing the bitfiles (Content Addressed Storage?)
  - Think Google, Longhorn, ...
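The scratch-balancing wish is a simple allocation policy: always draw the next scratch tape from the library with the largest remaining pool, so no single robot is drained first. A toy sketch (robot names and counts are invented):

```python
def pick_robot(scratch_counts):
    """Choose the library with the most scratch volumes remaining.

    Ties break alphabetically so the choice is deterministic.
    scratch_counts: {robot_name: free_scratch_tapes}.
    """
    return max(sorted(scratch_counts), key=lambda r: scratch_counts[r])

def allocate(scratch_counts):
    """Pick a robot, decrement its pool, and return the robot chosen."""
    robot = pick_robot(scratch_counts)
    scratch_counts[robot] -= 1
    return robot
```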
19. Future developments
- Test and deploy Copan MAIDs??
- SAN3P deployment.
- AIX 5.3:
  - Keep being supported.
  - Support for LUNs larger than 1 TB.
  - 64-bit applications.
- STK silos replacement:
  - The silos will not be supported after the end of 2010.
  - What to replace them with?
20. show_hpss_tapevols
- Lists the HPSS status of volumes for a given library.
- Can be used, for example, to find out which tapes are shelved.

Usage: show_hpss_tapevols -m mode -p pvr_id -h

mode can be one of:
  mount_pending    - volume is waiting to be mounted
  mounted          - volume is mounted
  dismount_pending - volume is waiting for a dismount to complete
  dismounted       - volume is dismounted
  shelf            - volume is shelved
  shelf_pending    - volume is waiting to be checked out
  checkin          - volume has been checked in
  checkin_pending  - volume is waiting to be checked in
  displayed        - volume is in the checkin or mount display window
  move_pending     - volume is waiting to be checked into a new library
  eject_pending    - volume is waiting to be ejected

pvr_id is the HPSS server name for a PVR.
Example of output: show_hpss_tapevols -p CONAN

Cartridge  Status      PVR
X00189     dismounted  CONAN
X00190     dismounted  CONAN
X00191     dismounted  CONAN
X00192     shelf       CONAN
X00193     shelf       CONAN
X00194     shelf       CONAN
X00195     shelf       CONAN
X00196     dismounted  CONAN
X00197     dismounted  CONAN
X00198     dismounted  CONAN
X00199     dismounted  CONAN
21. tape_recover
- A utility able to read bitfiles from tapes without using DB2 or HPSS.
- Could be used for:
  - Last-chance recovery scenarios, e.g. DB2 cannot be restored (hopefully not needed for many, many years!).
  - Recovery of files accidentally deleted from HPSS.
  - Migration to another archiving system.
- Requires a list of files and their tape segments to be generated.
  - At ECMWF this is done every week, piggy-backing on the calls to list_file_subsys done to generate user file lists.
- Requires tapes to be mounted manually, e.g. through acsls, mtlib, ...
- Developed for non-striped tapes.
- WORK IN PROGRESS! A working prototype is currently being tested.