Title: cdfSync: Networked Synchronization of netCDF Datasets
1 cdfSync: Networked Synchronization of netCDF Datasets
- Joe Sirott
- L.C. Sun, Donald W. Denbo
- NODC/PMEL NOAA
- University of Washington
2 What is cdfSync?
- Synchronizes netCDF datasets over the Internet
- Only differences between datasets are transmitted
- Based on the rsync algorithm and program (Tridgell, 2003)
3 Applications
- Local mirroring of dynamic datasets for faster access
- Mobile applications where network access may be unreliable
4 Rsync Algorithm
- Client divides file into blocks
- Calculates a hash-based signature for each block
- Client sends signatures to server
- Server compares signatures from client and sends only data that isn't already on the client
5 Rsync Algorithm (client)
WH(B[0, S-1]), SH(B[0, S-1])
WH(B[S, 2S-1]), SH(B[S, 2S-1])
WH(B[2S, 3S-1]), SH(B[2S, 3S-1])
WH = weak rolling hash, SH = strong (MD4) hash, S = block size, B[i, j] = file bytes i through j
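The client-side signature step can be sketched as follows. This is a minimal illustration, not cdfSync's code: the weak hash follows the Adler-32-style checksum described for rsync, the function names are hypothetical, and MD5 stands in for MD4 because MD4 is often unavailable in current Python builds.

```python
import hashlib

M = 1 << 16  # both halves of the weak hash are kept modulo 2^16


def weak_hash(block: bytes) -> int:
    """Adler-32-style weak checksum of one block (WH on the slide)."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a + (b << 16)


def block_signatures(data: bytes, block_size: int):
    """Return one (weak, strong) signature pair per fixed-size block."""
    sigs = []
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        # Assumption: MD5 is used here only as a stand-in for the MD4 strong hash.
        sigs.append((weak_hash(block), hashlib.md5(block).hexdigest()))
    return sigs
```

The last block may be shorter than S; the sketch simply hashes whatever bytes remain.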
6 Rsync Algorithm (server)
WH(B[0, S-1]), WH(B[1, S]), WH(B[2, S+1]), ...
WH = weak rolling hash, SH = strong (MD4) hash, S = block size
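The server evaluates the weak hash at every byte offset, which is only affordable because the hash can be "rolled" from one window to the next in constant time. A minimal sketch, assuming the same Adler-32-style checksum described for rsync (all function names are hypothetical):

```python
M = 1 << 16  # both halves of the weak hash are kept modulo 2^16


def weak_parts(block: bytes):
    """Compute the two halves (a, b) of the weak hash from scratch."""
    n = len(block)
    a = sum(block) % M
    b = sum((n - i) * x for i, x in enumerate(block)) % M
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int):
    """Slide an n-byte window one byte to the right in O(1)."""
    a_new = (a - out_byte + in_byte) % M
    b_new = (b - n * out_byte + a_new) % M
    return a_new, b_new


def rolling_hashes(data: bytes, n: int):
    """Weak hash of every n-byte window: B[0, n-1], B[1, n], B[2, n+1], ..."""
    if len(data) < n:
        return []
    a, b = weak_parts(data[:n])
    hashes = [a + (b << 16)]
    for k in range(1, len(data) - n + 1):
        a, b = roll(a, b, data[k - 1], data[k + n - 1], n)
        hashes.append(a + (b << 16))
    return hashes
```

Only windows whose weak hash matches a client signature need the more expensive strong hash.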
7 Rsync Algorithm (server)
WH = weak rolling hash, SH = strong (MD4) hash, S = block size
8 cdfSync enhancements
- Take advantage of netCDF block structure
- Compress file metadata for efficient updates of large numbers of small files
- In-place updates for small updates to large files
9 cdfSync algorithm (server)
WH = weak rolling hash, SH = strong (MD4) hash, S(i) = size of block i
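One way to read the variable block size S(i) is that blocks follow the file's own layout rather than a fixed size. The sketch below illustrates that idea only; it is an assumption, not cdfSync's documented behavior. The block sizes are supplied by the caller (hypothetically derived from the netCDF header, variable, and record sizes), both copies are held in memory for simplicity, and MD5 stands in for MD4.

```python
import hashlib


def block_offsets(sizes):
    """Turn a list of block sizes S(i) into (offset, size) pairs."""
    offsets, pos = [], 0
    for size in sizes:
        offsets.append((pos, size))
        pos += size
    return offsets


def changed_blocks(local: bytes, remote: bytes, sizes):
    """Return the indices of blocks whose strong hashes differ.

    Assumption: both sides agree on the block layout, so hashes are
    compared only at known block boundaries, not at every byte offset.
    """
    changed = []
    for i, (off, size) in enumerate(block_offsets(sizes)):
        h_local = hashlib.md5(local[off:off + size]).digest()
        h_remote = hashlib.md5(remote[off:off + size]).digest()
        if h_local != h_remote:
            changed.append(i)
    return changed
```

In a real exchange only the signatures would cross the network; the two byte strings are passed together here to keep the example self-contained.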
10 cdfSync enhancements
- Take advantage of netCDF block structure
- Compress file metadata for efficient updates of large numbers of small files
- In-place updates for small updates to large files
11 Compressed File Metadata
- In-situ data frequently consists of large numbers (10^5 to 10^6) of small files
- Not many files change between updates (10^1 to 10^2)
- Transfer of file metadata (file name, modification date, etc.) dominates update time
- cdfSync compresses (gzip) this data
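The effect of compressing the file list can be illustrated with Python's standard gzip module. The tab-separated record format and the function names below are hypothetical and chosen only for this sketch; cdfSync's actual wire format is not described in these slides.

```python
import gzip


def encode_file_list(entries):
    """Serialize (name, size, mtime) records and gzip the result.

    Assumption: one tab-separated line per file; names contain no tabs
    or newlines.
    """
    lines = [f"{name}\t{size}\t{mtime}" for name, size, mtime in entries]
    return gzip.compress("\n".join(lines).encode("utf-8"))


def decode_file_list(blob: bytes):
    """Inverse of encode_file_list."""
    text = gzip.decompress(blob).decode("utf-8")
    if not text:
        return []
    entries = []
    for line in text.split("\n"):
        name, size, mtime = line.split("\t")
        entries.append((name, int(size), int(mtime)))
    return entries
```

File lists for in-situ collections tend to be highly repetitive (shared path prefixes, similar sizes and dates), which is the kind of input gzip compresses well.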
12 cdfSync enhancements
- Take advantage of netCDF block structure
- Compress file metadata for efficient updates of large numbers of small files
- In-place updates for small updates to large files
13 In-place Updates
- Rsync is inefficient for large netCDF datasets (> 1 GB) with small updates
- Writes all data (even local data) to a temporary file and then renames the temporary file
- Data write time can be much greater than network transmission time
14 In-place Updates (cont)
- cdfSync has an option that allows data updates to be written to the existing file
- If a data block hasn't moved, no data is written
- Much more efficient for datasets where data is appended on the netCDF record dimension
- Downside: the file is corrupted if the update is interrupted
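The "no data is written if the block hasn't moved" rule can be sketched as below. The operation format and the function name are hypothetical, and the sketch assumes the operations arrive in an order that never overwrites a region a later copy still needs to read.

```python
import io


def apply_in_place(f, ops):
    """Apply update operations directly to an open, seekable binary file.

    ops lists the target file's contents in order, as either
    ("copy", src_offset, length) for data already in the file, or
    ("data", payload) for new bytes received from the server.
    Returns the number of bytes actually written.
    """
    dst = 0
    written = 0
    for op in ops:
        if op[0] == "copy":
            _, src, length = op
            if src != dst:  # block moved: read it, then rewrite it at dst
                f.seek(src)
                chunk = f.read(length)
                f.seek(dst)
                f.write(chunk)
                written += length
            # block already in place: skip the write entirely
            dst += length
        else:
            payload = op[1]
            f.seek(dst)
            f.write(payload)
            written += len(payload)
            dst += len(payload)
    f.truncate(dst)  # drop any stale bytes past the new end of file
    return written


# Appending one record to an 8-byte "file" writes only the 4 new bytes.
demo = io.BytesIO(b"AAAABBBB")
demo_written = apply_in_place(demo, [("copy", 0, 4), ("copy", 4, 4), ("data", b"CCCC")])
```

With a real dataset the file would be opened in "r+b" mode; io.BytesIO is used here only to keep the example self-contained.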
15 In-place Updates (cont)
- Find and resolve cyclic dependencies
[Figure: Server and Client block layouts; block labels 1, 2, M]
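Cyclic dependencies arise when blocks trade places: each copy would overwrite data that another copy still has to read. The slides do not say how cdfSync resolves them, so the sketch below is one assumed approach: order the copies so that readers run before writers, and when a cycle is found, mark one block to be fetched as literal data instead of copied locally. All names are hypothetical.

```python
def overlaps(start_a, len_a, start_b, len_b):
    """True if byte ranges [start, start + len) intersect."""
    return start_a < start_b + len_b and start_b < start_a + len_a


def order_moves(moves):
    """Order in-file block copies so no copy reads clobbered data.

    moves: list of (src_offset, dst_offset, length).
    Returns (order, literal): the order in which to perform the moves,
    and the set of moves whose data must be fetched from the server
    because a cycle made a local copy impossible.
    """
    n = len(moves)
    waits_for = {i: set() for i in range(n)}
    for i, (_, dst_i, len_i) in enumerate(moves):
        for j, (src_j, _, len_j) in enumerate(moves):
            if i != j and overlaps(dst_i, len_i, src_j, len_j):
                waits_for[i].add(j)  # i would clobber j's source: j goes first
    order, literal, remaining = [], set(), set(range(n))
    while remaining:
        ready = sorted(i for i in remaining if not (waits_for[i] & remaining))
        if ready:
            order.extend(ready)
            remaining.difference_update(ready)
            continue
        # Nothing is ready, so a cycle exists: walk the blockers until one repeats.
        seen, cur = set(), min(remaining)
        while cur not in seen:
            seen.add(cur)
            cur = min(waits_for[cur] & remaining)
        literal.add(cur)  # cur no longer reads local data, so nothing waits for it
        for deps in waits_for.values():
            deps.discard(cur)
    return order, literal
```

For two blocks that swap positions, `order_moves([(0, 4, 4), (4, 0, 4)])` performs the second copy first and marks the first block as literal data. A move whose destination overlaps its own source is not treated as a dependency here, since the block can be read in full before it is written.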
16 In-place Updates (cont)
17 Results (use netCDF blocks)
- Synchronize identical 512 MB netCDF files; compare with synchronization of identical 512 MB non-netCDF files. Use in-place to measure only disk reads, not writes
- Rsync: 105 sec
- cdfSync: 72 sec
18 Results (compressed file list)
- 1.5 million identical netCDF files over a low-bandwidth (256 Kb/sec) link
- Rsync: 910 sec
- cdfSync: 206 sec
19 Results (in-place updates)
- 1.5 GB netCDF file with extra data appended to the record dimension
- Rsync: 434 sec
- cdfSync: 175 sec
20 Availability
- http://www.epic.noaa.gov
- Joe.Sirott_at_noaa.gov