Title: HDF5
1 HDF5
- A new file format and software for high-performance scientific data management
2 High-performance data requirements
- larger datasets (greater than a terabyte)
- bigger, faster machines and storage systems
- varied architectures and I/O paradigms
- parallel computing environments
- complex subsetting
- complex data
3 HDF5 is based on lessons learned from
- Existing standards: HDF, PDB, AIO, netCDF, MPI-IO and others
- Computer science
- ASCI physics applications and users
- Earth science applications and users
- Other users
4 and ASCI Requirements
- Compatibility with vector bundle model
- Collective access
- MPI-IO
- Transform data between memory and storage
- Parallel file systems: PIOFS, HPSS, etc.
5 Data model
- Datatypes (array elements)
- integer, float
- strings, pointers
- compound (record structures)
- Aggregate object types
- dataset: multidimensional array
- grouping structure
- each object has a name and attributes
6 Basic data object: array of records
[Figure: an array of records with dimensionality 5 x 3. Each record is made up of fields, each with its own number type: int8, int4, int16, float32.]
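The array-of-records idea can be illustrated without HDF5 itself. The sketch below is only an analogy built on Python's standard `struct` module, not the HDF5 API. The three-field record (int8, int16, float32), the packed little-endian layout, and the helper names `pack_array` and `read_record` are assumptions made for this example.

```python
import struct

# Hypothetical record: one int8, one int16 and one float32 field, packed
# little-endian with no padding ("<" disables native alignment).
RECORD = struct.Struct("<bhf")
ROWS, COLS = 5, 3  # dimensionality 5 x 3, as on the slide


def pack_array(records):
    """Pack a ROWS x COLS nested list of (a, b, c) tuples in row-major order."""
    return b"".join(RECORD.pack(*rec) for row in records for rec in row)


def read_record(buf, i, j):
    """Return the record stored at row i, column j of the packed buffer."""
    offset = (i * COLS + j) * RECORD.size
    return RECORD.unpack_from(buf, offset)


# Sample data: every element of the 5 x 3 array is a whole record.
records = [[(i, 100 * i + j, 0.5 * j) for j in range(COLS)] for i in range(ROWS)]
buf = pack_array(records)
```

Because every record has the same size, any element can be located from its indices alone, which is what makes a compound datatype usable as an array element type.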
7 Storage Capacity
Capability | HDF4 | HDF5
Store large objects | Limit: 2 gigabytes | no limit
Store large numbers of objects | Limit: 20,000 objects | no limit
8 Dataset components
- a multidimensional array of data elements
- header with metadata: datatype, dataspace, attributes, storage info
[Figure: dataset "Fred", consisting of a metadata header and data. The header holds the datatype (int16), the dataspace (rank 2, with its dimensions), attributes, and storage info (chunked, compressed).]
9 Groups
- Group structure for organizing the file
- Every file starts with a root group
- Like directories in a file system
- Groups have attributes
[Figure: example hierarchy with the paths /, /foo and /foo/bar.]
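The directory-like naming can be sketched with a small tree of named nodes. This is a toy model in plain Python, not the HDF5 library: the `Group` class, its `create_group` and `lookup` methods, and the `units` attribute are hypothetical names chosen for illustration.

```python
class Group:
    """Toy stand-in for a group: named children plus attributes."""

    def __init__(self):
        self.children = {}
        self.attrs = {}

    def create_group(self, path):
        """Create every missing group along a slash-separated path."""
        node = self
        for name in path.strip("/").split("/"):
            if name:
                node = node.children.setdefault(name, Group())
        return node

    def lookup(self, path):
        """Resolve a path such as '/foo/bar'; raises KeyError if absent."""
        node = self
        for name in path.strip("/").split("/"):
            if name:
                node = node.children[name]
        return node


root = Group()                            # every file starts with a root group "/"
bar = root.create_group("/foo/bar")       # creates /foo, then /foo/bar
root.lookup("/foo").attrs["units"] = "m"  # groups can carry attributes
```

Resolving a path one component at a time is the same walk a file system performs on directories.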
10 Special Storage Options
- chunked: improves subsetting access time
- compressed: improves storage efficiency
- extendable: arrays can be extended individually
- split file: metadata in one file, raw data in another
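A rough way to see why chunking can help subsetting is to count how many chunks a read has to touch. The sketch below is plain Python arithmetic, not HDF5 code. The function name `chunks_touched`, the 1000 x 1000 array size and both chunk shapes are assumptions chosen for illustration.

```python
from itertools import product


def chunks_touched(start, count, chunk_shape):
    """Return the chunk grid coordinates overlapped by a rectangular subset.

    start, count: offset and extent of the subset along each dimension.
    chunk_shape:  extent of one chunk along each dimension.
    """
    ranges = []
    for s, n, c in zip(start, count, chunk_shape):
        first = s // c            # chunk holding the first selected element
        last = (s + n - 1) // c   # chunk holding the last selected element
        ranges.append(range(first, last + 1))
    return list(product(*ranges))


# Reading one column (column 7) of a 1000 x 1000 array:
col_rowchunks = chunks_touched((0, 7), (1000, 1), (1, 1000))  # one row per chunk
col_tiles = chunks_touched((0, 7), (1000, 1), (100, 100))     # 100 x 100 tiles
```

With one row per chunk the column read touches all 1000 chunks; with 100 x 100 tiles it touches 10. Which chunk shape is best depends on the access pattern, so this only shows the kind of trade-off involved.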
11 The HDF5 Library
- New API and programming model
- Smaller, better, faster
- Able to support parallel I/O better
- OO compatible
- C and Fortran still primary; others considered
- I/O performance emphasized
- Current platforms
- ASCI: IBM SP2, SGI Origin 2000, Intel Teraflop
- Solaris, Linux, HP-UX, IRIX, NT
12 Sub-selection Options
- Flexibility in mappings between data in memory and object in file
- Selection regions can be points, hyperslabs, or unions of hyperslabs
- Selection region in memory can be a different shape from the selection in file
- Supports I/O needs for parallel computation
13 Mappings between file dataspaces/selections and memory dataspaces/selections
(a) A hyperslab from a 2D array to the corner of a smaller 2D array.
(b) A regular series of blocks from a 2D array to a contiguous sequence at a certain offset in a 1D array.
(c) A sequence of points from a 2D array to a sequence of points in a 3D array.
(d) A union of hyperslabs in file to a union of hyperslabs in memory. The number of elements must be equal.
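Case (b) can be worked through in plain Python. The sketch below is not the HDF5 API: it describes a regular hyperslab by start, stride, count and block per dimension, lists the selected coordinates, and copies them pairwise into a differently shaped memory selection. The helper names `hyperslab` and `transfer`, the 8 x 8 array, and the row-major pairing order are assumptions of this example.

```python
from itertools import product


def hyperslab(start, stride, count, block):
    """List the coordinates selected by a regular hyperslab, in row-major order.

    Along each dimension, `count` blocks of `block` elements are taken,
    with block origins `stride` elements apart, beginning at `start`.
    """
    per_dim = []
    for s, st, c, b in zip(start, stride, count, block):
        per_dim.append([s + i * st + k for i in range(c) for k in range(b)])
    return list(product(*per_dim))


def transfer(src, src_sel, dst, dst_sel):
    """Copy selected elements of a 2D source into a 1D destination, in order."""
    if len(src_sel) != len(dst_sel):
        raise ValueError("selections must contain the same number of elements")
    for (si, sj), (di,) in zip(src_sel, dst_sel):
        dst[di] = src[si][sj]


# A regular series of 2 x 2 blocks from an 8 x 8 "file" array ...
file_data = [[10 * r + c for c in range(8)] for r in range(8)]
file_sel = hyperslab(start=(0, 1), stride=(4, 4), count=(2, 2), block=(2, 2))
# ... to a contiguous run of 16 elements at offset 3 in a 1D "memory" array.
memory = [None] * 20
mem_sel = hyperslab(start=(3,), stride=(1,), count=(16,), block=(1,))
transfer(file_data, file_sel, memory, mem_sel)
```

The two selections have different shapes but the same number of elements (16), which is the only condition the copy checks.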
14 HDF5 Raw Data Pipeline
- Handles all aspects of data storage and transfer of data between file and application
- Deals with multiple storage options: chunking, compression, number conversion, ...
- Optimized performance for common usage
- Hooks for new filters: compression schemes, encryption, checksum, ..., and user-specified filters
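The idea of a pipeline with hooks for filters can be sketched as a list of objects that each transform a chunk of bytes. This is a minimal illustration using Python's standard `zlib` module, not the HDF5 filter interface. The class names `Deflate` and `Crc32` and the functions `write_chunk` and `read_chunk` are hypothetical.

```python
import zlib


class Deflate:
    """Compression filter built on zlib from the standard library."""

    def encode(self, data):
        return zlib.compress(data)

    def decode(self, data):
        return zlib.decompress(data)


class Crc32:
    """Checksum filter: appends a 4-byte CRC-32 and verifies it on read."""

    def encode(self, data):
        return data + zlib.crc32(data).to_bytes(4, "big")

    def decode(self, data):
        payload, stored = data[:-4], int.from_bytes(data[-4:], "big")
        if zlib.crc32(payload) != stored:
            raise ValueError("checksum mismatch")
        return payload


def write_chunk(raw, filters):
    """Apply the filters in order on the way to storage."""
    for f in filters:
        raw = f.encode(raw)
    return raw


def read_chunk(stored, filters):
    """Undo the filters in reverse order on the way back to the application."""
    for f in reversed(filters):
        stored = f.decode(stored)
    return stored


pipeline = [Deflate(), Crc32()]   # a user-specified filter would be one more entry
chunk = bytes(1000)               # highly compressible sample data (all zeros)
stored = write_chunk(chunk, pipeline)
```

Any object with matching `encode` and `decode` methods can be added to the list, which is the sense in which such a pipeline offers hooks for new filters.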
15 Performance tuning
- Facilities for performance measurement
- timing tests in test suite
- Pablo instrumentation
- Caching
- app can set cache size for metadata and chunks
- Parallel optimizations
- efficient metadata management
- chunking
- can control placement on physical media
16 HDF5 and ASCI Applications
- Multi-lab collaboration
- DOE Tri-lab: Livermore, Sandia, Los Alamos
- NCSA, Limit Point Systems
- Motivation
- Data sharability
- Application interoperability
- Leverage experiences
- EXODUS (SNL), SILO PDB (LLNL)
- HDF (NCSA), netCDF (UCAR)
17 ASCI DMF Data Abstraction
- Objectives
- Sound data model with robust data abstractions
- Computational mechanics data: meshes, fields
- Based on mathematical field of fiber bundles
- Common format allows common tools and sharing
- Common API shields apps from model complexities
[Figure: software layers, top to bottom]
APPLICATION
Mesh APIs (SNL/LANL)
Fiber Bundle Kernel (LLNL)
Data Structure Layer (LLNL)
HDF5 (NCSA)
MPI-IO (ANL)
18 HDF5 driver projects
Project | Application | Types of data
ASCI | Computational mechanics | Fields on meshes: structured, unstructured, hierarchical
CANIS UIUC Digital Library Project | Concept space analysis of medical abstracts | Object store for large collection of small objects (noun phrases)
TRAPPIST non-destructive testing consortium | Non-destructive testing, tomography and radiology | NDT experiment data
NASA Earth Observing System | Earth Science data management | Remote sensing: swath, grid and point data