Title: Update on HDF5 1.8
1. Update on HDF5 1.8
- The HDF Group
- HDF and HDF-EOS Workshop X
- November 28, 2006
2. Why HDF5 1.8?
3. "As we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say, we know there are some things we do not know. But there are also unknown unknowns -- the ones we don't know we don't know." -- Donald Rumsfeld
4. Some things we knew we knew
- Need high-level APIs: image, etc.
- Need more datatypes: packed, n-bit, etc.
- Need external and other links
- Tools needed: h5pack, etc.
- Caching embellishments
- Eventually, multithreading
5. Things we knew we did not know
- New requirements from EOS and ASCI
- New applications that would use HDF5
- How HDF5 would really perform in parallel
- What new tools, features and options needed
- New APIs, API features
6. Things we didn't know we didn't know
- Completely unanticipated applications
- New data types and structures
- E.g. DNA sequences
- New operations
- E.g. write many real-time streams simultaneously
7. HDF5 1.8 topics
- Dataset and datatype improvements
- Group improvements
- Link Revisions
- Shared object header messages
- Metadata cache improvements
- Other improvements
- Platform-specific changes
- High level APIs
- Parallel HDF5
- Tool improvements
8. Dataset and Datatype Improvements
9. Text-based datatype descriptions
- Why
- Simplify datatype creation
- Make datatype creation code more readable
- Facilitate debugging by printing the text description of a datatype
- What
- New routines to create a datatype from its text description and to convert a datatype back to text: H5LTtext_to_dtype, H5LTdtype_to_text
10. Text datatype description example
- Create a compound datatype from its text description, then convert it back to text (shown here with the DDL text form and names from the released high-level library):

  /* Create the datatype from a text description */
  dtype = H5LTtext_to_dtype(
          "H5T_COMPOUND { H5T_NATIVE_INT \"a\"; H5T_NATIVE_FLOAT \"b\"; }",
          H5LT_DDL);

  /* Convert the datatype back to text (NULL buffer returns size in tsize) */
  H5LTdtype_to_text(dtype, NULL, H5LT_DDL, &tsize);
11. Serialized datatypes and dataspaces
- Why
- Allow datatype and dataspace info to be transmitted between processes
- Allow datatypes/dataspaces to be stored in non-HDF5 files
- What
- A new set of routines to serialize/deserialize HDF5 datatypes and dataspaces (sketch below)
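- A minimal sketch of the encode/decode pattern, assuming the names in released HDF5 1.8 (H5Sencode/H5Sdecode work the same way for dataspaces):

  size_t nalloc = 0;
  unsigned char *buf;

  H5Tencode(dtype, NULL, &nalloc);  /* first call returns the required size */
  buf = malloc(nalloc);
  H5Tencode(dtype, buf, &nalloc);   /* serialize the datatype into buf */

  /* ... send buf to another process or store it in a non-HDF5 file ... */

  hid_t dtype2 = H5Tdecode(buf);    /* reconstruct an equivalent datatype */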
12. Int-to-float conversion during I/O
- Why: applications need their integer data converted to floating point during I/O
- What: int-to-float datatype conversion now supported during I/O
13. Revised conversion exception handling
- Why: give applications greater control over exceptions (range errors, etc.) during datatype conversion
- What: revised conversion exception handling
14. Revised conversion exception handling
- To handle exceptions during conversions, register a handling function through H5Pset_type_conv_cb() (see the sketch below)
- Exception cases
- H5T_CONV_EXCEPT_RANGE_HI
- H5T_CONV_EXCEPT_RANGE_LOW
- H5T_CONV_EXCEPT_TRUNCATE
- H5T_CONV_EXCEPT_PRECISION
- H5T_CONV_EXCEPT_PINF
- H5T_CONV_EXCEPT_NINF
- H5T_CONV_EXCEPT_NAN
- Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED, H5T_CONV_HANDLED
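- A minimal sketch of a conversion callback, assuming the released 1.8 signature; my_handler is an illustrative name, and the example assumes an int destination type:

  #include <limits.h>

  H5T_conv_ret_t my_handler(H5T_conv_except_t except_type,
                            hid_t src_id, hid_t dst_id,
                            void *src_buf, void *dst_buf, void *op_data)
  {
      if (except_type == H5T_CONV_EXCEPT_RANGE_HI) {
          *(int *)dst_buf = INT_MAX;  /* clamp instead of failing */
          return H5T_CONV_HANDLED;
      }
      return H5T_CONV_UNHANDLED;      /* let the library do its default */
  }

  /* Register the callback on a dataset transfer property list */
  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_type_conv_cb(dxpl, my_handler, NULL);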
15. Compression filter for n-bit data
- Why
- Compact storage for user-defined datatypes
- What
- When data is stored on disk, padding bits are chopped off and only significant bits are stored
- Supports most datatypes
- Works with compound datatypes
16. N-bit compression example
- In memory, one value of an n-bit datatype is stored like this:

  byte 3   byte 2   byte 1   byte 0
  ???????? ????SPPP PPPPPPPP PPPP????
  S - sign bit, P - significant bit, ? - padding bit

- After passing through the n-bit filter, all padding bits are chopped off and the bits are stored on disk like this:

  1st value         2nd value
  SPPPPPPP PPPPPPPP SPPPPPPP PPPPPPPP ...

- The opposite happens (decompression) when going from disk to memory
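- A minimal sketch of setting up the filter for the layout above (16 significant bits starting at bit offset 4), assuming the released 1.8 names:

  hsize_t chunk_dims[1] = {1024};
  hid_t dtype = H5Tcopy(H5T_NATIVE_INT);
  H5Tset_precision(dtype, 16);        /* sign bit + 15 significant bits */
  H5Tset_offset(dtype, 4);            /* significant bits start at bit 4 */

  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 1, chunk_dims);  /* filters require chunked layout */
  H5Pset_nbit(dcpl);                  /* enable the n-bit filter */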
17. Scale-offset storage filter
- Why: use less storage when less precision is needed
- What
- Performs a scale/offset operation on each value
- Truncates the result to fewer bits before storing
- Currently supports integers and floats
- Example

  H5Pset_scaleoffset(dcr, H5Z_SO_INT, H5Z_SO_INT_MINBITS_DEFAULT);
  H5Dcreate(..., dcr);
  H5Dwrite(...);
18. Example with floating-point type
- Data: 104.561, 99.459, 100.545, 105.644
- Choose a scaling factor: the number of decimal digits to keep. E.g., scale factor D = 2
- 1. Find the minimum value (offset): 99.459
- 2. Subtract the minimum value from each element
  Result: 5.102, 0, 1.086, 6.185
- 3. Scale the data by multiplying by 10^D = 100
  Result: 510.2, 0, 108.6, 618.5
- 4. Round the data to integers
  Result: 510, 0, 109, 619
- 5. Pack and store using the minimum number of bits (code sketch below)
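- A minimal sketch matching this worked example (D-scaling, scale factor 2), assuming the released 1.8 names:

  hsize_t chunk_dims[1] = {1024};
  hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
  H5Pset_chunk(dcpl, 1, chunk_dims);                 /* filter needs chunking */
  H5Pset_scaleoffset(dcpl, H5Z_SO_FLOAT_DSCALE, 2);  /* keep 2 decimal digits */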
19. NULL dataspace
- Why
- Allow datasets with no elements to be described
- netCDF 4 needed a placeholder for attributes
- What
- A dataset with no dimensions and no data (sketch below)
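- A minimal sketch; "empty" is an illustrative name, and the 1.6-style H5Dcreate call matches the other examples in these slides:

  hid_t space = H5Screate(H5S_NULL);  /* dataspace with no elements */
  hid_t dset  = H5Dcreate(file, "empty", H5T_NATIVE_INT, space, H5P_DEFAULT);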
20. Group improvements
21. Access links by creation-time order
- Why
- Allow iteration/lookup of a group's links (children) by creation order as well as by name order
- Support the netCDF access model for netCDF 4
- What: option to access objects in a group according to relative creation time (sketch below)
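- A minimal sketch, assuming the released 1.8 names; my_callback is an illustrative H5L_iterate_t function:

  /* Track and index creation order on a new group */
  hid_t gcpl = H5Pcreate(H5P_GROUP_CREATE);
  H5Pset_link_creation_order(gcpl,
          H5P_CRT_ORDER_TRACKED | H5P_CRT_ORDER_INDEXED);
  /* ... create a group with this gcpl and add links ... */

  /* Visit the group's links in creation order */
  H5Literate(grp, H5_INDEX_CRT_ORDER, H5_ITER_INC, NULL, my_callback, NULL);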
22. Compact groups
- Why
- Save space and access time for small groups
- If groups are small, B-tree overhead isn't needed
- What
- Alternate storage for groups with few links
- Example
- File with 11,600 groups
- With the original group structure, file size is 20 MB
- With compact groups, file size is 12 MB
- Total savings: 8 MB (40%)
- Average savings per group: 700 bytes
23. Better large-group storage
- Why: faster, more scalable storage and access for large groups
- What: new format and method for storing groups with many links
24. Intermediate group creation
- Why
- Simplify creation of a series of connected groups
- Avoid having to create each intermediate group separately, one by one
- What
- Intermediate groups can be created when creating an object in a file, with one function call
25. Example: add intermediate groups
- Want to create /A/B/C/dset1
- A exists, but B, C, and dset1 do not

  H5Dcreate(file_id, "/A/B/C/dset1", ...);

- One call creates groups B and C, then creates dset1 (see the sketch below)
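- In released 1.8 this behavior is switched on through a link creation property list; a minimal sketch:

  hid_t lcpl = H5Pcreate(H5P_LINK_CREATE);
  H5Pset_create_intermediate_group(lcpl, 1);  /* create B and C on the fly */
  hid_t dset = H5Dcreate2(file_id, "/A/B/C/dset1", H5T_NATIVE_INT, space,
                          lcpl, H5P_DEFAULT, H5P_DEFAULT);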
26. Link Revisions
27. What are links?
- Links connect groups to their members
- Hard links point to a target by address
- Soft links store the path to a target
[Diagram: a root group pointing to a dataset through a hard link and through a soft link]
28. New external links
- Why: access objects in another file by file name and path within that file
- What
- Store the location of a file and a path within that file
- Can link across files (sketch below)
[Diagram: the root group of file1.h5 holds an external link to a dataset in file2.h5]
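- A minimal sketch with the released 1.8 call, matching the diagram:

  /* In file1.h5, create "ext_link" pointing to /dataset inside file2.h5 */
  H5Lcreate_external("file2.h5", "/dataset", file1_id, "ext_link",
                     H5P_DEFAULT, H5P_DEFAULT);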
29. New user-defined links
- Why
- Allow applications to create their own kinds of links and link operations, such as:
- Create a hard external link that finds an object by address
- Create a link that accesses a URL
- Keep track of how often a link is accessed, or other behavior
- What
- Applications can create new kinds of links by supplying custom callback functions (registration sketch below)
- Can do anything HDF5 hard, soft, or external links do
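- A minimal sketch of registering a link class, assuming the released 1.8 H5L interface; my_traverse is an illustrative callback that treats the link's stored data as a path:

  static hid_t my_traverse(const char *link_name, hid_t cur_group,
                           const void *udata, size_t udata_size, hid_t lapl_id)
  {
      return H5Oopen(cur_group, (const char *)udata, lapl_id);
  }

  const H5L_class_t my_link_class = {
      H5L_LINK_CLASS_T_VERS,        /* interface version           */
      (H5L_type_t)H5L_TYPE_UD_MIN,  /* link type id for this class */
      "illustrative link",          /* comment                     */
      NULL, NULL, NULL,             /* create / move / copy        */
      my_traverse,                  /* traversal callback          */
      NULL, NULL                    /* delete / query              */
  };
  H5Lregister(&my_link_class);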
30. Shared Object Header Messages
31. Shared object header messages
- Why: metadata is duplicated many times, wasting space
- Example
- You create a file with 10,000 datasets
- All use the same datatype and dataspace
- HDF5 needs to write this information 10,000 times!
32. Shared object header messages
- What
- Enable messages to be shared automatically
- HDF5 shares duplicated messages on its own!
[Diagram: Dataset 1 and Dataset 2 point to a single shared datatype and dataspace while keeping separate data]
33. Shared messages
- Happens automatically once enabled (sketch below)
- Works with datatypes, dataspaces, attributes, fill values, and filter pipelines
- Saves space if these objects are relatively large
- May be faster if HDF5 can cache shared messages
- Drawbacks
- Usually slower than non-shared messages
- Adds overhead to the file
- Index for storing shared datatypes
- 25 bytes per instance
- Older library versions can't read files with shared messages
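- A minimal sketch of enabling sharing at file creation, assuming the released 1.8 names:

  hid_t fcpl = H5Pcreate(H5P_FILE_CREATE);
  H5Pset_shared_mesg_nindexes(fcpl, 1);       /* one shared-message index */
  H5Pset_shared_mesg_index(fcpl, 0,
          H5O_SHMESG_DTYPE_FLAG | H5O_SHMESG_SDSPACE_FLAG,
          40);                                /* share messages >= 40 bytes */
  hid_t file = H5Fcreate("shared.h5", H5F_ACC_TRUNC, fcpl, H5P_DEFAULT);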
34. Two informal tests
- File with 24 datasets, all with the same big datatype
- 26,000 bytes normally
- 17,000 bytes with shared messages enabled
- Saves 375 bytes per dataset
- But make a bad decision (invoke shared messages but create only one dataset):
- 9,000 bytes normally
- 12,000 bytes with shared messages enabled
- Probably slower when reading and writing, too
- Moral: shared messages can be a big help, but only in the right situation!
35. Metadata cache improvements
36. Metadata cache improvements
- Why
- Improve I/O performance and memory usage when accessing many objects
- What
- New metadata cache APIs to:
- control cache size
- monitor actual cache size and current hit rate (sketch below)
- Under the hood: adaptive cache resizing
- Automatically detects the current working set size
- Sets the max cache size to the working set size
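- A minimal sketch of the monitoring calls, assuming the released 1.8 names:

  double hit_rate;
  size_t max_size, min_clean, cur_size;
  int    nentries;

  H5Fget_mdc_hit_rate(file, &hit_rate);
  H5Fget_mdc_size(file, &max_size, &min_clean, &cur_size, &nentries);
  printf("cache %zu/%zu bytes, hit rate %.2f\n", cur_size, max_size, hit_rate);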
37. Metadata cache improvements
- Note: most applications do not need to worry about the cache
- See the Advanced topics talk for details
- If you do see unusual memory growth or poor performance, please contact us. We want to help you.
38. Other improvements
39. New extendible error-handling API
- Why: enable applications to integrate their error reporting with the HDF5 library error stack
- What: new error-handling API (usage sketch below)
- H5Epush - push major and minor error IDs onto a specified error stack
- H5Eprint - print a specified stack
- H5Ewalk - walk through a specified stack
- H5Eclear - clear a specified stack
- H5Eset_auto - turn automatic error printing on/off for a specified stack
- H5Eget_auto - return the automatic traversal settings for a specified stack
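- A minimal sketch, assuming the names in released 1.8 (where these routines gained a "2" suffix):

  hid_t estack = H5Eget_current_stack();  /* detach the current error stack */
  H5Eprint2(estack, stderr);              /* print it */
  H5Eclear2(estack);                      /* clear it */
  H5Eclose_stack(estack);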
40. Extendible ID API
- New ID-management routines let an application use HDF5's ID-to-object mapping machinery for its own objects (sketch below)
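- A minimal sketch, assuming the released 1.8 H5I calls; my_object is illustrative application data:

  H5I_type_t my_type = H5Iregister_type(64, 0, NULL);  /* new ID class  */
  hid_t id  = H5Iregister(my_type, my_object);         /* object -> ID  */
  void *obj = H5Iobject_verify(id, my_type);           /* ID -> object  */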
41. Attribute improvements
- Why
- Use less storage when large numbers of attributes are attached to a single object
- Iterate over or look up attributes by creation order
- What
- Property to create an index on the order in which attributes are created
- Improved attribute storage
42. Support for the Unicode character set
- Why
- So applications can create names using Unicode
- netCDF 4 needed this
- What
- UTF-8 Unicode encoding now supported
- For string datatypes and the names of links and attributes
- Example

  H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8);
  H5Llink(file_id, "UTF-8 name", ..., lcpl_id, ...);
43. Efficient copying of HDF5 objects
- Why
- Enable applications to copy objects efficiently
- What
- New routines to copy an object in an HDF5 file within the current file or to another file (sketch below)
- Done at a low level in the HDF5 file, allowing
- Entire group hierarchies to be copied quickly
- Compressed datasets to be copied without going through a decompression/compression cycle
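- A minimal sketch with the released 1.8 call; file handles and names are illustrative:

  /* Copy the hierarchy under /A, compressed data and all, into another file */
  H5Ocopy(src_file, "/A", dst_file, "/A_copy", H5P_DEFAULT, H5P_DEFAULT);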
44. Performance of object copy routines
45. Data transformation filter
- Why
- Apply arithmetic operations to data during I/O
- What
- Data transformation filter
- Transform expressed as an algebraic formula
- Only +, -, *, and / are supported
- Example (code sketch below)
- Expression parameter is set, such as x*(x-5)
- When the dataset is read or written, x*(x-5) is applied to each element
- When reading, values in the file are unchanged
- When writing, the transformed data is written to the file
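- A minimal sketch with the released 1.8 call:

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_data_transform(dxpl, "x*(x-5)");  /* applied element by element */
  H5Dread(dset, H5T_NATIVE_FLOAT, H5S_ALL, H5S_ALL, dxpl, buf);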
46. Stackable Virtual File Drivers
- What is a Virtual File Driver (VFD)?
47. Structure of the HDF5 library
- Object API (C, Fortran 90, Java, C++)
- Specify objects and transformation properties
- Invoke data movement operations and data transformations
- Library internals
- Perform data transformations and other preparation for I/O
- Configurable transformations (compression, etc.)
- Virtual file I/O (C only)
- Perform byte-stream I/O operations (open/close, read/write, seek)
- User-implementable I/O (stdio, network, memory, etc.)
48. Stackable VFDs
- The HDF5 VFD layer allows
- Storing data using different physical file layouts. E.g., the family VFD writes a file as a family of files
- Doing different types of I/O. E.g., stdio (standard I/O), MPI-I/O (for parallel I/O)
49. Stackable VFDs
- Why stackable?
- Before now, only one VFD could be used at a time
- VFDs could not interoperate
- What is stackable?
- A non-terminal VFD may stack on top of compatible non-terminal VFDs and, eventually, a terminal VFD
- Two kinds of VFDs
- Non-terminal (e.g., family)
- Terminal (e.g., stdio, MPI-I/O)
50. Stackable VFDs
[Diagram: the application calls the HDF5 API; non-terminal VFDs (family, and split, which separates metadata from raw data) stack above terminal VFDs (stdio, mpiio, sec2) on the default I/O path down to the HDF5 files]
51. Platform-specific changes
52. Platform-specific changes
- Why: better UNIX/Linux portability
- What
- 1.8 uses the latest GNU autotools (autoconf, automake, libtool)
- Improves portability across many machine and OS configurations
- Builds can now be done in parallel
- with the gmake -j flag
- speeds up the build, test, and install processes
- The build infrastructure includes many other improvements as well
53. Platforms to be dropped
- Operating systems
- HP-UX 11.00
- Mac OS 10.3
- AIX 5.1 and 5.2
- SGI IRIX64 6.5
- Linux 2.4
- Solaris 2.8 and 2.9
- Compilers
- GNU C compilers older than 3.4 (Linux)
- Intel 8.x
- PGI 5.x, 6.0
- MPICH 1.2.5
http://www.hdfgroup.org/HDF5/release/alpha/obtain518.html
54. Platforms to be added
- Systems
- Alpha OpenVMS
- Mac OS X 10.4 (Intel)
- Solaris 2.x on Intel (?)
- Cray XT3
- Windows 64-bit (32-bit binaries)
- Linux 2.6
- BG/L
- Compilers
- g95
- PGI 6.1
- Intel 9.x
- MPICH 1.2.7
- MPICH2
55. High-level APIs
56. High-Level Fortran APIs
- Fortran APIs have been added for H5Lite, H5Image
and H5Table.
57. Dimension scales
- Similar to
- Dimension scales in HDF4
- Coordinate variables in netCDF
- What is a dimension scale?
- An HDF5 dataset with additional metadata that identifies the dataset as a dimension scale
- Associated with dimensions of HDF5 datasets
- The meaning of the association is left to applications
- A dimension scale can be shared by two or more dataset dimensions
58. Dimension scales example
[HDF Explorer image]
59. Dimension scales example
[HDF Explorer image]
60. Sample dimension scale functions
- H5DSset_scale - convert a dataset to a dimension scale
- H5DSattach_scale - attach a scale to a dimension (usage sketch below)
- H5DSdetach_scale - detach a scale from a dimension
- H5DSis_attached - verify whether a scale is attached to a dataset
- H5DSget_scale_name - read the name of a scale
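- A minimal sketch; time_dset and data_dset are illustrative dataset handles:

  H5DSset_scale(time_dset, "time");           /* mark the dataset as a scale */
  H5DSattach_scale(data_dset, time_dset, 0);  /* attach it to dimension 0    */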
61. HDF5 Packet Table
- Why
- High-performance table writing (API sketch below)
- For data acquisition, when there are many sources of data
- E.g., flight test
- What
- Each row is a packet: a collection of fields, fixed or variable length
- Append only
- Indexed retrieval
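- A minimal sketch with the H5PT calls from the released high-level library; names are illustrative:

  hid_t ptable = H5PTcreate_fl(file, "packets", H5T_NATIVE_INT,
                               100,   /* chunk size */
                               -1);   /* no compression */
  int values[2] = {42, 43};
  H5PTappend(ptable, 2, values);      /* append two fixed-length packets */
  H5PTclose(ptable);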
62. Packets in HDF5
[Diagram: fixed-length and variable-length packet records, each a sequence of (Time, Data) fields appended one after another]
63. Parallel HDF5
64. Collective I/O improvements
- Why
- Collective I/O was not available for chunked data
- Collective I/O was not available for complex selections
- Collective I/O is key to improving performance for parallel HDF5
- What
- Collective I/O works for chunked storage (sketch below)
- Works for irregular selections, for both chunked and contiguous storage
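- A minimal sketch of requesting collective transfer (requires a parallel HDF5 build):

  hid_t dxpl = H5Pcreate(H5P_DATASET_XFER);
  H5Pset_dxpl_mpio(dxpl, H5FD_MPIO_COLLECTIVE);  /* request collective I/O */
  H5Dwrite(dset, H5T_NATIVE_INT, memspace, filespace, dxpl, buf);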
65. Parallel h5diff (ph5diff)
- Compares two files in an MPI parallel environment
- Compares multiple datasets simultaneously
66. Windows MPICH support
- Prototype of MPICH support on Windows
67. Tool improvements
68. New features for old tools
- h5dump
- Dumps data in binary format
- Faster for files with large numbers of objects
- h5diff
- Can now compare dataset regions
- Parallel ph5diff now available
- h5repack
- Efficient data copy using H5Gcopy()
- Able to handle big datasets
69. New HDF5 tools
- h5copy
- Copies a group, dataset, or named datatype from one location to another
- Copies within a file or across files
- h5repart
- Partitions a file into a family of files
- h5import
- Imports binary/ASCII data into an HDF5 file
- h5check
- Verifies an HDF5 file against the defined HDF5 File Format Specification
- h5stat
- Reports statistics about a file and the objects in the file
70. Thank You
71. Questions/comments?
72. For more information
- Go to http://www.hdfgroup.org/HDF5/
- Click on "Obtain HDF5 1.8.0 Alpha"
- Look at the "Information" table
73. Acknowledgement
- This report is based upon work supported in part
by a Cooperative Agreement with NASA under NASA
NNG05GC60A. Any opinions, findings, and
conclusions or recommendations expressed in this
material are those of the author(s) and do not
necessarily reflect the views of the National
Aeronautics and Space Administration.