Update on HDF5 1.8

1
Update on HDF5 1.8
  • The HDF Group
  • HDF and HDF-EOS Workshop X
  • November 28, 2006

2
Why HDF5 1.8?
3
"As we know, there are known knowns; there are
things we know we know. We also know there are
known unknowns; that is to say, we know there are
some things we do not know. But there are also
unknown unknowns, the ones we don't know we
don't know." (Donald Rumsfeld)
4
Some things we knew we knew
  • Need high-level APIs (image, etc.)
  • Need more datatypes (packed, n-bit, etc.)
  • Need external and other links
  • Tools needed (h5pack, etc.)
  • Caching embellishments
  • Eventually, multithreading

5
Things we knew we did not know
  • New requirements from EOS and ASCI
  • New applications that would use HDF5
  • How HDF5 would really perform in parallel
  • What new tools, features and options needed
  • New APIs, API features

6
Things we didn't know we didn't know
  • Completely unanticipated applications
  • New data types and structures
  • E.g. DNA sequences
  • New operations
  • E.g. write many real-time streams simultaneously

7
HDF5 1.8 topics
  • Dataset and datatype improvements
  • Group improvements
  • Link Revisions
  • Shared object header messages
  • Metadata cache improvements
  • Other improvements
  • Platform-specific changes
  • High level APIs
  • Parallel HDF5
  • Tool improvements

8
Dataset and Datatype Improvements
9
Text-based data type descriptions
  • Why
  • Simplify datatype creation
  • Make datatype creation code more readable
  • Facilitate debugging by printing the text
    description of a data type
  • What
  • New routines to create a datatype from its text
    description and to convert a datatype back to
    text (H5LTtext_to_dtype, H5LTdtype_to_text)

10
Text data type description: example
  • Create a datatype of compound type:
  • /* Create the datatype from its text description */
  • dtype = H5Ttext_to_type("typedef struct foo { int a; float b; } foo_t;");
  • /* Convert the datatype back to text */
  • H5Ttype_to_text(dtype, NULL, H5T_C, &tsize);

11
Serialized datatypes and dataspaces
  • Why
  • Allow datatype and dataspace info to be
    transmitted between processes
  • Allow datatype/dataspace to be stored in non-HDF5
    files
  • What
  • A new set of routines to serialize/deserialize
    HDF5 datatypes and dataspaces.

12
Int-to-float conversion during I/O
  • Why: Convert ints to floats during I/O
  • What: Int-to-float conversion supported during
    I/O

13
Revised conversion exception handling
  • Why: Give apps greater control over exceptions
    (range errors, etc.) during datatype conversion.
  • What: Revised conversion exception handling

14
Revised conversion exception handling
  • To handle exceptions during conversions, register
    a handling function through H5Pset_type_conv_cb()
  • Exception cases:
  • H5T_CONV_EXCEPT_RANGE_HI
  • H5T_CONV_EXCEPT_RANGE_LOW
  • H5T_CONV_EXCEPT_TRUNCATE
  • H5T_CONV_EXCEPT_PRECISION
  • H5T_CONV_EXCEPT_PINF
  • H5T_CONV_EXCEPT_NINF
  • H5T_CONV_EXCEPT_NAN
  • Return values: H5T_CONV_ABORT, H5T_CONV_UNHANDLED,
    H5T_CONV_HANDLED

15
Compression filter for n-bit data
  • Why
  • Compact storage for user-defined datatypes
  • What
  • When data stored on disk, padding bits chopped
    off and only significant bits stored
  • Supports most datatypes
  • Works with compound datatypes

16
N-bit compression example
  • In memory, one value of an N-bit datatype is
    stored like this:
  • byte 3   byte 2   byte 1   byte 0
  • ????????????SPPPPPPPPPPPPPPP????
  • S = sign bit, P = significant bit, ? = padding bit
  • After passing through the N-bit filter, all
    padding bits are chopped off, and the bits are
    stored on disk like this:
  • 1st value         2nd value
  • SPPPPPPP PPPPPPPP SPPPPPPP PPPPPPPP ...
  • Opposite (decompress) when going from disk to
    memory
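The bit-chopping above can be sketched in plain C. This is a simplified model with illustrative names (`nbit_pack`, `offset`, `nbits`), not the actual HDF5 N-Bit filter code:

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified model of N-bit packing (not the HDF5 filter itself).
 * Each input value keeps only `nbits` significant bits starting at bit
 * position `offset`; those bits are written contiguously, MSB first,
 * into `out`, which must be zeroed and large enough for n*nbits bits. */
static void nbit_pack(const uint32_t *in, size_t n,
                      unsigned offset, unsigned nbits, uint8_t *out)
{
    size_t bitpos = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t mask = (nbits < 32) ? ((1u << nbits) - 1u) : 0xFFFFFFFFu;
        uint32_t v = (in[i] >> offset) & mask;   /* drop the padding bits */
        for (unsigned b = nbits; b-- > 0; ) {    /* emit bits, MSB first */
            if ((v >> b) & 1u)
                out[bitpos / 8] |= (uint8_t)(0x80u >> (bitpos % 8));
            bitpos++;
        }
    }
}
```

For the 32-bit layout on the slide (sign bit plus 15 significant bits starting at bit 4), `nbit_pack(values, n, 4, 16, out)` would emit the two bytes per value shown above.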

17
Scale/offset storage filter
  • Why: Use less storage when less precision is needed
  • What
  • Performs a scale/offset operation on each value
  • Truncates the result to fewer bits before storing
  • Currently supports integers and floats
  • Example
  • H5Pset_scaleoffset(dcr, H5Z_SO_INT,
    H5Z_SO_INT_MINBITS_DEFAULT);
  • H5Dcreate(..., dcr);
  • H5Dwrite(...);

18
Example with floating-point type
  • Data: 104.561, 99.459, 100.545, 105.644
  • Choose scaling factor: decimal precision to keep.
    E.g., scale factor D = 2
  • 1. Find minimum value (offset): 99.459
  • 2. Subtract minimum value from each element
  • Result: 5.102, 0, 1.086, 6.185
  • 3. Scale data by multiplying by 10^D = 100
  • Result: 510.2, 0, 108.6, 618.5
  • 4. Round the data to integer
  • Result: 510, 0, 109, 619
  • 5. Pack and store using min number of bits

19
NULL Dataspace
  • Why
  • Allow datasets with no elements to be described
  • netCDF 4 needed a placeholder for attributes
  • What
  • A dataset with no dimensions, no data

20
Group improvements
21
Access links by creation-time order
  • Why
  • Allow iteration and lookup of a group's links
    (children) by creation order as well as by name
    order
  • Support the netCDF access model for netCDF 4
  • What: Option to access objects in a group
    according to relative creation time

22
Compact groups
  • Why
  • Save space and access time for small groups
  • If groups are small, they don't need B-tree overhead
  • What
  • Alternate storage for groups with few links
  • Example
  • File with 11,600 groups
  • With original group structure, file size 20 MB
  • With compact groups, file size 12 MB
  • Total savings: 8 MB (40%)
  • Average savings/group: ~700 bytes

23
Better large group storage
  • Why: Faster, more scalable storage and access
    for large groups
  • What: New format and method for storing groups
    with many links

24
Intermediate group creation
  • Why
  • Simplify creation of a series of connected groups
  • Avoid having to create each intermediate group
    separately, one by one
  • What
  • Intermediate groups can be created when creating
    an object in a file, with one function call

25
Example: add intermediate groups
  • Want to create /A/B/C/dset1
  • A exists, but B/C/dset1 do not

H5Dcreate(file_id, "/A/B/C/dset1", ...) One call
creates groups B and C, then creates dset1
26
Link Revisions
27
What are links?
  • Links connect groups to their members
  • Hard links point to a target by address
  • Soft links store the path to a target

[Figure: a root group with a hard link and a soft link pointing to a dataset]
28
New external Links
  • Why: Access objects by file path within file
  • What
  • Store location of file and path within that file
  • Can link across files

[Figure: a link in file1.h5's root group pointing to a dataset in file2.h5]
29
New User-defined Links
  • Why
  • Allow applications to create their own kinds of
    links and link operations, such as
  • Create hard external link that finds an object
    by address
  • Create link that accesses a URL
  • Keep track of how often a link is accessed, or
    other behavior
  • What
  • App can create new kinds of links by supplying
    custom callback functions
  • Can do anything HDF5 hard, soft, or external
    links do

30
Shared Object Header Messages
31
Shared object header messages
  • Why: metadata is duplicated many times, wasting
    space
  • Example
  • You create a file with 10,000 datasets
  • All use the same datatype and dataspace
  • HDF5 needs to write this information 10,000 times!

32
Shared object header messages
  • What
  • Enable messages to be shared automatically
  • HDF5 shares duplicated messages on its own!

[Figure: Dataset 1 and Dataset 2, each with its own data, sharing a single datatype and dataspace message]
33
Shared Messages
  • Happens automatically
  • Works with datatypes, dataspaces, attributes,
    fill values, and filter pipelines
  • Saves space if these objects are relatively large
  • May be faster if HDF5 can cache shared messages
  • Drawbacks
  • Usually slower than non-shared messages
  • Adds overhead to the file
  • Index for storing shared datatypes
  • 25 bytes per instance
  • Older library versions can't read files with
    shared messages

34
Two informal tests
  • File with 24 datasets, all with same big datatype
  • 26,000 bytes normally
  • 17,000 bytes with shared messages enabled
  • Saves 375 bytes per dataset
  • But make a bad decision: invoke shared messages
    but create only one dataset
  • 9,000 bytes normally
  • 12,000 bytes with shared messages enabled
  • Probably slower when reading and writing, too.
  • Moral: shared messages can be a big help, but
    only in the right situation!
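The per-dataset figure from the first test can be checked with a one-line computation (`savings_per_dataset` is an illustrative helper name, not an HDF5 routine):

```c
/* Shared-message savings per dataset from the informal test above:
 * (bytes without sharing - bytes with sharing) / number of datasets. */
static int savings_per_dataset(int normal_bytes, int shared_bytes,
                               int ndatasets)
{
    return (normal_bytes - shared_bytes) / ndatasets;
}
```

`savings_per_dataset(26000, 17000, 24)` gives the 375 bytes quoted above.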

35
Metadata cache improvements
36
Metadata Cache improvements
  • Why
  • Improve I/O performance and memory usage when
    accessing many objects
  • What
  • New metadata cache APIs
  • control cache size
  • monitor actual cache size and current hit rate
  • Under the hood: adaptive cache resizing
  • Automatically detects the current working set size
  • Sets max cache size to the working set size

37
Metadata cache improvements
  • Note: most applications do not need to worry
    about the cache
  • See "Advanced topics" for details
  • And if you do see unusual memory growth or poor
    performance, please contact us. We want to help
    you.

38
Other improvements
39
New extendible error-handling API
  • Why: Enable apps to integrate error reporting with
    the HDF5 library error stack
  • What: New error-handling API
  • H5Epush - push major and minor error IDs onto a
    specified error stack
  • H5Eprint - print specified stack
  • H5Ewalk - walk through specified stack
  • H5Eclear - clear specified stack
  • H5Eset_auto - turn error printing on/off for a
    specified stack
  • H5Eget_auto - return settings for specified stack
    traversal

40
Extendible ID API
  • New ID management routines allow an application
    to use the HDF5 ID-to-object mapping routines

41
Attribute improvements
  • Why
  • Use less storage when large numbers of attributes
    are attached to a single object
  • Iterate over or look up attributes by creation
    order
  • What
  • Property to create index on the order in which
    the attributes are created
  • Improved attribute storage

42
Support for Unicode Character Set
  • Why
  • So apps can create names using Unicode
  • netCDF 4 needed this
  • What
  • UTF-8 Unicode encoding now supported
  • For string datatypes, names of links and
    attributes
  • Example
  • H5Pset_char_encoding(lcpl_id, H5T_CSET_UTF8);
  • H5Llink(file_id, "UTF-8 name", ..., lcpl_id, ...);

43
Efficient copying of HDF5 objects
  • Why
  • Enable apps to copy objects efficiently
  • What
  • New routines to copy an object in an HDF5 file
    within the current file or to another file
  • Done at a low-level in the HDF5 file, allowing
  • Entire group hierarchies to be copied quickly
  • Compressed datasets to be copied without going
    through a decompression/compression cycle

44
Performance of object copy routines
45
Data transformation filter
  • Why
  • Apply arithmetic operations to data during I/O
  • What
  • Data transformation filter
  • Transform expressed by algebraic formula
  • Only +, -, *, and / supported
  • Example
  • Expression parameter set, such as x*(x-5)
  • When dataset is read/written, x*(x-5) applied per
    element
  • When reading, values in file are unchanged
  • When writing, transformed data written to file
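Conceptually, the filter applies the expression element by element to the I/O buffer, as in this sketch (a model of the behavior with an illustrative name, not the library's expression parser; in the released API the expression string is set on a transfer property list with H5Pset_data_transform):

```c
#include <stddef.h>

/* Model of the data transform x*(x-5) applied to each element of the
 * I/O buffer; the real filter parses the expression string at runtime. */
static void apply_transform(double *buf, size_t n)
{
    for (size_t i = 0; i < n; i++)
        buf[i] = buf[i] * (buf[i] - 5.0);
}
```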

46
Stackable Virtual File Drivers
  • What is Virtual File Driver (VFD)?

47
Structure of HDF5 Library
  • Object API (C, Fortran 90, Java, C++)
  • Specify objects and transformation properties
  • Invoke data movement operations and data
    transformations
  • Library internals
  • Performs data transformations and other prep for
    I/O
  • Configurable transformations (compression, etc.)
  • Virtual file I/O (C only)
  • Perform byte-stream I/O operations (open/close,
    read/write, seek)
  • User-implementable I/O (stdio, network, memory,
    etc.)

48
Stackable VFD
  • HDF5 VFD allows
  • Storing data using different physical file
    layout. E.g., Family VFD (writes file as family
    of files)
  • Doing different types of I/O. E.g., stdio
    (standard I/O), MPI-I/O (for parallel I/O)

49
Stackable VFD
  • Why stackable?
  • Before now, only one VFD could be used at a time
  • VFDs could not interoperate
  • What is stackable?
  • A non-terminal VFD may stack on top of compatible
    non-terminal VFDs and, eventually, a terminal VFD
  • Two kinds of VFD
  • Non-terminal (e.g. family)
  • Terminal (e.g. stdio, MPI-I/O)

50
Stackable VFD
[Figure: stack diagram - Application -> HDF5 API -> non-terminal VFDs (family, split into metadata and raw data) -> terminal VFDs (stdio, mpiio, sec2) -> HDF5 files; the default I/O path goes straight to a terminal VFD]
51
Platform-specific changes
52
Platform-specific changes
  • Why: Better UNIX/Linux portability
  • What
  • 1.8 uses latest GNU auto tools (autoconf,
    automake, libtool)
  • improves portability between many machine and OS
    configurations
  • Build can now be done in parallel
  • with gmake -j flag
  • speeds up build, test and install processes
  • Build infrastructure includes many other
    improvements as well

53
Platforms to be dropped
  • Operating systems
  • HPUX 11.00
  • MAC OS 10.3
  • AIX 5.1 and 5.2
  • SGI IRIX64-6.5
  • Linux 2.4
  • Solaris 2.8 and 2.9
  • Compilers
  • GNU C compilers older than 3.4 (Linux)
  • Intel 8.*
  • PGI V. 5.*, 6.0
  • MPICH 1.2.5

http://www.hdfgroup.org/HDF5/release/alpha/obtain518.html
54
Platforms to be added
  • Systems
  • Alpha Open VMS
  • MAC OSX 10.4 (Intel)
  • Solaris 2.* on Intel (?)
  • Cray XT3
  • Windows 64-bit (32-bit binaries)
  • Linux 2.6
  • BG/L
  • Compilers
  • g95
  • PGI V. 6.1
  • Intel 9.
  • MPICH 1.2.7
  • MPICH2

55
High level APIs
56
High-Level Fortran APIs
  • Fortran APIs have been added for H5Lite, H5Image
    and H5Table.

57
Dimension scales
  • Similar to
  • Dimension scales in HDF4
  • Coordinate variables in netCDF
  • What is a dimension scale?
  • An HDF5 dataset with additional metadata that
    identifies the dataset as a Dimension Scale
  • Associated with dimensions of HDF5 datasets
  • Meaning of the association is left to
    applications
  • A Dimension scale can be shared by two or more
    dataset dimensions

58
Dimension scales example
[Figure: HDF Explorer screenshot]
59
Dimension scales example
[Figure: HDF Explorer screenshot]
60
Sample dimension scale functions
  • H5DSset_scale - convert a dataset to a dimension
    scale
  • H5DSattach_scale - attach a scale to a dimension
  • H5DSdetach_scale - detach a scale from a dimension
  • H5DSis_attached - verify whether a scale is
    attached to a dataset
  • H5DSget_scale_name - read the name of a scale

61
HDF5Packet
  • Why
  • High performance table writing
  • For data acquisition, when there are many sources
    of data
  • E.g. flight test
  • What
  • Each row is a packet: a collection of fields,
    fixed or variable length
  • Append only
  • Indexed retrieval

62
Packets in HDF5
[Figure: packet layouts - fixed-length data records (Time and Data fields per packet) and variable-length records]
63
Parallel HDF5
64
Collective I/O improvements
  • Why
  • Collective I/O not available for chunked data
  • Collective I/O not available for complex
    selections
  • Collective I/O is key to improving performance
    for parallel HDF5
  • What
  • Collective I/O works for chunked storage
  • Works for irregular selections for both chunked
    and contiguous storage

65
Parallel h5diff (ph5diff)
  • Compares two files in an MPI parallel
    environment.
  • Compares multiple datasets simultaneously

66
Windows MPICH support
  • Windows MPICH support prototype

67
Tool improvements
68
New features for old tools
  • h5dump
  • Dump data in binary format
  • Faster for files with large numbers of objects
  • h5diff
  • Can now compare dataset regions
  • Parallel ph5diff now available
  • h5repack
  • Efficient data copy using H5Gcopy()
  • Able to handle big datasets

69
New HDF5 Tools
  • h5copy
  • Copies a group, dataset or named datatype from
    one location to another
  • Copies within a file or across files
  • h5repart
  • Partition file into a family of files
  • h5import
  • Import binary/ascii data into an HDF5 file
  • h5check
  • Verifies an HDF5 file against the defined HDF5
    File Format Specification
  • h5stat
  • Reports statistics about a file and objects in a
    file

70
Thank You
71
Questions/comments?
72
For more information
  • Go to http://www.hdfgroup.org/HDF5/
  • Click on Obtain HDF5 1.8.0 Alpha
  • Look at table Information

73
Acknowledgement
  • This report is based upon work supported in part
    by a Cooperative Agreement with NASA under NASA
    NNG05GC60A. Any opinions, findings, and
    conclusions or recommendations expressed in this
    material are those of the author(s) and do not
    necessarily reflect the views of the National
    Aeronautics and Space Administration.