Title: Mike Folk, Elena Pourmal
1Update on HDF: Sustaining and Growing Data Technology
- Mike Folk, Elena Pourmal
- National Center for Supercomputing Applications
- University of Illinois at Urbana-Champaign
- canSAS-IV RAL
- May 13, 2004
2Talk overview
- Data management challenges, and evolution of data requirements and file formats
- HDF in a nutshell
- Current HDF development effort
- Sustainability and growth of HDF
3We are drowning in data and still starving for
information
4Current data management challenge
- Data diversity
- Comes from mixed sources (experiment, simulation, testing, remote sensing, etc.)
- Comes in different sizes (KB, MB, GB, TB, ...)
- Comes in different forms and formats (ASCII, binary, community formats, proprietary formats)
- Hard to share, archive, and mine
- Leads to duplicated effort in visualization and mining tools
- Results in high costs and lost opportunities to share, access, and use already existing data
5Current software management challenge
Stovepipe applications
- Climate Model Application: produce data with unique model and format; preprocess; visualize, analyze; archive
- Agricultural Monitoring: gather data with unique model and format; preprocess; visualize, analyze; archive
- Weather satellite: gather data with unique model and format; preprocess; visualize, analyze; archive
6Solution: Common data models and formats
Standards-based applications
- Applications: Climate Model Application, Agricultural Monitoring, Weather satellite
- Shared by all three: common data model and storage format
- For each application: visualize, analyze; archive
7Evolution of format and I/O requirements in the last 15 years
- 1990: I just need some numbers. Text is good. vi will do the trick!
- 1991: I've got lots o' numbers and metadata. Fortran print!
- 1992: I want to mix images, metadata, and other data. Better to use a multi-object binary format. HDF2 is cool!
- 1994: My objects are complex. I need groups and tables. I like HDF4!
- 1998: Whoa, this data is big! I need really big objects, parallel I/O. It's HDF5 for me!
- 2004: You human peebles betta use NeXus, or I'll be baaaack!
8HDF was created to address data management needs in science and engineering, and to provide building blocks for scientific communities' standards
9HDF in a nutshell
- Comes from the HDF group at NCSA, University of Illinois
- File format and I/O library for storing, managing, and archiving large, complex scientific and other data
- Accommodates data of diverse origins, sizes, and types
- Portable: available for almost all operating systems
- Scalable: works in high-performance computational environments
- Became the underlying file format for community-built standards such as NeXus, HDF-EOS, NPOESS
10Example of HDF file mixing and grouping objects
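Figure: an HDF file that groups mixed objects: a text annotation ("This file was created as a part of ... see http://hdf.ncsa.uiuc.edu"), a table, two raster images, and a 2-D array. Object names shown in the figure: foo, a, b, c, x, z, _foo_y. One object is labeled 1GB.
Table contents (lat, lon, temp): (12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)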
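A minimal sketch of a file that mixes and groups objects in this way, assuming the third-party h5py and NumPy Python packages (neither is mentioned in the talk). The object names (foo, table, image, array2d), the attribute text, and the image contents are hypothetical; only the three lat/lon/temp rows come from the slide. The file is held in memory, so nothing is written to disk.

```python
import h5py
import numpy as np

# Rows taken from the slide's table; the dtype choices are assumptions.
table = np.array(
    [(12, 23, 3.1), (15, 24, 4.2), (17, 21, 3.6)],
    dtype=[("lat", "i4"), ("lon", "i4"), ("temp", "f8")],
)

# driver="core" with backing_store=False keeps the whole file in memory.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    # Free-form text attached to the file as an attribute (hypothetical wording).
    f.attrs["note"] = "Hypothetical file mixing several object types"

    # A group collects related objects under one name.
    foo = f.create_group("foo")
    foo.create_dataset("table", data=table)                         # table
    foo.create_dataset("image", data=np.zeros((4, 4), dtype="u1"))  # raster image

    # A plain 2-D array stored at the top level of the file.
    f.create_dataset("array2d", data=np.arange(6.0).reshape(2, 3))

    # Walk the hierarchy and collect every object path.
    names = []
    f.visit(names.append)

    # Read the table and the image shape back while the file is open.
    rows = f["foo/table"][...]
    image_shape = f["foo/image"].shape
```

Each object is addressed by a path such as foo/table, so a reader can open one object without touching the others.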
11HDF Community
- Broad range of disciplines and applications
- We try to support all users
- Provide open source libraries and tools
- Provide user support, documentation
- Encourage widespread use, vendor participation
- Help to develop community standards
- Example: NASA HDF-EOS and NPOESS
- Data reached 4,000,000,000,000,000 bytes
- NASA Aqua, Terra, and (soon) Aura satellites use HDF4 and HDF5
- NPOESS will use HDF5 for data distribution
12Current HDF development efforts: Major directions
- Performance, library enhancements, and tools to facilitate access to HDF data
- Help different communities with common data models and formats based on HDF
- netCDF on top of HDF5 (Atmospheric Sciences)
- Storage and visualization of FEM and CFD data (ESA, NASA, US DOE Labs)
- Bioinformatics (NIH, DNA sequencing)
- Real-time data processing (Boeing)
- Public records (NARA, geospatial data)
- HDF sustainability (not-for-profit organization to support the HDF effort)
13Current releases
- For details check http://hdf.ncsa.uiuc.edu
- HDF5 1.6.2 and HDF4 r2.0
- Bug fixes
- Performance enhancements
- New compression method: NASA SZIP (fast, better compression ratios)
- Better configuration
- New platforms: Linux 64, Altix, Mac OS X
- Tools
- repack, diff
14Java Tools development
- HDFView
- Browse and edit HDF4 and HDF5 files
- Modularized Java packages
- Address the needs of the standards built on top of HDF
- HDF-EOS browser: first successful prototype to browse an HDF file using the HDF-EOS standard
- Web browser plug-in to read HDF (current research)
- http://hdf.ncsa.uiuc.edu/hdf-java-html/
15HDF-EOS Browser
- HDF-EOS objects
- Swath
- Grid
- Point
- Represented by HDF objects
16HDF view of HDF-EOS data
17Natural view of HDF-EOS data
18netCDF4/HDF5
- netCDF4
- Funded by NASA
- Extension to current netCDF
- Built on top of HDF5
- Used by atmospheric science community
- http://my.unidata.ucar.edu/content/software/netcdf/netcdf-4/index.html
19Storage and Visualization of FEM and CFD Data
- A few examples
- NASA CGNS standard for CFD applications
- STEP/NRF standard for non-destructive tests, thermal data analysis
- Abaqus internal file format for FEM calculations and visualization
- EnSight internal file format for visualization of FEM and CFD data
- Want to use a common model based on HDF5
- First prototype: ftp://ftp.ensight.com/pub/HDF_RW/
20HDF5 and Bioinformatics
- Goals
- Make DNA sequencing available to any educational, research, or clinical laboratory, and individual researchers
- Solutions
- Create common file format and data model based on HDF5 for solving alignment problems in DNA and protein sequence analysis
- Develop visualization and analysis tools to work with raw data and to produce final publishable results
- Develop general approach to address numerous sequencing activities
21HDF5 and Bioinformatics
- HDF5 challenges
- Efficient handling of element deletions
- Efficient handling of variable-length records
- Request to support new data structures in HDF5
- Linked lists
- Hash tables
- Sorting mechanisms
- Multithreaded support
22Real-time data processing (Boeing)
- Challenge
- Multiple data sources
- Aircraft real-time test data (500Mb/sec)
- Voice communications
- Video data
- Ground tracking data
- Satellite/GPS
- Simulations
- Difficulty sharing data between the company's different divisions
- Solution: use a common file format based on HDF5
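The slide writes the test-data rate as "500Mb/sec", which could be read as megabits or megabytes per second. A back-of-the-envelope sketch of the hourly volume under each reading follows, assuming decimal units (1 M = 10^6) and a sustained rate for the full hour; the function and variable names are illustrative only.

```python
SECONDS_PER_HOUR = 3600
MEGA = 1_000_000  # decimal mega; an assumption, since the slide does not say

def bytes_per_hour(rate_mega_per_sec, unit_is_bits):
    """Bytes accumulated in one hour at a sustained rate given in mega-units per second."""
    units_per_sec = rate_mega_per_sec * MEGA
    # Convert bits to bytes when the rate is quoted in bits.
    bytes_per_sec = units_per_sec // 8 if unit_is_bits else units_per_sec
    return bytes_per_sec * SECONDS_PER_HOUR

# Reading 1: 500 megabits/s = 62.5 MB/s, about 225 GB per hour.
hourly_if_megabits = bytes_per_hour(500, unit_is_bits=True)

# Reading 2: 500 megabytes/s, about 1.8 TB per hour.
hourly_if_megabytes = bytes_per_hour(500, unit_is_bits=False)
```

Under either reading a single test flight produces hundreds of gigabytes or more, which is why one shared, scalable file format matters for this use case.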
23Boeing: Variable-length array storage
24Boeing: Variable-length array storage
- Variable Length Array Storage in HDF5
- Needed for flight test data systems
- Must handle up to 500Mb/sec
- Must handle raw, real-time and/or embedded data
- NCSA implementing API to read/write data
- Based on HDF5 table API
- Potential applications to many domains
- Part of effort to adopt HDF5 as Boeing-wide
standard for engineering data
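The general idea behind variable-length array storage can be sketched with a flat byte buffer plus an offset table. This is a generic illustration using only the Python standard library, not the HDF5 table API and not the API NCSA was implementing; the class and method names (VarLenStore, append, read) are hypothetical, and samples are assumed to be 64-bit floats.

```python
import struct

class VarLenStore:
    """Variable-length records of float64 samples packed into one buffer (illustrative)."""

    ITEM_SIZE = 8  # bytes per little-endian float64 sample

    def __init__(self):
        self._data = bytearray()
        # Record i occupies bytes offsets[i] up to offsets[i + 1].
        self._offsets = [0]

    def append(self, samples):
        """Append one record of any length and return its index."""
        self._data += struct.pack(f"<{len(samples)}d", *samples)
        self._offsets.append(len(self._data))
        return len(self._offsets) - 2

    def read(self, index):
        """Return record `index` as a list of floats."""
        start, end = self._offsets[index], self._offsets[index + 1]
        count = (end - start) // self.ITEM_SIZE
        return list(struct.unpack(f"<{count}d", self._data[start:end]))

    def __len__(self):
        return len(self._offsets) - 1

# Records of different lengths, including an empty one.
store = VarLenStore()
first = store.append([1.0, 2.0, 3.0])
second = store.append([])
third = store.append([4.5])
```

Appending only extends the buffer and adds one offset, which suits a streaming source; random reads cost one offset lookup and one slice.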
25National Archives and Records Administration
(NARA)
- Huge challenges managing digital data
- Investigate HDF5 as format for large and/or complex data records
- Initial focus on geospatial data
- Images (e.g. elevation models, aerial photography)
- Features (e.g. boundaries, roads, rivers)
- Results so far
- HDF5 data model handles all data types
- Feature data present access and size problems for HDF5
- Research leading to good performance lessons
26Sustaining HDF: not-for-profit organization
- Investigating the idea of a not-for-profit organization dedicated to long-term sustainability of HDF-based technologies
- HDF remains free and open
- Funding similar to current mechanisms, plus consulting, donors, TBD.
27HDF Information
- HDF website
- http://hdf.ncsa.uiuc.edu/
- HDF Help email address
- hdfhelp_at_ncsa.uiuc.edu
- HDF users mailing list
- hdfnews_at_ncsa.uiuc.edu
28Acknowledgements
- This report is based upon work supported in part by
- Cooperative Agreement with NASA under NASA grant NAG 5-2040 and NAG NCCS-599. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Aeronautics and Space Administration.
- Lawrence Livermore National Laboratory contract DOE LLNL B507374 and B527300.
- Electronic Records Archive of the US National Archives and Records Administration under grant number NARA NSF 02-02GPG
- Boeing, NCSA and others (http://hdf.ncsa.uiuc.edu/acknowledge.html)