Title: The Python Programming Language and HDF5: H5py.
Slide 1: The Python Programming Language and HDF5: H5py.
Application to Satellite Remote Sensing Data
Processing.
Needs of satellite remote sensing data processing
and programming languages.
Introduction to some Python elements.
Introduction to Numerical Python with Numpy.
Tying them together with H5py.
Examples.
Daniel Kahn, Science Systems and Applications Inc.
HDF/HDF-EOS Workshop XIII, Riverdale, MD 3 Nov
2009
Slide 2: Where credit is due
Andrew Collette, UCLA, author of h5py
Francesc Alted, author of PyTables and of the HDF5 1.6 Pyrex definitions (i.e. the Python-C definitions used in h5py).
Andrew also requested acknowledgment for Darren
Dale (Cornell) and Armando Sole (ESRF) for their
suggestions, patches and encouragement.
Slide 3: Modern remote sensing satellite data and data processing needs have several characteristics.
1) The data need to be subjected to numerical algorithms. The algorithms vary in their sophistication but few, if any, require supercomputer architectures, e.g. computing a zonal mean. (Fortran, C, C++)
2) The data need to be subjected to non-numerical
bookkeeping algorithms. Associated metadata
needs to be read, matched, sliced, diced, acted
upon and committed to databases. (Perl)
3) Modern systems automatically schedule processing onto lots of inexpensive, commodity CPUs. This complicates using managed (i.e. licensed) commercial runtime environments, e.g. IDL, Matlab, Mathematica.
4) Satellite processing systems are dynamic. They need customization several times over the course of each instrument's life. Development time needs to be short. (Perl, IDL, Matlab, Mathematica)
5) Data processing systems need to track data
provenance.
Slide 4: A Remote Sensing Data Processing Challenge...
Problem
We lack a programming language which spans the breadth of our remote sensing data processing needs, i.e. one which can handle both numerical and non-numerical processing (including HDF5) and that has a short development cycle.
Goal
Find a programming language which allows fast
development and is adept at numerical processing
as well as non-numerical processing. Adept
doesn't mean best, just pretty good.
Slide 5: ...and the Python Programming Language

Why Python fits the bill better than:

Fortran
As an interpreted language with dynamic variable bindings, Python offers more flexibility during development. The multidimensional array extension (numpy) provides efficient and competitive array operations. It is more similar to IDL or Matlab...

IDL or Matlab
No license required, which can save money on several levels. The Python language is better suited to non-numerical data processing tasks such as are encountered in production processing of remote sensing data. It is more similar to Perl...

Perl/PDL
Nicer syntax for array operations. More complete HDF5 support (h5py vs. PDL HDF5).
Slide 6: Some elements of Python
Python is an interpreted language. Python
statements can easily be run from an interactive
prompt. Python programs are written in ASCII
files.
Python has many of the same elements as
programming languages used for numerical
processing (i.e. loops, conditionals, etc.) plus...
Name spaces
Lists
Dictionaries
Slide 7: Python has name spaces
Name spaces allow related functions and data to be grouped together in a way that will not conflict with unrelated functions and data which happen to be named the same.
Name spaces are very useful, but are only relevant
here because Python name spaces correspond to
Python modules. The numerical and HDF5 routines
we will see later are implemented as Python
modules.
For example, to open an HDF5 file in Python we
first import the h5py module and then we open the
file using the File function of the h5py module.
import h5py                              # First import the module.
FileID = h5py.File('hdf5_test.h5','r')   # Now use the File function; note the h5py module name prefix.
We could import other modules simultaneously
which have an unrelated function called File and
there would be no conflict or error.
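As a small illustration (a minimal sketch using two standard-library modules, not h5py), os.path and re both define a function named split, and the module prefix keeps the two from colliding:

import os.path
import re

print os.path.split('/arrays/3D int array')   # ('/arrays', '3D int array')
print re.split(',', '1.0,70')                 # ['1.0', '70']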
Slide 8: Python has Lists
These are 1D arrays which are useful for solving
many problems. List index notation looks like 1D
array notation in many languages.
>>> MyList = [1.0, 70]                       # initialize variable to 2 element list
>>> MyList.append("Demo Data!")              # Append an element, NOT like fortran
>>> MyList.append(MyList[0] + MyList[1])     # Append sum of first 2 elements
>>> print MyList
[1.0, 70, 'Demo Data!', 71.0]
Python has Dictionaries
(a.k.a. hash tables) These map symbols to objects
in analogy to how a 1D list maps an integer index
value to an object.
>>> MyDictionary = {}                        # Initialize empty Dictionary
>>> MyDictionary[0] = 1
>>> MyDictionary['Gadzooks'] = 70
>>> MyDictionary['Kind of Data'] = 'Demo Data!'
>>> MyDictionary['Result'] = MyDictionary[0] + MyDictionary['Gadzooks']
>>> print MyDictionary
{0: 1, 'Kind of Data': 'Demo Data!', 'Result': 71, 'Gadzooks': 70}
Slide 9: Python Lists vs Python Dictionaries vs Arrays in Fortran
Index to a List must be an integer, but a
Dictionary can be indexed by an integer or string.
MyList[Integer]  vs.  MyDictionary[Integer or String]
The objects referenced by a List or Dictionary
can be any Python object. This is in contrast to
more traditional languages, e.g. the elements of a Fortran array must be of a particular type such as REAL*4.

MyList = [34, "String Datum"]   or   MyDictionary = {'FirstKey': 34, 'SecondKey': "String Datum"}
Slide 10: The List and Dictionary data structures make it possible to write flexible programs very quickly. However, they are not good for numerical computation. The variety of objects (number, string, etc.) which they can reference makes computation slow.
For example, adding array elements: MyList[0] + MyList[1]

Python must check at run time whether MyList[0] and MyList[1] are objects that can be added together, and not, say, a number and a string. In a loop this check is performed at every iteration!
What to do?
Enter Numpy
Numpy is a package for use with Python which
provides multidimensional arrays of numeric (and
other) types and extends Python syntax to provide
a familiar interface to the arrays.
Numpy extends Python syntax to allow the
expression of vector operations similar to those
in IDL, Matlab, or Fortran (90 and later).
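As a small sketch of the difference (the three-element arrays here are made up), an element-wise add over plain Python lists loops in the interpreter, while the numpy version is a single vector expression:

import numpy

# Element-wise add with plain Python lists: the interpreter checks the
# operand types on every iteration of the loop.
a_list = [1.0, 2.0, 3.0]
b_list = [4.0, 5.0, 6.0]
c_list = [a_list[i] + b_list[i] for i in range(len(a_list))]

# The same operation on numpy arrays: one vector expression, with the loop
# and type handling done in compiled code.
a = numpy.array(a_list)
b = numpy.array(b_list)
c = a + b

print c_list   # [5.0, 7.0, 9.0]
print c        # [ 5.  7.  9.]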
Slide 11: Numpy Example: Create, print, and add two 2D arrays

Build from Python lists of lists of elements (the module name is numpy)...

>>> a = numpy.array([[1,2,3],[4,5,6]])
>>> print a
[[1 2 3]
 [4 5 6]]

Build from dimension sizes...

>>> b = numpy.ones((2,3))
>>> print b
[[ 1.  1.  1.]
 [ 1.  1.  1.]]

Print a selected element...

>>> print a[1,2]
6

Example: Add two-dimensional arrays.

>>> print a+b
[[ 2.  3.  4.]
 [ 5.  6.  7.]]
>>>
Slide 12: The HDF5 connection: H5py

H5py is a Python-HDF5 interface module written by Andrew Collette. Its design allows the use of familiar Python structures for working with HDF5 files.
The interface maps Python syntax for familiar
Python objects to similar objects in the HDF5
file.
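For example, writing works the same way. Here is a minimal sketch (the file, group, and dataset names are just for illustration) that creates a group, a dataset, and an attribute using dictionary-style assignment:

import numpy
import h5py

fid = h5py.File('sketch.h5', 'w')
grp = fid.create_group('arrays')
grp['2D float array'] = numpy.ones((2, 3))   # assignment creates the dataset
fid.attrs['Source'] = 'h5py write sketch'    # attributes use the same syntax
fid.close()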
Slide 13: Here is a familiar example: an HDF5 file from the HDFView distribution.
Here is how to read the 3D int array using h5py.
>>> import h5py
>>> fid = h5py.File('hdf5_test.h5','r')
>>> group = fid['arrays']
>>> The3DArray = group['3D int array'].value
>>> fid.close()
>>> The3DArray
array([[[   174,     27,      0, ...,    102,     71, 100009],
        [   171,     27,      0, ...,    194,     79, 100109],
        [   172,     27,      0, ...,    102,     55, 100209],
        ...,
Slide 14: Equivalence of HDF5 Groups and Python Dictionaries
Print value of dictionary entry
>>> MyDictionary = {'RandArray': numpy.random.random((2,2))}
>>> print MyDictionary['RandArray']
[[ 0.82066938  0.39219683]
 [ 0.86546443  0.91276533]]
Print value of HDF5 file entry
>>> fid = h5py.File('RandomArray.h5','r')
>>> print fid['RandArray'].value
[[ 0.1         3.14152908]
 [ 2.71828008  0.        ]]
Slide 15: Simple real-world example
Goal: Retrieve an HDF5 file from Configuration Management (CM) and insert CM metadata into the HDF5 file.
We want to place the CM path and revision number
inside the HDF5 file as an attribute.
Slide 16: Python script to retrieve a file from CM and store the Rev number as an attribute.
#! /usr/bin/env python
import sys
import os
import h5py

Rev = sys.argv[1]            # Specify revision number on command line.
SVNFilepath = sys.argv[2]    # Specify CM path on command line.

command = 'svn export -r ' + Rev + ' ' + SVNFilepath   # Subversion command
InStream = os.popen(command, 'r')
ExportString = InStream.read()
ExportReturnCode = InStream.close()

Elements = SVNFilepath.split('/')

# HDF5 code
fid = h5py.File(Elements[-1])    # Elements[-1] is the file name
fid.attrs['SVN Path and Revision'] = SVNFilepath + '@' + Rev
fid.close()
The h5py calls are the three lines after the '# HDF5 code' comment. Note the minimal effort needed to code the HDF5 calls.
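As a quick check (a sketch; the file name here stands in for whatever file the script exported), the new attribute can be read back with a few lines of h5py:

import h5py

fid = h5py.File('hdf5_test.h5', 'r')
print fid.attrs['SVN Path and Revision']   # prints something like <CM path>@<revision>
fid.close()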
Slide 17: Another real-world example (if NPOESS is your real world).
We have had several occasions to do data
aggregation on HDF5 files for the OMPS Limb
instrument.
Our retrieval code (Fortran) processes an orbit of data as 480 distinct pieces and places the results into 480 distinct HDF5 files. We wish to aggregate the files such that an N-dimensional array in an unaggregated file becomes an N+1-dimensional array in the aggregated HDF5 file.
The Fortran code is (mostly) the product of the
retrieval scientist, while the aggregation is a
requirement of the production data processing
system. It makes sense to aggregate as a
post-processing step to the Fortran code so as to
minimize changes to the original Fortran code.
Slide 18: Aggregation Algorithm

1) Input is a list of HDF5 files.
2) Analyze the structure of one file to generate a list of fully qualified dataset names, dimensions, and types (in the code I use a Python Dictionary and not a list).
3) Assume all files have that structure. Read corresponding datasets (of dim N) from each file into an aggregation variable (of dim N+1); a small numpy sketch of this step follows the schematic below.
4) After corresponding datasets have been read from all files, write the aggregation variable to the HDF5 file.
5) Repeat from step 3 until all datasets have been aggregated.
[Figure: Schematic of Array Aggregation for One Dataset — the corresponding array from each of Files 1 through 4 becomes one slice (Array 1 through 4) of the aggregated dataset.]
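Step 3 amounts to stacking same-shaped arrays along one new trailing dimension. Here is a small numpy sketch, with made-up 2x3 arrays standing in for the per-file datasets:

import numpy

# Four stand-in per-file arrays, each 2x3 (dim N = 2).
PerFileArrays = [numpy.zeros((2, 3), dtype=numpy.float32) + FileNumber
                 for FileNumber in range(4)]

# The aggregation variable gets one extra trailing dimension (dim N+1 = 3).
Aggregated = numpy.ndarray((2, 3, 4), dtype=numpy.float32)
for FileNumber in range(4):
    Aggregated[..., FileNumber] = PerFileArrays[FileNumber]

print Aggregated.shape   # (2, 3, 4)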
Slide 19: Data Aggregator script

# Data Aggregator. This script takes a set of HDF5 files as input.
# The files are expected to have identical structure. All the
# corresponding arrays in the input files are combined into an array
# which has dimensions N+1, where N is the number of dimensions in the
# original, constituent arrays.
import sys
import h5py
import numpy

Files = sys.argv[1:]    # Get file names from command line

# First step is to select one file and create a map of the shape and data
# type of all datasets. This is naturally done via a recursive function,
# called VisitAllObjects.
FirstFid = h5py.File(Files[0], 'r')   # Open first HDF5 file
FileInfo = {}   # Initialize FileInfo to be a dictionary. We will use it to build a mapping
                # from dataset name to a tuple containing shape and type of dataset.

# EVALUATE HDF5 HIERARCHY
# Evaluating a hierarchy is naturally a recursive process so we define a function....
def VisitAllObjects(Group, Path):
    for i in Group.items():
        if isinstance(i[1], h5py.Group):
            VisitAllObjects(i[1], Path + '/' + i[0])
        else:
            DatasetName = Path + '/' + i[0]
            FileInfo[DatasetName] = (Group[DatasetName].shape, Group[DatasetName].dtype)

VisitAllObjects(FirstFid, '')
FirstFid.close()
Slide 20: Data Aggregator script (continued)

# Print dataset paths and info to screen
for (k, v) in FileInfo.items():
    print k, v

# AGGREGATE DATA
# Now that we understand the file structure we can perform the aggregation.
OutputFileID = h5py.File('AggregatedData.h5', 'w')
NumberOfFiles = len(Files)

# Here is the meat of the code. The outer loop is over datasets, the inner over all files.
for Dataset in FileInfo.keys():
    AggregatedData = numpy.ndarray(FileInfo[Dataset][0] + (NumberOfFiles,),
                                   dtype=FileInfo[Dataset][1])
    for FileNumber in range(NumberOfFiles):
        # Open file, read data into aggregation array, and close
        fid = h5py.File(Files[FileNumber], 'r')
        AggregatedData[..., FileNumber] = fid[Dataset].value
        fid.close()
    Path = Dataset.split('/')
    map(lambda x: OutputFileID.require_group(x), Path[1:-1])
    # OutputFileID[Dataset] = AggregatedData
    OutputFileID.create_dataset(Dataset, data=AggregatedData, compression=5,
                                chunks=FileInfo[Dataset][0] + (1,))
OutputFileID.close()
Slide 21: Original, unaggregated data field

h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \
       Data/OMPS_LP_SDR_20041121_55_146_-69_-119.h5
HDF5 "Data/OMPS_LP_SDR_20041121_55_146_-69_-119.h5" {
DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 5, 3, 21 ) / ( 5, 3, 21 ) }
   ATTRIBUTE "Title" {
      DATATYPE  H5T_STRING {
         STRSIZE 25;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "Units" {
      DATATYPE  H5T_STRING {
         STRSIZE 2;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
}
}

Note: 3 dimensions.
Slide 22: Aggregated data field

h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \
       AggregatedData.h5
HDF5 "AggregatedData.h5" {
DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 5, 3, 21, 4 ) / ( 5, 3, 21, 4 ) }
}
}
Now we have 4 dimensions. The new dimension has
extent 4, corresponding to the number of input
files.
Note that none of the attributes were copied.
This is a bug, but easily fixed.
Slide 23: To fix the bug, I assume attributes (like Units) are not aggregated, and I take the attribute names and values from the first file.

The new code consists of only 2 additional lines, shown below.
def VisitAllObjects(Group, Path):
    for i in Group.items():
        if isinstance(i[1], h5py.Group):
            VisitAllObjects(i[1], Path + '/' + i[0])
        else:
            DatasetName = Path + '/' + i[0]
            FileInfo[DatasetName] = (Group[DatasetName].shape,
                                     Group[DatasetName].dtype,
                                     Group[DatasetName].attrs.listitems())
And also....
DS = OutputFileID.create_dataset(Dataset, data=AggregatedData, compression=5,
                                 chunks=FileInfo[Dataset][0] + (1,))
[DS.attrs.__setitem__(Attribute[0], Attribute[1]) for Attribute in FileInfo[Dataset][2]]
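The list comprehension above is used only for its side effect of setting each attribute. An equivalent, plainer way to write that second new line is a simple loop over the saved attribute list:

# Copy each (name, value) pair recorded from the first file onto the new dataset.
for Attribute in FileInfo[Dataset][2]:
    DS.attrs[Attribute[0]] = Attribute[1]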
Slide 24: Fixed output
h5dump -H -d /ANCILLARY_DATA/GeopotentialHeight_NCEP \
       AggregatedDataWithAttributes.h5
HDF5 "AggregatedDataWithAttributes.h5" {
DATASET "/ANCILLARY_DATA/GeopotentialHeight_NCEP" {
   DATATYPE  H5T_IEEE_F32LE
   DATASPACE  SIMPLE { ( 5, 3, 21, 4 ) / ( 5, 3, 21, 4 ) }
   ATTRIBUTE "Title" {
      DATATYPE  H5T_STRING {
         STRSIZE 25;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "Units" {
      DATATYPE  H5T_STRING {
         STRSIZE 2;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
   }
   ATTRIBUTE "_FillValue" {
      DATATYPE  H5T_IEEE_F32LE
      DATASPACE  SIMPLE { ( 1 ) / ( 1 ) }
   }
}
}
Now we have attributes.
Slide 25: Summary

Python offers a high degree of flexibility for code development; combined with easy text handling, numerical array operations, and HDF5 coding, this makes it a good candidate for solving problems in satellite remote sensing data processing. Few, if any, other computer languages offer this combination of benefits, making Python uniquely suited to the task.
Future Work
Tracking module version provenance is likely one
of the outstanding questions for Python use in
production.
Acknowledgment
Curt Tilmes at NASA Goddard funded this work via
contract NNG06HX18C task 614.5-01-07