This is part of a series on data storage with Python. HDF stands for Hierarchical Data Format, and HDF5 is a standard that was developed to allow massive data sets to be stored and read in a computationally and space-efficient way. HDF5 files can be very complex and typically require more code to manage, but they offer one of the richest feature sets of any of the data formats discussed in this series. You may not find yourself working with the base HDF5 API that often; you are more likely to use something built on top of it, like netCDF4 or PyTables. Still, it helps to understand the base HDF5 format in order to understand how PyTables and netCDF4 work.
HDF5 is a very mature technology with support in many languages and environments, including C, MATLAB, and Python. An HDF5 file stores its contents in a tree-like structure (similar to how files and folders are stored on a computer).
The above figure (which is a screenshot of an HDF5 file opened in HDFView) shows the structure of a file called `2021-08-06.h5`. Attached to the root of the file are three "groups" called `calibration`, `setup`, and `shots`. These groups can hold other groups, or they can hold arrays. If we look at the `calibration` group, we see that it holds three arrays, namely `dVdI_calibration`, `phase_calibration_earth_rate`, and `phase_calibration_rest`. Each of these arrays can be thought of as a numpy array.
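To make the tree structure concrete, here is a minimal sketch that builds a file with the same group and array names as the screenshot and then walks it with h5py's `visititems`. The file name `layout_demo.h5` and the array contents are placeholders I made up for illustration; they are not the real calibration data.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file mimicking the layout described above.
# The zeros are placeholder data, not the real calibration arrays.
path = os.path.join(tempfile.mkdtemp(), "layout_demo.h5")

with h5py.File(path, "w") as f:
    calibration = f.create_group("calibration")
    calibration["dVdI_calibration"] = np.zeros(4)
    calibration["phase_calibration_earth_rate"] = np.zeros(4)
    calibration["phase_calibration_rest"] = np.zeros(4)
    f.create_group("setup")
    f.create_group("shots")

found = []


def record(name, obj):
    """Remember each object's full path and whether it is a group or a dataset."""
    kind = "group" if isinstance(obj, h5py.Group) else "dataset"
    found.append((name, kind))


with h5py.File(path, "r") as f:
    # visititems calls record(name, obj) for every group and dataset below the root
    f.visititems(record)

for name, kind in found:
    print(f"{kind:8s} {name}")
```

Note that the arrays show up with path-like names such as `calibration/dVdI_calibration`, which is the same syntax you use to index into the file.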
HDF5 files have a number of key advantages over other file formats, as summarized in the table below:
Format | Compression | Hierarchy | Metadata |
---|---|---|---|
csv | No | No | No |
yaml | No | Yes | Yes |
npz | Yes | Multiple arrays, no hierarchy | No |
xlsx | Yes (xlsx, not xls) | Multiple sheets, no hierarchy | No |
hdf5 | Yes | Yes | Yes |
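As a rough sketch of what the compression and metadata columns mean in practice for HDF5, the snippet below writes a highly compressible array with gzip compression and attaches an attribute to it. The file name, the dataset path `measurements/zeros`, and the `units` attribute are all assumptions chosen for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, chosen only for this example.
path = os.path.join(tempfile.mkdtemp(), "compression_demo.h5")
data = np.zeros((1000, 1000))  # 8 MB of zeros: an easy case for any compressor

with h5py.File(path, "w") as f:
    # Intermediate groups ("measurements") are created automatically.
    dset = f.create_dataset(
        "measurements/zeros",
        data=data,
        compression="gzip",
        compression_opts=4,  # gzip level, 0 (fastest) to 9 (smallest)
    )
    dset.attrs["units"] = "volts"  # metadata lives next to the data

with h5py.File(path, "r") as f:
    dset = f["measurements/zeros"]
    compression = dset.compression
    units = dset.attrs["units"]
    restored = dset[()]  # decompression is transparent on read

file_size = os.path.getsize(path)
print(f"in memory: {data.nbytes} bytes, on disk: {file_size} bytes")
print(compression, units)
```

Real measurement data will compress far less dramatically than an array of zeros, so treat the size difference here as a best case.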
There is a Python library called h5py that makes it super easy to save and recall arrays and their metadata. If all you need to do is store and recall arrays, and edit the metadata on them, here is a simple example of the API:
import numpy as np
import h5py
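# 'w': create the file, overwriting it if it already exists
# 'a': open for reading and writing, creating the file if it does not exist
# 'r': read-only; the file must already exist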
f = h5py.File('test.h5', 'w')
f['consecutive'] = np.arange(5)
np.random.seed(42)
f['random/seed42'] = np.random.randn(5)
np.random.seed(43)
f['random/seed43'] = np.random.randn(5)
f['random/seed42'].attrs['seed'] = 42
print(f.keys())
# <KeysViewHDF5 ['consecutive', 'random']>
saved_data = f['consecutive'][()]
print(saved_data)
# [0 1 2 3 4]
print(f['random/seed42'][()])
# [ 0.49671415 -0.1382643 0.64768854 1.52302986 -0.23415337]
print(f['random/seed43'][()])
# [ 0.25739993 -0.90848143 -0.37850311 -0.5349156 0.85807335]
print(f['random/seed42'].attrs['seed'])
# 42
f.close()
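The example above closes the file by hand. A variation, sketched here with a made-up file name, is to open the file in a `with` block so it is closed even if an error occurs. It also shows that a dataset can be inspected and sliced without loading the whole array, which matters once the arrays get large.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file name used only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "slicing_demo.h5")

with h5py.File(path, "w") as f:
    f["consecutive"] = np.arange(10)
# The file is closed automatically when the with block exits.

with h5py.File(path, "r") as f:
    dset = f["consecutive"]  # a handle to the dataset; nothing is read yet
    shape = dset.shape       # shape and dtype come from the file's metadata
    dtype = dset.dtype
    first_three = dset[:3]   # reads only the requested slice from disk
    everything = dset[()]    # reads the full dataset into a numpy array

print(shape, dtype)
print(first_three)
```

The arrays returned by slicing are ordinary numpy arrays, so they remain usable after the file has been closed.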
HDF5 files have some other fancy features, like extendable arrays (called EArrays in PyTables) that can have rows of data appended to them after creation. While you can do this in the h5py library, there are other libraries that make this easier, such as PyTables. I'll be writing about that in a subsequent post.
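For reference, here is a minimal sketch of appending rows with h5py alone, using a resizable dataset rather than PyTables. The file name, the dataset name `shots`, and the three-column row layout are assumptions made for the example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical file and dataset names, chosen only for this sketch.
path = os.path.join(tempfile.mkdtemp(), "append_demo.h5")

with h5py.File(path, "w") as f:
    # maxshape=(None, 3) makes the first axis unlimited, so rows can be added later.
    # Resizable datasets must be chunked; chunks=True lets h5py pick a chunk shape.
    dset = f.create_dataset(
        "shots", shape=(0, 3), maxshape=(None, 3), dtype="f8", chunks=True
    )
    for i in range(4):
        row = np.full(3, float(i))
        n = dset.shape[0]
        dset.resize(n + 1, axis=0)  # grow the dataset by one row
        dset[n, :] = row            # write the new row into the freshly added slot

with h5py.File(path, "r") as f:
    stored = f["shots"][()]

print(stored.shape)
print(stored)
```

Resizing one row at a time is fine for a sketch, but for many appends it is usually cheaper to grow the dataset in larger blocks and track how many rows are actually filled.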