PyTables


This is part of a series on data storage with Python.

A previous post covered HDF5 files, and mentioned that they offer the ability to append rows of data as an experiment is in progress. In this post, we'll cover a library specifically designed to use this feature. The name of the library is PyTables.

Let's start by importing a few libraries:

import time

import numpy as np
import matplotlib.pyplot as plt
import tables as pt

For this example, we're going to pretend that we're running an experiment and measuring the output of some device over time. Let's suppose that we have a data acquisition unit (DAQ). We can simulate this device as outputting Gaussian white noise at a sampling rate f_s in Hz, with a root-mean-square amplitude of rms.

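class DAQ():
    """Simulated DAQ that outputs Gaussian white noise."""

    def __init__(self, f_s, rms):
        self.f_s = f_s  # sampling rate in Hz
        self.rms = rms  # root-mean-square noise amplitude

    def read(self, T):
        # Return the samples acquired over T seconds
        num_points = int(T * self.f_s)
        return np.random.randn(num_points) * self.rms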

daq = DAQ(f_s=10, rms=5)
data = daq.read(100)
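As a quick sanity check on the simulated device, here is a small sketch that repeats the DAQ class so that it runs on its own. It confirms that a 100 s read at 10 Hz returns 1000 points, and that the sample standard deviation lands close to rms. The fixed seed is my own addition, only there to make the run repeatable.

```python
import numpy as np


class DAQ:
    """Simulated DAQ that outputs Gaussian white noise."""

    def __init__(self, f_s, rms):
        self.f_s = f_s  # sampling rate in Hz
        self.rms = rms  # root-mean-square noise amplitude

    def read(self, T):
        # Return the samples acquired over T seconds
        num_points = int(T * self.f_s)
        return np.random.randn(num_points) * self.rms


np.random.seed(0)  # assumption: fixed seed, only for repeatability
daq = DAQ(f_s=10, rms=5)
data = daq.read(100)

num_points = len(data)       # 100 s * 10 Hz
measured_rms = np.std(data)  # should be close to 5 for zero-mean noise
print(num_points, measured_rms)
```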

Great, now let's create a file to hold the output of this DAQ:

h5file = pt.open_file('2020-10-07-demo.h5', 'w')

In this example, I'm using the 'w' option to say that we'll overwrite the file if it already exists. To append, we would use 'a', and for read-only, we'd use 'r'.
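To make the three modes concrete, here is a minimal sketch that writes to a throwaway file in a temporary directory. The file name and the node names first and second are made up for this illustration. It creates the file with 'w', adds a second array with 'a', and then lists the contents with 'r'.

```python
import os
import tempfile

import tables as pt

# Hypothetical scratch file, kept out of the working directory
path = os.path.join(tempfile.mkdtemp(), 'modes-demo.h5')

# 'w' creates the file, replacing any existing file of that name
with pt.open_file(path, 'w') as f:
    f.create_array(f.root, 'first', [1, 2, 3])

# 'a' reopens the file and keeps what is already there
with pt.open_file(path, 'a') as f:
    f.create_array(f.root, 'second', [4, 5, 6])

# 'r' opens the file read-only
with pt.open_file(path, 'r') as f:
    node_names = sorted(node._v_name for node in f.list_nodes(f.root))

print(node_names)
```

Using the file as a context manager also closes it for us at the end of each block.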

Before we put any data in, we need to define the structure that we're going to use. There is a more compact way of doing this, but perhaps the easiest way is to define a class that extends PyTables' IsDescription class.

class MyDescription(pt.IsDescription):
    timestamp = pt.Float64Col()
    voltage = pt.Float32Col()

table_detector_noise = h5file.create_table(h5file.root, 'detector_noise', MyDescription)

In this snippet, we're telling PyTables to create a table called detector_noise at the root of the HDF5 file. The table has two columns. The first column is called timestamp and each value is a 64-bit floating-point number. The second column is called voltage and each value is a 32-bit floating-point number.

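If you wanted to do this in one line, you could skip the class and pass a dictionary instead. Use this in place of the call above, not in addition to it, since a second table named detector_noise can't be created in the same group:

table_detector_noise = h5file.create_table(h5file.root, 'detector_noise', {'timestamp': pt.Float64Col(), 'voltage': pt.Float32Col()})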
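To check that the class and the dictionary describe the same structure, here is a short sketch that builds one table each way in a scratch file and compares the column types. The table names from_class and from_dict are hypothetical and are only used here so that the two tables can live side by side.

```python
import os
import tempfile

import tables as pt


class MyDescription(pt.IsDescription):
    timestamp = pt.Float64Col()
    voltage = pt.Float32Col()


# Hypothetical scratch file for the comparison
path = os.path.join(tempfile.mkdtemp(), 'description-demo.h5')

with pt.open_file(path, 'w') as f:
    from_class = f.create_table(f.root, 'from_class', MyDescription)
    from_dict = f.create_table(
        f.root, 'from_dict',
        {'timestamp': pt.Float64Col(), 'voltage': pt.Float32Col()})

    # coldtypes maps each column name to its NumPy dtype
    class_dtypes = dict(from_class.coldtypes)
    dict_dtypes = dict(from_dict.coldtypes)

print(class_dtypes)
print(dict_dtypes)
```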

You can also set this structure up to store an array of values in each cell instead of a single number. For example, if you were digitizing a waveform on an oscilloscope every second, and your oscilloscope output an array of 500 16-bit floating-point numbers, you could write

h5file.create_table(h5file.root, 'oscilloscope_table', {'timestamp': pt.Float64Col(), 'trace': pt.Float16Col(shape=(500,))})
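Here is a rough sketch of how such a table could be filled and read back. The sine wave stands in for a real oscilloscope trace, and the scratch file path is made up for the example. Each row holds one timestamp and one 500-point trace, so reading the trace column back gives a two-dimensional array.

```python
import os
import tempfile

import numpy as np
import tables as pt

# Hypothetical scratch file for the example
path = os.path.join(tempfile.mkdtemp(), 'scope-demo.h5')

with pt.open_file(path, 'w') as f:
    scope = f.create_table(
        f.root, 'oscilloscope_table',
        {'timestamp': pt.Float64Col(),
         'trace': pt.Float16Col(shape=(500,))})

    # Placeholder waveform standing in for a digitized trace
    fake_trace = np.sin(np.linspace(0, 2 * np.pi, 500))

    row = scope.row
    for i in range(3):
        row['timestamp'] = float(i)  # one trace per second
        row['trace'] = fake_trace    # the whole array goes into one cell
        row.append()
    scope.flush()

    traces = scope.col('trace')

print(traces.shape, traces.dtype)
```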

Now that we've defined the table structure, let's investigate how to load data into this file. We're going to simulate getting data off of our DAQ and entering it row by row into the file. We'll assume that each read of the DAQ covers 0.5 s, which at f_s = 10 Hz gives us 5 points spaced 0.1 s apart.

initial_timestamp = time.time()
last_time = initial_timestamp

for _ in range(10):
    # read() takes a duration in seconds: 0.5 s at 10 Hz yields 5 points
    data = daq.read(0.5)

    for data_value in data:
        row = table_detector_noise.row
        row['voltage'] = data_value
        row['timestamp'] = last_time
        last_time += 1 / daq.f_s

        row.append()
    # Write the buffered rows to disk after each read
    table_detector_noise.flush()

To summarize what we did here:

  1. Set up the timestamp at the beginning of the experiment
  2. Read a batch of values from the DAQ
  3. For each value that we measured:
    1. Get the table's row object, which acts as a buffer for the new row
    2. Add the value to the voltage column
    3. Add the timestamp to the timestamp column
    4. Increment a running variable that tracks which timestamp the next value corresponds to
    5. Append the row to the table
  4. Periodically write the data to the disk by flushing the table.
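The row-by-row steps above can also be done one batch at a time. This is a sketch of an alternative, not what the loop above does: it fills a NumPy structured array with the table's dtype and hands the whole batch to Table.append. The scratch file, the inline noise generator, and the variable names are assumptions made so that the example runs on its own.

```python
import os
import tempfile
import time

import numpy as np
import tables as pt

# Hypothetical scratch file and acquisition settings
path = os.path.join(tempfile.mkdtemp(), 'batch-demo.h5')
f_s = 10             # sampling rate in Hz
points_per_read = 5  # 0.5 s of data per read

with pt.open_file(path, 'w') as f:
    table = f.create_table(
        f.root, 'detector_noise',
        {'timestamp': pt.Float64Col(), 'voltage': pt.Float32Col()})

    start = time.time()
    for i in range(10):
        # Stand-in for daq.read(0.5): 5 samples of Gaussian noise
        voltages = np.random.randn(points_per_read) * 5

        # One structured array per read, with the same fields as the table
        batch = np.zeros(points_per_read, dtype=table.dtype)
        batch['voltage'] = voltages
        sample_index = i * points_per_read + np.arange(points_per_read)
        batch['timestamp'] = start + sample_index / f_s

        table.append(batch)  # append all 5 rows at once
        table.flush()        # write them to disk

    nrows = table.nrows
    stamps = table.col('timestamp')

print(nrows)
```

For small reads like this the two approaches are interchangeable; the batched form mainly saves the inner Python loop.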

Reading Back the Data

We can read back our data by asking for an entire column with the table's col method:

voltages = table_detector_noise.col('voltage')
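If we only want part of the table, Table.read takes start and stop arguments and returns the selected rows as a structured array. Here is a small self-contained sketch. The scratch file and the ramp of fake voltages are made up so that the result is easy to check by eye.

```python
import os
import tempfile

import tables as pt

# Hypothetical scratch file for the example
path = os.path.join(tempfile.mkdtemp(), 'read-demo.h5')

with pt.open_file(path, 'w') as f:
    table = f.create_table(
        f.root, 'detector_noise',
        {'timestamp': pt.Float64Col(), 'voltage': pt.Float32Col()})

    row = table.row
    for i in range(50):
        row['timestamp'] = i / 10   # 10 Hz sampling
        row['voltage'] = float(i)   # fake ramp instead of noise
        row.append()
    table.flush()

    voltages = table.col('voltage')           # the whole column
    first_rows = table.read(start=0, stop=5)  # only rows 0 to 4
    first_voltages = first_rows['voltage']

print(len(voltages), first_voltages)
```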

For time-like data, we can take an extra step to cast it into NumPy's datetime64 format:

timestamps = table_detector_noise.col('timestamp')
timestamps = (timestamps * 1e9).astype('datetime64[ns]')
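To see what this conversion does, here is a standalone sketch with two made-up timestamps 0.1 s apart. It goes through an explicit int64 step, which is my own variation on the one-liner above, so that the rounding to whole nanoseconds is visible. Keep in mind that a float64 holding seconds since 1970 only resolves about a quarter of a microsecond at present-day values, so the nanosecond digits are not meaningful.

```python
import numpy as np

# Hypothetical timestamps in seconds since the Unix epoch, 0.1 s apart
timestamps = np.array([1602072000.0, 1602072000.1])

# Seconds -> whole nanoseconds -> datetime64[ns]
nanoseconds = np.round(timestamps * 1e9).astype('int64')
stamps = nanoseconds.astype('datetime64[ns]')

# Spacing between the two samples, expressed in nanoseconds
step_ns = (stamps[1] - stamps[0]) / np.timedelta64(1, 'ns')

print(stamps)
print(step_ns)
```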

And then plot using matplotlib:

plt.plot(timestamps, voltages)

Attributes

Attributes on tables can be set using a dictionary-like interface.

table_detector_noise.attrs['foo'] = 'bar'

Tables can be organized in groups. By default, everything is going to be mounted to the "root" group. In addition to table-level attributes, you can also attach attributes to groups using the _v_attrs property.

my_group = h5file.root
my_group._v_attrs['foo'] = 'bar'
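Attributes are stored in the file alongside the data, so they survive closing and reopening it. The sketch below sets one attribute on a table and one on the root group, then reads both back in read-only mode. The attribute names units and operator, their values, and the scratch file are hypothetical.

```python
import os
import tempfile

import tables as pt

# Hypothetical scratch file for the example
path = os.path.join(tempfile.mkdtemp(), 'attrs-demo.h5')

with pt.open_file(path, 'w') as f:
    table = f.create_table(
        f.root, 'detector_noise',
        {'timestamp': pt.Float64Col(), 'voltage': pt.Float32Col()})

    table.attrs['units'] = 'V'             # table-level attribute
    f.root._v_attrs['operator'] = 'alice'  # group-level attribute

# Reopen read-only and fetch the attributes again
with pt.open_file(path, 'r') as f:
    units = f.root.detector_noise.attrs['units']
    operator = f.root._v_attrs['operator']
    # Names of the user-defined attributes on the table
    user_attrs = f.root.detector_noise.attrs._f_list('user')

print(units, operator, user_attrs)
```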

I wrote an API for PyTables that adds column-level attributes, but it hasn't been merged yet (mostly because I still need to write the docs and unit tests for it). Hopefully someday it can be used for even more granular metadata.