This is part of a series on data storage with Python.

In this post, we'll be covering .npy and .npz files. NumPy is short for "Numerical Python", which is *the* library for handling arrays in Python.

Let's suppose that we have a dataset consisting of one million random 64-bit floating-point numbers:

```
import numpy as np
np.random.seed(42)
num_samples = int(1e6)
data = np.random.randn(num_samples) # Generate 1M 64-bit floating point numbers
```

We can save these data to the disk with one line of code:

```
np.save('filename.npy', data)
```

and read it back with another:

```
data = np.load('filename.npy')
```
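For arrays too large to fit comfortably in memory, `np.load` can also memory-map the file instead of reading it all at once, so only the slices you actually touch are pulled from disk:

```python
import numpy as np

data = np.random.randn(int(1e6))
np.save('filename.npy', data)

# mmap_mode='r' maps the file read-only; nothing is loaded until accessed
mm = np.load('filename.npy', mmap_mode='r')
print(mm.shape)  # (1000000,)
print(mm[:3])    # only these elements are actually read from disk
```

This works because the on-disk layout of a `.npy` file is just the raw array bytes after a short header.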

If we look at the size of this file, we find that it is exactly 8,000,128 bytes. The first 128 bytes are the header (the header is padded to a multiple of 64 bytes, and may be longer than 128 bytes for more complicated dtypes such as record arrays). The remaining 8,000,000 bytes are the data: 8 bytes (64 bits) per value in the array. On most computers, the file size will read as 7.63 MB, because file browsers typically use binary megabytes: 1 MB = 2^20 = 1,048,576 bytes, and 8,000,128 / 2^20 ≈ 7.63.
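You can verify this breakdown yourself with `os.path.getsize`, and even peek at the header, which is stored as mostly human-readable text:

```python
import os
import numpy as np

np.save('filename.npy', np.random.randn(int(1e6)))

size = os.path.getsize('filename.npy')
print(size)  # 8000128 = 128-byte header + 1,000,000 values * 8 bytes

# The header starts with a magic string and version, followed by a
# plain-text dict describing the dtype, memory order, and shape
with open('filename.npy', 'rb') as f:
    header = f.read(128)
print(header)
```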

The `np.save` function only saves one array to a file. If you want to save multiple arrays to the same file, NumPy offers `np.savez`, which packs multiple arrays into a single Zip file. An example is shown here:

```
np.random.seed(42)
data_1M = np.random.randn(int(1e6)) # Generate 1M 64-bit floating point numbers
data_1k = np.random.randn(int(1e3)) # Generate 1k 64-bit floating point numbers
np.savez('filename.npz', data_1M=data_1M, data_1k=data_1k)
```

The resulting file is 8,008,514 bytes long: 8,000,128 bytes for the `.npy` holding the 1-million-point array, 8,128 bytes for the 1-thousand-point array (each array inside the zip carries its own 128-byte header), and 258 bytes of Zip container overhead.

The data can be read back by indexing into the loaded archive by name:

```
npz = np.load('filename.npz')
print(npz.files) # Use this to peek at what the names of the variables are
# ['data_1M', 'data_1k']
data_1M = npz['data_1M']
data_1k = npz['data_1k']
```

We can also compress these zip files with `np.savez_compressed`:

```
np.random.seed(64)
data_random = np.random.randn(int(1e6)) # 64-bit floats
data_linear = np.arange(int(1e6))       # 64-bit ints
data_zeros = np.zeros(int(1e6))         # 64-bit floats
np.savez('random.npz', data_random)                        # 8,000,264 bytes
np.savez_compressed('random-compressed.npz', data_random)  # 7,685,544 bytes
np.savez('linear.npz', data_linear)                        # 8,000,264 bytes
np.savez_compressed('linear-compressed.npz', data_linear)  # 1,513,988 bytes
np.savez('zeros.npz', data_zeros)                          # 8,000,264 bytes
np.savez_compressed('zeros-compressed.npz', data_zeros)    # 7,998 bytes
np.savez('all.npz', zeros=data_zeros, linear=data_linear, random=data_random)  # 24,000,752 bytes
np.savez_compressed('all-compressed.npz', zeros=data_zeros, linear=data_linear, random=data_random)  # 9,207,490 bytes
```
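The sizes in the comments above can be checked with `os.path.getsize`; here is the all-zeros case (the exact compressed byte count may vary slightly across zlib versions, so treat the figures as representative):

```python
import os
import numpy as np

data_zeros = np.zeros(int(1e6))
np.savez('zeros.npz', data_zeros)
np.savez_compressed('zeros-compressed.npz', data_zeros)

raw = os.path.getsize('zeros.npz')
packed = os.path.getsize('zeros-compressed.npz')
# An all-zeros array is maximally repetitive, so deflate crushes it
print(raw, packed, f'{packed / raw:.1%}')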

Description | Uncompressed (bytes) | Compressed (bytes) | Ratio |
---|---|---|---|
A million samples from a normal distribution | 8,000,264 | 7,685,544 | 96% |
The first million integers | 8,000,264 | 1,513,988 | 19% |
A million zeros | 8,000,264 | 7,998 | 0.1% |
All three in one archive | 24,000,752 | 9,207,490 | 38% (roughly the average of the other three) |

As you can see from the table, compression can be an effective way to store data more compactly. Because each array in the archive is deflated independently, arrays with a lot of repetition compress dramatically even when the other arrays in the same file barely shrink at all.

Numpy files are an easy, fast way to store data when you're working with Numpy. However, it's difficult to open `.npy` and `.npz` files in other tools like Excel or Matlab (though if you insist, there are Matlab libraries for this).

The biggest drawback of Numpy files is that they lack tools to annotate or describe the data. In CSV, JSON, YAML, or text files, you can write header information at the top. However, the only place to stuff metadata in Numpy files is the filename. This can lead to wacky filenames like `2021-03-25-jonathans-data-rigol-function-generator-2.06V-60Hz-battery-turned-off-v6.npz` ... yuck.
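One partial workaround, sketched below, is to serialize a metadata dict to JSON and store it in the archive as a zero-dimensional string array alongside the data (the instrument name and settings here are made up for illustration):

```python
import json
import numpy as np

data = np.random.randn(int(1e3))
# Hypothetical metadata describing how the data was collected
meta = {'instrument': 'rigol function generator', 'amplitude_V': 2.06, 'freq_Hz': 60}

# A JSON string round-trips cleanly through a 0-d unicode array
np.savez('annotated.npz', data=data, meta=np.array(json.dumps(meta)))

npz = np.load('annotated.npz')
recovered = json.loads(str(npz['meta']))
print(recovered['freq_Hz'])  # 60
```

This keeps the annotations inside the file itself instead of the filename, though it's a convention you have to remember rather than something the format supports natively.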

Looking ahead, there is some talk that the Numpy Format Specification will move to the much more robust HDF5 format. A lot of the scientific community is drifting towards HDF5 for storing everything, and I wouldn't be surprised if, in ten years, `.npy` and `.npz` files were really simple `.hdf5` files.

Numpy files are great if you're writing and reading "naked" Numpy arrays, where you're only concerned about the data alone. They don't work well for storing annotations about your variables. If you want to store annotations to better describe your data, consider using an HDF5-based format.