This is part of a series on data storage with Python.
In this post, we'll cover .npy and .npz files. NumPy is short for "Numerical Python"; it is the standard library for working with arrays in Python.
Let's suppose we have a dataset consisting of one million random 64-bit floating-point numbers:
```python
import numpy as np

np.random.seed(42)
num_samples = int(1e6)
data = np.random.randn(num_samples)  # generate 1M 64-bit floating-point numbers
```
We can save this data to disk with one line of code:

```python
np.save('filename.npy', data)
```

and read it back:

```python
data = np.load('filename.npy')
```
If we look at the size of this file, we find that it is exactly 8,000,128 bytes. The first 128 bytes are header information (the header is a multiple of 64 bytes, and may be longer than 128 bytes for more complicated record arrays). The remaining 8,000,000 bytes contain the data: 8 bytes (64 bits) per value in the array. On most computers, the file size will read as 7.6 MB, because 1 MB is 2^20 = 1,048,576 bytes, and 8,000,128 / 2^20 ≈ 7.63 MB.
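We can verify this breakdown directly. This is a sketch assuming a simple 1-D float64 array (more complicated dtypes produce longer headers):

```python
import os

import numpy as np

np.random.seed(42)
data = np.random.randn(int(1e6))
np.save('filename.npy', data)

size = os.path.getsize('filename.npy')
print(size)                # 8000128 bytes on disk
print(data.nbytes)         # 8000000 bytes of raw array data
print(size - data.nbytes)  # 128 bytes of header
```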
The np.save function saves only one array per file. If you want to save multiple arrays to the same file, NumPy offers np.savez, which bundles multiple arrays into a single ZIP archive. An example is shown here:
```python
np.random.seed(42)
data_1M = np.random.randn(int(1e6))  # generate 1M 64-bit floating-point numbers
data_1k = np.random.randn(int(1e3))  # generate 1k 64-bit floating-point numbers
np.savez('filename.npz', data_1M=data_1M, data_1k=data_1k)
```
The resulting file is 8,008,514 bytes long: 8,000,000 bytes for the 1-million-point array, 8,000 bytes for the 1-thousand-point array, and 514 bytes of overhead (the two .npy headers plus the ZIP bookkeeping).
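Under the hood, an .npz file is an ordinary (uncompressed) ZIP archive whose members are complete .npy files, one per array, so we can inspect it with the standard zipfile module:

```python
import zipfile

import numpy as np

np.random.seed(42)
np.savez('filename.npz',
         data_1M=np.random.randn(int(1e6)),
         data_1k=np.random.randn(int(1e3)))

# List each member of the archive and its size: every member carries
# its own 128-byte .npy header on top of the raw array bytes.
with zipfile.ZipFile('filename.npz') as zf:
    for info in zf.infolist():
        print(info.filename, info.file_size)
# data_1M.npy 8000128
# data_1k.npy 8128
```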
The data can be read back like so:
```python
npz = np.load('filename.npz')
print(npz.files)  # use this to peek at the names of the stored arrays
# ['data_1M', 'data_1k']
data_1M = npz['data_1M']
data_1k = npz['data_1k']
```
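Note that np.load does not read the arrays immediately here: for an .npz file it returns an NpzFile object that keeps the file handle open and reads each array on first access. If you want the handle closed deterministically, NpzFile also works as a context manager; a small sketch:

```python
import numpy as np

np.savez('filename.npz', data=np.arange(10))

# The with-block closes the underlying file handle when it exits.
with np.load('filename.npz') as npz:
    data = npz['data']  # the array is actually read here, on access

print(data.sum())  # 45
```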
Furthermore, we can apply compression to these ZIP archives with np.savez_compressed:
```python
np.random.seed(64)
data_random = np.random.randn(int(1e6))  # 64-bit floats
data_linear = np.arange(int(1e6))        # 64-bit ints
data_zeros = np.zeros(int(1e6))          # 64-bit floats

np.savez('random.npz', data_random)                        # 8,000,264 bytes
np.savez_compressed('random-compressed.npz', data_random)  # 7,685,544 bytes

np.savez('linear.npz', data_linear)                        # 8,000,264 bytes
np.savez_compressed('linear-compressed.npz', data_linear)  # 1,513,988 bytes

np.savez('zeros.npz', data_zeros)                          # 8,000,264 bytes
np.savez_compressed('zeros-compressed.npz', data_zeros)    # 7,998 bytes

np.savez('all.npz', zeros=data_zeros, linear=data_linear,
         random=data_random)                               # 24,000,752 bytes
np.savez_compressed('all-compressed.npz', zeros=data_zeros,
                    linear=data_linear, random=data_random)
```
| Description | Uncompressed (bytes) | Compressed (bytes) | Ratio |
|---|---|---|---|
| A million samples from a normal distribution | 8,000,264 | 7,685,544 | 96% |
| The first million integers | 8,000,264 | 1,513,988 | 19% |
| A million zeros | 8,000,264 | 7,998 | 0.1% |
| A ZIP of the other three | 24,000,752 | 9,207,490 | 96%/3 + 19%/3 + 0.1%/3 ≈ 38% |
As you can see from the table, compression can be an effective way to store data more compactly. Because each array in the archive is compressed independently, compression does a great job of deflating arrays with a lot of repetition, even when they are bundled alongside arrays without much repetition.
NumPy files are an easy, fast way to store data when you're working with NumPy. However, it's difficult to open .npy and .npz files in other tools like Excel or MATLAB (though if you insist, there are MATLAB libraries for this).
The biggest drawback of NumPy files is that they lack tools to annotate or describe the data. In CSV, JSON, YAML, or text files, you can write header information at the top. With NumPy files, however, the only place to stuff metadata is the filename. This can lead to some wacky filenames.
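One workaround, sketched below (this is a convention, not an official NumPy feature), is to smuggle the metadata into the .npz itself as an extra array holding a JSON string:

```python
import json

import numpy as np

data = np.arange(int(1e3), dtype=np.float64)
meta = {"description": "first thousand integers as float64", "seed": None}

# A 0-d NumPy string array rides along next to the data in the archive.
np.savez('annotated.npz', data=data, meta=np.array(json.dumps(meta)))

npz = np.load('annotated.npz')
recovered = json.loads(npz['meta'].item())  # .item() unwraps the 0-d array
print(recovered['description'])
```

The obvious cost is that any other reader of the file must know about this convention to find the metadata.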
Looking ahead, there is some talk that the NumPy Format Specification will move to the much more robust HDF5 format. A lot of the scientific community is drifting towards HDF5 for storing everything, and I wouldn't be surprised if, in 10 years, .npz files were really just simple HDF5 files.
NumPy files are great if you're writing and reading "naked" NumPy arrays, where you care only about the data itself. They don't work well for storing annotations about your variables. If you want to store annotations to better describe your data, consider using an HDF5-based format.