This is part of a series on data storage with Python.

In this post, we'll be covering .npy and .npz files. NumPy is short for "Numerical Python", which is *the* library for handling arrays in Python.

Let's suppose that we have a dataset consisting of one million random 64-bit floating-point numbers:

```
import numpy as np
np.random.seed(42)
num_samples = int(1e6)
data = np.random.randn(num_samples) # Generate 1M 64-bit floating point numbers
```

We can save these data to the disk with one line of code:

```
np.save('filename.npy', data)
```

and read it back with another:

```
data = np.load('filename.npy')
```
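For arrays too large to fit comfortably in memory, `np.load` can also memory-map the file instead of reading it all at once, so only the slices you actually touch are pulled from disk:

```python
import numpy as np

data = np.random.randn(int(1e6))
np.save('filename.npy', data)

# mmap_mode='r' maps the file read-only; nothing is loaded until accessed
mm = np.load('filename.npy', mmap_mode='r')
print(mm.shape)  # (1000000,)
print(mm[:3])    # only these elements are actually read from disk
```

This works because the on-disk layout of a `.npy` file is just the raw array bytes after a short header.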

If we look at the size of this file, we find that it is exactly 8,000,128 bytes. The first 128 bytes are the header (the header is padded to a multiple of 64 bytes, and may be longer than 128 bytes for more complicated dtypes such as record arrays). The remaining 8,000,000 bytes are the data: 8 bytes (64 bits) per value in the array. On most computers, the file size will read as 7.63 MB, because file browsers typically use binary megabytes: 1 MB = 2^20 = 1,048,576 bytes, and 8,000,128 / 2^20 ≈ 7.63.
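You can verify this breakdown yourself with `os.path.getsize`, and even peek at the header, which is stored as mostly human-readable text:

```python
import os
import numpy as np

np.save('filename.npy', np.random.randn(int(1e6)))

size = os.path.getsize('filename.npy')
print(size)  # 8000128 = 128-byte header + 1,000,000 values * 8 bytes

# The header starts with a magic string and version, followed by a
# plain-text dict describing the dtype, memory order, and shape
with open('filename.npy', 'rb') as f:
    header = f.read(128)
print(header)
```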

The `np.save` function only saves one array to a file. If you want to save multiple arrays to the same file, NumPy offers `np.savez`, which packs multiple arrays into a single Zip file. An example is shown here:

```
np.random.seed(42)
data_1M = np.random.randn(int(1e6)) # Generate 1M 64-bit floating point numbers
data_1k = np.random.randn(int(1e3)) # Generate 1k 64-bit floating point numbers
np.savez('filename.npz', data_1M=data_1M, data_1k=data_1k)
```

The resulting file is 8,008,514 bytes long: 8,000,128 bytes for the `.npy` holding the 1-million-point array, 8,128 bytes for the 1-thousand-point array (each array inside the zip carries its own 128-byte header), and 258 bytes of Zip container overhead.

The data can be read back by indexing into the loaded archive by name:

```
npz = np.load('filename.npz')
print(npz.files) # Use this to peek at what the names of the variables are
# ['data_1M', 'data_1k']
data_1M = npz['data_1M']
data_1k = npz['data_1k']
```

We can also compress these zip files with `np.savez_compressed`:

```
np.random.seed(64)
data_random = np.random.randn(int(1e6)) # 64-bit floats
data_linear = np.arange(int(1e6))       # 64-bit ints
data_zeros = np.zeros(int(1e6))         # 64-bit floats
np.savez('random.npz', data_random)                        # 8,000,264 bytes
np.savez_compressed('random-compressed.npz', data_random)  # 7,685,544 bytes
np.savez('linear.npz', data_linear)                        # 8,000,264 bytes
np.savez_compressed('linear-compressed.npz', data_linear)  # 1,513,988 bytes
np.savez('zeros.npz', data_zeros)                          # 8,000,264 bytes
np.savez_compressed('zeros-compressed.npz', data_zeros)    # 7,998 bytes
np.savez('all.npz', zeros=data_zeros, linear=data_linear, random=data_random)  # 24,000,752 bytes
np.savez_compressed('all-compressed.npz', zeros=data_zeros, linear=data_linear, random=data_random)  # 9,207,490 bytes
```
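The sizes in the comments above can be checked with `os.path.getsize`; here is the all-zeros case (the exact compressed byte count may vary slightly across zlib versions, so treat the figures as representative):

```python
import os
import numpy as np

data_zeros = np.zeros(int(1e6))
np.savez('zeros.npz', data_zeros)
np.savez_compressed('zeros-compressed.npz', data_zeros)

raw = os.path.getsize('zeros.npz')
packed = os.path.getsize('zeros-compressed.npz')
# An all-zeros array is maximally repetitive, so deflate crushes it
print(raw, packed, f'{packed / raw:.1%}')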

Description | Uncompressed (bytes) | Compressed (bytes) | Ratio |
---|---|---|---|
A million samples from a normal distribution | 8,000,264 | 7,685,544 | 96% |
The first million integers | 8,000,264 | 1,513,988 | 19% |
A million zeros | 8,000,264 | 7,998 | 0.1% |
All three in one archive | 24,000,752 | 9,207,490 | 38% (roughly the average of the other three) |

As you can see from the table, compression can be an effective way to store data more compactly. Because each array in the archive is deflated independently, arrays with a lot of repetition compress dramatically even when the other arrays in the same file barely shrink at all.

Numpy files are an easy, fast way to store data when you're working with Numpy. However, it's difficult to open `.npy` and `.npz` files in other tools like Excel or Matlab (though if you insist, there are Matlab libraries for this).

The biggest drawback of Numpy files is that they lack tools to annotate or describe the data. In CSV, JSON, YAML, or text files, you can write header information at the top. However, the only place to stuff metadata in Numpy files is the filename. This can lead to wacky filenames like `2021-03-25-jonathans-data-rigol-function-generator-2.06V-60Hz-battery-turned-off-v6.npz` ... yuck.
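One partial workaround, sketched below, is to serialize a metadata dict to JSON and store it in the archive as a zero-dimensional string array alongside the data (the instrument name and settings here are made up for illustration):

```python
import json
import numpy as np

data = np.random.randn(int(1e3))
# Hypothetical metadata describing how the data was collected
meta = {'instrument': 'rigol function generator', 'amplitude_V': 2.06, 'freq_Hz': 60}

# A JSON string round-trips cleanly through a 0-d unicode array
np.savez('annotated.npz', data=data, meta=np.array(json.dumps(meta)))

npz = np.load('annotated.npz')
recovered = json.loads(str(npz['meta']))
print(recovered['freq_Hz'])  # 60
```

This keeps the annotations inside the file itself instead of the filename, though it's a convention you have to remember rather than something the format supports natively.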

Looking ahead, there is some talk that the Numpy Format Specification will move to the much more robust HDF5 format. A lot of the scientific community is drifting towards HDF5 for storing everything, and I wouldn't be surprised if, in ten years, `.npy` and `.npz` files were really simple `.hdf5` files.

Numpy files are great if you're writing and reading "naked" Numpy arrays, where you're only concerned about the data alone. They don't work well for storing annotations about your variables. If you want to store annotations to better describe your data, consider using an HDF5-based format.