HDF5 file format

Vaex uses HDF5 (Hierarchical Data Format) for storing data. You can think of an HDF5 file as a file system in which the ‘files’ are N-dimensional arrays, or as a binary equivalent of XML. Because it works almost like a file system, you can store data anywhere in the hierarchy, for instance under ‘/mydata/somearray’.

For vaex we based our layout on VOTable. Recommendations, comments, and requests to standardize are welcome.

In vaex, every column is stored under the /data group, which you can see using the h5ls tool:

$ h5ls data/helmi-dezeeuw-2000-10p.hdf5
data                     Group

All columns are stored under this group, and can be listed:

$ h5ls data/helmi-dezeeuw-2000-10p.hdf5/data
E                        Dataset {330000}
FeH                      Dataset {330000}
L                        Dataset {330000}
Lz                       Dataset {330000}
random_index             Dataset {330000}
vx                       Dataset {330000}
vy                       Dataset {330000}
vz                       Dataset {330000}
x                        Dataset {330000}
y                        Dataset {330000}
z                        Dataset {330000}
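The same listing can be produced from Python. The following is a minimal sketch, assuming h5py is installed; the helper name list_columns is hypothetical and is not part of vaex or h5py.

```python
import h5py


def list_columns(path, group="/data"):
    """Return {column name: shape} for every dataset directly under `group`.

    `list_columns` is an illustrative helper, not part of vaex or h5py.
    """
    columns = {}
    with h5py.File(path, "r") as h5file:
        for name, item in h5file[group].items():
            # Skip anything that is not a dataset (for example nested groups).
            if isinstance(item, h5py.Dataset):
                columns[name] = item.shape
    return columns
```

Calling list_columns("data/helmi-dezeeuw-2000-10p.hdf5") on the file above should return one entry per column, each with the shape (330000,).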

If for some reason you don’t want to use vaex but still want to access the data from Python, you could do something like this:

import h5py
import numpy as np
h5file = h5py.File("/Users/users/breddels/src/vaex/data/helmi-dezeeuw-2000-10p.hdf5", "r")
FeH = h5file["/data/FeH"]
# FeH is an h5py Dataset, which behaves much like a regular numpy array (with some extras)
print("mean FeH", np.mean(FeH), "length", len(FeH))
mean FeH -1.6934730008384034 length 330000

More information about a column can be found using:

$ h5ls -v data/helmi-dezeeuw-2000-10p.hdf5/data/FeH
Opened "data/helmi-dezeeuw-2000-10p.hdf5" with sec2 driver.
FeH                      Dataset {330000/330000}
    Attribute: ucd scalar
        Type:      variable-length null-terminated ASCII string
        Data:  "phys.abund.fe"
    Attribute: unit scalar
        Type:      variable-length null-terminated ASCII string
        Data:  "dex"
    Location:  1:2644064
    Links:     1
    Storage:   2640000 logical bytes, 2640000 allocated bytes, 100.00% utilization
    Type:      native double

Here we see that, similar to VOTable, a column can carry a ucd attribute, which describes what the column represents, and a unit attribute, which gives its units.

These can be accessed using h5py as well:

print(FeH.attrs["ucd"], FeH.attrs["unit"])
phys.abund.fe dex

Column names are further restricted: the first character must be an underscore (_) or an ASCII letter (a-z or A-Z), and the following characters may also include digits.
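The naming rule can be written as a regular expression. This is a minimal sketch of one reading of the rule above; the names COLUMN_NAME_PATTERN and is_valid_column_name are hypothetical and are not taken from vaex.

```python
import re

# First character: underscore or ASCII letter.
# Following characters: underscore, ASCII letter, or digit.
COLUMN_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def is_valid_column_name(name):
    """Return True if `name` satisfies the column naming rule (illustrative helper)."""
    return COLUMN_NAME_PATTERN.fullmatch(name) is not None
```

Under this reading, names such as FeH, random_index, and _x1 are accepted, while 1x and a-b are rejected.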

For completeness, the layout is as follows:

  • /data
      • /data/column1 (with optional attributes ucd and unit)
      • ...
      • /data/columnN (with optional attributes ucd and unit)
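To illustrate the layout, here is a minimal sketch that writes a file with this structure using h5py and numpy. The helper name write_columns and its parameters are hypothetical. The sketch only follows the layout described on this page; whether vaex expects anything beyond it is not covered here.

```python
import h5py
import numpy as np


def write_columns(path, columns, metadata=None):
    """Write each column as a dataset under /data (illustrative helper).

    columns:  {column name: array-like of values}
    metadata: optional {column name: {"ucd": ..., "unit": ...}}
    """
    metadata = metadata or {}
    with h5py.File(path, "w") as h5file:
        data = h5file.create_group("data")
        for name, values in columns.items():
            dataset = data.create_dataset(name, data=np.asarray(values))
            # The ucd and unit attributes are optional, so only set what was given.
            for key, value in metadata.get(name, {}).items():
                dataset.attrs[key] = value
```

For example, write_columns("example.hdf5", {"FeH": [-1.5, -2.0]}, {"FeH": {"ucd": "phys.abund.fe", "unit": "dex"}}) creates /data/FeH with both attributes set, and the result can be inspected with h5ls -v as shown earlier.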