daf.storage.h5fs

Store daf data in an h5fs file.

The intent here is not to define a “new format”, but to use h5fs as simply as possible with a transparent naming convention.

An h5fs file contains “groups” and “data sets”, where a group contains groups and/or data sets, and each data set is a single 1D/2D data array, both allowing for additional arbitrary metadata.

For maximal flexibility, the code here does not deal with creating or opening the h5fs file. Instead, given a file opened or created using the h5py package, it allows using an arbitrary group in the file to hold all data for some daf storage. This allows multiple daf data sets to co-exist in the same h5fs file; the downside is that given an h5fs file, you need to know the name of the group that contains the daf data set.

Therefore, by convention, if you name a file .h5df, you are saying that it contains just one daf data set at the root group of the file (that is, viewing the file object as a group). In contrast, if you name the file .h5fs, you are saying it may contain “anything”, and you need to provide additional information such as “the group /foo/bar contains some daf data set, the group /foo/baz contains another daf data set, and the group /vaz contains non-daf data”.
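
For example, assuming H5fsReader and H5fsWriter (documented below) are imported from daf.storage.h5fs, wrapping a group might look like the following sketch; the file names and the /foo/bar group path are placeholders:

    import h5py

    from daf.storage.h5fs import H5fsReader, H5fsWriter

    # A ".h5df" file holds a single daf data set at its root group, so the file
    # object itself (viewed as a group) is what we wrap.
    with h5py.File("cells.h5df", "r") as h5df_file:
        reader = H5fsReader(h5df_file)
        ...  # Access the data through the StorageReader interface.

    # A ".h5fs" file may contain "anything"; here we assume the caller knows
    # that the daf data set lives in the group "/foo/bar".
    with h5py.File("everything.h5fs", "a") as h5fs_file:
        writer = H5fsWriter(h5fs_file.require_group("/foo/bar"))
        ...  # Store data through the StorageWriter interface.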

Note

Do not directly modify the content of a daf group in the h5fs after creating a H5fsReader or H5fsWriter for it. External modifications may or may not become visible, causing subtle problems.

Group Structure

A daf group inside an h5fs file will contain the following data sets:

  • A single __daf__ data set whose value is an array of two integers, the major and minor format version numbers, to protect against future extensions of the format. This version of the library will generate [1, 0] files and will accept any files with a major version of 1.

  • Each 0D data item will be stored as an attribute of the __daf__ data set.

  • For axes, there will be an axis# data set containing the unique names of the entries along the axis.

  • For 1D data, there will be an axis#property data set containing the data.

  • For dense 2D data, there will be a row_axis,column_axis#property data set containing the data.

Note

Storing 1D/2D data of strings in h5fs is built around the concept of a fixed number of bytes per element. This requires us to convert all strings to byte arrays before passing them on to h5fs (and the reverse when accessing the data). But numpy can’t encode None in a byte array; instead it silently converts it to the string 'None'. To work around this, in h5fs, we store the magic string value \001 to indicate the None value. So do not use this magic string value in arrays of strings you pass to daf, and don’t be surprised if you see this value if you access the data directly from h5fs. Sigh.

  • For sparse 2D data, there will be a group row_axis,column_axis#property which will contain three data sets: data, indices and indptr, needed to construct the sparse scipy.sparse.csr_matrix. The group will have a shape attribute whose value is an array of two integers, the rows count and the columns count of the matrix.

Other data sets and/or groups, if any, are silently ignored.
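
Given this convention, a daf group can be inspected directly through the raw h5py API. The following sketch assumes hypothetical axes (“cell”, “gene”) and properties (“batch”, “UMIs”, “fraction”); an actual group will of course contain whatever names were stored in it:

    import h5py
    import scipy.sparse as sp

    with h5py.File("cells.h5df", "r") as group:  # The root group of a .h5df file.
        major, minor = group["__daf__"][:]       # Format version, e.g. [1, 0].
        metadata = dict(group["__daf__"].attrs)  # All the 0D data.

        cell_names = group["cell#"][:]   # Unique entry names of the "cell" axis.
        batches = group["cell#batch"][:] # 1D data (fixed-size byte strings; b"\001" means None).

        dense = group["cell,gene#UMIs"][:]  # Dense 2D data is a plain data set.

        # Sparse 2D data is a group holding data, indices and indptr, plus a
        # "shape" attribute with the rows and columns counts.
        sparse_group = group["cell,gene#fraction"]
        sparse = sp.csr_matrix(
            (sparse_group["data"][:], sparse_group["indices"][:], sparse_group["indptr"][:]),
            shape=tuple(sparse_group.attrs["shape"]),
        )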

Note

Even though AnnData can also be used to access data in h5fs files, these files must be in a specific format (h5ad) which is not compatible with the format used here, and is much more restrictive; it isn’t even possible to store multiple AnnData in a single h5ad file, because “reasons”. See the anndata module if you need to read or write h5ad files with daf.

Using h5fs as a storage format has some advantages over using simple files storage:

  • The data is contained in a single file, making it easier to send it across a network.

  • Using an h5fs file only consumes a single file descriptor, as opposed to one per memory-mapped 1D/2D data for the files storage.

There are of course also downsides to this approach:

  • All access to the data must be via the h5py API. This means that you can’t apply any of the multitude of file-based tools to the data. Putting aside the loss of the convenience of using bash or the Windows file explorer to simply see and manipulate the data, this also rules out the possibility of using build tools like make to create complex reproducible multi-program computation pipelines, and automatically re-run just the necessary steps if/when some input data or control parameters are changed.

  • Accessing data from h5fs creates an in-memory copy. To clarify, the h5py API does lazily load data only when it is accessed, and does allow accessing only a slice of the data, but it will create an in-memory copy of that slice (see the sketch after this list).

    When using daf to access h5fs data, you can’t even ask it for just a slice, since daf always asks for the whole thing (in theory we could do something clever with views - we don’t). If you are accessing large data, this will hurt performance; in extreme cases, when the data is bigger than the available RAM, the program will crash.

    All that said, the implementation here uses the low-level h5py APIs to memory-map 1D/2D data, so the above applies only to using h5fs through the “normal” h5py high-level API, which does not support memory-mapping (at least until https://github.com/h5py/h5py/issues/1607 is resolved).
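
As a small illustration of the high-level API behavior (using the hypothetical data set name from the sketch above):

    import h5py
    import numpy as np

    with h5py.File("cells.h5df", "r") as group:
        umis = group["cell,gene#UMIs"]  # A lazy handle; nothing is loaded yet.
        top = umis[:1000, :]            # Loads (copies) just this slice into RAM.
        everything = umis[...]          # Loads (copies) the whole matrix into RAM.
        assert isinstance(top, np.ndarray) and isinstance(everything, np.ndarray)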

Note

The h5py API provides advanced storage features (e.g., chunking, compression). While daf doesn’t support creating data using these features, it will happily read them. You can therefore either manually create daf data using these advanced features (following the above naming convention), or you can create the data using daf and then run a post-processing step that optimizes the storage format of the data as you see fit. However, if you do so, daf will no longer be able to memory-map the data, so for large data you may end up losing rather than gaining performance.
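
For example, a post-processing step (or manual creation) of a compressed, chunked dense 2D property might look like the following sketch; the data and names are hypothetical, and as noted, daf will not memory-map such data when reading it back:

    import h5py
    import numpy as np

    umis = np.random.rand(1000, 2000)  # Stand-in for the actual data.

    with h5py.File("everything.h5fs", "a") as h5fs_file:
        group = h5fs_file.require_group("/foo/bar")
        group.create_dataset(
            "cell,gene#UMIs",   # Follow the naming convention above.
            data=umis,
            chunks=(100, 200),  # Advanced h5py storage features...
            compression="gzip", # ...which daf will read but not create.
        )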

Classes:

H5fsReader(group, *[, name])

Implement the StorageReader interface for a group in an h5fs file.

H5fsWriter(group, *[, name, copy, overwrite])

Implement the StorageWriter interface for data stored inside an empty group in an h5fs file.

class daf.storage.h5fs.H5fsReader(group: Group, *, name: str = 'h5fs#')[source]

Bases: StorageReader

Implement the StorageReader interface for a group in an h5fs file.

If the name ends with #, we append the object id to it to make it unique.

Attributes:

group

The h5fs group containing the data.

class daf.storage.h5fs.H5fsWriter(group: Group, *, name: str = 'h5fs#', copy: Optional[StorageReader] = None, overwrite: bool = False)[source]

Bases: H5fsReader, StorageWriter

Implement the StorageWriter interface for data stored inside an empty group in an h5fs file.

If the name ends with #, we append the object id to it to make it unique.

If copy is specified, it is copied into the group, using the overwrite flag.
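
For example, the copy and overwrite parameters allow converting an existing daf storage into a fresh group in an h5fs file; the following sketch assumes the source is itself a .h5df file, though any StorageReader would do (file and group names are placeholders):

    import h5py

    from daf.storage.h5fs import H5fsReader, H5fsWriter

    with h5py.File("source.h5df", "r") as source_file:
        source = H5fsReader(source_file, name="source")

        with h5py.File("copy.h5fs", "w") as target_file:
            H5fsWriter(target_file.create_group("copied"), name="copy", copy=source, overwrite=True)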