daf.storage.h5fs

Store daf data in an h5fs file.

The intent here is not to define a “new format”, but to use h5fs as simply as possible with a transparent naming convention.

An h5fs file contains “groups” and “data sets”, where a group contains groups and/or data sets, and each data set is a single 1D/2D data array; both groups and data sets allow for additional arbitrary metadata.

For maximal flexibility, the code here does not deal with creating or opening the h5fs file. Instead, given a file opened or created using the h5py package, it allows using an arbitrary group in the file to hold all the data for some daf storage. This allows multiple daf data sets to co-exist in the same h5fs file; the downside is that, given an h5fs file, you need to know the name of the group that contains the daf data set. Therefore, by convention, if you name a file .h5df, you are saying that it contains just one daf data set at the root group of the file (that is, viewing the file object as a group). In contrast, if you name the file .h5fs, you are saying it may contain “anything”, and you need to provide additional information such as “the group /foo/bar contains some daf data set, the group /foo/baz contains another daf data set, and the group /vaz contains non-daf data”.
Note

Do not directly modify the content of a daf group in the h5fs file after creating an H5fsReader or H5fsWriter for it. External modifications may or may not become visible, causing subtle problems.
Group Structure

A daf group inside an h5fs file will contain the following data sets:

- A single __daf__ data set whose value is an array of two integers, the major and minor format version numbers, to protect against future extensions of the format. This version of the library will generate [1, 0] files and will accept any files with a major version of 1.
- Every 0D data will be stored as an attribute of the __daf__ data set.
- For axes, there will be an axis# data set containing the unique names of the entries along the axis.
- For 1D data, there will be an axis#property data set containing the data.
- For dense 2D data, there will be a row_axis,column_axis#property data set containing the data.
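As a sketch of this layout, the following uses plain h5py to create the data sets described above in an (in-memory) file; the axis and property names (“cell”, “age”) and the title attribute are illustrative, not part of the format:

```python
import h5py
import numpy as np

# In-memory file for illustration only; a real .h5df file would be on disk.
file = h5py.File("example.h5df", "w", driver="core", backing_store=False)

# Format version marker: major 1, minor 0.
file.create_dataset("__daf__", data=np.array([1, 0]))

# 0D data is stored as attributes of the __daf__ data set.
file["__daf__"].attrs["title"] = "example"

# An axis data set with the unique entry names (fixed-width byte strings).
file.create_dataset("cell#", data=np.array([b"A", b"B", b"C"]))

# 1D data along that axis.
file.create_dataset("cell#age", data=np.array([1.0, 2.0, 3.0]))
```

A file built this way follows the naming convention, so a daf reader pointed at its root group should recognize the axis and the 1D property.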
Note

Storing 1D/2D data of strings in h5fs is built around the concept of a fixed number of bytes per element. This requires us to convert all strings to byte arrays before passing them on to h5fs (and the reverse when accessing the data). But numpy can’t encode None in a byte array; instead it silently converts it to the string 'None'. To work around this, in h5fs, we store the magic string value \001 to indicate the None value. So do not use this magic string value in arrays of strings you pass to daf, and don’t be surprised if you see this value if you access the data directly from h5fs. Sigh.
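The workaround can be sketched in plain numpy; the helper names here are hypothetical, not part of daf’s API:

```python
import numpy as np

MAGIC_NONE = "\001"  # the magic marker described above


def encode_strings(values):
    # Replace None with the magic marker, then convert to the fixed-width
    # byte strings that h5fs requires (dtype "S" picks the maximal width).
    return np.array(
        [MAGIC_NONE if value is None else value for value in values], dtype="S"
    )


def decode_strings(array):
    # Reverse the conversion, mapping the magic marker back to None.
    decoded = [element.decode("utf-8") for element in array]
    return [None if string == MAGIC_NONE else string for string in decoded]
```

A round-trip such as `decode_strings(encode_strings(["a", None, "bc"]))` recovers the original list, None included.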
For sparse 2D data, there will be a row_axis,column_axis#property group which will contain three data sets: data, indices and indptr, needed to construct the sparse scipy.sparse.csr_matrix. The group will have a shape attribute whose value is an array of two integers, the rows count and the columns count of the matrix.
Other data sets and/or groups, if any, are silently ignored.
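Reconstructing the matrix from such a group is a direct call to the scipy constructor; the values below stand in for the contents of the three data sets and the shape attribute:

```python
import numpy as np
import scipy.sparse as sp

# Illustrative contents of the data, indices and indptr data sets,
# together with the group's "shape" attribute.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([0, 2, 1])
indptr = np.array([0, 2, 3])
shape = (2, 3)

# Rebuild the sparse matrix in the exact form scipy expects.
matrix = sp.csr_matrix((data, indices, indptr), shape=shape)
```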
Note

Even though AnnData can also be used to access data in h5fs files, these files must be in a specific format (h5ad) which is not compatible with the format used here, and is much more restrictive; it isn’t even possible to store multiple AnnData in a single h5ad file, because “reasons”. See the anndata module if you need to read or write h5ad files with daf.
Using h5fs as a storage format has some advantages over using simple files storage:

- The data is contained in a single file, making it easier to send across a network.
- Using an h5fs file only consumes a single file descriptor, as opposed to one file descriptor per memory-mapped 1D/2D data for the files storage.
There are of course also downsides to this approach:

- All access to the data must be via the h5py API. This means that you can’t apply any of the multitude of file-based tools to the data. Putting aside the loss of the convenience of using bash or the Windows file explorer to simply see and manipulate the data, this also rules out the possibility of using build tools like make to create complex reproducible multi-program computation pipelines, and automatically re-run just the necessary steps if/when some input data or control parameters are changed.
- Accessing data from h5fs creates an in-memory copy. To clarify, the h5py API does lazily load data only when it is accessed, and does allow accessing only a slice of the data, but it will create an in-memory copy of that slice. When using daf to access h5fs data, you can’t even ask for just a slice, since daf always asks for the whole thing (in theory we could do something clever with views - we don’t). If you are accessing large data, this will hurt performance; in extreme cases, when the data is bigger than the available RAM, the program will crash.
- All that said, the implementation here uses the low-level h5py APIs to memory-map 1D/2D data, so the above applies only to using h5fs through the “normal” h5py high-level API, which does not support memory-mapping (at least until https://github.com/h5py/h5py/issues/1607 is resolved).
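The slice-copy behavior of the high-level h5py API can be demonstrated directly (using an in-memory file for illustration):

```python
import h5py
import numpy as np

# In-memory h5fs file for illustration only.
file = h5py.File("demo.h5fs", "w", driver="core", backing_store=False)
file.create_dataset("cell#age", data=np.arange(10.0))

dataset = file["cell#age"]  # lazy handle; no data loaded yet
slice_copy = dataset[2:5]   # loads only this slice, as an in-memory numpy copy

slice_copy[0] = 99.0        # modifies the copy, not the stored data
```

After this runs, `dataset[2]` is still 2.0 while `slice_copy[0]` is 99.0, showing that the slice is a copy rather than a view.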
Note

The h5py API provides advanced storage features (e.g., chunking, compression). While daf doesn’t support creating data using these features, it will happily read them. You can therefore either manually create daf data using these advanced features (following the above naming convention), or you can create the data using daf and then run a post-processing step that optimizes the storage format of the data as you see fit. However, if you do so, daf will no longer be able to memory-map the data, so for large data you may end up losing rather than gaining performance.
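Such a post-processing step might look like the following sketch, which rewrites a data set with chunking and gzip compression (one of h5py’s built-in filters); the data set name is illustrative:

```python
import h5py
import numpy as np

# In-memory file for illustration; a real pipeline would open the .h5fs file.
file = h5py.File("packed.h5fs", "w", driver="core", backing_store=False)

# Original (uncompressed) data, as daf would create it.
raw = np.arange(1000.0)
file.create_dataset("cell#age", data=raw)

# Post-process: replace the data set with a chunked, gzip-compressed one.
# daf can still read this, but can no longer memory-map it.
del file["cell#age"]
file.create_dataset("cell#age", data=raw, chunks=(100,), compression="gzip")
```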
Classes:

- H5fsReader: Implement the StorageReader interface for a group in an h5fs file.
- H5fsWriter: Implement the StorageWriter interface for an empty group in an h5fs file.
- class daf.storage.h5fs.H5fsReader(group: Group, *, name: str = 'h5fs#')[source]

  Bases: StorageReader

  Implement the StorageReader interface for a group in an h5fs file.

  If the name ends with #, we append the object id to it to make it unique.

  Attributes:

  - group: The h5fs group containing the data.
- class daf.storage.h5fs.H5fsWriter(group: Group, *, name: str = 'h5fs#', copy: Optional[StorageReader] = None, overwrite: bool = False)[source]

  Bases: H5fsReader, StorageWriter

  Implement the StorageWriter interface for an empty group in an h5fs file.

  If the name ends with #, we append the object id to it to make it unique.

  If copy is specified, it is copied into the group, applying the overwrite flag.