daf.storage.files¶
This stores all the data as simple files in a trivial format in a single directory.
The intent here is not to define a “new format”, but to use trivial, standard existing formats to store the data as files in a directory in the simplest way possible.
Note
Do not directly modify the storage files after creating a FilesReader or FilesWriter. External modifications may or may not become visible, causing subtle problems.
The only exception is that it is safe to create new axis/data files in the directory; these will not be reflected in any existing FilesReader or FilesWriter object. To access the new data, you will need to create a new FilesReader or FilesWriter object.
Directory Structure
A daf storage directory will contain the following files:
- A single __daf__.yaml file identifies the directory as containing daf data. It should contain a mapping with a single version key whose value must be a sequence of two integers, the major and minor format version numbers, to protect against future extensions of the format. This version of the library will generate [1, 0] files and will accept any files with a major version of 1.
- Every 0D data will be stored as a separate name.yaml file, to maximize human-readability of the data.
- For axes, there will be an axis#.csv file with a single column with the axis name header, containing the unique names of the entries along the axis.
- For 1D string data, there will be an axis#property.csv file with two columns, with the axis and the property name header (if the property name is identical to the axis, we suffix it with .value). Any missing entries will be set to None. The entries may be in any order, but FilesWriter always writes them in the axis order (skipping writing of None values).
- For 1D binary data, there will be an axis#property.yaml file and an axis#property.array file containing the data (always in the axis entries order). See create_memory_mapped_array for details on the (trivial) format of these files.
- For 2D string data, there will be a row_axis,column_axis#property.csv file with three columns, with the rows axis, columns axis, and property name header (if the axis names are identical, we suffix them with .row and .column, and if the property name is identical to either, we suffix it with .value). Any missing entries will be set to None. The entries may be in any order, but FilesWriter always writes them in ROW_MAJOR order (skipping writing of None values).
- For 2D Dense binary data, there will be a row_axis,column_axis#property.yaml file accompanied by a row_axis,column_axis#property.array file (always in ROW_MAJOR order based on the axis entries order). See create_memory_mapped_array for details on the (trivial) format of these files.
- For 2D Sparse binary data, there will be a row_axis,column_axis#property.yaml file accompanied by three files: row_axis,column_axis#property.data, row_axis,column_axis#property.indices and row_axis,column_axis#property.indptr (always in ROW_MAJOR, that is, CSR order, based on the axis entries order). See write_memory_mapped_sparse for details on the (trivial) format of these files.
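Since the __daf__.yaml marker is so simple, it can be validated without a full YAML parser. The sketch below (the check_daf_version helper is hypothetical, not part of the daf API; a real reader would more likely use yaml.safe_load) shows the version rule described above: accept any file with a major version of 1.

```python
import re

def check_daf_version(yaml_text: str) -> tuple[int, int]:
    """Return the (major, minor) format version from a __daf__.yaml file."""
    match = re.search(r"version:\s*\[\s*(\d+)\s*,\s*(\d+)\s*\]", yaml_text)
    if match is None:
        raise ValueError("not a daf storage directory marker")
    major, minor = int(match.group(1)), int(match.group(2))
    if major != 1:  # this version of the library accepts any 1.x file
        raise ValueError(f"unsupported daf format version: [{major}, {minor}]")
    return (major, minor)

print(check_daf_version("version: [1, 0]"))  # (1, 0)
```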
Other files, if any, are silently ignored.
Note
The formats of non-binary data (0D data and 1D/2D strings) were chosen to maximize robustness rather than performance. This is an explicit design choice, since this performance has almost no impact on the data sets we created daf for (single-cell RNA sequencing data), and we saw the advantage of using simple self-describing text files to maximize the direct accessibility of the data for non-daf tools. If/when this becomes an issue, these formats can be replaced (at least as an option) with more efficient (but more opaque) formats.
Similarly, we currently only support the most trivial format for binary data, to maximize their accessibility to non-daf tools. In particular, no compressed format is available, which may be important for some data sets.
If these restrictions are an issue for your data, you can use the h5fs storage instead (even then, using compression will require some effort).
Using a directory of separate files for separate data instead of a complex single-file format such as h5fs has some advantages:
- One can apply the multitude of file-based tools to the data. Putting aside the convenience of using bash or the Windows file explorer to simply see and manipulate the data, this allows using build tools like make to create complex reproducible multi-program computation pipelines, and to automatically re-run just the necessary steps if/when some input data or control parameters are changed.
- Using memory-mapped files never creates an in-memory copy when accessing data, which is faster, and allows you to access data files larger than the available RAM (thanks to the wonders of paged virtual address spaces). You would need to always use StorageWriter.create_dense_in_rows to create your data, though.
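The memory-mapping mechanism behind this can be illustrated with plain numpy (this is a sketch of the underlying technique, not the daf API itself; the file name and shape are invented for the example). Writing one row at a time into a file-backed array means no full in-memory copy ever exists:

```python
import os
import tempfile

import numpy as np

# A hypothetical 2D "cells x genes" array file, named in the daf style.
path = os.path.join(tempfile.mkdtemp(), "cell,gene#UMIs.array")

# Create a file-backed dense array and fill it row by row.
dense = np.memmap(path, dtype="float32", mode="w+", shape=(3, 4))
for row in range(3):
    dense[row, :] = np.arange(4) * (row + 1)  # fill one row at a time
dense.flush()

# Re-open read-only; the OS pages the data in on demand.
again = np.memmap(path, dtype="float32", mode="r", shape=(3, 4))
print(again[2, 3])  # 9.0
```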
Note
We have improved the implementation of daf storage for the h5fs format, using the low-level h5py APIs, to also use memory mapping, which avoids copies “almost all the time”. You still need to always use StorageWriter.create_dense_in_rows to create your data, though.
There are of course also downsides to this approach:
- It requires you to create an archive (using tar or zip or the like) if you want to send the data across the network. This isn’t much of a hardship, as a data set typically consists of multiple files anyway. Using an archive also allows for compression, which is important when sending files across the network.
- It uses one file descriptor per memory-mapped file (that is, any actually accessed 1D/2D data). If you access “too many” such data files at the same time, you may see an error saying something like “too many open files”. This isn’t typically a problem for normal usage. If you do encounter such an error, try calling allow_maximal_open_files, which will increase the limit as much as possible without requiring changing operating system settings.
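On POSIX systems, raising the open-file limit within the allowed range can be done with the standard resource module. The helper below is a sketch of this mechanism under that assumption (it is not the actual allow_maximal_open_files implementation, whose details are not shown here):

```python
import resource

def raise_open_files_limit() -> int:
    """Lift the soft open-file limit to the hard limit; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and soft < hard:
        # The soft limit can be raised up to the hard limit without
        # any special privileges or operating system changes.
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        soft = hard
    return soft

print(raise_open_files_limit())
```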
Classes:
- FilesReader: Implement the StorageReader interface for simple files storage inside a path directory.
- FilesWriter: Implement the StorageWriter interface for simple files storage inside a path directory.
- class daf.storage.files.FilesReader(path: str, *, name: Optional[str] = None)[source]¶
Bases: StorageReader
Implement the StorageReader interface for simple files storage inside a path directory.
If name is not specified, the path is used instead, adding a trailing / to ensure no ambiguity if/when the name is suffixed later. If the name ends with #, we append the object id to it to make it unique.
Attributes:
- path¶
The path of the directory containing the data files.
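The naming rule described above can be sketched as a small pure function (the storage_name helper is hypothetical, written only to illustrate the documented behavior; it is not the actual daf implementation):

```python
from typing import Optional

def storage_name(path: str, name: Optional[str], object_id: int) -> str:
    """Apply the documented naming rule: default to the path, uniquify a trailing '#'."""
    if name is None:
        # A trailing "/" ensures no ambiguity if/when the name is suffixed later.
        name = path if path.endswith("/") else path + "/"
    if name.endswith("#"):
        name += str(object_id)  # append the object id to make the name unique
    return name

print(storage_name("/data/cells", None, 127))     # /data/cells/
print(storage_name("/data/cells", "batch#", 127)) # batch#127
```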
- class daf.storage.files.FilesWriter(path: str, *, name: Optional[str] = None, copy: Optional[StorageReader] = None, overwrite: bool = False)[source]¶
Bases: FilesReader, StorageWriter
Implement the StorageWriter interface for simple files storage inside a path directory.
If name is not specified, the path is used instead, adding a trailing / to ensure no ambiguity if/when the name is suffixed later. If the name ends with #, we append the object id to it to make it unique.
If copy is specified, it is copied into the directory, honoring the overwrite flag.
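The copy-with-overwrite semantics can be sketched with the standard library alone (the copy_storage helper is hypothetical and operates on raw files; the real FilesWriter copies through the StorageReader interface, not by copying files directly):

```python
import shutil
import tempfile
from pathlib import Path

def copy_storage(source: Path, target: Path, *, overwrite: bool = False) -> None:
    """Copy every storage file from source into target, honoring the overwrite flag."""
    target.mkdir(parents=True, exist_ok=True)
    for item in source.iterdir():
        destination = target / item.name
        if destination.exists() and not overwrite:
            raise FileExistsError(f"refusing to overwrite {destination}")
        shutil.copy2(item, destination)

# Demonstrate on a throwaway directory holding just the format marker.
source = Path(tempfile.mkdtemp())
target = Path(tempfile.mkdtemp()) / "copy"
(source / "__daf__.yaml").write_text("version: [1, 0]\n")
copy_storage(source, target)
print((target / "__daf__.yaml").read_text().strip())  # version: [1, 0]
```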