daf.storage.files

This stores all the data as simple files in a trivial format in a single directory.

The intent here is not to define a “new format”, but to use trivial, standard existing formats to store the data in files in a directory in the simplest way possible.

Note

Do not directly modify the storage files after creating a FilesReader or FilesWriter. External modifications may or may not become visible, causing subtle problems.

The only exception is that it is safe to create new axis/data files in the directory; these will not be reflected in any existing FilesReader or FilesWriter object. To access the new data, you will need to create a new FilesReader or FilesWriter object.

Directory Structure

A daf storage directory will contain the following files:

  • A single __daf__.yaml file identifies the directory as containing daf data. It must contain a mapping with a single version key whose value is a sequence of two integers (the major and minor format version numbers), to protect against future extensions of the format. This version of the library will generate [1, 0] files and will accept any files with a major version of 1.

  • Every 0D data item will be stored as a separate name.yaml file, to maximize the human readability of the data.

  • For axes, there will be an axis#.csv file with a single column whose header is the axis name, containing the unique names of the entries along the axis.

  • For 1D string data, there will be an axis#property.csv file with two columns whose headers are the axis name and the property name (if the property name is identical to the axis, we suffix it with .value). Any missing entries will be set to None. The entries may be in any order, but FilesWriter always writes them in the axis order (skipping the writing of None values).

  • For 1D binary data, there will be an axis#property.yaml file and an axis#property.array file containing the data (always in the axis entries order). See create_memory_mapped_array for details on the (trivial) format of these files.

  • For 2D string data, there will be a row_axis,column_axis#property.csv file with three columns whose headers are the row axis name, the column axis name, and the property name (if the axis names are identical, we suffix them with .row and .column, and if the property name is identical to either axis, we suffix it with .value). Any missing entries will be set to None. The entries may be in any order, but FilesWriter always writes them in ROW_MAJOR order (skipping the writing of None values).

  • For 2D Dense binary data, there will be a row_axis,column_axis#property.yaml file accompanied by a row_axis,column_axis#property.array file (always in ROW_MAJOR order based on the axis entries order). See create_memory_mapped_array for details on the (trivial) format of these files.

  • For 2D Sparse binary data, there will be a row_axis,column_axis#property.yaml file accompanied by three files: row_axis,column_axis#property.data, row_axis,column_axis#property.indices and row_axis,column_axis#property.indptr (always in ROW_MAJOR, that is, CSR order, based on the axis entries order). See write_memory_mapped_sparse for details on the (trivial) format of these files.

Other files, if any, are silently ignored.
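The version check and file-name conventions above can be sketched in Python. The helper functions below are hypothetical illustrations of the conventions, not part of the daf API, and the cell/gene axis names in the examples are made up:

```python
# Hypothetical helpers illustrating the rules described above;
# these are NOT part of the daf API, just a sketch of the conventions.

def is_compatible_version(version: object) -> bool:
    # __daf__.yaml holds a mapping with a "version" key whose value is
    # [major, minor]; this library writes [1, 0] and accepts any file
    # whose major version is 1.
    return (
        isinstance(version, (list, tuple))
        and len(version) == 2
        and version[0] == 1
    )

def data_1d_stem(axis: str, prop: str) -> str:
    # 1D data lives in "axis#property.csv" (or .yaml/.array for binary);
    # a property named like its axis gets a ".value" suffix.
    if prop == axis:
        prop += ".value"
    return f"{axis}#{prop}"

def data_2d_stem(row_axis: str, column_axis: str, prop: str) -> str:
    # 2D data lives in "row_axis,column_axis#property.*"; identical axis
    # names get ".row"/".column" suffixes, and a property named like
    # either axis gets a ".value" suffix.
    if prop in (row_axis, column_axis):
        prop += ".value"
    if row_axis == column_axis:
        row_axis += ".row"
        column_axis += ".column"
    return f"{row_axis},{column_axis}#{prop}"
```

For example, with these (hypothetical) axes, dense per-cell-per-gene data named UMIs would be stored in cell,gene#UMIs.yaml and cell,gene#UMIs.array files.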

Note

The formats of non-binary (0D data and 1D/2D strings) data were chosen to maximize robustness as opposed to maximizing performance. This is an explicit design choice made since this performance has almost no impact on the data sets we created daf for (single-cell RNA sequencing data), and we saw the advantage of using simple self-describing text files to maximize the direct accessibility of the data for non-daf tools. If/when this becomes an issue, these formats can be replaced (at least as an option) with more efficient (but more opaque) formats.

Similarly, we currently only support the most trivial format for binary data, to maximize its accessibility to non-daf tools. In particular, no compressed format is available, which may be important for some data sets.

If these restrictions are an issue for your data, you can use the h5fs storage instead (even then, using compression will require some effort).

Using a directory of separate files for separate data instead of a complex single-file format such as h5fs has some advantages:

  • One can apply the multitude of file-based tools to the data. Putting aside the convenience of using bash or the Windows file explorer to simply see and manipulate the data, this allows using build tools like make to create complex reproducible multi-program computation pipelines, and automatically re-run just the necessary steps if/when some input data or control parameters are changed.

  • Using memory-mapped files never creates an in-memory copy when accessing data, which is faster, and allows you to access data files larger than the available RAM (thanks to the wonders of paged virtual address spaces). You would need to always use StorageWriter.create_dense_in_rows to create your data, though.
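The memory-mapping behind this advantage can be sketched with plain numpy. This is not the daf implementation; it only assumes, per the description above, that a dense .array file holds the raw values in ROW_MAJOR (C) order, with the dtype and shape recorded in the companion .yaml file (the file created here is fabricated for illustration):

```python
# Sketch (not the daf implementation) of accessing a dense ".array" file
# without making an in-memory copy.
import os
import tempfile

import numpy as np

# Fabricate a small ".array"-style file: raw float32 values, C order.
directory = tempfile.mkdtemp()
path = os.path.join(directory, "example.array")
np.arange(6, dtype="float32").reshape(2, 3).tofile(path)

# numpy.memmap pages the data in on demand, so no copy is made and the
# file may be larger than the available RAM.
mapped = np.memmap(path, dtype="float32", mode="r", shape=(2, 3), order="C")
print(float(mapped[1, 2]))  # → 5.0 (the value at row 1, column 2)
```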

Note

We have improved the implementation of daf storage for the h5fs format, using the low-level h5py APIs, to also use memory mapping, which avoids copies “almost all the time”. You still need to always use StorageWriter.create_dense_in_rows to create your data, though.

There are of course also downsides to this approach:

  • It requires you create an archive (using tar or zip or the like) if you want to send the data across the network. This isn’t much of a hardship, as typically a data set consists of multiple files anyway. Using an archive also allows for compression, which is important when sending files across the network.

  • It uses one file descriptor per memory-mapped file (that is, any actually accessed 1D/2D data). If you access “too many” such data files at the same time, you may see an error saying something like “too many open files”. This isn’t typically a problem for normal usage. If you do encounter such an error, try calling allow_maximal_open_files which will increase the limit as much as possible without requiring changing operating system settings.
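The behavior described for allow_maximal_open_files can be approximated with the standard resource module (POSIX only). This is a sketch of the idea under that assumption, not daf's actual implementation:

```python
# Sketch of raising the open-files limit as far as possible without
# changing operating system settings (POSIX only); not daf's actual code.
import resource

def raise_open_files_limit() -> int:
    # A process may raise its soft RLIMIT_NOFILE up to the hard limit
    # without any special privileges; going beyond the hard limit would
    # require changing operating system settings.
    _soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    return hard
```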

Classes:

FilesReader(path, *[, name])

Implement the StorageReader interface for simple files storage inside a path directory.

FilesWriter(path, *[, name, copy, overwrite])

Implement the StorageWriter interface for simple files storage inside a path directory.

class daf.storage.files.FilesReader(path: str, *, name: Optional[str] = None)[source]

Bases: StorageReader

Implement the StorageReader interface for simple files storage inside a path directory.

If name is not specified, the path is used instead, adding a trailing / to ensure no ambiguity if/when the name is suffixed later. If the name ends with #, we append the object id to it to make it unique.

Attributes:

path

The path of the directory containing the data files.


class daf.storage.files.FilesWriter(path: str, *, name: Optional[str] = None, copy: Optional[StorageReader] = None, overwrite: bool = False)[source]

Bases: FilesReader, StorageWriter

Implement the StorageWriter interface for simple files storage inside a path directory.

If name is not specified, the path is used instead, adding a trailing / to ensure no ambiguity if/when the name is suffixed later. If the name ends with #, we append the object id to it to make it unique.

If copy is specified, it is copied into the directory, with overwrite controlling whether existing data files may be overwritten.