daf.access
High-level API for accessing data in a daf data set.
These interfaces are intended for daf users (that is, applications built on top of daf). They allow placing any “reasonable” type of data into a DafWriter, while ensuring that accessing data in a DafReader will always return “clean” data. For example, 2D data returned by DafReader is always an is_optimal, is_frozen MatrixInRows, regardless of whatever was put into it.
This is in contrast to the low-level StorageReader and StorageWriter interface which, to simplify writing storage adapters, requires storing “clean” data as above, but does not guarantee anything when accessing the stored data. That is, DafReader and DafWriter satisfy the robustness principle both “up” towards the application and “down” towards the storage format adapters.
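As a rough illustration of "accept anything reasonable, return clean data", here is a minimal numpy-only sketch. It assumes that "clean" 2D data means, at least, a row-major (C-contiguous) and read-only array; the actual is_optimal and is_frozen checks in daf may cover more than this. The helper name as_clean_rows is hypothetical and is not part of daf.

```python
import numpy as np


def as_clean_rows(data):
    """Hypothetical helper (not part of daf): copy data into a row-major, read-only 2D array."""
    matrix = np.array(data, order="C")  # always copies, and forces row-major layout
    if matrix.ndim != 2:
        raise ValueError(f"expected 2D data, got {matrix.ndim}D")
    matrix.setflags(write=False)  # "frozen": later writes raise an error
    return matrix


# "Messy" input: a column-major matrix with 2 rows and 3 columns.
messy = np.asfortranarray([[1, 2, 3], [4, 5, 6]])
clean = as_clean_rows(messy)

print(messy.flags.c_contiguous, clean.flags.c_contiguous)
print(clean.flags.writeable)
```

The copy keeps the sketch simple and leaves the caller's array untouched; a real implementation would presumably avoid copying data that is already in the desired layout.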
Accessing data in daf is based on string names in the following format(s):
- 0D data is identified by a simple name, e.g. doi_url might be a string describing the overall data set. There is no restriction on the data type of 0D data except that it should be reasonably de/serializable to allow storing it in a disk file.
- 1D/2D data is specified along some axes, where each axis has a simple name and a string name for each entry along the axis.
  - 1D data along some axis is identified by axis#name, e.g. cell#age might assign an age to every cell in the data set. Such data is returned as a numpy 1D array (that is, Vector) or as a pandas.Series.
  - 2D data along two axes is identified by rows_axis,columns_axis#name, e.g. cell,gene#UMIs would give the number of unique molecular identifiers (that is, the count of mRNA molecules) for each gene in each cell.
    All such data is provided in ROW_MAJOR order; that is, in the above example, each row will describe a cell, and will contain (consecutively in memory) the UMIs of each gene. Requesting gene,cell#UMIs will return data where each row describes a gene, and will contain (consecutively in memory) its UMIs in each cell.
Note
Calling .transpose() on 2D data does not modify the memory layout; this is why it is an extremely fast operation. That is, the transpose of cell,gene#UMIs data contains the same rows, columns, and values as gene,cell#UMIs data, but the former will be in COLUMN_MAJOR layout and the latter will be in ROW_MAJOR layout. The two may be “equal” but will not be identical when it comes to performance (for non-trivial data sizes). For example, summing the UMIs of each cell would be much slower for the gene,cell#UMIs data. It is therefore important to keep track of the memory order of any non-trivial 2D data, and ensure operations are applied to the right layout. Otherwise the code will experience extreme slowdowns.
2D data can be stored in either dense (numpy 2D array) or sparse (scipy.sparse.csr_matrix and scipy.sparse.csc_matrix) formats. Which one you’ll get when accessing the data will depend on what was stored. This allows for efficient storage and processing of large sparse matrices, at the cost of requiring the users to examine the fetched data (e.g. using is_sparse or is_dense) to pick the right code path to process it (since numpy arrays and scipy.sparse matrices don’t really support the same operations).
You can also request the data as a pandas.DataFrame (that is, Frame), in which case, due to pandas limitations, the data will always be returned in the dense (numpy) format. The index and columns of the frame will be the relevant axis entries.
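The naming scheme and the layout note above can be sketched with plain numpy, without using daf itself. The helpers split_2d_name and swap_axes_in_name are hypothetical names introduced only for this illustration; they are not daf functions, and they make no claim about how daf parses names internally. The toy matrix stands in for cell,gene#UMIs data.

```python
import numpy as np


def split_2d_name(name):
    """Hypothetical helper: split 'rows_axis,columns_axis#name' into its three parts."""
    axes, separator, data_name = name.partition("#")
    rows_axis, comma, columns_axis = axes.partition(",")
    if not (separator and comma and rows_axis and columns_axis and data_name):
        raise ValueError(f"not a 2D data name: {name!r}")
    return rows_axis, columns_axis, data_name


def swap_axes_in_name(name):
    """Hypothetical helper: turn 'cell,gene#UMIs' into 'gene,cell#UMIs'."""
    rows_axis, columns_axis, data_name = split_2d_name(name)
    return f"{columns_axis},{rows_axis}#{data_name}"


# Toy "cell,gene#UMIs" data: 3 cells (rows) by 4 genes (columns), row-major.
cell_gene_umis = np.arange(12).reshape(3, 4)

# Transposing returns a view: no data moves, so the result is column-major.
transposed_view = cell_gene_umis.T

# A row-major copy has the layout described above for "gene,cell#UMIs".
gene_cell_umis = np.ascontiguousarray(transposed_view)

print(swap_axes_in_name("cell,gene#UMIs"))
print(transposed_view.flags.c_contiguous, gene_cell_umis.flags.c_contiguous)
print(np.array_equal(transposed_view, gene_cell_umis))
```

The two gene-by-cell arrays compare equal element by element, but only the copy stores each gene's values consecutively in memory, which is the difference that matters for per-row operations on large data.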
- daf.access.readers
  - DafReader
    - DafReader.name
    - DafReader.base
    - DafReader.derived
    - DafReader.chain
    - DafReader.as_reader()
    - DafReader.description()
    - DafReader.verify_has()
    - DafReader.has_data()
    - DafReader.item_names()
    - DafReader.has_item()
    - DafReader.get_item()
    - DafReader.axis_names()
    - DafReader.has_axis()
    - DafReader.axis_size()
    - DafReader.axis_entries()
    - DafReader.axis_indices()
    - DafReader.axis_index()
    - DafReader.data1d_names()
    - DafReader.has_data1d()
    - DafReader.get_vector()
    - DafReader.get_series()
    - DafReader.data2d_names()
    - DafReader.has_data2d()
    - DafReader.get_matrix()
    - DafReader.get_frame()
    - DafReader.get_columns()
    - DafReader.view()
  - transpose_name()
- daf.access.writers
- daf.access.operations