daf.access.readers

Read-only interface for daf data sets.

In daf, data access uses a string name in the format described below. Even though each name uniquely identifies whether the data is 0D, 1D or 2D, there are separate functions for accessing the data based on its dimension. This makes the code more readable, and also allows mypy to provide some semblance of effective type checking (if you choose to use it).

Note

To avoid ambiguities and to ensure that storing daf data in files works as expected, do not use ,, #, = or | characters in axis, property or entry names. In addition, since axis and property names are used as part of file names in certain storage formats, also avoid characters that are invalid in file names, most importantly /, but also ", :, and \. If you want to be friendly to interactive shell usage, try to avoid characters that are special to the shell such as ', ", *, ?, &, $ and ;, even though these can be quoted.
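The guidelines in this note can be checked mechanically. The following is a minimal sketch, not part of daf: the helper check_name and the three category names are hypothetical, and the character sets are copied from the note above.

```python
# Hypothetical helper (not part of daf) that checks a name against the note above.
RESERVED = set(",#=|")               # ambiguous inside daf data names
FILE_UNSAFE = set('/":\\')           # invalid in file names
SHELL_UNFRIENDLY = set("'\"*?&$;")   # require quoting in an interactive shell


def check_name(name: str) -> dict:
    """Return the problematic characters found in an axis, property or entry name."""
    chars = set(name)
    return {
        "reserved": sorted(chars & RESERVED),
        "file_unsafe": sorted(chars & FILE_UNSAFE),
        "shell_unfriendly": sorted(chars & SHELL_UNFRIENDLY),
    }


print(check_name("cell_type"))
print(check_name("age|years"))
```

A name is safe under all three guidelines when every list in the result is empty.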

The following data is used in all the examples below:

>>> import daf
>>> data = daf.DafReader(daf.FilesReader(daf.DAF_EXAMPLE_PATH), name="example")

2D Names

All 2D names start with rows_axis,columns_axis#.

rows_axis , columns_axis # property [ | !? ElementWise [ , param = value ]* ]*

The name of a property with a value for each combination of entries of two axes, optionally processed by a series of ElementWise operations. For example:

>>> data.get_matrix("metacell,gene#UMIs")
array([[  5.,   6.,  66.,   1.,   1.,   1.,   0., 110.,  13.,   1.],
       [ 13.,   2.,   1.,   2.,   1.,   3.,   2.,   3.,   7.,   1.],
       [211.,   1.,   2.,   0.,  91.,   0.,   0.,   1.,   2.,   4.],
       [  1.,   0., 179.,   1.,   0.,   2.,   0.,   9.,   1.,   2.],
       [  3.,   0.,   2.,  18.,   1.,   1.,   1.,   1., 126.,   1.],
       [ 14.,   0.,   1.,   1.,   2.,  10.,   3.,   3.,   6.,   6.],
       [  3.,   2.,   0.,   0.,   1.,   2.,   0.,   2.,   2.,   3.],
       [  0.,   1.,   0.,   0.,   0.,   1.,   1.,   0.,   5.,   1.],
       [ 62.,   0.,   0.,   0.,   2.,   0.,   2.,   0.,   1.,   0.],
       [326.,   0.,   0.,   0., 151.,   0.,   0.,   1.,   0.,   2.]],
      dtype=float32)

>>> data.get_matrix("metacell,gene#UMIs|Fraction|Log,base=2,factor=1e-1|Abs")
array([[3.0056686 , 2.9499593 , 1.239466  , 3.2528863 , 3.2528863 ,
        3.2528863 , 3.321928  , 0.64562523, 2.610649  , 3.2528863 ],
       [1.0848889 , 2.6698513 , 2.959358  , 2.6698513 , 2.959358  ,
        2.4288433 , 2.6698513 , 2.4288433 , 1.7369655 , 2.959358  ],
       [0.36534712, 3.2764134 , 3.2322907 , 3.321928  , 1.3523018 ,
        3.321928  , 3.321928  , 3.2764134 , 3.2322907 , 3.1478987 ],
       [3.2497783 , 3.321928  , 0.02566493, 3.2497783 , 3.321928  ,
        3.1810656 , 3.321928  , 2.7744403 , 3.2497783 , 3.1810656 ],
       [3.0651526 , 3.321928  , 3.1457713 , 2.2050104 , 3.2311625 ,
        3.2311625 , 3.2311625 , 3.2311625 , 0.1231482 , 3.2311625 ],
       [1.3063312 , 3.321928  , 3.038135  , 3.038135  , 2.801096  ,
        1.6556655 , 2.5975626 , 2.5975626 , 2.1175697 , 2.1175697 ],
       [1.7369655 , 2.0995355 , 3.321928  , 3.321928  , 2.5849624 ,
        2.0995355 , 3.321928  , 2.0995355 , 2.0995355 , 1.7369655 ],
       [3.321928  , 2.2439256 , 3.321928  , 3.321928  , 3.321928  ,
        2.2439256 , 2.2439256 , 3.321928  , 0.6092099 , 2.2439256 ],
       [0.03614896, 3.321928  , 3.321928  , 3.321928  , 2.9450738 ,
        3.321928  , 2.9450738 , 3.321928  , 3.1212306 , 3.321928  ],
       [0.35999608, 3.321928  , 3.321928  , 3.321928  , 1.2702659 ,
        3.321928  , 3.321928  , 3.2921808 , 3.321928  , 3.2630343 ]],
      dtype=float32)
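To make the pipeline above concrete, the following sketch recomputes the first row of the result in plain Python. The semantics are inferred from the printed values rather than taken from the daf source: Fraction is assumed to divide each value by the total of its row, and Log is assumed to add factor before taking the logarithm. The helper names fraction and log are hypothetical.

```python
import math

# First row of metacell,gene#UMIs, copied from the example output above.
row = [5, 6, 66, 1, 1, 1, 0, 110, 13, 1]


def fraction(values):
    """Divide each value by the total of its row (assumed meaning of Fraction)."""
    total = sum(values)
    return [value / total for value in values]


def log(values, *, base, factor):
    """Add factor before taking the logarithm, so that zeros stay finite (assumed)."""
    return [math.log(value + factor, base) for value in values]


# Equivalent of UMIs|Fraction|Log,base=2,factor=1e-1|Abs for this single row.
result = [abs(value) for value in log(fraction(row), base=2, factor=1e-1)]
print([round(value, 4) for value in result])
```

Under these assumptions the values agree with the first row printed above, up to float32 rounding.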

1D Names

All 1D names start with axis#.

axis #

The name of the entries of the axis. That is, get_vector("axis#") is the same as axis_entries("axis"). For example:

>>> data.get_vector("cell_type#")
array(['Amnion', 'Forebrain/Midbrain/Hindbrain', 'Neural tube Posterior',
       'Presomitic mesoderm', 'Surface ectoderm', 'caudal mesoderm',
       'epiblast'], dtype=object)

>>> data.axis_entries("cell_type")
array(['Amnion', 'Forebrain/Midbrain/Hindbrain', 'Neural tube Posterior',
       'Presomitic mesoderm', 'Surface ectoderm', 'caudal mesoderm',
       'epiblast'], dtype=object)

axis # property [ | !? ElementWise [ , param = value ]* ]*

The name of a property with a value per entry along some axis, optionally processed by a series of ElementWise operations. For example:

>>> data.get_vector("batch#age")
array([51, 38, 21, 31, 26, 43, 36, 27, 33, 45, 49, 41, 45])

>>> data.get_vector("batch#age|Clip,min=30,max=45")
array([45, 38, 30, 31, 30, 43, 36, 30, 33, 45, 45, 41, 45])

axis [ # axis_property ]+ # property? [ | !? ElementWise [ , param = value ]* ]*

The name of properties which are indices or entry names of some axes, followed by the name of a property of the final axis, optionally processed by a series of ElementWise operations. For example:

>>> data.get_vector("metacell#cell_type#color")
array(['#f7f79e', '#CDE089', '#1a3f52', '#f7f79e', '#cc7818', '#647A4F',
       '#635547', '#635547', '#A8DBF7', '#1a3f52'], dtype=object)

A property can refer to an axis either by using its exact name as above or by adding some qualifier after a . separator. For example, if we had a metacell#cell_type.projected property containing the cell type obtained by projecting the data on an atlas, we could write metacell#cell_type.projected#color to access the color of the projected cell type of each metacell, using the cell_type#color property.
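The chained lookup can be sketched with plain dictionaries. Everything here is hypothetical: the toy values are made up, lookup_chain is not a daf function, and the sketch handles only a single hop through entry names (not indices).

```python
# Hypothetical toy data; the values are made up for illustration only.
vectors = {
    "metacell#cell_type": ["epiblast", "Amnion", "epiblast"],
    "metacell#cell_type.projected": ["Amnion", "Amnion", "epiblast"],
    "cell_type#color": {"Amnion": "red", "epiblast": "blue"},
}


def lookup_chain(name: str) -> list:
    """Resolve axis#axis_property#property by following entry names."""
    axis, axis_property, final_property = name.split("#")
    # Drop the qualifier: cell_type.projected still refers to the cell_type axis.
    target_axis = axis_property.split(".")[0]
    mapping = vectors[f"{target_axis}#{final_property}"]
    return [mapping[entry] for entry in vectors[f"{axis}#{axis_property}"]]


print(lookup_chain("metacell#cell_type#color"))
print(lookup_chain("metacell#cell_type.projected#color"))
```

Both calls read the same cell_type#color mapping; only the property that selects the cell type of each metacell differs.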

axis # second_axis = entry , property [ | !? ElementWise [ , param = value ]* ]*

The slice for a specific entry of the data of a 2D property, optionally processed by a series of ElementWise operations. For example:

>>> data.get_vector("metacell#gene=FOXA1,UMIs")
array([6., 2., 1., 0., 0., 0., 2., 1., 0., 0.], dtype=float32)

>>> data.get_vector("metacell#gene=FOXA1,UMIs|Clip,min=1,max=4")
array([4., 2., 1., 1., 1., 1., 2., 1., 1., 1.], dtype=float32)

axis # second_axis , property [ | !? ElementWise [ , param = value ]* ]*

| !? Reduction [ , param = value ]* [ | !? ElementWise [ , param = value ]* ]*

A reduction of 2D data into a single value per row, optionally processed by a series of ElementWise operations. For example:

>>> data.get_vector("metacell#gene,UMIs|Sum")
array([204.,  35., 312., 195., 154.,  46.,  15.,   9.,  67., 480.],
      dtype=float32)

>>> data.get_vector("metacell#gene,UMIs|Fraction|Log,base=2,factor=1e-5|Max|Clip,min=-1.5,max=-0.5")
array([-0.89103884, -1.4288044 , -0.5642817 , -0.5       , -0.5       ,
       -1.5       , -1.5       , -0.84797084, -0.5       , -0.5581411 ],
      dtype=float32)
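Both reductions above can be reproduced in plain Python from the metacell,gene#UMIs matrix shown earlier. This is a sketch under assumptions inferred from the printed values: Fraction divides by the row total, Log adds factor before taking the logarithm, and the reduction runs along each row. The helper clipped_log_max is hypothetical.

```python
import math

# The metacell,gene#UMIs matrix, copied from the 2D example above.
umis = [
    [5, 6, 66, 1, 1, 1, 0, 110, 13, 1],
    [13, 2, 1, 2, 1, 3, 2, 3, 7, 1],
    [211, 1, 2, 0, 91, 0, 0, 1, 2, 4],
    [1, 0, 179, 1, 0, 2, 0, 9, 1, 2],
    [3, 0, 2, 18, 1, 1, 1, 1, 126, 1],
    [14, 0, 1, 1, 2, 10, 3, 3, 6, 6],
    [3, 2, 0, 0, 1, 2, 0, 2, 2, 3],
    [0, 1, 0, 0, 0, 1, 1, 0, 5, 1],
    [62, 0, 0, 0, 2, 0, 2, 0, 1, 0],
    [326, 0, 0, 0, 151, 0, 0, 1, 0, 2],
]

# Equivalent of metacell#gene,UMIs|Sum: one total per row.
totals = [sum(row) for row in umis]


def clipped_log_max(row, *, base=2, factor=1e-5, low=-1.5, high=-0.5):
    """Fraction, then Log, then Max along the row, then Clip of the single result."""
    total = sum(row)
    logs = [math.log(value / total + factor, base) for value in row]
    return min(max(max(logs), low), high)


pipeline = [clipped_log_max(row) for row in umis]
print(totals)
print([round(value, 4) for value in pipeline])
```

The element-wise steps before Max see the whole row, while the Clip after it sees only one value per metacell.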

0D Names

No 0D names contain # (at least not before the first |).

property

The name of a 0D data item property. For example:

>>> data.get_item("created")
datetime.datetime(2022, 7, 6, 16, 49, 44)

axis = entry , property

The value for a specific entry of the data of a 1D property. For example:

>>> data.get_item("batch=Batch_1,age")
38

axis = entry , second_axis = second_entry , property

The value for a specific entry of the data of a 2D property. For example:

>>> data.get_item("metacell=Metacell_1,gene=FOXA1,UMIs")
2.0
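The axis = entry form can be read as a list of selectors followed by a property name. The sketch below is a hypothetical parser (parse_entry_name is not a daf function) that handles only names without a pipeline, applied to a 2x2 excerpt of the example data.

```python
def parse_entry_name(name: str):
    """Split axis=entry,...,property into its (axis, entry) pairs and the property."""
    *selectors, property_name = name.split(",")
    pairs = [tuple(selector.split("=", 1)) for selector in selectors]
    return pairs, property_name


# Excerpt of the example data: the first two metacells and the first two genes.
entries = {"metacell": ["Metacell_0", "Metacell_1"], "gene": ["RSPO3", "FOXA1"]}
umis = [[5.0, 6.0], [13.0, 2.0]]

pairs, property_name = parse_entry_name("metacell=Metacell_1,gene=FOXA1,UMIs")
# Translate each entry name to its position along its axis.
row, column = (entries[axis].index(entry) for axis, entry in pairs)
value = umis[row][column]
print(pairs, property_name, value)
```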

axis , property

[ | !? ElementWise [ , param = value ]* ]* | !? Reduction [ , param = value ]*

A reduction into a single value of a 1D property with a value per entry along some axis, optionally processed by a series of ElementWise operations. For example:

>>> data.get_item("batch,age|Mean")
37.38461538461539

>>> data.get_item("batch,age|Clip,min=30,max=40|Mean")
36.0
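The order of the pipeline matters: element-wise operations run on the whole vector before the reduction collapses it. A plain Python sketch using the batch#age values shown earlier (the helpers clip and mean are hypothetical stand-ins for Clip and Mean):

```python
# The batch#age vector, copied from the 1D example above.
ages = [51, 38, 21, 31, 26, 43, 36, 27, 33, 45, 49, 41, 45]


def clip(values, *, low, high):
    """Limit every value to the range [low, high]."""
    return [min(max(value, low), high) for value in values]


def mean(values):
    """Reduce a vector to its arithmetic mean."""
    return sum(values) / len(values)


plain_mean = mean(ages)                          # batch,age|Mean
clipped_mean = mean(clip(ages, low=30, high=40))  # batch,age|Clip,min=30,max=40|Mean
print(plain_mean, clipped_mean)
```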

axis , second_axis = entry , property

[ | !? ElementWise [ , param = value ]* ]* | !? Reduction [ , param = value ]*

A reduction into a single value of a slice for a specific entry of the data of a 2D property, optionally processed by a series of ElementWise operations. For example:

>>> data.get_item("metacell,gene=FOXA1,UMIs|Max")
6.0

>>> data.get_item("metacell,gene=FOXA1,UMIs|Clip,min=1,max=3|Mean")
1.4

axis , second_axis , property

[ | !? ElementWise [ , param = value ]* ]* | !? Reduction [ , param = value ]*

[ | !? ElementWise [ , param = value ]* ]* | !? Reduction [ , param = value ]*

A reduction of 2D data into a single value per row and then to a single value, optionally processed by a series of
ElementWise operations. For example:

>>> data.get_item("metacell,gene,UMIs|Sum|Max")
480.0

>>> data.get_item("metacell,gene,UMIs|Fraction|Log,base=2,factor=1e-5|Max|Clip,min=-1.5,max=-0.5|Mean")
-0.8790237

Note

See operations for the list of built-in ElementWise and Reduction operations. Additional operations can be offered by other Python packages. In all the above, prefixing an operation name with ! will prevent its results from being cached. For example, cell#gene,UMIs|!Sum will not cache the total number of UMIs per cell. The current implementation doesn’t cache any 0D data regardless of whether a ! was specified.
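The effect of the ! prefix can be illustrated with a toy cache. This is only a sketch of the idea: evaluate, the cache key format, and the placeholder result are all hypothetical and do not reflect how daf stores derived data.

```python
def evaluate(cache, data_name, operation, compute):
    """Run compute() for data_name|operation, caching unless the operation starts with !."""
    uncached = operation.startswith("!")
    key = f"{data_name}|{operation.lstrip('!')}"
    if key in cache:
        return cache[key]
    result = compute()
    if not uncached:
        cache[key] = result
    return result


cache = {}
computed = []  # records every time the expensive computation actually runs


def total_umis():
    computed.append("Sum")
    return [1.0, 2.0]  # placeholder result, not real data


evaluate(cache, "cell#gene,UMIs", "!Sum", total_umis)
evaluate(cache, "cell#gene,UMIs", "!Sum", total_umis)
uncached_runs = len(computed)  # both calls recomputed the result

evaluate(cache, "cell#gene,UMIs", "Sum", total_umis)
evaluate(cache, "cell#gene,UMIs", "Sum", total_umis)
cached_runs = len(computed) - uncached_runs  # only the first call computed
print(uncached_runs, cached_runs)
```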

Motivation

The above scheme makes sense if you consider that each name starts with a description of the axes/shape of the result, followed by how to extract the result from the data set. This means that to get the sum of the UMIs of all the genes for each cell, we first consider this is per-cell 1D data and therefore must start with cell#. We therefore write cell#gene,UMIs|Sum instead of cell,gene#UMIs|Sum.

This may seem unintuitive at first, but it has some advantages, such as clearly identifying the axes/shape of the result of a pipeline. An important feature of the scheme is that the name of any 1D data along some axis has the common prefix axis#. This makes it easy to express data for get_columns, or describe the X and Y coordinates of a scatter plot, or anything along these lines, by providing the common axis and a list of suffixes to append to it.
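Because the prefix determines the shape, the dimension of a name can be read off without evaluating it. The following sketch applies the rules stated in the 2D, 1D and 0D sections; name_dimensions is a hypothetical helper, not a daf function.

```python
def name_dimensions(name: str) -> int:
    """Return 0, 1 or 2 according to the prefix of a daf data name."""
    head = name.split("|", 1)[0]   # ignore everything after the first |
    if "#" not in head:
        return 0                   # no 0D name contains # before the first |
    axes = head.split("#", 1)[0]   # the part that describes the result axes
    return 2 if "," in axes else 1


for name in ["metacell,gene#UMIs|Fraction", "metacell#gene,UMIs|Sum", "batch,age|Mean"]:
    print(name, name_dimensions(name))

# The common axis# prefix makes it easy to build several 1D names at once.
columns = ["batch#" + suffix for suffix in ["age", "sex"]]
print(columns)
```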

Classes:

DafReader(base, *[, derived, name])

Read-only access to a daf data set.

Functions:

transpose_name(name)

Given a 2D data name rows_axis,columns_axis#name, return the transposed data name columns_axis,rows_axis#name.

class daf.access.readers.DafReader(base: StorageReader, *, derived: Optional[StorageWriter] = None, name: str = '.daf#')

Bases: object

Read-only access to a daf data set.

The following data is used in all the examples below:

>>> import daf
>>> import yaml
>>> data = daf.DafReader(daf.FilesReader(daf.DAF_EXAMPLE_PATH), name="example")

If the name starts with ., it is appended to the base name. If the name ends with #, we append the object id to it to make it unique.
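The naming rule can be sketched as follows. This is an illustration only: resolve_name is hypothetical, and using Python's id() as the "object id" is an assumption about what is appended.

```python
def resolve_name(name: str, base_name: str, owner: object) -> str:
    """Apply the . prefix and # suffix rules to compute the final name."""
    if name.startswith("."):
        name = base_name + name        # ".daf#" becomes "<base name>.daf#"
    if name.endswith("#"):
        name = name + str(id(owner))   # assumed form of the unique object id
    return name


owner = object()
print(resolve_name("example", "files", owner))
print(resolve_name(".daf#", "files", owner))
```

With the default name '.daf#', both rules apply, so each reader gets a distinct name derived from its base storage.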

Attributes:

`name`	The name of the data set for messages.
`base`	The storage the `daf` data set is based on.
`derived`	How to store derived data computed from the storage data, for example, an alternate layout of 2D data, or the result of a pipeline.
`chain`	A `StorageChain` to use to actually access the data.

Methods:

`as_reader`()	Return the data set as a `DafReader`.
`description`(*[, detail, deep, description])	Return a dictionary describing the `daf` data set, useful for debugging.
`verify_has`(names, *[, reason])	Assert that all the listed data `names` exist in the data set, regardless of whether each is a 0D, 1D or 2D data name.
`has_data`(name)	Return whether the data set contains the `name` data, regardless of whether it is a 0D, 1D or 2D data.
`item_names`()	Return the list of names of the 0D data items that exist in the data set, in alphabetical order.
`has_item`(name)	Check whether the `name` 0D data item exists in the data set.
`get_item`(name, *[, default])	Access a 0D data item from the data set by its `name`.
`axis_names`()	Return the list of names of the axes that exist in the data set, in alphabetical order.
`has_axis`(axis)	Check whether the `axis` exists in the data set.
`axis_size`(axis)	Get the number of entries along some `axis` (which must exist).
`axis_entries`(axis)	Get the unique name of each entry in the data set along some `axis` (which must exist).
`axis_indices`(axis)	Return a mapping from the axis string entries to the integer indices.
`axis_index`(axis, entry)	Return the index of the `entry` (which must exist) in the entries of the `axis` (which must exist).
`data1d_names`(axis, *[, full])	Return the names of the 1D data that exists in the data set for a specific `axis` (which must exist), in alphabetical order.
`has_data1d`(name)	Check whether the `name` 1D data exists.
`get_vector`(name, *[, default])	Get the `name` 1D data as a `Vector`.
`get_series`(name, *[, default])	Get the `name` 1D data as a `pandas.Series`.
`data2d_names`(axes, *[, full])	Return the names of the 2D data that exists in the data set for a specific pair of `axes` (which must exist).
`has_data2d`(name)	Check whether the `name` 2D data exists.
`get_matrix`(name, *[, default])	Get the `name` 2D data (which must exist) as a `MatrixInRows`.
`get_frame`(name, *[, default])	Get the `name` 2D data (which must exist) as a `pandas.DataFrame`.
`get_columns`(axis[, columns, defaults])	Get an arbitrary collection of 1D data for the same `axis` as `columns` of a `pandas.DataFrame`.
`view`(*[, axes, data, name, cache, hide_implicit])	Create a read-only view of the data set.

name: The name of the data set for messages.

base: The storage the daf data set is based on.

derived: How to store derived data computed from the storage data, for example, an alternate layout of 2D data, or the result of a pipeline (e.g. cell,gene#UMIs|Sum). By default this is stored in a MemoryStorage so expensive operations (such as as_layout) will only be computed once in the application’s lifetime. You can explicitly set this to NO_STORAGE to disable the caching, or specify some persistent storage such as FilesWriter to allow the caching to be reused across multiple application invocations. You can even set this to be the same as the base storage to have everything (base and derived data) be stored in the same place.

chain: A StorageChain to use to actually access the data. This looks first in derived and then in the base.
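The lookup order of the chain resembles the standard library's collections.ChainMap. The sketch below is only an analogy using plain dictionaries with placeholder values; StorageChain is daf's own class and its interface is not shown here.

```python
from collections import ChainMap

# Placeholder storages: plain dictionaries standing in for base and derived.
base = {"metacell,gene#UMIs": "matrix stored on disk"}
derived = {}

# Lookups consult derived first and fall back to base.
chain = ChainMap(derived, base)
before = chain["metacell,gene#UMIs"]

# Caching a derived result does not modify the base storage.
derived["gene,metacell#UMIs"] = "cached alternate layout"
after = chain["gene,metacell#UMIs"]
print(before, "/", after, "/", sorted(base))
```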

as_reader() → DafReader

Return the data set as a DafReader.

This is a no-op (returns self) for “real” read-only data sets, but for writable data sets, it returns a “real” read-only wrapper object (that does not implement the writing methods). This ensures that the result can’t be used to modify the data if passed by mistake to a function that takes a DafWriter.

description(*, detail: bool = False, deep: bool = False, description: Optional[Dict] = None) → Dict

Return a dictionary describing the daf data set, useful for debugging.

The result uses the name field as a key, with a nested dictionary value with the keys class, axes, and data.

If not detail, the axes will contain a dictionary mapping each axis to a description of its size, and the data will contain just a list of the data names, except for StorageView where it will be a dictionary mapping each exposed name to the base name.

If detail, both the axes and the data will contain a mapping providing additional data_description of the relevant data.

If deep, there may be additional keys describing the internal storage.

If description is provided, collect the result into it. This allows collecting multiple data set descriptions into a single overall system state description.

For example:

>>> print(yaml.dump(data.description()).strip())
example:
  axes:
    batch: 13 entries
    cell: 524 entries
    cell_type: 7 entries
    gene: 10 entries
    metacell: 10 entries
    sex: 2 entries
  class: daf.access.readers.DafReader
  data:
  - created
  - batch#age
  - batch#sex
  - cell#batch
  - cell#metacell
  - cell_type#color
  - gene#forbidden
  - gene#marker
  - metacell#cell_type
  - metacell#umap_x
  - metacell#umap_y
  - cell,gene#UMIs
  - metacell,gene#UMIs

verify_has(names: Union[str, Collection[str]], *, reason: str = 'required') → None

Assert that all the listed data names exist in the data set, regardless of whether each is a 0D, 1D or 2D data name.

To verify an axis exists, list it as axis#.

For example:

>>> data.verify_has("cell#")
>>> data.verify_has(["metacell,gene#UMIs", "batch#age"])
>>> data.verify_has(["cell#color"])  
Traceback (most recent call last):
 ...
AssertionError: missing the data: cell#color which is required in the data set: example

has_data(name: str) → bool

Return whether the data set contains the name data, regardless of whether it is a 0D, 1D or 2D data.

To test whether an axis exists, you can use the axis# name.

For example:

>>> data.has_data("cell#")
True

>>> data.has_data("cell,gene#fraction")
False

item_names() → List[str]

Return the list of names of the 0D data items that exist in the data set, in alphabetical order.

For example:

>>> data.item_names()
['created']

has_item(name: str) → bool

Check whether the name 0D data item exists in the data set.

For example:

>>> data.has_item("created")
True

>>> data.has_item("modified")
False

get_item(name: str, *, default: Any = <object object>) → Any

Access a 0D data item from the data set by its name.

Normally, requesting missing data results in an error. If default is specified, it is returned instead.

The name is the name of some 0D data as described above.

For example:

>>> data.get_item("created")
datetime.datetime(2022, 7, 6, 16, 49, 44)
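The <object object> default in the signature suggests a sentinel value, which lets the caller pass None as a legitimate default. The following sketch shows that general Python pattern; it assumes, rather than confirms, that daf works this way, and get_item_sketch and _NO_DEFAULT are hypothetical names.

```python
_NO_DEFAULT = object()  # unique sentinel meaning "no default was given"


def get_item_sketch(items: dict, name: str, *, default=_NO_DEFAULT):
    """Return items[name], or default if given, or raise for missing data."""
    if name in items:
        return items[name]
    if default is _NO_DEFAULT:
        raise KeyError(f"missing the data: {name}")
    return default


items = {"created": "2022-07-06"}
print(get_item_sketch(items, "created"))
print(get_item_sketch(items, "modified", default=None))
```

Using None itself as the marker would make it impossible to tell "no default" apart from "default is None".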

axis_names() → List[str]

Return the list of names of the axes that exist in the data set, in alphabetical order.

For example:

>>> data.axis_names()
['batch', 'cell', 'cell_type', 'gene', 'metacell', 'sex']

has_axis(axis: str) → bool

Check whether the axis exists in the data set.

For example:

>>> data.has_axis("cell")
True

>>> data.has_axis("height")
False

axis_size(axis: str) → int

Get the number of entries along some axis (which must exist).

For example:

>>> data.axis_size("metacell")
10

axis_entries(axis: str) → Vector

Get the unique name of each entry in the data set along some axis (which must exist).

Note

You can also get the axis entries using get_vector by passing it the 1D data name axis#.

For example:

>>> data.axis_entries("gene")
array(['RSPO3', 'FOXA1', 'WNT6', 'TNNI1', 'MSGN1', 'LMO2', 'SFRP5',
       'DLX5', 'ITGA4', 'FOXA2'], dtype=object)

>>> data.get_vector("gene#")
array(['RSPO3', 'FOXA1', 'WNT6', 'TNNI1', 'MSGN1', 'LMO2', 'SFRP5',
       'DLX5', 'ITGA4', 'FOXA2'], dtype=object)

axis_indices(axis: str) → Mapping[str, int]

Return a mapping from the axis string entries to the integer indices.

For example:

>>> print(yaml.dump(data.axis_indices("gene")).strip())
DLX5: 7
FOXA1: 1
FOXA2: 9
ITGA4: 8
LMO2: 5
MSGN1: 4
RSPO3: 0
SFRP5: 6
TNNI1: 3
WNT6: 2

axis_index(axis: str, entry: str) → int

Return the index of the entry (which must exist) in the entries of the axis (which must exist).

For example:

>>> data.axis_index("gene", "FOXA2")
9
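The relation between axis_entries, axis_indices and axis_index can be sketched in plain Python using the gene entries shown above. This illustrates the mapping only; it says nothing about how daf computes or caches it.

```python
# The gene axis entries, copied from the axis_entries example above.
genes = ["RSPO3", "FOXA1", "WNT6", "TNNI1", "MSGN1", "LMO2", "SFRP5", "DLX5", "ITGA4", "FOXA2"]

# Equivalent of axis_indices("gene"): entry name to position along the axis.
indices = {entry: index for index, entry in enumerate(genes)}

# Equivalent of axis_index("gene", "FOXA2"): a single lookup in that mapping.
foxa2_index = indices["FOXA2"]
print(foxa2_index)
```

The mapping is only well defined because the entries along an axis are unique.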

data1d_names(axis: str, *, full: bool = True) → List[str]

Return the names of the 1D data that exists in the data set for a specific axis (which must exist), in alphabetical order.

The returned names are in the format axis#name which uniquely identifies the 1D data. If not full, the returned names include only the simple name without the axis# prefix.

For example:

>>> data.data1d_names("batch")
['batch#age', 'batch#sex']

>>> data.data1d_names("batch", full=False)
['age', 'sex']

has_data1d(name: str) → bool

Check whether the name 1D data exists.

The name must be in the format axis#name which uniquely identifies the 1D data.

For example:

>>> data.has_data1d("batch#age")
True

>>> data.has_data1d("batch#height")
False

get_vector(name: str, *, default: Optional[Tuple[Any, Union[str, dtype]]] = None) → Vector

Get the name 1D data as a Vector.

Normally, requesting missing data results in an error. If default is specified, a vector containing the specified value and data type is returned.

The name is the name of some 1D data as described above.

For example:

>>> data.get_vector("batch#age")
array([51, 38, 21, 31, 26, 43, 36, 27, 33, 45, 49, 41, 45])

get_series(name: str, *, default: Optional[Tuple[Any, Union[str, dtype]]] = None) → Series

Get the name 1D data as a pandas.Series.

Normally, requesting missing data results in an error. If default is specified, a series containing the specified value and data type is returned.

The name is the name of some 1D data as described above.

For example:

>>> data.get_series("batch#age")
Batch_0     51
Batch_1     38
Batch_2     21
Batch_3     31
Batch_4     26
Batch_5     43
Batch_6     36
Batch_7     27
Batch_8     33
Batch_9     45
Batch_10    49
Batch_11    41
Batch_12    45
dtype: int64

data2d_names(axes: Union[str, Tuple[str, str]], *, full: bool = True) → List[str]

Return the names of the 2D data that exists in the data set for a specific pair of axes (which must exist).

The returned names are in the format rows_axis,columns_axis#name which uniquely identifies the 2D data. If not full, the returned names include only the simple name without the rows_axis,columns_axis# prefix.

Note

Data will be listed in the results even if it is only stored in the other layout (that is, as columns_axis,rows_axis#name). Such data can still be fetched (e.g. using get_matrix), in which case it will be converted to the requested layout internally (and the result will be cached in derived).

For example:

>>> data.data2d_names("metacell,gene")
['metacell,gene#UMIs']

>>> data.data2d_names("metacell,gene", full=False)
['UMIs']

has_data2d(name: str) → bool

Check whether the name 2D data exists.

The name must be in the format rows_axis,columns_axis#name which uniquely identifies the 2D data.

This will also succeed if only the transposed columns_axis,rows_axis#name data exists in the data set. However, fetching the data in the specified order is likely to be less efficient.

For example:

>>> data.has_data2d("cell,gene#UMIs")
True

>>> data.has_data2d("cell,gene#fraction")
False

get_matrix(name: str, *, default: Optional[Tuple[Any, Union[str, dtype]]] = None) → Union[DenseInRows, SparseInRows]

Get the name 2D data (which must exist) as a MatrixInRows.

Normally, requesting missing data results in an error. If default is specified, a matrix containing the specified value and data type is returned.

The name is the name of some 2D data as described above.

For example:

>>> data.get_matrix("metacell,gene#UMIs")
array([[  5.,   6.,  66.,   1.,   1.,   1.,   0., 110.,  13.,   1.],
       [ 13.,   2.,   1.,   2.,   1.,   3.,   2.,   3.,   7.,   1.],
       [211.,   1.,   2.,   0.,  91.,   0.,   0.,   1.,   2.,   4.],
       [  1.,   0., 179.,   1.,   0.,   2.,   0.,   9.,   1.,   2.],
       [  3.,   0.,   2.,  18.,   1.,   1.,   1.,   1., 126.,   1.],
       [ 14.,   0.,   1.,   1.,   2.,  10.,   3.,   3.,   6.,   6.],
       [  3.,   2.,   0.,   0.,   1.,   2.,   0.,   2.,   2.,   3.],
       [  0.,   1.,   0.,   0.,   0.,   1.,   1.,   0.,   5.,   1.],
       [ 62.,   0.,   0.,   0.,   2.,   0.,   2.,   0.,   1.,   0.],
       [326.,   0.,   0.,   0., 151.,   0.,   0.,   1.,   0.,   2.]],
      dtype=float32)

get_frame(name: str, *, default: Optional[Tuple[Any, Union[str, dtype]]] = None) → FrameInRows

Get the name 2D data (which must exist) as a pandas.DataFrame.

The name is the name of some 2D data as described above.

Note

Storing Sparse data in a pandas.DataFrame fails in various unpleasant ways. Therefore, data for get_frame is always returned in a Dense format. Do not call get_frame unless you are certain that the data size is “within reason”, or that the data is memory-mapped from a Dense format on disk. In one of our data sets, calling get_frame("cell,gene#UMIs") would result in creating a numpy.ndarray of ~240GB(!), compared to the “mere” ~6GB needed to hold the data in a scipy.sparse.csr_matrix.

For example:

>>> data.get_frame("metacell,gene#UMIs")  
gene        RSPO3  FOXA1   WNT6  TNNI1  MSGN1  LMO2  SFRP5   DLX5  ITGA4  FOXA2
metacell...
Metacell_0    5.0    6.0   66.0    1.0    1.0   1.0    0.0  110.0   13.0    1.0
Metacell_1   13.0    2.0    1.0    2.0    1.0   3.0    2.0    3.0    7.0    1.0
Metacell_2  211.0    1.0    2.0    0.0   91.0   0.0    0.0    1.0    2.0    4.0
Metacell_3    1.0    0.0  179.0    1.0    0.0   2.0    0.0    9.0    1.0    2.0
Metacell_4    3.0    0.0    2.0   18.0    1.0   1.0    1.0    1.0  126.0    1.0
Metacell_5   14.0    0.0    1.0    1.0    2.0  10.0    3.0    3.0    6.0    6.0
Metacell_6    3.0    2.0    0.0    0.0    1.0   2.0    0.0    2.0    2.0    3.0
Metacell_7    0.0    1.0    0.0    0.0    0.0   1.0    1.0    0.0    5.0    1.0
Metacell_8   62.0    0.0    0.0    0.0    2.0   0.0    2.0    0.0    1.0    0.0
Metacell_9  326.0    0.0    0.0    0.0  151.0   0.0    0.0    1.0    0.0    2.0
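The gap between dense and sparse sizes mentioned in the note above follows from simple arithmetic. The sketch below estimates both; the shape and number of non-zeros are hypothetical values chosen only to land near the quoted ~240GB and ~6GB, and are not the actual dimensions of the authors' data set. It assumes float32 values and 32-bit CSR indices.

```python
def dense_bytes(rows: int, columns: int, itemsize: int = 4) -> int:
    """Memory for a dense array: one value per (row, column) combination."""
    return rows * columns * itemsize


def csr_bytes(rows: int, nonzeros: int, itemsize: int = 4, index_size: int = 4) -> int:
    """Memory for CSR: a value and a column index per non-zero, plus row pointers."""
    return nonzeros * (itemsize + index_size) + (rows + 1) * index_size


# Hypothetical shape: 2 million cells, 30 thousand genes, 1.25% non-zero values.
rows, columns, nonzeros = 2_000_000, 30_000, 750_000_000

dense_gb = dense_bytes(rows, columns) / 1e9
sparse_gb = csr_bytes(rows, nonzeros) / 1e9
print(f"dense: {dense_gb:.0f} GB, sparse: {sparse_gb:.1f} GB")
```

A quick estimate of this kind before calling get_frame on large 2D data shows whether the dense result would fit in memory.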

get_columns(axis: str, columns: Optional[Sequence[str]] = None, *, defaults: Optional[Sequence[Optional[Tuple[Any, Union[str, dtype]]]]] = None) → FrameInColumns

Get an arbitrary collection of 1D data for the same axis as columns of a pandas.DataFrame.

Normally, requesting missing columns results in an error. If a defaults entry is specified for some columns, a vector with the specified value and data type is returned.

The returned data will always be in COLUMN_MAJOR order.

If no columns are specified, returns all the 1D properties for the axis, in alphabetical order (that is, as if columns was set to data1d_names with full=False for the axis).

The specified columns names should only be the suffix following the axis# prefix in the 1D name of the data, as described above.

For example:

>>> data.get_columns("batch")  
         age     sex
batch...
Batch_0   51  female
Batch_1   38  female
Batch_2   21    male
Batch_3   31  female
Batch_4   26    male
Batch_5   43  female
Batch_6   36  female
Batch_7   27    male
Batch_8   33    male
Batch_9   45  female
Batch_10  49    male
Batch_11  41    male
Batch_12  45    male

view(*, axes: Optional[Mapping[str, Union[None, str, Sequence[Any], ndarray, _fake_sparse.spmatrix, Series, DataFrame, AxisView]]] = None, data: Optional[Mapping[str, Optional[str]]] = None, name: str = '.view#', cache: Optional[StorageWriter] = None, hide_implicit: bool = False) → DafReader

Create a read-only view of the data set.

This can be used to create slices of some axes, rename axes and/or data, and/or hide some data. It is a wrapper around the constructor of StorageView; see there for the semantics of the parameters, with the exception that here keys of the data dictionary may be any data name, including derived data.

If the name starts with ., it is appended to both the StorageView and the DafReader names. If the name ends with #, we append the object id to it to make it unique.

Note

If any of the axes is sliced, the view will ignore any derived data based on the sliced axes. While some derived data is safe to slice, some isn’t, and it isn’t easy to tell the difference; for example, when slicing the gene axis, then cell,gene#Log,... is safe to slice, but cell,gene#Folds|Significant,... is not. The code therefore plays it safe by ignoring any derived data using any of the sliced axes.

For example:

>>> view = data.view(axes=dict(gene=['FOXA1', 'FOXA2']))
>>> view.axis_entries("gene")
array(['FOXA1', 'FOXA2'], dtype=object)

>>> view = data.view(data={"metacell,gene#UMIs|Fraction": "fraction"})
>>> view.get_series("gene#metacell=Metacell_0,fraction")
RSPO3    0.024510
FOXA1    0.029412
WNT6     0.323529
TNNI1    0.004902
MSGN1    0.004902
LMO2     0.004902
SFRP5    0.000000
DLX5     0.539216
ITGA4    0.063725
FOXA2    0.004902
dtype: float32

daf.access.readers.transpose_name(name: str) → str

Given a 2D data name rows_axis,columns_axis#name, return the transposed data name columns_axis,rows_axis#name.

Note

This will refuse to transpose pipelined names rows_axis,columns_axis#name|operation|... as doing so would change the meaning of the name. For example, cell,gene#UMIs|Sum gives the sum of the UMIs of all the genes in each cell, while gene,cell#UMIs|Sum gives the sum of the UMIs of all the cells for each gene.

For example:

>>> daf.transpose_name("metacell,gene#UMIs")
'gene,metacell#UMIs'
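The behavior described above can be sketched as a small string manipulation. This is a hypothetical re-implementation for illustration, not the daf source; in particular, raising ValueError for pipelined names is an assumption about how the refusal is reported.

```python
def transpose_name_sketch(name: str) -> str:
    """Swap the two axes of a plain 2D name; refuse names that contain a pipeline."""
    if "|" in name:
        # Transposing would change which axis the operations reduce over.
        raise ValueError(f"can't transpose the pipelined name: {name}")
    axes, property_name = name.split("#", 1)
    rows_axis, columns_axis = axes.split(",")
    return f"{columns_axis},{rows_axis}#{property_name}"


print(transpose_name_sketch("metacell,gene#UMIs"))
```

Applying the function twice returns the original name, which is a convenient sanity check.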