daf.access.operations¶

Operations whose results can be cached in daf data sets.

Some operations on 1D and 2D data are used so often that it makes sense to cache their results instead of computing them from scratch every time. A trivial example is the sum of the values in each row of a matrix, which is used for computing averages, normalizing the values into fractions, etc.

To allow for efficiently caching such computations, we place severe restrictions on them. Specifically, we support two kinds of operations:

ElementWise operations transform 1D/2D data values but maintain its shape (e.g. absolute value).
Reduction operations remove one dimension of the data (e.g. sum), so that vectors become scalars and matrices become vectors. We always apply the reduction to each row of the (ROW_MAJOR) input matrix.

In both cases, the operation must be “pure”, that is, the result must depend only on the input data and the operation parameters (if any).
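The reason purity matters for caching can be shown with a minimal sketch. The cache dictionary, the key layout and the helper names below are illustrative assumptions, not the library's internals; the point is only that a pure result can be reused whenever the same data and operation are requested again.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical cache keyed by (data name, canonical operation text).
_cache: Dict[Tuple[str, str], List[float]] = {}
calls = {"count": 0}


def row_sums(matrix: List[List[float]]) -> List[float]:
    """A pure reduction: the result depends only on the input rows."""
    calls["count"] += 1
    return [sum(row) for row in matrix]


def cached(data_name: str, canonical: str, compute: Callable[[], List[float]]) -> List[float]:
    """Return the cached result if present, otherwise compute and store it."""
    key = (data_name, canonical)
    if key not in _cache:
        _cache[key] = compute()
    return _cache[key]


umis = [[1.0, 2.0, 3.0], [4.0, 0.0, 0.0]]
first = cached("cell,gene#UMIs", "Sum", lambda: row_sums(umis))
second = cached("cell,gene#UMIs", "Sum", lambda: row_sums(umis))  # Served from the cache.
```

If the operation depended on anything besides its input and parameters, the second lookup could silently return a stale value.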

Functions:

`operation`(klass)	Mark a class as implementing an `Operation` step in a pipeline.
`parse_float_parameter`(text, description)	Given the `text` value of a parameter, convert it to a floating point value, and if it is invalid, assert using the `description`.
`parse_int_parameter`(text, description)	Given the `text` value of a parameter, convert it to an integer value, and if it is invalid, assert using the `description`.
`parse_number_parameter`(text, description)	Given the `text` value of a parameter, convert it to a numeric value, and if it is invalid, assert using the `description`.
`parse_bool_parameter`(text, description)	Given the `text` value of a parameter, convert it to a Boolean value, and if it is invalid, assert using the `description`.
`float_dtype_for`(dtype)	Given an input `dtype`, return a reasonable output dtype for operations with floating point output.

Classes:

`Abs`(*, _input_dtype[, dtype])	An operation that converts each value to its absolute value.
`Floor`(*, _input_dtype[, dtype])	An operation that converts each value to the largest integer no bigger than the value.
`Round`(*, _input_dtype[, dtype])	An operation that converts each value to the nearest integer.
`Ceil`(*, _input_dtype[, dtype])	An operation that converts each value to the lowest integer no smaller than the value.
`Clip`(*, _input_dtype[, dtype])	An operation that restricts each value to the range between `min` and `max`.
`Log`(*, _input_dtype[, dtype])	An operation that converts each value to its `log`.
`Fraction`(*, _input_dtype[, dtype])	An operation that scales the values such that the sum will be 1.
`Reformat`(*, _input_dtype[, dtype])	An operation that converts between `Sparse` and `Dense` matrices.
`Densify`(*, _input_dtype[, dtype])	An operation that converts `Sparse` matrices to `Dense`.
`Sparsify`(*, _input_dtype[, dtype])	An operation that converts `Dense` matrices to `Sparse`.
`Significant`(*, _input_dtype[, dtype, abs])	An operation that converts any data to sparse format, preserving only the significant values.
`Sum`(*, _input_dtype[, dtype])	An operation that sums all the values, assuming there are no `None` (that is, `NaN`) values in the data.
`Min`(*, _input_dtype)	An operation that returns the minimal value, assuming there are no `None` (that is, `NaN`) values in the data.
`Max`(*, _input_dtype)	An operation that returns the maximal value, assuming there are no `None` (that is, `NaN`) values in the data.
`Mean`(*, _input_dtype[, dtype])	An operation that returns the mean value, assuming there are no `None` (that is, `NaN`) values in the data.
`Var`(*, _input_dtype[, dtype])	An operation that returns the variance of the values, assuming there are no `None` (that is, `NaN`) values in the data.
`Std`(*, _input_dtype[, dtype])	An operation that returns the standard deviation of the values, assuming there are no `None` (that is, `NaN`) values in the data.
`Operation`(*, dtype[, canonical])	Common functionality for all operations.
`ElementWise`(*, densifies, sparsifies, dtype)	Describe an element-wise operation (e.g., absolute value).
`Reduction`(*, dtype[, canonical])	Describe a reduction operation (e.g., sum).

daf.access.operations.operation(klass: type) → type[source]¶

Mark a class as implementing an Operation step in a pipeline.

The class should inherit from either Reduction or ElementWise. The point of the annotation is to register the class in the list of known operations, so that it is available for use in ...|OperationName....

For simplicity we register the operation under the (unqualified) class name and assert there are no ambiguities.

For example (this is the actual implementation of Sum):

@operation
class Sum(Reduction):
    def __init__(self, *, _input_dtype: str, dtype: Optional[str] = None) -> None:
        super().__init__(dtype=dtype or _input_dtype)

    def vector_to_scalar(self, input_vector: Vector) -> Any:
        return input_vector.sum()

    def dense_to_vector(self, input_dense: DenseInRows) -> Vector:
        return input_dense.sum(axis=1, dtype=self.dtype)

    def sparse_to_vector(self, input_sparse: SparseInRows) -> Vector:
        return as_vector(input_sparse.sum(axis=1, dtype=self.dtype))
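The registration step itself is not shown in this page. A minimal sketch of how such a decorator could work follows; `KNOWN_OPERATIONS`, `register_operation`, `lookup` and the `Total` class are hypothetical names chosen for illustration, and the real decorator may store or validate classes differently.

```python
from typing import Dict

# Hypothetical registry; the real module's internal storage is not shown in this page.
KNOWN_OPERATIONS: Dict[str, type] = {}


def register_operation(klass: type) -> type:
    """Register a class under its unqualified name, rejecting duplicates."""
    name = klass.__name__
    assert name not in KNOWN_OPERATIONS, f"ambiguous operation name: {name}"
    KNOWN_OPERATIONS[name] = klass
    return klass


@register_operation
class Total:
    """Stand-in for an operation class."""


def lookup(name: str) -> type:
    """Find the class to instantiate for an operation name."""
    assert name in KNOWN_OPERATIONS, f"unknown operation: {name}"
    return KNOWN_OPERATIONS[name]
```

Returning the class unchanged from the decorator keeps it usable as an ordinary class, while the dictionary gives name-based lookup.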

class daf.access.operations.Abs(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: ElementWise

An operation that converts each value to its absolute value.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

class daf.access.operations.Floor(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: ElementWise

An operation that converts each value to the largest integer no bigger than the value.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

class daf.access.operations.Round(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: ElementWise

An operation that converts each value to the nearest integer.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

class daf.access.operations.Ceil(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: ElementWise

An operation that converts each value to the lowest integer no smaller than the value.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

class daf.access.operations.Clip(*, _input_dtype: str, dtype: Optional[str] = None, min: str, max: str)[source]¶

Bases: ElementWise

An operation that restricts each value to the range between min and max.

Required Parameters:

min: The minimal allowed value in the result. Lower values will be raised to this value.
max: The maximal allowed value. Higher values will be lowered to this value.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input. However, if either min or max is a floating point number, we use float32 if the input data type is up to 32 bits and float64 otherwise.
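The clipping rule and the default output type can be sketched with NumPy. The function name `clip_sketch` is hypothetical, and taking numeric bounds instead of the textual `min` and `max` parameters is a simplification; only the behavior described above is modeled.

```python
import numpy as np


def clip_sketch(values: np.ndarray, low, high) -> np.ndarray:
    """Raise values below `low` to `low` and lower values above `high` to `high`."""
    needs_float = isinstance(low, float) or isinstance(high, float)
    if needs_float:
        # Assumed rule from the text: float32 for inputs of up to 32 bits, else float64.
        dtype = "float32" if values.dtype.itemsize <= 4 else "float64"
    else:
        dtype = values.dtype
    return np.clip(values.astype(dtype), low, high)


counts = np.array([0, 3, 7, 12], dtype="int32")
clipped_int = clip_sketch(counts, 2, 10)      # Stays int32.
clipped_float = clip_sketch(counts, 2.5, 10)  # Becomes float32.
```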

class daf.access.operations.Log(*, _input_dtype: str, dtype: Optional[str] = None, base: str, factor: str)[source]¶

Bases: ElementWise

An operation that converts each value to its log.

Required Parameters:

base: The base of the log. This can be a number or the special value e to designate the natural logarithm.
factor: A normalization factor added to all values before computing the log, to avoid running into log(0).

Optional Parameters:

dtype: The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.
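As a rough illustration of how base and factor combine, the sketch below computes log(value + factor) in the requested base using only the standard library. `log_sketch` is a hypothetical helper, not the class's implementation, and it ignores the dtype handling.

```python
import math
from typing import List


def log_sketch(values: List[float], base: str, factor: str) -> List[float]:
    """Compute log(value + factor) in the requested base for each value."""
    shift = float(factor)
    if base == "e":
        return [math.log(value + shift) for value in values]
    return [math.log(value + shift, float(base)) for value in values]


natural = log_sketch([0.0, 1.0], base="e", factor="1")
binary = log_sketch([0.0, 1.0, 3.0], base="2", factor="1")
```

With a factor of 1, a zero input maps to zero in any base, which is the usual reason for choosing it with count data.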

class daf.access.operations.Fraction(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: ElementWise

An operation that scales the values such that the sum will be 1.

For 2D data, scales each row separately such that the sum of its entries will be 1.

Note

This assumes the data is non-negative. All-zero data is allowed and will stay all-zero, rather than be converted to NaN.

Optional Parameters:

dtype: The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.

Note

Be very careful when applying this to a subset of the data. For example, if slicing the data to include just a single module of related genes, then writing cell,gene#UMIs|Fraction will give the fraction of the UMIs of each gene out of the total UMIs of this small subset of the genes, which is very different from the fraction of the UMIs of each gene out of the total UMIs of all the genes in the cell. It is therefore common to provide an explicit cell,gene#fraction property computed for the full data, which will maintain its values if the data is later sliced to focus on a subset of the genes.
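The row scaling and the slicing caveat can both be seen in a small NumPy sketch. `fraction_sketch` is a hypothetical stand-in for the operation and the UMI counts are made up for illustration.

```python
import numpy as np


def fraction_sketch(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to sum to 1, leaving all-zero rows as zeros."""
    totals = matrix.sum(axis=1, keepdims=True).astype("float64")
    totals[totals == 0] = 1.0  # Avoid 0/0 so all-zero rows stay all-zero.
    return matrix / totals


# Rows are cells, columns are genes (made-up UMI counts).
umis = np.array([[2, 2, 4, 8], [0, 0, 0, 0]], dtype="int32")
full = fraction_sketch(umis)
module = fraction_sketch(umis[:, :2])  # Slice to a two-gene "module" first.
```

In the full data the first gene holds 2 of 16 UMIs (0.125) in the first cell, but after slicing to two genes the same gene appears to hold half of the total.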

class daf.access.operations.Reformat(*, _input_dtype: str, dtype: Optional[str] = None, densifies: bool, sparsifies: bool)[source]¶

Bases: ElementWise

An operation that converts between Sparse and Dense matrices.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

Methods:

`vector_to_vector`(input_vector)	Compute the operation on an `input` vector into a new output vector.
`dense_to_dense`(input_dense, output_dense)	Compute the operation on a dense `ROW_MAJOR` `input` matrix into a dense `ROW_MAJOR` output matrix.
`dense_to_sparse`(input_dense)	Compute the operation on a dense `ROW_MAJOR` `input` matrix into a new sparse `ROW_MAJOR` output matrix.
`sparse_to_dense`(input_sparse, output_dense)	Compute the operation on a sparse `ROW_MAJOR` `input` matrix into a dense `ROW_MAJOR` output matrix.
`sparse_to_sparse`(input_sparse)	Compute the operation on a sparse `ROW_MAJOR` `input` matrix into a new sparse `ROW_MAJOR` output matrix.

vector_to_vector(input_vector: Vector) → Vector[source]¶: Compute the operation on an input vector into a new output vector.

dense_to_dense(input_dense: DenseInRows, output_dense: DenseInRows) → None[source]¶

Compute the operation on a dense ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.

This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.

dense_to_sparse(input_dense: DenseInRows) → SparseInRows[source]¶: Compute the operation on a dense ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix.

sparse_to_dense(input_sparse: SparseInRows, output_dense: DenseInRows) → None[source]¶

Compute the operation on a sparse ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.

This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.

sparse_to_sparse(input_sparse: SparseInRows) → SparseInRows[source]¶

Compute the operation on a sparse ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix.

A common idiom is to reuse the indices and indptr arrays of the input and only create a new data array.

class daf.access.operations.Densify(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: Reformat

An operation that converts Sparse matrices to Dense.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

class daf.access.operations.Sparsify(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: Reformat

An operation that converts Dense matrices to Sparse.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input.

class daf.access.operations.Significant(*, _input_dtype: str, dtype: Optional[str] = None, low: str, high: str, abs: str = 'True')[source]¶

Bases: ElementWise

An operation that converts any data to sparse format, preserving only the significant values.

Required Parameters:

high: A value of at least this is always preserved.
low: A value of at least this is preserved if, in the same row, there is at least one value which is at least high.

Optional Parameters:

abs: Whether to consider the absolute values when doing the filtering (by default, True). E.g., for fold factors, this will preserve both the strong positive and strong negative fold factors, which is exactly what you’d want for visualization in a heatmap.
dtype: The data type of the output. By default, we use the same data type as the input.
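The interplay of low, high and abs may be easier to follow in code. The sketch below is an interpretation of the rules above, assuming low is no larger than high; `significant_sketch` is a hypothetical name, and it returns a dense array with zeros in place of dropped values instead of a sparse matrix.

```python
import numpy as np


def significant_sketch(matrix: np.ndarray, low: float, high: float, use_abs: bool = True) -> np.ndarray:
    """Zero every value that does not pass the low/high rule described above."""
    strength = np.abs(matrix) if use_abs else matrix
    # A row qualifies only if at least one of its values reaches `high`.
    row_has_high = (strength >= high).any(axis=1, keepdims=True)
    keep = (strength >= low) & row_has_high
    return np.where(keep, matrix, 0)


folds = np.array([[3.5, -2.5, 0.5], [2.5, -2.0, 0.1]])
kept = significant_sketch(folds, low=2.0, high=3.0)
kept_signed = significant_sketch(folds, low=2.0, high=3.0, use_abs=False)
```

The second row is dropped entirely because none of its values reaches high, even though two of them pass low.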

Todo

Is it possible to implement Significant more efficiently for sparse matrices (in pure Python)?

class daf.access.operations.Sum(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: Reduction

An operation that sums all the values, assuming there are no None (that is, NaN) values in the data.

Optional Parameters:

dtype: The data type of the output. By default, we use the same data type as the input. This may be insufficient if the input is a small integer type.
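The warning about small integer types can be reproduced directly in NumPy, independent of this library. The values below are made up so that the first row total exceeds the int8 range.

```python
import numpy as np

counts = np.array([[100, 100], [1, 2]], dtype="int8")

# Same dtype as the input: 200 does not fit in int8, so the first row wraps around.
narrow = counts.sum(axis=1, dtype="int8")

# A wider explicit dtype holds the true totals.
wide = counts.sum(axis=1, dtype="int32")
```

Passing an explicit wider dtype is the way to avoid the silent wrap-around when the input is stored in a compact integer type.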

class daf.access.operations.Min(*, _input_dtype: str)[source]¶

Bases: Reduction

An operation that returns the minimal value, assuming there are no None (that is, NaN) values in the data.

class daf.access.operations.Max(*, _input_dtype: str)[source]¶

Bases: Reduction

An operation that returns the maximal value, assuming there are no None (that is, NaN) values in the data.

class daf.access.operations.Mean(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: Reduction

An operation that returns the mean value, assuming there are no None (that is, NaN) values in the data.

Optional Parameters:

dtype: The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.

class daf.access.operations.Var(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: Reduction

An operation that returns the variance of the values, assuming there are no None (that is, NaN) values in the data.

Optional Parameters:

dtype: The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.

class daf.access.operations.Std(*, _input_dtype: str, dtype: Optional[str] = None)[source]¶

Bases: Var

An operation that returns the standard deviation of the values, assuming there are no None (that is, NaN) values in the data.

Optional Parameters:

dtype: The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.

class daf.access.operations.Operation(*, dtype: str, canonical: Optional[str] = None)[source]¶

Bases: ABC

Common functionality for all operations.

When a user writes ...|Operation,parameter=value,..., this is converted to creating an instance of a sub-class of this base class, passing it all the parameter values as keyword arguments, with the addition of a _input_dtype keyword parameter.

The sub-class then calls super().__init__(...) to initialize itself as Operation. It can decide on the value of the following parameters based on the _input_dtype as well as user-provided parameters, if any:

Attributes:

`dtype`	Normally the output `dtype` would be the same as the `_input_dtype`, but sometimes it needs to be different (e.g. `Mean` will generate floating point results for integer data).
`canonical`	This is used for caching; that is, if the canonical form of the operation is the same for two instances, the operation is assumed to be identical.

dtype¶: Normally the output dtype would be the same as the _input_dtype, but sometimes it needs to be different (e.g. Mean will generate floating point results for integer data). It is expected that all sub-classes will allow the user to specify an optional explicit dtype parameter to override this.

canonical¶: This is used for caching; that is, if the canonical form of the operation is the same for two instances, the operation is assumed to be identical. To maximize its effectiveness, the canonical form should make all the parameters explicit, in the same order, with standard formatting. As a convenience, the dtype is automatically added to the canonical parameters string.
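To make the idea of a canonical form concrete, here is a minimal sketch that splits the textual form of an operation into a name and parameters, then rebuilds it with the parameters in a fixed order and the dtype appended. The helper names and the exact canonical layout are assumptions; this page does not specify the real format, and the sketch does not normalize value formatting (for example `2` versus `2.0`).

```python
from typing import Dict, Tuple


def split_operation(text: str) -> Tuple[str, Dict[str, str]]:
    """Split `Name,param=value,...` into the name and its textual parameters."""
    name, *parts = text.split(",")
    parameters = dict(part.split("=", 1) for part in parts)
    return name, parameters


def canonical_sketch(name: str, parameters: Dict[str, str], dtype: str) -> str:
    """Spell out all parameters in sorted order, then append the dtype."""
    # Assumed layout; the real canonical format is not specified in this page.
    explicit = [f"{key}={value}" for key, value in sorted(parameters.items())]
    return ",".join([name, *explicit, f"dtype={dtype}"])


name, parameters = split_operation("Log,factor=1,base=2")
canonical = canonical_sketch(name, parameters, dtype="float32")
same = canonical_sketch(*split_operation("Log,base=2,factor=1"), dtype="float32")
```

Both spellings map to one string, so a cache keyed by the canonical form treats them as the same operation.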

class daf.access.operations.ElementWise(*, densifies: bool, sparsifies: bool, dtype: str, canonical: Optional[str] = None, nop: bool = False)[source]¶

Bases: Operation

Describe an element-wise operation (e.g., absolute value).

Technically, the name “element-wise” is inaccurate, since in principle each output element may depend on all the input elements. The actual requirement is that the input and output of element-wise operations have the same shape. They need not produce the same data type, and the output may be dense even if the input is sparse.

The concrete sub-class needs to specify whether this operation densifies sparse matrices or sparsifies dense matrices. This, plus the type of the input, will determine which of the member functions will be called to actually do the computation.

Attributes:

`densifies`	Whether the result of the operation on a `Sparse` input matrix will be `Dense`.
`sparsifies`	Whether the result of the operation on a `Dense` input matrix will be `Sparse`.
`nop`	If `nop`, the sub-class indicates the whole operation is a no-op.

Methods:

`vector_to_vector`(input_vector)	Compute the operation on an `input` vector into a new output vector.
`dense_to_dense`(input_dense, output_dense)	Compute the operation on a dense `ROW_MAJOR` `input` matrix into a dense `ROW_MAJOR` output matrix.
`dense_to_sparse`(input_dense)	Compute the operation on a dense `ROW_MAJOR` `input` matrix into a new sparse `ROW_MAJOR` output matrix.
`sparse_to_dense`(input_sparse, output_dense)	Compute the operation on a sparse `ROW_MAJOR` `input` matrix into a dense `ROW_MAJOR` output matrix.
`sparse_to_sparse`(input_sparse)	Compute the operation on a sparse `ROW_MAJOR` `input` matrix into a new sparse `ROW_MAJOR` output matrix.

densifies¶: Whether the result of the operation on a Sparse input matrix will be Dense. This determines whether ElementWise.sparse_to_sparse or ElementWise.sparse_to_dense will be called when the input is Sparse.

sparsifies¶: Whether the result of the operation on a Dense input matrix will be Sparse. This determines whether ElementWise.dense_to_dense or ElementWise.dense_to_sparse will be called when the input is Dense.

nop¶: If nop, the sub-class indicates the whole operation is a no-op. In this case we just directly use the input data as the output data, unless this is an ElementWise operation that densifies or sparsifies the data.
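The way the two flags select a method for matrix inputs can be summarized as a small dispatch function. `method_for` is a hypothetical helper written from the attribute descriptions above; the real dispatch code may differ and also handles vectors and the nop case.

```python
def method_for(is_sparse_input: bool, densifies: bool, sparsifies: bool) -> str:
    """Name the ElementWise method that matches the input layout and the flags."""
    if is_sparse_input:
        return "sparse_to_dense" if densifies else "sparse_to_sparse"
    return "dense_to_sparse" if sparsifies else "dense_to_dense"
```

For example, an operation with both flags off keeps a sparse input sparse and a dense input dense.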

vector_to_vector(input_vector: Vector) → Vector[source]¶: Compute the operation on an input vector into a new output vector.

dense_to_dense(input_dense: DenseInRows, output_dense: DenseInRows) → None[source]¶

Compute the operation on a dense ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.

This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.
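Writing into a caller-provided output can be sketched with NumPy's `out=` argument. The function below is a hypothetical absolute-value example; the pre-allocated matrix is simulated with `np.empty_like` instead of the storage writer mentioned above.

```python
import numpy as np


def abs_dense_to_dense(input_dense: np.ndarray, output_dense: np.ndarray) -> None:
    """Write the absolute values directly into the pre-allocated output."""
    np.abs(input_dense, out=output_dense)


input_dense = np.array([[-1.0, 2.0], [3.0, -4.0]], dtype="float32")
# Stand-in for a matrix the caller allocated ahead of time (for example on disk).
output_dense = np.empty_like(input_dense)
abs_dense_to_dense(input_dense, output_dense)
```

Because the result lands in the existing buffer, no second full-size temporary array is needed.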

dense_to_sparse(input_dense: DenseInRows) → SparseInRows[source]¶: Compute the operation on a dense ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix.

sparse_to_dense(input_sparse: SparseInRows, output_dense: DenseInRows) → None[source]¶

Compute the operation on a sparse ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.

This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.

sparse_to_sparse(input_sparse: SparseInRows) → SparseInRows[source]¶

Compute the operation on a sparse ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix.

A common idiom is to reuse the indices and indptr arrays of the input and only create a new data array.
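The idiom can be sketched with SciPy's CSR matrices, assuming the sparse row-major format corresponds to `scipy.sparse.csr_matrix`. The function name is hypothetical, and the approach is only valid for operations that map zero to zero, so the set of stored positions does not change.

```python
import numpy as np
from scipy.sparse import csr_matrix


def abs_sparse_to_sparse(input_sparse: csr_matrix) -> csr_matrix:
    """Build the output from a new data array and the input's structure arrays."""
    new_data = np.abs(input_sparse.data)
    return csr_matrix(
        (new_data, input_sparse.indices, input_sparse.indptr),
        shape=input_sparse.shape,
    )


input_sparse = csr_matrix(np.array([[0.0, -2.0, 0.0], [3.0, 0.0, -1.0]]))
output_sparse = abs_sparse_to_sparse(input_sparse)
```

Only the stored values are touched, so the cost scales with the number of non-zero entries and not with the full matrix size.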

class daf.access.operations.Reduction(*, dtype: str, canonical: Optional[str] = None)[source]¶

Bases: Operation

Describe a reduction operation (e.g., sum).

If the input is a vector, the output is a scalar.

If the input is a matrix, the output is a vector with a value for each row. Functionally this should be identical to applying the vector reduction to each matrix row.
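The equivalence between the matrix form and the per-row vector form can be checked with a small NumPy sketch. The two functions are hypothetical stand-ins for a maximum reduction on plain arrays, not the library's typed vectors and matrices.

```python
import numpy as np


def max_vector_to_scalar(input_vector: np.ndarray):
    """Reduce one vector to a scalar."""
    return input_vector.max()


def max_dense_to_vector(input_dense: np.ndarray) -> np.ndarray:
    """Reduce each row of a matrix, giving one value per row."""
    return input_dense.max(axis=1)


matrix = np.array([[1, 5, 2], [7, 0, 3]])
per_row = max_dense_to_vector(matrix)
row_by_row = np.array([max_vector_to_scalar(row) for row in matrix])
```

The matrix version exists for speed; it should never give a different answer than the row-by-row loop.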

Methods:

`vector_to_scalar`(input_vector)	Reduce an input `vector` to a single scalar value.
`dense_to_vector`(input_dense)	Reduce a dense `ROW_MAJOR` `input` matrix into a new per-row output vector.
`sparse_to_vector`(input_sparse)	Reduce a sparse `ROW_MAJOR` `input` matrix into a new per-row output vector.

abstract vector_to_scalar(input_vector: Vector) → Any[source]¶: Reduce an input vector to a single scalar value.

abstract dense_to_vector(input_dense: DenseInRows) → Vector[source]¶: Reduce a dense ROW_MAJOR input matrix into a new per-row output vector.

abstract sparse_to_vector(input_sparse: SparseInRows) → Vector[source]¶: Reduce a sparse ROW_MAJOR input matrix into a new per-row output vector.

daf.access.operations.parse_float_parameter(text: str, description: str) → float[source]¶: Given the text value of a parameter, convert it to a floating point value, and if it is invalid, assert using the description.

daf.access.operations.parse_int_parameter(text: str, description: str) → int[source]¶: Given the text value of a parameter, convert it to an integer value, and if it is invalid, assert using the description.

daf.access.operations.parse_number_parameter(text: str, description: str) → Union[int, float][source]¶: Given the text value of a parameter, convert it to a numeric value, and if it is invalid, assert using the description.

daf.access.operations.parse_bool_parameter(text: str, description: str) → bool[source]¶: Given the text value of a parameter, convert it to a Boolean value, and if it is invalid, assert using the description.
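Since operation parameters arrive as text, each operation has to convert them before use. The sketch below shows one plausible shape for two such helpers; the names, the accepted Boolean spellings and the error messages are assumptions and may not match the library functions.

```python
from typing import Union


def parse_bool_sketch(text: str, description: str) -> bool:
    """Accept a few spellings of true/false; fail with the description otherwise."""
    # The accepted spellings are an assumption, not taken from the library.
    lowered = text.lower()
    if lowered not in ("true", "false", "yes", "no"):
        raise AssertionError(f"invalid {description}: {text}")
    return lowered in ("true", "yes")


def parse_number_sketch(text: str, description: str) -> Union[int, float]:
    """Prefer an integer, fall back to a float, fail if neither parses."""
    try:
        return int(text)
    except ValueError:
        pass
    try:
        return float(text)
    except ValueError:
        raise AssertionError(f"invalid {description}: {text}") from None


as_int = parse_number_sketch("3", "the min of Clip")
as_float = parse_number_sketch("2.5", "the min of Clip")
as_bool = parse_bool_sketch("True", "the abs of Significant")
```

Keeping the integer case separate lets an operation tell whether the user supplied a floating point bound, which matters for choosing the output dtype.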

daf.access.operations.float_dtype_for(dtype: str) → str[source]¶: Given an input dtype, return a reasonable output dtype for operations with floating point output.
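Several operations above state their default as float32 for inputs of up to 32 bits and float64 otherwise. A minimal sketch of that rule, assuming it is what this function applies, uses the item size of the NumPy dtype; `float_dtype_sketch` is a hypothetical name.

```python
import numpy as np


def float_dtype_sketch(dtype: str) -> str:
    """Pick float32 for inputs of up to 32 bits and float64 for wider ones."""
    return "float32" if np.dtype(dtype).itemsize <= 4 else "float64"
```

For example, `float_dtype_sketch("int16")` gives `"float32"`, while `float_dtype_sketch("int64")` gives `"float64"`.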