daf.access.operations
Operations whose results can be cached in daf data sets.
Some operations on 1D and 2D data are used so often that it makes sense to cache their results instead of recomputing them from scratch every time. A trivial example is the sum of the values in each row of a matrix, which is used for computing averages, normalizing the values into fractions, etc.
To allow for efficiently caching such computations, we place severe restrictions on them. Specifically, we support two kinds of operations:
- ElementWise operations transform 1D/2D data values but maintain the shape of the data (e.g. absolute value).
- Reduction operations remove one dimension of the data (e.g. sum), so that vectors become scalars and matrices become vectors. We always apply the reduction to each row of the input (ROW_MAJOR) matrix.
In both cases, the operation must be “pure”, that is, the result must depend only on the input data and the operation parameters (if any).
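To make the two kinds of operations concrete, here is a minimal sketch in plain NumPy (not the daf API). It only illustrates the shape rules described above: an element-wise operation keeps the shape, while a reduction yields one value per row of a matrix and a scalar for a vector.

```python
import numpy as np

matrix = np.array([[1, -2, 3], [-4, 5, -6]])  # 2 rows, 3 columns

# Element-wise: the output has the same shape as the input.
absolute = np.abs(matrix)

# Reduction: one value per row, so a matrix becomes a vector...
row_sums = matrix.sum(axis=1)

# ...and a vector (here, the first row) becomes a scalar.
total_of_first_row = matrix[0].sum()
```

Both computations are pure in the sense used here: the results depend only on the input values (and on parameters such as the axis).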
Functions:
|
Mark a class as implementing a |
|
Given the |
|
Given the |
|
Given the |
|
Given the |
|
Given an input |
Classes:
Abs | An operation that converts each value to its absolute value. |
Floor | An operation that converts each value to the largest integer no bigger than the value. |
Round | An operation that converts each value to the nearest integer. |
Ceil | An operation that converts each value to the lowest integer no smaller than the value. |
Clip | An operation that clips each value to the range between min and max. |
Log | An operation that converts each value to its log. |
Fraction | An operation that scales the values such that the sum will be 1. |
Reformat | An operation that converts between Sparse and Dense matrices. |
Densify | An operation that converts Sparse matrices to Dense. |
Sparsify | An operation that converts Dense matrices to Sparse. |
Significant | An operation that converts any data to sparse format, preserving only the significant values. |
Sum | An operation that sums all the values, assuming there are no None (that is, NaN) values in the data. |
Min | An operation that returns the minimal value, assuming there are no None (that is, NaN) values in the data. |
Max | An operation that returns the maximal value, assuming there are no None (that is, NaN) values in the data. |
Mean | An operation that returns the mean value, assuming there are no None (that is, NaN) values in the data. |
Var | An operation that returns the variance of the values, assuming there are no None (that is, NaN) values in the data. |
Std | An operation that returns the standard deviation of the values, assuming there are no None (that is, NaN) values in the data. |
Operation | Common functionality for all operations. |
ElementWise | Describe an element-wise operation (e.g., absolute value). |
Reduction | Describe a reduction operation (e.g., sum). |
- daf.access.operations.operation(klass: type) -> type
Mark a class as implementing an Operation step in a pipeline.
The class should inherit from either Reduction or ElementWise. The point of the annotation is to register the class in the list of known operations, so that it is available for use in "...|OperationName...".
For simplicity we register the operation under the (unqualified) class name and assert there are no ambiguities.
For example (this is the actual implementation of Sum):
    @operation
    class Sum(Reduction):
        def __init__(self, *, _input_dtype: str, dtype: Optional[str] = None) -> None:
            super().__init__(dtype=dtype or _input_dtype)

        def vector_to_scalar(self, input_vector: Vector) -> Any:
            return input_vector.sum()

        def dense_to_vector(self, input_dense: DenseInRows) -> Vector:
            return input_dense.sum(axis=1, dtype=self.dtype)

        def sparse_to_vector(self, input_sparse: SparseInRows) -> Vector:
            return as_vector(input_sparse.sum(axis=1, dtype=self.dtype))
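The registration idea can be sketched independently of daf as a class decorator that stores each class under its unqualified name. The registry name KNOWN_OPERATIONS and the toy Double class below are hypothetical and are not taken from the daf source; the real decorator may differ in its details.

```python
from typing import Dict

# Hypothetical registry; daf's actual data structure may differ.
KNOWN_OPERATIONS: Dict[str, type] = {}


def operation(klass: type) -> type:
    """Register ``klass`` under its unqualified class name."""
    name = klass.__name__
    # Two classes with the same unqualified name would be ambiguous.
    assert name not in KNOWN_OPERATIONS, f"ambiguous operation name: {name}"
    KNOWN_OPERATIONS[name] = klass
    return klass


@operation
class Double:
    """Toy operation used only for this illustration."""

    def vector_to_vector(self, values):
        return [2 * value for value in values]


# Look the class up by the name a user would write in a pipeline.
looked_up = KNOWN_OPERATIONS["Double"]
```

Returning the class unchanged keeps the decorator transparent, so the decorated class can still be used directly.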
- class daf.access.operations.Abs(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
ElementWise
An operation that converts each value to its absolute value.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
- class daf.access.operations.Floor(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
ElementWise
An operation that converts each value to the largest integer no bigger than the value.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
- class daf.access.operations.Round(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
ElementWise
An operation that converts each value to the nearest integer.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
- class daf.access.operations.Ceil(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
ElementWise
An operation that converts each value to the lowest integer no smaller than the value.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
- class daf.access.operations.Clip(*, _input_dtype: str, dtype: Optional[str] = None, min: str, max: str)
Bases:
ElementWise
An operation that clips each value to the range between min and max.
Required Parameters:
min
The minimal allowed value in the result. Lower values will be raised to this value.
max
The maximal allowed value in the result. Higher values will be lowered to this value.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input. However, if either of min or max is a floating point number, we use float32 if the input data type is up to 32 bits and float64 otherwise.
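The default dtype rule for Clip can be illustrated with a small NumPy sketch. The helper clip_dtype is hypothetical, and it takes already-parsed numbers rather than the string parameters the class receives; it only mirrors the rule stated above.

```python
import numpy as np


def clip_dtype(input_dtype: str, low, high) -> str:
    """Pick the output dtype following the default rule described above."""
    if isinstance(low, float) or isinstance(high, float):
        # Floating point bounds force a floating point result.
        return "float32" if np.dtype(input_dtype).itemsize <= 4 else "float64"
    return input_dtype


values = np.array([-3, 0, 7, 12], dtype="int32")

# Integer bounds: the input dtype is kept.
clipped = np.clip(values, 0, 10).astype(clip_dtype("int32", 0, 10))

# A floating point bound: a 32-bit input gives a float32 result.
clipped_float = np.clip(values, 0.5, 10).astype(clip_dtype("int32", 0.5, 10))
```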
- class daf.access.operations.Log(*, _input_dtype: str, dtype: Optional[str] = None, base: str, factor: str)
Bases:
ElementWise
An operation that converts each value to its log.
Required Parameters:
base
The base of the log. This can be a number or the special value e to designate the natural logarithm.
factor
A normalization factor added to all values before computing the log, to avoid running into log(0).
Optional Parameters:
dtype
The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.
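As a rough illustration of the parameters above, the following NumPy sketch adds the factor before taking the log and handles the special base e. The function log_with_factor is a hypothetical stand-in, not the daf implementation.

```python
import math

import numpy as np


def log_with_factor(values: np.ndarray, base: str, factor: float) -> np.ndarray:
    """Compute log(values + factor) in the given base ("e" means natural log)."""
    # Default dtype rule: float32 for inputs of up to 32 bits, else float64.
    dtype = "float32" if values.dtype.itemsize <= 4 else "float64"
    shifted = values.astype(dtype) + factor  # the factor avoids log(0)
    if base == "e":
        return np.log(shifted)
    # Change of base: log_b(x) = ln(x) / ln(b).
    return np.log(shifted) / math.log(float(base))


counts = np.array([0, 1, 3], dtype="int32")
log2_counts = log_with_factor(counts, base="2", factor=1.0)
```

With a factor of 1, a zero count maps to log2(1) = 0 instead of negative infinity.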
- class daf.access.operations.Fraction(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
ElementWise
An operation that scales the values such that the sum will be 1.
For 2D data, scales each row separately such that the sum of its entries will be 1.
Note
This assumes the data is non-negative. All-zero data is allowed and will stay all-zero, rather than be converted to NaN.
Optional Parameters:
dtype
The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.
Note
Be very careful when applying this to a subset of the data. For example, if slicing the data to include just a single module of related genes, then writing cell,gene#UMIs|Fraction will give the fraction of the UMIs of each gene out of the total UMIs of this small subset of the genes, which is very different from the fraction of the UMIs of each gene out of the total UMIs of all the genes in the cell. It is therefore common to provide an explicit cell,gene#fraction property computed for the full data, which will maintain its values if the data is later sliced to focus on a subset of the genes.
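The warning about subsets is easy to reproduce with a small NumPy sketch. The helper fraction_of_rows is hypothetical and only mimics the behavior described above (per-row scaling, with all-zero rows left as zeros).

```python
import numpy as np


def fraction_of_rows(matrix: np.ndarray) -> np.ndarray:
    """Scale each row so it sums to 1; all-zero rows stay all-zero."""
    dtype = "float32" if matrix.dtype.itemsize <= 4 else "float64"
    result = matrix.astype(dtype)
    totals = result.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1  # avoid 0/0 so that empty rows do not become NaN
    return result / totals


# Rows are cells, columns are genes; the second cell has no UMIs at all.
umis = np.array([[1, 1, 2, 4], [0, 0, 0, 0]], dtype="int32")

full = fraction_of_rows(umis)                 # fractions out of all genes
module_only = fraction_of_rows(umis[:, :2])   # fractions out of two genes only
```

For the first cell, the first gene is 0.125 of all UMIs but 0.5 of the UMIs in the two-gene subset, which is the discrepancy the note warns about.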
- class daf.access.operations.Reformat(*, _input_dtype: str, dtype: Optional[str] = None, densifies: bool, sparsifies: bool)
Bases:
ElementWise
An operation that converts between Sparse and Dense matrices.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
Methods:
vector_to_vector(input_vector) | Compute the operation on an input vector into a new output vector. |
dense_to_dense(input_dense, output_dense) | Compute the operation on a dense ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix. |
dense_to_sparse(input_dense) | Compute the operation on a dense ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix. |
sparse_to_dense(input_sparse, output_dense) | Compute the operation on a sparse ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix. |
sparse_to_sparse(input_sparse) | Compute the operation on a sparse ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix. |
- vector_to_vector(input_vector: Vector) -> Vector
Compute the operation on an input vector into a new output vector.
- dense_to_dense(input_dense: DenseInRows, output_dense: DenseInRows) -> None
Compute the operation on a dense ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.
This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.
- dense_to_sparse(input_dense: DenseInRows) -> SparseInRows
Compute the operation on a dense ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix.
- sparse_to_dense(input_sparse: SparseInRows, output_dense: DenseInRows) -> None
Compute the operation on a sparse ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.
This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.
- class daf.access.operations.Densify(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
Reformat
An operation that converts Sparse matrices to Dense.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
- class daf.access.operations.Sparsify(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
Reformat
An operation that converts Dense matrices to Sparse.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input.
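Outside of daf, the same conversions can be sketched with NumPy and SciPy. This assumes that a sparse row-major matrix corresponds to SciPy's CSR format, which is an assumption of this example rather than something stated above.

```python
import numpy as np
import scipy.sparse as sp

dense = np.array([[0, 2, 0], [0, 0, 3]], dtype="float32")

# Dense to sparse: only the non-zero entries are stored (CSR is row-oriented).
sparse = sp.csr_matrix(dense)

# Sparse to dense: the zeros are materialized again.
round_trip = sparse.toarray()
```

The round trip preserves both the values and the data type, matching the default dtype behavior described for these operations.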
- class daf.access.operations.Significant(*, _input_dtype: str, dtype: Optional[str] = None, low: str, high: str, abs: str = 'True')
Bases:
ElementWise
An operation that converts any data to sparse format, preserving only the significant values.
Required Parameters:
high
A value of at least this is always preserved.
low
A value of at least this is preserved if, in the same row, there is at least one value which is at least high.
Optional Parameters:
abs
Whether to consider the absolute values when doing the filtering (by default, True). E.g., for fold factors, this will preserve both the strong positive and strong negative fold factors, which is exactly what you’d want for visualization in a heatmap.
dtype
The data type of the output. By default, we use the same data type as the input.
Todo
Is it possible to implement Significant more efficiently for sparse matrices (in pure Python)?
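The filtering rule can be sketched for a dense input with NumPy and SciPy. The function significant below is a hypothetical illustration that assumes low is no larger than high and takes numeric parameters; it is not the daf implementation.

```python
import numpy as np
import scipy.sparse as sp


def significant(matrix: np.ndarray, low: float, high: float, use_abs: bool = True):
    """Keep values >= low, but only in rows containing a value >= high."""
    magnitudes = np.abs(matrix) if use_abs else matrix
    # A row qualifies if at least one of its values reaches ``high``.
    row_has_high = (magnitudes >= high).any(axis=1, keepdims=True)
    # With low <= high, this also keeps every value that reaches ``high``.
    keep = (magnitudes >= low) & row_has_high
    return sp.csr_matrix(np.where(keep, matrix, 0))


folds = np.array([[0.5, -3.0, 1.5], [1.5, 0.2, -1.0]])
kept = significant(folds, low=1.0, high=2.0).toarray()
```

In the first row, -3.0 reaches the high threshold in absolute value, so 1.5 is kept as well; the second row has no value reaching high, so it is dropped entirely.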
- class daf.access.operations.Sum(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
Reduction
An operation that sums all the values, assuming there are no None (that is, NaN) values in the data.
Optional Parameters:
dtype
The data type of the output. By default, we use the same data type as the input. This may be insufficient if the input is a small integer type.
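The caveat about small integer types can be demonstrated with plain NumPy, which is what the Sum implementation shown earlier delegates to. The arrays here are made up for illustration.

```python
import numpy as np

small = np.array([[200, 100], [1, 2]], dtype="uint8")

# Keeping the input dtype: 200 + 100 does not fit in 8 bits and wraps around.
wrapped = small.sum(axis=1, dtype="uint8")

# Requesting a wider dtype gives the expected totals.
widened = small.sum(axis=1, dtype="int64")
```

The first row sums to 300, which wraps to 44 in uint8, so passing an explicit wider dtype is the safer choice for such inputs.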
- class daf.access.operations.Min(*, _input_dtype: str)
Bases:
Reduction
An operation that returns the minimal value, assuming there are no None (that is, NaN) values in the data.
- class daf.access.operations.Max(*, _input_dtype: str)
Bases:
Reduction
An operation that returns the maximal value, assuming there are no None (that is, NaN) values in the data.
- class daf.access.operations.Mean(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
Reduction
An operation that returns the mean value, assuming there are no None (that is, NaN) values in the data.
Optional Parameters:
dtype
The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.
- class daf.access.operations.Var(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
Reduction
An operation that returns the variance of the values, assuming there are no None (that is, NaN) values in the data.
Optional Parameters:
dtype
The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.
- class daf.access.operations.Std(*, _input_dtype: str, dtype: Optional[str] = None)
Bases:
Var
An operation that returns the standard deviation of the values, assuming there are no None (that is, NaN) values in the data.
Optional Parameters:
dtype
The data type of the output. By default, we use float32 if the input data type is up to 32 bits and float64 otherwise.
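The floating point defaults of Mean, Var and Std can be sketched with NumPy. This example assumes the population variance (NumPy's default, ddof=0); whether daf uses the same convention is not stated above.

```python
import numpy as np

counts = np.array([[1, 2, 3, 4], [2, 2, 2, 2]], dtype="int32")

# Default dtype rule: float32 for inputs of up to 32 bits, else float64.
dtype = "float32" if counts.dtype.itemsize <= 4 else "float64"

row_mean = counts.mean(axis=1, dtype=dtype)
row_var = counts.var(axis=1, dtype=dtype)  # population variance (ddof=0)
row_std = np.sqrt(row_var)                 # standard deviation from the variance
```

Integer input therefore produces floating point output, which is why these operations do not default to the input data type.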
- class daf.access.operations.Operation(*, dtype: str, canonical: Optional[str] = None)
Bases:
ABC
Common functionality for all operations.
When a user writes "...|Operation,parameter=value,...", this is converted to creating an instance of a sub-class of this base class, passing it all the parameter values as keyword arguments, with the addition of a _input_dtype keyword parameter.
The sub-class then calls super().__init__(...) to initialize itself as Operation. It can decide on the value of the following parameters based on the _input_dtype as well as user-provided parameters, if any:
Attributes:
dtype | Normally the output dtype would be the same as the _input_dtype, but sometimes it needs to be different (e.g. Mean will generate floating point results for integer data). |
canonical | This is used for caching; that is, if the canonical form of the operation is the same for two instances, the operation is assumed to be identical. |
- dtype
Normally the output dtype would be the same as the _input_dtype, but sometimes it needs to be different (e.g. Mean will generate floating point results for integer data). It is expected that all sub-classes will allow the user to specify an optional explicit dtype parameter to override this.
- canonical
This is used for caching; that is, if the canonical form of the operation is the same for two instances, the operation is assumed to be identical. To maximize its effectiveness, the canonical form should make all the parameters explicit, in the same order, with standard formatting. As a convenience, the dtype is automatically added to the canonical parameters string.
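To see why a canonical form helps caching, consider the following sketch. The function canonical_form and the exact string layout are hypothetical; the only properties taken from the text above are that all parameters are explicit, in a fixed order, with the dtype appended.

```python
def canonical_form(name: str, dtype: str, **parameters: str) -> str:
    """Hypothetical canonical string: name, sorted parameters, then dtype."""
    parts = [name] + [f"{key}={parameters[key]}" for key in sorted(parameters)]
    parts.append(f"dtype={dtype}")
    return ",".join(parts)


cache = {}

# The same operation written with its parameters in a different order...
first = canonical_form("Log", "float32", base="2", factor="1")
second = canonical_form("Log", "float32", factor="1", base="2")

# ...maps to the same cache key, so the cached result can be reused.
cache[first] = "cached result"
```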
- class daf.access.operations.ElementWise(*, densifies: bool, sparsifies: bool, dtype: str, canonical: Optional[str] = None, nop: bool = False)
Bases:
Operation
Describe an element-wise operation (e.g., absolute value).
Technically, the name “element-wise” is inaccurate, since in principle each output element may depend on all the input elements. The actual requirement is that the input and output of element-wise operations have the same shape. They need not produce the same data type, and the output may be dense even if the input is sparse.
The concrete sub-class needs to specify whether this operation densifies sparse matrices or sparsifies dense matrices. This, plus the type of the input, will determine which of the member functions will be called to actually do the computation.
Attributes:
densifies | Whether the result of the operation on a Sparse input matrix will be Dense. |
sparsifies | Whether the result of the operation on a Dense input matrix will be Sparse. |
nop | If nop, the sub-class indicates the whole operation is a no-op. |
Methods:
vector_to_vector(input_vector) | Compute the operation on an input vector into a new output vector. |
dense_to_dense(input_dense, output_dense) | Compute the operation on a dense ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix. |
dense_to_sparse(input_dense) | Compute the operation on a dense ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix. |
sparse_to_dense(input_sparse, output_dense) | Compute the operation on a sparse ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix. |
sparse_to_sparse(input_sparse) | Compute the operation on a sparse ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix. |
- densifies
Whether the result of the operation on a Sparse input matrix will be Dense. This determines whether ElementWise.sparse_to_sparse or ElementWise.sparse_to_dense will be called when the input is Sparse.
- sparsifies
Whether the result of the operation on a Dense input matrix will be Sparse. This determines whether ElementWise.dense_to_dense or ElementWise.dense_to_sparse will be called when the input is Dense.
- nop
If nop, the sub-class indicates the whole operation is a no-op. In this case we just directly use the input data as the output data, unless this is an ElementWise operation that densifies or sparsifies the data.
- vector_to_vector(input_vector: Vector) -> Vector
Compute the operation on an input vector into a new output vector.
- dense_to_dense(input_dense: DenseInRows, output_dense: DenseInRows) -> None
Compute the operation on a dense ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.
This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.
- dense_to_sparse(input_dense: DenseInRows) -> SparseInRows
Compute the operation on a dense ROW_MAJOR input matrix into a new sparse ROW_MAJOR output matrix.
- sparse_to_dense(input_sparse: SparseInRows, output_dense: DenseInRows) -> None
Compute the operation on a sparse ROW_MAJOR input matrix into a dense ROW_MAJOR output matrix.
This allows us to pre-allocate the output matrix using StorageWriter.create_dense_in_rows, allowing us to efficiently process large data without consuming an excessive amount of RAM.
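The way densifies and sparsifies select the member function for matrix input can be summarized in a few lines. The helper method_name is hypothetical; it only restates the dispatch rule from the attribute descriptions above and is not how daf necessarily implements it.

```python
def method_name(input_is_sparse: bool, densifies: bool, sparsifies: bool) -> str:
    """Return the name of the matrix method that would be called."""
    if input_is_sparse:
        # For sparse input, only ``densifies`` matters.
        return "sparse_to_dense" if densifies else "sparse_to_sparse"
    # For dense input, only ``sparsifies`` matters.
    return "dense_to_sparse" if sparsifies else "dense_to_dense"


# An operation that neither densifies nor sparsifies keeps the input format.
same_format = method_name(input_is_sparse=True, densifies=False, sparsifies=False)

# A Significant-like operation that sparsifies dense input.
to_sparse = method_name(input_is_sparse=False, densifies=False, sparsifies=True)
```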
- class daf.access.operations.Reduction(*, dtype: str, canonical: Optional[str] = None)
Bases:
Operation
Describe a reduction operation (e.g., sum).
If the input is a vector, the output is a scalar.
If the input is a matrix, the output is a vector with a value for each row. Functionally this should be identical to applying the vector reduction to each matrix row.
Methods:
vector_to_scalar(input_vector) | Reduce an input vector to a single scalar value. |
dense_to_vector(input_dense) | Reduce a dense ROW_MAJOR input matrix into a new per-row output vector. |
sparse_to_vector(input_sparse) | Reduce a sparse ROW_MAJOR input matrix into a new per-row output vector. |
- abstract vector_to_scalar(input_vector: Vector) -> Any
Reduce an input vector to a single scalar value.
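The requirement that the matrix reduction match the vector reduction applied to each row can be checked directly. The two free functions below are a NumPy illustration using a maximum reduction; they borrow the method names from the table above but are not daf classes.

```python
import numpy as np


def vector_to_scalar(input_vector: np.ndarray):
    """Reduce a vector to its maximal value."""
    return input_vector.max()


def dense_to_vector(input_dense: np.ndarray) -> np.ndarray:
    """Reduce each row of a dense matrix to its maximal value."""
    return input_dense.max(axis=1)


matrix = np.array([[1, 5, 2], [7, 0, 3]])

# Applying the vector reduction to every row, one row at a time.
per_row = np.array([vector_to_scalar(row) for row in matrix])
```

The vectorized matrix version is only an optimization; its result should equal per_row.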
- daf.access.operations.parse_float_parameter(text: str, description: str) -> float
Given the text value of a parameter, convert it to a floating point value, and if it is invalid, assert using the description.
- daf.access.operations.parse_int_parameter(text: str, description: str) -> int
Given the text value of a parameter, convert it to an integer value, and if it is invalid, assert using the description.
- daf.access.operations.parse_number_parameter(text: str, description: str) -> Union[int, float]
Given the text value of a parameter, convert it to a numeric value, and if it is invalid, assert using the description.
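Since operation parameters arrive as text, helpers like these turn them into numbers. The function parse_number below is a hypothetical stand-in that shows one plausible behavior (try an integer first, then a float); the error message format and the exact rules of the real functions are assumptions.

```python
from typing import Union


def parse_number(text: str, description: str) -> Union[int, float]:
    """Convert ``text`` to an int if possible, else a float, else fail."""
    for convert in (int, float):
        try:
            return convert(text)
        except ValueError:
            continue
    # Mirrors the "assert using the description" behavior described above.
    raise AssertionError(f"invalid {description}: {text}")


low = parse_number("3", "low threshold")      # stays an integer
high = parse_number("0.5", "high threshold")  # becomes a float
```

Keeping integers as integers matters for rules such as the Clip default dtype, which depends on whether a bound is a floating point number.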