daf.typing.optimization

The generic data types provided by numpy, pandas and scipy.sparse allow for representing data in ways which aren’t optimal for processing. While most operations that take optimal data as input also return optimal data, this isn’t always the case, and the documentation of the relevant libraries is mostly silent on this issue.

At the same time, some code (especially C++ extension code) relies on the data being in one of the optimal formats. Even code that technically works can become much slower when applied to non-optimal data. Therefore, all data fetched from daf is is_optimal.

Examples of non-optimal data are strangely-strided numpy data, any scipy.sparse.spmatrix that isn’t a scipy.sparse.csr_matrix or a scipy.sparse.csc_matrix, and any CSR or CSC matrix that contains duplicate and/or unsorted indices.
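The cases listed above can be reproduced with plain numpy and scipy.sparse. The following is a minimal sketch that only uses flags exposed by those libraries; it does not call daf, and the variable names are made up for illustration.

```python
import numpy as np
import scipy.sparse as sp

# Taking every second column yields a view whose strides skip elements,
# so it is neither C-contiguous (row-major) nor F-contiguous (column-major).
dense = np.arange(12, dtype="float64").reshape(3, 4)
strided = dense[:, ::2]
strided_is_contiguous = bool(strided.flags.c_contiguous or strided.flags.f_contiguous)

# A COO matrix is a valid sparse matrix, but it is neither CSR nor CSC.
coo = sp.coo_matrix(dense)
coo_format = coo.format

# A CSR matrix built from raw arrays keeps whatever index order it was given.
# Row 0 lists column 2 before column 0, so its indices are unsorted.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([2, 0, 1])
indptr = np.array([0, 2, 3])
unsorted_csr = sp.csr_matrix((data, indices, indptr), shape=(2, 3))
csr_is_sorted = bool(unsorted_csr.has_sorted_indices)
```

All three objects hold perfectly valid data; they are only awkward for code that assumes a contiguous or canonical compressed layout.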

The functions here test whether data is in an optimal format and convert data to an optimal format, in-place if possible, optionally forcing a copy.

Most of the time, you can ignore these functions. However, if you are writing processing code that others will rely on (e.g. a library), they are useful for ensuring it is correct and efficient.

Note

We can easily test and correct for most issues, but if some operation placed “structural zeros” inside a scipy.sparse.csr_matrix or a scipy.sparse.csc_matrix, we have no way to detect them, so the result would be “optimal” as far as the code here goes. Luckily, such operations are rare. If you do encounter this, manually invoke the eliminate_zeros method on the afflicted matrix.
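As an illustration of what a structural zero looks like, the sketch below builds a CSR matrix that explicitly stores a zero value and then removes it with scipy's eliminate_zeros method. It uses only scipy.sparse and numpy; the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

# The middle entry of `data` is a zero that is explicitly stored in the matrix.
data = np.array([1.0, 0.0, 2.0])
indices = np.array([0, 1, 2])
indptr = np.array([0, 2, 3])
matrix = sp.csr_matrix((data, indices, indptr), shape=(2, 3))

# `nnz` counts stored entries, including the explicitly stored zero.
stored_before = matrix.nnz

# Drop the stored zeros in-place; the logical contents of the matrix are unchanged.
matrix.eliminate_zeros()
stored_after = matrix.nnz
```

The number of stored entries shrinks by one while the dense equivalent of the matrix stays the same.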

Data:

KnownT

A TypeVar bound to Known.

Functions:

`why_not_optimal`(data)	Return a reason for why some data is not "optimal", or `None` if it is optimal.
`is_optimal`(data)	Whether the `data` is in one of the supported `daf` types and also in an "optimal" format.
`be_optimal`(data)	Assert that some data is in "optimal" format and return it as-is.
`optimize`(data, *[, force_copy, preferred_layout])	Given some `data` in any `Known` format, return it in a `Proper` "optimal" format.

daf.typing.optimization.KnownT

A TypeVar bound to Known.

alias of TypeVar('KnownT', bound=Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame])

daf.typing.optimization.why_not_optimal(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame]) → Optional[str]: Return a reason for why some data is not “optimal”, or None if it is optimal.

daf.typing.optimization.is_optimal(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame]) → bool: Whether the data is in one of the supported daf types and also in an “optimal” format.

daf.typing.optimization.be_optimal(data: KnownT) → KnownT: Assert that some data is in “optimal” format and return it as-is.
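To show how a "reason or None" test and an "assert and return as-is" helper fit together, here is a rough stand-in written against numpy and scipy.sparse only. The names sketch_why_not_optimal and sketch_be_optimal are hypothetical, the checks are a simplified guess at the criteria described on this page (duplicate indices and pandas types are not handled), and this is not the daf implementation.

```python
from typing import Optional, TypeVar

import numpy as np
import scipy.sparse as sp

T = TypeVar("T")


def sketch_why_not_optimal(data: object) -> Optional[str]:
    """Hypothetical stand-in: return a reason string, or None if no issue is found."""
    if isinstance(data, np.ndarray):
        if not (data.flags.c_contiguous or data.flags.f_contiguous):
            return "numpy array is not contiguous"
        return None
    if sp.issparse(data):
        if data.format not in ("csr", "csc"):
            return f"sparse format {data.format} is neither csr nor csc"
        if not data.has_sorted_indices:
            return "sparse indices are not sorted"
        return None
    return f"unsupported type {type(data).__name__}"


def sketch_be_optimal(data: T) -> T:
    """Hypothetical stand-in: assert that no issue is found, then return the same object."""
    reason = sketch_why_not_optimal(data)
    assert reason is None, reason
    return data


rows = np.zeros((3, 4))
same = sketch_be_optimal(rows)                  # passes, returns the same object
reason = sketch_why_not_optimal(rows[:, ::2])   # a strided view is reported
```

Returning the input unchanged lets the assertion be used inline, e.g. wrapped around an argument at the top of a function that requires optimal data.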

daf.typing.optimization.optimize(data: Vector, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → Vector

daf.typing.optimization.optimize(data: DenseInRows, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → DenseInRows

daf.typing.optimization.optimize(data: DenseInColumns, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → DenseInColumns

daf.typing.optimization.optimize(data: ndarray, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → Union[Vector, DenseInRows, DenseInColumns]

daf.typing.optimization.optimize(data: Series, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → Series

daf.typing.optimization.optimize(data: FrameInRows, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → FrameInRows

daf.typing.optimization.optimize(data: FrameInColumns, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → FrameInColumns

daf.typing.optimization.optimize(data: DataFrame, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → Union[FrameInRows, FrameInColumns]

daf.typing.optimization.optimize(data: SparseInRows, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → SparseInRows

daf.typing.optimization.optimize(data: SparseInColumns, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → SparseInColumns

daf.typing.optimization.optimize(data: spmatrix, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) → Union[SparseInRows, SparseInColumns]

Given some data in any Known format, return it in a Proper “optimal” format.

If possible, and force_copy is not specified, this optimizes the data in-place. Otherwise, a copy is created. E.g. this can sort the indices of a CSR or CSC matrix in-place.
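The difference between optimizing in-place and optimizing into a copy can be seen with scipy's own methods: sort_indices modifies the matrix it is called on, while sorted_indices returns a sorted copy and leaves the original alone. This sketch uses only scipy.sparse; how optimize chooses between the two is not shown here, and make_unsorted is a helper invented for the example.

```python
import numpy as np
import scipy.sparse as sp


def make_unsorted() -> sp.csr_matrix:
    """Build a small CSR matrix whose first row lists column 2 before column 0."""
    data = np.array([1.0, 2.0, 3.0])
    indices = np.array([2, 0, 1])
    indptr = np.array([0, 2, 3])
    return sp.csr_matrix((data, indices, indptr), shape=(2, 3))


# In-place: the same object ends up with sorted indices (and reordered values).
in_place = make_unsorted()
in_place.sort_indices()

# Copy: the original stays unsorted, the returned matrix is a new sorted object.
original = make_unsorted()
copied = original.sorted_indices()
```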

If the data is 2D, and it has no clear layout, a copy will be created using the preferred_layout. E.g. this will determine whether a COO matrix will be converted to a CSR or CSC matrix. For 1D data, this argument is ignored.
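The choice that a preferred layout has to settle can be sketched with the underlying library calls. A COO matrix can become either CSR (rows) or CSC (columns), and a strided dense array can be copied into either row-major or column-major memory. The code below uses only numpy and scipy.sparse and does not show how daf maps preferred_layout onto these calls.

```python
import numpy as np
import scipy.sparse as sp

# A COO matrix has no row or column orientation; either conversion is valid.
coo = sp.coo_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
as_rows = coo.tocsr()      # compressed by rows
as_columns = coo.tocsc()   # compressed by columns

# A strided dense view has no clear layout either; a copy must pick one.
strided = np.arange(12, dtype="float64").reshape(3, 4)[:, ::2]
dense_rows = np.ascontiguousarray(strided)   # row-major (C order) copy
dense_columns = np.asfortranarray(strided)   # column-major (Fortran order) copy
```

Both results hold the same values in each case; only the memory layout, and therefore which axis is cheap to slice, differs.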

If the data was copied without force_copy being specified, and the original data was is_frozen, then the result is also is_frozen; this ensures the code consuming the result will work regardless of whether a copy was done. If force_copy was specified, the result is never is_frozen.
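The reason the frozen state has to be carried over explicitly can be seen in numpy, under the assumption (not stated on this page) that being frozen corresponds to numpy's read-only flag for dense arrays. A copy of a read-only array is writable by default, so a caller would otherwise see different behavior depending on whether a copy happened.

```python
import numpy as np

# A read-only, non-contiguous view (assumed here to model "frozen" dense data).
frozen = np.arange(6, dtype="float64").reshape(2, 3)[:, ::2]
frozen.setflags(write=False)

# Copying into a contiguous layout produces a fresh array that is writable.
plain_copy = np.ascontiguousarray(frozen)

# To keep the behavior consistent, the copy is marked read-only again.
refrozen = np.ascontiguousarray(frozen)
refrozen.setflags(write=False)
```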

This will fail if given a pandas.DataFrame with mixed data element types.
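A caller can check for this condition up front. The sketch below assumes that "mixed data element types" means the columns do not all share one dtype; has_single_dtype is a hypothetical helper, not part of daf.

```python
import pandas as pd


def has_single_dtype(frame: pd.DataFrame) -> bool:
    """Hypothetical helper: True if every column of the frame has the same dtype."""
    return len(set(frame.dtypes)) == 1


# One numeric column and one text column: the element types are mixed.
mixed = pd.DataFrame({"count": [1, 2], "label": ["x", "y"]})

# Two float columns: a single element type, so the frame maps onto one 2D array.
uniform = pd.DataFrame({"a": [1.0, 2.0], "b": [3.0, 4.0]})
uniform_dtype = uniform.to_numpy().dtype
```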

Note

This uses unfrozen to modify a scipy.sparse.csr_matrix or a scipy.sparse.csc_matrix in-place, even if it is is_frozen (unless force_copy is specified). This seems acceptable for in-memory sparse matrices. It would fail for read-only memory-mapped sparse matrices, but in practice this does not arise, because memory-mapped sparse matrices are only created by the FilesWriter, which always writes them in the optimal format, so no in-place modification is attempted.