daf.typing.optimization
The generic data types provided by numpy, pandas and scipy.sparse allow for representing data in ways which aren’t optimal for processing. While most operations that take optimal data as input also return optimal data, this isn’t always the case, and the documentation of the relevant libraries is mostly silent on this issue.
At the same time, some code (especially C++ extension code) relies on the data being in one of the optimal formats. Even code that technically works can become much slower when applied to non-optimal data. Therefore, all data fetched from daf is always is_optimal.
Examples of non-optimal data are strangely-strided numpy data, any scipy.sparse.spmatrix that isn’t scipy.sparse.csr_matrix or scipy.sparse.csc_matrix, or any data that is, but contains duplicate and/or unsorted indices.
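The cases listed above can be reproduced with plain numpy and scipy.sparse. The following is a minimal sketch that only uses those two libraries (it does not call daf); the variable names are illustrative.

```python
import numpy as np
import scipy.sparse as sp

# Strangely-strided dense data: every other column of a row-major matrix.
dense = np.arange(12, dtype="float32").reshape(3, 4)
strided = dense[:, ::2]
print(strided.flags.c_contiguous, strided.flags.f_contiguous)

# A CSR matrix whose column indices inside the first row are unsorted.
data = np.array([1.0, 2.0, 3.0])
indices = np.array([2, 0, 1])  # row 0 stores column 2 before column 0
indptr = np.array([0, 2, 3])
unsorted_csr = sp.csr_matrix((data, indices, indptr), shape=(2, 3))
print(unsorted_csr.has_sorted_indices)

# A sparse matrix that is neither CSR nor CSC.
coo = sp.coo_matrix(dense)
print(coo.format)
```

Each of these objects holds valid data; it is only the memory layout or the index order that makes them awkward for code expecting one of the optimal formats.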
The functions here test whether data is in an optimal format, and convert data to an optimal format, in-place if possible, optionally forcing a copy.
Most of the time, you can ignore these functions. However, if you are writing serious processing code (e.g. a library), they are useful in ensuring it will be correct and efficient.
Note
We can easily test and correct for most issues, but if some operation placed “structural zeros” inside the data of a scipy.sparse.csr_matrix or scipy.sparse.csc_matrix, we have no way to detect them, so the result would be “optimal” as far as the code here goes. Luckily, such operations are rare. If you do encounter this, manually invoke the eliminate_zeros method on the afflicted matrix.
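As an illustration of the manual fix, here is a small sketch using scipy.sparse only. It creates a structural zero by overwriting a stored value, which is just one assumed way such zeros can appear, and then removes it with eliminate_zeros.

```python
import numpy as np
import scipy.sparse as sp

matrix = sp.csr_matrix(np.array([[1.0, 0.0, 2.0], [0.0, 3.0, 0.0]]))

# Overwrite a stored value with zero; the entry remains in the structure.
matrix.data[0] = 0.0
stored_before = matrix.nnz  # counts stored entries, explicit zeros included

# Drop the explicitly stored zeros, in place.
matrix.eliminate_zeros()
stored_after = matrix.nnz

print(stored_before, stored_after)
```

The dense contents are the same before and after the call; only the number of stored entries changes.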
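Data:
KnownT | A TypeVar bound to Known.
Functions:
why_not_optimal(data) | Return a reason for why some data is not "optimal", or None if it is optimal.
is_optimal(data) | Whether the data is in one of the supported daf types and also in an "optimal" format.
be_optimal(data) | Assert that some data is in "optimal" format and return it as-is.
optimize(data, *, force_copy, preferred_layout) | Given some data in any Known format, return it in a Proper "optimal" format.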
- daf.typing.optimization.KnownT
A TypeVar bound to Known.
alias of TypeVar('KnownT', bound=Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame])
- daf.typing.optimization.why_not_optimal(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame]) -> Optional[str]
Return a reason for why some data is not “optimal”, or None if it is optimal.
- daf.typing.optimization.is_optimal(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame]) -> bool
Whether the data is in one of the supported daf types and also in an “optimal” format.
- daf.typing.optimization.be_optimal(data: KnownT) -> KnownT
Assert that some data is in “optimal” format and return it as-is.
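To make the relationship between the three checking functions concrete, here is a rough sketch built on numpy and scipy.sparse alone. The names why_not_optimal_sketch, is_optimal_sketch and be_optimal_sketch are hypothetical stand-ins, not daf's implementation; the sketch ignores pandas types and assumes that "optimal" means contiguous dense data or a CSR/CSC matrix in canonical form.

```python
from typing import Any, Optional

import numpy as np
import scipy.sparse as sp


def why_not_optimal_sketch(data: Any) -> Optional[str]:
    """Hypothetical stand-in: return a reason string, or None if acceptable."""
    if isinstance(data, np.ndarray):
        if data.ndim == 1:
            return None if data.flags.c_contiguous else "vector is not contiguous"
        if data.ndim == 2:
            if data.flags.c_contiguous or data.flags.f_contiguous:
                return None
            return "matrix is neither row-major nor column-major"
        return "unsupported number of dimensions"
    if sp.issparse(data):
        if data.format not in ("csr", "csc"):
            return f"sparse format {data.format} is not csr or csc"
        if not data.has_canonical_format:
            return "indices are unsorted and/or duplicated"
        return None
    return "unsupported type"


def is_optimal_sketch(data: Any) -> bool:
    # Assumed relationship: optimal means there is no reason to report.
    return why_not_optimal_sketch(data) is None


def be_optimal_sketch(data: Any) -> Any:
    # Assert, then hand back the very same object.
    reason = why_not_optimal_sketch(data)
    assert reason is None, reason
    return data


row_major = np.zeros((3, 4))
strided = row_major[:, ::2]
coo = sp.coo_matrix(np.eye(3))
for candidate in (row_major, strided, coo):
    print(why_not_optimal_sketch(candidate))
```

Returning the input unchanged from the asserting function lets a caller wrap an expression in the check without restructuring the surrounding code.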
- daf.typing.optimization.optimize(data: Vector, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> Vector
- daf.typing.optimization.optimize(data: DenseInRows, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> DenseInRows
- daf.typing.optimization.optimize(data: DenseInColumns, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> DenseInColumns
- daf.typing.optimization.optimize(data: ndarray, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> Union[Vector, DenseInRows, DenseInColumns]
- daf.typing.optimization.optimize(data: Series, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> Series
- daf.typing.optimization.optimize(data: FrameInRows, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> FrameInRows
- daf.typing.optimization.optimize(data: FrameInColumns, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> FrameInColumns
- daf.typing.optimization.optimize(data: DataFrame, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> Union[FrameInRows, FrameInColumns]
- daf.typing.optimization.optimize(data: SparseInRows, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> SparseInRows
- daf.typing.optimization.optimize(data: SparseInColumns, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> SparseInColumns
- daf.typing.optimization.optimize(data: spmatrix, *, force_copy: bool = False, preferred_layout: AnyMajor = _layouts.ROW_MAJOR) -> Union[SparseInRows, SparseInColumns]
Given some data in any Known format, return it in a Proper “optimal” format.
If possible, and force_copy is not specified, this optimizes the data in-place. Otherwise, a copy is created. E.g. this can sort the indices of a CSR or CSC matrix in-place.
If the data is 2D, and it has no clear layout, a copy will be created using the preferred_layout. E.g. this will determine whether a COO matrix will be converted to a CSR or CSC matrix. For 1D data, this argument is ignored.
If the data was copied and force_copy was not specified, and the data was is_frozen, then so is the result; this ensures the code consuming the result will work regardless of whether a copy was done. If force_copy was specified, the result is never is_frozen.
This will fail if given a pandas.DataFrame with mixed data element types.
Note
This uses unfrozen to modify a scipy.sparse.csr_matrix or a scipy.sparse.csc_matrix in-place, even if it is is_frozen (unless force_copy is specified). This seems acceptable for in-memory sparse matrices, but would fail for read-only memory-mapped sparse matrices. In practice this is not a problem, because memory-mapped sparse matrices are only created by the FilesWriter, which always writes them in the optimal format, so no in-place modification is attempted.