daf.typing.layouts¶
2D data can be arranged in many layouts. The choice of layout should, in theory, be transparent to the code. In practice, the layout is crucial for getting reasonable performance, as accessing data “against the grain” results in orders of magnitude loss of performance.
We restrict 2D data stored in daf to two layouts: ROW_MAJOR and COLUMN_MAJOR. This applies both to dense data and to sparse data (where “row-major” means “CSR” and “column-major” means “CSC”).
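For example, the same sparse matrix can be stored in either layout; this minimal sketch uses scipy directly (not daf) to show the difference:

```python
# The same matrix stored as CSR ("row-major") and CSC ("column-major").
import numpy as np
from scipy.sparse import csr_matrix, csc_matrix

dense = np.array([[0, 1, 0], [2, 0, 3]])
by_rows = csr_matrix(dense)     # non-zero elements stored row by row
by_columns = csc_matrix(dense)  # non-zero elements stored column by column

# Both hold the same values, but in a different internal order.
assert (by_rows.toarray() == by_columns.toarray()).all()
print(by_rows.data)     # scanned along the rows: [1 2 3]
print(by_columns.data)  # scanned along the columns: [2 1 3]
```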
We provide explicit data type annotations expressing the distinction between these layouts by suffixing the base type with InRows or InColumns (e.g., DenseInRows vs. DenseInColumns). This makes it easier to ensure that operations get data in the correct layout; e.g., summing each row of row-major data is much, much faster than summing each row of column-major data. Arguably, a clever implementation of the algorithms could mitigate this to a large degree, but libraries almost never perform these (non-trivial) optimizations.
The code here provides functions to test for the layout of 2D data and to convert data to the desired layout, using a somewhat more efficient algorithm than the one provided by numpy.
Of course you are free to just ignore the layout (or the type annotations altogether). This may be acceptable for very small data sets, but for “serious” code working on non-trivial data, controlling the 2D data layout is vital.
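To make the layouts concrete, here is a small numpy-only sketch (independent of daf) showing that the same array can be held either way, and that reductions along the contiguous axis are the ones that stream through memory:

```python
import numpy as np

data = np.arange(6).reshape(2, 3)     # numpy defaults to row-major ("C" order)
assert data.flags["C_CONTIGUOUS"]

by_columns = np.asfortranarray(data)  # same values, column-major ("F" order)
assert by_columns.flags["F_CONTIGUOUS"]

# Summing each row of row-major data scans contiguous memory; doing the same
# on column-major data jumps around in memory ("against the grain").
row_sums = data.sum(axis=1)
assert (row_sums == by_columns.sum(axis=1)).all()  # same result, different cost
```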
Classes:
- AnyMajor — Allow for either ROW_MAJOR or COLUMN_MAJOR 2D data layout.
Data:
- ROW_MAJOR — Require row-major layout.
- COLUMN_MAJOR — Require column-major layout.
- BLOCK_SIZE — The default size of the block to use (two copies should fit in the L2).
Functions:
- has_layout — Test whether the given 2D data is in some layout.
- copy2d — Create a copy of 2D data, preserving its layout.
- fast_all_close — Generalize numpy.allclose to handle more types.
- as_layout — Get the 2D data in a specific layout.
- class daf.typing.layouts.AnyMajor[source]¶
Bases: object
Allow for either ROW_MAJOR or COLUMN_MAJOR 2D data layout (which are the only valid instances of this class). This does not allow for other (e.g., strided or COO) layouts.
Attributes:
- name — Name for messages.
- numpy_order — The numpy order for the layout.
- numpy_flag — The numpy flag for the layout.
- contiguous_axis — The axis which should be contiguous for the layout.
- sparse_class — The sparse matrix class for this layout.
- sparse_class_name — The name of the sparse matrix class for this layout.
- name = 'any-major'¶
Name for messages.
- numpy_order = '?'¶
The numpy order for the layout.
- numpy_flag = '?'¶
The numpy flag for the layout.
- contiguous_axis = -1¶
The axis which should be contiguous for the layout.
- sparse_class = (<class 'scipy.sparse.csr.csr_matrix'>, <class 'scipy.sparse.csc.csc_matrix'>)¶
The sparse matrix class for this layout.
- sparse_class_name = 'scipy.sparse.csr/csc_matrix'¶
The name of the sparse matrix class for this layout.
- daf.typing.layouts.ROW_MAJOR: RowMajor = <daf.typing.layouts.RowMajor object>¶
Require row-major layout.
In this layout, the elements of each row are stored contiguously in memory. For sparse matrices, only the non-zero elements are stored (“CSR” format).
- daf.typing.layouts.COLUMN_MAJOR: ColumnMajor = <daf.typing.layouts.ColumnMajor object>¶
Require column-major layout.
In this layout, the elements of each column are stored contiguously in memory. For sparse matrices, only the non-zero elements are stored (“CSC” format).
- daf.typing.layouts.has_layout(data2d: Any, layout: AnyMajor) → bool [source]¶
Test whether the given 2D data2d is in some layout.
If given non-Known2D data, will always return False.
Note
Non-sparse 2D data with one row or one column is considered to be both row-major and column-major, if its elements are contiguous in memory.
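For dense numpy data, the row-major test roughly corresponds to checking the C_CONTIGUOUS flag. This is a sketch of the underlying idea, not daf's actual implementation:

```python
import numpy as np

def dense_is_row_major(data2d: np.ndarray) -> bool:
    # Rough dense-only equivalent of has_layout(data2d, ROW_MAJOR):
    # row-major means the elements of each row are contiguous in memory.
    return data2d.flags["C_CONTIGUOUS"]

assert dense_is_row_major(np.zeros((3, 4)))  # numpy's default is row-major
assert not dense_is_row_major(np.asfortranarray(np.zeros((3, 4))))
```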
- daf.typing.layouts.copy2d(data2d: DenseInRows) → DenseInRows [source]¶
- daf.typing.layouts.copy2d(data2d: DenseInColumns) → DenseInColumns
- daf.typing.layouts.copy2d(data2d: SparseInRows) → SparseInRows
- daf.typing.layouts.copy2d(data2d: SparseInColumns) → SparseInColumns
- daf.typing.layouts.copy2d(data2d: FrameInRows) → FrameInRows
- daf.typing.layouts.copy2d(data2d: FrameInColumns) → FrameInColumns
- daf.typing.layouts.copy2d(data2d: ndarray) → ndarray
- daf.typing.layouts.copy2d(data2d: spmatrix) → spmatrix
- daf.typing.layouts.copy2d(data2d: DataFrame) → DataFrame
Create a copy of 2D data, preserving its layout.
All Known2D data types (numpy.ndarray, scipy.sparse.spmatrix and pandas.DataFrame) have a copy() method, so you would think one can just write data2d.copy() and be done. That is almost true, except that in their infinite wisdom numpy will always create the copy in ROW_MAJOR layout, and pandas will always create the copy in COLUMN_MAJOR layout, because “reasons”. Sure, numpy allows specifying order="K", but pandas does not, and such a flag makes no sense for scipy.sparse.spmatrix in the first place.
The code here will give you a proper copy of the data in the same layout as the original. Sigh.
Note
In some (older) versions of pandas, it seems it just isn’t even possible to create a ROW_MAJOR frame of strings.
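The numpy half of this behavior is easy to demonstrate with pure numpy (no daf needed): ndarray.copy() defaults to order="C", so copying a column-major array silently flips its layout unless you pass order="K":

```python
import numpy as np

by_columns = np.asfortranarray(np.zeros((3, 4)))
assert by_columns.flags["F_CONTIGUOUS"]

# ndarray.copy() defaults to order="C", discarding the original layout.
assert by_columns.copy().flags["C_CONTIGUOUS"]           # layout silently changed!
assert by_columns.copy(order="K").flags["F_CONTIGUOUS"]  # layout preserved
```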
- daf.typing.layouts.fast_all_close(left: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame], right: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame], *, rtol: float = 1e-05, atol: float = 1e-08, equal_nan: bool = False) bool [source]¶
Generalize numpy.allclose to handle more types, and restrict it to only support efficient comparisons, which requires both left and right to have the same layout (if they are 2D).
Specifically:
- Both values must be 1D (numpy.ndarray or pandas.Series), or
- Both values must be Sparse matrices, or
- Both values must be either Matrix or pandas.DataFrame.
And if the values are 2D data:
- Both values must be in ROW_MAJOR layout, or
- Both values must be in COLUMN_MAJOR layout.
Otherwise the code will assert with a hopefully helpful message.
Note
When comparing Sparse matrices, the rtol, atol and equal_nan values are only used to compare the non-zero values, after ensuring their structure is identical in both matrices. This requires both matrices to be is_optimal.
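The structure-first comparison for sparse matrices can be sketched in pure scipy terms. This is an illustration of the idea (assuming both matrices are in canonical form), not daf's actual code:

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparse_all_close(left: csr_matrix, right: csr_matrix,
                     rtol: float = 1e-05, atol: float = 1e-08) -> bool:
    # First ensure the non-zero structure is identical in both matrices
    # (this assumes both are in canonical, i.e. "optimal", form)...
    if left.shape != right.shape:
        return False
    if not (np.array_equal(left.indptr, right.indptr)
            and np.array_equal(left.indices, right.indices)):
        return False
    # ...then compare only the non-zero values, without densifying anything.
    return bool(np.allclose(left.data, right.data, rtol=rtol, atol=atol))

a = csr_matrix(np.array([[0.0, 1.0], [2.0, 0.0]]))
b = csr_matrix(np.array([[0.0, 1.0 + 1e-09], [2.0, 0.0]]))
assert sparse_all_close(a, b)
```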
- daf.typing.layouts.as_layout(data2d: ndarray, layout: RowMajor, *, force_copy: bool = False, block_size: Optional[int] = None) → DenseInRows [source]¶
- daf.typing.layouts.as_layout(data2d: ndarray, layout: ColumnMajor, *, force_copy: bool = False, block_size: Optional[int] = None) → DenseInColumns
- daf.typing.layouts.as_layout(data2d: spmatrix, layout: RowMajor, *, force_copy: bool = False, block_size: Optional[int] = None) → SparseInRows
- daf.typing.layouts.as_layout(data2d: spmatrix, layout: ColumnMajor, *, force_copy: bool = False, block_size: Optional[int] = None) → SparseInColumns
- daf.typing.layouts.as_layout(data2d: DataFrame, layout: RowMajor, *, force_copy: bool = False, block_size: Optional[int] = None) → FrameInRows
Get the 2D data in a specific layout.
If force_copy, return a copy even if the data is already in the required layout.
It turns out that for large dense data it is more efficient to work on the data in blocks that fit in a “reasonable” HW cache level. The code here uses a default block_size of 1MB (two copies of this “should” fit in the L2); this value seems to work well in CPUs circa 2022. This optimization should really have been in numpy itself, ideally using some cache-oblivious algorithm, which would negate the need for specifying buffer sizes.
Todo
If/when https://github.com/numpy/numpy/issues/21655 is implemented, change the as_layout implementation to use it.
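The blocked conversion idea can be sketched in a few lines of numpy. This is a hypothetical simplification of as_layout for dense data (the real code picks the block size so two copies fit in the cache):

```python
import numpy as np

def blocked_to_row_major(data2d: np.ndarray, block: int = 512) -> np.ndarray:
    # Copy column-major data into a row-major result one square block at a
    # time, so each block (source + destination) fits in a cache level,
    # instead of streaming the whole array "against the grain".
    result = np.empty(data2d.shape, order="C", dtype=data2d.dtype)
    rows, columns = data2d.shape
    for row in range(0, rows, block):
        for column in range(0, columns, block):
            result[row:row + block, column:column + block] = \
                data2d[row:row + block, column:column + block]
    return result

by_columns = np.asfortranarray(np.arange(12.0).reshape(3, 4))
by_rows = blocked_to_row_major(by_columns)
assert by_rows.flags["C_CONTIGUOUS"]
assert np.array_equal(by_rows, by_columns)
```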
- daf.typing.layouts.BLOCK_SIZE = 1048576¶
The default size of the block to use (two copies should fit in the L2).