daf.typing.layouts

2D data can be arranged in many layouts. The choice of layout should, in theory, be transparent to the code. In practice, the layout is crucial for getting reasonable performance, as accessing data “against the grain” results in orders of magnitude loss of performance.
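The distinction is easy to see with plain numpy (an illustrative sketch, not daf code): the same logical 2D data can be laid out row-major ("C" order) or column-major ("F" order) in memory, and only reductions "with the grain" stream through contiguous memory.

```python
import numpy as np

# The same logical 2D data can be laid out row-major ("C" order)
# or column-major ("F" order) in memory.
data = np.arange(6, dtype=np.float64).reshape(2, 3)  # row-major by default
assert data.flags["C_CONTIGUOUS"]

transposed_view = data.T          # a transposed view is column-major
assert transposed_view.flags["F_CONTIGUOUS"]

# Summing along the contiguous axis ("with the grain") streams through
# memory; summing across it ("against the grain") strides through it,
# which is dramatically slower for large arrays.
row_sums = data.sum(axis=1)       # with the grain for row-major data
```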

We restrict 2D data stored in daf to two layouts: ROW_MAJOR and COLUMN_MAJOR. This applies both to dense data and also to sparse data (where “row-major” data means “CSR” and “column-major” means “CSC”).

We provide explicit data type annotations expressing the distinction between these layouts by suffixing the base type with InRows or InColumns (e.g., DenseInRows vs. DenseInColumns). This makes it easier to ensure that operations get data in the correct layout; e.g., summing each row of row-major data is much, much faster than summing the rows of column-major data. Arguably, a clever implementation of the algorithms could mitigate this to a large degree, but libraries almost never perform these (non-trivial) optimizations.

The code here provides functions to test for the layout of 2D data and to convert data to the desired layout, providing a somewhat more efficient algorithm to do so than is provided by numpy.

Of course you are free to just ignore the layout (or the type annotations altogether). This may be acceptable for very small data sets, but for “serious” code working on non-trivial data, controlling the 2D data layout is vital.

Classes:

AnyMajor()

Allow for either ROW_MAJOR or COLUMN_MAJOR 2D data layout (which are the only valid instances of this class).

Data:

ROW_MAJOR

Require row-major layout.

COLUMN_MAJOR

Require column-major layout.

BLOCK_SIZE

The default size of the block to use (two copies should fit in the L2 cache).

Functions:

has_layout(data2d, layout)

Test whether the given 2D data2d is in the given layout.

copy2d()

Create a copy of 2D data, preserving its layout.

fast_all_close(left, right, *[, rtol, atol, ...])

Generalize numpy.allclose to handle more types, and restrict it to only support efficient comparisons, which requires both left and right to have the same layout (if they are 2D).

as_layout()

Get the 2D data in a specific layout.

class daf.typing.layouts.AnyMajor[source]

Bases: object

Allow for either ROW_MAJOR or COLUMN_MAJOR 2D data layout (which are the only valid instances of this class).

This does not allow for other (e.g., strided or COO) layouts.

Attributes:

name

Name for messages.

numpy_order

The numpy order for the layout.

numpy_flag

The numpy flag for the layout.

contiguous_axis

The axis which should be contiguous for the layout.

sparse_class

The sparse matrix class for this layout.

sparse_class_name

The name of the sparse matrix class for this layout.

name = 'any-major'

Name for messages.

numpy_order = '?'

The numpy order for the layout.

numpy_flag = '?'

The numpy flag for the layout.

contiguous_axis = -1

The axis which should be contiguous for the layout.

sparse_class = (<class 'scipy.sparse.csr.csr_matrix'>, <class 'scipy.sparse.csc.csc_matrix'>)

The sparse matrix class for this layout.

sparse_class_name = 'scipy.sparse.csr/csc_matrix'

The name of the sparse matrix class for this layout.

daf.typing.layouts.ROW_MAJOR: RowMajor = <daf.typing.layouts.RowMajor object>

Require row-major layout.

In this layout, the elements of each row are stored contiguously in memory. For sparse matrices, only the non-zero elements are stored (“CSR” format).

daf.typing.layouts.COLUMN_MAJOR: ColumnMajor = <daf.typing.layouts.ColumnMajor object>

Require column-major layout.

In this layout, the elements of each column are stored contiguously in memory. For sparse matrices, only the non-zero elements are stored (“CSC” format).
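The dense/sparse correspondence can be made concrete with a small scipy sketch (assuming scipy is available; this is illustrative, not daf code):

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 1], [2, 0]])

csr = sparse.csr_matrix(dense)  # row-major: rows stored contiguously
csc = sparse.csc_matrix(dense)  # column-major: columns stored contiguously

# CSR stores, per row, the column indices and values of non-zero elements.
assert csr.indptr.tolist() == [0, 1, 2]   # row start offsets
assert csr.indices.tolist() == [1, 0]     # column index of each non-zero
assert csr.data.tolist() == [1, 2]
```

CSC is the mirror image: per column, it stores the row indices and values of the non-zero elements.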

daf.typing.layouts.has_layout(data2d: Any, layout: AnyMajor) bool[source]

Test whether the given 2D data2d is in the given layout.

If given non-Known2D data, will always return False.

Note

Non-sparse 2D data with one row or one column is considered to be both row-major and column-major, if its elements are contiguous in memory.
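The note can be illustrated with a rough numpy-only approximation of the dense case (this is not daf's implementation, just a sketch of the semantics): a contiguous array with a single row or column satisfies both layout tests.

```python
import numpy as np

def dense_has_layout(data2d: np.ndarray, order: str) -> bool:
    # Rough approximation of the dense case: "C" means row-major,
    # "F" means column-major. numpy marks a contiguous array with a
    # single row or single column as both C- and F-contiguous.
    flag = "C_CONTIGUOUS" if order == "C" else "F_CONTIGUOUS"
    return bool(data2d.flags[flag])

one_row = np.zeros((1, 4))
assert dense_has_layout(one_row, "C") and dense_has_layout(one_row, "F")
```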

daf.typing.layouts.copy2d(data2d: DenseInRows) DenseInRows[source]
daf.typing.layouts.copy2d(data2d: DenseInColumns) DenseInColumns
daf.typing.layouts.copy2d(data2d: SparseInRows) SparseInRows
daf.typing.layouts.copy2d(data2d: SparseInColumns) SparseInColumns
daf.typing.layouts.copy2d(data2d: FrameInRows) FrameInRows
daf.typing.layouts.copy2d(data2d: FrameInColumns) FrameInColumns
daf.typing.layouts.copy2d(data2d: ndarray) ndarray
daf.typing.layouts.copy2d(data2d: spmatrix) spmatrix
daf.typing.layouts.copy2d(data2d: DataFrame) DataFrame

Create a copy of 2D data, preserving its layout.

All Known2D data types (numpy.ndarray, scipy.sparse.spmatrix and pandas.DataFrame) have a copy() method, so you would think one can just write data2d.copy() and be done. That is almost true except that in their infinite wisdom numpy will always create the copy in ROW_MAJOR layout, and pandas will always create the copy in COLUMN_MAJOR layout, because “reasons”. Sure, numpy allows specifying order="K" but pandas does not, and such a flag makes no sense for scipy.sparse.spmatrix in the first place.

The code here will give you a proper copy of the data in the same layout as the original. Sigh.
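The numpy half of this behavior is easy to demonstrate (the pandas half varies by version):

```python
import numpy as np

column_major = np.asfortranarray(np.zeros((3, 4)))
assert column_major.flags["F_CONTIGUOUS"]

# numpy's default copy switches to row-major layout...
assert column_major.copy().flags["C_CONTIGUOUS"]

# ...while order="K" preserves the original layout, which is what a
# layout-preserving copy needs for the dense case.
assert column_major.copy(order="K").flags["F_CONTIGUOUS"]
```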

Note

In some (older) versions of pandas, it seems it just isn’t even possible to create a ROW_MAJOR frame of strings.

daf.typing.layouts.fast_all_close(left: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame], right: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame], *, rtol: float = 1e-05, atol: float = 1e-08, equal_nan: bool = False) bool[source]

Generalize numpy.allclose to handle more types, and restrict it to only support efficient comparisons, which requires both left and right to have the same layout (if they are 2D).

Specifically:

  • Both values must be 1D (numpy.ndarray or pandas.Series), or

  • Both values must be Sparse matrices, or

  • Both values must be either Matrix or pandas.DataFrame.

And if the values are 2D data, both must be in the same (ROW_MAJOR or COLUMN_MAJOR) layout.

Otherwise the code will assert with a hopefully helpful message.

Note

When comparing Sparse matrices, the rtol, atol and equal_nan values are only used to compare the non-zero values, after ensuring their structure is identical in both matrices. This requires both matrices to be is_optimal.

daf.typing.layouts.as_layout(data2d: ndarray, layout: RowMajor, *, force_copy: bool = False, block_size: Optional[int] = None) DenseInRows[source]
daf.typing.layouts.as_layout(data2d: ndarray, layout: ColumnMajor, *, force_copy: bool = False, block_size: Optional[int] = None) DenseInColumns
daf.typing.layouts.as_layout(data2d: spmatrix, layout: RowMajor, *, force_copy: bool = False, block_size: Optional[int] = None) SparseInRows
daf.typing.layouts.as_layout(data2d: spmatrix, layout: ColumnMajor, *, force_copy: bool = False, block_size: Optional[int] = None) SparseInColumns
daf.typing.layouts.as_layout(data2d: DataFrame, layout: RowMajor, *, force_copy: bool = False, block_size: Optional[int] = None) FrameInRows

Get the 2D data in a specific layout.

If force_copy, return a copy even if the data is already in the required layout.

It turns out that for large dense data it is more efficient to work on the data in blocks that fit in a “reasonable” HW cache level. The code here uses a default block_size of 1MB (two copies of this “should” fit in the L2 cache); this value seems to work well on CPUs circa 2022. This optimization should really have been in numpy itself, ideally using some cache-oblivious algorithm, which would negate the need for specifying buffer sizes.

Todo

If/when https://github.com/numpy/numpy/issues/21655 is implemented, change the as_layout implementation to use it.

daf.typing.layouts.BLOCK_SIZE = 1048576

The default size of the block to use (two copies should fit in the L2 cache).