daf.typing

Provide type annotations and support functions for code processing 1D and 2D data.

The code has to deal with many alternative data types for what are essentially two basic kinds of data: 2D data and 1D data. These alternatives have different implementations that expose “almost, but not quite entirely alike” APIs, which require different code paths for realistically efficient and correct algorithms. Even something as simple as “create a copy of this data” does not work in a uniform, robust way.
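As a concrete illustration (ours, not from daf itself), even numpy’s own two ways of copying an array disagree about whether the memory layout is preserved:

    import numpy as np

    column_major = np.asfortranarray(np.zeros((2, 3)))
    assert not column_major.copy().flags["F_CONTIGUOUS"]  # the method defaults to C order
    assert np.copy(column_major).flags["F_CONTIGUOUS"]    # the function preserves the layout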

Python does support “duck typing”, which in theory could have provided uniform APIs for 1D/2D data. In practice, however, the implementations are just diverse enough that uniform code only mostly works. Since Python has very weak static type analysis, one must mainly rely on tests to validate one’s code; given the subtle differences between the 1D/2D data implementations, your tests may pass today, but your code will break in strange and mysterious ways a year later, when it is applied to a combination of 1D/2D data implementations you didn’t test for.
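For example (again our illustration, not daf code), the “obvious” way to take the first row of 2D data works for a numpy array but fails for a pandas DataFrame:

    import numpy as np
    import pandas as pd

    matrix = np.zeros((2, 3))
    frame = pd.DataFrame(matrix)

    row = matrix[0, :]      # fine: a 1D numpy array (a view of the first row)
    row = frame.iloc[0, :]  # a DataFrame requires .iloc; frame[0, :] raises instead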

2D data in particular can be represented in a wide range of formats, with several variants within each format. Efficient code for each representation requires a different code path, which places a burden on the consumers of 2D data from daf containers.

To minimize this burden, daf restricts the data types it stores to very few variants. Specifically, daf uses numpy and pandas for dense data and scipy.sparse for sparse matrices. We further restrict 2D data to be in either a row-major or a column-major layout (see layouts for details on why and how daf deals with this). As a result, only a small number of code paths is needed to ensure efficient computation, as sketched below.
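A minimal sketch (not daf’s actual implementation) of how few code paths are needed once the data is restricted in this way:

    import numpy as np
    import scipy.sparse as sp

    def sum_per_row(data):
        """Sum each row of 2D data, assuming a daf-like restricted representation."""
        if sp.issparse(data):
            # Efficient only for a row-major (CSR) sparse matrix.
            return np.asarray(data.sum(axis=1)).flatten()
        dense = np.asarray(data)  # also accepts a pandas DataFrame
        assert dense.flags["C_CONTIGUOUS"], "expected row-major (C) layout"
        return dense.sum(axis=1)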

Todo

Extend daf to support arrow as well.

Pandas 2.0 was extended to support it as a backend in addition to numpy. This will make our life even more “interesting”, as we’ll learn by trial and error which seemingly “uniform” APIs behave subtly differently for this new 1D/2D data implementation.

Todo

Extend daf to support “masked arrays” for storing nullable integers and nullable Booleans?

This will require accompanying each nullable array with a Boolean mask of valid elements. It will also very likely require users to support additional code paths, as sketched below.
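A speculative sketch of what this might look like, using numpy’s existing masked arrays (to be clear, daf does not support this yet):

    import numpy as np

    values = np.array([0, 1, 2], dtype="int32")
    invalid = np.array([False, True, False])  # True marks a null (missing) element
    nullable = np.ma.masked_array(values, mask=invalid)
    assert nullable.sum() == 2  # operations skip the masked (null) elements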

We also have to deal with the data type of the elements of the data (aka dtype), which again is only mostly compatible between numpy and pandas. We don’t express it directly in our type annotations because mypy is not expressive enough to capture dtype without a combinatorial explosion of annotated types (see dtypes for details on how daf deals with element data types).
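For example (our illustration), the same logical data may end up with different element data types in numpy and in pandas:

    import numpy as np
    import pandas as pd

    assert np.array([0, None]).dtype == object      # numpy keeps boxed Python objects
    assert pd.Series([0, None]).dtype == "float64"  # pandas converts the None to a NaN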

We therefore provide here only the minimal mypy annotations needed to express the code’s intent and select correct and efficient code paths, together with some utilities to at least assert at run time that the data types are as expected. In particular, the type annotations only cover the restricted subset we allow storing, out of the full set of data types available in numpy, pandas and scipy.sparse.

In general, we provide is_... functions that test whether some data is in some format (which also work as mypy TypeGuards), be_... functions that assert that some data is in some format (and return it as such, to make mypy effective and happy), and as_... functions that convert data into specific formats (optionally forcing the creation of a copy). Additional support functions are provided as needed in the separate sub-modules; the most useful are as_layout and optimize, which are needed to ensure code is efficient, and freeze, which helps protect data against accidental in-place modification.
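A self-contained sketch of the is_.../be_... pattern, using hypothetical helper names (the actual daf functions cover many more formats):

    from typing import Any, TypeGuard  # TypeGuard requires Python 3.10 or above

    import numpy as np

    def is_vector(data: Any) -> TypeGuard[np.ndarray]:
        """Test whether the data is a 1D numpy array (doubles as a mypy TypeGuard)."""
        return isinstance(data, np.ndarray) and data.ndim == 1

    def be_vector(data: Any) -> np.ndarray:
        """Assert that the data is a 1D numpy array and return it as such."""
        assert is_vector(data), f"expected a 1D numpy array, got a {data.__class__.__name__}"
        return data

    vector = be_vector(np.arange(3))  # mypy now knows that vector is an ndarray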

Since pandas and scipy.sparse provide no mypy type annotations (at least as of Python 3.10), we are forced to introduce fake type annotations for their types in fake_pandas and fake_sparse. Even though numpy does provide mypy type annotations, its use of a catch-all numpy.ndarray type is not sufficient for capturing the code’s intent. Therefore, in most cases, the result of any computation on the data becomes effectively Any, which negates the usefulness of mypy, unless is_... and be_... (and the occasional as_...) calls are used.
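For example (assuming no third-party pandas stubs are installed):

    import pandas as pd

    frame = pd.DataFrame(dict(value=[1, 2]))
    total = frame["value"].sum()  # mypy infers Any here, so misuse of total goes unnoticed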

The bottom line is that, as always, type annotations in Python are optional. You can ignore them (which makes sense for quick-and-dirty scripts, where correctness and performance concerns are minimal). If you do want to benefit from them (for serious code), you need to put in the extra work (adding type annotations and a liberal amount of be_... calls). The goal of this module is merely to make it possible to do so with the least amount of pain. But even if you do not use type annotations at all, the support functions provided here are important for writing safe, efficient code (e.g. copy2d).

Note

The type annotations here require advanced mypy features. The code itself will run fine in older Python versions (3.7 and above). However, type-checking the code will only work in later versions (3.10 and above).