daf.typing
Provide type annotations and support functions for code processing 1D and 2D data.
The code has to deal with many different alternative data types for what are essentially two basic data types: 2D data and 1D data. These alternative data types have different implementations that expose “almost, but not quite entirely alike” APIs, which require different code paths for realistic, efficient, and correct algorithms. Even simple things like “create a copy of this data” do not work in a uniform, robust way.
Python does support “duck typing”, which in theory could have provided uniform APIs for 1D/2D data. However, in practice the implementations are just diverse enough that uniform code only mostly works. Since Python has very weak static typing analysis, one must rely mainly on tests to validate one’s code; the subtle differences between the 1D/2D data implementations mean your tests may pass today, but your code will break in strange and mysterious ways a year later, when applied to a combination of 1D/2D data implementations you didn’t test for.
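One concrete example of this “almost alike” divergence: numpy and pandas both support element-wise addition of 1D data, but numpy combines elements by position while pandas aligns them by index label first, so identical-looking code yields different results:

```python
import numpy as np
import pandas as pd

# numpy adds element-wise by position:
dense = np.array([1, 2, 3]) + np.array([10, 20, 30])
assert list(dense) == [11, 22, 33]

# pandas aligns by index label first, silently reordering the data:
shuffled = pd.Series([1, 2, 3], index=[2, 1, 0])
straight = pd.Series([10, 20, 30], index=[0, 1, 2])
aligned = shuffled + straight
assert list(aligned) == [13, 22, 31]  # NOT [11, 22, 33]
```

Code tested only on numpy arrays would pass, and then quietly misbehave when handed a pandas Series.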
2D data in particular can be represented in a wide range of formats, with variants within each format. Efficient code for each representation requires different code paths, which places a burden on the consumers of 2D data from daf containers.

To minimize this burden, daf restricts the data types it stores to very few variants. Specifically, daf uses pandas and numpy for dense data, and scipy.sparse for sparse matrices. We further restrict 2D data to be in either row-major or column-major layouts (see layouts for details on why and how daf deals with this). This requires only a small number of code paths to ensure efficient computation.
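In numpy terms, the row-major vs. column-major restriction corresponds to C-contiguous vs. Fortran-contiguous arrays, which can be queried and converted directly:

```python
import numpy as np

matrix = np.arange(6).reshape(2, 3)       # row-major (C order) by default
assert matrix.flags["C_CONTIGUOUS"]

column_major = np.asfortranarray(matrix)  # column-major (F order) copy
assert column_major.flags["F_CONTIGUOUS"]

# Iterating a large matrix along the "wrong" axis for its layout touches
# memory with a large stride, which is why knowing the layout up front
# matters for efficiency.
```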
Todo

Extend daf to support arrow as well. Pandas 2.0 was extended to support it as a backend in addition to numpy. This will make our life even more “interesting” as we learn by trial and error which seemingly “uniform” APIs behave subtly differently for this new 1D/2D data implementation.
Todo

Extend daf to support “masked arrays” for storing nullable integers and nullable Booleans? This will require accompanying each nullable array with a Boolean mask of valid elements. It will also very likely require users to support additional code paths.
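The masked-array approach described above already exists in numpy as `numpy.ma`, which pairs the data with exactly such a Boolean mask (where the mask marks the invalid elements); a minimal sketch of what storing nullable integers this way would look like:

```python
import numpy as np

# A "nullable" integer vector: integer data plus a Boolean mask.
# In numpy.ma, True in the mask marks an INVALID (missing) element.
values = np.ma.masked_array([1, 2, 3], mask=[False, True, False])

assert values.sum() == 4                       # masked entries are skipped
assert values.compressed().tolist() == [1, 3]  # only the valid elements
```

Note that this keeps an integer dtype, unlike the common pandas fallback of promoting to float and using NaN.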
We also have to deal with the data type of the elements of the data (a.k.a. the dtype), which again is only mostly compatible between numpy and pandas. We don’t handle it directly in our type annotations because mypy is not capable enough to express dtype without a combinatorial explosion of data types (see dtypes for details on how daf deals with element data types).
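A well-known instance of this “only mostly compatible” dtype behavior: pandas silently promotes an integer column to float64 when a value is missing, because classic numpy integers cannot represent a missing value:

```python
import numpy as np
import pandas as pd

# numpy keeps an integer dtype:
vector = np.array([1, 2, 3])
assert vector.dtype.kind == "i"

# pandas promotes to float64 when a value is missing, replacing None
# with NaN; the "same" data now has a different element type:
series = pd.Series([1, 2, None])
assert series.dtype == np.float64
```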
We therefore provide here only the minimal mypy annotations that allow expressing the code’s intent and selecting correct and efficient code paths, together with some utilities to at least assert that the data types are as expected. In particular, the type annotations only support the restricted subset we allow to be stored, out of the full set of data types available in numpy, pandas and scipy.sparse.
In general we provide is_... functions that test whether some data is of some format (and also work as a mypy TypeGuard), be_... functions that assert that some data is of some format (and return it as such, to make mypy effective and happy), and as_... functions that convert data into specific formats (optionally forcing the creation of a copy). Additional support functions are provided as needed in the separate sub-modules; the most useful are as_layout and optimize, which are needed to ensure code is efficient, and freeze, which helps protect data against accidental in-place modification.
Since pandas and scipy.sparse provide no mypy type annotations (at least as of Python 3.10), we are forced to introduce fake type annotations for their types in fake_pandas and fake_sparse. Even though numpy does provide mypy type annotations, its use of a catch-all numpy.ndarray type is not sufficient for capturing the code’s intent. Therefore, in most cases, the result of any computation on the data effectively becomes Any, which negates the usefulness of using mypy, unless is_... and be_... (and the occasional as_...) are used.
The bottom line is that, as always, type annotations in Python are optional. You can ignore them (which makes sense for quick-and-dirty scripts, where correctness and performance issues are trivial). If you do want to benefit from them (for serious code), you need to put in the extra work (adding type annotations and a liberal amount of be_... calls). The goal of this module is merely to make it possible to do so with the least amount of pain. But even if you do not use type annotations at all, the support functions provided here are important for writing safe, efficient code (e.g. copy2d).
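The freeze concept mentioned above has a direct counterpart in plain numpy: an array can be marked read-only, after which in-place modification raises. This is a rough sketch of the underlying mechanism, not daf’s freeze API itself:

```python
import numpy as np

# Mark an array read-only; attempts to write to it then raise ValueError.
vector = np.arange(4)
vector.setflags(write=False)

try:
    vector[0] = 7
except ValueError:
    pass  # in-place modification was rejected, as intended
else:
    raise AssertionError("read-only array was modified")
```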
Note

The type annotations here require advanced mypy features. The code itself will run fine in older Python versions (3.7 and above); however, type-checking the code will only work in later versions (3.10 and above).
- daf.typing 1D data
- daf.typing 2D data
- daf.typing support
- daf.typing.unions
  - AnyData
  - Known
  - is_known()
  - be_known()
  - Known1D
  - is_known1d()
  - be_known1d()
  - Known2D
  - is_known2d()
  - be_known2d()
  - Proper
  - is_proper()
  - be_proper()
  - Proper1D
  - is_proper1d()
  - be_proper1d()
  - Proper2D
  - is_proper2d()
  - be_proper2d()
  - ProperInRows
  - is_proper_in_rows()
  - be_proper_in_rows()
  - ProperInColumns
  - is_proper_in_columns()
  - be_proper_in_columns()
- daf.typing.dtypes
- daf.typing.layouts
- daf.typing.optimization
- daf.typing.freezing
- daf.typing.descriptions
- daf.typing fake