daf.typing.dtypes

Python 1D/2D data has a notion of dtype to describe the data type of its individual elements. In general it is possible to treat this as a string (e.g. bool, float32 or int8) and be done. This works well for numeric types, at least as long as you stick to explicitly sized types.
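
For example, a minimal sketch using numpy (which daf wraps):

    import numpy as np

    # Explicitly sized numeric dtypes are unambiguous and easy to test:
    vector = np.zeros(4, dtype="int8")
    assert str(vector.dtype) == "int8"

    matrix = np.zeros((2, 3), dtype="float32")
    assert str(matrix.dtype) == "float32"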

Both numpy and pandas also allow storing arbitrary data as object. In contrast, daf does not allow storing such arbitrary data as 1D or 2D data elements.

Note

Do not try to store arbitrary objects inside 1D/2D data in daf. There is no practical way to protect against this, and things will fail in spectacular and unexpected ways.

That said, daf does allow storing strings, which greatly complicates the issue. The numpy data types for representing strings encode the maximal string size (presumably for efficiency reasons), with dtype values looking like U5 or <U12. In contrast, pandas has no concept of a string dtype at all, and just represents strings as object. But to make things interesting, pandas also has a category dtype which it uses to represent strings-out-of-some-limited-set.

Since pandas uses numpy internally, this results in inconsistent dtype values for data containing strings. Depending on whether you access the internal numpy data or the wrapping pandas data, and on whether this data was created using a numpy or a pandas operation, you can get category, U5, <U12 or even object, which makes it impossible to implement an efficient and robust test for “string-ness” of the data.
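
A small sketch of the inconsistency (assuming recent numpy and pandas versions):

    import numpy as np
    import pandas as pd

    print(np.array(["foo", "bar"]).dtype)                     # <U3
    print(pd.Series(["foo", "bar"]).dtype)                    # object
    print(pd.Series(["foo", "bar"], dtype="category").dtype)  # category
    # The same strings, three different dtype values, none of them "U".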

The way daf deals with this mess is to restrict itself to storing just plain string data and to optimistically assume that object means str. We also never store categorical data, only allowing plain string data to be stored.

Note

In daf, it makes more sense to define a “category” as an “axis”, and simply store integer elements whose value is the index along that axis. The optimization module helps with converting categorical data into plain string data.
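
A hypothetical sketch of this idea, using numpy only (axis_entries and indices are illustrative names, not daf API):

    import numpy as np

    values = np.array(["red", "green", "red", "blue"])
    # Store the unique values as an axis, and per-element indices into it:
    axis_entries, indices = np.unique(values, return_inverse=True)
    print(axis_entries)  # ['blue' 'green' 'red']
    print(indices)       # [2 1 2 0]
    # The original strings can always be recovered from the indices:
    assert list(axis_entries[indices]) == list(values)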

Some daf functions take a dtype (or a collection of them), e.g. when testing whether some data elements have an acceptable type. This forces us to introduce a single dtype to stand for “string”, which we have chosen to be U. This value has the advantage that you can pass it to either numpy or pandas when creating new data. You can’t directly test for dtype == "U", of course, but if you pass U to any daf function that tests the element data type (e.g. has_dtype), then the code will test (to the limited best of its ability) that the data actually contains strings.
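
For example (a sketch; the has_dtype lines assume the daf package is importable):

    import numpy as np
    import pandas as pd

    # "U" can be passed when creating new data:
    np_strings = np.array(["foo", "bar"], dtype="U")
    pd_strings = pd.Series(["foo", "bar"], dtype="U")

    # But a direct dtype comparison fails (the actual dtype is <U3):
    assert np_strings.dtype != "U"

    # Passing "U" to a daf test function does work:
    from daf.typing.dtypes import STR_DTYPE, has_dtype
    assert has_dtype(np_strings, STR_DTYPE)
    assert has_dtype(pd_strings, STR_DTYPE)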

Data:

STR_DTYPE

Value of dtype for strings (U).

INT_DTYPES

Values of dtype for integers of any size.

FLOAT_DTYPES

Values of dtype for floats of any size.

NUM_DTYPES

Values of dtype for simple numbers (integers or floats) of any size.

FIXED_DTYPES

Values of dtype for fixed-size data (for memory_mapping).

ENTRIES_DTYPES

Values of dtype for specifying slice entries (for StorageView).

ALL_DTYPES

All the "acceptable" data types.

DType

Everything acceptable as a specification of a single numpy dtype.

DTypes

Everything acceptable as a specification of a set of numpy dtypes.

Functions:

dtype_of(data)

Return the type of the elements of the data.

has_dtype(data[, dtypes])

Check whether the type of the elements of the data is as expected.

is_dtype(dtype[, dtypes])

Check whether a numpy dtype is one of the expected dtypes.

daf.typing.dtypes.STR_DTYPE = 'U'

Value of dtype for strings (U).

Note

This is safe to use when creating new data and when testing the data type using daf functions. However, testing for foo.dtype == "U" will always fail, because “reasons”. Use has_dtype instead.

daf.typing.dtypes.INT_DTYPES = ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64')

Values of dtype for integers of any size.

daf.typing.dtypes.FLOAT_DTYPES = ('float16', 'float32', 'float64')

Values of dtype for floats of any size.

daf.typing.dtypes.NUM_DTYPES = ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64')

Values of dtype for simple numbers (integers or floats) of any size.

daf.typing.dtypes.FIXED_DTYPES = ('bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64')

Values of dtype for fixed-size data (for memory_mapping).

daf.typing.dtypes.ENTRIES_DTYPES = ('U', 'bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64')

Values of dtype for specifying slice entries (for StorageView).

daf.typing.dtypes.ALL_DTYPES = ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'bool', 'category', 'object')

All the “acceptable” data types.

This is used as the default set of data types when testing for whether a dtype is as expected via is_dtype.

Note

We are forced to allow object as it is used for strings, but we do not try to store arbitrary objects inside daf 1D/2D data. We also allow for category, but only actually allow storing plain strings.

daf.typing.dtypes.DType

Everything acceptable as a specification of a single numpy dtype.

alias of Union[str, dtype]

daf.typing.dtypes.DTypes

Everything acceptable as a specification of a set of numpy dtypes.

alias of Union[str, dtype, Collection[str], Collection[dtype], Collection[Union[str, dtype]]]
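
For illustration, each of the following should be a valid DTypes value (a sketch based on the alias above):

    import numpy as np

    dtypes_specs = [
        "int8",                    # a single dtype name (str)
        np.dtype("float32"),       # a single numpy dtype
        ("int8", "uint8"),         # a collection of names
        (np.dtype("bool"), "U"),   # a mixed collection
    ]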

daf.typing.dtypes.dtype_of(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame]) → Optional[dtype]

Return the type of the elements of the data.

And no, calling .dtype does not work for all Known types, because of pandas, which has no concept of a pandas.DataFrame with homogeneous data elements (that is, a Frame). For a data frame with mixed types, we give up and return None.

Note

This will return STR_DTYPE for any dtype we can interpret as a string.
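
A usage sketch (assuming daf is installed; the printed values reflect the behavior described above):

    import numpy as np
    import pandas as pd

    from daf.typing.dtypes import dtype_of

    print(dtype_of(np.zeros(3, dtype="int8")))  # int8
    print(dtype_of(np.array(["foo", "bar"])))   # U (normalized string dtype)
    mixed = pd.DataFrame({"n": [0, 1], "s": ["x", "y"]})
    print(dtype_of(mixed))                      # None (mixed element types)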

daf.typing.dtypes.has_dtype(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame], dtypes: Optional[daf.typing.dtypes.DTypes] = None) → bool

Check whether the type of the elements of the data is as expected.

If no dtypes are provided, tests for ALL_DTYPES.

When testing for strings, use STR_DTYPE (that is, U), since numpy and pandas use many different actual dtype values to represent strings, because “reasons”.
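
For example (a sketch, assuming daf is installed):

    import numpy as np

    from daf.typing.dtypes import STR_DTYPE, has_dtype

    assert has_dtype(np.zeros(3, dtype="float32"), "float32")
    # Use STR_DTYPE ("U") for strings, never a specific <U... value:
    assert has_dtype(np.array(["foo", "bar"]), STR_DTYPE)
    # With no dtypes argument, anything in ALL_DTYPES is accepted:
    assert has_dtype(np.zeros(3, dtype="bool"))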

daf.typing.dtypes.is_dtype(dtype: Union[str, dtype], dtypes: Optional[daf.typing.dtypes.DTypes] = None) → bool

Check whether a numpy dtype is one of the expected dtypes.

If no dtypes are provided, tests for ALL_DTYPES.

When testing for strings, use STR_DTYPE (that is, U), since numpy and pandas use many different actual dtype values to represent strings, because “reasons”.
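
A similar sketch for is_dtype (again assuming daf is installed):

    import numpy as np

    from daf.typing.dtypes import INT_DTYPES, is_dtype

    assert is_dtype("int16", INT_DTYPES)
    assert is_dtype(np.dtype("float32"), ("float32", "float64"))
    # With no dtypes argument, tests against ALL_DTYPES:
    assert is_dtype("bool")
    assert not is_dtype("complex64")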