daf.typing.dtypes¶
Python 1D/2D data has a notion of dtype
to describe the data type of a single elements. In general it is possible to
treat this as a string (e.g. bool
, float32
or int8
) and be done. This works well for numeric types, at least
as long as you stick to explicitly sized types.
Both numpy
and pandas
also allow to store arbitrary data as object
. In contrast, daf
does not allow
storing such arbitrary data as 1D or 2D data elements.
Note
Do not try to store arbitrary objects inside 1D/2D data in daf
. There is no practical way to protect against
this, and things will fail in spectacular and unexpected ways.
That said, daf
does allow storing strings, which greatly complicates the issue. The numpy
data types for
representing strings try to deal with the maximal string size (presumably for efficiency reasons) with dtype
looking
like U5
or <U12
. In contrast, pandas
has no concept of a string dtype
at all, and just represents it as
an object
. But to make things interesting, pandas
also has a category
dtype
which it uses to represent
strings-out-of-some-limited-set.
Since pandas
uses numpy
internally this results with inconsistent dtype
value for data containing strings.
Depending on whether you access the internal numpy
data or the wrapper pandas
data, and whether this data was
created using a numpy
or a pandas
operation, you can get either category
, or U5
, <U12
or even
object
, which makes it impossible to implement an efficient and robust testing for “string-ness” of data.
The way daf
deals with this mess is to restrict itself to storing just plain string data and optimistically assume
that object
means str
. We also never store categorical data, only allowing to store plain string data.
Note
In daf
, it makes more sense to define a “category” as an “axis”, and simply store integer elements whose value
is the index along that axis. The optimization
module helps with converting categorical data into plain string
data.
Some daf
functions take a dtype
(or a collection of them), e.g. when testing whether some data elements have an
acceptable type. This forces us to introduce a single dtype
to stand for “string”, which we have chosen to be U
.
This value has the advantage you can pass it to either numpy
or pandas
when creating new data. You can’t
directly test for dtype == "U"
, of course, but if you pass U
to any daf
function that tests the element
data type (e.g., has_dtype
), then the code will test (to its limited best ability) that the data actually contains
strings.
Data:
Value of |
|
Values of |
|
Values of |
|
Values of |
|
Values for |
|
Values for |
|
All the "acceptable" data types. |
|
Everything acceptable as a specification of a single |
|
Everything acceptable as a specification of a set of |
Functions:
|
Return the type of the element of the data. |
|
Check whether the type of the element of the data is as expected. |
|
Check whether a |
- daf.typing.dtypes.STR_DTYPE = 'U'¶
Value of
dtype
for strings (U
).Note
This is safe to use when creating new data and when testing the data type using
daf
functions. However testing forfoo.dtype == "U"
will always fail because “reasons”. Usehas_dtype
instead.
- daf.typing.dtypes.INT_DTYPES = ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64')¶
Values of
dtype
for integers of any size.
- daf.typing.dtypes.FLOAT_DTYPES = ('float16', 'float32', 'float64')¶
Values of
dtype
for floats of any size.
- daf.typing.dtypes.NUM_DTYPES = ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64')¶
Values of
dtype
for simple numbers (integers or floats) of any size.
- daf.typing.dtypes.FIXED_DTYPES = ('bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64')¶
Values for
dtype
for fixed-size data (formemory_mapping
).
- daf.typing.dtypes.ENTRIES_DTYPES = ('U', 'bool', 'int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64')¶
Values for
dtype
for specifying slice entries (forStorageView
).
- daf.typing.dtypes.ALL_DTYPES = ('int8', 'int16', 'int32', 'int64', 'uint8', 'uint16', 'uint32', 'uint64', 'float16', 'float32', 'float64', 'bool', 'category', 'object')¶
All the “acceptable” data types.
This is used as the default set of data types when testing for whether a
dtype
is as expected viais_dtype
.Note
We are forced to allow
object
as it is used for strings, but not try to store arbitrary objects insidedaf
1D/2D data. We also allow forcategory
, but only allow actually storing plain strings.
- daf.typing.dtypes.DType(*args, **kwargs)¶
Everything acceptable as a specification of a single
numpy
dtype
.alias of
Union
[str
,dtype
]
- daf.typing.dtypes.DTypes(*args, **kwargs)¶
Everything acceptable as a specification of a set of
numpy
dtype
.alias of
Union
[str
,dtype
,Collection
[str
],Collection
[dtype
],Collection
[Union
[str
,dtype
]]]
- daf.typing.dtypes.dtype_of(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame]) Optional[dtype] [source]¶
Return the type of the element of the data.
And no, calling
.dtype
does not work for allKnown
types, because ofpandas
, which has no concept of apandas.DataFrame
with homogeneous data elements (that is, aFrame
). For a data frame with mixed types, we give up and returnNone
.Note
This will return
STR_DTYPE
for anydtype
we can interpret as a string.
- daf.typing.dtypes.has_dtype(data: Union[ndarray, _fake_sparse.spmatrix, Series, DataFrame], dtypes: Optional[daf.typing.dtypes.DTypes] = None) bool [source]¶
Check whether the type of the element of the data is as expected.
If no
dtypes
are provided, tests forALL_DTYPES
.When testing for strings, use
STR_DTYPE
(that is,U
), sincenumpy
andpandas
use many different actualdtype
values to represent strings, because “reasons”.
- daf.typing.dtypes.is_dtype(dtype: Union[str, dtype], dtypes: Optional[daf.typing.dtypes.DTypes] = None) bool [source]¶
Check whether a
numpy
dtype
is one of the expecteddtypes
.If no
dtypes
are provided, tests forALL_DTYPES
.When testing for strings, use
STR_DTYPE
(that is,U
), sincenumpy
andpandas
use many different actualdtype
values to represent strings, because “reasons”.