daf.groups

Functions for projecting data between members and groups.

A common idiom is to have two axes such that one is a grouping of the other. For example, in scRNA-seq data, it is common to group cells into clusters, so we have a cell axis and a cluster axis. Often this is a multi-level grouping (cell, sub-cluster, cluster).

In this idiom, by convention there is a 1D data property for the “member” axis specifying the entry of the “group” axis it belongs to. That is, we may see something like cell#cluster which gives for each cell the (integer, 0-based) index of the cluster it belongs to. Since integers don’t support NaN, by convention any negative value (typically -1) is used to say “this cell belongs to no cluster”.
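The convention above can be pictured with plain numpy. This is only a sketch with made-up values: the array name `cluster_of_cells` is hypothetical and stands in for what a `cell#cluster` vector might contain.

```python
import numpy as np

# Hypothetical cell#cluster vector: one 0-based cluster index per cell.
# A negative value (here -1) marks a cell that belongs to no cluster.
cluster_of_cells = np.array([0, 2, -1, 1, 0, -1], dtype="int32")

# Boolean mask of the cells that do belong to some cluster.
is_grouped = cluster_of_cells >= 0

# Positions of the cells that are members of cluster 0.
cells_of_cluster_0 = np.flatnonzero(cluster_of_cells == 0)

print(int(is_grouped.sum()))  # how many cells are grouped
print(cells_of_cluster_0)     # which cells belong to cluster 0
```

Testing for `>= 0` rather than `!= -1` matches the convention that any negative value means "no group".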

Computing such groups is the goal of complex analysis pipelines and is very much out of scope for a low-level package such as daf. However, once such group(s) are computed, there are universal operations, which it does make sense to provide here:

Aggregation: Compute 1D/2D data for the group axis based on 1D/2D data of the members of the group. See aggregate_group_data1d for an example computing the mean age of the cells in each metacell and aggregate_group_data2d for an example computing the total UMIs of each gene for each metacell (which is the same as the metacell,gene#UMIs).
Counting: Compute 2D data for the group axis based on discrete 1D data of the members axis. See count_group_values for an example counting how many cells in each metacell came from donors of either sex.
Assignment: Compute 1D data for the member axis based on 1D data of the group axis. See assign_group_values for an example assigning for each cell the type of the metacell it belongs to.

Functions:

`aggregate_group_data1d`(data, aggregation, *)	Compute a per-group property value which is the result of applying the `aggregation` function to the vector of values of the members of the group.
`aggregate_group_data2d`(data, aggregation, *)	Compute per-group-per-axis property values which are the result of applying the `aggregation` function to the vector of values of the members of the group.
`most_frequent`(vector)	Return the most frequent value in a `vector`.
`create_group_axis`(data, *, format[, overwrite])	Create a new `group` axis to hold per-group data.
`count_group_members`(data, *[, dtype, overwrite])	Count how many members are included in each group.
`count_group_values`(data, *[, dtype, dense, ...])	Count how many members of each group have each possible property value.
`assign_group_values`(data, *[, dtype, ...])	Assign the per-group property value to each member of the group.

daf.groups.aggregate_group_data1d(data: DafWriter, aggregation: Callable[[Vector], Any], *, default: Optional[Any] = None, dtype: Optional[Union[dtype, str]] = None, overwrite: bool = False) → None

Compute a per-group property value which is the result of applying the aggregation function to the vector of values of the members of the group.

The aggregation function can be any function that converts a vector of all member values into a single group value. For example, for discrete data, most_frequent will pick the value that appears in the highest number of members. An optimized version is used if the aggregation is one of numpy.sum, numpy.mean, numpy.var, numpy.std, numpy.median, numpy.min or numpy.max.

The resulting per-group 1D data will have the specified dtype. By default, this is the same as the data type of the member values. This is acceptable for an aggregation like np.sum, but would fail for an aggregation like np.mean for integer data.

If no members are assigned to some existing group, then it is given the default value. By default this is None which is acceptable for floating point values (becomes a NaN), but would fail for integer data.
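The interplay between the aggregation, the dtype and the default can be sketched in plain numpy. This is not the daf implementation: the helper `aggregate_1d_sketch` and its `groups_count` parameter are hypothetical, and the sketch asks for `float64` explicitly when taking the mean of integer data.

```python
import numpy as np


def aggregate_1d_sketch(values, group_of_members, groups_count, aggregation, *, default=None, dtype=None):
    """Illustrative only: apply ``aggregation`` to the member values of each group."""
    # Mirror the documented default: reuse the dtype of the member values.
    result = np.empty(groups_count, dtype=values.dtype if dtype is None else dtype)
    for group_index in range(groups_count):
        member_values = values[group_of_members == group_index]
        if len(member_values) == 0:
            # Empty group: None becomes NaN in a float array, but raises for an integer one.
            result[group_index] = default
        else:
            result[group_index] = aggregation(member_values)
    return result


ages = np.array([30, 41, 50, 20], dtype="int32")
group_of_members = np.array([0, 0, 1, -1])  # the last member is in no group

# Group 2 exists but has no members, so it receives the default value.
means = aggregate_1d_sketch(ages, group_of_members, 3, np.mean, dtype="float64")
sums = aggregate_1d_sketch(ages, group_of_members, 3, np.sum, default=0)
print(means)
print(sums)
```

Passing an integer `default` (here `0`) is one way to keep an integer dtype when some groups may be empty.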

Required Inputs

member#: An axis with one entry per individual group member.
group#: An axis with an entry per group of zero or more members.
member#group: The index of the group each member belongs to, or negative if not a part of any group.
member#property: Some property value associated with each individual member.

Assured Outputs

group#property: The aggregated property value associated with each group.

If overwrite, will overwrite existing data.

For example:

import daf
import numpy as np

data = daf.DafWriter(
    storage=daf.MemoryStorage(name="example.storage"),
    base=daf.FilesReader(daf.DAF_EXAMPLE_PATH, name="example.base"),
    name="example"
)

with data.adapter(
    axes=dict(cell="member", metacell="group"),
    data={"cell#metacell": "group", "cell#batch#age": "property"},
    hide_implicit=True,
    back_data={"group#property": "age.mean"}
) as adapter:
    daf.aggregate_group_data1d(adapter, aggregation=np.mean)

print(data.get_vector("metacell#age.mean"))

[39 41 33 45 43 41 33 38 42 32]

daf.groups.aggregate_group_data2d(data: DafWriter, aggregation: Callable[[Vector], Any], *, default: Optional[Any] = None, dtype: Optional[Union[dtype, str]] = None, overwrite: bool = False) → None

Compute per-group-per-axis property values which are the result of applying the aggregation function to the vector of values of the members of the group.

The aggregation function can be any function that converts a vector of all member values into a single group value. For example, for discrete data, most_frequent will pick the value that appears in the highest number of members. An optimized version is used if the aggregation is one of numpy.sum, numpy.mean, numpy.var, numpy.std, numpy.median, numpy.min or numpy.max.

The resulting per-group-per-axis 2D data will have the specified dtype. By default, this is the same as the data type of the member values. This is acceptable for an aggregation like np.sum, but would fail for an aggregation like np.mean for integer data.

If no members are assigned to some existing group, then it is given the default value for all entries. By default this is None which is acceptable for floating point values (becomes a NaN), but would fail for integer data.
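A rough numpy picture of the 2D case, under stated assumptions: the helper `aggregate_2d_sketch` is hypothetical, works on a small dense matrix only, and uses `0` as its default (unlike the documented `None`) so that the integer dtype is kept.

```python
import numpy as np


def aggregate_2d_sketch(matrix, group_of_members, groups_count, aggregation, *, default=0):
    """Illustrative only: aggregate each column over the members of each group."""
    # Every entry starts as the default, which is what an empty group keeps.
    result = np.full((groups_count, matrix.shape[1]), default, dtype=matrix.dtype)
    for group_index in range(groups_count):
        rows = matrix[group_of_members == group_index]
        if rows.shape[0] > 0:
            # Apply the vector-to-scalar aggregation separately to every column.
            result[group_index, :] = [aggregation(rows[:, column]) for column in range(rows.shape[1])]
    return result


# Made-up data: 4 cells x 3 genes of UMI counts; the last cell is in no metacell.
umis = np.array([[1, 0, 2], [3, 1, 0], [0, 5, 1], [2, 2, 2]], dtype="int32")
metacell_of_cells = np.array([1, 0, 1, -1])

totals = aggregate_2d_sketch(umis, metacell_of_cells, 3, np.sum)
print(totals)
```

Each row of `totals` corresponds to one group and each column to one entry of the data axis; the third row belongs to a group with no members.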

Todo

Optimize aggregate_group_data2d to avoid creating a temporary dense matrix per group for sparse data, and/or to parallelize the operation in general.

Required Inputs

member#: An axis with one entry per individual group member.
group#: An axis with an entry per group of zero or more members.
axis#: An axis for some 2D property.
member#group: The index of the group each member belongs to, or negative if not a part of any group.
member,axis#property: The property value associated with each individual member and data axis entry.

Assured Outputs

group,axis#property: The aggregated value associated with each group and data axis entry.

If overwrite, will overwrite existing data.

For example:

import daf
import numpy as np

data = daf.DafWriter(
    storage=daf.MemoryStorage(name="example.storage"),
    base=daf.FilesReader(daf.DAF_EXAMPLE_PATH, name="example.base"),
    name="example"
)

with data.adapter(
    axes=dict(cell="member", metacell="group", gene="axis"),
    data={"cell#metacell": "group", "cell,gene#UMIs": "property"},
    hide_implicit=True,
    back_data={"group,axis#property": "UMIs_sum"},
) as adapter:
    daf.aggregate_group_data2d(adapter, aggregation=np.sum, overwrite=True)

print(data.get_series("gene#metacell=Metacell_0,UMIs_sum"))

RSPO3      5
FOXA1      6
WNT6      67
TNNI1      1
MSGN1      1
LMO2       1
SFRP5      0
DLX5     111
ITGA4     13
FOXA2      1
dtype: int32

daf.groups.most_frequent(vector: Vector) → Any

Return the most frequent value in a vector.

There is no guarantee that this value appears in the majority of the entries, or in general that it is “very common”. The only guarantee is that there is no other value that is more common.
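One possible way to compute such a value is sketched below with `numpy.unique`. The helper name `most_frequent_sketch` is hypothetical, and the tie-breaking rule (the first value in sorted order wins) is an assumption of this sketch rather than documented behavior.

```python
import numpy as np


def most_frequent_sketch(vector):
    """Illustrative only: return a value that no other value outnumbers."""
    unique_values, counts = np.unique(vector, return_counts=True)
    # argmax returns the first maximal count, so ties go to the smallest sorted value.
    return unique_values[np.argmax(counts)]


types = np.array(["epiblast", "ectoderm", "epiblast", "mesoderm", "endoderm"])
winner = most_frequent_sketch(types)
print(winner)
```

Here the winning value appears in only two of the five entries, which shows that the result need not be a majority.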

daf.groups.create_group_axis(data: DafWriter, *, format: str, overwrite: bool = False) → None

Create a new group axis to hold per-group data.

Since in daf axis entry names are always strings, we use the format to convert the group index to a string. This format should include %s somewhere in it.

Note

The created axis will be continuous. That is, group axis entries will be created for all the group indices from zero to the maximal used group index, even for indices that no member uses.
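The naming and continuity rules can be sketched as follows. The variable names and the `"Metacell_%s"` format string are illustrative choices, not values required by daf.

```python
import numpy as np

# Made-up member#group vector; index 2 is never used and -1 means "no group".
group_of_members = np.array([0, 3, -1, 3, 1])
name_format = "Metacell_%s"  # must contain %s, which receives the group index

# Continuous axis: one entry for every index from zero to the maximal used index.
groups_count = int(group_of_members.max()) + 1
entry_names = [name_format % group_index for group_index in range(groups_count)]
print(entry_names)
```

The entry for index 2 is created even though no member refers to it, which keeps the entry position equal to the group index.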

Required Inputs

member#: An axis with one entry per individual group member.
member#group: The index of the group each member belongs to. If negative, it is not a part of any group.

Assured Outputs

group#: A new axis with one entry per group.

If overwrite, will overwrite existing data.

daf.groups.count_group_members(data: DafWriter, *, dtype: Union[str, dtype] = 'int32', overwrite: bool = False) → None

Count how many members are included in each group.

The resulting per-group 1D data will have the specified dtype. By default, this is int32, which is a reasonable type for storing counts.
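Counting members amounts to a histogram of the non-negative group indices. A minimal numpy sketch, with hypothetical variable names and made-up data:

```python
import numpy as np

group_of_members = np.array([0, 1, 0, -1, 2, 0])
groups_count = 4  # assumed size of the group axis; group 3 has no members

# Drop the members that belong to no group, then count per group index.
grouped = group_of_members[group_of_members >= 0]
members_of_groups = np.bincount(grouped, minlength=groups_count).astype("int32")
print(members_of_groups)
```

The `minlength` argument makes sure that trailing groups without members still get a zero count.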

Required Inputs

member#group: The index of the group each member belongs to. If negative, it is not a part of any group.

Assured Outputs

group#members: How many members exist in the group.

If overwrite, will overwrite existing data.

For example:

import daf

data = daf.DafWriter(
    storage=daf.MemoryStorage(name="example.storage"),
    base=daf.FilesReader(daf.DAF_EXAMPLE_PATH, name="example.base"),
    name="example"
)

with data.adapter(
    axes=dict(cell="member", metacell="group"),
    data={"cell#metacell": "group"},
    hide_implicit=True,
    back_data={"group#members": "cells"}
) as adapter:
    daf.count_group_members(adapter)

print(data.get_vector("metacell#cells"))

[ 53 114  36  47  52  97  26  31  34  34]

daf.groups.count_group_values(data: DafWriter, *, dtype: Union[str, dtype] = 'int32', dense: bool = False, overwrite: bool = False) → None

Count how many members of each group have each possible property value.

In daf, axis entries always have string values. However, the per-member 1D data need not contain strings; the only requirement is that converting the values to strings matches the entry names of the property axis. This allows us to deal with data such as “age”, which may take only a few float values (e.g. one of 6, 6.5 or 7 days).

The resulting per-group 2D data will have the specified dtype. By default, this is int32, which is a reasonable type for storing counts.

By default, store the data in Sparse format. If dense, store it in Dense format.
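The counting itself can be pictured as filling a groups-by-values table. The sketch below builds a small dense table with numpy only; all names and values are hypothetical, and sparse storage is not shown.

```python
import numpy as np

group_of_members = np.array([0, 0, 1, -1, 1, 1])
sex_of_members = np.array(["male", "female", "male", "female", "male", "male"])
sex_entries = ["male", "female"]  # assumed entry names of the property axis
groups_count = 2

# Map each property axis entry name to its column in the result.
column_of_entry = {name: column for column, name in enumerate(sex_entries)}

counts = np.zeros((groups_count, len(sex_entries)), dtype="int32")
for group_index, value in zip(group_of_members, sex_of_members):
    if group_index >= 0:
        # Convert the value to a string to look up the matching axis entry.
        counts[group_index, column_of_entry[str(value)]] += 1
print(counts)
```

Members with a negative group index are skipped, so they contribute to no row of the table.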

Required Inputs

member#: An axis with one entry per individual group member.
group#: An axis with an entry per group of zero or more members.
property#: An axis with an entry per value of some property.
member#group: The index of the group each member belongs to. If negative, it is not a part of any group.
member#property: The property value associated with each individual member.

Assured Outputs

group,property#members: How many members have each property value in each group.

If overwrite, will overwrite existing data.

For example:

import daf

data = daf.DafWriter(
    storage=daf.MemoryStorage(name="example.storage"),
    base=daf.FilesReader(daf.DAF_EXAMPLE_PATH, name="example.base"),
    name="example"
)

with data.adapter(
    axes=dict(cell="member", metacell="group", sex="property"),
    data={"cell#metacell": "group", "cell#batch#sex": "property"},
    hide_implicit=True,
    back_data={"group,property#members": "cells"}
) as adapter:
    daf.count_group_values(adapter)

print(data.get_frame("metacell,sex#cells"))

sex         male  female
metacell...
Metacell_0    14      39
Metacell_1   114       0
Metacell_2    22      14
Metacell_3    47       0
Metacell_4    45       7
Metacell_5    97       0
Metacell_6    17       9
Metacell_7     9      22
Metacell_8    34       0
Metacell_9    18      16

daf.groups.assign_group_values(data: DafWriter, *, dtype: Optional[Union[dtype, str]] = None, default: Optional[Any] = None, overwrite: bool = False) → None

Assign the per-group property value to each member of the group.

The resulting per-member 1D data will have the specified dtype. By default, this is the same as the data type of the group values.

Members that are not a part of any group are given the default value. This is None by default, which is acceptable for floating point values (becomes a NaN), but would fail for integer data.
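In numpy terms, assignment is an indexing operation from the group values into the member axis, with the default filling in for ungrouped members. A minimal sketch with hypothetical names and made-up floating point data:

```python
import numpy as np

group_of_members = np.array([0, 2, -1, 1, 0])
score_of_groups = np.array([1.5, 2.5, 3.5])  # one value per group

# Start from the default (NaN for floating point data), then fill the grouped members.
score_of_members = np.full(len(group_of_members), np.nan, dtype=score_of_groups.dtype)
is_grouped = group_of_members >= 0
score_of_members[is_grouped] = score_of_groups[group_of_members[is_grouped]]
print(score_of_members)
```

Masking before indexing matters here: using a negative group index directly would silently pick a value from the end of the array instead of the default.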

Required Inputs

member#: An axis with one entry per individual group member.
member#group: The index of the group each member belongs to. If negative, it is not a part of any group.
group#property: The property value associated with each group.

Assured Outputs

member#property: The property value associated with the group of each member.

If overwrite, will overwrite existing data.

For example:

import daf

data = daf.DafWriter(
    storage=daf.MemoryStorage(name="example.storage"),
    base=daf.FilesReader(daf.DAF_EXAMPLE_PATH, name="example.base"),
    name="example"
)

with data.adapter(
    axes=dict(cell="member", metacell="group", sex="property"),
    data={"cell#metacell": "group", "metacell#cell_type": "property"},
    hide_implicit=True,
    back_data={"member#property": "type"}
) as adapter:
    daf.assign_group_values(adapter)

print(data.get_item("cell=Cell_0,type"))

epiblast