--- myst: substitutions: describe: '`describe`' --- (basics-groupby)= # GroupBy cuDF supports a small (but important) subset of Pandas' [groupby API](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html). ## Summary of supported operations 1. Grouping by one or more columns 2. Basic aggregations such as "sum", "mean", etc. 3. Quantile aggregation 4. A "collect" or `list` aggregation for collecting values in a group into lists 5. Automatic exclusion of columns with unsupported dtypes ("nuisance" columns) when aggregating 6. Iterating over the groups of a GroupBy object 7. `GroupBy.groups` API that returns a mapping of group keys to row labels 8. `GroupBy.apply` API for performing arbitrary operations on each group. Note that this has very limited functionality compared to the equivalent Pandas function. See the section on [apply](#groupby-apply) for more details. 9. `GroupBy.pipe` similar to [Pandas](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#piping-function-calls). ## Grouping A GroupBy object is created by grouping the values of a `Series` or `DataFrame` by one or more columns: ```python >>> import cudf >>> df = cudf.DataFrame({'a': [1, 1, 1, 2, 2], 'b': [1, 1, 2, 2, 3], 'c': [1, 2, 3, 4, 5]}) >>> df a b c 0 1 1 1 1 1 1 2 2 1 2 3 3 2 2 4 4 2 3 5 >>> gb1 = df.groupby('a') # grouping by a single column >>> gb2 = df.groupby(['a', 'b']) # grouping by multiple columns >>> gb3 = df.groupby(cudf.Series(['a', 'a', 'b', 'b', 'b'])) # grouping by an external column ``` ````{warning} Unlike Pandas, cuDF uses `sort=False` by default to achieve better performance, which does not guarantee any particular group order in the result. For example: ```python >>> df = cudf.DataFrame({'a' : [2, 2, 1], 'b' : [42, 21, 11]}) >>> df.groupby('a').sum() b a 2 63 1 11 >>> df.to_pandas().groupby('a').sum() b a 1 11 2 63 ``` Setting `sort=True` will produce Pandas-like output, but with some performance penalty: ```python >>> df.groupby('a', sort=True).sum() b a 1 11 2 63 ``` ```` ### Grouping by index levels You can also group by one or more levels of a MultiIndex: ```python >>> df = cudf.DataFrame( ... {'a': [1, 1, 1, 2, 2], 'b': [1, 1, 2, 2, 3], 'c': [1, 2, 3, 4, 5]} ... ).set_index(['a', 'b']) ... >>> df.groupby(level='a') ``` ### The `Grouper` object A `Grouper` can be used to disambiguate between columns and levels when they have the same name: ```python >>> df b c b 1 1 1 1 1 2 1 2 3 2 2 4 2 3 5 >>> df.groupby('b', level='b') # ValueError: Cannot specify both by and level >>> df.groupby([cudf.Grouper(key='b'), cudf.Grouper(level='b')]) # OK ``` ## Aggregation Aggregations on groups are supported via the `agg` method: ```python >>> df a b c 0 1 1 1 1 1 1 2 2 1 2 3 3 2 2 4 4 2 3 5 >>> df.groupby('a').agg('sum') b c a 1 4 6 2 5 9 >>> df.groupby('a').agg({'b': ['sum', 'min'], 'c': 'mean'}) b c sum min mean a 1 4 1 2.0 2 5 2 4.5 >>> df.groupby("a").corr(method="pearson") b c a 1 b 1.000000 0.866025 c 0.866025 1.000000 2 b 1.000000 1.000000 c 1.000000 1.000000 ``` The following table summarizes the available aggregations and the types that support them: ```{eval-rst} .. table:: :class: special-table +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | Aggregations / dtypes | Numeric | Datetime | String | Categorical | List | Struct | Interval | Decimal | +====================================+===========+============+==========+===============+========+==========+============+===========+ | count | ✅ | ✅ | ✅ | ✅ | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | size | ✅ | ✅ | ✅ | ✅ | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | sum | ✅ | ✅ | | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | idxmin | ✅ | ✅ | | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | idxmax | ✅ | ✅ | | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | min | ✅ | ✅ | ✅ | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | max | ✅ | ✅ | ✅ | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | mean | ✅ | ✅ | | | | | | | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | var | ✅ | ✅ | | | | | | | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | std | ✅ | ✅ | | | | | | | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | quantile | ✅ | ✅ | | | | | | | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | median | ✅ | ✅ | | | | | | | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | nunique | ✅ | ✅ | ✅ | ✅ | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | nth | ✅ | ✅ | ✅ | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | collect | ✅ | ✅ | ✅ | | ✅ | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | unique | ✅ | ✅ | ✅ | ✅ | | | | | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | corr | ✅ | | | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ | cov | ✅ | | | | | | | ✅ | +------------------------------------+-----------+------------+----------+---------------+--------+----------+------------+-----------+ ``` ## GroupBy apply To apply function on each group, use the `GroupBy.apply()` method: ```python >>> df a b c 0 1 1 1 1 1 1 2 2 1 2 3 3 2 2 4 4 2 3 5 >>> df.groupby('a').apply(lambda x: x.max() - x.min()) a b c a 0 0 1 2 1 0 1 1 ``` ### Limitations - `apply` works by applying the provided function to each group sequentially, and concatenating the results together. **This can be very slow**, especially for a large number of small groups. For a small number of large groups, it can give acceptable performance. - The results may not always match Pandas exactly. For example, cuDF may return a `DataFrame` containing a single column where Pandas returns a `Series`. Some post-processing may be required to match Pandas behavior. - cuDF does not support some of the exceptional cases that Pandas supports with `apply`, such as calling [describe] inside the callable. ## Transform The `.transform()` method aggregates per group, and broadcasts the result to the group size, resulting in a Series/DataFrame that is of the same size as the input Series/DataFrame. ```python >>> import cudf >>> df = cudf.DataFrame({'a': [2, 1, 1, 2, 2], 'b': [1, 2, 3, 4, 5]}) >>> df.groupby('a').transform('max') b 0 5 1 3 2 3 3 5 4 5 ``` ## Rolling window calculations Use the `GroupBy.rolling()` method to perform rolling window calculations on each group: ```python >>> df a b c 0 1 1 1 1 1 1 2 2 1 2 3 3 2 2 4 4 2 3 5 ``` Rolling window sum on each group with a window size of 2: ```python >>> df.groupby('a').rolling(2).sum() a b c a 1 0 1 2 2 3 2 2 3 5 2 3 4 4 5 9 ``` [describe]: https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#flexible-apply