Aggregation Groupby#
- group aggregation_groupby
-
struct aggregation_request#
- #include <groupby.hpp>
Request for groupby aggregation(s) to perform on a column.
The group membership of each values[i] is determined by the corresponding row i in the original order of keys used to construct the groupby. I.e., for each aggregation, values[i] is aggregated with all other values[j] where rows i and j in keys are equivalent.
The size of the values column, values.size(), must equal keys.num_rows().
Public Members
-
column_view values#
The elements to aggregate.
-
std::vector<std::unique_ptr<groupby_aggregation>> aggregations#
Desired aggregations.
-
struct scan_request#
- #include <groupby.hpp>
Request for groupby aggregation(s) for scanning a column.
The group membership of each values[i] is determined by the corresponding row i in the original order of keys used to construct the groupby. I.e., for each aggregation, values[i] is scan aggregated with all previous values[j] where rows i and j in keys are equivalent.
The size of the values column, values.size(), must equal keys.num_rows().
Public Members
-
column_view values#
The elements to aggregate.
-
std::vector<std::unique_ptr<groupby_scan_aggregation>> aggregations#
Desired aggregations.
-
struct aggregation_result#
- #include <groupby.hpp>
The result(s) of an aggregation_request.
For every aggregation_request given to groupby::aggregate, an aggregation_result will be returned. The aggregation_result holds the resulting column(s) for each requested aggregation on the request’s values.
Public Members
-
std::vector<std::unique_ptr<column>> results = {}#
Columns of results from an aggregation_request.
-
class groupby#
- #include <groupby.hpp>
Groups values by keys and computes aggregations on those groups.
Public Functions
-
explicit groupby(table_view const &keys, null_policy null_handling = null_policy::EXCLUDE, sorted keys_are_sorted = sorted::NO, std::vector<order> const &column_order = {}, std::vector<null_order> const &null_precedence = {})#
Construct a groupby object with the specified keys.
If the keys are already sorted, better performance may be achieved by passing keys_are_sorted == sorted::YES and indicating the ascending/descending order of each column and the null order in column_order and null_precedence, respectively.
Note
This object does not maintain the lifetime of keys. It is the user’s responsibility to ensure the groupby object does not outlive the data viewed by the keys table_view.
- Parameters:
keys – Table whose rows act as the groupby keys
null_handling – Indicates whether rows in keys that contain NULL values should be included
keys_are_sorted – Indicates whether rows in keys are already sorted
column_order – If keys_are_sorted == sorted::YES, indicates whether each column is ascending/descending. If empty, assumes all columns are ascending. Ignored if keys_are_sorted == sorted::NO.
null_precedence – If keys_are_sorted == sorted::YES, indicates the ordering of null values in each column. If empty, assumes all columns use null_order::AFTER. Ignored if keys_are_sorted == sorted::NO.
-
std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> aggregate(host_span<aggregation_request const> requests, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Performs grouped aggregations on the specified values.
The values to aggregate and the aggregations to perform are specified in an aggregation_request. Each request contains a column_view of values to aggregate and a set of aggregations to perform on those elements.
For each aggregation in a request, values[i] is aggregated with all other values[j] where rows i and j in keys are equivalent.
The size() of the request column must equal keys.num_rows().
For every aggregation_request an aggregation_result will be returned. The aggregation_result holds the resulting column(s) for each requested aggregation on the request’s values. The order of the columns in each result is the same order as was specified in the request.
The returned table contains the group labels for each group, i.e., the unique rows from keys. Element i across all aggregation results belongs to the group at row i in the group labels table.
The order of the rows in the group labels is arbitrary. Furthermore, successive groupby::aggregate calls may return results in different orders.
Example:
  Input:
    keys:     {1 2 1 3 1}
              {1 2 1 4 1}
    request:
      values: {3 1 4 9 2}
      aggregations: {{SUM}, {MIN}}
  result:
    keys:  {3 1 2}
           {4 1 2}
    values:
      SUM: {9 9 1}
      MIN: {9 2 1}
- Throws:
cudf::logic_error – If requests[i].values.size() != keys.num_rows().
- Parameters:
requests – The set of columns to aggregate and the aggregations to perform
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the returned table and columns’ device memory
- Returns:
Pair containing the table with each group’s unique key and a vector of aggregation_results for each request in the same order as specified in requests.
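The grouping rule in the example above can be reproduced on the host with the standard library alone. The following is only a sketch of the documented semantics, not libcudf code: key_row, group_summary, and reference_sum_min are hypothetical names, the keys are assumed to be two integer columns, and std::map returns groups in sorted key order whereas groupby::aggregate makes no ordering guarantee.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical host-side model of a two-column key row; not part of libcudf.
using key_row = std::pair<int, int>;

// Hypothetical container for the SUM and MIN of one group.
struct group_summary {
  int sum;
  int min;
};

// Aggregates values[i] with every values[j] whose key row equals keys[i].
std::map<key_row, group_summary> reference_sum_min(std::vector<key_row> const& keys,
                                                   std::vector<int> const& values)
{
  // Mirrors the documented precondition values.size() == keys.num_rows().
  assert(values.size() == keys.size());

  std::map<key_row, group_summary> out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto it = out.find(keys[i]);
    if (it == out.end()) {
      // First row of a new group: it is both the sum and the minimum so far.
      out.emplace(keys[i], group_summary{values[i], values[i]});
    } else {
      it->second.sum += values[i];
      it->second.min = std::min(it->second.min, values[i]);
    }
  }
  return out;
}
```

Called with the keys and values from the example, this model yields three groups whose sums and minimums match the SUM and MIN columns shown there.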
-
std::pair<std::unique_ptr<table>, std::vector<aggregation_result>> scan(host_span<scan_request const> requests, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Performs grouped scans on the specified values.
The values to aggregate and the aggregations to perform are specified in a scan_request. Each request contains a column_view of values to aggregate and a set of aggregations to perform on those elements.
For each aggregation in a request, values[i] is scan aggregated with all previous values[j] where rows i and j in keys are equivalent.
The size() of the request column must equal keys.num_rows().
For every scan_request an aggregation_result will be returned. The aggregation_result holds the resulting column(s) for each requested aggregation on the request’s values. The order of the columns in each result is the same order as was specified in the request.
The returned table contains the group labels for each row, i.e., the keys given to the groupby object. Element i across all aggregation results belongs to the group at row i in the group labels table.
The order of the rows in the group labels is arbitrary. Furthermore, successive groupby::scan calls may return results in different orders.
Example:
  Input:
    keys:     {1 2 1 3 1}
              {1 2 1 4 1}
    request:
      values: {3 1 4 9 2}
      aggregations: {{SUM}, {MIN}}
  result:
    keys:  {3 1 1 1 2}
           {4 1 1 1 2}
    values:
      SUM: {9 3 7 9 1}
      MIN: {9 3 3 2 1}
- Throws:
cudf::logic_error – If requests[i].values.size() != keys.num_rows().
- Parameters:
requests – The set of columns to scan and the scans to perform
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the returned table and columns’ device memory
- Returns:
Pair containing the table with each group’s key and a vector of aggregation_results for each request in the same order as specified in requests.
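A grouped scan differs from a grouped aggregation in that every input row produces an output row holding the running result of its group. The sketch below models that on the host with the standard library; it is not libcudf code. The names key_row, scan_row, and reference_scan_sum_min are hypothetical, and the model returns results in input row order for simplicity, whereas groupby::scan returns rows arranged by group in an arbitrary group order.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

// Hypothetical host-side model of a two-column key row; not part of libcudf.
using key_row = std::pair<int, int>;

// Hypothetical per-row scan output: running SUM and running MIN of the row's group.
struct scan_row {
  int sum;
  int min;
};

// For each row i, combines values[i] with all earlier values[j] that share its key row.
// Output row i corresponds to input row i (an assumption of this sketch).
std::vector<scan_row> reference_scan_sum_min(std::vector<key_row> const& keys,
                                             std::vector<int> const& values)
{
  // Mirrors the documented precondition values.size() == keys.num_rows().
  assert(values.size() == keys.size());

  std::map<key_row, scan_row> running;  // running state of each group
  std::vector<scan_row> out;
  out.reserve(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto it = running.find(keys[i]);
    if (it == running.end()) {
      it = running.emplace(keys[i], scan_row{values[i], values[i]}).first;
    } else {
      it->second.sum += values[i];
      it->second.min = std::min(it->second.min, values[i]);
    }
    out.push_back(it->second);
  }
  return out;
}
```

For the example input, the rows belonging to key (1, 1) receive running sums 3, 7, 9 and running minimums 3, 3, 2, which are the same per-group sequences shown in the example result.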
-
std::pair<std::unique_ptr<table>, std::unique_ptr<table>> shift(table_view const &values, host_span<size_type const> offsets, std::vector<std::reference_wrapper<scalar const>> const &fill_values, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Performs grouped shifts for specified values.
In the jth column, for each group, the ith element is determined by the (i - offsets[j])th element of the group. If i - offsets[j] < 0 or i - offsets[j] >= group_size, the value is determined by fill_values[j].
Example:
  keys:    {1 4 1 3 4 4 1}
           {1 2 1 3 2 2 1}
  values:  {3 9 1 4 2 5 7}
           {"a" "c" "bb" "ee" "z" "x" "d"}
  offset:  {2, -1}
  fill_value: {@, @}
  result (group order may be different):
    keys:    {3 1 1 1 4 4 4}
             {3 1 1 1 2 2 2}
    values:  {@ @ @ 3 @ @ 9}
             {@ "bb" "d" @ "z" "x" @}
  -------------------------------------------------
  keys:    {1 4 1 3 4 4 1}
           {1 2 1 3 2 2 1}
  values:  {3 9 1 4 2 5 7}
           {"a" "c" "bb" "ee" "z" "x" "d"}
  offset:  {-2, 1}
  fill_value: {-1, "42"}
  result (group order may be different):
    keys:    {3 1 1 1 4 4 4}
             {3 1 1 1 2 2 2}
    values:  {-1 7 -1 -1 5 -1 -1}
             {"42" "42" "a" "bb" "42" "c" "z"}
Note
The first returned table stores the keys passed to the groupby object. Row i of the key table corresponds to the group labels of row i in the shifted columns. The key order in each group matches the input order. The order of each group is arbitrary. The group order in successive calls to groupby::shift may be different.
- Parameters:
values – Table whose columns are to be shifted
offsets – The offsets by which to shift the input
fill_values – Fill values for indeterminable outputs
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the returned table and columns’ device memory
- Throws:
cudf::logic_error – If the dtype of fill_values[i] does not match the dtype of values[i] for the ith column
- Returns:
Pair containing the tables with each group’s key and the columns shifted
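The per-group indexing rule for shift can be checked with a small host-side model. This is a sketch under stated assumptions rather than libcudf code: reference_group_shift is a hypothetical name, it handles a single integer key column and a single integer value column, and it writes results back in input row order instead of the grouped order that groupby::shift returns.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical host-side model of a grouped shift on one column; not part of libcudf.
// Within each group, output element i takes the value of group element (i - offset);
// positions whose source index falls outside the group receive `fill`.
std::vector<int> reference_group_shift(std::vector<int> const& keys,
                                       std::vector<int> const& values,
                                       int offset,
                                       int fill)
{
  assert(values.size() == keys.size());

  // Row indices of each group, kept in input order.
  std::map<int, std::vector<std::size_t>> rows_of;
  for (std::size_t i = 0; i < keys.size(); ++i) { rows_of[keys[i]].push_back(i); }

  // Start with every position filled, then overwrite the positions that have a source.
  std::vector<int> out(values.size(), fill);
  for (auto const& entry : rows_of) {
    std::vector<std::size_t> const& rows = entry.second;
    std::ptrdiff_t const group_size      = static_cast<std::ptrdiff_t>(rows.size());
    for (std::ptrdiff_t i = 0; i < group_size; ++i) {
      std::ptrdiff_t const src = i - offset;
      if (src >= 0 && src < group_size) {
        out[rows[static_cast<std::size_t>(i)]] = values[rows[static_cast<std::size_t>(src)]];
      }
    }
  }
  return out;
}
```

With the first value column of the example, an offset of -2, and a fill of -1, the group with key 1 (values 3, 1, 7) becomes 7, -1, -1 and the group with key 4 (values 9, 2, 5) becomes 5, -1, -1, matching the per-group sequences in the second example result.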
-
groups get_groups(cudf::table_view values = {}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Get the grouped keys and values corresponding to a groupby operation on a set of values.
Returns a groups object representing the grouped keys and values. If values is not provided, only a grouping of the keys is performed, and the values of the groups object will be nullptr.
- Parameters:
values – Table representing values on which a groupby operation is to be performed
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the device memory of the tables in the returned groups
- Returns:
A groups object representing grouped keys and values
-
std::pair<std::unique_ptr<table>, std::unique_ptr<table>> replace_nulls(table_view const &values, host_span<cudf::replace_policy const> replace_policies, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Performs grouped replace nulls on values.
For each value[i] == NULL in group j, value[i] is replaced with the first non-null value in group j that precedes or follows value[i], depending on the replace policy given for that column. If a non-null value is not found in the specified direction, value[i] is left NULL.
The returned pair contains a table of the sorted keys and a table of the result columns. In each result column, values of the same group are in contiguous memory. Within each group, values keep their original order. The order of the groups is not guaranteed.
Example:
  //Inputs:
  keys:    {3 3 1 3 1 3 4}
           {2 2 1 2 1 2 5}
  values:  {3 4 7 @ @ @ @}
           {@ @ @ "x" "tt" @ @}
  replace_policies: {FORWARD, BACKWARD}

  //Outputs (group orders may be different):
  keys:    {3 3 3 3 1 1 4}
           {2 2 2 2 1 1 5}
  result:  {3 4 4 4 7 7 @}
           {"x" "x" "x" @ "tt" "tt" @}
- Parameters:
values – [in] A table whose column null values will be replaced
replace_policies – [in] Specify the position of replacement values relative to null values, one for each column
stream – [in] CUDA stream used for device memory operations and kernel launches.
mr – [in] Device memory resource used to allocate device memory of the returned column
- Returns:
Pair that contains a table with the sorted keys and the result column
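The replacement rule can be modelled on the host by carrying the most recent non-null value of each group while walking the rows in one direction. The sketch below is not libcudf code and makes several assumptions: nullable_int, fill_direction, and reference_group_replace_nulls are hypothetical names, a single integer key column is used, and results are written in input row order rather than grouped order. Judging by the example above, FORWARD corresponds to taking the preceding value and BACKWARD to taking the following one.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Hypothetical nullable integer: `value` is meaningful only when `valid` is true.
struct nullable_int {
  bool valid;
  int value;
};

// Hypothetical stand-in for a replace policy; not the libcudf enum.
enum class fill_direction { from_preceding, from_following };

// Replaces each null with the nearest non-null value of the same group in the chosen
// direction. Nulls with no such value stay null.
std::vector<nullable_int> reference_group_replace_nulls(std::vector<int> const& keys,
                                                        std::vector<nullable_int> const& values,
                                                        fill_direction direction)
{
  assert(values.size() == keys.size());

  std::vector<nullable_int> out(values);
  std::map<int, nullable_int> last_seen;  // most recent non-null value of each group
  std::size_t const n = values.size();
  for (std::size_t step = 0; step < n; ++step) {
    // Walk front to back for from_preceding, back to front for from_following.
    std::size_t const i = (direction == fill_direction::from_preceding) ? step : n - 1 - step;
    nullable_int& carry = last_seen[keys[i]];  // value-initialized to a null entry
    if (values[i].valid) {
      carry = values[i];
    } else {
      out[i] = carry;  // still null if the group has no non-null value yet
    }
  }
  return out;
}
```

Applied to the first value column of the example with from_preceding, the nulls in the group with key 3 become 4, the null in the group with key 1 becomes 7, and the lone null in the group with key 4 stays null.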
-
struct groups#
- #include <groupby.hpp>
The grouped data corresponding to a groupby operation on a set of values.
A groups object holds two tables with an identical number of rows: a table of grouped keys and a table of grouped values. In addition, it holds a vector of integer offsets into the rows of the tables, such that offsets[i+1] - offsets[i] gives the size of group i.
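The offsets convention can be illustrated with a host-side sketch. It is not libcudf code: offsets_from_grouped_keys and group_sizes are hypothetical helpers, a single integer key column is assumed, and the offsets vector is assumed to have one more entry than there are groups, which is what the formula offsets[i+1] - offsets[i] implies.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: builds group offsets from keys that are already arranged so that
// equal keys are contiguous. The result has one entry per group plus a final entry equal
// to the number of rows.
std::vector<int> offsets_from_grouped_keys(std::vector<int> const& grouped_keys)
{
  std::vector<int> offsets;
  for (std::size_t i = 0; i < grouped_keys.size(); ++i) {
    // A new group starts at row 0 and wherever the key changes.
    if (i == 0 || grouped_keys[i] != grouped_keys[i - 1]) {
      offsets.push_back(static_cast<int>(i));
    }
  }
  offsets.push_back(static_cast<int>(grouped_keys.size()));
  return offsets;
}

// Hypothetical helper: the size of group i is offsets[i + 1] - offsets[i].
std::vector<int> group_sizes(std::vector<int> const& offsets)
{
  std::vector<int> sizes;
  for (std::size_t i = 0; i + 1 < offsets.size(); ++i) {
    sizes.push_back(offsets[i + 1] - offsets[i]);
  }
  return sizes;
}
```

For grouped keys {3 1 1 1 4 4 4}, this model produces offsets {0 1 4 7} and group sizes {1 3 3}.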