Aggregation Rolling#
- group aggregation_rolling
Functions
-
std::unique_ptr<column> rolling_window(column_view const &input, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a fixed-size rolling window function to the values in a column.
This function aggregates values in a window around each element i of the input column, and invalidates the bit mask for element i if there are not enough observations. The window size is static (the same for each element). This matches Pandas’ API for DataFrame.rolling with a few notable differences:
instead of the center flag it uses a two-part window to allow for more flexible windows. The total window size =
preceding_window + following_window
. Elementi
uses elements[i-preceding_window+1, i+following_window]
to do the window computation.instead of storing NA/NaN for output rows that do not meet the minimum number of observations this function updates the valid bitmask of the column to indicate which elements are valid.
Notes on return column types:
The returned column for count aggregation always has
INT32
type.The returned column for VARIANCE/STD aggregations always has
FLOAT64
type.All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) to
FLOAT32
orFLOAT64
before doing a rollingMEAN
.
- Parameters:
input – [in] The input column
preceding_window – [in] The static rolling window size in the backward direction
following_window – [in] The static rolling window size in the forward direction
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.agg – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> rolling_window(column_view const &input, column_view const &default_outputs, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a fixed-size rolling window function to the values in a column.
This function aggregates values in a window around each element i of the input column, and invalidates the bit mask for element i if there are not enough observations. The window size is static (the same for each element). This matches Pandas’ API for DataFrame.rolling with a few notable differences:
instead of the center flag it uses a two-part window to allow for more flexible windows. The total window size =
preceding_window + following_window
. Elementi
uses elements[i-preceding_window+1, i+following_window]
to do the window computation.instead of storing NA/NaN for output rows that do not meet the minimum number of observations this function updates the valid bitmask of the column to indicate which elements are valid.
Notes on return column types:
The returned column for count aggregation always has
INT32
type.The returned column for VARIANCE/STD aggregations always has
FLOAT64
type.All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) to
FLOAT32
orFLOAT64
before doing a rollingMEAN
.
- Parameters:
input – [in] The input column
preceding_window – [in] The static rolling window size in the backward direction
following_window – [in] The static rolling window size in the forward direction
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.agg – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
default_outputs – A column of per-row default values to be returned instead of nulls. Used for LEAD()/LAG(), if the row offset crosses the boundaries of the column.
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, fixed-size rolling window function to the values in a column.
Like
rolling_window()
, this function aggregates values in a window around each element of a specifiedinput
column. It differs fromrolling_window()
in that elements of theinput
column are grouped into distinct groups (e.g. the result of a groupby). The window aggregation cannot cross the group boundaries. For a rowi
ofinput
, the group is determined from the corresponding (i.e. i-th) values of the columns undergroup_keys
.Note: This method requires that the rows are presorted by the
group_key
values.Example: Consider a user-sales dataset, where the rows look as follows: { "user_id", sales_amt, day } The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by `user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including current row), 1 row following). In this example, 1. `group_keys == [ user_id ]` 2. `input == sales_amt` The data are grouped by `user_id`, and ordered by `day`-string. The aggregation (SUM) is then calculated for a window of 3 values around (and including) each row. For the following input: [ // user, sales_amt { "user1", 10 }, { "user2", 20 }, { "user1", 20 }, { "user1", 10 }, { "user2", 30 }, { "user2", 80 }, { "user1", 50 }, { "user1", 60 }, { "user2", 40 } ] Partitioning (grouping) by `user_id` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`): [ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <-------user1-------->|<------user2-------> The SUM aggregation is applied with 1 preceding and 1 following row, with a minimum of 1 period. The aggregation window is thus 3 rows wide, yielding the following column: [ 30, 40, 80, 120, 110, 50, 130, 150, 120 ] Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8) consider only 2 values each, in spite of the window-size being 3. Each aggregation operation cannot cross group boundaries.
The returned column for
op == COUNT
always hasINT32
type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) toFLOAT32
orFLOAT64
before doing a rollingMEAN
.Note:
preceding_window
andfollowing_window
could well have negative values. This yields windows where the current row might not be included at all. For instance, consider a window defined as (preceding=3, following=-1). This produces a window from 2 (i.e. 3-1) rows preceding the current row, and 1 row preceding the current row. For the example above, the window for row#3 is:[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—window–> ^ | current_row
Similarly,
preceding
could have a negative value, indicating that the window begins at a position after the current row. It differs slightly from the semantics forfollowing
, becausepreceding
includes the current row. Therefore:preceding=1 => Window starts at the current row.
preceding=0 => Window starts at 1 past the current row.
preceding=-1 => Window starts at 2 past the current row. Etc.
- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
input – [in] The input column (to be aggregated)
preceding_window – [in] The static rolling window size in the backward direction (for positive values), or forward direction (for negative values)
following_window – [in] The static rolling window size in the forward direction (for positive values), or backward direction (for negative values)
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, window_bounds preceding_window, window_bounds following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, fixed-size rolling window function to the values in a column.
Like
rolling_window()
, this function aggregates values in a window around each element of a specifiedinput
column. It differs fromrolling_window()
in that elements of theinput
column are grouped into distinct groups (e.g. the result of a groupby). The window aggregation cannot cross the group boundaries. For a rowi
ofinput
, the group is determined from the corresponding (i.e. i-th) values of the columns undergroup_keys
.Note: This method requires that the rows are presorted by the
group_key
values.Example: Consider a user-sales dataset, where the rows look as follows: { "user_id", sales_amt, day } The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by `user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including current row), 1 row following). In this example, 1. `group_keys == [ user_id ]` 2. `input == sales_amt` The data are grouped by `user_id`, and ordered by `day`-string. The aggregation (SUM) is then calculated for a window of 3 values around (and including) each row. For the following input: [ // user, sales_amt { "user1", 10 }, { "user2", 20 }, { "user1", 20 }, { "user1", 10 }, { "user2", 30 }, { "user2", 80 }, { "user1", 50 }, { "user1", 60 }, { "user2", 40 } ] Partitioning (grouping) by `user_id` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`): [ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <-------user1-------->|<------user2-------> The SUM aggregation is applied with 1 preceding and 1 following row, with a minimum of 1 period. The aggregation window is thus 3 rows wide, yielding the following column: [ 30, 40, 80, 120, 110, 50, 130, 150, 120 ] Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8) consider only 2 values each, in spite of the window-size being 3. Each aggregation operation cannot cross group boundaries.
The returned column for
op == COUNT
always hasINT32
type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) toFLOAT32
orFLOAT64
before doing a rollingMEAN
.Note:
preceding_window
andfollowing_window
could well have negative values. This yields windows where the current row might not be included at all. For instance, consider a window defined as (preceding=3, following=-1). This produces a window from 2 (i.e. 3-1) rows preceding the current row, and 1 row preceding the current row. For the example above, the window for row#3 is:[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—window–> ^ | current_row
Similarly,
preceding
could have a negative value, indicating that the window begins at a position after the current row. It differs slightly from the semantics forfollowing
, becausepreceding
includes the current row. Therefore:preceding=1 => Window starts at the current row.
preceding=0 => Window starts at 1 past the current row.
preceding=-1 => Window starts at 2 past the current row. Etc.
- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
input – [in] The input column (to be aggregated)
preceding_window – [in] The static rolling window size in the backward direction (for positive values), or forward direction (for negative values)
following_window – [in] The static rolling window size in the forward direction (for positive values), or backward direction (for negative values)
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, column_view const &default_outputs, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, fixed-size rolling window function to the values in a column.
Like
rolling_window()
, this function aggregates values in a window around each element of a specifiedinput
column. It differs fromrolling_window()
in that elements of theinput
column are grouped into distinct groups (e.g. the result of a groupby). The window aggregation cannot cross the group boundaries. For a rowi
ofinput
, the group is determined from the corresponding (i.e. i-th) values of the columns undergroup_keys
.Note: This method requires that the rows are presorted by the
group_key
values.Example: Consider a user-sales dataset, where the rows look as follows: { "user_id", sales_amt, day } The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by `user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including current row), 1 row following). In this example, 1. `group_keys == [ user_id ]` 2. `input == sales_amt` The data are grouped by `user_id`, and ordered by `day`-string. The aggregation (SUM) is then calculated for a window of 3 values around (and including) each row. For the following input: [ // user, sales_amt { "user1", 10 }, { "user2", 20 }, { "user1", 20 }, { "user1", 10 }, { "user2", 30 }, { "user2", 80 }, { "user1", 50 }, { "user1", 60 }, { "user2", 40 } ] Partitioning (grouping) by `user_id` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`): [ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <-------user1-------->|<------user2-------> The SUM aggregation is applied with 1 preceding and 1 following row, with a minimum of 1 period. The aggregation window is thus 3 rows wide, yielding the following column: [ 30, 40, 80, 120, 110, 50, 130, 150, 120 ] Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8) consider only 2 values each, in spite of the window-size being 3. Each aggregation operation cannot cross group boundaries.
The returned column for
op == COUNT
always hasINT32
type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) toFLOAT32
orFLOAT64
before doing a rollingMEAN
.Note:
preceding_window
andfollowing_window
could well have negative values. This yields windows where the current row might not be included at all. For instance, consider a window defined as (preceding=3, following=-1). This produces a window from 2 (i.e. 3-1) rows preceding the current row, and 1 row preceding the current row. For the example above, the window for row#3 is:[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—window–> ^ | current_row
Similarly,
preceding
could have a negative value, indicating that the window begins at a position after the current row. It differs slightly from the semantics forfollowing
, becausepreceding
includes the current row. Therefore:preceding=1 => Window starts at the current row.
preceding=0 => Window starts at 1 past the current row.
preceding=-1 => Window starts at 2 past the current row. Etc.
- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
input – [in] The input column (to be aggregated)
preceding_window – [in] The static rolling window size in the backward direction (for positive values), or forward direction (for negative values)
following_window – [in] The static rolling window size in the forward direction (for positive values), or backward direction (for negative values)
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
default_outputs – A column of per-row default values to be returned instead of nulls. Used for LEAD()/LAG(), if the row offset crosses the boundaries of the column or group.
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_rolling_window(table_view const &group_keys, column_view const &input, column_view const &default_outputs, window_bounds preceding_window, window_bounds following_window, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, fixed-size rolling window function to the values in a column.
Like
rolling_window()
, this function aggregates values in a window around each element of a specifiedinput
column. It differs fromrolling_window()
in that elements of theinput
column are grouped into distinct groups (e.g. the result of a groupby). The window aggregation cannot cross the group boundaries. For a rowi
ofinput
, the group is determined from the corresponding (i.e. i-th) values of the columns undergroup_keys
.Note: This method requires that the rows are presorted by the
group_key
values.Example: Consider a user-sales dataset, where the rows look as follows: { "user_id", sales_amt, day } The `grouped_rolling_window()` method enables windowing queries such as grouping a dataset by `user_id`, and summing up the `sales_amt` column over a window of 3 rows (2 preceding (including current row), 1 row following). In this example, 1. `group_keys == [ user_id ]` 2. `input == sales_amt` The data are grouped by `user_id`, and ordered by `day`-string. The aggregation (SUM) is then calculated for a window of 3 values around (and including) each row. For the following input: [ // user, sales_amt { "user1", 10 }, { "user2", 20 }, { "user1", 20 }, { "user1", 10 }, { "user2", 30 }, { "user2", 80 }, { "user1", 50 }, { "user1", 60 }, { "user2", 40 } ] Partitioning (grouping) by `user_id` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`): [ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <-------user1-------->|<------user2-------> The SUM aggregation is applied with 1 preceding and 1 following row, with a minimum of 1 period. The aggregation window is thus 3 rows wide, yielding the following column: [ 30, 40, 80, 120, 110, 50, 130, 150, 120 ] Note: The SUMs calculated at the group boundaries (i.e. indices 0, 4, 5, and 8) consider only 2 values each, in spite of the window-size being 3. Each aggregation operation cannot cross group boundaries.
The returned column for
op == COUNT
always hasINT32
type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) toFLOAT32
orFLOAT64
before doing a rollingMEAN
.Note:
preceding_window
andfollowing_window
could well have negative values. This yields windows where the current row might not be included at all. For instance, consider a window defined as (preceding=3, following=-1). This produces a window from 2 (i.e. 3-1) rows preceding the current row, and 1 row preceding the current row. For the example above, the window for row#3 is:[ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <—window–> ^ | current_row
Similarly,
preceding
could have a negative value, indicating that the window begins at a position after the current row. It differs slightly from the semantics forfollowing
, becausepreceding
includes the current row. Therefore:preceding=1 => Window starts at the current row.
preceding=0 => Window starts at 1 past the current row.
preceding=-1 => Window starts at 2 past the current row. Etc.
- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
input – [in] The input column (to be aggregated)
preceding_window – [in] The static rolling window size in the backward direction (for positive values), or forward direction (for negative values)
following_window – [in] The static rolling window size in the forward direction (for positive values), or backward direction (for negative values)
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
default_outputs – A column of per-row default values to be returned instead of nulls. Used for LEAD()/LAG(), if the row offset crosses the boundaries of the column or group.
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_time_range_rolling_window(table_view const &group_keys, column_view const ×tamp_column, cudf::order const ×tamp_order, column_view const &input, size_type preceding_window_in_days, size_type following_window_in_days, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, timestamp-based rolling window function to the values in a column.
Like
rolling_window()
, this function aggregates values in a window around each element of a specifiedinput
column. It differs fromrolling_window()
in two respects:The elements of the
input
column are grouped into distinct groups (e.g. the result of a groupby), determined by the corresponding values of the columns undergroup_keys
. The window-aggregation cannot cross the group boundaries.Within a group, the aggregation window is calculated based on a time interval (e.g. number of days preceding/following the current row). The timestamps for the input data are specified by the
timestamp_column
argument.
Note: This method requires that the rows are presorted by the group keys and timestamp values.
Example: Consider a user-sales dataset, where the rows look as follows: { "user_id", sales_amt, date } This method enables windowing queries such as grouping a dataset by `user_id`, sorting by increasing `date`, and summing up the `sales_amt` column over a window of 3 days (1 preceding *day, the current day, and 1 following day). In this example, 1. `group_keys == [ user_id ]` 2. `timestamp_column == date` 3. `input == sales_amt` The data are grouped by `user_id`, and ordered by `date`. The aggregation (SUM) is then calculated for a window of 3 days around (and including) each row. For the following input: [ // user, sales_amt, YYYYMMDD (date) { "user1", 10, 20200101 }, { "user2", 20, 20200101 }, { "user1", 20, 20200102 }, { "user1", 10, 20200103 }, { "user2", 30, 20200101 }, { "user2", 80, 20200102 }, { "user1", 50, 20200107 }, { "user1", 60, 20200107 }, { "user2", 40, 20200104 } ] Partitioning (grouping) by `user_id`, and ordering by `date` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`): Date :(202001-) [ 01, 02, 03, 07, 07, 01, 01, 02, 04 ] Input: [ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <-------user1-------->|<---------user2---------> The SUM aggregation is applied, with 1 day preceding, and 1 day following, with a minimum of 1 period. The aggregation window is thus 3 *days* wide, yielding the following output column: Results: [ 30, 40, 30, 110, 110, 130, 130, 130, 40 ]
Note: The number of rows participating in each window might vary, based on the index within the group, datestamp, and
min_periods
. Apropos:results[0] considers 2 values, because it is at the beginning of its group, and has no preceding values.
results[5] considers 3 values, despite being at the beginning of its group. It must include 2 following values, based on its datestamp.
Each aggregation operation cannot cross group boundaries.
The returned column for
op == COUNT
always hasINT32
type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) toFLOAT32
orFLOAT64
before doing a rollingMEAN
.- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
timestamp_column – [in] The (pre-sorted) timestamps for each row
timestamp_order – [in] The order (ASCENDING/DESCENDING) in which the timestamps are sorted
input – [in] The input column (to be aggregated)
preceding_window_in_days – [in] The rolling window time-interval in the backward direction
following_window_in_days – [in] The rolling window time-interval in the forward direction
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_time_range_rolling_window(table_view const &group_keys, column_view const ×tamp_column, cudf::order const ×tamp_order, column_view const &input, window_bounds preceding_window_in_days, window_bounds following_window_in_days, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, timestamp-based rolling window function to the values in a column,.
Like
rolling_window()
, this function aggregates values in a window around each element of a specifiedinput
column. It differs fromrolling_window()
in two respects:The elements of the
input
column are grouped into distinct groups (e.g. the result of a groupby), determined by the corresponding values of the columns undergroup_keys
. The window-aggregation cannot cross the group boundaries.Within a group, the aggregation window is calculated based on a time interval (e.g. number of days preceding/following the current row). The timestamps for the input data are specified by the
timestamp_column
argument.
Note: This method requires that the rows are presorted by the group keys and timestamp values.
Example: Consider a user-sales dataset, where the rows look as follows: { "user_id", sales_amt, date } This method enables windowing queries such as grouping a dataset by `user_id`, sorting by increasing `date`, and summing up the `sales_amt` column over a window of 3 days (1 preceding *day, the current day, and 1 following day). In this example, 1. `group_keys == [ user_id ]` 2. `timestamp_column == date` 3. `input == sales_amt` The data are grouped by `user_id`, and ordered by `date`. The aggregation (SUM) is then calculated for a window of 3 days around (and including) each row. For the following input: [ // user, sales_amt, YYYYMMDD (date) { "user1", 10, 20200101 }, { "user2", 20, 20200101 }, { "user1", 20, 20200102 }, { "user1", 10, 20200103 }, { "user2", 30, 20200101 }, { "user2", 80, 20200102 }, { "user1", 50, 20200107 }, { "user1", 60, 20200107 }, { "user2", 40, 20200104 } ] Partitioning (grouping) by `user_id`, and ordering by `date` yields the following `sales_amt` vector (with 2 groups, one for each distinct `user_id`): Date :(202001-) [ 01, 02, 03, 07, 07, 01, 01, 02, 04 ] Input: [ 10, 20, 10, 50, 60, 20, 30, 80, 40 ] <-------user1-------->|<---------user2---------> The SUM aggregation is applied, with 1 day preceding, and 1 day following, with a minimum of 1 period. The aggregation window is thus 3 *days* wide, yielding the following output column: Results: [ 30, 40, 30, 110, 110, 130, 130, 130, 40 ]
Note: The number of rows participating in each window might vary, based on the index within the group, datestamp, and
min_periods
. Apropos:results[0] considers 2 values, because it is at the beginning of its group, and has no preceding values.
results[5] considers 3 values, despite being at the beginning of its group. It must include 2 following values, based on its datestamp.
Each aggregation operation cannot cross group boundaries.
The returned column for
op == COUNT
always hasINT32
type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) toFLOAT32
orFLOAT64
before doing a rollingMEAN
.The
preceding_window_in_days
andfollowing_window_in_days
are specified as awindow_bounds
and supports “unbounded” windows, if set towindow_bounds::unbounded()
.- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
timestamp_column – [in] The (pre-sorted) timestamps for each row
timestamp_order – [in] The order (ASCENDING/DESCENDING) in which the timestamps are sorted
input – [in] The input column (to be aggregated)
preceding_window_in_days – [in] The rolling window time-interval in the backward direction
following_window_in_days – [in] The rolling window time-interval in the forward direction
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> grouped_range_rolling_window(table_view const &group_keys, column_view const &orderby_column, cudf::order const &order, column_view const &input, range_window_bounds const &preceding, range_window_bounds const &following, size_type min_periods, rolling_aggregation const &aggr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a grouping-aware, value range-based rolling window function to the values in a column.
This function aggregates rows in a window around each element of a specified
input
column. The window is determined based on the values of an orderedorderby
column, and on the values of apreceding
andfollowing
scalar representing an inclusive range of orderby column values.The elements of the
input
column are grouped into distinct groups (e.g. the result of a groupby), determined by the corresponding values of the columns undergroup_keys
. The window-aggregation cannot cross the group boundaries.Within a group, with all rows sorted by the
orderby
column, the aggregation window for a row at indexi
is determined as follows: a) Iforderby
is ASCENDING, aggregation window for rowi
includes allinput
rows at indexj
such that:b) If(orderby[i] - preceding) <= orderby[j] <= orderby[i] + following
orderby
is DESCENDING, aggregation window for rowi
includes allinput
rows at indexj
such that:(orderby[i] + preceding) >= orderby[j] >= orderby[i] - following
Note: This method requires that the rows are presorted by the group keys and orderby column values.
The window intervals are specified as scalar values appropriate for the orderby column. Currently, only the following combinations of
orderby
column type and range types are supported:If
orderby
column is a TIMESTAMP, thepreceding
/following
windows are specified in terms ofDURATION
scalars of the same resolution. E.g. Fororderby
column of typeTIMESTAMP_SECONDS
, the intervals may only beDURATION_SECONDS
. Durations of higher resolution (e.g.DURATION_NANOSECONDS
) or lower (e.g.DURATION_DAYS
) cannot be used.If the
orderby
column is an integral type (e.g.INT32
), thepreceding
/following
should be the exact same type (INT32
).
Example: Consider a motor-racing statistics dataset, containing the following columns: 1. driver_name: (STRING) Name of the car driver 2. num_overtakes: (INT32) Number of times the driver overtook another car in a lap 3. lap_number: (INT32) The number of the lap The `group_range_rolling_window()` function allows one to calculate the total number of overtakes each driver made within any 3 lap window of each entry: 1. Group/partition the dataset by `driver_id` (This is the group_keys argument.) 2. Sort each group by the `lap_number` (i.e. This is the orderby_column.) 3. Calculate the SUM(num_overtakes) over a window (preceding=1, following=1) For the following input: [ // driver_name, num_overtakes, lap_number { "bottas", 1, 1 }, { "hamilton", 2, 1 }, { "bottas", 2, 2 }, { "bottas", 1, 3 }, { "hamilton", 3, 1 }, { "hamilton", 8, 2 }, { "bottas", 5, 7 }, { "bottas", 6, 8 }, { "hamilton", 4, 4 } ] Partitioning (grouping) by `driver_name`, and ordering by `lap_number` yields the following `num_overtakes` vector (with 2 groups, one for each distinct `driver_name`): lap_number: [ 1, 2, 3, 7, 8, 1, 1, 2, 4 ] num_overtakes: [ 1, 2, 1, 5, 6, 2, 3, 8, 4 ] <-----bottas------>|<----hamilton---> The SUM aggregation is applied, with 1 preceding, and 1 following, with a minimum of 1 period. The aggregation window is thus 3 (laps) wide, yielding the following output column: Results: [ 3, 4, 3, 11, 11, 13, 13, 13, 4 ]
Note: The number of rows participating in each window might vary, based on the index within the group, datestamp, and
min_periods
. Apropos:results[0] considers 2 values, because it is at the beginning of its group, and has no preceding values.
results[5] considers 3 values, despite being at the beginning of its group. It must include 2 following values, based on its orderby_column value.
Each aggregation operation cannot cross group boundaries.
The type of the returned column depends on the input column type
T
, and the aggregation:COUNT returns
INT32
columnsMIN/MAX returns
T
columnsSUM returns the promoted type for T. Sum on
INT32
yieldsINT64
.MEAN returns FLOAT64 columns
COLLECT returns columns of type
LIST<T>
.
LEAD/LAG/ROW_NUMBER are undefined for range queries.
- Parameters:
group_keys – [in] The (pre-sorted) grouping columns
orderby_column – [in] The (pre-sorted) order-by column, for range comparisons
order – [in] The order (ASCENDING/DESCENDING) in which the order-by column is sorted
input – [in] The input column (to be aggregated)
preceding – [in] The interval value in the backward direction
following – [in] The interval value in the forward direction
min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.aggr – [in] The rolling window aggregation type (SUM, MAX, MIN, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
std::unique_ptr<column> rolling_window(column_view const &input, column_view const &preceding_window, column_view const &following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Applies a variable-size rolling window function to the values in a column.
This function aggregates values in a window around each element i of the input column, and invalidates the bit mask for element i if there are not enough observations. The window size is dynamic (varying for each element). This matches Pandas’ API for DataFrame.rolling with a few notable differences:
instead of the center flag it uses a two-part window to allow for more flexible windows. The total window size =
preceding_window + following_window
. Elementi
uses elements[i-preceding_window+1, i+following_window]
to do the window computation.instead of storing NA/NaN for output rows that do not meet the minimum number of observations this function updates the valid bitmask of the column to indicate which elements are valid.
support for dynamic rolling windows, i.e. window size can be specified for each element using an additional array.
The returned column for count aggregation always has INT32 type. All other operators return a column of the same type as the input. Therefore it is suggested to convert integer column types (especially low-precision integers) to
FLOAT32
orFLOAT64
before doing a rollingMEAN
.- Throws:
cudf::logic_error – if window column type is not INT32
- Parameters:
input – [in] The input column
preceding_window – [in] A non-nullable column of INT32 window sizes in the forward direction.
preceding_window[i]
specifies preceding window size for elementi
.following_window – [in] A non-nullable column of INT32 window sizes in the backward direction.
following_window[i]
specifies following window size for elementi
.min_periods – [in] Minimum number of observations in window required to have a value, otherwise element
i
is null.agg – [in] The rolling window aggregation type (sum, max, min, etc.)
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
A nullable output column containing the rolling window results
-
struct range_window_bounds#
- #include <range_window_bounds.hpp>
Abstraction for window boundary sizes, to be used with
grouped_range_rolling_window()
.Similar to
window_bounds
ingrouped_rolling_window()
,range_window_bounds
represents window boundaries for use withgrouped_range_rolling_window()
. A window may be specified as one of the following:A fixed-width numeric scalar value. E.g. a) A
DURATION_DAYS
scalar, for use with aTIMESTAMP_DAYS
orderby column b) AnINT32
scalar, for use with anINT32
orderby column”unbounded”, indicating that the bounds stretch to the first/last row in the group.
”current row”, indicating that the bounds end at the first/last row in the group that match the value of the current row.
Public Types
-
enum class extent_type : int32_t#
The type of range_window_bounds.
Values:
-
enumerator CURRENT_ROW#
-
enumerator BOUNDED#
Bounds defined as the first/last row that matches the current row.
-
enumerator UNBOUNDED#
Bounds stretching to the first/last row in the entire group.
Bounds defined as the first/last row that falls within a specified range from the current row.
-
enumerator CURRENT_ROW#
Public Functions
-
inline bool is_current_row() const#
Whether or not the window is bounded to the current row.
- Returns:
true If window is bounded to the current row
- Returns:
false If window is not bounded to the current row
-
inline bool is_unbounded() const#
Whether or not the window is unbounded.
- Returns:
true If window is unbounded
- Returns:
false If window is of finite bounds
-
inline scalar const &range_scalar() const#
Returns the underlying scalar value for the bounds.
- Returns:
The underlying scalar value for the bounds
-
range_window_bounds(range_window_bounds const&) = default#
Copy constructor.
Public Static Functions
-
static range_window_bounds get(scalar const &boundary, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Factory method to construct a bounded window boundary.
- Parameters:
boundary – Finite window boundary
stream – CUDA stream used for device memory operations and kernel launches
- Returns:
A bounded window boundary object
-
static range_window_bounds current_row(data_type type, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Factory method to construct a window boundary limited to the value of the current row.
- Parameters:
type – The datatype of the window boundary
stream – CUDA stream used for device memory operations and kernel launches
- Returns:
A “current row” window boundary object
-
static range_window_bounds unbounded(data_type type, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Factory method to construct an unbounded window boundary.
- Parameters:
type – The datatype of the window boundary
stream – CUDA stream used for device memory operations and kernel launches
- Returns:
An unbounded window boundary object
-
struct window_bounds#
- #include <rolling.hpp>
Abstraction for window boundary sizes.
Public Functions
-
inline bool is_unbounded() const#
Whether the window_bounds is unbounded.
- Returns:
true if the window bounds is unbounded.
- Returns:
false if the window bounds has a finite row boundary.
-
inline size_type value() const#
Gets the row-boundary for this window_bounds.
- Returns:
the row boundary value (in days or rows)
Public Static Functions
-
static inline window_bounds get(size_type value)#
Construct bounded window boundary.
- Parameters:
value – Finite window boundary (in days or rows)
- Returns:
A window boundary
-
static inline window_bounds unbounded()#
Construct unbounded window boundary.
- Returns:
-
inline bool is_unbounded() const#
-
std::unique_ptr<column> rolling_window(column_view const &input, size_type preceding_window, size_type following_window, size_type min_periods, rolling_aggregation const &agg, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#