Aggregation Reduction
- group Reduction
Enums
Functions
-
std::unique_ptr<scalar> reduce(column_view const &col, reduce_aggregation const &agg, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Computes the reduction of the values in all rows of a column.
This function does not detect overflows in reductions except for the SUM_WITH_OVERFLOW aggregation. When output_type does not match col.type(), the values may be promoted to int64_t or double for computing the aggregation and are then cast to output_type before returning.
The SUM_WITH_OVERFLOW aggregation is a special case that detects integer overflow during summation of int64_t values and returns a struct containing both the sum result and an overflow flag.
Only min and max ops are supported for reduction of non-arithmetic types (e.g. timestamp or string).
Any null values are skipped for the operation. If the reduction fails, the output scalar is returned with is_valid()==false.
For empty or all-null input, the result is generally an invalid scalar, except for specific aggregations that have a well-defined output for such input.
If the input column is an arithmetic type, the output_type can be any arithmetic type. If the input column is a non-arithmetic type (e.g. timestamp or string), the output_type must match col.type(). If the reduction type is any or all, the output_type must be type BOOL8.

| Aggregation | Output Type | Init Value | Empty Input | Comments |
|---|---|---|---|---|
| SUM/PRODUCT | output_type | yes | NA | Input accumulated into output_type variable |
| SUM_WITH_OVERFLOW | STRUCT{INT64,BOOL8} | yes | {null,false} | {sum, overflow_flag}, input must be INT64 |
| SUM_OF_SQUARES | output_type | no | NA | Input accumulated into output_type variable |
| MIN/MAX | col.type | yes | NA | Supports arithmetic, timestamp, duration, string types only |
| ANY/ALL | BOOL8 | yes | True for ALL only | Checks for non-zero elements |
| MEAN/VARIANCE/STD | FLOAT32/FLOAT64 | no | NA | output_type must be a float type |
| MEDIAN/QUANTILE | output_type | no | NA | Exact value if output_type is FLOAT64. See cudf::quantile |
| NUNIQUE | output_type | no | 1 if all-nulls | May process null rows |
| NTH_ELEMENT | col.type | no | NA | |
| BITWISE_AGG | col.type | no | NA | Supports only integral types |
| HISTOGRAM/MERGE_HISTOGRAM | LIST of col.type | no | empty list returned | |
| COLLECT_LIST/COLLECT_SET | LIST of col.type | no | empty list returned | |
| TDIGEST/MERGE_TDIGEST | STRUCT | no | empty struct returned | tdigest scalar is returned |
| HOST_UDF | output_type | yes | NA | Custom UDF could ignore output_type |

The NA in the table indicates an output scalar with is_valid()==false.
- Throws:
std::invalid_argument – if reduction is called for a non-arithmetic output type and an operator other than min and max.
std::invalid_argument – if the input column data type is not convertible to output_type.
std::invalid_argument – if a min or max reduction is called and the output type does not match the input column data type.
std::invalid_argument – if an any or all reduction is called and the output type is not BOOL8.
std::invalid_argument – if a mean, var, or std reduction is called and output_type is not floating point.
std::invalid_argument – if a sum_with_overflow reduction is called and the input column type is not INT64 or output_type is not STRUCT.
- Parameters:
col – Input column view
agg – Aggregation operator applied by the reduction
output_type – The output scalar type
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned scalar’s device memory
- Returns:
Output scalar with reduce result
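The SUM_WITH_OVERFLOW behaviour described above (nulls skipped, a {sum, overflow_flag} result, and {null,false} for empty or all-null input) can be modelled on the host with the standard library alone. The sketch below is not libcudf code and does not call cudf::reduce. The names sum_with_overflow_result and sum_with_overflow are hypothetical, null rows are modelled as std::nullopt, and the value left in sum after an overflow is an assumption of this sketch, since the text does not say what libcudf stores there.

```cpp
#include <cstdint>
#include <limits>
#include <optional>
#include <vector>

// Host-only model of the {sum, overflow_flag} result; not a libcudf type.
struct sum_with_overflow_result {
  std::optional<std::int64_t> sum;  // std::nullopt models an invalid (null) sum
  bool overflow;
};

// Sums the valid (non-null) rows and reports whether the int64_t sum overflowed.
inline sum_with_overflow_result sum_with_overflow(
  std::vector<std::optional<std::int64_t>> const& rows)
{
  constexpr auto max = std::numeric_limits<std::int64_t>::max();
  constexpr auto min = std::numeric_limits<std::int64_t>::min();

  // Empty or all-null input keeps this initial state: {null, false}.
  sum_with_overflow_result result{std::nullopt, false};
  for (auto const& row : rows) {
    if (!row.has_value()) { continue; }  // null rows are skipped
    std::int64_t const acc = result.sum.value_or(0);
    std::int64_t const v   = *row;
    // Test before adding: signed overflow is undefined behaviour in C++.
    bool const would_overflow = (v > 0 && acc > max - v) || (v < 0 && acc < min - v);
    if (would_overflow) {
      // Assumption of this sketch: keep the last partial sum that did not overflow.
      result.sum      = acc;
      result.overflow = true;
      break;
    }
    result.sum = acc + v;
  }
  return result;
}
```

With the rows {1, null, 2} the sketch returns a sum of 3 with the flag cleared. With {INT64_MAX, 1} the flag is set.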
-
std::unique_ptr<scalar> reduce(column_view const &col, reduce_aggregation const &agg, data_type output_type, std::optional<std::reference_wrapper<scalar const>> init, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Computes the reduction of the values in all rows of a column with an initial value.
Only sum, product, min, max, any, all, and sum_with_overflow reductions are supported. For sum_with_overflow, the initial value is added to the sum and overflow detection is performed throughout the entire computation.
See also cudf::reduce(column_view const&, reduce_aggregation const&, data_type, rmm::cuda_stream_view, rmm::device_async_resource_ref) for more details.
- Throws:
std::invalid_argument – if the reduction is not sum, product, min, max, any, all, or sum_with_overflow and init is specified.
- Parameters:
col – Input column view
agg – Aggregation operator applied by the reduction
output_type – The output scalar type
init – The initial value of the reduction
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned scalar’s device memory
- Returns:
Output scalar with reduce result
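To illustrate what an initial value does to a reduction, here is a host-only sketch that folds the valid rows into an accumulator seeded from init. It is a model of the idea, not libcudf code. The name reduce_with_init is hypothetical, rows are plain int values with std::nullopt standing in for nulls, and returning an invalid result when there are no valid rows (even if init is given) is an assumption of this sketch that the text above does not confirm.

```cpp
#include <functional>
#include <optional>
#include <vector>

// Folds the valid rows with `op`, starting from `init` when one is supplied.
inline std::optional<int> reduce_with_init(std::vector<std::optional<int>> const& rows,
                                           std::function<int(int, int)> const& op,
                                           std::optional<int> init = std::nullopt)
{
  std::optional<int> acc = init;  // the accumulator is seeded from init, if any
  bool saw_valid_row     = false;
  for (auto const& row : rows) {
    if (!row.has_value()) { continue; }  // null rows are skipped
    saw_valid_row = true;
    acc           = acc.has_value() ? op(*acc, *row) : *row;
  }
  // Assumption of this sketch: no valid rows means an invalid result.
  if (!saw_valid_row) { return std::nullopt; }
  return acc;
}
```

For the rows {3, null, 5}, a sum with init 10 gives 18, and a min with init 1 gives 1 because the initial value takes part in the comparison like any other element.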
-
std::unique_ptr<column> segmented_reduce(column_view const &segmented_values, device_span<size_type const> offsets, segmented_reduce_aggregation const &agg, data_type output_type, null_policy null_handling, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Compute reduction of each segment in the input column.
This function does not detect overflows in reductions. When output_type does not match segmented_values.type(), the values may be promoted to int64_t or double for computing the aggregation and are then cast to output_type before returning.
Null values are treated as identities during reduction.
If the segment is empty, the row corresponding to the result of the segment is null.
If any index in offsets is out of bounds of segmented_values, the behavior is undefined.
If the input column has arithmetic type, output_type can be any arithmetic type. If the input column has non-arithmetic type, e.g. timestamp, the same output type must be specified.
If input is not empty, the result is always nullable.
- Throws:
cudf::logic_error – if reduction is called for a non-arithmetic output type and an operator other than min and max.
cudf::logic_error – if the input column data type is not convertible to output_type.
cudf::logic_error – if a min or max reduction is called and the output_type does not match the input column data type.
cudf::logic_error – if an any or all reduction is called and the output_type is not BOOL8.
- Parameters:
segmented_values – Column view of segmented inputs
offsets – Each segment’s offset into segmented_values. A list of offsets with size num_segments + 1. The size of the i-th segment is offsets[i+1] - offsets[i].
agg – Aggregation operator applied by the reduction
output_type – The output column type
null_handling – If INCLUDE, the reduction is valid if all elements in a segment are valid, otherwise null. If EXCLUDE, the reduction is valid if any element in the segment is valid, otherwise null.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Output column with results of segmented reduction
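The offsets layout and the two null_handling modes are easier to see on a small input. The following host-only sketch models a segmented sum with the standard library. It is not libcudf code: segmented_sum is a hypothetical name, the null_policy enum here only mirrors the names used in the text, and nulls are modelled as std::nullopt.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Host-only stand-in that mirrors the policy names used in the text.
enum class null_policy { EXCLUDE, INCLUDE };

// Sums each segment [offsets[i], offsets[i+1]) of `values`.
inline std::vector<std::optional<long long>> segmented_sum(
  std::vector<std::optional<int>> const& values,
  std::vector<std::size_t> const& offsets,
  null_policy null_handling)
{
  assert(!offsets.empty());  // offsets holds num_segments + 1 entries
  std::vector<std::optional<long long>> result;
  for (std::size_t seg = 0; seg + 1 < offsets.size(); ++seg) {
    std::size_t const begin = offsets[seg];
    std::size_t const end   = offsets[seg + 1];
    assert(begin <= end && end <= values.size());  // out-of-bounds offsets are not handled

    long long sum           = 0;
    std::size_t valid_count = 0;
    for (std::size_t i = begin; i < end; ++i) {
      if (values[i].has_value()) {  // nulls act as the identity for the sum
        sum += *values[i];
        ++valid_count;
      }
    }

    std::size_t const size = end - begin;
    bool is_valid          = false;
    if (size > 0) {  // an empty segment always yields null
      is_valid = (null_handling == null_policy::INCLUDE) ? (valid_count == size)
                                                         : (valid_count > 0);
    }
    if (is_valid) {
      result.push_back(sum);
    } else {
      result.push_back(std::nullopt);
    }
  }
  return result;
}
```

For the values {1, 2, null, 4, 5} and offsets {0, 2, 4, 4, 5}, the sketch yields {3, 4, null, 5} under EXCLUDE and {3, null, null, 5} under INCLUDE. The third segment is empty, so it is null in both cases.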
-
std::unique_ptr<column> segmented_reduce(column_view const &segmented_values, device_span<size_type const> offsets, segmented_reduce_aggregation const &agg, data_type output_type, null_policy null_handling, std::optional<std::reference_wrapper<scalar const>> init, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Compute reduction of each segment in the input column with an initial value. Only SUM, PRODUCT, MIN, MAX, ANY, and ALL aggregations are supported.
- Parameters:
segmented_values – Column view of segmented inputs
offsets – Each segment’s offset into segmented_values. A list of offsets with size num_segments + 1. The size of the i-th segment is offsets[i+1] - offsets[i].
agg – Aggregation operator applied by the reduction
output_type – The output column type
null_handling – If INCLUDE, the reduction is valid if all elements in a segment are valid, otherwise null. If EXCLUDE, the reduction is valid if any element in the segment is valid, otherwise null.
init – The initial value of the reduction
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Output column with results of segmented reduction.
-
std::unique_ptr<column> scan(column_view const &input, scan_aggregation const &agg, scan_type inclusive, null_policy null_handling = null_policy::EXCLUDE, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Computes the scan of a column.
The null values are skipped for the operation, and if an input element at i is null, then the output element at i will also be null.
- Throws:
cudf::logic_error – if the column data type is not a numeric type.
- Parameters:
input – [in] The input column view for the scan
agg – [in] Aggregation operator applied by the scan
inclusive – [in] Applies an inclusive scan if scan_type::INCLUSIVE, or an exclusive scan if scan_type::EXCLUSIVE.
null_handling – [in] Exclude null values when computing the result if null_policy::EXCLUDE. Include nulls if null_policy::INCLUDE. Any operation with a null results in a null.
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
Scanned output column
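The interaction between the scan type and the null policy can be sketched on the host as a running sum. This is a model of the description above, not libcudf code. scan_sum is a hypothetical name, the two enums only mirror the names in the text, and nulls are modelled as std::nullopt. Under INCLUDE the sketch makes every output from the first null onward null. For inclusive scans that follows from "any operation with a null results in a null". Applying the same rule to exclusive scans is an assumption of this sketch.

```cpp
#include <optional>
#include <vector>

// Host-only stand-ins that mirror the names used in the text.
enum class scan_type { INCLUSIVE, EXCLUSIVE };
enum class null_policy { EXCLUDE, INCLUDE };

// Running sum over `input`, with nulls modelled as std::nullopt.
inline std::vector<std::optional<int>> scan_sum(
  std::vector<std::optional<int>> const& input,
  scan_type type,
  null_policy null_handling = null_policy::EXCLUDE)
{
  std::vector<std::optional<int>> output;
  output.reserve(input.size());
  int running     = 0;      // identity for the sum
  bool seen_null  = false;  // only consulted under null_policy::INCLUDE
  for (auto const& row : input) {
    if (!row.has_value()) {
      output.push_back(std::nullopt);  // a null input row gives a null output row
      seen_null = true;
      continue;
    }
    if (null_handling == null_policy::INCLUDE && seen_null) {
      output.push_back(std::nullopt);  // an earlier null makes later results null
      continue;
    }
    if (type == scan_type::INCLUSIVE) {
      running += *row;  // include the current row in its own output
      output.push_back(running);
    } else {
      output.push_back(running);  // output covers only the rows before this one
      running += *row;
    }
  }
  return output;
}
```

For the input {1, null, 2, 3}, the sketch gives {1, null, 3, 6} for an inclusive scan and {0, null, 1, 3} for an exclusive scan under EXCLUDE, and {1, null, null, null} for an inclusive scan under INCLUDE.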
-
std::pair<std::unique_ptr<scalar>, std::unique_ptr<scalar>> minmax(column_view const &col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Determines the minimum and maximum values of a column.
- Parameters:
col – column to compute minmax
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned scalars’ device memory
- Returns:
A std::pair of scalars with the first scalar being the minimum value and the second scalar being the maximum value of the input column.