Aggregation Factories

Files

file  aggregation.hpp
 Representation for specifying desired aggregations from aggregation-based APIs, e.g., groupby, reductions, rolling, etc.
 

Classes

class  cudf::aggregation
 Abstract base class for specifying the desired aggregation in an aggregation_request. More...
 
class  cudf::rolling_aggregation
 Derived class intended for rolling_window specific aggregation usage. More...
 
class  cudf::groupby_aggregation
 Derived class intended for groupby specific aggregation usage. More...
 
class  cudf::groupby_scan_aggregation
 Derived class intended for groupby specific scan usage. More...
 
class  cudf::reduce_aggregation
 Derived class intended for reduction usage. More...
 
class  cudf::scan_aggregation
 Derived class intended for scan usage. More...
 
class  cudf::segmented_reduce_aggregation
 Derived class intended for segmented reduction usage. More...
 

Enumerations

enum  udf_type : bool { CUDA, PTX }
 
enum  correlation_type : int32_t { PEARSON, KENDALL, SPEARMAN }
 

Functions

template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_sum_aggregation ()
 Factory to create a SUM aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_product_aggregation ()
 Factory to create a PRODUCT aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_min_aggregation ()
 Factory to create a MIN aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_max_aggregation ()
 Factory to create a MAX aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_count_aggregation (null_policy null_handling=null_policy::EXCLUDE)
 Factory to create a COUNT aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_any_aggregation ()
 Factory to create an ANY aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_all_aggregation ()
 Factory to create an ALL aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_sum_of_squares_aggregation ()
 Factory to create a SUM_OF_SQUARES aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_mean_aggregation ()
 Factory to create a MEAN aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_m2_aggregation ()
 Factory to create an M2 aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_variance_aggregation (size_type ddof=1)
 Factory to create a VARIANCE aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_std_aggregation (size_type ddof=1)
 Factory to create a STD aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_median_aggregation ()
 Factory to create a MEDIAN aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_quantile_aggregation (std::vector< double > const &quantiles, interpolation interp=interpolation::LINEAR)
 Factory to create a QUANTILE aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_argmax_aggregation ()
 Factory to create an argmax aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_argmin_aggregation ()
 Factory to create an argmin aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_nunique_aggregation (null_policy null_handling=null_policy::EXCLUDE)
 Factory to create a nunique aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_nth_element_aggregation (size_type n, null_policy null_handling=null_policy::INCLUDE)
 Factory to create an nth_element aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_row_number_aggregation ()
 Factory to create a ROW_NUMBER aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_rank_aggregation ()
 Factory to create a RANK aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_dense_rank_aggregation ()
 Factory to create a DENSE_RANK aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_percent_rank_aggregation ()
 Factory to create a PERCENT_RANK aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_collect_list_aggregation (null_policy null_handling=null_policy::INCLUDE)
 Factory to create a COLLECT_LIST aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_collect_set_aggregation (null_policy null_handling=null_policy::INCLUDE, null_equality nulls_equal=null_equality::EQUAL, nan_equality nans_equal=nan_equality::UNEQUAL)
 Factory to create a COLLECT_SET aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_lag_aggregation (size_type offset)
 Factory to create a LAG aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_lead_aggregation (size_type offset)
 Factory to create a LEAD aggregation.
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_udf_aggregation (udf_type type, std::string const &user_defined_aggregator, data_type output_type)
 Factory to create an aggregation based on a UDF for PTX or CUDA. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_merge_lists_aggregation ()
 Factory to create a MERGE_LISTS aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_merge_sets_aggregation (null_equality nulls_equal=null_equality::EQUAL, nan_equality nans_equal=nan_equality::UNEQUAL)
 Factory to create a MERGE_SETS aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_merge_m2_aggregation ()
 Factory to create a MERGE_M2 aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_covariance_aggregation (size_type min_periods=1, size_type ddof=1)
 Factory to create a COVARIANCE aggregation. More...
 
template<typename Base = aggregation>
std::unique_ptr< Base > cudf::make_correlation_aggregation (correlation_type type, size_type min_periods=1)
 Factory to create a CORRELATION aggregation. More...
 
template<typename Base >
std::unique_ptr< Base > cudf::make_tdigest_aggregation (int max_centroids=1000)
 Factory to create a TDIGEST aggregation. More...
 
template<typename Base >
std::unique_ptr< Base > cudf::make_merge_tdigest_aggregation (int max_centroids=1000)
 Factory to create a MERGE_TDIGEST aggregation. More...
 

Detailed Description

Function Documentation

◆ make_argmax_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_argmax_aggregation ( )

Factory to create an argmax aggregation.

argmax returns the index of the maximum element.

◆ make_argmin_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_argmin_aggregation ( )

Factory to create an argmin aggregation.

argmin returns the index of the minimum element.
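
The index semantics of argmax and argmin can be illustrated with a small host-side sketch. The helper names below are hypothetical and are not part of cudf; resolving ties to the first occurrence is an assumption of this sketch, not documented behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers for illustration only; not cudf APIs.
// Ties resolve to the first occurrence (an assumption of this sketch).
std::size_t argmax_index(std::vector<int> const& values)
{
  assert(!values.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < values.size(); ++i) {
    if (values[i] > values[best]) { best = i; }
  }
  return best;
}

std::size_t argmin_index(std::vector<int> const& values)
{
  assert(!values.empty());
  std::size_t best = 0;
  for (std::size_t i = 1; i < values.size(); ++i) {
    if (values[i] < values[best]) { best = i; }
  }
  return best;
}
```

For the values {3, 9, 1, 4}, the maximum 9 sits at index 1 and the minimum 1 at index 2.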

◆ make_collect_list_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_collect_list_aggregation ( null_policy  null_handling = null_policy::INCLUDE)

Factory to create a COLLECT_LIST aggregation.

COLLECT_LIST returns a list column of all included elements in the group/series.

If null_handling is set to EXCLUDE, null elements are dropped from each of the list rows.

Parameters
null_handling  Indicates whether to include/exclude nulls in list elements.
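
The effect of null_handling on a single group can be sketched on the host as follows. The types null_policy_sketch and nullable_int are hypothetical stand-ins introduced for illustration; they are not the cudf types, and the real aggregation operates on device columns.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-ins for illustration; these are not the cudf types.
enum class null_policy_sketch { EXCLUDE, INCLUDE };
struct nullable_int {
  int value;   // meaningful only when valid is true
  bool valid;  // false marks a null element
};

// Collects one group's elements into a list, dropping nulls under EXCLUDE.
std::vector<nullable_int> collect_list_sketch(std::vector<nullable_int> const& group,
                                              null_policy_sketch null_handling)
{
  std::vector<nullable_int> list;
  for (nullable_int const& element : group) {
    if (element.valid || null_handling == null_policy_sketch::INCLUDE) {
      list.push_back(element);
    }
  }
  return list;
}
```

With the group {1, null, 3}, INCLUDE keeps all three entries while EXCLUDE yields {1, 3}.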

◆ make_collect_set_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_collect_set_aggregation ( null_policy  null_handling = null_policy::INCLUDE,
null_equality  nulls_equal = null_equality::EQUAL,
nan_equality  nans_equal = nan_equality::UNEQUAL 
)

Factory to create a COLLECT_SET aggregation.

COLLECT_SET returns a lists column of all included elements in the group/series. Within each list, duplicate entries are dropped so that each entry appears only once.

If null_handling is set to EXCLUDE, null elements are dropped from each of the list rows.

Parameters
null_handling  Indicates whether to include/exclude nulls during collection.
nulls_equal  Flag to specify whether null entries within each list should be considered equal.
nans_equal  Flag to specify whether NaN values in a floating point column should be considered equal.
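
How nans_equal changes de-duplication can be sketched on the host for a single group of doubles. The function name is hypothetical, nulls are left out for brevity, and the output order shown here (first occurrence) is an assumption of the sketch rather than a guarantee of the aggregation.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Keeps the first occurrence of each distinct value in the group.
std::vector<double> collect_set_sketch(std::vector<double> const& group, bool nans_equal)
{
  std::vector<double> set;
  for (double value : group) {
    bool duplicate = false;
    for (double seen : set) {
      bool const both_nan = std::isnan(value) && std::isnan(seen);
      // NaN never compares equal with ==, so NaNs only collapse when requested.
      if (value == seen || (nans_equal && both_nan)) {
        duplicate = true;
        break;
      }
    }
    if (!duplicate) { set.push_back(value); }
  }
  return set;
}
```

For the group {1.0, NaN, 1.0, NaN, 2.0}, treating NaNs as unequal keeps both NaN entries (four elements), while treating them as equal keeps one (three elements).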

◆ make_correlation_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_correlation_aggregation ( correlation_type  type,
size_type  min_periods = 1 
)

Factory to create a CORRELATION aggregation.

Computes the correlation coefficient between two columns. The input columns are child columns of a non-nullable struct column.

Parameters
type  correlation_type
min_periods  Minimum number of non-null observations required to produce a result.
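
For the PEARSON case, the coefficient is the textbook ratio of the co-moment to the geometric mean of the two second moments. The host-side sketch below shows that formula only; the function name is hypothetical, null handling and min_periods are omitted, and nothing here is claimed about how cudf computes the value internally.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Returns NaN if either input is constant (division by zero).
double pearson_sketch(std::vector<double> const& x, std::vector<double> const& y)
{
  assert(x.size() == y.size() && x.size() >= 2);
  double const n = static_cast<double>(x.size());
  double mean_x  = 0.0;
  double mean_y  = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    mean_x += x[i];
    mean_y += y[i];
  }
  mean_x /= n;
  mean_y /= n;

  double sxy = 0.0;  // sum of products of deviations
  double sxx = 0.0;  // sum of squared deviations of x
  double syy = 0.0;  // sum of squared deviations of y
  for (std::size_t i = 0; i < x.size(); ++i) {
    sxy += (x[i] - mean_x) * (y[i] - mean_y);
    sxx += (x[i] - mean_x) * (x[i] - mean_x);
    syy += (y[i] - mean_y) * (y[i] - mean_y);
  }
  return sxy / std::sqrt(sxx * syy);
}
```

Perfectly linear inputs such as x = {1, 2, 3, 4} and y = {2, 4, 6, 8} give a coefficient of 1; reversing y gives -1.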

◆ make_count_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_count_aggregation ( null_policy  null_handling = null_policy::EXCLUDE)

Factory to create a COUNT aggregation.

Parameters
null_handling  Indicates if null values will be counted.

◆ make_covariance_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_covariance_aggregation ( size_type  min_periods = 1,
size_type  ddof = 1 
)

Factory to create a COVARIANCE aggregation.

Computes the covariance between two columns. The input columns are child columns of a non-nullable struct column.

Parameters
min_periods  Minimum number of non-null observations required to produce a result.
ddof  Delta Degrees of Freedom. The divisor used in calculations is N - ddof, where N is the number of non-null observations.
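
The roles of min_periods and ddof can be sketched on the host for two null-free inputs. The names below are hypothetical; returning a null when N - ddof is not positive is an assumption of this sketch, not documented behaviour.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; valid == false models a null result.
struct nullable_double {
  double value;
  bool valid;
};

// Hypothetical helper for illustration only; not a cudf API.
nullable_double covariance_sketch(std::vector<double> const& x,
                                  std::vector<double> const& y,
                                  int min_periods = 1,
                                  int ddof        = 1)
{
  assert(x.size() == y.size());
  int const n = static_cast<int>(x.size());
  // Too few observations, or a non-positive divisor (assumed to give null here).
  if (n < min_periods || n - ddof <= 0) { return {0.0, false}; }

  double mean_x = 0.0;
  double mean_y = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) {
    mean_x += x[i];
    mean_y += y[i];
  }
  mean_x /= n;
  mean_y /= n;

  double sum = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i) { sum += (x[i] - mean_x) * (y[i] - mean_y); }
  return {sum / (n - ddof), true};
}
```

For x = {1, 2, 3, 4} and y = {2, 4, 6, 8}, the sum of products of deviations is 10, so the default ddof = 1 gives 10 / 3, and min_periods = 5 gives a null result because only four observations are available.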

◆ make_dense_rank_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_dense_rank_aggregation ( )

Factory to create a DENSE_RANK aggregation.

DENSE_RANK returns a non-nullable column of size_type "dense ranks": the preceding unique value's rank plus one. As a result, ranks are not unique but there are no gaps in the ranking sequence (unlike RANK aggregations).

This aggregation only works with "scan" algorithms. The input column to the grouped or ungrouped scan is an orderby column that orders the rows that the aggregate function ranks. If rows are ordered by more than one column, the orderby input column should be a struct column containing the ordering columns.

Note:

  1. This method requires that the rows are presorted by the group keys and order_by columns.
  2. DENSE_RANK aggregations will return a fully valid column regardless of null_handling policy specified in the scan.
  3. DENSE_RANK aggregations are not compatible with exclusive scans.
Example: Consider a motor-racing statistics dataset, containing the following columns:
1. venue: (STRING) Location of the race event
2. driver: (STRING) Name of the car driver (abbreviated to 3 characters)
3. time: (INT32) Time taken to complete the circuit
For the following presorted data:
[ // venue, driver, time
{ "silverstone", "HAM" ("hamilton"), 15823},
{ "silverstone", "LEC" ("leclerc"), 15827},
{ "silverstone", "BOT" ("bottas"), 15834}, // <-- Tied for 3rd place.
{ "silverstone", "NOR" ("norris"), 15834}, // <-- Tied for 3rd place.
{ "silverstone", "RIC" ("ricciardo"), 15905},
{ "monza", "RIC" ("ricciardo"), 12154},
{ "monza", "NOR" ("norris"), 12156}, // <-- Tied for 2nd place.
{ "monza", "BOT" ("bottas"), 12156}, // <-- Tied for 2nd place.
{ "monza", "LEC" ("leclerc"), 12201},
{ "monza", "PER" ("perez"), 12203}
]
A grouped dense rank aggregation scan with:
groupby column : venue
input orderby column: time
Produces the following dense rank column:
{ 1, 2, 3, 3, 4, 1, 2, 2, 3, 4}
(This corresponds to the following grouping and `driver` rows:)
{ "HAM", "LEC", "BOT", "NOR", "RIC", "RIC", "NOR", "BOT", "LEC", "PER" }
<----------silverstone----------->|<-------------monza-------------->
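
The dense rank column in the example above can be reproduced with a short host-side sketch over presorted rows. The function name is hypothetical and this is only an illustration of the ranking rule, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Rows must be presorted by group key and then by the orderby value.
std::vector<int> dense_rank_scan(std::vector<std::string> const& keys,
                                 std::vector<int> const& order_by)
{
  assert(keys.size() == order_by.size());
  std::vector<int> ranks(keys.size());
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (i == 0 || keys[i] != keys[i - 1]) {
      ranks[i] = 1;  // first row of a new group
    } else if (order_by[i] == order_by[i - 1]) {
      ranks[i] = ranks[i - 1];  // tie: share the previous rank
    } else {
      ranks[i] = ranks[i - 1] + 1;  // next distinct value: no gap
    }
  }
  return ranks;
}
```

Passing the venue and time columns of the example returns { 1, 2, 3, 3, 4, 1, 2, 2, 3, 4 }.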

◆ make_m2_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_m2_aggregation ( )

Factory to create an M2 aggregation.

An M2 aggregation is the sum of squares of differences from the mean. That is: M2 = SUM((x - MEAN) * (x - MEAN)).

This aggregation produces the intermediate values that are used to compute variance and standard deviation across multiple discrete sets. See https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Parallel_algorithm for more detail.
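
The formula above can be checked with a minimal two-pass host-side sketch. The function name is hypothetical, and the two-pass approach is chosen here for clarity; it is not a statement about how the library computes M2.

```cpp
#include <cassert>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Two passes: first the mean, then the sum of squared differences from it.
double m2_sketch(std::vector<double> const& x)
{
  assert(!x.empty());
  double mean = 0.0;
  for (double v : x) { mean += v; }
  mean /= static_cast<double>(x.size());

  double m2 = 0.0;
  for (double v : x) { m2 += (v - mean) * (v - mean); }
  return m2;
}
```

For {2, 4, 4, 4, 5, 5, 7, 9} the mean is 5 and M2 is 32; dividing by N - ddof then gives the variance (32 / 8 = 4 for ddof = 0).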

◆ make_merge_lists_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_merge_lists_aggregation ( )

Factory to create a MERGE_LISTS aggregation.

Given a lists column, this aggregation merges all the lists corresponding to the same key value into one list. It is designed specifically to merge the partial results of multiple (distributed) groupby COLLECT_LIST aggregations into a final COLLECT_LIST result. As such, it requires the input lists column to be non-nullable (the child column containing list entries is not subject to this requirement).
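
The merge step can be pictured on the host as concatenating, per key, the partial lists produced by separate workers. The names keyed_list and merge_lists_sketch are hypothetical, and the use of std::map (which orders keys) is a convenience of this sketch only.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// One row of a partial COLLECT_LIST result: (key, list). Hypothetical type.
typedef std::pair<std::string, std::vector<int> > keyed_list;

// Hypothetical helper for illustration only; not a cudf API.
// Concatenates all partial lists that share a key, in input order.
std::map<std::string, std::vector<int> > merge_lists_sketch(
  std::vector<keyed_list> const& partial_results)
{
  std::map<std::string, std::vector<int> > merged;
  for (keyed_list const& row : partial_results) {
    std::vector<int>& target = merged[row.first];
    target.insert(target.end(), row.second.begin(), row.second.end());
  }
  return merged;
}
```

If one worker produced ("a", [1, 2]) and ("b", [3]) and another produced ("a", [4]), the merged result holds [1, 2, 4] for "a" and [3] for "b".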

◆ make_merge_m2_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_merge_m2_aggregation ( )

Factory to create a MERGE_M2 aggregation.

Merges the results of M2 aggregations on independent sets into a new M2 value equivalent to the result of a single M2 aggregation over all of the sets at once. This aggregation is only valid on structs whose members are the result of the COUNT_VALID, MEAN, and M2 aggregations on the same sets. The output of this aggregation is a struct containing the merged COUNT_VALID, MEAN, and M2 aggregations.

The input M2 aggregation values are expected to be all non-negative numbers, since they were output from M2 aggregation.
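
Merging two (COUNT_VALID, MEAN, M2) triples can be sketched with the pairwise update from the parallel variance algorithm referenced in the M2 documentation. The struct and function names are hypothetical; this is an illustration of that formula, not the library's implementation.

```cpp
#include <cassert>

// Hypothetical mirror of the (COUNT_VALID, MEAN, M2) struct; not a cudf type.
struct m2_partial {
  long count;
  double mean;
  double m2;
};

// Hypothetical helper for illustration only; not a cudf API.
// Pairwise merge of two partial results computed on disjoint sets.
m2_partial merge_m2_sketch(m2_partial const& a, m2_partial const& b)
{
  assert(a.count >= 0 && b.count >= 0);
  if (a.count == 0) { return b; }
  if (b.count == 0) { return a; }
  long const n       = a.count + b.count;
  double const delta = b.mean - a.mean;
  double const mean  = a.mean + delta * b.count / n;
  double const m2    = a.m2 + b.m2 + delta * delta * a.count * b.count / n;
  return {n, mean, m2};
}
```

Splitting {2, 4, 4, 4, 5, 5, 7, 9} into {2, 4, 4, 4} (count 4, mean 3.5, M2 3) and {5, 5, 7, 9} (count 4, mean 6.5, M2 11) and merging gives count 8, mean 5, M2 32, the same as a single pass over all eight values.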

◆ make_merge_sets_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_merge_sets_aggregation ( null_equality  nulls_equal = null_equality::EQUAL,
nan_equality  nans_equal = nan_equality::UNEQUAL 
)

Factory to create a MERGE_SETS aggregation.

Given a lists column, this aggregation first merges all the lists corresponding to the same key value into one list, then drops all duplicate entries in each list, producing a lists column containing non-repeated entries.

This aggregation is designed specifically to merge the partial results of multiple (distributed) groupby COLLECT_LIST or COLLECT_SET aggregations into a final COLLECT_SET result. As such, it requires the input lists column to be non-nullable (the child column containing list entries is not subject to this requirement).

In practice, the input (partial results) to this aggregation should be generated by (distributed) COLLECT_LIST aggregations, not COLLECT_SET, to avoid unnecessarily removing duplicate entries for the partial results.

Parameters
nulls_equal  Flag to specify whether nulls within each list should be considered equal when dropping duplicate list entries.
nans_equal  Flag to specify whether NaN values in a floating point column should be considered equal when dropping duplicate list entries.

◆ make_merge_tdigest_aggregation()

template<typename Base >
std::unique_ptr<Base> cudf::make_merge_tdigest_aggregation ( int  max_centroids = 1000)

Factory to create a MERGE_TDIGEST aggregation.

Merges the results of previous make_tdigest_aggregation or make_merge_tdigest_aggregation aggregations to produce a new tdigest (https://arxiv.org/pdf/1902.04023.pdf) column.

The tdigest column produced is of the following structure:

struct {
  // centroids for the digest
  list {
    struct {
      double    // mean
      double    // weight
    },
    ...
  }
  // these are from the input stream, not the centroids. they are used
  // during the percentile_approx computation near the beginning or
  // end of the quantiles
  double       // min
  double       // max
}

Each output row is a single tdigest. The length of the row is the "size" of the tdigest, each element of which represents a weighted centroid (mean, weight).

Parameters
max_centroids  Parameter controlling compression level and accuracy on subsequent queries on the output tdigest data. max_centroids places an upper bound on the size of the computed tdigests: a value of 1000 will result in a tdigest containing no more than 1000 centroids (32 bytes each). Higher values result in more accurate tdigest information.
Returns
A MERGE_TDIGEST aggregation object.

◆ make_nth_element_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_nth_element_aggregation ( size_type  n,
null_policy  null_handling = null_policy::INCLUDE 
)

Factory to create an nth_element aggregation.

nth_element returns the n'th element of the group/series.

If n is not within the range [-group_size, group_size), the result of the respective group will be null. Negative indices [-group_size, -1] correspond to indices [0, group_size-1] respectively, where group_size is the size of each group.

Parameters
n  Index of the nth element in each group.
null_handling  Indicates whether to include or exclude nulls during indexing.
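
The index rule described above (negative indices count from the end, out-of-range indices give null) can be sketched on the host for a single null-free group. The names below are hypothetical stand-ins, not cudf types or functions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result type for illustration; valid == false models a null result.
struct nullable_element {
  int value;
  bool valid;
};

// Hypothetical helper for illustration only; not a cudf API.
nullable_element nth_element_sketch(std::vector<int> const& group, int n)
{
  int const group_size = static_cast<int>(group.size());
  // Outside [-group_size, group_size): the group's result is null.
  if (n < -group_size || n >= group_size) { return {0, false}; }
  // Map [-group_size, -1] onto [0, group_size - 1].
  int const index = (n < 0) ? n + group_size : n;
  return {group[static_cast<std::size_t>(index)], true};
}
```

For the group {10, 20, 30}, n = 0 and n = -3 both select 10, n = -1 selects 30, and n = 3 or n = -4 give a null result.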

◆ make_nunique_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_nunique_aggregation ( null_policy  null_handling = null_policy::EXCLUDE)

Factory to create a nunique aggregation.

nunique returns the number of unique elements.

Parameters
null_handling  Indicates if null values will be counted.

◆ make_percent_rank_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_percent_rank_aggregation ( )

Factory to create a PERCENT_RANK aggregation.

PERCENT_RANK returns a non-nullable column of double precision "fractional" ranks. For row index i, the percent rank of row i is defined as: percent_rank = (rank - 1) / (group_row_count - 1) where,

  1. rank is the RANK of the row within the group
  2. group_row_count is the number of rows in the group

This aggregation only works with "scan" algorithms. The input to the grouped or ungrouped scan is an orderby column that orders the rows that the aggregate function ranks. If rows are ordered by more than one column, the orderby input column should be a struct column containing the ordering columns.

Note:

  1. This method requires that the rows are presorted by the group keys and order_by columns.
  2. PERCENT_RANK aggregations will return a fully valid column regardless of null_handling policy specified in the scan.
  3. PERCENT_RANK aggregations are not compatible with exclusive scans.
Example: Consider a motor-racing statistics dataset, containing the following columns:
1. venue: (STRING) Location of the race event
2. driver: (STRING) Name of the car driver (abbreviated to 3 characters)
3. time: (INT32) Time taken to complete the circuit
For the following presorted data:
[ // venue, driver, time
{ "silverstone", "HAM" ("hamilton"), 15823},
{ "silverstone", "LEC" ("leclerc"), 15827},
{ "silverstone", "BOT" ("bottas"), 15834}, // <-- Tied for 3rd place.
{ "silverstone", "NOR" ("norris"), 15834}, // <-- Tied for 3rd place.
{ "silverstone", "RIC" ("ricciardo"), 15905},
{ "monza", "RIC" ("ricciardo"), 12154},
{ "monza", "NOR" ("norris"), 12156}, // <-- Tied for 2nd place.
{ "monza", "BOT" ("bottas"), 12156}, // <-- Tied for 2nd place.
{ "monza", "LEC" ("leclerc"), 12201},
{ "monza", "PER" ("perez"), 12203}
]
A grouped percent rank aggregation scan with:
groupby column : venue
input orderby column: time
Produces the following percent rank column:
{ 0.00, 0.25, 0.50, 0.50, 1.00, 0.00, 0.25, 0.25, 0.75, 1.00 }
(This corresponds to the following grouping and `driver` rows:)
{ "HAM", "LEC", "BOT", "NOR", "RIC", "RIC", "NOR", "BOT", "LEC", "PER" }
<----------silverstone----------->|<-------------monza-------------->
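
The percent rank column in the example above can be reproduced with a host-side sketch that applies (rank - 1) / (group_row_count - 1) to presorted rows. The function name is hypothetical, and returning 0 for a single-row group (where the divisor would be zero) is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Rows must be presorted by group key and then by the orderby value.
std::vector<double> percent_rank_scan(std::vector<std::string> const& keys,
                                      std::vector<int> const& order_by)
{
  assert(keys.size() == order_by.size());
  std::size_t const n = keys.size();
  std::vector<double> out(n, 0.0);
  std::size_t begin = 0;
  while (begin < n) {
    // Find the end of the current group.
    std::size_t end = begin + 1;
    while (end < n && keys[end] == keys[begin]) { ++end; }
    std::size_t const count = end - begin;

    std::size_t rank = 1;
    for (std::size_t i = begin; i < end; ++i) {
      // RANK: tied rows keep the rank of the first row of the tie.
      if (i > begin && order_by[i] != order_by[i - 1]) { rank = i - begin + 1; }
      // Single-row groups yield 0 here (assumption of this sketch).
      out[i] = (count > 1) ? static_cast<double>(rank - 1) / static_cast<double>(count - 1) : 0.0;
    }
    begin = end;
  }
  return out;
}
```

Passing the venue and time columns of the example returns { 0.00, 0.25, 0.50, 0.50, 1.00, 0.00, 0.25, 0.25, 0.75, 1.00 }.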

◆ make_quantile_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_quantile_aggregation ( std::vector< double > const &  quantiles,
interpolation  interp = interpolation::LINEAR 
)

Factory to create a QUANTILE aggregation.

Parameters
quantiles  The desired quantiles
interp  The desired interpolation
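
As a rough illustration of what a quantile with linear interpolation means, the host-side sketch below uses the common definition in which quantile q falls at fractional position q * (N - 1) of the sorted values. The function name is hypothetical, and whether interpolation::LINEAR follows exactly this positioning rule is an assumption of the sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Takes the values by copy so they can be sorted locally.
double quantile_linear_sketch(std::vector<double> values, double q)
{
  assert(!values.empty() && q >= 0.0 && q <= 1.0);
  std::sort(values.begin(), values.end());
  double const position   = q * static_cast<double>(values.size() - 1);
  std::size_t const lower = static_cast<std::size_t>(std::floor(position));
  std::size_t const upper = std::min(lower + 1, values.size() - 1);
  double const fraction   = position - static_cast<double>(lower);
  // Interpolate between the two neighbouring sorted values.
  return values[lower] + fraction * (values[upper] - values[lower]);
}
```

For the values {4, 1, 3, 2}, q = 0.5 falls at position 1.5 of the sorted sequence {1, 2, 3, 4} and yields 2.5; q = 0.25 yields 1.75.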

◆ make_rank_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_rank_aggregation ( )

Factory to create a RANK aggregation.

RANK returns a non-nullable column of size_type "ranks": one plus the number of rows in the group that sort strictly before the current row. As a result, tied rows share a rank and gaps appear in the ranking sequence after ties.

This aggregation only works with "scan" algorithms. The input column to the grouped or ungrouped scan is an orderby column that orders the rows that the aggregate function ranks. If rows are ordered by more than one column, the orderby input column should be a struct column containing the ordering columns.

Note:

  1. This method requires that the rows are presorted by the group keys and order_by columns.
  2. RANK aggregations will return a fully valid column regardless of null_handling policy specified in the scan.
  3. RANK aggregations are not compatible with exclusive scans.
Example: Consider a motor-racing statistics dataset, containing the following columns:
1. venue: (STRING) Location of the race event
2. driver: (STRING) Name of the car driver (abbreviated to 3 characters)
3. time: (INT32) Time taken to complete the circuit
For the following presorted data:
[ // venue, driver, time
{ "silverstone", "HAM" ("hamilton"), 15823},
{ "silverstone", "LEC" ("leclerc"), 15827},
{ "silverstone", "BOT" ("bottas"), 15834}, // <-- Tied for 3rd place.
{ "silverstone", "NOR" ("norris"), 15834}, // <-- Tied for 3rd place.
{ "silverstone", "RIC" ("ricciardo"), 15905},
{ "monza", "RIC" ("ricciardo"), 12154},
{ "monza", "NOR" ("norris"), 12156}, // <-- Tied for 2nd place.
{ "monza", "BOT" ("bottas"), 12156}, // <-- Tied for 2nd place.
{ "monza", "LEC" ("leclerc"), 12201},
{ "monza", "PER" ("perez"), 12203}
]
A grouped rank aggregation scan with:
groupby column : venue
input orderby column: time
Produces the following rank column:
{ 1, 2, 3, 3, 5, 1, 2, 2, 4, 5}
(This corresponds to the following grouping and `driver` rows:)
{ "HAM", "LEC", "BOT", "NOR", "RIC", "RIC", "NOR", "BOT", "LEC", "PER" }
<----------silverstone----------->|<-------------monza-------------->
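
The rank column in the example above can be reproduced with a short host-side sketch over presorted rows. The function name is hypothetical and this is only an illustration of the ranking rule, not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper for illustration only; not a cudf API.
// Rows must be presorted by group key and then by the orderby value.
std::vector<int> rank_scan(std::vector<std::string> const& keys,
                           std::vector<int> const& order_by)
{
  assert(keys.size() == order_by.size());
  std::vector<int> ranks(keys.size());
  std::size_t group_start = 0;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    if (i == 0 || keys[i] != keys[i - 1]) {
      group_start = i;  // first row of a new group
      ranks[i]    = 1;
    } else if (order_by[i] == order_by[i - 1]) {
      ranks[i] = ranks[i - 1];  // tie: share the previous rank
    } else {
      // One plus the number of preceding rows in the group, leaving a gap after ties.
      ranks[i] = static_cast<int>(i - group_start) + 1;
    }
  }
  return ranks;
}
```

Passing the venue and time columns of the example returns { 1, 2, 3, 3, 5, 1, 2, 2, 4, 5 }.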

◆ make_std_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_std_aggregation ( size_type  ddof = 1)

Factory to create a STD aggregation.

Parameters
ddof  Delta degrees of freedom. The divisor used in calculation of std is N - ddof, where N is the population size.
Exceptions
cudf::logic_error  If the input type is a chrono or compound type.

◆ make_tdigest_aggregation()

template<typename Base >
std::unique_ptr<Base> cudf::make_tdigest_aggregation ( int  max_centroids = 1000)

Factory to create a TDIGEST aggregation.

Produces a tdigest (https://arxiv.org/pdf/1902.04023.pdf) column from input values. The input aggregation values are expected to be fixed-width numeric types.

The tdigest column produced is of the following structure:

struct {
  // centroids for the digest
  list {
    struct {
      double    // mean
      double    // weight
    },
    ...
  }
  // these are from the input stream, not the centroids. they are used
  // during the percentile_approx computation near the beginning or
  // end of the quantiles
  double       // min
  double       // max
}

Each output row is a single tdigest. The length of the row is the "size" of the tdigest, each element of which represents a weighted centroid (mean, weight).

Parameters
max_centroids  Parameter controlling compression level and accuracy on subsequent queries on the output tdigest data. max_centroids places an upper bound on the size of the computed tdigests: a value of 1000 will result in a tdigest containing no more than 1000 centroids (32 bytes each). Higher values result in more accurate tdigest information.
Returns
A TDIGEST aggregation object.

◆ make_udf_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_udf_aggregation ( udf_type  type,
std::string const &  user_defined_aggregator,
data_type  output_type 
)

Factory to create an aggregation based on a UDF for PTX or CUDA.

Parameters
[in] type  Either udf_type::PTX or udf_type::CUDA
[in] user_defined_aggregator  A string containing the aggregator code
[in] output_type  Expected output type
Returns
A unique pointer to an aggregation housing the user_defined_aggregator string.

◆ make_variance_aggregation()

template<typename Base = aggregation>
std::unique_ptr<Base> cudf::make_variance_aggregation ( size_type  ddof = 1)

Factory to create a VARIANCE aggregation.

Parameters
ddof  Delta degrees of freedom. The divisor used in calculation of variance is N - ddof, where N is the population size.
Exceptions
cudf::logic_error  If the input type is a chrono or compound type.
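
The effect of ddof on the divisor N - ddof can be seen in a small host-side sketch; the same divisor applies to the STD aggregation, which is the square root of the variance. The function names are hypothetical and null handling is omitted.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helpers for illustration only; not cudf APIs.
double variance_sketch(std::vector<double> const& x, int ddof = 1)
{
  int const n = static_cast<int>(x.size());
  assert(n > ddof);  // the divisor N - ddof must be positive
  double mean = 0.0;
  for (double v : x) { mean += v; }
  mean /= n;

  double m2 = 0.0;  // sum of squared differences from the mean
  for (double v : x) { m2 += (v - mean) * (v - mean); }
  return m2 / (n - ddof);
}

double std_sketch(std::vector<double> const& x, int ddof = 1)
{
  return std::sqrt(variance_sketch(x, ddof));
}
```

For {2, 4, 4, 4, 5, 5, 7, 9}, ddof = 0 gives a variance of 32 / 8 = 4 and a standard deviation of 2, while the default ddof = 1 gives a variance of 32 / 7.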