Reorder Compact#
- group Stream Compaction
Enums
-
enum class duplicate_keep_option#
Options that control which duplicate rows are retained by the duplicate-removal APIs (unique, distinct, distinct_indices, stable_distinct).
Values:
-
enumerator KEEP_ANY#
Keep an unspecified occurrence.
-
enumerator KEEP_FIRST#
Keep first occurrence.
-
enumerator KEEP_LAST#
Keep last occurrence.
-
enumerator KEEP_NONE#
Keep no (remove all) occurrences of duplicates.
Functions
-
std::unique_ptr<table> drop_nulls(table_view const &input, std::vector<size_type> const &keys, cudf::size_type keep_threshold, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Filters a table to remove null elements with threshold count.
Filters the rows of the input considering specified columns indicated in keys for validity / null values.
Given an input table_view, row i from the input columns is copied to the output if the same row i of keys has at least keep_threshold non-null fields.
This operation is stable: the input order is preserved in the output.
Any non-nullable column in the input is treated as all non-null.
input   {col1: {1, 2,    3,    null},
         col2: {4, 5,    null, null},
         col3: {7, null, null, null}}
keys = {0, 1, 2} // All columns
keep_threshold = 2
output {col1: {1, 2}
        col2: {4, 5}
        col3: {7, null}}
Note
If input.num_rows() is zero, or keys is empty or has no nulls, there is no error and no rows are dropped: a copy of input is returned (an empty table when input has no rows).
- Parameters:
input – [in] The input table_view to filter
keys – [in] vector of indices representing key columns from input
keep_threshold – [in] The minimum number of non-null fields in a row required to keep the row.
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
Table containing all rows of the input with at least keep_threshold non-null fields in keys.
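The threshold rule above can be illustrated on the host with plain standard C++. This is only a sketch of the row-selection semantics, not libcudf's GPU implementation: the Column and Table aliases and the function drop_nulls_sketch are hypothetical names introduced here, a null is modeled as std::nullopt, and the special cases described in the Note are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model: std::nullopt stands for a null element.
using Column = std::vector<std::optional<int>>;
using Table  = std::vector<Column>;  // all columns are assumed to have equal length

// Keep row i when the key columns hold at least keep_threshold non-null values.
Table drop_nulls_sketch(Table const& input,
                        std::vector<std::size_t> const& keys,
                        std::size_t keep_threshold)
{
  Table output(input.size());
  if (input.empty()) { return output; }
  std::size_t const num_rows = input.front().size();
  for (std::size_t row = 0; row < num_rows; ++row) {
    std::size_t non_null = 0;
    for (std::size_t key : keys) {
      assert(key < input.size());  // key indices must refer to existing columns
      if (input[key][row].has_value()) { ++non_null; }
    }
    if (non_null >= keep_threshold) {
      // Copy the whole row (every column, not only the keys) in input order.
      for (std::size_t col = 0; col < input.size(); ++col) {
        output[col].push_back(input[col][row]);
      }
    }
  }
  return output;
}
```

Calling the sketch with the table from the example, keys {0, 1, 2}, and a threshold of 2 keeps the first two rows; passing keys.size() as the threshold mirrors the overload without keep_threshold.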
-
std::unique_ptr<table> drop_nulls(table_view const &input, std::vector<size_type> const &keys, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Filters a table to remove null elements.
Filters the rows of the input considering specified columns indicated in keys for validity / null values.
input   {col1: {1, 2,    3,    null},
         col2: {4, 5,    null, null},
         col3: {7, null, null, null}}
keys = {0, 1, 2} // All columns
output {col1: {1}
        col2: {4}
        col3: {7}}
Same as drop_nulls but defaults keep_threshold to the number of columns in keys.
- Parameters:
input – [in] The input table_view to filter
keys – [in] vector of indices representing key columns from input
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
Table containing all rows of the input without nulls in the columns of keys.
-
std::unique_ptr<table> drop_nans(table_view const &input, std::vector<size_type> const &keys, cudf::size_type keep_threshold, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Filters a table to remove NANs with threshold count.
Filters the rows of the input considering specified columns indicated in keys for NANs. These key columns must be of floating-point type.
Given an input table_view, row i from the input columns is copied to the output if the same row i of keys has at least keep_threshold non-NAN elements.
This operation is stable: the input order is preserved in the output.
input   {col1: {1.0, 2.0,  3.0, NAN},
         col2: {4.0, null, NAN, NAN},
         col3: {7.0, NAN,  NAN, NAN}}
keys = {0, 1, 2} // All columns
keep_threshold = 2
output {col1: {1.0, 2.0}
        col2: {4.0, null}
        col3: {7.0, NAN}}
Note
If input.num_rows() is zero, or keys is empty, there is no error, and an empty table is returned
- Throws:
cudf::logic_error – if the keys columns are not of floating-point type.
- Parameters:
input – [in] The input table_view to filter
keys – [in] vector of indices representing key columns from input
keep_threshold – [in] The minimum number of non-NAN elements in a row required to keep the row.
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
Table containing all rows of the input with at least keep_threshold non-NAN elements in keys.
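In the example above, the null in col2 counts toward the non-NAN total, which is why the second row survives with a threshold of 2. A minimal host-side sketch of that counting rule follows. It is an illustration only, not libcudf's implementation: FloatColumn, FloatTable, and drop_nans_sketch are hypothetical names, and a null is modeled as std::nullopt.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model: std::nullopt is a null, NaN is a regular value.
using FloatColumn = std::vector<std::optional<double>>;
using FloatTable  = std::vector<FloatColumn>;  // columns are assumed equal in length

// Keep row i when the key columns hold at least keep_threshold non-NaN elements.
// A null element is not a NaN, so it counts as non-NaN here.
FloatTable drop_nans_sketch(FloatTable const& input,
                            std::vector<std::size_t> const& keys,
                            std::size_t keep_threshold)
{
  FloatTable output(input.size());
  if (input.empty()) { return output; }
  std::size_t const num_rows = input.front().size();
  for (std::size_t row = 0; row < num_rows; ++row) {
    std::size_t non_nan = 0;
    for (std::size_t key : keys) {
      assert(key < input.size());  // key indices must refer to existing columns
      auto const& element = input[key][row];
      bool const is_nan   = element.has_value() && std::isnan(*element);
      if (!is_nan) { ++non_nan; }
    }
    if (non_nan >= keep_threshold) {
      for (std::size_t col = 0; col < input.size(); ++col) {
        output[col].push_back(input[col][row]);  // stable: input order is kept
      }
    }
  }
  return output;
}
```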
-
std::unique_ptr<table> drop_nans(table_view const &input, std::vector<size_type> const &keys, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Filters a table to remove NANs.
Filters the rows of the input considering specified columns indicated in keys for NANs. These key columns must be of floating-point type.
input   {col1: {1.0,  2.0,  3.0, NAN},
         col2: {4.0,  null, NAN, NAN},
         col3: {null, NAN,  NAN, NAN}}
keys = {0, 1, 2} // All columns
output {col1: {1.0}
        col2: {4.0}
        col3: {null}}
Same as drop_nans but defaults keep_threshold to the number of columns in keys.
- Parameters:
input – [in] The input table_view to filter
keys – [in] vector of indices representing key columns from input
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
Table containing all rows of the input without NANs in the columns of keys.
-
std::unique_ptr<table> apply_boolean_mask(table_view const &input, column_view const &boolean_mask, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Filters input using boolean_mask of boolean values as a mask.
Given an input table_view and a mask column_view, an element i from each column_view of the input is copied to the corresponding output column if the corresponding element i in the mask is non-null and true. This operation is stable: the input order is preserved.
Note
If input.num_rows() is zero, there is no error, and an empty table is returned.
- Throws:
cudf::logic_error – if input.num_rows() != boolean_mask.size().
cudf::logic_error – if boolean_mask is not type_id::BOOL8 type.
- Parameters:
input – [in] The input table_view to filter
boolean_mask – [in] A nullable column_view of type type_id::BOOL8 used as a mask to filter the input.
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
Table containing a copy of all rows of input passing the filter defined by boolean_mask.
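The mask rule (keep a row only when the mask element is both non-null and true) can be sketched on the host as follows. This is an illustration under stated assumptions, not the library's code: IntColumn, IntTable, Mask, and apply_boolean_mask_sketch are hypothetical names, and a null mask element is modeled as std::nullopt.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using IntColumn = std::vector<int>;
using IntTable  = std::vector<IntColumn>;
using Mask      = std::vector<std::optional<bool>>;  // std::nullopt models a null mask element

// Copy row i of every column when mask[i] is non-null and true.
IntTable apply_boolean_mask_sketch(IntTable const& input, Mask const& mask)
{
  IntTable output(input.size());
  for (std::size_t col = 0; col < input.size(); ++col) {
    // Mirrors the documented precondition input.num_rows() == boolean_mask.size().
    assert(input[col].size() == mask.size());
    for (std::size_t row = 0; row < mask.size(); ++row) {
      if (mask[row].value_or(false)) { output[col].push_back(input[col][row]); }
    }
  }
  return output;
}
```

With columns {1, 2, 3, 4} and {10, 20, 30, 40} and a mask of {true, false, null, true}, the sketch keeps the first and last rows in their original order.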
-
std::unique_ptr<table> unique(table_view const &input, std::vector<size_type> const &keys, duplicate_keep_option keep, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Create a new table with consecutive duplicate rows removed.
Given an input table_view, each row is copied to the output table to create a set of unique rows. If there are duplicate rows, which row is copied depends on the keep parameter.
The order of rows in the output table remains the same as in the input.
A row is distinct if there are no equivalent rows in the table. A row is unique if there is no adjacent equivalent row. That is, keeping distinct rows removes all duplicates in the table/column, while keeping unique rows only removes duplicates from consecutive groupings.
Performance hint: if the input is pre-sorted, cudf::unique can produce an equivalent result (i.e., same set of output rows) but with less running time than cudf::distinct.
- Throws:
cudf::logic_error – if the keys column indices are out of bounds in the input table.
- Parameters:
input – [in] input table_view to copy only unique rows
keys – [in] vector of indices representing key columns from input
keep – [in] keep any, first, last, or none of the found duplicates
nulls_equal – [in] flag to denote nulls are equal if null_equality::EQUAL, nulls are not equal if null_equality::UNEQUAL
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
Table with unique rows from each sequence of equivalent rows as specified by keep
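To make the effect of keep on consecutive duplicates concrete, the sketch below walks runs of equal adjacent keys in a single host-side key column and returns the indices of the rows that would be kept. The enum unique_keep and the function unique_rows_sketch are hypothetical stand-ins, and the any option simply picks the first row of each run, which is one permitted choice rather than the library's guaranteed behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for duplicate_keep_option.
enum class unique_keep { any, first, last, none };

// Return the indices of rows kept from each run of adjacent equal keys.
std::vector<std::size_t> unique_rows_sketch(std::vector<int> const& keys, unique_keep keep)
{
  std::vector<std::size_t> kept;
  std::size_t begin = 0;
  while (begin < keys.size()) {
    // Find the end of the run of adjacent rows equal to keys[begin].
    std::size_t end = begin + 1;
    while (end < keys.size() && keys[end] == keys[begin]) { ++end; }
    assert(end > begin);  // every run holds at least one row
    switch (keep) {
      case unique_keep::any:  // any row of the run is acceptable; pick the first
      case unique_keep::first: kept.push_back(begin); break;
      case unique_keep::last: kept.push_back(end - 1); break;
      case unique_keep::none:
        // Keep the row only when the run has no duplicates at all.
        if (end - begin == 1) { kept.push_back(begin); }
        break;
    }
    begin = end;
  }
  return kept;
}
```

For keys {1, 1, 2, 1, 3, 3}, the value 1 appears in two separate runs, so it is kept twice under first or last. Sorting the input beforehand would merge those runs, which is the situation the performance hint refers to.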
-
std::unique_ptr<table> distinct(table_view const &input, std::vector<size_type> const &keys, duplicate_keep_option keep = duplicate_keep_option::KEEP_ANY, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Create a new table without duplicate rows.
Given an input table_view, each row is copied to the output table to create a set of distinct rows. If there are duplicate rows, which row is copied depends on the keep parameter.
The order of rows in the output table is not specified.
Performance hint: if the input is pre-sorted, cudf::unique can produce an equivalent result (i.e., same set of output rows) but with less running time than cudf::distinct.
- Parameters:
input – The input table
keys – Vector of indices indicating key columns in the input table
keep – Copy any, first, last, or none of the found duplicates
nulls_equal – Flag to specify whether null elements should be considered as equal
nans_equal – Flag to specify whether NaN elements should be considered as equal
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned table
- Returns:
Table with distinct rows in an unspecified order
-
std::unique_ptr<column> distinct_indices(table_view const &input, duplicate_keep_option keep = duplicate_keep_option::KEEP_ANY, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Create a column of indices of all distinct rows in the input table.
Given an input table_view, an output column of all row indices of the distinct rows is generated. If there are duplicate rows, which index is kept depends on the keep parameter.
- Parameters:
input – The input table
keep – Get index of any, first, last, or none of the found duplicates
nulls_equal – Flag to specify whether null elements should be considered as equal
nans_equal – Flag to specify whether NaN elements should be considered as equal
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column
- Returns:
Column containing the result indices
-
std::unique_ptr<table> stable_distinct(table_view const &input, std::vector<size_type> const &keys, duplicate_keep_option keep = duplicate_keep_option::KEEP_ANY, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Create a new table without duplicate rows, preserving input order.
Given an input table_view, each row is copied to the output table to create a set of distinct rows. The input row order is preserved. If there are duplicate rows, which row is copied depends on the keep parameter.
This API produces the same output rows as cudf::distinct, but with input order preserved.
Note that when keep is KEEP_ANY, the choice of which duplicate row to keep is arbitrary, but the returned table will retain the input order. That is, if the key column contained 1, 2, 1 with another values column 3, 4, 5, the result could contain values 3, 4 or 4, 5 but not 4, 3 or 5, 4.
- Parameters:
input – The input table
keys – Vector of indices indicating key columns in the input table
keep – Copy any, first, last, or none of the found duplicates
nulls_equal – Flag to specify whether null elements should be considered as equal
nans_equal – Flag to specify whether NaN elements should be considered as equal
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the returned table
- Returns:
Table with distinct rows, preserving input order
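The order-preserving behavior described above can be sketched on the host by recording, for each key, its first index, last index, and occurrence count, and then emitting kept row indices in input order. This is a sketch of the semantics only: distinct_keep and stable_distinct_rows_sketch are hypothetical names, and the any option picks the first occurrence here, which is one valid outcome among those the documentation allows.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for duplicate_keep_option.
enum class distinct_keep { any, first, last, none };

// Return kept row indices in ascending (input) order.
std::vector<std::size_t> stable_distinct_rows_sketch(std::vector<int> const& keys,
                                                     distinct_keep keep)
{
  std::unordered_map<int, std::size_t> first_index, last_index, count;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    first_index.emplace(keys[i], i);  // emplace keeps the earliest index
    last_index[keys[i]] = i;          // overwritten until the latest index remains
    ++count[keys[i]];
  }
  assert(first_index.size() == last_index.size());  // one entry per distinct key

  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    bool keep_row = false;
    switch (keep) {
      case distinct_keep::any:  // arbitrary choice allowed; this sketch uses the first
      case distinct_keep::first: keep_row = (first_index.at(keys[i]) == i); break;
      case distinct_keep::last: keep_row = (last_index.at(keys[i]) == i); break;
      case distinct_keep::none: keep_row = (count.at(keys[i]) == 1); break;
    }
    if (keep_row) { kept.push_back(i); }
  }
  return kept;
}
```

With keys {1, 2, 1} and values {3, 4, 5}, keeping the first occurrence selects rows 0 and 1 (values 3, 4) and keeping the last selects rows 1 and 2 (values 4, 5), matching the two orderings the note permits.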
-
cudf::size_type unique_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Count the number of consecutive groups of equivalent rows in a column.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored, NaN is considered in count.
nulls are handled as equal.
- Parameters:
input – [in] The column_view whose consecutive groups of equivalent rows will be counted
null_handling – [in] flag to include or ignore null while counting
nan_handling – [in] flag to consider NaN==null or not
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of consecutive groups of equivalent rows in the column
-
cudf::size_type unique_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Count the number of consecutive groups of equivalent rows in a table.
- Parameters:
input – [in] Table whose consecutive groups of equivalent rows will be counted
nulls_equal – [in] flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL.
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of consecutive groups of equivalent rows in the table
-
cudf::size_type distinct_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Count the distinct elements in the column_view.
Given an input column_view, number of distinct elements in this column_view is returned.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored, NaN is considered in distinct count.
nulls are handled as equal.
- Parameters:
input – [in] The column_view whose distinct elements will be counted
null_handling – [in] flag to include or ignore null while counting
nan_handling – [in] flag to consider NaN==null or not
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of distinct elements in the column
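The interaction of the two policy flags is easier to follow with a small host-side model. The sketch below uses hypothetical stand-in enums (null_policy_sketch, nan_policy_sketch) and a hypothetical function; a null is modeled as std::nullopt. It assumes that all nulls count as one distinct value when included (following "nulls are handled as equal") and, as an assumption not stated in the text above, that all NaNs count as a single distinct value when they are treated as valid.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical stand-ins for null_policy and nan_policy.
enum class null_policy_sketch { EXCLUDE, INCLUDE };
enum class nan_policy_sketch { NAN_IS_NULL, NAN_IS_VALID };

std::size_t distinct_count_sketch(std::vector<std::optional<double>> const& input,
                                  null_policy_sketch null_handling,
                                  nan_policy_sketch nan_handling)
{
  std::set<double> values;  // distinct ordinary (non-null, non-NaN) values
  bool saw_null = false;
  bool saw_nan  = false;
  for (auto const& element : input) {
    bool const is_nan = element.has_value() && std::isnan(*element);
    if (!element.has_value() || (is_nan && nan_handling == nan_policy_sketch::NAN_IS_NULL)) {
      saw_null = true;  // NaN is folded into null under NAN_IS_NULL
    } else if (is_nan) {
      saw_nan = true;   // assumption: every NaN counts as the same distinct value
    } else {
      values.insert(*element);
    }
  }
  std::size_t count = values.size() + (saw_nan ? 1 : 0);
  if (null_handling == null_policy_sketch::INCLUDE && saw_null) { ++count; }
  assert(count <= input.size());  // cannot exceed the number of elements
  return count;
}
```

For the input {1.0, null, NaN, 1.0, NaN, 2.0}, excluding nulls gives 2 when NaN is treated as null and 3 when NaN is treated as valid.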
-
cudf::size_type distinct_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())#
Count the distinct rows in a table.
- Parameters:
input – [in] Table whose distinct rows will be counted
nulls_equal – [in] flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL.
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of distinct rows in the table
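The difference between counting consecutive groups (unique_count) and counting distinct rows (distinct_count) shows up as soon as equal rows are not adjacent. The following host-side sketch contrasts the two on a single column of integers; count_consecutive_groups and count_distinct_values are hypothetical helper names, and nulls are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Number of runs of adjacent equal values (the unique_count notion).
std::size_t count_consecutive_groups(std::vector<int> const& rows)
{
  std::size_t groups = 0;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    // A new group starts at the first row and wherever the value changes.
    if (i == 0 || rows[i] != rows[i - 1]) { ++groups; }
  }
  assert(groups <= rows.size());
  return groups;
}

// Number of different values anywhere in the column (the distinct_count notion).
std::size_t count_distinct_values(std::vector<int> const& rows)
{
  return std::unordered_set<int>(rows.begin(), rows.end()).size();
}
```

For {1, 1, 2, 1} the sketch reports three consecutive groups but two distinct values. Once the same data is sorted to {1, 1, 1, 2}, both counts are two.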
-
std::vector<std::unique_ptr<column>> filter(std::vector<column_view> const &predicate_columns, std::string const &predicate_udf, std::vector<column_view> const &filter_columns, bool is_ptx, std::optional<void*> user_data = std::nullopt, null_aware is_null_aware = null_aware::NO, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Creates new columns by applying a filter function against every element of the input columns.
Null values in the input columns are considered as not matching the filter.
Computes:
out[i]... = predicate(columns[i]... ) ? (columns[i]...): not-applied.
Note that for every scalar in columns (columns of size 1), columns[i] == input[0]
The size of the resulting column is the size of the largest column.
- Throws:
std::invalid_argument – if any of the input columns have different sizes (except scalars of size 1)
std::invalid_argument – if the output or any of the inputs are not fixed-width or string types
cudf::logic_error – if JIT is not supported by the runtime
std::invalid_argument – if the size of copy_mask does not match the number of input columns
- Parameters:
predicate_columns – Immutable views of the predicate columns
predicate_udf – The PTX/CUDA string of the predicate function to apply
filter_columns – Immutable view of the columns to be filtered
is_ptx – true: the UDF is treated as PTX code; false: the UDF is treated as CUDA code
user_data – User-defined device data to pass to the UDF.
is_null_aware – Signifies the UDF will receive row inputs as optional values
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
The filtered target columns
-
std::unique_ptr<table> filter(table_view const &predicate_table, ast::expression const &predicate_expr, table_view const &filter_table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Creates a new table by applying a filter function against every element of the input columns.
Null values in the input columns are considered as not matching the filter.
Computes:
out[i]... = predicate(columns[i]... ) ? (columns[i]...): not-applied.
- Throws:
std::invalid_argument – if the output or any of the inputs are not fixed-width or string types
cudf::logic_error – if JIT is not supported by the runtime
- Parameters:
predicate_table – The table used for predicate expression evaluation
predicate_expr – The predicate filter expression
filter_table – The table to be filtered
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned table’s device memory
- Returns:
The filtered table
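The "Computes" rule, together with the statement that nulls never match, can be modeled on the host with an ordinary callable in place of the AST expression or JIT-compiled UDF. This is a simplified sketch with a single predicate column: NullableColumn and filter_sketch are hypothetical names, a null is modeled as std::nullopt, and nothing here reflects how libcudf evaluates the predicate on the device.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <vector>

using NullableColumn = std::vector<std::optional<int>>;  // std::nullopt models a null

// Keep row i of every filter column when predicate(predicate_column[i]) holds.
// A null predicate input is treated as not matching.
std::vector<NullableColumn> filter_sketch(NullableColumn const& predicate_column,
                                          std::function<bool(int)> const& predicate,
                                          std::vector<NullableColumn> const& filter_columns)
{
  std::vector<NullableColumn> output(filter_columns.size());
  for (auto const& column : filter_columns) {
    assert(column.size() == predicate_column.size());  // all columns must have equal size
  }
  for (std::size_t row = 0; row < predicate_column.size(); ++row) {
    auto const& value = predicate_column[row];
    bool const matches = value.has_value() && predicate(*value);
    if (!matches) { continue; }
    for (std::size_t col = 0; col < filter_columns.size(); ++col) {
      output[col].push_back(filter_columns[col][row]);
    }
  }
  return output;
}
```

Separating the predicate column from the filtered columns mirrors the signatures above, where the predicate inputs and the data being filtered are passed as distinct arguments.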