Files | |
file | stream_compaction.hpp |
Column APIs for filtering rows. | |
Enumerations | |
enum class | cudf::duplicate_keep_option { cudf::KEEP_ANY = 0 , cudf::KEEP_FIRST , cudf::KEEP_LAST , cudf::KEEP_NONE } |
Choices for drop_duplicates API for retainment of duplicate rows. | |
Enumerator | |
---|---|
KEEP_ANY | Keep an unspecified occurrence. |
KEEP_FIRST | Keep first occurrence. |
KEEP_LAST | Keep last occurrence. |
KEEP_NONE | Keep no (remove all) occurrences of duplicates. |
Definition at line 223 of file stream_compaction.hpp.
std::unique_ptr<table> cudf::apply_boolean_mask | ( | table_view const & | input, |
column_view const & | boolean_mask, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Filters input using boolean_mask of boolean values as a mask.
Given an input table_view and a mask column_view, an element i from each column_view of the input is copied to the corresponding output column if the corresponding element i in the mask is non-null and true. This operation is stable: the input order is preserved.
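The masking rule above can be modeled on the host with plain C++. This is only a sketch of the documented semantics, not the cudf implementation: the name `mask_rows_model` is hypothetical, a single `int` column stands in for the table, and `std::nullopt` stands in for a null mask entry.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model of the masking rule (not cudf code).
// A mask entry is std::nullopt when null, otherwise true or false.
std::vector<int> mask_rows_model(std::vector<int> const& input,
                                 std::vector<std::optional<bool>> const& mask)
{
  // Mirrors the documented requirement that both sizes match.
  if (input.size() != mask.size()) { throw std::logic_error("size mismatch"); }

  std::vector<int> out;
  for (std::size_t i = 0; i < input.size(); ++i) {
    // Element i is copied only when mask[i] is non-null and true.
    if (mask[i].has_value() && *mask[i]) { out.push_back(input[i]); }
  }
  assert(out.size() <= input.size());  // filtering never adds rows
  return out;                          // input order is preserved
}
```

With input `10, 20, 30, 40` and mask `true, null, false, true`, the model keeps `10, 40`: a null mask entry drops the row just as `false` does.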
If input.num_rows() is zero, there is no error, and an empty table is returned.
cudf::logic_error | if input.num_rows() != boolean_mask.size(). |
cudf::logic_error | if boolean_mask is not type_id::BOOL8 type. |
[in] | input | The input table_view to filter |
[in] | boolean_mask | A nullable column_view of type type_id::BOOL8 used as a mask to filter the input . |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
[in] | mr | Device memory resource used to allocate the returned table's device memory |
Returns a table containing a copy of all rows of input passing the filter defined by boolean_mask.
std::unique_ptr<table> cudf::distinct | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
duplicate_keep_option | keep = duplicate_keep_option::KEEP_ANY , |
||
null_equality | nulls_equal = null_equality::EQUAL , |
||
nan_equality | nans_equal = nan_equality::ALL_EQUAL , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Create a new table without duplicate rows.
Given an input table_view, each row is copied to the output table to create a set of distinct rows. If there are duplicate rows, which row is copied depends on the keep parameter.
The order of rows in the output table is not specified.
Performance hint: if the input is pre-sorted, cudf::unique can produce an equivalent result (i.e., same set of output rows) but with less running time than cudf::distinct.
input | The input table |
keys | Vector of indices indicating key columns in the input table |
keep | Copy any, first, last, or none of the found duplicates |
nulls_equal | Flag to specify whether null elements should be considered as equal |
nans_equal | Flag to specify whether NaN elements should be considered as equal |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned table |
cudf::size_type cudf::distinct_count | ( | column_view const & | input, |
null_policy | null_handling, | ||
nan_policy | nan_handling, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
) |
Count the distinct elements in the column_view.
If nulls_equal == null_equality::UNEQUAL, all nulls are distinct.
Given an input column_view, the number of distinct elements in this column_view is returned.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored; NaN is considered in the distinct count.
nulls are handled as equal.
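The interaction of the two policies can be sketched on the host as follows. The enum and function names are hypothetical stand-ins for illustration, not the cudf types. The sketch also assumes that all NaN values count as a single distinct value under `NAN_IS_VALID`, which the text above does not state explicitly.

```cpp
#include <cassert>
#include <cmath>
#include <optional>
#include <set>
#include <vector>

// Hypothetical stand-ins for the cudf policy enums.
enum class null_policy_model { EXCLUDE, INCLUDE };
enum class nan_policy_model { NAN_IS_NULL, NAN_IS_VALID };

// Host-side model: std::nullopt represents a null element.
int distinct_count_model(std::vector<std::optional<double>> const& col,
                         null_policy_model null_handling,
                         nan_policy_model nan_handling)
{
  std::set<double> values;  // distinct values that are neither null nor NaN
  bool saw_null = false;
  bool saw_nan  = false;
  for (auto const& e : col) {
    if (!e.has_value()) {
      saw_null = true;
    } else if (std::isnan(*e)) {
      // NAN_IS_NULL folds NaN into the null group.
      if (nan_handling == nan_policy_model::NAN_IS_NULL) { saw_null = true; }
      else { saw_nan = true; }  // assumption: all NaNs form one distinct value
    } else {
      values.insert(*e);
    }
  }
  int count = static_cast<int>(values.size());
  if (saw_nan) { ++count; }
  // Nulls are handled as equal, so they add at most one to the count.
  if (saw_null && null_handling == null_policy_model::INCLUDE) { ++count; }
  assert(count <= static_cast<int>(col.size()));
  return count;
}
```

For the column `1.0, null, NaN, 1.0, 2.0`, this model returns 2 for `EXCLUDE` with `NAN_IS_NULL` and 3 for `EXCLUDE` with `NAN_IS_VALID`.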
[in] | input | The column_view whose distinct elements will be counted |
[in] | null_handling | flag to include or ignore null while counting |
[in] | nan_handling | flag to consider NaN==null or not |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
cudf::size_type cudf::distinct_count | ( | table_view const & | input, |
null_equality | nulls_equal = null_equality::EQUAL , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
) |
Count the distinct rows in a table.
[in] | input | Table whose distinct rows will be counted |
[in] | nulls_equal | flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL. |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
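The effect of nulls_equal on the row count can be illustrated with a one-column host model. This is a sketch under stated assumptions: `null_equality_model` and `distinct_row_count_model` are hypothetical names, and a single nullable `int` column stands in for a full table.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <vector>

// Hypothetical stand-in for cudf::null_equality.
enum class null_equality_model { EQUAL, UNEQUAL };

// One-column "table": each row is std::nullopt (null) or an int.
int distinct_row_count_model(std::vector<std::optional<int>> const& rows,
                             null_equality_model nulls_equal)
{
  std::set<int> values;  // distinct non-null rows
  int null_rows = 0;
  for (auto const& r : rows) {
    if (r.has_value()) { values.insert(*r); }
    else { ++null_rows; }
  }
  // EQUAL: all null rows collapse into one group.
  // UNEQUAL: every null row is distinct from every other row.
  int const null_groups = (nulls_equal == null_equality_model::EQUAL)
                            ? (null_rows > 0 ? 1 : 0)
                            : null_rows;
  int const count = static_cast<int>(values.size()) + null_groups;
  assert(count <= static_cast<int>(rows.size()));
  return count;
}
```

For rows `1, null, 1, null, 2`, the model gives 3 with `EQUAL` and 4 with `UNEQUAL`.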
std::unique_ptr<column> cudf::distinct_indices | ( | table_view const & | input, |
duplicate_keep_option | keep = duplicate_keep_option::KEEP_ANY , |
||
null_equality | nulls_equal = null_equality::EQUAL , |
||
nan_equality | nans_equal = nan_equality::ALL_EQUAL , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Create a column of indices of all distinct rows in the input table.
Given an input table_view, an output vector of all row indices of the distinct rows is generated. If there are duplicate rows, which index is kept depends on the keep parameter.
input | The input table |
keep | Get index of any, first, last, or none of the found duplicates |
nulls_equal | Flag to specify whether null elements should be considered as equal |
nans_equal | Flag to specify whether NaN elements should be considered as equal |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned vector |
std::unique_ptr<table> cudf::drop_nans | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
cudf::size_type | keep_threshold, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Filters a table to remove NANs with threshold count.
Filters the rows of the input considering specified columns indicated in keys for NANs. These key columns must be of floating-point type.
Given an input table_view, row i from the input columns is copied to the output if the same row i of keys has at least keep_threshold non-NAN elements.
This operation is stable: the input order is preserved in the output.
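The threshold rule can be sketched on the host. The function name `rows_kept_by_drop_nans_model` is hypothetical, the table is stored row-major for brevity, and the function returns the indices of the rows that would be kept rather than a table.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical row-major host model (not cudf code).
// Returns the indices of rows that satisfy the keep_threshold rule.
std::vector<std::size_t> rows_kept_by_drop_nans_model(
  std::vector<std::vector<double>> const& rows,
  std::vector<std::size_t> const& keys,
  std::size_t keep_threshold)
{
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    std::size_t non_nan = 0;
    for (std::size_t k : keys) {
      assert(k < rows[i].size());  // a key index must name an existing column
      if (!std::isnan(rows[i][k])) { ++non_nan; }
    }
    // Row i is kept when enough key columns hold non-NaN values.
    if (non_nan >= keep_threshold) { kept.push_back(i); }
  }
  return kept;  // ascending indices, so input order is preserved
}
```

Only the key columns are inspected: a NaN in a non-key column never causes a row to be dropped in this model.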
If input.num_rows() is zero, or keys is empty, there is no error, and an empty table is returned.
cudf::logic_error | if the keys columns are not floating-point type. |
[in] | input | The input table_view to filter |
[in] | keys | vector of indices representing key columns from input |
[in] | keep_threshold | The minimum number of non-NAN elements in a row required to keep the row. |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
[in] | mr | Device memory resource used to allocate the returned table's device memory |
Returns a table containing all rows of the input with at least keep_threshold non-NAN elements in keys.
std::unique_ptr<table> cudf::drop_nans | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Filters a table to remove NANs.
Filters the rows of the input considering specified columns indicated in keys for NANs. These key columns must be of floating-point type.
Same as drop_nans but defaults keep_threshold to the number of columns in keys.
[in] | input | The input table_view to filter |
[in] | keys | vector of indices representing key columns from input |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
[in] | mr | Device memory resource used to allocate the returned table's device memory |
Returns a table containing all rows of the input without NANs in the columns of keys.
std::unique_ptr<table> cudf::drop_nulls | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
cudf::size_type | keep_threshold, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Filters a table to remove null elements with threshold count.
Filters the rows of the input considering specified columns indicated in keys for validity / null values.
Given an input table_view, row i from the input columns is copied to the output if the same row i of keys has at least keep_threshold non-null fields.
This operation is stable: the input order is preserved in the output.
Any non-nullable column in the input is treated as all non-null.
If input.num_rows() is zero, or keys is empty or has no nulls, there is no error, and an empty table is returned.
[in] | input | The input table_view to filter |
[in] | keys | vector of indices representing key columns from input |
[in] | keep_threshold | The minimum number of non-null fields in a row required to keep the row. |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
[in] | mr | Device memory resource used to allocate the returned table's device memory |
Returns a table containing all rows of the input with at least keep_threshold non-null fields in keys.
std::unique_ptr<table> cudf::drop_nulls | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Filters a table to remove null elements.
Filters the rows of the input considering specified columns indicated in keys for validity / null values.
Same as drop_nulls but defaults keep_threshold to the number of columns in keys.
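The relationship between the two overloads can be sketched on the host. The names `row_model` and `rows_kept_by_drop_nulls_model` are hypothetical, `std::nullopt` stands in for a null field, and the functions return kept row indices instead of a table.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host model of a row: std::nullopt represents a null field.
using row_model = std::vector<std::optional<int>>;

// Threshold form: keep row i when at least keep_threshold key fields are non-null.
std::vector<std::size_t> rows_kept_by_drop_nulls_model(
  std::vector<row_model> const& rows,
  std::vector<std::size_t> const& keys,
  std::size_t keep_threshold)
{
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < rows.size(); ++i) {
    std::size_t non_null = 0;
    for (std::size_t k : keys) {
      assert(k < rows[i].size());  // a key index must name an existing column
      if (rows[i][k].has_value()) { ++non_null; }
    }
    if (non_null >= keep_threshold) { kept.push_back(i); }
  }
  return kept;
}

// Form without a threshold: defaults it to the number of key columns,
// so a row survives only if every key field is non-null.
std::vector<std::size_t> rows_kept_by_drop_nulls_model(
  std::vector<row_model> const& rows, std::vector<std::size_t> const& keys)
{
  return rows_kept_by_drop_nulls_model(rows, keys, keys.size());
}
```

For rows `(1, null)`, `(2, 3)`, `(null, null)` with both columns as keys, the default form keeps only the second row, while a threshold of 1 keeps the first two.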
[in] | input | The input table_view to filter |
[in] | keys | vector of indices representing key columns from input |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
[in] | mr | Device memory resource used to allocate the returned table's device memory |
Returns a table containing all rows of the input without nulls in the columns of keys.
std::unique_ptr<table> cudf::stable_distinct | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
duplicate_keep_option | keep = duplicate_keep_option::KEEP_ANY , |
||
null_equality | nulls_equal = null_equality::EQUAL , |
||
nan_equality | nans_equal = nan_equality::ALL_EQUAL , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Create a new table without duplicate rows, preserving input order.
Given an input table_view, each row is copied to the output table to create a set of distinct rows. The input row order is preserved. If there are duplicate rows, which row is copied depends on the keep parameter.
This API produces the same output rows as cudf::distinct, but with input order preserved.
Note that when keep is KEEP_ANY, the choice of which duplicate row to keep is arbitrary, but the returned table will retain the input order. That is, if the key column contained 1, 2, 1 with another values column 3, 4, 5, the result could contain values 3, 4 or 4, 5 but not 4, 3 or 5, 4.
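The `1, 2, 1` example above can be reproduced with a small host model. This is a sketch, not the cudf algorithm: `keep_model` and `stable_distinct_model` are hypothetical names, there is a single `int` key column and a single values column, and `KEEP_ANY` is omitted because its choice is arbitrary (it would match either the `FIRST` or the `LAST` result here).

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the deterministic duplicate_keep_option values.
enum class keep_model { FIRST, LAST, NONE };

// Returns the values whose rows survive, in input order.
std::vector<int> stable_distinct_model(std::vector<int> const& keys,
                                       std::vector<int> const& values,
                                       keep_model keep)
{
  assert(keys.size() == values.size());

  // For each key, record its first index, last index, and occurrence count.
  struct seen { std::size_t first; std::size_t last; std::size_t count; };
  std::unordered_map<int, seen> info;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    auto it = info.try_emplace(keys[i], seen{i, i, 0}).first;
    it->second.last = i;
    ++it->second.count;
  }

  // Second pass in input order, so the output order matches the input.
  std::vector<int> out;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    seen const& s = info.at(keys[i]);
    bool const take = (keep == keep_model::FIRST && s.first == i) ||
                      (keep == keep_model::LAST && s.last == i) ||
                      (keep == keep_model::NONE && s.count == 1);
    if (take) { out.push_back(values[i]); }
  }
  return out;
}
```

With keys `1, 2, 1` and values `3, 4, 5`, the model yields `3, 4` for `FIRST`, `4, 5` for `LAST`, and `4` for `NONE`.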
input | The input table |
keys | Vector of indices indicating key columns in the input table |
keep | Copy any, first, last, or none of the found duplicates |
nulls_equal | Flag to specify whether null elements should be considered as equal |
nans_equal | Flag to specify whether NaN elements should be considered as equal |
stream | CUDA stream used for device memory operations and kernel launches. |
mr | Device memory resource used to allocate the returned table |
std::unique_ptr<table> cudf::unique | ( | table_view const & | input, |
std::vector< size_type > const & | keys, | ||
duplicate_keep_option | keep, | ||
null_equality | nulls_equal = null_equality::EQUAL , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
) |
Create a new table with consecutive duplicate rows removed.
Given an input table_view, each row is copied to the output table to create a set of distinct rows. If there are duplicate rows, which row is copied depends on the keep parameter.
The order of rows in the output table remains the same as in the input.
A row is distinct if there are no equivalent rows in the table. A row is unique if there is no adjacent equivalent row. That is, keeping distinct rows removes all duplicates in the table/column, while keeping unique rows only removes duplicates from consecutive groupings.
Performance hint: if the input is pre-sorted, cudf::unique can produce an equivalent result (i.e., same set of output rows) but with less running time than cudf::distinct.
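The difference between unique and distinct, and why pre-sorting makes them agree, can be seen in a host-side analogy using the standard library. The function names are hypothetical, and `std::unique` is used here as a stand-in for the keep-first behavior on a single `int` column. It is not how cudf implements the operation.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Removes only consecutive duplicates, keeping the first row of each run.
std::vector<int> unique_keep_first_model(std::vector<int> v)
{
  v.erase(std::unique(v.begin(), v.end()), v.end());
  return v;
}

// On sorted input, equal rows are adjacent, so removing consecutive
// duplicates leaves exactly one row per distinct value.
std::vector<int> sorted_then_unique_model(std::vector<int> v)
{
  std::sort(v.begin(), v.end());
  std::vector<int> out = unique_keep_first_model(std::move(v));
  assert(std::is_sorted(out.begin(), out.end()));
  return out;
}
```

On the unsorted column `1, 1, 2, 1, 1`, the first function returns `1, 2, 1` because the two runs of `1` are not adjacent. After sorting, the result is `1, 2`, the same set of rows a distinct operation would keep.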
cudf::logic_error | if the keys column indices are out of bounds in the input table. |
[in] | input | input table_view to copy only unique rows |
[in] | keys | vector of indices representing key columns from input |
[in] | keep | keep any, first, last, or none of the found duplicates |
[in] | nulls_equal | flag to denote nulls are equal if null_equality::EQUAL, nulls are not equal if null_equality::UNEQUAL |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
[in] | mr | Device memory resource used to allocate the returned table's device memory |
Returns a table with unique rows from each sequence of equivalent rows, as specified by keep.
cudf::size_type cudf::unique_count | ( | column_view const & | input, |
null_policy | null_handling, | ||
nan_policy | nan_handling, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
) |
Count the number of consecutive groups of equivalent rows in a column.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored; NaN is considered in the count.
nulls are handled as equal.
[in] | input | The column_view whose consecutive groups of equivalent rows will be counted |
[in] | null_handling | flag to include or ignore null while counting |
[in] | nan_handling | flag to consider NaN==null or not |
[in] | stream | CUDA stream used for device memory operations and kernel launches |
cudf::size_type cudf::unique_count | ( | table_view const & | input, |
null_equality | nulls_equal = null_equality::EQUAL , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
) |
Count the number of consecutive groups of equivalent rows in a table.
[in] | input | Table whose consecutive groups of equivalent rows will be counted |
[in] | nulls_equal | flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL. |
[in] | stream | CUDA stream used for device memory operations and kernel launches |