Stream Compaction

Files

file  stream_compaction.hpp
 Column APIs for filtering rows.
 

Enumerations

enum  cudf::duplicate_keep_option { cudf::duplicate_keep_option::KEEP_FIRST = 0, cudf::duplicate_keep_option::KEEP_LAST, cudf::duplicate_keep_option::KEEP_NONE }
 Choices for drop_duplicates API for retention of duplicate rows.
 

Functions

std::unique_ptr< table > cudf::drop_nulls (table_view const &input, std::vector< size_type > const &keys, cudf::size_type keep_threshold, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Filters a table to remove null elements with threshold count.
 
std::unique_ptr< table > cudf::drop_nulls (table_view const &input, std::vector< size_type > const &keys, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Filters a table to remove null elements.
 
std::unique_ptr< table > cudf::drop_nans (table_view const &input, std::vector< size_type > const &keys, cudf::size_type keep_threshold, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Filters a table to remove NANs with threshold count.
 
std::unique_ptr< table > cudf::drop_nans (table_view const &input, std::vector< size_type > const &keys, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Filters a table to remove NANs.
 
std::unique_ptr< table > cudf::apply_boolean_mask (table_view const &input, column_view const &boolean_mask, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Filters input using boolean_mask of boolean values as a mask.
 
std::unique_ptr< table > cudf::unique (table_view const &input, std::vector< size_type > const &keys, duplicate_keep_option keep, null_equality nulls_equal=null_equality::EQUAL, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Create a new table with consecutive duplicate rows removed.
 
std::unique_ptr< table > cudf::distinct (table_view const &input, std::vector< size_type > const &keys, null_equality nulls_equal=null_equality::EQUAL, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource())
 Create a new table without duplicate rows.
 
cudf::size_type cudf::unique_count (column_view const &input, null_policy null_handling, nan_policy nan_handling)
 Count the number of consecutive groups of equivalent rows in a column.
 
cudf::size_type cudf::unique_count (table_view const &input, null_equality nulls_equal=null_equality::EQUAL)
 Count the number of consecutive groups of equivalent rows in a table.
 
cudf::size_type cudf::distinct_count (column_view const &input, null_policy null_handling, nan_policy nan_handling)
 Count the distinct elements in the column_view.
 
cudf::size_type cudf::distinct_count (table_view const &input, null_equality nulls_equal=null_equality::EQUAL)
 Count the distinct rows in a table.
 

Detailed Description

Enumeration Type Documentation

◆ duplicate_keep_option

Choices for drop_duplicates API for retention of duplicate rows.

Enumerator
KEEP_FIRST 

Keeps first duplicate element and unique elements.

KEEP_LAST 

Keeps last duplicate element and unique elements.

KEEP_NONE 

Keeps only unique elements.

Definition at line 210 of file stream_compaction.hpp.

Function Documentation

◆ apply_boolean_mask()

std::unique_ptr<table> cudf::apply_boolean_mask ( table_view const &  input,
column_view const &  boolean_mask,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Filters input using boolean_mask of boolean values as a mask.

Given an input table_view and a mask column_view, an element i from each column_view of the input is copied to the corresponding output column if the corresponding element i in the mask is non-null and true. This operation is stable: the input order is preserved.

Note
if input.num_rows() is zero, there is no error, and an empty table is returned.
Exceptions
cudf::logic_error  if input.num_rows() != boolean_mask.size().
cudf::logic_error  if boolean_mask is not type_id::BOOL8 type.
Parameters
[in]  input         The input table_view to filter
[in]  boolean_mask  A nullable column_view of type type_id::BOOL8 used as a mask to filter the input.
[in]  mr            Device memory resource used to allocate the returned table's device memory
Returns
Table containing copy of all rows of input passing the filter defined by boolean_mask.
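
The masking rule above can be modeled on the host for a single column. The following is a minimal sketch of the semantics in plain C++, not the cuDF implementation: apply_mask_model is a hypothetical name, a std::vector<int> stands in for one input column, and std::optional<bool> stands in for a nullable BOOL8 mask entry.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model of the masking rule for one column.
// A null mask entry is modeled as std::nullopt.
std::vector<int> apply_mask_model(std::vector<int> const& column,
                                  std::vector<std::optional<bool>> const& mask)
{
  // Mirrors the documented size check between input rows and mask entries.
  if (column.size() != mask.size()) {
    throw std::logic_error("column and mask sizes differ");
  }
  std::vector<int> out;
  for (std::size_t i = 0; i < column.size(); ++i) {
    // Keep element i only if the mask entry is non-null and true.
    if (mask[i].has_value() && *mask[i]) { out.push_back(column[i]); }
  }
  return out;  // elements appear in input order (stable)
}
```

In this model a null mask entry behaves like false: the row is dropped, and the relative order of the surviving rows is unchanged.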

◆ distinct()

std::unique_ptr<table> cudf::distinct ( table_view const &  input,
std::vector< size_type > const &  keys,
null_equality  nulls_equal = null_equality::EQUAL,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Create a new table without duplicate rows.

Given an input table_view, each row is copied to output table if the corresponding row of keys columns is distinct (no other equivalent row exists in the table). If duplicate rows are present, it is unspecified which row is copied.

The order of elements in the output table is not specified.

Performance hints:

  • Always use cudf::unique instead of cudf::distinct if the input is pre-sorted
  • If the input is not pre-sorted and the behavior of pandas.DataFrame.drop_duplicates is desired:
Parameters
[in]  input        Input table_view from which only distinct rows are copied
[in]  keys         Vector of indices representing key columns from input
[in]  nulls_equal  Flag denoting how nulls compare: equal if null_equality::EQUAL, not equal if null_equality::UNEQUAL
[in]  mr           Device memory resource used to allocate the returned table's device memory
Returns
Table with distinct rows in an unspecified order.
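
As an illustration of the semantics, the sketch below models distinct on the host for a single integer key column. It is plain C++, not the cuDF implementation, and distinct_rows_model is a hypothetical name. The model keeps the first occurrence of each key and returns row indices in input order only because that is the simplest thing to write; cudf::distinct itself leaves both the retained row and the output order unspecified.

```cpp
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical host-side model: keep one row index per distinct key value.
// Keeping the first occurrence is a choice of this sketch only.
std::vector<std::size_t> distinct_rows_model(std::vector<int> const& keys)
{
  std::unordered_set<int> seen;
  std::vector<std::size_t> kept;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    // insert() reports whether the key was new; only new keys keep their row.
    if (seen.insert(keys[i]).second) { kept.push_back(i); }
  }
  return kept;
}
```

Unlike a consecutive-duplicate filter, this removes repeated keys wherever they occur in the column, which is why no pre-sorting is needed.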

◆ distinct_count() [1/2]

cudf::size_type cudf::distinct_count ( column_view const &  input,
null_policy  null_handling,
nan_policy  nan_handling 
)

Count the distinct elements in the column_view.

If nulls_equal == null_equality::UNEQUAL, all nulls are distinct.

Given an input column_view, number of distinct elements in this column_view is returned.

If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored, NaN is considered in distinct count.

nulls are handled as equal.

Parameters
[in]  input          The column_view whose distinct elements will be counted
[in]  null_handling  Flag to include or ignore null while counting
[in]  nan_handling   Flag selecting whether NaN is treated as null (nan_policy::NAN_IS_NULL) or as a valid value (nan_policy::NAN_IS_VALID)
Returns
number of distinct elements in the column
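
The interaction of the two policy flags can be sketched on the host. The code below is a plain C++ model, not the cuDF implementation: distinct_count_model, null_policy_model and nan_policy_model are hypothetical stand-ins, and std::nullopt models a null. It also assumes that all NaN values count as a single distinct value when NaN is valid, which the text above does not state.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

enum class null_policy_model { exclude, include };          // stand-in for cudf::null_policy
enum class nan_policy_model { nan_is_null, nan_is_valid };  // stand-in for cudf::nan_policy

// Hypothetical host-side model of the column overload.
std::size_t distinct_count_model(std::vector<std::optional<double>> const& column,
                                 null_policy_model null_handling,
                                 nan_policy_model nan_handling)
{
  std::set<double> values;  // distinct values that are neither null nor NaN
  bool saw_null = false;
  bool saw_nan  = false;
  for (auto const& element : column) {
    if (!element.has_value()) {
      saw_null = true;
    } else if (std::isnan(*element)) {
      // NAN_IS_NULL folds NaN into the null bucket; otherwise NaN is a value.
      if (nan_handling == nan_policy_model::nan_is_null) {
        saw_null = true;
      } else {
        saw_nan = true;
      }
    } else {
      values.insert(*element);
    }
  }
  std::size_t count = values.size();
  if (saw_nan) { ++count; }  // assumption of this sketch: all NaNs are one value
  // Nulls are handled as equal, so they add at most one to the count.
  if (saw_null && null_handling == null_policy_model::include) { ++count; }
  return count;
}
```

For a column holding 1.0, null, NaN, 1.0, 2.0, NaN, the model counts 2 with exclude plus nan_is_null, 3 with exclude plus nan_is_valid, and 4 with include plus nan_is_valid.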

◆ distinct_count() [2/2]

cudf::size_type cudf::distinct_count ( table_view const &  input,
null_equality  nulls_equal = null_equality::EQUAL 
)

Count the distinct rows in a table.

Parameters
[in]  input        Table whose distinct rows will be counted
[in]  nulls_equal  Flag to denote if null elements should be considered equal. Nulls are not equal if null_equality::UNEQUAL.
Returns
number of distinct rows in the table

◆ drop_nans() [1/2]

std::unique_ptr<table> cudf::drop_nans ( table_view const &  input,
std::vector< size_type > const &  keys,
cudf::size_type  keep_threshold,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Filters a table to remove NANs with threshold count.

Filters the rows of the input considering specified columns indicated in keys for NANs. These key columns must be of floating-point type.

Given an input table_view, row i from the input columns is copied to the output if the same row i of keys has at least keep_threshold non-NAN elements.

This operation is stable: the input order is preserved in the output.

input  {col1: {1.0, 2.0,  3.0, NAN},
        col2: {4.0, null, NAN, NAN},
        col3: {7.0, NAN,  NAN, NAN}}
keys = {0, 1, 2} // All columns
keep_threshold = 2
output {col1: {1.0, 2.0}
        col2: {4.0, null}
        col3: {7.0, NAN}}
Note
if input.num_rows() is zero, or keys is empty, there is no error, and an empty table is returned
Exceptions
cudf::logic_error  if the keys columns are not of floating-point type.
Parameters
[in]  input           The input table_view to filter.
[in]  keys            Vector of indices representing key columns from input
[in]  keep_threshold  The minimum number of non-NAN elements in a row required to keep the row.
[in]  mr              Device memory resource used to allocate the returned table's device memory
Returns
Table containing all rows of the input with at least keep_threshold non-NAN elements in keys.
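
The threshold rule can be checked against the example above with a small host-side model. This is a sketch in plain C++, not the cuDF implementation: rows_kept_by_drop_nans and the column alias are hypothetical, std::nullopt models a null, and, following the example, a null is counted as a non-NAN element.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

using column = std::vector<std::optional<double>>;  // std::nullopt models a null

// Hypothetical host-side model: returns the indices of the rows that are kept.
// This sketch only models the case with at least one key column.
std::vector<std::size_t> rows_kept_by_drop_nans(std::vector<column> const& key_columns,
                                                std::size_t keep_threshold)
{
  std::vector<std::size_t> kept;
  if (key_columns.empty()) { return kept; }
  std::size_t const num_rows = key_columns.front().size();
  for (std::size_t row = 0; row < num_rows; ++row) {
    std::size_t non_nan = 0;
    for (auto const& col : key_columns) {
      // Only a present value that is NaN counts against the row.
      bool const is_nan = col[row].has_value() && std::isnan(*col[row]);
      if (!is_nan) { ++non_nan; }
    }
    if (non_nan >= keep_threshold) { kept.push_back(row); }
  }
  return kept;  // ascending row indices, so input order is preserved
}
```

With the three key columns from the example and keep_threshold = 2, the model keeps rows 0 and 1, matching the output shown.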

◆ drop_nans() [2/2]

std::unique_ptr<table> cudf::drop_nans ( table_view const &  input,
std::vector< size_type > const &  keys,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Filters a table to remove NANs.

Filters the rows of the input considering specified columns indicated in keys for NANs. These key columns must be of floating-point type.

input  {col1: {1.0,  2.0, 3.0, NAN},
        col2: {4.0,  null, NAN, NAN},
        col3: {null, NAN,  NAN, NAN}}
keys = {0, 1, 2} // All columns
output {col1: {1.0}
        col2: {4.0}
        col3: {null}}

Same as drop_nans but defaults keep_threshold to the number of columns in keys.

Parameters
[in]  input  The input table_view to filter.
[in]  keys   Vector of indices representing key columns from input
[in]  mr     Device memory resource used to allocate the returned table's device memory
Returns
Table containing all rows of the input without NANs in the columns of keys.

◆ drop_nulls() [1/2]

std::unique_ptr<table> cudf::drop_nulls ( table_view const &  input,
std::vector< size_type > const &  keys,
cudf::size_type  keep_threshold,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Filters a table to remove null elements with threshold count.

Filters the rows of the input considering specified columns indicated in keys for validity / null values.

Given an input table_view, row i from the input columns is copied to the output if the same row i of keys has at least keep_threshold non-null fields.

This operation is stable: the input order is preserved in the output.

Any non-nullable column in the input is treated as all non-null.

input  {col1: {1, 2,    3,    null},
        col2: {4, 5,    null, null},
        col3: {7, null, null, null}}
keys = {0, 1, 2} // All columns
keep_threshold = 2
output {col1: {1, 2}
        col2: {4, 5}
        col3: {7, null}}
Note
if input.num_rows() is zero, there is no error, and an empty table is returned. If keys is empty or has no nulls, there is no error, no rows are dropped, and a copy of the input is returned.
Parameters
[in]  input           The input table_view to filter.
[in]  keys            Vector of indices representing key columns from input
[in]  keep_threshold  The minimum number of non-null fields in a row required to keep the row.
[in]  mr              Device memory resource used to allocate the returned table's device memory
Returns
Table containing all rows of the input with at least keep_threshold non-null fields in keys.

◆ drop_nulls() [2/2]

std::unique_ptr<table> cudf::drop_nulls ( table_view const &  input,
std::vector< size_type > const &  keys,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Filters a table to remove null elements.

Filters the rows of the input considering specified columns indicated in keys for validity / null values.

input  {col1: {1, 2,    3,    null},
        col2: {4, 5,    null, null},
        col3: {7, null, null, null}}
keys = {0, 1, 2} // All columns
output {col1: {1}
        col2: {4}
        col3: {7}}

Same as drop_nulls but defaults keep_threshold to the number of columns in keys.

Parameters
[in]  input  The input table_view to filter.
[in]  keys   Vector of indices representing key columns from input
[in]  mr     Device memory resource used to allocate the returned table's device memory
Returns
Table containing all rows of the input without nulls in the columns of keys.
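
The relationship between the two drop_nulls overloads can be sketched per row on the host. The code below is a plain C++ model, not the cuDF implementation: keeps_row and the row alias are hypothetical names, and std::nullopt models a null key field.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <vector>

using row = std::vector<std::optional<int>>;  // key fields of one row; std::nullopt models a null

// Hypothetical model of the threshold overload, applied to one row of key fields.
bool keeps_row(row const& key_fields, std::size_t keep_threshold)
{
  auto const non_null =
    std::count_if(key_fields.begin(), key_fields.end(),
                  [](std::optional<int> const& field) { return field.has_value(); });
  return static_cast<std::size_t>(non_null) >= keep_threshold;
}

// Models the overload without keep_threshold: the threshold defaults to the
// number of key columns, so every key field must be non-null.
bool keeps_row(row const& key_fields) { return keeps_row(key_fields, key_fields.size()); }
```

Applied to the rows of the examples above, the row {2, 5, null} passes with keep_threshold = 2 but is dropped by the overload without a threshold.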

◆ unique()

std::unique_ptr<table> cudf::unique ( table_view const &  input,
std::vector< size_type > const &  keys,
duplicate_keep_option  keep,
null_equality  nulls_equal = null_equality::EQUAL,
rmm::mr::device_memory_resource *  mr = rmm::mr::get_current_device_resource() 
)

Create a new table with consecutive duplicate rows removed.

Given an input table_view, one specific row from a group of equivalent elements is copied to output table depending on the value of keep:

  • KEEP_FIRST: only the first of a sequence of duplicate rows is copied
  • KEEP_LAST: only the last of a sequence of duplicate rows is copied
  • KEEP_NONE: no duplicate rows are copied

A row is distinct if there are no equivalent rows in the table. A row is unique if there is no adjacent equivalent row. That is, keeping distinct rows removes all duplicates in the table/column, while keeping unique rows only removes duplicates from consecutive groupings.

Performance hints:

  • Always use cudf::unique instead of cudf::distinct if the input is pre-sorted
  • If the input is not pre-sorted and the behavior of pandas.DataFrame.drop_duplicates is desired:
Exceptions
cudf::logic_error  if the keys column indices are out of bounds in the input table.
Parameters
[in]  input        Input table_view from which only unique rows are copied
[in]  keys         Vector of indices representing key columns from input
[in]  keep         Keep first row, last row, or no rows of the found duplicates
[in]  nulls_equal  Flag denoting how nulls compare: equal if null_equality::EQUAL, not equal if null_equality::UNEQUAL
[in]  mr           Device memory resource used to allocate the returned table's device memory
Returns
Table with unique rows from each sequence of equivalent rows as specified by keep.
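
The effect of each keep value on runs of consecutive duplicates can be sketched on the host for a single integer key column. This is a plain C++ model, not the cuDF implementation: unique_rows_model and keep_option are hypothetical stand-ins for cudf::unique and duplicate_keep_option.

```cpp
#include <cstddef>
#include <vector>

enum class keep_option { first, last, none };  // stand-in for duplicate_keep_option

// Hypothetical host-side model: returns the kept row indices after collapsing
// each run of consecutive equal key values according to `keep`.
std::vector<std::size_t> unique_rows_model(std::vector<int> const& keys, keep_option keep)
{
  std::vector<std::size_t> kept;
  std::size_t run_begin = 0;
  while (run_begin < keys.size()) {
    std::size_t run_end = run_begin + 1;  // one past the last row of the run
    while (run_end < keys.size() && keys[run_end] == keys[run_begin]) { ++run_end; }
    std::size_t const run_length = run_end - run_begin;
    if (keep == keep_option::first) {
      kept.push_back(run_begin);
    } else if (keep == keep_option::last) {
      kept.push_back(run_end - 1);
    } else if (run_length == 1) {
      kept.push_back(run_begin);  // none: only rows with no adjacent duplicate survive
    }
    run_begin = run_end;
  }
  return kept;
}
```

For keys {1, 1, 2, 1, 3, 3} the model keeps rows {0, 2, 3, 4} with first, {1, 2, 3, 5} with last, and {2, 3} with none. The second occurrence of 1 at row 3 survives in every case because it is not adjacent to the earlier run, which is the difference between unique and distinct described above.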

◆ unique_count() [1/2]

cudf::size_type cudf::unique_count ( column_view const &  input,
null_policy  null_handling,
nan_policy  nan_handling 
)

Count the number of consecutive groups of equivalent rows in a column.

If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored, NaN is considered in count.

nulls are handled as equal.

Parameters
[in]  input          The column_view whose consecutive groups of equivalent rows will be counted
[in]  null_handling  Flag to include or ignore null while counting
[in]  nan_handling   Flag selecting whether NaN is treated as null (nan_policy::NAN_IS_NULL) or as a valid value (nan_policy::NAN_IS_VALID)
Returns
number of consecutive groups of equivalent rows in the column

◆ unique_count() [2/2]

cudf::size_type cudf::unique_count ( table_view const &  input,
null_equality  nulls_equal = null_equality::EQUAL 
)

Count the number of consecutive groups of equivalent rows in a table.

Parameters
[in]  input        Table whose consecutive groups of equivalent rows will be counted
[in]  nulls_equal  Flag to denote if null elements should be considered equal. Nulls are not equal if null_equality::UNEQUAL.
Returns
number of consecutive groups of equivalent rows in the table
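
Counting consecutive groups, and the effect of nulls_equal on that count, can be sketched on the host for a single nullable key column. The code below is a plain C++ model, not the cuDF implementation: unique_count_model is a hypothetical name, std::nullopt models a null, and a bool stands in for null_equality (true for EQUAL, false for UNEQUAL).

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model: counts runs of consecutive equivalent rows.
std::size_t unique_count_model(std::vector<std::optional<int>> const& keys, bool nulls_equal)
{
  std::size_t groups = 0;
  for (std::size_t i = 0; i < keys.size(); ++i) {
    bool same_as_previous = false;
    if (i > 0) {
      auto const& prev = keys[i - 1];
      auto const& curr = keys[i];
      if (prev.has_value() && curr.has_value()) {
        same_as_previous = (*prev == *curr);
      } else if (!prev.has_value() && !curr.has_value()) {
        same_as_previous = nulls_equal;  // two nulls match only when nulls compare equal
      }
    }
    // A row that differs from its predecessor starts a new group.
    if (!same_as_previous) { ++groups; }
  }
  return groups;
}
```

For keys {1, 1, null, null, 1, 2, 2} the model counts 4 groups when nulls compare equal and 5 when they do not, since each null then forms its own group.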