Set Operations#

group set_operations

Functions

std::unique_ptr<column> have_overlap(lists_column_view const &lhs, lists_column_view const &rhs, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Check if lists at each row of the given lists columns overlap.

Given two input lists columns, each list row in one column is checked if it has any common elements with the corresponding row of the other column.

A null input row in any of the input lists columns will result in a null output row.

Example:

lhs    = { {0, 1, 2}, {1, 2, 3}, null,         {4, null, 5} }
rhs    = { {1, 2, 3}, {4, 5},    {null, 7, 8}, {null, null} }
result = { true, false, null, true }

Throws:
  • cudf::logic_error – if the input lists columns have different sizes.

  • cudf::logic_error – if children of the input lists columns have different data types.

Parameters:
  • lhs – The input lists column for one side

  • rhs – The input lists column for the other side

  • nulls_equal – Flag to specify whether null elements should be considered as equal, default to be UNEQUAL which means only non-null elements are checked for overlapping

  • nans_equal – Flag to specify whether floating-point NaNs should be considered as equal

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned object

Returns:

A column of type BOOL containing the check results

std::unique_ptr<column> intersect_distinct(lists_column_view const &lhs, lists_column_view const &rhs, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Create a lists column of distinct elements common to two input lists columns.

Given two input lists columns lhs and rhs, an output lists column is created in a way such that each of its row i contains a list of distinct elements that can be found in both lhs[i] and rhs[i].

The order of distinct elements in the output rows is unspecified.

A null input row in any of the input lists columns will result in a null output row.

Example:

lhs    = { {2, 1, 2}, {1, 2, 3}, null,         {4, null, 5} }
rhs    = { {1, 2, 3}, {4, 5},    {null, 7, 8}, {null, null} }
result = { {1, 2}, {}, null, {null} }

Throws:
  • cudf::logic_error – if the input lists columns have different sizes.

  • cudf::logic_error – if children of the input lists columns have different data types.

Parameters:
  • lhs – The input lists column for one side

  • rhs – The input lists column for the other side

  • nulls_equal – Flag to specify whether null elements should be considered as equal

  • nans_equal – Flag to specify whether floating-point NaNs should be considered as equal

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned object

Returns:

A lists column containing the intersection results

std::unique_ptr<column> union_distinct(lists_column_view const &lhs, lists_column_view const &rhs, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Create a lists column of distinct elements found in either of two input lists columns.

Given two input lists columns lhs and rhs, an output lists column is created in a way such that each of its row i contains a list of distinct elements that can be found in either lhs[i] or rhs[i].

The order of distinct elements in the output rows is unspecified.

A null input row in any of the input lists columns will result in a null output row.

Example:

lhs    = { {2, 1, 2}, {1, 2, 3}, null,         {4, null, 5} }
rhs    = { {1, 2, 3}, {4, 5},    {null, 7, 8}, {null, null} }
result = { {1, 2, 3}, {1, 2, 3, 4, 5}, null, {4, null, 5} }

Throws:
  • cudf::logic_error – if the input lists columns have different sizes.

  • cudf::logic_error – if children of the input lists columns have different data types.

Parameters:
  • lhs – The input lists column for one side

  • rhs – The input lists column for the other side

  • nulls_equal – Flag to specify whether null elements should be considered as equal

  • nans_equal – Flag to specify whether floating-point NaNs should be considered as equal

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned object

Returns:

A lists column containing the union results

std::unique_ptr<column> difference_distinct(lists_column_view const &lhs, lists_column_view const &rhs, null_equality nulls_equal = null_equality::EQUAL, nan_equality nans_equal = nan_equality::ALL_EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Create a lists column of distinct elements found only in the left input column.

Given two input lists columns lhs and rhs, an output lists column is created in a way such that each of its row i contains a list of distinct elements that can be found in lhs[i] but are not found in rhs[i].

The order of distinct elements in the output rows is unspecified.

A null input row in any of the input lists columns will result in a null output row.

Example:

lhs    = { {2, 1, 2}, {1, 2, 3}, null,         {4, null, 5} }
rhs    = { {1, 2, 3}, {4, 5},    {null, 7, 8}, {null, null} }
result = { {}, {1, 2, 3}, null, {4, 5} }

Throws:
  • cudf::logic_error – if the input lists columns have different sizes.

  • cudf::logic_error – if children of the input lists columns have different data types.

Parameters:
  • lhs – The input lists column of elements that may be included

  • rhs – The input lists column of elements to exclude

  • nulls_equal – Flag to specify whether null elements should be considered as equal

  • nans_equal – Flag to specify whether floating-point NaNs should be considered as equal

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned object

Returns:

A lists column containing the difference results