Column Copy#

group column_copy

Enums

enum class out_of_bounds_policy : bool#

Policy to account for possible out-of-bounds indices.

NULLIFY means to nullify output values corresponding to out-of-bounds gather_map values. DONT_CHECK means do not check whether the indices are out-of-bounds, for better performance.

Values:

enumerator NULLIFY#

Output values corresponding to out-of-bounds indices are null.

enumerator DONT_CHECK#

No bounds checking is performed, which gives better performance.
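
The difference between the two policies can be pictured with a small host-side model. This is only a sketch in standard C++, not libcudf code: `oob_policy_model` and `gather_model` are hypothetical names, the "column" is a plain `std::vector<int>`, and a null is represented by `std::nullopt`. Indices are unsigned here so that "out of bounds" simply means `idx >= source.size()`.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Host-side model only; these names are hypothetical and not part of libcudf.
enum class oob_policy_model : bool { NULLIFY, DONT_CHECK };

// Builds out[i] = source[gather_map[i]].
// NULLIFY: an out-of-bounds index produces a null (std::nullopt).
// DONT_CHECK: no check is made, so the caller must guarantee valid indices.
std::vector<std::optional<int>> gather_model(std::vector<int> const& source,
                                             std::vector<std::size_t> const& gather_map,
                                             oob_policy_model policy)
{
  std::vector<std::optional<int>> out;
  out.reserve(gather_map.size());
  for (std::size_t idx : gather_map) {
    if (policy == oob_policy_model::NULLIFY && idx >= source.size()) {
      out.push_back(std::nullopt);  // out-of-bounds index becomes a null
    } else {
      out.push_back(source[idx]);  // in bounds, or unchecked under DONT_CHECK
    }
  }
  return out;
}
```

In this model, skipping the check saves one comparison per element, at the cost of undefined behavior if an index is actually out of range.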

enum class mask_allocation_policy : int32_t#

Indicates when to allocate a mask, based on an existing mask.

Values:

enumerator NEVER#

Do not allocate a null mask, regardless of input.

enumerator RETAIN#

Allocate a null mask if the input contains one.

enumerator ALWAYS#

Allocate a null mask, regardless of input.
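
The three enumerators amount to a small decision table. The sketch below restates it in plain C++; `mask_policy_model` and `should_allocate_mask` are hypothetical names used only for illustration and are not libcudf functions.

```cpp
// Host-side model only; these names are hypothetical and not part of libcudf.
enum class mask_policy_model { NEVER, RETAIN, ALWAYS };

// Returns whether a null mask would be allocated for the new column,
// given the policy and whether the input column already has a mask.
bool should_allocate_mask(mask_policy_model policy, bool input_has_mask)
{
  switch (policy) {
    case mask_policy_model::NEVER: return false;
    case mask_policy_model::RETAIN: return input_has_mask;
    case mask_policy_model::ALWAYS: return true;
  }
  return false;  // not reached for valid enumerators
}
```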

enum class sample_with_replacement : bool#

Indicates whether a row can be sampled more than once.

Values:

enumerator FALSE#

A row can be sampled only once.

enumerator TRUE#

A row can be sampled more than once.

Functions

std::unique_ptr<table> reverse(table_view const &source_table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Reverses the rows within a table.

Creates a new table that is the reverse of source_table. Example (each inner list is one column, so reversing the rows reverses every column):

source = [[4,5,6], [7,8,9], [10,11,12]]
return = [[6,5,4], [9,8,7], [12,11,10]]

Parameters:
  • source_table – Table that will be reversed

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned table’s device memory

Returns:

Reversed table

std::unique_ptr<column> reverse(column_view const &source_column, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Reverses the elements of a column.

Creates a new column that is the reverse of source_column. Example:

source = [4,5,6]
return = [6,5,4]

Parameters:
  • source_column – Column that will be reversed

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

Reversed column
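
The two reverse overloads can be modeled on the host as follows. This is a sketch in standard C++ rather than a libcudf call: `column_model`, `table_model`, and `reverse_model` are hypothetical names, and a table is assumed to be a list of equally sized columns, as in the examples above.

```cpp
#include <vector>

// Host-side model only; these names are hypothetical and not part of libcudf.
using column_model = std::vector<int>;
using table_model  = std::vector<column_model>;  // one inner vector per column

// Returns a new column whose elements are in reverse order.
column_model reverse_model(column_model const& source)
{
  return column_model(source.rbegin(), source.rend());
}

// Reversing the rows of a table reverses every column independently.
table_model reverse_model(table_model const& source)
{
  table_model out;
  out.reserve(source.size());
  for (auto const& col : source) { out.push_back(reverse_model(col)); }
  return out;
}
```

Applied to the table example above, the model turns the columns `[4,5,6]`, `[7,8,9]`, `[10,11,12]` into `[6,5,4]`, `[9,8,7]`, `[12,11,10]`.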

std::unique_ptr<column> empty_like(column_view const &input)#

Initializes and returns an empty column of the same type as the input.

Parameters:

input – Immutable view of input column to emulate

Returns:

An empty column of same type as input

std::unique_ptr<column> empty_like(scalar const &input)#

Initializes and returns an empty column of the same type as the input.

Parameters:

input – Scalar to emulate

Returns:

An empty column of same type as input

std::unique_ptr<column> allocate_like(column_view const &input, mask_allocation_policy mask_alloc = mask_allocation_policy::RETAIN, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Creates an uninitialized new column of the same size and type as the input.

Supports only fixed-width types.

If mask_alloc causes a validity mask to be allocated, that mask is also uninitialized, and the caller should set the validity bits and the null count.

Throws:

cudf::data_type_error – if input type is not of fixed width.

Parameters:
  • input – Immutable view of input column to emulate

  • mask_alloc – Optional policy for allocating the null mask. Defaults to RETAIN

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A column with sufficient uninitialized capacity to hold the same number of elements as input, with the same type as input.type()

std::unique_ptr<column> allocate_like(column_view const &input, size_type size, mask_allocation_policy mask_alloc = mask_allocation_policy::RETAIN, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Creates an uninitialized new column of the specified size and same type as the input.

Supports only fixed-width types.

If mask_alloc causes a validity mask to be allocated, that mask is also uninitialized, and the caller should set the validity bits and the null count.

Parameters:
  • input – Immutable view of input column to emulate

  • size – The desired number of elements that the new column should have capacity for

  • mask_alloc – Optional policy for allocating the null mask. Defaults to RETAIN

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A column with sufficient uninitialized capacity to hold the specified number of elements, with the same type as input.type()

std::unique_ptr<table> empty_like(table_view const &input_table)#

Creates a table of empty columns with the same types as the columns in input_table.

Creates the cudf::column objects, but does not allocate any underlying device memory for the column’s data or bitmask.

Parameters:

input_table – Immutable view of input table to emulate

Returns:

A table of empty columns with the same types as the columns in input_table

void copy_range_in_place(column_view const &source, mutable_column_view &target, size_type source_begin, size_type source_end, size_type target_begin, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Copies a range of elements in-place from one column to another.

Overwrites the range of elements in target indicated by the indices [target_begin, target_begin + N) with the elements from source indicated by the indices [source_begin, source_end) (where N = (source_end - source_begin)). Use the out-of-place copy function returning std::unique_ptr<column> for use cases requiring memory reallocation, for example strings columns and other variable-width types.

If source and target refer to the same elements and the ranges overlap, the behavior is undefined.

Throws:
  • cudf::data_type_error – if memory reallocation is required (e.g. for variable width types).

  • std::out_of_range – for invalid range (if source_begin > source_end, source_begin < 0, source_begin >= source.size(), source_end > source.size(), target_begin < 0, target_begin >= target.size(), or target_begin + (source_end - source_begin) > target.size()).

  • cudf::data_type_error – if target and source have different types.

  • std::invalid_argument – if source has null values and target is not nullable.

Parameters:
  • source – The column to copy from

  • target – The preallocated column to copy into

  • source_begin – The starting index of the source range (inclusive)

  • source_end – The end index of the source range (exclusive)

  • target_begin – The starting index of the target range (inclusive)

  • stream – CUDA stream used for device memory operations and kernel launches
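
The range rules listed under Throws can be restated as a host-side model. This is a sketch in standard C++, not the libcudf implementation: `copy_range_in_place_model` is a hypothetical name, columns are plain `std::vector<int>`, and nulls and type checks are left out. Only the `std::out_of_range` conditions from the list above are modeled.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Host-side model only; this name is hypothetical and not part of libcudf.
// Overwrites target[target_begin, target_begin + N) with
// source[source_begin, source_end), where N = source_end - source_begin.
void copy_range_in_place_model(std::vector<int> const& source,
                               std::vector<int>& target,
                               int source_begin,
                               int source_end,
                               int target_begin)
{
  auto const source_size = static_cast<int>(source.size());
  auto const target_size = static_cast<int>(target.size());
  int const n            = source_end - source_begin;

  // The same conditions the documentation lists for std::out_of_range.
  bool const invalid = source_begin > source_end || source_begin < 0 ||
                       source_begin >= source_size || source_end > source_size ||
                       target_begin < 0 || target_begin >= target_size ||
                       target_begin + n > target_size;
  if (invalid) { throw std::out_of_range("invalid copy range"); }

  for (int i = 0; i < n; ++i) {
    target[static_cast<std::size_t>(target_begin + i)] =
      source[static_cast<std::size_t>(source_begin + i)];
  }
}
```

For instance, copying the source range [1, 3) of `{1, 2, 3, 4}` into a five-element target at `target_begin = 2` overwrites exactly the target elements at indices 2 and 3.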

std::unique_ptr<column> copy_range(column_view const &source, column_view const &target, size_type source_begin, size_type source_end, size_type target_begin, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Copies a range of elements out-of-place from one column to another.

Creates a new column as if an in-place copy was performed into target. A copy of target is created first, and then the elements indicated by the indices [target_begin, target_begin + N) are overwritten with the elements indicated by the indices [source_begin, source_end) of source (where N = (source_end - source_begin)). Elements outside the range are copied from target into the returned column.

If source and target refer to the same elements and the ranges overlap, the behavior is undefined.

A range is considered invalid if:

  • Either the begin or end indices are out of bounds for the corresponding column

  • Begin is greater than end for source or target

  • The size of the source range would overflow the target column starting at target_begin

Throws:
  • std::out_of_range – for any invalid range.

  • cudf::data_type_error – if target and source have different types.

  • cudf::data_type_error – if the data type is not fixed width, string, or dictionary

Parameters:
  • source – The column to copy from inside the range

  • target – The column to copy from outside the range

  • source_begin – The starting index of the source range (inclusive)

  • source_end – The end index of the source range (exclusive)

  • target_begin – The starting index of the target range (inclusive)

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

The result target column
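
The "copy target first, then overwrite the range" behavior can be sketched on the host. This is a model in standard C++ under simplifying assumptions, not the libcudf implementation: `copy_range_model` is a hypothetical name, columns are plain `std::vector<int>`, nulls are ignored, and the range check is a simplified version of the conditions listed above.

```cpp
#include <algorithm>
#include <stdexcept>
#include <vector>

// Host-side model only; this name is hypothetical and not part of libcudf.
// Returns a copy of target in which [target_begin, target_begin + N) has been
// replaced by source[source_begin, source_end). Neither input is modified.
std::vector<int> copy_range_model(std::vector<int> const& source,
                                  std::vector<int> const& target,
                                  int source_begin,
                                  int source_end,
                                  int target_begin)
{
  auto const source_size = static_cast<int>(source.size());
  auto const target_size = static_cast<int>(target.size());
  int const n            = source_end - source_begin;

  // Simplified validity check (an assumption of this sketch).
  if (source_begin < 0 || source_begin > source_end || source_end > source_size ||
      target_begin < 0 || target_begin + n > target_size) {
    throw std::out_of_range("invalid copy range");
  }

  std::vector<int> result(target);  // elements outside the range come from target
  std::copy(source.begin() + source_begin,
            source.begin() + source_end,
            result.begin() + target_begin);
  return result;
}
```

Because the result is a fresh allocation, this form also suits cases where the copied elements change the amount of memory needed, which is why the in-place variant defers to it for variable-width types.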

std::unique_ptr<column> copy_if_else(column_view const &lhs, column_view const &rhs, column_view const &boolean_mask, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Returns a new column, where each element is selected from either lhs or rhs based on the value of the corresponding element in boolean_mask.

Selects each element i in the output column from either rhs or lhs using the following rule: output[i] = (boolean_mask.valid(i) and boolean_mask[i]) ? lhs[i] : rhs[i]

Throws:
  • cudf::data_type_error – if lhs and rhs are not of the same type

  • std::invalid_argument – if lhs and rhs are not of the same length

  • cudf::data_type_error – if boolean mask is not of type bool

  • std::invalid_argument – if boolean mask is not of the same length as lhs and rhs

Parameters:
  • lhs – left-hand column_view

  • rhs – right-hand column_view

  • boolean_mask – column of type_id::BOOL8 representing “left (true) / right (false)” boolean for each element. Null element represents false.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

new column with the selected elements
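
The selection rule, including the treatment of null mask entries as false, can be modeled on the host. This is a sketch in standard C++ and not a libcudf call: `copy_if_else_model` is a hypothetical name, the data columns are plain `std::vector<int>` without nulls, and a null mask entry is represented by `std::nullopt`.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Host-side model only; this name is hypothetical and not part of libcudf.
// output[i] = (mask[i] is valid and true) ? lhs[i] : rhs[i]
std::vector<int> copy_if_else_model(std::vector<int> const& lhs,
                                    std::vector<int> const& rhs,
                                    std::vector<std::optional<bool>> const& boolean_mask)
{
  if (lhs.size() != rhs.size() || boolean_mask.size() != lhs.size()) {
    throw std::invalid_argument("lhs, rhs and boolean_mask must have the same length");
  }
  std::vector<int> out(lhs.size());
  for (std::size_t i = 0; i < lhs.size(); ++i) {
    // A null mask entry (std::nullopt) selects rhs, exactly like false.
    bool const take_lhs = boolean_mask[i].has_value() && *boolean_mask[i];
    out[i]              = take_lhs ? lhs[i] : rhs[i];
  }
  return out;
}
```

With `lhs = {1, 2, 3, 4}`, `rhs = {10, 20, 30, 40}`, and a mask of `{true, false, null, true}`, the model yields `{1, 20, 30, 4}`. The scalar overloads follow the same rule with `lhs[i]` or `rhs[i]` replaced by the scalar value.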

std::unique_ptr<column> copy_if_else(scalar const &lhs, column_view const &rhs, column_view const &boolean_mask, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Returns a new column, where each element is selected from either lhs or rhs based on the value of the corresponding element in boolean_mask.

Selects each element i in the output column from either rhs or lhs using the following rule: output[i] = (boolean_mask.valid(i) and boolean_mask[i]) ? lhs : rhs[i]

Throws:
  • cudf::data_type_error – if lhs and rhs are not of the same type

  • cudf::data_type_error – if boolean mask is not of type bool

  • std::invalid_argument – if boolean mask is not of the same length as rhs

Parameters:
  • lhs – left-hand scalar

  • rhs – right-hand column_view

  • boolean_mask – column of type_id::BOOL8 representing “left (true) / right (false)” boolean for each element. Null element represents false.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

new column with the selected elements

std::unique_ptr<column> copy_if_else(column_view const &lhs, scalar const &rhs, column_view const &boolean_mask, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Returns a new column, where each element is selected from either lhs or rhs based on the value of the corresponding element in boolean_mask.

Selects each element i in the output column from either rhs or lhs using the following rule: output[i] = (boolean_mask.valid(i) and boolean_mask[i]) ? lhs[i] : rhs

Throws:
  • cudf::data_type_error – if lhs and rhs are not of the same type

  • cudf::data_type_error – if boolean mask is not of type bool

  • std::invalid_argument – if boolean mask is not of the same length as lhs

Parameters:
  • lhs – left-hand column_view

  • rhs – right-hand scalar

  • boolean_mask – column of type_id::BOOL8 representing “left (true) / right (false)” boolean for each element. Null element represents false.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

new column with the selected elements

std::unique_ptr<column> copy_if_else(scalar const &lhs, scalar const &rhs, column_view const &boolean_mask, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Returns a new column, where each element is selected from either lhs or rhs based on the value of the corresponding element in boolean_mask.

Selects each element i in the output column from either rhs or lhs using the following rule: output[i] = (boolean_mask.valid(i) and boolean_mask[i]) ? lhs : rhs

Throws:

cudf::logic_error – if boolean mask is not of type bool

Parameters:
  • lhs – left-hand scalar

  • rhs – right-hand scalar

  • boolean_mask – column of type_id::BOOL8 representing “left (true) / right (false)” boolean for each element. Null element represents false.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

new column with the selected elements

std::unique_ptr<scalar> get_element(column_view const &input, size_type index, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Get the element at specified index from a column.

Warning

This function is expensive because it invokes a kernel launch. It is not recommended for performance-sensitive code or for use inside a loop.

Throws:

std::out_of_range – if index is not within the range [0, input.size())

Parameters:
  • input – Column view to get the element from

  • index – Index into input to get the element at

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned scalar’s device memory

Returns:

Scalar containing the single value

std::unique_ptr<table> sample(table_view const &input, size_type const n, sample_with_replacement replacement = sample_with_replacement::FALSE, int64_t const seed = 0, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Gather n samples from given input randomly.

Example:

input: {col1: {1, 2, 3, 4, 5}, col2: {6, 7, 8, 9, 10}}
n: 3

replacement: false
output: {col1: {3, 1, 4}, col2: {8, 6, 9}}

replacement: true
output: {col1: {3, 1, 1}, col2: {8, 6, 6}}

Throws:
Parameters:
  • input – View of a table to sample

  • n – non-negative number of samples expected from input

  • replacement – Allow or disallow sampling of the same row more than once

  • seed – Seed value used to initialize the random number generator

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned table’s device memory

Returns:

Table containing samples from input
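
The effect of the replacement flag can be illustrated with a host-side model that picks row indices. This is a sketch in standard C++, not the libcudf implementation: `sample_rows_model` is a hypothetical name, it uses `std::mt19937_64` (so it will not reproduce the rows libcudf selects for a given seed), and the exceptions it throws are choices made for this sketch rather than documented libcudf behavior.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <random>
#include <stdexcept>
#include <vector>

// Host-side model only; this name is hypothetical and not part of libcudf.
// Returns n row indices in [0, num_rows). Each output row of a sampled table
// would be the input row at the corresponding index.
std::vector<std::size_t> sample_rows_model(std::size_t num_rows,
                                           std::size_t n,
                                           bool with_replacement,
                                           std::int64_t seed = 0)
{
  std::mt19937_64 rng(static_cast<std::uint64_t>(seed));
  std::vector<std::size_t> picked;
  picked.reserve(n);

  if (with_replacement) {
    // Every draw is independent, so the same row may appear more than once.
    if (num_rows == 0 && n > 0) { throw std::invalid_argument("cannot sample from no rows"); }
    if (n == 0) { return picked; }
    std::uniform_int_distribution<std::size_t> dist(0, num_rows - 1);
    for (std::size_t i = 0; i < n; ++i) { picked.push_back(dist(rng)); }
  } else {
    // Without replacement at most num_rows distinct rows exist
    // (throwing here is an assumption of this sketch).
    if (n > num_rows) { throw std::invalid_argument("n exceeds the number of rows"); }
    std::vector<std::size_t> all(num_rows);
    std::iota(all.begin(), all.end(), std::size_t{0});
    std::shuffle(all.begin(), all.end(), rng);
    picked.assign(all.begin(), all.begin() + static_cast<std::ptrdiff_t>(n));
  }
  return picked;
}
```

Shuffling all indices and keeping the first n is the simplest way to guarantee distinct rows; it is used here for clarity, not because libcudf is known to do the same.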

bool has_nonempty_nulls(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Checks if a column or its descendants have non-empty null rows.

A LIST or STRING column might have non-empty rows that are marked as null. A STRUCT or LIST column might have child columns that have non-empty null rows. Other types of columns are deemed incapable of having non-empty null rows. For example, fixed-width columns have no concept of an “empty” row.

Note

This function is exact. If it returns true, at least one non-empty null element exists.

Parameters:
  • input – The column which is (and whose descendants are) to be checked for non-empty null rows.

  • stream – CUDA stream used for device memory operations and kernel launches

Returns:

true if either the column or its descendants have non-empty null rows, false otherwise
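
What "non-empty null row" means for a LIST column can be made concrete with a host-side model. This is a sketch in standard C++, not libcudf code: `lists_model` and `has_nonempty_nulls_model` are hypothetical names, and the model covers a single flat list column only, without recursing into descendants as the real function is described to do.

```cpp
#include <cstddef>
#include <vector>

// Host-side model only; these names are hypothetical and not part of libcudf.
struct lists_model {
  std::vector<bool> validity;  // validity[i] == false marks row i as null
  std::vector<int> offsets;    // row i spans child[offsets[i], offsets[i + 1])
  std::vector<int> child;      // concatenated elements of all rows
};

// A row is a non-empty null if it is null yet still covers child elements.
bool has_nonempty_nulls_model(lists_model const& col)
{
  for (std::size_t row = 0; row < col.validity.size(); ++row) {
    bool const is_null     = !col.validity[row];
    bool const is_nonempty = col.offsets[row + 1] > col.offsets[row];
    if (is_null && is_nonempty) { return true; }
  }
  return false;
}
```

A column with validity `101` and offsets `[0, 2, 4, 6]` has a non-empty null at row 1, whereas the same validity with offsets `[0, 2, 2, 4]` does not.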

bool may_have_nonempty_nulls(column_view const &input)#

Approximates if a column or its descendants may have non-empty null elements.

False positives are possible, but false negatives are not.

Compared to the exact has_nonempty_nulls() function, this function is typically more efficient.

Complexity:

  • Best case: O(count_descendants(input))

  • Worst case: O(count_descendants(input) * m), where m is the number of rows in the largest descendant

Note

This function is approximate.

  • true: Non-empty null elements could exist

  • false: Non-empty null elements definitely do not exist

Parameters:

input – The column which is (and whose descendants are) to be checked for non-empty null rows

Returns:

true if either the column or its descendants have null rows, false if neither the column nor its descendants have null rows
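
How a false positive can arise is easiest to see in a host-side model. The sketch below is standard C++ and not libcudf code: `lists_model` and `may_have_nonempty_nulls_model` are hypothetical names, and the model assumes, following the return description above, that the approximation reports true whenever any row is null. It covers one flat list column and does not visit descendants.

```cpp
#include <vector>

// Host-side model only; these names are hypothetical and not part of libcudf.
struct lists_model {
  std::vector<bool> validity;  // validity[i] == false marks row i as null
  std::vector<int> offsets;    // row i spans child[offsets[i], offsets[i + 1])
  std::vector<int> child;      // concatenated elements of all rows
};

// Approximation: any null row means a non-empty null *could* exist.
// The offsets are never inspected, which is what makes the check cheap.
bool may_have_nonempty_nulls_model(lists_model const& col)
{
  for (bool valid : col.validity) {
    if (!valid) { return true; }
  }
  return false;
}
```

For a column with validity `101` and offsets `[0, 2, 2, 4]`, the model returns true even though the null row is empty: a false positive. It can never return false when a non-empty null exists, because every non-empty null is first of all a null.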

std::unique_ptr<column> purge_nonempty_nulls(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Copy input into output while purging any non-empty null rows in the column or its descendants.

If the input column is not of compound type (LIST/STRING/STRUCT/DICTIONARY), the output will be the same as input.

The purge operation only applies directly to LIST and STRING columns, but it applies indirectly to STRUCT/DICTIONARY columns as well, since these columns may have child columns that are LIST or STRING.

Examples:

auto const lists   = lists_column_wrapper<int32_t>{ {0,1}, {2,3}, {4,5} }.release();
cudf::set_null_mask(lists->null_mask(), 1, 2, false);

lists[1] is now null, but the lists child column still stores `{2,3}`.
The lists column contents will be:
  Validity: 101
  Offsets:  [0, 2, 4, 6]
  Child:    [0, 1, 2, 3, 4, 5]

After purging the contents of the list's null rows, the column's contents will be:
  Validity: 101
  Offsets:  [0, 2, 2, 4]
  Child:    [0, 1, 4, 5]

auto const strings = strings_column_wrapper{ "AB", "CD", "EF" }.release();
cudf::set_null_mask(strings->null_mask(), 1, 2, false);

strings[1] is now null, but the strings column still stores `"CD"`.
The strings column contents will be:
  Validity: 101
  Offsets:  [0, 2, 4, 6]
  Child:    [A, B, C, D, E, F]

After purging the contents of the strings column's null rows, the column's contents will be:
  Validity: 101
  Offsets:  [0, 2, 2, 4]
  Child:    [A, B, E, F]

auto const lists   = lists_column_wrapper<int32_t>{ {0,1}, {2,3}, {4,5} };
auto const structs = structs_column_wrapper{ {lists}, null_at(1) };

structs[1].child is now null, but the lists column still stores `{2,3}`.
The lists column contents will be:
  Validity: 101
  Offsets:  [0, 2, 4, 6]
  Child:    [0, 1, 2, 3, 4, 5]

After purging the contents of the list's null rows, the column's contents will be:
  Validity: 101
  Offsets:  [0, 2, 2, 4]
  Child:    [0, 1, 4, 5]

Parameters:
  • input – The column whose null rows are to be checked and purged

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A new column with equivalent contents to input, but with null rows purged
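
The transformation shown in the lists example can be reproduced with a host-side model. This is a sketch in standard C++, not the libcudf implementation: `lists_model` and `purge_nonempty_nulls_model` are hypothetical names, and the model handles one flat list column of integers without nested descendants.

```cpp
#include <cstddef>
#include <vector>

// Host-side model only; these names are hypothetical and not part of libcudf.
struct lists_model {
  std::vector<bool> validity;  // validity[i] == false marks row i as null
  std::vector<int> offsets;    // row i spans child[offsets[i], offsets[i + 1])
  std::vector<int> child;      // concatenated elements of all rows
};

// Rebuilds offsets and child so that every null row becomes empty.
// Validity is carried over unchanged.
lists_model purge_nonempty_nulls_model(lists_model const& in)
{
  lists_model out;
  out.validity = in.validity;
  out.offsets.push_back(0);
  for (std::size_t row = 0; row < in.validity.size(); ++row) {
    if (in.validity[row]) {
      // Valid rows keep their elements; null rows contribute nothing.
      out.child.insert(out.child.end(),
                       in.child.begin() + in.offsets[row],
                       in.child.begin() + in.offsets[row + 1]);
    }
    out.offsets.push_back(static_cast<int>(out.child.size()));
  }
  return out;
}
```

Given validity `101`, offsets `[0, 2, 4, 6]`, and child `[0, 1, 2, 3, 4, 5]`, the model returns offsets `[0, 2, 2, 4]` and child `[0, 1, 4, 5]`, matching the first example above.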