Transformation Transform

group transformation_transform

Functions

std::unique_ptr<column> transform(column_view const &input, std::string const &unary_udf, data_type output_type, bool is_ptx, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Creates a new column by applying a unary function against every element of an input column.

Computes: out[i] = F(in[i])

The output null mask is the same as the input null mask, so if input[i] is null then output[i] is also null.

Parameters:
  • input – An immutable view of the input column to transform

  • unary_udf – The PTX/CUDA string of the unary function to apply

  • output_type – The output type that is compatible with the output type in the UDF

  • is_ptx – true: the UDF is treated as PTX code; false: the UDF is treated as CUDA code

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

The column resulting from applying the unary function to every element of the input
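
The rule out[i] = F(in[i]) with an unchanged null mask can be illustrated on the host. The following is a minimal sketch, not libcudf code: `host_column` and `transform_reference` are hypothetical names, `std::nullopt` stands in for a null row, and F is an ordinary C++ callable rather than a PTX/CUDA string.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Host-side model of a nullable column (assumption: std::nullopt means "null").
template <typename T>
using host_column = std::vector<std::optional<T>>;

// Applies f to every valid element. Null rows stay null, which mirrors the
// "output null mask is the same as the input null mask" rule.
template <typename Out, typename In, typename F>
host_column<Out> transform_reference(host_column<In> const& input, F f)
{
  host_column<Out> output(input.size());  // every row starts out null
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (input[i].has_value()) { output[i] = static_cast<Out>(f(*input[i])); }
  }
  assert(output.size() == input.size());
  return output;
}
```

Calling `transform_reference<int>(in, [](int x) { return x * x + 1; })` on a column holding 1, null, 3 leaves the middle row null and maps the other two rows through the lambda.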

std::pair<std::unique_ptr<rmm::device_buffer>, size_type> nans_to_nulls(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Creates a null mask from input by converting NaN to null while preserving existing nulls, and also returns the new null count.

Throws:

cudf::logic_error – if input.type() is a non-floating type

Parameters:
  • input – An immutable view of the input column of floating-point type

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned bitmask

Returns:

A pair containing a device_buffer with the new bitmask and its null count, obtained by replacing NaN in input with null.
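
To make the NaN-to-null conversion concrete, here is a host-only sketch of the semantics described above. It is an illustration under stated assumptions, not the library's implementation: `nans_to_nulls_reference` is a hypothetical name, the mask is modelled as a vector of 32-bit words with bit i stored at position i % 32 of word i / 32, and `std::nullopt` marks a row that is already null.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

using bitmask_word = std::uint32_t;  // assumption: 32-bit mask words

// Returns {mask, null_count}. Bit i is set when row i is valid, meaning it is
// neither null nor NaN in the input.
std::pair<std::vector<bitmask_word>, int> nans_to_nulls_reference(
  std::vector<std::optional<double>> const& input)
{
  std::vector<bitmask_word> mask((input.size() + 31) / 32, 0u);
  int null_count = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    bool const valid = input[i].has_value() && !std::isnan(*input[i]);
    if (valid) {
      mask[i / 32] |= bitmask_word{1} << (i % 32);
    } else {
      ++null_count;  // existing nulls and NaNs both count as null
    }
  }
  assert(static_cast<std::size_t>(null_count) <= input.size());
  return {mask, null_count};
}
```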

std::unique_ptr<column> compute_column(table_view const &table, ast::expression const &expr, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Compute a new column by evaluating an expression tree on a table.

This evaluates an expression over a table to produce a new column, an operation also called an n-ary transform.

Throws:

cudf::logic_error – if passed an expression operating on table_reference::RIGHT.

Parameters:
  • table – The table used for expression evaluation

  • expr – The root of the expression tree

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

The column resulting from evaluating the expression on each row of the table
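
The idea of evaluating an expression tree once per row can be sketched on the host as follows. Everything here is hypothetical and far simpler than `cudf::ast`: the `expr` node type, the `col`, `lit`, and `binary` helpers, and `compute_column_reference` are invented for illustration, columns hold only `double` values, and nulls are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Host-side stand-in for a table: one inner vector per column, equal lengths.
using host_table = std::vector<std::vector<double>>;

// Hypothetical expression node (not a cudf::ast type).
struct expr {
  enum class kind { column_ref, literal, binary_op };
  kind k = kind::literal;
  std::size_t column_index = 0;              // column_ref only
  double value = 0.0;                        // literal only
  std::function<double(double, double)> op;  // binary_op only
  std::shared_ptr<expr> lhs, rhs;            // binary_op only
};

std::shared_ptr<expr> col(std::size_t index)
{
  auto e = std::make_shared<expr>();
  e->k = expr::kind::column_ref;
  e->column_index = index;
  return e;
}

std::shared_ptr<expr> lit(double value)
{
  auto e = std::make_shared<expr>();
  e->value = value;  // kind defaults to literal
  return e;
}

std::shared_ptr<expr> binary(std::function<double(double, double)> op,
                             std::shared_ptr<expr> lhs,
                             std::shared_ptr<expr> rhs)
{
  auto e = std::make_shared<expr>();
  e->k   = expr::kind::binary_op;
  e->op  = std::move(op);
  e->lhs = std::move(lhs);
  e->rhs = std::move(rhs);
  return e;
}

// Recursively evaluates the tree for a single row.
double eval_row(expr const& e, host_table const& t, std::size_t row)
{
  switch (e.k) {
    case expr::kind::column_ref:
      assert(e.column_index < t.size());
      return t[e.column_index][row];
    case expr::kind::literal: return e.value;
    case expr::kind::binary_op:
      return e.op(eval_row(*e.lhs, t, row), eval_row(*e.rhs, t, row));
  }
  return 0.0;  // not reached: all kinds are handled above
}

// Produces one output value per row of the table.
std::vector<double> compute_column_reference(host_table const& t, expr const& root)
{
  std::size_t const num_rows = t.empty() ? 0 : t.front().size();
  std::vector<double> out(num_rows);
  for (std::size_t row = 0; row < num_rows; ++row) { out[row] = eval_row(root, t, row); }
  return out;
}
```

For example, `binary(std::plus<double>{}, col(0), binary(std::multiplies<double>{}, col(1), lit(2.0)))` builds the tree for column 0 plus twice column 1.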

std::pair<std::unique_ptr<rmm::device_buffer>, cudf::size_type> bools_to_mask(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Creates a bitmask from a column of boolean elements.

If element i in input is true, bit i in the resulting mask is set (1). Else, if element i is false or null, bit i is unset (0).

Throws:

cudf::logic_error – if input.type() is a non-boolean type

Parameters:
  • input – Boolean elements to convert to a bitmask

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned bitmask

Returns:

A pair containing a device_buffer with the new bitmask and its null count, obtained from input by treating true as valid (1) and false as invalid (0).
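
The bit-setting rule above (true sets bit i; false or null leaves it unset) can be checked with a small host-only sketch. It is an illustration, not libcudf's implementation: `bools_to_mask_reference` is a hypothetical name, `std::nullopt` models a null element, and the mask is assumed to be a sequence of 32-bit words.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Returns {mask, null_count}, where null_count is the number of unset bits
// among the first input.size() bits.
std::pair<std::vector<std::uint32_t>, int> bools_to_mask_reference(
  std::vector<std::optional<bool>> const& input)
{
  std::vector<std::uint32_t> mask((input.size() + 31) / 32, 0u);
  int set_bits = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    // value_or(false) folds "null" and "false" into the same unset bit.
    if (input[i].value_or(false)) {
      mask[i / 32] |= std::uint32_t{1} << (i % 32);
      ++set_bits;
    }
  }
  int const null_count = static_cast<int>(input.size()) - set_bits;
  assert(null_count >= 0);
  return {mask, null_count};
}
```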

std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::column>> encode(cudf::table_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Encode the rows of the given table as integers.

The encoded values are integers in the range [0, n), where n is the number of distinct rows in the input table. The result column satisfies keys[result[i]] == input[i], where keys is a table containing the distinct rows of input sorted in ascending order. Nulls, if any, are sorted to the end of the keys table.

Examples:

input: [{'a', 'b', 'b', 'a'}]
output: [{'a', 'b'}], {0, 1, 1, 0}

input: [{1, 3, 1, 2, 9}, {1, 2, 1, 3, 5}]
output: [{1, 2, 3, 9}, {1, 3, 2, 5}], {0, 2, 0, 1, 3}

Parameters:
  • input – Table containing values to be encoded

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned table’s device memory

Returns:

A pair containing the distinct rows of the input table in sorted order, and a column of integer indices representing the encoded rows.
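
The invariant keys[result[i]] == input[i] can be reproduced with a short host-only sketch. This is a simplification, not the library's algorithm: `encode_reference` is a hypothetical name, rows are stored row-major as `std::vector<int>` (libcudf tables are columnar), and null handling is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

using host_row = std::vector<int>;  // one entry per column of the input table

// Returns {keys, indices}: keys holds the distinct rows in ascending
// lexicographic order, and indices[i] is the position of rows[i] within keys.
std::pair<std::vector<host_row>, std::vector<int>> encode_reference(
  std::vector<host_row> const& rows)
{
  std::vector<host_row> keys(rows);
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

  std::vector<int> indices;
  indices.reserve(rows.size());
  for (auto const& row : rows) {
    // keys is sorted and contains every input row, so a binary search finds it.
    auto const it = std::lower_bound(keys.begin(), keys.end(), row);
    assert(it != keys.end() && *it == row);
    indices.push_back(static_cast<int>(it - keys.begin()));
  }
  return {keys, indices};
}
```

Fed the rows of the second example above, (1,1), (3,2), (1,1), (2,3), (9,5), the sketch yields the keys (1,1), (2,3), (3,2), (9,5) and the indices {0, 2, 0, 1, 3}.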

std::pair<std::unique_ptr<column>, table_view> one_hot_encode(column_view const &input, column_view const &categories, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Encodes input by generating a new column for each value in categories indicating the presence of that value in input.

The resulting per-category columns are returned concatenated as a single column viewed by a table_view.

The ith row of the jth column in the output table equals 1 if input[i] == categories[j], and 0 otherwise.

Examples:

input: [{'a', 'c', null, 'c', 'b'}]
categories: ['c', null]
output: [{0, 1, 0, 1, 0}, {0, 0, 1, 0, 0}]

Throws:

cudf::logic_error – if input and categories are of different types.

Parameters:
  • input – Column containing values to be encoded

  • categories – Column containing categories

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned table’s device memory

Returns:

A pair containing the owner to all encoded data and a table view into the data
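
The rule "row i of column j is 1 if input[i] == categories[j]" can be sketched on the host as below. This is illustrative only: `one_hot_encode_reference` is a hypothetical name, values are single characters, `std::nullopt` models null, and the output is a plain vector of 0/1 columns rather than one owning column plus a table_view. Following the example above, a null input row is treated as equal to a null category, which is also how `std::optional` compares two empty values.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using nullable_char = std::optional<char>;

// Returns one 0/1 column per category; out[j][i] == 1 iff input[i] == categories[j].
std::vector<std::vector<int>> one_hot_encode_reference(
  std::vector<nullable_char> const& input, std::vector<nullable_char> const& categories)
{
  std::vector<std::vector<int>> out(categories.size(), std::vector<int>(input.size(), 0));
  for (std::size_t j = 0; j < categories.size(); ++j) {
    for (std::size_t i = 0; i < input.size(); ++i) {
      // Two std::nullopt values compare equal, so null matches a null category.
      out[j][i] = (input[i] == categories[j]) ? 1 : 0;
    }
  }
  assert(out.size() == categories.size());
  return out;
}
```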

std::unique_ptr<column> mask_to_bools(bitmask_type const *bitmask, size_type begin_bit, size_type end_bit, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Creates a boolean column from given bitmask.

Returns a bool for each bit in [begin_bit, end_bit). If bit i in least-significant bit numbering is set (1), then element i in the output is true, otherwise false.

Examples:

input: {0b10101010}
output: [{false, true, false, true, false, true, false, true}]

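Throws:

cudf::logic_error – if bitmask is null and end_bit - begin_bit > 0

cudf::logic_error – if begin_bit > end_bit

Parameters: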
  • bitmask – A device pointer to the bitmask which needs to be converted

  • begin_bit – Position of the bit from which the conversion should start

  • end_bit – Position of the bit before which the conversion should stop

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A boolean column representing the given mask from [begin_bit, end_bit)
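
Least-significant-bit numbering over the half-open range [begin_bit, end_bit) can be illustrated with a host-only sketch. The name `mask_to_bools_reference` is hypothetical, and the mask is assumed to be a vector of 32-bit words held in host memory rather than a device pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns one bool per bit in [begin_bit, end_bit); bit i lives at position
// i % 32 (counted from the least-significant end) of word i / 32.
std::vector<bool> mask_to_bools_reference(std::vector<std::uint32_t> const& bitmask,
                                          int begin_bit,
                                          int end_bit)
{
  assert(begin_bit >= 0 && begin_bit <= end_bit);
  assert(static_cast<std::size_t>(end_bit) <= bitmask.size() * 32);
  std::vector<bool> out;
  out.reserve(static_cast<std::size_t>(end_bit - begin_bit));
  for (int bit = begin_bit; bit < end_bit; ++bit) {
    out.push_back(((bitmask[bit / 32] >> (bit % 32)) & 1u) != 0u);
  }
  return out;
}
```

With the mask 0b10101010 and the range [0, 8), bit 0 is the rightmost binary digit, so the output starts with false, matching the example above.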

std::unique_ptr<column> row_bit_count(table_view const &t, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns an approximate cumulative size in bits of all columns in the table_view for each row.

This function counts bits instead of bytes to account for the null mask, which has only one bit per row.

Each row in the returned column is the sum of the per-row size for each column in the table.

In some cases, this is an inexact approximation. Specifically, columns of lists and strings require N+1 offsets to represent N rows. It is up to the caller to calculate the small additional overhead of the terminating offset for any group of rows being considered.

This function returns the per-row sizes as the columns are currently formed. This can end up being larger than the number you would get by gathering the rows. Specifically, the push-down of struct column validity masks can nullify rows that contain data for string or list columns. In these cases, the size returned is conservative:

row_bit_count(column(x)) >= row_bit_count(gather(column(x)))

Parameters:
  • t – The table view to perform the computation on

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A 32-bit integer column containing the per-row bit counts

std::unique_ptr<column> segmented_row_bit_count(table_view const &t, size_type segment_length, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns an approximate cumulative size in bits of all columns in the table_view for each segment of rows.

This is similar to counting bit size per row for the input table in cudf::row_bit_count, except that row sizes are accumulated by segments.

Currently, only fixed-length segments are supported. If the number of rows in the input table is not divisible by segment_length, the last segment is shorter than the others.

Throws:

std::invalid_argument – if the input segment_length is non-positive or larger than the number of rows in the input table.

Parameters:
  • t – The table view to perform the computation on

  • segment_length – The number of rows in each segment for which the total size is computed

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A 32-bit integer column containing the bit counts for each segment of rows
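
The segment accumulation, including the shorter last segment and the documented argument check, can be sketched on the host. This sketch assumes the per-row bit counts are already available as a plain vector (for instance, as they would be produced by cudf::row_bit_count); `segmented_bit_count_reference` is a hypothetical name, not a libcudf function.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Sums per-row bit counts over consecutive segments of segment_length rows.
// The last segment is shorter when the row count is not a multiple of it.
std::vector<int> segmented_bit_count_reference(std::vector<int> const& row_bits,
                                               int segment_length)
{
  if (segment_length <= 0 || static_cast<std::size_t>(segment_length) > row_bits.size()) {
    throw std::invalid_argument("segment_length must be in [1, number of rows]");
  }
  auto const len          = static_cast<std::size_t>(segment_length);
  auto const num_segments = (row_bits.size() + len - 1) / len;  // round up
  std::vector<int> out(num_segments, 0);
  for (std::size_t row = 0; row < row_bits.size(); ++row) {
    out[row / len] += row_bits[row];
  }
  assert(!out.empty());
  return out;
}
```

With five rows and a segment length of 2, the result has three entries, the last of which covers only the fifth row.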