
Files

file  transform.hpp
 Column APIs for transforming rows.
 

Functions

std::unique_ptr< column > cudf::transform (column_view const &input, std::string const &unary_udf, data_type output_type, bool is_ptx, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Creates a new column by applying a unary function against every element of an input column.
 
std::pair< std::unique_ptr< rmm::device_buffer >, size_type > cudf::nans_to_nulls (column_view const &input, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Creates a null_mask from input by converting NaN to null and preserving existing null values and also returns new null_count.
 
std::unique_ptr< column > cudf::compute_column (table_view const &table, ast::expression const &expr, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Compute a new column by evaluating an expression tree on a table.
 
std::pair< std::unique_ptr< rmm::device_buffer >, cudf::size_type > cudf::bools_to_mask (column_view const &input, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Creates a bitmask from a column of boolean elements.
 
std::pair< std::unique_ptr< cudf::table >, std::unique_ptr< cudf::column > > cudf::encode (cudf::table_view const &input, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Encode the rows of the given table as integers.
 
std::pair< std::unique_ptr< column >, table_view > cudf::one_hot_encode (column_view const &input, column_view const &categories, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Encodes input by generating a new column for each value in categories indicating the presence of that value in input.
 
std::unique_ptr< column > cudf::mask_to_bools (bitmask_type const *bitmask, size_type begin_bit, size_type end_bit, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Creates a boolean column from given bitmask.
 
std::unique_ptr< column > cudf::row_bit_count (table_view const &t, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns an approximate cumulative size in bits of all columns in the table_view for each row.
 
std::unique_ptr< column > cudf::segmented_row_bit_count (table_view const &t, size_type segment_length, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns an approximate cumulative size in bits of all columns in the table_view for each segment of rows.
 

Detailed Description

Function Documentation

◆ bools_to_mask()

std::pair<std::unique_ptr<rmm::device_buffer>, cudf::size_type> cudf::bools_to_mask ( column_view const &  input,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Creates a bitmask from a column of boolean elements.

If element i in input is true, bit i in the resulting mask is set (1). Else, if element i is false or null, bit i is unset (0).

Exceptions
cudf::logic_error if input.type() is a non-boolean type
Parameters
input    Boolean elements to convert to a bitmask
mr    Device memory resource used to allocate the returned bitmask
Returns
A pair containing a device_buffer with the new bitmask and its null count, obtained from input by treating true as valid (1) and false or null as invalid (0).
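
The mapping from boolean elements to mask bits can be illustrated on the host. The following is a minimal sketch of the documented semantics only, not the libcudf implementation: bools_to_mask_model is a hypothetical name, a null element is modeled as std::nullopt, and the 32-bit mask word with least-significant bit numbering is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical host-side model; a null element is represented by std::nullopt.
// Returns the mask words and the number of unset bits (the null count).
std::pair<std::vector<std::uint32_t>, std::size_t> bools_to_mask_model(
  std::vector<std::optional<bool>> const& input)
{
  constexpr std::size_t bits_per_word = 32;  // assumed word width
  std::vector<std::uint32_t> mask((input.size() + bits_per_word - 1) / bits_per_word, 0u);
  std::size_t null_count = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (input[i].value_or(false)) {
      // true: set bit i
      mask[i / bits_per_word] |= (std::uint32_t{1} << (i % bits_per_word));
    } else {
      // false or null: leave bit i unset and count it as null
      ++null_count;
    }
  }
  assert(null_count <= input.size());
  return {std::move(mask), null_count};
}
```

For the input {true, false, null, true}, this model sets bits 0 and 3 and reports a null count of 2.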

◆ compute_column()

std::unique_ptr<column> cudf::compute_column ( table_view const &  table,
ast::expression const &  expr,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Compute a new column by evaluating an expression tree on a table.

This evaluates an expression over a table to produce a new column. Also called an n-ary transform.

Exceptions
cudf::logic_error if passed an expression operating on table_reference::RIGHT.
Parameters
table    The table used for expression evaluation
expr    The root of the expression tree
mr    Device memory resource
Returns
Output column

◆ encode()

std::pair<std::unique_ptr<cudf::table>, std::unique_ptr<cudf::column> > cudf::encode ( cudf::table_view const &  input,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Encode the rows of the given table as integers.

The encoded values are integers in the range [0, n), where n is the number of distinct rows in the input table. The result table is such that keys[result[i]] == input[i], where keys is a table containing the distinct rows in input in sorted ascending order. Nulls, if any, are sorted to the end of the keys table.

Examples:

input: [{'a', 'b', 'b', 'a'}]
output: [{'a', 'b'}], {0, 1, 1, 0}
input: [{1, 3, 1, 2, 9}, {1, 2, 1, 3, 5}]
output: [{1, 2, 3, 9}, {1, 3, 2, 5}], {0, 2, 0, 1, 3}
Parameters
input    Table containing values to be encoded
mr    Device memory resource used to allocate the returned table's device memory
Returns
A pair containing the distinct rows of the input table in sorted order, and a column of integer indices representing the encoded rows.
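
The relationship keys[result[i]] == input[i] can be checked with a small host-side model. This is a sketch under stated assumptions, not how libcudf computes the encoding: encode_model and the row alias are hypothetical, rows are stored row-major as std::vector<int>, and nulls are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using row = std::vector<int>;  // hypothetical row-major stand-in for one table row

// Returns the distinct rows in ascending (lexicographic) order and, for each
// input row, the index of that row in the distinct-row list.
std::pair<std::vector<row>, std::vector<int>> encode_model(std::vector<row> const& input)
{
  std::vector<row> keys(input);
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

  std::vector<int> indices;
  indices.reserve(input.size());
  for (row const& r : input) {
    // keys is sorted and contains r, so lower_bound finds its exact position
    auto it = std::lower_bound(keys.begin(), keys.end(), r);
    indices.push_back(static_cast<int>(it - keys.begin()));
  }

  // Documented invariant: keys[result[i]] == input[i]
  for (std::size_t i = 0; i < input.size(); ++i) {
    assert(keys[static_cast<std::size_t>(indices[i])] == input[i]);
  }
  return {std::move(keys), std::move(indices)};
}
```

Applied to the rows of the second example above, (1,1), (3,2), (1,1), (2,3), (9,5), the model yields the keys (1,1), (2,3), (3,2), (9,5) and the indices {0, 2, 0, 1, 3}.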

◆ mask_to_bools()

std::unique_ptr<column> cudf::mask_to_bools ( bitmask_type const *  bitmask,
size_type  begin_bit,
size_type  end_bit,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Creates a boolean column from given bitmask.

Returns a bool for each bit in [begin_bit, end_bit). If bit i in least-significant bit numbering is set (1), then element i in the output is true, otherwise false.

Exceptions
cudf::logic_error if bitmask is null and end_bit-begin_bit > 0
cudf::logic_error if begin_bit > end_bit

Examples:

input: {0b10101010}
output: [{false, true, false, true, false, true, false, true}]
Parameters
bitmask    A device pointer to the bitmask which needs to be converted
begin_bit    Position of the bit from which the conversion should start
end_bit    Position of the bit before which the conversion should stop
mr    Device memory resource used to allocate the returned column's device memory
Returns
A boolean column representing the given mask from [begin_bit, end_bit)
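
The least-significant bit numbering can be illustrated with a host-side model. This is a minimal sketch, not the libcudf implementation: mask_to_bools_model is a hypothetical name, the 32-bit mask word is an assumption, and the output is taken to hold one element per bit in [begin_bit, end_bit), starting at begin_bit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side model: reads bits [begin_bit, end_bit) from a mask
// stored as 32-bit words, with bit 0 being the least-significant bit of word 0.
std::vector<bool> mask_to_bools_model(std::vector<std::uint32_t> const& bitmask,
                                      int begin_bit,
                                      int end_bit)
{
  assert(0 <= begin_bit && begin_bit <= end_bit);
  assert(static_cast<std::size_t>(end_bit) <= bitmask.size() * 32);

  std::vector<bool> out;
  out.reserve(static_cast<std::size_t>(end_bit - begin_bit));
  for (int i = begin_bit; i < end_bit; ++i) {
    std::uint32_t const word = bitmask[static_cast<std::size_t>(i / 32)];
    out.push_back(((word >> (i % 32)) & 1u) != 0);
  }
  return out;
}
```

With the mask 0b10101010 and the range [0, 8), the model reproduces the example above; with the range [1, 4) it returns {true, false, true}.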

◆ nans_to_nulls()

std::pair<std::unique_ptr<rmm::device_buffer>, size_type> cudf::nans_to_nulls ( column_view const &  input,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Creates a null_mask from input by converting NaN to null and preserving existing null values and also returns new null_count.

Exceptions
cudf::logic_error if input.type() is a non-floating-point type
Parameters
input    An immutable view of the input column of floating-point type
mr    Device memory resource used to allocate the returned bitmask
Returns
A pair containing a device_buffer with the new bitmask and its null count, obtained by replacing NaN in input with null.
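
A host-side model makes the combination of existing nulls and NaN values concrete. This is a sketch of the documented semantics only: nans_to_nulls_model is a hypothetical name, an existing null is modeled as std::nullopt, and the 32-bit mask word is an assumption. Unlike the real function, the model always materializes a mask.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical host-side model: an element is valid only if it is
// non-null and not NaN. Returns the mask words and the new null count.
std::pair<std::vector<std::uint32_t>, std::size_t> nans_to_nulls_model(
  std::vector<std::optional<double>> const& input)
{
  std::vector<std::uint32_t> mask((input.size() + 31) / 32, 0u);
  std::size_t null_count = 0;
  for (std::size_t i = 0; i < input.size(); ++i) {
    bool const valid = input[i].has_value() && !std::isnan(*input[i]);
    if (valid) {
      mask[i / 32] |= (std::uint32_t{1} << (i % 32));
    } else {
      ++null_count;  // existing null preserved, or NaN converted to null
    }
  }
  assert(null_count <= input.size());
  return {std::move(mask), null_count};
}
```

For the input {1.0, NaN, null, 4.0}, the model sets bits 0 and 3 and reports a null count of 2.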

◆ one_hot_encode()

std::pair<std::unique_ptr<column>, table_view> cudf::one_hot_encode ( column_view const &  input,
column_view const &  categories,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Encodes input by generating a new column for each value in categories indicating the presence of that value in input.

The resulting per-category columns are returned concatenated as a single column viewed by a table_view.

The ith row of the jth column in the output table equals 1 if input[i] == categories[j], and 0 otherwise.

Examples:

input: [{'a', 'c', null, 'c', 'b'}]
categories: ['c', null]
output: [{0, 1, 0, 1, 0}, {0, 0, 1, 0, 0}]
Exceptions
cudf::logic_error if input and categories are of different types.
Parameters
input    Column containing values to be encoded
categories    Column containing categories
mr    Device memory resource used to allocate the returned table's device memory
Returns
A pair containing the owner to all encoded data and a table view into the data
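
The per-category comparison, including the null category in the example above, can be modeled on the host. This is a minimal sketch and not the libcudf implementation: one_hot_encode_model and the value alias are hypothetical, null is modeled as std::nullopt (so null compares equal to null), and the model returns one separate vector per category rather than a single owning column with a table_view.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

using value = std::optional<std::string>;  // std::nullopt models a null element

// Hypothetical host-side model: out[j][i] is true when input[i] == categories[j].
std::vector<std::vector<bool>> one_hot_encode_model(std::vector<value> const& input,
                                                    std::vector<value> const& categories)
{
  std::vector<std::vector<bool>> out(categories.size(),
                                     std::vector<bool>(input.size(), false));
  for (std::size_t j = 0; j < categories.size(); ++j) {
    for (std::size_t i = 0; i < input.size(); ++i) {
      // two empty optionals compare equal, so null matches a null category
      out[j][i] = (input[i] == categories[j]);
    }
  }
  assert(out.size() == categories.size());
  return out;
}
```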

◆ row_bit_count()

std::unique_ptr<column> cudf::row_bit_count ( table_view const &  t,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns an approximate cumulative size in bits of all columns in the table_view for each row.

This function counts bits instead of bytes to account for the null mask which only has one bit per row.

Each row in the returned column is the sum of the per-row size for each column in the table.

In some cases, this is an inexact approximation. Specifically, columns of lists and strings require N+1 offsets to represent N rows. It is up to the caller to calculate the small additional overhead of the terminating offset for any group of rows being considered.

This function returns the per-row sizes as the columns are currently formed. This can end up being larger than the number you would get by gathering the rows. Specifically, the push-down of struct column validity masks can nullify rows that contain data for string or list columns. In these cases, the size returned is conservative:

row_bit_count(column(x)) >= row_bit_count(gather(column(x)))

Parameters
t    The table view to perform the computation on
mr    Device memory resource used to allocate the returned column's device memory
Returns
A 32-bit integer column containing the per-row bit counts

◆ segmented_row_bit_count()

std::unique_ptr<column> cudf::segmented_row_bit_count ( table_view const &  t,
size_type  segment_length,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns an approximate cumulative size in bits of all columns in the table_view for each segment of rows.

This is similar to counting bit size per row for the input table in cudf::row_bit_count, except that row sizes are accumulated by segments.

Currently, only fixed-length segments are supported. If the number of rows in the input table is not divisible by segment_length, the last segment is shorter than the others.

Exceptions
std::invalid_argument if the input segment_length is non-positive or larger than the number of rows in the input table.
Parameters
t    The table view to perform the computation on
segment_length    The number of rows in each segment for which the total size is computed
mr    Device memory resource used to allocate the returned column's device memory
Returns
A 32-bit integer column containing the bit counts for each segment of rows
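
The segment accumulation, including the shorter last segment, can be sketched on the host. This model assumes that a segment's size is the plain sum of its per-row sizes, which is a simplification for illustration; segmented_sum_model is a hypothetical name and the per-row bit counts are supplied directly rather than computed from a table.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model: sums per-row bit counts over fixed-length
// segments. The last segment is shorter when the row count is not divisible
// by segment_length.
std::vector<std::int32_t> segmented_sum_model(std::vector<std::int32_t> const& row_bits,
                                              std::size_t segment_length)
{
  if (segment_length == 0 || segment_length > row_bits.size()) {
    throw std::invalid_argument("segment_length must be in [1, number of rows]");
  }
  std::size_t const num_segments = (row_bits.size() + segment_length - 1) / segment_length;
  std::vector<std::int32_t> out(num_segments, 0);
  for (std::size_t i = 0; i < row_bits.size(); ++i) {
    out[i / segment_length] += row_bits[i];
  }
  assert(out.size() == num_segments);
  return out;
}
```

With per-row sizes {10, 20, 30, 40, 50} and segment_length 2, the model produces three segments, {30, 70, 50}, where the last segment covers a single row.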

◆ transform()

std::unique_ptr<column> cudf::transform ( column_view const &  input,
std::string const &  unary_udf,
data_type  output_type,
bool  is_ptx,
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Creates a new column by applying a unary function against every element of an input column.

Computes: out[i] = F(in[i])

The output null mask is the same as the input null mask, so if input[i] is null then output[i] is also null.

Parameters
input    An immutable view of the input column to transform
unary_udf    The PTX/CUDA string of the unary function to apply
output_type    The output type that is compatible with the output type in the UDF
is_ptx    true: the UDF is treated as PTX code; false: the UDF is treated as CUDA code
mr    Device memory resource used to allocate the returned column's device memory
Returns
The column resulting from applying the unary function to every element of the input
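
The element-wise rule out[i] = F(in[i]) with null propagation can be sketched on the host. This is an illustration of the semantics only: transform_model is a hypothetical name, F is an ordinary C++ callable rather than a PTX/CUDA string, and a null element is modeled as std::nullopt.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical host-side model: applies f to every non-null element and
// carries nulls through unchanged, so the output null pattern matches the input.
template <typename Out, typename In, typename F>
std::vector<std::optional<Out>> transform_model(std::vector<std::optional<In>> const& input, F f)
{
  std::vector<std::optional<Out>> out;
  out.reserve(input.size());
  for (auto const& v : input) {
    if (v.has_value()) {
      out.emplace_back(static_cast<Out>(f(*v)));
    } else {
      out.emplace_back(std::nullopt);  // null in, null out
    }
  }
  assert(out.size() == input.size());
  return out;
}
```

Calling transform_model<double> on {1, null, 3} with a callable that halves its argument gives {0.5, null, 1.5}.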