Interop Arrow#

group interop_arrow

Typedefs

using unique_schema_t = std::unique_ptr<ArrowSchema, void (*)(ArrowSchema*)>#

typedef for a unique_ptr to an ArrowSchema with custom deleter

using unique_device_array_t = std::unique_ptr<ArrowDeviceArray, void (*)(ArrowDeviceArray*)>#

typedef for a unique_ptr to an ArrowDeviceArray with a custom deleter

using owned_columns_t = std::vector<std::unique_ptr<cudf::column>>#

typedef for a vector of owning columns, used for conversion from ArrowDeviceArray

using unique_table_view_t = std::unique_ptr<cudf::table_view, custom_view_deleter<cudf::table_view>>#

typedef for a unique_ptr to a cudf::table_view with custom deleter

using unique_column_view_t = std::unique_ptr<cudf::column_view, custom_view_deleter<cudf::column_view>>#

typedef for a unique_ptr to a cudf::column_view with custom deleter

Functions

unique_schema_t to_arrow_schema(cudf::table_view const &input, cudf::host_span<column_metadata const> metadata)#

Create ArrowSchema from cudf table and metadata.

Populates and returns an ArrowSchema C struct using a table and metadata.

Note

For decimals, since the precision is not stored for them in libcudf, decimals will be converted to an Arrow decimal128 which has the widest precision that cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 with the precision of 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 with the precision of 38.

Parameters:
  • input – Table to create a schema from

  • metadata – Contains the hierarchy of names of columns and children

Returns:

ArrowSchema generated from input

unique_device_array_t to_arrow_device(cudf::table &&table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create ArrowDeviceArray from cudf table and metadata.

Populates the C struct ArrowDeviceArray without performing copies if possible. This maintains the data on the GPU device and gives ownership of the table and its buffers to the ArrowDeviceArray struct.

After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up the memory.

Note

For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.

Note

Copies will be performed in the cases where cudf differs from Arrow such as in the representation of bools (Arrow uses a bitmap, cudf uses 1-byte per value).

Parameters:
  • table – Input table, ownership of the data will be moved to the result

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used for any allocations during conversion

Returns:

ArrowDeviceArray which will have ownership of the GPU data, consumer must call release

unique_device_array_t to_arrow_device(cudf::column &&col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create ArrowDeviceArray from cudf column and metadata.

Populates the C struct ArrowDeviceArray without performing copies if possible. This maintains the data on the GPU device and gives ownership of the table and its buffers to the ArrowDeviceArray struct.

After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up the memory.

Note

For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similar, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.

Note

Copies will be performed in the cases where cudf differs from Arrow such as in the representation of bools (Arrow uses a bitmap, cudf uses 1 byte per value).

Parameters:
  • col – Input column, ownership of the data will be moved to the result

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used for any allocations during conversion

Returns:

ArrowDeviceArray which will have ownership of the GPU data

unique_device_array_t to_arrow_device(cudf::table_view const &table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create ArrowDeviceArray from a table view.

Populates the C struct ArrowDeviceArray performing copies only if necessary. This wraps the data on the GPU device and gives a view of the table data to the ArrowDeviceArray struct. If the caller frees the data referenced by the table_view, using the returned object results in undefined behavior.

After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up any memory created during conversion.

Copies will be performed in the cases where cudf differs from Arrow:

  • BOOL8: Arrow uses a bitmap and cudf uses 1 byte per value

  • DECIMAL32 and DECIMAL64: Converted to Arrow decimal128

  • STRING: Arrow expects a single value int32 offset child array for empty strings columns

Note

For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.

Parameters:
  • table – Input table

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used for any allocations during conversion

Returns:

ArrowDeviceArray which will have ownership of any copied data

unique_device_array_t to_arrow_device(cudf::column_view const &col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create ArrowDeviceArray from a column view.

Populates the C struct ArrowDeviceArray performing copies only if necessary. This wraps the data on the GPU device and gives a view of the column data to the ArrowDeviceArray struct. If the caller frees the data referenced by the column_view, using the returned object results in undefined behavior.

After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up any memory created during conversion.

Copies will be performed in the cases where cudf differs from Arrow:

  • BOOL8: Arrow uses a bitmap and cudf uses 1 byte per value

  • DECIMAL32 and DECIMAL64: Converted to Arrow decimal128

  • STRING: Arrow expects a single value int32 offset child array for empty strings columns

Note

For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similar, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.

Parameters:
  • col – Input column

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used for any allocations during conversion

Returns:

ArrowDeviceArray which will have ownership of any copied data

unique_device_array_t to_arrow_host(cudf::table_view const &table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Copy table view data to host and create ArrowDeviceArray for it.

Populates the C struct ArrowDeviceArray, copying the cudf data to the host. The returned ArrowDeviceArray will have a device_type of CPU and will have no ties to the memory referenced by the table view passed in. The deleter for the returned unique_ptr will call the release callback on the ArrowDeviceArray automatically.

Note

For decimals, since the precision is not stored for them in libcudf, it will be converted to an Arrow decimal128 that has the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of precision 38.

Parameters:
  • table – Input table

  • stream – CUDA stream used for the device memory operations and kernel launches

  • mr – Device memory resource used for any allocations during conversion

Returns:

ArrowDeviceArray generated from input table

unique_device_array_t to_arrow_host(cudf::column_view const &col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Copy column view data to host and create ArrowDeviceArray for it.

Populates the C struct ArrowDeviceArray, copying the cudf data to the host. The returned ArrowDeviceArray will have a device_type of CPU and will have no ties to the memory referenced by the column view passed in. The deleter for the returned unique_ptr will call the release callback on the ArrowDeviceArray automatically.

Note

For decimals, since the precision is not stored for them in libcudf, it will be converted to an Arrow decimal128 that has the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of precision 38.

Parameters:
  • col – Input column

  • stream – CUDA stream used for the device memory operations and kernel launches

  • mr – Device memory resource used for any allocations during conversion

Returns:

ArrowDeviceArray generated from input column

std::unique_ptr<cudf::table> from_arrow(ArrowSchema const *schema, ArrowArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::table from given ArrowArray and ArrowSchema input.

The conversion will not call release on the input Array.

Throws:
  • std::invalid_argument – if either schema or input are NULL

  • cudf::data_type_error – if the input array is not a struct array.

Parameters:
  • schemaArrowSchema pointer to describe the type of the data

  • inputArrowArray pointer that needs to be converted to cudf::table

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate cudf::table

Returns:

cudf table generated from given arrow data

std::unique_ptr<cudf::column> from_arrow_column(ArrowSchema const *schema, ArrowArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::column from a given ArrowArray and ArrowSchema input.

The conversion will not call release on the input Array.

Throws:

std::invalid_argument – if either schema or input are NULL

Parameters:
  • schemaArrowSchema pointer to describe the type of the data

  • inputArrowArray pointer that needs to be converted to cudf::column

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate cudf::column

Returns:

cudf column generated from given arrow data

std::unique_ptr<table> from_arrow_host(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::table from given ArrowDeviceArray input.

The conversion will not call release on the input Array.

Throws:
  • std::invalid_argument – if either schema or input are NULL

  • std::invalid_argument – if the device_type is not ARROW_DEVICE_CPU

  • cudf::data_type_error – if the input array is not a struct array, non-struct arrays should be passed to from_arrow_host_column instead.

Parameters:
  • schemaArrowSchema pointer to describe the type of the data

  • inputArrowDeviceArray pointer to object owning the Arrow data

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to perform cuda allocation

Returns:

cudf table generated from the given Arrow data

std::unique_ptr<table> from_arrow_stream(ArrowArrayStream *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::table from given ArrowArrayStream input.

The conversion WILL release the input ArrayArrayStream and its constituent arrays or schema since Arrow streams are not suitable for multiple reads.

Throws:

std::invalid_argument – if input is NULL

Parameters:
  • inputArrowArrayStream pointer to object that will produce ArrowArray data

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to perform cuda allocation

Returns:

cudf table generated from the given Arrow data

std::unique_ptr<column> from_arrow_host_column(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::column from given ArrowDeviceArray input.

The conversion will not call release on the input Array.

Throws:
  • std::invalid_argument – if either schema or input are NULL

  • std::invalid_argument – if the device_type is not ARROW_DEVICE_CPU

  • cudf::data_type_error – if input arrow data type is not supported in cudf.

Parameters:
  • schemaArrowSchema pointer to describe the type of the data

  • inputArrowDeviceArray pointer to object owning the Arrow data

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to perform cuda allocation

Returns:

cudf column generated from the given Arrow data

unique_table_view_t from_arrow_device(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::table_view from given ArrowDeviceArray and ArrowSchema

Constructs a non-owning cudf::table_view using ArrowDeviceArray and ArrowSchema, data must be accessible to the CUDA device. Because the resulting cudf::table_view will not own the data, the ArrowDeviceArray must be kept alive for the lifetime of the result. It is the responsibility of callers to ensure they call the release callback on the ArrowDeviceArray after it is no longer needed, and that the cudf::table_view is not accessed after this happens.

Each child of the input struct will be the columns of the resulting table_view.

Note

The custom deleter used for the unique_ptr to the table_view maintains ownership over any memory which is allocated, such as converting boolean columns from the bitmap used by Arrow to the 1-byte per value for cudf.

Note

If the input ArrowDeviceArray contained a non-null sync_event it is assumed to be a cudaEvent_t* and the passed in stream will have cudaStreamWaitEvent called on it with the event. This function, however, will not explicitly synchronize on the stream.

Throws:
  • std::invalid_argument – if device_type is not ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST or ARROW_DEVICE_CUDA_MANAGED

  • cudf::data_type_error – if the input array is not a struct array, non-struct arrays should be passed to from_arrow_device_column instead.

  • cudf::data_type_error – if the input arrow data type is not supported.

Parameters:
  • schemaArrowSchema pointer to object describing the type of the device array

  • inputArrowDeviceArray pointer to object owning the Arrow data

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to perform any allocations

Returns:

cudf::table_view generated from given Arrow data

unique_column_view_t from_arrow_device_column(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create cudf::column_view from given ArrowDeviceArray and ArrowSchema

Constructs a non-owning cudf::column_view using ArrowDeviceArray and ArrowSchema, data must be accessible to the CUDA device. Because the resulting cudf::column_view will not own the data, the ArrowDeviceArray must be kept alive for the lifetime of the result. It is the responsibility of callers to ensure they call the release callback on the ArrowDeviceArray after it is no longer needed, and that the cudf::column_view is not accessed after this happens.

Note

The custom deleter used for the unique_ptr to the table_view maintains ownership over any memory which is allocated, such as converting boolean columns from the bitmap used by Arrow to the 1-byte per value for cudf.

Note

If the input ArrowDeviceArray contained a non-null sync_event it is assumed to be a cudaEvent_t* and the passed in stream will have cudaStreamWaitEvent called on it with the event. This function, however, will not explicitly synchronize on the stream.

Throws:
  • std::invalid_argument – if device_type is not ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST or ARROW_DEVICE_CUDA_MANAGED

  • cudf::data_type_error – input arrow data type is not supported.

Parameters:
  • schemaArrowSchema pointer to object describing the type of the device array

  • inputArrowDeviceArray pointer to object owning the Arrow data

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to perform any allocations

Returns:

cudf::column_view generated from given Arrow data

struct column_metadata#
#include <interop.hpp>

Detailed metadata information for arrow array.

As of now this contains only name in the hierarchy of children of cudf column, but in future this can be updated as per requirement.

Public Functions

inline column_metadata(std::string _name)#

Construct a new column metadata object.

Parameters:

_name – Name of the column

Public Members

std::string name#

Name of the column.

std::vector<column_metadata> children_meta#

Metadata of children of the column.

template<typename ViewType>
struct custom_view_deleter#
#include <interop.hpp>

functor for a custom deleter to a unique_ptr of table_view

When converting from an ArrowDeviceArray, there are cases where data can’t be zero-copy (i.e. bools or non-UINT32 dictionary indices). This custom deleter is used to maintain ownership over the data allocated since a cudf::table_view doesn’t hold ownership.

Public Functions

inline explicit custom_view_deleter(owned_columns_t &&owned)#

Construct a new custom view deleter object.

Parameters:

owned – Vector of owning columns

inline void operator()(ViewType *ptr) const#

operator to delete the unique_ptr

Parameters:

ptr – Pointer to the object to be deleted

Public Members

owned_columns_t owned_mem_#

Owned columns that must be deleted.