Interop Arrow
- group interop_arrow
Typedefs
-
using unique_schema_t = std::unique_ptr<ArrowSchema, void (*)(ArrowSchema*)>
typedef for a unique_ptr to an ArrowSchema with a custom deleter
-
using unique_device_array_t = std::unique_ptr<ArrowDeviceArray, void (*)(ArrowDeviceArray*)>
typedef for a unique_ptr to an ArrowDeviceArray with a custom deleter
-
using owned_columns_t = std::vector<std::unique_ptr<cudf::column>>
typedef for a vector of owning columns, used for conversion from ArrowDeviceArray
-
using unique_table_view_t = std::unique_ptr<cudf::table_view, custom_view_deleter<cudf::table_view>>
typedef for a unique_ptr to a cudf::table_view with a custom deleter
-
using unique_column_view_t = std::unique_ptr<cudf::column_view, custom_view_deleter<cudf::column_view>>
typedef for a unique_ptr to a cudf::column_view with a custom deleter
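The two Arrow typedefs above pair a C struct with a plain function-pointer deleter. The following is a minimal sketch of that pattern, not libcudf's implementation. It assumes a locally declared ArrowSchema that copies the struct layout from the Arrow C data interface so the example compiles without Arrow headers. The names release_and_delete, demo_release, make_demo_schema and g_release_calls are hypothetical, and the deleter that libcudf actually installs may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Local copy of the ArrowSchema layout from the Arrow C data interface.
// Real code would include the Arrow (or nanoarrow) header instead.
struct ArrowSchema {
  const char* format;
  const char* name;
  const char* metadata;
  std::int64_t flags;
  std::int64_t n_children;
  ArrowSchema** children;
  ArrowSchema* dictionary;
  void (*release)(ArrowSchema*);
  void* private_data;
};

using unique_schema_t = std::unique_ptr<ArrowSchema, void (*)(ArrowSchema*)>;

// Hypothetical deleter: invoke the release callback if the struct has not
// been released yet, then free the struct itself.
void release_and_delete(ArrowSchema* schema)
{
  assert(schema != nullptr);
  if (schema->release != nullptr) { schema->release(schema); }
  delete schema;
}

// Hypothetical producer-side callback. Per the C data interface, a released
// struct is marked by setting its release member to null.
int g_release_calls = 0;
void demo_release(ArrowSchema* schema)
{
  ++g_release_calls;
  schema->release = nullptr;
}

// Hypothetical factory returning a schema for a single int32 field ("i").
unique_schema_t make_demo_schema()
{
  auto* schema    = new ArrowSchema{};  // all members zero-initialized
  schema->format  = "i";
  schema->name    = "a";
  schema->release = &demo_release;
  return unique_schema_t(schema, &release_and_delete);
}
```

With this arrangement the release callback runs exactly once when the unique_ptr is reset or goes out of scope, so callers do not have to invoke it by hand.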
Functions
-
std::shared_ptr<arrow::Table> to_arrow(table_view input, std::vector<column_metadata> const &metadata = {}, rmm::cuda_stream_view stream = cudf::get_default_stream(), arrow::MemoryPool *ar_mr = arrow::default_memory_pool())
Create arrow::Table from cudf table input.
Converts the cudf::table_view to arrow::Table with the provided metadata column_names.
- Deprecated:
Since 24.08. Use cudf::to_arrow_host instead.
Note
For decimals, since the precision is not stored for them in libcudf, it will be converted to an Arrow decimal128 that has the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.
- Throws:
cudf::logic_error – if the size of column_names doesn’t match the number of columns.
- Parameters:
input – table_view that needs to be converted to arrow Table
metadata – Contains hierarchy of names of columns and children
stream – CUDA stream used for device memory operations and kernel launches
ar_mr – arrow memory pool to allocate memory for arrow Table
- Returns:
arrow Table generated from input
-
std::shared_ptr<arrow::Scalar> to_arrow(cudf::scalar const &input, column_metadata const &metadata = {}, rmm::cuda_stream_view stream = cudf::get_default_stream(), arrow::MemoryPool *ar_mr = arrow::default_memory_pool())
Create arrow::Scalar from cudf scalar input.
Converts the cudf::scalar to arrow::Scalar.
- Deprecated:
Since 24.08.
Note
For decimals, since the precision is not stored for them in libcudf, it will be converted to an Arrow decimal128 that has the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.
- Parameters:
input – scalar that needs to be converted to arrow Scalar
metadata – Contains hierarchy of names of columns and children
stream – CUDA stream used for device memory operations and kernel launches
ar_mr – arrow memory pool to allocate memory for arrow Scalar
- Returns:
arrow Scalar generated from input
-
unique_schema_t to_arrow_schema(cudf::table_view const &input, cudf::host_span<column_metadata const> metadata)
Create ArrowSchema from cudf table and metadata.
Populates and returns an ArrowSchema C struct using a table and metadata.
Note
For decimals, since the precision is not stored for them in libcudf, decimals will be converted to an Arrow decimal128 which has the widest precision that the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 with a precision of 9, which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 with a precision of 38.
- Parameters:
input – Table to create a schema from
metadata – Contains the hierarchy of names of columns and children
- Returns:
ArrowSchema generated from input
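The precisions quoted in the decimal note (9 for 32-bit, 38 for 128-bit) are the largest digit counts that every value of a signed integer of that width can represent. The sketch below reproduces that arithmetic only; it is not how libcudf computes the value, and max_decimal_precision is a hypothetical helper name. The figure of 18 for a 64-bit type is derived from the same formula and is not stated in the text above.

```cpp
#include <cassert>
#include <cmath>

// Number of decimal digits that always fit in a signed integer with `bits`
// bits: floor((bits - 1) * log10(2)). One bit is reserved for the sign.
int max_decimal_precision(int bits)
{
  assert(bits > 1);
  return static_cast<int>(std::floor((bits - 1) * std::log10(2.0)));
}
```

For example, max_decimal_precision(32) evaluates floor(31 * 0.30103) and max_decimal_precision(128) evaluates floor(127 * 0.30103), which match the precisions named in the note.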
-
unique_device_array_t to_arrow_device(cudf::table &&table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create ArrowDeviceArray from cudf table and metadata.
Populates the C struct ArrowDeviceArray without performing copies if possible. This maintains the data on the GPU device and gives ownership of the table and its buffers to the ArrowDeviceArray struct.
After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up the memory.
Note
For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.
Note
Copies will be performed in the cases where cudf differs from Arrow such as in the representation of bools (Arrow uses a bitmap, cudf uses 1-byte per value).
- Parameters:
table – Input table, ownership of the data will be moved to the result
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used for any allocations during conversion
- Returns:
ArrowDeviceArray which will have ownership of the GPU data, consumer must call release
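The note on bools explains why a copy is unavoidable: the two layouts differ, so the bytes have to be repacked. The following host-only sketch illustrates the two representations, assuming Arrow's least-significant-bit-first bitmap ordering. pack_bools and unpack_bools are hypothetical helpers; libcudf performs the equivalent work on the GPU, and its implementation is not shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One byte per value (cudf layout) -> one bit per value (Arrow layout).
// Value i lands in byte i / 8 at bit position i % 8.
std::vector<std::uint8_t> pack_bools(std::vector<std::uint8_t> const& bytes)
{
  std::vector<std::uint8_t> bitmap((bytes.size() + 7) / 8, 0);
  for (std::size_t i = 0; i < bytes.size(); ++i) {
    if (bytes[i] != 0) { bitmap[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8)); }
  }
  return bitmap;
}

// One bit per value (Arrow layout) -> one byte per value (cudf layout).
std::vector<std::uint8_t> unpack_bools(std::vector<std::uint8_t> const& bitmap, std::size_t n)
{
  assert(bitmap.size() * 8 >= n);
  std::vector<std::uint8_t> bytes(n);
  for (std::size_t i = 0; i < n; ++i) {
    bytes[i] = static_cast<std::uint8_t>((bitmap[i / 8] >> (i % 8)) & 1u);
  }
  return bytes;
}
```

Nine boolean values occupy nine bytes in the byte-per-value layout but only two bytes as a bitmap, which is why the buffer cannot simply be handed over.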
-
unique_device_array_t to_arrow_device(cudf::column &&col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create ArrowDeviceArray from cudf column and metadata.
Populates the C struct ArrowDeviceArray without performing copies if possible. This maintains the data on the GPU device and gives ownership of the column and its buffers to the ArrowDeviceArray struct.
After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up the memory.
Note
For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.
Note
Copies will be performed in the cases where cudf differs from Arrow such as in the representation of bools (Arrow uses a bitmap, cudf uses 1 byte per value).
- Parameters:
col – Input column, ownership of the data will be moved to the result
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used for any allocations during conversion
- Returns:
ArrowDeviceArray which will have ownership of the GPU data
-
unique_device_array_t to_arrow_device(cudf::table_view const &table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create ArrowDeviceArray from a table view.
Populates the C struct ArrowDeviceArray performing copies only if necessary. This wraps the data on the GPU device and gives a view of the table data to the ArrowDeviceArray struct. If the caller frees the data referenced by the table_view, using the returned object results in undefined behavior.
After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up any memory created during conversion.
Copies will be performed in the cases where cudf differs from Arrow:
BOOL8: Arrow uses a bitmap and cudf uses 1 byte per value
DECIMAL32 and DECIMAL64: Converted to Arrow decimal128
STRING: Arrow expects a single value int32 offset child array for empty strings columns
Note
For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.
- Parameters:
table – Input table
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used for any allocations during conversion
- Returns:
ArrowDeviceArray which will have ownership of any copied data
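The STRING case in the list of copies follows from how Arrow lays out variable-length data: an offsets buffer with one more entry than there are rows, so an empty column still needs a single offset of 0. The host-only sketch below shows that rule; make_offsets is a hypothetical helper and does not reflect how libcudf builds the child array on the device.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Build int32 offsets for a list of strings. The result always has
// strings.size() + 1 entries, starting at 0.
std::vector<std::int32_t> make_offsets(std::vector<std::string> const& strings)
{
  std::vector<std::int32_t> offsets;
  offsets.reserve(strings.size() + 1);
  offsets.push_back(0);
  for (auto const& s : strings) {
    // int32 offsets cap the total character data of one column.
    assert(s.size() <=
           static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max() - offsets.back()));
    offsets.push_back(offsets.back() + static_cast<std::int32_t>(s.size()));
  }
  return offsets;
}
```

Row i spans the half-open byte range from offsets[i] to offsets[i + 1], so an empty string repeats the previous offset and a column with no rows yields the single value 0.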
-
unique_device_array_t to_arrow_device(cudf::column_view const &col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create ArrowDeviceArray from a column view.
Populates the C struct ArrowDeviceArray performing copies only if necessary. This wraps the data on the GPU device and gives a view of the column data to the ArrowDeviceArray struct. If the caller frees the data referenced by the column_view, using the returned object results in undefined behavior.
After calling this function, the release callback on the returned ArrowDeviceArray must be called to clean up any memory created during conversion.
Copies will be performed in the cases where cudf differs from Arrow:
BOOL8: Arrow uses a bitmap and cudf uses 1 byte per value
DECIMAL32 and DECIMAL64: Converted to Arrow decimal128
STRING: Arrow expects a single value int32 offset child array for empty strings columns
Note
For decimals, since the precision is not stored for them in libcudf it will be converted to an Arrow decimal128 with the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of the precision 38.
- Parameters:
col – Input column
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used for any allocations during conversion
- Returns:
ArrowDeviceArray which will have ownership of any copied data
-
unique_device_array_t to_arrow_host(cudf::table_view const &table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Copy table view data to host and create ArrowDeviceArray for it.
Populates the C struct ArrowDeviceArray, copying the cudf data to the host. The returned ArrowDeviceArray will have a device_type of CPU and will have no ties to the memory referenced by the table view passed in. The deleter for the returned unique_ptr will call the release callback on the ArrowDeviceArray automatically.
Note
For decimals, since the precision is not stored for them in libcudf, it will be converted to an Arrow decimal128 that has the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of precision 38.
- Parameters:
table – Input table
stream – CUDA stream used for the device memory operations and kernel launches
mr – Device memory resource used for any allocations during conversion
- Returns:
ArrowDeviceArray generated from input table
-
unique_device_array_t to_arrow_host(cudf::column_view const &col, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Copy column view data to host and create ArrowDeviceArray for it.
Populates the C struct ArrowDeviceArray, copying the cudf data to the host. The returned ArrowDeviceArray will have a device_type of CPU and will have no ties to the memory referenced by the column view passed in. The deleter for the returned unique_ptr will call the release callback on the ArrowDeviceArray automatically.
Note
For decimals, since the precision is not stored for them in libcudf, it will be converted to an Arrow decimal128 that has the widest-precision the cudf decimal type supports. For example, numeric::decimal32 will be converted to Arrow decimal128 of the precision 9 which is the maximum precision for 32-bit types. Similarly, numeric::decimal128 will be converted to Arrow decimal128 of precision 38.
- Parameters:
col – Input column
stream – CUDA stream used for the device memory operations and kernel launches
mr – Device memory resource used for any allocations during conversion
- Returns:
ArrowDeviceArray generated from input column
-
std::unique_ptr<table> from_arrow(arrow::Table const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::table from given arrow Table input.
- Deprecated:
Since 24.08. Use cudf::from_arrow_host instead.
- Parameters:
input – arrow::Table that needs to be converted to cudf::table
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate cudf::table
- Returns:
cudf table generated from given arrow Table
-
std::unique_ptr<cudf::scalar> from_arrow(arrow::Scalar const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::scalar from given arrow Scalar input.
- Deprecated:
Since 24.08. Use arrow’s MakeArrayFromScalar on the input, followed by ExportArray to obtain something that can be consumed by from_arrow_host. Then use cudf::get_element to extract a device scalar from the column.
- Parameters:
input – arrow::Scalar that needs to be converted to cudf::scalar
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate cudf::scalar
- Returns:
cudf scalar generated from given arrow Scalar
-
std::unique_ptr<cudf::table> from_arrow(ArrowSchema const *schema, ArrowArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::table from given ArrowArray and ArrowSchema input.
The conversion will not call release on the input Array.
- Throws:
std::invalid_argument – if either schema or input is NULL
cudf::data_type_error – if the input array is not a struct array.
- Parameters:
schema – ArrowSchema pointer to describe the type of the data
input – ArrowArray pointer that needs to be converted to cudf::table
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate cudf::table
- Returns:
cudf table generated from given arrow data
-
std::unique_ptr<cudf::column> from_arrow_column(ArrowSchema const *schema, ArrowArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::column from a given ArrowArray and ArrowSchema input.
The conversion will not call release on the input Array.
- Throws:
std::invalid_argument – if either schema or input is NULL
- Parameters:
schema – ArrowSchema pointer to describe the type of the data
input – ArrowArray pointer that needs to be converted to cudf::column
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate cudf::column
- Returns:
cudf column generated from given arrow data
-
std::unique_ptr<table> from_arrow_host(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::table from given ArrowDeviceArray input.
The conversion will not call release on the input Array.
- Throws:
std::invalid_argument – if either schema or input is NULL
std::invalid_argument – if the device_type is not ARROW_DEVICE_CPU
cudf::data_type_error – if the input array is not a struct array; non-struct arrays should be passed to from_arrow_host_column instead.
- Parameters:
schema – ArrowSchema pointer to describe the type of the data
input – ArrowDeviceArray pointer to object owning the Arrow data
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to perform cuda allocation
- Returns:
cudf table generated from the given Arrow data
-
std::unique_ptr<table> from_arrow_stream(ArrowArrayStream *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::table from given ArrowArrayStream input.
The conversion WILL release the input ArrowArrayStream and its constituent arrays or schema since Arrow streams are not suitable for multiple reads.
- Throws:
std::invalid_argument – if input is NULL
- Parameters:
input – ArrowArrayStream pointer to object that will produce ArrowArray data
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to perform cuda allocation
- Returns:
cudf table generated from the given Arrow data
-
std::unique_ptr<column> from_arrow_host_column(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::column from given ArrowDeviceArray input.
The conversion will not call release on the input Array.
- Throws:
std::invalid_argument – if either schema or input is NULL
std::invalid_argument – if the device_type is not ARROW_DEVICE_CPU
cudf::data_type_error – if input arrow data type is not supported in cudf.
- Parameters:
schema – ArrowSchema pointer to describe the type of the data
input – ArrowDeviceArray pointer to object owning the Arrow data
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to perform cuda allocation
- Returns:
cudf column generated from the given Arrow data
-
unique_table_view_t from_arrow_device(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::table_view from given ArrowDeviceArray and ArrowSchema.
Constructs a non-owning cudf::table_view using ArrowDeviceArray and ArrowSchema; the data must be accessible to the CUDA device. Because the resulting cudf::table_view will not own the data, the ArrowDeviceArray must be kept alive for the lifetime of the result. It is the responsibility of callers to ensure they call the release callback on the ArrowDeviceArray after it is no longer needed, and that the cudf::table_view is not accessed after this happens.
Each child of the input struct will be a column of the resulting table_view.
Note
The custom deleter used for the unique_ptr to the table_view maintains ownership over any memory which is allocated, such as converting boolean columns from the bitmap used by Arrow to the 1-byte per value for cudf.
Note
If the input ArrowDeviceArray contained a non-null sync_event it is assumed to be a cudaEvent_t* and the passed in stream will have cudaStreamWaitEvent called on it with the event. This function, however, will not explicitly synchronize on the stream.
- Throws:
std::invalid_argument – if device_type is not ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST or ARROW_DEVICE_CUDA_MANAGED
cudf::data_type_error – if the input array is not a struct array; non-struct arrays should be passed to from_arrow_device_column instead.
cudf::data_type_error – if the input arrow data type is not supported.
- Parameters:
schema – ArrowSchema pointer to object describing the type of the device array
input – ArrowDeviceArray pointer to object owning the Arrow data
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to perform any allocations
- Returns:
cudf::table_view generated from given Arrow data
-
unique_column_view_t from_arrow_device_column(ArrowSchema const *schema, ArrowDeviceArray const *input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Create cudf::column_view from given ArrowDeviceArray and ArrowSchema.
Constructs a non-owning cudf::column_view using ArrowDeviceArray and ArrowSchema; the data must be accessible to the CUDA device. Because the resulting cudf::column_view will not own the data, the ArrowDeviceArray must be kept alive for the lifetime of the result. It is the responsibility of callers to ensure they call the release callback on the ArrowDeviceArray after it is no longer needed, and that the cudf::column_view is not accessed after this happens.
Note
The custom deleter used for the unique_ptr to the column_view maintains ownership over any memory which is allocated, such as converting boolean columns from the bitmap used by Arrow to the 1-byte per value for cudf.
Note
If the input ArrowDeviceArray contained a non-null sync_event it is assumed to be a cudaEvent_t* and the passed in stream will have cudaStreamWaitEvent called on it with the event. This function, however, will not explicitly synchronize on the stream.
- Throws:
std::invalid_argument – if device_type is not ARROW_DEVICE_CUDA, ARROW_DEVICE_CUDA_HOST or ARROW_DEVICE_CUDA_MANAGED
cudf::data_type_error – if the input arrow data type is not supported.
- Parameters:
schema – ArrowSchema pointer to object describing the type of the device array
input – ArrowDeviceArray pointer to object owning the Arrow data
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to perform any allocations
- Returns:
cudf::column_view generated from given Arrow data
-
struct column_metadata
- #include <interop.hpp>
Detailed metadata information for arrow array.
Currently this contains only the name in the hierarchy of children of a cudf column, but it may be extended in the future as required.
Public Functions
-
inline column_metadata(std::string _name)
Construct a new column metadata object.
- Parameters:
_name – Name of the column
Public Members
-
std::string name
Name of the column.
-
std::vector<column_metadata> children_meta
Metadata of children of the column.
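To make the "hierarchy of names" concrete, the sketch below builds metadata for a two-column table whose second column is a struct with two children. It uses a stand-in type in a hypothetical sketch namespace that mirrors only the members documented above, so it compiles without libcudf; make_example_metadata and count_names are illustrative helpers, not part of the library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Stand-in mirroring the documented members of column_metadata.
struct column_metadata {
  std::string name;
  std::vector<column_metadata> children_meta;
  column_metadata(std::string _name) : name(std::move(_name)) {}
};

// Metadata for a table shaped like: id, point{x, y}.
std::vector<column_metadata> make_example_metadata()
{
  column_metadata id{"id"};
  column_metadata point{"point"};
  point.children_meta.emplace_back("x");
  point.children_meta.emplace_back("y");
  assert(point.children_meta.size() == 2);
  return {id, point};
}

// Count every name in the hierarchy, including nested children.
std::size_t count_names(std::vector<column_metadata> const& meta)
{
  std::size_t total = 0;
  for (auto const& m : meta) { total += 1 + count_names(m.children_meta); }
  return total;
}

}  // namespace sketch
```

The top-level vector has one entry per column of the table, and each entry nests one child entry per child column.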
-
template<typename ViewType>
struct custom_view_deleter
- #include <interop.hpp>
functor for a custom deleter to a unique_ptr of table_view
When converting from an ArrowDeviceArray, there are cases where data can’t be zero-copy (e.g. bools or non-UINT32 dictionary indices). This custom deleter is used to maintain ownership over the data allocated since a cudf::table_view doesn’t hold ownership.
Public Functions
-
inline explicit custom_view_deleter(owned_columns_t &&owned)
Construct a new custom view deleter object.
- Parameters:
owned – Vector of owning columns
Public Members
-
owned_columns_t owned_mem_
Owned columns that must be deleted.
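The idea of a deleter that carries ownership can be sketched without libcudf. The example below uses stand-in column and column_view types in a hypothetical sketch namespace; view_deleter and make_converted_view are illustrative names, and the real custom_view_deleter may differ in detail. It shows how memory allocated during a conversion stays alive exactly as long as the unique_ptr to the non-owning view.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

namespace sketch {

struct column {  // stand-in for an owning column
  std::vector<int> data;
};
struct column_view {  // stand-in for a non-owning view
  int const* data;
  std::size_t size;
};

using owned_columns_t = std::vector<std::unique_ptr<column>>;

// Deleter functor that owns the columns backing the view.
template <typename ViewType>
struct view_deleter {
  explicit view_deleter(owned_columns_t&& owned) : owned_mem_(std::move(owned)) {}
  void operator()(ViewType* ptr) const { delete ptr; }
  owned_columns_t owned_mem_;
};

using unique_column_view_t = std::unique_ptr<column_view, view_deleter<column_view>>;

// Pretend a conversion had to allocate: the new column is stored in the
// deleter, and the returned view points into it.
unique_column_view_t make_converted_view(std::vector<int> values)
{
  owned_columns_t owned;
  owned.push_back(std::make_unique<column>(column{std::move(values)}));
  assert(owned.back() != nullptr);
  auto* view = new column_view{owned.back()->data.data(), owned.back()->data.size()};
  return unique_column_view_t{view, view_deleter<column_view>{std::move(owned)}};
}

}  // namespace sketch
```

When the returned unique_ptr is destroyed, the view is deleted first and the deleter's owned columns are freed along with it, so the view never outlives the memory it points to.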