Copy Split#

group copy_split

Functions

std::vector<packed_table> contiguous_split(cudf::table_view const &input, std::vector<size_type> const &splits, rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Performs a deep-copy split of a table_view into a vector of packed_table, where each packed_table uses a single contiguous block of memory for all of the split’s column data.

The memory for the output views is allocated in a single contiguous rmm::device_buffer returned in the packed_table. There is no top-level owning table.

The returned views of input are constructed from a vector of indices that indicate where each split should occur. The ith returned table_view is sliced as [0, splits[i]) if i=0, as [splits[i-1], input.size()) if i is the last view, and as [splits[i-1], splits[i]) otherwise.

For all i, it is expected that splits[i] <= splits[i+1] <= input.size(). For a splits vector of size N, there will always be N+1 splits in the output.

Example:
input:   [{10, 12, 14, 16, 18, 20, 22, 24, 26, 28},
          {50, 52, 54, 56, 58, 60, 62, 64, 66, 68}]
splits:  {2, 5, 9}
output:  [{{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}},
          {{50, 52}, {54, 56, 58}, {60, 62, 64, 66}, {68}}]
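The mapping from split points to row ranges can be sketched on the host with plain standard-library containers. This is only a model of the documented semantics, not the cudf implementation: `split_ranges` and `apply_ranges` are hypothetical helpers, and a `std::vector<int>` stands in for one column of device data.

```cpp
#include <utility>
#include <vector>

// Hypothetical helper (not part of cudf): turn split points into half-open
// row ranges [begin, end). N split points always yield N + 1 ranges.
std::vector<std::pair<int, int>> split_ranges(int num_rows, std::vector<int> const& splits)
{
  std::vector<std::pair<int, int>> ranges;
  int begin = 0;
  for (int s : splits) {
    ranges.emplace_back(begin, s);  // [previous split point, current split point)
    begin = s;
  }
  ranges.emplace_back(begin, num_rows);  // the last range runs to the end of the input
  return ranges;
}

// Hypothetical helper: copy the rows of one host-side column for each range.
// Every column of a table is cut at the same row ranges.
std::vector<std::vector<int>> apply_ranges(std::vector<int> const& column,
                                           std::vector<std::pair<int, int>> const& ranges)
{
  std::vector<std::vector<int>> pieces;
  for (auto const& r : ranges) {
    pieces.emplace_back(column.begin() + r.first, column.begin() + r.second);
  }
  return pieces;
}
```

For a 10-row input and splits {2, 5, 9}, `split_ranges` yields [0, 2), [2, 5), [5, 9), and [9, 10). Applying those ranges to the first column of the example gives {10, 12}, {14, 16, 18}, {20, 22, 24, 26}, and {28}.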

Note

It is the caller’s responsibility to ensure that the returned views do not outlive the viewed device memory contained in the data field of the returned packed_table.

Throws:
  • std::out_of_range – When the last index in splits is greater than the size of input.

  • std::out_of_range – When a value in splits is not in the range [0, input.size()].

  • std::invalid_argument – When the values in splits are decreasing, that is, splits[i] > splits[i+1] for some i.

Parameters:
  • input – View of a table to split

  • splits – A vector of indices where the view will be split

  • mr – An optional memory resource to use for all returned device allocations

Returns:

The set of requested views of input indicated by the splits and the viewed memory buffer

packed_columns pack(cudf::table_view const &input, rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Deep-copy a table_view into a serialized contiguous memory format.

The metadata from the table_view is copied into a host vector of bytes and the data from the table_view is copied into a device_buffer. Pass the output of this function into cudf::unpack to deserialize.

Parameters:
  • input – View of the table to pack

  • mr – An optional memory resource to use for all returned device allocations

Returns:

A packed_columns struct containing the serialized metadata and data in contiguous host and device memory, respectively

std::vector<uint8_t> pack_metadata(table_view const &table, uint8_t const *contiguous_buffer, size_t buffer_size)#

Produce the metadata used for packing a table stored in a contiguous buffer.

The metadata from the table_view is copied into a host vector of bytes which can be used to construct a packed_columns or packed_table structure. The caller is responsible for guaranteeing that all of the columns in the table point into contiguous_buffer.

Parameters:
  • table – View of the table to pack

  • contiguous_buffer – A contiguous buffer of device memory which contains the data referenced by the columns in table

  • buffer_size – The size of contiguous_buffer

Returns:

Vector of bytes representing the metadata used to unpack a packed_columns struct

table_view unpack(packed_columns const &input)#

Deserialize the result of cudf::pack.

Converts the result of a serialized table into a table_view that points to the data stored in the contiguous device buffer contained in input.

It is the caller’s responsibility to ensure that the table_view in the output does not outlive the data in the input.

No new device memory is allocated in this function.

Parameters:

input – The packed columns to unpack

Returns:

The unpacked table_view

table_view unpack(uint8_t const *metadata, uint8_t const *gpu_data)#

Deserialize the result of cudf::pack.

Converts the result of a serialized table into a table_view that points to the data stored in the contiguous device buffer contained in gpu_data using the metadata contained in the host buffer metadata.

It is the caller’s responsibility to ensure that the table_view in the output does not outlive the data in the input.

No new device memory is allocated in this function.

Parameters:
  • metadata – The host-side metadata buffer resulting from the initial pack() call

  • gpu_data – The device-side contiguous buffer storing the data that will be referenced by the resulting table_view

Returns:

The unpacked table_view

std::vector<column_view> split(column_view const &input, host_span<size_type const> splits, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Splits a column_view into a set of column_views according to a set of indices derived from expected splits.

The returned views of input are constructed from a vector of splits, which indicates where each split should occur. The ith returned column_view is sliced as [0, splits[i]) if i=0, as [splits[i-1], input.size()) if i is the last view, and as [splits[i-1], splits[i]) otherwise.

For all i, it is expected that splits[i] <= splits[i+1] <= input.size(). For a splits vector of size N, there will always be N+1 splits in the output.

Example:
input:   {10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
splits:  {2, 5, 9}
output:  {{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}}

Note

It is the caller’s responsibility to ensure that the returned views do not outlive the viewed device memory.

Throws:
  • std::out_of_range – When the last index in splits is greater than the size of input.

  • std::out_of_range – When a value in splits is not in the range [0, input.size()].

  • std::invalid_argument – When the values in splits are decreasing, that is, splits[i] > splits[i+1] for some i.
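The documented preconditions on splits can be modelled on the host as a small check. This is a sketch under stated assumptions, not the library's own validation code: `validate_splits` is a hypothetical name, and the upper bound is treated as inclusive because the stated expectation is splits[i] <= input.size().

```cpp
#include <stdexcept>
#include <vector>

// Hypothetical host-side check (not the cudf implementation) that mirrors the
// documented exception types for a list of split points.
void validate_splits(int input_size, std::vector<int> const& splits)
{
  int previous = 0;
  for (int s : splits) {
    // Assumption: a split point equal to input_size is allowed and simply
    // produces an empty trailing piece.
    if (s < 0 || s > input_size) {
      throw std::out_of_range("split index outside [0, input size]");
    }
    // Equal neighbours are allowed; only a decrease is rejected.
    if (s < previous) {
      throw std::invalid_argument("split indices must be non-decreasing");
    }
    previous = s;
  }
}
```

With an input of 10 rows, {2, 5, 9} and {0, 10} pass this check, {2, 11} raises std::out_of_range, and {5, 2} raises std::invalid_argument.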

Parameters:
  • input – View of column to split

  • splits – Indices where the view will be split

  • stream – CUDA stream used for device memory operations and kernel launches

Returns:

The set of requested views of input indicated by the splits

std::vector<column_view> split(column_view const &input, std::initializer_list<size_type> splits, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Splits a column_view into a set of column_views according to a set of indices derived from expected splits.

The returned views of input are constructed from a vector of splits, which indicates where each split should occur. The ith returned column_view is sliced as [0, splits[i]) if i=0, as [splits[i-1], input.size()) if i is the last view, and as [splits[i-1], splits[i]) otherwise.

For all i, it is expected that splits[i] <= splits[i+1] <= input.size(). For a splits vector of size N, there will always be N+1 splits in the output.

Example:
input:   {10, 12, 14, 16, 18, 20, 22, 24, 26, 28}
splits:  {2, 5, 9}
output:  {{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}}

Note

It is the caller’s responsibility to ensure that the returned views do not outlive the viewed device memory.

Throws:
  • std::out_of_range – When the last index in splits is greater than the size of input.

  • std::out_of_range – When a value in splits is not in the range [0, input.size()].

  • std::invalid_argument – When the values in splits are decreasing, that is, splits[i] > splits[i+1] for some i.

Parameters:
  • input – View of column to split

  • splits – Indices where the view will be split

  • stream – CUDA stream used for device memory operations and kernel launches

Returns:

The set of requested views of input indicated by the splits

std::vector<table_view> split(table_view const &input, host_span<size_type const> splits, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Splits a table_view into a set of table_views according to a set of indices derived from expected splits.

The returned views of input are constructed from a vector of splits, which indicates where each split should occur. The ith returned table_view is sliced as [0, splits[i]) if i=0, as [splits[i-1], input.size()) if i is the last view, and as [splits[i-1], splits[i]) otherwise.

For all i, it is expected that splits[i] <= splits[i+1] <= input.size(). For a splits vector of size N, there will always be N+1 splits in the output.

Example:
input:   [{10, 12, 14, 16, 18, 20, 22, 24, 26, 28},
          {50, 52, 54, 56, 58, 60, 62, 64, 66, 68}]
splits:  {2, 5, 9}
output:  [{{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}},
          {{50, 52}, {54, 56, 58}, {60, 62, 64, 66}, {68}}]

Note

It is the caller’s responsibility to ensure that the returned views do not outlive the viewed device memory.

Throws:
  • std::out_of_range – When the last index in splits is greater than the size of input.

  • std::out_of_range – When a value in splits is not in the range [0, input.size()].

  • std::invalid_argument – When the values in splits are decreasing, that is, splits[i] > splits[i+1] for some i.

Parameters:
  • input – View of a table to split

  • splits – Indices where the view will be split

  • stream – CUDA stream used for device memory operations and kernel launches

Returns:

The set of requested views of input indicated by the splits

std::vector<table_view> split(table_view const &input, std::initializer_list<size_type> splits, rmm::cuda_stream_view stream = cudf::get_default_stream())#

Splits a table_view into a set of table_views according to a set of indices derived from expected splits.

The returned views of input are constructed from a vector of splits, which indicates where each split should occur. The ith returned table_view is sliced as [0, splits[i]) if i=0, as [splits[i-1], input.size()) if i is the last view, and as [splits[i-1], splits[i]) otherwise.

For all i, it is expected that splits[i] <= splits[i+1] <= input.size(). For a splits vector of size N, there will always be N+1 splits in the output.

Example:
input:   [{10, 12, 14, 16, 18, 20, 22, 24, 26, 28},
          {50, 52, 54, 56, 58, 60, 62, 64, 66, 68}]
splits:  {2, 5, 9}
output:  [{{10, 12}, {14, 16, 18}, {20, 22, 24, 26}, {28}},
          {{50, 52}, {54, 56, 58}, {60, 62, 64, 66}, {68}}]

Note

It is the caller’s responsibility to ensure that the returned views do not outlive the viewed device memory.

Throws:
  • std::out_of_range – When the last index in splits is greater than the size of input.

  • std::out_of_range – When a value in splits is not in the range [0, input.size()].

  • std::invalid_argument – When the values in splits are decreasing, that is, splits[i] > splits[i+1] for some i.

Parameters:
  • input – View of a table to split

  • splits – Indices where the view will be split

  • stream – CUDA stream used for device memory operations and kernel launches

Returns:

The set of requested views of input indicated by the splits

struct packed_columns#
#include <contiguous_split.hpp>

Column data in a serialized format.

Contains data from an array of columns in two contiguous buffers: one on the host, which contains the table metadata, and one on the device, which contains the table data.

Public Functions

inline packed_columns(std::unique_ptr<std::vector<uint8_t>> &&md, std::unique_ptr<rmm::device_buffer> &&gd)#

Construct a new packed columns object.

Parameters:
  • md – Host-side metadata buffer

  • gd – Device-side data buffer

Public Members

std::unique_ptr<std::vector<uint8_t>> metadata#

Host-side metadata buffer.

std::unique_ptr<rmm::device_buffer> gpu_data#

Device-side data buffer.

struct packed_table#
#include <contiguous_split.hpp>

The result(s) of a cudf::contiguous_split.

Each table_view resulting from a split operation performed by contiguous_split is returned wrapped in a packed_table. The table_view and internal column_views in this struct are not owned by a top-level cudf::table or cudf::column. The backing memory and metadata are instead owned by the data field and are stored in one contiguous block.

The user is responsible for ensuring that the table or any derived table_views do not outlive the memory owned by data.

Public Members

cudf::table_view table#

Result table_view of a cudf::contiguous_split.

packed_columns data#

Owned column data.

class chunked_pack#
#include <contiguous_split.hpp>

Perform a chunked “pack” operation of the input table_view using a user provided buffer of size user_buffer_size.

This operation is intended to be used in a streamed fashion when the GPU is out of memory, where we want to minimize the number of small cudaMemcpy calls and the tracking of all the metadata associated with cudf tables. Because of the memory constraints, all thrust and scratch memory allocations use the passed-in memory resource exclusively, not a per-device memory resource.

This class defines two methods that must be used in concert to carry out the chunked_pack: has_next and next. Here is an example:

// Create a table_view
cudf::table_view tv = ...;

// Choose a memory resource (optional). This memory resource is used for scratch/thrust temporary
// data. In memory constrained cases, this can be used to set aside scratch memory
// for `chunked_pack` at the beginning of a program.
auto mr = cudf::get_current_device_resource_ref();

// Define a buffer size for each chunk: the larger the buffer is, the more SMs can be
// occupied by this algorithm.
//
// Internally, the GPU unit of work is a 1MB batch. When we instantiate `cudf::chunked_pack`,
// all the 1MB batches for the source table_view are computed up front. Additionally,
// chunked_pack calculates the number of iterations that are required to go through all those
// batches given a `user_buffer_size` buffer. The number of 1MB batches in each iteration (chunk)
// equals the number of CUDA blocks that will be used for the main kernel launch.
//
std::size_t user_buffer_size = 128*1024*1024;

auto chunked_packer = cudf::chunked_pack::create(tv, user_buffer_size, mr);

std::size_t host_offset = 0;
auto host_buffer = ...; // obtain a host buffer you would like to copy to

while (chunked_packer->has_next()) {
  // get a user buffer of size `user_buffer_size`
  cudf::device_span<uint8_t> user_buffer = ...;
  std::size_t bytes_copied = chunked_packer->next(user_buffer);

  // `user_buffer` now holds `bytes_copied` bytes (at most `user_buffer_size`)
  // of the contiguously packed input `table_view`. You are now free to copy
  // this memory somewhere else, for example, to host.
  cudaMemcpyAsync(
    host_buffer.data() + host_offset,
    user_buffer.data(),
    bytes_copied,
    cudaMemcpyDefault,
    stream);

  host_offset += bytes_copied;
}

Public Functions

explicit chunked_pack(cudf::table_view const &input, std::size_t user_buffer_size, rmm::device_async_resource_ref temp_mr = cudf::get_current_device_resource_ref())#

Construct a chunked_pack class.

Parameters:
  • input – source table_view to pack

  • user_buffer_size – buffer size (in bytes) that will be passed on next. Must be at least 1MB

  • temp_mr – An optional memory resource to be used for temporary and scratch allocations only

~chunked_pack()#

Destructor that will be implemented as default. Declared with definition here because contiguous_split_state is incomplete at this stage.

std::size_t get_total_contiguous_size() const#

Obtain the total size of the contiguously packed table_view.

Returns:

total size (in bytes) of all the chunks

bool has_next() const#

Function to check if there are chunks left to be copied.

Returns:

true if there are chunks left to be copied, and false otherwise

std::size_t next(cudf::device_span<uint8_t> const &user_buffer)#

Packs the next chunk into user_buffer. This should be called as long as has_next returns true. If next is called when has_next is false, an exception is thrown.

Throws:
  • cudf::logic_error – If the size of user_buffer differs from user_buffer_size

  • cudf::logic_error – If called after all chunks have been copied

Parameters:

user_buffer – device span target for the chunk. The size of this span must equal the user_buffer_size parameter passed at construction

Returns:

The number of bytes that were written to user_buffer (at most user_buffer_size)
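The has_next/next contract described above can be exercised without a GPU through a small host-only stand-in. This is a hedged sketch: `mock_chunked_pack` and `drain` are hypothetical names, the stand-in copies from a host vector and fills every chunk completely, and it throws std::logic_error where the real class is documented to throw cudf::logic_error. The real cudf::chunked_pack may return fewer bytes per chunk than this model does.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical host-only stand-in for the has_next()/next() protocol.
// It is NOT cudf::chunked_pack; it only mirrors the documented contract.
class mock_chunked_pack {
 public:
  mock_chunked_pack(std::vector<std::uint8_t> source, std::size_t user_buffer_size)
    : source_(std::move(source)), user_buffer_size_(user_buffer_size)
  {
  }

  std::size_t get_total_contiguous_size() const { return source_.size(); }

  bool has_next() const { return offset_ < source_.size(); }

  // Copies the next chunk into user_buffer and returns the bytes written.
  std::size_t next(std::vector<std::uint8_t>& user_buffer)
  {
    if (user_buffer.size() != user_buffer_size_) {
      throw std::logic_error("user_buffer size differs from user_buffer_size");
    }
    if (!has_next()) { throw std::logic_error("no chunks left to copy"); }
    std::size_t const n = std::min(user_buffer_size_, source_.size() - offset_);
    std::copy_n(source_.begin() + offset_, n, user_buffer.begin());
    offset_ += n;
    return n;
  }

 private:
  std::vector<std::uint8_t> source_;
  std::size_t user_buffer_size_;
  std::size_t offset_ = 0;
};

// Drive the protocol: reuse one user buffer and append each chunk to a
// host-side destination sized from get_total_contiguous_size().
std::vector<std::uint8_t> drain(mock_chunked_pack& packer, std::size_t user_buffer_size)
{
  std::vector<std::uint8_t> host;
  host.reserve(packer.get_total_contiguous_size());
  std::vector<std::uint8_t> user_buffer(user_buffer_size);
  while (packer.has_next()) {
    std::size_t const bytes_copied = packer.next(user_buffer);
    host.insert(host.end(), user_buffer.begin(), user_buffer.begin() + bytes_copied);
  }
  return host;
}
```

Only the first `bytes_copied` bytes of the buffer are meaningful after each call, so the loop appends exactly that many. Once the loop finishes, the destination holds get_total_contiguous_size() bytes, and a further call to next throws.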

std::unique_ptr<std::vector<uint8_t>> build_metadata() const#

Build the opaque metadata for all added columns.

Returns:

A vector containing the serialized column metadata

Public Static Functions

static std::unique_ptr<chunked_pack> create(cudf::table_view const &input, std::size_t user_buffer_size, rmm::device_async_resource_ref temp_mr = cudf::get_current_device_resource_ref())#

Creates a chunked_pack instance to perform a “pack” of the table_view “input”, where a buffer of user_buffer_size is filled with chunks of the overall operation. This operation can be used in cases where GPU memory is constrained.

The memory resource (temp_mr) could be a special memory resource to be used in situations when GPU memory is low and we want scratch and temporary allocations to happen from a small reserved pool of memory. Note that it defaults to the regular cuDF per-device resource.

Throws:

cudf::logic_error – When user_buffer_size is less than 1MB

Parameters:
  • input – source table_view to pack

  • user_buffer_size – buffer size (in bytes) that will be passed on next. Must be at least 1MB

  • temp_mr – RMM memory resource to be used for temporary and scratch allocations only

Returns:

a unique_ptr of chunked_pack