Files | |
file | partitioning.hpp |
Column partitioning APIs. | |
Enumerations | |
enum class | cudf::hash_id { cudf::HASH_IDENTITY = 0 , cudf::HASH_MURMUR3 } |
Identifies the hash function to be used in hash partitioning. More... | |
Functions | |
std::pair< std::unique_ptr< table >, std::vector< size_type > > | cudf::partition (table_view const &t, column_view const &partition_map, size_type num_partitions, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource()) |
Partitions rows of t according to the mapping specified by partition_map . More... | |
std::pair< std::unique_ptr< table >, std::vector< size_type > > | cudf::hash_partition (table_view const &input, std::vector< size_type > const &columns_to_hash, int num_partitions, hash_id hash_function=hash_id::HASH_MURMUR3, uint32_t seed=DEFAULT_HASH_SEED, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource()) |
Partitions rows from the input table into multiple output tables. More... | |
std::pair< std::unique_ptr< cudf::table >, std::vector< cudf::size_type > > | cudf::round_robin_partition (table_view const &input, cudf::size_type num_partitions, cudf::size_type start_partition=0, rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource()) |
Round-robin partition. More... | |
|
strong |
Identifies the hash function to be used in hash partitioning.
Enumerator | |
---|---|
HASH_IDENTITY | Identity hash function that simply returns the key to be hashed. |
HASH_MURMUR3 | Murmur3 hash function. |
Definition at line 41 of file partitioning.hpp.
std::pair<std::unique_ptr<table>, std::vector<size_type> > cudf::hash_partition | ( | table_view const & | input, |
std::vector< size_type > const & | columns_to_hash, | ||
int | num_partitions, | ||
hash_id | hash_function = hash_id::HASH_MURMUR3 , |
||
uint32_t | seed = DEFAULT_HASH_SEED , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Partitions rows from the input table into multiple output tables.
Partitions rows of input
into num_partitions
bins based on the hash value of the columns specified by columns_to_hash
. Rows partitioned into the same bin are grouped consecutively in the output table. Returns a vector of row offsets to the start of each partition in the output table.
std::out_of_range | if index is columns_to_hash is invalid |
input | The table to partition |
columns_to_hash | Indices of input columns to hash |
num_partitions | The number of partitions to use |
hash_function | Optional hash id that chooses the hash function to use |
seed | Optional seed value to the hash function |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned table's device memory |
std::pair<std::unique_ptr<table>, std::vector<size_type> > cudf::partition | ( | table_view const & | t, |
column_view const & | partition_map, | ||
size_type | num_partitions, | ||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Partitions rows of t
according to the mapping specified by partition_map
.
For each row at i
in t
, partition_map[i]
indicates which partition row i
belongs to. partition
creates a new table by rearranging the rows of t
such that rows in the same partition are contiguous. The returned table is in ascending partition order from [0, num_partitions)
. The order within each partition is undefined.
Returns a vector<size_type>
of num_partitions + 1
values that indicate the starting position of each partition within the returned table, i.e., partition i
starts at offsets[i]
(inclusive) and ends at offset[i+1]
(exclusive). As a result, if value j
in [0, num_partitions)
does not appear in partition_map
, partition j
will be empty, i.e., offsets[j+1] - offsets[j] == 0
.
Values in partition_map
must be in the range [0, num_partitions)
, otherwise behavior is undefined.
cudf::logic_error | when partition_map is a non-integer type |
cudf::logic_error | when partition_map.has_nulls() == true |
cudf::logic_error | when partition_map.size() != t.num_rows() |
t | The table to partition |
partition_map | Non-nullable column of integer values that map each row in t to it's partition. |
num_partitions | The total number of partitions |
mr | Device memory resource used to allocate the returned table's device memory |
num_partitions + 1
offsets to each partition such that the size of partition i
is determined by offset[i+1] - offset[i]
. std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::size_type> > cudf::round_robin_partition | ( | table_view const & | input, |
cudf::size_type | num_partitions, | ||
cudf::size_type | start_partition = 0 , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Round-robin partition.
Returns a new table with rows re-arranged into partition groups and a vector of row offsets to the start of each partition in the output table. Rows are assigned partitions based on their row index in the table, in a round robin fashion.
cudf::logic_error | if num_partitions <= 1 |
cudf::logic_error | if start_partition >= num_partitions |
A good analogy for the algorithm is dealing out cards:
The algorithm has two outcomes:
A player's deck (partition) is the range of cards starting at the corresponding offset and ending at the next player's starting offset or the last card in the deck if it's the last player.
When num_partitions > nrows, we have more players than cards. We start dealing to the first indicated player and continuing around the players until we run out of cards before we run out of players. Players that did not get any cards are represented by offset[i] == offset[i+1] or offset[i] == table.num_rows() if i == num_partitions-1
meaning there are no cards (rows) in their deck (partition).
[in] | input | The input table to be round-robin partitioned |
[in] | num_partitions | Number of partitions for the table |
[in] | start_partition | Index of the 1st partition |
[in] | mr | Device memory resource used to allocate the returned table's device memory |