Reorder Partition#
- group reorder_partition
Enums
Functions
-
std::pair<std::unique_ptr<table>, std::vector<size_type>> partition(table_view const &t, column_view const &partition_map, size_type num_partitions, rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Partitions rows of
t
according to the mapping specified bypartition_map
.For each row at
i
int
,partition_map[i]
indicates which partition rowi
belongs to.partition
creates a new table by rearranging the rows oft
such that rows in the same partition are contiguous. The returned table is in ascending partition order from[0, num_partitions)
. The order within each partition is undefined.Returns a
vector<size_type>
ofnum_partitions + 1
values that indicate the starting position of each partition within the returned table, i.e., partitioni
starts atoffsets[i]
(inclusive) and ends atoffset[i+1]
(exclusive). As a result, if valuej
in[0, num_partitions)
does not appear inpartition_map
, partitionj
will be empty, i.e.,offsets[j+1] - offsets[j] == 0
.Values in
partition_map
must be in the range[0, num_partitions)
, otherwise behavior is undefined.- Throws:
cudf::logic_error – when
partition_map
is a non-integer typecudf::logic_error – when
partition_map.has_nulls() == true
cudf::logic_error – when
partition_map.size() != t.num_rows()
- Parameters:
t – The table to partition
partition_map – Non-nullable column of integer values that map each row in
t
to it’s partition.num_partitions – The total number of partitions
mr – Device memory resource used to allocate the returned table’s device memory
- Returns:
Pair containing the reordered table and vector of
num_partitions + 1
offsets to each partition such that the size of partitioni
is determined byoffset[i+1] - offset[i]
.
-
std::pair<std::unique_ptr<table>, std::vector<size_type>> hash_partition(table_view const &input, std::vector<size_type> const &columns_to_hash, int num_partitions, hash_id hash_function = hash_id::HASH_MURMUR3, uint32_t seed = DEFAULT_HASH_SEED, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Partitions rows from the input table into multiple output tables.
Partitions rows of
input
intonum_partitions
bins based on the hash value of the columns specified bycolumns_to_hash
. Rows partitioned into the same bin are grouped consecutively in the output table. Returns a vector of row offsets to the start of each partition in the output table.- Throws:
std::out_of_range – if index is
columns_to_hash
is invalid- Parameters:
input – The table to partition
columns_to_hash – Indices of input columns to hash
num_partitions – The number of partitions to use
hash_function – Optional hash id that chooses the hash function to use
seed – Optional seed value to the hash function
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned table’s device memory
- Returns:
An output table and a vector of row offsets to each partition
-
std::pair<std::unique_ptr<cudf::table>, std::vector<cudf::size_type>> round_robin_partition(table_view const &input, cudf::size_type num_partitions, cudf::size_type start_partition = 0, rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Round-robin partition.
Returns a new table with rows re-arranged into partition groups and a vector of row offsets to the start of each partition in the output table. Rows are assigned partitions based on their row index in the table, in a round robin fashion.
A good analogy for the algorithm is dealing out cards:
The deck of cards is represented as the rows in the table.
The number of partitions is the number of players being dealt cards.
the start_partition indicates which player starts getting cards first.
The algorithm has two outcomes:
Another deck of cards formed by stacking each player’s cards back into a deck again, preserving the order of cards dealt to each player, starting with player 0.
A vector into the output deck indicating where a player’s cards start.
A player’s deck (partition) is the range of cards starting at the corresponding offset and ending at the next player’s starting offset or the last card in the deck if it’s the last player.
When num_partitions > nrows, we have more players than cards. We start dealing to the first indicated player and continuing around the players until we run out of cards before we run out of players. Players that did not get any cards are represented by
offset[i] == offset[i+1] or offset[i] == table.num_rows() if i == num_partitions-1
meaning there are no cards (rows) in their deck (partition).Example 1: input: table => col 1 {0, ..., 12} num_partitions = 3 start_partition = 0 output: pair<table, partition_offsets> table => col 1 {0,3,6,9,12,1,4,7,10,2,5,8,11} partition_offsets => {0,5,9} Example 2: input: table => col 1 {0, ..., 12} num_partitions = 3 start_partition = 1 output: pair<table, partition_offsets> table => col 1 {2,5,8,11,0,3,6,9,12,1,4,7,10} partition_offsets => {0,4,9} Example 3: input: table => col 1 {0, ..., 10} num_partitions = 3 start_partition = 0 output: pair<table, partition_offsets> table => col 1 {0,3,6,9,1,4,7,10,2,5,8} partition_offsets => {0,4,8} Example 4: input: table => col 1 {0, ..., 10} num_partitions = 3 start_partition = 1 output: pair<table, partition_offsets> table => col 1 {2,5,8,0,3,6,9,1,4,7,10} partition_offsets => {0,3,7} Example 5: input: table => col 1 {0, ..., 10} num_partitions = 3 start_partition = 2 output: pair<table, partition_offsets> table => col 1 {1,4,7,10,2,5,8,0,3,6,9} partition_offsets => {0,4,7} Example 6: input: table => col 1 {0, ..., 10} num_partitions = 15 > num_rows = 11 start_partition = 2 output: pair<table, partition_offsets> table => col 1 {0,1,2,3,4,5,6,7,8,9,10} partition_offsets => {0,0,0,1,2,3,4,5,6,7,8,9,10,11,11} Example 7: input: table => col 1 {0, ..., 10} num_partitions = 15 > num_rows = 11 start_partition = 10 output: pair<table, partition_offsets> table => col 1 {5,6,7,8,9,10,0,1,2,3,4} partition_offsets => {0,1,2,3,4,5,6,6,6,6,6,7,8,9,10} Example 8: input: table => col 1 {0, ..., 10} num_partitions = 15 > num_rows = 11 start_partition = 14 output: pair<table, partition_offsets> table => col 1 {1,2,3,4,5,6,7,8,9,10,0} partition_offsets => {0,1,2,3,4,5,6,7,8,9,10,10,10,10,10} Example 9: input: table => col 1 {0, ..., 10} num_partitions = 11 == num_rows = 11 start_partition = 2 output: pair<table, partition_offsets> table => col 1 {9,10,0,1,2,3,4,5,6,7,8} partition_offsets => {0,1,2,3,4,5,6,7,8,9,10}
- Throws:
cudf::logic_error – if
num_partitions <= 1
cudf::logic_error – if
start_partition >= num_partitions
- Parameters:
input – [in] The input table to be round-robin partitioned
num_partitions – [in] Number of partitions for the table
start_partition – [in] Index of the 1st partition
mr – [in] Device memory resource used to allocate the returned table’s device memory
- Returns:
A std::pair consisting of a unique_ptr to the partitioned table and the partition offsets for each partition within the table.
-
std::pair<std::unique_ptr<table>, std::vector<size_type>> partition(table_view const &t, column_view const &partition_map, size_type num_partitions, rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#