Hash join that builds a hash table with the right table on construction and probes results in subsequent *_join member functions.
#include <hash_join.hpp>
Public Types | |
| using | impl_type = typename cudf::detail::hash_join< cudf::hashing::detail::MurmurHash3_x86_32< cudf::hash_value_type > > |
| Implementation type. | |
Static Public Member Functions | |
| static std::pair< std::unique_ptr< rmm::device_uvector< size_type > >, std::unique_ptr< rmm::device_uvector< size_type > > > | finalize_partitioned_full_join (cudf::host_span< cudf::device_span< size_type const > const > left_partials, cudf::host_span< cudf::device_span< size_type const > const > right_partials, size_type left_table_num_rows, size_type right_table_num_rows, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
| Finalizes a partitioned full join by concatenating all per-partition results and appending the unmatched right rows (the complement). | |
This class enables the hash join scheme that builds with the right table once and probes with many left tables (possibly in parallel).
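The build-once, probe-many scheme can be pictured with a small host-side model. This is only a sketch, assuming integer keys and a CPU `std::unordered_multimap`; `HostHashJoin` and its members are hypothetical names and are not part of cudf.

```cpp
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Host-side model only: NOT the cudf implementation or API.
class HostHashJoin {
 public:
  // Build phase: index every right-table key exactly once.
  explicit HostHashJoin(std::vector<int> const& right_keys) {
    for (std::int32_t i = 0; i < static_cast<std::int32_t>(right_keys.size()); ++i) {
      table_.emplace(right_keys[i], i);
    }
  }

  // Probe phase: may be called repeatedly with different left tables.
  // Returns (left_indices, right_indices) of the matching row pairs.
  std::pair<std::vector<std::int32_t>, std::vector<std::int32_t>> inner_join(
    std::vector<int> const& left_keys) const {
    std::vector<std::int32_t> left_idx, right_idx;
    for (std::int32_t i = 0; i < static_cast<std::int32_t>(left_keys.size()); ++i) {
      auto range = table_.equal_range(left_keys[i]);
      for (auto it = range.first; it != range.second; ++it) {
        left_idx.push_back(i);
        right_idx.push_back(it->second);
      }
    }
    return {left_idx, right_idx};
  }

 private:
  std::unordered_multimap<int, std::int32_t> table_;
};
```

Because `inner_join` is `const` and only reads the table, one built object can serve many probe calls, which mirrors the intent described above.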
Definition at line 65 of file hash_join.hpp.
| cudf::hash_join::hash_join | ( | cudf::table_view const & | right, |
| null_equality | compare_nulls, | ||
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Construct a hash join object for subsequent probe calls.
Note: The hash_join object must not outlive the table viewed by right; otherwise behavior is undefined.
| std::invalid_argument | if the right table has no columns |
| right | The right table, from which the hash table is built |
| compare_nulls | Controls whether null join-key values should match or not |
| stream | CUDA stream used for device memory operations and kernel launches |
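The effect of compare_nulls on key matching can be illustrated as follows. This is a host-side sketch, assuming a nullable key is modelled as `std::optional<int>`; the `NullEquality` enum and `keys_match` are hypothetical stand-ins rather than cudf types.

```cpp
#include <optional>

// Hypothetical host-side stand-in for the null-equality option.
enum class NullEquality { EQUAL, UNEQUAL };

// A nullable join key is modelled as std::optional<int> (empty means null).
inline bool keys_match(std::optional<int> const& lhs,
                       std::optional<int> const& rhs,
                       NullEquality compare_nulls) {
  if (lhs.has_value() && rhs.has_value()) { return *lhs == *rhs; }
  if (!lhs.has_value() && !rhs.has_value()) {
    // Two nulls match only when nulls are treated as equal.
    return compare_nulls == NullEquality::EQUAL;
  }
  return false;  // null versus non-null never matches
}
```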
| cudf::hash_join::hash_join | ( | cudf::table_view const & | right, |
| nullable_join | has_nulls, | ||
| null_equality | compare_nulls, | ||
| double | load_factor, | ||
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Construct a hash join object for subsequent probe calls.
Note: The hash_join object must not outlive the table viewed by right; otherwise behavior is undefined.
| std::invalid_argument | if the right table has no columns |
| std::invalid_argument | if load_factor is not in the range (0, 1] |
| right | The right table, from which the hash table is built |
| has_nulls | Flag indicating whether any nulls exist in the right table or in any left table that will later be used for the join |
| compare_nulls | Controls whether null join-key values should match or not |
| load_factor | The hash table occupancy ratio in (0,1]. A value of 0.5 means 50% desired occupancy. |
| stream | CUDA stream used for device memory operations and kernel launches |
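To make the load_factor range concrete, the sketch below shows one way a desired occupancy could translate into a slot count. It assumes capacity is simply rows divided by occupancy, rounded up; `slots_for` is a hypothetical helper, and the sizing policy cudf actually uses may differ.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: slot count needed to hold num_rows at the given occupancy.
inline std::size_t slots_for(std::size_t num_rows, double load_factor) {
  // Reject anything outside (0, 1], including NaN.
  if (!(load_factor > 0.0 && load_factor <= 1.0)) {
    throw std::invalid_argument("load_factor must be in (0, 1]");
  }
  return static_cast<std::size_t>(std::ceil(static_cast<double>(num_rows) / load_factor));
}
```

Under this assumption, 100 rows at a load factor of 0.5 need 200 slots, and a load factor of 1.0 needs exactly 100.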
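| static std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::finalize_partitioned_full_join | ( | cudf::host_span< cudf::device_span< size_type const > const > | left_partials, |
| cudf::host_span< cudf::device_span< size_type const > const > | right_partials, | ||
| size_type | left_table_num_rows, | ||
| size_type | right_table_num_rows, | ||
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | static |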
Finalizes a partitioned full join by concatenating all per-partition results and appending the unmatched right rows (the complement).
Call this method after calling partitioned_full_join() for every partition. It combines the per-partition indices with the unmatched right row indices (a global property across all partitions) and returns a single (left_indices, right_indices) pair equivalent to the output of full_join().
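The two steps described here (concatenate, then append the complement) can be sketched on the host. This is an illustrative model, assuming plain `std::vector` inputs and a sentinel value for "no matching row"; `finalize_full_join_host` and `kNoMatch` are hypothetical, and the left row count is omitted because this simplified model does not need it.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

using index_vec = std::vector<std::int32_t>;

// Assumed sentinel marking "no matching row"; the value is illustrative.
constexpr std::int32_t kNoMatch = std::numeric_limits<std::int32_t>::min();

std::pair<index_vec, index_vec> finalize_full_join_host(
  std::vector<index_vec> const& left_partials,
  std::vector<index_vec> const& right_partials,
  std::int32_t right_table_num_rows) {
  index_vec left_out, right_out;
  std::vector<bool> right_matched(right_table_num_rows, false);

  // 1) Concatenate the per-partition results, noting which right rows were hit.
  for (std::size_t p = 0; p < left_partials.size(); ++p) {
    for (std::size_t i = 0; i < left_partials[p].size(); ++i) {
      left_out.push_back(left_partials[p][i]);
      right_out.push_back(right_partials[p][i]);
      if (right_partials[p][i] != kNoMatch) { right_matched[right_partials[p][i]] = true; }
    }
  }
  // 2) Append the complement: right rows that no partition matched.
  for (std::int32_t r = 0; r < right_table_num_rows; ++r) {
    if (!right_matched[r]) {
      left_out.push_back(kNoMatch);
      right_out.push_back(r);
    }
  }
  return {left_out, right_out};
}
```

The unmatched right rows can only be known once every partition has been seen, which is why this step is separate from the per-partition probes.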
| left_partials | Per-partition left_indices views produced by partitioned_full_join() |
| right_partials | Per-partition right_indices views produced by partitioned_full_join() |
| left_table_num_rows | Total number of rows in the original left table |
| right_table_num_rows | Total number of rows in the right table |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the result device memory |
Returns a pair [left_indices, right_indices] representing the full join output.
| std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::full_join | ( | cudf::table_view const & | left, |
| std::optional< std::size_t > | output_size = {}, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns the row indices that can be used to construct the result of performing a full join between two tables.
Note: Behavior is undefined if the provided output_size is smaller than the actual output size.
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table, from which the tuples are probed |
| output_size | Optional value which allows users to specify the exact output size |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the returned table and columns' device memory. |
Returns a pair [left_indices, right_indices] that can be used to construct the result of performing a full join between two tables with left and right as the join keys.
| cudf::join_match_context cudf::hash_join::full_join_match_context | ( | cudf::table_view const & | left, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns context information about matches between the left and right tables.
This method computes, for each row in the left table, how many matching rows exist in the right table according to full join semantics, and returns the number of matches through a join_match_context object.
For full join, this includes matches for left table rows, and the result may need to be combined with unmatched rows from the right table to get the complete picture.
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table to join with the pre-processed right table |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the result device memory |
| std::size_t cudf::hash_join::full_join_size | ( | cudf::table_view const & | left, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns the exact number of matches (rows) when performing a full join with the specified left table.
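As a rough illustration of what this count covers, the host-side sketch below sums three contributions: matched pairs, left rows with no match, and right rows with no match. It assumes integer keys with no nulls; `full_join_size_host` is a hypothetical name and does not reflect how cudf computes the size on the device.

```cpp
#include <cstddef>
#include <unordered_map>
#include <unordered_set>
#include <vector>

std::size_t full_join_size_host(std::vector<int> const& left_keys,
                                std::vector<int> const& right_keys) {
  // How many times each key occurs in the right table.
  std::unordered_map<int, std::size_t> right_freq;
  for (int k : right_keys) { ++right_freq[k]; }

  // Which keys occur anywhere in the left table.
  std::unordered_set<int> left_seen(left_keys.begin(), left_keys.end());

  std::size_t size = 0;
  // Each left row yields one output row per match, or one row if unmatched.
  for (int k : left_keys) {
    auto it = right_freq.find(k);
    size += (it == right_freq.end()) ? 1 : it->second;
  }
  // Each right row whose key never appears on the left yields one extra row.
  for (int k : right_keys) {
    if (left_seen.count(k) == 0) { ++size; }
  }
  return size;
}
```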
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table, from which the tuples are probed |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the intermediate table and columns' device memory. |
Returns the exact number of output rows when performing a full join between two tables with left and right as the join keys.
| std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::inner_join | ( | cudf::table_view const & | left, |
| std::optional< std::size_t > | output_size = {}, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns the row indices that can be used to construct the result of performing an inner join between two tables.
Note: Behavior is undefined if the provided output_size is smaller than the actual output size.
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table, from which the tuples are probed |
| output_size | Optional value which allows users to specify the exact output size |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the returned table and columns' device memory. |
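How the returned row indices are used to construct the join result can be sketched with a simple host-side gather. This assumes each table carries a single string payload column held in a `std::vector`; `gather_rows` is a hypothetical helper, not a cudf function.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Pair up payload values using join indices.
// For an inner join every index refers to a valid row, so no sentinel check is needed.
std::vector<std::pair<std::string, std::string>> gather_rows(
  std::vector<std::string> const& left_payload,
  std::vector<std::string> const& right_payload,
  std::vector<std::int32_t> const& left_indices,
  std::vector<std::int32_t> const& right_indices) {
  std::vector<std::pair<std::string, std::string>> rows;
  rows.reserve(left_indices.size());
  for (std::size_t i = 0; i < left_indices.size(); ++i) {
    rows.emplace_back(left_payload[left_indices[i]], right_payload[right_indices[i]]);
  }
  return rows;
}
```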
Returns a pair [left_indices, right_indices] that can be used to construct the result of performing an inner join between two tables with left and right as the join keys.
| cudf::join_match_context cudf::hash_join::inner_join_match_context | ( | cudf::table_view const & | left, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns context information about matches between the left and right tables.
This method computes, for each row in the left table, how many matching rows exist in the right table according to inner join semantics, and returns the number of matches through a join_match_context object.
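This is particularly useful for planning partitioned join operations: the returned context is the input that partitioned_inner_join() requires.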
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table to join with the pre-processed right table |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the result device memory |
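One way per-row match counts can drive partition planning is sketched below on the host. It assumes integer keys without nulls and a caller-chosen output budget per partition; `match_counts` and `plan_partitions` are hypothetical helpers and do not correspond to cudf functions.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Count, per left row, how many right rows share its key (inner join semantics).
std::vector<std::int32_t> match_counts(std::vector<int> const& left_keys,
                                       std::vector<int> const& right_keys) {
  std::unordered_map<int, std::int32_t> freq;
  for (int k : right_keys) { ++freq[k]; }
  std::vector<std::int32_t> counts;
  counts.reserve(left_keys.size());
  for (int k : left_keys) {
    auto it = freq.find(k);
    counts.push_back(it == freq.end() ? 0 : it->second);
  }
  return counts;
}

// Split left rows into [start, end) ranges whose output stays within a budget.
// A single row that exceeds the budget still gets a range of its own.
std::vector<std::pair<std::int32_t, std::int32_t>> plan_partitions(
  std::vector<std::int32_t> const& counts, std::int64_t max_rows_per_partition) {
  std::vector<std::pair<std::int32_t, std::int32_t>> ranges;
  auto const n = static_cast<std::int32_t>(counts.size());
  std::int32_t start = 0;
  std::int64_t running = 0;
  for (std::int32_t i = 0; i < n; ++i) {
    if (i > start && running + counts[i] > max_rows_per_partition) {
      ranges.emplace_back(start, i);  // close the current range before row i
      start   = i;
      running = 0;
    }
    running += counts[i];
  }
  if (start < n) { ranges.emplace_back(start, n); }
  return ranges;
}
```

Each resulting range plays the role of the partition bounds that a partitioned probe would then process independently.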
| std::size_t cudf::hash_join::inner_join_size | ( | cudf::table_view const & | left, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) | const |
Returns the exact number of matches (rows) when performing an inner join with the specified left table.
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table, from which the tuples are probed |
| stream | CUDA stream used for device memory operations and kernel launches |
Returns the exact number of output rows when performing an inner join between two tables with left and right as the join keys.
| std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::left_join | ( | cudf::table_view const & | left, |
| std::optional< std::size_t > | output_size = {}, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns the row indices that can be used to construct the result of performing a left join between two tables.
Note: Behavior is undefined if the provided output_size is smaller than the actual output size.
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table, from which the tuples are probed |
| output_size | Optional value which allows users to specify the exact output size |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the returned table and columns' device memory. |
Returns a pair [left_indices, right_indices] that can be used to construct the result of performing a left join between two tables with left and right as the join keys.
| cudf::join_match_context cudf::hash_join::left_join_match_context | ( | cudf::table_view const & | left, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Returns context information about matches between the left and right tables.
This method computes, for each row in the left table, how many matching rows exist in the right table according to left join semantics, and returns the number of matches through a join_match_context object.
For left join, every row in the left table will have at least one match (either with a matching row from the right table or with a null placeholder).
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table to join with the pre-processed right table |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the result device memory |
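The "at least one match" property of left join semantics can be seen in a small host-side model. The sketch assumes integer keys without nulls and uses a sentinel index as the null placeholder; `left_join_host` and `kNullRow` are hypothetical names chosen for illustration.

```cpp
#include <cstdint>
#include <limits>
#include <unordered_map>
#include <utility>
#include <vector>

// Assumed placeholder index for "no right row"; the value is illustrative.
constexpr std::int32_t kNullRow = std::numeric_limits<std::int32_t>::min();

std::pair<std::vector<std::int32_t>, std::vector<std::int32_t>> left_join_host(
  std::vector<int> const& left_keys, std::vector<int> const& right_keys) {
  std::unordered_multimap<int, std::int32_t> table;
  for (std::int32_t i = 0; i < static_cast<std::int32_t>(right_keys.size()); ++i) {
    table.emplace(right_keys[i], i);
  }
  std::vector<std::int32_t> left_idx, right_idx;
  for (std::int32_t i = 0; i < static_cast<std::int32_t>(left_keys.size()); ++i) {
    auto range = table.equal_range(left_keys[i]);
    if (range.first == range.second) {
      // No right row matches: the left row is still emitted once.
      left_idx.push_back(i);
      right_idx.push_back(kNullRow);
    } else {
      for (auto it = range.first; it != range.second; ++it) {
        left_idx.push_back(i);
        right_idx.push_back(it->second);
      }
    }
  }
  return {left_idx, right_idx};
}
```

Every left row contributes at least one output row, so the per-row count under left join semantics is never zero.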
| std::size_t cudf::hash_join::left_join_size | ( | cudf::table_view const & | left, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) | const |
Returns the exact number of matches (rows) when performing a left join with the specified left table.
| std::invalid_argument | If the input left table has nulls while this hash_join object was not constructed with null check. |
| left | The left table, from which the tuples are probed |
| stream | CUDA stream used for device memory operations and kernel launches |
Returns the exact number of output rows when performing a left join between two tables with left and right as the join keys.
| std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::partitioned_full_join | ( | cudf::join_partition_context const & | context, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Performs a full join probe on a partition of the probe table.
This method executes the probe-side of a full join between a specific partition of the probe table (defined by the join_partition_context) and the build table. The context must have been previously created by calling full_join_match_context().
Note: After calling this method for every partition, call finalize_partitioned_full_join() to obtain the complete full join output.
The returned left_indices are relative to the original complete probe table.
| std::invalid_argument | If context.left_table_context is null, if its _match_counts is null, or if [left_start_idx, left_end_idx) is outside the bounds of the left table. |
| context | The partition context containing match information and partition bounds |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the join indices' device memory |
Returns a pair [left_indices, right_indices] for this partition.
| std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::partitioned_inner_join | ( | cudf::join_partition_context const & | context, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Performs an inner join on a partition of the probe table.
This method executes an inner join between a specific partition of the probe table (defined by the join_partition_context) and the build table. The context must have been previously created by calling inner_join_match_context().
The returned left_indices are relative to the original complete probe table, not just the partition, so they can be used directly with the original probe table.
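The point about global indices can be shown with a host-side sketch of a partitioned probe. It assumes the build table is an already populated `std::unordered_multimap` and that the partition is given as a half-open row range; `partitioned_inner_join_host` is a hypothetical name, and the bounds check only mimics the documented exception.

```cpp
#include <cstdint>
#include <stdexcept>
#include <unordered_map>
#include <utility>
#include <vector>

using index_vec = std::vector<std::int32_t>;

std::pair<index_vec, index_vec> partitioned_inner_join_host(
  std::vector<int> const& left_keys,
  std::unordered_multimap<int, std::int32_t> const& build_table,
  std::int32_t left_start_idx,
  std::int32_t left_end_idx) {
  auto const num_rows = static_cast<std::int32_t>(left_keys.size());
  if (left_start_idx < 0 || left_start_idx > left_end_idx || left_end_idx > num_rows) {
    throw std::invalid_argument("partition bounds outside the left table");
  }
  index_vec left_idx, right_idx;
  // Probe only rows in [left_start_idx, left_end_idx).
  for (std::int32_t i = left_start_idx; i < left_end_idx; ++i) {
    auto range = build_table.equal_range(left_keys[i]);
    for (auto it = range.first; it != range.second; ++it) {
      left_idx.push_back(i);  // global row index, not i - left_start_idx
      right_idx.push_back(it->second);
    }
  }
  return {left_idx, right_idx};
}
```

Because the emitted left indices are global, results from different partitions can be concatenated and applied to the original probe table without any offset adjustment.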
| std::invalid_argument | If context.left_table_context is null, if its _match_counts is null, or if [left_start_idx, left_end_idx) is outside the bounds of the left table. |
| context | The partition context containing match information and partition bounds |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the join indices' device memory |
Returns a pair [left_indices, right_indices] for this partition.
| std::pair<std::unique_ptr<rmm::device_uvector<size_type> >, std::unique_ptr<rmm::device_uvector<size_type> > > cudf::hash_join::partitioned_left_join | ( | cudf::join_partition_context const & | context, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream(), |
||
| rmm::device_async_resource_ref | mr = cudf::get_current_device_resource_ref() |
||
| ) | const |
Performs a left join on a partition of the probe table.
This method executes a left join between a specific partition of the probe table (defined by the join_partition_context) and the build table. The context must have been previously created by calling left_join_match_context().
The returned left_indices are relative to the original complete probe table.
| std::invalid_argument | If context.left_table_context is null, if its _match_counts is null, or if [left_start_idx, left_end_idx) is outside the bounds of the left table. |
| context | The partition context containing match information and partition bounds |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the join indices' device memory |
Returns a pair [left_indices, right_indices] for this partition.