cudf::mark_join Class Reference

Mark-based hash join for semi/anti join with left table reuse.

#include <mark_join.hpp>

Public Member Functions

 mark_join (mark_join const &)=delete
 
 mark_join (mark_join &&)=delete
 
mark_join & operator= (mark_join const &)=delete
 
mark_join & operator= (mark_join &&)=delete
 
 mark_join (cudf::table_view const &left, cudf::null_equality compare_nulls, cudf::join_prefilter prefilter, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Constructs a mark join object with explicit prefilter selection.
 
 mark_join (cudf::table_view const &left, double load_factor, cudf::null_equality compare_nulls=cudf::null_equality::EQUAL, cudf::join_prefilter prefilter=cudf::join_prefilter::NO, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Constructs a mark join object with an explicit hash table load factor and optional prefilter selection.
 
std::unique_ptr< rmm::device_uvector< size_type > > semi_join (cudf::table_view const &right, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) const
 Returns left row indices that have at least one match in the right table.
 
std::unique_ptr< rmm::device_uvector< size_type > > anti_join (cudf::table_view const &right, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) const
 Returns left row indices that have no match in the right table.
 

Detailed Description

Mark-based hash join for semi/anti join with left table reuse.

Builds a hash table from the left (build) table using a multiset that allows duplicate keys. The probe kernel atomically marks matching left entries via CAS on the hash MSB, then a retrieve kernel collects marked (semi) or unmarked (anti) entries.

This class enables building the hash table once and probing multiple times with different right (probe) tables, amortizing the build cost. Probe-side prefiltering can be enabled at construction time via join_prefilter.
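The build, mark, and retrieve steps described above can be modelled on the host. The following is a minimal sketch, not the cuDF implementation: `host_mark_join`, `hash31`, and `mark_bit` are hypothetical names, keys are plain `int` values, the table uses simple linear probing, and each probe marks a private copy of the stored hashes so that probing can stay `const`. How the real class stores and resets its marks is not stated here and is an assumption of this model.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical host-side model of a mark join over int keys.
// Illustrative only: it is not the cuDF implementation and runs on the CPU.
class host_mark_join {
 public:
  // Build once: every left row gets its own slot, so duplicate keys coexist.
  explicit host_mark_join(std::vector<int> const& left)
    : keys_(left), capacity_(2 * left.size() + 1), hashes_(capacity_, 0), rows_(capacity_, -1)
  {
    for (std::size_t row = 0; row < keys_.size(); ++row) {
      std::uint32_t const h = hash31(keys_[row]);
      std::size_t slot      = h % capacity_;
      while (rows_[slot] != -1) { slot = (slot + 1) % capacity_; }  // linear probing
      hashes_[slot] = h;
      rows_[slot]   = static_cast<int>(row);
    }
  }

  // Left row indices with at least one match in `right`.
  std::vector<int> semi_join(std::vector<int> const& right) const { return probe(right, true); }
  // Left row indices with no match in `right`.
  std::vector<int> anti_join(std::vector<int> const& right) const { return probe(right, false); }

 private:
  static constexpr std::uint32_t mark_bit = 0x80000000u;

  // Keep hashes to 31 bits so the most significant bit is free for the mark.
  static std::uint32_t hash31(int key)
  {
    return static_cast<std::uint32_t>(std::hash<int>{}(key)) & ~mark_bit;
  }

  std::vector<int> probe(std::vector<int> const& right, bool want_marked) const
  {
    // Per-probe working copy of the stored hashes (a modelling assumption).
    std::vector<std::atomic<std::uint32_t>> work(capacity_);
    for (std::size_t s = 0; s < capacity_; ++s) { work[s].store(hashes_[s]); }

    // Probe: walk the chain for each right key and mark every matching left entry.
    for (int key : right) {
      std::uint32_t const h = hash31(key);
      for (std::size_t s = h % capacity_; rows_[s] != -1; s = (s + 1) % capacity_) {
        if (hashes_[s] != h || keys_[static_cast<std::size_t>(rows_[s])] != key) { continue; }
        std::uint32_t expected = h;  // value of an entry that is not yet marked
        // CAS sets the MSB; it fails harmlessly if the entry is already marked.
        work[s].compare_exchange_strong(expected, h | mark_bit);
      }
    }

    // Retrieve: collect marked rows (semi) or unmarked rows (anti).
    std::vector<int> out;
    for (std::size_t s = 0; s < capacity_; ++s) {
      if (rows_[s] == -1) { continue; }
      bool const marked = (work[s].load() & mark_bit) != 0;
      if (marked == want_marked) { out.push_back(rows_[s]); }
    }
    std::sort(out.begin(), out.end());  // sorted only to make the output easy to read
    return out;
  }

  std::vector<int> keys_;              // left keys, indexed by row
  std::size_t capacity_;               // number of slots; always larger than the row count
  std::vector<std::uint32_t> hashes_;  // 31-bit hash stored per occupied slot
  std::vector<int> rows_;              // left row index per slot, -1 if the slot is empty
};
```

In this model, a left table of `{1, 2, 2, 3, 5}` probed with `{2, 3, 4}` yields rows `1, 2, 3` from `semi_join` and rows `0, 4` from `anti_join`. The same object can then be probed with a different right table without rebuilding, which is the reuse pattern the class is designed for.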

Note
This class is designed for the case where the left table is reused across multiple semi/anti join operations. It should only be used when:
  • The left (build) table is smaller than the right (probe) table. Building a hash table from the larger table is memory-inefficient and leads to poor probe performance due to longer collision chains.
  • The left table is reasonably small (e.g. ≤1M rows). The mark-based probe walks the hash table linearly per right row, so performance degrades with large hash tables. For large left tables, consider alternative join strategies.
  • The left table is probed multiple times. If only a single join is needed, there is no benefit to reuse and standard join APIs may be more efficient.

For the common case where the right (filter) table is reused, use cudf::filtered_join instead, which builds a distinct set from the right table.

Note
All NaNs are considered equal.
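As an illustration of the NaN rule, a key comparison that treats all NaNs as equal could look like the sketch below. The helper `nan_aware_equal` is a hypothetical name used only for this example and is not part of the cuDF API.

```cpp
#include <cmath>

// Hypothetical helper: equality in which any NaN matches any other NaN.
// Plain operator== would return false for NaN == NaN.
bool nan_aware_equal(double a, double b)
{
  if (std::isnan(a) && std::isnan(b)) { return true; }
  return a == b;
}
```

Under this rule, a left row whose key is NaN counts as matched by any right row whose key is NaN, regardless of the NaN payload or sign bit.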

Definition at line 61 of file mark_join.hpp.

Constructor & Destructor Documentation

◆ mark_join() [1/2]

cudf::mark_join::mark_join ( cudf::table_view const &  left,
cudf::null_equality  compare_nulls,
cudf::join_prefilter  prefilter,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Constructs a mark join object with explicit prefilter selection.

Parameters
left: The left table; the hash table is built from this table
compare_nulls: Controls whether null join-key values should match or not
prefilter: Controls whether an optional probe-side prefilter is enabled
stream: CUDA stream used for device memory operations and kernel launches
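The effect of compare_nulls can be pictured with a host-side key comparison. This is a sketch under stated assumptions: `null_equality_model` and `keys_match` are hypothetical stand-ins, with `std::optional<int>` modelling a nullable key; only the enumerator names `EQUAL` and `UNEQUAL` mirror cudf::null_equality.

```cpp
#include <optional>

// Hypothetical stand-in for cudf::null_equality, for illustration only.
enum class null_equality_model { EQUAL, UNEQUAL };

// Returns true if two nullable keys should be treated as a join match.
bool keys_match(std::optional<int> const& a,
                std::optional<int> const& b,
                null_equality_model nulls)
{
  if (a.has_value() && b.has_value()) { return *a == *b; }  // both valid: compare values
  if (!a.has_value() && !b.has_value()) {                   // both null: policy decides
    return nulls == null_equality_model::EQUAL;
  }
  return false;  // a null never matches a valid value
}
```

With `EQUAL`, a left row with a null key appears in the semi join result when the right table also contains a null key; with `UNEQUAL`, such a row can only appear in the anti join result.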

◆ mark_join() [2/2]

cudf::mark_join::mark_join ( cudf::table_view const &  left,
double  load_factor,
cudf::null_equality  compare_nulls = cudf::null_equality::EQUAL,
cudf::join_prefilter  prefilter = cudf::join_prefilter::NO,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Constructs a mark join object with an explicit hash table load factor and optional prefilter selection.

Parameters
left: The left table; the hash table is built from this table
load_factor: Hash table load factor in range (0,1]
compare_nulls: Controls whether null join-key values should match or not
prefilter: Controls whether an optional probe-side prefilter is enabled
stream: CUDA stream used for device memory operations and kernel launches
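To make the load_factor range concrete, the sketch below shows one common way a load factor translates into a slot count: rows divided by load factor, rounded up. This is an assumption for illustration. The helper `slots_for_load_factor` is hypothetical, and neither the exact sizing rule nor the error behaviour for out-of-range values is specified by this page.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper: slot count needed so that num_rows entries fill the
// table to at most `load_factor`. The real sizing rule may differ.
std::size_t slots_for_load_factor(std::size_t num_rows, double load_factor)
{
  // Accept only values in (0, 1]; the negated form also rejects NaN.
  if (!(load_factor > 0.0 && load_factor <= 1.0)) {
    throw std::invalid_argument("load_factor must be in (0, 1]");
  }
  return static_cast<std::size_t>(std::ceil(static_cast<double>(num_rows) / load_factor));
}
```

Under this model a lower load factor trades memory for shorter probe sequences: 1000 rows need 2000 slots at a load factor of 0.5, but only 1000 slots at 1.0.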

Member Function Documentation

◆ anti_join()

std::unique_ptr<rmm::device_uvector<size_type> > cudf::mark_join::anti_join ( cudf::table_view const &  right,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
) const

Returns left row indices that have no match in the right table.

Parameters
right: The right table; probed against the hash table built from the left table
stream: CUDA stream used for device memory operations and kernel launches
mr: Device memory resource used to allocate the returned device memory
Returns
Device vector of the left row indices that have no match in the right table

◆ semi_join()

std::unique_ptr<rmm::device_uvector<size_type> > cudf::mark_join::semi_join ( cudf::table_view const &  right,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
) const

Returns left row indices that have at least one match in the right table.

Parameters
right: The right table; probed against the hash table built from the left table
stream: CUDA stream used for device memory operations and kernel launches
mr: Device memory resource used to allocate the returned device memory
Returns
Device vector of the left row indices that have at least one match in the right table
