Public Member Functions | List of all members
rapidsmpf::streaming::BloomFilter Struct Reference

Utility managing construction and use of a bloom filter. More...

#include <bloom_filter.hpp>

Public Member Functions

 BloomFilter (std::shared_ptr< Context > ctx, std::shared_ptr< Communicator > comm, std::uint64_t seed, std::size_t num_filter_blocks) noexcept
 Construct storage for a bloom filter. More...
 
std::shared_ptr< Communicator > const & comm () const noexcept
 Gets the communicator associated with this BloomFilter. More...
 
Actor build (std::shared_ptr< Channel > ch_in, std::shared_ptr< Channel > ch_out, OpID tag)
 Build a bloom filter from the input channel. More...
 
Actor apply (std::shared_ptr< Channel > bloom_filter, std::shared_ptr< Channel > ch_in, std::shared_ptr< Channel > ch_out, std::vector< cudf::size_type > keys)
 Apply a bloom filter to an input channel. More...
 

Detailed Description

Utility managing construction and use of a bloom filter.

This class provides methods to build a bloom filter from a stream of TableChunks and then apply that filter to a different stream.

A bloom filter is a fixed size probabilistic data structure that provides approximate set membership queries with no false negatives. That is, let A be some set and f(A) be the bloom filter representation of that set. Then, for all a ∈ A it holds that a ∈ f(A). Conversely, there is a false positive rate that increases with the number of distinct values inserted into the bloom filter, and decreases with the number of filter blocks. That is, for any given bloom filter, there exists a ∉ A such that a ∈ f(A).

See https://arxiv.org/pdf/2512.15595 for details on the GPU implementation used.

We use bloom filters to provide runtime pre-filtering of tables during shuffle-based joins. We gather the keys that will match from the build side and use those to pre-filter the probe side before shuffling.

Definition at line 38 of file bloom_filter.hpp.

Constructor & Destructor Documentation

◆ BloomFilter()

rapidsmpf::streaming::BloomFilter::BloomFilter ( std::shared_ptr< Context ctx,
std::shared_ptr< Communicator comm,
std::uint64_t  seed,
std::size_t  num_filter_blocks 
)
inlineexplicitnoexcept

Construct storage for a bloom filter.

Parameters
ctxStreaming context.
commCommunicator for the collective operation.
seedHash seed used when hashing values into the filter.
num_filter_blocksNumber of blocks in the filter.

Definition at line 47 of file bloom_filter.hpp.

Member Function Documentation

◆ apply()

Actor rapidsmpf::streaming::BloomFilter::apply ( std::shared_ptr< Channel bloom_filter,
std::shared_ptr< Channel ch_in,
std::shared_ptr< Channel ch_out,
std::vector< cudf::size_type >  keys 
)

Apply a bloom filter to an input channel.

Parameters
bloom_filterChannel containing the bloom filter (a single message).
ch_inInput channel of TableChunks to apply bloom filter to.
ch_outOutput channel receiving filtered TableChunks.
keysIndices selecting the key columns for the hash fingerprint
Note
The application of the bloom filter expects exactly one message to come through the bloom_filter channel, which must be drained after that message is sent.
Returns
Coroutine representing the application of the bloom filter.

◆ build()

Actor rapidsmpf::streaming::BloomFilter::build ( std::shared_ptr< Channel ch_in,
std::shared_ptr< Channel ch_out,
OpID  tag 
)

Build a bloom filter from the input channel.

Parameters
ch_inInput channel of TableChunks to build bloom filter for.
ch_outOutput channel receiving a single message containing the bloom filter.
tagDisambiguating tag to combine filters across ranks.
Returns
Coroutine representing the construction of the bloom filter.

◆ comm()

std::shared_ptr<Communicator> const& rapidsmpf::streaming::BloomFilter::comm ( ) const
inlinenoexcept

Gets the communicator associated with this BloomFilter.

Returns
Shared pointer to communicator.

Definition at line 63 of file bloom_filter.hpp.


The documentation for this struct was generated from the following file: