Utility managing construction and use of a bloom filter. More...
#include <bloom_filter.hpp>
Public Member Functions | |
| BloomFilter (std::shared_ptr< Context > ctx, std::shared_ptr< Communicator > comm, std::uint64_t seed, std::size_t num_filter_blocks) noexcept | |
| Construct storage for a bloom filter. More... | |
| std::shared_ptr< Communicator > const & | comm () const noexcept |
| Gets the communicator associated with this BloomFilter. More... | |
| Actor | build (std::shared_ptr< Channel > ch_in, std::shared_ptr< Channel > ch_out, OpID tag) |
| Build a bloom filter from the input channel. More... | |
| Actor | apply (std::shared_ptr< Channel > bloom_filter, std::shared_ptr< Channel > ch_in, std::shared_ptr< Channel > ch_out, std::vector< cudf::size_type > keys) |
| Apply a bloom filter to an input channel. More... | |
Utility managing construction and use of a bloom filter.
This class provides methods to build a bloom filter from a stream of TableChunks and then apply that filter to a different stream.
A bloom filter is a fixed size probabilistic data structure that provides approximate set membership queries with no false negatives. That is, let A be some set and f(A) be the bloom filter representation of that set. Then, for all a ∈ A it holds that a ∈ f(A). Conversely, there is a false positive rate that increases with the number of distinct values inserted into the bloom filter, and decreases with the number of filter blocks. That is, for any given bloom filter, there exists a ∉ A such that a ∈ f(A).
See https://arxiv.org/pdf/2512.15595 for details on the GPU implementation used.
We use bloom filters to provide runtime pre-filtering of tables during shuffle-based joins. We gather the keys that will match from the build side and use those to pre-filter the probe side before shuffling.
Definition at line 38 of file bloom_filter.hpp.
|
inlineexplicitnoexcept |
Construct storage for a bloom filter.
| ctx | Streaming context. |
| comm | Communicator for the collective operation. |
| seed | Hash seed used when hashing values into the filter. |
| num_filter_blocks | Number of blocks in the filter. |
Definition at line 47 of file bloom_filter.hpp.
| Actor rapidsmpf::streaming::BloomFilter::apply | ( | std::shared_ptr< Channel > | bloom_filter, |
| std::shared_ptr< Channel > | ch_in, | ||
| std::shared_ptr< Channel > | ch_out, | ||
| std::vector< cudf::size_type > | keys | ||
| ) |
Apply a bloom filter to an input channel.
| bloom_filter | Channel containing the bloom filter (a single message). |
| ch_in | Input channel of TableChunks to apply bloom filter to. |
| ch_out | Output channel receiving filtered TableChunks. |
| keys | Indices selecting the key columns for the hash fingerprint |
bloom_filter channel, which must be drained after that message is sent.| Actor rapidsmpf::streaming::BloomFilter::build | ( | std::shared_ptr< Channel > | ch_in, |
| std::shared_ptr< Channel > | ch_out, | ||
| OpID | tag | ||
| ) |
Build a bloom filter from the input channel.
| ch_in | Input channel of TableChunks to build bloom filter for. |
| ch_out | Output channel receiving a single message containing the bloom filter. |
| tag | Disambiguating tag to combine filters across ranks. |
|
inlinenoexcept |
Gets the communicator associated with this BloomFilter.
Definition at line 63 of file bloom_filter.hpp.