Object-oriented HyperLogLog sketch for approximate distinct counting. More...
#include <approx_distinct_count.hpp>
Public Types | |
| using | impl_type = cudf::detail::approx_distinct_count< cudf::hashing::detail::XXHash_64 > |
| Implementation type. | |
Public Member Functions | |
| approx_distinct_count (table_view const &input, std::int32_t precision=12, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream()) | |
| Constructs an approximate distinct count sketch from a table. More... | |
| approx_distinct_count (cuda::std::span< cuda::std::byte > sketch_span, std::int32_t precision, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream()) | |
| Constructs an approximate distinct count sketch from serialized sketch bytes. More... | |
| approx_distinct_count (approx_distinct_count const &)=delete | |
| approx_distinct_count & | operator= (approx_distinct_count const &)=delete |
| approx_distinct_count (approx_distinct_count &&)=default | |
| Default move constructor. | |
| approx_distinct_count & | operator= (approx_distinct_count &&)=default |
| Move assignment operator. More... | |
| void | add (table_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream()) |
| Adds rows from a table to the sketch. More... | |
| void | merge (approx_distinct_count const &other, rmm::cuda_stream_view stream=cudf::get_default_stream()) |
| Merges another sketch into this sketch. More... | |
| void | merge (cuda::std::span< cuda::std::byte > sketch_span, rmm::cuda_stream_view stream=cudf::get_default_stream()) |
| Merges a sketch from raw bytes into this sketch. More... | |
| std::size_t | estimate (rmm::cuda_stream_view stream=cudf::get_default_stream()) const |
| Estimates the approximate number of distinct rows in the sketch. More... | |
| cuda::std::span< cuda::std::byte > | sketch () noexcept |
| Gets the raw sketch bytes for serialization or external merging. More... | |
| cuda::std::span< cuda::std::byte const > | sketch () const noexcept |
| Gets the raw sketch bytes for serialization or external merging (const overload) More... | |
| null_policy | null_handling () const noexcept |
| Gets the null handling policy for this sketch. More... | |
| nan_policy | nan_handling () const noexcept |
| Gets the NaN handling policy for this sketch. More... | |
| std::int32_t | precision () const noexcept |
| Gets the precision parameter for this sketch. More... | |
Object-oriented HyperLogLog sketch for approximate distinct counting.
This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.
The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).
Common precision values:
Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:
This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark's configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.
Example usage:
Definition at line 76 of file approx_distinct_count.hpp.
| cudf::approx_distinct_count::approx_distinct_count | ( | table_view const & | input, |
| std::int32_t | precision = 12, |
||
| null_policy | null_handling = null_policy::EXCLUDE, |
||
| nan_policy | nan_handling = nan_policy::NAN_IS_NULL, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Constructs an approximate distinct count sketch from a table.
| input | Table whose rows will be added to the sketch |
| precision | The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12. |
| null_handling | INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE) |
| nan_handling | NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL) |
| stream | CUDA stream used for device memory operations and kernel launches |
| cudf::approx_distinct_count::approx_distinct_count | ( | cuda::std::span< cuda::std::byte > | sketch_span, |
| std::int32_t | precision, | ||
| null_policy | null_handling = null_policy::EXCLUDE, |
||
| nan_policy | nan_handling = nan_policy::NAN_IS_NULL, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Constructs an approximate distinct count sketch from serialized sketch bytes.
This constructor enables distributed distinct counting by allowing sketches to be constructed from serialized data. The sketch data is copied into the newly created object, which then owns its own independent storage.
| sketch_span | The serialized sketch bytes to reconstruct from |
| precision | The precision parameter that was used to create the sketch (4-18) |
| null_handling | INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE) |
| nan_handling | NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL) |
| stream | CUDA stream used for device memory operations and kernel launches |
| void cudf::approx_distinct_count::add | ( | table_view const & | input, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Adds rows from a table to the sketch.
| input | Table whose rows will be added |
| stream | CUDA stream used for device memory operations and kernel launches |
| std::size_t cudf::approx_distinct_count::estimate | ( | rmm::cuda_stream_view | stream = cudf::get_default_stream() | ) | const |
Estimates the approximate number of distinct rows in the sketch.
| stream | CUDA stream used for device memory operations and kernel launches |
| void cudf::approx_distinct_count::merge | ( | approx_distinct_count const & | other, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Merges another sketch into this sketch.
After merging, this sketch will contain the combined distinct count estimate of both sketches.
| std::invalid_argument | if the sketches have different precision values |
| std::invalid_argument | if the sketches have different null handling policies |
| std::invalid_argument | if the sketches have different NaN handling policies |
| other | The sketch to merge into this sketch |
| stream | CUDA stream used for device memory operations and kernel launches |
| void cudf::approx_distinct_count::merge | ( | cuda::std::span< cuda::std::byte > | sketch_span, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Merges a sketch from raw bytes into this sketch.
This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.
| sketch_span | The sketch bytes to merge into this sketch |
| stream | CUDA stream used for device memory operations and kernel launches |
|
noexcept |
Gets the NaN handling policy for this sketch.
|
noexcept |
Gets the null handling policy for this sketch.
|
default |
Move assignment operator.
|
noexcept |
Gets the precision parameter for this sketch.
|
noexcept |
Gets the raw sketch bytes for serialization or external merging (const overload)
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
|
noexcept |
Gets the raw sketch bytes for serialization or external merging.
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.