Object-oriented HyperLogLog sketch for approximate distinct counting. More...
#include <approx_distinct_count.hpp>
Classes | |
| struct | desired_standard_error |
| Strong type wrapper for the desired standard error constructor parameter. More... | |
Public Types | |
| using | impl_type = cudf::detail::approx_distinct_count< cudf::hashing::detail::XXHash_64 > |
| Implementation type. | |
Public Member Functions | |
| approx_distinct_count (table_view const &input, std::int32_t precision=12, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream()) | |
| Constructs an approximate distinct count sketch from a table with specified precision. More... | |
| approx_distinct_count (table_view const &input, desired_standard_error error, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream()) | |
| Constructs an approximate distinct count sketch from a table with specified standard error. More... | |
| approx_distinct_count (cuda::std::span< cuda::std::byte > sketch_span, std::int32_t precision, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL) | |
| Constructs a non-owning sketch that operates on user-allocated storage. More... | |
| approx_distinct_count (approx_distinct_count const &)=delete | |
| approx_distinct_count & | operator= (approx_distinct_count const &)=delete |
| approx_distinct_count (approx_distinct_count &&)=default | |
| Default move constructor. | |
| approx_distinct_count & | operator= (approx_distinct_count &&)=default |
| Move assignment operator. More... | |
| void | add (table_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream()) |
| Adds rows from a table to the sketch. More... | |
| void | merge (approx_distinct_count const &other, rmm::cuda_stream_view stream=cudf::get_default_stream()) |
| Merges another sketch into this sketch. More... | |
| void | merge (cuda::std::span< cuda::std::byte const > sketch_span, rmm::cuda_stream_view stream=cudf::get_default_stream()) |
| Merges a sketch from raw bytes into this sketch. More... | |
| std::size_t | estimate (rmm::cuda_stream_view stream=cudf::get_default_stream()) const |
| Estimates the approximate number of distinct rows in the sketch. More... | |
| cuda::std::span< cuda::std::byte > | sketch () noexcept |
| Gets the raw sketch bytes for serialization or external merging. More... | |
| cuda::std::span< cuda::std::byte const > | sketch () const noexcept |
| Gets the raw sketch bytes for serialization or external merging (const overload) More... | |
| null_policy | null_handling () const noexcept |
| Gets the null handling policy for this sketch. More... | |
| nan_policy | nan_handling () const noexcept |
| Gets the NaN handling policy for this sketch. More... | |
| std::int32_t | precision () const noexcept |
| Gets the precision parameter for this sketch. More... | |
| double | standard_error () const noexcept |
| Gets the standard error (error tolerance) for this sketch. More... | |
Static Public Member Functions | |
| static std::size_t | sketch_bytes (std::int32_t precision) |
| Gets the number of bytes required for sketch storage at a given precision. More... | |
| static std::size_t | sketch_alignment () |
| Gets the alignment required for sketch storage. More... | |
Object-oriented HyperLogLog sketch for approximate distinct counting.
This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.
The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).
Common precision values:
Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:
This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark's configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.
Example usage:
Definition at line 82 of file approx_distinct_count.hpp.
| cudf::approx_distinct_count::approx_distinct_count | ( | table_view const & | input, |
| std::int32_t | precision = 12, |
||
| null_policy | null_handling = null_policy::EXCLUDE, |
||
| nan_policy | nan_handling = nan_policy::NAN_IS_NULL, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Constructs an approximate distinct count sketch from a table with specified precision.
| input | Table whose rows will be added to the sketch |
| precision | The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12. |
| null_handling | INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE) |
| nan_handling | NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL) |
| stream | CUDA stream used for device memory operations and kernel launches |
| cudf::approx_distinct_count::approx_distinct_count | ( | table_view const & | input, |
| desired_standard_error | error, | ||
| null_policy | null_handling = null_policy::EXCLUDE, |
||
| nan_policy | nan_handling = nan_policy::NAN_IS_NULL, |
||
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Constructs an approximate distinct count sketch from a table with specified standard error.
This constructor allows specifying the desired standard error (error tolerance) directly, which is more intuitive than specifying the precision parameter. The precision is calculated as: ceil(2 * log2(1.04 / standard_error)).
Since precision must be an integer, the actual standard error may be better (smaller) than requested. Use the standard_error() getter to retrieve the actual value.
| input | Table whose rows will be added to the sketch |
| error | The desired standard error (e.g., approx_distinct_count::desired_standard_error{0.01} for ~1%) |
| null_handling | INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE) |
| nan_handling | NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL) |
| stream | CUDA stream used for device memory operations and kernel launches |
| std::invalid_argument | if standard_error value is not positive |
| cudf::approx_distinct_count::approx_distinct_count | ( | cuda::std::span< cuda::std::byte > | sketch_span, |
| std::int32_t | precision, | ||
| null_policy | null_handling = null_policy::EXCLUDE, |
||
| nan_policy | nan_handling = nan_policy::NAN_IS_NULL |
||
| ) |
Constructs a non-owning sketch that operates on user-allocated storage.
This constructor creates a sketch that operates directly on the provided storage without copying. This enables zero-copy operations on pre-existing buffers, such as sketch data stored in a column or received from another process.
| sketch_span | The sketch bytes to operate on (must remain valid) |
| precision | The precision parameter for the sketch (4-18) |
| null_handling | INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE) |
| nan_handling | NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL) |
| void cudf::approx_distinct_count::add | ( | table_view const & | input, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Adds rows from a table to the sketch.
| input | Table whose rows will be added |
| stream | CUDA stream used for device memory operations and kernel launches |
| std::size_t cudf::approx_distinct_count::estimate | ( | rmm::cuda_stream_view | stream = cudf::get_default_stream() | ) | const |
Estimates the approximate number of distinct rows in the sketch.
| stream | CUDA stream used for device memory operations and kernel launches |
| void cudf::approx_distinct_count::merge | ( | approx_distinct_count const & | other, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Merges another sketch into this sketch.
After merging, this sketch will contain the combined distinct count estimate of both sketches.
| std::invalid_argument | if the sketches have different precision values |
| std::invalid_argument | if the sketches have different null handling policies |
| std::invalid_argument | if the sketches have different NaN handling policies |
| other | The sketch to merge into this sketch |
| stream | CUDA stream used for device memory operations and kernel launches |
| void cudf::approx_distinct_count::merge | ( | cuda::std::span< cuda::std::byte const > | sketch_span, |
| rmm::cuda_stream_view | stream = cudf::get_default_stream() |
||
| ) |
Merges a sketch from raw bytes into this sketch.
This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.
| sketch_span | The sketch bytes to merge into this sketch |
| stream | CUDA stream used for device memory operations and kernel launches |
|
noexcept |
Gets the NaN handling policy for this sketch.
|
noexcept |
Gets the null handling policy for this sketch.
|
default |
Move assignment operator.
|
noexcept |
Gets the precision parameter for this sketch.
|
noexcept |
Gets the raw sketch bytes for serialization or external merging (const overload)
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
|
noexcept |
Gets the raw sketch bytes for serialization or external merging.
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
|
static |
Gets the alignment required for sketch storage.
|
static |
Gets the number of bytes required for sketch storage at a given precision.
| precision | The HLL precision parameter (4-18) |
|
noexcept |
Gets the standard error (error tolerance) for this sketch.
The standard error is calculated from precision as: 1.04 / sqrt(2^precision). This represents the expected relative error of the cardinality estimate.