Object-oriented HyperLogLog sketch for approximate distinct counting. More...

#include <approx_distinct_count.hpp>

Classes
struct	desired_standard_error
	Strong type wrapper for the desired standard error constructor parameter. More...

Public Types
using	impl_type = cudf::detail::approx_distinct_count< cudf::hashing::detail::XXHash_64 >
	Implementation type.

Public Member Functions
	approx_distinct_count (table_view const &input, std::int32_t precision=12, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream())
	Constructs an approximate distinct count sketch from a table with specified precision. More...

	approx_distinct_count (table_view const &input, desired_standard_error error, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream())
	Constructs an approximate distinct count sketch from a table with specified standard error. More...

	approx_distinct_count (cuda::std::span< cuda::std::byte > sketch_span, std::int32_t precision, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL)
	Constructs a non-owning sketch that operates on user-allocated storage. More...

	approx_distinct_count (approx_distinct_count const &)=delete

approx_distinct_count &	operator= (approx_distinct_count const &)=delete

	approx_distinct_count (approx_distinct_count &&)=default
	Default move constructor.

approx_distinct_count &	operator= (approx_distinct_count &&)=default
	Move assignment operator. More...

void	add (table_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream())
	Adds rows from a table to the sketch. More...

void	merge (approx_distinct_count const &other, rmm::cuda_stream_view stream=cudf::get_default_stream())
	Merges another sketch into this sketch. More...

void	merge (cuda::std::span< cuda::std::byte const > sketch_span, rmm::cuda_stream_view stream=cudf::get_default_stream())
	Merges a sketch from raw bytes into this sketch. More...

std::size_t	estimate (rmm::cuda_stream_view stream=cudf::get_default_stream()) const
	Estimates the approximate number of distinct rows in the sketch. More...

cuda::std::span< cuda::std::byte >	sketch () noexcept
	Gets the raw sketch bytes for serialization or external merging. More...

cuda::std::span< cuda::std::byte const >	sketch () const noexcept
	Gets the raw sketch bytes for serialization or external merging (const overload) More...

null_policy	null_handling () const noexcept
	Gets the null handling policy for this sketch. More...

nan_policy	nan_handling () const noexcept
	Gets the NaN handling policy for this sketch. More...

std::int32_t	precision () const noexcept
	Gets the precision parameter for this sketch. More...

double	standard_error () const noexcept
	Gets the standard error (error tolerance) for this sketch. More...

Static Public Member Functions
static std::size_t	sketch_bytes (std::int32_t precision)
	Gets the number of bytes required for sketch storage at a given precision. More...

static std::size_t	sketch_alignment ()
	Gets the alignment required for sketch storage. More...

Detailed Description

Object-oriented HyperLogLog sketch for approximate distinct counting.

This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.

The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).

HyperLogLog Precision Parameter

The precision parameter (p) is the number of bits used to index into the register array. It determines the number of registers (m = 2^p) in the HLL sketch:

Memory usage: 2^p * 4 bytes (m registers of 4 bytes each for GPU atomics)
Standard error: 1.04 / sqrt(m) = 1.04 / sqrt(2^p)

Common precision values:

p = 10: m = 1,024 registers, ~3.2% standard error, 4KB memory
p = 12 (default): m = 4,096 registers, ~1.6% standard error, 16KB memory
p = 14: m = 16,384 registers, ~0.8% standard error, 64KB memory
p = 16: m = 65,536 registers, ~0.4% standard error, 256KB memory

Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:

Below 4: Too few registers for HLL's statistical assumptions, resulting in high variance and unstable estimates.
Above 18: Rapidly diminishing accuracy gains while incurring significant memory growth, making the structure no longer space-efficient for approximate counting.

This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark's configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.

Example usage:

auto adc = cudf::approx_distinct_count(table1);
auto count1 = adc.estimate();
 
adc.add(table2);
auto count2 = adc.estimate();

Definition at line 82 of file approx_distinct_count.hpp.

Constructor & Destructor Documentation

◆ approx_distinct_count() [1/3]

cudf::approx_distinct_count::approx_distinct_count	(	table_view const &	input,
		std::int32_t	precision = `12`,
		null_policy	null_handling = `null_policy::EXCLUDE`,
		nan_policy	nan_handling = `nan_policy::NAN_IS_NULL`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`
	)

Constructs an approximate distinct count sketch from a table with specified precision.

Parameters

input	Table whose rows will be added to the sketch
precision	The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12.
null_handling	`INCLUDE` or `EXCLUDE` rows with nulls (default: `EXCLUDE`)
nan_handling	`NAN_IS_VALID` or `NAN_IS_NULL` (default: `NAN_IS_NULL`)
stream	CUDA stream used for device memory operations and kernel launches

◆ approx_distinct_count() [2/3]

cudf::approx_distinct_count::approx_distinct_count	(	table_view const &	input,
		desired_standard_error	error,
		null_policy	null_handling = `null_policy::EXCLUDE`,
		nan_policy	nan_handling = `nan_policy::NAN_IS_NULL`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`
	)

Constructs an approximate distinct count sketch from a table with specified standard error.

This constructor allows specifying the desired standard error (error tolerance) directly, which is more intuitive than specifying the precision parameter. The precision is calculated as: ceil(2 * log2(1.04 / standard_error)).

Since precision must be an integer, the actual standard error may be better (smaller) than requested. Use the standard_error() getter to retrieve the actual value.

Parameters

input	Table whose rows will be added to the sketch
error	The desired standard error (e.g., `approx_distinct_count::desired_standard_error{0.01}` for ~1%)
null_handling	`INCLUDE` or `EXCLUDE` rows with nulls (default: `EXCLUDE`)
nan_handling	`NAN_IS_VALID` or `NAN_IS_NULL` (default: `NAN_IS_NULL`)
stream	CUDA stream used for device memory operations and kernel launches

Exceptions

std::invalid_argument if standard_error value is not positive

◆ approx_distinct_count() [3/3]

cudf::approx_distinct_count::approx_distinct_count	(	cuda::std::span< cuda::std::byte >	sketch_span,
		std::int32_t	precision,
		null_policy	null_handling = `null_policy::EXCLUDE`,
		nan_policy	nan_handling = `nan_policy::NAN_IS_NULL`
	)

Constructs a non-owning sketch that operates on user-allocated storage.

This constructor creates a sketch that operates directly on the provided storage without copying. This enables zero-copy operations on pre-existing buffers, such as sketch data stored in a column or received from another process.

Warning: The caller must ensure the storage remains valid for the lifetime of this object. The sketch will read from and write to the provided storage directly.

Parameters

sketch_span	The sketch bytes to operate on (must remain valid)
precision	The precision parameter for the sketch (4-18)
null_handling	`INCLUDE` or `EXCLUDE` rows with nulls (default: `EXCLUDE`)
nan_handling	`NAN_IS_VALID` or `NAN_IS_NULL` (default: `NAN_IS_NULL`)

Member Function Documentation

◆ add()

void cudf::approx_distinct_count::add	(	table_view const &	input,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`
	)

Adds rows from a table to the sketch.

Parameters

input	Table whose rows will be added
stream	CUDA stream used for device memory operations and kernel launches

◆ estimate()

std::size_t cudf::approx_distinct_count::estimate ( rmm::cuda_stream_view stream = cudf::get_default_stream() ) const

Estimates the approximate number of distinct rows in the sketch.

Parameters

stream CUDA stream used for device memory operations and kernel launches

Returns: Approximate number of distinct rows

◆ merge() [1/2]

void cudf::approx_distinct_count::merge	(	approx_distinct_count const &	other,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`
	)

Merges another sketch into this sketch.

After merging, this sketch will contain the combined distinct count estimate of both sketches.

Exceptions

std::invalid_argument	if the sketches have different precision values
std::invalid_argument	if the sketches have different null handling policies
std::invalid_argument	if the sketches have different NaN handling policies

Parameters

other	The sketch to merge into this sketch
stream	CUDA stream used for device memory operations and kernel launches

◆ merge() [2/2]

void cudf::approx_distinct_count::merge	(	cuda::std::span< cuda::std::byte const >	sketch_span,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`
	)

Merges a sketch from raw bytes into this sketch.

This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.

Warning: It is the caller's responsibility to ensure that the provided sketch span was created with the same approx_distinct_count configuration (precision, null/NaN handling, etc.) as this sketch. Merging incompatible sketches will produce incorrect results.

Parameters

sketch_span	The sketch bytes to merge into this sketch
stream	CUDA stream used for device memory operations and kernel launches

◆ nan_handling()

nan_policy cudf::approx_distinct_count::nan_handling ( ) const

noexcept

Gets the NaN handling policy for this sketch.

Returns: The NaN policy set at construction

◆ null_handling()

null_policy cudf::approx_distinct_count::null_handling ( ) const

noexcept

Gets the null handling policy for this sketch.

Returns: The null policy set at construction

◆ operator=()

approx_distinct_count& cudf::approx_distinct_count::operator= ( approx_distinct_count && )

default

Move assignment operator.

Returns: A reference to this object

◆ precision()

std::int32_t cudf::approx_distinct_count::precision ( ) const

noexcept

Gets the precision parameter for this sketch.

Returns: The precision value set at construction

◆ sketch() [1/2]

cuda::std::span<cuda::std::byte const> cudf::approx_distinct_count::sketch ( ) const

noexcept

Gets the raw sketch bytes for serialization or external merging (const overload)

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns: A span view of the sketch bytes

◆ sketch() [2/2]

cuda::std::span<cuda::std::byte> cudf::approx_distinct_count::sketch ( )

noexcept

Gets the raw sketch bytes for serialization or external merging.

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns: A span view of the sketch bytes

◆ sketch_alignment()

static std::size_t cudf::approx_distinct_count::sketch_alignment ( )

static

Gets the alignment required for sketch storage.

Returns: The required alignment in bytes

◆ sketch_bytes()

static std::size_t cudf::approx_distinct_count::sketch_bytes ( std::int32_t precision )

static

Gets the number of bytes required for sketch storage at a given precision.

Parameters

precision The HLL precision parameter (4-18)

Returns: The number of bytes required for the sketch

◆ standard_error()

double cudf::approx_distinct_count::standard_error ( ) const

noexcept

Gets the standard error (error tolerance) for this sketch.

The standard error is calculated from precision as: 1.04 / sqrt(2^precision). This represents the expected relative error of the cardinality estimate.

Returns: The actual standard error based on the sketch's precision

The documentation for this class was generated from the following file:

approx_distinct_count.hpp

Classes

Public Types

Public Member Functions

Static Public Member Functions

Detailed Description

Constructor & Destructor Documentation

◆ approx_distinct_count() [1/3]

◆ approx_distinct_count() [2/3]

◆ approx_distinct_count() [3/3]

Member Function Documentation

◆ add()

◆ estimate()

◆ merge() [1/2]

◆ merge() [2/2]

◆ nan_handling()

◆ null_handling()

◆ operator=()

◆ precision()

◆ sketch() [1/2]

◆ sketch() [2/2]

◆ sketch_alignment()

◆ sketch_bytes()

◆ standard_error()