Classes | Public Types | Public Member Functions | Static Public Member Functions | List of all members
cudf::approx_distinct_count Class Reference

Object-oriented HyperLogLog sketch for approximate distinct counting. More...

#include <approx_distinct_count.hpp>

Classes

struct  desired_standard_error
 Strong type wrapper for the desired standard error constructor parameter. More...
 

Public Types

using impl_type = cudf::detail::approx_distinct_count< cudf::hashing::detail::XXHash_64 >
 Implementation type.
 

Public Member Functions

 approx_distinct_count (table_view const &input, std::int32_t precision=12, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Constructs an approximate distinct count sketch from a table with specified precision. More...
 
 approx_distinct_count (table_view const &input, desired_standard_error error, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Constructs an approximate distinct count sketch from a table with specified standard error. More...
 
 approx_distinct_count (cuda::std::span< cuda::std::byte > sketch_span, std::int32_t precision, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL)
 Constructs a non-owning sketch that operates on user-allocated storage. More...
 
 approx_distinct_count (approx_distinct_count const &)=delete
 
approx_distinct_countoperator= (approx_distinct_count const &)=delete
 
 approx_distinct_count (approx_distinct_count &&)=default
 Default move constructor.
 
approx_distinct_countoperator= (approx_distinct_count &&)=default
 Move assignment operator. More...
 
void add (table_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Adds rows from a table to the sketch. More...
 
void merge (approx_distinct_count const &other, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Merges another sketch into this sketch. More...
 
void merge (cuda::std::span< cuda::std::byte const > sketch_span, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Merges a sketch from raw bytes into this sketch. More...
 
std::size_t estimate (rmm::cuda_stream_view stream=cudf::get_default_stream()) const
 Estimates the approximate number of distinct rows in the sketch. More...
 
cuda::std::span< cuda::std::byte > sketch () noexcept
 Gets the raw sketch bytes for serialization or external merging. More...
 
cuda::std::span< cuda::std::byte const > sketch () const noexcept
 Gets the raw sketch bytes for serialization or external merging (const overload) More...
 
null_policy null_handling () const noexcept
 Gets the null handling policy for this sketch. More...
 
nan_policy nan_handling () const noexcept
 Gets the NaN handling policy for this sketch. More...
 
std::int32_t precision () const noexcept
 Gets the precision parameter for this sketch. More...
 
double standard_error () const noexcept
 Gets the standard error (error tolerance) for this sketch. More...
 

Static Public Member Functions

static std::size_t sketch_bytes (std::int32_t precision)
 Gets the number of bytes required for sketch storage at a given precision. More...
 
static std::size_t sketch_alignment ()
 Gets the alignment required for sketch storage. More...
 

Detailed Description

Object-oriented HyperLogLog sketch for approximate distinct counting.

This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.

The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).

HyperLogLog Precision Parameter
The precision parameter (p) is the number of bits used to index into the register array. It determines the number of registers (m = 2^p) in the HLL sketch:
  • Memory usage: 2^p * 4 bytes (m registers of 4 bytes each for GPU atomics)
  • Standard error: 1.04 / sqrt(m) = 1.04 / sqrt(2^p)

Common precision values:

Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:

This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark's configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.

Example usage:

auto adc = cudf::approx_distinct_count(table1);
auto count1 = adc.estimate();
adc.add(table2);
auto count2 = adc.estimate();
Object-oriented HyperLogLog sketch for approximate distinct counting.

Definition at line 82 of file approx_distinct_count.hpp.

Constructor & Destructor Documentation

◆ approx_distinct_count() [1/3]

cudf::approx_distinct_count::approx_distinct_count ( table_view const &  input,
std::int32_t  precision = 12,
null_policy  null_handling = null_policy::EXCLUDE,
nan_policy  nan_handling = nan_policy::NAN_IS_NULL,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Constructs an approximate distinct count sketch from a table with specified precision.

Parameters
inputTable whose rows will be added to the sketch
precisionThe precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12.
null_handlingINCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handlingNAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
streamCUDA stream used for device memory operations and kernel launches

◆ approx_distinct_count() [2/3]

cudf::approx_distinct_count::approx_distinct_count ( table_view const &  input,
desired_standard_error  error,
null_policy  null_handling = null_policy::EXCLUDE,
nan_policy  nan_handling = nan_policy::NAN_IS_NULL,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Constructs an approximate distinct count sketch from a table with specified standard error.

This constructor allows specifying the desired standard error (error tolerance) directly, which is more intuitive than specifying the precision parameter. The precision is calculated as: ceil(2 * log2(1.04 / standard_error)).

Since precision must be an integer, the actual standard error may be better (smaller) than requested. Use the standard_error() getter to retrieve the actual value.

Parameters
inputTable whose rows will be added to the sketch
errorThe desired standard error (e.g., approx_distinct_count::desired_standard_error{0.01} for ~1%)
null_handlingINCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handlingNAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
streamCUDA stream used for device memory operations and kernel launches
Exceptions
std::invalid_argumentif standard_error value is not positive

◆ approx_distinct_count() [3/3]

cudf::approx_distinct_count::approx_distinct_count ( cuda::std::span< cuda::std::byte >  sketch_span,
std::int32_t  precision,
null_policy  null_handling = null_policy::EXCLUDE,
nan_policy  nan_handling = nan_policy::NAN_IS_NULL 
)

Constructs a non-owning sketch that operates on user-allocated storage.

This constructor creates a sketch that operates directly on the provided storage without copying. This enables zero-copy operations on pre-existing buffers, such as sketch data stored in a column or received from another process.

Warning
The caller must ensure the storage remains valid for the lifetime of this object. The sketch will read from and write to the provided storage directly.
Parameters
sketch_spanThe sketch bytes to operate on (must remain valid)
precisionThe precision parameter for the sketch (4-18)
null_handlingINCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handlingNAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)

Member Function Documentation

◆ add()

void cudf::approx_distinct_count::add ( table_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Adds rows from a table to the sketch.

Parameters
inputTable whose rows will be added
streamCUDA stream used for device memory operations and kernel launches

◆ estimate()

std::size_t cudf::approx_distinct_count::estimate ( rmm::cuda_stream_view  stream = cudf::get_default_stream()) const

Estimates the approximate number of distinct rows in the sketch.

Parameters
streamCUDA stream used for device memory operations and kernel launches
Returns
Approximate number of distinct rows

◆ merge() [1/2]

void cudf::approx_distinct_count::merge ( approx_distinct_count const &  other,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Merges another sketch into this sketch.

After merging, this sketch will contain the combined distinct count estimate of both sketches.

Exceptions
std::invalid_argumentif the sketches have different precision values
std::invalid_argumentif the sketches have different null handling policies
std::invalid_argumentif the sketches have different NaN handling policies
Parameters
otherThe sketch to merge into this sketch
streamCUDA stream used for device memory operations and kernel launches

◆ merge() [2/2]

void cudf::approx_distinct_count::merge ( cuda::std::span< cuda::std::byte const >  sketch_span,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Merges a sketch from raw bytes into this sketch.

This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.

Warning
It is the caller's responsibility to ensure that the provided sketch span was created with the same approx_distinct_count configuration (precision, null/NaN handling, etc.) as this sketch. Merging incompatible sketches will produce incorrect results.
Parameters
sketch_spanThe sketch bytes to merge into this sketch
streamCUDA stream used for device memory operations and kernel launches

◆ nan_handling()

nan_policy cudf::approx_distinct_count::nan_handling ( ) const
noexcept

Gets the NaN handling policy for this sketch.

Returns
The NaN policy set at construction

◆ null_handling()

null_policy cudf::approx_distinct_count::null_handling ( ) const
noexcept

Gets the null handling policy for this sketch.

Returns
The null policy set at construction

◆ operator=()

approx_distinct_count& cudf::approx_distinct_count::operator= ( approx_distinct_count &&  )
default

Move assignment operator.

Returns
A reference to this object

◆ precision()

std::int32_t cudf::approx_distinct_count::precision ( ) const
noexcept

Gets the precision parameter for this sketch.

Returns
The precision value set at construction

◆ sketch() [1/2]

cuda::std::span<cuda::std::byte const> cudf::approx_distinct_count::sketch ( ) const
noexcept

Gets the raw sketch bytes for serialization or external merging (const overload)

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns
A span view of the sketch bytes

◆ sketch() [2/2]

cuda::std::span<cuda::std::byte> cudf::approx_distinct_count::sketch ( )
noexcept

Gets the raw sketch bytes for serialization or external merging.

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns
A span view of the sketch bytes

◆ sketch_alignment()

static std::size_t cudf::approx_distinct_count::sketch_alignment ( )
static

Gets the alignment required for sketch storage.

Returns
The required alignment in bytes

◆ sketch_bytes()

static std::size_t cudf::approx_distinct_count::sketch_bytes ( std::int32_t  precision)
static

Gets the number of bytes required for sketch storage at a given precision.

Parameters
precisionThe HLL precision parameter (4-18)
Returns
The number of bytes required for the sketch

◆ standard_error()

double cudf::approx_distinct_count::standard_error ( ) const
noexcept

Gets the standard error (error tolerance) for this sketch.

The standard error is calculated from precision as: 1.04 / sqrt(2^precision). This represents the expected relative error of the cardinality estimate.

Returns
The actual standard error based on the sketch's precision

The documentation for this class was generated from the following file: