cudf::approx_distinct_count Class Reference

Object-oriented HyperLogLog sketch for approximate distinct counting.

#include <approx_distinct_count.hpp>

Public Types

using impl_type = cudf::detail::approx_distinct_count< cudf::hashing::detail::XXHash_64 >
 Implementation type.
 

Public Member Functions

 approx_distinct_count (table_view const &input, std::int32_t precision=12, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Constructs an approximate distinct count sketch from a table.

 approx_distinct_count (cuda::std::span< cuda::std::byte > sketch_span, std::int32_t precision, null_policy null_handling=null_policy::EXCLUDE, nan_policy nan_handling=nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Constructs an approximate distinct count sketch from serialized sketch bytes.

 approx_distinct_count (approx_distinct_count const &)=delete

approx_distinct_count & operator= (approx_distinct_count const &)=delete

 approx_distinct_count (approx_distinct_count &&)=default
 Default move constructor.

approx_distinct_count & operator= (approx_distinct_count &&)=default
 Move assignment operator.

void add (table_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Adds rows from a table to the sketch.

void merge (approx_distinct_count const &other, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Merges another sketch into this sketch.

void merge (cuda::std::span< cuda::std::byte > sketch_span, rmm::cuda_stream_view stream=cudf::get_default_stream())
 Merges a sketch from raw bytes into this sketch.

std::size_t estimate (rmm::cuda_stream_view stream=cudf::get_default_stream()) const
 Estimates the approximate number of distinct rows in the sketch.

cuda::std::span< cuda::std::byte > sketch () noexcept
 Gets the raw sketch bytes for serialization or external merging.

cuda::std::span< cuda::std::byte const > sketch () const noexcept
 Gets the raw sketch bytes for serialization or external merging (const overload).

null_policy null_handling () const noexcept
 Gets the null handling policy for this sketch.

nan_policy nan_handling () const noexcept
 Gets the NaN handling policy for this sketch.

std::int32_t precision () const noexcept
 Gets the precision parameter for this sketch.
 

Detailed Description

Object-oriented HyperLogLog sketch for approximate distinct counting.

This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.

The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).
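To make the "hash once, then feed the sketch" step concrete, here is a minimal host-side sketch of how a 64-bit hash can drive a HyperLogLog register update. It follows one common textbook convention (top p bits select the register, the position of the first set bit in the remaining bits is the rank); the bit layout actually used inside the library may differ. The names `hll_update`, `split_hash`, and `add_hash` are hypothetical and not part of cudf, and p is the precision parameter described in the next section.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration type (not cudf API): the register to touch and the rank to store.
struct hll_update {
  std::uint32_t index;  // which of the 2^p registers
  std::int32_t rank;    // 1-based position of the first set bit in the remaining 64 - p bits
};

// Splits an already-computed 64-bit hash; assumes 4 <= p <= 18.
inline hll_update split_hash(std::uint64_t hash, int p)
{
  auto const index = static_cast<std::uint32_t>(hash >> (64 - p));  // top p bits
  std::uint64_t const rest = hash << p;  // remaining bits, left-aligned
  std::int32_t const max_rank = 64 - p + 1;  // value used when all remaining bits are zero
  std::int32_t rank = 1;
  std::uint64_t mask = std::uint64_t{1} << 63;
  while (rank < max_rank && (rest & mask) == 0) {
    ++rank;
    mask >>= 1;
  }
  return {index, rank};
}

// Each register keeps the maximum rank seen; registers must hold 2^p entries.
inline void add_hash(std::vector<std::int32_t>& registers, std::uint64_t hash, int p)
{
  auto const u = split_hash(hash, p);
  registers[u.index] = std::max(registers[u.index], u.rank);
}
```

Because each register only ever keeps a maximum, adding the same row twice leaves the sketch unchanged, which is what makes incremental `add` calls safe.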

HyperLogLog Precision Parameter
The precision parameter (p) is the number of bits used to index into the register array. It determines the number of registers (m = 2^p) in the HLL sketch:
  • Memory usage: 2^p * 4 bytes (m registers of 4 bytes each for GPU atomics)
  • Standard error: 1.04 / sqrt(m) = 1.04 / sqrt(2^p)
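The two formulas above can be evaluated directly when choosing a precision. The following is a small sketch, assuming 4-byte registers as stated in the memory formula; `hll_precision_info` and `describe_precision` are hypothetical helper names, not part of cudf.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical helper type (not cudf API): properties derived from a precision p.
struct hll_precision_info {
  std::size_t num_registers;  // m = 2^p
  std::size_t memory_bytes;   // m * 4, assuming 4-byte registers
  double standard_error;      // 1.04 / sqrt(m)
};

inline hll_precision_info describe_precision(std::int32_t p)
{
  // The documented valid range is [4, 18]; this check belongs to the sketch only.
  if (p < 4 || p > 18) { throw std::invalid_argument("precision must be in [4, 18]"); }
  auto const m = std::size_t{1} << p;
  return {m, m * 4, 1.04 / std::sqrt(static_cast<double>(m))};
}
```

For the default p = 12 this gives 4,096 registers, 16,384 bytes, and a standard error of 1.625%. Each increment of p doubles the memory while reducing the error only by a factor of sqrt(2).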

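Common precision values (derived from the formulas above):
  • p = 10: m = 1,024 registers, 4 KiB, standard error ≈ 3.25%
  • p = 12 (default): m = 4,096 registers, 16 KiB, standard error ≈ 1.63%
  • p = 14: m = 16,384 registers, 64 KiB, standard error ≈ 0.81%
  • p = 16: m = 65,536 registers, 256 KiB, standard error ≈ 0.41%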

Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range.

This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark's configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.

Example usage:

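// Build a sketch from the rows of table1 (default precision 12).
auto adc = cudf::approx_distinct_count(table1);
auto count1 = adc.estimate();  // approximate distinct rows in table1
// Incrementally add the rows of table2 to the same sketch.
adc.add(table2);
auto count2 = adc.estimate();  // approximate distinct rows across table1 and table2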

Definition at line 76 of file approx_distinct_count.hpp.

Constructor & Destructor Documentation

◆ approx_distinct_count() [1/2]

cudf::approx_distinct_count::approx_distinct_count ( table_view const &  input,
std::int32_t  precision = 12,
null_policy  null_handling = null_policy::EXCLUDE,
nan_policy  nan_handling = nan_policy::NAN_IS_NULL,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Constructs an approximate distinct count sketch from a table.

Parameters
input          Table whose rows will be added to the sketch
precision      The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12.
null_handling  INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling   NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
stream         CUDA stream used for device memory operations and kernel launches

◆ approx_distinct_count() [2/2]

cudf::approx_distinct_count::approx_distinct_count ( cuda::std::span< cuda::std::byte >  sketch_span,
std::int32_t  precision,
null_policy  null_handling = null_policy::EXCLUDE,
nan_policy  nan_handling = nan_policy::NAN_IS_NULL,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Constructs an approximate distinct count sketch from serialized sketch bytes.

This constructor enables distributed distinct counting by allowing sketches to be constructed from serialized data. The sketch data is copied into the newly created object, which then owns its own independent storage.

Warning
The precision parameter must match the precision used to create the original sketch. The size of the sketch span must be exactly 2^precision * 4 bytes (2^precision registers of 4 bytes each, matching the memory usage given in the class description). The null and NaN handling policies must match those used when creating the original sketch. Providing incompatible parameters will produce incorrect results or errors.
Parameters
sketch_span    The serialized sketch bytes to reconstruct from
precision      The precision parameter that was used to create the sketch (4-18)
null_handling  INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling   NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
stream         CUDA stream used for device memory operations and kernel launches
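The relationship between precision and serialized size can be checked before reconstructing a sketch. Below is a host-only toy round trip, assuming 4-byte registers as in the memory-usage formula, so the byte length is 2^precision * 4. `registers_to_bytes` and `registers_from_bytes` are hypothetical helpers, not cudf functions, and the memory that the real API operates on may live on the device.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not cudf API): flatten registers into raw bytes.
inline std::vector<std::byte> registers_to_bytes(std::vector<std::int32_t> const& registers)
{
  std::vector<std::byte> bytes(registers.size() * sizeof(std::int32_t));
  if (!bytes.empty()) { std::memcpy(bytes.data(), registers.data(), bytes.size()); }
  return bytes;
}

// Hypothetical helper (not cudf API): rebuild registers, rejecting a size/precision mismatch.
inline std::vector<std::int32_t> registers_from_bytes(std::vector<std::byte> const& bytes,
                                                      std::int32_t precision)
{
  if (precision < 4 || precision > 18) {
    throw std::invalid_argument("precision must be in [4, 18]");
  }
  auto const num_registers = std::size_t{1} << precision;
  auto const expected      = num_registers * sizeof(std::int32_t);
  if (bytes.size() != expected) {
    throw std::invalid_argument("sketch size does not match precision");
  }
  std::vector<std::int32_t> registers(num_registers);
  std::memcpy(registers.data(), bytes.data(), expected);  // copy: the result owns its storage
  return registers;
}
```

Like the constructor described above, the toy version copies the bytes, so the reconstructed registers do not alias the serialized buffer. Note that the null and NaN policies are not recoverable from the bytes and have to travel alongside them.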

Member Function Documentation

◆ add()

void cudf::approx_distinct_count::add ( table_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Adds rows from a table to the sketch.

Parameters
input   Table whose rows will be added
stream  CUDA stream used for device memory operations and kernel launches

◆ estimate()

std::size_t cudf::approx_distinct_count::estimate ( rmm::cuda_stream_view  stream = cudf::get_default_stream()) const

Estimates the approximate number of distinct rows in the sketch.

Parameters
stream  CUDA stream used for device memory operations and kernel launches
Returns
Approximate number of distinct rows
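For intuition about how an estimate is derived from register contents, here is a host-side sketch of the classic HyperLogLog estimator with the small-range (linear counting) correction. This is the textbook formula, not necessarily what the library computes; HLL++-style implementations apply additional bias corrections. `alpha_for` and `estimate_from_registers` are hypothetical names.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Bias-correction constant from the original HyperLogLog analysis.
inline double alpha_for(std::size_t m)
{
  if (m == 16) { return 0.673; }
  if (m == 32) { return 0.697; }
  if (m == 64) { return 0.709; }
  return 0.7213 / (1.0 + 1.079 / static_cast<double>(m));  // m >= 128
}

// Hypothetical helper (not cudf API): classic estimator over m = registers.size() registers.
inline std::size_t estimate_from_registers(std::vector<std::int32_t> const& registers)
{
  if (registers.empty()) { throw std::invalid_argument("sketch has no registers"); }
  auto const m = static_cast<double>(registers.size());
  double sum        = 0.0;  // sum of 2^(-register)
  std::size_t zeros = 0;    // registers never touched
  for (auto const r : registers) {
    sum += std::ldexp(1.0, -r);
    if (r == 0) { ++zeros; }
  }
  double const raw = alpha_for(registers.size()) * m * m / sum;
  // Small-range correction: fall back to linear counting while empty registers remain.
  if (raw <= 2.5 * m && zeros > 0) {
    return static_cast<std::size_t>(std::llround(m * std::log(m / static_cast<double>(zeros))));
  }
  return static_cast<std::size_t>(std::llround(raw));
}
```

The estimate depends only on the register contents, which is why estimating does not modify the sketch and can be repeated after further `add` or `merge` calls.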

◆ merge() [1/2]

void cudf::approx_distinct_count::merge ( approx_distinct_count const &  other,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Merges another sketch into this sketch.

After merging, this sketch represents the union of the rows added to both sketches, so estimate() returns the approximate distinct count across both. The other sketch is not modified.

Exceptions
std::invalid_argument  if the sketches have different precision values
std::invalid_argument  if the sketches have different null handling policies
std::invalid_argument  if the sketches have different NaN handling policies
Parameters
other   The sketch to merge into this sketch
stream  CUDA stream used for device memory operations and kernel launches
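The compatibility checks and the merge itself can be sketched on the host as follows. Merging two HyperLogLog sketches of equal precision is conventionally a register-wise maximum; whether the library performs it exactly this way is an assumption. `toy_sketch` and `merge_into` are hypothetical, and the two booleans merely stand in for the null and NaN policies.

```cpp
#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side stand-in for a sketch (not cudf API).
struct toy_sketch {
  std::int32_t precision;
  bool include_nulls;                   // stands in for null_policy
  bool nan_is_valid;                    // stands in for nan_policy
  std::vector<std::int32_t> registers;  // 2^precision entries
};

// Merges src into dst, mirroring the documented std::invalid_argument conditions.
inline void merge_into(toy_sketch& dst, toy_sketch const& src)
{
  if (dst.precision != src.precision) {
    throw std::invalid_argument("sketches have different precision values");
  }
  if (dst.include_nulls != src.include_nulls) {
    throw std::invalid_argument("sketches have different null handling policies");
  }
  if (dst.nan_is_valid != src.nan_is_valid) {
    throw std::invalid_argument("sketches have different NaN handling policies");
  }
  if (dst.registers.size() != src.registers.size()) {
    throw std::invalid_argument("sketches have different register counts");
  }
  // Register-wise maximum: the result is the sketch of the union of both inputs.
  std::transform(dst.registers.begin(), dst.registers.end(), src.registers.begin(),
                 dst.registers.begin(),
                 [](std::int32_t a, std::int32_t b) { return std::max(a, b); });
}
```

Since the maximum is commutative and associative, partial sketches built on different workers can be merged in any order and produce the same final registers.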

◆ merge() [2/2]

void cudf::approx_distinct_count::merge ( cuda::std::span< cuda::std::byte >  sketch_span,
rmm::cuda_stream_view  stream = cudf::get_default_stream() 
)

Merges a sketch from raw bytes into this sketch.

This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.

Warning
It is the caller's responsibility to ensure that the provided sketch span was created with the same approx_distinct_count configuration (precision, null/NaN handling, etc.) as this sketch. Merging incompatible sketches will produce incorrect results.
Parameters
sketch_span  The sketch bytes to merge into this sketch
stream       CUDA stream used for device memory operations and kernel launches

◆ nan_handling()

nan_policy cudf::approx_distinct_count::nan_handling ( ) const noexcept

Gets the NaN handling policy for this sketch.

Returns
The NaN policy set at construction

◆ null_handling()

null_policy cudf::approx_distinct_count::null_handling ( ) const noexcept

Gets the null handling policy for this sketch.

Returns
The null policy set at construction

◆ operator=()

approx_distinct_count& cudf::approx_distinct_count::operator= ( approx_distinct_count &&  ) = default

Move assignment operator.

Returns
A reference to this object

◆ precision()

std::int32_t cudf::approx_distinct_count::precision ( ) const noexcept

Gets the precision parameter for this sketch.

Returns
The precision value set at construction

◆ sketch() [1/2]

cuda::std::span<cuda::std::byte const> cudf::approx_distinct_count::sketch ( ) const noexcept

Gets the raw sketch bytes for serialization or external merging (const overload).

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns
A span view of the sketch bytes

◆ sketch() [2/2]

cuda::std::span<cuda::std::byte> cudf::approx_distinct_count::sketch ( ) noexcept

Gets the raw sketch bytes for serialization or external merging.

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns
A span view of the sketch bytes

The documentation for this class was generated from the following file:
  • approx_distinct_count.hpp