Column Reduction

group Reduction

Functions

cudf::size_type distinct_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())

Count the distinct elements in the column_view.

Given an input column_view, returns the number of distinct elements it contains.

If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only nulls are ignored and NaN is included in the distinct count.

Nulls are handled as equal.

Parameters:
  • input[in] The column_view whose distinct elements will be counted

  • null_handling[in] flag to include or ignore null while counting

  • nan_handling[in] flag to consider NaN==null or not

  • stream[in] CUDA stream used for device memory operations and kernel launches

Returns:

number of distinct elements in the column
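The interaction of the two policies can be modelled on the host. The following is a minimal sketch, not libcudf code: the demo namespace, its enums, and distinct_count_model are hypothetical names, std::nullopt stands in for a null element, and the model assumes that all NaN values count as a single distinct value under NAN_IS_VALID.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <optional>
#include <vector>

namespace demo {  // hypothetical host-side stand-ins, not the cudf types

enum class null_policy { EXCLUDE, INCLUDE };
enum class nan_policy { NAN_IS_NULL, NAN_IS_VALID };

// Counts distinct elements of a nullable column of doubles.
// std::nullopt models a null element.
std::size_t distinct_count_model(std::vector<std::optional<double>> const& col,
                                 null_policy null_handling,
                                 nan_policy nan_handling)
{
  std::vector<double> seen;  // distinct values that are neither null nor NaN
  bool saw_null = false;
  bool saw_nan  = false;

  for (auto const& v : col) {
    if (!v.has_value()) {
      saw_null = true;
      continue;
    }
    if (std::isnan(*v)) {
      // NAN_IS_NULL folds NaN into the null bucket.
      if (nan_handling == nan_policy::NAN_IS_NULL) {
        saw_null = true;
      } else {
        saw_nan = true;
      }
      continue;
    }
    bool found = false;
    for (double s : seen) {
      if (s == *v) {
        found = true;
        break;
      }
    }
    if (!found) { seen.push_back(*v); }
  }

  std::size_t count = seen.size();
  if (saw_nan) { ++count; }  // assumption: all NaNs form one distinct value
  // Nulls are handled as equal, so they contribute at most one element.
  if (saw_null && null_handling == null_policy::INCLUDE) { ++count; }

  assert(count <= col.size());
  return count;
}

}  // namespace demo
```

For the column {1.0, 2.0, 2.0, null, NaN, NaN, null}, this model yields 2 for EXCLUDE with NAN_IS_NULL and 3 for EXCLUDE with NAN_IS_VALID.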

cudf::size_type distinct_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())

Count the distinct rows in a table.

Parameters:
  • input[in] Table whose distinct rows will be counted

  • nulls_equal[in] flag to denote if null elements should be considered equal. Nulls are not equal if null_equality::UNEQUAL.

  • stream[in] CUDA stream used for device memory operations and kernel launches

Returns:

number of distinct rows in the table

cudf::size_type unique_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())

Count the number of consecutive groups of equivalent rows in a column.

If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only nulls are ignored and NaN is included in the count.

Nulls are handled as equal.

Parameters:
  • input[in] The column_view whose consecutive groups of equivalent rows will be counted

  • null_handling[in] flag to include or ignore null while counting

  • nan_handling[in] flag to consider NaN==null or not

  • stream[in] CUDA stream used for device memory operations and kernel launches

Returns:

number of consecutive groups of equivalent rows in the column
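Counting consecutive groups is not the same as counting distinct values unless equal elements are adjacent. The host-side sketch below illustrates the difference on plain integers; count_consecutive_groups and count_distinct are hypothetical helpers written for this illustration, not libcudf functions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Counts runs of adjacent equal values, e.g. {1, 1, 2, 1} has 3 runs.
std::size_t count_consecutive_groups(std::vector<int> const& col)
{
  if (col.empty()) { return 0; }
  std::size_t groups = 1;
  for (std::size_t i = 1; i < col.size(); ++i) {
    if (col[i] != col[i - 1]) { ++groups; }
  }
  return groups;
}

// Counts distinct values regardless of position. After sorting, equal
// values are adjacent, so every run corresponds to one distinct value.
std::size_t count_distinct(std::vector<int> col)
{
  std::sort(col.begin(), col.end());
  return count_consecutive_groups(col);
}
```

On {1, 1, 2, 2, 1, 3, 3} the first helper reports four groups while the second reports three distinct values; on sorted input the two agree.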

cudf::size_type unique_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())

Count the number of consecutive groups of equivalent rows in a table.

Parameters:
  • input[in] Table whose consecutive groups of equivalent rows will be counted

  • nulls_equal[in] flag to denote if null elements should be considered equal. Nulls are not equal if null_equality::UNEQUAL.

  • stream[in] CUDA stream used for device memory operations and kernel launches

Returns:

number of consecutive groups of equivalent rows in the table

class approx_distinct_count
#include <approx_distinct_count.hpp>

Object-oriented HyperLogLog sketch for approximate distinct counting.

This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.

The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).
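To make the role of the 64-bit hash concrete, here is a minimal sketch of the classical HyperLogLog register update applied to an already-hashed value. It is not the libcudf implementation: hll_add_hash is a hypothetical helper, and the bit layout (top p bits select the register, the rank is one plus the number of leading zeros in the remaining bits) is the textbook convention, assumed here rather than taken from the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Updates one register of a HyperLogLog sketch from a 64-bit hash.
// Assumed layout: the top p bits are the register index; the rank is
// 1 + the number of leading zeros in the remaining 64 - p bits.
void hll_add_hash(std::vector<std::int32_t>& registers, int p, std::uint64_t hash)
{
  assert(p >= 4 && p <= 18);
  assert(registers.size() == (std::size_t{1} << p));

  std::uint64_t const index = hash >> (64 - p);
  std::uint64_t rest        = hash << p;  // remaining bits, moved to the top

  int rank = 1;
  while (rank <= 64 - p && (rest & (std::uint64_t{1} << 63)) == 0) {
    ++rank;
    rest <<= 1;
  }

  // Each register keeps the maximum rank observed for its index.
  if (registers[index] < rank) { registers[index] = rank; }
}
```

Because each register only ever keeps a maximum, adding the same hash twice leaves the sketch unchanged, which is what makes the structure insensitive to duplicate rows.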

Common precision values:

  • p = 10: m = 1,024 registers, ~3.2% standard error, 4KB memory

  • p = 12 (default): m = 4,096 registers, ~1.6% standard error, 16KB memory

  • p = 14: m = 16,384 registers, ~0.8% standard error, 64KB memory

  • p = 16: m = 65,536 registers, ~0.4% standard error, 256KB memory

HyperLogLog Precision Parameter

The precision parameter (p) is the number of bits used to index into the register array. It determines the number of registers (m = 2^p) in the HLL sketch:

  • Memory usage: 2^p * 4 bytes (m registers of 4 bytes each for GPU atomics)

  • Standard error: 1.04 / sqrt(m) = 1.04 / sqrt(2^p)

Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:

  • Below 4: Too few registers for HLL’s statistical assumptions, resulting in high variance and unstable estimates.

  • Above 18: Rapidly diminishing accuracy gains while incurring significant memory growth, making the structure no longer space-efficient for approximate counting.

This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark’s configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.
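The formulas above can be checked numerically. The following is a small host-side sketch; hll_footprint and footprint_for_precision are hypothetical helper names, not part of libcudf, and the 4-byte register size is taken from the memory-usage formula above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Register count, memory footprint, and standard error for a precision p.
struct hll_footprint {
  std::size_t registers;  // m = 2^p
  std::size_t bytes;      // m * 4, assuming 4-byte registers
  double standard_error;  // 1.04 / sqrt(m)
};

hll_footprint footprint_for_precision(int p)
{
  assert(p >= 4 && p <= 18);  // documented valid range
  std::size_t const m = std::size_t{1} << p;
  return {m, m * 4, 1.04 / std::sqrt(static_cast<double>(m))};
}
```

For the default precision of 12 this gives 4,096 registers, 16,384 bytes, and a standard error of about 1.6%, matching the table of common precision values.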

Example usage:

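// Build a sketch from the rows of table1 using the default precision (12)
auto adc = cudf::approx_distinct_count(table1);
auto count1 = adc.estimate();

// Add the rows of table2; the estimate now covers both tables
adc.add(table2);
auto count2 = adc.estimate();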

Public Types

using impl_type = cudf::detail::approx_distinct_count<cudf::hashing::detail::XXHash_64>

Implementation type.

Public Functions

approx_distinct_count(table_view const &input, std::int32_t precision = 12, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream = cudf::get_default_stream())

Constructs an approximate distinct count sketch from a table.

Parameters:
  • input – Table whose rows will be added to the sketch

  • precision – The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12.

  • null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)

  • nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)

  • stream – CUDA stream used for device memory operations and kernel launches

approx_distinct_count(cuda::std::span<cuda::std::byte> sketch_span, std::int32_t precision, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream = cudf::get_default_stream())

Constructs an approximate distinct count sketch from serialized sketch bytes.

This constructor enables distributed distinct counting by allowing sketches to be constructed from serialized data. The sketch data is copied into the newly created object, which then owns its own independent storage.

Warning

The precision parameter must match the precision used to create the original sketch. The size of the sketch span must exactly match the storage size of a sketch with that precision (2^precision registers of 4 bytes each, per the memory-usage formula above). The null and NaN handling policies must match those used when creating the original sketch. Providing incompatible parameters will produce incorrect results or errors.

Parameters:
  • sketch_span – The serialized sketch bytes to reconstruct from

  • precision – The precision parameter that was used to create the sketch (4-18)

  • null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)

  • nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)

  • stream – CUDA stream used for device memory operations and kernel launches

approx_distinct_count(approx_distinct_count&&) = default

Default move constructor.

approx_distinct_count &operator=(approx_distinct_count&&) = default

Move assignment operator.

Returns:

A reference to this object

void add(table_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream())

Adds rows from a table to the sketch.

Parameters:
  • input – Table whose rows will be added

  • stream – CUDA stream used for device memory operations and kernel launches

void merge(approx_distinct_count const &other, rmm::cuda_stream_view stream = cudf::get_default_stream())

Merges another sketch into this sketch.

After merging, this sketch will contain the combined distinct count estimate of both sketches.

Throws:
  • std::invalid_argument – if the sketches have different precision values

  • std::invalid_argument – if the sketches have different null handling policies

  • std::invalid_argument – if the sketches have different NaN handling policies

Parameters:
  • other – The sketch to merge into this sketch

  • stream – CUDA stream used for device memory operations and kernel launches
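In the standard HyperLogLog algorithm, merging two sketches of equal precision is a register-wise maximum. The host-side sketch below illustrates that technique under this assumption; it is not the libcudf implementation, and hll_merge is a hypothetical helper. It mirrors the documented behaviour of rejecting sketches with different precision by throwing std::invalid_argument.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Merges `other` into `into` by taking the maximum of each register pair.
// Register arrays of different length imply different precision values.
void hll_merge(std::vector<std::int32_t>& into, std::vector<std::int32_t> const& other)
{
  if (into.size() != other.size()) {
    throw std::invalid_argument("sketches have different precision values");
  }
  for (std::size_t i = 0; i < into.size(); ++i) {
    into[i] = std::max(into[i], other[i]);
  }
}
```

Because the maximum is commutative, associative, and idempotent, sketches built on separate partitions can be merged in any order, which is what makes the structure suitable for distributed distinct counting.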

void merge(cuda::std::span<cuda::std::byte> sketch_span, rmm::cuda_stream_view stream = cudf::get_default_stream())

Merges a sketch from raw bytes into this sketch.

This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.

Warning

It is the caller’s responsibility to ensure that the provided sketch span was created with the same approx_distinct_count configuration (precision, null/NaN handling, etc.) as this sketch. Merging incompatible sketches will produce incorrect results.

Parameters:
  • sketch_span – The sketch bytes to merge into this sketch

  • stream – CUDA stream used for device memory operations and kernel launches

std::size_t estimate(rmm::cuda_stream_view stream = cudf::get_default_stream()) const

Estimates the approximate number of distinct rows in the sketch.

Parameters:

stream – CUDA stream used for device memory operations and kernel launches

Returns:

Approximate number of distinct rows

cuda::std::span<cuda::std::byte> sketch() noexcept

Gets the raw sketch bytes for serialization or external merging.

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns:

A span view of the sketch bytes

cuda::std::span<cuda::std::byte const> sketch() const noexcept

Gets the raw sketch bytes for serialization or external merging (const overload).

The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.

Returns:

A span view of the sketch bytes

null_policy null_handling() const noexcept

Gets the null handling policy for this sketch.

Returns:

The null policy set at construction

nan_policy nan_handling() const noexcept

Gets the NaN handling policy for this sketch.

Returns:

The NaN policy set at construction

std::int32_t precision() const noexcept

Gets the precision parameter for this sketch.

Returns:

The precision value set at construction