Column Reduction
- group Reduction
Functions
-
cudf::size_type distinct_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the distinct elements in the column_view.
Given an input column_view, the number of distinct elements in this column_view is returned.
If nulls_equal == null_equality::UNEQUAL, all nulls are distinct.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored and NaN is considered in the distinct count.
nulls are handled as equal.
- Parameters:
input – [in] The column_view whose distinct elements will be counted
null_handling – [in] flag to include or ignore null while counting
nan_handling – [in] flag to consider NaN == null or not
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of distinct elements in the column
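The interaction of the two policies can be easier to follow in a small host-side model. The sketch below is not libcudf code: `NullPolicy`, `NanPolicy`, and `distinct_count_ref` are hypothetical stand-ins, `std::nullopt` plays the role of a null element, and it assumes that all NaN values count as a single distinct value under NAN_IS_VALID.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical stand-ins for cudf::null_policy and cudf::nan_policy.
enum class NullPolicy { EXCLUDE, INCLUDE };
enum class NanPolicy { NAN_IS_NULL, NAN_IS_VALID };

// Host-side reference model of the documented policy rules.
// std::nullopt represents a null element.
std::size_t distinct_count_ref(std::vector<std::optional<double>> const& column,
                               NullPolicy null_handling,
                               NanPolicy nan_handling)
{
  std::set<double> values;  // distinct values that are neither null nor NaN
  bool saw_null = false;
  bool saw_nan  = false;
  for (auto const& element : column) {
    if (!element.has_value()) {
      saw_null = true;
    } else if (std::isnan(*element)) {
      // NAN_IS_NULL folds NaN into the null bucket.
      if (nan_handling == NanPolicy::NAN_IS_NULL) {
        saw_null = true;
      } else {
        saw_nan = true;
      }
    } else {
      values.insert(*element);
    }
  }
  std::size_t count = values.size();
  if (saw_nan) { ++count; }  // assumption: all NaNs form one distinct value
  // Nulls are handled as equal, so they add at most one to the count.
  if (saw_null && null_handling == NullPolicy::INCLUDE) { ++count; }
  return count;
}
```

For a column holding 1.0, 1.0, null, NaN, 2.0, null, NaN, this model gives 2 for EXCLUDE with NAN_IS_NULL and 3 for EXCLUDE with NAN_IS_VALID.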
-
cudf::size_type distinct_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the distinct rows in a table.
- Parameters:
input – [in] Table whose distinct rows will be counted
nulls_equal – [in] flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL.
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of distinct rows in the table
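To illustrate how `nulls_equal` changes the result, here is a minimal host-side sketch of row-wise distinct counting. It is a quadratic reference model, not the library's algorithm; `NullEquality`, `Row`, `rows_equal`, and `distinct_rows_ref` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for cudf::null_equality.
enum class NullEquality { EQUAL, UNEQUAL };

// One table row; std::nullopt represents a null element.
using Row = std::vector<std::optional<int>>;

// Element-wise row comparison under the chosen null equality.
bool rows_equal(Row const& a, Row const& b, NullEquality nulls_equal)
{
  if (a.size() != b.size()) { return false; }
  for (std::size_t i = 0; i < a.size(); ++i) {
    bool const a_null = !a[i].has_value();
    bool const b_null = !b[i].has_value();
    if (a_null || b_null) {
      // Two nulls match only under EQUAL; a null never matches a valid value.
      if (!(a_null && b_null && nulls_equal == NullEquality::EQUAL)) { return false; }
    } else if (*a[i] != *b[i]) {
      return false;
    }
  }
  return true;
}

// Counts rows that are not equal to any previously kept row.
std::size_t distinct_rows_ref(std::vector<Row> const& rows, NullEquality nulls_equal)
{
  std::vector<Row> kept;
  for (auto const& row : rows) {
    bool duplicate = false;
    for (auto const& other : kept) {
      if (rows_equal(row, other, nulls_equal)) {
        duplicate = true;
        break;
      }
    }
    if (!duplicate) { kept.push_back(row); }
  }
  return kept.size();
}
```

With the rows (1, null), (1, null), (2, 3), (2, 3), the model counts 2 distinct rows under EQUAL and 3 under UNEQUAL, because each row containing a null is then distinct from every other row.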
-
cudf::size_type unique_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the number of consecutive groups of equivalent rows in a column.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored and NaN is considered in the count.
nulls are handled as equal.
- Parameters:
input – [in] The column_view whose consecutive groups of equivalent rows will be counted
null_handling – [in] flag to include or ignore null while counting
nan_handling – [in] flag to consider NaN == null or not
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of consecutive groups of equivalent rows in the column
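The difference from distinct_count is that only adjacent elements are compared. A minimal host-side sketch, assuming a column without nulls or NaNs (`unique_count_ref` is a hypothetical name, not a libcudf function):

```cpp
#include <cstddef>
#include <vector>

// Counts runs of equal adjacent values.
// Simplified model: the column contains no nulls or NaNs.
std::size_t unique_count_ref(std::vector<int> const& column)
{
  if (column.empty()) { return 0; }
  std::size_t groups = 1;  // the first element always opens a group
  for (std::size_t i = 1; i < column.size(); ++i) {
    // A new group starts whenever the value differs from its predecessor.
    if (column[i] != column[i - 1]) { ++groups; }
  }
  return groups;
}
```

For the values 1, 1, 2, 2, 1 this returns 3, although only 2 distinct values are present. On sorted input the two counts coincide.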
-
cudf::size_type unique_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the number of consecutive groups of equivalent rows in a table.
- Parameters:
input – [in] Table whose consecutive groups of equivalent rows will be counted
nulls_equal – [in] flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL.
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of consecutive groups of equivalent rows in the table
-
class approx_distinct_count
- #include <approx_distinct_count.hpp>
Object-oriented HyperLogLog sketch for approximate distinct counting.
This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.
The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).
Common precision values:
p = 10: m = 1,024 registers, ~3.2% standard error, 4KB memory
p = 12 (default): m = 4,096 registers, ~1.6% standard error, 16KB memory
p = 14: m = 16,384 registers, ~0.8% standard error, 64KB memory
p = 16: m = 65,536 registers, ~0.4% standard error, 256KB memory
- HyperLogLog Precision Parameter
The precision parameter (p) is the number of bits used to index into the register array. It determines the number of registers (m = 2^p) in the HLL sketch:
Memory usage: 2^p * 4 bytes (m registers of 4 bytes each for GPU atomics)
Standard error: 1.04 / sqrt(m) = 1.04 / sqrt(2^p)
Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:
Below 4: Too few registers for HLL’s statistical assumptions, resulting in high variance and unstable estimates.
Above 18: Rapidly diminishing accuracy gains while incurring significant memory growth, making the structure no longer space-efficient for approximate counting.
This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark’s configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.
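The register count, memory footprint, and standard error quoted above follow directly from p. The helper functions below are a small sketch of that arithmetic; their names are hypothetical, and the range check is an illustrative assumption rather than documented library behavior.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>

// Number of registers m = 2^p.
// The [4, 18] check mirrors the documented valid range (assumption: we throw).
std::size_t hll_registers(int p)
{
  if (p < 4 || p > 18) { throw std::invalid_argument("precision must be in [4, 18]"); }
  return std::size_t{1} << p;
}

// Memory usage: m registers of 4 bytes each.
std::size_t hll_memory_bytes(int p) { return hll_registers(p) * 4; }

// Standard error: 1.04 / sqrt(m).
double hll_standard_error(int p)
{
  return 1.04 / std::sqrt(static_cast<double>(hll_registers(p)));
}
```

For the default p = 12 this gives 4,096 registers, 16,384 bytes, and a standard error of 0.01625, matching the table of common precision values.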
Example usage:
auto adc = cudf::approx_distinct_count(table1);
auto count1 = adc.estimate();
adc.add(table2);
auto count2 = adc.estimate();
Public Types
Public Functions
-
approx_distinct_count(table_view const &input, std::int32_t precision = 12, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Constructs an approximate distinct count sketch from a table.
- Parameters:
input – Table whose rows will be added to the sketch
precision – The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12.
null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
stream – CUDA stream used for device memory operations and kernel launches
-
approx_distinct_count(cuda::std::span<cuda::std::byte> sketch_span, std::int32_t precision, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Constructs an approximate distinct count sketch from serialized sketch bytes.
This constructor enables distributed distinct counting by allowing sketches to be constructed from serialized data. The sketch data is copied into the newly created object, which then owns its own independent storage.
Warning
The precision parameter must match the precision used to create the original sketch. The size of the sketch span must be exactly 2^precision bytes. The null and NaN handling policies must match those used when creating the original sketch. Providing incompatible parameters will produce incorrect results or errors.
- Parameters:
sketch_span – The serialized sketch bytes to reconstruct from
precision – The precision parameter that was used to create the sketch (4-18)
null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
stream – CUDA stream used for device memory operations and kernel launches
-
approx_distinct_count(approx_distinct_count&&) = default
Default move constructor.
-
approx_distinct_count &operator=(approx_distinct_count&&) = default
Move assignment operator.
- Returns:
A reference to this object
-
void add(table_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream())
Adds rows from a table to the sketch.
- Parameters:
input – Table whose rows will be added
stream – CUDA stream used for device memory operations and kernel launches
-
void merge(approx_distinct_count const &other, rmm::cuda_stream_view stream = cudf::get_default_stream())
Merges another sketch into this sketch.
After merging, this sketch will contain the combined distinct count estimate of both sketches.
- Throws:
std::invalid_argument – if the sketches have different precision values
std::invalid_argument – if the sketches have different null handling policies
std::invalid_argument – if the sketches have different NaN handling policies
- Parameters:
other – The sketch to merge into this sketch
stream – CUDA stream used for device memory operations and kernel launches
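In the standard HyperLogLog formulation, merging two sketches of equal precision is a register-wise maximum. The host-side sketch below shows that technique together with a precision check like the one documented above. It assumes the textbook merge rule and is not taken from the libcudf implementation; `HllSketchRef`, `make_sketch`, and `merge_ref` are hypothetical names.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side sketch: 2^precision registers of 4 bytes each.
struct HllSketchRef {
  int precision;
  std::vector<std::uint32_t> registers;
};

// Creates an empty sketch with all registers set to zero.
HllSketchRef make_sketch(int precision)
{
  return {precision, std::vector<std::uint32_t>(std::size_t{1} << precision, 0u)};
}

// Merges `other` into `target` by taking the register-wise maximum.
void merge_ref(HllSketchRef& target, HllSketchRef const& other)
{
  if (target.precision != other.precision) {
    throw std::invalid_argument("sketches have different precision values");
  }
  for (std::size_t i = 0; i < target.registers.size(); ++i) {
    target.registers[i] = std::max(target.registers[i], other.registers[i]);
  }
}
```

Because the maximum is commutative and associative, partial sketches built on different workers can be merged in any order and yield the same registers.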
-
void merge(cuda::std::span<cuda::std::byte> sketch_span, rmm::cuda_stream_view stream = cudf::get_default_stream())
Merges a sketch from raw bytes into this sketch.
This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.
Warning
It is the caller’s responsibility to ensure that the provided sketch span was created with the same approx_distinct_count configuration (precision, null/NaN handling, etc.) as this sketch. Merging incompatible sketches will produce incorrect results.
- Parameters:
sketch_span – The sketch bytes to merge into this sketch
stream – CUDA stream used for device memory operations and kernel launches
-
std::size_t estimate(rmm::cuda_stream_view stream = cudf::get_default_stream()) const
Estimates the approximate number of distinct rows in the sketch.
- Parameters:
stream – CUDA stream used for device memory operations and kernel launches
- Returns:
Approximate number of distinct rows
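For intuition about how a cardinality estimate is derived from register values, the sketch below implements the classic HyperLogLog estimator with the small-range linear-counting correction. This is the textbook formula, swapped in for illustration; the estimator that libcudf actually uses may differ (for example, HLL++-style bias correction). `hll_alpha` and `hll_estimate_ref` are hypothetical names, and the input is assumed to hold at least 16 registers.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Bias-correction constant alpha_m from the original HyperLogLog paper.
double hll_alpha(std::size_t m)
{
  if (m == 16) { return 0.673; }
  if (m == 32) { return 0.697; }
  if (m == 64) { return 0.709; }
  return 0.7213 / (1.0 + 1.079 / static_cast<double>(m));
}

// Classic estimator: alpha_m * m^2 / sum(2^-register).
// Assumes `registers` is non-empty with a power-of-two size of at least 16.
double hll_estimate_ref(std::vector<std::uint8_t> const& registers)
{
  auto const m = static_cast<double>(registers.size());
  double inverse_sum = 0.0;
  std::size_t zero_registers = 0;
  for (auto const rank : registers) {
    inverse_sum += std::ldexp(1.0, -static_cast<int>(rank));  // 2^-rank
    if (rank == 0) { ++zero_registers; }
  }
  double estimate = hll_alpha(registers.size()) * m * m / inverse_sum;
  // Small-range correction: fall back to linear counting while empty registers remain.
  if (estimate <= 2.5 * m && zero_registers > 0) {
    estimate = m * std::log(m / static_cast<double>(zero_registers));
  }
  return estimate;
}
```

An empty sketch (all registers zero) yields an estimate of 0 under this formula, since linear counting gives m * ln(m / m).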
-
cuda::std::span<cuda::std::byte> sketch() noexcept
Gets the raw sketch bytes for serialization or external merging.
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
- Returns:
A span view of the sketch bytes
-
cuda::std::span<cuda::std::byte const> sketch() const noexcept
Gets the raw sketch bytes for serialization or external merging (const overload).
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
- Returns:
A span view of the sketch bytes
-
null_policy null_handling() const noexcept
Gets the null handling policy for this sketch.
- Returns:
The null policy set at construction
-
nan_policy nan_handling() const noexcept
Gets the NaN handling policy for this sketch.
- Returns:
The NaN policy set at construction
-
std::int32_t precision() const noexcept
Gets the precision parameter for this sketch.
- Returns:
The precision value set at construction