Column Reduction
- group Reduction
Functions
-
cudf::size_type distinct_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the distinct elements in the column_view.
If nulls_equal == null_equality::UNEQUAL, all nulls are distinct.
Given an input column_view, the number of distinct elements in this column_view is returned.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored, and NaN is considered in the distinct count.
nulls are handled as equal.
- Parameters:
input – [in] The column_view whose distinct elements will be counted
null_handling – [in] flag to include or ignore null while counting
nan_handling – [in] flag to consider NaN==null or not
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of distinct elements in the column
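The policy combinations described above can be modeled on the host with the standard library alone. This is a minimal sketch of the documented semantics, not libcudf's implementation: the enums and distinct_count_model are hypothetical stand-ins, std::nullopt plays the role of a null element, and the sketch assumes that all NaN values compare equal to each other and that a NaN treated as null is counted together with the nulls.

```cpp
#include <cmath>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical host-side stand-ins for the libcudf policy enums.
enum class null_policy { EXCLUDE, INCLUDE };
enum class nan_policy { NAN_IS_NULL, NAN_IS_VALID };

// Reference model of the column overload; std::nullopt represents a null.
std::size_t distinct_count_model(std::vector<std::optional<double>> const& col,
                                 null_policy null_handling,
                                 nan_policy nan_handling)
{
  std::set<double> values;  // distinct values that are neither null nor NaN
  bool saw_null = false;
  bool saw_nan  = false;
  for (auto const& e : col) {
    if (!e.has_value()) {
      saw_null = true;
    } else if (std::isnan(*e)) {
      // NAN_IS_NULL folds NaN into the null bucket (assumption of this model).
      if (nan_handling == nan_policy::NAN_IS_NULL) {
        saw_null = true;
      } else {
        saw_nan = true;
      }
    } else {
      values.insert(*e);
    }
  }
  std::size_t count = values.size();
  if (saw_nan) { ++count; }  // assumption: all NaNs form a single distinct value
  // nulls are handled as equal, so they contribute at most one element.
  if (saw_null && null_handling == null_policy::INCLUDE) { ++count; }
  return count;
}
```

Under these assumptions, a column holding 1.0, 1.0, null, NaN, 2.0, NaN yields 2 with EXCLUDE and NAN_IS_NULL, and 3 with EXCLUDE and NAN_IS_VALID.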
-
cudf::size_type distinct_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the distinct rows in a table.
- Parameters:
input – [in] Table whose distinct rows will be counted
nulls_equal – [in] flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL.
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of distinct rows in the table
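The effect of nulls_equal on row counting can be illustrated with a host-side model. This is a sketch under stated assumptions rather than the library's algorithm: the row alias, the enum, and distinct_rows_model are hypothetical, std::nullopt stands for a null element, and the model assumes that under UNEQUAL a row containing any null never matches another row, so each such row is counted separately.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <set>
#include <vector>

// Hypothetical stand-in for cudf::null_equality.
enum class null_equality { EQUAL, UNEQUAL };

// One table row; std::nullopt represents a null element.
using row = std::vector<std::optional<int>>;

std::size_t distinct_rows_model(std::vector<row> const& rows, null_equality nulls_equal)
{
  std::set<row> seen;              // rows compared element-wise, null == null
  std::size_t rows_with_null = 0;  // rows that can never match under UNEQUAL
  for (auto const& r : rows) {
    bool const has_null =
      std::any_of(r.begin(), r.end(), [](auto const& e) { return !e.has_value(); });
    if (has_null && nulls_equal == null_equality::UNEQUAL) {
      ++rows_with_null;  // assumption: a null never equals another null
    } else {
      seen.insert(r);
    }
  }
  return seen.size() + rows_with_null;
}
```

For the rows (1, null), (1, null), (2, 3), (2, 3), this model returns 2 with EQUAL and 3 with UNEQUAL.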
-
cudf::size_type unique_count(column_view const &input, null_policy null_handling, nan_policy nan_handling, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the number of consecutive groups of equivalent rows in a column.
If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_NULL, both NaN and null values are ignored. If null_handling is null_policy::EXCLUDE and nan_handling is nan_policy::NAN_IS_VALID, only null is ignored, NaN is considered in count.
nulls are handled as equal.
- Parameters:
input – [in] The column_view whose consecutive groups of equivalent rows will be counted
null_handling – [in] flag to include or ignore null while counting
nan_handling – [in] flag to consider NaN==null or not
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of consecutive groups of equivalent rows in the column
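Counting consecutive groups differs from counting distinct values whenever equal elements are not adjacent. The host-side sketch below shows the difference on plain integers; count_runs and count_distinct are hypothetical helper names, and nulls and NaNs are left out to keep the comparison simple.

```cpp
#include <cstddef>
#include <set>
#include <vector>

// Number of consecutive groups of equal elements (the unique_count notion).
std::size_t count_runs(std::vector<int> const& v)
{
  std::size_t runs = 0;
  for (std::size_t i = 0; i < v.size(); ++i) {
    // A new group starts at the first element and at every change of value.
    if (i == 0 || v[i] != v[i - 1]) { ++runs; }
  }
  return runs;
}

// Number of distinct values regardless of position (the distinct_count notion).
std::size_t count_distinct(std::vector<int> const& v)
{
  return std::set<int>(v.begin(), v.end()).size();
}
```

For the input 1, 1, 2, 2, 1 the first function returns 3 and the second returns 2. On sorted input the two results coincide.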
-
cudf::size_type unique_count(table_view const &input, null_equality nulls_equal = null_equality::EQUAL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Count the number of consecutive groups of equivalent rows in a table.
- Parameters:
input – [in] Table whose consecutive groups of equivalent rows will be counted
nulls_equal – [in] flag to denote if null elements should be considered equal. nulls are not equal if null_equality::UNEQUAL.
stream – [in] CUDA stream used for device memory operations and kernel launches
- Returns:
number of consecutive groups of equivalent rows in the table
-
class approx_distinct_count
- #include <approx_distinct_count.hpp>
Object-oriented HyperLogLog sketch for approximate distinct counting.
This class provides an object-oriented interface to HyperLogLog sketches, allowing incremental addition of data and cardinality estimation.
The implementation uses XXHash64 to hash table rows into 64-bit values, which are then added to the HyperLogLog sketch without additional hashing (identity function).
Common precision values:
p = 10: m = 1,024 registers, ~3.2% standard error, 4KB memory
p = 12 (default): m = 4,096 registers, ~1.6% standard error, 16KB memory
p = 14: m = 16,384 registers, ~0.8% standard error, 64KB memory
p = 16: m = 65,536 registers, ~0.4% standard error, 256KB memory
- HyperLogLog Precision Parameter
The precision parameter (p) is the number of bits used to index into the register array. It determines the number of registers (m = 2^p) in the HLL sketch:
Memory usage: 2^p * 4 bytes (m registers of 4 bytes each for GPU atomics)
Standard error: 1.04 / sqrt(m) = 1.04 / sqrt(2^p)
Valid range: p ∈ [4, 18]. This is not a hard theoretical limit but an empirically recommended range:
Below 4: Too few registers for HLL’s statistical assumptions, resulting in high variance and unstable estimates.
Above 18: Rapidly diminishing accuracy gains while incurring significant memory growth, making the structure no longer space-efficient for approximate counting.
This range represents a practical engineering compromise from HLL++ and is widely adopted by systems such as Apache Spark. The default of 12 aligns with Spark’s configuration and is the largest precision that fits efficiently in GPU shared memory, enabling optimal performance for our implementation.
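The register count, memory usage, and standard error formulas above can be evaluated directly for a given precision. The following is a small host-side sketch of that arithmetic only; hll_footprint and footprint_for_precision are hypothetical names and are not part of the libcudf API.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Derived quantities for one precision value p.
struct hll_footprint {
  std::size_t registers;  // m = 2^p
  std::size_t bytes;      // m registers of 4 bytes each
  double standard_error;  // 1.04 / sqrt(m)
};

hll_footprint footprint_for_precision(std::int32_t p)
{
  assert(p >= 4 && p <= 18);  // documented valid range
  std::size_t const m = std::size_t{1} << p;
  return {m, m * 4, 1.04 / std::sqrt(static_cast<double>(m))};
}
```

For the default p = 12 this gives 4,096 registers, 16,384 bytes, and a standard error of 0.01625, matching the values listed under the common precision values.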
Example usage:
auto adc = cudf::approx_distinct_count(table1);
auto count1 = adc.estimate();
adc.add(table2);
auto count2 = adc.estimate();
Public Types
Public Functions
-
approx_distinct_count(table_view const &input, std::int32_t precision = 12, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Constructs an approximate distinct count sketch from a table with specified precision.
- Parameters:
input – Table whose rows will be added to the sketch
precision – The precision parameter for HyperLogLog (4-18). Higher precision gives better accuracy but uses more memory. Default is 12.
null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
stream – CUDA stream used for device memory operations and kernel launches
-
approx_distinct_count(table_view const &input, desired_standard_error error, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL, rmm::cuda_stream_view stream = cudf::get_default_stream())
Constructs an approximate distinct count sketch from a table with specified standard error.
This constructor allows specifying the desired standard error (error tolerance) directly, which is more intuitive than specifying the precision parameter. The precision is calculated as:
ceil(2 * log2(1.04 / standard_error)).
Since precision must be an integer, the actual standard error may be better (smaller) than requested. Use the standard_error() getter to retrieve the actual value.
- Parameters:
input – Table whose rows will be added to the sketch
error – The desired standard error (e.g., approx_distinct_count::desired_standard_error{0.01} for ~1%)
null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
stream – CUDA stream used for device memory operations and kernel launches
- Throws:
std::invalid_argument – if standard_error value is not positive
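The conversion from a requested standard error to a precision, and back to the actual standard error, can be worked through on the host. This is a sketch of the two documented formulas only: precision_for_error and error_for_precision are hypothetical names, and the sketch does not model any range validation or clamping to [4, 18] that the library may apply.

```cpp
#include <cmath>
#include <cstdint>
#include <stdexcept>

// precision = ceil(2 * log2(1.04 / standard_error))
std::int32_t precision_for_error(double standard_error)
{
  if (!(standard_error > 0.0)) {
    throw std::invalid_argument("standard error must be positive");
  }
  return static_cast<std::int32_t>(std::ceil(2.0 * std::log2(1.04 / standard_error)));
}

// actual standard error = 1.04 / sqrt(2^precision)
double error_for_precision(std::int32_t precision)
{
  return 1.04 / std::sqrt(std::ldexp(1.0, precision));
}
```

Requesting 0.01 gives 2 * log2(104) ≈ 13.4, which rounds up to a precision of 14. The actual standard error is then 1.04 / 128 = 0.008125, slightly better than requested.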
-
approx_distinct_count(cuda::std::span<cuda::std::byte> sketch_span, std::int32_t precision, null_policy null_handling = null_policy::EXCLUDE, nan_policy nan_handling = nan_policy::NAN_IS_NULL)
Constructs a non-owning sketch that operates on user-allocated storage.
This constructor creates a sketch that operates directly on the provided storage without copying. This enables zero-copy operations on pre-existing buffers, such as sketch data stored in a column or received from another process.
Warning
The caller must ensure the storage remains valid for the lifetime of this object. The sketch will read from and write to the provided storage directly.
- Parameters:
sketch_span – The sketch bytes to operate on (must remain valid)
precision – The precision parameter for the sketch (4-18)
null_handling – INCLUDE or EXCLUDE rows with nulls (default: EXCLUDE)
nan_handling – NAN_IS_VALID or NAN_IS_NULL (default: NAN_IS_NULL)
-
approx_distinct_count(approx_distinct_count&&) = default
Default move constructor.
-
approx_distinct_count &operator=(approx_distinct_count&&) = default
Move assignment operator.
- Returns:
A reference to this object
-
void add(table_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream())
Adds rows from a table to the sketch.
- Parameters:
input – Table whose rows will be added
stream – CUDA stream used for device memory operations and kernel launches
-
void merge(approx_distinct_count const &other, rmm::cuda_stream_view stream = cudf::get_default_stream())
Merges another sketch into this sketch.
After merging, this sketch will contain the combined distinct count estimate of both sketches.
- Throws:
std::invalid_argument – if the sketches have different precision values
std::invalid_argument – if the sketches have different null handling policies
std::invalid_argument – if the sketches have different NaN handling policies
- Parameters:
other – The sketch to merge into this sketch
stream – CUDA stream used for device memory operations and kernel launches
-
void merge(cuda::std::span<cuda::std::byte const> sketch_span, rmm::cuda_stream_view stream = cudf::get_default_stream())
Merges a sketch from raw bytes into this sketch.
This allows merging sketches that have been serialized or created elsewhere, enabling distributed distinct counting scenarios.
Warning
It is the caller’s responsibility to ensure that the provided sketch span was created with the same approx_distinct_count configuration (precision, null/NaN handling, etc.) as this sketch. Merging incompatible sketches will produce incorrect results.
- Parameters:
sketch_span – The sketch bytes to merge into this sketch
stream – CUDA stream used for device memory operations and kernel launches
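Why merging requires matching configurations is easiest to see in a toy HyperLogLog. The host-side sketch below is for illustration only and is not libcudf's GPU implementation: toy_hll is a hypothetical class, it uses a splitmix64-style mixer instead of XXHash64, one byte per register instead of four, and the standard raw estimator with a small-range correction. In this standard formulation a merge is an element-wise maximum over registers, which is only meaningful when both sketches have the same number of registers and hashed their inputs the same way.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy host-side HyperLogLog over 64-bit keys (illustrative, not libcudf's code).
class toy_hll {
 public:
  explicit toy_hll(int precision)
    : p_(precision), regs_(std::size_t{1} << precision, std::uint8_t{0})
  {
    assert(precision >= 4 && precision <= 18);
  }

  void add(std::uint64_t value)
  {
    std::uint64_t const h = mix(value);
    std::size_t const idx = static_cast<std::size_t>(h >> (64 - p_));  // top p bits
    std::uint64_t rest    = h << p_;  // remaining 64 - p bits, left-aligned
    int rank              = 1;        // 1-based position of the first set bit
    while (rank <= 64 - p_ && (rest >> 63) == 0) {
      ++rank;
      rest <<= 1;
    }
    regs_[idx] = std::max(regs_[idx], static_cast<std::uint8_t>(rank));
  }

  // Element-wise maximum; both sketches must share the same precision.
  void merge(toy_hll const& other)
  {
    assert(p_ == other.p_);
    std::transform(regs_.begin(), regs_.end(), other.regs_.begin(), regs_.begin(),
                   [](std::uint8_t a, std::uint8_t b) { return std::max(a, b); });
  }

  double estimate() const
  {
    double const m    = static_cast<double>(regs_.size());
    double sum        = 0.0;
    std::size_t zeros = 0;
    for (auto r : regs_) {
      sum += std::ldexp(1.0, -static_cast<int>(r));  // 2^(-register)
      if (r == 0) { ++zeros; }
    }
    double const raw = alpha(regs_.size()) * m * m / sum;
    // Small-range correction (linear counting) while empty registers remain.
    if (raw <= 2.5 * m && zeros > 0) { return m * std::log(m / static_cast<double>(zeros)); }
    return raw;
  }

  std::vector<std::uint8_t> const& registers() const { return regs_; }

 private:
  static double alpha(std::size_t m)
  {
    if (m == 16) { return 0.673; }
    if (m == 32) { return 0.697; }
    if (m == 64) { return 0.709; }
    return 0.7213 / (1.0 + 1.079 / static_cast<double>(m));
  }

  // splitmix64 finalizer, used here as a stand-in hash function.
  static std::uint64_t mix(std::uint64_t x)
  {
    x += 0x9e3779b97f4a7c15ULL;
    x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
    x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
    return x ^ (x >> 31);
  }

  int p_;
  std::vector<std::uint8_t> regs_;
};
```

Because each register keeps a running maximum, merging two toy sketches yields exactly the registers that a single sketch would hold after seeing the union of both inputs. This property is what makes distributed counting possible.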
-
std::size_t estimate(rmm::cuda_stream_view stream = cudf::get_default_stream()) const
Estimates the approximate number of distinct rows in the sketch.
- Parameters:
stream – CUDA stream used for device memory operations and kernel launches
- Returns:
Approximate number of distinct rows
-
cuda::std::span<cuda::std::byte> sketch() noexcept
Gets the raw sketch bytes for serialization or external merging.
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
- Returns:
A span view of the sketch bytes
-
cuda::std::span<cuda::std::byte const> sketch() const noexcept
Gets the raw sketch bytes for serialization or external merging (const overload).
The returned span provides access to the internal sketch storage. This can be used to serialize the sketch, transfer it between processes, or merge it with other sketches using the span-based merge API.
- Returns:
A span view of the sketch bytes
-
null_policy null_handling() const noexcept
Gets the null handling policy for this sketch.
- Returns:
The null policy set at construction
-
nan_policy nan_handling() const noexcept
Gets the NaN handling policy for this sketch.
- Returns:
The NaN policy set at construction
-
std::int32_t precision() const noexcept
Gets the precision parameter for this sketch.
- Returns:
The precision value set at construction
-
double standard_error() const noexcept
Gets the standard error (error tolerance) for this sketch.
The standard error is calculated from precision as: 1.04 / sqrt(2^precision). This represents the expected relative error of the cardinality estimate.
- Returns:
The actual standard error based on the sketch’s precision
Public Static Functions
-
static std::size_t sketch_bytes(std::int32_t precision)
Gets the number of bytes required for sketch storage at a given precision.
- Parameters:
precision – The HLL precision parameter (4-18)
- Returns:
The number of bytes required for the sketch
-
static std::size_t sketch_alignment()
Gets the alignment required for sketch storage.
- Returns:
The required alignment in bytes
-
struct desired_standard_error
- #include <approx_distinct_count.hpp>
Strong type wrapper for the desired standard error constructor parameter.
Use this type to construct an approx_distinct_count with a desired error tolerance instead of specifying precision directly.
Example:
auto sketch = cudf::approx_distinct_count(
  table, cudf::approx_distinct_count::desired_standard_error{0.01});
Public Functions
-
inline explicit constexpr desired_standard_error(double v)
Constructs a desired_standard_error with the given value.
- Parameters:
v – The requested standard error value (must be positive, e.g., 0.01 for ~1% error)
Public Members
-
double value
The requested standard error value (must be positive)