Column Quantiles
- group column_quantiles
Functions
-
std::unique_ptr<column> quantile(column_view const &input, std::vector<double> const &q, interpolation interp = interpolation::LINEAR, column_view const &ordered_indices = {}, bool exact = true, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Computes quantiles with interpolation.
Computes each specified quantile by interpolating between the values adjacent to it, using the interpolation strategy specified in interp.
- Parameters:
input – [in] Column from which to compute quantile values
q – [in] Specified quantiles in range [0, 1]
interp – [in] Strategy used to select between values adjacent to a specified quantile.
ordered_indices – [in] Column containing the sorted order of input. If the column is empty, all input values are used in their existing order. Indices must be in range [0, input.size()), but are not required to be unique. Values not indexed by this column will be ignored.
exact – [in] If true, returns doubles. If false, returns the same type as input.
stream – [in] CUDA stream used for device memory operations and kernel launches
mr – [in] Device memory resource used to allocate the returned column’s device memory
- Returns:
Column of specified quantiles, with nulls for indeterminable values
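The interpolation strategies described above can be illustrated with a small host-side sketch. This is not libcudf code and does not call the library: `interp_kind` and `host_quantile` are hypothetical names, the mapping of a quantile `q` to the fractional position `q * (n - 1)` is an assumption, and ties under NEAREST are resolved toward the higher neighbor purely for illustration. The `ordered_indices` argument mirrors the parameter of the same name: when it is empty, the input is read in its existing order.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for the interpolation strategies.
enum class interp_kind { LINEAR, LOWER, HIGHER, MIDPOINT, NEAREST };

// Sketch only. Assumes quantile q maps to position q * (n - 1) in sorted order.
double host_quantile(std::vector<double> const& input,
                     double q,
                     interp_kind interp,
                     std::vector<std::size_t> const& ordered_indices = {})
{
  // With no ordered_indices, read the input in its existing order.
  std::size_t const n = ordered_indices.empty() ? input.size() : ordered_indices.size();
  assert(n > 0);
  auto value_at = [&](std::size_t i) {
    return ordered_indices.empty() ? input[i] : input[ordered_indices[i]];
  };

  // Clamp q into [0, 1], then locate the two neighboring values.
  double const clamped = std::min(1.0, std::max(0.0, q));
  double const pos     = clamped * static_cast<double>(n - 1);
  auto const lo_idx    = static_cast<std::size_t>(std::floor(pos));
  auto const hi_idx    = static_cast<std::size_t>(std::ceil(pos));
  double const lo      = value_at(lo_idx);
  double const hi      = value_at(hi_idx);
  double const frac    = pos - static_cast<double>(lo_idx);

  switch (interp) {
    case interp_kind::LINEAR: return lo + frac * (hi - lo);
    case interp_kind::LOWER: return lo;
    case interp_kind::HIGHER: return hi;
    case interp_kind::MIDPOINT: return (lo + hi) / 2.0;
    case interp_kind::NEAREST: return frac < 0.5 ? lo : hi;  // ties go to hi in this sketch
  }
  return lo;  // not reached; keeps compilers quiet
}
```

For the unsorted input {3, 1, 4, 2} with ordered_indices {1, 3, 0, 2}, the values are visited as 1, 2, 3, 4, so under the assumptions above q = 0.5 lands halfway between 2 and 3: LINEAR and MIDPOINT give 2.5, LOWER gives 2, and HIGHER gives 3.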
-
std::unique_ptr<table> quantiles(table_view const &input, std::vector<double> const &q, interpolation interp = interpolation::NEAREST, cudf::sorted is_input_sorted = sorted::NO, std::vector<order> const &column_order = {}, std::vector<null_order> const &null_precedence = {}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns the rows of the input corresponding to the requested quantiles.
Quantiles are cut points that divide the range of a dataset into continuous intervals. For example, quartiles are the three cut points that divide a dataset into four equal-sized groups. See https://en.wikipedia.org/wiki/Quantile
The indices used to gather rows are computed by interpolating between the indices on either side of the desired quantile. Since some columns may be non-arithmetic, interpolation between rows is limited to non-arithmetic strategies.
Non-arithmetic interpolation strategies include HIGHER, LOWER, and NEAREST.
Quantiles <= 0 correspond to row 0 (first). Quantiles >= 1 correspond to row input.size() - 1 (last).
- Parameters:
input – Table used to compute quantile rows
q – Desired quantiles in range [0, 1]
interp – Strategy used to select between the two rows on either side of the desired quantile.
is_input_sorted – Indicates if the input has been pre-sorted
column_order – The desired sort order for each column
null_precedence – The desired order of null compared to other elements
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned table’s device memory
- Throws:
cudf::logic_error – if interp is an arithmetic interpolation strategy
cudf::logic_error – if input is empty
- Returns:
Table of specified quantiles, with nulls for indeterminable values
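The row selection described above can be sketched on the host as a mapping from a quantile to a single row index. This is an illustration only and not libcudf code: `row_interp` and `quantile_row_index` are hypothetical names, the position formula `q * (num_rows - 1)` is an assumption, and NEAREST rounds halves upward here for simplicity. The enum deliberately contains only the non-arithmetic strategies, so an arithmetic strategy cannot be passed at all.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in holding only the non-arithmetic strategies.
enum class row_interp { LOWER, HIGHER, NEAREST };

// Sketch only. Returns the index (in sorted order) of the row chosen for q.
std::size_t quantile_row_index(std::size_t num_rows, double q, row_interp interp)
{
  if (num_rows == 0) { throw std::invalid_argument("input is empty"); }

  // Quantiles at or beyond the ends map to the first or last row.
  if (q <= 0.0) { return 0; }
  if (q >= 1.0) { return num_rows - 1; }

  // Assumed fractional position of the quantile among the sorted rows.
  double const pos = q * static_cast<double>(num_rows - 1);

  switch (interp) {
    case row_interp::LOWER: return static_cast<std::size_t>(std::floor(pos));
    case row_interp::HIGHER: return static_cast<std::size_t>(std::ceil(pos));
    case row_interp::NEAREST: return static_cast<std::size_t>(std::floor(pos + 0.5));
  }
  return 0;  // not reached; keeps compilers quiet
}
```

With five rows and q = 0.3, the assumed position is 1.2, so LOWER and NEAREST pick row 1 and HIGHER picks row 2. Because a whole row is selected rather than a blended value, the result is well defined even for columns such as strings.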
-
std::unique_ptr<column> percentile_approx(tdigest::tdigest_column_view const &input, column_view const &percentiles, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Calculate approximate percentiles on an input tdigest column.
tdigest (https://arxiv.org/pdf/1902.04023.pdf) columns are produced specifically by the TDIGEST and MERGE_TDIGEST aggregations. These columns are compressed representations of a very large input data set that can be queried for quantile information.
Produces a LIST column where each row i represents output from querying the corresponding tdigest from input row i. The length of each output list is the number of percentiles specified in percentiles.
- Parameters:
input – tdigest input data. One tdigest per row
percentiles – Desired percentiles in range [0, 1]
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Throws:
cudf::logic_error – if input is not a valid tdigest column.
cudf::logic_error – if percentiles is not a FLOAT64 column.
- Returns:
LIST Column containing requested percentile values as FLOAT64
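The shape of the result (one list per input tdigest, one entry per requested percentile) can be illustrated with a simplified host-side sketch. Everything here is an assumption made for illustration: `centroid`, `digest`, `query_digest`, and `host_percentile_approx` are hypothetical names, a digest is modeled as a list of (mean, weight) centroids sorted by mean with positive weights, and the query simply interpolates linearly between neighboring centroid centers. It is not the algorithm libcudf runs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one tdigest: centroids sorted by mean, weights > 0.
struct centroid {
  double mean;
  double weight;
};
using digest = std::vector<centroid>;

// Sketch only. Estimates the value at percentile p (in [0, 1]) for one digest.
double query_digest(digest const& d, double p)
{
  assert(!d.empty());
  double total = 0.0;
  for (auto const& c : d) { total += c.weight; }
  double const target = p * total;  // rank being looked up

  double cumulative  = 0.0;  // weight of all centroids before the current one
  double prev_center = 0.0;  // rank at the center of the previous centroid
  for (std::size_t i = 0; i < d.size(); ++i) {
    double const center = cumulative + d[i].weight / 2.0;
    if (target <= center) {
      if (i == 0) { return d[0].mean; }
      // Interpolate between the previous and current centroid means.
      double const frac = (target - prev_center) / (center - prev_center);
      return d[i - 1].mean + frac * (d[i].mean - d[i - 1].mean);
    }
    cumulative += d[i].weight;
    prev_center = center;
  }
  return d.back().mean;  // target lies past the last centroid center
}

// One output list per input digest; each list has percentiles.size() entries.
std::vector<std::vector<double>> host_percentile_approx(std::vector<digest> const& input,
                                                        std::vector<double> const& percentiles)
{
  std::vector<std::vector<double>> out;
  out.reserve(input.size());
  for (auto const& d : input) {
    std::vector<double> row;
    row.reserve(percentiles.size());
    for (double p : percentiles) { row.push_back(query_digest(d, p)); }
    out.push_back(row);
  }
  return out;
}
```

Querying two digests with three percentiles yields two lists of three values each, matching the row-per-tdigest layout described above.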
-