Dictionary Encode
- group Encoding
Functions
-
std::unique_ptr<column> encode(column_view const &column, data_type indices_type = data_type{type_id::INT32}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Construct a dictionary column by dictionary encoding an existing column.
The output column is a DICTIONARY type with a keys column of non-null, unique values that are in a strict, total order. Meaning, keys[i] is ordered before keys[i+1] for all i in [0, n-1), where n is the number of keys.
The output column has a child indices column that is of integer type and with the same size as the input column. The indices column will be of type indices_type. The result is undefined if indices_type is not large enough for the indices values.
The null mask and null count are copied from the input column to the output column.
c = [429, 111, 213, 111, 213, 429, 213]
d = encode(c)
d now has keys [111, 213, 429] and indices [2, 0, 1, 0, 1, 2, 1]
- Throws:
std::invalid_argument – if indices type is not a signed integer type
std::invalid_argument – if the column to encode is already a DICTIONARY type
- Parameters:
column – The column to dictionary encode
indices_type – The integer type to use for the indices
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Returns a dictionary column
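The encode semantics described above can be modeled on the host with standard containers. The following is a minimal sketch, not libcudf's implementation: host_dictionary and host_encode are hypothetical names, nulls are ignored, and the work runs on the CPU rather than on a CUDA stream. It builds sorted, unique keys and one index per input row. As an assumption of this sketch, it throws when the key count does not fit the chosen index type, which is the case the documentation above leaves undefined.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical host-side stand-in for a dictionary column (not a libcudf type).
template <typename T, typename Index = std::int32_t>
struct host_dictionary {
  std::vector<T> keys;         // unique values in strictly ascending order
  std::vector<Index> indices;  // one entry per input row, pointing into keys
};

// Hypothetical helper modeling encode() for a null-free column.
template <typename T, typename Index = std::int32_t>
host_dictionary<T, Index> host_encode(std::vector<T> const& input)
{
  host_dictionary<T, Index> out;

  // Keys: sort a copy of the input, then drop adjacent duplicates.
  out.keys = input;
  std::sort(out.keys.begin(), out.keys.end());
  out.keys.erase(std::unique(out.keys.begin(), out.keys.end()), out.keys.end());

  // Assumption of this sketch: fail loudly instead of producing an undefined
  // result when the largest index (keys.size() - 1) does not fit in Index.
  if (!out.keys.empty() &&
      out.keys.size() - 1 > static_cast<std::size_t>(std::numeric_limits<Index>::max())) {
    throw std::overflow_error("index type too small for the number of keys");
  }

  // Indices: the position of each input value within the sorted keys.
  out.indices.reserve(input.size());
  for (auto const& value : input) {
    auto const pos = std::lower_bound(out.keys.begin(), out.keys.end(), value);
    assert(pos != out.keys.end() && *pos == value);  // every value is a key
    out.indices.push_back(static_cast<Index>(pos - out.keys.begin()));
  }
  return out;
}
```

Calling host_encode on the input from the example above, [429, 111, 213, 111, 213, 429, 213], yields the same keys and indices as documented for encode. Choosing a narrower index type, for example host_encode<int, std::int8_t>(...), trades memory for a smaller limit on the number of distinct keys.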
-
std::unique_ptr<column> decode(dictionary_column_view const &dictionary_column, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Create a column by gathering the keys from the provided dictionary_column into a new column using the indices from that column.
d1 = {["a", "c", "d"], [2, 0, 1, 0]}
s = decode(d1)
s is now ["d", "a", "c", "a"]
- Parameters:
dictionary_column – Existing dictionary column
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with type matching the dictionary_column’s keys
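The gather that decode performs can be illustrated with a short host-side sketch. This is an illustration under stated assumptions rather than libcudf's implementation: host_decode and decode_example are hypothetical names, the data lives in std::vector instead of device memory, and nulls are not modeled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side model of decode(): out[i] = keys[indices[i]].
template <typename T, typename Index = std::int32_t>
std::vector<T> host_decode(std::vector<T> const& keys, std::vector<Index> const& indices)
{
  std::vector<T> out;
  out.reserve(indices.size());
  for (auto const idx : indices) {
    // Every index must refer to an existing key.
    assert(idx >= 0 && static_cast<std::size_t>(idx) < keys.size());
    out.push_back(keys[static_cast<std::size_t>(idx)]);
  }
  return out;
}

// Mirrors the documented example: keys ["a", "c", "d"], indices [2, 0, 1, 0].
inline std::vector<std::string> decode_example()
{
  std::vector<std::string> const keys{"a", "c", "d"};
  std::vector<std::int32_t> const indices{2, 0, 1, 0};
  return host_decode(keys, indices);
}
```

The result has one element per index and the element type of the keys, which matches the return description above.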