Dictionary Encode

group Encoding

Functions

std::unique_ptr<column> encode(column_view const &column, data_type indices_type = data_type{type_id::INT32}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Construct a dictionary column by dictionary encoding an existing column.

The output column is a DICTIONARY type with a keys column of non-null, unique values in a strict, total order. That is, keys[i] is ordered before keys[i+1] for all i in [0,n-1), where n is the number of keys.

The output column has a child indices column of integer type with the same size as the input column. The indices column has type indices_type. The result is undefined if indices_type is not large enough to hold the index values.
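
Since every index lies in [0,n-1] for n keys, whether a given signed integer type is large enough can be checked before encoding. The following is a minimal host-side sketch; indices_fit is a hypothetical helper introduced here for illustration and is not part of libcudf.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>

// Hypothetical helper (not a libcudf function): returns true when every
// index in [0, num_keys) is representable in the signed integer type IndexT.
template <typename IndexT>
constexpr bool indices_fit(std::size_t num_keys)
{
  static_assert(std::numeric_limits<IndexT>::is_integer &&
                  std::numeric_limits<IndexT>::is_signed,
                "IndexT must be a signed integer type");
  // The largest index that can be produced is num_keys - 1.
  return num_keys == 0 ||
         num_keys - 1 <= static_cast<std::size_t>(std::numeric_limits<IndexT>::max());
}
```

For example, indices_fit<std::int8_t>(128) is true because the largest index is 127, while indices_fit<std::int8_t>(129) is false.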

The null mask and null count are copied from the input column to the output column.

c = [429, 111, 213, 111, 213, 429, 213]
d = encode(c)
d now has keys [111, 213, 429] and indices [2, 0, 1, 0, 1, 2, 1]
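
The relationship between keys and indices in the example above can be reproduced on the host with the standard library. This is only a sketch of the encoding semantics, assuming an integer column without nulls; encode_sketch is a hypothetical name, and the code is not how libcudf implements encode on the GPU.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side illustration only (hypothetical function, not libcudf API).
// Returns {keys, indices}: keys holds the sorted, unique input values and
// indices[i] is the position of input[i] within keys.
inline std::pair<std::vector<int>, std::vector<std::int32_t>>
encode_sketch(std::vector<int> const& input)
{
  // Build the keys: sort a copy of the input, then drop duplicates.
  std::vector<int> keys(input);
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

  // Map each input value to its position in the sorted keys.
  std::vector<std::int32_t> indices;
  indices.reserve(input.size());
  for (int value : input) {
    auto it = std::lower_bound(keys.begin(), keys.end(), value);
    indices.push_back(static_cast<std::int32_t>(it - keys.begin()));
  }
  return {std::move(keys), std::move(indices)};
}
```

Calling encode_sketch on [429, 111, 213, 111, 213, 429, 213] yields keys [111, 213, 429] and indices [2, 0, 1, 0, 1, 2, 1], matching the example. Null handling is omitted from this sketch.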

Throws:
  • std::invalid_argument – if indices_type is not a signed integer type

  • std::invalid_argument – if the column to encode is already a DICTIONARY type

Parameters:
  • column – The column to dictionary encode

  • indices_type – The integer type to use for the indices

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A new dictionary column

std::unique_ptr<column> decode(dictionary_column_view const &dictionary_column, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Create a column by gathering the keys from the provided dictionary_column into a new column using the indices from that column.

d1 = {["a", "c", "d"], [2, 0, 1, 0]}
s = decode(d1)
s is now ["d", "a", "c", "a"]

Parameters:
  • dictionary_column – Existing dictionary column

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with type matching the dictionary_column’s keys
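
The gather performed by decode can be illustrated on the host as a lookup of each index in the keys. This is a sketch of the semantics only, assuming string keys and no nulls; decode_sketch is a hypothetical name and does not reflect libcudf's GPU implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Host-side illustration only (hypothetical function, not libcudf API).
// Produces out[i] = keys[indices[i]] for every i.
inline std::vector<std::string> decode_sketch(std::vector<std::string> const& keys,
                                              std::vector<std::int32_t> const& indices)
{
  std::vector<std::string> out;
  out.reserve(indices.size());
  for (std::int32_t idx : indices) {
    // at() throws std::out_of_range for an index outside the keys.
    out.push_back(keys.at(static_cast<std::size_t>(idx)));
  }
  return out;
}
```

With keys ["a", "c", "d"] and indices [2, 0, 1, 0], decode_sketch returns ["d", "a", "c", "a"], matching the decode example.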