Dictionary Encode

group dictionary_encode

Functions

std::unique_ptr<column> encode(column_view const &column, data_type indices_type = data_type{type_id::UINT32}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Construct a dictionary column by dictionary encoding an existing column.

The output column is a DICTIONARY type with a keys column of non-null, unique values that are in a strict, total order. That is, keys[i] is ordered before keys[i+1] for all i in [0, n-1), where n is the number of keys.

The output column has a child indices column that is of an integer type and has the same size as the input column.

The null mask and null count are copied from the input column to the output column.

c = [429, 111, 213, 111, 213, 429, 213]
d = encode(c)
d now has keys [111, 213, 429] and indices [2, 0, 1, 0, 1, 2, 1]
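
The mapping in the example above can be reproduced on the host with the standard library alone. The following is a minimal sketch of the encoding semantics described here (sorted, unique keys plus one index per input row), not the libcudf implementation. The function name dictionary_encode_sketch is hypothetical, and the sketch assumes an input without nulls.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <utility>
#include <vector>

// Hypothetical host-side helper for illustration; not part of libcudf.
// Returns {keys, indices}: keys holds the sorted, unique input values and
// indices[i] is the position of input[i] within keys.
std::pair<std::vector<int>, std::vector<std::uint32_t>>
dictionary_encode_sketch(std::vector<int> const& input)
{
  // Build the keys: sort a copy of the input, then drop duplicates.
  std::vector<int> keys(input);
  std::sort(keys.begin(), keys.end());
  keys.erase(std::unique(keys.begin(), keys.end()), keys.end());

  // Build the indices: binary-search each input value in the sorted keys.
  std::vector<std::uint32_t> indices;
  indices.reserve(input.size());
  for (int value : input) {
    auto const it = std::lower_bound(keys.begin(), keys.end(), value);
    indices.push_back(static_cast<std::uint32_t>(std::distance(keys.begin(), it)));
  }
  return {std::move(keys), std::move(indices)};
}
```

Calling dictionary_encode_sketch on {429, 111, 213, 111, 213, 429, 213} yields keys {111, 213, 429} and indices {2, 0, 1, 0, 1, 2, 1}, matching the example. Because the keys are sorted and deduplicated, each key is strictly ordered before the next one.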

Throws:
Parameters:
  • column – The column to dictionary encode

  • indices_type – The integer type to use for the indices

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

A dictionary column

std::unique_ptr<column> decode(dictionary_column_view const &dictionary_column, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Create a column by gathering the keys from the provided dictionary_column into a new column using the indices from that column.

d1 = {["a", "c", "d"], [2, 0, 1, 0]}
s = decode(d1)
s is now ["d", "a", "c", "a"]

Parameters:
  • dictionary_column – Existing dictionary column

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with type matching the dictionary_column’s keys
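
The gather that decode performs can be illustrated on the host in a few lines. This is a minimal sketch of the semantics only, assuming string keys and no nulls. The function name dictionary_decode_sketch is hypothetical and is not part of libcudf.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side helper for illustration; not part of libcudf.
// Builds the decoded column by looking up each index in the keys.
std::vector<std::string> dictionary_decode_sketch(
  std::vector<std::string> const& keys,
  std::vector<std::uint32_t> const& indices)
{
  std::vector<std::string> output;
  output.reserve(indices.size());
  for (auto const index : indices) {
    // at() throws std::out_of_range if an index does not refer to a key.
    output.push_back(keys.at(index));
  }
  return output;
}
```

With keys {"a", "c", "d"} and indices {2, 0, 1, 0}, the sketch returns {"d", "a", "c", "a"}, the same result as the decode example above. The output has one element per index, and its element type matches the keys.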