Dictionary Update#

group dictionary_update

Functions

std::unique_ptr<column> add_keys(dictionary_column_view const &dictionary_column, column_view const &new_keys, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create a new dictionary column by adding the new keys elements to the existing dictionary_column.

The indices are updated if any of the new keys are sorted before any of the existing dictionary elements.

d1 = { keys=["a", "c", "d"], indices=[2, 0, 1, 0, 1]}
d2 = add_keys( d1, ["b", "c"] )
d2 is now {keys=["a", "b", "c", "d"], indices=[3, 0, 2, 0, 2]}

The output column will have the same number of rows as the input column. Null entries from the input column are copied to the output column. No new null entries are created by this operation.

Throws:
  • cudf::logic_error – if the new_keys type does not match the keys type in the dictionary_column.

  • cudf::logic_error – if the new_keys contain nulls.

Parameters:
  • dictionary_column – Existing dictionary column.

  • new_keys – New keys to incorporate into the dictionary_column.

  • stream – CUDA stream used for device memory operations and kernel launches.

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

New dictionary column.

std::unique_ptr<column> remove_keys(dictionary_column_view const &dictionary_column, column_view const &keys_to_remove, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create a new dictionary column by removing the specified keys from the existing dictionary_column.

The output column will have the same number of rows as the input column. Null entries from the input column and copied to the output column. The indices are updated to the new positions of the remaining keys. Any indices pointing to removed keys sets that row to null.

d1 = {keys=["a", "c", "d"], indices=[2, 0, 1, 0, 2]}
d2 = remove_keys( d1, ["b", "c"] )
d2 is now {keys=["a", "d"], indices=[1, 0, x, 0, 1], valids=[1, 1, 0, 1, 1]}
Note that “a” has been removed so output row[2] becomes null.

Throws:
  • cudf::logic_error – if the keys_to_remove type does not match the keys type in the dictionary_column.

  • cudf::logic_error – if the keys_to_remove contain nulls.

Parameters:
  • dictionary_column – Existing dictionary column.

  • keys_to_remove – The keys to remove from the dictionary_column.

  • stream – CUDA stream used for device memory operations and kernel launches.

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

New dictionary column.

std::unique_ptr<column> remove_unused_keys(dictionary_column_view const &dictionary_column, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create a new dictionary column by removing any keys that are not referenced by any of the indices.

The indices are updated to the new position values of the remaining keys.

d1 = {["a","c","d"],[2,0,2,0]}
d2 = remove_unused_keys(d1)
d2 is now {["a","d"],[1,0,1,0]}
Parameters:
  • dictionary_column – Existing dictionary column.

  • stream – CUDA stream used for device memory operations and kernel launches.

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

New dictionary column.

std::unique_ptr<column> set_keys(dictionary_column_view const &dictionary_column, column_view const &keys, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create a new dictionary column by applying only the specified keys to the existing dictionary_column.

Any new elements found in the keys parameter are added to the output dictionary. Any existing keys not in the keys parameter are removed.

The number of rows in the output column will be the same as the number of rows in the input column. Existing null entries are copied to the output column. The indices are updated to reflect the position values of the new keys. Any indices pointing to removed keys sets those rows to null.

d1 = {keys=["a", "b", "c"], indices=[2, 0, 1, 2, 1]}
d2 = set_keys(existing_dict, ["b","c","d"])
d2 is now {keys=["b", "c", "d"], indices=[1, x, 0, 1, 0], valids=[1, 0, 1, 1, 1]}
Throws:
Parameters:
  • dictionary_column – Existing dictionary column.

  • keys – New keys to use for the output column. Must not contain nulls.

  • stream – CUDA stream used for device memory operations and kernel launches.

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

New dictionary column.

std::vector<std::unique_ptr<column>> match_dictionaries(cudf::host_span<dictionary_column_view const> input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#

Create new dictionaries that have keys merged from the input dictionaries.

This will concatenate the keys for each dictionary and then call set_keys on each. The result is a vector of new dictionaries with a common set of keys.

Parameters:
  • input – Dictionary columns to match keys.

  • stream – CUDA stream used for device memory operations and kernel launches.

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

New dictionary columns.