Nvtext Normalize

group Normalizing

Functions

std::unique_ptr<cudf::column> normalize_spaces(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns a new strings column by normalizing the whitespace in each string in the input column.

Normalizing a string replaces each run of one or more whitespace characters (any character with code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.

Example:
s = ["a b", "  c  d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]

A null input element at row i produces a corresponding null entry for row i in the output column.

Parameters:

input – Strings column to normalize
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column of normalized strings.
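
The whitespace rule described for normalize_spaces can be illustrated with a small host-side sketch. This is not libcudf's implementation (which runs on the GPU); the helper names normalize_spaces_host and normalize_column_host are hypothetical, and std::optional stands in for a nullable row. It works byte-wise, which is safe for UTF-8 input because every byte of a multi-byte sequence is greater than ' '.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side helper (illustration only, not part of libcudf):
// collapse every run of bytes <= ' ' into one space and trim both ends.
std::string normalize_spaces_host(std::string const& in)
{
  std::string out;
  bool pending_space = false;  // a whitespace run follows already-emitted text
  for (unsigned char ch : in) {
    if (ch <= ' ') {
      pending_space = !out.empty();  // leading whitespace is dropped
    } else {
      if (pending_space) { out.push_back(' '); }
      out.push_back(static_cast<char>(ch));
      pending_space = false;
    }
  }
  return out;  // a trailing whitespace run is never flushed, so it is trimmed
}

// Hypothetical column-level wrapper: a null row stays null at the same index.
std::vector<std::optional<std::string>> normalize_column_host(
  std::vector<std::optional<std::string>> const& rows)
{
  std::vector<std::optional<std::string>> out;
  out.reserve(rows.size());
  for (auto const& row : rows) {
    if (row.has_value()) {
      out.emplace_back(normalize_spaces_host(*row));
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

Applied to the rows of the example above, this sketch yields "a b", "c d" and "e f".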

std::unique_ptr<character_normalizer> create_character_normalizer(bool do_lower_case, cudf::strings_column_view const &special_tokens = cudf::strings_column_view(cudf::column_view{cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Create a normalizer object.

Creates a normalizer object that can be reused across multiple calls to nvtext::normalize_characters.

See also

nvtext::character_normalizer

Parameters:

do_lower_case – If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens – Individual tokens including [] brackets. Default is no special tokens.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned object’s device memory

Returns:

Object to be used with nvtext::normalize_characters.

std::unique_ptr<cudf::column> normalize_characters(cudf::strings_column_view const &input, character_normalizer const &normalizer, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Normalizes the text in the input strings column.

cn = create_character_normalizer(true)
s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,cn)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]

cn = create_character_normalizer(false)
s2 = normalize_characters(s,cn)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
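
The ASCII rows of the example above suggest three visible effects: whitespace becomes a single space character, each punctuation character is padded with a space on both sides, and letters are lower-cased when do_lower_case is true. The following host-side sketch reproduces only those ASCII rows. It is an assumption-laden illustration, not libcudf's algorithm: the helper name normalize_ascii_host is hypothetical, std::ispunct is used as a stand-in for the library's punctuation rule, and accent stripping, non-ASCII handling and special tokens are not modeled.

```cpp
#include <cctype>
#include <string>

// Hypothetical host-side helper (illustration only, not part of libcudf).
// ASCII only: bytes >= 0x80 are copied through unchanged.
std::string normalize_ascii_host(std::string const& in, bool do_lower_case)
{
  std::string out;
  for (unsigned char ch : in) {
    bool const is_ascii = ch < 0x80;
    if (is_ascii && std::isspace(ch)) {
      out.push_back(' ');  // e.g. '\t' becomes ' '
    } else if (is_ascii && std::ispunct(ch)) {
      // Assumed rule: pad each punctuation character with a space on both sides.
      out.push_back(' ');
      out.push_back(static_cast<char>(ch));
      out.push_back(' ');
    } else if (is_ascii && do_lower_case) {
      out.push_back(static_cast<char>(std::tolower(ch)));
    } else {
      out.push_back(static_cast<char>(ch));
    }
  }
  return out;
}
```

With do_lower_case set to true, "ACENU" maps to "acenu"; with either setting, "$24.08" maps to " $ 24 . 08" and "[a,bb]" maps to " [ a , bb ] ", matching the corresponding rows of s1 and s2.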