Files | |
| file | normalize.hpp |
Classes | |
| struct | nvtext::character_normalizer |
| Normalizer object to be used with nvtext::normalize_characters. | |
std::unique_ptr<character_normalizer> nvtext::create_character_normalizer(
    bool do_lower_case,
    cudf::strings_column_view const& special_tokens = cudf::strings_column_view(cudf::column_view{cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}),
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Create a normalizer object.
Creates a normalizer object that can be reused across multiple calls to nvtext::normalize_characters.
| do_lower_case | If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed. |
| special_tokens | Individual tokens including [] brackets. Default is no special tokens. |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<cudf::column> nvtext::normalize_characters(
    cudf::strings_column_view const& input,
    character_normalizer const& normalizer,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizes the text in the input strings column.
A null input element at row i produces a corresponding null entry for row i in the output column.
| input | The input strings to normalize |
| normalizer | Normalizer to use for this function |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Memory resource to allocate any returned objects |
std::unique_ptr<cudf::column> nvtext::normalize_spaces(
    cudf::strings_column_view const& input,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a new strings column by normalizing the whitespace in each string in the input column.
Normalizing a string replaces each run of one or more whitespace characters (any character with code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.
A null input element at row i produces a corresponding null entry for row i in the output column.
| input | Strings column to normalize |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the returned column's device memory |
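The whitespace rule described for nvtext::normalize_spaces can be illustrated on a single host-side string. The following is a minimal sketch, not the libcudf implementation: the function name `normalize_spaces_reference` is hypothetical, it runs on the CPU over one `std::string_view` rather than on a device strings column, and it does not model null rows.

```cpp
#include <string>
#include <string_view>

// Hypothetical host-side reference for the documented rule; this is NOT
// the libcudf device implementation. Every byte with code point <= ' '
// is treated as whitespace.
std::string normalize_spaces_reference(std::string_view input)
{
  std::string output;
  output.reserve(input.size());
  bool pending_space = false;  // a whitespace run follows already-emitted text

  for (unsigned char ch : input) {
    if (ch <= ' ') {
      // Remember the run only if text precedes it, which trims leading whitespace.
      pending_space = !output.empty();
    } else {
      // Collapse the whole preceding run into a single space.
      if (pending_space) { output.push_back(' '); }
      pending_space = false;
      output.push_back(static_cast<char>(ch));
    }
  }
  // A run still pending at the end is never emitted, which trims trailing whitespace.
  return output;
}
```

Under these assumptions, an input such as two spaces, `a`, a space, a tab, a space, `b`, and a trailing newline collapses to `a b`, while an empty or all-whitespace string yields an empty string.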