Files | |
file | normalize.hpp |
Functions | |
std::unique_ptr< cudf::column > | nvtext::normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
Returns a new strings column by normalizing the whitespace in each string in the input column. | |
std::unique_ptr< cudf::column > | nvtext::normalize_characters (cudf::strings_column_view const &input, bool do_lower_case, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
Normalizes string characters for tokenizing. | |
std::unique_ptr<cudf::column> nvtext::normalize_characters(
    cudf::strings_column_view const& input,
    bool do_lower_case,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizes string characters for tokenizing.
This uses the normalizer that is built into the nvtext::subword_tokenize function, which includes changing whitespace (e.g. "\t", "\n", "\r") to just space " " and padding certain characters with spaces.
The padding process here adds a single space before and after the character. Details on Unicode categories can be found here: https://unicodebook.readthedocs.io/unicode.html#categories
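The whitespace and padding steps can be illustrated with a small host-side sketch. This is not the nvtext device implementation: it handles ASCII bytes only, the function name `pad_and_space_ascii` is hypothetical, and using `std::ispunct` to decide which characters get padded is an assumption made for the example.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Illustrative host-side sketch only; not the nvtext device implementation.
// Assumption: std::ispunct stands in for the set of characters that the
// real normalizer pads. Non-ASCII bytes are copied through unchanged.
std::string pad_and_space_ascii(std::string_view in)
{
  std::string out;
  out.reserve(in.size() * 3);  // worst case: every byte becomes " c "
  for (unsigned char c : in) {
    if (c == '\t' || c == '\n' || c == '\r') {
      out.push_back(' ');  // whitespace becomes a plain space
    } else if (std::ispunct(c)) {
      out.push_back(' ');  // pad: one space before ...
      out.push_back(static_cast<char>(c));
      out.push_back(' ');  // ... and one space after
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  return out;
}
```

Under these assumptions, "[a,bb]" becomes " [ a , bb ] ": each padded character contributes its own leading and trailing space, and nothing is collapsed or trimmed afterwards.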
If do_lower_case = true, lower-casing also removes the accents. Accents cannot be removed from upper-case characters without lower-casing, and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
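The coupling between lower-casing and accent removal can be sketched on the host with a toy lookup table. The names `fold_code_point` and `fold_sketch` and the three-entry table are hypothetical and exist only to show the rule; the real normalizer covers far more of Unicode and runs on the device.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical toy table: maps a few code points to their lower-case,
// accent-free form. It is an illustration, not the library's data.
inline char32_t fold_code_point(char32_t cp)
{
  static const std::unordered_map<char32_t, char32_t> table{
    {U'A', U'a'},       // upper-case, no accent: lower-cased
    {U'\u00C9', U'e'},  // U+00C9 (E with acute): lower-cased and accent removed together
    {U'\u00E9', U'e'},  // U+00E9 (e with acute): already lower-case, only the accent goes
  };
  auto it = table.find(cp);
  return it == table.end() ? cp : it->second;
}

// Applies the fold to every code point when do_lower_case is true.
std::u32string fold_sketch(std::u32string const& in, bool do_lower_case)
{
  if (!do_lower_case) { return in; }  // false: characters are not transformed
  std::u32string out;
  out.reserve(in.size());
  for (char32_t cp : in) { out.push_back(fold_code_point(cp)); }
  return out;
}
```

A single table keyed by code point reflects the description above: there is no entry that maps an accented upper-case letter to an unaccented upper-case one, so the two transformations cannot be separated.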
A null input element at row i produces a corresponding null entry for row i in the output column.
This function requires about 16x the number of character bytes in the input strings column as working memory.
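As a rough planning aid, the working-memory figure can be turned into a one-line estimate. The helper name `estimated_working_bytes` is hypothetical, and because the documentation says "about 16x", the factor should be read as approximate rather than as an exact bound.

```cpp
#include <cstddef>

// Rough estimate only: the factor comes from the "about 16x" statement
// above and is approximate, not a guaranteed upper bound.
constexpr std::size_t estimated_working_bytes(std::size_t input_char_bytes)
{
  constexpr std::size_t approx_factor = 16;
  return input_char_bytes * approx_factor;
}
```

For example, a column holding 1 GiB of character bytes would need on the order of 16 GiB of working memory, so splitting a large column into smaller batches may be necessary on memory-constrained devices.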
input | The input strings to normalize |
do_lower_case | If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Memory resource to allocate any returned objects |
std::unique_ptr<cudf::column> nvtext::normalize_spaces(
    cudf::strings_column_view const& input,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a new strings column by normalizing the whitespace in each string in the input column.
Normalizing a string replaces each run of one or more whitespace characters (code-point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.
A null input element at row i produces a corresponding null entry for row i in the output column.
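The per-row rule can be written as a short host-side reference, which may help when checking expected results. This is a sketch under stated assumptions, not the library's implementation: the name `normalize_spaces_sketch` is hypothetical, a single row is modeled as `std::optional<std::string_view>`, and `std::nullopt` stands in for a null row.

```cpp
#include <optional>
#include <string>
#include <string_view>

// Host-side reference sketch of the rule described above. The library
// applies it per row on the GPU; std::nullopt models a null row here.
std::optional<std::string> normalize_spaces_sketch(std::optional<std::string_view> in)
{
  if (!in) { return std::nullopt; }  // null in, null out
  std::string out;
  bool pending_space = false;  // a whitespace run was seen after some output
  // Iterate as unsigned char so UTF-8 bytes above 0x7F never compare <= ' '.
  for (unsigned char c : *in) {
    if (c <= ' ') {
      pending_space = !out.empty();  // leading whitespace is dropped
    } else {
      if (pending_space) { out.push_back(' '); }  // collapse the run to one space
      pending_space = false;
      out.push_back(static_cast<char>(c));
    }
  }
  return out;  // a trailing run is never emitted, so the end is trimmed
}
```

Deferring the space until the next non-whitespace byte handles collapsing and trimming in a single pass, without a separate trim step.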
input | Strings column to normalize |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |