Normalizer object to be used with nvtext::normalize_characters.
#include <normalize.hpp>
Public Member Functions
character_normalizer (bool do_lower_case, cudf::strings_column_view const &special_tokens, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
    Normalizer object constructor.
Normalizer object to be used with nvtext::normalize_characters.
Use nvtext::create_normalizer to create this object.
This normalizer includes:
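- adding padding around punctuation (unicode category starts with "P") as well as certain ASCII symbols like "^" and "$"
- adding padding around the CJK Unicode block characters
- changing whitespace (e.g. "\t", "\n", "\r") to just space " "
- removing control characters (unicode categories "Cc" and "Cf")

The padding process adds a single space before and after the character. Details on unicode category can be found here: https://unicodebook.readthedocs.io/unicode.html#categories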
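The whitespace and padding rules above can be illustrated with a small host-side sketch. This is not the libcudf implementation: `normalize_ascii` is a hypothetical helper, it handles ASCII only, and it assumes that `std::ispunct` (in the default "C" locale) is an acceptable stand-in for the set of padded characters.

```cpp
#include <cctype>
#include <string>

// Hypothetical, ASCII-only illustration of two of the rules described above.
// It is NOT part of libcudf and ignores Unicode categories entirely.
std::string normalize_ascii(std::string const& input)
{
  std::string out;
  for (unsigned char ch : input) {
    if (ch == '\t' || ch == '\n' || ch == '\r') {
      out += ' ';  // each whitespace character becomes one plain space
    } else if (std::ispunct(ch)) {
      out += ' ';  // padding: one space before the character ...
      out += static_cast<char>(ch);
      out += ' ';  // ... and one space after it
    } else {
      out += static_cast<char>(ch);  // everything else is copied unchanged
    }
  }
  return out;
}
```

Under these assumptions, `normalize_ascii("hi,there")` yields `"hi , there"`. Runs of whitespace are not collapsed in this sketch; each character is replaced independently.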
If do_lower_case = true, lower-casing also removes any accents. The accents cannot be removed from upper-case characters without lower-casing and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
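The coupling between lower-casing and accent removal can be sketched on the host as follows. The table and the names `fold_entry`, `accent_table`, `fold_code_point`, and `fold` are hypothetical and cover only a handful of Latin-1 code points; the actual normalizer uses its own device-side tables, which are not shown here.

```cpp
#include <string>

// Hypothetical mapping from an accented code point to its unaccented,
// lower-case base letter. Only a few Latin-1 entries are listed.
struct fold_entry {
  char32_t from;
  char32_t to;
};

constexpr fold_entry accent_table[] = {
  {0x00C0, U'a'},  // A with grave (upper-case)
  {0x00C9, U'e'},  // E with acute (upper-case)
  {0x00D1, U'n'},  // N with tilde (upper-case)
  {0x00E0, U'a'},  // a with grave (already lower-case: only the accent goes)
  {0x00E9, U'e'},  // e with acute (already lower-case: only the accent goes)
  {0x00F1, U'n'},  // n with tilde (already lower-case: only the accent goes)
};

// With do_lower_case == false nothing is transformed. With true, accents are
// stripped and upper-case letters are lower-cased in the same step.
char32_t fold_code_point(char32_t cp, bool do_lower_case)
{
  if (!do_lower_case) { return cp; }
  for (auto const& entry : accent_table) {
    if (entry.from == cp) { return entry.to; }
  }
  if (cp >= U'A' && cp <= U'Z') { return static_cast<char32_t>(cp - U'A' + U'a'); }
  return cp;
}

// Applies fold_code_point to every code point of a UTF-32 string.
std::u32string fold(std::u32string const& input, bool do_lower_case)
{
  std::u32string out;
  out.reserve(input.size());
  for (char32_t cp : input) { out.push_back(fold_code_point(cp, do_lower_case)); }
  return out;
}
```

In this sketch there is deliberately no way to strip an accent while keeping the letter upper-case, which mirrors the constraint stated above.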
If special_tokens are included, the padding after [ and before ] is not inserted when the characters between them match one of the given tokens. Each string in special_tokens is expected to include the [ and ] characters at its beginning and end, respectively.
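The special-token rule can be pictured with the following host-side sketch. It is an ASCII-only approximation under stated assumptions, not the libcudf implementation: `pad_with_special_tokens` is a hypothetical helper, punctuation is detected with `std::ispunct`, and a matched token is assumed to keep its outer padding (before `[` and after `]`) while losing the inner padding.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical illustration: pad punctuation with spaces, but keep a bracketed
// special token such as "[BOS]" intact when it matches one of special_tokens.
std::string pad_with_special_tokens(std::string const& input,
                                    std::vector<std::string> const& special_tokens)
{
  std::string out;
  std::size_t pos = 0;
  while (pos < input.size()) {
    bool matched = false;
    if (input[pos] == '[') {
      for (auto const& token : special_tokens) {
        // Does the input contain this token starting at the current '['?
        if (!token.empty() && input.compare(pos, token.size(), token) == 0) {
          out += ' ';    // padding before '[' is still inserted
          out += token;  // no padding after '[' or before ']'
          out += ' ';    // padding after ']' is still inserted
          pos += token.size();
          matched = true;
          break;
        }
      }
    }
    if (matched) { continue; }
    auto const ch = static_cast<unsigned char>(input[pos++]);
    if (std::ispunct(ch)) {
      out += ' ';
      out += static_cast<char>(ch);
      out += ' ';
    } else {
      out += static_cast<char>(ch);
    }
  }
  return out;
}
```

With the token list `{"[BOS]", "[EOS]"}`, the sketch turns `"[BOS]hi"` into `" [BOS] hi"`, whereas with an empty token list the same input becomes `" [ BOS ] hi"`.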
Definition at line 89 of file normalize.hpp.
nvtext::character_normalizer::character_normalizer(
    bool do_lower_case,
    cudf::strings_column_view const& special_tokens,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Normalizer object constructor.
This initializes and holds the character normalizing tables and settings.
| do_lower_case | If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed. |
| special_tokens | Each row is a token including the [] brackets. For example: [BOS], [EOS], [UNK], [SEP], [PAD], [CLS], [MASK] |
| stream | CUDA stream used for device memory operations and kernel launches |
| mr | Device memory resource used to allocate the device memory held by this object |