Normalizing

Files

file  normalize.hpp
 

Classes

struct  nvtext::character_normalizer
 Normalizer object to be used with nvtext::normalize_characters.
 

Functions

std::unique_ptr< cudf::column >  nvtext::normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns a new strings column by normalizing the whitespace in each string in the input column.
 
std::unique_ptr< cudf::column >  nvtext::normalize_characters (cudf::strings_column_view const &input, bool do_lower_case, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Normalizes strings characters for tokenizing.
 
std::unique_ptr< character_normalizer >  nvtext::create_character_normalizer (bool do_lower_case, cudf::strings_column_view const &special_tokens=cudf::strings_column_view(cudf::column_view{ cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Create a normalizer object.
 
std::unique_ptr< cudf::column >  nvtext::normalize_characters (cudf::strings_column_view const &input, character_normalizer const &normalizer, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Normalizes the text in input strings column.
 

Detailed Description

Function Documentation

◆ create_character_normalizer()

std::unique_ptr<character_normalizer> nvtext::create_character_normalizer ( bool  do_lower_case,
cudf::strings_column_view const &  special_tokens = cudf::strings_column_view(cudf::column_view{cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}),
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Create a normalizer object.

Creates a normalizer object that can be reused across multiple calls to nvtext::normalize_characters.

See also
nvtext::character_normalizer
Parameters
do_lower_case  If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens  Individual tokens including [] brackets. Default is no special tokens.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
Object to be used with nvtext::normalize_characters

◆ normalize_characters() [1/2]

std::unique_ptr<cudf::column> nvtext::normalize_characters ( cudf::strings_column_view const &  input,
bool  do_lower_case,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Normalizes strings characters for tokenizing.

This uses the normalizer that is built into the nvtext::subword_tokenize function which includes:

  • adding padding around punctuation (Unicode category starts with "P") as well as certain ASCII symbols like "^" and "$"
  • adding padding around the CJK Unicode block characters
  • changing whitespace (e.g. "\t", "\n", "\r") to just space " "
  • removing control characters (Unicode categories "Cc" and "Cf")

The padding process here adds a single space before and after the character. Details on Unicode categories can be found here: https://unicodebook.readthedocs.io/unicode.html#categories
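
The padding, whitespace, and control-character rules can be illustrated with a small host-side sketch. This is not the nvtext implementation: `pad_ascii_sketch` is a hypothetical helper, it handles ASCII input only, and it assumes that every ASCII punctuation or symbol character (as classified by `std::ispunct` in the default "C" locale) is padded. Unicode "P" categories, "Cf" characters, and CJK handling are out of scope here.

```cpp
#include <cctype>
#include <string>

// Host-side, ASCII-only illustration of the rules listed above.
// Hypothetical helper; not part of nvtext.
std::string pad_ascii_sketch(std::string const& in)
{
  std::string out;
  for (unsigned char c : in) {
    if (c == '\t' || c == '\n' || c == '\r') {
      out += ' ';  // whitespace characters become a plain space
    } else if (std::iscntrl(c)) {
      continue;  // remaining ASCII control characters are removed
    } else if (std::ispunct(c)) {
      out += ' ';  // single space before the character
      out += static_cast<char>(c);
      out += ' ';  // single space after the character
    } else {
      out += static_cast<char>(c);  // everything else is copied unchanged
    }
  }
  return out;
}
```

Under these assumptions, calling `pad_ascii_sketch` on "$24.08" and "[a,bb]" yields " $ 24 . 08" and " [ a , bb ] ", the same padded forms that appear in the example for this function.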

If do_lower_case = true, lower-casing also removes the accents. The accents cannot be removed from upper-case characters without lower-casing and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.

s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,true)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
s2 = normalize_characters(s,false)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
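
The coupling between lower-casing and accent removal can be sketched on the host with a lookup table. The table and the `fold_sketch` function below are hypothetical and cover only the accented characters from the example above; the library's real mapping is far larger and is not reproduced here.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical table: each accented character from the example maps to its
// unaccented, lower-case ASCII letter.
static std::unordered_map<char32_t, char32_t> const kFoldTable = {
  {U'\u00E9', U'e'},  // é
  {U'\u00E2', U'a'},  // â
  {U'\u00EE', U'i'},  // î
  {U'\u00F4', U'o'},  // ô
  {U'\u0102', U'a'},  // Ă
  {U'\u0106', U'c'},  // Ć
  {U'\u0116', U'e'},  // Ė
  {U'\u00D1', U'n'},  // Ñ
  {U'\u00DC', U'u'},  // Ü
};

// Illustration only; not part of nvtext.
std::u32string fold_sketch(std::u32string const& in, bool do_lower_case)
{
  if (!do_lower_case) { return in; }  // neither case nor accents are touched
  std::u32string out;
  for (char32_t c : in) {
    auto const it = kFoldTable.find(c);
    if (it != kFoldTable.end()) {
      out += it->second;  // accent removed, result is lower-case
    } else if (c >= U'A' && c <= U'Z') {
      out += static_cast<char32_t>(c + (U'a' - U'A'));  // plain ASCII lower-casing
    } else {
      out += c;
    }
  }
  return out;
}
```

The sketch has a single switch, mirroring the rule that case folding and accent stripping are always applied together, never separately.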

A null input element at row i produces a corresponding null entry for row i in the output column.

This function requires about 16x the number of character bytes in the input strings column as working memory.
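
As a rough planning aid, the "about 16x" figure can be turned into a back-of-the-envelope estimate. `estimate_working_bytes` is a hypothetical helper, and the factor is approximate rather than a guaranteed bound.

```cpp
#include <cstddef>

// Rough working-memory estimate from the "about 16x" figure above.
// Hypothetical helper; the factor is an approximation, not a guarantee.
constexpr std::size_t estimate_working_bytes(std::size_t chars_bytes)
{
  return chars_bytes * 16;
}
```

For instance, a column holding 1,000 bytes of character data would need roughly 16,000 bytes of working memory under this estimate.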

Parameters
input  The input strings to normalize
do_lower_case  If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
stream  CUDA stream used for device memory operations and kernel launches
mr  Memory resource to allocate any returned objects
Returns
Normalized strings column

◆ normalize_characters() [2/2]

std::unique_ptr<cudf::column> nvtext::normalize_characters ( cudf::strings_column_view const &  input,
character_normalizer const &  normalizer,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Normalizes the text in input strings column.

See also
nvtext::character_normalizer for details on the normalizer behavior
cn = create_character_normalizer(true)
s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,cn)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
cn = create_character_normalizer(false)
s2 = normalize_characters(s,cn)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]

A null input element at row i produces a corresponding null entry for row i in the output column.

Parameters
input  The input strings to normalize
normalizer  Normalizer to use for this function
stream  CUDA stream used for device memory operations and kernel launches
mr  Memory resource to allocate any returned objects
Returns
Normalized strings column

◆ normalize_spaces()

std::unique_ptr<cudf::column> nvtext::normalize_spaces ( cudf::strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns a new strings column by normalizing the whitespace in each string in the input column.

Normalizing a string replaces each run of one or more whitespace characters (code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.

Example:
s = ["a b", " c d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]

A null input element at row i produces a corresponding null entry for row i in the output column.
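
The whitespace rule and the null handling described above can be sketched on the host with the standard library. This is not the nvtext implementation: `normalize_spaces_row`, `normalize_spaces_sketch`, and the `Column` alias are hypothetical names, and a `std::optional` stands in for a nullable row.

```cpp
#include <optional>
#include <string>
#include <vector>

// Collapse each run of characters <= ' ' into one space and trim both ends.
// Hypothetical helper; not part of nvtext.
std::string normalize_spaces_row(std::string const& in)
{
  std::string out;
  bool pending_space = false;
  for (unsigned char c : in) {
    if (c <= ' ') {
      // Remember the run, but only if something was already emitted,
      // so leading whitespace is dropped.
      pending_space = !out.empty();
    } else {
      if (pending_space) { out += ' '; }  // one space per whitespace run
      pending_space = false;
      out += static_cast<char>(c);
    }
  }
  return out;  // a trailing run is left pending and never emitted
}

// A nullable column of strings, modeled with std::optional.
using Column = std::vector<std::optional<std::string>>;

// Null rows stay null; all other rows are normalized.
Column normalize_spaces_sketch(Column const& in)
{
  Column out;
  out.reserve(in.size());
  for (auto const& row : in) {
    if (row.has_value()) {
      out.emplace_back(normalize_spaces_row(*row));
    } else {
      out.emplace_back(std::nullopt);
    }
  }
  return out;
}
```

Applied to the three strings from the example, the row helper returns "a b", "c d", and "e f". A row holding `std::nullopt` comes back as `std::nullopt`.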

Parameters
input  Strings column to normalize
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column of normalized strings.