Normalizing

Files

file  normalize.hpp
 

Classes

struct  nvtext::character_normalizer
 Normalizer object to be used with nvtext::normalize_characters.
 

Functions

std::unique_ptr< cudf::column > nvtext::normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns a new strings column by normalizing the whitespace in each string in the input column.
 
std::unique_ptr< character_normalizer > nvtext::create_character_normalizer (bool do_lower_case, cudf::strings_column_view const &special_tokens=cudf::strings_column_view(cudf::column_view{ cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Create a normalizer object.
 
std::unique_ptr< cudf::column > nvtext::normalize_characters (cudf::strings_column_view const &input, character_normalizer const &normalizer, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Normalizes the text in the input strings column.
 

Detailed Description

Function Documentation

◆ create_character_normalizer()

std::unique_ptr<character_normalizer> nvtext::create_character_normalizer ( bool  do_lower_case,
cudf::strings_column_view const &  special_tokens = cudf::strings_column_view(cudf::column_view{ cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}),
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Create a normalizer object.

Creates a normalizer object that can be reused across multiple calls to nvtext::normalize_characters.

See also
nvtext::character_normalizer
Parameters
do_lower_case  If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens  Individual tokens including [] brackets. Default is no special tokens.
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned object's device memory
Returns
Object to be used with nvtext::normalize_characters

◆ normalize_characters()

std::unique_ptr<cudf::column> nvtext::normalize_characters ( cudf::strings_column_view const &  input,
character_normalizer const &  normalizer,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Normalizes the text in the input strings column.

See also
nvtext::character_normalizer for details on the normalizer behavior

Example:
cn = create_character_normalizer(true)
s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,cn)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
cn = create_character_normalizer(false)
s2 = normalize_characters(s,cn)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]

A null input element at row i produces a corresponding null entry for row i in the output column.
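
The ASCII rows of the example above can be reproduced on the host with a short standard-library sketch. This is only an illustration of the visible behavior (tab becomes a space, punctuation and symbols are padded with a space on each side, optional lower-casing); it is not the libcudf implementation. The name normalize_ascii_host is hypothetical, treating "\n" and "\r" like "\t" is an assumption, and accent stripping and special tokens are deliberately left out.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Host-side, ASCII-only illustration inferred from the examples above.
// normalize_ascii_host is a hypothetical name, not a libcudf function.
std::string normalize_ascii_host(std::string_view in, bool do_lower_case)
{
  std::string out;
  for (unsigned char ch : in) {
    if (ch == '\t' || ch == '\n' || ch == '\r') {
      // assumption: newline and carriage return are treated like tab
      out.push_back(' ');
    } else if (ch < 0x80 && std::ispunct(ch)) {
      // ASCII punctuation and symbols get a space on each side
      out.push_back(' ');
      out.push_back(static_cast<char>(ch));
      out.push_back(' ');
    } else if (ch < 0x80 && do_lower_case) {
      out.push_back(static_cast<char>(std::tolower(ch)));
    } else {
      // non-ASCII bytes (and ASCII when not lower-casing) pass through unchanged
      out.push_back(static_cast<char>(ch));
    }
  }
  return out;
}
```

With do_lower_case set to true, "$24.08" becomes " $ 24 . 08" and "ACENU" becomes "acenu", matching the corresponding entries of s1.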

Parameters
input  The input strings to normalize
normalizer  Normalizer to use for this function
stream  CUDA stream used for device memory operations and kernel launches
mr  Memory resource to allocate any returned objects
Returns
Normalized strings column

◆ normalize_spaces()

std::unique_ptr<cudf::column> nvtext::normalize_spaces ( cudf::strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns a new strings column by normalizing the whitespace in each string in the input column.

Normalizing a string replaces each run of one or more whitespace characters (code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.

Example:
s = ["a b", " c d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]

A null input element at row i produces a corresponding null entry for row i in the output column.
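
The per-string rule described above can be sketched on the host with the standard library alone. This is a minimal illustration of the stated semantics, not the libcudf device implementation; normalize_spaces_host is a hypothetical name introduced for the example, and null rows are not modeled.

```cpp
#include <string>
#include <string_view>

// Host-side illustration only: collapses each run of characters with
// code point <= ' ' into one space and trims both ends.
// normalize_spaces_host is a hypothetical name, not a libcudf function.
std::string normalize_spaces_host(std::string_view in)
{
  std::string out;
  bool pending_space = false;  // a whitespace run was seen after some output
  for (unsigned char ch : in) {
    if (ch <= ' ') {
      // remember the run only if output already exists; this trims the front
      pending_space = !out.empty();
    } else {
      if (pending_space) { out.push_back(' '); }
      out.push_back(static_cast<char>(ch));
      pending_space = false;
    }
  }
  return out;  // a trailing run is never emitted, which trims the end
}
```

Applied to each string of s in the example, this yields "a b", "c d" and "e f". Bytes of multi-byte UTF-8 characters are all greater than ' ', so they are copied through untouched.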

Parameters
input  Strings column to normalize
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column of normalized strings.