Files
file	normalize.hpp
	APIs for normalizing whitespace and characters within strings columns.

Classes
struct	nvtext::character_normalizer
	Normalizer object to be used with nvtext::normalize_characters.

Functions
std::unique_ptr< cudf::column >	nvtext::normalize_spaces (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Returns a new strings column by normalizing the whitespace in each string in the input column.

std::unique_ptr< character_normalizer >	nvtext::create_character_normalizer (bool do_lower_case, cudf::strings_column_view const &special_tokens=cudf::strings_column_view(cudf::column_view{ cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Create a normalizer object.

std::unique_ptr< cudf::column >	nvtext::normalize_characters (cudf::strings_column_view const &input, character_normalizer const &normalizer, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Normalizes the text in input strings column.

Detailed Description

Function Documentation

◆ create_character_normalizer()

std::unique_ptr<character_normalizer> nvtext::create_character_normalizer	(	bool	do_lower_case,
		cudf::strings_column_view const &	special_tokens = `cudf::strings_column_view(cudf::column_view{ cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0})`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

#include <nvtext/normalize.hpp>

Create a normalizer object.

Creates a normalizer object that can be reused across multiple calls to nvtext::normalize_characters.

See also: nvtext::character_normalizer

Parameters

do_lower_case	If true, upper-case characters are converted to lower-case and accents are stripped from those characters. If false, accented and upper-case characters are not transformed.
special_tokens	Individual tokens including `[]` brackets. Default is no special tokens.
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned object's device memory

Returns: Object to be used with nvtext::normalize_characters

◆ normalize_characters()

std::unique_ptr<cudf::column> nvtext::normalize_characters	(	cudf::strings_column_view const &	input,
		character_normalizer const &	normalizer,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

#include <nvtext/normalize.hpp>

Normalizes the text in input strings column.

See also: nvtext::character_normalizer for details on the normalizer behavior

Example:
cn = create_character_normalizer(true)
s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
s1 = normalize_characters(s,cn)
s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
 
cn = create_character_normalizer(false)
s2 = normalize_characters(s,cn)
s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]

A null input element at row i produces a corresponding null entry for row i in the output column.
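
The ASCII part of the example above can be approximated on the host with a short sketch. This is not libcudf code: `normalize_ascii_host` is a hypothetical name, and padding every character that `std::ispunct` accepts is an assumption inferred from the `$24.08` and `[a,bb]` rows. Accent stripping and other Unicode handling are not modeled; bytes outside ASCII are copied through unchanged.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Host-side, ASCII-only illustration (hypothetical helper, not part of nvtext).
// Assumption: any ASCII character accepted by std::ispunct is padded with a
// space on both sides, as the "$24.08" and "[a,bb]" rows suggest.
std::string normalize_ascii_host(std::string const& in, bool do_lower_case)
{
  std::string out;
  out.reserve(in.size());
  for (unsigned char c : in) {
    if (c >= 0x80) {
      out.push_back(static_cast<char>(c));  // non-ASCII byte: left untouched here
    } else if (std::isspace(c)) {
      out.push_back(' ');  // "\t", "\n" and similar become a plain space
    } else if (std::ispunct(c)) {
      out.push_back(' ');  // pad punctuation on both sides
      out.push_back(static_cast<char>(c));
      out.push_back(' ');
    } else if (do_lower_case) {
      out.push_back(static_cast<char>(std::tolower(c)));
    } else {
      out.push_back(static_cast<char>(c));
    }
  }
  assert(out.size() >= in.size());  // every input byte yields at least one output byte
  return out;
}
```

Under these assumptions, `normalize_ascii_host("$24.08", true)` yields `" $ 24 . 08"` and `normalize_ascii_host("ACENU", false)` leaves the string unchanged, matching the corresponding rows of s1 and s2.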

Parameters

input	The input strings to normalize
normalizer	Normalizer to use for this function
stream	CUDA stream used for device memory operations and kernel launches
mr	Memory resource to allocate any returned objects

Returns: Normalized strings column

◆ normalize_spaces()

std::unique_ptr<cudf::column> nvtext::normalize_spaces	(	cudf::strings_column_view const &	input,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

#include <nvtext/normalize.hpp>

Returns a new strings column by normalizing the whitespace in each string in the input column.

Normalizing a string replaces each run of one or more whitespace characters (any character with code point <= ' ') with a single space ' ' and trims whitespace from the beginning and end of the string.

Example:
s = ["a b", "  c  d\n", "e \t f "]
t = normalize_spaces(s)
t is now ["a b","c d","e f"]

A null input element at row i produces a corresponding null entry for row i in the output column.
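
The whitespace rule and the null handling described above can be sketched on the host in standard C++. This only illustrates the documented semantics and says nothing about how libcudf implements them on the GPU; `normalize_spaces_host`, `normalize_spaces_column_host`, and `host_column` are hypothetical names, and `std::optional` stands in for a nullable row.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Host-side illustration only (hypothetical helper, not part of nvtext):
// collapse each run of bytes with value <= ' ' into one space, trim both ends.
std::string normalize_spaces_host(std::string const& in)
{
  std::string out;
  bool pending_space = false;  // a whitespace run follows already-emitted text
  for (unsigned char c : in) {
    if (c <= ' ') {
      pending_space = !out.empty();  // leading whitespace is dropped
    } else {
      if (pending_space) { out.push_back(' '); }
      out.push_back(static_cast<char>(c));
      pending_space = false;
    }
  }
  return out;  // a trailing run is never flushed, so the end is trimmed
}

// A nullable "column" of strings; std::nullopt models a null row.
using host_column = std::vector<std::optional<std::string>>;

host_column normalize_spaces_column_host(host_column const& input)
{
  host_column output;
  output.reserve(input.size());
  for (auto const& row : input) {
    if (row.has_value()) {
      output.emplace_back(normalize_spaces_host(*row));
    } else {
      output.emplace_back(std::nullopt);  // null at row i stays null at row i
    }
  }
  assert(output.size() == input.size());  // one output row per input row
  return output;
}
```

The loop iterates over `unsigned char` so that UTF-8 bytes above 0x7F are not mistaken for whitespace on platforms where `char` is signed.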

Parameters

input	Strings column to normalize
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned column's device memory

Returns: New strings column of normalized strings.
