Files
file	nvtext/replace.hpp

Functions
std::unique_ptr< cudf::column >	nvtext::replace_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Replaces specified tokens with corresponding replacement strings.

std::unique_ptr< cudf::column >	nvtext::filter_tokens (cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement=cudf::string_scalar{""}, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Removes tokens whose lengths are less than a specified number of characters.

Detailed Description

Function Documentation

◆ filter_tokens()

std::unique_ptr<cudf::column> nvtext::filter_tokens	(	cudf::strings_column_view const &	input,
		cudf::size_type	min_token_length,
		cudf::string_scalar const &	replacement = `cudf::string_scalar{""}`,
		cudf::string_scalar const &	delimiter = `cudf::string_scalar{""}`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Removes tokens whose lengths are less than a specified number of characters.

Tokens shorter than min_token_length are removed from the corresponding output string. If a replacement string is specified, each removed token is replaced by it instead.

The delimiter may be zero or more characters. If the delimiter is empty, whitespace (any character with code point <= ' ') is used to identify tokens. Consecutive delimiters found in a string do not produce empty tokens.

Example:
s = ["this is me", "theme music"]
result = filter_tokens(s,3)
result is now ["this  ", "theme music"]

Note that the first string in the result still retains the space delimiters.

Example with a replacement string:
s = ["this is me", "theme music"]
result = filter_tokens(s,5,"---")
result is now ["--- --- ---", "theme music"]

The replacement string is allowed to be shorter than min_token_length.
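
The filtering rule described above can be checked on the host with a small model. The sketch below is not the libcudf implementation and does not call libcudf: `filter_tokens_model` and `check_filter_examples` are hypothetical names, the code works on a single `std::string`, and it assumes ASCII input (one byte per character) and the default whitespace delimiter only.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-side model of the documented behavior; this is NOT libcudf code.
// Assumptions: ASCII input (one byte per character) and the default whitespace delimiter.
std::string filter_tokens_model(std::string const& s,
                                std::size_t min_token_length,
                                std::string const& replacement = "")
{
  std::string out;
  std::size_t i = 0;
  while (i < s.size()) {
    // Whitespace is any character with code point <= ' '; copy it through unchanged.
    if (static_cast<unsigned char>(s[i]) <= ' ') {
      out.push_back(s[i++]);
      continue;
    }
    // Scan to the end of the current token.
    std::size_t const begin = i;
    while (i < s.size() && static_cast<unsigned char>(s[i]) > ' ') { ++i; }
    std::size_t const length = i - begin;
    if (length < min_token_length) {
      out += replacement;            // short token: drop it or substitute the replacement
    } else {
      out.append(s, begin, length);  // long enough: keep the token as-is
    }
  }
  return out;
}

// Reproduces the two documented examples.
void check_filter_examples()
{
  assert(filter_tokens_model("this is me", 3) == "this  ");
  assert(filter_tokens_model("theme music", 3) == "theme music");
  assert(filter_tokens_model("this is me", 5, "---") == "--- --- ---");
  assert(filter_tokens_model("theme music", 5, "---") == "theme music");
}
```

Because delimiters are copied through untouched, removing "is" and "me" from "this is me" leaves both spaces in place, which is why the first documented result ends in two spaces.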

Exceptions

cudf::logic_error if delimiter or replacement is invalid

Parameters

input	Strings column whose tokens are filtered
min_token_length	Minimum number of characters a token must have to be retained in the output string
replacement	Optional replacement string to be used in place of removed tokens
delimiter	Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned column's device memory

Returns: New strings column of filtered strings

◆ replace_tokens()

std::unique_ptr<cudf::column> nvtext::replace_tokens	(	cudf::strings_column_view const &	input,
		cudf::strings_column_view const &	targets,
		cudf::strings_column_view const &	replacements,
		cudf::string_scalar const &	delimiter = `cudf::string_scalar{""}`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Replaces specified tokens with corresponding replacement strings.

Tokens are identified in each string, and any token that matches one of the specified target strings is replaced with the corresponding replacement string: if targets[i] is found, it is replaced by replacements[i].

The delimiter may be zero or more characters. If the delimiter is empty, whitespace (any character with code point <= ' ') is used to identify tokens. Consecutive delimiters found in a string do not produce empty tokens.

Example:
s = ["this is me", "theme music"]
tgt = ["is", "me"]
rpl = ["+", "_"]
result = replace_tokens(s,tgt,rpl)
result is now ["this + _", "theme music"]

A null input element at row i produces a corresponding null entry for row i in the output column.

An empty string is allowed as a replacement string, but the delimiters around the replaced token are not removed.

Example:
s = ["this is me", "theme music"]
tgt = ["me", "this"]
rpl = ["", ""]
result = replace_tokens(s,tgt,rpl)
result is now [" is ", "theme music"]
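
The token-matching rule above can be modeled on the host in the same spirit. This is a sketch under stated assumptions, not the libcudf implementation: `replace_tokens_model` and `check_replace_examples` are hypothetical names, the code handles one `std::string` at a time, supports only the default whitespace delimiter, and requires `targets` and `replacements` to have the same length.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side model of the documented behavior; this is NOT libcudf code.
// Assumptions: default whitespace delimiter, and one replacement per target.
std::string replace_tokens_model(std::string const& s,
                                 std::vector<std::string> const& targets,
                                 std::vector<std::string> const& replacements)
{
  assert(targets.size() == replacements.size());  // assumption of this sketch
  std::string out;
  std::size_t i = 0;
  while (i < s.size()) {
    // Whitespace is any character with code point <= ' '; copy it through unchanged.
    if (static_cast<unsigned char>(s[i]) <= ' ') {
      out.push_back(s[i++]);
      continue;
    }
    // Extract the whole token, then compare it against every target.
    std::size_t const begin = i;
    while (i < s.size() && static_cast<unsigned char>(s[i]) > ' ') { ++i; }
    std::string const token = s.substr(begin, i - begin);
    auto const it = std::find(targets.begin(), targets.end(), token);
    if (it != targets.end()) {
      // targets[k] matched, so emit replacements[k].
      out += replacements[static_cast<std::size_t>(it - targets.begin())];
    } else {
      out += token;
    }
  }
  return out;
}

// Reproduces the two documented examples.
void check_replace_examples()
{
  std::vector<std::string> const tgt1{"is", "me"};
  std::vector<std::string> const rpl1{"+", "_"};
  assert(replace_tokens_model("this is me", tgt1, rpl1) == "this + _");
  assert(replace_tokens_model("theme music", tgt1, rpl1) == "theme music");

  std::vector<std::string> const tgt2{"me", "this"};
  std::vector<std::string> const rpl2{"", ""};
  assert(replace_tokens_model("this is me", tgt2, rpl2) == " is ");
  assert(replace_tokens_model("theme music", tgt2, rpl2) == "theme music");
}
```

Matching is done on whole tokens, so "theme" is left alone even though it contains "me", and "this" is not affected by the target "is".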