Replacing

Files

file  nvtext/replace.hpp
 

Functions

std::unique_ptr< cudf::column > nvtext::replace_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Replaces specified tokens with corresponding replacement strings.
 
std::unique_ptr< cudf::column > nvtext::filter_tokens (cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement=cudf::string_scalar{""}, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Removes tokens whose lengths are less than a specified number of characters.
 

Detailed Description

Function Documentation

filter_tokens()

std::unique_ptr<cudf::column> nvtext::filter_tokens ( cudf::strings_column_view const &  input,
cudf::size_type  min_token_length,
cudf::string_scalar const &  replacement = cudf::string_scalar{""},
cudf::string_scalar const &  delimiter = cudf::string_scalar{""},
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Removes tokens whose lengths are less than a specified number of characters.

Tokens are identified in each string, and any token shorter than min_token_length is removed from the corresponding output string. Removed tokens can optionally be replaced by specifying a replacement string.

The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
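The delimiter rules above can be sketched on the host in plain C++. This is only an illustration of the documented behaviour, not libcudf's implementation: `token_spans` and `token_span` are hypothetical names, the input is assumed to be ASCII (so bytes and characters coincide), and a multi-character delimiter is assumed to be matched as one whole substring.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Host-side illustration only; this is not libcudf code.
// Assumption: ASCII input, so byte offsets equal character offsets.
using token_span = std::pair<std::size_t, std::size_t>;  // [begin, end) offsets

// Hypothetical helper: returns the span of every token in `s`.
// An empty delimiter means "split on whitespace" (any byte <= ' ').
std::vector<token_span> token_spans(std::string const& s, std::string const& delimiter = "")
{
  std::vector<token_span> spans;
  std::size_t pos = 0;
  while (pos < s.size()) {
    std::size_t end  = pos;  // one past the last byte of the candidate token
    std::size_t skip = 1;    // number of delimiter bytes to step over
    if (delimiter.empty()) {
      while (end < s.size() && static_cast<unsigned char>(s[end]) > ' ') { ++end; }
    } else {
      end = s.find(delimiter, pos);
      if (end == std::string::npos) { end = s.size(); }
      skip = delimiter.size();
    }
    // Consecutive delimiters yield an empty candidate, which is ignored.
    if (end > pos) { spans.push_back({pos, end}); }
    pos = end + skip;
  }
  return spans;
}
```

Under these assumptions, `token_spans("a  b")` reports two tokens even though two spaces separate them, which is the sense in which consecutive delimiters are ignored.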

Example:
s = ["this is me", "theme music"]
result = filter_tokens(s,3)
result is now ["this  ", "theme music"]

Note the first string in result still retains the space delimiters.

Example with a replacement string:
s = ["this is me", "theme music"]
result = filter_tokens(s,5,"---")
result is now ["--- --- ---", "theme music"]

The replacement string is allowed to be shorter than min_token_length.
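As a rough host-side reference for the filtering behaviour described above, the following sketch reproduces both examples in standard C++. It is not the libcudf implementation: `filter_tokens_host` is a hypothetical name, only the default whitespace delimiter is handled, the input is assumed to be ASCII, and null rows are not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Host-side reference sketch; this is not libcudf code.
// Assumptions: ASCII input and the default (whitespace) delimiter only.
std::vector<std::string> filter_tokens_host(std::vector<std::string> const& input,
                                            std::size_t min_token_length,
                                            std::string const& replacement = "")
{
  std::vector<std::string> output;
  output.reserve(input.size());
  for (auto const& s : input) {
    std::string out;
    std::size_t pos = 0;
    while (pos < s.size()) {
      if (static_cast<unsigned char>(s[pos]) <= ' ') {
        out += s[pos++];  // delimiters are copied through unchanged
        continue;
      }
      // Find the end of the current token.
      std::size_t end = pos;
      while (end < s.size() && static_cast<unsigned char>(s[end]) > ' ') { ++end; }
      if (end - pos < min_token_length) {
        out += replacement;  // short token: dropped, or swapped for the replacement
      } else {
        out += s.substr(pos, end - pos);  // long enough: kept as-is
      }
      pos = end;
    }
    output.push_back(out);
  }
  return output;
}
```

Because delimiters are copied through untouched, removing "is" and "me" from "this is me" leaves both spaces behind in this sketch, matching the note about retained delimiters.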

Exceptions
cudf::logic_error  if delimiter or replacement is invalid
Parameters
input             Strings column to replace
min_token_length  The minimum number of characters to retain a token in the output string
replacement       Optional replacement string to be used in place of removed tokens
delimiter         Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream            CUDA stream used for device memory operations and kernel launches
mr                Device memory resource used to allocate the returned column's device memory
Returns
New strings column of filtered strings

replace_tokens()

std::unique_ptr<cudf::column> nvtext::replace_tokens ( cudf::strings_column_view const &  input,
cudf::strings_column_view const &  targets,
cudf::strings_column_view const &  replacements,
cudf::string_scalar const &  delimiter = cudf::string_scalar{""},
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Replaces specified tokens with corresponding replacement strings.

Tokens are identified in each string, and any token that matches a string in targets is replaced with the corresponding string in replacements: if targets[i] is found, it is replaced by replacements[i].

The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.

Example:
s = ["this is me", "theme music"]
tgt = ["is", "me"]
rpl = ["+", "_"]
result = replace_tokens(s,tgt,rpl)
result is now ["this + _", "theme music"]

A null input element at row i produces a corresponding null entry for row i in the output column.

An empty string is allowed for a replacement string but the delimiters will not be removed.

Example:
s = ["this is me", "theme music"]
tgt = ["me", "this"]
rpl = ["", ""]
result = replace_tokens(s,tgt,rpl)
result is now [" is ", "theme music"]

Note the first string in result still retains the space delimiters.

replacements.size() must equal targets.size() unless replacements.size() == 1, in which case every matching targets string is replaced with the single replacements[0] string.
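The matching and size rules above can be illustrated with a small host-side sketch in standard C++. It is not the libcudf implementation: `replace_tokens_host` is a hypothetical name, only the default whitespace delimiter is handled, the input is assumed to be ASCII, null rows are not modelled, and `std::invalid_argument` stands in for `cudf::logic_error`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Host-side reference sketch; this is not libcudf code.
// Assumptions: ASCII input and the default (whitespace) delimiter only.
std::vector<std::string> replace_tokens_host(std::vector<std::string> const& input,
                                             std::vector<std::string> const& targets,
                                             std::vector<std::string> const& replacements)
{
  // Stand-in for the documented cudf::logic_error on mismatched sizes.
  if (replacements.size() != targets.size() && replacements.size() != 1) {
    throw std::invalid_argument("replacements must match targets in size or have size 1");
  }
  std::vector<std::string> output;
  output.reserve(input.size());
  for (auto const& s : input) {
    std::string out;
    std::size_t pos = 0;
    while (pos < s.size()) {
      if (static_cast<unsigned char>(s[pos]) <= ' ') {
        out += s[pos++];  // delimiters are copied through unchanged
        continue;
      }
      std::size_t end = pos;
      while (end < s.size() && static_cast<unsigned char>(s[end]) > ' ') { ++end; }
      std::string const token = s.substr(pos, end - pos);
      // Only whole tokens are compared, so "me" does not match inside "theme".
      auto const it = std::find(targets.begin(), targets.end(), token);
      if (it == targets.end()) {
        out += token;
      } else {
        auto const i = static_cast<std::size_t>(it - targets.begin());
        // A single replacement is shared by every target.
        out += replacements[replacements.size() == 1 ? 0 : i];
      }
      pos = end;
    }
    output.push_back(out);
  }
  return output;
}
```

Passing a single-element replacements vector exercises the broadcast rule: with targets ["is", "me"] and replacements ["*"], this sketch turns "this is me" into "this * *".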

Exceptions
cudf::logic_error  if targets.size() != replacements.size() and replacements.size() != 1
cudf::logic_error  if targets or replacements contain nulls
cudf::logic_error  if delimiter is invalid
Parameters
input         Strings column to replace
targets       Strings to compare against tokens found in input
replacements  Replacement strings for each string in targets
delimiter     Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream        CUDA stream used for device memory operations and kernel launches
mr            Device memory resource used to allocate the returned column's device memory
Returns
New strings column with replaced strings