Nvtext Replace
- group Replacing
Functions
-
std::unique_ptr<cudf::column> replace_tokens(cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter = cudf::string_scalar{""}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Replaces specified tokens with corresponding replacement strings.
Tokens are identified in each string, and any token that matches one of the specified targets strings is replaced with the corresponding replacements string: if targets[i] is found, it is replaced by replacements[i].
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
Example:
s = ["this is me", "theme music"]
tgt = ["is", "me"]
rpl = ["+", "_"]
result = replace_tokens(s,tgt,rpl)
result is now ["this + _", "theme music"]
A null input element at row i produces a corresponding null entry for row i in the output column.
An empty string is allowed for a replacement string, but the delimiters will not be removed.
Example:
s = ["this is me", "theme music"]
tgt = ["me", "this"]
rpl = ["", ""]
result = replace_tokens(s,tgt,rpl)
result is now [" is ", "theme music"]
Note the first string in result still retains the space delimiters.
The replacements.size() must equal targets.size() unless replacements.size()==1. When replacements.size()==1, all matching targets strings will be replaced with the single replacements[0] string.
- Throws:
cudf::logic_error – if targets.size() != replacements.size() and replacements.size() != 1
cudf::logic_error – if targets or replacements contain nulls
cudf::logic_error – if delimiter is invalid
- Parameters:
input – Strings column to replace
targets – Strings to compare against tokens found in input
replacements – Replacement strings for each string in targets
delimiter – Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with replaced strings
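The token-matching rules described above can be modeled on the host for a single string. The following is only a sketch of the documented semantics using std::string, not libcudf's device implementation. The names replace_tokens_host and delimiter_length_at are hypothetical, and treating a non-empty delimiter as one whole string (rather than a set of characters) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: length of the delimiter that starts at `pos`, or 0 if
// no delimiter starts there. An empty `delimiter` means whitespace, i.e. any
// single byte with code point <= ' '.
static std::size_t delimiter_length_at(std::string const& s,
                                       std::size_t pos,
                                       std::string const& delimiter)
{
  assert(pos < s.size());
  if (delimiter.empty()) { return static_cast<unsigned char>(s[pos]) <= ' ' ? 1 : 0; }
  // Assumption: a non-empty delimiter must match as a whole string.
  return s.compare(pos, delimiter.size(), delimiter) == 0 ? delimiter.size() : 0;
}

// Hypothetical host-side model of replace_tokens for one input string.
std::string replace_tokens_host(std::string const& input,
                                std::vector<std::string> const& targets,
                                std::vector<std::string> const& replacements,
                                std::string const& delimiter = "")
{
  // Mirrors the documented size rule: equal sizes, or a single replacement.
  if (replacements.size() != targets.size() && replacements.size() != 1) {
    throw std::logic_error("replacements must match targets in size or hold one string");
  }
  std::string output;
  std::size_t pos = 0;
  while (pos < input.size()) {
    std::size_t const dlen = delimiter_length_at(input, pos, delimiter);
    if (dlen > 0) {
      // Delimiters are copied through unchanged, even next to removed tokens.
      output.append(input, pos, dlen);
      pos += dlen;
      continue;
    }
    // Scan to the end of the current token.
    std::size_t end = pos;
    while (end < input.size() && delimiter_length_at(input, end, delimiter) == 0) { ++end; }
    std::string token = input.substr(pos, end - pos);
    // Only whole tokens are compared, so "me" does not match inside "theme".
    for (std::size_t i = 0; i < targets.size(); ++i) {
      if (token == targets[i]) {
        token = replacements.size() == 1 ? replacements[0] : replacements[i];
        break;
      }
    }
    output += token;
    pos = end;
  }
  return output;
}
```

Calling replace_tokens_host("this is me", {"is", "me"}, {"+", "_"}) reproduces the first example row above. The real API operates on whole columns on the device and also propagates null rows, which this sketch does not model.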
-
std::unique_ptr<cudf::column> filter_tokens(cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement = cudf::string_scalar{""}, cudf::string_scalar const &delimiter = cudf::string_scalar{""}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Removes tokens whose lengths are less than a specified number of characters.
Tokens identified in each string that are shorter than min_token_length are removed from the corresponding output string. The removed tokens can be replaced by specifying a replacement string as well.
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
Example:
s = ["this is me", "theme music"]
result = filter_tokens(s,3)
result is now ["this  ", "theme music"]
Note the first string in result still retains the space delimiters.
Example with a replacement string.
Example:
s = ["this is me", "theme music"]
result = filter_tokens(s,5,"---")
result is now ["--- --- ---", "theme music"]
The replacement string is allowed to be shorter than min_token_length.
- Throws:
cudf::logic_error – if delimiter or replacement is invalid
- Parameters:
input – Strings column to filter
min_token_length – The minimum number of characters to retain a token in the output string
replacement – Optional replacement string to be used in place of removed tokens
delimiter – Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column of filtered strings
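The filtering rule can likewise be modeled on the host for a single string. This is a sketch of the documented behavior, not libcudf's implementation. The names filter_tokens_host, count_characters, and delimiter_length_at are hypothetical. Counting token length in UTF-8 code points and matching a non-empty delimiter as one whole string are both assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: length of the delimiter that starts at `pos`, or 0 if
// no delimiter starts there. An empty `delimiter` means whitespace, i.e. any
// single byte with code point <= ' '.
static std::size_t delimiter_length_at(std::string const& s,
                                       std::size_t pos,
                                       std::string const& delimiter)
{
  assert(pos < s.size());
  if (delimiter.empty()) { return static_cast<unsigned char>(s[pos]) <= ' ' ? 1 : 0; }
  // Assumption: a non-empty delimiter must match as a whole string.
  return s.compare(pos, delimiter.size(), delimiter) == 0 ? delimiter.size() : 0;
}

// Assumption: "characters" means UTF-8 code points, counted here by skipping
// continuation bytes (bit pattern 10xxxxxx).
static std::size_t count_characters(std::string const& token)
{
  std::size_t count = 0;
  for (unsigned char byte : token) {
    if ((byte & 0xC0) != 0x80) { ++count; }
  }
  return count;
}

// Hypothetical host-side model of filter_tokens for one input string.
std::string filter_tokens_host(std::string const& input,
                               std::size_t min_token_length,
                               std::string const& replacement = "",
                               std::string const& delimiter   = "")
{
  std::string output;
  std::size_t pos = 0;
  while (pos < input.size()) {
    std::size_t const dlen = delimiter_length_at(input, pos, delimiter);
    if (dlen > 0) {
      // Delimiters are always kept, even when the adjacent token is removed.
      output.append(input, pos, dlen);
      pos += dlen;
      continue;
    }
    // Scan to the end of the current token.
    std::size_t end = pos;
    while (end < input.size() && delimiter_length_at(input, end, delimiter) == 0) { ++end; }
    std::string const token = input.substr(pos, end - pos);
    // Tokens shorter than the minimum are swapped for the replacement string.
    output += count_characters(token) < min_token_length ? replacement : token;
    pos = end;
  }
  return output;
}
```

Calling filter_tokens_host("this is me", 5, "---") reproduces the replacement example above, and filter_tokens_host("theme music", 5, "---") leaves the string unchanged because both tokens are exactly five characters long.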