Nvtext Replace
- group nvtext_replace
Functions
-
std::unique_ptr<cudf::column> replace_tokens(cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter = cudf::string_scalar{""}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Replaces specified tokens with corresponding replacement strings.
Tokens are identified in each string and if any match the specified targets strings, they are replaced with the corresponding replacements string such that if targets[i] is found, then it is replaced by replacements[i].
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
Example:
s = ["this is me", "theme music"]
tgt = ["is", "me"]
rpl = ["+", "_"]
result = replace_tokens(s,tgt,rpl)
result is now ["this + _", "theme music"]
A null input element at row i produces a corresponding null entry for row i in the output column.
An empty string is allowed for a replacement string but the delimiters will not be removed.
Example:
s = ["this is me", "theme music"]
tgt = ["me", "this"]
rpl = ["", ""]
result = replace_tokens(s,tgt,rpl)
result is now [" is ", "theme music"]
Note the first string in result still retains the space delimiters.
The replacements.size() must equal targets.size() unless replacements.size()==1. In this case, all matching targets strings will be replaced with the single replacements[0] string.
- Throws:
cudf::logic_error – if targets.size() != replacements.size() and replacements.size() != 1
cudf::logic_error – if targets or replacements contain nulls
cudf::logic_error – if delimiter is invalid
- Parameters:
input – Strings column to replace
targets – Strings to compare against tokens found in input
replacements – Replacement strings for each string in targets
delimiter – Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with replaced strings
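The token-matching rules described above can be illustrated with a small host-only model. This is a sketch under stated assumptions, not libcudf's implementation: `Row`, `delimiter_length_at`, and `replace_tokens_model` are hypothetical names introduced here, the code runs on the CPU with `std::string`, it assumes a multi-character delimiter is matched as a whole string, and it does not model the stream or memory-resource parameters.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for one nullable row of a strings column.
using Row = std::optional<std::string>;

// Byte length of the delimiter starting at `pos`, or 0 if none starts there.
// An empty `delim` selects the whitespace rule: any byte <= ' '.
// Assumption: a non-empty delimiter is matched as a whole string.
inline std::size_t delimiter_length_at(std::string const& s,
                                       std::size_t pos,
                                       std::string const& delim)
{
  if (delim.empty()) { return static_cast<unsigned char>(s[pos]) <= ' ' ? 1 : 0; }
  return s.compare(pos, delim.size(), delim) == 0 ? delim.size() : 0;
}

// CPU model of the documented replace_tokens behavior (not the libcudf code).
std::vector<Row> replace_tokens_model(std::vector<Row> const& input,
                                      std::vector<std::string> const& targets,
                                      std::vector<std::string> const& replacements,
                                      std::string const& delimiter = "")
{
  if (targets.size() != replacements.size() && replacements.size() != 1) {
    throw std::logic_error("replacements must match targets in size, or have size 1");
  }
  std::vector<Row> output;
  output.reserve(input.size());
  for (auto const& row : input) {
    // A null input row yields a null output row.
    if (!row) { output.push_back(std::nullopt); continue; }
    std::string const& s = *row;
    std::string result;
    std::size_t pos = 0;
    while (pos < s.size()) {
      // Delimiters are copied through unchanged, including consecutive ones.
      if (auto const dlen = delimiter_length_at(s, pos, delimiter); dlen > 0) {
        result.append(s, pos, dlen);
        pos += dlen;
        continue;
      }
      // Extend the token up to the next delimiter or the end of the string.
      std::size_t end = pos;
      while (end < s.size() && delimiter_length_at(s, end, delimiter) == 0) { ++end; }
      std::string token = s.substr(pos, end - pos);
      // Replace the token if it equals targets[i]; a single replacement serves all targets.
      for (std::size_t i = 0; i < targets.size(); ++i) {
        if (token == targets[i]) {
          token = replacements.size() == 1 ? replacements[0] : replacements[i];
          break;
        }
      }
      result += token;
      pos = end;
    }
    output.push_back(std::move(result));
  }
  return output;
}
```

Calling `replace_tokens_model` with the two examples above reproduces the documented results, and passing a single replacement string applies it to every matching target.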
-
std::unique_ptr<cudf::column> filter_tokens(cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement = cudf::string_scalar{""}, cudf::string_scalar const &delimiter = cudf::string_scalar{""}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Removes tokens whose lengths are less than a specified number of characters.
Tokens shorter than min_token_length are removed from the corresponding output string. The removed tokens can be replaced by specifying a replacement string as well.
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
Example:
s = ["this is me", "theme music"]
result = filter_tokens(s,3)
result is now ["this  ", "theme music"]
Note the first string in result still retains the space delimiters.
Example with a replacement string:
s = ["this is me", "theme music"]
result = filter_tokens(s,5,"---")
result is now ["--- --- ---", "theme music"]
The replacement string is allowed to be shorter than min_token_length.
- Throws:
cudf::logic_error – if delimiter or replacement is invalid
- Parameters:
input – Strings column to replace
min_token_length – The minimum number of characters to retain a token in the output string
replacement – Optional replacement string to be used in place of removed tokens
delimiter – Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with replaced strings
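The filtering rule can be sketched the same way on the host. Again this is an illustrative model under assumptions, not the libcudf implementation: `Row`, `char_count`, `delimiter_length_at`, and `filter_tokens_model` are hypothetical names, token length is counted in UTF-8 code points, a multi-character delimiter is assumed to match as a whole string, and the stream and memory-resource parameters are not modeled.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for one nullable row of a strings column.
using Row = std::optional<std::string>;

// Number of UTF-8 code points in `s` (continuation bytes 10xxxxxx are skipped).
inline std::size_t char_count(std::string const& s)
{
  std::size_t n = 0;
  for (unsigned char b : s) {
    if ((b & 0xC0) != 0x80) { ++n; }
  }
  return n;
}

// Byte length of the delimiter starting at `pos`, or 0 if none starts there.
// An empty `delim` selects the whitespace rule: any byte <= ' '.
inline std::size_t delimiter_length_at(std::string const& s,
                                       std::size_t pos,
                                       std::string const& delim)
{
  if (delim.empty()) { return static_cast<unsigned char>(s[pos]) <= ' ' ? 1 : 0; }
  return s.compare(pos, delim.size(), delim) == 0 ? delim.size() : 0;
}

// CPU model of the documented filter_tokens behavior (not the libcudf code).
std::vector<Row> filter_tokens_model(std::vector<Row> const& input,
                                     int min_token_length,
                                     std::string const& replacement = "",
                                     std::string const& delimiter   = "")
{
  std::vector<Row> output;
  output.reserve(input.size());
  for (auto const& row : input) {
    // Assumption: null rows stay null, as documented for replace_tokens.
    if (!row) { output.push_back(std::nullopt); continue; }
    std::string const& s = *row;
    std::string result;
    std::size_t pos = 0;
    while (pos < s.size()) {
      // Delimiters are always kept, even next to a removed token.
      if (auto const dlen = delimiter_length_at(s, pos, delimiter); dlen > 0) {
        result.append(s, pos, dlen);
        pos += dlen;
        continue;
      }
      std::size_t end = pos;
      while (end < s.size() && delimiter_length_at(s, end, delimiter) == 0) { ++end; }
      std::string const token = s.substr(pos, end - pos);
      // Tokens shorter than the minimum are swapped for `replacement` (empty by default).
      bool const too_short = static_cast<long long>(char_count(token)) < min_token_length;
      result += too_short ? replacement : token;
      pos = end;
    }
    output.push_back(std::move(result));
  }
  return output;
}
```

With the inputs from the examples above, the model returns `"this  "` (both space delimiters retained) for a minimum length of 3, and `"--- --- ---"` for a minimum length of 5 with `"---"` as the replacement.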