Files | |
file | nvtext/replace.hpp |
Functions | |
std::unique_ptr< cudf::column > | nvtext::replace_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
Replaces specified tokens with corresponding replacement strings. | |
std::unique_ptr< cudf::column > | nvtext::filter_tokens (cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement=cudf::string_scalar{""}, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
Removes tokens whose lengths are less than a specified number of characters. | |
std::unique_ptr<cudf::column> nvtext::filter_tokens (cudf::strings_column_view const &input, cudf::size_type min_token_length, cudf::string_scalar const &replacement=cudf::string_scalar{""}, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
Removes tokens whose lengths are less than a specified number of characters.
Tokens identified in each string that are shorter than min_token_length characters are removed from the corresponding output string. The removed tokens can instead be replaced by specifying a replacement string.
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
Note that removing a token does not remove the delimiters around it; they are retained in the output string.
The replacement string is allowed to be shorter than min_token_length.
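The filtering rules described above can be illustrated on the host with a small sketch. This is not the libcudf implementation: `filter_tokens_sketch` and `delimiter_length_at` are hypothetical names, the code works on a single `std::string` rather than a strings column, it measures token length in bytes (so it only matches the character-based rule for ASCII input), and it assumes a non-empty delimiter is matched as one whole substring.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical CPU-only model of the documented behavior; not libcudf code.
// Returns the length of the delimiter that starts at `pos`, or 0 if there is none.
// An empty `delimiter` means any byte <= ' ' counts as a one-byte delimiter.
static std::size_t delimiter_length_at(std::string const& s,
                                       std::size_t pos,
                                       std::string const& delimiter)
{
  if (delimiter.empty()) { return static_cast<unsigned char>(s[pos]) <= ' ' ? 1 : 0; }
  // Assumption: a multi-character delimiter must match as a whole substring.
  return s.compare(pos, delimiter.size(), delimiter) == 0 ? delimiter.size() : 0;
}

std::string filter_tokens_sketch(std::string const& input,
                                 std::size_t min_token_length,
                                 std::string const& replacement = "",
                                 std::string const& delimiter   = "")
{
  std::string output;
  std::string token;  // bytes of the token currently being read

  // Emit the pending token, or the replacement if the token is too short.
  auto flush = [&] {
    if (token.empty()) return;  // consecutive delimiters yield no token
    output += token.size() < min_token_length ? replacement : token;
    token.clear();
  };

  std::size_t pos = 0;
  while (pos < input.size()) {
    std::size_t const dlen = delimiter_length_at(input, pos, delimiter);
    if (dlen > 0) {
      flush();
      output.append(input, pos, dlen);  // delimiters are copied through unchanged
      pos += dlen;
    } else {
      token += input[pos++];
    }
  }
  flush();
  assert(token.empty());  // every pending token has been emitted
  return output;
}
```

Under these assumptions, `filter_tokens_sketch("this is me", 3)` yields `"this  "` (both space delimiters survive), and `filter_tokens_sketch("this is me", 5, "---")` yields `"--- --- ---"`, where the replacement is shorter than the minimum token length.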
Exceptions | |
cudf::logic_error | if delimiter or replacement is invalid |
Parameters | |
input | Strings column to replace |
min_token_length | The minimum number of characters to retain a token in the output string |
replacement | Optional replacement string to be used in place of removed tokens |
delimiter | Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<cudf::column> nvtext::replace_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, cudf::strings_column_view const &replacements, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
Replaces specified tokens with corresponding replacement strings.
Tokens are identified in each string, and any token that matches one of the specified targets strings is replaced with the corresponding replacements string: if targets[i] is found, it is replaced by replacements[i].
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored.
A null input element at row i produces a corresponding null entry for row i in the output column.
An empty string is allowed for a replacement string, but the delimiters around the replaced token are not removed; they are retained in the output string.
The replacements.size() must equal targets.size() unless replacements.size()==1. In this case, all matching targets strings will be replaced with the single replacements[0] string.
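The matching and size rules above can be sketched on the host as follows. This is a CPU-only illustration and not the libcudf implementation: `replace_tokens_sketch` is a hypothetical name, it processes one `std::string` instead of a strings column (so null handling is not modeled), it throws `std::logic_error` where the library reports `cudf::logic_error`, and it assumes a non-empty delimiter is matched as one whole substring.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical CPU-only model of the documented behavior; not libcudf code.
std::string replace_tokens_sketch(std::string const& input,
                                  std::vector<std::string> const& targets,
                                  std::vector<std::string> const& replacements,
                                  std::string const& delimiter = "")
{
  // Sizes must match unless a single replacement is shared by all targets.
  if (replacements.size() != targets.size() && replacements.size() != 1) {
    throw std::logic_error("replacements must match targets in size or hold one entry");
  }

  // Length of the delimiter starting at `pos`, or 0 if there is none.
  // An empty `delimiter` means any byte <= ' ' is a one-byte delimiter.
  auto delimiter_length_at = [&](std::size_t pos) -> std::size_t {
    if (delimiter.empty()) { return static_cast<unsigned char>(input[pos]) <= ' ' ? 1 : 0; }
    // Assumption: a multi-character delimiter must match as a whole substring.
    return input.compare(pos, delimiter.size(), delimiter) == 0 ? delimiter.size() : 0;
  };

  std::string output;
  std::string token;  // bytes of the token currently being read

  // Emit the pending token, or replacements[i] if it equals targets[i].
  auto flush = [&] {
    if (token.empty()) return;  // consecutive delimiters yield no token
    std::string const* result = &token;
    for (std::size_t i = 0; i < targets.size(); ++i) {
      if (targets[i] == token) {
        result = &replacements[replacements.size() == 1 ? 0 : i];
        break;
      }
    }
    output += *result;
    token.clear();
  };

  std::size_t pos = 0;
  while (pos < input.size()) {
    std::size_t const dlen = delimiter_length_at(pos);
    if (dlen > 0) {
      flush();
      output.append(input, pos, dlen);  // delimiters are copied through unchanged
      pos += dlen;
    } else {
      token += input[pos++];
    }
  }
  flush();
  assert(token.empty());  // every pending token has been emitted
  return output;
}
```

Under these assumptions, replacing targets `{"is", "me"}` with `{"+", "_"}` in `"this is me"` yields `"this + _"`, and replacing `{"me", "this"}` with two empty strings yields `" is "`, which keeps both space delimiters.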
Exceptions | |
cudf::logic_error | if targets.size() != replacements.size() and replacements.size() != 1 |
cudf::logic_error | if targets or replacements contain nulls |
cudf::logic_error | if delimiter is invalid |
Parameters | |
input | Strings column to replace |
targets | Strings to compare against tokens found in input |
replacements | Replacement strings for each string in targets |
delimiter | Characters used to separate each string into tokens. The default of empty string will identify tokens using whitespace. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |