Strings Replace

group strings_replace

Functions

std::unique_ptr<column> replace(strings_column_view const &input, string_scalar const &target, string_scalar const &repl, cudf::size_type maxrepl = -1, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Replaces target string within each string with the specified replacement string.

This function searches each string in the column for the target string. If found, the target string is replaced by the repl string within the input string. If not found, the output entry is just a copy of the corresponding input string.

Specifying an empty string for repl removes each occurrence of the target string found in each string.

Null string entries will return null output string entries.

Example:
s = ["hello", "goodbye"]
r1 = replace(s,"o","OOO")
r1 is now ["hellOOO","gOOOOOOdbye"]
r2 = replace(s,"oo","")
r2 is now ["hello","gdbye"]
Throws:

cudf::logic_error – if target is an empty string.

Parameters:
  • input – Strings column for this operation

  • target – String to search for within each string

  • repl – Replacement string if target is found

  • maxrepl – Maximum times to replace if target appears multiple times in the input string. Default of -1 specifies replace all occurrences of target in each string.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
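
The replacement rules above can be sketched on the host with plain std::string. This is only a model of the documented behavior, not libcudf code: the names replace_model and row are hypothetical, std::optional stands in for a nullable row, and treating a maxrepl of 0 as "replace nothing" is an assumption of the sketch.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side stand-in for one nullable row of a strings column.
using row = std::optional<std::string>;

// Models the documented behavior of replace(); not the libcudf implementation.
std::vector<row> replace_model(std::vector<row> const& input,
                               std::string const& target,
                               std::string const& repl,
                               int maxrepl = -1)
{
  if (target.empty()) { throw std::invalid_argument("target must not be empty"); }
  std::vector<row> output;
  output.reserve(input.size());
  for (auto const& in : input) {
    if (!in) {  // null entries stay null
      output.push_back(std::nullopt);
      continue;
    }
    std::string result;
    std::size_t pos = 0;
    int remaining   = maxrepl;  // a negative value means "no limit"
    while (remaining != 0) {
      auto const found = in->find(target, pos);
      if (found == std::string::npos) { break; }
      result.append(*in, pos, found - pos);  // text before the match
      result += repl;
      pos = found + target.size();
      if (remaining > 0) { --remaining; }
    }
    result.append(*in, pos, std::string::npos);  // unmatched tail
    output.push_back(std::move(result));
  }
  return output;
}
```

Calling replace_model on ["hello", "goodbye"] with target "o" and repl "OOO" reproduces the r1 result from the example; passing 1 for maxrepl replaces only the first occurrence in each string.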

std::unique_ptr<column> replace_slice(strings_column_view const &input, string_scalar const &repl = string_scalar(""), size_type start = 0, size_type stop = -1, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

For each string in the column, this function replaces the characters in the [start,stop) position range with the provided repl string.

Null string entries will return null output string entries.

Position values are 0-based meaning position 0 is the first character of each string.

This function can be used to insert a string at a specific position by specifying the same position value for start and stop. The repl string can be appended to each string by specifying -1 for both start and stop.

Example:
s = ["abcdefghij","0123456789"]
r = replace_slice(s,"z",2,5)
r is now ["abzfghij", "01z56789"]
Throws:

cudf::logic_error – if start is greater than stop.

Parameters:
  • input – Strings column for this operation.

  • repl – Replacement string for the specified position range. Default is an empty string.

  • start – Start position where repl will be added. Default is 0, first character position.

  • stop – End position (exclusive) to use for replacement. Default of -1 specifies the end of each string.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
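
The position handling described above (replace a range, insert at a position, append) can be illustrated with a single-row host model. It is a sketch under stated assumptions, not libcudf code: replace_slice_model is a hypothetical name, positions are counted in bytes (so it matches character positions only for ASCII input), and both the clamping of out-of-range positions and applying the start/stop check only when stop is non-negative are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <stdexcept>
#include <string>

// Models replace_slice() for one non-null row; not the libcudf implementation.
std::string replace_slice_model(std::string const& in,
                                std::string const& repl = "",
                                int start               = 0,
                                int stop                = -1)
{
  // Assumption: the range check applies only when stop is an explicit position.
  if (stop >= 0 && start > stop) { throw std::invalid_argument("start must not exceed stop"); }
  auto const len = static_cast<int>(in.size());
  // Assumption: negative or out-of-range positions mean "end of string".
  int const begin = (start < 0 || start > len) ? len : start;
  int end         = (stop < 0 || stop > len) ? len : stop;
  end             = std::max(begin, end);
  assert(begin <= end);  // invariant after clamping
  return in.substr(0, static_cast<std::size_t>(begin)) + repl +
         in.substr(static_cast<std::size_t>(end));
}
```

With this model, replace_slice_model("abcdefghij", "z", 2, 5) yields "abzfghij", equal start and stop insert repl without removing anything, and -1 for both appends repl.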

std::unique_ptr<column> replace_multiple(strings_column_view const &input, strings_column_view const &targets, strings_column_view const &repls, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Replaces substrings matching a list of targets with the corresponding replacement strings.

For each string in input, each entry in targets is searched for within that string. If a target string is found, it is replaced by the corresponding entry in the repls column. All occurrences found in each string are replaced.

This does not use regex to match targets in the string. Empty string targets are ignored.

Null string entries will return null output string entries.

The repls argument can optionally contain a single string. In this case, all matching target substrings will be replaced by that single string.

Example:
s = ["hello", "goodbye"]
tgts = ["e","o"]
repls = ["EE","OO"]
r1 = replace_multiple(s,tgts,repls)
r1 is now ["hEEllOO", "gOOOOdbyEE"]
tgts = ["e","oo"]
repls = ["33",""]
r2 = replace_multiple(s,tgts,repls)
r2 is now ["h33llo", "gdby33"]
Throws:
  • cudf::logic_error – if targets and repls differ in size, unless repls contains a single string.

  • cudf::logic_error – if targets or repls contain null entries.

Parameters:
  • input – Strings column for this operation

  • targets – Strings to search for in each string

  • repls – Corresponding replacement strings for target strings

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
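
The multi-target behavior can be modeled on the host for a single row as follows. This is a sketch, not libcudf code: replace_multiple_model is a hypothetical name, and the matching order (scan left to right, and at each position take the first listed target that matches) is an assumption that happens to reproduce the examples above.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Models replace_multiple() for one non-null row; not the libcudf implementation.
std::string replace_multiple_model(std::string const& in,
                                   std::vector<std::string> const& targets,
                                   std::vector<std::string> const& repls)
{
  if (repls.size() != 1 && repls.size() != targets.size()) {
    throw std::invalid_argument("repls must match targets in size or hold a single string");
  }
  std::string out;
  std::size_t pos = 0;
  while (pos < in.size()) {
    bool matched = false;
    for (std::size_t t = 0; t < targets.size(); ++t) {
      auto const& tgt = targets[t];
      if (tgt.empty()) { continue; }  // empty targets are ignored
      if (in.compare(pos, tgt.size(), tgt) == 0) {
        out += (repls.size() == 1) ? repls[0] : repls[t];  // single repl applies to all targets
        pos += tgt.size();
        matched = true;
        break;
      }
    }
    if (!matched) { out += in[pos++]; }  // copy the unmatched character
  }
  return out;
}
```

For "hello" with targets ["e","o"] and repls ["EE","OO"], the model returns "hEEllOO"; with a single-entry repls such as ["_"], both targets are replaced by "_".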

std::unique_ptr<column> replace(strings_column_view const &input, strings_column_view const &targets, strings_column_view const &repls, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Replaces substrings matching a list of targets with the corresponding replacement strings.

For each string in input, each entry in targets is searched for within that string. If a target string is found, it is replaced by the corresponding entry in the repls column. All occurrences found in each string are replaced.

This does not use regex to match targets in the string. Empty string targets are ignored.

Null string entries will return null output string entries.

The repls argument can optionally contain a single string. In this case, all matching target substrings will be replaced by that single string.

Example:
s = ["hello", "goodbye"]
tgts = ["e","o"]
repls = ["EE","OO"]
r1 = replace(s,tgts,repls)
r1 is now ["hEEllOO", "gOOOOdbyEE"]
tgts = ["e","oo"]
repls = ["33",""]
r2 = replace(s,tgts,repls)
r2 is now ["h33llo", "gdby33"]

Deprecated:

since 24.08

Throws:
  • cudf::logic_error – if targets and repls differ in size, unless repls contains a single string.

  • cudf::logic_error – if targets or repls contain null entries.

Parameters:
  • input – Strings column for this operation

  • targets – Strings to search for in each string

  • repls – Corresponding replacement strings for target strings

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column

std::unique_ptr<column> replace_re(strings_column_view const &input, regex_program const &prog, string_scalar const &replacement = string_scalar(""), std::optional<size_type> max_replace_count = std::nullopt, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

For each string, replaces any character sequence matching the given regex with the provided replacement string.

Any null string entries return corresponding null output column entries.

See the Regex Features page for details on patterns supported by this API.

Parameters:
  • input – Strings instance for this operation

  • prog – Regex program instance

  • replacement – The string used to replace the matched sequence in each string. Default is an empty string.

  • max_replace_count – The maximum number of times to replace the matched pattern within each string. Default replaces every substring that is matched.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
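
The effect of max_replace_count can be illustrated on the host with std::regex. This is a rough model only: replace_re_model is a hypothetical name, std::regex (ECMAScript grammar) stands in for regex_program and does not share libcudf's regex dialect, and treating a count of 0 as "replace nothing" is an assumption of the sketch.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>

// Models replace_re() for one non-null row; not the libcudf implementation.
std::string replace_re_model(std::string const& in,
                             std::regex const& prog,
                             std::string const& replacement       = "",
                             std::optional<int> max_replace_count = std::nullopt)
{
  std::string out;
  auto tail = in.cbegin();  // start of the text not yet copied to out
  int count = 0;
  for (std::sregex_iterator it(in.cbegin(), in.cend(), prog), last; it != last; ++it) {
    if (max_replace_count && count >= *max_replace_count) { break; }
    auto const& m = *it;
    out.append(tail, m[0].first);  // text between the previous match and this one
    out += replacement;            // inserted literally, no back-references
    tail = m[0].second;
    ++count;
  }
  out.append(tail, in.cend());  // remaining text after the last replaced match
  return out;
}
```

For the input "a1b22c333" and the pattern [0-9]+, the model returns "a#b#c#" with replacement "#" and no limit, and "a#b#c333" when max_replace_count is 2.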

std::unique_ptr<column> replace_re(strings_column_view const &input, std::vector<std::string> const &patterns, strings_column_view const &replacements, regex_flags const flags = regex_flags::DEFAULT, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

For each string, replaces any character sequence matching the given patterns with the corresponding string in the replacements column.

Any null string entries return corresponding null output column entries.

See the Regex Features page for details on patterns supported by this API.

Parameters:
  • input – Strings instance for this operation

  • patterns – The regular expression patterns to search within each string

  • replacements – The strings used for replacement

  • flags – Regex flags for interpreting special characters in the patterns

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column

std::unique_ptr<column> replace_with_backrefs(strings_column_view const &input, regex_program const &prog, std::string_view replacement, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

For each string, replaces any character sequence matching the given regex using the replacement template for back-references.

Any null string entries return corresponding null output column entries.

See the Regex Features page for details on patterns supported by this API.

Throws:

cudf::logic_error – if a capture index in replacement is outside the range 0-99, or if it exceeds the number of capture groups specified in the pattern

Parameters:
  • input – Strings instance for this operation

  • prog – Regex program instance

  • replacement – The replacement template for creating the output string

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
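
The idea of a replacement template with back-references, including the check that a capture index does not exceed the pattern's group count, can be sketched on the host with std::regex. This is an illustration under assumptions, not libcudf code: replace_with_backrefs_model is a hypothetical name, and the template uses std::regex's own $N syntax, which is not necessarily the syntax libcudf accepts.

```cpp
#include <cassert>
#include <cctype>
#include <regex>
#include <stdexcept>
#include <string>

// Models replace_with_backrefs() for one non-null row using std::regex "$N" templates.
std::string replace_with_backrefs_model(std::string const& in,
                                        std::string const& pattern,
                                        std::string const& replacement)
{
  std::regex const prog(pattern);
  // Reject templates that reference a group the pattern does not define.
  for (std::size_t i = 0; i + 1 < replacement.size(); ++i) {
    if (replacement[i] != '$') { continue; }
    if (replacement[i + 1] == '$') {  // "$$" is a literal dollar sign
      ++i;
      continue;
    }
    std::size_t j  = i + 1;
    unsigned index = 0;
    // Read at most two digits, so indices stay within 0-99.
    while (j < replacement.size() && j < i + 3 &&
           std::isdigit(static_cast<unsigned char>(replacement[j]))) {
      index = index * 10 + static_cast<unsigned>(replacement[j] - '0');
      ++j;
    }
    if (j > i + 1 && index > prog.mark_count()) {
      throw std::invalid_argument("capture index exceeds group count");
    }
  }
  return std::regex_replace(in, prog, replacement);
}
```

With the pattern (\d+)-(\d+) and the template "$2/$1", the input "2024-05" becomes "05/2024"; a template that references "$3" against the same two-group pattern is rejected.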