Strings Modify

group strings_modify

Enums

enum class side_type

Direction identifier for cudf::strings::strip and cudf::strings::pad functions.

Values:

enumerator LEFT

Strip characters from, or add padding to, the beginning of the string

enumerator RIGHT

Strip characters from, or add padding to, the end of the string

enumerator BOTH

Strip characters from, or add padding to, both the beginning and the end of the string

enum class filter_type : bool

Selects whether cudf::strings::filter_characters keeps or removes the specified character ranges.

Values:

enumerator KEEP

Only the specified characters are kept; all other characters are removed.

enumerator REMOVE

Only the specified characters are removed.

Functions

std::unique_ptr<column> pad(strings_column_view const &input, size_type width, side_type side = side_type::RIGHT, std::string_view fill_char = " ", rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Add padding to each string using a provided character.

If a string already has width or more characters, it is not padded. Strings are never truncated.

Null string entries result in corresponding null entries in the output column.

Example:
s = ['aa','bbb','cccc','ddddd']
r = pad(s,4)
r is now ['aa  ','bbb ','cccc','ddddd']
Parameters:
  • input – Strings instance for this operation

  • width – The minimum number of characters for each string

  • side – Where to place the padding characters; Default is pad right (left justify)

  • fill_char – Single UTF-8 character to use for padding; Default is the space character

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with padded strings
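
The padding rule above can be sketched on the host for a single string. This is only an illustration under stated assumptions, not libcudf's implementation: `pad_one`, `check_pad_example`, and the local `side_type` are hypothetical names, the input is assumed to be ASCII (one byte per character), null rows are out of scope, and the way `BOTH` splits an odd amount of padding is a guess because it is not specified above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-side mirror of cudf::strings::side_type.
enum class side_type { LEFT, RIGHT, BOTH };

// Pads one ASCII string to at least `width` characters. Never truncates.
std::string pad_one(std::string const& s,
                    std::size_t width,
                    side_type side = side_type::RIGHT,
                    char fill      = ' ')
{
  if (s.size() >= width) { return s; }  // already wide enough: returned unchanged
  std::size_t const missing = width - s.size();
  if (side == side_type::LEFT) { return std::string(missing, fill) + s; }
  if (side == side_type::RIGHT) { return s + std::string(missing, fill); }
  // BOTH. ASSUMPTION: the split of an odd amount is not specified above;
  // this sketch puts the extra fill character on the right.
  std::size_t const left = missing / 2;
  return std::string(left, fill) + s + std::string(missing - left, fill);
}

// Mirrors the documented example: pad(s, 4) with the default side and fill.
void check_pad_example()
{
  assert(pad_one("aa", 4) == "aa  ");
  assert(pad_one("ddddd", 4) == "ddddd");
}
```

The default `side_type::RIGHT` appends the fill character, which left-justifies the text; `side_type::LEFT` prepends it instead.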

std::unique_ptr<column> zfill(strings_column_view const &input, size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Add ‘0’ as padding to the left of each string.

This is equivalent to pad(width,left,'0') but preserves the sign character if it appears in the first position.

If the string is already width or more characters, no padding is performed. No strings are truncated.

Null rows in the input result in corresponding null rows in the output column.

Example:
s = ['1234','-9876','+0.34','-342567', '2+2']
r = zfill(s,6)
r is now ['001234','-09876','+00.34','-342567', '0002+2']
Parameters:
  • input – Strings instance for this operation

  • width – The minimum number of characters for each string

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of strings
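
The sign-preserving behavior is easiest to see in a host-side sketch for one string. The names `zfill_one` and `check_zfill_example` are hypothetical, the input is assumed to be ASCII, and treating exactly '-' and '+' as sign characters is an assumption taken from the example above; this is not libcudf's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Left-pads one ASCII string with '0' up to `width` characters.
// A leading '-' or '+' stays in the first position; the zeros go after it.
std::string zfill_one(std::string const& s, std::size_t width)
{
  if (s.size() >= width) { return s; }  // no padding and no truncation
  std::string const zeros(width - s.size(), '0');
  bool const has_sign = !s.empty() && (s[0] == '-' || s[0] == '+');
  if (has_sign) { return std::string(1, s[0]) + zeros + s.substr(1); }
  return zeros + s;
}

// Mirrors rows of the documented example: zfill(s, 6).
void check_zfill_example()
{
  assert(zfill_one("-9876", 6) == "-09876");
  assert(zfill_one("2+2", 6) == "0002+2");  // '+' is not in the first position
}
```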

std::unique_ptr<column> reverse(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Reverses the characters within each string.

Any null string entries return corresponding null output column entries.

Example:
s = ["abcdef", "12345", "", "A"]
r = reverse(s)
r is now ["fedcba", "54321", "", "A"]
Parameters:
  • input – Strings column for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
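
Because the reversal is described in terms of characters rather than bytes, a multi-byte UTF-8 character has to move as a unit. The host-side sketch below illustrates that idea for a single string; `reverse_utf8` and `check_reverse_example` are hypothetical names, the input is assumed to be valid UTF-8, and nothing here is taken from libcudf's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Reverses the order of UTF-8 characters while keeping the bytes of each
// character in their original order. Assumes `s` is valid UTF-8.
std::string reverse_utf8(std::string const& s)
{
  std::string out;
  out.reserve(s.size());
  std::size_t end = s.size();
  while (end > 0) {
    std::size_t begin = end - 1;
    // Step back over continuation bytes (bit pattern 10xxxxxx) to the lead byte.
    while (begin > 0 && (static_cast<unsigned char>(s[begin]) & 0xC0) == 0x80) { --begin; }
    out.append(s, begin, end - begin);  // copy one whole character
    end = begin;
  }
  return out;
}

// Mirrors the documented example and adds one two-byte character (U+00E9).
void check_reverse_example()
{
  assert(reverse_utf8("abcdef") == "fedcba");
  assert(reverse_utf8("th\xC3\xA9") == "\xC3\xA9" "ht");
}
```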

std::unique_ptr<column> strip(strings_column_view const &input, side_type side = side_type::BOTH, string_scalar const &to_strip = string_scalar(""), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Removes the specified characters from the beginning or end (or both) of each string.

The to_strip parameter can contain one or more characters. Any of these characters found at the selected end(s) of an input string are removed; characters in the interior of the string are left in place.

If to_strip is the empty string, whitespace characters are removed. Whitespace is considered the space character plus control characters like tab and line feed.

Any null string entries return corresponding null output column entries.

Example:
s = [" aaa ", "_bbbb ", "__cccc  ", "ddd", " ee _ff gg_"]
r = strip(s,both," _")
r is now ["aaa", "bbbb", "cccc", "ddd", "ee _ff gg"]
Throws:

cudf::logic_error – if to_strip is invalid.

Parameters:
  • input – Strings column for this operation

  • side – Indicates characters are to be stripped from the beginning, end, or both of each string; Default is both

  • to_strip – UTF-8 encoded characters to strip from each string; Default is empty string which indicates strip whitespace characters

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
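
A host-side sketch of the stripping rule for one string is shown below. It is an illustration only: `strip_one`, `check_strip_example`, and the local `side_type` are hypothetical names, the input is assumed to be ASCII, and treating every byte at or below the space character as whitespace is an assumption based on the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-side mirror of cudf::strings::side_type.
enum class side_type { LEFT, RIGHT, BOTH };

// Removes leading and/or trailing characters that appear in `to_strip`.
// An empty `to_strip` means "strip whitespace".
std::string strip_one(std::string const& s,
                      side_type side              = side_type::BOTH,
                      std::string const& to_strip = "")
{
  auto const strippable = [&to_strip](char c) {
    // ASSUMPTION: whitespace is the space character and any control byte below it.
    if (to_strip.empty()) { return static_cast<unsigned char>(c) <= ' '; }
    return to_strip.find(c) != std::string::npos;
  };
  std::size_t first = 0;
  std::size_t last  = s.size();
  if (side != side_type::RIGHT) {
    while (first < last && strippable(s[first])) { ++first; }
  }
  if (side != side_type::LEFT) {
    while (last > first && strippable(s[last - 1])) { --last; }
  }
  return s.substr(first, last - first);  // interior characters are untouched
}

// Mirrors rows of the documented example: strip(s, both, " _").
void check_strip_example()
{
  assert(strip_one("__cccc  ", side_type::BOTH, " _") == "cccc");
  assert(strip_one(" ee _ff gg_", side_type::BOTH, " _") == "ee _ff gg");
}
```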

std::unique_ptr<column> translate(strings_column_view const &input, std::vector<std::pair<char_utf8, char_utf8>> const &chars_table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Translates individual characters within each string.

A character can also be removed by specifying 0 as the replacement in its table entry.

Null string entries result in null entries in the output column.

Example:
s = ["aa","bbb","cccc","abcd"]
t = [['a','A'],['b',''],['d','Q']]
r = translate(s,t)
r is now ["AA", "", "cccc", "AcQ"]
Parameters:
  • input – Strings instance for this operation

  • chars_table – Table of UTF-8 character mappings; each pair maps its first character to its second

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with translated strings
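
The table lookup, including removal through a 0 entry, can be sketched on the host for one string. `translate_one`, `char_table`, and `check_translate_example` are hypothetical names, characters are assumed to be single-byte ASCII, and letting the first matching table entry win is an assumption; this is not libcudf's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical ASCII stand-in for the vector of char_utf8 pairs.
using char_table = std::vector<std::pair<char, char>>;

// Replaces each character that has a table entry; a replacement of 0 drops it.
std::string translate_one(std::string const& s, char_table const& table)
{
  std::string out;
  out.reserve(s.size());
  for (char const c : s) {
    // ASSUMPTION: if a character is listed more than once, the first entry wins.
    auto const it = std::find_if(table.begin(), table.end(),
                                 [c](std::pair<char, char> const& e) { return e.first == c; });
    if (it == table.end()) {
      out.push_back(c);  // unmapped characters pass through unchanged
    } else if (it->second != '\0') {
      out.push_back(it->second);  // mapped to a replacement character
    }
    // mapped to 0: the character is removed
  }
  return out;
}

// Mirrors the documented example table: a -> A, b -> removed, d -> Q.
void check_translate_example()
{
  char_table const t{{'a', 'A'}, {'b', '\0'}, {'d', 'Q'}};
  assert(translate_one("abcd", t) == "AcQ");
  assert(translate_one("bbb", t) == "");
}
```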

std::unique_ptr<column> filter_characters(strings_column_view const &input, std::vector<std::pair<cudf::char_utf8, cudf::char_utf8>> characters_to_filter, filter_type keep_characters = filter_type::KEEP, string_scalar const &replacement = string_scalar(""), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Removes ranges of characters from each string in a strings column.

This can also be used to keep only the specified character ranges and remove all others from each string.

Example:
s = ["aeiou", "AEIOU", "0123456789", "bcdOPQ5"]
f = [{'M','Z'}, {'a','l'}, {'4','6'}]
r1 = filter_characters(s, f)
r1 is now ["aei", "OU", "456", "bcdOPQ5"]
r2 = filter_characters(s, f, REMOVE)
r2 is now ["ou", "AEI", "0123789", ""]
r3 = filter_characters(s, f, KEEP, "*")
r3 is now ["aei**", "***OU", "****456***", "bcdOPQ5"]

Null string entries result in null entries in the output column.

Throws:

cudf::logic_error – if replacement is invalid

Parameters:
  • input – Strings instance for this operation

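  • characters_to_filter – Table of character ranges to filter on; as in the example, both ends of each range are included

  • keep_characters – If filter_type::KEEP, the characters in characters_to_filter are retained and all other characters are removed; if filter_type::REMOVE, only the characters in characters_to_filter are removed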

  • replacement – Optional replacement string for each character removed

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with filtered strings
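
The three results in the example above follow from one per-character decision, sketched here on the host for a single string. `filter_one`, `char_ranges`, `check_filter_example`, and the local `filter_type` are hypothetical names, characters are assumed to be single-byte ASCII, and ranges are treated as inclusive because the example implies it; this is not libcudf's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side mirror of cudf::strings::filter_type.
enum class filter_type : bool { KEEP, REMOVE };

// Hypothetical ASCII stand-in for the vector of char_utf8 range pairs.
using char_ranges = std::vector<std::pair<char, char>>;

// Keeps or removes characters that fall in any [first, second] range.
// Each removed character is replaced by `replacement` (empty by default).
std::string filter_one(std::string const& s,
                       char_ranges const& ranges,
                       filter_type mode               = filter_type::KEEP,
                       std::string const& replacement = "")
{
  std::string out;
  for (char const c : s) {
    bool const in_range =
      std::any_of(ranges.begin(), ranges.end(), [c](std::pair<char, char> const& r) {
        return r.first <= c && c <= r.second;  // inclusive on both ends
      });
    bool const keep = (mode == filter_type::KEEP) ? in_range : !in_range;
    if (keep) {
      out.push_back(c);
    } else {
      out += replacement;
    }
  }
  return out;
}

// Mirrors the documented example ranges: M-Z, a-l, 4-6.
void check_filter_example()
{
  char_ranges const f{{'M', 'Z'}, {'a', 'l'}, {'4', '6'}};
  assert(filter_one("0123456789", f) == "456");
  assert(filter_one("0123456789", f, filter_type::REMOVE) == "0123789");
  assert(filter_one("0123456789", f, filter_type::KEEP, "*") == "****456***");
}
```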

std::unique_ptr<column> wrap(strings_column_view const &input, size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Wraps strings onto multiple lines no longer than width, where possible, by replacing appropriate white space with new-line characters (ASCII 0x0A).

For each string row in the input column longer than width, the corresponding output string row has new-line characters substituted at existing white space locations so that each line is no more than width characters where possible. In the examples below, a sequence with no white space that is longer than width (such as "12345" with width = 3) is returned unsplit, so a line can exceed width.

Any null string entries return corresponding null output column entries.

Example 1:

width = 3
input_string_tbl = [ "12345", "thesé", nullptr, "ARE THE", "tést strings", "" ]

wrapped_string_tbl = wrap(input_string_tbl, width)
wrapped_string_tbl = [ "12345", "thesé", nullptr, "ARE\nTHE", "tést\nstrings", "" ]

Example 2:

width = 12
input_string_tbl = ["the quick brown fox jumped over the lazy brown dog", "hello, world"]

wrapped_string_tbl = wrap(input_string_tbl, width)
wrapped_string_tbl = ["the quick\nbrown fox\njumped over\nthe lazy\nbrown dog", "hello, world"]

Parameters:
  • input – Strings column for this operation

  • width – Maximum character width of a line within each string

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

Column of wrapped strings
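
One way to reproduce both examples above is a greedy pass that turns a space into a new-line whenever the next word would push the current line past width. The sketch below does that on the host for a single string. It is not libcudf's algorithm: `wrap_one` and `check_wrap_example` are hypothetical names, only the space character is treated as white space, and widths are counted in bytes, so the input is assumed to be ASCII.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Greedy wrap: replaces a space with '\n' when the word that follows it
// would not fit on the current line. Words are never split, so a word longer
// than `width` produces a line longer than `width`.
std::string wrap_one(std::string s, std::size_t width)
{
  std::size_t line_start = 0;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] != ' ') { continue; }
    // Find where the next word ends (next space or end of string).
    std::size_t word_end = s.find(' ', i + 1);
    if (word_end == std::string::npos) { word_end = s.size(); }
    if (word_end - line_start > width) {
      s[i]       = '\n';  // break here; the next word starts a new line
      line_start = i + 1;
    }
  }
  return s;
}

// Mirrors rows of the documented examples.
void check_wrap_example()
{
  assert(wrap_one("ARE THE", 3) == "ARE\nTHE");
  assert(wrap_one("12345", 3) == "12345");  // no white space: left as is
  assert(wrap_one("hello, world", 12) == "hello, world");
}
```

Because a space is replaced rather than a new-line inserted, the output always has the same length as the input.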