Strings Modify#
- group strings_modify
Enums
-
enum class side_type#
Direction identifier for cudf::strings::strip and cudf::strings::pad functions.
Values:
-
enumerator LEFT#
strip/pad characters from the beginning of the string
-
enumerator RIGHT#
strip/pad characters from the end of the string
-
enumerator BOTH#
strip/pad characters from the beginning and end of the string
-
enum class filter_type : bool#
Removes or keeps the specified character ranges in cudf::strings::filter_characters.
Values:
-
enumerator KEEP#
All characters but those specified are removed.
-
enumerator REMOVE#
Only the specified characters are removed.
Functions
-
std::unique_ptr<column> pad(strings_column_view const &input, size_type width, side_type side = side_type::RIGHT, std::string_view fill_char = " ", rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Add padding to each string using a provided character.
If the string is already width or more characters, no padding is performed. Also, no strings are truncated.
Null string entries result in corresponding null entries in the output column.
Example:
s = ['aa','bbb','cccc','ddddd']
r = pad(s,4)
r is now ['aa  ','bbb ','cccc','ddddd']
- Parameters:
input – Strings instance for this operation
width – The minimum number of characters for each string
side – Where to place the padding characters; Default is pad right (left justify)
fill_char – Single UTF-8 character to use for padding; Default is the space character
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with padded strings
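The padding rule above can be modeled on the host for a single string. This is a minimal sketch, not libcudf code: `pad_one` and `pad_side` are hypothetical names, the string is assumed to be ASCII (so bytes equal characters), and `BOTH` is left out because the text above does not say how an odd number of fill characters is split.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for the LEFT and RIGHT values of side_type.
enum class pad_side { LEFT, RIGHT };

// Host-side model of the documented rule for one ASCII string.
std::string pad_one(std::string const& s, std::size_t width,
                    pad_side side = pad_side::RIGHT, char fill = ' ')
{
  // Already `width` or more characters: returned unchanged, never truncated.
  if (s.size() >= width) { return s; }
  std::string const padding(width - s.size(), fill);
  // LEFT puts the fill before the text, RIGHT puts it after.
  return side == pad_side::LEFT ? padding + s : s + padding;
}
```

With these assumptions, `pad_one("aa", 4)` yields `"aa  "` and `pad_one("ddddd", 4)` returns its input unchanged, matching the example above.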
-
std::unique_ptr<column> zfill(strings_column_view const &input, size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Add ‘0’ as padding to the left of each string.
This is equivalent to pad(width,left,'0') but preserves the sign character if it appears in the first position.
If the string is already width or more characters, no padding is performed. No strings are truncated.
Null rows in the input result in corresponding null rows in the output column.
Example:
s = ['1234','-9876','+0.34','-342567', '2+2']
r = zfill(s,6)
r is now ['001234','-09876','+00.34','-342567', '0002+2']
- Parameters:
input – Strings instance for this operation
width – The minimum number of characters for each string
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of strings
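The sign-preserving behavior is easiest to see in a host-side model for one string. The sketch below is illustrative only: `zfill_one` is a hypothetical helper, not part of libcudf, and it assumes ASCII input.

```cpp
#include <cstddef>
#include <string>

// Host-side model of the documented zfill rule for one ASCII string.
std::string zfill_one(std::string const& s, std::size_t width)
{
  // Already `width` or more characters: returned unchanged.
  if (s.size() >= width) { return s; }
  std::string const zeros(width - s.size(), '0');
  // A sign counts only when it is the first character of the string.
  bool const has_sign = !s.empty() && (s[0] == '+' || s[0] == '-');
  // Keep the sign in front and insert the zeros right after it.
  return has_sign ? s.substr(0, 1) + zeros + s.substr(1) : zeros + s;
}
```

Note that `"2+2"` becomes `"0002+2"`: the `+` is not in the first position, so it is treated as an ordinary character.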
-
std::unique_ptr<column> reverse(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Reverses the characters within each string.
Any null string entries return corresponding null output column entries.
Example:
s = ["abcdef", "12345", "", "A"]
r = reverse(s)
r is now ["fedcba", "54321", "", "A"]
- Parameters:
input – Strings column for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
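Because the function reverses characters rather than bytes, a multi-byte UTF-8 character must stay intact. The following host-side sketch shows one way to do that for a single string; `reverse_utf8` is a hypothetical helper, it assumes valid UTF-8 input, and it is not the libcudf implementation.

```cpp
#include <cstddef>
#include <string>

// Reverse the characters (not the bytes) of one valid UTF-8 string.
std::string reverse_utf8(std::string const& s)
{
  std::string out;
  out.reserve(s.size());
  std::size_t end = s.size();
  while (end > 0) {
    std::size_t begin = end - 1;
    // Step back over continuation bytes (bit pattern 10xxxxxx) to the lead byte.
    while (begin > 0 && (static_cast<unsigned char>(s[begin]) & 0xC0) == 0x80) { --begin; }
    // Copy the whole character so its bytes keep their original order.
    out.append(s, begin, end - begin);
    end = begin;
  }
  return out;
}
```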
-
std::unique_ptr<column> strip(strings_column_view const &input, side_type side = side_type::BOTH, string_scalar const &to_strip = string_scalar(""), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Removes the specified characters from the beginning or end (or both) of each string.
The to_strip parameter can contain one or more characters. Any character in to_strip is removed wherever it appears at the selected end(s) of each input string.
If to_strip is the empty string, whitespace characters are removed. Whitespace is considered the space character plus control characters like tab and line feed.
Any null string entries return corresponding null output column entries.
Example:
s = [" aaa ", "_bbbb ", "__cccc ", "ddd", " ee _ff gg_"]
r = strip(s,both," _")
r is now ["aaa", "bbbb", "cccc", "ddd", "ee _ff gg"]
- Throws:
cudf::logic_error – if to_strip is invalid.
- Parameters:
input – Strings column for this operation
side – Indicates characters are to be stripped from the beginning, end, or both of each string; Default is both
to_strip – UTF-8 encoded characters to strip from each string; Default is empty string which indicates strip whitespace characters
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory.
- Returns:
New strings column.
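The stripping rule can be sketched on the host for one ASCII string. `strip_one` and `strip_side` are hypothetical names used only for illustration, and treating every byte at or below the space character as whitespace is an assumption drawn from the description above, not a confirmed detail of libcudf.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-in for side_type.
enum class strip_side { LEFT, RIGHT, BOTH };

// Host-side model of the documented strip rule for one ASCII string.
std::string strip_one(std::string const& s, strip_side side = strip_side::BOTH,
                      std::string const& to_strip = "")
{
  auto const drop = [&to_strip](char c) {
    // Assumption: empty to_strip means space plus control characters.
    if (to_strip.empty()) { return static_cast<unsigned char>(c) <= ' '; }
    return to_strip.find(c) != std::string::npos;
  };
  std::size_t first = 0;
  std::size_t last  = s.size();
  if (side != strip_side::RIGHT) {
    while (first < last && drop(s[first])) { ++first; }  // trim the beginning
  }
  if (side != strip_side::LEFT) {
    while (last > first && drop(s[last - 1])) { --last; }  // trim the end
  }
  return s.substr(first, last - first);
}
```

Characters from to_strip that sit in the middle of a string are untouched, which is why `" ee _ff gg_"` becomes `"ee _ff gg"` in the example above.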
-
std::unique_ptr<column> translate(strings_column_view const &input, std::vector<std::pair<char_utf8, char_utf8>> const &chars_table, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Translates individual characters within each string.
This can also be used to remove a character by specifying 0 for the corresponding table entry.
Null string entries result in null entries in the output column.
Example:
s = ["aa","bbb","cccc","abcd"]
t = [['a','A'],['b',''],['d','Q']]
r = translate(s,t)
r is now ["AA", "", "cccc", "AcQ"]
- Parameters:
input – Strings instance for this operation
chars_table – Table of UTF-8 character mappings
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with translated strings
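As a rough host-side illustration of the mapping rule, the sketch below translates one ASCII string. `translate_one` is a hypothetical helper and uses a `std::map<char, char>` in place of the `char_utf8` pair table; a mapped value of 0 removes the character, as described above.

```cpp
#include <map>
#include <string>

// Host-side model of the documented translate rule for one ASCII string.
std::string translate_one(std::string const& s, std::map<char, char> const& table)
{
  std::string out;
  for (char c : s) {
    auto const it = table.find(c);
    if (it == table.end()) {
      out += c;           // not in the table: copied through unchanged
    } else if (it->second != 0) {
      out += it->second;  // mapped to a character: substituted
    }
    // Mapped to 0: the character is dropped from the output.
  }
  return out;
}
```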
-
std::unique_ptr<column> filter_characters(strings_column_view const &input, std::vector<std::pair<cudf::char_utf8, cudf::char_utf8>> characters_to_filter, filter_type keep_characters = filter_type::KEEP, string_scalar const &replacement = string_scalar(""), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Removes ranges of characters from each string in a strings column.
This can also be used to keep only the specified character ranges and remove all others from each string.
Example:
s = ["aeiou", "AEIOU", "0123456789", "bcdOPQ5"]
f = [{'M','Z'}, {'a','l'}, {'4','6'}]
r1 = filter_characters(s, f)
r1 is now ["aei", "OU", "456", "bcdOPQ5"]
r2 = filter_characters(s, f, REMOVE)
r2 is now ["ou", "AEI", "0123789", ""]
r3 = filter_characters(s, f, KEEP, "*")
r3 is now ["aei**", "***OU", "****456***", "bcdOPQ5"]
Null string entries result in null entries in the output column.
- Throws:
cudf::logic_error – if replacement is invalid
- Parameters:
input – Strings instance for this operation
characters_to_filter – Table of character ranges to filter on
keep_characters – If filter_type::KEEP, the characters_to_filter are retained and all other characters are removed; if filter_type::REMOVE, only the characters_to_filter are removed
replacement – Optional replacement string for each character removed
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with filtered strings
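The interaction between the ranges, the keep/remove mode, and the replacement string can be modeled on the host. This is a sketch under stated assumptions: `filter_one` and `filter_mode` are hypothetical names, input is ASCII, and ranges are treated as inclusive at both ends, which is what the example above implies ('4' and '6' both survive).

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for filter_type.
enum class filter_mode { KEEP, REMOVE };

// Host-side model of the documented filter rule for one ASCII string.
std::string filter_one(std::string const& s,
                       std::vector<std::pair<char, char>> const& ranges,
                       filter_mode mode = filter_mode::KEEP,
                       std::string const& replacement = "")
{
  std::string out;
  for (char c : s) {
    bool in_range = false;
    for (auto const& r : ranges) {
      // Assumption: both ends of each range are inclusive.
      if (c >= r.first && c <= r.second) { in_range = true; break; }
    }
    // KEEP retains characters inside the ranges, REMOVE retains those outside.
    bool const keep = (mode == filter_mode::KEEP) ? in_range : !in_range;
    if (keep) { out += c; } else { out += replacement; }
  }
  return out;
}
```

Each removed character is replaced individually, so `filter_one("aeiou", f, filter_mode::KEEP, "*")` gives `"aei**"` for the ranges in the example.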
-
std::unique_ptr<column> wrap(strings_column_view const &input, size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Wraps strings onto multiple lines shorter than width by replacing appropriate white space with new-line characters (ASCII 0x0A).
For each string row in the input column longer than width, the corresponding output string row will have newline characters inserted so that each line is no more than width characters. Attempts to use existing white space locations to split the strings, but may split non-white-space sequences if necessary.
Any null string entries return corresponding null output column entries.
Example 1:
width = 3
input_string_tbl = [ "12345", "thesé", nullptr, "ARE THE", "tést strings", "" ]
wrapped_string_tbl = wrap(input_string_tbl, width)
wrapped_string_tbl = [ "12345", "thesé", nullptr, "ARE\nTHE", "tést\nstrings", "" ]
Example 2:
width = 12
input_string_tbl = ["the quick brown fox jumped over the lazy brown dog", "hello, world"]
wrapped_string_tbl = wrap(input_string_tbl, width)
wrapped_string_tbl = ["the quick\nbrown fox\njumped over\nthe lazy\nbrown dog", "hello, world"]
- Parameters:
input – String column
width – Maximum character width of a line within each string
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Column of wrapped strings
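A greedy host-side model reproduces the two examples above for ASCII input. It is only a sketch: `wrap_one` is a hypothetical helper, it treats only the space character as a break point, and it follows the examples in leaving a word longer than width intact rather than splitting it. The actual libcudf algorithm may differ in other cases.

```cpp
#include <cstddef>
#include <string>

// Greedy host-side model of wrapping for one ASCII string.
std::string wrap_one(std::string s, std::size_t width)
{
  std::size_t const none = std::string::npos;
  std::size_t line_start = 0;     // index of the first character of the current line
  std::size_t last_space = none;  // most recent space seen on the current line
  for (std::size_t pos = 0; pos < s.size(); ++pos) {
    if (s[pos] == ' ') { last_space = pos; }
    bool const too_long = (pos - line_start + 1) > width;
    if (too_long && last_space != none) {
      s[last_space] = '\n';        // break at the last space on this line
      line_start = last_space + 1;
      last_space = none;
    }
    // No space on an over-long line: the word is left intact.
  }
  return s;
}
```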