Strings column APIs. More...
Enumerations | |
enum | string_character_types : uint32_t { DECIMAL = 1 << 0, NUMERIC = 1 << 1, DIGIT = 1 << 2, ALPHA = 1 << 3, SPACE = 1 << 4, UPPER = 1 << 5, LOWER = 1 << 6, ALPHANUM = DECIMAL | NUMERIC | DIGIT | ALPHA, CASE_TYPES = UPPER | LOWER, ALL_TYPES = ALPHANUM | CASE_TYPES | SPACE } |
Character type values. These types can be or'd to check for any combination of types. More... | |
enum | pad_side { pad_side::LEFT, pad_side::RIGHT, pad_side::BOTH } |
Pad types for the pad method specify where the pad character should be placed. More... | |
enum | strip_type { LEFT, RIGHT, BOTH } |
Direction identifier for strip() function. | |
enum | filter_type : bool { KEEP, REMOVE } |
Removes or keeps the specified character ranges in cudf::strings::filter_characters. | |
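The note that the string_character_types values "can be or'd" can be illustrated with a small host-only sketch. It re-declares the enumeration locally with the values listed above, so it does not depend on the libcudf headers. The operator bodies are one plausible implementation, not necessarily the library's, and `has_any` is a hypothetical helper introduced only for this example.

```cpp
#include <cassert>
#include <cstdint>

// Local copy of the enumeration values listed above (illustration only).
enum string_character_types : std::uint32_t {
  DECIMAL    = 1 << 0,
  NUMERIC    = 1 << 1,
  DIGIT      = 1 << 2,
  ALPHA      = 1 << 3,
  SPACE      = 1 << 4,
  UPPER      = 1 << 5,
  LOWER      = 1 << 6,
  ALPHANUM   = DECIMAL | NUMERIC | DIGIT | ALPHA,
  CASE_TYPES = UPPER | LOWER,
  ALL_TYPES  = ALPHANUM | CASE_TYPES | SPACE
};

// Combine two flag sets; casts keep the result in the enum type.
constexpr string_character_types operator|(string_character_types lhs,
                                           string_character_types rhs)
{
  return static_cast<string_character_types>(static_cast<std::uint32_t>(lhs) |
                                             static_cast<std::uint32_t>(rhs));
}

// In-place variant of the operator above.
inline string_character_types& operator|=(string_character_types& lhs,
                                          string_character_types rhs)
{
  lhs = lhs | rhs;
  return lhs;
}

// Hypothetical helper: true if any bit of `mask` is set in `flags`.
constexpr bool has_any(string_character_types flags, string_character_types mask)
{
  return (static_cast<std::uint32_t>(flags) & static_cast<std::uint32_t>(mask)) != 0;
}
```

A combined value such as `ALPHA | DIGIT` could then be passed wherever a string_character_types argument is expected, for example as the `types` parameter of all_characters_of_type.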
Functions | |
std::unique_ptr< column > | count_characters (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns an integer numeric column containing the length of each string in characters. More... | |
std::unique_ptr< column > | count_bytes (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a numeric column containing the length of each string in bytes. More... | |
std::unique_ptr< column > | code_points (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Creates a numeric column with code point values (integers) for each character of each string. More... | |
std::unique_ptr< column > | capitalize (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of capitalized strings. More... | |
std::unique_ptr< column > | title (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Modifies first character after spaces to uppercase and lower-cases the rest. More... | |
std::unique_ptr< column > | to_lower (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Converts a column of strings to lower case. More... | |
std::unique_ptr< column > | to_upper (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Converts a column of strings to upper case. More... | |
std::unique_ptr< column > | swapcase (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of strings converting lower case characters to upper case and vice versa. More... | |
string_character_types | operator| (string_character_types lhs, string_character_types rhs) |
string_character_types & | operator|= (string_character_types &lhs, string_character_types rhs) |
std::unique_ptr< column > | all_characters_of_type (strings_column_view const &strings, string_character_types types, string_character_types verify_types=string_character_types::ALL_TYPES, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying string entries in which all characters are of the type specified. More... | |
std::unique_ptr< column > | filter_characters_of_type (strings_column_view const &strings, string_character_types types_to_remove, string_scalar const &replacement=string_scalar(""), string_character_types types_to_keep=string_character_types::ALL_TYPES, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Filter specific character types from a column of strings. More... | |
std::unique_ptr< column > | is_integer (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers. More... | |
bool | all_integer (strings_column_view const &strings) |
Returns true if all strings contain characters that are valid for conversion to integers. More... | |
std::unique_ptr< column > | is_float (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying strings in which all characters are valid for conversion to floats. More... | |
bool | all_float (strings_column_view const &strings) |
Returns true if all strings contain characters that are valid for conversion to floats. More... | |
std::unique_ptr< column > | concatenate (table_view const &strings_columns, string_scalar const &separator=string_scalar(""), string_scalar const &narep=string_scalar("", false), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Row-wise concatenates the given list of strings columns and returns a single strings column result. More... | |
std::unique_ptr< column > | join_strings (strings_column_view const &strings, string_scalar const &separator=string_scalar(""), string_scalar const &narep=string_scalar("", false), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Concatenates all strings in the column into one new string delimited by an optional separator string. More... | |
std::unique_ptr< column > | concatenate (table_view const &strings_columns, strings_column_view const &separators, string_scalar const &separator_narep=string_scalar("", false), string_scalar const &col_narep=string_scalar("", false), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Concatenates a list of strings columns using separators for each row and returns the result as a strings column. More... | |
std::unique_ptr< column > | contains_re (strings_column_view const &strings, std::string const &pattern, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying rows which match the given regex pattern. More... | |
std::unique_ptr< column > | matches_re (strings_column_view const &strings, std::string const &pattern, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying rows which match the given regex pattern, but only at the beginning of the string. More... | |
std::unique_ptr< column > | count_re (strings_column_view const &strings, std::string const &pattern, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns the number of times the given regex pattern matches in each string. More... | |
std::unique_ptr< column > | to_booleans (strings_column_view const &strings, string_scalar const &true_string=string_scalar("true"), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new BOOL8 column by parsing boolean values from the strings in the provided strings column. More... | |
std::unique_ptr< column > | from_booleans (column_view const &booleans, string_scalar const &true_string=string_scalar("true"), string_scalar const &false_string=string_scalar("false"), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column converting the boolean values from the provided column into strings. More... | |
std::unique_ptr< column > | to_timestamps (strings_column_view const &strings, data_type timestamp_type, std::string const &format, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new timestamp column converting a strings column into timestamps using the provided format pattern. More... | |
std::unique_ptr< column > | is_timestamp (strings_column_view const &strings, std::string const &format, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Verifies the given strings column can be parsed to timestamps using the provided format pattern. More... | |
std::unique_ptr< column > | from_timestamps (column_view const &timestamps, std::string const &format="%Y-%m-%dT%H:%M:%SZ", rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column converting a timestamp column into strings using the provided format pattern. More... | |
std::unique_ptr< column > | to_durations (strings_column_view const &strings, data_type duration_type, std::string const &format, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new duration column converting a strings column into durations using the provided format pattern. More... | |
std::unique_ptr< column > | from_durations (column_view const &durations, std::string const &format="%D days %H:%M:%S", rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column converting a duration column into strings using the provided format pattern. More... | |
std::unique_ptr< column > | to_floats (strings_column_view const &strings, data_type output_type, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new numeric column by parsing float values from each string in the provided strings column. More... | |
std::unique_ptr< column > | from_floats (column_view const &floats, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column converting the float values from the provided column into strings. More... | |
std::unique_ptr< column > | to_integers (strings_column_view const &strings, data_type output_type, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new integer numeric column parsing integer values from the provided strings column. More... | |
std::unique_ptr< column > | from_integers (column_view const &integers, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column converting the integer values from the provided column into strings. More... | |
std::unique_ptr< column > | hex_to_integers (strings_column_view const &strings, data_type output_type, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new integer numeric column parsing hexadecimal values from the provided strings column. More... | |
std::unique_ptr< column > | is_hex (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers from hex. More... | |
std::unique_ptr< column > | ipv4_to_integers (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Converts IPv4 addresses into integers. More... | |
std::unique_ptr< column > | integers_to_ipv4 (column_view const &integers, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Converts integers into IPv4 addresses as strings. More... | |
std::unique_ptr< column > | is_ipv4 (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers from IPv4 format. More... | |
std::unique_ptr< column > | url_encode (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Encodes each string using URL encoding. More... | |
std::unique_ptr< column > | url_decode (strings_column_view const &strings, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Decodes each URL-encoded string. More... | |
std::unique_ptr< table > | extract (strings_column_view const &strings, std::string const &pattern, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a table of strings columns, one for each matching group specified in the given regular expression pattern. More... | |
std::unique_ptr< column > | find (strings_column_view const &strings, string_scalar const &target, size_type start=0, size_type stop=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of character position values where the target string is first found in each string of the provided column. More... | |
std::unique_ptr< column > | rfind (strings_column_view const &strings, string_scalar const &target, size_type start=0, size_type stop=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of character position values where the target string is first found searching from the end of each string. More... | |
std::unique_ptr< column > | contains (strings_column_view const &strings, string_scalar const &target, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of boolean values for each string where true indicates the target string was found within that string in the provided column. More... | |
std::unique_ptr< column > | contains (strings_column_view const &strings, strings_column_view const &targets, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of boolean values for each string where true indicates the corresponding target string was found within that string in the provided column. More... | |
std::unique_ptr< column > | starts_with (strings_column_view const &strings, string_scalar const &target, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of boolean values for each string where true indicates the target string was found at the beginning of that string in the provided column. More... | |
std::unique_ptr< column > | starts_with (strings_column_view const &strings, strings_column_view const &targets, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of boolean values for each string where true indicates the corresponding string in the target column was found at the beginning of that string in the provided column. More... | |
std::unique_ptr< column > | ends_with (strings_column_view const &strings, string_scalar const &target, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of boolean values for each string where true indicates the target string was found at the end of that string in the provided column. More... | |
std::unique_ptr< column > | ends_with (strings_column_view const &strings, strings_column_view const &targets, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column of boolean values for each string where true indicates the corresponding string in the target column was found at the end of that string in the provided column. More... | |
std::unique_ptr< column > | find_multiple (strings_column_view const &strings, strings_column_view const &targets, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a column with character position values where each of the target strings is found in each string. More... | |
std::unique_ptr< table > | findall_re (strings_column_view const &strings, std::string const &pattern, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a table of strings columns for each matching occurrence of the regex pattern within each string. More... | |
std::unique_ptr< column > | pad (strings_column_view const &strings, size_type width, pad_side side=cudf::strings::pad_side::RIGHT, std::string const &fill_char=" ", rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Add padding to each string using a provided character. More... | |
std::unique_ptr< column > | zfill (strings_column_view const &strings, size_type width, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Add '0' as padding to the left of each string. More... | |
std::unique_ptr< column > | replace (strings_column_view const &strings, string_scalar const &target, string_scalar const &repl, int32_t maxrepl=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Replaces target string within each string with the specified replacement string. More... | |
std::unique_ptr< column > | replace_slice (strings_column_view const &strings, string_scalar const &repl=string_scalar(""), size_type start=0, size_type stop=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Replaces the [start,stop) character position range of each string in the column with the provided repl string. More... | |
std::unique_ptr< column > | replace (strings_column_view const &strings, strings_column_view const &targets, strings_column_view const &repls, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Replaces substrings matching a list of targets with the corresponding replacement strings. More... | |
std::unique_ptr< column > | replace_nulls (strings_column_view const &strings, string_scalar const &repl=string_scalar(""), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Replaces any null string entries with the given string. More... | |
std::unique_ptr< column > | replace_re (strings_column_view const &strings, std::string const &pattern, string_scalar const &repl=string_scalar(""), size_type maxrepl=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
For each string, replaces any character sequence matching the given pattern with the provided replacement string. More... | |
std::unique_ptr< column > | replace_re (strings_column_view const &strings, std::vector< std::string > const &patterns, strings_column_view const &repls, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
For each string, replaces any character sequence matching the given patterns with the corresponding string in the repls column. More... | |
std::unique_ptr< column > | replace_with_backrefs (strings_column_view const &strings, std::string const &pattern, std::string const &repl, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
For each string, replaces any character sequence matching the given pattern using the repl template for back-references. More... | |
std::unique_ptr< table > | partition (strings_column_view const &strings, string_scalar const &delimiter=string_scalar(""), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a set of 3 columns by splitting each string using the specified delimiter. More... | |
std::unique_ptr< table > | rpartition (strings_column_view const &strings, string_scalar const &delimiter=string_scalar(""), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a set of 3 columns by splitting each string using the specified delimiter starting from the end of each string. More... | |
std::unique_ptr< table > | split (strings_column_view const &strings_column, string_scalar const &delimiter=string_scalar(""), size_type maxsplit=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a table of columns by splitting each string using the specified delimiter. More... | |
std::unique_ptr< table > | rsplit (strings_column_view const &strings_column, string_scalar const &delimiter=string_scalar(""), size_type maxsplit=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a table of columns by splitting each string using the specified delimiter starting from the end of each string. More... | |
std::unique_ptr< column > | split_record (strings_column_view const &strings, string_scalar const &delimiter=string_scalar(""), size_type maxsplit=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Splits individual strings elements into a list of strings. More... | |
std::unique_ptr< column > | rsplit_record (strings_column_view const &strings, string_scalar const &delimiter=string_scalar(""), size_type maxsplit=-1, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Splits individual strings elements into a list of strings starting from the end of each string. More... | |
void | print (strings_column_view const &strings, size_type start=0, size_type end=-1, size_type max_width=-1, const char *delimiter="\n") |
Prints the strings to stdout. More... | |
std::pair< rmm::device_vector< char >, rmm::device_vector< size_type > > | create_offsets (strings_column_view const &strings, rmm::cuda_stream_view stream=rmm::cuda_stream_default, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Create output per Arrow strings format. More... | |
std::unique_ptr< column > | strip (strings_column_view const &strings, strip_type stype=strip_type::BOTH, string_scalar const &to_strip=string_scalar(""), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Removes the specified characters from the beginning or end (or both) of each string. More... | |
std::unique_ptr< column > | slice_strings (strings_column_view const &strings, numeric_scalar< size_type > const &start=numeric_scalar< size_type >(0, false), numeric_scalar< size_type > const &stop=numeric_scalar< size_type >(0, false), numeric_scalar< size_type > const &step=numeric_scalar< size_type >(1), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column that contains substrings of the strings in the provided column. More... | |
std::unique_ptr< column > | slice_strings (strings_column_view const &strings, column_view const &starts, column_view const &stops, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Returns a new strings column that contains substrings of the strings in the provided column using unique ranges for each string. More... | |
std::unique_ptr< column > | slice_strings (strings_column_view const &strings, string_scalar const &delimiter, size_type count, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Slices a column of strings by using a delimiter as a slice point. More... | |
std::unique_ptr< column > | slice_strings (strings_column_view const &strings, strings_column_view const &delimiter_strings, size_type count, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Slices a column of strings by using a delimiter column as slice points. More... | |
std::unique_ptr< column > | translate (strings_column_view const &strings, std::vector< std::pair< char_utf8, char_utf8 >> const &chars_table, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Translates individual characters within each string. More... | |
std::unique_ptr< column > | filter_characters (strings_column_view const &strings, std::vector< std::pair< cudf::char_utf8, cudf::char_utf8 >> characters_to_filter, filter_type keep_characters=filter_type::KEEP, string_scalar const &replacement=string_scalar(""), rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Filters each string in a strings column by keeping or removing the specified ranges of characters. More... | |
std::unique_ptr< column > | wrap (strings_column_view const &strings, size_type width, rmm::mr::device_memory_resource *mr=rmm::mr::get_current_device_resource()) |
Wraps strings onto multiple lines shorter than width by replacing appropriate white space with new-line characters (ASCII 0x0A). More... | |
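The table above distinguishes count_characters (length in characters) from count_bytes (length in bytes). The host-only sketch below shows why the two can differ, assuming the strings are UTF-8 encoded. The function names `count_bytes_host` and `count_characters_host` are hypothetical and are not part of the library. The real functions operate on device columns, not on `std::string`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Length in bytes: simply the size of the UTF-8 encoded buffer.
std::size_t count_bytes_host(std::string const& s) { return s.size(); }

// Length in characters: count every byte that is NOT a UTF-8
// continuation byte (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_characters_host(std::string const& s)
{
  std::size_t n = 0;
  for (unsigned char c : s) {
    if ((c & 0xC0) != 0x80) { ++n; }
  }
  return n;
}
```

For the string "héllo", where "é" is encoded as the two bytes 0xC3 0xA9, this sketch reports 5 characters and 6 bytes. For pure ASCII input the two counts are equal.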
Strings column APIs.
std::pair<rmm::device_vector<char>, rmm::device_vector<size_type>> cudf::strings::create_offsets(
  strings_column_view const& strings,
  rmm::cuda_stream_view stream = rmm::cuda_stream_default,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource())
Create output per Arrow strings format.
The returned pair holds the vector of chars first and the vector of offsets second.
strings | Strings instance for this operation. |
stream | CUDA stream used for device memory operations and kernel launches. |
mr | Device memory resource used to allocate the returned device_vectors. |
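As a rough illustration of the chars/offsets pair described above, the following host-only sketch builds the same kind of layout from a `std::vector<std::string>`. It assumes the conventional Arrow variable-length layout: one contiguous character buffer plus n+1 offsets, where string i occupies bytes [offsets[i], offsets[i+1]). Whether create_offsets returns exactly n+1 offsets is an assumption here. `create_offsets_host` and the local `size_type` alias are hypothetical names for this example, and null handling is omitted.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

using size_type = std::int32_t;  // assumption: a 32-bit signed row/offset type

// Host-side analogue: concatenate all strings into one chars buffer and
// record where each string begins and ends in an offsets buffer.
std::pair<std::vector<char>, std::vector<size_type>> create_offsets_host(
  std::vector<std::string> const& strings)
{
  std::vector<char> chars;
  std::vector<size_type> offsets;
  offsets.reserve(strings.size() + 1);
  offsets.push_back(0);  // the first string always starts at byte 0
  for (auto const& s : strings) {
    chars.insert(chars.end(), s.begin(), s.end());
    offsets.push_back(static_cast<size_type>(chars.size()));
  }
  return {std::move(chars), std::move(offsets)};
}
```

With the input {"ab", "", "cde"} this sketch yields the chars buffer "abcde" and the offsets {0, 2, 2, 5}. The empty string shows up as two equal consecutive offsets.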
void cudf::strings::print(
  strings_column_view const& strings,
  size_type start = 0,
  size_type end = -1,
  size_type max_width = -1,
  const char* delimiter = "\n")
Prints the strings to stdout.
strings | Strings instance for this operation. |
start | Index of first string to print. |
end | Index of last string to print. Specify -1 for all strings. |
max_width | Maximum number of characters to print per string. Specify -1 to print all characters. |
delimiter | The chars to print between each string. Default is new-line character. |
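The interaction of the parameters above can be sketched on the host as follows. This is not the library's implementation, and it makes several assumptions that the parameter table does not settle: `end` is treated as an exclusive upper bound (the wording "index of last string" could also be read as inclusive), `max_width` counts UTF-8 characters and not bytes, and the delimiter is written only between strings. `print_host` and `truncate_chars` are hypothetical names. Output goes to a caller-supplied stream instead of stdout so the result can be inspected.

```cpp
#include <cassert>
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Keep at most max_width UTF-8 characters of s; a negative width keeps all.
std::string truncate_chars(std::string const& s, int max_width)
{
  if (max_width < 0) { return s; }
  std::size_t bytes = 0;
  int chars         = 0;
  while (bytes < s.size()) {
    // A byte that is not a continuation byte (10xxxxxx) starts a character.
    bool const is_lead = (static_cast<unsigned char>(s[bytes]) & 0xC0) != 0x80;
    if (is_lead && chars == max_width) { break; }
    if (is_lead) { ++chars; }
    ++bytes;
  }
  return s.substr(0, bytes);
}

// Write strings [start, end) to out, separated by delimiter.
// Assumption: end is exclusive, and end = -1 means "through the last string".
void print_host(std::vector<std::string> const& strings,
                std::ostream& out,
                int start             = 0,
                int end               = -1,
                int max_width         = -1,
                char const* delimiter = "\n")
{
  int const count = static_cast<int>(strings.size());
  if (end < 0 || end > count) { end = count; }
  if (start < 0) { start = 0; }
  for (int i = start; i < end; ++i) {
    if (i > start) { out << delimiter; }
    out << truncate_chars(strings[i], max_width);
  }
}
```

Under these assumptions, printing {"alpha", "beta", "gamma"} with start = 1, max_width = 3 and delimiter ", " writes "bet, gam".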