Strings Convert
- group strings_convert
Functions
-
std::unique_ptr<column> to_booleans(strings_column_view const &input, string_scalar const &true_string, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new BOOL8 column by parsing boolean values from the strings in the provided strings column.
Any null entries will result in corresponding null entries in the output column.
- Parameters:
input – Strings instance for this operation
true_string – String to expect for true. Non-matching strings are false
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New BOOL8 column converted from strings
-
std::unique_ptr<column> from_booleans(column_view const &booleans, string_scalar const &true_string, string_scalar const &false_string, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting the boolean values from the provided column into strings.
Any null entries will result in corresponding null entries in the output column.
- Throws:
cudf::logic_error – if the input column is not BOOL8 type.
- Parameters:
booleans – Boolean column to convert
true_string – String to use for true in the output column
false_string – String to use for false in the output column
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
-
std::unique_ptr<column> to_timestamps(strings_column_view const &input, data_type timestamp_type, std::string_view format, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new timestamp column converting a strings column into timestamps using the provided format pattern.
The format pattern can include the following specifiers: “%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z”
Specifier – Description
%d – Day of the month: 01-31
%m – Month of the year: 01-12
%y – Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999]
%Y – Year with century: 0001-9999
%H – 24-hour of the day: 00-23
%I – 12-hour of the day: 01-12
%M – Minute of the hour: 00-59
%S – Second of the minute: 00-59. Leap second is not supported.
%f – 6-digit microsecond: 000000-999999
%z – UTC offset with format ±HHMM. Example: +0500
%j – Day of the year: 001-366
%p – Only ‘AM’, ‘PM’ or ‘am’, ‘pm’ are recognized
%W – Week of the year with Monday as the first day of the week: 00-53
%w – Day of week: 0-6 = Sunday-Saturday
%U – Week of the year with Sunday as the first day of the week: 00-53
%u – Day of week: 1-7 = Monday-Sunday
Other specifiers are not currently supported.
Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry’s timestamp value is undefined.
Any null string entry will result in a corresponding null row in the output column.
The resulting time units are specified by the timestamp_type parameter. The time units are independent of the number of digits parsed by the “%f” specifier. The “%f” supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use “%3f” for milliseconds, “%6f” for microseconds and “%9f” for nanoseconds.
Although leap second is not supported for “%S”, no checking is performed on the value. The cudf::strings::is_timestamp can be used to verify the valid range of values.
If “%W”/“%w” (or “%U”/“%u”) and “%m”/“%d” are both specified, the “%W”/“%U” and “%w”/“%u” values take precedence when computing the date part of the timestamp result.
- Throws:
cudf::logic_error – if timestamp_type is not a timestamp type.
- Parameters:
input – Strings instance for this operation
timestamp_type – The timestamp type used for creating the output column
format – String specifying the timestamp format in strings
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New datetime column
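The “%y” pivot listed among the specifiers above can be illustrated with a few lines of host-side C++. This is only a sketch of the documented mapping, not libcudf code; expand_two_digit_year is a hypothetical helper name.

```cpp
#include <cassert>

// Hypothetical helper (not part of libcudf): expands a two-digit "%y" value
// using the documented pivot.
inline int expand_two_digit_year(int yy)
{
  assert(yy >= 0 && yy <= 99);
  // [0,68] -> [2000,2068]; [69,99] -> [1969,1999]
  return yy <= 68 ? 2000 + yy : 1900 + yy;
}
```

With this mapping, “68” is read as 2068 while “69” is read as 1969.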
-
std::unique_ptr<column> is_timestamp(strings_column_view const &input, std::string_view format, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Verifies the given strings column can be parsed to timestamps using the provided format pattern.
The format pattern can include the following specifiers: “%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z”
Specifier – Description
%d – Day of the month: 01-31
%m – Month of the year: 01-12
%y – Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999]
%Y – Year with century: 0001-9999
%H – 24-hour of the day: 00-23
%I – 12-hour of the day: 01-12
%M – Minute of the hour: 00-59
%S – Second of the minute: 00-59. Leap second is not supported.
%f – 6-digit microsecond: 000000-999999
%z – UTC offset with format ±HHMM. Example: +0500
%j – Day of the year: 001-366
%p – Only ‘AM’, ‘PM’ or ‘am’, ‘pm’ are recognized
%W – Week of the year with Monday as the first day of the week: 00-53
%w – Day of week: 0-6 = Sunday-Saturday
%U – Week of the year with Sunday as the first day of the week: 00-53
%u – Day of week: 1-7 = Monday-Sunday
Other specifiers are not currently supported. The “%f” supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use “%3f” for milliseconds, “%6f” for microseconds and “%9f” for nanoseconds.
Any null string entry will result in a corresponding null row in the output column.
This will return a column of type BOOL8 where a true row indicates the corresponding input string can be parsed correctly with the given format.
- Parameters:
input – Strings instance for this operation
format – String specifying the timestamp format in strings
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New BOOL8 column
-
std::unique_ptr<column> from_timestamps(column_view const &timestamps, std::string_view format = "%Y-%m-%dT%H:%M:%SZ", strings_column_view const &names = strings_column_view(column_view{data_type{type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting a timestamp column into strings using the provided format pattern.
The format pattern can include the following specifiers: “%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z,%Z”
Specifier – Description
%d – Day of the month: 01-31
%m – Month of the year: 01-12
%y – Year without century: 00-99
%Y – Year with century: 0001-9999
%H – 24-hour of the day: 00-23
%I – 12-hour of the day: 01-12
%M – Minute of the hour: 00-59
%S – Second of the minute: 00-59
%f – 6-digit microsecond: 000000-999999
%z – Always outputs “+0000”
%Z – Always outputs “UTC”
%j – Day of the year: 001-366
%u – ISO weekday where Monday is 1 and Sunday is 7
%w – Weekday where Sunday is 0 and Saturday is 6
%U – Week of the year with Sunday as the first day: 00-53
%W – Week of the year with Monday as the first day: 00-53
%V – Week of the year per ISO-8601 format: 01-53
%G – Year based on the ISO-8601 weeks: 0000-9999
%p – AM/PM from timestamp_names::am_str/pm_str
%a – Weekday abbreviation from the names parameter
%A – Weekday from the names parameter
%b – Month name abbreviation from the names parameter
%B – Month name from the names parameter
Additional descriptions can be found here: https://en.cppreference.com/w/cpp/chrono/system_clock/formatter
No checking is done for invalid formats or invalid timestamp values. All timestamps values are formatted to UTC.
Any null input entry will result in a corresponding null entry in the output column.
The time units of the input column do not influence the number of digits written by the “%f” specifier. The “%f” supports a precision value to write out numeric digits for the subsecond value. Specify the precision with a single integer value (1-9) between the “%” and the “f” as follows: use “%3f” for milliseconds, use “%6f” for microseconds and use “%9f” for nanoseconds. If the precision is higher than the units, then zeroes are padded to the right of the subsecond value. If the precision is lower than the units, the subsecond value may be truncated.
If the “%a”, “%A”, “%b”, “%B” specifiers are included in the format, the caller should provide the format names in the names strings column using the following as a guide:
["AM", "PM",                             // specify the AM/PM strings
 "Sunday", "Monday", ..., "Saturday",    // Weekday full names
 "Sun", "Mon", ..., "Sat",               // Weekday abbreviated names
 "January", "February", ..., "December", // Month full names
 "Jan", "Feb", ..., "Dec"]               // Month abbreviated names
The result is undefined if the format names are not provided for these specifiers.
These format names can be retrieved for specific locales using the nl_langinfo functions from the C++ clocale (std) library or the Python locale library.
The following code is an example of retrieving these strings from the locale using C++ std functions:
#include <clocale>
#include <langinfo.h>
#include <string>
#include <vector>

// note: install language pack on Ubuntu using 'apt-get install language-pack-de'
{
  // set to a German language locale for date settings
  std::setlocale(LC_TIME, "de_DE.UTF-8");
  std::vector<std::string> names({
    nl_langinfo(AM_STR), nl_langinfo(PM_STR),
    // weekday full names
    nl_langinfo(DAY_1), nl_langinfo(DAY_2), nl_langinfo(DAY_3), nl_langinfo(DAY_4),
    nl_langinfo(DAY_5), nl_langinfo(DAY_6), nl_langinfo(DAY_7),
    // weekday abbreviated names
    nl_langinfo(ABDAY_1), nl_langinfo(ABDAY_2), nl_langinfo(ABDAY_3), nl_langinfo(ABDAY_4),
    nl_langinfo(ABDAY_5), nl_langinfo(ABDAY_6), nl_langinfo(ABDAY_7),
    // month full names
    nl_langinfo(MON_1), nl_langinfo(MON_2), nl_langinfo(MON_3), nl_langinfo(MON_4),
    nl_langinfo(MON_5), nl_langinfo(MON_6), nl_langinfo(MON_7), nl_langinfo(MON_8),
    nl_langinfo(MON_9), nl_langinfo(MON_10), nl_langinfo(MON_11), nl_langinfo(MON_12),
    // month abbreviated names
    nl_langinfo(ABMON_1), nl_langinfo(ABMON_2), nl_langinfo(ABMON_3), nl_langinfo(ABMON_4),
    nl_langinfo(ABMON_5), nl_langinfo(ABMON_6), nl_langinfo(ABMON_7), nl_langinfo(ABMON_8),
    nl_langinfo(ABMON_9), nl_langinfo(ABMON_10), nl_langinfo(ABMON_11), nl_langinfo(ABMON_12)});
  std::setlocale(LC_TIME, ""); // reset to default locale
}
- Throws:
cudf::logic_error – if timestamps column parameter is not a timestamp type.
cudf::logic_error – if the format string is empty
cudf::logic_error – if names.size() is an invalid size. Must be 0 or 40 strings.
- Parameters:
timestamps – Timestamp values to convert
format – The string specifying output format. Default format is “%Y-%m-%dT%H:%M:%SZ”.
names – The string names to use for weekdays (“%a”, “%A”) and months (“%b”, “%B”). Default is an empty strings_column_view.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with formatted timestamps
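The “%f” precision rule described above for from_timestamps can be illustrated on the host. The following is a minimal sketch in plain C++, not libcudf code: format_subsecond is a hypothetical helper, it assumes the fractional second is already expressed in nanoseconds, and it assumes that a lower precision simply drops trailing digits.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of libcudf): renders the sub-second part of a
// timestamp for a "%Nf" specifier, where N is `precision` (1-9).
inline std::string format_subsecond(std::int64_t subsecond_ns, int precision)
{
  assert(subsecond_ns >= 0 && subsecond_ns < 1000000000);
  assert(precision >= 1 && precision <= 9);
  std::string digits = std::to_string(subsecond_ns);
  // Left-pad to the full 9 nanosecond digits.
  digits = std::string(9 - digits.size(), '0') + digits;
  // Keep only the leading `precision` digits (truncation, no rounding).
  return digits.substr(0, static_cast<std::size_t>(precision));
}
```

In this sketch a value of 123 ms (123000000 ns) written with “%6f” yields 123000, showing the zero padding to the right, while 123456789 ns written with “%3f” yields 123.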
-
std::unique_ptr<column> to_durations(strings_column_view const &input, data_type duration_type, std::string_view format, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new duration column converting a strings column into durations using the provided format pattern.
The format pattern can include the following specifiers: “%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS”
Specifier – Description – Range
%% – A literal % character – %
%n – A newline character – \n
%t – A horizontal tab character – \t
%D – Days – -2,147,483,648 to 2,147,483,647
%H – 24-hour of the day – 00 to 23
%I – 12-hour of the day – 00 to 11
%M – Minute of the hour – 00 to 59
%S – Second of the minute – 00 to 59.999999999
%OH – same as H but without sign – 00 to 23
%OI – same as I but without sign – 00 to 11
%OM – same as M but without sign – 00 to 59
%OS – same as S but without sign – 00 to 59
%p – AM/PM designations associated with a 12-hour clock – ‘AM’ or ‘PM’
%R – Equivalent to “%H:%M”
%T – Equivalent to “%H:%M:%S”
%r – Equivalent to “%OI:%OM:%OS %p”
Other specifiers are not currently supported.
Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry’s duration value is undefined.
Any null string entry will result in a corresponding null row in the output column.
The resulting time units are specified by the duration_type parameter.
- Throws:
cudf::logic_error – if duration_type is not a duration type.
- Parameters:
input – Strings instance for this operation
duration_type – The duration type used for creating the output column
format – String specifying the duration format in strings
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New duration column
-
std::unique_ptr<column> from_durations(column_view const &durations, std::string_view format = "%D days %H:%M:%S", rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting a duration column into strings using the provided format pattern.
The format pattern can include the following specifiers: “%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS”
Specifier – Description – Range
%% – A literal % character – %
%n – A newline character – \n
%t – A horizontal tab character – \t
%D – Days – -2,147,483,648 to 2,147,483,647
%H – 24-hour of the day – 00 to 23
%I – 12-hour of the day – 00 to 11
%M – Minute of the hour – 00 to 59
%S – Second of the minute – 00 to 59.999999999
%OH – same as H but without sign – 00 to 23
%OI – same as I but without sign – 00 to 11
%OM – same as M but without sign – 00 to 59
%OS – same as S but without sign – 00 to 59
%p – AM/PM designations associated with a 12-hour clock – ‘AM’ or ‘PM’
%R – Equivalent to “%H:%M”
%T – Equivalent to “%H:%M:%S”
%r – Equivalent to “%OI:%OM:%OS %p”
No checking is done for invalid formats or invalid duration values. Formatting sticks to specifications of std::formatter<std::chrono::duration> as much as possible.
Any null input entry will result in a corresponding null entry in the output column.
The time units of the input column influence the number of decimal digits written for seconds: 3 digits for milliseconds, 6 digits for microseconds and 9 digits for nanoseconds. If the duration value is negative, only one negative sign is written to the output string. The specifiers with signs are “%H,%I,%M,%S,%R,%T”.
- Throws:
cudf::logic_error – if durations column parameter is not a duration type.
- Parameters:
durations – Duration values to convert
format – The string specifying output format. Default format is “%D days %H:%M:%S”.
mr – Device memory resource used to allocate the returned column’s device memory
stream – CUDA stream used for device memory operations and kernel launches
- Returns:
New strings column with formatted durations
-
std::unique_ptr<column> to_fixed_point(strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new fixed-point column parsing decimal values from the provided strings column.
Any null entries result in corresponding null entries in the output column.
The expected format is [sign][integer][.][fraction], where the sign is either not present, -, or +. The decimal point [.] may or may not be present, and integer and fraction are comprised of zero or more digits in [0-9]. An invalid data format results in undefined behavior in the corresponding output row result.
Example:
s = ['123', '-876', '543.2', '-0.12']
datatype = {DECIMAL32, scale=-2}
fp = to_fixed_point(s, datatype)
fp is [12300, -87600, 54320, -12]
Overflow of the resulting value type is not checked. The scale in the output_type is used for setting the integer component.
- Throws:
cudf::logic_error – if output_type is not a fixed-point decimal type.
- Parameters:
input – Strings instance for this operation
output_type – Type of fixed-point column to return including the scale value
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of output_type
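The relationship between the input text and the stored integer can be illustrated on the host. The following is a minimal sketch in plain C++ under stated assumptions, not libcudf's implementation: parse_fixed_point is a hypothetical helper, it only handles scales that are zero or negative, it drops fraction digits beyond the scale, and it stops at the first unexpected character (the documented behavior for invalid input is undefined).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of libcudf): converts "[sign][integer][.][fraction]"
// into the unscaled integer stored for a decimal with the given scale.
inline std::int64_t parse_fixed_point(std::string_view s, int scale)
{
  assert(scale <= 0);  // this sketch only handles scales such as -2
  bool negative = false;
  std::size_t i = 0;
  if (i < s.size() && (s[i] == '-' || s[i] == '+')) {
    negative = (s[i] == '-');
    ++i;
  }

  std::int64_t value  = 0;
  int fraction_digits = 0;  // fraction digits consumed so far
  bool in_fraction    = false;
  for (; i < s.size(); ++i) {
    char const c = s[i];
    if (c == '.' && !in_fraction) {
      in_fraction = true;
      continue;
    }
    if (c < '0' || c > '9') break;  // stop at the first unexpected character
    if (in_fraction) {
      if (fraction_digits == -scale) continue;  // extra fraction digits are dropped
      ++fraction_digits;
    }
    value = value * 10 + (c - '0');
  }
  // Pad with zeros when the string has fewer fraction digits than the scale needs.
  for (; fraction_digits < -scale; ++fraction_digits) value *= 10;
  return negative ? -value : value;
}
```

With scale -2, this sketch maps “543.2” to 54320 and “-0.12” to -12. Like the documented function, it does not check for overflow.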
-
std::unique_ptr<column> from_fixed_point(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting the fixed-point values into a strings column.
Any null entries result in corresponding null entries in the output column.
For each value, a string is created in base-10 decimal. Negative numbers include a ‘-’ prefix in the output string. The column’s scale value is used to place the decimal point. A negative scale value may add padded zeros after the decimal point.
Example:
fp is [110, 222, 3330, -440, -1] with scale = -2
s = from_fixed_point(fp)
s is now ['1.10', '2.22', '33.30', '-4.40', '-0.01']
- Throws:
cudf::logic_error – if the input column is not a fixed-point decimal type.
- Parameters:
input – Fixed-point column to convert
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
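How the scale places the decimal point can be illustrated on the host. The following is a minimal sketch in plain C++, not libcudf's implementation: format_fixed_point is a hypothetical helper and it only handles scales that are zero or negative.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of libcudf): formats an unscaled decimal value
// using a non-positive scale to place the decimal point.
inline std::string format_fixed_point(std::int64_t unscaled, int scale)
{
  assert(scale <= 0);
  bool const negative = unscaled < 0;
  // Work on the magnitude as unsigned so the most negative value stays well-defined.
  std::uint64_t const magnitude = negative ? 0 - static_cast<std::uint64_t>(unscaled)
                                           : static_cast<std::uint64_t>(unscaled);
  std::string digits      = std::to_string(magnitude);
  auto const fraction_len = static_cast<std::size_t>(-scale);
  // Ensure at least one integer digit: 1 with scale -2 becomes "001".
  if (digits.size() <= fraction_len) {
    digits = std::string(fraction_len + 1 - digits.size(), '0') + digits;
  }
  std::string out = negative ? "-" : "";
  out += digits.substr(0, digits.size() - fraction_len);
  if (fraction_len > 0) {
    out += '.';
    out += digits.substr(digits.size() - fraction_len);
  }
  return out;
}
```

The zero padding is what turns the unscaled value -1 with scale -2 into “-0.01” in the example above.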
-
std::unique_ptr<column> is_fixed_point(strings_column_view const &input, data_type decimal_type = data_type{type_id::DECIMAL64}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a boolean column identifying strings in which all characters are valid for conversion to fixed-point.
The sign and the exponent are optional. The decimal point may only appear once. Also, the integer component must fit within the size limits of the underlying fixed-point storage type. The value of the integer component is based on the scale of the decimal_type provided.
Example:
s = ['123', '-456', '', '1.2.3', '+17E30', '12.34', '.789', '-0.005']
b = is_fixed_point(s)
b is [true, true, false, false, true, true, true, true]
Any null entries result in corresponding null entries in the output column.
- Throws:
cudf::logic_error – if the decimal_type is not a fixed-point decimal type.
- Parameters:
input – Strings instance for this operation
decimal_type – Fixed-point type (with scale) used only for checking overflow
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of boolean results for each string
-
std::unique_ptr<column> to_floats(strings_column_view const &strings, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new numeric column by parsing float values from each string in the provided strings column.
Any null entries will result in corresponding null entries in the output column.
Only characters [0-9] plus a prefix ‘-’ and ‘+’ and decimal ‘.’ are recognized. Scientific notation is also supported (e.g. “-1.78e+5”).
- Throws:
cudf::logic_error – if output_type is not float type.
- Parameters:
strings – Strings instance for this operation
output_type – Type of float numeric column to return
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with floats converted from strings
-
std::unique_ptr<column> from_floats(column_view const &floats, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting the float values from the provided column into strings.
Any null entries will result in corresponding null entries in the output column.
For each float, a string is created in base-10 decimal. Negative numbers will include a ‘-’ prefix. Numbers producing more than 10 significant digits will produce a string that includes scientific notation (e.g. “-1.78e+15”).
- Throws:
cudf::logic_error – if floats column is not float type.
- Parameters:
floats – Numeric column to convert
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with floats as strings
-
std::unique_ptr<column> is_float(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a boolean column identifying strings in which all characters are valid for conversion to floats.
The output row entry will be set to true if the corresponding string element has at least one character in [-+0-9eE.].
Example:
s = ['123', '-456', '', 'A', '+7', '8.9', '3.7e+5']
b = is_float(s)
b is [true, true, false, false, true, true, true]
Any null row results in a null entry for that row in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of boolean results for each string
-
std::unique_ptr<column> to_integers(strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new integer numeric column parsing integer values from the provided strings column.
Any null entries will result in corresponding null entries in the output column.
Only characters [0-9] plus a prefix ‘-’ and ‘+’ are recognized. When any other character is encountered, the parsing ends for that string and the current digits are converted into an integer.
Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.
- Throws:
cudf::logic_error – if output_type is not integral type.
- Parameters:
input – Strings instance for this operation
output_type – Type of integer numeric column to return
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with integers converted from strings
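The “parse until the first unrecognized character” rule can be illustrated on the host. The following is a minimal sketch in plain C++, not libcudf's implementation: parse_leading_integer is a hypothetical helper, and returning 0 when no digits are found is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of libcudf): reads an optional sign followed by
// digits and stops at the first character that is not in [0-9].
inline std::int64_t parse_leading_integer(std::string_view s)
{
  std::size_t i = 0;
  bool negative = false;
  if (i < s.size() && (s[i] == '-' || s[i] == '+')) {
    negative = (s[i] == '-');
    ++i;
  }
  // Accumulate as unsigned so wrap-around on overflow is well-defined here;
  // like the documented function, the sketch does not detect overflow.
  std::uint64_t value = 0;
  for (; i < s.size() && s[i] >= '0' && s[i] <= '9'; ++i) {
    value = value * 10 + static_cast<std::uint64_t>(s[i] - '0');
  }
  if (negative) value = 0 - value;
  return static_cast<std::int64_t>(value);
}
```

In this sketch “123abc” yields 123 because parsing ends at the first letter. Casting the 64-bit result to a narrower type afterwards mirrors the documented convert-then-cast step, where values that do not fit are not reliable.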
-
std::unique_ptr<column> from_integers(column_view const &integers, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting the integer values from the provided column into strings.
Any null entries will result in corresponding null entries in the output column.
For each integer, a string is created in base-10 decimal. Negative numbers will include a ‘-’ prefix.
- Throws:
cudf::logic_error – if integers column is not integral type.
- Parameters:
integers – Numeric column to convert
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with integers as strings
-
std::unique_ptr<column> is_integer(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a boolean column identifying strings in which all characters are valid for conversion to integers.
The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Notice that the integer value is not checked to be within its storage limits. For a strict integer type check, use the other is_integer() API, which accepts a data_type argument.
Example:
s = ['123', '-456', '', 'A', '+7']
b = is_integer(s)
b is [true, true, false, false, true]
Any null row results in a null entry for that row in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of boolean results for each string
-
std::unique_ptr<column> is_integer(strings_column_view const &input, data_type int_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a boolean column identifying strings in which all characters are valid for conversion to integers.
The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Also, the integer component must fit within the size limits of the underlying storage type, which is provided by the int_type parameter.
Example:
s = ['123456', '-456', '', 'A', '+7']
output1 = is_integer(s, data_type{type_id::INT32})
output1 is [true, true, false, false, true]
output2 = is_integer(s, data_type{type_id::INT8})
output2 is [false, false, false, false, true]
Any null row results in a null entry for that row in the output column.
- Parameters:
input – Strings instance for this operation
int_type – Integer type used for checking underflow and overflow
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of boolean results for each string
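The storage-limit check can be illustrated on the host. The following is a minimal sketch in plain C++, not libcudf's implementation: is_integer_in_range is a hypothetical helper, restricted here to signed integer types, that validates the characters and rejects values whose magnitude would not fit in the target type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string_view>
#include <type_traits>

// Hypothetical helper (not part of libcudf): true if `s` is an optionally signed
// run of digits whose value fits in the signed integer type T.
template <typename T>
bool is_integer_in_range(std::string_view s)
{
  static_assert(std::is_integral<T>::value && std::is_signed<T>::value,
                "this sketch only covers signed integer types");
  std::size_t i = 0;
  bool negative = false;
  if (!s.empty() && (s[0] == '-' || s[0] == '+')) {
    negative = (s[0] == '-');
    i        = 1;
  }
  if (i == s.size()) return false;  // empty string or a lone sign

  // Largest allowed magnitude: max() for positives and max()+1 for negatives
  // (two's complement), e.g. 127 and 128 for int8_t.
  std::uint64_t const limit =
    static_cast<std::uint64_t>(std::numeric_limits<T>::max()) + (negative ? 1u : 0u);
  std::uint64_t value = 0;
  for (; i < s.size(); ++i) {
    if (s[i] < '0' || s[i] > '9') return false;
    auto const digit = static_cast<std::uint64_t>(s[i] - '0');
    if (value > (limit - digit) / 10) return false;  // value * 10 + digit would exceed limit
    value = value * 10 + digit;
  }
  return true;
}
```

The overflow test is done before each multiplication, so the accumulator itself never wraps.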
-
std::unique_ptr<column> hex_to_integers(strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new integer numeric column parsing hexadecimal values from the provided strings column.
Any null entries will result in corresponding null entries in the output column.
Only characters [0-9] and [A-F] are recognized. When any other character is encountered, the parsing ends for that string. No interpretation is made on the sign of the integer.
Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.
- Throws:
cudf::logic_error – if output_type is not integral type.
- Parameters:
input – Strings instance for this operation
output_type – Type of integer numeric column to return
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column with integers converted from strings
-
std::unique_ptr<column> is_hex(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a boolean column identifying strings in which all characters are valid for conversion to integers from hex.
The output row entry will be set to true if the corresponding string element has at least one character and all of its characters are in [0-9A-Fa-f]. Also, the string may start with ‘0x’.
Example:
s = ['123', '-456', '', 'AGE', '+17EA', '0x9EF', '123ABC']
b = is_hex(s)
b is [true, false, false, false, false, true, true]
Any null row results in a null entry for that row in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of boolean results for each string
-
std::unique_ptr<column> integers_to_hex(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a new strings column converting integer columns to hexadecimal characters.
Any null entries will result in corresponding null entries in the output column.
The output character set is ‘0’-‘9’ and ‘A’-‘F’. The output string width will be a multiple of 2 depending on the size of the integer type. A single leading zero is applied to the first non-zero output byte if it is less than 0x10.
Example:
input = [1234, -1, 0, 27, 342718233] // int32 type input column
s = integers_to_hex(input)
s is [ '04D2', 'FFFFFFFF', '00', '1B', '146D7719']
The example above shows an INT32 type column where each integer is 4 bytes. Leading zeros are suppressed unless filling out a complete byte as in 1234 -> ‘04D2’ instead of ‘000004D2’ or ‘4D2’.
- Throws:
cudf::logic_error – if the input column is not integral type.
- Parameters:
input – Integer column to convert to hex
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column with hexadecimal characters
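The byte-wise output rule can be illustrated on the host. The following is a minimal sketch in plain C++ for 32-bit inputs only, not libcudf's implementation; int32_to_hex is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of libcudf): writes a 32-bit integer as
// uppercase hex, two characters per byte, skipping leading zero bytes.
inline std::string int32_to_hex(std::int32_t value)
{
  static constexpr char digits[] = "0123456789ABCDEF";
  auto const bits = static_cast<std::uint32_t>(value);  // two's complement bit pattern
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    auto const byte = static_cast<unsigned>((bits >> shift) & 0xFFu);
    // Skip leading zero bytes, but always emit the lowest byte.
    if (out.empty() && byte == 0 && shift > 0) continue;
    out += digits[byte >> 4];
    out += digits[byte & 0x0Fu];
  }
  return out;
}
```

Because whole bytes are emitted, the result always has an even width: 1234 becomes 04D2 and 0 becomes 00.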
-
std::unique_ptr<column> ipv4_to_integers(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Converts IPv4 addresses into integers.
The IPv4 format is 1-3 character digits [0-9] between 3 dots (e.g. 123.45.67.89). Each section can have a value between [0-255].
The four sets of digits are converted to integers and placed in 8-bit fields inside the resulting integer.
i0.i1.i2.i3 -> (i0 << 24) | (i1 << 16) | (i2 << 8) | (i3)
No checking is done on the format. If a string is not in IPv4 format, the resulting integer is undefined.
The resulting 32-bit integer is placed in an int64_t to avoid setting the sign-bit in an int32_t type. This could be changed if cudf supported a UINT32 type in the future.
Any null entries will result in corresponding null entries in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New INT64 column converted from strings
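The bit layout i0.i1.i2.i3 -> (i0 << 24) | (i1 << 16) | (i2 << 8) | (i3) can be illustrated on the host. The following is a minimal sketch in plain C++, not libcudf's implementation; ipv4_to_integer is a hypothetical helper and, like the documented function, it performs no format checking.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not part of libcudf): packs the four dotted groups of an
// IPv4 string into the lower 32 bits of a 64-bit integer.
inline std::int64_t ipv4_to_integer(std::string_view address)
{
  std::int64_t result = 0;
  std::int64_t octet  = 0;
  for (char const c : address) {
    if (c == '.') {
      result = (result << 8) | octet;  // commit the finished group
      octet  = 0;
    } else {
      octet = octet * 10 + (c - '0');  // no validation of the character
    }
  }
  return (result << 8) | octet;  // commit the last group
}
```

Using a 64-bit result keeps addresses with the top bit set, such as 255.255.255.255, positive.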
-
std::unique_ptr<column> integers_to_ipv4(column_view const &integers, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Converts integers into IPv4 addresses as strings.
The IPv4 format is 1-3 character digits [0-9] between 3 dots (e.g. 123.45.67.89). Each section can have a value between [0-255].
Each input integer is dissected into four integers by dividing the input into 8-bit sections. These sub-integers are then converted into [0-9] characters and placed between ‘.’ characters.
No checking is done on the input integer value. Only the lower 32-bits are used.
Any null entries will result in corresponding null entries in the output column.
- Throws:
cudf::logic_error – if the input column is not INT64 type.
- Parameters:
integers – Integer (INT64) column to convert
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
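Splitting the integer into 8-bit sections can be illustrated on the host. The following is a minimal sketch in plain C++, not libcudf's implementation; integer_to_ipv4 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of libcudf): formats the lower 32 bits of a
// 64-bit integer as a dotted IPv4 string.
inline std::string integer_to_ipv4(std::int64_t value)
{
  auto const bits = static_cast<std::uint32_t>(value);  // keep only the lower 32 bits
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out += std::to_string((bits >> shift) & 0xFFu);  // one 8-bit section
    if (shift > 0) out += '.';
  }
  return out;
}
```

Any bits above the lower 32 are ignored, so in this sketch 4294967297 (2^32 + 1) formats as 0.0.0.1.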
-
std::unique_ptr<column> is_ipv4(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Returns a boolean column identifying strings in which all characters are valid for conversion to integers from IPv4 format.
The output row entry will be set to true if the corresponding string element has the format xxx.xxx.xxx.xxx, where xxx is integer digits between 0-255.
Example:
s = ['123.255.0.7', '127.0.0.1', '', '1.2.34', '123.456.789.10']
b = s.is_ipv4(s)
b is [true, true, false, false, true]
Any null row results in a null entry for that row in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of boolean results for each string
-
std::unique_ptr<column> format_list_column(lists_column_view const &input, string_scalar const &na_rep = string_scalar(""), strings_column_view const &separators = strings_column_view(column_view{data_type{type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Convert a list column of strings into a formatted strings column.
The separators column should contain 3 string elements in the following order:
- element separator (default is comma ,)
- left-hand enclosure (default is [)
- right-hand enclosure (default is ])
l1 = { [[a,b,c], [d,e]], [[f,g], [h]] }
s1 = format_list_column(l1)
s1 is now ["[[a,b,c],[d,e]]", "[[f,g],[h]]"]
l2 = { [[a,b,c], [d,e]], [NULL], [[f,g], NULL, [h]] }
s2 = format_list_column(l2, '-', [':', '{', '}'])
s2 is now ["{{a:b:c}:{d:e}}", "{-}", "{{f:g}:-:{h}}"]
- Throws:
cudf::logic_error – if the input column is not a LIST type with a STRING child.
- Parameters:
input – Lists column to format
na_rep – Replacement string for null elements
separators – Strings to use for enclosing list components and separating elements
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
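The formatting rule can be illustrated with a small recursive host-side sketch. Everything here is hypothetical and for illustration only: the `Node` type, the `leaf`, `make_list` and `null_node` helpers, and `format_node` are not libcudf types or functions, and the real function operates on device columns rather than a tree of host objects.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model of one (possibly nested) list element.
struct Node {
  bool is_null = false;        // null element: rendered as na_rep
  bool is_list = false;        // true for a list, false for a string leaf
  std::string value;           // used only for string leaves
  std::vector<Node> children;  // used only for lists
};

Node leaf(std::string s) { Node n; n.value = std::move(s); return n; }
Node null_node() { Node n; n.is_null = true; return n; }
Node make_list(std::vector<Node> c)
{
  Node n;
  n.is_list  = true;
  n.children = std::move(c);
  return n;
}

// Renders a node using the element separator and the two enclosures.
std::string format_node(Node const& n,
                        std::string const& na_rep,
                        std::string const& sep,
                        std::string const& left,
                        std::string const& right)
{
  if (n.is_null) { return na_rep; }
  if (!n.is_list) { return n.value; }
  std::string out = left;
  for (std::size_t i = 0; i < n.children.size(); ++i) {
    if (i > 0) { out += sep; }
    out += format_node(n.children[i], na_rep, sep, left, right);
  }
  return out + right;
}
```

For the row `[[f,g], NULL, [h]]` with `na_rep` set to `-` and separators `:`, `{`, `}`, this sketch yields `{{f:g}:-:{h}}`, matching the third output row of the example above.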
-
std::unique_ptr<column> url_encode(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Encodes each string using URL encoding.
Converts mostly non-ascii characters and control characters into UTF-8 hex code-points prefixed with ‘%’. For example, the space character is converted to the characters ‘%20’, where ‘20’ is the hex value for space in UTF-8. Likewise, multi-byte characters are converted to multiple hex sequences. For example, the é character is converted to the characters ‘%C3%A9’, where ‘C3A9’ is the UTF-8 bytes 0xC3A9 for this character.
Any null entries will result in corresponding null entries in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
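A host-side sketch of percent-encoding a single string may help make the byte-level behavior of url_encode concrete. It is not the libcudf implementation, and `url_encode_sketch` is a hypothetical name. The set of bytes left unencoded here (ASCII letters, digits, and `-`, `_`, `.`, `~`, the RFC 3986 unreserved characters) is an assumption; this page does not list the exact set libcudf uses.

```cpp
#include <string>
#include <string_view>

// Host-side illustration only (hypothetical helper, not part of libcudf).
// Every byte outside the assumed unreserved set becomes '%' followed by
// two uppercase hex digits, so a multi-byte UTF-8 character produces one
// %XX sequence per byte.
std::string url_encode_sketch(std::string_view in)
{
  constexpr char hex[] = "0123456789ABCDEF";
  std::string out;
  for (char c : in) {
    auto const b = static_cast<unsigned char>(c);
    bool const unreserved = (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
                            (b >= '0' && b <= '9') || b == '-' || b == '_' ||
                            b == '.' || b == '~';
    if (unreserved) {
      out += c;
    } else {
      out += '%';
      out += hex[b >> 4];    // high nibble
      out += hex[b & 0x0F];  // low nibble
    }
  }
  return out;
}
```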
-
std::unique_ptr<column> url_decode(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#
Decodes each string using URL encoding.
Converts all character sequences starting with ‘%’ into character code-points, interpreting the 2 following characters as hex values to create the code-point. For example, the sequence ‘%20’ is converted into the byte 0x20, which is a single space character. Another example converts ‘%C3%A9’ into 2 sequential bytes (0xc3 and 0xa9 respectively), which is the é character. Overall, 3 characters are converted into one char byte whenever a ‘%’ (single percent) character is encountered in the string.
Any null entries will result in corresponding null entries in the output column.
- Parameters:
input – Strings instance for this operation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column
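Percent-decoding of a single string can be sketched on the host as follows. It is an illustration, not the libcudf implementation; `hex_value` and `url_decode_sketch` are hypothetical names. Copying a `%` through unchanged when it is not followed by two hex digits is an assumption of this sketch, since the page does not say how malformed sequences are handled.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns the numeric value of a hex digit, or -1 if c is not a hex digit.
int hex_value(char c)
{
  if (c >= '0' && c <= '9') { return c - '0'; }
  if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
  if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
  return -1;
}

// Host-side illustration only (hypothetical helper, not part of libcudf).
// Each "%XY" sequence (3 characters) becomes the single byte 0xXY.
std::string url_decode_sketch(std::string_view in)
{
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      int const hi = hex_value(in[i + 1]);
      int const lo = hex_value(in[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out += static_cast<char>((hi << 4) | lo);
        i += 2;  // skip the two hex digits just consumed
        continue;
      }
    }
    out += in[i];  // ordinary character, or a '%' not followed by two hex digits
  }
  return out;
}
```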
-
std::unique_ptr<column> to_booleans(strings_column_view const &input, string_scalar const &true_string, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#