Files | |
file | convert_booleans.hpp |
file | convert_datetime.hpp |
file | convert_durations.hpp |
file | convert_fixed_point.hpp |
file | convert_floats.hpp |
file | convert_integers.hpp |
file | convert_ipv4.hpp |
file | convert_lists.hpp |
file | convert_urls.hpp |
std::unique_ptr<column> cudf::strings::format_list_column | ( | lists_column_view const & | input, |
string_scalar const & | na_rep = string_scalar("") , |
||
strings_column_view const & | separators = strings_column_view(column_view{ data_type{type_id::STRING}, 0, nullptr, nullptr, 0}) , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Convert a list column of strings into a formatted strings column.
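The separators column should contain 3 strings elements in the following order:
- element separator (default is comma ,)
- left-hand enclosure (default is [)
- right-hand enclosure (default is ])
cudf::logic_error | if the input column is not a LIST type with a STRING child. |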
input | Lists column to format |
na_rep | Replacement string for null elements |
separators | Strings to use for enclosing list components and separating elements |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
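As a rough host-side illustration of how the three separator strings and na_rep combine for a single list row, the sketch below uses plain standard C++. The function name format_list_row and the use of std::optional to model null elements are assumptions made for this example; it does not call cudf and is not the library's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side helper (not part of cudf): formats one list row.
// A std::nullopt element stands in for a null entry and is replaced by na_rep.
std::string format_list_row(std::vector<std::optional<std::string>> const& row,
                            std::string const& na_rep      = "",
                            std::string const& element_sep = ",",
                            std::string const& left        = "[",
                            std::string const& right       = "]")
{
  std::string out = left;
  for (std::size_t i = 0; i < row.size(); ++i) {
    if (i > 0) { out += element_sep; }               // separator between elements
    out += row[i].has_value() ? *row[i] : na_rep;    // null element -> na_rep
  }
  out += right;
  return out;
}
```

With na_rep set to "NULL", a row holding "a", a null element and "b" would be rendered by this sketch as [a,NULL,b].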
std::unique_ptr<column> cudf::strings::from_booleans | ( | column_view const & | booleans, |
string_scalar const & | true_string, | ||
string_scalar const & | false_string, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting the boolean values from the provided column into strings.
Any null entries will result in corresponding null entries in the output column.
cudf::logic_error | if the input column is not BOOL8 type. |
booleans | Boolean column to convert |
true_string | String to use for true in the output column |
false_string | String to use for false in the output column |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::from_durations | ( | column_view const & | durations, |
std::string_view | format = "%D days %H:%M:%S" , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting a duration column into strings using the provided format pattern.
The format pattern can include the following specifiers: "%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS"
Specifier | Description | Range |
---|---|---|
%% | A literal % character | % |
%n | A newline character | \n |
%t | A horizontal tab character | \t |
%D | Days | -2,147,483,648 to 2,147,483,647 |
%H | 24-hour of the day | 00 to 23 |
%I | 12-hour of the day | 00 to 11 |
%M | Minute of the hour | 00 to 59 |
%S | Second of the minute | 00 to 59.999999999 |
%OH | same as H but without sign | 00 to 23 |
%OI | same as I but without sign | 00 to 11 |
%OM | same as M but without sign | 00 to 59 |
%OS | same as S but without sign | 00 to 59 |
%p | AM/PM designations associated with a 12-hour clock | 'AM' or 'PM' |
%R | Equivalent to "%H:%M" | |
%T | Equivalent to "%H:%M:%S" | |
%r | Equivalent to "%OI:%OM:%OS %p" | |
No checking is done for invalid formats or invalid duration values. Formatting follows the specification of std::formatter<std::chrono::duration> as much as possible.
Any null input entry will result in a corresponding null entry in the output column.
The time units of the input column determine the number of fractional digits written for seconds: 3 digits for milliseconds, 6 digits for microseconds and 9 digits for nanoseconds. If a duration value is negative, only one negative sign is written to the output string. The specifiers with signs are "%H,%I,%M,%S,%R,%T".
cudf::logic_error | if durations column parameter is not a duration type. |
durations | Duration values to convert |
format | The string specifying output format. Default format is "%D days %H:%M:%S". |
mr | Device memory resource used to allocate the returned column's device memory |
stream | CUDA stream used for device memory operations and kernel launches |
std::unique_ptr<column> cudf::strings::from_fixed_point | ( | column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting the fixed-point values into a strings column.
Any null entries result in corresponding null entries in the output column.
For each value, a string is created in base-10 decimal. Negative numbers include a '-' prefix in the output string. The column's scale value is used to place the decimal point. A negative scale value may add padded zeros after the decimal point.
cudf::logic_error | if the input column is not a fixed-point decimal type. |
input | Fixed-point column to convert |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
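The way the scale places the decimal point can be sketched on the host for a single value. The helper name fixed_point_to_string is hypothetical, the value is assumed to be the unscaled integer stored in the column, and the handling of a positive scale (appending zeros) is an assumption of this sketch rather than documented cudf behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical host-side helper (not part of cudf).
// value is the unscaled integer; the represented number is value * 10^scale.
std::string fixed_point_to_string(std::int64_t value, int scale)
{
  bool const negative = value < 0;
  // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
  std::uint64_t const mag = negative ? 0 - static_cast<std::uint64_t>(value)
                                     : static_cast<std::uint64_t>(value);
  std::string digits = std::to_string(mag);
  if (scale >= 0) {
    // Assumption: a positive scale appends that many zeros.
    digits.append(static_cast<std::size_t>(scale), '0');
  } else {
    auto const frac = static_cast<std::size_t>(-scale);
    // Pad with zeros so at least one digit remains before the decimal point.
    if (digits.size() <= frac) { digits.insert(0, frac - digits.size() + 1, '0'); }
    digits.insert(digits.size() - frac, 1, '.');
  }
  return negative ? "-" + digits : digits;
}
```

For instance, an unscaled value of 5 with scale -3 becomes "0.005" in this sketch, which shows the zero padding after the decimal point mentioned above.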
std::unique_ptr<column> cudf::strings::from_floats | ( | column_view const & | floats, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting the float values from the provided column into strings.
Any null entries will result in corresponding null entries in the output column.
For each float, a string is created in base-10 decimal. Negative numbers will include a '-' prefix. Numbers producing more than 10 significant digits will produce a string that includes scientific notation (e.g. "-1.78e+15").
cudf::logic_error | if floats column is not float type. |
floats | Numeric column to convert |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::from_integers | ( | column_view const & | integers, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting the integer values from the provided column into strings.
Any null entries will result in corresponding null entries in the output column.
For each integer, a string is created in base-10 decimal. Negative numbers will include a '-' prefix.
cudf::logic_error | if integers column is not integral type. |
integers | Numeric column to convert |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::from_timestamps | ( | column_view const & | timestamps, |
std::string_view | format = "%Y-%m-%dT%H:%M:%SZ" , |
||
strings_column_view const & | names = strings_column_view(column_view{ data_type{type_id::STRING}, 0, nullptr, nullptr, 0}) , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting a timestamp column into strings using the provided format pattern.
The format pattern can include the following specifiers: "%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z,%Z"
Specifier | Description |
---|---|
%d | Day of the month: 01-31 |
%m | Month of the year: 01-12 |
%y | Year without century: 00-99 |
%Y | Year with century: 0001-9999 |
%H | 24-hour of the day: 00-23 |
%I | 12-hour of the day: 01-12 |
%M | Minute of the hour: 00-59 |
%S | Second of the minute: 00-59 |
%f | 6-digit microsecond: 000000-999999 |
%z | Always outputs "+0000" |
%Z | Always outputs "UTC" |
%j | Day of the year: 001-366 |
%u | ISO weekday where Monday is 1 and Sunday is 7 |
%w | Weekday where Sunday is 0 and Saturday is 6 |
%U | Week of the year with Sunday as the first day: 00-53 |
%W | Week of the year with Monday as the first day: 00-53 |
%V | Week of the year per ISO-8601 format: 01-53 |
%G | Year based on the ISO-8601 weeks: 0000-9999 |
%p | AM/PM from timestamp_names::am_str/pm_str |
%a | Weekday abbreviation from the names parameter |
%A | Weekday from the names parameter |
%b | Month name abbreviation from the names parameter |
%B | Month name from the names parameter |
Additional descriptions can be found here: https://en.cppreference.com/w/cpp/chrono/system_clock/formatter
No checking is done for invalid formats or invalid timestamp values. All timestamp values are formatted to UTC.
Any null input entry will result in a corresponding null entry in the output column.
The time units of the input column do not influence the number of digits written by the "%f" specifier. The "%f" supports a precision value to write out numeric digits for the subsecond value. Specify the precision with a single integer value (1-9) between the "%" and the "f" as follows: use "%3f" for milliseconds, use "%6f" for microseconds and use "%9f" for nanoseconds. If the precision is higher than the units, then zeroes are padded to the right of the subsecond value. If the precision is lower than the units, the subsecond value may be truncated.
If the "%a", "%A", "%b", "%B" specifiers are included in the format, the caller should provide the format names in the names strings column. The result is undefined if the format names are not provided for these specifiers.
These format names can be retrieved for specific locales using the nl_langinfo functions from the C++ clocale (std) library or the Python locale library.
The following code is an example of retrieving these strings from the locale using c++ std functions:
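A minimal sketch of such a retrieval, using the POSIX nl_langinfo function, is given here. It assumes the 40 names are laid out as the AM/PM strings, then the 7 full weekday names starting with Sunday, the 7 abbreviated weekday names, the 12 full month names and the 12 abbreviated month names; that layout matches the required size of 40 but is an assumption of this example. The function name locale_format_names is hypothetical, and building a cudf strings column from the resulting vector is left out.

```cpp
#include <cassert>
#include <clocale>
#include <langinfo.h>
#include <string>
#include <vector>

// Hypothetical helper: collects 40 names from the current LC_TIME locale.
// Call std::setlocale(LC_TIME, ...) beforehand to select a specific locale.
std::vector<std::string> locale_format_names()
{
  static const nl_item items[] = {
    AM_STR, PM_STR,                                              // AM/PM strings
    DAY_1, DAY_2, DAY_3, DAY_4, DAY_5, DAY_6, DAY_7,             // full weekday names
    ABDAY_1, ABDAY_2, ABDAY_3, ABDAY_4, ABDAY_5, ABDAY_6, ABDAY_7,  // abbreviated weekdays
    MON_1, MON_2, MON_3, MON_4, MON_5, MON_6,
    MON_7, MON_8, MON_9, MON_10, MON_11, MON_12,                 // full month names
    ABMON_1, ABMON_2, ABMON_3, ABMON_4, ABMON_5, ABMON_6,
    ABMON_7, ABMON_8, ABMON_9, ABMON_10, ABMON_11, ABMON_12};    // abbreviated months

  std::vector<std::string> names;
  names.reserve(sizeof(items) / sizeof(items[0]));
  for (nl_item item : items) {
    names.emplace_back(nl_langinfo(item));  // copy the locale string
  }
  return names;
}
```

Selecting a non-default locale, for example with std::setlocale(LC_TIME, "de_DE.UTF-8"), only works if that locale is installed on the system.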
cudf::logic_error | if timestamps column parameter is not a timestamp type. |
cudf::logic_error | if the format string is empty |
cudf::logic_error | if names.size() is an invalid size. Must be 0 or 40 strings. |
timestamps | Timestamp values to convert |
format | The string specifying output format. Default format is "%Y-%m-%dT%H:%M:%SZ". |
names | The string names to use for weekdays ("%a", "%A") and months ("%b", "%B"). Default is an empty strings_column_view. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::hex_to_integers | ( | strings_column_view const & | input, |
data_type | output_type, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new integer numeric column parsing hexadecimal values from the provided strings column.
Any null entries will result in corresponding null entries in the output column.
Only characters [0-9] and [A-F] are recognized. When any other character is encountered, the parsing ends for that string. No interpretation is made on the sign of the integer.
Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.
cudf::logic_error | if output_type is not integral type. |
input | Strings instance for this operation |
output_type | Type of integer numeric column to return |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
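The parsing rule described above (stop at the first unrecognized character, accumulate in a 64-bit value) can be sketched for a single string on the host. The helper name hex_to_int64 is hypothetical, and the sketch follows the text literally by accepting only [0-9] and [A-F]; it is not the cudf implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical host-side helper (not part of cudf).
// Parses leading hex digits and stops at the first other character.
std::int64_t hex_to_int64(std::string_view s)
{
  std::uint64_t result = 0;
  for (char c : s) {
    unsigned digit = 0;
    if (c >= '0' && c <= '9') {
      digit = static_cast<unsigned>(c - '0');
    } else if (c >= 'A' && c <= 'F') {
      digit = static_cast<unsigned>(c - 'A') + 10u;
    } else {
      break;  // parsing ends for this string
    }
    result = (result << 4) | digit;  // no overflow check, as in the description
  }
  return static_cast<std::int64_t>(result);
}
```

A caller would then cast the 64-bit result to the requested output type, which is where a value that is too large for the target type is lost.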
std::unique_ptr<column> cudf::strings::integers_to_hex | ( | column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new strings column converting integer columns to hexadecimal characters.
Any null entries will result in corresponding null entries in the output column.
The output character set is '0'-'9' and 'A'-'F'. The output string width will be a multiple of 2 depending on the size of the integer type. A single leading zero is applied to the first non-zero output byte if it is less than 0x10.
For example, in an INT32 column, where each integer is 4 bytes, leading zeros are suppressed unless filling out a complete byte, as in 1234 -> '04D2' instead of '000004D2' or '4D2'.
cudf::logic_error | if the input column is not integral type. |
input | Integer column to convert to hex |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
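The byte-wise suppression of leading zeros can be illustrated with a small host-side sketch for 32-bit values. The helper name int32_to_hex is hypothetical, and rendering zero as a single '00' byte is an assumption of this sketch; it is not the cudf implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical host-side helper (not part of cudf).
// Writes two hex characters per byte, skipping leading all-zero bytes.
std::string int32_to_hex(std::int32_t value)
{
  static constexpr char digits[] = "0123456789ABCDEF";
  auto const bits = static_cast<std::uint32_t>(value);  // reinterpret as raw bits
  std::string out;
  bool started = false;
  for (int shift = 24; shift >= 0; shift -= 8) {
    unsigned const byte = (bits >> shift) & 0xFFu;
    // Skip leading zero bytes, but always emit the last byte.
    if (byte == 0 && !started && shift > 0) { continue; }
    started = true;
    out += digits[byte >> 4];
    out += digits[byte & 0x0Fu];
  }
  return out;
}
```

Because output is produced one whole byte at a time, the result always has an even number of characters.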
std::unique_ptr<column> cudf::strings::integers_to_ipv4 | ( | column_view const & | integers, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Converts integers into IPv4 addresses as strings.
The IPv4 format is four groups of 1-3 digit characters [0-9] separated by 3 dots (e.g. 123.45.67.89). Each section can have a value between [0-255].
Each input integer is dissected into four integers by dividing the input into 8-bit sections. These sub-integers are then converted into [0-9] characters and placed between '.' characters.
No checking is done on the input integer value. Only the lower 32-bits are used.
Any null entries will result in corresponding null entries in the output column.
cudf::logic_error | if the input column is not INT64 type. |
integers | Integer (INT64) column to convert |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
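The dissection into 8-bit sections can be sketched on the host as follows. The helper name int_to_ipv4 is hypothetical, and the sketch assumes the first section of the address comes from the most significant byte of the lower 32 bits; it is not the cudf implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical host-side helper (not part of cudf).
std::string int_to_ipv4(std::int64_t value)
{
  auto const bits = static_cast<std::uint32_t>(value);  // only the lower 32 bits are used
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out += std::to_string((bits >> shift) & 0xFFu);  // one 8-bit section as decimal digits
    if (shift > 0) { out += '.'; }
  }
  return out;
}
```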
std::unique_ptr<column> cudf::strings::ipv4_to_integers | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Converts IPv4 addresses into integers.
The IPv4 format is four groups of 1-3 digit characters [0-9] separated by 3 dots (e.g. 123.45.67.89). Each section can have a value between [0-255].
The four sets of digits are converted to integers and placed in 8-bit fields inside the resulting integer.
No checking is done on the format. If a string is not in IPv4 format, the resulting integer is undefined.
The resulting 32-bit integer is placed in an int64_t to avoid setting the sign-bit in an int32_t type. This could be changed if cudf supported a UINT32 type in the future.
Any null entries will result in corresponding null entries in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
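A host-side sketch of packing the four sets of digits into 8-bit fields is shown here. The helper name ipv4_to_int is hypothetical, the first section is assumed to land in the most significant byte, and, like the function described above, the sketch performs no format checking.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical host-side helper (not part of cudf).
// Input is assumed to be well formed; nothing is validated.
std::int64_t ipv4_to_int(std::string_view s)
{
  std::uint32_t result = 0;
  std::uint32_t octet  = 0;
  for (char c : s) {
    if (c == '.') {
      result = (result << 8) | (octet & 0xFFu);  // store the finished section
      octet  = 0;
    } else {
      octet = octet * 10 + static_cast<std::uint32_t>(c - '0');
    }
  }
  result = (result << 8) | (octet & 0xFFu);  // last section
  // Widening to 64 bits keeps addresses above 127.255.255.255 non-negative.
  return static_cast<std::int64_t>(result);
}
```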
std::unique_ptr<column> cudf::strings::is_fixed_point | ( | strings_column_view const & | input, |
data_type | decimal_type = data_type{type_id::DECIMAL64} , |
||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a boolean column identifying strings in which all characters are valid for conversion to fixed-point.
The sign and the exponent are optional. The decimal point may only appear once. Also, the integer component must fit within the size limits of the underlying fixed-point storage type. The value of the integer component is based on the scale of the decimal_type provided.
Any null entries result in corresponding null entries in the output column.
cudf::logic_error | if the decimal_type is not a fixed-point decimal type. |
input | Strings instance for this operation |
decimal_type | Fixed-point type (with scale) used only for checking overflow |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::is_float | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a boolean column identifying strings in which all characters are valid for conversion to floats.
The output row entry will be set to true if the corresponding string element has at least one character in [-+0-9eE.].
Any null row results in a null entry for that row in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::is_hex | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers from hex.
The output row entry will be set to true if the corresponding string element has at least one character in [0-9A-Fa-f]. Also, the string may start with '0x'.
Any null row results in a null entry for that row in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::is_integer | ( | strings_column_view const & | input, |
data_type | int_type, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers.
The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Also, the integer component must fit within the size limits of the underlying storage type, which is provided by the int_type parameter.
Any null row results in a null entry for that row in the output column.
input | Strings instance for this operation |
int_type | Integer type used for checking underflow and overflow |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
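The combined character and range check can be sketched on the host for signed integer types. The template name fits_integer is hypothetical, and treating an empty string or a lone sign as invalid is an assumption of this sketch; it is not the cudf implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string_view>
#include <type_traits>

// Hypothetical host-side helper (not part of cudf).
// Returns true if s is an optionally signed decimal integer that fits in T.
template <typename T>
bool fits_integer(std::string_view s)
{
  static_assert(std::is_integral_v<T> && std::is_signed_v<T>, "signed integer type expected");
  if (s.empty()) { return false; }
  bool const negative = (s[0] == '-');
  if (s[0] == '-' || s[0] == '+') { s.remove_prefix(1); }  // sign only in first position
  if (s.empty()) { return false; }

  // Largest allowed magnitude: max for positive values, max + 1 for negative ones.
  std::uint64_t const limit =
    static_cast<std::uint64_t>(std::numeric_limits<T>::max()) + (negative ? 1u : 0u);
  std::uint64_t magnitude = 0;
  for (char c : s) {
    if (c < '0' || c > '9') { return false; }
    auto const digit = static_cast<std::uint64_t>(c - '0');
    // Equivalent to magnitude * 10 + digit > limit, written without overflow.
    if (magnitude > (limit - digit) / 10) { return false; }
    magnitude = magnitude * 10 + digit;
  }
  return true;
}
```

For an 8-bit type, "127" and "-128" pass this check while "128" and "-129" do not.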
std::unique_ptr<column> cudf::strings::is_integer | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers.
The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Notice that the integer value is not checked to be within its storage limits. For a strict integer type check, use the other is_integer() API, which accepts a data_type argument.
Any null row results in a null entry for that row in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::is_ipv4 | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a boolean column identifying strings in which all characters are valid for conversion to integers from IPv4 format.
The output row entry will be set to true if the corresponding string element has the following format xxx.xxx.xxx.xxx where xxx is integer digits between 0-255.
Any null row results in a null entry for that row in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::is_timestamp | ( | strings_column_view const & | input, |
std::string_view | format, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Verifies the given strings column can be parsed to timestamps using the provided format pattern.
The format pattern can include the following specifiers: "%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z"
Specifier | Description |
---|---|
%d | Day of the month: 01-31 |
%m | Month of the year: 01-12 |
%y | Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999] |
%Y | Year with century: 0001-9999 |
%H | 24-hour of the day: 00-23 |
%I | 12-hour of the day: 01-12 |
%M | Minute of the hour: 00-59 |
%S | Second of the minute: 00-59. Leap second is not supported. |
%f | 6-digit microsecond: 000000-999999 |
%z | UTC offset with format ±HHMM Example +0500 |
%j | Day of the year: 001-366 |
%p | Only 'AM', 'PM' or 'am', 'pm' are recognized |
%W | Week of the year with Monday as the first day of the week: 00-53 |
%w | Day of week: 0-6 = Sunday-Saturday |
%U | Week of the year with Sunday as the first day of the week: 00-53 |
%u | Day of week: 1-7 = Monday-Sunday |
Other specifiers are not currently supported. The "%f" supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use "%3f" for milliseconds, "%6f" for microseconds and "%9f" for nanoseconds.
Any null string entry will result in a corresponding null row in the output column.
This will return a column of type BOOL8 where a true row indicates the corresponding input string can be parsed correctly with the given format.
input | Strings instance for this operation |
format | String specifying the timestamp format in strings |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::to_booleans | ( | strings_column_view const & | input, |
string_scalar const & | true_string, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new BOOL8 column by parsing boolean values from the strings in the provided strings column.
Any null entries will result in corresponding null entries in the output column.
input | Strings instance for this operation |
true_string | String to expect for true. Non-matching strings are false |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::to_durations | ( | strings_column_view const & | input, |
data_type | duration_type, | ||
std::string_view | format, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new duration column converting a strings column into durations using the provided format pattern.
The format pattern can include the following specifiers: "%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS"
Specifier | Description | Range |
---|---|---|
%% | A literal % character | % |
%n | A newline character | \n |
%t | A horizontal tab character | \t |
%D | Days | -2,147,483,648 to 2,147,483,647 |
%H | 24-hour of the day | 00 to 23 |
%I | 12-hour of the day | 00 to 11 |
%M | Minute of the hour | 00 to 59 |
%S | Second of the minute | 00 to 59.999999999 |
%OH | same as H but without sign | 00 to 23 |
%OI | same as I but without sign | 00 to 11 |
%OM | same as M but without sign | 00 to 59 |
%OS | same as S but without sign | 00 to 59 |
%p | AM/PM designations associated with a 12-hour clock | 'AM' or 'PM' |
%R | Equivalent to "%H:%M" | |
%T | Equivalent to "%H:%M:%S" | |
%r | Equivalent to "%OI:%OM:%OS %p" | |
Other specifiers are not currently supported.
Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry's duration value is undefined.
Any null string entry will result in a corresponding null row in the output column.
The resulting time units are specified by the duration_type parameter.
cudf::logic_error | if duration_type is not a duration type. |
input | Strings instance for this operation |
duration_type | The duration type used for creating the output column |
format | String specifying the duration format in strings |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::to_fixed_point | ( | strings_column_view const & | input, |
data_type | output_type, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new fixed-point column parsing decimal values from the provided strings column.
Any null entries result in corresponding null entries in the output column.
The expected format is [sign][integer][.][fraction], where the sign is either not present, - or +. The decimal point [.] may or may not be present, and integer and fraction are comprised of zero or more digits in [0-9]. An invalid data format results in undefined behavior in the corresponding output row result.
Overflow of the resulting value type is not checked. The scale in the output_type is used for setting the integer component.
cudf::logic_error | if output_type is not a fixed-point decimal type. |
input | Strings instance for this operation |
output_type | Type of fixed-point column to return including the scale value |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::to_floats | ( | strings_column_view const & | strings, |
data_type | output_type, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new numeric column by parsing float values from each string in the provided strings column.
Any null entries will result in corresponding null entries in the output column.
Only characters [0-9] plus a prefix '-' and '+' and decimal '.' are recognized. Additionally, scientific notation is also supported (e.g. "-1.78e+5").
cudf::logic_error | if output_type is not float type. |
strings | Strings instance for this operation |
output_type | Type of float numeric column to return |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::to_integers | ( | strings_column_view const & | input, |
data_type | output_type, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new integer numeric column parsing integer values from the provided strings column.
Any null entries will result in corresponding null entries in the output column.
Only characters [0-9] plus a prefix '-' and '+' are recognized. When any other character is encountered, the parsing ends for that string and the current digits are converted into an integer.
Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.
cudf::logic_error | if output_type is not integral type. |
input | Strings instance for this operation |
output_type | Type of integer numeric column to return |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<column> cudf::strings::to_timestamps | ( | strings_column_view const & | input, |
data_type | timestamp_type, | ||
std::string_view | format, | ||
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Returns a new timestamp column converting a strings column into timestamps using the provided format pattern.
The format pattern can include the following specifiers: "%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z"
Specifier | Description |
---|---|
%d | Day of the month: 01-31 |
%m | Month of the year: 01-12 |
%y | Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999] |
%Y | Year with century: 0001-9999 |
%H | 24-hour of the day: 00-23 |
%I | 12-hour of the day: 01-12 |
%M | Minute of the hour: 00-59 |
%S | Second of the minute: 00-59. Leap second is not supported. |
%f | 6-digit microsecond: 000000-999999 |
%z | UTC offset with format ±HHMM Example +0500 |
%j | Day of the year: 001-366 |
%p | Only 'AM', 'PM' or 'am', 'pm' are recognized |
%W | Week of the year with Monday as the first day of the week: 00-53 |
%w | Day of week: 0-6 = Sunday-Saturday |
%U | Week of the year with Sunday as the first day of the week: 00-53 |
%u | Day of week: 1-7 = Monday-Sunday |
Other specifiers are not currently supported.
Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry's timestamp value is undefined.
Any null string entry will result in a corresponding null row in the output column.
The resulting time units are specified by the timestamp_type parameter. The time units are independent of the number of digits parsed by the "%f" specifier. The "%f" supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use "%3f" for milliseconds, "%6f" for microseconds and "%9f" for nanoseconds.
Although leap second is not supported for "%S", no checking is performed on the value. The cudf::strings::is_timestamp can be used to verify the valid range of values.
If "%W"/"%w" (or "%U"/"%u") and "%m"/"%d" are both specified, the "%W"/"%U" and "%w"/"%u" values take precedence when computing the date part of the timestamp result.
cudf::logic_error | if timestamp_type is not a timestamp type. |
input | Strings instance for this operation |
timestamp_type | The timestamp type used for creating the output column |
format | String specifying the timestamp format in strings |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
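Two of the parsing rules above can be sketched on the host: the "%y" mapping of two-digit years and the reading of subsecond digits with a "%f" precision. Both helper names are hypothetical, and normalizing the fraction to nanoseconds is an assumption made for illustration; neither function is the cudf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999].
int expand_two_digit_year(int yy) { return yy <= 68 ? 2000 + yy : 1900 + yy; }

// Hypothetical helper: reads at most `precision` (1-9) digits of a subsecond
// field and scales the result to nanoseconds.
std::int64_t fraction_to_nanoseconds(std::string_view digits, int precision)
{
  std::int64_t value = 0;
  int read = 0;
  while (read < precision && static_cast<std::size_t>(read) < digits.size()) {
    char const c = digits[static_cast<std::size_t>(read)];
    if (c < '0' || c > '9') { break; }
    value = value * 10 + (c - '0');
    ++read;
  }
  for (int i = read; i < 9; ++i) { value *= 10; }  // pad to 9 fractional digits
  return value;
}
```

In this sketch, reading "123456789" with a precision of 3 keeps only the millisecond digits, so the remaining digits do not contribute to the result.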
std::unique_ptr<column> cudf::strings::url_decode | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Decodes each string using URL encoding.
Converts all character sequences starting with '%' into character code-points interpreting the 2 following characters as hex values to create the code-point. For example, the sequence '%20' is converted into byte (0x20) which is a single space character. Another example converts '%C3%A9' into 2 sequential bytes (0xc3 and 0xa9 respectively) which is the é character. Overall, 3 characters are converted into one char byte whenever a '%' (single percent) character is encountered in the string.
Any null entries will result in corresponding null entries in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
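The three-characters-to-one-byte conversion can be sketched on the host for a single string. The helper name url_decode_host is hypothetical, and copying a '%' through unchanged when it is not followed by two hex digits is an assumption of this sketch; it is not the cudf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical host-side helper (not part of cudf).
std::string url_decode_host(std::string_view s)
{
  auto hex_value = [](char c) -> int {
    if (c >= '0' && c <= '9') { return c - '0'; }
    if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
    if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
    return -1;  // not a hex digit
  };

  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '%' && i + 2 < s.size()) {
      int const hi = hex_value(s[i + 1]);
      int const lo = hex_value(s[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out += static_cast<char>(hi * 16 + lo);  // '%XX' -> one byte
        i += 2;                                  // skip the two hex characters
        continue;
      }
    }
    out += s[i];  // ordinary character, or '%' without two hex digits
  }
  return out;
}
```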
std::unique_ptr<column> cudf::strings::url_encode | ( | strings_column_view const & | input, |
rmm::cuda_stream_view | stream = cudf::get_default_stream() , |
||
rmm::device_async_resource_ref | mr = rmm::mr::get_current_device_resource() |
||
) |
Encodes each string using URL encoding.
Converts mostly non-ASCII characters and control characters into UTF-8 hex code-points prefixed with '%'. For example, the space character is converted to characters '%20' where the '20' indicates the hex value for space in UTF-8. Likewise, multi-byte characters are converted to multiple hex characters. For example, the é character is converted to characters '%C3%A9' where 'C3A9' is the UTF-8 bytes 0xC3A9 for this character.
Any null entries will result in corresponding null entries in the output column.
input | Strings instance for this operation |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
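A host-side sketch of the encoding direction is shown below. The helper name url_encode_host is hypothetical, and the set of characters left unencoded (ASCII letters, digits and '-', '_', '.', '~') is an assumption based on the usual unreserved set; the exact set used by cudf is not stated above, and this is not the library's implementation.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical host-side helper (not part of cudf).
std::string url_encode_host(std::string_view s)
{
  static constexpr char digits[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : s) {
    // Assumption: only these characters are copied through unchanged.
    bool const unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                            (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                            c == '.' || c == '~';
    if (unreserved) {
      out += static_cast<char>(c);
    } else {
      out += '%';               // every other byte becomes '%XX'
      out += digits[c >> 4];
      out += digits[c & 0x0Fu];
    }
  }
  return out;
}
```

Each byte of a multi-byte UTF-8 character is encoded separately, which is why é produces two '%XX' groups.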