Converting

Files

file  convert_booleans.hpp
 
file  convert_datetime.hpp
 
file  convert_durations.hpp
 
file  convert_fixed_point.hpp
 
file  convert_floats.hpp
 
file  convert_integers.hpp
 
file  convert_ipv4.hpp
 
file  convert_lists.hpp
 
file  convert_urls.hpp
 

Functions

std::unique_ptr< column > cudf::strings::to_booleans (strings_column_view const &input, string_scalar const &true_string, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new BOOL8 column by parsing boolean values from the strings in the provided strings column.
 
std::unique_ptr< column > cudf::strings::from_booleans (column_view const &booleans, string_scalar const &true_string, string_scalar const &false_string, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting the boolean values from the provided column into strings.
 
std::unique_ptr< column > cudf::strings::to_timestamps (strings_column_view const &input, data_type timestamp_type, std::string_view format, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new timestamp column converting a strings column into timestamps using the provided format pattern.
 
std::unique_ptr< column > cudf::strings::is_timestamp (strings_column_view const &input, std::string_view format, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Verifies the given strings column can be parsed to timestamps using the provided format pattern.
 
std::unique_ptr< column > cudf::strings::from_timestamps (column_view const &timestamps, std::string_view format="%Y-%m-%dT%H:%M:%SZ", strings_column_view const &names=strings_column_view(column_view{ data_type{type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting a timestamp column into strings using the provided format pattern.
 
std::unique_ptr< column > cudf::strings::to_durations (strings_column_view const &input, data_type duration_type, std::string_view format, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new duration column converting a strings column into durations using the provided format pattern.
 
std::unique_ptr< column > cudf::strings::from_durations (column_view const &durations, std::string_view format="%D days %H:%M:%S", rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting a duration column into strings using the provided format pattern.
 
std::unique_ptr< column > cudf::strings::to_fixed_point (strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new fixed-point column parsing decimal values from the provided strings column.
 
std::unique_ptr< column > cudf::strings::from_fixed_point (column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting the fixed-point values into a strings column.
 
std::unique_ptr< column > cudf::strings::is_fixed_point (strings_column_view const &input, data_type decimal_type=data_type{type_id::DECIMAL64}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a boolean column identifying strings in which all characters are valid for conversion to fixed-point.
 
std::unique_ptr< column > cudf::strings::to_floats (strings_column_view const &strings, data_type output_type, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new numeric column by parsing float values from each string in the provided strings column.
 
std::unique_ptr< column > cudf::strings::from_floats (column_view const &floats, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting the float values from the provided column into strings.
 
std::unique_ptr< column > cudf::strings::is_float (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a boolean column identifying strings in which all characters are valid for conversion to floats.
 
std::unique_ptr< column > cudf::strings::to_integers (strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new integer numeric column parsing integer values from the provided strings column.
 
std::unique_ptr< column > cudf::strings::from_integers (column_view const &integers, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting the integer values from the provided column into strings.
 
std::unique_ptr< column > cudf::strings::is_integer (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a boolean column identifying strings in which all characters are valid for conversion to integers.
 
std::unique_ptr< column > cudf::strings::is_integer (strings_column_view const &input, data_type int_type, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a boolean column identifying strings in which all characters are valid for conversion to integers.
 
std::unique_ptr< column > cudf::strings::hex_to_integers (strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new integer numeric column parsing hexadecimal values from the provided strings column.
 
std::unique_ptr< column > cudf::strings::is_hex (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a boolean column identifying strings in which all characters are valid for conversion to integers from hex.
 
std::unique_ptr< column > cudf::strings::integers_to_hex (column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a new strings column converting integer columns to hexadecimal characters.
 
std::unique_ptr< column > cudf::strings::ipv4_to_integers (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Converts IPv4 addresses into integers.
 
std::unique_ptr< column > cudf::strings::integers_to_ipv4 (column_view const &integers, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Converts integers into IPv4 addresses as strings.
 
std::unique_ptr< column > cudf::strings::is_ipv4 (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Returns a boolean column identifying strings in which all characters are valid for conversion to integers from IPv4 format.
 
std::unique_ptr< column > cudf::strings::format_list_column (lists_column_view const &input, string_scalar const &na_rep=string_scalar(""), strings_column_view const &separators=strings_column_view(column_view{ data_type{type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Convert a list column of strings into a formatted strings column.
 
std::unique_ptr< column > cudf::strings::url_encode (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Encodes each string using URL encoding.
 
std::unique_ptr< column > cudf::strings::url_decode (strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=rmm::mr::get_current_device_resource())
 Decodes each string using URL encoding.
 

Detailed Description

Function Documentation

◆ format_list_column()

std::unique_ptr<column> cudf::strings::format_list_column ( lists_column_view const &  input,
string_scalar const &  na_rep = string_scalar(""),
strings_column_view const &  separators = strings_column_view(column_view{ data_type{type_id::STRING}, 0, nullptr, nullptr, 0}),
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Convert a list column of strings into a formatted strings column.

The separators column should contain 3 string elements in the following order:

  • element separator (default is comma ,)
  • left-hand enclosure (default is [)
  • right-hand enclosure (default is ])
l1 = { [[a,b,c], [d,e]], [[f,g], [h]] }
s1 = format_list_column(l1)
s1 is now ["[[a,b,c],[d,e]]", "[[f,g],[h]]"]
l2 = { [[a,b,c], [d,e]], [NULL], [[f,g], NULL, [h]] }
s2 = format_list_column(l2, '-', [':', '{', '}'])
s2 is now ["{{a:b:c}:{d:e}}", "{-}", "{{f:g}:-:{h}}"]
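The role of the three separators and of na_rep can be modelled on the host with a small recursive formatter. This is only a sketch of the documented output format, not how cudf implements it; the Element type and the make_* and format_element helpers are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model of one list element (not a cudf type).
struct Element {
  enum class Kind { Null, String, List };
  Kind kind;
  std::string text;               // used when kind == Kind::String
  std::vector<Element> children;  // used when kind == Kind::List
};

Element make_null() { return Element{Element::Kind::Null, {}, {}}; }
Element make_string(std::string s) { return Element{Element::Kind::String, std::move(s), {}}; }
Element make_list(std::vector<Element> c) { return Element{Element::Kind::List, {}, std::move(c)}; }

// Nulls become na_rep, strings are copied as-is, and lists are wrapped in
// left/right with their formatted children joined by sep.
std::string format_element(Element const& e,
                           std::string const& na_rep,
                           std::string const& sep,
                           std::string const& left,
                           std::string const& right)
{
  switch (e.kind) {
    case Element::Kind::Null: return na_rep;
    case Element::Kind::String: return e.text;
    case Element::Kind::List: break;
  }
  std::string out = left;
  for (std::size_t i = 0; i < e.children.size(); ++i) {
    if (i > 0) out += sep;
    out += format_element(e.children[i], na_rep, sep, left, right);
  }
  return out + right;
}

// Reproduces two rows from the example above.
void check_format_examples()
{
  // [[f,g], NULL, [h]] with na_rep '-' and separators [':', '{', '}']
  Element const row2 = make_list(
    {make_list({make_string("f"), make_string("g")}), make_null(), make_list({make_string("h")})});
  assert(format_element(row2, "-", ":", "{", "}") == "{{f:g}:-:{h}}");

  // [[f,g], [h]] with the default separators
  Element const row1 =
    make_list({make_list({make_string("f"), make_string("g")}), make_list({make_string("h")})});
  assert(format_element(row1, "", ",", "[", "]") == "[[f,g],[h]]");
}
```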
Exceptions
cudf::logic_error if the input column is not a LIST type with a STRING child.
Parameters
input Lists column to format
na_rep Replacement string for null elements
separators Strings to use for enclosing list components and separating elements
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column

◆ from_booleans()

std::unique_ptr<column> cudf::strings::from_booleans ( column_view const &  booleans,
string_scalar const &  true_string,
string_scalar const &  false_string,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting the boolean values from the provided column into strings.

Any null entries will result in corresponding null entries in the output column.

Exceptions
cudf::logic_error if the input column is not BOOL8 type.
Parameters
booleans Boolean column to convert
true_string String to use for true in the output column
false_string String to use for false in the output column
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column

◆ from_durations()

std::unique_ptr<column> cudf::strings::from_durations ( column_view const &  durations,
std::string_view  format = "%D days %H:%M:%S",
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting a duration column into strings using the provided format pattern.

The format pattern can include the following specifiers: "%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS"

Specifier  Description                                         Range
%%         A literal % character                               %
%n         A newline character                                 \n
%t         A horizontal tab character                          \t
%D         Days                                                -2,147,483,648 to 2,147,483,647
%H         24-hour of the day                                  00 to 23
%I         12-hour of the day                                  00 to 11
%M         Minute of the hour                                  00 to 59
%S         Second of the minute                                00 to 59.999999999
%OH        same as H but without sign                          00 to 23
%OI        same as I but without sign                          00 to 11
%OM        same as M but without sign                          00 to 59
%OS        same as S but without sign                          00 to 59
%p         AM/PM designations associated with a 12-hour clock  'AM' or 'PM'
%R         Equivalent to "%H:%M"
%T         Equivalent to "%H:%M:%S"
%r         Equivalent to "%OI:%OM:%OS %p"

No checking is done for invalid formats or invalid duration values. Formatting follows the specification of std::formatter<std::chrono::duration> as closely as possible.

Any null input entry will result in a corresponding null entry in the output column.

The time units of the input column determine the number of decimal digits written for seconds: 3 digits for milliseconds, 6 digits for microseconds and 9 digits for nanoseconds. If the duration value is negative, only one negative sign is written to the output string. The specifiers with signs are "%H,%I,%M,%S,%R,%T".

Exceptions
cudf::logic_error if durations column parameter is not a duration type.
Parameters
durations Duration values to convert
format The string specifying output format. Default format is "%D days %H:%M:%S".
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column with formatted durations

◆ from_fixed_point()

std::unique_ptr<column> cudf::strings::from_fixed_point ( column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting the fixed-point values into a strings column.

Any null entries result in corresponding null entries in the output column.

For each value, a string is created in base-10 decimal. Negative numbers include a '-' prefix in the output string. The column's scale value is used to place the decimal point. A negative scale value may add padded zeros after the decimal point.

Example:
fp is [110, 222, 3330, -440, -1] with scale = -2
s = from_fixed_point(fp)
s is now ['1.10', '2.22', '33.30', '-4.40', '-0.01']
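The placement of the decimal point can be illustrated with a host-side sketch for a single value. It is not the cudf implementation; fixed_point_to_string is a hypothetical helper, and its handling of positive scales (appending zeros) is an assumption that the text above does not spell out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a cudf API): formats one unscaled fixed-point
// value whose real value is value * 10^scale.
std::string fixed_point_to_string(std::int64_t value, int scale)
{
  bool const negative = value < 0;
  // Take the magnitude in unsigned arithmetic so INT64_MIN does not overflow.
  std::uint64_t const magnitude =
    negative ? ~static_cast<std::uint64_t>(value) + 1 : static_cast<std::uint64_t>(value);
  std::string digits = std::to_string(magnitude);

  if (scale > 0) {
    // Assumption: a positive scale appends zeros and writes no decimal point.
    digits.append(static_cast<std::size_t>(scale), '0');
  } else if (scale < 0) {
    auto const fraction_digits = static_cast<std::size_t>(-scale);
    // Left-pad with zeros so at least one digit precedes the decimal point.
    if (digits.size() <= fraction_digits) {
      digits.insert(0, fraction_digits - digits.size() + 1, '0');
    }
    digits.insert(digits.size() - fraction_digits, 1, '.');
  }
  return negative ? "-" + digits : digits;
}

// Reproduces the example above (scale = -2).
void check_fixed_point_examples()
{
  assert(fixed_point_to_string(110, -2) == "1.10");
  assert(fixed_point_to_string(3330, -2) == "33.30");
  assert(fixed_point_to_string(-440, -2) == "-4.40");
  assert(fixed_point_to_string(-1, -2) == "-0.01");
}
```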
Exceptions
cudf::logic_error if the input column is not a fixed-point decimal type.
Parameters
input Fixed-point column to convert
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column

◆ from_floats()

std::unique_ptr<column> cudf::strings::from_floats ( column_view const &  floats,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting the float values from the provided column into strings.

Any null entries will result in corresponding null entries in the output column.

For each float, a string is created in base-10 decimal. Negative numbers will include a '-' prefix. Numbers producing more than 10 significant digits will produce a string that includes scientific notation (e.g. "-1.78e+15").

Exceptions
cudf::logic_error if floats column is not float type.
Parameters
floats Numeric column to convert
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column with floats as strings

◆ from_integers()

std::unique_ptr<column> cudf::strings::from_integers ( column_view const &  integers,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting the integer values from the provided column into strings.

Any null entries will result in corresponding null entries in the output column.

For each integer, a string is created in base-10 decimal. Negative numbers will include a '-' prefix.

Exceptions
cudf::logic_error if integers column is not integral type.
Parameters
integers Numeric column to convert
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column with integers as strings

◆ from_timestamps()

std::unique_ptr<column> cudf::strings::from_timestamps ( column_view const &  timestamps,
std::string_view  format = "%Y-%m-%dT%H:%M:%SZ",
strings_column_view const &  names = strings_column_view(column_view{ data_type{type_id::STRING}, 0, nullptr, nullptr, 0}),
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting a timestamp column into strings using the provided format pattern.

The format pattern can include the following specifiers: "%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z,%Z"

Specifier Description
%d Day of the month: 01-31
%m Month of the year: 01-12
%y Year without century: 00-99
%Y Year with century: 0001-9999
%H 24-hour of the day: 00-23
%I 12-hour of the day: 01-12
%M Minute of the hour: 00-59
%S Second of the minute: 00-59
%f 6-digit microsecond: 000000-999999
%z Always outputs "+0000"
%Z Always outputs "UTC"
%j Day of the year: 001-366
%u ISO weekday where Monday is 1 and Sunday is 7
%w Weekday where Sunday is 0 and Saturday is 6
%U Week of the year with Sunday as the first day: 00-53
%W Week of the year with Monday as the first day: 00-53
%V Week of the year per ISO-8601 format: 01-53
%G Year based on the ISO-8601 weeks: 0000-9999
%p AM/PM from timestamp_names::am_str/pm_str
%a Weekday abbreviation from the names parameter
%A Weekday from the names parameter
%b Month name abbreviation from the names parameter
%B Month name from the names parameter

Additional descriptions can be found here: https://en.cppreference.com/w/cpp/chrono/system_clock/formatter

No checking is done for invalid formats or invalid timestamp values. All timestamp values are formatted to UTC.

Any null input entry will result in a corresponding null entry in the output column.

The time units of the input column do not influence the number of digits written by the "%f" specifier. The "%f" supports a precision value to write out numeric digits for the subsecond value. Specify the precision with a single integer value (1-9) between the "%" and the "f" as follows: use "%3f" for milliseconds, use "%6f" for microseconds and use "%9f" for nanoseconds. If the precision is higher than the units, then zeroes are padded to the right of the subsecond value. If the precision is lower than the units, the subsecond value may be truncated.
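The padding and truncation rule for "%f" can be sketched on the host as follows. This is an illustration of the rule as described, not the cudf code; format_subsecond is a hypothetical helper, and it assumes the subsecond value has already been converted to nanoseconds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not a cudf API): renders a subsecond value, given in
// nanoseconds, using the number of digits requested by "%<precision>f".
std::string format_subsecond(std::int32_t nanoseconds, int precision)
{
  assert(nanoseconds >= 0 && nanoseconds <= 999999999);
  assert(precision >= 1 && precision <= 9);
  std::string digits = std::to_string(nanoseconds);
  digits.insert(0, 9 - digits.size(), '0');  // zero-pad on the left to 9 digits
  // Keeping the leading digits truncates when the precision is lower than the
  // units and leaves right-hand zeros when it is higher.
  return digits.substr(0, static_cast<std::size_t>(precision));
}

void check_subsecond_examples()
{
  // 123 ms written with "%6f": zeros are padded on the right
  assert(format_subsecond(123000000, 6) == "123000");
  // 123456 us written with "%3f": the value is truncated
  assert(format_subsecond(123456000, 3) == "123");
}
```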

If the "%a", "%A", "%b", "%B" specifiers are included in the format, the caller should provide the format names in the names strings column using the following as a guide:

["AM", "PM", // specify the AM/PM strings
"Sunday", "Monday", ..., "Saturday", // Weekday full names
"Sun", "Mon", ..., "Sat", // Weekday abbreviated names
"January", "February", ..., "December", // Month full names
"Jan", "Feb", ..., "Dec"] // Month abbreviated names

The result is undefined if the format names are not provided for these specifiers.

These format names can be retrieved for specific locales using the nl_langinfo function from the POSIX <langinfo.h> header (after selecting the locale with std::setlocale from <clocale>) or the Python locale library.

The following code is an example of retrieving these strings from the locale using C++ and POSIX functions:

#include <clocale>
#include <langinfo.h>
#include <string>
#include <vector>
// note: install language pack on Ubuntu using 'apt-get install language-pack-de'
// Returns the 40 format names in the order expected by the names parameter.
std::vector<std::string> german_format_names()
{
// set to a German language locale for date settings
// (std::setlocale returns nullptr if the locale is not installed)
std::setlocale(LC_TIME, "de_DE.UTF-8");
std::vector<std::string> names({nl_langinfo(AM_STR), nl_langinfo(PM_STR),
nl_langinfo(DAY_1), nl_langinfo(DAY_2), nl_langinfo(DAY_3), nl_langinfo(DAY_4),
nl_langinfo(DAY_5), nl_langinfo(DAY_6), nl_langinfo(DAY_7),
nl_langinfo(ABDAY_1), nl_langinfo(ABDAY_2), nl_langinfo(ABDAY_3), nl_langinfo(ABDAY_4),
nl_langinfo(ABDAY_5), nl_langinfo(ABDAY_6), nl_langinfo(ABDAY_7),
nl_langinfo(MON_1), nl_langinfo(MON_2), nl_langinfo(MON_3), nl_langinfo(MON_4),
nl_langinfo(MON_5), nl_langinfo(MON_6), nl_langinfo(MON_7), nl_langinfo(MON_8),
nl_langinfo(MON_9), nl_langinfo(MON_10), nl_langinfo(MON_11), nl_langinfo(MON_12),
nl_langinfo(ABMON_1), nl_langinfo(ABMON_2), nl_langinfo(ABMON_3), nl_langinfo(ABMON_4),
nl_langinfo(ABMON_5), nl_langinfo(ABMON_6), nl_langinfo(ABMON_7), nl_langinfo(ABMON_8),
nl_langinfo(ABMON_9), nl_langinfo(ABMON_10), nl_langinfo(ABMON_11), nl_langinfo(ABMON_12)});
std::setlocale(LC_TIME, ""); // switch LC_TIME back to the locale set in the environment
return names;
}
Exceptions
cudf::logic_error if timestamps column parameter is not a timestamp type.
cudf::logic_error if the format string is empty.
cudf::logic_error if names.size() is an invalid size. Must be 0 or 40 strings.
Parameters
timestamps Timestamp values to convert
format The string specifying output format. Default format is "%Y-%m-%dT%H:%M:%SZ".
names The string names to use for weekdays ("%a", "%A") and months ("%b", "%B"). Default is an empty strings_column_view.
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column with formatted timestamps

◆ hex_to_integers()

std::unique_ptr<column> cudf::strings::hex_to_integers ( strings_column_view const &  input,
data_type  output_type,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new integer numeric column parsing hexadecimal values from the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Only characters [0-9] and [A-F] are recognized. When any other character is encountered, the parsing ends for that string. No interpretation is made on the sign of the integer.

Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.
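The parsing rule described above (stop at the first unrecognized character, no sign handling, no overflow check) can be sketched for a single string. hex_to_integer is a hypothetical host-side helper, not the cudf kernel; it accumulates in an unsigned 64-bit value instead of int64 so that the sketch itself avoids signed-overflow undefined behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Hypothetical helper (not a cudf API): parses leading [0-9A-F] characters
// and casts the accumulated value to the requested integer type.
template <typename T>
T hex_to_integer(std::string_view str)
{
  static_assert(std::is_integral<T>::value, "integer output type expected");
  std::uint64_t result = 0;
  for (char const ch : str) {
    std::uint64_t digit = 0;
    if (ch >= '0' && ch <= '9') {
      digit = static_cast<std::uint64_t>(ch - '0');
    } else if (ch >= 'A' && ch <= 'F') {
      digit = static_cast<std::uint64_t>(ch - 'A') + 10;
    } else {
      break;  // parsing ends at the first unrecognized character
    }
    result = (result << 4) | digit;  // no overflow check, as documented
  }
  return static_cast<T>(result);
}

void check_hex_to_integer_examples()
{
  assert(hex_to_integer<std::int32_t>("04D2") == 1234);
  assert(hex_to_integer<std::int32_t>("12G4") == 0x12);  // stops at 'G'
  assert(hex_to_integer<std::int64_t>("FFFFFFFF") == 4294967295LL);
}
```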

Exceptions
cudf::logic_error if output_type is not integral type.
Parameters
input Strings instance for this operation
output_type Type of integer numeric column to return
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column with integers converted from strings

◆ integers_to_hex()

std::unique_ptr<column> cudf::strings::integers_to_hex ( column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new strings column converting integer columns to hexadecimal characters.

Any null entries will result in corresponding null entries in the output column.

The output character set is '0'-'9' and 'A'-'F'. The output string width will be a multiple of 2 depending on the size of the integer type. A single leading zero is applied to the first non-zero output byte if it is less than 0x10.

Example:
input = [1234, -1, 0, 27, 342718233] // int32 type input column
s = integers_to_hex(input)
s is [ '04D2', 'FFFFFFFF', '00', '1B', '146D7719']

The example above shows an INT32 type column where each integer is 4 bytes. Leading zeros are suppressed unless filling out a complete byte, as in 1234 -> '04D2' instead of '000004D2' or '4D2'.
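The byte-wise suppression of leading zeros can be sketched for a single value as shown here. integer_to_hex is a hypothetical host-side helper written to match the example output, not the cudf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical helper (not a cudf API): writes two hex characters per byte,
// skipping leading zero bytes but always emitting the lowest byte.
template <typename T>
std::string integer_to_hex(T value)
{
  static_assert(std::is_integral<T>::value, "integer input expected");
  constexpr char hex_digits[] = "0123456789ABCDEF";
  // Reinterpret as unsigned so negative values expose their two's-complement bits.
  auto const bits = static_cast<typename std::make_unsigned<T>::type>(value);
  std::string out;
  for (std::size_t byte = sizeof(T); byte-- > 0;) {
    auto const b = static_cast<unsigned>((bits >> (byte * 8)) & 0xFF);
    if (out.empty() && b == 0 && byte != 0) continue;  // suppress a leading zero byte
    out.push_back(hex_digits[b >> 4]);
    out.push_back(hex_digits[b & 0x0F]);
  }
  return out;
}

// Reproduces the INT32 example above.
void check_integer_to_hex_examples()
{
  assert(integer_to_hex<std::int32_t>(1234) == "04D2");
  assert(integer_to_hex<std::int32_t>(-1) == "FFFFFFFF");
  assert(integer_to_hex<std::int32_t>(0) == "00");
  assert(integer_to_hex<std::int32_t>(27) == "1B");
  assert(integer_to_hex<std::int32_t>(342718233) == "146D7719");
}
```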

Exceptions
cudf::logic_error if the input column is not integral type.
Parameters
input Integer column to convert to hex
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column with hexadecimal characters

◆ integers_to_ipv4()

std::unique_ptr<column> cudf::strings::integers_to_ipv4 ( column_view const &  integers,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Converts integers into IPv4 addresses as strings.

The IPv4 format is four groups of 1-3 decimal digits [0-9] separated by 3 dots (e.g. 123.45.67.89). Each group can have a value in the range [0-255].

Each input integer is dissected into four integers by dividing the input into 8-bit sections. These sub-integers are then converted into [0-9] characters and placed between '.' characters.

No checking is done on the input integer value. Only the lower 32-bits are used.
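The dissection into 8-bit sections can be sketched for one value as follows. integer_to_ipv4 is a hypothetical host-side helper that mirrors the description, not the cudf kernel.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper (not a cudf API): formats the lower 32 bits of the
// input as four dot-separated decimal sections, highest byte first.
std::string integer_to_ipv4(std::int64_t value)
{
  auto const bits = static_cast<std::uint32_t>(value);  // upper 32 bits are ignored
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out += std::to_string((bits >> shift) & 0xFFu);
    if (shift > 0) out += '.';
  }
  return out;
}

void check_integer_to_ipv4_examples()
{
  assert(integer_to_ipv4(2130706433) == "127.0.0.1");
  assert(integer_to_ipv4(4294967295LL) == "255.255.255.255");
}
```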

Any null entries will result in corresponding null entries in the output column.

Exceptions
cudf::logic_error if the input column is not INT64 type.
Parameters
integers Integer (INT64) column to convert
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column

◆ ipv4_to_integers()

std::unique_ptr<column> cudf::strings::ipv4_to_integers ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Converts IPv4 addresses into integers.

The IPv4 format is four groups of 1-3 decimal digits [0-9] separated by 3 dots (e.g. 123.45.67.89). Each group can have a value in the range [0-255].

The four sets of digits are converted to integers and placed in 8-bit fields inside the resulting integer.

i0.i1.i2.i3 -> (i0 << 24) | (i1 << 16) | (i2 << 8) | (i3)

No checking is done on the format. If a string is not in IPv4 format, the resulting integer is undefined.

The resulting 32-bit integer is placed in an int64_t to avoid setting the sign-bit in an int32_t type. This could be changed if cudf supported a UINT32 type in the future.
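The packing formula and the reason for the 64-bit result can be seen in a small host-side sketch. ipv4_to_integer is a hypothetical helper, not the cudf kernel; like the documented function it does not validate the format, and masking each section to 8 bits is only this sketch's choice for malformed input.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical helper (not a cudf API): packs i0.i1.i2.i3 into
// (i0 << 24) | (i1 << 16) | (i2 << 8) | i3 and widens the result to int64.
std::int64_t ipv4_to_integer(std::string_view str)
{
  std::uint32_t result = 0;
  std::uint32_t section = 0;
  for (char const ch : str) {
    if (ch == '.') {
      result  = (result << 8) | (section & 0xFFu);
      section = 0;
    } else if (ch >= '0' && ch <= '9') {
      section = section * 10 + static_cast<std::uint32_t>(ch - '0');
    }
  }
  result = (result << 8) | (section & 0xFFu);  // last section has no trailing dot
  return static_cast<std::int64_t>(result);    // stays non-negative in 64 bits
}

void check_ipv4_to_integer_examples()
{
  assert(ipv4_to_integer("1.2.3.4") == 16909060);
  // The top bit is set here; an int32_t would hold a negative value.
  assert(ipv4_to_integer("192.168.0.1") == 3232235521LL);
}
```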

Any null entries will result in corresponding null entries in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New INT64 column converted from strings

◆ is_fixed_point()

std::unique_ptr<column> cudf::strings::is_fixed_point ( strings_column_view const &  input,
data_type  decimal_type = data_type{type_id::DECIMAL64},
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column identifying strings in which all characters are valid for conversion to fixed-point.

The sign and the exponent are optional. The decimal point may only appear once. Also, the integer component must fit within the size limits of the underlying fixed-point storage type. The value of the integer component is based on the scale of the decimal_type provided.

Example:
s = ['123', '-456', '', '1.2.3', '+17E30', '12.34', '.789', '-0.005']
b = is_fixed_point(s)
b is [true, true, false, false, true, true, true, true]

Any null entries result in corresponding null entries in the output column.

Exceptions
cudf::logic_error if the decimal_type is not a fixed-point decimal type.
Parameters
input Strings instance for this operation
decimal_type Fixed-point type (with scale) used only for checking overflow
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of boolean results for each string

◆ is_float()

std::unique_ptr<column> cudf::strings::is_float ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column identifying strings in which all characters are valid for conversion to floats.

The output row entry will be set to true if the corresponding string element has at least one character in [-+0-9eE.].

Example:
s = ['123', '-456', '', 'A', '+7', '8.9', '3.7e+5']
b = is_float(s)
b is [true, true, false, false, true, true, true]

Any null row results in a null entry for that row in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of boolean results for each string

◆ is_hex()

std::unique_ptr<column> cudf::strings::is_hex ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column identifying strings in which all characters are valid for conversion to integers from hex.

The output row entry will be set to true if the corresponding string element has at least one character in [0-9A-Fa-f]. Also, the string may start with '0x'.

Example:
s = ['123', '-456', '', 'AGE', '+17EA', '0x9EF', '123ABC']
b = is_hex(s)
b is [true, false, false, false, false, true, true]
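A host-side check that reproduces the example results might look like the sketch below. looks_like_hex is a hypothetical helper, not the cudf implementation; accepting lowercase a-f and rejecting a bare '0x' prefix are assumptions of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string_view>

// Hypothetical helper (not a cudf API): true when the string, after an
// optional "0x" prefix, is non-empty and consists only of hex digits.
bool looks_like_hex(std::string_view str)
{
  if (str.size() >= 2 && str[0] == '0' && str[1] == 'x') str.remove_prefix(2);
  if (str.empty()) return false;  // assumption: empty or bare "0x" is not hex
  for (char const ch : str) {
    // std::isxdigit accepts 0-9, A-F and a-f; signs and other letters fail.
    if (!std::isxdigit(static_cast<unsigned char>(ch))) return false;
  }
  return true;
}

// Reproduces the example above.
void check_is_hex_examples()
{
  assert(looks_like_hex("123"));
  assert(!looks_like_hex("-456"));
  assert(!looks_like_hex(""));
  assert(!looks_like_hex("AGE"));
  assert(!looks_like_hex("+17EA"));
  assert(looks_like_hex("0x9EF"));
  assert(looks_like_hex("123ABC"));
}
```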

Any null row results in a null entry for that row in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of boolean results for each string

◆ is_integer() [1/2]

std::unique_ptr<column> cudf::strings::is_integer ( strings_column_view const &  input,
data_type  int_type,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column identifying strings in which all characters are valid for conversion to integers.

The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Also, the integer component must fit within the size limits of the underlying storage type, which is provided by the int_type parameter.

Example:
s = ['123456', '-456', '', 'A', '+7']
output1 = is_integer(s, data_type{type_id::INT32})
output1 is [true, true, false, false, true]
output2 = is_integer(s, data_type{type_id::INT8})
output2 is [false, false, false, false, true]
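The combination of the character check and the range check can be sketched on the host with a template parameter standing in for int_type. is_integer_of is a hypothetical helper and not the cudf implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <string_view>
#include <type_traits>

// Hypothetical helper (not a cudf API): true when str is an optionally signed
// run of decimal digits whose value fits in the integer type T.
template <typename T>
bool is_integer_of(std::string_view str)
{
  static_assert(std::is_integral<T>::value, "integer type expected");
  if (str.empty()) return false;
  bool const negative = str.front() == '-';
  if (negative || str.front() == '+') str.remove_prefix(1);
  if (str.empty()) return false;  // a lone sign is not an integer

  // Largest allowed magnitude: max() for positive values, |min()| for negative ones.
  std::uint64_t const limit =
    negative ? std::uint64_t{0} - static_cast<std::uint64_t>(std::numeric_limits<T>::min())
             : static_cast<std::uint64_t>(std::numeric_limits<T>::max());

  std::uint64_t magnitude = 0;
  for (char const ch : str) {
    if (ch < '0' || ch > '9') return false;
    auto const digit = static_cast<std::uint64_t>(ch - '0');
    // Reject before magnitude * 10 + digit would exceed the limit.
    if (digit > limit || magnitude > (limit - digit) / 10) return false;
    magnitude = magnitude * 10 + digit;
  }
  return true;
}

// Reproduces the example above.
void check_is_integer_examples()
{
  assert(is_integer_of<std::int32_t>("123456"));
  assert(is_integer_of<std::int32_t>("-456"));
  assert(!is_integer_of<std::int32_t>(""));
  assert(!is_integer_of<std::int32_t>("A"));
  assert(is_integer_of<std::int32_t>("+7"));
  assert(!is_integer_of<std::int8_t>("123456"));
  assert(!is_integer_of<std::int8_t>("-456"));
  assert(is_integer_of<std::int8_t>("+7"));
}
```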

Any null row results in a null entry for that row in the output column.

Parameters
input Strings instance for this operation
int_type Integer type used for checking underflow and overflow
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of boolean results for each string

◆ is_integer() [2/2]

std::unique_ptr<column> cudf::strings::is_integer ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column identifying strings in which all characters are valid for conversion to integers.

The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Note that the integer value is not checked to be within its storage limits. For a strict integer type check, use the other is_integer() overload, which accepts a data_type argument.

Example:
s = ['123', '-456', '', 'A', '+7']
b = is_integer(s)
b is [true, true, false, false, true]

Any null row results in a null entry for that row in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of boolean results for each string

◆ is_ipv4()

std::unique_ptr<column> cudf::strings::is_ipv4 ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a boolean column identifying strings in which all characters are valid for conversion to integers from IPv4 format.

The output row entry will be set to true if the corresponding string element has the format xxx.xxx.xxx.xxx, where each xxx is an integer in the range 0-255.

Example:
s = ['123.255.0.7', '127.0.0.1', '', '1.2.34', '123.456.789.10']
b = is_ipv4(s)
b is [true, true, false, false, false]

Any null row results in a null entry for that row in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of boolean results for each string
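
The format check can be sketched for a single string on the host as follows. This is an illustration only, not libcudf's implementation: is_ipv4_sketch is a hypothetical helper, and it assumes each group holds one to three digits and that leading zeros are accepted, neither of which is stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Host-side sketch of the documented IPv4 format check; NOT libcudf's implementation.
// Assumptions: each group has 1 to 3 digits and leading zeros are allowed.
bool is_ipv4_sketch(std::string_view s)
{
  int groups      = 0;
  std::size_t pos = 0;
  while (true) {
    int value  = 0;
    int digits = 0;
    while (pos < s.size() && s[pos] >= '0' && s[pos] <= '9') {
      value = value * 10 + (s[pos] - '0');
      ++digits;
      ++pos;
      if (digits > 3) { return false; }  // group is too long
    }
    if (digits == 0 || value > 255) { return false; }  // empty group or out of range
    ++groups;
    if (pos == s.size()) { break; }            // consumed the whole string
    if (s[pos] != '.') { return false; }       // only '.' may separate groups
    ++pos;
    if (groups == 4) { return false; }         // a separator after the fourth group
  }
  return groups == 4;
}
```

Applied to the example strings, '1.2.34' fails because it has only three groups and '123.456.789.10' fails because 456 and 789 exceed 255.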

◆ is_timestamp()

std::unique_ptr<column> cudf::strings::is_timestamp ( strings_column_view const &  input,
std::string_view  format,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Verifies the given strings column can be parsed to timestamps using the provided format pattern.

The format pattern can include the following specifiers: "%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z"

Specifier Description
%d Day of the month: 01-31
%m Month of the year: 01-12
%y Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999]
%Y Year with century: 0001-9999
%H 24-hour of the day: 00-23
%I 12-hour of the day: 01-12
%M Minute of the hour: 00-59
%S Second of the minute: 00-59. Leap second is not supported.
%f 6-digit microsecond: 000000-999999
%z UTC offset with format ±HHMM Example +0500
%j Day of the year: 001-366
%p Only 'AM', 'PM' or 'am', 'pm' are recognized
%W Week of the year with Monday as the first day of the week: 00-53
%w Day of week: 0-6 = Sunday-Saturday
%U Week of the year with Sunday as the first day of the week: 00-53
%u Day of week: 1-7 = Monday-Sunday

Other specifiers are not currently supported. The "%f" supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use "%3f" for milliseconds, "%6f" for microseconds and "%9f" for nanoseconds.

Any null string entry will result in a corresponding null row in the output column.

This will return a column of type BOOL8 where a true row indicates the corresponding input string can be parsed correctly with the given format.

Parameters
input Strings instance for this operation
format String specifying the timestamp format in strings
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New BOOL8 column

◆ to_booleans()

std::unique_ptr<column> cudf::strings::to_booleans ( strings_column_view const &  input,
string_scalar const &  true_string,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new BOOL8 column by parsing boolean values from the strings in the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Parameters
input Strings instance for this operation
true_string String to expect for true. Non-matching strings are false
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New BOOL8 column converted from strings

◆ to_durations()

std::unique_ptr<column> cudf::strings::to_durations ( strings_column_view const &  input,
data_type  duration_type,
std::string_view  format,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new duration column converting a strings column into durations using the provided format pattern.

The format pattern can include the following specifiers: "%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS"

Specifier  Description                                         Range
%%         A literal % character                               %
%n         A newline character                                 \n
%t         A horizontal tab character                          \t
%D         Days                                                -2,147,483,648 to 2,147,483,647
%H         24-hour of the day                                  00 to 23
%I         12-hour of the day                                  00 to 11
%M         Minute of the hour                                  00 to 59
%S         Second of the minute                                00 to 59.999999999
%OH        same as H but without sign                          00 to 23
%OI        same as I but without sign                          00 to 11
%OM        same as M but without sign                          00 to 59
%OS        same as S but without sign                          00 to 59
%p         AM/PM designations associated with a 12-hour clock  'AM' or 'PM'
%R         Equivalent to "%H:%M"
%T         Equivalent to "%H:%M:%S"
%r         Equivalent to "%OI:%OM:%OS %p"

Other specifiers are not currently supported.

Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry's duration value is undefined.

Any null string entry will result in a corresponding null row in the output column.

The resulting time units are specified by the duration_type parameter.

Exceptions
cudf::logic_error if duration_type is not a duration type.
Parameters
input Strings instance for this operation
duration_type The duration type used for creating the output column
format String specifying the duration format in strings
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New duration column

◆ to_fixed_point()

std::unique_ptr<column> cudf::strings::to_fixed_point ( strings_column_view const &  input,
data_type  output_type,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new fixed-point column parsing decimal values from the provided strings column.

Any null entries result in corresponding null entries in the output column.

The expected format is [sign][integer][.][fraction], where the sign is either not present, - or +. The decimal point [.] may or may not be present, and integer and fraction each consist of zero or more digits in [0-9]. An invalid data format results in undefined behavior in the corresponding output row result.

Example:
s = ['123', '-876', '543.2', '-0.12']
datatype = {DECIMAL32, scale=-2}
fp = to_fixed_point(s, datatype)
fp is [12300, -87600, 54320, -12]

Overflow of the resulting value type is not checked. The scale in the output_type is used for setting the integer component.

Exceptions
cudf::logic_error if output_type is not a fixed-point decimal type.
Parameters
input Strings instance for this operation
output_type Type of fixed-point column to return including the scale value
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column of output_type
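
The relationship between the scale and the stored integer can be worked through with a small host-side sketch. It is not libcudf's implementation: to_fixed_point_sketch is a hypothetical helper, it handles only scales that are zero or negative, and it assumes fraction digits beyond the scale are truncated, which the description above does not specify.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Host-side sketch: parse [sign][integer][.][fraction] into the integer that
// a fixed-point value with the given scale would store (value * 10^-scale).
// NOT libcudf's implementation; overflow is not checked, as in the description.
std::int64_t to_fixed_point_sketch(std::string_view s, int scale)
{
  assert(scale <= 0);  // this sketch only covers zero or negative scales

  std::size_t i = 0;
  bool negative = false;
  if (!s.empty() && (s[0] == '-' || s[0] == '+')) {
    negative = (s[0] == '-');
    i        = 1;
  }

  std::int64_t value = 0;
  // Integer part: every digit contributes directly.
  for (; i < s.size() && s[i] >= '0' && s[i] <= '9'; ++i) {
    value = value * 10 + (s[i] - '0');
  }

  int fraction_digits_left = -scale;
  if (i < s.size() && s[i] == '.') {
    ++i;
    // Fraction part: keep at most -scale digits (extra digits are truncated, an assumption).
    for (; i < s.size() && s[i] >= '0' && s[i] <= '9' && fraction_digits_left > 0; ++i) {
      value = value * 10 + (s[i] - '0');
      --fraction_digits_left;
    }
  }
  // Pad with zeros when the string has fewer fraction digits than the scale requires.
  for (; fraction_digits_left > 0; --fraction_digits_left) { value *= 10; }

  return negative ? -value : value;
}
```

With scale = -2, '543.2' becomes 54320 (one parsed fraction digit plus one padding zero) and '-0.12' becomes -12.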

◆ to_floats()

std::unique_ptr<column> cudf::strings::to_floats ( strings_column_view const &  strings,
data_type  output_type,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new numeric column by parsing float values from each string in the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Only characters [0-9] plus a prefix '-' or '+' and a decimal '.' are recognized. Scientific notation is also supported (e.g. "-1.78e+5").

Exceptions
cudf::logic_error if output_type is not a float type.
Parameters
strings Strings instance for this operation
output_type Type of float numeric column to return
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column with floats converted from strings

◆ to_integers()

std::unique_ptr<column> cudf::strings::to_integers ( strings_column_view const &  input,
data_type  output_type,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new integer numeric column parsing integer values from the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Only characters [0-9] plus a prefix '-' and '+' are recognized. When any other character is encountered, the parsing ends for that string and the current digits are converted into an integer.

Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.

Exceptions
cudf::logic_error if output_type is not an integral type.
Parameters
input Strings instance for this operation
output_type Type of integer numeric column to return
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New column with integers converted from strings
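
The parse-then-cast behavior described above can be sketched per string on the host. This is an illustration under stated assumptions, not libcudf's implementation: to_integer_sketch is a hypothetical helper, and it accumulates in an unsigned 64-bit value so that the sketch itself stays well defined when the input overflows.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <type_traits>

// Host-side sketch: read an optional sign and leading digits, stop at the first
// other character, then cast the 64-bit result to the target type.
// NOT libcudf's implementation; overflow is not checked, as in the description.
template <typename T>
T to_integer_sketch(std::string_view s)
{
  static_assert(std::is_integral_v<T>, "target must be an integer type");

  std::size_t i = 0;
  bool negative = false;
  if (!s.empty() && (s[0] == '-' || s[0] == '+')) {
    negative = (s[0] == '-');
    i        = 1;
  }

  // Unsigned accumulator: overflow wraps here instead of being undefined behavior.
  std::uint64_t magnitude = 0;
  for (; i < s.size() && s[i] >= '0' && s[i] <= '9'; ++i) {
    magnitude = magnitude * 10 + static_cast<std::uint64_t>(s[i] - '0');
  }

  auto const wide = static_cast<std::int64_t>(negative ? 0 - magnitude : magnitude);
  // Narrowing cast: values outside T's range should be treated as undefined results.
  return static_cast<T>(wide);
}
```

A string such as '12abc' yields 12 because parsing ends at 'a'. A string with no leading digits yields 0 in this sketch.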

◆ to_timestamps()

std::unique_ptr<column> cudf::strings::to_timestamps ( strings_column_view const &  input,
data_type  timestamp_type,
std::string_view  format,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Returns a new timestamp column converting a strings column into timestamps using the provided format pattern.

The format pattern can include the following specifiers: "%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z"

Specifier Description
%d Day of the month: 01-31
%m Month of the year: 01-12
%y Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999]
%Y Year with century: 0001-9999
%H 24-hour of the day: 00-23
%I 12-hour of the day: 01-12
%M Minute of the hour: 00-59
%S Second of the minute: 00-59. Leap second is not supported.
%f 6-digit microsecond: 000000-999999
%z UTC offset with format ±HHMM Example +0500
%j Day of the year: 001-366
%p Only 'AM', 'PM' or 'am', 'pm' are recognized
%W Week of the year with Monday as the first day of the week: 00-53
%w Day of week: 0-6 = Sunday-Saturday
%U Week of the year with Sunday as the first day of the week: 00-53
%u Day of week: 1-7 = Monday-Sunday

Other specifiers are not currently supported.

Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry's timestamp value is undefined.

Any null string entry will result in a corresponding null row in the output column.

The resulting time units are specified by the timestamp_type parameter. The time units are independent of the number of digits parsed by the "%f" specifier. The "%f" supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use "%3f" for milliseconds, "%6f" for microseconds and "%9f" for nanoseconds.

Although leap second is not supported for "%S", no checking is performed on the value. The cudf::strings::is_timestamp can be used to verify the valid range of values.

If "%W"/"%w" (or "%U"/"%u") and "%m"/"%d" are both specified, the "%W"/"%U" and "%w"/"%u" values take precedence when computing the date part of the timestamp result.

Exceptions
cudf::logic_error if timestamp_type is not a timestamp type.
Parameters
input Strings instance for this operation
timestamp_type The timestamp type used for creating the output column
format String specifying the timestamp format in strings
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New datetime column
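
The "%y" row of the specifier table defines a pivot for two-digit years. A minimal host-side sketch of that mapping follows. expand_two_digit_year is a hypothetical helper used only for illustration and is not part of libcudf.

```cpp
#include <cassert>

// Maps a two-digit "%y" value to a full year using the documented pivot:
// [0,68] -> [2000,2068] and [69,99] -> [1969,1999].
int expand_two_digit_year(int yy)
{
  assert(yy >= 0 && yy <= 99);  // "%y" only covers 00-99
  return yy <= 68 ? 2000 + yy : 1900 + yy;
}
```

For example, '68' is read as 2068 while '69' is read as 1969.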

◆ url_decode()

std::unique_ptr<column> cudf::strings::url_decode ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Decodes each string using URL encoding.

Converts all character sequences starting with '%' into character code-points, interpreting the 2 following characters as hex values to create the code-point. For example, the sequence '%20' is converted into byte (0x20) which is a single space character. Another example converts '%C3%A9' into 2 sequential bytes (0xc3 and 0xa9 respectively) which is the é character. Overall, 3 characters are converted into one char byte whenever a '%' (single percent) character is encountered in the string.

Any null entries will result in corresponding null entries in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column
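
The three-characters-to-one-byte conversion can be sketched for a single string on the host. This is not libcudf's implementation: url_decode_sketch and hex_value are hypothetical helpers, and copying a '%' through unchanged when it is not followed by two hex digits is an assumption, since the description does not cover malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Returns the numeric value of a hex digit, or -1 if c is not a hex digit.
int hex_value(char c)
{
  if (c >= '0' && c <= '9') { return c - '0'; }
  if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
  if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
  return -1;
}

// Host-side sketch of percent-decoding; NOT libcudf's implementation.
std::string url_decode_sketch(std::string_view s)
{
  std::string out;
  out.reserve(s.size());
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '%' && i + 2 < s.size()) {
      int const hi = hex_value(s[i + 1]);
      int const lo = hex_value(s[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out.push_back(static_cast<char>(hi * 16 + lo));  // '%XY' becomes one byte
        i += 2;                                          // skip the two hex digits
        continue;
      }
    }
    out.push_back(s[i]);  // ordinary character, or a '%' without two hex digits (assumption)
  }
  return out;
}
```

Decoding '%C3%A9' produces the two bytes 0xC3 and 0xA9, which together form the UTF-8 encoding of é.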

◆ url_encode()

std::unique_ptr<column> cudf::strings::url_encode ( strings_column_view const &  input,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = rmm::mr::get_current_device_resource() 
)

Encodes each string using URL encoding.

Converts mostly non-ASCII characters and control characters into UTF-8 hex code-points prefixed with '%'. For example, the space character is converted to the characters '%20', where '20' is the hex value for space in UTF-8. Likewise, multi-byte characters are converted to multiple hex sequences. For example, the é character is converted to the characters '%C3%A9', where 'C3A9' is the UTF-8 bytes 0xC3 0xA9 for this character.

Any null entries will result in corresponding null entries in the output column.

Parameters
input Strings instance for this operation
stream CUDA stream used for device memory operations and kernel launches
mr Device memory resource used to allocate the returned column's device memory
Returns
New strings column
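
A byte-wise percent-encoding pass can be sketched on the host as follows. This is not libcudf's implementation: url_encode_sketch is a hypothetical helper, and the set of characters left unchanged (ASCII letters, digits, '-', '_', '.', '~', the unreserved set from RFC 3986) is an assumption, because the description above only says "mostly" non-ASCII and control characters are converted.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Host-side sketch of percent-encoding; NOT libcudf's implementation.
// Assumption: only RFC 3986 unreserved characters are passed through unchanged.
std::string url_encode_sketch(std::string_view s)
{
  constexpr char hex_digits[] = "0123456789ABCDEF";
  std::string out;
  out.reserve(s.size());
  for (char c : s) {
    bool const unreserved = (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
                            (c >= '0' && c <= '9') || c == '-' || c == '_' ||
                            c == '.' || c == '~';
    if (unreserved) {
      out.push_back(c);
      continue;
    }
    // Every other byte becomes '%' followed by its two-digit uppercase hex value.
    auto const byte = static_cast<unsigned char>(c);
    out.push_back('%');
    out.push_back(hex_digits[byte >> 4]);
    out.push_back(hex_digits[byte & 0x0F]);
  }
  return out;
}
```

Because the loop works on bytes, a multi-byte UTF-8 character such as é (0xC3 0xA9) is emitted as one '%XX' sequence per byte, giving '%C3%A9'.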