Strings Convert

group strings_convert

Functions

std::unique_ptr<column> to_booleans(strings_column_view const &input, string_scalar const &true_string, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new BOOL8 column by parsing boolean values from the strings in the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Parameters:
  • input – Strings instance for this operation

  • true_string – String to expect for true. Non-matching strings are false.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New BOOL8 column converted from strings
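As a rough host-side illustration of the to_booleans rule (not libcudf's GPU implementation), the sketch below maps each optional string to an optional boolean: a null stays null, a string equal to true_string becomes true, and anything else becomes false. The function name to_booleans_host, the exact-match comparison, and the use of std::optional to model null rows are assumptions made for this example.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side model of the to_booleans rule.
// std::nullopt stands in for a null row.
std::vector<std::optional<bool>> to_booleans_host(
  std::vector<std::optional<std::string>> const& input, std::string const& true_string)
{
  std::vector<std::optional<bool>> output;
  output.reserve(input.size());
  for (auto const& row : input) {
    if (!row.has_value()) {
      output.push_back(std::nullopt);  // null in, null out
    } else {
      output.push_back(*row == true_string);  // any non-matching string is false
    }
  }
  return output;
}
```

With true_string set to "true", the inputs "true", "no", null and "TRUE" produce true, false, null and false in this sketch.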

std::unique_ptr<column> from_booleans(column_view const &booleans, string_scalar const &true_string, string_scalar const &false_string, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting the boolean values from the provided column into strings.

Any null entries will result in corresponding null entries in the output column.

Throws:

cudf::logic_error – if the input column is not BOOL8 type.

Parameters:
  • booleans – Boolean column to convert

  • true_string – String to use for true in the output column

  • false_string – String to use for false in the output column

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column

std::unique_ptr<column> to_timestamps(strings_column_view const &input, data_type timestamp_type, std::string_view format, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new timestamp column converting a strings column into timestamps using the provided format pattern.

The format pattern can include the following specifiers: “%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z”

Specifier  Description
%d         Day of the month: 01-31
%m         Month of the year: 01-12
%y         Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999]
%Y         Year with century: 0001-9999
%H         24-hour of the day: 00-23
%I         12-hour of the day: 01-12
%M         Minute of the hour: 00-59
%S         Second of the minute: 00-59. Leap second is not supported.
%f         6-digit microsecond: 000000-999999
%z         UTC offset with format ±HHMM. Example: +0500
%j         Day of the year: 001-366
%p         Only ‘AM’, ‘PM’ or ‘am’, ‘pm’ are recognized
%W         Week of the year with Monday as the first day of the week: 00-53
%w         Day of week: 0-6 = Sunday-Saturday
%U         Week of the year with Sunday as the first day of the week: 00-53
%u         Day of week: 1-7 = Monday-Sunday

Other specifiers are not currently supported.
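The two-digit year mapping listed for “%y” can be written as a one-line rule. The helper below is a minimal sketch of that mapping only. The name expand_two_digit_year is hypothetical and is not part of libcudf.

```cpp
// Hypothetical helper: expand a two-digit %y value to a full year using the
// documented mapping ([0,68] -> [2000,2068], [69,99] -> [1969,1999]).
// Assumes 0 <= yy <= 99.
constexpr int expand_two_digit_year(int yy)
{
  return yy <= 68 ? 2000 + yy : 1900 + yy;
}
```

For example, the value 68 expands to 2068 while 69 expands to 1969.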

Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry’s timestamp value is undefined.

Any null string entry will result in a corresponding null row in the output column.

The resulting time units are specified by the timestamp_type parameter. The time units are independent of the number of digits parsed by the “%f” specifier. The “%f” supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use “%3f” for milliseconds, “%6f” for microseconds and “%9f” for nanoseconds.

Although leap seconds are not supported for “%S”, no checking is performed on the value. Use cudf::strings::is_timestamp to verify that the values are within their valid ranges.

If “%W”/“%w” (or “%U”/“%u”) and “%m”/“%d” are both specified, the “%W”/“%U” and “%w”/“%u” values take precedence when computing the date part of the timestamp result.

Throws:

cudf::logic_error – if timestamp_type is not a timestamp type.

Parameters:
  • input – Strings instance for this operation

  • timestamp_type – The timestamp type used for creating the output column

  • format – String specifying the timestamp format in strings

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New timestamp column

std::unique_ptr<column> is_timestamp(strings_column_view const &input, std::string_view format, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Verifies the given strings column can be parsed to timestamps using the provided format pattern.

The format pattern can include the following specifiers: “%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z”

Specifier  Description
%d         Day of the month: 01-31
%m         Month of the year: 01-12
%y         Year without century: 00-99. [0,68] maps to [2000,2068] and [69,99] maps to [1969,1999]
%Y         Year with century: 0001-9999
%H         24-hour of the day: 00-23
%I         12-hour of the day: 01-12
%M         Minute of the hour: 00-59
%S         Second of the minute: 00-59. Leap second is not supported.
%f         6-digit microsecond: 000000-999999
%z         UTC offset with format ±HHMM. Example: +0500
%j         Day of the year: 001-366
%p         Only ‘AM’, ‘PM’ or ‘am’, ‘pm’ are recognized
%W         Week of the year with Monday as the first day of the week: 00-53
%w         Day of week: 0-6 = Sunday-Saturday
%U         Week of the year with Sunday as the first day of the week: 00-53
%u         Day of week: 1-7 = Monday-Sunday

Other specifiers are not currently supported. The “%f” supports a precision value to read the numeric digits. Specify the precision with a single integer value (1-9) as follows: use “%3f” for milliseconds, “%6f” for microseconds and “%9f” for nanoseconds.

Any null string entry will result in a corresponding null row in the output column.

This will return a column of type BOOL8 where a true row indicates the corresponding input string can be parsed correctly with the given format.

Parameters:
  • input – Strings instance for this operation

  • format – String specifying the timestamp format in strings

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New BOOL8 column

std::unique_ptr<column> from_timestamps(column_view const &timestamps, std::string_view format = "%Y-%m-%dT%H:%M:%SZ", strings_column_view const &names = strings_column_view(column_view{data_type{type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting a timestamp column into strings using the provided format pattern.

The format pattern can include the following specifiers: “%Y,%y,%m,%d,%H,%I,%p,%M,%S,%f,%z,%Z”

Specifier  Description
%d         Day of the month: 01-31
%m         Month of the year: 01-12
%y         Year without century: 00-99
%Y         Year with century: 0001-9999
%H         24-hour of the day: 00-23
%I         12-hour of the day: 01-12
%M         Minute of the hour: 00-59
%S         Second of the minute: 00-59
%f         6-digit microsecond: 000000-999999
%z         Always outputs “+0000”
%Z         Always outputs “UTC”
%j         Day of the year: 001-366
%u         ISO weekday where Monday is 1 and Sunday is 7
%w         Weekday where Sunday is 0 and Saturday is 6
%U         Week of the year with Sunday as the first day: 00-53
%W         Week of the year with Monday as the first day: 00-53
%V         Week of the year per ISO-8601 format: 01-53
%G         Year based on the ISO-8601 weeks: 0000-9999
%p         AM/PM from timestamp_names::am_str/pm_str
%a         Weekday abbreviation from the names parameter
%A         Weekday from the names parameter
%b         Month name abbreviation from the names parameter
%B         Month name from the names parameter

Additional descriptions can be found here: https://en.cppreference.com/w/cpp/chrono/system_clock/formatter

No checking is done for invalid formats or invalid timestamp values. All timestamp values are formatted to UTC.

Any null input entry will result in a corresponding null entry in the output column.

The time units of the input column do not influence the number of digits written by the “%f” specifier. The “%f” supports a precision value to write out numeric digits for the subsecond value. Specify the precision with a single integer value (1-9) between the “%” and the “f” as follows: use “%3f” for milliseconds, use “%6f” for microseconds and use “%9f” for nanoseconds. If the precision is higher than the units, then zeroes are padded to the right of the subsecond value. If the precision is lower than the units, the subsecond value may be truncated.
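The padding and truncation behavior described for “%f” can be pictured with a small host-side sketch. The helper format_subseconds below is hypothetical and is not libcudf code. It assumes a non-negative subsecond count that already fits in unit_digits digits, where unit_digits is 3, 6 or 9 for millisecond, microsecond or nanosecond columns.

```cpp
#include <string>

// Hypothetical helper: render a subsecond count as `precision` digits.
// `unit_digits` is the number of decimal digits implied by the column's units
// (3 for milliseconds, 6 for microseconds, 9 for nanoseconds).
std::string format_subseconds(long long subsecond, int unit_digits, int precision)
{
  std::string digits = std::to_string(subsecond);
  // Left-pad to the full width of the unit, e.g. 7 ms -> "007".
  if (static_cast<int>(digits.size()) < unit_digits) {
    digits.insert(0, unit_digits - digits.size(), '0');
  }
  // Right-pad with zeros, or truncate, to reach the requested precision.
  digits.resize(precision, '0');
  return digits;
}
```

In this sketch, 123 milliseconds written with “%6f” gives "123000", and 123456 microseconds written with “%3f” gives "123".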

If the “%a”, “%A”, “%b”, “%B” specifiers are included in the format, the caller should provide the format names in the names strings column using the following as a guide:

["AM", "PM",                             // specify the AM/PM strings
 "Sunday", "Monday", ..., "Saturday",    // Weekday full names
 "Sun", "Mon", ..., "Sat",               // Weekday abbreviated names
 "January", "February", ..., "December", // Month full names
 "Jan", "Feb", ..., "Dec"]               // Month abbreviated names

The result is undefined if the format names are not provided for these specifiers.

These format names can be retrieved for specific locales using the POSIX nl_langinfo function (declared in <langinfo.h>) or the Python locale library.

The following code is an example of retrieving these strings from the locale in C++:

#include <clocale>
#include <langinfo.h>
#include <string>
#include <vector>

// note: install language pack on Ubuntu using 'apt-get install language-pack-de'
{
  // set to a German language locale for date settings
  std::setlocale(LC_TIME, "de_DE.UTF-8");

  std::vector<std::string> names({nl_langinfo(AM_STR), nl_langinfo(PM_STR),
    nl_langinfo(DAY_1), nl_langinfo(DAY_2), nl_langinfo(DAY_3), nl_langinfo(DAY_4),
     nl_langinfo(DAY_5), nl_langinfo(DAY_6), nl_langinfo(DAY_7),
    nl_langinfo(ABDAY_1), nl_langinfo(ABDAY_2), nl_langinfo(ABDAY_3), nl_langinfo(ABDAY_4),
     nl_langinfo(ABDAY_5), nl_langinfo(ABDAY_6), nl_langinfo(ABDAY_7),
    nl_langinfo(MON_1), nl_langinfo(MON_2), nl_langinfo(MON_3), nl_langinfo(MON_4),
     nl_langinfo(MON_5), nl_langinfo(MON_6), nl_langinfo(MON_7), nl_langinfo(MON_8),
     nl_langinfo(MON_9), nl_langinfo(MON_10), nl_langinfo(MON_11), nl_langinfo(MON_12),
    nl_langinfo(ABMON_1), nl_langinfo(ABMON_2), nl_langinfo(ABMON_3), nl_langinfo(ABMON_4),
     nl_langinfo(ABMON_5), nl_langinfo(ABMON_6), nl_langinfo(ABMON_7), nl_langinfo(ABMON_8),
     nl_langinfo(ABMON_9), nl_langinfo(ABMON_10), nl_langinfo(ABMON_11), nl_langinfo(ABMON_12)});

  std::setlocale(LC_TIME,""); // reset to default locale
}

Throws:

cudf::logic_error – if timestamps column parameter is not a timestamp type.

Parameters:
  • timestamps – Timestamp values to convert

  • format – The string specifying output format. Default format is “%Y-%m-%dT%H:%M:%SZ”.

  • names – The string names to use for weekdays (“%a”, “%A”) and months (“%b”, “%B”). Default is an empty strings_column_view.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with formatted timestamps

std::unique_ptr<column> to_durations(strings_column_view const &input, data_type duration_type, std::string_view format, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new duration column converting a strings column into durations using the provided format pattern.

The format pattern can include the following specifiers: “%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS”

Specifier  Description                                          Range
%%         A literal % character                                %
%n         A newline character                                  \n
%t         A horizontal tab character                           \t
%D         Days                                                 -2,147,483,648 to 2,147,483,647
%H         24-hour of the day                                   00 to 23
%I         12-hour of the day                                   00 to 11
%M         Minute of the hour                                   00 to 59
%S         Second of the minute                                 00 to 59.999999999
%OH        same as H but without sign                           00 to 23
%OI        same as I but without sign                           00 to 11
%OM        same as M but without sign                           00 to 59
%OS        same as S but without sign                           00 to 59
%p         AM/PM designations associated with a 12-hour clock   ‘AM’ or ‘PM’
%R         Equivalent to “%H:%M”
%T         Equivalent to “%H:%M:%S”
%r         Equivalent to “%OI:%OM:%OS %p”

Other specifiers are not currently supported.

Invalid formats are not checked. If the string contains unexpected or insufficient characters, that output row entry’s duration value is undefined.

Any null string entry will result in a corresponding null row in the output column.

The resulting time units are specified by the duration_type parameter.

Throws:

cudf::logic_error – if duration_type is not a duration type.

Parameters:
  • input – Strings instance for this operation

  • duration_type – The duration type used for creating the output column

  • format – String specifying the duration format in strings

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New duration column

std::unique_ptr<column> from_durations(column_view const &durations, std::string_view format = "%D days %H:%M:%S", rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting a duration column into strings using the provided format pattern.

The format pattern can include the following specifiers: “%%,%n,%t,%D,%H,%I,%M,%S,%p,%R,%T,%r,%OH,%OI,%OM,%OS”

Specifier  Description                                          Range
%%         A literal % character                                %
%n         A newline character                                  \n
%t         A horizontal tab character                           \t
%D         Days                                                 -2,147,483,648 to 2,147,483,647
%H         24-hour of the day                                   00 to 23
%I         12-hour of the day                                   00 to 11
%M         Minute of the hour                                   00 to 59
%S         Second of the minute                                 00 to 59.999999999
%OH        same as H but without sign                           00 to 23
%OI        same as I but without sign                           00 to 11
%OM        same as M but without sign                           00 to 59
%OS        same as S but without sign                           00 to 59
%p         AM/PM designations associated with a 12-hour clock   ‘AM’ or ‘PM’
%R         Equivalent to “%H:%M”
%T         Equivalent to “%H:%M:%S”
%r         Equivalent to “%OI:%OM:%OS %p”

No checking is done for invalid formats or invalid duration values. Formatting follows the specification of std::formatter<std::chrono::duration> as closely as possible.

Any null input entry will result in a corresponding null entry in the output column.

The time units of the input column determine the number of decimal digits written for seconds: 3 digits for milliseconds, 6 digits for microseconds and 9 digits for nanoseconds. If the duration value is negative, only one negative sign is written to the output string. The specifiers with signs are “%H,%I,%M,%S,%R,%T”.

Throws:

cudf::logic_error – if durations column parameter is not a duration type.

Parameters:
  • durations – Duration values to convert

  • format – The string specifying output format. Default format is “%D days %H:%M:%S”.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with formatted durations

std::unique_ptr<column> to_fixed_point(strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new fixed-point column parsing decimal values from the provided strings column.

Any null entries result in corresponding null entries in the output column.

The expected format is [sign][integer][.][fraction], where the sign is either absent, - or +. The decimal point [.] may or may not be present, and integer and fraction each consist of zero or more digits in [0-9]. An invalid data format results in undefined behavior in the corresponding output row result.

Example:
s = ['123', '-876', '543.2', '-0.12']
datatype = {DECIMAL32, scale=-2}
fp = to_fixed_point(s, datatype)
fp is [12300, -87600, 54320, -12]

Overflow of the resulting value type is not checked. The scale in the output_type is used for setting the integer component.
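To make the role of the scale concrete, here is a host-side sketch of parsing one string into the unscaled integer for a non-positive scale. It is not libcudf's implementation. The name parse_fixed_point is hypothetical, well-formed input is assumed, and dropping fraction digits beyond the scale (truncation rather than rounding) is an assumption of this example.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical host-side model: parse "[sign][integer][.][fraction]" into the
// unscaled integer for a non-positive `scale` (scale = -2 keeps 2 fraction digits).
// Assumes well-formed input; overflow is not checked.
std::int64_t parse_fixed_point(std::string_view s, int scale)
{
  bool negative = false;
  if (!s.empty() && (s.front() == '-' || s.front() == '+')) {
    negative = (s.front() == '-');
    s.remove_prefix(1);
  }
  std::int64_t value       = 0;
  int fraction_digits_kept = 0;
  bool in_fraction         = false;
  for (char c : s) {
    if (c == '.') {
      in_fraction = true;
      continue;
    }
    if (in_fraction) {
      if (fraction_digits_kept == -scale) { break; }  // assumed: extra digits are dropped
      ++fraction_digits_kept;
    }
    value = value * 10 + (c - '0');
  }
  // Pad with zeros when fewer fraction digits were given than the scale requires.
  for (; fraction_digits_kept < -scale; ++fraction_digits_kept) { value *= 10; }
  return negative ? -value : value;
}
```

With scale = -2, the string '543.2' becomes 54320 and '-0.12' becomes -12 in this sketch.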

Throws:

cudf::logic_error – if output_type is not a fixed-point decimal type.

Parameters:
  • input – Strings instance for this operation

  • output_type – Type of fixed-point column to return including the scale value

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of output_type

std::unique_ptr<column> from_fixed_point(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting the fixed-point values into a strings column.

Any null entries result in corresponding null entries in the output column.

For each value, a string is created in base-10 decimal. Negative numbers include a ‘-’ prefix in the output string. The column’s scale value is used to place the decimal point. A negative scale value may add padded zeros after the decimal point.

Example:
fp is [110, 222, 3330, -440, -1] with scale = -2
s = from_fixed_point(fp)
s is now ['1.10', '2.22', '33.30', '-4.40', '-0.01']

Throws:

cudf::logic_error – if the input column is not a fixed-point decimal type.

Parameters:
  • input – Fixed-point column to convert

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
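To illustrate how from_fixed_point places the decimal point, the following host-side sketch renders a single unscaled integer with a non-positive scale. The name fixed_point_to_string is hypothetical, positive scales are not handled, and this is not the libcudf implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical host-side model: render an unscaled integer with a non-positive scale.
std::string fixed_point_to_string(std::int64_t unscaled, int scale)
{
  bool const negative = unscaled < 0;
  // Work with the magnitude as unsigned so the most negative value does not overflow.
  std::uint64_t const magnitude = negative ? 0 - static_cast<std::uint64_t>(unscaled)
                                           : static_cast<std::uint64_t>(unscaled);
  std::string digits        = std::to_string(magnitude);
  auto const fraction_width = static_cast<std::size_t>(-scale);
  // Pad with zeros so at least one digit remains in front of the decimal point.
  if (digits.size() <= fraction_width) {
    digits.insert(0, fraction_width + 1 - digits.size(), '0');
  }
  if (fraction_width > 0) { digits.insert(digits.size() - fraction_width, 1, '.'); }
  return negative ? "-" + digits : digits;
}
```

For scale = -2, the value -1 becomes "-0.01" because zeros are padded between the decimal point and the single digit.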

std::unique_ptr<column> is_fixed_point(strings_column_view const &input, data_type decimal_type = data_type{type_id::DECIMAL64}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a boolean column identifying strings in which all characters are valid for conversion to fixed-point.

The sign and the exponent are optional. The decimal point may only appear once. Also, the integer component must fit within the size limits of the underlying fixed-point storage type. The value of the integer component is based on the scale of the decimal_type provided.

Example:
s = ['123', '-456', '', '1.2.3', '+17E30', '12.34', '.789', '-0.005']
b = is_fixed_point(s)
b is [true, true, false, false, true, true, true, true]

Any null entries result in corresponding null entries in the output column.

Throws:

cudf::logic_error – if the decimal_type is not a fixed-point decimal type.

Parameters:
  • input – Strings instance for this operation

  • decimal_type – Fixed-point type (with scale) used only for checking overflow

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string

std::unique_ptr<column> to_floats(strings_column_view const &strings, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new numeric column by parsing float values from each string in the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Only characters [0-9] plus a prefix ‘-’ or ‘+’ and a decimal ‘.’ are recognized. Scientific notation is also supported (e.g. “-1.78e+5”).

Throws:

cudf::logic_error – if output_type is not float type.

Parameters:
  • strings – Strings instance for this operation

  • output_type – Type of float numeric column to return

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with floats converted from strings

std::unique_ptr<column> from_floats(column_view const &floats, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting the float values from the provided column into strings.

Any null entries will result in corresponding null entries in the output column.

For each float, a string is created in base-10 decimal. Negative numbers will include a ‘-’ prefix. Numbers producing more than 10 significant digits will produce a string that includes scientific notation (e.g. “-1.78e+15”).

Throws:

cudf::logic_error – if floats column is not float type.

Parameters:
  • floats – Numeric column to convert

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with floats as strings

std::unique_ptr<column> is_float(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a boolean column identifying strings in which all characters are valid for conversion to floats.

The output row entry will be set to true if the corresponding string element has at least one character in [-+0-9eE.].

Example:
s = ['123', '-456', '', 'A', '+7', '8.9', '3.7e+5']
b = is_float(s)
b is [true, true, false, false, true, true, true]

Any null row results in a null entry for that row in the output column.

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string

std::unique_ptr<column> to_integers(strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new integer numeric column parsing integer values from the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Only characters [0-9] plus a prefix ‘-’ and ‘+’ are recognized. When any other character is encountered, the parsing ends for that string and the current digits are converted into an integer.

Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.
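The parsing rule above (optional sign, then digits until the first unrecognized character) can be sketched on the host as follows. The name parse_leading_integer is hypothetical and this is not libcudf code. Returning 0 for a string with no leading digits is a choice made for this example, not documented behavior.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical host-side model of the parsing rule: optional sign, then digits
// until the first non-digit character. Overflow of int64 is not checked.
std::int64_t parse_leading_integer(std::string_view s)
{
  bool negative = false;
  if (!s.empty() && (s.front() == '-' || s.front() == '+')) {
    negative = (s.front() == '-');
    s.remove_prefix(1);
  }
  std::int64_t value = 0;
  for (char c : s) {
    if (c < '0' || c > '9') { break; }  // parsing ends at the first unrecognized character
    value = value * 10 + (c - '0');
  }
  return negative ? -value : value;
}
```

In this sketch "12abc" yields 12 and "9.5" yields 9, since parsing stops at 'a' and '.' respectively.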

Throws:

cudf::logic_error – if output_type is not integral type.

Parameters:
  • input – Strings instance for this operation

  • output_type – Type of integer numeric column to return

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with integers converted from strings

std::unique_ptr<column> from_integers(column_view const &integers, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting the integer values from the provided column into strings.

Any null entries will result in corresponding null entries in the output column.

For each integer, a string is created in base-10 decimal. Negative numbers will include a ‘-’ prefix.

Throws:

cudf::logic_error – if integers column is not integral type.

Parameters:
  • integers – Numeric column to convert

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with integers as strings

std::unique_ptr<column> is_integer(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a boolean column identifying strings in which all characters are valid for conversion to integers.

The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Notice that the integer value is not checked to be within its storage limits. For a strict integer type check, use the other is_integer() API, which accepts a data_type argument.

Example:
s = ['123', '-456', '', 'A', '+7']
b = is_integer(s)
b is [true, true, false, false, true]

Any null row results in a null entry for that row in the output column.

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string

std::unique_ptr<column> is_integer(strings_column_view const &input, data_type int_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a boolean column identifying strings in which all characters are valid for conversion to integers.

The output row entry will be set to true if the corresponding string element has all characters in [-+0-9]. The optional sign character must only be in the first position. Also, the integer component must fit within the size limits of the underlying storage type, which is provided by the int_type parameter.

Example:
s = ['123456', '-456', '', 'A', '+7']

output1 = is_integer(s, data_type{type_id::INT32})
output1 is [true, true, false, false, true]

output2 = is_integer(s, data_type{type_id::INT8})
output2 is [false, false, false, false, true]

Any null row results in a null entry for that row in the output column.
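A host-side way to picture the range-checked variant is to parse with std::from_chars into the target type and require that the whole string is consumed without an out-of-range error. The helper fits_integer below is a hypothetical sketch, not libcudf's implementation. Rejecting a lone sign is an assumption of this example.

```cpp
#include <charconv>
#include <cstdint>
#include <string_view>
#include <system_error>

// Hypothetical host-side check: does `s` hold an integer that fits in type Int?
template <typename Int>
bool fits_integer(std::string_view s)
{
  // std::from_chars does not accept a leading '+', so strip it manually.
  if (!s.empty() && s.front() == '+') {
    s.remove_prefix(1);
    if (s.empty() || s.front() == '-') { return false; }  // rejects "+" and "+-7"
  }
  if (s.empty()) { return false; }
  Int value{};
  char const* const first = s.data();
  char const* const last  = s.data() + s.size();
  auto const [ptr, ec]    = std::from_chars(first, last, value);
  // Success requires no error (including out-of-range) and no trailing characters.
  return ec == std::errc{} && ptr == last;
}
```

Here fits_integer<std::int8_t>("-456") is false because the value is below the INT8 minimum, while fits_integer<std::int32_t>("-456") is true.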

Parameters:
  • input – Strings instance for this operation

  • int_type – Integer type used for checking underflow and overflow

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string

std::unique_ptr<column> hex_to_integers(strings_column_view const &input, data_type output_type, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new integer numeric column parsing hexadecimal values from the provided strings column.

Any null entries will result in corresponding null entries in the output column.

Only characters [0-9] and [A-F] are recognized. When any other character is encountered, the parsing ends for that string. No interpretation is made on the sign of the integer.

Overflow of the resulting integer type is not checked. Each string is converted using an int64 type and then cast to the target integer type before storing it into the output column. If the resulting integer type is too small to hold the value, the stored value will be undefined.

Throws:

cudf::logic_error – if output_type is not integral type.

Parameters:
  • input – Strings instance for this operation

  • output_type – Type of integer numeric column to return

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with integers converted from strings

std::unique_ptr<column> is_hex(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a boolean column identifying strings in which all characters are valid for conversion to integers from hex.

The output row entry will be set to true if the corresponding string element consists of one or more characters in [0-9A-Fa-f]. The string may also start with ‘0x’.

Example:
s = ['123', '-456', '', 'AGE', '+17EA', '0x9EF', '123ABC']
b = is_hex(s)
b is [true, false, false, false, false, true, true]

Any null row results in a null entry for that row in the output column.
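The is_hex rule can be approximated on the host with a short character scan. The helper is_hex_string is hypothetical and is not libcudf code. Accepting lowercase hex digits and rejecting a bare "0x" prefix are assumptions of this sketch.

```cpp
#include <cctype>
#include <string_view>

// Hypothetical host-side model: optional "0x" prefix, then one or more hex digits.
bool is_hex_string(std::string_view s)
{
  if (s.size() >= 2 && s[0] == '0' && s[1] == 'x') { s.remove_prefix(2); }
  if (s.empty()) { return false; }  // nothing left to interpret as hex digits
  for (unsigned char c : s) {
    if (!std::isxdigit(c)) { return false; }  // signs and letters past F are rejected
  }
  return true;
}
```

In this sketch 'AGE' is rejected because 'G' is not a hexadecimal digit, and '-456' is rejected because of the sign.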

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string

std::unique_ptr<column> integers_to_hex(column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Returns a new strings column converting integer columns to hexadecimal characters.

Any null entries will result in corresponding null entries in the output column.

The output character set is ‘0’-‘9’ and ‘A’-‘F’. The output string width will be a multiple of 2 depending on the size of the integer type. A single leading zero is applied to the first non-zero output byte if it is less than 0x10.

Example:
input = [1234, -1, 0, 27, 342718233] // int32 type input column
s = integers_to_hex(input)
s is [ '04D2', 'FFFFFFFF', '00', '1B', '146D7719']

The example above shows an INT32 type column where each integer is 4 bytes. Leading zeros are suppressed unless filling out a complete byte, as in 1234 -> ‘04D2’ instead of ‘000004D2’ or ‘4D2’.
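The byte-wise rule for integers_to_hex can be sketched on the host as below: leading zero bytes are skipped, every remaining byte is written as two uppercase hex characters, and a zero value still produces one byte. The template to_hex is a hypothetical illustration and is not the libcudf implementation.

```cpp
#include <cstdint>
#include <string>
#include <type_traits>

// Hypothetical host-side model: two hex characters per byte, most significant
// byte first, with leading zero bytes suppressed.
template <typename Int>
std::string to_hex(Int value)
{
  using U = std::make_unsigned_t<Int>;
  auto const bits         = static_cast<U>(value);  // reinterpret negatives as two's complement
  constexpr char digits[] = "0123456789ABCDEF";
  std::string out;
  bool started = false;
  for (int byte = static_cast<int>(sizeof(U)) - 1; byte >= 0; --byte) {
    auto const b = static_cast<unsigned>((bits >> (8 * byte)) & 0xFF);
    // Skip leading zero bytes, but always emit the last byte so 0 becomes "00".
    if (b == 0 && !started && byte > 0) { continue; }
    started = true;
    out.push_back(digits[b >> 4]);
    out.push_back(digits[b & 0xF]);
  }
  return out;
}
```

For a 32-bit input, 1234 becomes "04D2" and -1 becomes "FFFFFFFF" in this sketch.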

Throws:

cudf::logic_error – if the input column is not integral type.

Parameters:
  • input – Integer column to convert to hex

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with hexadecimal characters

std::unique_ptr<column> ipv4_to_integers(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Converts IPv4 addresses into integers.

An IPv4 address consists of four groups of 1-3 digits [0-9] separated by 3 dots (e.g. 123.45.67.89). Each group can have a value in [0-255].

The four sets of digits are converted to integers and placed in 8-bit fields inside the resulting integer.

i0.i1.i2.i3 -> (i0 << 24) | (i1 << 16) | (i2 << 8) | (i3)

No checking is done on the format. If a string is not in IPv4 format, the resulting integer is undefined.

The resulting 32-bit integer is placed in an int64_t to avoid setting the sign-bit in an int32_t type. This could be changed if cudf supported a UINT32 type in the future.
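The packing formula above can be followed step by step in a small host-side sketch. The function ipv4_to_integer is hypothetical and is not libcudf code. Like the documented behavior, it assumes a well-formed address and performs no validation.

```cpp
#include <cstdint>
#include <string_view>

// Hypothetical host-side model of (i0 << 24) | (i1 << 16) | (i2 << 8) | i3.
// Assumes a well-formed address; the result is undefined otherwise.
std::int64_t ipv4_to_integer(std::string_view address)
{
  std::int64_t result = 0;
  std::int64_t octet  = 0;
  for (char c : address) {
    if (c == '.') {
      result = (result << 8) | octet;  // shift earlier octets up by one 8-bit field
      octet  = 0;
    } else {
      octet = octet * 10 + (c - '0');
    }
  }
  return (result << 8) | octet;  // append the final octet
}
```

For "127.0.0.1" this gives (127 << 24) | 1 = 2130706433, which fits in an int64_t without touching the sign bit.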

Any null entries will result in corresponding null entries in the output column.

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New INT64 column converted from strings

std::unique_ptr<column> integers_to_ipv4(column_view const &integers, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Converts integers into IPv4 addresses as strings.

An IPv4 address consists of four groups of 1-3 digits [0-9] separated by 3 dots (e.g. 123.45.67.89). Each group can have a value in [0-255].

Each input integer is dissected into four integers by dividing the input into 8-bit sections. These sub-integers are then converted into [0-9] characters and placed between ‘.’ characters.

No checking is done on the input integer value. Only the lower 32-bits are used.
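The dissection into 8-bit sections can be sketched on the host as follows. The function integer_to_ipv4 is a hypothetical illustration, not libcudf's implementation. It keeps only the lower 32 bits of the input, as described above.

```cpp
#include <cstdint>
#include <string>

// Hypothetical host-side model: split the lower 32 bits into four 8-bit
// sections and join their decimal values with '.' characters.
std::string integer_to_ipv4(std::int64_t value)
{
  auto const bits = static_cast<std::uint32_t>(value);  // only the lower 32 bits are used
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out += std::to_string((bits >> shift) & 0xFFu);
    if (shift > 0) { out += '.'; }
  }
  return out;
}
```

For example, 2130706433 is rendered as "127.0.0.1".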

Any null entries will result in corresponding null entries in the output column.
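The dissection into 8-bit sections can be sketched for a single value as follows. This is a host-side illustration in plain C++ under the description above, not the libcudf implementation; `integer_to_ipv4` is a hypothetical name.

```cpp
#include <cstdint>
#include <string>

// Host-side illustration only (hypothetical helper, not part of cudf).
// Splits the lower 32 bits into four 8-bit sections, most significant first,
// and joins their decimal representations with '.' characters.
std::string integer_to_ipv4(std::int64_t value)
{
  auto const bits = static_cast<std::uint32_t>(value);  // keep only the lower 32 bits
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out += std::to_string((bits >> shift) & 0xFFu);  // one section as decimal digits
    if (shift > 0) { out += '.'; }
  }
  return out;
}
```

Because only the lower 32 bits are used, any upper bits in the input are silently discarded in this sketch.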

Throws:

cudf::logic_error – if the input column is not INT64 type.

Parameters:
  • integers – Integer (INT64) column to convert

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column

std::unique_ptr<column> is_ipv4(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Returns a boolean column identifying strings in which all characters are valid for conversion to integers from IPv4 format.

The output row entry is set to true if the corresponding string element has the format xxx.xxx.xxx.xxx, where each xxx is 1-3 digits forming an integer in the range 0-255.

Example:
s = ['123.255.0.7', '127.0.0.1', '', '1.2.34', '123.456.789.10']
b = is_ipv4(s)
b is [true, true, false, false, false]

Any null row results in a null entry for that row in the output column.
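The format rule stated above (four sections of 1-3 digits, each in the range 0-255) can be sketched as a per-string check. This is a host-side illustration in plain C++ and not the libcudf kernel; the name `looks_like_ipv4` is hypothetical, and details such as the handling of leading zeros are assumptions.

```cpp
#include <string_view>

// Host-side illustration only (hypothetical helper, not part of cudf).
// Returns true when s has exactly four sections separated by '.',
// each made of 1-3 digits with a value in the range 0-255.
bool looks_like_ipv4(std::string_view s)
{
  int dots   = 0;  // separators seen so far
  int digits = 0;  // digits seen in the current section
  int value  = 0;  // numeric value of the current section
  for (char c : s) {
    if (c >= '0' && c <= '9') {
      value = value * 10 + (c - '0');
      if (++digits > 3 || value > 255) { return false; }
    } else if (c == '.') {
      if (digits == 0) { return false; }  // empty section
      ++dots;
      digits = 0;
      value  = 0;
    } else {
      return false;  // any other character is invalid
    }
  }
  return dots == 3 && digits > 0;
}
```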

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string

std::unique_ptr<column> format_list_column(lists_column_view const &input, string_scalar const &na_rep = string_scalar(""), strings_column_view const &separators = strings_column_view(column_view{data_type{type_id::STRING}, 0, nullptr, nullptr, 0}), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Convert a list column of strings into a formatted strings column.

The separators column should contain 3 string elements in the following order:

  • element separator (default is comma ,)

  • left-hand enclosure (default is [)

  • right-hand enclosure (default is ])

l1 = { [[a,b,c], [d,e]], [[f,g], [h]] }
s1 = format_list_column(l1)
s1 is now ["[[a,b,c],[d,e]]", "[[f,g],[h]]"]

l2 = { [[a,b,c], [d,e]], [NULL], [[f,g], NULL, [h]] }
s2 = format_list_column(l2, '-', [':', '{', '}'])
s2 is now ["{{a:b:c}:{d:e}}", "{-}", "{{f:g}:-:{h}}"]

Throws:

cudf::logic_error – if the input column is not a LIST type with a STRING child.

Parameters:
  • input – Lists column to format

  • na_rep – Replacement string for null elements

  • separators – Strings to use for enclosing list components and separating elements

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column
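The way format_list_column combines na_rep with the three separator strings can be modelled on the host with a small recursive function. This is only a sketch in plain C++: the `element` type and the `make_leaf`, `make_list`, `make_null`, and `format_element` helpers are hypothetical and do not correspond to any cudf type or function.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model of one entry: a null, a string, or a nested list.
struct element {
  enum class kind { null, leaf, list };
  kind type = kind::null;
  std::string text;               // used when type == kind::leaf
  std::vector<element> children;  // used when type == kind::list
};

element make_null() { return element{}; }

element make_leaf(std::string s)
{
  element e;
  e.type = element::kind::leaf;
  e.text = std::move(s);
  return e;
}

element make_list(std::vector<element> items)
{
  element e;
  e.type     = element::kind::list;
  e.children = std::move(items);
  return e;
}

// Formats one entry: nulls become na_rep, strings are copied as-is, and
// lists are wrapped in left/right with sep placed between their elements.
std::string format_element(element const& e,
                           std::string const& na_rep,
                           std::string const& sep,
                           std::string const& left,
                           std::string const& right)
{
  if (e.type == element::kind::null) { return na_rep; }
  if (e.type == element::kind::leaf) { return e.text; }
  std::string out = left;
  for (std::size_t i = 0; i < e.children.size(); ++i) {
    if (i > 0) { out += sep; }
    out += format_element(e.children[i], na_rep, sep, left, right);
  }
  return out + right;
}
```

Calling `format_element` once per row with `"-"`, `":"`, `"{"`, and `"}"` reproduces the strings shown for s2 in the example above.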

std::unique_ptr<column> url_encode(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Encodes each string using URL encoding.

Converts mostly non-ASCII characters and control characters into UTF-8 hex code-points prefixed with ‘%’. For example, the space character is converted to the characters ‘%20’, where ‘20’ is the hex value for space in UTF-8. Likewise, multi-byte characters are converted to multiple hex sequences. For example, the é character is converted to the characters ‘%C3%A9’, where 0xC3 and 0xA9 are the UTF-8 bytes for this character.

Any null entries will result in corresponding null entries in the output column.
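Percent-encoding of a single string can be sketched on the host as follows. This is an illustration in plain C++, not the libcudf implementation; `url_encode_sketch` is a hypothetical name, and the choice of which characters pass through unchanged (ASCII letters, digits, and `-`, `_`, `.`, `~`, the RFC 3986 unreserved set) is an assumption rather than something stated by this documentation.

```cpp
#include <cctype>
#include <string>
#include <string_view>

// Host-side illustration only (hypothetical helper, not part of cudf).
// Assumption: ASCII letters, digits, and "-_.~" are copied unchanged;
// every other byte is written as '%' followed by two uppercase hex digits.
std::string url_encode_sketch(std::string_view s)
{
  static constexpr char hex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : s) {
    bool const keep = (c < 0x80 && std::isalnum(c)) ||
                      c == '-' || c == '_' || c == '.' || c == '~';
    if (keep) {
      out += static_cast<char>(c);
    } else {
      out += '%';
      out += hex[c >> 4];    // high nibble
      out += hex[c & 0x0F];  // low nibble
    }
  }
  return out;
}
```

Working byte by byte means a multi-byte UTF-8 character naturally produces one ‘%XX’ sequence per byte.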

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column

std::unique_ptr<column> url_decode(strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())#

Decodes each string using URL encoding.

Converts all character sequences starting with ‘%’ into character code-points, interpreting the 2 following characters as hex values to create the code-point. For example, the sequence ‘%20’ is converted into the byte 0x20, which is a single space character. Another example converts ‘%C3%A9’ into 2 sequential bytes (0xC3 and 0xA9 respectively), which together form the é character. Overall, 3 characters are converted into one char byte whenever a ‘%’ (single percent) character is encountered in the string.

Any null entries will result in corresponding null entries in the output column.
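The 3-characters-to-1-byte conversion can be sketched on the host for a single string. This is an illustration in plain C++, not the libcudf implementation; `hex_value` and `url_decode_sketch` are hypothetical names. Copying a ‘%’ through unchanged when it is not followed by two hex digits is an assumption; this documentation does not say how such input is treated.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Returns the value of a hex digit (0-15), or -1 if c is not a hex digit.
int hex_value(char c)
{
  if (c >= '0' && c <= '9') { return c - '0'; }
  if (c >= 'a' && c <= 'f') { return c - 'a' + 10; }
  if (c >= 'A' && c <= 'F') { return c - 'A' + 10; }
  return -1;
}

// Host-side illustration only (hypothetical helper, not part of cudf).
// Each "%XY" sequence becomes the single byte 0xXY; other characters are copied.
// Assumption: a '%' not followed by two hex digits is copied unchanged.
std::string url_decode_sketch(std::string_view s)
{
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    if (s[i] == '%' && i + 2 < s.size()) {
      int const hi = hex_value(s[i + 1]);
      int const lo = hex_value(s[i + 2]);
      if (hi >= 0 && lo >= 0) {
        out += static_cast<char>((hi << 4) | lo);  // 3 characters -> 1 byte
        i += 2;                                    // skip the two hex digits
        continue;
      }
    }
    out += s[i];
  }
  return out;
}
```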

Parameters:
  • input – Strings instance for this operation

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column