Strings Combine

group strings_combine

Enums

enum class separator_on_nulls

Setting for specifying how separators are added with null string elements.

Values:

enumerator YES

Always add separators between elements.

enumerator NO

Do not add separators if an element is null.

enum class output_if_empty_list

Setting for specifying what will be output from join_list_elements when an input list is empty.

Values:

enumerator EMPTY_STRING

Empty list will result in empty string.

enumerator NULL_ELEMENT

Empty list will result in a null.

Functions

std::unique_ptr<column> join_strings(strings_column_view const &input, string_scalar const &separator = string_scalar(""), string_scalar const &narep = string_scalar("", false), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Concatenates all strings in the column into one new string delimited by an optional separator string.

This returns a column with one string. Any null entries are ignored unless the narep parameter specifies a replacement string.

Example:
s = ['aa', null, '', 'zz' ]
r = join_strings(s,':','_')
r is ['aa:_::zz']
Throws:

cudf::logic_error – if separator is not valid.

Parameters:
  • input – Strings for this operation

  • separator – String that should be inserted between each string. Default is an empty string.

  • narep – String to replace any null strings found. Default of invalid-scalar will ignore any null entries.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

New column containing one string.
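
As a rough illustration of the null handling described above, the following host-only C++ sketch models join_strings on a std::vector of std::optional<std::string>. It does not call libcudf: opt_str and join_strings_model are hypothetical names introduced here, and std::nullopt stands in for the default invalid narep scalar. The text does not say what happens to the separator next to an ignored null, so the sketch assumes it is dropped together with the null entry.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side stand-in for a nullable string; not a cudf type.
using opt_str = std::optional<std::string>;

// Host model of the documented join_strings behaviour (illustration only).
// narep == std::nullopt plays the role of the default invalid scalar.
std::string join_strings_model(std::vector<opt_str> const& input,
                               std::string const& separator = "",
                               opt_str const& narep         = std::nullopt)
{
  std::string out;
  bool first = true;
  for (auto const& s : input) {
    // Assumption: a null entry with no replacement contributes nothing,
    // not even a separator.
    if (!s && !narep) { continue; }
    if (!first) { out += separator; }
    out += s ? *s : *narep;
    first = false;
  }
  return out;
}
```

With the example input above, join_strings_model(s, ":", "_") produces "aa:_::zz", matching the documented result.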

std::unique_ptr<column> concatenate(table_view const &strings_columns, strings_column_view const &separators, string_scalar const &separator_narep = string_scalar("", false), string_scalar const &col_narep = string_scalar("", false), separator_on_nulls separate_nulls = separator_on_nulls::YES, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Concatenates a list of strings columns using separators for each row and returns the result as a strings column.

Each new string is created by concatenating the strings from the same row delimited by the row separator provided for that row. The following rules are applicable:

  • If the row separator for a given row is null, the output for that row is null unless a valid separator_narep is provided.

  • The separator is applied between all adjacent row values if separate_nulls is YES, or only between valid (non-null) values if separate_nulls is NO.

  • If separator_narep and col_narep are both valid, the output column is always non-nullable.

Example:
c0   = ['aa', null, '',  'ee',  null, 'ff']
c1   = [null, 'cc', 'dd', null, null, 'gg']
c2   = ['bb', '',   null, null, null, 'hh']
sep  = ['::', '%%', '^^', '!',  '*',  null]
out = concatenate({c0, c1, c2}, sep)
// all rows have at least one null or sep[i]==null
out is [null, null, null, null, null, null]

sep_rep = '+'
out = concatenate({c0, c1, c2}, sep, sep_rep)
// all rows with at least one null output as null
out is [null, null, null, null, null, 'ff+gg+hh']

col_narep = '-'
sep_na = non-valid scalar
out = concatenate({c0, c1, c2}, sep, sep_na, col_narep)
// only the null entry in the sep column produces a null row
out is ['aa::-::bb', '-%%cc%%', '^^dd^^-', 'ee!-!-', '-*-*-', null]

col_narep = ''
out = concatenate({c0, c1, c2}, sep, sep_rep, col_narep, separator_on_nulls::NO)
// parameter suppresses separator for null rows
out is ['aa::bb', 'cc%%', '^^dd', 'ee', '', 'ff+gg+hh']
Throws:
  • cudf::logic_error – if no input columns are specified (the table view is empty).

  • cudf::logic_error – if input columns are not all strings columns.

  • cudf::logic_error – if the number of rows from separators and strings_columns do not match

Parameters:
  • strings_columns – List of strings columns to concatenate

  • separators – Strings column that provides the separator for a given row

  • separator_narep – String to replace a null separator for a given row. Default of invalid-scalar means no row separator value replacements.

  • col_narep – String that should be used in place of any null strings found in any column. Default of invalid-scalar means no null column value replacements.

  • separate_nulls – If YES, then the separator is included for null rows if col_narep is valid.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Resource for allocating device memory

Returns:

New column with concatenated results
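
The rules and example above can be traced with a small host-only model. This is a sketch under stated assumptions, not libcudf code: opt_str, column, sep_on_nulls, and concatenate_rows_model are hypothetical names, std::nullopt stands in for an invalid scalar, and std::invalid_argument stands in for cudf::logic_error. The documented examples only combine separator_on_nulls::NO with an empty col_narep, so the sketch assumes that the replacement text is still written for a null element while its separator is suppressed.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins; none of these are cudf types.
using opt_str = std::optional<std::string>;
using column  = std::vector<opt_str>;
enum class sep_on_nulls { YES, NO };

// Host model of row-wise concatenate with one separator per row.
column concatenate_rows_model(std::vector<column> const& columns,
                              column const& separators,
                              opt_str const& separator_narep = std::nullopt,
                              opt_str const& col_narep       = std::nullopt,
                              sep_on_nulls separate_nulls    = sep_on_nulls::YES)
{
  if (columns.empty()) { throw std::invalid_argument("no input columns"); }
  for (auto const& c : columns) {
    if (c.size() != separators.size()) { throw std::invalid_argument("row count mismatch"); }
  }

  column out(separators.size());  // every row starts out null
  for (std::size_t row = 0; row < separators.size(); ++row) {
    // A null separator nulls the row unless separator_narep replaces it.
    opt_str const& sep = separators[row] ? separators[row] : separator_narep;
    if (!sep) { continue; }

    std::string joined;
    bool row_is_null = false;
    bool pending_sep = false;
    for (auto const& c : columns) {
      bool const is_null = !c[row].has_value();
      // A null element nulls the row unless col_narep replaces it.
      if (is_null && !col_narep) { row_is_null = true; break; }
      if (pending_sep && (separate_nulls == sep_on_nulls::YES || !is_null)) {
        joined += *sep;
        pending_sep = false;
      }
      joined += is_null ? *col_narep : *c[row];
      pending_sep = pending_sep || separate_nulls == sep_on_nulls::YES || !is_null;
    }
    if (!row_is_null) { out[row] = joined; }
  }
  return out;
}
```

Feeding c0, c1, c2, and sep from the example into this model reproduces each of the four documented outputs.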

std::unique_ptr<column> concatenate(table_view const &strings_columns, string_scalar const &separator = string_scalar(""), string_scalar const &narep = string_scalar("", false), separator_on_nulls separate_nulls = separator_on_nulls::YES, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Row-wise concatenates the given list of strings columns and returns a single strings column result.

Each new string is created by concatenating the strings from the same row delimited by the separator provided.

Any row with a null entry produces a null in the corresponding output row unless a narep string is specified to be used in its place.

If separate_nulls is set to NO and narep is valid then separators are not added to the output between null elements. Otherwise, separators are always added if narep is valid.

More than one column must be specified in the input strings_columns table.

Example:
s1 = ['aa', null, '', 'dd']
s2 = ['', 'bb', 'cc', null]
out = concatenate({s1, s2})
out is ['aa', null, 'cc', null]

out = concatenate({s1, s2}, ':', '_')
out is ['aa:', '_:bb', ':cc', 'dd:_']

out = concatenate({s1, s2}, ':', '', separator_on_nulls::NO)
out is ['aa:', 'bb', ':cc', 'dd']
Throws:
Parameters:
  • strings_columns – List of string columns to concatenate

  • separator – String that should be inserted between each string from each row. Default is an empty string.

  • narep – String to replace any null strings found in any column. Default of invalid-scalar means any null entry in any column will produce a null result for that row.

  • separate_nulls – If YES, then the separator is included for null rows if narep is valid

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column with concatenated results
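
A host-only sketch of the scalar-separator overload may help when checking the example outputs by hand. It is an illustration, not libcudf code: opt_str, column, sep_on_nulls, and concatenate_model are hypothetical names, std::nullopt stands in for the invalid narep scalar, and std::invalid_argument is used where the real function would report an error. As an assumption, a null element under separator_on_nulls::NO still contributes the narep text but no separator.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins; none of these are cudf types.
using opt_str = std::optional<std::string>;
using column  = std::vector<opt_str>;
enum class sep_on_nulls { YES, NO };

// Host model of row-wise concatenate with a single scalar separator.
column concatenate_model(std::vector<column> const& columns,
                         std::string const& separator = "",
                         opt_str const& narep         = std::nullopt,
                         sep_on_nulls separate_nulls  = sep_on_nulls::YES)
{
  // The text above requires more than one input column.
  if (columns.size() < 2) { throw std::invalid_argument("need more than one column"); }
  std::size_t const num_rows = columns.front().size();
  for (auto const& c : columns) {
    if (c.size() != num_rows) { throw std::invalid_argument("row count mismatch"); }
  }

  column out(num_rows);  // every row starts out null
  for (std::size_t row = 0; row < num_rows; ++row) {
    std::string joined;
    bool row_is_null = false;
    bool pending_sep = false;
    for (auto const& c : columns) {
      bool const is_null = !c[row].has_value();
      // Without a narep, one null entry makes the whole row null.
      if (is_null && !narep) { row_is_null = true; break; }
      if (pending_sep && (separate_nulls == sep_on_nulls::YES || !is_null)) {
        joined += separator;
        pending_sep = false;
      }
      joined += is_null ? *narep : *c[row];
      pending_sep = pending_sep || separate_nulls == sep_on_nulls::YES || !is_null;
    }
    if (!row_is_null) { out[row] = joined; }
  }
  return out;
}
```

Called with s1 and s2 from the example, the model returns the same three results listed there.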

std::unique_ptr<column> join_list_elements(lists_column_view const &lists_strings_column, strings_column_view const &separators, string_scalar const &separator_narep = string_scalar("", false), string_scalar const &string_narep = string_scalar("", false), separator_on_nulls separate_nulls = separator_on_nulls::YES, output_if_empty_list empty_list_policy = output_if_empty_list::EMPTY_STRING, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Given a lists column of strings (each row is a list of strings), concatenates the strings within each row and returns a single strings column result.

Each new string is created by concatenating the strings from the same row (same list element) delimited by the row separator provided in the separators strings column.

A null list row always results in a null string in the output row. Any non-null list row that contains a null element produces a null output row unless a valid string_narep scalar is provided to be used in its place. Any null row in the separators column also produces a null output row unless a valid separator_narep scalar is provided to be used in place of the null separators.

If separate_nulls is set to NO and string_narep is valid then separators are not added to the output between null elements. Otherwise, separators are always added if string_narep is valid.

If empty_list_policy is set to EMPTY_STRING, any row that is an empty list will result in an empty output string. Otherwise, the output will be a null.

In the special case where every element of an input list row is null, the output is the same as for an empty input list, regardless of the string_narep and separate_nulls values.

Example:
s = [ ['aa', 'bb', 'cc'], null, ['', 'dd'], ['ee', null], ['ff', 'gg'] ]
sep  = ['::', '%%',  '!',  '*',  null]

out = join_list_elements(s, sep)
out is ['aa::bb::cc', null, '!dd', null, null]

out = join_list_elements(s, sep, ':', '_')
out is ['aa::bb::cc', null,  '!dd', 'ee*_', 'ff:gg']

out = join_list_elements(s, sep, ':', '', separator_on_nulls::NO)
out is ['aa::bb::cc', null,  '!dd', 'ee', 'ff:gg']
Throws:
  • cudf::logic_error – if input column is not lists of strings column.

  • cudf::logic_error – if the number of rows from separators and lists_strings_column do not match

Parameters:
  • lists_strings_column – Column containing lists of strings to concatenate

  • separators – Strings column that provides separators for concatenation

  • separator_narep – String that should be used to replace a null separator. Default is an invalid-scalar denoting that rows containing null separator will result in a null string in the corresponding output rows.

  • string_narep – String to replace null strings in any non-null list row. Default is an invalid-scalar denoting that list rows containing null strings will result in a null string in the corresponding output rows.

  • separate_nulls – If YES, then the separator is included for null rows if string_narep is valid

  • empty_list_policy – If set to EMPTY_STRING, any input row that is an empty list will result in an empty string. Otherwise, it will result in a null.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with concatenated results
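
The interaction of null lists, null separators, null elements, and empty lists can be sketched with a host-only model. This is an illustration and not libcudf code: opt_str, opt_list, sep_on_nulls, if_empty_list, and join_list_elements_model are hypothetical names, and std::invalid_argument stands in for cudf::logic_error. The order in which the rules are checked (null list or null separator first, then empty or all-null list, then null elements) is an assumption, as is writing string_narep without a separator when separate_nulls is NO.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins; none of these are cudf types.
using opt_str  = std::optional<std::string>;
using opt_list = std::optional<std::vector<opt_str>>;
enum class sep_on_nulls { YES, NO };
enum class if_empty_list { EMPTY_STRING, NULL_ELEMENT };

// Host model of join_list_elements with one separator per list row.
std::vector<opt_str> join_list_elements_model(
  std::vector<opt_list> const& lists,
  std::vector<opt_str> const& separators,
  opt_str const& separator_narep = std::nullopt,
  opt_str const& string_narep    = std::nullopt,
  sep_on_nulls separate_nulls    = sep_on_nulls::YES,
  if_empty_list empty_policy     = if_empty_list::EMPTY_STRING)
{
  if (lists.size() != separators.size()) { throw std::invalid_argument("row count mismatch"); }

  std::vector<opt_str> out(lists.size());  // every row starts out null
  for (std::size_t row = 0; row < lists.size(); ++row) {
    opt_str const& sep = separators[row] ? separators[row] : separator_narep;
    if (!lists[row] || !sep) { continue; }  // null list, or unreplaced null separator

    auto const& elems    = *lists[row];
    bool const any_valid = std::any_of(
      elems.begin(), elems.end(), [](opt_str const& e) { return e.has_value(); });
    if (!any_valid) {  // empty list, or every element is null
      if (empty_policy == if_empty_list::EMPTY_STRING) { out[row] = std::string{}; }
      continue;
    }
    bool const any_null = std::any_of(
      elems.begin(), elems.end(), [](opt_str const& e) { return !e.has_value(); });
    if (any_null && !string_narep) { continue; }  // null element with no replacement

    std::string joined;
    bool pending_sep = false;
    for (auto const& e : elems) {
      bool const is_null = !e.has_value();
      if (pending_sep && (separate_nulls == sep_on_nulls::YES || !is_null)) {
        joined += *sep;
        pending_sep = false;
      }
      joined += is_null ? *string_narep : *e;
      pending_sep = pending_sep || separate_nulls == sep_on_nulls::YES || !is_null;
    }
    out[row] = joined;
  }
  return out;
}
```

With s and sep from the example, the model yields the three documented outputs. An all-null row such as [null, null] is treated like an empty list even when string_narep is valid.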

std::unique_ptr<column> join_list_elements(lists_column_view const &lists_strings_column, string_scalar const &separator = string_scalar(""), string_scalar const &narep = string_scalar("", false), separator_on_nulls separate_nulls = separator_on_nulls::YES, output_if_empty_list empty_list_policy = output_if_empty_list::EMPTY_STRING, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())

Given a lists column of strings (each row is a list of strings), concatenates the strings within each row and returns a single strings column result.

Each new string is created by concatenating the strings from the same row (same list element) delimited by the separator provided.

A null list row always results in a null string in the output row. Any non-null list row that contains a null element produces a null output row unless a narep string is specified to be used in its place.

If separate_nulls is set to NO and narep is valid then separators are not added to the output between null elements. Otherwise, separators are always added if narep is valid.

If empty_list_policy is set to EMPTY_STRING, any row that is an empty list will result in an empty output string. Otherwise, the output will be a null.

In the special case where every element of an input list row is null, the output is the same as for an empty input list, regardless of the narep and separate_nulls values.

Example:
s = [ ['aa', 'bb', 'cc'], null, ['', 'dd'], ['ee', null], ['ff'] ]

out = join_list_elements(s)
out is ['aabbcc', null, 'dd', null, 'ff']

out = join_list_elements(s, ':', '_')
out is ['aa:bb:cc', null,  ':dd', 'ee:_', 'ff']

out = join_list_elements(s, ':', '', separator_on_nulls::NO)
out is ['aa:bb:cc', null,  ':dd', 'ee', 'ff']
Throws:
Parameters:
  • lists_strings_column – Column containing lists of strings to concatenate

  • separator – String to insert between strings of each list row. Default is an empty string.

  • narep – String to replace null strings in any non-null list row. Default is an invalid-scalar denoting that list rows containing null strings will result in a null string in the corresponding output rows.

  • separate_nulls – If YES, then the separator is included for null rows if narep is valid

  • empty_list_policy – If set to EMPTY_STRING, any input row that is an empty list will result in an empty string. Otherwise, it will result in a null.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with concatenated results
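
For the scalar-separator overload, a per-row host model makes the empty_list_policy behaviour easy to try out. The sketch below is an illustration only and does not use libcudf: opt_str, opt_list, sep_on_nulls, if_empty_list, join_row_model, and join_list_elements_model are hypothetical names. It assumes that the empty or all-null check takes precedence over the null-element check, and that narep is written without a separator when separate_nulls is NO.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side stand-ins; none of these are cudf types.
using opt_str  = std::optional<std::string>;
using opt_list = std::optional<std::vector<opt_str>>;
enum class sep_on_nulls { YES, NO };
enum class if_empty_list { EMPTY_STRING, NULL_ELEMENT };

// Joins one list row; std::nullopt represents a null output row.
opt_str join_row_model(opt_list const& row,
                       std::string const& separator,
                       opt_str const& narep,
                       sep_on_nulls separate_nulls,
                       if_empty_list empty_policy)
{
  if (!row) { return std::nullopt; }  // a null list row is always null

  std::string joined;
  bool any_valid   = false;
  bool any_null    = false;
  bool pending_sep = false;
  for (auto const& e : *row) {
    bool const is_null = !e.has_value();
    any_valid          = any_valid || !is_null;
    any_null           = any_null || is_null;
    if (is_null && !narep) { continue; }  // row outcome is decided after the loop
    if (pending_sep && (separate_nulls == sep_on_nulls::YES || !is_null)) {
      joined += separator;
      pending_sep = false;
    }
    joined += is_null ? *narep : *e;
    pending_sep = pending_sep || separate_nulls == sep_on_nulls::YES || !is_null;
  }

  if (!any_valid) {  // empty list, or every element is null
    if (empty_policy == if_empty_list::EMPTY_STRING) { return std::string{}; }
    return std::nullopt;
  }
  if (any_null && !narep) { return std::nullopt; }  // null element with no replacement
  return joined;
}

// Applies join_row_model to every row of a lists "column".
std::vector<opt_str> join_list_elements_model(
  std::vector<opt_list> const& lists,
  std::string const& separator = "",
  opt_str const& narep         = std::nullopt,
  sep_on_nulls separate_nulls  = sep_on_nulls::YES,
  if_empty_list empty_policy   = if_empty_list::EMPTY_STRING)
{
  std::vector<opt_str> out;
  out.reserve(lists.size());
  for (auto const& row : lists) {
    out.push_back(join_row_model(row, separator, narep, separate_nulls, empty_policy));
  }
  return out;
}
```

Running the model on s from the example gives the three documented outputs. A row holding an empty list becomes an empty string under EMPTY_STRING and null under NULL_ELEMENT.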