Strings Types

group strings_types

Enums

enum string_character_types

Character type values. These values are bit flags and can be combined with bitwise OR to check for any combination of types.

This cannot be turned into an enum class because OR-ing entries can produce values that are not in the class. For example, NUMERIC|SPACE is a valid, reasonable combination but does not match any explicitly named enumerator.

Values:

enumerator DECIMAL

all decimal characters

enumerator NUMERIC

all numeric characters

enumerator DIGIT

all digit characters

enumerator ALPHA

all alphabetic characters

enumerator SPACE

all space characters

enumerator UPPER

all upper case characters

enumerator LOWER

all lower case characters

enumerator ALPHANUM

all alphanumeric characters

enumerator CASE_TYPES

all case-able characters

enumerator ALL_TYPES

all character types
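The flag behavior described for this enum can be sketched on the host as follows. This is a minimal illustration, not the libcudf definition: the name char_types and the specific bit values are assumptions chosen for the example. It shows why an OR-ed value such as NUMERIC|SPACE is meaningful even though no enumerator carries that value.

```cpp
#include <cstdint>

// Hypothetical stand-in for string_character_types.
// The bit positions below are assumed for illustration only.
enum char_types : std::uint32_t {
  DECIMAL    = 1 << 0,
  NUMERIC    = 1 << 1,
  DIGIT      = 1 << 2,
  ALPHA      = 1 << 3,
  SPACE      = 1 << 4,
  UPPER      = 1 << 5,
  LOWER      = 1 << 6,
  ALPHANUM   = DECIMAL | NUMERIC | DIGIT | ALPHA,
  CASE_TYPES = UPPER | LOWER,
  ALL_TYPES  = ALPHANUM | CASE_TYPES | SPACE
};

// The built-in | promotes both operands to an integer, so the result must be
// cast back to the enum type. The value is valid because the underlying type
// is fixed, but it does not equal any named enumerator.
constexpr char_types numeric_or_space = static_cast<char_types>(NUMERIC | SPACE);

// True when every bit of `wanted` is also set in `mask`.
constexpr bool has_all(char_types mask, char_types wanted)
{
  return (static_cast<std::uint32_t>(mask) & static_cast<std::uint32_t>(wanted)) ==
         static_cast<std::uint32_t>(wanted);
}

static_assert(numeric_or_space != NUMERIC && numeric_or_space != SPACE,
              "the combination is not one of the single-bit enumerators");
static_assert(has_all(ALL_TYPES, numeric_or_space),
              "ALL_TYPES covers every combination");
```

With the assumed bit values, ALPHANUM does not cover numeric_or_space because the SPACE bit is missing from it.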

Functions

std::unique_ptr<column> all_characters_of_type(strings_column_view const &input, string_character_types types, string_character_types verify_types = string_character_types::ALL_TYPES, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())

Returns a boolean column identifying string entries in which all characters are of the type specified.

The output row entry will be set to false if the corresponding string element is empty or has at least one character not of the specified type. If all characters fit the type then true is set in that output row entry.

To ignore all but specific types, set verify_types to the types that should be checked; characters of any other type are skipped. Otherwise, the default ALL_TYPES verifies that every character matches types.

Example:
s = ['ab', 'a b', 'a7', 'a B']
b1 = all_characters_of_type(s, LOWER)
b1 is [true, false, false, false]
b2 = all_characters_of_type(s, LOWER, LOWER|UPPER)
b2 is [true, true, true, false]

Any null row results in a null entry for that row in the output column.

Parameters:
  • input – Strings instance for this operation

  • types – The character types to check in each string

  • verify_types – Only verify against these character types. Default ALL_TYPES means return true iff all characters match types.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New column of boolean results for each string
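The interaction between types and verify_types in all_characters_of_type can be modeled on the host with plain std::string values. The sketch below is not the libcudf implementation: char_types, classify and all_of_type are hypothetical names, the bit values are assumed, classification is ASCII-only, and null handling is omitted. Returning false when no character is verified is a choice of this model that extends the empty-string rule.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <string>

// Hypothetical stand-in for string_character_types (bit values assumed).
enum char_types : std::uint32_t {
  DECIMAL = 1 << 0, NUMERIC = 1 << 1, DIGIT = 1 << 2, ALPHA = 1 << 3,
  SPACE   = 1 << 4, UPPER   = 1 << 5, LOWER = 1 << 6,
  ALL_TYPES = (1 << 7) - 1
};

// ASCII-only classification; characters such as punctuation get no flags.
inline std::uint32_t classify(unsigned char c)
{
  if (std::isdigit(c)) return DECIMAL | NUMERIC | DIGIT;
  if (std::isupper(c)) return ALPHA | UPPER;
  if (std::islower(c)) return ALPHA | LOWER;
  if (std::isspace(c)) return SPACE;
  return 0;
}

// Host model of one output row: true only if every verified character
// has at least one of the requested types.
inline bool all_of_type(std::string const& s,
                        std::uint32_t types,
                        std::uint32_t verify_types = ALL_TYPES)
{
  bool checked_any = false;  // stays false for an empty string
  for (unsigned char c : s) {
    std::uint32_t const flags = classify(c);
    // With a restricted verify_types, characters of other types are skipped.
    bool const verify = (verify_types == ALL_TYPES) || (flags & verify_types) != 0;
    if (!verify) continue;
    if ((flags & types) == 0) return false;
    checked_any = true;
  }
  return checked_any;
}

// Reproduces the documented example for s = ['ab', 'a b', 'a7', 'a B'].
inline void check_documented_example()
{
  assert(all_of_type("ab", LOWER) && !all_of_type("a b", LOWER));
  assert(!all_of_type("a7", LOWER) && !all_of_type("a B", LOWER));
  assert(all_of_type("a b", LOWER, LOWER | UPPER));
  assert(all_of_type("a7", LOWER, LOWER | UPPER));
  assert(!all_of_type("a B", LOWER, LOWER | UPPER));
}
```

In the second call of the example, the space in 'a b' and the digit in 'a7' are neither LOWER nor UPPER, so the model skips them; the 'B' in 'a B' is verified and fails the LOWER check.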

std::unique_ptr<column> filter_characters_of_type(strings_column_view const &input, string_character_types types_to_remove, string_scalar const &replacement = string_scalar(""), string_character_types types_to_keep = string_character_types::ALL_TYPES, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())

Filter specific character types from a column of strings.

To remove all characters of a specific type, set that type in types_to_remove and set types_to_keep to ALL_TYPES.

To filter out characters NOT of a select type, specify ALL_TYPES for types_to_remove and list the types to retain in types_to_keep.

Example:
s = ['ab', 'a b', 'a7bb', 'A7B234']
s1 = filter_characters_of_type(s, NUMERIC, "", ALL_TYPES)
s1 is ['ab', 'a b', 'abb', 'AB']
s2 = filter_characters_of_type(s, ALL_TYPES, "-", LOWER)
s2 is ['ab', 'a-b', 'a-bb', '------']

In s1 all NUMERIC characters have been removed. In s2 every non-LOWER character has been replaced with "-".

Exactly one of the parameters types_to_remove and types_to_keep must be set to ALL_TYPES.

Any null row results in a null entry for that row in the output column.

Throws:

cudf::logic_error – if both or neither of types_to_remove and types_to_keep are set to ALL_TYPES.

Parameters:
  • input – Strings instance for this operation

  • types_to_remove – The character types to remove from each string. Use ALL_TYPES here to specify types_to_keep instead.

  • replacement – The replacement string to substitute for each removed character. The default empty string deletes the character.

  • types_to_keep – The character types to retain in each string. Default ALL_TYPES means all characters of types_to_remove will be filtered.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column with the filtered characters removed or replaced in each string
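The two filtering modes of filter_characters_of_type, and the rule that exactly one of types_to_remove and types_to_keep is ALL_TYPES, can be modeled on the host as shown below. This is a sketch under stated assumptions, not the libcudf implementation: char_types, classify and filter_of_type are hypothetical names, the bit values are assumed, classification is ASCII-only, std::invalid_argument stands in for cudf::logic_error, and passing untyped characters through unchanged is a choice of this model.

```cpp
#include <cassert>
#include <cctype>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for string_character_types (bit values assumed).
enum char_types : std::uint32_t {
  DECIMAL = 1 << 0, NUMERIC = 1 << 1, DIGIT = 1 << 2, ALPHA = 1 << 3,
  SPACE   = 1 << 4, UPPER   = 1 << 5, LOWER = 1 << 6,
  ALL_TYPES = (1 << 7) - 1
};

// ASCII-only classification; characters such as punctuation get no flags.
inline std::uint32_t classify(unsigned char c)
{
  if (std::isdigit(c)) return DECIMAL | NUMERIC | DIGIT;
  if (std::isupper(c)) return ALPHA | UPPER;
  if (std::islower(c)) return ALPHA | LOWER;
  if (std::isspace(c)) return SPACE;
  return 0;
}

// Host model of one output row.
inline std::string filter_of_type(std::string const& s,
                                  std::uint32_t types_to_remove,
                                  std::string const& replacement = "",
                                  std::uint32_t types_to_keep = ALL_TYPES)
{
  bool const remove_all = (types_to_remove == ALL_TYPES);
  bool const keep_all   = (types_to_keep == ALL_TYPES);
  // Both or neither set to ALL_TYPES is an error.
  if (remove_all == keep_all) {
    throw std::invalid_argument("exactly one of types_to_remove/types_to_keep must be ALL_TYPES");
  }
  std::string out;
  for (char c : s) {
    std::uint32_t const flags = classify(static_cast<unsigned char>(c));
    if (flags == 0) { out.push_back(c); continue; }  // model assumption: untyped chars pass through
    bool const keep = remove_all ? (flags & types_to_keep) != 0
                                 : (flags & types_to_remove) == 0;
    if (keep) out.push_back(c);
    else out += replacement;
  }
  return out;
}

// Reproduces two rows of the documented example.
inline void check_documented_example()
{
  assert(filter_of_type("A7B234", NUMERIC) == "AB");
  assert(filter_of_type("a7bb", ALL_TYPES, "-", LOWER) == "a-bb");
}
```

In keep mode the replacement is emitted once per removed character, which is why 'A7B234' becomes six dashes in the documented s2 result.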

constexpr string_character_types operator|(string_character_types lhs, string_character_types rhs)

OR operator for combining string_character_types.

Parameters:
  • lhs – left-hand side of OR operation

  • rhs – right-hand side of OR operation

Returns:

combined string_character_types

constexpr string_character_types &operator|=(string_character_types &lhs, string_character_types rhs)

Compound assignment OR operator for combining string_character_types.

Parameters:
  • lhs – left-hand side of OR operation

  • rhs – right-hand side of OR operation

Returns:

Reference to lhs after combining lhs and rhs
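Overloads with the two signatures above are commonly written by casting to the underlying integer type and back. The following is a minimal sketch of that pattern on a hypothetical enum char_types with assumed bit values; it is not taken from the libcudf sources. The point of the overloads is that the result of | keeps the enum type instead of decaying to an integer.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical flag enum (names and bit values assumed for illustration).
enum char_types : std::uint32_t {
  NUMERIC = 1 << 1,
  SPACE   = 1 << 4,
  UPPER   = 1 << 5,
  LOWER   = 1 << 6
};

// OR two flag values and return the enum type rather than an integer.
constexpr char_types operator|(char_types lhs, char_types rhs)
{
  using U = std::underlying_type_t<char_types>;
  return static_cast<char_types>(static_cast<U>(lhs) | static_cast<U>(rhs));
}

// Compound form: updates lhs in place and returns a reference to it.
constexpr char_types& operator|=(char_types& lhs, char_types rhs)
{
  lhs = lhs | rhs;
  return lhs;
}

// Builds LOWER|UPPER step by step with the compound operator.
constexpr char_types lower_then_upper()
{
  char_types t = LOWER;
  t |= UPPER;
  return t;
}

static_assert(std::is_same<decltype(NUMERIC | SPACE), char_types>::value,
              "the overloaded | yields the enum type");
static_assert(lower_then_upper() == (LOWER | UPPER), "|= matches |");
```

Because the result has the enum type, an expression such as LOWER | UPPER can be passed directly to a parameter declared with that type, without a cast at the call site.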