Unicode Limitations

The strings column currently stores characters only as UTF-8 internally. Functions that test character types (e.g. cudf::strings::all_characters_of_type()) or convert case (e.g. cudf::strings::capitalize()) support only the 16-bit Unicode 13.0 code points (0-65535). Case conversion and character testing are not supported for characters above code point 65535.
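To make the 0-65535 boundary concrete, here is a minimal standard C++ sketch that decodes UTF-8 and checks whether every code point falls inside that range. It does not call libcudf and is not how libcudf is implemented; the helper names decode_utf8 and within_supported_range are hypothetical, and the decoder assumes well-formed input.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: decode a well-formed UTF-8 string into code points.
// Illustration only; malformed input is not validated.
std::vector<char32_t> decode_utf8(const std::string& s)
{
  std::vector<char32_t> out;
  std::size_t i = 0;
  while (i < s.size()) {
    auto const b = static_cast<unsigned char>(s[i]);
    char32_t cp     = 0;
    std::size_t len = 1;
    if (b < 0x80) {                 // 1-byte sequence (ASCII)
      cp = b;
    } else if ((b & 0xE0) == 0xC0) { // 2-byte sequence
      cp  = b & 0x1F;
      len = 2;
    } else if ((b & 0xF0) == 0xE0) { // 3-byte sequence
      cp  = b & 0x0F;
      len = 3;
    } else {                         // 4-byte sequence
      cp  = b & 0x07;
      len = 4;
    }
    // Fold in the low 6 bits of each continuation byte.
    for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
      cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
    }
    out.push_back(cp);
    i += len;
  }
  return out;
}

// Hypothetical helper: true when every code point is within 0-65535,
// the range described above as supported for case and type functions.
bool within_supported_range(const std::string& utf8)
{
  for (char32_t cp : decode_utf8(utf8)) {
    if (cp > 0xFFFF) { return false; }
  }
  return true;
}
```

A check like this could be used to flag rows containing characters, such as most emoji, that need four UTF-8 bytes and therefore lie above code point 65535.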

Context-sensitive case conversions are not supported. Also, case conversions that produce multiple characters are not reversible: adjacent individual characters are never converted back into a single character. For example, converting the character ß to upper case yields “SS”, but converting “SS” to lower case yields “ss”, not ß.
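The round-trip behavior can be illustrated with a toy sketch in standard C++. This is not libcudf code: toy_to_upper and toy_to_lower are hypothetical functions that handle only ASCII letters plus the single special case ß (UTF-8 bytes C3 9F), which is enough to show why the expansion cannot be undone.

```cpp
#include <cstddef>
#include <string>

// Hypothetical toy upper-casing over UTF-8 bytes: ASCII letters plus the
// one multi-character case U+00DF (ß, bytes C3 9F), which expands to "SS".
std::string toy_to_upper(const std::string& s)
{
  std::string out;
  for (std::size_t i = 0; i < s.size(); ++i) {
    auto const b = static_cast<unsigned char>(s[i]);
    if (b == 0xC3 && i + 1 < s.size() &&
        static_cast<unsigned char>(s[i + 1]) == 0x9F) {
      out += "SS";  // one character becomes two
      ++i;          // skip the second byte of ß
    } else if (b >= 'a' && b <= 'z') {
      out += static_cast<char>(b - 'a' + 'A');
    } else {
      out += s[i];
    }
  }
  return out;
}

// Hypothetical toy lower-casing: each character is converted on its own,
// so two adjacent 'S' characters are never merged back into ß.
std::string toy_to_lower(const std::string& s)
{
  std::string out;
  for (char c : s) {
    out += (c >= 'A' && c <= 'Z') ? static_cast<char>(c - 'A' + 'a') : c;
  }
  return out;
}
```

Because lower-casing sees only individual characters, it has no way to know that a given “SS” originated from ß, so upper-casing followed by lower-casing does not restore the original string.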

Strings case and type APIs:

  • cudf::strings::all_characters_of_type()

  • cudf::strings::to_upper()

  • cudf::strings::to_lower()

  • cudf::strings::capitalize()

  • cudf::strings::title()

  • cudf::strings::swapcase()

Also, regex patterns that use the shorthand character classes \d \D \w \W \s \S match only the appropriate characters with code points in the range 0-65535.