StringMethods.normalize_characters(do_lower: bool = True) -> SeriesOrIndex

Normalizes string characters for tokenizing.

This uses the normalizer built into the subword_tokenize function, which performs the following steps:

  • adding padding around punctuation (Unicode categories starting with “P”) as well as certain ASCII symbols like “^” and “$”

  • adding padding around characters in the CJK Unicode block

  • changing whitespace (e.g. \t, \n, \r) to a space

  • removing control characters (Unicode categories “Cc” and “Cf”)
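The rules above can be approximated on the CPU with the standard library. The following is a minimal sketch, not cuDF's implementation: `sketch_normalize`, `is_cjk`, and `EXTRA_SYMBOLS` are hypothetical names, the symbol set is limited to the two characters named above, and the CJK check covers only the main CJK Unified Ideographs range. Exact spacing in cuDF's output may differ from what this sketch produces.

```python
import unicodedata

# Hypothetical helper, not part of cuDF: a CPU-side approximation of the
# listed rules, useful only for reasoning about the kind of output to expect.
EXTRA_SYMBOLS = {"^", "$"}  # assumption: only the two symbols named in the text


def is_cjk(ch: str) -> bool:
    # Assumption: only the main CJK Unified Ideographs range is checked.
    return 0x4E00 <= ord(ch) <= 0x9FFF


def sketch_normalize(text: str) -> str:
    out = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Whitespace is tested first because \t, \n and \r are themselves "Cc".
        if ch in "\t\n\r" or cat == "Zs":
            out.append(" ")
        elif cat in ("Cc", "Cf"):
            continue  # control and format characters are dropped
        elif cat.startswith("P") or ch in EXTRA_SYMBOLS or is_cjk(ch):
            out.append(f" {ch} ")  # pad punctuation, listed symbols, and CJK
        else:
            out.append(ch)
    return "".join(out)
```

Because every padded character receives a space on both sides, this sketch can emit leading, trailing, or doubled spaces that a real tokenizer pipeline would later collapse or ignore.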

If do_lower is True, lower-casing also removes accents. Accents cannot be removed from upper-case characters without lower-casing them, and lower-casing cannot be performed without also removing accents. However, if an accented character is already lower-case, only the accent is removed.
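The coupling between lower-casing and accent removal can be illustrated with the standard library. This is a sketch under an assumption: it strips accents by NFD decomposition and dropping combining marks, which is a common approach but is not confirmed here to be what cuDF does internally. `strip_accents` and `sketch_lower` are hypothetical names.

```python
import unicodedata


def strip_accents(text: str) -> str:
    # Decompose each character (NFD), then drop combining marks (category "Mn").
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def sketch_lower(text: str, do_lower: bool = True) -> str:
    # Hypothetical helper mirroring the behavior described above:
    # lower-casing and accent removal are applied together or not at all.
    return strip_accents(text.lower()) if do_lower else text


print(sketch_lower("ĂĆCĖÑTED"))                  # accented
print(sketch_lower("héllo"))                     # hello
print(sketch_lower("ĂĆCĖÑTED", do_lower=False))  # ĂĆCĖÑTED
```

The single flag reflects the documented constraint: there is no mode that lower-cases while keeping accents, or strips accents while keeping case.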

do_lower : bool, default True

If set to True, characters will be lower-cased and accents will be removed. If False, accented and upper-case characters are not transformed.

Series or Index of object.


>>> import cudf
>>> ser = cudf.Series(["héllo, \tworld", "ĂĆCĖÑTED", "$99"])
>>> ser.str.normalize_characters()
0    hello ,  world
1          accented
2              $ 99
dtype: object
>>> ser.str.normalize_characters(do_lower=False)
0    héllo ,  world
1          ĂĆCĖÑTED
2              $ 99
dtype: object