- StringMethods.normalize_characters(do_lower: bool = True) → SeriesOrIndex
Normalizes string characters for tokenizing.
This uses the normalizer that is built into the subword_tokenize function, which includes:
- adding padding around punctuation (Unicode category starting with “P”) as well as certain ASCII symbols like “^” and “$”
- adding padding around the CJK Unicode block characters
- changing whitespace (e.g. “\r”) to space
- removing control characters (Unicode categories “Cc” and “Cf”)
If do_lower=True, lower-casing also removes the accents. The accents cannot be removed from upper-case characters without lower-casing, and lower-casing cannot be performed without also removing accents. However, if the accented character is already lower-case, then only the accent is removed.
- do_lower : bool, default True
If set to True, characters will be lower-cased and accents will be removed. If False, accented and upper-case characters are not transformed.
- Returns: Series or Index of object.
>>> import cudf
>>> ser = cudf.Series(["héllo, \tworld", "ĂĆCĖÑTED", "$99"])
>>> ser.str.normalize_characters()
0    hello , world
1         accented
2             $ 99
dtype: object
>>> ser.str.normalize_characters(do_lower=False)
0    héllo , world
1        ĂĆCĖÑTED
2            $ 99
dtype: object
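For readers without a GPU at hand, the rules above can be approximated in plain Python with `unicodedata`. This is only an illustrative sketch of the described behavior, not cuDF's GPU implementation; the exact set of padded ASCII symbols and the output spacing may differ from the real normalizer.

```python
import unicodedata

def normalize_characters_sketch(text: str, do_lower: bool = True) -> str:
    """CPU approximation of the normalization rules described above."""
    out = []
    for ch in text:
        # Whitespace (e.g. \t, \r) becomes a plain space. Checked before the
        # control-character rule, since \t and \r are also category "Cc".
        if ch.isspace():
            out.append(" ")
            continue
        # Remove control and format characters (categories "Cc" and "Cf").
        if unicodedata.category(ch) in ("Cc", "Cf"):
            continue
        if do_lower:
            # Decompose (NFD), drop combining marks (accents), lower-case.
            ch = "".join(
                c for c in unicodedata.normalize("NFD", ch)
                if unicodedata.category(c) != "Mn"
            ).lower()
        # Pad punctuation; "^" and "$" stand in for the "certain ASCII
        # symbols" mentioned above (the real set is not fully listed here).
        if ch and (unicodedata.category(ch[0]).startswith("P") or ch in "^$"):
            out.append(f" {ch} ")
        else:
            out.append(ch)
    return "".join(out)
```

For example, `normalize_characters_sketch("ĂĆCĖÑTED")` yields `"accented"`, while with `do_lower=False` accented characters pass through unchanged.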