cudf.core.column.string.StringMethods.ngrams_tokenize
- StringMethods.ngrams_tokenize(n: int = 2, delimiter: str = ' ', separator: str = '_') -> SeriesOrIndex
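Generate n-grams from the tokens of each string. Each string is first split into tokens at the delimiter; every run of n consecutive tokens within that string is then joined with the separator to form one n-gram. Each n-gram becomes its own row in the result.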
- Parameters:
- n : int, Default 2.
The degree of the n-gram (number of consecutive tokens).
- delimiter : str, Default is ' ' (a space).
The character used to locate the split points of each string.
- separator : str, Default is '_'.
The separator to use between tokens within an n-gram.
- Returns:
- Series or Index of object.
Examples
>>> import cudf
>>> ser = cudf.Series(['this is the', 'best book'])
>>> ser.str.ngrams_tokenize(n=2, separator='_')
0      this_is
1       is_the
2    best_book
dtype: object
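
To see how n, delimiter, and separator interact without a GPU, the behaviour described above can be approximated in plain Python. This is only a sketch: `ngrams_tokenize_reference` is a hypothetical helper, not part of cuDF, and it assumes that empty tokens produced by repeated delimiters are dropped and that a string with fewer than n tokens contributes no rows.

```python
from typing import List


def ngrams_tokenize_reference(
    strings: List[str], n: int = 2, delimiter: str = " ", separator: str = "_"
) -> List[str]:
    """Hypothetical CPU-only illustration of per-string n-gram generation."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    result = []
    for s in strings:
        # Assumption: empty tokens from repeated delimiters are dropped.
        tokens = [t for t in s.split(delimiter) if t]
        # N-grams never span two input strings; a string with fewer
        # than n tokens yields an empty range and adds nothing.
        for i in range(len(tokens) - n + 1):
            result.append(separator.join(tokens[i:i + n]))
    return result


print(ngrams_tokenize_reference(["this is the", "best book"]))
# prints ['this_is', 'is_the', 'best_book']
```

The flat list mirrors the shape of the Series in the example: three tokens in the first string give two bigrams, and two tokens in the second give one.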