cudf.core.column.string.StringMethods.ngrams_tokenize

StringMethods.ngrams_tokenize(n: int = 2, delimiter: str = ' ', separator: str = '_') → SeriesOrIndex

Generate n-grams from the tokens of each string. Each string is first split into tokens on the delimiter, and n-grams are then built from consecutive tokens within that string. Each n-gram becomes one element of the result.

Parameters:

n (int, default 2): The degree of the n-gram (number of consecutive tokens).
delimiter (str, default ' '): The character used to locate the split points of each string.
separator (str, default '_'): The separator to use between tokens within an n-gram.

Returns:

Series or Index of object.

Examples

>>> import cudf
>>> ser = cudf.Series(['this is the', 'best book'])
>>> ser.str.ngrams_tokenize(n=2, separator='_')
0      this_is
1       is_the
2    best_book
dtype: object
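
The semantics described above can be illustrated without a GPU. The following is a minimal pure-Python sketch, not the cuDF implementation: `ngrams_tokenize_sketch` is a hypothetical helper introduced only for illustration. It assumes that empty tokens produced by repeated delimiters are dropped and that a string with fewer than `n` tokens contributes no n-grams; neither behavior is stated on this page.

```python
from typing import List


def ngrams_tokenize_sketch(
    strings: List[str], n: int = 2, delimiter: str = " ", separator: str = "_"
) -> List[str]:
    """Hypothetical pure-Python illustration of per-string n-gram tokenization."""
    result = []
    for s in strings:
        # Split on the delimiter; dropping empty tokens is an assumption.
        tokens = [t for t in s.split(delimiter) if t]
        # Slide a window of n tokens; n-grams never span two input strings.
        for i in range(len(tokens) - n + 1):
            result.append(separator.join(tokens[i:i + n]))
    return result


print(ngrams_tokenize_sketch(["this is the", "best book"], n=2, separator="_"))
# ['this_is', 'is_the', 'best_book']
```

The sketch returns a flat list, mirroring how the example output has one row per n-gram rather than one row per input string.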