cudf.core.column.string.StringMethods.ngrams_tokenize

StringMethods.ngrams_tokenize(n: int = 2, delimiter: str = ' ', separator: str = '_') -> SeriesOrIndex

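Generate n-grams from the tokens of each string. Each string is first split into tokens on the delimiter, and n-grams are then built from consecutive tokens within that string. The result holds one row per n-gram, so its length can differ from that of the input.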

Parameters:
n : int, default 2

The degree of the n-gram (number of consecutive tokens).

delimiter : str, default ' ' (a single space)

The character used to locate the split points of each string.

separator : str, default '_'

The separator to use between tokens within an n-gram.

Returns:
Series or Index of object.

Examples

>>> import cudf
>>> ser = cudf.Series(['this is the', 'best book'])
>>> ser.str.ngrams_tokenize(n=2, separator='_')
0      this_is
1       is_the
2    best_book
dtype: object
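To make the semantics of the example concrete, the following is a minimal pure-Python sketch of the same idea: split each string on the delimiter, then join every run of n consecutive tokens with the separator. The helper name `ngrams_tokenize_reference` is hypothetical and is not part of cuDF. The handling of empty tokens, and of strings with fewer than n tokens, is an assumption of this sketch and not a statement about the library's GPU implementation.

```python
def ngrams_tokenize_reference(strings, n=2, delimiter=" ", separator="_"):
    """Pure-Python illustration of per-string n-gram tokenization (not cuDF code)."""
    out = []
    for s in strings:
        # Split on the delimiter; dropping empty tokens is an assumption of this sketch.
        tokens = [t for t in s.split(delimiter) if t]
        # Slide a window of n tokens; n-grams never span two input strings.
        for i in range(len(tokens) - n + 1):
            out.append(separator.join(tokens[i:i + n]))
    return out


result = ngrams_tokenize_reference(["this is the", "best book"], n=2, separator="_")
print(result)  # ['this_is', 'is_the', 'best_book']
```

In this sketch, a string with fewer than n tokens contributes no rows, which is why the output is a flat sequence and not one entry per input string.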