cudf.core.tokenize_vocabulary.TokenizeVocabulary.tokenize

TokenizeVocabulary.tokenize(text, delimiter: str = '', default_id: int = -1) → Series
Parameters:
text : cudf string series

The strings to be tokenized.

delimiter : str

Delimiter used to identify tokens. The default, an empty string, splits on whitespace.

default_id : int

Value to use for tokens not found in the vocabulary. Default is -1.

Returns:
Series

The tokenized strings. Tokens not found in the vocabulary are represented by default_id.
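
As a rough illustration of the parameter semantics described above, the following pure-Python sketch mimics them on plain lists. It is not cuDF's implementation and does not run on the GPU. The helper name `tokenize_reference` is hypothetical, and the sketch assumes that a token's id is its position in the vocabulary and that empty tokens are dropped.

```python
def tokenize_reference(texts, vocabulary, delimiter="", default_id=-1):
    """CPU-only model of the documented behavior (hypothetical helper)."""
    # Assumption: a token's id is its position in the vocabulary.
    token_ids = {token: i for i, token in enumerate(vocabulary)}
    result = []
    for text in texts:
        if delimiter == "":
            # An empty delimiter means "split on whitespace".
            tokens = text.split()
        else:
            # Assumption: empty tokens between repeated delimiters are dropped.
            tokens = [t for t in text.split(delimiter) if t]
        # Tokens missing from the vocabulary map to default_id.
        result.append([token_ids.get(t, default_id) for t in tokens])
    return result


vocabulary = ["brown", "fox", "quick", "the"]
ids = tokenize_reference(["the quick brown fox", "the lazy fox"], vocabulary)
print(ids)  # [[3, 2, 0, 1], [3, -1, 1]]
```

In this sketch "lazy" is not in the vocabulary, so it maps to the default_id of -1. Passing a different default_id, for example 99, would change only that entry.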