cudf.core.tokenize_vocabulary.TokenizeVocabulary.tokenize
- TokenizeVocabulary.tokenize(text, delimiter: str = '', default_id: int = -1) -> Series
- Parameters:
- text : cudf string series
The strings to be tokenized.
- delimiter : str
Delimiter used to identify tokens. The default, an empty string, splits on whitespace.
- default_id : int
Value to use for tokens not found in the vocabulary. Default is -1.
- Returns:
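- Series
The tokenized strings, with each token replaced by its vocabulary ID (or default_id when the token is not in the vocabulary).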
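
The parameter semantics above can be illustrated with a small pure-Python sketch. This is not the cuDF implementation and does not run on the GPU; the function name `tokenize_with_vocabulary_sketch` is hypothetical, and two behaviours are assumptions made for illustration: a token's ID is taken to be its position in the vocabulary, and empty tokens produced by consecutive delimiters are dropped.

```python
from typing import Dict, List, Sequence


def tokenize_with_vocabulary_sketch(
    text: Sequence[str],
    vocabulary: Sequence[str],
    delimiter: str = "",
    default_id: int = -1,
) -> List[List[int]]:
    """Emulate the documented parameters on plain Python lists (hypothetical helper)."""
    # Assumption: a token's ID is its position in the vocabulary.
    token_to_id: Dict[str, int] = {tok: i for i, tok in enumerate(vocabulary)}

    result: List[List[int]] = []
    for s in text:
        if delimiter == "":
            # An empty delimiter means "split on whitespace".
            tokens = s.split()
        else:
            # Assumption: empty tokens between repeated delimiters are skipped.
            tokens = [t for t in s.split(delimiter) if t]
        # Tokens missing from the vocabulary map to default_id.
        result.append([token_to_id.get(t, default_id) for t in tokens])
    return result


vocab = ["hello", "world"]
ids = tokenize_with_vocabulary_sketch(["hello world", "hello there"], vocab)
print(ids)  # [[0, 1], [0, -1]]
```

Here "there" is not in the vocabulary, so it is reported as the default -1. Passing a different `default_id` (for example, the ID of an unknown-token entry) changes only that fallback value.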