cudf.core.wordpiece_tokenize.WordPieceVocabulary.tokenize

WordPieceVocabulary.tokenize(text: Series, max_words_per_row: int = 0) -> Series

Produces tokens for the input strings. The input is expected to be the output of NormalizeCharacters or a similar normalizer.

Parameters:

text : cudf.Series

Normalized strings to be tokenized.

max_words_per_row : int

Maximum number of words to tokenize per row. The default of 0 tokenizes all words.

Returns:

cudf.Series

Token values for each input row.
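To illustrate what the tokenizer computes, below is a minimal, pure-Python sketch of the greedy longest-match-first WordPiece algorithm; the actual cudf implementation runs on the GPU over a whole Series at once. The `"##"` continuation prefix and `"[UNK]"` fallback token are conventional WordPiece assumptions, not details taken from this page.

```python
# Greedy longest-match-first WordPiece over a single word (CPU sketch).
# vocab is a set of subword strings; continuation pieces carry a "##" prefix.
def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        # Try the longest candidate first, shrinking from the right.
        while start < end:
            cand = word[start:end]
            if start > 0:
                cand = "##" + cand  # mark non-initial pieces
            if cand in vocab:
                piece = cand
                break
            end -= 1
        if piece is None:
            return [unk]  # no prefix matched: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

vocab = {"un", "##aff", "##able", "hug", "##s"}
print(wordpiece("unaffable", vocab))  # ['un', '##aff', '##able']
print(wordpiece("hugs", vocab))       # ['hug', '##s']
```

In cudf, the equivalent call would build a `WordPieceVocabulary` from a Series of vocabulary strings and pass a normalized strings Series to `tokenize`, with `max_words_per_row` capping how many whitespace-delimited words per row are tokenized.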