cudf.core.wordpiece_tokenize.WordPieceVocabulary.tokenize
- WordPieceVocabulary.tokenize(text: Series, max_words_per_row: int = 0) -> Series
Produces tokens for the input strings. The input is expected to be the output of NormalizeCharacters or a similar normalizer.
- Parameters:
- text : cudf.Series
Normalized strings to be tokenized.
- max_words_per_row : int
Maximum number of words to tokenize per row. The default, 0, tokenizes all words.
- Returns:
- cudf.Series
Token values produced for each input string.
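
The effect of max_words_per_row can be illustrated with a small pure-Python sketch of greedy longest-match WordPiece tokenization. This is a conceptual illustration only, not cuDF's GPU implementation: the helper name wordpiece_tokenize_row, the toy vocabulary, the "##" continuation prefix, and the "[UNK]" fallback are all assumptions drawn from the conventional WordPiece scheme, and this page does not confirm how the library handles any of them.

```python
from typing import Dict, List


def wordpiece_tokenize_row(
    text: str,
    vocab: Dict[str, int],
    max_words: int = 0,
    unk: str = "[UNK]",
) -> List[int]:
    """Hypothetical reference tokenizer for one already-normalized string."""
    words = text.split()
    # Mirror the documented parameter: 0 means "tokenize all words".
    if max_words > 0:
        words = words[:max_words]

    ids: List[int] = []
    for word in words:
        pieces: List[str] = []
        start = 0
        while start < len(word):
            end = len(word)
            match = None
            # Try the longest remaining substring first, then shrink it.
            while end > start:
                piece = word[start:end]
                if start > 0:
                    piece = "##" + piece  # assumed continuation marker
                if piece in vocab:
                    match = piece
                    break
                end -= 1
            if match is None:
                # Assumed fallback: the whole word becomes the unknown token.
                pieces = [unk]
                break
            pieces.append(match)
            start = end
        ids.extend(vocab[p] for p in pieces)
    return ids


# Toy vocabulary (hypothetical); real vocabularies are far larger.
toy_vocab = {"[UNK]": 0, "un": 1, "##aff": 2, "##able": 3, "run": 4, "##ning": 5}

all_words = wordpiece_tokenize_row("unaffable running xyz", toy_vocab)
first_two = wordpiece_tokenize_row("unaffable running xyz", toy_vocab, max_words=2)
```

With the library itself, the corresponding call would be made on a WordPieceVocabulary instance with a cudf.Series of normalized strings as text; the sketch above only shows how a per-row word limit truncates the work done for each string.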