subword_tokenize

class pylibcudf.nvtext.subword_tokenize.HashedVocabulary

The vocabulary data for use with the subword_tokenize function.

For details, see cudf::nvtext::hashed_vocabulary.
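A minimal construction sketch, assuming the constructor accepts the path to a vocabulary file that was pre-hashed for cudf; the file name below is hypothetical:

>>> from pylibcudf.nvtext.subword_tokenize import HashedVocabulary
>>> # Hypothetical path to a vocabulary pre-hashed for cudf.
>>> vocab = HashedVocabulary("bert-base-uncased-vocab-hash.txt")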

pylibcudf.nvtext.subword_tokenize.subword_tokenize(Column input, HashedVocabulary vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate) -> tuple

Creates a tokenizer that cleans the text, splits it into tokens and returns token-ids from an input vocabulary.

For details, see nvtext::subword_tokenize.

Parameters:
input : Column

The input strings to tokenize.

vocabulary_table : HashedVocabulary

The vocabulary table pre-loaded into this object.

max_sequence_length : uint32_t

Limit on the number of token-ids per row in the final tensor for each string.

stride : uint32_t

Each row in the output token-ids will replicate max_sequence_length - stride token-ids from the previous row, unless it is the first row for a string. For example, with max_sequence_length=64 and stride=48, each row after the first repeats the final 16 token-ids of the previous row.

do_lower_case : bool

If true, the tokenizer will convert uppercase characters in the input stream to lower-case and strip accents from those characters. If false, accented and uppercase characters are not transformed.

do_truncate : bool

If true, the tokenizer will discard all the token-ids after max_sequence_length for each input string. If false, it will use a new row in the output token-ids to continue generating the output.

Returns:
tuple[Column, Column, Column]

A tuple of three columns containing the tokens, masks, and metadata.
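A hedged end-to-end sketch. The vocabulary file name is hypothetical, building the input column through pylibcudf.interop.from_arrow is an assumption about the surrounding API, and the parameter values are illustrative only:

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> from pylibcudf.nvtext.subword_tokenize import (
...     HashedVocabulary,
...     subword_tokenize,
... )
>>> # Build a device strings column from host data.
>>> strings = plc.interop.from_arrow(
...     pa.array(["This is a test.", "Another sentence."])
... )
>>> # Vocabulary pre-hashed for the target model (hypothetical path).
>>> vocab = HashedVocabulary("bert-base-uncased-vocab-hash.txt")
>>> tokens, masks, metadata = subword_tokenize(
...     strings,
...     vocab,
...     64,     # max_sequence_length
...     48,     # stride: each new row repeats 64 - 48 = 16 token-ids
...     True,   # do_lower_case
...     False,  # do_truncate: overflow continues on additional rows
... )

With do_truncate=False, a string whose token count exceeds max_sequence_length produces additional rows in the output, and the metadata column records which input string each output row belongs to.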