subword_tokenize

class pylibcudf.nvtext.subword_tokenize.HashedVocabulary

The vocabulary data for use with the subword_tokenize function.

For details, see cudf::nvtext::hashed_vocabulary.
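A minimal construction sketch, assuming the constructor accepts the path to a vocabulary file that was pre-hashed for cudf; the file name below is hypothetical:

>>> from pylibcudf.nvtext.subword_tokenize import HashedVocabulary
>>> # Hypothetical path to a vocabulary pre-hashed for cudf.
>>> vocab = HashedVocabulary("bert-base-uncased-vocab-hash.txt")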

pylibcudf.nvtext.subword_tokenize.subword_tokenize(Column input, HashedVocabulary vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate) -> tuple

Creates a tokenizer that cleans the text, splits it into tokens and returns token-ids from an input vocabulary.

For details, see nvtext::subword_tokenize.

Parameters:
input : Column

The input strings to tokenize.

vocabulary_table : HashedVocabulary

The vocabulary table pre-loaded into this object.

max_sequence_length : uint32_t

Limit on the number of token-ids per row in the final tensor for each string.

stride : uint32_t

Each row in the output token-ids will replicate max_sequence_length - stride token-ids from the previous row, unless it is the first row for a string. For example, with max_sequence_length=64 and stride=48, each row after the first repeats the final 16 token-ids of the previous row.

do_lower_case : bool

If true, the tokenizer will convert uppercase characters in the input stream to lower-case and strip accents from those characters. If false, accented and uppercase characters are not transformed.

do_truncate : bool

If true, the tokenizer will discard all the token-ids after max_sequence_length for each input string. If false, it will use a new row in the output token-ids to continue generating the output.

Returns:
tuple[Column, Column, Column]

A tuple of three columns containing the tokens, masks, and metadata.
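A hedged end-to-end sketch. The vocabulary file name is hypothetical, building the input column through pylibcudf.interop.from_arrow is an assumption about the surrounding API, and the parameter values are illustrative only:

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> from pylibcudf.nvtext.subword_tokenize import (
...     HashedVocabulary,
...     subword_tokenize,
... )
>>> # Build a device strings column from host data.
>>> strings = plc.interop.from_arrow(
...     pa.array(["This is a test.", "Another sentence."])
... )
>>> # Vocabulary pre-hashed for the target model (hypothetical path).
>>> vocab = HashedVocabulary("bert-base-uncased-vocab-hash.txt")
>>> tokens, masks, metadata = subword_tokenize(
...     strings,
...     vocab,
...     64,     # max_sequence_length
...     48,     # stride: each new row repeats 64 - 48 = 16 token-ids
...     True,   # do_lower_case
...     False,  # do_truncate: overflow continues on additional rows
... )

With do_truncate=False, a string whose token count exceeds max_sequence_length produces additional rows in the output, and the metadata column records which input string each output row belongs to.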