subword_tokenize
- class pylibcudf.nvtext.subword_tokenize.HashedVocabulary
The vocabulary data for use with the subword_tokenize function.
For details, see cudf::nvtext::hashed_vocabulary.
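A minimal loading sketch (not part of the reference above), assuming the constructor accepts the path to a vocabulary file that has already been hashed; producing that file is outside this module (cudf ships a hash_vocab utility for it):

from pylibcudf.nvtext.subword_tokenize import HashedVocabulary

# Load a pre-hashed vocabulary file. "vocab_hash.txt" is a hypothetical path;
# the file must already be in the hashed format, not a plain vocab.txt.
vocab = HashedVocabulary("vocab_hash.txt")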
- pylibcudf.nvtext.subword_tokenize.subword_tokenize(Column input, HashedVocabulary vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate) -> tuple
Creates a tokenizer that cleans the text, splits it into tokens and returns token-ids from an input vocabulary.
For details, see nvtext::subword_tokenize.
- Parameters:
- input : Column
The input strings to tokenize.
- vocabulary_table : HashedVocabulary
The vocabulary table pre-loaded into this object.
- max_sequence_length : uint32_t
Limit on the number of token-ids per row in the final tensor for each string.
- stride : uint32_t
Each row in the output token-ids will replicate max_sequence_length - stride token-ids from the previous row, unless it is the first string.
- do_lower_case : bool
If true, the tokenizer will convert uppercase characters in the input stream to lower-case and strip accents from those characters. If false, accented and uppercase characters are not transformed.
- do_truncate : bool
If true, the tokenizer will discard all the token-ids after max_sequence_length for each input string. If false, it will use a new row in the output token-ids to continue generating the output.
- Returns:
- tuple[Column, Column, Column]
A tuple of three columns containing the token-ids, the attention masks, and the row metadata.
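A minimal end-to-end sketch (not from the reference above), assuming a hashed vocabulary file at the hypothetical path "vocab_hash.txt", pyarrow for building the input column, and the usual BERT-style reading of the outputs:

import pyarrow as pa
import pylibcudf as plc
from pylibcudf.nvtext.subword_tokenize import HashedVocabulary, subword_tokenize

# Build the input strings column from a pyarrow array.
strings = plc.interop.from_arrow(
    pa.array(["This is a test.", "A slightly longer example."])
)

vocab = HashedVocabulary("vocab_hash.txt")  # hypothetical pre-hashed vocabulary file

# max_sequence_length=64, stride=48: when a string does not fit in one row and
# do_truncate is False, each continuation row replicates 64 - 48 = 16 token-ids
# from the previous row.
tokens, masks, metadata = subword_tokenize(strings, vocab, 64, 48, True, False)

# tokens:   token-ids, max_sequence_length entries per output row
# masks:    1 where a token-id is valid, 0 where the row is padding
# metadata: maps each output row back to its input string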