cudf.core.subword_tokenizer.SubwordTokenizer#
- class cudf.core.subword_tokenizer.SubwordTokenizer(hash_file: str, do_lower_case: bool = True)#
Runs a CUDA BERT subword tokenizer on a cuDF strings column. Encodes words to token ids using the vocabulary from a pretrained tokenizer. Tokenizing requires approximately 21x the number of character bytes in the input strings column as working memory.
- Parameters:
- hash_file : str
Path to the hash file containing the vocabulary of words with token ids. This file can be created from a raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.
- do_lower_case : bool, default True
If set to True, the original text is lowercased before encoding.
- Returns:
- SubwordTokenizer
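The hash file consumed by the constructor can be generated once, ahead of time, from a plain-text vocabulary file (one token per line, as in a BERT vocab.txt). A minimal sketch, where "vocab.txt" and "voc_hash.txt" are placeholder paths, not files shipped with cuDF:

```python
from cudf.utils.hash_vocab_utils import hash_vocab

# "vocab.txt" is a placeholder for a raw vocabulary file (one token per
# line); "voc_hash.txt" is the hashed output that SubwordTokenizer's
# hash_file parameter expects.
hash_vocab("vocab.txt", "voc_hash.txt")
```

The same hash file can then be reused across many tokenizer instances, so the hashing cost is paid only once per vocabulary.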
Methods
__call__(text, max_length, max_num_rows[, ...])
Run CUDA BERT subword tokenizer on cuDF strings column.
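Putting the pieces together, a typical end-to-end call looks like the sketch below. It assumes a hash file already exists at the placeholder path "voc_hash.txt" and that a CUDA-capable GPU is available; the keyword arguments shown (padding, return_tensors, truncation) belong to the __call__ method summarized above.

```python
import cudf
from cudf.core.subword_tokenizer import SubwordTokenizer

# "voc_hash.txt" is a placeholder path to a hash file previously built
# with cudf.utils.hash_vocab_utils.hash_vocab.
tokenizer = SubwordTokenizer("voc_hash.txt", do_lower_case=True)

strings = cudf.Series(["This is the", "best book"])
output = tokenizer(
    strings,
    max_length=8,
    max_num_rows=len(strings),
    padding="max_length",   # pad every row out to max_length tokens
    return_tensors="cp",    # return CuPy arrays ("pt" for PyTorch tensors)
    truncation=True,
)
# output is a dict with "input_ids", "attention_mask", and "metadata"
```

The returned arrays live on the GPU, so they can be fed directly into a GPU model without a host round trip.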