cudf.core.subword_tokenizer.SubwordTokenizer#

class cudf.core.subword_tokenizer.SubwordTokenizer(hash_file: str, do_lower_case: bool = True)#

Run CUDA BERT subword tokenizer on cuDF strings column. Encodes words to token ids using vocabulary from a pretrained tokenizer. The tokenizer requires about 21x the number of character bytes in the input strings column as working memory.

Parameters:
hash_file : str

Path to hash file containing vocabulary of words with token-ids. This can be created from the raw vocabulary using the cudf.utils.hash_vocab_utils.hash_vocab function.

do_lower_case : bool, default True

If set to True, original text will be lowercased before encoding.

Returns:
SubwordTokenizer

Methods

__call__(text, max_length, max_num_rows[, ...])

Run CUDA BERT subword tokenizer on cuDF strings column.