cudf.core.subword_tokenizer.SubwordTokenizer.__call__

SubwordTokenizer.__call__(text, max_length: int, max_num_rows: int, add_special_tokens: bool = True, padding: str = 'max_length', truncation: bool | str = False, stride: int = 0, return_tensors: str = 'cp', return_token_type_ids: bool = False)

Run the CUDA BERT subword tokenizer on a cuDF strings column. Encodes words to token ids using the vocabulary from a pretrained tokenizer.

Parameters:
text : cudf string series

The batch of sequences to be encoded.

max_length : int

Controls the maximum length to use or pad to.

max_num_rows : int

Maximum number of rows the tokenizer is expected to generate for the output token-ids. Used to allocate temporary working memory on the GPU device; if the output requires more rows than this, behavior is undefined. The required count varies with stride, truncation, and max_length. For example, with non-overlapping sequences the number of output rows equals the number of input rows. A good default can be twice the max_length (see the sizing sketch after this parameter list).

add_special_tokens : bool, optional, defaults to True

Whether or not to encode the sequences with the special tokens of the BERT classification model.

padding : “max_length”

Pad to a maximum length specified with the argument max_length.

truncation : bool, defaults to False

True: truncate to a maximum length specified with the argument max_length.
False or ‘do_not_truncate’ (default): no truncation. (Output differs from HuggingFace.)

stride : int, optional, defaults to 0

The value of this argument defines the number of overlapping tokens. Information about the overlapping tokens is included in the output metadata.

return_tensors : str, {“cp”, “pt”, “tf”}, defaults to “cp”

“cp”: return cupy cp.ndarray objects.
“tf”: return TensorFlow tf.constant objects.
“pt”: return PyTorch torch.Tensor objects.

return_token_type_ids : bool, optional

Only False is currently supported.
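
To make the interaction between max_num_rows, stride, and truncation concrete, here is a minimal sizing sketch (an illustrative heuristic, not part of the API; the doubling factor used for overlapping output is an assumption rather than a documented bound):

>>> import cudf
>>> series = cudf.Series(['This is the', 'best book'])
>>> # stride == 0 with truncation: one output row per input string
>>> max_num_rows = len(series)
>>> # stride > 0 can split long strings into overlapping rows, so reserve
>>> # extra head-room; exceeding max_num_rows is undefined behavior
>>> max_num_rows_with_overlap = 2 * len(series)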

Returns:
An encoding with the following fields (a short sketch of inspecting the returned types follows this list):
input_ids : (type defined by return_tensors)

A tensor of token ids to be fed to the model.

attention_mask : (type defined by return_tensors)

A tensor of indices specifying which tokens should be attended to by the model.

metadata : (type defined by return_tensors)

Each row contains the index of the original string and the first and last index of the token-ids in that row that are non-padded and non-overlapping (see the slicing sketch at the end of the Examples).
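
A minimal sketch of inspecting the returned types, assuming a vocabulary hash file 'voc_hash.txt' has already been produced as in the Examples below; with the default return_tensors='cp', each field is a cupy ndarray:

>>> import cudf
>>> from cudf.core.subword_tokenizer import SubwordTokenizer
>>> tok = SubwordTokenizer('voc_hash.txt', do_lower_case=True)
>>> out = tok(cudf.Series(['This is the', 'best book']),
...           max_length=8, max_num_rows=2)  # return_tensors defaults to 'cp'
>>> type(out['input_ids'])
<class 'cupy.ndarray'>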

Examples

>>> import cudf
>>> from cudf.utils.hash_vocab_utils import hash_vocab
>>> hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
>>> from cudf.core.subword_tokenizer import SubwordTokenizer
>>> cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
...                                    do_lower_case=True)
>>> str_series = cudf.Series(['This is the', 'best book'])
>>> tokenizer_output = cudf_tokenizer(str_series,
...                                   max_length=8,
...                                   max_num_rows=len(str_series),
...                                   padding='max_length',
...                                   return_tensors='pt',
...                                   truncation=True)
>>> tokenizer_output['input_ids']
tensor([[ 101, 1142, 1110, 1103,  102,    0,    0,    0],
        [ 101, 1436, 1520,  102,    0,    0,    0,    0]],
        device='cuda:0',
       dtype=torch.int32)
>>> tokenizer_output['attention_mask']
tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0]],
        device='cuda:0', dtype=torch.int32)
>>> tokenizer_output['metadata']
tensor([[0, 1, 3],
        [1, 1, 2]], device='cuda:0', dtype=torch.int32)
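
The metadata rows can be used to slice the padded output back down to per-string tokens. A follow-on sketch continuing the example above (the exact tensor formatting printed by your environment may differ):

>>> ids = tokenizer_output['input_ids']
>>> meta = tokenizer_output['metadata']
>>> for out_row in range(meta.shape[0]):
...     orig, start, stop = meta[out_row].tolist()
...     print(orig, ids[out_row, start:stop + 1])
0 tensor([1142, 1110, 1103], device='cuda:0', dtype=torch.int32)
1 tensor([1436, 1520], device='cuda:0', dtype=torch.int32)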