cudf.core.subword_tokenizer.SubwordTokenizer.__call__
- SubwordTokenizer.__call__(text, max_length: int, max_num_rows: int, add_special_tokens: bool = True, padding: str = 'max_length', truncation: bool | str = False, stride: int = 0, return_tensors: str = 'cp', return_token_type_ids: bool = False)
Run the CUDA BERT subword tokenizer on a cuDF strings column. Encodes words to token ids using the vocabulary from a pretrained tokenizer.
- Parameters:
- text: cudf string series
The batch of sequences to be encoded.
- max_length: int
Controls the maximum length to use or pad to.
- max_num_rows: int
Maximum number of rows of output token ids that the tokenizer is expected to generate. Used for allocating temporary working memory on the GPU device. If the tokenizer generates more rows than this, behavior is undefined. The number of output rows varies with stride, truncation, and max_length. For example, for non-overlapping sequences the number of output rows is the same as the number of input rows. A good default can be twice the max_length.
- add_special_tokens: bool, optional, defaults to True
Whether or not to encode the sequences with the special tokens of the BERT classification model.
- padding: “max_length”
Pad to the maximum length specified with the argument max_length.
- truncation: bool or str, defaults to False
True: Truncate to the maximum length specified with the argument max_length.
False or ‘do_not_truncate’ (default): No truncation (output differs from HuggingFace).
- stride: int, optional, defaults to 0
The number of overlapping tokens between consecutive output rows. Information about the overlapping tokens is present in the returned metadata.
- return_tensors: str, {“cp”, “pt”, “tf”}, defaults to “cp”
“cp”: Return cupy cp.ndarray objects.
“pt”: Return PyTorch torch.Tensor objects.
“tf”: Return TensorFlow tf.constant objects.
- return_token_type_ids: bool, optional, defaults to False
Only False is currently supported.
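Because max_num_rows depends on stride, truncation, and max_length, it can help to estimate the row count before calling the tokenizer. The sketch below is a rough, CPU-only estimate and not part of cuDF: estimate_output_rows is a hypothetical helper, the per-string token counts are assumed to be known already, and the sliding-window formula for the untruncated case is an assumption about how long sequences are split (it ignores special tokens). Treat the result as a lower bound and add headroom.

```python
import math


def estimate_output_rows(token_counts, max_length, stride=0, truncation=False):
    """Hypothetical helper: rough number of output rows for a batch.

    Assumes each string longer than max_length is split into windows of
    max_length tokens that overlap by `stride` tokens. This windowing rule
    is an assumption for illustration, not taken from the cuDF source.
    """
    if not 0 <= stride < max_length:
        raise ValueError("stride must satisfy 0 <= stride < max_length")
    if truncation:
        # Each input is cut to max_length, so there is one row per string.
        return len(token_counts)
    step = max_length - stride  # new tokens contributed by each extra row
    rows = 0
    for n in token_counts:
        if n <= max_length:
            rows += 1
        else:
            rows += 1 + math.ceil((n - max_length) / step)
    return rows


# Two short strings with truncation: one row each.
print(estimate_output_rows([3, 2], max_length=8, truncation=True))
# One 20-token string and one 5-token string, 4 overlapping tokens per row.
print(estimate_output_rows([20, 5], max_length=8, stride=4))
```

Passing a value with some margin above this estimate avoids the undefined behavior described for max_num_rows, at the cost of extra temporary GPU memory.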
- Returns:
- An encoding with the following fields:
- input_ids: (type defined by return_tensors)
A tensor of token ids to be fed to the model.
- attention_mask: (type defined by return_tensors)
A tensor of indices specifying which tokens should be attended to by the model.
- metadata: (type defined by return_tensors)
Each row contains the index of the original string, followed by the first and last index of the token ids in that row that are non-padded and non-overlapping.
Examples
>>> import cudf
>>> from cudf.utils.hash_vocab_utils import hash_vocab
>>> hash_vocab('bert-base-cased-vocab.txt', 'voc_hash.txt')
>>> from cudf.core.subword_tokenizer import SubwordTokenizer
>>> cudf_tokenizer = SubwordTokenizer('voc_hash.txt',
...                                   do_lower_case=True)
>>> str_series = cudf.Series(['This is the', 'best book'])
>>> tokenizer_output = cudf_tokenizer(str_series,
...                                   max_length=8,
...                                   max_num_rows=len(str_series),
...                                   padding='max_length',
...                                   return_tensors='pt',
...                                   truncation=True)
>>> tokenizer_output['input_ids']
tensor([[ 101, 1142, 1110, 1103,  102,    0,    0,    0],
        [ 101, 1436, 1520,  102,    0,    0,    0,    0]],
       device='cuda:0', dtype=torch.int32)
>>> tokenizer_output['attention_mask']
tensor([[1, 1, 1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0, 0, 0]],
       device='cuda:0', dtype=torch.int32)
>>> tokenizer_output['metadata']
tensor([[0, 1, 3],
        [1, 1, 2]], device='cuda:0', dtype=torch.int32)
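To illustrate how the metadata field relates to input_ids, the sketch below slices the content tokens out of each row. It runs on the CPU with plain Python lists standing in for the tensors, using the values copied from the example output above. content_tokens is a hypothetical helper, and the last index is treated as inclusive because that is what the example values imply.

```python
# Values copied from the example output on this page
# (plain lists stand in for the returned tensors).
input_ids = [
    [101, 1142, 1110, 1103, 102, 0, 0, 0],
    [101, 1436, 1520, 102, 0, 0, 0, 0],
]
metadata = [[0, 1, 3], [1, 1, 2]]


def content_tokens(input_ids, metadata):
    """Hypothetical helper: group non-padded, non-overlapping token ids
    by the index of the string they came from."""
    grouped = {}
    for row, (string_index, first, last) in zip(input_ids, metadata):
        # The last index is treated as inclusive, matching the example values.
        grouped.setdefault(string_index, []).extend(row[first:last + 1])
    return grouped


tokens_by_string = content_tokens(input_ids, metadata)
print(tokens_by_string)  # {0: [1142, 1110, 1103], 1: [1436, 1520]}
```

When a long string spans several output rows, grouping by the string index in this way would stitch the rows back together without repeating the overlapping tokens.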