nvtext::tokenizer_result Struct Reference

Result object for the subword_tokenize functions.

#include <subword_tokenize.hpp>

Public Attributes

uint32_t nrows_tensor {}
 The number of rows for the output token-ids.
 
uint32_t sequence_length {}
 The number of token-ids in each row.
 
std::unique_ptr< cudf::column > tensor_token_ids
 A vector of token-ids for each row.
 
std::unique_ptr< cudf::column > tensor_attention_mask
 This mask identifies which tensor-token-ids are valid.
 
std::unique_ptr< cudf::column > tensor_metadata
 The metadata for each tensor row.
 

Detailed Description

Result object for the subword_tokenize functions.

Definition at line 77 of file subword_tokenize.hpp.

Member Data Documentation

◆ tensor_attention_mask

std::unique_ptr<cudf::column> nvtext::tokenizer_result::tensor_attention_mask

This mask identifies which tensor-token-ids are valid.

This column is of type UINT32 with no null entries.

Definition at line 98 of file subword_tokenize.hpp.
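
As a rough illustration of how the mask might be consumed, the sketch below counts the valid token-ids in each tensor row. It rests on several assumptions not stated above: the mask has one entry per token-id in the same (nrows_tensor x sequence_length) row-major layout, a nonzero entry means "valid", and the column data has already been copied into a host std::vector (the copy is not shown). The function name valid_counts is hypothetical and not part of nvtext.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of nvtext.
// Assumes `mask` is a host copy of tensor_attention_mask laid out row-major
// as nrows_tensor x sequence_length, and that nonzero means "valid".
inline std::vector<std::uint32_t> valid_counts(std::vector<std::uint32_t> const& mask,
                                               std::uint32_t nrows_tensor,
                                               std::uint32_t sequence_length)
{
  assert(mask.size() == static_cast<std::size_t>(nrows_tensor) * sequence_length);
  std::vector<std::uint32_t> counts(nrows_tensor, 0);
  for (std::uint32_t row = 0; row < nrows_tensor; ++row) {
    std::size_t const base = static_cast<std::size_t>(row) * sequence_length;
    for (std::uint32_t pos = 0; pos < sequence_length; ++pos) {
      if (mask[base + pos] != 0) { ++counts[row]; }
    }
  }
  return counts;
}
```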

◆ tensor_metadata

std::unique_ptr<cudf::column> nvtext::tokenizer_result::tensor_metadata

The metadata for each tensor row.

There are three elements per tensor row: [row-id, start_pos, stop_pos]. This column is of type UINT32 with no null entries.

Definition at line 105 of file subword_tokenize.hpp.
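
The three-elements-per-row layout can be unpacked as in the following sketch. It assumes the column data has already been copied into a host std::vector (the copy is not shown) and that the triples are stored consecutively in the order given above. The row_metadata struct and decode_metadata function are hypothetical names introduced for illustration, not part of nvtext.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical record for one tensor row; field order follows
// [row-id, start_pos, stop_pos] as documented for tensor_metadata.
struct row_metadata {
  std::uint32_t row_id;
  std::uint32_t start_pos;
  std::uint32_t stop_pos;
};

// Hypothetical helper, not part of nvtext.
// Splits a flat host copy of tensor_metadata into one record per tensor row.
inline std::vector<row_metadata> decode_metadata(std::vector<std::uint32_t> const& flat)
{
  assert(flat.size() % 3 == 0);  // three elements per tensor row
  std::vector<row_metadata> out;
  out.reserve(flat.size() / 3);
  for (std::size_t i = 0; i + 2 < flat.size(); i += 3) {
    out.push_back(row_metadata{flat[i], flat[i + 1], flat[i + 2]});
  }
  return out;
}
```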

◆ tensor_token_ids

std::unique_ptr<cudf::column> nvtext::tokenizer_result::tensor_token_ids

A vector of token-ids for each row.

The data is a flat matrix (nrows_tensor x sequence_length) of token-ids. This column is of type UINT32 with no null entries.

Definition at line 92 of file subword_tokenize.hpp.
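
Indexing into the flat matrix can be sketched as below. The example assumes a row-major layout (each row's sequence_length token-ids stored contiguously) and a host copy of the column data in a std::vector; the copy is not shown. The host_token_matrix struct and token_at function are hypothetical names for illustration, not part of nvtext.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side view of the result; not part of nvtext.
struct host_token_matrix {
  std::uint32_t nrows_tensor;             // number of rows of token-ids
  std::uint32_t sequence_length;          // number of token-ids in each row
  std::vector<std::uint32_t> token_ids;   // nrows_tensor * sequence_length values
};

// Returns the token-id at (row, pos), assuming row-major storage.
inline std::uint32_t token_at(host_token_matrix const& m,
                              std::uint32_t row,
                              std::uint32_t pos)
{
  assert(row < m.nrows_tensor && pos < m.sequence_length);
  return m.token_ids[static_cast<std::size_t>(row) * m.sequence_length + pos];
}
```

Casting the row index to std::size_t before multiplying avoids 32-bit overflow when nrows_tensor x sequence_length is large.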


The documentation for this struct was generated from the following file: