The vocabulary data for use with the subword_tokenize function.
#include <subword_tokenize.hpp>
Public Attributes | |
uint16_t | first_token_id {} |
The first token id in the vocabulary. | |
uint16_t | separator_token_id {} |
The separator token id in the vocabulary. | |
uint16_t | unknown_token_id {} |
The unknown token id in the vocabulary. | |
uint32_t | outer_hash_a {} |
The a parameter for the outer hash. | |
uint32_t | outer_hash_b {} |
The b parameter for the outer hash. | |
uint16_t | num_bins {} |
Number of hash bins. | |
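std::unique_ptr< cudf::column > | table |
uint64 column, the flattened hash table with key-value pairs packed in 64 bits | |
std::unique_ptr< cudf::column > | bin_coefficients |
uint64 column containing the hashing parameters for each hash bin on the GPU | |
std::unique_ptr< cudf::column > | bin_offsets |
uint16 column containing the start index of each bin in the flattened hash table | |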
std::unique_ptr< cudf::column > | cp_metadata |
uint32 column, the code point metadata table to use for normalization | |
std::unique_ptr< cudf::column > | aux_cp_table |
uint64 column, the auxiliary code point table to use for normalization | |
The vocabulary data for use with the subword_tokenize function.
Definition at line 35 of file subword_tokenize.hpp.
std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::bin_coefficients |
uint64 column containing the hashing parameters for each hash bin on the GPU
Definition at line 44 of file subword_tokenize.hpp.
std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::bin_offsets |
uint16 column containing the start index of each bin in the flattened hash table
Definition at line 46 of file subword_tokenize.hpp.
std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::table |
uint64 column, the flattened hash table with key-value pairs packed in 64 bits
Definition at line 42 of file subword_tokenize.hpp.
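The members above suggest a two-level hash table: an outer hash selects one of num_bins bins, bin_offsets gives that bin's start in the flattened table, and bin_coefficients supplies the per-bin hash parameters. The following is a minimal host-side sketch of such a lookup, not the library's implementation. It uses std::vector in place of the device columns, and everything not listed in the member table is an assumption made for illustration: the names host_vocabulary, linear_hash, pack_entry, pack_coefficients and find_token_id, the modulus kPrime, the kEmptySlot sentinel, the bit layout of the packed 64-bit values, and the choice to return unknown_token_id on a miss.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for the device columns of hashed_vocabulary.
struct host_vocabulary {
  std::uint16_t unknown_token_id{};
  std::uint32_t outer_hash_a{};
  std::uint32_t outer_hash_b{};
  std::uint16_t num_bins{};
  std::vector<std::uint64_t> table;             // flattened key-value slots
  std::vector<std::uint64_t> bin_coefficients;  // per-bin hash parameters
  std::vector<std::uint16_t> bin_offsets;       // start index of each bin in table
};

// Assumed modulus (2^31 - 1) and empty-slot marker; neither is taken from the library.
constexpr std::uint64_t kPrime     = 2147483647ULL;
constexpr std::uint64_t kEmptySlot = ~0ULL;

// Assumed layout: key in the upper bits, 16-bit token id in the low 16 bits.
inline std::uint64_t pack_entry(std::uint32_t key, std::uint16_t token_id) {
  return (static_cast<std::uint64_t>(key) << 16) | token_id;
}

// Assumed layout: a in bits 32..63, b in bits 16..31, bin size in bits 0..15.
inline std::uint64_t pack_coefficients(std::uint32_t a, std::uint16_t b, std::uint16_t size) {
  return (static_cast<std::uint64_t>(a) << 32) | (static_cast<std::uint64_t>(b) << 16) | size;
}

// Linear hash ((a * key + b) mod p) mod m; m must be non-zero.
inline std::uint32_t linear_hash(std::uint64_t key, std::uint64_t a, std::uint64_t b,
                                 std::uint32_t m) {
  return static_cast<std::uint32_t>(((a * key + b) % kPrime) % m);
}

// Returns the token id stored for key, or unknown_token_id when the key is absent.
inline std::uint16_t find_token_id(host_vocabulary const& v, std::uint32_t key) {
  assert(v.bin_coefficients.size() == v.num_bins && v.bin_offsets.size() == v.num_bins);
  if (v.num_bins == 0) { return v.unknown_token_id; }

  // Level 1: the outer hash picks the bin.
  std::uint32_t const bin = linear_hash(key, v.outer_hash_a, v.outer_hash_b, v.num_bins);

  // Level 2: unpack that bin's parameters and hash within the bin.
  std::uint64_t const coeff = v.bin_coefficients[bin];
  std::uint64_t const a     = coeff >> 32;
  std::uint64_t const b     = (coeff >> 16) & 0xFFFF;
  std::uint32_t const size  = static_cast<std::uint32_t>(coeff & 0xFFFF);
  if (size == 0) { return v.unknown_token_id; }

  std::uint64_t const entry = v.table[v.bin_offsets[bin] + linear_hash(key, a, b, size)];

  // A slot matches only if its stored key equals the query key.
  // kEmptySlot never matches because its key part exceeds any 32-bit key.
  if ((entry >> 16) != key) { return v.unknown_token_id; }
  return static_cast<std::uint16_t>(entry & 0xFFFF);
}
```

In this sketch a lookup costs two hash evaluations and one table read regardless of vocabulary size, provided the per-bin parameters were chosen so that no two keys in the same bin collide. Building such collision-free parameters is not shown here.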