The vocabulary data for use with the subword_tokenize function.

Public Attributes
uint16_t	first_token_id {}
	The first token id in the vocabulary.

uint16_t	separator_token_id {}
	The separator token id in the vocabulary.

uint16_t	unknown_token_id {}
	The unknown token id in the vocabulary.

uint32_t	outer_hash_a {}
	The a parameter for the outer hash.

uint32_t	outer_hash_b {}
	The b parameter for the outer hash.

uint16_t	num_bins {}
	Number of hash bins in the flattened hash table.

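std::unique_ptr< cudf::column >	table
	uint64 column, the flattened hash table with key, value pairs packed into 64 bits

std::unique_ptr< cudf::column >	bin_coefficients
	uint64 column, containing the hashing parameters for each hash bin on the GPU

std::unique_ptr< cudf::column >	bin_offsets
	uint16 column, containing the start index of each bin in the flattened hash table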

std::unique_ptr< cudf::column >	cp_metadata
	uint32 column, the code point metadata table to use for normalization

std::unique_ptr< cudf::column >	aux_cp_table
	uint64 column, the auxiliary code point table to use for normalization

Detailed Description

The vocabulary data for use with the subword_tokenize function.

Definition at line 35 of file subword_tokenize.hpp.

Member Data Documentation

std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::bin_coefficients

uint64 column, containing the hashing parameters for each hash bin on the GPU

Definition at line 44 of file subword_tokenize.hpp.

std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::bin_offsets

uint16 column, containing the start index of each bin in the flattened hash table

Definition at line 46 of file subword_tokenize.hpp.

std::unique_ptr<cudf::column> nvtext::hashed_vocabulary::table

uint64 column, the flattened hash table with key, value pairs packed into 64 bits

Definition at line 42 of file subword_tokenize.hpp.
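
The members outer_hash_a, outer_hash_b, num_bins, bin_coefficients, bin_offsets and table together describe a two-level hash table: an outer hash selects a bin, per-bin parameters select a slot inside that bin, and the slot holds a packed key, value pair. The host-side sketch below illustrates that lookup pattern only. Everything not named on this page is an assumption made for the example: the struct host_vocabulary, the helpers hash_to, pack_coeff, pack_entry and lookup, the hash form ((a * key + b) % prime) % size, the prime kPrime, and both bit layouts (key in the upper 48 bits with a 16-bit token id below it, and a | b | bin size packed into one coefficient). The actual library may use different constants and layouts.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side stand-in for the device columns; member names mirror the struct.
struct host_vocabulary {
  uint32_t outer_hash_a{};
  uint32_t outer_hash_b{};
  uint16_t num_bins{};
  std::vector<uint64_t> table;             // packed key, value entries
  std::vector<uint64_t> bin_coefficients;  // packed per-bin hash parameters
  std::vector<uint16_t> bin_offsets;       // start index of each bin in `table`
};

// ASSUMPTION: modulus chosen for this example only (2^31 - 1).
constexpr uint64_t kPrime = 2147483647ULL;
// ASSUMPTION: marker for an unused slot; its key part never matches a 48-bit key
// smaller than 2^48 - 1.
constexpr uint64_t kEmpty = ~0ULL;

// ASSUMPTION: linear hash of the form ((a * key + b) % prime) % size.
inline uint32_t hash_to(uint64_t key, uint64_t a, uint64_t b, uint32_t size)
{
  return static_cast<uint32_t>(((a * key + b) % kPrime) % size);
}

// ASSUMPTION: coefficient layout is [ a | b (8 bits) | bin size (8 bits) ].
inline uint64_t pack_coeff(uint64_t a, uint64_t b, uint64_t bin_size)
{
  return (a << 16) | ((b & 0xFF) << 8) | (bin_size & 0xFF);
}

// ASSUMPTION: entry layout is [ key (48 bits) | token id (16 bits) ].
inline uint64_t pack_entry(uint64_t key, uint16_t token_id)
{
  return (key << 16) | token_id;
}

// Two-level lookup: outer hash picks the bin, inner hash picks the slot.
// Returns unknown_token_id when the slot does not hold the requested key.
inline uint16_t lookup(host_vocabulary const& v, uint64_t key, uint16_t unknown_token_id)
{
  assert(v.num_bins > 0);
  assert(v.bin_offsets.size() == v.num_bins);
  assert(v.bin_coefficients.size() == v.num_bins);

  uint32_t const bin      = hash_to(key, v.outer_hash_a, v.outer_hash_b, v.num_bins);
  uint64_t const coeff    = v.bin_coefficients[bin];
  uint32_t const bin_size = static_cast<uint32_t>(coeff & 0xFF);
  if (bin_size == 0) { return unknown_token_id; }  // empty bin

  uint64_t const inner_b = (coeff >> 8) & 0xFF;
  uint64_t const inner_a = coeff >> 16;
  uint32_t const slot    = hash_to(key, inner_a, inner_b, bin_size);
  uint64_t const entry   = v.table[v.bin_offsets[bin] + slot];

  // Compare the stored key before trusting the stored value.
  return (entry >> 16) == key ? static_cast<uint16_t>(entry & 0xFFFF) : unknown_token_id;
}
```

In this sketch a caller would fill host_vocabulary by hand (or from a file in the same assumed layout) and pass unknown_token_id as the fallback, so that a key missing from the table maps to the unknown token rather than to a neighbouring entry.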

The documentation for this struct was generated from the following file: