Nvtext Tokenize#
- group nvtext_tokenize
Functions
-
std::unique_ptr<bpe_merge_pairs> load_merge_pairs(cudf::strings_column_view const &merge_pairs, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Create a nvtext::bpe_merge_pairs from a strings column.
The input column should contain a unique pair of strings per line separated by a single space. An incorrect format or non-unique entries will result in undefined behavior.
Example:
merge_pairs = ["e n", "i t", "i s", "e s", "en t", "c e", "es t", "en ce", "t est", "s ent"]
mps = load_merge_pairs(merge_pairs)
// the mps object can be passed to the byte_pair_encoding API
The pairs are expected to be ordered in the file by their rank relative to each other. A pair earlier in the file has priority over any pairs below it.
- Throws:
cudf::logic_error – if merge_pairs is empty or contains nulls
- Parameters:
merge_pairs – Column containing the unique merge pairs
stream – CUDA stream used for device memory operations and kernel launches
mr – Memory resource to allocate any returned objects
- Returns:
A nvtext::bpe_merge_pairs object
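The rank ordering described above can be pictured with a small host-side model. This is only a sketch in standard C++, not the libcudf implementation: `merge_rank_table` and `parse_merge_pairs` are hypothetical names, and unlike the documented API (where a bad format is undefined behavior) this sketch rejects malformed entries.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model: maps (left, right) to its rank, where the
// rank is the entry's position in the input (lower rank = higher priority).
using merge_rank_table = std::map<std::pair<std::string, std::string>, std::size_t>;

merge_rank_table parse_merge_pairs(std::vector<std::string> const& lines)
{
  if (lines.empty()) { throw std::invalid_argument("merge pairs must not be empty"); }
  merge_rank_table table;
  for (std::size_t rank = 0; rank < lines.size(); ++rank) {
    auto const& line = lines[rank];
    auto const pos   = line.find(' ');
    // Each entry must look like "left right" with a single separating space.
    if (pos == std::string::npos || pos == 0 || pos + 1 == line.size()) {
      throw std::invalid_argument("expected \"left right\": " + line);
    }
    table.emplace(std::make_pair(line.substr(0, pos), line.substr(pos + 1)), rank);
  }
  return table;
}
```

An entry that appears earlier in the input receives a smaller rank, which is what gives it priority over later entries.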
-
std::unique_ptr<cudf::column> byte_pair_encoding(cudf::strings_column_view const &input, bpe_merge_pairs const &merges_pairs, cudf::string_scalar const &separator = cudf::string_scalar(" "), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Byte pair encode the input strings.
The encoding algorithm rebuilds each string by matching substrings in the merge_pairs table and iteratively removing the minimum ranked pair until no pairs are left. Then, the separator is inserted between the remaining pairs before the result is joined to make the output string.
Example:
merge_pairs = ["e n", "i t", "i s", "e s", "en t", "c e", "es t", "en ce", "t est", "s ent"]
mps = load_merge_pairs(merge_pairs)
input = ["test sentence", "thisis test"]
result = byte_pair_encoding(input, mps)
result is now ["test sent ence", "this is test"]
- Throws:
cudf::logic_error – if merge_pairs is empty
cudf::logic_error – if separator is invalid
- Parameters:
input – Strings to encode.
merges_pairs – Created by a call to nvtext::load_merge_pairs.
separator – String used to build the output after encoding. Default is a space.
mr – Memory resource to allocate any returned objects.
- Returns:
An encoded column of strings.
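The merge loop at the heart of byte pair encoding can be sketched on the host for a single word. This is an illustration in standard C++, not the libcudf kernel: `bpe_encode_word` is a hypothetical name, the input is assumed to be ASCII, and every character of the word is assumed to appear somewhere in the merge table. How the library treats whitespace or characters that never occur in the table is not modeled here.

```cpp
#include <cstddef>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model of the merge loop for one word.
// `ranks` maps (left, right) to its priority; lower values merge first.
std::string bpe_encode_word(
  std::string const& word,
  std::map<std::pair<std::string, std::string>, std::size_t> const& ranks,
  std::string const& separator = " ")
{
  // Start with one symbol per byte (ASCII assumed).
  std::vector<std::string> symbols;
  for (char c : word) { symbols.push_back(std::string(1, c)); }

  while (symbols.size() > 1) {
    // Find the adjacent pair with the minimum rank (leftmost on ties).
    auto best_rank      = std::numeric_limits<std::size_t>::max();
    std::size_t best_at = 0;
    bool found          = false;
    for (std::size_t i = 0; i + 1 < symbols.size(); ++i) {
      auto const it = ranks.find({symbols[i], symbols[i + 1]});
      if (it != ranks.end() && it->second < best_rank) {
        best_rank = it->second;
        best_at   = i;
        found     = true;
      }
    }
    if (!found) { break; }  // no mergeable pair is left
    // Merge the pair into a single symbol.
    symbols[best_at] += symbols[best_at + 1];
    symbols.erase(symbols.begin() + static_cast<std::ptrdiff_t>(best_at + 1));
  }

  // Join the remaining symbols with the separator.
  std::string out;
  for (std::size_t i = 0; i < symbols.size(); ++i) {
    if (i > 0) { out += separator; }
    out += symbols[i];
  }
  return out;
}
```

With the ten merge pairs from the example, this loop turns "sentence" into "sent ence" and leaves "test" as a single piece.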
-
std::unique_ptr<hashed_vocabulary> load_vocabulary_file(std::string const &filename_hashed_vocabulary, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Load the hashed vocabulary file into device memory.
The returned object can be passed to subword_tokenize repeatedly without incurring the cost of loading the same file each time.
- Throws:
cudf::logic_error – if the filename_hashed_vocabulary could not be opened.
- Parameters:
filename_hashed_vocabulary – A path to the preprocessed vocab.txt file. Note that this is the file AFTER python/perfect_hash.py has been used for preprocessing.
stream – CUDA stream used for device memory operations and kernel launches
mr – Memory resource to allocate any returned objects.
- Returns:
vocabulary hash-table elements
-
tokenizer_result subword_tokenize(cudf::strings_column_view const &strings, hashed_vocabulary const &vocabulary_table, uint32_t max_sequence_length, uint32_t stride, bool do_lower_case, bool do_truncate, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Creates a tokenizer that cleans the text, splits it into tokens and returns token-ids from an input vocabulary.
The strings are first normalized by converting to lower-case, removing punctuation, replacing a select set of multi-byte characters and whitespace characters.
The strings are then tokenized by using whitespace as a delimiter. Consecutive delimiters are ignored. Each token is then assigned a 4-byte token-id mapped from the provided vocabulary table.
Essentially each string is converted into one or more vectors of token-ids in the output column. The total number of these vectors times max_sequence_length is the size of the tensor_token_ids output column. For do_truncate==true:
size of tensor_token_ids = max_sequence_length * strings.size()
size of tensor_attention_mask = max_sequence_length * strings.size()
size of tensor_metadata = 3 * strings.size()
For do_truncate==false the number of rows per output string depends on the number of tokens resolved and the stride value which may repeat tokens in subsequent overflow rows.
This function requires about 21x the number of character bytes in the input strings column as working memory.
- Throws:
cudf::logic_error – if stride > max_sequence_length
std::overflow_error – if max_sequence_length * max_rows_tensor exceeds the column size limit
- Parameters:
strings – The input strings to tokenize.
vocabulary_table – The vocabulary table pre-loaded into this object.
max_sequence_length – Limit of the number of token-ids per row in final tensor for each string.
stride – Each row in the output token-ids will replicate max_sequence_length - stride token-ids from the previous row, unless it is the first string.
do_lower_case – If true, the tokenizer will convert uppercase characters in the input stream to lower-case and strip accents from those characters. If false, accented and uppercase characters are not transformed.
do_truncate – If true, the tokenizer will discard all the token-ids after max_sequence_length for each input string. If false, it will use a new row in the output token-ids to continue generating the output.
stream – CUDA stream used for device memory operations and kernel launches
mr – Memory resource to allocate any returned objects.
- Returns:
token-ids, attention-mask, and metadata
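The output sizes listed for do_truncate==true are simple products, and a short host-side helper makes the arithmetic concrete. The names `tensor_sizes` and `truncated_tensor_sizes` are hypothetical and not part of the library; the sketch only restates the three formulas given in the description.

```cpp
#include <cstdint>

// Hypothetical helper: element counts of the three output columns
// when do_truncate == true (one output row per input string).
struct tensor_sizes {
  std::uint64_t token_ids;
  std::uint64_t attention_mask;
  std::uint64_t metadata;
};

tensor_sizes truncated_tensor_sizes(std::uint32_t max_sequence_length,
                                    std::uint64_t num_strings)
{
  return {max_sequence_length * num_strings,   // tensor_token_ids
          max_sequence_length * num_strings,   // tensor_attention_mask
          3 * num_strings};                    // tensor_metadata
}
```

For example, 4 input strings with max_sequence_length of 64 give 256 token-ids, 256 attention-mask entries, and 12 metadata entries.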
-
std::unique_ptr<cudf::column> tokenize(cudf::strings_column_view const &input, cudf::string_scalar const &delimiter = cudf::string_scalar{""}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters.
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored. This means only non-empty tokens are returned.
Tokens are found by locating delimiter(s) starting at the beginning of each string. As each string is tokenized, the tokens are appended using input column row order to build the output column. That is, tokens found in input row[i] will be placed in the output column directly before tokens found in input row[i+1].
Example:
s = ["a", "b c", "d e f "]
t = tokenize(s)
t is now ["a", "b", "c", "d", "e", "f"]
All null row entries are ignored and the output contains all valid rows.
- Parameters:
input – Strings column to tokenize
delimiter – UTF-8 characters used to separate each string into tokens. The default of empty string will separate tokens using whitespace.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column of tokens
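The empty-delimiter behavior can be modeled on the host as follows. This is a sketch in standard C++ rather than the GPU implementation; `tokenize_whitespace` is a hypothetical name, and null rows are represented with `std::optional`, which is an assumption of this sketch.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side model of tokenize() with an empty delimiter:
// any byte <= ' ' separates tokens, and empty tokens are never produced.
std::vector<std::string> tokenize_whitespace(
  std::vector<std::optional<std::string>> const& input)
{
  std::vector<std::string> tokens;
  for (auto const& row : input) {
    if (!row) { continue; }  // null rows contribute no tokens
    std::string current;
    for (unsigned char c : *row) {
      if (c <= ' ') {
        // A delimiter ends the current token, if there is one.
        if (!current.empty()) {
          tokens.push_back(current);
          current.clear();
        }
      } else {
        current.push_back(static_cast<char>(c));
      }
    }
    if (!current.empty()) { tokens.push_back(current); }
  }
  return tokens;
}
```

Tokens from all rows land in one flat output in row order, which is why the result has no per-row structure.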
-
std::unique_ptr<cudf::column> tokenize(cudf::strings_column_view const &input, cudf::strings_column_view const &delimiters, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters.
Tokens are found by locating delimiter(s) starting at the beginning of each string. Any consecutive delimiters found in a string are ignored. This means only non-empty tokens are returned.
As each string is tokenized, the tokens are appended using input column row order to build the output column. That is, tokens found in input row[i] will be placed in the output column directly before tokens found in input row[i+1].
Example:
s = ["a", "b c", "d.e:f;"]
d = [".", ":", ";"]
t = tokenize(s,d)
t is now ["a", "b c", "d", "e", "f"]
All null row entries are ignored and the output contains all valid rows.
- Throws:
cudf::logic_error – if the delimiters column is empty or contains nulls.
- Parameters:
input – Strings column to tokenize
delimiters – Strings used to separate individual strings into tokens
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column of tokens
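A host-side model of splitting on several string delimiters is sketched below. `tokenize_multi` is a hypothetical name, and the sketch assumes that when more than one delimiter matches at the same position the first one in the list wins; the documentation does not state how the library resolves that case.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical host-side model of tokenize() with multiple delimiters.
std::vector<std::string> tokenize_multi(std::vector<std::string> const& input,
                                        std::vector<std::string> const& delimiters)
{
  std::vector<std::string> tokens;
  for (auto const& row : input) {
    std::string current;
    std::size_t pos = 0;
    while (pos < row.size()) {
      // Check whether any delimiter starts at this position
      // (assumption: the first match in list order is used).
      std::size_t matched = 0;
      for (auto const& d : delimiters) {
        if (!d.empty() && row.compare(pos, d.size(), d) == 0) {
          matched = d.size();
          break;
        }
      }
      if (matched > 0) {
        if (!current.empty()) {
          tokens.push_back(current);
          current.clear();
        }
        pos += matched;
      } else {
        current.push_back(row[pos]);
        ++pos;
      }
    }
    if (!current.empty()) { tokens.push_back(current); }
  }
  return tokens;
}
```

Because the space is not among the delimiters in the example, "b c" survives as a single token.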
-
std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const &input, cudf::string_scalar const &delimiter = cudf::string_scalar{""}, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Returns the number of tokens in each string of a strings column.
The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored. This means that only empty strings or null rows will result in a token count of 0.
Example:
s = ["a", "b c", " ", "d e f"]
t = count_tokens(s)
t is now [1, 2, 0, 3]
All null row entries are ignored and the output contains all valid rows. The number of tokens for a null element is set to 0 in the output column.
- Parameters:
input – Strings column to count tokens
delimiter – Strings used to separate each string into tokens. The default of empty string will separate tokens using whitespace.
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of token counts
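Counting differs from tokenizing in that the output keeps one entry per input row, including null rows. A host-side sketch of the whitespace case follows; `count_whitespace_tokens` is a hypothetical name and `std::optional` stands in for nullable rows.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side model of count_tokens() with an empty delimiter.
std::vector<int> count_whitespace_tokens(
  std::vector<std::optional<std::string>> const& input)
{
  std::vector<int> counts;
  counts.reserve(input.size());
  for (auto const& row : input) {
    int count     = 0;      // stays 0 for null rows
    bool in_token = false;
    if (row) {
      for (unsigned char c : *row) {
        bool const is_delimiter = (c <= ' ');
        // A token starts at each transition from delimiter to non-delimiter.
        if (!is_delimiter && !in_token) { ++count; }
        in_token = !is_delimiter;
      }
    }
    counts.push_back(count);
  }
  return counts;
}
```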
-
std::unique_ptr<cudf::column> count_tokens(cudf::strings_column_view const &input, cudf::strings_column_view const &delimiters, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Returns the number of tokens in each string of a strings column by using multiple string delimiters to identify tokens in each string.
Any consecutive delimiters found in a string are ignored. This means that only empty strings or null rows will result in a token count of 0.
Example:
s = ["a", "b c", "d.e:f;"]
d = [".", ":", ";"]
t = count_tokens(s,d)
t is now [1, 1, 3]
All null row entries are ignored and the output contains all valid rows. The number of tokens for a null element is set to 0 in the output column.
- Throws:
cudf::logic_error – if the delimiters column is empty or contains nulls
- Parameters:
input – Strings column to count tokens
delimiters – Strings used to separate each string into tokens
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of token counts
-
std::unique_ptr<cudf::column> character_tokenize(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Returns a single column of strings by converting each character to a string.
Each string is converted to multiple strings, one for each character. Note that a character may be more than one byte.
Example:
s = ["hello world", null, "goodbye"]
t = character_tokenize(s)
t is now ["h","e","l","l","o"," ","w","o","r","l","d","g","o","o","d","b","y","e"]
- Throws:
std::invalid_argument – if input contains nulls
std::overflow_error – if the output would produce more than max size_type rows
- Parameters:
input – Strings column to tokenize
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column of tokens
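Since a character may span several bytes, splitting by character means decoding UTF-8 lead bytes rather than splitting by byte. The following host-side sketch shows one way to do that; `utf8_char_length` and `character_tokenize_host` are hypothetical names, and treating an invalid lead byte as a one-byte character is a choice made only for this sketch.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 character that begins with `lead`.
std::size_t utf8_char_length(unsigned char lead)
{
  if (lead < 0x80) { return 1; }            // 0xxxxxxx
  if ((lead & 0xE0) == 0xC0) { return 2; }  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) { return 3; }  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) { return 4; }  // 11110xxx
  return 1;  // invalid lead byte: kept as a single byte in this sketch
}

// Hypothetical host-side model of character_tokenize().
std::vector<std::string> character_tokenize_host(std::vector<std::string> const& input)
{
  std::vector<std::string> tokens;
  for (auto const& row : input) {
    for (std::size_t pos = 0; pos < row.size();) {
      auto const len = utf8_char_length(static_cast<unsigned char>(row[pos]));
      tokens.push_back(row.substr(pos, len));  // substr clamps at the string end
      pos += len;
    }
  }
  return tokens;
}
```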
-
std::unique_ptr<cudf::column> detokenize(cudf::strings_column_view const &input, cudf::column_view const &row_indices, cudf::string_scalar const &separator = cudf::string_scalar(" "), rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Creates a strings column from a strings column of tokens and an associated column of row ids.
Multiple tokens from the input column may be combined into a single row (string) in the output column. The tokens are concatenated along with the separator string in the order in which they appear in the row_indices column.
Example:
s = ["hello", "world", "one", "two", "three"]
r = [0, 0, 1, 1, 1]
s1 = detokenize(s,r)
s1 is now ["hello world", "one two three"]
r = [0, 2, 1, 1, 0]
s2 = detokenize(s,r)
s2 is now ["hello three", "one two", "world"]
All null row entries are ignored and the output contains all valid rows. The values in row_indices are expected to be non-negative, sequential values without any missing row indices; otherwise the output is undefined.
- Throws:
cudf::logic_error – if separator is invalid
cudf::logic_error – if row_indices.size() != input.size()
cudf::logic_error – if row_indices contains nulls
- Parameters:
input – Strings column to detokenize
row_indices – The relative output row index assigned for each token in the input column
separator – String to append after concatenating each token to the proper output row
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New strings column of detokenized strings
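The grouping rule can be reproduced with a short host-side model. `detokenize_host` is a hypothetical name; the sketch assumes the row indices cover 0 through the largest index with no gaps, as the description requires.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side model of detokenize(): tokens that share a row
// index are joined with the separator, in the order they appear.
std::vector<std::string> detokenize_host(std::vector<std::string> const& tokens,
                                         std::vector<std::size_t> const& row_indices,
                                         std::string const& separator = " ")
{
  if (tokens.size() != row_indices.size()) {
    throw std::invalid_argument("tokens and row_indices must have the same size");
  }
  // The output has one row per distinct index (assumed contiguous from 0).
  std::size_t num_rows = 0;
  for (auto r : row_indices) { num_rows = std::max(num_rows, r + 1); }

  std::vector<std::string> output(num_rows);
  std::vector<bool> started(num_rows, false);
  for (std::size_t i = 0; i < tokens.size(); ++i) {
    auto const r = row_indices[i];
    if (started[r]) { output[r] += separator; }  // separator only between tokens
    output[r] += tokens[i];
    started[r] = true;
  }
  return output;
}
```

The second example shows that the row index, not the token's position, decides the output row: "three" follows "hello" in row 0 because both carry index 0.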
-
std::unique_ptr<tokenize_vocabulary> load_vocabulary(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Create a tokenize_vocabulary object from a strings column.
Token ids are the row indices within the vocabulary column. Each vocabulary entry is expected to be unique, otherwise the behavior is undefined.
- Throws:
cudf::logic_error – if vocabulary contains nulls or is empty
- Parameters:
input – Strings for the vocabulary
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Object to be used with nvtext::tokenize_with_vocabulary
-
std::unique_ptr<cudf::column> tokenize_with_vocabulary(cudf::strings_column_view const &input, tokenize_vocabulary const &vocabulary, cudf::string_scalar const &delimiter, cudf::size_type default_id = -1, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Returns the token ids for the input string by looking up each delimited token in the given vocabulary.
Example:
s = ["hello world", "hello there", "there there world", "watch out world"]
v = load_vocabulary(["hello", "there", "world"])
r = tokenize_with_vocabulary(s,v)
r is now [[0,2], [0,1], [1,1,2], [-1,-1,2]]
Any null row entry results in a corresponding null entry in the output.
- Throws:
cudf::logic_error – if delimiter is invalid
- Parameters:
input – Strings column to tokenize
vocabulary – Used to lookup tokens within input
delimiter – Used to identify tokens within input
default_id – The token id to be used for tokens not found in the vocabulary; Default is -1
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
Lists column of token ids
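The lookup can be modeled on the host with a hash map from token to row index. `tokenize_with_vocabulary_host` is a hypothetical name, and two simplifications are assumptions of this sketch rather than documented behavior: the delimiter is a single character, and empty tokens between consecutive delimiters are skipped.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical host-side model: token ids are the row indices of the
// vocabulary entries; unknown tokens receive default_id.
std::vector<std::vector<int>> tokenize_with_vocabulary_host(
  std::vector<std::string> const& input,
  std::vector<std::string> const& vocabulary,
  char delimiter = ' ',
  int default_id = -1)
{
  std::unordered_map<std::string, int> ids;
  for (std::size_t i = 0; i < vocabulary.size(); ++i) {
    ids.emplace(vocabulary[i], static_cast<int>(i));
  }
  auto lookup = [&](std::string const& token) {
    auto const it = ids.find(token);
    return it == ids.end() ? default_id : it->second;
  };

  std::vector<std::vector<int>> output;
  for (auto const& row : input) {
    std::vector<int> row_ids;
    std::string current;
    for (char c : row) {
      if (c == delimiter) {
        if (!current.empty()) {
          row_ids.push_back(lookup(current));
          current.clear();
        }
      } else {
        current.push_back(c);
      }
    }
    if (!current.empty()) { row_ids.push_back(lookup(current)); }
    output.push_back(row_ids);  // one list of ids per input row
  }
  return output;
}
```

Unlike tokenize, the result keeps one list per input row, which matches the lists column returned by the API.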
-
struct bpe_merge_pairs#
- #include <byte_pair_encoding.hpp>
The table of merge pairs for the BPE encoder.
To create an instance, call nvtext::load_merge_pairs
Public Functions
-
bpe_merge_pairs(std::unique_ptr<cudf::column> &&input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Construct a new bpe merge pairs object.
- Parameters:
input – The input file containing the BPE merge pairs
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the device memory
-
bpe_merge_pairs(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Construct a new bpe merge pairs object.
- Parameters:
input – The input column of strings
stream – CUDA stream used for device memory operations and kernel launches.
mr – Device memory resource used to allocate the device memory
-
struct hashed_vocabulary#
- #include <subword_tokenize.hpp>
The vocabulary data for use with the subword_tokenize function.
Public Members
-
uint16_t first_token_id = {}#
The first token id in the vocabulary.
-
uint16_t separator_token_id = {}#
The separator token id in the vocabulary.
-
uint16_t unknown_token_id = {}#
The unknown token id in the vocabulary.
-
uint32_t outer_hash_a = {}#
The a parameter for the outer hash.
-
uint32_t outer_hash_b = {}#
The b parameter for the outer hash.
-
uint16_t num_bins = {}#
Number of bins.
-
std::unique_ptr<cudf::column> table#
uint64 column, the flattened hash table with key, value pairs packed in 64-bits
-
std::unique_ptr<cudf::column> bin_coefficients#
uint64 column, containing the hashing parameters for each hash bin on the GPU
-
std::unique_ptr<cudf::column> bin_offsets#
uint16 column, containing the start index of each bin in the flattened hash table
-
struct tokenizer_result#
- #include <subword_tokenize.hpp>
Result object for the subword_tokenize functions.
Public Members
-
uint32_t nrows_tensor = {}#
The number of rows for the output token-ids.
-
uint32_t sequence_length = {}#
The number of token-ids in each row.
-
std::unique_ptr<cudf::column> tensor_token_ids#
A vector of token-ids for each row.
The data is a flat matrix (nrows_tensor x sequence_length) of token-ids. This column is of type UINT32 with no null entries.
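Reading one token-id out of the flat matrix is an index computation. The sketch below assumes a row-major layout, which the description suggests but does not state outright, and uses a host `std::vector` in place of the device column; `token_id_at` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical accessor for a flat (nrows_tensor x sequence_length) matrix.
// Assumption: rows are stored one after another (row-major).
std::uint32_t token_id_at(std::vector<std::uint32_t> const& flat,
                          std::uint32_t sequence_length,
                          std::uint32_t row,
                          std::uint32_t col)
{
  // at() throws std::out_of_range if the computed index is past the end.
  return flat.at(static_cast<std::size_t>(row) * sequence_length + col);
}
```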
-
struct tokenize_vocabulary#
- #include <tokenize.hpp>
Vocabulary object to be used with nvtext::tokenize_with_vocabulary.
Use nvtext::load_vocabulary to create this object.
Public Functions
-
tokenize_vocabulary(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())#
Vocabulary object constructor.
Token ids are the row indices within the vocabulary column. Each vocabulary entry is expected to be unique, otherwise the behavior is undefined.
- Throws:
cudf::logic_error – if vocabulary contains nulls or is empty
- Parameters:
input – Strings for the vocabulary
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory