Tokenizing

Files

file  byte_pair_encoding.hpp
 
file  tokenize.hpp
 
file  wordpiece_tokenize.hpp
 

Classes

struct  nvtext::bpe_merge_pairs
 The table of merge pairs for the BPE encoder. More...
 
struct  nvtext::tokenize_vocabulary
 Vocabulary object to be used with nvtext::tokenize_with_vocabulary. More...
 
struct  nvtext::wordpiece_vocabulary
 Vocabulary object to be used with nvtext::wordpiece_tokenizer. More...
 

Functions

std::unique_ptr< bpe_merge_pairs > nvtext::load_merge_pairs (cudf::strings_column_view const &merge_pairs, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Create a nvtext::bpe_merge_pairs from a strings column. More...
 
std::unique_ptr< cudf::column > nvtext::byte_pair_encoding (cudf::strings_column_view const &input, bpe_merge_pairs const &merges_pairs, cudf::string_scalar const &separator=cudf::string_scalar(" "), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Byte pair encode the input strings. More...
 
std::unique_ptr< cudf::column > nvtext::tokenize (cudf::strings_column_view const &input, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters. More...
 
std::unique_ptr< cudf::column > nvtext::tokenize (cudf::strings_column_view const &input, cudf::strings_column_view const &delimiters, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters. More...
 
std::unique_ptr< cudf::column > nvtext::count_tokens (cudf::strings_column_view const &input, cudf::string_scalar const &delimiter=cudf::string_scalar{""}, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns the number of tokens in each string of a strings column. More...
 
std::unique_ptr< cudf::column > nvtext::count_tokens (cudf::strings_column_view const &input, cudf::strings_column_view const &delimiters, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
Returns the number of tokens in each string of a strings column using multiple string delimiters to identify the tokens in each string. More...
 
std::unique_ptr< cudf::column > nvtext::character_tokenize (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns a single column of strings by converting each character to a string. More...
 
std::unique_ptr< cudf::column > nvtext::detokenize (cudf::strings_column_view const &input, cudf::column_view const &row_indices, cudf::string_scalar const &separator=cudf::string_scalar(" "), rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Creates a strings column from a strings column of tokens and an associated column of row ids. More...
 
std::unique_ptr< tokenize_vocabulary > nvtext::load_vocabulary (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Create a tokenize_vocabulary object from a strings column. More...
 
std::unique_ptr< cudf::column > nvtext::tokenize_with_vocabulary (cudf::strings_column_view const &input, tokenize_vocabulary const &vocabulary, cudf::string_scalar const &delimiter, cudf::size_type default_id=-1, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns the token ids for the input string by looking up each delimited token in the given vocabulary. More...
 
std::unique_ptr< wordpiece_vocabulary > nvtext::load_wordpiece_vocabulary (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
Create a wordpiece_vocabulary object from a strings column. More...
 
std::unique_ptr< cudf::column > nvtext::wordpiece_tokenize (cudf::strings_column_view const &input, wordpiece_vocabulary const &vocabulary, cudf::size_type max_words_per_row=0, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
Returns the token ids for the input strings using a wordpiece tokenizer algorithm with the given vocabulary. More...
 

Detailed Description

Function Documentation

◆ byte_pair_encoding()

std::unique_ptr<cudf::column> nvtext::byte_pair_encoding ( cudf::strings_column_view const &  input,
bpe_merge_pairs const &  merges_pairs,
cudf::string_scalar const &  separator = cudf::string_scalar(" "),
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Byte pair encode the input strings.

The encoding algorithm rebuilds each string by matching substrings in the merge_pairs table and iteratively removing the minimum ranked pair until no pairs are left. Then, the separator is inserted between the remaining pairs before the result is joined to make the output string.

merge_pairs = ["e n", "i t", "i s", "e s", "en t", "c e", "es t", "en ce", "t est", "s ent"]
mps = load_merge_pairs(merge_pairs)
input = ["test sentence", "thisis test"]
result = byte_pair_encoding(input, mps)
result is now ["test sent ence", "this is test"]
Exceptions
  cudf::logic_error  if merge_pairs is empty
  cudf::logic_error  if separator is invalid
Parameters
  input         Strings to encode.
  merges_pairs  Created by a call to nvtext::load_merge_pairs.
  separator     String used to build the output after encoding. Default is a space.
  stream        CUDA stream used for device memory operations and kernel launches
  mr            Memory resource to allocate any returned objects.
Returns
  An encoded column of strings.

◆ character_tokenize()

Returns a single column of strings by converting each character to a string.

Each string is converted to multiple strings – one for each character. Note that a character may be more than one byte.

Example:
s = ["hello world", "goodbye"]
t = character_tokenize(s)
t is now ["h","e","l","l","o"," ","w","o","r","l","d","g","o","o","d","b","y","e"]
Exceptions
  std::invalid_argument  if input contains nulls
  std::overflow_error    if the output would produce more than max size_type rows
Parameters
  input   Strings column to tokenize
  stream  CUDA stream used for device memory operations and kernel launches
  mr      Device memory resource used to allocate the returned column's device memory
Returns
  New strings column of tokens
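For intuition, the behavior described above can be sketched on the CPU in Python. This is an illustrative sketch over plain lists, not the cuDF device implementation; Python iterates code points, which mirrors how a multi-byte UTF-8 character yields a single token.

```python
def character_tokenize(strings):
    # One output token per character; nulls (None) are rejected,
    # matching the documented std::invalid_argument behavior.
    if any(s is None for s in strings):
        raise ValueError("input must not contain nulls")
    return [ch for s in strings for ch in s]

print(character_tokenize(["hello world", "goodbye"]))
# ['h', 'e', 'l', 'l', 'o', ' ', 'w', 'o', 'r', 'l', 'd',
#  'g', 'o', 'o', 'd', 'b', 'y', 'e']
```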

◆ count_tokens() [1/2]

Returns the number of tokens in each string of a strings column.

The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored. This means that only empty strings or null rows will result in a token count of 0.

Example:
s = ["a", "b c", " ", "d e f"]
t = count_tokens(s)
t is now [1, 2, 0, 3]

All null row entries are ignored and the output contains all valid rows. The number of tokens for a null element is set to 0 in the output column.

Parameters
  input      Strings column to count tokens
  delimiter  Strings used to separate each string into tokens. The default of empty string will separate tokens using whitespace.
  stream     CUDA stream used for device memory operations and kernel launches
  mr         Device memory resource used to allocate the returned column's device memory
Returns
  New column of token counts
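A CPU sketch of these counting semantics in Python (illustrative only; Python's `\s` is an approximation of the documented "code-point <= ' '" whitespace rule, and the delimiter here is treated as a set of characters):

```python
import re

def count_tokens(strings, delimiter=""):
    # Null rows count 0. An empty delimiter means whitespace splitting,
    # and runs of consecutive delimiters never yield empty tokens.
    pattern = "[%s]+" % re.escape(delimiter) if delimiter else r"\s+"
    return [0 if s is None else sum(1 for t in re.split(pattern, s) if t)
            for s in strings]

print(count_tokens(["a", "b c", " ", "d e f"]))  # [1, 2, 0, 3]
```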

◆ count_tokens() [2/2]

Returns the number of tokens in each string of a strings column using multiple string delimiters to identify the tokens in each string.

Also, any consecutive delimiters found in a string are ignored. This means that only empty strings or null rows will result in a token count of 0.

Example:
s = ["a", "b c", "d.e:f;"]
d = [".", ":", ";"]
t = count_tokens(s,d)
t is now [1, 1, 3]

All null row entries are ignored and the output contains all valid rows. The number of tokens for a null element is set to 0 in the output column.

Exceptions
  cudf::logic_error  if the delimiters column is empty or contains nulls
Parameters
  input       Strings column to count tokens
  delimiters  Strings used to separate each string into tokens
  stream      CUDA stream used for device memory operations and kernel launches
  mr          Device memory resource used to allocate the returned column's device memory
Returns
  New column of token counts

◆ detokenize()

std::unique_ptr<cudf::column> nvtext::detokenize ( cudf::strings_column_view const &  input,
cudf::column_view const &  row_indices,
cudf::string_scalar const &  separator = cudf::string_scalar(" "),
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Creates a strings column from a strings column of tokens and an associated column of row ids.

Multiple tokens from the input column may be combined into a single row (string) in the output column. The tokens are concatenated along with the separator string in the order in which they appear in the row_indices column.

Example:
s = ["hello", "world", "one", "two", "three"]
r = [0, 0, 1, 1, 1]
s1 = detokenize(s,r)
s1 is now ["hello world", "one two three"]
r = [0, 2, 1, 1, 0]
s2 = detokenize(s,r)
s2 is now ["hello three", "one two", "world"]

All null row entries are ignored and the output contains all valid rows. The values in row_indices are expected to be positive, sequential values without any missing row indices; otherwise the output is undefined.

Exceptions
  cudf::logic_error  if separator is invalid
  cudf::logic_error  if row_indices.size() != strings.size()
  cudf::logic_error  if row_indices contains nulls
Parameters
  input        Strings column to detokenize
  row_indices  The relative output row index assigned for each token in the input column
  separator    String to append after concatenating each token to the proper output row
  stream       CUDA stream used for device memory operations and kernel launches
  mr           Device memory resource used to allocate the returned column's device memory
Returns
  New strings column of tokens
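The grouping behavior can be sketched on the CPU in Python (an illustrative sketch over plain lists, not the device implementation):

```python
def detokenize(tokens, row_indices, separator=" "):
    # Group tokens by their assigned output row, preserving the order
    # in which they appear in the input column; nulls are skipped.
    assert len(tokens) == len(row_indices)
    rows = {}
    for tok, idx in zip(tokens, row_indices):
        if tok is None:
            continue
        rows.setdefault(idx, []).append(tok)
    return [separator.join(rows[i]) for i in sorted(rows)]

s = ["hello", "world", "one", "two", "three"]
print(detokenize(s, [0, 0, 1, 1, 1]))  # ['hello world', 'one two three']
print(detokenize(s, [0, 2, 1, 1, 0]))  # ['hello three', 'one two', 'world']
```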

◆ load_merge_pairs()

Create a nvtext::bpe_merge_pairs from a strings column.

The input column should contain a unique pair of strings per row separated by a single space. An incorrect format or non-unique entries will result in undefined behavior.

Example:

merge_pairs = ["e n", "i t", "i s", "e s", "en t", "c e", "es t", "en ce", "t est", "s ent"]
mps = load_merge_pairs(merge_pairs)
// the mps object can be passed to the byte_pair_encoding API

The pairs are expected to be ordered in the file by their rank relative to each other. A pair earlier in the file has priority over any pairs below it.

Exceptions
  cudf::logic_error  if merge_pairs is empty or contains nulls
Parameters
  merge_pairs  Column containing the unique merge pairs
  stream       CUDA stream used for device memory operations and kernel launches
  mr           Memory resource to allocate any returned objects
Returns
  A nvtext::bpe_merge_pairs object

◆ load_vocabulary()

Create a tokenize_vocabulary object from a strings column.

Token ids are the row indices within the vocabulary column. Each vocabulary entry is expected to be unique otherwise the behavior is undefined.

Exceptions
  cudf::logic_error  if vocabulary contains nulls or is empty
Parameters
  input   Strings for the vocabulary
  stream  CUDA stream used for device memory operations and kernel launches
  mr      Device memory resource used to allocate the returned column's device memory
Returns
  Object to be used with nvtext::tokenize_with_vocabulary

◆ load_wordpiece_vocabulary()

Create a wordpiece_vocabulary object from a strings column.

Token ids are the row indices within the vocabulary column. Each vocabulary entry is expected to be unique otherwise the behavior is undefined.

Exceptions
  std::invalid_argument  if vocabulary contains nulls or is empty
Parameters
  input   Strings for the vocabulary
  stream  CUDA stream used for device memory operations and kernel launches
  mr      Device memory resource used to allocate the returned column's device memory
Returns
  Object to be used with nvtext::wordpiece_tokenize

◆ tokenize() [1/2]

Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters.

The delimiter may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens. Also, any consecutive delimiters found in a string are ignored. This means only non-empty tokens are returned.

Tokens are found by locating delimiter(s) starting at the beginning of each string. As each string is tokenized, the tokens are appended using input column row order to build the output column. That is, tokens found in input row[i] will be placed in the output column directly before tokens found in input row[i+1].

Example:
s = ["a", "b c", "d e f "]
t = tokenize(s)
t is now ["a", "b", "c", "d", "e", "f"]

All null row entries are ignored and the output contains all valid rows.

Parameters
  input      Strings column to tokenize
  delimiter  UTF-8 characters used to separate each string into tokens. The default of empty string will separate tokens using whitespace.
  stream     CUDA stream used for device memory operations and kernel launches
  mr         Device memory resource used to allocate the returned column's device memory
Returns
  New strings column of tokens
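A CPU sketch of these semantics in Python (illustrative only; the delimiter argument is treated as a set of single characters, and Python's `\s` approximates the documented "code-point <= ' '" whitespace rule):

```python
import re

def tokenize(strings, delim_chars=""):
    # Tokens from row i precede tokens from row i+1 in the output;
    # null rows are skipped and empty tokens are never produced.
    pattern = "[%s]+" % re.escape(delim_chars) if delim_chars else r"\s+"
    out = []
    for s in strings:
        if s is None:
            continue
        out.extend(t for t in re.split(pattern, s) if t)
    return out

print(tokenize(["a", "b c", "d e f "]))  # ['a', 'b', 'c', 'd', 'e', 'f']
```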

◆ tokenize() [2/2]

Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters.

Tokens are found by locating delimiter(s) starting at the beginning of each string. Any consecutive delimiters found in a string are ignored. This means only non-empty tokens are returned.

As each string is tokenized, the tokens are appended using input column row order to build the output column. That is, tokens found in input row[i] will be placed in the output column directly before tokens found in input row[i+1].

Example:
s = ["a", "b c", "d.e:f;"]
d = [".", ":", ";"]
t = tokenize(s,d)
t is now ["a", "b c", "d", "e", "f"]

All null row entries are ignored and the output contains all valid rows.

Exceptions
  cudf::logic_error  if the delimiters column is empty or contains nulls
Parameters
  input       Strings column to tokenize
  delimiters  Strings used to separate individual strings into tokens
  stream      CUDA stream used for device memory operations and kernel launches
  mr          Device memory resource used to allocate the returned column's device memory
Returns
  New strings column of tokens
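The multi-delimiter variant can be sketched in Python as follows (an illustrative CPU sketch, not the device implementation; longer delimiters are tried first, a detail assumed here rather than documented above):

```python
import re

def tokenize_multi(strings, delimiters):
    # Split each string on any of several delimiter strings;
    # empty tokens are dropped and null rows are skipped.
    if not delimiters or any(d is None for d in delimiters):
        raise ValueError("delimiters must be non-empty and non-null")
    pattern = "|".join(re.escape(d)
                       for d in sorted(delimiters, key=len, reverse=True))
    return [t for s in strings if s is not None
            for t in re.split(pattern, s) if t]

print(tokenize_multi(["a", "b c", "d.e:f;"], [".", ":", ";"]))
# ['a', 'b c', 'd', 'e', 'f']
```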

◆ tokenize_with_vocabulary()

std::unique_ptr<cudf::column> nvtext::tokenize_with_vocabulary ( cudf::strings_column_view const &  input,
tokenize_vocabulary const &  vocabulary,
cudf::string_scalar const &  delimiter,
cudf::size_type  default_id = -1,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns the token ids for the input string by looking up each delimited token in the given vocabulary.

Example:
s = ["hello world", "hello there", "there there world", "watch out world"]
v = load_vocabulary(["hello", "there", "world"])
r = tokenize_with_vocabulary(s,v)
r is now [[0,2], [0,1], [1,1,2], [-1,-1,2]]

Any null row entry results in a corresponding null entry in the output.

Exceptions
  cudf::logic_error  if delimiter is invalid
Parameters
  input       Strings column to tokenize
  vocabulary  Used to lookup tokens within input
  delimiter   Used to identify tokens within input
  default_id  The token id to be used for tokens not found in the vocabulary. Default is -1.
  stream      CUDA stream used for device memory operations and kernel launches
  mr          Device memory resource used to allocate the returned column's device memory
Returns
  Lists column of token ids
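The lookup semantics can be sketched on the CPU in Python (illustrative only; the real API takes a vocabulary object from load_vocabulary rather than a plain list, and the delimiter default shown here is an assumption for the sketch):

```python
def tokenize_with_vocabulary(strings, vocabulary, delimiter=" ", default_id=-1):
    # A token's id is its row index in the vocabulary column;
    # unknown tokens map to default_id, null rows to null output rows.
    ids = {tok: i for i, tok in enumerate(vocabulary)}
    out = []
    for s in strings:
        if s is None:
            out.append(None)
            continue
        out.append([ids.get(t, default_id) for t in s.split(delimiter) if t])
    return out

s = ["hello world", "hello there", "there there world", "watch out world"]
print(tokenize_with_vocabulary(s, ["hello", "there", "world"]))
# [[0, 2], [0, 1], [1, 1, 2], [-1, -1, 2]]
```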

◆ wordpiece_tokenize()

std::unique_ptr<cudf::column> nvtext::wordpiece_tokenize ( cudf::strings_column_view const &  input,
wordpiece_vocabulary const &  vocabulary,
cudf::size_type  max_words_per_row = 0,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns the token ids for the input strings using a wordpiece tokenizer algorithm with the given vocabulary.

Example:

vocabulary = ["[UNK]", "a", "have", "I", "new", "GP", "##U", "!"]
v = load_wordpiece_vocabulary(vocabulary)
input = ["I have a new GPU now !"]
t = wordpiece_tokenize(input,v)
t is now [[3, 2, 1, 4, 5, 6, 0, 7]]

The max_words_per_row parameter optionally limits the output by processing only a maximum number of words per row. Here a word is defined as a consecutive sequence of characters delimited by space character(s).

Example:

vocabulary = ["[UNK]", "a", "have", "I", "new", "GP", "##U", "!"]
v = load_wordpiece_vocabulary(vocabulary)
input = ["I have a new GPU now !"]
t4 = wordpiece_tokenize(input,v,4)
t4 is now [[3, 2, 1, 4]]
t5 = wordpiece_tokenize(input,v,5)
t5 is now [[3, 2, 1, 4, 5, 6]]

Any null row entry results in a corresponding null entry in the output.

Exceptions
  std::invalid_argument  if max_words_per_row is less than 0
Parameters
  input              Strings column to tokenize
  vocabulary         Used to lookup tokens within input
  max_words_per_row  Maximum number of words to tokenize for each row. Default 0 tokenizes all words.
  stream             CUDA stream used for device memory operations and kernel launches
  mr                 Device memory resource used to allocate the returned column's device memory
Returns
  Lists column of token ids
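The examples above can be reproduced with a CPU sketch of the standard greedy longest-match wordpiece scheme (an illustrative Python sketch assumed to match the nvtext behavior on these inputs; the real API takes a vocabulary object from load_wordpiece_vocabulary, not a plain list):

```python
def wordpiece_tokenize(strings, vocabulary, max_words_per_row=0, unk="[UNK]"):
    # Each whitespace-delimited word is split into the longest matching
    # vocabulary prefix plus "##"-prefixed continuation pieces; a word
    # with no matching piece becomes the unknown token.
    ids = {tok: i for i, tok in enumerate(vocabulary)}
    out = []
    for s in strings:
        if s is None:
            out.append(None)  # null rows produce null output rows
            continue
        words = s.split()
        if max_words_per_row > 0:
            words = words[:max_words_per_row]
        row = []
        for word in words:
            pieces, start = [], 0
            while start < len(word):
                end = len(word)
                prefix = "##" if start > 0 else ""
                while end > start and prefix + word[start:end] not in ids:
                    end -= 1
                if end == start:  # no piece matched: whole word is unknown
                    pieces = [ids[unk]]
                    break
                pieces.append(ids[prefix + word[start:end]])
                start = end
            row.extend(pieces)
        out.append(row)
    return out

vocab = ["[UNK]", "a", "have", "I", "new", "GP", "##U", "!"]
print(wordpiece_tokenize(["I have a new GPU now !"], vocab))
# [[3, 2, 1, 4, 5, 6, 0, 7]]
```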