Files | |
file | generate_ngrams.hpp |
file | ngrams_tokenize.hpp |
std::unique_ptr<cudf::column> nvtext::generate_character_ngrams(
  cudf::strings_column_view const& input,
  cudf::size_type ngrams = 2,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Generates ngrams of characters within each string.
Each character of a string is used to build ngrams for the output row. Ngrams are not created across strings.
All null row entries are ignored and the corresponding output row will be empty.
std::invalid_argument | if ngrams < 2 |
cudf::logic_error | if there are not enough characters to generate any ngrams |
input | Strings column to produce ngrams from |
ngrams | The ngram number to generate. Default is 2 = bigram. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
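To make the row-by-row behavior of generate_character_ngrams easier to picture, here is a minimal host-side sketch. It is not the library's implementation and does not call nvtext. The function name `character_ngrams_reference` is hypothetical, nulls are modeled with `std::optional`, and the sketch assumes single-byte (ASCII) characters and one output row per input row, as the description above suggests. It does not model the cudf::logic_error raised when no ngrams can be generated at all.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical CPU reference; illustrative only, not part of nvtext.
// Assumption: every character is one byte (ASCII), so bytes == characters.
std::vector<std::vector<std::string>> character_ngrams_reference(
  std::vector<std::optional<std::string>> const& input, std::size_t ngrams = 2)
{
  if (ngrams < 2) { throw std::invalid_argument("ngrams must be >= 2"); }

  std::vector<std::vector<std::string>> output;
  output.reserve(input.size());
  for (auto const& row : input) {
    // Stays empty for null rows and for rows shorter than `ngrams`.
    std::vector<std::string> row_ngrams;
    if (row.has_value() && row->size() >= ngrams) {
      // Slide a window of `ngrams` characters across this string only.
      for (std::size_t pos = 0; pos + ngrams <= row->size(); ++pos) {
        row_ngrams.push_back(row->substr(pos, ngrams));
      }
    }
    output.push_back(std::move(row_ngrams));
  }
  return output;
}
```

With this sketch, the rows "ab", "cde", null, "fgh" and ngrams = 2 give the rows ["ab"], ["cd", "de"], [], ["fg", "gh"]. No ngram spans two strings.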
std::unique_ptr<cudf::column> nvtext::generate_ngrams(
  cudf::strings_column_view const& input,
  cudf::size_type ngrams,
  cudf::string_scalar const& separator,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a single column of strings by generating ngrams from a strings column.
An ngram is a grouping of 2 or more strings with a separator. For example, generating bigrams groups all adjacent pairs of strings.
The size of the output column will be the total number of ngrams generated from the input strings column.
All null row entries are ignored and the output contains all valid rows.
cudf::logic_error | if ngrams < 2 |
cudf::logic_error | if separator is invalid |
cudf::logic_error | if there are not enough strings to generate any ngrams |
input | Strings column to tokenize and produce ngrams from |
ngrams | The ngram number to generate |
separator | The string to use for separating ngram tokens |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
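The grouping that generate_ngrams performs across rows can be sketched on the host as follows. This is an illustrative reference under stated assumptions, not the library's code. The name `generate_ngrams_reference` is hypothetical and nulls are modeled with `std::optional`. The text above does not say how empty (non-null) strings are treated, so the sketch keeps them as ordinary strings.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical CPU reference; illustrative only, not part of nvtext.
std::vector<std::string> generate_ngrams_reference(
  std::vector<std::optional<std::string>> const& input,
  std::size_t ngrams,
  std::string const& separator)
{
  if (ngrams < 2) { throw std::invalid_argument("ngrams must be >= 2"); }

  // Null rows are ignored before any grouping takes place.
  std::vector<std::string> valid;
  for (auto const& row : input) {
    if (row.has_value()) { valid.push_back(*row); }
  }

  // Each window of `ngrams` adjacent strings becomes one output row.
  std::vector<std::string> output;
  for (std::size_t first = 0; first + ngrams <= valid.size(); ++first) {
    std::string ngram = valid[first];
    for (std::size_t k = 1; k < ngrams; ++k) {
      ngram += separator;
      ngram += valid[first + k];
    }
    output.push_back(std::move(ngram));
  }
  return output;
}
```

With this sketch, the rows "a", null, "bb", "ccc" and separator "_" give ["a_bb", "bb_ccc"] for bigrams and ["a_bb_ccc"] for trigrams.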
std::unique_ptr<cudf::column> nvtext::hash_character_ngrams(
  cudf::strings_column_view const& input,
  cudf::size_type ngrams = 5,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Hashes ngrams of characters within each string.
Each character of a string is used to build the ngrams, and ngrams are not produced across adjacent string rows.
The ngrams for each string are hashed and returned in a list column where the offsets specify rows of hash values for each string.
The size of the child column will be the total number of ngrams generated from the input strings column.
All null row entries are ignored and the output contains all valid rows.
The hash algorithm uses MurmurHash32 on each ngram.
cudf::logic_error | if ngrams < 2 |
cudf::logic_error | if there are not enough characters to generate any ngrams |
input | Strings column to produce ngrams from |
ngrams | The ngram number to generate. Default is 5. |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
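The list-column layout described for hash_character_ngrams (offsets delimiting each string's run of hash values in a child column) can be pictured with a small host-side sketch. Everything here is illustrative. The names `hashed_ngrams`, `fnv1a_32`, and `hash_character_ngrams_reference` are hypothetical, and the sketch assumes ASCII input and one list row per input row. It deliberately uses FNV-1a as a stand-in hash, so its values will not match the MurmurHash-based values the library produces.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical host-side picture of a list column: offsets plus child values.
struct hashed_ngrams {
  std::vector<std::int32_t> offsets;  // one more entry than the number of rows
  std::vector<std::uint32_t> hashes;  // one hash per generated ngram
};

// Stand-in hash (32-bit FNV-1a). NOT the MurmurHash used by the library.
std::uint32_t fnv1a_32(std::string_view bytes)
{
  std::uint32_t hash = 2166136261u;
  for (unsigned char byte : bytes) {
    hash ^= byte;
    hash *= 16777619u;
  }
  return hash;
}

// Assumption: single-byte (ASCII) characters; null rows yield empty lists.
hashed_ngrams hash_character_ngrams_reference(
  std::vector<std::optional<std::string>> const& input, std::size_t ngrams = 5)
{
  if (ngrams < 2) { throw std::invalid_argument("ngrams must be >= 2"); }

  hashed_ngrams result;
  result.offsets.push_back(0);
  for (auto const& row : input) {
    if (row.has_value() && row->size() >= ngrams) {
      std::string_view chars{*row};
      for (std::size_t pos = 0; pos + ngrams <= chars.size(); ++pos) {
        result.hashes.push_back(fnv1a_32(chars.substr(pos, ngrams)));
      }
    }
    // The closing offset of this row is the opening offset of the next one.
    result.offsets.push_back(static_cast<std::int32_t>(result.hashes.size()));
  }
  return result;
}
```

With this sketch, the rows "abcdef", null, "abcde" and ngrams = 5 give offsets [0, 2, 2, 3] and three child values, so the child size equals the total number of ngrams. The first and last hashes are equal because both come from the ngram "abcde".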
std::unique_ptr<cudf::column> nvtext::ngrams_tokenize(
  cudf::strings_column_view const& input,
  cudf::size_type ngrams,
  cudf::string_scalar const& delimiter,
  cudf::string_scalar const& separator,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.
An ngram is a grouping of 2 or more tokens with a separator. For example, generating bigrams groups all adjacent pairs of tokens for a string.
The delimiter is used for tokenizing and may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens.
Once tokens are identified, ngrams are produced by joining the tokens with the specified separator. Ngrams are built only from tokens within the same string, never across strings in adjacent rows. Any input string that contains fewer tokens than the specified ngrams value is skipped and will not contribute to the output. For example, a string with a single token produces no bigrams, and a string with two or fewer tokens produces no trigrams.
Tokens are found by locating delimiter(s) starting at the beginning of each string. As each string is tokenized, the ngrams are generated using input column row order to build the output column. That is, ngrams created in input row[i] will be placed in the output column directly before ngrams created in input row[i+1].
The size of the output column will be the total number of ngrams generated from the input strings column.
All null row entries are ignored and the output contains all valid rows.
input | Strings column to tokenize and produce ngrams from |
ngrams | The ngram number to generate |
delimiter | UTF-8 characters used to separate each string into tokens. An empty string will separate tokens using whitespace. |
separator | The string to use for separating ngram tokens |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
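The tokenize-then-join behavior of ngrams_tokenize can be sketched on the host as shown below. This is an illustrative reference, not the library's implementation. The names `tokenize_reference` and `ngrams_tokenize_reference` are hypothetical and nulls are modeled with `std::optional`. The sketch also assumes that empty tokens between consecutive delimiters are dropped, which the text above does not state explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split one string into tokens.
// Assumption: empty tokens between consecutive delimiters are dropped.
std::vector<std::string> tokenize_reference(std::string const& str, std::string const& delimiter)
{
  std::vector<std::string> tokens;
  if (delimiter.empty()) {
    // Empty delimiter: any character with code point <= ' ' separates tokens.
    std::string current;
    for (char ch : str) {
      if (static_cast<unsigned char>(ch) <= ' ') {
        if (!current.empty()) { tokens.push_back(current); }
        current.clear();
      } else {
        current += ch;
      }
    }
    if (!current.empty()) { tokens.push_back(current); }
    return tokens;
  }
  std::size_t begin = 0;
  while (begin <= str.size()) {
    std::size_t end = str.find(delimiter, begin);
    if (end == std::string::npos) { end = str.size(); }
    if (end > begin) { tokens.push_back(str.substr(begin, end - begin)); }
    begin = end + delimiter.size();
  }
  return tokens;
}

// Hypothetical CPU reference; illustrative only, not part of nvtext.
std::vector<std::string> ngrams_tokenize_reference(
  std::vector<std::optional<std::string>> const& input,
  std::size_t ngrams,
  std::string const& delimiter,
  std::string const& separator)
{
  std::vector<std::string> output;
  for (auto const& row : input) {
    if (!row.has_value()) { continue; }  // null rows are ignored
    auto const tokens = tokenize_reference(*row, delimiter);
    // Rows with fewer tokens than `ngrams` contribute nothing.
    for (std::size_t first = 0; first + ngrams <= tokens.size(); ++first) {
      std::string ngram = tokens[first];
      for (std::size_t k = 1; k < ngrams; ++k) {
        ngram += separator;
        ngram += tokens[first + k];
      }
      output.push_back(std::move(ngram));
    }
  }
  return output;
}
```

With this sketch, the rows "a b c", "d e", "f g h i", "j" with ngrams = 2, delimiter " ", and separator "_" give ["a_b", "b_c", "d_e", "f_g", "g_h", "h_i"]. The row "j" has only one token and is skipped, and the output preserves input row order.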