Nvtext Ngrams

group nvtext_ngrams

Functions

std::unique_ptr<cudf::column> generate_ngrams(cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &separator, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())

Returns a single column of strings by generating ngrams from a strings column.

An ngram is a grouping of 2 or more strings with a separator. For example, generating bigrams groups all adjacent pairs of strings.

["a", "bb", "ccc"] would generate bigrams as ["a_bb", "bb_ccc"]
and trigrams as ["a_bb_ccc"]

The size of the output column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.
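The grouping described above can be modeled on the host with plain standard-library containers. This is only a sketch of the semantics, not the libcudf implementation: the name `generate_ngrams_host` is hypothetical, a null row is modeled as `std::nullopt`, and null rows are assumed to be removed before adjacent strings are grouped. The `assert` on the ngram size reflects the definition of an ngram as 2 or more strings; it does not describe how the library reports invalid arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical host-side model of generate_ngrams (not the libcudf code).
// A null row is represented by std::nullopt.
std::vector<std::string> generate_ngrams_host(
    std::vector<std::optional<std::string>> const& input,
    std::size_t ngrams,
    std::string const& separator)
{
  assert(ngrams >= 2);  // an ngram groups 2 or more strings

  // Assumption: null rows are dropped first, so only valid rows are grouped.
  std::vector<std::string> rows;
  for (auto const& row : input) {
    if (row.has_value()) { rows.push_back(*row); }
  }

  std::vector<std::string> output;
  // Slide a window of `ngrams` adjacent strings over the valid rows.
  for (std::size_t i = 0; i + ngrams <= rows.size(); ++i) {
    std::string gram = rows[i];
    for (std::size_t j = 1; j < ngrams; ++j) {
      gram += separator;
      gram += rows[i + j];
    }
    output.push_back(gram);
  }
  return output;
}
```

With the input ["a", "bb", "ccc"] and separator "_", this sketch produces the bigrams and trigrams listed in the example above. An input of N valid strings yields N - ngrams + 1 output rows in this model.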

Throws:
Parameters:
  • input – Strings column to tokenize and produce ngrams from

  • ngrams – Number of strings to join into each ngram (2 for bigrams, 3 for trigrams)

  • separator – The string to use for separating ngram tokens

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column of ngrams

std::unique_ptr<cudf::column> generate_character_ngrams(cudf::strings_column_view const &input, cudf::size_type ngrams = 2, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())

Generates ngrams of characters within each string.

Each character of a string is used to build ngrams. Ngrams are not created across strings.

["ab", "cde", "fgh"] would generate bigrams as ["ab", "cd", "de", "fg", "gh"]

The size of the output column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.
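The per-string windowing can be sketched on the host as follows. The names `char_offsets` and `generate_character_ngrams_host` are hypothetical, and the sketch assumes that a "character" is a UTF-8 code point rather than a byte; it illustrates the semantics only and is not the libcudf implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Byte offsets at which each UTF-8 character starts, followed by the end offset.
std::vector<std::size_t> char_offsets(std::string const& s)
{
  std::vector<std::size_t> offsets;
  for (std::size_t i = 0; i < s.size(); ++i) {
    // Continuation bytes look like 10xxxxxx; any other byte starts a character.
    if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) { offsets.push_back(i); }
  }
  offsets.push_back(s.size());
  return offsets;
}

// Hypothetical host-side model of generate_character_ngrams.
std::vector<std::string> generate_character_ngrams_host(
    std::vector<std::optional<std::string>> const& input, std::size_t ngrams = 2)
{
  assert(ngrams >= 2);  // assumption: sizes below 2 are not meaningful ngrams
  std::vector<std::string> output;
  for (auto const& row : input) {
    if (!row.has_value()) { continue; }  // null rows contribute nothing
    auto const offsets          = char_offsets(*row);
    std::size_t const num_chars = offsets.size() - 1;
    // The window never runs past the end of a row, so no ngram spans two strings.
    for (std::size_t i = 0; i + ngrams <= num_chars; ++i) {
      output.push_back(row->substr(offsets[i], offsets[i + ngrams] - offsets[i]));
    }
  }
  return output;
}
```

In this model a string of C characters contributes C - ngrams + 1 rows when C is at least `ngrams`, and nothing otherwise.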

Throws:
Parameters:
  • input – Strings column to produce ngrams from

  • ngrams – Number of characters in each ngram. Default is 2 (bigram).

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column of ngrams

std::unique_ptr<cudf::column> hash_character_ngrams(cudf::strings_column_view const &input, cudf::size_type ngrams = 5, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())

Hashes ngrams of characters within each string.

Each character of a string is used to build the ngrams, and ngrams are not produced across adjacent string rows.

With ngrams = 5, "abcdefg" would generate ["abcde", "bcdef", "cdefg"]

The ngrams for each string are hashed and returned in a list column where the offsets specify rows of hash values for each string.

The size of the child column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.

The hash algorithm uses MurmurHash32 on each ngram.
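The offsets-plus-child layout of the returned lists column can be illustrated with a small host-side model. Everything named here is hypothetical: `hashed_ngrams` stands in for a lists column, `hash_character_ngrams_host` for the function, and `std::hash` is only a placeholder for the MurmurHash32 step, so the values will not match the library's output. For brevity the sketch assumes single-byte characters and leaves null handling out.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a lists column: row i owns the half-open range
// hashes[offsets[i]] .. hashes[offsets[i + 1]].
struct hashed_ngrams {
  std::vector<std::size_t> offsets;
  std::vector<std::size_t> hashes;
};

hashed_ngrams hash_character_ngrams_host(std::vector<std::string> const& input,
                                         std::size_t ngrams = 5)
{
  assert(ngrams >= 2);            // assumption: sizes below 2 are not meaningful ngrams
  std::hash<std::string> hasher;  // placeholder only; NOT MurmurHash32
  hashed_ngrams result;
  result.offsets.push_back(0);
  for (auto const& row : input) {
    // Hash every window of `ngrams` characters inside this row.
    for (std::size_t i = 0; i + ngrams <= row.size(); ++i) {
      result.hashes.push_back(hasher(row.substr(i, ngrams)));
    }
    // Close this row's list by recording where the next row starts.
    result.offsets.push_back(result.hashes.size());
  }
  return result;
}
```

For the inputs "abcdefg" and "hijklmn" with ngrams = 5, the model's offsets are [0, 3, 6] and the child holds six hash values, three per row.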

Throws:
Parameters:
  • input – Strings column to produce ngrams from

  • ngrams – Number of characters in each ngram. Default is 5.

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory.

Returns:

A lists column of hash values

std::unique_ptr<cudf::column> ngrams_tokenize(cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &delimiter, cudf::string_scalar const &separator, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())

Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.

An ngram is a grouping of 2 or more tokens with a separator. For example, generating bigrams groups all adjacent pairs of tokens for a string.

["a bb ccc"] can be tokenized to ["a", "bb", "ccc"]
bigrams would generate ["a_bb", "bb_ccc"] and trigrams would generate ["a_bb_ccc"]

The delimiter is used for tokenizing and may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens.

Once tokens are identified, ngrams are produced by joining the tokens with the specified separator. The generated ngrams use the tokens for each string and not across strings in adjacent rows. Any input string that contains fewer tokens than the specified ngrams value is skipped and does not contribute to the output. For example, a string with a single token produces no bigrams, and a string with two or fewer tokens produces no trigrams.

Tokens are found by locating delimiter(s) starting at the beginning of each string. As each string is tokenized, the ngrams are generated using input column row order to build the output column. That is, ngrams created in input row[i] will be placed in the output column directly before ngrams created in input row[i+1].

The size of the output column will be the total number of ngrams generated from the input strings column.

Example:
s = ["a b c", "d e", "f g h i", "j"]
t = ngrams_tokenize(s, 2, " ", "_")
t is now ["a_b", "b_c", "d_e", "f_g", "g_h", "h_i"]

All null row entries are ignored and the output contains all valid rows.
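The tokenize-then-join behavior can be modeled on the host as below. This is a sketch of the described semantics rather than the libcudf implementation: `tokenize_host` and `ngrams_tokenize_host` are hypothetical names, null rows are modeled as `std::nullopt`, and the sketch assumes that empty tokens between consecutive delimiters are dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Split one string into tokens. An empty delimiter means "split on whitespace",
// where whitespace is any character with code point <= ' '.
// Assumption of this sketch: empty tokens are dropped.
std::vector<std::string> tokenize_host(std::string const& s, std::string const& delimiter)
{
  std::vector<std::string> tokens;
  std::size_t begin = 0;
  std::size_t pos   = 0;
  while (pos <= s.size()) {
    std::size_t delim_len = 0;
    if (pos == s.size()) {
      delim_len = 1;  // treat the end of the string as a final delimiter
    } else if (delimiter.empty()) {
      if (static_cast<unsigned char>(s[pos]) <= ' ') { delim_len = 1; }
    } else if (s.compare(pos, delimiter.size(), delimiter) == 0) {
      delim_len = delimiter.size();
    }
    if (delim_len == 0) { ++pos; continue; }
    if (pos > begin) { tokens.push_back(s.substr(begin, pos - begin)); }
    pos += delim_len;
    begin = pos;
  }
  return tokens;
}

// Hypothetical host-side model of ngrams_tokenize.
std::vector<std::string> ngrams_tokenize_host(
    std::vector<std::optional<std::string>> const& input,
    std::size_t ngrams,
    std::string const& delimiter,
    std::string const& separator)
{
  assert(ngrams >= 2);  // an ngram groups 2 or more tokens
  std::vector<std::string> output;
  for (auto const& row : input) {  // input row order fixes the output order
    if (!row.has_value()) { continue; }
    auto const tokens = tokenize_host(*row, delimiter);
    // Rows with fewer than `ngrams` tokens never enter this loop and are skipped.
    for (std::size_t i = 0; i + ngrams <= tokens.size(); ++i) {
      std::string gram = tokens[i];
      for (std::size_t j = 1; j < ngrams; ++j) { gram += separator + tokens[i + j]; }
      output.push_back(gram);
    }
  }
  return output;
}
```

Called with the rows from the example above, a delimiter of " " and a separator of "_", the model returns the same six bigrams; the row "j" has a single token and contributes nothing.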

Parameters:
  • input – Strings column to tokenize and produce ngrams from

  • ngrams – Number of tokens to join into each ngram (2 for bigrams, 3 for trigrams)

  • delimiter – UTF-8 characters used to separate each string into tokens. An empty string will separate tokens using whitespace.

  • separator – The string to use for separating ngram tokens

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New strings column of ngrams