Files
file	generate_ngrams.hpp

file	ngrams_tokenize.hpp

Functions
std::unique_ptr< cudf::column >	nvtext::generate_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &separator, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Returns a single column of strings by generating ngrams from a strings column.

std::unique_ptr< cudf::column >	nvtext::generate_character_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams=2, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Generates ngrams of characters within each string.

std::unique_ptr< cudf::column >	nvtext::hash_character_ngrams (cudf::strings_column_view const &input, cudf::size_type ngrams=5, uint32_t seed=0, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Hashes ngrams of characters within each string.

std::unique_ptr< cudf::column >	nvtext::ngrams_tokenize (cudf::strings_column_view const &input, cudf::size_type ngrams, cudf::string_scalar const &delimiter, cudf::string_scalar const &separator, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
	Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.

Detailed Description

Function Documentation

◆ generate_character_ngrams()

std::unique_ptr<cudf::column> nvtext::generate_character_ngrams	(	cudf::strings_column_view const &	input,
		cudf::size_type	ngrams = `2`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Generates ngrams of characters within each string.

Each character of a string is used to build ngrams for the output row. Ngrams are not created across strings.

["ab", "cde", "fgh"] would generate bigrams as

[["ab"], ["cd", "de"], ["fg", "gh"]]

All null row entries are ignored and the corresponding output row will be empty.
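
The per-row behavior described above can be sketched on the host with the standard library alone. This is an illustration of the semantics, not libcudf code: `character_ngrams_host` is a hypothetical name, the sketch treats each byte as one character (so it is only correct for ASCII input), and it returns empty rows for short strings rather than reproducing the library's error conditions.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side reference; not part of libcudf.
// Assumption: one byte == one character (ASCII-only input).
std::vector<std::vector<std::string>> character_ngrams_host(
  std::vector<std::string> const& input, std::size_t n = 2)
{
  if (n < 2) { throw std::invalid_argument("n must be at least 2"); }

  std::vector<std::vector<std::string>> output;
  output.reserve(input.size());
  for (auto const& s : input) {
    std::vector<std::string> row;
    // A string of length len yields len - n + 1 ngrams, or none if len < n.
    // Ngrams never span two input strings.
    for (std::size_t i = 0; i + n <= s.size(); ++i) {
      row.push_back(s.substr(i, n));
    }
    output.push_back(std::move(row));
  }
  return output;
}
```

Calling `character_ngrams_host({"ab", "cde", "fgh"}, 2)` reproduces the bigram rows shown in the example above.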

Exceptions

std::invalid_argument	if `ngrams < 2`
cudf::logic_error	if there are not enough characters to generate any ngrams

Parameters

input	Strings column to produce ngrams from
ngrams	The ngram number to generate. Default is 2 = bigram.
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned column's device memory

Returns: Lists column of strings

◆ generate_ngrams()

std::unique_ptr<cudf::column> nvtext::generate_ngrams	(	cudf::strings_column_view const &	input,
		cudf::size_type	ngrams,
		cudf::string_scalar const &	separator,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Returns a single column of strings by generating ngrams from a strings column.

An ngram is a grouping of 2 or more strings with a separator. For example, generating bigrams groups all adjacent pairs of strings.

["a", "bb", "ccc"] would generate bigrams as ["a_bb", "bb_ccc"]

and trigrams as ["a_bb_ccc"]

The size of the output column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.
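
As a rough host-side illustration of the grouping described above, the sliding window over adjacent strings might look like the following. `generate_ngrams_host` is a hypothetical helper, not a libcudf function, and the sketch assumes null rows have already been dropped from the input vector.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side reference; not part of libcudf.
// Assumption: `input` holds only the valid (non-null) rows.
std::vector<std::string> generate_ngrams_host(std::vector<std::string> const& input,
                                              std::size_t n,
                                              std::string const& separator)
{
  if (n < 2) { throw std::invalid_argument("n must be at least 2"); }

  std::vector<std::string> output;
  // Slide a window of n adjacent strings over the column.
  for (std::size_t i = 0; i + n <= input.size(); ++i) {
    std::string ngram = input[i];
    for (std::size_t j = 1; j < n; ++j) {
      ngram += separator;
      ngram += input[i + j];
    }
    output.push_back(std::move(ngram));
  }
  return output;
}
```

With the input `["a", "bb", "ccc"]` and separator `"_"`, the sketch produces the bigrams and the trigram listed above. The output length is `size - n + 1` when the column has at least `n` rows, and zero otherwise.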

Exceptions

cudf::logic_error	if `ngrams < 2`
cudf::logic_error	if `separator` is invalid
cudf::logic_error	if there are not enough strings to generate any ngrams

Parameters

input	Strings column to produce ngrams from
ngrams	The ngram number to generate
separator	The string to use for separating ngram tokens
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned column's device memory

Returns: New strings column of ngrams

◆ hash_character_ngrams()

std::unique_ptr<cudf::column> nvtext::hash_character_ngrams	(	cudf::strings_column_view const &	input,
		cudf::size_type	ngrams = `5`,
		uint32_t	seed = `0`,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Hashes ngrams of characters within each string.

Each character of a string is used to build the ngrams, and ngrams are not produced across adjacent string rows.

"abcdefg" would generate ngrams=5 as ["abcde", "bcdef", "cdefg"]

The ngrams for each string are hashed and returned in a list column where the offsets specify rows of hash values for each string.

The size of the child column will be the total number of ngrams generated from the input strings column.

All null row entries are ignored and the output contains all valid rows.

The hash algorithm uses MurmurHash32 on each ngram.
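
The offsets-plus-child layout described above can be illustrated with a small host-side sketch. Everything here is an assumption made for illustration: `HashedNgrams`, `standin_hash`, and `hash_character_ngrams_host` are hypothetical names, bytes are treated as characters, and the hash is a seeded 32-bit FNV-1a used only as a stand-in. It is not MurmurHash and will not reproduce the values libcudf returns.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical host-side model of a lists column of hash values.
struct HashedNgrams {
  std::vector<std::size_t> offsets;   // row i spans hashes[offsets[i]] .. hashes[offsets[i+1] - 1]
  std::vector<std::uint32_t> hashes;  // child values: one hash per ngram
};

// Stand-in hash: 32-bit FNV-1a with the seed mixed into the basis.
// NOT MurmurHash; chosen only because it is short and deterministic.
std::uint32_t standin_hash(std::string const& s, std::size_t pos, std::size_t len,
                           std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (std::size_t i = 0; i < len; ++i) {
    h ^= static_cast<unsigned char>(s[pos + i]);
    h *= 16777619u;
  }
  return h;
}

// Assumption: one byte == one character (ASCII-only input).
HashedNgrams hash_character_ngrams_host(std::vector<std::string> const& input,
                                        std::size_t n = 5,
                                        std::uint32_t seed = 0)
{
  HashedNgrams out;
  out.offsets.push_back(0);
  for (auto const& s : input) {
    // Hash every window of n characters within this string only.
    for (std::size_t i = 0; i + n <= s.size(); ++i) {
      out.hashes.push_back(standin_hash(s, i, n, seed));
    }
    // Closing offset for this row; a row with no ngrams repeats the previous offset.
    out.offsets.push_back(out.hashes.size());
  }
  return out;
}
```

For the rows `"abcdefg"`, `"abc"`, and `"abcdef"` with `n = 5`, the sketch yields offsets `0, 3, 3, 5` and five child values, matching the statement that the child size equals the total number of ngrams.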

Exceptions

cudf::logic_error	if `ngrams < 2`
cudf::logic_error	if there are not enough characters to generate any ngrams

Parameters

input	Strings column to produce ngrams from
ngrams	The ngram number to generate. Default is 5.
seed	The seed value to use with the hash algorithm. Default is 0.
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned column's device memory.

Returns: A lists column of hash values

◆ ngrams_tokenize()

std::unique_ptr<cudf::column> nvtext::ngrams_tokenize	(	cudf::strings_column_view const &	input,
		cudf::size_type	ngrams,
		cudf::string_scalar const &	delimiter,
		cudf::string_scalar const &	separator,
		rmm::cuda_stream_view	stream = `cudf::get_default_stream()`,
		rmm::device_async_resource_ref	mr = `cudf::get_current_device_resource_ref()`
	)

Returns a single column of strings by tokenizing the input strings column and then producing ngrams of each string.

An ngram is a grouping of 2 or more tokens with a separator. For example, generating bigrams groups all adjacent pairs of tokens for a string.

["a bb ccc"] can be tokenized to ["a", "bb", "ccc"]

bigrams would generate ["a_bb", "bb_ccc"] and trigrams would generate ["a_bb_ccc"]

The delimiter is used for tokenizing and may be zero or more characters. If the delimiter is empty, whitespace (character code-point <= ' ') is used for identifying tokens.

Once tokens are identified, ngrams are produced by joining the tokens with the specified separator. The generated ngrams use the tokens for each string and not across strings in adjacent rows. Any input string that contains fewer tokens than the specified ngrams value is skipped and will not contribute to the output. Therefore, a bigram of a single token is ignored, as is a trigram of 2 or fewer tokens.

Tokens are found by locating delimiter(s) starting at the beginning of each string. As each string is tokenized, the ngrams are generated using input column row order to build the output column. That is, ngrams created in input row[i] will be placed in the output column directly before ngrams created in input row[i+1].

The size of the output column will be the total number of ngrams generated from the input strings column.

Example:
s = ["a b c", "d e", "f g h i", "j"]
t = ngrams_tokenize(s, 2, " ", "_")
t is now ["a_b", "b_c", "d_e", "f_g", "g_h", "h_i"]

All null row entries are ignored and the output contains all valid rows.
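
The tokenize-then-join behavior can be approximated on the host as follows. This is a sketch of the documented semantics rather than the library implementation: `split_tokens` and `ngrams_tokenize_host` are hypothetical names, null rows are assumed to have been removed already, and dropping empty tokens between consecutive delimiters is an assumption of the sketch.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: split one string into tokens.
// An empty delimiter means "split on whitespace", i.e. any byte <= ' '.
// Assumption: empty tokens between consecutive delimiters are dropped.
std::vector<std::string> split_tokens(std::string const& s, std::string const& delimiter)
{
  std::vector<std::string> tokens;
  std::size_t pos = 0;
  while (pos < s.size()) {
    std::size_t end  = 0;
    std::size_t next = 0;
    if (delimiter.empty()) {
      end = pos;
      while (end < s.size() && static_cast<unsigned char>(s[end]) > ' ') { ++end; }
      next = end + 1;
    } else {
      end = s.find(delimiter, pos);
      if (end == std::string::npos) { end = s.size(); }
      next = end + delimiter.size();
    }
    if (end > pos) { tokens.push_back(s.substr(pos, end - pos)); }
    pos = next;
  }
  return tokens;
}

// Hypothetical host-side reference; not part of libcudf.
std::vector<std::string> ngrams_tokenize_host(std::vector<std::string> const& input,
                                              std::size_t n,
                                              std::string const& delimiter,
                                              std::string const& separator)
{
  // The sketch rejects n == 0; the library's own checks are not modeled here.
  if (n < 1) { throw std::invalid_argument("n must be at least 1"); }

  std::vector<std::string> output;
  for (auto const& s : input) {
    auto const tokens = split_tokens(s, delimiter);
    // Rows with fewer than n tokens contribute nothing; ngrams never cross rows.
    for (std::size_t i = 0; i + n <= tokens.size(); ++i) {
      std::string ngram = tokens[i];
      for (std::size_t j = 1; j < n; ++j) {
        ngram += separator;
        ngram += tokens[i + j];
      }
      output.push_back(std::move(ngram));
    }
  }
  return output;
}
```

Calling `ngrams_tokenize_host({"a b c", "d e", "f g h i", "j"}, 2, " ", "_")` gives the same six bigrams as the example above, in input row order, with the single-token row `"j"` contributing nothing.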

Parameters

input	Strings column to tokenize and produce ngrams from
ngrams	The ngram number to generate
delimiter	UTF-8 characters used to separate each string into tokens. An empty string will separate tokens using whitespace.
separator	The string to use for separating ngram tokens
stream	CUDA stream used for device memory operations and kernel launches
mr	Device memory resource used to allocate the returned column's device memory

Returns: New strings column of ngrams
