tokenize
- class pylibcudf.nvtext.tokenize.TokenizeVocabulary
  The vocabulary object to be used with tokenize_with_vocabulary.
  For details, see cudf::nvtext::tokenize_vocabulary.
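  A minimal construction sketch. The example vocabulary strings are illustrative, the pyarrow/plc.interop conversion is one way to build the input column, and the assumption that the constructor takes a single strings Column of vocabulary entries is not stated above:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  # Hypothetical vocabulary of known tokens; the single-Column constructor
  # signature is an assumption.
  vocab_strings = plc.interop.from_arrow(pa.array(["the", "fox", "dog"]))
  vocab = plc.nvtext.tokenize.TokenizeVocabulary(vocab_strings)
  ```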
- pylibcudf.nvtext.tokenize.character_tokenize(Column input) → Column
  Returns a single column of strings by converting each character to a string.
  For details, see cudf::nvtext::character_tokenize.
  - Parameters:
    - input : Column
      Strings column to tokenize
  - Returns:
    - Column
      New strings column of tokens
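  A hedged usage sketch; the sample strings, the plc.interop conversions, and the expected output in the comment are illustrative assumptions:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["abc", "de"]))
  # Each character becomes its own string element in the output column.
  chars = plc.nvtext.tokenize.character_tokenize(strings)
  print(plc.interop.to_arrow(chars))  # expected, per the description: ["a", "b", "c", "d", "e"]
  ```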
- pylibcudf.nvtext.tokenize.count_tokens_column(Column input, Column delimiters) → Column
  Returns the number of tokens in each string of a strings column using multiple strings as delimiters.
  For details, see cudf::nvtext::count_tokens.
  - Parameters:
    - input : Column
      Strings column to count tokens
    - delimiters : Column
      Strings column used to separate each string into tokens
  - Returns:
    - Column
      New column of token counts
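  A brief sketch with assumed sample data and plc.interop conversions; the expected counts in the comment are an assumption based on the description above:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["a,b;c", "d,e"]))
  delimiters = plc.interop.from_arrow(pa.array([",", ";"]))
  counts = plc.nvtext.tokenize.count_tokens_column(strings, delimiters)
  print(plc.interop.to_arrow(counts))  # expected token counts: [3, 2]
  ```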
- pylibcudf.nvtext.tokenize.count_tokens_scalar(Column input, Scalar delimiter=None) → Column
  Returns the number of tokens in each string of a strings column using the provided characters as delimiters.
  For details, see cudf::nvtext::count_tokens.
  - Parameters:
    - input : Column
      Strings column to count tokens
    - delimiter : Scalar
      String scalar used to separate each string into tokens
  - Returns:
    - Column
      New column of token counts
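  A short sketch assuming a single-space delimiter scalar; the sample data, interop calls, and expected counts are illustrative:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["the fox", "over the lazy dog"]))
  delim = plc.interop.from_arrow(pa.scalar(" "))
  counts = plc.nvtext.tokenize.count_tokens_scalar(strings, delim)
  print(plc.interop.to_arrow(counts))  # expected: [2, 4]
  ```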
- pylibcudf.nvtext.tokenize.detokenize(Column input, Column row_indices, Scalar separator=None) → Column
  Creates a strings column from a strings column of tokens and an associated column of row ids.
  For details, see cudf::nvtext::detokenize.
  - Parameters:
    - input : Column
      Strings column to detokenize
    - row_indices : Column
      The relative output row index assigned for each token in the input column
    - separator : Scalar
      String to append after concatenating each token to the proper output row
  - Returns:
    - Column
      New strings column of detokenized strings
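  A hedged sketch of reassembling rows from tokens. The token/row-id values, the int32 type for row ids, and the expected output are assumptions for illustration:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  tokens = plc.interop.from_arrow(pa.array(["hello", "world", "goodbye", "all"]))
  # Row id per token: tokens sharing an id are joined into one output row.
  row_ids = plc.interop.from_arrow(pa.array([0, 0, 1, 1], type=pa.int32()))
  sep = plc.interop.from_arrow(pa.scalar(" "))
  result = plc.nvtext.tokenize.detokenize(tokens, row_ids, sep)
  print(plc.interop.to_arrow(result))  # expected: ["hello world", "goodbye all"]
  ```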
- pylibcudf.nvtext.tokenize.tokenize_column(Column input, Column delimiters) → Column
  Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters.
  For details, see cudf::nvtext::tokenize.
  - Parameters:
    - input : Column
      Strings column to tokenize
    - delimiters : Column
      Strings column used to separate individual strings into tokens
  - Returns:
    - Column
      New strings column of tokens
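  A sketch with two assumed delimiter strings; the data, interop calls, and expected flat token output are illustrative:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["a,b;c", "d,e"]))
  delimiters = plc.interop.from_arrow(pa.array([",", ";"]))
  tokens = plc.nvtext.tokenize.tokenize_column(strings, delimiters)
  print(plc.interop.to_arrow(tokens))  # expected: ["a", "b", "c", "d", "e"]
  ```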
- pylibcudf.nvtext.tokenize.tokenize_scalar(Column input, Scalar delimiter=None) → Column
  Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters.
  For details, see cudf::nvtext::tokenize.
  - Parameters:
    - input : Column
      Strings column to tokenize
    - delimiter : Scalar
      String scalar used to separate individual strings into tokens
  - Returns:
    - Column
      New strings column of tokens
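  A sketch assuming a whitespace delimiter scalar; sample strings, interop calls, and the expected output are illustrative:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["the fox", "lazy dog"]))
  delim = plc.interop.from_arrow(pa.scalar(" "))
  tokens = plc.nvtext.tokenize.tokenize_scalar(strings, delim)
  print(plc.interop.to_arrow(tokens))  # expected: ["the", "fox", "lazy", "dog"]
  ```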
- pylibcudf.nvtext.tokenize.tokenize_with_vocabulary(Column input, TokenizeVocabulary vocabulary, Scalar delimiter, size_type default_id=-1) → Column
  Returns the token ids for the input strings by looking up each delimited token in the given vocabulary.
  For details, see cudf::nvtext::tokenize_with_vocabulary.
  - Parameters:
    - input : Column
      Strings column to tokenize
    - vocabulary : TokenizeVocabulary
      Used to lookup tokens within input
    - delimiter : Scalar
      Used to identify tokens within input
    - default_id : size_type
      The token id to be used for tokens not found in the vocabulary; default is -1
  - Returns:
    - Column
      Lists column of token ids
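  An end-to-end sketch under the same assumptions as the TokenizeVocabulary example above: the sample data and interop calls are illustrative, the single-Column vocabulary constructor is an assumption, and the expected ids in the comment assume ids correspond to row positions in the vocabulary column:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["the fox jumped", "the dog"]))
  vocab_strings = plc.interop.from_arrow(pa.array(["dog", "fox", "the"]))
  # Assumption: TokenizeVocabulary is built from a strings Column of entries.
  vocab = plc.nvtext.tokenize.TokenizeVocabulary(vocab_strings)
  delim = plc.interop.from_arrow(pa.scalar(" "))
  # default_id is left at -1, so "jumped" (not in the vocabulary) maps to -1.
  ids = plc.nvtext.tokenize.tokenize_with_vocabulary(strings, vocab, delim)
  print(plc.interop.to_arrow(ids))  # expected, assuming row-position ids: [[2, 1, -1], [2, 0]]
  ```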