tokenize

class pylibcudf.nvtext.tokenize.TokenizeVocabulary

The Vocabulary object to be used with tokenize_with_vocabulary.

For details, see cudf::nvtext::tokenize_vocabulary.
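
A minimal construction sketch, not taken from the original docs: it assumes the vocabulary is built from a strings Column and that pyarrow interop goes through pylibcudf.interop.from_arrow.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> # Assumption: the constructor accepts a strings Column of vocabulary entries
>>> vocab_col = plc.interop.from_arrow(pa.array(["hello", "world"]))
>>> vocab = plc.nvtext.tokenize.TokenizeVocabulary(vocab_col)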

pylibcudf.nvtext.tokenize.character_tokenize(Column input) → Column

Returns a single column of strings by converting each character to a string.

For details, see cudf::nvtext::character_tokenize.

Parameters:
input : Column

Strings column to tokenize

Returns:
Column

New strings column of tokens
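
A minimal usage sketch, assuming pyarrow interop via pylibcudf.interop.from_arrow/to_arrow (not shown in the original docs).

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> col = plc.interop.from_arrow(pa.array(["ab", "cd"]))
>>> chars = plc.nvtext.tokenize.character_tokenize(col)
>>> # Each character becomes its own string; expected: ['a', 'b', 'c', 'd']
>>> plc.interop.to_arrow(chars).to_pylist()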

pylibcudf.nvtext.tokenize.count_tokens_column(Column input, Column delimiters) → Column

Returns the number of tokens in each string of a strings column using multiple strings as delimiters.

For details, see cudf::nvtext::count_tokens.

Parameters:
input : Column

Strings column to count tokens

delimiters : Column

Strings column used to separate each string into tokens

Returns:
Column

New column of token counts
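
A minimal sketch, assuming pyarrow interop via pylibcudf.interop.from_arrow/to_arrow.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> col = plc.interop.from_arrow(pa.array(["a,b c", "d"]))
>>> delims = plc.interop.from_arrow(pa.array([",", " "]))
>>> counts = plc.nvtext.tokenize.count_tokens_column(col, delims)
>>> # Expected counts per input row: [3, 1]
>>> plc.interop.to_arrow(counts).to_pylist()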

pylibcudf.nvtext.tokenize.count_tokens_scalar(Column input, Scalar delimiter=None) → Column

Returns the number of tokens in each string of a strings column using the provided characters as delimiters.

For details, see cudf::nvtext::count_tokens.

Parameters:
input : Column

Strings column to count tokens

delimiter : Scalar

String scalar used to separate each string into tokens

Returns:
Column

New column of token counts
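
A minimal sketch, assuming a Scalar can be built from a pyarrow scalar via pylibcudf.interop.from_arrow.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> col = plc.interop.from_arrow(pa.array(["hello world", "goodbye"]))
>>> space = plc.interop.from_arrow(pa.scalar(" "))
>>> counts = plc.nvtext.tokenize.count_tokens_scalar(col, space)
>>> # Expected counts per input row: [2, 1]
>>> plc.interop.to_arrow(counts).to_pylist()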

pylibcudf.nvtext.tokenize.detokenize(Column input, Column row_indices, Scalar separator=None) → Column

Creates a strings column from a strings column of tokens and an associated column of row ids.

For details, see cudf::nvtext::detokenize.

Parameters:
input : Column

Strings column to detokenize

row_indices : Column

The relative output row index assigned for each token in the input column

separator : Scalar

String to append after concatenating each token to the proper output row

Returns:
Column

New strings column of detokenized strings
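
A minimal sketch, assuming pyarrow interop via pylibcudf.interop.from_arrow/to_arrow and that the default separator is a single space.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> tokens = plc.interop.from_arrow(pa.array(["hello", "world", "goodbye"]))
>>> rows = plc.interop.from_arrow(pa.array([0, 0, 1], type=pa.int32()))
>>> out = plc.nvtext.tokenize.detokenize(tokens, rows)
>>> # Tokens that share a row index are joined; expected: ['hello world', 'goodbye']
>>> plc.interop.to_arrow(out).to_pylist()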

pylibcudf.nvtext.tokenize.tokenize_column(Column input, Column delimiters) → Column

Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters.

For details, see cudf::nvtext::tokenize.

Parameters:
input : Column

Strings column to tokenize

delimiters : Column

Strings column used to separate individual strings into tokens

Returns:
Column

New strings column of tokens
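
A minimal sketch, assuming pyarrow interop via pylibcudf.interop.from_arrow/to_arrow.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> col = plc.interop.from_arrow(pa.array(["a,b c"]))
>>> delims = plc.interop.from_arrow(pa.array([",", " "]))
>>> tokens = plc.nvtext.tokenize.tokenize_column(col, delims)
>>> # Expected flattened tokens: ['a', 'b', 'c']
>>> plc.interop.to_arrow(tokens).to_pylist()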

pylibcudf.nvtext.tokenize.tokenize_scalar(Column input, Scalar delimiter=None) → Column

Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters.

For details, see cudf::nvtext::tokenize.

Parameters:
input : Column

Strings column to tokenize

delimiter : Scalar

String scalar used to separate individual strings into tokens

Returns:
Column

New strings column of tokens
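
A minimal sketch, assuming pyarrow interop via pylibcudf.interop.from_arrow/to_arrow and that delimiter=None falls back to whitespace splitting.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> col = plc.interop.from_arrow(pa.array(["hello world", "cudf"]))
>>> tokens = plc.nvtext.tokenize.tokenize_scalar(col)
>>> # Expected flattened tokens: ['hello', 'world', 'cudf']
>>> plc.interop.to_arrow(tokens).to_pylist()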

pylibcudf.nvtext.tokenize.tokenize_with_vocabulary(Column input, TokenizeVocabulary vocabulary, Scalar delimiter, size_type default_id=-1) → Column

Returns the token ids for the input strings by looking up each delimited token in the given vocabulary.

For details, see cudf::nvtext::tokenize_with_vocabulary.

Parameters:
input : Column

Strings column to tokenize

vocabulary : TokenizeVocabulary

Used to look up tokens within input

delimiter : Scalar

Used to identify tokens within input

default_id : size_type

The token id to assign to tokens not found in the vocabulary; default is -1

Returns:
Column

Lists column of token ids
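
A minimal end-to-end sketch, not taken from the original docs: it assumes the vocabulary is constructed from a strings Column, that token ids correspond to row positions in that column, and that pyarrow interop goes through pylibcudf.interop.from_arrow/to_arrow.

>>> import pyarrow as pa
>>> import pylibcudf as plc
>>> col = plc.interop.from_arrow(pa.array(["hello world", "hello there"]))
>>> vocab = plc.nvtext.tokenize.TokenizeVocabulary(
...     plc.interop.from_arrow(pa.array(["hello", "world"])))
>>> space = plc.interop.from_arrow(pa.scalar(" "))
>>> ids = plc.nvtext.tokenize.tokenize_with_vocabulary(col, vocab, space, default_id=-1)
>>> # Assuming ids are vocabulary row positions, 'there' is unknown and maps to -1:
>>> # expected [[0, 1], [0, -1]]
>>> plc.interop.to_arrow(ids).to_pylist()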