tokenize
- class pylibcudf.nvtext.tokenize.TokenizeVocabulary
  The vocabulary object to be used with tokenize_with_vocabulary.
  For details, see cudf::nvtext::tokenize_vocabulary.
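  A minimal construction sketch. The example vocabulary strings are illustrative, the pyarrow/plc.interop conversion is one way to build the input column, and the assumption that the constructor takes a single strings Column of vocabulary entries is not stated above:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  # Hypothetical vocabulary of known tokens; the single-Column constructor
  # signature is an assumption.
  vocab_strings = plc.interop.from_arrow(pa.array(["the", "fox", "dog"]))
  vocab = plc.nvtext.tokenize.TokenizeVocabulary(vocab_strings)
  ```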
- pylibcudf.nvtext.tokenize.character_tokenize(Column input) → Column
  Returns a single column of strings by converting each character to a string.
  For details, see cudf::nvtext::character_tokenize.
  - Parameters:
    - input : Column
      Strings column to tokenize
  - Returns:
    - Column
      New strings column of tokens
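  A hedged usage sketch; the sample strings, the plc.interop conversions, and the expected output in the comment are illustrative assumptions:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["abc", "de"]))
  # Each character becomes its own string element in the output column.
  chars = plc.nvtext.tokenize.character_tokenize(strings)
  print(plc.interop.to_arrow(chars))  # expected, per the description: ["a", "b", "c", "d", "e"]
  ```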
- pylibcudf.nvtext.tokenize.count_tokens_column(Column input, Column delimiters) → Column
  Returns the number of tokens in each string of a strings column using multiple strings as delimiters.
  For details, see cudf::nvtext::count_tokens.
  - Parameters:
    - input : Column
      Strings column to count tokens
    - delimiters : Column
      Strings column used to separate each string into tokens
  - Returns:
    - Column
      New column of token counts
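  A brief sketch with assumed sample data and plc.interop conversions; the expected counts in the comment are an assumption based on the description above:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["a,b;c", "d,e"]))
  delimiters = plc.interop.from_arrow(pa.array([",", ";"]))
  counts = plc.nvtext.tokenize.count_tokens_column(strings, delimiters)
  print(plc.interop.to_arrow(counts))  # expected token counts: [3, 2]
  ```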
- pylibcudf.nvtext.tokenize.count_tokens_scalar(Column input, Scalar delimiter=None) → Column
  Returns the number of tokens in each string of a strings column using the provided characters as delimiters.
  For details, see cudf::nvtext::count_tokens.
  - Parameters:
    - input : Column
      Strings column to count tokens
    - delimiter : Scalar
      String scalar used to separate each string into tokens
  - Returns:
    - Column
      New column of token counts
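  A short sketch assuming a single-space delimiter scalar; the sample data, interop calls, and expected counts are illustrative:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["the fox", "over the lazy dog"]))
  delim = plc.interop.from_arrow(pa.scalar(" "))
  counts = plc.nvtext.tokenize.count_tokens_scalar(strings, delim)
  print(plc.interop.to_arrow(counts))  # expected: [2, 4]
  ```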
- pylibcudf.nvtext.tokenize.detokenize(Column input, Column row_indices, Scalar separator=None) → Column
  Creates a strings column from a strings column of tokens and an associated column of row ids.
  For details, see cudf::nvtext::detokenize.
  - Parameters:
    - input : Column
      Strings column to detokenize
    - row_indices : Column
      The relative output row index assigned for each token in the input column
    - separator : Scalar
      String to append after concatenating each token to the proper output row
  - Returns:
    - Column
      New strings column of detokenized strings
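  A hedged sketch of reassembling rows from tokens. The token/row-id values, the int32 type for row ids, and the expected output are assumptions for illustration:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  tokens = plc.interop.from_arrow(pa.array(["hello", "world", "goodbye", "all"]))
  # Row id per token: tokens sharing an id are joined into one output row.
  row_ids = plc.interop.from_arrow(pa.array([0, 0, 1, 1], type=pa.int32()))
  sep = plc.interop.from_arrow(pa.scalar(" "))
  result = plc.nvtext.tokenize.detokenize(tokens, row_ids, sep)
  print(plc.interop.to_arrow(result))  # expected: ["hello world", "goodbye all"]
  ```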
- pylibcudf.nvtext.tokenize.tokenize_column(Column input, Column delimiters) → Column
  Returns a single column of strings by tokenizing the input strings column using multiple strings as delimiters.
  For details, see cudf::nvtext::tokenize.
  - Parameters:
    - input : Column
      Strings column to tokenize
    - delimiters : Column
      Strings column used to separate individual strings into tokens
  - Returns:
    - Column
      New strings column of tokens
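  A sketch with two assumed delimiter strings; the data, interop calls, and expected flat token output are illustrative:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["a,b;c", "d,e"]))
  delimiters = plc.interop.from_arrow(pa.array([",", ";"]))
  tokens = plc.nvtext.tokenize.tokenize_column(strings, delimiters)
  print(plc.interop.to_arrow(tokens))  # expected: ["a", "b", "c", "d", "e"]
  ```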
- pylibcudf.nvtext.tokenize.tokenize_scalar(Column input, Scalar delimiter=None) → Column
  Returns a single column of strings by tokenizing the input strings column using the provided characters as delimiters.
  For details, see cudf::nvtext::tokenize.
  - Parameters:
    - input : Column
      Strings column to tokenize
    - delimiter : Scalar
      String scalar used to separate individual strings into tokens
  - Returns:
    - Column
      New strings column of tokens
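  A sketch assuming a whitespace delimiter scalar; sample strings, interop calls, and the expected output are illustrative:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["the fox", "lazy dog"]))
  delim = plc.interop.from_arrow(pa.scalar(" "))
  tokens = plc.nvtext.tokenize.tokenize_scalar(strings, delim)
  print(plc.interop.to_arrow(tokens))  # expected: ["the", "fox", "lazy", "dog"]
  ```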
- pylibcudf.nvtext.tokenize.tokenize_with_vocabulary(Column input, TokenizeVocabulary vocabulary, Scalar delimiter, size_type default_id=-1) → Column
  Returns the token ids for the input strings by looking up each delimited token in the given vocabulary.
  For details, see cudf::nvtext::tokenize_with_vocabulary.
  - Parameters:
    - input : Column
      Strings column to tokenize
    - vocabulary : TokenizeVocabulary
      Used to lookup tokens within input
    - delimiter : Scalar
      Used to identify tokens within input
    - default_id : size_type
      The token id to be used for tokens not found in the vocabulary; default is -1
  - Returns:
    - Column
      Lists column of token ids
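  An end-to-end sketch under the same assumptions as the TokenizeVocabulary example above: the sample data and interop calls are illustrative, the single-Column vocabulary constructor is an assumption, and the expected ids in the comment assume ids correspond to row positions in the vocabulary column:

  ```python
  import pyarrow as pa
  import pylibcudf as plc

  strings = plc.interop.from_arrow(pa.array(["the fox jumped", "the dog"]))
  vocab_strings = plc.interop.from_arrow(pa.array(["dog", "fox", "the"]))
  # Assumption: TokenizeVocabulary is built from a strings Column of entries.
  vocab = plc.nvtext.tokenize.TokenizeVocabulary(vocab_strings)
  delim = plc.interop.from_arrow(pa.scalar(" "))
  # default_id is left at -1, so "jumped" (not in the vocabulary) maps to -1.
  ids = plc.nvtext.tokenize.tokenize_with_vocabulary(strings, vocab, delim)
  print(plc.interop.to_arrow(ids))  # expected, assuming row-position ids: [[2, 1, -1], [2, 0]]
  ```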