Nvtext Stemmer
- group nvtext_stemmer
Enums
Functions
-
std::unique_ptr<cudf::column> is_letter(cudf::strings_column_view const &input, letter_type ltype, cudf::size_type character_index, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Returns a boolean column indicating whether the character at character_index in each input string matches the letter type given by ltype (consonant or vowel).
The rules for determining consonants and vowels are described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise the result for that string is undefined.
Also, the algorithm only works with English words.
Example:
  st = ["trouble", "toy", "sygyzy"]
  b1 = is_letter(st, VOWEL, 1)
  b1 is now [false, true, true]
A negative index value checks the character counting from the end of each string. That is, for character_index < 0 the letter checked for string input[i] is at position input[i].length + character_index.
Example:
  st = ["trouble", "toy", "sygyzy"]
  b2 = is_letter(st, CONSONANT, -1) // last letter checked in each string
  b2 is now [false, true, false]
A null input element at row i produces a corresponding null entry for row i in the output column.
- Parameters:
input – Strings column of words to check
ltype – Specify letter type to check
character_index – The character position to check in each string
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New BOOL column
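The per-row check described above can be sketched on the host in plain C++. This is only an illustration of Porter's consonant/vowel rule and of the negative-index convention, not libcudf's device implementation. The names letter_kind, is_consonant and is_letter_host are hypothetical, std::nullopt stands in for a null row, a leading 'y' is assumed to count as a consonant, and returning false for an out-of-range index is an assumption because the text above does not specify that case.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for the letter_type enum used by is_letter.
enum class letter_kind { consonant, vowel };

// Porter's rule: a, e, i, o, u are vowels; 'y' is a vowel only when the
// preceding letter is a consonant. A leading 'y' is treated as a consonant
// (assumption for this sketch).
bool is_consonant(std::string const& word, std::size_t pos)
{
  char const ch = word[pos];
  if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') { return false; }
  if (ch != 'y' || pos == 0) { return true; }
  return !is_consonant(word, pos - 1);
}

// Host-side sketch of the documented behaviour for a single character_index.
// A std::nullopt input row yields a std::nullopt output row.
std::vector<std::optional<bool>> is_letter_host(
  std::vector<std::optional<std::string>> const& input,
  letter_kind kind,
  int character_index)
{
  std::vector<std::optional<bool>> result;
  result.reserve(input.size());
  for (auto const& row : input) {
    if (!row) {
      result.push_back(std::nullopt);
      continue;
    }
    int const length = static_cast<int>(row->size());
    // Negative indices count from the end: position = length + character_index.
    int const pos = character_index < 0 ? length + character_index : character_index;
    if (pos < 0 || pos >= length) {
      result.push_back(false);  // assumption: out-of-range positions report false
      continue;
    }
    bool const consonant = is_consonant(*row, static_cast<std::size_t>(pos));
    result.push_back(kind == letter_kind::consonant ? consonant : !consonant);
  }
  return result;
}
```

Calling is_letter_host with the words from the examples above and letter_kind::vowel at index 1, or letter_kind::consonant at index -1, reproduces the documented b1 and b2 values.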
-
std::unique_ptr<cudf::column> is_letter(cudf::strings_column_view const &input, letter_type ltype, cudf::column_view const &indices, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Returns a boolean column indicating whether the character at indices[i] of input[i] matches the letter type given by ltype (consonant or vowel).
The rules for determining consonants and vowels are described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise the result for that string is undefined.
Also, the algorithm only works with English words.
Example:
  st = ["trouble", "toy", "sygyzy"]
  ix = [3, 1, 4]
  b1 = is_letter(st, VOWEL, ix)
  b1 is now [true, true, false]
A negative index value checks the character counting from the end of each string. That is, for indices[i] < 0 the letter checked for string input[i] is at position input[i].length + indices[i].
Example:
  st = ["trouble", "toy", "sygyzy"]
  ix = [3, -2, 4] // 2nd to last character in st[1] is checked
  b2 = is_letter(st, CONSONANT, ix)
  b2 is now [false, false, true]
A null input element at row i produces a corresponding null entry for row i in the output column.
- Throws:
cudf::logic_error – if indices.size() != input.size()
cudf::logic_error – if indices contains nulls
- Parameters:
input – Strings column of words to check
ltype – Specify letter type to check
indices – The character positions to check in each string
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New BOOL column
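The per-row index variant can be sketched on the host as follows. This is a hedged illustration of the size check and of resolving each row's own (possibly negative) index, not the library's implementation. The names letter_kind, is_consonant and is_letter_rows are hypothetical, std::logic_error stands in for cudf::logic_error, null handling is left out, and treating an out-of-range index as false is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the letter_type enum used by is_letter.
enum class letter_kind { consonant, vowel };

// Porter's rule: 'y' is a vowel only when preceded by a consonant;
// a leading 'y' is treated as a consonant (assumption for this sketch).
bool is_consonant(std::string const& word, std::size_t pos)
{
  char const ch = word[pos];
  if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') { return false; }
  if (ch != 'y' || pos == 0) { return true; }
  return !is_consonant(word, pos - 1);
}

// Host-side sketch: row i is checked at indices[i].
std::vector<bool> is_letter_rows(std::vector<std::string> const& input,
                                 letter_kind kind,
                                 std::vector<int> const& indices)
{
  // Mirrors the documented precondition indices.size() == input.size().
  if (indices.size() != input.size()) {
    throw std::logic_error("indices.size() must equal input.size()");
  }
  std::vector<bool> result(input.size(), false);
  for (std::size_t i = 0; i < input.size(); ++i) {
    int const length = static_cast<int>(input[i].size());
    // Negative indices count from the end: position = length + indices[i].
    int const pos = indices[i] < 0 ? length + indices[i] : indices[i];
    if (pos < 0 || pos >= length) { continue; }  // assumption: out of range stays false
    bool const consonant = is_consonant(input[i], static_cast<std::size_t>(pos));
    result[i] = (kind == letter_kind::consonant) ? consonant : !consonant;
  }
  return result;
}
```

With the words and index lists from the two examples above, this sketch yields the same b1 and b2 values, and a mismatched index list raises the stand-in exception.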
-
std::unique_ptr<cudf::column> porter_stemmer_measure(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Returns the Porter Stemmer measurements of a strings column.
Porter stemming is used to normalize words by removing plural and tense endings from words in English. The stemming measurement involves counting consonant/vowel patterns within a string. Reference paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise the measure value for that string is undefined.
Also, the algorithm only works with English words.
Example:
  st = ["tr", "troubles", "trouble"]
  m = porter_stemmer_measure(st)
  m is now [0,2,1]
A null input element at row i produces a corresponding null entry for row i in the output column.
- Parameters:
input – Strings column of words to measure
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New INT32 column of measure values
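The measure itself can be illustrated with a small host-side sketch. In the referenced paper a word has the form [C](VC)^m[V], where C and V are runs of consonants and vowels, so m equals the number of vowel-to-consonant transitions. The functions porter_measure_host and porter_measure_column below are hypothetical names for this illustration and are not libcudf's implementation; null rows are not modelled, and a leading 'y' is assumed to count as a consonant.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Counts the vowel-to-consonant transitions in a lower-cased English word,
// which equals m in Porter's [C](VC)^m[V] form.
int porter_measure_host(std::string const& word)
{
  int measure         = 0;
  bool prev_consonant = false;
  for (std::size_t i = 0; i < word.size(); ++i) {
    char const ch = word[i];
    bool consonant = true;
    if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') {
      consonant = false;
    } else if (ch == 'y' && i > 0) {
      // 'y' is a vowel only when the preceding letter is a consonant.
      consonant = !prev_consonant;
    }
    // A consonant directly after a vowel closes one VC pair.
    if (consonant && i > 0 && !prev_consonant) { ++measure; }
    prev_consonant = consonant;
  }
  return measure;
}

// Applies the measure to every word, mirroring the column-in, column-out shape.
std::vector<int> porter_measure_column(std::vector<std::string> const& input)
{
  std::vector<int> result;
  result.reserve(input.size());
  for (auto const& word : input) { result.push_back(porter_measure_host(word)); }
  return result;
}
```

Applied to the three words from the example above, the sketch produces the documented measures 0, 2 and 1.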