Files | |
file | stemmer.hpp |
Enumerations | |
enum class | nvtext::letter_type { nvtext::CONSONANT , nvtext::VOWEL } |
Used for specifying letter type to check. | |
Used for specifying letter type to check.
Enumerator | |
---|---|
CONSONANT | Letter is a consonant. |
VOWEL | Letter is not a consonant. |
Definition at line 34 of file stemmer.hpp.
std::unique_ptr<cudf::column> nvtext::is_letter(
    cudf::strings_column_view const& input,
    letter_type ltype,
    cudf::column_view const& indices,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a boolean column indicating whether the character at indices[i] of input[i] is a consonant or vowel.
Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise the result for that string is undefined.
Also, the algorithm only works with English words.
A negative index value checks the character counting from the end of the string. That is, for indices[i] < 0 the letter checked for string input[i] is at position input[i].length + indices[i].
A null input element at row i produces a corresponding null entry for row i in the output column.
Exceptions | |
---|---|
cudf::logic_error | if indices.size() != input.size() |
cudf::logic_error | if indices contains nulls |
Parameters | |
---|---|
input | Strings column of words to check |
ltype | Specify letter type to check |
indices | The character positions to check in each string |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
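The per-row behavior described above can be illustrated with a small host-only sketch. It is not the libcudf implementation: `is_letter_host` and `is_consonant` are hypothetical helper names, `std::optional` stands in for column nulls, the letter rule is a reading of the Porter definition linked above, and returning `false` for an out-of-range index is an assumption the text above does not confirm.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

enum class letter_type { CONSONANT, VOWEL };  // mirrors nvtext::letter_type

// Porter's rule: a, e, i, o, u are vowels; 'y' is a vowel only when the
// preceding letter is a consonant. Every other letter is a consonant.
bool is_consonant(std::string const& word, std::size_t pos)
{
  assert(pos < word.size());
  char const ch = word[pos];
  if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') return false;
  if (ch != 'y' || pos == 0) return true;
  return !is_consonant(word, pos - 1);
}

// Hypothetical host-side analogue of the indices overload.
// Assumes ASCII words, so one byte equals one character.
std::vector<std::optional<bool>> is_letter_host(
  std::vector<std::optional<std::string>> const& input,
  letter_type ltype,
  std::vector<int> const& indices)
{
  if (indices.size() != input.size()) {
    throw std::invalid_argument("indices.size() != input.size()");
  }
  std::vector<std::optional<bool>> result(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    if (!input[i]) continue;  // null row in, null row out
    std::string const& word = *input[i];
    int const length = static_cast<int>(word.size());
    // A negative index counts back from the end of the word.
    int const pos = indices[i] < 0 ? length + indices[i] : indices[i];
    if (pos < 0 || pos >= length) {
      result[i] = false;  // assumption: out-of-range positions match nothing
      continue;
    }
    bool const consonant = is_consonant(word, static_cast<std::size_t>(pos));
    result[i] = (ltype == letter_type::CONSONANT) ? consonant : !consonant;
  }
  return result;
}
```

For example, with words `toy`, null, `syzygy`, `tree` and indices `-1, 0, 1, -2`, checking for `letter_type::VOWEL` yields false, null, true, true: the final `y` of `toy` follows a vowel and is therefore a consonant, while the first `y` of `syzygy` follows a consonant and is therefore a vowel.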
std::unique_ptr<cudf::column> nvtext::is_letter(
    cudf::strings_column_view const& input,
    letter_type ltype,
    cudf::size_type character_index,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns a boolean column indicating whether the character at character_index of each input string is a consonant or vowel.
Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise the result for that string is undefined.
Also, the algorithm only works with English words.
A negative index value checks the character counting from the end of each string. That is, for character_index < 0 the letter checked for string input[i] is at position input[i].length + character_index.
A null input element at row i produces a corresponding null entry for row i in the output column.
Parameters | |
---|---|
input | Strings column of words to check |
ltype | Specify letter type to check |
character_index | The character position to check in each string |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<cudf::column> nvtext::porter_stemmer_measure(
    cudf::strings_column_view const& input,
    rmm::cuda_stream_view stream = cudf::get_default_stream(),
    rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns the Porter Stemmer measurements of a strings column.
Porter stemming is used to normalize words by removing plural and tense endings from words in English. The stemming measurement involves counting consonant/vowel patterns within a string. Reference paper: https://tartarus.org/martin/PorterStemmer/def.txt
Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise the measure value for that string is undefined.
Also, the algorithm only works with English words.
A null input element at row i produces a corresponding null entry for row i in the output column.
Parameters | |
---|---|
input | Strings column of words to measure |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
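As a rough illustration of the consonant/vowel pattern counting mentioned above, the following host-only sketch computes the measure of a single word as defined in the referenced Porter paper: a word has the form [C](VC)^m[V], and m is the number of vowel-run to consonant-run transitions. The function name `porter_measure_host` is hypothetical, the sketch assumes ASCII lower-case input, and it is not the libcudf device implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-side sketch of the Porter measure for one word.
// Counts how many times a vowel is immediately followed by a consonant.
int porter_measure_host(std::string const& word)
{
  int measure         = 0;
  bool prev_consonant = true;  // value is unused for the first letter
  for (std::size_t i = 0; i < word.size(); ++i) {
    char const ch = word[i];
    // Documented precondition: lower-case letters only.
    assert(ch >= 'a' && ch <= 'z');
    bool consonant = true;
    if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') {
      consonant = false;
    } else if (ch == 'y' && i > 0) {
      // 'y' is a vowel only when it follows a consonant.
      consonant = !prev_consonant;
    }
    // Each vowel -> consonant transition closes one VC group.
    if (consonant && i > 0 && !prev_consonant) ++measure;
    prev_consonant = consonant;
  }
  return measure;
}
```

Checked against the examples in the Porter paper, this gives 0 for `tree` and `by`, 1 for `trouble` and `ivy`, and 2 for `private` and `orrery`.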