Nvtext Stemmer

group nvtext_stemmer

Enums

enum class letter_type

Used for specifying letter type to check.

Values:

enumerator CONSONANT

Letter is a consonant.

enumerator VOWEL

Letter is a vowel (that is, not a consonant).

Functions

std::unique_ptr<cudf::column> is_letter(cudf::strings_column_view const &input, letter_type ltype, cudf::size_type character_index, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns a boolean column indicating whether the character at character_index in each input string matches the letter type given by ltype (consonant or vowel).

Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt

Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the result for that string is undefined.

Also, the algorithm only works with English words.

Example:
st = ["trouble", "toy", "sygyzy"]
b1 = is_letter(st, VOWEL, 1)
b1 is now [false, true, true]
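
The classification rule from the referenced paper can be sketched on the host as follows. This is a minimal illustration in plain C++, not libcudf's device implementation. The helper names `is_consonant` and `is_vowel` are hypothetical, and the sketch assumes lower-case ASCII input and a valid, non-negative position.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>

// Host-side sketch of Porter's letter classification (not libcudf code).
// Assumes `word` is one lower-case ASCII English word and `pos < word.size()`.
inline bool is_consonant(std::string_view word, std::size_t pos)
{
  assert(pos < word.size());
  char const ch = word[pos];
  if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') { return false; }
  if (ch != 'y') { return true; }
  // 'y' is a vowel only when the letter before it is a consonant;
  // at the start of a word, or after a vowel, it is a consonant.
  return pos == 0 || !is_consonant(word, pos - 1);
}

// A letter that is not a consonant is treated as a vowel.
inline bool is_vowel(std::string_view word, std::size_t pos)
{
  return !is_consonant(word, pos);
}
```

Under this rule, position 1 of "trouble", "toy", and "sygyzy" classifies as consonant, vowel, and vowel respectively, which agrees with the b1 result above.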

A negative index value checks the character counting back from the end of each string. That is, for character_index < 0 the letter checked for string input[i] is at position input[i].length + character_index.

Example:
st = ["trouble", "toy", "sygyzy"]
b2 = is_letter(st, CONSONANT, -1) // last letter checked in each string
b2 is now [false, true, false]
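
The negative-index arithmetic can be sketched on the host like this. The function name `resolve_index` is hypothetical, and returning `std::nullopt` for an out-of-range index is a choice made for this sketch only; this page does not state how libcudf treats out-of-range positions. The sketch also assumes one byte per character (ASCII).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Maps a possibly negative character index onto a position within `word`.
// A negative index counts back from the end: position = length + index.
// Out-of-range handling (std::nullopt) is an assumption of this sketch.
inline std::optional<std::size_t> resolve_index(std::string_view word,
                                                std::int32_t character_index)
{
  auto const length = static_cast<std::int64_t>(word.size());
  std::int64_t const pos =
    character_index < 0 ? length + character_index : character_index;
  if (pos < 0 || pos >= length) { return std::nullopt; }
  return static_cast<std::size_t>(pos);
}
```

For example, index -1 resolves to position 6 in "trouble" and position 2 in "toy", which are the last letters checked in the b2 example above.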

A null input element at row i produces a corresponding null entry for row i in the output column.

Parameters:
  • input – Strings column of words to check

  • ltype – Specify letter type to check

  • character_index – The character position to check in each string

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New BOOL column

std::unique_ptr<cudf::column> is_letter(cudf::strings_column_view const &input, letter_type ltype, cudf::column_view const &indices, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns a boolean column indicating whether the character at indices[i] of input[i] matches the letter type given by ltype (consonant or vowel).

Determining consonants and vowels is described in the following paper: https://tartarus.org/martin/PorterStemmer/def.txt

Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the result for that string is undefined.

Also, the algorithm only works with English words.

Example:
st = ["trouble", "toy", "sygyzy"]
ix = [3, 1, 4]
b1 = is_letter(st, VOWEL, ix)
b1 is now [true, true, false]

A negative index value checks the character counting back from the end of each string. That is, for indices[i] < 0 the letter checked for string input[i] is at position input[i].length + indices[i].

Example:
st = ["trouble", "toy", "sygyzy"]
ix = [3, -2, 4] // 2nd to last character in st[1] is checked
b2 = is_letter(st, CONSONANT, ix)
b2 is now [false, false, true]

A null input element at row i produces a corresponding null entry for row i in the output column.
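
The per-row behavior, including null propagation, can be modeled on the host as below. This is a sketch under stated assumptions rather than libcudf's implementation: `is_letter_host` and `is_consonant` are hypothetical names, `std::nullopt` stands in for a null row, the two inputs are assumed to have the same length, and reporting `false` for an out-of-range index is a choice made only for this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

enum class letter_type { CONSONANT, VOWEL };  // mirrors the enum documented above

// Porter's rule for one lower-case ASCII word; `pos` must be in range.
inline bool is_consonant(std::string const& word, std::size_t pos)
{
  char const ch = word[pos];
  if (ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u') { return false; }
  if (ch != 'y') { return true; }
  return pos == 0 || !is_consonant(word, pos - 1);
}

// Host-side model of the per-row overload: rows[i] is checked at indices[i].
// Assumes rows.size() == indices.size().
inline std::vector<std::optional<bool>> is_letter_host(
  std::vector<std::optional<std::string>> const& rows,
  letter_type ltype,
  std::vector<std::int32_t> const& indices)
{
  std::vector<std::optional<bool>> result(rows.size());  // all rows start as null
  for (std::size_t i = 0; i < rows.size(); ++i) {
    if (!rows[i].has_value()) { continue; }  // null in, null out
    auto const length = static_cast<std::int64_t>(rows[i]->size());
    std::int64_t const pos = indices[i] < 0 ? length + indices[i] : indices[i];
    if (pos < 0 || pos >= length) {
      result[i] = false;  // sketch-only choice for out-of-range positions
      continue;
    }
    bool const consonant = is_consonant(*rows[i], static_cast<std::size_t>(pos));
    result[i] = (ltype == letter_type::CONSONANT) ? consonant : !consonant;
  }
  return result;
}
```

Calling this model with the strings and indices from the b2 example above yields false, false, true for the three non-null rows, and any null row stays null in the result.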

Throws:
Parameters:
  • input – Strings column of words to check

  • ltype – Specify letter type to check

  • indices – The character positions to check in each string

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New BOOL column

std::unique_ptr<cudf::column> porter_stemmer_measure(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns the Porter Stemmer measurements of a strings column.

Porter stemming is used to normalize words by removing plural and tense endings from words in English. The stemming measurement involves counting consonant/vowel patterns within a string. Reference paper: https://tartarus.org/martin/PorterStemmer/def.txt

Each string in the input column is expected to contain a single, lower-cased word (or subword) with no punctuation and no whitespace; otherwise, the measure value for that string is undefined.

Also, the algorithm only works with English words.

Example:
st = ["tr", "troubles", "trouble"]
m = porter_stemmer_measure(st)
m is now [0,2,1]
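
In the referenced paper, a word has the form [C](VC)^m[V], where C and V are runs of consonants and vowels and m is the measure. One way to compute m is to count the vowel-to-consonant transitions in a single pass. The host-side sketch below illustrates that idea; `porter_measure_host` is a hypothetical name, the input is assumed to be one lower-case ASCII word, and this is not libcudf's device code.

```cpp
#include <cstddef>
#include <string_view>

// Host-side sketch of the Porter measure: the m in [C](VC)^m[V].
// Assumes a single lower-case ASCII English word.
inline int porter_measure_host(std::string_view word)
{
  int measure     = 0;
  bool prev_vowel = false;  // no letter seen yet
  for (std::size_t i = 0; i < word.size(); ++i) {
    char const ch = word[i];
    bool vowel    = ch == 'a' || ch == 'e' || ch == 'i' || ch == 'o' || ch == 'u';
    // 'y' counts as a vowel only when the previous letter is a consonant.
    if (ch == 'y') { vowel = (i > 0) && !prev_vowel; }
    // Each vowel-to-consonant transition closes one VC pair.
    if (prev_vowel && !vowel) { ++measure; }
    prev_vowel = vowel;
  }
  return measure;
}
```

Applied to "tr", "troubles", and "trouble", this sketch gives 0, 2, and 1, the same values as the example above.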

A null input element at row i produces a corresponding null entry for row i in the output column.

Parameters:
  • input – Strings column of words to measure

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

New INT32 column of measure values