Nvtext Jaccard#

group nvtext_jaccard

Functions

std::unique_ptr<cudf::column> jaccard_index(cudf::strings_column_view const &input1, cudf::strings_column_view const &input2, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::mr::device_memory_resource *mr = rmm::mr::get_current_device_resource())#

Computes the Jaccard similarity between individual rows in two strings columns.

The similarity is calculated between strings in corresponding rows such that output[row] = J(input1[row],input2[row]).

The Jaccard index formula is https://en.wikipedia.org/wiki/Jaccard_index

J = |A ∩ B| / |A ∪ B|
where |A ∩ B| is number of common values between A and B
and |x| is the number of unique values in x.

The computation here compares strings columns by treating each string as text (i.e. sentences, paragraphs, articles) instead of individual words or tokens to be compared directly. The algorithm applies a sliding window (size specified by the width parameter) to each string to form the set of tokens to compare within each row of the two input columns.

These substrings are essentially character ngrams and used as part of the union and intersect calculations for that row. For efficiency, the substrings are hashed using the default MurmurHash32 to identify uniqueness within each row. Once the union and intersect sizes for the row are resolved, the Jaccard index is computed using the above formula and returned as a float32 value.

input1 = ["the fuzzy dog", "little piggy", "funny bunny", "chatty parrot"]
input2 = ["the fuzzy cat", "bitty piggy", "funny bunny", "silent partner"]
r = jaccard_index(input1, input2)
r is now [0.5, 0.15384616, 1.0, 0]

If either input column’s row is null, the output for that row will also be null.

Throws:

std::invalid_argument – if the width < 2 or input1.size() != input2.size()

Parameters:
  • input1 – Strings column to compare with input2

  • input2 – Strings column to compare with input1

  • width – The character width used for apply substrings

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

Index calculation values