Nvtext Edit Distance
- group nvtext_edit_distance
Functions
-
std::unique_ptr<cudf::column> edit_distance(cudf::strings_column_view const &input, cudf::strings_column_view const &targets, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Compute the edit distance between individual strings in two strings columns.
The output[i] is the edit distance between input[i] and targets[i]. This edit distance calculation uses the Levenshtein algorithm as documented here: https://www.cuelogic.com/blog/the-levenshtein-algorithm
Example:
s = ["hello", "", "world"]
t = ["hallo", "goodbye", "world"]
d = edit_distance(s, t)
d is now [1, 7, 0]
Null entries in either input or targets are treated as empty strings when computing the edit distance.
The targets.size() must equal input.size() unless targets.size()==1. In that case, every row of input is compared against the single targets[0] string.
- Throws:
cudf::logic_error – if targets.size() != input.size() and targets.size() != 1
- Parameters:
input – Strings column of input strings
targets – Strings to compute edit distance against input
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New column of integer edit distance values, one per row of input
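The row-wise semantics described above (null rows treated as empty strings, a single target broadcast against every input row, and an error on mismatched sizes) can be sketched on the host in plain C++. This is not the libcudf implementation and does not call libcudf: `levenshtein`, `nullable_string`, and `reference_edit_distance` are hypothetical names introduced for illustration, `std::nullopt` stands in for a null row, and `std::invalid_argument` stands in for `cudf::logic_error`. The sketch compares bytes, so it matches a character-level distance only for single-byte (ASCII) text.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side helper: Levenshtein distance over bytes,
// computed with two rolling rows of the dynamic-programming table.
int levenshtein(std::string const& a, std::string const& b)
{
  std::vector<int> prev(b.size() + 1), curr(b.size() + 1);
  // Distance from the empty prefix of `a` to each prefix of `b`.
  for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = static_cast<int>(j);
  for (std::size_t i = 1; i <= a.size(); ++i) {
    curr[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const substitution = prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
      int const deletion     = prev[j] + 1;
      int const insertion    = curr[j - 1] + 1;
      curr[j] = std::min({substitution, deletion, insertion});
    }
    std::swap(prev, curr);
  }
  int const result = prev[b.size()];
  // Sanity check: the distance never exceeds the longer length.
  assert(result <= static_cast<int>(std::max(a.size(), b.size())));
  return result;
}

// Assumption: std::nullopt models a null row in a strings column.
using nullable_string = std::optional<std::string>;

// Hypothetical reference for the documented row-wise behaviour.
std::vector<int> reference_edit_distance(std::vector<nullable_string> const& input,
                                         std::vector<nullable_string> const& targets)
{
  if (targets.size() != input.size() && targets.size() != 1) {
    throw std::invalid_argument("targets.size() must equal input.size() or be 1");
  }
  std::vector<int> output;
  output.reserve(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    // A single target is compared against every input row.
    auto const& target = targets.size() == 1 ? targets[0] : targets[i];
    // Null rows are treated as empty strings.
    output.push_back(levenshtein(input[i].value_or(""), target.value_or("")));
  }
  return output;
}
```

Called with the inputs from the example above, `reference_edit_distance` yields the same `[1, 7, 0]`. A host-side reference like this is mainly useful for checking expectations on small inputs.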
-
std::unique_ptr<cudf::column> edit_distance_matrix(cudf::strings_column_view const &input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = rmm::mr::get_current_device_resource())
Compute the edit distance between all the strings in the input column.
This uses the Levenshtein algorithm to calculate the edit distance between two strings as documented here: https://www.cuelogic.com/blog/the-levenshtein-algorithm
The output is essentially an input.size() x input.size() square matrix of integers. All values on the diagonal (row == col) are 0 since the edit distance between two identical strings is zero. All values above the diagonal are mirrored below it since the edit distance is symmetric.
Example:
s = ["hello", "hallo", "hella"]
d = edit_distance_matrix(s)
d is now [[0, 1, 1],
          [1, 0, 2],
          [1, 2, 0]]
Null entries in input are treated as empty strings when computing the edit distance.
The output is a lists column of size input.size() in which each list has input.size() elements.
- Throws:
cudf::logic_error – if input.size() == 1
- Parameters:
input – Strings column of input strings
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
New lists column of edit distance values
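The all-pairs layout described above can be sketched on the host in plain C++. This is not the libcudf implementation and does not call libcudf: `levenshtein` and `reference_edit_distance_matrix` are hypothetical names introduced for illustration, a nested `std::vector` stands in for the lists column, and `std::invalid_argument` stands in for `cudf::logic_error`. The sketch compares bytes, so it matches a character-level distance only for single-byte (ASCII) text, and it uses the symmetry of the distance to compute each pair once.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical host-side helper: Levenshtein distance over bytes,
// using a single rolling row of the dynamic-programming table.
int levenshtein(std::string const& a, std::string const& b)
{
  std::vector<int> row(b.size() + 1);
  std::iota(row.begin(), row.end(), 0);  // distance from "" to each prefix of b
  for (std::size_t i = 1; i <= a.size(); ++i) {
    int diagonal = row[0];  // table value at (i-1, j-1)
    row[0] = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const above = row[j];  // table value at (i-1, j)
      int const cost  = a[i - 1] == b[j - 1] ? 0 : 1;
      row[j] = std::min({diagonal + cost, above + 1, row[j - 1] + 1});
      diagonal = above;
    }
  }
  // Sanity check: the distance never exceeds the longer length.
  assert(row.back() <= static_cast<int>(std::max(a.size(), b.size())));
  return row.back();
}

// Hypothetical reference for the documented all-pairs behaviour.
// Row i of the result plays the role of list i in the lists column.
std::vector<std::vector<int>> reference_edit_distance_matrix(
  std::vector<std::string> const& input)
{
  // Mirrors the documented precondition on a single-row input.
  if (input.size() == 1) { throw std::invalid_argument("input.size() must not be 1"); }
  std::size_t const n = input.size();
  std::vector<std::vector<int>> matrix(n, std::vector<int>(n, 0));  // diagonal stays 0
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      int const d = levenshtein(input[i], input[j]);
      matrix[i][j] = d;  // above the diagonal
      matrix[j][i] = d;  // mirrored below the diagonal
    }
  }
  return matrix;
}
```

Called with `{"hello", "hallo", "hella"}`, the sketch produces the same matrix as the example above. Null rows are not modelled here; passing an empty string in their place matches the documented treatment.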