Files | |
file | edit_distance.hpp |
Functions | |
std::unique_ptr< cudf::column > | nvtext::edit_distance (cudf::strings_column_view const &input, cudf::strings_column_view const &targets, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
Compute the edit distance between individual strings in two strings columns. | |
std::unique_ptr< cudf::column > | nvtext::edit_distance_matrix (cudf::strings_column_view const &input, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref()) |
Compute the edit distance between all the strings in the input column. | |
std::unique_ptr<cudf::column> nvtext::edit_distance (cudf::strings_column_view const & input, cudf::strings_column_view const & targets, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Compute the edit distance between individual strings in two strings columns.
The output[i] is the edit distance between input[i] and targets[i]. This edit distance calculation uses the Levenshtein algorithm as documented here: https://www.cuelogic.com/blog/the-levenshtein-algorithm
Null entries in either input or targets are treated as empty strings when the edit distance is computed.
The targets.size() must equal input.size() unless targets.size()==1. In that case, every string in input is compared against the single targets[0] string.
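The row-wise semantics described above can be illustrated with a small host-side sketch. This is not the libcudf implementation: it runs on the CPU, uses only the C++ standard library, compares bytes rather than UTF-8 characters, and models a null row as std::nullopt. The names levenshtein, nullable_string, and edit_distance_rows are hypothetical and exist only for this illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: Levenshtein distance using one rolling row.
// Assumption: compares bytes, not UTF-8 characters.
int levenshtein(std::string const& a, std::string const& b)
{
  std::vector<int> row(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) { row[j] = static_cast<int>(j); }
  for (std::size_t i = 1; i <= a.size(); ++i) {
    int diag = row[0];  // D[i-1][j-1] for the first column
    row[0]   = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const up   = row[j];  // D[i-1][j], saved before it is overwritten
      int const cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      row[j]         = std::min({row[j - 1] + 1, up + 1, diag + cost});
      diag           = up;
    }
  }
  return row[b.size()];
}

// Hypothetical stand-in for a strings column row; std::nullopt models a null.
using nullable_string = std::optional<std::string>;

// Sketch of the documented behavior: output[i] = distance(input[i], targets[i]),
// nulls are treated as empty strings, and a single target is applied to every row.
std::vector<int> edit_distance_rows(std::vector<nullable_string> const& input,
                                    std::vector<nullable_string> const& targets)
{
  if (targets.size() != input.size() && targets.size() != 1) {
    throw std::invalid_argument("targets.size() must equal input.size() or be 1");
  }
  std::vector<int> output;
  output.reserve(input.size());
  for (std::size_t i = 0; i < input.size(); ++i) {
    auto const& target = (targets.size() == 1) ? targets[0] : targets[i];
    output.push_back(levenshtein(input[i].value_or(""), target.value_or("")));
  }
  return output;
}
```

With input {"kitten", null, "abc"} and the single target "sitting", the sketch compares every row against "sitting", and the null row is measured as if it were the empty string.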
cudf::logic_error | if targets.size() != input.size() and targets.size() != 1 |
input | Strings column of input strings |
targets | Strings to compute edit distance against input |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
std::unique_ptr<cudf::column> nvtext::edit_distance_matrix (cudf::strings_column_view const & input, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Compute the edit distance between all the strings in the input column.
This uses the Levenshtein algorithm to calculate the edit distance between two strings as documented here: https://www.cuelogic.com/blog/the-levenshtein-algorithm
The output is essentially an input.size() x input.size() square matrix of integers. All values on the diagonal (row == col) are 0 since the edit distance between two identical strings is zero. All values above the diagonal are mirrored below it since the edit distance is symmetric.
Null entries in input are treated as empty strings when the edit distance is computed.
The output is a lists column of size input.size() in which each list contains input.size() elements.
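The shape and symmetry of the result can be illustrated with a host-side sketch. As an assumption for illustration only, it uses a vector of vectors in place of a lists column, std::nullopt in place of a null row, and byte-wise comparison; the names levenshtein and edit_distance_matrix_rows are hypothetical and this is not how libcudf computes the result on the device.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: Levenshtein distance using one rolling row.
// Assumption: compares bytes, not UTF-8 characters.
int levenshtein(std::string const& a, std::string const& b)
{
  std::vector<int> row(b.size() + 1);
  for (std::size_t j = 0; j <= b.size(); ++j) { row[j] = static_cast<int>(j); }
  for (std::size_t i = 1; i <= a.size(); ++i) {
    int diag = row[0];  // D[i-1][j-1] for the first column
    row[0]   = static_cast<int>(i);
    for (std::size_t j = 1; j <= b.size(); ++j) {
      int const up   = row[j];  // D[i-1][j], saved before it is overwritten
      int const cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
      row[j]         = std::min({row[j - 1] + 1, up + 1, diag + cost});
      diag           = up;
    }
  }
  return row[b.size()];
}

// Sketch of the documented output: an n x n matrix with a zero diagonal,
// where only the upper triangle is computed and then mirrored.
std::vector<std::vector<int>> edit_distance_matrix_rows(
  std::vector<std::optional<std::string>> const& input)
{
  // Mirrors the documented exception condition (a single-row input).
  if (input.size() == 1) { throw std::invalid_argument("input must have more than one row"); }
  std::size_t const n = input.size();
  std::vector<std::vector<int>> matrix(n, std::vector<int>(n, 0));
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      // Nulls are treated as empty strings.
      int const d  = levenshtein(input[i].value_or(""), input[j].value_or(""));
      matrix[i][j] = d;
      matrix[j][i] = d;  // distance is symmetric
    }
  }
  return matrix;
}
```

Computing only the entries above the diagonal halves the number of distance evaluations, which is the practical consequence of the symmetry noted above.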
cudf::logic_error | if input.size() == 1 |
input | Strings column of input strings |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |