Files | |
file | deduplicate.hpp |
std::unique_ptr<rmm::device_uvector<cudf::size_type>> nvtext::build_suffix_array(
  cudf::strings_column_view const& input,
  cudf::size_type min_width,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Builds a suffix array for the input strings column.
Internally, building the suffix array requires roughly 4x the input size in temporary memory. The returned suffix array occupies an additional 4x the input size.
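The memory figures above can be turned into a rough sizing check before calling the function. The following is a minimal sketch only: the struct and function names are hypothetical and not part of nvtext, and the 4x multipliers are taken directly from the description above rather than measured.

```cpp
#include <cstddef>

// Hypothetical helper, not part of nvtext: applies the approximate
// multipliers quoted in the description to a given input size.
struct suffix_array_memory {
  std::size_t temporary_bytes;  // ~4x the input chars size, freed after the call
  std::size_t output_bytes;     // ~4x the input chars size, owned by the result

  // Approximate peak additional device memory while the call is running.
  std::size_t total_bytes() const { return temporary_bytes + output_bytes; }
};

inline suffix_array_memory estimate_suffix_array_memory(std::size_t input_chars_bytes)
{
  constexpr std::size_t multiplier = 4;  // assumption: the "~4x" figure from the docs
  return {multiplier * input_chars_bytes, multiplier * input_chars_bytes};
}
```

For example, under these assumptions a 1,000-byte input would need about 4,000 bytes of temporary memory and 4,000 bytes for the output, on top of the input itself.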
std::invalid_argument | If min_width is greater than the input chars size |
std::invalid_argument | If the input chars size is greater than 2GB |
input | Strings column to build suffix array for |
min_width | Minimum number of bytes that must match to identify a duplicate |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned suffix array's device memory |
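To illustrate what a suffix array contains, here is a host-side sketch using only the C++ standard library. It is not the nvtext implementation, which runs on the GPU. The function name `build_suffix_array_host` is hypothetical, and the treatment of `min_width` (leaving out suffixes shorter than `min_width` bytes) is an assumption made for this example. The two checks mirror the documented exceptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <numeric>
#include <stdexcept>
#include <string_view>
#include <vector>

// Hypothetical host-side illustration: returns the starting byte offsets of
// the suffixes of `chars`, ordered lexicographically by the suffix they point to.
// Assumption: suffixes shorter than min_width bytes are omitted.
inline std::vector<std::int32_t> build_suffix_array_host(std::string_view chars,
                                                         std::int32_t min_width)
{
  if (chars.size() > static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max())) {
    throw std::invalid_argument("input chars size is greater than 2GB");
  }
  auto const size = static_cast<std::int32_t>(chars.size());
  if (min_width < 0 || min_width > size) {
    throw std::invalid_argument("min_width is out of range for the input chars size");
  }

  // One 4-byte index per retained suffix.
  std::vector<std::int32_t> indices(static_cast<std::size_t>(size - min_width + 1));
  std::iota(indices.begin(), indices.end(), 0);

  // Sort the offsets by comparing the suffixes they reference.
  std::sort(indices.begin(), indices.end(), [chars](std::int32_t a, std::int32_t b) {
    return chars.substr(a) < chars.substr(b);
  });
  return indices;
}
```

With the input `"banana"` and a `min_width` of 1, this sketch returns the offsets `5, 3, 1, 0, 4, 2`, which correspond to the sorted suffixes `a`, `ana`, `anana`, `banana`, `na`, `nana`. Comparing whole suffixes in the sort is simple but slow for large inputs; it is used here only to make the ordering easy to follow.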
std::unique_ptr<cudf::column> nvtext::resolve_duplicates(
  cudf::strings_column_view const& input,
  cudf::device_span<cudf::size_type const> indices,
  cudf::size_type min_width,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns duplicate strings found in the given input.
The output includes any strings of at least min_width bytes that appear more than once in the entire input.
The result is undefined if indices was not created from the same input provided here.
std::invalid_argument | If min_width <= 8 |
std::invalid_argument | If min_width is greater than the input chars size |
std::invalid_argument | If the input chars size is greater than 2GB |
input | Strings column for indices |
indices | Suffix array from nvtext::build_suffix_array |
min_width | Minimum number of bytes that must match to identify a duplicate |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
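The idea of finding repeated byte sequences from a suffix array can be sketched on the host as follows. This is an illustration under stated assumptions, not the nvtext algorithm: the names `common_prefix_length` and `resolve_duplicates_host` are hypothetical, the input is a single character buffer rather than a strings column, the `min_width > 8` check is omitted so that short examples work, and overlapping or nested matches are reported as they are found rather than merged.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Number of leading bytes shared by a and b.
inline std::size_t common_prefix_length(std::string_view a, std::string_view b)
{
  auto const n = std::min(a.size(), b.size());
  std::size_t i = 0;
  while (i < n && a[i] == b[i]) { ++i; }
  return i;
}

// Hypothetical host-side illustration. `indices` must be a suffix array of
// `chars`; neighbouring entries then reference the most similar suffixes, so a
// shared prefix of at least min_width bytes marks a repeated sequence.
inline std::vector<std::string> resolve_duplicates_host(std::string_view chars,
                                                        std::vector<std::int32_t> const& indices,
                                                        std::int32_t min_width)
{
  std::set<std::string> found;  // keeps the results unique and sorted
  for (std::size_t i = 1; i < indices.size(); ++i) {
    auto const left   = chars.substr(indices[i - 1]);
    auto const right  = chars.substr(indices[i]);
    auto const length = common_prefix_length(left, right);
    if (length >= static_cast<std::size_t>(min_width)) {
      found.insert(std::string(left.substr(0, length)));
    }
  }
  return std::vector<std::string>(found.begin(), found.end());
}
```

For `"banana"` with the suffix array `5, 3, 1, 0, 4, 2` and a `min_width` of 2, the sketch reports `ana` and `na`. It also shows why the indices must come from the same input: the offsets are only meaningful as positions in the buffer they were built from.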
std::unique_ptr<cudf::column> nvtext::resolve_duplicates_pair(
  cudf::strings_column_view const& input1,
  cudf::device_span<cudf::size_type const> indices1,
  cudf::strings_column_view const& input2,
  cudf::device_span<cudf::size_type const> indices2,
  cudf::size_type min_width,
  rmm::cuda_stream_view stream = cudf::get_default_stream(),
  rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns duplicate strings from input1 that are also found in input2.
The output includes any strings of at least min_width bytes that appear more than once between input1 and input2.
The result is undefined if indices1 was not created from input1 or indices2 was not created from input2.
std::invalid_argument | If min_width <= 8 |
std::invalid_argument | If min_width is greater than the input chars size |
std::invalid_argument | If the input chars size is greater than 2GB |
input1 | Strings column for indices1 |
indices1 | Suffix array from nvtext::build_suffix_array for input1 |
input2 | Strings column for indices2 |
indices2 | Suffix array from nvtext::build_suffix_array for input2 |
min_width | Minimum number of bytes that must match to identify a duplicate |
stream | CUDA stream used for device memory operations and kernel launches |
mr | Device memory resource used to allocate the returned column's device memory |
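Matching across two inputs can be sketched on the host in a similar way. Again, this is only an illustration and not the nvtext algorithm: `resolve_duplicates_pair_host` and `shared_prefix_length` are hypothetical names, each input is a single character buffer, the `min_width > 8` check is omitted, and overlapping matches are not merged. The sketch relies on one property of sorted suffixes: the suffix of the second input that shares the longest prefix with a given string sits next to that string's insertion point in the second suffix array.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <set>
#include <string>
#include <string_view>
#include <vector>

// Number of leading bytes shared by a and b.
inline std::size_t shared_prefix_length(std::string_view a, std::string_view b)
{
  auto const n = std::min(a.size(), b.size());
  std::size_t i = 0;
  while (i < n && a[i] == b[i]) { ++i; }
  return i;
}

// Hypothetical host-side illustration. indices1 and indices2 must be suffix
// arrays of chars1 and chars2 respectively.
inline std::vector<std::string> resolve_duplicates_pair_host(
  std::string_view chars1,
  std::vector<std::int32_t> const& indices1,
  std::string_view chars2,
  std::vector<std::int32_t> const& indices2,
  std::int32_t min_width)
{
  std::set<std::string> found;  // keeps the results unique and sorted
  for (auto const offset : indices1) {
    auto const suffix = chars1.substr(offset);
    // Binary search for the first suffix of chars2 not ordered before `suffix`.
    auto const pos = std::lower_bound(
      indices2.begin(), indices2.end(), suffix,
      [chars2](std::int32_t idx, std::string_view value) { return chars2.substr(idx) < value; });
    // The longest shared prefix is with one of the two neighbours of `pos`.
    std::size_t best = 0;
    if (pos != indices2.end()) {
      best = std::max(best, shared_prefix_length(suffix, chars2.substr(*pos)));
    }
    if (pos != indices2.begin()) {
      best = std::max(best, shared_prefix_length(suffix, chars2.substr(*std::prev(pos))));
    }
    if (best >= static_cast<std::size_t>(min_width)) {
      found.insert(std::string(suffix.substr(0, best)));
    }
  }
  return std::vector<std::string>(found.begin(), found.end());
}
```

With `"banana"` (suffix array `5, 3, 1, 0, 4, 2`) as the first input, `"bandana"` (suffix array `6, 4, 1, 0, 3, 5, 2`) as the second, and a `min_width` of 3, the sketch reports `ana` and `ban`. Swapping the two suffix arrays would make the binary search compare against the wrong buffer, which is the host-side analogue of the undefined result described above.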