Deduplication

Files

file  deduplicate.hpp
 

Functions

std::unique_ptr< rmm::device_uvector< cudf::size_type > > nvtext::build_suffix_array (cudf::strings_column_view const &input, cudf::size_type min_width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Builds a suffix array for the input strings column.
 
std::unique_ptr< cudf::column > nvtext::resolve_duplicates (cudf::strings_column_view const &input, cudf::device_span< cudf::size_type const > indices, cudf::size_type min_width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns duplicate strings found in the given input.
 
std::unique_ptr< cudf::column > nvtext::resolve_duplicates_pair (cudf::strings_column_view const &input1, cudf::device_span< cudf::size_type const > indices1, cudf::strings_column_view const &input2, cudf::device_span< cudf::size_type const > indices2, cudf::size_type min_width, rmm::cuda_stream_view stream=cudf::get_default_stream(), rmm::device_async_resource_ref mr=cudf::get_current_device_resource_ref())
 Returns duplicate strings from input1 that are also found in input2.
 

Detailed Description

Function Documentation

◆ build_suffix_array()

Builds a suffix array for the input strings column.

The internal implementation creates a suffix array of the input which requires ~4x the input size for temporary memory. The output is an additional 4x of the input size.
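The multipliers above can be turned into a rough memory budget before calling the function. The following is a minimal sketch; `suffix_array_budget` and `estimate_suffix_array_budget` are hypothetical names, not part of nvtext, and the peak figure assumes the input, the temporary memory, and the output are all resident at the same time.

```cpp
#include <cstddef>

// Rough device-memory budget for building a suffix array, using the
// ~4x temporary and 4x output multipliers quoted in the description.
// These are estimates, not guarantees made by the library.
struct suffix_array_budget {
  std::size_t temporary_bytes;  // scratch memory used during the build
  std::size_t output_bytes;     // memory held by the returned suffix array
  std::size_t peak_bytes;       // input + temporary + output (assumed to coexist)
};

// Hypothetical helper (not part of nvtext).
inline suffix_array_budget estimate_suffix_array_budget(std::size_t chars_bytes)
{
  suffix_array_budget b{};
  b.temporary_bytes = 4 * chars_bytes;
  b.output_bytes    = 4 * chars_bytes;
  b.peak_bytes      = chars_bytes + b.temporary_bytes + b.output_bytes;
  return b;
}
```

Under these assumptions, an input with 1,000 bytes of chars would need roughly 9,000 bytes at peak.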

Exceptions
std::invalid_argument  If min_width is greater than the input chars size
std::invalid_argument  If the input chars size is greater than 2GB
Parameters
input  Strings column to build suffix array for
min_width  Minimum number of bytes that must match to identify a duplicate
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
Sorted suffix array and corresponding sizes
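For readers unfamiliar with the data structure, a suffix array is the list of start offsets of every suffix of a byte sequence, sorted so that the suffixes appear in lexicographic order. The sketch below builds one on the host with the C++ standard library. It is an illustration of the concept only, not the libcudf implementation: `build_suffix_array_host` is a hypothetical name, it takes a plain `std::string` rather than a strings column, and it ignores `min_width` because this page does not say how the device function applies it.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <string>
#include <vector>

// Host-side illustration only (hypothetical helper, not part of nvtext).
// Returns the start offsets of all suffixes of `chars`, ordered so that
// the suffixes themselves are in byte-wise lexicographic order.
// Simple comparison sort: fine for small inputs, slow for large ones.
inline std::vector<std::int32_t> build_suffix_array_host(std::string const& chars)
{
  std::vector<std::int32_t> sa(chars.size());
  std::iota(sa.begin(), sa.end(), 0);  // 0, 1, 2, ..., n-1
  std::sort(sa.begin(), sa.end(), [&chars](std::int32_t a, std::int32_t b) {
    // Compare the suffix starting at a with the suffix starting at b.
    return chars.compare(static_cast<std::size_t>(a), std::string::npos,
                         chars, static_cast<std::size_t>(b), std::string::npos) < 0;
  });
  return sa;
}
```

For the input `banana`, the sorted suffixes are `a`, `ana`, `anana`, `banana`, `na`, `nana`, so the sketch returns the offsets 5, 3, 1, 0, 4, 2.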

◆ resolve_duplicates()

std::unique_ptr<cudf::column> nvtext::resolve_duplicates ( cudf::strings_column_view const &  input,
cudf::device_span< cudf::size_type const >  indices,
cudf::size_type  min_width,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns duplicate strings found in the given input.

The output includes any strings of at least min_width bytes that appear more than once in the entire input.

The result is undefined if the indices were not created from the same input provided here.

Exceptions
If min_width <= 8
If min_width is greater than the input chars size
If the input chars size is greater than 2GB
Parameters
input  Strings column for indices
indices  Suffix array from nvtext::build_suffix_array
min_width  Minimum number of bytes that must match to identify a duplicate
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column with updated strings
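The reason a sorted suffix array helps here is that suffixes sharing a long prefix end up next to each other, so repeated byte runs can be found by comparing neighbours. The host-side sketch below shows that idea with the C++ standard library. It is an assumption about the general technique, not a description of what libcudf does internally: `common_prefix` and `find_duplicates_host` are hypothetical names, the input is a plain `std::string`, the `min_width <= 8` restriction is not enforced, and the result is a set of the longest match found for each adjacent pair rather than a strings column.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Length of the common prefix of the suffixes of `s` starting at a and b.
inline std::size_t common_prefix(std::string const& s, std::size_t a, std::size_t b)
{
  std::size_t n = 0;
  while (a + n < s.size() && b + n < s.size() && s[a + n] == s[b + n]) { ++n; }
  return n;
}

// Host-side illustration only (hypothetical helper, not part of nvtext).
// `sa` must be a suffix array built over the same `chars`.
// Reports every byte run of at least `min_width` bytes that is shared by
// two suffixes that are adjacent in sorted order.
inline std::set<std::string> find_duplicates_host(std::string const& chars,
                                                  std::vector<std::int32_t> const& sa,
                                                  std::size_t min_width)
{
  std::set<std::string> out;
  for (std::size_t i = 1; i < sa.size(); ++i) {
    auto const prev = static_cast<std::size_t>(sa[i - 1]);
    auto const curr = static_cast<std::size_t>(sa[i]);
    auto const len  = common_prefix(chars, prev, curr);
    if (len >= min_width) { out.insert(chars.substr(curr, len)); }
  }
  return out;
}
```

With `chars = "banana"`, the suffix array 5, 3, 1, 0, 4, 2 and `min_width = 3`, the only adjacent pair that shares at least three bytes is `ana` / `anana`, so the sketch reports `ana`. As with the device function, passing a suffix array built from a different input gives meaningless results.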

◆ resolve_duplicates_pair()

std::unique_ptr<cudf::column> nvtext::resolve_duplicates_pair ( cudf::strings_column_view const &  input1,
cudf::device_span< cudf::size_type const >  indices1,
cudf::strings_column_view const &  input2,
cudf::device_span< cudf::size_type const >  indices2,
cudf::size_type  min_width,
rmm::cuda_stream_view  stream = cudf::get_default_stream(),
rmm::device_async_resource_ref  mr = cudf::get_current_device_resource_ref() 
)

Returns duplicate strings from input1 that are also found in input2.

The output includes any strings of at least min_width bytes that appear more than once between input1 and input2.

The result is undefined if indices1 was not created from input1 or indices2 was not created from input2.

Exceptions
If min_width <= 8
If min_width is greater than the input chars size
If the input chars size is greater than 2GB
Parameters
input1  Strings column for indices1
indices1  Suffix array from nvtext::build_suffix_array for input1
input2  Strings column for indices2
indices2  Suffix array from nvtext::build_suffix_array for input2
min_width  Minimum number of bytes that must match to identify a duplicate
stream  CUDA stream used for device memory operations and kernel launches
mr  Device memory resource used to allocate the returned column's device memory
Returns
New strings column with updated strings
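One way to picture the two-input case is as a lookup: take each suffix of the first input and binary-search the sorted suffixes of the second input for one that starts with the same `min_width` bytes. The host-side sketch below does that with the C++ standard library. It is an illustrative assumption, not the libcudf algorithm: `find_shared_host` is a hypothetical name, the inputs are plain `std::string` values, the `min_width <= 8` restriction is not enforced, and the result is a set of fixed-width `min_width`-byte matches rather than a strings column.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <set>
#include <string>
#include <vector>

// Host-side illustration only (hypothetical helper, not part of nvtext).
// `sa1` must be a suffix array over `text1` and `sa2` one over `text2`.
// Reports each `min_width`-byte run of text1 that also occurs in text2.
inline std::set<std::string> find_shared_host(std::string const& text1,
                                              std::vector<std::int32_t> const& sa1,
                                              std::string const& text2,
                                              std::vector<std::int32_t> const& sa2,
                                              std::size_t min_width)
{
  std::set<std::string> out;
  if (min_width == 0) { return out; }
  for (auto const p : sa1) {
    auto const start = static_cast<std::size_t>(p);
    if (start + min_width > text1.size()) { continue; }  // suffix too short
    std::string const key = text1.substr(start, min_width);
    // First suffix of text2 whose leading min_width bytes are not less than key.
    // Truncating sorted suffixes to a fixed width keeps them in order,
    // so binary search is valid here.
    auto const it = std::lower_bound(
      sa2.begin(), sa2.end(), key, [&](std::int32_t pos, std::string const& k) {
        return text2.compare(static_cast<std::size_t>(pos), min_width, k) < 0;
      });
    if (it != sa2.end() &&
        text2.compare(static_cast<std::size_t>(*it), min_width, key) == 0) {
      out.insert(key);
    }
  }
  return out;
}
```

With `text1 = "abcde"` (suffix array 0, 1, 2, 3, 4), `text2 = "zcdez"` (suffix array 1, 2, 3, 4, 0) and `min_width = 3`, the only three-byte run of `text1` that also occurs in `text2` is `cde`.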