Nvtext Minhash
- group MinHashing
Functions
-
std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const &input, uint32_t seed, cudf::device_span<uint32_t const> parameter_a, cudf::device_span<uint32_t const> parameter_b, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns the minhash values for each string.
This function uses MurmurHash3_x86_32 for the hash algorithm.
The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:
    max_hash = max of uint32
    mp = (1 << 61) - 1
    hv[i] = hash value of a substring at i
    pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
This calculation is performed on each substring and the minimum value is computed as follows:
mh[j,i] = min(pv[i]) for all substrings in row j and where i=[0,a.size())
Any null row entries result in corresponding null output rows.
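The formula above can be sketched on the host as follows. This is a minimal CPU illustration, not the library's GPU implementation. The name permuted_minhash32 is hypothetical, the hashes argument stands in for the MurmurHash3_x86_32 values of one row's substrings, and the sketch reads the formula as "for each parameter pair i, take the minimum over all substrings of the row".

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical host-side helper, for illustration only (not part of nvtext).
// `hashes` holds the already-computed 32-bit hash values of one row's substrings.
// Precondition (mirrors the documented check): a.size() == b.size().
std::vector<std::uint32_t> permuted_minhash32(std::vector<std::uint32_t> const& hashes,
                                              std::vector<std::uint32_t> const& a,
                                              std::vector<std::uint32_t> const& b)
{
  constexpr std::uint64_t mp       = (std::uint64_t{1} << 61) - 1;
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint32_t>::max();

  // One output value per (a[i], b[i]) pair, starting from the largest possible value.
  std::vector<std::uint32_t> mh(a.size(), std::numeric_limits<std::uint32_t>::max());
  for (std::size_t i = 0; i < a.size(); ++i) {
    for (std::uint32_t hv : hashes) {
      // The 32-bit operands are widened to 64 bits, so hv * a[i] + b[i] cannot overflow.
      std::uint64_t const pv = ((std::uint64_t{hv} * a[i] + b[i]) % mp) & max_hash;
      mh[i] = std::min(mh[i], static_cast<std::uint32_t>(pv));
    }
  }
  return mh;
}
```

For hash values {10, 3, 7} with a = {1, 2} and b = {0, 5}, the sketch yields {3, 11}: the first pair leaves the hashes unchanged, and the second maps them to 25, 11 and 19.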
- Throws:
std::invalid_argument – if the width < 2
std::invalid_argument – if parameter_a is empty
std::invalid_argument – if parameter_b.size() != parameter_a.size()
std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit
- Parameters:
input – Strings column to compute minhash
seed – Seed value used for the hash algorithm
parameter_a – Values used for the permuted calculation
parameter_b – Values used for the permuted calculation
width – The character width of substrings to hash for each row
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
List column of minhash values for each string, with one value per parameter_a/parameter_b pair
-
std::unique_ptr<cudf::column> minhash64(cudf::strings_column_view const &input, uint64_t seed, cudf::device_span<uint64_t const> parameter_a, cudf::device_span<uint64_t const> parameter_b, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns the minhash values for each string.
This function uses MurmurHash3_x64_128 for the hash algorithm.
The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:
    max_hash = max of uint64
    mp = (1 << 61) - 1
    hv[i] = hash value of a substring at i
    pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
This calculation is performed on each substring and the minimum value is computed as follows:
mh[j,i] = min(pv[i]) for all substrings in row j and where i=[0,a.size())
Any null row entries result in corresponding null output rows.
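With 64-bit operands the product hv[i] * a[i] can exceed 64 bits, and the formula does not say how that case is handled. The sketch below evaluates the formula literally in uint64_t and assumes ordinary C++ unsigned wraparound (modulo 2^64); the library may behave differently. The name permute64 is hypothetical.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper: literal uint64_t evaluation of
//   pv = ((hv * a + b) % mp) & max_hash
// Assumption: hv * a + b wraps modulo 2^64, as plain C++ unsigned arithmetic does.
constexpr std::uint64_t permute64(std::uint64_t hv, std::uint64_t a, std::uint64_t b)
{
  constexpr std::uint64_t mp       = (std::uint64_t{1} << 61) - 1;
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint64_t>::max();
  // With a 64-bit max_hash the mask keeps every bit; the % mp step bounds the result below mp.
  return ((hv * a + b) % mp) & max_hash;
}
```

Under this reading every permuted value is smaller than mp, so the minimum for a row is also smaller than 2^61 - 1.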
- Throws:
std::invalid_argument – if the width < 2
std::invalid_argument – if parameter_a is empty
std::invalid_argument – if parameter_b.size() != parameter_a.size()
std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit
- Parameters:
input – Strings column to compute minhash
seed – Seed value used for the hash algorithm
parameter_a – Values used for the permuted calculation
parameter_b – Values used for the permuted calculation
width – The character width of substrings to hash for each row
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
List column of minhash values for each string, with one value per parameter_a/parameter_b pair
-
std::unique_ptr<cudf::column> minhash_ngrams(cudf::lists_column_view const &input, cudf::size_type ngrams, uint32_t seed, cudf::device_span<uint32_t const> parameter_a, cudf::device_span<uint32_t const> parameter_b, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns the minhash values for each input row.
This function uses MurmurHash3_x86_32 for the hash algorithm.
The input row is first hashed using the given seed over a sliding window of ngrams of strings. These hash values are then combined with the a and b parameter values using the following formula:
    max_hash = max of uint32
    mp = (1 << 61) - 1
    hv[i] = hash value of the ngrams at i
    pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
This calculation is performed on each set of ngrams and the minimum value is computed as follows:
mh[j,i] = min(pv[i]) for all ngrams in row j and where i=[0,a.size())
Any null row entries result in corresponding null output rows.
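The sliding window over a row's strings can be pictured with the host-side sketch below. It only enumerates the windows. How the library combines the strings of a window before hashing, and how it treats rows with fewer than ngrams strings, is not stated here, so the sketch returns no windows in that case as an assumption. The name ngram_windows is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only (not part of nvtext).
// Returns every run of `ngrams` consecutive strings in `row`, in order.
std::vector<std::vector<std::string>> ngram_windows(std::vector<std::string> const& row,
                                                    std::size_t ngrams)
{
  std::vector<std::vector<std::string>> windows;
  // Assumption: a row shorter than `ngrams` produces no windows.
  if (ngrams == 0 || row.size() < ngrams) { return windows; }
  for (std::size_t i = 0; i + ngrams <= row.size(); ++i) {
    auto const first = row.begin() + static_cast<std::ptrdiff_t>(i);
    auto const last  = first + static_cast<std::ptrdiff_t>(ngrams);
    windows.emplace_back(first, last);  // copy strings [i, i + ngrams)
  }
  return windows;
}
```

A row of four strings with ngrams = 2 gives three windows; each window would then be hashed once and passed through the permutation formula.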
- Throws:
std::invalid_argument – if the ngrams < 2
std::invalid_argument – if parameter_a is empty
std::invalid_argument – if parameter_b.size() != parameter_a.size()
std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit
- Parameters:
input – List strings column to compute minhash
ngrams – The number of strings to hash within each row
seed – Seed value used for the hash algorithm
parameter_a – Values used for the permuted calculation
parameter_b – Values used for the permuted calculation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
List column of minhash values for each input row, with one value per parameter_a/parameter_b pair
-
std::unique_ptr<cudf::column> minhash64_ngrams(cudf::lists_column_view const &input, cudf::size_type ngrams, uint64_t seed, cudf::device_span<uint64_t const> parameter_a, cudf::device_span<uint64_t const> parameter_b, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())
Returns the minhash values for each input row.
This function uses MurmurHash3_x64_128 for the hash algorithm.
The input row is first hashed using the given seed over a sliding window of ngrams of strings. These hash values are then combined with the a and b parameter values using the following formula:
    max_hash = max of uint64
    mp = (1 << 61) - 1
    hv[i] = hash value of the ngrams at i
    pv[i] = ((hv[i] * a[i] + b[i]) % mp) & max_hash
This calculation is performed on each set of ngrams and the minimum value is computed as follows:
mh[j,i] = min(pv[i]) for all ngrams in row j and where i=[0,a.size())
Any null row entries result in corresponding null output rows.
- Throws:
std::invalid_argument – if the ngrams < 2
std::invalid_argument – if parameter_a is empty
std::invalid_argument – if parameter_b.size() != parameter_a.size()
std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit
- Parameters:
input – List strings column to compute minhash
ngrams – The number of strings to hash within each row
seed – Seed value used for the hash algorithm
parameter_a – Values used for the permuted calculation
parameter_b – Values used for the permuted calculation
stream – CUDA stream used for device memory operations and kernel launches
mr – Device memory resource used to allocate the returned column’s device memory
- Returns:
List column of minhash values for each input row, with one value per parameter_a/parameter_b pair