Nvtext Minhash

group nvtext_minhash

Functions

std::unique_ptr<cudf::column> minhash(cudf::strings_column_view const &input, uint32_t seed, cudf::device_span<uint32_t const> parameter_a, cudf::device_span<uint32_t const> parameter_b, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns the minhash values for each string.

This function uses MurmurHash3_x86_32 for the hash algorithm.

The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint32
mp = (1 << 61) - 1
hv[k] = hash value of the substring at position k
pv[k,i] = ((hv[k] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each substring, and the minimum over all substrings is kept for each parameter pair:

mh[j,i] = min(pv[k,i]) over all substrings k in row j
                       for each i in [0, a.size())
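
As a host-side illustration of the formula above, the following C++ sketch computes the minhash values for a single row. It is not the library's implementation: toy_hash32 is a seeded FNV-1a stand-in for MurmurHash3_x86_32, the helper names permute32 and minhash_row are hypothetical, substrings are taken over bytes rather than characters, a row shorter than width is assumed to be hashed as a single substring, and the intermediate arithmetic is assumed to be 64-bit. Its outputs should therefore not be expected to match nvtext::minhash().

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <string>
#include <vector>

// Stand-in 32-bit hash (seeded FNV-1a) over bytes [pos, pos + len) of s.
// NOT MurmurHash3_x86_32; used only to keep the sketch self-contained.
inline std::uint32_t toy_hash32(std::string const& s,
                                std::size_t pos,
                                std::size_t len,
                                std::uint32_t seed)
{
  std::uint32_t h = 2166136261u ^ seed;
  for (std::size_t k = pos; k < pos + len; ++k) {
    h ^= static_cast<unsigned char>(s[k]);
    h *= 16777619u;
  }
  return h;
}

// One permuted value: pv = ((hv * a + b) % mp) & max_hash.
inline std::uint32_t permute32(std::uint32_t hv, std::uint32_t a, std::uint32_t b)
{
  constexpr std::uint64_t mp       = (std::uint64_t{1} << 61) - 1;
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint32_t>::max();
  // Assumption: the intermediate is held in 64 bits, where hv * a + b
  // cannot overflow for 32-bit operands.
  std::uint64_t const v = std::uint64_t{hv} * a + b;
  return static_cast<std::uint32_t>((v % mp) & max_hash);
}

// Minhash values for one row: one output per (a[i], b[i]) pair.
// Precondition (as documented): a.size() == b.size().
inline std::vector<std::uint32_t> minhash_row(std::string const& row,
                                              std::uint32_t seed,
                                              std::vector<std::uint32_t> const& a,
                                              std::vector<std::uint32_t> const& b,
                                              std::size_t width)
{
  std::vector<std::uint32_t> mh(a.size(), std::numeric_limits<std::uint32_t>::max());
  // Assumption: a row shorter than `width` is hashed as one substring.
  std::size_t const len   = std::min(width, row.size());
  std::size_t const count = row.size() - len + 1;
  for (std::size_t k = 0; k < count; ++k) {
    std::uint32_t const hv = toy_hash32(row, k, len, seed);
    for (std::size_t i = 0; i < a.size(); ++i) {
      mh[i] = std::min(mh[i], permute32(hv, a[i], b[i]));
    }
  }
  return mh;
}
```

With a[i] = 1 and b[i] = 0 the permutation leaves each 32-bit hash unchanged, so the corresponding output is simply the smallest substring hash in the row.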

Any null row entries result in corresponding null output rows.

Throws:
  • std::invalid_argument – if width < 2

  • std::invalid_argument – if parameter_a is empty

  • std::invalid_argument – if parameter_b.size() != parameter_a.size()

  • std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit
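
The documented preconditions can be pictured with a small host-side check. This is only a sketch: check_minhash_args is a hypothetical helper, not part of nvtext, and the column size limit is assumed to be the largest signed 32-bit value (the usual range of cudf::size_type).

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical helper mirroring the documented exceptions; not an nvtext API.
inline void check_minhash_args(std::size_t input_rows,
                               std::size_t a_size,
                               std::size_t b_size,
                               std::int32_t width)
{
  if (width < 2) { throw std::invalid_argument("width must be at least 2"); }
  if (a_size == 0) { throw std::invalid_argument("parameter_a must not be empty"); }
  if (b_size != a_size) {
    throw std::invalid_argument("parameter_b.size() must equal parameter_a.size()");
  }
  // Assumption: the column size limit is the maximum of a signed 32-bit integer.
  constexpr std::size_t limit =
    static_cast<std::size_t>(std::numeric_limits<std::int32_t>::max());
  // Division avoids overflowing the product a_size * input_rows.
  if (input_rows != 0 && a_size > limit / input_rows) {
    throw std::overflow_error("parameter_a.size() * input.size() exceeds the column size limit");
  }
}
```

The product check reflects the output shape: each input row yields parameter_a.size() values, so the total number of values must fit in a single column.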

Parameters:
  • input – Strings column to compute minhash

  • seed – Seed value used for the hash algorithm

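  • parameter_a – Multiplier values (a in the formula above) used for the permuted calculation

  • parameter_b – Additive values (b in the formula above) used for the permuted calculation; must have the same size as parameter_a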

  • width – The character width of substrings to hash for each row

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

List column of minhash values for each string; each non-null row holds one value per parameter_a/parameter_b pair

std::unique_ptr<cudf::column> minhash_permuted(cudf::strings_column_view const &input, uint32_t seed, cudf::device_span<uint32_t const> parameter_a, cudf::device_span<uint32_t const> parameter_b, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns the minhash values for each string.

This function uses MurmurHash3_x86_32 for the hash algorithm.

The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint32
mp = (1 << 61) - 1
hv[k] = hash value of the substring at position k
pv[k,i] = ((hv[k] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each substring, and the minimum over all substrings is kept for each parameter pair:

mh[j,i] = min(pv[k,i]) over all substrings k in row j
                       for each i in [0, a.size())

Any null row entries result in corresponding null output rows.

Deprecated:

Use nvtext::minhash()

Throws:
  • std::invalid_argument – if width < 2

  • std::invalid_argument – if parameter_a is empty

  • std::invalid_argument – if parameter_b.size() != parameter_a.size()

  • std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit

Parameters:
  • input – Strings column to compute minhash

  • seed – Seed value used for the hash algorithm

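  • parameter_a – Multiplier values (a in the formula above) used for the permuted calculation

  • parameter_b – Additive values (b in the formula above) used for the permuted calculation; must have the same size as parameter_a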

  • width – The character width of substrings to hash for each row

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

List column of minhash values for each string; each non-null row holds one value per parameter_a/parameter_b pair

std::unique_ptr<cudf::column> minhash64(cudf::strings_column_view const &input, uint64_t seed, cudf::device_span<uint64_t const> parameter_a, cudf::device_span<uint64_t const> parameter_b, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns the minhash values for each string.

This function uses MurmurHash3_x64_128 for the hash algorithm.

The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint64
mp = (1 << 61) - 1
hv[k] = hash value of the substring at position k
pv[k,i] = ((hv[k] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each substring, and the minimum over all substrings is kept for each parameter pair:

mh[j,i] = min(pv[k,i]) over all substrings k in row j
                       for each i in [0, a.size())
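
For the 64-bit variant, a single permuted value can be sketched on the host as follows. The helper name permute64 is hypothetical, and the sketch assumes that hv * a + b is evaluated in ordinary unsigned 64-bit arithmetic (wrapping modulo 2^64); the documentation above does not state the intermediate width, so treat this as an illustration of the formula rather than of the library's exact arithmetic.

```cpp
#include <cstdint>
#include <limits>

// One permuted 64-bit value: pv = ((hv * a + b) % mp) & max_hash.
inline std::uint64_t permute64(std::uint64_t hv, std::uint64_t a, std::uint64_t b)
{
  constexpr std::uint64_t mp       = (std::uint64_t{1} << 61) - 1;
  constexpr std::uint64_t max_hash = std::numeric_limits<std::uint64_t>::max();
  // Assumption: the product and sum wrap modulo 2^64.
  std::uint64_t const v = hv * a + b;
  return (v % mp) & max_hash;
}
```

Because every value is reduced modulo mp before masking, the results of the formula as written are always below 2^61 - 1, and the mask with the uint64 maximum leaves them unchanged.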

Any null row entries result in corresponding null output rows.

Throws:
  • std::invalid_argument – if width < 2

  • std::invalid_argument – if parameter_a is empty

  • std::invalid_argument – if parameter_b.size() != parameter_a.size()

  • std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit

Parameters:
  • input – Strings column to compute minhash

  • seed – Seed value used for the hash algorithm

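  • parameter_a – Multiplier values (a in the formula above) used for the permuted calculation

  • parameter_b – Additive values (b in the formula above) used for the permuted calculation; must have the same size as parameter_a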

  • width – The character width of substrings to hash for each row

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

List column of minhash values for each string; each non-null row holds one value per parameter_a/parameter_b pair

std::unique_ptr<cudf::column> minhash64_permuted(cudf::strings_column_view const &input, uint64_t seed, cudf::device_span<uint64_t const> parameter_a, cudf::device_span<uint64_t const> parameter_b, cudf::size_type width, rmm::cuda_stream_view stream = cudf::get_default_stream(), rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref())

Returns the minhash values for each string.

This function uses MurmurHash3_x64_128 for the hash algorithm.

The input strings are first hashed using the given seed over substrings of width characters. These hash values are then combined with the a and b parameter values using the following formula:

max_hash = max of uint64
mp = (1 << 61) - 1
hv[k] = hash value of the substring at position k
pv[k,i] = ((hv[k] * a[i] + b[i]) % mp) & max_hash

This calculation is performed on each substring, and the minimum over all substrings is kept for each parameter pair:

mh[j,i] = min(pv[k,i]) over all substrings k in row j
                       for each i in [0, a.size())

Any null row entries result in corresponding null output rows.

Deprecated:

Use nvtext::minhash64()

Throws:
  • std::invalid_argument – if width < 2

  • std::invalid_argument – if parameter_a is empty

  • std::invalid_argument – if parameter_b.size() != parameter_a.size()

  • std::overflow_error – if parameter_a.size() * input.size() exceeds the column size limit

Parameters:
  • input – Strings column to compute minhash

  • seed – Seed value used for the hash algorithm

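  • parameter_a – Multiplier values (a in the formula above) used for the permuted calculation

  • parameter_b – Additive values (b in the formula above) used for the permuted calculation; must have the same size as parameter_a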

  • width – The character width of substrings to hash for each row

  • stream – CUDA stream used for device memory operations and kernel launches

  • mr – Device memory resource used to allocate the returned column’s device memory

Returns:

List column of minhash values for each string; each non-null row holds one value per parameter_a/parameter_b pair