cudf.core.accessors.string.StringMethods.minhash

StringMethods.minhash(seed: int, a: ColumnLike, b: ColumnLike, width: int) → Series | Index

Compute the minhash of a strings column or of a list column of strings (terms).

The hash function is MurmurHash3_x86_32 if a and b are of type np.uint32, or MurmurHash3_x86_128 if a and b are of type np.uint64.

Each value is calculated as (hv * a + b) % mersenne_prime, where hv is the hash of a substring of width characters (or, for a list column, of an ngram of width strings), a and b are the provided values, and mersenne_prime is 2^61-1.
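The formula above can be sketched in plain Python for a single string. This is an illustration only, not the cudf implementation: `stand_in_hash`, `char_windows`, and `minhash_sketch` are hypothetical names, CRC32 is used as a stand-in for MurmurHash3, the handling of strings shorter than width is an assumption, and the sketch assumes the smallest value per (a, b) pair is kept. Its output will not match the values cudf returns.

```python
import zlib

MERSENNE_PRIME = (1 << 61) - 1  # 2^61-1, as given in the formula


def stand_in_hash(text, seed):
    # Assumption: CRC32 replaces MurmurHash3 purely for illustration.
    return zlib.crc32(text.encode("utf-8"), seed) & 0xFFFFFFFF


def char_windows(s, width):
    # Every substring of `width` characters.
    # Assumption: a string no longer than `width` is hashed as a whole.
    if len(s) <= width:
        return [s]
    return [s[i:i + width] for i in range(len(s) - width + 1)]


def minhash_sketch(s, seed, a, b, width):
    # Hash each window once, then apply (hv * a + b) % mersenne_prime
    # for every (a, b) pair and keep the smallest result per pair.
    hashes = [stand_in_hash(w, seed) for w in char_windows(s, width)]
    return [
        min((hv * ai + bi) % MERSENNE_PRIME for hv in hashes)
        for ai, bi in zip(a, b)
    ]


signature = minhash_sketch("this is my", 0, [1, 2, 3], [4, 5, 6], 5)
print(signature)  # one value per (a, b) pair
```

For a list column, the same idea would apply with windows of width consecutive strings in place of character substrings.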

Parameters:
seed : int

The seed used for the hash algorithm.

a : ColumnLike

Values for minhash calculation. Must be of type uint32 or uint64.

b : ColumnLike

Values for minhash calculation. Must be of type uint32 or uint64.

width : int

The width, in characters, of each substring to hash. For a list column, the number of strings in each ngram to hash.

Examples

>>> import cudf
>>> import numpy as np
>>> s = cudf.Series(['this is my', 'favorite book'])
>>> a = cudf.Series([1, 2, 3], dtype=np.uint32)
>>> b = cudf.Series([4, 5, 6], dtype=np.uint32)
>>> s.str.minhash(0, a=a, b=b, width=5)
0    [1305480171, 462824409, 74608232]
1       [32665388, 65330773, 97996158]
dtype: list
>>> sl = cudf.Series([['this', 'is', 'my'], ['favorite', 'book']])
>>> sl.str.minhash(width=2, seed=0, a=a, b=b)
0      [416367551, 832735099, 1249102647]
1    [1906668704, 3813337405, 1425038810]
dtype: list
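
Minhash values are commonly compared position by position to estimate how similar two rows are. The sketch below is not part of cudf: `signature_agreement` is a hypothetical helper, and it operates on plain Python lists copied from the first example's output rather than on a cudf list column.

```python
def signature_agreement(sig_x, sig_y):
    # Fraction of positions at which two minhash signatures agree.
    if len(sig_x) != len(sig_y) or not sig_x:
        raise ValueError("signatures must be non-empty and of equal length")
    matches = sum(x == y for x, y in zip(sig_x, sig_y))
    return matches / len(sig_x)


# Rows 0 and 1 from the first example above.
row0 = [1305480171, 462824409, 74608232]
row1 = [32665388, 65330773, 97996158]
print(signature_agreement(row0, row1))  # 0.0, no positions match
```

With only three (a, b) pairs the estimate is coarse; longer a and b columns give a finer-grained estimate.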