cudf.core.accessors.string.StringMethods.minhash
- StringMethods.minhash(seed: int, a: ColumnLike, b: ColumnLike, width: int) -> Series | Index
Compute the minhash of a strings column, or of a list column whose rows are lists of strings (terms).
The hash function is MurmurHash3_x86_32 if a and b are of type np.uint32, or MurmurHash3_x86_128 if a and b are of type np.uint64.
The calculation uses the formula (hv * a + b) % mersenne_prime, where hv is the hash of a substring of width characters (or of an ngram of width strings, for a list column), a and b are the provided values, and mersenne_prime is 2^61-1.
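The formula can be illustrated in plain Python, without cudf. This is a minimal sketch rather than the library's implementation: `permute_and_min` is a hypothetical helper, the hash values are made-up stand-ins for MurmurHash3 output, and any narrowing of the result to the 32-bit or 64-bit output type is not modeled.

```python
# Sketch only: applies (hv * a + b) % mersenne_prime and keeps the minimum
# per (a, b) pair. The hash values are placeholders, NOT real MurmurHash3 output.
MERSENNE_PRIME = (1 << 61) - 1  # 2^61 - 1, as stated in the description


def permute_and_min(hash_values, a, b):
    """Return one minhash value per (a, b) pair for a single row."""
    return [
        min((hv * ai + bi) % MERSENNE_PRIME for hv in hash_values)
        for ai, bi in zip(a, b)
    ]


# Hypothetical hashes of one row's substrings (assumed values).
row_hashes = [32665384, 900000000, 2000000000]
result = permute_and_min(row_hashes, a=[1, 2, 3], b=[4, 5, 6])
print(result)  # [32665388, 65330773, 97996158]
```

Each row produces as many values as there are (a, b) pairs, which is why each output row in the examples holds three values.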
- Parameters:
- seed: int
The seed used for the hash algorithm.
- a: ColumnLike
Values for a (the multiplier) in the minhash formula. Must be of type uint32 or uint64.
- b: ColumnLike
Values for b (the additive term) in the minhash formula. Must be of type uint32 or uint64.
- width: int
For a strings column, the width in characters of each substring to hash. For a list column, the number of strings in each ngram to hash.
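The two meanings of width can be sketched in plain Python. The helpers `char_windows` and `term_ngrams` are hypothetical and only enumerate the pieces that would be hashed. How cudf combines the strings of an ngram before hashing, and how it treats rows shorter than width, are not covered here.

```python
# Sketch only: enumerates the units that width selects, without hashing them.
def char_windows(text, width):
    """All contiguous substrings of `width` characters (strings column case)."""
    return [text[i:i + width] for i in range(len(text) - width + 1)]


def term_ngrams(terms, width):
    """All contiguous runs of `width` strings (list column case)."""
    return [tuple(terms[i:i + width]) for i in range(len(terms) - width + 1)]


windows = char_windows('this is my', 5)
ngrams = term_ngrams(['this', 'is', 'my'], 2)
print(windows)  # ['this ', 'his i', 'is is', 's is ', ' is m', 'is my']
print(ngrams)   # [('this', 'is'), ('is', 'my')]
```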
Examples
>>> import cudf
>>> import numpy as np
>>> s = cudf.Series(['this is my', 'favorite book'])
>>> a = cudf.Series([1, 2, 3], dtype=np.uint32)
>>> b = cudf.Series([4, 5, 6], dtype=np.uint32)
>>> s.str.minhash(0, a=a, b=b, width=5)
0    [1305480171, 462824409, 74608232]
1       [32665388, 65330773, 97996158]
dtype: list
>>> sl = cudf.Series([['this', 'is', 'my'], ['favorite', 'book']])
>>> sl.str.minhash(width=2, seed=0, a=a, b=b)
0      [416367551, 832735099, 1249102647]
1    [1906668704, 3813337405, 1425038810]
dtype: list